My second-favourite keynote from yesterday's Strata Hadoop World conference was this one, from Pinterest's John Rauser. To many people (especially in the Big Data world), Statistics is a series of complex equations, but just a little intuition goes a long way to really understanding data. John illustrates this wonderfully using an example of data collected to determine whether consuming beer causes mosquitoes to bite you more:
The big lesson here, IMO, is that so many statistical problems can seem complex, but you can actually get a lot of insight by recognizing that your data is just one possible instance of a random process. If you have a hypothesis for what that process is, you can simulate it, and get an intuitive sense of how surprising your data is. R has excellent tools for simulating data, and a couple of hours spent writing code to simulate data can often give insights that will be valuable for the formal data analysis to come.
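To make that idea concrete, here is a minimal sketch of a shuffle-based (permutation) test. It is written in Python for illustration, though the same few lines translate directly to R's sample(). The function name `permutation_p_value`, the shuffle count, and the mosquito counts are all hypothetical: the numbers are made up for this example and are not the data from the talk.

```python
import random


def permutation_p_value(treatment, control, n_shuffles=10_000, seed=0):
    """Return (observed difference in means, one-sided p-value).

    The p-value is the share of random relabelings whose difference in
    means is at least as large as the one actually observed.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible
    n_treat = len(treatment)
    n_ctrl = len(control)
    observed = sum(treatment) / n_treat - sum(control) / n_ctrl

    # Under the "no effect" hypothesis the group labels are arbitrary,
    # so pool the observations and deal them out again at random.
    pooled = list(treatment) + list(control)
    at_least_as_extreme = 0
    for _ in range(n_shuffles):
        rng.shuffle(pooled)
        diff = sum(pooled[:n_treat]) / n_treat - sum(pooled[n_treat:]) / n_ctrl
        if diff >= observed:
            at_least_as_extreme += 1
    return observed, at_least_as_extreme / n_shuffles


# Hypothetical mosquito counts, invented for this sketch.
beer = [27, 20, 21, 26, 27, 31, 24, 21]
water = [21, 22, 15, 12, 21, 16, 19, 15]

observed, p = permutation_p_value(beer, water)
print(f"observed difference in means: {observed:.2f}")
print(f"share of shuffles at least that large: {p:.4f}")
```

A small share means the observed difference would be surprising if beer made no difference, which is the intuitive sense of "surprising" described above.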
(By the way, my favourite keynote from the conference was Amanda Cox's keynote on data visualization at the New York Times, which featured several examples developed in R. Sadly, though, it wasn't recorded.)
O'Reilly Strata: Statistics Without the Agonizing Pain
This video has been removed by the user:-(
Posted by: Fe Ze | October 17, 2014 at 10:42
Looks like O'Reilly replaced the video with a corrected version. I've fixed it in the post above. Thanks!
Posted by: David Smith | October 17, 2014 at 11:03
Great video. I too have a preference for simulation-based rather than theoretical hypothesis testing, mainly because I find it much easier to explain to a non-technical audience.
Posted by: Michael Jackson | October 17, 2014 at 22:55
This may seem like nitpicking, but shouldn't it read ...: permutate! The example in the video does not generate a Monte Carlo distribution with mean/sd but uses resampling. Or would "resampling" be a subtype of simulation?
Cheers,
Andrej
Posted by: A.N. Spiess | October 18, 2014 at 17:05
This is a good video. Great explanation of the two approaches.
Resampling is amazingly easy...and fun! He's right that the tools matter little. There's at least one Excel plug-in that does exactly the resampling approach he shows. It's a great way to see the central limit theorem (CLT) in action. It's an ah-ha moment to see the variance in the estimate of the mean decrease as the number of times the data is resampled increases.
The "simulation" is the whole thing, the 43 observations resampled 50,000 times. The add-on step is to obtain the p-value of the observed difference, 4.4, in this newly built distribution. The p-value from the stat 101 approach and the p-value from this newly built distribution will be pretty close in value to each other.
Posted by: Phillip Burger | October 19, 2014 at 10:27