The New York Times published an article of interest to statisticians the other day: "The Odds, Continually Updated". Surprisingly for a general-audience newspaper, this article goes into the distinctions between Bayesian and frequentist statistics, and does so in a very approachable way. Here's an excerpt:
The essence of the frequentist technique is to apply probability to data. If you suspect your friend has a weighted coin, for example, and you observe that it came up heads nine times out of 10, a frequentist would calculate the probability of getting such a result with an unweighted coin. The answer (about 1 percent) is not a direct measure of the probability that the coin is weighted; it’s a measure of how improbable the nine-in-10 result is — a piece of information that can be useful in investigating your suspicion.
By contrast, Bayesian calculations go straight for the probability of the hypothesis, factoring in not just the data from the coin-toss experiment but any other relevant information — including whether you’ve previously seen your friend use a weighted coin.
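The two calculations in that excerpt are easy to try in R. A sketch: the frequentist number is a binomial tail probability, while the Bayesian version needs two extra inputs that the article doesn't specify, so the 0.8 heads rate for a weighted coin and the 10% prior below are hypothetical.

```r
# Frequentist: probability of 9 or more heads in 10 tosses of a fair coin
sum(dbinom(9:10, size = 10, prob = 0.5))   # about 0.011 -- the "about 1 percent"

# Bayesian: posterior probability the coin is weighted, assuming a
# (hypothetical) weighted coin that lands heads 80% of the time and a
# 10% prior belief that your friend would use one
prior         <- 0.10
like_weighted <- dbinom(9, 10, 0.8)
like_fair     <- dbinom(9, 10, 0.5)
prior * like_weighted / (prior * like_weighted + (1 - prior) * like_fair)
```

Note how the prior matters: with a smaller prior, the same nine-heads evidence yields a much lower posterior.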
The article covers the genesis of both frequentist and Bayesian statistics, and includes several examples of Bayesian statistics applications, including healthcare, cosmology, and even search-and-rescue. (This story of how Bayesian analysis directed the rescue of a fisherman lost at sea is nothing short of amazing.)
R, of course, has extensive capabilities for Bayesian analysis. Check out the Bayesian Inference task view on CRAN, and especially the RStan package.
New York Times: The Odds, Continually Updated
The following post by Norm Matloff originally appeared on his blog, Mad (Data) Scientist, on September 15th. We rarely republish posts that have appeared on other blogs; however, the questions that Norm raises both with respect to the teaching of statistics, and his assertion that "R's statistical procedures are centered far too much on significance testing", deserve a second look. Moreover, Norm's post elicited quite a few comments, many of which are at a high level of discourse. At the bottom of this post we have included excerpts from exchanges with statistician Mervyn Thomas and with philosopher of science Deborah Mayo. It is well worth reading the full threads of these exchanges as well as those associated with a number of other comments. Norm has been a contributor to the Revolutions Blog in the past. We thank him for permission to republish his post. (Guest post editor, Joseph Rickert).
by Norm Matloff
My posting about the statistics profession losing ground to computer science drew many comments, not only here in Mad (Data) Scientist, but also in the co-posting at Revolution Analytics, and in Slashdot. One of the themes in those comments was that Statistics Departments are out of touch and have failed to modernize their curricula. Though I may disagree with the commenters’ definitions of “modern,” I have in fact long felt that there are indeed serious problems in statistics curricula.
I must clarify before continuing that I do NOT advocate that, to paraphrase Shakespeare, “First thing we do, we kill all the theoreticians.” A precise mathematical understanding of the concepts is crucial to good applications. But stat curricula are not realistic.
I’ll use Student t-tests to illustrate. (This is material from my open-source book on probability and statistics.) The t-test is an exemplar for the curricular ills in three separate senses:
Significance testing has long been known to be under-informative at best, and highly misleading at worst. Yet it is the core of almost any applied stat course. Why are we still teaching — actually highlighting — a method that is recognized to be harmful?
We prescribe the use of the t-test in situations in which the sampled population has an exact normal distribution — when we know full well that there is no such animal. All real-life random variables are bounded (as opposed to the infinite-support normal distributions) and discrete (unlike the continuous normal family). [Clarification, added 9/17: I advocate skipping the t-distribution, and going directly to inference based on the Central Limit Theorem. Same for regression. See my book.]
Going hand-in-hand with the t-test is the sample variance. The classic quantity s² is an unbiased estimate of the population variance σ², with s² defined as 1/(n-1) times the sum of squares of our data relative to the sample mean. The concept of unbiasedness does have a place, yes, but in this case there really is no point to dividing by n-1 rather than n. Indeed, even if we do divide by n-1, it is easily shown that the quantity that we actually need, s rather than s², is a BIASED (downward) estimate of σ. So that n-1 factor is much ado about nothing.
Right from the beginning, then, in the very first course a student takes in statistics, the star of the show, the t-test, has three major problems.
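Norm's point that s (unlike s²) is biased downward is easy to check by simulation. A quick sketch, drawing many samples of size 10 from N(0, 1), where the true σ is 1:

```r
# s^2 is unbiased for sigma^2, but s itself underestimates sigma
set.seed(1)
sds <- replicate(100000, sd(rnorm(10)))
mean(sds^2)   # close to the true variance of 1
mean(sds)     # noticeably below the true sd of 1 (about 0.97 for n = 10)
```

The downward bias follows from Jensen's inequality: the square root is concave, so E[s] < sqrt(E[s²]) = σ.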
Sadly, the R language largely caters to this old-fashioned, unwarranted thinking. The var() and sd() functions use that 1/(n-1) factor, for example — a bit of a shock to unwary students who wish to find the variance of a random variable uniformly distributed on, say, 1,2,…,10.
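The shock is easy to reproduce. For a random variable uniformly distributed on 1, 2, …, 10, the true variance is (10² − 1)/12 = 8.25, but var() divides by n − 1:

```r
x <- 1:10                 # the full support of the distribution, treated as data
var(x)                    # about 9.17 -- divides by n - 1
mean((x - mean(x))^2)     # 8.25      -- divides by n, matching the true variance
```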
Much more importantly, R’s statistical procedures are centered far too much on significance testing. Take ks.test(), for instance: all one can do is a significance test, when it would be nice to be able to obtain a confidence band for the true cdf. Or consider log-linear models: The loglin() function is so centered on testing that the user must proactively request parameter estimates, never mind standard errors. (One can get the latter by using glm() as a workaround, but one shouldn’t have to do this.)
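Both complaints can be seen at the console. A sketch, using simulated data for ks.test() and the built-in Titanic table for the glm() workaround:

```r
set.seed(1)
x <- rnorm(50)
ks.test(x, "pnorm")   # returns a statistic and p-value; no confidence band for the cdf

# workaround for log-linear models: fit with glm() to get
# parameter estimates together with their standard errors
fit <- glm(Freq ~ Class * Sex, family = poisson, data = as.data.frame(Titanic))
summary(fit)$coefficients
```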
I loved the suggestion by Frank Harrell on r-devel to at least remove the “star system” (asterisks of varying numbers for different p-values) from R output. A quixotic action on Frank’s part (so of course I chimed in, in support of his point); sadly, no way would such a change be made. To be sure, R in fact is modern in many ways, but there are some problems nevertheless.
In my blog posting cited above, I was especially worried that the stat field is not attracting enough of the “best and brightest” students. Well, any thoughtful student can see the folly of claiming the t-test to be “exact.” And if a sharp student looks closely, he/she will notice the hypocrisy of using the 1/(n-1) factor in estimating variance for comparing two general means, but NOT doing so when comparing two proportions. If unbiasedness is so vital, why not use 1/(n-1) in the proportions case, a skeptical student might ask?
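The inconsistency is concrete. Code the 9-heads-in-10-tosses data as 0s and 1s and estimate the variance of the sample proportion both ways:

```r
x <- c(rep(1, 9), 0)        # 9 heads in 10 tosses, coded as 0/1 data
p <- mean(x)
p * (1 - p) / length(x)     # the standard estimate: divides by n     (0.009)
var(x) / length(x)          # the "unbiased" 1/(n-1) version           (0.010)
```

The two differ by exactly the factor n/(n-1), yet introductory courses insist on 1/(n-1) in one setting and ignore it in the other.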
Some years ago, an Israeli statistician, upon hearing me kvetch like this, said I would enjoy a book written by one of his countrymen, titled What’s Not What in Statistics. Unfortunately, I’ve never been able to find it. But a good cleanup along those lines of the way statistics is taught is long overdue.
Selected Comments
Mervyn Thomas
SEPTEMBER 16, 2014 AT 4:15 PM
I have run statistics operations in quite large public and private sector organisations, and directly supervised many masters and PhD level statisticians. The biggest problem I had with new statisticians was helping them to understand that nobody else cares about the statistics.
Of course the statistics is important, but only in so far as it helps produce solid and reliable answers to problems – or reveals that no such answers are available with current data. Nearly everybody is focussed on their own problems. The trick is producing results and reports which address those problems in a rigorous and defensible way.
In a sense, I see applied statistics as more of an engineering discipline – but one that makes careful use of rigorous analysis.
I believe that statistics departments have largely missed the boat with data science (except for a few stand out examples like Stanford), and that the reason is that many academic statisticians have failed to engage with other disciplines properly. Of course, there are very significant exceptions to that – Terry Speed for example.
One of the most telling examples of that for me is the number of time academic statisticians have asked if I or my life science collaborators could provide them with data to test an approach — without actually wanting to engage with the problem that generated the data.
Relevance comes from engagement, not from rarefied brilliance. There is no better example of that than Fisher.
Does it matter? Yes because I see other disciplines reinventing the statistical wheel – and doing it badly.
REPLY
matloff
SEPTEMBER 16, 2014 AT 5:01 PM
Very interesting comments. I largely agree.
Sadly, my own campus, the University of California at Davis, illustrates your point. To me, a big issue is joint academic appointments, and to my knowledge the Statistics Dept. has none. This is especially surprising in light of the longtime (several decades) commitment of UCD to interdisciplinary research. The Stat. Dept. has even gone in the opposite direction: The Stat grad program used to be administered by a Graduate Group, a unique UCD entity in which faculty from many departments run the graduate program in a given field; yet a few years ago, the Stat. Dept. disbanded its Graduate Group. I must hasten to add that there IS good interdisciplinary work being done by Stat faculty with researchers in other fields, but still the structure is too narrow, in my view.
(My own department, Computer Science, has several appointments with other disciplines, and more important, has actually expanded the membership of its Graduate Group.)
I would say, though, that I think the biggest reason Stat (in general, not just UCD) has been losing ground to CS and other fields is not because of disinterest in applications, but rather a failure to tackle the complex, large-scale, “messy” problems that the Machine Learning crowd addresses routinely.
REPLY
Mervyn Thomas
SEPTEMBER 16, 2014 AT 5:19 PM
“a failure to tackle the complex, large-scale, ‘messy’ problems that the Machine Learning crowd addresses routinely.” Good point! I have often struggled with junior statisticians wanting to know whether or not an analysis is “right” rather than fit for purpose. That’s a strange preoccupation, because in 40 years as a professional statistician I have never done a “correct” analysis. Everything is predicated on assumptions which are approximations at best.

Mayo
SEPTEMBER 27, 2014 AT 9:27 PM
I reviewed the part in your book on tests vs CIs. It was quite as extreme as I’d remembered it. I’m so used to interpreting significance levels and p-values in terms of discrepancies warranted or not that I automatically have those (severity) interpretations in mind when I consider tests. Fallacies of rejection and acceptance, relativity to sample size — all dealt with, and the issues about CIs requiring testing supplements remain (especially in one-sided testing, which is common). This paper covers 13 central problems with hypothesis tests, and how error statistics deals with them.
I remember many of the things I like A LOT about Matloff’s book. I’m glad he sees CIs as the way to go for variable choice (on prediction grounds) because it means that severity is relevant there too.
Norm Matloff
SEPTEMBER 28, 2014 AT 11:46 PM
Looks like a very interesting paper, Deborah (as I would have expected). I look forward to reading it. Just skimming through, though, it looks like I’ll probably have comments similar to the ones I made on Mervyn’s points.
Going back to my original post, do you at least agree that CIs are more informative than tests?
Hadley Wickham's dplyr package is a great toolkit for getting data ready for analysis in R. If you haven't yet taken the plunge into using dplyr, Kevin Markham has put together a great hands-on video tutorial for his Data School blog, which you can see below. The video covers the five main data-manipulation "verbs" that dplyr provides: filter, select, arrange, mutate and summarise/group_by. (It also introduces the glimpse function, a handy alternative to str, that I had overlooked before.)
The video also provides an introduction to the %>% ("then") operator from magrittr, which you'll likely find useful for many other applications in addition to dplyr. Also, Kevin's video works from an R Markdown script to show how dplyr works, and so serves as a mini-tutorial for R Markdown as well. It's well worth 40 minutes of your time. Finally, check out Kevin's blog post linked below for links to many other useful dplyr resources.
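If you'd rather skim than watch, here is a minimal sketch of the five verbs (plus glimpse) applied to the built-in mtcars data:

```r
library(dplyr)

glimpse(mtcars)                       # a compact alternative to str()

mtcars %>%
  filter(cyl == 4) %>%                # keep rows: 4-cylinder cars only
  select(mpg, wt, gear) %>%           # keep just a few columns
  arrange(desc(mpg)) %>%              # sort, best fuel economy first
  mutate(wt_lbs = wt * 1000)          # add a derived column

mtcars %>%
  group_by(cyl) %>%                   # summarise within groups:
  summarise(mean_mpg = mean(mpg))     # one row per cylinder count
```

Each verb takes a data frame as its first argument and returns a data frame, which is exactly why they chain so naturally with %>%.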
Data School: Hands-on dplyr tutorial for faster data manipulation in R (via Peter Aldhous)
The militarization of local police departments here in the US has been much in the news lately, and the New York Times published in June an in-depth article on how materiel from wars has ended up in the hands of US counties. Besides the traditional reporting, it's a fantastic piece of data journalism: the Times submitted a freedom-of-information request to the Defense Department for the items, their value, and the date they were provided to each county, and published the data on GitHub. Here's a small snippet of the data:
The Times also published an interactive map of the data, aggregated by county. What that map doesn't show is the element of time, and the rate at which materials were being supplied from 2006 to 2012. Andrew Cooper, an associate professor at Simon Fraser University, used the Times data and the R programming language to create this animation of the supply of materials throughout the US over the past 8 years:
You can find a link to the complete NYT article on the topic below.
New York Times: War Gear Flows to Police Departments
by Joseph Rickert
One of the most difficult things about R, a problem that is particularly vexing to beginners, is finding things. This is an unintended consequence of R's spectacular, but mostly uncoordinated, organic growth. The R core team does a superb job of maintaining the stability and growth of the R language itself, but the innovation engine for new functionality is largely in the hands of the global R community.
Several structures have been put in place to address various aspects of the finding-things problem. For example, Task Views represent a monumental effort to collect and classify R packages. The RSeek site is an effective tool for web searches. R-bloggers is a good place to go for R applications, and CRANberries lets you know what's new. But how do you find things that you didn't even know you were looking for? For this, the so-called "misc" packages can be very helpful. Whereas the majority of R packages are focused on a particular type of analysis, class of models, or special tool, misc packages tend to be collections of functions that facilitate common tasks. (Look below for a partial list.)
DescTools is a new entry to the misc package scene that I think could become very popular. The description for the package begins:
DescTools contains a bunch of basic statistic functions and convenience wrappers for efficiently describing data, creating specific plots, doing reports using MS Word, Excel or PowerPoint. The package's intention is to offer a toolbox, which facilitates the (notoriously time consuming) first descriptive tasks in data analysis, consisting of calculating descriptive statistics, drawing graphical summaries and reporting the results. Many of the included functions can be found scattered in other packages and other sources written partly by Titans of R.
So far, of the 380 functions in this collection, the Desc function has my attention. This function provides very nice tabular and graphic summaries of the variables in a data frame, with output that is specific to the data type. The d.pizza data frame that comes with the package has a nice mix of data types.
> head(d.pizza)
  index       date week weekday        area count rabate  price operator  driver delivery_min temperature wine_ordered wine_delivered
1     1 2014-03-01    9       6      Camden     5   TRUE 65.655   Rhonda  Taylor         20.0        53.0            0              0
2     2 2014-03-01    9       6 Westminster     2  FALSE 26.980   Rhonda Butcher         19.6        56.4            0              0
3     3 2014-03-01    9       6 Westminster     3  FALSE 40.970  Allanah Butcher         17.8        36.5            0              0
4     4 2014-03-01    9       6       Brent     2  FALSE 25.980  Allanah  Taylor         37.3          NA            0              0
5     5 2014-03-01    9       6       Brent     5   TRUE 57.555   Rhonda  Carter         21.8        50.0            0              0
6     6 2014-03-01    9       6      Camden     1  FALSE 13.990  Allanah  Taylor         48.7        27.0            0              0
  wrongpizza quality
1      FALSE  medium
2      FALSE    high
3      FALSE    <NA>
4      FALSE    <NA>
5      FALSE  medium
6      FALSE     low
Here is some of the voluminous output from the function. The data frame as a whole is summarized as follows
'data.frame':   1209 obs. of  16 variables:
 $ index         : int  1 2 3 4 5 6 7 8 9 10 ...
 $ date          : Date, format: "2014-03-01" "2014-03-01" "2014-03-01" "2014-03-01" ...
 $ week          : num  9 9 9 9 9 9 9 9 9 9 ...
 $ weekday       : num  6 6 6 6 6 6 6 6 6 6 ...
 $ area          : Factor w/ 3 levels "Brent","Camden",..: 2 3 3 1 1 2 2 1 3 1 ...
 $ count         : int  5 2 3 2 5 1 4 NA 3 6 ...
 $ rabate        : logi  TRUE FALSE FALSE FALSE TRUE FALSE ...
 $ price         : num  65.7 27 41 26 57.6 ...
 $ operator      : Factor w/ 3 levels "Allanah","Maria",..: 3 3 1 1 3 1 3 1 1 3 ...
 $ driver        : Factor w/ 7 levels "Butcher","Carpenter",..: 7 1 1 7 3 7 7 7 7 3 ...
 $ delivery_min  : num  20 19.6 17.8 37.3 21.8 48.7 49.3 25.6 26.4 24.3 ...
 $ temperature   : num  53 56.4 36.5 NA 50 27 33.9 54.8 48 54.4 ...
 $ wine_ordered  : int  0 0 0 0 0 0 1 NA 0 1 ...
 $ wine_delivered: int  0 0 0 0 0 0 1 NA 0 1 ...
 $ wrongpizza    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ quality       : Ord.factor w/ 3 levels "low"<"medium"<..: 2 3 NA NA 2 1 1 3 3 2 ...
The factor variable driver gets a table and a plot.
10 - driver (factor)

  length       n     NAs  levels  unique  dupes
   1'209   1'204       5       7       7      y

   level      freq   perc  cumfreq  cumperc
1  Carpenter   272   .226      272     .226
2  Carter      234   .194      506     .420
3  Taylor      204   .169      710     .590
4  Hunter      156   .130      866     .719
5  Miller      125   .104      991     .823
6  Farmer      117   .097     1108     .920
7  Butcher      96   .080     1204    1.000
and so does the numeric variable delivery_min.
11 - delivery_min (numeric)

  length       n    NAs  unique     0s    mean  meanSE
   1'209   1'209      0     384      0  25.653   0.312

     .05     .10     .25  median     .75     .90     .95
  10.400  11.600  17.400  24.400  32.500  40.420  45.200

     rng      sd   vcoef     mad     IQR    skew    kurt
  56.800  10.843   0.423  11.268  15.100   0.611   0.095

lowest : 8.8 (3), 8.9, 9 (3), 9.1 (5), 9.2 (3)
highest: 61.9, 62.7, 62.9, 63.2, 65.6

Shapiro-Wilks normality test
  p.value : 2.2725e-16
Pretty nice for an automatic first look at the data.
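All of that output comes from a single call. A sketch (assuming, per the package documentation, that the plotit argument controls whether the graphical summaries are drawn):

```r
library(DescTools)

Desc(d.pizza, plotit = TRUE)   # type-specific summary of every column
Desc(d.pizza$driver)           # or describe one variable at a time
```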
For some more R treasure hunting have a look into the following short list of misc packages.
- Tools for manipulating data (the No. 1 package downloaded in 2013)
- Convenience wrappers for functions for manipulating strings
- One of the most popular R packages of all time: functions for data analysis, graphics, utilities and much more
- Package development tools
- The “go to” package for machine learning, classification and regression training
- A good SVM implementation and other machine learning algorithms
- Tools for describing data and descriptive statistics
- Tools for plotting decision trees
- Functions for numerical analysis, linear algebra, optimization, differential equations and some special functions
- High-level graphics functions for displaying large datasets
- A relatively new package with various functions for survival data, extending the methods available in the survival package
- New this year: miscellaneous R tools to simplify working with data types and formats, including functions for working with data frames and character strings
- Some functions for Kalman filters
- Miscellaneous 3D plots, including isosurfaces
- A new package with utilities for producing maps
- Various programming tools, like ASCIIfy() to convert characters to ASCII and checkRVersion() to see if a newer version of R is available
- A grab bag of utilities, including progress bars and function timers
In discussion with several data scientists, Will Stanton (a data scientist with Return Path) learned that a common concern is: what software should I be using? There are many options out there, but what is the best platform to be an effective "data hacker"?
Will recommends using a technology stack with R and Hadoop, which allows data scientists "to do almost anything you need to for data hacking". With this platform, you have all the tools you need.
On the other hand, Will says the stack works best on Unix or Linux based systems (Windows is possible, but tricky), and isn't ideally suited for text mining or web-based applications. But if this is something you want to try, a good start is the RHadoop project, a collection of R packages that connect R and Hadoop.
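As a first taste of the stack, here is a hedged sketch assuming the rmr2 package from the RHadoop project is installed; rmr2's local backend runs the MapReduce logic in plain R, so you can experiment before standing up a Hadoop cluster:

```r
library(rmr2)
rmr.options(backend = "local")   # run MapReduce locally; no cluster needed

small.ints <- to.dfs(1:1000)     # push data into the (local) "dfs"
out <- mapreduce(
  input = small.ints,
  map   = function(k, v) keyval(v, v^2)   # emit (x, x^2) key-value pairs
)
result <- from.dfs(out)          # pull the key-value pairs back into R
head(result$val)
```

Switching the backend option to "hadoop" sends the same job to a real cluster, which is the appeal of the approach.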
For more on being a data hacker with the R and Hadoop stack, check out Will's complete blog post linked below.
Will Stanton's Data Science blog: Becoming a data “hacker” (via Joaquim Coll)
by Matt Sundquist, Plotly Co-founder
It's delightfully smooth to publish R code, plots, and presentations to the web.
Now, Plotly lets you collaboratively edit and publish interactive ggplot2 graphs using these tools. This post shows how. Find us on GitHub, email us at feedback@plot.ly, and follow @plotlygraphs. For more on our ggplot2 and R support, see our API docs.
install.packages("devtools")  # so we can install from github
library("devtools")
install_github("ropensci/plotly")  # plotly is part of ropensci
library(ggplot2)
library(plotly)
py <- plotly(username = "r_user_guide", key = "mw5isa4yqp")  # open plotly connection
ggiris <- qplot(Petal.Width, Sepal.Length, data = iris, color = Species)
py$ggplotly(ggiris)  # send to plotly
Adding

py$ggplotly()

to your ggplot2 plot creates a Plotly graph online, drawn with D3.js, a popular JavaScript visualization library. The plot, data, and code for making the plot in Julia, Python, R, and MATLAB are all online and editable by you and your collaborators. In this case, it's here: https://plot.ly/~r_user_guide/2; if you forked the plot and wanted to tweak and share it, a new version of the plot would be saved into your profile.

## 1. Putting Plotly Graphs in Knitr

```{r}
library("knitr")
library("devtools")
url <- "https://plot.ly/~MattSundquist/1971"
plotly_iframe <- paste("<center><iframe scrolling='no' seamless='seamless' style='border:none' src='", url, "/800/1200' width='800' height='1200'></iframe></center>", sep = "")
```

`r I(plotly_iframe)`
You can also embed Plotly graphs with the `plotly=TRUE` chunk option. Here is an example and source to see the process in action. To pull an existing figure back into R:

py <- plotly("ggplot2examples", "3gazttckd7")  # key and username for your call
figure <- py$get_figure("r_user_guide", 1)     # graph id for the plot you want to access
str(figure)
figure$data[]
You're probably familiar with the classic Travelling Salesman problem: given (say) 20 cities, what is the shortest route you can take that passes through all 20 cities and returns to the starting point? It's a difficult problem to solve exactly, because you need to try all possible routes to find the minimum, and there are a LOT of possibilities. For a 20-city tour there are more than 60 quadrillion distinct routes to try — and that's a fairly small problem!
You can get a good answer (if not necessarily the perfect answer) using various heuristic techniques. Software developer Todd Schneider used the R language to implement a technique called simulated annealing. It starts with a random route (each city in the route is chosen at random, without regard to its distance) and then tries various similar routes and probably adopts the shortest one and repeats the process. I say "probably" because a random element in the annealing process helps the process avoid getting stuck with suboptimal solutions.
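The loop Todd describes can be sketched in a few lines of base R. This is illustrative only: the city coordinates below are random points in the unit square, and Todd's actual implementation (on GitHub) differs in its details.

```r
set.seed(42)
n <- 20
pts <- matrix(runif(2 * n), ncol = 2)     # random "cities" in the unit square

tour_len <- function(tour) {
  path <- pts[c(tour, tour[1]), ]         # close the loop: return to the start
  sum(sqrt(rowSums(diff(path)^2)))        # total Euclidean length of the tour
}

tour <- sample(n)                         # start from a completely random route
temp <- 1                                 # initial "temperature"
for (i in 1:20000) {
  cand <- tour
  swap <- sample(n, 2)                    # propose a similar route: swap two cities
  cand[swap] <- cand[rev(swap)]
  delta <- tour_len(cand) - tour_len(tour)
  # always accept improvements; accept a worse tour with probability
  # exp(-delta / temp), which is what lets the search escape local minima
  if (delta < 0 || runif(1) < exp(-delta / temp)) tour <- cand
  temp <- temp * 0.9995                   # cool down gradually
}
tour_len(tour)                            # length of the annealed tour
```

As the temperature falls, the acceptance of worse tours becomes rare and the search settles into a (near-)optimal route, which is exactly the behavior visible in Todd's animation.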
In the animation below (created by Todd), you can see the process in action, trying to find a Salesman tour through the world's major capitals. The initial tour is more than 600,000 miles, but it soon settles down into a compact 80,666-mile tour (you may need to click the image to see the animation):
Todd's R code for the Travelling Salesman problem can be found on GitHub, and he's also created a Shiny app that lets you solve the problem for your own selection of cities in the USA or around the world, and see a similar animation of the annealing process at work. You can find the app and lots more detail about the algorithm and implementation at Todd's blog at the link below.
Todd W. Schneider: The Traveling Salesman with Simulated Annealing, R, and Shiny (via Brock Tibert)
Rrrr! It's International Talk Like a Pirate day again, mateys, the day all landlubbers should talk in pirate lingo. (If you're unsure how, R can help.) It's also the day where you can pick up some great O'Reilly R books for half price.
The Revolution Analytics team has been celebrating with some great pirate costumes. Thanks, Colleen, Dave and Amanda!
That's all for this week. See you on Monday, mateys!