*Are MLB players better hitters now than they were 20 years ago? Revolution Analytics' Joseph Rickert uses R to take a look at the data, and offers an instructive lesson in checking your assumptions for statistical tests in the process -- Ed.*

Data are everywhere – but, even for simple things, I still seem to spend too much time surfing the web to find an appropriate data set. Many times a data set will come my way for some reason and then end up being interesting for some entirely different reason. If I only had a good way to cross-reference all of these data sets: you know, a list of every data set I ever came across, annotated with what it is good for and with a link to where I put it. Of course, if I had this list I would have to keep it in my list of all my other lists; and then, to make sure I could remember where it was, I would have to keep it in a special place and ….

Anyway, baseball season is here and few things in the world do data better than baseball. From opening day (3/31/11) to the end of October the ballparks of this country will generate more data than I can keep up with. Fortunately, there are others who can. Take a look, for example, at Baseball Prospectus for comprehensive baseball statistics organized by year in csv files that are easy to download, or at Sean Lahman’s website for a comprehensive database going back to 1871.

Given that today is the first day of the baseball season, I have the perfect excuse and data set to illustrate how important it is to check the assumptions before doing a t-test. Let's look at the batting averages (AVG) for both major leagues for the years 1990 and 2010. Except for the apparent increase in variability for 2010, the box plots for the two distributions look pretty similar.

So it might seem reasonable to do a simple t-test to see if there is any significant difference. In R this is one line of code that produces the result:

> t.test(AVG ~ YEAR,data=bdat,var.equal=T)

Two Sample t-test

data:  AVG by YEAR
t = 1.4098, df = 1682, p-value = 0.1588
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.003395079  0.020751550
sample estimates:
mean in group 1990 mean in group 2010
         0.2034081          0.1947299


Because the confidence interval contains 0, there is no reason to reject the null hypothesis that the means of the two distributions are indeed the same; so it appears that there was no change in batting average between 1990 and last season. However, is the t-test the right test to use? Forget about the fact that I blew off the apparent increase in variability as irrelevant; are the two distributions even approximately Normal?

Good thing I checked! The kernel density plots indicate that these distributions have two, or maybe three, modes – not even close to Normal! These plots are pretty interesting on their own: when I watch hitters this season I’ll be wondering under which bump they belong. But getting back to a formal test to look for a difference in the means of the batting averages for the 1990 and 2010 seasons, it appears that the Wilcoxon Rank Sum Test (also known as the Mann-Whitney test, which doesn't assume the distributions are Normal) is the way to go.

> wilcox.test(AVG ~ YEAR, data=bdat, conf.int=TRUE)

Wilcoxon rank sum test with continuity correction

data:  AVG by YEAR
W = 370070.5, p-value = 0.03541
alternative hypothesis: true location shift is not equal to 0
95 percent confidence interval:
 6.849543e-06 1.302933e-02
sample estimates:
difference in location
           0.004991099


The Wilcoxon test indicates that there **is** a significant difference, at the 5% level anyway. (The R code for the charts and analysis appears after the jump.) This was a surprise to me; maybe I’m easily surprised, but isn’t life more fun that way? I am looking forward to quite a few surprises from the 2011 baseball season. Enjoy the season!

##################################################################
# Get data and build some data frames
dataDir <- "C:/Users/Joseph/Documents/Revolution/Baseball"
fn1990 <- file.path(dataDir, "bpstats_1990.csv")
df1990 <- read.csv(fn1990)
AVG1990 <- df1990$AVG
fn2010 <- file.path(dataDir, "bpstats_2010.csv")
df2010 <- read.csv(fn2010)
AVG2010 <- df2010$AVG
##################################################################
# Make a "long form" data frame to generate the boxplot
year1990 <- rep("1990", length(AVG1990))
year2010 <- rep("2010", length(AVG2010))
df1 <- data.frame(AVG1990, year1990)
names(df1) <- c("AVG", "YEAR")
df2 <- data.frame(AVG2010, year2010)
names(df2) <- c("AVG", "YEAR")
bdat <- rbind(df1, df2)
boxplot(AVG ~ YEAR, data=bdat, col=c("red", "blue"),
        main="Batting Average Distributions")
####################################################################
# Draw the kernel density plots
par(mfrow=c(2, 1))
plot(density(AVG1990), col="red", main="1990 Batting Average Density")
rug(AVG1990, col="green")
plot(density(AVG2010), col="blue", main="2010 Batting Average Density")
rug(AVG2010, col="green")
#####################################################################
# Perform the tests
t.test(AVG ~ YEAR, data=bdat, var.equal=TRUE)
wilcox.test(AVG ~ YEAR, data=bdat, conf.int=TRUE)
#####################################################################

Hopefully this isn't a dumb question -- I'm new to all of this, but in your density plots above the curve appears to extend below 0.0 a little bit. How could a player have a negative batting average? Or am I interpreting this incorrectly?

Thanks

Posted by: Bill Svec | March 31, 2011 at 10:09

If you look at the density plots, you notice that a number of batters had a batting average of zero (no hits) and that some had a batting average of 1.0 (100% hits!). Those guys with a 100% batting average must be billionaires by now. You would never lose with them in your lineup! Seriously, a good batting average is around 0.3. Batters at the extremes of the distribution likely have a low number of attempts. Since it is reasonable to say we are only interested in the performance of typical batters in our comparison, it would be logical to apply a filter to the data based on the number of hits. This would likely remove the extreme values from the distribution. Whether this would affect the results of your statistical tests is an open question.

Posted by: Mike | March 31, 2011 at 10:36

I was going to suggest excluding batters with less than a certain number of at-bats. Obviously those at around .400 and higher would be removed. So would much of the peak which is strangely drawn centered on .000.

I'm a little surprised by the size of this peak. Are there that many pitchers and players who got only a handful of at-bats? Maybe each player's contribution should be weighted by number of plate appearances. I'm sure that, even without filtering out batters with few at-bats, the peak at zero would then disappear.

Posted by: Jon Peltier | March 31, 2011 at 10:56

@Bill Svec That's an artifact from the kernel density estimate; it doesn't know the range of the data unless you supply it. If you look at the rug (the green lines) you'll see that the data stops at 0 and 1 as it should.

Posted by: Jared | March 31, 2011 at 10:56
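*[A quick illustration of Jared's point: `density()` evaluates the estimate out past the data range by a multiple of the bandwidth, but its `from` and `to` arguments can pin the evaluation grid to the legal range of a batting average. The data below are simulated stand-ins for the real averages, so only the mechanics, not the numbers, carry over. -- Ed.]*

```r
# Simulated batting averages: a cluster of zeros (no hits), a realistic
# middle, and a few perfect 1.000 averages, mimicking the post's data.
set.seed(1)
avg <- c(rep(0, 50), rbeta(400, 8, 22), rep(1, 5))

# Default KDE: the grid runs 3 bandwidths past the data, so the curve
# spills below 0 and above 1 even though no such averages exist.
d_unbounded <- density(avg)

# Restricting the evaluation grid to [0, 1] removes the visual artifact.
d_bounded <- density(avg, from = 0, to = 1)

range(d_unbounded$x)  # extends outside [0, 1]
range(d_bounded$x)    # exactly 0 to 1
```

Note that `from`/`to` only clip where the curve is drawn; some probability mass still leaks past the boundary, so a proper boundary correction would need something like a reflection method.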

It's always great to see some good old-fashioned data analysis! Thanks for the guest blog, Joseph.

As the other comments point out, you can restrict "at bats" to eliminate the extreme outliers (maybe use AB > 20). Also, restrict the kernel density to only consider AVG >= 0. (I don't know how to restrict the KDE in R off the top of my head, although I know how to do it in another language.)

Lastly, the standard t-test is fairly robust to non-normal data, but it can be sensitive to unequal variance, which I'm guessing will still be present even after you filter by AB > 20. In that case, you might want to use the Welch-Satterthwaite adjustment to the t-test, which I think has better power (?) than the more general Mann-Whitney test.

Posted by: Rick Wicklin | March 31, 2011 at 11:40
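*[For what it's worth, the Welch-Satterthwaite version Rick mentions is `t.test()`'s default behavior: leaving out `var.equal=TRUE` gives Welch's unequal-variance test. A sketch on simulated data, since the original csv files aren't reproduced here. -- Ed.]*

```r
# Two simulated seasons with deliberately unequal spread.
set.seed(2)
avg1990 <- rnorm(700, mean = 0.203, sd = 0.08)
avg2010 <- rnorm(900, mean = 0.195, sd = 0.10)

# Default t.test() is Welch's test: it estimates each group's variance
# separately and uses the Welch-Satterthwaite approximation for the
# degrees of freedom, so df comes out fractional and below n1 + n2 - 2.
welch  <- t.test(avg1990, avg2010)
pooled <- t.test(avg1990, avg2010, var.equal = TRUE)

welch$method      # "Welch Two Sample t-test"
welch$parameter   # fractional df, below the pooled df of 700 + 900 - 2
```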

Why not run a permutation test?

Posted by: John Myles White | March 31, 2011 at 11:44

First thing I noticed here is that it's pretty obvious there are a lot of players with only a couple at bats in the data set. Batting average has a lot of noise even for a guy with a full season of at bats.

If you set a minimum of at bats for a season, you'll find the distribution to have lots of players bunched up on the left, with a long right tail.

Honestly, the AB filter should be something in the realm of no less than 300. It's important, of course, to check assumptions, but even before that it's important to use common sense.

Posted by: Millsy | March 31, 2011 at 12:35
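*[A sketch of the filter Millsy and others suggest. The Baseball Prospectus files aren't reproduced here, and the at-bats column name (`AB` below) is an assumption, so treat this as illustrative; the simulated frame stands in for the post's `bdat` with an at-bats column added. -- Ed.]*

```r
# Simulated stand-in for bdat, with a hypothetical at-bats column AB.
set.seed(3)
bdat_sim <- data.frame(
  AVG  = runif(1000, 0, 0.4),
  YEAR = sample(c("1990", "2010"), 1000, replace = TRUE),
  AB   = sample(0:600, 1000, replace = TRUE)
)

# Keep only "regulars": players with at least 300 at-bats,
# Millsy's suggested cutoff.
regulars <- subset(bdat_sim, AB >= 300)

nrow(regulars)  # fewer rows than bdat_sim, all with AB >= 300
```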

Thank you all for the comments. It is nice to just do some basic statistics. I should have thought to adjust the axis for the kernel plots; sorry for the confusion. Doing permutation tests and trying the Welch-Satterthwaite adjustment sound like good ideas. However, I am not so sure about removing the outliers first thing. Doing that seems to me to shift the question to something like: how different are the "real" hitters?

Posted by: Joseph Rickert | March 31, 2011 at 16:21
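*[The permutation test John suggests needs no distributional assumptions at all: shuffle the season labels, recompute the difference in mean AVG each time, and see how extreme the observed difference is under that null distribution. A sketch on simulated data standing in for `bdat`. -- Ed.]*

```r
# Simulated stand-in for the two seasons' batting averages.
set.seed(4)
avg  <- c(rnorm(700, mean = 0.203, sd = 0.05),
          rnorm(900, mean = 0.195, sd = 0.05))
year <- c(rep("1990", 700), rep("2010", 900))

# Observed difference in mean AVG between the two seasons.
obs <- mean(avg[year == "1990"]) - mean(avg[year == "2010"])

# Null distribution: the same statistic after randomly shuffling
# the season labels, which breaks any real season effect.
perm <- replicate(2000, {
  shuffled <- sample(year)
  mean(avg[shuffled == "1990"]) - mean(avg[shuffled == "2010"])
})

# Two-sided p-value: fraction of shuffled differences at least as
# extreme as the observed one.
p_perm <- mean(abs(perm) >= abs(obs))
```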

The author probably forgot to take into account an assumption that is crucial for the Wilcoxon Rank Sum Test: a symmetric distribution. Obviously, neither of the two distributions is symmetric, so the test results are questionable.

Posted by: Robert | April 01, 2011 at 03:26

What about the Central Limit Theorem? Even if trimodal, the sampling distribution of the mean will be normally distributed with 700+ samples. The assumption of normality (or, more accurately, a t-distribution) is on that distribution, not the parent distribution.

Here is a crazy mixture of normal, lognormal, gamma, and beta that is similar to what was shown for the actual data. Look at the sampling distribution (takes about 1 min to run):

xpts <- seq(0, 1, .01)
par(mfrow=c(3, 1))
plot(y=0.4*dnorm(x=xpts, mean=0.3, sd=0.035)
      + 0.4*dlnorm(xpts, meanlog=log(0.265), sdlog=0.3)
      + 0.15*dgamma(xpts, shape=0.2, scale=0.5)
      + 0.05*dbeta(0.9*xpts+0.01, shape1=0.5, shape2=0.5),
     x=xpts, type="l", xlim=c(0, 1), main="Mixture Density",
     ylab="f(x)", xlab="x")
plot(density(c(
       rnorm(400, mean=0.3, sd=0.035),
       rlnorm(400, meanlog=log(0.265), sdlog=0.3),
       rgamma(150, shape=0.2, scale=0.05),
       rbeta(50, shape1=0.5, shape2=0.5))),
     xlim=c(-0.1, 1.1), main="Typical Kernel Density Estimate")
xbar <- c(1:5e4)
for (i in 1:5e4) {
  xbar[i] <- mean(c(
    rnorm(0.4*740, mean=0.3, sd=0.035),
    rlnorm(0.4*740, meanlog=log(0.265), sdlog=0.3),
    rgamma(0.15*740, shape=0.2, scale=0.05),
    rbeta(0.05*740, shape1=0.5, shape2=0.5)))
}
xpts2 <- seq(0.240, 0.270, 0.0001)
plot(density(xbar), main="Sampling Dist of Mean w/Normal Overlay (red)")
lines(y=dnorm(x=xpts2, mean=mean(xbar), sd=sd(xbar)), x=xpts2,
      type="l", col="red", lwd=2, lty=2)

Looks pretty good to me. Now, one could also debate that we have a census rather than a sample here, but then that takes the fun out of baseball data altogether!

Also, the Wilcoxon-Mann-Whitney isn't just a test of location, it will detect differences in shape/variability (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1120984/)

So the fact that it finds a difference may just be due to differences in variability.

Posted by: Wade | April 01, 2011 at 08:29

I think the part-time players ("outliers" if you wish) probably impact the variability as well... In 1990, there were 26 teams with a total of 650 roster spots (25 per team), but there were 740 players in the list (many, but not all, of the extra 90 are the part-timers that are probably disproportionately near the extremes)... In 2010, there were 30 teams and 750 roster spots, but the 2010 data shows 944 players... That's roughly 100 extra regulars, but also 100 more part-timers... I wouldn't be surprised that removing the part-timers would also likely remove most of the variability differences...

Posted by: Barry | April 01, 2011 at 12:46

I redid the work excluding all players with fewer than 100 Plate Appearances. This niftily excluded all pitchers - as the maximum PA for any pitcher in either season was in the 90s - and also excludes late-season callups, who often hit worse than MLB regulars.

The t-test gave the following result

t = 0.4475, df = 822, p-value = 0.6546
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -0.003606420  0.005736718
sample estimates:
mean in group 1990 mean in group 2010
         0.2566466          0.2555814

The means could hardly be any closer.

The density plots are much closer to a normal distribution. Fielding position is probably also a factor, as catchers, for instance, are not expected to have as high a batting average as those in less crucial areas of the diamond.

Posted by: Andrew Clark | May 27, 2011 at 05:32