
March 31, 2011


Comments


Hopefully this isn't a dumb question -- I'm new to all of this, but in your density plots above the curve appears to extend below 0.0 a little bit. How could a player have a negative batting average? Or am I interpreting this incorrectly?

Thanks

If you look at the density plots, you notice that a number of batters had a batting average of zero (no hits) and that some had a batting average of 1.0 (100% hits!). Those guys with a 100% batting average must be billionaires by now. You would never lose with them in your lineup! Seriously, a good batting average is around 0.3. Batters at the extremes of the distribution likely have a low number of attempts. Since it is reasonable to say we are only interested in the performance of typical batters in our comparison, it would be logical to apply a filter to the data based on the number of hits. This would likely remove the extreme values from the distribution. Whether this would affect the results of your statistical tests is an open question.

I was going to suggest excluding batters with fewer than a certain number of at-bats. Obviously those at around .400 and higher would be removed. So would much of the peak, which is strangely centered on .000.

I'm a little surprised by the size of this peak. Are there really that many pitchers and players who got only a handful of at-bats? Maybe each player's contribution should be weighted by the number of plate appearances. I suspect that, even without filtering out batters with few at-bats, the peak at zero would then disappear.
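
For anyone who wants to try the weighting idea, base R's density() accepts a weights argument. A rough sketch, assuming a data frame bat with columns H (hits) and AB (at-bats) for one season; the names are illustrative, not from the original post:

bat <- subset(bat, AB > 0)            # drop players with no at-bats
avg <- bat$H / bat$AB                 # per-player batting average
w   <- bat$AB / sum(bat$AB)           # weight each player by at-bats
plot(density(avg, weights = w, from = 0, to = 1),
     main = "At-bat-weighted density of batting average")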

@Bill Svec That's an artifact from the kernel density estimate; it doesn't know the range of the data unless you supply it. If you look at the rug (the green lines) you'll see that the data stops at 0 and 1 as it should.
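
In base R, the range can be supplied through density()'s from and to arguments, which restrict where the estimate is drawn (they don't apply a boundary correction). A quick illustration with simulated averages, since the real data aren't reproduced here:

set.seed(1)
avg <- c(rep(0, 60), rnorm(650, mean = 0.26, sd = 0.04), rep(1, 5))  # toy data with spikes at 0 and 1
par(mfrow = c(2, 1))
plot(density(avg), main = "Default KDE (spills below 0 and above 1)")
rug(avg, col = "green")                                              # the data themselves stay inside [0, 1]
plot(density(avg, from = 0, to = 1), main = "KDE evaluated only on [0, 1]")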

It's always great to see some good old-fashioned data analysis! Thanks for the guest blog, Joseph.

As the other comments point out, you can restrict "at bats" to eliminate the extreme outliers (maybe use AB > 20). Also, restrict the kernel density to only consider AVG >= 0. (I don't know how to restrict the KDE in R off the top of my head, although I know how to do it in another language.)

Lastly, the standard t-test is fairly robust to non-normal data, but it can be sensitive to unequal variance, which I'm guessing will still be present even after you filter by AB > 20. In that case, you might want to use the Welch-Satterthwaite adjustment to the t-test, which I think has better power (?) than the more general Mann-Whitney test.
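
For what it's worth, R's t.test() already applies the Welch-Satterthwaite correction unless asked not to. A sketch, assuming two hypothetical vectors avg1990 and avg2010 of per-player averages after the AB filter:

t.test(avg1990, avg2010)                     # Welch t-test; var.equal = FALSE is the default
t.test(avg1990, avg2010, var.equal = TRUE)   # classical pooled-variance t-test, for comparison
wilcox.test(avg1990, avg2010)                # Mann-Whitney / Wilcoxon rank-sum test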

Why not run a permutation test?
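
One way that could look, again assuming the hypothetical vectors avg1990 and avg2010 from above: permute the season labels and recompute the difference in means.

set.seed(42)
obs  <- mean(avg1990) - mean(avg2010)        # observed difference in means
pool <- c(avg1990, avg2010)
n1   <- length(avg1990)
perm <- replicate(10000, {
  shuffled <- sample(pool)                   # randomly reassign season labels
  mean(shuffled[1:n1]) - mean(shuffled[-(1:n1)])
})
mean(abs(perm) >= abs(obs))                  # two-sided permutation p-value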

The first thing I noticed here is that there are clearly a lot of players with only a couple of at-bats in the data set. Batting average is noisy even for a guy with a full season of at-bats.

If you set a minimum number of at-bats for a season, you'll find the distribution has lots of players bunched up on the left, with a long right tail.

Honestly, the AB filter should be somewhere in the realm of no less than 300. Of course it's important to check assumptions, but even before that it's important to use common sense.

Thank you all for the comments. It is nice to just do some basic statistics. I should have thought to adjust the axis for the kernel plots; sorry for the confusion. Doing permutation tests and trying the Welch-Satterthwaite adjustment sound like good ideas. However, I am not so sure about removing the outliers first thing. Doing that seems to me to shift the question to something like: how different are the "real" hitters?

The author probably forgot to take into account an assumption that is crucial for using the Wilcoxon rank-sum test: a symmetric distribution. Obviously, neither of those two distributions is symmetric, so the test results are questionable.

What about the Central Limit Theorem? Even if the parent distribution is trimodal, the sampling distribution of the mean will be approximately normal with 700+ observations. The assumption of normality (or, more accurately, of a t-distribution) applies to that sampling distribution, not the parent distribution.

Here is a crazy mixture of normal, lognormal, gamma, and beta that is similar to what was shown for the actual data. Look at the sampling distribution (takes about 1 min to run):

xpts<-seq(0,1,.01)   # grid of batting averages on [0,1]

par(mfrow=c(3,1))    # three stacked panels

# Panel 1: theoretical density of the mixture; the 0.4/0.4/0.15/0.05 weights
# and the gamma scale match the sampling code below
plot(y=
0.4*dnorm(x=xpts,mean=0.3,sd=0.035)
+0.4*dlnorm(xpts,meanlog=log(0.265),sdlog=0.3)
+0.15*dgamma(xpts,shape=0.2,scale=0.05)
+0.05*dbeta(0.9*xpts+0.01,shape1=0.5,shape2=0.5),
x=xpts,type="l",xlim=c(0,1),main="Mixture Density",ylab="f(x)",xlab="x")

# Panel 2: kernel density estimate of one sample of 1000 draws from the mixture
plot(density(
c(
rnorm(400,mean=0.3,sd=0.035),
rlnorm(400,meanlog=log(0.265),sdlog=0.3),
rgamma(150,shape=0.2,scale=0.05),
rbeta(50,shape1=0.5,shape2=0.5)
))
,xlim=c(-0.1,1.1), main="Typical Kernel Density Estimate")


# Sampling distribution of the mean: 50,000 samples of n = 740 from the mixture
xbar<-numeric(5e4)

for(i in 1:5e4){
xbar[i]<-mean(c(
rnorm(0.4*740,mean=0.3,sd=0.035),
rlnorm(0.4*740,meanlog=log(0.265),sdlog=0.3),
rgamma(0.15*740,shape=0.2,scale=0.05),
rbeta(0.05*740,shape1=0.5,shape2=0.5)
)
)
}

# Panel 3: density of the 50,000 sample means with a matching normal curve overlaid
xpts2<-seq(0.240,0.270,0.0001)
plot(density(xbar), main="Sampling Dist of Mean w/Normal Overlay (red)")
lines(y=dnorm(x=xpts2,mean=mean(xbar),sd=sd(xbar)),x=xpts2,
type="l",col="red",lwd=2,lty=2)


Looks pretty good to me. Now, one could also argue that we have a census rather than a sample here, but then that takes the fun out of baseball data altogether!

Also, the Wilcoxon-Mann-Whitney isn't just a test of location; it will also detect differences in shape and variability (http://www.ncbi.nlm.nih.gov/pmc/articles/PMC1120984/).

So the fact that it finds a difference may just be due to differences in variability.

I think the part-time players ("outliers" if you wish) probably affect the variability as well. In 1990, there were 26 teams with a total of 650 roster spots (25 per team), but there were 740 players in the list (many, but not all, of the extra 90 are part-timers who are probably disproportionately near the extremes). In 2010, there were 30 teams and 750 roster spots, but the 2010 data show 944 players. That's roughly 100 extra regulars, but also 100 more part-timers. I wouldn't be surprised if removing the part-timers also removed most of the variability differences.

I redid the work excluding all players with fewer than 100 plate appearances. This niftily excluded all pitchers, as the maximum PA for any pitcher in either season was in the 90s, and it also excludes late-season call-ups, who often hit worse than MLB regulars.
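
One way that filtering might be set up in R, assuming a hypothetical data frame bat with columns yearID, AB, H and PA (plate appearances); if PA isn't stored directly it can be derived from AB, BB, HBP, SF and SH:

reg <- subset(bat, yearID %in% c(1990, 2010) & PA >= 100)   # regulars only
reg$AVG <- reg$H / reg$AB                                    # batting average
t.test(AVG ~ factor(yearID), data = reg)                     # Welch t-test, 1990 vs 2010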

The t-test gave the following result

t = 0.4475, df = 822, p-value = 0.6546
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-0.003606420 0.005736718
sample estimates:
mean in group 1990 mean in group 2010
0.2566466 0.2555814

The means could hardly be any closer.

The density plots are now much closer to a normal distribution. Fielding position is probably also a factor: catchers, for instance, are not expected to have as high a batting average as those in less crucial areas of the diamond.
