by Joseph Rickert
We have been following sports statistics regularly on the Revolutions Blog, with quite a few sports-related posts this year. In a post I wrote back in April about the Lahman R package for baseball statistics, I speculated that baseball was poised to move from Moneyball-style predictive analytics to real-time descriptive stats by showing strike zone heat maps overlaid on TV images of batters swinging away. Sports statistics, however, is moving much more quickly than I imagined. Apparently, the NBA is blowing by this milestone and setting up to do real-time predictions.
Recently Mark Glickman, one of the organizers of NESSIS 2013 (the New England Symposium on Statistics in Sports), sent me links to the slides and videos of the presentations made at the conference. There are several excellent presentations here, but I was astounded by Dan Cervone's presentation, "State of Transition: Estimating Real-Time Expected Possession Value in the NBA with a Spatiotemporal Transition Model and Player Tracking Data".
Dan, a Harvard graduate student, describes how he and his fellow researchers are using data from an optical tracking system, developed by STATS and scheduled to be installed in all 30 NBA arenas, to build predictive state transition models. The optical system tracks the 2D locations of all 10 players on the court, as well as the 3D position of the ball, by taking 25 images per second. Using the 800 million data points generated from only 515 games, the Harvard researchers are trying to answer questions like "How many points is a team expected to score given the spatial evolution of its possession up to time t?"
EPV = E[X | F(t)], where X = the number of points scored on this possession (unknown) and F(t) = the space-time information of the possession up to time t.
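To make the conditional expectation concrete, here is a minimal Python sketch of the same idea on a toy state transition model. It is not the Harvard group's model: the state names, the transition probabilities, and the function expected_possession_value are all invented for illustration, and it assumes the possession is a simple Markov chain, so that conditioning on the history F(t) reduces to conditioning on the current state.

```python
# Toy illustration only: the states and probabilities below are invented,
# not taken from the NESSIS presentation or from any tracking data.

# Points awarded when the possession ends in each terminal state.
TERMINAL_POINTS = {"made_three": 3.0, "made_two": 2.0, "no_score": 0.0}

# Hypothetical one-step transition probabilities between coarse possession
# states. Each row sums to 1.
TRANSITIONS = {
    "perimeter": {"perimeter": 0.50, "paint": 0.20, "made_three": 0.10, "no_score": 0.20},
    "paint": {"perimeter": 0.30, "made_two": 0.40, "no_score": 0.30},
}


def expected_possession_value(transitions, terminal_points, tol=1e-12, max_iter=10_000):
    """Return E[points | current state] for every state by fixed-point iteration."""
    # Terminal states are worth their points; live states start at zero.
    values = {state: 0.0 for state in transitions}
    values.update(terminal_points)
    for _ in range(max_iter):
        delta = 0.0
        for state, row in transitions.items():
            # The value of a live state is the probability-weighted value
            # of the states it can move to next.
            new_value = sum(prob * values[nxt] for nxt, prob in row.items())
            delta = max(delta, abs(new_value - values[state]))
            values[state] = new_value
        if delta < tol:
            break
    return values


epv = expected_possession_value(TRANSITIONS, TERMINAL_POINTS)
for state in TRANSITIONS:
    print(f"EPV given ball in {state}: {epv[state]:.3f}")
```

With these made-up numbers the possession is worth a little over one point in either live state, and slightly more once the ball is in the paint. The real model conditions on far richer information (every player's location 25 times per second), but the quantity being estimated has the same form.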
The following graph shows spatial effect surface plots for some San Antonio Spurs players. These surfaces are components of the predictive model.
Just how big this kind of modeling is expected to be can be inferred from the opening remarks made by Mike Zarren, Assistant GM of the Boston Celtics, before Dan's presentation. Speaking about plans for the continued availability of the data, Mr. Zarren says, "I've talked with people on both sides, at the league and also at Stats, and both are still interested in researchers getting some access to this data, but exactly what the model looks like is still up to debate". My guess is that there will be some serious money riding on this data and the predictive models based on it.
All of the NESSIS presentations exhibit a fairly high level of statistical play. In addition to Dan's presentation, there are four more basketball-related studies; one each on the Boston Marathon, soccer and tennis; one on football that uses random forest models to estimate win probabilities on each play during a game; and three presentations on baseball, including an R-based analysis of "streakiness" by Jim Albert, longtime R contributor and editor of the Journal of Quantitative Analysis in Sports. At the beginning of his talk, Jim recounts how early in his career he was surprised to find an analysis of baseball data in a paper by Brad Efron and Carl Morris on Stein's Paradox in Statistics. At that time, Jim remarks, "you don't write about sports to get tenure... maybe times have changed". Maybe they have.