It's well-known that the home team has an advantage in soccer (or football, as it's called in England). But which teams have made the most of their home-field advantage over the years? Evolutionary biologist (and Liverpool fan) Joe Gallagher analyzed the percentage of points won in the UK Premier League (which awards 3 points for a win and one point for a draw) for teams at home and away:

On average, Premier League teams win around 60% of their points at home. The green bars show the 10 teams who won most of their points at home (Burnley managed only a single away win and an away draw in their 2009 season) and the purple bars shows 10 teams who defied the odds to win most of their points in *away* games.

Joe used the R language to perform the analysis and create the chart above, with the engsoccerdata package providing the historical home/away score data (augmented with current-season data from the web). You can find the R code used to perform the analysis, along with several other interesting analyses of the data, at the link below.

Joe Gallagher: Home advantages and wanderlust

While you might think of Scrabble as that game you play with your grandparents on a rainy Sunday, some people take it *very* seriously. There's an international competition devoted to Scrabble, and no end of guides and strategies for competitive play. James Curley, a psychology professor at Columbia University, has used an interesting method to collect data about what plays are most effective in Scrabble: by having robots play against each other, thousands of times.

The data were generated with a Visual Basic script that automated two AI players completing a game in Quackle. Quackle emulates the Scrabble board, and provides a number of AI players; the simulation used the "Speedy Player" AI which tends to make tactical scoring moves while missing some longer-term strategic plays (like most reasonably skilled Scrabble players). He recorded the results of 2566 games between two such computer players and provided the resulting plays and boards in an R package. With these data, you can see some interesting statistics on long-term outcomes from competitive Scrabble games, like this map (top-left) of which squares on the board are most used in games (darker means more frequently), and also for just the Q, Z and blank tiles. Scrabble games in general tend to follow the diagonals where the double-word score squares are located, while the high-scoring Q and Z tiles tend to be used on double- and triple-letter squares. The zero-point blank tile, by comparison, is used fairly uniformly across the board.

Further analysis of the actual plays during the simulated games reveals some interesting Scrabble statistics:

**It's best to play first**. Player 1 won 54.6% of games, while Player 2 won 44.9%, a statistically significant difference. (The remaining 0.4% of the games were ties.)

**Some uncommon words are frequently used**. The 10 most frequently-played words were QI, QAT, QIN, XI, OX, EUOI, XU, ZO, ZA, and EX; not necessarily words you'd use in casual conversation, but all words very familiar to competitive scrabble players. As we'll see later though, not all are high-scoring plays. Sometimes it's a good idea to get rid of a high-scoring but restrictive letter (QI, the life energy in Chinese philosophy), or simply to shake up a vowel-heavy rack for more opportunities next turn (EUOI, a cry of impassioned rapture in ancient Bacchic revels).

**For high-scoring plays, go for the bingo**. A Scrabble Bingo, where you lay down all 7 tiles in your rack in one play, comes with a 50-point bonus. The top three highest-scoring plays in the simulation were all bingo plays: REPIQUED (239 points), CZARISTS (230 points), and IDOLIZED. (Remember though, that this is from just a couple of thousand simulated games; there are many many more potentially high-scoring words.)

**High-scoring non-bingo plays can be surprisingly long words**. It's super-satisfying to lay down a short word like EX with the X making a second word on a triple-letter tile (for a 9x bonus), so I was surprised to see the top 10 highest-scoring non-bingo plays were still fairly long words: XENURINE (144 points), CYANOSES (126 pts) and SNAPWEED (126 points), all using at least 2 tiles already on the board. The shortest word in the top 10 of this list was ZITS.

**Some tiles just don't work well together**. The pain of getting a Q without a U seems obvious, but it turns out getting **two** U's is way worse in point-scoring potential. From the simulation, you can estimate the point-scoring potential of any pair of tiles in your rack: lighter is better, and darker is worse.

Managing the scoring potential of the tiles in your rack is a big part of Scrabble strategy, as we saw in another Scrabble analysis using R a few years ago. The lowly zero-point blank is actually worth a lot of potential points, while the highest-scoring tile Q is actually a liability. Here are the findings from that analysis:

- The blank is worth about 30 points to a good player, mainly by making 50-point "bingo" plays possible.
- Each S is worth about 10 points to the player who draws it.
- The Q is a burden to whichever player receives it, effectively serving as a 5 point penalty for having to deal with it due to its effect in reducing bingo opportunities, needing either a U or a blank for a chance at a bingo and a 50-point bonus.
- The J is essentially neutral pointwise.
- The X and the Z are each worth about 3-5 extra points to the player who receives them. Their difficulty in playing in bingoes is mitigated by their usefulness in other short words.

For more James Curley's recent Scrabble analysis, including the R code using his scrabblr package, follow the link to below.

RPubs: Analyzing Scrabble Games

At the 2016 EARL London conference senior data-visualisation journalist John Burn-Murdoch, described how the Financial Times uses R to produce high-quality, striking data visualisations. Until recently, charts were the realm of an information designer using tools like Adobe Illustrator: the output was beautiful, but the process was a long and winding one. The FT needed to be able to "audition" several different visual treatments quickly, to be able to create stunning visuals before deadline. That's where R and the ggplot2 package come in.

John presented a case study (you can see the slides with animations here) on creating this FT article, Explore the changing tides of European footballing power. The final work included 128 charts in total, telling the story of dozens of soccer teams in four countries. John also shared these animated versions (along with the ggplot2 code to produce them):

You can find the data and R code behind the animations here.

John B Murdoch: Ggplot2 as a creative engine ... and other ways R is transforming the FT's quantitative journalism

For about three years now, telemetry has been gathered for professional basketball games in the US by SportVU for the NBA. Six cameras track the on-court position of the players and the ball, with a resolution of 25 samples per second.

Combine this movement data with NBA play-by-play data (players, plays, fouls, and points scored — data sadly no longer made available by the NBA), and you have a rich data set for analysis. Naturally, you can read these data files into R, and Rajiv Shah provides several R scripts to facilitate the process. These include functions to import the motion and play-by-play data files and merge them into a data frame, exploration of the movement data, extract and analyze player trajectories and calculate metrics on player motion (such as speed and 'jerk').

James Curley used the same data and extended those scripts to animate NBA plays, such as this basket scored during a December 2015 game between the San Antonio Spurs and the Minnesota Timberwolves. The orange polygon is a measure of player spacing on the court. (Pop a taut rubber band around the players and let go: that's a convex hull.) It would be interesting to extract the area of this convex hull over time as a series, and see if the value relates to scoring opportunities, but that's a task for another time.

James used simple ggplot2 functions to plot the positions of the players and the ball on top of a geom extension to draw the court. Each frame was animated from records in the SportVU data, and then assempled into an animated GIF using the gg_animate function. (Many thanks to James for providing the GIF itself.) You can see further details, including the complete R code, at the blog post linked below.

Curley Lab: Animating NBA Play by Play using R

March Madness is upon us here in the US. This annual college basketball competition pits 64 teams in a single-elimination tournament, and the team that goes undefeated for all 6 rounds will be named NCAA Champion.

Predicting the winners of the competition, and in particular completing a "bracket" of the teams you predict to make it to the final 32 or 16 and eventually win, is a popular pastime (and foundation for many wagers). Some use their knowledge of the teams or the betting markets to select their bracket. And some, like 47-year-old English data scientist and top-ranked Kaggler Amanda Schierz, use data, models, and R. Watch her story in this (sadly, unembeddable) video from ESPN and FiveThirtyEight.

If you'd like to try your hand at your own predictions based on machine learning, Azure ML (part of the Cortana Analytics suite) provides all the data, algorithms, and R and Python support you need. Here at Microsoft we've run internal March Madness competitions every year, and in the video below last year's winner Damon Hachmeister shares his secrets.

There's even a March Madness Prediction Service published on the Cortana Analytics Gallery which, given the current state of the competition as a Web Service input, will provide predictions for the remaining games as an output.

Got more tips for predicting a bracket with Machine Learning? Share them in the comments below.

by Andrie de Vries

A week ago my high school friend, @XLRunner, sent me a link to the article "How Zach Bitter Ran 100 Miles in Less Than 12 Hours". Zach's effort was rewarded with the American record for the 100 mile event.

This reminded me of some analysis I did, many years ago, of the world record speeds for various running distances. The International Amateur Athletics Federation (IAAF) keeps track of world records for distances from 100m up to the marathon (42km). The distances longer than 42km do not fall in the IAAF event list, but these are also tracked by various other organisations.

You can find a list of IAAF world records at Wikipedia, and a list of ultramarathon world best times at Wikepedia.

I extracted only the mens running events from these lists, and used R to plot the average running speeds for these records:

You can immediately see that the speed declines very rapidly from the sprint events. Perhaps it would be better to plot this using a logarithmic x-scale, adding some labels at the same time. I also added some colour for what I call standard events - where "standard" is the type of distance you would see regularly at a world championships or olympic games. Thus the mile is "standard", but the 2,000m race is not.

Now our data points are in somewhat more of a straight line, meaning we could consider fitting a linear regression.

However, it seems that there might be two kinks in the line:

- The first kink occurs somewhere between the 800m distance and the mile. It seems that the sprinting distances (and the 800m is sometimes called a long sprint) has different dynamics from the events up to the marathon.
- And then there is another kink for the ultra-marathon distances. The standard marathon is 42.2km, and distances longer than this are called ultramarathons.

Also, note that the speed for the 100m is actually slower than for the 200m. This indicates the transition effect of getting started from a standing start - clearly this plays a large role in the very short sprint distance.

For the analysis below, I exlcuded the data for:

- The 100m sprint (transition effects play too large a role)
- The ultramarahon distances (they get raced less frequently, thus something strange seems to be happening in the data for the 50km race in particular).

To fit a regression line with kinks, more properly known as a segmented regression (or sometimes called piecewise regression), you can use the segmented package, available on CRAN.

The `segmented()`

function allows you to modify a fitted object of class `lm`

or `glm`

, specifying which of the independent variables should have segments (kinks). In my case, I fitted a linear model with a single variable (log of distance), and allowed `segmented()`

to find a single kink point.

My analysis indicates that there is a kink point at 1.13km (10^0.055 = 1.13), i.e. between the 800m event and the 1,000m event.

`> summary(sfit)`

`***Regression Model with Segmented Relationship(s)***`

`Call: `

`segmented.lm(obj = lfit, seg.Z = ~logDistance)`

`Estimated Break-Point(s):`

` Est. St.Err `

` 0.055 0.021`

`Meaningful coefficients of the linear terms:`

` Estimate Std. Error t value Pr(>|t|) `

`(Intercept) 27.2064 0.1755 155.04 < 2e-16 ***`

`logDistance - 15.1305 0.4332 -34.93 1.94e-13 ***`

`U1.logDistance 11.2046 0.4536 24.70 NA `

`---`

`Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1`

`Residual standard error: 0.2373 on 12 degrees of freedom`

`Multiple R-Squared: 0.9981, Adjusted R-squared: 0.9976`

`Convergence attained in 4 iterations with relative change -4.922372e-16 `

The final plot shows the same data, but this time with the segmented regression line also displayed.

I conlude:

- It is really easy to fit a segmented linear regression model using the segmented package
- There seems to be a different physiological process for the sprint events and the middle distance events. The segmented regression finds this kink point between the 800m event and the 1,000m event
- The ultramarathon distances have a completely different dynamic. However, it's not clear to me whether this is due to inherent physiological constraints, or vastly reduced competition in these "non-standard" events.
- The 50km world record seems too "slow". Perhaps the competition for this event is less intense than for the marathon?

Here is my code for the analysis:

EMC recently ran a competion to find out why John McGuinness, the legendary motorcycle racer known as the "Morecambe Missile", is outperforms the average motorcycle racer. To answer this question, EMC instrumented his bike and his suit with a number of real-time sensors. (Data collected included gear and RPM for the bike, and heart rate and acceleration for the rider.) They did the same for Adam Child, a motorcycle racing journalist who acted as a "control" in this study. The two motorcyclists then completed 10 laps of Circuit Monteblanco in Spain, and the sensor data was provided for analysis in a CrowdAnalytix competition. (Sadly, the data themselves are proprietary and not available for others to analyze.) Prizes were awarded for the best model and for the best data visualization, and both winners used R.

In the "best model" competition, winner Stefan Jol (a revenue management Analyst at a leading UK radio group) used a random forest to determine that bike position and rider position were the primary determinants of speed (and race performance). His analysis appeared to divide the track into segments, shown in the map below.

In the visualization competition, winner Charlotte Wickham (Assistant Professor in the Department of Statistics at Oregon State University) also divided the track into segments, with an interactive data visulization to compare the pro racer and journalist in each segment.

As this demonstration video shows, Charlotte's app allows you to select one or more segments from the racetrack, and compare summary statistics and even the racing line used by each rider. EMC noted Charlotte's use of R in this quote:

“Professor Wickham built a very unique and insightful application (using the R statistical programming language) which allowed users to look at sections of the course and compare John and Chad across a number of different variables. R is a powerful language that we use at EMC for predictive analytics, and her use of R demonstrates the versatility of open source analytical programs.”

You can learn more about the winning submissions at the link below, by scrolling down to the "The Analysis" section.

By Mark Malter

A few weeks ago, I wrote about my Baseball Stats R shiny application, where I demonstrated how to calculate runs expectancies based on the 24 possible bases/outs states for any plate appearance. In this article, I’ll explain how I expanded on that to calculate the probability of winning the game, based on the current score/inning/bases/outs state. While this is done on other websites, I have added some unique play attempt features -- steal attempt, sacrifice bunt attempt, and tag from third attempt -- to show the probability of winning with and without the attempt, as well as the expected win probability given a user determined success rate for the play. That way, a manager can not only know the expected runs based on a particular decision, but the actual probability of winning the game if the play is attempted.

After the user enters the score, inning, bases state, and outs, the code runs through a large number of simulated games using the expected runs to be scored over the remainder of the current half inning, as well as each succeeding half inning for the remainder of the game.

When there are runners on base and the user clicks on any of the ‘play attempt’ tabs, a new table is generated showing the new probabilities. I allow for sacrifice bunts with less than two outs and a runner on first, second, or first and second. The stolen base tab can be used with any number of outs and the possibility of stealing second, third, or both. The tag from third tab will work as long as there is a runner on third and less than two outs prior to the catch.

I first got this idea after watching game seven of the 2014 World Series. Trailing 3-2 with two outs and nobody on base, Alex Gordon singled to center off of Madison Bumgarner and made it all the way to third base after a two base error. As Gordon was approaching third base, the coach gave him the stop sign, as shortstop Brandon Crawford was taking the relay throw in short left field. Had Gordon attempted to score on the play, he probably would have been out at the plate- game and series over. However, by holding up at third base, the Royals are also still ‘probably’ going to lose since runners only score from third base with two outs about 26% of the time.

The calculator shows that the probability of winning a game down by one run with a runner on third and two outs in the bottom of the ninth (or an extra inning) is roughly 17%. If we click on the ‘Tag from Third Attempt’ tab (Gordon attempting to score would have been equivalent to tagging from third after a catch for the second out), and play with the ‘Base running success rate’ slider, we see that the break even success rate is roughly 0.3. I don’t know the probability of Gordon beating the throw home, but if it was greater than 0.3 then making the attempt would have improved the Royals chances of winning the World Series. In fact, if the true success rate was as much as 0.5 then the Royals win probability would have jumped by 11% to 28%.

Here is the UI code. Here is the server code. And here is the Shiny App.

>

by Mark Malter

After reading the book, Analyzing Baseball with R, by Max Marchi and Jim Albert, I decided to expand on some of their ideas relating to runs created and put them into an R shiny app .

The Server and UI code are linked at the bottom of the Introduction tab.

I downloaded the Retrosheet play-by-play data for every game played in the 2011-2014 seasons in every park and aggregated every plate appearance by one of the 24 bases/outs states (ranging from nobody on/nobody out to bases loaded/two outs). With Retrosheets data, I wrote code to track the batter, bases, outs, runs scored over remainder of inning, current game score, and inning. I also used the R Lahman package and databases for individual player information. Below is a brief explanation of the function of each tab on the app.

**Potential Runs by bases/outs state**: Matrix of all 24 possible bases/outs states, both with expected runs over the remainder of an inning, and the probability of scoring at least one run over the remainder of the inning (for late innings of close games). I used this table to analyze several types of plays, as shown below. Notice that, assuming average hitters, the analysis below shows why sacrifice bunts are always a bad idea. The Runs Created stat for a plate appearance is defined as:

end state – start state + runs created on play.

I first became serious about this after watching the last inning of the 2014 world series. Down 3-2 with two outs and nobody on base, Alex Gordon singled to center and advanced to third on a two base error. As Gordon was heading into third base, Giants shortstop Brandon Crawford was taking the relay throw in short left field. Had Gordon been sent home, Crawford would likely have thrown him out at the plate. However, the runs matrix shows only a 26% chance of scoring a run with a man on third and two outs, and with Madison Bumgarner on the mound, it was even less likely that on deck hitter Salvador Perez would be able to drive in Hosmer. So even though sending Gordon would likely have ended the game (and the series), it still may have been the optimal play. This would be similar to hitting 16 vs. a dealer’s ten in Blackjack- you’ll probably lose, but you’re making an optimal play. For equivalency, see the Tag from Third analysis below, as this play would have been equivalent to tagging from third after a catch for the second out.

**Runs Created All Regular MLB Players**: I filtered out all players with fewer than 400 plate appearances and created an interactive rchart showing each player’s runs potential by runs created. I placed the following filters in the UI: year, innings (1-3, 4-6, 7-extras), run differential at time of at bat (0-1, 2-3, 4+), position, team, bats, age range, and weight. Hovering over a point shows the player and his salary. For example, Mike Trout created 58 runs out of a potential of 332 in 2014. Filtering 2013 for second baseman under the age of 30 and weighing less than 200 pounds, we see Jason Kipnis created 27 runs out of a potential of 300.

**Player Runs Table**: Same as above, but this shows each player (> 400 plate appearances for the selected season), broken down by each of the eight bases states. For example, in 2014 Jose Abreu created 43.5 runs on a potential of 291, and was most efficient with a runner on second base, where he created 10.3 runs on a potential of only 36.

The following tabs show runs expectancies of various offensive plays from the start state the expected end state, based on the expected Baserunning Success rate in the UI. For each play, I created a graphical as well as a table tab. For the graphical tabs, there is a UI to switch between views of expected runs and scoring probability.

**Stolen bases Graphic/Table**: For each of fifteen different base stealing situations, I show the start state, end state (based on the UI selected success rate), and the breakeven success rate for the given situation. We see that rather than one generic rule of thumb for breaking even, the situational b/e’s vary widely, ranging from 91% with a runner on second and two outs, to 54% for a double steal with first and second and one out (I assume that any out is the lead runner). Notice though that if only the runner on second attempts to steal, the break even jumps from 54% to 72%.

**Tag from Third Graphic/Table**: I broke down every situation where a fly ball was caught with a runner on third, where the catch was either the first or second out. I tracked the attempt frequency and success rate for each situation, based on the outs and whether there were trailing runners. Surprisingly, I found that almost every success rate is well over 95%, meaning runners are only tagging when they’re almost certain to score. However, the break evens range from 40% with first and third with two outs (after the catch) to 77% with runners on second and third with one out. I believe this shows a gray area between the b/e and success rates where runners are being far too cautious.

The following tabs show whether a base runner should attempt to advance two bases on a single. Again, of course it depends on the situation.

**First to Third Graphic/Table**: Here we see that the attempted frequencies are very low, and as expected, lowest on balls hit to left field. However, as with the above tag plays, runners are almost always safe, showing another gray area between attempts and b/e’s. For example, on a single to right field with one out, runners only attempt to advance to third base 42.1% of the time, and are safe 97.3%. If we place the UI Success Rate slider on 0.85, we see that the attempt increases the runs expectancy from 0.87 to 0.99.

**Second to Home Graphic/Table**: Here we see the old adage, “don’t make the first or second out at the plate”, is not necessarily true. Attempting to score from second on a single depends not only on the outs, but also whether there is a trailing runner. The break evens range from 93% with no outs and no trailing runner on first, to 40% with two outs and no runner on first. Once again, the success rates are almost always higher than the break even rate, showing too much caution.

**Sacrifice Bunt Graphic/Table**: These tabs show that unless we have a hitter far below average, the sacrifice should never be attempted. For example, in going from a runner on first and no outs to a runner on second with one out, or going from a runner on second with no outs to a runner on third with one out, we drop from 0.85 runs to 0.66 runs and from 1.10 runs to 0.94 runs respectively. Worse, I’m assuming that the bunt is always successful with the lead runner never being thrown out. The only situation where the bunt might be wise is in a late inning and the team is playing for one run after a leadoff double. Getting the runner from second and no outs to third with one out increases the probability of scoring from 0.61 to 0.65, IF the bunt is successful. Even here, it is a poor play if the success rate is less than 90%. The graphic tab allows the user to see how the expected end state changes as the UI success rate slider is altered.

UI code: https://github.com/malter61/retrosheets/blob/master/ui.R

Server code: https://github.com/malter61/retrosheets/blob/master/server.R

*Mark Malter is a data scientist currently working for Houghton, Mifflin, Harcourt, as well as the consulting firm Channel Pricing, specializing in building predictive models, cluster analysis, and visualizing data. He is also a sixteen year veteran stock options market-maker at the Chicago Board Options Exchange. He has a BS degree in electrical engineering, an MBA, and is currently working on an MS degree in Predictive Analytics. Mark also spent 14 years as a director and coach of his local youth baseball league.*

A quick heads-up that this Thursday (December 11), Allen Day from MapR and Bill Jacobs from Revolution Analytics will be live presenting a new webinar, Batter Up! Advanced Sports Analytics with R and Storm. The analysis will be of baseball data, but the webinar will be of interest to anyone interested in doing large-scale statistical analysis with R of data in Hadoop. Here's the abstract:

This session will demonstrate how the all-star line-up featuring R and Storm enables real-time processing on massive data sets; a real home run! The presenters will use actual baseball data and a real-world use case to compose an implementation of the use case as Storm components (spouts, bolts, etc.) and highlight how R can be an effective tool in prototyping a solution. Attendees will leave the session with information that could easily be applied for other use cases such as video game analytics, fraud detection, intrusion detection, and consumer propensity to buy calculations.

The business need for real-time analytics at large scale has focused attention on the use of Apache Storm, but an approach that is sometimes overlooked is the use of Storm and R together. This novel combination of real-time processing with Storm and the practical but powerful statistical analysis offered by R substantially extends the usefulness of Storm as a solution to a variety of business critical problems. By architecting R into the Storm application development process, Storm developers can be much more effective. The aim of this design is not necessarily to deploy faster code but rather to deploy code faster. Just a few lines of R code can be used in place of lengthy Storm code for the purpose of early exploration – you can easily evaluate alternative approaches and quickly make a working prototype.

Register to attend the session on Thursday, where you can ask Bill and Allen you questions live. All registrants will also recieve a copy of the slides and the webinar recording. You can register at the link below.

**Update**: The webinar is a wrap, and you can view the slides and replay at the link below.

Revolution Analytics webinars: Batter Up! Advanced Sports Analytics with R and Storm