It's relatively easy to find the player who has scored the most goals in the last 12 years (hello, Lionel Messi). But which professional football (soccer) player is the best **finisher**, i.e. which player is most likely to put a shot they take into the goal?

You can't simply use the conversion rate (the ratio of goals scored to shots taken), because some players take more of their shots far from the goal while others get more set-ups near the goal. To correct for that, the blog Barça Numeros used a Bayesian beta-binomial regression model to weight the conversion rates by distance, and then ranked each player by goal-scoring rate at 25 distances from the goal. (The analysis was performed in R using techniques described in David Robinson's book *Introduction to Empirical Bayes: Examples from Baseball Statistics*, which is available online.)
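The core idea of shrinking each player's raw conversion rate toward a league-wide beta prior can be sketched in a few lines of base R. This is a simplified illustration on simulated players, not the blog's actual model (which also regresses on shot distance):

```r
set.seed(42)

# Simulated players: true conversion rates drawn from a beta distribution,
# observed goals drawn from a binomial given each player's shot count
n_players <- 200
shots <- sample(30:600, n_players, replace = TRUE)
truth <- rbeta(n_players, 15, 85)           # league-wide "talent" distribution
goals <- rbinom(n_players, shots, truth)
raw   <- goals / shots                      # naive conversion rate

# Method-of-moments fit of a beta prior to the raw rates
m <- mean(raw); v <- var(raw)
ab    <- m * (1 - m) / v - 1                # alpha + beta
alpha <- m * ab
beta  <- (1 - m) * ab

# Empirical Bayes estimate: shrink each rate toward the prior mean
eb <- (goals + alpha) / (shots + alpha + beta)
```

Players with few shots are pulled strongly toward the prior mean `alpha / (alpha + beta)`, while high-volume shooters keep an estimate close to their raw rate.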

Here's a chart comparing the ranks of Messi, Zlatan Ibrahimovic, Cristiano Ronaldo, Paulo Dybala and Alessandro Del Piero at each distance, showing Messi to be the best finisher of these players at *all* ranges:

For an overall ranking for each player, the blog used the median rank across the 25 shot distances — a ranking that places Lionel Messi as the greatest finisher of the last 12 years.

For more details behind the analysis (and many more charts), check out the complete blog post linked below.

Barça Numeros: Who are the best finishers in contemporary football? (via @barcanumbers)

Strava is a fitness app that records your activities, including the routes of your walks, rides and runs. The service also provides an API that allows you to extract all of your data for analysis. University of Melbourne research fellow Marcus Volz created an R package to download and visualize Strava data, and created a chart to visualize all of his runs over six years as a small multiple.

Inspired by his work (and the availability of the R package he created), others also visualized bike rides, activity calendars, and aggregated route maps with elevation data. (You can see several examples in the Twitter moment embedded below.) If you'd like to download your own Strava data, all you need is a Strava access token, a recent version of R (3.4.3 or later) and the strava package found on GitHub.

Strava activity visualized with R

Marcus Volz: A gallery of visualisations derived from Strava running data

It's well-known that the home team has an advantage in soccer (or football, as it's called in England). But which teams have made the most of their home-field advantage over the years? Evolutionary biologist (and Liverpool fan) Joe Gallagher analyzed the percentage of points won in the UK Premier League (which awards 3 points for a win and one point for a draw) for teams at home and away:

On average, Premier League teams win around 60% of their points at home. The green bars show the 10 teams who won the largest share of their points at home (Burnley managed only a single away win and an away draw in their 2009 season), and the purple bars show the 10 teams who defied the odds to win most of their points in *away* games.

Joe used the R language to perform the analysis and create the chart above, with the engsoccerdata package providing the historical home/away score data (augmented with current-season data from the web). You can find the R code used to perform the analysis, along with several other interesting analyses of the data, at the link below.
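The points-share computation at the heart of the chart is straightforward once the results are in a long home/visitor format. Here is a base-R sketch; the column names (home, visitor, hgoal, vgoal) follow the engsoccerdata convention, but the results below are made up:

```r
# Toy results table in the style of engsoccerdata (fabricated data)
matches <- data.frame(
  home    = c("A", "B", "A", "B"),
  visitor = c("B", "A", "B", "A"),
  hgoal   = c(2, 1, 0, 3),
  vgoal   = c(0, 1, 0, 1)
)

# Premier League scoring: 3 points for a win, 1 for a draw, 0 for a loss
pts <- function(gf, ga) ifelse(gf > ga, 3, ifelse(gf == ga, 1, 0))
matches$hpts <- pts(matches$hgoal, matches$vgoal)
matches$vpts <- pts(matches$vgoal, matches$hgoal)

# Points each team won at home and away, and the share won at home
home_pts   <- tapply(matches$hpts, matches$home, sum)
away_pts   <- tapply(matches$vpts, matches$visitor, sum)
home_share <- home_pts / (home_pts + away_pts)
```

With the toy results above, both teams win 4 of their 5 points at home, a home share of 0.8.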

Joe Gallagher: Home advantages and wanderlust

While you might think of Scrabble as that game you play with your grandparents on a rainy Sunday, some people take it *very* seriously. There's an international competition devoted to Scrabble, and no end of guides and strategies for competitive play. James Curley, a psychology professor at Columbia University, has used an interesting method to collect data about what plays are most effective in Scrabble: by having robots play against each other, thousands of times.

The data were generated with a Visual Basic script that automated two AI players completing a game in Quackle. Quackle emulates the Scrabble board and provides a number of AI players; the simulation used the "Speedy Player" AI, which tends to make tactical scoring moves while missing some longer-term strategic plays (like most reasonably skilled Scrabble players). He recorded the results of 2566 games between two such computer players and provided the resulting plays and boards in an R package.

With these data, you can see some interesting statistics on long-term outcomes from competitive Scrabble games, like this map (top-left) of which squares on the board are most used in games (darker means more frequently), and also for just the Q, Z and blank tiles. Scrabble games in general tend to follow the diagonals where the double-word score squares are located, while the high-scoring Q and Z tiles tend to be used on double- and triple-letter squares. The zero-point blank tile, by comparison, is used fairly uniformly across the board.
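A board-usage heatmap of this kind is just a tally of play coordinates into a 15x15 matrix. A base-R sketch on fabricated plays (the real coordinates come from James Curley's package):

```r
set.seed(7)

# Fabricated play coordinates, biased toward the centre of the board
n_plays <- 2000
rows <- pmin(pmax(round(rnorm(n_plays, 8, 3)), 1), 15)
cols <- pmin(pmax(round(rnorm(n_plays, 8, 3)), 1), 15)

# Tally usage of each of the 15x15 squares
board <- matrix(0, 15, 15)
for (i in seq_len(n_plays)) {
  board[rows[i], cols[i]] <- board[rows[i], cols[i]] + 1
}

# Darker = used more often
image(1:15, 1:15, board, col = grey(seq(1, 0, length.out = 64)),
      xlab = "", ylab = "", main = "Square usage (darker = more)")
```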

Further analysis of the actual plays during the simulated games reveals some interesting Scrabble statistics:

**It's best to play first**. Player 1 won 54.6% of games, while Player 2 won 44.9%, a statistically significant difference. (The remaining 0.5% of the games were ties.)
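You can verify that 54.6% vs 44.9% over 2566 games is far outside chance with a quick binomial test, ties excluded. (The counts below are reconstructed from the reported percentages, so they may be off by a game or two.)

```r
games   <- 2566
p1_wins <- round(0.546 * games)   # reconstructed Player 1 win count
p2_wins <- round(0.449 * games)   # reconstructed Player 2 win count
decisive <- p1_wins + p2_wins     # games that weren't ties

# Under the null hypothesis, Player 1 wins half of the decisive games
bt <- binom.test(p1_wins, decisive, p = 0.5)
```

The resulting p-value is vanishingly small, so the first-move advantage is real.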

**Some uncommon words are frequently used**. The 10 most frequently-played words were QI, QAT, QIN, XI, OX, EUOI, XU, ZO, ZA, and EX; not necessarily words you'd use in casual conversation, but all words very familiar to competitive Scrabble players. As we'll see later though, not all are high-scoring plays. Sometimes it's a good idea to get rid of a high-scoring but restrictive letter (QI, the life energy in Chinese philosophy), or simply to shake up a vowel-heavy rack for more opportunities next turn (EUOI, a cry of impassioned rapture in ancient Bacchic revels).

**For high-scoring plays, go for the bingo**. A Scrabble bingo, where you lay down all 7 tiles in your rack in one play, comes with a 50-point bonus. The top three highest-scoring plays in the simulation were all bingo plays: REPIQUED (239 points), CZARISTS (230 points), and IDOLIZED. (Remember, though, that this is from just a couple of thousand simulated games; there are many, many more potentially high-scoring words.)

**High-scoring non-bingo plays can be surprisingly long words**. It's super-satisfying to lay down a short word like EX with the X making a second word on a triple-letter tile (for a 9x bonus), so I was surprised to see that the top 10 highest-scoring non-bingo plays were still fairly long words: XENURINE (144 points), CYANOSES (126 points) and SNAPWEED (126 points), all using at least 2 tiles already on the board. The shortest word in the top 10 of this list was ZITS.

**Some tiles just don't work well together**. The pain of getting a Q without a U seems obvious, but it turns out getting **two** U's is way worse in point-scoring potential. From the simulation, you can estimate the point-scoring potential of any pair of tiles in your rack: lighter is better, and darker is worse.

Managing the scoring potential of the tiles in your rack is a big part of Scrabble strategy, as we saw in another Scrabble analysis using R a few years ago. The lowly zero-point blank is actually worth a lot of potential points, while the highest-scoring tile Q is actually a liability. Here are the findings from that analysis:

- The blank is worth about 30 points to a good player, mainly by making 50-point "bingo" plays possible.
- Each S is worth about 10 points to the player who draws it.
- The Q is a burden to whichever player receives it, effectively serving as a 5-point penalty: it reduces bingo opportunities, because you need either a U or a blank for a chance at a bingo and its 50-point bonus.
- The J is essentially neutral pointwise.
- The X and the Z are each worth about 3-5 extra points to the player who receives them. Their difficulty in playing in bingoes is mitigated by their usefulness in other short words.

For more of James Curley's recent Scrabble analysis, including the R code using his scrabblr package, follow the link below.

RPubs: Analyzing Scrabble Games

At the 2016 EARL London conference, senior data-visualisation journalist John Burn-Murdoch described how the Financial Times uses R to produce high-quality, striking data visualisations. Until recently, charts were the realm of an information designer using tools like Adobe Illustrator: the output was beautiful, but the process was a long and winding one. The FT needed to be able to "audition" several different visual treatments quickly, to create stunning visuals before deadline. That's where R and the ggplot2 package come in.

John presented a case study (you can see the slides with animations here) on creating this FT article, Explore the changing tides of European footballing power. The final work included 128 charts in total, telling the story of dozens of soccer teams in four countries. John also shared these animated versions (along with the ggplot2 code to produce them):

You can find the data and R code behind the animations here.

John B Murdoch: ggplot2 as a creative engine ... and other ways R is transforming the FT's quantitative journalism

For about three years now, the NBA has gathered telemetry for professional basketball games in the US using the SportVU system. Six cameras track the on-court position of the players and the ball, with a resolution of 25 samples per second.

Combine this movement data with NBA play-by-play data (players, plays, fouls, and points scored — data sadly no longer made available by the NBA), and you have a rich data set for analysis. Naturally, you can read these data files into R, and Rajiv Shah provides several R scripts to facilitate the process. These include functions to import the motion and play-by-play data files and merge them into a data frame, explore the movement data, extract and analyze player trajectories, and calculate metrics on player motion (such as speed and 'jerk').
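With positions sampled at 25 Hz, instantaneous speed is just the frame-to-frame displacement times the sampling rate, and 'jerk' (the third derivative of position) falls out of differencing twice more. A base-R sketch on fabricated coordinates:

```r
hz <- 25                       # SportVU sampling rate (samples per second)

# Fabricated track: a player accelerating along the x-axis (court feet)
t <- seq(0, 2, by = 1 / hz)
x <- 2 * t^2                   # constant acceleration of 4 ft/s^2
y <- rep(25, length(t))        # constant y position

step  <- sqrt(diff(x)^2 + diff(y)^2)   # distance covered per frame
speed <- step * hz                     # feet per second
accel <- diff(speed) * hz              # first difference of speed
jerk  <- diff(accel) * hz              # third derivative of position
```

For this constant-acceleration track, the recovered acceleration is a flat 4 ft/s^2 and the jerk is zero (up to floating-point noise).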

James Curley used the same data and extended those scripts to animate NBA plays, such as this basket scored during a December 2015 game between the San Antonio Spurs and the Minnesota Timberwolves. The orange polygon is a measure of player spacing on the court. (Pop a taut rubber band around the players and let go: that's a convex hull.) It would be interesting to extract the area of this convex hull over time as a series, and see if the value relates to scoring opportunities, but that's a task for another time.
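Extracting the hull area per frame is easy in base R: `chull()` returns the hull vertices in order, and the shoelace formula gives the polygon's area. A sketch with made-up player positions:

```r
# Shoelace formula for the area of a polygon with ordered vertices
poly_area <- function(x, y) {
  n <- length(x)
  j <- c(2:n, 1)
  abs(sum(x * y[j] - x[j] * y)) / 2
}

# Five fabricated player positions (court feet): a unit square plus
# one interior player, who should not affect the hull
px <- c(0, 1, 1, 0, 0.5)
py <- c(0, 0, 1, 1, 0.5)

hull    <- chull(px, py)                  # indices of hull vertices, in order
spacing <- poly_area(px[hull], py[hull])  # hull area for this frame
```

Applied frame by frame, `spacing` becomes the time series of player spacing suggested above.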

James used simple ggplot2 functions to plot the positions of the players and the ball on top of a geom extension to draw the court. Each frame was animated from records in the SportVU data, and then assembled into an animated GIF using the gg_animate function. (Many thanks to James for providing the GIF itself.) You can see further details, including the complete R code, at the blog post linked below.

Curley Lab: Animating NBA Play by Play using R

March Madness is upon us here in the US. This annual college basketball competition pits 64 teams against one another in a single-elimination tournament, and the team that goes undefeated for all 6 rounds will be named NCAA Champion.

Predicting the winners of the competition, and in particular completing a "bracket" of the teams you predict to make it to the final 32 or 16 and eventually win, is a popular pastime (and foundation for many wagers). Some use their knowledge of the teams or the betting markets to select their bracket. And some, like 47-year-old English data scientist and top-ranked Kaggler Amanda Schierz, use data, models, and R. Watch her story in this (sadly, unembeddable) video from ESPN and FiveThirtyEight.

If you'd like to try your hand at your own predictions based on machine learning, Azure ML (part of the Cortana Analytics suite) provides all the data, algorithms, and R and Python support you need. Here at Microsoft we've run internal March Madness competitions every year, and in the video below last year's winner Damon Hachmeister shares his secrets.

There's even a March Madness Prediction Service published on the Cortana Analytics Gallery which, given the current state of the competition as a Web Service input, will provide predictions for the remaining games as an output.

Got more tips for predicting a bracket with Machine Learning? Share them in the comments below.

by Andrie de Vries

A week ago my high school friend, @XLRunner, sent me a link to the article "How Zach Bitter Ran 100 Miles in Less Than 12 Hours". Zach's effort was rewarded with the American record for the 100 mile event.

This reminded me of some analysis I did, many years ago, of the world record speeds for various running distances. The International Amateur Athletics Federation (IAAF) keeps track of world records for distances from 100m up to the marathon (42km). Distances longer than 42km do not fall under the IAAF event list, but they are tracked by various other organisations.

You can find a list of IAAF world records at Wikipedia, and a list of ultramarathon world best times at Wikipedia.

I extracted only the men's running events from these lists, and used R to plot the average running speeds for these records:

You can immediately see that the speed declines very rapidly from the sprint events. Perhaps it would be better to plot this using a logarithmic x-scale, adding some labels at the same time. I also added some colour for what I call standard events - where "standard" is the type of distance you would see regularly at a World Championships or Olympic Games. Thus the mile is "standard", but the 2,000m race is not.
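Setting up such a plot in base R looks something like the following. The record times below are approximate values I've filled in for illustration; check the current IAAF lists for exact figures:

```r
# Approximate men's world records (distance in km, time in seconds)
records <- data.frame(
  event = c("200m", "400m", "800m", "1500m", "5000m", "10000m", "marathon"),
  km    = c(0.2, 0.4, 0.8, 1.5, 5, 10, 42.195),
  sec   = c(19.19, 43.18, 100.91, 206.00, 757.35, 1577.53, 7377)
)
records$speed <- records$km * 1000 / records$sec   # metres per second

# A log x-scale spreads out the short events and reveals the near-linear decline
plot(speed ~ km, data = records, log = "x",
     xlab = "Distance (km, log scale)", ylab = "Average speed (m/s)")
text(records$km, records$speed, records$event, pos = 4, cex = 0.7)
```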

Now our data points are in somewhat more of a straight line, meaning we could consider fitting a linear regression.

However, it seems that there might be two kinks in the line:

- The first kink occurs somewhere between the 800m distance and the mile. It seems that the sprinting distances (the 800m is sometimes called a long sprint) have different dynamics from the events up to the marathon.
- And then there is another kink for the ultra-marathon distances. The standard marathon is 42.2km, and distances longer than this are called ultramarathons.

Also, note that the average speed for the 100m is actually slower than for the 200m. This reflects the effect of the standing start - clearly it plays a large role over the very short sprint distances.

For the analysis below, I excluded the data for:

- The 100m sprint (transition effects play too large a role)
- The ultramarathon distances (they are raced less frequently, and something strange seems to be happening in the data for the 50km race in particular).

To fit a regression line with kinks, more properly known as a segmented regression (or sometimes called piecewise regression), you can use the segmented package, available on CRAN.

The `segmented()` function allows you to modify a fitted object of class `lm` or `glm`, specifying which of the independent variables should have segments (kinks). In my case, I fitted a linear model with a single variable (log of distance), and allowed `segmented()` to find a single kink point.

My analysis indicates that there is a kink point at 1.13km (10^0.055 = 1.13), i.e. between the 800m event and the 1,000m event.
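The essence of what `segmented()` estimates - a breakpoint plus a change in slope - can be mimicked in base R by profiling the residual sum of squares over candidate breakpoints. This toy version on synthetic data is only to show the mechanics (no standard errors; use the segmented package for real work):

```r
set.seed(1)

# Synthetic data: slope changes from -15 to -4 at logDistance = 0.055,
# echoing the fitted model reported below
logDistance <- seq(-0.7, 1.7, length.out = 18)
true_bp <- 0.055
speed <- 27 - 15 * logDistance +
         11 * pmax(logDistance - true_bp, 0) + rnorm(18, sd = 0.2)

# Profile the residual sum of squares over a grid of candidate breakpoints
candidates <- seq(-0.4, 1.4, by = 0.005)
rss <- sapply(candidates, function(bp) {
  deviance(lm(speed ~ logDistance + pmax(logDistance - bp, 0)))
})
est_bp <- candidates[which.min(rss)]   # breakpoint minimising the RSS
```

The grid search recovers a breakpoint close to the true kink at 0.055; `segmented()` does the same job iteratively and also supplies a standard error for the estimate.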

```
> summary(sfit)

	***Regression Model with Segmented Relationship(s)***

Call: 
segmented.lm(obj = lfit, seg.Z = ~logDistance)

Estimated Break-Point(s):
   Est. St.Err 
  0.055  0.021

Meaningful coefficients of the linear terms:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)     27.2064     0.1755  155.04  < 2e-16 ***
logDistance    -15.1305     0.4332  -34.93 1.94e-13 ***
U1.logDistance  11.2046     0.4536   24.70       NA    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2373 on 12 degrees of freedom
Multiple R-Squared: 0.9981,  Adjusted R-squared: 0.9976

Convergence attained in 4 iterations with relative change -4.922372e-16
```

The final plot shows the same data, but this time with the segmented regression line also displayed.

I conclude:

- It is really easy to fit a segmented linear regression model using the segmented package
- There seems to be a different physiological process for the sprint events and the middle distance events. The segmented regression finds this kink point between the 800m event and the 1,000m event
- The ultramarathon distances have a completely different dynamic. However, it's not clear to me whether this is due to inherent physiological constraints, or vastly reduced competition in these "non-standard" events.
- The 50km world record seems too "slow". Perhaps the competition for this event is less intense than for the marathon?

Here is my code for the analysis:

EMC recently ran a competition to find out why John McGuinness, the legendary motorcycle racer known as the "Morecambe Missile", outperforms the average motorcycle racer. To answer this question, EMC instrumented his bike and his suit with a number of real-time sensors. (Data collected included gear and RPM for the bike, and heart rate and acceleration for the rider.) They did the same for Adam Child, a motorcycle racing journalist who acted as a "control" in this study. The two motorcyclists then completed 10 laps of Circuit Monteblanco in Spain, and the sensor data was provided for analysis in a CrowdAnalytix competition. (Sadly, the data themselves are proprietary and not available for others to analyze.) Prizes were awarded for the best model and for the best data visualization, and both winners used R.

In the "best model" competition, winner Stefan Jol (a revenue management analyst at a leading UK radio group) used a random forest to determine that bike position and rider position were the primary determinants of speed (and race performance). His analysis appeared to divide the track into segments, shown in the map below.

In the visualization competition, winner Charlotte Wickham (Assistant Professor in the Department of Statistics at Oregon State University) also divided the track into segments, with an interactive data visualization to compare the pro racer and the journalist in each segment.

As this demonstration video shows, Charlotte's app allows you to select one or more segments from the racetrack, and compare summary statistics and even the racing line used by each rider. EMC noted Charlotte's use of R in this quote:

“Professor Wickham built a very unique and insightful application (using the R statistical programming language) which allowed users to look at sections of the course and compare John and Chad across a number of different variables. R is a powerful language that we use at EMC for predictive analytics, and her use of R demonstrates the versatility of open source analytical programs.”

You can learn more about the winning submissions at the link below, by scrolling down to the "The Analysis" section.

By Mark Malter

A few weeks ago, I wrote about my Baseball Stats R shiny application, where I demonstrated how to calculate runs expectancies based on the 24 possible bases/outs states for any plate appearance. In this article, I’ll explain how I expanded on that to calculate the probability of winning the game, based on the current score/inning/bases/outs state. While this is done on other websites, I have added some unique play attempt features -- steal attempt, sacrifice bunt attempt, and tag from third attempt -- to show the probability of winning with and without the attempt, as well as the expected win probability given a user determined success rate for the play. That way, a manager can not only know the expected runs based on a particular decision, but the actual probability of winning the game if the play is attempted.

After the user enters the score, inning, bases state, and outs, the code runs through a large number of simulated games using the expected runs to be scored over the remainder of the current half inning, as well as each succeeding half inning for the remainder of the game.
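A stripped-down version of that simulation idea - ignoring bases and outs and just drawing runs per remaining half-inning - looks like the sketch below. The Poisson run rate is an assumption chosen for illustration, not the app's actual run-expectancy model:

```r
set.seed(123)

# P(batting team wins) entering the bottom of the 9th, trailing by `deficit`,
# assuming runs per half-inning ~ Poisson(run_rate) (illustrative only)
sim_win_prob <- function(deficit, run_rate = 0.5, n_sims = 10000) {
  wins <- 0
  for (i in seq_len(n_sims)) {
    # Bottom of the 9th: the trailing team bats
    margin <- rpois(1, run_rate) - deficit
    # Tied after nine: play extra innings (home runs minus visitor runs)
    while (margin == 0) {
      margin <- rpois(1, run_rate) - rpois(1, run_rate)
    }
    if (margin > 0) wins <- wins + 1
  }
  wins / n_sims
}

wp <- sim_win_prob(deficit = 1)   # down one run entering the bottom of the 9th
```

The full app refines each half-inning's run distribution using the 24 bases/outs run-expectancy states rather than a flat Poisson rate.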

When there are runners on base and the user clicks on any of the ‘play attempt’ tabs, a new table is generated showing the new probabilities. I allow for sacrifice bunts with less than two outs and a runner on first, second, or first and second. The stolen base tab can be used with any number of outs and the possibility of stealing second, third, or both. The tag from third tab will work as long as there is a runner on third and less than two outs prior to the catch.

I first got this idea after watching game seven of the 2014 World Series. Trailing 3-2 with two outs and nobody on base, Alex Gordon singled to center off of Madison Bumgarner and made it all the way to third base after a two-base error. As Gordon was approaching third base, the coach gave him the stop sign, as shortstop Brandon Crawford was taking the relay throw in short left field. Had Gordon attempted to score on the play, he probably would have been out at the plate: game and series over. However, by holding up at third base, the Royals were also still 'probably' going to lose, since runners score from third base with two outs only about 26% of the time.

The calculator shows that the probability of winning a game down by one run with a runner on third and two outs in the bottom of the ninth (or an extra inning) is roughly 17%. If we click on the ‘Tag from Third Attempt’ tab (Gordon attempting to score would have been equivalent to tagging from third after a catch for the second out), and play with the ‘Base running success rate’ slider, we see that the break-even success rate is roughly 0.3. I don’t know the probability of Gordon beating the throw home, but if it was greater than 0.3 then making the attempt would have improved the Royals’ chances of winning the World Series. In fact, if the true success rate was as high as 0.5, the Royals’ win probability would have jumped 11 percentage points to 28%.
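The break-even arithmetic behind that slider is a one-liner. The win probabilities below are my rough assumptions for illustration: about 17% if the runner holds at third (per the calculator), 0% if he is thrown out (third out, game over), and something like 57% if he scores to tie the game with the home team still batting:

```r
wp_hold  <- 0.17   # win prob. if the runner holds at third (assumed)
wp_out   <- 0.00   # thrown out at the plate: third out, game over
wp_score <- 0.57   # win prob. if the run scores and ties the game (assumed)

# Expected win probability if the runner goes with success rate p
wp_attempt <- function(p) p * wp_score + (1 - p) * wp_out

# Break-even: the success rate at which attempting equals holding
p_breakeven <- (wp_hold - wp_out) / (wp_score - wp_out)
```

Under these assumptions the break-even rate comes out near 0.3, and a 0.5 success rate gives an expected win probability near 28%, matching the figures above.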

Here is the UI code. Here is the server code. And here is the Shiny App.
