March Madness is upon us here in the US. This annual college basketball competition pits 64 teams against one another in a single-elimination tournament, and the team that goes undefeated for all 6 rounds will be named NCAA Champion.

Predicting the winners of the competition, and in particular completing a "bracket" of the teams you predict to make it to the final 32 or 16 and eventually win, is a popular pastime (and foundation for many wagers). Some use their knowledge of the teams or the betting markets to select their bracket. And some, like 47-year-old English data scientist and top-ranked Kaggler Amanda Schierz, use data, models, and R. Watch her story in this (sadly, unembeddable) video from ESPN and FiveThirtyEight.

If you'd like to try your hand at your own predictions based on machine learning, Azure ML (part of the Cortana Analytics suite) provides all the data, algorithms, and R and Python support you need. Here at Microsoft we've run internal March Madness competitions every year, and in the video below last year's winner Damon Hachmeister shares his secrets.

There's even a March Madness Prediction Service published on the Cortana Analytics Gallery which, given the current state of the competition as a Web Service input, will provide predictions for the remaining games as an output.

Got more tips for predicting a bracket with Machine Learning? Share them in the comments below.

by Andrie de Vries

A week ago my high school friend, @XLRunner, sent me a link to the article "How Zach Bitter Ran 100 Miles in Less Than 12 Hours". Zach's effort was rewarded with the American record for the 100 mile event.

This reminded me of some analysis I did, many years ago, of the world record speeds for various running distances. The International Association of Athletics Federations (IAAF) keeps track of world records for distances from 100m up to the marathon (42km). Distances longer than 42km do not fall in the IAAF event list, but they are tracked by various other organisations.

You can find a list of IAAF world records at Wikipedia, and a list of ultramarathon world best times at Wikipedia.

I extracted only the men's running events from these lists, and used R to plot the average running speeds for these records:

You can immediately see that the speed declines very rapidly from the sprint events. Perhaps it would be better to plot this using a logarithmic x-scale, adding some labels at the same time. I also added some colour for what I call standard events, where "standard" is the type of distance you would see regularly at a World Championships or Olympic Games. Thus the mile is "standard", but the 2,000m race is not.
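As a rough sketch of the calculation behind these plots: average speed is distance divided by time, and the log-scale view uses the base-10 log of the distance in km. The helper names (`avg_speed_kmh`, `plot_speeds`) and the three sample rows are placeholders made up for illustration; they are not the record data used in the post.

```r
# Average speed in km/h from a distance in metres and a time in seconds.
avg_speed_kmh <- function(distance_m, time_s) {
  (distance_m / 1000) / (time_s / 3600)
}

# Placeholder rows (NOT real world records), just to show the shape of the data.
records <- data.frame(
  event      = c("short", "middle", "long"),
  distance_m = c(200, 1500, 10000),
  time_s     = c(20, 210, 1600)
)
records$speed_kmh   <- avg_speed_kmh(records$distance_m, records$time_s)
records$logDistance <- log10(records$distance_m / 1000)  # log10 of distance in km

# Hypothetical helper: speed against distance on a logarithmic x-axis, with labels.
plot_speeds <- function(df) {
  plot(df$distance_m / 1000, df$speed_kmh, log = "x",
       xlab = "Distance (km, log scale)", ylab = "Average speed (km/h)")
  text(df$distance_m / 1000, df$speed_kmh, labels = df$event, pos = 3)
}
```

Calling `plot_speeds(records)` draws the labelled log-scale chart; with real record data in the same three columns the rest of the code should work unchanged.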

Now our data points are in somewhat more of a straight line, meaning we could consider fitting a linear regression.

However, it seems that there might be two kinks in the line:

- The first kink occurs somewhere between the 800m distance and the mile. It seems that the sprinting distances (and the 800m is sometimes called a long sprint) have different dynamics from the events up to the marathon.
- And then there is another kink for the ultra-marathon distances. The standard marathon is 42.2km, and distances longer than this are called ultramarathons.

Also, note that the speed for the 100m is actually slower than for the 200m. This indicates the transition effect of getting started from a standing start - clearly this plays a large role in the very short sprint distance.

For the analysis below, I excluded the data for:

- The 100m sprint (transition effects play too large a role)
- The ultramarathon distances (they get raced less frequently, and something strange seems to be happening in the data for the 50km race in particular).

To fit a regression line with kinks, more properly known as a segmented regression (or sometimes called piecewise regression), you can use the segmented package, available on CRAN.

The `segmented()` function allows you to modify a fitted object of class `lm` or `glm`, specifying which of the independent variables should have segments (kinks). In my case, I fitted a linear model with a single variable (log of distance), and allowed `segmented()` to find a single kink point.
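To make the idea concrete, here is a minimal sketch on synthetic data with a known kink. It is not the code from the original analysis. The data are simulated, and the grid search over candidate break points is an illustrative stand-in written in base R, not the iterative algorithm that `segmented()` uses. The call at the end mirrors the one shown in the model summary further down and only runs if the segmented package is installed.

```r
# Synthetic data with a known kink at logDistance = 0.05 (NOT world-record data).
set.seed(1)
true_psi <- 0.05
x <- seq(-0.7, 1.6, length.out = 40)
y <- 27 - 15 * x + 11 * pmax(x - true_psi, 0) + rnorm(length(x), sd = 0.1)
dat <- data.frame(logDistance = x, speed = y)

# For a fixed break point psi, a segmented line is an ordinary linear model
# with an extra "hinge" term that is zero to the left of psi.
fit_at <- function(psi, data) {
  lm(speed ~ logDistance + pmax(logDistance - psi, 0), data = data)
}

# Crude grid search: keep the break point with the smallest residual sum of squares.
candidates <- seq(-0.5, 1.4, by = 0.01)
rss <- vapply(candidates, function(p) deviance(fit_at(p, dat)), numeric(1))
psi_hat <- candidates[which.min(rss)]
best_fit <- fit_at(psi_hat, dat)

psi_hat          # estimated break point (on the log10 scale)
coef(best_fit)   # intercept, left-hand slope, and change in slope after the kink

# The same model via the segmented package, if it is available.
if (requireNamespace("segmented", quietly = TRUE)) {
  lfit <- lm(speed ~ logDistance, data = dat)
  sfit <- segmented::segmented(lfit, seg.Z = ~logDistance)
  print(summary(sfit))
}
```

The third coefficient of `best_fit` plays the same role as the `U1.logDistance` term in the `segmented()` summary: it is the change in slope once the break point is passed.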

My analysis indicates that there is a kink point at 1.13km (10^0.055 = 1.13), i.e. just beyond the 1,000m event and well short of the mile.

`> summary(sfit)`

`***Regression Model with Segmented Relationship(s)***`

`Call: `

`segmented.lm(obj = lfit, seg.Z = ~logDistance)`

`Estimated Break-Point(s):`

` Est. St.Err `

` 0.055 0.021`

`Meaningful coefficients of the linear terms:`

` Estimate Std. Error t value Pr(>|t|) `

`(Intercept) 27.2064 0.1755 155.04 < 2e-16 ***`

`logDistance -15.1305 0.4332 -34.93 1.94e-13 ***`

`U1.logDistance 11.2046 0.4536 24.70 NA `

`---`

`Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1`

`Residual standard error: 0.2373 on 12 degrees of freedom`

`Multiple R-Squared: 0.9981, Adjusted R-squared: 0.9976`

`Convergence attained in 4 iterations with relative change -4.922372e-16 `

The final plot shows the same data, but this time with the segmented regression line also displayed.

I conclude:

- It is really easy to fit a segmented linear regression model using the segmented package
- There seems to be a different physiological process for the sprint events and the middle distance events. The segmented regression finds this kink point at about 1.13km, just beyond the 1,000m event
- The ultramarathon distances have a completely different dynamic. However, it's not clear to me whether this is due to inherent physiological constraints, or vastly reduced competition in these "non-standard" events.
- The 50km world record seems too "slow". Perhaps the competition for this event is less intense than for the marathon?

Here is my code for the analysis:

EMC recently ran a competition to find out why John McGuinness, the legendary motorcycle racer known as the "Morecambe Missile", outperforms the average motorcycle racer. To answer this question, EMC instrumented his bike and his suit with a number of real-time sensors. (Data collected included gear and RPM for the bike, and heart rate and acceleration for the rider.) They did the same for Adam Child, a motorcycle racing journalist who acted as a "control" in this study. The two motorcyclists then completed 10 laps of Circuit Monteblanco in Spain, and the sensor data was provided for analysis in a CrowdAnalytix competition. (Sadly, the data themselves are proprietary and not available for others to analyze.) Prizes were awarded for the best model and for the best data visualization, and both winners used R.

In the "best model" competition, winner Stefan Jol (a revenue management Analyst at a leading UK radio group) used a random forest to determine that bike position and rider position were the primary determinants of speed (and race performance). His analysis appeared to divide the track into segments, shown in the map below.

In the visualization competition, winner Charlotte Wickham (Assistant Professor in the Department of Statistics at Oregon State University) also divided the track into segments, with an interactive data visualization to compare the pro racer and journalist in each segment.

As this demonstration video shows, Charlotte's app allows you to select one or more segments from the racetrack, and compare summary statistics and even the racing line used by each rider. EMC noted Charlotte's use of R in this quote:

“Professor Wickham built a very unique and insightful application (using the R statistical programming language) which allowed users to look at sections of the course and compare John and Chad across a number of different variables. R is a powerful language that we use at EMC for predictive analytics, and her use of R demonstrates the versatility of open source analytical programs.”

You can learn more about the winning submissions at the link below, by scrolling down to the "The Analysis" section.

By Mark Malter

A few weeks ago, I wrote about my Baseball Stats R shiny application, where I demonstrated how to calculate runs expectancies based on the 24 possible bases/outs states for any plate appearance. In this article, I’ll explain how I expanded on that to calculate the probability of winning the game, based on the current score/inning/bases/outs state. While this is done on other websites, I have added some unique play attempt features -- steal attempt, sacrifice bunt attempt, and tag from third attempt -- to show the probability of winning with and without the attempt, as well as the expected win probability given a user determined success rate for the play. That way, a manager can not only know the expected runs based on a particular decision, but the actual probability of winning the game if the play is attempted.

After the user enters the score, inning, bases state, and outs, the code runs through a large number of simulated games using the expected runs to be scored over the remainder of the current half inning, as well as each succeeding half inning for the remainder of the game.
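The simulation idea can be sketched in a few lines of R. This is a simplified illustration, not the app's server code. It starts at the beginning of a half inning and ignores the bases/outs state that the app accounts for, and the runs-per-half-inning probabilities are made-up numbers, not estimates from play-by-play data. The function names are hypothetical.

```r
# Hypothetical distribution of runs scored in one half inning
# (made-up probabilities that sum to 1).
runs_values <- 0:5
runs_probs  <- c(0.73, 0.15, 0.07, 0.03, 0.015, 0.005)

draw_runs <- function() sample(runs_values, 1, prob = runs_probs)

# Simulate one game to completion from the start of a half inning.
# home_lead: home score minus visitor score; top_half: TRUE if the visitors bat next.
simulate_game <- function(home_lead, inning, top_half) {
  repeat {
    if (top_half) {
      home_lead <- home_lead - draw_runs()
      top_half <- FALSE
    } else {
      # From the 9th inning on, the home team does not bat if it already leads.
      if (!(inning >= 9 && home_lead > 0)) home_lead <- home_lead + draw_runs()
      # The game ends after the bottom of the 9th (or later) unless it is tied.
      if (inning >= 9 && home_lead != 0) return(home_lead > 0)
      inning <- inning + 1
      top_half <- TRUE
    }
  }
}

# Share of simulated games won by the home team.
home_win_prob <- function(home_lead, inning, top_half, n_sims = 5000) {
  mean(replicate(n_sims, simulate_game(home_lead, inning, top_half)))
}

set.seed(2014)
home_win_prob(home_lead = 1, inning = 9, top_half = TRUE)
```

A fuller version would draw the current half inning from a distribution that depends on the bases/outs state, which is where the run expectancy work from the earlier post comes in.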

When there are runners on base and the user clicks on any of the ‘play attempt’ tabs, a new table is generated showing the new probabilities. I allow for sacrifice bunts with fewer than two outs and a runner on first, second, or first and second. The stolen base tab can be used with any number of outs and the possibility of stealing second, third, or both. The tag from third tab will work as long as there is a runner on third and fewer than two outs prior to the catch.

I first got this idea after watching game seven of the 2014 World Series. Trailing 3-2 with two outs and nobody on base, Alex Gordon singled to center off Madison Bumgarner and made it all the way to third base after a two-base error. As Gordon was approaching third base, the coach gave him the stop sign, as shortstop Brandon Crawford was taking the relay throw in short left field. Had Gordon attempted to score on the play, he probably would have been out at the plate: game and series over. However, by holding up at third base, the Royals were also still ‘probably’ going to lose, since runners score from third base with two outs only about 26% of the time.

The calculator shows that the probability of winning a game down by one run with a runner on third and two outs in the bottom of the ninth (or an extra inning) is roughly 17%. If we click on the ‘Tag from Third Attempt’ tab (Gordon attempting to score would have been equivalent to tagging from third after a catch for the second out), and play with the ‘Base running success rate’ slider, we see that the break even success rate is roughly 0.3. I don’t know the probability of Gordon beating the throw home, but if it was greater than 0.3 then making the attempt would have improved the Royals’ chances of winning the World Series. In fact, if the true success rate was as much as 0.5 then the Royals’ win probability would have jumped by 11 percentage points to 28%.
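The break-even arithmetic can be checked with a small sketch. The 17% and the zero for being thrown out come straight from the scenario above. The win probability after scoring (0.56) is an assumption backed out of the quoted 28% at a 0.5 success rate, and the function names are hypothetical.

```r
# Win probability of attempting a play with a given success rate.
attempt_wp <- function(success_rate, wp_success, wp_fail) {
  success_rate * wp_success + (1 - success_rate) * wp_fail
}

# Success rate at which attempting the play is exactly as good as holding.
break_even_rate <- function(wp_hold, wp_success, wp_fail) {
  (wp_hold - wp_fail) / (wp_success - wp_fail)
}

wp_hold    <- 0.17  # from the text: runner on third, two outs, down by one
wp_fail    <- 0     # thrown out at the plate: the game is over
wp_success <- 0.56  # assumption: implied by 28% at a 0.5 success rate

break_even_rate(wp_hold, wp_success, wp_fail)  # about 0.30
attempt_wp(0.5, wp_success, wp_fail)           # 0.28
```

The same two functions work for any play attempt once the win probabilities of the hold, success, and failure states are known.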

Here is the UI code. Here is the server code. And here is the Shiny App.


by Mark Malter

After reading the book, Analyzing Baseball with R, by Max Marchi and Jim Albert, I decided to expand on some of their ideas relating to runs created and put them into an R Shiny app.

The Server and UI code are linked at the bottom of the Introduction tab.

I downloaded the Retrosheet play-by-play data for every game played in the 2011-2014 seasons in every park and aggregated every plate appearance by one of the 24 bases/outs states (ranging from nobody on/nobody out to bases loaded/two outs). With the Retrosheet data, I wrote code to track the batter, bases, outs, runs scored over the remainder of the inning, current game score, and inning. I also used the R Lahman package and databases for individual player information. Below is a brief explanation of the function of each tab on the app.
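As a rough illustration of the aggregation step, the sketch below builds the two kinds of summaries from a toy table. The rows and the column names (`bases`, `outs`, `runs_roi`) are made up for the example and are not Retrosheet fields or the app's actual code.

```r
# Toy plate-appearance records (made-up values, NOT Retrosheet data).
# bases: occupied bases; outs: outs at the start of the plate appearance;
# runs_roi: runs scored over the remainder of the inning.
pa <- data.frame(
  bases    = c("___", "___", "1__", "1__", "1__", "__3", "__3", "___"),
  outs     = c(0, 0, 0, 0, 1, 2, 2, 1),
  runs_roi = c(0, 1, 0, 2, 1, 0, 1, 0)
)

# Expected runs for each observed bases/outs state.
expected_runs <- tapply(pa$runs_roi, list(pa$bases, pa$outs), mean)

# Probability of scoring at least one run from each state.
prob_score <- tapply(pa$runs_roi > 0, list(pa$bases, pa$outs), mean)

expected_runs
prob_score
```

States that never occur in the toy data show up as `NA`; with several seasons of play-by-play data all 24 cells would be filled.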

**Potential Runs by bases/outs state**: Matrix of all 24 possible bases/outs states, both with expected runs over the remainder of an inning, and the probability of scoring at least one run over the remainder of the inning (for late innings of close games). I used this table to analyze several types of plays, as shown below. Notice that, assuming average hitters, the analysis below shows why sacrifice bunts are always a bad idea. The Runs Created stat for a plate appearance is defined as:

end state – start state + runs scored on play.
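A small sketch of this definition, reading the last term as the runs that actually score on the play (an assumption about the intended meaning). The four run expectancies are the ones quoted in the sacrifice bunt discussion of this post, while the function and vector names are hypothetical.

```r
# Run value of a play: change in run expectancy plus runs that scored on it.
run_value <- function(re_start, re_end, runs_on_play = 0) {
  re_end - re_start + runs_on_play
}

# Run expectancies quoted in the sacrifice bunt discussion of this post.
re <- c(first_0out = 0.85, second_1out = 0.66,
        second_0out = 1.10, third_1out = 0.94)

# A "successful" sacrifice bunt costs runs in both situations.
run_value(re[["first_0out"]],  re[["second_1out"]])  # about -0.19
run_value(re[["second_0out"]], re[["third_1out"]])   # about -0.16
```

Summing `run_value()` over all of a hitter's plate appearances gives his runs created for the season.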

I first became serious about this after watching the last inning of the 2014 World Series. Down 3-2 with two outs and nobody on base, Alex Gordon singled to center and advanced to third on a two-base error. As Gordon was heading into third base, Giants shortstop Brandon Crawford was taking the relay throw in short left field. Had Gordon been sent home, Crawford would likely have thrown him out at the plate. However, the runs matrix shows only a 26% chance of scoring a run with a man on third and two outs, and with Madison Bumgarner on the mound, it was even less likely that on-deck hitter Salvador Perez would be able to drive in Gordon. So even though sending Gordon would likely have ended the game (and the series), it still may have been the optimal play. This would be similar to hitting 16 vs. a dealer’s ten in Blackjack: you’ll probably lose, but you’re making an optimal play. For equivalency, see the Tag from Third analysis below, as this play would have been equivalent to tagging from third after a catch for the second out.

**Runs Created All Regular MLB Players**: I filtered out all players with fewer than 400 plate appearances and created an interactive rchart showing each player’s runs potential by runs created. I placed the following filters in the UI: year, innings (1-3, 4-6, 7-extras), run differential at time of at bat (0-1, 2-3, 4+), position, team, bats, age range, and weight. Hovering over a point shows the player and his salary. For example, Mike Trout created 58 runs out of a potential of 332 in 2014. Filtering 2013 for second basemen under the age of 30 and weighing less than 200 pounds, we see Jason Kipnis created 27 runs out of a potential of 300.

**Player Runs Table**: Same as above, but this shows each player (> 400 plate appearances for the selected season), broken down by each of the eight bases states. For example, in 2014 Jose Abreu created 43.5 runs on a potential of 291, and was most efficient with a runner on second base, where he created 10.3 runs on a potential of only 36.

The following tabs show runs expectancies of various offensive plays from the start state to the expected end state, based on the expected Baserunning Success rate in the UI. For each play, I created a graphical as well as a table tab. For the graphical tabs, there is a UI to switch between views of expected runs and scoring probability.

**Stolen bases Graphic/Table**: For each of fifteen different base stealing situations, I show the start state, end state (based on the UI selected success rate), and the breakeven success rate for the given situation. We see that rather than one generic rule of thumb for breaking even, the situational b/e’s vary widely, ranging from 91% with a runner on second and two outs, to 54% for a double steal with first and second and one out (I assume that any out is the lead runner). Notice though that if only the runner on second attempts to steal, the break even jumps from 54% to 72%.

**Tag from Third Graphic/Table**: I broke down every situation where a fly ball was caught with a runner on third, where the catch was either the first or second out. I tracked the attempt frequency and success rate for each situation, based on the outs and whether there were trailing runners. Surprisingly, I found that almost every success rate is well over 95%, meaning runners are only tagging when they’re almost certain to score. However, the break evens range from 40% with first and third with two outs (after the catch) to 77% with runners on second and third with one out. I believe this shows a gray area between the b/e and success rates where runners are being far too cautious.

The following tabs show whether a base runner should attempt to advance two bases on a single. Again, of course it depends on the situation.

**First to Third Graphic/Table**: Here we see that the attempted frequencies are very low, and as expected, lowest on balls hit to left field. However, as with the above tag plays, runners are almost always safe, showing another gray area between attempts and b/e’s. For example, on a single to right field with one out, runners only attempt to advance to third base 42.1% of the time, and are safe 97.3% of the time. If we place the UI Success Rate slider on 0.85, we see that the attempt increases the runs expectancy from 0.87 to 0.99.

**Second to Home Graphic/Table**: Here we see the old adage, “don’t make the first or second out at the plate”, is not necessarily true. Attempting to score from second on a single depends not only on the outs, but also whether there is a trailing runner. The break evens range from 93% with no outs and no trailing runner on first, to 40% with two outs and no runner on first. Once again, the success rates are almost always higher than the break even rate, showing too much caution.

**Sacrifice Bunt Graphic/Table**: These tabs show that unless we have a hitter far below average, the sacrifice should never be attempted. For example, in going from a runner on first and no outs to a runner on second with one out, or going from a runner on second with no outs to a runner on third with one out, we drop from 0.85 runs to 0.66 runs and from 1.10 runs to 0.94 runs respectively. Worse, I’m assuming that the bunt is always successful with the lead runner never being thrown out. The only situation where the bunt might be wise is in a late inning and the team is playing for one run after a leadoff double. Getting the runner from second and no outs to third with one out increases the probability of scoring from 0.61 to 0.65, IF the bunt is successful. Even here, it is a poor play if the success rate is less than 90%. The graphic tab allows the user to see how the expected end state changes as the UI success rate slider is altered.

UI code: https://github.com/malter61/retrosheets/blob/master/ui.R

Server code: https://github.com/malter61/retrosheets/blob/master/server.R

*Mark Malter is a data scientist currently working for Houghton Mifflin Harcourt, as well as the consulting firm Channel Pricing, specializing in building predictive models, cluster analysis, and visualizing data. He is also a sixteen-year veteran stock options market-maker at the Chicago Board Options Exchange. He has a BS degree in electrical engineering, an MBA, and is currently working on an MS degree in Predictive Analytics. Mark also spent 14 years as a director and coach of his local youth baseball league.*

A quick heads-up that this Thursday (December 11), Allen Day from MapR and Bill Jacobs from Revolution Analytics will be live presenting a new webinar, Batter Up! Advanced Sports Analytics with R and Storm. The analysis will be of baseball data, but the webinar will be of interest to anyone interested in doing large-scale statistical analysis with R of data in Hadoop. Here's the abstract:

This session will demonstrate how the all-star line-up featuring R and Storm enables real-time processing on massive data sets; a real home run! The presenters will use actual baseball data and a real-world use case to compose an implementation of the use case as Storm components (spouts, bolts, etc.) and highlight how R can be an effective tool in prototyping a solution. Attendees will leave the session with information that could easily be applied for other use cases such as video game analytics, fraud detection, intrusion detection, and consumer propensity to buy calculations.

The business need for real-time analytics at large scale has focused attention on the use of Apache Storm, but an approach that is sometimes overlooked is the use of Storm and R together. This novel combination of real-time processing with Storm and the practical but powerful statistical analysis offered by R substantially extends the usefulness of Storm as a solution to a variety of business critical problems. By architecting R into the Storm application development process, Storm developers can be much more effective. The aim of this design is not necessarily to deploy faster code but rather to deploy code faster. Just a few lines of R code can be used in place of lengthy Storm code for the purpose of early exploration – you can easily evaluate alternative approaches and quickly make a working prototype.

Register to attend the session on Thursday, where you can ask Bill and Allen your questions live. All registrants will also receive a copy of the slides and the webinar recording. You can register at the link below.

**Update**: The webinar is a wrap, and you can view the slides and replay at the link below.

Revolution Analytics webinars: Batter Up! Advanced Sports Analytics with R and Storm

by Joseph Rickert

Norman Matloff, professor of computer science at UC Davis and founding member of the UCD Dept. of Statistics, has begun posting as Mad(Data)Scientist. (You may know Norm from his book, *The Art of R Programming*, NSP, 2011.) In his second post (out today) on the new R package, freqparcoord, that he wrote with Yinkang Xie, Norm looks into outliers in baseball data.

`> library(freqparcoord)`

`> data(mlb)`

`> freqparcoord(mlb,-3,4:6,7)`

We would like to welcome Norm as a new R blogger and we are looking forward to future posts!

Mad(Data)Scientist: More on freqparcoord

If you're still working on your March Madness brackets or fantasy teams, Rodrigo Zamith has updated his NCAA Data Visualizer with the latest teams, players and results. Just choose the two teams you want to compare and the metric to compare them on, and this R-based app will show you the results instantly.

Rodrigo Zamith: Visualizing Season Performance by NCAA Tournament Teams (2014)

Who will win the Super Bowl this Sunday: Seattle or Denver? As pundits around the country weigh in with their predictions, you might want to check out the analysis from the New York Times' 4th Down Bot, which compares the coaches' calls on fourth down plays with what historical statistics and a point-forecasting model indicate would have been the ideal play.

According to the Bot, the Seahawks made 6 of 8 4th down calls correctly in their championship-winning game (one of those was "complicated" according to the Bot, so let's give it to Seattle). In the Broncos' championship game, Denver made 5 of 6 calls correctly. So at least according to the Bot, Denver has the edge over Seattle in 4th down calls. (I'm going to get run out of Seattle for saying that!)

As you're watching the game on Sunday, you can see the Bot's calls in real-time on Twitter (follow NYT4thDownBot), and see if you agree with the model or with the coaches' calls.

By the way, the New York Times makes extensive use of R in its sports analysis, from fantasy football to draft picks to baseball.

New York Times: 4th Down Bot

Because video games happen in a virtual world, it's possible to measure just about every aspect of the game. It's kind of like being able to observe a sports match or a battle, but being able to attach a telemetry sensor to every player, every weapon and bullet, every surface of the environment, and gather all that data in real time. The Big Data revolution has made this possible, and video game companies routinely gather 50 terabytes of data per day to improve their games, operations and revenue.

But from the player's point of view, can analyzing this data improve their performance? Just as Moneyball revolutionized baseball, can analyzing video game data improve the success of a professional gamer? Video gaming magazine Kill Screen asked how big data helps and hurts League of Legends players in a recent article and suggested it can help many players:

“Almost all players will reach a point where they will plateau without self-reflection, analysis, and focused practice,” Sabine Hemmi of the League of Legends stat site Elobuff tells me. “Any player who understands the basics can learn from statistics. It will be easier for them to identify their weaknesses and focus on improving.”

But when it comes to the elite players, the numbers may not be as much help:

“If you go to a local chess club and pick a low-level player, it’s easy to spot flaws in their game,” explains Bill Grosso, CEO of Scientific Revenue. “Then, you go to the world championship of chess. It becomes really, really hard.”

Nonetheless, it's clear that data analysis is revolutionizing the video game industry, as Bill Grosso described in a recent Revolution Analytics webinar, "Knowing How People are Playing Your Game Gives You the Winning Hand". Follow that link for the webinar replay, or click the link below for the full Kill Screen article.

Kill Screen: How big data helps — and hurts — League of Legends players