If you're still working on your March Madness brackets or fantasy teams, Rodrigo Zamith has updated his NCAA Data Visualizer with the latest teams, players and results. Just choose the two teams you want to compare and the metric to compare them on, and this R-based app will show you the results instantly.

Rodrigo Zamith: Visualizing Season Performance by NCAA Tournament Teams (2014)

Nate Silver, the statistician famous for correctly predicting the outcome of the 2012 US presidential election, was lured away from the New York Times by ESPN eight months ago. Yesterday marked the relaunch of his website FiveThirtyEight.com, now with a focus on data journalism and a staff of 20. In the opening post, Nate describes the journalistic methodology of the site, which will be data-driven, deliberate, and focused not just on the explanation of data but on its generalization to future outcomes, a process missing from traditional journalism.

The site seems to be doing very well so far, if the traffic it's sending to this blog is any indication. (The opening post linked to our 2011 post, "The plural of anecdote is data, after all".) I like the fact that the site isn't shying away from in-depth statistical issues: for example, today's post on how statisticians can help find the missing Malaysia Airlines 777 is basically a primer on Bayesian statistics. (Bayesian and simulation-based methods have helped find planes, boats and individuals lost at sea before, such as in this gripping story of a Montauk fisherman lost overboard in the Atlantic Ocean.)

I wish Nate and his team the best for their new venture. Hopefully the success of the site will bring data journalism and statistical principles to a wider audience.

In June 2013, the conflict between opposition and government forces around the Syrian city of Aleppo had intensified. Rockets struck residential districts, and car-bombs exploded near key facilities.

Many people died. But as is common in conflict areas, the reports of the *number* of dead varied by the source of the information. While some agencies reported a surge in casualties in the Aleppo area around June 2013, others did not.

The true number of casualties in conflicts like the Syrian war seems unknowable, but the mission of the Human Rights Data Analysis Group (HRDAG) is to make sense of such information, clouded as it is by the fog of war. They do this not by nominating one source of information as the "best", but instead with statistical modeling of the *differences* between sources.

In a fascinating talk at Strata Santa Clara in February, HRDAG's Director of Research Megan Price explained the statistical technique she used to make sense of the conflicting information. Each of the four agencies shown in the chart above published a list of identified victims. By painstakingly linking the records between the different agencies (no simple task, given incomplete information about each victim and variations in capturing names, ages and so on), HRDAG can get a more complete sense of the total number of casualties. But the real insight comes from recognizing that **some victims were reported by no agency at all**. From the rates at which known victims appear on some agencies' lists but not on others, HRDAG can estimate the number of victims that were identified by *nobody*, and thereby get a more accurate count of total casualties. (The specific statistical technique used was Random Forests, using the R language. You can read more about the methodology here.)
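HRDAG's actual analysis links several lists with more sophisticated models, but the core intuition can be sketched with a toy two-list capture-recapture estimate in base R. Everything below (the counts and the use of the Chapman estimator) is an invented illustration of the general idea, not HRDAG's method:

```r
# Counts from two hypothetical agency lists (numbers invented for illustration)
n1 <- 60   # victims on agency A's list
n2 <- 50   # victims on agency B's list
m  <- 20   # victims appearing on both lists

# Chapman's bias-corrected variant of the Lincoln-Petersen estimator:
# the less the two lists overlap, the more victims we infer were missed.
N_hat <- (n1 + 1) * (n2 + 1) / (m + 1) - 1

# Estimated victims reported by neither agency
unreported <- N_hat - (n1 + n2 - m)

round(N_hat)        # 147 total victims estimated
round(unreported)   # 57 never reported by either agency
```

With only 20 of the 90 distinct known victims appearing on both lists, the overlap is low, so the estimate implies a substantial number of victims identified by nobody.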

HRDAG is doing a noble and difficult job of understanding the facts of war from incomplete data. "If we base our conclusions about what's happening in Syria on the observed data — on the reporting rates — we get those questions wrong", said Megan in her Strata talk. "When we estimate what is missing, we have a much more accurate estimate of reality."

Strata: Record Linkage and Other Statistical Models for Quantifying Conflict Casualties in Syria

In 1990, 87% of Americans could be uniquely identified given only their gender, date of birth and 5-digit ZIP code. You can check how easily you can be identified using those three data points here, and *vastly* more data is available about individuals today than 24 years ago. In this brave new world of social sharing, open data and data security revelations, data privacy is a big issue for consumers and businesses alike.

Statisticians have a unique perspective when it comes to data, yet searching for "privacy" or "ethics" at the websites of the major statistical societies yields little of relevance. Why aren't more statisticians playing leading roles in data privacy and ethics issues? That's the topic I raise in an op-ed at StatsLife, the online magazine of the Royal Statistical Society. I encourage you to add your voice to the conversation in the comments.

StatsLife: Why aren’t more statisticians involved in data privacy?

This is the time of year when everyone likes to speculate on the winners of the Academy Awards, to be announced on Sunday. There are plenty of ways to try and predict which movie is going to win Best Picture or who'll win Best Actress. You could look at the various betting markets and see who the speculators are favouring. You could take a look at the predictions from various movie experts. You could base your predictions on the movie "fundamentals": prior awards won, box office receipts, and so forth. If you travel in such circles, you could listen in on the chatter at Hollywood cocktail parties. Or you could even watch all of the nominated movies and decide for yourself.

As Peter Aldhous (a data journalist we've featured in this blog before) reports in Medium, a team of researchers used statistical analysis to evaluate all the possible methods for forecasting the Oscars, by using them to predict the outcomes of the 2013 Academy Awards and comparing the results to the actual outcomes. The conclusion: the predictions from the BetFair betting markets — alone — are the best indicators of the actual outcomes. BetFair even does better than a Nate Silver-style aggregation of the critics' picks on the day before the actual awards (and *way* better than a statistical model based on movie fundamentals), as you can see in the chart below.

You can read the details of the analysis in this paper from Microsoft Research. Author David Rothschild let me know that all the computation for the paper was done in the R language, along with many R packages including plyr, reshape, ggplot2 and data.table. Rothschild uses the BetFair predictions (slightly adjusted so that the total probability of all outcomes adds to 100%) as the basis of the Oscar predictions at the PredictWise website. Click through to see the up-to-the-minute predictions, but the forecasts for the top awards as of this writing, along with their predicted chance of winning, are:

| Category | Predicted winner | Chance of winning |
|----------|------------------|-------------------|
| Best Picture | 12 Years a Slave | 87.4% |
| Best Directing | Alfonso Cuarón (Gravity) | 98.2% |
| Best Actor | Matthew McConaughey (Dallas Buyers Club) | 91.9% |
| Best Actress | Cate Blanchett (Blue Jasmine) | 98.6% |
| Best Supporting Actor | Jared Leto (Dallas Buyers Club) | 97.2% |
| Best Supporting Actress | Lupita Nyong’o (12 Years a Slave) | 59.1% |
| Best Visual Effects | Gravity | 99.8% |

(On a personal note: I really hope *Gravity* does match the predictions above. It's easily one of the best films I've seen in the last decade, and I'd give it Best Picture as well if I were an Academy member. See it in 3-D if you can.)

You can see predictions for the other categories in Peter Aldhous's article linked below.

**Update March 3**: By my count, the PredictWise predictions (based on the Betfair betting markets) correctly predicted 21 of 24 Oscar winners. Not a bad record!

*by James Paul Peruvankal, Senior Program Manager at Revolution Analytics*

Three weeks ago, researchers at Princeton released a study, Epidemiological modeling of online social network dynamics, predicting that Facebook might lose 80% of its users by 2015-2017. Facebook's data scientists hilariously debunked the study by showing that, using the same methodology, Princeton itself would lose all of its students by 2021. The media joined the hype, with TechCrunch, CNN, Mashable, CNET and others offering their comments on the study and its methodology.

The reality is that neither of these institutions is going out of business. Last year Princeton admitted just 7.4% of applicants, maintaining its status as a premier educational institution. Facebook has over 1.2 billion active users. That said, Facebook does see lower engagement from teens and faces threats from a number of new social networks; this article from the Times of India explores various possibilities for the social network. Similarly, traditional universities are threatened by the rise of online education platforms such as edX, Coursera and Udacity. One of the key assumptions behind the SIR model is that social networks come in 'waves' (e.g. Classmates.com, Friendster, MySpace) that rise and fall. It may instead be that Facebook sees growth followed by stability, rather than growth followed by decline.

Here is the Google Trends plot for MySpace:

and here is the Google Trends plot for Facebook:

Our interest here, however, is in using R as a platform for modeling contagion. Contagion models are useful for modeling the propagation of diseases, marketing messages, viral videos and so on. The simplest classic model is the SIR model, which considers a fixed population divided into three compartments:

- S(t) is used to represent the number of individuals not yet infected with the disease at time t, or those susceptible to the disease.
- I(t) denotes the number of individuals who have been infected with the disease and are capable of spreading the disease to those in the susceptible category.
- R(t) is the compartment used for those individuals who have been infected and then removed from the disease, either due to immunization or due to death.

If N is the fixed population, N = S(t) + I(t) + R(t).

Let’s assume that every individual in the population has an equal probability of contracting the disease, at a rate ‘beta’ (the infection rate of the disease). Also assume that in each unit of time a fraction ‘gamma’ (the recovery rate) of the infected individuals enters the removed class.

We derive the following differential equations:

dS/dt = -beta * S * I

dI/dt = beta * S * I - gamma * I

dR/dt = gamma * I

I used the deSolve package to model this and the optim function to optimize the solution. Here is what I got:

If you would like to improve this model, here is the code. If you would like to explore other models of contagion, see Download SIR Model v3. Let us know if you have any interesting findings.
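As a self-contained sketch of the SIR system above, here is a minimal base-R simulation using a fixed-step Euler integrator. (The post itself used deSolve and optim; the parameter values below are invented for illustration.)

```r
# Minimal SIR simulation with a fixed-step Euler integrator.
# beta and gamma are invented illustration values, not fitted parameters.
sir_sim <- function(S0, I0, R0, beta, gamma, dt = 0.1, steps = 1000) {
  S <- S0; I <- I0; R <- R0
  for (step in seq_len(steps)) {
    dS <- -beta * S * I             # susceptibles becoming infected
    dI <-  beta * S * I - gamma * I # new infections minus recoveries
    dR <-  gamma * I                # infected moving to removed
    S <- S + dS * dt
    I <- I + dI * dt
    R <- R + dR * dt
  }
  c(S = S, I = I, R = R)
}

# One infected individual in a population of 1,000
out <- sir_sim(S0 = 999, I0 = 1, R0 = 0, beta = 0.0005, gamma = 0.1)

# The population is conserved: S + I + R stays equal to N = 1000
sum(out)
```

Since dS + dI + dR = 0 at every step, the total population is conserved exactly; with these parameters the epidemic burns through most of the population before dying out.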

Two recent salary surveys have shown that R language skills attract median salaries in excess of $110,000 in the United States.

In the 2014 Dice Tech Salary Survey of over 17,000 technology professionals, the highest-paid IT skill was R programming. While big-data skills in general featured strongly in the top tier, having R at the top of the list reflects the strong demand for skills to make sense of, and extract value from, big data.

Similarly, the recent O'Reilly Data Scientist Survey also found R skills amongst those that pay in the $110,000-$125,000 range (albeit amongst a much smaller and specialized sample of respondents).

The reports also show that IT wages in general are rising at around 3% per year, largely due to increased demand. You can download the full reports at the links below.

Dice: 2014 Dice Tech Salary Survey Report

O'Reilly Media: 2013 Data Science Salary Survey

There's no shortage of web sites listing the current medal standings at Sochi, not least the official Winter Olympics Medal Tally. And here's the same tally, rendered with R:

Click through to see a real-time version of the chart, created with RStudio's Shiny by Ty Henkaline. (By the way, does anyone know if it's possible to embed a live version of the chart in a blog post like this?) If you're looking to create similar real-time charts of Web-based tables, be sure to check out the underlying code by Tyler Rinker that grabs the medal table from the Sochi website, cleans up the data, and plots the medal tally as a chart.

[**Updated**: The interactive chart by Ty Henkaline was mistakenly attributed to Ramnath Vaidyanathan. Apologies for the error.]

TRinker's R Blog: Sochi Olympic Medals

It's a brand new year, and the Revolutions blog is now three weeks into its sixth year. Hard to believe that a little over five years ago this was the only R-related blog; now there are more than 450 and the R project and the R community continue to thrive and grow.

To bring in the new year, I thought I'd take a look back at the 10 most popular R-related posts of 2013. In descending order starting with the post with the most pageviews in 2013, they are:

- Free e-book on Data Science with R
- Learn how to analyze data with R with Coursera's "Data Analysis" videos
- Statistics vs Data Science vs BI
- Visualize large data sets with the bigvis package
- Elements of Statistical Learning: free book download
- Trevor Hastie presents glmnet: lasso and elastic-net regularization in R
- Did an Excel error bring down the London Whale?
- Big Data Sets you can use with R
- Draw nicer Classification and Regression Trees with the rpart.plot package
- 10 R packages every data scientist should know about

Some popular posts not directly related to R were Nate Silver addresses assembled statisticians at this year's JSM and A great example of Simpson's Paradox: US median wage decline. The most popular "Because it's Friday" posts were Game of Thrones Family Trees and Ten Tech Tips from David Pogue.

We've got plenty more to come in 2014: more guest bloggers, more data science and big data discussions, and of course much more R. I'd like to thank our many guest bloggers in 2013, especially Joe Rickert who featured several times in the top 10. I'd also like to thank my colleagues at Revolution Analytics for supporting this blog. And I'd especially like to thank the readers of this blog for insightful comments and great suggestions. Let us know what you'd like to see in the coming year in the comments.

by Joseph Rickert

The Revolution R Enterprise 7.0 Getting Started Guide makes a distinction between High Performance Computing (HPC), which is CPU-centric, focusing on using many cores to perform lots of processing on small amounts of data, and High Performance Analytics (HPA), data-centric computing that concentrates on feeding data to the cores: disk I/O, data locality, efficient threading, and data management in RAM. The following collection of tips for computing with big data is an abbreviated version of the Guide’s discussion of the HPC and HPA considerations underlying the design of Revolution R Enterprise 7.0 and RevoScaleR, Revolution’s R package for HPA computing.

It doesn’t hurt to state the obvious: bigger is better. In general, memory is the most important consideration. Getting more cores can also help, but only up to a point since R itself can generally only use one core at a time internally. Moreover, for many data analysis problems the bottlenecks are disk I/O and the speed of RAM, making it difficult to efficiently use more than 4 or 8 cores on commodity hardware.

R allows its core math libraries to be replaced. Doing so can provide a very noticeable performance boost to any function that makes use of computational linear algebra algorithms. Revolution R Enterprise links in the Intel Math Kernel Library.

R does quite a bit of automatic copying. For example, when a data frame is passed into a function a copy of the data is made if the data frame is modified, and putting a data frame into a list also automatically causes a copy to be made. Moreover, many basic analysis algorithms, such as lm and glm, produce multiple copies of a data set as the computations progress. Memory management is important.
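A small base-R illustration of this copy-on-modify behavior: modifying a data frame inside a function operates on a copy, leaving the caller's object untouched.

```r
# R's copy-on-modify semantics: the modification inside the function
# triggers a copy, so the caller's data frame is never changed.
df <- data.frame(x = 1:3)

add_column <- function(d) {
  d$y <- d$x * 2   # this assignment forces a copy of the data frame
  d
}

df2 <- add_column(df)

ncol(df)    # still 1: the original was never modified
ncol(df2)   # 2: the function returned the modified copy
```

This safety is convenient, but with a multi-gigabyte data frame each such copy carries a real memory cost.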

Processing data a chunk at a time is the key to being able to scale computations without increasing memory requirements. External memory algorithms load a manageable amount of data into RAM, perform some intermediate calculations, load the next chunk and keep going until all of the data has been processed. Then, the final result is computed from the set of intermediate results. There are several CRAN packages including biglm, bigmemory, ff and ffbase that either implement external memory algorithms or help with writing them. Revolution R Enterprise’s RevoScaleR package takes chunking algorithms to the next level by automatically taking advantage of the computational resources to run its algorithms in parallel.
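A minimal sketch of the chunking idea in base R. Here the "data source" is just an in-memory vector; in a real external-memory algorithm each chunk would be read from disk.

```r
# Compute a mean by accumulating per-chunk intermediate results
# (a running sum and count) instead of operating on all data at once.
set.seed(42)
x <- runif(1e6)          # stand-in for a data set too big for RAM
chunk_size <- 1e5

total <- 0
n <- 0
for (start in seq(1, length(x), by = chunk_size)) {
  chunk <- x[start:min(start + chunk_size - 1, length(x))]  # "load" one chunk
  total <- total + sum(chunk)   # intermediate result for this chunk
  n <- n + length(chunk)
}

chunked_mean <- total / n       # final result from the intermediate results
all.equal(chunked_mean, mean(x))
```

The peak memory requirement is one chunk plus two scalars, regardless of how large the full data set grows.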

Using all of the available cores and nodes is key to scaling computations to really big data. However, since data analysis algorithms tend to be I/O bound when data cannot fit into memory, the use of multiple hard drives can be even more important than the use of multiple cores. The CRAN package foreach provides easy-to-use tools for executing R functions in parallel, both on a single computer and across multiple computers. The foreach() function is particularly useful for “embarrassingly parallel” computations that do not involve communication among different tasks.
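The same embarrassingly parallel idea can be sketched with the parallel package, which ships with base R (the task function here is invented busywork):

```r
# Run independent tasks on two worker processes with the base
# "parallel" package; foreach offers a similar, higher-level interface.
library(parallel)

slow_task <- function(i) sum(sqrt(seq_len(i * 1e5)))  # independent work unit

cl <- makeCluster(2)                        # two worker processes
par_results <- parLapply(cl, 1:8, slow_task)
stopCluster(cl)

serial_results <- lapply(1:8, slow_task)    # same computation on one core
identical(par_results, serial_results)      # results agree
```

Because no task depends on another's output, the work divides cleanly across workers with no inter-task communication.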

The statistical functions and machine learning algorithms in the RevoScaleR package are all Parallel External Memory Algorithms (PEMAs). They automatically take advantage of all of the cores available on a machine or on a cluster (including LSF and Hadoop clusters).

In R, the two choices for “continuous” data are numeric, an 8-byte (double) floating-point number, and integer, a 4-byte integer. There are circumstances where storing and processing integer data can provide the dual advantages of using less memory and decreasing processing time. For example, when working with integers, a tabulation is generally much faster than sorting and gives exact values for all empirical quantiles. Even when you are not working with integers, scaling and converting to integers can produce fast and accurate estimates of quantiles. As an example, if the data consists of floating-point values in the range 0 to 1,000, converting to integers and tabulating will bound the median, or any other quantile, to within two adjacent integers; interpolation can then get you an even closer approximation.
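A base-R sketch of the tabulation trick (the data and bin width are invented for illustration):

```r
# Approximate the median by tabulation instead of sorting:
# truncate values in [0, 1000) to integer bins, tabulate, and read
# the median's bin off the cumulative counts.
set.seed(1)
x <- runif(1e6, 0, 1000)

counts <- tabulate(as.integer(x) + 1L, nbins = 1000L)  # counts per unit bin
cum <- cumsum(counts)

# First bin whose cumulative count reaches half the observations
median_bin <- which(cum >= length(x) / 2)[1] - 1L

# The true median lies within the unit interval starting at median_bin,
# so the tabulation bounds it between two adjacent integers.
c(bin = median_bin, true_median = median(x))
```

A single pass builds the table, so no sort of the full data is ever needed; interpolating within the bin refines the estimate further.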

You will want to store big data so that it can be efficiently accessed from disk. The use of appropriate data types can save both storage space and access time. Take advantage of integers and, when you can, store data in 32-bit floats, not 64-bit doubles. A 32-bit float can represent seven decimal digits of precision, which is more than enough for most data, and it takes up half the space of doubles. Save the 64-bit doubles for computations.

Even though a data set may have many thousands of variables, typically not all of them are being analyzed at one time. By reading from disk just the actual variables and observations you will use in analysis, you can speed up the analysis considerably.

Loops in R can be very slow compared with R’s core vector operations, which are typically written in C, C++ or Fortran, compiled languages that execute much faster than the R interpreter.
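A quick base-R comparison of the two styles:

```r
# Summing a vector two ways: an interpreted loop versus one call
# into compiled code. Both give the same answer.
set.seed(7)
x <- runif(1e6)

loop_sum <- 0
for (v in x) loop_sum <- loop_sum + v   # interpreted, one element at a time

vec_sum <- sum(x)                       # single call into compiled C code

all.equal(loop_sum, vec_sum)            # equal up to floating-point rounding
```

Wrapping each in system.time() shows the vectorized call completing orders of magnitude faster than the loop.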

One of R’s great strengths is its ability to integrate easily with other languages, including C, C++ and Fortran. You can pass R data objects to another language, do some computations, and return the results in R data objects. The CRAN package Rcpp, for example, makes it easy to call C and C++ code from R.

When working with small data sets, it is common to perform data transformations one at a time. For instance, one line of code might create a new variable, and subsequent lines perform additional transformations, with each transformation requiring a pass through the data. To avoid the overhead of making multiple passes over a large data set, write chunking algorithms that apply all of the transformations to each chunk. RevoScaleR’s rxDataStep() function is designed for one-pass processing, permitting multiple data transformations to be performed on each chunk.

When writing chunking algorithms, try to avoid algorithms that cross chunk boundaries. In general, data transformations for a single row of data should not be dependent on values in other rows. The key idea is that a transformation expression should give the same result even if only some of the rows of data are in memory at one time. Data manipulations requiring lags can be done but require special handling.

Working with categorical or factor variables in big data sets can be challenging. For starters, not all of the factor levels may be represented in a single chunk of data. Using R’s factor() function in a transformation on a chunk of data without explicitly specifying all of the levels that are present in the entire data set might cause you to end up with incompatible factor levels from chunk to chunk. Also, building models with factors having hundreds of levels may cause hundreds of dummy variables to be created that really eat up memory. The functions in the RevoScaleR package that deal with factors minimize memory use and do not generally explicitly create dummy variables to represent factors.
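A base-R sketch of why explicit levels matter when factoring chunk by chunk (the level names are invented):

```r
# Specify the full level set explicitly rather than letting factor()
# infer it from whatever values happen to appear in each chunk.
all_levels <- c("low", "medium", "high")   # known from the whole data set

chunk1 <- factor(c("low", "high"), levels = all_levels)
chunk2 <- factor(c("medium"),      levels = all_levels)

# Without explicit levels, each chunk infers its own incompatible set:
bad1 <- factor(c("low", "high"))   # levels: "high", "low"
bad2 <- factor(c("medium"))        # levels: "medium"

identical(levels(chunk1), levels(chunk2))  # consistent across chunks
identical(levels(bad1), levels(bad2))      # inconsistent across chunks
```

With a shared level set, results computed on different chunks (tables, model matrices) line up and can be combined safely.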

Most analysis functions return a relatively small object of results that can easily be handled in memory. Occasionally, however, the output will have the same number of rows as the data, as when computing predictions or residuals. For this to scale, you will want the output written out to a file rather than kept in memory.

Sorting is by nature a time-intensive operation. Do what you can to avoid sorting a large data set. Use functions that compute estimates of medians and quantiles, and look for implementations of popular algorithms that avoid sorting. For example, the RevoScaleR function rxDTree() avoids sorting by working with histograms of the data rather than with the raw data itself.