It's not an overstatement to say that, at least for me personally, Edward Tufte's book *The Visual Display of Quantitative Information* was transformative. Reading this book got me and, I feel confident saying, many, many other data scientists passionate about visualizing data. This is the book that popularized Minard's chart depicting Napoleon's march on Russia, introduced the world to the concepts of chartjunk and the data-ink ratio, and demonstrated many times over the value of *telling stories* with data (as opposed to merely displaying it).

Tufte's book was also a direct influence on the graphics system of the S Language (and also its successor, R): it was the first statistical programming language where Tufte's concepts could easily be expressed in small amounts of code (even if his principles weren't fully adhered to in the default settings). So it's great to find Lukasz Piwek's Tufte in R page, where many of the examples from *The Visual Display* are recreated in base R code (and sometimes using lattice and ggplot2 as well). Here for example is Tufte's famous rugplot, where the axis tickmarks are replaced by dashes at the data points, giving a sense of the marginal distributions while also marking the data:
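Piwek's exact code isn't reproduced here, but a minimal base R sketch of my own (on simulated data) conveys the dot-dash technique: suppress the default axes and draw tick marks only at the observed values.

```r
# A minimal sketch of a Tufte-style dot-dash scatterplot in base R
# (illustrative only -- not Piwek's code). The default axes are
# suppressed and tick marks are drawn only at the observed data values,
# so the "axes" double as marginal rug plots.
set.seed(42)
x <- rnorm(100)
y <- x + rnorm(100, sd = 0.5)

plot(x, y, pch = 20, axes = FALSE, frame.plot = FALSE, xlab = "x", ylab = "y")
axis(1, at = x, labels = FALSE, tcl = -0.3)  # dashes at each x value
axis(2, at = y, labels = FALSE, tcl = -0.3)  # dashes at each y value
```

Base R also provides `rug()` for adding such marks along one axis of an existing plot.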

And here is Tufte's original sparklines chart: minimal time series presented as small multiples of individual data units (and now all the rage in business intelligence tools and spreadsheets).
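Sparklines of this kind are easy to approximate in base R. The sketch below (simulated series, not the Tufte in R code) stacks bare line charts with axes and boxes suppressed:

```r
# A rough base-R approximation of sparklines as small multiples
# (an illustrative sketch with simulated data, not Piwek's code).
set.seed(1)
series <- replicate(4, cumsum(rnorm(50)), simplify = FALSE)
names(series) <- paste0("unit", 1:4)

op <- par(mfrow = c(4, 1), mar = c(0.5, 5, 0.5, 1))
for (nm in names(series)) {
  s <- series[[nm]]
  plot(s, type = "l", axes = FALSE, xlab = "", ylab = nm, las = 1)
  points(length(s), s[length(s)], pch = 20, col = "red")  # mark the last value
}
par(op)
```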

Each of the examples comes with corresponding R code, usually just a dozen lines or so. Even the document itself is laid out in the style of Tufte's book, with footnotes presented as sidenotes in the margins of the text, right where they're referenced. (RStudio has an RMarkdown style for Tufte handouts like this.) Check out all the examples at the link below.

Lukasz Piwek: Tufte in R

by Joseph Rickert

Answering email queries from friends and acquaintances from around the world wanting to attend useR! 2016 has been painful. It is amazing that the conference sold out a full two months before its start, but upon reflection, not unbelievable. From its inception useR! has been an "academic" conference both in spirit and location. Including this year's conference at Stanford, 10 of the 12 useR! conferences have been held on-site at academic institutions with limited physical space. All of these conferences have been very successful and great fun, and I think it would be fantastic if useR! could continue at university venues.

However, the R community appears to be growing at a prodigious rate. The success of R with machine learning practitioners over the past couple of years, and the attention being paid to R by Microsoft and other corporate users, all but ensures that any academic venue attempting to contain the R community will be left bursting at the seams. Unless some professional organization, like the R Consortium for example, steps up to the task of organizing useR! at a convention center venue, it seems to me that the need for R users to congregate will be met by a growing number of regional conferences. In this respect, European R users are particularly lucky this year, with a few choices for R conferences still coming up.

The Cinquièmes Rencontres R will be held June 22nd through 24th in Toulouse. The conference bills itself as a national forum for sharing ideas about using R in a variety of disciplines including visualization, applied statistics, Bayesian analysis, machine learning and more. With talks in both French and English, it has the potential to attract an international audience. Note that the program includes Ryan Hafen talking about tools for the analysis and visualization of large complex data sets in R, and Heather Turner speaking about the inclusion of women in the R community.

R aficionados seeking a useR!-like experience might consider the "Meeting of Heroes", the European R users meeting (eRum) conference being organized by the Department of Statistics at Poznan University of Economics, the Estimator Students’ Association, and the Department of Mathematical and Statistical Methods at Poznan University of Life Sciences.

The conference is more than five months away (October 12-14), but they already have an outstanding lineup of workshops.

- Introductory R workshops (in Polish) - Adam Dąbrowski (UAM WNGiG)
- Predictive modeling with R - Artur Suchwałko (Data Scientist & Owner at QuantUp, Poland; Co-owner/CSO/Vice-CEO at MedicWave, Sweden)
- Data Visualization using R - Matthias Templ (Vienna University of Technology & Statistics Austria & data-analysis OG)
- Time series forecasting with R - Adam Zagdański (Faculty of Pure and Applied Mathematics, Wroclaw University of Technology; QuantUp)
- R for expression profiling by Next Generation Sequencing - Paweł Łabaj (Austrian Academy of Science / Boku University Vienna)
- Introduction to Bayesian Statistics with R and Stan - Rasmus Bååth (Lund University)
- Small Area Estimation with R - Virgilio Gómez-Rubio (Universidad de Castilla-La Mancha, Spain)
- R for industry and business: Statistical tools for quality control and improvement - Emilio L. Cano (The University of Castilla-La Mancha)
- An introduction to changepoint models using R - Rebecca Killick (Lancaster University)
- Visualising spatial data with R: from 'base' to 'shiny' - Robin Lovelace (Leeds Institute for Data Analytics, University of Leeds)

If you are thinking of going to eRum don't wait too long to sign up. The conference will be limited to 250 attendees, and the workshops have room for only 25 people. The call for papers closes June 15th. You can submit an abstract here.

The 10th R/Rmetrics Summer Workshop is coming up soon (June 24th and 25th) in Zurich. In addition to financial topics there will be sessions on Bayesian statistics, R and Shiny programming and much more.

If you are interested in insurance and actuarial science, consider attending R in Insurance 2016, a one-day conference to be held at the Cass Business School in London on Monday, July 11th. Here I have to mention that fellow BARUG organizer Dan Murphy is giving an invited talk.

If you are looking for a larger, more industry-oriented conference, then don't forget about EARL (Effective Applications of the R Language), to be held in London from September 13th through 15th. Do hurry if you would like to give a talk: abstract submission closes tomorrow, April 29th!

Please do let me know in the comments below if I have missed anything.

by Andrie de Vries

A few weeks ago I wrote about the growth of CRAN packages, where I demonstrated how to scrape CRAN archives to get an estimate of the number of packages over time. In this post I briefly mentioned that the Ecdat package contains a dataset, CRANpackages, with snapshots recorded by John Fox and Spencer Graves.

Here is a plot of the data they collected. The dataset contains data through 2014, so I manually added the package count as of today (8,329).

In my previous post, I asked the question: "are there indications that the contribution rate is steady, accelerating or decelerating?"

This hints at an analysis by John Fox, where he says "The number of packages on CRAN ... has grown roughly exponentially, with residuals from the exponential trend ... showing a recent decline in the rate of growth" (Fox, 2009).

In my previous post Using segmented regression to analyse world record running times I used segmented regression to estimate a model that is piece-wise linear.

I used the same process to fit a segmented regression line through the CRAN package data.

By default the segmented package fits a single break point through the data. The results of this analysis indicate a break point occurring sometime during 2008. This is entirely consistent with John Fox's observation that the rate of growth is slowing down.

However, note that the segmented regression line doesn't fit the data very well during the period 2008 to 2012.

With a small amount of extra work you can fit segmented models with multiple break points. To do this, you simply have to specify initial values for the search. Here I show the results of a simple model with two break points. This model finds the first break point during 2007 and the second break point during 2011.
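The workflow is short enough to sketch. The example below uses simulated piece-wise linear data rather than the actual CRAN counts; the `psi` argument to `segmented()` supplies the initial guesses for the break-point locations.

```r
# Sketch of the segmented-regression workflow (simulated data, not the
# actual CRAN package counts). segmented() refines the break-point
# locations, starting from the initial guesses given in psi.
library(segmented)

set.seed(123)
year <- seq(2000, 2016, by = 0.1)
logn <- 1 + 0.10 * pmin(year - 2000, 7) +           # slope change near 2007
            0.25 * pmax(pmin(year - 2007, 4), 0) +  # slope change near 2011
            0.15 * pmax(year - 2011, 0) +
            rnorm(length(year), sd = 0.05)

fit0 <- lm(logn ~ year)
fit1 <- segmented(fit0, seg.Z = ~year)                       # one break point
fit2 <- segmented(fit0, seg.Z = ~year, psi = c(2005, 2012))  # two break points
fit2$psi[, "Est."]  # estimated break-point locations
```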

Natural systems cannot maintain exponential growth forever. There are always some limits on the system that will ultimately inhibit further growth. This is why many systems display some kind of sigmoid curve, or S-curve.

Although the growth curve of CRAN packages shows signs of slowing down, it does not seem as if there is an inflexion point in the data. An inflexion point is where the curve transitions from being convex to being concave.

Thus it seems the growth of CRAN packages will appear exponential for quite some time to come!

As usual, here is the R code I used.

by Lixun Zhang, Data Scientist at Microsoft

As a data scientist, I have experience with R. Naturally, when I was first exposed to Microsoft R Open (MRO, formerly Revolution R Open) and Microsoft R Server (MRS, formerly Revolution R Enterprise), I wanted to know the answers to three questions:

- What do R, MRO, and MRS have in common?
- What’s new in MRO and MRS compared with R?
- Why should I use MRO or MRS instead of R?

The publicly available information on MRS either describes it at a high level or explains specific functions and their underlying algorithms. When comparing R, MRO, and MRS, the materials tend to stay high level, without many details at the function and package level with which data scientists are most familiar, and they don't answer the above questions in a comprehensive way. So I designed my own tests (the code behind them is available on GitHub). Below are my answers to the three questions above. Note that MRO offers an optional MKL math library; unless noted otherwise, the observations hold whether or not MKL is installed.

After installing R, MRO, and MRS, you'll notice that everything you can do in R can be done in MRO or MRS. For example, you can use *glm()* to fit a logistic regression and *kmeans()* to carry out cluster analysis. As another example, you can install packages from CRAN. In fact, a package installed in R can be used in MRO or MRS and vice versa if the package is installed in a library tree that's shared among them. You can use the command *.libPaths()* to set and get library trees for R, MRO and MRS. Finally, you can use your favorite IDEs such as RStudio and Visual Studio with RTVS for R, MRO or MRS. In other words, MRO and MRS are 100% compatible with R in terms of functions, packages, and IDEs.
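As a quick illustration, here is a sketch of my own using only base R, so it runs identically on all three platforms:

```r
# Plain base-R calls that behave identically on R, MRO, and MRS
# (an illustrative sketch, not from the GitHub tests).
fit <- glm(case ~ spontaneous + induced, data = infert, family = binomial)  # logistic regression
cl  <- kmeans(iris[, 1:4], centers = 3)                                     # cluster analysis

.libPaths()  # the library trees searched for installed packages;
             # point all three installations at a shared tree to share packages
```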

While everything you do in R can be done in MRO and MRS, the reverse is not true, due to the additional components in MRO and MRS. MRO allows users to install an optional math library, MKL, for multithreaded performance. This library shows up as a package named *RevoUtilsMath* in MRO.

MRS comes with more packages and functions than R. From the package perspective, most of the additional ones are not on CRAN and are available only after installing MRS. One such example is the RevoScaleR package. MRS also installs the MKL library by default. As for functions, MRS has High Performance Analysis (HPA) versions of many base R functions, included in the RevoScaleR package. For example, the HPA version of *glm()* is *rxGlm()*, and for *kmeans()* it is *rxKmeans()*. These HPA functions can be used in the same way as their base R counterparts, with additional options. In addition, these functions can work with a special data format (XDF) that's customized for MRS.
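To make the correspondence concrete, here is a sketch pairing base-R calls with their HPA counterparts. The `rx*` lines are commented out because RevoScaleR ships only with MRS, and the argument details shown should be treated as indicative rather than definitive:

```r
# Base-R models alongside their RevoScaleR HPA counterparts. The rx*
# calls are commented out because the RevoScaleR package is available
# only in an MRS installation.
fit <- glm(am ~ wt, data = mtcars, family = binomial)
# rxfit <- rxGlm(am ~ wt, data = mtcars, family = binomial)

km <- kmeans(mtcars[, c("mpg", "wt")], centers = 2)
# rxkm <- rxKmeans(~ mpg + wt, data = mtcars, numClusters = 2)

# Under MRS, large data would typically be converted to XDF first:
# rxDataStep(inData = mtcars, outFile = "mtcars.xdf")
```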

In a nutshell, MRS solves two problems associated with using R: capacity (handling the size of datasets and models) and speed. And MRO solves the problem associated with speed.

The following table summarizes the performance comparisons for R, MRO, and MRS. In terms of capacity, using HPA in MRS increases the size of data that can be analyzed. From the speed perspective, certain matrix-related base R functions can perform better in MRO and MRS than in base R due to MKL. The HPA functions in MRS perform better than their base R counterparts for large datasets. More details on this comparison can be found in the notebook on GitHub.

It should be noted that while there are packages such as *“bigmemory”* and *“ff”* that help address some of the big data problems, they were not included in the benchmark tests.

For data scientists trying to determine which of these platforms to use under different scenarios, the following table can serve as a reference. Depending on the amount of data and the availability of MRS's HPA functions, the table summarizes scenarios where R, MRO, and MRS can be used. Note that whenever R can be used, MRO can be used, with the additional benefit of multithreaded computation for certain matrix-related operations. And MRS can be used whenever R or MRO can be used, with the added possibility of using HPA functions that provide better performance in terms of both speed and capacity.

Follow the link below for my in-depth comparison of R, MRO and MRS.

Lixun Zhang: Introduction to Microsoft R Open and Microsoft R Server

Naomi Robbins, author of *Creating More Effective Graphs* and Forbes contributor, has teamed up with her daughter, Dr. Joyce Robbins, to present a new webinar this Thursday, April 28: Creating Effective Graphs with Microsoft R Open. The webinar will demonstrate how to create a variety of useful graphics with R: comparisons, distributions, trends over time, relationships, divisions of a whole, and much more, like this:

This webinar will be useful for anyone who wants to learn how to display data graphically with the greatest impact. The webinar will use Microsoft R Open, but since it's 100% compatible, the code provided during the webinar can be used with any edition of R. The webinar will begin at 10AM Pacific time (click here to see your local time), and I'll be hosting and passing your questions to the presenters. Even if you can't make the live event, sign up to receive a link to the slides and replay, plus a free copy of a new 50-page e-book by the presenters.

To register for the webinar, follow the link below.

Microsoft Advanced Analytics and IoT: Creating Effective Graphs with Microsoft R Open

Microsoft R Open 3.2.4, Microsoft's enhanced distribution of R, is now available for download from mran.microsoft.com. This update is based on R 3.2.4-revised, and includes several improvements and some minor bug fixes from the R Core Group. Improvements include long-vector support for the `smooth` function, a new `stringsAsFactors` option when using `rbind` with data frames, and better rounding from the `summary` function in the presence of some infinite values.

This release uses a CRAN snapshot taken on April 1, 2016 as the default package repository. Many packages have been updated since the release of MRO 3.2.3, including implementations of the Bayesian bootstrap, deep boosting, multivariate and fuzzy random forests, and balanced sampling. There's a summary of package changes provided in the package spotlight for Microsoft R Open 3.2.4. And as always, you can use packages released or updated since April 1 by using the checkpoint package, and browse updates to CRAN with the CRAN Time Machine on MRAN.

To download Microsoft R Open 3.2.4, visit MRAN at the link below. Don't forget to also download the Math Library for your platform to enable the performance improvements, and (if you don't have an R development environment installed already), R Tools for Visual Studio or RStudio. If you have questions about Microsoft R Open, check the FAQ or the new forum on MSDN, or ask here in the comments.

by Joseph Rickert

R/Finance 2016 is less than a month away and, as always, I am very much looking forward to it. In past years, I have elaborated on what puts it among my favorite conferences even though I am not a finance guy. R/Finance is small, single track and intense with almost no fluff. And scattered among the esoterica of finance and trading there has, so far, always been a rich mix of mathematics, time series applications, R programming, stimulating conversation and attitude. When it comes down to it, it’s the people, the organizers and participants who make a conference. Looking over the agenda for this year, I am sure that once again, for two days at least, Chicago will be the center of the R world.

This year, however, I am going to be ready for R/Finance. I am going to do my homework. If I had done a little prep last year, I would have had a copy of Arthur Koestler’s The Sleepwalkers in my bag. So when Emanuel Derman went deep into philosophy, I could have gone through that looking-glass with him.

So what’s on the lineup this year? Rishi Narang will lead off for the keynote speakers with a talk provocatively entitled “Rage Against the Machine Learning”. There is not much online for an industry outsider to latch onto, but it probably wouldn’t hurt to have a look at one of his three books on quantitative trading.

Tarek Eldin will deliver the second keynote, entitled “Random Pricing Errors and Systematic Returns: The Flaw in Fundamental Prices”. My guess is that this online paper might provide some relevant preparatory reading.

Frank Diebold’s keynote is entitled “Estimating Global Bank Network Connectedness”. I think it’s a safe bet that his recent paper with Mert Demirer, Laura Liu and Kami Yilmaz will indeed be relevant.

Batting cleanup for the keynote speakers will be none other than the R Inferno himself, who vaguely and possibly misleadingly suggests that preparation for his talk, “Some Linguistics of Quantitative Finance”, might begin with Yucatan.

For preparation on more solid ground, I am going to look into the R packages explicitly called out in the agenda. Of course, there will be Rcpp. Chicago is Eddelbuettel country, and no doubt much of the conversation over coffee will revolve around high-performance computing. But even R users who are not particularly interested in writing high-performance code themselves ought to know something about this package. With a reverse-dependency listing of hundreds of packages, it is becoming the foundation for much of R.

In addition to Dirk’s tutorial on Rcpp and RcppArmadillo, Matt Dziubinski will talk about getting the most out of Rcpp in practice, and Jason Foster will talk about using RcppParallel for multi-asset principal component regression. Look here for some older talks by Matt.

Robert McDonald will describe the derivmkts package which contains functions that support his book Derivatives Markets.

Eran Raviv will talk about combining multiple forecasts using R’s ForecastCombinations package.

Kjell Konis will describe how to compare Fitted Factor Models with his fit.models package.

Steven Pav will speak of madness, a package for multivariate automatic differentiation. There is a very nice vignette that describes the mathematics of madness.

Qiang Kou will talk about deep learning in R using the MxNet package which makes use of GPUs.

Mario Annau will talk about the h5 package, an S4 interface to the HDF5 storage format.

Robert Krzyzanowski will describe the Syberia development framework for R.

Dirk Eddelbuettel will revisit the Rblpapi package for connecting R to Bloomberg.

Michael Kane will talk about a new package he is writing, glmnetlib, which is intended to be a low-level library for regularized regression.

Matt Brigida will use a Shiny implementation to talk about Community Finance Teaching Resources.

When I registered for the conference, I saw that the preconference tutorial by Harte and Weylandt on modern Bayesian tools for time series analysis is going to use Stan. So I need to add rstan to the list.

In his tutorial on leveraging the Azure cloud from R, Doug Service will show how to use the foreach package in the Azure environment.

And then, for some serious preparation it might be helpful to take a look at the math underlying some of the presentations. For example, Klaus Spanderen will talk about calibrating Heston Local Stochastic Volatility Models. Sida Yang will discuss using Latent Dirichlet Allocation to discover distributions underlying financial news topics and Pedro Alexander will discuss portfolio selection with support vector regression.

All that I have listed doesn't even cover half of what will be presented at the conference, but I hope some of it will be helpful in preparing for R/Finance. Most importantly, don’t forget to register! Unfortunately, this year many, if not most, of the people who would like to go to the useR! conference will not be able to attend. Don’t get locked out of R/Finance too!

You might think literary criticism is no place for statistical analysis, but given digital versions of the text you can, for example, use sentiment analysis to infer the dramatic arc of an Oscar Wilde novel. Now you can apply similar techniques to the works of Jane Austen thanks to Julia Silge's R package janeaustenr (available on CRAN). The package includes the full text of the six Austen novels, including *Pride and Prejudice* and *Sense and Sensibility*.

With the novels' text in hand, Julia then applied Bing sentiment analysis (as implemented in R's syuzhet package), shown here with annotations marking the major dramatic turns in the book:

There's quite a lot of noise in that chart, so Julia took the elegant step of using a low-pass Fourier transform to smooth the sentiment for all six novels, which allows for a comparison of the dramatic arcs:
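Julia's smoothing relies on the syuzhet package's built-in transforms; the idea itself can be sketched in a few lines of base R (simulated sentiment and a helper function of my own): take the FFT, zero out all but the lowest frequencies, and invert.

```r
# Base-R sketch of low-pass Fourier smoothing (simulated data; Julia's
# post uses the syuzhet package's transform functions instead).
set.seed(7)
n <- 256
sentiment <- sin(seq(0, 4 * pi, length.out = n)) + rnorm(n, sd = 0.8)

low_pass <- function(x, keep = 3) {
  f <- fft(x)
  idx <- seq_along(f)
  # retain the DC term, the `keep` lowest frequencies, and their
  # conjugate mirrors at the top of the spectrum; zero the rest
  f[idx > keep + 1 & idx < length(f) - keep + 1] <- 0
  Re(fft(f, inverse = TRUE)) / length(f)
}

smoothed <- low_pass(sentiment, keep = 3)
```

Larger values of `keep` preserve more detail; very small values reduce each novel to a broad dramatic arc, which is what makes the six curves comparable.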

An apparent Austen aficionada, Julia interprets the analysis:

This is super interesting to me.

*Emma* and *Northanger Abbey* have the most similar plot trajectories, with their tales of immature women who come to understand their own folly and grow up a bit. *Mansfield Park* and *Persuasion* also have quite similar shapes, which also is absolutely reasonable; both of these are more serious, darker stories with main characters who are a little melancholic. *Persuasion* also appears unique in starting out with near-zero sentiment and then moving to more dramatic shifts in plot trajectory; it is a markedly different story from Austen’s other works.

For more on the techniques of the analysis, including all the R code (plus some clever Austen-based puns), check out Julia's complete post linked below.

data science ish: If I Loved Natural Language Processing Less, I Might Be Able to Talk About It More

As I mentioned yesterday, Microsoft R Server is now available for HDInsight, which means that you can now run R code (including the big-data algorithms of Microsoft R Server) on a managed, cloud-based Hadoop instance.

Debraj GuhaThakurta, Senior Data Scientist, and Shauheen Zahirazami, Senior Machine Learning Engineer at Microsoft, demonstrate some of these capabilities in their analysis of 170M taxi trips in New York City in 2013 (about 40 GB). Their goal was to show the use of Microsoft R Server on an HDInsight Hadoop cluster, and to that end, they created machine learning models using distributed R functions to predict (1) whether a tip was given for a taxi ride (binary classification problem), and (2) the amount of tip given (regression problem). The analyses involved building and testing different kinds of predictive models. Debraj and Shauheen uploaded the NYC Taxi data to HDFS on Azure blob storage, provisioned an HDInsight Hadoop Cluster with 2 head nodes (D12), 4 worker nodes (D12), and 1 R-server node (D4), and installed RStudio Server on the HDInsight cluster to conveniently communicate with the cluster and drive the computations from R.

To predict the tip amount, Debraj and Shauheen used linear regression on the training set (75% of the full dataset, about 127M rows). Boosted Decision Trees were used to predict whether or not a tip was paid. On the held-out test data, both models did fairly well. The linear regression model was able to predict the actual tip amount with a correlation of 0.78 (see figure below). Also, the boosted decision tree performed well on the test data with an AUC of 0.98.
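The two held-out metrics are easy to reproduce on toy data. The sketch below (simulated predictions, unrelated to the actual taxi models) computes a Pearson correlation for the regression side and an AUC for the classifier via the rank-sum identity, all in base R:

```r
# Held-out evaluation sketch on simulated data (not the taxi models).
set.seed(99)
n <- 10000
actual_tip <- rgamma(n, shape = 2, scale = 1.5)
pred_tip   <- actual_tip + rnorm(n)      # stand-in regression predictions
r <- cor(actual_tip, pred_tip)           # regression metric: correlation

tipped <- rbinom(n, 1, 0.6)
score  <- tipped + rnorm(n)              # stand-in classifier scores
auc <- function(score, label) {          # AUC via the rank-sum identity
  rk <- rank(score)
  n1 <- sum(label == 1); n0 <- sum(label == 0)
  (sum(rk[label == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}
a <- auc(score, tipped)
```

The rank-sum form avoids constructing an ROC curve explicitly and handles tied scores via midranks.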

The data behind the analysis is public, so if you'd like to try it out yourself the Microsoft R Server code for the analysis is available on Github, and you can read more details about the analysis in the detailed writeup, linked below. The link also contains details about data exploration and modeling, including references to additional distributed machine learning functions in R, which may be explored to improve model performance.

Scalable Data Analysis using Microsoft R Server (MRS) on Hadoop MapReduce: Using MRS on Azure HDInsight (Premium) for Exploring and Modeling the 2013 New York City Taxi Trip and Fare Data

If you want to train a statistical model on very large amounts of data, you'll need three things: a storage platform capable of holding all of the training data, a computational platform capable of efficiently performing the heavy-duty mathematical computations required, and a statistical computing language with algorithms that can take advantage of the storage and computation power. Microsoft R Server, running on HDInsight with Apache Spark, provides all three.

As Mario Inchiosa and Roni Burd demonstrate in this recorded webinar, Microsoft R Server can now run within HDInsight Hadoop nodes running on Microsoft Azure. Better yet, the big-data-capable algorithms of ScaleR (pdf) take advantage of the in-memory architecture of Spark, dramatically reducing the time needed to train models on large data. And if your data grows or you just need more power, you can dynamically add nodes to the HDInsight cluster using the Azure portal.

Many of the details are in the slides embedded above, but to see a demonstration of Microsoft R Server running on Spark with HDInsight, click on the link below for access to the recorded webinar.

Microsoft Azure On-Demand Webinar: Building A Scalable Data Science Platform with R and Hadoop