by Joseph Rickert

2015 has been a good year for R user groups, both in terms of activity and the number of new groups founded. The plot below, which runs from 12/30/2012 through the week beginning Monday 11/23/2015, shows that the number of weekly meetings continues to drift up and to the right. You can see the seasonal pattern of fewer meetings in the late summer and winter holiday seasons, and increased activity in autumn. We track weekly meetings in our Community Calendar.

The table below shows 18 new groups founded in 2015. (If I have missed your group please let me know.)

Altogether, we have 174 R user groups listed in the Local R User Group Directory.

After running for more than 5 years, the Revolution Analytics R user group support program closed down on September 30, 2015. However, without missing a beat, we immediately picked up with the new Microsoft Data Science User Group Program. Reflecting Microsoft's broad support for the open source community, this program provides support for data science user groups of all stripes (even Python :). The new program works in a manner similar to the Revolution Analytics program. There are three levels of support: Vector, Matrix and Array, with the amount of funding and the care package of goodies determined by a group's size and the frequency with which it meets. If you are a data science group organizer and you think a little cash and some swag could help your group grow, please do apply to the program here.

Finally, a shout out to RBelgium that postponed its meeting scheduled for this week, but will pick up again on 12/16 and to RLyon meeting on 11/27.

Today is Thanksgiving Day in the United States, where the nation's citizens pause to reflect on what they are thankful for. I'd like to take this opportunity to give thanks to the members of the R Core Group who developed R, and continue to donate their time to help the community use R by improving R, writing documentation, maintaining the build systems on a huge number of platforms, maintaining the CRAN system and working with developers to approve packages, answering questions on mailing lists, and much much more. Much of this work goes on invisibly behind the scenes, and the R community owes these volunteers a great debt of thanks. The members of the R core group are:

- Douglas Bates
- John Chambers
- Peter Dalgaard
- Robert Gentleman
- Kurt Hornik
- Ross Ihaka
- Michael Lawrence
- Friedrich Leisch
- Uwe Ligges
- Thomas Lumley
- Martin Maechler
- Martin Morgan
- Duncan Murdoch
- Paul Murrell
- Martyn Plummer
- Brian Ripley
- Deepayan Sarkar
- Duncan Temple Lang
- Luke Tierney
- Simon Urbanek
- plus Heiner Schwarte up to October 1999, Guido Masarotto up to June 2003, Stefano Iacus up to July 2014 and Seth Falcon up to August 2015.

If you're in the US or celebrating Thanksgiving elsewhere, have a great day. To celebrate the occasion, Kieran Healy offers us this Thanksgiving Turkey, created in just 3 lines of R code (4 with my slight modification for posterity). Enjoy!

by Michael Helbraun

The software business includes travel, and that means hotels. The news that Marriott was acquiring Starwood was of particular interest to me – especially since more than 75% of my 95 nights so far this year on the road have been spent with one of those two companies.

While other folks can evaluate whether the deal makes sense financially, I was just curious how it might affect a business traveler. Looking at the news, some commentators are optimistic and plenty are concerned. Granted, many of the details of how the loyalty programs will be combined won’t be known for some time, but what we do know is where each company maintains properties.

With 4,200+ Marriott and 1,700+ Starwood properties, I was curious where there might be overlap and how well the deal would help Marriott to grow in new markets. Luckily R can help in this regard.

The first thing to do is to put together a data set. It would have been nice if the companies had clean spreadsheets available publicly, but as is normally the case we end up spending a good portion of time gathering and preparing data. In this case that meant scraping and formatting the data from SPG and Marriott into a spreadsheet with all their property locations. While I won’t go into data cleaning here, for a one-time effort on just a few thousand rows of data this was pretty straightforward to do in Excel.

After I had all locations for all properties it was time to bring that data into R to start the analysis. First I was curious where each firm had the most properties – simple to do with a cross tab. NYC seems a logical member of the top 5, but Houston and Atlanta are interesting:

**Top 10 Marriott Locations**
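A cross tab like this takes only a line or two of base R once the spreadsheet is loaded. Here is a minimal sketch, using a tiny hypothetical stand-in for the scraped property list (the `city` and `firm` column names are assumptions):

```r
# Tiny stand-in for the scraped property list (hypothetical data)
hotels <- data.frame(
  city = c("New York", "New York", "Houston", "Atlanta", "Houston", "Atlanta"),
  firm = c("Marriott", "Starwood", "Marriott", "Marriott", "Starwood", "Marriott")
)

# Cross tab of property counts by city and firm
xt <- table(hotels$city, hotels$firm)
xt

# Cities with the most Marriott properties, sorted by count
head(sort(xt[, "Marriott"], decreasing = TRUE), 10)
```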

So far so good, but to actually put these on a map it’s much easier if the data has latitude and longitude. The `geocode()` function from the *ggmap* package can look these up for us:

```
# Geocode the Marriott property locations and cache the result
marGeocoded <- cbind(locations, geocode(locations))
save(marGeocoded, file = "D:/Datasets/marGeocoded.RData")
load("D:/Datasets/marGeocoded.RData")

# Do the same for the Starwood property locations
locations <- hotToGeo
hotGeocoded <- cbind(locations, geocode(locations))
save(hotGeocoded, file = "D:/Datasets/hotGeocoded.RData")
load("D:/Datasets/hotGeocoded.RData")
```

Once the lat/long coordinates are merged back into our data set there are a number of ways to plot the results. I’m a fan of the globe plots within Bryan Lewis’s excellent *rthreejs* package. This allows you to stretch a 2D image over a globe which you can then plot on top of and interact with. Here I’ve plotted all the Marriott properties in orange and the Starwood properties in yellow:
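The call itself can be sketched roughly as follows (hypothetical coordinates; this assumes the package is installed under its CRAN name *threejs* and that its `globejs()` function is used):

```r
# Hypothetical stand-in for the combined, geocoded property data
combGeocoded <- data.frame(
  lat  = c(40.7, 48.9, 33.7),
  lon  = c(-74.0, 2.3, -84.4),
  firm = c("Marriott", "Starwood", "Marriott")
)
# Marriott in orange, Starwood in yellow
cols <- ifelse(combGeocoded$firm == "Marriott", "orange", "yellow")

if (requireNamespace("threejs", quietly = TRUE)) {
  threejs::globejs(lat   = combGeocoded$lat,
                   long  = combGeocoded$lon,
                   color = cols,
                   value = 20)  # height of the bar drawn at each location
}
```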

From this, it seemed that the most overlap was in the US and Europe. To create a static plot, *ggmap* is very quick:

```
# Europe map with ggmap
eurPlot <- qmap(location = "Europe", zoom = 4, legend = "bottomright",
                maptype = "terrain", color = "bw", darken = 0.01)
eurPlot <- eurPlot + geom_point(data = combGeocoded,
                                aes(y = lat, x = lon, colour = firm,
                                    size = Counts, alpha = .2))
(eurPlot <- eurPlot + scale_size_continuous(range = c(3, 10)))
```

If we want to create something with an interactive zoom, the *leaflet* package is another useful one. It leverages Open Street Map and allows you to pan and zoom:
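A minimal *leaflet* sketch along those lines, with hypothetical coordinates standing in for the geocoded property data:

```r
# Hypothetical stand-in for the combined, geocoded property data
combGeocoded <- data.frame(
  lat  = c(40.76, 41.88),
  lon  = c(-73.98, -87.63),
  firm = c("Marriott", "Starwood")
)
pal <- c(Marriott = "orange", Starwood = "yellow")

if (requireNamespace("leaflet", quietly = TRUE)) {
  library(leaflet)
  leaflet(combGeocoded) %>%
    addTiles() %>%                       # Open Street Map tiles by default
    addCircleMarkers(lng = ~lon, lat = ~lat,
                     color = unname(pal[combGeocoded$firm]),
                     radius = 6)
}
```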

Aggregating and deriving value from low-value info is a great use of R, and this sort of analysis is fun as it gives some additional perspective on a current event. If you would like to play around with this, a copy of the script (Merger analysis) and the relevant data files (HotGeocoded and MarGeocoded) are available for download – let us know what you find in the comments.

Todd W Schneider analyzed a database of 1.1 billion taxi rides in New York City from 2009-2015, and discovered some interesting insights on how New Yorkers use cabs. For example, here's a map of the drop-off locations of each ride in the database:

The R code to generate this beautiful map is surprisingly simple: just one line to extract the data from a Postgres database, and a few lines of ggplot2 code to render each drop-off as a point on the map, colored by the type of cab (NYC Yellow or regional Green Boro taxis). Note the use of the alpha= argument to make the dots transparent, allowing them to build in intensity according to the number of drop-offs in each location.
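This is not Todd's actual code, but the plotting pattern he uses can be sketched like this, with simulated coordinates standing in for the taxi data:

```r
# Simulated drop-off coordinates standing in for the real taxi records
dropoffs <- data.frame(
  lon = rnorm(10000, -73.97, 0.05),
  lat = rnorm(10000, 40.75, 0.05),
  cab = sample(c("yellow", "green"), 10000, replace = TRUE)
)

if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  ggplot(dropoffs, aes(lon, lat, colour = cab)) +
    geom_point(size = 0.1, alpha = 0.05) +  # transparency lets dense areas build up
    scale_colour_manual(values = c(yellow = "gold", green = "darkgreen")) +
    coord_equal() +
    theme_void()
}
```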

Todd also used R to calculate from the data the amount of time required to get from various NYC districts to the airport. For example, here's the chart for trips from midtown Manhattan to JFK airport:

Note how Todd presents probability bands instead of the medians of trip times by time of day. As anyone who commutes regularly knows, the same trip at the same time of day doesn't always take the same amount of time: there is a *distribution* of possible trip times, from quick runs to extreme delays. If you leave for your destination with only the median trip time (as shown by most navigation apps) to spare, **you will be late half the time**. Personally, I like to use the 90/90 rule for airport trips: leave at a time that gives me a 90% chance of arriving 90 or more minutes before my flight. This chart helps me follow that rule. For example, at rush hour (around 4PM) you should leave midtown **2 hours and 55 minutes** (85 + 90 minutes) before your flight if you want to have 90 minutes at the airport.
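The idea behind those bands can be sketched in base R: compute several quantiles of trip time per departure hour rather than a single median. Here simulated data stands in for the real taxi records:

```r
set.seed(42)
# Simulated trip times (minutes) by departure hour
trips <- data.frame(
  hour    = sample(0:23, 5000, replace = TRUE),
  minutes = rlnorm(5000, meanlog = 3.7, sdlog = 0.3)
)

# 10th/50th/90th percentiles of trip time for each departure hour;
# the 90th percentile, not the median, is what the 90/90 rule calls for
bands <- aggregate(minutes ~ hour, data = trips,
                   FUN = function(m) quantile(m, c(0.1, 0.5, 0.9)))
head(bands)
```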

For many other charts and analyses of the NYC taxi data, check out Todd's complete blog post below.

Todd W Schneider: Analyzing 1.1 Billion NYC Taxi and Uber Trips, with a Vengeance

by Joseph Rickert

What are you reading? And what are you recommending to friends, colleagues, and students who want to learn something about R programming? A quick search of Amazon will show that there are several new R books proposed for 2016; but of course, new doesn't necessarily mean better. I fully expect that, for years to come, new books in statistics, data science and many other scientific disciplines will continue to use R to provide a computational aspect for their exposition. All of these books will provide windows into learning R for people excited about the particular subject matter. However, so many excellent R-based texts have already been published that it will be difficult for these new works to achieve "must buy" status for the R content alone.

Below are my recommendations for good R reads. Some of these books go back a few years, but they continue to hold their value. With the possible exception of books that were based primarily on the S language, good R books don't become obsolete. Unlike some other computer languages, R evolves mostly through new capabilities added by contributed packages, not through changes to the R core. The fact that the dplyr family of packages may make data wrangling more convenient in many circumstances doesn't make a book that teaches data manipulation through base R functions any less relevant. In fact, some might argue that new students should be taught the basic functionality first. I am not a militant traditionalist, but it does seem to me that familiarity with the bare-bones basics of the language will help newcomers gain intuition about how R works.

There are three lists below. The first gives my picks for teaching R programming (top row in the graphic). The second provides my recommendations for people interested in learning R for data science (second row in the graphic).

The third list is of books on my shelf that I continue to value. For every entry in all three lists I provide a mini or micro review. In a few cases, I point to a more extensive review that I have previously published in this blog. My lists are in no way intended to be complete, and I apologize right now if I have omitted some really good books. Please let me know what I have missed by commenting on this post with a mini review of your own.

Advanced R by Hadley Wickham - Anyone who wants to gain a deep understanding of the R language will certainly benefit from this book. More than a reference: the author seeks to provide a conceptual framework for understanding R’s structure and guide readers through R’s idiosyncratic mechanisms pointing out traps, illuminating difficult concepts and providing expert commentary.

The Art of R Programming: A Tour of Statistical Software Design by Norman Matloff – This is still my pick for the best book for people with some programming experience who want to make a serious effort at learning R. Professor Matloff’s interest in teaching the mechanics of programming, infused with his deep understanding of both the underlying computer science and statistical theory, puts this book on top.

Hands on Programming with R by Garrett Grolemund – If you are not only new to R but new to programming as well, this is the book for you. I have reviewed it more extensively here.

R For Dummies by Andrie de Vries and Joris Meys – A current, concise and insightful reference to core concepts in the R language. A really nice feature of the book is its emphasis on presenting the R ecosystem along with core R concepts. When learning anything new, it is always helpful to understand the big picture. Keep this book by your computer; when you stop referring to it you will be a pretty good R programmer.

Applied Predictive Modeling by Max Kuhn and Kjell Johnson – This book is the master text for predictive analytics, carefully walking through several modeling examples and making expert use of the extensive machine learning tools in R’s caret package. I have described the book more fully here.

Data Mining with Rattle and R by Graham Williams – This is the perfect first book for machine learning with R. The rattle GUI helps get across the machine learning concepts and also produces some pretty good R code to get you started.

Data Science in R: A Case Studies Approach to Computational Reasoning and Data Science by Deborah Nolan and Duncan Temple Lang – My most recent acquisition, this book consists of 12 non-trivial case studies organized under three themes: Data Manipulation and Modeling, Simulation Studies, and Data and Web Technologies. All of the data sets are messy, and the projects identify and develop the kind of skills required to undertake open-ended data science projects. The book doesn’t teach R programming, but it shows why R is the appropriate language for doing data science.

Practical Data Science with R by Nina Zumel and John Mount – This book is one of a kind. It moves fluidly between the various stages of the data science process from surface considerations of working with customers to the deep details of various machine learning algorithms. There is quite a bit of original R code that you can use in real projects. Most impressive is the statistical sensibility of the authors who want you to make correct inferences from your data and machine learning models as well as effectively communicate your findings to the people paying the bills.

A First Course in Statistical Programming with R by W. John Braun and Duncan Murdoch – A deceptively thin book that provides a sharp introduction to R and moves quickly through debugging, computational linear algebra, numerical optimization and linear programming.

An Introduction to Statistical Learning with Applications in R by Gareth James, Daniela Witten, Trevor Hastie and Robert Tibshirani – This book is the companion to the master text for Machine / Statistical Learning, The Elements of Statistical Learning, and contains plenty of R code. The authors have generously posted pdf versions of both of these books online.

An R Companion to Applied Regression by John Fox and Sanford Weisberg – I have been a fan since the first edition, which is possibly the best introduction to regression analysis with R ever.

Applied Meta-Analysis with R by Ding-Geng Chen and Karl E. Peace – Provides a solid introduction to basic meta-analysis that should be very helpful to people working in the field who want to move to R.

Bayesian Computation with R by Jim Albert – A concise, undergraduate level introduction to Bayesian Statistics.

Bayesian Essentials with R by Jean-Michel Marin and Christian P. Robert – This is a solid introduction to Bayesian Statistics with lots of useful code.

Data Analysis and Graphics Using R: An Example-Based Approach by John Maindonald and John Braun – A comprehensive introduction to both statistical analysis and R that is most suitable for self-learning. It is also a very handsome book. If you are a book person, this is the one to own.

Data Analysis Using Regression and Multilevel/Hierarchical Models by Andrew Gelman and Jennifer Hill – A superb book on statistical modeling that is both practical and rigorous, with a modern perspective that should appeal to everyone, Bayesians and non-Bayesians alike.

Data Manipulation with R by Phil Spector – A concise introduction to data munging using base R capabilities. This is another book to keep with you while programming.

Doing Bayesian Data Analysis: A Tutorial with R and BUGS by John K. Kruschke – This eclectic and entertaining read is a way to learn both R and Bayesian Analysis simultaneously. It provides lots of R code to build on.

Extending the Linear Model with R: Generalized Linear, Mixed Effects and Nonparametric Regression Models – Building on the author's text on linear models, this book covers a lot of ground and provides real insight.

Forecasting: Principles and Practice by Rob J Hyndman and George Athanasopoulos - Written to teach time series forecasting to a business audience, this free, online text is a beautiful example of both the open source ethos and of how R can help people with real business problems become productive with a very modest learning curve.

Introduction to Probability with R by Kenneth Baclawski – This is an eclectic little book. There is really not much R in it, but it is a modern introduction to probability theory including stochastic processes with enough R to help you teach yourself the math by experimenting. R is the really easy part of this book.

Introductory Statistics with R by Peter Dalgaard – A classic text with R code to get you doing real statistics very quickly and a great reference for both statistics and R that you will want to hang on to.

Introductory Time Series with R by Paul S.P. Cowpertwait and Andrew C. Metcalfe - Could be the best introduction to time series analysis ever.

Linear Models with R by Julian J. Faraway – A compact course on analyzing linear models using R. It contains several examples and enough R code to thoroughly analyze regression models.

Modern Applied Statistics with S by W.N. Venables and B.D. Ripley – Probably the best introduction to modern computational statistics out there. Even though it is S, most of the code will work in R.

R Cookbook by Paul Teetor – A solid introduction with recipes for carrying out data analyses and basic plots that you will want on your shelf.

R for Everyone: Advanced Analytics and Graphics by Jared P. Lander – An easy read with relevant machine learning examples that will get you started with R.

R for SAS and SPSS Users by Robert A. Muenchen. If you are still using SAS or SPSS you need this book. The author speaks your language, understands where you are coming from and will help you learn some R.

R Graphics Cookbook by Winston Chang – an indispensable reference for R visualizations of all kinds. Read a more complete review here.

R in Action by Robert I Kabacoff - A gentle introduction to R with elegant plots that are model visualizations. You can read more about it here.

Regression Modeling Strategies by Frank E. Harrell, Jr. An incredible amount of wisdom for how to do statistics backed up with mostly straightforward R code.

R Programming for Bioinformatics by Robert Gentleman – Not only for bioinformatics. This book provides insight into the structure of the R language for intermediate and advanced programmers.

Software for Data Analysis: Programming with R by John M. Chambers – A text for advanced programmers discussing philosophy and good practices and providing deep insight into R.

Statistical Analysis of Network Data with R by Eric D. Kolaczyk and Gábor Csárdi – This is an indispensable resource for analyzing network data. Containing a thorough explanation of the igraph package, it also works through exponential random graph models and other advanced topics.

Statistical Computing in C++ and R by Randall L. Eubank and Ana Kupresanin – A very approachable introduction to both R and C++ for anyone who wants to understand these languages from the perspective of numerical analysis and the nuts and bolts of linear algebra.

Statistics and Data Analysis for Financial Engineering by David Ruppert and David S. Matteson – If you are interested in financial modeling this book could be your ticket to learning R and the R packages that support time series and financial engineering.

Time Series Analysis with Applications in R by Jonathan D. Cryer and Kung-Sik Chan – A solid undergraduate level introduction to R with step-by-step R code. Very suitable for self-study.

XML and Web Technologies for Data Sciences with R by Deborah Nolan and Duncan Temple Lang – Everything a data scientist would ever really want to know about XML documents, JSON and other web technologies and how you can work with them using R.

by Andrie de Vries

We have written on several occasions about AzureML, the Microsoft machine learning studio that is part of the Cortana Analytics suite:

- Running R in the Azure ML cloud
- Call R functions from any application with the AzureML package
- Using miniCRAN in Azure ML

In September we announced that the AzureML package for R allows you to publish R functions as Azure web services. This is a brilliantly easy way to deploy your functions to other users and clients. For example, you can publish a function from R, then consume that function from Excel!

I am pleased to announce that we have completed a significant rewrite of the AzureML package. This rewrite adds several enhancements. Specifically, AzureML now also allows you to interact with:

- Workspace: connect to and manage AzureML workspaces
- Datasets: upload and download datasets to and from AzureML workspaces
- Experiments: download intermediate datasets from AzureML experiments

We have also significantly enhanced the functionality to publish and consume models:

- Publish: define a custom function or train a model and publish it as an Azure Web Service
- Consume: use available web services from R in a variety of convenient formats

This version of the AzureML package adds new functionality to interact with datasets and experiments.

The code to do this is very simple:

```
# Create a workspace object
ws <- workspace()
# List datasets
datasets(ws, filter = "sample")
# Download a dataset
frame <- download.datasets(ws, name = "Forest fires data")
head(frame)
```

As expected, this displays the first few lines of the resulting data frame:

```
  X Y month day FFMC  DMC    DC  ISI temp RH wind rain area
1 7 5   mar fri 86.2 26.2  94.3  5.1  8.2 51  6.7  0.0    0
2 7 4   oct tue 90.6 35.4 669.1  6.7 18.0 33  0.9  0.0    0
3 7 4   oct sat 90.6 43.7 686.9  6.7 14.6 33  1.3  0.0    0
4 8 6   mar fri 91.7 33.3  77.5  9.0  8.3 97  4.0  0.2    0
5 8 6   mar sun 89.3 51.3 102.2  9.6 11.4 99  1.8  0.0    0
6 8 6   aug sun 92.3 85.3 488.0 14.7 22.2 29  5.4  0.0    0
```

We made many improvements to the mechanism underlying the functionality to publish a web service.

In particular, it is now very easy to provide a data frame as input to the publishing function. You no longer have to specify the classes of every column. Instead, the publishWebService() function automatically determines the column classes of the inputs as well as the results.

To illustrate, here is an example from the help:

```
ws <- workspace()
# Publish a simple model using the lme4::sleepstudy data
library(lme4)
set.seed(1)
train <- sleepstudy[sample(nrow(sleepstudy), 120), ]
m <- lm(Reaction ~ Days + Subject, data = train)
# Define a prediction function to publish based on the model:
sleepyPredict <- function(newdata){
  predict(m, newdata = newdata)
}
ep <- publishWebService(ws, fun = sleepyPredict, name = "sleepy lm",
                        inputSchema = sleepstudy,
                        data.frame = TRUE)
# OK, try this out, and compare with raw data
ans <- consume(ep, sleepstudy)$ans
plot(ans, sleepstudy$Reaction)
```

Right now, the new version is only available on GitHub. To install the package, use:

```
if(!require("devtools")) install.packages("devtools")
devtools::install_github("RevolutionAnalytics/AzureML")
```

Additional resources:

The package has extensive help with many examples as well as a vignette. You can also:

- view the vignette at Getting Started with the AzureML Package.
- take a look at the bug bash instructions, a walk-through guide with installation and configuration instructions as well as sample code

Github: AzureML, An R interface to AzureML experiments, datasets, and web services

Bob Horton

Sr Data Scientist, Microsoft

Wikipedia describes Simpson’s paradox as “a trend that appears in different groups of data but disappears or reverses when these groups are combined.” Here is the figure from the top of that article (you can click on the image in Wikipedia then follow the “more details” link to find the R code used to generate it. There is a lot of R in Wikipedia).

I rearranged it a bit to put the values in a data frame, to make it a bit easier to think of the “color” column as a confounding variable:

x | y | color |
---|---|---|
1 | 6 | 1 |
2 | 7 | 1 |
3 | 8 | 1 |
4 | 9 | 1 |
8 | 1 | 2 |
9 | 2 | 2 |
10 | 3 | 2 |
11 | 4 | 2 |
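For reference, the eight points above can be entered directly as the `simpson_data` data frame used in the model calls:

```r
# The figure's points, with "color" as the confounding variable
simpson_data <- data.frame(
  x     = c(1, 2, 3, 4, 8, 9, 10, 11),
  y     = c(6, 7, 8, 9, 1, 2, 3, 4),
  color = c(1, 1, 1, 1, 2, 2, 2, 2)
)
```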

If we do not consider this confounder, we find that the coefficient of x is negative (the dashed line in the figure above):

`coefficients(lm(y ~ x, data=simpson_data))`

```
## (Intercept) x
## 8.3333333 -0.5555556
```

If we do take the confounder into account, we see the coefficient of x is positive:

`coefficients(lm(y ~ x + color, data=simpson_data))`

```
## (Intercept) x color
## 17 1 -12
```

In his book *Causality*, Judea Pearl makes a more sweeping statement regarding Simpson’s paradox: “Any statistical relationship between two variables may be reversed by including additional factors in the analysis.” [Pearl2009]

That sounds fun; let’s try it.

First we’ll make variables `x` and `y` with a simple linear relationship. I’ll use the same slopes and intercepts as in the Wikipedia figure, both to show the parallel and to demonstrate the incredible cosmic power I have to bend coefficients to my will.

```
set.seed(1)
N <- 3000
x <- rnorm(N)
m <- -0.5555556
b <- 8.3333333
y <- m * x + b + rnorm(length(x))
plot(x, y, col="gray", pch=20, asp=1)
fit <- lm(y ~ x)
abline(fit, lty=2, lwd=2)
```

When we look at the slope of the regression line determined by fitting the model, it is almost exactly equal to the constant `m` that we used to determine `y`.

`coefficients(fit)`

```
## (Intercept) x
## 8.3284021 -0.5358175
```

We get out what we put in; the coefficient of x is essentially the slope we originally gave `y` when we generated it (-0.5555556). This is the ‘effect’ of `x`, in that a one unit increase in `x` apparently increases `y` by this amount.

Now think about how to concoct a confounding variable to reverse the coefficient of `x`. This figure shows one way to approach the problem: group the points into a set of parallel stripes, with the stripes sloping in a different direction from the overall dataset:

```
m_new <- 1 # the new coefficient we want x to have
cdf <- confounded_data_frame(x, y, m_new, num_grp=10) # see function below
striped_scatterplot(y ~ x, cdf) # also see below
```

The stripes were made by specifying a reference line with a slope equal to the x-coefficient we want to achieve, and calculating the distance to that line for each point. Putting these distances into categories (by rounding off some multiple of the distance) then groups the points into stripes (shown as colors in the figure). A regression line was then fitted separately to the set of points within each stripe. The regression lines for the stripes on the very ends can be a bit wild, since these groups are very small and scattered, but the ones near the center, representing the majority of the data points, have a quite consistent slope.

The equation for determining the distance from a point to a line is (of course) right there in Wikipedia.

With a little rearranging to express the line in terms of y-intercept (`b`) and slope (`m`), and leaving off the absolute value so that points below the line have negative distances (and thus end up in a different group from the stripe with a positive distance of the same magnitude), we get this function:

```
point_line_distance <- function(b, m, x, y)
(y - (m*x + b))/sqrt(m^2 + 1)
```
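As a quick sanity check (repeating the definition so the snippet is self-contained): the point (0, 1) lies at signed distance 1/sqrt(2) from the line y = x, and (1, 0) at -1/sqrt(2):

```r
point_line_distance <- function(b, m, x, y)
  (y - (m*x + b))/sqrt(m^2 + 1)

point_line_distance(0, 1, 0, 1)  # positive: the point is above the line
point_line_distance(0, 1, 1, 0)  # negative: the point is below the line
```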

Here are functions for putting the points into stripewise groups, determining the regression coefficients for each group, and putting it all together into a figure:

```
confounded_data_frame <- function(x, y, m, num_grp){
  b <- 0  # intercept doesn't matter
  d <- point_line_distance(b, m, x, y)
  d_scaled <- 0.0005 + 0.999 * (d - min(d))/(max(d) - min(d))  # avoid 0 and 1
  data.frame(x = x, y = y,
             group = as.factor(sprintf("grp%02d", ceiling(num_grp * d_scaled))))
}

find_group_coefficients <- function(data){
  coef <- t(sapply(levels(data$group),
                   function(grp) coefficients(lm(y ~ x, data = data[data$group == grp, ]))))
  coef[!is.na(coef[, 1]) & !is.na(coef[, 2]), ]
}

striped_scatterplot <- function(formula, grouped_data){
  # blue on top and red on bottom, to match the Wikipedia figure
  colors <- rev(rainbow(length(levels(grouped_data$group)), end = 2/3))
  plot(formula, grouped_data, bg = colors[grouped_data$group], pch = 21, asp = 1)
  grp_coef <- find_group_coefficients(grouped_data)
  # if some coefficients get dropped, colors won't match exactly
  for (r in 1:nrow(grp_coef))
    abline(grp_coef[r, 1], grp_coef[r, 2], col = colors[r], lwd = 2)
}
```

Note that the regression lines for each group are not exactly parallel to the stripes. This is because linear regression is about minimizing the squared error on the y-axis, not the distance of points from the line. However, the thinner the stripes are, the closer the group regression lines are to our target slope. If we make a large number of thin stripes, the coefficient of `x` when the groups are taken into account is essentially the same as the slope of the reference line we used to orient the stripes:

```
cdf100 <- confounded_data_frame(x, y, m_new, num_grp=100)
# without confounder
coefficients(lm(y ~ x, cdf100))['x']
```

```
## x
## -0.5358175
```

```
# with confounder
coefficients(lm(y ~ x + group, cdf100))['x']
```

```
## x
## 0.9961566
```

This approach gives us the power to synthesize simulated confounders that can change the coefficient of `x` to pretty much any value we choose when a model is fitted with the confounder taken into account. Plus, it makes pretty rainbows.

While Simpson’s Paradox is typically described in terms of categorical confounders, the same reversal principle applies to continuous confounders. But that’s a topic for another post.

[Pearl2009]: Pearl, J. Causality: Models, Reasoning and Inference (2ed). Cambridge University Press, New York 2009.

Two recent surveys — one based on LinkedIn skills data, and another a direct survey of data miners — show that R remains the most popular software for statistical data analysis.

In a study of skills associated with LinkedIn profiles by RJMetrics (also reported on Forbes), "data analysis" was unsurprisingly the skill most associated with self-proclaimed data scientists. Of the specific software skills listed, R was the most common, closely followed by Python.

Meanwhile, Karl Rexer is preparing the results from his latest survey of data analytics professionals. When the survey was last published in 2013, R was the most popular tool and it was growing rapidly in popularity. Karl has shared some preliminary data from the upcoming survey report (now titled the "2015 Data Science Survey"), where R's popularity continues to surge: 76% of respondents use R for data analysis, and 36% use R as their primary tool (compared to 7% for the runner-up, SAS). R has continued to grow on both measures in every year since the Rexer survey was launched:

We'll have more coverage on the 2015 Rexer Data Science Survey when the complete results are published later this year.