There's no denying that for a language as popular as R, it has more than its fair share of quirks. If you've ever wondered *why*, for example, R has a non-standard assignment operator, why periods are allowed in symbols (and don't signify method calls), or why character data imports as factors (not strings) by default, then this blog post by Oliver Keyes is for you. If you're new to R, it's worth checking out for some common traps in R's syntax. And if you're a longtime R user, it's an interesting (and entertaining) look into R's history, and how various influences (including ancient keyboards) have shaped the design of the world's most popular language for data science.

Oliver Keyes: Rbirtary Standards

If you're new to the concept of predictive models, or just want to review the background on how data scientists learn from past data to predict the future, you may be interested in my talk from the Data Insights Summit, Introduction to Real-Time Predictive Modeling.

In the talk above I gave a brief introduction to the R language and mentioned several applications using R. If you'd like to get started with R, you might like to follow along with my co-blogger Joseph Rickert's beginners workshop, Supercharge Your Data Analysis With R:

You can follow along with Joe's session by downloading Microsoft R Open and using the scripts in this GitHub repository.

by Joseph Rickert

If you are an R user and work for an enterprise where Microsoft SQL Server is important, it is extremely helpful to have easy access to SQL Server databases. Over the past year, we have run several posts on this topic, including a comprehensive four-part series from Microsoft's Gregory Vandenbrouck on using various flavors of SQL with Azure as a data source (Part1, Part2, Part3 and Part4), as well as several posts on using the advanced features of Microsoft R Server (formerly Revolution R Enterprise) with SQL Server 2016. (See, for example, this recent post: Credit Card Fraud Detection with SQL Server 2016 R Services.)

In this post, I would just like to describe how to connect to an Azure SQL database from your local R session. Setting up an Azure hosted database is the easiest way I know for an R user to get started with Microsoft SQL server. You don't have to install the database, and very little SQL knowledge is necessary to begin working with a database. All of the heavy lifting is done by the Azure platform.

The only prerequisite is an Azure account. If you don't already have one, signing up for a free trial account (which you can do here) will give you enough credits to experiment with SQL Server from R.

The first step after getting an account is to login to the Azure portal. You should see a screen that looks like the figure at right.

From here, clicking on the SQL database icon and selecting "New" should bring you to a screen that looks similar to the one below. The first time you create a database you will be asked to provide a server admin logon name and a password. Remember these because they will be necessary to form your complete connection string.

Next, select a name for your database, and click on the tab "blank database". This will bring you to a screen that looks like the figure below. Copy the text in the "ODBC" box. This is everything that you need to form a complete connection string except for the password you created above. Note that the text in the "ODBC" box will show your logon ID but not your password.

Now that the preliminaries are out of the way, you should be able to populate your database with the following code. The RODBC package is used to communicate with SQL Server. The nycflights13 package provides a convenient test data set with over 300,000 rows and 16 columns. The connection string is what I copied from the Azure ODBC text box described above, except that I have replaced my login ID with the text "my_ID". (Let this serve as a gentle reminder not to store your credentials in source code.) The command odbcDriverConnect() opens the ODBC connection to the database, but the real work is done by sqlSave(), which populates the database with the flights data. This took about 40 minutes running from an RStudio R script on my Lenovo ThinkPad.

# CONNECT TO AN AZURE SQL DATABASE
library(RODBC)        # Provides database connectivity
library(nycflights13) # Some sample data
library(dplyr)        # Only used here for the nice format of the head() output

# The connection string comes from the Azure ODBC text box
connectionString <- "Driver={SQL Server Native Client 11.0};Server=tcp:hzgi1l8nwn.database.windows.net,1433;Database=Test_R2;Uid=your_logon_ID@hzgi1l8nwn;Pwd={your_password_here};Encrypt=yes;TrustServerCertificate=no;Connection Timeout=30;"

# Open the RODBC connection
myconn <- odbcDriverConnect(connectionString)

# Get some data: the New York City 2013 flight data from the package nycflights13
dim(flights)
# [1] 336776 16
head(flights)
# Source: local data frame [6 x 16]
#
#    year month   day dep_time dep_delay arr_time arr_delay carrier tailnum
#   (int) (int) (int)    (int)     (dbl)    (int)     (dbl)   (chr)   (chr)
# 1  2013     1     1      517         2      830        11      UA  N14228
# 2  2013     1     1      533         4      850        20      UA  N24211
# 3  2013     1     1      542         2      923        33      AA  N619AA
# 4  2013     1     1      544        -1     1004       -18      B6  N804JB
# 5  2013     1     1      554        -6      812       -25      DL  N668DN
# 6  2013     1     1      554        -4      740        12      UA  N39463
# Variables not shown: flight (int), origin (chr), dest (chr), air_time (dbl),
#   distance (dbl), hour (dbl), minute (dbl)

# Save the table to the database
sqlSave(channel = myconn, dat = flights, tablename = "flightsTbl")

The following code executes a SQL query and displays some of the data returned. odbcCloseAll() closes the ODBC connection.

# Fetch flights from the first two months of the year from the Azure SQL database
sqlQuery_m1 <- "SELECT * FROM flightsTbl WHERE month < 3"
m1 <- sqlQuery(myconn, sqlQuery_m1)
head(m1)
#   rownames year month day dep_time dep_delay arr_time arr_delay carrier
# 1        1 2013     1   1      517         2      830        11      UA
# 2        2 2013     1   1      533         4      850        20      UA
# 3        3 2013     1   1      542         2      923        33      AA
# 4        4 2013     1   1      544        -1     1004       -18      B6
# 5        5 2013     1   1      554        -6      812       -25      DL
# 6        6 2013     1   1      554        -4      740        12      UA
#   tailnum flight origin dest air_time distance hour minute
# 1  N14228   1545    EWR  IAH      227     1400    5     17
# 2  N24211   1714    LGA  IAH      227     1416    5     33
# 3  N619AA   1141    JFK  MIA      160     1089    5     42
# 4  N804JB    725    JFK  BQN      183     1576    5     44
# 5  N668DN    461    LGA  ATL      116      762    5     54
# 6  N39463   1696    EWR  ORD      150      719    5     54
dim(m1)
# [1] 51955 17
odbcCloseAll()

To go further, have a look at this tutorial on writing Microsoft SQL queries, or get a more in depth introduction to Microsoft SQL Server here.

We had a fantastic turnout to last week's webinar, Introduction to Microsoft R Open. If you missed it, you can watch the replay below. In the talk, I give some background on the R language and its applications, describe the performance and reproducibility benefits of Microsoft R Open, and demonstrate the basics of the R language along with a more in-depth demo of producing a beautiful weather data chart with R.

You can also check out the slides (and follow the embedded links) below:

Check out the other webinars in the Microsoft R Series as well.

Microsoft Webinars: Introduction to Microsoft R Open

by Joseph Rickert

There are a number of R packages devoted to sophisticated applications of Markov chains. These include msm and SemiMarkov for fitting multistate models to panel data, mstate for survival analysis applications, TPmsm for estimating transition probabilities for 3-state progressive disease models, heemod for applying Markov models to health care economic applications, HMM and depmixS4 for fitting hidden Markov models, and mcmc for working with Markov chain Monte Carlo. All of these assume considerable knowledge of the underlying theory. To my knowledge, only DTMCPack and the relatively recent markovchain package were written to facilitate basic computations with Markov chains.

In this post, we’ll explore some basic properties of discrete time Markov chains using the functions provided by the markovchain package supplemented with standard R functions and a few functions from other contributed packages. “Chapter 11”, of Snell’s online probability book will be our guide. The calculations displayed here illustrate some of the theory developed in this document. In the text below, section numbers refer to this document.

A large part of working with discrete time Markov chains involves manipulating the matrix of transition probabilities associated with the chain. The first section of code below replicates the Oz transition probability matrix from section 11.1 and uses the plotmat() function from the diagram package to illustrate it. Then, the efficient operator %^% from the expm package is used to raise the Oz matrix to the third power. Finally, left-multiplying Oz^3 by the distribution vector u = (1/3, 1/3, 1/3) gives the weather forecast three days ahead.

library(expm)
library(markovchain)
library(diagram)
library(pracma)

stateNames <- c("Rain", "Nice", "Snow")
Oz <- matrix(c(.5, .25, .25, .5, 0, .5, .25, .25, .5),
             nrow = 3, byrow = TRUE)
row.names(Oz) <- stateNames; colnames(Oz) <- stateNames
Oz
#      Rain Nice Snow
# Rain 0.50 0.25 0.25
# Nice 0.50 0.00 0.50
# Snow 0.25 0.25 0.50

plotmat(Oz, pos = c(1, 2),
        lwd = 1, box.lwd = 2,
        cex.txt = 0.8,
        box.size = 0.1,
        box.type = "circle",
        box.prop = 0.5,
        box.col = "light yellow",
        arr.length = .1,
        arr.width = .1,
        self.cex = .4,
        self.shifty = -.01,
        self.shiftx = .13,
        main = "")

Oz3 <- Oz %^% 3
round(Oz3, 3)
#       Rain  Nice  Snow
# Rain 0.406 0.203 0.391
# Nice 0.406 0.188 0.406
# Snow 0.391 0.203 0.406

u <- c(1/3, 1/3, 1/3)
round(u %*% Oz3, 3)
# 0.401 0.198 0.401

The igraph package can also be used to draw Markov chain diagrams, but I prefer the "drawn on a chalkboard" look of plotmat.
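As a minimal sketch of the igraph alternative (assuming the igraph package is installed; the Oz matrix is repeated here so the snippet stands alone):

```r
library(igraph)

# The Oz transition matrix from section 11.1
stateNames <- c("Rain", "Nice", "Snow")
Oz <- matrix(c(.5, .25, .25, .5, 0, .5, .25, .25, .5),
             nrow = 3, byrow = TRUE,
             dimnames = list(stateNames, stateNames))

# Build a directed, weighted graph from the transition matrix;
# edges are created for the nonzero transition probabilities
g <- graph_from_adjacency_matrix(Oz, weighted = TRUE, mode = "directed")

# Plot the chain with transition probabilities as edge labels
plot(g, edge.label = round(E(g)$weight, 2))
```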

This next block of code reproduces the 5-state Drunkard's Walk example from section 11.2, which presents the fundamentals of absorbing Markov chains. First, the transition matrix describing the chain is instantiated as an object of the S4 class markovchain. Then, functions from the markovchain package are used to identify the absorbing and transient states of the chain and place the transition matrix, P, into canonical form.

p <- c(.5, 0, .5)
dw <- c(1, rep(0, 4), p, 0, 0, 0, p, 0, 0, 0, p, rep(0, 4), 1)
DW <- matrix(dw, 5, 5, byrow = TRUE)
DWmc <- new("markovchain",
            transitionMatrix = DW,
            states = c("0", "1", "2", "3", "4"),
            name = "Drunkard's Walk")
DWmc
# Drunkard's Walk
# A 5 - dimensional discrete Markov Chain with following states
# 0 1 2 3 4
# The transition matrix (by rows) is defined as follows
#     0   1   2   3   4
# 0 1.0 0.0 0.0 0.0 0.0
# 1 0.5 0.0 0.5 0.0 0.0
# 2 0.0 0.5 0.0 0.5 0.0
# 3 0.0 0.0 0.5 0.0 0.5
# 4 0.0 0.0 0.0 0.0 1.0

# Determine transient states
transientStates(DWmc)
# [1] "1" "2" "3"

# Determine absorbing states
absorbingStates(DWmc)
# [1] "0" "4"

In canonical form, the transition matrix, P, is partitioned into the Identity matrix, I, a matrix of 0’s, the matrix, Q, containing the transition probabilities for the transient states and a matrix, R, containing the transition probabilities for the absorbing states.

Next, we find the fundamental matrix, N, by inverting (I - Q). For each transient state j, n_{ij} gives the expected number of times the process is in state j given that it started in transient state i. u_{i} is the expected time until absorption given that the process starts in state i. Finally, we compute the matrix B, where b_{ij} is the probability that the process will be absorbed in state j given that it starts in state i.

# Extract the matrix Q (or R) from a transition matrix in canonical form
getRQ <- function(M, type = "Q"){
  if(length(absorbingStates(M)) == 0) stop("Not an absorbing matrix")
  tm <- M@transitionMatrix
  d <- diag(tm)
  m <- max(which(d == 1))
  n <- length(d)
  ifelse(type == "Q",
         A <- tm[(m+1):n, (m+1):n],
         A <- tm[(m+1):n, 1:m])
  return(A)
}

# Put DWmc into canonical form
P <- canonicForm(DWmc)
P
Q <- getRQ(P)

# Find the fundamental matrix
I <- diag(dim(Q)[2])
N <- solve(I - Q)
N
#     1 2   3
# 1 1.5 1 0.5
# 2 1.0 2 1.0
# 3 0.5 1 1.5

# Calculate time to absorption
c <- rep(1, dim(N)[2])
u <- N %*% c
u
# 1 3
# 2 4
# 3 3

R <- getRQ(P, "R")
B <- N %*% R
B
#      0    4
# 1 0.75 0.25
# 2 0.50 0.50
# 3 0.25 0.75

For section 11.3, which deals with regular and ergodic Markov chains, we return to Oz and provide four options for calculating the steady state, or limiting, probability distribution for this regular transition matrix. The first three options involve standard methods which are readily available in R. Method 1 uses %^% to raise the matrix Oz to a sufficiently high power. Method 2 computes the eigenvector associated with the eigenvalue 1, and method 3 uses the nullspace() function from the pracma package to compute the null space, or kernel, of the linear transformation associated with the matrix. The fourth method uses the steadyStates() function from the markovchain package. To use this function, we first convert Oz into a markovchain object.

# 11.3 Ergodic Markov Chains
# Four methods to find the steady state distribution

# Method 1: compute powers of the matrix
round(Oz %^% 6, 2)
#      Rain Nice Snow
# Rain  0.4  0.2  0.4
# Nice  0.4  0.2  0.4
# Snow  0.4  0.2  0.4

# Method 2: compute the eigenvector of eigenvalue 1
eigenOz <- eigen(t(Oz))
ev <- eigenOz$vectors[, 1] / sum(eigenOz$vectors[, 1])
ev

# Method 3: compute the null space of t(P - I)
I <- diag(3)
ns <- nullspace(t(Oz - I))
ns <- round(ns / sum(ns), 2)
ns

# Method 4: use the steadyStates() function in the markovchain package
OzMC <- new("markovchain",
            states = stateNames,
            transitionMatrix = matrix(c(.5, .25, .25, .5, 0, .5, .25, .25, .5),
                                      nrow = 3,
                                      byrow = TRUE,
                                      dimnames = list(stateNames, stateNames)))
steadyStates(OzMC)

The steadyStates() function seems to be reasonably efficient for fairly large Markov chains. The following code creates a 5,000 row by 5,000 column regular Markov matrix. On my modest Lenovo ThinkPad ultrabook it took a little less than 2 minutes to create the markovchain object and about 11 minutes to compute the steady state distribution.

# Create a large random regular matrix
randReg <- function(N){
  M <- matrix(runif(N^2, min = 1, max = N), nrow = N, ncol = N)
  rowS <- rowSums(M)
  regM <- M / rowS   # normalize rows so each sums to 1
  return(regM)
}

N <- 5000
M <- randReg(N)
# rowSums(M)
system.time(regMC <- new("markovchain", states = as.character(1:N),
                         transitionMatrix = M,
                         name = "M"))
#   user  system elapsed
#  98.33    0.82   99.46
system.time(ss <- steadyStates(regMC))
#   user  system elapsed
# 618.47    0.61  640.05

We conclude this little Markov Chain excursion by using the rmarkovchain() function to simulate a trajectory from the process represented by this large random matrix and plot the results. It seems that this is a reasonable method for simulating a stationary time series in a way that makes it easy to control the limits of its variability.

# Sample from regMC
regMCts <- rmarkovchain(n = 1000, object = regMC)
regMCtsDf <- as.data.frame(regMCts, stringsAsFactors = FALSE)
regMCtsDf$index <- 1:1000
regMCtsDf$regMCts <- as.numeric(regMCtsDf$regMCts)

library(ggplot2)
p <- ggplot(regMCtsDf, aes(index, regMCts))
p + geom_line(colour = "dark red") +
  xlab("time") +
  ylab("state") +
  ggtitle("Random Markov Chain")

For more on the capabilities of the markovchain package do have a look at the package vignette. For more theory on both discrete and continuous time Markov processes illustrated with R see Norm Matloff's book: From Algorithms to Z-Scores: Probabilistic and Statistical Modeling in Computer Science.

by Joseph Rickert

Early October: somewhere the leaves are turning brilliant colors, temperatures are cooling down and that back-to-school feeling is in the air. And for more people than ever before, it is going to seem like a good time to commit to really learning R. I have some suggestions for R courses below, but first: What does it mean to learn R anyway? My take is that the answer depends on a person's circumstances and motivation.

I find the following graphic to be helpful in sorting things out.

The X axis is time on Malcolm Gladwell's "Outliers" scale. His idea is that it takes 10,000 hours of real effort to master anything, be it R, Python or rock and roll guitar. The Y axis lists increasingly difficult R tasks, and the arrows within the plot area label increasingly proficient types of R users.

The point I want to make here is that a significant amount of very productive R work happens in the area around the red ellipse. So, while there is no avoiding the "10,000" hours of hard work required to become an R Jedi knight, a curious and motivated person can master enough R to accomplish his or her programming goals with a more modest commitment. There are three main reasons for this:

- R's functional programming style is very well suited for statistical modeling, data visualization and data science tasks
- The 7,000+ packages available in the R ecosystem provide tens of thousands of functions that make it possible to accomplish quite a bit without having to write much code
- Numerous, high-quality books and online materials are devoted to teaching statistical theory and data science with R

If you have a background in some area of statistics or data science, a viable strategy for learning R is to identify a resource that works for you and just jump into the middle of things, picking up R as you go along.

The lists below link to courses that can either start you on a formal programming path, or help you become a productive R user in a particular application area. Some of the courses are "live events" that you take with a cohort of students, others are set up for self study.

The courses devoted to teaching R as a programming language are

- The Data Scientist’s toolbox
- R Programming
- Introduction to R Programming
- Introduction to R
- R Programming - Introduction 1
- Introduction a la programacion estadistica con R
- O’Reilly Code School

The first two courses above are from Coursera's Data Science Specialization sequence. Taught by Roger Peng, Jeff Leek and Brian Caffo, they are probably the gold standard for MOOC R courses. I am a little late with this post: The Data Scientist's Toolbox started this past Monday, but there is still time to catch up. The third course, Introduction to R Programming, is a relatively new edX course from Microsoft's online offerings that is getting great reviews. The fourth course on the list is a solid introduction to R from DataCamp. R Programming - Introduction 1 is a beginner's introduction to R taught by Paul Murrell or Tal Galili. Next listed is a Spanish-language introduction to R from Coursera, followed by O'Reilly's interactive Code School course.

These next three lists contain courses from DataCamp and statistics.com, and online resources from RStudio, that introduce more advanced features of R by building on basic R programming skills. Note that the final course on the DataCamp list introduces Big Data features of Revolution R Enterprise, which is available in the Azure Marketplace.

- Intermediate R
- Data Visualization in R with ggvis
- Data Manipulation with dplyr
- Data Analysis in R, the data.table Way
- Reporting with R Markdown
- Big data Analysis with Revolution R Enterprise

This next section lists courses from the major MOOCs, and the non-MOOCs DataCamp and statistics.com, that use R to teach various quantitative disciplines.

**Coursera Courses**

- Data Analysis and Statistical Inference
- Developing Data Products
- Exploratory Data Analysis
- Getting and Cleaning Data
- Introduction to Computational Finance and Financial Econometrics
- Measuring Causal Effects in the Social Sciences
- Regression Models
- Reproducible Research
- Statistical Inference
- Statistics One

**edX Courses**

- Data Analysis for Life Sciences 1: Statistics and R
- Data Analysis for life Sciences 2: Introduction to Linear Models and Matrix Algebra
- Data Analysis for life Sciences 6: High-performance Computing for Reproducible Genomics
- Explore Statistics with R
- Sabermetrics 101: Introduction to Baseball Analytics

**Udacity Course**

DataCamp

statistics.com

Finally, here are a couple of Google apps and Swirl, a new platform for teaching and learning R, that may be useful for learning on the go.

It's time to "go back to school" and make some headway against those 10,000 hours.

by Jens Carl Streibig, Professor Emeritus at University of Copenhagen

*Editor's introduction: for background on the miniCRAN package, see our previous blog posts:*

miniCRAN saves my neck when I am out in regions where seamlessly running internet is an exception rather than the rule. R is definitely the programme to offer universities and research institutions in agriculture because it is open source, involves no money, and the help, although sometimes a bit nerdy, is easy to access. I usually tell my students not to buy books on specific topics, because R is dynamic and within a couple of years some of the functions in the book become obsolete and thus discourage the average user. Look instead at the documentation at r-project.org or rseek.org.

I have recently been teaching in Turkey and Iran. Sometimes the internet is OK; other times it is not. Previously it was a struggle to get particular packages downloaded and installed via RStudio. In a workshop in Iran we could not download the essential packages. A shrewd student downloaded the dependencies and distributed the zip files to her fellow students. After some glitches we got everything up and running.

When I became aware of miniCRAN at the useR! 2015 meeting, almost all my R problems were solved. With help from the maintainer, Andrie de Vries at Revolution Analytics, we got it to work when giving a workshop on dose-response, also in Iran, two weeks ago. Everything went all right for those students who could not install the packages at home. Some Windows versions were in a poor state of repair, so they could not run RStudio and we had to provide all the dependencies; but that was no problem, as they were all in the miniCRAN repository.

by Andrie de Vries

Every once in a while I try to remember how to do interpolation using R. This is not something I do frequently in my workflow, so I do the usual sequence of finding the appropriate help page:

?interpolate

Help pages:

stats::approx Interpolation Functions

stats::NLSstClosestX Inverse Interpolation

stats::spline Interpolating Splines

So, the help tells me to use approx() to perform linear interpolation. This is an interesting function, because the help page also describes approxfun() that does the same thing as approx(), except that approxfun() returns a function that does the interpolation, whilst approx() returns the interpolated values directly.

(In other words, approxfun() acts a little bit like a predict() method for approx().)
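A quick sketch of the difference, using made-up sample points on the curve y = x^2:

```r
# Some sample points
x <- 1:10
y <- x^2

# approx() returns the interpolated values directly:
# linear interpolation between (2, 4) and (3, 9) gives 6.5
approx(x, y, xout = 2.5)$y

# approxfun() instead returns a function that performs the interpolation
f <- approxfun(x, y)
f(2.5)
```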

The help page for approx() also points to stats::spline() to do spline interpolation and from there you can find smooth.spline() for smoothing splines.

Talking about smoothing, base R also contains the function smooth(), an implementation of running median smoothers (an algorithm proposed by Tukey).

Finally, I want to mention loess(), a function that fits local polynomial regressions. (loess() underlies stat_smooth(), one of the defaults in the ggplot2 package.)

I set up a little experiment to see how the different functions behave. To do this, I simulate some random data in the shape of a sine wave. Then I use each of these functions to interpolate or smooth the data.

On my generated data, the interpolation functions approx() and spline() give quite ragged results. The running median function smooth() doesn't do much better; there is simply too much variance in the data.

The smooth.spline() function does a great job at finding a smoother using default values.

The last two plots illustrate loess(), the local regression estimator. Notice that loess() needs a tuning parameter (span). The lower the value of the span parameter, the smaller the number of points used in each local fit, so with a span of 0.1 the curve follows the data much more closely than with a span of 0.5.

Here is the code:
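The original listing did not survive the formatting, so what follows is a sketch of such an experiment; the noisy sine wave, the panel layout and the span values (0.5 and 0.1) are my assumptions based on the description above:

```r
# Simulate noisy data in the shape of a sine wave
set.seed(42)
x <- seq(0, 2 * pi, length.out = 100)
y <- sin(x) + rnorm(100, sd = 0.3)
xout <- seq(0, 2 * pi, length.out = 500)

par(mfrow = c(2, 3))

# Interpolation: approx() (linear) and spline()
plot(x, y, main = "approx()")
lines(approx(x, y, xout = xout), col = "red")
plot(x, y, main = "spline()")
lines(spline(x, y, xout = xout), col = "red")

# Running median smoother
plot(x, y, main = "smooth()")
lines(x, smooth(y), col = "red")

# Smoothing spline with default parameters
plot(x, y, main = "smooth.spline()")
lines(smooth.spline(x, y), col = "red")

# Local regression at two values of the span tuning parameter
for (s in c(0.5, 0.1)) {
  plot(x, y, main = paste("loess(), span =", s))
  fit <- loess(y ~ x, span = s)
  lines(xout, predict(fit, newdata = data.frame(x = xout)), col = "red")
}
```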

If you've thought about learning the R language but didn't know how to start, there's a new, free course on edX that starts you from the R basics and lets you learn R by trying R as you go.

Presented by DataCamp and Microsoft, the course starts from the very basics of R (arithmetic on the command line, creating variables), progresses through the basic data types (vector, matrix, factor, list and data frame) and ends with a module on data visualization. The course consists of lecture-style videos interspersed with quizzes to test your knowledge. And best of all, you can try out what you've learned at the R command line using the DataCamp online interface -- so you don't even have to install R yourself! A browser is all you need.

It's perfect for newcomers to R, even if you don't have experience in other programming languages. If you want to get a (paid) certification you'll need to complete the course by September 1, but you can view all of the course materials, quizzes and labs anytime for free. You can learn more about the course at the DataCamp blog or get started now at the link below.

by John Mount

Data Scientist, Win-Vector LLC

R has a number of very good packages for manipulating and aggregating data (plyr, sqldf, RevoScaleR, data.table, and more), but when it comes to accumulating results the beginning R user is often at sea. The R execution model is a bit exotic so many R users are very uncertain which methods of accumulating results are efficient and which are inefficient.
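To illustrate the issue (this is my own sketch, not code from the linked article): growing a vector one element at a time forces R to copy it on every append, while pre-allocating the result, or letting vapply() handle the accumulation, avoids that cost entirely:

```r
n <- 10000

# Inefficient: grow the result with c(), copying the vector on each append
r1 <- c()
for (i in 1:n) r1 <- c(r1, i^2)

# Better: pre-allocate the full vector, then fill it in place
r2 <- numeric(n)
for (i in 1:n) r2[i] <- i^2

# Idiomatic: let vapply() do the accumulation
r3 <- vapply(1:n, function(i) i^2, numeric(1))

# All three give the same result; only their costs differ
stopifnot(identical(r1, r2), identical(r2, r3))
```

Timing the first two loops with system.time() for increasing n makes the quadratic cost of the c() approach easy to see.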

In this latest "R as it is" we will quickly become expert at efficiently accumulating results in R. To read more please click here.