by Lixun Zhang (Data Scientist), Ye Xing (Senior Data Scientist) and Tao Wu (Principal Data Scientist Manager), all at Microsoft
Editor's Note: To learn more about migrating from SAS to R, there will be a live webinar presented by Lixun and Ye tomorrow (Tuesday, November 15). Register to attend the webinar here.
R has been gaining popularity among data professionals in recent years in industries such as financial services, as shown for example in this survey from executive search firm Burtch Works. In this blog post we share some key considerations in migrating from SAS to R for a financial services workload. More specifically, we focus on the data manipulation aspects of the migration.
One of the most important differences between SAS and R is how data are processed. Take the process of calculating the sum of two variables as an example, which is shown by the SAS and R code below.
SAS processes data row by row, using an implied loop in the Data Step. The following graph shows how it executes the operation on a dataset with 3 rows. It starts by calculating the sum for row 1, writes the result for row 1 to the output table, then does the calculation for row 2, and repeats this until the end of the dataset. Assuming the data have been sorted by the variable "x" and are processed with a BY statement, SAS also records that row 1 is the first occurrence of x = 10 and row 2 is the last occurrence of x = 10 through the automatic "first." and "last." variables, respectively. This can be useful in situations where only the first or last occurrence should be kept in the output dataset.
R, on the other hand, applies functions at the column level by processing all rows at the same time, as shown in the following graph. Since R processes data by column, it has no direct counterpart to the SAS "first." and "last." variables.
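The contrast described above can be sketched in R as follows. This is a minimal illustration, not the code from the original screenshots: the data values and the column names first_x and last_x are assumptions made for the example. It computes the sum column by column, repeats the calculation with an explicit row-by-row loop for comparison, and then shows one possible way to flag first and last occurrences using base R's duplicated().

```r
# Hypothetical 3-row input, already sorted by x (values are illustrative only)
input <- data.frame(x = c(10, 10, 30), y = c(100, 200, 300))

# Column-level (vectorized) sum: all rows are handled in a single call
input <- within(input, z <- x + y)

# Row-by-row version that mimics the SAS implied loop, for comparison only
z_loop <- numeric(nrow(input))
for (i in seq_len(nrow(input))) {
  z_loop[i] <- input$x[i] + input$y[i]
}
stopifnot(identical(input$z, z_loop))

# One way to emulate first.x / last.x when the data are sorted by x:
# a row is "first" if its x value has not appeared before, and "last"
# if its x value does not appear again when scanning from the end.
input$first_x <- !duplicated(input$x)
input$last_x  <- !duplicated(input$x, fromLast = TRUE)

# Keep only the last occurrence of each x value
last_rows <- input[input$last_x, ]
```

Note that duplicated() looks at the whole column rather than at contiguous groups, so this sketch matches the SAS behavior only when the data are sorted by x, as assumed here.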
Because of the differences between SAS and R, one of the best approaches for converting SAS programs to R is to first understand what a block of SAS code is doing and then rewrite it in R. To illustrate this, we summarized several scenarios and published them to the Cortana Intelligence Gallery as 3 Jupyter notebooks so that you can test the R code in Azure Machine Learning Studio. These notebooks cover some common business scenarios in the financial industry, such as counting delinquencies by account and calculating total expense by account. Some important technical concepts, such as the SAS “retain” statement, the “first.” and “last.” variables, and R’s apply() and sapply() functions, are demonstrated in these samples.
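To give a flavor of these scenarios, here is a minimal sketch in base R. It is not taken from the notebooks: the data frame txn and its columns account, expense and delinquent are hypothetical names invented for this example. It totals expenses and counts delinquencies by account with split() and sapply(), and uses ave() with cumsum() to build a running total per account, which is roughly what a retained accumulator reset at each new account would produce in SAS.

```r
# Hypothetical transaction data; all names and values are made up
txn <- data.frame(
  account    = c("A", "A", "A", "B", "B"),
  expense    = c(20, 35, 10, 50, 5),
  delinquent = c(TRUE, FALSE, TRUE, FALSE, FALSE)
)

# Total expense by account: split the column by account, then sum each piece
total_expense <- sapply(split(txn$expense, txn$account), sum)

# Number of delinquencies by account (summing a logical vector counts TRUEs)
n_delinquent <- sapply(split(txn$delinquent, txn$account), sum)

# Running total of expense within each account, in row order
txn$running_expense <- ave(txn$expense, txn$account, FUN = cumsum)

print(total_expense)
print(n_delinquent)
print(txn)
```

The grouped operations replace the row-by-row bookkeeping: instead of resetting and accumulating a variable inside a loop, each account's rows are handled as one vector.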
To run the notebooks, click on SAS to R Tutorial and then click on “Open in Studio.”
Cortana Intelligence Gallery: SAS to R Tutorial Part 1
[Updated Nov 15 2016 to correct an arithmetic error in the tables. Thanks to Bill Venables for this and also for suggesting an improvement to the R code.]
Oops! Look at n=3, 30 + 300 = 330 (not 300!)
An alternative in R that avoids the $ ugliness would be
input <- within(input, z <- x + y)
This also extends much more gracefully to more complicated adjustments, and %>% pipes.
Posted by: Bill Venables | November 14, 2016 at 16:35
Hi Bill, Thanks so much for pointing out the error and suggesting better R code. All pictures have been updated following your comment.
FYI for other readers: we originally had 30 + 300 = 300 in the two screenshots, and they have been corrected. Bill's recommended R code replaced the "ugly" version, input$z <- input$x + input$y.
Thanks so much.
Posted by: Lixzhang | November 15, 2016 at 08:01