by Lixun Zhang, Data Scientist at Microsoft
For financial institutions currently using SAS, R is an alternative statistical software that is free and has been widedly used in academia and industry. Technical support for R is included with Microsoft R Server, which in addition to being 100% compatible with R, also has improved speed and the capability to work with large datasets.
As corporations switch from SAS to R, they might need to rewrite some of their legacy SAS programs in the new language. The good news is that for most of SAS data manipulation steps, R has a straightforward corresponding function. For example, merging can be easily done with R’s merge function. As another example, R’s colSums function, just like the SAS 'sum' function, calculates the sum of two or more variables while allowing missing values to be omitted.
The purpose of this write-up is to help programmers rewriting SAS programs in R understand an important concept in SAS: the “retain” statement, which is used to remember information from one row to the next. This article assumes some familiarity with R and basic understanding of SAS. As a reminder, SAS processes data row by row: it finishes processing the current row then processes the next row. This is different from R where operations are column-based.
Creating Data
The dataset shown below will be used throughout this article. In this dataset there are 2 credit cards, each having 3 rows corresponding to 3 months’ expense information.
We can load the data with the code below, with the SAS code first, followed by the R code.
Now, let's look at some examples of working with this data in SAS and R.
Reshaping data
In this first example we’ll look at how the SAS “retain” statement works by examining the code below.
This diagram explains how the sample dataset’s first 3 rows are processed by the above SAS code.
Row 1: Code Block A tells SAS to create 3 variables Expense1, Expense2, and Expense3 (abbreviated as Expense1 – Expense3) because this is the 1st row (n = 1 is TRUE). By default, SAS assigns missing values (“.”) to these new variables. In addition it creates an array named “Expensex” for the 3 variables. Code Block B uses the “retain” statement and tells SAS to keep the value of 3 variables Expense1, Expense2, and Expense3 from the previous row. Since this is the 1st row this code block does not affect the values of the 3 variables. Code Block C assigns value “.” to these three variables because CardID 1 appears for the first time (first.CardID is TRUE). Then Code Block D assigns Expense3 = 100 because the variable month has value 3 and Expense is 100. Code Block E does not output the current row because this row is not the last occurrence of CardID 1 (last.CardID is FALSE).
Row 2: Because of the statement “retain Expense1 – Expense3” in Code Block B, the values of the variables Expense1 – Expense3 from the previous row are carried over. So Expense3 has value 100 and Expense1 and Expense2 have missing values. Code Block C does not affect this row because this is not the first time CardID 1 appears. Code Block D then assigns Expense2 = 50 because the value of month is 2 and the value of Expense is 50. Code Block E does not output the current row because this is not the last occurrence of CardID.
Row 3: Code Block B copies the values of Expense1 – Expense3 from Row 2. Code Block C does not affect this row because this is not the first time CardID 1 appears. Then Code Block D assigns Expense1 = 70 because the value of month is 1 and the value of Expense is 70. Because this is the last occurrence of CardID 1, Code Block E exports this row to the dataset bank_mod.
Then the code starts processing Row 4 for CardID 2 and repeat the same process as for CardID 1. It’s worth noting that Code Block C assigns values “.” to the three expense variables at Row 4 because this is the first time CardID 2 appears.
Here's how the data look before and after the transformation (redundant variables such as month and Expense have been removed by the “drop” option in the SAS code):
R users will recognize that this transformation can be accomplished with the reshape function, as the code below shows.
If we had not used the “retain” statement in this example, SAS would not have carried over the values for Expense1 – Expense3 from one row to the next. This means, for example, that when we export Row 3, Expense1 would have value 70 whereas Expense2 and Expense3 would have missing values.
Sum by ID
The “retain” statement can be used for other purposes. In the following example, we are trying to calculate the total expense per CardID. Here the retain variable “total_expense” is first copied from the previous row and then used in calculating a new value through the line:
total_expense = total_expense + Expense ;
We can do these calculations with the aggregate function in R. The diagram below describes how the data look like before and after the transformation and is followed by the SAS and R programs for this task. Only variables 'CardID' and 'total_expense' are kept through the “keep” option in the SAS code.
Count by ID
The “retain” statement can also be used for counting purposes. For example, we may want to count the number of times that the expense was higher than 40. The diagram below describes how the data look like before and after the transformation, followed by the SAS and R programs for this task. Only variables 'CardID' and 'over_spend' are kept through the “keep” option in the SAS code. This is similar to the example for calculating sums by ID except that here the variable has values 0 or 1.
Conclusion
In this write-up we have shown that it’s possible to make the conversion for SAS “retain” statement even though R does not provide a 1-to-1 match function. In such scenarios, the best approach is to understand what SAS is doing and then find the R function that does the same thing.
A bit simpler code, which may be important to those who are new in R:
bank %>%
mutate(condition = Expense > 40) %>%
group_by(CardID) %>%
summarize(over_spend = sum(condition))
Posted by: bijou4 | September 08, 2016 at 04:28
Thanks bijou4.
For SAS users not familiar with R, you can install and load the package "dplyr" to use "%>%" as an operator in bijou4's example.
In my case I avoided using 3rd party packages.
Posted by: Lixzhang | September 12, 2016 at 13:49