by Joseph Rickert

A strong case can be made that base R graphics, supplemented with either the lattice library or ggplot2 for plotting by subgroups, provide everything a statistician might need both for exploratory data analysis and for developing clear, crisp graphics for communicating results. However, it is abundantly clear that web-based graphics, driven to a large extent by JavaScript-enhanced web design, are opening up new vistas for data visualization. The ability to interact with graphs, view them from different perspectives, and establish real-time relationships between different plots and other graphical elements provides opportunities to extract new insights from data. To be fair, many of these capabilities have existed in R for quite some time, some from the very beginning. For example, the identify() function in the graphics package lets you mouse over a point on a plot and click to determine the associated value, and what could be easier than the plot3d() function in the rgl package, which uses OpenGL technology to let you grab a 3D scatter plot with your mouse and rotate it any which way? Run this code to see how it works.
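A minimal sketch of the rgl approach, using the built-in mtcars data set (the choice of variables here is illustrative):

```r
# install.packages("rgl") if needed
library(rgl)

# Open an interactive 3D scatter plot; drag with the mouse to rotate it
with(mtcars, plot3d(wt, disp, mpg, type = "s", size = 1, col = "red"))
```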

Developers are continuing to build out the infrastructure of web based graphics, and now it is possible to select environments that offer a rich set of features all in one place.

Until recently, however, making use of web based graphics directly from R required a basic knowledge of web based development and some JavaScript programming skills. If you have these skills, or want to acquire them, have a look at the V8 package which provides an R interface to Google's open source JavaScript engine, but if JavaScript programming is not going to be your thing then htmlwidgets is the way to go.

An R user can load an htmlwidgets-based package and generate a web-based plot by calling a function that looks like any other R plotting function. For example, after installing and loading the threejs package, a few lines of code will produce an interactive 3D scatter plot that can be displayed in a webpage, a markdown document or the RStudio plot window. The following code generates a more contemporary version of the rotating 3D scatterplot.

```r
# One-time setup: install the threejs package from GitHub
install.packages("devtools", repos = "http://cran.rstudio.com/")
devtools::install_github("bwlewis/rthreejs")

library(threejs)
data(mtcars)                              # load the mtcars data set
data <- mtcars[order(mtcars$cyl), ]       # sort the data set for plotting
head(data)
uv <- tabulate(data$cyl)                  # count the observations for each cylinder type
col <- c(rep("red", uv[4]),               # set the colors: red = 4 cylinders,
         rep("yellow", uv[6]),            # yellow = 6, blue = 8
         rep("blue", uv[8]))
row.names(data)                           # see which models of cars are in the data set

scatterplot3js(data[, c(3, 6, 1)],        # disp, wt, mpg
               labels = row.names(data),  # mousing over a point shows the car model
               size = data$hp / 100,      # point size maps to horsepower
               flip.y = TRUE,
               color = col,               # point color indicates the number of cylinders
               renderer = "canvas")
```

This kind of visualization packs a lot of information into a relatively small space. Not only does the ability to rotate the plot produce a satisfying 3 dimensional rendering, but using color, size and mouse movement to convey information provides three additional dimensions.

As exciting as this kind of visualization is, however, I don’t mean to imply that it is somehow going to make static graphics obsolete. Rob Kabacoff's 2012 post using the scatterplot3d package provides an example of a 3D scatterplot of the mtcars data that has a timeless, elegant look and clearly displays the data without distraction.

Nevertheless, I am betting on htmlwidgets to be the next big thing. Not only are they easy to use, but the developers have created a framework for developing new widgets that hides most of the details of JavaScript bindings and the like. Currently, there are only a few ready-to-use widgets listed at the htmlwidgets.org showcase, so we will have to see if the R community embraces this technology.

In the meantime, for inspiration, have a look at Bryan Lewis' presentation at the recent NY R conference and the examples of widgets listed on his last slide.

Computerworld's Sharon Machlis today published a very useful list of R packages that every R user should know. The list covers packages for data import, data wrangling, data visualization and package development, but for beginning R users the biggest challenge is usually just dealing with data. To that end, I thought it was worth listing the packages for data access and manipulation, which I thoroughly endorse:

- **Data import/access**: readr (text data files), rio (many binary data file formats), readxl (Excel spreadsheets), googlesheets (Google Sheets), RMySQL (MySQL databases), quantmod (economic and financial data sources)
- **Data manipulation**: dplyr (general data frame processing), data.table (aggregation and filtering), tidyr (tidying messy data into row/col format), sqldf (SQL queries on data frames), zoo (time series data wrangling)

Check out Sharon's complete list below for details on these and many other useful R packages.

ComputerWorld: Great R packages for data import, wrangling & visualization

Arthur Charpentier was trying to solve an interesting problem with R: given this data set of random walks in the 2-D plane, what is the likely *origin* of a pathway that ends in the black circle below?

It's pretty easy to generate random data like this with a few lines of code in R. And with 2 million trajectories of 80 points each, you have some moderately-sized data to analyze: about 4Gb.

There are several ways to tackle data of this size with R: you can use an ordinary data.frame object (provided you have sufficient RAM to hold it in memory) and use standard R functions to select the corresponding records; you can use functions in the dplyr package to filter the data; or you can use the data.table package and its operations to select the appropriate data. Arthur tried all three methods, with the following results:

- Using ordinary data.frame operations, it took about a minute to extract the necessary data. Even then, Arthur had some challenges with out of memory errors when trying to create temporary columns in the data (which swelled its size to over 6 Gb).
- Using the dplyr package, Arthur read in the data as a data_frame object and filtered the data using dplyr's group_by, summarise, and left_join operations. This process took about two minutes.
- Using the data.table package and using its built-in selection syntax and merge operator, the process took around 10 seconds.

Note that all of these techniques are in-memory operations. Arthur doesn't note the size of the system he was using, but it probably has at least 8Gb of RAM to be able to accommodate the data. While dplyr's syntax is (for me) somewhat simpler to use, here data.table wins out on performance, thanks to its optimized operations and the ability to create new variables on the fly (and without requiring additional RAM) with its := syntax. You can see the complete code used for the various methods at the link below.
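Arthur's full code is at the link below; as a minimal sketch of the three styles of row selection on a toy data frame (the column names here are made up, not Arthur's):

```r
library(dplyr)
library(data.table)

# Toy stand-in for the trajectory data: an id column plus coordinates
df <- data.frame(id = rep(1:100, each = 80),
                 x = rnorm(8000), y = rnorm(8000))

# 1. Base data.frame subsetting
sel1 <- df[df$id == 42, ]

# 2. dplyr filtering
sel2 <- df %>% filter(id == 42)

# 3. data.table: keyed selection, plus := to add a column by reference
dt <- as.data.table(df)
setkey(dt, id)
sel3 <- dt[.(42)]
dt[, dist := sqrt(x^2 + y^2)]  # new column created without copying the table
```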

Freakonometrics: Working with "large" datasets, with dplyr and data.table

by Mark Malter

After reading the book Analyzing Baseball with R, by Max Marchi and Jim Albert, I decided to expand on some of their ideas relating to runs created and put them into an R Shiny app.

The Server and UI code are linked at the bottom of the Introduction tab.

I downloaded the Retrosheet play-by-play data for every game played in the 2011-2014 seasons in every park and aggregated every plate appearance by one of the 24 bases/outs states (ranging from nobody on/nobody out to bases loaded/two outs). With the Retrosheet data, I wrote code to track the batter, bases, outs, runs scored over the remainder of the inning, current game score, and inning. I also used the R Lahman package and its databases for individual player information. Below is a brief explanation of the function of each tab on the app.
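The 24 bases/outs states are just the 8 possible base configurations crossed with the 3 possible out counts; in R they can be enumerated like this (the state labels are my own shorthand, not from the app):

```r
# 8 base configurations ("-" = empty base) crossed with 0, 1, or 2 outs
bases  <- c("---", "1--", "-2-", "--3", "12-", "1-3", "-23", "123")
states <- expand.grid(bases = bases, outs = 0:2)
nrow(states)  # 24
```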

**Potential Runs by bases/outs state**: Matrix of all 24 possible bases/outs states, both with expected runs over the remainder of an inning, and the probability of scoring at least one run over the remainder of the inning (for late innings of close games). I used this table to analyze several types of plays, as shown below. Notice that, assuming average hitters, the analysis below shows why sacrifice bunts are always a bad idea. The Runs Created stat for a plate appearance is defined as:

end state – start state + runs created on play.
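As a minimal sketch of this definition, using two of the expected-runs values quoted later in the post (runner on first/no outs = 0.85, runner on second/one out = 0.66):

```r
# Expected runs for two bases/outs states (values quoted in the post)
run_exp <- c(first_0out = 0.85, second_1out = 0.66)

# Runs created = end state - start state + runs scored on the play
runs_created <- function(start, end, runs_on_play) {
  unname(run_exp[end] - run_exp[start] + runs_on_play)
}

# A "successful" sacrifice bunt (no runs score on the play) costs runs:
runs_created("first_0out", "second_1out", 0)  # -0.19
```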

I first became serious about this after watching the last inning of the 2014 World Series. Down 3-2 with two outs and nobody on base, Alex Gordon singled to center and advanced to third on a two-base error. As Gordon was heading into third base, Giants shortstop Brandon Crawford was taking the relay throw in short left field. Had Gordon been sent home, Crawford would likely have thrown him out at the plate. However, the runs matrix shows only a 26% chance of scoring a run with a man on third and two outs, and with Madison Bumgarner on the mound, it was even less likely that on-deck hitter Salvador Perez would be able to drive in Gordon. So even though sending Gordon would likely have ended the game (and the series), it still may have been the optimal play. This would be similar to hitting 16 against a dealer's ten in Blackjack: you'll probably lose, but you're making the optimal play. For equivalency, see the Tag from Third analysis below, as this play would have been equivalent to tagging from third after a catch for the second out.

**Runs Created All Regular MLB Players**: I filtered out all players with fewer than 400 plate appearances and created an interactive rCharts plot showing each player's runs potential by runs created. I placed the following filters in the UI: year, innings (1-3, 4-6, 7-extras), run differential at the time of the at bat (0-1, 2-3, 4+), position, team, bats, age range, and weight. Hovering over a point shows the player and his salary. For example, Mike Trout created 58 runs out of a potential of 332 in 2014. Filtering 2013 for second basemen under the age of 30 and weighing less than 200 pounds, we see Jason Kipnis created 27 runs out of a potential of 300.

**Player Runs Table**: Same as above, but this shows each player (> 400 plate appearances for the selected season), broken down by each of the eight bases states. For example, in 2014 Jose Abreu created 43.5 runs on a potential of 291, and was most efficient with a runner on second base, where he created 10.3 runs on a potential of only 36.

The following tabs show run expectancies of various offensive plays, from the start state to the expected end state, based on the expected Baserunning Success rate set in the UI. For each play, I created a graphical tab as well as a table tab. For the graphical tabs, there is a UI control to switch between views of expected runs and scoring probability.

**Stolen bases Graphic/Table**: For each of fifteen different base stealing situations, I show the start state, end state (based on the UI selected success rate), and the breakeven success rate for the given situation. We see that rather than one generic rule of thumb for breaking even, the situational b/e’s vary widely, ranging from 91% with a runner on second and two outs, to 54% for a double steal with first and second and one out (I assume that any out is the lead runner). Notice though that if only the runner on second attempts to steal, the break even jumps from 54% to 72%.
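The break-even rate in each situation is the success probability at which an attempt neither gains nor loses expected runs; the calculation is a one-liner (the run-expectancy numbers below are illustrative, not taken from the app):

```r
# Break-even success rate p solves:
#   p * RE(success) + (1 - p) * RE(caught) = RE(no attempt)
breakeven <- function(re_stay, re_success, re_caught) {
  (re_stay - re_caught) / (re_success - re_caught)
}

# Illustrative run expectancies for staying put, a successful steal,
# and getting caught:
breakeven(re_stay = 0.85, re_success = 1.10, re_caught = 0.25)  # ~0.71
```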

**Tag from Third Graphic/Table**: I broke down every situation where a fly ball was caught with a runner on third, where the catch was either the first or second out. I tracked the attempt frequency and success rate for each situation, based on the outs and whether there were trailing runners. Surprisingly, I found that almost every success rate is well over 95%, meaning runners are only tagging when they’re almost certain to score. However, the break evens range from 40% with first and third with two outs (after the catch) to 77% with runners on second and third with one out. I believe this shows a gray area between the b/e and success rates where runners are being far too cautious.

The following tabs show whether a base runner should attempt to advance two bases on a single. Again, of course it depends on the situation.

**First to Third Graphic/Table**: Here we see that the attempt frequencies are very low and, as expected, lowest on balls hit to left field. However, as with the above tag plays, runners are almost always safe, showing another gray area between attempts and b/e's. For example, on a single to right field with one out, runners only attempt to advance to third base 42.1% of the time, and are safe 97.3% of the time. If we set the UI Success Rate slider to 0.85, we see that the attempt increases the runs expectancy from 0.87 to 0.99.

**Second to Home Graphic/Table**: Here we see the old adage, “don’t make the first or second out at the plate”, is not necessarily true. Attempting to score from second on a single depends not only on the outs, but also whether there is a trailing runner. The break evens range from 93% with no outs and no trailing runner on first, to 40% with two outs and no runner on first. Once again, the success rates are almost always higher than the break even rate, showing too much caution.

**Sacrifice Bunt Graphic/Table**: These tabs show that unless we have a hitter far below average, the sacrifice should never be attempted. For example, in going from a runner on first and no outs to a runner on second with one out, or from a runner on second with no outs to a runner on third with one out, we drop from 0.85 runs to 0.66 and from 1.10 runs to 0.94, respectively. Worse, I'm assuming that the bunt is always successful, with the lead runner never being thrown out. The only situation where the bunt might be wise is in a late inning when the team is playing for one run after a leadoff double. Getting the runner from second with no outs to third with one out increases the probability of scoring from 0.61 to 0.65, IF the bunt is successful. Even here, it is a poor play if the success rate is less than 90%. The graphic tab allows the user to see how the expected end state changes as the UI success rate slider is altered.

UI code: https://github.com/malter61/retrosheets/blob/master/ui.R

Server code: https://github.com/malter61/retrosheets/blob/master/server.R

*Mark Malter is a data scientist currently working for Houghton Mifflin Harcourt, as well as the consulting firm Channel Pricing, specializing in building predictive models, cluster analysis, and visualizing data. He is also a sixteen-year veteran stock options market-maker at the Chicago Board Options Exchange. He has a BS degree in electrical engineering, an MBA, and is currently working on an MS degree in Predictive Analytics. Mark also spent 14 years as a director and coach of his local youth baseball league.*

by Joseph Rickert

Tracking R user group meetings is a good way to stay informed about what's happening in the R world. On Tuesday the Bay Area useR Group (BARUG) met at AdRoll in San Francisco. It was a mini-conference with 6 talks:

- Bryan Galvin, our host at AdRoll (many thanks for the pizza and beer), kicked off the evening by showing how his company has created Sparkle, a big-league workflow for deploying Shiny apps.
- Shivaram Venkataraman of Amplab gave an overview of SparkR, highlighting new features and giving a preview of what's coming.
- Perry de Valpine presented NIMBLE, a new system that compiles BUGS-style models to C++ from within R, which Bayesian R users should find of great interest.
- Hamed Dagour presented three scenarios of how Personal Capital uses R for financial applications.
- Ramnath Vaidyanathan delivered a high-energy overview of htmlwidgets for R. htmlwidgets may very well be "the next big thing".
- Bill Grosso finished up the evening by describing how his team at Scientific Revenue uses R to build dynamic pricing models for internet game companies.

SparkR has been generating quite a bit of interest in the R community at large. Shivaram revealed that SparkR is now going mainstream within the Spark project. SparkR will be merged into Spark in the upcoming 1.4 release. This means that there will be a Spark API for R.

The following slide illustrates the data frame methods that have been developed for SparkR:

Future SparkR work which is highlighted in Shivaram's presentation includes:

- High-level APIs for machine learning algorithms
- Pipelines with "featurizers"
- Extended models and summary methods
- APIs for streaming and time series analysis
- Distributed matrix operations

htmlwidgets is another project generating some buzz in the R world. The idea is to create R bindings to JavaScript libraries so that interactive JavaScript plots can be used in stand-alone applications or embedded in markdown files or Shiny apps.

Ramnath's presentation highlighted some brand new work in htmlwidgets. Particularly notable was his live demonstration of cooperating widgets. Using message passing to transmit state, Ramnath showed a cluster of graphs changing on the fly to reflect new information from the graphs they were linked to. This gif link gives an idea of the kinds of cool things people are doing with htmlwidgets.

In other R user group meetings around the world this week: the New Delhi useR Group gave R tutorials; the Taiwan R User Group explored the MLDM package and its applications; Statistical Programming DC looked into R and Shiny; MadR looked into making R code easier and more reliable; and the Portland R User Group got a glimpse of new features coming to RStudio. The Barcelona R users group will look into dplyr; Dublin R will look at using R with medical statistics method comparison studies; DataPhilly will investigate data science in health care and look into using R with Plotly; the Denver R Users Group will discuss websites with GitHub and RStudio in the context of continuous delivery systems; and the Atlanta R Users Group will look into building a better data stack.

It is only a small sample of 15 talks, but they do emphasize that Shiny and the use of R in production-level work are both trending topics in the R community. Bryan's talk illustrates the practical intersection of the two trends.

In today's data-oriented world, just about every retailer has amassed a huge database of purchase transactions. Each transaction consists of a number of products that have been purchased together. A natural question you could answer from this database is: which products are typically purchased together? This is called Market Basket Analysis (or Affinity Analysis). A closely related question is: can we find relationships between certain products that indicate the purchase of other products? For example, if someone purchases avocados and salsa, it's likely they'll purchase tortilla chips and limes as well. This is called association rule learning, a data mining technique retailers use to improve product placement, marketing, and new product development.

R has an excellent suite of algorithms for market basket analysis in the arules package by Michael Hahsler and colleagues. It includes support for both the Apriori algorithm and ECLAT (the equivalence class transformation algorithm). You can find an in-depth description of both techniques (including several examples) in the Introduction to arules vignette. The slides below, by Yanchang Zhao, provide a nice overview, and you can find further examples at RDataMining.com.
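As a quick, hedged sketch of the Apriori workflow, using the Groceries transaction data set bundled with arules (the support and confidence thresholds here are arbitrary):

```r
library(arules)
data(Groceries)  # real point-of-sale transactions shipped with arules

# Mine association rules with the Apriori algorithm
rules <- apriori(Groceries,
                 parameter = list(support = 0.01, confidence = 0.5))

# Inspect the strongest rules by lift
inspect(head(sort(rules, by = "lift"), 3))
```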

by Ari Lamstein

Today I will walk through an analysis of San Francisco zip code demographics using my new R package choroplethrZip. This package creates choropleth maps of US zip codes and connects them to data from the US Census Bureau. A choropleth is a map that shows the boundaries of regions (such as zip codes) and colors those regions according to some metric (such as population).

Zip codes are a common geographic unit for businesses to work with, but rendering them is difficult. Official zip codes are maintained by the US Postal Service, but they exist solely to facilitate mail delivery. The USPS does not release a map of them, they change frequently, and in some cases they are not even polygons. The most authoritative map I could find of US zip codes was the Census Bureau's map of Zip Code Tabulated Areas (ZCTAs). Despite shipping with only a simplified version of this map (60MB instead of 500MB), choroplethrZip is still too large for CRAN. It is instead hosted on GitHub, and you can install it from an R console like this:

```r
# install.packages("devtools")
library(devtools)
install_github('arilamstein/choroplethrZip@v1.1.1')
```

The package vignettes (1, 2) explain basic usage. In this article I’d like to demonstrate a more in depth example: showing racial and financial characteristics of each zip code in San Francisco. The data I use comes from the 2013 American Community Survey (ACS) which is run by the US Census Bureau. If you are new to the ACS you might want to view my vignette on Mapping US Census Data.

One table that deals with race and ethnicity is B03002 - Hispanic or Latino Origin by Race. Many people will be surprised by the large number of categories. This is because the US Census Bureau has a complex framework for categorizing race and ethnicity. Since my purpose here is to demonstrate technology, I will simplify the data by dealing with only a handful of the values: Total Hispanic or Latino, White (not Hispanic), Black (not Hispanic) and Asian (not Hispanic).

The R code for getting this data into a data.frame can be viewed here, and the code for generating the graphs in this post can be viewed here. Here is a boxplot of the ethnic breakdown of the 27 ZCTAs in San Francisco.

This boxplot shows that there is wide variation in the racial and ethnic breakdown of San Francisco ZCTAs. For example, the percentage of White people in each ZCTA ranges from 7% to 80%. The percentages of Black and Hispanic residents have tighter ranges, but also contain outliers. Also, while Asian Americans make up only about 5% of the total US population, the median ZCTA in SF is 30% Asian.

Viewing this data with choropleth maps allows us to associate locations with these values.
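A sketch of how one such map can be drawn, assuming the zip_choropleth() function and the df_zip_demographics data set that ship with choroplethrZip (parameter names like county_zoom are taken from my reading of the package and should be checked against the vignettes):

```r
library(choroplethrZip)
data(df_zip_demographics)

# zip_choropleth() expects a data frame with "region" (ZCTA) and "value" columns
df <- df_zip_demographics[, c("region", "percent_white")]
names(df)[2] <- "value"

zip_choropleth(df,
               county_zoom = 6075,  # FIPS code for San Francisco County
               title       = "San Francisco ZCTAs: Percent White")
```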

When discussing demographics, people often ask about per capita income. Here is a boxplot of the per capita income in San Francisco ZCTAs.

The range of this dataset - from $15,960 to $144,400 - is striking. Equally striking is the outlier at the top. We can use a continuous scale choropleth to highlight the outlier. We can also use a four color choropleth to show the locations of the quartiles.

The outlier for income is zip 94105, which is where a large number of tech companies are located. The zips in the southern part of the city tend to have a low income.

After viewing this analysis readers might wish to do a similar analysis for the city where they live. To facilitate this I have created an interactive web application. The app begins by showing a choropleth map of a random statistic of the Zips in a random Metropolitan Statistical Area (MSA). You can choose another statistic, zoom in, or select another MSA.

In the event that the application does not load (for example, if I reach my monthly quota at my hosting company) then you can run the app from source, which is available here.

I hope that you have enjoyed this exploration of Zip code level demographics with choroplethrZip. I also hope that it encourages more people to use R for demographic statistics.

by Sherri Rose

Assistant Professor of Health Care Policy

Harvard Medical School

Targeted learning methods build machine-learning-based estimators of parameters defined as features of the probability distribution of the data, while also providing influence-curve or bootstrap-based confidence intervals. The theory offers a general template for creating targeted maximum likelihood estimators for a data structure, nonparametric or semiparametric statistical model, and parameter mapping. These estimators of causal inference parameters are doubly robust and have a variety of other desirable statistical properties.

Targeted maximum likelihood estimation builds on the loss-based "super learning" system so that lower-dimensional parameters can be targeted (e.g., a marginal causal effect), removing the remaining bias for the (low-dimensional) target feature of the probability distribution. Targeted learning for effect estimation and causal inference allows for the complete integration of machine learning advances in prediction while providing statistical inference for the target parameter(s) of interest. Further details about these methods can be found in the many targeted learning papers as well as the 2011 targeted learning book.

Practical tools for the implementation of targeted learning methods for effect estimation and causal inference have developed alongside the theoretical and methodological advances. While some work has been done to develop computational tools for targeted learning in proprietary programming languages, such as SAS, the majority of the code has been built in R.

Of key importance are the two R packages SuperLearner and tmle. Ensembling with SuperLearner allows us to use many algorithms to generate an ideal prediction function that is a weighted average of all the algorithms considered. The SuperLearner package, authored by Eric Polley (NCI), is flexible, allowing for the integration of dozens of prespecified potential algorithms found in other packages as well as a system of wrappers that provide the user with the ability to design their own algorithms, or include newer algorithms not yet added to the package. The package returns multiple useful objects, including the cross-validated predicted values, final predicted values, vector of weights, and fitted objects for each of the included algorithms, among others.

Below is sample code with the ensembling prediction package SuperLearner using a small simulated data set.

```r
library(SuperLearner)

## Generate simulated data ##
set.seed(27)
n <- 500
data <- data.frame(W1 = runif(n, min = .5, max = 1),
                   W2 = runif(n, min = 0, max = 1),
                   W3 = runif(n, min = .25, max = .75),
                   W4 = runif(n, min = 0, max = 1))
data <- transform(data,  # add W5, dependent on W2, W3
                  W5 = rbinom(n, 1, 1 / (1 + exp(1.5 * W2 - W3))))
data <- transform(data,  # add Y, dependent on W1, W2, W4, W5
                  Y = rbinom(n, 1, 1 / (1 + exp(-(-.2 * W5 - 2 * W1 + 4 * W5 * W1
                                                  - 1.5 * W2 + sin(W4))))))
summary(data)

## Specify a library of algorithms ##
SL.library <- c("SL.nnet", "SL.glm", "SL.randomForest")

## Run the super learner to obtain predicted values for the super learner
## as well as CV risks for the algorithms in the library ##
fit.data.SL <- SuperLearner(Y = data[, 6], X = data[, 1:5], SL.library = SL.library,
                            family = binomial(), method = "method.NNLS", verbose = TRUE)

## Run the cross-validated super learner to obtain its CV risk ##
fitSL.data.CV <- CV.SuperLearner(Y = data[, 6], X = data[, 1:5], V = 10,
                                 SL.library = SL.library, verbose = TRUE,
                                 method = "method.NNLS", family = binomial())

## Cross-validated risks ##
mean((data[, 6] - fitSL.data.CV$SL.predict)^2)  # CV risk for the super learner
fit.data.SL                                     # CV risks for algorithms in the library
```

The final lines of code return the cross-validated risks for the super learner as well as for each algorithm considered within the super learner. While this is a trivial example with a small data set and few covariates, the results demonstrate that the super learner, which takes a weighted average of the algorithms in the library, has the smallest cross-validated risk and outperforms each individual algorithm.

The tmle package, authored by Susan Gruber (Reagan-Udall Foundation), allows for the estimation of both average treatment effects and parameters defined by a marginal structural model in cross-sectional data with a binary intervention. This package also includes the ability to incorporate missingness in the outcome and the intervention, use SuperLearner to estimate the relevant components of the likelihood, and use data with a mediating variable. Additionally, TMLE and collaborative TMLE R code specifically tailored to answer quantitative trait loci mapping questions, such as those discussed in Wang et al 2011, is available in the supplementary material of that paper.
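A minimal, hedged sketch of an average treatment effect estimate with the tmle package, on simulated data (W = baseline covariates, A = binary treatment, Y = binary outcome; the data-generating model is made up for illustration):

```r
library(tmle)

set.seed(1)
n <- 250
W <- data.frame(W1 = rnorm(n), W2 = rnorm(n))
A <- rbinom(n, 1, plogis(0.3 * W$W1))      # treatment depends on W1
Y <- rbinom(n, 1, plogis(A + 0.5 * W$W2))  # outcome depends on A and W2

fit <- tmle(Y = Y, A = A, W = W, family = "binomial")
fit$estimates$ATE$psi  # targeted estimate of the additive treatment effect
```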

The multiPIM package, authored by Stephan Ritter (Omicia, Inc.), is designed specifically for variable importance analysis, and estimates an attributable-risk-type parameter using TMLE. This package also allows the use of SuperLearner to estimate nuisance parameters and produces additional estimates using estimating-equation-based estimators and g-computation. The package includes its own internal bootstrapping function to calculate standard errors if this is preferred over the use of influence curves, or influence curves are not valid for the chosen estimator.

Four additional prediction-focused packages are casecontrolSL, cvAUC, subsemble, and h2oEnsemble, all primarily authored by Erin LeDell (Berkeley). The casecontrolSL package relies on SuperLearner and performs subsampling in a case-control design with inverse-probability-of-censoring-weighting, which may be particularly useful in settings with rare outcomes. The cvAUC package is a tool kit to evaluate area under the ROC curve estimators when using cross-validation. The subsemble package was developed based on a new approach to ensembling that fits each algorithm on a subset of the data and combines these fits using cross-validation. This technique can be used in data sets of all size, but has been demonstrated to be particularly useful in smaller data sets. A new implementation of super learner can be found in the Java-based h2oEnsemble package, which was designed for big data. The package uses the H2O R interface to run super learning in R with a selection of prespecified algorithms.

Another TMLE package is ltmle, primarily authored by Joshua Schwab (Berkeley). This package mainly focuses on parameters in longitudinal data structures, including the treatment-specific mean outcome and parameters defined by a marginal structural model. The package returns estimates for TMLE, g-computation, and estimating-equation-based estimators.

*The text above is a modified excerpt from the chapter "Targeted Learning for Variable Importance" by Sherri Rose in the forthcoming Handbook of Big Data (2015) edited by Peter Buhlmann, Petros Drineas, Michael John Kane, and Mark Van Der Laan to be published by CRC Press.*

Thanks to all who attended my webinar earlier this week, Reproducibility with Revolution R Open and the Checkpoint Package. If you missed the live session, you can catch up with the slides and video replay which I've embedded below.

If you just want to check out the demo of the checkpoint package, it starts at 18:30 in the video below. If you want to follow along at home, you can download the demo script here.

Revolution Analytics webinars: Reproducibility with Revolution R Open and the Checkpoint Package

by Herman Jopia

**What is Binning?**

**Binning** is the term used in scoring modeling for what is also known in Machine Learning as **Discretization**: the process of transforming a continuous characteristic into a finite number of intervals (the bins), which allows for a better understanding of its distribution and its relationship with a binary variable. The bins generated by this process will eventually become the attributes of a predictive characteristic, the key component of a Scorecard.

**Why Binning?**

Though there is some reticence about it [1], the benefits of binning are pretty straightforward:

- It allows missing data and other special values (e.g., division by zero) to be included in the model.
- It controls or mitigates the impact of outliers over the model.
- It solves the issue of having different scales among the characteristics, making the weights of the coefficients in the final model comparable.

**Unsupervised Discretization**

Unsupervised Discretization divides a continuous feature into groups (bins) without taking into account any other information. It is basically a partition with two options: equal length intervals and equal frequency intervals.

**Equal length intervals**

- Objective: Understand the distribution of a variable.
- Example: The classic histogram, whose bins have equal length that can be calculated using different rules (Sturges, Rice, and others).
- Disadvantage: The number of records in a bin may be too small to allow for a valid calculation, as shown in Table 1.
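As a minimal illustration of the disadvantage (synthetic data, base R only), `cut()` splits the observed range into equal-width bins, and a skewed distribution leaves the upper bins nearly empty:

```r
# Sketch: equal-length binning with base R (synthetic data)
set.seed(1)
tob <- c(rexp(500, rate = 1/12), 60)   # skewed "time on books" in months

# Six intervals of equal length across the observed range
bins <- cut(tob, breaks = 6)

# The long right tail leaves the upper bins with very few records
table(bins)
```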

Table 1. Time on Books and Credit Performance. Bin 6 has no bads, producing indeterminate metrics.

**Equal frequency intervals**

- Objective: Analyze the relationship with a binary target variable through metrics like bad rate.
- Example: Quartiles or percentiles.
- Disadvantage: The cutpoints selected may not maximize the difference between bins when mapped to a target variable, as shown in Table 2.
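Equal-frequency bins are just as easy to sketch in base R (again on synthetic data): `quantile()` supplies the cutpoints and `cut()` assigns the records, giving bins of (nearly) identical size regardless of the shape of the distribution.

```r
# Sketch: equal-frequency binning via quantiles (synthetic data)
set.seed(1)
tob <- rexp(500, rate = 1/12)          # skewed "time on books" in months

# Quartile cutpoints give four bins with ~125 records each
cuts <- quantile(tob, probs = seq(0, 1, 0.25))
bins <- cut(tob, breaks = cuts, include.lowest = TRUE)
table(bins)
```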

Table 2. Time on Books and Credit Performance. Different cutpoints may improve the Information Value (0.4969).

**Supervised Discretization**

Supervised Discretization divides a continuous feature into groups (bins) mapped to a target variable. The central idea is to find those cutpoints that maximize the difference between the groups.
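As a rough sketch of the idea, recursive partitioning (via the `rpart` package, which ships with R) can propose cutpoints on a single characteristic by splitting wherever the good/bad separation improves most. The data here are simulated, and the tuning values (`cp`, `minbucket`) are illustrative assumptions, not recommendations:

```r
# Sketch: supervised cutpoints via recursive partitioning (synthetic data)
library(rpart)
set.seed(1)
tob  <- runif(1000, 0, 48)                        # months on books
flag <- rbinom(1000, 1, plogis(-2 + 0.08 * tob))  # good/bad indicator

# A shallow tree on the single characteristic; each split point is a
# candidate cutpoint chosen to separate good and bad rates
tree <- rpart(flag ~ tob, method = "class",
              control = rpart.control(cp = 0.01, minbucket = 50))
cutpoints <- sort(tree$splits[, "index"])
cutpoints
```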

In the past, analysts iteratively moved from Fine Binning to Coarse Binning, a very time-consuming process of finding the right cutpoints manually and visually (if they were found at all). Nowadays, with algorithms like ChiMerge or Recursive Partitioning, two of the several techniques available [2], analysts can find optimal cutpoints in seconds and evaluate the relationship with the target variable using metrics such as Weight of Evidence and Information Value.
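For readers unfamiliar with those metrics, Weight of Evidence (WoE) compares each bin's share of goods to its share of bads, and the Information Value (IV) sums the WoE-weighted differences across bins. A small worked sketch, with purely illustrative counts (not from the chileancredit data):

```r
# Sketch: WoE and IV from binned good/bad counts (illustrative numbers)
goods <- c(100, 300, 400, 200)   # good accounts per bin
bads  <- c( 60,  80,  40,  20)   # bad accounts per bin

dist_good <- goods / sum(goods)  # each bin's share of all goods
dist_bad  <- bads  / sum(bads)   # each bin's share of all bads

woe <- log(dist_good / dist_bad)            # per-bin Weight of Evidence
iv  <- sum((dist_good - dist_bad) * woe)    # Information Value

round(woe, 4)
round(iv, 4)   # -> 0.4564 for these counts
```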

**An Example With 'smbinning'**

Using the 'smbinning' package and its data (chileancredit), whose documentation can be found on its supporting website, the characteristic Time on Books is grouped into bins taking the Credit Performance (Good/Bad) into account, in order to establish the optimal cutpoints and obtain meaningful, statistically different groups. The R code below, Table 3, and Figure 1 show the result of this application, which clearly surpasses the previous methods with the highest Information Value (0.5353).

```r
# Load package and its data
library(smbinning)
data(chileancredit)

# Training and testing samples
chileancredit.train=subset(chileancredit,FlagSample==1)
chileancredit.test=subset(chileancredit,FlagSample==0)

# Run and save results
result=smbinning(df=chileancredit.train,y="FlagGB",x="TOB",p=0.05)
result$ivtable

# Relevant plots (2x2 Page)
par(mfrow=c(2,2))
boxplot(chileancredit.train$TOB~chileancredit.train$FlagGB,
        horizontal=T, frame=F, col="lightgray", main="Distribution")
mtext("Time on Books (Months)",3)
smbinning.plot(result,option="dist",sub="Time on Books (Months)")
smbinning.plot(result,option="badrate",sub="Time on Books (Months)")
smbinning.plot(result,option="WoE",sub="Time on Books (Months)")
```

Table 3. Time on Books cutpoints mapped to Credit Performance.

Figure 1. Plots generated by the package.

In the middle of the "data era", it is critical to speed up the development of scoring models. Binning, and more specifically automated binning, significantly reduces the time-consuming process of generating predictive characteristics, which is why companies like SAS and FICO have developed their own proprietary algorithms to implement this functionality in their respective software. For analysts who do not have these specific tools or modules, the R package 'smbinning' offers a statistically robust alternative for running their analyses faster.

For more information about binning, the package's documentation available on CRAN lists some references related to the algorithm behind it and its supporting website some references for scoring modeling development.

**References**

[1] Dinero, T. (1996). Seven Reasons Why You Should Not Categorize Continuous Data. Journal of Health & Social Policy, 8(1), 63-72.

[2] Garcia, S. et al. (2013). A Survey of Discretization Techniques: Taxonomy and Empirical Analysis in Supervised Learning. IEEE Transactions on Knowledge and Data Engineering, 25(4), April 2013.