by Herman Jopia

**What is Binning?**

**Binning** is the term used in scoring modeling for what Machine Learning calls **Discretization**: the process of transforming a continuous characteristic into a finite number of intervals (the bins), which allows for a better understanding of its distribution and its relationship with a binary variable. The bins generated by this process will eventually become the attributes of a predictive characteristic, the key component of a Scorecard.

**Why Binning?**

Though there is some reticence about it [1], the benefits of binning are fairly straightforward:

- It allows missing data and other special values (e.g. results of division by zero) to be included in the model.
- It controls or mitigates the impact of outliers on the model.
- It solves the issue of having different scales among the characteristics, making the weights of the coefficients in the final model comparable.

**Unsupervised Discretization**

Unsupervised Discretization divides a continuous feature into groups (bins) without taking into account any other information. It is basically a partition with two options: equal length intervals and equal frequency intervals.
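Both options take only a few lines of base R with `cut()`. The simulated "Time on Books" variable and the choice of six bins below are invented for illustration:

```r
# Toy illustration: equal-length vs. equal-frequency binning of a
# simulated, skewed "Time on Books" variable (not the article's data).
set.seed(1)
tob <- round(rexp(1000, rate = 1/24))  # months on books

# Equal length: split the range into 6 intervals of identical width
equal_length <- cut(tob, breaks = 6)

# Equal frequency: cut at the quantiles so each bin holds ~1/6 of records
equal_freq <- cut(tob,
                  breaks = quantile(tob, probs = seq(0, 1, by = 1/6)),
                  include.lowest = TRUE)

table(equal_length)  # counts vary widely; tail bins may be nearly empty
table(equal_freq)    # counts are roughly equal by construction
```

The two tables make the trade-off above concrete: equal-length bins describe the distribution but can leave bins too thin for valid metrics, while equal-frequency bins guarantee counts but place cutpoints with no regard for the target.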

**Equal length intervals**

- Objective: Understand the distribution of a variable.
- Example: The classic histogram, whose bins have equal length that can be calculated using different rules (Sturges, Rice, and others).
- Disadvantage: The number of records in a bin may be too small to allow for a valid calculation, as shown in Table 1.

Table 1. Time on Books and Credit Performance. Bin 6 has no bads, producing indeterminate metrics.

**Equal frequency intervals**

- Objective: Analyze the relationship with a binary target variable through metrics like bad rate.
- Example: Quartiles or Percentiles.
- Disadvantage: The cutpoints selected may not maximize the difference between bins when mapped to a target variable, as shown in Table 2.

Table 2. Time on Books and Credit Performance. Different cutpoints may improve the Information Value (0.4969).

**Supervised Discretization**

Supervised Discretization divides a continuous feature into groups (bins) mapped to a target variable. The central idea is to find those cutpoints that maximize the difference between the groups.

In the past, analysts used to iterate from Fine Binning to Coarse Binning, a very time-consuming process of manually and visually finding the right cutpoints (if they were found at all). Nowadays, with algorithms like ChiMerge or Recursive Partitioning, two of the several techniques available [2], analysts can find the optimal cutpoints in seconds and evaluate the relationship with the target variable using metrics such as Weight of Evidence and Information Value.
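Once a set of cutpoints is fixed, the Weight of Evidence and Information Value metrics are just a few lines of arithmetic. A minimal sketch, with invented good/bad counts per bin:

```r
# Weight of Evidence and Information Value for a fixed binning.
# The goods/bads counts below are invented for illustration.
goods <- c(100, 300, 500, 600)   # good accounts per bin
bads  <- c( 80,  60,  50,  10)   # bad accounts per bin

dist_good <- goods / sum(goods)  # share of all goods falling in each bin
dist_bad  <- bads  / sum(bads)   # share of all bads falling in each bin

woe <- log(dist_good / dist_bad)           # Weight of Evidence per bin
iv  <- sum((dist_good - dist_bad) * woe)   # Information Value overall

round(woe, 4)
round(iv, 4)
```

Bins where bads are over-represented get negative WoE, bins where goods dominate get positive WoE, and the IV summarizes how well the whole binning separates the two classes.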

**An Example With 'smbinning'**

Using the 'smbinning' package and its sample data (chileancredit), documented on the package's supporting website, the characteristic Time on Books is binned against Credit Performance (Good/Bad) to establish the optimal cutpoints and obtain meaningful, statistically distinct groups. The R code below, Table 3, and Figure 1 show the result of this application, which clearly surpasses the previous methods with the highest Information Value (0.5353).

```r
# Load package and its data
library(smbinning)
data(chileancredit)

# Training and testing samples
chileancredit.train = subset(chileancredit, FlagSample == 1)
chileancredit.test = subset(chileancredit, FlagSample == 0)

# Run and save results
result = smbinning(df = chileancredit.train, y = "FlagGB", x = "TOB", p = 0.05)
result$ivtable

# Relevant plots (2x2 Page)
par(mfrow = c(2, 2))
boxplot(chileancredit.train$TOB ~ chileancredit.train$FlagGB,
        horizontal = TRUE, frame = FALSE, col = "lightgray", main = "Distribution")
mtext("Time on Books (Months)", 3)
smbinning.plot(result, option = "dist", sub = "Time on Books (Months)")
smbinning.plot(result, option = "badrate", sub = "Time on Books (Months)")
smbinning.plot(result, option = "WoE", sub = "Time on Books (Months)")
```

Table 3. Time on Books cutpoints mapped to Credit Performance.

Figure 1. Plots generated by the package.

In the middle of the "data era", it is critical to speed up the development of scoring models. Binning, and more specifically automated binning, significantly reduces the time-consuming process of generating predictive characteristics, which is why companies like SAS and FICO have developed proprietary algorithms to implement this functionality in their software. For analysts who do not have these tools or modules, the R package 'smbinning' offers a statistically robust alternative to run their analyses faster.

For more information about binning, the package's documentation on CRAN lists references related to the underlying algorithm, and its supporting website lists references for scoring model development.

**References**

[1] Dinero, T. (1996). Seven Reasons Why You Should Not Categorize Continuous Data. Journal of Health & Social Policy, 8(1), 63-72.

[2] Garcia, S. et al (2013) A Survey of Discretization Techniques: Taxonomy and Empirical Analysis in Supervised Learning. IEEE Transactions on Knowledge and Data Engineering, Vol. 25, No. 4, April 2013.

by Joseph Rickert

What will you be doing at 26 minutes and 53 seconds past 9 this coming Saturday morning? I will probably be running simulations. I have become obsessed with an astounding result from number theory and have been trying to devise Monte Carlo simulations to get at it. The result, well known to number theorists, says: choose two integers at random; the probability that they will be coprime is 6/*π*^{2}! Here, *π* materializes out of thin air. Who could have possibly guessed this? Well, Leonhard Euler, apparently, and this sort of magic seems to be quite common in number theory.

More formally, the theorem Euler proved goes something like this: Let P_{N} be the probability that two randomly chosen integers in {1, 2, ... N} are coprime. Then, as N goes to infinity, P_{N} goes to 6/π^{2}. Well, this seems to be a little different. You don't actually have to sample from an infinite set. So, I asked myself, would a person who is not familiar with this result, but who is allowed to do some Monte Carlo simulations, have a reasonable chance of guessing the answer? I know that going to infinity would be quite a trip, but I imagined that I could take a few steps and see something interesting. How about this: choose some boundary, N, and draw a large number, D, of pairs of random numbers. Count the number of coprime pairs, M. Then sqrt(6 / (M/D)) will give an estimate of π. As N gets bigger and bigger you should see the digits of the average of the estimates marching closer and closer to *π*: 3.1, 3.14, 3.141 etc.
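That recipe can be sketched in a few lines. The bound, the number of draws, and the plain recursive gcd below are my own choices for illustration, not taken from the original post:

```r
# Monte Carlo sketch of the coprime-pair estimate of pi.
set.seed(42)
N     <- 1e7   # integers are drawn uniformly from 1..N
draws <- 1e5   # number of random pairs

a <- sample.int(N, draws, replace = TRUE)
b <- sample.int(N, draws, replace = TRUE)

gcd <- function(x, y) if (y == 0) x else gcd(y, x %% y)  # Euclid's algorithm

p_hat  <- mean(mapply(gcd, a, b) == 1)  # estimated coprime probability
pi_hat <- sqrt(6 / p_hat)
pi_hat  # close to 3.14, as described below
```

With 100,000 draws the estimate reliably lands near 3.14, but as the post goes on to explain, squeezing out further digits this way is painfully slow.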

Before you go off and try this, let me warn you: there is no parade of digits. For a modest 100,000 draws with an N around 10,000,000 you should get an estimate of *π* close to 3.14. Then, even with letting N get up around 1e13 with a million draws, you won't do much better. The following code (compliments of a colleague who introduced me to mapply) performs 100 simulations, each with 100,000 draws as the range varies from +/- 1,000,000 to +/- 1e13. (The code runs pretty quickly on my laptop.)

```r
# Monte Carlo estimate of pi
library(numbers)
library(ggplot2)
set.seed(123)

bigRange <- seq(1e6, 1e13, by = 1e11)
M <- length(bigRange)
draws <- 1e5
prob <- numeric(M)    # initialize the result vectors
piEst <- numeric(M)

for (i in 1:M) {
  maxRange <- bigRange[i]
  print(bigRange[i])
  min <- -maxRange
  max <- maxRange
  r1 <- round(runif(n = draws, min = min, max = max))
  r2 <- round(runif(n = draws, min = min, max = max))
  system.time(coprimeTests <- mapply(coprime, r1, r2))
  prob[i] <- sum(coprimeTests) / draws
  print(prob[i])
  piEst[i] <- sqrt(6 / prob[i])
  print(piEst[i])
}

piRes2 <- data.frame(bigRange, prob, piEst)
p2 <- ggplot(piRes2, aes(bigRange, piEst))
p2 + geom_line() + geom_point() +
  xlab("Half Range for Random Draws") +
  ylab("Estimate of Pi") +
  ggtitle("Expanding Range Simulation") +
  geom_smooth(method = "lm", se = FALSE, color = "black")
```

Here is the "time series" plot of the results.

The mean is 3.142959. The slight upward trend is most likely a random number induced illusion.

The problem, as I have framed it, is probably beyond the reach of a naive Monte Carlo approach. Nevertheless, on Saturday, when I have some simulation time, I will try running 100,000,000 draws. This should get me another digit of *π*, since the accuracy of the mean increases as sqrt(N).

The situation, however, is not as dismal as I have been making it out to be. Don't imagine that just because the problem is opaque to a brute-force Monte Carlo effort it is without the possibility of computational illumination. Euler's proof, mentioned above, turns on recognizing that the probability of two randomly chosen integers being coprime may be expressed in terms of the Riemann zeta function ζ(2). The proof yields *π*^{2} = 6ζ(2). The wizards of R have reduced this calculation to the trivial: the Rmpfr package, which allows the use of arbitrarily precise numbers instead of R's double-precision numbers, includes the function zeta(x)! So, here, splendidly arrayed, is *π*.
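Even without Rmpfr installed, the identity is easy to check in plain double precision by truncating the series for zeta(2):

```r
# pi^2 = 6 * zeta(2): check the identity by summing the first million
# terms of sum(1/n^2). (Rmpfr's zeta() would give arbitrary precision;
# this is just a quick double-precision sanity check.)
zeta2_approx <- sum(1 / (1:1e6)^2)   # truncation error is about 1e-6
pi_from_zeta <- sqrt(6 * zeta2_approx)
pi_from_zeta   # matches pi to roughly six significant digits
```

A million terms of a convergent series beat a hundred million random draws, which is rather the point.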

Happy Pi Day!

by Joseph Rickert

Distcomp, a new R package available on GitHub from a group of Stanford researchers, has the potential to significantly advance the practice of collaborative computing with large data sets distributed over separate sites that may be unwilling to explicitly share data. The fundamental idea is to be able to rapidly set up a web service based on Shiny and OpenCPU technology that manages and performs a series of master / slave computations which require sharing only intermediate results. The particular target application for distcomp is any group of medical researchers who would like to fit a statistical model using the data from several data sets, but face daunting difficulties with data aggregation or are constrained by privacy concerns. Distcomp and its methodology, however, ought to be of interest to any organization with data spread across multiple heterogeneous database environments.

Setting up the distcomp environment requires some preliminary work and out-of-band communication among the collaborators. In the first step, the lead investigator uses a distcomp function to invoke a browser-based Shiny application to describe the location of her data set, the variables to be used in the computation, the model formula and other metadata necessary to describe the computation.

Next, the investigator invokes another distcomp function to move the metadata and a copy of the local data set to computation server with a unique identifier. Once the master server is in place, collaborating investigators at remote locations perform a similar process to set up slave computation servers at their sites. When the lead investigator receives the URLs pointing to the slave servers she is ready to kick off the computation.

All of the details of this setup process are described in this paper by Narasimhan et al. The paper also describes two non-trivial computations: a distributed rank-k singular value decomposition and a distributed, stratified Cox model, both of interest in their own right. The algorithm and code for the stratified Cox model ought to be useful to data scientists in a number of fields working on time to event models. A really nice feature of the algorithm is that it only requires each site to independently optimize the partial likelihood function using its local data. The master process uses the partial likelihood information from all of the sites to compute a final estimate of the coefficients and their variances.

There are several nice aspects to this work:

- It builds on the cumulative work of the R community to provide a big league, big data application around open source R.
- It provides a flexible paradigm for implementing distributed / parallel applications that leverages existing R algorithms (e.g. the Cox model makes use of code in the survival package).
- It illustrates the ease with which R projects can be deployed in web service applications with Shiny and other R-centric software such as DeployR.
- It provides an alternative to building out infrastructure and aggregating data before realizing the benefits of a big data computation. (Prototyping calculations with distcomp might also serve to justify the expense and effort of developing centralized infrastructure.)
- It recognizes that privacy and other social concerns are important in big data applications and provides a model for respecting some of the social requirements for dealing with sensitive data.

Distcomp is new work and the developers acknowledge several limitations. (So far, they have only built out two algorithms and they don’t have a way to easily deal with factor data across the distributed data sets.) Nevertheless, the project appears to show great promise.

By David R. Morganstein, ASA President

Raise your hand if you recently read an article in a newspaper or online about a new scientific discovery and were surprised by how the journalist reported the data or a statistical concept.

You’re not alone! We may both want reporters to be more statistically literate.

More of them will be, thanks to a new American Statistical Association (ASA) and Sense About Science USA (SAS USA) initiative whose goal is to connect with journalists and their editors and help them become more statistically savvy.

Last year, the ASA began working with the newly formed SAS USA to re-launch STATS.org, a statistics informational and resource hub for journalists — and anyone interested in how numbers shape science and society. The project will help reporters gain access to the statistical, data-driven perspective of stories on which they are working.

Through STATS.org, journalists are connected with statisticians who are experts on specific topics and can provide them with understandable statistical advice and explanations. This connection is critical to raising the media's understanding of the statistical issues in their stories. For the first time, many reporters will have access to a statistical expert who can help them interpret a scientific study and convey its meaning to the public.

In addition, ASA member statisticians provide background information and a statistical-science perspective on timely news stories, which STATS.org writers turn into articles about quantitative concepts in a readable, easily understandable style. These articles are featured on the STATS.org website for reporters and others interested in statistical matters.

The project, which publicly launched January 1, already is hard at work. The website is populated with articles covering medicines, nutrition science, the vaccination debate and climate change.

The latter is in response to a recent *New York Times* opinion article that advocated that climate scientists use a less stringent standard to assess global warming. Michael Lavine, an ASA member and professor of mathematics and statistics at the University of Massachusetts Amherst, wrote an article clarifying the statistical components of the opinion piece.

“Unfortunately, to make her argument, the author confuses several different aspects of confidence, evidence, belief, and decision-making. The purpose here is to point out the confusion and clarify the statistical issues. This article is not about climate change; it’s about statistics. Oreskes’ mistaken interpretation of these statistical ideas do not imply that climate change is under question; the evidence for climate change consists of mechanistic as well as statistical arguments, and has little to do with the topic under discussion here: a misinterpretation of what is called the p-value,” wrote Lavine.

You, too, can help ASA and STATS.org. The next time you read a news article that misstates or misinterprets statistical data or concepts, take a minute to forward a copy of it to the ASA or STATS.org. Also, if you see a key statistical concept that consistently is misreported in the media, let us know about it as well.

STATS.org: Because Numbers Count

The World Cup of Cricket starts this week. (C'mon Aussie!) Cricket isn't well-known amongst many of my American friends or colleagues, so when I'm asked about it I usually point them to this video, which gives a good sense of the game:

Actually, this Vox article and this ESPN video do a much better job of describing the game. One thing the ESPN video doesn't mention (besides not listing all 11 ways to be out) is the possibility of a draw. In test match cricket, it's entirely possible for a match lasting five days to end without a winner. The reason is that a test match has four innings (each team gets to bat twice), and is also limited to five days. If the time limit ends before both innings are complete, *and* the trailing team is still at bat, the game is declared a draw. (The idea is that the trailing team may have caught up if only the game could continue.) The strategy for the trailing team, if they don't think they can achieve outright victory, is to instead play for time and go for the draw. (The winner of a test series is the team with the most wins over five five-day test matches.)

Playing for five days without a definitive outcome can try the patience of the modern sporting fan, so one-day cricket was born. Here, there are just two innings per game, each team is given a fixed number of overs (sets of six balls) with which to score, after which the innings is automatically over and the other team has an opportunity to bat. As the name suggests, the game is over in a day and one team or the other will be declared the winner (unless there is an exact tie in scores).

There's an interesting statistical angle here, which is related to interruptions in the game. Let's say we're halfway through the second innings and Australia is at bat with 142 runs to England's 204. Normally, Australia would need to score 63 runs (the "target") to win. Now suppose it starts to rain, and the game is suspended for an hour. To keep the game from running long, Australia will be given fewer overs to bat, and their target will be reduced as well. But the target isn't reduced in exact proportion to the overs removed, to reflect the fact that more runs are generally scored in the latter part of an innings. The exact calculation is based on statistical analysis of cricket games, and is a great example of censored data analysis. (The basic idea is to be able to forecast what the final score *would have been* in games that are interrupted.) The calculation is known as the Duckworth-Lewis method, named after the two British statisticians who devised it. People often talk about statistics and baseball in the same breath, but this is the only example I can think of where statistical *modeling* is such an important part of a sport. (If you can think of others, let me know in the comments!)
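To give a flavor of the calculation, here is a toy version in R. This is emphatically NOT the official Duckworth-Lewis resource table: the exponential "resource" curve, its parameter, and the numbers are invented purely to show the shape of the adjustment:

```r
# A toy sketch of the Duckworth-Lewis idea (not the official table):
# assume run-scoring "resources" grow with the overs remaining along an
# invented exponential curve, and revise the target by the resource ratio.
resources <- function(overs_left, full = 50, b = 0.04) {
  # fraction of a full innings' scoring resources still available
  (1 - exp(-b * overs_left)) / (1 - exp(-b * full))
}

first_innings  <- 204                # runs by the side batting first
full_target    <- first_innings + 1  # target with all 50 overs available
overs_lost     <- 10                 # overs removed by rain
share_left     <- resources(50 - overs_lost) / resources(50)
revised_target <- ceiling(full_target * share_left)

revised_target                  # smaller than 205, but larger than a
205 * (50 - overs_lost) / 50    # straight pro-rata reduction
```

Note the key qualitative property: because early overs carry less scoring resource than late ones, removing 20% of the overs removes less than 20% of the target, just as the text describes. This sketch also ignores wickets lost, which the real method accounts for.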

Well, that's all for this week — I'm off to watch the cricket! See you back here on Monday.

From the "statistician humour" department, today's xkcd cartoon will ring a bell for anyone who's ever published (or read!) a scientific article including a P-value for a statistical test:

If finding P-value excuses is a common activity for you (and let's hope not!) then R has you covered with the Significantly Improved Significance Test. This R code from Rasmus Bååth will automatically annotate your P-values between 0.05 and 0.12 with excuses like "suggestive of statistical significance", "weakly non-significant" or "quasi-significant". Bonus points for links in the comments to real journal articles that actually use these excuses!
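For fun, a homemade sketch of the same joke (this is my own toy function, not Bååth's actual code; the cutoffs and phrases are invented in his spirit):

```r
# Annotate borderline p-values with a face-saving phrase.
# Thresholds and wording are invented for illustration.
p_excuse <- function(p) {
  if (p <= 0.05)      "significant"
  else if (p <= 0.07) "suggestive of statistical significance"
  else if (p <= 0.09) "weakly non-significant"
  else if (p <= 0.12) "quasi-significant"
  else                "not significant"
}

p_excuse(0.049)  # "significant"
p_excuse(0.061)  # "suggestive of statistical significance"
p_excuse(0.11)   # "quasi-significant"
```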

*by Bob Horton, **Data Scientist, Revolution Analytics*

From electronic medical records to genomic sequences, the data deluge is affecting all aspects of health care. The Master of Science in Health Informatics (MSHI) program at the University of San Francisco, now in its second year, is designed to help students develop the practical computing skills and quantitative perspicacity they need to manage and exploit this wealth of data in health care applications.

This spring, I am privileged to participate in this effort by developing and teaching a new course, “Statistical Computing for Biomedical Data Analytics”, intended to motivate and prepare students for further studies in data science, such as the intensive summer courses of the MSAN bootcamp. The syllabus is on github.

As you’ve probably guessed, we will be using R. Other courses in the curriculum use Python, which seems to be favored by engineers; in contrast, R was developed by and for statisticians. We want the students to be exposed to both perspectives, and to have the technical background needed to make use of the extensive repositories of code available from CRAN and Bioconductor.

Data science is an interdisciplinary endeavor born of the synergy between computing, statistics, data management, and visualization. This can make it challenging to get started, because you have to know so many things before you get to the good stuff. We’re going to try to ease into it by starting with computational explorations of mathematical and statistical concepts. R is a fantastic environment for this; you can see a bell-shaped curve emerge from an example as simple as

`plot(0:20, choose(n=20, k=0:20))`

Note the expressive power of the vector of k values, and the easy convenience of having a world of statistical functions at your fingertips. Imagine how this little plot would have delighted Sir Francis Galton.

Data science is a journey. The enormous breadth of material and the rapid pace of development mean that the most important thing to learn is how to learn more. We’ll explore many fantastic resources for learning data science and R. For example, Coursera has excellent offerings, exemplified by the series of mini-courses from Johns Hopkins; our students will take at least one of these as a course project.

Of course, the R community itself is the biggest and most important resource. One class will be a field trip to a Bay Area useR Group (BARUG) meeting, and the comments in response to this post will be required reading. Ideas or suggestions regarding the syllabus or course materials from the github repository are welcome, as are observations or ruminations on the process of learning data science and R.

Finally, we are very interested in helping our students find outstanding internship opportunities in health-related organizations. Please don’t hesitate to contact me through community@revolutionanalytics.com if you are interested in working with us. Stay tuned for progress reports.

by Joseph Rickert

KatRisk, a Berkeley-based catastrophe modeling company specializing in wind and flood risk, has put three R and Shiny powered interactive demos on their website. Together these provide a nice introduction to the practical aspects of weather-based risk modeling and give a good indication of the kinds of data that are important. Two of the models, the US & Caribbean Hurricane Model and the Asia Typhoon Model, provide a tremendous amount of information, but they require a little background knowledge to understand the data required to drive them and the computed loss statistics.

The Flood Data Lookup Model, however, can really hit home for anybody. Just bring up the model, type in the address of the location of interest and press the red "Geocode" button to get the associated longitude and latitude. Then click on the "Get Data" button. The resulting information will give you an idea of the level of risk for the property and let you know what a 100 year flood and 500 year flood would look like. Next, switch to the "Flood Map" tab and press the "Get Map" button to see some of the information overlaid on a Google map.

Not being able to resist the opportunity to have Google Maps google Google, I thought it would be interesting to see how bad things could get at the Googleplex.

Uh oh! The Googleplex gets a pretty high KatRisk score. A 100 year flood would put the place under 7 feet of water!

Not to worry though: Google has already completed their first round of feasibility tests for a navy. (Nobody does long range planning like Google.)

The KatRisk models are based on R code that makes heavy use of data.table for fast table lookups of the risk results. As the company says on their website:

KatRisk has developed a suite of analytic tools to make it easy to access our data and models. We use open source software tools including R Shiny for our web applications. By using R shiny we can develop on-line products that can also easily be deployed to a client site. Our software is completely open, so if you decide to host our analytical tools you will be able to see all of the details in easy to understand and modify R code.
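The keyed-lookup pattern that makes such table queries fast looks something like this (the table contents and column names below are invented; only the mechanism is the point):

```r
# Keyed lookup with data.table: after setkey(), subsetting by key values
# is a binary search rather than a scan of every row.
library(data.table)

risk <- data.table(cell_id = 1:1e6,
                   flood_depth_100yr = runif(1e6, min = 0, max = 10))
setkey(risk, cell_id)            # sort and index the table on cell_id

hits <- risk[J(c(42L, 99999L))]  # fetch two cells by key
hits
```

On a million-row table the keyed lookup returns essentially instantly, which is what makes interactive Shiny front ends over large risk tables practical.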

For some details on the underlying analytics have a look at this previous post that was based on a talk Dag Lohmann gave to the Bay Area UseR Group last year.

So, go ahead and compute your KatRisk score, but please do be mindful of the company's request not to run the model for more than 3 locations in one day.

by Ryan Garner

Senior Data Scientist, Revolution Analytics

I love creating spatial data visualizations in R. With the ggmap package, I can easily download satellite imagery which serves as a base layer for the data I want to represent. In the code below, I show you how to visualize sampled soil attributes among 16 different rice fields in Uruguay.

```r
library(ggmap)
library(plyr)
library(gridExtra)

temp <- tempfile()
download.file("http://www.plantsciences.ucdavis.edu/plant/data.zip", temp)
connection <- unz(temp, "Data/Set3/Set3data.csv")
rice <- read.csv(connection)
names(rice) <- tolower(names(rice))

# Create a custom soil attribute plot
# @param df Data frame containing data for a field
# @param attribute Soil attribute
# @return Custom soil attribute plot
create_plot <- function(df, attribute) {
  map <- get_map(location = c(median(df$longitude), median(df$latitude)),
                 maptype = "satellite", source = "google",
                 crop = FALSE, zoom = 15)
  plot <- ggmap(map) +
    geom_point(aes_string(x = "longitude", y = "latitude", color = attribute),
               size = 5, data = df)
  plot <- plot + ggtitle(paste("Farmer", df$farmer, "/ Field", df$field))
  plot <- plot + scale_color_gradient(low = "darkorange", high = "darkorchid4")
  return(plot)
}

ph_plot <- dlply(rice, "field", create_plot, attribute = "ph")
ph_plots <- do.call(arrangeGrob, ph_plot)
```

First, I download the data used in "Spatial Data Analysis in Ecology and Agriculture using R" by Dr. Richard Plant. (This is an excellent book for getting your feet wet working with spatial data in R.) After the data has been downloaded, I create a function that builds a custom soil attribute plot for each unique field found in the rice yield data. I then customize the output to include larger spatial points and, for clarity, a custom gradient running from dark orange to dark purple.

Finally, once all the plots are generated, I arrange them into a single plot.

The plot shows the soil pH in 16 fields belonging to 9 different farmers. The second-to-last plot, field 15 of farmer L, appears to show higher pH than the rest.

by Joseph Rickert

In a recent post, where I presented some R-related highlights of November's H_{2}O World conference, I singled out and described talks by Trevor Hastie and John Chambers and remarked that it would be nice if the videos were made available. Well, thanks to the generosity of the folks at H_{2}O, I got my wish.

Here is the video of Professor Hastie's talk.

This video represents a master class on machine learning in which, in 40 minutes or so, Professor Hastie conducts a tour that starts with basic decision trees and goes all the way to building learning ensembles with the Lasso. Along the way, he presents the salient ideas on bagging, random forests and boosting. The treatment of boosting is succinct and elegant, covering some remarkable features of the family of boosting algorithms. For example, Professor Hastie describes how training error in Adaboost can reach zero and stay there while testing error continues to improve, how superior performance can be achieved with boosting algorithms using only tree stumps, and how the stagewise additive modeling "slows down the rate of overfitting". The really deep insight comes in the discussion about viewing Adaboost as an algorithm that fits additive logistic regression models with an exponential loss function. This, in turn, leads to a discussion of Jerome Friedman's Gradient Boosting Machine and more general boosting algorithms that can accommodate multiple kinds of loss functions. These are the models implemented in R's gbm package.

I think this video of John Chambers' reminiscing about his time at Bell Labs working with John Tukey is destined to become an important part of the historical record for Statistics. There are many remembrances of Tukey to be found online, but I don't know of any other visual record by someone of John Chambers' stature who interacted with Tukey as a colleague and professional statistician.

In just a few minutes, Chambers paints a balanced and revealing portrait that humanizes and captures some of the complexity of this icon of modern statistics. I especially like the story in the Q & A portion of the talk where John describes Tukey's propensity for "mischief" and his delight in inventing new words (like boxplot "hinges") that rankled many of his statistician colleagues, but apparently particularly upset the British statisticians.

There are a few more videos on the H_{2}O site that are worth a look.