by Herman Jopia
What is Binning?
Binning is the term used in scoring modeling for what is also known in Machine Learning as Discretization: the process of transforming a continuous characteristic into a finite number of intervals (the bins), which allows for a better understanding of its distribution and its relationship with a binary variable. The bins generated by this process will eventually become the attributes of a predictive characteristic, the key component of a Scorecard.
Why Binning?
Though there is some reticence about it [1], the benefits of binning are fairly straightforward.
Unsupervised Discretization
Unsupervised Discretization divides a continuous feature into groups (bins) without taking into account any other information. It is basically a partition with two options: equal length intervals and equal frequency intervals.
Equal length intervals
Table 1. Time on Books and Credit Performance. Bin 6 has no bads, producing indeterminate metrics.
Equal frequency intervals
Table 2. Time on Books and Credit Performance. Different cutpoints may improve the Information Value (0.4969).
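Both unsupervised schemes are easy to sketch in base R. The simulated time-on-books variable below is purely illustrative (it is not the chileancredit data), and six bins is an arbitrary choice:

```r
# Equal-length vs. equal-frequency binning of a simulated
# "time on books" variable (months); illustrative data only.
set.seed(1)
tob <- round(rexp(1000, rate = 1/24))

# Equal length intervals: 6 bins of identical width
equal_width <- cut(tob, breaks = 6)

# Equal frequency intervals: roughly equal counts per bin, via quantiles
cuts <- quantile(tob, probs = seq(0, 1, length.out = 7))
equal_freq <- cut(tob, breaks = unique(cuts), include.lowest = TRUE)

table(equal_width)   # counts vary widely across bins
table(equal_freq)    # counts are roughly balanced
```

Note how the equal-length scheme can leave some bins nearly empty (the source of the indeterminate metrics in Table 1), while the quantile-based scheme balances the counts.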
Supervised Discretization
Supervised Discretization divides a continuous feature into groups (bins) mapped to a target variable. The central idea is to find those cutpoints that maximize the difference between the groups.
In the past, analysts used to iteratively move from Fine Binning to Coarse Binning, a very time consuming process of finding manually and visually the right cutpoints (if ever). Nowadays with algorithms like ChiMerge or Recursive Partitioning, two out of several techniques available [2], analysts can quickly find the optimal cutpoints in seconds and evaluate the relationship with the target variable using metrics such as Weight of Evidence and Information Value.
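Once a set of cutpoints is fixed, both metrics are simple to compute. The sketch below uses a made-up four-bin good/bad table, not the Time on Books data:

```r
# Weight of Evidence and Information Value for a binned characteristic.
# The good/bad counts below are invented for illustration.
bins <- data.frame(
  bin  = c("< 12", "12-24", "24-36", "> 36"),
  good = c(100, 250, 300, 350),
  bad  = c(60, 50, 30, 20)
)
dist_good <- bins$good / sum(bins$good)       # share of all goods in each bin
dist_bad  <- bins$bad  / sum(bins$bad)        # share of all bads in each bin
bins$WoE  <- log(dist_good / dist_bad)        # Weight of Evidence per bin
iv <- sum((dist_good - dist_bad) * bins$WoE)  # Information Value
iv
```

Bins where goods and bads are distributed alike contribute little to the IV; cutpoints that separate the two classes sharply drive it up, which is exactly what the supervised algorithms search for.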
An Example With 'smbinning'
Using the 'smbinning' package and its data (chileancredit), whose documentation can be found on its supporting website, the characteristic Time on Books is grouped into bins taking into account the Credit Performance (Good/Bad) to establish optimal cutpoints that yield meaningful and statistically distinct groups. The R code below, Table 3, and Figure 1 show the result of this application, which clearly surpasses the previous methods with the highest Information Value (0.5353).
# Load package and its data
library(smbinning)
data(chileancredit)
# Training and testing samples
chileancredit.train=subset(chileancredit,FlagSample==1)
chileancredit.test=subset(chileancredit,FlagSample==0)
# Run and save results
result=smbinning(df=chileancredit.train,y="FlagGB",x="TOB",p=0.05)
result$ivtable
# Relevant plots (2x2 Page)
par(mfrow=c(2,2))
boxplot(chileancredit.train$TOB~chileancredit.train$FlagGB,
        horizontal=T, frame=F, col="lightgray", main="Distribution")
mtext("Time on Books (Months)",3)
smbinning.plot(result,option="dist",sub="Time on Books (Months)")
smbinning.plot(result,option="badrate",sub="Time on Books (Months)")
smbinning.plot(result,option="WoE",sub="Time on Books (Months)")
Table 3. Time on Books cutpoints mapped to Credit Performance.
Figure 1. Plots generated by the package.
References
[1] Dinero, T. (1996). Seven Reasons Why You Should Not Categorize Continuous Data. Journal of Health & Social Policy 8(1), 63-72.
[2] Garcia, S. et al. (2013). A Survey of Discretization Techniques: Taxonomy and Empirical Analysis in Supervised Learning. IEEE Transactions on Knowledge and Data Engineering 25(4).
by Joseph Rickert
The gradient boosting machine, as developed by Friedman, Hastie, Tibshirani and others, has become an extremely successful algorithm for dealing with both classification and regression problems, and is now an essential feature of any machine learning toolbox. R’s gbm() function (gbm package) is a particularly well-crafted implementation of the gradient boosting machine that served as a model for the rxBTrees() function, which was released last year as a feature of Revolution R Enterprise 7.3. You can think of rxBTrees() as a scaled-up version of gbm() that is designed to work with massive data sets in distributed environments.
The basic idea underlying the gbm is that weak learners may be combined in an additive, iterative fashion to produce highly accurate classifiers that are resistant to overfitting. While in theory any weak learner can be incorporated into a gbm, classification and regression trees, or decision trees as they are called in RRE, are particularly convenient. As Kuhn and Johnson (p 205) point out:
rxBTrees() scales gbm() by building the boosting algorithm around RRE’s rxDTree() decision tree implementation. As I described in a previous post, this algorithm scales to work with large data sets by building trees on histograms, a summary of the data, and not on the data itself. rxBTrees(), like rxDTree(), is also a proper parallel external memory algorithm (PEMA). This means two things:
The following function fits an rxBTrees() model to a training data subset of a mortgage data set. rxBTrees() contains many input parameters. Some are common to all RevoScaleR PEMAs, others are inherited from rxDTree(), and still others, such as nTree and learningRate, have direct counterparts in R's gbm() function. In this function call, I have explicitly called out parameters that pertain to fitting a gbm model.
# Fit a Boosted Trees Model on the training data set
form <- formula(default ~ houseAge + creditScore + yearsEmploy + ccDebt)
BT_model <- rxBTrees(formula = form,
    data = "mdTrain",
    lossFunction = "bernoulli",
    minSplit = NULL,       # Min num of obs at node before a split is attempted
    minBucket = NULL,      # Min num of obs at a terminal node
    maxDepth = 1,          # 2^maxDepth = number of leaves on tree
    cp = 0,                # Complexity parameter
    maxSurrogate = 0,      # Max surrogate splits retained, 0 improves perf
    useSurrogate = 2,      # Determines how surrogates are used
    surrogateStyle = 0,    # Sets surrogate algorithm
    nTree = 100,           # Num of trees = num of boosting iterations
    mTry = NULL,           # Num vars to sample as a split candidate
    sampRate = NULL,       # % of observations to sample for each tree; default = 0.632
    importance = TRUE,     # Importance of predictors to be assessed
    learningRate = 0.1,    # Also known as shrinkage
    maxNumBins = NULL,     # Max num bins in histograms; default: min(1001, max(sqrt(num of obs)))
    verbose = 1,           # Amount of status sent to console
    computeContext = "RxLocalParallel",  # Use all cores available in PC
    overwrite = TRUE)      # Allow xdf output file to be overwritten
Some especially noteworthy parameters are maxDepth, maxNumBins and computeContext. maxDepth accomplishes the same purpose as interaction.depth in the gbm() function. However, they are not the same parameter. Setting interaction.depth = 1 produces stump trees with two leaves each, because the number of leaves = interaction.depth + 1. In rxBTrees(), stumps are also produced by setting maxDepth = 1, but this is because 2^maxDepth gives the number of leaves.
maxNumBins determines the number of bins to be used in making histograms of the data. If anyone ever wanted to compare gbm() directly with rxBTrees() on a small data set, for example, setting maxNumBins to the number of points in the data set would mimic gbm(). Note, however, that this would saddle rxBTrees() with all of the overhead of computing histograms while receiving none of the benefits.
The computeContext parameter is the mechanism that connects a PEMA to the underlying hardware and enables functions such as rxBTrees() to run in parallel on various clusters using distributed data. Some of the details of how this happens have been described in a previous post, but I hope to be able to follow up this post with an example of rxBTrees() running on Hadoop in the not too distant future.
The data set used to build the model above is available on Revolution Analytics' data set web page, both as an xdf file and as multiple csv files. The following command shows what the data looks like.
> rxGetInfo(mortData,getVarInfo=TRUE,numRows=3)
File name: C:\DATA\Mortgage Data\mortDefault\mortDefault2.xdf
Number of observations: 1e+07
Number of variables: 6
Number of blocks: 20
Compression type: none
Variable information:
Var 1: creditScore, Type: integer, Low/High: (432, 955)
Var 2: houseAge, Type: integer, Low/High: (0, 40)
Var 3: yearsEmploy, Type: integer, Low/High: (0, 15)
Var 4: ccDebt, Type: integer, Low/High: (0, 15566)
Var 5: year, Type: integer, Low/High: (2000, 2009)
Var 6: default, Type: integer, Low/High: (0, 1)
Data (3 rows starting with row 1):
  creditScore houseAge yearsEmploy ccDebt year default
1         615       10           5   2818 2000       0
2         780       34           5   3575 2000       0
3         735       12           1   3184 2000       0
In building the model, I divided the mortgage default data randomly into training and test data sets. Then I fit an rxBTrees() model on the training data and used the test data to compute the ROC curve shown below.
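The split itself is straightforward. Here is a base-R sketch on simulated data standing in for the 10-million-row xdf file (the column ranges mimic the variable summary above; the 70/30 proportion is illustrative):

```r
# Random 70/30 train/test split, sketched on simulated data rather
# than the mortgage xdf file used in the post.
set.seed(42)
n <- 1000
mort <- data.frame(
  creditScore = sample(432:955, n, replace = TRUE),
  ccDebt      = sample(0:15000, n, replace = TRUE)
)
mort$default <- as.integer(runif(n) < 0.1)   # ~10% defaults

idx   <- sample(n, size = 0.7 * n)  # indices of the training rows
train <- mort[idx, ]
test  <- mort[-idx, ]
c(nrow(train), nrow(test))
```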
Along the way, the model also produced the out-of-bag error versus number of trees plot, which shows that for this data set one could probably get by with building only 70 or 80 trees.
The code I used to build the model is available here: Download Post_model. If you do not have Revolution R Enterprise available to you at your company, you can try out this model for yourself in the AWS cloud by taking advantage of Revolution Analytics' 14 day free trial.
Take the first step towards building a gbm on big data.
by Gary R. Moser
Director of Institutional Research and Planning
The California Maritime Academy
I recently contacted Joseph Rickert about inviting Vim guru Drew Neil (web: vimcasts.org; book: "Practical Vim: Edit Text at the Speed of Thought") to speak at the Bay Area R User Group. Since Drew lives in Great Britain, that might not be easily achieved, so Joe generously extended an invitation for me to share a bit about why I like Vim here.
When it comes to text editors, there are a number of strong contenders to choose from (...lots and lots of them). If you've found a tool with a rich feature set that makes you more productive (and, more importantly, is something you like), you should probably stick with that. Given that editing text of various types is such a big part of our lives, however, it's worth turning a critical eye toward how well our tools facilitate this specific activity.
Do you spend a lot of time editing text? Yeah, me too. Personally, this includes R code, Markdown, LaTeX, as well as notes and outlines (including this article). When I rediscovered Vim fairly recently, it was from the perspective of being a proficient R user. Therefore, my values and beliefs about what constitutes good software are framed by my experience with R. That includes being OK with an initial learning curve.
I have been using RStudio as my primary IDE for R since it was first offered. It's great software, and they did something not too long ago that I really appreciate: they added a stripped-down Vim editing mode. Vim pushes users to the documentation pretty quickly (usually as the result of accidentally deleting large chunks of work), and as I dug in and began to discover its full functionality, I came to realize how much I was missing out on by using the emulator. The ability to set and jump to marks in a document, or utilizing multiple registers for copy/paste, are two good examples of essential but missing features in RStudio.
Vim has been described as "a language for text editing," which I think is a useful way to view it. At the risk of sounding snotty, I would compare the experiences of using Vim (or another good editor) versus a plain-jane text editor to that of playing chess versus checkers. That is, there's an element of strategic and intentional action compared to simply playing one of a limited set of moves over and over again.
One of the things that makes Vim so interesting and different from other editors stems from its origins. As the result of being developed in the context of severe constraints (slow networks, no digital displays, limited system resources, and no mouse), Vim, then "ed", had to accomplish the greatest amount of work with the least number of keystrokes. This requirement led to the development of a vast number of very specific commands that can be combined in useful ways. Drew Neil artfully compares this to playing notes, chords, and melodies on a piano. It's also an appropriate comparison for setting one's expectations toward becoming a skilled Vim user! Michael Mrozek's humorous plot cleverly suggests that, not unlike R, Vim doesn't hold your hand.
It also speaks to my point about specificity. Emacs, for example, can be extended to be a web client or music player, hence the rabbit-hole learning curve, but isn't that getting away from the primary task of text editing?
The fundamental way that Vim differs from most other text editors is that it is explicitly modal; all software is technically modal in certain ways (that is, the same keys do different things under different circumstances), but with Vim it is a central design feature. Essentially, what this means is that by switching modes, a different keyboard comes into existence under your fingers. Because Vim has four modes, and a very rich and terse set of key-bindings, it's like having four+ keyboards in one. The keyboard cheat sheet is a useful reference, especially in the beginning.
Warning: after becoming familiar with Vim's basic functionality, going back to a typical text editor feels rather clumsy.
Vim as an interface to R using the Vim-R-plugin is mostly good for how I use it, but I expect to be dialing-in Vim for a long time before it's got all the features I want. I don't mind this, but I can see how someone else might. I encourage you to consider your own tools and how well they facilitate your most frequent tasks. If you're an RStudio user, try giving Vim mode a go. A visit to www.vim.org will connect you to the resources you'll need.
by Joseph Rickert
What will you be doing at 26 minutes and 53 seconds past 9 this coming Saturday morning? I will probably be running simulations. I have become obsessed with an astounding result from number theory and have been trying to devise Monte Carlo simulations to get at it. The result, well known to number theorists, says: choose two integers at random; the probability that they will be coprime is 6/π^{2}! Here, π materializes out of thin air. Who could have possibly guessed this? Well, Leonhard Euler, apparently, and this sort of magic seems to be quite common in number theory.
More formally, the theorem Euler proved goes something like this: Let P_{N} be the probability that two randomly chosen integers in {1, 2, ... N} are coprime. Then, as N goes to infinity, P_{N} goes to 6/π^{2}. Well, this seems to be a little different. You don't actually have to sample from an infinite set. So, I asked myself, would a person who is not familiar with this result, but who is allowed to do some Monte Carlo simulations, have a reasonable chance of guessing the answer? I know that going to infinity would be quite a trip, but I imagined that I could take a few steps and see something interesting. How about this: choose some boundary, N, and draw lots of pairs of random numbers, say D pairs. Count the number of coprime pairs, M. Then sqrt(6 / (M/D)) will give an estimate of π. As N gets bigger and bigger you should see the digits of the average of the estimates marching closer and closer to π: 3.1, 3.14, 3.141, etc.
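The recipe can be sketched in base R with a small recursive gcd, before bringing in the numbers package. The boundary and number of draws below are illustrative, and for simplicity the sketch draws positive integers only:

```r
# Estimate pi from the probability that two random integers are coprime.
gcd <- function(a, b) if (b == 0) a else gcd(b, a %% b)

set.seed(314)
draws <- 1e5
N <- 1e7                                     # boundary for the random integers
r1 <- sample.int(N, draws, replace = TRUE)
r2 <- sample.int(N, draws, replace = TRUE)
p_coprime <- mean(mapply(gcd, r1, r2) == 1)  # fraction of coprime pairs
piEst <- sqrt(6 / p_coprime)
piEst  # close to 3.14
```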
Before you go off and try this, let me warn you: there is no parade of digits. For a modest 100,000 draws with an N around 10,000,000 you should get an estimate of π close to 3.14. Then, even with letting N get up around 1e13 with a million draws, you won't do much better. The following code (compliments of a colleague who introduced me to mapply) performs 100 simulations, each with 100,000 draws as the range varies from +/- 1,000,000 to +/- 1e13. (The code runs pretty quickly on my laptop.)
# Monte Carlo estimate of pi
library(numbers)
library(ggplot2)
set.seed(123)
bigRange <- seq(1e6, 1e13, by = 1e11)
M <- length(bigRange)
draws <- 1e5
prob <- numeric(M)    # allocate result vectors
piEst <- numeric(M)
for (i in 1:M) {
  maxRange <- bigRange[i]
  print(bigRange[i])
  min <- -maxRange
  max <- maxRange
  r1 <- round(runif(n = draws, min = min, max = max))
  r2 <- round(runif(n = draws, min = min, max = max))
  system.time(coprimeTests <- mapply(coprime, r1, r2))
  prob[i] <- sum(coprimeTests) / draws
  print(prob[i])
  piEst[i] <- sqrt(6 / prob[i])
  print(piEst[i])
}
piRes2 <- data.frame(bigRange, prob, piEst)
p2 <- ggplot(piRes2, aes(bigRange, piEst))
p2 + geom_line() + geom_point() +
  xlab("Half Range for Random Draws") +
  ylab("Estimate of Pi") +
  ggtitle("Expanding Range Simulation") +
  geom_smooth(method = "lm", se = FALSE, color = "black")
Here is the "time series" plot of the results.
The mean is 3.142959. The slight upward trend is most likely a random number induced illusion.
The problem, as I have framed it, is probably beyond the reach of a naive Monte Carlo approach. Nevertheless, on Saturday, when I have some simulation time, I will try running 100,000,000 draws. This should get me another digit of π, since the accuracy of the mean increases as sqrt(N).
The situation, however, is not as dismal as I have made it out to be. Don't imagine that just because the problem is opaque to a brute-force Monte Carlo effort it is without the possibility of computational illumination. Euler's proof, mentioned above, turns on recognizing that the expression for the probability of two randomly chosen integers being coprime may be expressed in terms of the zeta function evaluated at 2, and the proof yields π^{2} = 6 * zeta(2). The wizards of R have reduced this calculation to the trivial: the Rmpfr package, which allows the use of arbitrarily precise numbers instead of R's double-precision numbers, includes the function zeta(x)! So, here, splendidly arrayed, is π.
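Even without Rmpfr, the identity can be checked in plain double precision with a partial sum of the zeta series (a rough stand-in for Rmpfr's arbitrary-precision zeta()):

```r
# pi via Euler: pi^2 = 6 * zeta(2). A partial sum of 1/n^2 already
# recovers pi to about six digits in double precision.
n <- 1:1e6
piEst <- sqrt(6 * sum(1/n^2))
piEst        # 3.141592...
pi - piEst   # truncation error, on the order of 1e-6
```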
Happy Pi Day!
by Andrie de Vries
In my previous post, I demonstrated how to get status information for jobs running on a parallel back end. However, I stopped short of actually demonstrating progress bars. In this post I show how to do this.
The StackOverflow question "How do you create a progress bar when using the foreach() function in R?" asks how to create progress bars with foreach.
This question attracted some helpful answers, but ultimately didn’t answer the question. In particular, Steve Weston said this approach is flawed, since
It won't work well with doParallel because doParallel only calls the combine function after all of the results have been returned, since it is implemented by calling the parallel clusterApplyLB function. This technique only works well with backends that call the combine function on-the-fly, like doRedis, doMPI, doNWS, and (defunct?) doSMP.
My experiments
However, this set me thinking. Could it be that one could trigger a tcltk progress bar in each worker?
Thus I started to experiment, and came up with the following code. On my Windows machine, using doParallel, this seems to work reliably (but I could not get this approach to work on Linux using doMC).
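A sketch of the idea follows; the worker count, iteration count, and sleep interval are all illustrative, and each worker session lazily creates its own tcltk bar on first use:

```r
# Sketch: one tcltk progress bar per doParallel worker (Windows).
library(doParallel)
library(tcltk)

cl <- makeCluster(4)
registerDoParallel(cl)

r <- foreach(i = 1:40, .packages = "tcltk") %dopar% {
  # Each worker session creates its bar once, then updates it
  if (!exists("pb")) {
    pb <<- tkProgressBar(title = sprintf("Worker PID %d", Sys.getpid()),
                         min = 0, max = 40)
  }
  setTkProgressBar(pb, i)
  Sys.sleep(0.05)   # simulate real work
  i
}
stopCluster(cl)
```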
When you run this code, you should see synchronous progress bars, one for each worker:
Conclusion
Although it takes some work, and there isn’t a single solution for all cases, it is possible to get progress bars for at least some parallel back ends.
In this post I demonstrated how to create progress bars using doParallel on Windows, but failed to replicate this on Linux.
Can you do better?
by Matt Sundquist
co-founder of Plotly
Domino's new R Notebook and Plotly's R API let you code, make interactive R and ggplot2 graphs, and collaborate entirely online. Here is the Notebook in action:
Published R Notebook
To execute this Notebook, or to build your own, head to Domino's Plotly Project. The GIF below shows how to get started: choose an R session, press "Open Notebook", and choose the "R Notebook" file. For sensitive data, see Domino On-Premise and Plotly Enterprise.
This plot, like the rest in this post, was made in the R Notebook with ggplot2, then converted to an interactive plot with one line of code. The plot is drawn with D3.js. You can hover your mouse to see text, click and drag to zoom, and toggle traces on and off by clicking them in the legend. Check out our GitHub page to see our ggplot2 support.
Next up, let's visualize Anscombe's Quartet. Head to the R Notebook to see the R code for this.
Interactive bar charts let you hover and filter. Press the legend items to filter the plot. Head to the R Notebook to see the code.
Plotly and Domino let you transition ggplot2 plots into interactive 3D plots. Head to the R Notebook to see the code.
If you use error bars, you can see the values when you hover your mouse. Head to the R Notebook to see the code.
We can also make dendrograms; this example uses ggdendro and ggplot2. Head to the R Notebook to see the example.
You can make interactive box plots from ggplot2 and share them.
Domino Notebooks can run Python, R, Julia, and Scala. Each Notebook can be shared in the Domino Cloud or exported as an IPython Notebook, HTML or reST file. Notebooks support LaTeX, and allow you to use Markdown, headers, import images, and display iframes.
Plotly graphs can be embedded in Notebooks, dashboards, HTML, knitr, or Shiny. Every Plotly graph can be exported in varied programming languages and image types. For example, for the boxplot above:
This is our first pass at showing how you can use Domino and Plotly for coding and plotting in R. Have ideas? Suggestions? Other collaboration or Notebook projects you'd like to see? We're at @DominoDataLab and welcome your suggestions.
R has strong support for parallel programming, both in base R and additional CRAN packages.
For example, we have previously written about foreach and parallel programming in the articles Tutorial: Parallel programming with foreach and Intro to Parallel Random Number Generation with RevoScaleR.
The foreach package provides simple looping constructs in R, similar to lapply() and friends, and makes it easy to execute each element of the loop in parallel. You can find the packages at foreach: Foreach looping construct for R and doParallel.
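As a minimal illustration, the sequential %do% operator below runs without a registered backend; %dopar% parallelizes the same loop once a backend is registered:

```r
# foreach returns a combined result rather than relying on side effects.
library(foreach)

squares <- foreach(i = 1:5, .combine = c) %do% i^2
squares  # 1 4 9 16 25
```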
Parallel programming can help speed up the total completion time of your project. However, for tasks that take a long time to run, you may wish to track the progress of the task while it is running.
This seems like a simple request, but it turns out to be remarkably hard to achieve. The reason boils down to this:
Each parallel worker is running in a different session of R
In some parallel computing setups, the workers don’t communicate with the initiating process until the final combining step
So, if it is difficult to track progress directly, what can be done?
It seems to me the typical answers to this question fall into three different classes:
Use operating system monitoring tools, i.e. tools external to R.
Print messages to a file (or connection) in each worker, then read from this file, again outside of R
Use specialist back-ends that support this capability, e.g. the Redis database and the doRedis package
This is an area with many avenues of exploration, so I plan to briefly summarize each method and point to at least one question on StackOverflow that may help.
The question Monitoring Progress/Debugging Parallel R Scripts asks if it is possible to monitor a parallel job.
In his answer to this question, Dirk Eddelbuettel mentions that parallel back ends like MPI and PVM have job monitors, such as slurm and TORQUE. However, tools that are simpler to use, like snow, do not have monitoring tools. In this case, you may be forced to use methods like printing diagnostic messages to a file.
For parallel jobs using the doParallel backend, you can use standard operating system monitoring tools to see if the job is running on multiple cores. For example, in Windows, you can use the "Task Manager" to do this. Notice in the CPU utilization how each core went to maximum once the script started:
Sometimes it may be sufficient, or desirable, to print status messages from each of the workers. Simply adding a print() statement will not work, since the parallel workers do not share the standard output of the master job.
The question How can I print when using %dopar% asks how to do this using a snow parallel backend.
Steve Weston, the author of foreach (and one of the original founders of Revolution Analytics) wrote an excellent answer to this question.
Steve says that output produced by the snow workers gets thrown away by default, but you can use the outfile argument of makeCluster() to change that. Setting outfile to the empty string ("") prevents snow from redirecting the output, often resulting in the output from your print messages showing up on the terminal of the master process.
Steve suggests creating and registering your cluster with something like:
library(doSNOW)
cl <- makeCluster(4, outfile="")
registerDoSNOW(cl)
He continues: Your foreach loop doesn't need to change at all. This works with both SOCK clusters and MPI clusters using Rmpi built with Open MPI. On Windows, you won't see any output if you're using Rgui. If you use Rterm.exe instead, you will. In addition to your own output, you'll see messages produced by snow which can also be useful.
Also note that this solution seems to work with doSNOW, but is not supported by the doParallel backend.
The final approach is a novel idea by Brian Lewis, and uses the Redis database as a parallel back end.
Specifically, the R package rredis allows message passing between R and Redis. The package doRedis allows you to use foreach with Redis as the parallel backend. What’s interesting about Redis is that this database allows the user to create queues, and each parallel worker fetches jobs from this queue. This allows for a dynamic network of workers, even across different machines.
The package has a wonderful vignette. Also take a look at the video demo at http://bigcomputing.com/doredis.html (reproduced below):
During my research of available information on this topic, I could not find a published, reliable way of creating progress bars using foreach.
I came across some tantalising hints, e.g. at How do you create a progress bar when using the “foreach()” function in R?
Sadly, the proposed mechanism didn’t actually work.
I think there might be a way of getting progress bars with foreach and the doParallel package, at least in some circumstances.
I plan to pen my ideas in a follow-up blog post.
Meanwhile, can you do better? Is there a way of creating progress bars with foreach in a parallel job?
New media sites like Buzzfeed and Upworthy have mastered the art of "clickbait": headlines and content designed to drive as much traffic as possible to their sites. One technique is to use coy headlines like "If you take a puppy video break today, make sure this is the dog video you watch." (Gawker apparently spends longer writing a headline than the actual article.) But the big stock-in-trade is "listicles": articles that are, well, just lists of things. (Exactly half of Buzzfeed's top 20 posts of this week are listicles, including "32 Paintings Paired With Quotes From 'Mean Girls'".)
If your goal is to maximize virality, how long should a listicle be? Max Woolf, an R user and Bay Area Software QA Engineer, set out to answer that question with data. Buzzfeed reports the number of Facebook shares for each of its articles, so he scraped BuzzFeed’s website and counted the number of items in 15,656 listicles. He then used R's ggplot2 package to plot number of Facebook shares versus number of listicle items, and added a smooth line to show the relationship:
So it looks like if you want your listicle to go viral, aim for 15 or more items.
For R programmers, Max also shares a collection of useful tips for creating graphics like the above using ggplot2. His tutorial provides a step-by-step guide to making basic histograms and scatterplots, and then adding titles and annotations and choosing attractive color schemes to make your chart publication-ready. If you're an R user but haven't yet made the dive into ggplot2, the tutorial linked below is a good place to start.
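The general shape of such a chart — points plus a smoothing line — takes only a few lines of ggplot2. The data simulated below merely stands in for the scraped BuzzFeed counts:

```r
# Scatterplot with a smoothing line, in the spirit of the
# shares-vs-items chart; simulated data, not the scraped counts.
library(ggplot2)
set.seed(7)
listicles <- data.frame(items = sample(5:50, 500, replace = TRUE))
listicles$shares <- rlnorm(500, meanlog = 6 + listicles$items / 40)

p <- ggplot(listicles, aes(items, shares)) +
  geom_point(alpha = 0.3) +   # one point per listicle
  geom_smooth() +             # smoothed trend of shares vs. items
  scale_y_log10()             # share counts are heavily skewed
p
```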
Max Woolf: An Introduction on How to Make Beautiful Charts With R and ggplot2
by Joseph Rickert
Learning to effectively use any of the dozens of popular machine learning algorithms requires mastering many details and dealing with all kinds of practical issues. With all of this to consider, it might not be apparent to a person coming to machine learning from a background other than computer science or applied math that there are some easy-to-get-at and very useful “theoretical” results. In this post, we will look at an upper bound result carefully described in section 2.2.1 of Schapire and Freund’s book “Boosting, Foundations and Algorithms”. (This book, published in 2012, is destined to be a classic. In the course of developing a thorough treatment of boosting algorithms, Schapire and Freund provide a compact introduction to the foundations of machine learning and relate the boosting algorithm to some exciting ideas from game theory, optimization and information geometry.)
The following is an example of a probabilistic upper bound result for the generalization error of an arbitrary classifier.
Assume that:
examples (x,y) are drawn independently at random from a fixed but unknown distribution D, and a classifier h has been trained on m such examples.
Define E_{t}, the training error, to be the percentage of misclassified samples, and E_{g}, the generalization error, to be the probability of misclassifying a single example (x,y) chosen at random from D.
Then, for any d greater than zero, with probability at least 1 - d, the following upper bound holds on the generalization error of h:
E_{g} <= E_{t} + sqrt(log(1/d)/(2m)) where m is the number of random samples (R)
Schapire and Freund approach this result as a coin flipping problem, noting that when a training example (x,y) is selected at random, the probability that h(x) does not equal y can be identified with a flipped coin coming up heads. The probability of getting a head, p, is fixed for all flips. The problem becomes that of determining whether the training error, the fraction of mismatches in a sequence of m flips, is significantly different from p.
The big trick in the proof of the above result is to realize that Hoeffding’s Inequality can be used to bound the binomial expression for the probability of getting at most (p - e)m heads in m trials where e is a small positive number. For our purposes, Hoeffding’s inequality can be stated as follows:
Let X_{1} . . . X_{m} be independent random variables taking values in [0,1]. Let A_{m} denote their average. Then P(A_{m} <= E[A_{m}] - e) <= exp(-2me^{2}).
If the X_{i} are binomial random variables where X_{i} = 1 if h(x) is not equal to y, then the training error, E_{t} as defined above, is equal to A_{m}, the fraction of mismatches in m flips, and E[A_{m}] = p is the generalization error, E_{g}. Hence, the event bounded by Hoeffding’s Inequality can be written:
E_{g} >= E_{t} + e.
Now, letting d = exp(-2me^{2}) where d > 0 we get the result ( R ) above. What this says is that with probability at least 1 - d,
E_{t} + sqrt(log(1/d)/(2m)) is an upper bound for the generalization error.
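To put a number on the additive term: with m = 10,000 samples and d = 0.1 (the values used in the simulation later in this post), the bound adds only about one percentage point to the training error.

```r
# Size of the Hoeffding term sqrt(log(1/d) / (2m)) for m = 10000, d = 0.1
m <- 10000
d <- 0.1
bound_term <- sqrt(log(1/d) / (2 * m))
round(bound_term, 4)  # 0.0107
```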
A couple of things to notice about the result are:
The really big assumption is the one that slipped in at the very beginning, that the training samples and test samples are random draws from the same distribution. This is something that would be difficult to verify in practice, but serves the purpose of encouraging one to think about the underlying distributions that might govern the data in a classification problem.
The plot below provides a simple visualization of the result. It was generated by simulating draws from a binomial with a little bit of noise added, where p = .4 and d = .1. This represents a classifier that does a little better than guessing. The red vertical line marks the value of the generalization error among the simulated upper bounds. The green lines focus on the 10% quantile.
As the result predicts, a little more than 90% of the upper bounds are larger than p.
And here is the code.
m <- 10000        # Number of samples
p <- .4           # Probability of incorrect classification
N <- 1000         # Number of simulated sampling experiments
delta <- .1       # 1 - delta is the upper probability bound
gamma <- sqrt(log(1/delta)/(2*m))   # Constant term of the upper bound
Am <- vector("numeric", N)          # Allocate vector
for (i in 1:N) {
  Am[i] <- sum(rbinom(m, 1, p) + rnorm(m, 0, .1))/m  # Simulate training error
}
u_bound <- Am + gamma               # Calculate upper bounds
plot(ecdf(u_bound), xlab = "Upper Bound", col = "blue", lwd = 3,
     main = "Empir Dist (Binomial with noise)")
abline(v = .4, col = "red")
abline(h = .1, col = "green")
abline(v = quantile(u_bound, .1), col = "green")
So, what does all this mean in practice? The result clearly pertains to an idealized situation, but to my mind it provides a rough guide as to how low you ought to be able to reduce your testing error. In some cases, it may even signal that you might want to look for better data.
Thanks to everyone at the Chicago R User Group for giving me such a warm welcome for my presentation last night. In my talk, I gave an introduction to Revolution R Open, with a focus on how the checkpoint package makes sharing R code in a reproducible way easy:
library(devtools)
install_github("RevolutionAnalytics/checkpoint")
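Once installed, checkpoint pins a project's package versions to a CRAN snapshot date; a typical usage (the date below is illustrative, and the call requires network access to the snapshot server) is just two lines at the top of a script:

```r
# Install and use packages as they existed on CRAN on the given date
library(checkpoint)
checkpoint("2015-04-26")
```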