R has something of a reputation for generating, shall we say, obscure error messages like this:

Error in model.frame.default(formula = y ~ female + DNC + SE_region + : could not find function "function (object, ...) \nobject"

One tip for dealing with error messages is to ignore everything between "Error in" and the colon: unless you are running a function that you wrote yourself, only the error message at the end is likely to be useful. If you're still stuck, another tip is to ask for help on StackOverflow.com using the [r] tag, where you'll find more than 20,000 questions about R error messages.

Noam Ross has analyzed these questions to find the most commonly asked-about R error messages. Naturally, he used the stackr R package to interrogate the StackOverflow API and downloaded around 10,000 questions. He then used a regular expression to break the questions down into trigrams (sequences of 3 words) so that he could count which were the most common. On that basis, the most common types of error messages were:

"could not find function" errors, usually caused by typos or not loading a required package

"Error in if" errors, caused by non-logical data or missing values passed to R's "if" conditional statement

"Error in eval" errors, caused by references to objects that don't exist

"cannot open" errors, caused by attempts to read a file that doesn't exist or can't be accessed

"no applicable method" errors, caused by using an object-oriented function on a data type it doesn't support

"subscript out of bounds" errors, caused by trying to access an element or dimension that doesn't exist

package errors caused by being unable to install, compile or load a package.
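To see what several of these look like in practice, you can trigger them deliberately and capture the messages with tryCatch() (a small base-R sketch; the function name meen and the file name nope.csv are made up for the demonstration):

```r
# Capture an error message instead of stopping the script
err_msg <- function(expr) tryCatch(expr, error = function(e) conditionMessage(e))

err_msg(meen(1:3))             # could not find function "meen" (a typo for mean)
err_msg(if (NA) 1)             # missing value where TRUE/FALSE needed
err_msg(read.csv("nope.csv"))  # cannot open the connection
x <- 1:3
err_msg(x[[5]])                # subscript out of bounds
```

Wrapping risky expressions this way is also a handy pattern for scripts that need to keep running past individual failures.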

Noam's full analysis is at the link below. In addition to providing insights about R's error messages, the trigram method he uses will be useful to anyone who needs to do frequency analysis on unstructured data.

Binning is the term used in scoring modeling for what is also known in Machine Learning as Discretization: the process of transforming a continuous characteristic into a finite number of intervals (the bins), which allows for a better understanding of its distribution and its relationship with a binary variable. The bins generated by this process will eventually become the attributes of a predictive characteristic, the key component of a Scorecard.

Why Binning?

Though there is some resistance to it [1], the benefits of binning are pretty straightforward:

It allows missing data and other special values (e.g. division by zero) to be included in the model.

It controls or mitigates the impact of outliers on the model.

It solves the issue of having different scales among the characteristics, making the weights of the coefficients in the final model comparable.

Unsupervised Discretization

Unsupervised Discretization divides a continuous feature into groups (bins) without taking into account any other information. It is basically a partition, with two options: equal length intervals and equal frequency intervals.

Equal length intervals

Objective: Understand the distribution of a variable.

Example: The classic histogram, whose bins have equal length that can be calculated using different rules (Sturges, Rice, and others).

Disadvantage: The number of records in a bin may be too small to allow for a valid calculation, as shown in Table 1.

Table 1. Time on Books and Credit Performance. Bin 6 has no bads, producing indeterminate metrics.

Equal frequency intervals

Objective: Analyze the relationship with a binary target variable through metrics like bad rate.

Example: Quartiles or percentiles.

Disadvantage: The cutpoints selected may not maximize the difference between bins when mapped to a target variable, as shown in Table 2.

Table 2. Time on Books and Credit Performance. Different cutpoints may improve the Information Value (0.4969).
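Both unsupervised options are one-liners in base R with cut(); here is a sketch on simulated data (the variable tob is a made-up stand-in for Time on Books):

```r
set.seed(42)
tob <- round(rexp(1000, rate = 1 / 24))  # simulated time on books, in months

# Equal-length intervals: split the range into bins of equal width
equal_length <- cut(tob, breaks = 6)
table(equal_length)   # counts per bin can be badly unbalanced

# Equal-frequency intervals: cut at quantiles so bins hold similar counts
equal_freq <- cut(tob, breaks = quantile(tob, probs = seq(0, 1, 0.25)),
                  include.lowest = TRUE)
table(equal_freq)     # roughly 250 observations per bin
```

The first table illustrates the thin-bin problem of equal-length intervals; the second shows how quantile cutpoints even out the counts without consulting the target variable.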

Supervised Discretization

Supervised Discretization divides a continuous feature into groups (bins) mapped to a target variable. The central idea is to find the cutpoints that maximize the difference between the groups. In the past, analysts would iteratively move from Fine Binning to Coarse Binning, a very time-consuming process of finding the right cutpoints manually and visually (if they were found at all). Nowadays, with algorithms like ChiMerge or Recursive Partitioning, two of the several techniques available [2], analysts can find the optimal cutpoints in seconds and evaluate the relationship with the target variable using metrics such as Weight of Evidence and Information Value.
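The metrics themselves involve only simple arithmetic. As a hand-rolled sketch (smbinning reports these for you; this version just shows the formulas, with good coded as 1 and bad as 0):

```r
# WoE_i = ln(%good_i / %bad_i); IV = sum((%good_i - %bad_i) * WoE_i)
woe_iv <- function(bin, good) {
  tab <- table(bin, good)              # rows = bins, columns = c("0", "1")
  pct_bad  <- tab[, "0"] / sum(tab[, "0"])
  pct_good <- tab[, "1"] / sum(tab[, "1"])
  woe <- log(pct_good / pct_bad)
  list(woe = woe, iv = sum((pct_good - pct_bad) * woe))
}

# A bin whose good/bad mix matches the population has WoE = 0
bins  <- rep(c("bin1", "bin2"), each = 100)
goods <- rep(c(0, 1), times = 100)
woe_iv(bins, goods)$iv   # 0: these bins carry no information
```

A binning with an IV near zero separates goods from bads no better than chance, which is exactly what the supervised algorithms try to avoid.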

An Example With 'smbinning'

Using the 'smbinning' package and its data (chileancredit), whose documentation can be found on its supporting website, the characteristic Time on Books is grouped into bins taking into account the Credit Performance (Good/Bad) to establish the optimal cutpoints and get meaningful, statistically different groups. The R code below, Table 3, and Figure 1 show the result of this application, which clearly surpasses the previous methods with the highest Information Value (0.5353).

# Load package and its data
library(smbinning)
data(chileancredit)
# Training and testing samples
chileancredit.train = subset(chileancredit, FlagSample == 1)
chileancredit.test = subset(chileancredit, FlagSample == 0)
# Run and save results
result = smbinning(df = chileancredit.train, y = "FlagGB", x = "TOB", p = 0.05)
result$ivtable
# Relevant plots (2x2 page)
par(mfrow = c(2, 2))
boxplot(chileancredit.train$TOB ~ chileancredit.train$FlagGB,
        horizontal = TRUE, frame = FALSE, col = "lightgray", main = "Distribution")
mtext("Time on Books (Months)", 3)
smbinning.plot(result, option = "dist", sub = "Time on Books (Months)")
smbinning.plot(result, option = "badrate", sub = "Time on Books (Months)")
smbinning.plot(result, option = "WoE", sub = "Time on Books (Months)")

Table 3. Time on Books cutpoints mapped to Credit Performance.

Figure 1. Plots generated by the package.

In the middle of the "data era", it is critical to speed up the development of scoring models. Binning, and more specifically automated binning, significantly reduces the time-consuming process of generating predictive characteristics, which is why companies like SAS and FICO have developed their own proprietary algorithms to implement this functionality in their respective software. For analysts who do not have these specific tools or modules, the R package 'smbinning' offers a statistically robust alternative to run their analyses faster.

For more information about binning, the package's documentation, available on CRAN, lists some references related to the algorithm behind it, and its supporting website lists references for scoring model development.

References

[1] Dinero, T. (1996). Seven Reasons Why You Should Not Categorize Continuous Data. Journal of Health & Social Policy, 8(1), 63-72.

[2] Garcia, S. et al. (2013). A Survey of Discretization Techniques: Taxonomy and Empirical Analysis in Supervised Learning. IEEE Transactions on Knowledge and Data Engineering, 25(4).

The gradient boosting machine, as developed by Friedman, Hastie, Tibshirani and others, has become an extremely successful algorithm for dealing with both classification and regression problems and is now an essential feature of any machine learning toolbox. R’s gbm() function (gbm package) is a particularly well-crafted implementation of the gradient boosting machine that served as a model for the rxBTrees() function, released last year as a feature of Revolution R Enterprise 7.3. You can think of rxBTrees() as a scaled-up version of gbm() that is designed to work with massive data sets in distributed environments.

The basic idea underlying the gbm is that weak learners may be combined in an additive, iterative fashion to produce highly accurate classifiers that are resistant to overfitting. While in theory any weak learner can be incorporated into a gbm, classification and regression trees, or decision trees as they are called in RRE, are particularly convenient. As Kuhn and Johnson (p 205) point out:

They have the flexibility to be weak learners by simply restricting their depth

Separate trees can be easily added together

Trees can be generated very quickly
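The additive, iterative idea is easy to see in a toy version: fit a depth-1 "stump" to the current residuals, add a shrunken copy of it to the model, and repeat. The base-R sketch below is not gbm() or rxBTrees(), just the skeleton they share, on made-up data with squared-error loss:

```r
set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd = 0.2)

# A "stump": the single split on x that best fits the residuals r
fit_stump <- function(x, r) {
  best <- NULL
  best_sse <- Inf
  for (s in quantile(x, probs = seq(0.05, 0.95, by = 0.05))) {
    left <- x <= s
    pred <- ifelse(left, mean(r[left]), mean(r[!left]))
    sse  <- sum((r - pred)^2)
    if (sse < best_sse) {
      best_sse <- sse
      best <- list(split = s, left = mean(r[left]), right = mean(r[!left]))
    }
  }
  best
}

n_trees <- 100
shrink  <- 0.1                   # the learning rate
pred <- rep(mean(y), length(y))  # start from a constant model
for (i in seq_len(n_trees)) {
  r  <- y - pred                 # residuals = negative gradient of squared loss
  st <- fit_stump(x, r)
  pred <- pred + shrink * ifelse(x <= st$split, st$left, st$right)
}
mean((y - pred)^2)               # far below the variance of y
```

Each stump is a weak learner on its own, but the shrunken sum of a hundred of them tracks the sine curve closely.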

rxBTrees() scales gbm() by building the boosting algorithm around RRE’s rxDTree() decision tree implementation. As I described in a previous post, that algorithm scales to large data sets by building trees on histograms, a summary of the data, rather than on the data itself. rxBTrees(), like rxDTree(), is also a proper parallel external memory algorithm (PEMA). This means two things:

It uses a “chunking” strategy, working with one block of data at a time as it is read from disk. Since there is no need to keep all of the data in memory at once, rxBTrees() can build a gbm on arbitrarily large files.

PEMAs are parallel algorithms built on top of RRE’s distributed computing infrastructure. This allows them to take full advantage of the opportunities for parallelism in the underlying hardware by setting up a “compute context”. In the example below, the compute context is set to local parallel to take advantage of the cores on my PC. RRE also allows establishing compute contexts that enable PEMAs to run in parallel on clusters.

The following function fits an rxBTrees() model to a training subset of a mortgage data set. rxBTrees() takes many input parameters. Some are common to all RevoScaleR PEMAs, others are inherited from rxDTree(), and still others, such as nTree and learningRate, have direct counterparts in R's gbm() function. In this function call, I have explicitly called out the parameters that pertain to fitting a gbm model.

# Fit a boosted trees model on the training data set
form <- formula(default ~ houseAge + creditScore + yearsEmploy + ccDebt)
BT_model <- rxBTrees(formula = form, data = "mdTrain",
    lossFunction = "bernoulli",
    minSplit = NULL,       # Min num of obs at a node before a split is attempted
    minBucket = NULL,      # Min num of obs at a terminal node
    maxDepth = 1,          # 2^maxDepth = number of leaves on tree
    cp = 0,                # Complexity parameter
    maxSurrogate = 0,      # Max surrogate splits retained; 0 improves performance
    useSurrogate = 2,      # Determines how surrogates are used
    surrogateStyle = 0,    # Sets surrogate algorithm
    nTree = 100,           # Num of trees = num of boosting iterations
    mTry = NULL,           # Num of vars to sample as split candidates
    sampRate = NULL,       # Fraction of obs to sample for each tree; default = 0.632
    importance = TRUE,     # Assess importance of predictors
    learningRate = 0.1,    # Also known as shrinkage
    maxNumBins = NULL,     # Max num of bins in histograms; default: min(1001, max(sqrt(num of obs)))
    verbose = 1,           # Amount of status sent to console
    computeContext = "RxLocalParallel",  # Use all cores available on the PC
    overwrite = TRUE)      # Allow xdf output file to be overwritten

Some especially noteworthy parameters are maxDepth, maxNumBins and computeContext. maxDepth serves the same purpose as interaction.depth in the gbm() function, but they are not the same parameter. Setting interaction.depth = 1 produces stump trees with two leaves each, because the number of leaves equals interaction.depth + 1. In rxBTrees(), stumps are produced by setting maxDepth = 1, because 2^{maxDepth} gives the number of leaves.

maxNumBins determines the number of bins used in making histograms of the data. If anyone ever wanted to compare gbm() directly with rxBTrees() on a small data set, for example, setting maxNumBins to the number of points in the data set would mimic gbm(). Note, however, that this would saddle rxBTrees() with all of the overhead of computing histograms while receiving none of the benefits.

The computeContext parameter is the mechanism that connects a PEMA to the underlying hardware and enables functions such as rxBTrees() to run in parallel on various clusters using distributed data. Some of the details of how this happens were described in a previous post, but I hope to follow up this post with an example of rxBTrees() running on Hadoop in the not too distant future.

The data set used to build the model above is available on Revolution Analytics' data set web page, both as an xdf file and as multiple csv files. The following command shows what the data looks like.

> rxGetInfo(mortData, getVarInfo = TRUE, numRows = 3)

File name: C:\DATA\Mortgage Data\mortDefault\mortDefault2.xdf
Number of observations: 1e+07
Number of variables: 6
Number of blocks: 20
Compression type: none
Variable information:
Var 1: creditScore, Type: integer, Low/High: (432, 955)
Var 2: houseAge, Type: integer, Low/High: (0, 40)
Var 3: yearsEmploy, Type: integer, Low/High: (0, 15)
Var 4: ccDebt, Type: integer, Low/High: (0, 15566)
Var 5: year, Type: integer, Low/High: (2000, 2009)
Var 6: default, Type: integer, Low/High: (0, 1)
Data (3 rows starting with row 1):
  creditScore houseAge yearsEmploy ccDebt year default
1         615       10           5   2818 2000       0
2         780       34           5   3575 2000       0
3         735       12           1   3184 2000       0

In building the model, I divided the mortgage default data randomly into training and test data sets. Then I fit an rxBTrees() model on the training data and used the test data to compute the ROC curve shown below.
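For readers without RRE, an ROC curve and its AUC can be computed from predicted scores and 0/1 labels in a few lines of base R (a sketch on simulated scores, not the mortgage model itself):

```r
# Sweep thresholds from high to low; each threshold gives one (FPR, TPR) point
roc_points <- function(score, label) {
  thr <- sort(unique(score), decreasing = TRUE)
  data.frame(
    fpr = sapply(thr, function(t) mean(score[label == 0] >= t)),
    tpr = sapply(thr, function(t) mean(score[label == 1] >= t))
  )
}

set.seed(2)
score <- c(rnorm(100, mean = 1), rnorm(100, mean = 0))  # positives score higher
label <- rep(c(1, 0), each = 100)
r <- roc_points(score, label)

# Area under the curve by the trapezoid rule
auc <- sum(diff(c(0, r$fpr)) * (head(c(0, r$tpr), -1) + r$tpr) / 2)
```

Plotting r$fpr against r$tpr with plot(type = "l") reproduces the familiar curve; an AUC near 0.5 means the scores are uninformative.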

Along the way, the model also produced the plot of out-of-bag error versus number of trees, which shows that for this data set one could probably get by with building only 70 or 80 trees.

The code I used to build the model is available here: Download Post_model. If you do not have Revolution R Enterprise available at your company, you can try out this model for yourself in the AWS cloud by taking advantage of Revolution Analytics' 14-day free trial.

Take the first step towards building a gbm on big data.

by Gary R. Moser, Director of Institutional Research and Planning, The California Maritime Academy

I recently contacted Joseph Rickert about inviting Vim guru Drew Neil (web: vimcasts.org, book: "Practical Vim: Edit Text at the Speed of Thought") to speak at the Bay Area R User Group. Because Drew lives in Great Britain, that might not be easily arranged, so Joe generously extended an invitation for me to share a bit about why I like Vim here.

When it comes to text editors, there are a number of strong contenders to choose from (lots and lots of them). If you've found a tool with a rich feature set that makes you more productive (and, more importantly, is something you like), you should probably stick with it. Given that editing text of various types is such a big part of our lives, however, it's worth turning a critical eye toward how well our tools facilitate this specific activity.

Do you spend a lot of time editing text? Yeah, me too. Personally this includes R code, Markdown, LaTeX, as well as notes and outlines (including this article). When I rediscovered Vim fairly recently, it was from the perspective of being a proficient R user. Therefore, my values and beliefs about what constitutes good software are framed by my experience with R. That includes being OK with an initial learning curve.

I have been using RStudio as my primary IDE for R since it was offered. It's great software; they did something not too long ago that I really appreciate - they added a stripped-down Vim editing mode. Vim pushes users to the documentation pretty quickly (usually as the result of accidentally deleting large chunks of work), and as I dug in and began to discover its full functionality, I came to realize how much I was missing out on by using the emulator. The ability to set and jump to marks in a document, or utilizing multiple registers for copy/paste are two good examples of essential but missing features in RStudio.

Vim has been described as "a language for text editing," which I think is a useful way to view it. At the risk of sounding snotty, I would compare the experiences of using Vim (or another good editor) versus a plain-jane text editor to that of playing chess versus checkers. That is, there's an element of strategic and intentional action compared to simply playing one of a limited set of moves over and over again.

One of the things that makes Vim so interesting and different from other editors stems from its origins. As the result of being developed in the context of severe constraints (slow networks, no digital displays, limited system resources, and no mouse), Vim - then "ed" - had to accomplish the greatest amount of work with the least number of keystrokes. This requirement led to the development of a vast number of very specific commands that can be combined in useful ways. Drew Neil artfully compares this to playing notes, chords, and melodies on a piano. It's also an appropriate comparison for setting one's expectations toward becoming a skilled Vim user! Michael Mrozek's humorous plot cleverly suggests that, not unlike R, Vim doesn't hold your hand.

It also speaks to my point about specificity. Emacs, for example, can be extended to be a web client or music player, hence the rabbit-hole learning curve, but isn't that getting away from the primary task of text editing?

The fundamental way that Vim differs from most other text editors is that it is explicitly modal; all software is technically modal in certain ways (that is, the same keys do different things under different circumstances), but with Vim it is a central design feature. Essentially, what this means is that by switching modes, a different keyboard comes into existence under your fingers. Because Vim has four modes, and a very rich and terse set of key-bindings, it's like having four+ keyboards in one. The keyboard cheat sheet is a useful reference, especially in the beginning.

Warning: after becoming familiar with Vim's basic functionality, going back to a typical text editor feels rather clumsy.

Vim as an interface to R using the Vim-R-plugin is mostly good for how I use it, but I expect to be dialing-in Vim for a long time before it has all the features I want. I don't mind this, but I can see how someone else might. I encourage you to consider your own tools and how well they facilitate your most frequent tasks. If you're an RStudio user, try giving Vim mode a go. A visit to www.vim.org will connect you to the resources you'll need.

What will you be doing at 26 minutes and 53 seconds past 9 this coming Saturday morning? I will probably be running simulations. I have become obsessed with an astounding result from number theory and have been trying to devise Monte Carlo simulations to get at it. The result, well known to number theorists, says: choose two integers at random; the probability that they will be coprime is 6/π^{2}! Here, π materializes out of thin air. Who could have possibly guessed this? Well, Leonhard Euler, apparently, and this sort of magic seems to be quite common in number theory.

More formally, the theorem Euler proved goes something like this: Let P_{N} be the probability that two randomly chosen integers in {1, 2, ... N} are coprime. Then, as N goes to infinity, P_{N} goes to 6/π^{2}. Well, this seems to be a little different. You don't actually have to sample from an infinite set. So, I asked myself, would a person who is not familiar with this result, but who is allowed to do some Monte Carlo simulations, have a reasonable chance of guessing the answer? I know that going to infinity would be quite a trip, but I imagined that I could take a few steps and see something interesting. How about this: choose some boundary, N, and draw D pairs of random numbers from that range. Count the number of coprime pairs, M. Then sqrt(6 / (M/D)) will give an estimate of π. As N gets bigger and bigger you should see the digits of the average of the estimates marching closer and closer to π: 3.1, 3.14, 3.141 etc.
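The single-N version of the experiment takes only a few lines of base R (a sketch; the coprimality test here is Euclid's algorithm, which the numbers package used later wraps as coprime()):

```r
set.seed(1)
gcd <- function(a, b) if (b == 0) a else gcd(b, a %% b)

N <- 1e7        # the boundary
draws <- 1e5    # number of random pairs, D
a <- sample.int(N, draws, replace = TRUE)
b <- sample.int(N, draws, replace = TRUE)

p_hat  <- mean(mapply(gcd, a, b) == 1)  # estimated probability of coprimality
pi_hat <- sqrt(6 / p_hat)               # lands near 3.14
```

With 100,000 pairs the standard error of p_hat puts the estimate of π within a few thousandths, which already hints at the accuracy wall described next.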

Before you go off and try this, let me warn you: there is no parade of digits. For a modest 100,000 draws with an N around 10,000,000 you should get an estimate of π close to 3.14. Then, even with letting N get up around 1e13 with a million draws, you won't do much better. The following code (compliments of a colleague who introduced me to mapply) performs 100 simulations, each with 100,000 draws as the range varies from +/- 1,000,000 to +/- 1e13. (The code runs pretty quickly on my laptop.)

# Monte Carlo estimate of pi
library(numbers)
library(ggplot2)
set.seed(123)
bigRange <- seq(1e6, 1e13, by = 1e11)
M <- length(bigRange)
draws <- 1e5
prob <- numeric(M)    # initialize result vectors
piEst <- numeric(M)
for (i in 1:M) {
  maxRange <- bigRange[i]
  print(bigRange[i])
  min <- -maxRange
  max <- maxRange
  r1 <- round(runif(n = draws, min = min, max = max))
  r2 <- round(runif(n = draws, min = min, max = max))
  system.time(coprimeTests <- mapply(coprime, r1, r2))
  prob[i] <- sum(coprimeTests) / draws
  print(prob[i])
  piEst[i] <- sqrt(6 / prob[i])
  print(piEst[i])
}
piRes2 <- data.frame(bigRange, prob, piEst)
p2 <- ggplot(piRes2, aes(bigRange, piEst))
p2 + geom_line() +
  geom_point() +
  xlab("Half Range for Random Draws") +
  ylab("Estimate of Pi") +
  ggtitle("Expanding Range Simulation") +
  geom_smooth(method = "lm", se = FALSE, color = "black")

Here is the "time series" plot of the results.

The mean is 3.142959. The slight upward trend is most likely a random number induced illusion.

The problem, as I have framed it, is probably beyond the reach of a naive Monte Carlo approach. Nevertheless, on Saturday, when I have some simulation time, I will try running 100,000,000 draws. This should get me another digit of π, since the accuracy of the mean increases as sqrt(N).

The situation, however, is not as dismal as I have been making it out to be. Don't imagine that just because the problem is opaque to a brute-force Monte Carlo effort it is without the possibility of computational illumination. Euler's proof, mentioned above, turns on recognizing that the probability of two randomly chosen integers being coprime may be expressed in terms of the Riemann zeta function evaluated at 2: the proof yields π^{2} = 6ζ(2). The wizards of R have reduced this calculation to the trivial. The Rmpfr package, which allows the use of arbitrarily precise numbers instead of R's double precision numbers, includes the function zeta(x)! So, here, splendidly arrayed, is π.

> library(Rmpfr)
> sqrt(6*zeta(2))
1 'mpfr' number of precision 128 bits
[1] 3.141592653589793238462643383279502884195

In my previous post, I demonstrated how to get some status of running jobs on a parallel back end. However, I stopped short of actually demonstrating progress bars. In this post I demonstrate how to do this.

That question attracted some helpful answers, but none of them fully solved the problem. In particular, Steve Weston said this approach is flawed, since:

It won't work well with doParallel because doParallel only calls the combine function after all of the results have been returned, since it is implemented by calling the parallel clusterApplyLB function. This technique only works well with backends that call the combine function on-the-fly, like doRedis, doMPI, doNWS, and (defunct?) doSMP.

My experiments

However, this set me thinking. Could it be that one could trigger a tcltk progress bar in each worker?

Thus I started to experiment, and came up with the following code. On my Windows machine, using doParallel, this seems to work reliably (but I could not get this approach to work on Linux using doMC).
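What I ran was along these lines (a reconstruction: the worker count and task sizes are my own, and each worker opens its own Tk window, so a desktop session is required):

```r
library(doParallel)

cl <- makeCluster(4)
registerDoParallel(cl)

res <- foreach(i = 1:400, .packages = "tcltk") %dopar% {
  # Each worker session creates its progress bar once and reuses it across tasks
  if (!exists("pb", where = globalenv())) {
    assign("pb",
           tkProgressBar(title = sprintf("Worker %d", Sys.getpid()),
                         min = 0, max = 400),
           envir = globalenv())
  }
  setTkProgressBar(get("pb", envir = globalenv()), i)
  Sys.sleep(0.01)   # simulate work
  sqrt(i)
}

stopCluster(cl)
```

The trick is stashing the bar in each worker's global environment so it survives from one task to the next.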

When you run this code, you should see multiple progress bars, one for each worker:

Conclusion

Although it takes some work, and there isn’t a single solution for all cases, it is possible to get progress bars for at least some parallel back ends.

In this post I demonstrated how to create progress bars using doParallel on Windows, but failed to replicate this on Linux.

Domino's new R Notebook and Plotly's R API let you code, make interactive R and ggplot2 graphs, and collaborate entirely online. Here is the Notebook in action:

To execute this Notebook, or to build your own, head to Domino's Plotly Project. The GIF below shows how to get started: choose an R session, press "Open Notebook", and choose the "R Notebook" file. For sensitive data, see Domino On-Premise and Plotly Enterprise.

Overlaid Histogram

This plot, like the rest in this post, was made in the R Notebook with ggplot2, then converted to an interactive plot with one line of code. The plot is drawn with D3.js. You can hover your mouse to see text, click and drag to zoom, and toggle traces on and off by clicking them in the legend. Check out our GitHub page to see our ggplot2 support.

Anscombe's Quartet

Next up, let's visualize Anscombe's Quartet. Head to the R Notebook to see the R code for this.

Bar Charts

Interactive bar charts let you hover and filter. Press the legend items to filter the plot. Head to the R Notebook to see the code.

3D Plots

Plotly and Domino let you transition ggplot2 plots into interactive 3D plots. Head to the R Notebook to see the code.

Error Bars

If you use error bars, you can see the values when you hover your mouse. Head to the R Notebook to see the code.

Dendrograms

We can also make dendrograms; this example uses ggdendro and ggplot2. Head to the R Notebook to see the example.

Box plots

You can make interactive box plots from ggplot2 and share them.

Usage

Domino Notebooks can run Python, R, Julia, and Scala. Each Notebook can be shared in the Domino Cloud or exported as an IPython Notebook, HTML, or reST file. Notebooks support LaTeX and let you use Markdown and headers, import images, and display iframes.

Plotly graphs can be embedded in Notebooks, dashboards, HTML, knitr, or Shiny. Every Plotly graph can be exported in varied programming languages and image types. For example, for the boxplot above:

https://plot.ly/~r_user_guide/1079.png

https://plot.ly/~r_user_guide/1079.svg

https://plot.ly/~r_user_guide/1079.pdf

https://plot.ly/~r_user_guide/1079.eps

https://plot.ly/~r_user_guide/1079.py

https://plot.ly/~r_user_guide/1079.m

https://plot.ly/~r_user_guide/1079.jl

https://plot.ly/~r_user_guide/1079.json

https://plot.ly/~r_user_guide/1079.embed

This is our first pass at showing how you can use Domino and Plotly for coding and plotting in R. Have ideas? Suggestions? Other collaboration or Notebook projects you'd like to see? We're at @DominoDataLab and welcome your suggestions.

The foreach package provides simple looping constructs in R, similar to lapply() and friends, and makes it easy to execute each element in the loop in parallel. You can find the packages at foreach: Foreach looping construct for R and doParallel.
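In its simplest form, foreach() looks like a for loop that returns a value; swapping %do% for %dopar% sends the iterations to whatever parallel backend is registered:

```r
library(foreach)

# Sequential: %do% evaluates each iteration in the current session
squares <- foreach(i = 1:5, .combine = c) %do% i^2
squares   # 1 4 9 16 25

# Parallel: register a backend first, then use %dopar%, e.g.
# library(doParallel); registerDoParallel(2)
# squares <- foreach(i = 1:5, .combine = c) %dopar% i^2
```

The .combine argument controls how per-iteration results are merged (c, rbind, a custom function, or nothing, which returns a list).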

Tracking progress of parallel computing tasks

Parallel programming can help speed up the total completion time of your project. However, for tasks that take a long time to run, you may wish to track progress of the task, while the task is running.

This seems like a simple request, but seems remarkably hard to achieve. The reason boils down to this:

Each parallel worker is running in a different session of R

In some parallel computing setups, the workers don’t communicate with the initiating process, until the final combining step

So, if it is difficult to track progress directly, what can be done?

It seems to me the typical answers to this question fall into three different classes:

Use operating system monitoring tools, i.e. tools external to R.

Print messages to a file (or connection) in each worker, then read from this file, again outside of R

Use specialist back-ends that support this capability, e.g. the Redis database and the doRedis package

This is an area with many avenues of exploration, so I plan to briefly summarize each method and point to at least one question on StackOverflow that may help.

Method 1: Use operating system monitoring tools, i.e. tools external to R

In his answer to this question, Dirk Eddelbuettel mentions that parallel back ends like MPI and PVM have job monitors, such as slurm and TORQUE. However, simpler frameworks, like snow, do not have monitoring tools. In this case, you may be forced to use methods like printing diagnostic messages to a file.

For parallel jobs using the doParallel backend, you can use standard operating system monitoring tools to see if the job is running on multiple cores. For example, in Windows, you can use the "Task Manager" to do this. Notice in the CPU utilization how each core went to maximum once the script started:

Method 2: Print messages to a file (or connection) in each worker, then read from this file, again outside of R

Sometimes it may be sufficient, or desirable, to print status messages from each of the workers. Simply adding a print() statement will not work, since the parallel workers do not share the standard output of the master job.
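One workaround is to have each worker append to its own file, which you can then watch from a shell. A minimal sketch with doParallel (the log file names are my own invention):

```r
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

res <- foreach(i = 1:10, .combine = c) %dopar% {
  # Append a status line to a per-worker log; tail -f these files outside R
  logf <- sprintf("worker_%d.log", Sys.getpid())
  cat(sprintf("%s starting task %d\n", format(Sys.time()), i),
      file = logf, append = TRUE)
  sqrt(i)
}

stopCluster(cl)
```

Keying the file name on Sys.getpid() keeps the workers from clobbering each other's output.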

Steve Weston, the author of foreach (and one of the original founders of Revolution Analytics) wrote an excellent answer to this question.

Steve says that output produced by the snow workers gets thrown away by default, but you can use the makeCluster() "outfile" argument to change that. Setting outfile to the empty string ("") prevents snow from redirecting the output, often resulting in the output from your print messages showing up on the terminal of the master process.

Steve says to create and register your cluster with something like:
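The setup he describes is along these lines (a reconstruction; the worker count is arbitrary):

```r
library(doSNOW)

# outfile = "" stops snow from redirecting worker output,
# so print() messages from the workers reach the master's terminal
cl <- makeCluster(4, outfile = "")
registerDoSNOW(cl)
```
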

He continues: Your foreach loop doesn't need to change at all. This works with both SOCK clusters and MPI clusters using Rmpi built with Open MPI. On Windows, you won't see any output if you're using Rgui. If you use Rterm.exe instead, you will. In addition to your own output, you'll see messages produced by snow which can also be useful.

Also note that this solution seems to work with doSNOW, but is not supported by the doParallel backend.

Method 3: Use specialist back-ends that support this capability, e.g. the Redis database and the doRedis package

The final approach is a novel idea by Brian Lewis, and uses the Redis database as a parallel back end.

Specifically, the R package rredis allows message passing between R and Redis. The package doRedis allows you to use foreach with redis as the parallel backend. What’s interesting about Redis is that this database allows the user to create queues and each parallel worker fetches jobs from this queue. This allows for a dynamic network of workers, even across different machines.

New media sites like Buzzfeed and Upworthy have mastered the art of "clickbait": headlines and content designed to drive as much traffic as possible to their sites. One technique is to use coy headlines like "If you take a puppy video break today, make sure this is the dog video you watch." (Gawker apparently spends longer writing a headline than the actual article.) But the big stock-in-trade is "listicles": articles that are, well, just lists of things. (Exactly half of Buzzfeed's top 20 posts of this week are listicles, including "32 Paintings Paired With Quotes From 'Mean Girls'".)

If your goal is to maximize virality, how long should a listicle be? Max Woolf, an R user and Bay Area Software QA Engineer, set out to answer that question with data. Buzzfeed reports the number of Facebook shares for each of its articles, so he scraped BuzzFeed’s website and counted the number of items in 15,656 listicles. He then used R's ggplot2 package to plot number of Facebook shares versus number of listicle items, and added a smooth line to show the relationship:

So it looks like if you want your listicle to go viral, aim for 15 or more items.

For R programmers, Max also shares a collection of useful tips for creating graphics like the above using ggplot2. His tutorial provides a step-by-step guide to making basic histograms and scatterplots, and then adding titles and annotations and choosing attractive color schemes to make your chart publication-ready. If you're an R user but haven't yet made the dive into ggplot2, the tutorial linked below is a good place to start.
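As a flavor of what the tutorial covers, a chart in the same spirit takes only a few lines of ggplot2 (the data here is simulated to stand in for the scraped listicles, not Max's actual numbers):

```r
library(ggplot2)

set.seed(7)
listicles <- data.frame(items = sample(5:50, 500, replace = TRUE))
# Simulated share counts, skewed like real social-share data
listicles$shares <- rlnorm(500, meanlog = 7 + 0.02 * listicles$items, sdlog = 1)

p <- ggplot(listicles, aes(x = items, y = shares)) +
  geom_point(alpha = 0.2) +
  scale_y_log10() +          # share counts are heavy-tailed
  geom_smooth() +            # loess smooth with confidence band
  labs(x = "# of listicle items", y = "# of Facebook shares")
```

Printing p renders the plot; geom_smooth() adds the trend line that makes the items-versus-shares relationship visible through the noise.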

Learning to effectively use any of the dozens of popular machine learning algorithms requires mastering many details and dealing with all kinds of practical issues. With all of this to consider, it might not be apparent to a person coming to machine learning from a background other than computer science or applied math that there are some easy-to-get-at and very useful "theoretical" results. In this post, we will look at an upper bound result carefully described in section 2.2.1 of Schapire and Freund's book "Boosting, Foundations and Algorithms". (This book, published in 2012, is destined to be a classic. In the course of developing a thorough treatment of boosting algorithms, Schapire and Freund provide a compact introduction to the foundations of machine learning and relate the boosting algorithm to some exciting ideas from game theory, optimization and information geometry.)

The following is an example of a probabilistic upper bound result for the generalization error of an arbitrary classifier.

Assume that:

The data looks like a collection of labeled pairs (x,y) where x represents the vector of features and y takes values 0 or 1.

The training examples and test examples come from the same unknown distribution D of (x,y) pairs

We are dealing with a classification model (or hypothesis in machine learning jargon) h

Define E_{t}, the training error, to be the percentage of misclassified samples and E_{g}, the generalization error, to be the probability of misclassifying a single example (x,y) chosen at random from D.

Then, for any d greater than zero, with probability at least 1 - d, the following upper bound holds on the generalization error of h:

E_{g} <= E_{t} + sqrt(log(1/d)/(2m))     (R)

where m is the number of random samples.
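To get a feel for the size of the second term, it is easy to compute in R for some illustrative (arbitrary) values of m and d:

```r
# Slack term of the bound (R): sqrt(log(1/d)/(2m))
slack <- function(m, d) sqrt(log(1/d)/(2*m))

# Slack for d = .1 at several sample sizes (values chosen for illustration)
sapply(c(100, 1000, 10000), slack, d = .1)
```

Note that the gap between training and generalization error shrinks like 1/sqrt(m): quadrupling the sample size halves the slack.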

Schapire and Freund approach this result as a coin flipping problem, noting that when a training example (x,y) is selected at random, the probability that h(x) does not equal y can be identified with a flipped coin coming up heads. The probability of getting a head, p, is fixed for all flips. The problem becomes that of determining whether the training error, the fraction of mismatches in a sequence of m flips, is significantly different from p.

The big trick in the proof of the above result is to realize that Hoeffding’s Inequality can be used to bound the binomial expression for the probability of getting at most (p - e)m heads in m trials where e is a small positive number. For our purposes, Hoeffding’s inequality can be stated as follows:

Let X_{1} . . . X_{m} be independent random variables taking values in [0,1]. Let A_{m} denote their average. Then P(A_{m} <= E[A_{m}] - e) <= exp(-2me^{2}).
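Hoeffding's Inequality is also easy to check by simulation. The sketch below (with arbitrary choices of m, p and e) estimates the lower-tail probability for Bernoulli draws and compares it with the exp(-2me^{2}) bound; the simulated probability should come out no larger than the bound.

```r
set.seed(123)
m <- 500    # flips per experiment
p <- .4     # probability of heads (a misclassification)
e <- .05    # tail margin
N <- 10000  # number of simulated experiments

# Fraction of experiments where the sample mean falls at least e below p
tail_prob <- mean(replicate(N, mean(rbinom(m, 1, p))) <= p - e)
bound <- exp(-2 * m * e^2)  # Hoeffding's upper bound on that probability

c(simulated = tail_prob, hoeffding_bound = bound)
```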

If the X_{i} are Bernoulli random variables with X_{i} = 1 when h(x) is not equal to y, then the training error E_{t} as defined above is equal to A_{m}, the fraction of mismatches in m flips, and E[A_{m}] = p is the generalization error, E_{g}. Hence, the event whose probability Hoeffding's Inequality bounds can be written:

E_{g} >= E_{t} + e.

Now, letting d = exp(-2me^{2}) where d > 0, and solving for e gives e = sqrt(log(1/d)/(2m)), which yields the result (R) above. What this says is that with probability at least 1 - d,

E_{t} + sqrt(log(1/d)/(2m)) is an upper bound for the generalization error.

A couple of things to notice about the result are:

The bound is the sum of two terms, the second of which doesn't depend on the distribution of the X_{i}

Hoeffding’s Inequality applies to any independent random variables taking values in [0,1], so presumably this would make the bound robust with respect to the binomial assumption.

The really big assumption is the one that slipped in at the very beginning, that the training samples and test samples are random draws from the same distribution. This is something that would be difficult to verify in practice, but serves the purpose of encouraging one to think about the underlying distributions that might govern the data in a classification problem.

The plot below provides a simple visualization of the result. It was generated by simulating draws from a binomial with a little bit of noise added, where p = .4 and d = .1. This represents a classifier that does a little better than guessing. The red vertical line marks the value of the generalization error, p. The green lines mark the 10% quantile of the simulated upper bounds.

As the result predicts, a little more than 90% of the upper bounds are larger than p.

And here is the code.

m <- 10000   # number of samples
p <- .4      # probability of incorrect classification
N <- 1000    # number of simulated sampling experiments
delta <- .1  # 1 - delta is the upper probability bound
gamma <- sqrt(log(1/delta)/(2*m))  # calculate constant term of the upper bound
Am <- vector("numeric", N)         # allocate vector
for (i in 1:N) {
  Am[i] <- sum(rbinom(m, 1, p) + rnorm(m, 0, .1))/m  # simulate training error
}
u_bound <- Am + gamma  # calculate upper bounds
plot(ecdf(u_bound),
     xlab = "Upper Bound", col = "blue",
     lwd = 3,
     main = "Empir Dist (Binomial with noise)")
abline(v = .4, col = "red")
abline(h = .1, col = "green")
abline(v = quantile(u_bound, .1), col = "green")

So, what does all this mean in practice? The result clearly pertains to an idealized situation, but to my mind it provides a rough guide as to how low you ought to be able to reduce your testing error. In some cases, it may even signal that you might want to look for better data.