by Joseph Rickert

Learning to effectively use any of the dozens of popular machine learning algorithms requires mastering many details and dealing with all kinds of practical issues. With all of this to consider, it might not be apparent to a person coming to machine learning from a background other than computer science or applied math that there are some easy to get at and very useful “theoretical” results. In this post, we will look at an upper bound result carefully described in section 2.2.1 of Schapire and Freund’s book “Boosting, Foundations and Algorithms” (This book, published in 2012, is destined to be a classic. During the course developing a thorough treatment of boosting algorithms, Schapire and Freund provide a compact introduction to the foundations of machine learning and relate the boosting algorithm to some exciting ideas from game theory, optimization and information geometry.)

The following is an example of a probabilistic upper bound result for the generalization error of an arbitrary classifier.

Assume that:

- The data looks like a collection of labeled pairs (x,y) where x represents the vector of features and y takes values 0 or 1.
- The training examples and test examples come from the same unknown distribution D of (x,y) pairs
- We are dealing with a classification model (or hypothesis in machine learning jargon) h

Define E_{t}, the training error, to be the percentage of misclassified samples and E_{g}, the generalization error, to be the probability of misclassifying a single example (x,y) chosen at random from D.

Then, for any d greater than zero, with probability at least 1 - d, the following upper bound holds on the generalization error of h:

E_{g} <= E_{t} + sqrt(log(1/d)/(2m)) where m is the number of random samples (R)

Schapire and Freund approach this result as a coin flipping problem, noting that when a training example (x,y) is selected at random, the probability that h(x) does not equal y can be identified with a flipped coin coming up heads. The probability of getting a head, p, is fixed for all flips. The problem becomes that of determining whether the training error, the fraction of mismatches in a sequence of m flips, is significantly different from p.

The big trick in the proof of the above result is to realize that Hoeffding’s Inequality can be used to bound the binomial expression for the probability of getting at most (p - e)m heads in m trials where e is a small positive number. For our purposes, Hoeffding’s inequality can be stated as follows:

Let X_{1} . . . X_{m} be independent random variables taking values in [0,1]. Let A_{m} denote their average. Then P(A_{m} <= E[A_{m}] - e) <= exp(-2me^{2}).

If the X_{i} are binomial random variables where X_{i} = 1 if h(x) is not equal to y, then the training error, E_{t} as defined above, is equal to A_{m} , the number of successes in m flips. E[A_{m}] = p is the generalization error, E_{g}. Hence, the expression in defining the bound in Hoeffding’s Inequality can be written:

E_{g} >= E_{t} + e.

Now, letting d = exp(-2me^{2}) where d > 0 we get the result ( R ) above. What this says is that with probability at least 1 - d,

E_{t} + sqrt(log(1/d)/(2m)) is an upper bound for the generalization error.

A couple of things to notice about the result are:

- The bound is the sum of two terms, the second of which doesn't depend on the distribution of the X
_{i}
- Hoeffding’s Inequality applies to any random variable in [0,1], so presumably this would make the bound robust with respect to the binomial assumption.

The really big assumption is the one that slipped in at the very beginning, that the training samples and test samples are random draws from the same distribution. This is something that would be difficult to verify in practice, but serves the purpose of encouraging one to think about the underlying distributions that might govern the data in a classification problem.

The plot below provides a simple visualization of the result. It was generated by simulating draws from a binomial with a little bit of noise added, where p = .4 and d = .1 This represents a classifier that does a little better than guessing. The red vertical line marks the value of the generalization error among the simulated upper bounds. The green lines focus on the 10% quantile.

As the result predicts, a little more than 90% of the upper bounds are larger than p.

And here is the code.

m <- 10000 # number of samples
p <- .4 # Probability of incorrect classification
N <- 1000 # Number of simulated sampling experiments
delta <- .1 # 1 - delta is the upper probability bound
gamma <- sqrt(log(1/delta)/(2*m)) # Calculate constant term to upper bound
Am <- vector("numeric",N) # Allocate vector
for(i in 1:N){
Am[i] <- sum(rbinom(m,1,p) + rnorm(m,0,.1))/m # Simulate training error
}
u_bound <- Am + gamma # Calculate upper bounds
plot(ecdf(u_bound),
xlab="Upper Bound",
col = "blue",
lwd = 3,
main = "Empir Dist (Binomial with noise)")
abline(v=.4, col = "red")
abline(h=.1, col = "green")
abline(v= quantile(u_bound,.1),col="green")

So, what does all this mean in practice? The result clearly pertains to an idealized situation, but to my mind it provides a rough guide as to how low you ought to be able to reduce your testing error. In some cases, it may even signal that you might want to look for better data.