*In this post Joseph Rickert demonstrates how to build a classification model on a large data set with the RevoScaleR package. A script file for use with Revolution R Enterprise to recreate the analysis below is at the end of the post, and can also be downloaded here -- ed.*

The k-means (Lloyd) algorithm, an intuitive way to explore the structure of a data set, is a work horse in the data mining world. The idea is to view the observations in an N variable data set as a region in N dimensional space and to see if the points form themselves into clusters according to some method of measuring distance. To apply the k-means algorithm one takes a guess at the number of clusters (i.e. select a value for k) and picks k points (maybe randomly) to be the initial center of the clusters. The algorithm then proceeds by iterating through two steps:

- Assign each point to the cluster to which it is closest
- Use the points in a cluster at the m
^{th} step to compute the new center of the cluster for the (m +1)^{th} step

Eventually, the algorithm will settle on k final clusters and terminate. Figure 1 shows an example of k-means clustering on an artificial 2-dimensional data set. The data come from two different normal distributions, one centered at (0,0) and the other at (1,1). The large circles are show the points within

Figure 1: k-means on artificial data

one standard deviation of the true means. The smaller colored circles are the calculated centers of the clusters. For this artificial example, the two clusters do a pretty good job of describing the data. However, for real data the situation will generally not be so clear cut. There is no guarantee that clusters found will be globally optimal, and of course, since the choice of k was arbitrary, there is no reason to believe that the clusters really mean anything.

Using k-means to get real work done means running the algorithm lots of times. The R implementation of the k-means algorithm, kmeans in the stats package, is pretty fast. Running the example above on my pc (1.87 GHz Dell laptop with 8 GB of RAM) on 10,000,000 points took about 4.3 seconds. But what if you have a data set that won’t fit into memory?

For Revolution R Enterprise users this is no longer a problem. The RevoScaleR package in version 4.3 has a new k-means function, rxKmeans, implemented as an external memory algorithm that works on a chunk of data at a time. Since the k-means algorithm is “embarrassingly parallel”, rxKmeans reads chunks of data (rows / observations) at a time, and iterates the Lloyd algorithm on each chunk (in parallel if multiple processors are available). Once all of the chunks have been processed, the means are updated one last time to produce the final result. Figure 2 shows rxKmeans working on some real data, a slice of the 2000 5% IPUMS U.S. Census Data containing information on individuals from 20 to 65 years old in the states of Connecticut, Indiana and Washington. The data set contains slightly over 350,000 observations with six variables. The data shows income plotted against age and the four clusters found by the k-means algorithm are color coded.

Figure 2: rxKmeans analysis of Census Data

The first thing to notice about the data is that apparently nobody in the three states involved makes between $175,000 and $300,000. As you probably guessed, this is an artifact of how the data were coded. Values of income higher than $175,000 were recorded as state means: hence, the three green lines. The choice of 4 clusters appears to be reasonable and k-means provides some insight . The horizontal clusters indicate that people group together more by income than by age, and the black dots which mark the centers of the clusters confirm that people generally get wealthier as they age. It took about 0.4 seconds to run this example on my laptop.

Finally, just for fun, I ran the rxKmeans on the 123 million plus row airlines data set that I have described in a previous blog post looking for 2 clusters in a 7 dimensional space described by departure time, arrival time, air time, arrival delay, departure delay, taxi in time and taxi out time. I have no idea how to interpret the results, but the calculation ran in just under 6 minutes. My take away is that I ought to think about doing some serious k-means clustering on a really, really big data set.