The 2010 Workshop on Algorithms for Modern Massive Data Sets (MMDS 2010) finished up this past Friday (June 18th) at Stanford. This was an exceptionally well organized conference: four days of mind-stretching talks on algorithm development and the challenges of working with massive data sets approached from almost every conceivable angle. The approximately 100 attendees were a diverse group of computer scientists, developers and quantitative marketing people from industry and academia. The mix of academics and industry types appeared to be pretty well balanced with a number of speakers listing themselves as having both university and company affiliations. This mashup of people, backgrounds and ideas reflected what I think was the overall theme and rationale for the workshop: moving deep theory ideas to practical solutions.

The level of the talks ranged from Peter Norvig’s (head of Google research) opening tutorial on the practical problems of running Google’s data center that is so massive that shark attacks, blasphemy outbreaks and other 6+ sigma events are real worries -- to Konstantin Mischaikow’s (Rutgers university) presentation on a “Combinatorial Framework for Nonlinear Dynamics” for which a good background in algebraic topology would have been helpful. Most of talks were fairly technical, but many of the speakers came across as being very skilled at communicating to non-specialists. (For a short time, I even had the illusion that I understood some algebraic topology!)

Maybe it was the perfect weather, or the beautiful Stanford campus, or the fact that the coffee was always hot, but it seemed to me that people were just in the mood for talking openly about their work. Too much and too varied a body of material was presented for me, an outsider to this field, to summarize here, but the following is my take on some of the major themes -- the recurring melodies if you will -- that played as background music through a good many of the presentations:

- Many modern data sets of practical interest, those containing social networking information for example, are best described by graphs not flat tables. The need to look at networked data has triggered a shift in perspective were researchers are looking at almost all massive data sets as graphs, and are investigating aspects of network structure that might be useful in developing new algorithms. There are deep theoretical connections between linear algebra and graph theory; however, it is not clear to what extent these connections can be exploited to create practical, efficient algorithms. John Gilbert (UC Santa Barbara), referred to linear algebra as the “middle ware” of scientific computing. So, finding analogous computational tools for graphical data sets or augmenting linear algebraic methods with new techniques could indeed have profound, practical implications.

- The analysis of massive data is tied to the underlying hardware in at least two ways. First there is the matter of capacity. There seemed to be some agreement that data sets are small to medium if they can fit on a laptop, medium to large if they can fit on a single server, and massive if they are so big that they need to be spread among multiple servers. Hence, both the tools and techniques to analyze data will depend on the underlying hardware architecture. So for example, memory bound languages such as MatLab and R, viable choices for small to large data sets, are not be feasible for massive data sets without extensions to blow through the memory constraint problem. Then, there is the matter of parallelism. As several speakers remarked, parallel computing is no longer an option. These days, even low end laptops are designed around multiple cores. It is very likely that the next generation of middleware algorithms will take maximum advantage of parallel computing.

- Growth in the number of records to be processed is not the only contributor to the size and complexity of modern massive data sets. The drive to bring all of the relevant data to bear on a problem is leading statisticians to seek new methods of analyzing heterogeneous data. Susan Holmes (Stanford) presented work that integrated data describing networks of gene interactions, microarray data and image data into a single framework for analysis. Flexible tools for working with multiple data types and perhaps new data types altogether will be necessary to advance this program. For example, David Bader (George Tech College of Computing) described work on a new data type for dealing with streaming graphs with vertices that change over time, while several speakers presented work to develop structures and algorithms to deal with streaming data.

- Computer scientists and statisticians working with massive data sets are pursuing dual strategies for dealing with the challenge of size. On the one hand there is the continuing search for algorithms that scale in less than exponential time. On the other, there is the strategy of reducing the size of the problem either through sampling or dimensionality reduction techniques. A favorite trick of algorithm developers is to approximate a large matrix M, with a smaller matrix M’ that is “close” to M according to some relevant norm. However, as Petros Drineas (Rensselaer Polytechnic Institute) pointed out in his tutorial on randomized algorithms, clever algebra won’t always work if the resulting structures do not have straightforward interpretations. In describing work he did with SNPs, Petros pointed out that the geneticists he was working with were not happy with linear combinations of the SNPs that contained the maximum amount of relevant information; they wanted to know the individual SNPs that were most important

The papers presented at the workshop should be posted at the link above in the near future. I highly recommend them not only as a snapshot of the state of the science, but also as a way for non-specialists to learn something about the field.