by Joseph Rickert
When I was in graduate school in the mid '70s Mathematics departments were still under the spell of abstraction for its own sake. At that time, Algebraic Topology which uses concepts from Abstract Algebra to study topological spaces was a major gateway to the realm of abstraction. On my first visit, it was not at all clear that any of the exotic creatures to be found there: simplices, homotopy groups, homology groups etc. would have anything to say about the Real line let alone the real world. So it is astounding to me that over the last several years the masters of mathematical abstraction have made Topological Data Analysis (TDA), a subfield of Computational Topology, an exciting technique for dealing with high dimensional data sets that shows great promise.
One fundamental idea of TDA is to consider a data set to be a sample or “point cloud” taken from a manifold in some high-dimensional topological space. The sample data are used to construct simplices, generalizations of intervals, which are, in turn, are “glued” together to form a kind of wireframe approximation of the manifold. This manifold represents the “shape” of the data, and once you have it, the tools of Algebraic Topology can be used to construct a class of groups, homology groups, that are algebraic analogues of certain properties of the manifold. For example, knowing the number of holes in a manifold at various dimensions characterizes the topology of the manifold. (It is a single hole that makes a torus different from a sphere but makes a coffee cup topologically equivalent to a donut.) Homology groups describe the algebraic analogues of the holes. So, investigating the properties of the homology groups can tell you something about the topology of the manifold and, hopefully, something useful about your data.
All of this so far is just basic Algebraic Topology. A recent breakthrough idea, the notion of "persistent homology", has helped to make TDA a practical tool for data analysis. Persistent homology algorithms look for topological invariants across various scales of a topological manifold. All of the several methods for constructing the simplicial complex that constitutes the wire frame described above involve a scale parameter e, but knowing the number of holes and other structural features that appear at any particular scale is not enough to characterize the manifold. What you really want to know are which features persist over the full range of the parameter e. Efficient algorithms have been developed for computing both the homology groups themselves and a way to visualize them. A “barcode” plot is a qualitative visualization of a homology group that looks like a collection of horizontal line segments. The x axis represents the parameter e and the y axis is an arbitrary ordering of homology generators. (See Ghrist for the details.) When examining a barcode plot you are looking for lines that span a good portion of the x axis. Short lines are most likely just topological noise, long lines that persist across a good bit of the e scale may tell you something about your data. The following barcode plot for the iris data set was drawn with functions from R package phom.
library(phom) data <- as.matrix(iris[,-5]) head(data) max_dim <- 0 max_f <- 1 irisInt0 <- pHom(data, dimension=max_dim, # maximum dimension of persistent homology computed max_filtration_value=max_f, # maximum dimension of filtration complex mode="vr", # type of filtration complex metric="euclidean") plotBarcodeDiagram(irisInt0, max_dim, max_f, title="H0 Barcode plot of Iris Data")
Created by Pretty R at inside-R.org
I interpret the plot to tell me that there are probably 3 or 4 clusters in the data set. This is a lot of heavy duty mathematical machinery to say something a bit vague about the iris data set, however, while this small example may not be particularly exciting, I hope it serves to whet your appetite. To investigate further have a look at CompTop, Stanford's group for Applied and Computational Algebraic Topology and the resources page on the Ayasdi website. This Palo Alto start up, founded by Stanford professor Gunnar Carlsson, a pioneer in Computational Topology, along with two other Stanford trained mathematicians is on a mission to make TDA mainstream. Be sure to take a look at Professor Carlsson's video which is entertaining, informative and well worth watching.
Nice post
It is possible to use Dionysus (which is much faster) from R
Instructions are here:
http://www.stat.cmu.edu/~flecci/research/index.html
Posted by: Larry Wasserman | January 16, 2014 at 17:08
How soon will Vegas have odds that using TDA will enable someone like Li to make something like his copula to blow up what's left of the world's economy?
I, too, was in grad school (Econ) in the mid '70s, and the result was the capture of the field by the micro/math faction, which all served to support the distributional shift (mess) we have today. Math and data are fine tools, but only reliable when the underlying system obeys Mother Nature's rules (i.e., Newton/Einstein/Heisenberg), which are immutable (although we didn't always know, for sure, what they were). Finance math/stat/quant analysis is victim to humans changing the rules at a whim. Oft times, it is these analyses which manifest the rule change. They ain't no Brownian Motion in human endeavors.
Posted by: Robert Young | January 17, 2014 at 06:19
Great links. Excellent introduction to an extremely difficult topic that seems to offer much. Well done, do more.
Posted by: Joel Cadwell | January 17, 2014 at 07:56
Awesome post. Didn't know about the phom package. Will use it in a project I have in mind.
Posted by: Meena | January 17, 2014 at 19:48