by Norman Matloff

The American Statistical Association (ASA) leadership, and many in Statistics academia, have been undergoing a period of angst the last few years. They worry that the field of Statistics is headed for a future of reduced national influence and importance, with the feeling that:

- The field is to a large extent being usurped by other disciplines, notably Computer Science (CS).
- Efforts to make the field attractive to students have largely been unsuccessful.

I had been aware of these issues for quite a while, and thus was pleasantly surprised last year to see then-ASA president Marie Davidian write a plaintive editorial titled, “Aren’t *We* Data Science?”

Good, the ASA is taking action, I thought. But even then I was startled to learn during JSM 2014 (a conference tellingly titled “Statistics: Global Impact, Past, Present and Future”) that the ASA leadership is so concerned about these problems that it has now retained a PR firm.

This is probably a wise move–most large institutions engage in extensive PR in one way or another–but it is a sad statement about how complacent the profession has become. Indeed, it can be argued that the action is long overdue; as a friend of mine put it, “They [the statistical profession] lost the PR war because they never fought it.”

In this post, I’ll tell you the rest of the story, as I see it, viewing events as a statistician, computer scientist and R activist.

**CS vs. Statistics**

Let’s consider the CS issue first. Recently a number of new terms have arisen, such as data science, Big Data, and analytics, and the popularity of the term machine learning has grown rapidly. To many of us, though, this is just “old wine in new bottles,” with the “wine” being Statistics. But the new “bottles” are disciplines outside of Statistics–especially CS.

I have a foot in both the Statistics and CS camps. I’ve spent most of my career in the Computer Science Department at the University of California, Davis, but I began my career in Statistics at that institution. My mathematics doctoral thesis at UCLA was in probability theory, and my first years on the faculty at Davis focused on statistical methodology. I was one of the seven charter members of the Department of Statistics. Though my departmental affiliation later changed to CS, I never left Statistics as a field, and most of my research in Computer Science has been statistical in nature. With such “dual loyalties,” I’ll refer to people in both professions via third-person pronouns, not first, and I will be critical of both groups. However, in keeping with the theme of the ASA’s recent actions, my essay will be Stat-centric: What is poor Statistics to do?

Well then, how did CS come to annex the Stat field? The primary cause, I believe, came from the CS subfield of Artificial Intelligence (AI). Though there always had been some probabilistic analysis in AI, in recent years the interest has been almost exclusively in predictive analysis–a core area of Statistics.

That switch in AI was due largely to the emergence of Big Data. No one really knows what the term means, but people “know it when they see it,” and they see it quite often these days. Typical data sets range from large to huge to astronomical (sometimes literally the latter, as cosmology is one of the application fields), necessitating that one pay key attention to the computational aspects. Hence the term data science, combining quantitative methods with speedy computation, and hence another reason for CS to become involved.

Involvement is one thing, but usurpation is another. Though not a deliberate action by any means, CS is eclipsing Stat in many of Stat’s central areas. This is dramatically demonstrated by statements that are made like, “With machine learning methods, you don’t need statistics”–a punch in the gut for statisticians who realize that machine learning really IS statistics. ML goes into great detail in certain aspects, e.g. text mining, but in essence it consists of parametric and nonparametric curve estimation methods from Statistics, such as logistic regression, LASSO, nearest-neighbor classification, random forests, the EM algorithm and so on.

Though the Stat leaders seem to regard all this as something of an existential threat to the well-being of their profession, I view it as much worse than that. The problem is not that CS people are doing Statistics, but rather that they are doing it poorly: Generally the quality of CS work in Stat is weak. It is not a problem of quality of the researchers themselves; indeed, many of them are very highly talented. Instead, there are a number of **systemic reasons** for this, structural problems with the CS research “business model”:

- **CS, having grown out of research on fast-changing software and hardware systems, became accustomed to the “24-hour news cycle”**–very rapid publication rates, with the venue of choice being (refereed) frequent conferences rather than slow journals. This leads to research work being less thoroughly conducted, and less thoroughly reviewed, resulting in poorer quality work. The fact that some prestigious conferences have acceptance rates in the teens or even lower doesn’t negate these realities.
- Because CS Depts. at research universities tend to be housed in Colleges of Engineering, there is **heavy pressure to bring in lots of research funding, and produce lots of PhD students**. Large amounts of time are spent on trips to schmooze funding agencies and industrial sponsors, writing grants, meeting conference deadlines and managing a small army of doctoral students–instead of time spent in careful, deep, long-term contemplation about the problems at hand. This is made even worse by the rapid change in the fashionable research topic du jour, making it difficult to go into a topic in any real depth. Offloading the actual research onto a large team of grad students can result in faculty not fully applying the talents they were hired for; I’ve seen too many cases in which the thesis adviser is not sufficiently aware of what his/her students are doing.
- **There is rampant “reinventing the wheel.”** The above-mentioned lack of “adult supervision” and lack of long-term commitment to research topics results in weak knowledge of the literature. This is especially true for knowledge of the Stat literature, which even the “adults” tend to have very little awareness of. For instance, consider a paper on the use of unlabeled training data in classification. (I’ll omit names.) One of the two authors is one of the most prominent names in the machine learning field, and the paper has been cited over 3,000 times, yet the paper cites nothing in the extensive Stat literature on this topic, consisting of a long stream of papers from 1981 to the present.
- Again for historical reasons, CS research is largely empirical/experimental in nature. This causes what in my view is **one of the most serious problems plaguing CS research in Stat – lack of rigor**. Mind you, I am not saying that every paper should consist of theorems and proofs or be overly abstract; data- and/or simulation-based studies are fine. But there is no substitute for precise thinking, and in my experience, many (nominally) successful CS researchers in Stat do not have a solid understanding of the fundamentals underlying the problems they work on. For example, a recent paper in a top CS conference incorrectly stated that the logistic classification model cannot handle non-monotonic relations between the predictors and response variable; actually, one can add quadratic terms, and so on, to models like this.
- **This “engineering-style” research model causes a cavalier attitude towards underlying models and assumptions.** Most empirical work in CS doesn’t have any models to worry about. That’s entirely appropriate, but in my observation it creates a mentality that inappropriately carries over when CS researchers do Stat work. A few years ago, for instance, I attended a talk by a machine learning specialist who had just earned her PhD at one of the very top CS departments in the world. She had taken a Bayesian approach to the problem she worked on, and I asked her why she had chosen that specific prior distribution. She couldn’t answer – she had just blindly used what her thesis adviser had given her–and moreover, she was baffled as to why anyone would want to know why that prior was chosen.
- **Again due to the history of the field, CS people tend to have grand, starry-eyed ambitions–laudable, but a double-edged sword**. On the one hand, this is a huge plus, leading to highly impressive feats such as recognizing faces in a crowd. But this mentality leads to an oversimplified view of things, with everything being viewed as a paradigm shift. Neural networks epitomize this problem. Enticing phrasing such as “Neural networks work like the human brain” blinds many researchers to the fact that neural nets are not fundamentally different from other parametric and nonparametric methods for regression and classification. (Recently I was pleased to discover–“learn,” if you must–that the famous book by Hastie, Tibshirani and Friedman complains about what they call “hype” over neural networks; sadly, theirs is a rare voice on this matter.) Among CS folks, there is a failure to understand that the celebrated accomplishments of “machine learning” have been mainly the result of applying a lot of money, a lot of people time, a lot of computational power and prodigious amounts of tweaking to the given problem – not because fundamentally new technology has been invented.
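The point about logistic models and non-monotonic relations can be illustrated with a small simulation. This is only a sketch under assumptions of my own: the simulated data, the true curve `plogis(2 - x^2)`, and the variable names are all made up for illustration and do not come from the paper mentioned above.

```r
# Sketch: a logistic model picks up a non-monotonic relation
# once a quadratic term is added. All data here are simulated.
set.seed(1)
n <- 2000
x <- runif(n, -3, 3)

# Assumed true relation: P(y = 1) peaks at x = 0 and falls off on both sides
p <- plogis(2 - x^2)
y <- rbinom(n, 1, p)

# Linear-in-x logit: forced to be monotonic in x
fit_lin <- glm(y ~ x, family = binomial)

# Adding a quadratic term lets the fitted probability rise and then fall
fit_quad <- glm(y ~ x + I(x^2), family = binomial)

# Compare the two fits; lower AIC is better
aic_lin <- AIC(fit_lin)
aic_quad <- AIC(fit_quad)
print(c(linear = aic_lin, quadratic = aic_quad))
print(coef(fit_quad))
```

The model is still linear in its parameters, so nothing beyond ordinary `glm()` is needed; the quadratic term is just another predictor.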

All this matters – a LOT. In my opinion, the above factors result in highly lamentable opportunity costs. Clearly, I’m not saying that people in CS should stay out of Stat research. But the sad truth is that the usurpation process is causing precious resources–research funding, faculty slots, the best potential grad students, attention from government policymakers, even attention from the press–to go quite disproportionately to CS, even though Statistics is arguably better equipped to make use of them. This is not a CS vs. Stat issue; Statistics is important to the nation and to the world, and if scarce resources aren’t being used well, it’s everyone’s loss.

**Making Statistics Attractive to Students**

This of course is an age-old problem in Stat. Let’s face it–the very word statistics sounds hopelessly dull. But I would argue that a more modern development is making the problem a lot worse – the Advanced Placement (AP) Statistics courses in high schools.

Professor Xiao-Li Meng has written extensively about the destructive nature of AP Stat. He observed, “Among Harvard undergraduates I asked, the most frequent reason for not considering a statistical major was a ‘turn-off’ experience in an AP statistics course.” That says it all, doesn’t it? And though Meng’s views predictably sparked defensive replies in some quarters, I’ve had exactly the same experiences as Meng in my own interactions with students. No wonder students would rather major in a field like CS and study machine learning–without realizing it is Statistics. It is especially troubling that Statistics may be losing the “best and brightest” students.

One of the major problems is that AP Stat is usually taught by people who lack depth in the subject matter. A typical example is that a student complained to me that his AP Stat teacher could not answer his question as to why it is customary to use n-1 rather than n in the denominator of s^2, even though he had attended a top-quality high school in the heart of Silicon Valley. But even that lapse is really minor, compared to the lack among the AP teachers of the broad overview typically possessed by Stat professors teaching university courses, in terms of what can be done with Stat, what the philosophy is, what the concepts really mean and so on. AP courses are ostensibly college level, but the students are not getting college-level instruction. The “teach to the test” syndrome that pervades AP courses in general exacerbates this problem.
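For readers curious about the student's question, the usual answer is that dividing by n-1 makes s^2 an unbiased estimator of the population variance, while dividing by n underestimates it on average. A quick simulation shows this; the sample size, variance and number of replications below are arbitrary choices of mine, not anything from the AP curriculum.

```r
# Sketch: dividing by n - 1 gives an unbiased variance estimate; dividing by n does not.
set.seed(42)
sigma2 <- 4      # assumed true population variance
n <- 5           # small sample size, where the bias is most visible
reps <- 20000    # number of simulated samples

# Each row is one sample of size n from N(0, sigma2)
samples <- matrix(rnorm(n * reps, mean = 0, sd = sqrt(sigma2)), nrow = reps)

# Sum of squared deviations from each sample's own mean
ss <- apply(samples, 1, function(x) sum((x - mean(x))^2))

mean_div_n  <- mean(ss / n)        # averages near sigma2 * (n - 1) / n
mean_div_n1 <- mean(ss / (n - 1))  # averages near sigma2
print(c(divide_by_n = mean_div_n, divide_by_n_minus_1 = mean_div_n1))
```

This is exactly the kind of five-minute experiment that is easy in R and awkward on a pocket calculator.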

The most exasperating part of all this is that AP Stat officially relies on TI-83 pocket calculators as its computational vehicle. The machines are expensive, and after all we are living in an age in which R is free! Moreover, the calculators don’t have the capabilities for dazzling graphics and analysis of nontrivial data sets that R provides – exactly the kinds of things that motivate young people.

So, unlike the “CS usurpation problem,” whose solution is unclear, here is something that actually can be fixed reasonably simply. If I had my druthers, I would simply ban AP Stat, and actually, I am one of those people who would do away with the entire AP program. Obviously, there are too many deeply entrenched interests for this to happen, but one thing that can be done for AP Stat is to switch its computational vehicle to R.

As noted, R is free and is multi-platform, with outstanding graphical capabilities. There is no end to the number of data sets teenagers would find attractive for R use, say the Million Song Data Set.

As to a textbook, there are many introductions to Statistics that use R, such as Michael Crawley’s Statistics: An Introduction Using R, and Peter Dalgaard’s Introductory Statistics with R. But to really do it right, I would suggest that a group of Stat professors collaboratively write an open-source text, as has been done for instance for Chemistry. Examples of interest to high schoolers should be used, say this engaging analysis on OK Cupid.

This is not a complete solution by any means. There still is the issue of AP Stat being taught by people who lack depth in the field, and so on. And even switching to R would meet with resistance from various interests, such as the College Board and especially the AP Stat teachers themselves.

But given all these weighty problems, it certainly would be nice to do something, right? Switching to R would be doable–and should be done.

*[Crossposted with permission from the Mad (Data) Scientist blog.]*

Check out this tweet:

> Just found out that @rdpeng @jtleek @bcaffo have enrolled 1,000,000 students in statistics classes on @coursera in last 2 years.
>
> Simply Statistics (@simplystats), May 30, 2014

Those courses include Roger Peng's Computing for Data Analysis and Jeff Leek's "Data Analysis", and they all use the R language. That means more than 1,000,000 new students have been exposed to R in the last two years! On this basis alone it seems clear that the estimate of two million R users worldwide is an *underestimate*.

by Joseph Rickert

Norman Matloff, professor of computer science at UC Davis and founding member of the UCD Dept. of Statistics, has begun posting as Mad(Data)Scientist. (You may know Norm from his book, *The Art of R Programming*, NSP, 2011.) In his second post (out today) on the new R package, freqparcoord, that he wrote with Yinkang Xie, Norm looks into outliers in baseball data.

    library(freqparcoord)
    data(mlb)
    freqparcoord(mlb,-3,4:6,7)

We would like to welcome Norm as a new R blogger and we are looking forward to future posts!

Mad(Data)Scientist: More on freqparcoord

If you're a student and you'd like to improve R by developing a new R package for 3 months over the summer, and get paid $5000 by Google in the process, why not apply for an R project in the Google Summer of Code? All you need to do is browse the list of R projects available (if you have any questions about the projects, ask on the gsoc-r google group), and apply to GSOC 2014 using this template. The deadline for student applications is March 21, so act quickly. Students accepted into the program will be announced on April 21.

R Wiki: Google Summer of Code 2014

by James Paul Peruvankal, Senior Program Manager at Revolution Analytics

At Revolution Analytics, we are always interested in how people teach and learn R, and what makes R so popular, yet ‘quirky’ to learn. To get some insight from a real pro we interviewed Bob Muenchen. Bob is the author of *R for SAS and SPSS Users* and, with Joseph M. Hilbe, *R for Stata Users*. He is also the creator of r4stats.com, a popular web site devoted to analyzing trends in analytics software and helping people learn the R language. Bob is an Accredited Professional Statistician™ with 30 years of experience and is currently the manager of OIT Research Support (formerly the Statistical Consulting Center) at the University of Tennessee. He has conducted research for a variety of public and private organizations and has assisted on more than 1,000 graduate theses and dissertations. He has written or coauthored over 60 articles published in scientific journals and conference proceedings.

Bob has served on the advisory boards of SAS Institute, SPSS Inc., StatAce OOD, the Statistical Graphics Corporation and PC Week Magazine. His suggested improvements have been incorporated into SAS, SPSS, JMP, STATGRAPHICS and several R packages. His research interests include statistical computing, data graphics and visualization, text analysis, data mining, psychometrics and resampling.

**James: How did you get started teaching people how to use statistical software?**

Bob: When I came to UT in 1979, many people were switching from either FORTRAN or SPSS to SAS. There was quite a lot of demand for SAS training, and I enjoyed teaching the workshops. Back then SAS could save results, like residuals or predicted values, much more easily than SPSS, which drove the switch.

When the Windows version of SPSS came out, people started switching back. The SPSS user interface designer, Sheri Gilley, really understood what ease of use was all about, and the SAS folks didn’t get that until quite recently. I was just as happy teaching the SPSS workshops. However, many SPSS users at UT avoid programming, which I think is a big mistake. Pointing-and-clicking your way through an analysis can be a time-saving way to work, but I always keep the program so I have a record of what I did.

I started teaching R workshops in 2005 and attendance was quite sparse. Now it’s one of our Research Computing Support team’s most popular topics.

**James: Is there anything special about teaching people how to use R, any particular difficulties?**

Bob: In other analytics software, the focus is on variables. It sounds too simple to even bother saying: "Every procedure accepts variables." There are very few ways to specify them, such as by simple name, A, B, C, or lists like A TO Z or A--Z.

Rather than just variables, R has a variety of objects such as vectors, factors and matrices. Some procedures (called functions in R) require particular kinds of objects and there are many more ways to specify which objects to use. From a new user's perspective that may seem like needless complexity. However it provides significant benefits. Once an R user has defined a categorical variable as a factor, analyses will then try to “do the right thing” with that variable. For instance, you could include it in a regression equation and R would create the indicator variables needed to handle a categorical variable automatically.
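A tiny illustration of the factor behavior Bob describes, using a made-up data frame (the names `df`, `y` and `group` are mine, purely for the example):

```r
# Sketch: once a variable is a factor, lm() builds the indicator variables itself.
df <- data.frame(
  y     = c(10, 12, 11, 20, 22, 21, 30, 31, 29),
  group = factor(rep(c("a", "b", "c"), each = 3))
)

fit <- lm(y ~ group, data = df)

# The design matrix shows the indicator (dummy) columns R created;
# the first level, "a", is absorbed into the intercept by default.
dummy_cols <- colnames(model.matrix(fit))
print(dummy_cols)
print(coef(fit))
```

With R's default treatment contrasts, each coefficient after the intercept is the difference between that group's mean and the mean of the baseline group.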

Another important benefit to R’s object orientation is that it allows a total merger of what would normally be a separate matrix language into the main language of R. This attracts developers, who are helping grow R’s capabilities very rapidly.
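As a rough sketch of what that merger looks like in practice, the matrix operators sit right alongside the modeling functions, so a regression can be computed "by hand" and checked against `lm()` in the same session. The example uses the built-in `mtcars` data; the choice of variables is mine.

```r
# Sketch: matrix algebra and modeling functions live in the same language.
X <- cbind(1, mtcars$wt)   # design matrix: intercept column plus car weight
y <- mtcars$mpg

# Least squares via the normal equations, written directly in matrix notation
beta_hat <- solve(t(X) %*% X, t(X) %*% y)

# The same fit through the usual modeling function
fit <- lm(mpg ~ wt, data = mtcars)

print(cbind(by_hand = as.numeric(beta_hat), lm = unname(coef(fit))))
```

In packages built around a separate matrix language, the first half of this would typically happen in a different procedure or module from the second.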

**James: How do you handle such a broad range of backgrounds in your classes?**

Bob: The workshop participants do come from a very wide range of fields, but they share a common set of knowledge: what a variable is, how to analyze data, and so on. So I save a great deal of time by not having to explain all that. Instead, I redirect it into pointing out where R is likely to surprise them. You can have variables that are *not* in a data set? That’s a bizarre concept to a SAS, SPSS or Stata user. You can have X in one data set and Y in another, but include both in the same regression model? That sounds very strange at first and, of course, it’s quite risky if you’re not careful. I introduce most topics with, “You’re expecting this, but here comes something very different…”. Different doesn’t necessarily mean better, of course. SAS, SPSS and Stata are all top-quality packages and they do some things with less effort. I love R, but I like to point out where I think the others do a better job.
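To make the surprise concrete, here is a minimal sketch with simulated data (the object names `first`, `second` and `z` are hypothetical): the response lives in one data frame, a predictor in another, and a third variable is not in any data set at all.

```r
# Sketch: one regression drawing variables from two data frames and a free-standing vector.
set.seed(7)
first  <- data.frame(x = rnorm(50))
second <- data.frame(y = 2 * first$x + rnorm(50, sd = 0.1))
z      <- rnorm(50)   # a variable that belongs to no data set

# R accepts this without complaint. It is only meaningful if row i of
# every object refers to the same case, which is the risk Bob mentions.
fit <- lm(second$y ~ first$x + z)
print(coef(fit))
```

Nothing checks that the rows line up, so sorting or subsetting one of the objects beforehand would silently produce a wrong model.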

**James: How do you find teaching online compared to classroom courses?**

Bob: I teach my workshops in-person at The University of Tennessee and I’ve taught at the American Statistical Association’s Joint Statistical Meeting as well as the UseR! Conference. Teaching “live” is great fun, and being able to see the participants’ expressions is helpful in adjusting the presentation pace and knowing when to stop and ask for questions.

However, live workshops have major drawbacks. Travel costs can easily exceed the fee for a workshop, but worse, minimizing those expenses means cramming too much material into a short timeframe. That’s why I teach my webinars in half-day stretches skipping a day in between. We break every hour and fifteen minutes so people can relax. On their days off they can catch up on their regular work, review the workshop material, work on the exercises and email me with questions. At the end of a live workshop people are happy but exhausted and they leave quickly. At the end of a webinar-based workshop, they often stay for a long time afterwards asking questions. I stay online as long as it takes to answer them all.

**James: Some people like learning actively, with their hands on the keyboard. Others prefer to focus more on what’s being said and taking notes. How do you handle these styles?**

Bob: This is an excellent question! When I take a workshop myself, I usually prefer hands-on but sometimes I don’t. Each of my workshop attendees receives setup instructions a week early so their computer has the software installed and the files in the right place by the time we start. They’re ready for whichever learning style they prefer.

For hands-on learners, I use a single R program that contains the course notes as programming comments interspersed with executable code. Since the “slides” are right in front of them, they never need to take their eyes off their screens. The examples are designed to be easy to convert to their own projects. They build in a step-by-step fashion, going from simple to more complex to make sure no one gets lost. Participants can run each example as I cover it, and see the results on their own computers.

For people focused more on listening and taking notes, everyone also has a complete set of slides. The slides have the notes that describe each concept, then the code for it, followed by the output. The notes follow a numbering scheme that is used in both the program and the handout. This way, both types of learners stay in sync.

This dual approach has another benefit. It’s very easy to switch from one style to the other at any time. If someone gets tired of typing, or his or her computer malfunctions, switching to the notes is seamless. Conversely, if someone is following the printed notes and wants to switch to running an example, it’s very easy to find.

**James: What motivated you to start writing books?**

Bob: I’ve always enjoyed writing newsletter and journal articles. My books on R started out just as a set of notes that I kept for myself. When I put them online, they started getting thousands of hits and Springer called to ask if I could make it a book. I really didn’t think I had enough information, but it kept growing. The second edition of R for SAS and SPSS Users is 686 pages and I have notes on a few topics that I wish I had added. If I ever find time for a third edition, they’ll be in there.

**James: Thank you Bob for your time!**

If you are looking to learn R and are already familiar with software like SAS, SPSS or Stata, do check out Bob’s upcoming workshops here and here.

*by Joseph Rickert*

One of the key ideas in topological data analysis is to consider a data set to be a sample from a manifold in some high dimensional topological space and then to use the tools of algebraic topology to reconstruct the manifold. It turns out that the converse problem of taking a random sample from a given topological manifold also has some very useful applications in statistics. In their 2012 paper, Sampling From A Manifold, the mathematical statisticians Persi Diaconis, Susan Holmes and Mehrdad Shahshahani develop a general approach to this converse problem using fundamental ideas from geometric measure theory. They show how this topological / geometrical approach can be used not only to construct algorithms for generating test data for topological statistics, but also for testing the goodness of fit of sufficient statistics for exponential family distributions and for other statistical problems that may be conceptualized as sampling from a manifold.

The first example in the paper shows how to correctly sample from a torus:

$$M = \{\, ((R + r\cos\theta)\cos\psi,\ (R + r\cos\theta)\sin\psi,\ r\sin\theta) \,\}$$

where θ and ψ are both in [0, 2π) and R > r > 0.

The authors point out that naively drawing θ and ψ from uniformly distributed random variables will lead to sampled points that are denser than they should be in regions of higher curvature such as the inside of the torus. The correct way to sample is to use the theory they develop to derive the joint density, g(θ, ψ), of θ and ψ on the manifold. The theorem that enables this is an extension of the change of variables formula from calculus. Readers who have used Jacobians to work through tricky problems involving the sums and products of random variables in a probability course will recognize what is going on.

As it turns out, g(θ, ψ) factors into the product of

$$g_1(\theta) = \frac{1}{2\pi}\left(1 + \frac{r}{R}\cos\theta\right)$$

where θ ∈ [0, 2π), and g2(ψ), which is uniform on [0, 2π).

To actually sample from g1(θ) the authors provide the R code for a rejection sampling algorithm.

The following plot shows the histogram and density curve for 10,000 draws from g1(θ).

(Look here for the code to draw the plot: Download G1(theta)_histogram)
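For readers who want to experiment, here is a minimal rejection sampler for the angle density described above, (1/(2π))(1 + (r/R)cos θ) on [0, 2π). It is my own sketch, not the authors' code; the function names `g1` and `sample_g1` and the default values r = 0.5, R = 1 are assumptions made for illustration.

```r
# Sketch of a rejection sampler for the angle density on the torus (not the paper's code).

# Target density on [0, 2*pi)
g1 <- function(theta, r = 0.5, R = 1) (1 + (r / R) * cos(theta)) / (2 * pi)

sample_g1 <- function(n, r = 0.5, R = 1) {
  bound <- (1 + r / R) / (2 * pi)   # maximum of g1, attained at theta = 0
  out <- numeric(0)
  while (length(out) < n) {
    theta <- runif(n, 0, 2 * pi)    # uniform proposals for the angle
    u <- runif(n, 0, bound)         # uniform heights under the bounding box
    out <- c(out, theta[u < g1(theta, r, R)])  # keep points under the curve
  }
  out[seq_len(n)]
}

set.seed(123)
r <- 0.5; R <- 1
theta <- sample_g1(10000, r, R)
psi   <- runif(10000, 0, 2 * pi)    # the second angle is uniform

# Map the angles to points on the torus (standard embedding in 3-space)
pts <- cbind((R + r * cos(theta)) * cos(psi),
             (R + r * cos(theta)) * sin(psi),
             r * sin(theta))

hist(theta, breaks = 50, freq = FALSE, main = "Draws from g1")
curve(g1(x, r, R), from = 0, to = 2 * pi, add = TRUE)
```

Under this density the mean of cos θ works out to r/(2R), which gives a quick sanity check on the draws.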

The math in Sampling from a Manifold is intense. However, the authors help where they can. Much of the paper is expository, providing an introduction to geometric measure theory, and amid the literature review the authors are kind enough to point out more elementary references that they think are useful. This is hard work, but it is nice to see that R code has a place among all the abstract ideas.

*by Joseph Rickert*

Revolution Analytics is announcing three new programs today that we hope will be modest but positive contributions to data science education and public service analytics. The first new program, the Academic Institution Program (AIP), enables colleges, universities and other educational institutions to obtain a site license for Revolution Analytics' commercial distribution of the R Language, Revolution R Enterprise (RRE), for the yearly fee of $999. This is not a watered-down version of the software. It is a full-featured, supported distribution of RRE including the Big Data capability of Revolution R Enterprise ScaleR and the secure, scalable web services distribution software: Revolution R Enterprise DeployR. For this one nominal fee, a participating academic institution can run the software on workstations and PCs in software labs, on all of its departmental and central IT servers, on large clusters (including Hadoop clusters), and on cloud resources under its control. The only limitation is that, wherever it is installed, the software be used only for academic purposes.

Revolution Analytics has always made a single user workstation version of RRE available as a totally free download for students and faculty, and we are very proud that quite a few academics have used our software for learning and research. We will continue to offer free single user workstation academic downloads under the renamed Individual Scholar Program (ISP). However, only so much can be accomplished with individual downloads. Even when the entire class downloads the software, its use depends on the expertise of the students, and every new class must begin from scratch. Having R as part of the IT infrastructure, however, will allow departments within the university to realize multiple benefits in productivity, collaboration and performance. Our intention with the AIP is to help colleges and universities deploy R as their global platform for quantitative learning. Details of the AIP and ISP programs may be found on our website.

Our third new program, what we are calling the Public Services Program (PSP), is designed to make RRE available to non-profit organizations working for the public good. Our model organization for this program is the Human Rights Data Analysis Group (HRDAG). I have blogged about this group before and we at Revolution Analytics continue to be impressed by their determination to use statistical models as a light to illuminate those areas where the world needs to pay attention. (Clearly, not *all* of the brightest minds of this generation of data scientists are engaged in getting people to click ads.) We hope that by offering site licenses for RRE to public services organizations for the same nominal fee as offered to academic institutions we will encourage and enable more public service non-profit organizations to bring the tools of data science to bear on important problems in health care, economics, conservation and other areas that will benefit society as a whole. The details of the PSP, along with guidelines as to what kinds of organizations qualify for this program, may be found at the same link mentioned above.

Please send your nominations for non-profits that could benefit from the Public Services Program to community@revolutionanalytics.com.

Revolution Analytics: Academic and Public Service Programs

If you're new to the R language but keen to get started with linear modeling or logistic regression in the language, take a look at this "Introduction to R" PDF, by Princeton's Germán Rodríguez. (There's also a browsable HTML version.)

In a crisp 35 pages it begins by taking you through the basics of R: simple objects, importing data, and graphics. Then, it works through several examples of linear models (formula basics, fitting a model, model diagnostics, analysis of variance and even regression splines). Finally, there's a section on Generalized Linear Models, with a focus on logistic regression. The document doesn't attempt to explain all of the capabilities of R, but instead works through a series of examples to teach by demonstration. All of the datasets used in the guide are available online, so it's easy to follow along from home.

Also by the same author: a guide to R for Stata users.

Germán Rodríguez: Introducing R

by Joseph Rickert

When I was in graduate school in the mid '70s, Mathematics departments were still under the spell of abstraction for its own sake. At that time, Algebraic Topology, which uses concepts from Abstract Algebra to study topological spaces, was a major gateway to the realm of abstraction. On my first visit, it was not at all clear that any of the exotic creatures to be found there (simplices, homotopy groups, homology groups, etc.) would have anything to say about the Real line, let alone the real world. So it is astounding to me that over the last several years the masters of mathematical abstraction have made Topological Data Analysis (TDA), a subfield of Computational Topology, an exciting technique for dealing with high dimensional data sets that shows great promise.

One fundamental idea of TDA is to consider a data set to be a sample or “point cloud” taken from a manifold in some high-dimensional topological space. The sample data are used to construct simplices, generalizations of intervals, which are, in turn, “glued” together to form a kind of wireframe approximation of the manifold. This manifold represents the “shape” of the data, and once you have it, the tools of Algebraic Topology can be used to construct a class of groups, homology groups, that are algebraic analogues of certain properties of the manifold. For example, knowing the number of holes in a manifold at various dimensions characterizes the topology of the manifold. (It is a single hole that makes a torus different from a sphere but makes a coffee cup topologically equivalent to a donut.) Homology groups describe the algebraic analogues of the holes. So, investigating the properties of the homology groups can tell you something about the topology of the manifold and, hopefully, something useful about your data.

All of this so far is just basic Algebraic Topology. A recent breakthrough idea, the notion of "persistent homology", has helped to make TDA a practical tool for data analysis. Persistent homology algorithms look for topological invariants across various scales of a topological manifold. All of the several methods for constructing the simplicial complex that constitutes the wireframe described above involve a scale parameter e, but knowing the number of holes and other structural features that appear at any particular scale is not enough to characterize the manifold. What you really want to know is which features persist over the full range of the parameter e. Efficient algorithms have been developed both for computing the homology groups themselves and for visualizing them. A “barcode” plot is a qualitative visualization of a homology group that looks like a collection of horizontal line segments. The x axis represents the parameter e and the y axis is an arbitrary ordering of homology generators. (See Ghrist for the details.) When examining a barcode plot, you are looking for lines that span a good portion of the x axis. Short lines are most likely just topological noise, while long lines that persist across a good bit of the e scale may tell you something about your data. The following barcode plot for the iris data set was drawn with functions from the R package phom.

    library(phom)

    data <- as.matrix(iris[,-5])
    head(data)

    max_dim <- 0
    max_f <- 1

    irisInt0 <- pHom(data,
                     dimension=max_dim,            # maximum dimension of persistent homology computed
                     max_filtration_value=max_f,   # maximum dimension of filtration complex
                     mode="vr",                    # type of filtration complex
                     metric="euclidean")

    plotBarcodeDiagram(irisInt0, max_dim, max_f, title="H0 Barcode plot of Iris Data")


I interpret the plot as telling me that there are probably 3 or 4 clusters in the data set. This is a lot of heavy-duty mathematical machinery to say something a bit vague about the iris data set. However, while this small example may not be particularly exciting, I hope it serves to whet your appetite. To investigate further, have a look at CompTop, Stanford's group for Applied and Computational Algebraic Topology, and the resources page on the Ayasdi website. This Palo Alto startup, founded by Stanford professor Gunnar Carlsson, a pioneer in Computational Topology, along with two other Stanford-trained mathematicians, is on a mission to make TDA mainstream. Be sure to take a look at Professor Carlsson's video, which is entertaining, informative and well worth watching.
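For readers who do not have phom installed, the dimension-0 part of the barcode can be approximated with base R alone. The sketch below relies on the fact that the connected components of a Vietoris-Rips filtration merge at the same scales as single-linkage clustering: every component is born at scale 0 and dies at a merge height. The helper name `n_components()` is made up for illustration, and the scale may differ from phom's by a constant factor depending on the convention used to build the complex.

```r
# H0 persistence of a Vietoris-Rips filtration via single-linkage clustering.
X <- as.matrix(iris[, -5])
hc <- hclust(dist(X, method = "euclidean"), method = "single")

# Each merge height is the scale at which one connected component dies.
deaths <- sort(hc$height)

# Hypothetical helper: number of components still alive at scale eps.
n_components <- function(eps) nrow(X) - sum(deaths <= eps)

tail(deaths, 5)      # the longest finite bars
n_components(0.5)    # components alive at scale 0.5
n_components(1)      # components alive at scale 1
```

Long gaps between the largest death times play the same role as the long bars in the barcode plot: they suggest clusters that stay separate over a wide range of scales.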

by Joseph Rickert

The Revolution R Enterprise 7.0 Getting Started Guide makes a distinction between High Performance Computing (HPC), which is CPU centric and focuses on using many cores to perform lots of processing on small amounts of data, and High Performance Analytics (HPA), which is data centric and concentrates on feeding data to cores, disk I/O, data locality, efficient threading, and data management in RAM. The following collection of tips for computing with big data is an abbreviated version of the Guide’s discussion of the HPC and HPA considerations underlying the design of Revolution R Enterprise 7.0 and RevoScaleR, Revolution’s R package for HPA computing.

It doesn’t hurt to state the obvious: bigger is better. In general, memory is the most important consideration. Getting more cores can also help, but only up to a point since R itself can generally only use one core at a time internally. Moreover, for many data analysis problems the bottlenecks are disk I/O and the speed of RAM, making it difficult to efficiently use more than 4 or 8 cores on commodity hardware.

R allows its core math libraries to be replaced. Doing so can provide a very noticeable performance boost to any function that makes use of computational linear algebra algorithms. Revolution R Enterprise links in the Intel Math Kernel Libraries.

R does quite a bit of automatic copying. For example, when a data frame is passed into a function a copy of the data is made if the data frame is modified, and putting a data frame into a list also automatically causes a copy to be made. Moreover, many basic analysis algorithms, such as lm and glm, produce multiple copies of a data set as the computations progress. Memory management is important.
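A small sketch of the copy-on-modify behavior described above, using only base R. The function name `add_column()` is made up for illustration. The point is that the modification inside the function works on a copy, so the caller's data frame is untouched and both versions exist in memory afterwards.

```r
# Hypothetical helper that modifies its argument.
add_column <- function(df) {
  df$z <- df$x * 2   # modifying df inside the function works on a copy
  df
}

d <- data.frame(x = 1:5)
d2 <- add_column(d)

names(d)    # the original is unchanged
names(d2)   # the modified copy has the extra column
```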

Processing data a chunk at a time is the key to being able to scale computations without increasing memory requirements. External memory algorithms load a manageable amount of data into RAM, perform some intermediate calculations, load the next chunk and keep going until all of the data has been processed. Then, the final result is computed from the set of intermediate results. There are several CRAN packages including biglm, bigmemory, ff and ffbase that either implement external memory algorithms or help with writing them. Revolution R Enterprise’s RevoScaleR package takes chunking algorithms to the next level by automatically taking advantage of the computational resources to run its algorithms in parallel.
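To make the external memory idea concrete, here is a minimal sketch in base R that fits a linear regression one chunk at a time by accumulating the cross-products X'X and X'y and solving the normal equations at the end. The simulated data, chunk size, and variable names are all assumptions for illustration; in practice each chunk would be read from disk rather than sliced from a data frame already in memory, and packages such as biglm use more numerically careful updates.

```r
set.seed(1)
n <- 10000
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- 1 + 2 * dat$x1 - 3 * dat$x2 + rnorm(n)

chunk_size <- 1000
starts <- seq(1, n, by = chunk_size)

# Intermediate results: small matrices whose size does not grow with n.
XtX <- matrix(0, 3, 3)
Xty <- matrix(0, 3, 1)

for (s in starts) {
  rows <- s:min(s + chunk_size - 1, n)
  chunk <- dat[rows, ]                      # stands in for reading a chunk from disk
  X <- cbind(1, chunk$x1, chunk$x2)
  XtX <- XtX + crossprod(X)                 # accumulate X'X
  Xty <- Xty + crossprod(X, chunk$y)        # accumulate X'y
}

# Final result computed from the intermediate results only.
beta_chunked <- drop(solve(XtX, Xty))
beta_full <- unname(coef(lm(y ~ x1 + x2, data = dat)))

all.equal(beta_chunked, beta_full)
```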

Using all of the available cores and nodes is key to scaling computations to really big data. However, since data analysis algorithms tend to be I/O bound when data cannot fit into memory, the use of multiple hard drives can be even more important than the use of multiple cores. The CRAN package foreach provides easy-to-use tools for executing R functions in parallel, both on a single computer and across multiple computers. The foreach() function is particularly useful for “embarrassingly parallel” computations that do not involve communication among different tasks.
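As a sketch of an embarrassingly parallel computation, the example below fits one small model per group, with no communication between tasks. Note that it swaps in the base `parallel` package rather than foreach, so that it runs without extra installs; the choice of two cores, the fallback to one core on Windows (where forking is unavailable), and the helper name `fit_one()` are assumptions for illustration.

```r
library(parallel)

# mclapply() relies on forking, which is not available on Windows.
n_cores <- if (.Platform$OS.type == "windows") 1L else 2L

# One independent task per species.
by_species <- split(iris, iris$Species)

# Hypothetical helper: fit a simple regression on one group.
fit_one <- function(d) coef(lm(Sepal.Length ~ Petal.Length, data = d))

coefs <- mclapply(by_species, fit_one, mc.cores = n_cores)
do.call(rbind, coefs)
```

Because each task touches only its own subset, the same function can be handed to a serial `lapply()` and should give identical results, which makes this kind of code easy to check.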

The statistical functions and machine learning algorithms in the RevoScaleR package are all Parallel External Memory Algorithms (PEMAs). They automatically take advantage of all of the cores available on a machine or on a cluster (including LSF and Hadoop clusters).

In R, the two choices for “continuous” data are numeric, an 8 byte (double) floating point number, and integer, a 4 byte integer. There are circumstances where storing and processing integer data can provide the dual advantages of using less memory and decreasing processing time. For example, when working with integers, a tabulation is generally much faster than sorting and gives exact values for all empirical quantiles. Even when you are not working with integers, scaling and converting to integers can produce fast and accurate estimates of quantiles. As an example, if the data consists of floating point values in the range from 0 to 1,000, converting to integers and tabulating will bound the median or any other quantile to within two adjacent integers. Then interpolation can get you an even closer approximation.
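The median example above can be sketched in base R as follows. The simulated data and variable names are assumptions for illustration; the sample size is odd so that the median is a single order statistic, which keeps the bounding argument simple.

```r
set.seed(42)
n <- 100001                                  # odd, so the median is one order statistic
x <- runif(n, min = 0, max = 1000)

xi <- as.integer(floor(x))                   # integer codes 0..999
counts <- tabulate(xi + 1L, nbins = 1000L)   # bin k counts values in [k - 1, k)
cum <- cumsum(counts)

rank_med <- (n + 1) / 2
k <- which(cum >= rank_med)[1]               # first bin reaching the median rank
lower <- k - 1                               # the median lies between two adjacent integers
upper <- k

# Linear interpolation inside the bin for a closer approximation.
before <- if (k > 1) cum[k - 1] else 0
approx_med <- lower + (rank_med - before) / counts[k]

c(lower = lower, upper = upper, approx = approx_med, exact = median(x))
```

No sort of the full data is needed: one pass builds the table, and the table has only as many cells as there are distinct integer values.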

You will want to store big data so that it can be efficiently accessed from disk. The use of appropriate data types can save both storage space and access time. Take advantage of integers and, when you can, store data in 32-bit floats, not 64-bit doubles. A 32-bit float can represent about 7 decimal digits of precision, which is more than enough for most data, and it takes up half the space of a double. Save the 64-bit doubles for computations.
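R itself has no single-precision vector type, so the following sketch only illustrates the storage trade-off, using base R's `writeBin()` and `readBin()` with `size = 4` to serialize doubles as 32-bit floats. The simulated data are an assumption for illustration.

```r
set.seed(7)
x <- runif(1000, min = 0, max = 1000)

raw4 <- writeBin(x, raw(), size = 4)   # stored as 32-bit floats
raw8 <- writeBin(x, raw(), size = 8)   # stored as 64-bit doubles

length(raw4)   # bytes used by the float representation
length(raw8)   # bytes used by the double representation

# Read the floats back into R doubles and check the precision lost.
x_back <- readBin(raw4, what = "numeric", n = length(x), size = 4)
max(abs(x_back - x) / x)               # worst-case relative error
```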

Even though a data set may have many thousands of variables, typically not all of them are being analyzed at one time. By reading from disk just the actual variables and observations you will use in analysis, you can speed up the analysis considerably.
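With plain text files, base R can already skip unwanted columns at read time. A minimal sketch, assuming a CSV file (here a temporary copy of iris) and a made-up list of variables to keep; setting a column's class to "NULL" tells `read.csv()` not to load it.

```r
# Write a small file to stand in for a large data set on disk.
tmp <- tempfile(fileext = ".csv")
write.csv(iris, tmp, row.names = FALSE)

keep <- c("Sepal.Length", "Species")           # variables needed for the analysis

# Read only the header line to learn the column names.
header <- names(read.csv(tmp, nrows = 1))

# "NULL" skips a column; NA lets read.csv() guess the type.
classes <- rep("NULL", length(header))
classes[header %in% keep] <- NA

small <- read.csv(tmp, colClasses = classes)
str(small)
```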

Loops in R can be very slow compared with R’s core vector operations, which are typically written in C, C++ or Fortran, compiled languages that execute much faster than the R interpreter.
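A quick way to see the difference is to compute the same quantity both ways. The function names below are made up for illustration, and the timings will vary by machine and R version, so none are quoted here.

```r
x <- as.numeric(1:1e6)

# Explicit loop: every iteration is run by the R interpreter.
loop_sum_sq <- function(v) {
  total <- 0
  for (i in seq_along(v)) total <- total + v[i]^2
  total
}

# Vectorized: the squaring and the summation both run in compiled code.
vec_sum_sq <- function(v) sum(v^2)

system.time(a <- loop_sum_sq(x))
system.time(b <- vec_sum_sq(x))

all.equal(a, b)
```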

One of R’s great strengths is its ability to integrate easily with other languages, including C, C++, and Fortran. You can pass R data objects to other languages, do some computations, and return the results in R data objects. The CRAN package Rcpp, for example, makes it easy to call C and C++ code from R.

When working with small data sets, it is common to perform data transformations one at a time. For instance, one line of code might create a new variable, and subsequent lines perform additional transformations, with each transformation requiring a pass through the data. To avoid the overhead of making multiple passes over a large data set, write chunking algorithms that apply all of the transformations to each chunk. RevoScaleR’s rxDataStep() function is designed for one-pass processing by permitting multiple data transformations to be performed on each chunk.
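The one-pass idea can be sketched in base R without RevoScaleR: bundle all of the transformations into a single function and apply it once per chunk. The function name `transform_chunk()`, the derived variables, and the chunk size are assumptions for illustration; this is not how rxDataStep() is implemented.

```r
# Hypothetical helper: all transformations applied together, one pass per chunk.
transform_chunk <- function(chunk) {
  chunk$ratio     <- chunk$Sepal.Length / chunk$Sepal.Width
  chunk$log_petal <- log(chunk$Petal.Length)
  chunk$wide      <- chunk$Petal.Width > 1
  chunk
}

chunk_size <- 40
starts <- seq(1, nrow(iris), by = chunk_size)

pieces <- lapply(starts, function(s) {
  rows <- s:min(s + chunk_size - 1, nrow(iris))
  transform_chunk(iris[rows, ])   # stands in for read chunk, transform, write chunk
})

# Collected here only so the result can be inspected; with big data each
# transformed chunk would be written to disk instead of kept in memory.
result <- do.call(rbind, pieces)
head(result)
```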

When writing chunking algorithms, try to avoid algorithms that cross chunk boundaries. In general, data transformations for a single row of data should not be dependent on values in other rows. The key idea is that a transformation expression should give the same result even if only some of the rows of data are in memory at one time. Data manipulations requiring lags can be done but require special handling.
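The "special handling" for lags usually amounts to carrying a little state across the chunk boundary. A minimal sketch in base R, with made-up data and variable names, computes first differences chunk by chunk by remembering the last value of the previous chunk.

```r
x <- c(5, 7, 6, 9, 12, 11, 15)
chunk_size <- 3
starts <- seq(1, length(x), by = chunk_size)

carry <- NA_real_        # last value of the previous chunk (none before the first)
diffs <- numeric(0)

for (s in starts) {
  chunk <- x[s:min(s + chunk_size - 1, length(x))]
  lagged <- c(carry, chunk[-length(chunk)])   # lag 1, using the carried value
  diffs <- c(diffs, chunk - lagged)
  carry <- chunk[length(chunk)]               # state passed to the next chunk
}

diffs
```

The result matches what a lag computed on the full vector would give, even though no chunk ever sees more than one value from its predecessor.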

Working with categorical or factor variables in big data sets can be challenging. For starters, not all of the factor levels may be represented in a single chunk of data. Using R’s factor() function in a transformation on a chunk of data without explicitly specifying all of the levels that are present in the entire data set might cause you to end up with incompatible factor levels from chunk to chunk. Also, building models with factors having hundreds of levels may cause hundreds of dummy variables to be created that really eat up memory. The functions in the RevoScaleR package that deal with factors minimize memory use and do not generally explicitly create dummy variables to represent factors.
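The factor-level problem is easy to reproduce in base R. In this sketch the two "chunks" are slices of the iris Species column, and the full set of levels is assumed to be known in advance.

```r
species <- as.character(iris$Species)
chunk_a <- species[1:50]      # contains only "setosa"
chunk_b <- species[51:150]    # contains "versicolor" and "virginica"

# Levels inferred per chunk: the integer codes are not comparable.
bad_a <- factor(chunk_a)
bad_b <- factor(chunk_b)
levels(bad_a)
levels(bad_b)

# Levels fixed up front: the same code means the same category in every chunk.
all_levels <- c("setosa", "versicolor", "virginica")
good_a <- factor(chunk_a, levels = all_levels)
good_b <- factor(chunk_b, levels = all_levels)

as.integer(bad_b[1])    # code for "versicolor" when levels are inferred
as.integer(good_b[1])   # code for "versicolor" when levels are fixed
```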

Most analysis functions return a relatively small object of results that can easily be handled in memory. Occasionally, however, output will have the same number of rows as the data: when computing predictions and residuals for example. In order for this to scale, you will want the output written out to a file rather than kept in memory.
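One way to keep row-level output out of memory is to append it to a file as each chunk is processed. The sketch below uses base R only; the model, chunk size, and output layout are assumptions for illustration.

```r
fit <- lm(Sepal.Length ~ Petal.Length, data = iris)   # small object, fits in memory

out_file <- tempfile(fileext = ".csv")
chunk_size <- 50
starts <- seq(1, nrow(iris), by = chunk_size)

for (s in starts) {
  rows <- s:min(s + chunk_size - 1, nrow(iris))
  chunk <- iris[rows, ]                               # stands in for a chunk read from disk
  pred <- predict(fit, newdata = chunk)
  out <- data.frame(row = rows,
                    pred = pred,
                    resid = chunk$Sepal.Length - pred)
  first <- (s == 1)
  # Write the header once, then append the remaining chunks.
  write.table(out, out_file, sep = ",", row.names = FALSE,
              col.names = first, append = !first)
}

written <- read.csv(out_file)   # read back here only to check the result
nrow(written)
```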

Sorting is by nature a time-intensive operation. Do what you can to avoid sorting a large data set. Use functions that compute estimates of medians and quantiles, and look for implementations of popular algorithms that avoid sorting. For example, the RevoScaleR function rxDTree() avoids sorting by working with histograms of the data rather than with the raw data itself.