by Joseph Rickert
In a recent post, where I presented some R-related highlights of November's H2O World conference, I singled out and described talks by Trevor Hastie and John Chambers and remarked that it would be nice if the videos were made available. Well, thanks to the generosity of the folks at H2O, I got my wish.
Here is the video of Professor Hastie's talk.
This video is a master class on machine learning: in 40 minutes or so, Professor Hastie conducts a tour that starts with basic decision trees and goes all the way to building learning ensembles with the Lasso. Along the way, he presents the salient ideas on bagging, random forests and boosting. The treatment of boosting is succinct and elegant, covering some remarkable features of the family of boosting algorithms. For example, Professor Hastie describes how training error in AdaBoost can reach zero and stay there while testing error continues to improve, how superior performance can be achieved with boosting algorithms by using only tree stumps, and how stagewise additive modeling "slows down the rate of overfitting". The really deep insight comes in the discussion about viewing AdaBoost as an algorithm that fits additive logistic regression models with an exponential loss function. This, in turn, leads to a discussion of Jerome Friedman's Gradient Boosting Machine and more general boosting algorithms that can accommodate multiple kinds of loss functions. These are the models implemented in R's gbm package.
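For readers who want to see the stagewise idea in miniature, here is a rough sketch of discrete AdaBoost built from one-split stumps. It is not taken from the talk or from the gbm package, and it is written in plain Python purely for illustration. The ten-point data set and the function names (`fit_stump`, `adaboost`, `training_errors`) are made up. Each round fits one stump to reweighted data and adds it to the ensemble with a weight `alpha` chosen to minimize exponential loss.

```python
import math

def stump_predict(x, threshold, polarity):
    """A one-split 'tree': predict `polarity` at or below the threshold, else the opposite."""
    return polarity if x <= threshold else -polarity

def fit_stump(xs, ys, weights):
    """Try every threshold/polarity pair and keep the one with the lowest weighted error."""
    points = sorted(set(xs))
    midpoints = [(a + b) / 2 for a, b in zip(points, points[1:])]
    thresholds = [points[0] - 1] + midpoints + [points[-1] + 1]
    best = None
    for threshold in thresholds:
        for polarity in (1, -1):
            err = sum(w for x, y, w in zip(xs, ys, weights)
                      if stump_predict(x, threshold, polarity) != y)
            if best is None or err < best[0]:
                best = (err, threshold, polarity)
    return best

def adaboost(xs, ys, rounds):
    """Stagewise additive fit: each round adds one weighted stump to the ensemble."""
    n = len(xs)
    weights = [1.0 / n] * n
    model = []
    for _ in range(rounds):
        err, threshold, polarity = fit_stump(xs, ys, weights)
        err = min(max(err, 1e-12), 1 - 1e-12)    # guard against log(0)
        alpha = 0.5 * math.log((1 - err) / err)  # minimizer of exponential loss
        model.append((alpha, threshold, polarity))
        # Up-weight the points this stump got wrong, down-weight the rest.
        weights = [w * math.exp(-alpha * y * stump_predict(x, threshold, polarity))
                   for x, y, w in zip(xs, ys, weights)]
        total = sum(weights)
        weights = [w / total for w in weights]
    return model

def predict(model, x):
    """Sign of the weighted vote of all stumps."""
    score = sum(alpha * stump_predict(x, t, p) for alpha, t, p in model)
    return 1 if score >= 0 else -1

def training_errors(model, xs, ys):
    """Number of training points the ensemble misclassifies."""
    return sum(predict(model, x) != y for x, y in zip(xs, ys))

# Hypothetical toy data: no single threshold separates the two classes.
xs = list(range(10))
ys = [1, 1, 1, -1, -1, -1, 1, 1, 1, -1]

for rounds in (1, 2, 3):
    print(rounds, "stump(s):", training_errors(adaboost(xs, ys, rounds), xs, ys), "errors")
```

On this toy data the best single stump still misclassifies three of the ten points, while the three-stump ensemble classifies all ten correctly. The example is far too small to say anything about test error or overfitting; it only shows the reweighting mechanics.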
I think this video of John Chambers reminiscing about his time at Bell Labs working with John Tukey is destined to become an important part of the historical record for Statistics. There are many remembrances of Tukey to be found online, but I don't know of any other visual record by someone of John Chambers' stature who interacted with Tukey as a colleague and professional statistician.
In just a few minutes, Chambers paints a balanced and revealing portrait that humanizes and captures some of the complexity of this icon of modern statistics. I especially like the story in the Q & A portion of the talk where John describes Tukey's propensity for "mischief" and his delight in inventing new words (like boxplot "hinges") that rankled many of his statistician colleagues, and apparently upset the British statisticians in particular.
There are a few more videos on the H2O site that are worth a look.
by James Paul Peruvankal, Senior Program Manager at Revolution Analytics
At Revolution Analytics, we are always interested in how people teach and learn R, and what makes R so popular, yet ‘quirky’ to learn. To get some insight from a real pro we interviewed Bob Muenchen. Bob is the author of R for SAS and SPSS Users and, with Joseph M. Hilbe, R for Stata Users. He is also the creator of r4stats.com, a popular web site devoted to analyzing trends in analytics software and helping people learn the R language. Bob is an Accredited Professional Statistician™ with 30 years of experience and is currently the manager of OIT Research Support (formerly the Statistical Consulting Center) at the University of Tennessee. He has conducted research for a variety of public and private organizations and has assisted on more than 1,000 graduate theses and dissertations. He has written or coauthored over 60 articles published in scientific journals and conference proceedings.
Bob has served on the advisory boards of SAS Institute, SPSS Inc., StatAce OOD, the Statistical Graphics Corporation and PC Week Magazine. His suggested improvements have been incorporated into SAS, SPSS, JMP, STATGRAPHICS and several R packages. His research interests include statistical computing, data graphics and visualization, text analysis, data mining, psychometrics and resampling.
James: How did you get started teaching people how to use statistical software?
Bob: When I came to UT in 1979, many people were switching from either FORTRAN or SPSS to SAS. There was quite a lot of demand for SAS training, and I enjoyed teaching the workshops. Back then SAS could save results, like residuals or predicted values, much more easily than SPSS, which drove the switch.
When the Windows version of SPSS came out, people started switching back. The SPSS user interface designer, Sheri Gilley, really understood what ease of use was all about, and the SAS folks didn’t get that until quite recently. I was just as happy teaching the SPSS workshops. However, many SPSS users at UT avoid programming, which I think is a big mistake. Pointing-and-clicking your way through an analysis can be a time-saving way to work, but I always keep the program so I have a record of what I did.
I started teaching R workshops in 2005 and attendance was quite sparse. Now it’s one of our Research Computing Support team’s most popular topics.
James: Is there anything special about teaching people how to use R, any particular difficulties?
Bob: In other analytics software, the focus is on variables. It sounds too simple to even bother saying: "Every procedure accepts variables." There are very few ways to specify them, such as by simple name, A, B, C, or lists like A TO Z or A--Z.
Rather than just variables, R has a variety of objects such as vectors, factors and matrices. Some procedures (called functions in R) require particular kinds of objects, and there are many more ways to specify which objects to use. From a new user's perspective that may seem like needless complexity. However, it provides significant benefits. Once an R user has defined a categorical variable as a factor, analyses will then try to “do the right thing” with that variable. For instance, you could include it in a regression equation and R would automatically create the indicator variables needed to handle a categorical variable.
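To make the indicator variables concrete, here is a minimal sketch of the idea, written in Python rather than R so that nothing is hidden inside a modeling function. It assumes treatment coding with the alphabetically first level as the baseline, which is what R's default contrasts do for an unordered factor. The function name `treatment_code` and the `diet` data are made up for illustration.

```python
def treatment_code(values):
    """Return (baseline, column_names, rows) of 0/1 indicators for a categorical variable."""
    levels = sorted(set(values))              # distinct categories, in sorted order
    baseline, others = levels[0], levels[1:]  # the first level gets no column of its own
    rows = [[1 if value == level else 0 for level in others] for value in values]
    return baseline, others, rows

diet = ["low", "high", "medium", "low"]       # hypothetical categorical variable
baseline, columns, rows = treatment_code(diet)

print("baseline:", baseline)
print("columns: ", columns)
for value, row in zip(diet, rows):
    print(value, row)
```

A variable with three categories becomes two 0/1 columns, and the baseline category is the row of all zeros. With a factor, R does this bookkeeping behind the scenes when the model is fitted.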
Another important benefit to R’s object orientation is that it allows a total merger of what would normally be a separate matrix language into the main language of R. This attracts developers, who are helping grow R’s capabilities very rapidly.
James: How do you handle such a broad range of backgrounds in your classes?
Bob: The workshop participants do come from a very wide range of fields, but they share a common set of knowledge: what a variable is, how to analyze data, and so on. So I save a great deal of time by not having to explain all that. Instead, I redirect it into pointing out where R is likely to surprise them. You can have variables that are not in a data set? That’s a bizarre concept to a SAS, SPSS or Stata user. You can have X in one data set and Y in another, but include both in the same regression model? That sounds very strange at first and, of course, it’s quite risky if you’re not careful. I introduce most topics with, “You’re expecting this, but here comes something very different…”. Different doesn’t necessarily mean better, of course. SAS, SPSS and Stata are all top-quality packages and they do some things with less effort. I love R, but I like to point out where I think the others do a better job.
James: How do you find teaching online compared to classroom courses?
Bob: I teach my workshops in person at The University of Tennessee and I’ve taught at the American Statistical Association’s Joint Statistical Meetings as well as the useR! conference. Teaching “live” is great fun, and being able to see the participants’ expressions is helpful in adjusting the presentation pace and knowing when to stop and ask for questions.
However, live workshops have major drawbacks. Travel costs can easily exceed the fee for a workshop, but worse, minimizing those expenses means cramming too much material into a short timeframe. That’s why I teach my webinars in half-day stretches, skipping a day in between. We break every hour and fifteen minutes so people can relax. On their days off they can catch up on their regular work, review the workshop material, work on the exercises and email me with questions. At the end of a live workshop people are happy but exhausted and they leave quickly. At the end of a webinar-based workshop, they often stay for a long time afterwards asking questions. I stay online as long as it takes to answer them all.
James: Some people like learning actively, with their hands on the keyboard. Others prefer to focus more on what’s being said and taking notes. How do you handle these styles?
Bob: This is an excellent question! When I take a workshop myself, I usually prefer hands-on but sometimes I don’t. Each of my workshop attendees receives setup instructions a week early so their computer has the software installed and the files in the right place by the time we start. They’re ready for whichever learning style they prefer.
For hands-on learners, I use a single R program that contains the course notes as programming comments interspersed with executable code. Since the “slides” are right in front of them, they never need to take their eyes off their screens. The examples are designed to be easy to convert to their own projects. They build in a step-by-step fashion, going from simple to more complex to make sure no one gets lost. Participants can run each example as I cover it, and see the results on their own computers.
For people focused more on listening and taking notes, everyone also has a complete set of slides. The slides have the notes that describe each concept, then the code for it, followed by the output. The notes follow a numbering scheme that is used in both the program and the handout. This way, both types of learners stay in sync.
This dual approach has another benefit. It’s very easy to switch from one style to the other at any time. If someone gets tired of typing, or his or her computer malfunctions, switching to the notes is seamless. Conversely, if someone is following the printed notes and wants to switch to running an example, it’s very easy to find.
James: What motivated you to start writing books?
Bob: I’ve always enjoyed writing newsletter and journal articles. My books on R started out just as a set of notes that I kept for myself. When I put them online, they started getting thousands of hits and Springer called to ask if I could make it a book. I really didn’t think I had enough information, but it kept growing. The second edition of R for SAS and SPSS Users is 686 pages and I have notes on a few topics that I wish I had added. If I ever find time for a third edition, they’ll be in there.
James: Thank you Bob for your time!
If you are looking to learn R and are already familiar with software like SAS, SPSS or Stata, do check out Bob’s upcoming workshops here and here.
You know Hadley Wickham as the inventor of the ggplot2 visualization phenomenon, the creator of time-saving R packages like plyr and lubridate, and the Chief Scientist at RStudio. But do you know what laptop Hadley uses, what software he uses (besides R, of course), or his favourite kitchen appliance? Find out in Hadley's interview with The Setup. (Also check out our profile of Hadley from September 2010.)
The Setup: Hadley Wickham
"The R-Files" is an occasional series from Revolution Analytics, where we profile prominent members of the R Community.
Name: Paul Teetor
Profession: Quantitative developer (freelance)
Nationality: American
Years Using R: 7
Known for: Author of R Cookbook (O’Reilly Media, 2011)
An active member of the R community, Paul Teetor is a quantitative developer and statistical consultant based in the Chicago area. He’s been using R for seven years, during which time his contributions to the community have been significant, particularly in the field of finance. He’s currently a freelance consultant largely focused on time series analysis. Teetor is also the author of the popular R Cookbook, which was published by O’Reilly Media this past March and offers new users over 200 “recipes” for performing more efficient data analysis with R.
He was first drawn to R for the flexibility it offered him in developing trading systems. Citing his own background in software engineering and the need to perform sophisticated statistical analysis in a programmable (and cost-effective) environment, Teetor said that R emerged as the perfect fit for him. Since then, he has performed the majority of his financial analyses in R and has also emerged as a leading evangelist for the community. He gradually collected a catalog of tricks and techniques for R, many of which were compiled into the R Cookbook. He's been a participant at conferences such as the Joint Statistical Meetings and the R/Finance Conference, where he evangelized the role of R in quantitative finance. Some of those talks and papers are available on his website.
“Prior to R, I did most of my statistical analysis in Excel, and occasionally SAS,” said Teetor. “However, performing statistical analyses for financial tables in either was extremely tedious and puts you in a specific box. R is a modern, programmable language, so I can make it do what I need it to do in a timely manner. It’s been a pleasure to be able to take what I’ve learned from R and share it with other community members, and to continue learning new tips and tricks from them as well.”
Teetor uses R for the majority of his finance work because, as he puts it, it does things other languages “simply cannot do.” He cited the example of hedge ratio calculations, which benefit from the flexibility of R, a topic on which he gave a lightning talk at R/Finance this past summer. He was also quick to credit fellow R user Jeff Ryan (whom we profiled here earlier this year) as an influential member of the R community, citing his finance packages as particularly useful. “I use nearly every finance package he’s written, they’re incredibly helpful and greatly streamline the process of R-based financial analysis.”
When asked about the relationship between financial analysis and the rise of the data science movement, Teetor noted, “People in data science are experiencing what financial analysts have experienced for years: out of the box data analysis is not realistic. You need to incorporate a heavy amount of custom statistics, something that’s not easy to do with a commercial product where you can’t get to the source code. Data scientists need a way to construct custom analyses and R gives them that opportunity. Nothing else on the horizon that can compete with that, in terms of finance or the wider field of data science.”
Looking ahead, Teetor sees a bright future for the continued evolution of R. Since there is no real alternative on the market, he argues, R’s potential for future growth is nearly unlimited. He did, however, cite R’s capacity (or lack thereof) for software engineering as one possible area of improvement. “When R was originally envisioned, it wasn’t thought of as a vehicle for software engineering. Nobody expected people to keep their scripts as opposed to just throwing them away. As it’s grown though, people are building larger, more complex systems with longer lifetimes.” It’s an area that Teetor cites as one of the main struggles with R today, but also one which he cites as a great opportunity on which to innovate.
Martyn Plummer is a longtime contributor to the R community and a member of the R core group, which consists of 20 members that help oversee the continued evolution of the project. Plummer also serves on the editorial board of the R Journal, the official journal of the R project. By day, he serves as a Statistician and Epidemiologist at the International Agency for Research on Cancer (IARC), based in Lyon, France.
Plummer, who has been using R since 1995, has developed or contributed to a number of popular packages, including coda, for analyzing Markov Chain Monte Carlo output; JAGS, a clone of the popular WinBUGS software for Bayesian analysis; and Epi, which provides functions for epidemiologists and accompanies an annual course that aims to introduce epidemiologists to R.
He has also incorporated R into his work at IARC, where he works in the Infection and Cancer Epidemiology group. Much of the work of this group is focused on human papillomavirus (HPV), which causes half a million cases of cervical cancer per year worldwide. Plummer and his colleagues use R (including his own Epi package) to analyze epidemiological studies of HPV infection and try to tease out some aspects of HPV natural history that are difficult to understand without statistical modeling, such as whether different HPV types interact with each other. He also relies heavily on R’s graphical capabilities for visualizing data in scientific publications.
Prior to R, Plummer worked primarily with S+ for analyzing data. He had been working in Cambridge, United Kingdom in the Biostatistics Unit at the Medical Research Council when he was offered a position at the IARC in Lyon. He recalls the transition, and how his new position introduced an entirely different computing environment. Soon after moving, he was introduced to the recently formed R project by his colleague David Clayton.
“From the beginning, I saw enormous potential in R,” says Plummer. “While I was accustomed to S+, it wasn’t long before I completely switched over to R. It was and continues to be unparalleled in its flexibility in terms of data analysis.”
Plummer also points to R’s extensible nature as one of its defining features. As a modern language, R is able to effectively adapt to the changing nature of data analysis in an era of increasingly large, unstructured data sets. “One of the most important features of R is that it’s built around the data; it’s designed for programming with data, so it can take these developments in stride,” he says.
He went on to describe a recent article in the R Journal that analyzed 18 months’ worth of text from the R mailing lists and identified relationships between prominent members of the R community based on the topics they discussed. Plummer cites it as an example of R’s ability to keep up with the ever-changing notion of “data.”
“10 years ago, I would have never called such an amalgamation of text a ‘data set,’” he says. “Today, though, we find ourselves in a situation where we can elicit structure from large and complex data sets and glean meaning from it.”
When asked about how he sees the R project evolving in coming years, Plummer speaks of a delicate yet effective balance. “R manages a difficult equilibrium; it’s partly a frontier for innovation in statistical computing, yet it’s also a stable platform for data analysis. It’s unique in this regard and I don’t see it facing serious competition for quite some time.”
He sees the current situation being maintained at least over the next few years, though one challenge for R users is to navigate the increasing number of contributed packages. While there’s incredible innovation being done for a diverse range of functions, Plummer says, there are also opportunities for the community as a whole to pool and share their work.
“One of the most important and oft-overlooked values of the R community is its interdisciplinary nature,” he says. “It’s remarkable to be able to collaborate with so many talented people from a diverse range of fields. We’re all statisticians, but statistics has a terrible tendency to fragment by subject matter. R gives us all a common platform and brings us together to encourage innovation.”

Name: Jeff Ryan
Profession: Owner/Principal at Lemnica; Committee Member at R/Finance
Nationality: American
Years Using R: 8
Known for: R/Finance Conference, quantmod and xts packages
Jeffrey Ryan is a Chicago-based quantitative software analyst and avid R user. He is perhaps best known in the R community as one of the primary organizers of the annual R/Finance conference. By day, he’s Principal at Lemnica, a firm that specializes in developing quantitative software systems for financial market trading. In addition to his efforts on the R/Finance organizing committee (of which Dirk Eddelbuettel, whom we profiled here earlier this year, is also a member), he has also developed a number of popular R packages for financial analysis, most notably quantmod and xts.
Ryan first started using R in 2002 as an undergraduate at the University of Illinois at Chicago, where he studied economics and finance. He described his frustration with cumbersome proprietary tools and recalls his search for a more extensible programming language with which to work. "At that point, I did most of my work in Perl and Excel/VBA," he says. "R was not particularly well known at that point, but I recognized the great potential it had for adding deeper statistical insights into my analyses."
After receiving his Bachelor’s degree from UIC, Ryan went to work for a Chicago-based hedge fund as a floor trader on the options exchange. While he wasn’t actively using R in his work at that point, he continued to experiment in his free time by building trading tools and models in R.
From there, Ryan began using R more frequently and became an increasingly active participant on the R-SIG-Finance mailing list. He recalls being particularly impressed by the work of Diethelm Wuertz, whom he credits as one of his major influences from the R community. Over time, Ryan took the lead on developing an all-in-one R package for quantitative trading models: quantmod.
The quantmod package is designed to assist quantitative traders in developing, testing and deploying trading models and has been adopted by big and small traders around the world. Says Ryan of quantmod, "I come from a quantitative trading background. When I was in school, there was no single package for building and testing a trading model. I saw the chance to do something like that with R, which is where quantmod was born."
"The package has evolved concurrently with R and is one of the more popular packages today – with the website serving as a gateway for new users to R and finance. I’m particularly honored to see the traction quantmod and other tools have gained amongst professional quants in the community, as well as how instrumental it has been in bringing new users into R."
In addition to quantmod, Ryan has developed numerous R packages and has been a contributor to many more. Highlights include xts, or eXtensible Time Series, which is used to manage large financial time series, and IBrokers, which is geared towards real-time trading. He is also working on a new collection of packages called “indexing”. According to Ryan, indexing is an abstraction of data.frames that allows fast queries on data that do not fit in memory.
When asked to comment on the significant role he’s played in developing R’s capabilities for quantitative finance, Ryan is quick to credit other members of the R community. In addition to Diethelm Wuertz and Dirk Eddelbuettel, Ryan specifically mentioned Josh Ulrich (TTR, LSPM) and Brian Peterson and Peter Carl (PerformanceAnalytics) for the work they’ve done in developing finance-focused packages in R, as well as their key contributions to the success of the R/Finance conferences.
The 2011 R/Finance conference was held this past April 29-30, drawing more than 250 attendees from across the globe. The 2012 conference has already been confirmed for May 17-19 of next year – once again to be held in Ryan’s hometown of Chicago.

Name: Martin Morgan
Profession: Senior Staff Scientist at Fred Hutchinson Cancer Research Center
Nationality: Canadian
Years Using R: 7
Known for: Director of the Bioconductor project
Martin Morgan is a Senior Staff Scientist at the Fred Hutchinson Cancer Research Center (FHCRC) in Seattle. He is perhaps best known for running the Bioconductor project, which has emerged as the tool of choice for scientists conducting analyses of high-throughput genomic data.
His use of R stems from his early days working at FHCRC, where one of his colleagues was R project cofounder Robert Gentleman, who also founded the Bioconductor project. While Morgan had worked with a fair share of statistical analyses as he pursued his Ph.D. in Evolutionary Genetics at the University of Chicago, he was especially impressed by R’s flexible, powerful nature.
Morgan describes his first real interaction with the wider R community, which came via the R and Bioconductor mailing lists. As he puts it, “I was impressed with both the level of sophistication and engagement of the users on the mailing list. I remember realizing that the people responding to questions were the very authors of the packages in question.”
As he became more familiar with R, Morgan took on an increasingly active role in Bioconductor, eventually taking the lead on the project. Morgan and his team of researchers and analysts at FHCRC have worked to develop a number of R packages for genomic research, including the popular Rsamtools, IRanges, GenomicRanges and GenomicFeatures packages for importing and using next-generation sequence data. Morgan was quick to credit Michael Lawrence, Herve Pages, Patrick Aboyoun, and Marc Carlson in particular for their efforts in developing these packages.
Morgan cites significant insights from the first R project he embarked on. “I’d initially written a C program that was literally thousands of lines of code. After a while, I decided to try programming in R; since the facilities for data input and optimization were already there, I was able to do it in just six lines of R code.”
When asked about particular areas in which he would like to see R evolve, he responded, “Because R is a programming language, there’s a risk that people lose sight of its biggest strengths, namely its nearly limitless capacity for statistical analysis.” He paused, before continuing, “Some conservative elements of the R project help reproducible research, but encouraging more interoperability between packages would improve R’s core functionality.”
Dirk Eddelbuettel is an active member of the R community who has been contributing packages to CRAN for nearly a decade. An early adopter of Linux, he was first introduced to R in 1996. "The world was smaller then; much of the open source community were reading the same mailing lists and news groups at the time," he said. "I had known about S for several years, and was starting to use S+ in my first job out of graduate school. So when I learned about R, there was a natural curiosity about this implementation and what it could do as a next step for data analysis."

Name: Saptarshi Guha
Background: Ph.D. in Statistics, Purdue University
Nationality: India
Years Using R: 6
Known for: Developing RHIPE package for R + Hadoop integration
At just 31 years old, Saptarshi Guha has emerged as a cutting-edge contributor to the R community. Saptarshi earned a Ph.D. in Statistics from Purdue University in June 2010 (his advisor was William S. Cleveland, one of the pioneers of modern data visualization) and his current research focus is the analysis of large data sets. He is working to develop innovative approaches to the visualization and computing of statistical analyses. He has also worked on modeling network traffic for security and developing algorithms to detect human presence in SSH connections.
Saptarshi is best known for developing the popular RHIPE package that integrates the R statistical environment with the Hadoop framework. RHIPE allows R users to compute on terabyte-sized data sets across a cluster using the MapReduce framework, thus offering the best of both worlds to users seeking to leverage the strength of R and Hadoop. People with very large data sets stored in the Hadoop Distributed File System can now easily process the data on hundreds or even thousands of nodes in parallel, using only the R language (no need to learn Java). They can even apply the statistical algorithms in R to boot.
While Saptarshi has studied the intersection of computer science and statistics for well over a decade — in the U.S. as well as his native India — it was not until he began his doctoral research at Purdue that he learned R. Given his statistical background, he quickly took to the language. "R is one of the most expressive languages I’ve ever encountered, and it’s perfectly geared towards comprehensive data analysis," he says.
In addition to RHIPE, Saptarshi has worked on packages that serialize R objects for operability with other languages, including Python, Java and C. He also wrote a package that saves R objects in a flexible data format so that individual objects can be lazy loaded. He has used R to perform statistical analysis on a wide array of topics, from network security modeling to generating reports and graphs for monthly expenditures and weight loss programs.
“The real beauty with R is that it’s constantly evolving,” Saptarshi says. “Is it perfect? No. But it’s being constantly refined by some of the most brilliant statistical minds today.” To that end, Saptarshi is working on several packages that will bring features of distributed computing to users working within the R environment. He’s also working on a package that integrates R with HBase and gives users a fast query distributed data store that applies MapReduce computations across the data using RHIPE.
On October 12, Saptarshi will be delivering a talk at Hadoop World, where he’ll demonstrate the use of RHIPE to analyze 190 Gb of VoIP network data. (This project was joint work with Jin Xia and William S. Cleveland.) When you make a call over Skype, Google Voice, or even a landline routed over the internet, the call quality is of primary importance. One of the major factors influencing call quality is timing: when you speak, your voice is sampled every 20 milliseconds, but on the receiver’s end it might not arrive quite so regularly – perhaps 5ms too early or too late.
This “jitter” in packet arrival times degrades the audio quality. To investigate this, Saptarshi used R code to identify which packets corresponded to a single call, and R’s robust regression algorithm to remove the effect of the gateway. In this way, he was able to process the data, stored across a cluster of eight computers, in a matter of minutes and assess the overall call quality metrics of the system.
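As a rough illustration of what "jitter" means here, the sketch below groups packets by call and measures how far each gap between consecutive arrivals strays from the nominal 20 ms. This is not Saptarshi's code: it is plain Python rather than R on RHIPE, it skips the robust regression step entirely, and the packet records, call names and `jitter_by_call` function are hypothetical.

```python
from collections import defaultdict

SAMPLE_INTERVAL_MS = 20  # voice is sampled every 20 ms, as described above

def jitter_by_call(packets, interval_ms=SAMPLE_INTERVAL_MS):
    """Group (call_id, arrival_ms) records by call, then return each call's
    deviations of the inter-arrival gap from the nominal interval."""
    arrivals_by_call = defaultdict(list)
    for call_id, arrival_ms in packets:          # the "which packets belong to one call" step
        arrivals_by_call[call_id].append(arrival_ms)

    jitter = {}
    for call_id, arrivals in arrivals_by_call.items():
        arrivals.sort()
        jitter[call_id] = [(later - earlier) - interval_ms
                           for earlier, later in zip(arrivals, arrivals[1:])]
    return jitter

# Hypothetical interleaved packet log: (call id, arrival time in ms)
packets = [
    ("call-a", 0), ("call-b", 3), ("call-a", 20), ("call-b", 28),
    ("call-a", 45), ("call-b", 43), ("call-a", 60),
]

for call_id, deviations in sorted(jitter_by_call(packets).items()):
    print(call_id, deviations, "worst:", max(abs(d) for d in deviations), "ms")
```

A positive value means the packet arrived later than the 20 ms schedule implies, and a negative value means earlier. The group-by-call step followed by a per-call computation has the same shape as a MapReduce job, which is why this kind of analysis spreads naturally across a cluster.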
If you’re interested in attending and are in the Greater New York area, you can register here. Cloudera also published an interview with Saptarshi last month.