This is Part 3 of a five-part article series, with new parts published each Thursday. You can download the complete article from the Revolution Analytics website.
Power from Elegance
If the R movement has a genuine rock star, it’s probably Hadley Wickham. He’s an assistant professor and the Dobelman Family Junior Chair in Statistics at Rice University. He’s written and contributed to more than 20 R packages, and he’s won the John Chambers Award for Statistical Computing.
Most of Wickham’s research focuses on making data analysis better, faster and easier. He is especially interested in using visualization techniques to improve how data and models are understood. In other words, he’s all about making it easy to use R.
“R was designed from the ground up to deal with common data problems,” says Wickham. “Compared to other programming languages, it’s designed to help you do the kinds of things that you do most often when you’re performing data analysis. For example, R has data frames built into the core language. It’s such a natural structure, and it makes working with data much easier. But very few other languages have data frames built in.”
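For instance, a short session shows the idea (an illustrative example, not from the article):

> d <- data.frame(name = c("Ann", "Bob", "Cal"), weight = c(62.5, 80.1, 71.3))
> mean(d$weight)       # summarize a numeric column directly
[1] 71.3
> d[d$weight > 70, ]   # filter rows with ordinary indexing
  name weight
2  Bob   80.1
3  Cal   71.3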
Because R was created by statisticians for statisticians, it’s already loaded with many of the crucial features required to accomplish the everyday tasks of statistical analysis. The very design of the R language is often described as “elegant” – in other words, R is in tune with the way statisticians think and work.
For example, says Wickham, “In statistics, it’s really critical to keep track of missing values. That’s when you don’t know what a value is, but you need some way of indicating it. R keeps track of that for you, so that if you add a number to a missing number, you still don’t know what that number is and R will keep track of it. That’s important.”
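That behavior is easy to see in a session (our illustration):

> x <- c(1, 2, NA, 4)    # one observation is missing
> x + 10                 # NA propagates through arithmetic
[1] 11 12 NA 14
> mean(x)                # summaries stay honest about the gap...
[1] NA
> mean(x, na.rm = TRUE)  # ...until you explicitly choose to drop it
[1] 2.333333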
No Need to Reinvent the Wheel
Precisely because R is a programming language – as opposed to being a pre-fabricated piece of software – new analytic techniques that are written in R can be saved and re-used. So when R users discover something fresh and exciting, they have two options that are not generally available to users of pre-fab software:
- They can share the new techniques with other R users, inside their organizations and all over the world.
- They can reproduce and re-use the new techniques they have discovered.
These are not trivial or minor advantages – they represent enormous potential value. The ability to save and re-use improvised functions means that you’re not forced to reinvent the wheel each time that you run an analytic operation. Try doing that in SAS or SPSS and you’re in for a long haul.
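A minimal sketch of that workflow (the function and file name here are hypothetical):

> # saved in a file such as my_tools.R; reload in any session with source("my_tools.R")
> coef_of_variation <- function(x, na.rm = TRUE) {
+   sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
+ }
> coef_of_variation(c(10, 12, 9, 11))
[1] 0.1229518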
The ability to share new R code through forums hosted by CRAN (Comprehensive R Archive Network) and other groups ensures a state of continuous evolution. Bluntly put, the world of R never sits still.
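Pulling someone else's contribution into your own session takes one line (the package named here is just an example):

> install.packages("randomForest")  # fetch a contributed package from CRAN
> library(randomForest)             # its functions are now available locally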
“New methods show up in R before they show up in other packages,” says Michael Elashoff of CardioDX, a molecular diagnostics company that collects data from multiple sources and builds predictive models in R that help physicians detect cardiovascular diseases in their patients.
“We do a lot of predictive model development on complex data sets, so the ability to use and evaluate new statistical methods is important to us. Especially in the last couple of years, many of these newer methods have been showing up as R packages first. R is definitely on the cutting edge,” says Elashoff.
Zubin Dowlaty, VP / Head of Innovation & Development at Mu Sigma, has a similar take on the value of R. Headquartered in Chicago, Mu Sigma is a global analytics services company providing business decision support services to clients in data-intensive industries such as pharma, insurance, financial services, CPG/retail, healthcare and technology. All of that means that Mu Sigma is in the business of analyzing data – big time.
“The large ecosystem of statisticians all over the world adding new functions and packages to the R system is a huge benefit,” says Dowlaty. “State-of-the-art algorithms are available quickly through the R platform.”
The R platform has become so comprehensive that it now represents a “one-stop shop” for analytical techniques, says Dowlaty. “Most of the techniques you need to drive analytics into the business are available through R – everything from statistical to machine learning and optimization techniques. Unlike other vendors, like SAS or SPSS, R provides everything in one go-round.”
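For a flavor of that breadth, here is a sketch using only the base distribution (the particular models are illustrative):

> fit <- lm(dist ~ speed, data = cars)    # classical statistics: linear regression
> ssq <- function(b) sum((cars$dist - b[1] - b[2] * cars$speed)^2)
> b <- optim(c(0, 0), ssq)                # general-purpose numerical optimization
> cl <- kmeans(cars, centers = 2)         # clustering, machine learning style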
Er, I don't think R introduced missing values; these are just IEEE 754 QNaNs, widely supported in most languages. The fact that QNaN + other value = QNaN is defined by IEEE 754 and invariably implemented in hardware, and so has nothing to do with R.
What's more, the R documentation shows a serious lack of understanding of what an IEEE 754 QNaN is when it discusses equality testing.
R is still awesome though!!
Posted by: David Heffernan | October 21, 2010 at 09:27
Hmm, I'm not sure that's quite correct: NA as a *statistical* missing value is still quite different from QNaN, I believe. For example, the fact that TRUE || NA is TRUE and that FALSE && NA is FALSE (in both cases, not NA) is different from the QNaN spec, right?
I think the fact that R's NA is different from QNaN is by design.
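For example, R's vectorized logical operators, which don't short-circuit, show the same three-valued behavior:

> NA | TRUE
[1] TRUE
> NA & FALSE
[1] FALSE
> NA | FALSE
[1] NA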
Posted by: David Smith | October 21, 2010 at 09:32
TRUE || NA doesn't mean anything in a floating point context. IEEE 754 talks about floating point rather than logical values.
NA is implemented as a QNaN - there are many to choose from. The documentation of is.nan talks about not comparing for equality, but IEEE 754 is subtle: NaNs compare not equal to all FP values. So NA<>NA, at least in terms of the IEEE 754 spec.
Posted by: David Heffernan | October 21, 2010 at 12:28
If you think about it, there is no other tenable way of implementing NA. If you were to hold FP values in objects that stored flags for properties like missing, infinity, NaN, etc., then the performance when mapped to hardware would be dire. The use of IEEE 754 is both sensible and forced.
Posted by: David Heffernan | October 21, 2010 at 12:30
Also, TRUE || NA being TRUE and FALSE && NA being FALSE - isn't that just short circuit evaluation? Once you get to TRUE || you can stop evaluating, and likewise with FALSE &&.
Posted by: David Heffernan | October 21, 2010 at 15:19
> 0/0
[1] NaN
> x = 0/0
> TRUE || x
[1] TRUE
> FALSE && x
[1] FALSE
>
Posted by: David Heffernan | October 21, 2010 at 15:22
Sometimes it is important to "re-invent" the wheel if the original wheel was designed poorly...
Posted by: Jason | October 21, 2010 at 21:07
@Jason
Is your point that IEEE 754 is designed poorly?
Posted by: David Heffernan | October 22, 2010 at 01:52
@David
No, David. I am not talking about missing values here. My point is that not all "redundancy" is bad, and parallel efforts are needed for important issues.
Posted by: Jason | October 22, 2010 at 07:36