The R core team announced today that R 2.13.2 is now available:
The byte pixies have rolled up R-2.13.2.tar.gz at 9:00 this morning. This is intended to be the final release of the 2.13 series, for the benefit of those apprehensive of putting 2.14.x into production use.
This update fixes a number of minor bugs (for example, pch="." will guarantee at least a 1-pixel dot, to support very high-resolution charting), improves performance (date objects print much faster now, for example), and adds a few minor features (such as being able to accurately limit memory usage for systems with very large amounts of RAM). The full change log appears after the jump.
You can download the R 2.13.2 source code now for home builds; binaries for Windows, Mac and Linux will appear on your local CRAN mirror in the next few days. The next release of R, R 2.14.0, is scheduled for October 31.
R-announce mailing list: R 2.13.2 is released
The campaign to re-elect US president Barack Obama is hiring -- and the RDataMining blog noticed that several of the open positions seek R skills. If you want to be a Communications Analyst, Digital Strategy Analyst, or Statistical Modeling Analyst and you know R, there may be a job opening for you. Just goes to show there's no corner of life untouched by analytics.
If you've used SAS or SPSS and want a jump-start into the basics of the popular R language, next week's webinar, Introduction to R for SAS and SPSS Users, will be of interest to you. While R, SAS and SPSS are all software systems for data analysis and graphics, the underlying concepts in R are quite different to those in SAS and SPSS.
To get SAS and SPSS users up to speed with the basics of R, Bob Muenchen (University of Tennessee) will give a 60-minute webinar on Wednesday, October 5. It will include:
Bob is the perfect presenter for this topic: he is the author of R for SAS and SPSS Users (and, with Joseph M. Hilbe, R for Stata Users). He is also the creator of r4stats.com, a popular web site devoted to helping people learn R. Bob also serves on advisory panels for IBM’s SPSS Corporation and SAS Institute.
This live webinar takes place 11:00AM - 12:00PM Pacific Time on Wednesday, October 5. (Click here for the webinar time in your local time zone). Register at the link below for details on how to join the live session.
Revolution Analytics Webinars: Introduction to R for SAS and SPSS Users
The R Graph Gallery, the website from Romain François that showcases hundreds of examples of data visualization with R, has new social features. Now, when you come across a graph or chart you find appealing or useful, you can "Like" it on Facebook or "+1" it on Google+. This should be a great way of highlighting the best charts and graphs in this massive collection (and don't forget they all include working R source code as well, to adapt for your own needs).
Here are some examples from the current front page:
Explore the graphs in the R Graph Gallery at the link below, and +1 or Like your favourites!
R Graph Gallery: Enhance your data visualization with R
Ever since the term "Data Scientist" was coined by DJ Patil and Jeff Hammerbacher in 2009, there's been a vigorous debate on what the term actually means. More than 80% of statisticians consider themselves data scientists, but Data Science is more than just Statistics. (My own take is that Data Science is a valuable rebranding of computer science and applied statistics skills.)
To help bring clarity to the issue, Data Scientist and R user Harlan Harris has published a great presentation he gave at the Data Science DC meetup group, "What is Data Science anyway?". The presentation recaps the key data science discussions over the last few years, from Hal Varian ("the sexy job in the next 10 years will be Statistics"), Mike Driscoll ("sexy skills of data geeks"), Nathan Yau ("data scientists: people who can do it all"), Mike Loukides ("Data science enables the creation of data products"), Hilary Mason ("Data science is clearly a blend of the hackers"), Drew Conway ("The Data Science Venn Diagram") and many others.
In fact, the entire presentation serves as a literature review for the birth of "Data Science" as a concept, and would make excellent fodder for the "Data Science" page on Wikipedia which, sadly, is still a blank page.
One thing that seems certain: Data Science is here to stay. Companies are clamoring to hire people with data science skills, and the excitement at data science events like the recent Strata conference in New York is palpable. (This review of the conference even says that "Data Scientists are the new rock stars of the technology world.") As with all new concepts, the definition of "data science" may seem a bit cloudy now, but I'd wager that in 5 years or less "Data Scientist" will be as natural a job title as "Product Manager" or "Engineer". Harlan notes a most apposite analogy made by John Cook, who looked at the "computer programmer" role:
"Take an expert programmer back in time 100 years. What are his skills? Maybe he’s pretty good at math. He has good general problem solving skills, especially logic. He has dabbled a little in linguistics, physics, psychology, business, and art. He has an interesting assortment of knowledge, but he’s not a master of any recognized trade."
Apply the same quote to "data scientists", and I'd wager that in just 5 years or so we'll look at all the data scientists working around us and wonder how companies ever survived without them.
Harlan adds additional thoughts to his presentation at the link below.
Harlan Harris: Data Science, Moore’s Law, and Moneyball
Looks like there's been a lot of activity in the R user community in the Northern hemisphere now that the summer break is over. I've just added several new groups to the Local R User Group Directory:
Tokyo, Japan: The Tokyo.R R study group has already had 17 meetings, but has just been added to the directory.
Revolutions: Local R User Group Directory
Revolution Analytics today announced that it has partnered with Cloudera, the leader in Apache Hadoop-based software and services, to make big-data analytics with Hadoop and R available to Revolution R Enterprise users.
As we announced earlier this month, we have created three open-source R packages which make it possible for R users to write map-reduce programs in the R language, and bring computations in R to data stored in Hadoop. These three packages, known collectively as "RevoConnectR for Apache Hadoop" are now Cloudera Certified Technology, and have undergone a thorough verification process to ensure that they comply with Cloudera's list of development and performance guidelines. They will be fully supported in Revolution R Enterprise 5.0 Server for Linux (scheduled for later this year) for use with Cloudera Distribution including Apache Hadoop (CDH).
For R users, this combination brings the power and flexibility of the R language to the big-data capabilities of Apache Hadoop:
“The combination of R and Apache Hadoop creates a more powerful analytics solution than what’s previously been on the market for big data. R users can now leverage advanced analytics and data stored in Apache Hadoop to tap into more multifaceted dimensions of their data,” said Norman Nie, CEO of Revolution Analytics. “R and Hadoop are inherently complementary, but the technology combination has unique technical challenges to overcome to work seamlessly and effectively. Revolution Analytics is pleased to partner with Cloudera to deliver an intuitive solution that bridges these gaps and minimizes complexity to help users realize the potential of R and Apache Hadoop”.
And for users of Cloudera's CDH, it brings the predictive analytics capabilities of R to bring value to the masses of data stored in the Hadoop environment:
“Running advanced analytics on unstructured data has historically been challenging and cost prohibitive. It is not a coincidence that some of the fastest growing businesses in the last decade are the same organizations that invested in infrastructure to model and derive insight from the new classes of data being generated,” said Ed Albanese, Head of Business Development at Cloudera. “This solution, which combines the Cloudera Distribution including Apache Hadoop with Revolution Analytics Enterprise R, is about democratizing access to technologies uniquely capable of performing analysis on these new classes of data. Cloudera is pleased to partner with Revolution Analytics to deliver an accessible, integrated analytics solution for unstructured data”.
Revolution Analytics Press Releases: Revolution Analytics Partners With Cloudera To Deliver Comprehensive New Big Analytics Solution
As regular readers of this blog will know, I love optical illusions. Not only are they fun to watch, they can provide insights into the subtle operations of the brain and how it influences our perceptions of reality. The same applies to auditory input: as the video on the McGurk Effect shows, you can't always trust what you hear.
Some of the commenters at the Bad Astronomy blog said the McGurk Effect didn't work for them, but for me it worked every time. I could even close and open my eyes during the "fa" speech and "hear" the change over and over.
I agree completely with Nathan's comments on the last point above:
There's a certain amount of subjectivity that goes into any visualization as you choose what data to show and how to show it. By focusing on one part of the data, you might inadvertently obscure another. However, if you're careful, get to know the data that you're dealing with, and stay true to what's there, then it should be easier to overcome bias.
After all, statistics is somewhat subjective, too. You choose what you analyze, what methods to use, and pick what to point out in reports.
News organizations, for example, have to do this all the time. They get a dataset, decide what story they want to tell (or find what story the data has to tell). Browse through graphics by The New York Times, and you can see how you can add a layer of information that objectively describes what the data is about.
This stands in contrast to the presentation I saw today at the Strata conference from Alex Lundry, Chart Wars: The Political Power of Data Visualization. (You can see a shorter version of his talk online). It was an entertaining talk, but his main point was to encourage data visualization practitioners to actively insert a point of view into the presentation of data. For example, he encourages more charts like the one on the right below, rather than the one on the left.
(Images from the paper, Useful Junk? The Effects of Visual Embellishment on Comprehension and Memorability of Charts by Scott Bateman et al.)
Lundry's take is that because the image on the right is more easily recalled by those who have seen it, it's naturally better. I disagree. My objection to the chart on the right isn't just that it uses chartjunk, nor that the teeth are disproportionately sized to the values, nor even that the X "axis" is slanted upwards to exaggerate the rise. My objection to the chart on the right is that it actively pushes an analysis upon the viewer. As Nathan notes, there's always an element of bias in what data is selected to be presented, and the way it's presented. But good charts merely present data, and leave the analysis (obvious though it may be) to the viewer. When a chart takes on the burden of analysis for the viewer, that's when it strays from data visualization into propaganda.
Update Sep 26: Corrected attribution of paper with images above.
FlowingData: 5 misconceptions about visualization