The R core team announced today that R 2.13.2 is now available:
The byte pixies have rolled up R-2.13.2.tar.gz at 9:00 this morning. This is intended to be the final release of the 2.13 series, for the benefit of those apprehensive of putting 2.14.x into production use.
This update fixes a number of minor bugs (for example, pch="." will guarantee at least a 1-pixel dot, to support very high-resolution charting), improves performance (printing of date objects is much faster now, for example), and adds a few minor features (such as being able to accurately limit memory usage for systems with very large amounts of RAM). The full change log appears after the jump.
The campaign to re-elect US president Barack Obama is hiring -- and the RDataMining blog noticed that several of the open positions seek R skills. If you want to be a Communications Analyst, Digital Strategy Analyst, or Statistical Modeling Analyst and you know R, there may be a job opening for you. Just goes to show there's no corner of life untouched by analytics.
If you've used SAS or SPSS and want a jump-start into the basics of the popular R language, next week's webinar, Introduction to R for SAS and SPSS Users, will be of interest to you. While R, SAS, and SPSS are all software systems for data analysis and graphics, the underlying concepts in R are quite different from those in SAS and SPSS.
To get SAS and SPSS users up to speed with the basics of R, Bob Muenchen (University of Tennessee) will give a 60-minute webinar on Wednesday, October 5. It will include:
What R is and how it compares to SAS and SPSS
An overview of how to install and maintain it
How to find R add-on modules comparable to those for SAS and SPSS
Which of R’s many user interfaces are most like those of SAS and SPSS
How to run R from within SAS and SPSS
What a simple R program looks like
Live Q&A with Bob Muenchen
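To give a flavor of that last agenda item, a simple R program often amounts to just a few lines: read some data, summarize it, and plot it. The sketch below is illustrative only (the file name and column name are hypothetical, not taken from the webinar):

```r
# Read a CSV file into a data frame (hypothetical file name)
mydata <- read.csv("mydata.csv")

# Print summary statistics for every column
summary(mydata)

# Draw a histogram of one (hypothetical) numeric column
hist(mydata$score, main = "Distribution of score")
```

For SAS and SPSS users, the key difference is that each step produces an object you can inspect and reuse, rather than output sent to a listing.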
Bob is the perfect presenter for this topic: he is the author of R for SAS and SPSS Users (and, with Joseph M. Hilbe, R for Stata Users). He is also the creator of r4stats.com, a popular web site devoted to helping people learn R. Bob also serves on advisory panels for IBM’s SPSS Corporation and SAS Institute.
The R Graph Gallery, the website from Romain François that showcases hundreds of examples of data visualization with R, has new social features. Now, when you find a graph or chart appealing or useful, you can "Like" it on Facebook or "+1" it on Google+. This should be a great way of highlighting the best charts and graphs in this massive collection (and don't forget they all include working R source code as well, to adapt for your own needs).
Here are some examples from the current front page:
Explore the graphs in the R Graph Gallery at the link below, and +1 or Like your favourites!
In fact, the entire presentation serves as a literature review for the birth of "Data Science" as a concept, and would make excellent fodder for the "Data Science" page on Wikipedia which, sadly, is still a blank page.
One thing that seems certain: Data Science is here to stay. Companies are clamoring to hire people with data science skills, and the excitement at data science events like the recent Strata conference in New York is palpable. (This review of the conference even says that "Data Scientists are the new rock stars of the technology world.") As with all new concepts, the definition of "data science" may seem a bit cloudy now, but I'd wager that in 5 years or less "Data Scientist" will be as natural a job title as "Product Manager" or "Engineer". Harlan notes a most apposite analogy made by John Cook, who looked at the "computer programmer" role:
"Take an expert programmer back in time 100 years. What are his skills? Maybe he’s pretty good at math. He has good general problem solving skills, especially logic. He has dabbled a little in linguistics, physics, psychology, business, and art. He has an interesting assortment of knowledge, but he’s not a master of any recognized trade."
Apply the same quote to "data scientists", and I'd wager that in just 5 years or so we'll look at all the data scientists working around us and wonder how companies ever survived without them.
Harlan adds additional thoughts to his presentation at the link below.
As we announced earlier this month, we have created three open-source R packages which make it possible for R users to write map-reduce programs in the R language, and bring computations in R to data stored in Hadoop. These three packages, known collectively as "RevoConnectR for Apache Hadoop" are now Cloudera Certified Technology, and have undergone a thorough verification process to ensure that they comply with Cloudera's list of development and performance guidelines. They will be fully supported in Revolution R Enterprise 5.0 Server for Linux (scheduled for later this year) for use with Cloudera Distribution including Apache Hadoop (CDH).
For R users, this combination brings the power and flexibility of the R language to the big-data capabilities of Apache Hadoop:
“The combination of R and Apache Hadoop creates a more powerful analytics solution than what’s previously been on the market for big data. R users can now leverage advanced analytics and data stored in Apache Hadoop to tap into more multifaceted dimensions of their data,” said Norman Nie, CEO of Revolution Analytics. “R and Hadoop are inherently complementary, but the technology combination has unique technical challenges to overcome to work seamlessly and effectively. Revolution Analytics is pleased to partner with Cloudera to deliver an intuitive solution that bridges these gaps and minimizes complexity to help users realize the potential of R and Apache Hadoop”.
And for users of Cloudera's CDH, it brings the predictive analytics capabilities of R to the masses of data stored in the Hadoop environment:
“Running advanced analytics on unstructured data has historically been challenging and cost prohibitive. It is not a coincidence that some of the fastest growing businesses in the last decade are the same organizations that invested in infrastructure to model and derive insight from the new classes of data being generated,” said Ed Albanese, Head of Business Development at Cloudera. “This solution, which combines the Cloudera Distribution including Apache Hadoop with Revolution Analytics Enterprise R, is about democratizing access to technologies uniquely capable of performing analysis on these new classes of data. Cloudera is pleased to partner with Revolution Analytics to deliver an accessible, integrated analytics solution for unstructured data”.
As regular readers of this blog will know, I love optical illusions. Not only are they fun to watch, they can provide insights into the subtle operations of the brain and how it influences our perceptions of reality. The same applies to auditory input: as the video on the McGurk Effect shows, you can't always trust what you hear.
Some of the commenters at the Bad Astronomy blog said the McGurk Effect didn't work for them, but for me it worked every time. I could even close and open my eyes during the "fa" speech and "hear" the change over and over.
Software does everything (Nathan notes, "Personally, I use a lot of R and have a lot of fun in Illustrator", but he uses a lot of other tools as well)
Visualization is for making data flashy
The more information in a single graphic, the better
It has to be exact
Visualization is too biased to be useful
I agree completely with Nathan's comments on the last point above:
There's a certain amount of subjectivity that goes into any visualization as you choose what data to show and how to show it. By focusing on one part of the data, you might inadvertently obscure another. However, if you're careful, get to know the data that you're dealing with, and stay true to what's there, then it should be easier to overcome bias.
After all, statistics is somewhat subjective, too. You choose what you analyze, what methods to use, and pick what to point out in reports.
News organizations, for example, have to do this all the time. They get a dataset, decide what story they want to tell (or find what story the data has to tell). Browse through graphics by The New York Times, and you can see how you can add a layer of information that objectively describes what the data is about.
This stands in contrast to the presentation I saw today at the Strata conference from Alex Lundry, Chart Wars: The Political Power of Data Visualization. (You can see a shorter version of his talk online). It was an entertaining talk, but his main point was to encourage data visualization practitioners to actively insert a point of view into the presentation of data. For example, he encourages more charts like the one on the right below, rather than the one on the left.
Lundry's take is that because the image on the right is more easily recalled by those who have seen it, it's naturally better. I disagree. My objection to the chart on the right isn't just that it uses chartjunk, nor that the teeth are disproportionately sized to the values, nor even that the X "axis" is slanted upwards to exaggerate the rise. My objection to the chart on the right is that it actively pushes an analysis upon the viewer. As Nathan notes, there's always an element of bias in what data is selected to be presented, and the way it's presented. But good charts merely present data, and leave the analysis (obvious though it may be) to the viewer. When a chart takes on the burden of analysis for the viewer, that's when it strays from data visualization into propaganda.
Update Sep 26: Corrected attribution of paper with images above.