It's sunny up here in Seattle (I'm attending BioConductor 2010 -- Revolution's a sponsor) now that the morning fog has lifted. So in honor of sunny Friday afternoons, I bring you the Double Rainbow song:
(If you've been in a wi-fi-less cave for the past couple of weeks, check out the original.)
At his Zero Intelligence Agents blog, Drew Conway has taken on the task of performing a quantitative analysis (using R, of course) of the controversial Afghanistan document dump from Wikileaks. He's started with an analysis of the overall flow of information in the five Afghanistan regions, categorized by type of activity (enemy, neutral, etc.).
(Click to enlarge.) It's a 10,000-foot view of the data to be sure, but even so it does show some interesting trends: the relative quiet of some regions, surges and ebbs in the war, and the interchanges of activity between the various agents. Drew offers more analysis:
Given the nature of the reports, we would expect a noticeable degree of seasonality (peaks and valleys) given the natural ebb and flow of war. Any drastic deviations from this expectation could indicate a strong degree of selection on the part of Wikileaks. As you can see, however, the data generally do fit this expectation. Note the dramatic upward trending seasonality present in the heavy reporting areas of RC EAST and RC SOUTH. Perhaps more interestingly, though, is the sudden increase in the number of NEUTRAL reports present in the data for RC EAST and RC CAPITAL for the period roughly between mid-2006 and mid-2008.
Be sure to follow ZIA as Drew dives deeper into the analysis of this fascinating data set.
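For readers who want to try this kind of tabulation themselves, here is a minimal sketch in base R of counting reports per month by region and type. It is not Drew's code: the data frame, its column names (date, region, type) and every row in it are invented for illustration, so treat it as a starting point under those assumptions rather than a reconstruction of his analysis.

```r
# Toy stand-in for a report log. These rows are invented for illustration
# and are NOT taken from the Wikileaks data; the column names are assumptions.
reports <- data.frame(
  date   = as.Date(c("2006-01-15", "2006-01-20", "2006-02-03",
                     "2006-02-11", "2006-02-25", "2006-03-09")),
  region = c("RC EAST", "RC EAST", "RC SOUTH",
             "RC EAST", "RC SOUTH", "RC EAST"),
  type   = c("ENEMY", "NEUTRAL", "ENEMY",
             "ENEMY", "ENEMY", "NEUTRAL"),
  stringsAsFactors = FALSE
)

# Bucket each report into its calendar month.
reports$month <- format(reports$date, "%Y-%m")

# Count reports for each month / region / type combination.
counts <- aggregate(n ~ month + region + type,
                    data = transform(reports, n = 1L),
                    FUN = sum)

# One panel's worth of data: monthly ENEMY counts in RC EAST.
east_enemy <- subset(counts, region == "RC EAST" & type == "ENEMY")
print(east_enemy)
```

Plotting the n column against month, one panel per region and one line per type, would give a chart in the spirit of the one described above. Seasonality would then show up as regular peaks and valleys in each line.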
G Jay Kerns has published a 400+ page introductory text on Probability and Statistics. All of the examples and illustrations are done using R (as Jay puts it, "The people at the party are Probability and Statistics; the handshake is R") so if you want to brush up on your probability and learn R at the same time, this might be a good resource. It would also be great for teaching: Jay wrote the book based on an undergraduate course he gave at Youngstown State University. There's also a plug-in for R Commander to access some of the methods via dialogs.
Jay's book is free, in both senses of the word. You can download the PDF for free from Lulu, or purchase a printed copy for just over $30. Jay has also published all of the LaTeX sources if you want to build the book yourself. And if you're already using R, you can read the book with just three commands:
I haven't read the entire book, but glancing through it, it looks like a comprehensive overview of the basics of Statistics: distributions, hypothesis testing, estimation, and linear regression. It even touches on resampling and nonparametric methods.
I did want to point out one minor error on page xiii, though. It's "We're not in Kansas any more, Toto", not "We're not in Kansas any more, Dorothy". They'd take away my Rainbow Card if I didn't mention that.
So why the sudden attention for R? Last week, Steve Miller at Information Management posted an insightful analysis of why the time is right for R, and of Revolution's role in its commercial success. (Steve also had some kind words to say about this very blog -- thanks, Steve! By the way, if you'd like to receive the monthly Revolution newsletter that Steve mentions in his article, you can sign up here.) He charts R's ascendance in 10 steps:
The "Data Deluge", the rapid rise in the size of data sets, is a critical issue for businesses, because:
Advanced analytics are more accurate than "traditional" (backward-looking) data analysis methods or "expert opinion", and as a result:
Statistical analysis software is a hot topic for the Business Intelligence community.
The legacy statistical platforms, SAS and SPSS, are based on dated technology, but:
S is a modern statistical language platform (and, I'd add, a winner of the ACM Software Systems award, an honor it shares with Java, Apache and the Web itself), and:
R, the open-source descendant of S, has taken the academic world by storm and is now the lingua franca of statistical research in academia. Academics trained in R are now moving into the commercial world.
Revolution Analytics is headed by statistics pioneer and SPSS founder Norman Nie, bringing a wealth of experience to the mission.
The challenge is to build upon the open-source R platform to entice current users of SPSS and SAS. An immediate hurdle is to enhance R's support for ever-expanding data set sizes.
Revolution must work with the open-source R community, who may be leery of a commercial venture related to R.
It's a great summary, and I agree with all of it. Let me expand on the last two points:
Regarding the challenges of building upon R, Revolution has laid out its plan not just to bring scalable, high-performance analysis of large data sets to R, but also to provide a modern Web Services integration platform for R analytics, and to create an easy-to-use GUI for the more casual user. You'll be hearing much more about those initiatives in the coming weeks.
And finally, supporting the open-source R community is a critical part of Revolution's mission. And not just because it's the right thing to do: after all, Revolution is building a business on decades of collective work by volunteers to the R Project, starting with the foundation created by Robert Gentleman and Ross Ihaka, realized by the dedication of the R Core Group, and expanded by the thousands of contributors of R packages. But also because the R community itself is a key part of the value of R: its innovation, its adaptiveness to new applications, and the resources and help from community members themselves.
That's why Revolution is supporting the R Community in a number of ways. Not just by contributing code to the R project (like the foreach and iterators packages), but also in supporting local user groups, sponsoring R conferences, funding students to do research and development for R, and evangelizing the benefits of R to the media and analyst community. And just last week, we launched inside-R.org, a portal for the open-source R community, to make it easier to find the wealth of R resources around the Web.
We'll keep working on points 9 and 10 in Steve's list above, and we look forward to continuing the R story ... to 11 and beyond.
Revolution's CEO Norman Nie has been busy recently spreading the good word about R and Revolution. Here's a quick roundup:
In an audio interview with Mary Jo Nott at B-eye Network, Norman talks about the "perfect storm" around R, "the most complete statistical tool in the world".
In Business Management magazine, Norman Nie describes how predictive analytics is changing the face of business, and how R is the driving force behind this revolution.
And in a general article about the use of predictive analytics by the Federal government in Federal Times, Norman reminds us that "predictive analytics" is really Statistics: "The essence is really using statistical modeling to understand both the past and present and the future of the world around us."
I've still got more to write from days 2 and 3 of useR! 2010, but in the meantime here are a few snapshots from my camera roll. (Click on the images for larger versions.)
Jeroen Ooms (left) chats with other useRs in the main hallway at NIST.
Karim Chine demonstrates his Elastic-R portal for R on Amazon's EC2 cloud.
Andrew Lampitt talks about Business Intelligence with Jaspersoft and Revolution R.
Jonathan Lees demonstrates analysis of earthquake data with R.
James "JD" Long and Joseph Rickert hat-battle at the conference dinner.
Unfortunately the flash on my camera wasn't quite up to catching any of the keynote presenters on the large NIST stage, but video of many of the invited talks should soon be available at Drew Conway's Video Rchive.
Data mining analyst Yuval Marom has just started a new user group for users of R in Melbourne (the one in Australia). He says:
The group is open to people with all levels of experience with R, including complete beginners and complete experts, and people from all industries - private, government, and academic. The main purpose of the group is to provide a forum for the exchange of knowledge and experiences to help people along in their journey of learning to use R. The actual format and content of the meetings will be determined by the members.
The group's first meeting will be in the Melbourne CBD on August 9 from 6-8PM. For more info and to RSVP or join the mailing list, fill out the form at the link below.
I just discovered that R core member Paul Murrell has been maintaining a list of plaudits for R: newspaper articles, book reviews, remarks on mailing lists and blogs, and even gratitudes from individual R users. He's collected dozens of entries since 2001 -- great material here if you ever need more evidence of the awesomeness of R.
Just a couple of quick notes about the first day of talks at useR! 2010. It's been a jam-packed schedule -- so many good talks to see and people to meet, I just wish I had more time for it all!
One stand-out for me so far has been Frank Harrell's amazing keynote lecture, Information Allergy, on the dangers of misusing statistics in Medicine. (Update: video here.) You know a talk is thought-provoking when you're still thinking about the consequences in free moments the day after. It's worthy of an entire blog post on its own.
I've also been excited to see the number of real-life applications using R presented at the conference. In one session alone, I saw how R is used to precisely locate earthquakes (by comparing actual arrival times of signals in seismograph data to their predicted arrival times); how it's used to measure and report on water quality in Australia; and even how it's used to measure the amount of greenhouse gases leaching out of landfills, from LIDAR measurement data. Really fascinating stuff.
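To give a flavor of the arrival-time idea behind the earthquake example, here is a toy sketch in base R. It is not the method shown in the talk. The station layout, the constant wave speed, the flat 2-D geometry and the known origin time are all simplifying assumptions of mine, and the "observed" arrivals are simulated from a known source so the result can be checked.

```r
# Hypothetical station coordinates (km) and a constant wave speed (km/s).
# Both are assumptions made purely for illustration.
stations <- data.frame(x = c(0, 50, 0, 50), y = c(0, 0, 50, 50))
speed <- 6

# Predicted arrival time at each station for an event at (x, y)
# with origin time t0 (assumed known here).
predict_arrivals <- function(x, y, t0 = 0) {
  t0 + sqrt((stations$x - x)^2 + (stations$y - y)^2) / speed
}

# Simulate "observed" arrivals from a known source at (20, 30).
observed <- predict_arrivals(20, 30)

# Grid search: score every candidate location by the sum of squared
# differences between observed and predicted arrival times.
grid <- expand.grid(x = seq(0, 50, by = 1), y = seq(0, 50, by = 1))
grid$misfit <- mapply(
  function(x, y) sum((observed - predict_arrivals(x, y))^2),
  grid$x, grid$y
)

# The best-fitting candidate is the estimated epicenter.
best <- grid[which.min(grid$misfit), ]
print(best)
```

The grid search recovers the simulated source because the misfit is zero only where predicted and observed times agree at every station. A realistic locator would also have to estimate origin time and depth, and would use a proper velocity model and noisy picks.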
The launch-party for inside-R.org last night was a lot of fun too: having about 150 R users together to drink and chat was a great way to learn lots of new things and meet some great people. Thanks to everyone who came along. (If you're at JSM in Vancouver, we'll be hosting another social event on Tuesday, August 3.)
Overall, so far it's been a really outstanding conference: smooth organization, great people, interesting talks, and a really palpable sense of excitement about R. Anyway, I have to run now to give my talk. I'll write more when I get a free moment.
As recently as a couple of years ago, finding information about R on the Web was hard. Other than the canonical content and mailing list archives at the official R project site, www.r-project.org, there wasn't too much else dedicated to R on the web -- and what was there was hard to find on Google without the help of sites like Rseek.org.
Fast-forward to today, and all that's changed. If you search for R on Google, you'll actually find content related to the R project on the first page. Dozens of blogs now cover R regularly, as the RSS feed of r-bloggers.com demonstrates. StackOverflow.com has over 1500 questions related to R. Crantastic.org lists more than 2500 user-contributed packages for R. On top of that, there are thousands of personal websites with tips, tricks and other useful suggestions for R.
Revolution Analytics created Inside-R.org for the R community to highlight these useful resources for R from around the Web, and make them accessible and searchable from a single site. From today, you'll be able to find blog posts about R, information about R packages from crantastic.org, questions about R from StackOverflow, and the help pages for all R functions in the latest R distribution, all accessible from a comprehensive search. You can even contribute your own tips and tricks for other R users.
I'll be blogging about the new features of Inside-R.org over the next couple of days, and you can read an overview in today's press release. Inside-R.org is for you, the R users, so check it out and let us know what you think.