Ajay Ohri at DecisionStats recently interviewed me about my background, this blog, and REvolution Computing's plans for REvolution R Enterprise and ParallelR. The interview is online now. Thanks for the interview, Ajay!
The current issue of Wired magazine has a lengthy profile of Hal Varian, Google's chief economist (and admirer of Statistics as a profession). It begins with a fascinating history of the use and economics of auctions at Google: upon joining, Varian declared of Google's auctions for AdWords, "You've managed to design an auction perfectly!"
The latter half of the article delves into some of the challenges in implementing AdWords, and includes this intriguing quote from leading statistician Daryl Pregibon:
"Google needs mathematical types that have a rich tool set for looking for signals in noise," says statistician Daryl Pregibon, who joined Google in 2003 after 23 years as a top scientist at Bell Labs and AT&T Labs. "The rough rule of thumb is one statistician for every 100 computer scientists."
Given that Google is a high-tech company with over 20,000 employees, that adds up to a lot of statisticians. R is a key tool for those statisticians: it was revealed in the New York Times article about R earlier this year that Google uses R widely. In fact, in that same article Pregibon said, "R is really important to the point that it's hard to overvalue it." Varian himself said (again, in that same article), "The great beauty of R is that you can modify it to do all sorts of things."
The R Journal has published its first issue, now available online. Billed as "a peer-reviewed, open-access publication of the R Foundation for Statistical Computing", The R Journal takes over from R News as the venue for in-depth articles about all things R.
This first issue begins with an introduction to the new journal from Editor-in-Chief Vince Carey. Next, an invited paper from John Chambers, called "Facets of R", traces the history of the various implementations of the S language he pioneered, that ultimately resulted in the "rich but sometimes messy" R we have today. This is followed by a review of the purpose and use of R-Forge, an on-line collaborative environment for developing extensions to R, by Stefan Theußl and Achim Zeileis.
Other articles cover: how to create Visio-style diagrams in R programs; generating Web pages in HTML containing R output; probability model elicitation; integration of the multivariate normal density; Hilbert spectrum analysis; microarray study design; parallel computing support; and sharing predictive models with PMML (the Predictive Model Markup Language).
The R Journal: Current Issue. (Is there a permanent link to the index of the May 2009 issue?)
Many R users on Twitter append their tweets with the hashtag #rstats, which makes it easy to find other R users on Twitter. (A hashtag is just an easily-searched sequence of characters beginning with #.) Even if you don't use Twitter, you can follow the R discussion with this Twitter search.
It's a practice I've only just become aware of, so you won't find me that way (yet), but I am active on Twitter as user revodavid.
I guess this is a famous result, but it was new to me: George Zipf observed in 1949 that the populations of the cities in just about every country follow a power-law distribution. The largest city is always about twice as large as the second largest, and three times as big as the third largest, and so on. You can see this in action in R by plotting the populations of Canadian cities in 1996 (available as dataset cities in the DAAG package) against their rank on a log-log scale:
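Here's a minimal sketch of that rank-size plot. (I'm assuming the 1996 population column in the DAAG cities dataset is named POP; check ?cities for the actual column names before running.)

```r
# Rank-size plot for Canadian city populations (Zipf's law)
library(DAAG)   # provides the 'cities' dataset

pop  <- sort(cities$POP, decreasing = TRUE)  # column name assumed; see ?cities
rank <- seq_along(pop)                       # 1 = largest city (Toronto)

plot(rank, pop, log = "xy",
     xlab = "Rank (log scale)", ylab = "Population (log scale)",
     main = "Canadian city populations vs. rank, 1996")
```

Under a power law, the points should fall along a straight line on these log-log axes.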
The nearly straight line (especially if you downweight the rank 1 city, Toronto) confirms the power-law distribution. The same result applies to cities of other countries, and even to statistics other than population, but no one knows exactly why it holds. Oddly, while the result holds within nations, it breaks down somewhat when considered globally. Tim Gulden of George Mason University used light output from cities (estimated from satellite images) to evaluate Zipf's law for the population and economic activity of cities globally:
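You can quantify "nearly straight" by regressing log population on log rank; a slope near -1 is what Zipf's law predicts. A quick sketch, with the same caveat that the POP column name is an assumption about the DAAG dataset:

```r
# Estimate the power-law exponent from the rank-size relationship
library(DAAG)

pop  <- sort(cities$POP, decreasing = TRUE)  # column name assumed; see ?cities
rank <- seq_along(pop)

# Drop the rank-1 city (Toronto), which tends to sit off the fitted line
fit <- lm(log(pop[-1]) ~ log(rank[-1]))
coef(fit)   # a slope close to -1 is consistent with Zipf's law
```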
Co-author Richard Florida interprets these results:
While the population of cities tends to follow the Zipf law that Strogatz describes within a nation, this scaling does not hold for the whole collection of world cities. The distribution ends up being somewhat flatter - particularly among the largest cities. This may result from barriers to migration between countries.
Google's Chief Economist, Hal Varian, recently said, "The sexy job in the next ten years will be statisticians." On the Dataspora blog, Michael Driscoll explains why. The sex appeal comes from three core skills: an understanding of statistical principles, the ability to "munge" data into forms suitable for analysis, and the ability to tell stories with data through visualization. (I particularly appreciated the distinction between exploratory and presentation graphics on that last point.) Put all three skills together, and you have an awesome, sexy job.
Robert Grossman of the Open Data Group has posted a useful set of tips for running R under Amazon's EC2 service. This is a quick and cost-effective way of creating instances of a large number of machines running R. You might want to do this for parallel computing in R when you don't have access to your own cluster, or when using R as the back-end for a web-based application and you're not sure how many servers you'll need (EC2 allows you to dynamically scale the hardware to the demand).
The Open Data Group has helpfully provided a virtual machine image (called an "AMI") including R (and I believe some of the default AMIs provided by Amazon include R, too). You can also create custom AMIs with REvolution R Enterprise installed, which allows you to take advantage of ParallelR on large EC2 clusters. Creating custom AMIs is pretty easy (especially if you use a provided image as a starting point), and you're probably going to need to create one in any case if you want to use your own packages. (If you do need to create a custom AMI, REvolution's services group can help.)
Wall Street and Technology reported in an article earlier this month that investment banks, hedge funds and other financial institutions are increasingly turning to open-source solutions for analysis and trading. It isn't just for cost reasons, either: the maturity of the open-source movement and early access to cutting-edge technology are also cited as factors. As our own Colin Magee is quoted in the article:
"People realize now that the open source project -- which really has worldwide buy-in from top experts from whatever field -- is perhaps a more secure and future-proof method of development than going with a proprietary vendor who can never keep up with the worldwide community," says Colin Magee, VP sales and marketing at Revolution Computing, an open source predictive analytics platform for developing with R, the statistical modeling language.
But the more dramatic shift for Wall Street right now is that it is considering open-source alternatives for fundamental, industry-specific applications, such as Marketcetera's open-source trading platform, which I've called "a lifeline to the hedge fund industry" because it enables the industry to become more efficient and more productive. REvolution Computing, Esper, and others are also benefiting from this shift.