Today is the last day to submit abstracts for the R/Finance 2010 conference to be held in Chicago on April 16-17. If you're not planning on speaking, but are interested in applications of R in Finance, be sure to add this to your calendar -- last year's conference was an outstanding event. Here's some more information about the conference from the website:
The two-day conference will cover topics including portfolio management, time series analysis, advanced risk tools, high-performance computing, and econometrics. All will be discussed within the context of using R as a primary tool for financial risk management and trading. We strongly encourage R users, as well as package authors, to submit an abstract for presentation at this year's conference.
The 2010 conference will build upon the success of last year's event, including traditional keynotes from leading names in the R and finance community, presentations of contributed papers, short "lightning-style" presentations, and the chance to meet and collaboratively discuss the future of the R in Finance community.
Darwin is the capital city of Australia's Northern Territory. Lying on the coast of far northern Australia, it's situated well within the tropics and as a result has hot, steamy, monsoonal weather. Darwin's weather has already had an impact on its urban culture, and now it seems it's had a political impact too: it's been in the middle of the recent "Climategate" "scandal". A climate-change skeptic observed a discontinuity in the raw temperature records at Darwin airport in 1942 (which, as it turned out, was caused by a change in the equipment), and claimed that the standardization of the data was an overt attempt to present a cooling trend as a warming trend. This claim was roundly denounced in The Economist (prompting yet another reply and rebuttal).
Matthew Markus wasn't satisfied with drawing conclusions from a back-and-forth in the blogosphere, though. He decided to get hold of the raw data from the Global Historical Climatology Network (GHCN) and attempt to reproduce this "smoking gun" graphic for himself in R.
Markus provides all the details and R code for accessing and plotting the temperature data in a blog post. There's lots of practical advice there, including how to use various command-line tools to uncompress and inspect the data (although you'll need a Unix-ish machine to follow along), as well as all the R commands for reading in the data. (It makes for a nice demonstration of reading fixed-format data files into R, actually.) You'll also see how to standardize and plot the data, with a result fairly close to the original above:
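For the fixed-format step in particular, the call is along these lines -- a minimal sketch only, where the file name, column widths and column names are assumptions for illustration rather than the exact GHCN layout Markus uses:

# Hedged sketch: read fixed-width GHCN-style monthly temperature records
darwin <- read.fwf("v2.mean",
                   widths = c(11, 1, 4, rep(5, 12)),
                   col.names = c("station", "dup", "year", month.abb))
head(darwin)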
In a post earlier this month, it seemed as though compressing a data file before reading it into R could save you some time. With some feedback from readers and further experimentation, we might need to revisit that conclusion.
To recap, in our previous experiment it took 170 seconds to read a 182Mb text file into R. But if we compressed the file first, it only took 65 seconds. Apparently, the benefits of reducing the amount of disk access (by dealing with a smaller file) far outweighed the CPU time required to decompress the file for reading.
In that experiment, though, each file was only read once. If you simply repeat the read statement on the uncompressed file, you see a sudden decrease in the time required to read it:
> system.time(read.table("bigdata.txt", sep=","))   # first read
> system.time(read.table("bigdata.txt", sep=","))   # repeated read: noticeably faster
(This was on MacOS, using the R GUI. I also tried using R from the terminal on MacOS, and also from the R GUI in Windows, using both regular R and REvolution R. There were some slight variations in the timings, but in general I got similar results.)
So what's going on here (other than my embarrassing failure as a statistician to replicate my measurements the first time round)? One possibility is that we're seeing the effects of disk cache: when you access data on a hard drive, most modern drives will temporarily store some of the data in high-speed memory. This makes it faster to access the file in subsequent attempts, for as long as the file data remains in the cache. But that doesn't explain why we don't see a similar speedup in repeated readings of the compressed file:
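The corresponding check for the compressed file looks like this (a sketch only, assuming the same file compressed with gzip to bigdata.txt.gz and read through a gzfile() connection; timings omitted):

> system.time(read.table(gzfile("bigdata.txt.gz"), sep=","))   # first read
> system.time(read.table(gzfile("bigdata.txt.gz"), sep=","))   # repeated read: no comparable speedup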
I'd expect the second reading to be faster if disk cache had an effect, so I don't think disk cache is the culprit here. More revealing is the fact that the first use of read.table in any R session takes longer than subsequent ones. Reading from the gzipped file is slower than reading from the uncompressed file if it's the first read of the session:
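In a fresh R session, that comparison looks something like this (again just a sketch, with the timings omitted):

> system.time(read.table(gzfile("bigdata.txt.gz"), sep=","))   # first read.table of the session: slowest
> system.time(read.table("bigdata.txt", sep=","))              # uncompressed file, read second: fast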
So what's going on here? (This was using R from the terminal under MacOS; I got similar results using the R GUI on MacOS.) I don't have a good explanation, frankly. Maybe the additional time is required by R to load libraries or to page in the R executable (but why would it scale with the file size, then?). Note that we got the speed benefits from reading the uncompressed file second, which rules out disk cache having any significant benefits. If anyone has any good explanations, I'd love to hear them.
So what file type is the fastest for reading into R? Reader Peter M. Li took a much more systematic approach to answering that question than I did, running fifty trials for compressed and uncompressed files using both read.table and scan. (We can safely assume that this level of replication nullifies any first-read or caching effects.) He also tested Stata files (an open, binary data file format that R can both read and write). Peter also tested different file sizes for each file type, with files containing one thousand, ten thousand, one hundred thousand, one million and ten million observations. His results are summarized in the graph below, with log(file size) on the X axis and log(time to read) on the Y axis:
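If you want to run a comparison in this style yourself, here's a rough sketch (not Peter's actual script; the file names, and the use of read.dta from the foreign package for the Stata file, are assumptions):

library(foreign)   # provides read.dta() for Stata files
n <- 50
t_text  <- replicate(n, system.time(read.table("bigdata.txt", sep=","))["elapsed"])
t_gz    <- replicate(n, system.time(read.table(gzfile("bigdata.txt.gz"), sep=","))["elapsed"])
t_scan  <- replicate(n, system.time(scan("bigdata.txt", what="", sep=","))["elapsed"])
t_stata <- replicate(n, system.time(read.dta("bigdata.dta"))["elapsed"])
summary(t_text); summary(t_gz); summary(t_scan); summary(t_stata)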
So, what can we conclude from all of this? Let's see:
In general, compressing your text data doesn't speed up reading it into R. If anything, it's slower.
The only time compressing files might be beneficial is for large files read with read.table (but not scan).
There's a speed penalty the first time you use read.table in an R session.
Reading data from Stata files has significant performance benefits compared to text-based files.
And, lastly but most importantly, you should always replicate your measurements.
At the most recent New York R User Group meetup, the topic was creating graphics in R with the ggplot2 package. Drew Conway's talk, "Making pretty pictures with ggplot2" gave several practical examples of visualizing data with ggplot2 and is well worth checking out:
He's no Father Christmas, but he is dressed in festive red and green, and he's made of bacteria. This image, Mario, was submitted to the 2009 international Genetically Engineered Machine (iGEM) competition by Team Osaka from the nanobiology laboratories at the University of Osaka, Japan.
They genetically engineered bacteria to express fluorescent proteins and carotenoid pigments to create works of art.
FlowingData recently took a look at Jeroen Ooms' latest web-based statistical tool based on R. We've looked at his tools for random-effects models and finance visualizations before, but this one is a more general tool for creating graphs from data sets using the ggplot2 package. It's pretty slick. All you need to do is upload a data set (in comma-separated .csv or tab-delimited .txt format) and then you can use the grammar of graphics philosophy underlying ggplot2 to layer various graph types using your data, and also to facet it (like panels in Trellis or lattice) by categorical variables. Here's an example I created in just a few minutes using the cereals dataset from the Data and Story Library:
If you want to try it out yourself, the trick is that most of the interaction happens through right-clicking on the middle pane where the graphics appear: use the menus to set up your X and Y variables, and then choose the type of graphic with the "New Geom Layer (2D)" menu. There are many other features, which you can see in action in the video below. But the best feature of all, which I would have missed except for the video, is that it shows you the code to create the graph in R using ggplot2. There's a tiny arrow at the very bottom of the application: click it to show the code pane. The code it displays is exactly the code you would write as an R programmer to create the chart. This makes Jeroen's application a really useful tool for learning the capabilities and syntax of ggplot2.
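The code pane shows statements along these lines (a hedged sketch of my own, using assumed column names from the cereals data set -- sugars, calories and shelf -- rather than anything the app necessarily emits):

library(ggplot2)
cereals <- read.csv("cereals.csv")
ggplot(cereals, aes(x = sugars, y = calories)) +
  geom_point() +
  facet_wrap(~ shelf)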
Though commercial statistical packages are popular among researchers, their licensing costs drive people away from them. In this context, R (http://www.r-project.org), the free, open-source statistical package that is fast becoming the darling of researchers and analysts, assumes significance. The great advantage of R is that it can be downloaded and installed on your machine without any licensing worries. Another advantage of R is that it runs on multiple platforms such as Linux, Mac and Windows. Naturally, this adds to the autonomy of a researcher.
Unlike in the past, today's students run a variety of operating systems on their laptops.
In a classroom setting of this kind, teaching statistics with commercial packages becomes quite unwieldy — not everyone can afford to purchase software for every platform. In this regard, commercial packages pale in comparison with R, which has no such restrictions.
Besides being free, the advantage of this statistical software is its “extensibility” feature. R allows its users to enhance its functionality by creating new functions (this is similar to the extensions we find in Firefox).
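As a trivial illustration (my own example, not one from the article), adding a new function to R takes just a line or two:

# A hypothetical helper: coefficient of variation of a numeric vector
cv <- function(x, na.rm = FALSE) sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
cv(c(2, 4, 6, 8))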
Other open-source tools mentioned in the article include Firefox, AbiWord, OpenOffice.org, and the bibliography managers Zotero (a Firefox extension) and JabRef.
The US National Centers for Environmental Prediction (NCEP) produces weather forecasts for the entire world from a model that's updated every 6 hours. The data is made freely available, and with a couple of free tools to convert the data, plus R, you can easily produce an updated global weather forecast like this (click to enlarge):
(Check out the heat-wave in Australia!) Physicist and climate scientist Joe Wheatley provides all the details, including R code. All you need is R, the ncdf package to process the data file, and a free command-line tool to convert the file from the format provided by NCEP.
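As a taste of the R side, here's a minimal sketch of reading a converted file with the ncdf package, assuming the NCEP data has already been converted to NetCDF and saved as forecast.nc; the variable name below is an assumption for illustration, not necessarily what the actual file uses:

library(ncdf)
nc   <- open.ncdf("forecast.nc")
temp <- get.var.ncdf(nc, "TMP_2maboveground")   # assumed variable name
close.ncdf(nc)
image(temp[, , 1])   # quick look at the first forecast time step, assuming a (lon, lat, time) array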