Bruno Rodrigues teaches a class on applied econometrics at the University of Strasbourg, with a focus on implementing econometric concepts in the R language. Since many of the students don't have any previous programming background, he's put together a tutorial on the basics of applied econometrics with R. The first two chapters serve as a general-purpose beginners' introduction to R, while chapter 3 explores basic applied econometrics with R (primarily data summaries and linear models). A fourth chapter to come promises a focus on reproducible research, so check back for updates to this free document at the link below. (And if you're looking for a more advanced tutorial on econometrics with R, check out Econometrics in R by Grant Farnsworth.)
Bruno Rodrigues: Introduction to Programming Econometrics with R
by Joseph Rickert
Revolution Analytics' 2015 R User Group Support Program (RevoRUGS) begins today. Last year we provided financial support to 51 groups worldwide. That works out to about one third of the active R user groups listed in our Local User Group Directory.
R User Groups Supported by Revolution Analytics in 2014
(Download RevoRUGS_2014 to click on R user group locations.)
This year we are aiming to double the number we support. To make it easier for new groups to get started, we are increasing the stipend for Vector level groups from $100 to $120. This should cover the cost of a year's worth of organizer dues on meetup.com. (Look here to see what meetup.com pricing for a Basic group is in your region.)
The benefits for Revolution Analytics User Group Sponsorship include:
The mechanics of the program are pretty simple. If you are thinking about organizing a new group, review the tips for starting up a local R user group, get a web page going (preferably on meetup.com) and fill out the sponsorship form. If you are representing an existing R user group, review the requirements for Matrix and Array sponsorship. If you think your group meets either Matrix or Array requirements make sure your group's website reflects this before submitting the sponsorship request form.
We will review applications for sponsorship as they come in, and do our best to make a decision and inform organizers within a month after receiving the application form. Organizers of accepted groups will receive a cash stipend and package of goodies that depends on the group level accepted.
The deadlines this year are March 31, 2015 for Matrix and Array level groups, and September 30, 2015 for Vector groups. Properly completed sponsorship forms must be received by these dates.
All the details are here on our website.
The following code was used to draw the map using the data that you can download here: Download RevoRUGS2014
# REVO SUPPORTED RUGS 2014
# Code for 12/18/14 post
library(leafletR)   # plot with leaflet

# Read data from csv file and store in a GeoJSON file
city_dat <- file.path(getwd(), "RevoRUGS2014.csv")
cities   <- read.csv(city_dat, header = TRUE)
dat_geo  <- toGeoJSON(data = city_dat, dest = getwd(), name = "RevoRUGS_2014")
#
# Draw the map
map <- leaflet(data = dat_geo, dest = getwd(),
               popup = c("City", "Name"), incl.data = TRUE,
               base.map = list("osm", "mqsat", "tls"))
#
# View map
browseURL(map)
Also note: We are conducting an audit of the user groups listed in the Revolution Analytics directory and removing entries for groups who no longer have websites. If we remove your group by mistake, please do let us know!
by Joseph Rickert
One of the most difficult things about R, a problem that is particularly vexing to beginners, is finding things. This is an unintended consequence of R's spectacular, but mostly uncoordinated, organic growth. The R core team does a superb job of maintaining the stability and growth of the R language itself, but the innovation engine for new functionality is largely in the hands of the global R community.
Several structures have been put in place to address various aspects of the finding-things problem. For example, Task Views represent a monumental effort to collect and classify R packages. The RSeek site is an effective tool for web searches. R-Bloggers is a good place to go for R applications, and CRANberries lets you know what's new. But how do you find things that you didn't even know you were looking for? For this, the so-called "misc" packages can be very helpful. Whereas the majority of R packages are focused on a particular type of analysis, class of models or special tool, misc packages tend to be collections of functions that facilitate common tasks. (Look below for a partial list.)
DescTools is a new entry to the misc package scene that I think could become very popular. The description for the package begins:
DescTools contains a bunch of basic statistic functions and convenience wrappers for efficiently describing data, creating specific plots, doing reports using MS Word, Excel or PowerPoint. The package's intention is to offer a toolbox, which facilitates the (notoriously time consuming) first descriptive tasks in data analysis, consisting of calculating descriptive statistics, drawing graphical summaries and reporting the results. Many of the included functions can be found scattered in other packages and other sources written partly by Titans of R.
So far, of the 380 functions in this collection, the Desc function has my attention. This function provides very nice tabular and graphical summaries of the variables in a data frame, with output that is specific to the data type. The d.pizza data frame that comes with the package has a nice mix of data types:
head(d.pizza)
  index       date week weekday        area count rabate  price operator  driver delivery_min temperature wine_ordered wine_delivered
1     1 2014-03-01    9       6      Camden     5   TRUE 65.655   Rhonda  Taylor         20.0        53.0            0              0
2     2 2014-03-01    9       6 Westminster     2  FALSE 26.980   Rhonda Butcher         19.6        56.4            0              0
3     3 2014-03-01    9       6 Westminster     3  FALSE 40.970  Allanah Butcher         17.8        36.5            0              0
4     4 2014-03-01    9       6       Brent     2  FALSE 25.980  Allanah  Taylor         37.3          NA            0              0
5     5 2014-03-01    9       6       Brent     5   TRUE 57.555   Rhonda  Carter         21.8        50.0            0              0
6     6 2014-03-01    9       6      Camden     1  FALSE 13.990  Allanah  Taylor         48.7        27.0            0              0
  wrongpizza quality
1      FALSE  medium
2      FALSE    high
3      FALSE    <NA>
4      FALSE    <NA>
5      FALSE  medium
6      FALSE     low
Here is some of the voluminous output from the function. The data frame as a whole is summarized as follows:
'data.frame':  1209 obs. of 16 variables:
 1  $ index         : int  1 2 3 4 5 6 7 8 9 10 ...
 2  $ date          : Date, format: "2014-03-01" "2014-03-01" "2014-03-01" "2014-03-01" ...
 3  $ week          : num  9 9 9 9 9 9 9 9 9 9 ...
 4  $ weekday       : num  6 6 6 6 6 6 6 6 6 6 ...
 5  $ area          : Factor w/ 3 levels "Brent","Camden",..: 2 3 3 1 1 2 2 1 3 1 ...
 6  $ count         : int  5 2 3 2 5 1 4 NA 3 6 ...
 7  $ rabate        : logi  TRUE FALSE FALSE FALSE TRUE FALSE ...
 8  $ price         : num  65.7 27 41 26 57.6 ...
 9  $ operator      : Factor w/ 3 levels "Allanah","Maria",..: 3 3 1 1 3 1 3 1 1 3 ...
10  $ driver        : Factor w/ 7 levels "Butcher","Carpenter",..: 7 1 1 7 3 7 7 7 7 3 ...
11  $ delivery_min  : num  20 19.6 17.8 37.3 21.8 48.7 49.3 25.6 26.4 24.3 ...
12  $ temperature   : num  53 56.4 36.5 NA 50 27 33.9 54.8 48 54.4 ...
13  $ wine_ordered  : int  0 0 0 0 0 0 1 NA 0 1 ...
14  $ wine_delivered: int  0 0 0 0 0 0 1 NA 0 1 ...
15  $ wrongpizza    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
16  $ quality       : Ord.factor w/ 3 levels "low"<"medium"<..: 2 3 NA NA 2 1 1 3 3 2 ...
The factor variable driver gets a table and a plot.
10 - driver (factor)

  length      n  NAs  levels  unique  dupes
   1'209  1'204    5       7       7      y

   level      freq  perc  cumfreq  cumperc
1  Carpenter   272  .226      272     .226
2  Carter      234  .194      506     .420
3  Taylor      204  .169      710     .590
4  Hunter      156  .130      866     .719
5  Miller      125  .104      991     .823
6  Farmer      117  .097     1108     .920
7  Butcher      96  .080     1204    1.000
and so does the numeric variable delivery_min.
11 - delivery_min (numeric)

  length      n  NAs  unique  0s    mean  meanSE
   1'209  1'209    0     384   0  25.653   0.312

     .05     .10     .25  median     .75     .90     .95
  10.400  11.600  17.400  24.400  32.500  40.420  45.200

     rng      sd   vcoef     mad     IQR    skew    kurt
  56.800  10.843   0.423  11.268  15.100   0.611   0.095

lowest : 8.8 (3), 8.9, 9 (3), 9.1 (5), 9.2 (3)
highest: 61.9, 62.7, 62.9, 63.2, 65.6

Shapiro-Wilks normality test p.value : 2.2725e-16
Pretty nice for an automatic first look at the data.
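Under the hood, the factor summary Desc prints is just frequencies, proportions and their cumulative sums. As a sketch of the idea, it can be approximated in base R; the driver names below come from the output above, but the counts are invented for illustration:

```r
# Approximate the factor summary that Desc() prints, using only base R.
# The driver names come from the d.pizza output above; the counts here
# are invented for illustration.
driver <- factor(rep(c("Carpenter", "Carter", "Taylor", "Butcher"),
                     times = c(5, 4, 3, 2)))

freq <- sort(table(driver), decreasing = TRUE)  # level frequencies
perc <- prop.table(freq)                        # proportions of the total

summary_tab <- data.frame(level   = names(freq),
                          freq    = as.vector(freq),
                          perc    = round(as.vector(perc), 3),
                          cumfreq = cumsum(as.vector(freq)),
                          cumperc = round(cumsum(as.vector(perc)), 3))
print(summary_tab)
```

Of course, the point of Desc is that it assembles all of this (plus the plot) in one call.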
For some more R treasure hunting, have a look at the following short list of misc packages.
- Tools for manipulating data (the No. 1 package downloaded in 2013)
- Convenience wrappers for functions for manipulating strings
- One of the most popular R packages of all time: functions for data analysis, graphics, utilities and much more
- Package development tools
- The "go to" package for machine learning, classification and regression training
- Good SVM implementation and other machine learning algorithms
- Tools for describing data and descriptive statistics
- Tools for plotting decision trees
- Functions for numerical analysis, linear algebra, optimization, differential equations and some special functions
- Different high-level graphics functions for displaying large datasets
- Relatively new package with various functions for survival data, extending the methods available in the survival package
- New this year: miscellaneous R tools to simplify working with data types and formats, including functions for working with data frames and character strings
- Some functions for Kalman filters
- Miscellaneous 3-D plots, including isosurfaces
- New package with utilities for producing maps
- Various programming tools, like ASCIIfy() to convert characters to ASCII and checkRVersion() to see if a newer version of R is available
- A grab bag of utilities, including progress bars and function timers
by James Peruvankal
There are plenty of options if you want to learn R and are looking for training: your college's statistics department, or massive open online courses from Coursera, Udacity, edX, DataCamp and others. SiliconANGLE recently published an article about top R-training companies.
Let’s talk about how to choose a good Rtrainer.
At Revolution Analytics we are guided by the teaching philosophy presented in the following chart:
So, if you are serious about learning R, brush up on your statistics, be prepared to jump right in and start doing things on your own, surround yourself with people who are passionate about statistics and R, and figure out how to make the whole process fun for you. If you are teaching R and want to join us in our mission to ‘take R to the Enterprise’, see if you can fit in with our team.
Facebook is a company that deals with a lot of data — more than 500 terabytes a day — and R is widely used at Facebook to visualize and analyze that data. Applications of R at Facebook include user behaviour, content trends, human resources and even graphics for the IPO prospectus. Now, four R users at Facebook (Moira Burke, Chris Saden, Dean Eckles and Solomon Messing) share their experiences using R at Facebook in a new Udacity online course, Exploratory Data Analysis.
As the name suggests, this online course uses R (via RStudio) and the ggplot2 package to provide an introduction to Exploratory Data Analysis. The R Basics chapter gives a general overview of R: installing, starting, getting help, and language basics. Then the course covers visualizing data sets of one, two and multiple variables. There's even an introduction to predictive modeling at the end of the course.
A free Udacity account is required to watch the course videos. Get started at the link below.
Udacity: Exploratory Data Analysis
Looking for a fun and useful intro to R for first-timers, with bonus cat pictures? Look no further than R for Cats, from Scott Chamberlain of rOpenSci. In addition to very helpful tips on R syntax, data structures, and an excellent list of dos and don'ts, it also shows you how to do this with R:
Check it out: R for cats and cat lovers
by Joseph Rickert
R/Finance 2014 is just about a week away. Over the past four or five years this has become my favorite conference. It is small (300 people this year), exceptionally well-run, and always offers an eclectic mix of theoretical mathematics, efficient practical computing, industry best practices and trading "street smarts". This clip of Blair Hull delivering a keynote speech at R/Finance 2012 is an example of the latter. It ought to resonate with anyone who has followed some of the hype surrounding Michael Lewis's recent book Flash Boys.
In any event, I thought it would be a good time to look at the relationship between R and Finance and to highlight some resources that are available to students, quants and data scientists looking to do computational finance with R.
First off, consider what computational finance has done for R. From the point of view of the development and growth of the R language, I think it is pretty clear that computational finance has played the role of the ultimate "Killer App" for R. This high-stakes, competitive environment, where a theoretical edge or a marginal computational advantage can mean big rewards, has led to R package development in several areas including time series, optimization, portfolio analysis, risk management, high performance computing and big data. Additionally, challenges and crises in the financial markets have helped accelerate R's growth into big data. In this podcast, Michael Kane talks about the analysis of the 2010 Flash Crash he did with Casey King and Richard Holowczak and describes using R with large financial datasets.
Conversely, I think that it is also clear that R has done quite a bit to further computational finance. R’s ability to facilitate rapid data analysis and visualization, its great number of available functions and algorithms and the ease with which it can interface to new data sources and other computing environments has made it a flexible tool that evolves and adapts at a pace that matches developments in the financial industry. The list of packages in the Finance Task View on CRAN indicates the symbiotic relationship between the development of R and the needs of those working in computational finance. On the one hand, there are over 70 packages under the headings Finance and Risk Management that were presumably developed to directly respond to a problem in computational finance. But, the task view also mentions that packages in the Econometrics, Multivariate, Optimization, Robust, SocialSciences and TimeSeries task views may also be useful to anyone working in computational finance. (The High Performance Computing and Machine Learning task views should probably also be mentioned.) The point is that while a good bit of R is useful to problems in computational finance, R has greatly benefited from the contributions of the computational finance community.
If you are just getting started with R and computational finance, have a look at John Nolan's R as a Tool in Computational Finance. Other resources for R and computational finance that you may find helpful are:
Package Vignettes
Several of the Finance-related packages have very informative vignettes or associated websites. For example, have a look at those for the packages portfolio, rugarch, RQuantLib (check out the cool rotating distributions), PerformanceAnalytics and MarkowitzR.
Data
Quandl has become a major source for financial data, which can be easily accessed from R.
Websites
Relevant websites include the Rmetrics site, The R Trader, Burns Statistics and Guy Yollin's repository of presentations.
YouTube
Three videos that I found particularly interesting are recordings of the presentations "Finance with R" by Ronald Hochreiter, "Using R in Academic Finance" by Sanjiv Das and "Portfolio Construction in R" by Elliot Noma.
Blogs
Over the past couple of years, R-Bloggers has posted quite a few finance-related applications. Prominent among these is the series on Quantitative Finance Applications in R by Daniel Hanson on the Revolutions blog.
Books
Books on R and Finance include the excellent Rmetrics series of e-books, Statistics and Data Analysis for Financial Engineering by David Ruppert, Financial Risk Modelling and Portfolio Optimisation with R by Bernhard Pfaff, Introduction to R for Quantitative Finance by Daróczi et al. and a brand new title, Computational Finance: An Introductory Course with R by Argimiro Arratia.
Coursera
This August, Eric Zivot will teach the course Introduction to Computational Finance and Financial Econometrics which will emphasize R.
The R Journal
The R Journal frequently publishes finance-related papers. The current issue, Volume 5/2 (December 2013), contains three relevant papers: Performance Attribution for Equity Portfolios by Yang Lu and David Kane; Temporal Disaggregation of Time Series by Christoph Sax and Peter Steiner; and betategarch: Simulation, Estimation and Forecasting of Beta-Skew-t-EGARCH Models by Genaro Sucarrat.
Conferences
In addition to R/Finance (Chicago) and useR! 2014 (Los Angeles), look for R-based computational finance expertise at the 8th R/Rmetrics Workshop (Paris).
Community
R-SIG-Finance is one of R's most active special interest groups.
by Joseph Rickert
Worldwide R user group activity for the first quarter of 2014 appears to be way up compared to previous years, as the following plot shows.
The plot was built by counting the meetings on Revolution Analytics' R Community Calendar. R users continue to value live, in-person events and face-to-face meetings with their peers. Moreover, if you peruse the details of the meetings listed on the calendar, you will see that there have been some fantastic presentations so far this year. And although, as with any live event, if you missed it you will never know how good it really was, some of the R user groups have left traces of what happened by posting presentation slides on public areas of their websites.
I am sorry that I missed Harold Baize’s presentation to the Berkeley R language Beginner Study Group on Using R in a Microsoft Office world. I think the following slide from Harold’s presentation indicates a pragmatic approach coupled with a wry sense of humor.
The presentation goes on to discuss reading and writing to Excel, Tableau, the R2DOCX package, markdown and more. To get the presentation, download the file RinMSOffice from the Berkeley site.
Jeroen Janssens’ presentation on Command Line Data Science to the New York Open Statistical Programming Meetup is a very nice introduction to command line essentials for either Linux or Mac OS. If you are not comfortable with cat, grep or awk this might be a place to start.
In a similar spirit of helping new users get up to speed with R, the New Hampshire R Users Group (NH UserRs) has posted a number of tutorials by David Hocking. Introduction to Linear Regression and ANOVA in R is a useful 10-minute first look. Newcomers might also find it valuable to follow up with the David Lillis presentation, Data Analysis Tips in R, posted on the Wellington R User Group (WRUG) site. This presentation contains a short discussion on calculating tetrachoric and polychoric correlations in R that quite a few people might find valuable.
DublinR has graciously made available all of the materials from a Bayesian Data Analysis workshop that Mick Cooney delivered earlier this year. A zipped file containing R scripts, JAGS files and data can be downloaded by clicking on "bdasingle2014" at this link.
Finally, I am glad that I was present at Megan Price's talk, How a Small Non-Profit Human Rights Group Uses R, to the Bay Area useR Group. This was an inspirational talk where you really had to be there. Nevertheless, Megan's slides do portray the big picture of how statistical analysis can make a difference in human rights investigations. The following photo of a cache of documents containing records of those who disappeared during Guatemala's civil war indicates the magnitude of the statistical sampling problem that Megan and her colleagues at HRDAG faced.
This coming Monday, March 31, 2014, is the last day to apply to Revolution Analytics' R User Group Sponsorship Program for Matrix or Array level sponsorship. So if you are the organizer of an established R user group and you think a cash grant and a box of R-related "goodies" could help you grow, please apply before midnight PST 3/31/14.
by Joseph Rickert
Recently, I had the opportunity to be a member of a job panel for Mathematics, Economics and Statistics students at my alma mater, CSUEB (California State University, East Bay). In the context of preparing for a career in data science, a student at the event asked: "Where can I find good data sets?" This triggered a number of thoughts, the first being that it was time to update the list of data sets that I maintain and blog about from time to time. So, thanks to that reminder, I have added a few new links to the page, including a new section called Data Science Practice that links to some of the data sets used as examples in Doing Data Science by Rachel Schutt and Cathy O'Neil. Additionally, I have provided a direct link to the Big Data tag on infochimps and pointed out that multiple song data sets are available.
However, to do justice to the student's question, it is necessary to give some thought to exactly what a "good" practice data set might look like. Here are three characteristics that I think a practice data set should have to be good:
Here are three data sets that meet these criteria in ascending order of degree of difficulty:
The first suggestion is the MovieLens data set, which contains a million ratings applied to over 10,000 movies by more than 71,000 users. The download comes in two sizes: the full set and a 100K subset. Both versions require working with multiple files.
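For a sense of what working with the MovieLens files involves: the ratings are stored one per line with a "::" separator (at least in the 1M release), which read.table() cannot handle directly since it only accepts single-character separators. A minimal base-R sketch, using invented sample lines in the MovieLens layout:

```r
# Sketch: parsing MovieLens-style ratings. The 1M set stores ratings as
# "UserID::MovieID::Rating::Timestamp" lines; read.table() cannot use a
# multi-character separator, so split the lines manually. These sample
# lines are invented for illustration.
sample_lines <- c("1::1193::5::978300760",
                  "1::661::3::978302109",
                  "2::1357::5::978298709")

fields  <- strsplit(sample_lines, "::", fixed = TRUE)
ratings <- data.frame(user      = as.integer(sapply(fields, `[`, 1)),
                      movie     = as.integer(sapply(fields, `[`, 2)),
                      rating    = as.integer(sapply(fields, `[`, 3)),
                      timestamp = as.integer(sapply(fields, `[`, 4)))

# Mean rating per user -- the kind of first summary the data set invites
tapply(ratings$rating, ratings$user, mean)
```

On the real files you would read the lines with readLines() and join the ratings, movies and users tables on their shared IDs.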
Near the top of anybody's list of practice data sets, and second on my little list because of its degree of difficulty, is the airlines data set from the 2009 ASA challenge. This data set, which contains the arrival and departure information for all domestic flights from 1987 to 2008, has become the "iris" data set for Big Data. With over 123M rows it is too big to fit into your laptop's memory, and with 29 variables of different types it is rich enough to suggest several analyses. Moreover, although the version of the data set maintained on the ASA website is fixed and therefore perfect for benchmarking, the Research and Innovative Technology Administration's Bureau of Transportation Statistics continues to add to the data on a monthly basis. Go to RITA to get all of the data collected since the ASA competition ended.
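Since the full file won't fit in memory, one standard base-R tactic is to stream it in chunks and accumulate running totals. A minimal sketch of the idea, with a tiny invented CSV standing in for the real file (ArrDelay is one of the data set's actual columns; the values here are made up):

```r
# Sketch of chunked processing for a file too big for memory, using only
# base R. A tiny invented CSV stands in for the real 123M-row file.
tmp <- tempfile(fileext = ".csv")
writeLines(c("Year,ArrDelay",
             "1987,5", "1987,-3", "1988,12", "1988,0", "1989,7"), tmp)

con <- file(tmp, open = "r")
header <- strsplit(readLines(con, n = 1), ",")[[1]]  # consume the header row

total <- 0; n <- 0
chunk_size <- 2          # use something like 1e6 on the real file
repeat {
  lines <- readLines(con, n = chunk_size)
  if (length(lines) == 0) break
  chunk <- read.csv(text = lines, header = FALSE, col.names = header)
  total <- total + sum(chunk$ArrDelay)  # accumulate a running sum
  n     <- n + nrow(chunk)
}
close(con)
unlink(tmp)

mean_delay <- total / n   # overall mean arrival delay
```

The same accumulate-per-chunk pattern extends to group counts and sums of squares; anything that cannot be computed from running totals (medians, for instance) needs a different strategy or a package built for out-of-memory data.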
Last on my short list is the Million Song data set. This contains features and metadata for one million songs, originally provided by the music intelligence company Echo Nest. The data is in the specialized HDF5 format, which makes it somewhat of a challenge to access. The data set maintainers do provide wrapper functions to facilitate downloading the data and avoiding some of the complexities of the HDF5 format. However, there are no R wrappers! The last I checked, the maintainers had a paragraph about there being a problem with their code, along with an invitation for R experts to contact them. (This would clearly be for extra points.) For more details about the contents of the data set, look here.
As a final note, it is much easier to use R to analyze the Public Data Sets available through Amazon Web Services now that you can run Revolution R Enterprise in the Amazon Cloud. We hope to have more to say about exactly how to go about doing this in a future post. However, everything you need to get started is in place, including a 14-day free trial (Amazon charges apply) for Revolution R Enterprise. All you need is your own Amazon account.
Please let me know if you have additional links to useful, publicly available data sets that I have missed. We very much appreciate the contributions blog readers have made to the list of data sets.