by Joseph Rickert
The XXXV Sunbelt Conference of the International Network for Social Network Analysis (INSNA) was held last month at Brighton beach in the UK. (And I am still bummed out that I was not there.)
A run of 35 conferences is impressive indeed, but the social network analysts have been at it for an even longer time than that:
and today they are still on the cutting edge of the statistical analysis of networks. The conference presentations have not been posted yet, but judging from the conference workshops program there was plenty of R action in Brighton.
Social network analysis at this level involves some serious statistics and mastering a very specialized vocabulary. However, it seems to me that some knowledge of this field will become important to everyone working in data science. Supervised learning models and statistical models that assume independence among the predictors will most likely represent only the first steps that data scientists will take in exploring the complexity of large data sets.
And, maybe of equal importance, is the fact that working with network data is great fun. Moreover, software tools exist in R and other languages that make it relatively easy to get started with just a few pointers.
From a statistical inference point of view, the key thing to know is that Exponential Random Graph Models (ERGMs) are at the heart of modern social network analysis. An ERGM is a statistical model that enables one to compute the probability of observing a given network from a specified class of networks, based on both observed structural properties of the network and covariates associated with its vertices. The "exponential" part of the name comes from the exponential family of functions used to specify the form of these models. ERGMs are analogous to generalized linear models, except that ERGMs take into account the dependency structure of the ties (edges) between vertices. For a rigorous definition of ERGMs see sections 3 and 4 of the paper by Hunter et al. in the 2008 special issue of the JSS, or Chapter 6 of Kolaczyk and Csárdi's book Statistical Analysis of Network Data with R. (I have found this book to be very helpful and highly recommend it. Not only does it provide an accessible introduction to ERGMs, it also begins with basic network statistics and the igraph package before going on to more advanced topics such as modeling processes that take place on graphs and network flows.)
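To make the exponential-family form concrete, here is a tiny base-R sketch (not the statnet machinery; the parameter values here are hypothetical, chosen only for illustration) that enumerates all eight undirected graphs on three nodes and computes each graph's probability from the sufficient statistics, the number of edges and the number of triangles:

```r
# Toy illustration of the exponential-family form behind ERGMs:
# on 3 nodes there are 2^3 = 8 possible undirected graphs. With
# sufficient statistics g(y) = (edges, triangles) and parameters
# theta, P(Y = y) = exp(theta . g(y)) / kappa(theta).
theta <- c(edges = -1, triangle = 0.5)   # hypothetical parameter values

# All 8 graphs, coded by presence/absence of the 3 possible edges
graphs <- expand.grid(e12 = 0:1, e13 = 0:1, e23 = 0:1)
stats <- cbind(
  edges    = rowSums(graphs),
  triangle = as.integer(rowSums(graphs) == 3)  # a triangle iff all edges present
)

w     <- exp(stats %*% theta)   # unnormalised weights
kappa <- sum(w)                 # normalising constant
p     <- as.vector(w / kappa)   # probability of each graph
round(p, 4)
sum(p)                          # probabilities sum to 1
```

With real data you would of course let the ergm package do the work, via a formula such as `ergm(net ~ edges + triangle)`; the brute-force normalising constant above is exactly what makes fitting real ERGMs hard, and is what motivates the MCMC machinery inside the statnet packages.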
In the R world, the place to go to work with ERGMs is statnet.org. statnet is a suite of 15 or so CRAN packages that provide a complete infrastructure for working with ERGMs. statnet.org is a real gem of a site that contains documentation for all of the statnet packages along with tutorials, presentations from past Sunbelt conferences and more.
I am particularly impressed with the Shiny-based GUI for learning how to fit ERGMs. Try it out on the Shiny webpage. Click the Get Started button. Then select "built-in network" and "ecoli 1" under File type. After that, click the right arrow in the upper right corner. You should see a plot of the ecoli graph.
You will be fitting models in no time. And since the commands used to drive the GUI are similar to specifying the parameters for the functions in the ergm package you will be writing your own R code shortly after that.
By Torben Tvedebrink, Chair of local committee, useR! 2015
After useR! 2015 in Aalborg I had some time to reflect and think back on the phase leading up to the actual conference. The story of useR! 2015 began in 2013 when Søren Højsgaard, Head of Department of Mathematical Sciences, Aalborg University, popped the idea of hosting useR! 2015 or 2017 in Aalborg. He had made some informal enquiries to the R Foundation and R Core members about the possibility. With some positive indications we sat down and wrote the first draft for the bidding material. This included a description of Aalborg, the conference venue, our thoughts on the scientific and social programme together with our first budget. This five page document was sent to the R Foundation by the end of October 2013 and after some communication back and forth we received the final "go!" in January 2014.
The first thing we decided on was the form and location of the social events (i.e. the welcome reception and conference dinner). We decided that the participants should experience more of Aalborg than just the conference venue. Hence, the House of Music seemed like a natural choice. With the support of the municipality of Aalborg and our sponsors, we had the opportunity to show our guests some of the best the city has to offer on the evening before the conference. Conference dinners are often held in big restaurants or settings where it is difficult to ensure good food and drinks. Hence, we wanted to focus on the place and theme rather than the food. The Robber's Banquet fitted nicely with this idea.
In the figure below I have plotted the number of useR! 2015 emails in my inbox over time (right: accumulated numbers on a log scale; total number of emails received: 3561). The number of received emails serves as a nice proxy for the amount of work put into the conference over time. As seen from the plot to the right, the number of emails grew exponentially over time. In the plot I have added some of the important dates, e.g. the opening of registration, the abstract deadline and the registration deadlines.
We wanted to follow the example of useR! 2014 in Los Angeles and offer the tutorials free of charge to the participants. We applied to some Danish foundations to support the initiative and had some positive feedback. In order to allocate the 16 tutorials into morning and afternoon sessions, we sent out a survey to the participants a month before the useR! conference. Based on the survey, we ran through all possible permutations of the tutorials and minimised the number of individuals with both tutorial selections in the same session -- of course all done in R.
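That allocation step can be sketched in a few lines of base R. This is a simplified illustration with made-up survey data (six tutorials split three and three), not the actual useR! 2015 code:

```r
# Sketch of the session-allocation idea: enumerate every way of
# splitting the tutorials into a morning and an afternoon session,
# and pick the split with the fewest participant clashes.
set.seed(1)
n_tut <- 6
# each row: the two tutorials a survey respondent wants to attend
survey <- t(replicate(40, sample(n_tut, 2)))

# every way to put half of the tutorials in the morning
splits <- combn(n_tut, n_tut / 2)
conflicts <- apply(splits, 2, function(morning) {
  in_morning <- matrix(survey %in% morning, ncol = 2)
  # a respondent clashes if both picks fall in the same session
  sum(in_morning[, 1] == in_morning[, 2])
})
best <- splits[, which.min(conflicts)]
best             # tutorials assigned to the morning session
min(conflicts)   # number of people who still clash
```

For the real problem, choose(16, 8) = 12870 possible splits, so a brute-force search like this remains entirely feasible.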
Initially we hoped that 300-400 participants would show up in Aalborg. This was based on some data, but primarily on a somewhat pessimistic prior, which fortunately did not hold true. In the figure below the number of registered participants is plotted over time. We opened registration on December 3rd, 2014 and, as expected, only a few signed up in the first months. However, as presenters received their notifications and the early registration deadline approached, we had already achieved our goal of 400 participants. In the months that followed another 260 useR!s signed up, for a total of 660 participants when we ran the conference in July 2015 (128 females and 532 males). The industry/academia split was almost 50/50, with 284 participants from industry, 262 academics and 113 students.
We had participants coming from more than 40 countries (see below), where the majority came from Denmark (129), USA (117) and Germany (92). For most countries the distribution between industry and academia was also close to 50% (right barplot).
In summary the useR! 2015 conference went well and we are happy for all the feedback we have received - positive as well as negative. These inputs are valuable to the R community in general and to us as local organisers in particular. We are happy to share any ideas and comments with future organisers of the useR! conference series.
The conference venue (Aalborg Congress and Culture Center, akkc.dk) was an ideal size for the turnout: the plenary lecture hall seats almost 800 people, and the four additional rooms (150-220 seats) for parallel sessions were adequate. The professional assistance from the staff during the planning was very helpful. We can only recommend involving experienced people in the planning and execution of the next useR! conferences. Similar thoughts go to the catering - with a good, varied and sufficient food supply, most people are happy!
The social events (welcome reception, poster session and conference dinner) all went as we hoped. For the poster session we had free drinks and food. This meant most people stayed until the end, and poster presenters had many interesting discussions. We had intentionally encouraged posters to be on display throughout the conference and located them in the exhibitors' area. With this arrangement people had more time to visit our many sponsors and look at the posters whenever they felt like it.
Once again we would like to thank all useR! 2015 participants for making the conference a memorable experience for the Department of Mathematical Sciences at Aalborg University. A special thanks goes to our many sponsors, who made it possible to provide a high level of service.
On behalf of the local organising committee,
Torben Tvedebrink
by Joseph Rickert
June was a hot month for extreme statistics and R. Not only did we close out the month with useR! 2015, but two small conferences in the middle of the month brought experts together from all over the world to discuss two very difficult areas of statistics that generate quite a bit of R code.
The Extreme Value Analysis conference is a prestigious event that is held every two years in different parts of the world. This year, over 230 participants from 26 countries met from June 15th through 19th at the University of Michigan, Ann Arbor for EVA 2015. The program included theoretical advances as well as novel applications of Extreme Value Theory in fields including finance,
economics, insurance, hydrology, traffic safety, terrorism risk, and climate and environmental extremes. You can get a good idea of the topics discussed at EVA 2015 from the book of abstracts, which includes an author index as well as a keyword index. The conference organizers are in the process of obtaining permissions to post the slides from the talks. These should be available soon.
In the meantime, have a look at the slides from two excellent presentations from the Workshop on Statistical Computing which was held the day before the main conference. Eric Gilleland's Introduction to Extreme Value Analysis provides a gentle introduction for anyone willing to look at some math. Eric begins with some motivating examples, develops some key concepts and illustrates them with R and even provides some history along the way. This quote from Emil Gumbel, a founding giant in the field, should be every modeler's mantra: “Il est impossible que l’improbable n’arrive jamais”. ("It's impossible for the improbable to never occur" -- ed)
In Modeling spatial extremes with the SpatialExtremes package, Mathieu Ribatet works through a complete example in R by fitting and evaluating a model and running simulations. This motivating slide from the presentation describes the kind of problems he is considering.
In our world of climate extremes and financial black swans there are probably few topics of more immediate concern to statisticians than EVA, but the vexing problem of dealing with missing values might be one of them. So, it was not surprising that at nearly the same time (June 18th and 19th), 150 people or so gathered on the other side of the world, in Rennes, France, for missData 2015.
Over the years, R developers have expended considerable energy creating routines for dealing with missing values. The transcan function in the Hmisc package "automatically transforms continuous and categorical variables to have maximum correlation with the best linear combination of the other variables". mice provides functions for multiple imputation based on Fully Conditional Specification, using the MICE algorithm. (See the slides from Stef van Buuren's presentation Fully Conditional Specification: Past, present and beyond for a perspective on FCS and the reading list at left.) mi provides functions for missing value imputation in a Bayesian framework, as does the BaBooN package, and VIM provides tools for visualizing the structure of missing values. Slides for almost all of the talks are available online at the conference program page, and videos will be available soon. Have a look at the slides from the lightning talk by Matthias Templ and Alexander Kowarik to see what the VIM package can do.
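To illustrate the idea behind Fully Conditional Specification, here is a minimal base-R sketch that cycles regression imputations between two variables. It is only a caricature of what mice actually does (mice draws from proper conditional distributions and produces multiple imputations, rather than a single deterministic fill), on simulated data:

```r
# Minimal FCS-style sketch: iterate conditional models, each
# variable imputed in turn given the current completed data.
set.seed(42)
n <- 200
x <- rnorm(n)
y <- 2 * x + rnorm(n)
x[sample(n, 30)] <- NA          # poke holes in both variables
y[sample(n, 30)] <- NA

xi <- ifelse(is.na(x), mean(x, na.rm = TRUE), x)  # crude starting fill
yi <- ifelse(is.na(y), mean(y, na.rm = TRUE), y)

for (iter in 1:20) {            # cycle through the conditional models
  fit_y <- lm(yi ~ xi)          # impute y given the current x
  yi[is.na(y)] <- predict(fit_y)[is.na(y)]
  fit_x <- lm(xi ~ yi)          # impute x given the current y
  xi[is.na(x)] <- predict(fit_x)[is.na(x)]
}
cor(xi, yi)                     # completed data preserve the association
```

The real mice package wraps this cycling in proper stochastic draws and repeats it to produce several completed data sets, so that the uncertainty due to imputation can be propagated into the final analysis.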
Revolution Analytics was very pleased to have been able to sponsor both of these conferences. For the next EVA mark your calendars to visit Delft, the Netherlands in 2017.
by Joseph Rickert
Last week, I was fortunate enough to attend the R Summit & Workshop, an invitation-only event held at the Copenhagen Business School. The abstracts for the public talks presented are online and well worth a look. Collectively they provide a snapshot of the state of development of R and the R community, as well as some insight into the directions in which researchers are moving to expand the boundaries of R.
Real highlights of the event were talks by Jennifer Bryan and Mine Çetinkaya-Rundel, two educators who are channeling enormous amounts of energy into teaching statistics and statistical programming, and into developing new pedagogical methods to improve the learning experience for both students and teachers alike. Both Mine and Jennifer are committed to R, as well as to using state of the art developer tools such as RStudio, R Markdown, Git and GitHub.
If you clicked on the link to Jennifer’s university home page above and expected to see more content there you are probably not running with the in crowd. Anybody who aspires to bask in the faintest glow of tech cool is hanging out on GitHub. So go to github.com/jennybc to find a place where social media, software development and best practices collide to generate a state-of-the-art learning platform.
What Jennifer has created there is the R version of a full immersion experience for learning a new language. Just like you can’t separate the highs and awkward lows of human interactions from tripping over the grammar while learning Japanese on the streets of Tokyo, at jennybc you have to cope with GitHub, R Markdown and complying with best practices while doing your homework, seeking help, and learning from your peers.
As Jennifer writes in her synopsis of her R Summit & Workshop talk:
I've formed strong opinions about workflows for R Markdown + GitHub and what the big wins are.
Mine, who focuses on undergraduate education, points out that: “R is attractive because, unlike software designed specifically for courses at this level, it is relevant beyond the introductory statistics classroom, and is more powerful and flexible.” Poke around Mine’s GitHub page and you will see that she is all about R, open source, reproducibility and teaching good habits and right values. In addition to her work in the classroom, Mine has developed the Coursera course: Data Analysis and Statistical Inference, is a coauthor of three, free R based textbooks and is a driving force behind the ASA Datafest competitions.
Students are often ambivalent as to whether they are looking for education or training. But if you are a student of either Mine or Jennifer you are going to get some of both, and have a real shot at launching a productive, R-fueled career.
by Andrie de Vries
Today is the first day of the useR! 2015 conference in Aalborg in northern Denmark. But yesterday was a day packed with 16 tutorials on a range of interesting topics. I submitted a proposal many months ago to run a session on using R in Hadoop and was very happy to be selected to run a session in the morning.
When we first started planning the session, we set a Big Hairy Audacious Goal to run the session using a HortonWorks Hadoop cluster hosted in the Microsoft Azure cloud.
We trialled the session at the Birmingham R user group during May, and then again last week during a Microsoft internal webinar. In both cases, the cluster performed great.
Then I asked the organisers how many people had expressed an interest, and I heard that 66 people said they might come and that the room would seat 144 people.
At this point I started having cold sweats!
As luck would have it, minutes before start time something went wrong. The cluster would not start, and participants were unable to access the server for the first hour of the tutorial. Fortunately, we had a back-up plan.
You can take a vicarious tour through the analysis of the NYC Taxi Cab data with the rmarkdown slides built with the RStudio presentation suite.
Although I did not cover all the material during the session, the full set of presentations is:
2. Analysing New York taxi data with Hadoop
3. Computing on distributed matrices
4. Using RHive to connect to the hive database
(These presentations, sample data, all the scripts and examples live in github at https://github.com/andrie/RHadoop-tutorial. Use the tag https://github.com/andrie/RHadoop-tutorial/tree/2015-06-30-UseR!2015 to find the state of the repository at the time of the tutorial.)
The central example throughout is to use mapreduce() in the rmr2 package to summarize New York taxi journeys, grouped by hour of the day. This is a dataset of ~200GB of uncompressed csv files.
You can read more about how this data came into the public domain at http://chriswhong.com/open-data/foil_nyc_taxi/. Here is a simple plot of the results, computed from a 1-in-1000 sample of the data:
The full code to do this is at available in a github repository at https://github.com/andrie/RHadoop-tutorial. Here is an extract:
library(rmr2)
library(rhdfs)

hdfs.init()
rmr.options(backend = "hadoop")
hdfs.ls("taxi")$file

homeFolder <- file.path("/user", Sys.getenv("USER"))
taxi.hdp <- file.path(homeFolder, "taxi")

# Read the column names and classes from the data dictionary
headerInfo <- read.csv("data/dictionary_trip_data.csv", stringsAsFactors = FALSE)
colClasses <- as.character(as.vector(headerInfo[1, ]))
taxi.format <- make.input.format(format = "csv", sep = ",",
                                 col.names = names(headerInfo),
                                 colClasses = colClasses,
                                 stringsAsFactors = FALSE)

# Map: extract the pickup timestamp (column 6) and count trips per hour
taxi.map <- function(k, v){
  original <- v[[6]]
  date  <- as.Date(original, origin = "1970-01-01")
  wkday <- weekdays(date)
  hour  <- format(as.POSIXct(original), "%H")
  dat   <- data.frame(date, hour)
  z     <- aggregate(date ~ hour, dat, FUN = length)
  keyval(z[[1]], z[[2]])
}

# Reduce: sum the per-chunk counts for each hour
taxi.reduce <- function(k, v){
  data.frame(hour = k, trips = sum(v), row.names = k)
}

m <- mapreduce(taxi.hdp, input.format = taxi.format,
               map = taxi.map,
               reduce = taxi.reduce)
dat <- values(from.dfs(m))

# Plot trips by hour of day
library("ggplot2")
p <- ggplot(dat, aes(x = hour, y = trips, group = 1)) +
  geom_smooth(method = loess, span = 0.5,
              col = "grey50", fill = "yellow") +
  geom_line(col = "blue") +
  expand_limits(y = 0) +
  ggtitle("Sample of taxi trips in New York")
p
by Joseph Rickert
In a little over three weeks useR! 2015 will convene in Aalborg, Denmark and I am looking forward to being there and learning and talking about R user groups. The following map shows the big picture for R User Groups around the world.
However, it is very difficult to keep it up to date. Just after the map "went to press" I learned that a new user group formed in Norfolk Virginia last month. In fact, at least 11 new R user groups have formed so far this year.
RUG | City | Country | Date Founded | No. Members | Website
Berlin R Users Group | Berlin | Germany | 1/5/2015 | 133 | http://www.meetup.com/Berlin-R-Users-Group/
Rhus - useR Group | Aarhus | Denmark | 1/19/2015 | 87 | http://www.meetup.com/Rhus-useR-group
Trenton R Users (TRU) | Trenton | US | 1/27/2015 | 50 | http://www.meetup.com/TRUgroup/
Honolulu R Users Group | Honolulu | US | 1/30/2015 | 36 | http://www.meetup.com/Honolulu-R-Users-Group/
Oslo useR! Group | Oslo | Norway | 2/26/2015 | 89 | http://www.meetup.com/Oslo-useR-Group/
St. Petersburg R User Group | St Petersburg | Russia | 3/13/2015 | 50 | http://www.meetup.com/St-Petersburg-R-User-Group/
Nijmegen eveRybody | Nijmegen | Netherlands | 4/11/2015 | 23 | http://www.meetup.com/Nijmegen-eveRybody/
AthensR | Athens | Greece | 4/16/2015 | 30 | http://www.meetup.com/AthensR/
757 R Users Group | Norfolk | US | 4/22/2015 | 18 | http://www.meetup.com/757-R-Users-Group/
LubbockR Meetup | Lubbock | US | 5/5/2015 | 23 | http://www.meetup.com/lubbockR-Meetup/
R Kazan | Kazan | Russia | 5/10/2015 | 11 | http://www.meetup.com/R-Kazan/
Moreover, judging by the more than 4,000 users who showed up for the R China conference this week, China is probably severely underrepresented. There are very likely a few more user groups there, and in the rest of Asia, for us to learn about.
From a tweet by Kun Ren
RUG | City | State | Country | Date Founded | Num. Members | Website | Platform
1 | Adelaide R-users group | Adelaide | SA | Australia | 10/1/2011 | 94 | http://www.meetup.com/Adelaide-R-users-group/ | Meetup
2 | Albany R Users Group | Albany | NY | United States | 3/20/2014 | 104 | http://www.meetup.com/Albany-R-Users-Group/ | Meetup
3 | amst-R-dam | Amsterdam | | Netherlands | 9/9/2010 | 483 | http://www.meetup.com/amst-R-dam/ | Meetup
4 | Turkish Community of R | Ankara | | Turkey | 3/8/2013 | 181 | http://www.meetup.com/tcr-users/ | Meetup
5 | Rhus - useR group | Aarhus | | Denmark | 1/19/2015 | 87 | http://www.meetup.com/Rhus-useR-group | Meetup
6 | AthensR | Athens | | Greece | 4/16/2015 | 30 | http://www.meetup.com/AthensR/ | Meetup
We would like this information as well as our R User Group Directory to be as accurate as possible and would very much appreciate corrections, additions and subtractions from user group organizers.
If you are going to Aalborg and would like to chat about R User Groups please come by the Revolution Analytics / Microsoft table.
Here is the code used to draw the map: Download Code_for_RUGs_Map.
At last month's BUILD conference for Microsoft developers in San Francisco, R was front-and-center on the keynote stage.
In the keynote, Microsoft CVP Joseph Sirosh introduced the "language of data": open source R. Sirosh encouraged the audience to learn R, saying "if there is a single language that you choose to learn today .. let it be R".
The keynote featured a demonstration of genomic data analysis using R. The analysis was based on the 1000 Genomes data set stored in the HDInsight Hadoop-in-the-cloud service. Revolution R Enterprise, running on eight Hadoop clusters distributed around the globe (about 1600 cores in total), together with R's Bioconductor suite (specifically the VariantTools and gmapR packages), was used to perform 'variant calling' and calculate, in parallel, the disease risks indicated by a subset of the 1000 Genomes data. The result was an interactive heat map showing the disease risks for each individual.
The heat map was created by Winston Chang and Joe Cheng from RStudio as an htmlwidget using the D3heatmap package. (You can interact with a variant of the heatmap from the demo here.)
The next part of the demo was to compare an individual's disease risks — as indicated by his or her DNA — to the population. Joseph Sirosh had his own DNA sequence for this purpose, which he submitted via a Windows Phone app to an Azure service running R. This is easy to do with Azure ML Studio: just put your R code as part of a workflow, and an API will automatically be generated on request. In this way you can publish any R code as an API to the cloud, which is then callable by any connected application.
You can watch the entire keynote presentation below, and the R demo begins at around the 23 minute mark.
The R/Finance 2015 Conference wrapped up last Saturday at UIC. It has been seven years already, but R/Finance still has the magic: mostly very high quality presentations and the opportunity to interact and talk shop with some of the most accomplished R developers, financial modelers and even a few industry legends such as Emanuel Derman and Blair Hull.
Emanuel Derman led off with a provocative but extraordinary keynote talk. Derman began way out there, somewhere well beyond the left field wall recounting the struggle of Johannes Kepler to formulate his three laws of planetary motion and closed with some practical advice on how to go about the business of financial modeling. Along the way he shared some profound, original thinking in an attempt to provide a theoretical context for evaluating and understanding the limitations of financial models. His argument hinged on making and defending the distinction between theories and models. Theories such as physical theories of Kepler, Newton and Einstein are ontological: they attempt to say something about how the world is. A theory attempts to provide "absolute knowledge of the world". A model, on the other hand, "tells you about what some aspect of the world is like". Theories can be wrong, but they are not the kinds of things you can interrogate with "why" questions.
Models work through analogies and similarities. They compare something we understand to something we don't. Spinoza's Theory of emotions is a theory because it attempts to explain human emotions axiomatically from first principles.
The Black Scholes equation, by contrast, is a model that tries to provide insight through the analogy with Brownian motion. As I understood it, the practical advice from all of this is to avoid the twin traps of attempting to axiomatize financial models as if they directly captured reality, and of believing that analyzing data, no matter how many terabytes you plow through, is a substitute for an educated intuition about how the world is.
The following table lists the remaining talks in alphabetical order by speaker.
# | Presentation | Package | Package Location |
1 | Rohit Arora: Inefficiency of Modified VaR and ES | ||
2 | Kyle Balkissoon: A Framework for Integrating Portfolio-level Backtesting with Price and Quantity Information | PortFolioAnalytics | |
3 | Mark Bennett: Gaussian Mixture Models for Extreme Events | ||
4 | Oleg Bondarenko: High-Frequency Trading Invariants for Equity Index Futures | ||
5 | Matt Brigida: Markov Regime-Switching (and some State Space) Models in Energy Markets | code for regime switching | GitHub |
6 | John Burkett: Portfolio Optimization: Price Predictability, Utility Functions, Computational Methods, and Applications | DEoptim | CRAN |
7 | Matthew Clegg: The partialAR Package for Modeling Time Series with both Permanent and Transient Components | partialAR | CRAN |
8 | Yuanchu Dang: Credit Default Swaps with R (with Zijie Zhu) | CDS | GitHub |
9 | Gergely Daroczi: Network analysis of the Hungarian interbank lending market | ||
10 | Sanjiv Das: Efficient Rebalancing of Taxable Portfolios | ||
11 | Sanjiv Das: Matrix Metrics: Network-Based Systemic Risk Scoring | ||
12 | Emanuel Derman: Understanding the World | ||
13 | Matthew Dixon: Risk Decomposition for Fund Managers | ||
14 | Matt Dowle: Fast automatic indexing with data.table | data.table | CRAN |
15 | Dirk Eddelbuettel: Rblpapi: Connecting R to the data service that shall not be named | Rblpapi | GitHub |
16 | Markus Gesmann: Communicating risk - a perspective from an insurer | ||
17 | Vincenzo Giordano: Quantifying the Risk and Price Impact of Energy Policy Events on Natural Gas Markets Using R (with Soumya Kalra) | ||
18 | Chris Green: Detecting Multivariate Financial Data Outliers using Calibrated Robust Mahalanobis Distances | CerioliOutlierDetection | CRAN |
19 | Rohini Grover: The informational role of algorithmic traders in the option market | ||
20 | Marius Hofert: Parallel and other simulations in R made easy: An end-to-end study | simsalapar | CRAN |
21 | Nicholas James: Efficient Multivariate Analysis of Change Points | ecp | CRAN |
22 | Kresimir Kalafatic: Financial network analysis using SWIFT and R | ||
23 | Michael Kapler: Follow the Leader - the application of time-lag series analysis to discover leaders in S&P 500 | SIT | other |
24 | Ilya Kipnis: Flexible Asset Allocation With Stepwise Correlation Rank | ||
25 | Rob Krzyzanowski: Building Better Credit Models through Deployable Analytics in R | ||
26 | Bryan Lewis: More thoughts on the SVD and Finance | ||
27 | Yujia Liu and Guy Yollin: Fundamental Factor Model DataBrowser using Tableau and R | factorAnalytics | RFORGE |
28 | Louis Marascio: An Outsider's Education in Quantitative Trading | ||
29 | Doug Martin: Nonparametric vs Parametric Shortfall: What are the Differences? | ||
30 | Alexander McNeil: R Tools for Understanding Credit Risk Modelling | ||
31 | William Nicholson: Structured Regularization for Large Vector Autoregression | BigVAR | GitHub |
32 | Steven Pav: Portfolio Cramer-Rao Bounds (why bad things happen to good quants) | SharpeR | CRAN |
33 | Jerzy Pawlowski: Are High Frequency Traders Prudent and Temperate? | HighFreq | GitHub |
34 | Bernhard Pfaff: The sequel of cccp: Solving cone constrained convex programs | cccp | CRAN |
35 | Stephen Rush: Information Diffusion in Equity Markets | ||
36 | Mark Seligman: The Arborist: a High-Performance Random Forest Implementation | Rborist | CRAN |
37 | Majeed Simaan: Global Minimum Variance Portfolio: a Horse Race of Volatilities | ||
38 | Anthoney Tsou: Implementation of Quality Minus Junk | qmj | GitHub |
39 | Marjan Wauters: Characteristic-based equity portfolios: economic value and dynamic style allocation | ||
40 | Hadley Wickham: Data ingest in R | readr | CRAN |
41 | Eric Zivot: Price Discovery Share-An Order Invariant Measure of Price Discovery with Application to Exchange-Traded Funds |
I particularly enjoyed Sanjiv Das' talks on Efficient Rebalancing of Taxable Portfolios and Matrix Metrics: Network Based Systemic Risk Scoring, both of which are approachable by non-specialists. Sanjiv became the first person to present two talks at an R/Finance conference, and thus the first person to win one of the best presentation prizes with the judges unwilling to say which of his two presentations secured the award.
Bryan Lewis' talk: More thoughts on the SVD and Finance was also notable for its exposition. Listening to Bryan you can almost fool yourself into believing that you could develop a love for numerical analysis and willingly spend an inordinate amount of your time contemplating the stark elegance of matrix decompositions.
Alexander McNeil's talk: R Tools for Understanding Credit Risk Modeling was a concise and exceptionally coherent tutorial on the subject, an unusual format for a keynote talk, but something that I think will be valued by students when the slides for all of the presentations become available.
Going out on a limb a bit, I offer a few un-researched but strong impressions of the conference. This year, to a greater extent than I remember in previous years, talks were built around particular packages; talks 5, 7 and 8, for example. Also, it seemed that authors were more comfortable highlighting and sharing packages that are works in progress, residing not on CRAN but on GitHub, R-Forge and other platforms. This may reflect a larger trend in R culture.
This is the year that cointegration replaced correlation as the operative concept in many models. The quants are way out ahead of the statisticians and data scientists on this one. Follow the money!
Speaking of data scientists: if you are a Random Forests fan do check out Mark Seligman's Rborist package, a high-performance and extensible implementation of the Random Forests algorithm.
Network analysis also seemed to be an essential element of many presentations. Gergely Daróczi's Shiny app for his analysis of the Hungarian interbank lending network is a spectacular example of how interactive graphics can enhance an analysis.
Finally, I'll finish up with some suggested reading in preparation for studying the slides of the presentations when they become available.
Sanjiv Das: Efficient Rebalancing of Taxable Portfolios
Sanjiv Das: Matrix Metrics: Network-Based Systemic Risk Scoring
Emanuel Derman: Models.Behaving.Badly
Jurgen A. Doornik and R.J. O'Brien: Numerically Stable Cointegration Analysis (A recommendation from Bryan Lewis)
Arthur Koestler: The Sleepwalkers (I am certain this is the book whose title Derman forgot.)
Alexander J. McNeil and Rüdiger Frey: Quantitative Risk Management: Concepts, Techniques and Tools
Bernhard Pfaff: Analysis of Integrated and Cointegrated Time Series with R (Use R!)
by Joseph Rickert
The 8th XLDB (Extremely Large Databases) Conference opened at Stanford on Tuesday with an outstanding program. This conference has been providing leadership in the "Big Data" world since its first workshop, held in 2007. For example, the summary report for that year notes: "Both communities (industry and science) are moving towards parallel ... architectures on large clusters of commodity hardware, with the map/reduce paradigm as the leading processing model." but also observes that: "The map/reduce paradigm ... will likely not be the final answer" — prescience and a sober assessment with none of the hype that was to follow.
The extraordinary feature of the first day of this year's conference was the prominence of R. Several talks were either directly about R, or discussed R in conjunction with a significant subtopic. John Chambers spoke on "R in the World: Interfaces between Languages". Karim Chine presented ElasticR. Hannes Mühleisen elaborated on some innovative ideas in his talk "R as a Query Language", describing a system for using R to write effective queries based on Renjin (R on the JVM). Jeff Lefevre discussed HP's Distributed R in his talk on "Extending Vertica with External Analytics". Rene Brun described the Root-R package and Rcpp in his talk about "ROOT: a Data Storage and Analysis Framework" used at CERN, and Nachum Shacham mentioned both R and the R/H2O/Hadoop interface in his opening talk: "On the Practice of Predictive Modeling with Big Data".
Even Stephen Wolfram obliquely referred to R! He began his special keynote talk, and very impressive impromptu demo of the Wolfram Language, with a statement that went something like this: "Unlike other languages that have a very small core and add features through packages, we decided to build as much as possible into the language". The exact quote will have to wait until the video is available, but it very much seemed to me that, at least with respect to design, he was positioning the Wolfram Language (the combination of Mathematica and Wolfram Alpha) as a kind of anti-R!
The slides from all of the talks will be available on the conference program page in a couple of days, and the conference videos will follow in June. In the meantime, through the kindness of Hannes Mühleisen and the conference organizers, we have Hannes' slides and those of Rene Brun and John Chambers available for download.
The following slide from Hannes' presentation indicates how R might be made more efficient through certain SQL sensibilities, and seems to share the spirit of data.table.
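To give a flavor of what those "SQL sensibilities" already look like in R (my own toy example, not one of Hannes' — and it assumes the data.table package is installed), data.table expresses a SELECT ... GROUP BY directly in its indexing syntax:

```r
library(data.table)

# Toy table of per-trade records
dt <- data.table(
  ticker = c("A", "A", "B", "B", "B"),
  price  = c(10, 12, 5, 6, 7)
)

# Roughly: SELECT ticker, AVG(price) AS avg_price, COUNT(*) AS n
#          FROM dt GROUP BY ticker
res <- dt[, .(avg_price = mean(price), n = .N), by = ticker]
res
```

The grouping and aggregation happen inside the `[` call itself, which is exactly the query-like economy of expression the slide is getting at.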
Rene's presentation contains several informative slides. Be sure to check out slide 11, which shows when C/C++ overtook Fortran; slide 29, which gives an overview of the core ROOT Math/Stat libraries; and slide 40, which shows how R, Rcpp and RInside fit in.
John Chambers's presentation begins with a reminder that the original S language was initially conceived as an interface to the Fortran libraries, the outstanding computational resource of the day, and then stresses that R's interfaces to other languages and resources, such as databases, are one of its greatest strengths.
He then elaborates on his three principles for understanding R and describes the motivations, architecture and design of the new group of "XR" packages he is working on. When complete, these will provide a uniform interface to languages as diverse as Python and Julia and provide proxies to objects, functions and classes that will benefit both end-user programmers and developers.
If XLDB 8 turns out to be as prescient as its predecessors at pointing to the direction in which big databases will go, then the future will bring some pretty exciting developments to R.
Downloads:
Download Tues_Hannes_Muehleisen
Download 4_Tues_ReneBrun_XLDB
Download 9_Tues_Chambers - XLDB Conference
by Joseph Rickert
Last Friday and Saturday the NY R Conference briefly lit up Manhattan's Union Square neighborhood as the center of the R world. You may have caught some of the glow on Twitter. Jared Lander, volunteers from the New York Open Statistical Programming Meetup, and the staff at Workbench (the conference venue) set the bar pretty darn high for a first-time conference.
The list of speakers was impressive (a couple of the presentations approached the sublime), the venue was bright and upscale, the food was good, and although some of the best talks ran way over the time limit, somehow the clock slowed down to sync with the schedule.
But the best part of the conference was the vibe! It was a sweet brew of competency, cooperation and fun. The crowd, clearly out to enjoy themselves, provided whatever lift the speakers needed to be at the top of their game. For example, when near the very end of the second day Stefan Karpinski's PC just "up and died" as he was about to start his Julia-to-R demo, the crowd hung in there with him and Stefan managed an engaging, ad lib, no-visuals 20-minute talk. It was also uncanny how the talks seemed to be arranged in just the right order. Mike Dewar, a data scientist with the New York Times, gave the opening presentation, which featured some really imaginative and impressive data visualizations that wowed the audience. But Bryan Lewis stole back the thunder, and the applause, later in the morning when, as part of his presentation on htmlwidgets, he reproduced Dewar's finale viz with mushroom data.
Bryan has posted his slides on his site here along with a promise to post all of the code soon.
The slides from all of the presentations have yet to be posted on the NY R Conference website. So, all I can do here today is to provide an opportunity sample drawn from postings I have managed to find scattered about the web. Here are Winston Chang's talk on Dashboarding with Shiny, Jared Lander's talk on Making R Go Faster and Bigger, Wes McKinney's talk on Data Frames, my talk on Reproducibility with the checkpoint package, and Max Richman's talk on R for Survey Analysis.
For the rest of the presentations, we will have to wait for the slides to become available on the conference site. There is a lot to look forward to: Vivian Peng's presentation on Storytelling and Data Visualization will be worth multiple viewings and you will not want to miss Hilary Parker's hilarious "grand slam" talk on Reproducible Analysis in Production featuring explainR and complainR. But for sure, look for Andrew Gelman's talk: But When You Call Me A Bayesian I Know I'm Not the Only One. Gelman delivered what was possibly the best technical talk ever, but we will have to wait for the conference video to reassess that.
Was Gelman's talk really the best ever, or was it just the magic of his delivery and the mood of the audience that made it seem so? Either way, I'm glad I was there.