The militarization of local police departments here in the US has been much in the news lately, and in June the New York Times published an in-depth article on how materiel from wars has ended up in the hands of US counties. Besides the traditional reporting, it's a fantastic piece of data journalism: the Times submitted a freedom-of-information request to the Defense Department for the items, their value, and the date they were provided to each county, and published the data on GitHub. Here's a small snippet of the data:
The Times also published an interactive map of the data, aggregated by county. What that map doesn't show is the element of time, and the rate at which materials were being supplied from 2006 to 2012. Andrew Cooper, an associate professor at Simon Fraser University, used the Times data and the R programming language to create this animation of the supply of materials throughout the US over the past 8 years:
You can find a link to the complete NYT article on the topic below.
New York Times: War Gear Flows to Police Departments
I was visiting Napa Valley over the weekend, and at around 3:30AM on Sunday morning I awoke suddenly to what felt like some giant at the end of the bed shaking it as hard as he could. It was an earthquake. One of the scariest things about an earthquake is that when it happens, you have no idea how serious it is — you only know what it feels like where you are. So of course I turned to Twitter (thank you Twitter!) and the USGS was on the case — it was a 6.1 (later revised to 6.0), with an epicenter about 25 miles away.
It was actually a relief to find out that the epicenter was so close: I was worried that it might have been closer to San Francisco, which would have meant the shaking we felt was indicative of a much more catastrophic event. Still, it was hard to go back to sleep after it happened, and of course I wasn't the only one. Jawbone, the makers of the Jawbone UP sleep-tracking device, released this chart showing wearers being woken up by the quake and then slowly going back to sleep. Those closer to the epicenter took longer to recover.
You can read more about the data collection and analysis behind this chart at the Jawbone blog post linked below.
Jawbone blog: How the Napa earthquake affected Bay Area sleepers
by Joseph Rickert
Last week, I posted a list of sessions at the Joint Statistical Meetings related to R. As it turned out, that list was only the tip of the iceberg. In some areas of statistics, such as graphics, simulation, and computational statistics, the use of R is so prevalent that people working in the field often don't think to mention it. For example, in the session New Approaches to Data Exploration and Discovery, which included the presentation on the Glassbox package that figured in my original list, R was important to the analyses underlying nearly all of the talks in one way or another. The following are synopses of the talks in that session, along with some pointers to relevant R resources.
Exploring Huge Collections of Scatterplots
Statistics and visualization legend Leland Wilkinson of Skytree showed off ScagExplorer, a tool he built with Tuan Dang of the University of Illinois at Chicago to explore scagnostics (a contraction of "Scatterplot Diagnostics" coined by John and Paul Tukey in the 1980s). ScagExplorer makes it possible to look for anomalies and search for similar distributions in a huge collection of scatterplots. (The example Leland showed contained 124K plots.) The ideas and many of the visuals for the talk can be found in the paper ScagExplorer: Exploring Scatterplots by Their Scagnostics. ScagExplorer is a Java-based tool, but R users can work with the scagnostics package written by Leland Wilkinson and Anushka Anand in 2007.
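To get a feel for what the scagnostics measures look like, here is a minimal sketch using the scagnostics CRAN package mentioned above (the data is synthetic; only the `scagnostics()` call comes from the package):

```r
# A minimal sketch, assuming the 'scagnostics' CRAN package is installed.
library(scagnostics)

# Two variables with a strong nonlinear relationship
set.seed(1)
x <- rnorm(200)
y <- x^2 + rnorm(200, sd = 0.2)

# Compute the scagnostic measures (Outlying, Skewed, Clumpy, Monotonic, ...)
# for this single scatterplot; ScagExplorer does this for thousands at once
s <- scagnostics(x, y)
round(s, 2)
```

Running this over every pair of variables in a dataset yields one row of measures per scatterplot, which is the representation ScagExplorer clusters and searches.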
Glassbox: An R Package for Visualizing Algorithmic Models
Google’s Max Ghenis presented work he did with fellow Googlers Ben Ogorek and Estevan Flores. Glassbox is an R application that attempts to provide transparency to “black-box” algorithmic models such as Random Forests. Among other things, it calculates and plots the collective importance of groups of variables in such a model. The slides for the presentation are available, as is the package itself. Google is using predictive modeling and tools such as glassbox to better understand the characteristics of its workforce and to ask important, reflective questions such as “How can we better understand diversity?” The company also does HR modeling to see if what they know about people can give them a competitive edge in hiring. For example, Google uses data collected from people who have interviewed at the company in the past, but who have not received offers, to try to understand Google's future hiring needs. The coolest thing about this presentation was that these guys work for the Human Resources Department! If you work for a tech company, go down to HR and see if you can get some help with Random Forests.
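To illustrate the idea of "collective importance of groups of variables" (this is a rough sketch using the randomForest package, not the glassbox API itself, and summing per-variable permutation importances is a simplification of what a grouped-permutation approach would do):

```r
# Rough illustration of grouped variable importance with 'randomForest'
# (not the glassbox API; shown only to convey the concept).
library(randomForest)
set.seed(1)
fit <- randomForest(Species ~ ., data = iris, importance = TRUE)
imp <- importance(fit, type = 1)   # mean decrease in accuracy, per variable

# Compare the collective importance of the petal vs the sepal measurements
sum(imp[c("Petal.Length", "Petal.Width"), ])
sum(imp[c("Sepal.Length", "Sepal.Width"), ])
```

A true grouped measure would permute all variables in a group jointly and measure the accuracy drop, which respects correlations within the group.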
A Web Application for Efficient Analysis of Peptide Libraries
Eric Hare of Iowa State University introduced PeLica, work he did with colleagues Timo Sieber of University Medical Center Hamburg-Eppendorf and Heike Hofmann of Iowa State University. PeLica is an interactive Shiny application to help assess the statistical properties of peptide libraries. PeLica’s creators refer to it as a Peptide Library Calculator that acts as a front end to the R package peptider, which contains functions for evaluating the diversity of peptide libraries. The authors have done an exceptional job of using the documentation features available in Shiny to make their app a teaching tool.
To Merge or Not to Merge: An Interactive Visualization Tool for Local Merges of Mixture Model Components
Elizabeth Lorenzi of Carnegie Mellon showed the prototype for an interactive visualization tool that she is working on with Rebecca Nugent of Carnegie Mellon and Nema Dean of the University of Glasgow. The software calculates inter-component similarities of mixture model component trees and displays them as hierarchical dendrograms. Elizabeth and her colleagues are implementing this tool as an R package.
An Interactive Visualization Platform for Interpreting Topic Models
Carson Sievert of Iowa State University presented LDAvis, a general framework for visualizing topic models that he is building with Kenny Shirley of AT&T Labs. LDAvis is interactive R software that enables users to interpret and compare topics by highlighting keywords. The theory is nicely described in a recent paper, and the examples on Carson’s GitHub page are instructive and fun to play with. In the plot below, circle 26, representing a topic, has been selected. The bar chart on the right displays the 30 most relevant terms for this topic. The red bars represent the frequency of a term in a given topic (proportional to p(term | topic)), and the gray bars represent a term's frequency across the entire corpus (proportional to p(term)).
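The LDAvis workflow boils down to passing the fitted quantities of a topic model to `createJSON()` and launching the browser view with `serVis()`. Here is a sketch with toy random inputs (in real use, phi and theta come from a model fit with a package such as lda or topicmodels; everything below except the two LDAvis calls is made-up scaffolding):

```r
# Sketch of the LDAvis workflow with toy inputs.
library(LDAvis)
set.seed(1)
K <- 5; W <- 50; D <- 20   # topics, vocabulary size, documents

# Row-normalized random matrices standing in for fitted model output:
phi   <- t(apply(matrix(runif(K * W), K), 1, function(r) r / sum(r)))  # topic-term
theta <- t(apply(matrix(runif(D * K), D), 1, function(r) r / sum(r)))  # doc-topic
doc.length     <- sample(50:100, D, replace = TRUE)  # tokens per document
vocab          <- paste0("term", 1:W)
term.frequency <- sample(1:100, W, replace = TRUE)   # corpus-wide term counts

json <- createJSON(phi = phi, theta = theta, doc.length = doc.length,
                   vocab = vocab, term.frequency = term.frequency)
# serVis(json)   # opens the interactive topic browser described above
```

The JSON string encodes the topic positions (from multidimensional scaling) and the per-topic term rankings that drive the red/gray bar chart.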
Gravicom: A Web-Based Tool for Community Detection in Networks
Andrea Kaplan showed off an interactive application that she and her Iowa State University team members, Heike Hofmann and Daniel Nordman, are building. Gravicom is an interactive web application based on Shiny and the D3 JavaScript library that lets a user manually collect nodes into clusters in a social network graph and then save this grouping information for subsequent processing. The idea is that eyeballing a large social network and selecting “obvious” groups may be an efficient way to initialize a machine learning algorithm. Have a look at the live demo.
Human Factors Influencing Visual Statistical Inference
Mahbubul Majumder of the University of Nebraska presented joint work done with Heike Hofmann and Dianne Cook, both of Iowa State University, on identifying key factors, such as demographics, experience, training, or even the placement of figures in an array of plots, that may be important for the human analysis of visual data.
As global warming causes sea levels to rise, the risk of flooding for coastal settlements also rises over time. A recent analysis by Reuters finds that incidents of coastal flooding along the Eastern seaboard of the United States have surged in recent years as the sea level steadily rises.
Flood thresholds have been exceeded in six eastern cities on an average of 20 days or more since 2001. The analysis is based on more than 25 million hourly tide-gauge readings compared to flood thresholds set by the National Oceanic and Atmospheric Administration (NOAA).
As noted by reporter Ryan McNeill on Twitter, the analysis was done in R, with Ruby used to download and process the data from NOAA and the RinRuby gem used to feed it into R:
@dancow @znmeb I used Ruby to download and process the data, then I fed it into R using this gem: https://t.co/GGgnU5d04b
— Ryan McNeill (@McNeill_Tweets) July 10, 2014
Reuters plans to publish a broad examination of rising sea levels later this year. You can read the July 10 article at the link below.
Reuters: Exclusive: Coastal flooding has surged in U.S., Reuters finds (via Sharon Machlis)
by Joseph Rickert
UseR! 2014 got under way this past Monday with a very impressive array of tutorials, delivered on a day when the conference organizers were struggling to cope with a record-breaking crowd. My guess is that conference attendance is somewhere in the 700 range. Moreover, this is the first year that I can remember that tutorials were free. The combination made for jam-packed tutorials.
The first thing that jumps out just by looking at the tutorial schedule is the effort that RStudio made to teach at the conference. Of the sixteen tutorials given, four were presented by the RStudio team: Winston Chang conducted an introduction to interactive graphics with ggvis; Yihui Xie presented Dynamic Documents with R and knitr; Garrett Grolemund taught interactive data display with Shiny and R; and Hadley Wickham taught data manipulation with dplyr. Shiny is still captivating R users. I was particularly struck by a conversation I had with an undergraduate Stats major who seemed to be genuinely pleased and excited about being able to build her own Shiny apps. Kudos to the RStudio team.
Bob Muenchen brought this same kind of energy to his introduction to managing data with R. Bob has extensive experience with SAS and Stata, and he seems to have a gift for anticipating areas where someone proficient in one of these packages, but new to R, might have difficulty.
Matt Dowle presented his tutorial on data.table and, from the chatter I heard, increased his growing list of data.table converts. Dirk Eddelbuettel presented An Example-Driven, Hands-on Introduction to Rcpp, and Romain Francois taught C++ and Rcpp11 for beginners. Both Dirk and Romain work hard to make interfacing to C++ a real option for people who are willing to make the effort.
I came late to Martin Morgan’s tutorial on Bioconductor but couldn’t get in the room. Fortunately, Martin prepared an extensive set of materials for his course which I hope to be able to work through.
Max Kuhn taught an introduction to applied predictive modeling based on the recent book he wrote with Kjell Johnson. Both the slides and code are available. Virgilio Gomez Rubio presented a tutorial on applied spatial data analysis with R; his materials are available here. Ramnath Vaidyanathan presented Interactive Documents with R. Have a look at Ramnath's slides, the map embedded in his abstract, and his screencast.
Drew Schmidt taught a workshop on Programming with Big Data in R based on the pbdR package. The course page points to a rich set of introductory resources.
I very much wanted to attend Søren Højsgaard’s tutorial on Graphical Models and Bayesian Networks with R, but couldn’t make it. I did attend John Nash's tutorial on Nonlinear parameter optimization and modeling in R and I am glad that I did. This is a new field for me and I was fortunate to see something of John’s meticulous approach to the subject.
I was disappointed that I didn’t get to attend Thomas Petzoldt’s tutorial on Simulating differential equation models in R. This is an area that is not usually associated with R. The webpage for the tutorial is really worth a look.
I don't know if the conference organizers planned it that way, but as it turned out, the tutorial subjects chosen are an excellent showcase for the depth and diversity of applications that can be approached through R. Many thanks to the tutorial teachers and congratulations to the UseR! 2014 conference organizers for a great start to the week. I hope to have more to say about the conference in future posts.
To play on a national team in the World Cup, a soccer player must be a citizen of that country. But most World Cup players don't regularly play in the nation of their World Cup team. Some hold dual citizenship; others simply play for a league team in a foreign country, where citizenship rules don't apply.
In this elegant chart, Guy Abel, a statistician and R programmer at the Vienna Institute of Demography, illustrates how the World Cup national teams are drawn from League players from around the world. (Click to enlarge.)
The arrows on the chart flow FROM the World Cup national teams TO the countries where the players currently play in league teams. Most of the players in Australia's World Cup team, for example, actually play for teams in the USA, South Korea, and European league teams. By contrast, about a third of Italy's team and almost all of Russia's play for domestic leagues (note the arrows folding back on themselves indicating players who play in home leagues).
The chart was created in the R language using the circlize package. The underlying data was scraped from Wikipedia, and the code to create this plot is available on GitHub. Guy gives several other examples (with R code) of creating such "circular migration flow plots" on his blog.
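The heart of such a plot is a flow matrix passed to circlize's `chordDiagram()`. A minimal sketch (the matrix below is toy data, not the real squad counts; note how a team whose players stay home produces a self-link, the "folding back" arrows mentioned above):

```r
# Minimal directional chord diagram with the 'circlize' package
# (toy squad-flow matrix: rows = World Cup team, cols = league country).
library(circlize)
m <- matrix(c( 0, 3, 10,
               1, 8,  4,
               0, 2, 15),
            nrow = 3, byrow = TRUE,
            dimnames = list(team   = c("AUS", "ITA", "RUS"),
                            league = c("USA", "ITA", "RUS")))

# directional = 1 draws flows from row sectors (teams) to column
# sectors (leagues); ITA -> ITA and RUS -> RUS render as self-links
chordDiagram(m, directional = 1)
```

Guy's code on GitHub does essentially this with the scraped Wikipedia counts, plus styling for colors and sector labels.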
Guy Abel: 2014 World Cup Squads
by Joseph Rickert
Last week, I had the opportunity to participate in the Second Academy of Science and Engineering (ASE) Conference on Big Data Science and Computing at Stanford University. Since the conference was held simultaneously with two other conferences, one on Social Computing and the other on Cyber Security, it was definitely not an R crowd, and not even a typical Big Data crowd. Talks from the three programs were intermixed throughout the day, so at any given moment you could find yourself looking for common ground in a conversation with mostly R-aware, but language-impartial, fellow attendees. I don't know whether this method of organization was the desperate result of necessity or genius, but I thought it worked out very well and made for a stimulating interaction dynamic. The ASE conference must have been a difficult program to set up. The organizers, however, did a wonderful job mashing talks and themes together to make for an excellent experience.
There were several very good talks at the conference; however, the tutorial on Deep Learning and Natural Language Processing given by Richard Socher was truly outstanding. Richard is a PhD student in Stanford's Computer Science Department studying under Chris Manning and Andrew Ng. Very rarely do you come across such a polished speaker with complete and casual command of complex material. And, while the delivery was impressive, the content was jaw-dropping. Richard walked through the Deep Learning methodology and tools being developed in Stanford's AI lab and showed a number of areas where Deep Learning techniques are yielding notable results; for example, a system for single-sentence sentiment detection that improved positive/negative sentence classification by 5.4%. Have a look at Andrew Ng's or Christopher Manning's lists of publications to get a good idea of the outstanding work that is being done in this area.
A key concept covered in the tutorial is the ability to represent natural language structures, parsing trees for example, in a finite-dimensional vector space, and to build the theoretical and software tools in such a way that the same methods can be used to deconstruct and represent other hierarchies. The following slide indicates how structures built for Natural Language Processing (NLP) can also be used to represent images.
This ability to bring a powerful, integrated set of tools to many different areas seems to be a key reason why neural nets and Deep Learning are suddenly getting so much attention. In a tutorial similar to the one Richard gave on Saturday, Richard and Chris Manning attribute the recent resurgence of Deep Learning to three factors:
The software used in the NLP and Deep Learning work at Stanford seems to be mostly based on Python and C. (See Theano and Senna, for example.) So far, it does not appear that much Deep Learning work at all is being done with R. However, things are looking up. 0xdata's H2O Deep Learning implementation is showing impressive results, and this algorithm is available in the h2o R package. Also, the R package darch and the very recent deepnet package, both of which offer implementations of Restricted Boltzmann Machines, indicate that Deep Learning researchers are working in R.
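As a taste of what is already possible in R, here is a quick sketch with the deepnet package mentioned above: train a small Restricted Boltzmann Machine on random data and propagate inputs through its hidden layer (the data and layer sizes are arbitrary; only the `rbm.train()`/`rbm.up()` calls come from the package):

```r
# Train a tiny Restricted Boltzmann Machine with the 'deepnet' package.
library(deepnet)
set.seed(1)
x <- matrix(runif(500), nrow = 50, ncol = 10)   # inputs scaled to [0, 1]

# Contrastive-divergence training of a 10-visible / 4-hidden RBM
r <- rbm.train(x, hidden = 4, numepochs = 5, batchsize = 10)

# Propagate the data up to the hidden layer: one activation per hidden unit
h <- rbm.up(r, x)
dim(h)   # 50 rows (observations) by 4 columns (hidden units)
```

Stacking such RBMs is the classic way to pre-train a deep belief network, which deepnet also supports via `dbn.dnn.train()`.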
Finally, for a quick overview of the area, have a look at the book Deep Learning: Methods and Applications by Li Deng and Dong Yu of Microsoft Research, which is available online.
by Joseph Rickert
I was very happy to have been able to attend R/Finance 2014, which wrapped up a couple of weeks ago. In general, the talks were at a very high level of play, some dealing with brand new ideas and many presented at a significant level of technical or mathematical sophistication. Fortunately, most of the slides from the presentations are quite detailed and available at the conference site. Collectively, these presentations provide a view of the boundaries of the conceptual space imagined by the leaders in quantitative finance. Some of this space covers infrastructure issues involving ideas for pushing the limits of R (Some Performance Improvements for the R Engine), building a new infrastructure (New Ideas for Large Network Analysis), or (Building Simple Data Caches), for example. Others are involved with new computational tools (Solving Cone Constrained Convex Programs) or attempt to push the limits on getting some actionable insight from the mathematical abstractions: (Portfolio Inference with This One Weird Trick) or (Twinkle, Twinkle Little STAR: Smooth Transition AR Models in R), for example.
But while the talks may be illuminating, the real takeaways from the conference are the R packages. These tools embody the work of the thought leaders in the field of computational finance and are the means for anyone sufficiently motivated to understand this cutting-edge work. By my count, 20 of the 44 tutorials and talks given at the conference were based on a particular R package. Some of the packages listed in the following table are well-established, and others are works-in-progress sitting out on R-Forge or GitHub, providing opportunities for the interested to get involved.
| R/Finance 2014 Talk | Description |
| --- | --- |
| Introduction to data.table | Extension of the data frame |
| An Example-Driven, Hands-on Introduction to Rcpp | Functions to facilitate integrating R with C++ |
| Portfolio Optimization: Utility, Computation, Equities Applications | Environment for teaching Financial Engineering and Computational Finance |
| Re-Evaluation of the Low Risk Anomaly via Matching | Implementation of the Coarsened Exact Matching algorithm |
| BCP Stability Analytics: New Directions in Tactical Asset Management | Bayesian analysis of change point problems |
| On the Persistence of Cointegration in Pairs Trading | Engle-Granger cointegration models |
| Tests for Robust Versus Least Squares Factor Model Fits | Robust methods |
| The R Package cccp: Solving Cone Constrained Convex Programs | Solver for convex problems with cone constraints |
| Twinkle, Twinkle Little STAR: Smooth Transition AR Models in R | Modeling smooth transition models |
| Asset Allocation with Higher Order Moments and Factor Models | Global optimization by differential evolution / numerical methods for portfolio optimization |
| Event Studies in R | Event study and extreme event analysis |
| An R Package on Credit Default Swaps | Tools for pricing credit default swaps |
| New Ideas for Large Network Analysis, Implemented in R | Implicitly restarted Lanczos methods for R |
| Package "Intermediate and Long Memory Time Series" | Simulate and detect intermediate and long memory processes (in development) |
| stochvol: Dealing with Stochastic Volatility in Time Series | Efficient Bayesian inference for stochastic volatility (SV) models |
| Divide and Recombine for the Analysis of Large Complex Data with R | Package for using R with Hadoop |
| gpusvcalibration: Fast Stochastic Volatility Model Calibration Using GPUs | Fast calibration of stochastic volatility models for option pricing |
| The FlexBayes Package | MCMC engine for hierarchical generalized linear models, with connections to WinBUGS and OpenBUGS |
| Building Simple Redis Data Caches | Rcpp bindings connecting R to the Redis key/value store |
| Package pbo: Probability of Backtest Overfitting | Uses combinatorial symmetric cross-validation to implement performance tests |
Many of these packages / projects also have supplementary material that is worth chasing down. Be sure to take a look at Alexios Ghalanos's recent post, which provides an accessible introduction to his stellar keynote address.
Many thanks to the organizers of the conference who, once again, did a superb job, and to the many professionals attending who graciously attempted to explain their ideas to a dilettante. My impression was that most of the attendees thoroughly enjoyed themselves and that the general sentiment was expressed by the last slide of Stephen Rush's presentation:
Nate Silver's departure to relaunch FiveThirtyEight.com left a bit of a hole at the New York Times, which The Upshot, the new data journalism practice at the Times, seeks to fill. And they've gotten off to a great start with the new Senate forecasting model, called Leo. Leo was created by Amanda Cox (longtime graphics editor at the NYT) and Josh Katz (creator of the Dialect Quiz), and uses a similar poll-aggregation methodology to that used by Silver. The model itself is implemented in the R language, and the R code is available for inspection at GitHub.
As of this writing, Leo suggests that the Democrats have a 51% chance of retaining the Senate in the 2014 elections. Now, probabilities are subtle things, and some commentators are prone to report from an estimate like this that "Leo predicts the Democrats will win the Senate in 2014". Of course, that would still be a risky bet to make, "essentially the same as a coin flip" as the Leo website says. As long as the probability isn't 0% or 100%, the actual outcome is always in doubt, dependent on factors we can't measure and over which we have no control. (In other words, luck.) The Upshot does a lovely job of demonstrating this variability with a feature that spins roulette wheels, loaded according to the data (mainly polls) for each race, and simulates one possible outcome.
This is a fantastic way to demonstrate the inherent variability in any statistical forecast, and it ties in with the underlying methodology as well. (For the statisticians out there, those wheels don't spin independently; correlations between individual races are taken into account.) Spin those wheels hundreds or thousands of times, count the number of times Democrats or Republicans win, and you have an estimate of the overall probability each party wins. Nicely done!
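The spin-and-count idea is easy to sketch in base R. In the toy simulation below, every quantity is made up (ten contested races with invented Democratic win probabilities and an assumed count of safe seats), and, unlike the real model, the spins are independent, ignoring the between-race correlations noted above:

```r
# Toy Monte Carlo version of the loaded-roulette-wheel idea.
# All probabilities and seat counts below are invented for illustration,
# and races are simulated independently (the real model correlates them).
set.seed(1)
p_dem    <- c(0.90, 0.70, 0.55, 0.50, 0.40, 0.35, 0.85, 0.60, 0.45, 0.30)
dem_safe <- 45       # assumed seats not in play
sims     <- 10000

# One "spin" per race per simulation; add wins to the safe-seat count
seats <- replicate(sims, dem_safe + sum(runif(length(p_dem)) < p_dem))

# Share of simulated elections in which Democrats hold 50+ seats
mean(seats >= 50)
```

Repeating the spins thousands of times and tallying outcomes is exactly the estimate-by-simulation logic the Upshot's feature demonstrates one spin at a time.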
The Upshot: Senate Forecasts (via Sharon Machlis)
If you're still working on your March Madness brackets or fantasy teams, Rodrigo Zamith has updated his NCAA Data Visualizer with the latest teams, players and results. Just choose the two teams you want to compare and the metric to compare them on, and this R-based app will show you the results instantly.
Rodrigo Zamith: Visualizing Season Performance by NCAA Tournament Teams (2014)