by Joseph Rickert
We usually have a pretty good time at the monthly Bay Area useR Group (BARUG) meetings, but this month's meeting was a bit more of a party than usual. The very well-connected PR team at Sqor Sports, our host company for the evening, secured San Francisco's très trendy 111 Minna Gallery for the venue. There was a full bar, house music for the networking portion of the meeting, gourmet grilled cheese sandwiches compliments of Revolution Analytics, and drama: Matt Dowle, one of our speakers, was on a flight that was late getting in from London.
Oh, and yes, there were three very engaging presentations, well worth standing around in the dark.
First up was Noah Gift, CTO of Sqor, a company with a mission to take sports marketing to a whole new level. They are creating a marketplace for athletes to build and promote their digital brands. Noah described how devilishly difficult it is to gather, clean and prepare the data. Correctly labeling social media data that comes from several sources and is generated by different athletes with the same name poses a number of vexing challenges.
One surprising aspect of the technology that Sqor is developing is what they call an Erlang-to-R bridge, which replaces many tasks they formerly accomplished with Python. Noah indicated that they are planning to release this code as open source.
Below is a plot from Noah's presentation showing predictions from their R-based machine learning algorithms.
Our second speaker was Stephen Elston, who gave a virtuoso live demo on using R on the Microsoft Azure Machine Learning cloud platform. Steve glided between the Azure workflow interface and running R scripts. He showed how to manipulate and transform data in both environments, go back and forth to run models in both Azure and R, and visualize results in R. Slides for Steve's talk are available, as is some R code on Steve's GitHub site. Studying the scripts will give you an idea of the features he presented.
Finally, just in from London, and still lucid at what would have been 4 AM his time, Matt Dowle walked through a summary of new features of data.table v1.9.4 and v1.9.5. There were several data.table users present, and Matt made a few new converts with a series of impressively fast benchmarks against base R. In one demo, Matt showed data.table's forder() taking only 17 seconds to sort 40 million random numerics, a task that took base R 7 minutes. According to Matt, the trick for getting this kind of performance is data.table's C-based implementation of radix sorting, which works on numeric, character and integer types with no range restrictions (recall that base::sort.list(...,method="radix") is limited to integers with range < 100,000).
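For readers who want to try a comparison of this kind, here is a rough sketch rather than a reproduction of Matt's benchmark. It assumes the data.table package is installed and uses a much smaller vector than the 40 million values in the demo. Two caveats: forder() is internal, so the sketch goes through the public setorder() instead, and recent versions of base R have themselves adopted a data.table-derived radix sort, which is why method = "shell" is requested explicitly to get a comparison-based baseline.

```r
library(data.table)

set.seed(1)
n <- 1e6L                      # illustrative size only, far below 40 million
x <- runif(n)

# Base R with a comparison-based method (needs a version of R whose order()
# accepts a method argument).
base_time <- system.time(x_base <- x[order(x, method = "shell")])

# data.table: setorder() reorders the table by reference using its radix ordering.
DT <- data.table(x = x)
dt_time <- system.time(setorder(DT, x))

print(rbind(base = base_time, data.table = dt_time)[, "elapsed"])

# Both approaches should produce the same sorted values.
stopifnot(isTRUE(all.equal(DT$x, x_base)))
```

Timings will vary with hardware, R version and data.table version, so treat the printed numbers as indicative only.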
data.table's radix sorting, which scales linearly in the number of items (that is, below the O(n log n) lower bound for comparison sorts), is based on two papers: one by Terdiman and the other by Herf. However, where both of these papers start from the least significant digit, data.table starts from the most significant digit to improve cache efficiency.
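To make the idea concrete, here is a minimal sketch of the least-significant-digit variant, written in plain R for non-negative integers only. It is not data.table's code (which is in C and works from the most significant digit); the function name lsd_radix_order is made up for this illustration. Each of the four passes is a stable bucket placement on one byte, so the number of passes is fixed and does not grow with the length of the input.

```r
# Illustrative LSD radix ordering for non-negative integers (hypothetical helper).
# Returns a permutation, like order(), built from four stable byte-wise passes.
lsd_radix_order <- function(x) {
  stopifnot(is.integer(x), !anyNA(x), all(x >= 0L))
  if (length(x) == 0L) return(integer(0))
  ord <- seq_along(x)
  for (pass in 0:3) {                         # four 8-bit digits cover 0 .. 2^31 - 1
    divisor <- as.integer(256^pass)
    digit <- (x[ord] %/% divisor) %% 256L     # current byte, in 0..255
    counts <- tabulate(digit + 1L, nbins = 256L)
    starts <- cumsum(c(0L, counts[-256L]))    # slots taken by smaller digits
    # Rank of each element within its own bucket, in current order (keeps it stable).
    rank_in_bucket <- ave(seq_along(digit), digit, FUN = seq_along)
    new_ord <- integer(length(ord))
    new_ord[starts[digit + 1L] + rank_in_bucket] <- ord
    ord <- new_ord
  }
  ord
}

x <- c(170L, 45L, 75L, 90L, 802L, 24L, 2L, 66L)
x[lsd_radix_order(x)]
```

A most-significant-digit version would instead split the data into buckets on the top byte first and then recurse within each bucket, which keeps later passes working on smaller, more cache-friendly chunks.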
Matt also demonstrated data.table's new automatic indexes (you can now use == in i, and data.table will automatically build a secondary key) as well as using dplyr syntax with data.table. Matt emphasized that this flexibility shows the power of R's object-oriented design. Matt also claimed that both Python's pandas and dplyr for R made the wrong choice in using hashing. Instead of hashing, data.table relies on fast sorting, and the resulting sort-order vector serves as an index in data.table.
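A small sketch of what automatic indexing looks like in practice follows. It assumes a reasonably recent data.table release with auto-indexing left at its default setting; the helper indices() used to inspect the table comes from later releases, so the exact function names in v1.9.4 may differ. The column names id and grp are invented for the example.

```r
library(data.table)

set.seed(42)
DT <- data.table(id  = sample(1e5L),
                 grp = sample(letters, 1e5L, replace = TRUE))

print(indices(DT))        # no secondary index has been built yet

# A plain == subset in i; with auto-indexing enabled, data.table may build
# and cache an index on grp as a side effect of this call.
hits <- DT[grp == "q"]

print(indices(DT))        # typically reports an index on grp now

# A repeated subset on the same column can reuse the cached index.
hits_again <- DT[grp == "q"]
```

The result of the subset is the same with or without the index; only the lookup strategy changes.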
For more benchmark information, be sure to visit Matt's GitHub site. If you are new to data.table, I recommend starting with Matt's 2014 useR presentation, which explains some of the ideas underlying data.table as well as providing an introduction.