by Joseph Rickert
The second annual H2O World conference finished up yesterday. More than 700 people from all over the US attended the three-day event that was held at the Computer History Museum in Mountain View, California; a venue that pretty much sits well within the blast radius of ground zero for Data Science in the Silicon Valley. This was definitely a conference for practitioners and I recognized quite a few accomplished data scientists in the crowd. Unlike many other single-vendor productions, this was a genuine Data Science event and not merely a vendor showcase. H2O is a relatively small company, but they took a big league approach to the conference with an emphasis on cultivating the community of data scientists and delivering presentations and panel discussions that focused on programming, algorithms and good Data Science practice.
The R based sessions I attended on the tutorial day were all very well done. Each was designed around a carefully crafted R script performing a non-trivial model building exercise and showcasing one or more of the various algorithms in the H2O repertoire including GLMs, Gradient Boosting Machines, Random Forests and Deep Learning Neural Nets. The presentations were targeted to a sophisticated audience with considerable discussion of pros and cons. Deep Learning is probably H2O's signature algorithm, but despite its extremely impressive performance in many applications nobody here was selling it as the answer to everything.
The following code fragment from a script (Download Deeplearning) that uses deep learning to identify a spiral pattern in a data set illustrates the current look and feel of H2O's R interface. Any function that begins with h2o. runs in the JVM, not in the R environment. (Also note that if you want to run the code you must first install Java on your machine; the Java Runtime Environment will do. Then, download the H2O R package Version 3.6.0.3 from the company's website. The scripts will not run with the older version of the package on CRAN.)
### Cover Type Dataset
# We import the full cover type dataset (581k rows, 13 columns, 10 numerical, 3 categorical).
# We also split the data 3 ways: 60% for training, 20% for validation (hyperparameter tuning)
# and 20% for final testing.
df <- h2o.importFile(path = normalizePath("../data/covtype.full.csv"))
dim(df)
df
splits <- h2o.splitFrame(df, c(0.6, 0.2), seed = 1234)
train <- h2o.assign(splits[[1]], "train.hex")  # 60%
valid <- h2o.assign(splits[[2]], "valid.hex")  # 20%
test  <- h2o.assign(splits[[3]], "test.hex")   # 20%

# Here's a scalable way to do scatter plots via binning (works for categorical and numeric
# columns) to get more familiar with the dataset.
# dev.new(noRStudioGD = FALSE)  # direct plotting output to a new window
par(mfrow = c(1, 1))  # reset canvas
plot(h2o.tabulate(df, "Elevation", "Cover_Type"))
plot(h2o.tabulate(df, "Horizontal_Distance_To_Roadways", "Cover_Type"))
plot(h2o.tabulate(df, "Soil_Type", "Cover_Type"))
plot(h2o.tabulate(df, "Horizontal_Distance_To_Roadways", "Elevation"))

#### First Run of H2O Deep Learning
# Let's run our first Deep Learning model on the covtype dataset.
# We want to predict the `Cover_Type` column, a categorical feature with 7 levels, and the
# Deep Learning model will be tasked to perform (multi-class) classification. It uses the
# other 12 predictors of the dataset, of which 10 are numerical, and 2 are categorical with a
# total of 44 levels. We can expect the Deep Learning model to have 56 input neurons (after
# automatic one-hot encoding).
response <- "Cover_Type"
predictors <- setdiff(names(df), response)
predictors

# To keep it fast, we only run for one epoch (one pass over the training data).
m1 <- h2o.deeplearning(
  model_id = "dl_model_first",
  training_frame = train,
  validation_frame = valid,  ## validation dataset: used for scoring and early stopping
  x = predictors,
  y = response,
  # activation = "Rectifier",  ## default
  # hidden = c(200, 200),      ## default: 2 hidden layers with 200 neurons each
  epochs = 1,
  variable_importances = TRUE  ## not enabled by default
)
summary(m1)

# Inspect the model in [Flow](http://localhost:54321/) for more information about model
# building etc. by issuing a cell with the content `getModel "dl_model_first"`, and pressing
# Ctrl-Enter.

#### Variable Importances
# Variable importances for Neural Network models are notoriously difficult to compute, and
# there are many [pitfalls](ftp://ftp.sas.com/pub/neural/importance.html). H2O Deep Learning
# has implemented the method of [Gedeon](http://cs.anu.edu.au/~./Tom.Gedeon/pdfs/ContribDataMinv2.pdf),
# and returns relative variable importances in descending order of importance.
head(as.data.frame(h2o.varimp(m1)))

#### Early Stopping
# Now we run another, smaller network, and we let it stop automatically once the
# misclassification rate converges (specifically, if the moving average of length 2 does not
# improve by at least 1% for 2 consecutive scoring events). We also sample the validation set
# to 10,000 rows for faster scoring.
m2 <- h2o.deeplearning(
  model_id = "dl_model_faster",
  training_frame = train,
  validation_frame = valid,
  x = predictors,
  y = response,
  hidden = c(32, 32, 32),            ## small network, runs faster
  epochs = 1000000,                  ## hopefully converges earlier...
  score_validation_samples = 10000,  ## sample the validation dataset (faster)
  stopping_rounds = 2,
  stopping_metric = "misclassification",  ## could be "MSE","logloss","r2"
  stopping_tolerance = 0.01
)
summary(m2)
plot(m2)
First notice that it all looks pretty much like R code. The script mixes standard R functions and H2O functions in a natural way. For example, h2o.tabulate() produces an object of class "list" and h2o.deeplearning() yields a model object that plot() can deal with. This is baseline stuff that has to happen to make H2O coding feel like R. But note that the H2O code goes beyond this baseline requirement. The functions h2o.splitFrame() and h2o.assign() manipulate data residing in the JVM in a way that will probably seem natural to most R users, and the function signatures also seem "R like" enough to go unnoticed. All of this reflects the conscious intent of the H2O designers not only to provide tools to facilitate the manipulation of H2O data from the R environment, but also to replicate the R experience.
An innovative feature of the h2o.deeplearning() function itself is the ability to specify a stopping metric. The parameter settings stopping_metric="misclassification", stopping_rounds=2 and stopping_tolerance=0.01 in the specification of model m2 mean that the neural net will stop training once the moving average of the validation misclassification rate fails to improve by at least 1% for two consecutive scoring events. In most cases, this will produce a useful model in much less time than it would take to have the learner run to completion. The following plot, generated in the script referenced above, shows the kind of problem for which the Deep Learning algorithm excels.
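To make the convergence rule concrete, here is a toy base-R sketch of that kind of early-stopping check. This is not H2O's internal code, just an illustration of the idea: track the length-2 moving average of the scored metric and stop after it fails to improve by the relative tolerance for the given number of consecutive scoring events.

```r
# Toy sketch (not H2O's internal code) of an early-stopping rule:
# stop when the length-2 moving average of the validation metric fails to
# improve by at least `tolerance` (relative) for `rounds` consecutive events.
stopping_round <- function(metric, rounds = 2, tolerance = 0.01) {
  ma <- (metric[-1] + metric[-length(metric)]) / 2  # moving average of length 2
  best <- ma[1]
  misses <- 0
  for (i in seq_along(ma)[-1]) {
    if (ma[i] < best * (1 - tolerance)) {  # improved by at least 1%
      best <- ma[i]
      misses <- 0
    } else {
      misses <- misses + 1
      if (misses >= rounds) return(i)      # index of the stopping event
    }
  }
  length(ma)  # rule never triggered: run to the end
}

# misclassification rates from successive scoring events (made-up numbers);
# improvement stalls near the end, so the rule fires at the 6th event
errs <- c(0.30, 0.25, 0.22, 0.21, 0.209, 0.2085, 0.208)
stopping_round(errs)
```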
Highlights of the conference for me included the presentations listed below. The videos and slides (when available) from all of these presentations will be posted on the H2O conference website. Some have been posted already and the rest should follow soon. (I have listed the dates and presentation times to help you locate the slides when they become available)
Madeleine Udell (11-11: 10:30AM) presented the mathematics underlying the new algorithm, Generalized Low Rank Models (GLRM), she developed as part of her PhD work under Stephen Boyd, professor at Stanford University and adviser to H2O. This algorithm, which generalizes PCA to deal with heterogeneous data types, shows great promise for a variety of data science applications. Among other things, it offers a scalable way to impute missing data. This was possibly the best presentation of the conference. Madeleine is an astonishingly good speaker; she makes the math exciting.
Anqi Fu (11-9: 3PM) presented her H2O implementation of the GLRM. Anqi not only does a great job of presenting the algorithm, she also offers some real insight into the challenges of turning the mathematics into production level code. You can download one of Anqi's demo R scripts here: Download Glrm.census.labor.violations. To my knowledge, Anqi's code is the only scalable implementation of the GLRM. (Madeleine wrote the prototype code in Julia.)
Matt Dowle (11-10), of data.table fame, demonstrated his port of data.table's lightning fast radix sorting algorithm to H2O. Matt showed a 1B row x 1B row table join that runs in about 1.45 minutes on a 4 node, 128 core H2O cluster. This is a very impressive result, but Matt says he can already do 10B x 10B row joins, and is shooting for 100B x 100B rows.
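For readers who haven't seen it, the keyed-join idiom being ported is easy to try at small scale in data.table itself (the tables and column names below are invented for illustration; the same radix-sort machinery is what makes it fast at billions of rows):

```r
library(data.table)  # assumes the data.table package is installed

# Two small keyed tables; keying sorts each table by id using radix sort
sales  <- data.table(id = c(3L, 1L, 2L), amount = c(30, 10, 20), key = "id")
labels <- data.table(id = 1:3, region = c("north", "south", "east"), key = "id")

# Join on the key: look up each sales row in labels
joined <- labels[sales]
joined
```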
Professor Rob Tibshirani (11-11: 11AM) presented work he is doing that may lead to lasso based models capable of detecting the presence of cancer in tissue extracted from patients while they are on the operating table! He described "Customized Learning", a method of building individual models for each patient. The basic technique is to pool the data from all of the patients and run a clustering algorithm. Then, for each patient fit a model using only the data in the patient's cluster. This is exciting work with the real potential to save lives.
Professor Stephen Boyd (11-10: 11AM) delivered a tutorial on optimization starting with basic convex optimization problems and then went on to describe Consensus Optimization, an algorithm for building machine learning models from data stored at different locations without sharing the data among the locations. Professor Boyd is a lucid and entertaining speaker, the kind of professor you will wish you had had.
Arno Candel (11-9: 1:30PM) presented the Deep Learning model which he developed at H2O. Arno is an accomplished speaker who presents the details with great clarity and balance. Be sure to have a look at his slide showing the strengths and weaknesses of Deep Learning.
Erin LeDell (11-9: 3PM) de-mystified ensembles and described how to build an ensemble learner from scratch. Anyone who wants to compete in a Kaggle competition should find this talk to be of value.
Szilard Pafka (11-11: 3PM), in a devastatingly effective, low key presentation, described his efforts to benchmark the open source machine learning platforms R, Python scikit-learn, Vowpal Wabbit, H2O, xgboost and Spark MLlib. Szilard downplayed his results, pointing out that they are in no way meant to be either complete or conclusive. Nevertheless, Szilard put considerable effort into the benchmarks. (He worked directly with the development teams for all of the various platforms.) Szilard did not offer any conclusions, but things are not looking all that good for Spark. The following slide plots AUC vs file size up to 10M rows.
Szilard's presentation should be available on the H2O site soon, but it is also available here.
I also found the Wednesday morning panel discussion on the "Culture of Data Driven Decision Making" and the Wednesday afternoon panel on "Algorithms -Design and Application" to be informative and well worth watching. Both panels included a great group of articulate and knowledgeable people.
If you have not checked in with H2O since the post I wrote last year, here, on one slide, is some of what they have been up to since then.
Congratulations to H2O for putting on a top notch event!
The R Consortium Infrastructure Steering Committee (chaired by Hadley Wickham) announced today the award of its first grant for an R community development project: $85,000 to Gábor Csárdi to implement the R-Hub project. As a board member of the R Consortium, I'm pleased to say this is a great first project for the R Consortium to get behind, as it aims to ease some of the difficulties associated with developing an R package for submission to CRAN. Currently more than 80% of CRAN submissions are rejected, often due to problems on platforms package developers don't have access to. When R-hub is ready, package developers will be able to detect and resolve any such issues prior to submitting, making it more likely their package will be accepted while relieving some of the burden on the dedicated volunteers who review CRAN submissions.
When completed, R-Hub will be a free online service available to all R users, allowing them to build and test R packages on all of the operating system platforms supported by CRAN: Windows, OS X, Linux and Solaris. It will integrate with GitHub (and possibly other online source code repositories) to provide a unified system for package source code management and testing. The architecture of the system has been designed by Gábor with input from many members of the R community, including: J.J. Allaire (RStudio), Ben Bolker (McMaster University), Dirk Eddelbuettel (Debian), Jay Emerson (Yale University), Nicholas Lewin-Koh (Genentech), Joseph Rickert and me, David Smith (Revolution Analytics/Microsoft), Murray Stokely (Google), and Simon Urbanek (AT&T). You can review the R-hub plan on GitHub (and provide comments via issues). The project is estimated to take about six months to complete.
Meanwhile, the R Consortium ISC is now accepting proposals from the community on how its projects budget (about $110,000 "over the next several months", now that R-hub is approved) should be spent. Proposals can be for anything that would be of benefit to the R Community. Suggestions include "software development, developing new teaching materials, documenting best practices, standardising APIs or doing research". So if you have an idea for a project that could get off the ground with some funding, make a proposal to the R Consortium for consideration.
By the way, if you work for a company that makes extensive use of R, consider asking them to join the R Consortium to make even more funds available for community projects. (I'm proud to say that Microsoft is a platinum member.) And if you're attending the EARL Conference in Boston, I'll be participating in a panel discussion with other R Consortium board members where we'll be discussing the R Consortium's goals and the projects managed by the Infrastructure Steering Committee. I hope to see you there!
R Consortium press release: R Consortium Awards First Grant to Help Advance Popular Programming Language for Unlocking Value from Data
I'm honoured to be giving the opening keynote at the Effective Applications of R Conference (EARL) Conference in Boston on November 2. My presentation will be on the business economics and opportunity of open source data science, with a focus on applications that are now possible given the convergence of big data platforms, cloud technology, and data science software (especially R) charged by the contributions of the open source community.
Given the outstanding calibre of talks at last month's EARL London conference, I can't wait to learn about more uses of R in business and industry. The whole agenda looks great, but a few of the sessions that caught my eye include:
If you haven't yet signed up for EARL Boston (organized and hosted by Mango Solutions), registration is still open here. (Discount academic registrations are sold out, though.) I hope to see you there!
by Andrie de Vries
The second week of SQLRelay (#SQLRelay) kicked off in London earlier this week. SQLRelay is a series of conferences, spanning 10 cities in the United Kingdom over two weeks. The London agenda included 4 different streams, with tracks for the DBA, BI and Analytics users, as well as a workshop track with two separate tutorials.
My speaking slot was in the afternoon, with the title "In-database analytics using Revolution R and SQL".
In my talk I covered:
The last two demonstrations show how to run R code embedded in a SQL stored procedure:
The presentation is available on SlideShare:
Here are the code samples I used in the demonstration:
by Joseph Rickert
We can declare 2015 the year that R went mainstream at the JSM. There is no doubt about it, the calculations, visualizations and deep thinking of a great many of the world's statisticians are rendered or expressed in R and the JSM is with the program. In 2013 I was happy to have stumbled into a talk where an FDA statistician confirmed that R was indeed a much used and trusted tool. Last year, while preparing to attend the conference, I was delighted to find a substantial list of R and data science related talks. This year, talks not only mentioned R: they were about R.
The conference began with several R focused pre-conference tutorials including Statistical Analysis of Financial Data Using R, The Art and Science of Data Visualization Using R, and Hadley Wickham’s sold out Advanced R. The Sunday afternoon session on Advances in R Software played to a full room. Highlights of that session included Gabe Becker’s presentation on the switchr package for reproducible research, Mark Seligman’s update on the new work being done on the Arborist implementation of the random forest algorithm, and my colleague Andrie de Vries’ presentation of some work we did on the network structure of R packages. (See yesterday’s post.)
The enthusiasm expressed by the overflowing crowd for Monday’s invited session on Recent Advances in Interactive Graphics for Data Analysis was contagious. Talks revolved around several packages linking R graphics to d3 and JavaScript in order to provide interactive graphics which are not only visually stunning but also open up new possibilities for exploratory data analysis. Hadley Wickham, the substitute chair for the session, characterized the various approaches to achieving interactive graphics in R with a bit of humor and much insight that I think brings some clarity to this chaotic whorl of development. Hadley places current efforts to provide interactive R graphics in one of three categories:
Other highlights of the session included Kenny Shirley’s presentation on interactively visualizing trees with his summarytrees package that interfaces R to D3, Susan VanderPlas’ presentation of Animint (this package adds interactive aesthetics to ggplot2; here is a nice tutorial), and Karl Broman’s discussion of visualizing high-dimensional genomic data (see qtlcharts and d3examples).
In addition to visualization, education was another thread that stitched together various R related topics. Waller's talk, Evaluating Data Science Contributions in Teaching and Research, in the invited paper session The Statistics Identity Crisis: Are We Really Data Scientists?, provided some advice on how software developed by academics could be “packaged” to look like the work products traditionally valued for academic advancement. Progress along these lines would go a long way towards helping some of the most productive R contributors achieve career advancing recognition. There was also considerable discussion about the kind of practical R and data science skills that should supplement the theoretical training of statisticians to help them be effective in academia as well as in industry. To get some insight into the relevant issues have a look at Jennifer Bryan’s slides for her talk Teach Data Science and They Will Come.
The following list contains 20 JSM talks with interesting package, educational or application R content.
by Joseph Rickert
The XXXV Sunbelt Conference of the International Network for Social Network Analysis (INSNA) was held last month at Brighton beach in the UK. (And I am still bummed out that I was not there.)
A run of 35 conferences is impressive indeed, but the social network analysts have been at it for an even longer time than that:
and today they are still on the cutting edge of the statistical analysis of networks. The conference presentations have not been posted yet, but judging from the conference workshops program there was plenty of R action in Brighton.
Social network analysis at this level involves some serious statistics and mastering a very specialized vocabulary. However, it seems to me that some knowledge of this field will become important to everyone working in data science. Supervised learning models and statistical models that assume independence among the predictors will most likely represent only the first steps that data scientists will take in exploring the complexity of large data sets.
And, maybe of equal importance, is the fact that working with network data is great fun. Moreover, software tools exist in R and other languages that make it relatively easy to get started with just a few pointers.
From a statistical inference point of view, what you need to know is that Exponential Random Graph Models (ERGMs) are at the heart of modern social network analysis. An ERGM is a statistical model that enables one to predict the probability of observing a given network from a given class of networks, based on both observed structural properties of the network and covariates associated with the vertices of the network. The exponential part of the name comes from the exponential family of functions used to specify the form of these models. ERGMs are analogous to generalized linear models, except that ERGMs take into account the dependency structure of ties (edges) between vertices. For a rigorous definition of ERGMs see sections 3 and 4 of the paper by Hunter et al. in the 2008 special issue of the JSS, or Chapter 6 in Kolaczyk and Csárdi's book Statistical Analysis of Network Data with R. (I have found this book to be very helpful and highly recommend it. Not only does it provide an accessible introduction to ERGMs, it also begins with basic network statistics and the igraph package and then goes on to introduce some more advanced topics such as modeling processes that take place on graphs and network flows.)
In the R world, the place to go to work with ERGMs is statnet.org. statnet is a suite of 15 or so CRAN packages that provide a complete infrastructure for working with ERGMs. statnet.org is a real gem of a site, containing documentation for all of the statnet packages along with tutorials, presentations from past Sunbelt conferences and more.
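To get a first taste of the statnet workflow, the ergm package ships with the classic Florentine marriage network. A minimal sketch, assuming ergm is installed: fit a dyad-independent ERGM with an edges term (the density baseline) and a wealth covariate on the vertices.

```r
library(ergm)  # part of the statnet suite; also loads the network package

data(florentine)  # flomarriage: marriage ties among 16 Florentine families

# Observed sufficient statistics for the model terms
summary(flomarriage ~ edges + nodecov("wealth"))

# ERGM: P(network) is proportional to
#   exp(theta1 * number_of_edges + theta2 * sum_of_wealth_over_tied_families)
fit <- ergm(flomarriage ~ edges + nodecov("wealth"))
summary(fit)  # a positive wealth coefficient: wealthier families have more ties
```

Because both terms are dyad-independent, this particular model fits instantly; terms like triangle counts introduce the tie-dependence discussed above and require MCMC estimation.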
I am particularly impressed with the Shiny based GUI for learning how to fit ERGMs. Try it out on the Shiny webpage or in the box below. Click the Get Started button. Then select "built-in network" and "ecoli 1" under File type. After that, click the right arrow in the upper right corner. You should see a plot of the ecoli graph.
You will be fitting models in no time. And since the commands used to drive the GUI are similar to specifying the parameters for the functions in the ergm package you will be writing your own R code shortly after that.
By Torben Tvedebrink, Chair of local committee, useR! 2015
After useR! 2015 in Aalborg I had some time to reflect and think back on the phase leading up to the actual conference. The story of useR! 2015 began in 2013 when Søren Højsgaard, Head of Department of Mathematical Sciences, Aalborg University, popped the idea of hosting useR! 2015 or 2017 in Aalborg. He had made some informal enquiries to the R Foundation and R Core members about the possibility. With some positive indications we sat down and wrote the first draft for the bidding material. This included a description of Aalborg, the conference venue, our thoughts on the scientific and social programme together with our first budget. This five page document was sent to the R Foundation by the end of October 2013 and after some communication back and forth we received the final "go!" in January 2014.
The first thing we decided on was the form and location of the social events (i.e. the welcome reception and conference dinner). We decided that the participants should experience more of Aalborg than just the conference venue; hence, the House of Music seemed like a natural choice. With the support of the municipality of Aalborg and our sponsors we had the opportunity to show our guests some of the best of the city on the evening before the conference. Conference dinners are often held in big restaurants or settings where it is difficult to ensure good food and drinks; hence, we wanted to focus on the place and theme rather than the food. The Robber's Banquet fitted nicely with this idea.
In the figure below I have plotted the number of useR! 2015 emails in my inbox over time (right: accumulated numbers on a log scale; total number of emails received: 3561). The number of received emails serves as a nice proxy for the amount of work put into the conference over time. As seen from the plot on the right, the number of emails grew exponentially over time. In the plot I have added some of the important dates, e.g. the opening of registration, the abstract deadline and the registration deadlines.
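A figure like that is plain base R. The sketch below uses made-up weekly counts (not the real inbox data) to reproduce the general shape: raw counts on the left, accumulated counts on a log scale with milestone dates marked on the right.

```r
# Made-up weekly email counts standing in for the real inbox data
set.seed(1)
weeks  <- 1:90  # roughly the span from the first planning emails to the conference
emails <- pmax(1, rpois(90, lambda = exp(weeks / 20)))  # roughly exponential growth

par(mfrow = c(1, 2))
plot(weeks, emails, type = "h",
     xlab = "Week", ylab = "Emails received")
plot(weeks, cumsum(emails), log = "y", type = "l",
     xlab = "Week", ylab = "Accumulated emails (log scale)")
abline(v = c(40, 65, 80), lty = 2)  # e.g. registration opens, abstract & registration deadlines
```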
We wanted to follow the example of useR! 2014 in Los Angeles and offer the tutorials free of charge to the participants. We applied to some Danish foundations to support the initiative and had some positive feedback. In order to allocate the 16 tutorials into morning and afternoon sessions, we sent out a survey to the participants a month before the conference. Based on the survey we ran through all possible permutations of the tutorials and minimised the number of individuals with both tutorial selections in the same session -- of course all done in R.
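That allocation is a small combinatorial search. A toy version with four hypothetical tutorials and invented survey picks (not the real conference data) might look like:

```r
# Toy tutorial-allocation search: split 4 tutorials into a morning and an
# afternoon session so that as few respondents as possible have both of
# their two picks in the same session.
tutorials <- c("dplyr", "ggplot2", "Rcpp", "shiny")

# Invented survey responses: each row is one respondent's two tutorial picks
picks <- rbind(c("dplyr", "ggplot2"),
               c("dplyr", "Rcpp"),
               c("ggplot2", "shiny"),
               c("Rcpp", "shiny"),
               c("dplyr", "shiny"))

conflicts <- function(morning) {
  # a respondent clashes if both picks fall in the same session
  sum(apply(picks, 1, function(p) all(p %in% morning) || !any(p %in% morning)))
}

# Try every way of choosing 2 of the 4 tutorials for the morning session
splits <- combn(tutorials, 2, simplify = FALSE)
costs  <- vapply(splits, conflicts, numeric(1))
best   <- splits[[which.min(costs)]]
best        # morning tutorials in the best split
min(costs)  # respondents stuck with a same-session clash
```

With 16 real tutorials the search space is larger but the idea is the same: score every candidate split against the survey and keep the one with the fewest clashes.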
Initially we hoped that 300-400 participants would show up in Aalborg. This was based on some data, but primarily on a somewhat pessimistic prior, which fortunately did not hold true. In the figure below the number of registered participants is plotted over time. We opened registration December 3rd 2014 and, as expected, only a few signed up in the first months. However, as presenters received their notifications and the early deadline approached, we had already achieved our goal of 400 participants. In the months that followed another 260 useR!s signed up, and we had a total of 660 participants when we ran the conference in July 2015 (128 females and 532 males). The industry/academia split was almost 50%, with 284 participants from industry, 262 academics and 113 students.
We had participants coming from more than 40 countries (see below), where the majority came from Denmark (129), USA (117) and Germany (92). For most countries the distribution between industry and academia was also close to 50% (right barplot).
In summary the useR! 2015 conference went well and we are happy for all the feedback we have received - positive as well as negative. These inputs are valuable to the R community in general and to us as local organisers in particular. We are happy to share any ideas and comments with future organisers of the useR! conference series.
The conference venue (Aalborg Congress and Culture Center, akkc.dk) was an ideal size for the turnout: a plenary lecture hall seating almost 800 people plus four additional rooms (150-220 seats) for parallel sessions was adequate. The professional assistance from the staff during the planning was very helpful. We can only recommend involving experienced people in the planning and execution of the next useR! conferences. Similar thoughts go to the catering -- with a good, varied and sufficient food supply most people are happy!
The social events (welcome reception, poster session and conference dinner) all went as we hoped. For the poster session we had free drinks and food. This meant most people stayed until the end, and poster presenters had many interesting discussions. We had intentionally encouraged posters to be on display throughout the conference and located them in the exhibitors' area. This gave people more time to visit both our many sponsors and look at the posters whenever they felt like it.
Once again we would like to thank all useR! 2015 participants for making the conference a memorable experience for the Department of Mathematical Sciences at Aalborg University. A special thanks goes to our many sponsors who made it possible to provide a high service level.
On behalf of the local organising committee,
Torben Tvedebrink
by Joseph Rickert
June was a hot month for extreme statistics and R. Not only did we close out the month with useR! 2015, but two small conferences in the middle of the month brought experts together from all over the world to discuss two very difficult areas of statistics that generate quite a bit of R code.
The Extreme Value Analysis conference is a prestigious event held every two years in different parts of the world. This year, over 230 participants from 26 countries met from June 15th through 19th at the University of Michigan, Ann Arbor for EVA 2015. The program included theoretical advances as well as novel applications of Extreme Value Theory in fields including finance, economics, insurance, hydrology, traffic safety, terrorism risk, climate and environmental extremes. You can get a good idea of the topics discussed at the EVA from the book of abstracts, which includes an author index as well as a keyword index. The conference organizers are in the process of obtaining permission to post the slides from the talks. These should be available soon.
In the meantime, have a look at the slides from two excellent presentations from the Workshop on Statistical Computing which was held the day before the main conference. Eric Gilleland's Introduction to Extreme Value Analysis provides a gentle introduction for anyone willing to look at some math. Eric begins with some motivating examples, develops some key concepts and illustrates them with R and even provides some history along the way. This quote from Emil Gumbel, a founding giant in the field, should be every modeler's mantra: “Il est impossible que l’improbable n’arrive jamais”. ("It's impossible for the improbable to never occur" -- ed)
In Modeling spatial extremes with the SpatialExtremes package, Mathieu Ribatet works through a complete example in R by fitting and evaluating a model and running simulations. This motivating slide from the presentation describes the kind of problems he is considering.
In our world of climate extremes and financial black swans there are probably few topics of more immediate concern to statisticians than EVA, but the vexing problem of dealing with missing values might be one of them. So, it was not surprising that at nearly the same time (June 18th and 19th) 150 people or so gathered on the other side of the world in Rennes, France for missData 2015.
Over the years, R developers have expended considerable energy creating routines to handle missing values. The transcan function in the Hmisc package "automatically transforms continuous and categorical variables to have maximum correlation with the best linear combination of the other variables". mice implements Fully Conditional Specification via the MICE algorithm. (See the slides from Stef van Buuren's presentation Fully Conditional Specification: Past, present and beyond for a perspective on FCS.) mi provides functions for missing value imputation in a Bayesian framework, as does the BaBooN package, and VIM provides tools for visualizing the structure of missing values. Slides for almost all of the talks are available online at the conference program page and videos will be available soon. Have a look at the slides from the lightning talk by Matthias Templ and Alexander Kowarik to see what the VIM package can do.
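For a quick flavor of the FCS approach, the mice package ships with the small nhanes example data set (assuming mice is installed): impute several completed data sets, fit a model on each, and pool the results with Rubin's rules.

```r
library(mice)  # multiple imputation by chained equations (FCS)

head(nhanes)  # small built-in example data with NAs in bmi, hyp and chl

# Create 5 completed (imputed) copies of the data
imp <- mice(nhanes, m = 5, seed = 123, printFlag = FALSE)

# Fit the same regression on each completed data set
fit <- with(imp, lm(chl ~ bmi + age))

# Pool the 5 fits into a single set of estimates via Rubin's rules
summary(pool(fit))
```

The key point is that the analysis is run once per completed data set and then combined, so the reported standard errors reflect the extra uncertainty due to the missing values.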
Revolution Analytics was very pleased to have been able to sponsor both of these conferences. For the next EVA mark your calendars to visit Delft, the Netherlands in 2017.
by Joseph Rickert
Last week, I was fortunate enough to attend the R Summit & Workshop, an invitation only event, held at the Copenhagen Business School. The abstracts for the public talks presented are online and well worth a look. Collectively they provide a snapshot of the state of development of R and the R Community as well some insight into the directions in which researchers are moving to expand the boundaries of R.
Real highlights of the event were talks by Jennifer Bryan and Mine Çetinkaya-Rundel, two educators who are channeling enormous amounts of energy into teaching statistics and statistical programming, and into developing new pedagogical methods to improve the learning experience for both students and teachers alike. Both Mine and Jennifer are committed to R, as well as to using state of the art developer tools such as RStudio, R Markdown, Git and GitHub.
If you clicked on the link to Jennifer’s university home page above and expected to see more content there you are probably not running with the in crowd. Anybody who aspires to bask in the faintest glow of tech cool is hanging out on GitHub. So go to github.com/jennybc to find a place where social media, software development and best practices collide to generate a state-of-the-art learning platform.
What Jennifer has created there is the R version of a full immersion experience for learning a new language. Just like you can’t separate the highs and awkward lows of human interactions from tripping over the grammar while learning Japanese on the streets of Tokyo, at jennybc you have to cope with GitHub, R Markdown and complying with best practices while doing your homework, seeking help, and learning from your peers.
As Jennifer writes in her synopsis of her R Summit & Workshop talk:
I've formed strong opinions about workflows for R Markdown + GitHub and what the big wins are.
Mine, who focuses on undergraduate education, points out that: “R is attractive because, unlike software designed specifically for courses at this level, it is relevant beyond the introductory statistics classroom, and is more powerful and flexible.” Poke around Mine’s GitHub page and you will see that she is all about R, open source, reproducibility and teaching good habits and right values. In addition to her work in the classroom, Mine has developed the Coursera course: Data Analysis and Statistical Inference, is a coauthor of three, free R based textbooks and is a driving force behind the ASA Datafest competitions.
Students are often ambivalent as to whether they are looking for education or training. But if you are a student of either Mine or Jennifer, you are going to get some of both, and have a real shot at launching a productive, R-fueled career.
by Andrie de Vries
Today is the first day of the UseR!2015 conference in Aalborg in Northern Denmark. But yesterday was a day packed with 16 tutorials on a range of interesting topics. I submitted a proposal many months ago to run a session on using R in Hadoop, and was very happy to be selected to run a session in the morning.
When we first started planning the session, we set a Big Hairy Audacious Goal to run the session using a HortonWorks Hadoop cluster hosted in the Microsoft Azure cloud.
We trialled the session at the Birmingham R user group during May, and then again last week during a Microsoft internal webinar. In both cases, the cluster performed very well.
Then I asked the organisers how many people expressed an interest, and I heard that 66 people said they might come and that the room will seat 144 people.
At this point I started having cold sweats!
As luck would have it, minutes before start time something went wrong. The cluster would not start, and participants were unable to access the server for the first hour of the tutorial. Fortunately, we had a back-up plan.
You can take a vicarious tour through the analysis of the NYC Taxi Cab data with the rmarkdown slides built with the RStudio presentation suite.
Although I did not cover all the material during the session, the full set of presentations is:
2. Analysing New York taxi data with Hadoop
3. Computing on distributed matrices
4. Using RHive to connect to the hive database
(These presentations, sample data, and all the scripts and examples live on GitHub at https://github.com/andrie/RHadoop-tutorial. Use the tag https://github.com/andrie/RHadoop-tutorial/tree/2015-06-30-UseR!2015 to find the state of the repository at the time of the tutorial.)
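The RHive part of the tutorial connects R to a Hive database; the fragment below is only a minimal sketch of that pattern, assuming a Hive server running on localhost and a table called taxi (both are placeholders, not details taken from the tutorial repository).

```r
library(RHive)

# Initialise RHive and connect to the Hive server
# (the host name here is a placeholder)
rhive.init()
rhive.connect(host = "localhost")

# List the available tables, then run a HiveQL query;
# the taxi table name is hypothetical
rhive.list.tables()
trips <- rhive.query("SELECT hour, COUNT(*) AS trips FROM taxi GROUP BY hour")

# Always close the connection when done
rhive.close()
```

The appeal of this approach is that the aggregation runs inside Hive, so only the small summary table crosses the wire back into R.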
The central example throughout is to use mapreduce() in the package rmr2 to summarize New York taxi journeys, grouped by hour of the day. This is a dataset of ~200GB of uncompressed csv files.
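To see what the map and reduce steps compute without needing a cluster, here is the same hour-of-day aggregation run on a tiny, invented data frame of pickup timestamps (the timestamps are made up for illustration):

```r
# A few invented pickup timestamps
pickups <- data.frame(
  pickup_datetime = c("2013-01-01 08:15:00", "2013-01-01 08:47:00",
                      "2013-01-01 17:05:00", "2013-01-02 08:30:00",
                      "2013-01-02 17:55:00"),
  stringsAsFactors = FALSE
)

# Same logic as the map step: extract the hour, then count trips per hour
hour <- format(as.POSIXct(pickups$pickup_datetime), "%H")
trips <- aggregate(pickup_datetime ~ hour, cbind(pickups, hour), FUN = length)
names(trips)[2] <- "trips"
trips
#   hour trips
# 1   08     3
# 2   17     2
```

On the cluster, each mapper performs this aggregation on its own chunk of the csv files, and the reducer simply sums the per-chunk counts for each hour.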
You can read more about how this data came into the public domain at http://chriswhong.com/open-data/foil_nyc_taxi/. Here is a simple plot of the results, computed from a 1-in-1,000 sample of the data:
The full code to do this is available in a GitHub repository at https://github.com/andrie/RHadoop-tutorial. Here is an extract:
library(rmr2)
library(rhdfs)

# Initialise the HDFS connection and tell rmr2 to use the Hadoop backend
hdfs.init()
rmr.options(backend = "hadoop")
hdfs.ls("taxi")$file

# Location of the taxi data in HDFS
homeFolder <- file.path("/user", Sys.getenv("USER"))
taxi.hdp <- file.path(homeFolder, "taxi")

# Read the column names and classes from a local data dictionary,
# then define the input format for the csv files in HDFS
headerInfo <- read.csv("data/dictionary_trip_data.csv", stringsAsFactors = FALSE)
colClasses <- as.character(as.vector(headerInfo[1, ]))
taxi.format <- make.input.format(format = "csv", sep = ",",
                                 col.names = names(headerInfo),
                                 colClasses = colClasses,
                                 stringsAsFactors = FALSE
)

# Map: extract the pickup timestamp (column 6) and count trips by hour
taxi.map <- function(k, v){
  original <- v[[6]]
  date <- as.Date(original, origin = "1970-01-01")
  wkday <- weekdays(date)
  hour <- format(as.POSIXct(original), "%H")
  dat <- data.frame(date, hour)
  z <- aggregate(date ~ hour, dat, FUN = length)
  keyval(z[[1]], z[[2]])
}

# Reduce: sum the per-chunk counts for each hour
taxi.reduce <- function(k, v){
  data.frame(hour = k, trips = sum(v), row.names = k)
}

# Run the mapreduce job and collect the results from HDFS
m <- mapreduce(taxi.hdp, input.format = taxi.format,
               map = taxi.map,
               reduce = taxi.reduce
)
dat <- values(from.dfs(m))

# Plot trips by hour of day
library("ggplot2")
p <- ggplot(dat, aes(x = hour, y = trips, group = 1)) +
  geom_smooth(method = loess, span = 0.5,
              col = "grey50", fill = "yellow") +
  geom_line(col = "blue") +
  expand_limits(y = 0) +
  ggtitle("Sample of taxi trips in New York")
p