A quick heads-up that I'll be presenting a live webinar on Thursday January 28, Introduction to Microsoft R Open. If you know anyone that should get to know R and/or Microsoft R Open, send 'em along. Here's the abstract:

Data Science is a strategic initiative for most companies today, who seek to mine the wealth of data now available to them to understand patterns, make forecasts, and build data-driven products and processes. The open-source R language is the lingua franca of data science, and ranked #6 in popularity amongst all languages by the IEEE. If you haven’t yet learned what R is all about, this webinar will bring you up to speed on the history of the R language, how it’s used, and why it’s so popular for developing advanced analytics applications.

In this 50-minute webinar, David Smith, R Community Lead at Microsoft, will introduce the R language and community, and give examples of R in action. In the webinar, David will demonstrate Microsoft R Open, Microsoft’s enhanced distribution of open-source R. He will also cover the enhancements included in Microsoft R Open, including enhanced performance, features for reproducible programming in R, and the new CRAN Time Machine for reproducible data analysis with R packages.
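As a small taste of the reproducibility features mentioned above: the CRAN Time Machine is backed by daily CRAN snapshots, and the checkpoint package lets you pin a project to one of them. A minimal sketch (the snapshot date below is purely illustrative):

```r
# Sketch: reproducible package versions via a CRAN snapshot date.
# checkpoint() installs and uses packages exactly as they existed
# on CRAN on the given date (date here is illustrative).
library(checkpoint)
checkpoint("2016-01-15")

# From here on, library() calls resolve against the snapshot,
# so colleagues running the same script get the same package versions.
library(ggplot2)
```

Call checkpoint() at the top of the script, before any other library() calls, so every package load is redirected to the snapshot.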

Then on February 4 Derek Norton (a frequent blogger here) will take a look at Using Microsoft R Server to Address Scalability Issues in R:

R is the world’s most widely-used programming language for data analysis. Despite its popularity, there are several constraints with leveraging open source R with enterprise-class datasets. In this 50-minute webinar, Derek Norton, Senior Data Scientist at Microsoft, will introduce the issues posed by open source R, and will walk through a use case that demonstrates how Microsoft R Server addresses each of the gaps associated with analyzing enterprise-class datasets with R.

Click the links above to register, or check out the full schedule of webinars at the link below — more will be added soon!

Microsoft: Microsoft R Webinars

By Virgilio Gómez Rubio, Spanish R Users Organizing Committee

As in every autumn since 2009, Spanish R users gathered at their annual meeting. It was organised by the Spanish R users group ‘Comunidad R-Hispano’ and took place on 5-6 November in the historic city of Salamanca. The 7th Meeting of Spanish R Users attracted more than 100 R enthusiasts and provided a mix of tutorials and contributed talks within the quarters of the University of Salamanca.

First of all, from Comunidad R-Hispano we would like to thank our sponsors Revolution Analytics, Telefónica, Instituto de Ingeniería del Conocimiento, Open Sistemas and Datatons for their financial support. Publishers Springer, Wiley and CRC/Taylor & Francis also supported this meeting by providing flyers, discounts and samples of recent books.

Tutorials presented at the conference covered topics such as R on Spark, the analysis of data from telephone surveys, analysis of survival data, analysis of data obtained from randomised surveys, analysis of network data with igraph and how to use R for the analysis of questionnaire data. All these tutorials provided a hands-on approach to the subject and the materials are available from the conference website.

Contributed talks focused on four main areas: Applications, Interfaces/Data Mining, Statistical Methodology and Biostatistics. Altogether, there were 22 oral presentations on these topics.

Among the Applications, Marcos Fernández Arias showed how to use R to gather information available on-line to pay a fair price for a new car. Teresa González Arteaga also shared some of her teaching experiences with R in a Degree in Statistics. In the Interfaces/Data Mining section, Christian González Martel and co-authors explored using R to mine Wikipedia searches of top Spanish companies to inform investment in the Spanish stock market.

Regarding the contributed session on Statistical Methodology, José Luis Vicente Villardón talked about his experience migrating his code for multivariate analysis using biplots from Matlab to R. Finally, in the Biostatistics section, Carlos Prieto and co-authors presented some interactive plots developed with R and other tools to visualize genomic data.

The prize for the best presentation by a young presenter was awarded to Karel López Quintero for his work on a Price Sensitivity Meter (PSM) with R.

Comunidad R-Hispano is already preparing the 8th Meeting, which will take place in November 2016 at the University of Castilla-La Mancha in Albacete; it will be locally organised by the same team that took care of useR! 2013.

by Joseph Rickert

The following map of all of the R user groups listed in Microsoft's Local R User Group Directory is a good way to visualize the R world as we rocket into 2016. As a member of the useR! 2016 planning committee, foremost in my mind right now is that in just a few months people will be coming to Stanford from all points plotted and almost everywhere in between.

I suppose I have a strong bias here, but this conference promises to be an outstanding event where Silicon Valley welcomes R. The invited speakers for useR! 2016 are Richard Becker, Donald Knuth, Deborah Nolan, Simon Urbanek, Hadley Wickham and Daniela Witten and the invited tutorials already scheduled to be on the program are:

But all of this is only half the story. The rest depends on you. There are still several open slots for tutorials, and there is plenty of time to think about submitting an abstract for a contributed talk. Please use one of the templates to submit your tutorial proposal and make sure it is received at useR-2016@R-project.org before the submissions deadline: midnight January 10, 2016 PST. The clock for contributed talks starts on January 5th and runs till March 25th.

Moreover, due to a special alignment of the stars in 2016, the Bioconductor Conference will also be held at Stanford, from June 24th through the 26th, making it possible to attend both conferences! The details should be up on the Bioconductor site soon. Bay Area hotels are notoriously expensive and there is limited capacity close to the campus, so if you are going to either useR! 2016 or BioC 2016, make your housing arrangements soon.

And - don't let all of this cosmic California talk distract you from planning for R/Finance 2016 which will be held May 20th and 21st in Chicago. If you have any interest at all in Finance (or Chicago), R/Finance is as good as it gets. The deadline for submissions is January 29th. You can find the details here.

And - don't forget that January 10th is also the deadline for submitting a proposal to the R-Consortium for the next round of funding. (You can find the details for submitting a proposal here.) The Infrastructure Steering Committee (ISC) has already received proposals for improving distributed computing in R, developing a package and API for retrieving package download statistics and building a framework for collecting R usage information at the function level. Don't miss this opportunity to get your ideas for work that could benefit the greater R Community in front of the ISC.

Finally, if you are thinking about starting an R user group in 2016, consider applying for support from Microsoft's Data Science User Group Sponsorship Program to get your RUG going. You can find the application form here.

by Joseph Rickert

The second, annual H2O World conference finished up yesterday. More than 700 people from all over the US attended the three-day event that was held at the Computer History Museum in Mountain View, California; a venue that pretty much sits well within the blast radius of ground zero for Data Science in the Silicon Valley. This was definitely a conference for practitioners and I recognized quite a few accomplished data scientists in the crowd. Unlike many other single-vendor productions, this was a genuine Data Science event and not merely a vendor showcase. H2O is a relatively small company, but they took a big league approach to the conference with an emphasis on cultivating the community of data scientists and delivering presentations and panel discussions that focused on programming, algorithms and good Data Science practice.

The R based sessions I attended on the tutorial day were all very well done. Each was designed around a carefully crafted R script performing a non-trivial model building exercise and showcasing one or more of the various algorithms in the H2O repertoire including GLMs, Gradient Boosting Machines, Random Forests and Deep Learning Neural Nets. The presentations were targeted to a sophisticated audience with considerable discussion of pros and cons. Deep Learning is probably H2O's signature algorithm, but despite its extremely impressive performance in many applications nobody here was selling it as the answer to everything.

The following code fragment from a script (Download Deeplearning) that uses deep learning to identify a spiral pattern in a data set illustrates the current look and feel of H2O's R interface. Any function that begins with h2o. runs in the JVM, not in the R environment. (Also note that if you want to run the code you must first install Java on your machine; the Java Runtime Environment will do. Then download the H2O R package, version 3.6.0.3, from the company's website. The scripts will not run with the older version of the package on CRAN.)

```r
### Cover Type Dataset
# We import the full cover type dataset (581k rows, 13 columns: 10 numerical, 3 categorical).
# We also split the data 3 ways: 60% for training, 20% for validation (hyperparameter
# tuning) and 20% for final testing.
df <- h2o.importFile(path = normalizePath("../data/covtype.full.csv"))
dim(df)
df
splits <- h2o.splitFrame(df, c(0.6, 0.2), seed = 1234)
train <- h2o.assign(splits[[1]], "train.hex") # 60%
valid <- h2o.assign(splits[[2]], "valid.hex") # 20%
test  <- h2o.assign(splits[[3]], "test.hex")  # 20%

# Here's a scalable way to do scatter plots via binning (works for categorical and
# numeric columns) to get more familiar with the dataset.
# dev.new(noRStudioGD = FALSE) # direct plotting output to a new window
par(mfrow = c(1, 1)) # reset canvas
plot(h2o.tabulate(df, "Elevation", "Cover_Type"))
plot(h2o.tabulate(df, "Horizontal_Distance_To_Roadways", "Cover_Type"))
plot(h2o.tabulate(df, "Soil_Type", "Cover_Type"))
plot(h2o.tabulate(df, "Horizontal_Distance_To_Roadways", "Elevation"))

#### First Run of H2O Deep Learning
# Let's run our first Deep Learning model on the covtype dataset.
# We want to predict the `Cover_Type` column, a categorical feature with 7 levels, so
# the Deep Learning model will perform (multi-class) classification. It uses the other
# 12 predictors, of which 10 are numerical and 2 are categorical with a total of 44
# levels. We can expect the model to have 56 input neurons (after automatic one-hot
# encoding).
response   <- "Cover_Type"
predictors <- setdiff(names(df), response)
predictors

# To keep it fast, we only run for one epoch (one pass over the training data).
m1 <- h2o.deeplearning(
  model_id = "dl_model_first",
  training_frame = train,
  validation_frame = valid,    ## validation dataset: used for scoring and early stopping
  x = predictors,
  y = response,
  # activation = "Rectifier",  ## default
  # hidden = c(200, 200),      ## default: 2 hidden layers with 200 neurons each
  epochs = 1,
  variable_importances = TRUE  ## not enabled by default
)
summary(m1)

# Inspect the model in Flow (http://localhost:54321/) for more information about model
# building etc. by issuing a cell with the content `getModel "dl_model_first"` and
# pressing Ctrl-Enter.

#### Variable Importances
# Variable importances for Neural Network models are notoriously difficult to compute,
# and there are many pitfalls (ftp://ftp.sas.com/pub/neural/importance.html). H2O Deep
# Learning implements the method of Gedeon
# (http://cs.anu.edu.au/~./Tom.Gedeon/pdfs/ContribDataMinv2.pdf) and returns relative
# variable importances in descending order of importance.
head(as.data.frame(h2o.varimp(m1)))

#### Early Stopping
# Now we run another, smaller network, and we let it stop automatically once the
# misclassification rate converges (specifically, if the moving average of length 2
# does not improve by at least 1% for 2 consecutive scoring events). We also sample
# the validation set down to 10,000 rows for faster scoring.
m2 <- h2o.deeplearning(
  model_id = "dl_model_faster",
  training_frame = train,
  validation_frame = valid,
  x = predictors,
  y = response,
  hidden = c(32, 32, 32),           ## small network, runs faster
  epochs = 1000000,                 ## hopefully converges earlier...
  score_validation_samples = 10000, ## sample the validation dataset (faster)
  stopping_rounds = 2,
  stopping_metric = "misclassification", ## could be "MSE", "logloss", "r2"
  stopping_tolerance = 0.01
)
summary(m2)
plot(m2)
```

First notice that it all looks pretty much like R code. The script mixes standard R functions and H2O functions in a natural way. For example, h2o.tabulate() produces an object of class "list" and h2o.deeplearning() yields a model object that plot() can deal with. This is just the baseline behavior that has to happen to make H2O coding feel like R. But note that the H2O code goes beyond this baseline requirement. The functions h2o.splitFrame() and h2o.assign() manipulate data residing in the JVM in a way that will probably seem natural to most R users, and the function signatures also seem "R like" enough to go unnoticed. All of this reflects the conscious intent of the H2O designers not only to provide tools to facilitate the manipulation of H2O data from the R environment, but also to try and replicate the R experience.

An innovative feature of the h2o.deeplearning() function itself is the ability to specify a stopping metric. The parameter settings stopping_metric = "misclassification" (which could also be "MSE", "logloss" or "r2"), stopping_rounds = 2 and stopping_tolerance = 0.01 in the specification of model m2 mean that the neural net keeps training only as long as the misclassification rate on the validation set keeps improving by at least the specified tolerance. In most cases, this will produce a useful model in much less time than it would take to have the learner run to completion. The following plot, generated in the script referenced above, shows the kind of problem for which the Deep Learning algorithm excels.

Highlights of the conference for me included the presentations listed below. The videos and slides (when available) from all of these presentations will be posted on the H2O conference website. Some have been posted already and the rest should follow soon. (I have listed the dates and presentation times to help you locate the slides when they become available.)

Madeleine Udell (11-11: 10:30AM) presented the mathematics underlying the new algorithm, Generalized Low Rank Models (GLRM), which she developed as part of her PhD work under Stephen Boyd, professor at Stanford University and adviser to H2O. This algorithm, which generalizes PCA to deal with heterogeneous data types, shows great promise for a variety of data science applications. Among other things, it offers a scalable way to impute missing data. This was possibly the best presentation of the conference. Madeleine is an astonishingly good speaker; she makes the math exciting.

Anqi Fu (11-9: 3PM) presented her H2O implementation of the GLRM. Anqi not only does a great job of presenting the algorithm, she also offers some real insight into the challenges of turning the mathematics into production level code. You can download one of Anqi's demo R scripts here: Download Glrm.census.labor.violations. To my knowledge, Anqi's code is the only scalable implementation of the GLRM. (Madeleine wrote the prototype code in Julia.)
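For readers who want to try Anqi's implementation, here is a minimal sketch of fitting a GLRM with the h2o package. It requires a running H2O cluster (and therefore Java); the dataset and parameter choices below are illustrative, not taken from Anqi's demo.

```r
# Sketch: a Generalized Low Rank Model in H2O. Assumes the h2o package
# and a local Java runtime; parameter choices are illustrative.
library(h2o)
h2o.init(nthreads = -1)            # start (or connect to) a local H2O cluster

hf <- as.h2o(iris[, 1:4])          # numeric columns of a familiar dataset
glrm_model <- h2o.glrm(
  training_frame = hf,
  k = 2,                           # rank of the decomposition
  loss = "Quadratic",              # quadratic loss gives PCA-like behavior
  regularization_x = "None",
  regularization_y = "None",
  max_iterations = 100
)
summary(glrm_model)                # objective value, archetypes, etc.
```

Swapping in other loss functions and regularizers is what lets the GLRM handle heterogeneous (numeric, categorical, ordinal) columns and missing entries.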

Matt Dowle (11-10), of data.table fame, demonstrated his port of data.table's lightning-fast radix sorting algorithm to H2O. Matt showed a 1B row x 1B row table join that runs in about 1.45 minutes on a 4-node, 128-core H2O cluster. This is a very impressive result, but Matt says he can already do 10B x 10B row joins, and is shooting for 100B x 100B rows.
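For readers unfamiliar with data.table, the operation Matt scaled up is the keyed join. A small-scale sketch (toy tables of my own invention) of what that looks like in ordinary data.table code:

```r
# Small-scale sketch of a data.table keyed join -- the operation
# Matt Dowle's radix-sort port scales to billions of rows in H2O.
library(data.table)

a <- data.table(id = 1:5, x = letters[1:5])
b <- data.table(id = 3:7, y = LETTERS[3:7])
setkey(a, id)   # sort and index by the join column
setkey(b, id)

a[b]            # right join: one row per row of b, x is NA where unmatched
merge(a, b)     # inner join on the shared key: only ids 3, 4, 5
```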

Professor Rob Tibshirani (11-11: 11AM) presented work he is doing that may lead to lasso based models capable of detecting the presence of cancer in tissue extracted from patients while they are on the operating table! He described "Customized Learning", a method of building individual models for each patient. The basic technique is to pool the data from all of the patients and run a clustering algorithm. Then, for each patient fit a model using only the data in the patient's cluster. This is exciting work with the real potential to save lives.

Professor Stephen Boyd (11-10: 11AM) delivered a tutorial on optimization starting with basic convex optimization problems and then went on to describe Consensus Optimization, an algorithm for building machine learning models from data stored at different locations without sharing the data among the locations. Professor Boyd is a lucid and entertaining speaker, the kind of professor you will wish you had had.

Arno Candel (11-9: 1:30PM) presented the Deep Learning model which he developed at H2O. Arno is an accomplished speaker who presents the details with great clarity and balance. Be sure to have a look at his slide showing the strengths and weaknesses of Deep Learning.

Erin LeDell (11-9: 3PM) de-mystified ensembles and described how to build an ensemble learner from scratch. Anyone who wants to compete in a Kaggle competition should find this talk to be of value.

Szilard Pafka (11-11: 3PM), in a devastatingly effective, low-key presentation, described his efforts to benchmark the open source machine learning platforms R, Python scikit-learn, Vowpal Wabbit, H2O, xgboost and Spark MLlib. Szilard downplayed his results, pointing out that they are in no way meant to be either complete or conclusive. Nevertheless, Szilard put considerable effort into the benchmarks. (He worked directly with the development teams for all of the various platforms.) Szilard did not offer any conclusions, but things are not looking all that good for Spark. The following slide plots AUC vs file size up to 10M rows.

Szilard's presentation should be available on the H2O site soon, but it is also available here.

I also found the Wednesday morning panel discussion on the "Culture of Data Driven Decision Making" and the Wednesday afternoon panel on "Algorithms - Design and Application" to be informative and well worth watching. Both panels included a great group of articulate and knowledgeable people.

If you have not checked in with H2O since the post I wrote last year, here, on one slide, is some of what they have been up to since then.

Congratulations to H2O for putting on a top notch event!

The R Consortium Infrastructure Steering Committee (chaired by Hadley Wickham) announced today the award of its first grant for an R community development project: $85,000 to Gábor Csárdi to implement the R-Hub project. As a board member of the R Consortium, I'm pleased to say this is a great first project for the R Consortium to get behind, as it aims to ease some of the difficulties associated with developing an R package for submission to CRAN. Currently more than 80% of CRAN submissions are rejected, often due to problems on platforms package developers don't have access to. When R-hub is ready, package developers will be able to detect and resolve any such issues prior to submitting, making it more likely their package will be accepted while relieving some of the burden on the dedicated volunteers who review CRAN submissions.

When completed, R-Hub will be a free online service available to all R users, allowing them to build and test R packages on all of the operating system platforms supported by CRAN: Windows, OS X, Linux and Solaris. It will integrate with GitHub (and possibly other online source code repositories) to provide a unified system for package source code management and testing. The architecture of the system has been designed by Gábor with input from many members of the R community, including: J.J. Allaire (RStudio), Ben Bolker (McMaster University), Dirk Eddelbuettel (Debian), Jay Emerson (Yale University), Nicholas Lewin-Koh (Genentech), Joseph Rickert and me, David Smith (Revolution Analytics/Microsoft), Murray Stokely (Google), and Simon Urbanek (AT&T). You can review the R-hub plan on GitHub (and provide comments via issues). The project is estimated to take about six months to complete.

Meanwhile, the R Consortium ISC is now accepting proposals from the community on how its projects budget (about $110,000 "over the next several months", now that R-hub is approved) should be spent. Proposals can be for anything that would be of benefit to the R Community. Suggestions include "software development, developing new teaching materials, documenting best practices, standardising APIs or doing research". So if you have an idea for a project that could get off the ground with some funding, make a proposal to the R Consortium for consideration.

By the way, if you work for a company that makes extensive use of R, consider asking them to join the R Consortium to make even more funds available for community projects. (I'm proud to say that Microsoft is a platinum member.) And if you're attending the EARL Conference in Boston, I'll be participating in a panel discussion with other R Consortium board members where we'll be discussing the R Consortium's goals and the projects managed by the Infrastructure Steering Committee. I hope to see you there!

R Consortium press release: R Consortium Awards First Grant to Help Advance Popular Programming Language for Unlocking Value from Data

I'm honoured to be giving the opening keynote at the Effective Applications of R Conference (EARL) Conference in Boston on November 2. My presentation will be on the business economics and opportunity of open source data science, with a focus on applications that are now possible given the convergence of big data platforms, cloud technology, and data science software (especially R) charged by the contributions of the open source community.

Given the outstanding calibre of talks at last month's EARL London conference, I can't wait to learn about more uses of R in business and industry. The whole agenda looks great, but a few of the sessions that caught my eye include:

- Mark Sellors' workshop on "Getting Started with Spark and R"
- Gergely Daroczi talking about how CARD.com is analyzing and managing Facebook ads from R
- Oliver Keyes on how Wikipedia (one of the top 10 websites) analyzes traffic data with R
- Ari Lamstein (a frequent contributor to this blog) on mapping census data with R
- Bob Rudis from Verizon, with a behind-the-scenes look at making "the most respected report in information security"
- Tim Hesterberg from Google on measuring brand ad effectiveness
- William Denney from Pfizer on monitoring Phase 1 clinical drug trials

If you haven't yet signed up for EARL Boston (organized and hosted by Mango Solutions), registration is still open here. (Discount academic registrations are sold out, though.) I hope to see you there!

by Andrie de Vries

The second week of SQLRelay (#SQLRelay) kicked off in London earlier this week. SQLRelay is a series of conferences, spanning 10 cities in the United Kingdom over two weeks. The London agenda included 4 different streams, with tracks for the DBA, BI and Analytics users, as well as a workshop track with two separate tutorials.

My speaking slot was in the afternoon, with the title "In-database analytics using Revolution R and SQL".

In my talk I covered:

- A high level overview of R
- Data science in the cloud
- Connecting R to SQL
- Scalable R
- R in SQL Server
- Moving your workflow to the cloud

Although the ability to run R directly inside SQL Server will only arrive with SQL Server 2016, Microsoft announced earlier this year that SQL Server 2016 will include Revolution Analytics technology. I expect that more information will be released during the PASS 2015 summit in Seattle at the end of this month.

In my talk I included 5 simple demonstrations. The first 3 demonstrations appeal to the data scientist coding in R:

- Connecting R to SQL Server using an RODBC connector
- Using Revolution R Enterprise (RRE) in a local parallel compute context, reading data from a local file
- Changing the compute context to SQL Server, and running the R code directly inside the SQL Server machine
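The first of these demonstrations can be sketched in a few lines of R. The connection string below is illustrative (substitute your own server, database and authentication details), and you will need an ODBC driver for SQL Server installed:

```r
# Sketch: querying SQL Server from R with RODBC.
# Connection details and table name are illustrative placeholders.
library(RODBC)

ch <- odbcDriverConnect(
  "Driver={SQL Server};Server=localhost;Database=mydb;Trusted_Connection=yes"
)
top_rows <- sqlQuery(ch, "SELECT TOP 10 * FROM dbo.SomeTable")
odbcClose(ch)

head(top_rows)   # results arrive as an ordinary R data frame
```

sqlQuery() returns a regular data frame, so everything downstream is just standard R.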

The last two demonstrations show how to run R code embedded in a SQL stored procedure:

- Creating a very simple script that calls out to R
- Using R to generate some data, in this case by returning the famous iris data set that is built into R.
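In SQL Server 2016's R Services, embedding R in T-SQL is done with the sp_execute_external_script stored procedure. A minimal sketch of the iris demo, invoked here from R via RODBC so the whole round trip stays in one script (connection details are illustrative placeholders):

```r
# Sketch: running R *inside* SQL Server 2016 via sp_execute_external_script,
# returning R's built-in iris data set as a result set.
# Connection details are illustrative; requires SQL Server 2016 R Services.
library(RODBC)

ch <- odbcDriverConnect(
  "Driver={SQL Server};Server=localhost;Database=testdb;Trusted_Connection=yes"
)
iris_from_sql <- sqlQuery(ch, "
  EXEC sp_execute_external_script
    @language = N'R',
    @script   = N'OutputDataSet <- iris'
")
odbcClose(ch)

head(iris_from_sql)
```

Whatever the embedded script assigns to OutputDataSet comes back to the caller as a SQL result set.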

The presentation is available on SlideShare:

Here are the code samples I used in the demonstration:

by Joseph Rickert

We can declare 2015 the year that R went mainstream at the JSM. There is no doubt about it, the calculations, visualizations and deep thinking of a great many of the world's statisticians are rendered or expressed in R and the JSM is with the program. In 2013 I was happy to have stumbled into a talk where an FDA statistician confirmed that R was indeed a much used and trusted tool. Last year, while preparing to attend the conference, I was delighted to find a substantial list of R and data science related talks. This year, talks not only mentioned R: they were *about* R.

The conference began with several R-focused pre-conference tutorials including Statistical Analysis of Financial Data Using R, The Art and Science of Data Visualization Using R, and Hadley Wickham’s sold-out Advanced R. The Sunday afternoon session on Advances in R Software played to a full room. Highlights of that session included Gabe Becker’s presentation on the switchr package for reproducible research, Mark Seligman’s update on the new work being done on the Arborist implementation of the random forest algorithm and my colleague Andrie de Vries’s presentation of some work we did on the network structure of R packages. (See yesterday’s post.)

The enthusiasm expressed by the overflowing crowd for Monday’s invited session on Recent Advances in Interactive Graphics for Data Analysis was contagious. Talks revolved around several packages linking R graphics to d3 and JavaScript in order to provide interactive graphics which are not only visually stunning but also open up new possibilities for exploratory data analysis. Hadley Wickham, the substitute chair for the session, characterized the various approaches to achieving interactive graphics in R with a bit of humor and much insight that I think brings some clarity to this chaotic whorl of development. Hadley places current efforts to provide interactive R graphics in one of three categories:

- Speaking in tongues: interfacing to low level specialized languages (examples: iplots and rggobi)
- Hacking existing graphics (examples: Animint and using ggplot2 with Shiny)
- Abusing the browser (examples: R/qtlcharts, leaflet and htmlwidgets)
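As a taste of the "abusing the browser" approach, here is a minimal leaflet sketch; the marker coordinates are illustrative (roughly Stanford University, in honor of useR! 2016):

```r
# Sketch: an htmlwidgets-based interactive map built with leaflet.
# Coordinates and popup text are illustrative.
library(leaflet)

m <- leaflet() %>%
  addTiles() %>%                  # default OpenStreetMap tile layer
  addMarkers(lng = -122.17, lat = 37.43, popup = "useR! 2016 at Stanford")
m                                  # renders in the RStudio viewer or a browser
```

The returned object is an htmlwidget, so the same few lines also work inside Shiny apps and R Markdown documents.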

Other highlights of the session included Kenny Shirley’s presentation on interactively visualizing trees with his summarytrees package that interfaces R to D3, Susan VanderPlas’ presentation of Animint (This package adds interactive aesthetics to ggplot2. Here is a nice tutorial.), and Karl Broman’s discussion of visualizing high-dimensional genomic data (See qtlcharts and d3examples.)

In addition to visualization, education was another thread that stitched together various R-related topics. Waller's talk, Evaluating Data Science Contributions in Teaching and Research, in the invited paper session The Statistics Identity Crisis: Are We Really Data Scientists?, provided some advice on how software developed by academics could be “packaged” to look like the work products traditionally valued for academic advancement. Progress along these lines would go a long way towards helping some of the most productive R contributors achieve career-advancing recognition. There was also considerable discussion about the kind of practical R and data science skills that should supplement the theoretical training of statisticians to help them be effective in academia as well as in industry. To get some insight into the relevant issues, have a look at Jennifer Bryan’s slides for her talk Teach Data Science and They Will Come.

The following list contains 21 JSM talks with interesting package, educational or application R content.

- Animint: Interactive Web-Based Animations Using Ggplot2's Grammar of Graphics (Susan Ruth VanderPlas, Iowa State University; Carson Sievert, Iowa State University; Toby Hocking, McGill University)
- Applying the R Language in Streaming and Business Intelligence Applications (Louis Bajuk, TIBCO Software Inc.)
- A Bayesian Test of Independence of Two Categorical Variables with Covariates (Dilli Bhatta, Truman State University)
- Comparison of R and Vowpal Wabbit for Click Prediction in Display Advertising (Jaimyoung Kwon, AOL Advertising; Bin Ren, AOL Platforms; Rajasekhar Cherukuri, AOL Platforms; Marius Holtan, AOL Platforms)
- Demonstration of Statistical Concepts with Animated Graphics and Simulations in R (Andrej Blejec, National Institute of Biology)
- The Dendextend R Package for Manipulation, Visualization, and Comparison of Dendrograms (Tal Galili, Tel Aviv University)
- Enhancing Reproducibility and Collaboration via Management of R Package Cohorts (Gabriel Becker, Genentech Research; Cory Barr, Anticlockwork Arts; Robert Gentleman, Genentech Research; Michael Lawrence, Genentech Research)
- GMM Versus GQL Logistic Regression Models for Multi-Level Correlated Data (Bei Wang, Arizona State University; Jeffrey Wilson, W. P. Carey School of Business/Arizona State University)
- Increasing the Accuracy of Gene Expression Classifiers by Incorporating Pathway Information: A Latent Group Selection Approach (Yaohui Zeng, The University of Iowa; Patrick Breheny, The University of Iowa)
- Learning Statistics with R, from the Ground Up (Xiaofei Wang)
- Mining an R Bug Database with R (Stephen Kaluzny, TIBCO Software Inc.)
- Multinomial Regression for Correlated Data Using the Bootstrap in R (Jennifer Thompson, Vanderbilt University; Timothy Girard, Vanderbilt University Medical Center; Pratik Pandharipande, Vanderbilt University Medical Center; E. Wesley Ely, Vanderbilt University Medical Center; Rameela Chandrasekhar, Vanderbilt University)
- The Network Structure of R Packages (Andrie de Vries, Revolution Analytics Limited; Joseph Rickert)
- Online PCA in High Dimension: A Comparative Study (David Degras, DePaul University; Hervé Cardot, Université de Bourgogne)
- Perils and Solutions for Comparative Effectiveness Research in Massive Observational Databases (Marc A. Suchard, UCLA)
- R Package PRIMsrc: Bump Hunting by Patient Rule Induction Method for Survival, Regression, and Classification (Jean-Eudes Dazard, Case Western Reserve University; Michael Choe, Case Western Reserve University; Michael LeBlanc, Fred Hutchinson Cancer Research Center; J. Sunil Rao, University of Miami)
- An R Package That Collects and Archives Files and Other Details to Support Reproducible Computing (Stan Pounds, St. Jude Children's Research Hospital; Zhifa Liu, St. Jude Children's Research Hospital)
- simcausal R Package: Conducting Transparent and Reproducible Simulation Studies of Causal Effect Estimation with Complex Longitudinal Data (Oleg Sofrygin, Kaiser Permanente Northern California/UC Berkeley; Mark Johannes van der Laan, UC Berkeley; Romain Neugebauer, Kaiser Permanente Northern California)
- Statistical Computation Using Student Collaborative Work (John D. Emerson, Middlebury College)
- Teaching Introductory Regression with R Using Package Regclass (Adam Petrie)
- Using Software to Search for Optimal Cross-Over Designs (Byron Jones)

by Joseph Rickert

The XXXV Sunbelt Conference of the International Network for Social Network Analysis (INSNA) was held last month in the seaside town of Brighton in the UK. (And I am still bummed out that I was not there.)

A run of 35 conferences is impressive indeed, but social network analysts have been at it for even longer than that, and today they are still on the cutting edge of the statistical analysis of networks. The conference presentations have not been posted yet, but judging from the conference workshops program there was plenty of R action in Brighton.

Social network analysis at this level involves some serious statistics and mastering a very specialized vocabulary. However, it seems to me that some knowledge of this field will become important to everyone working in data science. Supervised learning models and statistical models that assume independence among the predictors will most likely represent only the first steps that data scientists will take in exploring the complexity of large data sets.

And, of perhaps equal importance, is the fact that working with network data is great fun. Moreover, software tools exist in R and other languages that make it relatively easy to get started with just a few pointers.

From a statistical inference point of view, what you need to know is that Exponential Random Graph Models (ERGMs) are at the heart of modern social network analysis. An ERGM is a statistical model that enables one to predict the probability of observing a given network, from a specified class of networks, based on both observed structural properties of the network and covariates associated with its vertices. The "exponential" part of the name comes from the exponential family of functions used to specify the form of these models. ERGMs are analogous to generalized linear models, except that ERGMs take into account the dependency structure of the ties (edges) between vertices. For a rigorous definition of ERGMs see sections 3 and 4 of the paper by Hunter et al. in the 2008 special issue of the JSS, or Chapter 6 in Kolaczyk and Csárdi's book *Statistical Analysis of Network Data with R*. (I have found this book to be very helpful and highly recommend it. Not only does it provide an accessible introduction to ERGMs, it also begins with basic network statistics and the igraph package, and then goes on to introduce some more advanced topics such as modeling processes that take place on graphs, and network flows.)

In the R world, the place to go to work with ERGMs is statnet.org. statnet is a suite of 15 or so CRAN packages that provides a complete infrastructure for working with ERGMs. statnet.org is a real gem of a site, containing documentation for all of the statnet packages along with tutorials, presentations from past Sunbelt conferences and more.

I am particularly impressed with the Shiny-based GUI for learning how to fit ERGMs. Try it out on the Shiny webpage: click the **Get Started** button, then select "built-in network" and "ecoli 1" under **File type**. After that, click the right arrow in the upper right corner. You should see a plot of the ecoli graph.


You will be fitting models in no time. And since the commands used to drive the GUI are similar to specifying the parameters for the functions in the ergm package you will be writing your own R code shortly after that.
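To give a flavor of what that code looks like, here is a minimal sketch of fitting an ERGM with the statnet ergm package. It uses the classic Florentine marriage network that ships with the package; the particular model terms chosen here (an edges term plus a wealth covariate) are just the standard tutorial example, not anything specific to the GUI above.

```r
# A minimal ERGM fit with the statnet 'ergm' package, using the Florentine
# marriage network bundled with it (a classic tutorial example).
library(ergm)            # also loads the 'network' package

data(florentine)         # provides the 'flomarriage' network
plot(flomarriage)        # quick look at the graph

# Model tie probability with an edges (density) term plus a vertex
# covariate: the wealth of each Florentine family.
fit <- ergm(flomarriage ~ edges + nodecov("wealth"))
summary(fit)             # coefficients, standard errors, AIC/BIC
```

Swapping `nodecov("wealth")` for structural terms such as `triangle` is how you start to capture the dependency among ties that distinguishes ERGMs from ordinary GLMs.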

*By Torben Tvedebrink, Chair of local committee, useR! 2015*

After useR! 2015 in Aalborg I had some time to reflect and think back on the phase leading up to the actual conference. The story of useR! 2015 began in 2013, when Søren Højsgaard, Head of the Department of Mathematical Sciences at Aalborg University, floated the idea of hosting useR! 2015 or 2017 in Aalborg. He had made some informal enquiries to the R Foundation and R Core members about the possibility. With some positive indications in hand, we sat down and wrote the first draft of the bidding material. This included a description of Aalborg, the conference venue, and our thoughts on the scientific and social programme, together with our first budget. This five-page document was sent to the R Foundation by the end of October 2013, and after some communication back and forth we received the final "go!" in January 2014.

The first thing we decided on was the form and location of the social events (i.e. the welcome reception and conference dinner). We wanted the participants to experience more of Aalborg than just the conference venue, and the House of Music seemed like a natural choice. With the support of the municipality of Aalborg and our sponsors, we had the opportunity to show some of the best of the city to our guests on the evening before the conference. Conference dinners are often held in big restaurants or settings where it is difficult to ensure good food and drinks. Hence, we wanted to focus on the place and theme rather than the food, and the Robber's Banquet fitted nicely with this idea.

In the figure below I have plotted the number of useR! 2015 emails in my inbox over time (right: accumulated numbers on a log scale; total number of emails received: 3561). The number of emails received serves as a nice proxy for the amount of work put into the conference over time. As seen from the plot on the right, the number of emails grew exponentially over time. In the plot I have added some of the important dates, e.g. the opening of registration, the abstract deadline and the registration deadlines.

We wanted to follow the example of useR! 2014 in Los Angeles and offer the tutorials free of charge to the participants. We applied to some Danish foundations to support the initiative and had some positive feedback. In order to allocate the 16 tutorials into morning and afternoon sessions, we sent out a survey to the participants a month before the conference. Based on the survey, we ran through all possible allocations of the tutorials and minimised the number of individuals with both of their tutorial selections in the same session -- of course, all done in R.
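The survey data and allocation code are not shown in the post, but a brute-force version of this optimisation is small enough to sketch. The code below is hypothetical (the `choices` matrix is simulated, not the real survey): it enumerates every way to put 8 of the 16 tutorials in the morning session and picks the split with the fewest clashes.

```r
# Hypothetical sketch: assign 16 tutorials to a morning and an afternoon
# session so that as few respondents as possible have both of their chosen
# tutorials in the same session. 'choices' stands in for the survey: one
# row per respondent, two columns giving the ids (1-16) of their picks.
set.seed(42)
choices <- t(replicate(200, sample(16, 2)))

# All ways to choose the 8 tutorials that run in the morning
splits <- combn(16, 8)

clashes <- apply(splits, 2, function(morning) {
  in_morning <- matrix(choices %in% morning, ncol = 2)
  # A clash: both picks in the morning, or both in the afternoon
  sum(in_morning[, 1] == in_morning[, 2])
})

best <- splits[, which.min(clashes)]
best            # tutorial ids assigned to the morning session
min(clashes)    # number of respondents left with a clash
```

Since choose(16, 8) is only 12,870, exhaustive enumeration is perfectly feasible here; a larger tutorial programme would call for a heuristic instead.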

Initially we hoped that 300-400 participants would show up in Aalborg. This was based on some data, but primarily on a somewhat *pessimistic* prior, which fortunately did not hold true. In the figure below, the number of registered participants is plotted over time. We opened registration on December 3rd, 2014, and as expected only a few signed up in the first months. However, as presenters received their notifications and the early registration deadline approached, we had already achieved our goal of 400 participants. In the months that followed, another 260 useR!s signed up, and we had a total of 660 participants when we ran the conference in July 2015 (128 females and 532 males). The industry/academia split was almost 50/50, with 284 participants from industry, 262 academics and 113 students.

We had participants coming from more than 40 countries (see below), with the majority coming from Denmark (129), the USA (117) and Germany (92). For most countries, the split between industry and academia was also close to 50/50 (right barplot).

In summary, the useR! 2015 conference went well, and we are grateful for all the feedback we have received -- positive as well as negative. These inputs are valuable to the R community in general, and to us as local organisers in particular. We are happy to share any ideas and comments with future organisers of the useR! conference series.

The conference venue (the Aalborg Congress and Culture Center, akkc.dk) was an ideal size for the turnout: a plenary lecture hall seating almost 800 people, plus four additional rooms (150-220 seats) for the parallel sessions, proved adequate. The professional assistance from the staff during the planning was very helpful, and we can only recommend involving experienced people in the planning and execution of the next useR! conferences. Similar thoughts go to the catering: with a good, varied and sufficient food supply, most people are happy!

The social events (the welcome reception, poster session and conference dinner) all went as we had hoped. For the poster session we had free drinks and food, which meant most people stayed until the end and poster presenters had many interesting discussions. We had intentionally encouraged posters to be on display throughout the conference and to be located in the exhibitors' area. This gave people more time both to visit our many sponsors and to look at the posters whenever they felt like it.

Once again we would like to thank all useR! 2015 participants for making the conference a memorable experience for the Department of Mathematical Sciences at Aalborg University. A special thanks goes to our many sponsors, who made it possible to provide a high level of service.

On behalf of the local organising committee,

Torben Tvedebrink