*by Bob Horton, Data Scientist, Revolution Analytics*

From electronic medical records to genomic sequences, the data deluge is affecting all aspects of health care. The Master of Science in Health Informatics (MSHI) program at the University of San Francisco, now in its second year, is designed to help students develop the practical computing skills and quantitative perspicacity they need to manage and exploit this wealth of data in health care applications.

This spring, I am privileged to participate in this effort by developing and teaching a new course, “Statistical Computing for Biomedical Data Analytics”, intended to motivate and prepare students for further studies in data science, such as the intensive summer courses of the MSAN bootcamp. The syllabus is on GitHub.

As you’ve probably guessed, we will be using R. Other courses in the curriculum use Python, which seems to be favored by engineers; in contrast, R was developed by and for statisticians. We want the students to be exposed to both perspectives, and to have the technical background needed to make use of the extensive repositories of code available from CRAN and Bioconductor.

Data science is an interdisciplinary endeavor born of the synergy between computing, statistics, data management, and visualization. This can make it challenging to get started, because you have to know so many things before you get to the good stuff. We’re going to try to ease into it by starting with computational explorations of mathematical and statistical concepts. R is a fantastic environment for this; you can see a bell-shaped curve emerge from an example as simple as

`plot(0:20, choose(n=20, k=0:20))`

Note the expressive power of the vector of k values, and the easy convenience of having a world of statistical functions at your fingertips. Imagine how this little plot would have delighted Sir Francis Galton.
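To push the example one step further, here is a base-R sketch (not from the course materials) that normalizes the coefficients into binomial probabilities and overlays the normal curve they approach:

```r
# Binomial(20, 0.5) probabilities built from the coefficients, with the
# limiting normal curve N(np, np(1-p)) overlaid for comparison.
n <- 20
k <- 0:n
p <- choose(n, k) / 2^n                  # P(X = k) for a fair coin
plot(k, p, pch = 19, ylab = "P(X = k)")
curve(dnorm(x, mean = n / 2, sd = sqrt(n / 4)), add = TRUE, lty = 2)
```

The dashed curve makes the de Moivre-Laplace approximation visible: even at n = 20 the binomial probabilities hug the normal density closely.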

Data science is a journey. The enormous breadth of material and the rapid pace of development mean that the most important thing to learn is how to learn more. We’ll explore many fantastic resources for learning data science and R. For example, Coursera has excellent offerings, exemplified by the series of mini-courses from Johns Hopkins; our students will take at least one of these as a course project.

Of course, the R community itself is the biggest and most important resource. One class will be a field trip to a Bay Area useR Group (BARUG) meeting, and the comments in response to this post will be required reading. Ideas or suggestions regarding the syllabus or course materials from the GitHub repository are welcome, as are observations or ruminations on the process of learning data science and R.

Finally, we are very interested in helping our students find outstanding internship opportunities in health-related organizations. Please don’t hesitate to contact me through community@revolutionanalytics.com if you are interested in working with us. Stay tuned for progress reports.

by Joseph Rickert

KatRisk, a Berkeley-based catastrophe modeling company specializing in wind and flood risk, has put three R- and Shiny-powered interactive demos on their website. Together these provide a nice introduction to the practical aspects of weather-based risk modeling and give a good indication of the kinds of data that are important. Two of the models, the US & Caribbean Hurricane Model and the Asia Typhoon Model, provide a tremendous amount of information, but they require a little background knowledge to understand the data required to drive them and the computed loss statistics.

The Flood Data Lookup Model, however, can really hit home for anybody. Just bring up the model, type in the address of the location of interest and press the red "Geocode" button to get the associated longitude and latitude. Then click on the "Get Data" button. The resulting information will give you an idea of the level of risk for the property and let you know what a 100-year flood and a 500-year flood would look like. Next, switch to the "Flood Map" tab and press the "Get Map" button to see some of the information overlaid on a Google map.

Not being able to resist the opportunity to have Google Maps google Google, I thought it would be interesting to see how bad things could get at the Googleplex.

Uh oh! The Googleplex gets a pretty high KatRisk score. A 100-year flood would put the place under 7 feet of water!

Not to worry though: Google has already completed their first round of feasibility tests for a navy. (Nobody does long range planning like Google.)

The KatRisk models are based on R code that makes heavy use of data.table for fast table lookups of the risk results. As the company says on their website:

KatRisk has developed a suite of analytic tools to make it easy to access our data and models. We use open source software tools including R Shiny for our web applications. By using R shiny we can develop on-line products that can also easily be deployed to a client site. Our software is completely open, so if you decide to host our analytical tools you will be able to see all of the details in easy to understand and modify R code.
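As a hypothetical illustration of the kind of keyed lookup involved (the table and column names below are invented, not KatRisk's schema), data.table's keys turn a linear scan into a binary search:

```r
library(data.table)

# Invented example data: precomputed flood-depth results per location.
risk <- data.table(
  lon = c(-122.08, -122.27, -121.89),
  lat = c(  37.42,   37.87,   37.34),
  depth_100yr_ft = c(7.0, 2.5, 0.0)
)
setkey(risk, lon, lat)   # sort once; subsequent lookups use binary search

# Keyed join: retrieve the precomputed result for a single location.
risk[.(-122.08, 37.42)]
```

On large tables this is the difference between scanning millions of rows per query and jumping straight to the matching row, which is presumably why it matters for interactive Shiny apps.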

For some details on the underlying analytics, have a look at this previous post based on a talk Dag Lohmann gave to the Bay Area useR Group last year.

So, go ahead and compute your KatRisk score, but please do be mindful of the company's request not to run the model for more than 3 locations in one day.

by Ryan Garner

Senior Data Scientist, Revolution Analytics

I love creating spatial data visualizations in R. With the ggmap package, I can easily download satellite imagery which serves as a base layer for the data I want to represent. In the code below, I show you how to visualize sampled soil attributes among 16 different rice fields in Uruguay.

```r
library(ggmap)
library(plyr)
library(gridExtra)

temp <- tempfile()
download.file("http://www.plantsciences.ucdavis.edu/plant/data.zip", temp)
connection <- unz(temp, "Data/Set3/Set3data.csv")
rice <- read.csv(connection)
names(rice) <- tolower(names(rice))

# Create a custom soil attribute plot
# @param df Data frame containing data for a field
# @param attribute Soil attribute
# @return Custom soil attribute plot
create_plot <- function(df, attribute) {
  map <- get_map(location = c(median(df$longitude), median(df$latitude)),
                 maptype = "satellite", source = "google",
                 crop = FALSE, zoom = 15)
  plot <- ggmap(map) +
    geom_point(aes_string(x = "longitude", y = "latitude", color = attribute),
               size = 5, data = df)
  plot <- plot + ggtitle(paste("Farmer", df$farmer, "/ Field", df$field))
  plot <- plot + scale_color_gradient(low = "darkorange", high = "darkorchid4")
  return(plot)
}

ph_plot <- dlply(rice, "field", create_plot, attribute = "ph")
ph_plots <- do.call(arrangeGrob, ph_plot)
```

First, I download data that is used in "Spatial Data Analysis in Ecology and Agriculture using R" by Dr. Richard Plant. (This is an excellent book to get your feet wet working with spatial data in R.) After the data has been downloaded, I create a function that builds a custom soil attribute plot for each unique field found in the rice yield data. Then I customize the output to include larger spatial points and a custom gradient, from dark orange to dark purple, for clarity.

Finally, once all the plots are generated, I arrange them into a single plot.

The plot shows the pH intensity of the soil in 16 fields belonging to 9 different farmers. The second-to-last plot, field 15 of farmer L, appears to have higher pH values than the rest.

by Joseph Rickert

In a recent post, where I presented some R-related highlights of November's H2O World conference, I singled out and described talks by Trevor Hastie and John Chambers and remarked that it would be nice if the videos would be made available. Well, thanks to the generosity of the folks at H2O, I got my wish.

Here is the video of Professor Hastie's talk.

This video represents a master class on machine learning in which, in 40 minutes or so, Professor Hastie conducts a tour that starts with basic decision trees and goes all the way to building learning ensembles with the Lasso. Along the way, he presents the salient ideas on bagging, random forests and boosting. The treatment of boosting is succinct and elegant, covering some remarkable features of the family of boosting algorithms. For example, Professor Hastie describes how training error in AdaBoost can reach zero and stay there while testing error continues to improve, how superior performance can be achieved with boosting algorithms by using only tree stumps, and how the stagewise additive modeling "slows down the rate of overfitting". The really deep insight comes in the discussion of viewing AdaBoost as an algorithm that fits additive logistic regression models with an exponential loss function. This, in turn, leads to a discussion of Jerome Friedman's Gradient Boosting Machine and more general boosting algorithms that can accommodate multiple kinds of loss functions. These are the models implemented in R's gbm package.
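The stagewise additive view is easy to see in code. Below is a toy base-R sketch of AdaBoost with decision stumps (an illustration of the idea only, not the gbm package's implementation): weights on misclassified points grow under the exponential loss, and the ensemble is a weighted vote of stumps.

```r
# Toy AdaBoost with decision stumps in base R. Labels must be in {-1, +1}.

# Fit the single threshold ("stump") minimizing weighted 0/1 error.
fit_stump <- function(x, y, w) {
  best <- list(err = Inf)
  for (t in unique(x)) {
    for (s in c(-1, 1)) {
      pred <- ifelse(x > t, s, -s)
      err  <- sum(w[pred != y])
      if (err < best$err) best <- list(t = t, s = s, err = err)
    }
  }
  best
}

adaboost <- function(x, y, M = 25) {
  n <- length(x)
  w <- rep(1 / n, n)                        # uniform starting weights
  stumps <- vector("list", M)
  alpha  <- numeric(M)
  for (m in seq_len(M)) {
    st  <- fit_stump(x, y, w)
    err <- min(max(st$err, 1e-10), 1 - 1e-10)
    alpha[m] <- 0.5 * log((1 - err) / err)  # this stump's vote weight
    pred <- ifelse(x > st$t, st$s, -st$s)
    w <- w * exp(-alpha[m] * y * pred)      # exponential-loss reweighting
    w <- w / sum(w)
    stumps[[m]] <- st
  }
  list(stumps = stumps, alpha = alpha)
}

# Ensemble prediction: sign of the weighted sum of stump votes.
predict_ada <- function(model, x) {
  f <- numeric(length(x))
  for (m in seq_along(model$alpha)) {
    st <- model$stumps[[m]]
    f  <- f + model$alpha[m] * ifelse(x > st$t, st$s, -st$s)
  }
  sign(f)
}
```

On cleanly separable data a single stump already drives training error to zero; the behavior Hastie describes, where test error keeps improving after training error flattens, shows up on noisy data.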

I think this video of John Chambers reminiscing about his time at Bell Labs working with John Tukey is destined to become an important part of the historical record for statistics. There are many remembrances of Tukey to be found online, but I don't know of any other visual record by someone of John Chambers' stature who interacted with Tukey as a colleague and professional statistician.

In just a few minutes, Chambers paints a balanced and revealing portrait that humanizes and captures some of the complexity of this icon of modern statistics. I especially like the story in the Q & A portion of the talk where John describes Tukey's propensity for "mischief" and his delight in inventing new words (like boxplot "hinges") that rankled many of his statistician colleagues, but apparently particularly upset the British statisticians.

There are a few more videos on the H2O site that are worth a look.

Johns Hopkins Biostatistics Professor (and presenter of Data Analysis at Coursera) Jeff Leek has published his list of awesome things other people did in 2014. It's well worth following the links in his 38 entries, where you'll find a wealth of useful resources in teaching, statistics, data science, and data visualization.

Many of the entries are related to R, including shout-outs to: the data wrangling, exploration, and analysis with R class at UBC; this paper on R Markdown and reproducible analysis; Hadley Wickham's R Packages; Hilary Parker's guide to writing R packages from scratch; the broom package (for tidying up statistical output in R); Karl Broman's hipsteR tutorial; Rocker (Docker containers for R); and Packrat and R markdown v2 from RStudio. I was also chuffed to see that this blog got a mention, too:

Another huge reason for the movement with R has been the outreach and development efforts of the Revolution Analytics folks. The Revolutions blog has been a must read this year.

Thanks, Jeff! Check out Jeff's complete list of awesome things at SimplyStatistics by following the link below.

SimplyStatistics: A non-comprehensive list of awesome things other people did in 2014

Visualizing complex survey data is something of an art. If the data have been collected and aggregated to geographic units (say, counties or states), a choropleth is one option. But if the data aren't so neatly arranged, making visual sense often requires some form of smoothing to represent it on a map.

R, of course, has a number of features and packages to help you, not least the survey package and the various mapping tools. Swmap (short for "survey-weighted maps") is a collection of R scripts that visualize some public data sets, for example this cartogram of transportation share of household spending based on data from the 2012-2013 Consumer Expenditure Survey.

Take a look at the script that created this chart and the other scripts available at swmap to learn some R-based techniques for visualizing survey-based data. And while you're there, browse the rest of asdfree.com for other useful resources (including data sets and tutorials) on analyzing survey data, for free, with R.

Analyze survey data for free: maps and the art of survey-weighted maintenance (via the author, Anthony Damico)

by Joseph Rickert

The American Statistical Association (ASA) Undergraduate Guidelines Workgroup recently published the report Curriculum Guidelines for Undergraduate Programs in Statistical Science. Although intended for educators setting up or revamping Stats programs at colleges and universities, this concise, 17-page document should be good reading for anyone who wants to take charge of their own education in learning to "think with data". Whether you are just getting started with your education or you are a working professional contemplating what to learn next to expand your knowledge and update your skills, you should find the ASA report helpful.

The report places good statistical practice firmly on the foundation of the scientific method and locates statistical knowledge and skills squarely in the center of modern data analysis.

However, it is far from being a complacent panegyric to statistics. The ASA report challenges educators to help students see that the "discipline of statistics is more than a collection of unrelated tools" and explicitly calls for an increased emphasis on data science and big-league computational skills. Graduates of statistical programs:

should be facile with professional statistical software and other appropriate tools for data exploration, cleaning, validation, analysis, and communication. They should be able to program in a higher-level language, to think algorithmically, to use simulation-based statistical techniques . . . Graduates should be able to manage and manipulate data, including joining data from different sources and formats and restructuring data into a form suitable for analysis.

The expectations for communication skills are particularly noteworthy. The report says:

Graduates should be expected to write clearly, speak fluently, and construct effective visual displays and compelling written summaries. They should demonstrate ability to collaborate in teams and to organize and manage projects.

One could argue about the details of the topics that should be included in an undergraduate program. But clearly the committee is aiming for far more than producing minimally competent, employable graduates. They are outlining a way of life, a competent way of being in a data-driven world.

Hidden among the white papers listed on the ASA curriculum guidelines page is a treasure: Tim Hesterberg's paper on What Teachers Should Know about the Bootstrap: Resampling in the Undergraduate Curriculum. This is a lucid and fairly deep explication of bootstrapping and resampling techniques that deserves wide circulation. Tim writes that he had three goals in producing the paper: "(1) To show the enormous potential of bootstrapping and permutation tests to help students understand statistical concepts . . . (2) To dig deeper . . . (3) To change statistical practice . . ."

Point (3) may sound astoundingly ambitious. However, it is grounded in a revolution that has been quietly gaining strength and whose time has come. Textbooks that rely on R-based simulations to teach probability (e.g. Baclawski) and statistics (e.g. Matloff) have been available for some time, and Tim points out that undergraduate textbooks such as Chihara and Hesterberg, which use resampling as the fundamental unifying idea, are beginning to appear. Moreover, data scientists outside of the community of academic statisticians are well aware that programming skills more than compensate for a traditional statistics education that presents the subject as a collection of unrelated tests and techniques, as this Strata + Hadoop World presentation from John Rauser makes clear.
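In that spirit, the core resampling idea fits in a few lines of base R. Here is a sketch (not Hesterberg's code) of a percentile-bootstrap confidence interval for a median:

```r
# Percentile-bootstrap 95% CI for the median of a skewed sample.
set.seed(1)
x <- rexp(50)                                  # a small, skewed sample
boot_medians <- replicate(10000, median(sample(x, replace = TRUE)))
ci <- quantile(boot_medians, c(0.025, 0.975))  # percentile interval
ci
```

No distributional formula is required: the sampling variability of the statistic is estimated directly by recomputing it on resamples, which is exactly why resampling works so well as a unifying classroom idea.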

It is very good, indeed, to see the ASA leading the charge for change.

by Joseph Rickert

H2O.ai held its first H2O World conference over two days at the Computer History Museum in Mountain View, CA. Although the main purpose of the conference was to promote the company's rich set of Java-based machine learning algorithms and announce their new products Flow and Play, there were quite a few sessions devoted to R and statistics in general.

Before I describe some of these, a few words about the conference itself. H2O World was exceptionally well run, especially for a first try with over 500 people attending (my estimate). The venue is an interesting, accommodating space with plenty of parking that played well with what, I think, must have been an underlying theme of the conference: acknowledging the contributions of past generations of computer scientists and statisticians. There were two stages offering simultaneous talks for at least part of the conference: the Paul Erdős stage and the John Tukey stage. Tukey I got; but why put such an eccentric mathematician as Erdős front and center? I was puzzled until Sri Ambati, H2O.ai's CEO and co-founder, remarked that he admired Erdős for his great generosity with collaboration. To a greater extent than most similar events, H2O World itself felt like a collaboration. There was plenty of opportunity to interact with other attendees, speakers and H2O technical staff (the whole company must have been there). Data scientists, developers and marketing staff were accessible and gracious with their time. Well done!

R was center stage for a good bit of the hands-on training that occupied the first day of the conference. There were several sessions (Exploratory Data Analysis, Regression, Deep Learning, Clustering and Dimensionality Reduction) on accessing various H2O algorithms through the h2o R package and the H2O API. All of these moved quickly from R to running the custom H2O algorithms on the JVM. However, the message that came through is that R is the right environment for sophisticated machine learning.

Two great pleasures from the second day of the conference were Trevor Hastie's tutorial on the Gradient Boosting Machine and John Chambers' personal remembrances of John Tukey. It is unusual for a speaker to announce that he has been asked to condense a two-hour talk into something just under an hour and then go on to speak slowly with great clarity, each sentence beguiling you into imagining that you are really following the details. (It would be very nice if the video of this talk would be made available.)

Two notable points from Trevor's lecture were understanding gradient boosting as minimizing the exponential loss function and the openness of the gbm algorithm to "tinkering". For the former point see Chapter 10 of The Elements of Statistical Learning or the more extended discussion in Schapire and Freund's Boosting: Foundations and Algorithms.

John Tukey spent 40 years at Bell Labs (1945 - 1985), and John Chambers' tenure there overlapped the last 20 years of Tukey's stay. Chambers, who had the opportunity to observe Tukey over this extended period of time, painted a moving and lifelike portrait of the man. According to Chambers, Tukey could be patient and gracious with customers and staff, provocative with his statistician colleagues and "intellectually intimidating". John remembered Richard Hamming saying: "John (Tukey) was a genius. I was not." Tukey apparently delighted in making up new terms when talking with fellow statisticians. For example, he called the top and bottom lines that identify the interquartile range on a box plot "hinges", not quartiles. I found it particularly interesting that Tukey would describe a statistic in terms of the process used to compute it, and not in terms of any underlying theory. Very unusual, I would think, for someone who earned a PhD in topology under Solomon Lefschetz. For more memories of John Tukey, including more from John Chambers, look here.

Other R-related highlights were talks by Matt Dowle and Erin LeDell. Matt reprised the update on new features in data.table that he recently gave to the Bay Area useR Group and also presented interesting applications using data.table from UK insurance company Landmark, and KatRisk (look here for the KatRisk part of Matt's presentation).

Erin, author of the h2oEnsemble package available on GitHub, delivered an exciting and informative talk on using ensembles of learners (combining gbm models and logistic regression models, for example) to create "superlearners".

Finally, I gave a short talk on Revolution Analytics' recent work towards achieving reproducibility in R. The presentation motivates the need for reproducibility by examining the use of R in industry and science, and describes how the checkpoint package and Revolution R Open, an open source distribution of R that points to a static repository, can be helpful.

by Tim Winke

PhD student in Demography and Social Sciences in Berlin

*This post has been abstracted from Tim's entry to a contest that Dalia Research is running based on a global smartphone survey that they are conducting. Tim's entry post is available, as is all of the code behind it. - editor*

When people think about Germany, what comes to their mind? Oktoberfest, ok – but Mercedes might be second or BMW or Porsche. German car brands have a solid reputation all over the world, but how popular is each brand in different countries?

There is plenty of survey data out there, but hardly anyone collects answers within a couple of days from 6 continents. A new start-up called Dalia Research found a way to use smartphone and tablet networks to conduct surveys. It’s not a separate app but works via thousands of apps, where targeted users decide to take part in a survey in exchange for an incentive.

In August 2014, they asked 51 questions of young mobile users in 64 countries including Colombia, Iran and Ukraine. This is impressive – you have access to the opinions of 32,000 people collected within 4 days from all over the world – 500 respondents in each country – about their religion, what they think about the United States, about where the EU has global influence or whether Qatar should host the 2022 FIFA World Cup, and also: “What is your favorite German car brand?”.

Surprisingly, as the map below shows, BMW seems to be the most popular German car brand – and Volkswagen does not reach the pole position in any country.

The ggplot2 stacked barchart provides even more detail.

To see how I employed dplyr, ggplot2 and rworldmap to construct these plots as well as how to integrate the survey data with world development indicators from the World Bank please have a look at my original post.

With so many more devices and instruments connected to the "Internet of Things" these days, there's a whole lot more time series data available to analyze. But time series are typically quite noisy: how do you distinguish a short-term tick up or down from a true change in the underlying signal? To solve this problem, Twitter created the BreakoutDetection package for R, which decomposes a time series into a series of segments of one of three types:

- **Steady state**: The time series follows a fixed mean (with random noise around the mean);
- **Mean shift**: The time series jumps directly from one steady state to another;
- **Ramp up / down**: The time series transitions linearly from one steady state to another, over a fixed period of time.

Given a univariate time series (and a few tuning parameters), the breakout function will return a list of **breakout points**: times when these state transitions are detected. It uses a non-parametric algorithm (E-Divisive with Medians) to detect the breakout points, so no assumptions are made about the underlying distribution of the time series.
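A minimal sketch of calling it on simulated data (the series below is invented for illustration, and the package must first be installed from Twitter's GitHub repository):

```r
# devtools::install_github("twitter/BreakoutDetection")
library(BreakoutDetection)

# Simulated series: a steady state followed by a mean shift at t = 100.
set.seed(1)
z <- c(rnorm(100, mean = 0), rnorm(100, mean = 3))

# min.size sets the smallest segment length; method = "multi" allows
# multiple breakouts; beta penalizes adding breakout points.
res <- breakout(z, min.size = 30, method = "multi", beta = 0.001)
res$loc   # estimated breakout location(s)
```

With a shift this pronounced, the estimated location should land near index 100; on real, noisier data the `min.size` and `beta` tuning parameters control how readily the algorithm declares a breakout.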

Twitter uses this R package to monitor the user experience on the Twitter network and detect when things are "Breaking Bad". Data scientist Randy Zwitch used the package to identify the dates of blog posts or references on Hacker News from his blog traffic data. (He also compared the algorithm to anomaly detection with the Adobe Analytics API.) And the University of Louisville School of Medicine has also looked at using the package to identify past influenza outbreaks from CDC data:

For more information about the BreakoutDetection package, check out Twitter's blog post linked below. You can download the BreakoutDetection R package itself from GitHub.

Twitter Engineering blog: Breakout detection in the wild (via FlowingData)