by Tim Winke

PhD student in Demography and Social Sciences in Berlin

*This post has been abstracted from Tim's entry to a contest that Dalia Research is running based on a global smartphone survey that they are conducting. Tim's entry post is available, as is all of the code behind it. - editor*

When people think about Germany, what comes to their mind? Oktoberfest, ok – but Mercedes might be second or BMW or Porsche. German car brands have a solid reputation all over the world, but how popular is each brand in different countries?

There is plenty of survey data out there, but hardly anyone collects answers from six continents within a couple of days. A new start-up called Dalia Research found a way to use smartphone and tablet networks to conduct surveys. It’s not a separate app but works via thousands of apps where targeted users decide to take part in a survey in exchange for an incentive.

In August 2014, they asked 51 questions to young mobile users in 64 countries including Colombia, Iran and Ukraine. This is impressive – you have access to the opinions of 32,000 people, collected within 4 days from all over the world – 500 respondents in each country – about their religion, what they think about the United States, whether the EU has global influence, or if Qatar should host the 2022 FIFA World Cup, and also: “What is your favorite German car brand?”.

Surprisingly, as the map below shows, BMW seems to be the most popular German car brand – and Volkswagen does not reach the pole position in any country.

The ggplot2 stacked barchart provides even more detail.

To see how I employed dplyr, ggplot2 and rworldmap to construct these plots as well as how to integrate the survey data with world development indicators from the World Bank please have a look at my original post.

With so many more devices and instruments connected to the "Internet of Things" these days, there's a whole lot more time series data available to analyze. But time series are typically quite noisy: how do you distinguish a short-term tick up or down from a true change in the underlying signal? To solve this problem, Twitter created the BreakoutDetection package for R, which decomposes a time series into a series of segments of one of three types:

- **Steady state**: The time series follows a fixed mean (with random noise around the mean);
- **Mean shift**: The time series jumps directly from one steady state to another;
- **Ramp up / down**: The time series transitions linearly from one steady state to another, over a fixed period of time.

Given a univariate time series (and a few tuning parameters), the breakout function will return a list of **breakout points**: times when these state transitions are detected. It uses a non-parametric algorithm (E-Divisive with Medians) to detect the breakout points, so no assumptions are made about the underlying distribution of the time series.
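The segment types above are easy to simulate in base R, which is a good way to build intuition for what `breakout` is looking for. The `breakout` call at the end is commented out, and its arguments (`min.size`, `method`, `beta`) follow Twitter's announcement, so treat the exact values as assumptions rather than recommendations:

```r
# Simulate a series with the three segment types the package models:
# a steady state, an abrupt mean shift, a linear ramp, and a new steady state.
set.seed(42)
steady1 <- rnorm(100, mean = 0)                     # steady state around 0
shift   <- rnorm(100, mean = 3)                     # mean shift to 3
ramp    <- seq(3, 6, length.out = 50) + rnorm(50)   # linear ramp up
steady2 <- rnorm(100, mean = 6)                     # new steady state around 6
z <- c(steady1, shift, ramp, steady2)

# With BreakoutDetection installed from GitHub
# (devtools::install_github("twitter/BreakoutDetection")),
# breakout points could then be estimated with:
# library(BreakoutDetection)
# res <- breakout(z, min.size = 30, method = "multi", beta = 0.001, plot = TRUE)
# res$loc  # estimated breakout locations
```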

Twitter uses this R package to monitor the user experience on the Twitter network and detect when things are "Breaking Bad". Data scientist Randy Zwitch used the package to identify the dates of blog posts or references on Hacker News from his blog traffic data. (He also compared the algorithm to anomaly detection with the Adobe Analytics API.) And the University of Louisville School of Medicine has also looked at using the package to identify past influenza outbreaks from CDC data:

For more information about the BreakoutDetection package, check out Twitter's blog post linked below. You can download the BreakoutDetection R package itself from GitHub.

Twitter Engineering blog: Breakout detection in the wild (via FlowingData)

by Joseph Rickert

The igraph package has become a fundamental tool for the study of graphs and their properties, the manipulation and visualization of graphs and the statistical analysis of networks. To get an idea of just how firmly igraph has become embedded into the R package ecosystem consider that currently igraph lists 72 reverse depends, 59 reverse imports and 24 reverse suggests. The following plot, which was made with functions from the igraph and miniCRAN packages, indicates the complexity of this network.

```r
library(miniCRAN)
library(igraph)

# Packages with a reverse dependency on igraph
pk <- c("igraph", "agop", "bc3net", "BDgraph", "c3net", "camel", "cccd",
        "CDVine", "CePa", "CINOEDV", "cooptrees", "corclass", "cvxclustr",
        "dcGOR", "ddepn", "dils", "dnet", "dpa", "ebdbNet", "editrules",
        "fanovaGraph", "fastclime", "FisHiCal", "flare", "G1DBN", "gdistance",
        "GeneNet", "GeneReg", "genlasso", "ggm", "gRapfa", "hglasso", "huge",
        "igraphtosonia", "InteractiveIGraph", "iRefR", "JGL", "lcd",
        "linkcomm", "locits", "loe", "micropan", "mlDNA", "mRMRe", "nets",
        "netweavers", "optrees", "packdep", "PAGI", "pathClass", "PBC",
        "phyloTop", "picasso", "PoMoS", "popgraph", "PROFANCY", "qtlnet",
        "RCA", "ReliabilityTheory", "rEMM", "restlos", "rgexf", "RNetLogo",
        "ror", "RWBP", "sand", "SEMID", "shp2graph", "SINGLE", "spacejam",
        "TDA", "timeordered", "tnet")

dg <- makeDepGraph(pk)
plot(dg, main = "Network of reverse depends for igraph", cex = .4,
     vertex.size = 8)
```

The igraph package itself is a tour de force with over 200 functions. Learning the package can be a formidable task especially if you are trying to learn graph theory and network analysis at the same time. To help myself through this process, I sorted the functions in the igraph package into seven rough categories:

- Create Graph
- Describe Graph
- Environment
- Find Communities
- Operate on a Graph
- Plot
- Statistics

The following shows a portion of the table for the first 10 functions related to creating graphs.

| | Function | Description | Category of Function |
|---|----------|-------------|----------------------|
| 1 | aging.prefatt.game | Generate an evolving random graph with preferential attachment and aging | Create Graph |
| 2 | barabasi.game | Generate scale-free graphs according to the Barabasi-Albert model | Create Graph |
| 3 | bipartite.random.game | Generate bipartite graphs using the Erdos-Renyi model | Create Graph |
| 4 | degree.sequence.game | Generate a random graph with a given degree sequence | Create Graph |
| 5 | erdos.renyi.game | Generate random graphs according to the Erdos-Renyi model | Create Graph |
| 6 | forest.fire.game | Grow a network that simulates how a fire spreads by igniting trees | Create Graph |
| 7 | graph.adjacency | Create an igraph from an adjacency matrix | Create Graph |
| 8 | graph.bipartite | Create a bipartite graph | Create Graph |
| 9 | graph.complementer | Create the complementary graph for a given graph | Create Graph |
| 10 | graph.empty | Create an empty graph | Create Graph |

The entire table may be downloaded here: Download Igraph_functions.
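For a quick taste of the Create Graph category, here is a minimal sketch using two of the functions from the table (the function names follow the igraph release current at the time of writing; newer releases alias them to underscore-style names):

```r
library(igraph)

# An Erdos-Renyi random graph with 20 vertices and edge probability 0.2
set.seed(1)
g1 <- erdos.renyi.game(20, 0.2)

# The same kind of igraph object built from an adjacency matrix
m <- matrix(c(0, 1, 1,
              1, 0, 0,
              1, 0, 0), nrow = 3, byrow = TRUE)
g2 <- graph.adjacency(m, mode = "undirected")

vcount(g2)  # 3 vertices
ecount(g2)  # 2 edges
```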

infomap.community() is an intriguing function listed under the Finding Communities category that looks for structure in a network by minimizing the expected description length of a random walker trajectory. The abstract to the paper by Rosvall and Bergstrom that introduced this method states:

To comprehend the multipartite organization of large-scale biological and social systems, we introduce an information theoretic approach that reveals community structure in weighted and directed networks. We use the probability flow of random walks on a network as a proxy for information flows in the real system and decompose the network into modules by compressing a description of the probability flow. The result is a map that both simplifies and highlights the regularities in the structure and their relationships.

I am not willing to claim that the 37 communities found by the algorithm represent meaningful structure, however, the idea of partitioning the network based on information flow does seem relevant to the package building process. Anybody looking for a research project?

```r
imc <- infomap.community(dg)
imc
# Graph community structure calculated with the infomap algorithm
# Number of communities: 37
# Modularity: 0.5139813
# Membership vector:
```

Some additional resources for working with igraph are:

- The igraph home page
- A presentation by Gábor Csárdi, igraph's principal author
- An old post with some pointers to additional resources
- A couple of nice tutorials: here and here.
- The new book, Statistical Analysis of Network Data with R by Eric D. Kolaczyk and Gábor Csárdi, is a very good read.

by Peggy Fan

Ph.D. Candidate at Stanford's Graduate School of Education

Part of my dissertation at the Stanford Graduate School of Education, International Comparative Education program, looks at the World Values Survey (WVS), a cross-national social survey that started in 1981. Since then there have been six waves, and the surveys include questions that capture the demographics, behaviors, personal beliefs, and attitudes of respondents in a variety of contexts. I am interested in looking at civic participation, which is often measured by the extent to which a person belongs to an organization outside of family and work (and religion).

The goal is to create a tool that facilitates preliminary data analyses on this large dataset. The shiny app turns out to be a great tool for data visualization and exploration.

There are 85 countries from the first five waves in this dataset, with about 255,000 observations. My outcome variable is based on a battery of questions from the WVS that ask if the respondent is a member of any of the following types of association: sports, arts, labor, politics, environmental, charity, women’s rights, human rights, or other. A respondent gets a “1” if he or she answers “yes” to any of the associational membership questions.

For the purpose of this app, I extract variables that are relevant, such as regions, country IDs, gender, educational attainment, and membership from the larger data set. Because the lowest unit of analysis is “country”, I calculate the country and regional averages of membership by topics of gender and educational attainment.
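As a rough sketch of that aggregation step, here is how the country and regional averages might be computed in base R; the data frame and its column names are hypothetical stand-ins for the actual WVS extract:

```r
# Hypothetical respondent-level data with the variables described above
wvs <- data.frame(
  country    = rep(c("Chile", "Ghana", "Poland"), each = 4),
  region     = rep(c("region4", "region5", "region2"), each = 4),
  gender     = rep(c("male", "female"), 6),
  membership = c(1, 0, 1, 1,  0, 0, 1, 0,  1, 1, 0, 1)
)

# Country-level membership rates, broken down by gender
country_avg <- aggregate(membership ~ country + gender, data = wvs, FUN = mean)

# Regional averages of membership
region_avg <- aggregate(membership ~ region, data = wvs, FUN = mean)
```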

The reactive input function in shiny adjusts the data displayed based on the criteria selected. This allows many ways to dissect data. I utilize it to present data at the world, region, and country levels as well as by gender and education.

I create three tabs by topic. Users can choose “the world” or a region of interest on the left, and using `reactive` with `selectInput`, I create an object that makes the main panel display the corresponding data. I also use `observe` to add another reactive element, which displays the list of countries after the region is selected.

For the gender and educational attainment tabs, `renderDataTable` is perfect for displaying the information because it allows users to sort the data based on variables of interest. It also has other options, such as including a search box in the table, but I do not want extraneous features to clutter the table, and a simple version is adequate for conveying the information.

server.R

```r
selectedData1 <- reactive({
  if (input$region == "the world") {
    highested_table[, -1]
  } else {
    region <- input$region
    highested_table[(highested_table$region == region), -1]
  }
})

output$mytable1 <- renderDataTable({
  selectedData1()
}, options = list(lengthMenu = c(5, 10), pageLength = 5, searching = FALSE))

observe({
  region <- input$region
  updateSelectInput(session, "country",
    choices  = levels(as.factor(as.character(wvs_c$country[wvs_c$region == region]))),
    selected = levels(as.factor(as.character(wvs_c$country[wvs_c$region == region])))[1]
  )
})
```

ui.R

```r
sidebarPanel(
  selectInput("region", "Select a region:",
    list("All World" = "the world",
         "North America & Western Europe" = "region1",
         "Central Europe" = "region2",
         "Asia" = "region3",
         "Latin America & Caribbean" = "region4",
         "Sub-Saharan Africa" = "region5",
         "Middle East & Northern Africa" = "region6",
         "Oceania" = "region7"),
    selected = "the world"
  )
)

mainPanel(
  tabPanel("Gender",
    dataTableOutput("mytable"),
    selectInput("country", "Select a Country:",
      names(wvs_c$country), selected = names(wvs_c$country)[1]),
    plotOutput("myplot")
  ),
  # ... remaining tabs omitted
```

The maps provide a holistic view of world and regional comparisons. I choose the `rworldmap` package because it uses ISO3 as the country identifier, and I also use ISO3 in my own data, which makes merging country-level data and spatial polygons quite easy. Moreover, its default is a choropleth map by country, so I only have to adjust the palette for styling.

```r
# server.R
library(rworldmap)

wvs_c <- read.csv("/Users/peggyfan/Downloads/R_data/Developing_data_products/wvs_c")
wvs_c <- wvs_c[, -1]

colourPalette1 <- c("#F5A9A9", "#F6D8CE", "#F8ECE0", "#EFFBFB",
                    "#E0F2F7", "#CEE3F6", "#A9BCF5")

world <- joinCountryData2Map(wvs_c,
                             joinCode = "ISO3",
                             nameJoinColumn = "iso3",
                             mapResolution = "li")
```

When “the world” is chosen, the map tab shows the mapping of the entire data set. The gender and educational attainment tabs show the regional breakdown of those two topics using `ggplot`.

Below the table, I embed another panel so users can choose a specific country (listed in alphabetical order) to view its gender and educational attainment breakdown in charts, also created with `ggplot`.

Those who are interested in the tabular data displayed on the “gender” and “education attainment” tabs can download the data from the website for their own research purposes.

App address: https://peggyfan.shinyapps.io/shinyapps/

by Joseph Rickert

The San Francisco Bay Area Chapter of the Association of Computing Machinery (ACM) has been holding an annual Data Mining Camp and "unconference" since 2009. This year, to reflect the times, the group held a *Data Science Camp* and unconference, and we at Revolution Analytics were, once again, very happy to be a sponsor for the event and pleased to be able to participate.

In an ACM unconference, except for prearranged tutorials and the keynote address, there are no scheduled talks. Instead, anyone with the passion to speak gets two minutes to pitch a session. A show of hands determines what flies, the organizers allocate rooms and group talks by theme on the fly, and then off you go. The photo below shows how all of this sorted out on Saturday.

As you might expect, there was a lot of interest in Big Data, NoSQL, NLP etc., but there was also quite a bit of interest in R, enough to fill a large room for two back-to-back sessions. I was very happy to reprise some of the material from a recent webinar I presented on an introduction to Machine Learning and Data Science with R, and Ram Narasimhan (a longtime member of the Bay Area useR Group) gave a high-energy and very informative tutorial on the dplyr package that, judging from the audience reaction, inspired quite a few new R programmers.

But the real R highlight came early in the day. Irina Kukuyeva presented a tutorial on *Principal Component Analysis with Applications in R and Python* that was well worth getting up for early Saturday morning. Not only did Irina put together a very nice introduction to PCA, starting with the basic math and illustrating how PCA is used through case studies, but in a laudable effort to be as inclusive as possible, she also took the trouble to write both Python and R code for all of her examples! The following slide shows what PCA looks like in both languages.

This next slide shows what a good bit of statistics looks like in both languages.
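For readers who were not there, here is a minimal base-R sketch of the kind of PCA computation the slides compare, using the built-in iris data rather than Irina's case-study data:

```r
# PCA on the numeric columns of the built-in iris data
pca <- prcomp(iris[, 1:4], center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component
summary(pca)$importance["Proportion of Variance", ]

# Scores for the first two components, ready for plotting
scores <- pca$x[, 1:2]
```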

For more presentations and tutorials by Irina that feature R, have a look at her Tutorial page.

by Terry M. Therneau Ph.D.

Faculty, Mayo Clinic

About a year ago there was a query about how to do "type 3" tests for a Cox model on the R help list, which someone wanted because SAS does it. The SAS addition looked suspicious to me, but as the author of the survival package I thought I should understand the issue more deeply. It took far longer than I expected but has been illuminating.

First off, what exactly is this 'type 3' computation of which SAS is so deeply enamored? Imagine that we are dealing with a data set that has interactions. In my field of biomedical statistics all data relationships have interactions: an effect is never precisely the same for young vs old, fragile vs robust, long vs short duration of disease, etc. We may not have the sample size or energy to model them, but they exist nonetheless. Assume as an example that we had a treatment effect that increases with age; how then would one describe a main effect for treatment? One approach is to select an age distribution of interest and use the mean treatment effect, averaged over that age distribution.

To compute this, one can start by fitting a sufficiently rich model, get predicted values for our age distribution, and then average them. This requires almost by definition a model that includes an age by treatment interaction: we need reasonably unbiased estimates of the treatment effects at individual ages a,b,c,... before averaging, or we are just fooling ourselves with respect to this overall approach. The SAS type 3 method for linear models is exactly this. It assumes as the "reference population of interest" a uniform distribution over any categorical variables and the observed distribution of the data set for any continuous ones, followed by a computation of the average predicted value. Least squares means are also an average prediction taken over the reference population.

A primary statistical issue with type 3 is the choice of reference. Assume for instance that age had been coded as a categorical with levels of 50-59, 60-69, 70-79 and 80+. A type 3 test answers the question of what the treatment effect would be in a population of subjects in which 1/4 were aged 50-59, another 1/4 were 60-69, etc. Since I will never encounter a set of subjects with said pattern in real life, such an average is irrelevant. A nice satire of the situation can be found under the nom de plume of Guernsey McPearson (Also have a look at *Multi-Centre Trials and the Finally Decisive Argument*). To be fair there are other cases where the uniform distribution is precisely the right population, e.g., a designed experiment that lost perfect balance due to a handful of missing response values. But these are rare to non-existent in my world, and type 3 remains an answer to the question that nobody asked.

Average population prediction also highlights a serious deficiency in R. Working out the algebra, type 3 tests for a linear model turn out to be a contrast, C %*% coef(fit), for a particular contrast vector or matrix C. This fits neatly into the SAS package, which has a simple interface for user-specified contrasts. (The SAS type 3 algorithm is at its heart simply an elegant way to derive C for their default reference population.) The original S package took a different view, which R has inherited, of pre- rather than post-processing. Several of the common contrasts one might want to test can be obtained by clever coding of the design matrix X, before the fit, causing the contrast of interest to appear as one of the coefficients of the fitted model. This is a nice idea when it works, but there are many cases where it is insufficient, a linear trend test or all possible pair-wise comparisons for example.
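To make the post-fit contrast idea concrete, here is a small base-R sketch: a linear trend contrast C computed after fitting a model with sum-to-zero coding. The data and trend weights are invented for illustration:

```r
# Four-level group factor with a built-in linear trend in the group means
set.seed(7)
dat <- data.frame(
  grp = factor(rep(1:4, each = 25)),
  y   = rnorm(100) + rep(c(0, 0.5, 1.0, 1.5), each = 25)
)
fit <- lm(y ~ grp, data = dat, contrasts = list(grp = "contr.sum"))

# Express a linear trend over the four cell means as a contrast C
# in the coefficient basis, then test it after the fit
newd <- data.frame(grp = factor(1:4))
M <- model.matrix(~ grp, data = newd, contrasts.arg = list(grp = "contr.sum"))
w <- c(-3, -1, 1, 3)            # linear trend weights over the four levels
C <- matrix(w %*% M, nrow = 1)  # contrast vector for C %*% coef(fit)

est <- drop(C %*% coef(fit))                 # estimated trend contrast
se  <- sqrt(drop(C %*% vcov(fit) %*% t(C)))  # its standard error
tstat <- est / se                            # test statistic for the contrast
```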

*R needs a general and well thought out post-fit contrasts function. Population averaged estimates could be one option of said routine, with the SAS population one possible choice.*

Also, I need to mention a couple more things:

- The standard methods for computing type 3 that I see in the help lists are flawed, giving seriously incorrect answers unless sum-to-zero constraints were used for the fit (contr.sum). This includes both the use of drop.terms and the Anova function in the car package.
- For coxph models, my original goal, the situation is even more complex. In particular, which average does one want: average log hazard ratio, average hazard ratio, ratio of average hazards, or something else? Only one of these can be rewritten as a contrast in the coefficients, and thus the clever linear models algorithms do not transfer.

My second-favourite keynote from yesterday's Strata Hadoop World conference was this one, from Pinterest's John Rauser. To many people (especially in the Big Data world), statistics is a series of complex equations, but just a little intuition goes a long way toward really understanding data. John illustrates this wonderfully using an example of data collected to determine whether consuming beer causes mosquitoes to bite you more:

The big lesson here, IMO, is that so many statistical problems can seem complex, but you can actually get a lot of insight by recognizing that your data is just one possible instance of a random process. If you have a hypothesis for what that process is, you can **simulate it**, and get an intuitive sense of how surprising your data is. R has excellent tools for simulating data, and a couple of hours spent writing code to simulate data can often give insights that will be valuable for the formal data analysis to come.
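As a sketch of that idea, here is a simple permutation test in base R; the mosquito counts are made up for illustration and are not the data from the talk:

```r
# Is the observed difference in mosquito counts between beer and water
# drinkers surprising under the null hypothesis of no effect?
set.seed(123)
beer  <- c(27, 20, 21, 26, 27, 31, 24, 21, 20, 19, 23, 24)
water <- c(21, 22, 15, 12, 21, 16, 19, 15, 22, 24, 19, 23)
obs_diff <- mean(beer) - mean(water)

# Simulate the null: shuffle group labels many times and recompute
pooled <- c(beer, water)
n_beer <- length(beer)
perm_diffs <- replicate(10000, {
  idx <- sample(length(pooled), n_beer)
  mean(pooled[idx]) - mean(pooled[-idx])
})

# One-sided p-value: how often does shuffled data look as extreme as ours?
p_value <- mean(perm_diffs >= obs_diff)
```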

(By the way, my favourite keynote from the conference was Amanda Cox's keynote on data visualization at the New York Times, which featured several examples developed in R. Sadly, though, it wasn't recorded.)

O'Reilly Strata: Statistics Without the Agonizing Pain

by Joseph Rickert

There is something about R user group meetings that both encourages and nourishes a certain kind of "after hours" creativity. Maybe it is the pressure of having to make a presentation about stuff you do at work interesting to a general audience, or maybe it is just the desire to reach a high level of play. But, R user group presentations often manage to make some obscure area of computational statistics seem to be not only accessible, but also relevant and fun. Here are a couple of examples of what I mean.

Recently Xiaocun Sun conducted an image processing workshop for KRUG, the Knoxville R User's Group. As the following slide indicates, he used the EBImage Bioconductor package, a package that I imagine few people who don't do medical imaging for a living would be likely to stumble upon by accident, to illustrate the basics of image processing.

Xiaocun's presentation, along with R code, is available for download from the KRUG site.

As a second example, consider the presentation that Antonio Piccolboni recently made to the Bay Area useR Group (BARUG): 10 Eigenmaps of the United States of America. Inspired by an article in the New York Times, Antonio decided to undertake his own idiosyncratic tour through the Census data and look at socio-economic trends in the United States. His analysis is both thought provoking and visually compelling. For example, concerning the following map Antonio writes:

This map shows a very interesting ring pattern around some cities, including Atlanta, Dallas and Minneapolis. The red areas show strong population increase, including migration, and increase in available housing and high median income. The blue areas have a higher death rate, Federal Government payments to individuals, more widows, single person households and older people receiving social security.

Antonio's presentation might well illustrate the theme: "Data Scientist reads the Sunday paper and finds data to begin a conversation about what he read with his quantitative, R-literate friends".

This kind of active reading fits nicely with ideas about responsible, quantitative journalism that Chris Wiggins expresses in a presentation he recently made to the New York Open Statistical Programming Meetup. Here, Chris provides some insight into the role of Data Science at the New York Times and offers advice on using data to study relevant issues and clearly communicate findings. One major point in Chris' presentation is that data science plus clear communication can have a very positive influence on shaping our culture.

It is not an exaggeration to say that the kind of work that Xiaocun, Antonio and other R user group presenters undertake in their spare time "for fun" is valuable and important beyond the immediate goals of learning and teaching R.

by Joseph Rickert

In a recent post I talked about the information that can be developed by fitting a Tweedie GLM to a 143 million record version of the airlines data set. Since I started working with them about a year or so ago, I now see Tweedie models everywhere. Basically, any time I come across a histogram that looks like it might be a sample from a gamma distribution except for a big spike at zero, I see a candidate for a Tweedie model. (Having a Tweedie hammer makes lots of things look like Tweedie nails.) Nevertheless, apparently lots of people are seeing Tweedie these days. Even the scholarly citations for Maurice Tweedie's original paper are up.

Tweedie distributions are a subset of what are called Exponential Dispersion Models. EDMs are two parameter distributions from the linear exponential family that also have a dispersion parameter f. Statistician Bent Jørgensen solidified the concept of EDMs in a 1987 paper, and named the following class of EDMs after Tweedie.

An EDM random variable Y follows a Tweedie distribution if

var(Y) = f * V(m)

where m is the mean of the distribution, f is the dispersion parameter, V is a function describing the mean/variance relationship of the distribution, and p is a constant such that:

V(m) = m^p

Some very familiar distributions fall into the Tweedie family. Setting p = 0 gives a normal distribution. p = 1 is Poisson. p = 2 gives a gamma distribution and p = 3 yields an inverse Gaussian. However, much of the action for fitting Tweedie GLMs is for values of p between 1 and 2. In this interval, closed form distribution functions don’t exist, but as it turns out, Tweedies in this interval are compound Poisson distributions. (A compound Poisson random variable Y is the sum of N independent gamma random variables where N follows a Poisson distribution and N and the gamma random variates are independent.)

This last fact helps to explain why Tweedies are so popular. For example, one might model the insurance claims for a customer as a series of independent gamma random variables and the number of claims in some time interval as a Poisson random variable. Or, the gamma random variables could be models for precipitation, and the total rainfall resulting from N rainstorms would follow a Tweedie distribution. The possibilities are endless.
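This compound Poisson construction is also easy to simulate directly in base R, which makes the spike at zero visible: whenever the Poisson count N is zero, Y is exactly zero. The parameter values below are arbitrary choices for illustration:

```r
# Simulate a compound Poisson (Tweedie, 1 < p < 2) sample from its
# definition: Y is the sum of N independent gamma variables, N ~ Poisson.
set.seed(2014)
rcomp_pois <- function(n, lambda, shape, rate) {
  N <- rpois(n, lambda)   # number of gamma summands for each observation
  sapply(N, function(k) if (k == 0) 0 else sum(rgamma(k, shape, rate)))
}

y <- rcomp_pois(10000, lambda = 1, shape = 2, rate = 1)
mean(y == 0)   # roughly exp(-1), about 0.37: the point mass at zero
hist(y, breaks = 50, main = "Compound Poisson sample")
```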

R has quite a few resources for working with Tweedie models. Here are just a few. You can fit a Tweedie GLM with the tweedie family function in the statmod package.

```r
# Fit an inverse-Gaussian glm with log link
library(statmod)
glm(y ~ x, family = tweedie(var.power = 3, link.power = 0))
```

The tweedie package has several interesting functions for working with Tweedie models, including a function to generate random samples. The following graph shows four different Tweedie histograms as the power parameter moves from 1.2 to 1.9.

It is apparent that increasing the power shifts mass away from zero towards the right.

(The code for producing these plots, which includes some nice code from Stephen Turner for putting all four ggplots on a single graph, is available: Download Plot_tweedie)

Package poistweedie also provides functions for simulating Tweedie models. Package HDtweedie implements an iteratively reweighted least squares algorithm for computing solution paths for grouped lasso and grouped elastic net Tweedie models. And, package cplm provides both likelihood and Bayesian functions for working with compound Poisson models. Be sure to have a look at the vignette for this package to see compound Poisson distributions in action.

Finally, two very readable references for both the math underlying Tweedie models and the algorithms to compute them are a couple of papers by Dunn and Smyth: here and here.

by Joseph Rickert

Recently, I had the opportunity to present a webinar on R and Data Science. The challenge with attempting this sort of thing is to say something interesting that does justice to the subject while being suitable for an audience that may include both experienced R users and curious beginners. The approach I settled on had three parts. I decided to:

- show a few slides that indicate the status of R among data scientists
- offer some thoughts as to why R is such a popular and effective tool
- work through some code.

The "why" slides attempt to convey the great number of machine learning and statistical algorithms available in R, the visualization capabilities, the richness of the R programming language and its many tools for data manipulation. I tried to emphasize the great amount of effort that the R community continues to make in order to integrate R with other languages and computing platforms, and to scale R to handle massive data sets on Hadoop and other big data platforms.

The code examples presented in the webinar emphasize the machine learning algorithms organized in the caret package and the many tools available for working through the predictive modeling process, such as functions for searching through the parameter space of a model, performing cross validation, comparing models etc. The code for the caret examples is available here.

Towards the end of the webinar I show the code for running a large Tweedie model with Revolution Analytics' rxGlm() function, and I also show what it looks like to run an rxLogit() model directly on Hadoop.

Click on the video to view the webinar, or go to the Revolution Analytics website to download the webinar and a pdf of the slides. All of the code is available on my GitHub repository.