by Matt Sundquist

co-founder of Plotly

R, Plotly, and ggplot2 let you make, share, and collaborate on beautiful, interactive plots online. Let's see what we can do with the topographic data from Auckland's Maunga Whau Volcano that comes with R.

Copy and paste this R code to make your first plot. The basic idea is: use ggplot2 code, add `py$ggplotly()` to call the Plotly API, and get an interactive, web-based plot for sharing and collaboration.

```r
install.packages("devtools")       # so we can install from GitHub
library("devtools")
install_github("ropensci/plotly")  # plotly is part of rOpenSci
library(plotly)
py <- plotly(username="r_user_guide", key="mw5isa4yqp")  # open plotly connection

# Generate data
library(reshape2)  # for melt
volcano3d <- melt(volcano)
names(volcano3d) <- c("x", "y", "z")

# Basic plot
v <- ggplot(volcano3d, aes(x, y, z = z))
v + geom_tile(aes(fill = z)) + stat_contour()
py$ggplotly()
```

The API response is a URL where the data, code, and plot reside: https://plot.ly/~r_user_guide/346.

If you ran the code you'll notice the colors are different. We've applied a theme from within the Plotly GUI to change the colors and layout. You can save themes--styles, colors, layouts--from plots and apply them to new plots. Hover to see data points--you can customize the hover text--or click and drag to zoom.

The plot is rendered using D3.js, a JavaScript visualization library developed by Mike Bostock of The New York Times. You can use the URL to embed plots using knitr and Shiny, or in iframes with a bit of HTML:

<iframe width="640" height="480" frameborder="0" seamless="seamless" scrolling="no" src="https://plot.ly/~r_user_guide/346.embed?width=640&height=480"></iframe>

We can add collaborators online and control the privacy. But not all our work is online; we generate reports, write papers, and give presentations with slides. Thus, for offline use, we can export an image (SVG, EPS, PNG, or PDF) and include a link to the plot in our image.

When you share the URL, you're sharing a fully reproducible version of your plot. The URL hosts your data, code to reproduce the plot, and exportable versions.

Others can fork your plot and make their own version, allowing for lightweight collaboration. That means no more emailing or searching around for data, plots, and code. It's all here.

- https://plot.ly/~r_user_guide/346.svg
- https://plot.ly/~r_user_guide/346.png
- https://plot.ly/~r_user_guide/346.pdf
- https://plot.ly/~r_user_guide/346.r
- https://plot.ly/~r_user_guide/346.py
- https://plot.ly/~r_user_guide/346.m
- https://plot.ly/~r_user_guide/346.jl
- https://plot.ly/~r_user_guide/346.json

Plotly graphs are represented using JSON, a syntax for storing and exchanging data. The .json version of each plot contains the data and a full description of the plot. The framework allows interoperability between Python, R, MATLAB, and other languages.

We can also make 3D plots rendered with WebGL. The iframe below says "source: from api (254)." That's a link to our original plot. The forked 3D plot is at a new URL--https://plot.ly/~MattSundquist/2444--and shows the fork history.

If you click and hold the plot, you can rotate it, or scroll in and out on the plot to zoom in and out. If you hover, you can see dynamic contour lines. You can fork the plot and see it in full-screen here: https://plot.ly/~MattSundquist/2444. Check out our 3D collection for more.

We're on Twitter and GitHub, and would love to hear your feedback, thoughts, and suggestions. We plan to continue expanding our maps coverage, so your ideas are most welcome. If our free cloud-based product doesn't work for you, contact us about getting Plotly on-premise.

by Tim Winke

PhD student in Demography and Social Sciences in Berlin

*This post has been abstracted from Tim's entry to a contest that Dalia Research is running based on a global smartphone survey that they are conducting. Tim's entry post is available, as is all of the code behind it. - editor*

When people think about Germany, what comes to their mind? Oktoberfest, ok – but Mercedes might be second or BMW or Porsche. German car brands have a solid reputation all over the world, but how popular is each brand in different countries?

There are plenty of survey data out there but hardly anyone collects answers within a couple of days from 6 continents. A new start-up called Dalia Research found a way to use smartphone and tablet networks to conduct surveys. It’s not a separate app but works via thousands of apps where targeted users decide to take part in a survey in exchange for an incentive.

In August 2014, they asked 51 questions of young mobile users in 64 countries, including Colombia, Iran and Ukraine. This is impressive: you have access to the opinions of 32,000 people, collected within 4 days from all over the world (500 respondents in each country), about their religion, what they think about the United States, whether the EU has global influence, or whether Qatar should host the 2022 FIFA World Cup, and also: "What is your favorite German car brand?"

Surprisingly, as the map below shows, BMW seems to be the most popular German car brand – and Volkswagen does not reach the pole position in any country.

The ggplot2 stacked barchart provides even more detail.

To see how I employed dplyr, ggplot2 and rworldmap to construct these plots, as well as how to integrate the survey data with world development indicators from the World Bank, please have a look at my original post.

by Joseph Rickert

We usually have a pretty good time at the monthly Bay Area useR Group (BARUG) meetings, but this month's meeting was a bit more of a party than usual. The very well connected PR team at Sqor Sports, our host company for the evening, secured San Francisco's très trendy 111 Minna Gallery for the venue. There was a full bar, house music for the networking portion of the meeting, gourmet grilled cheese sandwiches compliments of Revolution Analytics and drama — Matt Dowle, one of our speakers, was on a flight that was late getting in from London.

Oh! and yes, there were three very engaging presentations — well worth standing around in the dark.

First up was Noah Gift, CTO of Sqor, a company with a mission to take sports marketing to a whole new level. They are creating a marketplace for athletes to build and promote their digital brands. Noah described how devilishly difficult it is to gather, clean and prepare the data. Correctly labeling social media data from several sources generated by different athletes with the same name poses a number of vexing challenges.

One surprising aspect of the technology that Sqor is developing is what they call an Erlang-to-R bridge, which replaces many tasks they formerly accomplished with Python. Noah indicated that they plan to release this code as open source.

Below is a plot from Noah's presentation showing predictions from their R based machine learning algorithms.

Our second speaker was Stephen Elston, who gave a virtuoso live demo of using R on the Microsoft Azure Machine Learning cloud platform. Steve glided between the Azure workflow interface and running R scripts. He showed how to manipulate and transform data in both environments, go back and forth to run models in both Azure and R, and visualize results in R. Slides for Steve's talk are available, as is some R code on Steve's GitHub site. Studying the scripts will give you an idea of the features he presented.

Finally, just in from London, and still lucid at what would have been 4AM his time, Matt Dowle walked through a summary of new features of data.table v1.9.4 and v1.9.5. There were several data.table users present, and Matt made a few new converts with a series of impressively fast benchmarks against base R. In one demo, Matt showed data.table's forder() taking only 17 seconds to sort 40 million random numerics, a task that took R 7 minutes. According to Matt, the trick for getting this kind of performance is data.table's C-based implementation of radix sorting which works on numeric, character and integer types, with no range restrictions (recall that base::sort.list(...,method="radix") is limited to integers with range < 100,000).

data.table's radix sorting, which scales linearly (i.e., below the O(n log n) bound for comparison sorts), is based on two papers: one by Terdiman and the other by Herf. However, where both of these papers use the least significant digit, data.table uses the most significant digit to improve cache efficiency.
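To make the counting-pass idea concrete, here is a minimal base-R sketch of an LSD radix sort on non-negative integers. It is illustrative only: data.table's actual implementation is compiled C, handles negatives, doubles and characters, and works MSD-first as described above.

```r
# Illustrative LSD radix sort for non-negative integers, one pass per byte.
# Each pass is a stable reorder on the current byte, so earlier (less
# significant) passes are preserved by later ones.
radix_sort <- function(x) {
  for (shift in c(0L, 8L, 16L, 24L)) {
    byte <- bitwAnd(bitwShiftR(x, shift), 255L)  # extract the current byte
    x <- x[order(byte)]                          # stable reorder by that byte
  }
  x
}

set.seed(1)
v <- sample.int(1e6, 1e4)
identical(radix_sort(v), sort(v))  # TRUE
```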

Matt also demonstrated data.table's new automatic indexes (you can now use == in i, and data.table will automatically build a secondary key) as well as using dplyr syntax with data.table. Matt emphasized that this flexibility shows the power of R's object-oriented design. He also claimed that both Python's pandas and dplyr for R made the wrong choice in using hashing. Instead of hashing, data.table uses fast sorting based on the sort-order vector, which serves as an index in data.table.
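The sort-order-vector idea can be sketched in base R (a toy illustration, not data.table's internals): compute the ordering of a key column once, then answer `==` lookups with a binary search on the sorted copy instead of a hash table.

```r
set.seed(42)
key <- sample.int(26L, 1e5, replace = TRUE)  # stand-in for a keyed column
o <- order(key)                              # the "index": sort order of the key
sorted_key <- key[o]

# Row numbers where key == v, found by binary search (findInterval) on the
# sorted copy and mapped back through the order vector
lookup <- function(v) {
  lo <- findInterval(v - 1L, sorted_key) + 1L  # first position of v
  hi <- findInterval(v, sorted_key)            # last position of v
  if (hi < lo) integer(0) else o[lo:hi]
}

rows <- lookup(17L)
setequal(rows, which(key == 17L))  # TRUE
```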

For more benchmark information be sure to visit Matt's GitHub site. If you are new to data.table, I recommend starting with Matt's 2014 useR presentation, which explains some of the ideas underlying data.table as well as providing an introduction.

by Joseph Rickert

The igraph package has become a fundamental tool for the study of graphs and their properties, the manipulation and visualization of graphs and the statistical analysis of networks. To get an idea of just how firmly igraph has become embedded into the R package ecosystem consider that currently igraph lists 72 reverse depends, 59 reverse imports and 24 reverse suggests. The following plot, which was made with functions from the igraph and miniCRAN packages, indicates the complexity of this network.

```r
library(miniCRAN)
library(igraph)

pk <- c("igraph", "agop", "bc3net", "BDgraph", "c3net", "camel", "cccd",
        "CDVine", "CePa", "CINOEDV", "cooptrees", "corclass", "cvxclustr",
        "dcGOR", "ddepn", "dils", "dnet", "dpa", "ebdbNet", "editrules",
        "fanovaGraph", "fastclime", "FisHiCal", "flare", "G1DBN", "gdistance",
        "GeneNet", "GeneReg", "genlasso", "ggm", "gRapfa", "hglasso", "huge",
        "igraphtosonia", "InteractiveIGraph", "iRefR", "JGL", "lcd",
        "linkcomm", "locits", "loe", "micropan", "mlDNA", "mRMRe", "nets",
        "netweavers", "optrees", "packdep", "PAGI", "pathClass", "PBC",
        "phyloTop", "picasso", "PoMoS", "popgraph", "PROFANCY", "qtlnet",
        "RCA", "ReliabilityTheory", "rEMM", "restlos", "rgexf", "RNetLogo",
        "ror", "RWBP", "sand", "SEMID", "shp2graph", "SINGLE", "spacejam",
        "TDA", "timeordered", "tnet")

dg <- makeDepGraph(pk)
plot(dg, main="Network of reverse depends for igraph", cex=.4, vertex.size=8)
```

The igraph package itself is a tour de force with over 200 functions. Learning the package can be a formidable task especially if you are trying to learn graph theory and network analysis at the same time. To help myself through this process, I sorted the functions in the igraph package into seven rough categories:

- Create Graph
- Describe Graph
- Environment
- Find Communities
- Operate on a Graph
- Plot
- Statistics

The following shows a portion of the table for the first 10 functions related to creating graphs.

| | Function | Description | Category of Function |
|---|---|---|---|
| 1 | aging.prefatt.game | Generate an evolving random graph with preferential attachment and aging | Create Graph |
| 2 | barabasi.game | Generate scale-free graphs according to the Barabasi-Albert model | Create Graph |
| 3 | bipartite.random.game | Generate bipartite graphs using the Erdos-Renyi model | Create Graph |
| 4 | degree.sequence.game | Generate a random graph with a given degree sequence | Create Graph |
| 5 | erdos.renyi.game | Generate random graphs according to the Erdos-Renyi model | Create Graph |
| 6 | forest.fire.game | Grow a network that simulates how a fire spreads by igniting trees | Create Graph |
| 7 | graph.adjacency | Create an igraph from an adjacency matrix | Create Graph |
| 8 | graph.bipartite | Create a bipartite graph | Create Graph |
| 9 | graph.complementer | Create the complementary graph for a given graph | Create Graph |
| 10 | graph.empty | Create an empty graph | Create Graph |

The entire table may be downloaded here: Download Igraph_functions.

infomap.community() is an intriguing function listed under the Finding Communities category that looks for structure in a network by minimizing the expected description length of a random walker trajectory. The abstract to the paper by Rosvall and Bergstrom that introduced this method states:

To comprehend the multipartite organization of large-scale biological and social systems, we introduce an information theoretic approach that reveals community structure in weighted and directed networks. We use the probability flow of random walks on a network as a proxy for information flows in the real system and decompose the network into modules by compressing a description of the probability flow. The result is a map that both simplifies and highlights the regularities in the structure and their relationships.

I am not willing to claim that the 37 communities found by the algorithm represent meaningful structure, however, the idea of partitioning the network based on information flow does seem relevant to the package building process. Anybody looking for a research project?

```r
imc <- infomap.community(dg)
imc
# Graph community structure calculated with the infomap algorithm
# Number of communities: 37
# Modularity: 0.5139813
# Membership vector:
```

Some additional resources for working with igraph are:

- The igraph home page
- A presentation by Gábor Csárdi, igraph's principal author
- An old post with some pointers to additional resources
- A couple of nice tutorials: here and here.
- The new book, Statistical Analysis of Network Data with R by Eric D. Kolaczyk and Gábor Csárdi, is a very good read.

by Matt Sundquist

Plotly, co-founder

Plotly is a platform for data analysis, graphing, and collaboration. You can use ggplot2, Plotly's R API, and Plotly's web app to make and share interactive plots. Now you can also make 3D plots. Immediately below are a few examples. In this post we will show how to make 3D plots with ggplot2 and Plotly's R API.

First, let's convert a ggplot2 tile plane into a Plotly graph, then convert it to a 3D plot. You can copy and paste this code and use a test username and key, or sign up for an account and generate your own.

```r
install.packages("devtools")       # so we can install from GitHub
library("devtools")
install_github("ropensci/plotly")  # plotly is part of rOpenSci
library(plotly)
py <- plotly(username="r_user_guide", key="mw5isa4yqp")  # open plotly connection

pp <- function(n, r=4) {
  x <- seq(-r*pi, r*pi, len=n)
  df <- expand.grid(x=x, y=x)
  df$r <- sqrt(df$x^2 + df$y^2)
  df$z <- cos(df$r^2) * exp(-df$r/6)
  df
}
p <- ggplot(pp(20), aes(x=x, y=y))
p <- p + geom_tile(aes(fill=z))
py$ggplotly(p)
```

We return a URL: plot.ly/~r_user_guide/83/y-vs-x/. The URL hosts the interactive plot, rendered with D3.js, a JavaScript visualization library. Each plot stores the data, and code to reproduce a plot with MATLAB, Python, R, Julia, and JavaScript.

We can export or embed plots in Shiny Apps, knitr, Slidify, blogs, and in an iframe, as we're doing below.

We'll now style, share, and change to 3D in the web app. Press the "Fork and Edit" button to get started in the GUI. The web app runs online and is free, so you won't need to install or download anything.

Below is our edited plot. Go to the plot and press "View full-size graph" to really dig in, or go straight to a full-screen version: plot.ly/~MattSundquist/2260.embed. Try clicking, holding, and toggling to flip, drag, and zoom. Press the icons in the upper right-hand corner to change modes.

We can also use qplot and edit the plot in the GUI. Assuming you ran the code above:

```r
qplot(x, y, data=pp(100), geom="tile", fill=z)
py$ggplotly()
```

We can also make 3D plots from a grid of imported Excel, Dropbox, Google Drive, or pasted data. Or combine data from another plot. Privacy settings are like Google Drive and GitHub: you control if plots are public or private, you own your data and can add collaborators, or you can run Plotly on-premise.

To create a 3D plot directly from R, we can use Plotly's R API. For example, try this code to make a surface plot. Plotly also supports 3D line and scatter plots.

```r
data <- list(
  list(
    z = matrix(c(1, 20, 30, 50, 1, 20, 1, 60, 80, 30, 30, 60, 1, -10, 20),
               nrow=3, ncol=5),
    x = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday"),
    y = c("Morning", "Afternoon", "Evening"),
    type = "surface"
  )
)
response <- py$plotly(data, kwargs=list(filename="3D surface", fileopt="overwrite"))
url <- response$url  # returns plot URL
```

Editing in the GUI offers options for editing the lighting, lines that appear when hovering, colors in the scene and grid lines, and axis titles.

The plot, as well as code to reproduce our new version, is also available at the URL: plot.ly/~MattSundquist/2263.

3D plots can be useful for showing three-dimensional surfaces like worlds, lakes, or mountains. You can also plot a random walk, the Lorenz attractor, functions, and stock volatility. We welcome your feedback, thoughts, and suggestions. We're at feedback at plot.ly and @plotlygraphs. Happy plotting!

by Joseph Rickert

I recently had the opportunity to look at the data used for the 2009 KDD Cup competition. There are actually two sets of files still available from this competition. The "large" file is a series of five .csv files that, when concatenated, form a data set with 50,000 rows and 15,000 columns. The "small" file also contains 50,000 rows but only 230 columns. "Target" files are also provided for both the large and small data sets. The target files contain three sets of labels, for "appetency", "churn" and "upselling", so that the data can be used to train models for three different classification problems.

The really nice feature of both the large and small data sets is that they are extravagantly ugly, containing large numbers of missing variables, factor variables with thousands of levels, factor variables with only one level, numeric variables with constant values, and correlated independent variables. To top it off, the targets are severely unbalanced containing very low proportions of positive examples for all three of the classification problems. These are perfect files for practice.

Oftentimes, the most difficult part of working with data like this is just knowing where to begin. Since getting a good look at the data is usually a sensible place to start, let's look at a couple of R tools that I found helpful for taking that first dive into messy data.
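As a warm-up, a few lines of base R can already surface some of these pathologies: per-column missingness, single-level factors, and constant numerics. This sketch runs on a small synthetic frame standing in for the post's DF; the column names are made up.

```r
toy <- data.frame(
  a = c(1, NA, NA, 4),
  b = factor(c("x", "x", "x", "x")),  # single-level factor
  c = c(7, 7, 7, 7),                  # constant numeric
  d = c(1.2, 3.4, NA, 5.6)
)

na_frac   <- colMeans(is.na(toy))  # share of missing values per column
one_level <- names(Filter(function(v) is.factor(v) && nlevels(droplevels(v)) < 2, toy))
constant  <- names(Filter(function(v) is.numeric(v) && length(unique(na.omit(v))) == 1, toy))

na_frac      # a: 0.50, b: 0.00, c: 0.00, d: 0.25
one_level    # "b"
constant     # "c"
```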

The mi package takes a sophisticated approach to multiple imputation and provides some very advanced capabilities. However, it also contains simple and powerful tools for looking at data. The function missing.pattern.plot() lets you see the pattern of missing values. The following line of code provides a gestalt for the small data set.

```r
missing.pattern.plot(DF, clustered=FALSE,
                     xlab="observations",
                     main="KDD 2009 Small Data Set")
```

Observations (rows) go from left to right and variables from bottom to top. Red indicates missing values. Looking at just the first 25 variables makes it easier to see what the plot is showing.

The function mi.info(), also in the mi package, provides a tremendous amount of information about a data set. Here is the output for the first 10 variables. The first thing the function does is list the variables with no data and the variables that are highly correlated with each other. Thereafter, the function lists a row for each variable that includes the number of missing values and the variable type. This is remarkably useful information that would otherwise take a little bit of work to discover.

```
> mi.info(DF)
variable(s) Var8, Var15, Var20, Var31, Var32, Var39, Var42, Var48, Var52,
Var55, Var79, Var141, Var167, Var169, Var175, Var185 has(have) no observed
value, and will be omitted.

following variables are collinear
[[1]]
[1] "Var156" "Var66"  "Var9"
[[2]]
[1] "Var104" "Var105"
[[3]]
[1] "Var111" "Var157" "Var202" "Var33"  "Var61"  "Var71"  "Var91"

    names include order number.mis all.mis                type      collinear
1    Var1     Yes     1      49298      No         nonnegative             No
2    Var2     Yes     2      48759      No              binary             No
3    Var3     Yes     3      48760      No         nonnegative             No
4    Var4     Yes     4      48421      No ordered-categorical             No
5    Var5     Yes     5      48513      No         nonnegative             No
6    Var6     Yes     6       5529      No         nonnegative             No
7    Var7     Yes     7       5539      No         nonnegative             No
8    Var8      No    NA      50000     Yes          proportion             No
9    Var9      No    NA      49298      No         nonnegative  Var156, Var66
10  Var10     Yes     8      48513      No         nonnegative             No
```

For Revolution R Enterprise users, the function rxGetInfo() is a real workhorse. It applies to data frames as well as data stored in .xdf files. For data in these files there is essentially no limit to how many observations can be analysed. rxGetInfo() is an example of an external memory algorithm that reads only a chunk of data at a time from the file. Hence, there is no need to try to stuff all of the data into memory.

The following is a portion of the output from running the function with the getVarInfo flag set to TRUE.

```
rxGetInfo(DF, getVarInfo=TRUE)

Data frame: DF
Number of observations: 50000
Number of variables: 230
Variable information:
Var 1: Var1, Type: numeric, Low/High: (0.0000, 680.0000)
Var 2: Var2, Type: numeric, Low/High: (0.0000, 5.0000)
Var 3: Var3, Type: numeric, Low/High: (0.0000, 130668.0000)
.
.
.
Var 187: Var187, Type: numeric, Low/High: (0.0000, 910.0000)
Var 188: Var188, Type: numeric, Low/High: (-6.4200, 628.6200)
Var 189: Var189, Type: numeric, Low/High: (6.0000, 642.0000)
Var 190: Var190, Type: numeric, Low/High: (0.0000, 230427.0000)
Var 191: Var191 2 factor levels: r__I
Var 192: Var192 362 factor levels: _hrvyxM6OP _v2gUHXZeb _v2rjIKQ76 _v2TmBftjz ... zKnrjIPxRp ZlOBLJED1x ZSNq9atbb6 ZSNq9aX0Db ZSNrjIX0Db
Var 193: Var193 51 factor levels: _7J0OGNN8s6gFzbM 2Knk1KF 2wnefc9ISdLjfQoAYBI 5QKIjwyXr4MCZTEp7uAkS8PtBLcn 8kO9LslBGNXoLvWEuN6tPuN59TdYxfL9Sm6oU ... X1rJx42ksaRn3qcM X2uI6IsGev yaM_UXtlxCFW5NHTcftwou7BmXcP9VITdHAto z3s4Ji522ZB1FauqOOqbkl zPhCMhkz9XiOF7LgT9VfJZ3yI
Var 194: Var194 4 factor levels: CTUH lvza SEuy
Var 195: Var195 23 factor levels: ArtjQZ8ftr3NB ArtjQZmIvr94p ArtjQZQO1r9fC b_3Q BNjsq81k1tWAYigY ... taul TnJpfvsJgF V10_0kx3ZF2we XMIgoIlPqx ZZBPiZh
Var 196: Var196 4 factor levels: 1K8T JA1C mKeq z3mO
Var 197: Var197 226 factor levels: _8YK _Clr _vzJ 0aHy ... ZEGa ZF5Q ZHNR ZNsX ZSv9
Var 198: Var198 4291 factor levels: _0Ong1z _0OwruN _0OX0q9 _3J0EW7 _3J6Cnn ... ZY74iqB ZY7dCxx ZY7YHP2 ZyTABeL zZbYk2K
Var 199: Var199 5074 factor levels: _03fc1AIgInD8 _03fc1AIgL6pC _03jtWMIkkSXy _03wXMo6nInD8 ... zyR5BuUrkb8I9Lth ZZ5
.
.
.
```

rxGetInfo() doesn't provide all of the information that mi.info() does, but it does a particularly nice job on factor data, giving the number of levels and showing the first few. The two functions are complementary.

For a full listing of the output shown above, download the file: Download Mi_info_output.

by Joseph Rickert

In a recent post I talked about the information that can be developed by fitting a Tweedie GLM to a 143 million record version of the airlines data set. Since I started working with them about a year or so ago, I now see Tweedie models everywhere. Basically, any time I come across a histogram that looks like it might be a sample from a gamma distribution except for a big spike at zero, I see a candidate for a Tweedie model. (Having a Tweedie hammer makes lots of things look like Tweedie nails.) Nevertheless, apparently lots of people are seeing Tweedie these days. Even the scholarly citations for Maurice Tweedie's original paper are up.

Tweedie distributions are a subset of what are called Exponential Dispersion Models. EDMs are two-parameter distributions from the linear exponential family that also have a dispersion parameter f. Statistician Bent Jørgensen solidified the concept of EDMs in a 1987 paper and named the following class of EDMs after Tweedie.

An EDM random variable Y follows a Tweedie distribution if

var(Y) = f * V(m)

where m is the mean of the distribution, f is the dispersion parameter, V is a function describing the mean/variance relationship of the distribution, and p is a constant such that:

V(m) = m^p

Some very familiar distributions fall into the Tweedie family. Setting p = 0 gives a normal distribution. p = 1 is Poisson. p = 2 gives a gamma distribution and p = 3 yields an inverse Gaussian. However, much of the action for fitting Tweedie GLMs is for values of p between 1 and 2. In this interval, closed form distribution functions don’t exist, but as it turns out, Tweedies in this interval are compound Poisson distributions. (A compound Poisson random variable Y is the sum of N independent gamma random variables where N follows a Poisson distribution and N and the gamma random variates are independent.)

This last fact helps to explain why Tweedies are so popular. For example, one might model the insurance claims for a customer as a series of independent gamma random variables and the number of claims in some time interval as a Poisson random variable. Or, the gamma random variables could be models for precipitation, and the total rainfall resulting from N rainstorms would follow a Tweedie distribution. The possibilities are endless.
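The compound Poisson construction is easy to simulate in base R. This sketch uses made-up parameter values; it just shows the characteristic Tweedie shape of an exact spike at zero plus a continuous positive part.

```r
# Y = sum of N iid Gamma(shape, rate) variables, with N ~ Poisson(lambda);
# N = 0 gives Y = 0 exactly, which produces the spike at zero
r_cpois <- function(n, lambda = 2, shape = 3, rate = 1) {
  N <- rpois(n, lambda)  # e.g. number of claims or rainstorms
  vapply(N, function(k) sum(rgamma(k, shape = shape, rate = rate)), numeric(1))
}

set.seed(123)
y <- r_cpois(10000)
mean(y == 0)  # close to exp(-2) = P(N = 0), about 0.135
```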

R has quite a few resources for working with Tweedie models. Here are just a few. You can fit a Tweedie GLM with the tweedie family function in the statmod package.

```r
# Fit an inverse-Gaussian glm with log link
library(statmod)  # provides the tweedie family
glm(y ~ x, family=tweedie(var.power=3, link.power=0))
```

The tweedie package has several interesting functions for working with Tweedie models, including a function to generate random samples. The following graph shows four different Tweedie histograms as the power parameter moves from 1.2 to 1.9.

It is apparent that increasing the power shifts mass away from zero towards the right.

(The code for producing these plots, which includes some nice code from Stephen Turner for putting all four ggplots on a single graph, is available: Download Plot_tweedie)

Package poistweedie also provides functions for simulating Tweedie models. Package HDtweedie implements an iteratively reweighted least squares algorithm for computing solution paths for grouped lasso and grouped elastic net Tweedie models. And, package cplm provides both likelihood and Bayesian functions for working with compound Poisson models. Be sure to have a look at the vignette for this package to see compound Poisson distributions in action.

Finally, two very readable references for both the math underlying Tweedie models and the algorithms to compute them are a couple of papers by Dunn and Smyth: here and here.

Hadley Wickham's dplyr package is a great toolkit for getting data ready for analysis in R. If you haven't yet taken the plunge into using dplyr, Kevin Markham has put together a great hands-on video tutorial for his Data School blog, which you can see below. The video covers the five main data-manipulation "verbs" that dplyr provides: filter, select, arrange, mutate and summarise/group_by. (It also introduces the glimpse function, a handy alternative to str that I had overlooked before.)

The video also provides an introduction to the %>% ("then") operator from magrittr, which you'll likely find useful for many other applications in addition to dplyr. Kevin's video works from an Rmarkdown script to show how dplyr works, and so serves as a mini-tutorial for Rmarkdown as well. It's well worth 40 minutes of your time. Also, check out Kevin's blog post linked below for links to many other useful dplyr resources.

Data School: Hands-on dplyr tutorial for faster data manipulation in R (via Peter Aldhous)

by Joseph Rickert

While preparing for the DataWeek R Bootcamp that I conducted this week, I came across the following gem. This code, based directly on a Max Kuhn presentation from a couple of years back, compares the efficacy of two machine learning models on a training data set.

```r
#-----------------------------------------
# SET UP THE PARAMETER SPACE SEARCH GRID
ctrl <- trainControl(method="repeatedcv",            # use repeated 10-fold cross-validation
                     repeats=5,                      # do 5 repetitions of 10-fold cv
                     summaryFunction=twoClassSummary, # use AUC to pick the best model
                     classProbs=TRUE)

# Note that the default search grid selects 3 values of each tuning parameter
grid <- expand.grid(.interaction.depth = seq(1,7,by=2), # look at tree depths from 1 to 7
                    .n.trees=seq(10,100,by=5),          # let iterations go from 10 to 100
                    .shrinkage=c(0.01,0.1))             # try 2 values of the learning rate

# BOOSTED TREE MODEL
set.seed(1)
names(trainData)
trainX <- trainData[,4:61]
registerDoParallel(4)  # register a parallel backend for train
getDoParWorkers()
system.time(gbm.tune <- train(x=trainX, y=trainData$Class,
                              method = "gbm",
                              metric = "ROC",
                              trControl = ctrl,
                              tuneGrid=grid,
                              verbose=FALSE))

#---------------------------------
# SUPPORT VECTOR MACHINE MODEL
set.seed(1)
registerDoParallel(4, cores=4)
getDoParWorkers()
system.time(svm.tune <- train(x=trainX, y=trainData$Class,
                              method = "svmRadial",
                              tuneLength = 9,              # 9 values of the cost function
                              preProc = c("center","scale"),
                              metric="ROC",
                              trControl=ctrl))             # same as for gbm above

#-----------------------------------
# COMPARE MODELS USING RESAMPLING
# Having set the seed to 1 before running gbm.tune and svm.tune, we have
# generated paired samples for comparing models using resampling.
#
# The resamples function in caret collates the resampling results from the two models
rValues <- resamples(list(svm=svm.tune, gbm=gbm.tune))
rValues$values

#---------------------------------------------
# BOXPLOTS COMPARING RESULTS
bwplot(rValues, metric="ROC")  # boxplot
```

After setting up a grid to search the parameter space of a model, the train() function from the caret package is used to train a generalized boosted regression model (gbm) and a support vector machine (svm). Setting the seed produces paired samples and enables the two models to be compared using the resampling technique described in Hothorn et al., *"The design and analysis of benchmark experiments"*, Journal of Computational and Graphical Statistics (2005), vol 14 (3), pp 675-699.
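The paired-resampling idea can be sketched in base R with made-up ROC values: because the common seed makes the folds identical for both models, the per-fold differences are directly comparable, which is what caret's resamples() exploits.

```r
set.seed(1)
# Hypothetical per-fold ROC values for 5 repeats of 10-fold CV (50 resamples)
gbm_roc <- rnorm(50, mean = 0.86, sd = 0.02)
svm_roc <- gbm_roc - rnorm(50, mean = 0.03, sd = 0.01)  # svm slightly worse

d <- gbm_roc - svm_roc  # paired differences, fold by fold
t.test(d)               # confidence interval on the mean ROC difference
```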

The performance metric for the comparison is the ROC curve. From examining the boxplots of the sampling distributions for the two models, it is apparent that, in this case, the gbm has the advantage.

Also, notice that the call to registerDoParallel() permits parallel execution of the training algorithms. (The taskbar showed all four cores of my laptop maxed out at 100% utilization.)

I chose this example because I wanted to show programmers coming to R for the first time that the power and productivity of R comes not only from the large number of machine learning models implemented, but also from the tremendous amount of infrastructure that package authors have built up, making it relatively easy to do fairly sophisticated tasks in just a few lines of code.

All of the code for this example, along with the rest of my code from the DataWeek R Bootcamp, is available on GitHub.

by Seth Mottaghinejad

Let's review the Collatz conjecture, which says that given a positive integer n, the following recursive algorithm will always terminate:

- if n is 1, stop, otherwise recurse on the following
- if n is even, then divide it by 2
- if n is odd, then multiply it by 3 and add 1

In our last post, we created a function called 'cpp_collatz' which, given an integer vector, returns an integer vector of the corresponding stopping times (the number of iterations for the above algorithm to reach 1). For example, when n = 5 we have

5 -> 3*5+1 = 16 -> 16/2 = 8 -> 8/2 = 4 -> 4/2 = 2 -> 2/2 = 1,

giving us a stopping time of 5 iterations.
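For readers without a C++ toolchain handy, here is a plain-R equivalent of the stopping-time computation (a hypothetical helper mirroring cpp_collatz, not code from the post):

```r
collatz_r <- function(n) {
  steps <- 0L
  while (n != 1) {
    n <- if (n %% 2 == 0) n %/% 2 else 3 * n + 1  # one Collatz step
    steps <- steps + 1L
  }
  steps
}

collatz_r(5)            # 5, matching the trace above
sapply(1:6, collatz_r)  # 0 1 7 2 5 8
```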

In today's article, we want to perform some exploratory data analysis to see if we can find any pattern relating an integer to its stopping time. As part of the analysis, we will extract some features out of the integers that could help us explain any differences in stopping times.

Here are some examples of potentially useful features:

- Is the integer a prime number?
- What are its proper divisors?
- What is the remainder upon dividing the integer by some other number?
- What is the sum of its digits?
- Is the integer a triangular number?
- Is the integer a square number?
- Is the integer a pentagonal number?

In case you are encountering these terms for the first time, a triangular number is any number m that can be written as m = n(n+1)/2, where n is some positive integer. To determine if a number is triangular, we can rewrite the above equation as n^2 + n - 2m = 0 and use the quadratic formula to get n = (-1 + sqrt(1 + 8m))/2 or n = (-1 - sqrt(1 + 8m))/2. Since n must be a positive integer, we ignore the latter solution, leaving us with (-1 + sqrt(1 + 8m))/2.

Thus, if plugging m in the above formula results in an integer, we can say that m is a triangular number. Similar rules exist to determine if an integer is square or pentagonal, but I will refer you to Wikipedia for the details.

For the purpose of conducting our analysis, we created some other functions in C++ and R to help us. Let's take a look at these functions:

cat(paste(readLines(file.path(directory, "collatz.cpp")), collapse = "\n"))

#include <Rcpp.h>
#include <vector>
using namespace Rcpp;

// [[Rcpp::export]]
int collatz(int nn) {
    int ii = 0;
    while (nn != 1) {
        if (nn % 2 == 0) nn /= 2;
        else nn = 3 * nn + 1;
        ii += 1;
    }
    return ii;
}

// [[Rcpp::export]]
IntegerVector cpp_collatz(IntegerVector ints) {
    return sapply(ints, collatz);
}

// [[Rcpp::export]]
bool is_int_prime(int nn) {
    if (nn < 1) stop("int must be greater than 0.");
    else if (nn == 1) return false;
    else if (nn % 2 == 0) return (nn == 2);
    for (int ii = 3; ii * ii <= nn; ii += 2) {
        if (nn % ii == 0) return false;
    }
    return true;
}

// [[Rcpp::export]]
LogicalVector is_prime(IntegerVector ints) {
    return sapply(ints, is_int_prime);
}

// [[Rcpp::export]]
NumericVector gen_primes(int n) {
    if (n < 1) stop("n must be greater than 0.");
    std::vector<int> primes;
    primes.push_back(2);
    int i = 3;
    while (primes.size() < unsigned(n)) {
        if (is_int_prime(i)) primes.push_back(i);
        i += 2;
    }
    return Rcpp::wrap(primes);
}

// [[Rcpp::export]]
NumericVector gen_divisors(int n) {
    if (n < 1) stop("n must be greater than 0.");
    std::vector<int> divisors;
    divisors.push_back(1);
    for (int i = 2; i <= sqrt(n); i++) {
        if (n % i == 0) {
            divisors.push_back(i);
            divisors.push_back(n / i);
        }
    }
    sort(divisors.begin(), divisors.end());
    divisors.erase(unique(divisors.begin(), divisors.end()), divisors.end());
    return Rcpp::wrap(divisors);
}

// [[Rcpp::export]]
bool is_int_perfect(int nn) {
    if (nn < 1) stop("int must be greater than 0.");
    return nn == sum(gen_divisors(nn));
}

// [[Rcpp::export]]
LogicalVector is_perfect(IntegerVector ints) {
    return sapply(ints, is_int_perfect);
}

Here is a list of other helper functions in collatz.cpp:

- 'is_prime': given an integer vector, returns a logical vector indicating which elements are prime
- 'gen_primes': given some integer input, n, generates the first n prime numbers
- 'gen_divisors': given an integer n, returns an integer vector of all its proper divisors
- 'is_perfect': given an integer vector, returns a logical vector indicating which elements are perfect numbers

sum_digits <- function(x) {
    # returns the sum of the individual digits that make up x
    # x must be an integer vector
    f <- function(xx) sum(as.integer(strsplit(as.character(xx), "")[[1]]))
    sapply(x, f)
}

As you can guess, many of the above functions (such as 'is_prime' and 'gen_divisors') rely on loops, which makes C++ the ideal place to perform the computation. So we farmed out the heavy-duty computations to C++, leaving R with the task of processing and analyzing the resulting data.

Let's get started. We will perform the analysis on all integers below 10^5, since R is memory-bound and we can run into a bottleneck quickly. But next time, we will show you how to overcome this limitation using the 'RevoScaleR' package, which will allow us to scale the analysis to much larger integers.

One small caveat before we start: I enjoy dabbling in mathematics but I know very little about number theory. The analysis we are about to perform is not meant to be rigorous. Instead, we will attempt to approach the problem using EDA the same way that we approach any data-driven problem.

maxint <- 10^5
df <- data.frame(int = 1:maxint) # the original number
df <- transform(df, st = cpp_collatz(int)) # stopping times
df <- transform(df,
    sum_digits = sum_digits(int),
    inv_triangular = (sqrt(8*int + 1) - 1)/2, # inverse triangular number
    inv_square = sqrt(int),                   # inverse square number
    inv_pentagonal = (sqrt(24*int + 1) + 1)/6 # inverse pentagonal number
)

To determine whether a numeric value is an integer, we need to be careful not to use the '==' operator in R, as it is not guaranteed to work because of minute rounding errors. Here's an example:

.3 == .4 - .1 # we expect to get TRUE but get FALSE instead

[1] FALSE

The solution is to check whether the absolute difference between the above numbers is smaller than some tolerance threshold.

eps <- 1e-9

abs(.3 - (.4 - .1)) < eps # returns TRUE

[1] TRUE

df <- transform(df,
    is_triangular = abs(inv_triangular - as.integer(inv_triangular)) < eps,
    is_square = abs(inv_square - as.integer(inv_square)) < eps,
    is_pentagonal = abs(inv_pentagonal - as.integer(inv_pentagonal)) < eps,
    is_prime = is_prime(int),
    is_perfect = is_perfect(int)
)
df <- df[ , names(df)[-grep("^inv_", names(df))]]

Finally, we will create a variable listing all of the integer's proper prime divisors. Every composite integer can be reconstructed out of these basic building blocks, a mathematical result known as the *unique factorization theorem*. We can use the function 'gen_divisors' to get a vector of an integer's proper divisors, and the 'is_prime' function to keep only the ones that are prime. Because the return object must be a singleton, we can then use 'paste' with the 'collapse' argument to join all of the prime divisors into a single comma-separated string.

On its own, the variable 'all_prime_divs' may not be especially helpful. Instead, we can generate multiple flag variables out of it, each indicating whether or not a specific prime number is a divisor of the integer. We will generate 25 flag variables, one for each of the first 25 prime numbers.

There are many more features that we could extract from the underlying integers, but we will stop here. As we mentioned earlier, our goal is not to provide rigorous mathematical work, but to show how the tools of data analysis can be brought to bear on a problem of this nature.

Here's a sample of 10 rows from the data:

df[sample.int(nrow(df), 10), ]

        int  st sum_digits is_triangular is_square is_pentagonal is_prime
21721 21721 162         13         FALSE     FALSE         FALSE    FALSE
36084 36084 142         21         FALSE     FALSE         FALSE    FALSE
40793 40793 119         23         FALSE     FALSE         FALSE    FALSE
3374   3374  43         17         FALSE     FALSE         FALSE    FALSE
48257 48257  44         26         FALSE     FALSE         FALSE    FALSE
42906 42906  49         21         FALSE     FALSE         FALSE    FALSE
37283 37283  62         23         FALSE     FALSE         FALSE    FALSE
55156 55156  60         22         FALSE     FALSE         FALSE    FALSE
6169   6169 111         22         FALSE     FALSE         FALSE    FALSE
77694 77694 231         33         FALSE     FALSE         FALSE    FALSE
      is_perfect all_prime_divs is_div_by_2 is_div_by_3 is_div_by_5 is_div_by_7
21721      FALSE       7,29,107       FALSE       FALSE       FALSE        TRUE
36084      FALSE      2,3,31,97        TRUE        TRUE       FALSE       FALSE
40793      FALSE         19,113       FALSE       FALSE       FALSE       FALSE
3374       FALSE        2,7,241        TRUE       FALSE       FALSE        TRUE
48257      FALSE      11,41,107       FALSE       FALSE       FALSE       FALSE
42906      FALSE       2,3,7151        TRUE        TRUE       FALSE       FALSE
37283      FALSE        23,1621       FALSE       FALSE       FALSE       FALSE
55156      FALSE        2,13789        TRUE       FALSE       FALSE       FALSE
6169       FALSE         31,199       FALSE       FALSE       FALSE       FALSE
77694      FALSE     2,3,23,563        TRUE        TRUE       FALSE       FALSE
      is_div_by_11 is_div_by_13 is_div_by_17 is_div_by_19 is_div_by_23
21721        FALSE        FALSE        FALSE        FALSE        FALSE
36084        FALSE        FALSE        FALSE        FALSE        FALSE
40793        FALSE        FALSE        FALSE         TRUE        FALSE
3374         FALSE        FALSE        FALSE        FALSE        FALSE
48257         TRUE        FALSE        FALSE        FALSE        FALSE
42906        FALSE        FALSE        FALSE        FALSE        FALSE
37283        FALSE        FALSE        FALSE        FALSE         TRUE
55156        FALSE        FALSE        FALSE        FALSE        FALSE
6169         FALSE        FALSE        FALSE        FALSE        FALSE
77694        FALSE        FALSE        FALSE        FALSE         TRUE
      is_div_by_29 is_div_by_31 is_div_by_37 is_div_by_41 is_div_by_43
21721         TRUE        FALSE        FALSE        FALSE        FALSE
36084        FALSE         TRUE        FALSE        FALSE        FALSE
40793        FALSE        FALSE        FALSE        FALSE        FALSE
3374         FALSE        FALSE        FALSE        FALSE        FALSE
48257        FALSE        FALSE        FALSE         TRUE        FALSE
42906        FALSE        FALSE        FALSE        FALSE        FALSE
37283        FALSE        FALSE        FALSE        FALSE        FALSE
55156        FALSE        FALSE        FALSE        FALSE        FALSE
6169         FALSE         TRUE        FALSE        FALSE        FALSE
77694        FALSE        FALSE        FALSE        FALSE        FALSE
      is_div_by_47 is_div_by_53 is_div_by_59 is_div_by_61 is_div_by_67
21721        FALSE        FALSE        FALSE        FALSE        FALSE
36084        FALSE        FALSE        FALSE        FALSE        FALSE
40793        FALSE        FALSE        FALSE        FALSE        FALSE
3374         FALSE        FALSE        FALSE        FALSE        FALSE
48257        FALSE        FALSE        FALSE        FALSE        FALSE
42906        FALSE        FALSE        FALSE        FALSE        FALSE
37283        FALSE        FALSE        FALSE        FALSE        FALSE
55156        FALSE        FALSE        FALSE        FALSE        FALSE
6169         FALSE        FALSE        FALSE        FALSE        FALSE
77694        FALSE        FALSE        FALSE        FALSE        FALSE
      is_div_by_71 is_div_by_73 is_div_by_79 is_div_by_83 is_div_by_89
21721        FALSE        FALSE        FALSE        FALSE        FALSE
36084        FALSE        FALSE        FALSE        FALSE        FALSE
40793        FALSE        FALSE        FALSE        FALSE        FALSE
3374         FALSE        FALSE        FALSE        FALSE        FALSE
48257        FALSE        FALSE        FALSE        FALSE        FALSE
42906        FALSE        FALSE        FALSE        FALSE        FALSE
37283        FALSE        FALSE        FALSE        FALSE        FALSE
55156        FALSE        FALSE        FALSE        FALSE        FALSE
6169         FALSE        FALSE        FALSE        FALSE        FALSE
77694        FALSE        FALSE        FALSE        FALSE        FALSE
      is_div_by_97
21721        FALSE
36084         TRUE
40793        FALSE
3374         FALSE
48257        FALSE
42906        FALSE
37283        FALSE
55156        FALSE
6169         FALSE
77694        FALSE

We can now move on to looking at various statistical summaries to see if we notice any differences between the stopping times (our response variable) when we break up the data in different ways. We will look at the count, mean, median, standard deviation, and trimmed mean (after throwing out the highest 10 percent) of the stopping times, as well as the correlation between the stopping times and the integers. This is by no means a comprehensive list, but it can serve as guidance for deciding which direction to go next.

my_summary <- function(df) {
    with(df, data.frame(
        count = length(st),
        mean_st = mean(st),
        median_st = median(st),
        tmean_st = mean(st[st < quantile(st, .9)]),
        sd_st = sd(st),
        cor_st_int = cor(st, int, method = "spearman")
    ))
}

To create the above summaries broken up by the flag variables in the data, we will use the 'ddply' function from the 'plyr' package. For example, ddply(df, ~ is_prime, my_summary) gives us the summaries defined in 'my_summary', grouped by 'is_prime'.

To avoid having to manually type every formula, we can pull the flag variables from the data set, generate the strings that will make up the formula, wrap it inside 'as.formula' and pass it to 'ddply'.

flags <- names(df)[grep("^is_", names(df))]
res <- lapply(flags, function(nm) ddply(df, as.formula(sprintf('~ %s', nm)), my_summary))
names(res) <- flags
res

$is_triangular
  is_triangular count   mean_st median_st  tmean_st    sd_st cor_st_int
1         FALSE 99554 107.58511        99  96.23178 51.36153  0.1700491
2          TRUE   446  97.11211        94  85.97243 51.34051  0.3035063

$is_square
  is_square count   mean_st median_st tmean_st    sd_st cor_st_int
1     FALSE 99684 107.59743        99 96.25033 51.34206  0.1696582
2      TRUE   316  88.91772        71 77.29577 55.43725  0.4274504

$is_pentagonal
  is_pentagonal count   mean_st median_st tmean_st    sd_st cor_st_int
1         FALSE 99742 107.56948      99.0 96.22466 51.35580  0.1702560
2          TRUE   258  95.52326      83.5 83.86638 53.91514  0.3336478

$is_prime
  is_prime count  mean_st median_st  tmean_st    sd_st cor_st_int
1    FALSE 90408 107.1227        99  95.95392 51.29145  0.1693013
2     TRUE  9592 111.4569       103 100.39643 51.90186  0.1953000

$is_perfect
  is_perfect count  mean_st median_st tmean_st    sd_st cor_st_int
1      FALSE 99995 107.5419        99 96.20092 51.36403  0.1707839
2       TRUE     5  37.6000        18 19.50000 45.06440  0.9000000

$is_div_by_2
  is_div_by_2 count  mean_st median_st  tmean_st    sd_st cor_st_int
1       FALSE 50000 113.5745       106 102.58324 52.25180  0.1720108
2        TRUE 50000 101.5023        94  90.68234 49.73778  0.1737786

$is_div_by_3
  is_div_by_3 count  mean_st median_st tmean_st    sd_st cor_st_int
1       FALSE 66667 107.4415        99 96.15832 51.30838  0.1705745
2        TRUE 33333 107.7322        99 96.94426 51.48104  0.1714432

$is_div_by_5
  is_div_by_5 count  mean_st median_st tmean_st    sd_st cor_st_int
1       FALSE 80000 107.5676        99 96.21239 51.36466  0.1713552
2        TRUE 20000 107.4214        99 96.13873 51.37206  0.1688929

. . .

$is_div_by_89
  is_div_by_89 count  mean_st median_st tmean_st    sd_st cor_st_int
1        FALSE 98877 107.5381        99 96.19965 51.37062  0.1707754
2         TRUE  1123 107.5628        97 96.02096 50.97273  0.1803568

$is_div_by_97
  is_div_by_97 count  mean_st median_st  tmean_st    sd_st cor_st_int
1        FALSE 98970 107.4944        99  96.15365 51.36880 0.17187453
2         TRUE  1030 111.7660       106 101.35853 50.93593 0.07843996

As we can see from the above results, most comparisons appear to be non-significant (although in mathematics even tiny differences can be meaningful, so we will avoid relying on statistical significance here). Here's a summary of the trends that stand out as we go over the results:

- On average, stopping times are slightly higher for prime numbers compared to composite numbers.
- On average, stopping times are slightly lower for triangular numbers, square numbers, and pentagonal numbers compared to their corresponding counterparts.
- Despite having lower average stopping times, triangular numbers, square numbers and pentagonal numbers are more strongly correlated with their stopping times than their corresponding counterparts. A larger sample size may help in this case.
- On average, odd numbers have a higher stopping time than even numbers.

However, this last item could just be a restatement of the first point. Let's take a closer look:

ddply(df, ~ is_prime + is_div_by_2, my_summary)

  is_prime is_div_by_2 count  mean_st median_st  tmean_st    sd_st cor_st_int
1    FALSE       FALSE 40409 114.0744       106 102.91178 52.32495  0.1658988
2    FALSE        TRUE 49999 101.5043        94  90.68434 49.73624  0.1737291
3     TRUE       FALSE  9591 111.4685       103 100.40796 51.89231  0.1950482
4     TRUE        TRUE     1   1.0000         1       NaN       NA         NA

When limiting the analysis to odd numbers only, prime numbers now have a lower average stopping time than composite numbers, reversing the trend noted in the first point.

And one final point: since having a prime number as a proper divisor is a proxy for being a composite number, it is difficult to read too much into whether divisibility by a specific prime number affects the average stopping time. In any case, no specific prime number stands out in particular. Once again, a larger sample size would give us more confidence in the results.

The above analysis is perhaps too crude and primitive to offer any significant lead, so here are some possible improvements:

- We could think of other features that we could add to the data.
- We could think of other statistical summaries to include in the analysis.
- We could try to scale up by looking at all integers from 1 to 1 billion instead of just the first 10^5.

In the next post, we will see how to use the 'RevoScaleR' package to go from an R 'data.frame' (which is memory-bound) to an external data frame, or XDF file for short. In doing so, we will achieve the following improvements:

- as the data will no longer be bound by available memory, we will scale with the size of the data,
- since the XDF format is also a distributed data format, we will use the multiple cores available on a single machine to distribute the computation itself by having each core process a separate chunk of the data. On a cluster of machines, the same analysis could then be distributed over the different nodes of the cluster.