FiveThirtyEight published a fascinating article this week about the subreddits that provided support to Donald Trump during his campaign, and continue to do so today. Reddit, for those not in the know, is a popular online social community organized into thousands of discussion topics, called subreddits (the names all begin with `r/`). Most of the subreddits are useful forums for interesting discussions by like-minded people, but some of them are toxic. (That toxicity extends to some of the names, which is reflected in some of the screenshots below; apologies in advance.) The article looks at various popular and notorious subreddits and finds those that are most similar to the main subreddit devoted to Donald Trump, and also to the other main contenders in the 2016 campaign for president, Hillary Clinton and Bernie Sanders.

The underlying method used to compare subreddits for this purpose is quite ingenious. It's based on a concept you might call "subreddit algebra": you can "add" two subreddits and find a third that reflects the intersection of the two. (One example they give is that adding `r/nba` to `r/minnesota` gives you `r/timberwolves`, the subreddit for Minnesota's NBA team.) You can apply the same process to subtraction: if you remove all the posts like those in the mainstream `r/politics` site from those in `r/The_Donald`, you're left with posts that look like those in several toxic subreddits.

The statistical technique used to identify posts that are "similar" to another is Latent Semantic Analysis, and the article gives this nice illustration of using it to compare subreddits:

The analysis was performed in R, and the code is available on GitHub. The code makes heavy use of the lsa package for R, which provides a number of functions for performing latent semantic analysis. The triangular plot shown above (known as a ternary diagram) was created using the ggtern package.

For the complete subreddit analysis, and the list of subreddits close to Donald Trump based on the analysis, check out the FiveThirtyEight article linked below.

FiveThirtyEight: Dissecting Trump's Most Rabid Following

While R's base graphics library is almost limitlessly flexible when it comes to creating static graphics and data visualizations, new Web-based technologies like D3 and WebGL open up new horizons in high-resolution, rescalable and interactive charts. Graphics built with these libraries can easily be embedded in a webpage, can be dynamically resized while maintaining readable fonts and clear lines, and can include interactive features like hover-over data tips or movable components. And thanks to htmlwidgets for R, you can easily create a variety of such charts using R data and functions, explore them in an interactive R session, and include them in Web-based applications for others to experience.

This htmlwidgets showcase allows you to try out a few such charts, but to see the full diversity of charts available check out the new htmlwidgets gallery. It's an index of packages on CRAN and on GitHub that create htmlwidgets visualizations. (Be sure to turn OFF the "CRAN Only" switch in the top-right corner to see the full range available.)

There's a huge range of charts to explore, including scrollable geographic maps, heat maps, word clouds, streamgraphs, bubble charts and much much more. There are also general interfaces to interactive graphics systems like Plotly and Bokeh. Browse around the gallery to find a chart of interest, click to find the associated R package, install it, and get visualizing!
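To give a taste of the workflow, here's a minimal sketch using the leaflet package, one of the many htmlwidgets packages in the gallery. (This assumes leaflet is installed; the marker location and label are just illustrative.)

```r
# A minimal htmlwidget: an interactive, pannable map with one marker.
# Assumes install.packages("leaflet") has been run.
library(leaflet)

m <- leaflet() %>%
  addTiles() %>%                           # add the base map tiles
  addMarkers(lng = -93.65, lat = 42.03,    # illustrative coordinates
             popup = "An example marker")

m   # printing the widget renders it interactively in RStudio or a browser
```

The same object can be saved as a standalone page with `htmlwidgets::saveWidget(m, "map.html")`, ready to embed in a report or web application.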

By the way, while interactive charts are great for hands-on applications, they're not necessarily so helpful for printed reports or presentations. If you're looking for R graphics of the static variety, check out the R Graph Gallery. But if interactive and Web-enabled is your thing, check out the link below.

htmlwidgets for R: Gallery

A graph, a collection of nodes connected by edges, is just data. Whether it's a social network (where nodes are people, and edges are friend relationships), or a decision tree (where nodes are branch criteria or values, and edges are decisions), the nature of the graph is easily represented in a data object. It might be represented as a matrix (where rows and columns are nodes, and elements mark whether an edge between them is present) or as a data frame (where each row is an edge, with columns representing the pair of connected nodes).
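Both representations are easy to build in base R. Here's a small sketch of the same directed graph stored first as an edge-list data frame, then as an adjacency matrix:

```r
# A small directed graph, represented two ways in base R.

# Edge list as a data frame: each row is one edge (from -> to).
edges <- data.frame(
  from = c("A", "A", "B"),
  to   = c("B", "C", "C"),
  stringsAsFactors = FALSE
)

# The same graph as an adjacency matrix: rows and columns are nodes,
# and a 1 marks the presence of an edge between them.
nodes <- sort(unique(c(edges$from, edges$to)))
adj <- matrix(0L, nrow = length(nodes), ncol = length(nodes),
              dimnames = list(nodes, nodes))
adj[cbind(edges$from, edges$to)] <- 1L   # index by node names

adj["A", "B"]   # 1: there is an edge from A to B
```

Packages like igraph can convert between these forms, but the point is that the data itself is unremarkable; the challenge is the visualization.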

The trick comes in how you represent a graph visually; there are many different options each with strengths and weaknesses when it comes to interpretation. A graph with many nodes and edges may become an unintelligible hairball without careful arrangement, and including directionality or other attributes of edges or nodes can reveal insights about the data that wouldn't be apparent otherwise. There are many R packages for creating and displaying graphs (igraph is a popular one, and this CRAN task view lists many others) but that's a problem in its own right: an important part of the data exploration process is trying and comparing different visualization options, and the myriad packages and interfaces make that process difficult for graph data.

Now, there's the new ggraph package, recently published to CRAN by author Thomas Lin Pedersen, which promises to make exploring graph data easier. Unlike other graphing packages, ggraph uses the grammar of graphics paradigm of the ggplot2 package, unifying the data structures and attributes associated with graphics. It also includes a wide range of visual representations of graphs (layouts) and makes it easy to switch between them. The basic "mesh" visualization of nodes and edges provides 11 different options for arranging the nodes:

Other types of visualizations are supported, too: hive plots, dendrograms, treemaps, and circle plots, to name just a few. Note that only static graphs are available, though: unlike igraph and some other packages, you can't rearrange the location of the nodes or otherwise manipulate the graphics with a mouse.

For the R programmer, most of the work is done by the ggraph function. It's analogous to the ggplot function, except that you don't provide data for the locations of the nodes; their position is selected by an algorithm. (Similarly, layout choices are automatically made for visualization types other than the mesh.) There are also various themes suited to graphs you can use to style your chart: goodbye gridlines and axes; hello labels, annotations and edge arrows.
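A minimal sketch of the interface might look like this (assuming the ggraph and igraph packages are installed; the random graph and layout choice are just illustrative):

```r
library(ggraph)
library(igraph)

# A small random graph to lay out (20 nodes, ~15% edge probability)
set.seed(42)
g <- erdos.renyi.game(20, 0.15)

# ggplot2-style grammar: swap the layout name to compare arrangements
ggraph(g, layout = "kk") +       # Kamada-Kawai; try "circle", "fr", ...
  geom_edge_link(alpha = 0.5) +  # draw the edges
  geom_node_point(size = 3) +    # draw the nodes
  theme_graph()                  # graph-friendly theme: no axes or grid
```

Because the result is a ggplot2 object, you can keep layering on labels, scales, and facets just as you would with an ordinary plot.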

The ggraph package is available on CRAN now, and works with R version 2.10 and later. For more on the ggraph package, see the announcement blog post linked below.

Data Imaginist: Announcing ggraph: A grammar of graphics for relational data

Facebook is a famously data-driven organization, and an important goal in any data science activity is forecasting. Now, Facebook has released Prophet, an open-source package for R and Python that implements the time-series methodology that Facebook uses in production for forecasting at scale.

Prophet has a very simple interface: you pass it a column of dates and a column of numbers, and it produces a forecast for the time series, like this:

The black dots are the number of views of Peyton Manning's Wikipedia page through the end of 2016; the blue region is a forecast (with uncertainty interval) into 2017. As you can see, the Prophet forecast automatically detects the seasonal cycles (presumably related to NFL seasons). The prophet function also provides options to explicitly model weekly and/or yearly seasonality, account for holidays, and to specify changepoints where discontinuities in the time series are expected. It also supports modeling logistic growth, where each data point is measured against a maximum possible capacity.
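In R, the basic workflow is only a few lines. Here's a hedged sketch (assuming the prophet package is installed; the CSV file name is hypothetical, standing in for any data frame with `ds` and `y` columns):

```r
library(prophet)

# Prophet expects a data frame with columns `ds` (date) and `y` (value)
df <- read.csv("views.csv")    # hypothetical file of daily page views

m <- prophet(df)               # fit the model (seasonality detected automatically)

# Extend the date range one year into the future and forecast
future <- make_future_dataframe(m, periods = 365)
forecast <- predict(m, future) # yhat plus uncertainty bounds

plot(m, forecast)              # history as dots, forecast as a shaded band
```

Options for holidays, explicit seasonalities, changepoints, and logistic growth are all arguments to the `prophet()` call.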

The underlying model behind the forecast is described in this paper. It's not your traditional ARIMA-style time series model; it's closer in spirit to a Bayesian-influenced generalized additive model, a regression of smooth terms. The model is resistant to the effects of outliers, and supports data collected on an irregular time scale (including the presence of missing data) without the need for interpolation. The underlying calculation engine is Stan; the R and Python packages simply provide a convenient interface.

Prophet is billed as a platform for "forecasting at scale", but here "scale" does *not* refer to data size or computational speed (as we are accustomed to in this domain). As authors Sean Taylor and Ben Letham write in the aforementioned paper:

The actual problems of scale we have observed in practice involve the complexity introduced by the variety of forecasting problems and building trust in a large number of forecasts once they have been produced.

With its simple interface that automates much of the process of finding a "best fit" characterization of a complex time series, Prophet scales easily to a large number of users who need many forecasts of qualitatively different kinds of data, without in-depth expertise in time series modeling techniques.

You can find out more about Prophet, including how to install it for R and Python, at the link below.

Facebook Open Source: Prophet

Oksana Kutina and Stefan Feuerriegel from the University of Freiburg recently published an in-depth comparison of four R packages for deep learning. The packages reviewed were:

- MXNet: The R interface to the MXNet deep learning library. (The blog post refers to an older name for the package, MXNetR.)
- darch: An R package for deep architectures and restricted Boltzmann machines.
- deepnet: An R package implementing feed-forward neural networks, restricted Boltzmann machines, deep belief networks, and stacked autoencoders.
- h2o: The R interface to the H2O deep-learning framework.

The blog post goes into detail about the capabilities of the packages, and compares them in terms of flexibility, ease-of-use, parallelization frameworks supported (GPUs, clusters) and performance -- follow the link below for details. I include the conclusion from the paper here:

The current version of deepnet might represent the most differentiated package in terms of available architectures. However, due to its implementation, it might not be the fastest nor the most user-friendly option. Furthermore, it might not offer as many tuning parameters as some of the other packages.

H2O and MXNetR, on the contrary, offer a highly user-friendly experience. Both also provide output of additional information, perform training quickly and achieve decent results. H2O might be more suited for cluster environments, where data scientists can use it for data mining and exploration within a straightforward pipeline. When flexibility and prototyping is more of a concern, then MXNetR might be the most suitable choice. It provides an intuitive symbolic tool that is used to build custom network architectures from scratch. Additionally, it is well optimized to run on a personal computer by exploiting multi CPU/GPU capabilities.

darch offers a limited but targeted functionality focusing on deep belief networks.

Information Systems Research R Blog: Deep Learning in R

The heatmap is a useful graphical tool in any data scientist's arsenal. It's a way of representing numeric data in a 2-dimensional grid, where the value of each cell in the grid is represented by a color. It's a natural fit for data that's in a grid already (say, a correlation matrix). But it's also useful for data that can be arranged in a grid, like quantities in a calendar, as a way of comparing clusters, or simply as a combination of two categorical or discrete variables.

The base R heatmap function does a good job of generating basic heatmaps (this FlowingData tutorial showcases its capabilities), but if you want to put anything on the margins besides labels you're going to need something more powerful. The superheat package by Rebecca Barter (currently available only on GitHub) provides many additional capabilities for basic heatmaps (like ordering the rows/columns, or choosing a color scheme) and also the option to supercharge the heatmap with annotations and additional data visualizations in the margins. Here are a few examples:

Add a scatterplot (or boxplot) to one of the margins (details and code):

Color the labels by another variable (here, Human Development Ranking, also represented as a bar chart — details and code here):

Or add dendrograms (perhaps from a clustering process) to the margins (details and code):

While the superheat package uses the ggplot2 package internally, it doesn't itself follow the grammar of graphics paradigm: the function is more like a traditional base R graphics function with a couple of dozen options, and it creates a plot directly rather than returning a ggplot2 object that can be further customized. But as long as the options cover your heatmap needs (and that's likely), you should find it a useful tool next time you need to represent data on a grid.
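A typical call looks like this rough sketch (assuming superheat is installed from GitHub; the dataset and the particular margin options shown are just illustrative):

```r
library(superheat)

# mtcars as a simple numeric grid; each option is one of the
# couple-dozen arguments mentioned above
superheat(mtcars,
          scale = TRUE,             # scale columns before coloring
          row.dendrogram = TRUE,    # cluster rows and draw a dendrogram
          yt = colMeans(mtcars),    # top margin: column means...
          yt.plot.type = "bar")     # ...displayed as a bar chart
```

Swapping `yt.plot.type` between `"bar"`, `"scatter"`, and similar values is how you get the different margin visualizations shown above.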

The superheat package apparently works with any R version after 3.1 (and I can confirm it works on the most recent, R 3.3.2). This arXiv paper provides some details and several case studies, and you can find more examples here. Check out the vignette for detailed usage instructions, and download it from its GitHub repository linked below.

GitHub (rlbarter): superheat

If you want to get data out of R and into another application or system, simply copying the data as it resides in memory generally isn't an option. Instead you have to *serialize* the data (into a file, usually), which the other application can then *deserialize* to recreate the original data. R has several options to serialize data frames:

- You can serialize (export) data to comma-separated files (CSVs), which can be imported by just about any application. R has several packages to read and write CSVs, including `fwrite` and `fread` from the data.table package. The downside is that as an ASCII format, CSVs are inefficient, particularly for numeric data.
- The base R function `saveRDS` (and its deserialization counterpart, `readRDS`) can write any R object to a file. This is a fairly efficient binary representation of the data, but not many applications can read RDS files.
- The feather package provides the functions `read_feather` and `write_feather`, which use an efficient binary format based on the open Apache Arrow framework.
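The RDS route, for instance, is a one-line round trip in base R:

```r
# Round-tripping a data frame through an RDS file, using only base R.
df <- data.frame(x = 1:3, y = c("a", "b", "c"), stringsAsFactors = FALSE)

path <- tempfile(fileext = ".rds")
saveRDS(df, path)       # serialize to a binary RDS file
df2 <- readRDS(path)    # deserialize: recreates the original object

identical(df, df2)      # TRUE: the round trip is lossless
```

The other packages follow the same write-then-read pattern, differing mainly in format, speed, and which other systems can read the result.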

And now there's a new package to add to the list: the fst package. Like the data.table package (the fast data.frame replacement for R), the primary focus of the fst package is speed. The chart below compares the speed of reading and writing data to/from CSV files (with fwrite/fread), feather, fst, and the native R RDS format. The vertical axis is throughput in megabytes per second (more is better). As you can see, fst outperforms the other options for both reading (orange) and writing (green).

The fst package achieves these impressive benchmarks thanks to the magic of compression. Even in this modern age of fast, solid-state storage, it's still (usually) faster to spend time using the CPU to compress the data first, rather than simply writing a larger file to disk. (The same applies to de-compression, and because that's an easier task than compression, there are even more performance gains to be had when reading.) The benefits are dependent on the data itself though: data without a lot of repetition (or, in the worst case, truly random numerical data) won't see performance gains like this.

Nonetheless, fst looks like it will be a useful package for applications that need to export data from R and into another R session as quickly as possible. (The fst format isn't supported by any systems other than R, as far as I know.) The package is still in its early days — the authors warn that the file format is likely to change in the future — but it likely has a place in high-performance R applications that rely on data transfer.
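Usage follows the same pattern as the other serialization options. Here's a rough sketch using the `write_fst`/`read_fst` function names from current releases (the package is young and, as noted, its interface and format may still change):

```r
library(fst)

# A million-row data frame with repetitive character data,
# the kind of content that compresses well
df <- data.frame(x = runif(1e6),
                 y = sample(letters, 1e6, replace = TRUE))

path <- tempfile(fileext = ".fst")
write_fst(df, path)      # compressed binary write
df2 <- read_fst(path)    # fast read back into R
```

Because the format is columnar, `read_fst` can also read just a subset of columns or rows, which helps when only part of a large file is needed.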

fst package: Lightning Fast Serialization of Data Frames for R

CRAN, the global repository of open-source packages that extend the capabilities of R, reached a milestone today. There are now more than 10,000 R packages available for download*.

(Incidentally, that count doesn't even include all the R packages out there. There are also another 1294 packages for genomic analysis in the BioConductor repository, hundreds of R packages published only on GitHub, commercial R packages from vendors such as Microsoft and Oracle, and an unknowable number of private, unpublished packages.)

Why so many packages? R has a *very* active developer community, who contribute new packages to CRAN on a daily basis. As a result, R is unparalleled in its capabilities for statistical computing, data science, and data visualization: almost anything you might care to do with data has likely already been implemented for you and released as an R package. There are several reasons behind this:

- R is the most popular language for data scientists, *and* it's been around for almost 20 years, so by sheer force of numbers and time, R has more extensions than any other data science software.
- R is the primary tool used for statistical *research*: when new methods are developed, they're not just published as a paper; they're also published as an R package. That means R is always at the cutting edge of new methodologies.
- R was designed as an interface language: a means to present a consistent language interface for algorithms written in other languages. Many packages work by providing R language bindings to other open-source software, making R a convenient hub for all kinds of algorithms and methods.
- Last but certainly not least, the CRAN system itself is a very effective platform for sharing R extensions, with a mature system for package authoring, building, testing, and distribution. The R core team and in particular the CRAN maintainers deserve significant credit for creating such a vibrant ecosystem for R packages.

Having so many packages available can be a double-edged sword though: it can take some searching to find the package you need. Luckily, there are some resources available to help you:

- MRAN (the Microsoft R Application Network) provides a search tool for R packages on CRAN.
- To find the most popular packages, Rdocumentation.org provides a leaderboard of packages by number of downloads. It also provides lists of newly-released and recently-updated packages.
- CRAN provides package Task Views, providing a directory of packages by topic area (such as Finance or Clinical Trials). MRAN and RDocumentation.org also provide searchable versions based on the CRAN task views.
- To find popular and active R packages on GitHub, see the Trending R repositories list.
- For curated news on updated and new R packages, check out the Package Picks by Joseph Rickert on RStudio's RViews blog, and also the Package Spotlights published with each Microsoft R Open release. Cranberries also provides a comprehensive uncurated feed of new and updated packages.

The rate of R package growth shows no signs of abating, either. As you can see from this chart (created using this script by Gergely Daróczi), the number of packages is still climbing steadily. (This chart includes packages released and subsequently withdrawn from CRAN, which is why it goes over 10,000.)

Know of any other resources for exploring R packages? Let us know in the comments.

**Actually, as of this writing, there are 9,999 packages on CRAN. So close! But it won't be long...*

Microsoft R Server 9 includes a new R package for machine learning: MicrosoftML. (So do the Data Science Virtual Machine and the free Microsoft R Client edition, incidentally.) This package includes a suite of fast predictive modeling functions implemented by Microsoft Research, including:

- Linear (`rxFastLinear`) and logistic (`rxLogisticRegression`) model functions based on the Stochastic Dual Coordinate Ascent method;
- Classification/regression trees (`rxFastTrees`) and random forests (`rxFastForests`) based on FastRank, an efficient implementation of the MART gradient boosting algorithm;
- A neural network algorithm (`rxNeuralNet`) with support for custom, multilayer network topologies; and
- One-class anomaly detection (`rxOneClassSvm`) based on support vector machines.

As the function names suggest, the implementations are tuned for speed: most use multiple CPUs, and some will even use the GPU (if available). Not all of the implementations scale to unlimited data sizes, however; all but the linear and logistic regression routines are bound by available RAM.
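For a flavor of the interface, a call might look like this minimal sketch (it requires Microsoft R Server or Client with the MicrosoftML package loaded; the data frames and predictor names here are hypothetical):

```r
library(MicrosoftML)

# Hypothetical training data: predict whether the driver was tipped
model <- rxFastTrees(tipped ~ passenger_count + trip_distance + fare_amount,
                     data = trainData)   # boosted-tree binary classifier

# Score the held-out test set with the fitted model
scores <- rxPredict(model, data = testData)
```

The other algorithms in the list follow the same formula-plus-data calling convention, so swapping in `rxNeuralNet` or `rxLogisticRegression` is a one-line change.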

If you want to give these routines a try, the Microsoft R Server Tiger Team has prepared a walkthrough analyzing the famous NYC Taxi data set. Once you have access to Microsoft R Server (or Client), this R script walks you through the process of:

- Loading the MicrosoftML package
- Importing the NYC Taxi Data from SQL Server (it comes preinstalled on the Data Science Virtual Machine)
- Splitting the data into a test set and a training set, with the binary value "tipped" (whether or not the driver was tipped) as the response
- Fitting several predictive models: logistic regression, linear model, fast forest, and neural network.
- Making predictions on the test data
- Evaluating model performance by comparing AUC (area under the ROC curve)

The ROC curves are shown below. As you'd expect, the linear model performs poorly compared to the others, since it's being applied here to a binary variable.

To try it out yourself, follow the walkthrough linked below, which also provides instructions for running the logistic regression model in SQL Server Management Studio.

Microsoft R Server Tiger Team: Predicting NYC Taxi Tips using MicrosoftML

Julia Silge and David Robinson are both dab hands at using R to analyze text, from tracking the happiness (or otherwise) of Jane Austen characters, to identifying whether Trump's tweets came from him or a staffer. If you too would like to be able to make statistical sense of masses of (possibly messy) text data, check out their book *Tidy Text Mining with R*, available free online and soon to be published by O'Reilly.

The book builds on the tidytext package (to which Gabriela De Queiroz also contributed) and describes how to handle and analyze text data. The "tidy text" of the title refers to a standardized way of handling text data, as a simple table with one term per row (where a "term" may be a word, collection of words, or sentence, depending on the application). Julia gave several examples of tidy text in her recent talk at the RStudio conference:

Once you have text data in this "tidy" format, you can apply a vast range of statistical tools to it, by assigning data values to the terms. For example, you can use sentiment analysis tools to quantify terms by their emotional content, and analyze that. You can compare rates of term usage, such as between chapters or between authors, or simply create a word cloud of terms used. You could also use topic modeling techniques to classify a collection of documents into like kinds.
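The core move is `unnest_tokens`, which reshapes raw text into the one-term-per-row form. Here's a small sketch (assuming the tidytext, dplyr, and janeaustenr packages are installed):

```r
library(tidytext)
library(dplyr)
library(janeaustenr)

# One row per word ("tidy text"), then count the most frequent words,
# dropping common stop words first
austen_books() %>%
  unnest_tokens(word, text) %>%          # text column -> one word per row
  anti_join(stop_words, by = "word") %>% # remove "the", "and", etc.
  count(word, sort = TRUE)               # most frequent words first
```

From there, joining against a sentiment lexicon or feeding the counts into a topic model is just more ordinary data-frame manipulation.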

There is a wealth of data sources you can use to apply these techniques: documents, emails, text messages ... anything with human-readable text. The book includes examples of analyzing works of literature (check out the janeaustenr and gutenbergr packages), downloading Tweets and Usenet posts, and even shows how to use metadata (in this case, from NASA) as the subject of a text analysis. But it's just as likely you have data of your own to try tidy text mining with, so check out Tidy Text Mining with R to get started.