If you've lived in or simply love London, a wonderful new book for your coffee table is *London: The Information Capital*. In 100 beautifully rendered charts, the book explores the data that underlies the city and its residents. To create most of these charts, geographer James Cheshire and designer Oliver Uberti relied on programs written in R. Using R not only created beautiful results, it saved time: "a couple of lines of code in R saved a day of manually drawing lines".

Take for example *From Home To Work*, the graphic illustrating the typical London-area commute. R's ggplot2 package was used to draw the individual segments as transparent lines, which when overlaid build up the overall picture of commuter flows around cities and towns. The R graphic was then imported into Adobe Illustrator to set the color palette and add annotations. (FlowingData's Nathan Yau uses a similar process.)
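As a minimal sketch of the overlay technique (with simulated coordinates standing in for the book's real origin-destination data), thousands of near-invisible segments drawn with ggplot2's geom_segment build up a flow picture:

```r
# Sketch: overlay many translucent commute segments with ggplot2.
# The coordinates are simulated; the book used real commute data.
library(ggplot2)

set.seed(42)
n <- 5000
commutes <- data.frame(
  x0 = rnorm(n),           y0 = rnorm(n),            # simulated home locations
  x1 = rnorm(n, sd = 0.3), y1 = rnorm(n, sd = 0.3)   # simulated work locations
)

# Each trip is a faint line; thousands overlaid reveal the flow structure
p <- ggplot(commutes) +
  geom_segment(aes(x = x0, y = y0, xend = x1, yend = y1),
               alpha = 0.05, colour = "white") +
  theme_void() +
  theme(plot.background = element_rect(fill = "black"))
```

The low `alpha` is the key design choice: individual trips are nearly invisible, so brightness encodes route density.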

Another example is the chart below of cycle routes in London. (We reported on an earlier version of this chart back in 2012.) As the authors note, "hundreds of thousands of line segments are plotted here, making the graphic an excellent illustration of R’s power to plot large volumes of data."

You can learn more from the authors about how R was used to create the graphics in *London: The Information Capital* and see several more examples at the link below. And if you'd like a copy, you can buy the book here.

London: The Information Capital / Our Process: The Coder and Designer

by Tim Winke

PhD student in Demography and Social Sciences in Berlin

*This post has been abstracted from Tim's entry to a contest that Dalia Research is running based on a global smartphone survey that they are conducting. Tim's entry post is available, as is all of the code behind it. - editor*

When people think about Germany, what comes to mind? Oktoberfest, OK – but Mercedes might be second, or BMW or Porsche. German car brands have a solid reputation all over the world, but how popular is each brand in different countries?

There are plenty of survey data out there but hardly anyone collects answers within a couple of days from 6 continents. A new start-up called Dalia Research found a way to use smartphone and tablet networks to conduct surveys. It’s not a separate app but works via thousands of apps where targeted users decide to take part in a survey in exchange for an incentive.

In August 2014, they asked young mobile users in 64 countries, including Colombia, Iran and Ukraine, 51 questions. This is impressive – you have access to the opinions of 32,000 people, collected within 4 days from all over the world – 500 respondents in each country – about their religion, what they think about the United States, whether the EU has global influence, whether Qatar should host the 2022 FIFA World Cup, and also: "What is your favorite German car brand?".

Surprisingly, as the map below shows, BMW seems to be the most popular German car brand – and Volkswagen does not reach the pole position in any country.

The ggplot2 stacked barchart provides even more detail.
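A stacked bar chart of this kind takes only a few lines of ggplot2; the country and share values below are invented for illustration and are not Dalia's survey results:

```r
# Sketch of a stacked bar chart of brand preference by country.
# The shares below are made up, purely to illustrate the chart type.
library(ggplot2)

prefs <- data.frame(
  country = rep(c("Colombia", "Iran", "Ukraine"), each = 4),
  brand   = rep(c("BMW", "Mercedes", "Audi", "Volkswagen"), times = 3),
  share   = c(40, 30, 15, 15,  35, 35, 20, 10,  45, 25, 20, 10)
)

# position = "fill" normalizes each bar to 100%, comparing brand mix by country
p <- ggplot(prefs, aes(x = country, y = share, fill = brand)) +
  geom_col(position = "fill") +
  scale_y_continuous(labels = scales::percent) +
  labs(x = NULL, y = "Share of respondents", fill = "Brand")
```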

To see how I employed dplyr, ggplot2 and rworldmap to construct these plots as well as how to integrate the survey data with world development indicators from the World Bank please have a look at my original post.

With so many more devices and instruments connected to the "Internet of Things" these days, there's a whole lot more time series data available to analyze. But time series are typically quite noisy: how do you distinguish a short-term tick up or down from a true change in the underlying signal? To solve this problem, Twitter created the BreakoutDetection package for R, which decomposes a time series into a series of segments of one of three types:

- **Steady state**: The time series follows a fixed mean (with random noise around the mean);
- **Mean shift**: The time series jumps directly from one steady state to another;
- **Ramp up / down**: The time series transitions linearly from one steady state to another, over a fixed period of time.

Given a univariate time series (and a few tuning parameters), the breakout function will return a list of **breakout points**: times when these state transitions are detected. It uses a non-parametric algorithm (E-Divisive with Medians) to detect the breakout points, so no assumptions are made about the underlying distribution of the time series.
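As a toy illustration (not Twitter's E-Divisive with Medians algorithm itself), a series containing a single mean shift can be simulated in base R; with BreakoutDetection installed from GitHub, the commented call sketches how breakout would be applied to it:

```r
# Simulate a univariate series with one mean shift: a steady state around 0,
# then a jump to a new steady state around 3 (values chosen arbitrarily).
set.seed(1)
ts_vals <- c(rnorm(200, mean = 0), rnorm(200, mean = 3))

# With the package installed (devtools::install_github("twitter/BreakoutDetection")),
# the transition near t = 200 could be detected with a call like:
# res <- BreakoutDetection::breakout(ts_vals, min.size = 30, method = "multi")

# A crude sanity check that the two halves really have different means:
shift <- mean(ts_vals[201:400]) - mean(ts_vals[1:200])
```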

Twitter uses this R package to monitor the user experience on the Twitter network and detect when things are "Breaking Bad". Data scientist Randy Zwitch used the package to identify the dates of blog posts or references on Hacker News from his blog traffic data. (He also compared the algorithm to anomaly detection with the Adobe Analytics API.) And the University of Louisville School of Medicine has also looked at using the package to identify past influenza outbreaks from CDC data:

For more information about the BreakoutDetection package, check out Twitter's blog post linked below. You can download the BreakoutDetection R package itself from GitHub.

Twitter Engineering blog: Breakout detection in the wild (via FlowingData)

Mike Cavaretta is Ford Motor Company’s Chief Data Scientist, and was tasked by the incoming CEO Alan Mulally to help change the culture so that "important decisions within the company had to be based on data". In a feature article at *Dataconomy*, he reveals that R is a big part of this revolution at Ford:

On the statistical side, we did a lot of stuff in R. ... We’ve done a lot more with R and we’re currently evaluating Pentaho. So we’ve really moved from more point solutions for solving particular problems, to more of a framework and understanding different needs in different areas. For example, there may be certain times when SAS is great for analysis because we already have implementations, and it’s easier to get that into production. There are other times when R is a better choice because it’s got certain packages that makes that analysis a lot easier, so we’re working on trying to put all that together.

Data science is revolutionizing Ford's products, too. Among the applications Cavaretta describes:

- Breaking down data silos between analytics groups within Ford;
- Understanding how drivers are using electric cars, based on opt-in telemetry data from the cars themselves; and
- Supporting the development of features in vehicles by looking at social media chatter about them. (He gives a great example related to the "three-blink" turn signal in the Ford Fiesta.)

You can read more about Ford's use of data science and R at the link below, or explore more companies using R.

Dataconomy: How Ford Uses Data Science: Past, Present and Future (via @JulianHi)

by Joseph Rickert

We usually have a pretty good time at the monthly Bay Area useR Group (BARUG) meetings, but this month's meeting was a bit more of a party than usual. The very well connected PR team at Sqor Sports, our host company for the evening, secured San Francisco's très trendy 111 Minna Gallery for the venue. There was a full bar, house music for the networking portion of the meeting, gourmet grilled cheese sandwiches compliments of Revolution Analytics, and drama — Matt Dowle, one of our speakers, was on a flight from London that was late getting in.

Oh! and yes, there were three very engaging presentations — well worth standing around in the dark.

First up was Noah Gift, CTO of Sqor, a company with a mission to take sports marketing to a whole new level. They are creating a marketplace for athletes to build and promote their digital brands. Noah described how devilishly difficult it is to gather, clean and prepare the data. Correctly labeling social media data from several sources generated by different athletes with the same name poses a number of vexing challenges.

One surprising aspect of the technology that Sqor is developing is what they call an Erlang-to-R bridge, which replaces many tasks they formerly accomplished with Python. Noah indicated that they are planning to release this code as open source.

Below is a plot from Noah's presentation showing predictions from their R based machine learning algorithms.

Our second speaker was Stephen Elston, who gave a virtuoso live demo of using R on the Microsoft Azure Machine Learning cloud platform. Steve glided between the Azure workflow interface and running R scripts. He showed how to manipulate and transform data in both environments, go back and forth to run models in both Azure and R, and visualize results in R. Slides for Steve's talk are available, as is some R code on Steve's GitHub site. Studying the scripts will give you an idea of the features he presented.

Finally, just in from London, and still lucid at what would have been 4AM his time, Matt Dowle walked through a summary of new features of data.table v1.9.4 and v1.9.5. There were several data.table users present, and Matt made a few new converts with a series of impressively fast benchmarks against base R. In one demo, Matt showed data.table's forder() taking only 17 seconds to sort 40 million random numerics, a task that took R 7 minutes. According to Matt, the trick for getting this kind of performance is data.table's C-based implementation of radix sorting which works on numeric, character and integer types, with no range restrictions (recall that base::sort.list(...,method="radix") is limited to integers with range < 100,000).
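A small-scale sketch of that kind of benchmark follows (1 million values rather than 40 million; note that timings will differ on modern R, which has since adopted data.table's radix sort for base ordering):

```r
# Sketch of data.table's in-place radix-based ordering via setorder(),
# the same machinery behind the forder() demo described above.
library(data.table)

set.seed(7)
x  <- runif(1e6)
dt <- data.table(x = x)

t_dt   <- system.time(setorder(dt, x))   # data.table's radix-based sort
t_base <- system.time(y <- x[order(x)])  # base R ordering, for comparison
```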

data.table's radix sorting, which scales linearly (i.e., below the O(n log n) bound for comparison sorts), is based on two papers: one by Terdiman and the other by Herf. However, where both of these papers use the least significant digit, data.table uses the most significant digit to improve cache efficiency.

Matt also demonstrated data.table's new automatic indexes (you can now use == in i, and data.table will automatically build a secondary key), as well as using dplyr syntax with data.table. Matt emphasized that this flexibility shows the power of R's object-oriented design. He also claimed that both Python's pandas and dplyr for R made the wrong choice in using hashing: instead of hashing, data.table uses fast sorting based on the sort-order vector, which serves as an index in data.table.
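A minimal sketch of the automatic-index behavior, written against the current data.table API (the feature landed in the 1.9.x series discussed here):

```r
# Sketch of data.table's automatic secondary indexes: the first `==` subset
# in `i` builds an index on that column as a side effect, so repeated
# subsets on it avoid a full vector scan.
library(data.table)

set.seed(1)
dt <- data.table(id = sample(1e5), val = rnorm(1e5))

res1 <- dt[id == 42L]   # builds the secondary index on `id`
res2 <- dt[id == 99L]   # reuses the index

indices(dt)             # lists any automatically created indexes
```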

For more benchmark information, be sure to visit Matt's GitHub site. If you are new to data.table, I recommend starting with Matt's 2014 useR presentation, which explains some of the ideas underlying data.table as well as providing an introduction.

Todd Schneider's blog post on solving the traveling salesman problem with R hit the front page of reddit.com. This is a big deal: front-page placement on the popular social news site can drive a *ton* of traffic (in Todd's case, 1.3 million pageviews). But what factors determine which of reddit's contributed links make it to the front page? (There are 25 front-page slots, but more than 100,000 reddit posts on an average day.)

Todd set out to answer this question using the statistical language R, and reported his results on Mashable. He collected 6 weeks of data including 1.2 million rankings for about 15,000 posts, and looked for commonalities amongst those posts that made the top 25.

Now, you might expect that a post's front page ranking is determined by its score (the number of times it has been "liked" by a reddit user, most likely after having seen it in the "subreddit" special topic area where it was posted), and how long since it was posted (reddit's front page generally contains recent posts). But it turns out that *not all subreddits are treated equally*. Todd discovered that there are three different types of subreddits when it comes to how posts are promoted to the front page:

- **"Viral Candy"** subreddits, like funny, gifs and todayilearned. Posts from this category dominate page one.
- **"Page Two"** subreddits, which include Documentaries, Fitness and personalfinance. As the name suggests, posts in these subreddits almost never make it to page 1, but are often promoted to page 2.
- **"The Rest"**, which includes food, LifeProTips, and sports. Todd's post was in this category, in the subreddit dataisbeautiful. Posts in these subreddits make up a small but significant fraction of page 1 posts.

It seems that reddit's front page (and pages 2, 3 and 4 which follow) follow a well-defined mix of posts from each of the three categories, as you can see in the chart below:

Starting from the left of the chart above, you can see the #1 post (on page 1) is from one of the "Viral Candy" subreddits about 97% of the time, but that a "The Rest" post does occasionally take top billing. By contrast, posts from the "Page Two" subreddits almost never appear above #10, but dominate page two (ranks 26-50). There's a pretty consistent mix on pages 3 and 4: about 65% "viral candy", about 15% "page twos" and about 25% "the rest".

As for post scores, Todd noted that posts from "Viral Candy" and "The Rest" subreddits need high scores to get on page 1: about 3500-4500 and 3000-4000 respectively for the top slot. By contrast, posts in "Page 2" reddits only need scores in the 500-1500 range to hit the lower ranks of Page 1 (but are much more likely to appear on Page 2).

If you're interested in the details of what gets a post on reddit's front page, Todd's blog post has lots more information. And if you're an R user and want to do a similar analysis, Todd's data and R code are available on github.

Todd W Schneider: The reddit Front Page is Not a Meritocracy

by Jeremy Reynolds

Senior R Trainer, Revolution Analytics

Last week, Revolution Analytics released its first massive open online course (MOOC) through a partnership with datacamp.com: Introduction to Revolution R Enterprise for Big Data Analytics. You can sign up for the free course here.

This course provides a look at some of the tools in the RevoScaleR package that ships with Revolution R Enterprise. The course and the interactive training framework provided by the platform allow you to get a feel for how you can manipulate, visualize, and analyze large datasets with RevoScaleR.

There are four “chapters” within the course:

- Chapter 1 introduces the RevoScaleR package. Within the chapter, we discuss the challenges associated with big data and how the functions and algorithms in RevoScaleR address them. We walk through an example in which we demonstrate the use of several core RevoScaleR functions, and we provide exercises in which you can use RevoScaleR to create your first linear model on big data.
- Chapter 2 provides details for some of the RevoScaleR functions used to explore large datasets. We demonstrate how you can use these functions to summarize, cross-tabulate, and visualize variables in large datasets, and we provide hands-on exercises where you can practice with them.
- Chapter 3 covers the RevoScaleR functions used to manipulate and transform large datasets. We demonstrate how you can use these functions to perform simple and complex transformations. We provide a set of interactive exercises that allow you to practice creating transformed variables and to explore how chunking algorithms impact the ways in which you need to process data.
- Chapter 4 concentrates on the analysis of large data sets. We demonstrate how to use RevoScaleR functions to build statistical and machine learning models on large data sets. Here, we cover linear and logistic regression, k-means clustering, and decision tree estimation. For each kind of analysis, there are hands-on exercises in which you can explore some of the flexibility and power associated with the functions.

We are very excited about our partnership with datacamp.com — the platform provides a unique, hands-on training environment in which you can practice in a live R environment, and you can “learn by doing.” We are diligently working to create more extensive content, including courses on the Fundamentals of the R Programming Language, Introductory Statistics with Revolution R Enterprise, Predictive Modeling, and Advanced R Programming. Stay tuned for more!

DataCamp: Introduction to Revolution R Enterprise for Big Data Analytics

The video replay from last week's Introducing Revolution R Open webinar is now available, and I've embedded it below for anyone who may have missed the live presentation.

In addition to providing an overview of Revolution R Open, the presentation also introduces other open source projects from Revolution Analytics. If you've ever wanted to embed R into another application, check out the demo of DeployR Open beginning at the 21-minute mark.

There are lots of links in the slides, which you can follow in the presentation on SlideShare. If you want to get started with Revolution R Open, visit our MRAN website (which has just been upgraded to provide even more detailed package info — see the knitr entry for an example).

Revolution Analytics Webinars: Introducing Revolution R Open: Enhanced, Open Source R distribution from Revolution Analytics

The R language has jumped to number 12 in the November 2014 TIOBE Index of programming language popularity. This is R's highest ranking in the history of the TIOBE index, which has been ranking languages since 2003. A high ranking is an impressive achievement for R given that it is a domain-specific language (designed for data science applications), and puts it in the company of general-purpose languages like C, Java and Python. Here are the top 15 from the latest index:

Other data science / statistics languages in the top 50 include SAS (at #22), Matlab (#24) and Scala (#31).

Tiobe's managing director noted in an Infoworld interview that "R is gaining share for a while now", and you can see R's rise in the history of its relative ranking:

The TIOBE rankings are based on search engine results and on the number of skilled engineers world-wide, courses and third party vendors. It's interesting to note that while this methodology is very different from that used for the IEEE language popularity rankings, R also ranked very highly in that list (at #9 in July 2014).

by Joseph Rickert

The igraph package has become a fundamental tool for the study of graphs and their properties, the manipulation and visualization of graphs and the statistical analysis of networks. To get an idea of just how firmly igraph has become embedded into the R package ecosystem consider that currently igraph lists 72 reverse depends, 59 reverse imports and 24 reverse suggests. The following plot, which was made with functions from the igraph and miniCRAN packages, indicates the complexity of this network.

```r
library(miniCRAN)
library(igraph)

pk <- c("igraph", "agop", "bc3net", "BDgraph", "c3net", "camel", "cccd",
        "CDVine", "CePa", "CINOEDV", "cooptrees", "corclass", "cvxclustr",
        "dcGOR", "ddepn", "dils", "dnet", "dpa", "ebdbNet", "editrules",
        "fanovaGraph", "fastclime", "FisHiCal", "flare", "G1DBN", "gdistance",
        "GeneNet", "GeneReg", "genlasso", "ggm", "gRapfa", "hglasso", "huge",
        "igraphtosonia", "InteractiveIGraph", "iRefR", "JGL", "lcd", "linkcomm",
        "locits", "loe", "micropan", "mlDNA", "mRMRe", "nets", "netweavers",
        "optrees", "packdep", "PAGI", "pathClass", "PBC", "phyloTop", "picasso",
        "PoMoS", "popgraph", "PROFANCY", "qtlnet", "RCA", "ReliabilityTheory",
        "rEMM", "restlos", "rgexf", "RNetLogo", "ror", "RWBP", "sand", "SEMID",
        "shp2graph", "SINGLE", "spacejam", "TDA", "timeordered", "tnet")

dg <- makeDepGraph(pk)
plot(dg, main = "Network of reverse depends for igraph", cex = .4, vertex.size = 8)
```

The igraph package itself is a tour de force with over 200 functions. Learning the package can be a formidable task especially if you are trying to learn graph theory and network analysis at the same time. To help myself through this process, I sorted the functions in the igraph package into seven rough categories:

- Create Graph
- Describe Graph
- Environment
- Find Communities
- Operate on a Graph
- Plot
- Statistics

The following shows a portion of the table for the first 10 functions related to creating graphs.

| # | Function | Description | Category of Function |
|---|----------|-------------|----------------------|
| 1 | aging.prefatt.game | Generate an evolving random graph with preferential attachment and aging | Create Graph |
| 2 | barabasi.game | Generate scale-free graphs according to the Barabasi-Albert model | Create Graph |
| 3 | bipartite.random.game | Generate bipartite graphs using the Erdos-Renyi model | Create Graph |
| 4 | degree.sequence.game | Generate a random graph with a given degree sequence | Create Graph |
| 5 | erdos.renyi.game | Generate random graphs according to the Erdos-Renyi model | Create Graph |
| 6 | forest.fire.game | Grow a network that simulates how a fire spreads by igniting trees | Create Graph |
| 7 | graph.adjacency | Create an igraph from an adjacency matrix | Create Graph |
| 8 | graph.bipartite | Create a bipartite graph | Create Graph |
| 9 | graph.complementer | Create the complementary graph for a given graph | Create Graph |
| 10 | graph.empty | Create an empty graph | Create Graph |

The entire table may be downloaded here: Download Igraph_functions.
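As a quick sketch of two of the creation routines from the table (note that recent igraph releases favor underscore-style names such as sample_gnp and graph_from_adjacency_matrix over the older dot-style names listed above):

```r
# Two ways to create a graph with igraph, using the modern function names.
library(igraph)

set.seed(123)
g1 <- sample_gnp(100, 0.05)        # Erdos-Renyi random graph: 100 vertices, p = 0.05

# A small graph built from an explicit adjacency matrix
m <- matrix(c(0, 1, 1,
              1, 0, 0,
              1, 0, 0), nrow = 3, byrow = TRUE)
g2 <- graph_from_adjacency_matrix(m, mode = "undirected")
```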

infomap.community() is an intriguing function listed under the Finding Communities category that looks for structure in a network by minimizing the expected description length of a random walker trajectory. The abstract to the paper by Rosvall and Bergstrom that introduced this method states:

To comprehend the multipartite organization of large-scale biological and social systems, we introduce an information theoretic approach that reveals community structure in weighted and directed networks. We use the probability flow of random walks on a network as a proxy for information flows in the real system and decompose the network into modules by compressing a description of the probability flow. The result is a map that both simplifies and highlights the regularities in the structure and their relationships.

I am not willing to claim that the 37 communities found by the algorithm represent meaningful structure; however, the idea of partitioning the network based on information flow does seem relevant to the package-building process. Anybody looking for a research project?

```r
imc <- infomap.community(dg)
imc
# Graph community structure calculated with the infomap algorithm
# Number of communities: 37
# Modularity: 0.5139813
# Membership vector:
```

Some additional resources for working with igraph are:

- The igraph home page
- A presentation by Gábor Csárdi, igraph's principal author
- An old post with some pointers to additional resources
- A couple of nice tutorials: here and here.
- The new book *Statistical Analysis of Network Data with R* by Eric D. Kolaczyk and Gábor Csárdi is a very good read.