If you're a technical type (a programmer or engineer) who's ever been pulled into a business meeting, this may seem familiar to you. Horrifyingly familiar.

In that situation, I don't think I'd have been able to resist suggesting a seven-dimensional chromatic-agnostic platform as the solution (with the full expectation that the account manager would immediately promise to deliver it!).

That's all for this week. We'll be back on Monday — enjoy your weekend!

by Joseph Rickert

New R packages just keep coming. The following plot, constructed with information from the monthly files on Dirk Eddelbuettel's CRANberries site, shows the number of new packages released to CRAN each month between January 1, 2013 and July 27, 2015 (not quite 31 months).

This is amazing growth! The mean rate is about 125 new packages a month. How can anyone keep up? The direct approach, of course, would be to become an avid, frequent reader of CRANberries. Every day the CRAN:New link presents the relentless roll call of new arrivals. However, dealing with this extreme level of tediousness is not for everyone.
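To make the "mean rate" concrete, here is a minimal sketch of the arithmetic, written in Python. It is not the script behind the plot (that is linked in the postscript). The dates are invented for illustration, and `monthly_counts` and `mean_monthly_rate` are hypothetical helper names.

```python
from collections import Counter
from datetime import date

# Hypothetical release dates; a real run would use dates taken from CRANberries.
release_dates = [
    date(2015, 5, 3), date(2015, 5, 17), date(2015, 5, 29),
    date(2015, 6, 1), date(2015, 6, 20),
    date(2015, 7, 4), date(2015, 7, 9), date(2015, 7, 15), date(2015, 7, 27),
]

def monthly_counts(dates):
    """Count releases per (year, month), returned in chronological order."""
    counts = Counter((d.year, d.month) for d in dates)
    return dict(sorted(counts.items()))

def mean_monthly_rate(dates):
    """Average number of releases per month, over months that appear in the data.

    Months with no releases at all are not represented, so they do not
    pull the average down; with CRAN-sized data that case does not arise.
    """
    counts = monthly_counts(dates)
    return sum(counts.values()) / len(counts)

print(monthly_counts(release_dates))
print(mean_monthly_rate(release_dates))
```

Note that a partial final month (such as July 2015 in the plot) is counted as a full month here, which slightly understates the rate.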

At MRAN we are attempting to provide some help with the problem of keeping up with what's new through the old-fashioned (pre-machine learning) practice of making some idiosyncratic, but not completely capricious, human-generated recommendations. With every new release of RRO we publish, on the Package Spotlight page, brief descriptions of packages in three categories: New Packages, Updated Packages and GitHub packages. None of these lists is intended to be either comprehensive or complete in any sense.

The New Packages list includes new packages that have been released to CRAN since the previous release of RRO. My general rules for selecting packages for this list are that they should either (1) be tools or infrastructure packages that may prove useful to a wide audience or (2) involve a new algorithm or statistical technique that I think will be of interest to statisticians and data scientists working in many different areas. The following two packages respectively illustrate these two selection rules:

metricsgraphics V0.8.5: provides an htmlwidgets interface to the MetricsGraphics.js D3 JavaScript library for plotting time series data. The vignette shows what it can do.

rotationForest V0.1: provides an implementation of the new Rotation Forest binary ensemble classifier described in the paper by Rodriguez et al.

I also tend to favor packages that are backed by a vignette, paper or URL that provides additional explanatory material.

Of course, any scheme like this is limited by the knowledge and biases of the curator. I am particularly worried about missing packages targeted towards biotech applications that may indeed have broader appeal. The way to mitigate the shortcomings of this approach is to involve more people. So if you come across a new package that you think may have broad appeal, send us a note and let us know why (open@revolutionanalytics.com).

The Updated Packages list is constructed with a single criterion: the fact that the package was updated should convey news of some sort. Most of the very popular and useful packages are updated frequently, some approaching monthly updates. So, even though they are important packages, the fact that they have been updated is generally no news at all. It is also the case that package authors generally do not put much effort into describing the updates. In my experience poking around CRAN I have found that the NEWS directories for packages go mostly unused. (An exemplary exception is the NEWS for ggplot2.)

Finally, the GitHub list is mostly built from repositories that are trending on GitHub with a few serendipitous finds included.

We would be very interested in learning how you keep up with new R packages. Please leave us a comment.

Post Script:

Note that the information from CRANberries about CRAN's new, updated and removed packages is also available as an RSS feed: Download Index.

The code for generating the plot may be found here: Download New_packages

Also, we have written quite a few posts over the last year or so about the difficulties of searching for relevant packages on CRAN. Here are links to three recent posts:

How many packages are there really on CRAN?

Fishing for packages in CRAN

Working with R Studio CRAN Logs

by Andrie de Vries

Last week, IEEE Spectrum said R rose to #6 in its Top Programming Languages ranking. They use a weighted methodology of 12 factors to compute their score. Among these factors is the activity on social programming websites, including StackOverflow and GitHub.

I recently used data.stackexchange.com to query the total number of questions on StackOverflow using the R tag.

It is easy to extend the query to include all of the top 10 languages (according to IEEE Spectrum) and see if the StackOverflow activity tells us anything interesting.

This plot shows the monthly total number of questions in each tag on StackOverflow, since 2008. JavaScript and Java top the list, with R in position #8, roughly on par with C.

There is a fair amount of noise in the form of monthly fluctuations. The underlying trends are slightly easier to observe by fitting a smoother through the data:

From this it is apparent that R is increasing its rank compared to the other languages. During the past year, it has been catching up to C and should soon overtake it to reach position #7. By comparison, during most of 2013, R was only in position #9.
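For readers who want to see what "fitting a smoother" does to noisy monthly counts, here is a minimal sketch in Python. It uses a centered moving average as a simple stand-in; it is not the smoother used for the plot above, and the question counts are invented for illustration.

```python
def moving_average(values, window=3):
    """Centered moving average; the window shrinks at the two ends."""
    half = window // 2
    smoothed = []
    for i in range(len(values)):
        lo = max(0, i - half)
        hi = min(len(values), i + half + 1)
        smoothed.append(sum(values[lo:hi]) / (hi - lo))
    return smoothed

# Hypothetical monthly question counts for one tag.
monthly_questions = [100, 130, 110, 150, 140, 180, 160]
trend = moving_average(monthly_questions, window=3)
print(trend)
```

A wider window gives a smoother trend at the cost of reacting more slowly to real changes, which is the same trade-off any smoother has to make.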

<edit>

I've had several requests, from Dirk Eddelbuettel and Hadley Wickham, to include a plot with a log y scale. Here it is. It seems to me that the R-tag has a steeper slope than the other tags, indicating faster growth.
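The reason a log scale is useful here: on a log y axis, a series growing by a constant percentage per month appears as a straight line, and the slope of that line is the log of the monthly growth factor. A minimal sketch in Python, using two synthetic series (not the StackOverflow data) and a hypothetical helper `log_slope`:

```python
import math

def log_slope(counts):
    """Least-squares slope of log(count) against month index.

    Assumes every count is strictly positive.
    """
    n = len(counts)
    xs = range(n)
    ys = [math.log(c) for c in counts]
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    num = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    den = sum((x - x_mean) ** 2 for x in xs)
    return num / den

# Synthetic series: a small tag growing 5% a month, a large tag growing 1% a month.
small_fast = [100 * 1.05 ** m for m in range(24)]
large_slow = [1000 * 1.01 ** m for m in range(24)]

print(log_slope(small_fast), log_slope(large_slow))
```

The smaller tag has the steeper slope even though its absolute counts are lower, which is exactly the kind of comparison a linear y axis hides.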

Also, Gabe Becker asked if this includes questions on CrossValidated. The answer is no, this is only questions on StackOverflow. On CrossValidated there are an additional ~400 R questions every month. In other words, about 10% of the volume on StackOverflow. This will make a small, but not hugely material, difference. Note that this also doesn't include tags on any of the other StackExchange sites.

</edit>

The IEEE Spectrum study uses many other factors, including search rankings and job postings. For R to jump to #6 in their study means that these other factors also play a role.

Nevertheless, it is interesting to compare the numbers on a single, very easily accessible metric.

I didn't try to include any of the other languages in the StackExchange data query, using only the list of top 10. The overall ranking in my analysis might change with a more comprehensive list of tags to search for. (You can change the underlying query and add more tag names.)

Here is the code to make these plots:

by John Mount

Data Scientist, Win-Vector LLC

R has a number of very good packages for manipulating and aggregating data (plyr, sqldf, RevoScaleR, data.table, and more), but when it comes to accumulating results the beginning R user is often at sea. The R execution model is a bit exotic so many R users are very uncertain which methods of accumulating results are efficient and which are inefficient.

In this latest "R as it is" we will quickly become expert at efficiently accumulating results in R. To read more please click here.

Priceonomics published on Friday an in-depth profile of Hadley Wickham, author of many of the most popular R packages including ggplot2, dplyr and devtools. In the article, he reveals that his motivation for creating these packages was primarily to provide better ways of accomplishing routine tasks in R, an immensely useful contribution that sadly wasn't recognized in an academic setting. He said:

“There are definitely some academic statisticians who just don’t understand why what I do is statistics, but basically I think they are all wrong. What I do is fundamentally statistics. The fact that data science exists as a field is a colossal failure of statistics. To me, that is what statistics is all about. It is gaining insight from data using modelling and visualization. Data munging and manipulation is hard and statistics has just said that’s not our domain.”

I couldn't agree with the sentiment more, and I too wish the field of Statistics had more respect for solving these "mundane" (i.e. non-mathematical) but important problems.

Hadley also says that his work with R has made him "nerd famous" — in the words of the article author, "The kind of famous where people at statistics conferences line up for selfies, ask him for autographs, and are generally in awe of him". I can attest to the truth of that personally: here's a photo I took at the China R User Conference last year, where enthusiastic attendees were lined up at least 5 deep to get his autograph and take pictures:

The whole article is definitely worth a read, and can be found at the link below. You may also like to check out the 2010 profile on Hadley Wickham from this blog.

Priceonomics: Hadley Wickham, the Man Who Revolutionized R

If you've ever wanted to see how guitar strings actually move as they make music, it turns out you don't need an expensive high-speed camera. All you need to do is set your smartphone to record video, and put it inside:

While some have suggested this is due to the rolling shutter in an iPhone camera, I'm not so sure. I think this is just an example of motion frequencies resonating with the video frame rate — the same effect that makes wagon wheels appear to move without rotating in old black-and-white films. Whatever the cause, it's cool to see the shapes the strings make.

That's all for this week. We'll be back on Monday — enjoy your weekend!

IEEE Spectrum has published its 2015 list of Top Programming Languages, and R ranks in 6th place, jumping 3 places from its 2014 ranking.

Here's what the IEEE has to say about the top 10 from the table above:

The big five—Java, C, C++, Python, and C#—remain on top, with their ranking undisturbed, but C has edged to within a whisper of knocking Java off the top spot. The big mover is R, a statistical computing language that’s handy for analyzing and visualizing big data, which comes in at sixth place. Last year it was in ninth place, and its move reflects the growing importance of big data to a number of fields.

IEEE Spectrum ranks languages using a weighted score of 12 factors including Google search rankings and trends, social media chatter, aggregator posts (Reddit and Hacker News), social programming activity (GitHub and StackOverflow), job opportunities (Career Builder and Dice) and academic citations. You can also specify your own rankings using this interactive web application (for a charge of $0.99). The application also offers rankings of trending languages (R is #10 in this list), languages in demand by employers (R is #13), and languages popular on open source hubs (R is #10). However you measure, R's ranking is impressive: as a domain-specific language for data science, the fact that it ranks alongside general-purpose languages like Java, C and Python demonstrates the importance of advanced data analysis in today's world.
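To illustrate why the ranking depends so much on the weights chosen, here is a minimal sketch of a weighted score in Python. It is only an illustration of the idea: the three metric names, the metric values, and both sets of weights are invented, and IEEE Spectrum's actual factors, normalization and weights are not reproduced here.

```python
def weighted_scores(metrics, weights):
    """Combine per-language metrics (assumed pre-scaled to 0..1) into one score."""
    total_weight = sum(weights.values())
    return {
        lang: sum(weights[name] * value for name, value in vals.items()) / total_weight
        for lang, vals in metrics.items()
    }

def rank(scores):
    """Language names ordered from highest to lowest score."""
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical, pre-normalized metrics for three languages.
metrics = {
    "Java": {"search": 1.0, "jobs": 1.0, "social": 0.5},
    "R": {"search": 0.5, "jobs": 0.25, "social": 1.0},
    "C": {"search": 0.75, "jobs": 0.75, "social": 0.5},
}

equal_weights = {"search": 1, "jobs": 1, "social": 1}
social_heavy = {"search": 1, "jobs": 1, "social": 6}

print(rank(weighted_scores(metrics, equal_weights)))
print(rank(weighted_scores(metrics, social_heavy)))
```

With equal weights the made-up R figures come last; weighting social activity heavily moves R to the top. The same data can support quite different league tables, which is why the custom-ranking option matters.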

IEEE Spectrum: The 2015 Top Ten Programming Languages

The latest update to Revolution R Open, RRO 3.2.1, is now available for download from MRAN. This release upgrades to the latest R engine (3.2.1), enables package downloads via HTTPS by default, and adds new supported Linux platforms.

Revolution R Open 3.2.1 includes:

- The latest R engine, R 3.2.1. Improvements in this release include more flexible character string handling, reduced memory usage, and some minor bug fixes.
- Multi-threaded math processing, reducing the time for some numerical operations on multi-core systems.
- A focus on reproducibility, with access to a fixed CRAN snapshot taken on July 1, 2015. Many new and updated packages are available since the previous release of RRO — see the latest Package Spotlight for details. CRAN packages released since July 1 can be easily (and reproducibly!) accessed with the checkpoint function.
- Binary downloads for Windows, Mac and Linux systems, including new support for SUSE Linux Enterprise Server 10 and 11, and openSUSE 13.1.
- 100% compatibility with R 3.2.1, RStudio and all other R-based applications.

You can download Revolution R Open now from the link below, and we welcome comments, suggestions and other discussion on the RRO Google Group. If you're new to Revolution R Open, here are some tips to get started, and there are many data sources you can explore with RRO. Thanks go as always to the contributors to the R Project upon which RRO is built.

by Joseph Rickert

The XXXV Sunbelt Conference of the International Network for Social Network Analysis (INSNA) was held last month at Brighton beach in the UK. (And I am still bummed out that I was not there.)

A run of 35 conferences is impressive indeed, but the social network analysts have been at it for an even longer time than that:

and today they are still on the cutting edge of the statistical analysis of networks. The conference presentations have not been posted yet, but judging from the conference workshops program there was plenty of R action in Brighton.

Social network analysis at this level involves some serious statistics and mastering a very specialized vocabulary. However, it seems to me that some knowledge of this field will become important to everyone working in data science. Supervised learning models and statistical models that assume independence among the predictors will most likely represent only the first steps that data scientists will take in exploring the complexity of large data sets.

And, maybe of equal importance is the fact that working with network data is great fun. Moreover, software tools exist in R and other languages that make it relatively easy to get started with just a few pointers.

From a statistical inference point of view, what you need to know is that Exponential Random Graph Models (ERGMs) are at the heart of modern social network analysis. An ERGM is a statistical model that enables one to predict the probability of observing a given network from a specified class of networks, based on both observed structural properties of the network and covariates associated with the vertices of the network. The exponential part of the name comes from the exponential family of functions used to specify the form of these models. ERGMs are analogous to generalized linear models except that ERGMs take into account the dependency structure of ties (edges) between vertices. For a rigorous definition of ERGMs see sections 3 and 4 of the paper by Hunter et al. in the 2008 special issue of the JSS, or Chapter 6 in Kolaczyk and Csárdi's book *Statistical Analysis of Network Data with R*. (I have found this book to be very helpful and highly recommend it. Not only does it provide an accessible introduction to ERGMs, it also begins with basic network statistics and the igraph package and then goes on to introduce some more advanced topics such as modeling processes that take place on graphs and network flows.)
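For orientation, the general form in which an ERGM is usually written is shown below. The notation is an assumption chosen for this sketch (the references cited above give the authoritative statement): $Y$ is the random network, $y$ an observed network, $\mathcal{Y}$ the class of networks considered, $g(y)$ a vector of network statistics (structural counts such as edges or triangles, possibly combined with vertex covariates), and $\theta$ the vector of coefficients.

```latex
P_{\theta}(Y = y) \;=\; \frac{\exp\{\theta^{\top} g(y)\}}{\kappa(\theta)},
\qquad
\kappa(\theta) \;=\; \sum_{y' \in \mathcal{Y}} \exp\{\theta^{\top} g(y')\}
```

The normalizing constant $\kappa(\theta)$ sums over every network in $\mathcal{Y}$, which is why these models are generally fitted by simulation-based methods rather than by evaluating the likelihood directly.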

In the R world, the place to go to work with ERGMs is statnet.org. statnet is a suite of 15 or so CRAN packages that provide a complete infrastructure for working with ERGMs. statnet.org is a real gem of a site that contains documentation for all of the statnet packages along with tutorials, presentations from past Sunbelt conferences and more.

I am particularly impressed with the Shiny-based GUI for learning how to fit ERGMs. Try it out on the Shiny webpage or in the box below. Click the **Get Started** button. Then select "built-in network" and "ecoli 1" under **File type**. After that, click the right arrow in the upper right corner. You should see a plot of the ecoli graph.

--------------------------------------------------------------------------------------------------------------------------

You will be fitting models in no time. And since the commands used to drive the GUI are similar to the parameters you specify for the functions in the ergm package, you will be writing your own R code shortly after that.