RedMonk has once again updated (a little later than usual) its bi-annual programming language report with the January 2017 rankings. If you haven't come across these rankings before, they are based on GitHub contributions and StackOverflow questions related to around 40 commonly-used programming languages. The raw data (as of January 2017) is shown below; as you might guess from the appearance of the chart, the analysis for the rankings is done in R.

Languages used by data scientists rank highly in this metric. Python is ranked #3 (up from #4 in the June 2016 rankings). R is ranked #14, down from its all-time high rank of #12 in June 2016. RedMonk's Stephen O'Grady attributes this dip to a change in the process of collecting the GitHub data and notes that R "remains popular in spite of a step back".

Other notable moves in the rankings are Swift, jumping from #26 to #16; Go (moving to #11 from #15); and PowerShell (jumping to #19 from #36). For the complete rankings and associated analysis, check out the RedMonk post linked below.

RedMonk (Tecosystems): The RedMonk Programming Language Rankings: January 2017

When we last looked at job trends from indeed.com, job listings for "R statistics" were on the rise but were still around half the volume of listings for "SAS statistics". Three-and-a-half years later, R has overtaken SAS in job listings for "statistics".

I added Python to the search this time; job listings for "Python statistics" have risen at a similar rate to those for R, but with a somewhat higher volume for R.

Since data science is a popular job role these days, let's do the same search for "data scientist":

For "data scientist" jobs, R and Python track very closely, with Python just edging out R in the past few months. This is most likely because R and Python (unlike SAS) appear together in many data scientist job listings.

You can explore other job titles at indeed.com. (And thanks to reader SK for the suggestion to revisit these searches!)

If you're just getting started with data science, the Sharp Sight Labs blog argues that R is the best data science language to learn today.

The blog post gives several detailed reasons, but the main arguments are:

- R is an extremely popular (arguably *the* most popular) data programming language, and ranks highly in several popularity surveys.
- Learning R is a great way of learning data science, with many R-based books and resources for probability, frequentist and Bayesian statistics, data visualization, machine learning and more.
- Python is another excellent language for data science, but with R it's easier to learn the foundations.

Once you've learned the basics, Sharp Sight also argues that R is a great data science language to master, even though it's an old language compared to some of the newer alternatives. Every tool has a shelf life, but R isn't going anywhere and learning R gives you a foundation beyond the language itself.

If you want to get started with R, Sharp Sight Labs offers a data science crash course. You might also want to check out the Introduction to R for Data Science course on edX.

Sharp Sight Labs: Why R is the best data science language to learn today, and Why you should master R (even if it might eventually become obsolete)

O'Reilly has released the results of the 2016 Data Science Salary Survey. This survey is based on data from over 900 respondents to a 64-question survey about data-related tasks, tools, and the salary they receive from doing/using them. The median salary reported in the survey was US$87,000; amongst data scientists in the US, the median salary was US$106,000.

Appropriately for a survey about data science, O'Reilly doesn't merely report aggregate statistics from the survey; they fit a linear regression model to the data, and extract coefficients from the model indicative of salary "bumps" (or downgrades) attributable to demographic factors. (The model apparently includes no interaction terms.) Factors that tended to increase salary included: working in cloud computing environments; working with Python; and being older. Factors that tended to decrease salary included: working in the Education industry, working with Excel, and being female. (Since this is a regression model, that means female data scientists earned $7,800 per year less than their male counterparts for doing the same work.)
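To make the idea of regression coefficients as salary "bumps" concrete, here is a minimal sketch in Python (standard library only). It is not O'Reilly's model or data: the two indicator variables, the salaries, and the helper names `solve_linear_system` and `fit_ols` are all made up for illustration. The model is purely additive (no interaction terms), so each coefficient reads as the change in salary associated with one factor while the other is held fixed.

```python
# Synthetic illustration only: these numbers are invented, not survey data.

def solve_linear_system(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    # Augmented matrix, so row operations apply to both sides at once.
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [x - factor * y for x, y in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_ols(rows, y):
    """Least-squares coefficients from the normal equations (X'X) b = X'y."""
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


# Columns: intercept, uses_python (0/1), uses_excel (0/1).
X = [
    [1, 0, 0],
    [1, 1, 0],
    [1, 0, 1],
    [1, 1, 1],
    [1, 1, 0],
    [1, 0, 1],
]
# Hypothetical salaries generated as 80000 + 5000*python - 3000*excel.
salaries = [80000, 85000, 77000, 82000, 85000, 77000]

intercept, python_bump, excel_bump = fit_ols(X, salaries)
print(round(intercept), round(python_bump), round(excel_bump))
```

Because the synthetic salaries follow the additive rule exactly, the fit recovers the base salary and the two bumps used to generate them. With real survey data the coefficients would be estimates with uncertainty attached.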

The survey also reports on use of tools, and the top 3 in each category were as follows (respondents could select multiple tools in each category):

- Operating Systems: Windows 74%; Linux 49%; Mac OS X 42%
- Databases: MySQL 37%; Microsoft SQL Server 33%; Oracle 23%
- Programming languages: SQL 70%; R 57%; Python 54%

Interestingly, the survey also reported on the tasks that data scientists perform: over 90% reported some kind of coding in their day-to-day work. The tasks reported, in order of frequency of reporting and shown with corresponding salary ranges of the subset, are shown below:

For much more data and analysis from the 2016 survey, follow the link below to download a free copy from O'Reilly.

O'Reilly: 2016 Data Science Salary Survey

IEEE Spectrum has just published its third annual ranking with its 2016 Top Programming Languages, and the R Language is once again near the top of the list, moving up one place to fifth position.

As I said last year (when R moved up to take sixth place), this is an extraordinary result for a domain-specific language. The other four languages in the top 5 (C, Java, Python and C++) are all general-purpose languages, suitable for just about any programming task. R by contrast is a language specifically for data science, and its high ranking here reflects both the critical importance of data science as a discipline today, and of R as the language of choice for data scientists.

IEEE Spectrum ranks languages according to a large number of factors, including search rankings and trends, social media mentions, and job postings. (You can adjust the weighting of these factors to generate your own rankings using this interactive tool.) It also includes scholarly citations of the languages, a factor that influenced R's rise in this ranking:

Another language that has continued to move up the rankings since 2014 is R, now in fifth place. R has been lifted in our rankings by racking up more questions on Stack Overflow, about 46 percent more since 2014. But even more important to R’s rise is that it is increasingly mentioned in scholarly research papers. The Spectrum default ranking is heavily weighted toward data from IEEE Xplore, which indexes millions of scholarly articles, standards, and books in the IEEE database. In our 2015 ranking there were a mere 39 papers talking about the language, whereas this year we logged 244 papers.
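The effect of re-weighting factors on a ranking can be illustrated with a small Python sketch. Everything in it is hypothetical: the placeholder languages, the metric names, the scores and the weights are invented for illustration and are not IEEE Spectrum's data or method. It simply shows how a weighted average of per-factor scores can reorder a list when one factor, such as scholarly citations, is weighted heavily.

```python
# Hypothetical per-factor scores (0-100) for placeholder languages.
scores = {
    "LangA": {"search": 90, "social": 60, "jobs": 80, "citations": 20},
    "LangB": {"search": 60, "social": 40, "jobs": 50, "citations": 90},
    "LangC": {"search": 80, "social": 70, "jobs": 40, "citations": 40},
}


def rank(scores, weights):
    """Order languages by the weighted average of their factor scores."""
    total = sum(weights.values())
    combined = {
        name: sum(weights[factor] * value for factor, value in factors.items()) / total
        for name, factors in scores.items()
    }
    return sorted(combined, key=combined.get, reverse=True)


equal = {"search": 1, "social": 1, "jobs": 1, "citations": 1}
citation_heavy = {"search": 1, "social": 1, "jobs": 1, "citations": 5}

print(rank(scores, equal))
print(rank(scores, citation_heavy))
```

With equal weights the placeholder with the strongest search and jobs scores comes first. Once citations count five times as much, the citation-heavy placeholder moves to the top.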

In related news, R also increased its ranking in the recently-released RedMonk Language Rankings for June 2016, moving up one spot to take 12th place. Unlike IEEE Spectrum, RedMonk ranks languages using just two criteria: activity of the language on GitHub and StackOverflow. Analyst Stephen O'Grady had this to say about R's performance in the RedMonk rankings:

Out of all the back half of the Top 20 languages, R has shown the most consistent upwards movement over time. From its position of 17 back in 2012, it has made steady gains over time, but had seemed to stall at 13 having stuck there for three consecutive quarters. This time around, however, R took over #12 from Perl which in turn dropped to #13. There’s still an enormous amount of Perl in circulation, but the fact that the more specialized R has unseated the language once considered the glue of the web says as much about Perl as it does about R.

Again, R's steady growth in this and numerous other surveys and rankings over time reflects the growing importance of data science applied using R.

The open-source R language is the most frequently used analytics / data science software, selected by 49% of the 2,895 voters of the 2016 KDnuggets Software Poll. (R was also the top selection in last year's poll.) Python was a close second at 45.8%, and SQL was third at 35.5%. (Respondents could select multiple tools in the poll, and 6 tools were selected on average.)

While this is a self-selected poll (not a scientific survey), the results have been remarkably consistent over the 16 years that it has been conducted. This year's top 10 is very similar to that in 2015, the notable changes being Python's leap from #4 to #2 ranking, and scikit-learn (the machine-learning library for Python) joining the top 10 at the expense of SAS (which drops to #18 this year).

2016 also saw increased use of Big Data tools (particularly Hadoop and Spark, the latter seeing a 91% increase in use) and Deep Learning tools. For details on those results and a complete analysis of the poll results, follow the link below.

KDnuggets: R, Python Duel As Top Analytics, Data Science software – KDnuggets 2016 Software Poll Results

by Andrie de Vries

A few weeks ago I wrote about the growth of CRAN packages, where I demonstrated how to scrape CRAN archives to get an estimate of the number of packages over time. In this post I briefly mentioned that the Ecdat package contains a dataset, CRANpackages, with snapshots recorded by John Fox and Spencer Graves.

Here is a plot of the data they collected. The dataset contains data through 2014, so I manually added the package count as of today (8,329).

In my previous post, I asked the question: "are there indications that the contribution rate is steady, accelerating or decelerating?"

This alludes to an analysis by John Fox, in which he says "The number of packages on CRAN ... has grown roughly exponentially, with residuals from the exponential trend ... showing a recent decline in the rate of growth" (Fox, 2009).

In my previous post, "Using segmented regression to analyse world record running times", I used segmented regression to estimate a model that is piece-wise linear.
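The idea of an exponential trend and its residuals can be sketched in a few lines of Python (standard library only). This is not Fox's analysis or the CRAN data: the counts are synthetic, generated from an assumed growth rate, and `fit_line` is a hypothetical helper. Exponential growth becomes a straight line once you take logs, so an ordinary least-squares line through the log counts gives the growth rate, and the residuals show where the data departs from the trend.

```python
import math


def fit_line(x, y):
    """Ordinary least-squares slope and intercept for y = intercept + slope * x."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


# Synthetic package counts: 100 * exp(0.3 * t). Not real CRAN numbers.
years = list(range(10))
counts = [100 * math.exp(0.3 * t) for t in years]

# Fit a straight line to log(count) against time.
log_counts = [math.log(c) for c in counts]
slope, intercept = fit_line(years, log_counts)

# Residuals from the exponential trend, on the log scale.
residuals = [lc - (intercept + slope * t) for t, lc in zip(years, log_counts)]
doubling_time = math.log(2) / slope

print(round(slope, 3), round(doubling_time, 2))
```

On this perfectly exponential series the residuals are essentially zero. A run of negative residuals at the end of a real series would be the kind of "recent decline in the rate of growth" described in the quote.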

I used the same process to fit a segmented regression line through the CRAN package data.

By default the segmented package fits a single break point through the data. The results of this analysis indicate a break point occurring some time during 2008. This is entirely consistent with the observation by John Fox that the rate of growth is slowing down.

However, note that the segmented regression line doesn't fit the data very well during the period 2008 to 2012.

With a small amount of extra work you can fit segmented models with multiple break points. To do this, you simply have to specify initial values for the search. Here I show the results of a simple model with two break points. This model finds the first break point during 2007 and the second break point during 2011.
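For readers who want to see the mechanics of a break point, here is a rough Python sketch (standard library only) of a piece-wise linear fit with a single break. It is not the code used in this post and it does not reproduce the algorithm of the R segmented package: as a simplifying assumption, it tries each candidate break point in turn, fits a continuous "hinge" model by least squares, and keeps the candidate with the smallest squared error. The data and the helper names `solve`, `fit_hinge` and `best_break` are invented for illustration.

```python
def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [list(map(float, row)) + [float(rhs)] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                factor = m[r][col] / m[col][col]
                m[r] = [u - factor * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_hinge(x, y, c):
    """Fit y = b0 + b1*x + b2*max(0, x - c); return coefficients and squared error."""
    rows = [[1.0, xi, max(0.0, xi - c)] for xi in x]
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(3)]
    coef = solve(xtx, xty)
    sse = sum(
        (yi - sum(ci * ri for ci, ri in zip(coef, r))) ** 2
        for r, yi in zip(rows, y)
    )
    return coef, sse


def best_break(x, y, candidates):
    """Grid search: the candidate break point with the smallest squared error."""
    return min(candidates, key=lambda c: fit_hinge(x, y, c)[1])


# Synthetic data: slope 2 up to x = 10, then slope 0.5 afterwards.
x = list(range(20))
y = [1 + 2 * xi - 1.5 * max(0, xi - 10) for xi in x]

break_point = best_break(x, y, x[2:-2])
coef, sse = fit_hinge(x, y, break_point)
print(break_point, [round(c, 3) for c in coef])
```

The third coefficient is the change in slope at the break, so a negative value corresponds to growth slowing down after the break point. Extending the search to two break points means adding a second hinge term and searching over pairs of candidates.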

Natural systems cannot maintain exponential growth forever. There are always some limits on the system that will ultimately inhibit any further growth. This is why many systems display some kind of sigmoid curve, or S curve.

Although the growth curve of CRAN packages shows signs of slowing down, it does not seem as if there is an inflexion point in the data. An inflexion point is where the curve transitions from being convex to being concave.
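One simple way to look for an inflexion point in a series of counts is to check where the second difference changes sign. The Python sketch below (standard library only) illustrates this on two synthetic curves, not on the CRAN data. The function name `first_inflexion` and both series are assumptions made for the example.

```python
import math


def first_inflexion(t, y):
    """Return the first t at which the second difference of y stops being
    positive (convex turning concave), or None if that never happens."""
    second = [y[i + 1] - 2 * y[i] + y[i - 1] for i in range(1, len(y) - 1)]
    # second[j] is the second difference centred on t[j + 1].
    for j in range(1, len(second)):
        if second[j - 1] > 0 and second[j] <= 0:
            return t[j + 1]
    return None


t = list(range(12))

# A synthetic S curve (logistic) whose growth slows after t = 5.5.
s_curve = [1 / (1 + math.exp(-(ti - 5.5))) for ti in t]

# A synthetic exponential curve, which stays convex throughout.
exponential = [math.exp(0.3 * ti) for ti in t]

print(first_inflexion(t, s_curve))      # first concave point of the S curve
print(first_inflexion(t, exponential))  # no inflexion point
```

Real data is noisy, so in practice the series would usually need smoothing before second differences are meaningful.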

Thus it seems the growth of CRAN packages will appear to be exponential for quite some time in the future!

As usual, here is the R code I used.

R version 1.0.0, the first official release of R, came out on February 29, 2000. The anniversary was marked on Twitter by Thomas Lumley, a member of the R Core Group: 20 leading statisticians and computer scientists (and 4 alums) from around the world without whom the R Project would not exist. That makes it 16 years (sixteen!) that the R language has faithfully served statisticians, bioinformaticians, quantitative analysts, data scientists and others solving problems with data.

But in fact, the R project is even more venerable than that. The project itself began in 1993, 7 years before the first "official" R release was made available to the public, as a research project initiated by Ross Ihaka and Robert Gentleman. Here's a brief timeline of the history of R:

- **1993**: Research project in Auckland, NZ
- **1995**: R Released as open-source software
- **1997**: R core group formed
- **2000**: R 1.0.0 released (February 29)
- **2003**: R Foundation founded
- **2004**: First international user conference in Vienna
- **2015**: R Consortium founded

The project is still as strong and as active as ever. (A new update for R, version 3.2.4, is scheduled for March 10.) Likewise, the community around R continues to grow rapidly, as evidenced by user-created contributions to R hosted on the Comprehensive R Archive Network, CRAN. CRAN is a repository where anyone can contribute an extension to R (called a "package"), as long as it meets the quality and licensing requirements set by the CRAN maintainers. On February 29, R's official 16th anniversary, there were exactly 8,000 packages — *eight thousand* — hosted on CRAN. (You can explore those packages at MRAN, Microsoft's historical archive of CRAN.)

That number doesn't even count packages that have been accepted to CRAN but have since been retired (either through obsolescence or lack of updates by the author when they fail to pass checks with new versions of R). Gergely Daróczi analyzed the CRAN logs to show that submissions have increased exponentially over time, and more than 9,000 distinct packages have been accepted to CRAN:

By the way, hosting and maintaining CRAN is a huge effort, run by volunteers from the R Core Group. Every package submitted to CRAN is automatically tested on a wide variety of platforms by the CRAN build system, and volunteers spend a significant amount of time personally interacting with package authors to resolve problems that arise. On top of that, the CRAN system runs checks of R packages with each nightly build of R, a process that takes 90 days of computing time (37 of which are on the Solaris system, which R still supports). Again, problems that arise in this process often result in manual notifications to package authors by the CRAN volunteers. R packages are an incredibly important part of the value of R, and it's thanks to the CRAN system (and its volunteers) that all R users have access to such an amazingly rich source of capabilities for R.

So this week of R's 16th official anniversary marks a great time to thank the R Core Group and the CRAN volunteers for providing their time and expertise to create the most useful ecosystem for data science the world has ever known. Thank you all!

Analyst firm RedMonk has updated its (near-)biannual Programming Language rankings as of January 2016, and the R language ranks at #13, unchanged since the last ranking in June 2015. RedMonk's rankings are based on number of projects in GitHub and number of questions tagged in StackOverflow, and the most recent data is visualized (using R's ggplot2 package) below:

In this latest RedMonk ranking, Python is likewise steady at #4, and Matlab falls one place to #18. See the entire list and a new visualization of ranking changes over time here: RedMonk Programming Language Rankings: January 2016.

Meanwhile, the Tiobe Index for February 2016, which ranks languages according to search engine results, places R at #17 (up 1 place from a year ago). In the same Tiobe rankings, Python places at #5, Matlab at #18, and SAS at #12.

For more R rankings from these and other sources, check out the popularity section of the blog.