I'm speaking at the DataWeek conference in San Francisco today. My talk follows Skylar Lyon from Accenture — I'm really looking forward to hearing how he uses Revolution R Enterprise with Teradata Database to run R in-database with 400 million rows of data. **Update**: Here are Skylar's slides.

The slides for my talk on other companies' applications of R are embedded below.

The R Foundation for Statistical Computing, the Vienna-based non-profit organization that oversees the R Project, has just added several new "ordinary members". (Ordinary members participate in R Foundation meetings and provide guidance to the project.) The new members are: Dirk Eddelbuettel, Torsten Hothorn, Marc Schwartz, Hadley Wickham, and Achim Zeileis, Martin Morgan and Michael Lawrence.

The R Core group, the group of developers with commit access to the R codebase, has also expanded with Martin Morgan and Michael Lawrence joining the team. (Stefano Iacus is stepping down from R Core.)

Each of these new members to R Foundation and R Core have been prolific contributors to the R project over the past several years. It's great to see such a wealth of talent, R expertise and enthusiasm coming to the R project. We welcome them to the project and thank them and all of the R contributors for making R what it is today.

R-devel mailing list: Updates to R Core and R Foundation Membership

Graduate student Clay McLeod decided to find out what makes a post on the social-sharing site Reddit popular. These are the questions he seeks to answer:

What’s in a post? Reddit pulls in around 115 million unique visitors each month, amassing a staggering 5 billion page views per month. For a long time, I’ve wondered what factors draw people to certain Reddit posts while shunning others - does it have to do with the time of day a post is submitted? Do certain users have a monopoly on the most viewed posts? What about text posts vs. links?

Reddit provides a JSON API to download Reddit data, and Clay created this Python script to download a CSV file with one record per post, with information about its domain, subreddit, upvotes, downvotes, numbr of comments etc. This file can then easily be analyzed with R:

If you've spent any time on Reddit none of the analysis will be very surprising: images generate a lot of votes, NSFW posts are more popular, etc. But I'm interested to see what can do make with data from Reddit.

Clay Mcleod: What's in a Post, Part 1 (via KDNuggets)

Google has just released a new package for R: CausalImpact. Amongst many other things, this package allows Google to resolve the classical conundrum: how can we asses the impact of an intervention (for example, the effect of an advertising campaign on website clicks) when we can't know what would have happened if we *hadn't* run the campaign? For a marketer, the worry is that the spike in clicks was partially or wholly the result of something unrelated (say, a general increase in web traffic) rather than your campaign.

The CausalImpact package uses Bayesian structural time-series models to resolve this question. All you need is a *second* time series to act a a "virtual" control, which is unaffected by your actions but which is still subject to the extraneous effects you're worried about. (For the marketing example, you might choose web clicks from a region where the campaign didn't run.) Then, you can model the extraneous effects and subtract them from your actual results, so see how your things would have played out had the intervention *not* occurred.

In the chart below (from the Google Open Source blog post) you can see the results of the campaign in black, with the campaign launch at the dotted line. The blue line shows the estimated results had the campaign *not* run, clearly showing that it was effective.

Google uses R and the CausalImpact package to measure the return-on-investment on advertising campaigns its customers run:

We've been testing and applying structural time-series models for some time at Google. For example, we've used them to better understand the effectiveness of advertising campaigns and work out their return on investment. We've also applied the models to settings where a randomised experiment was available, to check how similar our effect estimates would have been without an experimental control.

Now that Google has made the CausalImpact package available, you can do the same for any kind of intervention, as long as you have a time series of results before and after the intervation, and a "control" time series where the intervention had no effect. Read more about the package and the methodology behind it at the link below.

Google Open Source Blog: CausalImpact: A new open-source package for estimating causal effects in time series

by Joseph Rickert

The days are getting shorter here in California and the summer R conferences UseR!2014 and JSM are behind us, but there are still some very fine conferences for R users to look forward to before the year ends.

DataWeek starts in San Francisco on September 15th. I will be conducting a bootcamp for new R users, and on Wedneday the 17th Skylar Lyon and David Smith will talk about R in production during an R Case Studies Session.

The same week, halfway around the world, the EARL (Effective Applications of the R Language) Conference starts in London. Ben Goldacre, author of two best sellers Bad Science and Bad Pharma will be the keynote speaker. The technical program will include sessions from such R luminaries as Hadley Wickham, Patrick Burns, Matt Dowle, Andrie de Vries, Romain Francois, Tal Galili and more.

On October 5th through 9th, Predictive Analytics World - Healthcare will be held in Boston. Max Kuhn will be conducting a hands-on workshop on R for predictive modeling in Business and Healthcare.

One week later, October 15th, Strata and Hadoop World will kick off in New York City. The R Studio team will be start the conference with an R Day. Hadley Wickham, WInston Chang, Garrett Grolemund, JJ Allaire and Yihui Xie will all be giving presentations. On the 16th, Amanda Cox, the force behind the Time's superb R based graphics, will be givining a keynote address. Other R releated sessions include a presentation by Sunil Venkayala and Indrajit Roy on HP's Distributed R platform. Revolution Analytics will, once again, be sponsoring this conference. If you go, please drop by the Revolution Analytics' booth.

That same week, PAZUR'14 the Polish Academic R User's Meeting will be taking place. Workshops will be held on data visualization, the analysis of surveys, the exploration of geospatial data and much more. Revolution Analytics is pleased to be a sponsor here too.

On October 25th, the ACM will once again hold its very popular Data Science Bootcamp at eBay in San Jose. And, once again Revolution Analytics is proud to be a sponsor. I will be attending this event, so please look me up if you want to chat about R.

There is sure to be some R content at the ICSM, International Conference on Statistics and Mathematics and the Workshop on Bayesian Modeling to be held on November 24th through 26th in Surabaya, Indonesia.

Also, sometime in November, rOpenSci will be "bringing together a mix of developers and academics to hack on/create tools in the open data space using and for R". A date hasen't been set yet but we will let you know when things are finalized. Here is the link to the Spring hackathon that took place in San Francsco which was pretty impressive.

If I have missed anything, please let us know!

DataScience.LA has posted a great recap of the latest LA R meetup, which in turn was a recap of presentations from the useR! 2014 conference. Follow that link to review slides from the event, whith summaries of useR! 2014 related to R and Python; Finance; dplyr; R books; SalesForce and R AnalyticFlow.

DataScience.LA has also posted more videos from useR! 2014, including Joe Cheng's demonstration of Shiny, Matthew Dowle's presentation on data.table and interviews with Heather Turner and Yihui Xie. Go check 'em out!

by Seth Mottaghinejad

Let's review the Collatz conjecture, which says that given a positive integer n, the following recursive algorithm will always terminate:

- if n is 1, stop, otherwise recurse on the following
- if n is even, then divide it by 2
- if n is odd, then multiply it by 3 and add 1

In our last post, we created a function called 'cpp_collatz', which given an integer vector, returns an integer vector of their corresponding stopping times (number of iterations for the above algorithm to reach 1). For example, when n = 5 we have

5 -> 3*5+1 = 16 -> 16/2 = 8 -> 8/2 = 4 -> 4/2 = 2 -> 2/2 = 1,

giving us a stopping time of 5 iterations.

In today's article, we want to perform some exploratory data analysis to see if we can find any pattern relating an integer to its stopping time. As part of the analysis, we will extract some features out of the integers that could help us explain any differences in stopping times.

Here are some examples of potentially useful features:

- Is the integer a prime number?
- What are its proper divisors?
- What is the remainder upon dividing the integer by some other number?
- What are the sum of its digits?
- Is the integer a triangular number?
- Is the integer a square number?
- Is the integer a pentagonal number?

In case you are encountering these terms for the first time, a triangular number is any number m that can be written as m = n(n+1)/2, where n is some positive integer. To determine if a number is triangular, we can rewrite the above equation as n^2 + n - 2m = 0, and use the quadriatic formula to get n = (-1 + sqrt(1 + 8m))/2 and (-1 - sqrt(1 + 8m))/2. Since n must be a positive integer, we ignore the latter solution, leaving us with (-1 + sqrt(1 + 8m))/2.

Thus, if plugging m in the above formula results in an integer, we can say that m is a triangular number. Similar rules exist to determine if an integer is square or pentagonal, but I will refer you to Wikipedia for the details.

For the purpose of conducting our analysis, we created some other functions in C++ and R to help us. Let's take a look at these functions:

cat(paste(readLines(file.path(directory, "collatz.cpp")), collapse = "\n")) #include <Rcpp.h> #include <vector> using namespace Rcpp; // [[Rcpp::export]] int collatz(int nn) { int ii = 0; while (nn != 1) { if (nn % 2 == 0) nn /= 2; else nn = 3 * nn + 1; ii += 1; } return ii; } // [[Rcpp::export]] IntegerVector cpp_collatz(IntegerVector ints) { return sapply(ints, collatz); } // [[Rcpp::export]] bool is_int_prime(int nn) { if (nn < 1) stop("int must be greater than 1."); else if (nn == 1) return FALSE; else if (nn % 2 == 0) return (nn == 2); for (int ii=3; ii*ii<=nn; ii+=2) { if (nn % ii == 0) return false; } return true; } // [[Rcpp::export]] LogicalVector is_prime(IntegerVector ints) { return sapply(ints, is_int_prime); } // [[Rcpp::export]] NumericVector gen_primes(int n) { if (n < 1) stop("n must be greater than 0."); std::vector<int> primes; primes.push_back(2); int i = 3; while (primes.size() < unsigned(n)) { if (is_int_prime(i)) primes.push_back(i); i += 2; } return Rcpp::wrap(primes); } // [[Rcpp::export]] NumericVector gen_divisors(int n) { if (n < 1) stop("n must be greater than 0."); std::vector<int> divisors; divisors.push_back(1); for(int i = 2; i <= sqrt(n); i++) { if(n%i == 0) { divisors.push_back(i); divisors.push_back(n/i); } } sort(divisors.begin(), divisors.end()); divisors.erase(unique(divisors.begin(), divisors.end()), divisors.end()); return Rcpp::wrap(divisors); } // [[Rcpp::export]] bool is_int_perfect(int nn) { if (nn < 1) stop("int must be greater than 0."); return nn == sum(gen_divisors(nn)); } // [[Rcpp::export]] LogicalVector is_perfect(IntegerVector ints) { return sapply(ints, is_int_perfect); }

Here is a list of other helper functions in collatz.cpp:

- 'is_prime': given an integer vector, returns a logical vector
- 'gen_primes': given some integer input, n, generates the first n prime numbers
- 'gen_divisors': given an integer n, returns an integer vector of all its proper divisors
- 'is_perfect': given an integer vector, returns a logical vector

sum_digits <- function(x) { # returns the sum of the individual digits that make up x # x must be an integer vector f <- function(xx) sum(as.integer(strsplit(as.character(xx), "")[[1]])) sapply(x, f) }

As you can guess, many of the above functions (such as 'is_prime' and 'gen_divisors') rely on loops, which makes C++ the ideal place to perform the computation. So we farmed out as much of the heavy-duty computations to C++ leaving R with the task of processing and analyzing the resulting data.

Let's get started. We will perform the analysis on all integers below 10^5, since R is memory-bound and we can run into a bottleneck quickly. But next time, we will show you how to overcome this limitation using the 'RevoScaleR' package, which will allow us to scale the analysis to much larger integers.

One small caveat before we start: I enjoy dabbling in mathematics but I know very little about number theory. The analysis we are about to perform is not meant to be rigorous. Instead, we will attempt to approach the problem using EDA the same way that we approach any data-driven problem.

maxint <- 10^5 df <- data.frame(int = 1:maxint) # the original number df <- transform(df, st = cpp_collatz(int)) # stopping times df <- transform(df, sum_digits = sum_digits(int), inv_triangular = (sqrt(8*int + 1) - 1)/2, # inverse triangular number inv_square = sqrt(int), # inverse square number inv_pentagonal = (sqrt(24*int + 1) + 1)/6 # inverse pentagonal number )

To determine if a numeric number is an integer or not, we need to be careful about not using the '==' operator in R, as it is not guaranteed to work, because of minute rounding errors that may occur. Here's an example:

.3 == .4 - .1 # we expect to get TRUE but get FALSE instead

[1] FALSE

The solution is to check whether the absolute difference between the above numbers is smaller than some tolerance threshold.

eps <- 1e-9

abs(.3 - (.4 - .1)) < eps # returns TRUE

[1] TRUE

df <- transform(df, is_triangular = abs(inv_triangular - as.integer(inv_triangular)) < eps, is_square = abs(inv_square - as.integer(inv_square)) < eps, is_pentagonal = abs(inv_pentagonal - as.integer(inv_pentagonal)) < eps, is_prime = is_prime(int), is_perfect = is_perfect(int) ) df <- df[ , names(df)[-grep("^inv_", names(df))]]

Finally, we will create a variable listing all of the integer's proper prime divisors. Every composite integer can be recunstruced out of these basic building blocks, a mathematical result known as the *'unique factorization theorem'*. We can use the function 'gen_divisors' to get a vector of an integers proper divisors, and the 'is_prime' function to only keep the ones that are prime. Finally, because the return object must be a singleton, we can use 'paste' with the 'collapse' argument to join all of the prime divisors into a single comma-separated string.

Lastly, on its own, we may not find the variable 'all_prime_divs' especially helpful. Instead, we cangenerate multiple flag variables out of it indicating whether or not a specific prime number is a divisor for the integer. We will generate 25 flag variables, one for each of the first 25 prime numbers.

There are many more features that we can extract from the underlying integers, but we will stop here. As we mentioned earlier, our goal is not to provide a rigorous mathematical work, but show you how the tools of data analysis can be brought to bear to tackle a problem of such nature.

Here's a sample of 10 rows from the data:

df[sample.int(nrow(df), 10), ] int st sum_digits is_triangular is_square is_pentagonal is_prime 21721 21721 162 13 FALSE FALSE FALSE FALSE 36084 36084 142 21 FALSE FALSE FALSE FALSE 40793 40793 119 23 FALSE FALSE FALSE FALSE 3374 3374 43 17 FALSE FALSE FALSE FALSE 48257 48257 44 26 FALSE FALSE FALSE FALSE 42906 42906 49 21 FALSE FALSE FALSE FALSE 37283 37283 62 23 FALSE FALSE FALSE FALSE 55156 55156 60 22 FALSE FALSE FALSE FALSE 6169 6169 111 22 FALSE FALSE FALSE FALSE 77694 77694 231 33 FALSE FALSE FALSE FALSE is_perfect all_prime_divs is_div_by_2 is_div_by_3 is_div_by_5 is_div_by_7 21721 FALSE 7,29,107 FALSE FALSE FALSE TRUE 36084 FALSE 2,3,31,97 TRUE TRUE FALSE FALSE 40793 FALSE 19,113 FALSE FALSE FALSE FALSE 3374 FALSE 2,7,241 TRUE FALSE FALSE TRUE 48257 FALSE 11,41,107 FALSE FALSE FALSE FALSE 42906 FALSE 2,3,7151 TRUE TRUE FALSE FALSE 37283 FALSE 23,1621 FALSE FALSE FALSE FALSE 55156 FALSE 2,13789 TRUE FALSE FALSE FALSE 6169 FALSE 31,199 FALSE FALSE FALSE FALSE 77694 FALSE 2,3,23,563 TRUE TRUE FALSE FALSE is_div_by_11 is_div_by_13 is_div_by_17 is_div_by_19 is_div_by_23 21721 FALSE FALSE FALSE FALSE FALSE 36084 FALSE FALSE FALSE FALSE FALSE 40793 FALSE FALSE FALSE TRUE FALSE 3374 FALSE FALSE FALSE FALSE FALSE 48257 TRUE FALSE FALSE FALSE FALSE 42906 FALSE FALSE FALSE FALSE FALSE 37283 FALSE FALSE FALSE FALSE TRUE 55156 FALSE FALSE FALSE FALSE FALSE 6169 FALSE FALSE FALSE FALSE FALSE 77694 FALSE FALSE FALSE FALSE TRUE is_div_by_29 is_div_by_31 is_div_by_37 is_div_by_41 is_div_by_43 21721 TRUE FALSE FALSE FALSE FALSE 36084 FALSE TRUE FALSE FALSE FALSE 40793 FALSE FALSE FALSE FALSE FALSE 3374 FALSE FALSE FALSE FALSE FALSE 48257 FALSE FALSE FALSE TRUE FALSE 42906 FALSE FALSE FALSE FALSE FALSE 37283 FALSE FALSE FALSE FALSE FALSE 55156 FALSE FALSE FALSE FALSE FALSE 6169 FALSE TRUE FALSE FALSE FALSE 77694 FALSE FALSE FALSE FALSE FALSE is_div_by_47 is_div_by_53 is_div_by_59 is_div_by_61 is_div_by_67 21721 FALSE FALSE FALSE FALSE FALSE 36084 FALSE FALSE FALSE FALSE FALSE 40793 FALSE FALSE FALSE FALSE FALSE 3374 FALSE FALSE FALSE FALSE FALSE 48257 FALSE FALSE FALSE FALSE FALSE 42906 FALSE FALSE FALSE FALSE FALSE 37283 FALSE FALSE FALSE FALSE FALSE 55156 FALSE FALSE FALSE FALSE FALSE 6169 FALSE FALSE FALSE FALSE FALSE 77694 FALSE FALSE FALSE FALSE FALSE is_div_by_71 is_div_by_73 is_div_by_79 is_div_by_83 is_div_by_89 21721 FALSE FALSE FALSE FALSE FALSE 36084 FALSE FALSE FALSE FALSE FALSE 40793 FALSE FALSE FALSE FALSE FALSE 3374 FALSE FALSE FALSE FALSE FALSE 48257 FALSE FALSE FALSE FALSE FALSE 42906 FALSE FALSE FALSE FALSE FALSE 37283 FALSE FALSE FALSE FALSE FALSE 55156 FALSE FALSE FALSE FALSE FALSE 6169 FALSE FALSE FALSE FALSE FALSE 77694 FALSE FALSE FALSE FALSE FALSE is_div_by_97 21721 FALSE 36084 TRUE 40793 FALSE 3374 FALSE 48257 FALSE 42906 FALSE 37283 FALSE 55156 FALSE 6169 FALSE 77694 FALSE

We can now move on to looking at various statistical summaries to see if we notice any differences between the stopping times (our response variable) when we break up the data in different ways. We will look at the counts, mean, median, standard deviation, and trimmed mean (after throwing out the highest 10 percent) of the stopping times, as well as the correlation between stopping times and the integers. This is by no means a comprehensive list, but it can serve as a guidance for deciding which direction to go to next.

my_summary <- function(df) { primes <- gen_primes(9) res <- with(df, data.frame( count = length(st), mean_st = mean(st), median_st = median(st), tmean_st = mean(st[st < quantile(st, .9)]), sd_st = sd(st), cor_st_int = cor(st, int, method = "spearman") ) ) }

To create above summaries broken up by the flag variables in the data, we will use the 'ddply' function in the 'plyr' package. For example, the following will give us the summaries asked for in 'my_summary', grouped by 'is_prime'.

To avoid having to manually type every formula, we can pull the flag variables from the data set, generate the strings that will make up the formula, wrap it inside 'as.formula' and pass it to 'ddply'.

flags <- names(df)[grep("^is_", names(df))] res <- lapply(flags, function(nm) ddply(df, as.formula(sprintf('~ %s', nm)), my_summary)) names(res) <- flags res $is_triangular is_triangular count mean_st median_st tmean_st sd_st cor_st_int 1 FALSE 99554 107.58511 99 96.23178 51.36153 0.1700491 2 TRUE 446 97.11211 94 85.97243 51.34051 0.3035063 $is_square is_square count mean_st median_st tmean_st sd_st cor_st_int 1 FALSE 99684 107.59743 99 96.25033 51.34206 0.1696582 2 TRUE 316 88.91772 71 77.29577 55.43725 0.4274504 $is_pentagonal is_pentagonal count mean_st median_st tmean_st sd_st cor_st_int 1 FALSE 99742 107.56948 99.0 96.22466 51.35580 0.1702560 2 TRUE 258 95.52326 83.5 83.86638 53.91514 0.3336478 $is_prime is_prime count mean_st median_st tmean_st sd_st cor_st_int 1 FALSE 90408 107.1227 99 95.95392 51.29145 0.1693013 2 TRUE 9592 111.4569 103 100.39643 51.90186 0.1953000 $is_perfect is_perfect count mean_st median_st tmean_st sd_st cor_st_int 1 FALSE 99995 107.5419 99 96.20092 51.36403 0.1707839 2 TRUE 5 37.6000 18 19.50000 45.06440 0.9000000 $is_div_by_2 is_div_by_2 count mean_st median_st tmean_st sd_st cor_st_int 1 FALSE 50000 113.5745 106 102.58324 52.25180 0.1720108 2 TRUE 50000 101.5023 94 90.68234 49.73778 0.1737786 $is_div_by_3 is_div_by_3 count mean_st median_st tmean_st sd_st cor_st_int 1 FALSE 66667 107.4415 99 96.15832 51.30838 0.1705745 2 TRUE 33333 107.7322 99 96.94426 51.48104 0.1714432 $is_div_by_5 is_div_by_5 count mean_st median_st tmean_st sd_st cor_st_int 1 FALSE 80000 107.5676 99 96.21239 51.36466 0.1713552 2 TRUE 20000 107.4214 99 96.13873 51.37206 0.1688929 . . . $is_div_by_89 is_div_by_89 count mean_st median_st tmean_st sd_st cor_st_int 1 FALSE 98877 107.5381 99 96.19965 51.37062 0.1707754 2 TRUE 1123 107.5628 97 96.02096 50.97273 0.1803568 $is_div_by_97 is_div_by_97 count mean_st median_st tmean_st sd_st cor_st_int 1 FALSE 98970 107.4944 99 96.15365 51.36880 0.17187453 2 TRUE 1030 111.7660 106 101.35853 50.93593 0.07843996

As we can see from the above results, most comparisons appear to be non-significant (although, in mathematics, even tiny differences can be meaningful, therefore we will avoid relying on statistical significance here). Here's a summary of trends that stand out as we go over the results:

- On average, stopping times are slightly higher for prime numbers compared to composite numbers.
- On average, stopping times are slightly lower for triangular numbers, square numbers, and pentagonal numbers compared to their corresponding counterparts.
- Despite having lower average stopping times, triangular numbers, square numbers and pentagonal numbers are more strongly correlated with their stopping times than their corresponding counterparts. A larger sample size may help in this case.
- On average, odd numbers have a higher stopping time than even numbers.

This last item could just be restatement of 1. however. Let's take a closer look:

ddply(df, ~ is_prime + is_div_by_2, my_summary) is_prime is_div_by_2 count mean_st median_st tmean_st sd_st cor_st_int 1 FALSE FALSE 40409 114.0744 106 102.91178 52.32495 0.1658988 2 FALSE TRUE 49999 101.5043 94 90.68434 49.73624 0.1737291 3 TRUE FALSE 9591 111.4685 103 100.40796 51.89231 0.1950482 4 TRUE TRUE 1 1.0000 1 NaN NA NA

When limiting the analysis to odd numbers only, prime numbers now have a lower average stopping time compared to composite numbers, which reverses the trend seen in 1.

Abd ibe final point: Since having a prime number as a proper divisor is a proxy for being a composite number, it is difficult to read too much into whether divisibility by a specific prime number affects the average stopping time. But no specific prime number stands out in particular. Once again, a larger sample size would give us more confidence about the results.

The above analysis is perhaps too crude and primitive to offer any significant lead, so here are some possible improvements:

- We could think of other features that we could add to the data.
- We could think of other statistical summaries to include in the analysis.
- We could try to scale up by looking at all integers form 1 to 1 billion instead of 1 to 1 million.

In the next post, we will see how we can use the 'RevoScaleR' package to go from an R 'data.frame' (memory-bound) to an external 'data.frame' or XDF file for short. In doing so, we will achieve the following improvements:

- as the data will no longer be bound by available memory, we will scale with the size of the data,
- since the XDF format is also a distributed data format, we will use the multiple cores available on a single machine to distribute the computation itself by having each core process a separate chunk of the data. On a cluster of machines, the same analysis could then be distributed over the different nodes of the cluster.

In case you missed them, here are some articles from August of particular interest to R users:

R is the most popular software in the KDNuggets poll for the 4th year running.

The frequency of R user group meetings continues to rise, and there are now 147 R user groups worldwide.

A video interview with David Smith, Chief Community Officer at Revolution Analytics, at the useR! 2014 conference.

In a provocative op-ed, Norm Matloff worries that Statistics is losing ground to Computer Science.

A new certification program for Revolution R Enterprise.

An interactive map of R user groups around the world, created with R and Shiny.

Using R to generate calendar entries (and create photo opportunities).

Integrating R with production systems with Domino.

The New York Times compares data science to janitorial work.

Rdocumentation.org provides search for CRAN, GitHub and BioConductor packages and publishes a top-10 list of packages by downloads.

An update to the "airlines" data set (the "iris" of Big Data) with flights through the end of 2012.

A consultant compares the statistical capabilities of R, Matlab, SAS, Stata and SPSS.

Using heatmaps to explore correlations in financial portfolios.

Video of John Chambers' keynote at the useR! 2014 conference on the interfaces, efficiency, big data and the history of R.

CIO magazine says the open source R language is becoming pervasive.

Reviews of some presentations at the JSM 2014 conference that used R.

GRAN is a new R package to manage package repositories to support reproducibility.

The ASA launches a PR campaign to promote the role of statisticians in society.

Video replay of the webinar Applications in R, featuring examples from several companies using R.

General interest stories (not related to R) in the past month included: dance moves from Japan, an earthquake's signal in personal sensors, a 3-minute movie in less than 4k, smooth time-lapse videos, representing mazes as trees and the view from inside a fireworks display.

As always, thanks for the comments and please send any suggestions to me at david@revolutionanalytics.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.

by Joseph Rickert

The ancient Egyptians were a people with long memories. The lists of their pharaohs went back thousands of years, and we still have the names and tax assessments for certain persons and institutions from the time of Ramesses II. When Herodotus began writing about Egypt and the Nile (~ 450 BC), the Egyptians who knew that their prosperity depended on the river’s annual overflow, had been keeping records of the Nile’s high water mark for more than three millennia. So, it seems reasonable, and maybe even appropriate, that one of the first attempts to understand long memory in time series was motivated by the Nile.

The story of British hydrologist and civil servant H.E. Hurst who earned the nickname “Abu Nil”, Father of the Nile, for his 62 year career of measuring and studying the river is now fairly well known. Pondering an 847 year record of Nile overflow data, Hurst noticed that the series was persistent in the sense that heavy flood years tended to be followed by heavier than average flood years while below average flood years were typically followed by light flood years. Working from an ancient formula on optimal dam design he devised the equation: log(R/S) = K*log(N/2) where R is the range of the time series, S is the standard deviation of year-to-year flood measurements and N is the number of years in the series. Hurst noticed that the value of K for the Nile series and other series related to climate phenomena tended to be about 0.73, consistently much higher than the 0.5 value that one would expect from independent observations and short autocorrelations.

Today, mostly due to the work of Benoit Mandelbrot who rediscovered and popularized Hurst work in the early 1960s, Hurst’s Rescale/Range Analysis, and the calculation of the Hurst exponent (Mandlebrot renamed “K” to “H”) is the demarcation point for the modern study of Long Memory Time Series. To investigate let’s look at some monthly flow data taken at the Dongola measurement station that is just upstream from the high dam at Aswan. (Look here for the data, and here for map and nice Python-based analysis that covers some of the same ground as is presented below.) The data consists of average monthly flow measurements from January 1869 to December 1984.

head(nile_dat) Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec 1 1869 NA NA NA NA NA NA 2606 8885 10918 8699 NA NA 2 1870 NA NA 1146 682 545 540 3027 10304 10802 8288 5709 3576 3 1871 2606 1992 1485 933 731 760 2475 8960 9953 6571 3522 2419 4 1872 1672 1033 728 605 560 879 3121 8811 10532 7952 4976 3102 5 1873 2187 1851 1235 756 556 1392 2296 7093 8410 5675 3070 2049 6 1874 1340 847 664 516 466 964 3061 10790 11805 8064 4282 2904

To get a feel for the data we plot a portion of the time series.

The pattern is very regular and the short term correlations are apparent. The following boxplots show the variation in monthly flow.

Herodotus clearly knew what he was talking about when he wrote (*The Histories*: Book 2, 19):

I was particularly eager to find out from them (the Egyptian priests) why the Nile starts coming down in a flood at the summer solstice and continues flooding for a hundred days, but when the hundred days are over the water starts to recede and decreases in volume, with the result that it remains low for the whole winter, until the summer solstice comes round again.

To construct a long memory time series we aggregate the monthly flows to produce a yearly time series of total flow (droppng the years 1869 and 1870 because of the missing values).

Plotting the ACF, we see that the autocorelations persist for nearly 20 years!!

So, let's compute the Hurst exponent. For our first try, we use a simple function suggested by an example in Bernard Pfaff's classic text: *Analysis of Integrated and Cointegrated Time Series* with R.

Bingo! 0.73 - just what we were expecting for a long memory time series. Unfortunately, things are not so simple. The function hurst() from the pracma package which is a much more careful calculation than simpleHurst() yields:

hurst(nile.yr.ts) [1] 1.041862

This is midly distressing since H is supposed to be bounded above by 1. The function hurstexp() from the same package which is based on Weron's MatLab code and implements the small sample correction seems to solve that problem.

> hurstexp(nile.yr.ts) Corrected R over S Hurst exponent: 1.041862 Theoretical Hurst exponent: 0.5244148 Corrected empirical Hurst exponent: 0.7136607 Empirical Hurst exponent: 0.6975531

0.71 is more reasonable result. However, as a post on the Timely Portfolio blog pointed out a few years ago, computing the Hurst exponent is an estimation problem not merely a calculation. So, where are the error bars?

I am afraid that confidence intervals and a look at several other methods available in R for estimating the Hurst exponent will have to wait for another time. In the meantime, the following references may be of interest. The first two are light reading from early champions of applying Rescale/Range analysis and the Hurst exponent to Financial time series. The book by Mandelbrot and Hudson is especially interesting for its sketch of the historical background. The last two are relatively early papers on the subject.

- Edgar E. Peters - Fractal Market Analysis: Applying Chaos Theory to Investments and Economics
- Benoit B. Mandelbrot & Richard L Hudson - The (Mis)behavior of Markets
- Davies & Hart - Tests for the Hurst Effect
- Rafeal Weron - Estimating long range dependence: finite sample properties and confidence intervals

My code for this post may be downloaded here: Long_memory_Post.

As more companies explore the benefits that Hadoop may provide, the opportunities to better understand the technology are myriad and unequal. As a provider of in-Hadoop analytics, Revolution Analytics is participating in the coming Hortonworks seminar series. We will be on site to discuss how to deploy R-based analytics within Hadoop clusters using Revolution R Enterprise. The seminar series itself will host a number of different speakers to provide the attendees with the chance to:

- Gain insight into the business value derived from Hadoop
- Understand the technical role of Hadoop within your data center
- Look at the future of Hadoop as presented by the builders and architects of the platform

The cost to attend the seminar is $99.00 and includes breakfast, lunch and a cocktail hour; starting at 8:00 AM and ending at 4:00 PM. We hope to see you at one of the cities that we are sponsoring: