by Joseph Rickert

While preparing for the DataWeek R Bootcamp that I conducted this week, I came across the following gem. This code, based directly on a Max Kuhn presentation from a couple of years ago, compares the efficacy of two machine learning models on a training data set.

#-----------------------------------------
# SET UP THE PARAMETER SPACE SEARCH GRID
ctrl <- trainControl(method="repeatedcv",   # use repeated 10-fold cross validation
                     repeats=5,             # do 5 repetitions of 10-fold cv
                     summaryFunction=twoClassSummary, # use AUC to pick the best model
                     classProbs=TRUE)
# Note that the default search grid selects 3 values of each tuning parameter
#
grid <- expand.grid(.interaction.depth = seq(1,7,by=2), # look at tree depths from 1 to 7
                    .n.trees=seq(10,100,by=5),          # let iterations go from 10 to 100
                    .shrinkage=c(0.01,0.1))             # try 2 values of the learning rate parameter
#-----------------------------------------
# BOOSTED TREE MODEL
set.seed(1)
names(trainData)
trainX <- trainData[,4:61]
registerDoParallel(4)   # register a parallel backend for train
getDoParWorkers()
system.time(gbm.tune <- train(x=trainX, y=trainData$Class,
                              method = "gbm",
                              metric = "ROC",
                              trControl = ctrl,
                              tuneGrid = grid,
                              verbose = FALSE))
#---------------------------------
# SUPPORT VECTOR MACHINE MODEL
#
set.seed(1)
registerDoParallel(4, cores=4)
getDoParWorkers()
system.time(
  svm.tune <- train(x=trainX, y=trainData$Class,
                    method = "svmRadial",
                    tuneLength = 9,                # 9 values of the cost parameter
                    preProc = c("center","scale"),
                    metric = "ROC",
                    trControl = ctrl)              # same as for gbm above
)
#-----------------------------------
# COMPARE MODELS USING RESAMPLING
# Having set the seed to 1 before running gbm.tune and svm.tune, we have generated
# paired samples for comparing models using resampling.
#
# The resamples function in caret collates the resampling results from the two models
rValues <- resamples(list(svm=svm.tune, gbm=gbm.tune))
rValues$values
#---------------------------------------------
# BOXPLOTS COMPARING RESULTS
bwplot(rValues, metric="ROC")  # boxplot

After setting up a grid to search the parameter space of a model, the train() function from the caret package is used to train a generalized boosted regression model (gbm) and a support vector machine (svm). Setting the seed produces paired samples and enables the two models to be compared using the resampling technique described in Hothorn et al., *"The design and analysis of benchmark experiments"*, Journal of Computational and Graphical Statistics (2005), vol. 14 (3), pp. 675-699.

The performance metric for the comparison is the area under the ROC curve. From examining the boxplots of the resampling distributions for the two models it is apparent that, in this case, the gbm has the advantage.

Also, notice that the call to registerDoParallel() permits parallel execution of the training algorithms. (The taskbar showed all four cores of my laptop maxed out at 100% utilization.)

I chose this example because I wanted to show programmers coming to R for the first time that the power and productivity of R comes not only from the large number of machine learning models implemented, but also from the tremendous amount of infrastructure that package authors have built up, making it relatively easy to do fairly sophisticated tasks in just a few lines of code.

All of the code for this example, along with the rest of my code from the DataWeek R Bootcamp, is available on GitHub.

by Seth Mottaghinejad

Let's review the Collatz conjecture, which says that given a positive integer n, the following recursive algorithm will always terminate:

- if n is 1, stop; otherwise recurse on the following
- if n is even, then divide it by 2
- if n is odd, then multiply it by 3 and add 1

In our last post, we created a function called 'cpp_collatz', which, given an integer vector, returns an integer vector of the corresponding stopping times (the number of iterations for the above algorithm to reach 1). For example, when n = 5 we have

5 -> 3*5+1 = 16 -> 16/2 = 8 -> 8/2 = 4 -> 4/2 = 2 -> 2/2 = 1,

giving us a stopping time of 5 iterations.

In today's article, we want to perform some exploratory data analysis to see if we can find any pattern relating an integer to its stopping time. As part of the analysis, we will extract some features out of the integers that could help us explain any differences in stopping times.

Here are some examples of potentially useful features:

- Is the integer a prime number?
- What are its proper divisors?
- What is the remainder upon dividing the integer by some other number?
- What is the sum of its digits?
- Is the integer a triangular number?
- Is the integer a square number?
- Is the integer a pentagonal number?

In case you are encountering these terms for the first time, a triangular number is any number m that can be written as m = n(n+1)/2, where n is some positive integer. To determine if a number is triangular, we can rewrite the above equation as n^2 + n - 2m = 0 and use the quadratic formula to get n = (-1 + sqrt(1 + 8m))/2 and n = (-1 - sqrt(1 + 8m))/2. Since n must be a positive integer, we ignore the latter solution, leaving us with n = (-1 + sqrt(1 + 8m))/2.

Thus, if plugging m in the above formula results in an integer, we can say that m is a triangular number. Similar rules exist to determine if an integer is square or pentagonal, but I will refer you to Wikipedia for the details.
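As a concrete illustration, the triangular-number test described above can be written as a small standalone function. This is just a C++ sketch in the spirit of the post's Rcpp helpers; the name is_triangular and the 1e-9 tolerance are our own choices, not taken from the post's code:

```cpp
#include <cmath>

// m is triangular iff n = (-1 + sqrt(1 + 8m))/2 is a positive integer.
// We compare against the nearest integer with a small tolerance rather
// than testing equality directly, to guard against rounding error.
bool is_triangular(long m) {
    double n = (std::sqrt(8.0 * m + 1.0) - 1.0) / 2.0;
    return std::fabs(n - std::llround(n)) < 1e-9;
}
```

For example, is_triangular(10) is true (10 = 4*5/2), while is_triangular(11) is false. The square and pentagonal checks follow the same pattern with their own inverse formulas.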

For the purpose of conducting our analysis, we created some other functions in C++ and R to help us. Let's take a look at these functions:

cat(paste(readLines(file.path(directory, "collatz.cpp")), collapse = "\n"))

#include <Rcpp.h>
#include <vector>
using namespace Rcpp;

// [[Rcpp::export]]
int collatz(int nn) {
    int ii = 0;
    while (nn != 1) {
        if (nn % 2 == 0) nn /= 2;
        else nn = 3 * nn + 1;
        ii += 1;
    }
    return ii;
}

// [[Rcpp::export]]
IntegerVector cpp_collatz(IntegerVector ints) {
    return sapply(ints, collatz);
}

// [[Rcpp::export]]
bool is_int_prime(int nn) {
    if (nn < 1) stop("int must be greater than 0.");
    else if (nn == 1) return false;
    else if (nn % 2 == 0) return (nn == 2);
    for (int ii = 3; ii * ii <= nn; ii += 2) {
        if (nn % ii == 0) return false;
    }
    return true;
}

// [[Rcpp::export]]
LogicalVector is_prime(IntegerVector ints) {
    return sapply(ints, is_int_prime);
}

// [[Rcpp::export]]
NumericVector gen_primes(int n) {
    if (n < 1) stop("n must be greater than 0.");
    std::vector<int> primes;
    primes.push_back(2);
    int i = 3;
    while (primes.size() < unsigned(n)) {
        if (is_int_prime(i)) primes.push_back(i);
        i += 2;
    }
    return Rcpp::wrap(primes);
}

// [[Rcpp::export]]
NumericVector gen_divisors(int n) {
    if (n < 1) stop("n must be greater than 0.");
    std::vector<int> divisors;
    divisors.push_back(1);
    for (int i = 2; i <= sqrt(n); i++) {
        if (n % i == 0) {
            divisors.push_back(i);
            divisors.push_back(n / i);
        }
    }
    sort(divisors.begin(), divisors.end());
    divisors.erase(unique(divisors.begin(), divisors.end()), divisors.end());
    return Rcpp::wrap(divisors);
}

// [[Rcpp::export]]
bool is_int_perfect(int nn) {
    if (nn < 1) stop("int must be greater than 0.");
    return nn == sum(gen_divisors(nn));
}

// [[Rcpp::export]]
LogicalVector is_perfect(IntegerVector ints) {
    return sapply(ints, is_int_perfect);
}

Here is a list of other helper functions in collatz.cpp:

- 'is_prime': given an integer vector, returns a logical vector indicating which elements are prime
- 'gen_primes': given an integer n, generates the first n prime numbers
- 'gen_divisors': given an integer n, returns an integer vector of all its proper divisors
- 'is_perfect': given an integer vector, returns a logical vector indicating which elements are perfect numbers

sum_digits <- function(x) {
  # returns the sum of the individual digits that make up x
  # x must be an integer vector
  f <- function(xx) sum(as.integer(strsplit(as.character(xx), "")[[1]]))
  sapply(x, f)
}

As you can guess, many of the above functions (such as 'is_prime' and 'gen_divisors') rely on loops, which makes C++ the ideal place to perform the computation. So we farmed out as much of the heavy-duty computation as possible to C++, leaving R with the task of processing and analyzing the resulting data.

Let's get started. We will perform the analysis on all integers below 10^5, since R is memory-bound and we can run into a bottleneck quickly. But next time, we will show you how to overcome this limitation using the 'RevoScaleR' package, which will allow us to scale the analysis to much larger integers.

One small caveat before we start: I enjoy dabbling in mathematics but I know very little about number theory. The analysis we are about to perform is not meant to be rigorous. Instead, we will attempt to approach the problem using EDA the same way that we approach any data-driven problem.

maxint <- 10^5
df <- data.frame(int = 1:maxint)           # the original number
df <- transform(df, st = cpp_collatz(int)) # stopping times
df <- transform(df,
                sum_digits = sum_digits(int),
                inv_triangular = (sqrt(8*int + 1) - 1)/2,  # inverse triangular number
                inv_square = sqrt(int),                    # inverse square number
                inv_pentagonal = (sqrt(24*int + 1) + 1)/6  # inverse pentagonal number
)

To determine whether a numeric value is an integer or not, we need to be careful not to use the '==' operator in R, as it is not guaranteed to work because of minute rounding errors that may occur. Here's an example:

.3 == .4 - .1 # we expect to get TRUE but get FALSE instead

[1] FALSE

The solution is to check whether the absolute difference between the above numbers is smaller than some tolerance threshold.

eps <- 1e-9

abs(.3 - (.4 - .1)) < eps # returns TRUE

[1] TRUE

df <- transform(df,
                is_triangular = abs(inv_triangular - as.integer(inv_triangular)) < eps,
                is_square = abs(inv_square - as.integer(inv_square)) < eps,
                is_pentagonal = abs(inv_pentagonal - as.integer(inv_pentagonal)) < eps,
                is_prime = is_prime(int),
                is_perfect = is_perfect(int)
)
df <- df[ , names(df)[-grep("^inv_", names(df))]]

Finally, we will create a variable listing all of the integer's proper prime divisors. Every composite integer can be reconstructed out of these basic building blocks, a mathematical result known as the *unique factorization theorem*. We can use the function 'gen_divisors' to get a vector of an integer's proper divisors, and the 'is_prime' function to keep only the ones that are prime. Finally, because the return object must be a singleton, we can use 'paste' with the 'collapse' argument to join all of the prime divisors into a single comma-separated string.
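The post's actual R code for this step isn't shown, but the logic can be sketched in standalone C++ (prime_divs is a hypothetical helper name; the real analysis composes gen_divisors, is_prime and paste in R):

```cpp
#include <string>
#include <vector>

// Collect the distinct prime divisors of n by trial division,
// then join them into a single comma-separated string.
std::string prime_divs(int n) {
    std::vector<int> primes;
    for (int p = 2; p * p <= n; ++p) {
        if (n % p == 0) {
            primes.push_back(p);
            while (n % p == 0) n /= p;  // strip every power of p
        }
    }
    if (n > 1) primes.push_back(n);     // whatever remains is itself prime
    std::string out;
    for (std::size_t i = 0; i < primes.size(); ++i) {
        if (i > 0) out += ",";
        out += std::to_string(primes[i]);
    }
    return out;
}
```

For instance, prime_divs(21721) yields "7,29,107", matching the all_prime_divs entry for 21721 in the data sample shown later.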

Lastly, on its own, we may not find the variable 'all_prime_divs' especially helpful. Instead, we can generate multiple flag variables out of it, indicating whether or not a specific prime number is a divisor of the integer. We will generate 25 flag variables, one for each of the first 25 prime numbers.

There are many more features that we can extract from the underlying integers, but we will stop here. As we mentioned earlier, our goal is not to provide rigorous mathematical work, but to show you how the tools of data analysis can be brought to bear on a problem of this nature.

Here's a sample of 10 rows from the data:

df[sample.int(nrow(df), 10), ]

        int  st sum_digits is_triangular is_square is_pentagonal is_prime
21721 21721 162         13         FALSE     FALSE         FALSE    FALSE
36084 36084 142         21         FALSE     FALSE         FALSE    FALSE
40793 40793 119         23         FALSE     FALSE         FALSE    FALSE
3374   3374  43         17         FALSE     FALSE         FALSE    FALSE
48257 48257  44         26         FALSE     FALSE         FALSE    FALSE
42906 42906  49         21         FALSE     FALSE         FALSE    FALSE
37283 37283  62         23         FALSE     FALSE         FALSE    FALSE
55156 55156  60         22         FALSE     FALSE         FALSE    FALSE
6169   6169 111         22         FALSE     FALSE         FALSE    FALSE
77694 77694 231         33         FALSE     FALSE         FALSE    FALSE

      is_perfect all_prime_divs is_div_by_2 is_div_by_3 is_div_by_5 is_div_by_7
21721      FALSE       7,29,107       FALSE       FALSE       FALSE        TRUE
36084      FALSE      2,3,31,97        TRUE        TRUE       FALSE       FALSE
40793      FALSE         19,113       FALSE       FALSE       FALSE       FALSE
3374       FALSE        2,7,241        TRUE       FALSE       FALSE        TRUE
48257      FALSE      11,41,107       FALSE       FALSE       FALSE       FALSE
42906      FALSE       2,3,7151        TRUE        TRUE       FALSE       FALSE
37283      FALSE        23,1621       FALSE       FALSE       FALSE       FALSE
55156      FALSE        2,13789        TRUE       FALSE       FALSE       FALSE
6169       FALSE         31,199       FALSE       FALSE       FALSE       FALSE
77694      FALSE     2,3,23,563        TRUE        TRUE       FALSE       FALSE

      is_div_by_11 is_div_by_13 is_div_by_17 is_div_by_19 is_div_by_23
21721        FALSE        FALSE        FALSE        FALSE        FALSE
36084        FALSE        FALSE        FALSE        FALSE        FALSE
40793        FALSE        FALSE        FALSE         TRUE        FALSE
3374         FALSE        FALSE        FALSE        FALSE        FALSE
48257         TRUE        FALSE        FALSE        FALSE        FALSE
42906        FALSE        FALSE        FALSE        FALSE        FALSE
37283        FALSE        FALSE        FALSE        FALSE         TRUE
55156        FALSE        FALSE        FALSE        FALSE        FALSE
6169         FALSE        FALSE        FALSE        FALSE        FALSE
77694        FALSE        FALSE        FALSE        FALSE         TRUE

      is_div_by_29 is_div_by_31 is_div_by_37 is_div_by_41 is_div_by_43
21721         TRUE        FALSE        FALSE        FALSE        FALSE
36084        FALSE         TRUE        FALSE        FALSE        FALSE
40793        FALSE        FALSE        FALSE        FALSE        FALSE
3374         FALSE        FALSE        FALSE        FALSE        FALSE
48257        FALSE        FALSE        FALSE         TRUE        FALSE
42906        FALSE        FALSE        FALSE        FALSE        FALSE
37283        FALSE        FALSE        FALSE        FALSE        FALSE
55156        FALSE        FALSE        FALSE        FALSE        FALSE
6169         FALSE         TRUE        FALSE        FALSE        FALSE
77694        FALSE        FALSE        FALSE        FALSE        FALSE

      is_div_by_47 is_div_by_53 is_div_by_59 is_div_by_61 is_div_by_67
21721        FALSE        FALSE        FALSE        FALSE        FALSE
36084        FALSE        FALSE        FALSE        FALSE        FALSE
40793        FALSE        FALSE        FALSE        FALSE        FALSE
3374         FALSE        FALSE        FALSE        FALSE        FALSE
48257        FALSE        FALSE        FALSE        FALSE        FALSE
42906        FALSE        FALSE        FALSE        FALSE        FALSE
37283        FALSE        FALSE        FALSE        FALSE        FALSE
55156        FALSE        FALSE        FALSE        FALSE        FALSE
6169         FALSE        FALSE        FALSE        FALSE        FALSE
77694        FALSE        FALSE        FALSE        FALSE        FALSE

      is_div_by_71 is_div_by_73 is_div_by_79 is_div_by_83 is_div_by_89
21721        FALSE        FALSE        FALSE        FALSE        FALSE
36084        FALSE        FALSE        FALSE        FALSE        FALSE
40793        FALSE        FALSE        FALSE        FALSE        FALSE
3374         FALSE        FALSE        FALSE        FALSE        FALSE
48257        FALSE        FALSE        FALSE        FALSE        FALSE
42906        FALSE        FALSE        FALSE        FALSE        FALSE
37283        FALSE        FALSE        FALSE        FALSE        FALSE
55156        FALSE        FALSE        FALSE        FALSE        FALSE
6169         FALSE        FALSE        FALSE        FALSE        FALSE
77694        FALSE        FALSE        FALSE        FALSE        FALSE

      is_div_by_97
21721        FALSE
36084         TRUE
40793        FALSE
3374         FALSE
48257        FALSE
42906        FALSE
37283        FALSE
55156        FALSE
6169         FALSE
77694        FALSE

We can now move on to looking at various statistical summaries to see if we notice any differences between the stopping times (our response variable) when we break up the data in different ways. We will look at the counts, mean, median, standard deviation, and trimmed mean (after throwing out the highest 10 percent) of the stopping times, as well as the correlation between stopping times and the integers. This is by no means a comprehensive list, but it can serve as guidance for deciding which direction to go in next.

my_summary <- function(df) {
  res <- with(df, data.frame(
    count = length(st),
    mean_st = mean(st),
    median_st = median(st),
    tmean_st = mean(st[st < quantile(st, .9)]),  # trimmed mean
    sd_st = sd(st),
    cor_st_int = cor(st, int, method = "spearman")
  ))
  res
}

To create the above summaries broken up by the flag variables in the data, we will use the 'ddply' function from the 'plyr' package. For example, ddply(df, ~ is_prime, my_summary) gives us the summaries computed in 'my_summary', grouped by 'is_prime'.

To avoid having to manually type every formula, we can pull the flag variables from the data set, generate the strings that will make up the formula, wrap it inside 'as.formula' and pass it to 'ddply'.

flags <- names(df)[grep("^is_", names(df))]
res <- lapply(flags, function(nm) ddply(df, as.formula(sprintf('~ %s', nm)), my_summary))
names(res) <- flags
res

$is_triangular
  is_triangular count   mean_st median_st  tmean_st    sd_st cor_st_int
1         FALSE 99554 107.58511        99  96.23178 51.36153  0.1700491
2          TRUE   446  97.11211        94  85.97243 51.34051  0.3035063

$is_square
  is_square count   mean_st median_st  tmean_st    sd_st cor_st_int
1     FALSE 99684 107.59743        99  96.25033 51.34206  0.1696582
2      TRUE   316  88.91772        71  77.29577 55.43725  0.4274504

$is_pentagonal
  is_pentagonal count   mean_st median_st  tmean_st    sd_st cor_st_int
1         FALSE 99742 107.56948      99.0  96.22466 51.35580  0.1702560
2          TRUE   258  95.52326      83.5  83.86638 53.91514  0.3336478

$is_prime
  is_prime count  mean_st median_st  tmean_st    sd_st cor_st_int
1    FALSE 90408 107.1227        99  95.95392 51.29145  0.1693013
2     TRUE  9592 111.4569       103 100.39643 51.90186  0.1953000

$is_perfect
  is_perfect count  mean_st median_st tmean_st    sd_st cor_st_int
1      FALSE 99995 107.5419        99 96.20092 51.36403  0.1707839
2       TRUE     5  37.6000        18 19.50000 45.06440  0.9000000

$is_div_by_2
  is_div_by_2 count  mean_st median_st  tmean_st    sd_st cor_st_int
1       FALSE 50000 113.5745       106 102.58324 52.25180  0.1720108
2        TRUE 50000 101.5023        94  90.68234 49.73778  0.1737786

$is_div_by_3
  is_div_by_3 count  mean_st median_st tmean_st    sd_st cor_st_int
1       FALSE 66667 107.4415        99 96.15832 51.30838  0.1705745
2        TRUE 33333 107.7322        99 96.94426 51.48104  0.1714432

$is_div_by_5
  is_div_by_5 count  mean_st median_st tmean_st    sd_st cor_st_int
1       FALSE 80000 107.5676        99 96.21239 51.36466  0.1713552
2        TRUE 20000 107.4214        99 96.13873 51.37206  0.1688929

. . .

$is_div_by_89
  is_div_by_89 count  mean_st median_st tmean_st    sd_st cor_st_int
1        FALSE 98877 107.5381        99 96.19965 51.37062  0.1707754
2         TRUE  1123 107.5628        97 96.02096 50.97273  0.1803568

$is_div_by_97
  is_div_by_97 count  mean_st median_st  tmean_st    sd_st cor_st_int
1        FALSE 98970 107.4944        99  96.15365 51.36880 0.17187453
2         TRUE  1030 111.7660       106 101.35853 50.93593 0.07843996

As we can see from the above results, most comparisons appear to be non-significant (although in mathematics even tiny differences can be meaningful, so we will avoid relying on statistical significance here). Here's a summary of the trends that stand out as we go over the results:

- On average, stopping times are slightly higher for prime numbers compared to composite numbers.
- On average, stopping times are slightly lower for triangular numbers, square numbers, and pentagonal numbers compared to their corresponding counterparts.
- Despite having lower average stopping times, triangular numbers, square numbers and pentagonal numbers are more strongly correlated with their stopping times than their corresponding counterparts. A larger sample size may help in this case.
- On average, odd numbers have a higher stopping time than even numbers.

This last item, however, could just be a restatement of the first point. Let's take a closer look:

ddply(df, ~ is_prime + is_div_by_2, my_summary)

  is_prime is_div_by_2 count  mean_st median_st  tmean_st    sd_st cor_st_int
1    FALSE       FALSE 40409 114.0744       106 102.91178 52.32495  0.1658988
2    FALSE        TRUE 49999 101.5043        94  90.68434 49.73624  0.1737291
3     TRUE       FALSE  9591 111.4685       103 100.40796 51.89231  0.1950482
4     TRUE        TRUE     1   1.0000         1       NaN       NA         NA

When limiting the analysis to odd numbers only, prime numbers now have a lower average stopping time compared to composite numbers, which reverses the trend seen in the first point above.

And one final point: since having a prime number as a proper divisor is a proxy for being a composite number, it is difficult to read too much into whether divisibility by a specific prime number affects the average stopping time. But no specific prime number stands out in particular. Once again, a larger sample size would give us more confidence about the results.

The above analysis is perhaps too crude and primitive to offer any significant lead, so here are some possible improvements:

- We could think of other features that we could add to the data.
- We could think of other statistical summaries to include in the analysis.
- We could try to scale up by looking at all integers from 1 to 1 billion instead of 1 to 100,000.

In the next post, we will see how we can use the 'RevoScaleR' package to go from an R 'data.frame' (memory-bound) to an external data frame, or XDF file for short. In doing so, we will achieve the following improvements:

- as the data will no longer be bound by available memory, we will scale with the size of the data,
- since the XDF format is also a distributed data format, we will use the multiple cores available on a single machine to distribute the computation itself by having each core process a separate chunk of the data. On a cluster of machines, the same analysis could then be distributed over the different nodes of the cluster.

by Joseph Rickert

The ancient Egyptians were a people with long memories. The lists of their pharaohs went back thousands of years, and we still have the names and tax assessments for certain persons and institutions from the time of Ramesses II. When Herodotus began writing about Egypt and the Nile (~ 450 BC), the Egyptians who knew that their prosperity depended on the river’s annual overflow, had been keeping records of the Nile’s high water mark for more than three millennia. So, it seems reasonable, and maybe even appropriate, that one of the first attempts to understand long memory in time series was motivated by the Nile.

The story of British hydrologist and civil servant H.E. Hurst, who earned the nickname “Abu Nil”, Father of the Nile, for his 62-year career of measuring and studying the river, is now fairly well known. Pondering an 847-year record of Nile overflow data, Hurst noticed that the series was persistent in the sense that heavy flood years tended to be followed by heavier than average flood years, while below average flood years were typically followed by light flood years. Working from an ancient formula on optimal dam design, he devised the equation log(R/S) = K*log(N/2), where R is the range of the time series, S is the standard deviation of year-to-year flood measurements and N is the number of years in the series. Hurst noticed that the value of K for the Nile series and other series related to climate phenomena tended to be about 0.73, consistently much higher than the 0.5 value that one would expect from independent observations and short autocorrelations.
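Rearranging Hurst's equation gives K = log(R/S) / log(N/2). A minimal standalone C++ sketch of that calculation (analogous in spirit to the simpleHurst() R function used later in this post; following the usual rescaled range convention, R here is taken as the range of the cumulative sums of the mean-adjusted series):

```cpp
#include <cmath>
#include <vector>

// Estimate K = log(R/S) / log(N/2), where S is the sample standard
// deviation of the series and R is the range of the cumulative sums
// of the mean-adjusted series (the rescaled range).
double simple_hurst(const std::vector<double>& y) {
    const double n = static_cast<double>(y.size());
    double mean = 0.0;
    for (double v : y) mean += v;
    mean /= n;
    double ss = 0.0;
    for (double v : y) ss += (v - mean) * (v - mean);
    const double sd = std::sqrt(ss / (n - 1.0));
    double cum = 0.0, lo = 0.0, hi = 0.0;
    for (double v : y) {        // range of the cumulative deviations
        cum += v - mean;
        if (cum < lo) lo = cum;
        if (cum > hi) hi = cum;
    }
    const double rs = (hi - lo) / sd;
    return std::log(rs) / std::log(n / 2.0);
}
```

A strongly persistent series, such as a steady upward trend, produces an estimate close to 1, while independent observations should hover around 0.5.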

Today, mostly due to the work of Benoit Mandelbrot, who rediscovered and popularized Hurst's work in the early 1960s, Hurst's rescaled range analysis, and the calculation of the Hurst exponent (Mandelbrot renamed “K” to “H”), is the demarcation point for the modern study of long memory time series. To investigate, let's look at some monthly flow data taken at the Dongola measurement station, just upstream from the high dam at Aswan. (Look here for the data, and here for a map and a nice Python-based analysis that covers some of the same ground as is presented below.) The data consist of average monthly flow measurements from January 1869 to December 1984.

head(nile_dat)
  Year  Jan  Feb  Mar Apr May  Jun  Jul   Aug   Sep  Oct  Nov  Dec
1 1869   NA   NA   NA  NA  NA   NA 2606  8885 10918 8699   NA   NA
2 1870   NA   NA 1146 682 545  540 3027 10304 10802 8288 5709 3576
3 1871 2606 1992 1485 933 731  760 2475  8960  9953 6571 3522 2419
4 1872 1672 1033  728 605 560  879 3121  8811 10532 7952 4976 3102
5 1873 2187 1851 1235 756 556 1392 2296  7093  8410 5675 3070 2049
6 1874 1340  847  664 516 466  964 3061 10790 11805 8064 4282 2904

To get a feel for the data we plot a portion of the time series.

The pattern is very regular and the short term correlations are apparent. The following boxplots show the variation in monthly flow.

Herodotus clearly knew what he was talking about when he wrote (*The Histories*: Book 2, 19):

I was particularly eager to find out from them (the Egyptian priests) why the Nile starts coming down in a flood at the summer solstice and continues flooding for a hundred days, but when the hundred days are over the water starts to recede and decreases in volume, with the result that it remains low for the whole winter, until the summer solstice comes round again.

To construct a long memory time series, we aggregate the monthly flows to produce a yearly time series of total flow (dropping the years 1869 and 1870 because of the missing values).

Plotting the ACF, we see that the autocorrelations persist for nearly 20 years!

So, let's compute the Hurst exponent. For our first try, we use a simple function suggested by an example in Bernhard Pfaff's classic text *Analysis of Integrated and Cointegrated Time Series with R*.

Bingo! 0.73 - just what we were expecting for a long memory time series. Unfortunately, things are not so simple. The function hurst() from the pracma package, which performs a much more careful calculation than simpleHurst(), yields:

hurst(nile.yr.ts)
[1] 1.041862

This is mildly distressing, since H is supposed to be bounded above by 1. The function hurstexp() from the same package, which is based on Weron's MATLAB code and implements the small sample correction, seems to solve that problem.

> hurstexp(nile.yr.ts)
Corrected R over S Hurst exponent:  1.041862
Theoretical Hurst exponent:         0.5244148
Corrected empirical Hurst exponent: 0.7136607
Empirical Hurst exponent:           0.6975531

0.71 is a more reasonable result. However, as a post on the Timely Portfolio blog pointed out a few years ago, computing the Hurst exponent is an estimation problem, not merely a calculation. So, where are the error bars?

I am afraid that confidence intervals and a look at several other methods available in R for estimating the Hurst exponent will have to wait for another time. In the meantime, the following references may be of interest. The first two are light reading from early champions of applying rescaled range analysis and the Hurst exponent to financial time series. The book by Mandelbrot and Hudson is especially interesting for its sketch of the historical background. The last two are relatively early papers on the subject.

- Edgar E. Peters - Fractal Market Analysis: Applying Chaos Theory to Investments and Economics
- Benoit B. Mandelbrot & Richard L. Hudson - The (Mis)behavior of Markets
- Davies & Harte - Tests for the Hurst Effect
- Rafal Weron - Estimating long-range dependence: finite sample properties and confidence intervals

My code for this post may be downloaded here: Long_memory_Post.

by Norman Matloff

The American Statistical Association (ASA) leadership, and many in Statistics academia, have been undergoing a period of angst the last few years. They worry that the field of Statistics is headed for a future of reduced national influence and importance, with the feeling that:

- The field is to a large extent being usurped by other disciplines, notably Computer Science (CS).
- Efforts to make the field attractive to students have largely been unsuccessful.

I had been aware of these issues for quite a while, and thus was pleasantly surprised last year to see then-ASA president Marie Davidian write a plaintive editorial titled, “Aren’t *We* Data Science?”

Good, the ASA is taking action, I thought. But even then I was startled to learn during JSM 2014 (a conference tellingly titled “Statistics: Global Impact, Past, Present and Future”) that the ASA leadership is so concerned about these problems that it has now retained a PR firm.

This is probably a wise move–most large institutions engage in extensive PR in one way or another–but it is a sad statement about how complacent the profession has become. Indeed, it can be argued that the action is long overdue; as a friend of mine put it, “They [the statistical profession] lost the PR war because they never fought it.”

In this post, I’ll tell you the rest of the story, as I see it, viewing events as a statistician, computer scientist and R activist.

**CS vs. Statistics**

Let’s consider the CS issue first. Recently a number of new terms have arisen, such as data science, Big Data, and analytics, and the popularity of the term machine learning has grown rapidly. To many of us, though, this is just “old wine in new bottles,” with the “wine” being Statistics. But the new “bottles” are disciplines outside of Statistics–especially CS.

I have a foot in both the Statistics and CS camps. I’ve spent most of my career in the Computer Science Department at the University of California, Davis, but I began my career in Statistics at that institution. My mathematics doctoral thesis at UCLA was in probability theory, and my first years on the faculty at Davis focused on statistical methodology. I was one of the seven charter members of the Department of Statistics. Though my departmental affiliation later changed to CS, I never left Statistics as a field, and most of my research in Computer Science has been statistical in nature. With such “dual loyalties,” I’ll refer to people in both professions via third-person pronouns, not first, and I will be critical of both groups. However, in keeping with the theme of the ASA’s recent actions, my essay will be Stat-centric: What is poor Statistics to do?

Well then, how did CS come to annex the Stat field? The primary cause, I believe, came from the CS subfield of Artificial Intelligence (AI). Though there always had been some probabilistic analysis in AI, in recent years the interest has been almost exclusively in predictive analysis–a core area of Statistics.

That switch in AI was due largely to the emergence of Big Data. No one really knows what the term means, but people “know it when they see it,” and they see it quite often these days. Typical data sets range from large to huge to astronomical (sometimes literally the latter, as cosmology is one of the application fields), necessitating that one pay key attention to the computational aspects. Hence the term data science, combining quantitative methods with speedy computation, and hence another reason for CS to become involved.

Involvement is one thing, but usurpation is another. Though not a deliberate action by any means, CS is eclipsing Stat in many of Stat’s central areas. This is dramatically demonstrated by statements like, “With machine learning methods, you don’t need statistics”–a punch in the gut for statisticians who realize that machine learning really IS statistics. ML goes into great detail in certain aspects, e.g. text mining, but in essence it consists of parametric and nonparametric curve estimation methods from Statistics, such as logistic regression, LASSO, nearest-neighbor classification, random forests, the EM algorithm and so on.

Though the Stat leaders seem to regard all this as something of an existential threat to the well-being of their profession, I view it as much worse than that. The problem is not that CS people are doing Statistics, but rather that they are doing it poorly: Generally the quality of CS work in Stat is weak. It is not a problem of quality of the researchers themselves; indeed, many of them are very highly talented. Instead, there are a number of **systemic reasons** for this, structural problems with the CS research “business model”:

- **CS, having grown out of research on fast-changing software and hardware systems, became accustomed to the “24-hour news cycle”**–very rapid publication rates, with the venue of choice being (refereed) frequent conferences rather than slow journals. This leads to research work being less thoroughly conducted, and less thoroughly reviewed, resulting in poorer quality work. The fact that some prestigious conferences have acceptance rates in the teens or even lower doesn’t negate these realities.
- Because CS Depts. at research universities tend to be housed in Colleges of Engineering, there is **heavy pressure to bring in lots of research funding, and produce lots of PhD students**. Large amounts of time are spent on trips to schmooze funding agencies and industrial sponsors, writing grants, meeting conference deadlines and managing a small army of doctoral students–instead of time spent in careful, deep, long-term contemplation about the problems at hand. This is made even worse by the rapid change in the fashionable research topic du jour, making it difficult to go into a topic in any real depth. Offloading the actual research onto a large team of grad students can result in faculty not fully applying the talents they were hired for; I’ve seen too many cases in which the thesis adviser is not sufficiently aware of what his/her students are doing.
- **There is rampant “reinventing the wheel.”** The above-mentioned lack of “adult supervision” and lack of long-term commitment to research topics results in weak knowledge of the literature. This is especially true for knowledge of the Stat literature, which even the “adults” tend to have very little awareness of. For instance, consider a paper on the use of unlabeled training data in classification. (I’ll omit names.) One of the two authors is one of the most prominent names in the machine learning field, and the paper has been cited over 3,000 times, yet it cites nothing in the extensive Stat literature on this topic, consisting of a long stream of papers from 1981 to the present.
- Again for historical reasons, CS research is largely empirical/experimental in nature. This causes what in my view is **one of the most serious problems plaguing CS research in Stat–lack of rigor**. Mind you, I am not saying that every paper should consist of theorems and proofs or be overly abstract; data- and/or simulation-based studies are fine. But there is no substitute for precise thinking, and in my experience, many (nominally) successful CS researchers in Stat do not have a solid understanding of the fundamentals underlying the problems they work on. For example, a recent paper in a top CS conference incorrectly stated that the logistic classification model cannot handle non-monotonic relations between the predictors and response variable; actually, one can add quadratic terms, and so on, to models like this.
- **This “engineering-style” research model causes a cavalier attitude towards underlying models and assumptions.** Most empirical work in CS doesn’t have any models to worry about. That’s entirely appropriate, but in my observation it creates a mentality that inappropriately carries over when CS researchers do Stat work. A few years ago, for instance, I attended a talk by a machine learning specialist who had just earned her PhD at one of the very top CS Departments in the world. She had taken a Bayesian approach to the problem she worked on, and I asked her why she had chosen that specific prior distribution. She couldn’t answer–she had just blindly used what her thesis adviser had given her–and moreover, she was baffled as to why anyone would want to know why that prior was chosen.
- **Again due to the history of the field, CS people tend to have grand, starry-eyed ambitions–laudable, but a double-edged sword.** On the one hand, this is a huge plus, leading to highly impressive feats such as recognizing faces in a crowd. But this mentality leads to an oversimplified view of things, with everything being viewed as a paradigm shift. Neural networks epitomize this problem. Enticing phrasing such as “Neural networks work like the human brain” blinds many researchers to the fact that neural nets are not fundamentally different from other parametric and nonparametric methods for regression and classification. (Recently I was pleased to discover–“learn,” if you must–that the famous book by Hastie, Tibshirani and Friedman complains about what they call “hype” over neural networks; sadly, theirs is a rare voice on this matter.) Among CS folks, there is a failure to understand that the celebrated accomplishments of “machine learning” have been mainly the result of applying a lot of money, a lot of people time, a lot of computational power and prodigious amounts of tweaking to the given problem–not because fundamentally new technology has been invented.

All this matters – a LOT. In my opinion, the above factors result in highly lamentable opportunity costs. Clearly, I’m not saying that people in CS should stay out of Stat research. But the sad truth is that the usurpation process is causing precious resources–research funding, faculty slots, the best potential grad students, attention from government policymakers, even attention from the press–to go quite disproportionately to CS, even though Statistics is arguably better equipped to make use of them. This is not a CS vs. Stat issue; Statistics is important to the nation and to the world, and if scarce resources aren’t being used well, it’s everyone’s loss.

**Making Statistics Attractive to Students**

This of course is an age-old problem in Stat. Let’s face it–the very word statistics sounds hopelessly dull. But I would argue that a more modern development is making the problem a lot worse – the Advanced Placement (AP) Statistics courses in high schools.

Professor Xiao-Li Meng has written extensively about the destructive nature of AP Stat. He observed, “Among Harvard undergraduates I asked, the most frequent reason for not considering a statistical major was a ‘turn-off’ experience in an AP statistics course.” That says it all, doesn’t it? And though Meng’s views predictably sparked defensive replies in some quarters, I’ve had exactly the same experiences as Meng in my own interactions with students. No wonder students would rather major in a field like CS and study machine learning–without realizing it is Statistics. It is especially troubling that Statistics may be losing the “best and brightest” students.

One of the major problems is that AP Stat is usually taught by people who lack depth in the subject matter. A typical example is that a student complained to me that his AP Stat teacher could not answer his question as to why it is customary to use n-1 rather than n in the denominator of s^2 , even though he had attended a top-quality high school in the heart of Silicon Valley. But even that lapse is really minor, compared to the lack among the AP teachers of the broad overview typically possessed by Stat professors teaching university courses, in terms of what can be done with Stat, what the philosophy is, what the concepts really mean and so on. AP courses are ostensibly college level, but the students are not getting college-level instruction. The “teach to the test” syndrome that pervades AP courses in general exacerbates this problem.
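
The student's question has a crisp answer that a short simulation can make concrete. Here is a sketch (mine, not from any AP curriculum) that estimates a known variance both ways; dividing by n systematically underestimates the true variance, while n - 1 corrects the bias:

```r
# Why s^2 divides by n - 1: estimate a known variance (sigma^2 = 1) from
# many small samples, dividing the sum of squared deviations by n and by n - 1.
set.seed(42)
n <- 5
samples <- replicate(100000, rnorm(n, mean = 0, sd = 1))
v_n  <- apply(samples, 2, function(x) sum((x - mean(x))^2) / n)
v_n1 <- apply(samples, 2, function(x) sum((x - mean(x))^2) / (n - 1))
mean(v_n)   # close to (n - 1) / n = 0.8: biased low
mean(v_n1)  # close to 1: unbiased
```

The bias arises because deviations are measured from the sample mean rather than the true mean; dividing by n - 1 compensates exactly, which is the sort of thing one would hope an AP teacher could explain.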

The most exasperating part of all this is that AP Stat officially relies on TI-83 pocket calculators as its computational vehicle. The machines are expensive, and after all, we are living in an age in which R is free! Moreover, the calculators offer nothing like the dazzling graphics and the ability to analyze nontrivial data sets that R provides – exactly the kinds of things that motivate young people.

So, unlike the “CS usurpation problem,” whose solution is unclear, here is something that actually can be fixed reasonably simply. If I had my druthers, I would simply ban AP Stat, and actually, I am one of those people who would do away with the entire AP program. Obviously, there are too many deeply entrenched interests for this to happen, but one thing that can be done for AP Stat is to switch its computational vehicle to R.

As noted, R is free and multi-platform, with outstanding graphical capabilities. There is no end to the data sets teenagers would find attractive to explore with R, say the Million Song Data Set.

As to a textbook, there are many introductions to Statistics that use R, such as Michael Crawley’s Statistics: An Introduction Using R, and Peter Dalgaard’s Introductory Statistics Using R. But to really do it right, I would suggest that a group of Stat professors collaboratively write an open-source text, as has been done for instance for Chemistry. Examples of interest to high schoolers should be used, say this engaging analysis on OK Cupid.

This is not a complete solution by any means. There still is the issue of AP Stat being taught by people who lack depth in the field, and so on. And even switching to R would meet with resistance from various interests, such as the College Board and especially the AP Stat teachers themselves.

But given all these weighty problems, it certainly would be nice to do something, right? Switching to R would be doable–and should be done.

*[Crossposted with permission from the Mad (Data) Scientist blog.]*

by Joseph Rickert

I think we can be sure that when American botanist Edgar Anderson meticulously collected data on three species of iris in the early 1930s he had no idea that these data would produce a computational storm that would persist well into the 21st century. The calculations started, presumably by hand, when R. A. Fisher selected this data set to illustrate the techniques described in his 1936 paper on discriminant analysis. However, they really got going in the early 1970s when the pattern recognition and machine learning community began using it to test new algorithms or illustrate fundamental principles. (The earliest reference I could find was: Gates, G.W. "The Reduced Nearest Neighbor Rule". IEEE Transactions on Information Theory, May 1972, 431-433.) Since then, the data set (or one of its variations) has been used to test hundreds, if not thousands, of machine learning algorithms. The UCI Machine Learning Repository, which contains what is probably the “official” iris data set, lists over 200 papers referencing the iris data.

So why has the iris data set become so popular? Like most success stories, randomness undoubtedly plays a huge part. However, Fisher’s selecting it to illustrate a discrimination algorithm brought it to people's attention, and the fact that the data set contains three classes, only one of which is linearly separable from the other two, makes it interesting.
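
The separability claim is easy to check in R itself, since the data set ships with every installation (a quick look, not part of Fisher's analysis):

```r
# iris ships with base R: 150 flowers, 3 species, 4 measurements each.
data(iris)
table(iris$Species)
# Setosa separates cleanly from the other two species on petal length alone;
# versicolor and virginica overlap.
aggregate(Petal.Length ~ Species, data = iris, FUN = range)
with(iris, plot(Petal.Length, Petal.Width, col = Species,
                main = "Anderson's irises"))
```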

For similar reasons, the airlines data set used in the 2009 ASA Sections on Statistical Computing and Statistical Graphics Data expo has gained a prominent place in the machine learning world and is well on its way to becoming the “iris data set for big data”. It shows up in all kinds of places. (In addition to this blog, it made its way into the RHIPE documentation and figures in several college course modeling efforts.)

Some key features of the airlines data set are:

- It is big enough to exceed the memory of most desktop machines. (The version of the airlines data set used for the competition contained just over 123 million records with twenty-nine variables.)
- The data set contains several different types of variables. (Some of the categorical variables have hundreds of levels.)
- There are interesting things to learn from the data set. (This exercise from Kane and Emerson for example)
- The data set is *tidy*, but not clean, making it an attractive tool to practice big data wrangling. (The AirTime variable ranges from -3,818 minutes to 3,508 minutes.)
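
That mix of tidy structure and dirty values makes for a natural first wrangling exercise. A minimal sketch (the toy data frame and the cutoff values are mine, not from the expo) that treats impossible flight times as missing rather than dropping rows:

```r
# Toy stand-in for the airlines data: AirTime in minutes, including the kinds
# of impossible values (negative, or longer than a day) found in the real set.
air <- data.frame(AirTime = c(330, -3818, 195, 3508, 75))
# Flag anything outside (0, 1440] minutes as NA; 1440 = 24 hours.
air$AirTime[air$AirTime <= 0 | air$AirTime > 1440] <- NA
summary(air$AirTime)
```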

An additional, really nice feature of the airlines data set is that it keeps getting bigger! RITA, the Research and Innovative Technology Administration of the Bureau of Transportation Statistics, continues to collect data which can be downloaded in .csv files. For your convenience, we have a 143M+ record version of the data set on the Revolution Analytics test data site, containing all of the RITA records from 1987 through the end of 2012, available for download.

The following analysis from Revolution Analytics’ Sue Ranney uses this large version of the airlines data set and illustrates how a good model, driven with enough data, can reveal surprising features of a data set.

# Fit a Tweedie GLM
tm <- system.time(
  glmOut <- rxGlm(ArrDelayMinutes ~ Origin:Dest + UniqueCarrier + F(Year) +
                    DayOfWeek:F(CRSDepTime),
                  data = airData,
                  family = rxTweedie(var.power = 1.15),
                  cube = TRUE, blocksPerRead = 30)
)
tm

# Build a data frame for three airlines: Delta (DL), Alaska (AS), Hawaiian (HA)
airVarInfo <- rxGetVarInfo(airData)
predData <- data.frame(
  UniqueCarrier = factor(rep(c("DL", "AS", "HA"), times = 168),
                         levels = airVarInfo$UniqueCarrier$levels),
  Year = as.integer(rep(2012, times = 504)),
  DayOfWeek = factor(rep(c("Mon", "Tues", "Wed", "Thur", "Fri", "Sat", "Sun"),
                         times = 72),
                     levels = airVarInfo$DayOfWeek$levels),
  CRSDepTime = rep(0:23, each = 21),
  Origin = factor(rep("SEA", times = 504), levels = airVarInfo$Origin$levels),
  Dest = factor(rep("HNL", times = 504), levels = airVarInfo$Dest$levels)
)

# Use the model to predict the arrival delay for the three airlines and plot
predDataOut <- rxPredict(glmOut, data = predData, outData = predData,
                         type = "response")
rxLinePlot(ArrDelayMinutes_Pred ~ CRSDepTime | UniqueCarrier,
           groups = DayOfWeek, data = predDataOut, layout = c(3, 1),
           title = "Expected Delay: Seattle to Honolulu by Departure Time, Day of Week, and Airline",
           xTitle = "Scheduled Departure Time",
           yTitle = "Expected Delay")

Here, rxGlm() fits a Tweedie generalized linear model that looks at arrival delay as a function of the interaction between origin and destination airports, carrier, year, and the interaction between day of the week and scheduled departure time. This function kicks off a considerable amount of number crunching, as Origin is a factor variable with 373 levels and Dest, also a factor, has 377 levels. The F() function makes Year and CRSDepTime factors "on the fly" as the model is being fit. The resulting model ends up with 140,852 coefficients, 8,626 of which are not NA. The calculation takes 12.6 minutes to run on a 5-node (4 cores and 16GB of RAM per node) IBM Platform LSF cluster.

The rest of the code uses the model to predict arrival delay for three airlines and plots the fitted values by day of the week and departure time.

It looks like Saturday is the best day to fly on these airlines. Note that none of the structure revealed in these curves was put into the model, in the sense that there are no polynomial terms in the model.

It will take a few minutes to download the zip file with the 143M airlines records, but please do, and let us know how your modeling efforts go.

*The latest in a series by Daniel Hanson*

**Introduction**

Correlations between holdings in a portfolio are of course a key component in financial risk management. Borrowing a tool common in fields such as bioinformatics and genetics, we will look at how to use heat maps in R for visualizing correlations among financial returns, and examine behavior in both a stable and down market.

While base R contains its own heatmap(.) function, the reader will likely find the heatmap.2(.) function in the R package gplots to be a bit more user friendly. A very nicely written companion article entitled A short tutorial for decent heat maps in R (Sebastian Raschka, 2013), which covers more details and features, is available on the web; we will also refer to it in the discussion below.

We will present the topic in the form of an example.

**Sample Data**

As in previous articles, we will make use of R packages Quandl and xts to acquire and manage our market data. Here, in a simple example, we will use returns from the following global equity indices over the period 1998-01-05 to the present, and then examine correlations between them:

S&P 500 (US)

RUSSELL 2000 (US Small Cap)

NIKKEI (Japan)

HANG SENG (Hong Kong)

DAX (Germany)

CAC (France)

KOSPI (Korea)

First, we gather the index values and convert to returns:

library(xts)
library(Quandl)

my_start_date <- "1998-01-05"
SP500.Q <- Quandl("YAHOO/INDEX_GSPC", start_date = my_start_date, type = "xts")
RUSS2000.Q <- Quandl("YAHOO/INDEX_RUT", start_date = my_start_date, type = "xts")
NIKKEI.Q <- Quandl("NIKKEI/INDEX", start_date = my_start_date, type = "xts")
HANG_SENG.Q <- Quandl("YAHOO/INDEX_HSI", start_date = my_start_date, type = "xts")
DAX.Q <- Quandl("YAHOO/INDEX_GDAXI", start_date = my_start_date, type = "xts")
CAC.Q <- Quandl("YAHOO/INDEX_FCHI", start_date = my_start_date, type = "xts")
KOSPI.Q <- Quandl("YAHOO/INDEX_KS11", start_date = my_start_date, type = "xts")

# Depending on the index, the final price for each day is either
# "Adjusted Close" or "Close Price". Extract this single column for each:
SP500 <- SP500.Q[, "Adjusted Close"]
RUSS2000 <- RUSS2000.Q[, "Adjusted Close"]
DAX <- DAX.Q[, "Adjusted Close"]
CAC <- CAC.Q[, "Adjusted Close"]
KOSPI <- KOSPI.Q[, "Adjusted Close"]
NIKKEI <- NIKKEI.Q[, "Close Price"]
HANG_SENG <- HANG_SENG.Q[, "Adjusted Close"]

# The xts merge(.) function will only accept two series at a time.
# We can, however, merge multiple columns by downcasting to zoo objects.
# Remark: "all = FALSE" uses an inner join to merge the data.
z <- merge(as.zoo(SP500), as.zoo(RUSS2000), as.zoo(DAX), as.zoo(CAC),
           as.zoo(KOSPI), as.zoo(NIKKEI), as.zoo(HANG_SENG), all = FALSE)

# Set the column names; these will be used in the heat maps:
myColnames <- c("SP500", "RUSS2000", "DAX", "CAC", "KOSPI", "NIKKEI", "HANG_SENG")
colnames(z) <- myColnames

# Cast back to an xts object:
mktPrices <- as.xts(z)

# Next, calculate log returns:
mktRtns <- diff(log(mktPrices), lag = 1)
head(mktRtns)
mktRtns <- mktRtns[-1, ]  # Remove resulting NA in the 1st row

**Generate Heat Maps**

As noted above, heatmap.2(.) is the function in the gplots package that we will use. For convenience, we’ll wrap this function inside our own generate_heat_map(.) function, as we will call this parameterization several times to compare market conditions.

As for the parameterization, the comments should be self-explanatory, but we’re keeping things simple by eliminating the dendrogram, and leaving out the trace lines inside the heat map and the density plot inside the color legend. Note also the setting Rowv = FALSE; this ensures the ordering of the rows and columns remains consistent from plot to plot. We’re also just using the default color settings; for customized colors, see the Raschka tutorial linked above.

require(gplots)

generate_heat_map <- function(correlationMatrix, title) {
  heatmap.2(x = correlationMatrix,       # the correlation matrix input
            cellnote = correlationMatrix, # places the correlation value in each cell
            main = title,                # heat map title
            symm = TRUE,                 # configure diagram as standard correlation matrix
            dendrogram = "none",         # do not draw a row dendrogram
            Rowv = FALSE,                # keep ordering consistent
            trace = "none",              # turns off trace lines inside the heat map
            density.info = "none",       # turns off density plot inside color legend
            notecol = "black")           # set font color of cell labels to black
}

Next, let’s calculate three correlation matrices using the data we have obtained:

- Correlations based on the entire data set from 1998-01-05 to the present
- Correlations of market indices during a reasonably calm period -- January through December 2004
- Correlations of falling market indices in the midst of the financial crisis - October 2008 through May 2009
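
The article doesn't show the code behind corr1, corr2 and corr3, but with mktRtns in hand each is a one-liner. A sketch using base R's cor() and xts date-range subsetting (the helper name is mine; xts is already loaded by the data-gathering code above):

```r
# Helper (my naming, not from the article): correlation matrix of a set of
# returns, optionally restricted to an ISO-8601 date range when the input is
# an xts object, rounded so the values fit in the heat map cells.
window_cor <- function(rtns, range = NULL) {
  if (!is.null(range)) rtns <- rtns[range]  # xts-style "YYYY-MM/YYYY-MM" subset
  round(cor(rtns), 2)
}

# With mktRtns from the code above, the three matrices would be:
# corr1 <- window_cor(mktRtns)                     # full history
# corr2 <- window_cor(mktRtns, "2004-01/2004-12")  # calm period
# corr3 <- window_cor(mktRtns, "2008-10/2009-05")  # financial crisis
```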

Now, let’s call our heat map function using the total market data set:

generate_heat_map(corr1, "Correlations of World Market Returns, Jan 1998 - Present")

And then, examine the result:

As expected, we trivially have correlations of 100% down the main diagonal. Note that, as shown in the color key, the darker the color, the lower the correlation. Per the heatmap.2(.) parameters described above, the title is set with main = title, and the correlations are displayed in black via the notecol = "black" setting.

Next, let’s look at a period of relative calm in the markets, namely the year 2004:

generate_heat_map(corr2, "Correlations of World Market Returns, Jan - Dec 2004")

This gives us:


Note that in this case, a glance at the darker colors in each of the cells shows that the correlations are even lower than those computed from the entire data set. This may of course be verified by comparing the numerical values.

Finally, let’s look at the opposite extreme, during the upheaval of the financial crisis in 2008-2009:

generate_heat_map(corr3, "Correlations of World Market Returns, Oct 2008 - May 2009")

This yields the following heat map:

Note that in this case, again just at first glance, we can tell the correlations have increased compared to 2004, by the colors changing from dark to light nearly across the board. While there are some correlations that do not increase all that much, such as the SP500/Nikkei and the Russell 2000/Kospi values, there are others across international and capitalization categories that jump quite significantly, such as the SP500/Hang Seng correlation going from about 21% to 41%, and that of the Russell 2000/DAX moving from 43% to over 57%. So, in other words, portfolio diversification can take a hit in down markets.

**Conclusion**

In this example, we only looked at seven market indices, but for a closer look at how correlations were affected during 2008-09 – and how heat maps among a greater number of market sectors compared – this article, entitled *Diversification is Broken*, is a recommended and interesting read.

by Joseph Rickert

Last week, I posted a list of sessions at the Joint Statistical Meetings related to R. As it turned out, that list was only the tip of the iceberg. In some areas of statistics, such as graphics, simulation and computational statistics, the use of R is so prevalent that people working in the field often don't think to mention it. For example, in the session New Approaches to Data Exploration and Discovery, which included the presentation on the Glassbox package that figured in my original list, R was important to the analyses underlying nearly all of the talks in one way or another. The following are synopses of the talks in that session along with some pointers to relevant R resources.

Exploring Huge Collections of Scatterplots

Statistics and visualization legend Leland Wilkinson of Skytree showed off ScagExplorer, a tool he built with Tuan Dang of the University of Illinois at Chicago to explore scagnostics (a contraction of “Scatter Plot Diagnostics” made up by John Hartigan and Paul Tukey in the 1980’s). ScagExplorer makes it possible to look for anomalies and search for similar distributions in a huge collection of scatter plots. (The example Leland showed contained 124K plots.) The ideas and many of the visuals for the talk can be found in the paper ScagExplorer: Exploring Scatterplots by Their Scagnostics. ScagExplorer is a Java-based tool, but R users can work with the scagnostics package written by Lee Wilkinson and Anushka Anand in 2007.

Glassbox: An R Package for Visualizing Algorithmic Models:

Google’s Max Ghenis presented work he did with fellow Googlers Ben Ogorek and Estevan Flores. Glassbox is an R application that attempts to provide transparency to “blackbox” algorithmic models such as Random Forests. Among other things, it calculates and plots the collective importance of groups of variables in such a model. The slides for the presentation are available, as is the package itself. Google is using predictive modeling and tools such as glassbox to better understand the characteristics of its workforce and to ask important, reflective questions such as “How can we better understand diversity?” The company also does HR modeling to see if what it knows about people can give it a competitive edge in hiring. For example, Google uses data collected from people who have interviewed at the company in the past, but who have not received offers, to try to understand its future hiring needs. The coolest thing about this presentation was that these guys work for the Human Resources Department! If you think you work for a tech company, go down to HR and see if you can get some help with Random Forests.

A Web Application for Efficient Analysis of Peptide Libraries

Eric Hare of Iowa State University introduced PeLica, work he did with colleagues Timo Sieber of University Medical Center Hamburg-Eppendorf and Heike Hofmann of Iowa State University. PeLica is an interactive, Shiny application to help assess the statistical properties of peptide libraries. PeLica’s creators refer to it as a Peptide Library Calculator that acts as a front end to the R package peptider which contains functions for evaluating the diversity of peptide libraries. The authors have done an exceptional job of using the documentation features available in Shiny to make their app a teaching tool.

To Merge or Not to Merge: An Interactive Visualization Tool for Local Merges of Mixture Model Components

Elizabeth Lorenzi of Carnegie Mellon showed the prototype for an interactive visualization tool that she is working on with Rebecca Nugent of Carnegie Mellon and Nema Dean of the University of Glasgow. The software calculates inter-component similarities of mixture model component trees and displays them as hierarchical dendrograms. Elizabeth and her colleagues are implementing this tool as an R package.

An Interactive Visualization Platform for Interpreting Topic Models

Carson Sievert of Iowa State University presented LDAvis, a general framework for visualizing topic models that he is building with Kenny Shirley of AT&T Labs. LDAvis is interactive R software that enables users to interpret and compare topics by highlighting keywords. The theory is nicely described in a recent paper, and the examples on Carson’s Github page are instructive and fun to play with. In the plot below, circle 26, representing a topic, has been selected. The bar chart on the right displays the 30 most relevant terms for this topic. The red bars represent the frequency of a term in a given topic (proportional to p(term | topic)), and the gray bars represent a term's frequency across the entire corpus (proportional to p(term)).

Gravicom: A Web-Based Tool for Community Detection in Networks

Andrea Kaplan showed off an interactive application that she and her Iowa State University team members, Heike Hofmann and Daniel Nordman, are building. Gravicom is an interactive web application based on Shiny and the D3 JavaScript library that lets a user manually collect nodes into clusters in a social network graph and then save this grouping information for subsequent processing. The idea is that eyeballing a large social network and selecting “obvious” groups may be an efficient way to initialize a machine learning algorithm. Have a look at the live demo.

Human Factors Influencing Visual Statistical Inference

Mahbubul Majumder of the University of Nebraska presented joint work done with Heike Hofmann and Dianne Cook, both of Iowa State University, on identifying key factors – such as demographics, experience, training, or even the placement of figures in an array of plots – that may be important for the human analysis of visual data.

I'm here at the JSM conference in Boston, the latest annual gathering of 6000+ statisticians from North America and around the world. (Revolution Analytics is a proud sponsor of the conference.) One of the great things to see is that the American Statistical Association, the organizer of the conference and the professional body for statisticians, is putting some effort into promoting Statistics as a discipline with its new website This is Statistics.

Statistics as a discipline has been overshadowed by its close sibling, Data Science. (I'm as guilty of this as anyone.) As statistician Terry Speed points out (hat tip: Stephanie Hicks), statisticians haven't been deeply involved in the Big Data revolution, despite having been trained in exactly the types of issues that are critical to extracting useful inferences from complex data sets. So it's great to see that the ASA is helping to promote the role of statisticians in today's data-centric workplace. My favorite video comes from Roger Peng, who has himself been instrumental in promoting statistical analysis in R via Coursera:

Check out the complete This is Statistics site at the link below, and share it with your friends so they can learn what Statistics is really about.

American Statistical Association: This is Statistics

by Joseph Rickert

The Joint Statistical Meetings (JSM) get underway this weekend in Boston and Revolution Analytics is again proud to be a sponsor. More than 6,000 statisticians and data scientists from around the world are expected to attend and listen to thousands of presentations. It is true that many talks will be on specialized topics that only statisticians working in a particular field will have the interest and patience to sit through. However, there is evidence that the conference will have something exciting to offer data scientists and statisticians working in industry. Keyword searches yield 79 presentations for Big Data, 29 on Machine Learning, 17 on Data Science, 17 on Data Mining and 19 related to R. There is more than enough here to fill a data scientist’s dance card.

Three must-see presentations under the Big Data keyword are: Michael Franklin's presentation on Analyzing Data at Scale with the Berkeley Data Analytics Stack; Hui Jiang et al. on Implementation of Statistical Algorithms in Big Data Platforms; and Tim Hesterberg's talk on Simulation-Based Methods in Statistics Education, and Google Tools. Under the Data Science label, Bill Ruh’s invited talk Industrial Internet, an Opportunity for Statisticians to Become Data Scientists looks most inviting. There are also quite a few Data Science talks that indicate some soul searching within the academic community as to how the statistics curriculum ought to change. See, for example, Michael Rappa’s talk on Data Scientists: How Do We Prepare for the Future? and Johanna Hardin’s talk: Data Science and Statistics: How Should They Fit into Our Curriculum?

Here is the list of R related presentations:

**Saturday, August 2**

- 8:00 AM - 12:00 PM: Adaptive Tests of Significance Using R and SAS — Professional Development Continuing Education Course ASA Instructor: Tom O'Gorman

**Sunday, August 3**

- 8:30 AM - 5:00 PM: Adaptive Methods in Modern Clinical Trials — Professional Development Continuing Education Course ASA , Biometrics Section Instructors: Frank Bretz, Byron Jones, and Guosheng Yin
- 4:20 PM: Glassbox: An R Package for Visualizing Algorithmic Models: Max Ghenis and Ben Ogorek and Estevan Flores
- 4:45 PM: Bayesian Enrollment and Event Predictions in Clinical Trials Leveraging Literature Data: Aijun Gao and Fanni Natanegara and Govinda Weerakkody

**Monday, August 4**

- 8:55 AM: Thinking with Data in the Second Course: Nicholas J. Horton and Ben S. Baumer and Hadley Wickham
- 8:30 AM to 10:20 AM: Do You See What I See? Formal Usability Testing and Statistical Graphics: Marie C. Vendettuoli and Matthew Williams and Susan Ruth VanderPlas
- 8:35 AM: Preparing Students for Big Data Using R and RStudio: Randall Pruim
- 8:35 AM: Does R Provide What Customers Need?: Vipin Arora
- 8:55 AM: Doing Reproducible Research Unconsciously: Higher Standard, but Less Work: Yihui Xie
- 12:30 PM to 1:50 PM: Analyzing Umpire Performance Using PITCHf/x: Andrew Swift
- 3:30 PM: The Perfect Bracket: Machine Learning in NCAA Basketball: Sara Stoudt and Loren Santana and Ben S. Baumer

**Tuesday, August 5**

- 10:35 AM: Tools for Teaching R and Statistics Using Games: Brad Luen and Michael Higgins
- 2:00 PM: Multiple Treatment Groups: A Case Study with Health Care Practice and Policy Implications: Alexandra Hanlon and Karen Hirschman and Beth Ann Griffin and Mary Naylor
- 2:05 PM: glmmplus: An R Package for Messy Longitudinal Data: Ben Ogorek and Caitlin Hogan
- 3:30 PM: Give Me an Old Computer, a Blank DVD, and an Internet Connection and I'll Give You World-Class Analytics: Ty Henkaline

**Wednesday, August 6**

- 9:35 AM: Testing Packages for the R Language: Stephen Kaluzny and Lou Bajuk-Yorgan
- 9:50 AM: Using R Analytics on Streaming Data: Lou Bajuk-Yorgan and Stephen Kaluzny
- 10:35 AM: Shiny: Easy Web Applications in R: Joseph Cheng
- 10:30 AM to 12:20 PM: Classroom Demonstrations of Big Data: Eric A. Suess
- 11:00 AM: ggvis: Moving Toward a Grammar of Interactive Graphics: Hadley Wickham
- 3:05 PM: Accessing Data from the Census Bureau API: Alex Shum and Heike Hofmann

**Thursday, August 7**

- 9:20 AM: Predicting Dangerous E. Coli Levels at Erie, Pennsylvania, Beaches with Random Forests in R: Michael Rutter
- 9:25 AM: Beyond the Black Box: Flexible Programming of Hierarchical Modeling Algorithms for BUGS-Compatible Models Using NIMBLE: Perry de Valpine and Daniel Turek and Christopher J. Paciorek and Rastislav Bodik and Duncan Temple Lang

If you are going to JSM please come by booth #303 to say hello. You may also find the mobile apps (Apple or Android) that Revolution Analytics is sponsoring useful, and don't forget to fill out the survey for a chance to win an Apple TV.

Finally, I will be the program chair for Session 401, Monte Carlo Methods to be held Tuesday, 8/5/2014, from 2:00 PM to 3:50 PM in room CC-101. If you are interested in simulation be sure to drop in. I have seen the presentations and think they are well worth attending.

by Joseph Rickert

If I had to pick just one application to be the “killer app” for the digital computer I would probably choose Agent Based Modeling (ABM). Imagine creating a world populated with hundreds, or even thousands of agents, interacting with each other and with the environment according to their own simple rules. What kinds of patterns and behaviors would emerge if you just let the simulation run? Could you guess a set of rules that would mimic some part of the real world? This dream is probably much older than the digital computer, but according to Jan Thiele’s brief account of the history of ABMs that begins his recent paper, *R Marries NetLogo: Introduction to the RNetLogo Package* in the *Journal of Statistical Software,* academic work with ABMs didn’t really take off until the late 1990s.

Now, people are using ABMs for serious studies in economics, sociology, ecology, socio-psychology, anthropology, marketing and many other fields. No less of a complexity scientist than Doyne Farmer (of Dynamic Systems and Prediction Company fame) has argued in *Nature* for using ABMs to model the complexity of the US economy, and has published on using ABMs to drive investment models. In the following clip from a 2006 interview, Doyne talks about building ABMs to explain the role of subprime mortgages in the housing crisis. (Note that when asked how one would calibrate such a model, Doyne explains the need to collect massive amounts of data on individuals.)

Fortunately, the tools for building ABMs seem to be keeping pace with the ambition of the modelers. There are now dozens of platforms for building ABMs, and it is somewhat surprising that NetLogo, a tool with some whimsical terminology (e.g. agents are called turtles) that was designed for teaching children, has apparently become a de facto standard. NetLogo is Java based, has an intuitive GUI, ships with dozens of useful sample models, is easy to program, and is available under the GPL 2 license.

As you might expect, R is a perfect complement for NetLogo. Doing serious simulation work requires a considerable amount of statistics for calibrating models, designing experiments, performing sensitivity analyses, reducing data, exploring the results of simulation runs and much more. The recent *JASSS* paper *Facilitating Parameter Estimation and Sensitivity Analysis of Agent-Based Models: A Cookbook Using NetLogo and R* by Thiele and his collaborators describes the R / NetLogo relationship in great detail and points to a decade's worth of reading. But the real fun is that Thiele's RNetLogo package lets you jump in and start analyzing NetLogo models in a matter of minutes.

Here is part of an extended example from Thiele's *JSS* paper that shows R interacting with the Fire model that ships with NetLogo. Using some very simple logic, Fire models the progress of a forest fire.

Snippet of NetLogo Code that drives the Fire model

```
to go
  if not any? turtles  ;; either fires or embers
    [ stop ]
  ask fires
    [ ask neighbors4 with [pcolor = green]
        [ ignite ]
      set breed embers ]
  fade-embers
  tick
end

;; creates the fire turtles
to ignite  ;; patch procedure
  sprout-fires 1 [ set color red ]
  set pcolor black
  set burned-trees burned-trees + 1
end
```

The general idea is that turtles represent the frontier of the fire as it burns through a grid of randomly placed trees. Not shown in the snippet above is the logic that makes the entire model depend on a single parameter representing the density of the trees.
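The underlying percolation idea is simple enough to mimic in plain R. The following is a hypothetical sketch, not the NetLogo implementation: trees are placed on a square grid with probability density/100, fire starts in the left column, and each burning cell ignites its four unburned tree neighbors on every tick.

```r
# Minimal pure-R sketch of the Fire model's percolation logic
# (an illustration only, not the NetLogo code)
simulate_fire <- function(density, n = 50) {
  grid <- matrix(runif(n * n) < density / 100, n, n)  # TRUE = tree
  burning <- matrix(FALSE, n, n)
  burning[, 1] <- grid[, 1]          # ignite the left edge
  burned <- burning
  while (any(burning)) {
    frontier <- matrix(FALSE, n, n)
    idx <- which(burning, arr.ind = TRUE)
    for (k in seq_len(nrow(idx))) {
      r <- idx[k, 1]; c <- idx[k, 2]
      for (step in list(c(-1, 0), c(1, 0), c(0, -1), c(0, 1))) {
        rr <- r + step[1]; cc <- c + step[2]
        if (rr >= 1 && rr <= n && cc >= 1 && cc <= n &&
            grid[rr, cc] && !burned[rr, cc])
          frontier[rr, cc] <- TRUE
      }
    }
    burned <- burned | frontier      # frontier becomes embers
    burning <- frontier
  }
  100 * sum(burned) / max(sum(grid), 1)  # percent of trees burned
}

set.seed(42)
simulate_fire(70)  # at high density, most of the forest burns
```

Even this toy version reproduces the qualitative behavior: at low densities the fire dies out quickly, while at high densities it sweeps across the grid.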

This next bit of R code shows how to launch the Fire model from R, set the density parameter, and run the model.

```r
# Launch RNetLogo and control an initial run of the
# NetLogo Fire model
library(RNetLogo)
nlDir <- "C:/Program Files (x86)/NetLogo 5.0.5"
setwd(nlDir)
nl.path <- getwd()
NLStart(nl.path)
model.path <- file.path("models", "Sample Models",
                        "Earth Science", "Fire.nlogo")
NLLoadModel(file.path(nl.path, model.path))
NLCommand("set density 70")  # set density value
NLCommand("setup")           # call the setup routine
NLCommand("go")              # launch the model from R
```

Here we see the Fire model running in the NetLogo GUI after it was launched from RStudio.

This next bit of code tracks the progression of the fire as a function of time (model "ticks"), returns results to R and plots them. The plot shows the non-linear behavior of the system.

```r
# Investigate the percentage of forest burned as the
# simulation proceeds, and plot the result
library(ggplot2)
NLCommand("set density 60")
NLCommand("setup")
burned <- NLDoReportWhile("any? turtles", "go",
                          c("ticks", "(burned-trees / initial-trees) * 100"),
                          as.data.frame = TRUE,
                          df.col.names = c("tick", "percent.burned"))
# Plot with ggplot2
p <- ggplot(burned, aes(x = tick, y = percent.burned))
p + geom_line() +
  ggtitle("Non-linear forest fire progression with density = 60")
```

As with many dynamical systems, the Fire model displays a phase transition. Setting the density lower than 55 will not result in the complete destruction of the forest, while setting density above 75 will very likely result in complete destruction. The following plot shows this behavior.

RNetLogo makes it very easy to programmatically run multiple simulations and capture the results for analysis in R. The following two lines of code run the Fire model twenty times for each value of density between 55 and 65, the region surrounding the phase transition.

```r
d <- seq(55, 65, 1)    # vector of densities to examine
res <- rep.sim(d, 20)  # run the simulation 20 times per density
```
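Note that rep.sim() is a helper defined in the code accompanying Thiele's *JSS* paper, not part of RNetLogo itself. For readers without that code handy, a rough sketch of what such a wrapper might look like (the actual definition in the paper may differ) is:

```r
# Hypothetical sketch of a rep.sim()-style wrapper; the real
# function is defined in the code accompanying Thiele's JSS paper.
# Assumes the Fire model has already been loaded as above.
rep.sim <- function(densities, reps) {
  do.call(rbind, lapply(densities, function(d) {
    percent.burned <- replicate(reps, {
      NLCommand("set density", d)
      NLCommand("setup")
      NLDoCommandWhile("any? turtles", "go")   # run until the fire dies out
      NLReport("(burned-trees / initial-trees) * 100")
    })
    data.frame(density = d, percent.burned = percent.burned)
  }))
}
```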

The plot below shows the variability of the percent of trees burned as a function of density in the transition region.
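A boxplot of percent burned grouped by density is one way to produce such a plot. Assuming res is a data frame with density and percent.burned columns (as in the sketch of rep.sim() above), something like the following would do it:

```r
# Boxplots of percent burned by density; assumes res has
# columns density and percent.burned
library(ggplot2)
ggplot(res, aes(x = factor(density), y = percent.burned)) +
  geom_boxplot() +
  labs(x = "density", y = "percent burned",
       title = "Variability of burn-off near the phase transition")
```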

My code to generate the plots is available in the file Download NelLogo_blog, while all of the code from Thiele's *JSS* paper is available from the journal website.

Finally, here are a few more interesting links related to ABMs.

- On validating ABMs
- ABMs and