The following post by Norm Matloff originally appeared on his blog, Mad (Data) Scientist, on September 15th. We rarely republish posts that have appeared on other blogs. However, the questions that Norm raises about the teaching of statistics, and his assertion that "R's statistical procedures are centered far too much on significance testing," deserve a second look. Moreover, Norm's post elicited quite a few comments, many of them at a high level of discourse. At the bottom of this post we have included excerpts from exchanges with statistician Mervyn Thomas and with philosopher of science Deborah Mayo. It is well worth reading the full threads of these exchanges, as well as those associated with a number of other comments. Norm has been a contributor to the Revolutions Blog in the past. We thank him for permission to republish his post. (Guest post editor, Joseph Rickert).
by Norm Matloff
My posting about the statistics profession losing ground to computer science drew many comments, not only here in Mad (Data) Scientist, but also in the co-posting at Revolution Analytics, and in Slashdot. One of the themes in those comments was that Statistics Departments are out of touch and have failed to modernize their curricula. Though I may disagree with the commenters’ definitions of “modern,” I have in fact long felt that there are indeed serious problems in statistics curricula.
I must clarify before continuing that I do NOT advocate that, to paraphrase Shakespeare, “First thing we do, we kill all the theoreticians.” A precise mathematical understanding of the concepts is crucial to good applications. But stat curricula are not realistic.
I’ll use Student t-tests to illustrate. (This is material from my open-source book on probability and statistics.) The t-test is an exemplar for the curricular ills in three separate senses:
Significance testing has long been known to be underinformative at best, and highly misleading at worst. Yet it is the core of almost any applied stat course. Why are we still teaching — actually highlighting — a method that is recognized to be harmful?
We prescribe the use of the t-test in situations in which the sampled population has an exact normal distribution, when we know full well that there is no such animal. All real-life random variables are bounded (as opposed to the infinite-support normal distributions) and discrete (unlike the continuous normal family). [Clarification, added 9/17: I advocate skipping the t-distribution, and going directly to inference based on the Central Limit Theorem. Same for regression. See my book.]
Going hand-in-hand with the t-test is the sample variance. The classic quantity s² is an unbiased estimate of the population variance σ², with s² defined as 1/(n-1) times the sum of squares of our data relative to the sample mean. The concept of unbiasedness does have a place, yes, but in this case there really is no point to dividing by n-1 rather than n. Indeed, even if we do divide by n-1, it is easily shown that the quantity that we actually need, s rather than s², is a BIASED (downward) estimate of σ. So that n-1 factor is much ado about nothing.
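A quick simulation makes the claim about s concrete. The sketch below is only an illustration: the normal population, σ = 1, n = 5 and the seed are all assumptions chosen for the example, not values from the book.

```r
# Simulation sketch: s = sqrt(s^2) underestimates sigma on average,
# even though s^2 (with the n - 1 divisor) is unbiased for sigma^2.
set.seed(2014)        # arbitrary seed, for reproducibility only
n     <- 5            # illustrative sample size
sigma <- 1            # illustrative population standard deviation
nreps <- 20000        # number of simulated samples

# sd() uses the n - 1 divisor
s_vals <- replicate(nreps, sd(rnorm(n, mean = 0, sd = sigma)))

mean_s  <- mean(s_vals)      # average of s: falls below sigma
mean_s2 <- mean(s_vals^2)    # average of s^2: close to sigma^2

cat("mean of s  :", mean_s,  "\n")
cat("mean of s^2:", mean_s2, "\n")
```

With small n the shortfall in the average of s is clearly visible, while the average of s² stays near σ².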
Right from the beginning, then, in the very first course a student takes in statistics, the star of the show, the t-test, has three major problems.
Sadly, the R language largely caters to this old-fashioned, unwarranted thinking. The var() and sd() functions use that 1/(n-1) factor, for example, which is a bit of a shock to unwary students who wish to find the variance of a random variable uniformly distributed on, say, 1,2,…,10.
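To see the discrepancy for the uniform example, here is a minimal sketch in base R. The variable names are chosen for the illustration only.

```r
x <- 1:10                  # support of the discrete uniform distribution
n <- length(x)

pop_var    <- mean((x - mean(x))^2)   # population variance, divisor n
sample_var <- var(x)                  # what var() returns, divisor n - 1

pop_var      # 8.25
sample_var   # 9.166667

# Rescaling by (n - 1) / n recovers the population variance
all.equal(sample_var * (n - 1) / n, pop_var)   # TRUE
```

The student expecting 8.25, the variance of the distribution, gets 9.17 from var() instead.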
Much more importantly, R’s statistical procedures are centered far too much on significance testing. Take ks.test(), for instance; all one can do is a significance test, when it would be nice to be able to obtain a confidence band for the true cdf. Or consider log-linear models: The loglin() function is so centered on testing that the user must proactively request parameter estimates, never mind standard errors. (One can get the latter by using glm() as a workaround, but one shouldn’t have to do this.)
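As an illustration of what a confidence band for the cdf could look like, here is a rough sketch based on the Dvoretzky-Kiefer-Wolfowitz (DKW) inequality. This is one standard construction, not something ks.test() provides, and the helper name cdf_band is hypothetical.

```r
# Sketch: simultaneous confidence band for the true cdf via the
# DKW inequality, using base R only. cdf_band() is a hypothetical helper.
cdf_band <- function(x, conf = 0.95) {
  n     <- length(x)
  alpha <- 1 - conf
  eps   <- sqrt(log(2 / alpha) / (2 * n))   # DKW half-width of the band
  Fn    <- ecdf(x)                          # empirical cdf
  pts   <- sort(unique(x))
  est   <- Fn(pts)
  data.frame(x     = pts,
             lower = pmax(est - eps, 0),    # clip the band to [0, 1]
             est   = est,
             upper = pmin(est + eps, 1))
}

set.seed(1)                      # arbitrary seed, illustrative data
band <- cdf_band(rnorm(200))
head(band)
```

The band covers the whole true cdf simultaneously with probability at least the stated level, which tells the analyst how far the truth could plausibly be from a hypothesized distribution, rather than just whether a test rejects.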
I loved the suggestion by Frank Harrell in r-devel to at least remove the “star system” (asterisks of varying numbers for different p-values) from R output. A Quixotic action on Frank’s part (so of course I chimed in, in support of his point); sadly, no way would such a change be made. To be sure, R in fact is modern in many ways, but there are some problems nevertheless.
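The default is unlikely to change, but the stars can be switched off for a session through the show.signif.stars option. A minimal sketch follows; the built-in cars data set is used purely as an illustration.

```r
# Turn off significance stars in printed coefficient tables
# for the current session; keep the old setting so it can be restored.
old_opts <- options(show.signif.stars = FALSE)

fit <- lm(dist ~ speed, data = cars)   # built-in 'cars' data, illustrative only
summary(fit)                           # coefficient table prints without asterisks
```

Calling options(old_opts) afterwards restores the previous setting.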
In my blog posting cited above, I was especially worried that the stat field is not attracting enough of the “best and brightest” students. Well, any thoughtful student can see the folly of claiming the t-test to be “exact.” And if a sharp student looks closely, he/she will notice the hypocrisy of using the 1/(n-1) factor in estimating variance for comparing two general means, but NOT doing so when comparing two proportions. If unbiasedness is so vital, why not use 1/(n-1) in the proportions case, a skeptical student might ask?
Some years ago, an Israeli statistician, upon hearing me kvetch like this, said I would enjoy a book written by one of his countrymen, titled What’s Not What in Statistics. Unfortunately, I’ve never been able to find it. But a good cleanup along those lines of the way statistics is taught is long overdue.
Selected Comments
Mervyn Thomas
SEPTEMBER 16, 2014 AT 4:15 PM
I have run statistics operations in quite large public and private sector organisations, and directly supervised many masters and PhD level statisticians. The biggest problem I had with new statisticians was helping them to understand that nobody else cares about the statistics.
Of course the statistics is important, but only in so far as it helps produce solid and reliable answers to problems – or reveals that no such answers are available with current data. Nearly everybody is focussed on their own problems. The trick is producing results and reports which address those problems in a rigorous and defensible way.
In a sense, I see applied statistics as more of an engineering discipline – but one that makes careful use of rigorous analysis.
I believe that statistics departments have largely missed the boat with data science (except for a few stand out examples like Stanford), and that the reason is that many academic statisticians have failed to engage with other disciplines properly. Of course, there are very significant exceptions to that – Terry Speed for example.
One of the most telling examples of that for me is the number of times academic statisticians have asked if I or my life science collaborators could provide them with data to test an approach, without actually wanting to engage with the problem that generated the data.
Relevance comes from engagement, not from rarefied brilliance. There is no better example of that than Fisher.
Does it matter? Yes because I see other disciplines reinventing the statistical wheel – and doing it badly.
matloff
SEPTEMBER 16, 2014 AT 5:01 PM
Very interesting comments. I largely agree.
Sadly, my own campus, the University of California at Davis, illustrates your point. To me, a big issue is joint academic appointments, and to my knowledge the Statistics Dept. has none. This is especially surprising in light of the longtime (several decades) commitment of UCD to interdisciplinary research. The Stat. Dept. has even gone in the opposite direction: The Stat grad program used to be administered by a Graduate Group, a unique UCD entity in which faculty from many departments run the graduate program in a given field; yet a few years ago, the Stat. Dept. disbanded its Graduate Group. I must hasten to add that there IS good interdisciplinary work being done by Stat faculty with researchers in other fields, but still the structure is too narrow, in my view.
(My own department, Computer Science, has several appointments with other disciplines, and more important, has actually expanded the membership of its Graduate Group.)
I would say, though, that I think the biggest reason Stat (in general, not just UCD) has been losing ground to CS and other fields is not because of disinterest in applications, but rather a failure to tackle the complex, large-scale, “messy” problems that the Machine Learning crowd addresses routinely.
Mervyn Thomas
SEPTEMBER 16, 2014 AT 5:19 PM
“a failure to tackle the complex, large-scale, “messy” problems that the Machine Learning crowd addresses routinely.” Good point! I have often struggled with junior statisticians wanting to know whether or not an analysis is ‘right’ rather than fit for purpose. That’s a strange preoccupation, because in 40 years as a professional statistician I have never done a ‘correct’ analysis. Everything is predicated on assumptions which are approximations at best.

Mayo
SEPTEMBER 27, 2014 AT 9:27 PM
I reviewed the part in your book on tests vs CIs. It was quite as extreme as I’d remembered it. I’m so used to interpreting significance levels and p-values in terms of discrepancies warranted or not that I automatically have those (severity) interpretations in mind when I consider tests. Fallacies of rejection and acceptance, relativity to sample size–all dealt with, and the issues about CIs requiring testing supplements remain (especially in one-sided testing which is common). This paper covers 13 central problems with hypothesis tests, and how error statistics deals with them.
I remember many of the things I like A LOT about Matloff’s book. I’m glad he sees CIs as the way to go for variable choice (on prediction grounds) because it means that severity is relevant there too.
Norm Matloff
SEPTEMBER 28, 2014 AT 11:46 PM
Looks like a very interesting paper, Deborah (as I would have expected). I look forward to reading it. Just skimming through, though, it looks like I’ll probably have comments similar to the ones I made on Mervyn’s points.
Going back to my original post, do you at least agree that CIs are more informative than tests?
by Joseph Rickert
One of the most difficult things about R, a problem that is particularly vexing to beginners, is finding things. This is an unintended consequence of R's spectacular, but mostly uncoordinated, organic growth. The R core team does a superb job of maintaining the stability and growth of the R language itself, but the innovation engine for new functionality is largely in the hands of the global R community.
Several structures have been put in place to address various aspects of the finding-things problem. For example, Task Views represent a monumental effort to collect and classify R packages. The RSeek site is an effective tool for web searches. R-bloggers is a good place to go for R applications, and CRANberries lets you know what's new. But how do you find things that you didn't even know you were looking for? For this, the so-called "misc packages" can be very helpful. Whereas the majority of R packages are focused on a particular type of analysis, class of models, or special tool, misc packages tend to be collections of functions that facilitate common tasks. (Look below for a partial list.)
DescTools is a new entry to the misc package scene that I think could become very popular. The description for the package begins:
DescTools contains a bunch of basic statistic functions and convenience wrappers for efficiently describing data, creating specific plots, doing reports using MS Word, Excel or PowerPoint. The package's intention is to offer a toolbox, which facilitates the (notoriously time consuming) first descriptive tasks in data analysis, consisting of calculating descriptive statistics, drawing graphical summaries and reporting the results. Many of the included functions can be found scattered in other packages and other sources written partly by Titans of R.
So far, of the 380 functions in this collection, the Desc function has my attention. This function provides very nice tabular and graphic summaries of the variables in a data frame, with output that is specific to the data type. The d.pizza data frame that comes with the package has a nice mix of data types:
    head(d.pizza)
      index       date week weekday        area count rabate  price operator  driver delivery_min temperature wine_ordered wine_delivered
    1     1 2014-03-01    9       6      Camden     5   TRUE 65.655   Rhonda  Taylor         20.0        53.0            0              0
    2     2 2014-03-01    9       6 Westminster     2  FALSE 26.980   Rhonda Butcher         19.6        56.4            0              0
    3     3 2014-03-01    9       6 Westminster     3  FALSE 40.970  Allanah Butcher         17.8        36.5            0              0
    4     4 2014-03-01    9       6       Brent     2  FALSE 25.980  Allanah  Taylor         37.3          NA            0              0
    5     5 2014-03-01    9       6       Brent     5   TRUE 57.555   Rhonda  Carter         21.8        50.0            0              0
    6     6 2014-03-01    9       6      Camden     1  FALSE 13.990  Allanah  Taylor         48.7        27.0            0              0
      wrongpizza quality
    1      FALSE  medium
    2      FALSE    high
    3      FALSE    <NA>
    4      FALSE    <NA>
    5      FALSE  medium
    6      FALSE     low
Here is some of the voluminous output from the function. The data frame as a whole is summarized as follows
    'data.frame': 1209 obs. of 16 variables:
     1 $ index         : int 1 2 3 4 5 6 7 8 9 10 ...
     2 $ date          : Date, format: "2014-03-01" "2014-03-01" "2014-03-01" "2014-03-01" ...
     3 $ week          : num 9 9 9 9 9 9 9 9 9 9 ...
     4 $ weekday       : num 6 6 6 6 6 6 6 6 6 6 ...
     5 $ area          : Factor w/ 3 levels "Brent","Camden",..: 2 3 3 1 1 2 2 1 3 1 ...
     6 $ count         : int 5 2 3 2 5 1 4 NA 3 6 ...
     7 $ rabate        : logi TRUE FALSE FALSE FALSE TRUE FALSE ...
     8 $ price         : num 65.7 27 41 26 57.6 ...
     9 $ operator      : Factor w/ 3 levels "Allanah","Maria",..: 3 3 1 1 3 1 3 1 1 3 ...
    10 $ driver        : Factor w/ 7 levels "Butcher","Carpenter",..: 7 1 1 7 3 7 7 7 7 3 ...
    11 $ delivery_min  : num 20 19.6 17.8 37.3 21.8 48.7 49.3 25.6 26.4 24.3 ...
    12 $ temperature   : num 53 56.4 36.5 NA 50 27 33.9 54.8 48 54.4 ...
    13 $ wine_ordered  : int 0 0 0 0 0 0 1 NA 0 1 ...
    14 $ wine_delivered: int 0 0 0 0 0 0 1 NA 0 1 ...
    15 $ wrongpizza    : logi FALSE FALSE FALSE FALSE FALSE FALSE ...
    16 $ quality       : Ord.factor w/ 3 levels "low"<"medium"<..: 2 3 NA NA 2 1 1 3 3 2 ...
The factor variable driver gets a table and a plot.
    10 - driver (factor)

      length      n    NAs levels unique dupes
       1'209  1'204      5      7      7     y

          level freq perc cumfreq cumperc
    1 Carpenter  272 .226     272    .226
    2    Carter  234 .194     506    .420
    3    Taylor  204 .169     710    .590
    4    Hunter  156 .130     866    .719
    5    Miller  125 .104     991    .823
    6    Farmer  117 .097    1108    .920
    7   Butcher   96 .080    1204   1.000
and so does the numeric variable delivery_min.
    11 - delivery_min (numeric)

      length      n    NAs unique     0s   mean meanSE
       1'209  1'209      0    384      0 25.653  0.312

         .05    .10    .25 median    .75    .90    .95
      10.400 11.600 17.400 24.400 32.500 40.420 45.200

         rng     sd  vcoef    mad    IQR   skew   kurt
      56.800 10.843  0.423 11.268 15.100  0.611  0.095

    lowest : 8.8 (3), 8.9, 9 (3), 9.1 (5), 9.2 (3)
    highest: 61.9, 62.7, 62.9, 63.2, 65.6

    Shapiro-Wilks normality test p.value : 2.2725e-16
Pretty nice for an automatic first look at the data.
For some more R treasure hunting have a look into the following short list of misc packages.
Package 
Description 
Tools for manipulating data (No 1 package downloaded for 2013) 

Convenience wrappers for functions for manipulating strings 

One of the most popular R packages of all time: functions for data analysis, graphics, utilities and much more 

Package development tools 

The “go to” package for machine learning, classification and regression training 

Good svm implementation and other machine learning algorithms 

Tools for describing data and descriptive statistics 

Tools for plotting decision trees 

Functions for numerical analysis, linear algebra, optimization, differential equations and some special functions 

Contains different high-level graphics functions for displaying large datasets 

Relatively new package with various functions for survival data extending the methods available in the survival package. 

New this year: miscellaneous R tools to simplify working with data types and formats, including functions for working with data frames and character strings 

Some functions for Kalman filters 

Misc 3d plots including isosurfaces 

New package with utilities for producing maps 

Various programming tools like ASCIIfy() to convert characters to ASCII and checkRVersion() to see if a newer version of R is available 

A grab bag of utilities including progress bars and function timers 
by Joseph Rickert
While preparing for the DataWeek R Bootcamp that I conducted this week I came across the following gem. This code, based directly on a Max Kuhn presentation of a couple years back, compares the efficacy of two machine learning models on a training data set.
    library(caret)        # train(), trainControl(), resamples()
    library(doParallel)   # registerDoParallel(), getDoParWorkers()
    #
    # SET UP THE PARAMETER SPACE SEARCH GRID
    ctrl <- trainControl(method = "repeatedcv",              # use repeated 10-fold cross validation
                         repeats = 5,                        # do 5 repetitions of 10-fold cv
                         summaryFunction = twoClassSummary,  # use AUC to pick the best model
                         classProbs = TRUE)
    # Note that the default search grid selects 3 values of each tuning parameter
    #
    grid <- expand.grid(.interaction.depth = seq(1, 7, by = 2),  # look at tree depths from 1 to 7
                        .n.trees = seq(10, 100, by = 5),         # let iterations go from 10 to 100
                        .shrinkage = c(0.01, 0.1))               # try 2 values of the learning rate parameter
    # BOOSTED TREE MODEL
    set.seed(1)
    names(trainData)
    trainX <- trainData[, 4:61]
    registerDoParallel(4)   # register a parallel backend for train
    getDoParWorkers()
    system.time(gbm.tune <- train(x = trainX, y = trainData$Class,
                                  method = "gbm",
                                  metric = "ROC",
                                  trControl = ctrl,
                                  tuneGrid = grid,
                                  verbose = FALSE))
    #
    # SUPPORT VECTOR MACHINE MODEL
    #
    set.seed(1)
    registerDoParallel(4, cores = 4)
    getDoParWorkers()
    system.time(
      svm.tune <- train(x = trainX,
                        y = trainData$Class,
                        method = "svmRadial",
                        tuneLength = 9,                   # 9 values of the cost function
                        preProc = c("center", "scale"),
                        metric = "ROC",
                        trControl = ctrl)                 # same as for gbm above
    )
    #
    # COMPARE MODELS USING RESAMPLING
    # Having set the seed to 1 before running gbm.tune and svm.tune
    # we have generated paired samples for comparing models using resampling.
    #
    # The resamples function in caret collates the resampling results from the two models
    rValues <- resamples(list(svm = svm.tune, gbm = gbm.tune))
    rValues$values
    #
    # BOXPLOTS COMPARING RESULTS
    bwplot(rValues, metric = "ROC")   # boxplot
After setting up a grid to search the parameter space of a model, the train() function from the caret package is used to train a generalized boosted regression model (gbm) and a support vector machine (svm). Setting the seed to the same value before each call produces paired samples and enables the two models to be compared using the resampling technique described in Hothorn et al., "The design and analysis of benchmark experiments", Journal of Computational and Graphical Statistics (2005) vol 14 (3) pp 675-699.
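Why does resetting the seed give paired samples? The base R sketch below mimics the idea with a hypothetical make_folds() helper. It is not caret's internal code, and the number of rows is made up for the example.

```r
# Sketch: same seed -> same random fold assignment -> paired resamples.
# make_folds() is a hypothetical stand-in for the resampling step.
make_folds <- function(n, k = 10, repeats = 5) {
  # one column of shuffled fold labels (1..k) per repeat
  replicate(repeats, sample(rep_len(seq_len(k), n)))
}

n_obs <- 157                          # illustrative number of training rows

set.seed(1)
folds_model_a <- make_folds(n_obs)    # folds seen by the first model

set.seed(1)
folds_model_b <- make_folds(n_obs)    # folds seen by the second model

identical(folds_model_a, folds_model_b)   # TRUE: both models get the same splits
```

Because both models are evaluated on identical held-out sets, the difference in their performance on each resample can be compared directly, which is what makes the comparison paired.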
The performance metric for the comparison is the area under the ROC curve. From examining the boxplots of the sampling distributions for the two models it is apparent that, in this case, the gbm has the advantage.
Also, notice that the call to registerDoParallel() permits parallel execution of the training algorithms. (The taskbar showed all four cores of my laptop maxed out at 100% utilization.)
I chose this example because I wanted to show programmers coming to R for the first time that the power and productivity of R comes not only from the large number of machine learning models implemented, but also from the tremendous amount of infrastructure that package authors have built up, making it relatively easy to do fairly sophisticated tasks in just a few lines of code.
All of the code for this example, along with the rest of my code from the DataWeek R Bootcamp, is available on GitHub.
by Seth Mottaghinejad
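Let's review the Collatz conjecture, which says that given a positive integer n, the following recursive algorithm will always terminate: if n is 1, stop; otherwise, if n is even, replace it with n/2, and if n is odd, replace it with 3n + 1; then repeat.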
In our last post, we created a function called 'cpp_collatz', which given an integer vector, returns an integer vector of their corresponding stopping times (number of iterations for the above algorithm to reach 1). For example, when n = 5 we have
5 -> 3*5+1 = 16 -> 16/2 = 8 -> 8/2 = 4 -> 4/2 = 2 -> 2/2 = 1,
giving us a stopping time of 5 iterations.
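For readers who want to check stopping times without compiling anything, here is a plain R sketch of the same computation. The function name collatz_steps is hypothetical, and this version is far slower than 'cpp_collatz'.

```r
# Plain R sketch of the stopping-time computation (illustrative only;
# the post's 'cpp_collatz' does the same job in C++).
collatz_steps <- function(n) {
  stopifnot(n >= 1, n == round(n))
  steps <- 0
  while (n != 1) {
    n <- if (n %% 2 == 0) n / 2 else 3 * n + 1
    steps <- steps + 1
  }
  steps
}

collatz_steps(5)              # 5, matching the worked example
sapply(1:10, collatz_steps)   # 0 1 7 2 5 8 16 3 19 6
```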
In today's article, we want to perform some exploratory data analysis to see if we can find any pattern relating an integer to its stopping time. As part of the analysis, we will extract some features out of the integers that could help us explain any differences in stopping times.
Here are some examples of potentially useful features:
In case you are encountering these terms for the first time, a triangular number is any number m that can be written as m = n(n+1)/2, where n is some positive integer. To determine if a number is triangular, we can rewrite the above equation as n^2 + n - 2m = 0, and use the quadratic formula to get n = (-1 + sqrt(1 + 8m))/2 and n = (-1 - sqrt(1 + 8m))/2. Since n must be a positive integer, we ignore the latter solution, leaving us with n = (-1 + sqrt(1 + 8m))/2.
Thus, if plugging m in the above formula results in an integer, we can say that m is a triangular number. Similar rules exist to determine if an integer is square or pentagonal, but I will refer you to Wikipedia for the details.
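The triangular-number check can be sketched in a few lines of R. This variant, with the hypothetical name is_triangular_num, rounds the candidate n and then verifies it exactly, which sidesteps floating-point comparison; it is a variation on the approach described here, not the code used in the analysis.

```r
# Sketch: test whether m is triangular by solving for n, rounding,
# and checking n(n+1)/2 == m exactly.
is_triangular_num <- function(m) {
  n <- round((sqrt(8 * m + 1) - 1) / 2)   # nearest integer candidate
  n * (n + 1) / 2 == m                    # exact verification
}

is_triangular_num(c(1, 3, 6, 10, 15))   # all TRUE
is_triangular_num(c(2, 4, 5, 7))        # all FALSE
```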
For the purpose of conducting our analysis, we created some other functions in C++ and R to help us. Let's take a look at these functions:
    cat(paste(readLines(file.path(directory, "collatz.cpp")), collapse = "\n"))

    #include <Rcpp.h>
    #include <algorithm>
    #include <vector>
    using namespace Rcpp;

    // Stopping time: number of iterations needed to reach 1
    // [[Rcpp::export]]
    int collatz(int nn) {
      int ii = 0;
      while (nn != 1) {
        if (nn % 2 == 0) nn /= 2;
        else nn = 3 * nn + 1;
        ii += 1;
      }
      return ii;
    }

    // [[Rcpp::export]]
    IntegerVector cpp_collatz(IntegerVector ints) {
      return sapply(ints, collatz);
    }

    // Trial division by odd numbers up to sqrt(nn)
    // [[Rcpp::export]]
    bool is_int_prime(int nn) {
      if (nn < 1) stop("int must be greater than 0.");
      else if (nn == 1) return false;
      else if (nn % 2 == 0) return (nn == 2);
      for (int ii = 3; ii * ii <= nn; ii += 2) {
        if (nn % ii == 0) return false;
      }
      return true;
    }

    // [[Rcpp::export]]
    LogicalVector is_prime(IntegerVector ints) {
      return sapply(ints, is_int_prime);
    }

    // Returns the first n primes
    // [[Rcpp::export]]
    NumericVector gen_primes(int n) {
      if (n < 1) stop("n must be greater than 0.");
      std::vector<int> primes;
      primes.push_back(2);
      int i = 3;
      while (primes.size() < unsigned(n)) {
        if (is_int_prime(i)) primes.push_back(i);
        i += 2;
      }
      return Rcpp::wrap(primes);
    }

    // Returns the proper divisors of n (1 is always included, n itself is not,
    // so for n = 1 the result is just 1)
    // [[Rcpp::export]]
    NumericVector gen_divisors(int n) {
      if (n < 1) stop("n must be greater than 0.");
      std::vector<int> divisors;
      divisors.push_back(1);
      for (int i = 2; i <= sqrt(n); i++) {
        if (n % i == 0) {
          divisors.push_back(i);
          divisors.push_back(n / i);
        }
      }
      std::sort(divisors.begin(), divisors.end());
      divisors.erase(std::unique(divisors.begin(), divisors.end()), divisors.end());
      return Rcpp::wrap(divisors);
    }

    // A number is perfect if it equals the sum of its proper divisors
    // (as written, nn = 1 also passes this check)
    // [[Rcpp::export]]
    bool is_int_perfect(int nn) {
      if (nn < 1) stop("int must be greater than 0.");
      return nn == sum(gen_divisors(nn));
    }

    // [[Rcpp::export]]
    LogicalVector is_perfect(IntegerVector ints) {
      return sapply(ints, is_int_perfect);
    }
Here is a list of other helper functions in collatz.cpp:
    sum_digits <- function(x) {
      # returns the sum of the individual digits that make up x
      # x must be an integer vector
      f <- function(xx) sum(as.integer(strsplit(as.character(xx), "")[[1]]))
      sapply(x, f)
    }
As you can guess, many of the above functions (such as 'is_prime' and 'gen_divisors') rely on loops, which makes C++ the ideal place to perform the computation. So we farmed out as much of the heavy-duty computation as possible to C++, leaving R with the task of processing and analyzing the resulting data.
Let's get started. We will perform the analysis on all integers up to 10^5, since R is memory-bound and we can run into a bottleneck quickly. But next time, we will show you how to overcome this limitation using the 'RevoScaleR' package, which will allow us to scale the analysis to much larger integers.
One small caveat before we start: I enjoy dabbling in mathematics but I know very little about number theory. The analysis we are about to perform is not meant to be rigorous. Instead, we will attempt to approach the problem using EDA the same way that we approach any data-driven problem.
    maxint <- 10^5
    df <- data.frame(int = 1:maxint)             # the original number
    df <- transform(df, st = cpp_collatz(int))   # stopping times
    df <- transform(df,
      sum_digits = sum_digits(int),
      inv_triangular = (sqrt(8*int + 1) - 1)/2,    # inverse triangular number
      inv_square = sqrt(int),                      # inverse square number
      inv_pentagonal = (sqrt(24*int + 1) + 1)/6    # inverse pentagonal number
    )
To determine if a numeric number is an integer or not, we need to be careful about not using the '==' operator in R, as it is not guaranteed to work, because of minute rounding errors that may occur. Here's an example:
    .3 == .4 - .1 # we expect to get TRUE but get FALSE instead
[1] FALSE
The solution is to check whether the absolute difference between the above numbers is smaller than some tolerance threshold.
    eps <- 1e-9
    abs(.3 - (.4 - .1)) < eps # returns TRUE
[1] TRUE
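Base R also ships a tolerance-based comparison, all.equal(). The sketch below shows it next to a small hypothetical wrapper, is_close(), that packages the threshold check from above.

```r
# Built-in alternative: all.equal() compares with a numeric tolerance.
isTRUE(all.equal(.3, .4 - .1))            # TRUE

# Hypothetical helper wrapping the threshold check
is_close <- function(a, b, eps = 1e-9) abs(a - b) < eps

is_close(.3, .4 - .1)                     # TRUE
is_close(sqrt(49), as.integer(sqrt(49)))  # TRUE:  7 is a whole number
is_close(2.5, as.integer(2.5))            # FALSE: 2.5 is not
```

Note that all.equal() returns a character description rather than FALSE when the values differ, which is why it is wrapped in isTRUE().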
    df <- transform(df,
      is_triangular = abs(inv_triangular - as.integer(inv_triangular)) < eps,
      is_square = abs(inv_square - as.integer(inv_square)) < eps,
      is_pentagonal = abs(inv_pentagonal - as.integer(inv_pentagonal)) < eps,
      is_prime = is_prime(int),
      is_perfect = is_perfect(int)
    )
    df <- df[ , names(df)[-grep("^inv_", names(df))]]   # drop the helper inv_ columns
Finally, we will create a variable listing all of the integer's proper prime divisors. Every composite integer can be reconstructed out of these basic building blocks, a mathematical result known as the 'unique factorization theorem'. We can use the function 'gen_divisors' to get a vector of an integer's proper divisors, and the 'is_prime' function to keep only the ones that are prime. Finally, because the return object must be a singleton, we can use 'paste' with the 'collapse' argument to join all of the prime divisors into a single comma-separated string.
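As a rough illustration of this step, here is a pure R sketch that produces the same kind of comma-separated string. It uses trial division instead of the C++ helpers 'gen_divisors' and 'is_prime', and the function name prime_divs_string is hypothetical.

```r
# Sketch: comma-separated string of the proper prime divisors of n.
# Uses trial division in plain R; a prime n has no proper prime divisors,
# so it maps to the empty string.
prime_divs_string <- function(n) {
  stopifnot(n >= 1)
  original <- n
  divs <- numeric(0)
  p <- 2
  while (p * p <= n) {
    if (n %% p == 0) {
      divs <- c(divs, p)
      while (n %% p == 0) n <- n %/% p   # strip out this prime factor
    }
    p <- p + 1
  }
  # whatever remains is a prime factor, unless it is the input itself
  if (n > 1 && n != original) divs <- c(divs, n)
  paste(divs, collapse = ",")
}

prime_divs_string(36084)                       # "2,3,31,97"
sapply(c(21721, 3374, 7), prime_divs_string)   # "7,29,107" "2,7,241" ""
```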
Lastly, on its own, we may not find the variable 'all_prime_divs' especially helpful. Instead, we can generate multiple flag variables out of it indicating whether or not a specific prime number is a divisor for the integer. We will generate 25 flag variables, one for each of the first 25 prime numbers.
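One way to build such flags is sketched below on a toy data frame. The sketch assumes the flags are computed directly from divisibility rather than by parsing 'all_prime_divs', and it hard-codes only the first five primes; toy_df and the loop are illustrative, not the post's code.

```r
# Sketch: one logical flag column per prime, named is_div_by_<p>.
primes <- c(2, 3, 5, 7, 11)          # first few primes, hard-coded for the sketch
toy_df <- data.frame(int = 1:12)     # toy stand-in for the full data frame

for (p in primes) {
  toy_df[[paste0("is_div_by_", p)]] <- toy_df$int %% p == 0
}

names(toy_df)
toy_df[toy_df$is_div_by_3, "int"]    # 3 6 9 12
```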
There are many more features that we can extract from the underlying integers, but we will stop here. As we mentioned earlier, our goal is not to provide a rigorous mathematical work, but show you how the tools of data analysis can be brought to bear to tackle a problem of such nature.
Here's a sample of 10 rows from the data:
    df[sample.int(nrow(df), 10), ]
            int  st sum_digits is_triangular is_square is_pentagonal is_prime
    21721 21721 162         13         FALSE     FALSE         FALSE    FALSE
    36084 36084 142         21         FALSE     FALSE         FALSE    FALSE
    40793 40793 119         23         FALSE     FALSE         FALSE    FALSE
    3374   3374  43         17         FALSE     FALSE         FALSE    FALSE
    48257 48257  44         26         FALSE     FALSE         FALSE    FALSE
    42906 42906  49         21         FALSE     FALSE         FALSE    FALSE
    37283 37283  62         23         FALSE     FALSE         FALSE    FALSE
    55156 55156  60         22         FALSE     FALSE         FALSE    FALSE
    6169   6169 111         22         FALSE     FALSE         FALSE    FALSE
    77694 77694 231         33         FALSE     FALSE         FALSE    FALSE
          is_perfect all_prime_divs is_div_by_2 is_div_by_3 is_div_by_5 is_div_by_7
    21721      FALSE       7,29,107       FALSE       FALSE       FALSE        TRUE
    36084      FALSE      2,3,31,97        TRUE        TRUE       FALSE       FALSE
    40793      FALSE         19,113       FALSE       FALSE       FALSE       FALSE
    3374       FALSE        2,7,241        TRUE       FALSE       FALSE        TRUE
    48257      FALSE      11,41,107       FALSE       FALSE       FALSE       FALSE
    42906      FALSE       2,3,7151        TRUE        TRUE       FALSE       FALSE
    37283      FALSE        23,1621       FALSE       FALSE       FALSE       FALSE
    55156      FALSE        2,13789        TRUE       FALSE       FALSE       FALSE
    6169       FALSE         31,199       FALSE       FALSE       FALSE       FALSE
    77694      FALSE     2,3,23,563        TRUE        TRUE       FALSE       FALSE
          is_div_by_11 is_div_by_13 is_div_by_17 is_div_by_19 is_div_by_23
    21721        FALSE        FALSE        FALSE        FALSE        FALSE
    36084        FALSE        FALSE        FALSE        FALSE        FALSE
    40793        FALSE        FALSE        FALSE         TRUE        FALSE
    3374         FALSE        FALSE        FALSE        FALSE        FALSE
    48257         TRUE        FALSE        FALSE        FALSE        FALSE
    42906        FALSE        FALSE        FALSE        FALSE        FALSE
    37283        FALSE        FALSE        FALSE        FALSE         TRUE
    55156        FALSE        FALSE        FALSE        FALSE        FALSE
    6169         FALSE        FALSE        FALSE        FALSE        FALSE
    77694        FALSE        FALSE        FALSE        FALSE         TRUE
          is_div_by_29 is_div_by_31 is_div_by_37 is_div_by_41 is_div_by_43
    21721         TRUE        FALSE        FALSE        FALSE        FALSE
    36084        FALSE         TRUE        FALSE        FALSE        FALSE
    40793        FALSE        FALSE        FALSE        FALSE        FALSE
    3374         FALSE        FALSE        FALSE        FALSE        FALSE
    48257        FALSE        FALSE        FALSE         TRUE        FALSE
    42906        FALSE        FALSE        FALSE        FALSE        FALSE
    37283        FALSE        FALSE        FALSE        FALSE        FALSE
    55156        FALSE        FALSE        FALSE        FALSE        FALSE
    6169         FALSE         TRUE        FALSE        FALSE        FALSE
    77694        FALSE        FALSE        FALSE        FALSE        FALSE
          is_div_by_47 is_div_by_53 is_div_by_59 is_div_by_61 is_div_by_67
    21721        FALSE        FALSE        FALSE        FALSE        FALSE
    36084        FALSE        FALSE        FALSE        FALSE        FALSE
    40793        FALSE        FALSE        FALSE        FALSE        FALSE
    3374         FALSE        FALSE        FALSE        FALSE        FALSE
    48257        FALSE        FALSE        FALSE        FALSE        FALSE
    42906        FALSE        FALSE        FALSE        FALSE        FALSE
    37283        FALSE        FALSE        FALSE        FALSE        FALSE
    55156        FALSE        FALSE        FALSE        FALSE        FALSE
    6169         FALSE        FALSE        FALSE        FALSE        FALSE
    77694        FALSE        FALSE        FALSE        FALSE        FALSE
          is_div_by_71 is_div_by_73 is_div_by_79 is_div_by_83 is_div_by_89
    21721        FALSE        FALSE        FALSE        FALSE        FALSE
    36084        FALSE        FALSE        FALSE        FALSE        FALSE
    40793        FALSE        FALSE        FALSE        FALSE        FALSE
    3374         FALSE        FALSE        FALSE        FALSE        FALSE
    48257        FALSE        FALSE        FALSE        FALSE        FALSE
    42906        FALSE        FALSE        FALSE        FALSE        FALSE
    37283        FALSE        FALSE        FALSE        FALSE        FALSE
    55156        FALSE        FALSE        FALSE        FALSE        FALSE
    6169         FALSE        FALSE        FALSE        FALSE        FALSE
    77694        FALSE        FALSE        FALSE        FALSE        FALSE
          is_div_by_97
    21721        FALSE
    36084         TRUE
    40793        FALSE
    3374         FALSE
    48257        FALSE
    42906        FALSE
    37283        FALSE
    55156        FALSE
    6169         FALSE
    77694        FALSE
We can now move on to looking at various statistical summaries to see if we notice any differences between the stopping times (our response variable) when we break up the data in different ways. We will look at the counts, mean, median, standard deviation, and trimmed mean (after throwing out the highest 10 percent) of the stopping times, as well as the correlation between stopping times and the integers. This is by no means a comprehensive list, but it can serve as guidance for deciding which direction to go next.
    my_summary <- function(df) {
      with(df, data.frame(
        count = length(st),
        mean_st = mean(st),
        median_st = median(st),
        tmean_st = mean(st[st < quantile(st, .9)]),   # mean after dropping the top 10 percent
        sd_st = sd(st),
        cor_st_int = cor(st, int, method = "spearman")
      ))
    }
To create above summaries broken up by the flag variables in the data, we will use the 'ddply' function in the 'plyr' package. For example, the following will give us the summaries asked for in 'my_summary', grouped by 'is_prime'.
To avoid having to manually type every formula, we can pull the flag variables from the data set, generate the strings that will make up the formula, wrap it inside 'as.formula' and pass it to 'ddply'.
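The formula-building trick does not depend on 'plyr'. The sketch below shows the same pattern with base R's aggregate() on a made-up toy data frame; toy_df, its is_even column and the use of the mean alone are assumptions for the illustration.

```r
# Sketch: build one formula per flag variable from its name,
# then summarise by group. Base R's aggregate() stands in for ddply here.
toy_df <- data.frame(
  st       = c(0, 1, 7, 2, 5, 8),                          # stopping times of 1..6
  is_prime = c(FALSE, TRUE, TRUE, FALSE, TRUE, FALSE),
  is_even  = c(FALSE, TRUE, FALSE, TRUE, FALSE, TRUE)
)

flags <- names(toy_df)[grep("^is_", names(toy_df))]

res <- lapply(flags, function(nm) {
  f <- as.formula(sprintf("st ~ %s", nm))    # e.g. st ~ is_prime
  aggregate(f, data = toy_df, FUN = mean)    # mean stopping time per group
})
names(res) <- flags

res$is_prime
```

Adding a new flag column to the data frame is then enough for it to be picked up automatically, with no formula to type by hand.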
flags < names(df)[grep("^is_", names(df))] res < lapply(flags, function(nm) ddply(df, as.formula(sprintf('~ %s', nm)), my_summary)) names(res) < flags res $is_triangular is_triangular count mean_st median_st tmean_st sd_st cor_st_int 1 FALSE 99554 107.58511 99 96.23178 51.36153 0.1700491 2 TRUE 446 97.11211 94 85.97243 51.34051 0.3035063 $is_square is_square count mean_st median_st tmean_st sd_st cor_st_int 1 FALSE 99684 107.59743 99 96.25033 51.34206 0.1696582 2 TRUE 316 88.91772 71 77.29577 55.43725 0.4274504 $is_pentagonal is_pentagonal count mean_st median_st tmean_st sd_st cor_st_int 1 FALSE 99742 107.56948 99.0 96.22466 51.35580 0.1702560 2 TRUE 258 95.52326 83.5 83.86638 53.91514 0.3336478 $is_prime is_prime count mean_st median_st tmean_st sd_st cor_st_int 1 FALSE 90408 107.1227 99 95.95392 51.29145 0.1693013 2 TRUE 9592 111.4569 103 100.39643 51.90186 0.1953000 $is_perfect is_perfect count mean_st median_st tmean_st sd_st cor_st_int 1 FALSE 99995 107.5419 99 96.20092 51.36403 0.1707839 2 TRUE 5 37.6000 18 19.50000 45.06440 0.9000000 $is_div_by_2 is_div_by_2 count mean_st median_st tmean_st sd_st cor_st_int 1 FALSE 50000 113.5745 106 102.58324 52.25180 0.1720108 2 TRUE 50000 101.5023 94 90.68234 49.73778 0.1737786 $is_div_by_3 is_div_by_3 count mean_st median_st tmean_st sd_st cor_st_int 1 FALSE 66667 107.4415 99 96.15832 51.30838 0.1705745 2 TRUE 33333 107.7322 99 96.94426 51.48104 0.1714432 $is_div_by_5 is_div_by_5 count mean_st median_st tmean_st sd_st cor_st_int 1 FALSE 80000 107.5676 99 96.21239 51.36466 0.1713552 2 TRUE 20000 107.4214 99 96.13873 51.37206 0.1688929 . . . $is_div_by_89 is_div_by_89 count mean_st median_st tmean_st sd_st cor_st_int 1 FALSE 98877 107.5381 99 96.19965 51.37062 0.1707754 2 TRUE 1123 107.5628 97 96.02096 50.97273 0.1803568 $is_div_by_97 is_div_by_97 count mean_st median_st tmean_st sd_st cor_st_int 1 FALSE 98970 107.4944 99 96.15365 51.36880 0.17187453 2 TRUE 1030 111.7660 106 101.35853 50.93593 0.07843996
As we can see from the above results, most comparisons appear to be nonsignificant (although, in mathematics, even tiny differences can be meaningful, therefore we will avoid relying on statistical significance here). Here's a summary of trends that stand out as we go over the results:
This last item could just be a restatement of item 1, however. Let's take a closer look:
ddply(df, ~ is_prime + is_div_by_2, my_summary)
  is_prime is_div_by_2 count  mean_st median_st  tmean_st    sd_st cor_st_int
1    FALSE       FALSE 40409 114.0744       106 102.91178 52.32495  0.1658988
2    FALSE        TRUE 49999 101.5043        94  90.68434 49.73624  0.1737291
3     TRUE       FALSE  9591 111.4685       103 100.40796 51.89231  0.1950482
4     TRUE        TRUE     1   1.0000         1       NaN       NA         NA
When limiting the analysis to odd numbers only, prime numbers now have a lower average stopping time compared to composite numbers, which reverses the trend seen in 1.
And one final point: Since having a prime number as a proper divisor is a proxy for being a composite number, it is difficult to read too much into whether divisibility by a specific prime number affects the average stopping time. But no specific prime number stands out in particular. Once again, a larger sample size would give us more confidence about the results.
The above analysis is perhaps too crude and primitive to offer any significant lead, so here are some possible improvements:
In the next post, we will see how we can use the 'RevoScaleR' package to go from an R 'data.frame' (memorybound) to an external 'data.frame' or XDF file for short. In doing so, we will achieve the following improvements:
by Joseph Rickert
The ancient Egyptians were a people with long memories. The lists of their pharaohs went back thousands of years, and we still have the names and tax assessments for certain persons and institutions from the time of Ramesses II. When Herodotus began writing about Egypt and the Nile (~ 450 BC), the Egyptians, who knew that their prosperity depended on the river’s annual overflow, had been keeping records of the Nile’s high water mark for more than three millennia. So, it seems reasonable, and maybe even appropriate, that one of the first attempts to understand long memory in time series was motivated by the Nile.
The story of British hydrologist and civil servant H.E. Hurst, who earned the nickname “Abu Nil”, Father of the Nile, for his 62-year career of measuring and studying the river, is now fairly well known. Pondering an 847-year record of Nile overflow data, Hurst noticed that the series was persistent in the sense that heavy flood years tended to be followed by heavier than average flood years while below average flood years were typically followed by light flood years. Working from an ancient formula on optimal dam design he devised the equation: log(R/S) = K*log(N/2) where R is the range of the time series, S is the standard deviation of year-to-year flood measurements and N is the number of years in the series. Hurst noticed that the value of K for the Nile series and other series related to climate phenomena tended to be about 0.73, consistently much higher than the 0.5 value that one would expect from independent observations and short autocorrelations.
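As a rough illustration of Hurst's equation, the following R sketch solves it for K. It assumes that R is the range of the cumulative deviations from the mean (the usual rescaled range convention, which the description above does not spell out), and it uses R's built-in Nile data set (annual flow at Aswan, 1871 to 1970) as a stand-in for Hurst's own records. The function name simple_rs_hurst is hypothetical.

```r
# Sketch: solve log(R/S) = K * log(N/2) for K.
# Assumption: R is the range of the cumulative sums of deviations from the mean.
simple_rs_hurst <- function(x) {
  x <- as.numeric(x)
  n <- length(x)
  dev <- cumsum(x - mean(x))   # cumulative departures from the mean
  r <- max(dev) - min(dev)     # range R
  s <- sd(x)                   # standard deviation S
  log(r / s) / log(n / 2)      # estimate of K
}

# Stand-in data: the built-in annual Nile series, not Hurst's 847-year record
k_nile <- simple_rs_hurst(datasets::Nile)
print(k_nile)
```

A single whole-series estimate like this is crude; it is meant only to make the roles of R, S and N concrete.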
Today, mostly due to the work of Benoit Mandelbrot, who rediscovered and popularized Hurst’s work in the early 1960s, Hurst’s Rescaled Range Analysis and the calculation of the Hurst exponent (Mandelbrot renamed “K” to “H”) are the demarcation point for the modern study of Long Memory Time Series. To investigate, let’s look at some monthly flow data taken at the Dongola measurement station that is just upstream from the high dam at Aswan. (Look here for the data, and here for a map and a nice Python-based analysis that covers some of the same ground as is presented below.) The data consists of average monthly flow measurements from January 1869 to December 1984.
head(nile_dat)
  Year  Jan  Feb  Mar Apr May  Jun  Jul   Aug   Sep  Oct  Nov  Dec
1 1869   NA   NA   NA  NA  NA   NA 2606  8885 10918 8699   NA   NA
2 1870   NA   NA 1146 682 545  540 3027 10304 10802 8288 5709 3576
3 1871 2606 1992 1485 933 731  760 2475  8960  9953 6571 3522 2419
4 1872 1672 1033  728 605 560  879 3121  8811 10532 7952 4976 3102
5 1873 2187 1851 1235 756 556 1392 2296  7093  8410 5675 3070 2049
6 1874 1340  847  664 516 466  964 3061 10790 11805 8064 4282 2904
To get a feel for the data we plot a portion of the time series.
The pattern is very regular and the short term correlations are apparent. The following boxplots show the variation in monthly flow.
Herodotus clearly knew what he was talking about when he wrote (The Histories: Book 2, 19):
I was particularly eager to find out from them (the Egyptian priests) why the Nile starts coming down in a flood at the summer solstice and continues flooding for a hundred days, but when the hundred days are over the water starts to recede and decreases in volume, with the result that it remains low for the whole winter, until the summer solstice comes round again.
To construct a long memory time series we aggregate the monthly flows to produce a yearly time series of total flow (dropping the years 1869 and 1870 because of the missing values).
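The aggregation step might look something like the sketch below. It rebuilds only the first four rows of the monthly table shown earlier, drops the incomplete years, and sums across the twelve month columns. The object names nile_head and nile_yr_sketch are hypothetical, and using complete.cases() to drop 1869 and 1870 is an assumption about how the years were removed.

```r
# First four rows of the monthly Dongola table shown above
nile_head <- data.frame(
  Year = 1869:1872,
  Jan = c(NA, NA, 2606, 1672),       Feb = c(NA, NA, 1992, 1033),
  Mar = c(NA, 1146, 1485, 728),      Apr = c(NA, 682, 933, 605),
  May = c(NA, 545, 731, 560),        Jun = c(NA, 540, 760, 879),
  Jul = c(2606, 3027, 2475, 3121),   Aug = c(8885, 10304, 8960, 8811),
  Sep = c(10918, 10802, 9953, 10532), Oct = c(8699, 8288, 6571, 7952),
  Nov = c(NA, 5709, 3522, 4976),     Dec = c(NA, 3576, 2419, 3102)
)

# Keep only years with all twelve months observed (drops 1869 and 1870)
complete <- complete.cases(nile_head)

# Total yearly flow as an annual time series; month.abb is "Jan", ..., "Dec"
nile_yr_sketch <- ts(rowSums(nile_head[complete, month.abb]),
                     start = min(nile_head$Year[complete]))
print(nile_yr_sketch)
```

With the full data set the same two steps would apply unchanged, since every year after 1870 in the table is complete.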
Plotting the ACF, we see that the autocorrelations persist for nearly 20 years!
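For readers who want to try this kind of check without the Dongola file, here is a small sketch using R's built-in Nile series (annual flow at Aswan, 1871 to 1970) as a stand-in. It is not the data used in this post, so the lags it flags need not match the plot described above. The white-noise bound is the usual approximate 95% band.

```r
# Stand-in data: R's built-in annual Nile flow series (not the Dongola data)
nile_acf <- acf(datasets::Nile, lag.max = 30, plot = FALSE)

# Approximate 95% bound for the autocorrelations of white noise
n_obs <- length(datasets::Nile)
bound <- qnorm(0.975) / sqrt(n_obs)

# Lags (in years) whose sample autocorrelation falls outside the bound
lags <- as.vector(nile_acf$lag)
vals <- as.vector(nile_acf$acf)
print(lags[lags > 0 & abs(vals) > bound])

# Draw the correlogram only in an interactive session
if (interactive()) plot(nile_acf, main = "ACF of annual Nile flow (built-in data)")
```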
So, let's compute the Hurst exponent. For our first try, we use a simple function suggested by an example in Bernard Pfaff's classic text: Analysis of Integrated and Cointegrated Time Series with R.
Bingo! 0.73, just what we were expecting for a long memory time series. Unfortunately, things are not so simple. The function hurst() from the pracma package, which is a much more careful calculation than simpleHurst(), yields:
hurst(nile.yr.ts)
[1] 1.041862
This is mildly distressing since H is supposed to be bounded above by 1. The function hurstexp() from the same package, which is based on Weron's MATLAB code and implements the small sample correction, seems to solve that problem.
> hurstexp(nile.yr.ts)
Corrected R over S Hurst exponent: 1.041862
Theoretical Hurst exponent: 0.5244148
Corrected empirical Hurst exponent: 0.7136607
Empirical Hurst exponent: 0.6975531
0.71 is a more reasonable result. However, as a post on the Timely Portfolio blog pointed out a few years ago, computing the Hurst exponent is an estimation problem, not merely a calculation. So, where are the error bars?
I am afraid that confidence intervals and a look at several other methods available in R for estimating the Hurst exponent will have to wait for another time. In the meantime, the following references may be of interest. The first two are light reading from early champions of applying Rescaled Range analysis and the Hurst exponent to financial time series. The book by Mandelbrot and Hudson is especially interesting for its sketch of the historical background. The last two are relatively early papers on the subject.
My code for this post may be downloaded here: Long_memory_Post.
by Norman Matloff
The American Statistical Association (ASA) leadership, and many in Statistics academia, have been undergoing a period of angst the last few years. They worry that the field of Statistics is headed for a future of reduced national influence and importance, with the feeling that:
I had been aware of these issues for quite a while, and thus was pleasantly surprised last year to see then-ASA president Marie Davidson write a plaintive editorial titled, “Aren’t We Data Science?”
Good, the ASA is taking action, I thought. But even then I was startled to learn during JSM 2014 (a conference tellingly titled “Statistics: Global Impact, Past, Present and Future”) that the ASA leadership is so concerned about these problems that it has now retained a PR firm.
This is probably a wise move–most large institutions engage in extensive PR in one way or another–but it is a sad statement about how complacent the profession has become. Indeed, it can be argued that the action is long overdue; as a friend of mine put it, “They [the statistical profession] lost the PR war because they never fought it.”
In this post, I’ll tell you the rest of the story, as I see it, viewing events as a statistician, computer scientist and R activist.
CS vs. Statistics
Let’s consider the CS issue first. Recently a number of new terms have arisen, such as data science, Big Data, and analytics, and the popularity of the term machine learning has grown rapidly. To many of us, though, this is just “old wine in new bottles,” with the “wine” being Statistics. But the new “bottles” are disciplines outside of Statistics–especially CS.
I have a foot in both the Statistics and CS camps. I’ve spent most of my career in the Computer Science Department at the University of California, Davis, but I began my career in Statistics at that institution. My mathematics doctoral thesis at UCLA was in probability theory, and my first years on the faculty at Davis focused on statistical methodology. I was one of the seven charter members of the Department of Statistics. Though my departmental affiliation later changed to CS, I never left Statistics as a field, and most of my research in Computer Science has been statistical in nature. With such “dual loyalties,” I’ll refer to people in both professions via third-person pronouns, not first, and I will be critical of both groups. However, in keeping with the theme of the ASA’s recent actions, my essay will be Stat-centric: What is poor Statistics to do?
Well then, how did CS come to annex the Stat field? The primary cause, I believe, came from the CS subfield of Artificial Intelligence (AI). Though there always had been some probabilistic analysis in AI, in recent years the interest has been almost exclusively in predictive analysis–a core area of Statistics.
That switch in AI was due largely to the emergence of Big Data. No one really knows what the term means, but people “know it when they see it,” and they see it quite often these days. Typical data sets range from large to huge to astronomical (sometimes literally the latter, as cosmology is one of the application fields), necessitating that one pay key attention to the computational aspects. Hence the term data science, combining quantitative methods with speedy computation, and hence another reason for CS to become involved.
Involvement is one thing, but usurpation is another. Though not a deliberate action by any means, CS is eclipsing Stat in many of Stat’s central areas. This is dramatically demonstrated by statements that are made like, “With machine learning methods, you don’t need statistics”–a punch in the gut for statisticians who realize that machine learning really IS statistics. ML goes into great detail in certain aspects, e.g. text mining, but in essence it consists of parametric and nonparametric curve estimation methods from Statistics, such as logistic regression, LASSO, nearest-neighbor classification, random forests, the EM algorithm and so on.
Though the Stat leaders seem to regard all this as something of an existential threat to the wellbeing of their profession, I view it as much worse than that. The problem is not that CS people are doing Statistics, but rather that they are doing it poorly: Generally the quality of CS work in Stat is weak. It is not a problem of quality of the researchers themselves; indeed, many of them are very highly talented. Instead, there are a number of systemic reasons for this, structural problems with the CS research “business model”:
All this matters – a LOT. In my opinion, the above factors result in highly lamentable opportunity costs. Clearly, I’m not saying that people in CS should stay out of Stat research. But the sad truth is that the usurpation process is causing precious resources–research funding, faculty slots, the best potential grad students, attention from government policymakers, even attention from the press–to go quite disproportionately to CS, even though Statistics is arguably better equipped to make use of them. This is not a CS vs. Stat issue; Statistics is important to the nation and to the world, and if scarce resources aren’t being used well, it’s everyone’s loss.
Making Statistics Attractive to Students
This of course is an age-old problem in Stat. Let’s face it–the very word statistics sounds hopelessly dull. But I would argue that a more modern development is making the problem a lot worse – the Advanced Placement (AP) Statistics courses in high schools.
Professor Xiao-Li Meng has written extensively about the destructive nature of AP Stat. He observed, “Among Harvard undergraduates I asked, the most frequent reason for not considering a statistical major was a ‘turnoff’ experience in an AP statistics course.” That says it all, doesn’t it? And though Meng’s views predictably sparked defensive replies in some quarters, I’ve had exactly the same experiences as Meng in my own interactions with students. No wonder students would rather major in a field like CS and study machine learning–without realizing it is Statistics. It is especially troubling that Statistics may be losing the “best and brightest” students.
One of the major problems is that AP Stat is usually taught by people who lack depth in the subject matter. A typical example is that a student complained to me that his AP Stat teacher could not answer his question as to why it is customary to use n-1 rather than n in the denominator of s^2, even though he had attended a top-quality high school in the heart of Silicon Valley. But even that lapse is really minor, compared to the lack among the AP teachers of the broad overview typically possessed by Stat professors teaching university courses, in terms of what can be done with Stat, what the philosophy is, what the concepts really mean and so on. AP courses are ostensibly college level, but the students are not getting college-level instruction. The “teach to the test” syndrome that pervades AP courses in general exacerbates this problem.
The most exasperating part of all this is that AP Stat officially relies on TI-83 pocket calculators as its computational vehicle. The machines are expensive, and after all we are living in an age in which R is free! Moreover, the calculators don’t have the capabilities of dazzling graphics and analyzing of nontrivial data sets that R provides – exactly the kinds of things that motivate young people.
So, unlike the “CS usurpation problem,” whose solution is unclear, here is something that actually can be fixed reasonably simply. If I had my druthers, I would simply ban AP Stat, and actually, I am one of those people who would do away with the entire AP program. Obviously, there are too many deeply entrenched interests for this to happen, but one thing that can be done for AP Stat is to switch its computational vehicle to R.
As noted, R is free and is multiplatform, with outstanding graphical capabilities. There is no end to the number of data sets teenagers would find attractive for R use, say the Million Song Data Set.
As to a textbook, there are many introductions to Statistics that use R, such as Michael Crawley’s Statistics: An Introduction Using R, and Peter Dalgaard’s Introductory Statistics with R. But to really do it right, I would suggest that a group of Stat professors collaboratively write an open-source text, as has been done for instance for Chemistry. Examples of interest to high schoolers should be used, say this engaging analysis on OK Cupid.
This is not a complete solution by any means. There still is the issue of AP Stat being taught by people who lack depth in the field, and so on. And even switching to R would meet with resistance from various interests, such as the College Board and especially the AP Stat teachers themselves.
But given all these weighty problems, it certainly would be nice to do something, right? Switching to R would be doable–and should be done.
[Crossposted with permission from the Mad (Data) Scientist blog.]
by Joseph Rickert
I think we can be sure that when American botanist Edgar Anderson meticulously collected data on three species of iris in the early 1930s he had no idea that these data would produce a computational storm that would persist well into the 21st century. The calculations started, presumably by hand, when R. A. Fisher selected this data set to illustrate the techniques described in his 1936 paper on discriminant analysis. However, they really got going in the early 1970s when the pattern recognition and machine learning community began using it to test new algorithms or illustrate fundamental principles. (The earliest reference I could find was: Gates, G.W. "The Reduced Nearest Neighbor Rule". IEEE Transactions on Information Theory, May 1972, 431-433.) Since then, the data set (or one of its variations) has been used to test hundreds, if not thousands, of machine learning algorithms. The UCI Machine Learning Repository, which contains what is probably the “official” iris data set, lists over 200 papers referencing the iris data.
So why has the iris data set become so popular? As in most success stories, randomness undoubtedly plays a huge part. However, Fisher’s selecting it to illustrate a discrimination algorithm brought it to people’s attention, and the fact that the data set contains three classes, only one of which is linearly separable from the other two, makes it interesting.
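The separability remark is easy to check with the copy of iris that ships with R. The sketch below shows that a single cutoff on petal length isolates setosa, while versicolor and virginica overlap on each of the four measurements taken one at a time. The 2.5 cm cutoff and the variable names are illustrative assumptions, and the per-variable overlap is only consistent with, not a proof of, the two remaining classes being linearly inseparable.

```r
# iris ships with base R (datasets package)
is_setosa <- iris$Species == "setosa"

# Petal length range per species: setosa does not overlap the other two
print(sapply(split(iris$Petal.Length, iris$Species), range))

# Hypothetical one-variable rule; the 2.5 cm cutoff is an assumption
rule_setosa <- iris$Petal.Length < 2.5
print(table(rule = rule_setosa, actual = is_setosa))

# Do versicolor and virginica overlap on each measurement taken alone?
others <- droplevels(iris[!is_setosa, ])
overlap <- sapply(others[1:4], function(v) {
  r <- lapply(split(v, others$Species), range)
  max(r$versicolor[1], r$virginica[1]) <= min(r$versicolor[2], r$virginica[2])
})
print(overlap)
```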
For similar reasons, the airlines data set used in the 2009 ASA Sections on Statistical Computing and Statistical Graphics Data expo has gained a prominent place in the machine learning world and is well on its way to becoming the “iris data set for big data”. It shows up in all kinds of places. (In addition to this blog, it made its way into the RHIPE documentation and figures in several college course modeling efforts.)
Some key features of the airlines data set are:
An additional, really nice feature of the airlines data set is that it keeps getting bigger! RITA, the Research and Innovative Technology Administration, Bureau of Transportation Statistics, continues to collect data, which can be downloaded in .csv files. For your convenience, a 143M+ record version of the data set, containing all of the RITA records from 1987 through the end of 2012, is available for download from the Revolution Analytics test data site.
The following analysis from Revolution Analytics’ Sue Ranney uses this large version of the airlines data set and illustrates how a good model, driven with enough data, can reveal surprising features of a data set.
# Fit a Tweedie GLM
tm <- system.time(
  glmOut <- rxGlm(ArrDelayMinutes ~ Origin:Dest + UniqueCarrier + F(Year) + DayOfWeek:F(CRSDepTime),
                  data = airData, family = rxTweedie(var.power = 1.15),
                  cube = TRUE, blocksPerRead = 30)
)
tm

# Build a data frame for three airlines: Delta (DL), Alaska (AS), Hawaiian (HA)
airVarInfo <- rxGetVarInfo(airData)
predData <- data.frame(
  UniqueCarrier = factor(rep(c("DL", "AS", "HA"), times = 168), levels = airVarInfo$UniqueCarrier$levels),
  Year = as.integer(rep(2012, times = 504)),
  DayOfWeek = factor(rep(c("Mon", "Tues", "Wed", "Thur", "Fri", "Sat", "Sun"), times = 72), levels = airVarInfo$DayOfWeek$levels),
  CRSDepTime = rep(0:23, each = 21),
  Origin = factor(rep("SEA", times = 504), levels = airVarInfo$Origin$levels),
  Dest = factor(rep("HNL", times = 504), levels = airVarInfo$Dest$levels)
)

# Use the model to predict the arrival delay for the three airlines and plot
predDataOut <- rxPredict(glmOut, data = predData, outData = predData, type = "response")
rxLinePlot(ArrDelayMinutes_Pred ~ CRSDepTime | UniqueCarrier, groups = DayOfWeek,
           data = predDataOut, layout = c(3, 1),
           title = "Expected Delay: Seattle to Honolulu by Departure Time, Day of Week, and Airline",
           xTitle = "Scheduled Departure Time", yTitle = "Expected Delay")
Here, rxGlm() fits a Tweedie Generalized Linear model that looks at arrival delay as a function of the interaction between origin and destination airports, carriers, year, and the interaction between days of the week and scheduled departure time. This function kicks off a considerable amount of number crunching as Origin is a factor variable with 373 levels and Dest, also a factor, has 377 levels. The F() function makes Year and CRSDepTime factors "on the fly" as the model is being fit. The resulting model ends up with 140,852 coefficients, 8,626 of which are not NA. The calculation takes 12.6 minutes to run on a 5 node (4 cores and 16GB of RAM per node) IBM Platform LSF cluster.
The rest of the code uses the model to predict arrival delay for three airlines and plots the fitted values by day of the week and departure time.
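The prediction data frame relies on rep() cycling three carriers and seven days within each block of 21 rows per departure hour, which yields every carrier, day and hour combination exactly once (3 x 7 x 24 = 504 rows). The base R sketch below checks that against expand.grid(); it leaves out the RevoScaleR factor levels, and the names rep_grid and full_grid are hypothetical.

```r
carriers <- c("DL", "AS", "HA")
days <- c("Mon", "Tues", "Wed", "Thur", "Fri", "Sat", "Sun")

# rep()-based construction, mirroring the pattern used for predData
rep_grid <- data.frame(
  UniqueCarrier = rep(carriers, times = 168),
  DayOfWeek = rep(days, times = 72),
  CRSDepTime = rep(0:23, each = 21),
  stringsAsFactors = FALSE
)

# Explicit full factorial grid for comparison
full_grid <- expand.grid(
  UniqueCarrier = carriers,
  DayOfWeek = days,
  CRSDepTime = 0:23,
  stringsAsFactors = FALSE
)

# Both should contain the same 504 distinct combinations
same_rows <- setequal(do.call(paste, rep_grid), do.call(paste, full_grid))
print(c(rows = nrow(rep_grid), distinct = nrow(unique(rep_grid)), same = same_rows))
```

The rep() version works because 3 and 7 share no common factor, so the two cycles only realign after 21 rows; expand.grid() states the same intent more directly.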
It looks like Saturday is the best time to fly for these airlines. Note that none of the structure revealed in these curves was put into the model in the sense that there are no polynomial terms in the model.
It will take a few minutes to download the zip file with the 143M airlines records, but please do, and let us know how your modeling efforts go.
The latest in a series by Daniel Hanson
Introduction
Correlations between holdings in a portfolio are of course a key component in financial risk management. Borrowing a tool common in fields such as bioinformatics and genetics, we will look at how to use heat maps in R for visualizing correlations among financial returns, and examine behavior in both a stable and down market.
While base R contains its own heatmap(.) function, the reader will likely find the heatmap.2(.) function in the R package gplots to be a bit more user friendly. A very nicely written companion article entitled A short tutorial for decent heat maps in R (Sebastian Raschka, 2013), which covers more details and features, is available on the web; we will also refer to it in the discussion below.
We will present the topic in the form of an example.
Sample Data
As in previous articles, we will make use of R packages Quandl and xts to acquire and manage our market data. Here, in a simple example, we will use returns from the following global equity indices over the period 1998-01-05 to the present, and then examine correlations between them:
S&P 500 (US)
RUSSELL 2000 (US Small Cap)
NIKKEI (Japan)
HANG SENG (Hong Kong)
DAX (Germany)
CAC (France)
KOSPI (Korea)
First, we gather the index values and convert to returns:
library(xts)
library(Quandl)

my_start_date <- "1998-01-05"
SP500.Q <- Quandl("YAHOO/INDEX_GSPC", start_date = my_start_date, type = "xts")
RUSS2000.Q <- Quandl("YAHOO/INDEX_RUT", start_date = my_start_date, type = "xts")
NIKKEI.Q <- Quandl("NIKKEI/INDEX", start_date = my_start_date, type = "xts")
HANG_SENG.Q <- Quandl("YAHOO/INDEX_HSI", start_date = my_start_date, type = "xts")
DAX.Q <- Quandl("YAHOO/INDEX_GDAXI", start_date = my_start_date, type = "xts")
CAC.Q <- Quandl("YAHOO/INDEX_FCHI", start_date = my_start_date, type = "xts")
KOSPI.Q <- Quandl("YAHOO/INDEX_KS11", start_date = my_start_date, type = "xts")

# Depending on the index, the final price for each day is either
# "Adjusted Close" or "Close Price". Extract this single column for each:
SP500 <- SP500.Q[,"Adjusted Close"]
RUSS2000 <- RUSS2000.Q[,"Adjusted Close"]
DAX <- DAX.Q[,"Adjusted Close"]
CAC <- CAC.Q[,"Adjusted Close"]
KOSPI <- KOSPI.Q[,"Adjusted Close"]
NIKKEI <- NIKKEI.Q[,"Close Price"]
HANG_SENG <- HANG_SENG.Q[,"Adjusted Close"]

# The xts merge(.) function will only accept two series at a time.
# We can, however, merge multiple columns by downcasting to *zoo* objects.
# Remark: "all = FALSE" uses an inner join to merge the data.
z <- merge(as.zoo(SP500), as.zoo(RUSS2000), as.zoo(DAX), as.zoo(CAC),
           as.zoo(KOSPI), as.zoo(NIKKEI), as.zoo(HANG_SENG), all = FALSE)

# Set the column names; these will be used in the heat maps:
myColnames <- c("SP500","RUSS2000","DAX","CAC","KOSPI","NIKKEI","HANG_SENG")
colnames(z) <- myColnames

# Cast back to an xts object:
mktPrices <- as.xts(z)

# Next, calculate log returns:
mktRtns <- diff(log(mktPrices), lag = 1)
head(mktRtns)
mktRtns <- mktRtns[-1, ] # Remove resulting NA in the 1st row
Generate Heat Maps
As noted above, heatmap.2(.) is the function in the gplots package that we will use. For convenience, we’ll wrap this function inside our own generate_heat_map(.) function, as we will call this parameterization several times to compare market conditions.
As for the parameterization, the comments should be self-explanatory, but we’re keeping things simple by eliminating the dendrogram and leaving out the trace lines inside the heat map and the density plot inside the color legend. Note also the setting Rowv = FALSE; this ensures that the ordering of the rows and columns remains consistent from plot to plot. We’re also just using the default color settings; for customized colors, see the Raschka tutorial linked above.
require(gplots)

generate_heat_map <- function(correlationMatrix, title) {
  heatmap.2(x = correlationMatrix,        # the correlation matrix input
            cellnote = correlationMatrix, # places correlation value in each cell
            main = title,                 # heat map title
            symm = TRUE,                  # configure diagram as standard correlation matrix
            dendrogram = "none",          # do not draw a row dendrogram
            Rowv = FALSE,                 # keep ordering consistent
            trace = "none",               # turns off trace lines inside the heat map
            density.info = "none",        # turns off density plot inside color legend
            notecol = "black")            # set font color of cell labels to black
}
Next, let’s calculate three correlation matrices using the data we have obtained:
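One way the three matrices might be computed is sketched below in base R. The names corr1, corr2 and corr3 and the three time windows (full sample, calendar year 2004, October 2008 through May 2009) are taken from the heat map calls and titles in this post; everything else is an assumption. In particular, simulated returns stand in for mktRtns so that the sketch runs without a data download, the helper corr_for_period is hypothetical, and rounding to two decimals is a guess at keeping the cell labels readable.

```r
# Simulated daily log returns stand in for mktRtns (hypothetical data)
set.seed(1)
dates <- seq(as.Date("1998-01-06"), as.Date("2009-12-31"), by = "day")
idx_names <- c("SP500", "RUSS2000", "DAX", "CAC", "KOSPI", "NIKKEI", "HANG_SENG")
sim_rtns <- matrix(rnorm(length(dates) * length(idx_names), sd = 0.01),
                   ncol = length(idx_names),
                   dimnames = list(NULL, idx_names))

# Correlation matrix of the returns dated within [from, to], rounded for display
corr_for_period <- function(rtns, dates, from = min(dates), to = max(dates)) {
  in_window <- dates >= as.Date(from) & dates <= as.Date(to)
  round(cor(rtns[in_window, , drop = FALSE]), 2)
}

corr1 <- corr_for_period(sim_rtns, dates)                              # full sample
corr2 <- corr_for_period(sim_rtns, dates, "2004-01-01", "2004-12-31")  # calm market
corr3 <- corr_for_period(sim_rtns, dates, "2008-10-01", "2009-05-31")  # crisis period
print(corr2)
```

With an xts object of returns, the same windows could be selected with xts date-range subsetting and passed to cor() directly.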
Now, let’s call our heat map function using the total market data set:
generate_heat_map(corr1, "Correlations of World Market Returns, Jan 1998 - Present")
And then, examine the result:
As expected, we trivially have correlations of 100% down the main diagonal. Note that, as shown in the color key, the darker the color, the lower the correlation. By design, using the parameters of the heatmap.2(.) function, we set the title with the main = title parameter setting, and the correlations shown in black by using the notecol="black" setting.
Next, let’s look at a period of relative calm in the markets, namely the year 2004:
generate_heat_map(corr2, "Correlations of World Market Returns, Jan - Dec 2004")
This gives us:
Note that in this case, at a glance of the darker colors in each of the cells, we can see that we have even lower correlations than those from our entire data set. This may of course be verified by comparing the numerical values.
Finally, let’s look at the opposite extreme, during the upheaval of the financial crisis in 2008-2009:
generate_heat_map(corr3, "Correlations of World Market Returns, Oct 2008 - May 2009")
This yields the following heat map:
Note that in this case, again just at first glance, we can tell the correlations have increased compared to 2004, by the colors changing from dark to light nearly across the board. While there are some correlations that do not increase all that much, such as the SP500/Nikkei and the Russell 2000/Kospi values, there are others across international and capitalization categories that jump quite significantly, such as the SP500/Hang Seng correlation going from about 21% to 41%, and that of the Russell 2000/DAX moving from 43% to over 57%. So, in other words, portfolio diversification can take a hit in down markets.
Conclusion
In this example, we only looked at seven market indices, but for a closer look at how correlations were affected during 2008-09, and how heat maps among a greater number of market sectors compared, this article, entitled Diversification is Broken, is a recommended and interesting read.
by Joseph Rickert
Last week, I posted a list of sessions at the Joint Statistical Meetings related to R. As it turned out, that list was only the tip of the iceberg. In some areas of statistics, such as graphics, simulation and computational statistics the use of R is so prevalent that people working in the field often don't think to mention it. For example, in the session New Approaches to Data Exploration and Discovery which included the presentation on the Glassbox package that figured in my original list, R was important to the analyses underlying nearly all of the talks in one way or another. The following are synopses of the talks in that session along with some pointers to relevant R resources.
Exploring Huge Collections of Scatterplots
Statistics and visualization legend Leland Wilkinson of Skytree showed off ScagExplorer, a tool he built with Tuan Dang of the University of Illinois at Chicago to explore scagnostics (a contraction for “Scatter Plot Diagnostics” made up by John Tukey and Paul Tukey in the 1980s). ScagExplorer makes it possible to look for anomalies and search for similar distributions in huge collections of scatter plots. (The example Leland showed contained 124K plots.) The ideas and many of the visuals for the talk can be found in the paper ScagExplorer: Exploring Scatterplots by Their Scagnostics. ScagExplorer is a Java-based tool, but R users can work with the scagnostics package written by Lee Wilkinson and Anushka Anand in 2007.
Glassbox: An R Package for Visualizing Algorithmic Models
Google’s Max Ghenis presented work he did with fellow Googlers Ben Ogorek and Estevan Flores. Glassbox is an R application that attempts to provide transparency to “blackbox” algorithmic models such as Random Forests. Among other things, it calculates and plots the collective importance of groups of variables in such a model. The slides for the presentation are available, as is the package itself. Google is using predictive modeling and tools such as glassbox to better understand the characteristics of its workforce and to ask important, reflective questions such as “How can we better understand diversity?” The company also does HR modeling to see if what they know about people can give them a competitive edge in hiring. For example, Google uses data collected from people who have interviewed at the company in the past, but who have not received offers from Google, to try and understand Google's future hiring needs. The coolest thing about this presentation was that these guys work for the Human Resources Department! If you think that you work for a tech company, go down to HR and see if you can get some help with Random Forests.
A Web Application for Efficient Analysis of Peptide Libraries
Eric Hare of Iowa State University introduced PeLica, work he did with colleagues Timo Sieber of University Medical Center Hamburg-Eppendorf and Heike Hofmann of Iowa State University. PeLica is an interactive Shiny application to help assess the statistical properties of peptide libraries. PeLica’s creators refer to it as a Peptide Library Calculator that acts as a front end to the R package peptider, which contains functions for evaluating the diversity of peptide libraries. The authors have done an exceptional job of using the documentation features available in Shiny to make their app a teaching tool.
To Merge or Not to Merge: An Interactive Visualization Tool for Local Merges of Mixture Model Components
Elizabeth Lorenzi of Carnegie Mellon showed the prototype for an interactive visualization tool that she is working on with Rebecca Nugent of Carnegie Mellon and Nema Dean of the University of Glasgow. The software calculates inter-component similarities of mixture model component trees and displays them as hierarchical dendrograms. Elizabeth and her colleagues are implementing this tool as an R package.
An Interactive Visualization Platform for Interpreting Topic Models
Carson Sievert of Iowa State University presented LDAvis, a general framework for visualizing topic models that he is building with Kenny Shirley of AT&T Labs. LDAvis is interactive R software that enables users to interpret and compare topics by highlighting keywords. The theory is nicely described in a recent paper, and the examples on Carson’s Github page are instructive and fun to play with. In the plot below, circle 26, representing a topic, has been selected. The bar chart on the right displays the 30 most relevant terms for this topic. The red bars represent the frequency of a term in a given topic (proportional to p(term | topic)), and the gray bars represent a term's frequency across the entire corpus (proportional to p(term)).
Gravicom: A Web-Based Tool for Community Detection in Networks
Andrea Kaplan showed off an interactive application that she and her Iowa State University team members, Heike Hofmann and Daniel Nordman, are building. Gravicom is an interactive web application based on Shiny and the D3 JavaScript library that lets a user manually collect nodes into clusters in a social network graph and then save this grouping information for subsequent processing. The idea is that eyeballing a large social network and selecting “obvious” groups may be an efficient way to initialize a machine learning algorithm. Have a look at a live demo.
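As a rough illustration of how hand-picked groups might seed an algorithm, the Python sketch below runs a simple seeded label propagation: nodes chosen by eye keep their group, and every other node repeatedly adopts the most common group among its labeled neighbors. This is a stand-in technique chosen for illustration; the post does not say which algorithm Gravicom's output feeds into, and the toy graph, the seed groups, and the function name `propagate_labels` are all hypothetical.

```python
from collections import Counter

# Hypothetical toy network as an undirected adjacency list.
graph = {
    "a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d"],
    "d": ["c", "e"], "e": ["d", "f", "g"], "f": ["e", "g"], "g": ["e", "f"],
}

# Groups a user might have selected by eye; all other nodes start unlabeled.
manual_groups = {"a": 0, "b": 0, "f": 1, "g": 1}


def propagate_labels(graph, seeds, max_rounds=20):
    """Spread seed labels to unlabeled nodes by neighbor majority vote."""
    labels = dict(seeds)
    for _ in range(max_rounds):
        changed = False
        for node in sorted(graph):  # fixed visiting order keeps the result deterministic
            if node in seeds:
                continue  # manually chosen labels are never overwritten
            votes = Counter(labels[n] for n in graph[node] if n in labels)
            if not votes:
                continue  # no labeled neighbor yet; try again next round
            # Most votes wins; ties go to the smallest label.
            best = min(votes, key=lambda lab: (-votes[lab], lab))
            if labels.get(node) != best:
                labels[node] = best
                changed = True
        if not changed:
            break
    return labels


labels = propagate_labels(graph, manual_groups)
print(labels)
```

Node "d" sits between the two groups and its assignment depends on the tie-breaking rule, which is exactly the sort of borderline case where a few more manual selections would help.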
Human Factors Influencing Visual Statistical Inference
Mahbubul Majumder of the University of Nebraska presented joint work done with Heike Hofmann and Dianne Cook, both of Iowa State University, on identifying key factors, such as demographics, experience, training, or even the placement of figures in an array of plots, that may be important for the human analysis of visual data.
I'm here at the JSM conference in Boston, the latest annual gathering of 6000+ statisticians from North America and around the world. (Revolution Analytics is a proud sponsor of the conference.) One of the great things to see is that the American Statistical Association, the organizer of the conference and the professional body for statisticians, is putting some effort into promoting Statistics as a discipline with its new website This is Statistics.
Statistics as a discipline has been overshadowed by its close sibling, Data Science. (I'm as guilty of this as anyone.) As statistician Terry Speed points out (hat tip: Stephanie Hicks), Statisticians haven't been deeply involved in the Big Data revolution, despite having been trained in exactly the types of issues that are critical to extracting useful inferences from complex data sets. So it's great to see that the ASA is helping to promote the role of Statisticians in today's data-centric workplace. My favorite video comes from Roger Peng, who has himself been instrumental in promoting statistical analysis in R via Coursera:
Check out the complete This is Statistics site at the link below, and share it with your friends so they can learn what Statistics is really about.
American Statistical Association: This is Statistics