My second-favourite keynote from yesterday's Strata Hadoop World conference was this one, from Pinterest's John Rauser. To many people (especially in the Big Data world), statistics is a series of complex equations, but just a little intuition goes a long way toward really understanding data. John illustrates this wonderfully using an example of data collected to determine whether consuming beer causes mosquitoes to bite you more:
The big lesson here, IMO, is that so many statistical problems can seem complex, but you can actually get a lot of insight by recognizing that your data is just one possible instance of a random process. If you have a hypothesis for what that process is, you can simulate it, and get an intuitive sense of how surprising your data is. R has excellent tools for simulating data, and a couple of hours spent writing code to simulate data can often give insights that will be valuable for the formal data analysis to come.
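To make the simulation idea concrete, here is a minimal sketch of that kind of resampling in R. The bite counts below are made up purely for illustration (they are not the numbers from the study John discusses):

```r
# Hypothetical permutation test in the spirit of Rauser's example:
# did beer drinkers attract more mosquito bites than water drinkers?
# (These counts are invented for illustration only.)
set.seed(42)
beer  <- c(27, 20, 21, 26, 27, 31, 24, 21, 20, 19, 23, 24)
water <- c(21, 22, 15, 12, 21, 16, 19, 15, 22, 24, 19, 23)
observed <- mean(beer) - mean(water)

# Under the null hypothesis the group labels are arbitrary: shuffle them
# many times and see how often a difference this large arises by chance.
pooled <- c(beer, water)
sims <- replicate(10000, {
  shuffled <- sample(pooled)
  mean(shuffled[1:12]) - mean(shuffled[13:24])
})
mean(sims >= observed)  # approximate one-sided p-value
```

If the shuffled differences rarely reach the observed one, the data would be surprising under the "labels don't matter" hypothesis, which is exactly the intuition the keynote builds.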
(By the way, my favourite keynote from the conference was Amanda Cox's keynote on data visualization at the New York Times, which featured several examples developed in R. Sadly, though, it wasn't recorded.)
O'Reilly Strata: Statistics Without the Agonizing Pain
by Joseph Rickert
There is something about R user group meetings that both encourages and nourishes a certain kind of "after hours" creativity. Maybe it is the pressure of having to make a presentation about stuff you do at work interesting to a general audience, or maybe it is just the desire to reach a high level of play. But, R user group presentations often manage to make some obscure area of computational statistics seem to be not only accessible, but also relevant and fun. Here are a couple of examples of what I mean.
Recently Xiaocun Sun conducted an image processing workshop for KRUG, the Knoxville R User's Group. As the following slide indicates, he used the EBImage Bioconductor package, a package that I imagine few people who don't do medical imaging for a living would be likely to stumble upon by accident, to illustrate the basics of image processing.
Xiaocun's presentation, along with R code, is available for download from the KRUG site.
As a second example, consider the presentation that Antonio Piccolboni recently made to the Bay Area useR Group (BARUG): 10 Eigenmaps of the United States of America. Inspired by an article in the New York Times, Antonio decided to undertake his own idiosyncratic tour through the Census data and look at socioeconomic trends in the United States. His analysis is both thought provoking and visually compelling. For example, concerning the following map Antonio writes:
This map shows a very interesting ring pattern around some cities, including Atlanta, Dallas and Minneapolis. The red areas show strong population increase, including migration, and increase in available housing and high median income. The blue areas have a higher death rate, Federal Government payments to individuals, more widows, single person households and older people receiving social security.
Antonio's presentation might well illustrate the theme: "Data Scientist reads the Sunday paper and finds data to begin a conversation about what he read with his quantitative, R-literate friends".
This kind of active reading fits nicely with ideas about responsible, quantitative journalism that Chris Wiggins expresses in a presentation he recently made to the New York Open Statistical Programming Meetup. Here, Chris provides some insight into the role of Data Science at the New York Times and offers advice on using data to study relevant issues and clearly communicate findings. One major point in Chris' presentation is that data science plus clear communication can have a very positive influence on shaping our culture.
It is not an exaggeration to say that the kind of work that Xiaocun, Antonio and other R user group presenters undertake in their spare time "for fun" is valuable and important beyond the immediate goals of learning and teaching R.
Part 2 of a series
by Daniel Hanson, with contributions by Steve Su (author of the GLDEX package)
In our previous article, we introduced the four-parameter Generalized Lambda Distribution (GLD) and looked at fitting a 20-year set of returns from the Wilshire 5000 Index, comparing the results of two methods, namely the Method of Moments and the Method of Maximum Likelihood.
Erratum: one very important omission in Part 1 was that we neglected to put
require(GLDEX)
prior to the examples shown. Many thanks to a reader who pointed this out in the comments section last time.
Let’s also recall the code we used for obtaining returns from the Wilshire 5000 index, and the first four moments of the data (details are in Part 1):
require(quantmod)  # quantmod package must be installed
getSymbols("VTSMX", from = "1994-09-01")
VTSMX.Close <- VTSMX[,4]  # Closing prices
VTSMX.vector <- as.vector(VTSMX.Close)
# Calculate log returns
Wilsh5000 <- diff(log(VTSMX.vector), lag = 1)
Wilsh5000 <- 100 * Wilsh5000[-1]  # Remove the NA in the first position,
                                  # and put in percent format
# Moments of Wilshire 5000 market returns:
fun.moments.r(Wilsh5000, normalise = "Y")  # normalise = "Y" subtracts the 3
                                           # from the normal dist value.
# Results:
#       mean   variance   skewness   kurtosis
# 0.02824676 1.50214916 0.30413445 7.82107430
Finally, in Part 1, we looked at two methods for fitting a GLD to this data, namely the Method of Moments (MM), and the Method of Maximum Likelihood (ML). We found that MM gave us a near perfect match in mean, variance, skewness, and kurtosis, but goodness of fit measures showed that we could not conclude that the market data was drawn from the fitted distribution. On the other hand, ML gave us a much better fit, but it came at the price of skewness being way off compared to that of the data, and kurtosis not being determined by the fitting algorithm (NA).
Method of L-Moments (LM)
Steve Su, in his contributions to this article series, suggested the option of a “third way”, namely the Method of L-Moments. Also, as mentioned in the paper L-moments and TL-moments of the generalized lambda distribution (William H. Asquith, 2006),
“The method of L-moments is an alternative technique, which is suitable and popular for heavy-tailed distributions. The method of L-moments is particularly useful for distributions, such as the generalized lambda distribution (GLD), that are only expressible in inverse or quantile function form.”
Additional details on the method and algorithm for computing it can be found in this paper, noted above.
As we will see in the example that follows, the result is essentially a compromise between our first two results, but the goodness of fit is still far preferable to that of the Method of Moments.
We follow the same approach as above, but using the GLDEX function fun.RMFMKL.lm(.) to calculate the fitted distribution:
# Fit the LM distribution:
require(GLDEX) # Remembered it this time!
wshLambdaLM = fun.RMFMKL.lm(Wilsh5000)
# Compute the associated moments
fun.theo.mv.gld(wshLambdaLM[1], wshLambdaLM[2], wshLambdaLM[3], wshLambdaLM[4],
param = "fmkl", normalise="Y")
# The results are:
#       mean   variance   skewness     kurtosis
# 0.02824678 1.56947022 1.32265715 291.58852044
As was the case with the maximum likelihood fit, the mean and variance are reasonably close, but the skewness and kurtosis do not match the empirical data. However, the skew is not as far off as in the ML case, and we are at least able to calculate a kurtosis value.
Looking at our goodness of fit tests based on the KS statistic:
fun.diag.ks.g(result = wshLambdaLM, data = Wilsh5000, no.test = 1000,
              param = "fmkl")
# Result: 740/1000
ks.gof(Wilsh5000, "pgl", wshLambdaLM, param = "fmkl")
# D = 0.0201, p-value = 0.03383
In the first case, our result of 740/1000 suggests a much better fit than the Method of Moments (53/1000) in Part 1, while falling slightly below the ratio we obtained with the Method of Maximum Likelihood. In the second test, a p-value of 0.03383 is not overwhelmingly convincing, but it does mean we cannot reject the hypothesis that our data is drawn from the fitted distribution at significance levels of α = 0.025 or 0.01.
Perhaps more interesting is just looking at the plot:
bins <- nclass.FD(Wilsh5000)  # We get 158 (Freedman-Diaconis Rule)
fun.plot.fit(fit.obj = wshLambdaLM, data = Wilsh5000, nclass = bins, param = "fmkl",
             xlab = "Returns", main = "Method of L-Moments")
Again, compared to the result for the Method of Moments in Part 1, the plot suggests that we have a better fit.
The Q-Q plot, however, is not much different from what we got in the Maximum Likelihood case; in particular, losses in the left tail are not underestimated by the fitted distribution as they are in the MM case:
qqplot.gld(fit = wshLambdaLM, data = Wilsh5000, param = "fmkl", type = "str.qqplot",
           main = "Method of L-Moments")
Which Option is “Best”?
Steve Su points out that there is no one “best” solution, as there are trade-offs and competing constraints involved in the algorithms; this is one reason why, in addition to the methods described above, so many different functions are available in the GLDEX package. On one point, however, there is general agreement in the literature: the Method of Moments, even with the appeal of matching the moments of the empirical data, is inferior to other methods that result in a better fit. This is also discussed in the paper by Asquith, namely, that the method of moments “generally works well for light-tailed distributions. For heavy-tailed distributions, however, use of the method of moments can be questioned.”
Comparison with the Normal Distribution
For a straw-man comparison, we can fit the Wilshire 5000 returns to a normal distribution in R and run the KS test as follows:
require(MASS)  # for fitdistr()
f <- fitdistr(Wilsh5000, densfun = "normal")
ks.test(Wilsh5000, "pnorm", f$estimate[1], f$estimate[2], alternative = "two.sided")
The results are as follows:
# One-sample Kolmogorov-Smirnov test
# data: Wilsh5000
# D = 0.0841, p-value < 2.2e-16
# alternative hypothesis: two-sided
With a p-value that small, we can firmly reject the hypothesis that the returns data was drawn from the fitted normal distribution.
We can also get a look at the plot of the implied normal distribution overlaid upon the fit we obtained with the method of L-moments, as follows:
x <- seq(min(Wilsh5000), max(Wilsh5000), length.out = bins)
# Chop the domain into bins = 158 intervals to get sample points
# from the approximated normal distribution (Freedman-Diaconis)
fun.plot.fit(fit.obj = wshLambdaLM, data = Wilsh5000, nclass = bins,
             param = "fmkl", xlab = "Returns")
curve(dnorm(x, mean = f$estimate[1], sd = f$estimate[2]), add = TRUE,
      col = "red", lwd = 2)  # Normal curve in red
Although it may be a little difficult to see, note that between -4 and -3 on the horizontal axis, the tail of the normal fit (in red) falls below that of the GLD (in blue), and it is along this left tail that extreme events can occur in the markets. The normal distribution implies a lower probability of these “black swan” events than the more representative GLD.
This is further confirmed by looking at the Q-Q plot versus a normal distribution fit. Note how the theoretical fit (on the horizontal axis in this case, using the base R function qqnorm(.); i.e., the axes are switched compared to those in our previous Q-Q plots) vastly underestimates losses in the left tail.
qqnorm(Wilsh5000, main = "Normal QQ Plot")
qqline(Wilsh5000)
In summary, from these plots, we can see that the GLD fit, particularly using ML or LM, is a superior alternative to what we get with the normal distribution fit when estimating potential index losses.
Conclusion
We have seen, using R and the GLDEX package, how a four-parameter distribution such as the Generalized Lambda Distribution can provide a more realistic fit to market data than the normal distribution, particularly given the fat tails typically present in returns data, which a normal distribution cannot capture. While the Method of Moments is highly appealing as a fitting algorithm because it preserves the moments of the empirical distribution, it sacrifices the goodness of fit that can be obtained with other methods such as Maximum Likelihood and L-Moments.
The GLD has been demonstrated in financial texts and research literature as a suitable distributional fit for determining market risk measures such as Value at Risk (VaR), Expected Shortfall (ES), and other metrics. We will look at examples in an upcoming article.
Again, very special thanks are due to Dr Steve Su for his contributions and guidance in presenting this topic.
by Joseph Rickert
In a recent post I talked about the information that can be developed by fitting a Tweedie GLM to a 143 million record version of the airlines data set. Since I started working with them about a year or so ago, I now see Tweedie models everywhere. Basically, any time I come across a histogram that looks like it might be a sample from a gamma distribution except for a big spike at zero, I see a candidate for a Tweedie model. (Having a Tweedie hammer makes lots of things look like Tweedie nails.) Nevertheless, apparently lots of people are seeing Tweedie these days. Even the scholarly citations for Maurice Tweedie's original paper are up.
Tweedie distributions are a subset of what are called Exponential Dispersion Models (EDMs). EDMs are two-parameter distributions from the linear exponential family that also have a dispersion parameter φ. Statistician Bent Jørgensen solidified the concept of EDMs in a 1987 paper, and named the following class of EDMs after Tweedie.
An EDM random variable Y follows a Tweedie distribution if
var(Y) = φ V(μ)
where μ is the mean of the distribution, φ is the dispersion parameter, V is a function describing the mean/variance relationship of the distribution, and p is a constant such that:
V(μ) = μ^p
Some very familiar distributions fall into the Tweedie family. Setting p = 0 gives a normal distribution. p = 1 is Poisson. p = 2 gives a gamma distribution and p = 3 yields an inverse Gaussian. However, much of the action for fitting Tweedie GLMs is for values of p between 1 and 2. In this interval, closed form distribution functions don’t exist, but as it turns out, Tweedies in this interval are compound Poisson distributions. (A compound Poisson random variable Y is the sum of N independent gamma random variables where N follows a Poisson distribution and N and the gamma random variates are independent.)
This last fact helps to explain why Tweedies are so popular. For example, one might model the insurance claims for a customer as a series of independent gamma random variables and the number of claims in some time interval as a Poisson random variable. Or, the gamma random variables could be models for precipitation, and the total rainfall resulting from N rainstorms would follow a Tweedie distribution. The possibilities are endless.
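To see the compound Poisson construction at work, here is a small simulation sketch (the parameter values are arbitrary, chosen only for illustration):

```r
# Y is the sum of N independent gamma variables, with N ~ Poisson.
# When N = 0 the sum is exactly zero, which produces the spike at zero.
set.seed(123)
rcomp_pois <- function(n, lambda, shape, rate) {
  N <- rpois(n, lambda)
  sapply(N, function(k) if (k == 0) 0 else sum(rgamma(k, shape, rate)))
}
y <- rcomp_pois(10000, lambda = 1.5, shape = 2, rate = 1)
mean(y == 0)          # close to dpois(0, 1.5) = exp(-1.5), about 0.22
hist(y, breaks = 50)  # gamma-like body with a point mass at zero
```

The histogram shows exactly the shape described above: a gamma-looking body with a spike at zero whose height is the Poisson probability of zero events.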
R has quite a few resources for working with Tweedie models. Here are just a few. You can fit a Tweedie GLM with the tweedie family in the statmod package.
# Fit an inverse-Gaussian glm with log link
require(statmod)  # provides the tweedie family
glm(y ~ x, family = tweedie(var.power = 3, link.power = 0))
The tweedie package has several interesting functions for working with Tweedie models, including a function to generate random samples. The following graph shows four different Tweedie histograms as the power parameter moves from 1.2 to 1.9.
It is apparent that increasing the power shifts mass away from zero towards the right.
(Note: the code for producing these plots, which includes some nice code from Stephen Turner for putting all four ggplots on a single graph, is available. Download Plot_tweedie)
Package poistweedie also provides functions for simulating Tweedie models. Package HDtweedie implements an iteratively reweighted least squares algorithm for computing solution paths for grouped lasso and grouped elastic net Tweedie models. And, package cplm provides both likelihood and Bayesian functions for working with compound Poisson models. Be sure to have a look at the vignette for this package to see compound Poisson distributions in action.
Finally, two very readable references for both the math underlying Tweedie models and the algorithms to compute them are a couple of papers by Dunn and Smyth: here and here.
by Joseph Rickert
Recently, I had the opportunity to present a webinar on R and Data Science. The challenge with attempting this sort of thing is to say something interesting that does justice to the subject while being suitable for an audience that may include both experienced R users and curious beginners. The approach I settled on had three parts. I decided to:
The "why" slides attempt to convey the great number of machine learning and statistical algorithms available in R, the visualization capabilities, the richness of the R programming language and its many tools for data manipulation. I tried to emphasize the great amount of effort that the R community continues to make in order to integrate R with other languages and computing platforms, and to scale R to handle massive data sets on Hadoop and other big data platforms.
The code examples presented in the webinar emphasize the machine learning algorithms organized in the caret package and the many tools available for working through the predictive modeling process, such as functions for searching through the parameter space of a model, performing cross-validation, comparing models, etc. The code for the caret examples is available here.
Towards the end of the webinar I show the code for running a large Tweedie model with Revolution Analytics' rxGlm() function, and I also show what it looks like to run an rxLogit() model directly on Hadoop.
Click on the video to view the webinar, or go to Revolution Analytics' website to download the webinar and a PDF of the slides. All of the code is available on my GitHub repository.
The New York Times published an article of interest to statisticians the other day: "The Odds, Continually Updated". Surprisingly for a general-audience newspaper, this article goes into the distinctions between Bayesian and frequentist statistics, and does so in a very approachable way. Here's an excerpt:
The essence of the frequentist technique is to apply probability to data. If you suspect your friend has a weighted coin, for example, and you observe that it came up heads nine times out of 10, a frequentist would calculate the probability of getting such a result with an unweighted coin. The answer (about 1 percent) is not a direct measure of the probability that the coin is weighted; it’s a measure of how improbable the nine-in-10 result is — a piece of information that can be useful in investigating your suspicion.
By contrast, Bayesian calculations go straight for the probability of the hypothesis, factoring in not just the data from the coin-toss experiment but any other relevant information — including whether you’ve previously seen your friend use a weighted coin.
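The "about 1 percent" in the excerpt is easy to verify in R, taking "such a result" to mean nine or more heads out of 10 with a fair coin:

```r
# Tail probability of 9 or more heads in 10 tosses of a fair coin
sum(dbinom(9:10, size = 10, prob = 0.5))  # 11/1024, about 0.0107

# binom.test gives the same one-sided p-value
binom.test(9, 10, p = 0.5, alternative = "greater")$p.value
```

Both lines return 11/1024 ≈ 0.0107, the roughly 1 percent the article mentions.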
The article covers the genesis of both frequentist and Bayesian statistics, and includes several examples of Bayesian statistics applications, including healthcare, cosmology, and even search-and-rescue. (This story of how Bayesian analysis directed the rescue of a fisherman lost at sea is nothing short of amazing.)
R, of course, has extensive capabilities for Bayesian analysis. Check out the Bayesian Inference task view on CRAN, and especially the RStan package.
New York Times: The Odds, Continually Updated
The following post by Norm Matloff originally appeared on his blog, Mad (Data) Scientist, on September 15th. We rarely republish posts that have appeared on other blogs; however, the questions that Norm raises both with respect to the teaching of statistics and with his assertion that "R's statistical procedures are centered far too much on significance testing" deserve a second look. Moreover, Norm's post elicited quite a few comments, many of which are at a high level of discourse. At the bottom of this post we have included excerpts from exchanges with statistician Mervyn Thomas and with philosopher of science Deborah Mayo. It is well worth reading the full threads of these exchanges as well as those associated with a number of other comments. Norm has been a contributor to the Revolutions Blog in the past. We thank him for permission to republish his post. (Guest post editor, Joseph Rickert).
by Norm Matloff
My posting about the statistics profession losing ground to computer science drew many comments, not only here in Mad (Data) Scientist, but also in the co-posting at Revolution Analytics, and on Slashdot. One of the themes in those comments was that Statistics Departments are out of touch and have failed to modernize their curricula. Though I may disagree with the commenters’ definitions of “modern,” I have in fact long felt that there are indeed serious problems in statistics curricula.
I must clarify before continuing that I do NOT advocate that, to paraphrase Shakespeare, “First thing we do, we kill all the theoreticians.” A precise mathematical understanding of the concepts is crucial to good applications. But stat curricula are not realistic.
I’ll use Student t-tests to illustrate. (This is material from my open-source book on probability and statistics.) The t-test is an exemplar of the curricular ills in three separate senses:
Significance testing has long been known to be underinformative at best, and highly misleading at worst. Yet it is the core of almost any applied stat course. Why are we still teaching — actually highlighting — a method that is recognized to be harmful?
We prescribe the use of the t-test in situations in which the sampled population has an exact normal distribution, when we know full well that there is no such animal. All real-life random variables are bounded (as opposed to the infinite-support normal distributions) and discrete (unlike the continuous normal family). [Clarification, added 9/17: I advocate skipping the t-distribution, and going directly to inference based on the Central Limit Theorem. Same for regression. See my book.]
Going hand-in-hand with the t-test is the sample variance. The classic quantity s² is an unbiased estimate of the population variance σ², with s² defined as 1/(n-1) times the sum of squares of our data relative to the sample mean. The concept of unbiasedness does have a place, yes, but in this case there really is no point to dividing by n-1 rather than n. Indeed, even if we do divide by n-1, it is easily shown that the quantity that we actually need, s rather than s², is a BIASED (downward) estimate of σ. So that n-1 factor is much ado about nothing.
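A quick simulation makes the downward bias of s concrete (the choice of n = 5 and σ = 1 here is arbitrary, purely for illustration):

```r
# Even with the 1/(n-1) factor, s systematically underestimates sigma.
# Simulate many samples of size 5 from N(0, 1), where sigma = 1:
set.seed(99)
s_vals <- replicate(100000, sd(rnorm(5)))
mean(s_vals)  # about 0.94, noticeably below the true sigma of 1
```

For normal samples of size 5 the expected value of s is known to be about 0.94σ, so unbiasedness of s² does nothing for the quantity we actually report.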
Right from the beginning, then, in the very first course a student takes in statistics, the star of the show, the t-test, has three major problems.
Sadly, the R language largely caters to this old-fashioned, unwarranted thinking. The var() and sd() functions use that 1/(n-1) factor, for example — a bit of a shock to unwary students who wish to find the variance of a random variable uniformly distributed on, say, 1, 2, …, 10.
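The shock in question is easy to reproduce at the console:

```r
# var() divides by n-1, not n, so it does not return the population
# variance of the discrete uniform distribution on 1, 2, ..., 10:
x <- 1:10
var(x)                 # 9.166667, the unbiased sample estimate
mean((x - mean(x))^2)  # 8.25, the population variance the student expects

# Rescaling recovers the population value:
var(x) * (length(x) - 1) / length(x)  # 8.25
```

The population variance of the discrete uniform on 1..10 is (10² − 1)/12 = 8.25, not the 9.17 that var() returns.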
Much more importantly, R’s statistical procedures are centered far too much on significance testing. Take ks.test(), for instance; all one can do is a significance test, when it would be nice to be able to obtain a confidence band for the true cdf. Or consider loglinear models: The loglin() function is so centered on testing that the user must proactively request parameter estimates, never mind standard errors. (One can get the latter by using glm() as a workaround, but one shouldn’t have to do this.)
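The confidence band Norm has in mind can be constructed by hand, for instance from the Dvoretzky-Kiefer-Wolfowitz inequality. This is a sketch on simulated data, not a feature of ks.test() or of any package mentioned here:

```r
# 95% confidence band for the true cdf via the DKW inequality
set.seed(1)
x <- rnorm(100)
Fhat <- ecdf(x)
alpha <- 0.05
eps <- sqrt(log(2 / alpha) / (2 * length(x)))  # DKW half-width

grid <- seq(min(x), max(x), length.out = 200)
lower <- pmax(Fhat(grid) - eps, 0)  # band clipped to [0, 1]
upper <- pmin(Fhat(grid) + eps, 1)

plot(Fhat, main = "ECDF with 95% DKW confidence band")
lines(grid, lower, lty = 2)
lines(grid, upper, lty = 2)
```

With n = 100 the half-width is about 0.136, so the band is wide but honest; it shrinks at rate 1/sqrt(n).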
I loved the suggestion by Frank Harrell on r-devel to at least remove the “star system” (asterisks of varying numbers for different p-values) from R output. A quixotic action on Frank’s part (so of course I chimed in, in support of his point); sadly, no way would such a change be made. To be sure, R in fact is modern in many ways, but there are some problems nevertheless.
In my blog posting cited above, I was especially worried that the stat field is not attracting enough of the “best and brightest” students. Well, any thoughtful student can see the folly of claiming the t-test to be “exact.” And if a sharp student looks closely, he/she will notice the hypocrisy of using the 1/(n-1) factor in estimating variance for comparing two general means, but NOT doing so when comparing two proportions. If unbiasedness is so vital, why not use 1/(n-1) in the proportions case, a skeptical student might ask?
Some years ago, an Israeli statistician, upon hearing me kvetch like this, said I would enjoy a book written by one of his countrymen, titled What’s Not What in Statistics. Unfortunately, I’ve never been able to find it. But a good cleanup along those lines of the way statistics is taught is long overdue.
Selected Comments
Mervyn Thomas
SEPTEMBER 16, 2014 AT 4:15 PM
I have run statistics operations in quite large public and private sector organisations, and directly supervised many masters and PhD level statisticians. The biggest problem I had with new statisticians was helping them to understand that nobody else cares about the statistics.
Of course the statistics is important, but only in so far as it helps produce solid and reliable answers to problems – or reveals that no such answers are available with current data. Nearly everybody is focussed on their own problems. The trick is producing results and reports which address those problems in a rigorous and defensible way.
In a sense, I see applied statistics as more of an engineering discipline – but one that makes careful use of rigorous analysis.
I believe that statistics departments have largely missed the boat with data science (except for a few stand out examples like Stanford), and that the reason is that many academic statisticians have failed to engage with other disciplines properly. Of course, there are very significant exceptions to that – Terry Speed for example.
One of the most telling examples of that for me is the number of times academic statisticians have asked if I or my life science collaborators could provide them with data to test an approach — without actually wanting to engage with the problem that generated the data.
Relevance comes from engagement, not from rarefied brilliance. There is no better example of that than Fisher.
Does it matter? Yes because I see other disciplines reinventing the statistical wheel – and doing it badly.
REPLY
matloff
SEPTEMBER 16, 2014 AT 5:01 PM
Very interesting comments. I largely agree.
Sadly, my own campus, the University of California at Davis, illustrates your point. To me, a big issue is joint academic appointments, and to my knowledge the Statistics Dept. has none. This is especially surprising in light of the longtime (several decades) commitment of UCD to interdisciplinary research. The Stat. Dept. has even gone in the opposite direction: The Stat grad program used to be administered by a Graduate Group, a unique UCD entity in which faculty from many departments run the graduate program in a given field; yet a few years ago, the Stat. Dept. disbanded its Graduate Group. I must hasten to add that there IS good interdisciplinary work being done by Stat faculty with researchers in other fields, but still the structure is too narrow, in my view.
(My own department, Computer Science, has several appointments with other disciplines, and more important, has actually expanded the membership of its Graduate Group.)
I would say, though, that I think the biggest reason Stat (in general, not just UCD) has been losing ground to CS and other fields is not because of disinterest in applications, but rather a failure to tackle the complex, large-scale, “messy” problems that the Machine Learning crowd addresses routinely.
REPLY
Mervyn Thomas
SEPTEMBER 16, 2014 AT 5:19 PM
“a failure to tackle the complex, large-scale, “messy” problems that the Machine Learning crowd addresses routinely.” Good point! I have often struggled with junior statisticians wanting to know whether or not an analysis is ‘right’ rather than fit for purpose. That’s a strange preoccupation, because in 40 years as a professional statistician I have never done a ‘correct’ analysis. Everything is predicated on assumptions which are approximations at best.
Mayo
SEPTEMBER 27, 2014 AT 9:27 PM
I reviewed the part in your book on tests vs. CIs. It was quite as extreme as I’d remembered it. I’m so used to interpreting significance levels and p-values in terms of discrepancies warranted or not that I automatically have those (severity) interpretations in mind when I consider tests. Fallacies of rejection and acceptance, relativity to sample size: all dealt with, and the issues about CIs requiring testing supplements remain (especially in one-sided testing, which is common). This paper covers 13 central problems with hypothesis tests, and how error statistics deals with them.
I remember many of the things I like A LOT about Matloff’s book. I’m glad he sees CIs as the way to go for variable choice (on prediction grounds) because it means that severity is relevant there too.
Norm Matloff
SEPTEMBER 28, 2014 AT 11:46 PM
Looks like a very interesting paper, Deborah (as I would have expected). I look forward to reading it. Just skimming through, though, it looks like I’ll probably have comments similar to the ones I made on Mervyn’s points.
Going back to my original post, do you at least agree that CIs are more informative than tests?
by Joseph Rickert
One of the most difficult things about R, a problem that is particularly vexing to beginners, is finding things. This is an unintended consequence of R's spectacular, but mostly uncoordinated, organic growth. The R core team does a superb job of maintaining the stability and growth of the R language itself, but the innovation engine for new functionality is largely in the hands of the global R community.
Several structures have been put in place to address various aspects of the finding things problem. For example, Task Views represent a monumental effort to collect and classify R packages. The RSeek site is an effective tool for web searches. R-Bloggers is a good place to go for R applications, and CRANberries lets you know what's new. But, how do you find things that you didn't even know you were looking for? For this, the so-called "misc packages" can be very helpful. Whereas the majority of R packages are focused on a particular type of analysis, class of models, or special tool, misc packages tend to be collections of functions that facilitate common tasks. (Look below for a partial list).
DescTools is a new entry to the misc package scene that I think could become very popular. The description for the package begins:
DescTools contains a bunch of basic statistic functions and convenience wrappers for efficiently describing data, creating specific plots, doing reports using MS Word, Excel or PowerPoint. The package's intention is to offer a toolbox, which facilitates the (notoriously time consuming) first descriptive tasks in data analysis, consisting of calculating descriptive statistics, drawing graphical summaries and reporting the results. Many of the included functions can be found scattered in other packages and other sources written partly by Titans of R.
So far, of the 380 functions in this collection, the Desc function has my attention. This function provides very nice tabular and graphic summaries of the variables in a data frame, with output that is specific to the data type. The d.pizza data frame that comes with the package has a nice mix of data types:
head(d.pizza)
  index       date week weekday        area count rabate  price operator  driver delivery_min temperature wine_ordered wine_delivered
1     1 2014-03-01    9       6      Camden     5   TRUE 65.655   Rhonda  Taylor         20.0        53.0            0              0
2     2 2014-03-01    9       6 Westminster     2  FALSE 26.980   Rhonda Butcher         19.6        56.4            0              0
3     3 2014-03-01    9       6 Westminster     3  FALSE 40.970  Allanah Butcher         17.8        36.5            0              0
4     4 2014-03-01    9       6       Brent     2  FALSE 25.980  Allanah  Taylor         37.3          NA            0              0
5     5 2014-03-01    9       6       Brent     5   TRUE 57.555   Rhonda  Carter         21.8        50.0            0              0
6     6 2014-03-01    9       6      Camden     1  FALSE 13.990  Allanah  Taylor         48.7        27.0            0              0
  wrongpizza quality
1      FALSE  medium
2      FALSE    high
3      FALSE    <NA>
4      FALSE    <NA>
5      FALSE  medium
6      FALSE     low
Here is some of the voluminous output from the function. The data frame as a whole is summarized as follows:
'data.frame':   1209 obs. of  16 variables:
 $ index         : int  1 2 3 4 5 6 7 8 9 10 ...
 $ date          : Date, format: "2014-03-01" "2014-03-01" "2014-03-01" "2014-03-01" ...
 $ week          : num  9 9 9 9 9 9 9 9 9 9 ...
 $ weekday       : num  6 6 6 6 6 6 6 6 6 6 ...
 $ area          : Factor w/ 3 levels "Brent","Camden",..: 2 3 3 1 1 2 2 1 3 1 ...
 $ count         : int  5 2 3 2 5 1 4 NA 3 6 ...
 $ rabate        : logi  TRUE FALSE FALSE FALSE TRUE FALSE ...
 $ price         : num  65.7 27 41 26 57.6 ...
 $ operator      : Factor w/ 3 levels "Allanah","Maria",..: 3 3 1 1 3 1 3 1 1 3 ...
 $ driver        : Factor w/ 7 levels "Butcher","Carpenter",..: 7 1 1 7 3 7 7 7 7 3 ...
 $ delivery_min  : num  20 19.6 17.8 37.3 21.8 48.7 49.3 25.6 26.4 24.3 ...
 $ temperature   : num  53 56.4 36.5 NA 50 27 33.9 54.8 48 54.4 ...
 $ wine_ordered  : int  0 0 0 0 0 0 1 NA 0 1 ...
 $ wine_delivered: int  0 0 0 0 0 0 1 NA 0 1 ...
 $ wrongpizza    : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ quality       : Ord.factor w/ 3 levels "low"<"medium"<..: 2 3 NA NA 2 1 1 3 3 2 ...
The factor variable driver gets a table and a plot.
10 - driver (factor)

  length       n     NAs  levels  unique  dupes
   1'209   1'204       5       7       7      y

       level  freq  perc  cumfreq  cumperc
1  Carpenter   272  .226      272     .226
2     Carter   234  .194      506     .420
3     Taylor   204  .169      710     .590
4     Hunter   156  .130      866     .719
5     Miller   125  .104      991     .823
6     Farmer   117  .097     1108     .920
7    Butcher    96  .080     1204    1.000
and so does the numeric variable delivery_min.
11 - delivery_min (numeric)

  length       n     NAs  unique      0s    mean  meanSE
   1'209   1'209       0     384       0  25.653   0.312

     .05     .10     .25  median     .75     .90     .95
  10.400  11.600  17.400  24.400  32.500  40.420  45.200

     rng      sd   vcoef     mad     IQR    skew    kurt
  56.800  10.843   0.423  11.268  15.100   0.611   0.095

lowest : 8.8 (3), 8.9, 9 (3), 9.1 (5), 9.2 (3)
highest: 61.9, 62.7, 62.9, 63.2, 65.6

Shapiro-Wilks normality test p.value : 2.2725e-16
Pretty nice for an automatic first look at the data.
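A minimal sketch of the calls behind the output above (assuming the DescTools package is installed; Desc dispatches on the type of its argument):

```r
library(DescTools)

Desc(d.pizza, plotit = FALSE)              # whole data frame, variable by variable
Desc(d.pizza$driver, plotit = TRUE)        # factor: frequency table plus plot
Desc(d.pizza$delivery_min, plotit = TRUE)  # numeric: quantiles, moments, normality test
```

Setting plotit = FALSE suppresses the graphics when you only want the tabular summaries.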
For some more R treasure hunting, have a look at the following short list of misc packages.
- Tools for manipulating data (the No. 1 downloaded package for 2013)
- Convenience wrappers for functions for manipulating strings
- One of the most popular R packages of all time: functions for data analysis, graphics, utilities and much more
- Package development tools
- The "go to" package for machine learning, classification and regression training
- A good SVM implementation and other machine learning algorithms
- Tools for describing data and descriptive statistics
- Tools for plotting decision trees
- Functions for numerical analysis, linear algebra, optimization, differential equations and some special functions
- Different high-level graphics functions for displaying large datasets
- Relatively new package with various functions for survival data, extending the methods available in the survival package
- New this year: miscellaneous R tools to simplify working with data types and formats, including functions for working with data frames and character strings
- Some functions for Kalman filters
- Miscellaneous 3D plots, including isosurfaces
- New package with utilities for producing maps
- Various programming tools, like ASCIIfy() to convert characters to ASCII and checkRVersion() to see if a newer version of R is available
- A grab bag of utilities, including progress bars and function timers
by Joseph Rickert
While preparing for the DataWeek R Bootcamp that I conducted this week, I came across the following gem. This code, based directly on a Max Kuhn presentation from a couple of years back, compares the efficacy of two machine learning models on a training data set.
library(caret)        # train(), trainControl(), resamples()
library(doParallel)   # parallel backend for train()
#
# SET UP THE PARAMETER SPACE SEARCH GRID
ctrl <- trainControl(method = "repeatedcv",          # use repeated 10-fold cross-validation
                     repeats = 5,                    # do 5 repetitions of 10-fold cv
                     summaryFunction = twoClassSummary,  # use AUC to pick the best model
                     classProbs = TRUE)
# Note that the default search grid selects 3 values of each tuning parameter
#
grid <- expand.grid(.interaction.depth = seq(1, 7, by = 2),  # look at tree depths from 1 to 7
                    .n.trees = seq(10, 100, by = 5),         # let iterations go from 10 to 100
                    .shrinkage = c(0.01, 0.1))               # try 2 values of the learning rate parameter
#
# BOOSTED TREE MODEL
set.seed(1)
names(trainData)
trainX <- trainData[, 4:61]

registerDoParallel(4)   # register a parallel backend for train
getDoParWorkers()

system.time(gbm.tune <- train(x = trainX, y = trainData$Class,
                              method = "gbm",
                              metric = "ROC",
                              trControl = ctrl,
                              tuneGrid = grid,
                              verbose = FALSE))
#
# SUPPORT VECTOR MACHINE MODEL
#
set.seed(1)
registerDoParallel(4, cores = 4)
getDoParWorkers()
system.time(
  svm.tune <- train(x = trainX, y = trainData$Class,
                    method = "svmRadial",
                    tuneLength = 9,                  # 9 values of the cost function
                    preProc = c("center", "scale"),
                    metric = "ROC",
                    trControl = ctrl)                # same as for gbm above
)
#
# COMPARE MODELS USING RESAMPLING
# Having set the seed to 1 before running gbm.tune and svm.tune, we have generated
# paired samples for comparing models using resampling.
#
# The resamples function in caret collates the resampling results from the two models
rValues <- resamples(list(svm = svm.tune, gbm = gbm.tune))
rValues$values
#
# BOXPLOTS COMPARING RESULTS
bwplot(rValues, metric = "ROC")   # boxplot
After setting up a grid to search the parameter space of a model, the train() function from the caret package is used to train a generalized boosted regression model (gbm) and a support vector machine (svm). Setting the seed produces paired samples and enables the two models to be compared using the resampling technique described in Hothorn et al., "The design and analysis of benchmark experiments", Journal of Computational and Graphical Statistics (2005), vol. 14 (3), pp. 675-699.
The performance metric for the comparison is the area under the ROC curve. From examining the boxplots of the sampling distributions for the two models, it is apparent that, in this case, the gbm has the advantage.
Also, notice that the call to registerDoParallel() permits parallel execution of the training algorithms. (The task bar showed all four cores of my laptop maxed out at 100% utilization.)
I chose this example because I wanted to show programmers coming to R for the first time that the power and productivity of R comes not only from the large number of machine learning models implemented, but also from the tremendous amount of infrastructure that package authors have built up, making it relatively easy to do fairly sophisticated tasks in just a few lines of code.
All of the code for this example, along with the rest of my code from the DataWeek R Bootcamp, is available on GitHub.
by Seth Mottaghinejad
Let's review the Collatz conjecture, which says that given a positive integer n, the following recursive algorithm will always terminate: if n is even, replace it with n/2; if n is odd, replace it with 3n + 1; repeat until n reaches 1.
In our last post, we created a function called 'cpp_collatz' which, given an integer vector, returns an integer vector of the corresponding stopping times (the number of iterations the above algorithm takes to reach 1). For example, when n = 5 we have
5 -> 3*5+1 = 16 -> 16/2 = 8 -> 8/2 = 4 -> 4/2 = 2 -> 2/2 = 1,
giving us a stopping time of 5 iterations.
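For readers without a C++ toolchain handy, the stopping-time computation can be sketched in plain R (the function name collatz_r is mine, not from the original post; the compiled cpp_collatz is of course much faster):

```r
# Plain-R Collatz stopping time: count iterations until n reaches 1
collatz_r <- function(n) {
  i <- 0
  while (n != 1) {
    n <- if (n %% 2 == 0) n / 2 else 3 * n + 1
    i <- i + 1
  }
  i
}

sapply(c(1, 2, 3, 5), collatz_r)   # stopping times: 0 1 7 5
```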
In today's article, we want to perform some exploratory data analysis to see if we can find any pattern relating an integer to its stopping time. As part of the analysis, we will extract some features out of the integers that could help us explain any differences in stopping times.
Here are some examples of potentially useful features: the sum of the integer's digits; whether it is a triangular, square, or pentagonal number; whether it is prime; and whether it is a perfect number.
In case you are encountering these terms for the first time, a triangular number is any number m that can be written as m = n(n+1)/2, where n is some positive integer. To determine if a number is triangular, we can rewrite the above equation as n^2 + n - 2m = 0 and use the quadratic formula to get n = (-1 + sqrt(1 + 8m))/2 and n = (-1 - sqrt(1 + 8m))/2. Since n must be a positive integer, we ignore the latter solution, leaving us with n = (-1 + sqrt(1 + 8m))/2.
Thus, if plugging m into the above formula results in an integer, we can say that m is a triangular number. Similar rules exist to determine whether an integer is square or pentagonal, but I will refer you to Wikipedia for the details.
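As a quick illustration of the inverse-formula test (a sketch; is_triangular_num is a hypothetical name, not from the post):

```r
# Triangular test: m is triangular iff the positive root of n^2 + n - 2m = 0
# is (numerically) an integer
is_triangular_num <- function(m, eps = 1e-9) {
  n <- (-1 + sqrt(1 + 8 * m)) / 2
  abs(n - round(n)) < eps
}

is_triangular_num(c(1, 3, 6, 10, 11))   # TRUE TRUE TRUE TRUE FALSE
```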
For the purpose of conducting our analysis, we created some other functions in C++ and R to help us. Let's take a look at these functions:
cat(paste(readLines(file.path(directory, "collatz.cpp")), collapse = "\n"))

#include <Rcpp.h>
#include <vector>
using namespace Rcpp;

// [[Rcpp::export]]
int collatz(int nn) {
    int ii = 0;
    while (nn != 1) {
        if (nn % 2 == 0) nn /= 2;
        else nn = 3 * nn + 1;
        ii += 1;
    }
    return ii;
}

// [[Rcpp::export]]
IntegerVector cpp_collatz(IntegerVector ints) {
    return sapply(ints, collatz);
}

// [[Rcpp::export]]
bool is_int_prime(int nn) {
    if (nn < 1) stop("int must be greater than 0.");
    else if (nn == 1) return false;
    else if (nn % 2 == 0) return (nn == 2);
    for (int ii = 3; ii * ii <= nn; ii += 2) {
        if (nn % ii == 0) return false;
    }
    return true;
}

// [[Rcpp::export]]
LogicalVector is_prime(IntegerVector ints) {
    return sapply(ints, is_int_prime);
}

// [[Rcpp::export]]
NumericVector gen_primes(int n) {
    if (n < 1) stop("n must be greater than 0.");
    std::vector<int> primes;
    primes.push_back(2);
    int i = 3;
    while (primes.size() < unsigned(n)) {
        if (is_int_prime(i)) primes.push_back(i);
        i += 2;
    }
    return Rcpp::wrap(primes);
}

// [[Rcpp::export]]
NumericVector gen_divisors(int n) {
    if (n < 1) stop("n must be greater than 0.");
    std::vector<int> divisors;
    divisors.push_back(1);
    for (int i = 2; i <= sqrt(n); i++) {
        if (n % i == 0) {
            divisors.push_back(i);
            divisors.push_back(n / i);
        }
    }
    sort(divisors.begin(), divisors.end());
    divisors.erase(unique(divisors.begin(), divisors.end()), divisors.end());
    return Rcpp::wrap(divisors);
}

// [[Rcpp::export]]
bool is_int_perfect(int nn) {
    if (nn < 1) stop("int must be greater than 0.");
    return nn == sum(gen_divisors(nn));
}

// [[Rcpp::export]]
LogicalVector is_perfect(IntegerVector ints) {
    return sapply(ints, is_int_perfect);
}
And here is a helper function written in R:
sum_digits <- function(x) {
  # returns the sum of the individual digits that make up x
  # x must be an integer vector
  f <- function(xx) sum(as.integer(strsplit(as.character(xx), "")[[1]]))
  sapply(x, f)
}
As you can guess, many of the above functions (such as 'is_prime' and 'gen_divisors') rely on loops, which makes C++ the ideal place to perform the computation. So we farmed out as much of the heavy-duty computation to C++ as possible, leaving R with the task of processing and analyzing the resulting data.
Let's get started. We will perform the analysis on all integers below 10^5, since R is memory-bound and we can quickly run into a bottleneck. Next time, we will show you how to overcome this limitation using the 'RevoScaleR' package, which will allow us to scale the analysis to much larger integers.
One small caveat before we start: I enjoy dabbling in mathematics, but I know very little about number theory. The analysis we are about to perform is not meant to be rigorous. Instead, we will attempt to approach the problem using EDA the same way we approach any data-driven problem.
maxint <- 10^5
df <- data.frame(int = 1:maxint)            # the original number
df <- transform(df, st = cpp_collatz(int))  # stopping times
df <- transform(df,
  sum_digits = sum_digits(int),
  inv_triangular = (sqrt(8*int + 1) - 1)/2,   # inverse triangular number
  inv_square = sqrt(int),                     # inverse square number
  inv_pentagonal = (sqrt(24*int + 1) + 1)/6   # inverse pentagonal number
)
To determine whether a numeric value is an integer, we need to be careful not to use the '==' operator in R, as it is not guaranteed to work because of the minute rounding errors of floating-point arithmetic. Here's an example:
.3 == .4 - .1  # we expect to get TRUE but get FALSE instead
[1] FALSE
The solution is to check whether the absolute difference between the above numbers is smaller than some tolerance threshold.
eps <- 1e-9
abs(.3 - (.4 - .1)) < eps  # returns TRUE
[1] TRUE
df <- transform(df,
  is_triangular = abs(inv_triangular - as.integer(inv_triangular)) < eps,
  is_square = abs(inv_square - as.integer(inv_square)) < eps,
  is_pentagonal = abs(inv_pentagonal - as.integer(inv_pentagonal)) < eps,
  is_prime = is_prime(int),
  is_perfect = is_perfect(int)
)
df <- df[ , names(df)[-grep("^inv_", names(df))]]   # drop the inv_ helper columns
Finally, we will create a variable listing all of the integer's proper prime divisors. Every composite integer can be reconstructed from these basic building blocks, a mathematical result known as the 'unique factorization theorem'. We can use the function 'gen_divisors' to get a vector of an integer's proper divisors, and the 'is_prime' function to keep only the ones that are prime. Then, because the return value must be a single element, we can use 'paste' with the 'collapse' argument to join all of the prime divisors into one comma-separated string.
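The code for this step is not shown above; here is a base-R sketch of the idea (fine for small n, and prime_divs_str is my own name; in practice the post's C++ helpers gen_divisors() and is_prime() would do the heavy lifting):

```r
# Build a comma-separated string of an integer's prime divisors
prime_divs_str <- function(n) {
  divs <- which(n %% seq_len(n) == 0)              # all divisors of n
  is_p <- vapply(divs, function(d) {
    d > 1 && all(d %% seq_len(d)[-c(1, d)] != 0)   # crude primality check
  }, logical(1))
  paste(divs[is_p], collapse = ",")                # join into one string
}

prime_divs_str(36084)   # "2,3,31,97"
```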
On its own, the variable 'all_prime_divs' may not be especially helpful. Instead, we can generate multiple flag variables out of it, each indicating whether or not a specific prime number is a divisor of the integer. We will generate 25 flag variables, one for each of the first 25 prime numbers.
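One way to generate those divisibility flags (a sketch, since the post's exact code isn't shown; a toy data frame stands in for df, and the primes are listed directly rather than taken from gen_primes()):

```r
# Add one logical column per prime: is the integer divisible by p?
df <- data.frame(int = 1:20)   # toy stand-in for the real df
primes25 <- c(2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37, 41, 43, 47,
              53, 59, 61, 67, 71, 73, 79, 83, 89, 97)
for (p in primes25) {
  df[[paste0("is_div_by_", p)]] <- df$int %% p == 0
}

df[12, c("int", "is_div_by_2", "is_div_by_3", "is_div_by_5")]
```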
There are many more features we could extract from the underlying integers, but we will stop here. As we mentioned earlier, our goal is not to produce rigorous mathematical work, but to show how the tools of data analysis can be brought to bear on a problem of this nature.
Here's a sample of 10 rows from the data:
df[sample.int(nrow(df), 10), ]
        int  st sum_digits is_triangular is_square is_pentagonal is_prime
21721 21721 162         13         FALSE     FALSE         FALSE    FALSE
36084 36084 142         21         FALSE     FALSE         FALSE    FALSE
40793 40793 119         23         FALSE     FALSE         FALSE    FALSE
3374   3374  43         17         FALSE     FALSE         FALSE    FALSE
48257 48257  44         26         FALSE     FALSE         FALSE    FALSE
42906 42906  49         21         FALSE     FALSE         FALSE    FALSE
37283 37283  62         23         FALSE     FALSE         FALSE    FALSE
55156 55156  60         22         FALSE     FALSE         FALSE    FALSE
6169   6169 111         22         FALSE     FALSE         FALSE    FALSE
77694 77694 231         33         FALSE     FALSE         FALSE    FALSE
      is_perfect all_prime_divs is_div_by_2 is_div_by_3 is_div_by_5 is_div_by_7
21721      FALSE       7,29,107       FALSE       FALSE       FALSE        TRUE
36084      FALSE      2,3,31,97        TRUE        TRUE       FALSE       FALSE
40793      FALSE         19,113       FALSE       FALSE       FALSE       FALSE
3374       FALSE        2,7,241        TRUE       FALSE       FALSE        TRUE
48257      FALSE      11,41,107       FALSE       FALSE       FALSE       FALSE
42906      FALSE       2,3,7151        TRUE        TRUE       FALSE       FALSE
37283      FALSE        23,1621       FALSE       FALSE       FALSE       FALSE
55156      FALSE        2,13789        TRUE       FALSE       FALSE       FALSE
6169       FALSE         31,199       FALSE       FALSE       FALSE       FALSE
77694      FALSE     2,3,23,563        TRUE        TRUE       FALSE       FALSE
      is_div_by_11 is_div_by_13 is_div_by_17 is_div_by_19 is_div_by_23
21721        FALSE        FALSE        FALSE        FALSE        FALSE
36084        FALSE        FALSE        FALSE        FALSE        FALSE
40793        FALSE        FALSE        FALSE         TRUE        FALSE
3374         FALSE        FALSE        FALSE        FALSE        FALSE
48257         TRUE        FALSE        FALSE        FALSE        FALSE
42906        FALSE        FALSE        FALSE        FALSE        FALSE
37283        FALSE        FALSE        FALSE        FALSE         TRUE
55156        FALSE        FALSE        FALSE        FALSE        FALSE
6169         FALSE        FALSE        FALSE        FALSE        FALSE
77694        FALSE        FALSE        FALSE        FALSE         TRUE
      is_div_by_29 is_div_by_31 is_div_by_37 is_div_by_41 is_div_by_43
21721         TRUE        FALSE        FALSE        FALSE        FALSE
36084        FALSE         TRUE        FALSE        FALSE        FALSE
40793        FALSE        FALSE        FALSE        FALSE        FALSE
3374         FALSE        FALSE        FALSE        FALSE        FALSE
48257        FALSE        FALSE        FALSE         TRUE        FALSE
42906        FALSE        FALSE        FALSE        FALSE        FALSE
37283        FALSE        FALSE        FALSE        FALSE        FALSE
55156        FALSE        FALSE        FALSE        FALSE        FALSE
6169         FALSE         TRUE        FALSE        FALSE        FALSE
77694        FALSE        FALSE        FALSE        FALSE        FALSE
      is_div_by_47 is_div_by_53 is_div_by_59 is_div_by_61 is_div_by_67
21721        FALSE        FALSE        FALSE        FALSE        FALSE
36084        FALSE        FALSE        FALSE        FALSE        FALSE
40793        FALSE        FALSE        FALSE        FALSE        FALSE
3374         FALSE        FALSE        FALSE        FALSE        FALSE
48257        FALSE        FALSE        FALSE        FALSE        FALSE
42906        FALSE        FALSE        FALSE        FALSE        FALSE
37283        FALSE        FALSE        FALSE        FALSE        FALSE
55156        FALSE        FALSE        FALSE        FALSE        FALSE
6169         FALSE        FALSE        FALSE        FALSE        FALSE
77694        FALSE        FALSE        FALSE        FALSE        FALSE
      is_div_by_71 is_div_by_73 is_div_by_79 is_div_by_83 is_div_by_89
21721        FALSE        FALSE        FALSE        FALSE        FALSE
36084        FALSE        FALSE        FALSE        FALSE        FALSE
40793        FALSE        FALSE        FALSE        FALSE        FALSE
3374         FALSE        FALSE        FALSE        FALSE        FALSE
48257        FALSE        FALSE        FALSE        FALSE        FALSE
42906        FALSE        FALSE        FALSE        FALSE        FALSE
37283        FALSE        FALSE        FALSE        FALSE        FALSE
55156        FALSE        FALSE        FALSE        FALSE        FALSE
6169         FALSE        FALSE        FALSE        FALSE        FALSE
77694        FALSE        FALSE        FALSE        FALSE        FALSE
      is_div_by_97
21721        FALSE
36084         TRUE
40793        FALSE
3374         FALSE
48257        FALSE
42906        FALSE
37283        FALSE
55156        FALSE
6169         FALSE
77694        FALSE
We can now move on to various statistical summaries, to see if we notice any differences in the stopping times (our response variable) when we break up the data in different ways. We will look at the counts, mean, median, standard deviation, and trimmed mean (after throwing out the highest 10 percent) of the stopping times, as well as the correlation between stopping times and the integers. This is by no means a comprehensive list, but it can serve as guidance for deciding which direction to take next.
my_summary <- function(df) {
  with(df, data.frame(
    count = length(st),
    mean_st = mean(st),
    median_st = median(st),
    tmean_st = mean(st[st < quantile(st, .9)]),   # trimmed mean
    sd_st = sd(st),
    cor_st_int = cor(st, int, method = "spearman")
  ))
}
To create the above summaries broken up by the flag variables in the data, we will use the 'ddply' function from the 'plyr' package. For example, the following will give us the summaries computed by 'my_summary', grouped by 'is_prime'.
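The grouped call looks like this; to make the snippet run on its own, toy stand-ins for df and my_summary are built first (assuming plyr is installed):

```r
library(plyr)

# toy stand-ins so the snippet is self-contained
df <- data.frame(int = 1:100)
df$st <- sapply(df$int, function(n) {     # Collatz stopping times, plain R
  i <- 0
  while (n != 1) { n <- if (n %% 2 == 0) n / 2 else 3 * n + 1; i <- i + 1 }
  i
})
df$is_prime <- sapply(df$int, function(n) n > 1 && sum(n %% seq_len(n) == 0) == 2)

my_summary <- function(df) with(df, data.frame(count = length(st), mean_st = mean(st)))

ddply(df, ~ is_prime, my_summary)   # one row of summaries per group
```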
To avoid having to manually type out a formula for every flag, we can pull the flag variables from the data set, generate the strings that make up each formula, wrap them in 'as.formula' and pass the result to 'ddply'.
flags <- names(df)[grep("^is_", names(df))]
res <- lapply(flags, function(nm) ddply(df, as.formula(sprintf('~ %s', nm)), my_summary))
names(res) <- flags
res

$is_triangular
  is_triangular count   mean_st median_st  tmean_st    sd_st cor_st_int
1         FALSE 99554 107.58511        99  96.23178 51.36153  0.1700491
2          TRUE   446  97.11211        94  85.97243 51.34051  0.3035063

$is_square
  is_square count   mean_st median_st tmean_st    sd_st cor_st_int
1     FALSE 99684 107.59743        99 96.25033 51.34206  0.1696582
2      TRUE   316  88.91772        71 77.29577 55.43725  0.4274504

$is_pentagonal
  is_pentagonal count   mean_st median_st tmean_st    sd_st cor_st_int
1         FALSE 99742 107.56948      99.0 96.22466 51.35580  0.1702560
2          TRUE   258  95.52326      83.5 83.86638 53.91514  0.3336478

$is_prime
  is_prime count  mean_st median_st  tmean_st    sd_st cor_st_int
1    FALSE 90408 107.1227        99  95.95392 51.29145  0.1693013
2     TRUE  9592 111.4569       103 100.39643 51.90186  0.1953000

$is_perfect
  is_perfect count  mean_st median_st tmean_st    sd_st cor_st_int
1      FALSE 99995 107.5419        99 96.20092 51.36403  0.1707839
2       TRUE     5  37.6000        18 19.50000 45.06440  0.9000000

$is_div_by_2
  is_div_by_2 count  mean_st median_st  tmean_st    sd_st cor_st_int
1       FALSE 50000 113.5745       106 102.58324 52.25180  0.1720108
2        TRUE 50000 101.5023        94  90.68234 49.73778  0.1737786

$is_div_by_3
  is_div_by_3 count  mean_st median_st tmean_st    sd_st cor_st_int
1       FALSE 66667 107.4415        99 96.15832 51.30838  0.1705745
2        TRUE 33333 107.7322        99 96.94426 51.48104  0.1714432

$is_div_by_5
  is_div_by_5 count  mean_st median_st tmean_st    sd_st cor_st_int
1       FALSE 80000 107.5676        99 96.21239 51.36466  0.1713552
2        TRUE 20000 107.4214        99 96.13873 51.37206  0.1688929

. . .

$is_div_by_89
  is_div_by_89 count  mean_st median_st tmean_st    sd_st cor_st_int
1        FALSE 98877 107.5381        99 96.19965 51.37062  0.1707754
2         TRUE  1123 107.5628        97 96.02096 50.97273  0.1803568

$is_div_by_97
  is_div_by_97 count  mean_st median_st  tmean_st    sd_st cor_st_int
1        FALSE 98970 107.4944        99  96.15365 51.36880 0.17187453
2         TRUE  1030 111.7660       106 101.35853 50.93593 0.07843996
As we can see from the above results, most comparisons appear to be non-significant (although in mathematics even tiny differences can be meaningful, so we will avoid relying on statistical significance here). Here's a summary of trends that stand out as we go over the results:
However, this last item could just be a restatement of the first. Let's take a closer look:
ddply(df, ~ is_prime + is_div_by_2, my_summary)
  is_prime is_div_by_2 count  mean_st median_st  tmean_st    sd_st cor_st_int
1    FALSE       FALSE 40409 114.0744       106 102.91178 52.32495  0.1658988
2    FALSE        TRUE 49999 101.5043        94  90.68434 49.73624  0.1737291
3     TRUE       FALSE  9591 111.4685       103 100.40796 51.89231  0.1950482
4     TRUE        TRUE     1   1.0000         1       NaN       NA         NA
When limiting the analysis to odd numbers only, prime numbers now have a lower average stopping time than composite numbers, reversing the trend noted above.
And one final point: since having a prime number as a proper divisor is just a proxy for being a composite number, it is difficult to read too much into whether divisibility by a specific prime affects the average stopping time. In any case, no specific prime number stands out. Once again, a larger sample size would give us more confidence in the results.
The above analysis is perhaps too crude and primitive to offer any significant lead, so here are some possible improvements:
In the next post, we will see how we can use the 'RevoScaleR' package to go from an R 'data.frame' (memory-bound) to an external data frame (an XDF file for short). In doing so, we will achieve the following improvements: