Many R scripts depend on CRAN packages, and most CRAN packages in turn depend on other CRAN packages. If you install an R package, you'll also be installing its dependencies to make it work, and possibly other packages as well to enable its full functionality.

My colleague Andrie posted some R code to map package dependencies a couple of months ago, but now you can easily explore the dependencies of any CRAN package at MRAN. Simply search for a package and click the Dependencies Graph tab. Here's a very simple one: the foreach package.

The foreach package depends on two others: iterators and codetools, which will be automatically installed for you by install.packages when you install foreach. (We'll discuss the use of "Suggests" — as here with randomForest — later.) Now let's look at a more complex example: the caret package.

The caret package provides an interface to many of the predictive modeling packages on CRAN, and so it has several dependencies (nine, in fact — you can see the list by clicking on the Dependencies Table tab). But it also Suggests many more packages — these are packages that are not required to run caret, but if you do have them, there are more model types you can use within the caret framework.

Here's a quick overview of the types of dependencies you'll find in the charts and tables on MRAN:

- Imports (red) and Depends (orange). These packages are required to use the package of interest. Their dependencies are *also* required, and shown in the graph. (As a package user, you don't need to worry about the distinction between Imports and Depends, but if you're interested, Hadley Wickham's excellent Guide to R Packages has the details.)
- Suggests (grey). These are packages that add additional functionality to the package of interest, but aren't required to use it. You may find certain functions or options don't work if the Suggested packages are not installed and loaded.
- LinkingTo (black): Only relevant if you are working with compiled code. If you need to have the package installed, it will also be listed as a dependency. (Again, see Guide to R Packages for details.)
- Enhances (blue): Very few packages use this type of link, but if they do it indicates that the package provides optional capabilities in another package. (It's sort of a reverse Suggests.)
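You can also query these relationships from within R itself. Here is a rough sketch using the base tools package (this assumes a working internet connection, and the mirror URL is just an example):

```r
# Build the CRAN package database, then list the foreach package's
# hard dependencies and its Suggests.
db <- available.packages(repos = "https://cloud.r-project.org")
tools::package_dependencies("foreach", db = db,
                            which = c("Depends", "Imports", "LinkingTo"))
tools::package_dependencies("foreach", db = db, which = "Suggests")
```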

MRAN is updated daily, and so the Dependencies Graph is always up-to-date with the latest CRAN packages and their connections. Start exploring at the link below.

MRAN: Explore Packages

My second-favourite keynote from yesterday's Strata Hadoop World conference was this one, from Pinterest's John Rauser. To many people (especially in the Big Data world), Statistics is a series of complex equations, but just a little intuition goes a long way to really understanding data. John illustrates this wonderfully using an example of data collected to determine whether consuming beer causes mosquitoes to bite you more:

The big lesson here, IMO, is that so many statistical problems can seem complex, but you can actually get a lot of insight by recognizing that your data is just one possible instance of a random process. If you have a hypothesis for what that process is, you can **simulate it**, and get an intuitive sense of how surprising your data is. R has excellent tools for simulating data, and a couple of hours spent writing code to simulate data can often give insights that will be valuable for the formal data analysis to come.
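As a sketch of what that looks like in practice, here is a minimal permutation test in R. The numbers are invented for illustration (they are not the data from the talk): we ask how often a random relabeling of the two groups produces a difference in means at least as large as the one observed.

```r
# Hypothetical measurements: mosquito landings for a beer group
# and a water group (made-up numbers, for illustration only).
set.seed(42)
beer  <- c(27, 20, 21, 26, 27, 31, 24, 21, 20, 19, 23, 24)
water <- c(21, 22, 15, 12, 21, 16, 19, 15, 22, 24, 19, 23)
observed <- mean(beer) - mean(water)

pooled <- c(beer, water)
n <- length(beer)
sim <- replicate(10000, {
  shuffled <- sample(pooled)  # re-assign group labels at random
  mean(shuffled[1:n]) - mean(shuffled[-(1:n)])
})

# Proportion of simulated differences at least as large as observed:
p_value <- mean(sim >= observed)
```

If p_value is small, the observed difference is surprising under the null hypothesis that the group labels don't matter.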

(By the way, my favourite keynote from the conference was Amanda Cox's keynote on data visualization at the New York Times, which featured several examples developed in R. Sadly, though, it wasn't recorded.)

O'Reilly Strata: Statistics Without the Agonizing Pain

by Joseph Rickert

There is something about R user group meetings that both encourages and nourishes a certain kind of "after hours" creativity. Maybe it is the pressure of having to make a presentation about stuff you do at work interesting to a general audience, or maybe it is just the desire to reach a high level of play. But R user group presentations often manage to make some obscure area of computational statistics seem not only accessible, but also relevant and fun. Here are a couple of examples of what I mean.

Recently Xiaocun Sun conducted an image processing workshop for KRUG, the Knoxville R User's Group. As the following slide indicates, he used the EBImage Bioconductor package, a package that I imagine few people who don't do medical imaging for a living would be likely to stumble upon by accident, to illustrate the basics of image processing.

Xiaocun's presentation, along with R code, is available for download from the KRUG site.

As a second example, consider the presentation that Antonio Piccolboni recently made to the Bay Area useR Group (BARUG): 10 Eigenmaps of the United States of America. Inspired by an article in the New York Times, Antonio decided to undertake his own idiosyncratic tour through the Census data and look at socio-economic trends in the United States. His analysis is both thought provoking and visually compelling. For example, concerning the following map Antonio writes:

This map shows a very interesting ring pattern around some cities, including Atlanta, Dallas and Minneapolis. The red areas show strong population increase, including migration, and increase in available housing and high median income. The blue areas have a higher death rate, Federal Government payments to individuals, more widows, single person households and older people receiving social security.

Antonio's presentation might well illustrate the theme: "Data Scientist reads the Sunday paper and finds data to begin a conversation about what he read with his quantitative, R-literate friends".

This kind of active reading fits nicely with ideas about responsible, quantitative journalism that Chris Wiggins expresses in a presentation he recently made to the New York Open Statistical Programming Meetup. Here, Chris provides some insight into the role of Data Science at the New York Times and offers advice on using data to study relevant issues and clearly communicate findings. One major point in Chris' presentation is that data science plus clear communication can have a very positive influence on shaping our culture.

It is not an exaggeration to say that the kind of work that Xiaocun, Antonio and other R user group presenters undertake in their spare time "for fun" is valuable and important beyond the immediate goals of learning and teaching R.

For the past 7 years, Revolution Analytics has been the leading provider of R-based software and services to companies around the globe. Today, we're excited to announce a new, enhanced R distribution for everyone: Revolution R Open.

Revolution R Open is a downstream distribution of R from the R Foundation for Statistical Computing. It's built on the R 3.1.1 language engine, so it's 100% compatible with any scripts, packages or applications that work with R 3.1.1. It also comes with enhancements to improve your R experience, focused on performance and reproducibility:

- Revolution R Open is linked with the Intel Math Kernel Libraries (MKL). These replace the standard R BLAS/LAPACK libraries to improve the performance of R, especially on multi-core hardware. You don't need to modify your R code to take advantage of the performance improvements.

- Revolution R Open comes with the Reproducible R Toolkit. The default CRAN repository is a static snapshot of CRAN (taken on October 1). You can always access newer R packages with the checkpoint package, which comes pre-installed. These changes make it easier to share R code with other R users, confident that they will get the same results as you did when you wrote the code.
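On the performance point above: a quick, unscientific way to probe the effect of the linked BLAS on your own machine is to time a large matrix cross-product and compare across R distributions (the matrix size here is arbitrary):

```r
# Multi-threaded BLAS libraries such as the MKL typically speed up
# dense linear algebra like this substantially on multi-core hardware.
set.seed(1)
m <- matrix(rnorm(2000 * 2000), nrow = 2000)
system.time(crossprod(m))
```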

Today we are also introducing MRAN, a new website where you can find information about R, Revolution R Open, and R Packages. MRAN includes tools to explore R Packages and R Task Views, making it easy to find packages to extend R's capabilities. MRAN is updated daily.

Revolution R Open is available for download now. Visit mran.revolutionanalytics.com/download for binaries for Windows, Mac, Ubuntu, CentOS/Red Hat Linux and (of course) the GPLv2 source distribution.

With the new Revolution R Plus program, Revolution Analytics is offering technical support and open-source assurance for Revolution R Open and several other open source projects from Revolution Analytics (including DeployR Open, ParallelR and RHadoop). If you are interested in subscribing, you can find more information at www.revolutionanalytics.com/plus. And don't forget that big-data R capabilities are still available in Revolution R Enterprise.

We hope you enjoy using Revolution R Open, and that your workplace will be confident adopting R with the backing of technical support and open source assurance of Revolution R Plus. Let us know what you think in the comments!

Part 2 of a series

by Daniel Hanson, with contributions by Steve Su (author of the GLDEX package)

In our previous article, we introduced the four-parameter Generalized Lambda Distribution (GLD) and looked at fitting a 20-year set of returns from the Wilshire 5000 Index, comparing the results of two methods, namely the Method of Moments, and the Method of Maximum Likelihood.

Errata: One very important omission in Part 1, however, was not putting

require(GLDEX)

prior to the examples shown. Many thanks to a reader who pointed this out in the comments section last time.

Let’s also recall the code we used for obtaining returns from the Wilshire 5000 index, and the first four moments of the data (details are in Part 1):

require(quantmod) # quantmod package must be installed

getSymbols("VTSMX", from = "1994-09-01")

VTSMX.Close <- VTSMX[,4] # Closing prices

VTSMX.vector <- as.vector(VTSMX.Close)

# Calculate log returns

Wilsh5000 <- diff(log(VTSMX.vector), lag = 1)

Wilsh5000 <- 100 * Wilsh5000[-1] # Remove the NA in the first position,

# and put in percent format

# Moments of Wilshire 5000 market returns:

fun.moments.r(Wilsh5000,normalise="Y") # normalise="Y" -- subtracts the 3

# from the normal dist value.

# Results:

# mean variance skewness kurtosis

# 0.02824676 1.50214916 -0.30413445 7.82107430

Finally, in Part 1, we looked at two methods for fitting a GLD to this data, namely the Method of Moments (MM), and the Method of Maximum Likelihood (ML). We found that MM gave us a near perfect match in mean, variance, skewness, and kurtosis, but goodness of fit measures showed that we could not conclude that the market data was drawn from the fitted distribution. On the other hand, ML gave us a much better fit, but it came at the price of skewness being way off compared to that of the data, and kurtosis not being determined by the fitting algorithm (NA).

**Method of L-Moments (LM)**

Steve Su, in his contributions to this article series, suggested the option of a “third way”, namely the Method of L-Moments. As mentioned in the paper L-moments and TL-moments of the generalized lambda distribution (William H. Asquith, 2006):

“The method of L-moments is an alternative technique, which is suitable and popular for heavy-tailed distributions. The method of L-moments is particularly useful for distributions, such as the generalized lambda distribution (GLD), that are only expressible in inverse or quantile function form.”

Additional details on the method and algorithm for computing it can be found in this paper, noted above.

As we will see in the example that follows, the result is essentially a compromise between our first two results, but the goodness of fit is still far preferable to that of the Method of Moments.

We follow the same approach as above, but using the GLDEX function fun.RMFMKL.lm(.) to calculate the fitted distribution:

# Fit the LM distribution:

require(GLDEX) # Remembered it this time!

wshLambdaLM = fun.RMFMKL.lm(Wilsh5000)

# Compute the associated moments

fun.theo.mv.gld(wshLambdaLM[1], wshLambdaLM[2], wshLambdaLM[3], wshLambdaLM[4],

param = "fmkl", normalise="Y")

# The results are:

# mean variance skewness kurtosis

# 0.02824678 1.56947022 -1.32265715 291.58852044

As was the case with the maximum likelihood fit, the mean and variance are reasonably close, but the skewness and kurtosis do not match the empirical data. However, the skew is not as far off as in the ML case, and we are at least able to calculate a kurtosis value.

Looking at our goodness of fit tests based on the KS statistic:

fun.diag.ks.g(result = wshLambdaLM, data = Wilsh5000, no.test = 1000,

param = "fmkl")

# Result: 740/1000

ks.gof(Wilsh5000,"pgl", wshLambdaLM, param="fmkl")

# D = 0.0201, p-value = 0.03383

In the first case, our result of 740/1000 suggests a much better fit than the Method of Moments (53/1000) in Part 1, while falling slightly short of the ratio we obtained with the Method of Maximum Likelihood. In the second test, a p-value of 0.03383 is not overwhelmingly convincing, but since it exceeds α = 0.025 or α = 0.01, at those significance levels we technically fail to reject the hypothesis that our data is drawn from the fitted distribution.

Perhaps more interesting is just looking at the plot:

bins <- nclass.FD(Wilsh5000) # We get 158 (Freedman-Diaconis Rule)

fun.plot.fit(fit.obj = wshLambdaLM, data = Wilsh5000, nclass = bins, param = "fmkl",

xlab = "Returns", main = "Method of L-Moments")

Again, compared to the result for the Method of Moments in Part 1, the plot suggests that we have a better fit.

The QQ plot, however, is not much different from what we got in the Maximum Likelihood case; in particular, losses in the left tail are not underestimated by the fitted distribution as they are in the MM case:

qqplot.gld(fit = wshLambdaLM, data = Wilsh5000, param = "fmkl", type = "str.qqplot",

main = "Method of L-Moments")

Which Option is “Best”?

Steve Su points out that there is no one “best” solution, as there are trade-offs and competing constraints involved in the algorithms; this is one reason why, in addition to the methods described above, so many different functions are available in the GLDEX package. On one point, however, there is general agreement in the literature: the Method of Moments -- even with the appeal of matching the moments of the empirical data -- is inferior to other methods that result in a better fit. This is also discussed in the paper by Asquith, namely, that the method of moments “generally works well for light-tailed distributions. For heavy-tailed distributions, however, use of the method of moments can be questioned.”

**Comparison with the Normal Distribution**

For a strawman comparison, we can fit the Wilshire 5000 returns to a normal distribution in R, and run the KS test as follows:

require(MASS) # provides fitdistr

f <- fitdistr(Wilsh5000, densfun = "normal")

ks.test(Wilsh5000, "pnorm", f$estimate[1], f$estimate[2], alternative = "two.sided")

The results are as follows:

# One-sample Kolmogorov-Smirnov test

# data: Wilsh5000

# D = 0.0841, p-value < 2.2e-16

# alternative hypothesis: two-sided

With a p-value that small, we can firmly reject the returns data as being drawn from a fitted normal distribution.

We can also get a look at the plot of the implied normal distribution overlaid upon the fit we obtained with the method of L-moments, as follows:

x <- seq(min(Wilsh5000), max(Wilsh5000), length.out = bins)

# Chop the domain into bins = 158 intervals (Freedman-Diaconis) to get

# sample points from the fitted normal distribution

fun.plot.fit(fit.obj = wshLambdaLM, data = Wilsh5000, nclass = bins,

param = "fmkl", xlab = "Returns")

curve(dnorm(x, mean=f$estimate[1], sd=f$estimate[2]), add=TRUE,

col = "red", lwd = 2) # Normal curve in red

Although it may be a little difficult to see, note that between -3 and -4 on the horizontal axis, the tail of the normal fit (in red) falls below that of the GLD (in blue), and it is along this left tail where extreme events can occur in the markets. The normal distribution implies a lower probability of these “black swan” events than the more representative GLD.

This is further confirmed by looking at the QQ plot vs a normal distribution fit. Note how the theoretical fit (on the horizontal axis in this case, using the base R function qqnorm(.); i.e., the axes are switched compared to those in our previous QQ plots) vastly underestimates losses in the left tail.

qqnorm(Wilsh5000, main = "Normal Q-Q Plot")

qqline(Wilsh5000)

In summary, from these plots, we can see that the GLD fit, particularly using ML or LM, is a superior alternative to what we get with the normal distribution fit when estimating potential index losses.

**Conclusion**

We have seen, using R and the GLDEX package, how a four parameter distribution such as the Generalized Lambda Distribution can be used to fit a more realistic distribution to market data as compared to the normal distribution, particularly considering the fat tails typically present in returns data that cannot be captured by a normal distribution. While the Method of Moments as a fitting algorithm is highly appealing due to its preserving the moments of the empirical distribution, we sacrifice goodness of fit that can be obtained using other methods such as Maximum Likelihood, and L-Moments.

The GLD has been demonstrated in financial texts and research literature as a suitable distributional fit for determining market risk measures such as Value at Risk (VaR), Expected Shortfall (ES), and other metrics. We will look at examples in an upcoming article.

*Again, very special thanks are due to Dr Steve Su for his contributions and guidance in presenting this topic*.

The ability to create reproducible research is an important topic for many users of R. So important that several groups in the R community have tackled this problem, notably packrat from RStudio and gRAN from Genentech (see our previous blog post).

The **Reproducible R Toolkit** is a new open-source initiative from Revolution Analytics. It takes a simple approach to dealing with R package versions, consisting of an R package, checkpoint, and an associated daily CRAN snapshot archive, the checkpoint-server. Here's one illustration of the problem it solves (with apologies to xkcd):

To achieve reproducibility, we store daily snapshots of all CRAN packages. At midnight UTC each day we refresh the CRAN mirror and then store a snapshot of CRAN as it exists at that very moment. You can access these daily snapshots using the checkpoint package, which installs these packages and consistently uses them just as they existed at the snapshot date. Daily snapshots exist starting from 2014-09-17.

checkpoint package

The goal of the checkpoint package is to solve the problem of package reproducibility in R. Since packages get updated on CRAN all the time, it can be difficult to recreate an environment where all your packages are consistent with some earlier state. To solve this issue, checkpoint allows you to install packages locally as they existed on a specific date from the corresponding snapshot (stored on the checkpoint server) and it configures your R session to use only these packages. Together, the checkpoint package and the checkpoint server act as a "CRAN time machine", so that anyone using checkpoint can ensure the reproducibility of scripts or projects at any time.

How to use checkpoint

Once you have the checkpoint package installed, using the checkpoint() function is as simple as adding the following lines to the top of your script:
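The lines in question look like this (the date shown is only an example; you would use a date appropriate to your script):

```r
library(checkpoint)
checkpoint("2014-09-17")
```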

Typically, you will use the date you created the script as the argument to checkpoint. The first time you run the script, checkpoint will inspect your script (and other R files in the same project folder) for the packages used, and install the required packages with versions as of the specified date. (The first time you run the script, it will take some time to download and install the packages, but subsequent runs will use the previously-installed package versions.)

The checkpoint package installs the packages in a folder specific to the current project.

If you want to update the packages you use at a later date, just update the date in the checkpoint() call and checkpoint() will automatically update the locally-installed packages.

The checkpoint package is available on CRAN:

Worked example
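As a hypothetical sketch (the snapshot date and package are illustrative, and the checkpoint package must already be installed):

```r
# myscript.R -- pinned to the CRAN snapshot of 2014-10-01
library(checkpoint)
checkpoint("2014-10-01")

# From here on, packages load from a project-specific library
# containing the versions current on the snapshot date.
library(ggplot2)
ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point()
```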

The Reproducible R Toolkit was created by the Open Source Solutions group at Revolution Analytics. Special thanks go to Scott Chamberlain who helped with early development.

We'd love to know what you think about checkpoint. Leave comments here on the blog, or via the checkpoint GitHub page.

The Fantasy Football Analytics blog shares these 14 reasons why R is better than Excel for data analysis:

- More powerful data manipulation capabilities
- Easier automation
- Faster computation
- It reads any type of data
- Easier project organization
- It supports larger data sets
- Reproducibility (important for detecting errors)
- Easier to find and fix errors
- It's free
- It's open source
- Advanced Statistics capabilities
- State-of-the-art graphics
- It runs on many platforms
- Anyone can contribute packages to improve its functionality

The two most important in my mind are #2 (automation) and #7 (reproducibility), advantages that hold over any GUI-driven tool. The ability to use code to repeat your analyses and reproduce the results consistently cannot be overstated.

For more detailed background behind each of these reasons, and four situations where it's best to use Excel, check out the complete blog post linked below.

Fantasy Football Analytics: Why R is Better Than Excel for Fantasy Football (and most other) Data Analysis

by Joseph Rickert

In a recent post I talked about the information that can be developed by fitting a Tweedie GLM to a 143 million record version of the airlines data set. Since I started working with them about a year ago, I now see Tweedie models everywhere. Basically, any time I come across a histogram that looks like it might be a sample from a gamma distribution except for a big spike at zero, I see a candidate for a Tweedie model. (Having a Tweedie hammer makes lots of things look like Tweedie nails.) Nevertheless, apparently lots of people are seeing Tweedie these days. Even the scholarly citations for Maurice Tweedie's original paper are up.

Tweedie distributions are a subset of what are called Exponential Dispersion Models. EDMs are two parameter distributions from the linear exponential family that also have a dispersion parameter φ. Statistician Bent Jørgensen solidified the concept of EDMs in a 1987 paper, and named the following class of EDMs after Tweedie.

An EDM random variable Y follows a Tweedie distribution if

var(Y) = φ · V(μ)

where μ is the mean of the distribution, φ is the dispersion parameter, V is a function describing the mean/variance relationship of the distribution, and p is a constant such that:

V(μ) = μ^p

Some very familiar distributions fall into the Tweedie family. Setting p = 0 gives a normal distribution. p = 1 is Poisson. p = 2 gives a gamma distribution and p = 3 yields an inverse Gaussian. However, much of the action for fitting Tweedie GLMs is for values of p between 1 and 2. In this interval, closed form distribution functions don’t exist, but as it turns out, Tweedies in this interval are compound Poisson distributions. (A compound Poisson random variable Y is the sum of N independent gamma random variables where N follows a Poisson distribution and N and the gamma random variates are independent.)
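To make the compound Poisson connection concrete, here is a small base-R simulation sketch (the parameter values are arbitrary):

```r
# A compound Poisson draw: the sum of N iid gamma variates, where N
# is Poisson. For 1 < p < 2 a Tweedie variable has exactly this form,
# including the point mass at zero (when N = 0).
set.seed(1)
rcompois <- function(n, lambda, shape, rate) {
  N <- rpois(n, lambda)
  vapply(N, function(k) if (k == 0) 0 else sum(rgamma(k, shape, rate)),
         numeric(1))
}
y <- rcompois(10000, lambda = 2, shape = 3, rate = 1)
mean(y == 0)  # proportion of exact zeros; should be near exp(-2)
```

The spike at zero and the gamma-like right tail are exactly the histogram shape described above.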

This last fact helps to explain why Tweedies are so popular. For example, one might model the insurance claims for a customer as a series of independent gamma random variables and the number of claims in some time interval as a Poisson random variable. Or, the gamma random variables could be models for precipitation, and the total rainfall resulting from N rainstorms would follow a Tweedie distribution. The possibilities are endless.

R has quite a few resources for working with Tweedie models. Here are just a few. You can fit a Tweedie GLM with the tweedie family function in the statmod package:

# Fit an inverse-Gaussian GLM with log link

require(statmod) # provides the tweedie family function

glm(y ~ x, family = tweedie(var.power = 3, link.power = 0))

The tweedie package has several interesting functions for working with Tweedie models, including a function to generate random samples. The following graph shows four different Tweedie histograms as the power parameter moves from 1.2 to 1.9.

It is apparent that increasing the power shifts mass away from zero towards the right.
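If you want to experiment yourself, here is a rough base-graphics sketch using rtweedie() from the tweedie package (not the original code behind the graph, which arranged ggplot2 panels):

```r
# Draw Tweedie samples for several power values and compare histograms.
library(tweedie)
set.seed(123)
op <- par(mfrow = c(2, 2))
for (p in c(1.2, 1.5, 1.7, 1.9)) {
  y <- rtweedie(2000, mu = 1, phi = 1, power = p)
  hist(y, breaks = 40, main = paste("power =", p), xlab = "y")
}
par(op)
```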

(The code for producing these plots, which includes some nice code from Stephen Turner for putting all four ggplots on a single graph, is available: Download Plot_tweedie)

Package poistweedie also provides functions for simulating Tweedie models. Package HDtweedie implements an iteratively reweighted least squares algorithm for computing solution paths for grouped lasso and grouped elastic net Tweedie models. And, package cplm provides both likelihood and Bayesian functions for working with compound Poisson models. Be sure to have a look at the vignette for this package to see compound Poisson distributions in action.

Finally, two very readable references for both the math underlying Tweedie models and the algorithms to compute them are a couple of papers by Dunn and Smyth: here and here.

In case you missed them, here are some articles from September of particular interest to R users.

Norm Matloff argues that T-tests shouldn't be part of the Statistics curriculum and questions the "star system" for p-values in R.

A nice video introduction to the dplyr package and the %>% operator, presented by Kevin Markham.

An animation of police militarization in the US, created with R and open data published by the New York Times.

An overview of the miscellaneous R functions in the DescTools package.

Some guidance from Will Stanton on becoming a "data hacker" using R and Hadoop.

A tutorial on publishing ggplot2 graphics to the web with plotly.

A Shiny app that implements the Traveling Salesman problem and animates the simulated annealing algorithm behind the solution.

R code for comparing performance of machine learning models.

Presentations at DataWeek on applications of R at companies.

Announcing new members for the R Foundation and the R Core team.

A graduate student uses R to look at the popularity of posts on Reddit.

Google introduces the CausalImpact package for R, and uses it to evaluate performance of marketing campaigns.

A review of several recent and upcoming conferences that include R-related tracks.

More presentations and video interviews from the useR! 2014 conference, from DataScience.LA.

A detailed Rcpp example based on the Collatz Conjecture.

Use Rmarkdown to create documents combining text, mathematics, and R graphical and tabular output.

A very early example of data analysis: Nile floods in 450 BC.

The Rockefeller Institute of Government uses R to simulate the finances of public sector pension funds.

General interest stories (not related to R) in the past month included: ET for the Atari 2600, Talk Like a Pirate day photos, a parody lifestyle magazine for data scientists and the spread of the Ice Bucket Challenge.

As always, thanks for the comments and please send any suggestions to me at david@revolutionanalytics.com. Don't forget you can follow the blog using an RSS reader, via email using blogtrottr, or by following me on Twitter (I'm @revodavid). You can find roundups of previous months here.

by Daniel Hanson, with contributions by Steve Su (author of the GLDEX package). Part 1 of a series.

As most readers are well aware, market return data tends to have heavier tails than can be captured by a normal distribution, and skewness that the normal distribution likewise misses. For this reason, a four-parameter distribution such as the Generalized Lambda Distribution (GLD) can give us a more realistic representation of the behavior of market returns, including a more accurate measure of expected loss in risk management applications as compared to the normal distribution.

This is not to say that the normal distribution should be thrown in the dustbin, as the underlying stochastic calculus, based on Brownian Motion, remains a very convenient tool in modeling derivatives pricing and risk exposures (see earlier blog article here), but like all modeling methods, it has its strengths and weaknesses.

As noted in the book Financial Risk Modelling and Portfolio Optimization with R (Pfaff, Ch 6: Suitable distributions for returns) (publisher information provided here), the GLD is one of the recommended distributions to consider in order “to model not just the tail behavior of the losses, but the entire return distribution. This need arises when, for example, returns have to be sampled for Monte Carlo type applications.” The author provides descriptions and examples of several R packages freely available on the CRAN website, namely Davies, fBasics, gld, and lmomco. Another package, also freely available on CRAN, is the GLDEX package, which is the package we will use in the current article. It contains a rich offering of functions and is well documented. In addition, the author of the GLDEX package, Dr Steve Su, has kindly provided assistance in the writing of this article. He has also published a very useful and related article in the Journal of Statistical Software (JSS) (2007), to which we will refer in the discussion below.

The four parameters of the GLD are, not surprisingly, λ1, λ2, λ3, and λ4. Without going into theoretical details, suffice it to say that λ1 and λ2 are measures of location and scale respectively, while the skewness and kurtosis of the distribution are determined by λ3 and λ4.

Furthermore, there are two forms of the GLD that are implemented in GLDEX, namely those of Ramberg and Schmeiser (1974), and Freimer, Mudholkar, Kollia, and Lin (1988). These are commonly abbreviated as RS and FMKL. As the FMKL form is the more modern of the two, we will focus on it in the discussion that follows. An additional reference frequently cited in the literature related to the GLD in finance is the paper by Chalabi, Scott, and Wurtz, freely available here on the rmetrics website.
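For reference, the FMKL form is defined through its quantile function. A direct R transcription (valid for λ3, λ4 ≠ 0) is:

```r
# FMKL GLD quantile function: l1 = location, l2 = scale,
# l3 and l4 control the left and right tails respectively.
qgld_fmkl <- function(u, l1, l2, l3, l4) {
  l1 + ((u^l3 - 1) / l3 - ((1 - u)^l4 - 1) / l4) / l2
}
qgld_fmkl(0.5, 0, 1, 0.5, 0.5)  # median of a symmetric case is l1 = 0
```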

As Steve Su points out in his 2007 JSS article on the GLDEX package (see link above), there are three basic steps that are useful in determining the quality of the GLD fit. The first two, as we shall see, can be competing objectives in determining the fit. The GLDEX package provides functionality for each.

- Comparing the mean, variance, skewness and kurtosis of the fitted distribution with the empirical data
- The Kolmogorov-Smirnov (KS) resample test and goodness of fit
- Graphical outputs

Remark: The list of options is presented here in the opposite order from that in the JSS article, in order to assist in the development of the discussion, as we shall see.

*Market Returns Data*

Let’s first obtain some market data to use. The Wilshire 5000 index is commonly used as a measure of the total US equity market -- comprising large, medium, and small cap stocks -- so we call once again upon our old friend the quantmod package to access the past 20 years of daily closing prices of the Vanguard Total Stock Market Index Fund (VTSMX).

require(quantmod) # quantmod package must be installed

getSymbols("VTSMX", from = "1994-09-01")

VTSMX.Close <- VTSMX[,4] # Closing prices

VTSMX.vector <- as.vector(VTSMX.Close)

# Calculate log returns

Wilsh5000 <- diff(log(VTSMX.vector), lag = 1)

Wilsh5000 <- 100 * Wilsh5000[-1] # Remove the NA in the first position,

# and put in percent format

*Method of Moments*

Appealing to step (1) above, the following function uses the FMKL form to fit a GLD to the data with the method of moments:

wshLambdaMM <- fun.RMFMKL.mm(Wilsh5000)

This returns the estimated values of λ1, λ2, λ3, and λ4 in the vector wshLambdaMM:

[1] 0.04882924 1.98442097 -0.16423899 -0.13470102

Remark: Warning messages such as the following may occur when running this function, and may be safely ignored:

Warning messages:
1: In beta(a, b) : NaNs produced
2: In beta(a, b) : NaNs produced
…

We can then compare the four moments of the fitted distribution with those of the market data, using the following functions respectively:

# Moments of fitted distribution:
fun.theo.mv.gld(wshLambdaMM[1], wshLambdaMM[2], wshLambdaMM[3], wshLambdaMM[4],
                param = "fmkl", normalise = "Y")

# Results:
#       mean   variance   skewness   kurtosis
# 0.02824672 1.50214919 -0.30413445  7.8210743

# Moments of Wilshire 5000 market returns:
fun.moments.r(Wilsh5000, normalise = "Y")  # normalise="Y" subtracts 3
                                           # (the normal distribution value) from the kurtosis

# Results:
#       mean   variance   skewness   kurtosis
# 0.02824676 1.50214916 -0.30413445 7.82107430
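To make clear what these four summaries represent, they can be computed directly in base R. This is a simplified sketch using population-style moments, not GLDEX's implementation; the point is that normalise = "Y" reports *excess* kurtosis, i.e. kurtosis minus 3, the value for the normal distribution:

```r
# Sketch: mean, variance, skewness, and excess kurtosis of a sample
four.moments <- function(x) {
  m  <- mean(x)
  v  <- mean((x - m)^2)             # population variance
  sk <- mean((x - m)^3) / v^1.5     # skewness
  ku <- mean((x - m)^4) / v^2 - 3   # excess kurtosis (normal dist -> 0)
  c(mean = m, variance = v, skewness = sk, kurtosis = ku)
}

four.moments(c(-2, -1, 0, 1, 2))
# Symmetric data: skewness is 0, and the flat shape gives excess kurtosis -1.3
```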

We’re basically spot-on here, and things are looking pretty good; however, we haven’t looked at a goodness-of-fit test yet, and unfortunately, this will tell a different story. We will first look at the Kolmogorov-Smirnov (KS) resample test, as shown in the 2007 JSS article. The test is based on the Kolmogorov-Smirnov distance (D) between the sample data and the fitted distribution. The null hypothesis is, simply speaking, that the sample data are drawn from the fitted distribution.

The function here, from the GLDEX package, samples a proportion (default 90%) of the data, computes the KS test p-value against the fitted distribution, repeats this 1000 times (the no.test argument), and returns the number of times the p-value is not significant. The higher the count, the more confident we can be that the fitted distribution is reasonable.
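The resampling idea can be sketched in base R with ks.test, using a fitted normal distribution as a stand-in for the GLD. This is purely illustrative, not GLDEX's fun.diag.ks.g implementation:

```r
# Sketch of the resample diagnostic: repeatedly subsample the data, run a KS
# test against the fitted distribution, and count how often the p-value is
# NOT significant at the 5% level. A higher count suggests a more plausible fit.
ks.resample.count <- function(x, pfun, ..., prop = 0.9, no.test = 1000) {
  n.sub <- floor(prop * length(x))
  hits <- 0
  for (i in seq_len(no.test)) {
    sub <- sample(x, n.sub)
    if (suppressWarnings(ks.test(sub, pfun, ...)$p.value) > 0.05)
      hits <- hits + 1
  }
  hits  # out of no.test
}

set.seed(42)
x <- rnorm(500)  # stand-in data; a normal fit to normal data should score high
ks.resample.count(x, pnorm, mean(x), sd(x), no.test = 200)
```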

fun.diag.ks.g(result = wshLambdaMM, data = Wilsh5000, no.test = 1000,
              param = "fmkl")

Our result here is 53/1000, which suggests that the fit is poor. A more recent addition to the GLDEX package, not yet available when the 2007 JSS article was written, is the following:

ks.gof(Wilsh5000, "pgl", wshLambdaMM, param="fmkl")

where pgl is the GLD distribution function included in the GLDEX package (the analog of the pnorm normal distribution function included in Base R).

With a p-value of 1.912e-05, it is pretty safe to reject the hypothesis that the sample data is drawn from the fitted distribution.
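The call has the same shape as a one-sample ks.test in base R; for comparison, here is the analogous test against a fitted normal distribution (with simulated stand-in data, for illustration only):

```r
# One-sample KS test against a fitted normal -- the base-R analog of
# ks.gof(Wilsh5000, "pgl", wshLambdaMM, param = "fmkl"), with pnorm
# playing the role of pgl
set.seed(1)
returns <- rnorm(1000, mean = 0.03, sd = 1.2)  # stand-in for market returns
ks.test(returns, "pnorm", mean(returns), sd(returns))
```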

*Method of Maximum Likelihood (ML)*

As Steve Su points out in his JSS article, “The maximum likelihood estimation is usually the preferred method” for “providing definite fits to a data set using the GLD”. The function in the GLDEX package, again for the FMKL parameterization, is

wshLambdaML <- fun.RMFMKL.ml(Wilsh5000)

Checking our goodness of fit tests,

fun.diag.ks.g(result = wshLambdaML, data = Wilsh5000, no.test = 1000, param = "fmkl")

We get a result of 825/1000, and for

ks.gof(Wilsh5000,"pgl", wshLambdaML, param="fmkl")

we get D = 0.0151, p-value = 0.2

This p-value, while not spectacular, is far better than what we saw for the method of moments case, and the KS resample test is also much more convincing. But now, the “bad news”: if we look at the four moments of the ML fit,

fun.theo.mv.gld(wshLambdaML[1], wshLambdaML[2], wshLambdaML[3], wshLambdaML[4],
                param = "fmkl", normalise = "Y")

we get

#       mean   variance   skewness   kurtosis
# 0.02850058 1.64456695 -2.13494680         NA

While the mean and variance are reasonably close to their empirical counterparts, the skewness is wildly off (-2.13 versus the empirical -0.30), and the kurtosis can’t be determined by the algorithm.

*Graphical Comparison of Method of Moments and Maximum Likelihood*

Now, invoking step 3, let’s compare the plots resulting from the two different methods, using the fun.plot.fit() function provided in the GLDEX package to overlay the pdf curve of the fitted distribution on the histogram of the returns data. To ensure a meaningful plot, however, we should first determine a suitable number of bins for the histogram using the Freedman-Diaconis rule, via the following R function:

bins <- nclass.FD(Wilsh5000) # We get 158
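nclass.FD implements the Freedman-Diaconis rule, which sets the bin width to 2 · IQR · n^(-1/3). The count can be reproduced by hand (approximately -- base R rounds the data slightly before computing the IQR):

```r
# Freedman-Diaconis rule by hand: bin width h = 2 * IQR * n^(-1/3),
# number of bins = range / h, rounded up
fd.bins <- function(x) {
  h <- 2 * IQR(x) * length(x)^(-1/3)
  ceiling(diff(range(x)) / h)
}

set.seed(7)
x <- rnorm(1000)
fd.bins(x)  # should agree with nclass.FD(x) to within a bin
```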

Then pass nclass = bins to the plotting function in the GLDEX package:

# Method of Moments
fun.plot.fit(fit.obj = wshLambdaMM, data = Wilsh5000, nclass = bins,
             param = "fmkl", xlab = "Returns", main = "Method of Moments Fit")

# Method of Maximum Likelihood
fun.plot.fit(fit.obj = wshLambdaML, data = Wilsh5000, nclass = bins,
             param = "fmkl", xlab = "Returns", main = "Method of Maximum Likelihood")

Visual inspection of the plots is consistent with our findings above: the method of maximum likelihood results in a better fit of the data than the method of moments, despite the fact that the moments line up almost exactly in the case of the latter.

One more set of plots that one should inspect is the set of quantile (“QQ”) plots:

qqplot.gld(fit = wshLambdaMM, data = Wilsh5000, param = "fmkl",
           type = "str.qqplot", main = "Method of Moments")

qqplot.gld(fit = wshLambdaML, data = Wilsh5000, param = "fmkl",
           type = "str.qqplot", main = "Method of Maximum Likelihood")

Now, if we were to look at these two plots in a vacuum, so to speak, with none of the prior information available, a good case could be made that the QQ plot for the Method of Moments indicates the better fit. However, note that at about -4 along the horizontal (empirical data) axis, the plotted points start to drift above the line where the horizontal and vertical axis values are equal. This implies that the fit underestimates market losses as we move out toward the left tail of the distribution. The QQ plot for Maximum Likelihood is more conservative, erring on the side of caution: its fitted distribution indicates a greater risk of large losses than the Method of Moments fit does. As Steve Su puts it, the general recommendation is to look at the QQ plot and KS test results together to judge the goodness of fit; the QQ plot alone is not a foolproof method.

**Conclusion**

We have seen, using R and the GLDEX package, how a four-parameter distribution such as the Generalized Lambda Distribution can be fitted to market data. While the Method of Moments is highly appealing as a fitting algorithm because it preserves the moments of the empirical distribution, it sacrifices the goodness of fit that can be obtained with the Method of Maximum Likelihood.

In our next article, we will look at an alternative GLD fitting method known as the Method of L-Moments as a compromise between the two methods discussed here, and then conclude with a comparison against the normal distribution, which will exhibit quite clearly the advantages of the GLD when it comes to fitting financial returns data.

*Very special thanks are due to Dr Steve Su for his contributions and guidance in presenting this topic.*