by Joseph Rickert

In a recent post I talked about the information that can be developed by fitting a Tweedie GLM to a 143 million record version of the airlines data set. Since I started working with them about a year or so ago, I now see Tweedie models everywhere. Basically, any time I come across a histogram that looks like it might be a sample from a gamma distribution except for a big spike at zero, I see a candidate for a Tweedie model. (Having a Tweedie hammer makes lots of things look like Tweedie nails.) Nevertheless, apparently lots of people are seeing Tweedie these days. Even the scolarly citations for Maurice Tweedie's original paper are up.

Tweedie distributions are a subset of what are called Exponential Dispersion Models. EDMs are two parameter distributions from the linear exponential family that also have a dispersion parameter f. Statistician Bent Jørgensen solidified the concept of EDMs in a 1987 paper, and named the following class of EDMs after Tweedie.

An EDM random variable Y follows a Tweedie distribution if

var(Y) = f * V(m)

where m is the mean of the distribution, f is the dispersion parameter, V is function describing the mean/variance relationship of the distribution and p is a constant such that:

V(m) = m^{p }

Some very familiar distributions fall into the Tweedie family. Setting p = 0 gives a normal distribution. p = 1 is Poisson. p = 2 gives a gamma distribution and p = 3 yields an inverse Gaussian. However, much of the action for fitting Tweedie GLMs is for values of p between 1 and 2. In this interval, closed form distribution functions don’t exist, but as it turns out, Tweedies in this interval are compound Poisson distributions. (A compound Poisson random variable Y is the sum of N independent gamma random variables where N follows a Poisson distribution and N and the gamma random variates are independent.)

This last fact helps to explain why Tweedies are so popular. For example, one might model the insurance claims for a customer as a series of independent gamma random variables and the number of claims in some time interval as a Poisson random variable. Or, the gamma random variables could be models for precipation, and the total rainfall resulting from N rainstorms would follow a Tweedie distribution. The possibilities are endless.

R has a quite a few resources for working with Tweedie models. Here are just a few. You can fit Tweedie GLM model with the tweedie function in the statmod package .

# Fit an inverse-Gaussion glm with log-link

glm(y~x,family=tweedie(var.power=3,link.power=0))

The tweedie package has several interesting functions for working with Tweedie models including a function to generate random samples.The following graph shows four different Tweedie histograms as the power parameter moves from 1.2 to 1.9.

It is apparent that increasing the power shifts mass away from zero towards the right.

(Note the code for producing these plots which includes some nice code from Stephen Turner for putting all for ggplots on a single graph are availale are available. Download Plot_tweedie)

Package poistweedie also provides functions for simulating Tweedie models. Package HDtweedie implements an iteratively reweighted least squares algorithm for computing solution paths for grouped lasso and grouped elastic net Tweedie models. And, package cplm provides both likelihood and Bayesian functions for working with compound Poisson models. Be sure to have a look at the vignette for this package to see compound Poisson distributions in action.

Finally, two very readable references for both the math underlying Tweedie models and the algorithms to compute them are a couple of papers by Dunn and Smyth: here and here.

Another package to add to your list, and one which comes with R; mgcv. For a while mgcv has has a Tweedie family for use in `gam()` etc., but you had to set the power parameter p and hence optimise over a range of values p by hand. Since version 1.8.0, mgcv has included a new tw family function which treats p as just another model parameter to be included in the model and optimised for.

Posted by: Gavin Simpson | October 09, 2014 at 09:43