By Joseph Rickert
Even to the practiced eye, looking at coefficients in R model summaries can be tedious. Capturing information about the significance of coefficients from scores, or maybe even hundreds, of models in a way that makes writing the final report a bit easier is a time-consuming and thankless task. Of course, once you know what you are looking for, it only takes a few lines of code to select coefficients and plot them. Nevertheless, it would be nice to have a function that just plots the coefficients with error bars. The coefplot package, a relatively recent effort by Jared Lander, does exactly this and has the potential to become a very useful tool. Built on top of ggplot2 graphics, coefplot plots coefficients from lm and glm models as well as from the big data models generated by RevoScaleR's rxLinMod and rxLogit functions. A small example from Revolution Analytics’ Saar Golde illustrates the use of coefplot. The R code reads in credit data (see table) from 10 separate csv files, concatenates them into a single file,
| creditScore | houseAge | yearsEmploy | ccDebt | year | default |
| --- | --- | --- | --- | --- | --- |
| 691 | 16 | 9 | 6725 | 2000 | 0 |
| 691 | 4 | 4 | 5077 | 2000 | 0 |
| 743 | 18 | 3 | 3080 | 2000 | 0 |
| 728 | 22 | 1 | 4345 | 2000 | 0 |
| 745 | 17 | 3 | 2969 | 2000 | 0 |
| 539 | 15 | 3 | 4588 | 2000 | 0 |
and uses RevoScaleR’s rxLinMod function to perform the linear regression:
default ~ F(year) + yearsEmploy + ccDebt + creditScore
Note that the F function makes year a factor on the fly so that the regression will produce a coefficient for each year. Running coefplot on the model object produces the graph of the coefficients.
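For readers without RevoScaleR, the same workflow can be sketched in base R. This is only an illustration under stated assumptions: the data are simulated (the column names follow the table, but every value is randomly generated), the file names and the `csv_dir` folder are hypothetical, and `lm()` with `factor(year)` stands in for rxLinMod with `F(year)`. The coefplot call is guarded so that the rest of the sketch runs even if the package is not installed.

```r
# Simulated stand-in for the credit data. Column names follow the table,
# but all values are randomly generated for illustration only.
set.seed(42)
years   <- 2000:2009
csv_dir <- file.path(tempdir(), "credit_csv")  # hypothetical location
dir.create(csv_dir, showWarnings = FALSE)

for (yr in years) {
  n <- 200
  one_year <- data.frame(
    creditScore = round(rnorm(n, mean = 700, sd = 50)),
    houseAge    = sample(1:30, n, replace = TRUE),
    yearsEmploy = sample(0:15, n, replace = TRUE),
    ccDebt      = round(runif(n, min = 1000, max = 10000)),
    year        = yr
  )
  # Made-up relationship: default risk rises with debt, falls with credit score.
  p <- plogis(-2 + 0.0003 * one_year$ccDebt -
                0.01 * (one_year$creditScore - 700))
  one_year$default <- rbinom(n, size = 1, prob = p)
  write.csv(one_year,
            file.path(csv_dir, paste0("credit_", yr, ".csv")),
            row.names = FALSE)
}

# Read the 10 csv files and concatenate them into a single data frame.
files  <- list.files(csv_dir, pattern = "^credit_[0-9]{4}\\.csv$",
                     full.names = TRUE)
credit <- do.call(rbind, lapply(files, read.csv))

# Base-R analogue of the rxLinMod formula: factor(year) plays the role of F(year),
# so the regression produces one coefficient per year (minus the reference year).
fit <- lm(default ~ factor(year) + yearsEmploy + ccDebt + creditScore,
          data = credit)

# The "few lines of code" version: pull out estimates and standard errors
# and build rough +/- 2 standard error bars by hand.
est <- coef(summary(fit))
coef_table <- data.frame(
  term     = rownames(est),
  estimate = est[, "Estimate"],
  lower    = est[, "Estimate"] - 2 * est[, "Std. Error"],
  upper    = est[, "Estimate"] + 2 * est[, "Std. Error"],
  row.names = NULL
)

# The one-liner version, if the coefplot package is available.
if (requireNamespace("coefplot", quietly = TRUE)) {
  print(coefplot::coefplot(fit))
}
```

Here `coef_table` holds the numbers that a coefficient plot displays, which is handy when the same values also need to go into a report table.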
This is slick, but to be really useful coefplot should be able to handle models with thousands of coefficients. I spoke with Jared about this. He said that he is well aware of the problem and is working on it:
“The big issue is identifying levels that belong to factors, which I solved, even for interactions. But how do people specify levels that might belong to different factors, or how to handle a specified level and its interactions, etc....”
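To make the problem concrete: for ordinary lm fits, base R already records which model term each coefficient came from, in the `assign` attribute of the model matrix. The sketch below uses made-up data and hypothetical variable names (`grp`, `x`), and it is not a description of how coefplot solves the problem internally; it only shows one way to map coefficients back to factors and interactions.

```r
# Made-up data with two factors, one numeric predictor, and an interaction.
set.seed(1)
d <- data.frame(
  y    = rnorm(60),
  year = factor(rep(2000:2002, each = 20)),
  grp  = factor(rep(c("a", "b"), times = 30)),
  x    = rnorm(60)
)
fit <- lm(y ~ year * grp + x, data = d)

# Each column of the model matrix corresponds to one coefficient.
mm          <- model.matrix(fit)
term_labels <- attr(terms(fit), "term.labels")
assign_idx  <- attr(mm, "assign")  # 0 = intercept, k = k-th term label

# Map every coefficient name to the term (factor or interaction) it belongs to.
coef_map <- data.frame(
  coefficient = colnames(mm),
  term        = c("(Intercept)", term_labels)[assign_idx + 1],
  stringsAsFactors = FALSE
)
print(coef_map)
```

With a map like this, all levels of `year` or of the `year:grp` interaction can be selected by term name rather than by pattern-matching coefficient labels. The harder questions in the quote, such as a level name shared by several factors, remain open.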
It is difficult to build useful tools, and an amazing feature of the open-source R project is that so many people are willing to try. Jared also said that he is open to suggestions.
Very interesting - thank you for your post.
Posted by: Tal Galili | January 03, 2012 at 14:04
This type of plot easily leads to a probabilistic Bayesian interpretation of frequentist parameter estimates. The visuals just seem to suggest it.
Bayesians have long used boxplots to visualize parameter estimates.
The frequentist standard error added to and subtracted from the point estimate is something different, however.
Posted by: Joint_Posterior | January 03, 2012 at 14:25
I have worked for a while on the coefplot2 package (on r-forge), which is an extension and partial reworking of the coefplot function from the arm package. It focuses more on the back-end machinery of extracting coefficients in a common format from as many different model types as possible. It would be great to see these approaches pooled.
Posted by: Ben Bolker | January 03, 2012 at 18:15
@Ben The coefplot from arm was my inspiration too. I worked with one of Andy Gelman's collaborators over the course of my development. The backend machinery is where the bulk of the effort lies. I'd be happy to work together. I cannot find your email so please contact me.
Posted by: Jared | January 11, 2012 at 08:52