by Max Kuhn: Director, Nonclinical Statistics, Pfizer
I spend a lot of time working on machine learning problems where we need to predict some future data point based on historical data. I just wrote a book on the subject and developed the caret R package to facilitate my work. caret stands for Classification And REgression Training (and yes, I back-fit the name to the acronym).
One of the primary features of caret is to provide a unified user interface to the abundance of machine learning models in R. Since these models are often written by different authors, they can have very different interfaces and conventions. I don't want to spend any of my intellectual resources on remembering the syntactical minutiae of each function. caret spares the user this pain by acting as a wrapper around a large number of modeling functions; in this way, the user only needs to remember one set of syntax.
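To make that concrete, here is a minimal sketch of the unified interface; the data set and the two model choices are just illustrative picks, but only the "method" string changes between the two fits:

library(caret)

## Two very different underlying packages, one calling convention:
rf_fit  <- train(Species ~ ., data = iris, method = "rf")
svm_fit <- train(Species ~ ., data = iris, method = "svmRadial")

## Predictions also go through a single predict() interface:
head(predict(rf_fit, newdata = iris))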
When I first started the package here at Pfizer in 2005, I didn't think much about distributing the code and I foolishly hard-coded the details of each model into the package. The downside to this approach is that the package currently has a lot of dependencies, and I mean a lot. There are 76 packages in the “Suggests” field and 7 in the “Depends” field. Keep in mind that, to check this package, the CRAN maintainers have to make sure that all of those dependencies (and their dependencies, and so on) are working and that the current versions are installed. Needless to say, our fine R Core folks who maintain CRAN are very generous with their time.
One other problem with this approach is that, since the model-specific details are essentially hidden, other people and packages cannot utilize or modify them.
Given this, I've been working for the last few months on refactoring my package to keep most of the model-specific code formalized and indexed in a way that 1) reduces the number of package dependencies and 2) allows users to “roll their own” model code. I'm finishing up testing and should send this to CRAN in the next month, depending on whether I can shake this horrible cold.
In doing this, I made a catalog entry for each model with components related to model training, prediction, variable importance and so on. I also put in tags for each model so that users can easily find functions that have similar features to packages they already know about. Example tags would be “tree-based model”, “Implicit Feature Selection”, “Boosting” and so on. The current mapping between models and tags can be found in this file. I'm still refining them.
## Read the tag catalog: one row per model, one column per possible tag
tag_data <- read.csv("model_tags.csv", stringsAsFactors = FALSE)
## Use the model names as row names, then drop the redundant column
rownames(tag_data) <- tag_data$Model
tag_data$Model <- NULL
The file contains data on 143 models and, for each, there are 41 possible tags.
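As a quick way to poke at the catalog, the tag columns can be queried directly; the column name below is one of the example tags mentioned above, so treat the exact header as an assumption about the file:

## Models carrying the "Boosting" tag (assumes a column with that exact name):
rownames(tag_data)[tag_data[, "Boosting"] == 1]

## How many tags each model carries:
head(sort(rowSums(tag_data), decreasing = TRUE))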
How can we visualize these data? We can calculate some measure of tag similarity between the models and threshold those values to say which models are “similar”. From there, we can draw a network graph. The tags are represented as binary values (i.e. 1 if the model uses boosting and 0 otherwise) and I used the Jaccard index via the proxy package to do this calculation.
library(proxy)
## Jaccard distance between each pair of models' binary tag vectors:
D <- dist(tag_data, method = "Jaccard")
Dm <- as.matrix(D)
## Currently a distance, so we can convert it to similarity:
sim <- 1 - Dm
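For binary tag vectors, the Jaccard similarity is just the number of tags two models share divided by the number of tags either one has, so the proxy result can be sanity-checked by hand. This is only a sketch, with the two models picked arbitrarily from the row names:

## Jaccard similarity of two 0/1 vectors: |intersection| / |union|
jaccard <- function(a, b) sum(a == 1 & b == 1) / sum(a == 1 | b == 1)

m1 <- rownames(sim)[1]
m2 <- rownames(sim)[2]
## These two numbers should agree:
c(by_hand = jaccard(tag_data[m1, ], tag_data[m2, ]), via_proxy = sim[m1, m2])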
I've had to play around with the threshold value a little bit, but a value around 0.55 seems reasonable: two models are called “similar” if their Jaccard similarity is greater than the threshold.
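One quick way to play with the threshold is to count how many model pairs would be connected at a few candidate cutoffs (purely exploratory; the candidate values are arbitrary):

## Number of distinct model pairs at or above each candidate cutoff:
cutoffs <- c(0.45, 0.50, 0.55, 0.60)
sapply(cutoffs, function(x) sum(sim[upper.tri(sim)] >= x))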
R has some great packages to create graphs. However, I wanted something interactive, so I used a D3 implementation of force-directed graphs from Mike Bostock. To use his JavaScript code, I need to encode the similarity data into a graph format and save it as JSON. I used some basic paste commands to create the nodes list, but first I wanted to color the node for each model based on whether the model works for classification, regression or both. The R object grps contains the mapping between models and this characteristic:
grps <- rep(NA, nrow(sim))
## group 4: models that handle both classification and regression
grps[tag_data[, "Classification"] == 1 & tag_data[, "Regression"] == 1] <- 4
## group 2: regression-only models
grps[tag_data[, "Classification"] == 0 & tag_data[, "Regression"] == 1] <- 2
## group 3: classification-only models
grps[tag_data[, "Classification"] == 1 & tag_data[, "Regression"] == 0] <- 3
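A quick check that every model got a group (anything left as NA would be a model tagged as neither classification nor regression):

## 2 = regression only, 3 = classification only, 4 = both
table(grps, useNA = "ifany")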
Now, we want this part of the JSON data to look like this:
"nodes":[ {"name":"model0","group":3}, {"name":"model1","group":4}, -snip-
This R code gets us there:
## One JSON object per model, then wrap them all in the "nodes" array:
nodes <- paste(' {"name":"', rownames(sim), '","group":',
               grps, "}", sep = "")
nodes <- paste(' "nodes":[\n',
               paste(nodes, collapse = ",\n"),
               '\n ],', sep = "")
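A quick peek at the start of the assembled string confirms it matches the format above:

## Show just the first chunk of the "nodes" JSON:
cat(substr(nodes, 1, 150))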
We also need to encode when nodes (i.e. models) are formally similar to each other and should be connected. Here, the JSON should look like this:
"links":[ {"source":0,"target":6,"value":1}, {"source":0,"target":8,"value":2}, {"source":0,"target":9,"value":3}, -snip-
The source and target values are zero-based indices for the nodes (i.e. model 0 is the first node listed in the JSON file). The number after value describes how thick the line between the nodes should be. I used a subjectively defined function of similarity: the thickness is 20 times the amount by which the similarity exceeds the threshold.
To produce this part of the JSON, we loop through all pairs of models and keep those whose similarity is greater than the threshold:
thresh <- 0.55
binary <- sim >= thresh

links <- NULL
index <- 0
for(i in 1:nrow(binary)) {
  for(j in i:ncol(binary)) {
    ## skip the diagonal (a model is trivially similar to itself)
    if(i != j) {
      if(binary[i, j]) {
        index <- index + 1
        ## line thickness: 20 times the amount the similarity exceeds the threshold
        val <- round((sim[i, j] - thresh) * 20, 2)
        tmp <- paste(' {"source":', i - 1,
                     ',"target":', j - 1,
                     ',"value":', val, '}', sep = "")
        links <- if(is.null(links)) tmp else c(links, tmp)
        rm(tmp)
      }
    }
  }
}
links <- paste(' "links":[\n',
               paste(links, collapse = ",\n", sep = ""),
               '\n ]\n', sep = "")

## Write the full JSON file for the D3 script to read:
cat('{\n', nodes, links, '}', sep = "", file = "model_data.json")
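As an aside, the double loop can also be written in a vectorized way; this sketch should build the same link strings (possibly in a different order, which the force layout does not care about):

## Keep each pair once by looking only at the upper triangle of the similarity matrix.
## Column 1 of "pairs" is the row index, column 2 is the column index.
pairs <- which(binary & upper.tri(binary), arr.ind = TRUE)
vals  <- round((sim[pairs] - thresh) * 20, 2)
links2 <- paste(' {"source":', pairs[, 1] - 1,
                ',"target":', pairs[, 2] - 1,
                ',"value":', vals, '}', sep = "")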
The results look like this:
Orange is regression only, dark blue is classification only and light blue is “dual use”. Hover over a circle to get the model name and the model code used by the caret package; refreshing the screen will reconfigure the layout. There are a few clusters of analogous models. In all, there are some small tweaks that I should make to the tags, but I think they cluster the models fairly well. What do you think?