by Siddarth Ramesh, Data Scientist, Microsoft
I’m an R programmer. To me, R has been great for data exploration, transformation, statistical modeling, and visualizations. However, there is a huge community of Data Scientists and Analysts who turn to Python for these tasks. Moreover, both R and Python experts exist in most analytics organizations, and it is important for both languages to coexist.
Many times, this means that R coders will develop a workflow in R but then must redesign and recode it in Python for their production systems. If the coder is lucky, this is easy, and the R model can be exported as a serialized object and read into Python. There are packages that do this, such as pmml. Unfortunately, this is often more challenging because the production system might demand that the entire end-to-end workflow be built exclusively in Python. That’s sometimes tough because there are aspects of statistical model building in R which are more intuitive than in Python.
Python has many strengths, such as its robust data structures (for example, dictionaries), its compatibility with Deep Learning and Spark, and its ability to serve as a multipurpose language. However, many scenarios in enterprise analytics require people to go back to basic statistics and Machine Learning, for which the classic Data Science packages in Python are not as intuitive as R. The key difference is that many statistical methods are built into R natively. As a result, there is a gap when R users must build workflows in Python. To try to bridge this gap, this post will discuss a relatively new package developed by Microsoft, revoscalepy.
Revoscalepy is the Python implementation of the R package RevoScaleR included with Microsoft Machine Learning Server.
The methods in ‘revoscalepy’ are the same, and more importantly, the way the R user can view data is the same. This matters because, for an R programmer, understanding the shape and structure of the data is one of the challenges of getting used to Python. In Python, data types are different, preprocessing the data is different, and the criteria for feeding the processed dataset into a model are different.
To understand how revoscalepy eases the transition from R to Python, the following section will compare building a decision tree using revoscalepy with building a decision tree using sklearn. The Titanic dataset from Kaggle will be used for this example. To be clear, this post is written from an R user’s perspective, as many of the challenges this post will outline are standard practices for native Python users.
revoscalepy works on Python 3.5, and can be downloaded as a part of Microsoft Machine Learning Server. Once downloaded, set the Python environment path to the python executable in the MML directory, and then import the packages.
The first chunk of code imports the revoscalepy, numpy, pandas, and sklearn packages, and imports the Titanic data. Pandas has some R roots in that it has its own implementation of DataFrames as well as methods that resemble R’s exploratory methods.
import revoscalepy as rp
import numpy as np
import pandas as pd
import sklearn as sk
titanic_data = pd.read_csv('titanic.csv')
titanic_data.head()
One of the challenges as an R user with using sklearn is that the decision tree model for sklearn can only handle the numeric datatype. Pandas has a categorical type that looks like factors in R, but sklearn’s Decision Tree does not integrate with this. As a result, numerically encoding the categorical data becomes a mandatory step. This example will use a one-hot encoder to shape the categories in a way that sklearn’s decision tree understands.
The side effect of having to one-hot encode the variable is that if the dataset contains high cardinality features, it can be memory intensive and computationally expensive because each category becomes its own binary column. While implementing one-hot encoding itself is not a difficult transformation in Python and provides good results, it is still an extra step for an R programmer to have to manually implement. The following chunk of code detaches the categorical columns, label and one-hot encodes them, and then reattaches the encoded columns to the rest of the dataset.
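from sklearn import tree
import sklearn.preprocessing  # make sure sk.preprocessing is loaded
le = sk.preprocessing.LabelEncoder()
# Pull out the string columns and drop the free-text fields
x = titanic_data.select_dtypes(include=[object])
x = x.drop(['Name', 'Ticket', 'Cabin'], axis = 1)
# Treat passenger class as a categorical column as well
x = pd.concat([x, titanic_data['Pclass']], axis = 1)
x['Pclass'] = x['Pclass'].astype('object')
x = pd.DataFrame(x)
x = x.fillna('Missing')
# Label encode each column, then one-hot encode the integer labels
x_cats = x.apply(le.fit_transform)
enc = sk.preprocessing.OneHotEncoder()
enc.fit(x_cats)
onehotlabels = enc.transform(x_cats).toarray()
# Reattach the encoded columns to the numeric columns
encoded_titanic_data = pd.concat([pd.DataFrame(titanic_data.select_dtypes(include=[np.number])), pd.DataFrame(onehotlabels)], axis = 1)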
At this point, there are more columns than before, and the columns no longer have semantic names (they have been enumerated). This means that if a decision tree is visualized, it will be difficult to understand without going through the extra step of renaming these columns. There are techniques in Python to help with this, but it is still an extra step that must be considered.
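As a rough illustration of what one-hot encoding does to the column count and the column names, here is a minimal pure-Python sketch. It is not the sklearn implementation: the helper `one_hot`, the naming scheme `<column>_<category>`, and the three passenger rows are all made up for this example.

```python
def one_hot(rows, column):
    """One-hot encode `column` in a list of dict rows, keeping readable names."""
    # Sorted distinct categories give a stable column order.
    categories = sorted({row[column] for row in rows})
    encoded = []
    for row in rows:
        # Copy every other field unchanged.
        new_row = {k: v for k, v in row.items() if k != column}
        # One 0/1 indicator per category, named "<column>_<category>".
        for cat in categories:
            new_row[column + "_" + str(cat)] = int(row[column] == cat)
        encoded.append(new_row)
    return encoded


# Toy rows invented for illustration (not read from titanic.csv).
passengers = [
    {"Age": 22, "Sex": "male", "Embarked": "S"},
    {"Age": 38, "Sex": "female", "Embarked": "C"},
    {"Age": 26, "Sex": "female", "Embarked": "S"},
]

encoded = passengers
for col in ("Sex", "Embarked"):
    encoded = one_hot(encoded, col)

print(sorted(encoded[0].keys()))
```

Each categorical column is replaced by one indicator column per distinct value, which is why high-cardinality features widen the dataset so quickly. Carrying the category into the column name is one way to keep a fitted tree readable.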
Unlike sklearn, revoscalepy reads pandas’ ‘category’ type like factors in R. This section of code iterates through the DataFrame, finds the string types, and converts those types to ‘category’. In pandas, there is an argument to set the order to False, to prevent ordered factors.
titanic_data_object_types = titanic_data.select_dtypes(include = ['object'])
titanic_data_object_types_columns = np.array(titanic_data_object_types.columns)
for column in titanic_data_object_types_columns:
    titanic_data[column] = titanic_data[column].astype('category', ordered = False)
titanic_data['Pclass'] = titanic_data['Pclass'].astype('category', ordered = False)
The dataset is now ready to be fed into the revoscalepy model.
One difference between implementing a model in R and in sklearn in Python is that sklearn does not use formulas.
Formulas are important and useful for modeling because they provide a consistent framework to develop models with varying degrees of complexity. With formulas, users can easily apply different types of variable cases, such as ‘+’ for separate independent variables, ‘:’ for interaction terms, and ‘*’ to include both the variables and their interaction terms, along with many other convenient calculations. Within a formula, users can do mathematical calculations, create factors, and include more complex entities such as third-order interactions. Furthermore, formulas allow for building highly complex models such as mixed effect models, which are next to impossible to build without them. In Python, there are packages such as ‘statsmodels’ which have more intuitive ways to build certain statistical models. However, statsmodels has a limited selection of models, and does not include tree-based models.
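To make the three operators concrete, here is a minimal Python sketch that expands the right-hand side of a formula into its individual terms. It is an illustration only: the helper name `expand_formula_rhs` is hypothetical, this is not how R or revoscalepy parse formulas, and it handles only ‘+’, ‘:’ and ‘*’ on plain variable names (no parentheses, ‘-’, ‘^’ or function calls).

```python
from itertools import combinations


def expand_formula_rhs(rhs):
    """Expand '+', ':' and '*' on the right-hand side of a formula into terms."""
    terms = []
    for part in rhs.split("+"):
        # 'a*b' means the main effects plus their interaction: a + b + a:b.
        factors = [f.strip() for f in part.split("*")]
        for size in range(1, len(factors) + 1):
            for combo in combinations(factors, size):
                term = ":".join(combo)
                if term not in terms:  # skip duplicates, keep first-seen order
                    terms.append(term)
    return terms


print(expand_formula_rhs("Age + Sex*Pclass"))
print(expand_formula_rhs("a*b*c"))
```

Without formula support, each interaction term in such a list has to be built by hand as an extra column before the model is fitted.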
With sklearn, model.fit expects the independent and dependent terms to be columns from the DataFrame. Interactions must be created manually as a preprocessing step for more complex examples. The code below trains the decision tree:
model = tree.DecisionTreeClassifier(max_depth = 50)
x = encoded_titanic_data.drop(['Survived'], axis = 1)
x = x.fillna(-1)
y = encoded_titanic_data['Survived']
model = model.fit(x,y)
revoscalepy brings back formulas. Granted, users cannot view the formula the same way as they can in R, because formulas are strings in Python. However, importing code from R to Python is an easy transition because formulas are read the same way in the revoscalepy functions as in the model-fitting functions in R. The code below fits the Decision Tree in revoscalepy.
#rx_dtree works with formulas, just like models in R
form = 'Survived ~ Pclass + Sex + Age + Parch + Fare + Embarked'
titanic_data_tree = rp.rx_dtree(form, titanic_data, max_depth = 50)
The resulting object, titanic_data_tree, is the same structural object that rxDTree() would create in R. Because the individual elements that make up the rx_dtree() object are the same as those of rxDTree(), R users can easily understand the decision tree without having to translate between two object structures.
From the workflow, it should be clear how revoscalepy can help with transliteration between R and Python. Sklearn has different preprocessing considerations because the data must be fed into the model differently. The advantage of revoscalepy is that R programmers can easily convert their R code to Python without thinking too much about the ‘Pythonic way’ of implementing their R code. Categories replace factors, rx_dtree() reads the R-like formula, and the arguments are similar to the R equivalent. Looking at the big picture, revoscalepy is one way to ease Python for the R user, and future posts will cover other ways to transition between R and Python.
Microsoft Docs: Introducing revoscalepy
by Bob Horton and Vanja Paunic, Microsoft AI and Research Data Group
Training deep learning models from scratch requires large data sets and significant computational resources. Using pre-trained deep neural network models to extract relevant features from images allows us to build classifiers using standard machine learning approaches that work well for relatively small data sets. In this context, a deep learning solution can be thought of as incorporating layers that compute features, followed by layers that map these features to outcomes; here we’ll just map the features to outcomes ourselves.
We explore an example of using a pre-trained deep learning image classifier to generate features for use with traditional machine learning approaches to address a problem the original model was never trained on (see the blog post “Image featurization with a pre-trained deep neural network model” for other examples). This approach allows us to quickly and easily create a custom classifier for a specific specialized task, using only a relatively small training set. We use the image featurization abilities of Microsoft R Server 9.1 (MRS) to create a classifier for different types of knots in lumber. These images were made publicly available from the laboratory of Prof. Dr. Olli Silven, University of Oulu, Finland, in 1995. Please note that we are using this problem as an academic example of an image classification task with clear industrial implications, but we are not really trying to raise the bar in this well-established field.
We characterize the performance of the machine learning model and describe how it might fit into the framework of a lumber grading system. Knowing the strengths and weaknesses of the classifier, we discuss how it could be used to triage additional image data for labeling by human experts, so that the system can be iteratively improved.
The pre-trained deep learning models used here are optional components that can be installed alongside Microsoft R Server 9.1; directions are here.
In the sawmill industry, lumber grading is an important step of the manufacturing process. Improved grading accuracy and better control of quality variation in production lead directly to improved profits. Grading has traditionally been done by visual inspection, in which a (human) grader marks each piece of lumber as it leaves the mill, according to factors like size, category, and position of knots, cracks, species of tree, etc. Visual inspection is often an error-prone and laborious task. Certain defect classes may be difficult to distinguish, even for a human expert. To that end, a number of automated lumber grading systems have been developed which aim to improve the accuracy and the efficiency of lumber grading [2-7].
Let us start by downloading the data.
DATA_DIR <- file.path(getwd(), 'data')
if(!dir.exists(DATA_DIR)){
  dir.create(DATA_DIR)
  knots_url <- 'http://www.ee.oulu.fi/research/imag/knots/KNOTS/knots.zip'
  names_url <- 'http://www.ee.oulu.fi/research/imag/knots/KNOTS/names.txt'
  download.file(knots_url, destfile = file.path(DATA_DIR, 'knots.zip'))
  download.file(names_url, destfile = file.path(DATA_DIR, 'names.txt'))
  unzip(file.path(DATA_DIR, 'knots.zip'), exdir = file.path(DATA_DIR, 'knot_images'))
}
Let’s now load the data from the downloaded files and look at some of those files.
knot_info <- read.delim(file.path(DATA_DIR, "names.txt"), sep=" ", header=FALSE, stringsAsFactors=FALSE)[1:2]
names(knot_info) <- c("file", "knot_class")
knot_info$path <- file.path(DATA_DIR, "knot_images", knot_info$file)
names(knot_info)
## [1] "file" "knot_class" "path"
We’ll be trying to predict knot_class, and here are the counts of the categories:
table(knot_info$knot_class)
##
## decayed_knot dry_knot edge_knot encased_knot horn_knot
## 14 69 65 29 35
## leaf_knot sound_knot
## 47 179
Four of these labels relate to how the knot is integrated into the structure of the surrounding wood; these are the descriptions from the README file:
sound: “A knot grown firmly into the surrounding wood material and does not contain any bark or signs of decay. The color may be very close to the color of sound wood.”
dry: “A firm or partially firm knot, and has not taken part to the vital processes of growing wood, and does not contain any bark or signs of decay. The color is usually darker than the color of sound wood, and a thin dark ring or a partial ring surrounds the knot.”
encased: “A knot surrounded totally or partially by a bark ring. Compared to dry knot, the ring around the knot is thicker.”
decayed: “A knot containing decay. Decay is difficult to recognize, as it usually affects only the strength of the knot.”
Edge, horn and leaf knots are related to the orientation of the knot relative to the cutting plane, the position of the knot on the board relative to the edge, or both. These attributes are of different character than the ones related to structural integration. Theoretically, you could have an encased horn knot, or a decayed leaf knot, etc., though there do not happen to be any in this dataset. Unless otherwise specified, we have to assume the knot is sound, cross-cut, and not near an edge. This means that including these other labels makes the knot_class column ‘untidy’, in that it refers to more than one different kind of characteristic. We could try to split out distributed attributes for orientation and position, as well as for structural integration, but to keep this example simple we’ll just filter the data to keep only the sound, dry, encased, and decayed knots that are cross-cut and not near an edge. (It turns out that you can use featurization to fit image classification models that recognize position and orientation quite well, but we’ll leave that as an exercise for the reader.)
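To show what splitting the label into distributed attributes might look like, here is a small Python sketch. Everything in it is an assumption made for illustration: the attribute names, the default values, and in particular the guess that ‘edge’ describes position while ‘horn’ and ‘leaf’ describe orientation. The dataset's README is the authority on what each label really means.

```python
# Assumed defaults: a knot is sound, cross-cut and away from the edge
# unless its label says otherwise (attribute names are hypothetical).
DEFAULTS = {"integration": "sound", "orientation": "cross-cut", "position": "interior"}

# Assumed mapping from each knot_class label to the one attribute it overrides.
OVERRIDES = {
    "sound_knot": {},
    "dry_knot": {"integration": "dry"},
    "encased_knot": {"integration": "encased"},
    "decayed_knot": {"integration": "decayed"},
    "horn_knot": {"orientation": "horn"},
    "leaf_knot": {"orientation": "leaf"},
    "edge_knot": {"position": "edge"},
}


def split_label(knot_class):
    """Turn a single knot_class label into separate tidy attributes."""
    attrs = dict(DEFAULTS)
    attrs.update(OVERRIDES[knot_class])
    return attrs


# Keeping only cross-cut, interior knots leaves the four structural classes.
kept = sorted(
    label for label in OVERRIDES
    if split_label(label)["orientation"] == "cross-cut"
    and split_label(label)["position"] == "interior"
)
print(kept)
```

Under these assumptions, filtering on orientation and position selects the same four classes that are kept for the rest of the example.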
knot_classes_to_keep <- c("sound_knot", "dry_knot", "encased_knot", "decayed_knot")
knot_info <- knot_info[knot_info$knot_class %in% knot_classes_to_keep,]
knot_info$knot_class <- factor(knot_info$knot_class, levels=knot_classes_to_keep)
We kept the knot_class column as character data while we were deciding which classes to keep, but now we can make that a factor.
Here are a few examples of images from each of the four classes:
library("pixmap")
set.seed(1)
samples_per_category <- 3
op <- par(mfrow=c(1, samples_per_category), oma=c(2,2,2,2), no.readonly = TRUE)
for (kc in levels(knot_info$knot_class)){
  kc_files <- knot_info[knot_info$knot_class == kc, "path"]
  kc_examples <- sample(kc_files, samples_per_category)
  for (kc_ex in kc_examples){
    pnm <- read.pnm(kc_ex)
    plot(pnm, xlab=gsub(".*(knot[0-9]+).*", "\\1", kc_ex))
    mtext(text=gsub("_", " ", kc), side=3, line=0, outer=TRUE, cex=1.7)
  }
}
par(op)
Here we use the rxFeaturize function from Microsoft R Server, which allows us to perform a number of transformations on the knot images in order to produce numerical features. We first resize the images to fit the dimensions required by the pre-trained deep neural model we will use, then extract the pixels to form a numerical data set, then run that data set through a DNN pre-trained model. The result of the image featurization is a numeric vector (“feature vector”) that represents key characteristics of that image.
Image featurization here is accomplished by using a deep neural network (DNN) model that has already been pre-trained by using millions of images. Currently, MRS supports four types of DNNs - three ResNet models (18, 50, 101)[1] and AlexNet [8].
knot_data_df <- rxFeaturize(data = knot_info,
                            mlTransforms = list(loadImage(vars = list(Image = "path")),
                                                resizeImage(vars = list(Features = "Image"),
                                                            width = 224, height = 224,
                                                            resizingOption = "IsoPad"),
                                                extractPixels(vars = "Features"),
                                                featurizeImage(var = "Features",
                                                               dnnModel = "resnet101")),
                            mlTransformVars = c("path", "knot_class"))
## Elapsed time: 00:02:59.8362547
We have chosen the “resnet101” DNN model, which is 101 layers deep; the other ResNet options (18 or 50 layers) generate features more quickly (well under 2 minutes each for this dataset, as opposed to several minutes for ResNet-101), but we found that the features from 101 layers work better for this example.
We have placed the features in a dataframe, which lets us use any R algorithm to build a classifier. Alternatively, we could have saved them directly to an XDF file (the native file format for Microsoft R Server, suitable for large datasets that would not fit in memory, or that you want to distribute on HDFS), or generated them dynamically when training the model (see examples in the earlier blog post).
Since the featurization process takes a while, let’s save the results up to this point. We’ll put them in a CSV file so that, if you are so inspired, you can open it in Excel and admire the 2048 numerical features the deep neural net model has created to describe each image. Once we have the features, training models on this data will be relatively fast.
write.csv(knot_data_df, "knot_data_df.csv", row.names=FALSE)
Now that we have extracted numeric features from each image, we can use traditional machine learning algorithms to train a classifier.
in_training_set <- sample(c(TRUE, FALSE), nrow(knot_data_df), replace=TRUE, prob=c(2/3, 1/3))
in_test_set <- !in_training_set
We put about two thirds of the examples (203) into the training set and the rest (88) in the test set.
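The split above draws an independent TRUE/FALSE flag for every row, so the training set size is itself random rather than exactly two thirds. A minimal Python sketch of the same idea follows; the function name `bernoulli_split` and the seed are assumptions for illustration, and the 291 rows correspond to the four knot classes kept earlier (179 + 69 + 29 + 14).

```python
import random


def bernoulli_split(n_rows, train_fraction=2 / 3, seed=None):
    """Flag each row as training with probability `train_fraction`."""
    rng = random.Random(seed)
    return [rng.random() < train_fraction for _ in range(n_rows)]


in_training_set = bernoulli_split(291, seed=1)
in_test_set = [not flag for flag in in_training_set]

n_train = sum(in_training_set)
n_test = sum(in_test_set)
print(n_train, n_test)
```

Every row lands in exactly one of the two sets, but the counts vary from run to run unless a seed is fixed, so the training set will not always be the same size.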
We will use the popular randomForest package, since it handles both “wide” data (with more features than cases) as well as multiclass outcomes.
library(randomForest)
form <- formula(paste("knot_class" , paste(grep("Feature", names(knot_data_df), value=TRUE), collapse=" + "), sep=" ~ "))
training_set <- knot_data_df[in_training_set,]
test_set <- knot_data_df[in_test_set,]
fit_rf <- randomForest(form, training_set)
pred_rf <- as.data.frame(predict(fit_rf, test_set, type="prob"))
pred_rf$knot_class <- knot_data_df[in_test_set, "knot_class"]
pred_rf$pred_class <- predict(fit_rf, knot_data_df[in_test_set, ], type="response")
# Accuracy
with(pred_rf, sum(pred_class == knot_class)/length(knot_class))
## [1] 0.8068182
Let’s see how the classifier did for all four classes. Here is the confusion matrix, as well as a mosaic plot showing predicted class (pred_class) against the actual class (knot_class).
with(pred_rf, table(knot_class, pred_class))
## pred_class
## knot_class sound_knot dry_knot encased_knot decayed_knot
## sound_knot 51 0 0 0
## dry_knot 3 19 0 0
## encased_knot 6 5 1 0
## decayed_knot 2 1 0 0
mycolors <- c("yellow3", "yellow2", "thistle3", "thistle2")
mosaicplot(table(pred_rf[,c("knot_class", "pred_class")]), col = mycolors, las = 1, cex.axis = .55,
main = NULL, xlab = "Actual Class", ylab = "Predicted Class")
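The per-class picture can be read directly off the confusion matrix. As a worked check, this short Python sketch recomputes the overall accuracy and the per-class recall from the counts printed above; the variable names are ours, and the numbers are copied from the table.

```python
classes = ["sound_knot", "dry_knot", "encased_knot", "decayed_knot"]

# Rows are actual classes, columns are predicted classes (same order).
confusion = [
    [51, 0, 0, 0],
    [3, 19, 0, 0],
    [6, 5, 1, 0],
    [2, 1, 0, 0],
]

total = sum(sum(row) for row in confusion)
correct = sum(confusion[i][i] for i in range(len(classes)))
accuracy = correct / total

# Recall: the share of each actual class that was predicted correctly.
recall = {cls: confusion[i][i] / sum(confusion[i]) for i, cls in enumerate(classes)}

print(round(accuracy, 7))
for cls in classes:
    print(cls, round(recall[cls], 3))
```

The 71 correct predictions out of 88 reproduce the accuracy reported earlier. Recall is 1.0 for sound knots and about 0.86 for dry knots, but only about 0.08 for encased knots and 0 for decayed knots.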
It looks like the classifier performs really well on the sound knots. On the other hand, all of the decayed knots in the test set are misclassified. This is a small class, so let’s look at those misclassified samples.
library(dplyr)
pred_rf$path <- test_set$path
# misclassified knots
misclassified <- pred_rf %>% filter(knot_class != pred_class)
# number of misclassified decayed knots
num_example <- sum(misclassified$knot_class == "decayed_knot")
op <- par(mfrow = c(1, num_example), mar=c(2,1,0,1), no.readonly = TRUE)
for (i in which(misclassified$knot_class == "decayed_knot")){
  example <- misclassified[i, ]
  pnm <- read.pnm(example$path)
  plot(pnm)
  mtext(text=sprintf("predicted as\n%s", example$pred_class), side=1, line=-2)
}
mtext(text='Misclassified Decayed Knots', side = 3, outer = TRUE, line = -3, cex=1.2)
par(op)
We were warned about the decayed knots in the README file; they are more difficult to visually classify. Interestingly, the ones classified as sound actually do appear to be well-integrated into the surrounding wood; they also happen to look somewhat rotten. Also, the decayed knot classified as dry does have a dark border, and looks like a dry knot. These knots appear to be two decayed sound knots and a decayed dry knot; in other words, maybe decay should be represented as a separate attribute that is independent of the structural integration of the knot. We may have found an issue with the labels, rather than a problem with the classifier.
Let’s look at the performance of the classifier more closely. Using the scores returned by the classifier, we can plot an ROC curve for each of the individual classes.
library(pROC)
plot_target_roc <- function(target_class, outcome, test_data, multiclass_predictions){
  is_target <- test_data[[outcome]] == target_class
  roc_obj <- roc(is_target, multiclass_predictions[[target_class]])
  plot(roc_obj, print.auc=TRUE, main=target_class)
  text(x=0.2, y=0.2, labels=sprintf("total cases: %d", sum(is_target)), pos=3)
}
op <- par(mfrow=c(2,2), oma=c(2,2,2,2), #mar=c(0.1 + c(7, 4, 4, 2)),
          no.readonly = TRUE)
for (knot_category in levels(test_set$knot_class)){
  plot_target_roc(knot_category, "knot_class", test_set, pred_rf)
}
mtext(text="ROC curves for individual classes", side=3, line=0, outer=TRUE, cex=1.5)
par(op)
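Each panel treats one class as the target and all other classes as the rest. As a rough illustration of what the AUC on each panel measures, the following Python sketch computes a one-vs-rest AUC as the fraction of (target, non-target) pairs in which the target case gets the higher score, counting ties as half. The function name and the toy labels and scores are invented for illustration; they are not taken from the knot data, and pROC may differ in details such as how it chooses the comparison direction.

```python
def one_vs_rest_auc(actual, scores, target):
    """AUC for `target` vs. all other classes, via pairwise comparisons."""
    pos = [s for a, s in zip(actual, scores) if a == target]
    neg = [s for a, s in zip(actual, scores) if a != target]
    if not pos or not neg:
        raise ValueError("need at least one target and one non-target case")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # target case ranked above non-target case
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Toy data: actual classes and the classifier's 'decayed' score for each case.
actual = ["decayed", "sound", "dry", "sound", "decayed", "encased"]
decayed_score = [0.30, 0.05, 0.10, 0.02, 0.08, 0.20]

auc = one_vs_rest_auc(actual, decayed_score, "decayed")
print(auc)
```

An AUC well above 0.5 means the score ranks target cases ahead of the others most of the time, even when the score itself never becomes the top prediction.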
These ROC curves show that the classifier scores can be used to provide significant enrichment for any of the classes. With sound knots, for example, the score can unambiguously identify a large majority of the cases that are not of that class. Another way to look at this is to consider the ranges of each classification score for each actual class:
library(tidyr)
library(ggplot2)
pred_rf %>%
  select(-path, -pred_class) %>%
  gather(prediction, score, -knot_class) %>%
  ggplot(aes(x=knot_class, y=score, col=prediction)) + geom_boxplot()
Note that the ranges of the sound knot and dry knot scores tend to be higher for all four classes. But when you consider the scores for a given prediction class across all the actual classes, the scores tend to be higher when they match than when they don’t. For example, even though the decayed knot score never goes very high, it tends to be higher for the decayed knots than for other classes. Here’s a boxplot of just the decayed_knot scores for all four actual classes:
boxplot(decayed_knot ~ knot_class, pred_rf, xlab="actual class", ylab="decayed_knot score", main="decayed_knot score for all classes")
Even though the multi-class classifier did not correctly identify any of the three decayed knots in the test set, this does not mean that it is useless for finding decayed knots. In fact, the decayed knots had higher scores for the decayed predictor than most of the other knots did (as shown in the boxplots above), and this score can be used to correctly determine that almost 80% of the knots are not decayed. This means that this classifier could be used to screen for knots that are more likely to be decayed, or alternatively, you could use these scores to isolate a collection of cases that are unlikely to be any of the kinds of knots your classifier is good at recognizing. This ability to focus on things we are not good at might be helpful if we needed to search through a large collection of images to find more of the kinds of knots for which we need more training data. We could show those selected knots to expert human labelers, so they wouldn’t need to label all the examples in the entire database. This would help us get started with an active learning process, where we use a model to help develop a better model.
Here we’ve shown an open-source random forest model trained on features that have been explicitly written to a dataframe. The model classes in the MicrosoftML package can also generate these features dynamically, so that the trained model can accept a file name as input and automatically generate the features on demand during scoring. For examples see the earlier blog post. In general, these data sets are “wide”, in that they have a large number of features; in this example, they have far more columns of features than rows of cases. This means that you need to select algorithms that can handle wide data, like elastic net regularized linear models, or, as we’ve seen here, random forests. Many of the MicrosoftML algorithms are well suited for wide datasets.
Image featurization allows us to make effective use of relatively small sets of labeled images that would not be sufficient to train a deep network from scratch. This is because we can re-use the lower-level features that the pre-trained model had learned for more general image classification tasks.
This custom classifier was constructed quite quickly; although the featurizer required several minutes to run, after the features were generated all training was done on relatively small models and data sets (a few hundred cases with a few thousand features), where training a given model takes on the order of seconds, and tuning hyperparameters is relatively straightforward. Industrial applications commonly require specialized classification or scoring models that can be used to address specific technical or regulatory concerns, so having a rapid and adaptable approach to generate such models should have considerable practical utility.
Featurization is the low-hanging fruit of transfer learning. More general transfer learning allows the specialization of deeper and deeper layers in a pre-trained general model, but the deeper you want to go, the more labelled training data you will generally need. We often find ourselves awash in data, but limited by the availability of quality labels. Having a partially effective classifier can let you bootstrap an active learning approach, where a crude model is used to triage images, classifying the obvious cases directly and referring only those with uncertain scores to expert labelers. The larger set of labeled images is then used to build a better model, which can do more effective triage, and so on, leading to iterative model improvement. Companies like CrowdFlower use these kinds of approaches to optimize data labeling and model building, but there are many use cases where general crowdsourcing may not be adequate (such as when the labeling must be done by experts), so having a straightforward way to bootstrap that process could be quite useful.
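The triage step in such an active learning loop can be sketched in a few lines of Python. This is only one possible rule, under stated assumptions: the function name `triage`, the 0.8 confidence threshold, and the image ids and scores are all hypothetical, and a real system would tune the threshold against expert-labeled data.

```python
def triage(scored_images, confident=0.8):
    """Accept confident predictions; refer uncertain images to expert labelers."""
    auto_labeled, needs_expert = [], []
    for image_id, class_scores in scored_images.items():
        # Take the class with the highest score as the model's prediction.
        best_class = max(class_scores, key=class_scores.get)
        if class_scores[best_class] >= confident:
            auto_labeled.append((image_id, best_class))
        else:
            needs_expert.append(image_id)
    return auto_labeled, needs_expert


# Made-up class scores for three images.
scored = {
    "knot_a": {"sound": 0.92, "dry": 0.05, "encased": 0.02, "decayed": 0.01},
    "knot_b": {"sound": 0.45, "dry": 0.35, "encased": 0.05, "decayed": 0.15},
    "knot_c": {"sound": 0.10, "dry": 0.85, "encased": 0.03, "decayed": 0.02},
}

auto_labeled, needs_expert = triage(scored)
print(auto_labeled)
print(needs_expert)
```

Only the images without a confident top score go to the experts, and their labels are added to the training set for the next round of model building.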
The labels we have may not be tidy, that is, they may not refer to distinct characteristics. In this case, the decayed knot does not seem to really be an alternative to “encased”, “dry”, or “sound” knots; rather, it seems that any of these categories of knots might possibly be decayed. In many applications, it is not always obvious what a properly distributed outcome labeling should include. This is one reason that a quick and easy approach is valuable; it can help you to clarify these questions with iterative solutions that can help frame the discussion with domain experts. In a subsequent iteration we may want to consider “decayed” as a separate attribute from “sound”, “dry” or “encased”.
Image featurization gives us a simple way to build domain-specific image classifiers using relatively small training sets. In the common use case where data is plentiful but good labels are not, this provides a rapid and straightforward first step toward getting the labeled cases we need to build more sophisticated models; with a few iterations, we might even learn to recognize decayed knots. Seriously, wood that knot be cool?
[1] Kaiming He, Xiangyu Zhang, Shaoqing Ren, Jian Sun. (2016) Deep Residual Learning for Image Recognition. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778.
[2] Kauppinen H. and Silven O. (1995) A Color Vision Approach for Grading Lumber. In Theory & Applications of Image Processing II - Selected papers from the 9th Scandinavian Conference on Image Analysis (ed. G. Borgefors), World Scientific, pp. 367-379.
[3] Silven O. and Kauppinen H. (1996) Recent Developments in Wood Inspection. International Journal on Pattern Recognition and Artificial Intelligence, IJPRAI, pp. 83-95.
[4] Rinnhofer, Alfred & Jakob, Gerhard & Deutschl, Edwin & Benesova, Wanda & Andreu, Jean-Philippe & Parziale, Geppy & Niel, Albert. (2006) A multi-sensor system for texture based high-speed hardwood lumber inspection. Proceedings of SPIE - The International Society for Optical Engineering. 5672. 34-43. 10.1117/12.588199.
[5] Irene Yu-Hua Gu, Henrik Andersson, Raul Vicen. (2010) Wood defect classification based on image analysis and support vector machines. Wood Science and Technology 44:4, 693-704. Online publication date: 1-Nov-2010.
[6] Kauppinen H, Silven O & Piirainen T. (1999) Self-organizing map based user interface for visual surface inspection. Proc. 11th Scandinavian Conference on Image Analysis (SCIA’99), June 7-11, Kangerlussuaq, Greenland, 801-808.
[7] Kauppinen H, Rautio H & Silven O. (1999) Nonsegmenting defect detection and SOM-based classification for surface inspection using color vision. Proc. EUROPTO Conf. on Polarization and Color Techniques in Industrial Inspection (SPIE Vol. 3826), June 17-18, Munich, Germany, 270-280.
[8] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. (2012) ImageNet Classification with Deep Convolutional Neural Networks. In NIPS, 2012.
Last week I wrote about how you can use the MicrosoftML package in Microsoft R to featurize images: reduce an image to a vector of 4096 numbers that quantify the essential characteristics of the image, according to an AI vision model. You can perform a similar featurization process with text as well, but in this case you have a lot more control of the features used to represent the text.
Tsuyoshi Matsuzaki demonstrates the process in a post at the MSDN Blog. The post explores the Multi-Domain Sentiment Dataset, a collection of product reviews from Amazon.com. The dataset includes reviews from 975,194 products on Amazon.com from a variety of domains, and for each product there is a text review and a star rating of 1, 2, 4, or 5. (There are no 3-star rated reviews in the data set.) Here's one example, selected at random:
What a useful reference! I bought this book hoping to brush up on my French after a few years of absence, and found it to be indispensable. It's great for quickly looking up grammatical rules and structures as well as vocabulary-building using the helpful vocabulary lists throughout the book. My personal favorite feature of this text is Part V, Idiomatic Usage. This section contains extensive lists of idioms, grouped by their root nouns or verbs. Memorizing one or two of these a day will do wonders for your confidence in French. This book is highly recommended either as a standalone text, or, preferably, as a supplement to a more traditional textbook. In either case, it will serve you well in your continuing education in the French language.
The review contains many positive terms ("useful", "indispensable", "highly recommended"), and in fact is associated with a 5-star rating for this book. The goal of the blog post was to find the terms most associated with positive (or negative) reviews. One way to do this is to use the featurizeText function in the MicrosoftML package included with Microsoft R Client and Microsoft R Server. Among other things, this function can be used to extract ngrams (sequences of one, two, or more words) from arbitrary text. In this example, we extract all of the one and two-word sequences represented at least 500 times in the reviews. Then, to assess which have the most impact on ratings, we use their presence or absence as predictors in a linear model:
transformRule = list(
  featurizeText(
    vars = c(Features = "REVIEW_TEXT"),
    # ngramLength=2: include not only "Azure", "AD", but also "Azure AD"
    # skipLength=1 : "computer" and "compuuter" is the same
    wordFeatureExtractor = ngramCount(
      weighting = "tfidf",
      ngramLength = 2,
      skipLength = 1),
    language = "English"
  ),
  selectFeatures(
    vars = c("Features"),
    mode = minCount(500)
  )
)
# train using transforms !
model <- rxFastLinear(
  RATING ~ Features,
  data = train,
  mlTransforms = transformRule,
  type = "regression" # not binary (numeric regression)
)
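To make the idea of counting n-grams concrete, here is a minimal base-R sketch on three made-up reviews. It is not how featurizeText is implemented (the model above uses tf-idf weighting, for one thing); the function name ngram_counts and the toy data are assumptions for illustration only.

```r
# Toy reviews (made up for illustration)
reviews <- c("great book", "not a great book", "a waste of money")

# Count all word sequences of length 1..n_max across the texts.
# Words inside an n-gram are joined with "|", as in the output shown later.
ngram_counts <- function(texts, n_max = 2) {
  grams <- unlist(lapply(texts, function(txt) {
    words <- strsplit(tolower(txt), "[^a-z']+")[[1]]
    words <- words[nzchar(words)]           # drop empty tokens
    unlist(lapply(seq_len(n_max), function(n) {
      if (length(words) < n) return(character(0))
      starts <- seq_len(length(words) - n + 1)
      vapply(starts,
             function(i) paste(words[i:(i + n - 1)], collapse = "|"),
             character(1))
    }))
  }))
  sort(table(grams), decreasing = TRUE)
}

counts <- ngram_counts(reviews)
# Keep only n-grams seen at least twice (a tiny analogue of a minimum count of 500)
frequent <- counts[counts >= 2]
print(frequent)
```

In the model above, features of this kind (filtered by a minimum count) serve as the predictors.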
We can then look at the coefficients associated with these features (presence of n-grams) to assess their impact on the overall rating. By this standard, the top 10 words or word-pairs contributing to a negative rating are:
boring -7.647399
waste -7.537471
not -6.355953
nothing -6.149342
money -5.386262
bad -5.377981
no -5.210301
worst -5.051558
poorly -4.962763
disappointed -4.890280
Similarly, the top 10 words or word-pairs associated with a positive rating are:
will 3.073104
the|best 3.265797
love 3.290348
life 3.562267
wonderful 3.652950
,|and 3.762862
you 3.889580
excellent 3.902497
my 4.454115
great 4.552569
Another option is simply to look at the sentiment score for each review, which can be extracted using the getSentiment function.
sentimentScores <- rxFeaturize(data=data, mlTransforms = getSentiment(vars = list(SentimentScore = "REVIEW_TEXT")))
As we expect, a negative sentiment (in the 0-0.5 range) is associated with 1- and 2-star reviews, while a positive sentiment (0.5-1.0) is associated with the 4- and 5-star reviews.
You can find more details on this analysis, including the Microsoft R code, at the link below.
Microsoft Technologies Blog for Enterprise Developers: Analyze your text in R (MicrosoftML)
by Fang Zhou, Data Scientist; and Graham Williams, Director of Data Science, both at Microsoft
Rattle — the R Analytical Tool To Learn Easily — is a popular open-source GUI for data mining using R. It presents statistical and visual summaries of data, transforms data that can be readily modelled, builds both unsupervised and supervised models from the data, presents the performance of models graphically, and scores new datasets. All of the underlying R code is presented as a script for learning R and for running independent of Rattle.
Collaborating with the IGD Data Insight Team and under the guidance of the author of Rattle, Graham Williams, we took on the challenging task of understanding the existing code base and re-engineering it to support the latest machine learning algorithms in Open Source R and Microsoft R Server for model development and evaluation.
The Extreme Gradient Boosting algorithm from the R package xgboost is one of the newly added features, providing an alternative option for building boosting models. The main effort in integrating xgboost into Rattle lies in three aspects:
Now we demonstrate the usage of Rattle for xgboost on the credit card data set from the Kaggle competition Credit Card Fraud Detection.
After loading the credit card data from a CSV file in Rattle’s Data Tab, we can click on the Model Tab to navigate to the Boosting Model. By choosing the Model Builder xgb and a set of hyper-parameters, we can easily build an xgboost model without coding.
The measure and visualization of feature importance as well as training error can be generated by clicking the Importance and Errors buttons.
Performance evaluation is also supported. By navigating to the Evaluate Tab, we can calculate the confusion matrix and draw various statistical plots for model evaluation, such as ROC curve, Risk chart and Lift chart.
Do check the Log Tab to review the commands that were executed underneath.
Inspired by the work of the IGD Data Insight Team (see this blog Microsoft R Server support for Rattle) and the latest releases of LightGBM, mxnet, MicrosoftML, etc., we could extend Rattle to expose plenty more functionality in the near future.
The latest release of Rattle (Version 5.0.18) is available on Bitbucket.
You can try this new version out either by using Microsoft R Client on Windows or by firing up an Azure Linux Data Science Virtual Machine, which comes with the developer version of Microsoft R Server installed. Then upgrade the pre-installed Rattle to this new release.
togaware: Rattle: A Graphical User Interface for Data Mining using R
With the current focus on deep learning, neural networks are all the rage again. (Neural networks have been described for more than 60 years, but it wasn't until the power of modern computing systems became available that they were successfully applied to tasks like image recognition.) Neural networks are the fundamental predictive engine in deep learning systems, but it can be difficult to understand exactly what they do. To help with that, Brandon Rohrer has created this from-the-basics guide to how neural networks work:
In R, you can train a simple neural network with just a single hidden layer with the nnet package, which comes pre-installed with every R distribution. It's a great place to start if you're new to neural networks, but the deep learning applications call for more complex neural networks. R has several packages to check out here, including MXNet, darch, deepnet, and h2o: see this post for a comparison. The tensorflow package can also be used to implement various kinds of neural networks. And the rxNeuralNet function (found in the MicrosoftML package included with Microsoft R Server and Microsoft R Client) provides high-performance training of complex neural networks using CPUs and GPUs.
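As a quick illustration of the nnet starting point mentioned above, here is a minimal sketch that fits a network with one small hidden layer to the built-in iris data. The choice of data set, the hidden layer size of 3 and the seed are assumptions made for the example, not recommendations.

```r
library(nnet)   # recommended package shipped with standard R installations

set.seed(42)    # nnet starts from random weights, so fix the seed

# One hidden layer with 3 units (size = 3), predicting the iris species
fit <- nnet(Species ~ ., data = iris, size = 3, trace = FALSE)

# Predicted class labels on the training data, cross-tabulated with the truth
predicted <- predict(fit, newdata = iris, type = "class")
print(table(actual = iris$Species, predicted = predicted))
```

This evaluates on the training data only; for anything beyond a first look you would hold out a test set.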
Data Science and Robots Blog: How neural networks work
The MicrosoftML package introduced with Microsoft R Server 9.0 added several new functions for high-performance machine learning, including rxNeuralNet. Tomaz Kastrun recently applied rxNeuralNet to the MNIST database of handwritten digits to compare its performance with two other machine learning packages, h2o and xgboost. The results are summarized in the chart below:
In addition to having the best performance (for both the CPU-enabled and GPU-enabled modes), rxNeuralNet did not have to sacrifice accuracy. In fact, rxNeuralNet had the best accuracy of the three algorithms: 97.8%, compared to 95.3% for h2o and 94.9% for xgBoost. The same training and validation sets were used for each case, and the R code is available here. (If you're looking for other uses of MicrosoftML, this script also applies algorithms like rxFastForest and rxFastLinear to various other datasets.)
The MicrosoftML package can be used to classify other kinds of images, too. This post from the Microsoft R Server Tiger Team demonstrates using the rxNeuralNet function to classify images from the UCI Image Segmentation Data Set. But for more on the OCR application, follow the link below.
TomaztSQL: rxNeuralNet vs. xgBoost vs. H2O
One of the major "wow!" moments in the keynote where SQL Server 2016 was first introduced was a demo that automated the process of classifying images of galaxies in a huge database of astronomical images.
The SQL Server Blog has since published a step-by-step tutorial on implementing the galaxy classifier in SQL Server (and the code is also available on GitHub). This updated version of the demo uses the new MicrosoftML package in Microsoft R Server 9, and specifically the rxNeuralNet function for deep neural networks. The tutorial recommends using the Azure NC class of virtual machines, to take advantage of the GPU-accelerated capabilities of the function, and provides details on using the SQL Server interfaces to train the neural network and run predictions (classifications) on the image database. For the details, follow the link below.
SQL Server Blog: How six lines of code + SQL Server can bring Deep Learning to ANY App
Oksana Kutina and Stefan Feuerriegel from the University of Freiburg recently published an in-depth comparison of four R packages for deep learning. The packages reviewed were:
The blog post goes into detail about the capabilities of the packages, and compares them in terms of flexibility, ease-of-use, parallelization frameworks supported (GPUs, clusters) and performance -- follow the link below for details. I include the conclusion from the paper here:
The current version of deepnet might represent the most differentiated package in terms of available architectures. However, due to its implementation, it might not be the fastest nor the most user-friendly option. Furthermore, it might not offer as many tuning parameters as some of the other packages.
H2O and MXNetR, on the contrary, offer a highly user-friendly experience. Both also provide output of additional information, perform training quickly and achieve decent results. H2O might be more suited for cluster environments, where data scientists can use it for data mining and exploration within a straightforward pipeline. When flexibility and prototyping is more of a concern, then MXNetR might be the most suitable choice. It provides an intuitive symbolic tool that is used to build custom network architectures from scratch. Additionally, it is well optimized to run on a personal computer by exploiting multi CPU/GPU capabilities.
darch offers a limited but targeted functionality focusing on deep belief networks.
Information Systems Research R Blog: Deep Learning in R
Microsoft R Server 9 includes a new R package for machine learning: MicrosoftML. (So do the Data Science Virtual Machine and the free Microsoft R Client edition, incidentally.) This package includes a suite of fast predictive modeling functions implemented by Microsoft Research, including:
Linear (rxFastLinear) and logistic (rxLogisticRegression) model functions based on the Stochastic Dual Coordinate Ascent method;
Boosted decision trees (rxFastTrees) and random forests (rxFastForest) based on FastRank, an efficient implementation of the MART gradient boosting algorithm;
Neural networks (rxNeuralNet) with support for custom, multilayer network topologies; and
One-class anomaly detection (rxOneClassSvm) based on support vector machines.
As the function names suggest, the implementations are tuned for speed: most use multiple CPUs, and some will even use the GPU (if available). Not all of the implementations scale to unlimited data sizes, however; all but the linear and logistic regression routines are bound by available RAM.
If you want to give these routines a try, the Microsoft R Server Tiger Team has prepared a walkthrough analyzing the famous NYC Taxi data set. Once you have access to Microsoft R Server (or Client), this R script walks you through the process of:
The ROC curves are shown below. As you'd expect, the linear model performs poorly compared to the others, since it's being applied here to a binary variable.
To try it out yourself, follow the walkthrough linked below, which also provides instructions for running the logistic regression model in SQL Server Management Studio.
Microsoft R Server Tiger Team: Predicting NYC Taxi Tips using MicrosoftML
by Bob Horton, Microsoft Senior Data Scientist
Receiver Operating Characteristic (ROC) curves are a popular way to visualize the tradeoffs between sensitivity and specificity in a binary classifier. In an earlier post, I described a simple “turtle’s eye view” of these plots: a classifier is used to sort cases in order from most to least likely to be positive, and a Logo-like turtle marches along this string of cases. The turtle considers all the cases it has passed as having tested positive. Depending on their actual class they are either false positives (FP) or true positives (TP); this is equivalent to adjusting a score threshold. When the turtle passes a TP it takes a step upward on the y-axis, and when it passes an FP it takes a step rightward on the x-axis. The step sizes are inversely proportional to the number of actual positives (in the y-direction) or negatives (in the x-direction), so the path always ends at coordinates (1, 1). The result is a plot of true positive rate (TPR, or sensitivity) against false positive rate (FPR, or 1 - specificity), which is all an ROC curve is.
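The turtle's walk can be sketched with cumulative sums over the sorted labels. This is a minimal sketch that assumes the labels are already sorted from highest to lowest score and that there are no tied scores; the six labels are made up for illustration.

```r
# Labels sorted by decreasing classifier score (1 = positive, 0 = negative)
labels <- c(1, 1, 0, 1, 0, 0)

# Each positive moves the turtle up by 1/(number of positives);
# each negative moves it right by 1/(number of negatives).
# The leading 0 is the starting point at the origin.
TPR <- c(0, cumsum(labels) / sum(labels))
FPR <- c(0, cumsum(1 - labels) / sum(1 - labels))

print(data.frame(FPR, TPR))
plot(FPR, TPR, type = "s", xlim = c(0, 1), ylim = c(0, 1))
```

Because the step sizes are fractions of the class totals, the last row is always (1, 1).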
Computing the area under the curve is one way to summarize it in a single value; this metric is so common that if data scientists say “area under the curve” or “AUC”, you can generally assume they mean an ROC curve unless otherwise specified.
Probably the most straightforward and intuitive metric for classifier performance is accuracy. Unfortunately, there are circumstances where simple accuracy does not work well. For example, with a disease that only affects 1 in a million people, a completely bogus screening test that always reports “negative” will be 99.9999% accurate. Unlike accuracy, ROC curves are insensitive to class imbalance; the bogus screening test would have an AUC of 0.5, which is like not having a test at all.
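The screening example can be checked numerically. The sketch below assumes a population of exactly one million with a single affected person, and computes AUC directly as the average over positive/negative pairs, counting ties as one half.

```r
n <- 1e6
labels <- c(TRUE, rep(FALSE, n - 1))    # one affected person in a million
predicted <- rep(FALSE, n)              # bogus test: always reports "negative"

accuracy <- mean(predicted == labels)   # fraction of correct calls

# The bogus test gives everyone the same score, so every
# positive/negative pair is a tie and contributes 0.5.
scores <- rep(0, n)
pos <- scores[labels]
neg <- scores[!labels]
auc <- mean((1 + sign(outer(pos, neg, "-"))) / 2)

print(c(accuracy = accuracy, auc = auc))
```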
In this post I’ll work through the geometry exercise of computing the area, and develop a concise vectorized function that uses this approach. Then we’ll look at another way of viewing AUC which leads to a probabilistic interpretation.
Let’s start with a simple artificial data set:
category <- c(1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0)
prediction <- rev(seq_along(category))
prediction[9:10] <- mean(prediction[9:10])
Here the vector prediction holds ersatz scores; these normally would be assigned by a classifier, but here we’ve just assigned numbers so that the decreasing order of the scores matches the given order of the category labels. The 9th and 10th scores, one representing a positive case and the other a negative case, are replaced by their average so that the data will contain ties without otherwise disturbing the order.
To plot an ROC curve, we’ll need to compute the true positive and false positive rates. In the earlier article we did this using cumulative sums of positives (or negatives) along the sorted binary labels. But here we’ll use the pROC package to make it official:
library(pROC)
roc_obj <- roc(category, prediction)
auc(roc_obj)
## Area under the curve: 0.825
roc_df <- data.frame(
TPR=rev(roc_obj$sensitivities),
FPR=rev(1 - roc_obj$specificities),
labels=roc_obj$response,
scores=roc_obj$predictor)
The roc function returns an object with plot methods and other conveniences, but for our purposes all we want from it is vectors of TPR and FPR values. TPR is the same as sensitivity, and FPR is 1 - specificity (see “confusion matrix” in Wikipedia). Unfortunately, the roc function reports these values sorted in the order of ascending score; we want to start in the lower left hand corner, so I reverse the order. According to the auc function from the pROC package, our simulated category and prediction data gives an AUC of 0.825; we’ll compare other attempts at computing AUC to this value.
If the ROC curve were a perfect step function, we could find the area under it by adding a set of vertical bars with widths equal to the spaces between points on the FPR axis, and heights equal to the step height on the TPR axis. Since actual ROC curves can also include portions representing sets of values with tied scores which are not square steps, we need to adjust the area for these segments. In the figure below we use green bars to represent the areas under the steps. Adjustments for sets of tied values will be shown as blue rectangles; half the area of each of these blue rectangles is below a sloped segment of the curve.
The function for drawing polygons in base R takes vectors of x and y values; we’ll start by defining a rectangle function that uses a simpler and more specialized syntax; it takes x and y coordinates for the lower left corner of the rectangle, and a height and width. It sets some default display options, and passes along any other parameters we might specify (like color) to the polygon function.
rectangle <- function(x, y, width, height, density=12, angle=-45, ...)
polygon(c(x,x,x+width,x+width), c(y,y+height,y+height,y),
density=density, angle=angle, ...)
The spaces between TPR (or FPR) values can be calculated by diff. Since this results in a vector one position shorter than the original data, we pad each difference vector with a zero at the end:
roc_df <- transform(roc_df,
dFPR = c(diff(FPR), 0),
dTPR = c(diff(TPR), 0))
For this figure, we’ll draw the ROC curve last to place it on top of the other elements, so we start by drawing an empty graph (type='n') spanning from 0 to 1 on each axis. Since the data set has exactly ten positive and ten negative cases, the TPR and FPR values will all be multiples of 1/10, and the points of the ROC curve will all fall on a regularly spaced grid. We draw the grid using light blue horizontal and vertical lines spaced one tenth of a unit apart. Now we can pass the values we calculated above to the rectangle function, using mapply (the multi-variate version of sapply) to iterate over all the cases and draw all the green and blue rectangles. Finally we plot the ROC curve (that is, we plot TPR against FPR) on top of everything in red.
plot(0:10/10, 0:10/10, type='n', xlab="FPR", ylab="TPR")
abline(h=0:10/10, col="lightblue")
abline(v=0:10/10, col="lightblue")
with(roc_df, {
mapply(rectangle, x=FPR, y=0,
width=dFPR, height=TPR, col="green", lwd=2)
mapply(rectangle, x=FPR, y=TPR,
width=dFPR, height=dTPR, col="blue", lwd=2)
lines(FPR, TPR, type='b', lwd=3, col="red")
})
The area under the red curve is all of the green area plus half of the blue area. For adding areas we only care about the height and width of each rectangle, not its (x,y) position. The heights of the green rectangles, which all start from 0, are in the TPR column and widths are in the dFPR column, so the total area of all the green rectangles is the dot product of TPR and dFPR. Note that the vectorized approach computes a rectangle for each data point, even when the height or width is zero (in which case it doesn’t hurt to add them). Similarly, the heights and widths of the blue rectangles (if there are any) are in columns dTPR and dFPR, so their total area is the dot product of these vectors. For regions of the graph that form square steps, one or the other of these values will be zero, so you only get blue rectangles (of non-zero area) if both TPR and FPR change in the same step. Only half the area of each blue rectangle is below its segment of the ROC curve (which is a diagonal of a blue rectangle). Remember the ‘real’ auc function gave us an AUC of 0.825, so that is the answer we’re looking for.
simple_auc <- function(TPR, FPR){
# inputs already sorted, best scores first
dFPR <- c(diff(FPR), 0)
dTPR <- c(diff(TPR), 0)
sum(TPR * dFPR) + sum(dTPR * dFPR)/2
}
with(roc_df, simple_auc(TPR, FPR))
## [1] 0.825
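The green-plus-half-blue sum is the trapezoidal rule in disguise: each term TPR*dFPR + dTPR*dFPR/2 equals dFPR times the average of two neighboring TPR values. The sketch below checks this on a small made-up curve with one sloped (tied) segment; trapezoid_auc is a hypothetical helper name, and simple_auc is repeated from above so the block runs on its own.

```r
# Repeated from above so this block is self-contained
simple_auc <- function(TPR, FPR){
  dFPR <- c(diff(FPR), 0)
  dTPR <- c(diff(TPR), 0)
  sum(TPR * dFPR) + sum(dTPR * dFPR)/2
}

# Trapezoidal rule: width times the mean of the two adjacent heights
trapezoid_auc <- function(TPR, FPR){
  sum(diff(FPR) * (head(TPR, -1) + tail(TPR, -1)) / 2)
}

# Made-up curve: a vertical step, a sloped segment (a tie), then a flat step
FPR <- c(0, 0,   0.5, 1)
TPR <- c(0, 0.5, 1,   1)

simple_auc(TPR, FPR)
## [1] 0.875
trapezoid_auc(TPR, FPR)
## [1] 0.875
```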
Now let’s try a completely different approach. Here we generate a matrix representing all possible combinations of a positive case with a negative case. Each row represents a positive case, in order from the highest-scoring positive case at the bottom to the lowest-scoring positive case at the top. Similarly, the columns represent the negative cases, sorted with the highest scores at the left. Each cell represents a comparison between a particular positive case and a particular negative case, and we mark the cell by whether its positive case has a higher score (or higher overall rank) than its negative case. If your classifier is any good, most of the positive cases will outrank most of the negative cases, and any exceptions will be in the upper left corner, where low-ranking positives are being compared to high-ranking negatives.
rank_comparison_auc <- function(labels, scores, plot_image=TRUE, ...){
score_order <- order(scores, decreasing=TRUE)
labels <- as.logical(labels[score_order])
scores <- scores[score_order]
pos_scores <- scores[labels]
neg_scores <- scores[!labels]
n_pos <- sum(labels)
n_neg <- sum(!labels)
M <- outer(sum(labels):1, 1:sum(!labels),
function(i, j) (1 + sign(pos_scores[i] - neg_scores[j]))/2)
AUC <- mean (M)
if (plot_image){
image(t(M[nrow(M):1,]), ...)
library(pROC)
with( roc(labels, scores),
lines((1 + 1/n_neg)*((1 - specificities) - 0.5/n_neg),
(1 + 1/n_pos)*sensitivities - 0.5/n_pos,
col="blue", lwd=2, type='b'))
text(0.5, 0.5, sprintf("AUC = %0.4f", AUC))
}
return(AUC)
}
rank_comparison_auc(labels=as.logical(category), scores=prediction)
## [1] 0.825
The blue line is an ROC curve computed in the conventional manner (slid and stretched a bit to get the coordinates to line up with the corners of the matrix cells). This makes it evident that the ROC curve marks the boundary of the area where the positive cases outrank the negative cases. The AUC can be computed by adjusting the values in the matrix so that cells where the positive case outranks the negative case receive a 1, cells where the negative case has higher rank receive a 0, and cells with ties get 0.5. (Since applying the sign function to the difference in scores gives values of 1, -1, and 0 to these cases, we put them in the range we want by adding one and dividing by two.) We find the AUC by averaging these values.
The probabilistic interpretation is that if you randomly choose a positive case and a negative case, the probability that the positive case outranks the negative case according to the classifier is given by the AUC. This is evident from the figure, where the total area of the plot is normalized to one, the cells of the matrix enumerate all possible combinations of positive and negative cases, and the fraction under the curve comprises the cells where the positive case outranks the negative one.
We can use this observation to approximate AUC:
auc_probability <- function(labels, scores, N=1e7){
pos <- sample(scores[labels], N, replace=TRUE)
neg <- sample(scores[!labels], N, replace=TRUE)
# sum( (1 + sign(pos - neg))/2)/N # does the same thing
(sum(pos > neg) + sum(pos == neg)/2) / N # give partial credit for ties
}
auc_probability(as.logical(category), prediction)
## [1] 0.8249989
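The same pairwise probability can also be computed exactly, without sampling and without building the full comparison matrix, from the ranks of the scores. This is the standard rank-sum (Mann-Whitney) identity rather than something from the original analysis; rank_sum_auc is a hypothetical name, and the toy data is repeated so the block runs on its own.

```r
# Toy data repeated from above
category <- c(1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0)
prediction <- rev(seq_along(category))
prediction[9:10] <- mean(prediction[9:10])

rank_sum_auc <- function(labels, scores){
  labels <- as.logical(labels)
  n_pos <- sum(labels)
  n_neg <- sum(!labels)
  r <- rank(scores)   # tied scores get the average of their ranks
  # The rank sum of the positives, minus its smallest possible value, counts
  # the positive/negative pairs won by the positive (ties count one half)
  (sum(r[labels]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

rank_sum_auc(category, prediction)
## [1] 0.825
```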
Now let’s try our new AUC functions on a bigger dataset. I’ll use the simulated dataset from the earlier blog post, where the labels are in the bad_widget column of the test set dataframe, and the scores are in a vector called glm_response_scores.
This data has no tied scores, so for testing let’s make a modified version that has ties. We’ll plot a black line representing the original data; since each point has a unique score, the ROC curve is a step function. Then we’ll generate tied scores by rounding the score values, and plot the rounded ROC in red. Note that we are using “response” scores from a glm model, so they all fall in the range from 0 to 1. When we round these scores to one decimal place, there are 11 possible rounded scores, from 0.0 to 1.0. The AUC values calculated with the pROC package are indicated on the figure.
roc_full_resolution <- roc(test_set$bad_widget, glm_response_scores)
rounded_scores <- round(glm_response_scores, digits=1)
roc_rounded <- roc(test_set$bad_widget, rounded_scores)
plot(roc_full_resolution, print.auc=TRUE)
##
## Call:
## roc.default(response = test_set$bad_widget, predictor = glm_response_scores)
##
## Data: glm_response_scores in 59 controls (test_set$bad_widget FALSE) < 66 cases (test_set$bad_widget TRUE).
## Area under the curve: 0.9037
lines(roc_rounded, col="red", type='b')
text(0.4, 0.43, labels=sprintf("AUC: %0.3f", auc(roc_rounded)), col="red")
Now we can try our AUC functions on both sets to check that they can handle both step functions and segments with intermediate slopes.
options(digits=22)
set.seed(1234)
results <- data.frame(
`Full Resolution` = c(
auc = as.numeric(auc(roc_full_resolution)),
simple_auc = simple_auc(rev(roc_full_resolution$sensitivities), rev(1 - roc_full_resolution$specificities)),
rank_comparison_auc = rank_comparison_auc(test_set$bad_widget, glm_response_scores,
main="Full-resolution scores (no ties)"),
auc_probability = auc_probability(test_set$bad_widget, glm_response_scores)
),
`Rounded Scores` = c(
auc = as.numeric(auc(roc_rounded)),
simple_auc = simple_auc(rev(roc_rounded$sensitivities), rev(1 - roc_rounded$specificities)),
rank_comparison_auc = rank_comparison_auc(test_set$bad_widget, rounded_scores,
main="Rounded scores (ties in all segments)"),
auc_probability = auc_probability(test_set$bad_widget, rounded_scores)
)
)
| | Full.Resolution | Rounded.Scores |
|---|---|---|
| auc | 0.90369799691833586 | 0.89727786337955828 |
| simple_auc | 0.90369799691833586 | 0.89727786337955828 |
| rank_comparison_auc | 0.90369799691833586 | 0.89727786337955828 |
| auc_probability | 0.90371970000000001 | 0.89716879999999999 |
So we have two new functions that give exactly the same results as the function from the pROC package, and our probabilistic function is pretty close. Of course, these functions are intended as demonstrations; you should normally use standard packages such as pROC or ROCR for actual work.
Here we’ve focused on calculating AUC and understanding the probabilistic interpretation. The probability associated with AUC is somewhat arcane, and is not likely to be exactly what you are looking for in practice (unless you actually will be randomly selecting a positive and a negative case, and you really want to know the probability that the classifier will score the positive case higher). While AUC gives a single-number summary of classifier performance that is suitable in some circumstances, other metrics are often more appropriate. In many applications, overall behavior of a classifier across all possible score thresholds is of less interest than the behavior in a specific range. For example, in marketing the goal is often to identify a highly enriched target group with a low false positive rate. In other applications it may be more important to clearly identify a group of cases likely to be negative. For example, when pre-screening for a disease or defect you may want to rule out as many cases as you can before you start running expensive confirmatory tests. More generally, evaluation metrics that take into account the actual costs of false positive and false negative errors may be much more appropriate than AUC. If you know these costs, you should probably use them. A good introduction relating ROC curves to economic utility functions, complete with story and characters, is given in the excellent blog post “ML Meets Economics.”
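As a small illustration of using error costs directly, the sketch below picks the score threshold that minimizes total cost on a made-up data set. The labels, scores and the two cost values (a false negative costing five times a false positive) are all hypothetical; in practice the costs would come from the application.

```r
# Made-up labels and classifier scores
labels <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)
scores <- c(0.9,  0.8,  0.7,   0.6,  0.4,   0.2)

cost_fp <- 1   # hypothetical cost of a false positive
cost_fn <- 5   # hypothetical cost of a false negative

# Candidate thresholds: every observed score, plus Inf ("call nothing positive")
thresholds <- c(sort(unique(scores)), Inf)

total_cost <- sapply(thresholds, function(thr){
  predicted <- scores >= thr
  cost_fp * sum(predicted & !labels) + cost_fn * sum(!predicted & labels)
})

best_threshold <- thresholds[which.min(total_cost)]
print(data.frame(threshold = thresholds, cost = total_cost))
best_threshold
## [1] 0.6
```

With these costs the cheapest choice accepts one false positive in order to catch all three positives; changing the cost ratio moves the threshold.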