by Bob Horton, Microsoft Senior Data Scientist
Receiver Operating Characteristic (ROC) curves are a popular way to visualize the tradeoffs between sensitivity and specificity in a binary classifier. In an earlier post, I described a simple “turtle’s eye view” of these plots: a classifier is used to sort cases in order from most to least likely to be positive, and a Logo-like turtle marches along this string of cases. The turtle considers all the cases it has passed as having tested positive. Depending on their actual class they are either false positives (FP) or true positives (TP); this is equivalent to adjusting a score threshold. When the turtle passes a TP it takes a step upward on the y-axis, and when it passes a FP it takes a step rightward on the x-axis. The step sizes are inversely proportional to the number of actual positives (in the y-direction) or negatives (in the x-direction), so the path always ends at coordinates (1, 1). The result is a plot of true positive rate (TPR, or sensitivity) against false positive rate (FPR, or 1 - specificity), which is all an ROC curve is.
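That turtle walk translates directly into a pair of cumulative sums. Here is a minimal sketch (the labels are made up, and assumed already sorted by decreasing classifier score):

```r
# Actual classes, already sorted from highest to lowest classifier score.
labels <- c(1, 1, 0, 1, 0, 0)

# Each passed positive moves the turtle up; each passed negative moves it right.
# Prepend 0 so the path starts at the origin.
TPR <- c(0, cumsum(labels) / sum(labels))          # y-coordinates of the path
FPR <- c(0, cumsum(1 - labels) / sum(1 - labels))  # x-coordinates of the path

rbind(FPR = FPR, TPR = TPR)  # the path runs from (0, 0) to (1, 1)
```

Plotting TPR against FPR traces exactly the staircase described above.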
Computing the area under the curve is one way to summarize it in a single value; this metric is so common that if data scientists say “area under the curve” or “AUC”, you can generally assume they mean an ROC curve unless otherwise specified.
Probably the most straightforward and intuitive metric for classifier performance is accuracy. Unfortunately, there are circumstances where simple accuracy does not work well. For example, with a disease that only affects 1 in a million people, a completely bogus screening test that always reports “negative” will be 99.9999% accurate. Unlike accuracy, ROC curves are insensitive to class imbalance; the bogus screening test would have an AUC of 0.5, which is like not having a test at all.
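To make this concrete, here is a small simulation of the bogus screening test (the prevalence and score values are invented for illustration):

```r
n <- 1e6
has_disease <- c(TRUE, rep(FALSE, n - 1))  # one affected person in a million
test_score  <- rep(0, n)                   # the bogus test always says "negative"

# Accuracy looks spectacular...
accuracy <- mean((test_score > 0.5) == has_disease)
accuracy  # 0.999999

# ...but every positive/negative pair of cases is a tie, and under the usual
# tie convention each pair contributes 1/2, giving an AUC of exactly 0.5.
pos <- test_score[has_disease]
neg <- test_score[!has_disease]
auc_val <- mean(outer(pos, neg, ">")) + mean(outer(pos, neg, "==")) / 2
auc_val  # 0.5
```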
In this post I’ll work through the geometry exercise of computing the area, and develop a concise vectorized function that uses this approach. Then we’ll look at another way of viewing AUC which leads to a probabilistic interpretation.
Let’s start with a simple artificial data set:
category <- c(1, 1, 1, 1, 0, 1, 1, 0, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 0, 0)
prediction <- rev(seq_along(category))
prediction[9:10] <- mean(prediction[9:10])
Here the vector prediction holds ersatz scores; these normally would be assigned by a classifier, but here we’ve just assigned numbers so that the decreasing order of the scores matches the given order of the category labels. Scores 9 and 10, one representing a positive case and the other a negative case, are replaced by their average so that the data will contain ties without otherwise disturbing the order.
To plot an ROC curve, we’ll need to compute the true positive and false positive rates. In the earlier article we did this using cumulative sums of positives (or negatives) along the sorted binary labels. But here we’ll use the pROC package to make it official:
library(pROC)
roc_obj <- roc(category, prediction)
auc(roc_obj)
## Area under the curve: 0.825
roc_df <- data.frame(
TPR=rev(roc_obj$sensitivities),
FPR=rev(1 - roc_obj$specificities),
labels=roc_obj$response,
scores=roc_obj$predictor)
The roc function returns an object with plot methods and other conveniences, but for our purposes all we want from it is vectors of TPR and FPR values. TPR is the same as sensitivity, and FPR is 1 - specificity (see “confusion matrix” in Wikipedia). Unfortunately, the roc function reports these values sorted in order of ascending score; we want to start in the lower left hand corner, so I reverse the order. According to the auc function from the pROC package, our simulated category and prediction data give an AUC of 0.825; we’ll compare other attempts at computing AUC to this value.
If the ROC curve were a perfect step function, we could find the area under it by adding a set of vertical bars with widths equal to the spaces between points on the FPR axis, and heights equal to the step height on the TPR axis. Since actual ROC curves can also include sloped segments, representing sets of tied scores, which are not square steps, we need to adjust the area for these segments. In the figure below we use green bars to represent the areas under the steps. Adjustments for sets of tied values will be shown as blue rectangles; half the area of each of these blue rectangles is below a sloped segment of the curve.
The function for drawing polygons in base R takes vectors of x and y values; we’ll start by defining a rectangle function that uses a simpler and more specialized syntax: it takes x and y coordinates for the lower left corner of the rectangle, plus a height and width. It sets some default display options, and passes along any other parameters we might specify (like color) to the polygon function.
rectangle <- function(x, y, width, height, density=12, angle=-45, ...)
polygon(c(x,x,x+width,x+width), c(y,y+height,y+height,y),
density=density, angle=angle, ...)
The spaces between TPR (or FPR) values can be calculated by diff. Since this results in a vector one position shorter than the original data, we pad each difference vector with a zero at the end:
roc_df <- transform(roc_df,
dFPR = c(diff(FPR), 0),
dTPR = c(diff(TPR), 0))
For this figure, we’ll draw the ROC curve last to place it on top of the other elements, so we start by drawing an empty graph (type='n') spanning from 0 to 1 on each axis. Since the data set has exactly ten positive and ten negative cases, the TPR and FPR values will all be multiples of 1/10, and the points of the ROC curve will all fall on a regularly spaced grid. We draw the grid using light blue horizontal and vertical lines spaced one tenth of a unit apart. Now we can pass the values we calculated above to the rectangle function, using mapply (the multi-variate version of sapply) to iterate over all the cases and draw all the green and blue rectangles. Finally we plot the ROC curve (that is, we plot TPR against FPR) on top of everything in red.
plot(0:10/10, 0:10/10, type='n', xlab="FPR", ylab="TPR")
abline(h=0:10/10, col="lightblue")
abline(v=0:10/10, col="lightblue")
with(roc_df, {
mapply(rectangle, x=FPR, y=0,
width=dFPR, height=TPR, col="green", lwd=2)
mapply(rectangle, x=FPR, y=TPR,
width=dFPR, height=dTPR, col="blue", lwd=2)
lines(FPR, TPR, type='b', lwd=3, col="red")
})
The area under the red curve is all of the green area plus half of the blue area. For adding areas we only care about the height and width of each rectangle, not its (x,y) position. The heights of the green rectangles, which all start from 0, are in the TPR column and the widths are in the dFPR column, so the total area of all the green rectangles is the dot product of TPR and dFPR. Note that the vectorized approach computes a rectangle for each data point, even when the height or width is zero (in which case it doesn’t hurt to add them). Similarly, the heights and widths of the blue rectangles (if there are any) are in columns dTPR and dFPR, so their total area is the dot product of these vectors. For regions of the graph that form square steps, one or the other of these values will be zero, so you only get blue rectangles (of non-zero area) if both TPR and FPR change in the same step. Only half the area of each blue rectangle is below its segment of the ROC curve (which is a diagonal of the blue rectangle). Remember the ‘real’ auc function gave us an AUC of 0.825, so that is the answer we’re looking for.
simple_auc <- function(TPR, FPR){
# inputs already sorted, best scores first
dFPR <- c(diff(FPR), 0)
dTPR <- c(diff(TPR), 0)
sum(TPR * dFPR) + sum(dTPR * dFPR)/2
}
with(roc_df, simple_auc(TPR, FPR))
## [1] 0.825
Now let’s try a completely different approach. Here we generate a matrix representing all possible combinations of a positive case with a negative case. Each row represents a positive case, in order from the highest-scoring positive case at the bottom to the lowest-scoring positive case at the top. Similarly, the columns represent the negative cases, sorted with the highest scores at the left. Each cell represents a comparison between a particular positive case and a particular negative case, and we mark the cell by whether its positive case has a higher score (or higher overall rank) than its negative case. If your classifier is any good, most of the positive cases will outrank most of the negative cases, and any exceptions will be in the upper left corner, where low-ranking positives are being compared to high-ranking negatives.
rank_comparison_auc <- function(labels, scores, plot_image=TRUE, ...){
score_order <- order(scores, decreasing=TRUE)
labels <- as.logical(labels[score_order])
scores <- scores[score_order]
pos_scores <- scores[labels]
neg_scores <- scores[!labels]
n_pos <- sum(labels)
n_neg <- sum(!labels)
M <- outer(sum(labels):1, 1:sum(!labels),
function(i, j) (1 + sign(pos_scores[i] - neg_scores[j]))/2)
AUC <- mean(M)
if (plot_image){
image(t(M[nrow(M):1,]), ...)
library(pROC)
with( roc(labels, scores),
lines((1 + 1/n_neg)*((1 - specificities) - 0.5/n_neg),
(1 + 1/n_pos)*sensitivities - 0.5/n_pos,
col="blue", lwd=2, type='b'))
text(0.5, 0.5, sprintf("AUC = %0.4f", AUC))
}
return(AUC)
}
rank_comparison_auc(labels=as.logical(category), scores=prediction)
## [1] 0.825
The blue line is an ROC curve computed in the conventional manner (slid and stretched a bit to get the coordinates to line up with the corners of the matrix cells). This makes it evident that the ROC curve marks the boundary of the area where the positive cases outrank the negative cases. The AUC can be computed by adjusting the values in the matrix so that cells where the positive case outranks the negative case receive a 1, cells where the negative case has the higher rank receive a 0, and cells with ties get 0.5 (since applying the sign function to the difference in scores gives values of 1, -1, and 0 for these cases, we put them in the range we want by adding one and dividing by two). We find the AUC by averaging these values.
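The mapping is easy to check in isolation (a tiny example of my own, not taken from the matrix above):

```r
# Differences between a positive case's score and a negative case's score:
# positive outranks, negative outranks, and a tie, respectively.
score_diff <- c(2.5, -1.0, 0)
(1 + sign(score_diff)) / 2  # 1.0, 0.0, 0.5
```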
The probabilistic interpretation is that if you randomly choose a positive case and a negative case, the probability that the positive case outranks the negative case according to the classifier is given by the AUC. This is evident from the figure, where the total area of the plot is normalized to one, the cells of the matrix enumerate all possible combinations of positive and negative cases, and the fraction under the curve comprises the cells where the positive case outranks the negative one.
We can use this observation to approximate AUC:
auc_probability <- function(labels, scores, N=1e7){
pos <- sample(scores[labels], N, replace=TRUE)
neg <- sample(scores[!labels], N, replace=TRUE)
# sum( (1 + sign(pos - neg))/2)/N # does the same thing
(sum(pos > neg) + sum(pos == neg)/2) / N # give partial credit for ties
}
auc_probability(as.logical(category), prediction)
## [1] 0.8249989
Now let’s try our new AUC functions on a bigger dataset. I’ll use the simulated dataset from the earlier blog post, where the labels are in the bad_widget column of the test set dataframe, and the scores are in a vector called glm_response_scores.
This data has no tied scores, so for testing let’s make a modified version that has ties. We’ll plot a black line representing the original data; since each point has a unique score, the ROC curve is a step function. Then we’ll generate tied scores by rounding the score values, and plot the rounded ROC in red. Note that we are using “response” scores from a glm model, so they all fall in the range from 0 to 1. When we round these scores to one decimal place, there are 11 possible rounded scores, from 0.0 to 1.0. The AUC values calculated with the pROC package are indicated on the figure.
roc_full_resolution <- roc(test_set$bad_widget, glm_response_scores)
rounded_scores <- round(glm_response_scores, digits=1)
roc_rounded <- roc(test_set$bad_widget, rounded_scores)
plot(roc_full_resolution, print.auc=TRUE)
##
## Call:
## roc.default(response = test_set$bad_widget, predictor = glm_response_scores)
##
## Data: glm_response_scores in 59 controls (test_set$bad_widget FALSE) < 66 cases (test_set$bad_widget TRUE).
## Area under the curve: 0.9037
lines(roc_rounded, col="red", type='b')
text(0.4, 0.43, labels=sprintf("AUC: %0.3f", auc(roc_rounded)), col="red")
Now we can try our AUC functions on both sets to check that they can handle both step functions and segments with intermediate slopes.
options(digits=22)
set.seed(1234)
results <- data.frame(
`Full Resolution` = c(
auc = as.numeric(auc(roc_full_resolution)),
simple_auc = simple_auc(rev(roc_full_resolution$sensitivities), rev(1 - roc_full_resolution$specificities)),
rank_comparison_auc = rank_comparison_auc(test_set$bad_widget, glm_response_scores,
main="Full-resolution scores (no ties)"),
auc_probability = auc_probability(test_set$bad_widget, glm_response_scores)
),
`Rounded Scores` = c(
auc = as.numeric(auc(roc_rounded)),
simple_auc = simple_auc(rev(roc_rounded$sensitivities), rev(1 - roc_rounded$specificities)),
rank_comparison_auc = rank_comparison_auc(test_set$bad_widget, rounded_scores,
main="Rounded scores (ties in all segments)"),
auc_probability = auc_probability(test_set$bad_widget, rounded_scores)
)
)
| | Full.Resolution | Rounded.Scores |
|---|---|---|
| auc | 0.90369799691833586 | 0.89727786337955828 |
| simple_auc | 0.90369799691833586 | 0.89727786337955828 |
| rank_comparison_auc | 0.90369799691833586 | 0.89727786337955828 |
| auc_probability | 0.90371970000000001 | 0.89716879999999999 |
So we have two new functions that give exactly the same results as the function from the pROC package, and our probabilistic function comes pretty close. Of course, these functions are intended as demonstrations; you should normally use standard packages such as pROC or ROCR for actual work.
Here we’ve focused on calculating AUC and understanding the probabilistic interpretation. The probability associated with AUC is somewhat arcane, and is not likely to be exactly what you are looking for in practice (unless you actually will be randomly selecting a positive and a negative case, and you really want to know the probability that the classifier will score the positive case higher.) While AUC gives a single-number summary of classifier performance that is suitable in some circumstances, other metrics are often more appropriate. In many applications, overall behavior of a classifier across all possible score thresholds is of less interest than the behavior in a specific range. For example, in marketing the goal is often to identify a highly enriched target group with a low false positive rate. In other applications it may be more important to clearly identify a group of cases likely to be negative. For example, when pre-screening for a disease or defect you may want to rule out as many cases as you can before you start running expensive confirmatory tests. More generally, evaluation metrics that take into account the actual costs of false positive and false negative errors may be much more appropriate than AUC. If you know these costs, you should probably use them. A good introduction relating ROC curves to economic utility functions, complete with story and characters, is given in the excellent blog post “ML Meets Economics.”
About a decade or so ago, photomosaics were all the rage: a near-recreation of a famous image using many smaller images as elements. Here, for example, is the Mona Lisa, created using the Metapixel program by overlaying 32x32 images of vehicles and animals.
An image like this presents an interesting computer vision challenge: can you use deep learning techniques to find the pictures of boats and cars embedded in the image, amongst all the noise and clutter of the other images around and often on top of them? This is the challenge that Max Kaznady and his colleagues on the data science team took upon themselves, using the power of an Azure N-Series virtual machine with 24 cores and 4 K80 GPUs. The model was trained using the mxnet package running on Microsoft R Server, which takes advantage of the powerful GPUs to train a Residual Network (ResNet) DNN with 21 convolutional blocks. (You can read more about ResNet in this Microsoft Research paper.) Once the model was trained, an HDInsight Spark cluster running Microsoft R Server was used to parallelize the problem of finding boat and car images within the photomosaic. Here's the architecture of the system, with the steps marked in order in yellow; another blog post explains how to set up such an architecture yourself. You can also just use the Deep Learning Toolkit with the Data Science Virtual Machine.
To learn more about this application, check out this recorded presentation from the Data Science Summit presented by Max Kaznady and Tao Wu, or the blog post linked below.
Cortana Intelligence and Machine Learning Blog: Applying Deep Learning at Cloud Scale, with Microsoft R Server & Azure Data Lake
There's a new e-book available to download free from Microsoft Academy: Data Science with Microsoft SQL Server 2016.
This 90-page e-book is aimed at data scientists who already have some experience in R, but want to learn how to use R with SQL Server. The book was written by some of my most experienced colleagues on Microsoft's data science team: Buck Woody, Danielle Dean, Debraj GuhaThakurta, Gagan Bansal, Matt Conners, and Wee-Hyong Tok, and begins with an introduction by Joseph Sirosh. It includes everything you need to know to use R and SQL Server:
(If you're new to R or data science, there are also links to learning resources in Chapter 1.) The book also includes several fully-worked examples following the data science process, with links to data and code so you can try them out yourself:
Data Science with Microsoft SQL Server 2016 is available now as a download from Microsoft Academy. (Desktop and mobile PDF formats available. Free registration required.)
If you'd like to manipulate and analyze very large data sets with the R language, one option is to use R and Apache Spark together. R provides the simple, data-oriented language for specifying transformations and models; Spark provides the storage and computation engine to handle data much larger than R alone can handle.
At the KDD 2016 conference last October, a team from Microsoft presented a tutorial on Scalable R on Spark, and made all of the materials available on Github. The materials include an 80-slide presentation covering several tutorials (you can download the 13Mb PowerPoint file here).
Slides 1-29 form an introduction which covers:
Slides 32-44 form a hands-on tutorial working with the airline arrival data to predict flight delays. In the tutorial, you use SparkR to clean and join the data, R Server's "rxDTree" function to fit a random forest model to predict delays, and then publish a prediction function to Azure with the AzureML package to create a cloud-based flight-delay prediction service. The Microsoft R scripts are available here.
Slides 46-50 form another tutorial, this time working with the NYC Taxi dataset. The first tutorial script uses the sparklyr package to visualize the data and create models to predict the tip amount. This second tutorial script goes further with models, fitting Elastic Net, Random Forest and Gradient Boosted Tree models with both SparkR and sparklyr. In addition this script uses SparkR and SparkSQL to create a map of the trips.
Slides 51-59 demonstrate optimizing the performance of a time series forecasting model by searching over a large parameter space with the hts package. By running the models in parallel to optimize the MAPE (mean absolute percent error), the total execution time was reduced to 1 day, compared to the 40 days it would have taken to complete the computations serially. The parallelization was achieved with the Microsoft R Server "rxExec" function, which you can replicate with the script available here.
Slides 60-72 includes background on calculating learning curves for various predictive models (see here and here for more information). There's also a tutorial: the R scripts are available here.
To work through the materials from the tutorial, you'll need access to a Spark cluster configured with Microsoft R Server and the necessary scripts and data files. You can easily create an HDInsight Premium cluster including Microsoft R Server on Microsoft Azure: these instructions provide the details. Once the cluster is ready, you can remotely access it from your desktop using ssh as described. The clusters are charged by the hour (according to their size and power), so be sure to shut down the cluster when you're done with the tutorial.
These tutorials are hopefully useful to anyone who is trying to learn to use R with Spark. The full collection of materials and slides is available at the Github repository below.
Github (Azure): KDD 2016 tutorial: Scalable Data Science with R, from Single Nodes to Spark Clusters
In Joseph Sirosh's keynote presentation at the Data Science Summit on Monday, Wee Hyong Tok demonstrated using R in SQL Server 2016 to detect fraud in real-time credit card transactions at a rate of 1 million transactions per second. The demo (which starts at the 17:00 minute mark) used a gradient-boosted tree model to predict the probability of a credit card transaction being fraudulent, based on attributes like the charge amount and the country of origin.
Then, a stored procedure in SQL Server 2016 was used to score transactions streaming into the database at a rate of 3.6 billion per hour. If you'd like to try this yourself, a step-by-step tutorial with code to implement the model and scoring is available here.
Later in the keynote (starting at 25:00), John Salch, VP of Technology and Platforms at PROS, describes using R to determine prices for airline tickets, hotel rooms, and laptops. PROS has been using R for a while in development, but found running R within SQL Server 2016 to be 100 times (not 100%, 100x!) faster for price optimization. "This really woke us up that we can use R in a production setting ... it's truly amazing," he says.
It's great to see these global-scale applications of R, driving the intelligence of businesses behind the scenes. As Joseph said in the opening, "If there's one language you should learn today ... it's R."
Channel 9: Microsoft Machine Learning & Data Science Summit 2016
by Anusua Trivedi, Microsoft Data Scientist
This is part 3 of my series on Deep Learning, where I describe my experiences and go deep into the reasons behind my choices.
In Part 1, I discussed the pros and cons of different symbolic frameworks, and my reasons for choosing Theano (with Lasagne) as my platform of choice. A very recent benchmarking paper compares CNTK with Caffe, Torch & TensorFlow, and finds that CNTK performs substantially better than the other three frameworks.
In Part 2, I describe Deep Convolutional Neural Network (DCNN) and how Transfer learning and Fine-tuning helps better the training process for domain specific images.
This Part 3 of the series is based on my talk at PAPI 2016. In this blog, I show re-usability of trained DCNN model by combining it with a Long Short-term Memory (LSTM) Recurrent Neural Network (RNN).
I am delivering a Deep Learning webinar on 27th September 2016, 10:00am-11:00am PST. You can register for the live webinar to learn more about Microsoft Azure and Deep Learning. Please feel free to email me at trivedianusua23@gmail.com if you have questions.
Given the role of clothing apparel in society, fashion classification has many applications. For example, Liu et al.’s work on predicting the clothing details in an unlabeled image can facilitate the discovery of the most similar fashion items in an e-commerce database. Yang et al. show that real-time clothing recognition can be useful in the surveillance context, where information about individuals’ clothes can be used to identify crime suspects. Depending on the application of fashion classification, the most relevant problems to solve will differ.
We will focus on papers by Wang et al. and Vinyals et al., and modify the model to optimize fashion classification for the purposes of annotating images and predicting clothing tags for fashion images.
The main inspiration of our work comes from recent advances in machine translation, where the task is to translate words or sentences individually. Recent work has shown that translation can be done in a much simpler way using RNNs and outperform the state-of-the-art performance. An “encoder” RNN reads the source tag/label and transforms it into a rich fixed-length vector representation, which in turn is used as the initial hidden state of a “decoder” RNN that generates the target tag/label. Here, we propose to follow this elegant recipe, replacing the encoder RNN by a DCNN. Over the last few years it has been convincingly shown that DCNNs can produce a rich representation of the input image by embedding it to a fixed-length vector, which can be used for a variety of vision tasks. Hence, it is natural to use a DCNN as an image “encoder”, by first pre-training it for an image classification task and using the last hidden layer as an input to the RNN decoder that generates tags.
In this work, we use an ImageNet pre-trained GoogLeNet model [Figure 2] for extracting CNN-features from the ACS dataset. Then we use the CNN-features to train a LSTM RNN model for the ACS Tag prediction.
In this work, we use the ACS dataset: a complete pipeline for recognizing and classifying people’s clothing in natural scenes. This has several interesting applications, including e-commerce, event and activity recognition, online advertising, etc. The stages of the pipeline combine several state-of-the-art building blocks such as upper body detectors, various feature channels and visual attributes. ACS defines 15 clothing classes and introduces a benchmark data set for the clothing classification task consisting of over 80,000 images, which is publicly available. We use the ACS dataset to predict new clothing tags for unseen images.
Images from the training and test datasets have very different resolutions, aspect ratios, colors etc. Neural networks require a fixed input size, so each image was resized and/or cropped to fixed dimensions (3 × 224 × 224).
One of the drawbacks of non-regularized neural networks is that they are extremely flexible: they learn both features and noise equally well, increasing the potential for overfitting. In our model, we apply L2 regularization to avoid overfitting. But even after that, we observed a large gap between performance on the training and validation sets of ACS images, indicating that the fine-tuning process was overfitting to the training set. To combat this overfitting, we leverage data augmentation for the ACS image dataset.
There are many ways to do data augmentation, such as the popular horizontal flipping, random crops and color jittering. As the color information in these images is very important, we only rotate the images at different angles – at 0, 90, 180, and 270 degrees.
For this ACS Tag prediction problem, we fall under fine-tuning scenario 2 (refer to Part 2). The GoogLeNet model that we use here was initially trained on ImageNet. The ImageNet dataset contains about 1 million natural images and 1000 labels/categories. In contrast, our labeled ACS dataset has about 80,000 domain-specific fashion images and 15 labels/categories. The ACS dataset is insufficient to train a network as complex as GoogLeNet, so we use weights from the ImageNet-trained GoogLeNet model and fine-tune all the layers of the pre-trained model by continuing the backpropagation.
In this work, we use a LSTM RNN model, which has shown state-of-the art performance on sequence tasks. The core of the LSTM model is a memory cell, which encodes knowledge of what inputs have been observed at every time step [Figure 5]. The behavior of the cell is controlled by gates, which are layers that are applied multiplicatively. Three gates are being used here:
The memory block contains a cell ‘c’ which is controlled by three gates. In blue we show the recurrent connections – the output ‘m’ at time (t – 1) is fed back to the memory at time ‘t’ via the three gates; the cell value is fed back via the forget gate; the predicted word at time (t – 1) is fed back in addition to the memory output ‘m’ at time ‘t’ into the Softmax for tag prediction.
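In the notation of this figure, the standard LSTM updates (as in Vinyals et al.; the weight matrix names here are generic placeholders) can be written as:

```latex
i_t &= \sigma(W_{ix} x_t + W_{im} m_{t-1})  && \text{(input gate)} \\
f_t &= \sigma(W_{fx} x_t + W_{fm} m_{t-1})  && \text{(forget gate)} \\
o_t &= \sigma(W_{ox} x_t + W_{om} m_{t-1})  && \text{(output gate)} \\
c_t &= f_t \odot c_{t-1} + i_t \odot h(W_{cx} x_t + W_{cm} m_{t-1}) \\
m_t &= o_t \odot c_t \\
p_{t+1} &= \mathrm{Softmax}(m_t)
```

Here σ is the sigmoid, h the hyperbolic tangent, and ⊙ elementwise multiplication; the three gates scale the cell input, the previous cell value, and the cell output, exactly as described above.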
The LSTM model is trained to predict top tags for each image. It takes the ACS DCNN-features (produced using the GoogLeNet) as input. LSTM is then trained on a combination of these DCNN-features and labels for these pre-processed images.
A copy of the LSTM memory is created for each image and each label, such that all LSTMs share the same parameters and the output m(t − 1) of the LSTM at time (t − 1) is fed to the LSTM at time (t) [Figure 6]. All recurrent connections are transformed to feed-forward connections in the final version. We saw that feeding the image at each time step as an extra input yields inferior results, as the network can explicitly exploit noise in the image and overfits more easily. The LSTM-RNN loss is minimized w.r.t. all the parameters of the LSTM, the top layer of the image embedder CNN and the label embedding.
We apply the above LSTM-RNN-CNN model for ACS Tag prediction. Tag prediction accuracy of this model improves quickly during the first part of the iterations and stabilizes after about 20,000 iterations.
Each image in our ACS image dataset contains only a single type of clothing; there is no image combining clothing types. In spite of this fact, when we test images with multiple clothing types, our trained model generates tags for these unseen test images quite accurately (~80% accurate).
Below we show a sample of our prediction outputs. Our test image below contains a person in suit and a person in t-shirt.
The image below describes the GoogLeNet Tag Prediction for our test image using ImageNet-trained GoogleNet model:
As we see, the prediction using this model is not very accurate as our test image is tagged as ‘Windsor Tie’. The image below describes ACS Tag Prediction for our test image using the above LSTM-RNN-CNN model. As we see, the prediction using this model is very accurate as our test image is tagged as ‘jersey, T-shirt, upper body, suit of clothes’.
Here our aim is to develop a model which can predict clothing category tags for fashion images with high accuracy.
In this work, we compare the tag prediction accuracy of the GoogLeNet model to our LSTM-RNN-CNN model. The tags predicted by our model are far more accurate than those predicted by the state-of-the-art GoogLeNet model. For training our model, we used a relatively small number of training iterations, around 10,000. Prediction accuracy of our model improves quickly with an increasing number of training iterations and stabilizes after about 20,000 iterations. We used only one GPU for this work.
Fine-tuning allows us to bring the power of state-of-the-art DCNN models to new domains. Moreover, combining DCNN-RNN model helps us extend the trained model to solve completely different problem like fashion image tag generation. I hope this blog series helps you train DCNNs for domain-specific images and leads to the re-usability of the trained models.
There are many algorithms behind Deep Learning (see this comparison of deep learning frameworks for details), but one common algorithm used by many frameworks is Convolutional Neural Networks (CNNs). The mathematics behind that algorithm are complex, but Brandon Rohrer explains the process in plain language, and shows how AIs trained with CNNs can appear to mimic human processes like vision:
You can also download the slides from Brandon's presentation for offline viewing. Check out in particular Slide 90 which includes links to several software frameworks for deep learning, including Microsoft's open-source CNTK toolkit.
For a text-based explanation of CNNs from Brandon, follow the link to his post on KDnuggets below.
KDnuggets: How Convolutional Neural Networks Work
by Jaya Mathew, Data Scientist at Microsoft
By using R Services within SQL Server 2016, users can leverage the power of R at scale without having to move their data around. Such a solution is beneficial for organizations that have very sensitive big data which cannot be hosted on any public cloud, but do most of their coding in R.
To illustrate the scenario, we will focus on companies that operate machines which encounter mechanical failures. These failures lead to downtime, which has cost implications for any business, so most companies are interested in predicting failures ahead of time so that they can proactively prevent them. This scenario is aligned with an existing R Notebook published in the Cortana Intelligence Gallery, but works with a larger dataset: we focus on predicting component failures of a machine using raw telemetry, maintenance logs, previous errors/failures, and additional information about the make/model of the machine. The scenario is widely applicable to almost any industry that uses machines needing maintenance. A quick overview of typical feature engineering techniques, as well as how to build a model, is given below.
The sample data and code for this template are available on GitHub; the template requires Windows for both the local R client and the remote SQL Server. To use the template, you will need:
To run the template using the sample data, follow these steps:
In this template, there are 5 data sources, namely telemetry, errors, maintenance, machines, and failures. The data ingestion, feature engineering, and data preparation are done using SQL code on the server where the data resides for most users, which circumvents the issues of moving sensitive big data. (For the template, we upload the local CSV files to SQL Server.) Various aspects of the data are then visualized and models are built using R code on SQL Server, which enables users to scale their operations.
The process can be visualized as shown below:
The five data sources (telemetry, errors, failures, maintenance, machines) capture various aspects of the functioning of the machine. The machine attribute table contains descriptive information such as the make, model, and age of each machine. The telemetry time-series data consists of operating measurements like voltage, rotation, pressure, and vibration, collected at an hourly rate (in this template's data). Depending on the machine, other measurements might be available. The error log contains non-breaking errors thrown while the machine is still operational; these therefore do not qualify as failures. The error dates and times are rounded to the closest hour, since the telemetry is also collected at an hourly rate. The maintenance records cover both scheduled and unscheduled maintenance: a record is generated whenever a component is replaced, either during a regular inspection or due to a breakdown. The failure table captures the records of component replacements due to machine failures. Each record has a date and time, a machine ID, and a failed component type associated with it. In this template the data is at an hourly rate, but in other use cases the rate could be every second, minute, hour, or once a day.
The first step in predictive maintenance applications is feature engineering, which combines the different data sources to create features that best describe a machine's health condition at a given point in time. Telemetry data almost always comes with timestamps, which makes it suitable for calculating lagging features. In this template, the rolling mean and standard deviation of the telemetry data over the last 3-hour lag window are calculated every 3 hours (the lag window sizes can be edited to suit the requirements of the business). Like the telemetry data, the errors also come with timestamps; however, unlike telemetry, which has numerical values, errors have categorical values denoting the type of error that occurred at a given timestamp. Aggregating methods such as averaging do not apply here, so counting the different categories is the more viable approach: lagging counts of the different types of errors that occurred in the lag window are calculated. The maintenance data contains component replacement records; relevant features from this data set include how long it has been since a component was last replaced, and the number of times a component has been replaced. The machine features are used directly, since they hold descriptive information about the type of the machines and their age, defined as years in service.
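As a concrete illustration of these lagging features, the rolling statistics could be sketched with pandas as follows. This is a hypothetical rendering of the idea only: the `volt` column, machine ID, and values are made up, and the template itself computes these features with SQL and R on the server, not Python.

```python
import pandas as pd

# Hypothetical hourly telemetry for one machine (column names are illustrative).
telemetry = pd.DataFrame({
    "datetime": pd.date_range("2015-01-01", periods=9, freq="h"),
    "machineID": 1,
    "volt": [170.0, 165.0, 168.0, 172.0, 180.0, 176.0, 169.0, 171.0, 175.0],
})

# Rolling mean and standard deviation over the last 3-hour lag window.
rolled = (telemetry
          .set_index("datetime")
          .groupby("machineID")["volt"]
          .rolling(window=3)
          .agg(["mean", "std"])
          .reset_index())

# Keep one row every 3 hours, matching the template's 3-hourly features.
features_3h = rolled.iloc[2::3]
print(features_3h)
```

The 3-hour window mirrors the template's default; as noted above, the window size can be adjusted to the requirements of the business.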
Once the data has been collated and the features generated, the next step is to create the label for the prediction problem. In this example scenario we are interested in computing the probability that a machine will fail in the next 24 hours due to a certain component failure, so records in the 24 hours preceding a component failure are labeled with that component (component 1, 2, 3 or 4), and the rest of the records are labeled "none", indicating there is no failure within the next 24 hours.
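The labeling step can be sketched as follows. This is a hypothetical Python/pandas rendering of the idea, with an illustrative component name and dates, not the template's actual SQL/R code:

```python
import pandas as pd

# Hypothetical hourly feature rows for one machine, plus one recorded failure.
features = pd.DataFrame({
    "datetime": pd.date_range("2015-01-01", periods=48, freq="h"),
    "machineID": 1,
})
failures = pd.DataFrame({
    "datetime": [pd.Timestamp("2015-01-02 12:00:00")],
    "machineID": [1],
    "failure": ["comp2"],
})

# Every record in the 24 hours leading up to a failure gets that component
# as its label; all other records are labeled "none".
features["label"] = "none"
for _, f in failures.iterrows():
    in_window = ((features["machineID"] == f["machineID"]) &
                 (features["datetime"] > f["datetime"] - pd.Timedelta(hours=24)) &
                 (features["datetime"] <= f["datetime"]))
    features.loc[in_window, "label"] = f["failure"]

print(features["label"].value_counts())
```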
For this predictive maintenance problem, a time-dependent splitting strategy is used to estimate performance: the model is validated and tested on examples that are later in time than the training examples. For a time-dependent split, a point in time is picked; the model is trained on examples up to that point and validated on the examples after it, on the assumption that data after the splitting point is not known at training time. There are many other ways of splitting the data, but this template illustrates this method.
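The time-dependent split amounts to a comparison against a chosen timestamp; here is a minimal sketch with made-up data and a hypothetical split point:

```python
import pandas as pd

# Hypothetical labeled data spanning four days of hourly records.
data = pd.DataFrame({
    "datetime": pd.date_range("2015-01-01", periods=96, freq="h"),
    "label": "none",
})

# Pick a point in time: train on everything before it, validate on the rest,
# so validation examples are strictly later than training examples.
split_point = pd.Timestamp("2015-01-03")
train = data[data["datetime"] < split_point]
validate = data[data["datetime"] >= split_point]
print(len(train), len(validate))
```

One refinement not shown here: because the labels look 24 hours into the future, training records within the labeling horizon of the split point can leak future information, so in practice they may need to be dropped.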
At the end of the template implementation, the user will have hands-on experience of how to formulate the problem and how to work with SQL Server R Services to solve data science problems, even when working with big data. In our experience, feature engineering processing time decreases significantly when done using SQL directly on the server. Another benefit is the ability to run R models on much larger datasets than could previously be handled on a local laptop or PC.
Acknowledgements: Thanks to Danielle Dean and Fidan Boylu Uz for their input.
Cortana Intelligence Gallery: Predictive Maintenance Modeling Guide using SQL R Services
by Anusua Trivedi, Microsoft Data Scientist
This is a blog series in several parts, in which I describe my experiences and go deep into the reasons behind my choices. In Part 1, I discussed the pros and cons of different symbolic frameworks, and my reasons for choosing Theano (with Lasagne) as my platform of choice.
Part 2 of this blog series is based on my upcoming talk at The Data Science Conference, 2016. Here in Part 2, I describe Deep Convolutional Neural Networks (DCNNs) and how transfer learning and fine-tuning improve the training process for domain-specific images.
Please feel free to email me at trivedianusua23@gmail.com if you have questions.
The eye disease Diabetic Retinopathy (DR) is a common cause of vision loss. Screening diabetic patients using fluorescein angiography images can potentially reduce the risk of blindness. Current research has demonstrated that DCNNs are very effective at automatically analyzing large collections of images and identifying features that can categorize images with minimum error. DCNNs are rarely trained from scratch, as it is relatively uncommon to have a domain-specific dataset of sufficient size. Since modern DCNNs take 2-3 weeks to train across GPUs, the Berkeley Vision and Learning Center (BVLC) has released some final DCNN checkpoints. In this blog, we use such a pre-trained network: GoogLeNet, which was pre-trained on a large collection of natural ImageNet images. We transfer the learned ImageNet weights as initial weights for the network, and fine-tune this pre-trained generic network to recognize fluorescein angiography images of eyes and improve DR prediction.
Much work has been done in developing algorithms and morphological image processing techniques that explicitly extract features prevalent in patients with DR. The generic workflow used in a standard image classification technique is as follows:
Faust et al. provide a very comprehensive analysis of models that use explicit feature extraction for DR screening. Vujosevic et al. build a binary classifier on a dataset of 55 patients by explicitly forming single-lesion features. These authors use morphological image processing techniques to extract blood vessel and hemorrhage features and then train an SVM on a data set of 331 images, reporting 90% accuracy and 90% sensitivity on a binary classification task with a dataset of 140 images.
However, all these processes are very time- and effort-consuming, and further improvements in prediction accuracy require large quantities of labeled data. Image processing and feature extraction for image datasets is complex and time-consuming, so we choose to automate these steps by using DCNNs.
Image data requires subject-matter expertise to extract key features. DCNNs extract features automatically from domain-specific images, without any feature engineering techniques. This process makes DCNNs suitable for image analysis:
Convolution: Convolution layers consist of a rectangular grid of neurons, each of which has the same weights. These shared weights specify the convolution filter.
Pooling: The pooling layer takes small rectangular blocks from the convolutional layer and subsamples each block to produce a single output from it.
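The two layer types above can be sketched in a few lines of NumPy. This is a toy illustration of the mechanics, not how a framework implements them (and, like most deep learning frameworks, it actually computes cross-correlation rather than flipped-kernel convolution):

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide one shared-weight filter over the image ('valid' positions only)."""
    kh, kw = kernel.shape
    out_h, out_w = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    out = np.zeros((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(feature_map, size=2):
    """Subsample each non-overlapping size x size block to its maximum."""
    h, w = feature_map.shape
    trimmed = feature_map[:h - h % size, :w - w % size]
    return trimmed.reshape(h // size, size, w // size, size).max(axis=(1, 3))

image = np.arange(16.0).reshape(4, 4)
kernel = np.array([[1.0, -1.0],
                   [1.0, -1.0]])          # a crude vertical-edge filter
fmap = conv2d_valid(image, kernel)        # 3x3 feature map
pooled = max_pool(fmap)                   # subsampled by 2x2 max pooling
```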
In this post, we use the GoogLeNet DCNN, which was developed at Google. GoogLeNet won the ImageNet challenge in 2014, setting the record for the best contemporaneous results. The motivation for this model was an architecture that is simultaneously deeper and computationally inexpensive.
In practice, we don't usually train an entire DCNN from scratch with random initialization, because it is relatively rare to have a dataset of sufficient size for the required depth of network. Instead, it is common to pre-train a DCNN on a very large dataset and then use the trained DCNN weights either as an initialization or as a fixed feature extractor for the task of interest.
Fine-Tuning: Transfer learning strategies depend on various factors, but the two most important ones are the size of the new dataset, and its similarity to the original dataset. Keeping in mind that DCNN features are more generic in early layers and more dataset-specific in later layers, there are four major scenarios:
Fine-tuning DCNNs: For this DR prediction problem, we fall under scenario iv. We fine-tune the weights of the pre-trained DCNN by continuing the backpropagation. It is possible to fine-tune all the layers of the DCNN, or to keep some of the earlier layers fixed (due to overfitting concerns) and only fine-tune some higher-level portion of the network. This is motivated by the observation that the earlier layers of a DCNN contain more generic features (e.g. edge detectors or color blob detectors) that should be useful for many tasks, while the later layers become progressively more specific to the details of the classes contained in the DR dataset.
Transfer learning constraints: As we use a pre-trained network, we are slightly constrained in terms of the model architecture. For example, we can't arbitrarily remove convolutional layers from the pre-trained network. However, due to parameter sharing, we can easily run a pre-trained network on images of different spatial sizes. This is clearly evident for convolutional and pooling layers, because their forward function is independent of the input volume's spatial size. It also holds for fully connected (FC) layers, because an FC layer can be converted to a convolutional layer.
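The FC-to-convolution equivalence can be checked numerically. In this toy NumPy sketch (shapes chosen purely for illustration), each row of the FC weight matrix is reshaped into a filter that spans the entire input, so a 'valid' convolution produces exactly one output per filter:

```python
import numpy as np

rng = np.random.default_rng(1)
feature_map = rng.normal(size=(2, 2))   # final conv-layer output (toy size)
W = rng.normal(size=(3, 4))             # FC layer: 4 inputs -> 3 outputs

# FC forward pass on the flattened feature map.
fc_out = W @ feature_map.reshape(-1)

# The same weights viewed as 3 convolution filters of size 2x2; each filter
# covers the whole input, giving a single output value per filter.
filters = W.reshape(3, 2, 2)
conv_out = np.array([np.sum(f * feature_map) for f in filters])

print(np.allclose(fc_out, conv_out))  # the two layers compute the same thing
```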
Learning rates: We use a smaller learning rate for DCNN weights that are being fine-tuned, on the assumption that the pre-trained DCNN weights are already relatively good. We don't wish to distort them too quickly or too much, so we keep both the learning rate and the learning-rate decay very small.
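A minimal sketch of this learning-rate policy, with dummy gradients and hypothetical layer shapes; real fine-tuning would use the framework's optimizer, and the 10x reduction for pre-trained weights is an illustrative choice, not the value used in the post:

```python
import numpy as np

rng = np.random.default_rng(0)
pretrained_w = rng.normal(size=(4, 4))   # weights transferred from ImageNet
new_head_w = rng.normal(size=(4, 2))     # freshly initialized classifier layer
pre0, head0 = pretrained_w.copy(), new_head_w.copy()

base_lr = 0.01
finetune_lr = base_lr * 0.1   # much smaller rate for pre-trained layers
decay = 0.999                 # very slow learning-rate decay

for step in range(3):
    # Dummy all-ones gradients; a real run would backpropagate these.
    g_pre = np.ones_like(pretrained_w)
    g_head = np.ones_like(new_head_w)
    pretrained_w -= finetune_lr * g_pre
    new_head_w -= base_lr * g_head
    base_lr *= decay
    finetune_lr *= decay

# The new head moves 10x further than the gently fine-tuned weights.
ratio = float((head0 - new_head_w).flat[0] / (pre0 - pretrained_w).flat[0])
print(ratio)
```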
Data Augmentation: One of the drawbacks of non-regularized neural networks is that they are extremely flexible: they learn both features and noise equally well, increasing the potential for overfitting. In our model, we apply L2 regularization to avoid overfitting. Even so, we observed a large gap in model performance between the training and validation DR images, indicating that the fine-tuning process was overfitting to the training set. To combat this overfitting, we leverage data augmentation for the DR image dataset.
There are many ways to do data augmentation, such as horizontal flipping, random crops, and color jittering. Because the color information in these images is very important, we only rotate the images at different angles: 0, 90, 180, and 270 degrees.
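The rotation-only augmentation can be expressed directly with NumPy's `rot90`; the tiny array here is a stand-in for a real fundus image:

```python
import numpy as np

# A tiny 2x2 stand-in for a fundus image; real inputs are H x W x 3 arrays.
img = np.array([[1, 2],
                [3, 4]])

# The four rotations used for augmentation: 0, 90, 180, and 270 degrees.
# Rotations leave the pixel color values untouched.
augmented = [np.rot90(img, k) for k in range(4)]
for a in augmented:
    print(a)
```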
Fine-tuning GoogLeNet: The GoogLeNet network we use here for DR screening was initially trained on ImageNet. The ImageNet dataset contains about 1 million natural images and 1,000 labels/categories. In contrast, our labeled DR dataset has only about 30,000 domain-specific images and 4 labels/categories. This is insufficient to train a network as complex as GoogLeNet, so we start from the weights of the ImageNet-trained GoogLeNet network. We fine-tune all layers except the top 2 pre-trained layers, which contain more generic, data-independent weights. The original classification layer "loss3/classifier" outputs predictions for 1,000 classes; we replace it with a new binary layer.
Fine-tuning allows us to bring the power of state-of-the-art DCNN models to new domains where insufficient data and time/cost constraints might otherwise prevent their use. This approach achieves a significant improvement in average accuracy and advances the state of the art in image-based medical classification.
In Part 3 of this blog series (coming soon), I will explain the re-usability of these trained DCNN models.
by Anusua Trivedi, Microsoft Data Scientist
Background and Approach
This blog series is based on my upcoming talk on the re-usability of Deep Learning models at the Hadoop+Strata World Conference in Singapore. The series will be in several parts, in which I describe my experiences and go deep into the reasons behind my choices.
Deep learning is an emerging field of research with applications across multiple domains. I try to show how a transfer learning and fine-tuning strategy leads to re-usability of the same Convolutional Neural Network model in different, disjoint domains; applying the model across these domains is what brings value to the fine-tuned model.
In this blog (Part 1), I describe and compare the commonly used open-source deep learning frameworks. I dive deep into the pros and cons of each framework, and discuss why I chose Theano for my work.
Please feel free to email me at trivedianusua23@gmail.com if you have questions.
Symbolic Frameworks
Symbolic computation frameworks (such as CNTK, MXNET, TensorFlow, and Theano) specify models as a symbolic graph of vector operations, such as matrix add/multiply or convolution. A layer is just a composition of those operations. The fine granularity of the building blocks (operations) allows users to invent new complex layer types without implementing them in a low-level language (as in Caffe).
I've used different symbolic computation frameworks in my work. However, I found that each has its pros and cons in its design and current implementation, and that none of them perfectly satisfies all needs. For my needs, I decided to work with Theano.
Here we compare the following symbolic computation frameworks:
Non-symbolic frameworks
PROS:
CONS:
Symbolic frameworks
PROS:
CONS:
Adding New Operations
In all of these frameworks, adding an Operation with reasonable performance is not easy.
| Theano / MXNET | TensorFlow |
| --- | --- |
| Can add an operation in Python with inline C support. | Forward pass in C++, symbolic gradient in Python. |
Code Re-usability
Training deep networks is time-consuming, so Caffe has released some pre-trained models/weights (the model zoo) which can be used as initial weights when transfer learning or fine-tuning deep networks on domain-specific or custom images.
Low-level Tensor Operators
A reasonably efficient implementation of low-level operators can serve as an ingredient in writing new models, saving the effort of writing new operations.
| Theano | TensorFlow | MXNET |
| --- | --- | --- |
| A lot of basic operations | Fairly good | Very few |
Control Flow Operator
Control flow operators make the symbolic engine more expressive and generic.
| Theano | TensorFlow | MXNET |
| --- | --- | --- |
| Supported | Experimental | Not supported |
High-level Support
Performance
Benchmarking Using Single-GPU
I benchmark the LeNet model on the MNIST dataset using a single GPU (NVIDIA Quadro K1200).
| Theano | TensorFlow | MXNET |
| --- | --- | --- |
| Great | Not so good | Excellent |
Memory
GPU memory is limited and can often be a problem for large models.
| Theano | TensorFlow | MXNET |
| --- | --- | --- |
| Great | Not so good | Excellent |
Single-GPU Speed
Theano takes a long time to compile a graph, especially with complex models. TensorFlow is a bit slower.
| Theano / MXNET | TensorFlow |
| --- | --- |
| Comparable to cuDNNv4 | About 0.5x slower |
Parallel/Distributed Support
| Theano | TensorFlow | MXNET |
| --- | --- | --- |
| Experimental multi-GPU | Multi-GPU | Distributed |
Conclusion
Theano (with the higher-level Lasagne and Keras libraries) is a great choice for deep learning models. It's very easy to implement new networks and modify existing ones using Lasagne/Keras. I prefer Python, and thus prefer Lasagne/Keras for their very mature Python interfaces; however, they do not support R. I have tried transfer learning and fine-tuning in Lasagne/Keras, and found it very easy to modify an existing network and customize it with domain-specific custom data.
Comparing the frameworks shows that MXNET is the best choice in terms of performance and memory use. Moreover, it has great R support; in fact, it is the only framework here that supports all of its functions in R. Transfer learning and fine-tuning networks are possible in MXNET, but not as easy as in Lasagne/Keras. This makes modifying existing trained networks more difficult, and thus a bit harder to use with domain-specific custom data.
Continued in Deep Learning Part 2: Transfer Learning and Fine-tuning Deep Convolutional Neural Networks