by Anusua Trivedi, Microsoft Data Scientist
This is part 3 of my series on Deep Learning, where I describe my experiences and go deep into the reasons behind my choices.
In Part 1, I discussed the pros and cons of different symbolic frameworks, and my reasons for choosing Theano (with Lasagne) as my platform of choice. A very recent benchmarking paper compares CNTK with Caffe, Torch & TensorFlow, and CNTK performs substantially better than the other three frameworks.
In Part 2, I described Deep Convolutional Neural Networks (DCNNs) and how transfer learning and fine-tuning improve the training process for domain-specific images.
This Part 3 of the series is based on my talk at PAPI 2016. In this post, I show the re-usability of a trained DCNN model by combining it with a Long Short-Term Memory (LSTM) Recurrent Neural Network (RNN).
I am delivering a Deep Learning webinar on 27th September 2016, 10:00am-11:00am PST. You can register for the live webinar to learn more about Microsoft Azure and Deep Learning. Please feel free to email me at firstname.lastname@example.org if you have questions.
Given the role of clothing apparel in society, fashion classification has many applications. For example, Liu et al.’s work on predicting the clothing details in an unlabeled image can facilitate the discovery of the most similar fashion items in an e-commerce database. Yang et al. show that real-time clothing recognition can be useful in the surveillance context, where information about individuals’ clothes can be used to identify crime suspects. Depending on the application of fashion classification, the most relevant problems to solve will differ.
We focus on the papers by Wang et al. and Vinyals et al. and modify the model to optimize fashion classification for two purposes: annotating images and predicting clothing tags for fashion images.
The main inspiration for our work comes from recent advances in machine translation, where the task is to translate words or sentences individually. Recent work has shown that translation can be done in a much simpler way using RNNs and still outperform the previous state of the art. An “encoder” RNN reads the source tag/label and transforms it into a rich fixed-length vector representation, which in turn is used as the initial hidden state of a “decoder” RNN that generates the target tag/label. Here, we propose to follow this elegant recipe, replacing the encoder RNN with a DCNN. Over the last few years it has been convincingly shown that DCNNs can produce a rich representation of the input image by embedding it into a fixed-length vector, which can be used for a variety of vision tasks. Hence, it is natural to use a DCNN as an image “encoder”, by first pre-training it for an image classification task and then using the last hidden layer as input to the RNN decoder that generates tags.
In this work, we use an ImageNet pre-trained GoogLeNet model [Figure 2] for extracting CNN-features from the ACS dataset. Then we use the CNN-features to train an LSTM RNN model for the ACS Tag prediction.
In this work, we use the ACS dataset, which accompanies a complete pipeline for recognizing and classifying people’s clothing in natural scenes. This has several interesting applications, including e-commerce, event and activity recognition, online advertising, etc. The stages of the pipeline combine several state-of-the-art building blocks such as upper body detectors, various feature channels and visual attributes. ACS defines 15 clothing classes and introduces a publicly available benchmark dataset of over 80,000 images for the clothing classification task. We use the ACS dataset to predict new clothing tags for unseen images.
Images from the training and test datasets have very different resolutions, aspect ratios, colors etc. Neural networks require a fixed input size, so each image was resized and/or cropped to fixed dimensions (3 × 224 × 224).
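As a rough illustration of this step, the sketch below crops the largest centered square and then resizes it with nearest-neighbour sampling. The crop-then-resize order, the sampling method, and the names `center_crop_box` and `to_fixed_size` are assumptions made for illustration; the post does not say which variant was actually used.

```python
def center_crop_box(width, height):
    """Return (left, top, right, bottom) of the largest centered square."""
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    return left, top, left + side, top + side


def to_fixed_size(image, size=224):
    """Center-crop a rows-by-columns grid of pixels, then resize to size x size.

    `image` is a list of rows; a pixel can be any value, e.g. an (r, g, b) tuple.
    Nearest-neighbour sampling is an assumption chosen to keep the sketch short.
    """
    height, width = len(image), len(image[0])
    left, top, right, bottom = center_crop_box(width, height)
    cropped = [row[left:right] for row in image[top:bottom]]
    side = right - left
    return [
        [cropped[r * side // size][c * side // size] for c in range(size)]
        for r in range(size)
    ]


# A dummy image with 300 rows and 500 columns is mapped to 224 x 224.
dummy = [[(0, 0, 0)] * 500 for _ in range(300)]
fixed = to_fixed_size(dummy)
print(len(fixed), len(fixed[0]))
```

Splitting the (r, g, b) pixels into three separate planes afterwards would give the channels-first 3 × 224 × 224 layout mentioned above.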
One of the drawbacks of non-regularized neural networks is that they are extremely flexible: they learn both features and noise equally well, increasing the potential for overfitting. In our model, we apply L2 regularization to avoid overfitting. But even after that, we observed a large gap between training and validation performance on the ACS images, indicating that the fine-tuning process is overfitting to the training set. To combat this overfitting, we leverage data augmentation for the ACS image dataset.
There are many ways to do data augmentation, such as the popular horizontal flipping, random crops and color jittering. As the color information in these images is very important, we only rotate the images, at 0, 90, 180, and 270 degrees.
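To make the L2 term concrete, here is a minimal sketch of how such a penalty enters the loss and the gradient update. The regularization strength, the learning rate, and the function names are illustrative assumptions, not the values or code used for the ACS experiments.

```python
def l2_penalty(weights, strength):
    """L2 penalty: strength times the sum of squared weights."""
    return strength * sum(w * w for w in weights)


def regularized_loss(data_loss, weights, strength):
    """Total loss = data loss + L2 penalty on the weights."""
    return data_loss + l2_penalty(weights, strength)


def sgd_step(weights, grads, lr, strength):
    """One gradient step; the penalty adds 2 * strength * w to each gradient."""
    return [w - lr * (g + 2.0 * strength * w) for w, g in zip(weights, grads)]


weights = [1.0, -2.0, 3.0]
print(regularized_loss(1.0, weights, strength=0.5))
# With zero data gradient, the penalty alone shrinks every weight toward zero.
print(sgd_step(weights, [0.0, 0.0, 0.0], lr=0.5, strength=0.5))
```

Because the extra gradient term is proportional to each weight, large weights are pulled back hardest, which is why this form of regularization is often called weight decay.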
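The rotation-only augmentation can be sketched in a few lines. This is a dependency-free illustration on a tiny grid of pixels; the function names are hypothetical, and a real pipeline would more likely use an array library.

```python
def rotate90(image):
    """Rotate a rows-by-columns grid of pixels 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def rotations(image):
    """Return the image rotated by 0, 90, 180 and 270 degrees."""
    out = [[list(row) for row in image]]
    for _ in range(3):
        out.append(rotate90(out[-1]))
    return out


# Each training image yields four samples that all keep the original label.
tiny = [[1, 2],
        [3, 4]]
for angle, rotated in zip((0, 90, 180, 270), rotations(tiny)):
    print(angle, rotated)
```

Rotations by multiples of 90 degrees only move pixels around, so the color values that matter for clothing are left untouched.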
Fine-tuning GoogLeNet for ACS
For this ACS Tag prediction problem, we fall under fine-tuning scenario 2 (refer to Part 2). The GoogLeNet model that we use here was initially trained on ImageNet. The ImageNet dataset contains about 1 million natural images and 1000 labels/categories. In contrast, our labeled ACS dataset has about 80,000 domain-specific fashion images and 15 labels/categories. The ACS dataset is insufficient to train a network as complex as GoogLeNet from scratch. Thus we start from the weights of the ImageNet-trained GoogLeNet model and fine-tune all the layers of the pre-trained model by continuing the backpropagation.
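One way to picture the starting point for fine-tuning is as a parameter dictionary in which every pre-trained layer is reused and only the classifier is re-created for the 15 ACS labels. The sketch below is a toy illustration under that assumption: the layer names, the tiny layer sizes, and the helper `init_from_pretrained` are hypothetical, and the post does not state how the output layer was actually handled.

```python
import random


def init_from_pretrained(pretrained, head_name, num_features, num_classes, seed=0):
    """Reuse every pre-trained layer except the classifier head.

    The head is re-created with `num_classes` outputs and small random weights,
    because a 1000-way ImageNet classifier does not match the 15 ACS labels.
    """
    rng = random.Random(seed)
    params = {name: value for name, value in pretrained.items() if name != head_name}
    params[head_name] = {
        "W": [[rng.gauss(0.0, 0.01) for _ in range(num_classes)]
              for _ in range(num_features)],
        "b": [0.0] * num_classes,
    }
    return params


# Toy stand-in for a pre-trained network (layer names and sizes are hypothetical).
pretrained = {
    "conv1": {"W": [[0.5, -0.2]], "b": [0.1]},
    "classifier": {"W": [[0.0] * 1000 for _ in range(4)], "b": [0.0] * 1000},
}
acs_params = init_from_pretrained(pretrained, "classifier",
                                  num_features=4, num_classes=15)
print(sorted(acs_params), len(acs_params["classifier"]["b"]))
```

All entries, including the reused ones, would then be updated by backpropagation, in line with the choice described above to fine-tune all the layers.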
Training LSTM RNN
In this work, we use an LSTM RNN model, which has shown state-of-the-art performance on sequence tasks. The core of the LSTM model is a memory cell, which encodes knowledge of what inputs have been observed at every time step [Figure 5]. The behavior of the cell is controlled by gates, which are layers that are applied multiplicatively. Three gates are used here, controlling whether to:
- forget the current cell value (forget gate f),
- read its input (input gate i),
- output the new cell value (output gate o).
The memory block contains a cell ‘c’ which is controlled by three gates. In blue we show the recurrent connections – the output ‘m’ at time (t – 1) is fed back to the memory at time ‘t’ via the three gates; the cell value is fed back via the forget gate; the predicted word at time (t – 1) is fed back in addition to the memory output ‘m’ at time ‘t’ into the Softmax for tag prediction.
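The gating described above can be written out as a single time step. The sketch below uses a common textbook LSTM formulation on scalar values so that it stays dependency-free; the exact equations, the bias terms, and the weight layout are assumptions, since the post does not spell them out.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def lstm_step(x, m_prev, c_prev, w):
    """One step of a scalar LSTM cell.

    x      : input at time t
    m_prev : memory output 'm' at time t - 1
    c_prev : cell value 'c' at time t - 1
    w      : maps "i", "f", "o", "g" to (input weight, recurrent weight, bias)
    """
    def pre(name):
        wx, wm, b = w[name]
        return wx * x + wm * m_prev + b

    i = sigmoid(pre("i"))    # input gate: how much new input to read
    f = sigmoid(pre("f"))    # forget gate: how much of the old cell to keep
    o = sigmoid(pre("o"))    # output gate: how much of the cell to expose
    g = math.tanh(pre("g"))  # candidate value computed from the input
    c = f * c_prev + i * g   # gates are applied multiplicatively
    m = o * math.tanh(c)
    return m, c


# Arbitrary illustrative weights; the same weights are reused at every step.
weights = {name: (0.5, 0.1, 0.0) for name in ("i", "f", "o", "g")}
m, c = 0.0, 0.0
for x in (1.0, -1.0, 0.5):
    m, c = lstm_step(x, m, c, weights)
print(m, c)
```

In a real model each quantity is a vector and each weight a matrix, and the output 'm' feeds a Softmax over the tag vocabulary.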
The LSTM model is trained to predict top tags for each image. It takes the ACS DCNN-features (produced using GoogLeNet) as input. The LSTM is then trained on a combination of these DCNN-features and the labels for these pre-processed images.
A copy of the LSTM memory is created for the image and for each label, such that all LSTMs share the same parameters and the output ‘m’ of the LSTM at time (t – 1) is fed to the LSTM at time (t) [Figure 6]. All recurrent connections are transformed to feed-forward connections in this unrolled version. We saw that feeding the image at each time step as an extra input yields inferior results, as the network can explicitly exploit noise in the image and overfit more easily. The LSTM-RNN loss is minimized w.r.t. all the parameters of the LSTM, the top layer of the image embedder CNN, and the label embedding.
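The unrolling described above can be pictured as one input sequence per image, with the image feature appearing only at the first step. The sketch below builds such a sequence; the `<start>` and `<end>` marker tokens, the toy embedding table, and the helper name `build_training_pair` are assumptions for illustration and do not come from the post.

```python
START, END = "<start>", "<end>"  # hypothetical marker tokens


def build_training_pair(image_feature, tags, embedding):
    """Return (inputs, targets) for one unrolled LSTM pass over an image.

    The image feature is fed at the first step only; every later input is the
    embedding of the previous token, and the target is the next tag.
    """
    tokens = [START] + list(tags)
    inputs = [image_feature] + [embedding[t] for t in tokens]
    targets = [None] + list(tags) + [END]  # nothing is predicted at the image step
    return inputs, targets


# Toy 2-dimensional feature and embeddings (values are made up).
embedding = {START: [0.0, 0.0], "suit": [0.3, 0.7], "T-shirt": [0.9, 0.1]}
image_feature = [0.2, 0.4]
inputs, targets = build_training_pair(image_feature, ["suit", "T-shirt"], embedding)
for step, (x, y) in enumerate(zip(inputs, targets)):
    print(step, x, y)
```

Every step would be processed by the same LSTM parameters, which requires the image feature and the label embeddings to share one dimensionality.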
We apply the above LSTM-RNN-CNN model for ACS Tag prediction. Tag prediction accuracy of this model improves quickly during the first part of the iterations and stabilizes after about 20,000 iterations.
Each image in our ACS image dataset shows only a single type of clothing; no image combines clothing types. In spite of this, when we test on images with multiple clothing types, our trained model generates tags for these unseen test images quite accurately (~80% accurate).
Below we show a sample of our prediction outputs. Our test image below contains a person in a suit and a person in a T-shirt.
The image below shows the GoogLeNet Tag Prediction for our test image using the ImageNet-trained GoogLeNet model:
As we see, the prediction using this model is not very accurate, as our test image is tagged as ‘Windsor Tie’. The image below shows the ACS Tag Prediction for our test image using the above LSTM-RNN-CNN model. As we see, the prediction using this model is very accurate, as our test image is tagged as ‘jersey, T-shirt, upper body, suit of clothes’.
Here our aim is to develop a model which can predict clothing category tags for fashion images with high accuracy.
In this work, we compare the tag prediction accuracy of the GoogLeNet model to that of our LSTM-RNN-CNN model. The tags predicted by our model are far more accurate than those predicted by the state-of-the-art GoogLeNet model. For training our model, we used a relatively small number of training iterations, around 10,000. Prediction accuracy of our model improves quickly with an increasing number of training iterations and stabilizes after about 20,000 iterations. We used only one GPU for this work.
Fine-tuning allows us to bring the power of state-of-the-art DCNN models to new domains. Moreover, combining a DCNN with an RNN helps us extend the trained model to solve a completely different problem, like fashion image tag generation. I hope this blog series helps you train DCNNs for domain-specific images and re-use the trained models.