Multimodal Learning with Image, Text, and Columnar Data
This competition challenged data scientists to “predict demand for an online advertisement based on its full description (title, description, images, etc.), its context (geographically where it was posted, similar ads already posted) and historical demand for similar ads in similar contexts.”
I didn’t plan to spend much time on this competition until I watched Fast.ai Lesson 10 and decided to find a real-world dataset to try the language model from the updated fast.ai library (with less than three weeks left before the competition ended). Fine-tuning a universal language model has been shown to be effective for text classification[2,3]. I wondered if it would also work for regression. (Recently OpenAI published a paper expanding the framework using transformer networks[4,5].)
By pre-training a language model, extracting its encoder, and combining it with numerical and categorical features, the resulting neural network model, averaged with a public LightGBM kernel, reached bronze medal range (~125th place as I recall) within a week. Intrigued, I decided to invest some time adding image features into the model. The final ensemble landed at 54th place on the private leaderboard:
- Fortunately the model did not overfit the public leaderboard. Rather, I’d guess the model still underfit. Because of some mistakes I’ll mention in later sections, training was slower than it should have been.
- I did not train a text encoder from scratch as a control group. But since I did not really do any extra feature engineering, it appears the language model gave me some boost.
The Purpose of This Post
As you can see, I really did not do anything special other than pretraining a language model, and I was actually going to skip writing about this competition. But there are some implementation details that bothered me, and some of them still haven’t been fully solved. I figured writing them down could help me avoid making the same mistakes or running into the same confusion in the future. Also, I want to try making model architecture diagrams using Google Drawings.
The pure neural network model can be split into three stages:
- Pretraining the language model
- Image feature extraction
- Training the regression model
The input tokens come from the concatenated title and description fields.
The embedding layer is initialized with FastText pretrained vectors. This is tricky because we want the embedding matrix to also serve as the weights of the softmax layer[7, 8]. This is what I did:
learner.models.model.encoder.weight = nn.Parameter(T(vectors))
learner.models.model.decoder.weight = (
    learner.models.model.encoder.weight)  # tie the decoder to the embedding matrix
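The same weight tying can be sketched in plain PyTorch, independent of fast.ai (the names `encoder`/`decoder` and the sizes here are illustrative, not the competition code):

```python
import torch
import torch.nn as nn

vocab_size, emb_dim = 1000, 300  # illustrative sizes

encoder = nn.Embedding(vocab_size, emb_dim)          # input embedding
decoder = nn.Linear(emb_dim, vocab_size, bias=False)  # softmax projection
decoder.weight = encoder.weight  # tie: one parameter tensor, two roles

# Loading pretrained vectors into the encoder now also updates the decoder,
# since both modules reference the same tensor.
vectors = torch.randn(vocab_size, emb_dim)  # stand-in for FastText vectors
with torch.no_grad():
    encoder.weight.copy_(vectors)
```

Because the two modules share one tensor, gradients from the softmax layer and the embedding lookup both accumulate into the same weights, which is the point of the tying trick[7, 8].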
In the final model I used LSTM layers, which worked well enough with the default fast.ai settings. For QRNN the parameters need some tuning; I did not finish training a QRNN before the end of the competition.
Here’s the learning rate schedule I used:
lrs = 1e-4
learner.fit(lrs, 1, wds=1e-7, use_clr=(32, 5))
The pretrained Resnet101 model comes from the official torchvision library, and the Resnext101_64x4d model comes from Cadene/pretrained-models.pytorch. The last average pooling layer was replaced with a global pooling layer to support arbitrary-sized images.
Two kinds of image preprocessing were used: center cropping and padding to square. Padding to square seemed to provide better results, but only marginally. You could use both methods and concatenate the results; I didn’t do that due to disk space constraints.
self.transform_pad = transforms.Compose([
    # (padding-to-square steps elided in the original snippet)
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]),
])
self.transform_center = transforms.Compose([
    # (center-cropping steps elided in the original snippet)
    transforms.ToTensor(),
    transforms.Normalize(
        mean=[0.485, 0.456, 0.406],
        std=[0.229, 0.224, 0.225]),
])
The outputs were dumped to disk as one pickle file per image, in an attempt to avoid blowing up memory. This was a mistake: it makes the total size on disk bigger and reading slower. A better way is probably numpy.memmap.
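A minimal sketch of the memmap alternative: one big float32 array on disk, one row per image, indexed by position (the file path, image count, and feature size here are made up for illustration):

```python
import os
import tempfile
import numpy as np

path = os.path.join(tempfile.mkdtemp(), "image_feats.dat")  # illustrative path
n_images, feat_dim = 1000, 2048

# Writer process: create the on-disk array and fill rows as features arrive
feats = np.memmap(path, dtype="float32", mode="w+", shape=(n_images, feat_dim))
feats[42] = np.arange(feat_dim, dtype="float32")  # write one image's features
feats.flush()

# Training process: open read-only; rows are paged in on demand,
# so the full 60+ GB never has to fit in RAM at once.
feats_ro = np.memmap(path, dtype="float32", mode="r", shape=(n_images, feat_dim))
row = np.asarray(feats_ro[42])
```

One file with O(1) row lookups also avoids the filesystem overhead of a million tiny pickles, which is exactly the problem described in the last section of this post.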
This is where all the parts come together. Numerical features were normalized to zero mean and unit standard deviation. Categorical embedding dimensions were relatively conservative compared to what other competitors used:
self.region_emb = nn.Embedding(28, 3)
self.city_emb = nn.Embedding(290, 5)
self.p_cate_emb = nn.Embedding(9, 3)
self.cate_emb = nn.Embedding(47, 5)
self.image_top1_emb = nn.Embedding(888, 5)
self.user_type_emb = nn.Embedding(3, 2)
self.weekday_emb = nn.Embedding(7, 3)
self.param1_emb = nn.Embedding(204, 5)
self.param2_emb = nn.Embedding(131, 3)
self.param3_emb = nn.Embedding(113, 3)
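To make the wiring concrete, here is a stripped-down sketch of how such embeddings can be concatenated with the numerical features and the 128-d image/text features before the dense head. Only two of the ten embeddings are shown, and every layer size beyond the embeddings is an assumption, not the competition architecture:

```python
import torch
import torch.nn as nn

class RegressionHead(nn.Module):
    def __init__(self, n_numeric=20):  # n_numeric is an assumed count
        super().__init__()
        self.region_emb = nn.Embedding(28, 3)
        self.city_emb = nn.Embedding(290, 5)
        # 3 + 5 embedding dims, numeric features, 128-d image + 128-d text
        self.fc = nn.Sequential(
            nn.Linear(3 + 5 + n_numeric + 128 + 128, 256),
            nn.ReLU(),
            nn.Linear(256, 1),  # single regression output
        )

    def forward(self, region, city, numeric, img_feat, txt_feat):
        x = torch.cat([self.region_emb(region), self.city_emb(city),
                       numeric, img_feat, txt_feat], dim=1)
        return self.fc(x)
```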
The light blue dense layers act as downsamplers: they reduce the dimensions of the image features and encoder outputs to 128. We can extract features from these layers and feed them to GBM models.
The dense layers share the same structure. The layer normalization probably should have been placed after the ReLU, but somehow I placed it before the ReLU and did not really think about it until now…
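The two orderings differ only in where the normalization sits; a small helper makes the comparison explicit (a sketch of the two variants, not the exact competition layers):

```python
import torch
import torch.nn as nn

def dense_block(d_in, d_out, norm_after_relu=False):
    """Linear -> LayerNorm -> ReLU (as built), or Linear -> ReLU -> LayerNorm."""
    layers = [nn.Linear(d_in, d_out)]
    if norm_after_relu:
        layers += [nn.ReLU(), nn.LayerNorm(d_out)]   # suspected better order
    else:
        layers += [nn.LayerNorm(d_out), nn.ReLU()]   # what was actually used
    return nn.Sequential(*layers)
```

Normalizing before the ReLU means the block's outputs are rectified normalized values rather than normalized activations, which changes the distribution the next layer sees; either variant trains, but they are not equivalent.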
For this regression model, the model and learner were almost entirely re-written; only some utility functions from the fast.ai library were used. (Oddly, when using the Learner class from the fast.ai library, the validation losses were always off by some amount. I wasn’t able to find where the problem was. My custom learner did not have this problem.)
I re-implemented the slanted triangular learning rates[2,10] by extending the official learning rate scheduler class:
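A sketch of what that extension can look like. The schedule follows the ULMFiT formulation[2] (linear warm-up for a `cut_frac` fraction of steps, then linear decay, with a max/min ratio of `ratio`); the class name and defaults are my reconstruction, not the competition code:

```python
import torch
from torch.optim.lr_scheduler import _LRScheduler

class SlantedTriangularLR(_LRScheduler):
    """Slanted triangular learning rate schedule (Howard & Ruder, 2018)."""

    def __init__(self, optimizer, num_steps, cut_frac=0.1, ratio=32,
                 last_epoch=-1):
        self.num_steps = num_steps
        self.cut_frac = cut_frac
        self.ratio = ratio
        super().__init__(optimizer, last_epoch)

    def get_lr(self):
        cut = int(self.num_steps * self.cut_frac)  # step where LR peaks
        t = self.last_epoch
        if t < cut:
            p = t / cut                                  # warm-up phase
        else:
            p = 1 - (t - cut) / (cut * (1 / self.cut_frac - 1))  # decay
        p = max(p, 0.0)  # guard against stepping past num_steps
        scale = (1 + p * (self.ratio - 1)) / self.ratio
        return [base_lr * scale for base_lr in self.base_lrs]
```

Called once per batch (`scheduler.step()` after `optimizer.step()`), the LR rises from `lr/ratio` to the base `lr` over the first 10% of steps, then decays back down.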
The model was trained in a 5-fold cross-validation framework. The test predictions from each fold were averaged to get the final predictions.
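The fold loop can be sketched as follows. `train_and_predict` is a hypothetical stand-in for fitting one fold's model and predicting on its validation split and on the test set; the out-of-fold predictions are what later feed the stacker:

```python
import numpy as np

def kfold_oof(X, y, X_test, train_and_predict, n_folds=5, seed=42):
    """Return out-of-fold predictions and fold-averaged test predictions."""
    rng = np.random.RandomState(seed)
    idx = rng.permutation(len(X))
    folds = np.array_split(idx, n_folds)
    oof = np.zeros(len(X))
    test_preds = np.zeros((n_folds, len(X_test)))
    for i, valid_idx in enumerate(folds):
        train_idx = np.setdiff1d(idx, valid_idx)
        # Each fold's model predicts its held-out rows and the test set
        oof[valid_idx], test_preds[i] = train_and_predict(
            X[train_idx], y[train_idx], X[valid_idx], X_test)
    return oof, test_preds.mean(axis=0)  # average test preds across folds
```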
The best single (and final) pure neural network model produces 0.2201 public and 0.2242 private loss, which lands between 534th and 548th place on the private leaderboard.
If we extract the image and text features from the network (the light blue dense layers) and put them into a LightGBM model slightly modified from the public kernel, we get a single (sort of) model with 0.2197 public and 0.2236 private loss, which lands between 189th and 210th place on the private leaderboard.
To get to 54th place, we feed the out-of-fold predictions from different models (the one with center-cropped Resnet101, the one with padded-to-square Resnext101_64x4d, the public LightGBM, LightGBM with NN features, etc.) into a LightGBM stacker. Model diversity is important here. Because I only used one pretrained language model encoder, I think there is still some low-hanging fruit to be grabbed. There are of course more stacking/ensemble tricks that could boost performance further. Check out what other competitors shared on the forum for hints.
That’s it! It’s really a low-effort process that works surprisingly well. In the last week I mostly left my machine to train the model for 24+ hours, came back and did some tweaking, then repeated the cycle. I could have done many more iterations if the image file issue had been dealt with properly. This brings us to the final section:
Handling A Large Number of Small Files (e.g. Images) on Disk
Available on my computer were an SSD-backed ext4 partition with 40+ GB free space, an SSD-backed NTFS partition with 70+ GB free space, and an HDD-backed NTFS partition with 1 TB free space. There are 1+ million images in the training and test datasets, taking around 60 GB of space.
I found reading images from the HDD-backed NTFS partition excruciatingly slow; even finding a file in the command prompt could take seconds. I did some research, and it appeared that NTFS cannot handle too many files in a single folder:
“Here’s some advice from someone with an environment where we have folders containing tens of millions of files. A…” (stackoverflow.com)
So I wrote a script to put images into sub-folders, in a similar manner as:
“I have a project that will generate a huge number of images. Around 1,000,000 for start. They are not large images so I…” (serverfault.com)
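The sharding script can be sketched like this (a reconstruction of the idea, not the original script: bucket each file into one of 256 sub-folders keyed by a hash of its name, so no single directory holds too many entries):

```python
import hashlib
import shutil
from pathlib import Path

def shard_images(src_dir: str, dst_dir: str) -> None:
    """Move every file in src_dir into dst_dir/<2-hex-char bucket>/."""
    src, dst = Path(src_dir), Path(dst_dir)
    for path in list(src.iterdir()):  # materialize before moving files out
        if not path.is_file():
            continue
        # First two hex chars of the name hash -> 256 evenly-filled buckets
        bucket = hashlib.md5(path.name.encode()).hexdigest()[:2]
        target = dst / bucket
        target.mkdir(parents=True, exist_ok=True)
        shutil.move(str(path), str(target / path.name))
```

Hashing keeps the buckets balanced even when file names share long common prefixes, which sequential IDs like `item_0000001.jpg` tend to do.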
Performance seemed better; at least moving the images from HDD to SSD was faster. I did not spend more time digging into this issue, and moved on to using the SSD-backed NTFS partition to store image files and the SSD-backed ext4 partition to store extracted image features. As I mentioned earlier, this still slowed down model training quite a bit, and using numpy.memmap instead of dumping individual pickle files should be much better.
This kind of issue happens to me once every few months, so after the competition I decided to take some time to find the correct way to do this, and wrote some simple scripts to benchmark different schemes:
small-file-benchmark: Simple Benchmark of Reading Small Files From Disk (github.com)
However, the results were highly inconsistent. One day the flat structure was slower than the nested structure; the next day it was the other way around. It was frustrating. I suspect the OS was doing some optimization under the hood, but I don’t have that kind of knowledge yet. So this remains a mystery waiting to be solved.
If I had the budget, I’d probably just create an instance on Google Compute Engine with a big SSD-backed partition attached and enough memory to load the dataset. That would make things much easier. It’s likely a bad idea to serve large quantities of random reads from an HDD anyway.
1. Fast.ai Lesson 10: NLP Classification and Translation
2. Howard, J., & Ruder, S. (2018). Universal Language Model Fine-tuning for Text Classification.
3. Introducing State of the Art Text Classification with Universal Language Models (fast.ai blog)
4. Radford, A., Narasimhan, K., Salimans, T., & Sutskever, I. (2018). Improving Language Understanding by Generative Pre-Training.
5. OpenAI blog: Improving Language Understanding with Unsupervised Learning
6. Grave, E., Bojanowski, P., Gupta, P., Joulin, A., & Mikolov, T. (2018). Learning Word Vectors for 157 Languages.
7. Press, O., & Wolf, L. (2016). Using the Output Embedding to Improve Language Models.
8. Inan, H., Khosravi, K., & Socher, R. (2016). Tying Word Vectors and Word Classifiers: A Loss Framework for Language Modeling.
9. Pretrained models for Pytorch (GitHub: Cadene/pretrained-models.pytorch)
10. Smith, L. N. (2017). Cyclical Learning Rates for Training Neural Networks.
Source: Deep Learning on Medium