My solution to the Global Data Science Challenge 2020
This repository contains the code I used to train the models, and the following post details the approach I took.
More details about the project here
```bash
pip install git+https://github.com/ildoonet/pytorch-gradual-warmup-lr.git
pip install timm
pip install pretrainedmodels
```
Since we're asked to list the top 20 most similar images for each whale fluke, I found that metric learning techniques worked better than a classification approach, where you would train a network to classify the images among all the ids and extract the weights of a hidden layer, hoping they form a good representation vector for similarity measures.
Metric learning techniques are well suited to generating vector representations that can be used to compute a similarity measure (cosine, Euclidean, etc.) between items.
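For illustration, here is a minimal NumPy sketch (with made-up array and function names) of how such embeddings can be ranked by cosine similarity to retrieve the top 20 most similar images:

```python
import numpy as np

def top_k_similar(query_embeddings, gallery_embeddings, k=20):
    """Return, for each query embedding, the indices of the k most
    cosine-similar gallery embeddings (highest similarity first)."""
    # L2-normalize so that the dot product equals cosine similarity
    q = query_embeddings / np.linalg.norm(query_embeddings, axis=1, keepdims=True)
    g = gallery_embeddings / np.linalg.norm(gallery_embeddings, axis=1, keepdims=True)
    similarity = q @ g.T                      # (n_queries, n_gallery)
    order = np.argsort(-similarity, axis=1)   # sort by decreasing similarity
    return order[:, :k]

# Example with random 256-d embeddings
queries = np.random.randn(5, 256)
gallery = np.random.randn(100, 256)
print(top_k_similar(queries, gallery).shape)  # (5, 20)
```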
The most popular losses for achieving this are:
In fact, if you look at the Humpback Whale Identification Kaggle competition, and more specifically at the solutions, you'll find that most participants went for a Siamese Neural Network with a margin loss, and it worked quite well.
I tried triplet loss with hard sample mining (cf. this article for more details) for about two months, and the best score I could achieve on the test leaderboard for a single model was about ~1370. So two weeks before the deadline, I started searching for another method, and that's when I discovered the arcface loss, which literally blew my mind :). So I obviously went for it.
The arcface loss was introduced at CVPR 2019, and its main goal is to maximize face class separability by learning highly discriminative features for face recognition. According to the authors of the paper, this method outperformed triplet loss, intra-loss and inter-loss on the most common face identification benchmarks.
What does this loss exactly do?
Given a feature vector extracted from the network and the corresponding ground-truth label (in this case, the whale id), arcface moves the computation into an angular space by computing the angle between the feature vector and the weight vector of the ground-truth class. Then, after adding a margin to that angle, as in the triplet loss or center loss schemes, it maps back to cosine logits and applies a softmax.
The main benefit of this loss is the transition into a new space, where separability is maximized.
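To make this concrete, here is a minimal PyTorch sketch of an ArcFace-style margin head. The class name and the default scale s and margin m are illustrative, not necessarily the values used in this repository:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ArcFaceHead(nn.Module):
    """Illustrative ArcFace margin head: adds an angular margin m to the
    ground-truth class angle, then rescales by s before softmax."""
    def __init__(self, embedding_size, num_classes, s=30.0, m=0.5):
        super().__init__()
        self.weight = nn.Parameter(torch.empty(num_classes, embedding_size))
        nn.init.xavier_uniform_(self.weight)
        self.s, self.m = s, m

    def forward(self, embeddings, labels):
        # cosine of the angle between each embedding and each class weight vector
        cosine = F.linear(F.normalize(embeddings), F.normalize(self.weight))
        theta = torch.acos(cosine.clamp(-1 + 1e-7, 1 - 1e-7))
        # add the angular margin m only to the ground-truth class angle
        one_hot = F.one_hot(labels, num_classes=self.weight.size(0)).float()
        logits = self.s * torch.cos(theta + self.m * one_hot)
        # the margin-adjusted logits go through a regular softmax cross-entropy
        return F.cross_entropy(logits, labels)
```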
Here are the benefits of this loss compared to softmax and triplet loss:
Let's now dive into the solution:
I used cropped images (see the data folder) of the whales' flukes in order to discard surrounding noise (water splashes, etc.) and zoom in on the relevant information. This acts as an attention mechanism. Note: I didn't use the pretrained Kaggle detection model; I trained a fluke detector (YOLO v3) from scratch myself after annotating about ~300 whales on this tool.
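As an illustration of the cropping step (not the detector itself), here is a minimal sketch that crops an image around a detected bounding box, assuming the detector outputs (x_min, y_min, x_max, y_max) pixel coordinates:

```python
from PIL import Image

def crop_fluke(image_path, box, margin=0.05):
    """Crop an image around a detected fluke bounding box
    (x_min, y_min, x_max, y_max), with a small relative margin."""
    img = Image.open(image_path)
    w, h = img.size
    x0, y0, x1, y1 = box
    dx, dy = (x1 - x0) * margin, (y1 - y0) * margin
    return img.crop((int(max(0, x0 - dx)), int(max(0, y0 - dy)),
                     int(min(w, x1 + dx)), int(min(h, y1 + dy))))
```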
Key learning: Spend as much time as you can on the data: clean it, curate it, cross-check it... Although powerful at representation learning, deep learning models are still garbage-in, garbage-out models. If you feed them noisy data, don't expect good results.
During the first weeks of the competition, I used ImageNet pretrained models as backbones. It worked fine: my models ended up converging after some time. The top score I could achieve on the test leaderboard was about 1270.
Then I looked into the Humpback Whale Identification Kaggle competition data. I noticed a couple of things:
Kaggle whales' flukes
So I decided to fine-tune the ImageNet pretrained models on this data using the triplet loss.
Funny how things worked out:
Key learning: Transfer learning rarely hurts. ImageNet models pretrained on 1000 common object classes (animals, cars, etc.) are a good start, but a network pretrained on a dataset similar to yours is likely to work even better.
As we've seen, Lisa's images can sometimes be very large, due to the professional cameras and tools she uses. Some images can even reach 3000x1200 pixels or higher.
When I started the competition, I set the input size to 224x224 pixels, as I typically do in most image classification problems.
However, when I started varying the input size, I got a performance lift with 480x480.
Two key learnings here:
As we all know, there's a large family of network architectures to choose from.
After several experiments, I noticed that the best-performing architectures on our dataset were:
My best performing single model relies on densenet121.
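For illustration, here is a minimal sketch of such a backbone with an embedding head, using the timm library installed above (the class name, pooling and head details are assumptions, not the exact code of this repository):

```python
import timm
import torch.nn as nn

class EmbeddingNet(nn.Module):
    """Sketch: a timm backbone followed by dropout and an embedding layer."""
    def __init__(self, backbone="densenet121", embedding_size=256, dropout=0.0):
        super().__init__()
        # num_classes=0 makes timm return pooled features instead of class logits
        self.backbone = timm.create_model(backbone, pretrained=True, num_classes=0)
        self.dropout = nn.Dropout(dropout)
        self.embedding = nn.Linear(self.backbone.num_features, embedding_size)

    def forward(self, x):
        return self.embedding(self.dropout(self.backbone(x)))
```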
Key learnings:
The training pipeline consists of 5 major steps:
Step 1: the dataloader connects to the database and serves the images and the corresponding labels to the network in batches. It's also responsible for shuffling the data between epochs and applying on-the-fly data augmentation. Heavy augmentation was applied as a regularization effect for better generalization. Transformations include: Gaussian noise, blurring, motion blur, random rain (to simulate splash effects), color shift, random changes in brightness, hue and saturation, sharpening, perspective and elastic transformations, random rotation ±20°, affine transformations (translation and shearing), and random occlusion (to increase generalization capabilities). Here is the pipeline script:
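The actual script lives in the repository; as an illustration, here is a minimal sketch of what such a pipeline can look like with albumentations (the transform choices and probabilities below are assumptions):

```python
import albumentations as A

# Hedged sketch of a heavy augmentation pipeline (assumes a recent
# albumentations version); transform choices and probabilities are illustrative.
train_transforms = A.Compose([
    A.GaussNoise(p=0.3),                                    # Gaussian noise
    A.OneOf([A.Blur(), A.MotionBlur()], p=0.3),             # blurring / motion blur
    A.RandomRain(p=0.1),                                    # simulate splash effects
    A.RGBShift(p=0.3),                                      # color shift
    A.RandomBrightnessContrast(p=0.3),                      # brightness changes
    A.HueSaturationValue(p=0.3),                            # hue / saturation changes
    A.Sharpen(p=0.2),                                       # sharpening
    A.OneOf([A.Perspective(), A.ElasticTransform()], p=0.3),
    A.ShiftScaleRotate(rotate_limit=20, p=0.5),             # rotation ±20° and translation
    A.Affine(shear=10, p=0.3),                              # shearing
    A.CoarseDropout(p=0.3),                                 # random occlusion
])
```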
Step 2: forward pass. The model takes the images as input and generates the features.
Step 3: the arcface loss is computed between the features and the targets.
Step 4: the loss.backward() method is called, where the gradients of the loss with respect to the model parameters are computed.
Step 5: this is where the Adam optimizer comes in. Based on the gradients computed in step 4, it updates the weights of the network. This operation is performed for each batch.
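Putting the five steps together, here is a hedged sketch of the training loop, reusing the EmbeddingNet and ArcFaceHead sketches above; the random dataset below is a placeholder standing in for the real dataloader:

```python
import torch
from torch.utils.data import DataLoader, TensorDataset

num_whale_ids = 100                                   # placeholder for the real number of ids
dataset = TensorDataset(torch.randn(64, 3, 480, 480),
                        torch.randint(0, num_whale_ids, (64,)))
loader = DataLoader(dataset, batch_size=16, shuffle=True)   # step 1: shuffled batches

model = EmbeddingNet(embedding_size=256)
head = ArcFaceHead(embedding_size=256, num_classes=num_whale_ids)
optimizer = torch.optim.Adam(list(model.parameters()) + list(head.parameters()), lr=2.5e-4)

for images, labels in loader:
    embeddings = model(images)                        # step 2: forward pass
    loss = head(embeddings, labels)                   # step 3: arcface loss
    optimizer.zero_grad()
    loss.backward()                                   # step 4: backpropagate the gradients
    optimizer.step()                                  # step 5: Adam updates the weights
```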
Training tips
I ran a lot of experiments during this competition. Here is my list of tips when it comes to building a strong training pipeline:
I trained two models following the previous pipeline with the following parameters:
| | model 1 | model 2 |
|---|---|---|
| backbone | resnet34 | densenet121 |
| pretrained | Kaggle data | Kaggle data |
| embedding size | 256 | 256 |
| image size | 620 | 480 |
| pseudo labels | yes | no |
| learning rate | 2.5e-4 to 5e-5 | 2.5e-4 to 5e-5 |
| batch size | 32 | 16 |
| dropout | 0 | 0.5 |
| epochs | 70 | 90 |
| Score on test leaderboard | 1434 | 1450 |
| Classification layer | no | yes |
| Embedding layer | yes | yes |
What gave me a bump in the final score was the way I combined these two models: a simple meta-embedding technique that is quite commonly used in Natural Language Processing.
It consists of generating the embeddings of each model on all the samples, then concatenating them.
This method is used to generate the meta-embeddings of the train and test data sets. Then, the same computations are used to generate the submission.
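Here is a minimal sketch of this concatenation (the array names are illustrative, and L2-normalizing each model's embeddings before concatenating is a common choice rather than a detail stated here):

```python
import numpy as np

def meta_embeddings(model_a_embeddings, model_b_embeddings):
    """Concatenate the (L2-normalized) embeddings of the two models into a
    single meta-embedding per sample."""
    a = model_a_embeddings / np.linalg.norm(model_a_embeddings, axis=1, keepdims=True)
    b = model_b_embeddings / np.linalg.norm(model_b_embeddings, axis=1, keepdims=True)
    return np.concatenate([a, b], axis=1)   # shape: (n_samples, 256 + 256)
```

The concatenated vectors can then be fed to the same cosine-similarity ranking shown earlier to produce the submission.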
Key learning:
I would like to thank the whole GDSC team for their work in making this challenge a great learning opportunity and Lisa Steiner for giving us the chance to bring our knowledge to a new field.
I hope you'll find here resources that you can use in other computer vision and deep learning projects.