Freesound Audio Tagging 2019
This is Eric BOUTEILLON's proposed solution for the Kaggle Freesound Audio Tagging 2019 competition and DCASE 2019 Task 2.
Indicators :+1: were added to sections containing major contributions from the author.
This repository presents a semi-supervised warm-up pipeline used to create an efficient audio tagging system, as well as a novel data augmentation technique for multi-label audio tagging, which the author named SpecMix.
These new techniques were applied to the audio tagging system we submitted to the Kaggle Freesound Audio Tagging 2019 challenge, carried out within the DCASE 2019 Task 2 challenge [3]. The purpose of this challenge is to predict the audio labels of every test clip using machine learning techniques trained on a small amount of reliable, manually labeled data and a larger quantity of noisy web audio data, in a multi-label audio tagging task with a large vocabulary.
The provided Jupyter notebooks reach an lwlrap of 0.738 on the public leaderboard, that is to say 12th position in this competition.
You can also find the weights resulting from CNN-model-1 and VGG-16 training in a public Kaggle dataset. Note that I no longer use git-lfs to store weights due to quota issues.
This competition required inference to be performed in a Kaggle kernel without changing its configuration. To be able to load locally generated CNN weights, it was therefore important to use the same versions of pytorch and fastai as the Kaggle kernel during the competition: pytorch 1.0.1 and fastai 1.0.51.
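As a quick sanity check before training, the pinned versions can be compared against what is installed. This is a minimal sketch, not part of the repository: `find_mismatches` and `installed_versions` are hypothetical helper names, and the lookup relies on `pkg_resources` from setuptools, which is assumed to be available in the environment.

```python
# Versions pinned by the competition kernel, as stated above.
EXPECTED = {"torch": "1.0.1", "fastai": "1.0.51"}


def find_mismatches(installed, expected=EXPECTED):
    """Return {package: (installed, expected)} for every pinned package that differs."""
    problems = {}
    for name, wanted in expected.items():
        found = installed.get(name)
        # Accept build suffixes such as "1.0.1.post2", but not "1.0.10".
        ok = found is not None and (found == wanted or found.startswith(wanted + "."))
        if not ok:
            problems[name] = (found, wanted)
    return problems


def installed_versions(names):
    """Look up installed distribution versions (None when a package is missing)."""
    import pkg_resources  # assumption: setuptools is installed

    versions = {}
    for name in names:
        try:
            versions[name] = pkg_resources.get_distribution(name).version
        except pkg_resources.DistributionNotFound:
            versions[name] = None
    return versions
```

Calling `find_mismatches(installed_versions(EXPECTED))` inside the activated environment should return an empty dictionary when both pins are satisfied.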
To get the same configuration as my local system, follow these steps, tested on GNU/Linux Ubuntu 18.04.2 LTS:
git clone https://github.com/ebouteillon/freesound-audio-tagging-2019.git
Install anaconda3
Type in a linux terminal:
conda create --name freesound --file spec-file.txt
You are ready to go!
Note: My configuration has CUDA 10 installed, so you may have to adapt the versions of pytorch and cudatoolkit in spec-file.txt to your own configuration.
This method does not guarantee the exact same configuration as the author's, as conda may install newer packages.
git clone https://github.com/ebouteillon/freesound-audio-tagging-2019.git
Install anaconda3
Type in a linux terminal:
conda update conda
conda create -n freesound python=3.7 anaconda
conda activate freesound
conda install numpy pandas scipy scikit-learn matplotlib tqdm seaborn pytorch==1.0.1 torchvision cudatoolkit=10.0 fastai==1.0.51 -c pytorch -c fastai
conda uninstall --force jpeg libtiff -y
conda install -c conda-forge libjpeg-turbo
CC="cc -mavx2" pip install --no-cache-dir -U --force-reinstall --no-binary :all: --compile pillow-simd
conda install -c conda-forge librosa
Notes:
During the competition I use the following:
Download dataset from Kaggle
(optional) Download my weights dataset from Kaggle
Unpack the dataset in the input folder so your environment looks like:
├── code
│ ├── inference-kernel.ipynb
│ ├── training-cnn-model1.ipynb
│ └── training-vgg16.ipynb
├── images
│ ├── all_augmentations.png
│ └── model-explained.png
├── input
│ ├── test
│ │ └── ...
│ ├── train_curated
│ │ └── ...
│ ├── train_noisy
│ │ └── ...
│ ├── sample_submission.csv
│ ├── train_curated.csv
│ ├── train_noisy.csv
│ └── keep.txt
├── LICENSE
├── README.md
├── requirements.txt
├── spec-file.txt
└── weights
    ├── cnn-model-1
    │   └── work
    │       ├── models
    │       │   └── keep.txt
    │       ├── stage-10_fold-0.pkl
    │       ├── ...
    │       └── stage-2_fold-9.pkl
    └── vgg16
        └── work
            ├── models
            │   └── keep.txt
            ├── stage-10_fold-0.pkl
            ├── ...
            └── stage-2_fold-9.pkl
conda activate freesound
jupyter notebook
Your web browser should open; then select the notebook you want to execute. Recommended order:
Enjoy!
Notes:
training-*.ipynb notebook to train one of the models. :smile:
A work folder and a preprocessed folder will be created; you may want to change their location: it is as easy as updating the variables WORK and PREPROCESSED.
models_list: I kept the paths used within the Kaggle kernel for the competition.
Audio clips were first trimmed of leading and trailing silence (threshold of 60 dB), then converted into 128-band mel-spectrograms using a 44.1 kHz sampling rate, a hop length of 347 samples between successive frames, 2560 FFT components, and frequencies kept in the range 20 Hz to 22,050 Hz. The last preprocessing step normalized the resulting images (mean=0, variance=1) and duplicated them to 3 channels.
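The preprocessing described above can be sketched with librosa and NumPy as follows. This is an illustrative sketch under assumptions, not the notebooks' code: the function names are hypothetical, the log (dB) scaling is an assumption since the text does not say how magnitudes are scaled, and normalization is done per image here although it could equally be done with dataset-wide statistics.

```python
import numpy as np

# Parameters taken from the description above.
SAMPLE_RATE = 44100
N_FFT = 2560
HOP_LENGTH = 347
N_MELS = 128
FMIN, FMAX = 20, 22050
TRIM_TOP_DB = 60


def clip_to_mel(path):
    """Load a clip, trim silence, and return a (128, n_frames) mel-spectrogram."""
    import librosa  # imported lazily so the helper below works without librosa

    y, _ = librosa.load(path, sr=SAMPLE_RATE)
    y, _ = librosa.effects.trim(y, top_db=TRIM_TOP_DB)
    mel = librosa.feature.melspectrogram(
        y=y, sr=SAMPLE_RATE, n_fft=N_FFT, hop_length=HOP_LENGTH,
        n_mels=N_MELS, fmin=FMIN, fmax=FMAX)
    # Assumption: log (dB) scaling of the power spectrogram.
    return librosa.power_to_db(mel)


def to_normalized_image(mel, eps=1e-6):
    """Standardize to zero mean and unit variance, then duplicate to 3 channels."""
    mel = np.asarray(mel, dtype=np.float32)
    mel = (mel - mel.mean()) / (mel.std() + eps)
    # Channels-first layout: (3, n_mels, n_frames).
    return np.stack([mel, mel, mel], axis=0)
```

A clip would then be prepared with `to_normalized_image(clip_to_mel("some_clip.wav"))`, where the file name is a placeholder.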
In this section, we describe the neural network architectures used:
Version 1 consists of an ensemble of a custom CNN, "CNN-model-1", defined in Table 1, and a VGG-16 with batch normalization. Both are trained in the same manner.
Version 2 consists of only our custom CNN "CNN-model-1", defined in Table 1.
Version 3 is evaluated for the Judge award and is the same model as version 2.
Input 128 × 128 × 3 |
---|
3 × 3 Conv(stride=1, pad=1)−64−BN−ReLU |
3 × 3 Conv(stride=1, pad=1)−64−BN−ReLU |
3 × 3 Conv(stride=1, pad=1)−128−BN−ReLU |
3 × 3 Conv(stride=1, pad=1)−128−BN−ReLU |
3 × 3 Conv(stride=1, pad=1)−256−BN−ReLU |
3 × 3 Conv(stride=1, pad=1)−256−BN−ReLU |
3 × 3 Conv(stride=1, pad=1)−512−BN−ReLU |
3 × 3 Conv(stride=1, pad=1)−512−BN−ReLU |
concat(AdaptiveAvgPool2d + AdaptiveMaxPool2d) |
Flatten−1024-BN-Dropout 25% |
Dense-512-ReLU-BN-Dropout 50% |
Dense-80 |
Table 1: CNN-model-1. BN: Batch Normalisation, ReLU: Rectified Linear Unit.
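Read literally, Table 1 translates into the following PyTorch sketch. It is an interpretation of the table, not the notebook's model definition: `build_cnn_model_1` and `ConcatPool` are hypothetical names, the adaptive pooling output size of 1 is assumed from the 1024 flattened features (512 + 512), and no downsampling is placed between convolutions because the table lists none, so the actual notebooks may differ.

```python
# (in_channels, out_channels) of the eight 3x3 convolutions listed in Table 1.
CONV_CHANNELS = [(3, 64), (64, 64), (64, 128), (128, 128),
                 (128, 256), (256, 256), (256, 512), (512, 512)]
NUM_CLASSES = 80


def build_cnn_model_1():
    """Build a network following Table 1 (PyTorch imported lazily)."""
    import torch
    from torch import nn

    class ConcatPool(nn.Module):
        """Concatenate global average and max pooling, then flatten."""

        def __init__(self):
            super().__init__()
            self.avg = nn.AdaptiveAvgPool2d(1)
            self.max = nn.AdaptiveMaxPool2d(1)

        def forward(self, x):
            pooled = torch.cat([self.avg(x), self.max(x)], dim=1)
            # view() is used instead of nn.Flatten, which pytorch 1.0.1 lacks.
            return pooled.view(pooled.size(0), -1)

    layers = []
    for c_in, c_out in CONV_CHANNELS:
        layers += [nn.Conv2d(c_in, c_out, kernel_size=3, stride=1, padding=1),
                   nn.BatchNorm2d(c_out),
                   nn.ReLU(inplace=True)]
    n_pooled = 2 * CONV_CHANNELS[-1][1]  # 512 avg + 512 max = 1024
    layers += [ConcatPool(),
               nn.BatchNorm1d(n_pooled), nn.Dropout(0.25),
               nn.Linear(n_pooled, 512), nn.ReLU(inplace=True),
               nn.BatchNorm1d(512), nn.Dropout(0.5),
               nn.Linear(512, NUM_CLASSES)]
    return nn.Sequential(*layers)
```

The final layer outputs 80 raw scores, one per label; for multi-label training these would typically be paired with a sigmoid-based loss, which is an assumption here rather than something the table states.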
One important technique to leverage a small training set is to augment this set using data augmentation. For this purpose we created a new augmentation named SpecMix. This new augmentation is an extension of SpecAugment [1] inspired by mixup [2].
SpecAugment applies 3 transformations to augment a training sample: time warping, frequency masking and time masking on mel-spectrograms.
mixup creates a virtual training example by computing a weighted average of the inputs and targets of two samples.
SpecMix is inspired by the two most effective transformations from SpecAugment and extends them to create virtual multi-label training examples:
Figure 1: Comparison of mixup, SpecAugment and SpecMix
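The two building blocks described above, SpecAugment-style masking and mixup, can be sketched in NumPy as follows. This illustrates the published techniques [1, 2] only; it is not the author's SpecMix implementation, and the function names, the zero fill value, and the parameter names `max_width` and `alpha` are assumptions made for the example.

```python
import numpy as np


def mask_band(spec, axis, max_width, rng):
    """Zero one random contiguous band along `axis` (0 = frequency, 1 = time)."""
    out = spec.copy()
    width = int(rng.integers(0, max_width + 1))
    start = int(rng.integers(0, spec.shape[axis] - width + 1))
    index = [slice(None)] * spec.ndim
    index[axis] = slice(start, start + width)
    out[tuple(index)] = 0.0
    return out


def mixup(x1, y1, x2, y2, alpha, rng):
    """Blend two samples and their targets with a Beta(alpha, alpha) weight."""
    lam = rng.beta(alpha, alpha)
    x = lam * x1 + (1.0 - lam) * x2
    y = lam * y1 + (1.0 - lam) * y2
    return x, y
```

With multi-hot targets, the mixed target `y` carries a fractional weight for every label present in either sample, which is what makes the virtual example multi-label.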
We added other data augmentation techniques:
At training time, we feed the network batches of 128 augmented excerpts taken from randomly selected mel-spectrograms. We use a 10-fold cross-validation setup and the fastai library [4].
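The excerpt selection and the fold split can be sketched in plain NumPy. This is a sketch under assumptions, independent of fastai: the function names are hypothetical, the excerpt width of 128 frames is taken from the network input size in Table 1, zero-padding of short clips is an assumption, and the split is a plain shuffled k-fold since the text does not say whether folds were stratified.

```python
import numpy as np


def random_crop(mel, width=128, rng=None):
    """Return a random (n_mels, width) excerpt; short clips are zero-padded on the right."""
    rng = np.random.default_rng() if rng is None else rng
    n_frames = mel.shape[1]
    if n_frames < width:
        return np.pad(mel, ((0, 0), (0, width - n_frames)), mode="constant")
    start = int(rng.integers(0, n_frames - width + 1))
    return mel[:, start:start + width]


def kfold_indices(n_samples, n_folds=10, seed=0):
    """Yield (train_idx, valid_idx) pairs for a shuffled k-fold split."""
    order = np.random.default_rng(seed).permutation(n_samples)
    folds = np.array_split(order, n_folds)
    for k in range(n_folds):
        train = np.concatenate([f for i, f in enumerate(folds) if i != k])
        yield train, folds[k]
```

Each of the 10 folds serves once as the validation set while the remaining nine are used for training.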
Training is done in 4 stages, each stage generating a model which is used for 3 things:
An important point of this competition is that we were not allowed to use external data or pretrained models. So the pipeline presented below only uses the curated and noisy sets from the competition:
Figure 2: warm-up pipeline
For inference, we split the test audio clips into overlapping windows of 128 time samples (2 seconds). These windows are then fed into our models to obtain predictions. All predictions linked to an audio clip are averaged to get the final predictions to submit.
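The windowing and averaging described above can be sketched as follows. It is a minimal sketch under assumptions: the hop of 64 frames between windows is illustrative (the text only says the windows overlap), short clips are zero-padded, and `predict_fn` is a hypothetical stand-in for a trained model that maps a batch of windows to per-label scores.

```python
import numpy as np


def sliding_windows(mel, width=128, hop=64):
    """Cut a (n_mels, n_frames) spectrogram into overlapping (n_mels, width) windows."""
    n_frames = mel.shape[1]
    if n_frames <= width:
        padded = np.pad(mel, ((0, 0), (0, width - n_frames)), mode="constant")
        return padded[None]
    starts = list(range(0, n_frames - width + 1, hop))
    if starts[-1] != n_frames - width:
        starts.append(n_frames - width)  # make sure the end of the clip is covered
    return np.stack([mel[:, s:s + width] for s in starts])


def predict_clip(mel, predict_fn, width=128, hop=64):
    """Average the per-window predictions of `predict_fn` over one clip."""
    windows = sliding_windows(mel, width, hop)
    return predict_fn(windows).mean(axis=0)
```

Averaging predictions from several models for the same clip works the same way, by taking the mean of the per-model outputs.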
This competition had major constraints for test prediction inference: submissions must be made through a Kaggle kernel with time constraints. As our solution requires a GPU, inference on the whole unseen test set must be done in less than an hour.
To meet this hard constraint, we made the following decisions:
To assess the performance of our system, we provide results in Table 2. Performance on the noisy and curated sets was evaluated with 10-fold cross-validation. Performance on the test set is the value reported by the public leaderboard. The metric used is lwlrap (label-weighted label-ranking average precision).
Model | lwlrap noisy | lwlrap curated | leaderboard |
---|---|---|---|
model1 | 0.65057 | 0.41096 | N/A |
model2 | 0.38142 | 0.86222 | 0.723 |
model3 | 0.56716 | 0.87930 | 0.724 |
model4 | 0.57590 | 0.87718 | 0.724 |
ensemble | N/A | N/A | 0.733 |
Table 2: Empirical results of CNN-model-1 using proposed warm-up pipeline
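For readers unfamiliar with lwlrap, the metric reported in Table 2, here is a compact NumPy sketch. It is written from the metric's definition rather than copied from the competition's reference code: for every true label of every clip, it computes the precision at that label's rank, then averages over all true (clip, label) pairs, which is equivalent to weighting each class by its number of positive examples. The function name and argument names are choices made for this example.

```python
import numpy as np


def lwlrap(truth, scores):
    """Label-weighted label-ranking average precision.

    truth:  (n_samples, n_classes) binary matrix of true labels.
    scores: (n_samples, n_classes) matrix of predicted scores.
    """
    truth = np.asarray(truth, dtype=bool)
    scores = np.asarray(scores, dtype=float)
    precisions = []
    for t, s in zip(truth, scores):
        if not t.any():
            continue  # clips without any true label do not contribute
        order = np.argsort(-s, kind="stable")  # highest score first
        hits = t[order]
        ranks = np.arange(1, len(s) + 1)
        cum_hits = np.cumsum(hits)
        # Precision at the rank of each true label.
        precisions.extend(cum_hits[hits] / ranks[hits])
    if not precisions:
        return 0.0
    return float(np.mean(precisions))
```

A score of 1.0 means every true label is ranked above every false label for every clip.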
Each stage of the warm-up pipeline generates a model with excellent prediction performance on the test set. As one can see in Figure 3, each model would give us a silver medal with the 25th position on the public leaderboard. Moreover, these warm-up models bring sufficient diversity on their own, as a simple averaging of their predictions (lwlrap 0.733) gives 16th position on the public leaderboard.
The author's final 12th position was obtained with version 1, which is an average of the predictions given by CNN-model-1 and VGG-16, both trained the same way.
Figure 3: Public leaderboard
This git repository presents a semi-supervised warm-up pipeline used to create an efficient audio tagging system, as well as a novel data augmentation technique for multi-label audio tagging, which the author named SpecMix. These techniques leveraged both clean and noisy sets and were shown to give excellent results.
These results are reproducible: the description of requirements, the steps to reproduce, and the source code are available on GitHub. The source code is released under an open source license (MIT).
These results were possible thanks to the infinite support of my 5-year-old boy, who said while I was watching the public leaderboard: “Dad, you are the best and you will be at the very top”. ❤️
I also thank the whole Kaggle community for sharing knowledge, ideas and code, in particular daisuke for his kernels during the competition and mhiro2 for his simple CNN model, as well as all the competition organizers.
[1] Daniel S. Park, William Chan, Yu Zhang, Chung-Cheng Chiu, Barret Zoph, Ekin D. Cubuk, Quoc V. Le, "SpecAugment: A Simple Data Augmentation Method for Automatic Speech Recognition", arXiv:1904.08779, 2019.
[2] Hongyi Zhang, Moustapha Cisse, Yann N. Dauphin, and David Lopez-Paz. "mixup: Beyond empirical risk minimization". arXiv preprint arXiv:1710.09412, 2017.
[3] Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, and Xavier Serra. "Audio tagging with noisy labels and minimal supervision". Submitted to DCASE2019 Workshop, 2019. URL: https://arxiv.org/abs/1906.02975
[4] fastai, Howard, Jeremy and others, 2018, URL: https://github.com/fastai/fastai