2018 Spotify ACM RecSys Challenge 2nd Place Solution
Contact: [email protected]
This repository contains the TensorFlow v1 implementation of our entry for the main track. We propose MMCF, which consists of two components: (1) context-aware autoencoders that use both the playlist and its categorical contents, and (2) character-level convolutional neural networks that learn the latent relationship between playlists and their titles.
If you are interested in building your research on this work, please cite:
@inproceedings{mmcf18,
author = {Yang, Hojin and Jeong, Yoonki and Choi, Minjin and Lee, Jongwuk},
year = {2018},
month = {10},
pages = {1--6},
title = {MMCF: Multimodal Collaborative Filtering for Automatic Playlist Continuation},
isbn = {978-1-4503-6586-4},
booktitle = {RecSys Challenge '18: Proceedings of the ACM Recommender Systems Challenge 2018},
doi = {10.1145/3267471.3267482}
}
Spotify has produced the MPD (Million Playlist Dataset), which contains a million user-curated playlists. Each playlist in the MPD contains a playlist title, a list of tracks (with metadata), and other miscellaneous information.
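For orientation, here is a minimal sketch of reading one MPD slice in Python. It assumes the slice layout distributed with the dataset (a top-level `playlists` list whose entries carry `name` and `tracks`, each track having `track_uri` and `artist_uri`). The helper name `iter_playlists` and the tiny example slice are hypothetical and are not part of this repository.

```python
import json

def iter_playlists(slice_text):
    """Yield (title, track_uris, artist_uris) for each playlist in one slice.

    Hypothetical helper: the key names follow the MPD slice layout,
    not any function in data_generator.py.
    """
    data = json.loads(slice_text)
    for pl in data.get("playlists", []):
        tracks = pl.get("tracks", [])
        yield (
            pl.get("name", ""),
            [t["track_uri"] for t in tracks],
            [t["artist_uri"] for t in tracks],
        )

# Tiny made-up slice, for illustration only.
example_slice = json.dumps({
    "playlists": [
        {"name": "road trip", "tracks": [
            {"track_uri": "spotify:track:a", "artist_uri": "spotify:artist:x"},
            {"track_uri": "spotify:track:b", "artist_uri": "spotify:artist:y"},
        ]},
    ]
})
parsed = list(iter_playlists(example_slice))
```

The title and the track and artist lists are the three pieces of information the model description above relies on.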
Proceed with these steps to convert the MPD’s data format into our system's.
--datadir
: Directory where the converted datasets (training, test, challenge) will be stored. default: ./data
--mpd_tr
: Directory which contains the MPD-slice json files used for training the model. default: ./mpd_train
--mpd_te
: Directory which contains the MPD-slice json files used for testing the model. default: ./mpd_test
--mpd_ch
: Directory which contains the challenge set json file. default: ./challenge
--mincount_trk
: The minimum number of occurrences of tracks in the train data. default: 5
--mincount_art
: The minimum number of occurrences of artists in the train data. default: 3
--divide_ch
: A list where each element is a range of challenge seed numbers. default: 0-1,5,10-25,10-25r
python data_generator.py --datadir ./data --mpd_tr ./mpd_train --mpd_te ./mpd_test --mpd_ch ./challenge
You can set the minimum number of occurrences of tracks and artists in the training set manually. When you run the following command, tracks with fewer than three occurrences are removed:
python data_generator.py --mincount_trk 3
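As a rough illustration of what such a minimum-count filter does, the sketch below counts how often each track appears across all training playlists and drops those below the threshold. The function name `filter_by_mincount` is hypothetical, and counting every occurrence (rather than, say, distinct playlists) is an assumption; data_generator.py may count differently.

```python
from collections import Counter

def filter_by_mincount(playlists, mincount):
    """Drop items that occur fewer than `mincount` times over all playlists.

    Assumption: every occurrence counts, including repeats inside one playlist.
    """
    counts = Counter(item for pl in playlists for item in pl)
    return [[item for item in pl if counts[item] >= mincount] for pl in playlists]

# Made-up playlists of track ids, for illustration only.
playlists = [["t1", "t2", "t3"], ["t1", "t2"], ["t1", "t4"]]
filtered = filter_by_mincount(playlists, mincount=2)
```

Here `t3` and `t4` each occur once, so they are removed, while `t1` and `t2` are kept.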
Each test file contains the same seed patterns as the Spotify RecSys Challenge: seed 0, 1, 5, 10, 25, 100, 25r, 100r.
By default, we also divide the challenge set into four categories based on seed pattern: (0,1), (5), (10,25,100), (25r,100r)
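One possible reading of a seed-range specification such as the --divide_ch default is sketched below: each comma-separated part is a single seed count or a `low-high` range, and a trailing `r` marks randomly ordered seeds. The parser name `parse_divide_ch` and the tuple layout are hypothetical; this interpretation is an assumption and may not match how data_generator.py parses the option.

```python
def parse_divide_ch(spec):
    """Parse '0-1,5,10-25,10-25r' into (low, high, is_random) tuples.

    Hypothetical helper illustrating one interpretation of the option.
    """
    categories = []
    for part in spec.split(","):
        part = part.strip()
        is_random = part.endswith("r")
        if is_random:
            part = part[:-1]
        # A single number such as '5' is treated as the range 5-5.
        low, _, high = part.partition("-")
        categories.append((int(low), int(high or low), is_random))
    return categories

categories = parse_divide_ch("0-1,5,10-25,10-25r")
```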
For submission, we trained our models with four different denoising schemes. Each scheme performs better on one of the four challenge categories.
Our model is composed of two parts: denoising autoencoders (DAE) and a character-level CNN. Train the parameters of the DAE first, then integrate it with the char-level CNN.
--dir
: Directory name which contains the config file.
--pretrain
: Pretrain DAE parameters if specified.
--dae
: Train DAE parameters if specified.
--title
: Train parameters of the title module if specified.
--challenge
: Generate challenge submission candidates if specified.
--testmode
: Get the results without training the model if specified.
python main.py --dir sample --pretrain
Run main.py in DAE mode after the loss has converged in pretrain mode. If you set the pretrain file name in the config.ini file, the following command will use the pretrained parameters saved in the folder you created (./sample). You can also train the DAE without initial values, depending on the config.ini setting:
python main.py --dir sample --dae
After you run the DAE, its parameters are saved in pickle format in ./sample.
python main.py --dir sample --title
python main.py --dir sample --challenge
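The flags used in these commands could be wired up roughly as follows. This is a sketch with Python's standard `argparse`, not the actual argument handling of main.py; treating --pretrain as one of the mutually exclusive modes and --testmode as an independent switch is an assumption.

```python
import argparse

def build_parser():
    """Hypothetical front end mirroring the documented main.py flags."""
    parser = argparse.ArgumentParser(description="hypothetical main.py front end")
    parser.add_argument("--dir", required=True,
                        help="directory name which contains the config file")
    # Exactly one training/generation mode per run (assumed behaviour).
    mode = parser.add_mutually_exclusive_group(required=True)
    for flag in ("pretrain", "dae", "title", "challenge"):
        mode.add_argument("--" + flag, action="store_true")
    # --testmode combines with a mode, e.g. --dae --testmode.
    parser.add_argument("--testmode", action="store_true",
                        help="get the results without training the model")
    return parser

args = build_parser().parse_args(["--dir", "sample", "--dae", "--testmode"])
```

With this layout, passing two modes at once (for example --dae --title) makes `argparse` exit with an error instead of silently picking one.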
[note]
For all models, parameters are updated if the average of the update_seed r-precision score(s) increases. Our system calculates the r-precision score every epoch.
You must specify only one mode (dae, title, challenge) when you set the arguments of main.py.
You can easily replace the parameter pickle files (for DAE) and/or the ckpt graph file (for title) with those from other directories, if both have the same number of tracks & artists and the same CNN filter shapes.
If you just want to check the metric scores after replacing the parameters with another directory's, using --testmode is efficient:
# after replacing DAE pickle file from another folder #
python main.py --dir sample --dae --testmode
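The update rule from the note above (keep the new parameters only when the average r-precision over the update seeds increases) can be sketched as follows. This uses a plain track-level r-precision, which is a simplification; the function names `r_precision` and `should_update` and the example scores are hypothetical, and the repository's own metric code may differ.

```python
def r_precision(recommended, ground_truth):
    """Fraction of held-out tracks found in the top-|ground_truth| recommendations."""
    truth = set(ground_truth)
    if not truth:
        return 0.0
    top = recommended[:len(truth)]
    return len(truth.intersection(top)) / len(truth)

def should_update(scores_by_seed, update_seeds, best_so_far):
    """Return (update?, average score) for the seeds that drive the update."""
    avg = sum(scores_by_seed[s] for s in update_seeds) / len(update_seeds)
    return avg > best_so_far, avg

# Made-up values, for illustration only.
score = r_precision(["a", "b", "c", "d"], ["a", "c", "x"])
decision = should_update({"25r": 0.2, "100r": 0.3}, ["100r"], best_so_far=0.25)
```

In the example, two of the three held-out tracks appear in the top three recommendations, and only the `100r` score is compared against the best value seen so far.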
We already set the initial setting: we created 4 different directories (0to1_inorder, 5_inorder, 10to100_inorder, 25to100_random) and set the config file of each directory. This setup assumes --divide_ch of data_generator.py is set to 0-1,5,10-25,10-25r(andom).

directory | challenge category | firstN_range | input denoising | pretrain only |
---|---|---|---|---|
0to1_inorder | challenge_inorder_0to1 | 0, 0.3 | 0.75 | True |
5_inorder | challenge_inorder_5 | 1, 50 | 0.75 | False |
10to100_inorder | challenge_inorder_10to100 | 0.3, 0.6 | 0.75 | False |
25to100_random | challenge_random_25to100 | -1 | 0.5, 0.8 | False |
In summary, run the following commands one line at a time:
# 997 mpd.slice on ./mpd_train, 3 mpd.slice on ./mpd_test, challenge set on ./challenge #
python data_generator.py
python main.py --dir 0to1_inorder --pretrain
python main.py --dir 0to1_inorder --title
python main.py --dir 0to1_inorder --challenge
# copy 0to1_inorder/graph to 5_inorder #
python main.py --dir 5_inorder --pretrain
python main.py --dir 5_inorder --dae
python main.py --dir 5_inorder --challenge
# copy 0to1_inorder/graph to 10to100_inorder #
python main.py --dir 10to100_inorder --pretrain
python main.py --dir 10to100_inorder --dae
python main.py --dir 10to100_inorder --challenge
# copy 0to1_inorder/graph to 25to100_random #
python main.py --dir 25to100_random --pretrain
python main.py --dir 25to100_random --dae
python main.py --dir 25to100_random --challenge
python merge_results.py
[Note]
[BASE]
verbose - boolean. print log on console if True.
data_dir - string. Directory of data that system will read.
The directory contains one training json file and multiple types of test json files.
challenge_dir - string. Directory where final results are saved.
testsize - int. The maximum number of test playlists in each test case.
[DAE]
epochs - int. Number of training epochs.
batch - int. batch size.
lr - float. learning rate.
reg_lamdba - float. regularization constant.
hidden - int. DAE hidden layer size.
test_seed - comma separated int(or int+’r’) list. Seed numbers for which the test is run after each epoch.
test_seed = 1,5,10 means the system runs the test after each epoch by reading the test-1, test-5, test-10 json files in the directory set in fold_dir.
update_seed - comma separated int(or int+’r’) list. Seed numbers that are considered when updating parameters. update_seed must be a subset of test_seed.
test_seed = 25r,100r , update_seed = 100r means the system runs the test after each epoch by reading the test-25r, test-100r json files, creates a log, and updates the parameters if test-100r’s r-precision value increases.
keep_prob - float(0.0<x<=1.0). Dropout keep probability in the hidden layer.
keep_prob = 0.75 means dropping out 25% of the input for every batch.
input_kp - comma separated float list(0.0<x<=1.0). Denoising keep probability range in the input layer.
input_kp = 0.5, 0.8 means the keep probability is randomly selected between 0.5 and 0.8, i.e. between 50% and 20% of the input is dropped.
firstN_range - comma separated float or int list. The range from which a random number n is drawn when the tracks from the 0th to the n-th track of a playlist are set as the input value.
You can set it up in three different ways.
firstN_range = -1 means all the songs in the playlist are considered as the input value.
firstN_range = float a , float b means the input track range is set from the 0-th to random(a*N, b*N). (N is the length of the playlist)
firstN_range = int a , int b means the input track range is set from the 0-th to random(a, b).
ex)
firstN_range - -1 : 0~N
firstN_range - 0,50 : 0~random(0,50)
firstN_range - 0.3,0.6 : 0~random(N*0.3, N*0.6)
initval - string. Name of pickle file which contains pretrained parameters. Set NULL if no initial value.
save - string. Name of pickle file to store the updated parameters.
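To make the interplay of firstN_range and input_kp concrete, here is a minimal sketch of how one training input could be drawn from a playlist. The function name `choose_input_tracks` is hypothetical, and applying the firstN cut before the denoising step is an assumption; the actual order and sampling in the repository may differ.

```python
import random

def choose_input_tracks(tracks, firstN_range, input_kp, rng=random):
    """Pick a prefix per firstN_range, then keep each track with a random keep probability.

    firstN_range: (-1,) for the whole playlist, two floats for fractions of
    the playlist length N, or two ints for absolute track counts.
    input_kp: one value or a (low, high) range of keep probabilities.
    """
    n_total = len(tracks)
    if len(firstN_range) == 1 and firstN_range[0] == -1:
        prefix = list(tracks)
    else:
        a, b = firstN_range
        if isinstance(a, float) or isinstance(b, float):
            # Fractions of the playlist length, e.g. 0.3, 0.6 (or 0, 0.3).
            low, high = int(a * n_total), int(b * n_total)
        else:
            low, high = a, b
        n = rng.randint(low, high)  # inclusive on both ends
        prefix = list(tracks[:n])
    keep = rng.uniform(min(input_kp), max(input_kp))
    return [t for t in prefix if rng.random() < keep]

# Made-up playlist of ten track ids, for illustration only.
playlist = ["t%d" % i for i in range(10)]
noisy_input = choose_input_tracks(playlist, (0.3, 0.6), (0.5, 0.8), random.Random(0))
```

With firstN_range = 0.3, 0.6 the prefix length is drawn between 3 and 6 tracks here, and with input_kp = 0.5, 0.8 each remaining track survives with a probability drawn once per call from that range.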
[PRETRAIN]
epochs - int. Number of training epochs.
batch - int. batch size.
lr - float. learning rate.
reg_lamdba - float. regularization constant.
save - string. Name of pickle file to store the updated parameters.
[TITLE]
epochs - int. Number of training epochs.
batch - int. batch size.
lr - float. learning rate.
keep_prob - float(0.0<x<=1.0). Dropout keep probability in the DAE hidden layer.
input_kp - comma separated float list(0.0<x<=1.0). Denoising keep probability range in the input layer.
title_kp - float(0.0<x<=1.0). Dropout keep probability in the title model hidden layer.
test_seed - comma separated int(or int+’r’) list. Seed numbers for which the test is run after each epoch.
update_seed - comma separated int(or int+’r’) list. Seed numbers that are considered when updating parameters.
char_model - Char_CNN or Char_RNN
rnn_hidden - int. Set this one if char_model is Char_RNN. RNN hidden size.
filter_num - int. Set this one if char_model is Char_CNN. Number of CNN filters.
filter_size - comma separated int list. Set this one if char_model is Char_CNN. Size of CNN filters.
char_emb - int. Character embedding size. One-hot if the value is 0.
DAEval - string. Name of pickle file where the parameters of the DAE are saved.
save - string. Name of checkpoint file which saves updated tensor graph.
[CHALLENGE]
batch - int. batch size.
challenge_data - string. Name of the challenge file in data_dir whose format is modified to fit our system.
result - string. Name of pickle file to save the result.
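As an illustration of the option formats described above, the sketch below reads a small [DAE] fragment with Python's standard `configparser` and converts the comma-separated values. The fragment only reuses example values quoted in this README and is not a complete config.ini; the helper `parse_number_list` is hypothetical, and main.py may read its config differently.

```python
import configparser

# Illustrative fragment only, built from example values in this README.
EXAMPLE_INI = """
[DAE]
keep_prob = 0.75
input_kp = 0.5, 0.8
firstN_range = -1
test_seed = 25r,100r
update_seed = 100r
"""

def parse_number_list(raw):
    """Turn '0.5, 0.8' into [0.5, 0.8] and '-1' into [-1]."""
    values = []
    for token in raw.split(","):
        token = token.strip()
        values.append(float(token) if "." in token else int(token))
    return values

config = configparser.ConfigParser()
config.read_string(EXAMPLE_INI)
dae = config["DAE"]

keep_prob = dae.getfloat("keep_prob")
input_kp = parse_number_list(dae["input_kp"])
firstN_range = parse_number_list(dae["firstN_range"])
# Seeds stay strings because values such as '100r' carry the 'r' suffix.
test_seed = [s.strip() for s in dae["test_seed"].split(",")]
update_seed = [s.strip() for s in dae["update_seed"].split(",")]

# update_seed must be a subset of test_seed.
assert set(update_seed) <= set(test_seed)
```

Note that `configparser` lowercases option names by default, so `firstN_range` is looked up case-insensitively here.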