Generating lyrics with a recurrent neural network
This is a small experiment in generating lyrics with a recurrent neural network, trained with Keras and Tensorflow 2.
It works in the browser with Tensorflow.js! Try it here.
The model can be trained at either word level or character level, each of which has its own pros and cons.
A few pre-trained models can be found here.
Requires Python 3.7+.
pip install -r requirements.txt
The requirements file has been trimmed down, so if any of the scripts fail because of a missing package, just install it :-)
Put the songdata.csv file in a data sub-directory, or point to a different file with the --songdata-file parameter when training.
Put the glove.6B.50d.txt file in a data sub-directory.
The code expects an input dataset to be stored at data/songdata.csv by default (this can be changed in config.py or via the CLI parameter --songdata-file).
The file should be in CSV format with the following columns (case sensitive):
artist
text
You can have any number of other columns; they will just be ignored.
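As a quick sanity check before training, the column requirement above can be verified with a few lines of Python. This is a minimal sketch, not part of the project: the helper name missing_columns is hypothetical, and only the column names artist and text come from this README.

```python
import csv
import io

# Column names taken from this README; the comparison is case sensitive.
REQUIRED_COLUMNS = {"artist", "text"}


def missing_columns(csv_file):
    """Return the required columns absent from the CSV header (empty set if none)."""
    reader = csv.DictReader(csv_file)
    header = set(reader.fieldnames or [])
    return REQUIRED_COLUMNS - header


# Extra columns such as "song" are fine; only the two required ones are checked.
sample = io.StringIO("artist,song,text\nABBA,Example,some lyrics here\n")
print(missing_columns(sample))
```

For a real file, pass an open file handle instead, e.g. open("data/songdata.csv", newline="", encoding="utf-8").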
A sample dataset with a simple text is provided in sample.csv. To test that things are working, you can train using that file:
python -m lyrics.train --songdata-file sample.csv --early-stopping-patience 50 --artists '*'
Put the billboardHot100_1999-2019.csv file from the Data on Songs from Billboard 1999-2019 dataset in the data/ folder and run the python scripts/billboard.py script, which will prepare the file for training.
Optionally, pip install fasttext to detect language. If it's not installed, language is not detected.
If you have the songdata.csv file from above, you can simply create the word2vec vectors like this:
python -m lyrics.embedding --name-suffix _myembedding
This will create word2vec_myembedding.model and word2vec_myembedding.txt files in the default data directory data/. Use -h to see other options, such as artists and a custom songdata file.
python -m lyrics.train -h
This command takes care of all the training by default. Warning: it takes a very long time on a normal CPU!
Check -h for options. For example, if you want to use a different embedding than the GloVe embedding:
python -m lyrics.train --embedding-file ./embeddings.txt
The embeddings are still assumed to be 50-dimensional.
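Before pointing --embedding-file at a custom file, it can help to confirm that every vector really has 50 values. The sketch below assumes the usual GloVe-style text layout (one word followed by space-separated floats per line) and also skips the "count dimension" header line that word2vec text files commonly start with. The function name load_embeddings is hypothetical and is not the project's own loader.

```python
import io

EXPECTED_DIM = 50  # this README states the embeddings are assumed to be 50-dimensional


def load_embeddings(lines, expected_dim=EXPECTED_DIM):
    """Parse 'word v1 v2 ...' lines into a dict, rejecting vectors of the wrong size."""
    vectors = {}
    for line_no, line in enumerate(lines, start=1):
        parts = line.rstrip().split(" ")
        if len(parts) < 2:
            continue  # skip blank lines
        # word2vec text files often begin with a "<vocab size> <dimension>" header
        if line_no == 1 and len(parts) == 2 and all(p.isdigit() for p in parts):
            continue
        word, values = parts[0], parts[1:]
        if len(values) != expected_dim:
            raise ValueError(
                f"line {line_no}: expected {expected_dim} values, got {len(values)}"
            )
        vectors[word] = [float(v) for v in values]
    return vectors


demo = io.StringIO("fire " + " ".join(["0.1"] * 50) + "\n")
print(len(load_embeddings(demo)["fire"]))
```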
The output model and tokenizer are stored in a timestamped folder like export/2020-01-01T010203 by default.
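The folder name in the example above looks like a date and time without separators in the time part. A minimal sketch of producing such a name follows; the format string is inferred from that single example, and the helper export_dir is hypothetical rather than the project's actual code.

```python
from datetime import datetime
from pathlib import Path


def export_dir(base="export", now=None):
    """Build a timestamped export path such as export/2020-01-01T010203."""
    now = now or datetime.now()
    # Format inferred from the README example: date, 'T', then HHMMSS without colons.
    return Path(base) / now.strftime("%Y-%m-%dT%H%M%S")


print(export_dir(now=datetime(2020, 1, 1, 1, 2, 3)))
```

Because the names sort lexicographically in time order, the most recent run is simply the last entry in a sorted directory listing.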
Note: During experimentation, I found that raising the batch size to something like 2048 speeds up processing, but whether this is feasible depends on your hardware resources.
I have found it easier to train on GPU by using Docker and nvidia-docker, rather than trying to install CUDA myself. To do this, first make sure you have nvidia-docker set up correctly, and then:
docker build -t lyrics-gpu .
docker run --rm -it --gpus all -v $PWD:/tf/src -u $(id -u):$(id -g) lyrics-gpu bash
Then run the normal commands from there, e.g. python -m lyrics.train.
Tip: You might want to use the --gpu-speedup parameter! Just note that this will disable Tensorflow.js compatibility, regardless of whether you have set the --tfjs-compatible flag.
Tip: If you get a cryptic Tensorflow error like errors_impl.CancelledError: [_Derived_]RecvAsync is cancelled. while training on GPU, try prefixing the train command with TF_FORCE_GPU_ALLOW_GROWTH=true, e.g.:
TF_FORCE_GPU_ALLOW_GROWTH=true python -m lyrics.train --transform-words --num-lines-to-include=10 --artists '*' --gpu-speedup
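If training is launched from a Python script rather than a shell, the same variable can be passed through the child process environment. The sketch below is only an illustration: the helper with_gpu_growth is hypothetical, and the commented subprocess call assumes you start training via python -m lyrics.train as shown above.

```python
import os


def with_gpu_growth(env=None):
    """Return a copy of the environment with TF_FORCE_GPU_ALLOW_GROWTH enabled."""
    new_env = dict(os.environ if env is None else env)
    new_env["TF_FORCE_GPU_ALLOW_GROWTH"] = "true"
    return new_env


# Possible usage (not executed here):
#   import subprocess, sys
#   subprocess.run([sys.executable, "-m", "lyrics.train"], env=with_gpu_growth())
print(with_gpu_growth({})["TF_FORCE_GPU_ALLOW_GROWTH"])
```

The variable needs to be in place before Tensorflow initializes the GPU, which is why it is set on the environment of the process that will run the training.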
To use the Universal Sentence Encoder or BERT architecture, use the --transformer-network parameter:
python -m lyrics.train --transformer-network [use|bert]
Note: These models do not currently work in Tensorflow.js, so they should only be used from the command line.
Note: I have not been able to get any results with BERT. It is only included for illustration purposes.
In the default training mode, the model predicts the next word, given a sequence of words. Changing the model to predict the next character can be done using the --char-level flag.
python -m lyrics.train --char-level
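To make the difference between the two modes concrete, here is a small sketch of how next-token training pairs could be built from a line of text at word level and at character level. The helper next_token_pairs and the window length of 3 are illustrative assumptions; the project's real preprocessing may differ.

```python
def next_token_pairs(tokens, seq_length):
    """Slide a window over tokens, returning (input sequence, next token) pairs."""
    return [
        (tokens[i:i + seq_length], tokens[i + seq_length])
        for i in range(len(tokens) - seq_length)
    ]


line = "you are my fire"

# Word level: tokens are whole words, so this short line yields a single pair.
word_pairs = next_token_pairs(line.split(" "), seq_length=3)

# Character level: tokens are single characters, so there are many more pairs.
char_pairs = next_token_pairs(list(line), seq_length=3)

print(word_pairs)
print(len(char_pairs), char_pairs[:2])
```

Word level gives a large vocabulary with few, meaningful steps per line, while character level gives a tiny vocabulary but many more steps for the same text.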
python -m cli lyrics model.h5 tokenizer.pickle
Try python -m cli lyrics -h to find out more. For example, the --randomness and --text options are worth trying.
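This README does not say how --randomness is implemented. A common way to expose such a knob in text generators is temperature sampling, so the sketch below shows that general technique under that assumption; it is not necessarily what this CLI does, and the function names are hypothetical.

```python
import math
import random


def apply_temperature(probs, temperature):
    """Rescale a probability distribution; higher temperature flattens it."""
    # Work in log space and clamp zeros so math.log never fails.
    logits = [math.log(max(p, 1e-12)) / temperature for p in probs]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(value - peak) for value in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_index(probs, temperature, rng=random):
    """Draw one index from the temperature-adjusted distribution."""
    scaled = apply_temperature(probs, temperature)
    return rng.choices(range(len(scaled)), weights=scaled, k=1)[0]


next_word_probs = [0.6, 0.3, 0.1]
print(apply_temperature(next_word_probs, 0.5))  # sharper: favors the top word more
print(apply_temperature(next_word_probs, 2.0))  # flatter: more surprising choices
print(sample_index(next_word_probs, 1.0, random.Random(0)))
```

With this technique, low values make the output repetitive but safe, and high values make it more varied at the cost of coherence.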
If you want to add newlines to the seed text via --text, you need to add a space on each side. For example, this works in Bash:
--text $'you are my fire \n the one desire'
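One plausible reason for the spaces: if the text is split into tokens on spaces only (an assumption about the tokenizer, not something this README states), a newline that touches its neighbours is glued to them instead of becoming its own token. A minimal sketch with a hypothetical split_on_spaces helper:

```python
def split_on_spaces(text):
    """Split on single spaces only, dropping empty tokens."""
    return [token for token in text.split(" ") if token]


# Without surrounding spaces the newline is fused into one odd token.
print(split_on_spaces("you are my fire\nthe one desire"))

# With a space on each side the newline becomes a separate token.
print(split_on_spaces("you are my fire \n the one desire"))
```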
Note: Make sure to use the --tfjs-compatible flag during training!
python -m cli export model.h5 tokenizer.pickle
This creates a sub-directory export/js with the relevant files (can be used for the app).
Note: Make sure to use the --tfjs-compatible flag during training!
The lyrics-tfjs sub-directory has a simple web page that can be used to create lyrics in the browser. The code expects data to be found in a data/ sub-directory. This includes the words.json file, model.json and any extra files generated by the Tensorflow export.
Demo.
Make sure to get all dependencies:
pip install -r requirements_dev.txt
python -m pytest --cov=lyrics tests/