MIDI / symbolic music tokenizers for Deep Learning models 🎶
Thanks to @Kapitan11, who spotted bugs when decoding tokens given as ids / integers (#59), this update brings a few fixes alongside tests ensuring that the input / output (i/o) formats of the tokenizers are handled correctly in every case. The documentation has also been updated on this subject, which was unclear until now.
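To picture the kind of i/o handling these fixes cover, here is a small hypothetical helper (a sketch, not MidiTok's actual code) that normalizes the different id input formats a decoding method may receive:

```python
from typing import Any, List

def normalize_ids(ids: Any) -> List[List[int]]:
    """Normalize token ids into a list of sequences of integers.

    Accepts a flat list of ids ([1, 2, 3]), a list of sequences
    ([[1, 2], [3, 4]]), or objects exposing .tolist() such as NumPy
    arrays or PyTorch tensors. Hypothetical sketch only.
    """
    if hasattr(ids, "tolist"):  # tensor / ndarray --> nested Python lists
        ids = ids.tolist()
    if len(ids) == 0:
        return []
    if isinstance(ids[0], int):  # flat sequence --> wrap as one sequence
        return [list(ids)]
    return [list(seq) for seq in ids]

# A decoder can then treat every input format uniformly:
print(normalize_ids([1, 2, 3]))      # one flat sequence
print(normalize_ids([[1, 2], [3]]))  # already nested
```

Normalizing once at the boundary, as the `_in_as_seq` decorator does for the real tokenizers, keeps the decoding logic itself free of format checks.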
- Fix: `MuMIDI` and `Octuple` token encodings that performed the preprocessing steps twice;
- `MuMIDI` can now decode tempo tokens;
- The `_in_as_seq` decorator is now used solely for the `tokens_to_midi()` method, and removed from `tokens_to_track()`, which explicitly expects a `TokSequence` object as argument (089fa74);
- The `_in_as_seq` decorator now handles all token id input formats, as it should;
- Fix: `TSD` decoding with multiple input sequences when not in `one_token_stream` mode;
- The `unique_track` property is renamed to `one_token_stream`, as it is more explicit and accurate;
- New `convert_sequence_to_tokseq` method, which can convert any input sequence holding ids (integers), tokens (strings) or events (`Event` objects) into a `TokSequence` or a list of `TokSequence` objects, with the appropriate format depending on the tokenizer. This method is used by the `_in_as_seq` decorator;
- New `io_format` tokenizer property, returning the tokenizer's i/o format as a tuple of strings, whose meanings are: I for instrument (for non-`one_token_stream` tokenizers), T for token, and C for sub-token class (for multi-vocabulary tokenizers);
- Fix: `learn_bpe()` for tokenizers in `unique_track` mode;
- Fix: data augmentation in `unique_track` mode 1) was skipping files detected as drums, and 2) now augments all pitches except drum ones (as opposed to all of them before);
- `Program` tokens are now created from the `tokenizer.config.programs` values given by the user.
Compatibility note: if you use `Program` tokens, make sure to give `(-1, 128)` as the `programs` argument of your tokenizer's `TokenizerConfig`. This is already the default value; this note only applies if you gave something else.

This "mid-size" update brings a new `TokenizerConfig` object holding any tokenizer's configuration. This object is now used to instantiate all tokenizers, and replaces the now-removed `beat_res`, `nb_velocities`, `pitch_range` and `additional_tokens` arguments. It simplifies the code, reduces exceptions, and exposes a simpler way to customize tokenizers. You can read the documentation and examples to see how to use it.
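As an illustration of the design, grouping the former constructor arguments into a single config object might look like this minimal sketch (the class and field defaults here are stand-ins, not the real `TokenizerConfig`, which has more options):

```python
from dataclasses import dataclass, field
from typing import Dict, Optional, Tuple

@dataclass
class SimpleTokenizerConfig:
    """Minimal stand-in for a tokenizer configuration object."""
    pitch_range: Tuple[int, int] = (21, 109)
    nb_velocities: int = 32
    beat_res: Dict[Tuple[int, int], int] = field(
        default_factory=lambda: {(0, 4): 8, (4, 12): 4}
    )
    use_chords: bool = False  # stands in for part of the old additional_tokens dict

class SimpleTokenizer:
    """One config argument replaces several separate constructor arguments."""
    def __init__(self, config: Optional[SimpleTokenizerConfig] = None):
        self.config = config or SimpleTokenizerConfig()

tok = SimpleTokenizer(SimpleTokenizerConfig(nb_velocities=64))
print(tok.config.nb_velocities)  # 64
```

The benefit is that constructor signatures stay stable as options grow: new settings are added to the config object rather than as new positional arguments on every tokenizer class.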
- New `TokenizerConfig` object to hold the configuration and instantiate tokenizers;
- Tokenizers now implement `__repr__`;
- The `max_bar_embedding` argument of `REMIPlus` is now set to `False` by default;
- `load_params` is now a private method, and the documentation has been updated for this feature;
- Changes to the `merge_tracks` method;
- `TSD` now natively handles `Program` tokens, the same way `REMIPlus` does. Using the `use_programs` option will convert MIDIs into a single token sequence for all tracks, instead of one sequence per track;
- Tokenizers can be loaded from a saved configuration file (`params` arg);
- New `MMM` tokenizer (Multi-Track Music Machine).
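The single-stream idea behind `use_programs` can be pictured with a toy sketch (not the library's implementation, which also interleaves tokens by time): each note token is preceded by a Program token identifying its track's instrument, so all tracks fit in one sequence:

```python
from typing import Dict, List

def tracks_to_single_stream(tracks: Dict[int, List[str]]) -> List[str]:
    """Merge per-track token lists into one token sequence.

    Each note token is prefixed with a Program token naming its track's
    instrument (-1 conventionally denotes drums). Toy sketch only.
    """
    stream: List[str] = []
    for program, tokens in tracks.items():
        for token in tokens:
            stream.append(f"Program_{program}")
            stream.append(token)
    return stream

tracks = {0: ["Pitch_60", "Pitch_64"], -1: ["Pitch_36"]}
print(tracks_to_single_stream(tracks))
```

Because the instrument is carried by the tokens themselves, a model trained on such sequences can generate multitrack music from a single stream.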
- Changes to the `learn_bpe` method;
- Fix: `learn_bpe` with a `unique_track`-compatible tokenizer (`REMIPlus`) caused no BPE learning;
- Fix in `learn_bpe`: it now checks that the total number of unique base tokens (chars) is smaller than the target vocabulary size;
- The `__call__` magic method now allows loading MIDI and JSON files before converting them;
- `TokSequence`s are now subscriptable! (you can do `tok_seq[id_]`);
- Special tokens no longer hold a `None` value in `token_type_graph` and `tokens_errors`. Previous config files store special tokens with a `None` value (e.g. `PAD_None`); they have to be modified to remove it (e.g. just `PAD`) (`special_tokens` entry only). No change in the vocabulary / tokens;
- New `_ids_are_bpe_encoded` method.
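A minimal, hypothetical sketch of what a subscriptable token sequence looks like (the real `TokSequence` also carries events and bytes):

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class MiniTokSequence:
    """Token sequence holding parallel views of the same data."""
    tokens: List[str] = field(default_factory=list)  # human-readable tokens
    ids: List[int] = field(default_factory=list)     # integers fed to models

    def __getitem__(self, index):
        # Subscripting returns the id(s) at the given index or slice.
        return self.ids[index]

seq = MiniTokSequence(tokens=["Pitch_60", "Velocity_96"], ids=[12, 47])
print(seq[0])    # 12
print(seq[0:2])  # [12, 47]
```

Implementing `__getitem__` is what makes `tok_seq[id_]` work; slices fall out for free since the underlying list handles them.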
`REMI+` is now implemented! 🎉 This multitrack tokenization can be seen as an extension of `REMI`.

- Chord token parameters can now be set in the `additional_tokens` argument, with the keys `chord_maps`, `chord_tokens_with_root_note` and `chord_unknown`. You can use the default value as an example;
- The `_in_as_seq` decorator now automatically checks whether the input ids are encoded with BPE.

This major update brings:
- A new `TokSequence` object to represent tokens! This object holds tokens as tokens (strings), ids (integers to pass to models), `Event`s and bytes (used internally for BPE);
- The `Vocabulary` class is replaced by a dictionary. Other (protected) dictionaries are also added for token <--> id <--> byte conversions;
- A new `special_tokens` constructor argument for all tokenizers, in place of the previous `pad`, `mask`, `sos_eos` and `sep` arguments. It is a list of tokens (str) for more versatility. By default, special tokens are `["PAD", "BOS", "EOS", "MASK"]`;
- `__getitem__` now handles both ids (int) and tokens (str), with multi-vocabulary support;
- The methods of `MIDITokenizer` meant to be used internally are now protected;
- The `TokSequence` object is used as input and output for the `midi_to_tokens` and `tokens_to_midi` methods, thanks to the `_in_as_seq` and `_out_as_complete_seq` decorators;
- A `complete_sequence` method that automatically fills the uninitialized attributes of a `TokSequence` (ids, tokens);
- `tokens_to_events` renamed `_ids_to_tokens`, and new id / token / byte conversion methods with recursivity;
- Saved tokens now use the `ids` key (previously `tokens`);
- `decompose_bpe` method renamed `decode_bpe`;
- `tokenize_dataset` now allows applying BPE afterwards.

Tokens and tokenizers from v1.4.3 and before are compatible; this update does not change anything in the tokenizations themselves. However, you will need to adapt your files to load them, and to update some of your code for the new changes:
- Saved tokens now use the `ids` key (previously `tokens`). To adapt your previously saved tokens, open them with `json` and rewrite them with the `ids` key instead;
- `midi_to_tokens` (also called with `tokenizer(midi)`) now outputs a list of `TokSequence`s, each holding tokens as tokens (str) and their ids (int). It previously returned token ids. You can now get them by accessing the `.ids` attribute, as `tokseq.ids`;
- The `Vocabulary` class is deleted. You can still access the vocabulary with `tokenizer.vocab`, but it is now a dictionary. The methods of the `Vocabulary` class are now directly integrated in `MIDITokenizer`;
- The `pad`, `mask`, `sos_eos` and `sep` constructor arguments need to be replaced with the new `special_tokens` argument;
- `decompose_bpe` method renamed `decode_bpe`.

With all big changes can come hidden bugs. We carefully tested that all methods pass the previous tests, while assessing the robustness of the new methods. Despite these efforts, if you encounter any bugs, please report them by opening an issue, and we will do our best to solve them as quickly as possible.
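The token-file migration described in the compatibility notes (renaming the `tokens` key to `ids`) can be sketched like this; the file layout assumed here is a simplification, so adapt it to your own saved files:

```python
import json
from pathlib import Path

def migrate_token_file(path: Path) -> None:
    """Rename the legacy 'tokens' key to 'ids' in a saved token file."""
    data = json.loads(path.read_text())
    if "tokens" in data and "ids" not in data:
        data["ids"] = data.pop("tokens")
    path.write_text(json.dumps(data))

# Example with a temporary file standing in for a real saved-tokens file:
import tempfile
with tempfile.TemporaryDirectory() as tmp:
    p = Path(tmp) / "tokens.json"
    p.write_text(json.dumps({"tokens": [[4, 8, 15]]}))
    migrate_token_file(p)
    print(json.loads(p.read_text())["ids"])  # [[4, 8, 15]]
```

The guard against an existing `ids` key makes the migration safe to run twice on the same file.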