MIDI / symbolic music tokenizers for Deep Learning models 🎶
New features:

* `encode_ids_split` attribute of the tokenizer config;
* `clip` method;
* `DatasetMIDI` and `DataCollator` classes;
* `filter_dataset` method to clean a dataset of MIDI/abc files before using it;
* the `MMM` tokenizer has been cleaned up and is now fully modular: it works on top of other tokenizations (`REMI`, `TSD` and `MIDILike`) to allow more flexibility and interoperability;
* `TokSequence` objects can now be sliced and concatenated (e.g. `seq3 = seq1[:50] + seq2[50:]`);
* `TokSequence` objects produced by a tokenizer can now be split into subsequences per bar or per beat.

A few methods and properties were previously named after "bpe" and "midi". To align with the more general usage of these methods (support for several file formats and training algorithms), they have been renamed with more idiomatic and accurate names:
* `midi_to_tokens` --> `encode`
* `tokens_to_midi` --> `decode`
* `learn_bpe` --> `train`
* `apply_bpe` --> `encode_token_ids`
* `decode_bpe` --> `decode_token_ids`
* `ids_bpe_encoded` --> `are_ids_encoded`
* `vocab_bpe` --> `vocab_model`
* `tokenize_midi_dataset` --> `tokenize_dataset`
* `MIDITokenizer` --> `MusicTokenizer`
* `augment_midi` --> `augment_score`
* `augment_midi_dataset` --> `augment_dataset`
* `augment_midi_multiple_offsets` --> `augment_score_multiple_offsets`
* `split_midis_for_training` --> `split_files_for_training`
* `split_midi_per_note_density` --> `split_score_per_note_density`
* `get_midi_programs` --> `get_score_programs`
* `merge_midis` --> `merge_scores`
* `get_midi_ticks_per_beat` --> `get_score_ticks_per_beat`
* `split_midi_per_ticks` --> `split_score_per_ticks`
* `split_midi_per_beats` --> `split_score_per_beats`
* `split_midi_per_tracks` --> `split_score_per_tracks`
* `concat_midis` --> `concat_scores`
* `MIDITokenizer._tokens_to_midi` --> `MusicTokenizer._tokens_to_score`
* `MIDITokenizer._midi_to_tokens` --> `MusicTokenizer._score_to_tokens`
* `MIDITokenizer._create_midi_events` --> `MusicTokenizer._create_global_events`
There are no other compatibility issues besides these renamings.
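For migrating old code, the public renames above can be captured in a plain Python lookup table. This is a convenience sketch, not part of MidiTok itself; the `new_name` helper is hypothetical.

```python
# Lookup assembled from the rename list above; handy for grepping and
# updating code written against MidiTok < v3.0.3. Not a MidiTok API.
RENAMES = {
    "midi_to_tokens": "encode",
    "tokens_to_midi": "decode",
    "learn_bpe": "train",
    "apply_bpe": "encode_token_ids",
    "decode_bpe": "decode_token_ids",
    "ids_bpe_encoded": "are_ids_encoded",
    "vocab_bpe": "vocab_model",
    "tokenize_midi_dataset": "tokenize_dataset",
    "MIDITokenizer": "MusicTokenizer",
    "augment_midi": "augment_score",
    "augment_midi_dataset": "augment_dataset",
    "augment_midi_multiple_offsets": "augment_score_multiple_offsets",
    "split_midis_for_training": "split_files_for_training",
    "split_midi_per_note_density": "split_score_per_note_density",
    "get_midi_programs": "get_score_programs",
    "merge_midis": "merge_scores",
    "get_midi_ticks_per_beat": "get_score_ticks_per_beat",
    "split_midi_per_ticks": "split_score_per_ticks",
    "split_midi_per_beats": "split_score_per_beats",
    "split_midi_per_tracks": "split_score_per_tracks",
    "concat_midis": "concat_scores",
}

def new_name(old: str) -> str:
    """Return the v3.0.3 name for an old public name (identity if unchanged)."""
    return RENAMES.get(old, old)
```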
Full Changelog: https://github.com/Natooz/MidiTok/compare/v3.0.2...v3.0.3
This new version introduces a new `DatasetMIDI` class to use when training PyTorch models. It is based on the previously named `DatasetTok` class, with a pre-tokenizing option and better handling of BOS and EOS tokens.

A new `miditok.pytorch_data.split_midis_for_training` method allows dynamically chunking MIDIs into smaller parts that approximately match the desired token sequence length, based on the note densities of their bars. These chunks can be used to train a model while maximizing the overall amount of data used.

A few new utils methods have been created for this feature, e.g. to split, concatenate or merge `symusic.Score` objects.
Thanks @Kinyugo for the discussions and tests that guided the development of the features! (#147)
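As a rough sketch of how these pieces might fit together (the folder paths, sequence length, and `BOS_None`/`EOS_None` token names are assumptions, and the exact signatures may differ from this release):

```python
# Hypothetical training-data setup: chunk MIDIs to roughly the model's
# sequence length, then build a PyTorch dataset and dataloader.
from pathlib import Path

from miditok import REMI
from miditok.pytorch_data import DataCollator, DatasetMIDI, split_midis_for_training
from torch.utils.data import DataLoader

tokenizer = REMI()  # any MidiTok tokenizer

# Dynamically split the files, using per-bar note density to estimate
# how many bars fit in `max_seq_len` tokens.
midi_paths = list(Path("dataset").glob("**/*.mid"))
split_midis_for_training(
    files_paths=midi_paths,
    tokenizer=tokenizer,
    save_dir=Path("dataset_chunks"),
    max_seq_len=1024,
)

# Build the dataset over the chunks; tokenization can happen on the fly.
dataset = DatasetMIDI(
    files_paths=list(Path("dataset_chunks").glob("**/*.mid")),
    tokenizer=tokenizer,
    max_seq_len=1024,
    bos_token_id=tokenizer["BOS_None"],
    eos_token_id=tokenizer["EOS_None"],
)
collator = DataCollator(pad_token_id=tokenizer.pad_token_id)
dataloader = DataLoader(dataset, batch_size=16, collate_fn=collator)
```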
The update also brings a few minor fixes, and the docs have a new theme!
* `save_pretrained` to comply with huggingface_hub v0.21 by @Natooz in https://github.com/Natooz/MidiTok/pull/150
* Overwrite `_create_durations_tuples` in init by @JLenzy in https://github.com/Natooz/MidiTok/pull/153
Full Changelog: https://github.com/Natooz/MidiTok/compare/v3.0.1...v3.0.2
* `use_pitchdrum_tokens` option to use dedicated `PitchDrum` tokens for drum tracks, in https://github.com/Natooz/MidiTok/pull/138 (#137 @oiabtt)
* `load_tokens` now returns a `TokSequence`, in https://github.com/Natooz/MidiTok/pull/139 (#137 @oiabtt)
* `MIDITokenizer.from_pretrained` works similarly to the `AutoTokenizer` of the Hugging Face transformers library, in https://github.com/Natooz/MidiTok/pull/142 (discussed in #127 @oiabtt)

Full Changelog: https://github.com/Natooz/MidiTok/compare/v3.0.0...v3.0.1
This major version marks the switch from the miditoolkit MIDI reading/writing library to symusic, and a large optimisation of the MIDI preprocessing steps.
Symusic is a MIDI reading / writing library written in C++ with Python bindings, offering unmatched speeds, up to 500 times faster than native Python libraries. It is based on minimidi. The two libraries are created and maintained by @Yikai-Liao and @lzqlzzq, who have done amazing work, which is still ongoing as many useful features are on the roadmap! 🫶
Tokenizers from previous versions are compatible with this new version, but there might be some timing variations if you compare how MIDIs are tokenized and how tokens are decoded.
These changes result in much faster MIDI loading/writing and tokenization! The overall tokenization (loading a MIDI and tokenizing it) is between 5 and 12 times faster depending on the tokenizer and data. You can find other benchmarks here.
This huge speed gain allows discarding the previously recommended step of pre-tokenizing MIDI files as JSON tokens, and directly tokenizing the MIDIs on the fly while training/using a model! We updated the usage examples in the docs accordingly; the code is now simplified.
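The simplified on-the-fly workflow might look like this minimal sketch (the tokenizer choice and the `"song.mid"` path are placeholders):

```python
# Sketch: load a MIDI with symusic and tokenize it directly, with no
# intermediate JSON pre-tokenization step.
from miditok import REMI
from symusic import Score

tokenizer = REMI()
score = Score("song.mid")    # fast C++ MIDI parsing
tokens = tokenizer(score)    # tokenize on the fly
decoded = tokenizer(tokens)  # decode back to a symusic.Score
decoded.dump_midi("song_decoded.mid")
```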
You can still pass `miditoolkit.MidiFile` objects to the tokenizers, but those will be converted on the fly to a `symusic.Score` object and a deprecation warning will be thrown.

Full Changelog: https://github.com/Natooz/MidiTok/compare/v2.1.8...v3.0.0
This new version brings an additional token type: pitch intervals. It allows representing pitch intervals for simultaneous and successive notes. You can read more details about how it works in the docs. We greatly improved the tests and CI workflow, and fixed a few minor bugs and made small improvements along the way.
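Enabling the new token type is done through the tokenizer config; a minimal sketch, assuming the `use_pitch_intervals` flag described in the docs:

```python
# Sketch: enable pitch-interval tokens via the tokenizer config.
from miditok import REMI, TokenizerConfig

config = TokenizerConfig(use_pitch_intervals=True)
tokenizer = REMI(config)
```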
This new version also drops support for Python 3.7, and now requires Python 3.8 and newer. You can read more about the decision and how to make it retro-compatible in the docs.
We encourage you to update to the latest miditoolkit version, which also features some fixes and improvements. The most notable one is a clean of the dependencies, and compatibility with recent numpy versions!
Full Changelog: https://github.com/Natooz/MidiTok/compare/v2.1.7...v2.1.8
This release brings the integration of the Hugging Face Hub, along with a few important fixes and improvements!
* Tokenizers can now be pushed to and loaded from the Hugging Face Hub with the `.from_pretrained` and `push_to_hub` methods, as you would do for your models! Special thanks to @Wauplin and @julien-c for the help and support! 🤗🤗
* New `func_to_get_labels` argument to `DatasetTok`, allowing to use it to retrieve labels when loading data;
* Fix in `detect_chords` when checking whether to use unknown chords;
* `tokenize_midi_dataset` now reproduces the file tree of the source files. This change fixes issues where files with the same name were overwritten in the previous method. You can also specify whether to overwrite files in the destination directory or not.

Full Changelog: https://github.com/Natooz/MidiTok/compare/v2.1.6...v2.1.7
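Sharing a tokenizer on the Hub might look like the following sketch (assumes you are authenticated, e.g. via `huggingface-cli login`; the repository id is a placeholder):

```python
# Sketch: push a tokenizer to the Hugging Face Hub and load it back.
from miditok import REMI

tokenizer = REMI()
tokenizer.push_to_hub("username/my-tokenizer", private=True)

# Later, or on another machine:
tokenizer = REMI.from_pretrained("username/my-tokenizer")
```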
* New `program_change` config option that will insert `Program` tokens whenever an event comes from a different track than the previous one, mimicking MIDI `ProgramChange` messages. If this parameter is disabled (the default), a `Program` token will be prepended to each track (as done in previous versions);
* `MIDILike` decoding optimized;
* `tokenize_check_equals` test method and more test cases;
* `Bar`/`Position`-based tokenizers (`REMI`, `CPWord`, `Octuple`, `MMM`);
* `Octuple` is now tested with time signatures disabled: as `TimeSig` tokens are only carried with notes, `Octuple` cannot accurately represent time signatures; as a result, if a time signature change occurs and the following bars do not contain any note, the time will be shifted by one or multiple bars depending on the previous time signature numerator and the time gap between the last and current notes. We do not recommend using `Octuple` with MIDIs with several time signature changes (at least numerator changes);
* `MMM` tokenization workflow speedup;
* New `one_token_stream_for_programs` parameter allowing to treat all tracks of a MIDI as a single stream of tokens (adding `Program` tokens before `Pitch`/`NoteOn`...). This option is enabled by default and corresponds to the default behaviour of previous versions. Disabling it allows having `Program` tokens in the vocabulary (`config.use_programs` enabled) while converting each track independently;
* `TimeShift` and `Rest` tokens can now be created successively during tokenization, which happens when the largest `TimeShift`/`Rest` value of the tokenizer isn't sufficient;
* Rests are now expressed like `TimeShift`s, and the `config.rest_range` parameter has been renamed `beat_res_rest` for simplicity and flexibility. The default value is `{(0, 1): 8, (1, 2): 4, (2, 12): 2}`.
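Configuring rests with the new format might look like this sketch, using the default value quoted above (the `use_rests` flag and the `TSD` choice are assumptions):

```python
# Sketch: the new beat_res_rest format maps (start_beat, end_beat)
# ranges to a resolution (samples per beat).
from miditok import TSD, TokenizerConfig

config = TokenizerConfig(
    use_rests=True,
    beat_res_rest={(0, 1): 8, (1, 2): 4, (2, 12): 2},
)
tokenizer = TSD(config)
```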
Full Changelog: https://github.com/Natooz/MidiTok/compare/v2.1.4...v2.1.5
Thanks to @caenopy for reporting the bugs fixed here.

* Old `rest_range` config parameters will be converted to the new `beat_res_rest` format;
* Fix in the `save_tokens` method, reading `kwargs` in the saved json file;
* `REMI`, `TSD` and `MIDILike` tokenizers;
* `MMM` now adds additional tokens in the same order as other tokenizers, meaning previously saved `MMM` tokenizers with these tokens would need to be converted if needed.

This big update brings a few important changes and improvements.
We now distinguish three types of tokens:
All tokenisations now follow the pattern:
This considerably cleans up the code (DRY, fewer redundant methods), while bringing speedups, as the number of calls to sorting methods has been reduced.
* New `pytorch_data` module offering PyTorch `Dataset` objects and a data collator, to be used when training a PyTorch model. Learn more in the documentation of the module;
* `MIDILike`, `CPWord` and `Structured` now natively handle `Program` tokens in a multitrack / `one_token_stream` way;
* `TSD`, `MIDILike` and `CPWord`;
* The `time_signature_range` config option is now more flexible / convenient;
* New `pytorch_data` submodule, with `DatasetTok` and `DatasetJsonIO` classes. This module is only loaded if `torch` is installed in the python environment;
* The `tokenize_midi_dataset()` method now has a `tokenizer_config_file_name` argument, allowing to save the tokenizer config with a custom file name;
* New `DataCollator` object to be used with PyTorch `DataLoader`s;
* `Structured` and `MIDILike` now natively handle `Program` tokens. When setting `config.use_programs` to True, a `Program` token will be added before each `Pitch`/`NoteOn`/`NoteOff` token to associate its instrument. MIDIs will also be treated as a single stream of tokens in this case, whereas otherwise each track is converted into an independent token sequence;
* The `miditok.utils.remove_duplicated_notes` method can now remove notes with the same pitch and onset time, regardless of their offset time / duration;
* `miditok.utils.merge_same_program_tracks` is now called in `preprocess_midi` when `config.use_programs` is True;
* The `REMI` codebase now has all the features of `REMIPlus`, with code cleaning and speedups (fewer calls to sorting). The `REMIPlus` class is now basically only a wrapped `REMI` with programs and time signature enabled;
* `TSD` and `MIDILike` now encode and decode time signature changes;
* `Tempo`s can now be created with a logarithmic scale, instead of the default linear scale;
* The `track_to_tokens` and `tokens_to_track` methods are now partially removed. They are now protected for the classes that still rely on them, and removed from the others. These methods were made for internal calls and were not recommended for use. Instead, the `midi_to_tokens` method is recommended;
* `time_signature_range` is now a dictionary `{denom_i: [num_i1, ..., num_in] / (min_num_i, max_num_i)}`;
* New option in `TokenizerConfig` to delete successive tempo / time signature changes carrying the same value during MIDI preprocessing;
* `CPWord` and `Octuple` now follow the common tokenization workflow;
* `OctupleMono` is removed as there was no record of its use. It is now equivalent to `Octuple` without `config.use_programs`;
* `CPWord` now handles time signature changes;
* `save_tokens` now by default doesn't save programs if `config.use_programs` is False.

Compatibility notes:

* The `track_to_tokens` and `tokens_to_track` methods are not supported anymore. If you used these methods, you may replace them with `midi_to_tokens` and `tokens_to_midi` (or just call the tokenizer) while selecting the appropriate token sequences / tracks;
* `time_signature_range` now needs to be given as a dictionary;
* With `Octuple` (as programs are now optional), tokenizers and tokens made with previous versions will not be compatible unless the vocabulary order is swapped, with idx 3 moved to 5.
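The dictionary form of `time_signature_range` described above can be sketched as follows (the `use_time_signatures` flag and the example values are assumptions):

```python
# Sketch: each denominator maps to a list of numerators, or to a
# (min, max) numerator range, per the format given above.
from miditok import REMI, TokenizerConfig

config = TokenizerConfig(
    use_time_signatures=True,
    time_signature_range={
        8: [3, 6, 12],  # 3/8, 6/8 and 12/8
        4: (1, 6),      # 1/4 through 6/4
    },
)
tokenizer = REMI(config)
```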