MidiTok Versions

MIDI / symbolic music tokenizers for Deep Learning models 🎶

v2.1.2

10 months ago

Thanks to @Kapitan11, who spotted bugs when decoding tokens given as ids / integers (#59), this update brings a few fixes that solve them, alongside tests ensuring that the input / output (i/o) formats of the tokenizers are handled correctly in all cases. The documentation has also been updated on this subject, which was unclear until now.

Changes

  • 394dc4d Fix in MuMIDI and Octuple token encodings that performed the preprocessing steps twice;
  • 394dc4d Code of single-track tests improved, now covering tempos for most tokenizations;
  • 394dc4d MuMIDI can now decode tempo tokens;
  • 394dc4d _in_as_seq decorator now used solely for the tokens_to_midi() method, and removed from tokens_to_track(), which explicitly expects a TokSequence object as an argument (089fa74);
  • 089fa74 _in_as_seq decorator now handles all token id input formats as it should;
  • 9fe7639 Fix in TSD decoding with multiple input sequences when not in one_token_stream mode;
  • 9fe7639 Adding i/o input ids tests;
  • 8c2349bfb771145c805c8a652392ae8f11ed0756 unique_track property renamed to one_token_stream as it is more explicit and accurate;
  • 8c2349bfb771145c805c8a652392ae8f11ed0756 new convert_sequence_to_tokseq method, which can convert any input sequence holding ids (integer), tokens (string) or events (Event) data into a TokSequence or a list of TokSequence objects, with the appropriate format depending on the tokenizer. This method is used by the _in_as_seq decorator;
  • 8c2349bfb771145c805c8a652392ae8f11ed0756 new io_format tokenizer property, returning the tokenizer's i/o format as a tuple of strings. Their meanings are: I for instrument (for non-one_token_stream tokenizers), T for token, and C for sub-token class (for multi-vocab tokenizers); see the sketch after this list;
  • Minor code lint improvements;
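
As a hedged illustration of the i/o formats mentioned above (the exact tuple depends on the tokenizer and its one_token_stream mode, and the ids below are placeholders):

```python
from miditok import TSD, TokenizerConfig

tokenizer = TSD(TokenizerConfig())
print(tokenizer.io_format)  # e.g. ("I", "T"): one token sequence per instrument

# tokens_to_midi() accepts raw ids as well as TokSequence objects; with real
# ids obtained from midi_to_tokens, decoding would look like:
# ids = [[12, 45, 6], [33, 4, 18]]  # one list of ids per instrument (placeholders)
# midi = tokenizer.tokens_to_midi(ids)
```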

Compatibility

  • All good 🙌

v2.1.1

10 months ago

Changes

  • 220f3842a55693e0d5a68e89f31c3eede6b4ab12 Fix in learn_bpe() for tokenizers in unique_track mode;
  • 30d554693b5c0c6e271cdcd72cb969ef5dc1efaa Fixes in data augmentation (on tokens) in unique_track mode: 1) it was skipping files (detected as drums), and 2) it now augments all pitches except drum ones (as opposed to all of them before);
  • 30d554693b5c0c6e271cdcd72cb969ef5dc1efaa Tokenizers now create Program tokens from the tokenizer.config.programs values given by the user.

Compatibility

  • If you used custom Program tokens, make sure to give (-1, 128) as the programs argument of your tokenizer's config (TokenizerConfig). This is already the default value, so this message only applies if you gave something else; a sketch follows below.
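
A minimal sketch of the corresponding configuration, assuming the tuple form quoted above (other versions may expect an explicit list such as list(range(-1, 128))):

```python
from miditok import REMI, TokenizerConfig

# (-1, 128) is already the default value of the programs argument; pass it
# explicitly only if you previously used a different range for Program tokens.
config = TokenizerConfig(programs=(-1, 128))
tokenizer = REMI(config)
```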

v2.1.0

10 months ago

Major change

This "mid-size" update brings a new TokenizerConfig object, holding any tokenizer's configuration. This object is now used to instantiate all tokenizers, and replaces the now removed beat_res, nb_velocities, pitch_range and additional_tokens arguments. It allows to simplify the code, reduce exceptions, and expose a simplified way to custom tokenizers. You can read the documentation and example to see how to use it.

Changes

  • e586b1fa444f90fd4f925f636f1eeffb549aae9d New TokenizerConfig object to hold config and instantiate tokenizers
  • 26a67a65b1d7af174d271294c5df38238c9a71b5 @tingled Fix in __repr__
  • 9970ec472bd7d6e983574d5b28d4b8cbcdd82013 Fix in CPWord token type graph
  • 69e64a7f4c1a8f511bd437d519b13fb838229fa6 The max_bar_embedding argument of REMIPlus now defaults to False
  • 62292d63bde48f619be354c69e371e03b3ee0d21 @Kapitan11 load_params is now a private method, and the documentation has been updated for this feature
  • 3aeb7ffa03b3e5e5235ca1c42eabedc1311a1db5 Removing the deprecated "slow" BPE methods
  • f8ca8548c7e1bd5ac10092f3601fdaeed253694a @ilya16 Fixing PitchBend time attribute in merge_tracks method
  • b12d270660cff14ae36549e3cfc00c320c5032b0 TSD now natively handles Program tokens, the same way REMIPlus does. Using the use_programs option will convert MIDIs into a single token sequence for all tracks, instead of one sequence per track;
  • Other minor code, lint and docstring improvements

Compatibility

  • In your current / previous projects, you will need to update your code, specifically the way you create tokenizers, to use this update. This doesn't apply to code creating tokenizers from a config file (params arg);
  • Slow BPE is removed. If you still use these methods, we encourage you to switch to the new fast ones. Models trained with the old slow BPE will have to keep being used with the old slow tokenizers.

v2.0.6

1 year ago

Changes

Compatibility

  • All good 🙌

v2.0.5

1 year ago

Changes

  • f9f63d076bd630606f0482291375af44e37d1136 (related to #37) adding a compatibility check to the learn_bpe method
  • f1af66ad2fec24007a59961d96069782f9b97ffc fixing an issue when loading tokens in learn_bpe with a unique_track-compatible tokenizer (REMIPlus) that caused BPE learning to not take place
  • f1af66ad2fec24007a59961d96069782f9b97ffc in learn_bpe: checking that the total number of unique base tokens (chars) is lower than the target vocabulary size
  • 47b616643643bdb3dac82388d10b0603ad988b4f handling multi-vocabulary indexing with tokens present in all vocabularies, e.g. special tokens (a sketch of learn_bpe usage follows this list)
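
For reference, a hedged sketch of how these checks come into play when training BPE (paths and vocabulary size are placeholders):

```python
from pathlib import Path

from miditok import REMI

tokenizer = REMI()
# learn_bpe checks that the tokenizer is compatible with the loaded token
# files, and that the number of unique base tokens (chars) is lower than
# the target vocabulary size.
tokenizer.learn_bpe(
    vocab_size=500,
    tokens_paths=list(Path("path", "to", "tokens").glob("**/*.json")),
)
```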

Compatibility

  • All good 🙌

v2.0.4

1 year ago

Changes

  • 456a6ce149fc856d58b65a73f71f2af9e3e4af87 bugfix of the velocity feature when performing data augmentation at the token level

v2.0.3

1 year ago

Changes

  • ff1bb5eebff335258cfda68339c6d304a1a82541 and 195cb6594ee579a302ba4bce9f7228f5c8bb2353 the __call__ magic method now allows loading MIDI and JSON files before converting them (see the sketch after this list)
  • c045630f2ae2dd250c30dfa9a6eaef47dc425588 TokSequences are now subscriptable! (you can do tok_seq[id_])
  • a63221435fc058523669b39ac0b08656a8754d84 Special tokens are now stored without the None value
  • Minor code and documentation improvements
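
A minimal sketch of these two changes (the path is a placeholder, and the returned object is assumed to be a list of TokSequence, as introduced in v2.0.0):

```python
from pathlib import Path

from miditok import REMI

tokenizer = REMI()
tok_seqs = tokenizer(Path("path", "to", "file.mid"))  # the file is loaded automatically
first_seq = tok_seqs[0]  # one TokSequence per track
print(first_seq[0])      # TokSequence objects are subscriptable
```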

Compatibility

  • If you use token_type_graph and tokens_errors: previous config files store special tokens with the None value (e.g. PAD_None); they have to be modified to remove it (e.g. just PAD), in the special_tokens entry only. There is no change in the vocabulary / tokens.

v2.0.2

1 year ago

Changes

  • 63110d7de15a140b6b8b79db794e0034e8ad222e fix in _ids_are_bpe_encoded method

v2.0.1

1 year ago

Changes

  • e26b088531befab41d9f8a7f6d7244498b544861 from @atsukoba + help from @muthissar: REMI+ is now implemented! 🎉 This multitrack tokenization can be seen as an extension of REMI.
  • 29622115f6061579f5de5502bbcea8b05c3712a0 Chord tokens can now represent the root note within tokens (versus only the chord quality previously). Chord parameters have to be specified in the additional_tokens argument, with the keys chord_maps, chord_tokens_with_root_note and chord_unknown. You can use the default values as an example (see the sketch after this list).
  • e402b0d42f7eb39eeb074d439e839e63bf8a1098 _in_as_seq decorator now automatically checks if the input ids are encoded with BPE
  • 2064ee944494d0d0583418ab6a2670c7861e561a fix for BPE merges containing spaces, which prevented loading tokenizers after training
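
A hedged sketch of enabling these chord parameters with the v2.0.x-style constructor, assuming the ADDITIONAL_TOKENS and CHORD_MAPS default constants exported by miditok.constants at the time:

```python
from miditok import REMI
from miditok.constants import ADDITIONAL_TOKENS, CHORD_MAPS

# Start from the default additional tokens and enable chords with root notes.
additional_tokens = dict(ADDITIONAL_TOKENS)
additional_tokens["Chord"] = True
additional_tokens["chord_maps"] = CHORD_MAPS  # chord quality -> pitch intervals
additional_tokens["chord_tokens_with_root_note"] = True  # root note kept in the token
additional_tokens["chord_unknown"] = (3, 6)  # detect unknown chords of 3 to 6 notes
tokenizer = REMI(additional_tokens=additional_tokens)
```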

Compatibility

  • Due to 2064ee944494d0d0583418ab6a2670c7861e561a, bytes and merges are shifted compared to v2.0.0. BPE tokenizers trained with v2.0.0 will be incompatible and would have to be retrained, or the bytes of their vocabularies and merges would have to be shifted. This only applies to BPE.

v2.0.0

1 year ago

TL;DR

This major update brings:

  • The integration of the Hugging Face 🤗tokenizers library as the Byte Pair Encoding (BPE) backend. BPE is now between 30 and 50 times faster, for both training and encoding! 🙌
  • A new TokSequence object to represent tokens! This object holds tokens as tokens (strings), ids (integers to pass to models), Events and bytes (used internally for BPE); see the sketch after this list.
  • Many internal changes, with methods and variables renamed, that require you to update some of your code (details below).
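
A hedged sketch of what TokSequence exposes (placeholder path; miditok relied on miditoolkit's MidiFile at the time):

```python
from miditok import REMI
from miditoolkit import MidiFile

tokenizer = REMI()
midi = MidiFile("path/to/file.mid")  # placeholder path
seq = tokenizer(midi)[0]  # one TokSequence per track
print(seq.tokens)  # tokens as strings, e.g. "Pitch_60"
print(seq.ids)     # token ids (integers) to feed to a model
print(seq.bytes)   # bytes, used internally for BPE
```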

Changes

  • a9b82e4ffb0f77b1541c5a236af4377ea156d77a The Vocabulary class is replaced by a dictionary. Other (protected) dictionaries are also added for token <--> id <--> byte conversions;
  • a9b82e4ffb0f77b1541c5a236af4377ea156d77a New special_tokens constructor argument for all tokenizers, in place of the previous pad, mask, sos_eos and sep arguments. It is a list of tokens (str) for more versatility. By default, special tokens are ["PAD", "BOS", "EOS", "MASK"];
  • a9b82e4ffb0f77b1541c5a236af4377ea156d77a __getitem__ now handles both ids (int) and tokens (str), with multi-vocab;
  • 36bf0f66a392835e492a9f7decf7e382662f30aa Some methods of MIDITokenizer meant to be used internally are now protected;
  • a2db7b9ed173149c34e9f01e3d873317a185288f New training method with 🤗tokenizers BPE model;
  • 9befb8d76c90b24533f8badbff9b368bf80e6da5 TokSequence object, used as in and out object for midi_to_tokens and tokens_to_midi methods, thanks to the _in_as_seq and _out_as_complete_seq decorators;
  • 9befb8d76c90b24533f8badbff9b368bf80e6da5 complete_sequence method allowing to automatically fill in the uninitialized attributes of a TokSequence (ids, tokens);
  • 9befb8d76c90b24533f8badbff9b368bf80e6da5 tokens_to_events renamed _ids_to_tokens, and new id / token / byte conversion methods that work recursively;
  • 9befb8d76c90b24533f8badbff9b368bf80e6da5 Tokens are now saved and loaded with the ids key (previously tokens);
  • cddd29cc2939fda706f502c9be36ccfa06f5dd20 Tokenization files moved to a dedicated tokenizations module;
  • cddd29cc2939fda706f502c9be36ccfa06f5dd20 decompose_bpe method renamed decode_bpe;
  • d5201287b93fd42121b76b39e08cabf575e9bdcd tokenize_dataset now allows applying BPE afterwards.

Compatibility

Tokens and tokenizers from v1.4.3 and before are compatible; this update does not change anything in the tokenizations themselves. However, you will need to adapt your saved files to load them, and to update some of your code for the new changes:

  • Tokens are now saved and loaded with the ids key (previously tokens). To adapt your previously saved tokens, open them with json and rewrite them with the ids key instead, as shown in the sketch after this list;
  • midi_to_tokens (also called with tokenizer(midi)) now outputs a list of TokSequences, each holding tokens as tokens (str) and their ids (int). It previously returned token ids. You can now get them by accessing the .ids attribute, as tokseq.ids;
  • Vocabulary class deleted. You can still access the vocabulary with tokenizer.vocab, but it is now a dictionary. The methods of the Vocabulary class are now directly integrated in MIDITokenizer;
  • For all tokenizers, the pad, mask, sos_eos and sep constructor arguments need to be replaced with the new special_tokens argument;
  • decompose_bpe method renamed decode_bpe.
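
A minimal sketch of that token-file migration (paths are placeholders):

```python
import json
from pathlib import Path

# One-off migration: rename the "tokens" key to "ids" in saved token files.
for file_path in Path("path", "to", "tokens").glob("**/*.json"):
    with file_path.open() as file:
        data = json.load(file)
    if "tokens" in data:
        data["ids"] = data.pop("tokens")
    with file_path.open("w") as file:
        json.dump(data, file)
```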

Bug reports

Big changes can bring hidden bugs. We carefully tested that all methods pass the previous tests, while assessing the robustness of the new methods. Despite these efforts, if you encounter any bugs, please report them by opening an issue, and we will do our best to fix them as quickly as possible.