Open STT

v1.02

4 years ago

New OPUS direct download links

v1.01

4 years ago

OPUS torrent micro release

v1.0-beta

4 years ago

The largest Russian STT dataset to date

  • ~16m utterances;
  • ~20 000 hours;
  • 2.3 TB of data (in .wav format, int16);
  • A wide variety of practical, close to real-life domains;
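
The hours and size figures above can be cross-checked with simple arithmetic. The sketch below is only an illustration: the 16 kHz mono sample rate is an assumption (the notes state int16 .wav but not the sample rate or channel count), and `wav_size_tb` is a hypothetical helper name.

```python
# Rough size of uncompressed int16 audio for a given number of hours.
# ASSUMPTION: 16 kHz, mono. Only "int16 .wav" is stated in the release notes.
SAMPLE_RATE_HZ = 16_000
BYTES_PER_SAMPLE = 2  # int16
CHANNELS = 1


def wav_size_tb(hours: float) -> float:
    """Return the approximate payload size in decimal terabytes (headers ignored)."""
    seconds = hours * 3600
    return seconds * SAMPLE_RATE_HZ * BYTES_PER_SAMPLE * CHANNELS / 1e12


print(f"{wav_size_tb(20_000):.2f} TB")  # prints "2.30 TB"
```

Under that assumption, ~20 000 hours comes out at about 2.3 TB, which agrees with the figure quoted in the list.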

Major highlights

  • ~3 000 hours of a completely new domain - public speech;
  • A huge Radio dataset update with 10 000+ hours;
  • A 5% demo version of new Radio/Public Speech datasets;
  • Vastly improved dataset normalization;
  • Overall annotation quality is improved:
    • Upstream model quality improvement;
    • No more "dangling" letters;
    • Improved voice activity detection; See the above TLDR bullets;

Next steps

  • Major past error clean-up planned in 1.1;
  • Refine and publish speaker labels, probably add speakers for old datasets;
  • Improve / re-upload some of the existing datasets, refine the STT labels;
  • Probably add new languages;
  • Add pre-trained models;

v0.5-beta

4 years ago

TLDR:

  • 855 GB unarchived (in .wav format, int16);
  • (new!) A new domain - radio;
  • (new!) A larger YouTube dataset with 1000+ additional hours;
  • (new!) A small (300 hours) YouTube dataset downloaded in maximum quality;
  • (new!) 18 hours in 3 validation sets for YouTube / books / public calls with ground truth annotations;
  • See the distilled files with "bad" data in this issue;

v0.4.3-alpha

4 years ago

v0.4.2-alpha

5 years ago

Added txt files to torrents and direct archives. Updated torrents.

v0.4.1-alpha

5 years ago

Added link to a torrent download.

v0.4-alpha

5 years ago

Key changes:

  • Converted the majority of the dataset to MP3;
  • Added a download script and put the md5 hashes into it;
  • Fixed license;
  • Added items to FAQ and common issues;

THE MAJORITY OF WAV LINKS WILL BE DELETED SOON.

Coming soon:

  • Download via torrent;
  • Large (1,500 hours) YouTube dataset;
  • ... and more)

Dataset composition

Dataset | Utterances | Hours | GB | Avg s / chars | Comment | Annotation | Quality/noise
public_youtube1500 (*) | | 1,500 | * | | Coming soon | |
audiobook_2 | 1,149,404 | 1,511 | 166 | 4.7s / 56 | Books | Alignment (*) | 95% / crisp
public_youtube700 | 759,483 | 701 | 75 | 3.3s / 43 | YouTube videos | Subtitles | 95% / ~crisp
tts_russian_addresses | 1,741,838 | 754 | 81 | 1.6s / 20 | Russian addresses | TTS 4 voices | 100% / crisp
asr_public_phone_calls_2 | 603,797 | 601 | 66 | 3.6s / 37 | Phone calls | ASR | 70% / noisy
asr_public_phone_calls_1 | 233,868 | 211 | 23 | 3.3s / 29 | Phone calls | ASR | 70% / noisy
asr_public_stories_2 | 78,186 | 78 | 9 | 3.5s / 43 | Books | ASR | 80% / crisp
asr_public_stories_1 | 46,142 | 38 | 4 | 3.0s / 30 | Books | ASR | 80% / crisp
public_series_1 | 20,243 | 17 | 2 | 3.1s / 38 | YouTube videos | Subtitles | 95% / ~crisp
ru_RU | 5,826 | 17 | 2 | 11s / 12 | Public dataset | Alignment | 99% / crisp
voxforge_ru | 8,344 | 17 | 2 | 7.5s / 77 | Public dataset | Reading | 100% / crisp
russian_single | 3,357 | 9 | 1 | 9.3s / 102 | Public dataset | Alignment | 99% / crisp
public_lecture_1 | 6,803 | 6 | 1 | 3.4s / 47 | Lectures | Subtitles | 95% / crisp
Total | 4,657,291 | 3,961 | 431 | | | |
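
The average-duration column can be sanity-checked from the Utterances and Hours columns. The sketch below copies four rows from the table and recomputes the mean utterance length; `mean_utterance_seconds` is a hypothetical helper, and the check assumes the listed averages are plain means rounded to one decimal.

```python
# (dataset, utterances, hours, listed average seconds), copied from the table.
ROWS = [
    ("audiobook_2", 1_149_404, 1_511, 4.7),
    ("public_youtube700", 759_483, 701, 3.3),
    ("tts_russian_addresses", 1_741_838, 754, 1.6),
    ("asr_public_phone_calls_2", 603_797, 601, 3.6),
]


def mean_utterance_seconds(hours: float, utterances: int) -> float:
    """Mean utterance length in seconds, given total hours and utterance count."""
    return hours * 3600 / utterances


for name, utterances, hours, listed in ROWS:
    computed = mean_utterance_seconds(hours, utterances)
    # Flag rows whose recomputed mean rounds differently from the listed value.
    status = "ok" if abs(computed - listed) < 0.05 else "mismatch"
    print(f"{name}: computed {computed:.2f}s, listed {listed}s ({status})")
```

The same check can be run on any other row to spot entries whose hours or utterance counts drifted between releases.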

Links

Meta data file.

Dataset | GB, wav | GB, mp3 | Wav | Mp3 | Source | Manifest
audiobook_2 | 166 | 21.0 | down | part1 | Sources from the Internet + alignment | link
asr_public_phone_calls_2 | 66 | 7.5 | down | part1 | Sources from the Internet + ASR | link
asr_public_stories_2 | 9 (7.5) | NA | part1 | NA | Sources from the Internet + alignment | link
tts_russian_addresses_rhvoice_4voices | 80.9 | 9.9 | down | part1 | TTS | link
public_youtube700 | 75.0 | 9.6 | down | part1 | YouTube videos | link
asr_public_phone_calls_1 | 22.7 | 2.6 | down | part1 | Sources from the Internet + ASR | link
asr_public_stories_1 | 4.1 | 0.5 | down | part1 | Public stories | link
public_series_1 | 1.9 | 0.2 | down | part1 | Public series | link
ru_RU | 1.9 | 0.2 | down | part1 | Caito.de dataset | link
voxforge_ru | 1.9 | 0.2 | down | part1 | Voxforge dataset | link
russian_single | 0.9 | 0.1 | down | part1 | Russian single speaker dataset | link
public_lecture_1 | 0.7 | 0.1 | down | part1 | Sources from the Internet | link
Total | 431 | 52 | | | |

v0.3-alpha

5 years ago

Key changes:

  • Added datasets: 1,500 hours of aligned books, 600+ hours of phone calls, 78 hours of ASR stories;
  • Formatting changes;
  • Added license;
  • Added items to FAQ and common issues;

Coming soon:

  • Large (1,500 hours) YouTube dataset;
  • ... and more)

Dataset composition

Dataset | Utterances | Hours | GB | Avg s / chars | Comment | Annotation | Quality/noise
public_youtube1500 (*) | | 1,500 | * | | Coming soon | |
audiobook_2 | 1,149,404 | 1,511 | 166 | 4.7s / 56 | Books | Alignment | 99% / crisp
audiobook_1 | 196,666 | 237 | 26 | 4.3s / 50 | Books | Alignment | 99% / crisp
public_youtube700 | 759,483 | 701 | 75 | 3.3s / 43 | YouTube videos | Subtitles | 95% / ~crisp
tts_russian_addresses | 1,741,838 | 754 | 81 | 1.6s / 20 | Russian addresses | TTS 4 voices | 100% / crisp
asr_public_phone_calls_2 | 603,797 | 601 | 66 | 3.6s / 37 | Phone calls | ASR | 70% / noisy
asr_public_phone_calls_1 | 233,868 | 211 | 23 | 3.3s / 29 | Phone calls | ASR | 70% / noisy
asr_public_stories_2 | 78,186 | 78 | 9 | 3.5s / 43 | Books | ASR | 80% / crisp
asr_public_stories_1 | 46,142 | 38 | 4 | 3.0s / 30 | Books | ASR | 80% / crisp
public_series_1 | 20,243 | 17 | 2 | 3.1s / 38 | YouTube videos | Subtitles | 95% / ~crisp
ru_RU | 5,826 | 17 | 2 | 11s / 12 | Public dataset | Alignment | 99% / crisp
voxforge_ru | 8,344 | 17 | 2 | 7.5s / 77 | Public dataset | Reading | 100% / crisp
russian_single | 3,357 | 9 | 1 | 9.3s / 102 | Public dataset | Alignment | 99% / crisp
public_lecture_1 | 6,803 | 6 | 1 | 3.4s / 47 | Lectures | Subtitles | 95% / crisp
Total | 4,853,957 | 4,198 | 457 | | | |

Links

Meta data file.

Dataset | GB | GB, compressed | Audio | Source | Manifest
audiobook_1 | 26 | 20.8 | part1 | Public books + alignment | link
audiobook_2 | 166 | 131.7 | part1, part2, part3, part4, part5, part6, part7 | Public books + alignment | link
asr_public_phone_calls_2 | 66 | 51.7 | part1, part2, part3 | ASR + public phone calls | link
asr_public_stories_2 | 9 | 7.5 | part1 | Public books + alignment | link
tts_russian_addresses_rhvoice_4voices | 80.9 | 67.0 | part1, part2, part3, part4 | TTS | link
public_youtube700 | 75.0 | 67.0 | part1, part2, part3, part4 | YouTube videos | link
asr_public_phone_calls_1 | 22.7 | 19.0 | part1 | ASR + public phone calls | link
asr_public_stories_1 | 4.1 | 3.8 | part1 | Public stories | link
public_series_1 | 1.9 | 1.7 | part1 | Public series | link
ru_RU | 1.9 | 1.4 | part1 | Caito.de dataset | link
voxforge_ru | 1.9 | 1.5 | part1 | Voxforge dataset | link
russian_single | 0.9 | 0.7 | part1 | Russian single speaker dataset | link
public_lecture_1 | 0.7 | 0.6 | part1 | Public lectures | link
Total | 457 | 374 | | |
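
Several archives in this release come in multiple parts. The part file names in the md5 list (`_aa`, `_ab`, ...) look like the default output of the `split` utility, so the sketch below assumes the parts are plain byte-wise chunks that only need to be concatenated in suffix order; this is an assumption, not something the notes state. `join_parts` is a hypothetical helper.

```python
import shutil
from pathlib import Path


def join_parts(directory, stem, output):
    """Concatenate `<stem>_aa`, `<stem>_ab`, ... (sorted by suffix) into `output`.

    ASSUMPTION: parts are raw byte chunks of one archive, named with a
    two-character suffix. Returns the list of part paths that were joined.
    """
    parts = sorted(Path(directory).glob(f"{stem}_??"))
    if not parts:
        raise FileNotFoundError(f"no parts matching {stem}_?? in {directory}")
    with open(output, "wb") as out:
        for part in parts:
            with open(part, "rb") as src:
                shutil.copyfileobj(src, out)  # streams in chunks, low memory use
    return parts


# Example call (paths are placeholders):
# join_parts("/path/to/downloads", "audiobooks_2.tar.gz",
#            "/path/to/downloads/audiobooks_2.tar.gz")
```

It is worth verifying the md5 of each part before joining, since a single corrupted chunk makes the whole archive unreadable.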

Check md5sum

md5sum /path/to/downloaded/file

type md5sum file
manifest b0ce7564ba90b121aeb13aada73a6e30 asr_public_phone_calls_1.csv
manifest 6867d14dfdec1f9e9b8ca2f1de9ceda6 asr_public_phone_calls_2.csv
manifest 0bdd77e15172e654d9a1999a86e92c7f asr_public_stories_1.csv
manifest f388013039d94dc36970547944db51c7 asr_public_stories_2.csv
manifest 697738331b6021890c29a0d415d0f22d private_buriy_audiobooks_1.csv
manifest 3b67e27c1429593cccbf7c516c4b582d private_buriy_audiobooks_2.csv
manifest 04027c20eb3aff05f6067957ecff856b public_lecture_1.csv
manifest 89da3f1b6afcd4d4936662ceabf3033e public_series_1.csv
manifest a81dfb018c88d0ecd5194ab3d8ff6c95 public_youtube700.csv
manifest c858f020729c34ba0ab525bbb8950d0c ru_RU.csv
manifest 0275525914825dec663fd53390fdc9a0 russian_single.csv
manifest 52f406f4e30fcc8c634f992befd91beb tts_russian_addresses_rhvoice_4voices.csv
audio a5496898ee78654bf398ec6df71540d7 asr_public_phone_calls_1.tar.gz
audio e4df5ef50787384648b59f5a87edc0c6 asr_public_phone_calls_2.tar.gz
audio 97594127a922df8a7bcc2eecd2470805 asr_public_phone_calls_2.tar.gz_aa
audio f9b6475f0f2898b16d9e6e0e648fb531 asr_public_phone_calls_2.tar.gz_ab
audio b19977c889cda639f621195251e6bb6f asr_public_phone_calls_2.tar.gz_ac
audio 657a31b544b10295f909ef4b2ca5c156 asr_public_stories_1.tar.gz
audio 7533581bb26975212817bcacb25546d0 asr_public_stories_2.tar.gz
audio d7d374025c56ca556d9cde86b9fdffda audiobooks_1.tar.gz
audio 3955616cd89761bf2d54d0e992f7eae5 audiobooks_2.tar.gz_aa
audio 81b6ec147c0c43bdd56002c41e0288b8 audiobooks_2.tar.gz_ab
audio 15d4cf99171c2db3f375619f4bd2b6d9 audiobooks_2.tar.gz_ac
audio 50635b0f4bdf44fae96e5a65f4738e19 audiobooks_2.tar.gz_ad
audio f1103be39ffc2da4a98d8f6ddeb50aa0 audiobooks_2.tar.gz_ae
audio 8b45d2bd8b1fa1d906e36b9fabd9fe4c audiobooks_2.tar.gz_af
audio 5104df44933b612b3c1bfc06f6376654 audiobooks_2.tar.gz_ag
audio e6b9e5f46811d33ea34ce50f6067a762 public_lecture_1.tar.gz
audio 86ebf7e30986b8ee8df11f85b35588a0 public_series_1.tar.gz
audio dc260dd8151b4fce6cde6d80af13146d public_youtube700.tar.gz_aa
audio 04706ef0f98841ec8d2f20a83aca3cf1 public_youtube700.tar.gz_ab
audio e11d5b118bf71425e4915e61277a06a9 public_youtube700.tar.gz_ac
audio d9a93157263eb9d8078c0e0b88c271de public_youtube700.tar.gz_ad
audio 1bbba5eb2f4911c9ed20ec69cbd292cb ru_ru.tar.gz
audio 6f79a9c514ad48a5763e3142919fc765 russian_single.tar.gz
audio c926df1068218eb9cc8103c94003fcc6 tts_russian_addresses_rhvoice_4voices.tar
audio 31d515e0bdfc467c3fe63088b817c15c tts_russian_addresses_rhvoice_4voices.tar.gz_aa
audio 4ca15694a8d8a638bbdc5e90832eadb4 tts_russian_addresses_rhvoice_4voices.tar.gz_ab
audio 447559a38cd8bf61c5de64e602f06da3 tts_russian_addresses_rhvoice_4voices.tar.gz_ac
audio 9131347a97c2e794d7c6d5a265083e83 tts_russian_addresses_rhvoice_4voices.tar.gz_ad
audio 91e2115b17b1ad08649f428d2caa643b voxforge_ru.tar.gz
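
For checking many files at once, the `md5sum` step can be scripted. The sketch below uses only the Python standard library; `md5_of_file`, `parse_md5_table` and `verify` are hypothetical helper names, and the parser assumes the list above is pasted as whitespace-separated `type md5sum file` rows.

```python
import hashlib
import os


def md5_of_file(path, chunk_size=1 << 20):
    """Compute the MD5 hex digest of a file, reading it in 1 MiB chunks."""
    digest = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5_table(text):
    """Turn 'type md5sum file' rows into a {file_name: md5} dict.

    Rows that do not have three fields with a 32-character hash
    (such as the header row) are skipped.
    """
    expected = {}
    for line in text.splitlines():
        fields = line.split()
        if len(fields) == 3 and len(fields[1]) == 32:
            expected[fields[2]] = fields[1].lower()
    return expected


def verify(path, expected):
    """Return True if the file's MD5 matches the entry for its base name."""
    return md5_of_file(path) == expected[os.path.basename(path)]


# Example call (path is a placeholder):
# expected = parse_md5_table(open("md5_list.txt").read())
# print(verify("/path/to/downloaded/voxforge_ru.tar.gz", expected))
```

A mismatch usually means an incomplete download, so re-fetching the file is the first thing to try.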

v0.2-alpha

5 years ago

Added medium-sized YouTube dataset and TTS dataset

Key changes:

  • The storage format was changed to on-disk DB with hashes;
  • Added a 700 hour YouTube dataset;
  • Added a 700+ hour TTS dataset with Russian addresses;
  • Added some utils to work with manifests;
  • Added manifest files for easier porting into your ASR application;
  • Discarded previous links;
  • The dataset format will be uniform from now on; new "datasets" will simply be added;
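
As an illustration of how a manifest file might be consumed, the sketch below sums the audio duration listed in one. The column layout is an assumption, since these notes do not describe it: each row is taken to hold an audio path, a transcript path and a duration in seconds, with no header row. `manifest_hours` is a hypothetical helper.

```python
import csv
import io


def manifest_hours(manifest_file):
    """Return (row_count, total_hours) for an open manifest file.

    ASSUMPTION: rows look like `audio_path,text_path,duration_seconds`.
    Rows with fewer than three fields are skipped.
    """
    count = 0
    total_seconds = 0.0
    for row in csv.reader(manifest_file):
        if len(row) < 3:
            continue
        total_seconds += float(row[2])
        count += 1
    return count, total_seconds / 3600


# Tiny in-memory example; real manifests would be opened with open(path, newline="").
sample = io.StringIO("a/1.wav,a/1.txt,1800\na/2.wav,a/2.txt,1800\n")
print(manifest_hours(sample))  # prints (2, 1.0)
```

If the actual manifests use a different column order or include a header, the indices and the skip rule need to be adjusted accordingly.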

Coming soon:

  • Large (1,500 hours) phone call dataset;
  • Large (1,500 hours) YouTube dataset;
  • ... and more)

Dataset composition

Dataset | Utterances | Hours | GB | Avg len / chars | Comment | Annotation | Quality/noise
asr_public_phone_calls_2 (*) | | 1,500 | * | | Coming soon | |
public_youtube1500 (*) | | 1,500 | * | | Coming soon | |
tts_russian_addresses | 1,741,838 | 754 | 81 | 1.6s / 20 | Russian addresses | TTS, 4 voices | 100% / crisp
public_youtube700 | 759,483 | 701 | 75 | 3.3s / 43 | YouTube videos | Subtitles | >95% / ~crisp
asr_public_phone_calls_1 | 233,868 | 211 | 23 | 3.3s / 29 | Phone calls | ASR | 70% / noisy
asr_public_stories_1 | 46,142 | 38 | 4 | 3.0s / 30 | Books | ASR | 70% / crisp
public_series_1 | 20,243 | 17 | 2 | 3.1s / 38 | YouTube videos | Subtitles | 95% / ~crisp
ru_RU | 5,826 | 17 | 2 | 10.8s / 12 | Public dataset | Alignment | 99% / crisp
voxforge_ru | 8,344 | 17 | 2 | 7.5s / 77 | Public dataset | Reading | 100% / crisp
russian_single | 3,357 | 9 | 1 | 9.3s / 102 | Public dataset | Alignment | 99% / crisp
public_lecture_1 | 6,803 | 6 | 1 | 3.4s / 47 | Lectures | Subtitles | >95% / crisp
Total | 2,825,904 | 1,771 | 190 | | | |

Links

Meta data file.

Dataset | GB | GB, compressed | Audio | Source | Manifest
tts_russian_addresses_rhvoice_4voices | 80.9 | 67.0 | part1, part2, part3, part4 | TTS | link
public_youtube700 | 75.0 | 67.0 | part1, part2, part3, part4 | YouTube videos | link
asr_public_phone_calls_1 | 22.7 | 19.0 | part1 | ASR + public phone calls | link
asr_public_stories_1 | 4.1 | 3.8 | part1 | Public stories | link
public_series_1 | 1.9 | 1.7 | part1 | Public series | link
ru_RU | 1.9 | 1.4 | part1 | Caito.de dataset | link
voxforge_ru | 1.9 | 1.5 | part1 | Voxforge dataset | link
russian_single | 0.9 | 0.7 | part1 | Russian single speaker dataset | link
public_lecture_1 | 0.7 | 0.6 | part1 | Public lectures | link
Total | 190 | 163 | | |