SOVA Dataset

SOVA Dataset is a free, public STT/ASR dataset.

Key facts:

Russian, English and Chinese languages
~ 32 328 hours
~ 3,21 TB in .wav format

Dataset composition

Name	Lang	Hours	Size	Source	Equipment	Annotation	Speech type	Augmentation	Quality
EngAudiobooksOriginal	EN	7 130	743 GB	audiobook	professional	forced alignment	reading	none	95%
EngAudiobooksNoisy	EN	3 873	310 GB	audiobook	professional	forced alignment	reading	phone calls	95%
RuAudiobooksDevices	RU	298	30,24 GB	audiobook	unprofessional	manual	reading	none	99%
RuDevices	RU	101	10,42 GB	audio records	unprofessional	manual	live speech	none	98%
RuYoutube	RU	17 451	1 873 GB	audio records	unprofessional	ASR	live speech	none	95%
ZhYoutube	CN	3 475,1	321 GB	audio records	unprofessional	ASR	live speech	none	97.83%
TOTAL	-	32 328,1	3 287,66 GB (3,21 TB)	-	-	-	-	-	-
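
The TOTAL row can be reproduced from the per-subset figures. The sketch below is only an illustration and is not part of any SOVA tooling: `parse_number` and `ROWS` are hypothetical names, the values are copied from the table above, and it assumes 1 TB = 1024 gigabytes, which is the conversion that matches the stated 3,21 TB.

```python
from decimal import Decimal


def parse_number(text: str) -> Decimal:
    """Parse a number written with space thousands separators and a decimal comma."""
    return Decimal(text.replace(" ", "").replace(",", "."))


# (hours, size in gigabytes) per subset, copied verbatim from the table.
ROWS = {
    "EngAudiobooksOriginal": ("7 130", "743"),
    "EngAudiobooksNoisy": ("3 873", "310"),
    "RuAudiobooksDevices": ("298", "30,24"),
    "RuDevices": ("101", "10,42"),
    "RuYoutube": ("17 451", "1 873"),
    "ZhYoutube": ("3 475,1", "321"),
}

total_hours = sum(parse_number(hours) for hours, _ in ROWS.values())
total_gb = sum(parse_number(size) for _, size in ROWS.values())

# Assumption: binary conversion (1 TB = 1024 gigabytes), rounded to two decimals.
total_tb = (total_gb / 1024).quantize(Decimal("0.01"))

print(total_hours, total_gb, total_tb)
```

Using `Decimal` rather than `float` keeps the sums exact, so the results can be compared directly with the figures in the TOTAL row.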

Audio characteristics

Bit rate mode: constant
Bit rate: 256 kbps
Channel(s): 1 channel
Sample rate: 16.0 kHz
Bit depth: 16 bit
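
For uncompressed PCM audio, the bit rate listed above follows from the other values: sample rate × bit depth × channels. The sketch below shows that arithmetic and a header check for a single file. It assumes the files are plain PCM WAV that Python's standard `wave` module can read; the function names are hypothetical and not part of the dataset.

```python
import wave

# Values taken from the "Audio characteristics" list.
EXPECTED_CHANNELS = 1
EXPECTED_SAMPLE_RATE_HZ = 16_000
EXPECTED_BIT_DEPTH = 16


def pcm_bit_rate_kbps(sample_rate_hz: int, bit_depth: int, channels: int) -> float:
    """Bit rate of uncompressed PCM audio in kbit/s."""
    return sample_rate_hz * bit_depth * channels / 1000


def matches_expected_format(path: str) -> bool:
    """Return True if the WAV header at `path` matches the expected values."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getnchannels() == EXPECTED_CHANNELS
            and wav.getframerate() == EXPECTED_SAMPLE_RATE_HZ
            and wav.getsampwidth() * 8 == EXPECTED_BIT_DEPTH  # sample width is in bytes
        )


# 16 000 Hz * 16 bit * 1 channel = 256 000 bit/s
print(pcm_bit_rate_kbps(EXPECTED_SAMPLE_RATE_HZ, EXPECTED_BIT_DEPTH, EXPECTED_CHANNELS))  # prints 256.0
```

A call such as `matches_expected_format("sample.wav")` (the file name is a placeholder) can be run over downloaded files before training to catch any that were resampled or converted along the way.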

Updates

08/11/2022: Release v0.4.0
10/12/2021: Release v0.3.0
22/12/2020: Release v0.2.0
24/12/2019: Published dataset with 116 hours.

Contacts

For any questions, please feel free to contact us at [email protected]

License

SOVA Dataset is licensed under the Creative Commons BY 4.0 license by Virtual Assistant, LLC.

Open Source Agenda is not affiliated with "Sova Dataset" Project. README Source: sovaai/sova-dataset

Repository

sovaai/sova-dataset

Homepage

https://sova.ai
