CNN-based audio segmentation toolkit. It detects speech, music, noise, and speaker gender, and was designed for large-scale gender equality studies based on speech time per gender.
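As an illustration of what "speech time per gender" means in practice, here is a minimal sketch that sums segment durations per label and derives each gender's share of speech time. It assumes the segmentation is available as a list of `(label, start, stop)` tuples with times in seconds; the label names (`female`, `male`, `music`, `noEnergy`), the helper functions, and the sample data are assumptions made for this example, not part of the models described here.

```python
from collections import defaultdict


def speech_time_per_label(segments):
    """Sum the duration (stop - start, in seconds) of all segments per label."""
    totals = defaultdict(float)
    for label, start, stop in segments:
        totals[label] += stop - start
    return dict(totals)


def gender_speech_share(segments, genders=("female", "male")):
    """Return each gender's fraction of the total gendered speech time."""
    totals = speech_time_per_label(segments)
    speech = sum(totals.get(g, 0.0) for g in genders)
    if speech == 0:
        # No gendered speech detected: report zero shares instead of dividing by zero.
        return {g: 0.0 for g in genders}
    return {g: totals.get(g, 0.0) / speech for g in genders}


# Hypothetical segmentation output, for illustration only.
example_segments = [
    ("noEnergy", 0.0, 1.0),
    ("female", 1.0, 4.0),
    ("music", 4.0, 10.0),
    ("male", 10.0, 11.0),
    ("female", 11.0, 12.0),
]

totals = speech_time_per_label(example_segments)
shares = gender_speech_share(example_segments)
print(totals)
print(shares)
```

Non-speech labels (music, noise, silence) are left out of the denominator, so the shares describe how speaking time is split between genders rather than how much of the recording contains speech.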
final.onnx and raw81.pth are pretrained x-vector ResNet101 models, obtained from the VBx project (Brno University of Technology): https://github.com/BUTSpeechFIT/VBx/tree/master/VBx/models/ResNet101_16kHz/nnet. For more details, see F. Landini, J. Profant, M. Diez, L. Burget: "Bayesian HMM clustering of x-vector sequences (VBx) in speaker diarization: theory, implementation and analysis on standard tasks" (arXiv version).
interspeech2023_all.hdf5 and interspeech2023_cvfr.hdf5 are x-vector MLP gender classification models trained by @simonD3V. This work is described in a study submitted to Interspeech 2023; further details will be provided upon acceptance.
Classification models used in inaSpeechSegmenter