Data manipulation and transformation for audio signal processing, powered by PyTorch
This release is compatible with PyTorch 2.2.2 patch release. There are no new features added.
This release is compatible with PyTorch 2.2.1 patch release. There are no new features added.
torio
A top-level module dedicated to core I/O operations (https://github.com/pytorch/audio/pull/3676, https://github.com/pytorch/audio/pull/3680, https://github.com/pytorch/audio/pull/3681, https://github.com/pytorch/audio/pull/3682). Please refer to https://pytorch.org/audio/2.2.0/torio.html for the details.

This is a patch release, which is compatible with PyTorch 2.1.2. There are no new features added.
TorchAudio v2.1 introduces new features and backward-incompatible changes:
New features:

- `torchaudio.io.AudioEffector` can apply filters, effects and encodings to waveforms in an online/offline fashion.
- `torchaudio.functional.forced_align` computes alignment from an emission, and `torchaudio.pipelines.MMS_FA` provides access to the model trained for multilingual forced alignment in the MMS: Scaling Speech Technology to 1000+ Languages project. Please refer to https://pytorch.org/audio/2.1/tutorials/forced_alignment_for_multilingual_data_tutorial.html for how one can use `MMS_FA` to align transcripts in multiple languages.
- `torchaudio.pipelines.SQUIM_SUBJECTIVE` and `torchaudio.pipelines.SQUIM_OBJECTIVE` models estimate various speech quality and intelligibility metrics. This is helpful when evaluating the quality of speech generation models, such as TTS.
- `torchaudio.models.decoder.CUCTCDecoder` takes an emission stored in CUDA memory and performs fast CTC beam search on it on the CUDA device. This eliminates the need to move data from the CUDA device to the CPU when performing automatic speech recognition. With PyTorch's CUDA support, it is now possible to run the entire speech recognition pipeline on CUDA.

Improvements:

- `torchaudio.io.StreamWriter` (#3135)
- `torchaudio.io.StreamReader.get_out_stream_info` (#3155)
- `torchaudio.io.StreamReader` filter graph (#3183, #3479)
- `torchaudio.io.StreamWriter` (#3194)
- `torchaudio.io.StreamReader` (#3216)
- `torchaudio.io.StreamWriter` (#3207)
- `yuv420p10le` support to `torchaudio.io.StreamReader` CPU decoder (#3332)
- `torchaudio.io.AudioEffector` (#3163, #3372, #3374)
- `torchaudio.transforms.SpecAugment` (#3309, #3314)
- `torchaudio.functional.forced_align` (#3348, #3355, #3533, #3536, #3354, #3365, #3433, #3357)
- `torchaudio.functional.merge_tokens` (#3535, #3614)
- `torchaudio.functional.frechet_distance` (#3545)
- `torchaudio.models.SquimObjective` for speech enhancement (#3042, #3087, #3512)
- `torchaudio.models.SquimSubjective` for speech enhancement (#3189)
- `torchaudio.models.decoder.CUCTCDecoder` (#3096)
- `torchaudio.pipelines.SquimObjectiveBundle` for speech enhancement (#3103)
- `torchaudio.pipelines.SquimSubjectiveBundle` for speech enhancement (#3197)
- `torchaudio.pipelines.MMS_FA` bundle for forced alignment (#3521, #3538)
- `torchaudio.io.AudioEffector` (#3226)
- `torchaudio.models.decoder.CUCTCDecoder` (#3297)

In this release, the following third-party libraries are removed from TorchAudio binary distributions. TorchAudio now searches for and links these libraries at runtime. Please install them to use the corresponding APIs.
`libsox` is used for various audio I/O and filtering operations. Pre-built binaries are available via package managers such as `conda`, `apt` and `brew`. Please refer to the respective documentation.
The APIs affected include:
- `torchaudio.load` ("sox" backend)
- `torchaudio.info` ("sox" backend)
- `torchaudio.save` ("sox" backend)
- `torchaudio.sox_effects.apply_effects_tensor`
- `torchaudio.sox_effects.apply_effects_file`
- `torchaudio.functional.apply_codec` (also deprecated, see below)

Changes related to the removal: #3232, #3246, #3497, #3035
`flashlight-text` is the core of the CTC decoder. Pre-built packages are available on PyPI. Please refer to https://github.com/flashlight/text for details.
The APIs affected include:
- `torchaudio.models.decoder.CTCDecoder`
Changes related to the removal: #3232, #3246, #3236, #3339
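For context, the decoding problem `CTCDecoder` solves can be illustrated with a minimal, pure-Python sketch of greedy (best-path) CTC decoding: collapse repeated labels, then drop blanks. This is not the library's beam search (which also scores hypotheses with a lexicon and language model); the names and emission values below are illustrative only.

```python
# Hypothetical sketch of greedy CTC decoding; not TorchAudio's API.
BLANK = 0

def ctc_greedy_decode(emission, labels):
    """emission: list of per-frame score lists; labels: index -> symbol."""
    # Pick the best label index per frame (argmax over scores).
    best_path = [max(range(len(frame)), key=frame.__getitem__) for frame in emission]
    # Collapse consecutive repeats, then remove blanks.
    decoded = []
    prev = None
    for idx in best_path:
        if idx != prev and idx != BLANK:
            decoded.append(labels[idx])
        prev = idx
    return "".join(decoded)

# Four frames scoring (blank, 'a', 'b'):
emission = [
    [0.1, 0.8, 0.1],    # 'a'
    [0.1, 0.8, 0.1],    # 'a' again (collapsed as a repeat)
    [0.9, 0.05, 0.05],  # blank
    [0.1, 0.1, 0.8],    # 'b'
]
print(ctc_greedy_decode(emission, {0: "-", 1: "a", 2: "b"}))  # → ab
```

A beam search explores multiple label paths instead of only the per-frame argmax, which is what the library's decoder does at scale.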
A custom-built `libkaldi` was used to implement `torchaudio.functional.compute_kaldi_pitch`. This function, along with the libkaldi integration, is removed in this release. There is no replacement.
Changes related to the removal: #3368, #3403
To make I/O operations more flexible, TorchAudio introduced the backend dispatcher in v2.0, and users could opt-in to use the dispatcher. In this release, the backend dispatcher becomes the default mechanism for selecting the I/O backend.
You can pass the `backend` argument to the `torchaudio.info`, `torchaudio.load` and `torchaudio.save` functions to select the I/O backend library on a per-call basis. (If it is omitted, an available backend is automatically selected.)
If you want to use the global backend mechanism, you can set the environment variable `TORCHAUDIO_USE_BACKEND_DISPATCHER=0`.
Please note, however, that the global backend mechanism is deprecated and is going to be removed in the next release.
Please see #2950 for the details of the migration work.
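The dispatch idea can be sketched as a toy pattern in plain Python. This is not TorchAudio's implementation; the availability table and the priority order below are assumptions for illustration only.

```python
# Toy sketch of backend dispatch: honor an explicit per-call choice,
# otherwise fall back to the first available backend in priority order.
AVAILABLE = {"ffmpeg": False, "sox": True, "soundfile": True}  # assumed

def dispatch(backend=None, priority=("ffmpeg", "sox", "soundfile")):
    if backend is not None:
        # Per-call selection: honor the explicit request or fail loudly.
        if not AVAILABLE.get(backend, False):
            raise RuntimeError(f"backend {backend!r} is not available")
        return backend
    # No backend given: pick the first available one in priority order.
    for name in priority:
        if AVAILABLE[name]:
            return name
    raise RuntimeError("no I/O backend available")

print(dispatch())             # → sox (first available in priority order)
print(dispatch("soundfile"))  # → soundfile (explicit per-call choice)
```

The point of the pattern is that callers never hard-code a backend unless they need one, so the same code runs on installations with different libraries present.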
`torchaudio.io.StreamReader` accepted a byte string wrapped in a 1D `torch.Tensor` object. This is no longer supported.
Please wrap the underlying data with io.BytesIO
instead.
The optional arguments of add_[audio|video]_stream
methods of torchaudio.io.StreamReader
and torchaudio.io.StreamWriter
are now keyword-only arguments.
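What this means for call sites can be sketched with a hypothetical signature in the keyword-only style (the parameter names below are illustrative, not the exact TorchAudio signatures): optional stream options must now be passed by name.

```python
# Hypothetical keyword-only signature; illustrative names only.
def add_basic_audio_stream(frames_per_chunk, *, sample_rate=None, format="fltp"):
    # Arguments after `*` can only be passed by keyword.
    return frames_per_chunk, sample_rate, format

print(add_basic_audio_stream(1024, sample_rate=16000))  # OK: options by name
try:
    add_basic_audio_stream(1024, 16000)  # positional options are now rejected
except TypeError as exc:
    print("TypeError:", exc)
```

Keyword-only options make call sites self-describing and let the library reorder or add options without breaking positional callers.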
Previously TorchAudio supported FFmpeg 4 (>=4.1, <=4.4). In this release, TorchAudio supports FFmpeg 4, 5 and 6 (>=4.4, <7). With this change, support for FFmpeg 4.1, 4.2 and 4.3 is dropped.
`torchaudio.functional.apply_codec` (#3397)
In previous versions, TorchAudio shipped a custom-built `libsox` so that it could perform in-memory decoding and encoding.
Now, in-memory decoding and encoding are handled by the FFmpeg binding, and with the switch to dynamic `libsox` linking, `torchaudio.functional.apply_codec` no longer processes audio in an in-memory fashion. Instead, it writes to a temporary file.
For in-memory processing, please use `torchaudio.io.AudioEffector`.
Switched to `lstsq` when solving InverseMelScale (#3280)
Previously, `torchaudio.transforms.InverseMelScale` ran an SGD optimizer to find the inverse of the mel-scale transform. This approach has a number of issues, as listed in #2643.
This release switches to `torch.linalg.lstsq`.
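The idea behind the switch can be sketched with NumPy on a tiny, hypothetical filter bank: treat the mel transform as a linear map `M = F @ S` and recover `S` by least squares instead of iterative optimization. The matrix shapes and values below are assumptions for illustration, not a real mel filter bank.

```python
import numpy as np

# Hypothetical tiny "mel filter bank" F mapping 8 freq bins to 4 mel bins.
rng = np.random.default_rng(0)
F = np.abs(rng.standard_normal((4, 8)))   # mel bins x freq bins
S = np.abs(rng.standard_normal((8, 10)))  # freq bins x time frames
M = F @ S                                 # the observed mel spectrogram

# Least-squares inverse: argmin_X ||F @ X - M||_2, solved in closed form.
S_hat, *_ = np.linalg.lstsq(F, M, rcond=None)

# M lies in the range of F, so the mel-domain residual is ~0.
print(np.allclose(F @ S_hat, M))  # → True
```

Note that the system is underdetermined (more frequency bins than mel bins), so `lstsq` returns the minimum-norm solution; it matches `M` in the mel domain but is not the unique linear spectrogram.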
The `infer` method of `torchaudio.models.RNNTBeamSearch` has been updated to accept a series of previous hypotheses.
bundle = torchaudio.pipelines.EMFORMER_RNNT_BASE_LIBRISPEECH
decoder: RNNTBeamSearch = bundle.get_decoder()

state, hypothesis = None, None
while streaming:
    ...
    hypo, state = decoder.infer(
        features,
        length,
        beam_width,
        state=state,
        hypothesis=hypothesis,
    )
    ...
    hypothesis = hypo
    # Previously this had to be hypothesis = hypo[0]
Deprecated `torchaudio.functional.apply_codec` function (#3386)
Due to the removal of the custom libsox binding, `torchaudio.functional.apply_codec` no longer supports in-memory processing. Please migrate to `torchaudio.io.AudioEffector`.
Please refer to the documentation for the detailed usage of `torchaudio.io.AudioEffector`.
- `get_trellis` in forced alignment tutorial (#3172)
- `torchaudio.io.StreamWriter` (#3373)
- `lfilter` (#3432)
- `torchaudio.io.StreamWriter` is not opened (#3152)
- `torchaudio.io.StreamReader` (#3157, #3170, #3186, #3184, #3188, #3320, #3296, #3328, #3419, #3209)
- `torchaudio.io.StreamWriter` (#3205, #3319, #3296, #3328, #3426, #3428)
- `n_fft` (#3442)
- `torch.norm` to `torch.linalg.vector_norm` (#3522)
- `torch.nn.utils.weight_norm` to `nn.utils.parametrizations.weight_norm` (#3523)

This is a minor release, which is compatible with PyTorch 2.0.1 and includes bug fixes, improvements and documentation updates. There are no new features added.
Full Changelog: https://github.com/pytorch/audio/compare/v2.0.1...v2.0.2
TorchAudio 2.0 release includes:
- `info`, `load`, `save` functions

The release adds several data augmentation operators under `torchaudio.functional` and `torchaudio.transforms`:
- `torchaudio.functional.add_noise`
- `torchaudio.functional.convolve`
- `torchaudio.functional.deemphasis`
- `torchaudio.functional.fftconvolve`
- `torchaudio.functional.preemphasis`
- `torchaudio.functional.speed`
- `torchaudio.transforms.AddNoise`
- `torchaudio.transforms.Convolve`
- `torchaudio.transforms.Deemphasis`
- `torchaudio.transforms.FFTConvolve`
- `torchaudio.transforms.Preemphasis`
- `torchaudio.transforms.Speed`
- `torchaudio.transforms.SpeedPerturbation`
The operators can be used to synthetically diversify training data to improve the generalizability of downstream models.
For usage details, please refer to the documentation for torchaudio.functional
and torchaudio.transforms
, and tutorial “Audio Data Augmentation”.
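As an illustration of what SNR-based noise addition does conceptually, the noise can be scaled so that the signal-to-noise ratio in decibels hits a requested target. This is a NumPy sketch under assumed conventions, not torchaudio's exact code; the function name and formula arrangement are illustrative.

```python
import numpy as np

def add_noise_at_snr(waveform, noise, snr_db):
    # Scale `noise` so that 10*log10(P_signal / P_noise_scaled) == snr_db.
    p_signal = np.mean(waveform ** 2)
    p_noise = np.mean(noise ** 2)
    scale = np.sqrt(p_signal / (p_noise * 10 ** (snr_db / 10)))
    return waveform + scale * noise

rng = np.random.default_rng(0)
x = np.sin(np.linspace(0, 8 * np.pi, 16000))  # clean "speech" stand-in
n = rng.standard_normal(16000)                # noise
y = add_noise_at_snr(x, n, snr_db=10.0)

# Verify the achieved SNR of the added component.
achieved = 10 * np.log10(np.mean(x ** 2) / np.mean((y - x) ** 2))
print(round(achieved, 3))  # → 10.0
```

Mixing the same clip with noise at several random SNRs is the usual way such operators diversify training data.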
The release adds two self-supervised learning models for speech and audio.
Besides the model architectures, torchaudio also supports corresponding pre-trained pipelines:
- `torchaudio.pipelines.WAVLM_BASE`
- `torchaudio.pipelines.WAVLM_BASE_PLUS`
- `torchaudio.pipelines.WAVLM_LARGE`
- `torchaudio.pipelines.WAV2VEC_XLSR_300M`
- `torchaudio.pipelines.WAV2VEC_XLSR_1B`
- `torchaudio.pipelines.WAV2VEC_XLSR_2B`
For usage details, please refer to factory function
and pre-trained pipelines
documentation.
Release 2.0 introduces new versions of I/O functions torchaudio.info
, torchaudio.load
and torchaudio.save
, backed by a dispatcher that allows for selecting one of backends FFmpeg, SoX, and SoundFile to use, subject to library availability. Users can enable the new logic in Release 2.0 by setting the environment variable TORCHAUDIO_USE_BACKEND_DISPATCHER=1
; the new logic will be enabled by default in Release 2.1.
# Fetch metadata using FFmpeg
metadata = torchaudio.info("test.wav", backend="ffmpeg")
# Load audio (with no backend parameter value provided, function prioritizes using FFmpeg if it is available)
waveform, rate = torchaudio.load("test.wav")
# Write audio using SoX
torchaudio.save("out.wav", waveform, rate, backend="sox")
Please see the documentation for torchaudio
for more details.
Dropped Python 3.7 support (#3020) Following the upstream PyTorch (https://github.com/pytorch/pytorch/pull/93155), the support for Python 3.7 has been dropped.
Default to "precise" seek in torchaudio.io.StreamReader.seek
(#2737, #2841, #2915, #2916, #2970)
Previously, the `StreamReader.seek` method sought to the key frame closest to the given timestamp. A new option `mode` has been added which switches the behavior to seeking to the frame closest to the given timestamp, of any type, including non-key frames, and this behavior is now the default.
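The difference between the two modes can be sketched with a toy model of a stream. The key-frame timestamps and the helper below are hypothetical, not TorchAudio's implementation.

```python
import bisect

# Key-frame timestamps of a hypothetical stream, in seconds.
KEY_FRAMES = [0.0, 2.0, 4.0, 6.0]

def seek(timestamp, mode):
    # A demuxer can only jump to a key frame, so find the closest one
    # at or before the requested timestamp.
    i = bisect.bisect_right(KEY_FRAMES, timestamp) - 1
    key = KEY_FRAMES[max(i, 0)]
    if mode == "key":
        return key        # old default: land on the key frame itself
    # "precise": decode forward from `key` and discard frames until the
    # requested timestamp, so the next returned frame is exactly there.
    return timestamp

print(seek(5.3, "key"))      # → 4.0
print(seek(5.3, "precise"))  # → 5.3
```

Precise seek costs extra decoding (from the key frame up to the target) but returns frames at the requested position, which is why it is now the default.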
Removed deprecated/unused/undocumented functions from datasets.utils (#2926, #2927)
The following functions are removed from `datasets.utils`:
- `stream_url`
- `download_url`
- `validate_file`
- `extract_archive`

Deprecated `onesided` init param for MelSpectrogram (#2797, #2799)
`torchaudio.transforms.MelSpectrogram` assumes the `onesided` argument to always be `True`, and the forward path fails if its value is `False`. Therefore this argument is deprecated; users specifying it should stop specifying it.
Deprecated "sinc_interpolation"
and "kaiser_window"
option value in favor of "sinc_interp_hann"
and "sinc_interp_kaiser"
(#2922)
The valid values of resampling_method
argument of resampling operations (torchaudio.transforms.Resample
and torchaudio.functional.resample
) are changed. "kaiser_window"
is now "sinc_interp_kaiser"
and "sinc_interpolation"
is "sinc_interp_hann"
. The old values will continue to work, but users are encouraged to update their code.
For the reasoning behind this change, please refer to #2891.
Deprecated sox initialization/shutdown public API functions (#3010)
torchaudio.sox_effects.init_sox_effects
and torchaudio.sox_effects.shutdown_sox_effects
are deprecated. They were required to use libsox-related features, but have been called automatically since v0.6, and the initialization/shutdown mechanism has been moved elsewhere. These functions are now no-ops. Users can simply remove the calls to these functions.
Deprecated file-like object support in sox I/O (`torchaudio.load`, `torchaudio.info` and `torchaudio.save`) and in sox effects (`torchaudio.sox_effects.apply_effects_file` and `torchaudio.functional.apply_codec`).
For I/O, to continue using file-like objects, please use the new dispatcher mechanism.
For effects, replacement functions will be added in the next release.

`torchaudio.io.StreamReader` supports decoding media from byte strings contained in 1D tensors of `torch.uint8` type. Using the `torch.Tensor` type as a container for byte strings is now deprecated. To pass byte strings, please wrap the string with `io.BytesIO`.
Deprecated | Migration |
---|---|
`data = b"..."`<br>`src = torch.frombuffer(data, dtype=torch.uint8)`<br>`StreamReader(src)` | `data = b"..."`<br>`src = io.BytesIO(data)`<br>`StreamReader(src)` |
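A minimal self-contained example of the recommended container: `io.BytesIO` wraps the bytes in a file-like object that supports the sequential reads and seeks decoders rely on. The byte content below is a stand-in, not a decodable media file.

```python
import io

data = b"RIFF....WAVEfmt "   # stand-in bytes; imagine an encoded audio file
src = io.BytesIO(data)       # file-like wrapper around the byte string

print(src.read(4))  # → b'RIFF'  (sequential reads, like a real file)
src.seek(0)         # decoders also need to seek within the data
print(src.read(4))  # → b'RIFF'
```

Because `BytesIO` implements the standard file protocol, the same object works anywhere a file opened with `open(..., "rb")` would.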
`torchaudio.functional.lfilter` (#3080)

Without the change in #2873, the WER results are:
Model | dev-clean | dev-other | test-clean | test-other |
---|---|---|---|---|
WAV2VEC2_ASR_LARGE_LV60K_10M | 10.59 | 15.62 | 9.58 | 16.33 |
WAV2VEC2_ASR_LARGE_LV60K_100H | 2.80 | 6.01 | 2.82 | 6.34 |
WAV2VEC2_ASR_LARGE_LV60K_960H | 2.36 | 4.43 | 2.41 | 4.96 |
HUBERT_ASR_LARGE | 1.85 | 3.46 | 2.09 | 3.89 |
HUBERT_ASR_XLARGE | 2.21 | 3.40 | 2.26 | 4.05 |
After applying layer normalization, the updated WER results are:
Model | dev-clean | dev-other | test-clean | test-other |
---|---|---|---|---|
WAV2VEC2_ASR_LARGE_LV60K_10M | 6.77 | 10.03 | 6.87 | 10.51 |
WAV2VEC2_ASR_LARGE_LV60K_100H | 2.19 | 4.55 | 2.32 | 4.64 |
WAV2VEC2_ASR_LARGE_LV60K_960H | 1.78 | 3.51 | 2.03 | 3.68 |
HUBERT_ASR_LARGE | 1.77 | 3.32 | 2.03 | 3.68 |
HUBERT_ASR_XLARGE | 1.73 | 2.72 | 1.90 | 3.16 |
When `shuffle` is set to `True` in `BucketizeBatchSampler`, the seed is only the same for the first epoch. In later epochs, each `BucketizeBatchSampler` object will generate a different shuffled iteration list, which may cause DDP training to hang forever if the lengths of the iteration lists are different across nodes. In the 2.0.0 release, the issue is fixed by using the same seed for the RNG in all nodes.

- `_fail_info_fileobj` (#3032) in `torchaudio.io.StreamReader`.
- `torchaudio.functional.lfilter` (#3018)
- `AddNoise`, `Convolve`, `FFTConvolve`, `Speed`, `SpeedPerturbation`, `Deemphasis`, and `Preemphasis` in `torchaudio.transforms`, and `add_noise`, `fftconvolve`, `convolve`, `speed`, `preemphasis`, and `deemphasis` in `torchaudio.functional`.
Added `fill_buffer` method to `torchaudio.io.StreamReader` (#2954, #2971)

Added `buffer_chunk_size=-1` option to `torchaudio.io.StreamReader` (#2969)
When `buffer_chunk_size=-1`, `StreamReader` does not drop any buffered frame. Together with the `fill_buffer` method, this is a recommended way to load the entire media.
reader = StreamReader("video.mp4")
reader.add_basic_audio_stream(buffer_chunk_size=-1)
reader.add_basic_video_stream(buffer_chunk_size=-1)
reader.fill_buffer()
audio, video = reader.pop_chunks()
`torchaudio.io.StreamReader` (#2975)

`torchaudio.io.StreamReader` now gives the PTS (presentation time stamp) of the media chunk it is returning. To maintain backward compatibility, the timestamp information is attached to the returned media chunk.
reader = StreamReader(...)
reader.add_basic_audio_stream(...)
reader.add_basic_video_stream(...)
for audio_chunk, video_chunk in reader.stream():
# Fetch timestamp
print(audio_chunk.pts)
print(video_chunk.pts)
# Chunks behave the same as torch.Tensor.
audio_chunk.mean(dim=1)
`torchaudio.io.play_audio` (#3026, #3051)
You can play audio with the `torchaudio.io.play_audio` function (macOS only).

Added `torchaudio.utils.ffmpeg_utils`, which can be used to query the dynamically linked FFmpeg libraries:
- `get_demuxers()`
- `get_muxers()`
- `get_audio_decoders()`
- `get_audio_encoders()`
- `get_video_decoders()`
- `get_video_encoders()`
- `get_input_devices()`
- `get_output_devices()`
- `get_input_protocols()`
- `get_output_protocols()`
- `get_build_config()`
Refactor StreamReader/Writer implementation:

- `torchaudio::ffmpeg` namespace with `torchaudio::io` (#3013)
- `pop_chunks` implementations (#3002)
- Added logging to `torchaudio.io.StreamReader`/`Writer` (#2878)
- Fixed the number of threads used by FilterGraph to 1 (#2985)
- Fixed the default number of threads used by the decoder to 1 in `torchaudio.io.StreamReader` (#2949)
- Moved libsox integration from `libtorchaudio` to `libtorchaudio_sox` (#2929)
- Added query methods to FilterGraph (#2976)
- `cuda_version` (#2952)
- `USE_CUDA` detection (#3005)
- `USE_ROCM` detection (#3008)

This is a minor release, which is compatible with PyTorch 1.13.1 and includes bug fixes, improvements and documentation updates. There are no new features added.
TorchAudio 0.13.0 release includes:
Hybrid Demucs is a music source separation model that uses both spectrogram and time domain features. It has demonstrated state-of-the-art performance in the Sony Music DeMixing Challenge. (citation: https://arxiv.org/abs/2111.03600)
The TorchAudio v0.13 release includes the following features:
SDR Results of pre-trained pipelines on MUSDB-HQ test set
Pipeline | All | Drums | Bass | Other | Vocals |
---|---|---|---|---|---|
HDEMUCS_HIGH_MUSDB* | 6.42 | 7.76 | 6.51 | 4.47 | 6.93 |
HDEMUCS_HIGH_MUSDB_PLUS** | 9.37 | 11.38 | 10.53 | 7.24 | 8.32 |
* Trained on the training data of MUSDB-HQ dataset. ** Trained on both training and test sets of MUSDB-HQ and 150 extra songs from an internal database that were specifically produced for Meta.
Special thanks to @adefossez for the guidance.
ConvTasNet model architecture was added in TorchAudio 0.7.0. It is the first source separation model that outperforms the oracle ideal ratio mask. In this release, TorchAudio adds the pre-trained pipeline that is trained within TorchAudio on the Libri2Mix dataset. The pipeline achieves 15.6dB SDR improvement and 15.3dB Si-SNR improvement on the Libri2Mix test set.
With the addition of four new audio-related datasets, there is now support for all downstream tasks in version 1 of the SUPERB benchmark. Furthermore, these datasets support metadata mode through a get_metadata
function, which enables faster dataset iteration or preprocessing without the need to load or store waveforms.
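The metadata-mode idea can be sketched with a toy dataset class. The class, field names and return values below are illustrative, not the actual dataset APIs: `get_metadata` returns the same fields as `__getitem__` but skips the expensive audio decoding.

```python
# Hypothetical sketch of a dataset with a metadata mode.
class ToyDataset:
    def __init__(self, entries):
        # entries: list of (path, sample_rate, transcript)
        self._entries = entries

    def get_metadata(self, n):
        # Cheap: returns metadata without decoding any audio.
        return self._entries[n]

    def __getitem__(self, n):
        path, sample_rate, transcript = self._entries[n]
        waveform = self._decode(path)  # the expensive step
        return waveform, sample_rate, transcript

    def _decode(self, path):
        # Stand-in for actual audio decoding.
        return [0.0, 0.0, 0.0, 0.0]

ds = ToyDataset([("a.wav", 16000, "hello")])
print(ds.get_metadata(0))  # → ('a.wav', 16000, 'hello'), no audio touched
```

Iterating with `get_metadata` lets preprocessing (building vocabularies, filtering by duration or label) run over the whole dataset without ever loading waveforms.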
Datasets with metadata functionality:
In release 0.12, TorchAudio released a CTC beam search decoder with KenLM language model support. This release adds functionality for creating custom Python language models that are compatible with the decoder, using the `torchaudio.models.decoder.CTCDecoderLM` wrapper.
torchaudio.io.StreamWriter
is a class for encoding media including audio and video. This can handle a wide variety of codecs, chunk-by-chunk encoding and GPU encoding.
- The `GriffinLim` implementations in transforms and functional used the `momentum` parameter differently, resulting in inconsistent results between the two implementations. The `transforms.GriffinLim` usage of `momentum` is updated to resolve this discrepancy.
- Make `torchaudio.info` decode audio to compute `num_frames` if it is not found in metadata (#2740). In such cases, `torchaudio.info` may now return non-zero values for `num_frames`.
- `torchaudio.compliance.kaldi.fbank` with the dither option produced a different output from Kaldi because it used a skewed, rather than Gaussian, distribution for dither. This is updated in this release to correctly use a random Gaussian instead.
- Replaced the `runtime_error` exception with `TORCH_CHECK` (#2550, #2551, #2592)

Benchmark of the `torchaudio.functional.resample` function using the sinc resampling method, on a `float32` tensor with two channels and one second duration.

CPU
torchaudio version | 8k → 16k [Hz] | 16k → 8k | 16k → 44.1k | 44.1k → 16k |
---|---|---|---|---|
0.13 | 0.256 | 0.549 | 0.769 | 0.820 |
0.12 | 0.386 | 0.534 | 31.8 | 12.1 |
CUDA
torchaudio version | 8k → 16k [Hz] | 16k → 8k | 16k → 44.1k | 44.1k → 16k |
---|---|---|---|---|
0.13 | 0.332 | 0.336 | 0.345 | 0.381 |
0.12 | 0.524 | 0.334 | 64.4 | 22.8 |
WER improvement on LibriSpeech dev and test sets
Viterbi (v0.12) | Viterbi (v0.13) | KenLM (v0.12) | KenLM (v0.13) | |
---|---|---|---|---|
dev-clean | 10.7 | 10.9 | 4.4 | 4.2 |
dev-other | 18.3 | 17.5 | 9.7 | 9.4 |
test-clean | 10.8 | 10.9 | 4.4 | 4.4 |
test-other | 18.5 | 17.8 | 10.1 | 9.5 |
:autosummary:
in torchaudio docs (#2664, #2681, #2683, #2684, #2693, #2689, #2690, #2692)