Data manipulation and transformation for audio signal processing, powered by PyTorch
This is a minor release, which is compatible with PyTorch 1.12.1 and includes small bug fixes, improvements, and documentation updates. There are no new features added.
For the full feature set of v0.12, please refer to the v0.12.0 release notes.
TorchAudio 0.12.0 includes the following:
To support inference-time decoding, the release adds the wav2letter CTC beam search decoder, ported over from Flashlight (GitHub). Both lexicon and lexicon-free decoding are supported, and decoding can be done without a language model or with a KenLM n-gram language model. Compatible token, lexicon, and certain pretrained KenLM files for the LibriSpeech dataset are also available for download.
For usage details, please check out the documentation and ASR inference tutorial.
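As a minimal sketch of lexicon-based decoding with a KenLM language model, following the pattern in the ASR inference tutorial (the emission tensor below is a random stand-in for real acoustic-model output):

```python
import torch
from torchaudio.models.decoder import ctc_decoder, download_pretrained_files

# Fetch token, lexicon, and KenLM 4-gram files prepared for LibriSpeech
files = download_pretrained_files("librispeech-4-gram")

decoder = ctc_decoder(
    lexicon=files.lexicon,
    tokens=files.tokens,
    lm=files.lm,    # omit for language-model-free decoding
    nbest=3,
    beam_size=50,
)

# Dummy emission for illustration; in practice this comes from an
# acoustic model such as a Wav2Vec2 CTC model: (batch, frames, tokens)
with open(files.tokens) as f:
    num_tokens = len(f.read().split())
emission = torch.randn(1, 100, num_tokens).log_softmax(-1)

hypotheses = decoder(emission)   # List[List[CTCHypothesis]]
print(hypotheses[0][0].words)    # word transcript of the best hypothesis
```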
To improve flexibility in usage, the release adds two new beamforming modules under torchaudio.transforms: SoudenMVDR and RTFMVDR. They differ from MVDR mainly in that they accept reference_channel as an input argument in the forward method, to allow users to select the reference channel in model training or dynamically change the reference channel in inference. Besides the two modules, the release adds new function-level beamforming methods under torchaudio.functional. These include psd, mvdr_weights_souden, mvdr_weights_rtf, rtf_evd, rtf_power, and apply_beamforming.
For usage details, please check out the documentation at torchaudio.transforms and torchaudio.functional and the Speech Enhancement with MVDR Beamforming tutorial.
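A minimal sketch of mask-based beamforming with the new APIs (the spectrogram and masks below are random placeholders):

```python
import torch
import torchaudio.functional as F
import torchaudio.transforms as T

# Illustrative multi-channel complex spectrogram: (channel, freq, time)
specgram = torch.randn(4, 257, 100, dtype=torch.cfloat)
mask_speech = torch.rand(257, 100)   # time-frequency mask for target speech
mask_noise = torch.rand(257, 100)    # time-frequency mask for noise

# Estimate PSD matrices from the masks with the new functional API
psd_speech = F.psd(specgram, mask_speech)
psd_noise = F.psd(specgram, mask_noise)

# Souden MVDR, selecting channel 0 as the reference at call time
beamformer = T.SoudenMVDR()
enhanced = beamformer(specgram, psd_speech, psd_noise, reference_channel=0)
```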
StreamReader is TorchAudio’s new I/O API. It is backed by FFmpeg† and allows users to decode audio and video from a variety of sources, including local files, network protocols, and file-like objects, and to iterate over and process the media chunk by chunk.
For usage details, please check out the documentation and tutorials:
† To use StreamReader, FFmpeg libraries are required. Please install FFmpeg. The coverage of codecs depends on how these libraries are configured. TorchAudio official binaries are compiled to work with FFmpeg 4 libraries; FFmpeg 5 can be used if TorchAudio is built from source.
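A minimal sketch of chunk-by-chunk decoding (the source file name is hypothetical):

```python
from torchaudio.io import StreamReader

# Open a media source; a local path is shown, but URLs and
# file-like objects work as well
streamer = StreamReader(src="example.wav")
streamer.add_basic_audio_stream(frames_per_chunk=8000, sample_rate=8000)

# Iterate over the stream, decoding chunk by chunk
for (chunk,) in streamer.stream():
    print(chunk.shape)  # up to 8000 frames per chunk, resampled to 8000 Hz
```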
- MP3 decoding is now handled by FFmpeg. To load MP3 audio with torchaudio.load, please install a compatible version of FFmpeg (Version 4 when using an official binary distribution). Note that torchaudio.info now returns num_frames=0 for MP3.
- Previously, Hypothesis subclassed namedtuple. Containers of namedtuple instances, however, are incompatible with the PyTorch Lite Interpreter. To achieve compatibility, Hypothesis has been modified in release 0.12 to instead alias tuple. This affects RNNTBeamSearch, as it accepts and returns a list of Hypothesis instances.
- Internally, some computations upcast to complex128 to improve the precision and robustness of downstream matrix computations. The output dtype, however, was not correctly converted back to the original dtype. In release 0.12, we fix the output dtype to be consistent with the original input dtype.

The release also improves the performance of torchaudio.transforms.PitchShift. The following table shows the time it takes for torchaudio.transforms.PitchShift, after its first call, to perform the operation on a float32 Tensor with two channels and 8000 frames, resampled to 44.1 kHz, across various shift steps.

| TorchAudio version \ shift steps | 2 | 3 | 4 | 5 |
|---|---|---|---|---|
| 0.12 | 2.76 | 5 | 1860 | 223 |
| 0.11 | 6.71 | 161 | 8680 | 1450 |
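A minimal sketch of the cached behavior (shapes and parameters are illustrative):

```python
import torch
import torchaudio.transforms as T

waveform = torch.randn(2, 8000)   # (channel, time)
pitch_shift = T.PitchShift(sample_rate=44100, n_steps=4)

shifted = pitch_shift(waveform)   # the first call builds the resampling kernel
shifted = pitch_shift(waveform)   # subsequent calls reuse it and run faster
```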
- Use __getattr__ to implement delayed initialization (#2377)

The TorchAudio 0.11.0 release includes:
To support streaming ASR use cases, the release adds implementations of Emformer (docs), an RNN-T model that uses Emformer (emformer_rnnt_base), and an RNN-T beam search decoder (RNNTBeamSearch). It also includes a pipeline bundle (EMFORMER_RNNT_BASE_LIBRISPEECH) that wraps pre- and post-processing components, the beam search decoder, and the RNN-T Emformer model with weights pre-trained on LibriSpeech, which together allow for performing streaming ASR inference out of the box. For reference and reproducibility, the release provides the training recipe used to produce the pre-trained weights in the examples directory.
The masked prediction training of the HuBERT model requires the masked logits, unmasked logits, and feature norm as outputs. The logits are used for the cross-entropy losses and the feature norm for the penalty loss. The release adds HuBERTPretrainModel and corresponding factory functions (hubert_pretrain_base, hubert_pretrain_large, and hubert_pretrain_xlarge) to enable training from scratch, as sketched below.
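A minimal sketch of a pretraining forward pass (the batch shapes and the frame-level pseudo-label layout are illustrative assumptions; the "base" configuration defaults to a 100-class pseudo-label vocabulary):

```python
import torch
from torchaudio.models import hubert_pretrain_base

model = hubert_pretrain_base()

waveforms = torch.randn(2, 16000)          # (batch, time): one second at 16 kHz
labels = torch.randint(0, 100, (2, 49))    # frame-level pseudo-labels (~50 Hz)
lengths = torch.tensor([16000, 16000])     # valid samples per utterance

# Returns the masked logits, unmasked logits, and the feature penalty term
logit_m, logit_u, feature_penalty = model(waveforms, labels, lengths)
```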
The release adds an implementation of Conformer (docs), a convolution-augmented transformer architecture that has achieved state-of-the-art results on speech recognition benchmarks.
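A minimal sketch of instantiating and running the model (hyperparameters and shapes are illustrative):

```python
import torch
from torchaudio.models import Conformer

model = Conformer(
    input_dim=80, num_heads=4, ffn_dim=128,
    num_layers=4, depthwise_conv_kernel_size=31)

features = torch.randn(10, 300, 80)   # (batch, frames, feature dim)
lengths = torch.full((10,), 300)      # valid lengths per utterance
output, output_lengths = model(features, lengths)
```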
- Removed deprecated F.magphase, F.angle, F.complex_norm, and T.ComplexNorm (#1934, #1935, #1942)
- Removed deprecated pseudo complex type support from F.spectrogram, T.Spectrogram, F.phase_vocoder, and T.TimeStretch (#1957, #1958)
- Removed deprecated create_fb_matrix (#1998). create_fb_matrix was replaced by melscale_fbanks in release 0.10 and is removed in 0.11. Please use melscale_fbanks.
- Removed the deprecated VCTK dataset. Please use the VCTK_092 class for the latest version of the dataset.
- Removed diskcache_iterator and bg_iterator, which were deprecated in 0.10. Please cease the usage of them.
- The pretrained Wav2Vec2 ASR models included output dimensions (<s>, <pad>, </s>, <unk>) that were not related to ASR tasks and not used. These dimensions were removed.

This is a minor release compatible with PyTorch 1.10.2.
There is no feature change in torchaudio from 0.10.1. For the full feature of v0.10, please refer to the v0.10.0 release notes.
This is a minor release, which is compatible with PyTorch 1.10.1 and includes small bug fixes, improvements, and documentation updates. There are no new features added.

- TORCH_CUDA_ARCH_LIST delimiter

For the full feature set of v0.10, please refer to the v0.10.0 release notes.
The torchaudio 0.10.0 release includes:
HuBERT model architectures (“base”, “large”, and “extra large” configurations) are added. In addition, support for pretrained weights from wav2vec 2.0, Unsupervised Cross-lingual Representation Learning, and HuBERT is added.
These pretrained weights can be used for feature extractions and downstream task adaptation.
>>> import torchaudio
>>>
>>> # Build the model and load pretrained weight.
>>> model = torchaudio.pipelines.HUBERT_BASE.get_model()
>>> # Perform feature extraction.
>>> features, lengths = model.extract_features(waveforms)
>>> # Pass the features to downstream task
>>> ...
Some of the pretrained weights are fine-tuned for ASR tasks. The following example illustrates how to use the weights and access associated information, such as labels, which can be used in subsequent CTC decoding steps. (Note: torchaudio does not provide a CTC decoding mechanism.)
>>> import torchaudio
>>>
>>> bundle = torchaudio.pipelines.HUBERT_ASR_LARGE
>>>
>>> # Build the model and load pretrained weight.
>>> model = bundle.get_model()
Downloading:
100%|███████████████████████████████| 1.18G/1.18G [00:17<00:00, 73.8MB/s]
>>> # Check the corresponding labels of the output.
>>> labels = bundle.get_labels()
>>> print(labels)
('<s>', '<pad>', '</s>', '<unk>', '|', 'E', 'T', 'A', 'O', 'N', 'I', 'H', 'S', 'R', 'D', 'L', 'U', 'M', 'W', 'C', 'F', 'G', 'Y', 'P', 'B', 'V', 'K', "'", 'X', 'J', 'Q', 'Z')
>>>
>>> # Infer the label probability distribution
>>> waveform, sample_rate = torchaudio.load('hello-world.wav')
>>>
>>> emissions, _ = model(waveform)
>>>
>>> # Pass emission to (hypothetical) decoder
>>> transcripts = ctc_decode(emissions, labels)
>>> print(transcripts[0])
HELLO WORLD
A new model architecture, Tacotron2, is added, alongside several pretrained weights for TTS (text-to-speech). Because these TTS pipelines are composed of multiple models and specific data processing steps, the notion of a bundle is introduced to make the associated objects easy to use together. Bundles provide a common access point to create a pipeline with a set of pretrained weights. They are available under the torchaudio.pipelines module.
The following example illustrates a TTS pipeline where two models (Tacotron2 and WaveRNN) are used together.
>>> import torchaudio
>>>
>>> bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH
>>>
>>> # Build text processor, Tacotron2 and vocoder (WaveRNN) model
>>> processor = bundle.get_text_preprocessor()
>>> tacotron2 = bundle.get_tacotron2()
Downloading:
100%|███████████████████████████████| 107M/107M [00:01<00:00, 87.9MB/s]
>>> vocoder = bundle.get_vocoder()
Downloading:
100%|███████████████████████████████| 16.7M/16.7M [00:00<00:00, 78.1MB/s]
>>>
>>> text = "Hello World!"
>>>
>>> # Encode text
>>> input, lengths = processor(text)
>>>
>>> # Generate (mel-scale) spectrogram
>>> specgram, lengths, _ = tacotron2.infer(input, lengths)
>>>
>>> # Convert spectrogram to waveform
>>> waveforms, lengths = vocoder(specgram, lengths)
>>>
>>> # Save audio
>>> torchaudio.save('hello-world.wav', waveforms, vocoder.sample_rate)
The loss function used in the RNN transducer architecture, which is widely used for speech recognition tasks, is added. The loss function (torchaudio.functional.rnnt_loss or torchaudio.transforms.RNNTLoss) supports float16 and float32 logits, has autograd and TorchScript support, and can be run on both CPU and GPU; the GPU path uses a custom CUDA kernel implementation for improved performance.
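A minimal sketch of computing the loss on random data (shapes are illustrative; note that targets and lengths are expected as int32):

```python
import torch
from torchaudio.transforms import RNNTLoss

batch, time, target_len, num_classes = 1, 10, 4, 20

# logits: (batch, max source length, max target length + 1, num classes)
logits = torch.randn(
    batch, time, target_len + 1, num_classes, requires_grad=True)
targets = torch.randint(
    0, num_classes - 1, (batch, target_len), dtype=torch.int32)
logit_lengths = torch.tensor([time], dtype=torch.int32)
target_lengths = torch.tensor([target_len], dtype=torch.int32)

# blank defaults to the last class index
loss = RNNTLoss()(logits, targets, logit_lengths, target_lengths)
loss.backward()
```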
This release adds support for MVDR beamforming on multi-channel audio using time-frequency masks. It offers three solutions (ref_channel, stv_evd, stv_power) and supports both single-channel and multi-channel masks (multi-channel masks are averaged within the method). It also provides an online option that recursively updates the parameters for streaming audio. Please refer to the MVDR tutorial.
This release adds GPU builds that support custom CUDA kernels in torchaudio, like the one being used for RNN transducer loss. Following this change, torchaudio’s binary distribution now includes CPU-only versions and CUDA-enabled versions. To use CUDA-enabled binaries, PyTorch also needs to be compatible with CUDA.
torchaudio.functional.lfilter now supports batch processing and multiple filters. Additional operations, including pitch shift, LFCC, and inverse spectrogram, are now supported in this release. The datasets CMUDict and LibriMix are added as well.
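As a sketch of the multi-filter lfilter usage (the coefficients here are arbitrary; with 2D coefficients, one filter is applied per channel under the default batching behavior):

```python
import torch
import torchaudio.functional as F

waveform = torch.randn(2, 8000).clamp(-1, 1)   # (num_filters, time)

# Two filters applied in one call: coefficients are (num_filters, order + 1)
b_coeffs = torch.tensor([[0.4, 0.2, 0.9],
                         [0.3, 0.2, 0.5]])
a_coeffs = torch.tensor([[1.0, 0.1, 0.3],
                         [1.0, 0.2, 0.4]])

filtered = F.lfilter(waveform, a_coeffs, b_coeffs)  # (2, 8000)
```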
- When saving audio, PCM_24 (the previous default) could cause warping. The default has been changed to PCM_16, which does not suffer this.
- When power=None, torchaudio.functional.spectrogram and torchaudio.transforms.Spectrogram now default to return_complex=True, which returns a Tensor of a native complex type (such as torch.cfloat and torch.cdouble). To use a pseudo complex type, pass the resulting tensor to torch.view_as_real.
- Removed the deprecated torchaudio.compliance.kaldi.resample_waveform. Please use torchaudio.functional.resample.
- specgram
- Changed extract_features of Wav2Vec2Model (#1776). For the output of the convolutional feature extractor, please use Wav2Vec2Model.feature_extractor().
- The model structure of Wav2Vec2Model was updated. The Wav2Vec2Model.encoder.read_out module is moved to Wav2Vec2Model.aux. If you have a serialized state dict, please replace the key encoder.read_out with aux.
- The num_out parameter has been changed to aux_num_out, and other parameters are added before it. Please update code from wav2vec2_base(num_out) to wav2vec2_base(aux_num_out=num_out).
- Added melscale_fbanks and deprecated create_fb_matrix (#1653). As linear_fbanks is introduced, create_fb_matrix is renamed to melscale_fbanks. The original create_fb_matrix is now deprecated. Please use melscale_fbanks.
- Deprecated the VCTK dataset (#1810). Please use the VCTK_092 dataset.
- bg_iterator and diskcache_iterator are known to not improve the throughput of data loaders. Please cease their usage.
- Tacotron2
- HuBERT

CUDA:
| torchaudio version \ Tensor shape | [1,4,8000] | [1,4,16000] | [1,4,32000] |
|---|---|---|---|
| 0.10 | 119 | 120 | 123 |
| 0.9 | 160 | 184 | 240 |

Unit: msec
This release depends on PyTorch 1.9.1. There are no functional changes other than minor updates to CI rules.
torchaudio 0.9.0 release includes:
This release includes model architectures from the wav2vec 2.0 paper, with utility functions that allow importing pretrained model parameters published on fairseq and Hugging Face Hub. Now you can easily run speech recognition with torchaudio. These model architectures also support TorchScript, and you can deploy them with ONNX or in non-Python environments, such as C++, Android, and iOS. Please check out our C++, Android, and iOS examples. The following snippets illustrate how to create a deployable model.
# Import fine-tuned model from Hugging Face Hub
from transformers import Wav2Vec2ForCTC
from torchaudio.models.wav2vec2.utils import import_huggingface_model

original = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
imported = import_huggingface_model(original)

# Import fine-tuned model from fairseq
import fairseq
from torchaudio.models.wav2vec2.utils import import_fairseq_model

original, _, _ = fairseq.checkpoint_utils.load_model_ensemble_and_task(
    ["wav2vec_small_960h.pt"], arg_overrides={'data': "<data_dir>"})
imported = import_fairseq_model(original[0].w2v_encoder)

# Build uninitialized model and load state dict
from torchaudio.models import wav2vec2_base

model = wav2vec2_base(num_out=32)
model.load_state_dict(imported.state_dict())

# Quantize / script / optimize for mobile
import torch
from torch.utils.mobile_optimizer import optimize_for_mobile

quantized_model = torch.quantization.quantize_dynamic(
    model, qconfig_spec={torch.nn.Linear}, dtype=torch.qint8)
scripted_model = torch.jit.script(quantized_model)
optimized_model = optimize_for_mobile(scripted_model)
optimized_model.save("model_for_deployment.pt")
The internal implementation of lfilter
has been updated to support autograd on both CPU and CUDA. Additionally, the performance on CPU is significantly improved. These improvements also apply to biquad
variants.
The following table illustrates the performance improvements compared against the previous releases. lfilter was applied to float32 tensors with one channel and different numbers of frames.
| torchaudio version \ number of frames | 256 | 512 | 1024 |
|---|---|---|---|
| 0.9 | 0.282 | 0.381 | 0.564 |
| 0.8 | 0.493 | 0.780 | 1.37 |
| 0.7 | 5.42 | 10.8 | 22.3 |

Unit: msec
torchaudio has functions that handle complex-valued tensors. In the early days, when PyTorch did not have a complex dtype, torchaudio adopted the convention of using an extra dimension to represent the real and imaginary parts. In PyTorch 1.6, new dtypes, such as torch.cfloat and torch.cdouble, were introduced to represent complex values natively. (In the following, we refer to torchaudio’s original convention as pseudo complex types, and PyTorch’s native dtypes as native complex types.)
As the native complex types have become mature and stable, torchaudio
has started to migrate complex functions to use the native complex type. In this release, the internal implementation was updated to use the native complex types, and interfaces were updated to allow passing/receiving native complex type directly. Users can choose to keep using the pseudo complex type or opt in to use native complex type. However, please note that the use of the pseudo complex type is now deprecated. These functions are tested to support TorchScript and autograd. For the detail of this migration plan, please refer to #1337.
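For example, passing return_complex=True requests the native complex dtype directly (a minimal sketch; parameters are illustrative):

```python
import torch
import torchaudio.transforms as T

waveform = torch.randn(1, 16000)

# Request a native complex output by disabling the power spectrum
spectrogram = T.Spectrogram(n_fft=400, power=None, return_complex=True)
spec = spectrogram(waveform)
print(spec.dtype)   # torch.complex64 (native complex type)

# Convert to the legacy pseudo complex layout if needed
pseudo = torch.view_as_real(spec)
```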
Additionally, switching the internal implementation to the native complex types improved the performance. Since the internal implementation uses native complex type regardless of which complex type is passed/returned, users will automatically benefit from this performance improvement.
The following table illustrates the performance improvements from the previous release by comparing the time it takes for complex transforms to perform the operation on a float32 Tensor with two channels and 256 frames.
CPU:

| torchaudio version | Spectrogram | TimeStretch | GriffinLim |
|---|---|---|---|
| 0.9 | 0.229 | 12.6 | 3320 |
| 0.8 | 0.283 | 126 | 5320 |

Unit: msec
CUDA:

| torchaudio version | Spectrogram | TimeStretch | GriffinLim |
|---|---|---|---|
| 0.9 | 0.195 | 0.599 | 36 |
| 0.8 | 0.219 | 0.687 | 60.2 |

Unit: msec
Along with the work of Complex Tensor Migration and Filtering Improvement mentioned above, more tests were added to ensure autograd support. Now the following operations are guaranteed to support autograd up to second order.

torchaudio.functional:
- lfilter
- allpass_biquad
- biquad
- band_biquad
- bandpass_biquad
- bandreject_biquad
- bass_biquad
- equalizer_biquad
- treble_biquad
- highpass_biquad
- lowpass_biquad

torchaudio.transforms:
- AmplitudeToDB
- ComputeDeltas
- Fade
- GriffinLim
- TimeMasking
- FrequencyMasking
- MFCC
- MelScale
- MelSpectrogram
- Resample
- SpectralCentroid
- Spectrogram
- SlidingWindowCmn
- TimeStretch *
- Vol

NOTE: The following functionals are also covered:
- amplitude_to_DB
- spectrogram
- griffinlim
- resample
- phase_vocoder *
- mask_along_axis_iid
- mask_along_axis
- gain
- spectral_centroid
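To illustrate the second-order guarantee, a minimal gradcheck sketch (inputs and coefficients are illustrative; clamp is disabled to keep the function smooth everywhere):

```python
import torch
from torch.autograd import gradcheck, gradgradcheck
import torchaudio.functional as F

# Double precision inputs, as required by gradcheck
waveform = (0.1 * torch.randn(1, 200, dtype=torch.float64)).requires_grad_()
a_coeffs = torch.tensor([1.0, 0.2, 0.3], dtype=torch.float64, requires_grad=True)
b_coeffs = torch.tensor([0.4, 0.2, 0.9], dtype=torch.float64, requires_grad=True)

def fn(waveform, a_coeffs, b_coeffs):
    return F.lfilter(waveform, a_coeffs, b_coeffs, clamp=False)

assert gradcheck(fn, (waveform, a_coeffs, b_coeffs))       # first order
assert gradgradcheck(fn, (waveform, a_coeffs, b_coeffs))   # second order
```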
* torchaudio.transforms.TimeStretch and torchaudio.functional.phase_vocoder call atan2, which is not differentiable around zero. Therefore these functions are differentiable only when the input spectrogram does not contain values around zero.

In release 0.8, the resampling operation was vectorized and its performance improved. In this release, the implementation of the resampling algorithm has been further revised:

- A rolloff parameter has been added for anti-aliasing control.
- torchaudio.transforms.Resample precomputes the kernel using float64 precision and caches it for even faster operation.
- A new entry point, torchaudio.functional.resample, has been added, and the original entry point, torchaudio.compliance.kaldi.resample_waveform, is deprecated.

The following table illustrates the performance improvements from the previous release by comparing the time it takes for torchaudio.transforms.Resample to complete the operation on a float32 tensor with two channels and one-second duration.
CPU:

| torchaudio version | 8k → 16k [Hz] | 16k → 8k | 16k → 44.1k | 44.1k → 16k |
|---|---|---|---|---|
| 0.9 | 0.192 | 0.559 | 0.478 | 0.467 |
| 0.8 | 0.537 | 0.753 | 43.9 | 17.6 |

Unit: msec
CUDA:

| torchaudio version | 8k → 16k [Hz] | 16k → 8k | 16k → 44.1k | 44.1k → 16k |
|---|---|---|---|---|
| 0.9 | 0.203 | 0.172 | 0.213 | 0.212 |
| 0.8 | 0.860 | 0.559 | 116 | 46.7 |

Unit: msec
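A minimal sketch of the two entry points (parameters are illustrative):

```python
import torch
import torchaudio.functional as F
import torchaudio.transforms as T

waveform = torch.randn(2, 16000)  # two channels, one second at 16 kHz

# One-off functional entry point, with anti-aliasing control via rolloff
resampled = F.resample(waveform, orig_freq=16000, new_freq=8000, rolloff=0.99)

# The transform precomputes and caches the kernel, so repeated calls are fast
resampler = T.Resample(orig_freq=16000, new_freq=8000)
resampled = resampler(waveform)
```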
torchaudio implements some operations in C++ for reasons such as performance and integration with third-party libraries. This C++ module was previously only available on Linux and macOS. In this release, Windows packages also come with the C++ module.
The C++ module in the Windows package includes the efficient filtering implementation mentioned above; however, the “sox_io” backend and torchaudio.functional.compute_kaldi_pitch are not included.
Since the 0.6 release, we have continuously improved I/O functionality. Specifically, in 0.8 the default backend was changed from “sox” to “sox_io”, and a similar API change was applied to the “soundfile” backend. The 0.9 release concludes this migration by removing the deprecated backends. For details, please refer to #903.
- Removed the normalized argument from torchaudio.functional.griffinlim (#1369)
- Renamed the torchaudio.functional.sliding_window_cmn arg for correctness (#1347). If you were passing the keyword argument waveform=..., please change it to specgram=...
- Changed torchaudio.transforms.Resample to precompute and cache the resampling kernel (#1499, #1514). To use the transform on a CUDA device, move it there first:

resampler = torchaudio.transforms.Resample(orig_freq=8000, new_freq=44100)
resampler.to(torch.device("cuda"))

- torchaudio no longer supports programmatic download of the Common Voice dataset. Please remove the arguments from your code.
- torchaudio is adopting the native complex type, and the use of the pseudo complex type and the related utility functions is now deprecated. Please refer to #1337 for the migration process.
- Deprecated torchaudio.compliance.kaldi.resample_waveform (#1533). Please use torchaudio.functional.resample.
- torchaudio.transforms.MelScale now expects a valid n_stft value (#1515). Please provide a valid value of n_stft.
- torchaudio.functional.lfilter (#1319)
- torchaudio.functional.lfilter (#1310, #1441)
- torchaudio.functional.resample (#1402)
- Added the rolloff parameter (#1488)
- torchaudio.transforms.Resample (#1499, #1514, #1556)
- torchaudio.functional.phase_vocoder and torchaudio.transforms.TimeStretch (#1410)
- Added return_complex to torchaudio.functional.spectrogram and torchaudio.transforms.Spectrogram (#1366, #1551)
- Added a __str__ override to AudioMetaData for easy print (#1339)
- sox/utils.cpp (#1306)
- Removed check_length from validate_input_file (#1312)
- torchaudio.functional.griffinlim (#1368)
- torchaudio.transforms.MelScale when n_stft is invalid (#1505)
- __all__ (#1458)
- reference_cast in make_boxed_from_unboxed_functor (#1300)
- torchaudio.transforms.GriffinLim (#1433)
- Replaced librosa’s Mel scale conversion with torchaudio’s in the WaveRNN example (#1444)
- Updated config.guess to support source builds on recent architectures (#1484)
- torchaudio.functional.lfilter and biquad variants (#1400, #1438)
- torchaudio.transforms.FrequencyMasking (#1498)
- torchaudio.transforms.SlidingWindowCmn (#1482)
- torchaudio.transforms.MelScale (#1467)
- torchaudio.transforms.Vol (#1460)
- torchaudio.transforms.TimeStretch (#1420)
- torchaudio.transforms.AmplitudeToDB (#1447)
- torchaudio.transforms.GriffinLim (#1421)
- torchaudio.transforms.SpectralCentroid (#1425)
- torchaudio.transforms.ComputeDeltas (#1422)
- torchaudio.transforms.Fade (#1424)
- torchaudio.transforms.Resample (#1416)
- torchaudio.transforms.MFCC (#1415)
- torchaudio.transforms.Spectrogram / MelSpectrogram (#1340)
- torchaudio.functional.lfilter shape (#1360)
- torchaudio.functional.resample (#1516)
- torchaudio.functional.phase_vocoder (#1379)
- Replaced floor_divide with div (#1455)
- Replaced torch.assert_allclose with assertEqual (#1387)
- torchaudio.functional.lfilter autograd tests input size (#1443)
- torchaudio.transforms.InverseMelScale comparison test (#1437)
- Changed torchaudio.transforms.TimeMasking and torchaudio.transforms.FrequencyMasking to perform out-of-place masking (#1481)
- power of torchaudio.transforms.MelSpectrogram as float only (#1572)
- Use torch.nn.functional.conv1d in torchaudio.functional.lfilter (#1318)
- torchaudio.functional.overdrive (#1299)
- sox_effects.apply_effects_tensor is CPU-only (#1459)
- sliding_window_cmn (#1383)

This release depends on PyTorch 1.8.1.
This release supports Python 3.9.
Continuing from the previous release, torchaudio improves the audio I/O mechanism. In this release, we have four major updates.
Backend migration. We have migrated the default backend for audio I/O. The new default backend is “sox_io” (for Linux/macOS). The interface for the “soundfile” backend has also been changed to align with that of “sox_io”. Following the change of default backends, the legacy backend/interface has been marked as deprecated. It is still accessible, though its use is strongly discouraged. For details on the migration, please refer to #903.
File-like object support. We have added file-like object support to I/O functions and sox_effects. You can perform the info, load, save, and apply_effects_file operations on file-like objects.
# Prerequisite imports for the snippets below
import io
import tarfile

import boto3
import requests
import torchaudio

# Query audio metadata over HTTP
# Will only fetch the first few kB
with requests.get(URL, stream=True) as response:
    metadata = torchaudio.info(response.raw)

# Load audio from tar file
# No need to extract the TAR file.
with tarfile.open(TAR_PATH, mode='r') as tarfile_:
    fileobj = tarfile_.extractfile(SAMPLE_TAR_ITEM)
    waveform, sample_rate = torchaudio.load(fileobj)

# Saving to Bytes buffer
# Using BytesIO, you can perform in-memory encoding/decoding.
buffer_ = io.BytesIO()
torchaudio.save(buffer_, waveform, sample_rate, format="wav")

# Apply effects (lowpass filter / resampling) while loading audio from S3
client = boto3.client('s3')
response = client.get_object(Bucket=S3_BUCKET, Key=S3_KEY)
waveform, sample_rate = torchaudio.sox_effects.apply_effects_file(
    response['Body'], [["lowpass", "-1", "300"], ["rate", "8000"]])
[Beta] Codec Application.
Built upon the file-like object support, we added the functional.apply_codec function, which can degrade audio data by applying audio codecs supported by the “sox_io” backend, in an in-memory fashion.
import torchaudio.functional as F

# Apply MP3 codec
degraded = F.apply_codec(
    waveform, sample_rate, format="mp3", compression=-9)

# Apply GSM codec
degraded = F.apply_codec(waveform, sample_rate, format="gsm")
Encoding options.
We have added encoding options to the save function of the new backends. Now you can change the format and encoding with the format, encoding, and bits_per_sample options:
# Save without any encoding option.
# The function will pick the encoding which the provided data fit
# For Tensor of float32 type, that is 32-bit floating-point PCM.
torchaudio.save("data.wav", waveform, sample_rate)
# Save as 16-bit signed integer Linear PCM
# The resulting file occupies half the storage but loses precision
torchaudio.save(
    "data.wav", waveform, sample_rate, encoding="PCM_S", bits_per_sample=16)
More format support to "sox_io"’s save function. We have added support for GSM, HTK, AMB, and AMR-NB formats to "sox_io"’s save function.
torchaudio was utilizing CMake to build third-party dependencies. Now torchaudio uses CMake to build its C++ extension as well. This opens the door to integrating torchaudio into non-Python environments (such as C++ applications and mobile). We will work on adding example applications and mobile integrations in upcoming releases.