Data manipulation and transformation for audio signal processing, powered by PyTorch
This is a minor release, which is compatible with PyTorch 1.12.1 and includes small bug fixes, improvements, and documentation updates. There are no new features added.
For the full feature set of v0.12, please refer to the v0.12.0 release notes.
TorchAudio 0.12.0 includes the following:
To support inference-time decoding, the release adds the wav2letter CTC beam search decoder, ported over from Flashlight (GitHub). Both lexicon and lexicon-free decoding are supported, and decoding can be done without a language model or with a KenLM n-gram language model. Compatible token, lexicon, and certain pretrained KenLM files for the LibriSpeech dataset are also available for download.
For usage details, please check out the documentation and ASR inference tutorial.
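As a minimal sketch of lexicon-based decoding with a KenLM language model, following the pattern in the ASR inference tutorial (the emission tensor below is a random stand-in for real acoustic-model output):

```python
import torch
from torchaudio.models.decoder import ctc_decoder, download_pretrained_files

# Fetch token, lexicon, and KenLM 4-gram files prepared for LibriSpeech
files = download_pretrained_files("librispeech-4-gram")

decoder = ctc_decoder(
    lexicon=files.lexicon,
    tokens=files.tokens,
    lm=files.lm,    # omit for language-model-free decoding
    nbest=3,
    beam_size=50,
)

# Dummy emission for illustration; in practice this comes from an
# acoustic model such as a Wav2Vec2 CTC model: (batch, frames, tokens)
with open(files.tokens) as f:
    num_tokens = len(f.read().split())
emission = torch.randn(1, 100, num_tokens).log_softmax(-1)

hypotheses = decoder(emission)   # List[List[CTCHypothesis]]
print(hypotheses[0][0].words)    # word transcript of the best hypothesis
```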
To improve flexibility in usage, the release adds two new beamforming modules under torchaudio.transforms: SoudenMVDR and RTFMVDR. They differ from MVDR mainly in that they accept reference_channel as an input argument in the forward method, to allow users to select the reference channel in model training or dynamically change the reference channel in inference. Besides the two modules, the release adds new function-level beamforming methods under torchaudio.functional. These include psd, mvdr_weights_souden, mvdr_weights_rtf, rtf_evd, rtf_power, and apply_beamforming.
For usage details, please check out the documentation at torchaudio.transforms and torchaudio.functional and the Speech Enhancement with MVDR Beamforming tutorial.
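A minimal sketch of mask-based beamforming with the new APIs (the spectrogram and masks below are random placeholders):

```python
import torch
import torchaudio.functional as F
import torchaudio.transforms as T

# Illustrative multi-channel complex spectrogram: (channel, freq, time)
specgram = torch.randn(4, 257, 100, dtype=torch.cfloat)
mask_speech = torch.rand(257, 100)   # time-frequency mask for target speech
mask_noise = torch.rand(257, 100)    # time-frequency mask for noise

# Estimate PSD matrices from the masks with the new functional API
psd_speech = F.psd(specgram, mask_speech)
psd_noise = F.psd(specgram, mask_noise)

# Souden MVDR, selecting channel 0 as the reference at call time
beamformer = T.SoudenMVDR()
enhanced = beamformer(specgram, psd_speech, psd_noise, reference_channel=0)
```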
StreamReader is TorchAudio’s new I/O API. It is backed by FFmpeg† and allows users to decode audio and video from a variety of sources, including local files, network protocols, and file-like objects, and to iterate over and process the media chunk by chunk.
For usage details, please check out the documentation and tutorials:
† To use StreamReader, FFmpeg libraries are required. Please install FFmpeg. The coverage of codecs depends on how these libraries are configured. TorchAudio official binaries are compiled to work with FFmpeg 4 libraries; FFmpeg 5 can be used if TorchAudio is built from source.
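A minimal sketch of chunk-by-chunk decoding (the source file name is hypothetical):

```python
from torchaudio.io import StreamReader

# Open a media source; a local path is shown, but URLs and
# file-like objects work as well
streamer = StreamReader(src="example.wav")
streamer.add_basic_audio_stream(frames_per_chunk=8000, sample_rate=8000)

# Iterate over the stream, decoding chunk by chunk
for (chunk,) in streamer.stream():
    print(chunk.shape)  # up to 8000 frames per chunk, resampled to 8000 Hz
```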
- MP3 decoding is now handled by FFmpeg. To load MP3 audio with torchaudio.load, please install a compatible version of FFmpeg (Version 4 when using an official binary distribution). Note that torchaudio.info now returns num_frames=0 for MP3.
- Previously, Hypothesis subclassed namedtuple. Containers of namedtuple instances, however, are incompatible with the PyTorch Lite Interpreter. To achieve compatibility, Hypothesis has been modified in release 0.12 to instead alias tuple. This affects RNNTBeamSearch, as it accepts and returns a list of Hypothesis instances.
- Internally, some computations upcast to complex128 to improve the precision and robustness of downstream matrix computations. The output dtype, however, was not correctly converted back to the original dtype. In release 0.12, we fix the output dtype to be consistent with the original input dtype.

The release also improves the performance of torchaudio.transforms.PitchShift. The following table shows the time it takes for torchaudio.transforms.PitchShift, after its first call, to perform the operation on a float32 Tensor with two channels and 8000 frames, resampled to 44.1 kHz, across various shift steps.

| TorchAudio version \ shift steps | 2 | 3 | 4 | 5 |
|---|---|---|---|---|
| 0.12 | 2.76 | 5 | 1860 | 223 |
| 0.11 | 6.71 | 161 | 8680 | 1450 |
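A minimal sketch of the cached behavior (shapes and parameters are illustrative):

```python
import torch
import torchaudio.transforms as T

waveform = torch.randn(2, 8000)   # (channel, time)
pitch_shift = T.PitchShift(sample_rate=44100, n_steps=4)

shifted = pitch_shift(waveform)   # the first call builds the resampling kernel
shifted = pitch_shift(waveform)   # subsequent calls reuse it and run faster
```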
- Use __getattr__ to implement delayed initialization (#2377)

The TorchAudio 0.11.0 release includes:
To support streaming ASR use cases, the release adds implementations of Emformer (docs), an RNN-T model that uses Emformer (emformer_rnnt_base), and an RNN-T beam search decoder (RNNTBeamSearch). It also includes a pipeline bundle (EMFORMER_RNNT_BASE_LIBRISPEECH) that wraps pre- and post-processing components, the beam search decoder, and the RNN-T Emformer model with weights pre-trained on LibriSpeech, which together allow for performing streaming ASR inference out of the box. For reference and reproducibility, the release provides the training recipe used to produce the pre-trained weights in the examples directory.
The masked prediction training of the HuBERT model requires the masked logits, unmasked logits, and feature norm as outputs. The logits are used for the cross-entropy losses and the feature norm for the penalty loss. The release adds HuBERTPretrainModel and corresponding factory functions (hubert_pretrain_base, hubert_pretrain_large, and hubert_pretrain_xlarge) to enable training from scratch, as sketched below.
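A minimal sketch of a pretraining forward pass (the batch shapes and the frame-level pseudo-label layout are illustrative assumptions; the "base" configuration defaults to a 100-class pseudo-label vocabulary):

```python
import torch
from torchaudio.models import hubert_pretrain_base

model = hubert_pretrain_base()

waveforms = torch.randn(2, 16000)          # (batch, time): one second at 16 kHz
labels = torch.randint(0, 100, (2, 49))    # frame-level pseudo-labels (~50 Hz)
lengths = torch.tensor([16000, 16000])     # valid samples per utterance

# Returns the masked logits, unmasked logits, and the feature penalty term
logit_m, logit_u, feature_penalty = model(waveforms, labels, lengths)
```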
The release adds an implementation of Conformer (docs), a convolution-augmented transformer architecture that has achieved state-of-the-art results on speech recognition benchmarks.
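A minimal sketch of instantiating and running the model (hyperparameters and shapes are illustrative):

```python
import torch
from torchaudio.models import Conformer

model = Conformer(
    input_dim=80, num_heads=4, ffn_dim=128,
    num_layers=4, depthwise_conv_kernel_size=31)

features = torch.randn(10, 300, 80)   # (batch, frames, feature dim)
lengths = torch.full((10,), 300)      # valid lengths per utterance
output, output_lengths = model(features, lengths)
```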
- Removed deprecated F.magphase, F.angle, F.complex_norm, and T.ComplexNorm (#1934, #1935, #1942)
- Removed deprecated pseudo complex type support from F.spectrogram, T.Spectrogram, F.phase_vocoder, and T.TimeStretch (#1957, #1958)
- Removed deprecated create_fb_matrix (#1998). create_fb_matrix was replaced by melscale_fbanks in release 0.10 and is removed in 0.11. Please use melscale_fbanks.
- Removed the deprecated VCTK dataset. Please use the VCTK_092 class for the latest version of the dataset.
- Removed diskcache_iterator and bg_iterator, which were deprecated in 0.10. Please cease the usage of them.
- The pretrained Wav2Vec2 ASR models included output dimensions (<s>, <pad>, </s>, <unk>) that were not related to ASR tasks and not used. These dimensions were removed.

This is a minor release compatible with PyTorch 1.10.2.
There is no feature change in torchaudio from 0.10.1. For the full feature of v0.10, please refer to the v0.10.0 release notes.
This is a minor release, which is compatible with PyTorch 1.10.1 and includes small bug fixes, improvements, and documentation updates. There are no new features added.

- TORCH_CUDA_ARCH_LIST delimiter

For the full feature set of v0.10, please refer to the v0.10.0 release notes.
The torchaudio 0.10.0 release includes:
HuBERT model architectures (“base”, “large”, and “extra large” configurations) are added. In addition, support for pretrained weights from wav2vec 2.0, Unsupervised Cross-lingual Representation Learning, and HuBERT is added.
These pretrained weights can be used for feature extractions and downstream task adaptation.
>>> import torchaudio
>>>
>>> # Build the model and load pretrained weight.
>>> model = torchaudio.pipelines.HUBERT_BASE.get_model()
>>> # Perform feature extraction.
>>> features, lengths = model.extract_features(waveforms)
>>> # Pass the features to downstream task
>>> ...
Some of the pretrained weights are fine-tuned for ASR tasks. The following example illustrates how to use the weights and access associated information, such as labels, which can be used in subsequent CTC decoding steps. (Note: torchaudio does not provide a CTC decoding mechanism.)
>>> import torchaudio
>>>
>>> bundle = torchaudio.pipelines.HUBERT_ASR_LARGE
>>>
>>> # Build the model and load pretrained weight.
>>> model = bundle.get_model()
Downloading:
100%|███████████████████████████████| 1.18G/1.18G [00:17<00:00, 73.8MB/s]
>>> # Check the corresponding labels of the output.
>>> labels = bundle.get_labels()
>>> print(labels)
('<s>', '<pad>', '</s>', '<unk>', '|', 'E', 'T', 'A', 'O', 'N', 'I', 'H', 'S', 'R', 'D', 'L', 'U', 'M', 'W', 'C', 'F', 'G', 'Y', 'P', 'B', 'V', 'K', "'", 'X', 'J', 'Q', 'Z')
>>>
>>> # Infer the label probability distribution
>>> waveform, sample_rate = torchaudio.load('hello-world.wav')
>>>
>>> emissions, _ = model(waveform)
>>>
>>> # Pass emission to (hypothetical) decoder
>>> transcripts = ctc_decode(emissions, labels)
>>> print(transcripts[0])
HELLO WORLD
A new model architecture, Tacotron2, is added, alongside several pretrained weights for TTS (text-to-speech). Because these TTS pipelines are composed of multiple models and specific data processing steps, the notion of a bundle is introduced to make the associated objects easy to use together. Bundles provide a common access point to create a pipeline with a set of pretrained weights. They are available under the torchaudio.pipelines module.
The following example illustrates a TTS pipeline where two models (Tacotron2 and WaveRNN) are used together.
>>> import torchaudio
>>>
>>> bundle = torchaudio.pipelines.TACOTRON2_WAVERNN_CHAR_LJSPEECH
>>>
>>> # Build text processor, Tacotron2 and vocoder (WaveRNN) model
>>> processor = bundle.get_text_preprocessor()
>>> tacotron2 = bundle.get_tacotron2()
Downloading:
100%|███████████████████████████████| 107M/107M [00:01<00:00, 87.9MB/s]
>>> vocoder = bundle.get_vocoder()
Downloading:
100%|███████████████████████████████| 16.7M/16.7M [00:00<00:00, 78.1MB/s]
>>>
>>> text = "Hello World!"
>>>
>>> # Encode text
>>> input, lengths = processor(text)
>>>
>>> # Generate (mel-scale) spectrogram
>>> specgram, lengths, _ = tacotron2.infer(input, lengths)
>>>
>>> # Convert spectrogram to waveform
>>> waveforms, lengths = vocoder(specgram, lengths)
>>>
>>> # Save audio
>>> torchaudio.save('hello-world.wav', waveforms, vocoder.sample_rate)
The loss function used in the RNN transducer architecture, which is widely used for speech recognition tasks, is added. The loss function (torchaudio.functional.rnnt_loss or torchaudio.transforms.RNNTLoss) supports float16 and float32 logits, has autograd and TorchScript support, and can be run on both CPU and GPU; the GPU path uses a custom CUDA kernel implementation for improved performance.
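A minimal sketch of computing the loss on random data (shapes are illustrative; note that targets and lengths are expected as int32):

```python
import torch
from torchaudio.transforms import RNNTLoss

batch, time, target_len, num_classes = 1, 10, 4, 20

# logits: (batch, max source length, max target length + 1, num classes)
logits = torch.randn(
    batch, time, target_len + 1, num_classes, requires_grad=True)
targets = torch.randint(
    0, num_classes - 1, (batch, target_len), dtype=torch.int32)
logit_lengths = torch.tensor([time], dtype=torch.int32)
target_lengths = torch.tensor([target_len], dtype=torch.int32)

# blank defaults to the last class index
loss = RNNTLoss()(logits, targets, logit_lengths, target_lengths)
loss.backward()
```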
This release adds support for MVDR beamforming on multi-channel audio using time-frequency masks. It offers three solutions (ref_channel, stv_evd, stv_power) and supports both single-channel and multi-channel masks (multi-channel masks are averaged within the method). It also provides an online option that recursively updates the parameters for streaming audio. Please refer to the MVDR tutorial.
This release adds GPU builds that support custom CUDA kernels in torchaudio, like the one being used for RNN transducer loss. Following this change, torchaudio’s binary distribution now includes CPU-only versions and CUDA-enabled versions. To use CUDA-enabled binaries, PyTorch also needs to be compatible with CUDA.
torchaudio.functional.lfilter now supports batch processing and multiple filters. Additional operations, including pitch shift, LFCC, and inverse spectrogram, are now supported in this release. The datasets CMUDict and LibriMix are added as well.
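As a sketch of the multi-filter lfilter usage (the coefficients here are arbitrary; with 2D coefficients, one filter is applied per channel under the default batching behavior):

```python
import torch
import torchaudio.functional as F

waveform = torch.randn(2, 8000).clamp(-1, 1)   # (num_filters, time)

# Two filters applied in one call: coefficients are (num_filters, order + 1)
b_coeffs = torch.tensor([[0.4, 0.2, 0.9],
                         [0.3, 0.2, 0.5]])
a_coeffs = torch.tensor([[1.0, 0.1, 0.3],
                         [1.0, 0.2, 0.4]])

filtered = F.lfilter(waveform, a_coeffs, b_coeffs)  # (2, 8000)
```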
- When saving audio, PCM_24 (the previous default) could cause warping. The default has been changed to PCM_16, which does not suffer this.
- When power=None, torchaudio.functional.spectrogram and torchaudio.transforms.Spectrogram now default to return_complex=True, which returns a Tensor of a native complex type (such as torch.cfloat and torch.cdouble). To use a pseudo complex type, pass the resulting tensor to torch.view_as_real.
- Removed the deprecated torchaudio.compliance.kaldi.resample_waveform. Please use torchaudio.functional.resample.
- specgram
- Changed extract_features of Wav2Vec2Model (#1776). For the output of the convolutional feature extractor, please use Wav2Vec2Model.feature_extractor().
- The model structure of Wav2Vec2Model was updated. The Wav2Vec2Model.encoder.read_out module is moved to Wav2Vec2Model.aux. If you have a serialized state dict, please replace the key encoder.read_out with aux.
- The num_out parameter has been changed to aux_num_out, and other parameters are added before it. Please update code from wav2vec2_base(num_out) to wav2vec2_base(aux_num_out=num_out).
- Added melscale_fbanks and deprecated create_fb_matrix (#1653). As linear_fbanks is introduced, create_fb_matrix is renamed to melscale_fbanks. The original create_fb_matrix is now deprecated. Please use melscale_fbanks.
- Deprecated the VCTK dataset (#1810). Please use the VCTK_092 dataset.
- bg_iterator and diskcache_iterator are known to not improve the throughput of data loaders. Please cease their usage.
- Tacotron2
- HuBERT

CUDA:
| torchaudio version \ Tensor shape | [1,4,8000] | [1,4,16000] | [1,4,32000] |
|---|---|---|---|
| 0.10 | 119 | 120 | 123 |
| 0.9 | 160 | 184 | 240 |

Unit: msec
This release depends on PyTorch 1.9.1. There are no functional changes other than minor updates to CI rules.
torchaudio 0.9.0 release includes:
This release includes model architectures from the wav2vec 2.0 paper, with utility functions that allow importing pretrained model parameters published on fairseq and Hugging Face Hub. Now you can easily run speech recognition with torchaudio. These model architectures also support TorchScript, and you can deploy them with ONNX or in non-Python environments, such as C++, Android, and iOS. Please check out our C++, Android, and iOS examples. The following snippets illustrate how to create a deployable model.
# Import fine-tuned model from Hugging Face Hub
from transformers import Wav2Vec2ForCTC
from torchaudio.models.wav2vec2.utils import import_huggingface_model

original = Wav2Vec2ForCTC.from_pretrained("facebook/wav2vec2-base-960h")
imported = import_huggingface_model(original)

# Import fine-tuned model from fairseq
import fairseq
from torchaudio.models.wav2vec2.utils import import_fairseq_model

original, _, _ = fairseq.checkpoint_utils.load_model_ensemble_and_task(
    ["wav2vec_small_960h.pt"], arg_overrides={'data': "<data_dir>"})
imported = import_fairseq_model(original[0].w2v_encoder)

# Build uninitialized model and load state dict
from torchaudio.models import wav2vec2_base

model = wav2vec2_base(num_out=32)
model.load_state_dict(imported.state_dict())

# Quantize / script / optimize for mobile
import torch
from torch.utils.mobile_optimizer import optimize_for_mobile

quantized_model = torch.quantization.quantize_dynamic(
    model, qconfig_spec={torch.nn.Linear}, dtype=torch.qint8)
scripted_model = torch.jit.script(quantized_model)
optimized_model = optimize_for_mobile(scripted_model)
optimized_model.save("model_for_deployment.pt")
The internal implementation of lfilter
has been updated to support autograd on both CPU and CUDA. Additionally, the performance on CPU is significantly improved. These improvements also apply to biquad
variants.
The following table illustrates the performance improvements compared against the previous releases. lfilter was applied to float32 tensors with one channel and different numbers of frames.
| torchaudio version \ number of frames | 256 | 512 | 1024 |
|---|---|---|---|
| 0.9 | 0.282 | 0.381 | 0.564 |
| 0.8 | 0.493 | 0.780 | 1.37 |
| 0.7 | 5.42 | 10.8 | 22.3 |

Unit: msec
torchaudio has functions that handle complex-valued tensors. In the early days, when PyTorch did not have a complex dtype, torchaudio adopted the convention of using an extra dimension to represent the real and imaginary parts. In PyTorch 1.6, new dtypes, such as torch.cfloat and torch.cdouble, were introduced to represent complex values natively. (In the following, we refer to torchaudio’s original convention as pseudo complex types, and PyTorch’s native dtypes as native complex types.)
As the native complex types have become mature and stable, torchaudio
has started to migrate complex functions to use the native complex type. In this release, the internal implementation was updated to use the native complex types, and interfaces were updated to allow passing/receiving native complex type directly. Users can choose to keep using the pseudo complex type or opt in to use native complex type. However, please note that the use of the pseudo complex type is now deprecated. These functions are tested to support TorchScript and autograd. For the detail of this migration plan, please refer to #1337.
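For example, passing return_complex=True requests the native complex dtype directly (a minimal sketch; parameters are illustrative):

```python
import torch
import torchaudio.transforms as T

waveform = torch.randn(1, 16000)

# Request a native complex output by disabling the power spectrum
spectrogram = T.Spectrogram(n_fft=400, power=None, return_complex=True)
spec = spectrogram(waveform)
print(spec.dtype)   # torch.complex64 (native complex type)

# Convert to the legacy pseudo complex layout if needed
pseudo = torch.view_as_real(spec)
```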
Additionally, switching the internal implementation to the native complex types improved the performance. Since the internal implementation uses native complex type regardless of which complex type is passed/returned, users will automatically benefit from this performance improvement.
The following table illustrates the performance improvements from the previous release by comparing the time it takes for complex transforms to perform the operation on a float32 Tensor with two channels and 256 frames.
CPU:

| torchaudio version | Spectrogram | TimeStretch | GriffinLim |
|---|---|---|---|
| 0.9 | 0.229 | 12.6 | 3320 |
| 0.8 | 0.283 | 126 | 5320 |

Unit: msec
CUDA:

| torchaudio version | Spectrogram | TimeStretch | GriffinLim |
|---|---|---|---|
| 0.9 | 0.195 | 0.599 | 36 |
| 0.8 | 0.219 | 0.687 | 60.2 |

Unit: msec
Along with the work of Complex Tensor Migration and Filtering Improvement mentioned above, more tests were added to ensure autograd support. Now the following operations are guaranteed to support autograd up to second order.

torchaudio.functional:
- lfilter
- allpass_biquad
- biquad
- band_biquad
- bandpass_biquad
- bandreject_biquad
- bass_biquad
- equalizer_biquad
- treble_biquad
- highpass_biquad
- lowpass_biquad

torchaudio.transforms:
- AmplitudeToDB
- ComputeDeltas
- Fade
- GriffinLim
- TimeMasking
- FrequencyMasking
- MFCC
- MelScale
- MelSpectrogram
- Resample
- SpectralCentroid
- Spectrogram
- SlidingWindowCmn
- TimeStretch *
- Vol

NOTE: The following functionals are also covered:
- amplitude_to_DB
- spectrogram
- griffinlim
- resample
- phase_vocoder *
- mask_along_axis_iid
- mask_along_axis
- gain
- spectral_centroid
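To illustrate the second-order guarantee, a minimal gradcheck sketch (inputs and coefficients are illustrative; clamp is disabled to keep the function smooth everywhere):

```python
import torch
from torch.autograd import gradcheck, gradgradcheck
import torchaudio.functional as F

# Double precision inputs, as required by gradcheck
waveform = (0.1 * torch.randn(1, 200, dtype=torch.float64)).requires_grad_()
a_coeffs = torch.tensor([1.0, 0.2, 0.3], dtype=torch.float64, requires_grad=True)
b_coeffs = torch.tensor([0.4, 0.2, 0.9], dtype=torch.float64, requires_grad=True)

def fn(waveform, a_coeffs, b_coeffs):
    return F.lfilter(waveform, a_coeffs, b_coeffs, clamp=False)

assert gradcheck(fn, (waveform, a_coeffs, b_coeffs))       # first order
assert gradgradcheck(fn, (waveform, a_coeffs, b_coeffs))   # second order
```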
* torchaudio.transforms.TimeStretch and torchaudio.functional.phase_vocoder call atan2, which is not differentiable around zero. Therefore these functions are differentiable only when the input spectrogram does not contain values around zero.

In release 0.8, the resampling operation was vectorized and its performance improved. In this release, the implementation of the resampling algorithm has been further revised:

- A rolloff parameter has been added for anti-aliasing control.
- torchaudio.transforms.Resample precomputes the kernel using float64 precision and caches it for even faster operation.
- A new entry point, torchaudio.functional.resample, has been added, and the original entry point, torchaudio.compliance.kaldi.resample_waveform, is deprecated.

The following table illustrates the performance improvements from the previous release by comparing the time it takes for torchaudio.transforms.Resample to complete the operation on a float32 tensor with two channels and one-second duration.
CPU:

| torchaudio version | 8k → 16k [Hz] | 16k → 8k | 16k → 44.1k | 44.1k → 16k |
|---|---|---|---|---|
| 0.9 | 0.192 | 0.559 | 0.478 | 0.467 |
| 0.8 | 0.537 | 0.753 | 43.9 | 17.6 |

Unit: msec
CUDA:

| torchaudio version | 8k → 16k [Hz] | 16k → 8k | 16k → 44.1k | 44.1k → 16k |
|---|---|---|---|---|
| 0.9 | 0.203 | 0.172 | 0.213 | 0.212 |
| 0.8 | 0.860 | 0.559 | 116 | 46.7 |

Unit: msec
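A minimal sketch of the two entry points (parameters are illustrative):

```python
import torch
import torchaudio.functional as F
import torchaudio.transforms as T

waveform = torch.randn(2, 16000)  # two channels, one second at 16 kHz

# One-off functional entry point, with anti-aliasing control via rolloff
resampled = F.resample(waveform, orig_freq=16000, new_freq=8000, rolloff=0.99)

# The transform precomputes and caches the kernel, so repeated calls are fast
resampler = T.Resample(orig_freq=16000, new_freq=8000)
resampled = resampler(waveform)
```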
torchaudio implements some operations in C++ for reasons such as performance and integration with third-party libraries. This C++ module was previously only available on Linux and macOS. In this release, Windows packages also come with the C++ module.
The C++ module in the Windows package includes the efficient filtering implementation mentioned above; however, the “sox_io” backend and torchaudio.functional.compute_kaldi_pitch are not included.
Since the 0.6 release, we have continuously improved I/O functionality. Specifically, in 0.8 the default backend was changed from “sox” to “sox_io”, and a similar API change was applied to the “soundfile” backend. The 0.9 release concludes this migration by removing the deprecated backends. For details, please refer to #903.
- Removed the normalized argument from torchaudio.functional.griffinlim (#1369)
- Renamed the torchaudio.functional.sliding_window_cmn arg for correctness (#1347). If you were passing the keyword argument waveform=..., please change it to specgram=...
- Changed torchaudio.transforms.Resample to precompute and cache the resampling kernel (#1499, #1514). To use the transform on a CUDA device, move it there first:

resampler = torchaudio.transforms.Resample(orig_freq=8000, new_freq=44100)
resampler.to(torch.device("cuda"))

- torchaudio no longer supports programmatic download of the Common Voice dataset. Please remove the arguments from your code.
- torchaudio is adopting the native complex type, and the use of the pseudo complex type and the related utility functions is now deprecated. Please refer to #1337 for the migration process.
- Deprecated torchaudio.compliance.kaldi.resample_waveform (#1533). Please use torchaudio.functional.resample.
- torchaudio.transforms.MelScale now expects a valid n_stft value (#1515). Please provide a valid value of n_stft.
- torchaudio.functional.lfilter (#1319)
- torchaudio.functional.lfilter (#1310, #1441)
- torchaudio.functional.resample (#1402)
- Added the rolloff parameter (#1488)
- torchaudio.transforms.Resample (#1499, #1514, #1556)
- torchaudio.functional.phase_vocoder and torchaudio.transforms.TimeStretch (#1410)
- Added return_complex to torchaudio.functional.spectrogram and torchaudio.transforms.Spectrogram (#1366, #1551)
- Added a __str__ override to AudioMetaData for easy print (#1339)
- sox/utils.cpp (#1306)
- Removed check_length from validate_input_file (#1312)
- torchaudio.functional.griffinlim (#1368)
- torchaudio.transforms.MelScale when n_stft is invalid (#1505)
- __all__ (#1458)
- reference_cast in make_boxed_from_unboxed_functor (#1300)
- torchaudio.transforms.GriffinLim (#1433)
- Replaced librosa’s Mel scale conversion with torchaudio’s in the WaveRNN example (#1444)
- Updated config.guess to support source builds on recent architectures (#1484)
- torchaudio.functional.lfilter and biquad variants (#1400, #1438)
- torchaudio.transforms.FrequencyMasking (#1498)
- torchaudio.transforms.SlidingWindowCmn (#1482)
- torchaudio.transforms.MelScale (#1467)
- torchaudio.transforms.Vol (#1460)
- torchaudio.transforms.TimeStretch (#1420)
- torchaudio.transforms.AmplitudeToDB (#1447)
- torchaudio.transforms.GriffinLim (#1421)
- torchaudio.transforms.SpectralCentroid (#1425)
- torchaudio.transforms.ComputeDeltas (#1422)
- torchaudio.transforms.Fade (#1424)
- torchaudio.transforms.Resample (#1416)
- torchaudio.transforms.MFCC (#1415)
- torchaudio.transforms.Spectrogram / MelSpectrogram (#1340)
- torchaudio.functional.lfilter shape (#1360)
- torchaudio.functional.resample (#1516)
- torchaudio.functional.phase_vocoder (#1379)
- Replaced floor_divide with div (#1455)
- Replaced torch.assert_allclose with assertEqual (#1387)
- torchaudio.functional.lfilter autograd tests input size (#1443)
- torchaudio.transforms.InverseMelScale comparison test (#1437)
- Changed torchaudio.transforms.TimeMasking and torchaudio.transforms.FrequencyMasking to perform out-of-place masking (#1481)
- power of torchaudio.transforms.MelSpectrogram as float only (#1572)
- Use torch.nn.functional.conv1d in torchaudio.functional.lfilter (#1318)
- torchaudio.functional.overdrive (#1299)
- sox_effects.apply_effects_tensor is CPU-only (#1459)
- sliding_window_cmn (#1383)

This release depends on PyTorch 1.8.1.
This release supports Python 3.9.
Continuing from the previous release, torchaudio improves the audio I/O mechanism. In this release, we have four major updates.
Backend migration. We have migrated the default backend for audio I/O. The new default backend is “sox_io” (for Linux/macOS). The interface for the “soundfile” backend has also been changed to align with that of “sox_io”. Following the change of default backends, the legacy backend/interface has been marked as deprecated. It is still accessible, though its use is strongly discouraged. For details on the migration, please refer to #903.
File-like object support. We have added file-like object support to I/O functions and sox_effects. You can perform the info, load, save, and apply_effects_file operations on file-like objects.
# Prerequisite imports for the snippets below
import io
import tarfile

import boto3
import requests
import torchaudio

# Query audio metadata over HTTP
# Will only fetch the first few kB
with requests.get(URL, stream=True) as response:
    metadata = torchaudio.info(response.raw)

# Load audio from tar file
# No need to extract the TAR file.
with tarfile.open(TAR_PATH, mode='r') as tarfile_:
    fileobj = tarfile_.extractfile(SAMPLE_TAR_ITEM)
    waveform, sample_rate = torchaudio.load(fileobj)

# Saving to Bytes buffer
# Using BytesIO, you can perform in-memory encoding/decoding.
buffer_ = io.BytesIO()
torchaudio.save(buffer_, waveform, sample_rate, format="wav")

# Apply effects (lowpass filter / resampling) while loading audio from S3
client = boto3.client('s3')
response = client.get_object(Bucket=S3_BUCKET, Key=S3_KEY)
waveform, sample_rate = torchaudio.sox_effects.apply_effects_file(
    response['Body'], [["lowpass", "-1", "300"], ["rate", "8000"]])
[Beta] Codec Application.
Built upon the file-like object support, we added the functional.apply_codec function, which can degrade audio data by applying audio codecs supported by the “sox_io” backend, in an in-memory fashion.
import torchaudio.functional as F

# Apply MP3 codec
degraded = F.apply_codec(
    waveform, sample_rate, format="mp3", compression=-9)

# Apply GSM codec
degraded = F.apply_codec(waveform, sample_rate, format="gsm")
Encoding options.
We have added encoding options to the save function of the new backends. Now you can change the format and encoding with the format, encoding, and bits_per_sample options:
# Save without any encoding option.
# The function will pick the encoding which the provided data fit
# For Tensor of float32 type, that is 32-bit floating-point PCM.
torchaudio.save("data.wav", waveform, sample_rate)
# Save as 16-bit signed integer Linear PCM
# The resulting file occupies half the storage but loses precision
torchaudio.save(
    "data.wav", waveform, sample_rate, encoding="PCM_S", bits_per_sample=16)
More format support to "sox_io"’s save function. We have added support for GSM, HTK, AMB, and AMR-NB formats to "sox_io"’s save function.
torchaudio was utilizing CMake to build third-party dependencies. Now torchaudio uses CMake to build its C++ extension as well. This opens the door to integrating torchaudio into non-Python environments (such as C++ applications and mobile). We will work on adding example applications and mobile integrations in upcoming releases.