A list of useful data augmentation resources. Here you will find less common techniques, libraries, links to GitHub repos, papers, and more.
Looking for someone who would like to help me maintain this repository! Contact me on LinkedIn or simply open a PR!
Do you like it? Feel free to :star: ! Feel free to make a pull request!
Data augmentation can be described simply as any method that makes a dataset larger by adding modified copies of its existing samples. To create more images, for example, we could zoom in and save the result, change the brightness of the image, or rotate it. To get a bigger sound dataset, we could raise or lower the pitch of an audio sample, or slow it down or speed it up. Example data augmentation techniques are presented in the diagram below.
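To make the idea concrete, here is a minimal sketch of a few of the transformations mentioned above, written in plain Python on nested lists so that it runs without any dependencies. The function names (`adjust_brightness`, `rotate_90`, `change_speed`, `augment_dataset`) are illustrative and do not come from any library listed here; real projects would normally rely on one of the libraries below.

```python
import random


def adjust_brightness(image, factor):
    """Scale every pixel of a 2D grayscale image by `factor`, clipped to 0-255."""
    return [[min(255, max(0, round(p * factor))) for p in row] for row in image]


def rotate_90(image):
    """Rotate a 2D image (list of rows) by 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def horizontal_flip(image):
    """Mirror a 2D image left to right."""
    return [row[::-1] for row in image]


def change_speed(samples, rate):
    """Resample audio by nearest index: rate > 1 speeds up, rate < 1 slows down."""
    n_out = max(1, int(len(samples) / rate))
    last = len(samples) - 1
    return [samples[min(last, int(i * rate))] for i in range(n_out)]


def augment_dataset(images, rng):
    """Return the original images plus one randomly modified copy of each."""
    ops = [
        lambda im: adjust_brightness(im, rng.uniform(0.7, 1.3)),
        rotate_90,
        horizontal_flip,
    ]
    return images + [rng.choice(ops)(im) for im in images]


if __name__ == "__main__":
    rng = random.Random(0)
    dataset = [[[10, 20], [30, 40]], [[50, 60], [70, 80]]]
    print(len(augment_dataset(dataset, rng)))  # twice as many samples as before
```

The dataset doubles in size because every original sample is kept and one modified copy is added, which matches the "modified copies" description above.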
If you wish to cite us, you can cite either of the following papers: "Style transfer-based image synthesis as an efficient regularization technique in deep learning" or "Data augmentation for improving deep learning in image classification problem".
Example Jupyter notebooks:
Example transformations:
At a granular level, Kornia is a library that consists of the following components:
Component | Description |
---|---|
kornia | a Differentiable Computer Vision library, with strong GPU support |
kornia.augmentation | a module to perform data augmentation on the GPU |
kornia.color | a set of routines to perform color space conversions |
kornia.contrib | a compilation of user contributed and experimental operators |
kornia.enhance | a module to perform normalization and intensity transformation |
kornia.feature | a module to perform feature detection |
kornia.filters | a module to perform image filtering and edge detection |
kornia.geometry | a geometric computer vision library to perform image transformations, 3D linear algebra and conversions using different camera models |
kornia.losses | a stack of loss functions to solve different vision tasks |
kornia.morphology | a module to perform morphological operations |
kornia.utils | image to tensor utilities and metrics for vision problems |
The details are available here: "Unsupervised Data Augmentation for Consistency Training"
In this paper, we introduce Random Erasing, a new data augmentation method for training the convolutional neural network (CNN). In training, Random Erasing randomly selects a rectangle region in an image and erases its pixels with random values. In this process, training images with various levels of occlusion are generated, which reduces the risk of over-fitting and makes the model robust to occlusion. Random Erasing is parameter learning free, easy to implement, and can be integrated with most of the CNN-based recognition models. Albeit simple, Random Erasing is complementary to commonly used data augmentation techniques such as random cropping and flipping, and yields consistent improvement over strong baselines in image classification, object detection and person re-identification.
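The procedure described in the abstract above (pick a random rectangle, overwrite it with random values) can be sketched in a few lines. This is a dependency-free illustration on a 2D grayscale image stored as nested lists, not the authors' released code; the parameter names and the default values for `p`, `area_range` and `min_aspect` are assumptions chosen for the example.

```python
import math
import random


def random_erasing(image, p=0.5, area_range=(0.02, 0.4), min_aspect=0.3,
                   max_attempts=100, rng=random):
    """Return a copy of `image` with one random rectangle filled with noise.

    `image` is a list of rows with values in 0-255. With probability 1 - p
    the image is returned unchanged. All defaults are illustrative.
    """
    out = [row[:] for row in image]
    if rng.random() >= p:
        return out
    height, width = len(image), len(image[0])
    for _ in range(max_attempts):
        # Sample the rectangle's area and aspect ratio, then derive its size.
        target_area = rng.uniform(*area_range) * height * width
        aspect = rng.uniform(min_aspect, 1.0 / min_aspect)
        h = int(round(math.sqrt(target_area * aspect)))
        w = int(round(math.sqrt(target_area / aspect)))
        if 0 < h < height and 0 < w < width:
            top = rng.randint(0, height - h)
            left = rng.randint(0, width - w)
            for y in range(top, top + h):
                for x in range(left, left + w):
                    out[y][x] = rng.randint(0, 255)  # erase with a random value
            return out
    return out  # no rectangle fitted within max_attempts; leave image as is


if __name__ == "__main__":
    img = [[128] * 16 for _ in range(16)]
    erased = random_erasing(img, p=1.0, rng=random.Random(0))
    print(sum(v != 128 for row in erased for v in row), "pixels changed")
```

Because the function needs no learned parameters, it can be dropped into a data loading step next to cropping and flipping.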
Multi-platform, open-source software for creating synthetic document images with ground truth.
DAG-GAN provides simple implementations of the DAG modules in both PyTorch and TensorFlow, which can be easily integrated into any GAN model to improve performance, especially when data is limited. We only illustrate some augmentation techniques (rotation, cropping, flipping, ...) as discussed in our paper, but DAG is not limited to these augmentations. The more augmentations are used, the greater the improvement DAG brings to the GAN model. It is also easy to design your own augmentations within the modules. However, there may be a trade-off between the number of augmentations used in DAG and the computational cost.
Unsupervised Data Augmentation or UDA is a semi-supervised learning method which achieves state-of-the-art results on a wide variety of language and vision tasks. With only 20 labeled examples, UDA outperforms the previous state-of-the-art on IMDb trained on 25,000 labeled examples.
They are releasing the following:
Each modality’s augmentations are contained within its own sub-library. These sub-libraries include both function-based and class-based transforms, composition operators, and have the option to provide metadata about the transform applied, including its intensity.
AugLy is a great library to utilize for augmenting your data in model training, or to evaluate the robustness gaps of your model! We designed AugLy to include many specific data augmentations that users perform in real life on internet platforms like Facebook's -- for example making an image into a meme, overlaying text/emojis on images/videos, reposting a screenshot from social media. While AugLy contains more generic data augmentations as well, it will be particularly useful to you if you're working on a problem like copy detection, hate speech detection, or copyright infringement where these "internet user" types of data augmentations are prevalent.
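The structure described above (function-based and class-based transforms, composition operators, and optional metadata that records what was applied and how strongly) can be illustrated with a small stand-alone sketch. Everything here is hypothetical: `shift_brightness`, `ShiftBrightness`, `Compose` and the metadata keys are names invented for this example and should not be read as AugLy's actual API.

```python
def shift_brightness(pixels, delta, metadata=None):
    """Function-based transform: add `delta` to each pixel, clipped to 0-255.

    If a `metadata` list is passed, a record describing the transform and a
    rough 0-100 intensity score is appended to it.
    """
    out = [min(255, max(0, p + delta)) for p in pixels]
    if metadata is not None:
        metadata.append({
            "name": "shift_brightness",
            "delta": delta,
            "intensity": min(100.0, abs(delta) / 255 * 100),
        })
    return out


class ShiftBrightness:
    """Class-based wrapper that fixes the parameters once and is then callable."""

    def __init__(self, delta):
        self.delta = delta

    def __call__(self, pixels, metadata=None):
        return shift_brightness(pixels, self.delta, metadata)


class Compose:
    """Composition operator: apply the given transforms one after another."""

    def __init__(self, transforms):
        self.transforms = transforms

    def __call__(self, pixels, metadata=None):
        for transform in self.transforms:
            pixels = transform(pixels, metadata)
        return pixels


if __name__ == "__main__":
    meta = []
    pipeline = Compose([ShiftBrightness(10), ShiftBrightness(-40)])
    print(pipeline([0, 100, 250], metadata=meta))
    print([m["name"] for m in meta])
```

Passing the same `metadata` list through every step keeps a log of the whole pipeline, which is handy when evaluating robustness against specific augmentations.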
It can be used to significantly improve the data efficiency for GAN training. We have provided DiffAugment-stylegan2 (TensorFlow) and DiffAugment-stylegan2-pytorch, DiffAugment-biggan-cifar (PyTorch) for GPU training, and DiffAugment-biggan-imagenet (TensorFlow) for TPU training.
Augmenter is the basic element of augmentation, while Flow is a pipeline that orchestrates multiple augmenters together.
Features:
Many of the components of TextAttack are useful for data augmentation. The textattack.Augmenter class uses a transformation and a list of constraints to augment data. We also offer six built-in recipes for data augmentation (source: QData/TextAttack):
- textattack.WordNetAugmenter augments text by replacing words with WordNet synonyms
- textattack.EmbeddingAugmenter augments text by replacing words with neighbors in the counter-fitted embedding space, with a constraint to ensure their cosine similarity is at least 0.8
- textattack.CharSwapAugmenter augments text by substituting, deleting, inserting, and swapping adjacent characters
- textattack.EasyDataAugmenter augments text with a combination of word insertions, substitutions and deletions
- textattack.CheckListAugmenter augments text by contraction/extension and by substituting names, locations, numbers
- textattack.CLAREAugmenter augments text by replacing, inserting, and merging with a pre-trained masked language model

[...] N < 500. While other techniques require you to train a language model on an external dataset just to get a small boost, we found that simple text editing operations using EDA result in good performance gains. Given a sentence in the training set, we perform the following operations:

This repository contains a collection of scripts for an experiment of Contextual Augmentation.
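The simple text editing operations that EDA relies on are easy to prototype. The sketch below covers the four operations usually associated with EDA (synonym replacement, random insertion, random swap, random deletion). It is a rough stand-alone illustration, not the EDA repository's code: the function signatures are assumptions, and the `synonyms` dictionary is a hypothetical stand-in for a real thesaurus such as WordNet.

```python
import random


def synonym_replacement(words, synonyms, n, rng):
    """Replace up to n words that have an entry in `synonyms`."""
    words = list(words)
    candidates = [i for i, w in enumerate(words) if w in synonyms]
    for i in rng.sample(candidates, min(n, len(candidates))):
        words[i] = rng.choice(synonyms[words[i]])
    return words


def random_insertion(words, synonyms, n, rng):
    """Insert n synonyms of randomly chosen words at random positions."""
    words = list(words)
    candidates = [w for w in words if w in synonyms]
    for _ in range(n if candidates else 0):
        new_word = rng.choice(synonyms[rng.choice(candidates)])
        words.insert(rng.randint(0, len(words)), new_word)
    return words


def random_swap(words, n, rng):
    """Swap two randomly chosen positions, n times."""
    words = list(words)
    if len(words) < 2:
        return words
    for _ in range(n):
        i, j = rng.sample(range(len(words)), 2)
        words[i], words[j] = words[j], words[i]
    return words


def random_deletion(words, p, rng):
    """Drop each word with probability p, always keeping at least one word."""
    if len(words) <= 1:
        return list(words)
    kept = [w for w in words if rng.random() > p]
    return kept if kept else [rng.choice(words)]


if __name__ == "__main__":
    rng = random.Random(0)
    toy_synonyms = {"quick": ["fast", "speedy"], "happy": ["glad"]}  # toy data
    sentence = "the quick brown fox is happy".split()
    print(" ".join(synonym_replacement(sentence, toy_synonyms, 1, rng)))
    print(" ".join(random_swap(sentence, 1, rng)))
    print(" ".join(random_deletion(sentence, 0.2, rng)))
```

Each call returns a new list and leaves the input sentence untouched, so several augmented variants can be generated from the same training example.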
The goal of this package is to make it easy for practitioners to consistently apply perturbations to annotated music data for the purpose of fitting statistical models.
tsaug is a Python package for time series augmentation. It offers a set of augmentation methods for time series, as well as a simple API to connect multiple augmenters into a pipeline. It can also be used for audio augmentation.
The audio data is represented as PyTorch tensors. It is particularly useful for speech data. Among others, it implements the augmentations that we found to be most useful for self-supervised learning (Data Augmenting Contrastive Learning of Speech Representations in the Time Domain, E. Kharitonov, M. Rivière, G. Synnaeve, L. Wolf, P.-E. Mazaré, M. Douze, E. Dupoux. [arXiv]):
tsaug is a Python package for time series augmentation. It offers a set of augmentation methods for time series, as well as a simple API to connect multiple augmenters into a pipeline.
Example augmenters:
T. T. Um et al., “Data augmentation of wearable sensor data for Parkinson’s disease monitoring using convolutional neural networks,” in Proceedings of the 19th ACM International Conference on Multimodal Interaction, ser. ICMI 2017. New York, NY, USA: ACM, 2017, pp. 216–220.
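As a rough illustration of what time series augmenters chained into a pipeline can look like, the sketch below implements jittering, scaling and cropping in plain Python. The function names and the `make_pipeline` helper are hypothetical and written for this example only; they are not the API of any package in this list.

```python
import random


def jitter(series, sigma, rng):
    """Add independent Gaussian noise to every sample."""
    return [x + rng.gauss(0.0, sigma) for x in series]


def scale(series, sigma, rng):
    """Multiply the whole series by a single random factor drawn around 1."""
    factor = rng.gauss(1.0, sigma)
    return [x * factor for x in series]


def crop(series, size, rng):
    """Cut a random contiguous window of `size` samples."""
    start = rng.randint(0, len(series) - size)
    return series[start:start + size]


def make_pipeline(*steps):
    """Chain augmenters; each step is a callable taking (series, rng)."""
    def apply(series, rng):
        for step in steps:
            series = step(series, rng)
        return series
    return apply


if __name__ == "__main__":
    rng = random.Random(0)
    signal = [float(i % 10) for i in range(100)]
    pipeline = make_pipeline(
        lambda s, r: crop(s, 50, r),
        lambda s, r: scale(s, 0.1, r),
        lambda s, r: jitter(s, 0.05, r),
    )
    augmented = pipeline(signal, rng)
    print(len(augmented))
```

Passing the random generator explicitly keeps every run reproducible, which helps when comparing models trained with different augmentation settings.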
Automatic Data Augmentation is a family of algorithms that search for an augmentation policy suited to the dataset and the selected task.
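The notion of searching for an augmentation policy can be sketched with plain random search. This is a deliberately simplified stand-in: published methods use more elaborate search strategies, and the `evaluate` callback here is a hypothetical placeholder for "train a model with this policy and return its validation score". The policy format (operation, probability, magnitude) and all names are assumptions made for this example.

```python
import random


def sample_policy(op_names, n_ops, rng):
    """Sample a policy: a list of (operation, probability, magnitude) triples."""
    return [
        (rng.choice(op_names), round(rng.random(), 2), rng.randint(0, 9))
        for _ in range(n_ops)
    ]


def random_search(op_names, evaluate, n_trials=20, n_ops=2, seed=0):
    """Try `n_trials` random policies and keep the one with the best score."""
    rng = random.Random(seed)
    best_policy, best_score = None, float("-inf")
    for _ in range(n_trials):
        policy = sample_policy(op_names, n_ops, rng)
        score = evaluate(policy)  # in practice: train and validate a model
        if score > best_score:
            best_policy, best_score = policy, score
    return best_policy, best_score


if __name__ == "__main__":
    # Toy objective that prefers mid-range magnitudes; it only stands in
    # for a real validation metric.
    def toy_score(policy):
        return -sum(abs(magnitude - 5) for _, _, magnitude in policy)

    ops = ["rotate", "flip", "brightness"]
    policy, score = random_search(ops, toy_score, n_trials=30)
    print(policy, score)
```

Swapping the search loop for a smarter optimizer changes the efficiency of the search but not the overall shape: propose a policy, score it on the task, keep the best.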
GitHub repositories:
More broadly, we aim at fostering a collaboration between academia and industry in terms of leveraging machine learning research and human-in-the-loop, interactive labeling to quickly build datasets that will enable the use of powerful deep models in all problems of computer vision.
The workshop topics include (but are not limited to):