MetaLM: Language Models are General-Purpose Interfaces
The Big Convergence - Large-scale self-supervised pre-training across tasks (predictive and generative), languages (100+ languages), and modalities (language, image, audio, layout/format + language, vision + language, audio + language, etc.)
Language & Multilingual
UniLM: unified pre-training for language understanding and generation
InfoXLM/XLM-E: multilingual/cross-lingual pre-trained models for 100+ languages
DeltaLM/mT6: encoder-decoder pre-training for language generation and translation for 100+ languages
MiniLM: small and fast pre-trained models for language understanding and generation
AdaLM: domain, language, and task adaptation of pre-trained models
EdgeLM (NEW): small pre-trained models on edge/client devices
SimLM (NEW): similarity matching with language model pre-training
Vision
BEiT/BEiT-2: generative self-supervised pre-training for vision (BERT Pre-Training of Image Transformers)
DiT (NEW): self-supervised pre-training for Document Image Transformers
[Model Release] September, 2022: TrOCR-Base and TrOCR-Large models for Scene Text Recognition (STR).
[Model Release] September, 2022: BEiT v2 code and pretrained models
August, 2022: BEiT-3 - a general-purpose multimodal foundation model that achieves state-of-the-art transfer performance on both vision and vision-language tasks
July, 2022: SimLM - Large-scale self-supervised pre-training for similarity matching
June, 2022: DiT and LayoutLMv3 were accepted by ACM Multimedia 2022
June, 2022: MetaLM - Language models are general-purpose interfaces to foundation models (language/multilingual, vision, speech, and multimodal)
June, 2022: VL-BEiT - bidirectional multimodal Transformer learned from scratch with one unified pretraining task, one shared backbone, and one-stage training, supporting both vision and vision-language tasks.
May, 2021: LayoutLMv2, InfoXLMv2, MiniLMv2, UniLMv3, and AdaLM were accepted by ACL 2021.
April, 2021: LayoutXLM extends LayoutLM with multilingual support! A multilingual form understanding benchmark, XFUND, is also introduced; it includes forms with human-labeled key-value pairs in 7 languages (Chinese, Japanese, Spanish, French, Italian, German, Portuguese).
Aggressive Decoding (May 20, 2022): Aggressive Decoding, a novel decoding paradigm for lossless speedup of seq2seq generation. Unlike previous efforts (e.g., non-autoregressive decoding) that speed up seq2seq generation at the cost of quality, Aggressive Decoding aims to yield generation identical to (or better than) autoregressive decoding, but with a significant speedup. For seq2seq tasks characterized by highly similar inputs and outputs (e.g., Grammatical Error Correction and Text Simplification), Input-guided Aggressive Decoding introduces a 7x-9x speedup for the popular 6-layer Transformer on GPU with results identical to greedy decoding; for other general seq2seq tasks (e.g., Machine Translation and Abstractive Summarization), Generalized Aggressive Decoding achieves a 3x-5x speedup with identical or even better quality. "Lossless Acceleration for Seq2seq Generation with Aggressive Decoding"
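For intuition, here is a minimal sketch of the input-guided variant's verify-then-fallback loop, assuming a hypothetical `greedy_next(src, prefix)` helper that returns the model's greedy prediction for every target position up to `len(prefix)` in a single forward pass (the released implementation instead keeps re-drafting from the input after a mismatch, which is where the full speedup comes from):

```python
# A simplified sketch of input-guided Aggressive Decoding, not the released
# code. `greedy_next` is a hypothetical stand-in for the seq2seq model: given
# the source and a target prefix of length n, it returns the model's greedy
# predictions for positions 0..n in one parallel forward pass.
from typing import Callable, List

def aggressive_decode(src: List[int],
                      greedy_next: Callable[[List[int], List[int]], List[int]],
                      eos: int, max_len: int = 256) -> List[int]:
    # Use the input itself as the draft output and verify all of it at once:
    # one parallel pass replaces len(src) sequential decoding steps.
    preds = greedy_next(src, src)
    k = 0
    while k < len(src) and preds[k] == src[k]:
        k += 1  # longest draft prefix that greedy decoding agrees with
    out = src[:k]
    # Lossless fallback: from the first disagreement on, decode token by
    # token exactly as ordinary greedy decoding would.
    while len(out) < max_len and (not out or out[-1] != eos):
        out.append(greedy_next(src, out)[len(out)])
    return out
```

Because every accepted draft token is one the model would have produced greedily anyway, the output is guaranteed identical to greedy decoding; only the number of forward passes shrinks.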
LayoutLM 3.0 (April 19, 2022): LayoutLMv3, a multimodal pre-trained Transformer for Document AI with unified text and image masking. It is additionally pre-trained with a word-patch alignment objective to learn cross-modal alignment by predicting whether the corresponding image patch of a text word is masked. The simple unified architecture and training objectives make LayoutLMv3 a general-purpose pre-trained model for both text-centric and image-centric Document AI tasks. Experimental results show that LayoutLMv3 achieves state-of-the-art performance not only in text-centric tasks, including form understanding, receipt understanding, and document visual question answering, but also in image-centric tasks such as document image classification and document layout analysis. "LayoutLMv3: Pre-training for Document AI with Unified Text and Image Masking" (ACM MM 2022)
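As a toy illustration of that word-patch alignment objective (shapes and names below are illustrative, not from the LayoutLMv3 codebase): for each text token, a small head classifies whether the image patch it corresponds to was masked, which forces the text representation to track the visual side.

```python
# Toy word-patch alignment (WPA) loss; all tensors and dimensions are made
# up for illustration and do not come from the LayoutLMv3 code.
import torch
import torch.nn as nn
import torch.nn.functional as F

hidden, n_words = 768, 16
text_states = torch.randn(1, n_words, hidden)      # encoder states of text tokens
patch_masked = torch.randint(0, 2, (1, n_words))   # 1 if the word's image patch was masked

wpa_head = nn.Linear(hidden, 2)                    # binary aligned/unaligned classifier
loss = F.cross_entropy(wpa_head(text_states).view(-1, 2), patch_masked.view(-1))
```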
EdgeFormer (March 18, 2022): EdgeFormer, the first publicly available pretrained parameter-efficient Transformer for on-device seq2seq generation. EdgeFormer has only 11 million parameters, taking up less than 15MB of disk space after int8 quantization and compression, and can process a sentence of 20-30 tokens with acceptable latency on two middle-to-high-end CPU cores and a memory footprint under 50MB. The pretrained EdgeFormer can be fine-tuned on English seq2seq tasks with promising results -- significantly better than both the strong parameter-efficient Transformer baseline (pretrained Universal Transformer) and a fully-parameterized Transformer-base model without pretraining -- which we believe can largely facilitate on-device seq2seq generation in practice. "EdgeFormer: A Parameter-Efficient Transformer for On-Device Seq2seq Generation"
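The release's packaging pipeline isn't reproduced here, but the int8 step it mentions can be illustrated generically with PyTorch dynamic quantization (the model below is a stand-in, not EdgeFormer):

```python
# Generic int8 dynamic quantization of a small seq2seq Transformer for CPU
# inference; a stand-in illustration, not EdgeFormer's release pipeline.
import torch
import torch.nn as nn

model = nn.Transformer(d_model=256, nhead=4,
                       num_encoder_layers=2, num_decoder_layers=2)
model.eval()

# Store Linear weights as int8; activations stay float and are quantized on
# the fly, which suits on-device CPU decoding.
quantized = torch.quantization.quantize_dynamic(model, {nn.Linear},
                                                dtype=torch.qint8)
torch.save(quantized.state_dict(), "seq2seq_int8.pt")  # much smaller on disk
```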
DiT (March 4, 2022): DiT, a self-supervised pre-trained Document Image Transformer model using large-scale unlabeled text images for Document AI tasks, which is essential since no supervised counterpart exists due to the lack of human-labeled document images. We leverage DiT as the backbone network in a variety of vision-based Document AI tasks, including document image classification, document layout analysis, table detection, and text detection for OCR. Experimental results show that the self-supervised pre-trained DiT model achieves new state-of-the-art results on these downstream tasks, e.g., document image classification (91.11 → 92.69), document layout analysis (91.0 → 94.9), table detection (94.23 → 96.55), and text detection for OCR (93.07 → 94.29). "DiT: Self-supervised Pre-training for Document Image Transformer" (ACM MM 2022)
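A hedged sketch of using DiT as a feature backbone via Hugging Face Transformers; the checkpoint name is taken from the DiT release on the Hub and should be verified before use:

```python
# Extract patch-level features from a document image with a DiT backbone.
# The "microsoft/dit-base" checkpoint name is an assumption; verify it on
# the Hugging Face Hub.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModel

processor = AutoImageProcessor.from_pretrained("microsoft/dit-base")
model = AutoModel.from_pretrained("microsoft/dit-base")

page = Image.new("RGB", (224, 224), "white")         # stand-in for a page scan
inputs = processor(images=page, return_tensors="pt")
with torch.no_grad():
    patch_features = model(**inputs).last_hidden_state  # feed to a task head
```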
WavLM (October 27, 2021): WavLM, a new pre-trained speech model for solving full-stack downstream speech tasks. WavLM integrates a gated relative position embedding structure and an utterance-mixing method to model spoken content while preserving speaker identity. WavLM is trained on 94k hours of public audio data, more than any other released checkpoint for English speech modeling. WavLM Large achieves state-of-the-art performance on the SUPERB benchmark and brings significant improvements to various speech processing tasks on their representative benchmarks. "WavLM: Large-Scale Self-Supervised Pre-training for Full Stack Speech Processing"
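A hedged usage sketch for feature extraction with Hugging Face Transformers; the checkpoint name is assumed from the release:

```python
# Frame-level WavLM features from raw 16 kHz audio. The checkpoint name
# "microsoft/wavlm-large" is an assumption; verify it on the Hub.
import numpy as np
import torch
from transformers import Wav2Vec2FeatureExtractor, WavLMModel

extractor = Wav2Vec2FeatureExtractor.from_pretrained("microsoft/wavlm-large")
model = WavLMModel.from_pretrained("microsoft/wavlm-large")

speech = np.zeros(16000, dtype=np.float32)  # one second of silence as dummy input
inputs = extractor(speech, sampling_rate=16000, return_tensors="pt")
with torch.no_grad():
    frames = model(**inputs).last_hidden_state  # features for downstream speech heads
```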
MarkupLM (October 19, 2021): MarkupLM, a simple yet effective pre-training approach for text and markup language. Built on the Transformer architecture, MarkupLM integrates different input embeddings, including text embeddings, position embeddings, and XPath embeddings. Furthermore, we also propose new pre-training objectives specially designed for understanding the markup language. We evaluate the pre-trained MarkupLM model on the WebSRC and SWDE datasets. Experiments show that MarkupLM significantly outperforms several SOTA baselines on these tasks. "MarkupLM: Pre-training of Text and Markup Language for Visually-rich Document Understanding" (ACL 2022)
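A hedged sketch of running MarkupLM on raw HTML with Hugging Face Transformers; the processor derives an XPath for every token, so the model receives the text, position, and XPath embeddings described above (checkpoint name assumed from the release):

```python
# Encode an HTML snippet with MarkupLM. "microsoft/markuplm-base" is an
# assumed checkpoint name; verify it on the Hugging Face Hub.
import torch
from transformers import MarkupLMProcessor, MarkupLMModel

processor = MarkupLMProcessor.from_pretrained("microsoft/markuplm-base")
model = MarkupLMModel.from_pretrained("microsoft/markuplm-base")

html = "<html><body><h1>Invoice</h1><p>Total: 42 USD</p></body></html>"
inputs = processor(html, return_tensors="pt")  # tokens plus per-token XPaths
with torch.no_grad():
    hidden = model(**inputs).last_hidden_state
```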
TrOCR (September 22, 2021): Transformer-based OCR with pre-trained models, which leverages the Transformer architecture for both image understanding and BPE-level text generation. The TrOCR model is simple but effective (convolution-free) and can be pre-trained with large-scale synthetic data and fine-tuned on human-labeled datasets. "TrOCR: Transformer-based Optical Character Recognition with Pre-trained Models"
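A hedged inference sketch with Hugging Face Transformers; the checkpoint name is assumed (printed and handwritten variants exist):

```python
# OCR a text-line image with TrOCR: the image encoder feeds an autoregressive
# text decoder. "microsoft/trocr-base-printed" is an assumed checkpoint name.
from PIL import Image
from transformers import TrOCRProcessor, VisionEncoderDecoderModel

processor = TrOCRProcessor.from_pretrained("microsoft/trocr-base-printed")
model = VisionEncoderDecoderModel.from_pretrained("microsoft/trocr-base-printed")

image = Image.new("RGB", (384, 384), "white")  # stand-in for a text-line crop
pixel_values = processor(images=image, return_tensors="pt").pixel_values
ids = model.generate(pixel_values)
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```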
LayoutLM 2.0 (December 29, 2020): multimodal pre-training for visually-rich document understanding that leverages text, layout, and image information in a single framework. It achieves new SOTA on a wide range of document understanding tasks, including FUNSD (0.7895 → 0.8420), CORD (0.9493 → 0.9601), SROIE (0.9524 → 0.9781), Kleister-NDA (0.834 → 0.852), RVL-CDIP (0.9443 → 0.9564), and DocVQA (0.7295 → 0.8672). "LayoutLMv2: Multi-modal Pre-training for Visually-Rich Document Understanding" (ACL 2021)
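As a toy illustration of the "layout" part of that single framework (not LayoutLMv2's code): each token carries its bounding box, which is embedded as 2-D position features and added to the usual text embeddings.

```python
# Toy 2-D layout embedding from token bounding boxes; names and sizes are
# illustrative, not taken from the LayoutLMv2 codebase.
import torch
import torch.nn as nn

hidden = 768
x_emb = nn.Embedding(1024, hidden)  # coordinates normalized to a 0..1023 grid
y_emb = nn.Embedding(1024, hidden)

def layout_embedding(bbox: torch.Tensor) -> torch.Tensor:
    # bbox: (seq_len, 4) as (x0, y0, x1, y1) on the 0..1023 grid
    x0, y0, x1, y1 = bbox.unbind(-1)
    return x_emb(x0) + y_emb(y0) + x_emb(x1) + y_emb(y1)

boxes = torch.tensor([[100, 200, 180, 220], [190, 200, 260, 220]])
print(layout_embedding(boxes).shape)  # torch.Size([2, 768])
```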