DGMs 4 NLP: Deep Generative Models for Natural Language Processing. A Roadmap.
Yao Fu, University of Edinburgh, [email protected]
**Update**: How does GPT Obtain its Ability? Tracing Emergent Abilities of Language Models to their Sources
**Update**: A Closer Look at Language Model Emergent Abilities
**Update**: Large Language Models
**Update**: Long-range Dependency; Why S4 is Good at Long Sequence: Remembering a Sequence with Online Function Approximation
**TODO 1**: Calibration; Prompting; Long-range transformers; State-space Models
**TODO 2**: Matrix Factorization and Word embedding; Kernels; Gaussian Process
**TODO 3**: Relationship between inference and RL;
(written in early 2019, originating from the DGM seminar at Columbia)
Why do we want deep generative models? Because we want to learn the basic factors that generate language. Human language contains rich latent factors: the continuous ones might be emotion or intention, while the discrete/structural ones might be POS/NER tags or syntax trees. Most of these factors are latent because in most cases we only observe the sentence. They are also generative: humans produce language based on an overall idea, the current emotion, the syntax, and all the other factors we can or cannot name.
How do we model the generative process of language in a statistically principled way? Can we have a flexible framework that lets us incorporate explicit supervision signals when we have labels, add distant supervision or logical/statistical constraints when we have no labels but do have other prior knowledge, or simply infer whatever makes the most sense when we have neither labels nor priors? Can we exploit the modeling power of advanced neural architectures while remaining mathematically principled and probabilistic? DGMs allow us to achieve these goals.
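To make this framing concrete before diving into the paper lists, here is the standard latent-variable formulation that most of the VAE-style models below build on; the notation is generic (x a sentence, z whatever latent factor we care about) rather than taken from any single paper:

```latex
% Generative story: draw a latent factor z, then generate the sentence x from it.
p_\theta(x) = \int p_\theta(x \mid z)\, p(z)\, dz
% The integral is intractable, so we maximize the evidence lower bound (ELBO)
% with an approximate posterior q_\phi(z \mid x):
\log p_\theta(x) \;\ge\; \mathbb{E}_{q_\phi(z \mid x)}\!\left[ \log p_\theta(x \mid z) \right]
    \;-\; \mathrm{KL}\!\left( q_\phi(z \mid x) \,\|\, p(z) \right)
```

Explicit supervision on z, distant supervision, or constraints can then be added as extra terms on this objective; with no labels at all, we simply maximize the bound.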
Let us begin the journey.
Citation:
@article{yao2019DGM4NLP,
title = "Deep Generative Models for Natural Language Processing",
author = "Yao Fu",
year = "2019",
url = "https://github.com/FranxYao/Deep-Generative-Models-for-Natural-Language-Processing"
}
How to write Variational Inference and Generative Models for NLP: a recipe. Strongly recommended for beginners writing papers about VAEs for NLP.
A Tutorial on Deep Latent Variable Models of Natural Language (link), EMNLP 18
Latent Structure Models for NLP. ACL 2019 tutorial link
Columbia STAT 8201 - Deep Generative Models, by John Cunningham
Stanford CS 236 - Deep Generative Models, by Stefano Ermon
U Toronto CSC 2541 - Differentiable Inference and Generative Models; CSC 2547 - Learning Discrete Latent Structures; CSC 2547 Fall 2019 - Learning to Search. By David Duvenaud
U Toronto STA 4273 Winter 2021 - Minimizing Expectations. By Chris Maddison
Berkeley CS294-158 - Deep Unsupervised Learning. By Pieter Abbeel
Columbia STCS 8101 - Representation Learning: A Probabilistic Perspective. By David Blei
Stanford CS324 - Large Language Models. By Percy Liang, Tatsunori Hashimoto and Christopher Re
U Toronto CSC2541 - Neural Net Training Dynamics. By Roger Grosse.
The foundation of DGMs is built upon probabilistic graphical models, so we also take a look at the following resources:
Blei's Foundation of Graphical Models course, STAT 6701 at Columbia (link)
Xing's Probabilistic Graphical Models, 10-708 at CMU (link)
Collins' Natural Language Processing, COMS 4995 at Columbia (link)
Pattern Recognition and Machine Learning. Christopher M. Bishop. 2006
Machine Learning: A Probabilistic Perspective. Kevin P. Murphy. 2012
Graphical Models, Exponential Families, and Variational Inference. 2008
Linguistic Structure Prediction. 2011
The Syntactic Process. 2000
Generating Sentences from a Continuous Space, CoNLL 16
Neural variational inference for text processing, ICML 16
Learning Neural Templates for Text Generation. EMNLP 2018
Residual Energy Based Models for Text Generation. ICLR 20
Paraphrase Generation with Latent Bag of Words. NeurIPS 2019.
Fairseq Decoding Library. [github]
Controllable Neural Text Generation [Lil'Log]
Best-First Beam Search. TACL 2020
The Curious Case of Neural Text Degeneration. ICLR 2020
Comparison of Diverse Decoding Methods from Conditional Language Models. ACL 2019
Stochastic Beams and Where to Find Them: The Gumbel-Top-k Trick for Sampling Sequences Without Replacement. ICML 19
Conditional Poisson Stochastic Beam Search. EMNLP 2021
Massive-scale Decoding for Text Generation using Lattices. 2021
Lexically Constrained Decoding for Sequence Generation Using Grid Beam Search. ACL 2017
Fast Lexically Constrained Decoding with Dynamic Beam Allocation for Neural Machine Translation. NAACL 2018
Improved Lexically Constrained Decoding for Translation and Monolingual Rewriting. NAACL 2019
Towards Decoding as Continuous Optimisation in Neural Machine Translation. EMNLP 2017
Gradient-guided Unsupervised Lexically Constrained Text Generation. EMNLP 2020
Controlled Text Generation as Continuous Optimization with Multiple Constraints. 2021
NeuroLogic Decoding: (Un)supervised Neural Text Generation with Predicate Logic Constraints. NAACL 2021
NeuroLogic A*esque Decoding: Constrained Text Generation with Lookahead Heuristics. 2021
COLD Decoding: Energy-based Constrained Text Generation with Langevin Dynamics. 2022
Note: I have not fully gone through this chapter, please give me suggestions!
Non-Autoregressive Neural Machine Translation. ICLR 2018
Fully Non-autoregressive Neural Machine Translation: Tricks of the Trade.
Fast Decoding in Sequence Models Using Discrete Latent Variables. ICML 2018
Cascaded Text Generation with Markov Transformers. Arxiv 20
Glancing Transformer for Non-Autoregressive Neural Machine Translation. ACL 2021
TODO: more about it
Prompt Papers, ThuNLP (link)
CTRL: A Conditional Transformer Language Model for Controllable Generation. Arxiv 2019
Plug and Play Language Models: a Simple Approach to Controlled Text Generation
Torch-Struct: Deep Structured Prediction Library. github, paper, documentation
An introduction to Conditional Random Fields. 2012
Inside-Outside and Forward-Backward Algorithms Are Just Backprop. 2016.
Learning with Fenchel-Young Losses. JMLR 2019
Structured Attention Networks. ICLR 2017
Differentiable Dynamic Programming for Structured Prediction and Attention. ICML 2018
Recurrent Neural Network Grammars. NAACL 16
Unsupervised Recurrent Neural Network Grammars, NAACL 19
Differentiable Perturb-and-Parse: Semi-Supervised Parsing with a Structured Variational Autoencoder, ICLR 19
The Syntactic Process. 2000
Linguistically-Informed Self-Attention for Semantic Role Labeling. EMNLP 2018 Best paper award
Semantic Parsing with Semi-Supervised Sequential Autoencoders. 2016
Compositional Generalization in NLP. Paper list
Generalization without Systematicity: On the Compositional Skills of Sequence-to-Sequence Recurrent Networks. ICML 2018
Improving Text-to-SQL Evaluation Methodology. ACL 2018
Probabilistic inference using Markov chain Monte Carlo methods. 1993
Elements of Sequential Monte Carlo (link)
A Conceptual Introduction to Hamiltonian Monte Carlo (link)
Candidate Sampling (link)
Noise-contrastive estimation: A new estimation principle for unnormalized statistical models. AISTATS 2010
A* Sampling. NIPS 2014 Best paper award
Cambridge Variational Inference Reading Group (link)
Variational Inference: A Review for Statisticians.
Stochastic Variational Inference
Variational Bayesian Inference with Stochastic Search. ICML 12
Auto-Encoding Variational Bayes, ICLR 14
beta-VAE: Learning Basic Visual Concepts with a Constrained Variational Framework. ICLR 2017
Importance Weighted Autoencoders. ICLR 2016
Stochastic Backpropagation and Approximate Inference in Deep Generative Models. ICML 14
Semi-amortized variational autoencoders, ICML 18
Adversarially Regularized Autoencoders, ICML 18
More on reparameterization: how to reparameterize Gaussian mixtures, permutation matrices, and rejection samplers (Gamma and Dirichlet); a minimal sketch of the basic tricks follows this list.
Stochastic Backpropagation through Mixture Density Distributions, Arxiv 16
Reparameterization Gradients through Acceptance-Rejection Sampling Algorithms. AISTATS 2017
Implicit Reparameterization Gradients. NeurIPS 2018.
Categorical Reparameterization with Gumbel-Softmax. ICLR 2017
The Concrete Distribution: A Continuous Relaxation of Discrete Random Variables. ICLR 2017
Invertible Gaussian Reparameterization: Revisiting the Gumbel-Softmax. 2020
Reparameterizable Subset Sampling via Continuous Relaxations. IJCAI 2019
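As a quick illustration of the two most common tricks the papers above build on (the Gaussian reparameterization of Auto-Encoding Variational Bayes and the Gumbel-Softmax/Concrete relaxation), here is a minimal PyTorch sketch; the function names, the temperature default, and the usage values are my own illustrative choices:

```python
import torch
import torch.nn.functional as F

def gaussian_reparameterize(mu, log_var):
    """Sample z ~ N(mu, sigma^2) as z = mu + sigma * eps with eps ~ N(0, I),
    so gradients can flow through mu and log_var."""
    eps = torch.randn_like(mu)
    return mu + torch.exp(0.5 * log_var) * eps

def gumbel_softmax_sample(logits, temperature=1.0):
    """Differentiable relaxation of categorical sampling: perturb the logits
    with Gumbel(0, 1) noise, then take a tempered softmax."""
    gumbel = -torch.log(-torch.log(torch.rand_like(logits) + 1e-20) + 1e-20)
    return F.softmax((logits + gumbel) / temperature, dim=-1)

# Usage: a 3-dimensional Gaussian latent code and a 5-way categorical relaxation.
z_cont = gaussian_reparameterize(torch.zeros(3), torch.zeros(3))
z_disc = gumbel_softmax_sample(torch.log(torch.tensor([0.1, 0.2, 0.3, 0.25, 0.15])))
```

PyTorch also ships `torch.nn.functional.gumbel_softmax`, which implements the same relaxation (with an optional straight-through estimator).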
Generative Adversarial Networks, NIPS 14
Towards principled methods for training generative adversarial networks, ICLR 2017
Wasserstein GAN
InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets. NIPS 2016
Adversarially Learned Inference. ICLR 2017
Flow-based Deep Generative Models [Lil'Log]
Variational Inference with Normalizing Flows, ICML 15
Learning About Language with Normalizing Flows
Improved Variational Inference with Inverse Autoregressive Flow
Density estimation using Real NVP. ICLR 17
Unsupervised Learning of Syntactic Structure with Invertible Neural Projections. EMNLP 2018
Latent Normalizing Flows for Discrete Sequences. ICML 2019.
Discrete Flows: Invertible Generative Models of Discrete Data. 2019
FlowSeq: Non-Autoregressive Conditional Sequence Generation with Generative Flow. EMNLP 2019
Variational Neural Machine Translation with Normalizing Flows. ACL 2020
On the Sentence Embeddings from Pre-trained Language Models. EMNLP 2020
FY: Need to see how score-based generative models and diffusion models can be used for discrete sequences
Generative Modeling by Estimating Gradients of the Data Distribution. Blog 2021
Score Based Generative Modeling Papers
Generative Modeling by Estimating Gradients of the Data Distribution. NeurIPS 2019
What are Diffusion Models? 2021
Deep Unsupervised Learning using Nonequilibrium Thermodynamics. 2015
Denoising Diffusion Probabilistic Models. NeurIPS 2020
Argmax Flows and Multinomial Diffusion: Learning Categorical Distributions. NeurIPS 2021
Structured Denoising Diffusion Models in Discrete State-Spaces. NeurIPS 2021
Autoregressive Diffusion Models. ICLR 2022
Diffusion-LM Improves Controllable Text Generation. 2022
Photorealistic Text-to-Image Diffusion Models with Deep Language Understanding. 2022
Ordered Neurons: Integrating Tree Structures into Recurrent Neural Networks
RNNs can generate bounded hierarchical languages with optimal memory
Analyzing Multi-Head Self-Attention: Specialized Heads Do the Heavy Lifting, the Rest Can Be Pruned. ACL 2019
Theoretical Limitations of Self-Attention in Neural Sequence Models. TACL 2019
Rethinking Attention with Performers. 2020
THUNLP: Pre-trained Language Model paper list (link)
Tomohide Shibata's BERT-related Papers
HiPPO: Recurrent Memory with Optimal Polynomial Projections. NeurIPS 2020
Combining Recurrent, Convolutional, and Continuous-time Models with the Linear State Space Layer. NeurIPS 2021
Efficiently Modeling Long Sequences with Structured State Spaces. ICLR 2022
Why S4 is Good at Long Sequence: Remembering a Sequence with Online Function Approximation. 2022
GPT3 (175B). Language Models are Few-Shot Learners. May 2020
Megatron-Turing NLG (530B). Using DeepSpeed and Megatron to Train Megatron-Turing NLG 530B, A Large-Scale Generative Language Model. Jan 2022
LaMDA (137B). LaMDA: Language Models for Dialog Applications. Jan 2022
Gopher (280B). Scaling Language Models: Methods, Analysis & Insights from Training Gopher. Dec 2021
Chinchilla (70B). Training Compute-Optimal Large Language Models. Mar 2022
PaLM (540B). PaLM: Scaling Language Modeling with Pathways. Apr 2022
OPT (175B). OPT: Open Pre-trained Transformer Language Models. May 2022
BLOOM (176B): BigScience Large Open-science Open-access Multilingual Language Model. May 2022
BlenderBot 3 (175B): a deployed conversational agent that continually learns to responsibly engage. Aug 2022
Scaling Laws for Neural Language Models. 2020
Emergent Abilities of Large Language Models. 2022
Minimizing Expectations. Chris Maddison
Monte Carlo Gradient Estimation in Machine Learning
Variational Inference for Monte Carlo Objectives. ICML 16
REBAR: Low-variance, unbiased gradient estimates for discrete latent variable models. NIPS 17
Backpropagation Through the Void: Optimizing Control Variates for Black-box Gradient Estimation. ICLR 18
Backpropagating through Structured Argmax using a SPIGOT. ACL 2018 Best Paper Honorable Mention.
Understanding the Mechanics of SPIGOT: Surrogate Gradients for Latent Structure Learning. EMNLP 2020
Learning with Differentiable Perturbed Optimizers. NeurIPS 2020
Gradient Estimation with Stochastic Softmax Tricks. NeurIPS 2020
Differentiable Dynamic Programming for Structured Prediction and Attention. ICML 18
Stochastic Optimization of Sorting Networks via Continuous Relaxations
Differentiable Ranks and Sorting using Optimal Transport
Reparameterizing the Birkhoff Polytope for Variational Permutation Inference. AISTATS 2018
A Regularized Framework for Sparse and Structured Neural Attention. NeurIPS 2017
SparseMAP: Differentiable Sparse Structured Inference. ICML 2018
Nested Named Entity Recognition with Partially-Observed TreeCRFs. AAAI 2021
Rao-Blackwellized Stochastic Gradients for Discrete Distributions. ICML 2019.
Efficient Marginalization of Discrete and Structured Latent Variables via Sparsity. NeurIPS 2020
Posterior Regularization for Structured Latent Variable Models. JMLR 2010
Posterior Control of Blackbox Generation. 2019
Dependency Grammar Induction with a Neural Variational Transition-based Parser. AAAI 2019
(In Chinese) A Concise Course in Differential Geometry and Topology (微分几何与拓扑学简明教程)
Only Bayes Should Learn a Manifold (On the Estimation of Differential Geometric Structure from Data). Arxiv 2018
The Riemannian Geometry of Deep Generative Models. CVPRW 2018
The Geometry of Deep Generative Image Models and Its Applications. ICLR 2021
Metrics for Deep Generative Models. AISTATS 2017
First-Order Algorithms for Min-Max Optimization in Geodesic Metric Spaces. 2022
Random Features for Large-Scale Kernel Machines. NeurIPS 2007
Finding structure with randomness: Probabilistic algorithms for constructing approximate matrix decompositions. SIAM 2011
Efficient optimization of loops and limits with randomized telescoping sums. ICML 2019
Telescoping Density-Ratio Estimation. NeurIPS 2020
Bias-Free Scalable Gaussian Processes via Randomized Truncations. ICML 2021
Randomized Automatic Differentiation. ICLR 2021
Scaling Structured Inference with Randomization. 2021
Elements of Information Theory. Cover and Thomas. 1991
On Variational Bounds of Mutual Information. ICML 2019
Learning Deep Representations By Mutual Information Estimation And Maximization. ICLR 2019
MINE: Mutual Information Neural Estimation
Deep Variational Information Bottleneck. ICLR 2017
Identifying Bayesian Mixture Models
Disentangling Disentanglement in Variational Autoencoders. ICML 2019
Challenging Common Assumptions in the Unsupervised Learning of Disentangled Representations. ICML 2019
Emergence of Invariance and Disentanglement in Deep Representations
Invariant Risk Minimization
Fixing a Broken ELBO. ICML 2018.
Tighter Variational Bounds are Not Necessarily Better. ICML 2018
The continuous Bernoulli: fixing a pervasive error in variational autoencoders. NeurIPS 2019
Do Deep Generative Models Know What They Don't Know? ICLR 2019
Effective Estimation of Deep Generative Language Models. ACL 2020
How Good is the Bayes Posterior in Deep Neural Networks Really? ICML 2020
A statistical theory of cold posteriors in deep neural networks. ICLR 2021
Limitations of Autoregressive Models and Their Alternatives. NAACL 2021