Scaling Laws for Neural Language Models (Arxiv, Jan. 2020) [Paper]
An empirical analysis of compute-optimal large language model training (NeurIPS 2022) [Paper]
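The sketch below is a rough, illustrative take on what the compute-optimal ("Chinchilla") analysis above implies for model sizing, using the common approximations that training compute is C ≈ 6·N·D FLOPs and that compute-optimal training uses on the order of 20 tokens per parameter; the helper name and example FLOP budget are made up for illustration, and the papers themselves fit these constants and exponents from data.

```python
import math

def chinchilla_optimal_size(compute_flops: float, tokens_per_param: float = 20.0):
    """Back-of-the-envelope compute-optimal sizing.

    Approximations (not exact fitted constants):
      * training compute C ~= 6 * N * D   (N parameters, D tokens)
      * compute-optimal training uses ~20 tokens per parameter
    """
    # C = 6 * N * D with D = tokens_per_param * N  =>  N = sqrt(C / (6 * tokens_per_param))
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

# Illustrative budget: ~5.8e23 FLOPs gives roughly a 70B-parameter model
# trained on ~1.4T tokens, in line with the Chinchilla configuration.
n, d = chinchilla_optimal_size(5.8e23)
print(f"params ~ {n:.2e}, tokens ~ {d:.2e}")
```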
Data Repetition
Scaling Laws and Interpretability of Learning from Repeated Data (Arxiv, May 2022) [Paper]
Will we run out of data? An analysis of the limits of scaling datasets in Machine Learning (Arxiv, Oct. 2022) [Paper]
Scaling Data-Constrained Language Models (Arxiv, May 2023) [Paper][Code]
To Repeat or Not To Repeat: Insights from Scaling LLM under Token-Crisis (Arxiv, May 2023) [Paper]
D4: Improving LLM Pretraining via Document De-Duplication and Diversification (Arxiv, Aug. 2023) [Paper]
Data Quality
Deduplication
Deduplicating training data makes language models better (ACL 2022) [Paper][Code]
Deduplicating training data mitigates privacy risks in language models (ICML 2022) [Paper]
Noise-Robust De-Duplication at Scale (ICLR 2023) [Paper]
SemDeDup: Data-efficient learning at web-scale through semantic deduplication (Arxiv, Mar. 2023) [Paper][Code]
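As a concrete illustration of the fuzzy-matching style of deduplication used in several of the papers above (Lee et al. combine MinHash-based near-duplicate detection with exact substring matching; SemDeDup instead clusters documents in embedding space), here is a minimal, self-contained MinHash sketch; the shingle size, number of hash functions, and example documents are arbitrary choices for demonstration.

```python
import hashlib
from itertools import combinations

def shingles(text: str, n: int = 5) -> set:
    """Character-level n-gram shingles; word-level shingles are also common."""
    text = " ".join(text.lower().split())
    return {text[i:i + n] for i in range(max(len(text) - n + 1, 1))}

def minhash_signature(doc_shingles: set, num_hashes: int = 64) -> list:
    """MinHash signature: for each of `num_hashes` seeded hash functions,
    keep the minimum hash value over the document's shingles."""
    sig = []
    for seed in range(num_hashes):
        sig.append(min(
            int.from_bytes(hashlib.blake2b(f"{seed}:{s}".encode(), digest_size=8).digest(), "big")
            for s in doc_shingles
        ))
    return sig

def estimated_jaccard(sig_a: list, sig_b: list) -> float:
    """Fraction of matching signature slots approximates Jaccard similarity."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)

docs = {
    "a": "The quick brown fox jumps over the lazy dog.",
    "b": "The quick brown fox jumped over the lazy dog!",
    "c": "Large language models are trained on web-scale corpora.",
}
sigs = {k: minhash_signature(shingles(v)) for k, v in docs.items()}
for x, y in combinations(docs, 2):
    # Pairs above a chosen similarity threshold would be flagged as near-duplicates.
    print(x, y, round(estimated_jaccard(sigs[x], sigs[y]), 2))
```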
Quality Filtering
An Empirical Exploration in Quality Filtering of Text Data (Arxiv, Sep. 2021) [Paper]
Quality at a glance: An audit of web-crawled multilingual datasets (ACL 2022) [Paper]
The MiniPile Challenge for Data-Efficient Language Models (Arxiv, Apr. 2023) [Paper][Dataset]
A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity (Arxiv, May 2023) [Paper]
Textbooks Are All You Need (Arxiv, Jun. 2023) [Paper][Code]
The RefinedWeb Dataset for Falcon LLM: Outperforming Curated Corpora with Web Data, and Web Data Only (NeurIPS 2023 Datasets and Benchmarks Track) [Paper][Dataset]
Textbooks Are All You Need II: phi-1.5 technical report (Arxiv, Sep. 2023) [Paper][Model]
When less is more: Investigating Data Pruning for Pretraining LLMs at Scale (Arxiv, Sep. 2023) [Paper]
Ziya2: Data-centric Learning is All LLMs Need (Arxiv, Nov. 2023) [Paper][Model]
DeepSeek LLM: Scaling Open-Source Language Models with Longtermism (Arxiv, Jan. 2024) [Paper]
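Several of the quality-filtering papers above examine classifier-based filters of the kind used in GPT-3-style pipelines: train a lightweight classifier to separate a trusted reference corpus from raw crawled text, then keep (or preferentially sample) documents scored as reference-like. The snippet below is a minimal sketch of that idea with scikit-learn; the toy documents, feature size, and hard threshold are placeholders rather than anything a specific paper prescribes.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import LogisticRegression

# Toy stand-ins: in practice these would be large samples of a trusted
# reference corpus and of raw web-crawled documents.
reference_docs = ["A carefully edited paragraph from a curated source ...",
                  "Another well written, coherent article ..."]
web_docs = ["buy cheap pills click here!!!",
            "lorem ipsum keyword keyword keyword"]

vectorizer = HashingVectorizer(n_features=2**18, alternate_sign=False)
X = vectorizer.transform(reference_docs + web_docs)
y = [1] * len(reference_docs) + [0] * len(web_docs)

clf = LogisticRegression(max_iter=1000).fit(X, y)

def keep(doc: str, threshold: float = 0.5) -> bool:
    """Keep a crawled document if the classifier scores it as 'reference-like'.
    GPT-3-style pipelines sample documents with probability tied to this score
    rather than applying a hard cutoff."""
    score = clf.predict_proba(vectorizer.transform([doc]))[0, 1]
    return score >= threshold
```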
Toxicity Filtering
Detoxifying language models risks marginalizing minority voices (NAACL-HLT 2021) [Paper][Code]
Challenges in detoxifying language models (EMNLP Findings 2021) [Paper]
What’s in the Box? A Preliminary Analysis of Undesirable Content in the Common Crawl Corpus (Arxiv, May 2021) [Paper][Code]
A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity (Arxiv, May 2023) [Paper]
Social Biases
Documenting large webtext corpora: A case study on the Colossal Clean Crawled Corpus (EMNLP 2021) [Paper]
An empirical survey of the effectiveness of debiasing techniques for pre-trained language models (ACL 2022) [Paper][Code]
Whose language counts as high quality? Measuring language ideologies in text data selection (EMNLP 2022) [Paper][Code]
From Pretraining Data to Language Models to Downstream Tasks: Tracking the Trails of Political Biases Leading to Unfair NLP Models (ACL 2023) [Paper][Code]
Diversity & Age
Beyond Scale: the Diversity Coefficient as a Data Quality Metric Demonstrates LLMs are Pre-trained on Formally Diverse Data (Arxiv, Jun. 2023) [Paper]
D2 Pruning: Message Passing for Balancing Diversity and Difficulty in Data Pruning (Arxiv, Oct. 2023) [Paper][Code]
A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity (Arxiv, May 2023) [Paper]
Domain Composition
LaMDA: Language Models for Dialog Applications (Arxiv, Jan. 2022) [Paper][Code]
Data Selection for Language Models via Importance Resampling (Arxiv, Feb. 2023) [Paper][Code]
CodeGen2: Lessons for Training LLMs on Programming and Natural Languages (ICLR 2023) [Paper][Model]
DoReMi: Optimizing Data Mixtures Speeds Up Language Model Pretraining (Arxiv, May 2023) [Paper][Code]
A Pretrainer's Guide to Training Data: Measuring the Effects of Data Age, Domain Coverage, Quality, & Toxicity (Arxiv, May 2023) [Paper]
SlimPajama-DC: Understanding Data Combinations for LLM Training (Arxiv, Sep. 2023) [Paper][Model][Dataset]
DoGE: Domain Reweighting with Generalization Estimation (Arxiv, Oct. 2023) [Paper]
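To make the domain-reweighting idea behind methods like DoReMi and DoGE concrete, the snippet below sketches one multiplicative-weights update of a domain mixture: domains where a small proxy model's loss exceeds a fixed reference model's loss get upweighted, and the result is smoothed so no domain's weight collapses to zero. The step size, smoothing constant, and exact update form here are simplifications for illustration, not the papers' exact algorithms.

```python
import numpy as np

def reweight_domains(weights, proxy_loss, reference_loss, step_size=1.0, smoothing=1e-3):
    """One illustrative multiplicative-weights update of domain mixture weights.

    `proxy_loss` / `reference_loss` are per-domain losses of a small proxy model
    and of a fixed reference model; domains where the proxy lags the reference
    (large excess loss) get upweighted.  Hyperparameters are illustrative.
    """
    excess = np.clip(np.asarray(proxy_loss) - np.asarray(reference_loss), 0.0, None)
    w = np.asarray(weights) * np.exp(step_size * excess)
    w = w / w.sum()
    # Mix with the uniform distribution so every domain keeps nonzero weight.
    k = len(w)
    return (1.0 - smoothing) * w + smoothing / k

# Toy example with three domains (names like web/code/books are illustrative).
w = np.ones(3) / 3
w = reweight_domains(w, proxy_loss=[3.1, 2.0, 2.6], reference_loss=[2.8, 2.1, 2.4])
print(w)  # the first and third domains gain weight relative to uniform
```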
Data Management Systems
Data-Juicer: A One-Stop Data Processing System for Large Language Models (Arxiv, Sep. 2023) [Paper][Code]
Oasis: Data Curation and Assessment System for Pretraining of Large Language Models (Arxiv, Nov. 2023) [Paper][Code]
Supervised Fine-Tuning
Data Quantity
Exploring the Impact of Instruction Data Scaling on Large Language Models: An Empirical Study on Real-World Use Cases (Arxiv, Mar. 2023) [Paper]
LIMA: Less Is More for Alignment (Arxiv, May 2023) [Paper]
Maybe Only 0.5% Data is Needed: A Preliminary Exploration of Low Training Data Instruction Tuning (Arxiv, May 2023) [Paper]
Scaling Relationship on Learning Mathematical Reasoning with Large Language Models (Arxiv, Aug. 2023) [Paper][Code]
How Abilities In Large Language Models Are Affected By Supervised Fine-Tuning Data Composition (Arxiv, Oct. 2023) [Paper]
Dynamics of Instruction Tuning: Each Ability of Large Language Models Has Its Own Growth Pace (Arxiv, Oct. 2023) [Paper]
Data Quality
Instruction Quality
Self-Refine: Iterative Refinement with Self-Feedback (Arxiv, Mar. 2023) [Paper][Project]
LIMA: Less Is More for Alignment (Arxiv, May 2023) [Paper]
Enhancing Chat Language Models by Scaling High-quality Instructional Conversations (Arxiv, May 2023) [Paper][Code]
SelFee: Iterative Self-Revising LLM Empowered by Self-Feedback Generation (Blog post, May 2023) [Project]
INSTRUCTEVAL: Towards Holistic Evaluation of Instruction-Tuned Large Language Models (Arxiv, Jun. 2023) [Paper][Code]
Instruction mining: High-quality instruction data selection for large language models (Arxiv, Jul. 2023) [Paper][Code]
Harnessing the Power of David against Goliath: Exploring Instruction Data Generation without Using Closed-Source Models (Arxiv, Aug. 2023) [Paper]
Self-Alignment with Instruction Backtranslation (Arxiv, Aug. 2023) [Paper]
SELF: Language-Driven Self-Evolution for Large Language Models (Arxiv, Oct. 2023) [Paper]
Reflection-Tuning: Recycling Data for Better Instruction-Tuning (NeurIPS 2023 Instruction Workshop) [Paper][Code]
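A recurring recipe in the instruction-quality papers above is to score candidate (instruction, response) pairs with some judge (a strong LLM, a reward model, or hand-crafted heuristics, as in the self-curation step of instruction backtranslation) and keep only the highest-rated pairs. The sketch below shows that selection loop with a deliberately toy heuristic standing in for the judge; the scorer, scale, and threshold are all illustrative.

```python
from typing import Callable, List, Tuple

def select_high_quality(pairs: List[Tuple[str, str]],
                        rate_fn: Callable[[str, str], float],
                        threshold: float) -> List[Tuple[str, str]]:
    """Keep (instruction, response) pairs whose quality score clears a threshold.

    `rate_fn` is a placeholder for any scorer: an LLM prompted to grade the
    pair, a reward model, or heuristics."""
    scored = [(rate_fn(inst, resp), inst, resp) for inst, resp in pairs]
    scored.sort(key=lambda t: t[0], reverse=True)
    return [(inst, resp) for score, inst, resp in scored if score >= threshold]

def toy_rate(instruction: str, response: str) -> float:
    """Toy judge: longer, non-refusal answers score higher (purely illustrative)."""
    score = min(len(response.split()) / 20.0, 4.0)
    if "i cannot" in response.lower():
        score -= 2.0
    return max(score, 0.0) + 1.0

data = [
    ("Explain beam search.",
     "Beam search keeps the k highest-scoring partial hypotheses at each decoding step and expands only those."),
    ("Explain beam search.", "I cannot help with that."),
]
print(select_high_quality(data, toy_rate, threshold=1.5))  # keeps only the first pair
```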