Unlock the Power of LLMs: Explore These Datasets to Train Your Own ChatGPT!
Dataset Name | Size | Languages | Description / Source | License |
---|---|---|---|---|
cc_sbu_align | 4K | English | MiniGPT-4 dataset | BSD 3-Clause License |
SLF5K | 5K | English | The Summarization with Language Feedback (SLF5K) dataset is an English-language dataset containing 5K unique samples that can be used for the task of abstractive summarization. | apache-2.0 |
blended_skill_talk | 7K | English | A dataset of 7k conversations explicitly designed to exhibit multiple conversation modes: displaying personality, having empathy, and demonstrating knowledge. | - |
GSM-IC | 8K | English | Grade-School Math with Irrelevant Context (GSM-IC) | - |
ChatAlpaca | 10K | English | The data currently contain a total of 10,000 conversations with 95,558 utterances. | Apache-2.0 license |
PKU-SafeRLHF-10K | 10K | English | PKU-SafeRLHF-10K is the first dataset of its kind and contains 10K instances with safety preferences. | - |
Dolly | 15K | English | databricks-dolly-15k is a corpus of more than 15,000 records generated by thousands of Databricks employees to enable large language models to exhibit the magical interactivity of ChatGPT. | CC BY-SA 3.0 |
WebGPT | 20K | English | This is the dataset of all comparisons that were marked as suitable for reward modeling by the end of the WebGPT project. | - |
Code Alpaca | 20K | English | Code generation task involving 20,022 samples | - |
HC3 | 37K | English, Chinese | 37,175 instructions generated by ChatGPT and human | - |
RefGPT | 50K | English, Chinese | We introduce a cost-effective method called RefGPT, which generates a vast amount of high-quality multi-turn Q&A content. | - |
Alpaca Dataset | 52K | English | Generated via the OpenAI API from 175 seed instructions | CC BY-NC 4.0; OpenAI terms of use |
Alpaca Data Cleaned | 52K | English | Revised version of Alpaca Dataset | - |
Alpaca GPT-4 Data | 52K | English | Generated by GPT-4 using Alpaca prompts | - |
Alpaca GPT-4 Data (Chinese) | 52K | Chinese | Generated by GPT-4 using Chinese prompts translated from Alpaca by ChatGPT | - |
Cabrita Dataset | 52K | Portuguese | Translated from Alpaca Data | - |
Japanese Alpaca Dataset | 52K | Japanese | Translated from Alpaca Data by ChatGPT API | CC BY-NC 4.0; OpenAI terms of use |
Traditional Chinese Alpaca Dataset | 52K | Traditional Chinese | Translated from Alpaca Data by ChatGPT API | Apache-2.0 license |
Dynosaur | 66K | English | Dynosaur, a dynamic growth paradigm for instruction-tuning data curation. | Apache-2.0 license |
Finance | 69K | English | 68,912 financial related instructions | - |
evol | 70K | English | This is the training data of WizardLM. | - |
Vicuna Dataset | 75K | English | ~100k ShareGPT conversations | - |
InstructionTranslation | 80K | Multi-lingual | Translations were generated by M2M 12B, and the output generations were limited to 512 tokens due to the VRAM limit (40 GB). | MIT |
Self-Instruct | 82K | English | We release a dataset that contains 52k instructions, paired with 82K instance inputs and outputs. | - |
OASST1 | 89K | Multi-lingual | a human-generated, human-annotated assistant-style conversation corpus consisting of 161,443 messages in 35 different languages, annotated with 461,292 quality ratings, resulting in over 10,000 fully annotated conversation trees. | apache-2.0 |
HH-RLHF | 91K | English | The data are described in the paper: Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback. | MIT |
Guanaco Dataset | 98K | English, Simplified Chinese, Traditional Chinese HK & TW, Japanese | 175 tasks from the Alpaca model | GPLv3 |
InstructionWild | 104K | English, Chinese | 429 seed instructions, expanded to 52K instructions following the Alpaca approach | Research only; OpenAI terms of use |
Camel Dataset | 107K | Multi-lingual | Role-playing between AIs (OpenAI API) | - |
tapir-cleaned-116k | 116K | English | This is a revised version of the DAISLab dataset of IFTTT rules, which has been thoroughly cleaned, scored, and adjusted for the purpose of instruction-tuning. | cc-by-nc-4.0 |
LLaVA Visual Instruct | 150K | English | LLaVA Visual Instruct 150K is a set of GPT-generated multimodal instruction-following data. It is constructed for visual instruction tuning and for building large multimodal models towards GPT-4 vision/language capability. | cc-by-nc-4.0 |
Prosocial Dialog | 166K | English | 165,681 instructions produced from GPT-3 rewrites of questions and human feedback | - |
COIG | 191K | Chinese | Chinese Open Instruction Generalist (COIG) project to maintain a harmless, helpful, and diverse set of Chinese instruction corpora. | apache-2.0 |
Unnatural Instructions | 241K | English | A large dataset of creative and diverse instructions, collected with virtually no human labor. | MIT |
SHP | 385K | English | SHP is a dataset of 385K collective human preferences over responses to questions/instructions in 18 different subject areas, from cooking to legal advice. | Reddit non-exclusive, non-transferable, non-sublicensable, and revocable license |
ultrachat | 404K | English | To ensure generation quality, two separate ChatGPT Turbo APIs are adopted in generation, where one plays the role of the user to generate queries and the other generates the response. | cc-by-nc-4.0 |
ELI5 | 559K | English | The ELI5 dataset is an English-language dataset of questions and answers gathered from three subreddits where users ask factual questions requiring paragraph-length or longer answers. | - |
GPT4All Dataset | 806K | Multi-lingual | Subset of LAION OIG, StackOverflow questions, and the BigScience/P3 dataset. Answered by OpenAI API. | - |
Instruct | 889K | English | 888,969 English instructions, augmentation using AllenAI NLP tools | MIT |
MOSS | 1M | Chinese | Generated by gpt-3.5-turbo | Apache-2.0, AGPL-3.0 licenses |
LaMini-Instruction | 3M | English | a total of 2.58M pairs of instructions and responses using gpt-3.5-turbo based on several existing resources of prompts | cc-by-nc-4.0 |
Natural Instructions | 5M | Multi-lingual | 5,040,134 instructions collected from diverse NLP tasks | - |
BELLE | 10M | Chinese | The 10M Chinese dataset is composed of subsets spanning multiple (instruction) types and multiple fields. | Research only; OpenAI terms of use |
Firefly | 1.6M | Chinese | 1,649,398 Chinese instructions in 23 NLP tasks | - |
OIG-43M Dataset | 43M | Multi-lingual | Together, LAION, and Ontocord.ai. | - |
xP3 | 79M | Multi-lingual | 78,883,588 instructions collected by prompts & datasets across 46 languages & 16 NLP tasks | - |
Alpaca-CoT Dataset | - | Multi-lingual | Instruction Data Collection | ODC-By |
stack-exchange-paired | - | English | This dataset contains questions and answers from the Stack Overflow Data Dump for the purpose of preference model training. | cc-by-sa-4.0 |
CodeParrot | - | Python | The database was queried for all Python files smaller than 1 MB, resulting in a 180 GB dataset with over 20M files. | - |
LangChainDatasets | - | English | This is a community-driven dataset repository for datasets that can be used to evaluate LangChain chains and agents. | - |
ParlAI | - | English | 100+ popular datasets available in one place for dialogue models, from open-domain chitchat to task-oriented dialogue to visual question answering. | - |
GPTeacher | - | English | A collection of modular datasets generated by GPT-4: General-Instruct, Roleplay-Instruct, Code-Instruct, and Toolformer | - |
silk-road/Wizard-LM-Chinese-instruct-evol | - | Chinese | Wizard-LM-Chinese | - |
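
Because the table uses shorthand sizes such as `4K` and `10M`, shortlisting datasets by size, language, or license is easier once the rows are parsed. The following is a minimal sketch, not part of the original list: the helper names `parse_size` and `parse_row` are hypothetical, the sample rows reuse names, sizes, languages, and licenses from the table but abbreviate the descriptions, and the parser assumes no cell contains a `|` character.

```python
from typing import NamedTuple, Optional


class DatasetRow(NamedTuple):
    name: str
    size: Optional[int]  # number of samples, or None when the table shows "-"
    languages: str
    license: str


# Suffixes used in the Size column of the table.
_MULTIPLIERS = {"K": 1_000, "M": 1_000_000}


def parse_size(text: str) -> Optional[int]:
    """Convert a size cell such as '4K' or '10M' to an integer; '-' becomes None."""
    text = text.strip()
    if text in ("", "-"):
        return None
    suffix = text[-1].upper()
    if suffix in _MULTIPLIERS:
        return int(round(float(text[:-1]) * _MULTIPLIERS[suffix]))
    return int(text)


def parse_row(line: str) -> DatasetRow:
    """Split one table row on '|' and keep name, size, languages, and license."""
    cells = [cell.strip() for cell in line.strip().rstrip("|").split("|")]
    name, size, languages, _description, license_ = cells
    return DatasetRow(name, parse_size(size), languages, license_)


# Sample rows in the table's layout (descriptions shortened for illustration).
SAMPLE = """\
SLF5K | 5K | English | Summarization with language feedback | apache-2.0 |
HH-RLHF | 91K | English | Helpful and harmless assistant data | MIT |
MOSS | 1M | Chinese | Generated by gpt-3.5-turbo | Apache-2.0, AGPL-3.0 licenses |
Alpaca-CoT Dataset | - | Multi-lingual | Instruction Data Collection | ODC-By |
"""

rows = [parse_row(line) for line in SAMPLE.splitlines() if line.strip()]

# Keep English datasets with a known size of at least 50K samples.
large_english = [
    row.name
    for row in rows
    if row.size is not None and row.size >= 50_000 and "English" in row.languages
]
print(large_english)
```

Rows whose size is listed as `-` are kept with `size=None` rather than dropped, so they can still be filtered by language or license. License strings are left as written in the table, since the list mixes several naming styles.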