A collection of open-source datasets to train instruction-following LLMs (ChatGPT, LLaMA, Alpaca)
A collection of open-source instruction tuning datasets to train (text and multi-modal) chat-based LLMs (GPT-4, ChatGPT, LLaMA, Alpaca). We currently include three types of datasets:
Instruction tuning and Reinforcement Learning from Human Feedback (RLHF) datasets are a key component of instruction-following LLMs such as ChatGPT. This repo provides a comprehensive list of datasets used for instruction tuning in various LLMs, making it easier for researchers and developers to access and use these resources.
Lists of codebases to train your LLMs:
Size: The number of instruction tuning pairs
Lingual-Tags:
Task-Tags:
Generation-method:
Append the new project at the end of the file, using this template:
## [({owner}/{project-name})|Tags](https://github.com/link/to/project)
- summary:
- Data generation model:
- paper:
- License:
- Related: (if applicable)
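
As a minimal sketch, the template above could be filled in programmatically. The `DatasetEntry` class and `render_entry` function are hypothetical helpers, not part of this repo, and the heading shape (a markdown link whose text is the parenthesized `owner/project-name` followed by `|Tags`) is an assumption about the intended form of the template.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class DatasetEntry:
    """Hypothetical container whose fields mirror the entry template."""
    owner: str
    project: str
    tags: str               # e.g. the tag string that follows the project name
    url: str
    summary: str
    generation_model: str
    paper: str
    license: str
    related: Optional[str] = None  # rendered only if applicable


def render_entry(entry: DatasetEntry) -> str:
    """Render one entry as markdown, ready to append at the end of the file."""
    lines = [
        # Assumed heading shape: [(owner/project)|Tags](url)
        f"## [({entry.owner}/{entry.project})|{entry.tags}]({entry.url})",
        f"- summary: {entry.summary}",
        f"- Data generation model: {entry.generation_model}",
        f"- paper: {entry.paper}",
        f"- License: {entry.license}",
    ]
    if entry.related:
        lines.append(f"- Related: {entry.related}")
    return "\n".join(lines) + "\n"


# Placeholder values only; they do not describe a real dataset.
example = DatasetEntry(
    owner="owner",
    project="project-name",
    tags="Tags",
    url="https://github.com/link/to/project",
    summary="one-line description of the dataset",
    generation_model="name of the generating model, or human generated",
    paper="link to the paper, if any",
    license="license name",
)
print(render_entry(example))
```

The `Related` line is omitted when no value is given, matching the "(if applicable)" note in the template.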
BSD 3-Clause
GPT-4-0314
CC BY-NC 4.0
CC BY-NC 4.0
52K data generated from a modified self-instruct pipeline with 175 human-written seed tasks.
text-davinci-003
CC BY-NC 4.0
text-davinci-003
CC BY-NC 4.0
52K data generated from a modified self-instruct pipeline with 429 human-written seed tasks.
text-davinci-003
52K instruction data generated from a modified self-instruct pipeline with 429 human-written seed tasks.
text-davinci-003
GPL-3.0
gpt-3.5, human generated
CC BY-SA 4.0
gpt-3.5, human generated
CC BY-SA 4.0
gpt-3.5, human generated
CC BY 4.0
1,616 diverse NLP tasks and their natural language definitions/instructions.
Human generated
Apache License 2.0
Apache License 2.0
Apache License 2.0
MIT License
GPT-4
MIT License
Apache License 2.0
GPT-3.5-turbo
CC BY-NC 4.0
GPT-3.5-turbo
Apache License 2.0
text-davinci-002
MIT License
GPT-4
CC BY-NC 4.0
CC BY-SA 3.0
Apache License 2.0
GPT-4, GPT-3.5
CC0 1.0 Universal
GPT-3.5
CC BY 4.0
Anthropic RL-CAI 52B
MIT License
GPT-3.5
Apache License 2.0
CC BY-SA 4.0
GPT-4
model NOT human) of Alpaca prompts from three models (GPT-4, GPT-3.5 and OPT-IML) by asking GPT-4 to rate the quality. The author believes "GPT-4 is capable of identifying and fixing its own mistakes, and accurately judging the quality of responses".
GPT-4
CC BY-NC 4.0
Note: While these licenses permit commercial use, they may have different requirements for attribution, distribution, or modification. Be sure to review the specific terms of each license before using it in a commercial project.
Commercial use licenses:
- Apache License 2.0
- MIT License
- BSD 3-Clause License
- BSD 2-Clause License
- GNU Lesser General Public License v3.0 (LGPLv3)
- GNU Affero General Public License v3.0 (AGPLv3)
- Mozilla Public License 2.0 (MPL-2.0)
- Eclipse Public License 2.0 (EPL-2.0)
- Microsoft Public License (Ms-PL)
- Creative Commons Attribution 4.0 International (CC BY 4.0)
- Creative Commons Attribution-ShareAlike 4.0 International (CC BY-SA 4.0)
- zlib License
- Boost Software License 1.0
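
A rough sketch of how the list above might be used to screen a dataset's license string. The function name `on_commercial_use_list` and the normalization rule (ignore case and a trailing "License") are assumptions made for illustration. A `False` result only means the name is not on this list, not that commercial use is prohibited.

```python
# Names and abbreviations copied from the commercial-use list above.
_LISTED_NAMES = [
    "Apache License 2.0",
    "MIT License",
    "BSD 3-Clause License",
    "BSD 2-Clause License",
    "GNU Lesser General Public License v3.0", "LGPLv3",
    "GNU Affero General Public License v3.0", "AGPLv3",
    "Mozilla Public License 2.0", "MPL-2.0",
    "Eclipse Public License 2.0", "EPL-2.0",
    "Microsoft Public License", "Ms-PL",
    "Creative Commons Attribution 4.0 International", "CC BY 4.0",
    "Creative Commons Attribution-ShareAlike 4.0 International", "CC BY-SA 4.0",
    "zlib License",
    "Boost Software License 1.0",
]


def _normalize(name: str) -> str:
    """Lowercase the name and drop a trailing ' license' (assumed rule)."""
    name = name.strip().casefold()
    suffix = " license"
    if name.endswith(suffix):
        name = name[: -len(suffix)]
    return name


_LISTED = {_normalize(n) for n in _LISTED_NAMES}


def on_commercial_use_list(license_name: str) -> bool:
    """Return True if the license name matches an entry in the list above."""
    return _normalize(license_name) in _LISTED


# License strings as they appear in the dataset entries of this document.
for name in ["MIT License", "BSD 3-Clause", "CC BY-SA 4.0", "CC BY-NC 4.0"]:
    print(name, on_commercial_use_list(name))
```

Even for a listed license, the attribution, distribution, and modification terms still need to be reviewed, as the note above says.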