A curated list of practical guide resources for LLMs (LLMs Tree, Examples, Papers)
A curated (and still actively updated) list of practical guide resources for LLMs, based on our survey paper, Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond, and on efforts from @xinyadu. The survey draws in part on the second half of this Blog. We also build an evolutionary tree of modern Large Language Models (LLMs) to trace the development of language models in recent years and highlight some of the most well-known models.
These resources aim to help practitioners navigate the vast landscape of large language models (LLMs) and their use in natural language processing (NLP) applications. We also include usage restrictions based on each model's and dataset's licensing information. If you find any resources in our repository helpful, please feel free to use them (and don't forget to cite our paper!). We welcome pull requests to refine this figure!
@article{yang2023harnessing,
title={Harnessing the Power of LLMs in Practice: A Survey on ChatGPT and Beyond},
author={Jingfeng Yang and Hongye Jin and Ruixiang Tang and Xiaotian Han and Qizhang Feng and Haoming Jiang and Bing Yin and Xia Hu},
year={2023},
eprint={2304.13712},
archivePrefix={arXiv},
primaryClass={cs.CL}
}
We build a decision flow for choosing between LLMs and fine-tuned models for users' NLP applications. The decision flow helps users assess whether their downstream NLP applications meet specific conditions and, based on that evaluation, determine whether an LLM or a fine-tuned model is the more suitable choice.
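The decision flow above can be sketched as a small function. The condition names below are illustrative stand-ins, not the exact criteria from the paper's figure; treat this as a toy sketch of the idea:

```python
def choose_model(abundant_labeled_data: bool,
                 needs_reasoning_or_generalization: bool,
                 tight_cost_or_latency_budget: bool) -> str:
    """Toy decision flow for choosing an LLM vs. a fine-tuned model.

    The branch conditions are illustrative assumptions, not the
    paper's exact decision criteria.
    """
    # Tasks demanding broad world knowledge or reasoning, with little
    # labeled data, favor a general-purpose LLM.
    if needs_reasoning_or_generalization and not abundant_labeled_data:
        return "LLM"
    # Well-defined tasks with plenty of labels and strict serving
    # budgets favor a smaller fine-tuned model.
    if abundant_labeled_data and tight_cost_or_latency_budget:
        return "fine-tuned model"
    return "LLM" if needs_reasoning_or_generalization else "fine-tuned model"

print(choose_model(False, True, False))  # LLM
print(choose_model(True, False, True))   # fine-tuned model
```

The real decision flow in the paper weighs more conditions than three booleans can capture; the point here is only the shape of the branching.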
We build a table summarizing LLM usage restrictions (e.g., for commercial and research purposes). In particular, we provide this information from the perspective of both the models and their pretraining data. We urge users in the community to consult the licensing information for public models and data and to use them responsibly. We urge developers to pay special attention to licensing, and to make licenses transparent and comprehensive, to prevent any unwanted or unforeseen usage.
| LLMs | Model License | Commercial Use | Other Notable Restrictions | Data License | Corpus |
|---|---|---|---|---|---|
| **Encoder-only** | | | | | |
| BERT series of models (general domain) | Apache 2.0 | ✅ | | Public | BooksCorpus, English Wikipedia |
| RoBERTa | MIT license | ✅ | | Public | BookCorpus, CC-News, OpenWebText, STORIES |
| ERNIE | Apache 2.0 | ✅ | | Public | English Wikipedia |
| SciBERT | Apache 2.0 | ✅ | | Public | BERT corpus, 1.14M papers from Semantic Scholar |
| LegalBERT | CC BY-SA 4.0 | ✅ | | Public (except data from the Case Law Access Project) | EU legislation, US court cases, etc. |
| BioBERT | Apache 2.0 | ✅ | | PubMed | PubMed, PMC |
| **Encoder-Decoder** | | | | | |
| T5 | Apache 2.0 | ✅ | | Public | C4 |
| Flan-T5 | Apache 2.0 | ✅ | | Public | C4, mixture of tasks (Fig. 2 in the paper) |
| BART | Apache 2.0 | ✅ | | Public | RoBERTa corpus |
| GLM | Apache 2.0 | ✅ | | Public | BooksCorpus and English Wikipedia |
| ChatGLM | ChatGLM License | ❌ | No use for illegal purposes or military research; no harming the public interest of society | N/A | 1T tokens of Chinese and English corpus |
| **Decoder-only** | | | | | |
| GPT-2 | Modified MIT License | ✅ | Use GPT-2 responsibly and clearly indicate your content was created using GPT-2 | Public | WebText |
| GPT-Neo | MIT license | ✅ | | Public | Pile |
| GPT-J | Apache 2.0 | ✅ | | Public | Pile |
| ---> Dolly | CC BY-NC 4.0 | ❌ | | CC BY-NC 4.0; subject to the Terms of Use of the data generated by OpenAI | Pile, Self-Instruct |
| ---> GPT4ALL-J | Apache 2.0 | ✅ | | Public | GPT4All-J dataset |
| Pythia | Apache 2.0 | ✅ | | Public | Pile |
| ---> Dolly v2 | MIT license | ✅ | | Public | Pile, databricks-dolly-15k |
| OPT | OPT-175B License Agreement | ❌ | No development relating to surveillance research or military use; no harming the public interest of society | Public | RoBERTa corpus, the Pile, PushShift.io Reddit |
| ---> OPT-IML | OPT-175B License Agreement | ❌ | Same as OPT | Public | OPT corpus, extended version of Super-NaturalInstructions |
| YaLM | Apache 2.0 | ✅ | | Unspecified | Pile, texts in Russian collected by the team |
| BLOOM | The BigScience RAIL License | ✅ | No generating verifiably false information with the purpose of harming others; no content without expressly disclaiming that the text is machine generated | Public | ROOTS corpus (Laurençon et al., 2022) |
| ---> BLOOMZ | The BigScience RAIL License | ✅ | Same as BLOOM | Public | ROOTS corpus, xP3 |
| Galactica | CC BY-NC 4.0 | ❌ | | N/A | The Galactica Corpus |
| LLaMA | Non-commercial bespoke license | ❌ | No development relating to surveillance research or military use; no harming the public interest of society | Public | CommonCrawl, C4, GitHub, Wikipedia, etc. |
| ---> Alpaca | CC BY-NC 4.0 | ❌ | | CC BY-NC 4.0; subject to the Terms of Use of the data generated by OpenAI | LLaMA corpus, Self-Instruct |
| ---> Vicuna | CC BY-NC 4.0 | ❌ | | Subject to the Terms of Use of the data generated by OpenAI; Privacy Practices of ShareGPT | LLaMA corpus, 70K conversations from ShareGPT.com |
| ---> GPT4ALL | GPL-licensed LLaMA | ❌ | | Public | GPT4All dataset |
| OpenLLaMA | Apache 2.0 | ✅ | | Public | RedPajama |
| CodeGeeX | The CodeGeeX License | ❌ | No use for illegal purposes or military research | Public | Pile, CodeParrot, etc. |
| StarCoder | BigCode OpenRAIL-M v1 license | ✅ | No generating verifiably false information with the purpose of harming others; no content without expressly disclaiming that the text is machine generated | Public | The Stack |
| MPT-7B | Apache 2.0 | ✅ | | Public | mC4 (English), The Stack, RedPajama, S2ORC |
| Falcon | TII Falcon LLM License | ✅/❌ | Available under a license allowing commercial use | Public | RefinedWeb |
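For quick programmatic checks, the commercial-use column can be turned into a small lookup. The sketch below hand-transcribes a few rows from the table above; it is an illustrative subset, not a complete or authoritative license database, and it does not query any model registry:

```python
# Commercial-use flags hand-transcribed from a few rows of the table
# above. Illustrative subset only; always verify against the actual
# license text before relying on it.
COMMERCIAL_USE = {
    "RoBERTa": True,   # MIT license
    "T5": True,        # Apache 2.0
    "BLOOM": True,     # BigScience RAIL License
    "LLaMA": False,    # non-commercial bespoke license
    "Alpaca": False,   # CC BY-NC 4.0
}

def commercially_usable(model: str) -> bool:
    """Return the table's commercial-use flag, or raise for unknown models."""
    if model not in COMMERCIAL_USE:
        raise KeyError(f"{model} is not in this subset; check its license directly")
    return COMMERCIAL_USE[model]

print(commercially_usable("T5"))      # True
print(commercially_usable("Alpaca"))  # False
```

Note that a permissive model license does not by itself settle downstream use: as the Data License column shows, models such as Dolly and Alpaca inherit restrictions from instruction data generated with OpenAI models even when the base weights are more permissive.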