Pratical Llms

A collection of hands-on notebooks for LLM practitioners

Guide for LLM Practitioners

Welcome to the repository for LLM (Large Language Model) engineers! This collection of Jupyter notebooks gathers the practical aspects of our job. I will keep adding notebooks and scripts for learning and experimentation.

Notebooks Included

Notebook	Description	URL
1_understanding_llms_benchmarks.ipynb	Explains the main benchmarks used in the Open LLM Leaderboard, to help you grasp the key metrics and methodologies used in benchmarking LLMs.	Link
2_quantization_base.ipynb	Learn how to load a Hugging Face model in 8-bit and 4-bit using the bitsandbytes library. Quantization is a key technique for reducing memory and resource usage, and this notebook guides you through the process.	Link
3_quantization_gptq.ipynb	Explore quantization in GPTQ format using the auto-gptq library. The GPTQ format is gaining popularity for its effectiveness in compressing large GPT-style models. Learn how to use this format for your own models.	Link
4_quantization_exllamav2.ipynb	How to quantize a Hugging Face model to the exllamav2 format.	Link
5_sharding_and_offloading.ipynb	How to shard a model into multiple chunks. This lets you load it across different devices, or load one chunk at a time to manage memory. Also covers how to offload some layers to the CPU or even to disk.	Link
6_gguf_quantization_and_inference.ipynb	Quantize a model into GGUF using the llama.cpp library, then run inference through an OpenAI-compatible server.	Link
7_gguf_split_and_load.ipynb	Split a quantized GGUF model into multiple parts, making it easier to share.	Link
8_hqq_quantization.ipynb	Explore quantization using Half-Quadratic Quantization (HQQ).	Link
9_inference_big_model_cpu_plus_gpu.ipynb	Shows how to calculate the RAM required by a quantized GGUF model and how to load it into memory using both RAM and VRAM, optimizing the number of layers that can be offloaded to the GPU. The notebook loads Qwen/Qwen1.5-32B-Chat-GGUF as an example on a system with a T4 (15GB of VRAM) and approximately 32GB of RAM.	Link
a10_inference_llama3.ipynb	Llama3 has been released. This notebook demonstrates how to run Llama3-8B-Instruct in half precision if you have access to a GPU with 24GB of VRAM, or quantized to 8 bits if you have 10GB of VRAM, and shows how to run the Q8 GGUF version for maximum performance when you only have 10GB of VRAM.	Link
a11_llm_guardrails_using_llama3_guard.ipynb	Protect your backend and your generative AI applications using Llama3-guard-2. In this notebook, I show you how to set up a server using 10GB of VRAM and how to perform inference through HTTP POST requests.	Link
a12_speculative_decoding.ipynb	Describes and demonstrates speculative decoding, a technique that increases the tokens/second generated by a Target Model by pairing it with a smaller and lighter Draft Model. The example uses Llama-3-70B-Instruct (Target) and Llama-3-8B-Instruct (Draft).	Link
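
The speculative decoding loop described for notebook a12 can be illustrated with a toy, dependency-free sketch. This is not the notebook's code: `target_model` and `draft_model` are hypothetical stand-in functions rather than real LLMs, and the sketch uses the simple greedy variant, in which a drafted token is accepted only if it matches the target's own greedy choice.

```python
from typing import Callable, List, Tuple

Token = int
NextToken = Callable[[List[Token]], Token]


def target_model(seq: List[Token]) -> Token:
    """Hypothetical 'large' model: a deterministic toy next-token rule."""
    return (seq[-1] * 3 + len(seq)) % 11


def draft_model(seq: List[Token]) -> Token:
    """Hypothetical 'small' model: agrees with the target except at every 5th position."""
    guess = target_model(seq)
    return (guess + 1) % 11 if len(seq) % 5 == 0 else guess


def greedy_decode(model: NextToken, prompt: List[Token], n_new: int) -> List[Token]:
    """Plain autoregressive decoding: one model call per generated token."""
    seq = list(prompt)
    for _ in range(n_new):
        seq.append(model(seq))
    return seq


def speculative_decode(target: NextToken, draft: NextToken, prompt: List[Token],
                       n_new: int, k: int = 4) -> Tuple[List[Token], int]:
    """Greedy speculative decoding. Returns (sequence, number of verification passes)."""
    seq = list(prompt)
    goal = len(prompt) + n_new
    passes = 0
    while len(seq) < goal:
        # 1) The cheap draft model proposes up to k tokens.
        proposal: List[Token] = []
        for _ in range(min(k, goal - len(seq))):
            proposal.append(draft(seq + proposal))
        # 2) The target checks each proposed position. A real implementation
        #    scores all positions in one batched forward pass; here we loop.
        passes += 1
        for i, tok in enumerate(proposal):
            expected = target(seq + proposal[:i])
            if expected != tok:
                # 3) First mismatch: keep the accepted prefix plus the target's token.
                seq.extend(proposal[:i])
                seq.append(expected)
                break
        else:
            # Every drafted token matched, so all of them are accepted.
            seq.extend(proposal)
    return seq, passes
```

Because every accepted token is one the target would have chosen anyway, `speculative_decode(target_model, draft_model, [1], 20)` returns the same sequence as `greedy_decode(target_model, [1], 20)`, using fewer verification passes whenever the draft guesses well.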

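The RAM/VRAM split described for notebook 9 can be approximated with simple arithmetic. The sketch below is an estimate under stated assumptions, not the notebook's code: it treats all layers as equally sized, ignores embeddings and the output head, and `reserve_gib` (room left for the KV cache and buffers) is a guessed default. `split_layers` is a hypothetical helper name.

```python
from typing import Tuple


def split_layers(model_gib: float, n_layers: int, vram_gib: float,
                 reserve_gib: float = 1.5) -> Tuple[int, float]:
    """Estimate how many layers fit in VRAM and how much of the model stays in RAM.

    model_gib   -- size of the quantized model file, in GiB
    n_layers    -- number of transformer layers in the model
    vram_gib    -- total GPU memory, in GiB
    reserve_gib -- VRAM kept free for KV cache and buffers (assumed value)
    """
    per_layer = model_gib / n_layers          # assumes equally sized layers
    usable = max(vram_gib - reserve_gib, 0.0)
    gpu_layers = min(n_layers, int(usable // per_layer))
    ram_gib = model_gib - gpu_layers * per_layer
    return gpu_layers, ram_gib


# Illustrative numbers only: a 16 GiB file with 64 layers on a 15 GiB GPU.
gpu_layers, ram_gib = split_layers(16.0, 64, 15.0, reserve_gib=1.0)
print(f"offload {gpu_layers} layers to the GPU, keep about {ram_gib:.1f} GiB in RAM")
```

The resulting layer count is the kind of value you would pass to llama.cpp's `--n-gpu-layers` (`-ngl`) option; treat it as a starting point and lower it if the GPU runs out of memory.
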
References

For further resources and support, feel free to reach out to the community or refer to the following:

bitsandbytes GitHub Repository: Learn more about the bitsandbytes library for quantization.
Auto-GPTQ GitHub Repository: Access the auto-gptq library for GPTQ format quantization.
ExLlamaV2 GitHub Repository: Learn more about the ExLlamaV2 library for quantization and fast inference.
Accelerate GitHub Repository: Learn more about the Accelerate library from Hugging Face.
llama.cpp GitHub Repository: Learn more about the llama.cpp library.
HQQ GitHub Repository: Learn more about the HQQ library.

Additional Resources

Which GGUF is right for me?: A useful reference on GGUF and a guide to choosing the right quantization for your scenario.
Interesting Reddit thread on GGUF: A useful reference on GGUF.
Half-Quadratic Quantization of Large Machine Learning Models: The HQQ blog post.
GPTQ vs AWQ vs EXL2 vs llamacpp: A comparison of quantization methods by memory, speed, and VRAM usage.
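
As a rough companion to these guides, a model's size can be estimated from its bits per weight. The sketch below is a back-of-the-envelope assumption, not something taken from the linked resources: it counts weights only (no KV cache or runtime overhead) and uses the nominal block layouts of three llama.cpp types, where Q8_0 and Q4_0 store one 16-bit scale per 32 weights, giving 8.5 and 4.5 bits per weight. Real files mix tensor types, so actual sizes differ somewhat. `largest_fit` is a hypothetical helper name.

```python
from typing import Optional

# Nominal bits per weight, including the per-block 16-bit scale.
BITS_PER_WEIGHT = {"F16": 16.0, "Q8_0": 8.5, "Q4_0": 4.5}


def estimated_gib(n_params: float, bits_per_weight: float) -> float:
    """Approximate size of the weights alone, in GiB."""
    return n_params * bits_per_weight / 8 / 2**30


def largest_fit(n_params: float, budget_gib: float) -> Optional[str]:
    """Return the highest-precision type whose weights fit the budget, or None."""
    fitting = [
        (bpw, name)
        for name, bpw in BITS_PER_WEIGHT.items()
        if estimated_gib(n_params, bpw) <= budget_gib
    ]
    return max(fitting)[1] if fitting else None


# Illustrative: an 8-billion-parameter model against two memory budgets.
for budget in (24, 10):
    print(budget, "GiB budget ->", largest_fit(8e9, budget))
```

Leave headroom below the budget in practice, since context length and batch size add memory on top of the weights.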

Happy learning and experimenting with LLMs! 🚀

Open Source Agenda is not affiliated with "Pratical Llms" Project. README Source: AntonioGr7/pratical-llms
