Mixtral Offloading

Run Mixtral-8x7B models in Colab or on consumer desktops

Mixtral offloading

This project implements efficient inference of Mixtral-8x7B models.

How does it work?

In summary, we achieve efficient inference of Mixtral-8x7B models through a combination of techniques:

Mixed quantization with HQQ. We apply separate quantization schemes for attention layers and experts to fit the model into the combined GPU and CPU memory.
MoE offloading strategy. Each expert in each layer is offloaded separately and only brought back to the GPU when needed. We store active experts in an LRU cache to reduce GPU-RAM communication when computing activations for adjacent tokens.
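
To illustrate the caching idea, here is a minimal sketch of an LRU cache keyed by (layer, expert). It is not the code used in this repo: the class name ExpertLRUCache, the load_fn and offload_fn callbacks, and the capacity value are hypothetical, and the "weights" are plain Python objects standing in for tensors that would be copied between CPU RAM and GPU memory.

```python
from collections import OrderedDict
from typing import Any, Callable, List, Optional, Tuple

ExpertKey = Tuple[int, int]  # (layer index, expert index)


class ExpertLRUCache:
    """Keeps at most `capacity` experts resident; evicts the least recently used."""

    def __init__(
        self,
        capacity: int,
        load_fn: Callable[[ExpertKey], Any],
        offload_fn: Optional[Callable[[ExpertKey, Any], None]] = None,
    ) -> None:
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self._load = load_fn        # hypothetical: copies an expert to the GPU
        self._offload = offload_fn  # hypothetical: releases the GPU copy
        self._resident: "OrderedDict[ExpertKey, Any]" = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, key: ExpertKey) -> Any:
        """Return the expert for `key`, loading it on a cache miss."""
        if key in self._resident:
            # Cache hit: mark as most recently used, no transfer needed.
            self._resident.move_to_end(key)
            self.hits += 1
            return self._resident[key]

        self.misses += 1
        if len(self._resident) >= self.capacity:
            # Evict the least recently used expert to make room.
            old_key, old_expert = self._resident.popitem(last=False)
            if self._offload is not None:
                self._offload(old_key, old_expert)

        expert = self._load(key)
        self._resident[key] = expert
        return expert

    def resident_keys(self) -> List[ExpertKey]:
        """Keys currently resident, ordered from least to most recently used."""
        return list(self._resident.keys())


# Experts picked by a hypothetical router in layer 0 for four consecutive tokens.
cache = ExpertLRUCache(capacity=2, load_fn=lambda key: ("weights", key))
for expert_idx in [3, 5, 3, 7]:
    cache.get((0, expert_idx))

print(cache.hits, cache.misses)  # 1 3
print(cache.resident_keys())     # [(0, 3), (0, 7)]
```

In this toy run, expert 3 is reused by the third token and served from the cache, while expert 5, the least recently used entry, is evicted to make room for expert 7. A real implementation would also need to bound the cache by available GPU memory rather than by a fixed count.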

For more detailed information about our methods and results, please refer to our tech-report.

Running

To try this demo, please use the demo notebook: ./notebooks/demo.ipynb

For now, there is no command-line script available for running the model locally. However, you can create one using the demo notebook as a reference. That being said, contributions are welcome!

Work in progress

Some techniques described in our technical report are not yet available in this repo. However, we are actively working on adding support for them in the near future.

Some of the upcoming features are:

Support for other quantization methods
Speculative expert prefetching

Open Source Agenda is not affiliated with "Mixtral Offloading" Project. README Source: dvmazur/mixtral-offloading

Repository

dvmazur/mixtral-offloading

License

MIT
