LLaVA Versions

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.

v1.2.0

3 months ago

LLaVA-1.6 is out! With additional scaling to LLaVA-1.5, LLaVA-1.6-34B outperforms Gemini Pro on some benchmarks. It can now process 4x more pixels and perform more tasks/applications than before. Check out the blog post, and explore the demo! Models are available in Model Zoo. Training/eval data and scripts coming soon.

v1.1.3

6 months ago

Updates

  • Support LoRA for the instruction tuning stage of LLaVA-1.5, with performance comparable to full-model finetuning and reduced GPU VRAM requirements. (ckpts/logs, script)
  • Bring your own data and finetune LLaVA-1.5 to your own task. (instruction)
  • Basic support for Windows. (instruction)
  • Fix: training with gradient accumulation now behaves the same as large-batch training (see the sketch below).
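
For reference, a minimal, self-contained sketch of the equivalence this fix targets: with a mean-reduced loss, scaling each micro-batch loss by 1/accum_steps and stepping once per accum_steps micro-batches reproduces the gradient of one large batch. This is a generic illustration, not the repo's actual patch.

```python
import torch

# Generic illustration of the equivalence (not the repo's actual patch).
torch.manual_seed(0)
model = torch.nn.Linear(8, 1)
x, y = torch.randn(16, 8), torch.randn(16, 1)

# Gradient from one large batch of 16 samples.
model.zero_grad()
torch.nn.functional.mse_loss(model(x), y).backward()
g_large = model.weight.grad.clone()

# Same gradient accumulated over 4 micro-batches of 4 samples each.
model.zero_grad()
accum_steps = 4
for xb, yb in zip(x.chunk(accum_steps), y.chunk(accum_steps)):
    (torch.nn.functional.mse_loss(model(xb), yb) / accum_steps).backward()
g_accum = model.weight.grad.clone()

print(torch.allclose(g_large, g_accum, atol=1e-6))  # True
```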

Notes

  • A new LoRA schedule is used for LLaVA-1.5 (see the code sketch after this list):
    • rank: 128
    • alpha: 256
    • lr (LoRA): 2e-4
    • lr (projector): 2e-5
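
A minimal sketch of how this schedule maps onto Hugging Face PEFT, using a toy module in place of the full LLaVA model; the module names ("llm", "mm_projector"), the target_modules choice, and the optimizer wiring are illustrative assumptions rather than the repo's exact training code.

```python
import torch
from torch import nn
from peft import LoraConfig, get_peft_model

# Toy stand-in: "llm" plays the role of the language model's linear layers,
# "mm_projector" the vision-to-language projector. Shapes are illustrative.
class ToyLLaVA(nn.Module):
    def __init__(self):
        super().__init__()
        self.llm = nn.Linear(4096, 4096)
        self.mm_projector = nn.Linear(1024, 4096)

model = get_peft_model(
    ToyLLaVA(),
    LoraConfig(
        r=128,                             # rank: 128
        lora_alpha=256,                    # alpha: 256
        target_modules=["llm"],            # in LLaVA: the LLM's linear layers
        modules_to_save=["mm_projector"],  # keep the projector trainable alongside LoRA
    ),
)

# Two learning rates: 2e-4 for the LoRA adapters, 2e-5 for the projector.
lora_params = [p for n, p in model.named_parameters() if "lora_" in n and p.requires_grad]
proj_params = [p for n, p in model.named_parameters() if "mm_projector" in n and p.requires_grad]
optimizer = torch.optim.AdamW([
    {"params": lora_params, "lr": 2e-4},
    {"params": proj_params, "lr": 2e-5},
])
```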

v1.1.1

7 months ago

In this version, we release the training scripts, data, and benchmark evaluation scripts for LLaVA-1.5. Bake your LLaVA today!

With only simple modifications to the original LLaVA, LLaVA-1.5 achieves SoTA on 11 benchmarks: it uses only publicly available data, completes training in ~1 day on a single 8-A100 node, and surpasses methods like Qwen-VL-Chat that rely on billion-scale data. Check out the technical report, and explore the demo! Models are available in the Model Zoo!

v1.1.0

7 months ago

🔥 LLaVA-1.5 is out! This release supports LLaVA-1.5 model inference and serving. We will release the training scripts, data, and evaluation scripts on benchmarks in the coming week.

With only simple modifications to the original LLaVA, LLaVA-1.5 achieves SoTA on 11 benchmarks: it uses only publicly available data, completes training in ~1 day on a single 8-A100 node, and surpasses methods like Qwen-VL-Chat that rely on billion-scale data. Check out the technical report, and explore the demo! Models are available in the Model Zoo, with training and evaluation scripts coming in the next week!

v1.0.2

8 months ago
  • Added model zoo
  • Improved support for ScienceQA with the latest training configurations
  • Improved docs

We are continuing to improve the documentation. If you find anything unclear, please let us know. Thanks!

v1.0.1

9 months ago
  • Added LLaMA-2 support
  • Full LoRA support. To make model training more accessible, we release a set of LoRA-based model weights, which support training on academic resources (e.g. 4x A6000s or 8x 3090s, without the need for CPU offloading)
  • A more versatile design for training large multimodal models, including the ability to swap in different language models and vision encoders, with more options coming soon
  • Support for higher-resolution input using CLIP-ViT-L-336px as the vision encoder, for more detailed visual understanding
  • Ablate and clean up some design choices to make the training simpler and smoother
  • Full DeepSpeed support
  • Improved model checkpoint saving during the pretraining stage to save disk space
  • Improved WebUI interface
  • Improved support for inference with multiple GPUs
  • Support inference with 4-bit and 8-bit quantization (see the sketch after this list)
  • Support interactive CLI inference
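
A generic sketch of 4-bit/8-bit quantized loading with bitsandbytes through Hugging Face transformers, using a small public model as a stand-in checkpoint; the LLaVA repo exposes the same capability through its own loader and serving scripts, so treat this as the underlying pattern rather than the repo's exact API. A CUDA GPU and the bitsandbytes package are assumed.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # set load_in_8bit=True instead for 8-bit
    bnb_4bit_compute_dtype=torch.float16,  # weights stay 4-bit, compute runs in fp16
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
)

model_name = "facebook/opt-350m"           # stand-in checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",                     # place the quantized weights on available GPUs
)

inputs = tokenizer("Describe the image in detail:", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0], skip_special_tokens=True))
```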

We train all models in this release using LLaVA-LCS-558K for pretraining and LLaVA-Instruct-80K for instruction tuning, to maintain an efficient and affordable training budget. The full training (including both pretraining and finetuning) can be completed within 6 hours on 8x 3090s.

We hope this release further benefits the community and makes large multimodal models more accessible.

Detailed Changes

  • Tokenization. We remove the dependency on the additional tokens (<IM_START>, <IM_END>, <IM_PATCH>), so that the tokenizer does not change at all during the pretraining stage and we only update the linear projector weights (see the sketch after this list).
  • Prompt.
    • Pretraining. We simplify the pretraining prompts by removing additional instructions such as "Describe the image details", which we find still allows zero-shot inference and can slightly improve training speed.
    • We keep the train/test prompts consistent, which we find slightly improves the model's performance at inference.
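
A minimal sketch of the projector-only pretraining described above: the tokenizer is left untouched (no extra tokens added), the vision encoder and the LLM stay frozen, and only the projector receives gradient updates. The "mm_projector" name filter and the usage line are assumptions for illustration, not the repo's exact code.

```python
import torch

def projector_only_parameters(model: torch.nn.Module):
    """Freeze everything except the multimodal projector and return its parameters."""
    trainable = []
    for name, param in model.named_parameters():
        is_projector = "mm_projector" in name  # assumed projector parameter name
        param.requires_grad = is_projector     # vision encoder + LLM stay frozen
        if is_projector:
            trainable.append(param)
    return trainable

# Illustrative usage: optimizer = torch.optim.AdamW(projector_only_parameters(model), lr=...)
```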