LLaVA Versions

[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.

v1.2.0

3 months ago

LLaVA-1.6 is out! With additional scaling to LLaVA-1.5, LLaVA-1.6-34B outperforms Gemini Pro on some benchmarks. It can now process 4x more pixels and perform more tasks/applications than before. Check out the blog post, and explore the demo! Models are available in Model Zoo. Training/eval data and scripts coming soon.

v1.1.3

6 months ago

Updates

  • Support LoRA for the instruction tuning stage of LLaVA-1.5, with performance comparable to full-model finetuning and reduced GPU VRAM requirements. (ckpts/logs, script)
  • Bring your own data and finetune LLaVA-1.5 to your own task. (instruction)
  • Basic support for Windows. (instruction)
  • Fix: training with gradient accumulation now behaves the same as large-batch training (see the sketch below).
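
For reference, a minimal, self-contained sketch of the equivalence this fix targets: with a mean-reduced loss, scaling each micro-batch loss by 1/accum_steps and stepping once per accum_steps micro-batches reproduces the gradient of one large batch. This is a generic illustration, not the repo's actual patch.

```python
import torch

# Generic illustration of the equivalence (not the repo's actual patch).
torch.manual_seed(0)
model = torch.nn.Linear(8, 1)
x, y = torch.randn(16, 8), torch.randn(16, 1)

# Gradient from one large batch of 16 samples.
model.zero_grad()
torch.nn.functional.mse_loss(model(x), y).backward()
g_large = model.weight.grad.clone()

# Same gradient accumulated over 4 micro-batches of 4 samples each.
model.zero_grad()
accum_steps = 4
for xb, yb in zip(x.chunk(accum_steps), y.chunk(accum_steps)):
    (torch.nn.functional.mse_loss(model(xb), yb) / accum_steps).backward()
g_accum = model.weight.grad.clone()

print(torch.allclose(g_large, g_accum, atol=1e-6))  # True
```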

Notes

  • A new LoRA schedule is used for LLaVA-1.5 (see the code sketch after this list):
    • rank: 128
    • alpha: 256
    • lr (LoRA): 2e-4
    • lr (projector): 2e-5
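
A minimal sketch of how this schedule maps onto Hugging Face PEFT, using a toy module in place of the full LLaVA model; the module names ("llm", "mm_projector"), the target_modules choice, and the optimizer wiring are illustrative assumptions rather than the repo's exact training code.

```python
import torch
from torch import nn
from peft import LoraConfig, get_peft_model

# Toy stand-in: "llm" plays the role of the language model's linear layers,
# "mm_projector" the vision-to-language projector. Shapes are illustrative.
class ToyLLaVA(nn.Module):
    def __init__(self):
        super().__init__()
        self.llm = nn.Linear(4096, 4096)
        self.mm_projector = nn.Linear(1024, 4096)

model = get_peft_model(
    ToyLLaVA(),
    LoraConfig(
        r=128,                             # rank: 128
        lora_alpha=256,                    # alpha: 256
        target_modules=["llm"],            # in LLaVA: the LLM's linear layers
        modules_to_save=["mm_projector"],  # keep the projector trainable alongside LoRA
    ),
)

# Two learning rates: 2e-4 for the LoRA adapters, 2e-5 for the projector.
lora_params = [p for n, p in model.named_parameters() if "lora_" in n and p.requires_grad]
proj_params = [p for n, p in model.named_parameters() if "mm_projector" in n and p.requires_grad]
optimizer = torch.optim.AdamW([
    {"params": lora_params, "lr": 2e-4},
    {"params": proj_params, "lr": 2e-5},
])
```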

v1.1.1

7 months ago

In this version, we release the training scripts, data, and benchmark evaluation scripts for LLaVA-1.5. Bake your LLaVA today!

With only simple modifications to the original LLaVA, LLaVA-1.5 achieves SoTA on 11 benchmarks: it uses only publicly available data, completes training in ~1 day on a single 8-A100 node, and surpasses methods like Qwen-VL-Chat that rely on billion-scale data. Check out the technical report, and explore the demo! Models are available in the Model Zoo!

v1.1.0

7 months ago

🔥 LLaVA-1.5 is out! This release supports LLaVA-1.5 model inference and serving. We will release the training scripts, data, and evaluation scripts on benchmarks in the coming week.

With only simple modifications to the original LLaVA, LLaVA-1.5 achieves SoTA on 11 benchmarks: it uses only publicly available data, completes training in ~1 day on a single 8-A100 node, and surpasses methods like Qwen-VL-Chat that rely on billion-scale data. Check out the technical report, and explore the demo! Models are available in the Model Zoo, with training and evaluation scripts coming in the next week!

v1.0.2

8 months ago
  • Added model zoo
  • Improved support for ScienceQA with the latest training configurations
  • Improved docs

We are continuing to improve the documentation. If you find anything unclear, please let us know. Thanks!

v1.0.1

9 months ago
  • Added LLaMA-2 support
  • Full LoRA support. To make model training more accessible, we release a set of LoRA-based model weights, which support training on academic resources (e.g. 4x A6000s or 8x 3090s, without the need for CPU offloading)
  • A more versatile design for training large multimodal models, including the ability to swap in different language models and vision encoders, with more options coming soon
  • Support for higher-resolution input using CLIP-ViT-L-336px as the vision encoder, for more detailed visual understanding
  • Ablate and clean up some design choices to make the training simpler and smoother
  • Full DeepSpeed support
  • Improved model checkpoint saving during the pretraining stage to save disk space
  • Improved WebUI interface
  • Improved support for inference with multiple GPUs
  • Support inference with 4-bit and 8-bit quantization (see the sketch after this list)
  • Support interactive CLI inference
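
A generic sketch of 4-bit/8-bit quantized loading with bitsandbytes through Hugging Face transformers, using a small public model as a stand-in checkpoint; the LLaVA repo exposes the same capability through its own loader and serving scripts, so treat this as the underlying pattern rather than the repo's exact API. A CUDA GPU and the bitsandbytes package are assumed.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,                     # set load_in_8bit=True instead for 8-bit
    bnb_4bit_compute_dtype=torch.float16,  # weights stay 4-bit, compute runs in fp16
    bnb_4bit_quant_type="nf4",             # NormalFloat4 quantization
)

model_name = "facebook/opt-350m"           # stand-in checkpoint for illustration
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=quant_config,
    device_map="auto",                     # place the quantized weights on available GPUs
)

inputs = tokenizer("Describe the image in detail:", return_tensors="pt").to(model.device)
print(tokenizer.decode(model.generate(**inputs, max_new_tokens=20)[0], skip_special_tokens=True))
```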

We train all models in this release using LLaVA-LCS-558K for pretraining and LLaVA-Instruct-80K for instruction tuning, to maintain an efficient and affordable training budget. The full training (including both pretraining and finetuning) can be completed within 6 hours on 8x 3090s.

We hope this release further benefits the community and makes large multimodal models more accessible.

Detailed Changes

  • Tokenization. We remove the dependency on the additional tokens (<IM_START>, <IM_END>, <IM_PATCH>), so that the tokenizer does not change at all during the pretraining stage and we only update the linear projector weights (see the sketch after this list).
  • Prompt.
    • Pretraining. We simplify the pretraining prompts by removing additional instructions such as "Describe the image details", which we find still allows zero-shot inference and can slightly improve training speed.
    • We keep the train/test prompts consistent, which we find slightly improves the model's performance at inference.
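
A minimal sketch of the projector-only pretraining described above: the tokenizer is left untouched (no extra tokens added), the vision encoder and the LLM stay frozen, and only the projector receives gradient updates. The "mm_projector" name filter and the usage line are assumptions for illustration, not the repo's exact code.

```python
import torch

def projector_only_parameters(model: torch.nn.Module):
    """Freeze everything except the multimodal projector and return its parameters."""
    trainable = []
    for name, param in model.named_parameters():
        is_projector = "mm_projector" in name  # assumed projector parameter name
        param.requires_grad = is_projector     # vision encoder + LLM stay frozen
        if is_projector:
            trainable.append(param)
    return trainable

# Illustrative usage: optimizer = torch.optim.AdamW(projector_only_parameters(model), lr=...)
```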