[NeurIPS'23 Oral] Visual Instruction Tuning (LLaVA) built towards GPT-4V level capabilities and beyond.
LLaVA-1.6 is out! With additional scaling to LLaVA-1.5, LLaVA-1.6-34B outperforms Gemini Pro on some benchmarks. It can now process 4x more pixels and perform more tasks/applications than before. Check out the blog post, and explore the demo! Models are available in Model Zoo. Training/eval data and scripts coming soon.
In this version, we release the training scripts, data, and benchmark evaluation scripts for LLaVA-1.5. Bake your LLaVA today!
LLaVA-1.5 achieves SoTA on 11 benchmarks with just simple modifications to the original LLaVA. It uses only publicly available data, completes training in ~1 day on a single 8-A100 node, and surpasses methods like Qwen-VL-Chat that use billion-scale data. Check out the technical report, and explore the demo! Models are available in Model Zoo!
🔥 LLaVA-1.5 is out! This release supports LLaVA-1.5 model inference and serving. We will release the training scripts, data, and evaluation scripts on benchmarks in the coming week.
LLaVA-1.5 achieves SoTA on 11 benchmarks with just simple modifications to the original LLaVA. It uses only publicly available data, completes training in ~1 day on a single 8-A100 node, and surpasses methods like Qwen-VL-Chat that use billion-scale data. Check out the technical report, and explore the demo! Models are available in Model Zoo, with training and evaluation scripts coming in the next week!
We are continuing to improve the documentation. Please let us know if you find any of it unclear. Thanks!
We train all models in this release using LLaVA-LCS-558K for pretraining and LLaVA-Instruct-80K for instruction tuning, to maintain an efficient and affordable training budget. The full training (including both pretraining and finetuning) can be completed within 6 hours on 8x 3090s.
We hope this release further benefits the community and makes large multimodal models more accessible.
(<IM_START>, <IM_END>, <IM_PATCH>), so that during the pretraining stage, the tokenizer does not change at all and we only update the linear projector weights.
"Describe the image details", which we find to allow zero-shot inference and can slightly improve the training speed.
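To make "we only update the linear projector weights" concrete, here is a minimal sketch in plain Python. It is a toy under stated assumptions, not the code used in this release: the parameter names (vision_encoder, projector, language_model) and the helpers mark_trainable and sgd_step are hypothetical, and scalars stand in for weight tensors.

```python
# Toy sketch: during pretraining, freeze everything except the projector.
# All names below are hypothetical placeholders, not identifiers from the release.


def mark_trainable(param_names, trainable_prefix="projector."):
    """Return {name: bool}, True only for parameters that belong to the projector."""
    return {name: name.startswith(trainable_prefix) for name in param_names}


def sgd_step(params, grads, trainable, lr=0.5):
    """Apply one plain SGD update, leaving every frozen parameter unchanged."""
    return {
        name: (value - lr * grads[name]) if trainable[name] else value
        for name, value in params.items()
    }


# Scalars stand in for weight tensors to keep the example dependency-free.
params = {
    "vision_encoder.weight": 1.0,
    "projector.weight": 1.0,
    "projector.bias": 0.0,
    "language_model.embed_tokens": 1.0,  # no new tokens, so embeddings stay as-is
}
grads = {name: 1.0 for name in params}

trainable = mark_trainable(params)
updated = sgd_step(params, grads, trainable)

for name in params:
    print(f"{name}: trainable={trainable[name]}, {params[name]} -> {updated[name]}")
```

Because no special tokens are added, the vocabulary embeddings need no resizing or training in this stage, which is why only the projector entries change in the toy update above.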