Effective prompting for Large Multimodal Models like GPT-4 Vision, LLaVA...
[CVPR 2024 🔥] Grounding Large Multimodal Model (GLaMM), the first-of-it...
The Cradle framework is a first attempt at General Computer Control (GCC...
LLaVA-Interactive-Demo
PG-Video-LLaVA: Pixel Grounding in Large Multimodal Video Models
[CVPR 2024] HallusionBench: You See What You Think? Or You Think What You ...
MLLM-Tool: A Multimodal Large Language Model For Tool Agent Learning
Official code for the paper "Mantis: Multi-Image Instruction Tuning"
Official repository for Graphist