[CVPR 2024] Alpha-CLIP: A CLIP Model Focusing on Wherever You Want
AI Research Platform for Reinforcement Learning from Real Panoramic Images.
A curated list of awesome vision and language resources (still under con...
[arXiv 2023] PointLLM: Empowering Large Language Models to Understand Po...
Conceptual 12M is a dataset containing (image-URL, caption) pairs collec...
Implementation of 'X-Linear Attention Networks for Image Captioning' [CV...
This repo lists relevant papers summarized in our survey paper: A Syste...
CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-H...
Code for TCL: Vision-Language Pre-Training with Triple Contrastive Learn...
Code/Data for the paper: "LLaVAR: Enhanced Visual Instruction Tuning for...
[ICLR'24] Mitigating Hallucination in Large Multi-Modal Models via Robus...
Official Implementation of "GiT: Towards Generalist Vision Transformer t...
HPT - Open Multimodal LLMs from HyperGAI
Align and Prompt: Video-and-Language Pre-training with Entity Prompts
[CVPR 2022 Oral] TubeDETR: Spatio-Temporal Video Grounding with Transfor...