CALVIN - A benchmark for Language-Conditioned Policy Learning for Long-H...
💐Kaleido-BERT: Vision-Language Pre-training on Fashion Domain. (CVPR2021)
Tools for movie and video research
This is the third-party implementation of the paper Grounding DINO: Marr...
A Framework of Small-scale Large Multimodal Models
Code and Model for VAST: A Vision-Audio-Subtitle-Text Omni-Modality Foun...
[ICCV 2023] Official repo of "BEVBert: Multimodal Map Pre-training for L...
[CVPR2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrar...
Code for "Learning the Best Pooling Strategy for Visual Semantic Embeddi...
PyTorch code for BagFormer: Better Cross-Modal Retrieval via bag-wise in...
PyTorch code for Language Models with Image Descriptors are Strong Few-S...
NExT-QA: Next Phase of Question-Answering to Explaining Temporal Actions...
[ICCV 2023] Official implementation of "PØDA: Prompt-driven Zero-shot Do...
A detection/segmentation dataset with labels characterized by intricate ...
[CVPR 2023] Official repository of paper titled "CLIP2Protect: Protectin...