SALMONN: Speech Audio Language Music Open Neural Network
A curated list of Visual Question Answering (VQA) (Image/Video Question An...
FarmVibes.AI: Multi-Modal GeoSpatial ML Models for Agriculture and Susta...
[ICLR 2024] Extending Video-Language Pretraining to N-modality by La...
Open-source evaluation toolkit of large vision-language models (LVLMs), ...
Source code for "Taming Visually Guided Sound Generation" (Oral at the B...
Democratization of RT-2 "RT-2: New model translates vision and language ...
This repository collects papers for "A Survey on Knowledge Distillation ...
[MIR-2023-Survey] A continuously updated paper list for multi-modal pre-...
[ICCV2023] Official Implementation of "UniTR: A Unified and Efficient Mu...
Robust robotic localization and mapping, together with NavAbility(TM). ...
VLE: Vision-Language Encoder (VLE: a vision-language multimodal pre-trained model)
[CVPR 2024] "LL3DA: Visual Interactive Instruction Tuning for Omni-3D Un...
[CVPR2020] Unsupervised Multi-Modal Image Registration via Geometry Pres...
[CVPR2024] ViP-LLaVA: Making Large Multimodal Models Understand Arbitrar...