BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models
Note: For ease of presentation, the questions in Domestic Robot and Open Game shown here have been simplified from their original multiple-choice format. Please see our benchmark for more examples and the detailed questions.
- `baseline/`
- `evaluate/`
- `evaluate_results/`
- `jsonl/`: contains all JSONL files with the question, the relative image location, and the ground-truth answer.
Sample JSONL format:
```json
{
    "question_id": "bottle_test_broken_large_000_001",
    "image": "bottle_test_broken_large_000.png",
    "text": "Is there any defect in the object in this image? Answer the question using a single word or phrase.",
    "answer": "Yes"
}
```
Here `image` is the image path relative to the corresponding style's image folder, `text` is the question, and `answer` is the ground-truth answer.
- `imgs/`
- `results/`
- `scripts/`
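For reference, a question file can be read with a few lines of stdlib Python. A minimal sketch; the concrete file and folder names (`jsonl/defect_detection.jsonl`, `imgs/defect_detection`) are placeholders, not fixed names from this repository:

```python
import json
from pathlib import Path

# Placeholder paths -- substitute the JSONL file and image folder for your style.
questions_file = Path("jsonl/defect_detection.jsonl")
image_root = Path("imgs/defect_detection")

with questions_file.open() as f:
    for line in f:
        record = json.loads(line)                  # one JSON object per line
        image_path = image_root / record["image"]  # image path relative to the style folder
        print(record["question_id"], image_path, record["text"], record["answer"])
```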
Clone this repository and create the directory for model outputs:

```bash
git clone git@github.com:AIFEG/BenchLMM.git
cd BenchLMM
mkdir evaluate_results
```
Save your model's outputs in `evaluate_results/`, one JSON record per line:

```json
{
    "question_id": 110,
    "prompt": "Is there any defect in the object in this image? Answer the question using a single word or phrase.",
    "model_output": "Yes"
}
```

Name each output file `xxxx_StyleName.jsonl`, as in the following project tree. You must keep the style of the suffix consistent with the examples (a sketch of writing such a file follows the tree).
```
evaluate_results
├── answers_Benchmark_AD.jsonl
├── xxxxxxxx_CT.jsonl
├── xxxxxxxx_MRI.jsonl
├── xxxxxxxx_Med-X-RAY.jsonl
├── xxxxxxxx_RS.jsonl
├── xxxxxxxx_Robots.jsonl
├── xxxxxxxx_defect_detection.jsonl
├── xxxxxxxx_game.jsonl
├── xxxxxxxx_infrard.jsonl
├── xxxxxxxx_style_cartoon.jsonl
├── xxxxxxxx_style_handmake.jsonl
├── xxxxxxxx_style_painting.jsonl
├── xxxxxxxx_style_sketch.jsonl
├── xxxxxxxx_style_tattoo.jsonl
└── xxxxxxxx_xray.jsonl
```
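A minimal sketch of producing one such output file; `run_model` stands in for your own inference code, and the file names are placeholders:

```python
import json
from pathlib import Path

def run_model(image: str, prompt: str) -> str:
    """Placeholder: call your LMM here and return its answer string."""
    raise NotImplementedError

style = "defect_detection"  # must match one of the suffixes above
out_file = Path("evaluate_results") / f"xxxxxxxx_{style}.jsonl"

with open(f"jsonl/{style}.jsonl") as questions, out_file.open("w") as out:
    for line in questions:
        q = json.loads(line)
        record = {
            "question_id": q["question_id"],
            "prompt": q["text"],
            "model_output": run_model(q["image"], q["text"]),
        }
        out.write(json.dumps(record) + "\n")
```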
Run the evaluation script:

```bash
bash scripts/evaluate.sh
```
Note: Scores will be saved in the `results/` directory. The Robots and game scores are written to `evaluate_results/Robots.jsonl` and `evaluate_results/game.jsonl`, respectively.
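The official scoring is implemented by `scripts/evaluate.sh`; purely as an illustration of the record matching involved, an exact-match accuracy over one style could be computed like this (a sketch with placeholder file names, not the official metric):

```python
import json

def load_jsonl(path):
    with open(path) as f:
        return [json.loads(line) for line in f]

# Placeholder file names for one style.
truth = {r["question_id"]: r["answer"] for r in load_jsonl("jsonl/defect_detection.jsonl")}
preds = load_jsonl("evaluate_results/xxxxxxxx_defect_detection.jsonl")

correct = sum(
    p["model_output"].strip().lower() == truth[p["question_id"]].strip().lower()
    for p in preds
)
print(f"Exact-match accuracy: {correct / len(preds):.3f}")
```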
| Model | VRAM required |
|---|---|
| InstructBLIP-7B | 30 GB |
| InstructBLIP-13B | 65 GB |
| LLaVA-1.5-7B | <24 GB |
| LLaVA-1.5-13B | 30 GB |
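If you are unsure whether your GPU meets these requirements, you can check its total VRAM with PyTorch (a quick sketch, assuming a CUDA device):

```python
import torch

if torch.cuda.is_available():
    props = torch.cuda.get_device_properties(0)
    print(f"{props.name}: {props.total_memory / 1024**3:.1f} GB VRAM")
else:
    print("No CUDA device detected.")
```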
Install LLaVA:

```bash
git clone https://github.com/haotian-liu/LLaVA.git
cd LLaVA
conda create -n llava python=3.10 -y
conda activate llava
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation
```
Copy the file `BenchLMM_LLaVA_model_vqa.py` into `LLaVA/llava/eval/`. Then modify the file paths in `scripts/LLaVA.sh` and run it, followed by the evaluation script:

```bash
bash scripts/LLaVA.sh
bash scripts/evaluate.sh
```
Note: Scores will be saved in the `results/` directory.
Install LAVIS:

```bash
git clone https://github.com/salesforce/LAVIS.git
cd LAVIS
pip install -e .
```
Prepare Vicuna Weights
InstructBLIP uses frozen Vicuna 7B and 13B models. Please first follow the instructions to prepare the Vicuna v1.1 weights. Then set `llm_model` in the model config to the folder that contains the Vicuna weights.
Run InstructBLIP on our Benchmark
Modify the file paths in `BenchLMM/scripts/InstructBLIP.sh`, then run it and the evaluation script:

```bash
bash BenchLMM/scripts/InstructBLIP.sh
bash BenchLMM/scripts/evaluate.sh
```
Note: Scores will be saved in the `results/` directory.
If you find our work useful, please cite:

```bibtex
@article{cai2023benchlmm,
  title={BenchLMM: Benchmarking Cross-style Visual Capability of Large Multimodal Models},
  author={Cai, Rizhao and Song, Zirui and Guan, Dayan and Chen, Zhenhao and Luo, Xing and Yi, Chenyu and Kot, Alex},
  journal={arXiv preprint arXiv:2312.02896},
  year={2023}
}
```
If you have any questions or issues with our project, please contact Dayan Guan: [email protected]
This research is supported in part by the Rapid-Rich Object Search (ROSE) Lab of Nanyang Technological University and the NTU-PKU Joint Research Institute (a collaboration between NTU and Peking University that is sponsored by a donation from the Ng Teng Fong Charitable Foundation). We are deeply grateful to Yaohang Li from the University of Technology Sydney for his invaluable assistance in conducting the experiments, and to Jingpu Yang, Helin Wang, Zihui Cui, Yushan Jiang, Fengxian Ji, and Yuxiao Hang from NLULab@NEUQ (Northeastern University at Qinhuangdao, China) for their meticulous efforts in annotating the dataset. We would also like to thank Prof. Miao Fang (PI of NLULab@NEUQ) for his supervision and insightful suggestions during discussions on this project.