Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models
TextHawk is a Multimodal Large Language Model (MLLM) designed specifically for document-oriented tasks while preserving general capabilities. It aims to explore efficient fine-grained perception through four dedicated components:
We create a new instruction-tuning dataset, DocGemini, for document-oriented tasks by enriching multimodal document data with Gemini Pro. Each data sample contains:
DocGemini consists of 30K images and 195K QA pairs with insights.
Note: The generated dataset is undergoing legal assessment. Alternatively, you can produce data on your own using the scripts we provide.

| Model | ViT (Params.) | MME perception | MMB dev | SEED image | GQA | DocVQA | ChartQA | InfoVQA | TabFact | WTQ | RefCOCO val | RefCOCO test-A | RefCOCO test-B |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $\text{Donut}$ | $\text{Swin-B}$ (0.1B) | - | - | - | - | 67.5 | 41.8 | 11.6 | 54.6 | 18.8 | - | - | - |
| $\text{Pix2Struct}$ | - | - | - | - | - | 76.6 | 58.6 | 40.0 | - | - | - | - | - |
| $\text{InternLM-XC}$ | $\text{EVA-G}$ (1B) | 1528.4 | 74.8 | 66.1 | - | - | - | - | - | - | - | - | - |
| $\text{LLaVA-1.5-7B}$ | $\text{CLIP-L}$ (0.3B) | 1510.7 | 65.2 | - | 62.0 | - | - | - | - | - | - | - | - |
| $\text{Shikra-7B}$ | $\text{CLIP-L}$ (0.3B) | - | 58.8 | - | - | - | - | - | - | - | 87.0 | 91.1 | 81.8 |
| $\text{Qwen-VL-Chat}$ | $\text{CLIP-G}$ (2B) | 1487.6 | 60.6 | 65.4 | 57.5 | 62.6 | 66.3 | - | - | - | 88.6 | 92.3 | 84.5 |
| $\text{Monkey}$ | $\text{CLIP-G}$ (2B) | - | 59.3 | - | 60.7 | 66.5 | 65.1 | 36.1 | - | 25.3 | - | - | - |
| $\text{UReader}$ | $\text{CLIP-L}$ (0.3B) | - | - | - | - | 65.4 | 59.3 | 42.2 | 67.6 | 29.4 | - | - | - |
| $\text{TextMonkey}$ | $\text{CLIP-G}$ (2B) | - | - | - | - | 73.0 | 66.9 | - | - | 31.9 | - | - | - |
| $\textbf{TextHawk}^*$ | $\text{SigLIP-SO}$ (0.4B) | 1520.9 | 73.0 | 69.2 | 64.7 | 73.6 | 64.0 | 47.3 | 70.7 | 33.5 | 87.3 | 90.9 | 83.3 |
| $\textbf{TextHawk}$ | $\text{SigLIP-SO}$ (0.4B) | 1500.0 | 74.6 | 69.2 | 64.6 | 76.4 | 66.6 | 50.6 | 71.1 | 34.7 | 87.2 | 90.8 | 82.5 |

Note: $\textbf{TextHawk}^*$ is fine-tuned without DocGemini.
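
To make the effect of DocGemini easier to read off the table, the following is a minimal sketch that computes the per-benchmark difference between the two TextHawk rows. The scores are copied from the table; the names `BENCHMARKS`, `WITHOUT_DOCGEMINI`, `WITH_DOCGEMINI`, and `score_deltas` are illustrative and not part of the TextHawk codebase.

```python
# Benchmark columns, in the same order as the results table.
BENCHMARKS = [
    "MME perception", "MMB dev", "SEED image", "GQA",
    "DocVQA", "ChartQA", "InfoVQA", "TabFact", "WTQ",
    "RefCOCO val", "RefCOCO test-A", "RefCOCO test-B",
]

# Scores copied from the TextHawk* row (fine-tuned without DocGemini).
WITHOUT_DOCGEMINI = [1520.9, 73.0, 69.2, 64.7, 73.6, 64.0,
                     47.3, 70.7, 33.5, 87.3, 90.9, 83.3]

# Scores copied from the TextHawk row (fine-tuned with DocGemini).
WITH_DOCGEMINI = [1500.0, 74.6, 69.2, 64.6, 76.4, 66.6,
                  50.6, 71.1, 34.7, 87.2, 90.8, 82.5]


def score_deltas(names, baseline, candidate):
    """Return {benchmark: candidate - baseline}, rounded to one decimal."""
    return {
        name: round(new - old, 1)
        for name, old, new in zip(names, baseline, candidate)
    }


deltas = score_deltas(BENCHMARKS, WITHOUT_DOCGEMINI, WITH_DOCGEMINI)

# Print a signed difference per benchmark (positive means DocGemini helped).
for name, delta in deltas.items():
    print(f"{name:<16}{delta:+.1f}")
```

Under these numbers, all five document-oriented benchmarks (DocVQA, ChartQA, InfoVQA, TabFact, WTQ) improve with DocGemini, while MME perception drops and the remaining benchmarks change by at most 1.6 points.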