TextHawk

Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models

Introduction

TextHawk is a Multimodal Large Language Model (MLLM) designed specifically for document-oriented tasks while preserving general capabilities. It explores efficient fine-grained perception through four dedicated components:

  • ReSampling and ReArrangement (ReSA)
  • Scalable Positional Embeddings (SPEs)
  • Query Proposal Network (QPN)
  • Multi-Level Cross-Attention (MLCA)

DocGemini

We create a new instruction-tuning dataset, DocGemini, for document-oriented tasks by enriching multimodal document data with Gemini Pro. Each data sample contains:

  • A brief summary of the document topics.
  • Short QA pairs, up to 10.
  • Insights behind each answer.
  • [Optional] An imaginary conversation between two researchers.

DocGemini consists of 30K images and 195K QA pairs with insights.
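
The sample layout described above can be pictured as a small record type. The following is only a sketch: the class names, field names, and the example values are hypothetical and are not taken from the DocGemini release or the provided scripts. Only the overall shape (a summary, up to 10 QA pairs each with an insight, and an optional conversation) comes from the description above.

```python
from dataclasses import dataclass, field
from typing import List, Optional

# The README states that each sample holds "up to 10" short QA pairs.
MAX_QA_PAIRS = 10


@dataclass
class QAPair:
    """One short question/answer pair plus the insight behind the answer."""
    question: str
    answer: str
    insight: str


@dataclass
class DocGeminiSample:
    """Hypothetical container for one DocGemini-style sample."""
    image_path: str                            # assumed field: path to the document image
    summary: str                               # brief summary of the document topics
    qa_pairs: List[QAPair] = field(default_factory=list)
    conversation: Optional[List[str]] = None   # optional imaginary conversation turns

    def __post_init__(self) -> None:
        # Enforce the "up to 10" limit on QA pairs.
        if len(self.qa_pairs) > MAX_QA_PAIRS:
            raise ValueError(
                f"expected at most {MAX_QA_PAIRS} QA pairs, got {len(self.qa_pairs)}"
            )


# Made-up example values, for illustration only.
sample = DocGeminiSample(
    image_path="invoice_001.png",
    summary="An invoice listing purchased items and a total amount.",
    qa_pairs=[
        QAPair(
            question="What is the total amount?",
            answer="42.00",
            insight="The total appears in the last row of the item table.",
        )
    ],
)
```

Leaving `conversation` as `None` by default mirrors the optional status of that field in the list above.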

Note: The generated dataset is undergoing legal assessment. In the meantime, you can produce the data yourself using the scripts we provide.

Benchmarks

| Model | ViT (Params.) | MME perception | MMB dev | SEED image | GQA | DocVQA | ChartQA | InfoVQA | TabFact | WTQ | RefCOCO val | RefCOCO test-A | RefCOCO test-B |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| $\text{Donut}$ | $\text{Swin-B}$ (0.1B) | - | - | - | - | 67.5 | 41.8 | 11.6 | 54.6 | 18.8 | - | - | - |
| $\text{Pix2Struct}$ | - | - | - | - | - | 76.6 | 58.6 | 40.0 | - | - | - | - | - |
| $\text{InternLM-XC}$ | $\text{EVA-G}$ (1B) | 1528.4 | 74.8 | 66.1 | - | - | - | - | - | - | - | - | - |
| $\text{LLaVA-1.5-7B}$ | $\text{CLIP-L}$ (0.3B) | 1510.7 | 65.2 | - | 62.0 | - | - | - | - | - | - | - | - |
| $\text{Shikra-7B}$ | $\text{CLIP-L}$ (0.3B) | - | 58.8 | - | - | - | - | - | - | - | 87.0 | 91.1 | 81.8 |
| $\text{Qwen-VL-Chat}$ | $\text{CLIP-G}$ (2B) | 1487.6 | 60.6 | 65.4 | 57.5 | 62.6 | 66.3 | - | - | - | 88.6 | 92.3 | 84.5 |
| $\text{Monkey}$ | $\text{CLIP-G}$ (2B) | - | 59.3 | - | 60.7 | 66.5 | 65.1 | 36.1 | - | 25.3 | - | - | - |
| $\text{UReader}$ | $\text{CLIP-L}$ (0.3B) | - | - | - | - | 65.4 | 59.3 | 42.2 | 67.6 | 29.4 | - | - | - |
| $\text{TextMonkey}$ | $\text{CLIP-G}$ (2B) | - | - | - | - | 73.0 | 66.9 | - | - | 31.9 | - | - | - |
| $\textbf{TextHawk}^*$ | $\text{SigLIP-SO}$ (0.4B) | 1520.9 | 73.0 | 69.2 | 64.7 | 73.6 | 64.0 | 47.3 | 70.7 | 33.5 | 87.3 | 90.9 | 83.3 |
| $\textbf{TextHawk}$ | $\text{SigLIP-SO}$ (0.4B) | 1500.0 | 74.6 | 69.2 | 64.6 | 76.4 | 66.6 | 50.6 | 71.1 | 34.7 | 87.2 | 90.8 | 82.5 |

Note: $\textbf{TextHawk}^*$ is fine-tuned without DocGemini.

README Source: yuyq96/TextHawk