TextHawk: Exploring Efficient Fine-Grained Perception of Multimodal Large Language Models

Introduction

TextHawk is a Multimodal Large Language Model (MLLM) designed specifically for document-oriented tasks while preserving general capabilities. It explores efficient fine-grained perception through four dedicated components:

ReSampling and ReArrangement (ReSA)
Scalable Positional Embeddings (SPEs)
Query Proposal Network (QPN)
Multi-Level Cross-Attention (MLCA)

DocGemini

We create a new instruction-tuning dataset, DocGemini, for document-oriented tasks by enriching multimodal document data with Gemini Pro. Each data sample contains:

A brief summary of the document topics.
Short QA pairs, up to 10.
Insights behind each answer.
[Optional] An imaginary conversation between two researchers.

DocGemini consists of 30K images and 195K QA pairs with insights.
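
The sample layout described above can be pictured with a small Python sketch. The class and field names (`DocGeminiSample`, `QAPair`, `summary`, `insight`, `conversation`) are hypothetical and chosen only for illustration; the actual data format may differ.

```python
from dataclasses import dataclass, field
from typing import List, Optional

MAX_QA_PAIRS = 10  # the README states "up to 10" short QA pairs per sample


@dataclass
class QAPair:
    question: str
    answer: str
    insight: str  # the reasoning behind the answer


@dataclass
class DocGeminiSample:
    # Hypothetical container: field names are assumptions, not an official schema.
    summary: str  # brief summary of the document topics
    qa_pairs: List[QAPair] = field(default_factory=list)
    conversation: Optional[str] = None  # optional imaginary dialogue between two researchers

    def __post_init__(self) -> None:
        # Enforce the "up to 10" limit described in the text.
        if len(self.qa_pairs) > MAX_QA_PAIRS:
            raise ValueError(
                f"expected at most {MAX_QA_PAIRS} QA pairs, got {len(self.qa_pairs)}"
            )


# Made-up example content, used only to show the shape of a sample.
sample = DocGeminiSample(
    summary="Quarterly revenue report for a fictional company.",
    qa_pairs=[
        QAPair(
            question="Which quarter does the report cover?",
            answer="Q3",
            insight="The page header names the third quarter.",
        )
    ],
)
```

Keeping the insight next to each answer, rather than in a separate list, mirrors the description of "QA pairs with insights" and makes it harder for the two to drift out of alignment.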

Note: The generated dataset is undergoing legal assessment. Alternatively, you can produce data on your own using the scripts we provide.

Benchmarks

Model	ViT (Params.)	MME perception	MMB dev	SEED image	GQA	DocVQA	ChartQA	InfoVQA	TabFact	WTQ	RefCOCO val	RefCOCO test-A	RefCOCO test-B
Donut	Swin-B (0.1B)	-	-	-	-	67.5	41.8	11.6	54.6	18.8	-	-	-
Pix2Struct	-	-	-	-	-	76.6	58.6	40.0	-	-	-	-	-
InternLM-XC	EVA-G (1B)	1528.4	74.8	66.1	-	-	-	-	-	-	-	-	-
LLaVA-1.5-7B	CLIP-L (0.3B)	1510.7	65.2	-	62.0	-	-	-	-	-	-	-	-
Shikra-7B	CLIP-L (0.3B)	-	58.8	-	-	-	-	-	-	-	87.0	91.1	81.8
Qwen-VL-Chat	CLIP-G (2B)	1487.6	60.6	65.4	57.5	62.6	66.3	-	-	-	88.6	92.3	84.5
Monkey	CLIP-G (2B)	-	59.3	-	60.7	66.5	65.1	36.1	-	25.3	-	-	-
UReader	CLIP-L (0.3B)	-	-	-	-	65.4	59.3	42.2	67.6	29.4	-	-	-
TextMonkey	CLIP-G (2B)	-	-	-	-	73.0	66.9	-	-	31.9	-	-	-
TextHawk*	SigLIP-SO (0.4B)	1520.9	73.0	69.2	64.7	73.6	64.0	47.3	70.7	33.5	87.3	90.9	83.3
TextHawk	SigLIP-SO (0.4B)	1500.0	74.6	69.2	64.6	76.4	66.6	50.6	71.1	34.7	87.2	90.8	82.5

Note: TextHawk* is fine-tuned without DocGemini.
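
To gauge the effect of DocGemini, the two TextHawk rows can be compared directly. The sketch below copies the document-oriented scores from the table and prints the per-benchmark difference; the dictionary and variable names are illustrative only.

```python
# Scores copied from the benchmark table above (document-oriented columns only).
# "without" is the starred TextHawk row, fine-tuned without DocGemini.
without_docgemini = {"DocVQA": 73.6, "ChartQA": 64.0, "InfoVQA": 47.3, "TabFact": 70.7, "WTQ": 33.5}
with_docgemini = {"DocVQA": 76.4, "ChartQA": 66.6, "InfoVQA": 50.6, "TabFact": 71.1, "WTQ": 34.7}

# Difference per benchmark, rounded to one decimal to avoid floating-point noise.
gains = {
    name: round(with_docgemini[name] - without_docgemini[name], 1)
    for name in with_docgemini
}

for name, gain in gains.items():
    print(f"{name}: {gain:+.1f}")
```

On these five benchmarks the difference ranges from 0.4 points (TabFact) to 3.3 points (InfoVQA).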

Open Source Agenda is not affiliated with "TextHawk" Project. README Source: yuyq96/TextHawk
