Awesome Document Understanding
A curated list of resources on Document Understanding (DU), a topic related to Intelligent Document Processing (IDP), which in turn relates to Robotic Process Automation (RPA) on unstructured data, especially from Visually Rich Documents (VRDs).
Note 1: bolded entries are more important than the others.
Note 2: because the field is new, this list is under construction; contributions are welcome (thank you in advance!). Please remember to use the following convention:
Table of contents
- Introduction
- Research topics
  - Key Information Extraction (KIE)
  - Document Layout Analysis (DLA)
  - Document Question Answering (DQA)
  - Scientific Document Understanding (SDU)
  - Optical Character Recognition (OCR)
  - Related
    - General
    - Tabular Data Comprehension (TDC)
    - Robotic Process Automation (RPA)
- Others
- Resources
  - Datasets for Pre-training Language Models
  - PDF processing tools
  - Conferences / workshops
  - Blogs
  - Solutions
- Examples
  - Visually Rich Documents (VRDs)
  - Key Information Extraction (KIE)
  - Document Layout Analysis (DLA)
  - Document Question Answering (DQA)
- Inspirations
Introduction
Documents are a core part of many businesses in many fields such as law, finance, and technology among others. Automatic understanding of documents such as invoices, contracts, and resumes is lucrative, opening up many new avenues of business. The fields of natural language processing and computer vision have seen tremendous progress through the development of deep learning such that these methods have started to become infused in contemporary document understanding systems. source
Papers
2023
- DocILE Benchmark for Document Information Localization and Extraction, [Website] [benchmark] [code]
Štěpán Šimsa, Milan Šulc, Michal Uřičář, Yash Patel, Ahmed Hamdi, Matěj Kocián, Matyáš Skalický, Jiří Matas, Antoine Doucet, Mickaël Coustaty, Dimosthenis Karatzas arxiv pre-print 2023
This paper introduces the DocILE benchmark with the largest dataset of business documents for the tasks of Key Information Localization and Extraction and Line Item Recognition. It contains 6.7k annotated business documents, 100k synthetically generated documents, and nearly 1M unlabeled documents for unsupervised pre-training. The dataset has been built with knowledge of domain- and task-specific aspects, resulting in the following key features: (i) annotations in 55 classes, which surpasses the granularity of previously published key information extraction datasets by a large margin; (ii) Line Item Recognition represents a highly practical information extraction task, where key information has to be assigned to items in a table; (iii) documents come from numerous layouts and the test set includes zero- and few-shot cases as well as layouts commonly seen in the training set. The benchmark comes with several baselines, including RoBERTa, LayoutLMv3 and DETR-based Table Transformer. These baseline models were applied to both tasks of the DocILE benchmark, with results shared in this paper, offering a quick starting point for future work. The dataset and baselines are available online.
2022
- Business Document Information Extraction: Towards Practical Benchmarks
Matyáš Skalický, Štěpán Šimsa, Michal Uřičář, Milan Šulc CLEF 2022
Information extraction from semi-structured documents is crucial for frictionless business-to-business (B2B) communication. While machine learning problems related to Document Information Extraction (IE) have been studied for decades, many common problem definitions and benchmarks do not reflect domain-specific aspects and practical needs for automating B2B document communication. We review the landscape of Document IE problems, datasets and benchmarks. We highlight the practical aspects missing in the common definitions and define the Key Information Localization and Extraction (KILE) and Line Item Recognition (LIR) problems. There is a lack of relevant datasets and benchmarks for Document IE on semi-structured business documents as their content is typically legally protected or sensitive. We discuss potential sources of available documents including synthetic data.
- Doc2Graph: A Task Agnostic Document Understanding Framework Based on Graph Neural Networks, [code]
Andrea Gemelli, Sanket Biswas, Enrico Civitelli, Josep Lladós, Simone Marinai TiE Workshop @ ECCV 2022
Geometric Deep Learning has recently attracted significant interest in a wide range of machine learning fields, including document analysis. The application of Graph Neural Networks (GNNs) has become crucial in various document-related tasks since they can unravel important structural patterns, fundamental in key information extraction processes. Previous works in the literature propose task-driven models and do not take into account the full power of graphs. We propose Doc2Graph, a task-agnostic document understanding framework based on a GNN model, to solve different tasks given different types of documents. We evaluated our approach on two challenging datasets for key information extraction in form understanding, invoice layout analysis and table detection.
2021
- Document AI: Benchmarks, Models and Applications
Lei Cui, Yiheng Xu, Tengchao Lv, Furu Wei arxiv 2021
Document AI, or Document Intelligence, is a relatively new research topic that refers to the techniques for automatically reading, understanding, and analyzing business documents. It is an important research direction for natural language processing and computer vision. In recent years, the popularity of deep learning technology has greatly advanced the development of Document AI, such as document layout analysis, visual information extraction, document visual question answering, document image classification, etc. This paper briefly reviews some of the representative models, tasks, and benchmark datasets. Furthermore, we also introduce early-stage heuristic rule-based document analysis, statistical machine learning algorithms, and deep learning approaches especially pre-training methods. Finally, we look into future directions for Document AI research.
- Efficient Automated Processing of the Unstructured Documents using Artificial Intelligence: A Systematic Literature Review and Future Directions
Dipali Baviskar, Swati Ahirrao, Vidyasagar Potdar, Ketan Kotecha IEEE Access 2021
The unstructured data impacts 95% of the organizations and costs them millions of dollars annually. If managed well, it can significantly improve business productivity. The traditional information extraction techniques are limited in their functionality, but AI-based techniques can provide a better solution. A thorough investigation of AI-based techniques for automatic information extraction from unstructured documents is missing in the literature. The purpose of this Systematic Literature Review (SLR) is to recognize and analyze research on the techniques used for automatic information extraction from unstructured documents and to provide directions for future research. The SLR guidelines proposed by Kitchenham and Charters were adhered to conduct a literature search on various databases between 2010 and 2020. We found that: 1. The existing information extraction techniques are template-based or rule-based, 2. The existing methods lack the capability to tackle complex document layouts in real-time situations such as invoices and purchase orders, 3. The datasets available publicly are task-specific and of low quality. Hence, there is a need to develop a new dataset that reflects real-world problems. Our SLR discovered that AI-based approaches have a strong potential to extract useful information from unstructured documents automatically. However, they face certain challenges in processing multiple layouts of the unstructured documents. Our SLR brings out conceptualization of a framework for construction of high-quality unstructured documents dataset with strong data validation techniques for automated information extraction. Our SLR also reveals a need for a close association between the businesses and researchers to handle various challenges of the unstructured data analysis.
2020
2018
- Future paradigms of automated processing of business documents
Matteo Cristani, Andrea Bertolaso, Simone Scannapieco, Claudio Tomazzoli International Journal of Information Management 2018
In this paper we summarize the results obtained so far in the communities interested in the development of automated processing techniques as applied to business documents, and devise a few evolutions that are demanded by the current stage of either those techniques by themselves or by collateral sector advancements. It emerges a clear picture of a field that has put an enormous effort in solving problems that changed a lot during the last 30 years, and is now rapidly evolving to incorporate document processing into workflow management systems on one side and to include features derived by the introduction of cloud computing technologies on the other side. We propose an architectural schema for business document processing that comes from the two above evolution lines.
Older
Research topics
Others
Resources
Datasets for Pre-training Language Models
- The RVL-CDIP Dataset - consists of 400,000 grayscale images in 16 classes, with 25,000 images per class
- The Industry Documents Library - a portal to millions of documents created by industries that influence public health, hosted by the UCSF Library
- Color Document Dataset - from the Intelligent Sensory Information Systems, University of Amsterdam
- The IIT CDIP Collection - around 7 million documents from the states' lawsuit against the tobacco industry in the 1990s
PDF processing tools
- borb - a pure Python library to read, write and manipulate PDF documents. It represents a PDF document as a JSON-like data structure of nested lists, dictionaries and primitives (numbers, strings, booleans, etc.)
- pawls - PDF Annotations with Labels and Structure is software that makes it easy to collect a series of annotations associated with a PDF document
- pdfplumber - Plumb a PDF for detailed information about each text character, rectangle, and line. Plus: table extraction and visual debugging
- Pdfminer.six - a community-maintained fork of the original PDFMiner. It is a tool for extracting information from PDF documents, focused on getting and analyzing text data
- Layout Parser - a deep learning based tool for document image layout analysis tasks
- Tabulo - table extraction from images
- OCRmyPDF - adds an OCR text layer to scanned PDF files, allowing them to be searched or copy-pasted
- PDFBox - the Apache PDFBox library is an open source Java tool for working with PDF documents. It allows creation of new PDF documents, manipulation of existing documents and extraction of content from documents
- PdfPig - lets users read and extract text and other content from PDF files. The library can also be used to create simple PDF documents containing text and geometrical shapes. This project aims to port PDFBox to C#
- parsing-prickly-pdfs - resources and worksheet for the NICAR 2016 workshop of the same name
- pdf-text-extraction-benchmark - PDF tools benchmark
- Born digital pdf scanner - checks whether a PDF is born-digital
- OpenContracts - an Apache-2.0-licensed PDF annotation platform for visually rich documents that preserves the original layout and exports x,y positional data for tokens as well as span starts and stops. Based on PAWLS, but with a Python-based backend and readily deployable on your local machine, company intranet or the web via Docker Compose
- deepdoctection - a Python library that orchestrates document extraction and document layout analysis tasks for images and PDF documents using deep learning models. It does not implement models but enables you to build pipelines using highly acknowledged libraries for object detection, OCR and selected NLP tasks, and provides an integrated framework for fine-tuning, evaluating and running models
- pydoxtools - an AI-composition library for document analysis. It features an extensive toolset for building complex document analysis pipelines and recognizes most document formats out of the box. It supports typical NLP tasks such as keywords, summarization and question answering out of the box, features a high-quality, low-CPU/memory table extraction algorithm, and makes NLP batch operations on a cluster easy
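The borb entry above describes a PDF as a JSON-like structure of nested lists, dictionaries and primitives. As a rough illustration of what working with such a structure can look like, the sketch below walks a hand-written nested dictionary and collects page sizes. The `document` literal and the `iter_pages` helper are hypothetical and use only the standard library; they are not borb's object model or API, although the keys (`Catalog`, `Pages`, `Kids`, `MediaBox`) follow standard PDF naming.

```python
from typing import Any, Iterator

# Hypothetical, hand-written structure (NOT produced by borb).
# MediaBox is [x0, y0, x1, y1] in PDF points.
document = {
    "Type": "Catalog",
    "Pages": {
        "Type": "Pages",
        "Count": 2,
        "Kids": [
            {"Type": "Page", "MediaBox": [0, 0, 595, 842]},  # portrait page
            {"Type": "Page", "MediaBox": [0, 0, 842, 595]},  # landscape page
        ],
    },
}


def iter_pages(node: Any) -> Iterator[dict]:
    """Recursively yield every dictionary whose Type is 'Page'."""
    if isinstance(node, dict):
        if node.get("Type") == "Page":
            yield node
        for value in node.values():
            yield from iter_pages(value)
    elif isinstance(node, list):
        for item in node:
            yield from iter_pages(item)


# Width and height are the last two MediaBox entries.
page_sizes = [tuple(page["MediaBox"][2:]) for page in iter_pages(document)]
print(page_sizes)
```

Because everything is a plain list, dictionary or primitive, generic recursive traversal like this is usually enough to query the document; a real library will expose its own types and accessors.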
Conferences, workshops
General / Business / Finance
- International Conference on Document Analysis and Recognition (ICDAR) [2021, 2019, 2017]
- Workshop on Document Intelligence (DI) [2021, 2019]
- Financial Narrative Processing Workshop (FNP) [2021, 2020, 2019]
- Workshop on Economics and Natural Language Processing (ECONLP) [2021, 2019, 2018]
- International Workshop on Document Analysis Systems (DAS) [2020, 2018, 2016]
- ACM International Conference on AI in Finance (ICAIF)
- The AAAI-21 Workshop on Knowledge Discovery from Unstructured Data in Financial Services
- CVPR 2020 Workshop on Text and Documents in the Deep Learning Era
- KDD Workshop on Machine Learning in Finance (KDD MLF 2020)
- FinIR 2020: The First Workshop on Information Retrieval in Finance
- 2nd KDD Workshop on Anomaly Detection in Finance (KDD 2019)
- Document Understanding Conference (DUC 2007)
Scientific Document Understanding
- The AAAI-21 Workshop on Scientific Document Understanding (SDU 2021)
- First Workshop on Scholarly Document Processing (SDProc 2020)
- International Workshop on SCIentific DOCument Analysis (SCIDOCA) [2020, 2018, 2017]
Blogs
- A Survey of Document Understanding Models, 2021
- Document Form Extraction, 2021
- How to automate processes with unstructured data, 2021
- A Comprehensive Guide to OCR with RPA and Document Understanding, 2021
- Information Extraction from Receipts with Graph Convolutional Networks, 2021
- How to extract structured data from invoices, 2021
- Extracting Structured Data from Templatic Documents, 2020
- To apply AI for good, think form extraction, 2020
- UiPath Document Understanding Solution Architecture and Approach, 2020
- How Can I Automate Data Extraction from Complex Documents?, 2020
- LegalTech: Information Extraction in legal documents, 2020
Solutions
Big companies:
- ABBYY
- Accenture
- Amazon
- Google
- Microsoft
- UiPath
Smaller:
- Applica.ai
- Base64.ai
- Docstack
- Element AI
- Indico
- Instabase
- Konfuzio
- Metamaze
- Nanonets
- Rossum
- Silo
Examples
Visually Rich Documents
In VRDs, layout information is crucial to understanding the whole document correctly (this is the case for almost all business documents). For humans, spatial information improves readability and speeds up document understanding.
Invoice / Resume / Job Ad
NDA / Annual reports
Key Information Extraction
The aim of this task is to extract texts of a number of key fields from a given collection of documents containing similar key entities.
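To make the task concrete, here is a minimal sketch that pulls a few key fields out of plain-text documents with regular expressions. The field names, patterns and sample texts are all hypothetical and chosen for illustration; systems in this area typically learn from text together with layout rather than relying on hand-written rules like these.

```python
import re
from typing import Dict, Optional

# Hypothetical field patterns, written by hand for this example only.
FIELD_PATTERNS = {
    "invoice_number": re.compile(
        r"\bInvoice\s*(?:No\.?|Number|#)\s*:?\s*([A-Z0-9-]+)", re.IGNORECASE
    ),
    "date": re.compile(r"\bDate\s*:?\s*(\d{4}-\d{2}-\d{2})", re.IGNORECASE),
    "total": re.compile(r"\bTotal\s*:?\s*\$?\s*(\d+(?:\.\d{2})?)", re.IGNORECASE),
}


def extract_key_fields(text: str) -> Dict[str, Optional[str]]:
    """Return the first match for each key field, or None if it is absent."""
    result: Dict[str, Optional[str]] = {}
    for name, pattern in FIELD_PATTERNS.items():
        match = pattern.search(text)
        result[name] = match.group(1) if match else None
    return result


# Two made-up documents that share the same kinds of key entities.
documents = [
    "ACME Ltd\nInvoice No: INV-001\nDate: 2021-03-04\nTotal: $120.50",
    "Globex\nInvoice # 7781\nTotal 99.00",
]

for doc in documents:
    print(extract_key_fields(doc))
```

Every document yields the same set of keys, with `None` marking a field that could not be found, which mirrors the idea of extracting a fixed schema from a collection of similar documents.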
Scanned Receipts
NDA / Annual reports
Examples of real business applications and data for the Kleister datasets (key entities are in blue)
An example of a commercial real estate flyer and manually entered listing information © ProMaker Commercial Real Estate LLC, © BrokerSavant Inc.
Value-added tax invoice
Webpages
Document Layout Analysis
In computer vision or natural language processing, document layout analysis is the process of identifying and categorizing the regions of interest in the scanned image of a text document. A reading system requires the segmentation of text zones from non-textual ones and the arrangement in their correct reading order. Detection and labeling of the different zones (or blocks) as text body, illustrations, math symbols, and tables embedded in a document is called geometric layout analysis. But text zones play different logical roles inside the document (titles, captions, footnotes, etc.) and this kind of semantic labeling is the scope of the logical layout analysis. (https://en.wikipedia.org/wiki/Document_layout_analysis)
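As a small illustration of the geometric side described above, the sketch below takes already-detected blocks, separates text zones from non-text ones, and arranges them in a simple reading order. The `Block` type, the label names and the `row_tolerance` parameter are assumptions made for this example. The ordering is a naive top-to-bottom, left-to-right heuristic, not a full layout analysis method, and it will not handle multi-column pages correctly.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Block:
    label: str  # hypothetical labels: "title", "text", "figure", "table"
    x0: float   # left edge
    y0: float   # top edge (y grows downward, as in image coordinates)
    x1: float   # right edge
    y1: float   # bottom edge


TEXT_LABELS = {"title", "text"}


def reading_order(blocks: List[Block], row_tolerance: float = 10.0) -> List[Block]:
    """Order blocks top-to-bottom, then left-to-right within a row.

    Blocks whose top edges lie within row_tolerance of the first block
    in the current row are treated as belonging to that row.
    """
    rows: List[List[Block]] = []
    for block in sorted(blocks, key=lambda b: b.y0):
        if rows and abs(block.y0 - rows[-1][0].y0) <= row_tolerance:
            rows[-1].append(block)
        else:
            rows.append([block])
    return [b for row in rows for b in sorted(row, key=lambda b: b.x0)]


blocks = [
    Block("text", 300, 102, 550, 400),
    Block("title", 50, 20, 550, 60),
    Block("figure", 50, 100, 280, 400),
    Block("text", 50, 420, 550, 700),
]

ordered = reading_order(blocks)
text_zones = [b for b in ordered if b.label in TEXT_LABELS]
print([b.label for b in ordered])
print([b.label for b in text_zones])
```

Logical layout analysis would go one step further and assign roles such as caption or footnote to the text zones, which this sketch does not attempt.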
Scientific publication
Historical newspapers
Business documents
Red: text block, Blue: figure.
Document Question Answering
DocVQA example
Inspirations
Domain
- https://github.com/kba/awesome-ocr
- https://github.com/Liquid-Legal-Institute/Legal-Text-Analytics
- https://github.com/icoxfog417/awesome-financial-nlp
- https://github.com/BobLd/DocumentLayoutAnalysis
- https://github.com/bikash/DocumentUnderstanding
- https://github.com/harpribot/awesome-information-retrieval
- https://github.com/roomylee/awesome-relation-extraction
- https://github.com/caufieldjh/awesome-bioie
- https://github.com/HelloRusk/entity-related-papers
- https://github.com/pliang279/awesome-multimodal-ml
- https://github.com/thunlp/LegalPapers
- https://github.com/heartexlabs/awesome-data-labeling
General AI/DL/ML
- https://github.com/jsbroks/awesome-dataset-tools
- https://github.com/EthicalML/awesome-production-machine-learning
- https://github.com/eugeneyan/applied-ml
- https://github.com/awesomedata/awesome-public-datasets
- https://github.com/keon/awesome-nlp
- https://github.com/thunlp/PLMpapers
- https://github.com/jbhuang0604/awesome-computer-vision#awesome-lists
- https://github.com/papers-we-love/papers-we-love
- https://github.com/BAILOOL/DoYouEvenLearn
- https://github.com/hibayesian/awesome-automl-papers