docTR (Document Text Recognition) - a seamless, high-performing & accessible library for OCR-related tasks powered by Deep Learning.
This patch release fixes issues with the preprocessor and greatly improves text detection models.
Brought to you by @fg-mindee & @charlesmindee
Note: doctr 0.2.1 requires TensorFlow 2.4.0 or higher.
With this iteration, DocTR brings you a set of newly pretrained parameters for db_resnet50
which was trained using a much wider range of data augmentations!
architecture | FUNSD recall | FUNSD precision | CORD recall | CORD precision |
---|---|---|---|---|
db_resnet50 + crnn_vgg16_bn (v0.2.0) | 64.8 | 70.3 | 67.7 | 78.4 |
db_resnet50 + crnn_vgg16_bn (v0.2.1) | 70.08 | 74.77 | 82.19 | 79.67 |
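Since the table reports recall and precision separately, a single balanced figure can be derived per row with the usual F1 harmonic mean (a quick sketch; this is not a metric doctr reports itself):

```python
# F1 (harmonic mean of precision and recall) for the v0.2.1 FUNSD row above
def f1_score(recall, precision):
    return 2 * precision * recall / (precision + recall)

print(round(f1_score(70.08, 74.77), 2))  # -> 72.35
```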
Users might be tempted to filter text recognition predictions, which was previously not easy without a prediction confidence. We harmonized our recognition models to provide the sequence prediction probability.
Using a sample word image, the following snippet:
from doctr.documents import DocumentFile
from doctr.models import recognition_predictor
predictor = recognition_predictor(pretrained=True)
doc = DocumentFile.from_images("path/to/reco_sample.jpg")
print(predictor(doc))
will get you a list of tuples (word value, sequence confidence):
[('invite', 0.9302278757095337)]
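With the confidence exposed, filtering low-quality words becomes a one-liner. A minimal sketch, assuming the output format shown above (the 0.5 threshold is illustrative, not a doctr default):

```python
# Keep only words whose sequence confidence clears a (hypothetical) threshold
predictions = [("invite", 0.9302278757095337), ("bl0b", 0.41)]
threshold = 0.5
kept = [word for word, confidence in predictions if confidence >= threshold]
print(kept)  # -> ['invite']
```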
For those who play around with the predictor's components, understanding their composition is valuable. To provide a cleaner interface, we improved the representation of all predictor components.
The following snippet:
from doctr.models import ocr_predictor
print(ocr_predictor())
now yields a much cleaner representation of the predictor composition
OCRPredictor(
(det_predictor): DetectionPredictor(
(pre_processor): PreProcessor(
(resize): Resize(output_size=(1024, 1024), method='bilinear')
(normalize): Compose(
(transforms): [
LambdaTransformation(),
Normalize(mean=[0.7979999780654907, 0.7850000262260437, 0.7720000147819519], std=[0.2639999985694885, 0.27489998936653137, 0.28700000047683716]),
]
)
)
(model): DBNet(
(feat_extractor): IntermediateLayerGetter()
(fpn): FeaturePyramidNetwork(channels=128)
(probability_head): <tensorflow.python.keras.engine.sequential.Sequential object at 0x7f6f645f58e0>
(threshold_head): <tensorflow.python.keras.engine.sequential.Sequential object at 0x7f6f7ce15310>
(postprocessor): DBPostProcessor(box_thresh=0.1, max_candidates=1000)
)
)
(reco_predictor): RecognitionPredictor(
(pre_processor): PreProcessor(
(resize): Resize(output_size=(32, 128), method='bilinear', preserve_aspect_ratio=True, symmetric_pad=False)
(normalize): Compose(
(transforms): [
LambdaTransformation(),
Normalize(mean=[0.5, 0.5, 0.5], std=[1.0, 1.0, 1.0]),
]
)
)
(model): CRNN(
(feat_extractor): <doctr.models.backbones.vgg.VGG object at 0x7f6f7d866040>
(decoder): <tensorflow.python.keras.engine.sequential.Sequential object at 0x7f6f7cce2430>
(postprocessor): CTCPostProcessor(vocab_size=118)
)
)
(doc_builder): DocumentBuilder(resolve_lines=False, resolve_blocks=False, paragraph_break=0.035)
)
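As a rough illustration of what the Normalize step in the detection PreProcessor above does, here is the per-channel operation on a single RGB pixel (a plain-Python sketch; doctr itself applies this as TensorFlow ops on whole batches):

```python
# Per-channel normalization as applied by the detection pre-processor above,
# shown on a single RGB pixel already rescaled to [0, 1]
mean = (0.798, 0.785, 0.772)
std = (0.264, 0.2749, 0.287)

pixel = (0.5, 0.5, 0.5)
normalized = [(channel - m) / s for channel, m, s in zip(pixel, mean, std)]
print([round(channel, 3) for channel in normalized])  # -> [-1.129, -1.037, -0.948]
```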
Renamed ExactMatch to TextMatch, since the metric now offers several levels of flexibility for evaluation. Additionally, the constructor flags have been deprecated, since the summary provides all evaluation variants.
0.2.0 | 0.2.1 |
---|---|
>>> from doctr.utils.metrics import ExactMatch >>> metric = ExactMatch(ignore_case=True) >>> metric.update(["i", "am", "a", "jedi"], ["I", "am", "a", "sith"]) >>> print(metric.summary()) 0.75 |
>>> from doctr.utils.metrics import TextMatch >>> metric = TextMatch() >>> metric.update(["i", "am", "a", "jedi"], ["I", "am", "a", "sith"]) >>> print(metric.summary()) {'raw': 0.5, 'caseless': 0.75, 'unidecode': 0.5, 'unicase': 0.75} |
Here raw is the exact match, caseless the exact match of lower-case counterparts, unidecode the exact match of unidecoded counterparts, and unicase the exact match of unidecoded lower-case counterparts.
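The four levels can be sketched with a minimal re-implementation (not doctr's actual code; the accent-stripping step is stubbed out since the sample strings are plain ASCII):

```python
# Minimal re-implementation sketch of TextMatch's summary (not doctr's code).
def text_match_summary(preds, targets):
    # Real unidecoding strips accents; a no-op stand-in suffices for ASCII input
    unidecode = lambda s: s
    raw = caseless = unidec = unicase = 0
    for pred, target in zip(preds, targets):
        raw += pred == target
        caseless += pred.lower() == target.lower()
        unidec += unidecode(pred) == unidecode(target)
        unicase += unidecode(pred).lower() == unidecode(target).lower()
    n = len(preds)
    return {"raw": raw / n, "caseless": caseless / n,
            "unidecode": unidec / n, "unicase": unicase / n}

print(text_match_summary(["i", "am", "a", "jedi"], ["I", "am", "a", "sith"]))
# -> {'raw': 0.5, 'caseless': 0.75, 'unidecode': 0.5, 'unicase': 0.75}
```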
Deep learning model building and inference
- db_resnet50 (#277)

Utility features relevant to the library use cases
Data transformations operations
Verifications of the package well-being before release
Online resources for potential users
Reference training scripts
Other tools and implementations

Fixes & improvements
- OCRDataset (#270)
- OCRMetric update edge case (#267)
- Resize when preserving aspect ratio (#266)
- RandomSaturation (#277)
- OCRDataset (#274)
- doctr.documents.elements (#274)
- ignore_case and ignore_accents from recognition postprocessors (#284)
- OCRDataset (#278)
- DocumentBuilder and recognition models (#284)

This release improves model performance and extends library features considerably (including a minimal API template, new datasets, newly trained models).
Release handled by @fg-mindee & @charlesmindee
Note: doctr 0.2.0 requires TensorFlow 2.4.0 or higher.
Enjoy our newly trained detection and recognition models with improved robustness and performance! Check the full benchmark in the documentation for further details.
This release comes with a large improvement of line detection. While it is only done in post-processing for now, we considered many cases to make sure you get a consistent and helpful result:
Before | After |
---|---|
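As a toy illustration of what such post-processing involves (not doctr's actual implementation), words can be grouped into lines by vertical proximity of their box centers:

```python
def group_into_lines(boxes, y_tol=0.02):
    """Group word boxes (xmin, ymin, xmax, ymax), in relative coords, into lines."""
    lines = []
    for box in sorted(boxes, key=lambda b: (b[1] + b[3]) / 2):
        y_center = (box[1] + box[3]) / 2
        if lines:
            prev = lines[-1]
            prev_center = sum((b[1] + b[3]) / 2 for b in prev) / len(prev)
            # close enough vertically -> same line
            if abs(y_center - prev_center) <= y_tol:
                prev.append(box)
                continue
        lines.append([box])
    # order words left to right within each line
    return [sorted(line, key=lambda b: b[0]) for line in lines]

words = [
    (0.50, 0.10, 0.60, 0.14),  # right-hand word of the first line
    (0.10, 0.11, 0.20, 0.15),  # left-hand word of the first line
    (0.10, 0.30, 0.30, 0.34),  # a second line, further down the page
]
print(group_into_lines(words))
# -> [[(0.1, 0.11, 0.2, 0.15), (0.5, 0.1, 0.6, 0.14)], [(0.1, 0.3, 0.3, 0.34)]]
```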
You can now read images or PDFs from files, binary streams, or even URLs. We completely revamped our document reading pipeline with the new DocumentFile class methods:
from doctr.documents import DocumentFile
# PDF
pdf_doc = DocumentFile.from_pdf("path/to/your/doc.pdf").as_images()
# Image
single_img_doc = DocumentFile.from_images("path/to/your/img.jpg")
# Multiple page images
multi_img_doc = DocumentFile.from_images(["path/to/page1.jpg", "path/to/page2.jpg"])
# Web page
webpage_doc = DocumentFile.from_url("https://www.yoursite.com").as_images()
If your PDF is a source file (web pages are converted into such PDFs) rather than a scanned version, you will also be able to read the information inside:
from doctr.documents import DocumentFile
pdf_doc = DocumentFile.from_pdf("path/to/your/doc.pdf")
# Retrieve bounding box and text information
words = pdf_doc.get_words()
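The exact structure of `words` may vary across versions; assuming a list of (bounding box, value) pairs per page (an assumption for illustration), the raw text could be recovered like this:

```python
# Hypothetical output structure, for illustration only; check the actual
# return type of get_words() in your doctr version.
words = [  # one list per page
    [((0.10, 0.10, 0.30, 0.15), "Hello"), ((0.35, 0.10, 0.50, 0.15), "world")],
]
# Flatten all pages into a plain string, discarding the bounding boxes
text = " ".join(value for page in words for _, value in page)
print(text)  # -> Hello world
```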
By adding multithreaded dataloaders and transformations to DocTR, we can now provide you with reference training scripts to train models on your own!
Text detection script (additional details available in README)
python references/detection/train.py /path/to/dataset db_resnet50 -b 8 --input-size 512 --epochs 20
Text recognition script (additional details available in README)
python references/recognition/train.py /path/to/dataset crnn_vgg16_bn -b 8 --epochs 20
If you enjoy DocTR, you might want to integrate it into your API. For your convenience, we added a minimal API template with routes for text detection, text recognition, or plain OCR!
Run it as follows in a docker container:
PORT=8050 docker-compose up -d --build
Your API is now running locally on port 8050! Navigate to http://localhost:8050/redoc to check the documentation, or start making your first request!
import requests
import io
with open('/path/to/your/image.jpeg', 'rb') as f:
data = f.read()
response = requests.post("http://localhost:8050/recognition", files={'file': io.BytesIO(data)})
In order to ensure that all compression features are fully functional in DocTR, support for TensorFlow < 2.4.0 has been dropped.
OCRPredictor used to take a list of documents as input; it now only takes a list of pages.
0.1.1 | 0.2.0 |
---|---|
>>> predictor = ... >>> page = np.zeros((h, w, 3), dtype=np.uint8) >>> out = predictor([[page]]) |
>>> predictor = ... >>> page = np.zeros((h, w, 3), dtype=np.uint8) >>> out = predictor([page]) |
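Migrating is mostly a matter of flattening: if existing code still holds a list of documents (each a list of pages), it can be unrolled before calling the new predictor. A plain-Python sketch with placeholder pages:

```python
# Placeholder "pages"; in practice these are HxWx3 uint8 numpy arrays
documents = [["page1", "page2"], ["page3"]]

# 0.1.1 code called predictor(documents); 0.2.0 expects a flat list of pages
pages = [page for document in documents for page in document]
print(pages)  # -> ['page1', 'page2', 'page3']
```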
To gain more flexibility on the training side, the model's call method was changed to yield a dictionary with multiple entries:
0.1.1 | 0.2.0 |
---|---|
>>> from doctr.models import db_resnet50, DBPostProcessor >>> model = db_resnet50(pretrained=True) >>> postprocessor = DBPostProcessor() >>> prob_map = model(input_t, training=False) >>> boxes = postprocessor(prob_map) |
>>> from doctr.models import db_resnet50 >>> model = db_resnet50(pretrained=True) >>> out = model(input_t, training=False) >>> boxes = out['boxes'] |
Easy-to-use datasets for OCR
- DataLoader as a dataset wrapper for parallel high-performance data reading (#198, #201)
- OCRDataset (#244)

Deep learning model building and inference
- crnn_resnet31 recognition model (#160)

Utility features relevant to the library use cases
Data transformations operations
- Compose, Resize, Normalize & LambdaTransformation (#205)

Verifications of the package well-being before release
- OCRDataset (#244)

Online resources for potential users
Other tools and implementations
This patch release fixes several bugs, introduces OCR datasets, and improves model performance.
Release handled by @fg-mindee & @charlesmindee
Note: doctr 0.1.1 requires TensorFlow 2.3.0 or higher.
Whether this is for training or evaluation purposes, DocTR provides you with objects to easily download and manipulate datasets. Access OCR datasets within a few lines of code:
from doctr.datasets import FUNSD
train_set = FUNSD(train=True, download=True)
img, target = train_set[0]
While DocTR 0.1.0 gave you access to pretrained models, you had no way to assess the performance of these models apart from computing it yourself. As of now, we have added a performance benchmark in our documentation for all our models and made the evaluation script available for seamless reproducibility:
python scripts/evaluate.py ocr_db_crnn_vgg
Since we want to make DocTR a convenience for you to build OCR-related applications and services, we made a minimal Streamlit demo app to showcase its text detection capabilities. You can run the demo with the following commands:
streamlit run demo/app.py
Here is how it renders when performing text detection on a sample document:
For improved clarity, the evaluation metrics' methods were renamed.
0.1.0 | 0.1.1 |
---|---|
>>> from doctr.utils import ExactMatch >>> metric = ExactMatch() >>> metric.update_state(['Hello', 'world'], ['hello', 'world']) >>> metric.result() |
>>> from doctr.utils import ExactMatch >>> metric = ExactMatch() >>> metric.update(['Hello', 'world'], ['hello', 'world']) >>> metric.summary() |
As the range of backbones and combinations evolves, we have updated the name of high-level predictors:
0.1.0 | 0.1.1 |
---|---|
>>> from doctr.models import ocr_db_crnn |
>>> from doctr.models import ocr_db_crnn_vgg |
Easy-to-use datasets for OCR
- FUNSD dataset (#136, #141)

Deep learning model building and inference
Utility features relevant to the library use cases
Verifications of the package well-being before release
- crnn_resnet31 (#148), and OCR predictors (#150)

Online resources for potential users
- FUNSD in documentation (#143, #149, #150, #155)
- sar_resnet31 to recognition models documentation (#150)

Other tools and implementations
- analyze.py script runs (#142)
- bitmap_to_boxes method (#155)
- ExactMatch (#120)
- VisionDataset and FUNSD (#147)
- max_length and input_shape of SAR (#143)
- NestedObject when they have no children (#137)
- FUNSD (#154)

This first release adds pretrained models for end-to-end OCR and document manipulation utilities.
Release handled by @fg-mindee & @charlesmindee
Note: doctr 0.1.0 requires TensorFlow 2.3.0 or newer.
Since document processing is at the core of this project, being able to read documents efficiently is a priority. In this release, we considered PDF and image-based files.
PDF reading is a wrapper around the PyMuPDF back-end for fast file reading:
from doctr.documents import read_pdf
# from path
doc = read_pdf("path/to/your/doc.pdf")
# from stream
with open("path/to/your/doc.pdf", 'rb') as f:
doc = read_pdf(f.read())
while image reading uses the OpenCV back-end:
from doctr.documents import read_img
page = read_img("path/to/your/img.jpg")
Whether you conduct text detection, text recognition, or end-to-end OCR, this release brings you pretrained models and advanced predictors (which take care of all preprocessing, model inference, and post-processing for you) for easy-to-use Pythonic features.
Currently, only DBNet-based architectures are supported, more to come in the next releases!
from doctr.documents import read_pdf
from doctr.models import db_resnet50_predictor
model = db_resnet50_predictor(pretrained=True)
doc = read_pdf("path/to/your/doc.pdf")
result = model(doc)
There are two architectures implemented for recognition: CRNN and SAR
from doctr.models import crnn_vgg16_bn_predictor
model = crnn_vgg16_bn_predictor(pretrained=True)
Simply combining two models into a two-stage architecture, OCR predictors bring you the easiest way to analyze your document
from doctr.documents import read_pdf
from doctr.models import ocr_db_crnn
model = ocr_db_crnn(pretrained=True)
doc = read_pdf("path/to/your/doc.pdf")
result = model([doc])
Document reading and manipulation
Deep learning model building and inference
Utility features relevant to the library use cases.
Verifications of the package well-being before release
Online resources for potential users
Other tools and implementations
This release is only a mirror for pretrained detection & recognition models.