docTR (Document Text Recognition) - a seamless, high-performing & accessible library for OCR-related tasks powered by Deep Learning.
Note: doctr 0.8.1 requires either TensorFlow >= 2.11.0 or PyTorch >= 1.12.0.
Fixed the conda recipe and CI jobs for conda and PyPI releases
Fixed some broken links
Pre-release: FAST text detection model from "FAST: Faster Arbitrarily-Shaped Text Detector with Minimalist Kernel Representation" -> checkpoints will be provided with the next release
Note: doctr 0.8.0 requires either TensorFlow >= 2.11.0 or PyTorch >= 1.12.0.
db_resnet50_rotation (PyTorch) and linknet_resnet18_rotation (TensorFlow) are removed (all models can handle rotated documents now)
.show(doc) changed to .show()
WildReceipt dataset added by @HamzaGbada
Added hooks to ocr_predictor so you can manipulate the detection predictions in the middle of the pipeline to your needs, by @felixdittrich92:
from doctr.models import ocr_predictor
class CustomHook:
    def __call__(self, loc_preds):
        # Manipulate the location predictions here
        # 1. The output structure needs to be the same as the input location predictions
        # 2. Be aware that the coordinates are relative and need to be between 0 and 1
        return loc_preds

my_hook = CustomHook()

predictor = ocr_predictor(pretrained=True)
# Add a hook in the middle of the pipeline
predictor.add_hook(my_hook)
# You can also add multiple hooks which will be executed sequentially
for hook in [my_hook, my_hook, my_hook]:
    predictor.add_hook(hook)
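As a concrete illustration, a hypothetical hook that drops tiny detections might look like this (assuming `loc_preds` arrives as a numpy array of shape `(N, 4, 2)` holding relative polygon coordinates; check your detector's actual output structure before relying on this):

```python
import numpy as np

class MinAreaHook:
    """Drop detected boxes whose relative area falls below a threshold."""

    def __init__(self, min_area=1e-4):
        self.min_area = min_area

    def __call__(self, loc_preds):
        # loc_preds: (N, 4, 2) polygons with coordinates in [0, 1]
        widths = loc_preds[:, :, 0].max(axis=1) - loc_preds[:, :, 0].min(axis=1)
        heights = loc_preds[:, :, 1].max(axis=1) - loc_preds[:, :, 1].min(axis=1)
        keep = (widths * heights) >= self.min_area
        return loc_preds[keep]

# Sanity check on dummy predictions: one normal box, one degenerate box
preds = np.array([
    [[0.1, 0.1], [0.4, 0.1], [0.4, 0.2], [0.1, 0.2]],                      # area 0.03
    [[0.5, 0.5], [0.5001, 0.5], [0.5001, 0.5001], [0.5, 0.5001]],          # negligible area
])
filtered = MinAreaHook(min_area=1e-4)(preds)
print(filtered.shape)  # -> (1, 4, 2)
```

Note that the filtered output keeps the same structure as the input (an array of relative polygons), as required by the hook contract above.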
tqdm instead of fastprogress in reference scripts by @odulcy-mindee in https://github.com/mindee/doctr/pull/1389
Added WILDRECEIPT to the docs and fixed README.md by @odulcy-mindee in https://github.com/mindee/doctr/pull/1363
Full Changelog: https://github.com/mindee/doctr/compare/v0.7.0...v0.8.0
Note: doctr 0.7.0 requires either TensorFlow >= 2.11.0 or PyTorch >= 1.12.0.
Note: We will release the missing PyTorch checkpoints with 0.7.1.
The preserve_aspect_ratio parameter now defaults to True, in https://github.com/mindee/doctr/pull/1279
=> To restore the old behaviour, pass preserve_aspect_ratio=False to the predictor instance
The KIE predictor is a more flexible predictor compared to OCR, as your detection model can detect multiple classes in a document. For example, you can have a detection model that detects just dates and addresses in a document.
The KIE predictor makes it possible to combine a multi-class detection model with a recognition model, with the whole pipeline already set up for you.
from doctr.io import DocumentFile
from doctr.models import kie_predictor

# Model
model = kie_predictor(det_arch='db_resnet50', reco_arch='crnn_vgg16_bn', pretrained=True)
# PDF
doc = DocumentFile.from_pdf("path/to/your/doc.pdf")
# Analyze
result = model(doc)

predictions = result.pages[0].predictions
for class_name in predictions.keys():
    list_predictions = predictions[class_name]
    for prediction in list_predictions:
        print(f"Prediction for {class_name}: {prediction}")
The KIE predictor's per-page results come as a dictionary, with each key a class name and its value the list of predictions for that class.
Removed the tensorflow_addons dependency by @felixdittrich92 in https://github.com/mindee/doctr/pull/1252
Full Changelog: https://github.com/mindee/doctr/compare/v0.6.0...v0.7.0
Note: doctr 0.6.0 requires either TensorFlow >= 2.9.0 or PyTorch >= 1.8.0.
from doctr.io import DocumentFile
from doctr.models import ocr_predictor, from_hub
image = DocumentFile.from_images(['data/example.jpg'])
# Load a custom detection model from huggingface hub
det_model = from_hub('Felix92/doctr-torch-db-mobilenet-v3-large')
# Load a custom recognition model from huggingface hub
reco_model = from_hub('Felix92/doctr-torch-crnn-mobilenet-v3-large-french')
# You can easily plug these models into the OCR predictor
predictor = ocr_predictor(det_arch=det_model, reco_arch=reco_model)
result = predictor(image)
from doctr.models import recognition, login_to_hub, push_to_hf_hub
login_to_hub()
my_awesome_model = recognition.crnn_mobilenet_v3_large(pretrained=True)
push_to_hf_hub(my_awesome_model, model_name='doctr-crnn-mobilenet-v3-large-french-v1', task='recognition', arch='crnn_mobilenet_v3_large')
Documentation: https://mindee.github.io/doctr/using_doctr/sharing_models.html
from doctr.datasets import CORD
# Crop boxes as is (crops can be irregular)
train_set = CORD(train=True, download=True, recognition_task=True)
# Crop rotated boxes (always regular)
train_set = CORD(train=True, download=True, use_polygons=True, recognition_task=True)
img, target = train_set[0]
Documentation: https://mindee.github.io/doctr/using_doctr/using_datasets.html
NOTE: a full production pipeline with ONNX export / build is planned for 0.7.0 (for now, models can only be exported up to the logits, without any post-processing included)
using_doctr by @odulcy-mindee in https://github.com/mindee/doctr/pull/993
Updated io/pdf.py to the new pypdfium2 API by @mara004 in https://github.com/mindee/doctr/pull/944
Full Changelog: https://github.com/mindee/doctr/compare/v0.5.1...v0.6.0
This minor release includes: documentation improvements thanks to @felixdittrich92, bug fixes, rotation support extended to the TensorFlow backend, a switch from PyMuPDF to pypdfium2, and a nice integration with the Hugging Face Hub thanks to @fg-mindee!
Note: doctr 0.5.0 requires either TensorFlow 2.4.0 or PyTorch 1.8.0.
The documentation has been improved with a new theme and illustrations, and the docstrings have been completed and expanded. This is how it renders:
We provide weights for the linknet_resnet18_rotation model, which has been deeply modified: we implemented a new loss (based on Dice Loss and Focal Loss), we changed the computation of the targets so that polygons are shrunken the same way as in DBNet (which greatly improves the precision of the segmenter), and we trained the model while preserving the aspect ratio of the images.
All these improvements led to much better results, and the pretrained model is now very robust.
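For intuition, here is a rough numpy sketch of such a combined Dice + Focal objective (illustrative only; not docTR's exact formulation, and the weighting is a made-up assumption):

```python
import numpy as np

def dice_loss(probs, targets, eps=1e-8):
    # Overlap-based loss on the segmentation heatmap
    inter = (probs * targets).sum()
    return 1.0 - (2.0 * inter + eps) / (probs.sum() + targets.sum() + eps)

def focal_loss(probs, targets, gamma=2.0, eps=1e-8):
    # Cross-entropy down-weighted for easy, well-classified pixels
    pt = np.where(targets == 1, probs, 1.0 - probs)
    return float(np.mean(-((1.0 - pt) ** gamma) * np.log(pt + eps)))

def combined_loss(probs, targets, alpha=0.5):
    return alpha * dice_loss(probs, targets) + (1 - alpha) * focal_loss(probs, targets)

perfect = combined_loss(np.ones(16), np.ones(16))
noisy = combined_loss(np.full(16, 0.5), np.ones(16))
print(perfect < noisy)  # -> True
```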
You can now choose to preserve the aspect ratio in the detection_predictor:
>>> from doctr.models import detection_predictor
>>> predictor = detection_predictor('db_resnet50_rotation', pretrained=True, assume_straight_pages=False, preserve_aspect_ratio=True)
This option can also be activated in the high level end-to-end predictor:
>>> from doctr.models import ocr_predictor
>>> model = ocr_predictor('linknet_resnet18_rotation', pretrained=True, assume_straight_pages=False, preserve_aspect_ratio=True)
The artefact detection model is now available on the Hugging Face Hub, which is amazing:
On docTR, you can now use the .from_hub() method, so that these two snippets are equivalent:
# Pretrained
from doctr.models.obj_detection import fasterrcnn_mobilenet_v3_large_fpn
model = fasterrcnn_mobilenet_v3_large_fpn(pretrained=True)
and:
# HF Hub
from doctr.models.obj_detection.factory import from_hub
model = from_hub("mindee/fasterrcnn_mobilenet_v3_large_fpn")
We replaced the PyMuPDF dependency with pypdfium2 due to a license-compatibility issue. As a result, we lose the word and object extraction from source PDFs that was done with PyMuPDF. It wasn't used in any models, so it is not a big issue, but we will work on re-integrating such a feature in the future.
Full Changelog: https://github.com/mindee/doctr/compare/v0.5.0...v0.5.1
This release adds support of rotated documents, and extends both the model & dataset zoos.
Note: doctr 0.5.0 requires either TensorFlow 2.4.0 or PyTorch 1.8.0.
It's no secret: this release's focus was to bring the same level of performance to rotated documents!
docTR is meant to be your best tool for seamless document processing, and it couldn't do without supporting a very natural & common augmentation of input documents. This large project was subdivided into three parts:
Developing a heuristic-based method to estimate the page skew, and rotate it before forwarding it to any deep learning model. Our thanks to @Rob192 for his contribution on this part :pray:
This behaviour can be enabled to avoid retraining the text detection models. However, the heuristics approach has its limits in terms of robustness.
The core of this project was to enable our text detection models to produce non-degraded heatmaps & localization candidates when processing a rotated page.
Finally, once the localization candidates have been extracted, there is no guarantee that a given candidate reads from left to right. To remove this doubt, a lightweight image orientation classifier was added to refine the crops that are sent to text recognition!
The stability of training complex deep learning tasks has mostly been helped by leveraging transfer learning. As such, OCR tasks usually require a backbone as a feature extractor. For this reason, all checkpoints of classification models in both PyTorch & TensorFlow have been updated :rocket: They were trained using our synthetic character classification dataset; for more details, cf. Character classification training
Thanks to @felixdittrich92, the list of supported datasets has grown considerably :partying_face: It includes widely popular datasets used for benchmarks on OCR-related tasks; you can find the full list over here :point_right: #587
Additionally, we followed up on the existing CharGenerator by introducing WordGenerator:
Below are some samples using font_size=32:
Two new notebooks have made their way into the documentation:
With the retraining of all classification backbones, several changes have been introduced:
linknet16 --> linknet_resnet18
In order to unify our data pipelines, we forced the conversion to relative coordinates on all datasets!
| 0.4.1 | 0.5.0 |
|---|---|
| `>>> from doctr.datasets import FUNSD`<br>`>>> ds = FUNSD(train=True, download=True)`<br>`>>> img, target = ds[0]`<br>`>>> print(target['boxes'].dtype, target['boxes'].max())`<br>`(dtype('int64'), 862)` | `>>> from doctr.datasets import FUNSD`<br>`>>> ds = FUNSD(train=True, download=True)`<br>`>>> img, target = ds[0]`<br>`>>> print(target['boxes'].dtype, target['boxes'].max())`<br>`(dtype('float32'), 0.98341835)` |
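The conversion itself amounts to dividing pixel coordinates by the page dimensions; a minimal sketch (hypothetical helper, not docTR's internal code):

```python
import numpy as np

def to_relative(boxes, height, width):
    """Convert absolute (xmin, ymin, xmax, ymax) pixel boxes to relative coords in [0, 1]."""
    boxes = boxes.astype(np.float32)  # work on a float copy
    boxes[:, [0, 2]] /= width
    boxes[:, [1, 3]] /= height
    return boxes

abs_boxes = np.array([[100, 50, 300, 150]], dtype=np.int64)
rel = to_relative(abs_boxes, height=200, width=400)
print(rel)  # -> [[0.25 0.25 0.75 0.75]]
```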
Full Changelog: https://github.com/mindee/doctr/compare/v0.4.1...v0.5.0
This patch release brings the support of AMP for PyTorch training to docTR along with artefact object detection.
Note: doctr 0.4.1 requires either TensorFlow 2.4.0 or PyTorch 1.8.0.
Training scripts with the PyTorch backend now benefit from AMP to reduce the RAM footprint and potentially increase the maximum batch size! This comes in especially handy for text detection, which requires high-resolution inputs!
Document understanding goes beyond textual elements, as information can be encoded in other visual forms. For this reason, we have extended the range of supported tasks by adding object detection. This will be focused on non-textual elements in documents, including QR codes, barcodes, ID pictures, and logos.
Here are some early results:
This release comes with a training & validation set, DocArtefacts, and a reference training script. Keep an eye out for the models we will be releasing in the next version!
You've been waiting for it: from now on, we will regularly add new tutorials for docTR in the form of Jupyter notebooks that you can open and run locally or on Google Colab, for instance!
Check the new page in the documentation to have an updated list of all our community notebooks: https://mindee.github.io/doctr/latest/notebooks.html
Float precision can be leveraged in deep learning to decrease the RAM footprint of training. The common data type float32 has a lower-resolution counterpart, float16, which is usually only supported on GPU for common deep learning operations. Initially, we were planning to make all our operations available in both types to reduce the memory footprint.
However, with the latest developments in deep learning frameworks and their Automatic Mixed Precision mechanisms, this isn't required anymore and only adds constraints on the development side. We have thus deprecated this feature in our datasets and predictors:
| 0.4.0 | 0.4.1 |
|---|---|
| `>>> from doctr.datasets import FUNSD`<br>`>>> ds = FUNSD(train=True, download=True, fp16=True)`<br>`>>> print(getattr(ds, "fp16"))`<br>`True` | `>>> from doctr.datasets import FUNSD`<br>`>>> ds = FUNSD(train=True, download=True)`<br>`>>> print(getattr(ds, "fp16"))`<br>`None` |
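The resolution gap between the two types is easy to inspect with numpy:

```python
import numpy as np

# Machine epsilon: the smallest step distinguishable from 1.0
print(np.finfo(np.float16).eps)  # ~0.000977
print(np.finfo(np.float32).eps)  # ~1.19e-07
```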
`OCRPredictor.__repr__` in #595 (@RBMindee)

Our thanks & warm welcome to the following people for their first contributions: @mzeidhassan @k-for-code @felixdittrich92 @SiddhantBahuguna @RBMindee @thentgesMindee :pray:
Full Changelog: https://github.com/mindee/doctr/compare/v0.4.0...v0.4.1
This release brings the support of PyTorch out of beta, makes text recognition more robust, and provides light architectures for complex tasks.
Note: doctr 0.4.0 requires either TensorFlow 2.4.0 or PyTorch 1.8.0.
Some documents, such as French ID cards, include very long strings that can be challenging to transcribe:
This release introduces a smart split/merge strategy for wide crops to avoid performance drops. Previously, the whole crop was analyzed at once; now it is split into reasonably sized crops, inference is performed in batches, and the predictions are merged back together.
The following snippet:
from doctr.io import DocumentFile
from doctr.models import ocr_predictor
doc = DocumentFile.from_images('path/to/img.png')
predictor = ocr_predictor(pretrained=True)
print(predictor(doc).pages[0])
used to yield:
Page(
dimensions=(447, 640)
(blocks): [Block(
(lines): [Line(
(words): [
Word(value='1XXXXXX', confidence=0.0023),
Word(value='1XXXX', confidence=0.0018),
]
)]
(artefacts): []
)]
)
and now yields:
Page(
dimensions=(447, 640)
(blocks): [Block(
(lines): [Line(
(words): [
Word(value='IDFRABERTHIER<<<<<<<<<<<<<<<<<<<<<<', confidence=0.49),
Word(value='8806923102858CORINNE<<<<<<<6512068F6', confidence=0.22),
]
)]
(artefacts): []
)]
)
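The splitting step of this strategy can be sketched in pure Python (illustrative parameters and helper, not docTR's actual implementation):

```python
def split_wide_crop(width, max_ratio=8, height=32, overlap=0.1):
    """Split a crop of the given pixel width into horizontal chunks.

    Crops wider than max_ratio * height are cut into overlapping
    pieces so each chunk stays within the recognition model's
    comfortable aspect ratio.
    """
    max_width = max_ratio * height
    if width <= max_width:
        return [(0, width)]
    step = int(max_width * (1 - overlap))
    chunks = []
    start = 0
    while start + max_width < width:
        chunks.append((start, start + max_width))
        start += step
    chunks.append((width - max_width, width))  # last chunk flush with the right edge
    return chunks

print(split_wide_crop(256))  # -> [(0, 256)]
print(split_wide_crop(600))  # several overlapping chunks covering the full width
```

Each returned (start, end) pair is a horizontal slice of the original crop; after batched recognition, the per-chunk predictions can be merged back together.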
PyTorch support is now out of beta, and we have made efforts to unify switching from one deep learning backend to another :raised_hands: Predictors are designed to be the recommended interface for inference with your models!
| 0.3.1 (TensorFlow) | 0.3.1 (PyTorch) | 0.4.0 |
|---|---|---|
| `>>> from doctr.models import detection_predictor`<br>`>>> predictor = detection_predictor(pretrained=True)`<br>`>>> out = predictor(doc, training=False)` | `>>> from doctr.models import detection_predictor`<br>`>>> import torch`<br>`>>> predictor = detection_predictor(pretrained=True)`<br>`>>> predictor.model.eval()`<br>`>>> with torch.no_grad(): out = predictor(doc)` | `>>> from doctr.models import detection_predictor`<br>`>>> predictor = detection_predictor(pretrained=True)`<br>`>>> out = predictor(doc)` |
As PyTorch goes out of beta, we have bridged the gap between PyTorch & TensorFlow pretrained models' availability. Additionally, by leveraging our integration of light backbones, this release comes with lighter architectures for text detection and text recognition:
The full list of supported architectures is available :point_right: here
If you have enjoyed the Streamlit demo but prefer not to run it on your own hardware, feel free to check out the online version on Hugging Face Spaces:
Courtesy of @osanseviero for deploying it, and HuggingFace for hosting & serving :pray:
After going over backbone compatibility and re-assessing whether all combinations should be trained, docTR is focusing on reproducing the paper authors' intent or improving upon it. As such, we have deprecated the following recognition models (which had no pretrained params): crnn_resnet31, sar_vgg16_bn.
Since doctr.models.export was specific to TensorFlow and didn't bring much more value than the TensorFlow tutorials, we added instructions in the documentation and deprecated the submodule.
Resources to access data in efficient ways

Features to manipulate input & outputs
- `.synthesize` method to `Page` and `Document` #472 (@fg-mindee)

Deep learning model building and inference
- `db_mobilenet_v3_large` #485 #487, `crnn_vgg16_bn` #487, `db_resnet50` #489, `crnn_mobilenet_v3_small` & `crnn_mobilenet_v3_large` #517 #516 (@charlesmindee)

Utility features relevant to the library use cases.

Data transformations operations
- `RandomCrop` transformation #448 (@charlesmindee)

Verifications of the package well-being before release
- `RandomCrop` #448 (@charlesmindee)

Online resources for potential users
- `RandomCrop` #448 (@charlesmindee)
- `db_mobilenet_v3_large` #485 in the documentation (@charlesmindee)

Reference training scripts

Other tools and implementations
- `RandomCrop` #473 (@fg-mindee)
- `DocDataset` & `OCRDataset` #474 (@charlesmindee)
- `DetectionDataset` label format #491 (@fg-mindee)
- `doctr.models.export` #463 (@fg-mindee)
- `crnn_resnet31` & `sar_vgg16_bn` recognition models #468 (@fg-mindee)
- `DocumentBuilder` to `doctr.models.builder`, split predictor into framework-specific objects #481 (@fg-mindee)
- `DocumentBuilder` & refactored crop preparation and result processing in OCR predictors #497 (@fg-mindee)
- `doctr.models.export` #463 (@fg-mindee)
- `doctr.utils.font` submodule #472 (@fg-mindee)
- `author_email` in setup #493 (@fg-mindee)
- `common`, `pytorch` & `tensorflow` #498 #503 #506 (@fg-mindee)

Many thanks to our contributors, we are delighted to see that there are more every week!
This release stabilizes the support for PyTorch backend while extending the range features (new task, superior pretrained models, speed ups).
Brought to you by @fg-mindee & @charlesmindee
Note: doctr 0.3.1 requires either TensorFlow 2.4.0 or PyTorch 1.8.0.
With each release, we hope to bring you improved models and more comprehensive evaluation results. As part of the 0.3.1 release, we provide you with:
crnn_vgg16_bn & sar_resnet31
Without any surprise, just like many other libraries, docTR's future will involve balancing speed and raw performance. To make this choice available to you, we added support for MobileNet V3 and pretrained it for character classification in both PyTorch & TensorFlow.
Whether you are a user looking for inference speed, or a dedicated model trainer looking for optimal data loading, you will be thrilled to know that we have greatly improved our data loading/processing by leveraging multi-threading!
We value the accessibility of this project and thus commit to improving tools for entry-level users. Deploying a demo from a Python library is not the expertise of every developer, so this release improves the existing demo:
Page selection was added for multi-page documents, the predictions are used to produce a synthesized version of the initial document, and you get the JSON export! We're looking forward to your feedback :hugs:
As DocTR continues to move forward with more complex tasks, paving the way for a consistent training procedure will become necessary. Pretraining has shown potential in many deep learning tasks, and we want to explore opportunities to make training for OCR even more accessible.
So this release makes a big step forward by adding an on-the-fly character generator and training scripts, which allow you to train a character classifier without any pre-existing data :hushed:
In order to harmonize data processing between frameworks, the default data type of dataloaders has been switched to float32 for the TensorFlow backend:
| 0.3.0 | 0.3.1 |
|---|---|
| `>>> from doctr.datasets import FUNSD`<br>`>>> ds = FUNSD()`<br>`>>> img, target = ds[0]`<br>`>>> print(img.dtype)`<br>`<dtype: 'uint8'>`<br>`>>> print(img.numpy().min(), img.numpy().max())`<br>`0 255` | `>>> from doctr.datasets import FUNSD`<br>`>>> ds = FUNSD()`<br>`>>> img, target = ds[0]`<br>`>>> print(img.dtype)`<br>`<dtype: 'float32'>`<br>`>>> print(img.numpy().min(), img.numpy().max())`<br>`0.0 1.0` |
Whether it is for exporting predictions or loading input data, the library lets you play around with inputs and outputs using minimal code. Since its usage is constantly expanding, the doctr.documents module was repurposed into doctr.io.
| 0.3.0 | 0.3.1 |
|---|---|
| `>>> from doctr.documents import DocumentFile`<br>`>>> pdf_doc = DocumentFile.from_pdf("path/to/your/doc.pdf").as_images()` | `>>> from doctr.io import DocumentFile`<br>`>>> pdf_doc = DocumentFile.from_pdf("path/to/your/doc.pdf").as_images()` |
It now also includes an image submodule for easy tensor <--> numpy conversion for all supported data types.
As multithreading is increasingly used to boost performance throughout the library, it has been moved from the TF-only dataset utilities to doctr.utils.multithreading:
| 0.3.0 | 0.3.1 |
|---|---|
| `>>> from doctr.datasets.multithreading import multithread_exec`<br>`>>> results = multithread_exec(lambda x: x ** 2, [1, 4, 8])` | `>>> from doctr.utils.multithreading import multithread_exec`<br>`>>> results = multithread_exec(lambda x: x ** 2, [1, 4, 8])` |
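Under the hood, such a helper is essentially a thin wrapper over a thread pool; a minimal stdlib sketch (illustrative, not docTR's exact implementation):

```python
from concurrent.futures import ThreadPoolExecutor

def multithread_exec(func, seq, threads=4):
    """Apply func to every element of seq using a pool of worker threads."""
    with ThreadPoolExecutor(max_workers=threads) as pool:
        return list(pool.map(func, seq))

print(multithread_exec(lambda x: x ** 2, [1, 4, 8]))  # -> [1, 16, 64]
```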
Resources to access data in efficient ways

Features to manipulate input & outputs
- `Element` creation from dictionary (#386)

Deep learning model building and inference
- `crnn_resnet31` as a recognition model (#361)
- `crnn_vgg16_bn` in TF (#395)
- `master` in TF (#396)
- `sar_resnet31` in TF (#395)

Utility features relevant to the library use cases.

Data transformations operations
- `rotate` function (#358) and its corresponding augmentation module (#363)

Verifications of the package well-being before release
- `RandomRotate` (#363)

Online resources for potential users
- `RandomRotate` (#363)
- `CharacterGenerator` (#412)

Reference training scripts

Other tools and implementations
- `PIL` version due to issues with version 8.3 (#362)
- `weasyprint` version due to issues with version 53.0 (#404)
- `matplotlib` version due to issues with version 3.4.3 (#413)
- `doctr.datasets` (#354)
- `tf.float32` instead of `tf.uint8` (#367, #375)
- `doctr.documents` to `doctr.io` (#390)
- `doctr.utils` (#371)
- `doctr.models._utils.rotate_page` to `doctr.utils.geometry.rotate_image` (#371)
- `doctr.documents` to `doctr.io` in documentation and README (#390)
- `setup.py` and in README (#444)
- `doctr.documents` to `doctr.io` (#390)
- `tf.float32` by default for datasets (#367)

This release adds support for PyTorch backend & rotated text elements.
Release brought to you by @fg-mindee & @charlesmindee
Note: doctr 0.3.0 requires either TensorFlow 2.4.0 or PyTorch 1.8.0.
This release comes with exciting news: we added support of PyTorch for the whole library!
If you have both TensorFlow & PyTorch installed, simply switch the docTR backend using the USE_TORCH and USE_TF environment variables.
export USE_TORCH='1'
Then DocTR will do the rest for you to play along with PyTorch:
import torch
from doctr.models import db_resnet50
model = db_resnet50(pretrained=True).eval()
with torch.no_grad():
    out = model(torch.rand(1, 3, 1024, 1024))
More pretrained models to come in the next releases!
Users might be tempted to filter text recognition predictions, which was not easy previously without a prediction confidence. We harmonized our recognition models to provide the sequence prediction probability.
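One common way to obtain such a sequence probability is to aggregate the per-character probabilities from the decoder, e.g. by taking their product or minimum (illustrative; not necessarily docTR's exact scheme):

```python
import math

def sequence_confidence(char_probs, mode="product"):
    """Aggregate per-character probabilities into one sequence score."""
    if mode == "product":
        return math.prod(char_probs)
    return min(char_probs)

probs = [0.99, 0.97, 0.95]
print(round(sequence_confidence(probs), 4))    # -> 0.9123
print(sequence_confidence(probs, mode="min"))  # -> 0.95
```

A product penalizes every uncertain character, while the minimum reflects only the single weakest one; either makes it easy to threshold and filter low-confidence predictions.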
Following up on some feedback about the lack of clarity for visualization of dense predictions, we added a page reconstruction feature.
import matplotlib.pyplot as plt
from doctr.utils.visualization import synthesize_page
from doctr.documents import DocumentFile
from doctr.models import ocr_predictor
model = ocr_predictor(pretrained=True)
# PDF
doc = DocumentFile.from_pdf("path/to/your/doc.pdf").as_images()
# Analyze
result = model(doc)
# Reconstruct the first page
reconstructed_page = synthesize_page(result.export()[0])
plt.imshow(reconstructed_page); plt.show()
Using the predictions from our models, we try to synthesize the document with only its textual information!
While the paper doesn't introduce different versions of the LinkNet architecture, we want to keep the possibility of adding more. In order to stabilize the interface early on, we renamed linknet to linknet16:
| 0.2.1 | 0.3.0 |
|---|---|
| `>>> from doctr.models import linknet`<br>`>>> model = linknet(pretrained=True)` | `>>> from doctr.models import linknet16`<br>`>>> model = linknet16(pretrained=True)` |
Resources to access data in efficient ways

Features to manipulate document information

Deep learning model building and inference
- `conv_sequence` & parameter loading (#323), `resnet31` (#327), `vgg16_bn` (#328), CRNN (#318), SAR (#333), MASTER (#329, #335, #340, #342)

Utility features relevant to the library use cases.

Data transformations operations
- `Resize` in PyTorch (#313), `ColorInversion` (#322)

Verifications of the package well-being before release

Online resources for potential users

Reference training scripts

Other tools and implementations
- `ColorInversion` unittest (#298, #339)
- `wandb` in the detection script (#288)
- `wandb` config for training scripts (#302)
- `OCRDataset` and `CORD` (#289, #299)

:pray: Thanks to our contributors :pray: @Rob192