A Repo For Document AI
#121 Adding support for W&B logging and visualizing evaluation results #132 Adding new properties for 'Page' and new attributes for 'Image'
The WandbTableAgent is a new object that generates table rows with images and bounding boxes and sends this table to the W&B server. Once a W&B account has been set up, this class allows monitoring evaluation results during training.
Moreover, a WandbWriter has been added that allows writing logs in JSON format and sending them to W&B.
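As a rough illustration of what such an agent has to assemble, the following sketch builds per-page table rows carrying an image id and bounding-box annotations. The function name `build_table_rows` and the row layout are assumptions for illustration, not the library's actual API.

```python
# Sketch (assumed names) of the data a WandbTableAgent-style logger could
# assemble per evaluated page before handing it to W&B.

def build_table_rows(pages):
    """Turn evaluation pages into rows of image ids and box annotations."""
    rows = []
    for page in pages:
        boxes = [
            {
                # min/max pixel coordinates of the bounding box
                "position": {"minX": x0, "minY": y0, "maxX": x1, "maxY": y1},
                "class_id": class_id,
                "box_caption": caption,
            }
            for (x0, y0, x1, y1, class_id, caption) in page["annotations"]
        ]
        rows.append({"image_id": page["image_id"], "boxes": boxes})
    return rows
```

Rows like these could then be logged, e.g. via `wandb.Table` together with `wandb.Image` objects that accept bounding-box overlays.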
Table and Image objects
Some properties have been added to Table:
- `csv`: Returns a list of lists of strings (cell entries)
- `__str__`: Returns a string representation of the table
Some attributes have been added to Image in order to take care of data lineage for multi page documents:
- `document_id`: Global document identifier (equal to `image_id` for single page documents)
- `page_number`: Page number in multi page documents (defaults to 0)
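The lineage rule can be sketched as follows; `PageImage` and `split_document` are illustrative names, not the library's classes, but the defaults mirror the description above.

```python
from dataclasses import dataclass, field
from uuid import uuid4

# Illustration (assumed names) of the new lineage attributes: every page
# image keeps a global document_id and its page_number within the document.
@dataclass
class PageImage:
    image_id: str = field(default_factory=lambda: str(uuid4()))
    document_id: str = ""
    page_number: int = 0  # defaults to 0

    def __post_init__(self):
        # single page document: document_id equals image_id
        if not self.document_id:
            self.document_id = self.image_id

def split_document(document_id, num_pages):
    """Create one PageImage per page, all sharing the same document_id."""
    return [PageImage(document_id=document_id, page_number=n)
            for n in range(num_pages)]
```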
#129 with commit 5fb2355 NN (defrost Hugging Face hub version) with #122 #124 (partial) with #125 NN (small bug fixes related to PR #117) with #118
#101 Docs are built now with MkDocs, Material for MkDocs as well as Mkdocstring. This PR is already productive #110 Adding state_id #115 Adding Table-Transformer with custom pipeline components #117 Adding pipeline component for NMS per pairs
ImageAnnotation objects change as they pass through a pipeline. To better detect such changes, a state_id is introduced which, unlike the static annotation_id, changes when the annotation changes, e.g. by adding sub-categories.
The following has been added:
AnnotationNmsService
An AnnotationNmsService has been added. It allows running non-maximum suppression on pairs or, more generally, on groups of image annotations. In contrast to the post-processing step inside object detectors, this step can suppress annotations that have been detected by different detectors.
The service runs for TensorFlow and PyTorch and chooses the necessary functions accordingly.
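For readers unfamiliar with the technique, a plain greedy NMS over scored boxes looks like this. This is a from-scratch sketch of the general algorithm, not the service's implementation; box format `(x0, y0, x1, y1)` is an assumption.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x0, y0, x1, y1)."""
    x0 = max(box_a[0], box_b[0]); y0 = max(box_a[1], box_b[1])
    x1 = min(box_a[2], box_b[2]); y1 = min(box_a[3], box_b[3])
    inter = max(0.0, x1 - x0) * max(0.0, y1 - y0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

def nms(annotations, threshold=0.5):
    """Greedy NMS. annotations: list of (box, score), possibly coming from
    different detectors. Returns the surviving annotations."""
    keep = []
    for box, score in sorted(annotations, key=lambda a: -a[1]):
        # keep a box only if it does not overlap a higher-scored kept box
        if all(iou(box, kept_box) < threshold for kept_box, _ in keep):
            keep.append((box, score))
    return keep
```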
#100 with PR #101 #112 with special fix on notebooks repo
#94 Adding support for LayoutLMv2 and LayoutXLM #99 Adding support for LayoutLMv3 #97 Refactoring repo structure and moving jupyter notebooks to notebooks
-Model wrappers for LayoutLMv2 have been added. To give the whole concept more structure, two new base classes (sub classed from LMTokenClassifier and LMSequenceClassifier resp.), HFLayoutLmTokenClassifierBase and HFLayoutLmSequenceClassifierBase, have been added.
-Adding sliding window for training and inference: Before adding sliding windows, pages with more than 512 tokens could be processed by splitting the page batch into several disjoint batches. This approach has the disadvantage that one loses context especially for tokens very close to where the batch has been dissected. Sliding window generates several overlapping batches so that there is always a batch where any token has context (except the first and last tokens). For inference one needs to add a post processing step for tokens in more than one batch: We currently choose the prediction with the highest score but there are other approaches. The effect on inference has not been tested yet and the implementation may be subject to change.
-Adding support in training script for LayoutLMv2 and LayoutXLM.
Note: As transformers tokenizers also implement the distribution of bounding boxes to tokens, which is already part of the pre-processing step in this library, users must not use these tokenizers but have to take the tokenizers generating the vocab of the underlying language model. This means:
LayoutLMv2 -> LayoutLMTokenizerFast
LayoutXLM -> XLMRobertaTokenizerFast
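The sliding-window logic described above can be sketched in a few lines. This is a simplified illustration of the windowing and the max-score resolution step, not the library's code; window and stride values are arbitrary.

```python
# Sketch: overlapping windows over a long token sequence, plus resolution of
# tokens that end up in more than one window by keeping the highest score.

def sliding_windows(num_tokens, window=512, stride=256):
    """Yield (start, end) index pairs of overlapping windows."""
    start = 0
    while True:
        end = min(start + window, num_tokens)
        yield start, end
        if end == num_tokens:
            break
        start += stride

def resolve_overlaps(window_predictions):
    """window_predictions: iterable of (token_index, label, score).
    For each token, keep the prediction with the highest score."""
    best = {}
    for idx, label, score in window_predictions:
        if idx not in best or score > best[idx][1]:
            best[idx] = (label, score)
    return {idx: label for idx, (label, _) in best.items()}
```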
Disentangling code base from jupyter notebooks.
-Adding HFLayoutLmv3SequenceClassifier and HFLayoutLmv3TokenClassifier
-LMTokenClassifierService no longer requires a mapping function in its __init__ method, because the inference processing works with one single mapping for all models.
-When processing features, it is now possible to choose the segment positions to be used as bounding boxes. This implies that the segment positions need child-specific relationships to the words.
-Evaluator has a new method compare which makes it possible to compare a ground truth sample from a dataset with predictions from a pipeline. Currently, only object detection models can be compared.
#91 with PR #93 #95 with PR #96
Patch release:
Due to changes of the hf_hub as of release 0.11.0, only versions <0.11.0 can currently be used.
#69 Modified cell merging in table refinement process and new row/column stretching rule #72 Optimizing reading order #76 Refactoring pipeline base classes #82 Adding an image transformer and corresponding pipeline component #86 Modify API for analyzing document output
TableSegmentationRefinementService: When merging cells, the merged cell can be equal to one of the input cells (e.g. if the largest cell contains all other cells). In this case the merged cell cannot be dumped and the smaller cells will not be deactivated. Logic has been added that deals with this situation.
TableSegmentationService: To tile tables with rows and columns more evenly, a new row/column stretching rule has been added.
Optimization of the arrangement of layout blocks so that the reading order becomes more robust even when the layout elements vary heavily.
jdeskew is used to estimate the distortion angle of a skewed document and to rotate it accordingly, so that text lines become horizontal and are easier to consume for OCR systems. It comes wrapped in an ImageTransformer that accepts and returns an image as a numpy array, and a new pipeline component SimpleTransformService accepts an ImageTransformer and updates the necessary meta data.
ObjectTypes strings have been changed to lower case. The reason is that ObjectTypes members are now made available as attributes for sub classes of ImageAnnotationObj.
A new Page object derived from Image has been created. It replaces the object of the same name. Moreover, a couple of Layout structures have been created. Both Page and the Layout structures represent views on the underlying Image resp. ImageAnnotation, and provide a more intuitive interface to document parsing, text extraction/text classification than the Image and ImageAnnotation classes.
CustomDataset has been added to provide users with an easy interface to create custom datasets. This class reduces the boilerplate: users now only have to write a DataFlowBuilder and instantiate CustomDataset.
ModelProfile has been provided with a new attribute model_wrapper.
TextExtractionService has been provided with a new option run_time_ocr_language_selection. If Tesseract has been chosen as text_extract_detector and a LanguageDetectionService is a predecessor pipeline component, setting run_time_ocr_language_selection=True will select the Tesseract model with the predicted language. You can therefore have different languages in one stream of documents.
#66 with PR #68 #70 with PR #71 #73 with PR #74 #77 with PR #78 #80 with PR #81 #84 with PR #85
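The ImageTransformer/SimpleTransformService split described above can be sketched as follows. All class and method names here are assumptions, and the toy transformer only corrects rotations that are multiples of 90 degrees (a real deskewer like jdeskew estimates an arbitrary angle first).

```python
import numpy as np

class Rotate90Transformer:
    """ImageTransformer-style object: numpy array in, numpy array out."""
    def __init__(self, k=1):
        self.k = k  # number of counter-clockwise 90-degree turns

    def transform(self, np_img):
        return np.rot90(np_img, self.k)

class SimpleTransformService:
    """Pipeline component wrapping a transformer and updating meta data."""
    def __init__(self, transformer):
        self.transformer = transformer

    def pass_datapoint(self, np_img, meta=None):
        out = self.transformer.transform(np_img)
        meta = dict(meta or {})
        # update the meta data that the transform invalidated
        meta["height"], meta["width"] = out.shape[:2]
        return out, meta
```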
#55 Adding precision/recall/F1 metrics
#57 More docs for LayoutLM
#61 Enum for categories
#63 Unifying log messages
#65 Reducing the number of extra install options
Precision, recall and F1 metrics (macro/micro/average versions) have been added to evaluate token classification models. Regarding visualization, some options have been added to display token class output at page level.
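To make the micro/macro distinction concrete, here is a from-first-principles computation of token-level F1 under both averaging schemes; it is an illustration, not the metric code of this repo.

```python
from collections import Counter

def prf(tp, fp, fn):
    """Precision, recall and F1 from raw counts."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

def token_f1(gold, pred):
    """gold, pred: equal-length label sequences. Returns (micro_f1, macro_f1).
    Micro pools counts over all classes; macro averages per-class F1."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for g, p in zip(gold, pred):
        if g == p:
            tp[g] += 1
        else:
            fp[p] += 1
            fn[g] += 1
    labels = set(gold) | set(pred)
    micro = prf(sum(tp.values()), sum(fp.values()), sum(fn.values()))[2]
    macro = sum(prf(tp[l], fp[l], fn[l])[2] for l in labels) / len(labels)
    return micro, macro
```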
As a side note, two docs have been added to discuss the following.
The current data model is based on object detection tasks. This can be seen by the choice of classes that includes Image
, CategoryAnnotation
, ImageAnnotation
, and the relationships between ImageAnnotation
and sub categories. On the other hand, however, category types from Document AI tasks are generally used to set up the sequential steps in the code base. These category types have been stored in the category_names
attribute as a string type. All category types are currently an attribute of the AttrDict instance names. As the number of category types increases, this procedure means that the names cannot be maintained well. Furthermore, one is not able to group category types.
This weakness is eliminated with the introduction of special Enum types for groups of categories. In the future, an Enum member will be stored in the category_names attribute. This ensures that categories can also be controlled using Enum types.
Enum members will also be used as keys of sub categories.
Enums are defined as string Enums, so one can still call Enum members with their original names.
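A string Enum keeps backwards compatibility with code that compares against raw strings, which is the point of the design above. The group and member names below are illustrative, not the library's actual ObjectTypes.

```python
from enum import Enum

# Illustrative category group: members subclass str, so they compare equal
# to the lower-case strings previously stored in category_names.
class LayoutType(str, Enum):
    TABLE = "table"
    TEXT = "text"
    TITLE = "title"

# Behaves like a string ...
assert LayoutType.TABLE == "table"
# ... and can still be looked up by its value.
assert LayoutType("table") is LayoutType.TABLE
```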
Log messages have been unified across all libraries, while keeping logs unchanged when they are devoted to training scripts - so that Tensorboard works correctly. Moreover, many assertion errors have been replaced with a more precise built-in error type.
The number of extra install options has been reduced by two. The installation docs have been modified accordingly.
The concept of lazy modules has been added. Lazy modules allow deferring the execution of a module import until the moment the imported module is used for the first time. This gives some speed gains.
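The mechanism behind lazy modules is available in the standard library; the following is essentially the recipe from the importlib documentation, not this repo's code. The module is registered in sys.modules immediately, but its code only executes on first attribute access.

```python
import importlib.util
import sys

def lazy_import(name):
    """Import a module lazily: loading is deferred until first use."""
    spec = importlib.util.find_spec(name)
    loader = importlib.util.LazyLoader(spec.loader)
    spec.loader = loader
    module = importlib.util.module_from_spec(spec)
    sys.modules[name] = module
    loader.exec_module(module)
    return module

json = lazy_import("json")                  # cheap: nothing executed yet
assert json.dumps({"a": 1}) == '{"a": 1}'   # first access triggers the load
```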
fixes: #53 with PR #54
Evaluator running over pipelines: #38 Adding Tree edit distance metric: #38 Adding LayoutLMv1 model: #44 Adding Doclaynet dataset: #45 New design of Page class: #47
When running evaluation for table recognition, the predictions depend on a chain of pipeline components for object detection and post processing (cell/row/column matching and table refinement). The evaluator therefore needs to compare results between the ground truth of a dataset and the prediction of a whole pipeline.
To compare prediction and ground truth on a datapoint, the evaluator first has to make a copy of the ground truth and then needs to erase all interim results that will later be generated when running through the pipeline. In order to know what has to be erased, a new meta data scheme has been established for each pipeline component, indicating what type of annotation (image annotation or category annotation) will be generated when passing the datapoint through the component.
Moreover, additional functions had to be added to each metric so that one can specify over which sub category/summary the evaluation is required.
Tree edit distance has been proposed to compare HTML representation of tables in the realm of table recognition. It is possible to call this metric on a given category for every task that generates an XML representation. The code has been mainly taken from the Pubtabnet repo.
The major addition of this release is support to train, evaluate and run LayoutLMv1 models in deepdoctection pipelines. Using separate pipeline components for sequence and token classification with support of LayoutLM massively extends the applicability of this repo. LayoutLMv1 is basically a BERT model that accepts multimodal features like tokens and bounding boxes, and comes in different flavors. Higher LayoutLM versions with additional features will be added in later releases. The model comes with a training script for fine tuning based on the custom trainer from the transformers library. A notebook to showcase the new functionality has been added.
Doclaynet is a new dataset for document layout analysis that contains around 80k manually labeled images such as financial reports, patents and others. Compared to automatically generated labels from other datasets like Publaynet, Doclaynet has high variability in document layouts, which allows training models that are able to determine layouts for a large variety of documents.
The original page class suffered from poor design choices, resulting in challenges when adding additional features to the output. It therefore had to be completely redesigned and simplified; it now has a modular approach that can easily be extended with new components.
Patch to add long description
Re-organizing extra dependencies: #35 Optimizing typing: #36 Training script for Detectron2 and new models: #37
Adding basic, full and all extra dependencies for TF, as well as full and all dependencies for PT. Compared to the old dependency setting, the basic setting no longer requires pycocotools or lxml. The all dependencies option includes packages that have a predictor wrapper. Setup now has several installation options depending on whether the package has been downloaded from PyPi or GitHub. The test suite has been divided into test groups according to the additional package distributions. CI for tests when merging into master has been added.
Static typing has been optimized to reduce the massive amount of typing issues due to incorrect typing. Some additional types (e.g. Pathlike) have been added.
Detectron2 is now on equal footing with Tensorpack's models and is easily trainable on dd datasets. Training metrics show that this framework is superior to the Tensorpack implementation in terms of speed and accuracy. A training script with an API identical to Tensorpack's has been provided, which is based on D2's train_net script. Central to this script is a trainer derived from D2's DefaultTrainer with custom data loading methods. Training of the given PyTorch models has been resumed for 20-50k iterations to overcome the poor accuracy in higher IoU metrics.
Language detection: #33 Merging datasets: #34
Adding a predictor for language detection that accepts a string and returns the predicted language. As model, we use the large fasttext word embedding, which is also included in the model catalog. Determining the language is crucial when applying downstream NLP tasks.
Along with language detector a new pipeline component LanguageDetectionService has been implemented. The service can be used in two situations:
before the text extraction: An OCR predictor extracts a snippet from a region of the page and passes it to the language detector. The result can be used to do a proper text extraction thereafter with an OCR model specialized in the inferred language.
after the text extraction: If the text extraction does not really depend on a specific language (e.g. text extraction with a PDF miner) one can use the pipeline component to determine a more confident prediction.
Adding a new class derived from DatasetBase to construct datasets as a union of pre-selected datapoints without touching the original datasets.
To train models on multiple datasets, MergeDataset accepts a number of datasets and builds meta data (e.g. categories) and a dataflow based on their inputs. Configuring the datasets (filtering, replacing categories with sub categories) before creating the merge is allowed, as is configuring the dataflow of each individual dataset.
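The merge idea can be sketched as follows; `ToyMergeDataset` and its dict-based dataset representation are assumptions for illustration, not the MergeDataset API.

```python
from itertools import chain

class ToyMergeDataset:
    """Toy union of datasets: joint categories, chained dataflow, and the
    underlying datasets stay untouched."""
    def __init__(self, *datasets):
        # each dataset here: {"categories": set, "build": callable -> iterable}
        self.datasets = datasets
        self.categories = set().union(*(d["categories"] for d in datasets))

    def dataflow(self):
        """Stream all datapoints of all underlying datasets in sequence."""
        return chain.from_iterable(d["build"]() for d in self.datasets)
```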
Getting started notebook fails: #26 When printing tables from page object, output does not show last row: #28 Some pipeline components do not have a clone method: #31
Models are now registered with a dataclass that allows saving the necessary meta data (urls, hf repo id, etc.) and retrieving the information from the ModelCatalog.
For metrics, datasets and pipeline components we now use the small library catalogue, which easily allows creating registries for these objects. It is especially suited for registering custom objects in individual projects.
Some TF warnings (esp. for warnings appearing in TF >= 2.5) are now silenced.
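A registry of this kind boils down to a name-to-object mapping filled via a decorator. The sketch below shows the pattern in the spirit of the catalogue library; the `Registry` class and the `datasets` instance are our own illustrative names, not the library's API.

```python
class Registry:
    """Minimal name-based registry filled via a decorator."""
    def __init__(self):
        self._objects = {}

    def register(self, name):
        def decorator(obj):
            self._objects[name] = obj   # store under the given name
            return obj                  # leave the object itself unchanged
        return decorator

    def get(self, name):
        return self._objects[name]

datasets = Registry()

@datasets.register("toy")
def build_toy_dataset():
    return ["datapoint-1", "datapoint-2"]
```

Custom objects can then be looked up by name, e.g. `datasets.get("toy")()`.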