Deepdoctection Versions

A Repo For Document AI

v.0.22


Enhancements

Summary

- #121 Adding support for W&B logging and visualizing evaluation results
- #132 Adding new properties for `Page` and new attributes for `Image`

Details

Adding support for W&B

The WandbTableAgent is a new object that generates table rows with images and bounding boxes and sends the table to the W&B server. With a W&B account set up, this class allows monitoring evaluation results during training.

Moreover, a WandbWriter has been added that allows writing logs in JSON format and sending them to W&B.
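Under the hood this builds on the standard `wandb` tables API. A minimal sketch of the mechanism (the project name and image are placeholders, not taken from deepdoctection):

```python
import numpy as np
import wandb

# Placeholder evaluation sample; in deepdoctection the agent fills the
# table from evaluation results (images plus bounding boxes).
image = np.zeros((100, 100, 3), dtype=np.uint8)

run = wandb.init(project="eval-monitoring")  # placeholder project name

# Build a table with one image column and send it to the W&B server.
table = wandb.Table(columns=["page"])
table.add_data(wandb.Image(image))
run.log({"eval_samples": table})
run.finish()
```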

Adding new properties for Table and Image objects

Some properties have been added to Table:

- `csv`: Returns a list of lists of strings (cell entries)
- `__str__`: Returns a string representation of the table

Some attributes have been added to Image in order to take care of data lineage for multi-page documents:

- `document_id`: Global document identifier (equal to `image_id` for single-page documents)
- `page_number`: Page number in multi-page documents (defaults to 0)
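A hedged usage sketch of the new properties (the file path is a placeholder, and it is assumed here that the lineage attributes are surfaced on `Page`):

```python
import deepdoctection as dd

analyzer = dd.get_dd_analyzer()
df = analyzer.analyze(path="sample.pdf")  # placeholder path
df.reset_state()

for page in df:
    for table in page.tables:
        rows = table.csv  # list of lists of cell strings
        print(table)      # __str__: plain-text representation of the table
    # assumption: data lineage attributes exposed on the page view
    print(page.document_id, page.page_number)
```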

Bugs

- #129 with commit 5fb2355
- NN (defrost Hugging Face hub version) with #122
- #124 (partial) with #125
- NN (small bug fixes related to PR #117) with #118

v.0.21


Enhancements

Summary

- #101 Docs are now built with MkDocs, Material for MkDocs as well as mkdocstrings. This PR is already productive
- #110 Adding state_id
- #115 Adding Table-Transformer with custom pipeline components
- #117 Adding pipeline component for NMS per pairs

Details

Adding state_id

ImageAnnotations change as they pass through a pipeline. To better detect changes to an annotation, the state_id has been introduced, which, unlike the static annotation_id, changes when the annotation changes, e.g. by adding sub-categories.
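A conceptual sketch of the distinction (illustrative only, not the actual deepdoctection implementation):

```python
import uuid
from dataclasses import dataclass, field

@dataclass
class Annotation:
    category_name: str
    sub_categories: dict = field(default_factory=dict)
    # annotation_id: fixed at creation, identifies the annotation itself
    annotation_id: str = field(default_factory=lambda: str(uuid.uuid4()))

    @property
    def state_id(self) -> str:
        # state_id: derived from the current content, so it changes
        # whenever the annotation changes, e.g. a sub-category is added
        state = (self.annotation_id, tuple(sorted(self.sub_categories)))
        return str(uuid.uuid5(uuid.NAMESPACE_OID, repr(state)))

ann = Annotation("table")
static_id, before = ann.annotation_id, ann.state_id
ann.sub_categories["html"] = "..."
assert ann.annotation_id == static_id  # unchanged
assert ann.state_id != before          # changed with the annotation
```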

Adding Table-Transformer

The following has been added:

  • dataset pubtables1m_struct for table structure recognition using Pubtables-1M
  • A derived ObjectDetector wrapper HFDetrDerivedDetector for TableTransformerForObjectDetection
  • A pipeline component PubtablesSegmentationService to segment the table structure recognition results of the model. (The segmentation following deepdoctection's first approach cannot be used here.)
  • A training script for training TableTransformerForObjectDetection models for object detection as well as DataCollator and Detr mappers.
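The wrapper delegates to the transformers implementation. For orientation, a hedged sketch of the underlying model call with the plain transformers API (the checkpoint is the public Microsoft one; the file path is a placeholder):

```python
import torch
from PIL import Image
from transformers import AutoImageProcessor, TableTransformerForObjectDetection

checkpoint = "microsoft/table-transformer-structure-recognition"
processor = AutoImageProcessor.from_pretrained(checkpoint)
model = TableTransformerForObjectDetection.from_pretrained(checkpoint)

image = Image.open("table_crop.png").convert("RGB")  # placeholder path
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs)

# Convert raw logits/boxes into detections (rows, columns, spanning cells, ...)
target_sizes = torch.tensor([image.size[::-1]])
results = processor.post_process_object_detection(
    outputs, threshold=0.7, target_sizes=target_sizes
)[0]
for label, box in zip(results["labels"], results["boxes"]):
    print(model.config.id2label[label.item()], box.tolist())
```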

Adding AnnotationNmsService

The new service allows running non-maximum suppression (NMS) on pairs, or more generally on groups, of image annotations. In contrast to the post-processing step inside a single object detector, this step can suppress annotations that have been produced by different detectors.

The service runs with both TensorFlow and PyTorch and chooses the required functions accordingly.
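For reference, a minimal framework-agnostic sketch of the suppression logic itself (plain NumPy, not the deepdoctection implementation):

```python
import numpy as np

def nms(boxes: np.ndarray, scores: np.ndarray, iou_threshold: float = 0.5) -> list:
    """Return the indices of boxes kept after greedy NMS.

    boxes: (N, 4) array of [x1, y1, x2, y2]; scores: (N,) confidences.
    """
    order = scores.argsort()[::-1]  # highest score first
    keep = []
    while order.size > 0:
        i = order[0]
        keep.append(int(i))
        if order.size == 1:
            break
        # IoU of the top box with all remaining boxes
        xx1 = np.maximum(boxes[i, 0], boxes[order[1:], 0])
        yy1 = np.maximum(boxes[i, 1], boxes[order[1:], 1])
        xx2 = np.minimum(boxes[i, 2], boxes[order[1:], 2])
        yy2 = np.minimum(boxes[i, 3], boxes[order[1:], 3])
        inter = np.maximum(0.0, xx2 - xx1) * np.maximum(0.0, yy2 - yy1)
        area_i = (boxes[i, 2] - boxes[i, 0]) * (boxes[i, 3] - boxes[i, 1])
        areas = (boxes[order[1:], 2] - boxes[order[1:], 0]) * (
            boxes[order[1:], 3] - boxes[order[1:], 1]
        )
        iou = inter / (area_i + areas - inter)
        order = order[1:][iou <= iou_threshold]  # drop overlapping boxes
    return keep
```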

Bugs

- #100 with PR #101
- #112 with special fix on notebooks repo

v.0.20


Enhancements

Summary

- #94 Adding support for LayoutLMv2 and LayoutXLM
- #99 Adding support for LayoutLMv3
- #97 Refactoring repo structure and moving jupyter notebooks to notebooks

Details

Adding support for LayoutLMv2 and LayoutXLM

- Model wrappers for LayoutLMv2 have been added. To give the whole concept more structure, two new base classes, HFLayoutLmTokenClassifierBase and HFLayoutLmSequenceClassifierBase (sub-classed from LMTokenClassifier and LMSequenceClassifier, resp.), have been added.
- Adding sliding windows for training and inference (see the sketch after this list): Before sliding windows, pages with more than 512 tokens could only be processed by splitting the page batch into several disjoint batches. This approach has the disadvantage that one loses context, especially for tokens very close to where the batch has been cut. Sliding windows generate several overlapping batches so that there is always a batch in which a given token has context (except for the first and last tokens). For inference, a post-processing step is needed for tokens that occur in more than one batch: we currently choose the prediction with the highest score, but there are other approaches. The effect on inference has not been tested yet and the implementation may be subject to change.
- Adding support for LayoutLMv2 and LayoutXLM in the training script.
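A minimal sketch of the overlapping-window idea for a 512-token limit (pure Python, not the library code):

```python
from typing import List

def sliding_windows(tokens: List[str], max_len: int = 512, stride: int = 128) -> List[List[str]]:
    """Split a long token sequence into overlapping windows.

    Consecutive windows overlap by max_len - stride tokens, so every token
    (except those at the very beginning and end) appears in at least one
    window where it has context on both sides.
    """
    if len(tokens) <= max_len:
        return [tokens]
    windows = []
    start = 0
    while start < len(tokens):
        windows.append(tokens[start:start + max_len])
        if start + max_len >= len(tokens):
            break
        start += stride
    return windows

# At inference time a token may occur in several windows; the merge strategy
# described above keeps the highest-scoring prediction per token position.
```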

Note: The transformers tokenizers also implement the distribution of bounding boxes to tokens, which is already part of the pre-processing step in this library. Users must therefore not use those tokenizers but have to take the tokenizers that generate the vocabulary of the underlying language model. This means:

LayoutLMv2 -> LayoutLMTokenizerFast
LayoutXLM -> XLMRobertaTokenizerFast
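For illustration, loading the two tokenizers with the plain transformers API (the checkpoint names are the public ones and an assumption here):

```python
from transformers import LayoutLMTokenizerFast, XLMRobertaTokenizerFast

# Tokenizers of the underlying language models, as required above
layoutlmv2_tokenizer = LayoutLMTokenizerFast.from_pretrained("microsoft/layoutlm-base-uncased")
layoutxlm_tokenizer = XLMRobertaTokenizerFast.from_pretrained("xlm-roberta-base")
```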

Refactoring repo structure

Disentangling code base from jupyter notebooks.

Adding LayoutLMv3 and more features for LayoutLM processing

- Adding HFLayoutLmv3SequenceClassifier and HFLayoutLmv3TokenClassifier.
- LMTokenClassifierService no longer requires a mapping function in its __init__ method because the inference processing works with one single mapping for all models.
- When processing features, it is now possible to choose the segment positions to be used as bounding boxes. This implies that the segment positions will need child-specific relationships to the words.
- Evaluator has a new method compare which makes it possible to compare a ground-truth sample from a dataset with predictions from a pipeline. Currently, only object detection models can be compared.

Bugs

- #91 with PR #93
- #95 with PR #96

v.0.19


Patch release:

Due to changes in hf_hub as of release 0.11.0, only versions <0.11.0 can currently be used.

v.0.18


Enhancements

Summary

- #69 Modified cell merging in table refinement process and new row/column stretching rule
- #72 Optimizing reading order
- #76 Refactoring pipeline base classes
- #82 Adding an image transformer and corresponding pipeline component
- #86 Modify API for analyzing document output

Details

Modified cell merging in table refinement process and new row/column stretching rule

TableSegmentationRefinementService: When merging cells, the merged cell can be equal to one of the input cells (e.g. if the largest cell contains all other cells). In this case the merged cell cannot be dumped and the smaller cells won't be deactivated. Logic has been added that deals with this situation.

TableSegmentationService: To tile tables with rows and columns more evenly, a new row/column stretching rule has been added.

Optimizing reading order

The arrangement of layout blocks has been optimized so that the reading order is more robust, even when the layout elements vary heavily.

Refactoring pipeline base classes

  • Adding a new attribute name so that each pipeline component in a pipeline can be uniquely described by its predictor and its component.
  • Removing some parameters from classes they do not really belong to; adding the method get_pipeline_info to the abstract pipeline base class.

Adding an image transformer (not a model) and corresponding pipeline component (closes #30)

  • Adding the package jdeskew to estimate the distortion angle of a skewed document and to rotate it accordingly, so that text lines are horizontal and easier to consume for OCR systems.
  • Adding a new class interface ImageTransformer that accepts and returns an image as numpy array, and a new pipeline component SimpleTransformService that accepts an ImageTransformer and updates the necessary metadata.
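For orientation, a hedged sketch of the deskewing step using jdeskew directly (file paths are placeholders; the get_angle/rotate entry points are taken from the jdeskew docs to the best of my knowledge):

```python
import cv2
from jdeskew.estimator import get_angle
from jdeskew.utility import rotate

image = cv2.imread("skewed_page.png")  # placeholder path

angle = get_angle(image)         # estimated distortion angle of the page
deskewed = rotate(image, angle)  # rotate so that text lines become horizontal

cv2.imwrite("deskewed_page.png", deskewed)
```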

Modify API for analyzing document output

  • ObjectTypes strings have been changed to lower case. The reason is that ObjectTypes members are now made available as attributes for subclasses of ImageAnnotationObj.
  • Unused and deprecated data structures have been deleted.
  • A new Page object, now derived from Image, has been created. This new object replaces the object of the same name. Moreover, a couple of Layout structures have been created. Page and the Layout structures represent views on the underlying Image resp. ImageAnnotation and provide a more intuitive interface to document parsing and text extraction/classification than the Image and ImageAnnotation classes.
  • A new class CustomDataset has been added to provide users with an easy interface for creating custom datasets. This class reduces boilerplate: users now only have to write a DataFlowBuilder and instantiate CustomDataset.
  • ModelProfile has been provided with a new attribute model_wrapper.
  • TextExtractionService has been provided with a new parameter run_time_ocr_language_selection. If Tesseract has been chosen as text_extract_detector and a LanguageDetectionService is a predecessor pipeline component, setting run_time_ocr_language_selection=True will select the Tesseract model with the predicted language. You can therefore have different languages in one stream of documents.
  • All notebooks have been revisited and updated. Many notebooks were almost one year old and no longer gave an exhaustive overview of what can be solved with the library.
  • Besides the notebooks, a substantial part of the docs has been updated.

Bugs

- #66 with PR #68
- #70 with PR #71
- #73 with PR #74
- #77 with PR #78
- #80 with PR #81
- #84 with PR #85

v.0.17


Enhancements

Summary

- #55 Adding precision/recall/F1 metrics
- #57 More docs for LayoutLM
- #61 Enum for categories
- #63 Unifying log messages
- #65 Reducing the number of extra install options

Details

Adding Precision/recall/F1 metrics

Precision, recall and F1 metrics (macro/micro/average versions) have been added to evaluate token classification models. Regarding visualization, some options have been added to display token class output at page level.

More docs for LayoutLM

As side notes, two docs have been added to discuss

  • results of sequence classification problems on modern type documents
  • results of LayoutLM models with visual backbone trained on layout analysis tasks

Enums for categories

The current data model is based on object detection tasks. This can be seen from the choice of classes, which includes Image, CategoryAnnotation and ImageAnnotation, and from the relationships between ImageAnnotation and sub-categories. On the other hand, category types from Document AI tasks are generally used to set up the sequential steps in the code base. These category types have been stored in the category_names attribute as plain strings, and all of them are currently attributes of the AttrDict instance names. As the number of category types increases, this procedure means that the names cannot be maintained well. Furthermore, one is not able to group category types.

This weakness is eliminated with the introduction of dedicated Enum types for groups of categories. In the future, an Enum member will be stored in the category_names attribute, which ensures that categories can also be controlled through the Enum type. Enum members will also be used as keys of sub-categories. The Enums are defined as string Enums, so one can still call Enum members with their original string names.
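A minimal sketch of the string-Enum idea (the member names are illustrative, not the library's actual members):

```python
from enum import Enum

class LayoutType(str, Enum):
    """String enum: members compare equal to their plain string values."""
    TABLE = "table"
    TEXT = "text"
    TITLE = "title"

# Members behave like strings, so existing string-based code keeps working ...
assert LayoutType.TABLE == "table"
assert LayoutType("table") is LayoutType.TABLE  # look up a member by its value

# ... while typos now fail loudly instead of silently creating a new category.
try:
    LayoutType("tabel")
except ValueError as err:
    print(err)
```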

Unifying log messages

Log messages have been unified across the library, while logs devoted to training scripts have been kept unchanged so that Tensorboard continues to work correctly. Moreover, many assertion errors have been replaced with more precise built-in error types.

Reducing number of extra install options

The number of extra install options has been reduced by two. The installation docs have been modified accordingly.

The concept of lazy modules has been added. Lazy modules defer the execution of an import until the moment the imported module is used for the first time. This gives some speed gains.
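A minimal sketch of the lazy-import idea using PEP 562 module-level __getattr__ (illustrative only, not the library's actual mechanism; the submodule name is hypothetical):

```python
# package/__init__.py
import importlib

_LAZY_MODULES = {"heavy": "package.heavy"}  # hypothetical heavy submodule

def __getattr__(name):
    # Called only when the attribute is not found the normal way (PEP 562);
    # the submodule is imported on first access instead of at package import.
    if name in _LAZY_MODULES:
        module = importlib.import_module(_LAZY_MODULES[name])
        globals()[name] = module  # cache so __getattr__ is not hit again
        return module
    raise AttributeError(f"module {__name__!r} has no attribute {name!r}")
```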

Bugs

#53 with PR #54

v.0.16


Enhancements

Summary

- Evaluator running over pipelines: #38
- Adding tree edit distance metric: #38
- Adding LayoutLMv1 model: #44
- Adding Doclaynet dataset: #45
- New design of Page class: #47

Details

Evaluator running over pipelines

When running evaluation for table recognition, the predictions depend on a chain of pipeline components for object detection and post-processing (cell/row/column matching and table refinement). The evaluator therefore needs to compare results between the ground truth of a dataset and the prediction of a whole pipeline.

  • For comparing prediction and ground truth on a datapoint, the evaluator first has to make a copy of the ground truth and then needs to erase all interim results that will later be generated when running through the pipeline. In order to know what has to be erased, a new metadata scheme had to be established for each pipeline component, indicating what type of annotation (image annotation or category annotation) will be generated when passing the datapoint through the component.

  • Moreover, additional functions had to be added to each metric so that one can specify the sub-category/summary over which the evaluation is required.

Adding TEDS metric

Tree edit distance (TEDS) has been proposed for comparing HTML representations of tables in the realm of table recognition. It is possible to call this metric on a given category for every task that generates an XML representation. The code has mainly been taken from the PubTabNet repo.
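The PubTabNet implementation builds on the apted package. A minimal sketch of a plain tree edit distance with apted (the bracket-notation trees are toy examples; for TEDS they would be parsed from the HTML table structure):

```python
from apted import APTED
from apted.helpers import Tree

# Two toy trees in bracket notation
ground_truth = Tree.from_text("{table{tr{td}{td}}}")
prediction = Tree.from_text("{table{tr{td}}}")

# Number of node insert/delete/rename operations to turn one tree into the other
distance = APTED(ground_truth, prediction).compute_edit_distance()
print(distance)
```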

Adding LayoutLMv1

The major addition of this release is support for training, evaluating and running LayoutLMv1 models in deepdoctection pipelines. Separate pipeline components for sequence and token classification with LayoutLM support massively extend the applicability of this repo. LayoutLMv1 is basically a BERT model that accepts multimodal features like tokens and bounding boxes, and it comes in different flavors. Higher LayoutLM versions with additional features will be added in later releases. The model comes with a training script for fine-tuning based on the custom trainer from the transformers library. A notebook showcasing the new functionality has been added.

Adding Doclaynet

Doclaynet is a new dataset for document layout analysis that contains around 80k manually labeled pages from documents such as financial reports, patents and others. Compared to automatically generated labels from other datasets like Publaynet, Doclaynet has a high variability in document layouts, which allows training models that are able to determine layouts for a large variety of documents.

New design of page class

The original Page class suffered from poor design choices, resulting in challenges when adding additional features to the output. It has therefore been completely redesigned and simplified, and now follows a modular approach that can easily be extended with new components.

v.0.15


Patch to add long description

v.0.14


Enhancements

Summary

- Re-organizing extra dependencies: #35
- Optimizing typing: #36
- Training script for Detectron2 and new models: #37

Details

Re-organizing extra dependencies

Adding basic, full and all extra dependencies for TF, as well as full and all dependencies for PT. Compared to the old dependency setting, it is no longer compulsory in the basic setting to have pycocotools or lxml. The all dependencies include packages that have a predictor wrapper. Setup now has several installation options, depending on whether the package has been downloaded from PyPI or GitHub. The test suite has been divided into test groups according to the additional package distributions. CI tests for merges into master have been added.

Optimizing typing

Static typing has been overhauled to reduce the massive number of typing issues caused by incorrect annotations. Some additional types (e.g. Pathlike) have been added.

Training scripts for Detectron2 and new PyTorch models

Detectron2 is now on an equal footing with Tensorpack's models and is easily trainable on dd datasets. Training metrics show that this framework is superior to the Tensorpack implementation in terms of speed and accuracy. A training script with an API identical to the Tensorpack one has been provided; it is based on D2's train_net script. Central to this script is a trainer derived from D2's DefaultTrainer with custom data-loading methods. Training of the provided PyTorch models has been resumed for 20-50k iterations to overcome the poor accuracy at higher IoU thresholds.

v.0.13


Enhancements

Summary

- Language detection: #33
- Merging datasets: #34

Details

Language detection

Adding a predictor for language detection that accepts a string and predicts its language. As model, we use the large fasttext language identification model, which is also included in the model catalog. Determining the language is crucial when applying downstream NLP tasks.

Along with the language detector, a new pipeline component LanguageDetectionService has been implemented. The service can be used in two situations:

  • before the text extraction: An OCR predictor extracts a snippet from a region of the page and passes it to the language detector. The result can then be used for a proper text extraction with an OCR model specialized for the inferred language.

  • after the text extraction: If the text extraction does not really depend on a specific language (e.g. text extraction with a PDF miner), one can use the pipeline component to obtain a more confident language prediction.
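For orientation, a hedged sketch of the underlying fasttext call outside the deepdoctection wrapper (lid.176.bin is fasttext's publicly available language identification model and must be downloaded beforehand):

```python
import fasttext

model = fasttext.load_model("lid.176.bin")  # placeholder path to the model file

text = "Dies ist ein deutscher Beispielsatz."
labels, scores = model.predict(text, k=1)  # top-1 language prediction
print(labels[0], scores[0])  # e.g. __label__de 0.99
```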

Merging datasets

Adding a new class derived from DatasetBase to construct datasets as a union of pre-selected datapoints without touching the original datasets.

To train models on multiple datasets, MergeDataset accepts a number of datasets and builds metadata (e.g. categories) and a dataflow based on their inputs. Configuring the datasets (filtering, replacing categories with sub-categories) before creating the merge is allowed, as is configuring the dataflow of each individual dataset.
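A hedged usage sketch (the dataset names are placeholders, and the method calls beyond the constructor are assumptions about the API, not verified signatures):

```python
import deepdoctection as dd

# Two registered datasets (placeholder names)
ds_1 = dd.get_dataset("publaynet")
ds_2 = dd.get_dataset("fintabnet")

# Assumption: filter one dataset to a single category before merging
ds_1.dataflow.categories.filter_categories(categories="table")

# Build the union; metadata (categories) and dataflow are derived from the inputs
merge = dd.MergeDataset(ds_1, ds_2)
merge.buffer_datasets()  # assumption: materialize datapoints from both inputs
```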

Bugs

- Getting started notebook fails: #26
- When printing tables from page object, output does not show last row: #28
- Some pipeline components do not have a clone method: #31

Improvements

Dataclass for model profile and new ModelCatalog

Models are now registered with a dataclass that allows saving the necessary metadata (URLs, HF repo id, etc.) and retrieving the information from the ModelCatalog.

Unifying registries

For metrics, datasets and pipeline components we now use the small library catalogue, which easily allows creating registries for these objects. The registries are especially designed for adding custom objects in individual projects.

Silence some TF warnings

Some TF warnings (esp. warnings appearing in TF >= 2.5) are now silenced.