Hotfix release fixing a regression in no_segmentation recognition.
This release contains two small fixes: for a regression related to bumping lightning up to 2.2, and a crash in Segmentation
instantiation occurring when the first region type does not contain a region dict.
Kraken 5.x is a major release introducing trainable reading order, a cleaner API, and changes resulting in a ~50% performance improvement of recognition inference, in addition to a large number of smaller bug fixes and stability improvements.
- The --threads option of all commands has been split into --workers and --threads.
- kraken.repo methods have been adapted to the new Zenodo API. They also correctly handle versioned records now.
- --fixed-splits in ketos test (@PonteIneptique)
- kraken.containers replaces the previous dicts produced and expected by segment/rpred/serialize.
- kraken.serialize.serialize_segmentation() has been removed as part of the container class rework.
- train/rotrain/segtrain/pretrain cosine annealing scheduling now allows setting the final learning rate with --cos-min-lr.
- Reading order can now be learned with ketos rotrain, and reading order models can be added to segmentation model files. The training process is documented here.
The polygon extractor is responsible for taking a page image, baselines, and their bounding polygons, then dewarping and masking out each line.
The new polygon extractor reduces line extraction time 30x, roughly halving inference time and significantly speeding up training from XML files and compilation of datasets. It should be noted that polygon extraction does not concern data in the legacy bounding box format nor does it touch the segmentation process as it is only a preprocessing step in the recognizer on an already existing segmentation.
Not all improvements in the polygon extractor are backward compatible, causing models trained with data extracted with the old implementation to suffer from a slight reduction in accuracy (usually <0.25 percentage points). Therefore models now contain a flag in their metadata indicating which implementation has been used to train them. This flag can be overridden, e.g.:
$ kraken --no-legacy-polygons -i ... ... ocr ...
to enable all speedups for a slight increase in character error rate.
For training, the new extractor is enabled by default, i.e. models trained with kraken 5.x will perform slightly worse on earlier kraken versions but will still work. It is possible to force the use of only backwards-compatible speedups:
$ ketos compile --legacy-polygons ...
$ ketos train --legacy-polygons ...
$ ketos pretrain --legacy-polygons ...
The command line tools now handle multiprocessing and thread pools more completely and configurably. The old --threads option has been split into --threads and --workers, the former limiting the size of thread pools (as far as possible) for intra-op parallelization, the latter setting the number of worker processes, usually for the purpose of data loading in training and dataset compilation.
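As a sketch only (which subcommands accept these options, and the remaining arguments, are placeholders here, not taken from the release notes), a training invocation using four data-loading processes with thread pools capped at two threads might look like:

```shell
# assumption: ketos train accepts the split options described above;
# 4 worker processes for data loading, intra-op thread pools capped at 2
$ ketos train --workers 4 --threads 2 -f xml train/*.xml
```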
While 5.x preserves the general OCR functional blocks, the existing dictionary-based data structures have been replaced with container classes and the XML parser has been reworked.
For straightforward processing little has changed. Most keys of the dictionaries have been converted into attributes of their respective classes.
The segmentation methods now return a Segmentation object containing Region and BaselineLine/BBoxLine objects:
>>> pageseg.segment(im)
{'text_direction': 'horizontal-lr',
'boxes': [(x1, y1, x2, y2),...],
'script_detection': False
}
>>> blla.segment(im)
{'text_direction': '$dir',
'type': 'baseline',
'lines': [{'baseline': [[x0, y0], [x1, y1], ..., [x_n, y_n]], 'boundary': [[x0, y0], [x1, y1], ..., [x_m, y_m]]}, ...
{'baseline': [[x0, ...]], 'boundary': [[x0, ...]]}],
'regions': [{'region': [[x0, y0], [x1, y1], ..., [x_n, y_n]], 'type': 'image'}, ...
{'region': [[x0, ...]], 'type': 'text'}]
}
becomes:
>>> pageseg.segment(im)
Segmentation(type='bbox',
imagename=None,
text_direction='horizontal-lr',
script_detection=False,
lines=[BBoxLine(id='f1d5b1e2-030c-41d5-b299-8a114eb0996e',
bbox=[34, 198, 279, 251],
text=None,
base_dir=None,
type='bbox',
imagename=None,
tags=None,
split=None,
regions=None,
text_direction='horizontal-lr'),
BBoxLine(...)],
line_orders=[])
>>> blla.segment(im)
Segmentation(type='baseline',
imagename=im,
text_direction='horizontal-lr',
script_detection=False,
lines=[BaselineLine(id='50ab1a29-c3b6-4659-9713-ff246b21d2dc',
baseline=[[183, 284], [272, 282]],
boundary=[[183, 284], ... ,[183, 284]],
text=None,
base_dir=None,
type='baselines',
tags={'type': 'default'},
split=None,
regions=['e28ccb6b-2874-4be0-8e0d-38948f0fdf09']), ...],
regions={'text': [Region(id='e28ccb6b-2874-4be0-8e0d-38948f0fdf09',
boundary=[[123, 218], ..., [123, 218]],
tags={'type': 'text'}), ...],
'foo': [Region(...), ...]},
line_orders=[])
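For code migrating from 4.x the practical change is mostly mechanical: dictionary keys become attributes. A stdlib-only sketch of the before/after access pattern, where types.SimpleNamespace merely stands in for kraken's container classes and the coordinates are made up:

```python
from types import SimpleNamespace

# Old (pre-5.x): segmentation results were plain dicts addressed by key.
old_seg = {'type': 'baseline',
           'lines': [{'baseline': [[183, 284], [272, 282]],
                      'boundary': [[183, 284], [272, 282], [183, 280]]}]}
old_baselines = [line['baseline'] for line in old_seg['lines']]

# New (5.x): attribute access on container objects. SimpleNamespace is a
# stand-in here for kraken's Segmentation/BaselineLine classes.
new_seg = SimpleNamespace(
    type='baseline',
    lines=[SimpleNamespace(baseline=[[183, 284], [272, 282]],
                           boundary=[[183, 284], [272, 282], [183, 280]])])
new_baselines = [line.baseline for line in new_seg.lines]
```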
The recognizer now yields BaselineOCRRecords/BBoxOCRRecords, which both inherit from the BaselineLine/BBoxLine classes:
>>> pred_it = rpred(network=model,
                    im=im,
                    segmentation=baseline_seg)
>>> record = next(pred_it)
>>> record
BaselineOCRRecord pred: 'predicted text' baseline: ...
>>> record.type
'baselines'
>>> record.line
BaselineLine(...)
>>> record.prediction
'predicted text'
One complication is the new serialization function, which now accepts a Segmentation object instead of a list of ocr_records and ancillary metadata:
>>> records = list(x for x in rpred(...))
>>> serialize(records,
image_name=im.filename,
image_size=im.size,
writing_mode='horizontal-tb',
scripts=['Latn', 'Hebr'],
regions=[{...}],
template='alto',
template_source='native',
processing_steps=proc_steps)
becomes:
>>> import dataclasses
>>> baseline_seg
Segmentation(...)
>>> records = list(x for x in rpred(..., segmentation=baseline_seg))
>>> results = dataclasses.replace(baseline_seg, lines=records)
>>> serialize(results,
image_size=im.size,
writing_mode='horizontal-tb',
scripts=['Latn', 'Hebr'],
template='alto',
template_source='native',
processing_steps=proc_steps)
This requires the construction of a new Segmentation object that contains the records produced by the text predictor. The most straightforward way to create this new Segmentation is through the dataclasses.replace function, as our container classes are immutable.
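Because the containers are frozen dataclasses, in-place assignment fails and dataclasses.replace is the idiomatic way to derive an updated copy. A self-contained sketch of that pattern, using deliberately simplified stand-in classes (the real Segmentation/BaselineLine carry many more fields):

```python
import dataclasses
from typing import Optional, Tuple

# Simplified stand-ins for kraken's container classes; frozen=True makes
# instances immutable, mirroring the behaviour described above.
@dataclasses.dataclass(frozen=True)
class Line:
    id: str
    text: Optional[str] = None

@dataclasses.dataclass(frozen=True)
class Seg:
    type: str
    lines: Tuple[Line, ...]

seg = Seg(type='baseline', lines=(Line(id='l0'),))

# In-place mutation of a frozen dataclass raises ...
try:
    seg.lines = ()
except dataclasses.FrozenInstanceError:
    pass

# ... so dataclasses.replace() derives a new object carrying the
# recognition results while the original stays untouched.
records = (Line(id='l0', text='predicted text'),)
results = dataclasses.replace(seg, lines=records)
```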
Lastly, serialize_segmentation has been removed. The serialize function now accepts Segmentation objects which do not contain text predictions:
>>> serialize_segmentation(segresult={'text_direction': '$dir',
'type': 'baseline',
'lines': [{'baseline': [[x0, y0], [x1, y1], ..., [x_n, y_n]], 'boundary': [[x0, y0], [x1, y1], ..., [x_m, y_m]]}, ...
{'baseline': [[x0, ...]], 'boundary': [[x0, ...]]}],
'regions': [{'region': [[x0, y0], [x1, y1], ..., [x_n, y_n]], 'type': 'image'}, ...
{'region': [[x0, ...]], 'type': 'text'}]
},
image_name=im.filename,
image_size=im.size,
template='alto',
template_source='native',
processing_steps=proc_steps)
is replaced by:
>>> baseline_seg
Segmentation(...)
>>> serialize(baseline_seg,
image_size=im.size,
writing_mode='horizontal-tb',
scripts=['Latn', 'Hebr'],
template='alto',
template_source='native',
processing_steps=proc_steps)
The kraken.lib.xml.parse_{xml,alto,page} methods have been replaced by a single kraken.lib.xml.XMLPage class.
>>> parse_xml('xyz.xml')
{'image': impath,
'lines': [{'boundary': [[x0, y0], ...],
'baseline': [[x0, y0], ...],
'text': 'apdjfqpf',
'tags': {'type': 'default', ...}},
...
{...}],
'regions': {'region_type_0': [[[x0, y0], ...], ...], ...}}
becomes
>>> XMLPage('xyz.xml')
XMLPage xyz.xml (format: alto, image: impath)
As the parser is now aware of reading order, the XMLPage.lines attribute is an unordered dict of BaselineLine/BBoxLine container classes. As ALTO/PageXML files can generally contain multiple different reading orders, the XMLPage.get_sorted_lines()/XMLPage.get_sorted_regions() methods on the object provide an ordered view of lines or regions. The default orders line_implicit/region_implicit correspond to the order produced by the previous parsers, i.e. the order formed by the sequence of elements in the XML tree.
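The idea behind the sorted views can be pictured as applying a named reading order (a sequence of line ids) to the unordered lines mapping. A stdlib-only toy model of that idea, with illustrative names and data that are not kraken's internals:

```python
# Unordered mapping of line id -> line, as described for XMLPage.lines.
lines = {'l2': 'third line', 'l0': 'first line', 'l1': 'second line'}

# Named reading orders are sequences of line ids; 'line_implicit'
# corresponds to document order in the XML tree.
orders = {'line_implicit': ['l0', 'l1', 'l2'],
          'column_order': ['l2', 'l0', 'l1']}  # a hypothetical alternative

def get_sorted_lines(order: str = 'line_implicit'):
    """Return lines in the requested reading order."""
    return [lines[line_id] for line_id in orders[order]]
```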
XMLPage objects can be converted into a Segmentation container using the XMLPage.to_container() method:
>>> XMLPage('xyz.xml').to_container()
Segmentation(...)
Full Changelog: https://github.com/mittagessen/kraken/compare/4.3.13...5.2
This is mostly a bugfix release but also includes a couple of minor improvements and changes.
- contrib/extract_lines.py now works with binary datasets.
- The --resize choices add/both have been renamed to union/new. (Thibault Clérice) #488
This is just another hotfix release.
This is a hotfix release to 4.3.0 correcting a regression in the CLI, fixing pretrain validation losses, and the conda environment files.
- New --warmup and --freeze-backbone options in pretraining (mostly to enable fine-tuning pretrained models).
- ketos compile can now create precompiled datasets containing lines without a corresponding transcription with the --keep-empty-lines switch (mostly for pretraining models).
- --failed-sample-threshold in the training modules, aborting training after a certain number of samples failed to load.
- --logger/--log-dir options.
- ocr_record has been replaced with the new smart classes BaselineOCRRecord and BBoxOCRRecord. These keep track of reading/display order, compute bounding polygons from the whole line bounding polygon, and average confidences when slicing.
- --text-direction is now required instead of assuming horizontal lines (--text-direction horizontal-lr/-rl).
- --pad.
- --template option.

Full Changelog: https://github.com/mittagessen/kraken/compare/4.2.0...4.3.0
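The record classes' slicing behaviour (averaging confidences over the selected characters) can be illustrated with a stdlib-only toy class; the real BaselineOCRRecord/BBoxOCRRecord in kraken carry far more state, so this only mimics the semantics described above:

```python
from statistics import mean

class ToyRecord:
    """Toy stand-in showing confidence averaging when slicing a record."""

    def __init__(self, prediction, confidences):
        self.prediction = prediction          # recognized text
        self.confidences = confidences        # per-character confidences

    def __getitem__(self, key):
        # Slicing returns a sub-record over the selected characters.
        if isinstance(key, slice):
            return ToyRecord(self.prediction[key], self.confidences[key])
        return ToyRecord(self.prediction[key], [self.confidences[key]])

    @property
    def confidence(self):
        # Aggregate confidence is the average over the slice.
        return mean(self.confidences) if self.confidences else 0.0

rec = ToyRecord('word', [1.0, 0.5, 0.5, 1.0])
sub = rec[:2]  # sub.prediction == 'wo', sub.confidence == 0.75
```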
This is mainly a bugfix release containing small improvements such as additional tests, typing, spelling corrections, additional contrib scripts, and fixes for rarely used functionality.
- TextLine elements without text are dropped in ALTO output to avoid standard-violating empty TextLines.
- ketos train and KrakenTrainer now actually load a given codec.
- contrib/forced_alignment_overlay.py now preserves the input file and only replaces the character cuts.
- TextEquiv for Word and TextLine in PAGE XML output.