A stupid OCR for the Malayalam language
https://harish2704.github.io/ml-tesseract-demo/
A stupid OCR for the Malayalam language. It can be easily configured to process other languages with complex scripts.
https://harish2704.github.io/pottan-demo/
git clone https://github.com/harish2704/pottan-ocr
cd pottan-ocr
env DISTRO=debian ./tools/install-dependencies.sh
env DISTRO=fedora ./tools/install-dependencies.sh
env DISTRO=opensuse ./tools/install-dependencies.sh
./tools/install-dependencies.sh
By default, the installer installs only the dependencies necessary to run the OCR. For training the OCR, pass the string for_training as the first argument to the installer.
./tools/install-dependencies.sh for_training
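The DISTRO value used by the installer commands above can also be derived from /etc/os-release instead of being typed by hand. The following is a minimal sketch and not part of pottan-ocr: the function name detect_distro is hypothetical, and the mapping from os-release IDs to the three DISTRO values is an assumption based only on the commands shown above.

```shell
# Hypothetical helper (not part of pottan-ocr): map an os-release ID to one
# of the DISTRO values used above. Prints nothing for any other ID, in which
# case the installer can be run without DISTRO, as in the last command above.
detect_distro() {
    case "$1" in
        debian)    echo debian ;;
        fedora)    echo fedora ;;
        opensuse*) echo opensuse ;;   # e.g. opensuse-leap, opensuse-tumbleweed
        *)         ;;                 # unknown ID: print nothing
    esac
}

# Possible usage (assumes /etc/os-release exists and defines ID):
#   . /etc/os-release
#   distro=$(detect_distro "$ID")
#   if [ -n "$distro" ]; then
#       env DISTRO="$distro" ./tools/install-dependencies.sh
#   else
#       ./tools/install-dependencies.sh
#   fi
```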
wget 'https://github.com/harish2704/pottan-ocr-data/raw/master/crnn_11032020_171631_5.h5' -O pottan_ocr_latest.h5
cp ./config.yaml.sample ./config.yaml
./bin/pottan ocr <trained_model.h5> <image_path> [ pottan_ocr_output.html ]
For more details, see the --help of bin/pottan and its subcommands.
Usage:
./pottan <command> [ arguments ]
List of available commands ( See '--help' of individual command for more details ):
extractWikiDump - Extract words from wiki xml dump ( most of the text corpus ). Output is written to stdout.
datagen - Prepare training data from data/train.txt & data/validate.txt. ( Deprecated. Used only for manual verification of training data )
train - Run the training
ocr - Run character recognition with a pre-trained model and image file
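The ocr synopsis shown earlier takes a single image path, so processing a folder of scans means calling it once per file. Below is a minimal sketch of such a loop. The helper names html_name and ocr_all, and the POTTAN and MODEL variables, are hypothetical. It also assumes that the optional third argument of the ocr command names the HTML output file.

```shell
# Hypothetical helper: derive an output HTML path from an image path by
# replacing the last extension, e.g. scans/page1.png -> scans/page1.html.
# Assumes the file name has an extension.
html_name() {
    printf '%s\n' "${1%.*}.html"
}

# Hypothetical helper: run the ocr subcommand once per image given as an
# argument. POTTAN and MODEL default to the paths used in the commands above.
ocr_all() {
    for img in "$@"; do
        "${POTTAN:-./bin/pottan}" ocr "${MODEL:-pottan_ocr_latest.h5}" \
            "$img" "$(html_name "$img")"
    done
}

# Possible usage:
#   ocr_all scans/*.png
```

Keeping the command and model path in variables makes it possible to dry-run the loop, for example with POTTAN=echo, before running the real OCR.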
The fonts listed in ./config.yaml.sample should be available in the output of the command fc-list :lang=ml
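A quick way to check this is to compare the font names you expect against the fc-list output. The sketch below is not part of pottan-ocr. The function name missing_fonts is hypothetical, the font names in the usage comment are only examples, and the match is a simple case-insensitive substring test, so it can report a font as present when only a similarly named family is installed.

```shell
# Hypothetical helper: read a font listing on stdin and print every font
# name given as an argument that does not appear in that listing
# (case-insensitive, fixed-string substring match).
missing_fonts() {
    listing=$(cat)
    for font in "$@"; do
        printf '%s\n' "$listing" | grep -qiF -- "$font" || printf '%s\n' "$font"
    done
}

# Possible usage (font names are examples, not taken from config.yaml.sample):
#   fc-list :lang=ml family | missing_fonts "Rachana" "Meera"
```

An empty result means every name was found in the listing.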
datagen does exactly this. During training, if images already exist in the cache directory (e.g. when the cache directory points to the directory of generated images), they are used for training instead of generating new images. This is used to reduce CPU load during production training sessions.
For more details, see the wiki.
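The caching idea can be illustrated with a small generic sketch. This is not pottan-ocr's actual code: the function name cached_or_generate and its arguments are hypothetical, and the generator command stands in for whatever renders a training image.

```shell
# Hypothetical illustration of "reuse if cached, otherwise generate".
# Usage: cached_or_generate CACHE_DIR NAME GENERATOR [ARGS...]
# If CACHE_DIR/NAME exists, it is left untouched and "cached" is printed.
# Otherwise GENERATOR is run, its stdout is saved as CACHE_DIR/NAME,
# and "generated" is printed.
cached_or_generate() {
    cache_dir=$1
    name=$2
    shift 2
    if [ -f "$cache_dir/$name" ]; then
        echo "cached"
    else
        mkdir -p "$cache_dir"
        "$@" > "$cache_dir/$name"
        echo "generated"
    fi
}
```

The expensive generator runs only on the first request for a given name. Later requests find the file on disk, which is where the CPU saving comes from.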
#pottan-ocr:matrix.org
( https://riot.im/app/#/room/#pottan-ocr:matrix.org ).