free-variation/ocr-arabic-script

ocr-arabic-script

Experiments in OCR for historical texts written in Arabic script.

Prerequisites

  • GNU Make
  • GNU gawk
  • xmllint, part of the libxml2-utils package.
  • A working Python3 environment
  • pip
  • GNU parallel, for running kraken operations in parallel, which may be somewhat faster than kraken batched operations on multicore machines with lower core counts and no GPU.
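A quick way to confirm the prerequisites above are installed is to look each one up on the PATH. This is a minimal sketch, not part of the repository: `check_tools` is a hypothetical helper, and the command names (`make`, `gawk`, `xmllint`, `python3`, `pip`, `parallel`) are the usual ones but may differ on your system (for example `pip3`).

```shell
#!/bin/sh
# Report which of the given commands are missing from PATH.
# Returns 0 if all are present, 1 otherwise.
check_tools() {
    missing=0
    for tool in "$@"; do
        if ! command -v "$tool" >/dev/null 2>&1; then
            echo "missing: $tool" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Assumed command names for the prerequisites listed above.
check_tools make gawk xmllint python3 pip parallel \
    || echo "install the missing tools before running make deps"
```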

Installation

make deps

Configuration

The system is configured via environment variables set in a local, non-versioned file, ./config.

PyTorch device

To run on a GPU, set the device, for example:

DEVICE=cuda:0

The default device is cpu.

Number of threads for OCR step

This parameter is passed to kraken's ocr command. For example, on a 4-core system:

NUM_THREADS=4

The default is 1.
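Putting the two settings together, a ./config file could contain the lines `DEVICE=cuda:0` and `NUM_THREADS=4`. How the Makefile actually reads this file is not described here, so the following is only a sketch of the documented defaults, assuming the file holds plain KEY=value lines that a POSIX shell can source. `load_config` is a hypothetical helper, not something the repository provides.

```shell
#!/bin/sh
# Load settings from a config file, falling back to the documented defaults.
# Assumption: the file holds plain KEY=value lines that a POSIX shell can source.
load_config() {
    unset DEVICE NUM_THREADS
    if [ -f "$1" ]; then
        . "$1"
    fi
    DEVICE="${DEVICE:-cpu}"            # documented default device
    NUM_THREADS="${NUM_THREADS:-1}"    # documented default thread count
}

load_config ./config
echo "device=$DEVICE threads=$NUM_THREADS"
```

Running this from the repository root prints the values the pipeline would be expected to use, which is a cheap check before starting a long OCR run.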

Test Runs

Binarization

make binarize-all

This binarizes all the images in data/fas, yielding image files ending in -bin.png.

Optionally, use the parallelized version of this target:

make binarize-all-par

Segmentation

make segment-all

This segments all the binarized images in data/fas, yielding ALTO XML files ending in -seg.xml.

Optionally, use the parallelized version of this target:

make segment-all-par

Because the parallelized version runs multiple processes, the cost of loading the neural model is paid once per process, and the number of processes defaults to the number of cores available on the machine (the GNU parallel default). Experiment to determine whether parallelization is beneficial on your hardware. On a MacBook Pro (2019) the speedup is considerable.
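One simple way to run that experiment is to time both targets on the same data. The sketch below uses a hypothetical helper, `elapsed_seconds`, with one-second resolution. Whether the make targets skip outputs that already exist is not stated here, so you may need to clear earlier -seg.xml files between runs for a fair comparison.

```shell
#!/bin/sh
# Print the wall-clock seconds a command takes (1-second resolution).
# The command's own output is discarded so only the timing is printed.
elapsed_seconds() {
    start=$(date +%s)
    "$@" >/dev/null 2>&1
    end=$(date +%s)
    echo $((end - start))
}

# Usage, run from the repository root:
#   serial=$(elapsed_seconds make segment-all)
#   parallel=$(elapsed_seconds make segment-all-par)
#   echo "serial: ${serial}s, parallel: ${parallel}s"
```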

Recognition

make ocr-all

This target runs kraken's OCR over the segmented images, again yielding ALTO XML files, this time with the recognized text in CONTENT attributes. The output filenames end in -rec.xml.
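To check that recognition produced text, the -rec.xml files can be inspected with xmllint, which is already a prerequisite. In ALTO, recognized words are carried in the CONTENT attribute of String elements. The helpers below are a sketch with hypothetical names (`count_strings`, `print_contents`), and the file name in the usage comment is a placeholder. They match on `local-name()` so the query does not depend on which ALTO namespace version the file declares.

```shell
#!/bin/sh
# Count the String elements (recognized words) in an ALTO file.
count_strings() {
    xmllint --xpath 'count(//*[local-name()="String"])' "$1"
}

# Print the CONTENT attributes of all String elements.
# The exact output layout depends on the libxml2 version.
print_contents() {
    xmllint --xpath '//*[local-name()="String"]/@CONTENT' "$1"
}

# Usage (placeholder file name):
#   count_strings data/fas/example-rec.xml
#   print_contents data/fas/example-rec.xml
```

A count of 0 on a -rec.xml file suggests the recognition step produced no text for that page.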

Optionally, use the parallelized version of this target:

make ocr-all-par

The same caveats as for parallel segmentation apply.
