This repository has been archived by the owner on Mar 3, 2024. It is now read-only.

Lotemn102/HebHTR

HebHTR

Hebrew Handwritten Text Recognizer, based on machine learning. Implemented with TensorFlow and OpenCV.
The model is based on Harald Scheidl's SimpleHTR model [1] and his CTC-WordBeam algorithm [2].

Getting Started

Prerequisites

Currently HebHTR is only supported on Linux. I've tested it on Ubuntu 18.04.

To run HebHTR you first need to compile Harald Scheidl's CTC-WordBeam: clone the CTC-WordBeam repository, go to its cpp/proj/ directory, and run the script ./buildTF.sh

Quick Start

from HebHTR import *

# Create new HebHTR object.
img = HebHTR('example.png')

# Infer words from image.
text = img.imgToWords(iterations=5, decoder_type='word_beam', remove_vertical_lines=False,
                        remove_horziontal_lines=False)

Result:

How This Works

HebHTR first detects the sentences in the text-based image. Then, for each sentence, it crops all the words in the sentence, and passes each word to the model to decode the text written in it.

Word segmentation is based on one of my previous works which can be found here.

About the Model

As mentioned, the model was written by Harald Scheidl and is designed to decode text from images that contain a single word. I trained it on a Hebrew words dataset. Its accuracy is 88%, with a character error rate of around 4%.
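The README does not state exactly how these two metrics were computed. A minimal sketch, assuming the common convention of exact-match word accuracy and a character error rate defined as total edit distance divided by total ground-truth characters, could look like this. The function names `edit_distance` and `evaluate` are hypothetical and not part of HebHTR.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete a character of a
                            curr[j - 1] + 1,      # insert a character of b
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def evaluate(predictions, ground_truth):
    """Return (word_accuracy, character_error_rate) for paired word lists.

    Assumed definitions (not confirmed by the source):
      word accuracy = share of predictions that match the label exactly
      CER           = summed edit distance / summed label length
    """
    pairs = list(zip(predictions, ground_truth))
    correct = sum(1 for p, g in pairs if p == g)
    char_errors = sum(edit_distance(p, g) for p, g in pairs)
    total_chars = sum(len(g) for _, g in pairs)
    return correct / len(pairs), char_errors / total_chars
```

Under these definitions a single wrong letter makes the whole word count as wrong, while adding only one error to the character count. That is why word accuracy is usually far lower than one minus the character error rate.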

The model receives a binary input image of size 128×32. It has 5 CNN layers and 2 RNN layers, and the final words are decoded with the CTC-WordBeam algorithm.

A much more detailed explanation can be found in Harald's article [1].

Every word passed to the model must match its input format, i.e. a binary image of size 128×32. Therefore, each image is first normalized to binary color. Then it is resized (without distortion) until it either has a width of 128 or a height of 32. Finally, it is copied into a (white) target image of size 128×32.
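The steps above can be sketched in plain Python on a nested list of grayscale pixels. This is an illustration only, not HebHTR's actual code (which works on OpenCV images). The function name `fit_to_model_input`, the threshold of 128, the nearest-neighbour resize and the top-left placement are all assumptions made for the example.

```python
def fit_to_model_input(img, target_w=128, target_h=32, threshold=128):
    """Binarize, resize without distortion, and pad to target_w x target_h.

    img is a list of rows of grayscale values (0 = black ink, 255 = white).
    """
    h, w = len(img), len(img[0])

    # 1. Normalize to binary color (threshold value is an assumption).
    binary = [[0 if p < threshold else 255 for p in row] for row in img]

    # 2. One scale factor for both axes keeps the aspect ratio, so the
    #    result touches either the target width or the target height.
    scale = min(target_w / w, target_h / h)
    new_w = max(1, min(target_w, round(w * scale)))
    new_h = max(1, min(target_h, round(h * scale)))

    # Nearest-neighbour resize (a real pipeline would use an image library).
    resized = [
        [binary[min(h - 1, int(y / scale))][min(w - 1, int(x / scale))]
         for x in range(new_w)]
        for y in range(new_h)
    ]

    # 3. Copy into a white target image (top-left corner, by assumption).
    canvas = [[255] * target_w for _ in range(target_h)]
    for y in range(new_h):
        canvas[y][:new_w] = resized[y]
    return canvas
```

For instance, a 64×64 crop is scaled by 0.5 to 32×32 and occupies the left quarter of the 128×32 canvas, while a 256×32 crop is scaled to 128×16 and fills the top half.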

The following figure demonstrates this process:

About the Dataset

I've created a dataset of around 100,000 Hebrew words. Around 50,000 of them are real words, taken from scanned student exams. Segmentation of those words was done using one of my previous works which can be found here.
I cleaned and labeled this data manually. I also generated the other 50,000 words artificially. The word list for the artificial words is taken from MILA's Hebrew stopwords lexicon [3]. Overall, the dataset contains 25 different handwritten fonts. It also contains digits and punctuation characters.

All words in the dataset have the size of 128×32, and were encoded into black and white (binary).
For example:

About the Corpus

The corpus used by the Word Beam decoder consists of around 500,000 unique Hebrew words. I created it from MILA's Arutz 7 corpus [4], TheMarker corpus [5], and HaKnesset corpus [6].

Available Functions

imgToWords

imgToWords(remove_horziontal_lines=False, remove_vertical_lines=False, iterations=5,
                    decoder_type='best_path')

Converts a text-based image to text.

Parameters:

  • remove_horziontal_lines (bool): Whether to remove horizontal lines from the text or not. Default value is set to 'False'.

  • remove_vertical_lines (bool): Whether to remove vertical lines from the text or not. Default value is set to 'False'.

  • iterations (int): Number of dilation iterations performed on the image. The image is dilated to find the contours of its words. Default value is set to 5.

  • decoder_type (string): Which decoder to use when inferring a word. There are two decoding options:

    • 'word_beam' - CTC word beam algorithm.
    • 'best_path' - Determined by taking the model's most likely character at each position.

    The word beam decoding gives significantly better results.
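To make the 'best_path' option concrete, here is a minimal sketch of standard CTC best-path decoding: take the most likely class at each time step, collapse consecutive repeats, then drop the CTC blank. It is a generic illustration, not HebHTR's implementation. The function name `best_path_decode` and the choice of the last class as the blank are assumptions.

```python
def best_path_decode(probs, charset, blank_index=None):
    """Decode a CTC output matrix with the best-path rule.

    probs   : list of time steps, each a list of scores with one entry
              per character in charset plus one for the CTC blank.
    charset : string of characters, indexed like the score entries.
    """
    if blank_index is None:
        blank_index = len(charset)  # assumption: blank is the last class

    # Most likely class at each time step.
    best = [max(range(len(step)), key=step.__getitem__) for step in probs]

    decoded = []
    previous = None
    for index in best:
        # Skip repeats of the previous step and skip blanks.
        if index != previous and index != blank_index:
            decoded.append(charset[index])
        previous = index
    return "".join(decoded)
```

Best-path decoding looks at each time step in isolation and never consults a dictionary, so it can emit strings that are not words. Word beam search restricts the output to words from a corpus, which is consistent with the better results reported above.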

Returns

  • Text decoded by the model from the image (string).

drawRectangles

drawRectangles(output_path=None, remove_horziontal_lines=False, remove_vertical_lines=False,
                        iterations=5, dilate=True)

This function draws rectangles around the words in the text. With it, you can see how the 'remove_horziontal_lines', 'remove_vertical_lines' and 'iterations' arguments affect HebHTR's segmentation performance.

Parameters:

  • output_path (string): Path to save the image to. If None is given, the image is saved in the parent directory of the original image.

  • remove_horziontal_lines (bool): Whether to remove horizontal lines from the text or not. Default value is set to 'False'.

  • remove_vertical_lines (bool): Whether to remove vertical lines from the text or not. Default value is set to 'False'.

  • iterations (int): Number of dilation iterations performed on the image. The image is dilated to find the contours of its words. Default value is set to 5.

  • dilate (bool): Whether to dilate the text in the image or not. Default is set to 'True'. It is recommended to dilate the image for better segmentation.

Returns

  • None. Saves the image in the output path.
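The effect of the `iterations` parameter can be illustrated with a simplified one-dimensional analogy. HebHTR itself dilates a 2-D image with OpenCV. The sketch below instead dilates a row of "ink columns" with a 3-wide kernel and counts the resulting blobs, which stand in for the word contours. The helper names `dilate` and `count_blobs` and the sample profile are made up for this example.

```python
def dilate(columns, iterations):
    """Binary dilation of a 1-D ink profile with a 3-wide kernel."""
    for _ in range(iterations):
        n = len(columns)
        columns = [
            1 if any(columns[j] for j in range(max(0, i - 1), min(n, i + 2))) else 0
            for i in range(n)
        ]
    return columns


def count_blobs(columns):
    """Count runs of consecutive ink columns (1-D stand-in for contours)."""
    return sum(
        1 for i, c in enumerate(columns)
        if c and (i == 0 or not columns[i - 1])
    )


# 1 = column containing ink. Two "words": three letters, then two letters.
# Gaps between letters are 2 columns wide, the gap between words is 6.
profile = [int(ch) for ch in "1100110011000000110011"]

for iterations in (0, 1, 3):
    print(iterations, count_blobs(dilate(profile, iterations)))
# 0 iterations -> 5 blobs (every letter is separate)
# 1 iteration  -> 2 blobs (letters merged into two words)
# 3 iterations -> 1 blob  (both words merged together)
```

Each iteration grows every blob by one column on each side, so a gap closes once the iteration count reaches half its width. The useful range is therefore the one that closes the gaps between letters but not the gaps between words, and it shifts upward as the handwriting and spacing get larger.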

Improve Accuracy

The model's accuracy is around 88%, but because of errors in word segmentation, accuracy on large texts might be much lower.
I suggest two ways to improve it:

1. Change the number of iterations.
A higher number of iterations suits large letters and wide spaces between words, while a lower number suits smaller handwriting. Use the drawRectangles function to see how the number of iterations affects HebHTR's segmentation of your text. I will use the following sentence as an example:

For 3 iterations we get the following segmentation:

Which the model infers as:

, כולת להקשום לעצמ נו - סוגיה מעני ות המשתנה עםהזמן

And for 6 iterations we get the following segmentation:

Which the model infers as:

היכולת להקשיב לעמנו - סוגיה מעניינת המשתנה עם הזמן


2. Remove horizontal and/or vertical lines.
Removing those lines might improve sentence segmentation, and thus improve the model's inference accuracy.

For example:

Without using any of the removing options, we get complete gibberish:

4- א- תמ" - מו, רח או- ין אות הלחמה הברים+ מידווסט יות באלו ברוחם נ: ורם מוטי אות, מוטין, אל ליוי יורטי ודורי ידי מ- יוש: מלי. - ימש, - ואירופאים - צרפת - וסוריה ניתנה לצרפת. השושלת ההאשמית רצתה את השליטה בסוריה - בשם הלאום הערבי.

But when we use both of the removing options, we get:

והם כבשו את דמשק, אך לאחר המלחמה הבריטים העדיפו את בעלי בריתם האירופאים - צרפת - וסוריה ניתנה לצרפת. (השושלת ההאשמית רצתה את השליטה בסוריה - בשם הלאום הערבי.


If none of the above helps, I suggest you try to segment the text into single words with another algorithm that fits your data, and then infer each word with the model.

Requirements

  • TensorFlow 1.12.0
  • NumPy 1.16.4 (will work on 1.17.0 as well)
  • OpenCV

References

[1] Harald Scheidl's SimpleHTR model
[2] Harald Scheidl's CTC-WordBeam algorithm
[3] The MILA Hebrew Lexicon
[4] MILA's Arutz 7 corpus
[5] MILA's TheMarker corpus
[6] MILA's HaKnesset corpus