
The problem you are describing is called "page layout detection and character segmentation". The generic steps are as follows:

1- Detect page zones such as text headers, text paragraphs, graphics and pictures, tables, and so on.
2- For each text zone (header, table cell, paragraph), do the following:

- Split the zone into lines
- Split each line into words
- Split each word into characters

In your case you only have one paragraph, so start by splitting it into lines. You can either compute a horizontal histogram (the amount of ink in each pixel row) and cut at its local minima, or find contours and group into one line the regions that overlap vertically by more than some height threshold.
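The histogram approach can be sketched roughly as follows. This is a minimal illustration, assuming the page is already binarized into a list of rows where 1 means ink and 0 means background. The function name `split_lines` and the `gap_threshold` parameter are made up for this example, and cutting wherever the row sum drops to the threshold is a simplification of "cut at local minima" that only works when lines are separated by (nearly) empty rows.

```python
def split_lines(binary, gap_threshold=0):
    """Return (top, bottom) row ranges, bottom exclusive, one per text line.

    binary: list of rows, each a list of 0/1 values (1 = ink). Assumed input format.
    gap_threshold: rows with this much ink or less count as a gap (hypothetical knob).
    """
    # Horizontal histogram: total ink in each pixel row.
    profile = [sum(row) for row in binary]

    lines = []
    start = None  # first row of the line currently being scanned, if any
    for y, ink in enumerate(profile):
        if ink > gap_threshold and start is None:
            start = y                      # entering a text line
        elif ink <= gap_threshold and start is not None:
            lines.append((start, y))       # leaving a text line
            start = None
    if start is not None:                  # text runs to the bottom edge
        lines.append((start, len(profile)))
    return lines


page = [
    [0, 0, 0, 0, 0, 0],
    [0, 1, 1, 0, 1, 0],
    [0, 1, 0, 0, 1, 0],
    [0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 0, 0],
    [1, 1, 0, 1, 1, 0],
    [0, 0, 0, 0, 0, 0],
]
print(split_lines(page))  # [(1, 3), (5, 6)]
```

On real scans the gaps are rarely perfectly empty, so you would probably need a non-zero threshold or some smoothing of the histogram before cutting.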

- Sort the lines from top to bottom.
- For every line, sort its regions from left to right.
- Within a line, if two regions overlap horizontally, merge them into one bigger region (this solves the problem of the detached dots on i and j).
- Then split the line into characters by treating every region as one character or ligature (rr, ff, vv).
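The contour-based steps can be sketched in the same spirit. This assumes you already have one bounding box per connected region as an `(x, y, w, h)` tuple (for example from a contour detector). All function names here are hypothetical, and the vertical grouping uses a plain overlap test rather than a tuned height threshold.

```python
def group_into_lines(boxes):
    """Group (x, y, w, h) boxes into lines by vertical overlap.

    Returns a list of lines ordered top to bottom; each line is a list of
    boxes ordered left to right.
    """
    lines = []  # each entry: {"top": int, "bottom": int, "boxes": [...]}
    for box in sorted(boxes, key=lambda b: b[1]):
        x, y, w, h = box
        for line in lines:
            # The box joins a line if their vertical extents overlap.
            if y < line["bottom"] and y + h > line["top"]:
                line["boxes"].append(box)
                line["top"] = min(line["top"], y)
                line["bottom"] = max(line["bottom"], y + h)
                break
        else:
            lines.append({"top": y, "bottom": y + h, "boxes": [box]})
    lines.sort(key=lambda l: l["top"])
    return [sorted(l["boxes"], key=lambda b: b[0]) for l in lines]


def merge_overlapping(line_boxes):
    """Merge boxes that overlap horizontally (input sorted left to right)."""
    merged = []
    for x, y, w, h in line_boxes:
        if merged:
            mx, my, mw, mh = merged[-1]
            if x < mx + mw:  # horizontal overlap, e.g. the dot above an "i"
                top = min(my, y)
                right = max(mx + mw, x + w)
                bottom = max(my + mh, y + h)
                merged[-1] = (mx, top, right - mx, bottom - top)
                continue
        merged.append((x, y, w, h))
    return merged


def segment(boxes):
    """Return one list of character (or ligature) boxes per text line."""
    return [merge_overlapping(line) for line in group_into_lines(boxes)]


boxes = [
    (10, 0, 2, 2),   # dot of an "i"
    (10, 3, 2, 7),   # stem of the same "i"
    (0, 3, 6, 7),    # a short letter on the first line
    (14, 0, 5, 10),  # a tall letter on the first line
    (0, 20, 5, 7),   # a letter on the second line
]
print(segment(boxes))
# [[(0, 3, 6, 7), (10, 0, 2, 10), (14, 0, 5, 10)], [(0, 20, 5, 7)]]
```

Note that a line made only of short letters plus a detached dot may not group correctly with a strict overlap test, since the dot would not overlap anything vertically. That is where a tolerance or height threshold comes in.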

Finally, if you need a ready-made solution, Tesseract can do all of the previous tasks plus the recognition.
