The problem you are describing is called "page layout detection and character segmentation". The generic steps are as follows:

  1. Detect page zones such as text headers, text paragraphs, graphics and pictures, tables, and so on.
  2. For each text zone (header, table cell, paragraph), do the following:

    • Split the zone into lines
    • Split each line into words
    • Split each word into characters

In your case you have only one paragraph. You can split it into lines by using a horizontal histogram (the count of ink pixels in each row) and cutting at its local minima, or you can use contours and group regions whose vertical extents overlap by more than some height threshold into one line.
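As a rough illustration of the horizontal histogram idea, here is a minimal sketch in plain Python. It assumes the image is already binarized into rows of 0/1 values (1 = ink), and it simplifies "cut at a local minimum" to "cut wherever a row has fewer than `min_ink` ink pixels". The names `split_lines` and `min_ink` are made up for this example.

```python
def split_lines(binary, min_ink=1):
    """Return (top, bottom) row ranges, bottom exclusive, one per text line.

    binary:  list of rows, each a list of 0/1 values where 1 marks an ink pixel.
    min_ink: rows with fewer ink pixels than this are treated as gaps
             (hypothetical parameter, tune it for noisy scans).
    """
    # Horizontal histogram: number of ink pixels in every row.
    profile = [sum(row) for row in binary]

    lines = []
    start = None
    for y, ink in enumerate(profile):
        if ink >= min_ink and start is None:
            start = y                     # entering a text line
        elif ink < min_ink and start is not None:
            lines.append((start, y))      # leaving a text line
            start = None
    if start is not None:                 # last line touches the bottom edge
        lines.append((start, len(profile)))
    return lines


page = [
    [0, 0, 0, 0],
    [1, 1, 0, 1],
    [1, 0, 0, 1],
    [0, 0, 0, 0],
    [0, 1, 1, 0],
    [0, 0, 0, 0],
]
print(split_lines(page))  # [(1, 3), (4, 5)]
```

On real scans the gaps between lines are rarely completely empty, so smoothing the profile or raising `min_ink` is probably needed.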

  • Sort the lines from top to bottom.
  • Within every line, sort the regions from left to right.
  • Within a line, if two regions overlap horizontally, merge them into one bigger region (this solves the i, j problem, where the dot is detected as a separate region).
  • Then split the line into characters by taking every region as one character or ligature (rr, ff, vv).
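The sorting and merging steps above could look roughly like this. It is only a sketch under some assumptions: regions are bounding boxes in `(x, y, w, h)` form (the layout `cv2.boundingRect` returns), and they have already been grouped into lines. The function names are hypothetical.

```python
def merge_overlapping(boxes):
    """Sort one line's (x, y, w, h) boxes left to right and merge
    any box that overlaps the previous one horizontally."""
    merged = []
    for x, y, w, h in sorted(boxes):
        if merged and x < merged[-1][0] + merged[-1][2]:
            # Horizontal overlap: replace the previous box by the union.
            px, py, pw, ph = merged[-1]
            left, top = min(px, x), min(py, y)
            right = max(px + pw, x + w)
            bottom = max(py + ph, y + h)
            merged[-1] = (left, top, right - left, bottom - top)
        else:
            merged.append((x, y, w, h))
    return merged


def segment_characters(lines):
    """lines: list of box lists, one list per text line.
    Returns the lines top to bottom, each as merged character boxes."""
    non_empty = [boxes for boxes in lines if boxes]
    # Order lines by the y coordinate of their topmost box.
    ordered = sorted(non_empty, key=lambda boxes: min(b[1] for b in boxes))
    return [merge_overlapping(boxes) for boxes in ordered]


# First line: "i" (stem and dot as separate regions) followed by "x".
line_a = [(15, 5, 6, 10), (10, 5, 2, 10), (10, 0, 2, 3)]
# Second line: a single region further down the page.
line_b = [(3, 20, 4, 10)]

print(segment_characters([line_b, line_a]))
# [[(10, 0, 2, 15), (15, 5, 6, 10)], [(3, 20, 4, 10)]]
```

The dot and the stem of the "i" end up as one box, while the "x" stays separate. Touching or overlapping neighbours such as ligatures would also be merged, which matches treating each region as one character or ligature.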

Finally, if you need a ready-made solution, Tesseract can do all of the previous tasks plus the recognition.
