
Yes, thresholding, also known as binarization, is the right idea. All of the artifacts (defects) you mentioned are considered in the design and evaluation of binarization algorithms: because these algorithms are created for the purpose of digitizing printed matter, they have to cope with every kind of practical issue. A minimal example is sketched below.
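
As a starting point, here is a minimal sketch of global binarization using OpenCV's Python bindings, with Otsu's method choosing the threshold automatically from the image histogram. The file names are placeholders, not anything from the original question:

```python
import cv2

# Placeholder input: any grayscale scan of a printed page.
img = cv2.imread("page.png", cv2.IMREAD_GRAYSCALE)

# Global thresholding: Otsu's method picks a single threshold that best
# separates the gray-level histogram into two classes (ink vs. paper),
# so the first argument (0) is ignored.
_, binary = cv2.threshold(img, 0, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)

cv2.imwrite("page_binary.png", binary)
```

A single global threshold works well on cleanly lit, high-contrast scans, but it breaks down when illumination or paper tone varies across the page.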

When reading research papers on binarization, one should make the following distinctions in terms of their applicability to your own needs:

  • Whether the paper targets pristine pages (e.g. book pages printed in modern times), degraded or weathered pages, or historical documents. For the degraded case, local (adaptive) thresholding is the usual approach; see the sketch after this list.
  • Whether the paper targets binarization of machine-printed text (including text typeset before the computerization era) or handwritten text.
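
Here is a hedged sketch of local (adaptive) thresholding with OpenCV, which computes a separate threshold for each pixel's neighborhood and therefore tolerates uneven illumination and stains far better than a global threshold. The block size and offset are illustrative values that would need tuning per scan:

```python
import cv2

# Placeholder input: a grayscale scan with uneven lighting or staining.
img = cv2.imread("degraded_page.png", cv2.IMREAD_GRAYSCALE)

# Each pixel is compared against a threshold computed from its own
# neighborhood, so a dark stain in one corner does not shift the
# threshold used elsewhere on the page.
binary = cv2.adaptiveThreshold(
    img,
    255,
    cv2.ADAPTIVE_THRESH_GAUSSIAN_C,  # Gaussian-weighted neighborhood mean
    cv2.THRESH_BINARY,
    31,   # neighborhood size in pixels (must be odd); illustrative value
    15,   # constant subtracted from the local mean; illustrative value
)

cv2.imwrite("degraded_binary.png", binary)
```

Classic local binarization algorithms from the document-analysis literature, such as Niblack's and Sauvola's methods, refine this same idea with neighborhood statistics beyond the mean.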

You can read about academic research from the following sources:

  • ICDAR (International Conference on Document Analysis and Recognition)
  • IJDAR (International Journal on Document Analysis and Recognition)
  • Various other image-processing research venues, such as CVPR, ICASSP, SIGGRAPH, and PAMI

There is a notable algorithm competition, DIBCO (Document Image Binarization Contest), which was held in 2009 and 2011. In those two contests, a large number of algorithms created by researchers and commercial entities from around the world were evaluated systematically, and all of the evaluation results are available online.

For proprietary reasons, this is all I can say; I won't be able to comment any further. Good luck with your search.