I have a scanned document from which I would like to remove the underlines before running OCR on it. I need to remove them because I noticed that the underlines reduce the OCR's recognition accuracy.


For example, in the attached image, if I remove the underlines, both dates are recognized accurately; otherwise, one of the dates is not recognized.

Any Python sample code would be much appreciated.


What did you try? Did you bother to search for a solution?

Hint: the HoughLinesP function is what you are looking for. Here's a tutorial; just change the line color to white: Python tutorial

kbarni ( 2020-11-19 11:32:05 -0500 )

I searched for possible solutions. I also tried the HoughLinesP function you suggested, as well as a contour-based way of looking for lines in the image. But the results weren't satisfactory. For example, an extra line was created.

kst ( 2020-11-19 17:49:01 -0500 )

@kst, I deleted my answers because you didn't provide a second image (the one with the invalid date stamp). The first image will work without the underline. How would I know that one of the dates is not recognizable, whether both are underlined or just one?

supra56 ( 2020-11-20 07:02:05 -0500 )

answered 2020-11-24 02:58:40 -0500

berak

Have a look at the morphology tutorial.

A long horizontal kernel should do the trick.

