Detecting hand-drawn strikeout lines on a scanned PDF and identifying the struck-out sections of the document

asked 2017-08-05 15:47:09 -0500

AnkurRoy

updated 2017-08-10 12:18:13 -0500

[strikeout.pdf.bmp](/upfiles/15023854825724684.bmp)

I have a scanned PDF in which some sections have been struck out with a pen. I need to find those sections throughout the document and mark them, because they should be omitted the next time the PDF is generated. I am also trying to generate a report of the omitted parts, that is, the parts that were detected as struck out by hand.

So far I have done the following steps:

1. Convert the image to grayscale.
2. Blur the image.
3. Generate an inverse binary image.
4. Find the contours.
5. Try to identify large struck-out sections by comparing the area of each contour's bounding rectangle with the average bounding-rectangle area over all contours. This is not working, because the document may also contain large headers and logos/seals. (I need a better detection algorithm.)
6. I also need a way to draw the detected contours (drawing onto a Mat I can do) and map them back to the source, so that I end up with the corresponding parts of the actual scanned page and can generate the report.

For point 6 I am currently using OCR (Tesseract), but it is not working accurately.
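To make the heuristic in step 5 concrete, here is a minimal Python sketch of the area comparison, assuming the bounding rectangles are already available as `(x, y, w, h)` tuples (the shape returned by OpenCV's `boundingRect`). The function name `flag_large_boxes`, the `factor` multiplier, and all the box coordinates are made up for illustration; none of them come from my actual code.

```python
from statistics import mean


def flag_large_boxes(boxes, factor=3.0):
    """Return the boxes whose area exceeds factor * mean area of all boxes.

    boxes:  list of (x, y, w, h) tuples, one per contour.
    factor: hypothetical threshold multiplier (an assumption, not a tuned value).
    """
    if not boxes:
        return []
    areas = [w * h for (_, _, w, h) in boxes]
    threshold = factor * mean(areas)
    return [box for box, area in zip(boxes, areas) if area > threshold]


# Made-up layout: 40 character-sized contours, one pen strikeout, one logo.
characters = [(10 + 12 * i, 100, 10, 14) for i in range(40)]
strikeout = (10, 200, 400, 60)
logo = (450, 10, 120, 120)

flagged = flag_large_boxes(characters + [strikeout, logo])
print(flagged)
```

With these made-up boxes both the strikeout and the logo exceed the threshold, which is exactly the problem described in step 5: area alone cannot separate a pen stroke from a large header or seal.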


Comments

A sample image and the code you used would be helpful.

sturkmen ( 2017-08-06 08:12:20 -0500 )

Do you have the original pdf files?

burhan986 ( 2017-08-07 06:14:18 -0500 )

I have attached a sample PDF with all 3 types of strikeout. Say I need "which day" as the output. Please delete the .bmp extension and you'll get the PDF; I had to rename it because the site doesn't allow .pdf uploads.

AnkurRoy ( 2017-08-10 12:11:07 -0500 )