I have a scanned PDF in which some sections have been struck out by hand with a pen. I need to detect these struck-out sections throughout the document so that they can be omitted the next time the PDF is generated. I also want to generate a report containing the parts that were detected as struck out.
So far I have done the following:
1. Converted the image to grayscale.
2. Blurred the image.
3. Generated an inverted binary image.
4. Found the contours.
5. Tried to identify large struck-out sections by comparing the area of each contour's bounding rectangle with the average bounding-rectangle area over all contours. This does not work reliably, because the document may also contain large headers and logos/seals. I need a better detection algorithm.
6. I also need a way to map the detected contours back to the corresponding regions of the original page image, so that I can include those regions in the report. Drawing the contours on a Mat is something I can already do. For this step I am currently using OCR (Tesseract), but it is not accurate.
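
To make the problem with step 5 concrete, here is a minimal sketch of that heuristic in plain Python. It is not my actual code: it assumes the bounding rectangles are available as (x, y, w, h) tuples (the shape OpenCV's boundingRect returns in Python), and the function name `flag_large_rects`, the factor of 3.0, and the sample rectangles are all made up for illustration.

```python
from statistics import mean


def flag_large_rects(rects, factor=3.0):
    """Return the rectangles whose area exceeds factor * mean area.

    rects  -- list of (x, y, w, h) bounding rectangles
    factor -- hypothetical multiplier; not a tuned value
    """
    areas = [w * h for (_x, _y, w, h) in rects]
    if not areas:
        return []
    threshold = factor * mean(areas)
    return [r for r, a in zip(rects, areas) if a > threshold]


# Hypothetical page: 20 word-sized boxes, one pen strike-out, one logo.
words = [(10 + 60 * i, 100, 50, 12) for i in range(20)]
strikeout = (10, 300, 500, 80)
logo = (400, 10, 150, 150)

flagged = flag_large_rects(words + [strikeout, logo])
# Both the strike-out and the logo end up in `flagged`,
# which is exactly the false positive described in step 5.
```

Area alone cannot separate a scribbled-out block from a logo or a large header, so I suspect the detection needs some additional feature beyond bounding-rectangle size.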