Extracting information from national id

python

asked 2018-11-22 07:40:51 -0600

Ahmed

updated 2018-11-24 13:07:51 -0600

I'm trying to run Arabic OCR on the following ID, but after preprocessing I get a very noisy picture and can't extract any information from it.

Here is my attempt:

import tesserocr
from PIL import Image
import pytesseract
import matplotlib as plt
import cv2
import imutils
import numpy as np

image = cv2.imread(r'c:\ahmed\ahmed.jpg')
gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
gray = cv2.bilateralFilter(gray,11,18,18)

gray = cv2.GaussianBlur(gray,(5,5), 0)

kernel = np.ones((2,2), np.uint8)


gray = cv2.adaptiveThreshold(gray,255,cv2.ADAPTIVE_THRESH_GAUSSIAN_C,
            cv2.THRESH_BINARY,11,2)
#img_dilation = cv2.erode(gray, kernel, iterations=1)


#cv2.imshow("dilation", img_dilation)

cv2.imshow("gray", gray)

text = pytesseract.image_to_string(gray, lang='ara')
print(text)
with open(r"c:\ahmed\file.txt", "w", encoding="utf-8") as myfile:
    myfile.write(text)
cv2.waitKey(0)

another sample

image description


2 answers

answered 2018-11-24 06:23:33 -0600

supra56


#!/usr/bin/env python
#Raspberry pi 3B+, kernel 5.14.82, Debian Stretch, OpenCV 4.0-pre
#Date 24th November, 2018

from PIL import Image
import pytesseract
import cv2
import numpy as np

def main():
    img = cv2.imread('id.jpg' )
    #img = cv2.resize(img, (640, 480)) 

    canny = cv2.Canny(img, 400, 600)
    gray = cv2.GaussianBlur(canny,(5,5), 0)
    inverted = cv2.bitwise_not(gray)

    test1 = Image.fromarray(gray)
    test2 = Image.fromarray(inverted)

    result = pytesseract.image_to_string(test1, lang="eng", config="-c tessedit_char_whitelist=0123456789X")
    print( result)
    print( "-------")
    result = pytesseract.image_to_string(test2, lang="eng")
    print( result)

    cv2.imwrite('inverted.jpg', inverted)
    cv2.imwrite('gray.jpg', gray)
    cv2.imshow('Gray', gray)
    cv2.imshow('Inverted', inverted)

    k = cv2.waitKey(0)

if __name__ == "__main__":
    main()

image description

Comments

@Ahmed. I raised the Canny min threshold from 60 to 400 so it loads faster, and set the max to 600. Anything higher than 600 takes longer, up to 30 sec.

supra56 (2018-11-24 06:28:07 -0600)

Btw, I changed lang from ara to eng on my Linux Raspberry Pi. Sorry, Ahmed.

supra56 (2018-11-24 06:30:09 -0600)

amaaaaaaaazing

Ahmed (2018-11-24 09:25:17 -0600)

But as you can see, some of the text is distorted; for example, the second line from the top has completely disappeared.

Ahmed (2018-11-24 10:01:18 -0600)

I get a better result with this code. Can you try it and tell me your opinion?

https://pastebin.com/wu3bbhAX

Ahmed (2018-11-24 10:05:22 -0600)

@Ahmed. I will try your code, but I have to comment out the last 2 lines, and I can't use lang='ara', although I do have the ara training data.

supra56 (2018-11-24 10:23:26 -0600)
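
When it is unclear whether lang='ara' is usable on a given machine, one option is to ask Tesseract which languages it can see before running OCR. The following is a minimal sketch, assuming a pytesseract version that provides get_languages; pick_language and installed_languages are hypothetical helper names, not part of any library.

```python
def pick_language(available, preferred=("ara", "eng")):
    """Return the first preferred language code that Tesseract reports as installed."""
    for lang in preferred:
        if lang in available:
            return lang
    raise ValueError("none of %r is installed (found: %r)" % (preferred, list(available)))


def installed_languages():
    """Ask Tesseract for its installed languages (assumes pytesseract.get_languages exists)."""
    import pytesseract  # imported here so pick_language works without Tesseract installed
    return pytesseract.get_languages(config="")


if __name__ == "__main__":
    langs = installed_languages()
    print("installed:", langs)
    print("using:", pick_language(langs))
```

If 'ara' is missing from the list even though ara.traineddata sits in the working directory, Tesseract is probably looking in a different tessdata directory.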

You can see the picture is better and clearer with my last attempt; you can check it out. Are there any issues with it that could come up with future images?

Ahmed (2018-11-24 11:07:06 -0600)

Here is the code I modified. You may try adding the txt file output.

from PIL import Image
import pytesseract
import cv2
import numpy as np

def main():
    img = cv2.imread('id.jpg' )
    edged = cv2.Canny(img, 400, 600)
    fil = cv2.bilateralFilter(edged, 9, 75, 75)
    blur = cv2.GaussianBlur(fil, (5, 5), 0)
    image =cv2.fastNlMeansDenoising(blur ,None, 4, 7, 21)
    cv2.imshow("gray", image)    
    resize = cv2.resize(image, (640, 480))
    cv2.imshow("resize", resize)     
    text = pytesseract.image_to_string(resize, lang='eng')
    print(text)
    cv2.waitKey(0)

if __name__ == "__main__":
    main()

supra56 (2018-11-24 12:17:41 -0600)

@Ahmed. Don't use cv2.adaptiveThreshold, because nothing gets printed out: the Arabic text and the numbers come out jumbled. You will have to use Canny, histogram equalization, thresholding, findContours, etc. to suit your needs. I am using lang='eng'. I do have ara.traineddata in my current folder.

supra56 (2018-11-24 12:23:07 -0600)

Still, the second line from the top is not appearing at all. I don't care about the first line, which has punctuation, but I do care about the second line after it. Do you have a solution for this? I also tried another test, and it doesn't work at all.

Ahmed (2018-11-24 13:02:23 -0600)

answered 2018-11-22 17:32:12 -0600

kvc

updated 2018-11-23 07:34:41 -0600

Hi,

Your pipeline works well on high-quality (scanned) images, but it is not enough for images captured with a smartphone, which is your case. You may consider 2 options:

1. Traditional image analysis
2. Machine learning

Option 1, traditional image analysis: this is the same as your pipeline, except:
- No Gaussian blur, which will remove contours
- No bilateralFilter, because it is slow
- Instead, add some quality correction such as contrast improvement and illumination correction
- Replace your adaptiveThreshold with Canny (see the Canny reference in OpenCV)

Option 2, machine learning: I would refer you to EAST, a scene text detector: https://github.com/argman/EAST.

Finally, for your case, I would choose option 1, because your ID seems to be cropped well.

Good luck.

Comments

Can you add sample code for processing the above image, along with its results?

Ahmed (2018-11-23 04:08:13 -0600)

@Ahmed: Don't send the whole processed image to Tesseract.

Quality improvement -> Canny -> findContours will help you locate your text lines (as rectangles or rotated rectangles). Then you can crop the text-line images from the original image and send them to Tesseract to get the text.

kvc (2018-11-25 21:56:20 -0600)
