What are some good resources for an Arabic OCR-in-the-wild dataset?

asked 2016-05-17 02:11:03 -0600

adamdylan

Hello there, I've recently started working on an OCR-in-the-wild algorithm using neural networks. My requirements are as follows: Arabic text, natural images (not scans etc.).

My goal is to detect whether the image contains text or not, and then to extract the text.

I need some help from you: I need a large dataset. If one exists, that would be great; otherwise, I would appreciate some help thinking of reasonable methods to create such a dataset on my own.

Thank you very much, A Dylan


Comments

On Ubuntu, if you type sudo apt-get install tesseract-ocr and then press Tab, you can see the range of language models available for the Tesseract OCR system.

StevenPuttemans ( 2016-05-17 09:15:02 -0600 )
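Once a language package is installed, it can help to confirm that Tesseract itself sees it. Below is a minimal sketch, assuming the tesseract binary is on the PATH and that its --list-langs output is a header line starting with "List of" followed by one language code per line. The helper names parse_langs and installed_langs are made up for this example and are not part of Tesseract.

```python
import subprocess


def parse_langs(output):
    """Extract language codes from the text printed by `tesseract --list-langs`.

    Assumption: the output is a header line starting with "List of",
    followed by one language code per line.
    """
    lines = [line.strip() for line in output.splitlines() if line.strip()]
    return [line for line in lines if not line.lower().startswith("list of")]


def installed_langs():
    """Run tesseract and return the language codes it reports."""
    proc = subprocess.run(
        ["tesseract", "--list-langs"],
        capture_output=True,
        text=True,
    )
    # Depending on the version, the list may be written to stdout or stderr,
    # so both streams are parsed.
    return parse_langs(proc.stdout + "\n" + proc.stderr)


if __name__ == "__main__":
    langs = installed_langs()
    print("Arabic model visible to tesseract:", "ara" in langs)
```

If "ara" does not show up in the list, the language package is probably not in the tessdata directory that this Tesseract build reads.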

Hey, I tried it and installed the Arabic language package, but running the command tesseract photo.jpeg out -l ara gives this error:

Tesseract Open Source OCR Engine v3.04.00 with Leptonica
Cube ERROR (CubeRecoContext::Load): unable to read cube language model params from /opt/local/share/tessdata/ara.cube.lm
Cube ERROR (CubeRecoContext::Create): unable to init CubeRecoContext object
init_cube_objects(false, &tessdata_manager):Error:Assert failed:in file tessedit.cpp, line 205
adamdylan ( 2016-05-18 03:09:22 -0600 )
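The first Cube error above says a specific file could not be read, so one quick check is whether that file actually exists in the tessdata directory. Here is a minimal sketch, assuming the directory and the ara.cube.lm file name from the error output. The helper missing_tessdata_files is hypothetical, and checking for ara.traineddata alongside it is my own assumption about what the language package ships.

```python
from pathlib import Path


def missing_tessdata_files(tessdata_dir, filenames):
    """Return the names in `filenames` that are not regular files in `tessdata_dir`."""
    base = Path(tessdata_dir)
    return [name for name in filenames if not (base / name).is_file()]


if __name__ == "__main__":
    # Directory and ara.cube.lm are copied from the error message;
    # ara.traineddata is an assumed companion file.
    tessdata = "/opt/local/share/tessdata"
    wanted = ["ara.traineddata", "ara.cube.lm"]

    missing = missing_tessdata_files(tessdata, wanted)
    if missing:
        for name in missing:
            print("missing:", Path(tessdata) / name)
    else:
        print("all listed files are present in", tessdata)
```

If ara.cube.lm is reported as missing, the installed language package may be incomplete for this Tesseract version. If it is present, the file could be unreadable or corrupted, which is worth mentioning in a bug report.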

I guess you will need to raise this as an issue on the Tesseract GitHub repository to get better support!

StevenPuttemans ( 2016-05-18 03:21:29 -0600 )