2

What are suitable datasets for an offline English handwritten OCR application?

asked 2013-05-28 05:56:17 -0600

Heshan Sandeepa

Hi, this is not directly related to OpenCV, but I feel people in this forum can help me. I am developing an offline English handwritten OCR application using OpenCV and LibSVM, and I need a dataset to train it. I searched Google and found a few, but some of them are not free and some contain only printed text, among other problems. The dataset must be free, English, and handwritten. Can anyone suggest an available dataset? Please help.

Thank you.

3 answers

6

answered 2013-05-28 08:07:17 -0600

Guanta

One of the most popular datasets is the IAM Dataset. It will probably suit you best: http://www.iam.unibe.ch/fki/databases/iam-handwriting-database.

Note: MNIST contains only handwritten digits, and I guess you don't want to train on digits alone.

Comments

2

Oh, yes. I forgot that it is digits only.

berak ( 2013-05-28 08:23:23 -0600 )

I checked it, but it contains words, not individual characters.

Heshan Sandeepa ( 2013-05-28 16:08:34 -0600 )
2

Indeed, most handwriting-recognition algorithms work on a line or word basis (typically using HMMs). I just double-checked my references and most of them use this database; only a few papers used characters at all. The only one with segmented English characters I found was the CEDAR database: http://www.cedar.buffalo.edu/Databases/ . Please let me know if you find something else.

Guanta ( 2013-05-29 03:23:13 -0600 )
1

But CEDAR is not free :( . Anyway, thanks. I am still searching, and I will let you know if I find something.

Heshan Sandeepa ( 2013-05-29 03:43:33 -0600 )
1

Hi Guanta, yes, I checked the IAM databases too, but I was not able to construct images from the .GXL files. They also have sentence and single-word images, so I downloaded those, segmented them, and thereby prepared some single-character images. For the moment I can make do with them, but I need more :) . Thank you very much for your consideration and for your effort on my problem.

Heshan Sandeepa ( 2013-07-11 06:55:21 -0600 )
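
A minimal sketch of the kind of segmentation described in the comment above, assuming the word image has already been binarized (1 = ink, 0 = background) and that neighbouring characters are separated by at least one empty column. It splits on gaps in the vertical projection profile. The names `split_columns` and `crop` are illustrative and do not come from the thread, and the poster did not say which method was actually used. Connected or cursive handwriting will not split cleanly this way.

```python
from typing import List, Tuple

Image = List[List[int]]  # rows of 0/1 pixels; 1 means ink (assumed convention)


def split_columns(binary: Image, min_width: int = 1) -> List[Tuple[int, int]]:
    """Return (start, end) column ranges, end exclusive, that contain ink."""
    if not binary:
        return []
    width = len(binary[0])
    # Vertical projection profile: number of ink pixels in each column.
    profile = [sum(row[x] for row in binary) for x in range(width)]

    spans = []
    start = None
    for x, count in enumerate(profile):
        if count > 0 and start is None:
            start = x                      # a run of ink columns begins
        elif count == 0 and start is not None:
            if x - start >= min_width:     # drop runs narrower than min_width
                spans.append((start, x))
            start = None
    if start is not None and width - start >= min_width:
        spans.append((start, width))       # run reaches the right edge
    return spans


def crop(binary: Image, span: Tuple[int, int]) -> Image:
    """Cut the columns of one span out of the image."""
    start, end = span
    return [row[start:end] for row in binary]


# Toy 3x9 "word" with three ink runs separated by empty columns.
word = [
    [1, 1, 0, 0, 1, 0, 0, 1, 1],
    [1, 0, 0, 0, 1, 0, 0, 1, 0],
    [1, 1, 0, 0, 1, 0, 0, 1, 1],
]
spans = split_columns(word)
glyphs = [crop(word, s) for s in spans]
```

Each entry of `glyphs` could then be resized to a fixed size and flattened into a feature vector for the classifier.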

@David Jgones: I think that with the additional graph datasets given in the link you'll get a huge amount of data.

Guanta ( 2013-07-11 08:10:34 -0600 )

@Guanta, if we train a classifier for words instead of characters, the number of clusters becomes much larger. Say there are 5000 frequently used words: that means 5000 clusters, whereas there are only 72 clusters for characters + digits. I don't have much practice in this field, so: is 5000 clusters normal? Does that have a huge impact on performance?

marc_c ( 2017-01-02 18:12:46 -0600 )

I think you misunderstand a concept here. It's about images containing words, so not actual text.

Guanta ( 2017-03-20 16:40:35 -0600 )
1

answered 2013-07-01 10:20:02 -0600

ankitrawat8

The ETL6 database... contains isolated English and Japanese characters, with around 1300 images of each of the English characters.

Have you found any other database? Please let me know.

Comments

Hi, were you able to download this ETL6 database? If so, please let me know how. Anyway, I got the IAM database, but it has sentence and single-word images, so I segmented them and prepared some character images. I need more images, though. Thank you very much for your consideration and your effort on my question.

Heshan Sandeepa ( 2013-07-11 07:06:44 -0600 )

Yes, I was able to download the ETL6 DB from this link (you might need to register first): https://projects.itri.aist.go.jp/etlcdb/wordpress/?p=139&lang=en . You can read the ETL6 images with the help of this page: http://projects.itri.aist.go.jp/etlcdb/util/et2tif.htm.

ankitrawat8 ( 2013-09-24 02:50:58 -0600 )
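
For anyone who would rather read the ETL6 records directly than convert them, here is a rough sketch. It assumes the record layout as commonly described for ETL6: fixed 2052-byte records, each holding a 64x63 image at 4 bits per pixel (2016 bytes) starting at byte offset 32, with the high nibble as the left pixel. All four constants and the nibble order are assumptions to verify against the official format description before use. The helper names are hypothetical.

```python
from typing import Iterator, List

# Assumed ETL6 record layout; check against the official format description.
RECORD_SIZE = 2052               # bytes per record
IMAGE_OFFSET = 32                # byte offset of the image inside a record
WIDTH, HEIGHT = 64, 63           # image size in pixels
IMAGE_BYTES = WIDTH * HEIGHT // 2  # two 4-bit pixels per byte -> 2016 bytes


def unpack_4bit(packed: bytes, width: int, height: int) -> List[List[int]]:
    """Expand 4-bit packed pixels into rows of grey levels in 0..15."""
    if len(packed) * 2 != width * height:
        raise ValueError("packed data does not match the image size")
    pixels = []
    for byte in packed:
        pixels.append(byte >> 4)     # high nibble first (assumed order)
        pixels.append(byte & 0x0F)   # then low nibble
    return [pixels[r * width:(r + 1) * width] for r in range(height)]


def iter_images(data: bytes) -> Iterator[List[List[int]]]:
    """Yield one 63-row image per complete record found in `data`."""
    for offset in range(0, len(data) - RECORD_SIZE + 1, RECORD_SIZE):
        record = data[offset:offset + RECORD_SIZE]
        image = record[IMAGE_OFFSET:IMAGE_OFFSET + IMAGE_BYTES]
        yield unpack_4bit(image, WIDTH, HEIGHT)


# Synthetic two-record buffer, used only to exercise the code.
fake_record = bytes(IMAGE_OFFSET) + bytes([0x1F]) * IMAGE_BYTES + bytes(4)
images = list(iter_images(fake_record * 2))
```

With a real file, `data` would come from `open(path, "rb").read()`. The header bytes before the image (character code and so on) are skipped here because their exact layout is not covered in this thread.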
2

answered 2018-02-15 07:57:11 -0600

Sachin21

Do check the image dataset on Kaggle; it contains 370,000+ handwritten A-Z images.
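
Since the question mentions LibSVM, here is a small sketch of turning such data into LibSVM's sparse text format (`label index:value ...`, with 1-based indices in ascending order). It assumes the Kaggle data is a CSV in which each row is a class label 0..25 followed by 8-bit pixel values; that layout is an assumption about the dataset, not something stated in this thread, and any header row would need to be skipped first. The function names are illustrative.

```python
import csv
import io
from typing import List


def row_to_libsvm(row: List[str], scale: float = 255.0) -> str:
    """Convert one CSV row (label, then pixels) to a LibSVM text line.

    Zero pixels are omitted, which the sparse format allows.
    """
    label = int(float(row[0]))           # tolerate "3" as well as "3.0"
    features = []
    for index, value in enumerate(row[1:], start=1):  # LibSVM indices start at 1
        pixel = int(float(value))
        if pixel:
            features.append(f"{index}:{pixel / scale:.4f}")
    return " ".join([str(label)] + features)


def label_to_letter(label: int) -> str:
    """Map 0..25 to 'A'..'Z' (assumed label encoding)."""
    return chr(ord("A") + label)


# Tiny in-memory stand-in for the real CSV file, with 4 "pixels" per row.
sample = io.StringIO("0,0,255,0,51\n25,0,0,0,0\n")
lines = [row_to_libsvm(row) for row in csv.reader(sample)]
```

Scaling pixels to the 0..1 range follows the usual advice to scale features before training an SVM. The resulting lines can be written to a file and passed to LibSVM's training tool.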

Comments

Thanks dude, I will check that.

Heshan Sandeepa ( 2018-06-07 22:36:42 -0600 )

Stats

Asked: 2013-05-28 05:56:17 -0600

Seen: 5,375 times

Last updated: Jul 01 '13