What are suitable datasets for an offline English handwritten OCR application?

asked 2013-05-28 05:56:17 -0600


Hi, this is not directly related to OpenCV, but I feel people in this forum can help me. I am developing an offline English handwritten OCR application using OpenCV and LibSVM, and I need a dataset to train it. I searched Google and found a few, but some of them are not free and some contain only printed text, among other problems. The dataset must be free, English, and handwritten. Can anyone suggest an available dataset? Please help.

Thank you.
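Whichever dataset is chosen, character images have to be turned into feature vectors before LibSVM can train on them. The following is a minimal sketch, assuming raw pixel intensities are used as features (only one possible choice) and that training data is written in LIBSVM's sparse text format, `<label> <index>:<value> ...`. The function name `to_libsvm_line` and the tiny 2x3 image are hypothetical and only for illustration.

```python
def to_libsvm_line(label, pixels, max_value=255):
    """Format one sample as a line of LIBSVM's sparse text format.

    `pixels` is a flat sequence of grayscale values in [0, max_value].
    """
    features = []
    # LIBSVM feature indices are 1-based and must appear in ascending order.
    for index, value in enumerate(pixels, start=1):
        if value != 0:  # zero-valued features may be omitted in the sparse format
            features.append(f"{index}:{value / max_value:.4f}")
    return " ".join([str(label)] + features)


# Hypothetical 2x3 grayscale character image, flattened row by row.
image = [[0, 255, 0],
         [51, 0, 255]]
flat = [pixel for row in image for pixel in row]

line = to_libsvm_line(label=1, pixels=flat)
print(line)
```

In practice every image would first be resized to one fixed size so that all samples share the same feature indices.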


3 answers


answered 2013-05-28 08:07:17 -0600

Guanta

One of the most popular datasets is the IAM Dataset. It will probably suit you best: http://www.iam.unibe.ch/fki/databases/iam-handwriting-database.

Note: MNIST contains only handwritten digits, and I guess you don't want to train on digits only.


Comments

Oh, yes. I forgot that it is digits only.

berak (2013-05-28 08:23:23 -0600)

I checked it, but it contains words, not individual characters.

Heshan Sandeepa (2013-05-28 16:08:34 -0600)

Indeed, most handwriting-recognition algorithms work on a line or word basis (typically using HMMs). I just double-checked my references: most of them use this database, and only a few papers use characters at all. The only database with segmented English characters I found was the CEDAR database: http://www.cedar.buffalo.edu/Databases/ . Please let me know if you find something else.

Guanta (2013-05-29 03:23:13 -0600)

But CEDAR is not free :( . Anyway, thanks. I am still searching, and I will let you know if I find something.

Heshan Sandeepa (2013-05-29 03:43:33 -0600)

It seems that IAM also has letters: http://www.iam.unibe.ch/fki/databases/iam-graph-database/download-the-iam-graph-database

Guanta (2013-07-04 07:04:55 -0600)

Hi Guanta, yes, I checked the IAM databases as well, but I was not able to construct images from those .GXL files. They also have sentence and single-word images, so I downloaded those, segmented them, and thereby prepared some single-character images. For the moment I can make do with them, but I need more :) . Thank you very much for your consideration and your effort on my problem.

Heshan Sandeepa (2013-07-11 06:55:21 -0600)
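The thread does not say how the word images were segmented into characters. One simple possibility is a vertical projection profile: count the ink pixels in each column and cut wherever a column is empty. The sketch below assumes an already binarized image (1 = ink, 0 = background) and characters that do not touch, so it would not work on cursive writing as-is. The helper names `segment_columns` and `crop` and the toy `word` image are hypothetical.

```python
def segment_columns(binary, min_width=1):
    """Return (start, end) column ranges that contain ink; `end` is exclusive.

    `binary` is a list of rows, where 1 marks an ink pixel and 0 background.
    """
    if not binary:
        return []
    width = len(binary[0])
    # Vertical projection: number of ink pixels in each column.
    profile = [sum(row[x] for row in binary) for x in range(width)]

    segments, start = [], None
    for x, count in enumerate(profile):
        if count > 0 and start is None:
            start = x  # entering a run of inked columns
        elif count == 0 and start is not None:
            if x - start >= min_width:  # drop runs narrower than min_width
                segments.append((start, x))
            start = None
    # Close a run that reaches the right edge of the image.
    if start is not None and width - start >= min_width:
        segments.append((start, width))
    return segments


def crop(binary, segment):
    """Cut the columns of one segment out of every row."""
    start, end = segment
    return [row[start:end] for row in binary]


# Toy "word" with two separated glyphs.
word = [
    [0, 1, 1, 0, 0, 1, 0, 0],
    [0, 1, 0, 0, 0, 1, 1, 0],
    [0, 1, 1, 0, 0, 1, 0, 0],
]
segments = segment_columns(word)
chars = [crop(word, s) for s in segments]
print(segments)
```

With real scans, the binary input could come from thresholding a grayscale image loaded with OpenCV; raising `min_width` helps discard specks of noise.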

@David Jgones: I think that with the additional graph datasets given in the link you'll get a huge amount of data.

Guanta (2013-07-11 08:10:34 -0600)

@Guanta, if we train a classifier for words instead of characters, the number of clusters is much larger. Let's say there are 5000 frequently used words; then there will be 5000 clusters, whereas there are only 72 clusters for characters + digits. I don't have much practice in the field: is 5000 clusters normal? Does that have a huge impact on performance?

marc_c (2017-01-02 18:12:46 -0600)

I think you misunderstand a concept here. It's about images containing words, not actual text.

Guanta (2017-03-20 16:40:35 -0600)


answered 2013-07-01 10:20:02 -0600

ankitrawat8

The ETL6 database...contains isolated English and Japanese characters, with around 1300 images of each English character.

Have you found any other database? Please let me know.


Comments

Hi, were you able to download the ETL6 database? If so, let me know how. In the meantime I got the IAM database, but it has sentence and single-word images, so I segmented them and prepared some character images. I still need more images. Thank you very much for your consideration and your effort on my question.

Heshan Sandeepa (2013-07-11 07:06:44 -0600)

Yes, I was able to download the ETL6 DB from this link (you might need to register first): https://projects.itri.aist.go.jp/etlcdb/wordpress/?p=139&lang=en . You can read the ETL6 images with the help of this page: http://projects.itri.aist.go.jp/etlcdb/util/et2tif.htm.

ankitrawat8 (2013-09-24 02:50:58 -0600)


answered 2018-02-15 07:57:11 -0600

Sachin21

Do check the image dataset on Kaggle; it contains 370,000+ handwritten A-Z images.


Comments

Thanks, I will check that.

Heshan Sandeepa (2018-06-07 22:36:42 -0600)
