Generating good training data for haar cascades

asked 2015-03-13 02:34:33 -0600

dmilne

I am trying to build haar cascades for doing OCR of a specific font; one classifier per character.

I can generate tons of training data just by drawing the font onto images. So the plan is to generate positive training data for each character, and use examples of the other characters as negative training data (please let me know if this is a dumb idea).
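A minimal sketch of the "generate positives by drawing the character" idea. The 5×5 bitmap below is a hypothetical stand-in for one rendered character; a real pipeline would rasterise the actual font (e.g. with PIL's `ImageFont`). The point is to stamp the glyph with a little positional jitter so the classifier doesn't memorise one exact pixel alignment:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy 5x5 bitmap standing in for one rendered character; in a real
# pipeline you would rasterise the actual font instead of hard-coding
# a glyph like this.
GLYPH = np.array([
    [0, 1, 1, 1, 0],
    [1, 0, 0, 0, 1],
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 1],
    [1, 0, 0, 0, 1],
], dtype=np.uint8) * 255

def make_positive(size=12):
    """Stamp the glyph at a slightly jittered position on a blank patch,
    so the cascade does not learn one exact pixel alignment."""
    patch = np.zeros((size, size), dtype=np.uint8)
    max_off = size - GLYPH.shape[0]
    y, x = rng.integers(0, max_off + 1, size=2)
    patch[y:y + 5, x:x + 5] = GLYPH
    return patch

positives = [make_positive() for _ in range(100)]
```

The patches for every *other* character, generated the same way, would then serve as the negative set.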

I am wondering how much variation I should put into the training data. Normally I'd just try everything, but I gather these things take days to train (for each character!), so some advice would be good.

So, a few questions:

  • Does the training algorithm recognise that I don't care about transparent pixels? Or will it perform better if I superimpose the characters over different backgrounds?
  • Should I include images where each character is shown with different prefixes and suffixes, or should I just treat each character individually?
  • Should I include images where the character is scaled up and down? I gather the algorithm pretty much ignores size, and scales everything down for efficiency anyway?
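On the first bullet: as far as I know, `opencv_traincascade` has no notion of transparent or "don't care" pixels, so any alpha channel has to be flattened onto a background before training. A minimal sketch of compositing a glyph over a random background (the all-white 8×8 `glyph` and all-opaque `alpha` are hypothetical stand-ins for a rendered character and its mask):

```python
import numpy as np

rng = np.random.default_rng(1)

def composite(glyph, alpha, size=24):
    """Flatten a glyph and its alpha mask onto a random-noise
    background: the trainer has no concept of transparency, so
    'don't care' pixels must be baked in beforehand."""
    bg = rng.integers(0, 256, size=(size, size), dtype=np.uint8)
    h, w = glyph.shape
    y = (size - h) // 2
    x = (size - w) // 2
    region = bg[y:y + h, x:x + w]
    bg[y:y + h, x:x + w] = np.where(alpha > 0, glyph, region)
    return bg

glyph = np.full((8, 8), 255, dtype=np.uint8)   # stand-in rendered character
alpha = np.ones((8, 8), dtype=np.uint8)        # fully opaque mask
sample = composite(glyph, alpha)
```

On the third bullet: the trainer rescales every sample to the fixed detection window (the `-w`/`-h` parameters), so multiple scales of the same image add little; varying backgrounds is likely the more useful kind of variation.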

Thanks!

P.S. This question is also on StackOverflow. Apologies for cross-posting.


Comments

Did you ever find an answer to your problem? I'm interested in this problem field as well.

I think that using other characters as negatives is exactly what you want to do. The negative set should also include windows that cover the spaces between characters and between lines, sampled randomly over a page. Your OCR must decide what is and is not the character in question, so you need to weaken the weights of features found on other letters. Consider "l" and "t", or "v" and "w": they share many features, so you will need to show your cascade the difference through your negatives.

Warren ( 2015-05-11 08:44:53 -0600 )
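The "windows between characters and lines, randomly spaced over a page" suggestion can be sketched as random crops from a rendered page. The striped `page` below is a hypothetical stand-in for a real rasterised page of the font:

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic page: white background with dark horizontal "text rows",
# standing in for an actual rendered page of the target font.
page = np.full((200, 300), 255, dtype=np.uint8)
for row in range(20, 200, 40):
    page[row:row + 12, 10:290] = 30

def random_negatives(img, n, size=24):
    """Crop n windows at uniformly random positions, so the negative
    set also covers inter-character and inter-line whitespace."""
    h, w = img.shape
    crops = []
    for _ in range(n):
        y = rng.integers(0, h - size + 1)
        x = rng.integers(0, w - size + 1)
        crops.append(img[y:y + size, x:x + size])
    return crops

negatives = random_negatives(page, 50)
```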