Basically, what you are doing wrong is using the opencv_createsamples utility to introduce artificial transformations. This works in clean, lab-like environments, but it produces features that are unrealistic for the object in real-life situations. Start by removing that step and gather positive samples the hard way: collect thousands of original training images, capturing as much natural variation as possible in the samples.
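As a sketch of what "the hard way" looks like on the tooling side, opencv_traincascade ultimately consumes an annotation file (passed to opencv_createsamples via -info) with one line per image: the path, the number of objects, and one bounding box each. The image paths and boxes below are made-up placeholders; a small helper like this can generate the file from your own annotations:

```python
# Write a positives description file in the format expected by
# opencv_createsamples -info: "<path> <count> <x y w h> ...".
# The image paths and bounding boxes here are hypothetical examples.

def write_info_file(annotations, out_path):
    """annotations: dict mapping image path -> list of (x, y, w, h) boxes."""
    with open(out_path, "w") as f:
        for img_path, boxes in sorted(annotations.items()):
            coords = " ".join(f"{x} {y} {w} {h}" for x, y, w, h in boxes)
            f.write(f"{img_path} {len(boxes)} {coords}\n")

annotations = {
    "pos/img001.jpg": [(140, 100, 45, 45)],
    "pos/img002.jpg": [(10, 20, 50, 50), (200, 30, 48, 48)],
}
write_info_file(annotations, "positives.info")
```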
You will also need to introduce far more negatives. You want to detect an object instance in arbitrary situations, where the variation in possible backgrounds is huge. Think of it this way: you should try to cover every possible background with your negatives. This is why many object models for in-the-wild detection come with huge negative sets, on the order of several 100,000s of samples. The advantage is that you can provide a wide range of negative images that are larger than your model window, and the algorithm will take random sub-windows from those images. That means 5000 high-resolution images could easily give you 150,000 windows to train on as negatives.
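To get a feel for those numbers, here is a quick back-of-the-envelope count of how many model-sized windows a single large negative image can contribute at one scale with a non-overlapping stride. The 24x24 model size and Full HD resolution are illustrative assumptions, not values from the answer above; traincascade also samples across scales, so the real pool is even larger:

```python
# Count how many non-overlapping model-sized windows fit in one
# negative image at a single scale. The 24x24 model size and the
# 1920x1080 resolution are assumptions chosen for illustration.

def count_windows(img_w, img_h, win_w, win_h, stride):
    cols = (img_w - win_w) // stride + 1
    rows = (img_h - win_h) // stride + 1
    return cols * rows

per_image = count_windows(1920, 1080, 24, 24, 24)
print(per_image)         # windows available in one Full HD image
print(5000 * per_image)  # pool available across 5000 such images
```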
Aside from that, parameter tweaking is always one of the more time-consuming steps. Each application has its own set of specific settings that gives the best result. This is a long period of trial and error, I am afraid, though it can be partially automated.
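Part of that trial and error can be scripted: generate one opencv_traincascade command line per parameter combination and run them in sequence, comparing the resulting models. The flag names (-minHitRate, -maxFalseAlarmRate, -numStages) are real traincascade parameters, but the value grids and file paths below are arbitrary examples, not recommendations:

```python
# Build (but do not execute) one opencv_traincascade command line per
# parameter combination. The value grids and the paths "model/",
# "pos.vec" and "negatives.txt" are placeholder examples.
import itertools

grid = {
    "-minHitRate": [0.995, 0.999],
    "-maxFalseAlarmRate": [0.3, 0.5],
    "-numStages": [15, 20],
}

def build_commands(grid):
    keys = list(grid)
    commands = []
    for combo in itertools.product(*(grid[k] for k in keys)):
        cmd = ["opencv_traincascade", "-data", "model/",
               "-vec", "pos.vec", "-bg", "negatives.txt"]
        for key, value in zip(keys, combo):
            cmd += [key, str(value)]
        commands.append(cmd)
    return commands

cmds = build_commands(grid)
print(len(cmds))  # 2 * 2 * 2 = 8 combinations to try
```

Each entry in `cmds` could then be handed to `subprocess.run` with a different `-data` directory per combination, so the trained models can be compared afterwards.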
Also, use LBP features until you reach a somewhat decent model. Training is about ten times faster and detection is faster as well, mainly because both steps use integer operations only. It will get models with a lot of samples trained in days rather than weeks compared to HAAR features.
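For reference, switching to LBP is a single flag on opencv_traincascade; the paths, sample counts and window size in this invocation are placeholder values, not recommendations:

```shell
# -featureType LBP switches from the default HAAR features to LBP.
# Paths, sample counts and the 24x24 window size are placeholders.
opencv_traincascade -data model/ \
    -vec pos.vec -bg negatives.txt \
    -numPos 4000 -numNeg 15000 \
    -w 24 -h 24 \
    -featureType LBP
```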
About the bonus questions: