
LSVM translation invariant

asked 2017-06-04 03:28:03 -0600 by smaher

updated 2017-06-05 08:30:51 -0600

Hi,

I am using LatentSVM in OpenCV 2.4 with the models from opencv_extra, as suggested here.

This is a VOC sample test image, and these are the results for inputs that differ by a 1-pixel shift:


[image: output_1] [image: output_2]


I noticed that the scores and bounding boxes differ with each 1-pixel shift. What causes this variation, given that the object is fully present in all of these cases? Here is the code:
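The listing below is only a minimal sketch of that shift test, written against the standard OpenCV 2.4 cv::LatentSvmDetector API; the model file, image path, shift range, and overlap threshold are placeholders rather than my exact values.

    #include <opencv2/core/core.hpp>
    #include <opencv2/highgui/highgui.hpp>
    #include <opencv2/objdetect/objdetect.hpp>
    #include <cstdio>
    #include <string>
    #include <vector>

    int main()
    {
        // Placeholder paths: a VOC person model from opencv_extra and the sample image.
        std::vector<std::string> models;
        models.push_back("person.xml");
        cv::LatentSvmDetector detector(models);
        if (detector.empty())
            return -1;

        cv::Mat image = cv::imread("voc_sample.jpg");

        // Run the detector on crops whose left edge moves by 1 pixel each time,
        // so the object stays fully visible but its position shifts slightly.
        for (int shift = 0; shift < 4; ++shift)
        {
            cv::Rect roi(shift, 0, image.cols - 4, image.rows);
            cv::Mat shifted = image(roi).clone();

            std::vector<cv::LatentSvmDetector::ObjectDetection> detections;
            detector.detect(shifted, detections, 0.5f, -1);

            for (size_t i = 0; i < detections.size(); ++i)
            {
                const cv::Rect& r = detections[i].rect;
                std::printf("shift=%d  score=%.4f  box=(%d,%d,%d,%d)\n", shift,
                            detections[i].score, r.x, r.y, r.width, r.height);
            }
        }
        return 0;
    }

Even though the object is fully inside every crop, the printed scores and boxes differ slightly from shift to shift, which is the variation I am asking about.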

  • Why do different translations / ROI image sizes produce different scores and predictions? And is there any way to guarantee which kinds of displacements are best for good detection (e.g. the object in the middle versus at the border of the image, etc.)?
  • Are there any down-resolution factors other than the HOG block calculation, such as a stride in the convolution and score-calculation step? Is it densely overlapped with a 1-pixel stride?
  • To speed up detection, I changed LAMBDA from 10 to 5. Other than missed detections caused by the different pyramid scale sizes, are there any miscalculations or dependent parameters in the LSVM code to be aware of, or is it safe to change LAMBDA alone to limit the pyramid size?

Edit 1: Here are some important points in more detail:

  1. The sliding window I applied is only to illustrate how the output varies for different ROIs/inputs. LatentSVM in OpenCV already applies a sliding window over the input image, so I am not trying to apply a sliding window here; I am just evaluating the stability of the output with respect to translation.
  2. My main interest is detecting people in general, not pedestrians, since a general person detector has to cope with far more structure and deformation than a pedestrian detector (for example, DPM achieves 88% on INRIA pedestrians but 50% on the VOC person class).
  3. Finally, I am interested in a non-deep-learning method. As far as I know, LSVM/DPM is the state-of-the-art classical method (is there any other high-accuracy classical method?).

1 answer


answered 2017-06-05 07:49:08 -0600 by Pedro Batista

It's really hard to understand your parameters. For example, your sliding window taking up the whole image height every time is very unusual, so I'll assume you know what you are doing and address your points.

Why do different translations / ROI image sizes produce different scores and predictions?

Your detection window (ROI) must be consistent with the way you trained your SVM. If you trained your algorithm with an ROI of size x and then run it with an ROI of size y, the results may become inconsistent and hard to interpret. Changing the translation rules of this window will also change the results, because different translations make the detector evaluate different ROIs, so the results are necessarily different.
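To make that concrete, here is a toy illustration (my own sketch, with made-up coordinates), assuming the 8-pixel HOG cells used by the DPM/LSVM feature pyramid: a shift smaller than one cell changes how the object's gradients are binned into the cell grid, while a whole-cell shift only relabels the cells.

    #include <cstdio>

    int main()
    {
        const int cellSize = 8;     // assumed HOG cell size in pixels
        const int objectLeft = 37;  // hypothetical x-coordinate of the object's left edge

        for (int shift = 0; shift <= 8; ++shift)
        {
            int x = objectLeft + shift;
            std::printf("shift=%d -> cell index %d, offset inside cell %d px\n",
                        shift, x / cellSize, x % cellSize);
        }
        // Shifts of 1-7 px change the offset inside the cell grid, so the binned
        // features (and therefore the score) change; an 8 px shift restores the
        // original alignment, just one cell over.
        return 0;
    }

That is why 1-pixel shifts produce slightly different scores even when the object is fully visible, while shifts of a whole cell should give nearly identical results.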

And is there any way to guarantee which kinds of displacements are best for good detection?

Only trial and error can provide an answer to this. But the work has been done, so there is no need to reinvent the wheel: most sliding-window people detectors use a (64, 128) window and a 4-pixel stride in both directions.
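For reference, here is a minimal sketch of such a detector using OpenCV's built-in HOG people detector, which already uses a 64x128 window; winStride is set to (4, 4) to match the 4-pixel stride mentioned above, and the image path and thresholds are placeholders.

    #include <opencv2/core/core.hpp>
    #include <opencv2/highgui/highgui.hpp>
    #include <opencv2/objdetect/objdetect.hpp>
    #include <vector>

    int main()
    {
        cv::HOGDescriptor hog;  // default parameters: 64x128 window, 8x8 cells
        hog.setSVMDetector(cv::HOGDescriptor::getDefaultPeopleDetector());

        cv::Mat image = cv::imread("street.jpg");  // placeholder input image

        std::vector<cv::Rect> found;
        hog.detectMultiScale(image, found,
                             0,               // hit threshold
                             cv::Size(4, 4),  // window stride: 4 px in x and y
                             cv::Size(8, 8),  // padding
                             1.05,            // pyramid scale factor
                             2);              // grouping threshold

        for (size_t i = 0; i < found.size(); ++i)
            cv::rectangle(image, found[i], cv::Scalar(0, 255, 0), 2);

        cv::imwrite("detections.jpg", image);
        return 0;
    }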

Are there any down-resolution factors other than the HOG block calculation, such as a stride in the convolution and score-calculation step? Is it densely overlapped with a 1-pixel stride?

I suggest you read some of the scientific literature. It has been a while since HOG was the best way to find people in images. Even though people detection in urban scenarios remains one of the unsolved problems in computer vision, there have been significant advances, and there are algorithms that are reasonably accurate and fast. I suggest you review the state of the art; you can start with the paper How Far are We from Solving Pedestrian Detection? Keep in mind that there still isn't a reliable detector that works on images alone; the most reliable detectors combine data from different sensors (thermal, LIDAR, RADAR) to achieve good results.

To speed up detection, I changed LAMBDA from 10 to 5. Other than missed detections caused by the different pyramid scale sizes, are there any miscalculations or dependent parameters in the LSVM code to be aware of, or is it safe to change LAMBDA alone to limit the pyramid size?

Not really. The fewer downscales you evaluate, the fewer pedestrians you detect. You just need to strike a balance between the speed and the accuracy your application needs.
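To see what halving LAMBDA does, here is a rough back-of-the-envelope sketch. It assumes, as in the OpenCV latentsvm code, that LAMBDA is the number of pyramid levels per octave, so consecutive levels differ by a factor of 2^(1/LAMBDA); the image and model sizes below are made up.

    #include <cmath>
    #include <cstdio>

    int main()
    {
        const int lambdas[2] = { 10, 5 };  // original value vs. the modified one
        const int minSide = 480;           // hypothetical shorter image side, in pixels
        const int modelCells = 8;          // hypothetical model height, in HOG cells
        const int cellSize = 8;            // HOG cell size used by the feature pyramid

        for (int k = 0; k < 2; ++k)
        {
            const int lambda = lambdas[k];
            const double step = std::pow(2.0, 1.0 / lambda);

            // Count levels until the downscaled image is smaller than the model.
            int levels = 0;
            while (minSide / std::pow(step, levels) >= modelCells * cellSize)
                ++levels;

            std::printf("LAMBDA=%d: scale step %.3f, about %d pyramid levels\n",
                        lambda, step, levels);
        }
        // Halving LAMBDA roughly halves the number of levels (faster), but the
        // scales that used to sit between the remaining levels are no longer
        // evaluated, so objects whose best-matching scale falls in those gaps
        // may be missed or scored lower.
        return 0;
    }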


Comments

hey, 1. The sliding window I applied is only to illustrate how the output varies for different ROIs/inputs; LatentSVM in OpenCV already applies a sliding window over the input image, so I am not trying to apply a sliding window here. 2. My main interest is detecting people in general, not pedestrians, since a general person detector has to cope with far more structure and deformation than a pedestrian detector (for example, DPM achieves 88% on INRIA pedestrians but 50% on the VOC person class). 3. Finally, I am interested in a non-deep-learning method; as far as I know, LSVM/DPM is the state-of-the-art classical method. Do you know of any high-accuracy classical method? Thanks for taking the time.

smaher (2017-06-05 08:24:12 -0600)

If you are using the OpenCV pre-trained LatentSVM detector, then you can be assured that you are far from using the best classical method in existence. That method uses a latent SVM to classify the old HOG features.

The best methods do not have downloadable, ready-to-use implementations; most of the time we have to get down to business with the scientific papers, really understand what is being done, and recreate the researchers' results.

Go to the link I provided in my answer, go to page 24 of the PDF, and look at the graph that ranks people-detection algorithms. Then read about the best ones one by one and you will find non-deep-learning methods that outperform the one you are using.

Pedro Batista (2017-06-05 09:01:06 -0600)
