Detection of multiple instances of an object with keypoint descriptors (BRISK, ORB, ...)
tl;dr : How to detect an object when multiple instances of the same object are present in the scene?
I am using a keypoint descriptor feature matcher (BRISK right now, others could follow but I want to stay in the binary descriptor world). I am using OpenCV3 in C++.
Now, I knew beforehand that while a keypoint descriptor matcher is good (great!) at detecting ONE instance of an object in an image, it is not able by itself to discriminate between 2, 3, 4, etc. instances of the same object (since it has no global knowledge of the object, it won't "understand" that a specific keypoint belongs to instance number 1 vs instance number 2).
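For reference, this is roughly the single-instance baseline I have in mind: BRISK keypoints/descriptors, brute-force Hamming matching with a ratio test, and a RANSAC homography. It is only a minimal sketch; the image file names and the 0.8 ratio threshold are placeholders, not anything prescribed.

```cpp
#include <opencv2/core.hpp>
#include <opencv2/features2d.hpp>
#include <opencv2/calib3d.hpp>
#include <opencv2/imgcodecs.hpp>
#include <vector>

int main()
{
    // Placeholder file names -- replace with your own model/scene images.
    cv::Mat model = cv::imread("model.png", cv::IMREAD_GRAYSCALE);
    cv::Mat scene = cv::imread("scene.png", cv::IMREAD_GRAYSCALE);

    // BRISK keypoints + binary descriptors for both images.
    cv::Ptr<cv::BRISK> brisk = cv::BRISK::create();
    std::vector<cv::KeyPoint> kpModel, kpScene;
    cv::Mat descModel, descScene;
    brisk->detectAndCompute(model, cv::noArray(), kpModel, descModel);
    brisk->detectAndCompute(scene, cv::noArray(), kpScene, descScene);

    // Brute-force Hamming matching with a ratio test (0.8 is arbitrary here).
    cv::BFMatcher matcher(cv::NORM_HAMMING);
    std::vector<std::vector<cv::DMatch>> knn;
    matcher.knnMatch(descModel, descScene, knn, 2);

    std::vector<cv::Point2f> ptsModel, ptsScene;
    for (const auto& m : knn)
    {
        if (m.size() >= 2 && m[0].distance < 0.8f * m[1].distance)
        {
            ptsModel.push_back(kpModel[m[0].queryIdx].pt);
            ptsScene.push_back(kpScene[m[0].trainIdx].pt);
        }
    }

    // A RANSAC homography locates ONE instance; with several identical
    // instances the good matches get split among them, which is the problem.
    cv::Mat H;
    if (ptsModel.size() >= 4)
        H = cv::findHomography(ptsModel, ptsScene, cv::RANSAC, 3.0);

    return 0;
}
```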
Luckily, it is not impossible to distinguish between the instances, and I would like your advice and experience with some methods I came across on the internet or thought of myself (which surely means they exist somewhere on the internet ;) )
Some details on the kind of application I am interested in.
An "exact" model of the object is available (imagine I take a picture of the object in the same environment that the instances will be searched for. So the instances will be about the same size (scale) and with the same perspective, just maybe rotated (360deg) or deformed a little bit).
Occlusion will be present. Imagine that a camera is on a robot and three identical photos are hung on the wall in front of it, but two of them are partially hidden by two dudes. The algorithm will find keypoint-descriptor matches on all of them, but since I want to locate the one that is not occluded, it is safe to assume that the correct instance will have the most keypoint-descriptor matches in its area.
Real-time detection (or close to it, about 2-3 fps minimum), and from 1 s to ~20 s max for the "training" of the model (keypoint/descriptor computation plus any other relevant operation).
Now, here are the methods I am contemplating. Of course, unless there is already a ready-made implementation, I will need to implement them myself, so any thoughts are welcome!
Scanning windows: A scanning window big enough to contain any rotated version of the model object (assuming ~no scale change) is used.
The scene image is scanned using some relevant step size. In each window, the matching score (say, Hamming distance) is computed between the model and the scene descriptors. The number of good individual matches gives a score for this window, and the best score gives the location of the object. A seemingly naive approach.
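Here is a rough sketch of how I picture the window scoring, assuming the scene keypoints have already been matched against the model as in the baseline above (so trainIdx indexes the scene keypoints); the window size and step are hypothetical parameters derived from the known model size.

```cpp
#include <opencv2/core.hpp>
#include <opencv2/features2d.hpp>
#include <vector>

// Slide a window over the scene, count how many "good" matches fall inside
// each window position, and return the best-scoring window.
cv::Rect bestWindow(const std::vector<cv::KeyPoint>& kpScene,
                    const std::vector<cv::DMatch>& goodMatches,
                    const cv::Size& sceneSize,
                    const cv::Size& windowSize,   // e.g. model diagonal, placeholder
                    int step)                     // scanning step, placeholder
{
    cv::Rect best;
    int bestScore = -1;

    for (int y = 0; y + windowSize.height <= sceneSize.height; y += step)
    {
        for (int x = 0; x + windowSize.width <= sceneSize.width; x += step)
        {
            int score = 0;
            for (const cv::DMatch& m : goodMatches)
            {
                // trainIdx indexes the scene keypoints in this setup.
                const cv::Point2f& p = kpScene[m.trainIdx].pt;
                if (p.x >= x && p.x < x + windowSize.width &&
                    p.y >= y && p.y < y + windowSize.height)
                    ++score;
            }
            if (score > bestScore)
            {
                bestScore = score;
                best = cv::Rect(x, y, windowSize.width, windowSize.height);
            }
        }
    }
    return best;
}
```

To find several instances, one could presumably keep every window whose score is above a threshold and suppress overlapping windows, but that is exactly the part I am unsure about.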
Clustering: According to (Colet et al., 2009), Mean Shift clustering is a good choice because no fixed number of clusters needs to be specified (a rough sketch of the clustering step is given after the list). The full algorithm is as follows:
1) Cluster [chosen] feature 2D [keypoint] locations p using the Mean Shift algorithm. Each cluster contains a subset of points p_k.
2) For each cluster of points p_k, choose a subset of n points and estimate a hypothesis with the best pose according to ...
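As far as I know, OpenCV does not expose a generic Mean Shift clustering for arbitrary 2D point sets (cv::meanShift is the tracking variant working on a back-projection image), so step 1 would need to be hand-rolled. A naive flat-kernel sketch over matched keypoint locations is below; the bandwidth, iteration cap, and merge tolerance are placeholder values.

```cpp
#include <opencv2/core.hpp>
#include <cmath>
#include <vector>

static float dist(const cv::Point2f& a, const cv::Point2f& b)
{
    return std::hypot(a.x - b.x, a.y - b.y);
}

// Naive flat-kernel mean shift over 2D keypoint locations (step 1 above).
// Each point is shifted toward the mean of its neighbours within `bandwidth`
// until convergence; points whose modes end up close together share a label,
// ideally one label per object instance.
std::vector<int> meanShiftCluster(const std::vector<cv::Point2f>& pts,
                                  float bandwidth)
{
    std::vector<cv::Point2f> modes = pts;
    const float eps = 1e-2f;

    for (size_t i = 0; i < modes.size(); ++i)
    {
        for (int iter = 0; iter < 100; ++iter)   // arbitrary iteration cap
        {
            cv::Point2f sum(0.f, 0.f);
            int count = 0;
            for (const cv::Point2f& p : pts)
            {
                if (dist(modes[i], p) <= bandwidth) { sum += p; ++count; }
            }
            if (count == 0) break;               // no neighbours left: stop
            cv::Point2f next(sum.x / count, sum.y / count);
            bool converged = dist(next, modes[i]) < eps;
            modes[i] = next;
            if (converged) break;
        }
    }

    // Merge modes that converged to (almost) the same location.
    std::vector<int> labels(pts.size(), -1);
    std::vector<cv::Point2f> centers;
    for (size_t i = 0; i < modes.size(); ++i)
    {
        int label = -1;
        for (size_t c = 0; c < centers.size(); ++c)
        {
            if (dist(modes[i], centers[c]) < bandwidth / 2)
            {
                label = static_cast<int>(c);
                break;
            }
        }
        if (label < 0)
        {
            centers.push_back(modes[i]);
            label = static_cast<int>(centers.size()) - 1;
        }
        labels[i] = label;
    }
    return labels;   // one cluster of matches per (hopefully) object instance
}
```

Step 2 (pose hypothesis per cluster) would then run something like findHomography with RANSAC on each cluster's matches separately.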
Related:
http://answers.opencv.org/question/44141/multiple-object-detenction-2d-camera/
http://answers.opencv.org/question/17985/detecting-multiple-instances-of-same-object-with/
Hmm, considering you want real-time performance, I would skip all the keypoint detection/matching/clustering and go for model-based approaches like Viola & Jones or SVM+HOG...
Stupid question: does OpenCV include these things? I see them frequently in the literature, but while I am not afraid to program, I can't pretend I am a pro programmer able to optimize code to the level of what is done in OpenCV... Also, I need a licence similar to the one OpenCV uses...
Well, I'll do my homework and look for implementations ;)
@Doombot: yes, they are in OpenCV. Viola & Jones is available as the cascade classifier in the object detection (objdetect) module, the SVM is in the machine learning module, and the HOG descriptor/detector (cv::HOGDescriptor) is in the objdetect module as well.
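To illustrate why the cascade route handles the multiple-instance requirement naturally: detectMultiScale returns one rectangle per detected instance. A minimal sketch, assuming a cascade trained on the object (the XML and image file names here are placeholders):

```cpp
#include <opencv2/core.hpp>
#include <opencv2/imgcodecs.hpp>
#include <opencv2/objdetect.hpp>
#include <vector>

int main()
{
    // "my_object_cascade.xml" stands in for a cascade trained on the object
    // (e.g. with opencv_traincascade); "scene.png" is a placeholder as well.
    cv::CascadeClassifier cascade("my_object_cascade.xml");
    cv::Mat scene = cv::imread("scene.png", cv::IMREAD_GRAYSCALE);

    // One rectangle per detected instance, so several identical copies of
    // the object in the scene are handled for free.
    std::vector<cv::Rect> detections;
    cascade.detectMultiScale(scene, detections, 1.1, 3);

    return 0;
}
```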