Detection of multiple instances of an object with Keypoint-Descriptors (BRISK, ORB, ...)

tl;dr: How to detect an object when multiple instances of the same object are present in the scene?

I am using a keypoint descriptor feature matcher (BRISK right now, others could follow but I want to stay in the binary descriptor world). I am using OpenCV3 in C++.
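For context, this is the kind of baseline single-instance pipeline I am starting from (a minimal sketch; file names are placeholders):

```cpp
#include <opencv2/core.hpp>
#include <opencv2/features2d.hpp>
#include <opencv2/imgcodecs.hpp>
#include <vector>

int main()
{
    // Placeholder file names, just for illustration.
    cv::Mat model = cv::imread("model.png", cv::IMREAD_GRAYSCALE);
    cv::Mat scene = cv::imread("scene.png", cv::IMREAD_GRAYSCALE);
    if (model.empty() || scene.empty()) return 1;

    cv::Ptr<cv::BRISK> brisk = cv::BRISK::create();

    std::vector<cv::KeyPoint> kpModel, kpScene;
    cv::Mat descModel, descScene;
    brisk->detectAndCompute(model, cv::noArray(), kpModel, descModel);
    brisk->detectAndCompute(scene, cv::noArray(), kpScene, descScene);

    // Hamming norm for binary descriptors; crossCheck prunes one-sided matches.
    cv::BFMatcher matcher(cv::NORM_HAMMING, true);
    std::vector<cv::DMatch> matches;
    matcher.match(descModel, descScene, matches);
    // -> `matches` mixes keypoints from every instance of the object.
    return 0;
}
```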

Now, I knew beforehand that while a keypoint-descriptor matcher is good (great!) at detecting ONE instance of an object in an image, it is not able by itself to discriminate between 2, 3, 4, etc. instances of the same object (since it has no global knowledge of the object, it won't "understand" that a specific keypoint belongs to instance number 1 vs. instance number 2).

Luckily, it is not impossible to distinguish between the instances, and I would like your advice and experience with some methods I came across on the internet or came up with myself (which surely means they exist somewhere on the internet ;) )

Some details on the kind of application I am interested in:

  • An "exact" model of the object is available (imagine I take a picture of the object in the same environment where the instances will be searched for, so the instances will be about the same size (scale) and seen from the same perspective, maybe just rotated (up to 360°) or slightly deformed).

  • Occlusion will be present. Imagine that a camera is on a robot and three identical photos are hung on the wall in front of it, but two of them are partially hidden by two dudes. The algorithm will find keypoint-descriptor matches on all of them, but since I want to locate the one that is not occluded, it is safe to assume that the correct instance will have the most keypoint-descriptor matches in its area.

  • Real-time detection (or close to real time, about 2-3 fps minimum), and from 1 s to ~20 s max for the "training" of the model (keypoint-descriptor computation + any other relevant op).

Now, here are the methods I am contemplating. Of course, unless there is already a ready-made implementation, I will need to implement them myself, so any thoughts are welcome!

  1. Scanning Windows: A scanning window that is big enough to contain any rotated version of the model object (assuming ~no scale change) is used.

    The scene image is scanned with some relevant step size. In each window, the matching score (say, Hamming distance) is computed between the model and the scene descriptors. The number of good individual matches gives a score for this window, and the best score gives the location of the object. A seemingly naive approach; see the sketch below.
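    Something like this is what I have in mind for the scoring loop (an untested sketch; it reuses `matches` and `kpScene` from a matcher pass like the one above, and `winSize`/`step` are parameters I would have to tune):

    ```cpp
    #include <opencv2/core.hpp>
    #include <opencv2/features2d.hpp>
    #include <vector>

    // Return the window containing the most matched scene keypoints.
    cv::Rect bestWindow(const std::vector<cv::DMatch>& matches,
                        const std::vector<cv::KeyPoint>& kpScene,
                        cv::Size sceneSize, cv::Size winSize, int step)
    {
        cv::Rect best;
        int bestScore = -1;
        for (int y = 0; y + winSize.height <= sceneSize.height; y += step)
            for (int x = 0; x + winSize.width <= sceneSize.width; x += step)
            {
                cv::Rect win(x, y, winSize.width, winSize.height);
                int score = 0;
                for (const cv::DMatch& m : matches)  // count matches inside this window
                {
                    cv::Point2f p = kpScene[m.trainIdx].pt;
                    if (win.contains(cv::Point(cvRound(p.x), cvRound(p.y))))
                        ++score;
                }
                if (score > bestScore) { bestScore = score; best = win; }
            }
        return best;
    }
    ```

    For multiple instances, the top non-overlapping windows would be kept instead of just the best one.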

  2. Clustering: According to (Collet et al., 2009), Mean Shift clustering is a good choice because no fixed number of clusters needs to be specified. The full algorithm is as follows (a rough sketch follows after the steps):

    1) Cluster [chosen] feature 2D [keypoint] locations p using the Mean Shift algorithm. Each cluster contains a subset of points p_k.

    2) For each cluster of points p_k, choose a subset of n points and estimate a hypothesis with the best pose according to those points. If the number of points consistent with the hypothesis is higher than a threshold e, create a new object instance and refine the estimated pose using all consistent points in the optimization. Keep repeating this procedure until the number of unallocated points is lower than a threshold, or the maximum number of iterations has been exceeded.

    3) Merge all found instances from different clusters whose estimated R,t are similar. The instances with the most consistent points survive.
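    Since OpenCV3 has no built-in mean shift for 2D point sets (cv::meanShift works on back-projection images for tracking), steps 1) and 2) would have to be hand-rolled. A rough, untested sketch; `bandwidth` and `minInliers` are guesses to tune, and the homography assumes a planar object like the photos on the wall:

    ```cpp
    #include <opencv2/calib3d.hpp>
    #include <opencv2/core.hpp>
    #include <vector>

    static float dist2(cv::Point2f a, cv::Point2f b)
    {
        cv::Point2f d = a - b;
        return d.x * d.x + d.y * d.y;
    }

    // Step 1): flat-kernel mean shift over the matched scene locations. Each
    // point is shifted to the mean of its neighbours until convergence; points
    // whose modes collapse together get the same cluster label.
    static std::vector<int> meanShiftLabels(const std::vector<cv::Point2f>& pts,
                                            float bandwidth, int& numClusters)
    {
        std::vector<cv::Point2f> modes(pts);
        for (cv::Point2f& m : modes)
        {
            for (int iter = 0; iter < 100; ++iter)
            {
                cv::Point2f mean(0.f, 0.f);
                int n = 0;
                for (const cv::Point2f& p : pts)
                    if (dist2(m, p) < bandwidth * bandwidth) { mean += p; ++n; }
                if (n == 0) break;                                // safety guard
                mean *= 1.0f / n;
                if (dist2(mean, m) < 0.25f) { m = mean; break; }  // converged
                m = mean;
            }
        }
        std::vector<int> labels(pts.size(), -1);
        numClusters = 0;
        for (size_t i = 0; i < modes.size(); ++i)
        {
            for (size_t j = 0; j < i; ++j)
                if (dist2(modes[i], modes[j]) < bandwidth * bandwidth / 4.f)
                { labels[i] = labels[j]; break; }
            if (labels[i] < 0) labels[i] = numClusters++;
        }
        return labels;
    }

    // Step 2): one pose hypothesis per cluster, kept only if enough points agree.
    // modelPts[i] / scenePts[i] are the matched coordinates taken from each DMatch.
    std::vector<cv::Mat> instancesFromClusters(const std::vector<cv::Point2f>& modelPts,
                                               const std::vector<cv::Point2f>& scenePts,
                                               float bandwidth, int minInliers)
    {
        int numClusters = 0;
        std::vector<int> labels = meanShiftLabels(scenePts, bandwidth, numClusters);
        std::vector<cv::Mat> instances;
        for (int c = 0; c < numClusters; ++c)
        {
            std::vector<cv::Point2f> mdl, scn;
            for (size_t i = 0; i < labels.size(); ++i)
                if (labels[i] == c) { mdl.push_back(modelPts[i]); scn.push_back(scenePts[i]); }
            if (scn.size() < 4) continue;                 // homography needs 4 points
            cv::Mat inlierMask;
            cv::Mat H = cv::findHomography(mdl, scn, cv::RANSAC, 3.0, inlierMask);
            if (!H.empty() && cv::countNonZero(inlierMask) >= minInliers)
                instances.push_back(H);                   // one detected instance
        }
        return instances;
    }
    ```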

  3. Clustering 2: The keypoints of the model are grouped into clusters according to their geometric proximity.

    Only some major clusters are kept, based on their number of keypoints, their global distribution over the surface of the model, etc. The distance between each pair of clusters is computed as an identification metric for the object (this object has three clusters, the distance between clusters 1 and 2 is X, ...).

    Descriptors of those keypoints are computed. The scene's keypoint-descriptors are computed.

    The keypoint-descriptors of the first cluster of the model are searched in the scene and a certain number of plausible matches are kept for each. These plausible matches are clustered in relation to their geometrical proximity to identify the presence(s) of this cluster in the scene image. This step is repeated for all the major clusters of the model object image.

    The distances between the detected clusters are computed and compared against the model's inter-cluster distances. For all plausible candidates, the homography is computed. A sketch of the model-side clustering follows below.
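    For the model-side grouping, cv::partition with a proximity predicate seems like a natural fit (strictly speaking it builds connected components rather than true clusters, which may be good enough here); the `proximity` threshold is a guess to tune:

    ```cpp
    #include <opencv2/core.hpp>
    #include <opencv2/features2d.hpp>
    #include <vector>

    // Group model keypoints by geometric proximity and return cluster centroids;
    // the pairwise centroid distances then give the identification metric above.
    int clusterModelKeypoints(const std::vector<cv::KeyPoint>& kps, float proximity,
                              std::vector<int>& labels,
                              std::vector<cv::Point2f>& centroids)
    {
        // Two keypoints belong to the same group if they are closer than `proximity`.
        int n = cv::partition(kps, labels,
            [proximity](const cv::KeyPoint& a, const cv::KeyPoint& b) {
                cv::Point2f d = a.pt - b.pt;
                return d.x * d.x + d.y * d.y < proximity * proximity;
            });

        centroids.assign(n, cv::Point2f(0.f, 0.f));
        std::vector<int> counts(n, 0);
        for (size_t i = 0; i < kps.size(); ++i)
        {
            centroids[labels[i]] += kps[i].pt;
            ++counts[labels[i]];
        }
        for (int c = 0; c < n; ++c)
            centroids[c] *= 1.0f / counts[c];   // centroid of each cluster
        return n;                               // number of clusters found
    }
    ```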

For now, this is what I have found/thought of. I don't have any implementation yet, so if you know a method with some sample code, I wouldn't say no. If not, your comments would certainly help me start coding a promising strategy instead of something bad or inefficient.
