I'm implementing a cache for virtual reality applications: given an input image query, return the result associated to the most visually similar cached image (so a previously processed query) if the distance between the query representation and the cached image representation is lower than a certain threshold. Our cache is relatively small and contains 10k images representations.
We use VLAD codes [[1]] as image representation since they are very compact and incredibly fast to compute (around 1 ms).
However, it has been shown in [[2]] that the the distance between the query code and the images in the dataset (the cache in this case) is very different from query to query, so it's not trivial to find an absolute threshold. In the same work it's proposed a method for object detection applications, which is not relevant in this context (we return just the most similar image, not all and only the images containing the query subject).
[[3]] offers a very precise method, but at the same time it's very expensive and returns short lists. It's based on spatial feature matching re-ranking, and if you want to know more details the quoted section is at the end of this question. I'm not an expert in computer vision, but this step sounds to me a lot like to use a Feature Matcher on the short-list of the top-k elements according to the image representation and re-rank them based on the number of features matched. My first question is: is that correct?
In our case this approach is not a problem, since most of the times the top-10 most similar VLAD codes contains the query subject, and so we should do this spatial matching step only on 10 images.
However, at this point I have a second question: if we had the problem of deciding an absolute threshold for image representations (as VLAD codes), this problem still persists with this approach? In the first case, the threshold was "the L2 distance between the query VLAD code and the closest VLAD code", here instead the threshold value would represent "the number of features matched between the query image and the image closest image using VLAD codes".
Of course my second question makes sense if the first question is positive.
The approach of [[3]]:
Geometrical Re-ranking verifies the global geometrical consistency between matches (Lowe 2004; Philbin et al. 2007) for a short-list of database images returned by the image search system. Here we implement the approach of Lowe (2004) and apply it to a short-list of 200 images. We first obtain a set of matches, i.e., each descriptor of the query image is matched to the 10 closest ones in all the short-list images. We then estimate an affine 2D transforma- tion in two steps. First, a Hough scheme estimates a trans- formation with 4 degrees of freedom. Each pair of matching regions generates a set of parameters that “vote” in a 4D histogram. In a second step, the sets of matches from the largest bins are used to estimate a finer 2D affine transform. The images for which the geometrical estimation succeeds are returned in first positions and ranked with a score based on the number of inliers. The images for which the estima- tion failed are appended to the geometrically matched ones, with their order unchanged.