LSH knnSearch slow in large dataset [closed]

asked 2020-04-08 10:15:32 -0600 by vitto32

updated 2020-04-08 10:32:53 -0600 by berak

I'm using the Python version of OpenCV 4.2 and I'm trying to do feature matching against a large dataset (10K low-resolution images) using cv2.flann_Index().

I need optimal performance both in feature detection on the query ("key") image and in the search itself, but I'm stuck.

Using AKAZE and a KDTree index (and after hours of parameter tuning) I achieve barely acceptable search performance (50-100 ms over 5 million descriptors), but detectAndCompute takes longer than expected (30-50 ms for 400-900 descriptors). Memory consumption is also an issue (KAZE descriptors are 64 floats each).

ORB + LSH looks promising (32-byte descriptors), but the results in terms of speed are much worse, especially in knnSearch. I've tried different parameters, but the best result I get is 2000 ms per search (table_number=8, key_size=32, multi_probe_level=0), and ORB's detectAndCompute is no faster than AKAZE's (I've read papers stating it should be a lot faster).

Am I doing something conceptually wrong?
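
A minimal sketch of the ORB + LSH setup described above (assuming FLANN_INDEX_LSH = 6 as the algorithm id; the file names, nfeatures value and knn value are illustrative placeholders, not the actual code):

    import cv2
    import numpy as np

    FLANN_INDEX_LSH = 6   # LSH algorithm id expected by OpenCV's FLANN wrapper

    orb = cv2.ORB_create(nfeatures=1000)

    # Query side: detect + compute on the "key" image
    query = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)
    q_kps, q_descs = orb.detectAndCompute(query, None)

    # Database side: all descriptors already stacked into one uint8 array,
    # one row per descriptor (32 bytes for ORB)
    haystack_descs = np.load("haystack_descs.npy")

    index_params = dict(algorithm=FLANN_INDEX_LSH,
                        table_number=8,
                        key_size=32,
                        multi_probe_level=0)
    fi = cv2.flann_Index(haystack_descs, index_params)

    # For each query descriptor, return the 2 nearest rows of haystack_descs
    indices, dists = fi.knnSearch(q_descs, 2, params={})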


Closed for the following reason: the question is answered, right answer was accepted by sturkmen (close date 2020-10-14 13:46:46)

Comments

"detectAndCompute takes more than expected"

However, you would do this only for your query image(s) and cache the features of the 10K database images, no?

berak ( 2020-04-08 10:48:24 -0600 )

Hi @vitto32, I am facing a similar problem, can you please help me?

I'm not sure how to build an index or how to search for similar descriptors given a query descriptor. (It would be very helpful if you could post a snippet of your code for indexing and searching.)

hha1 ( 2020-04-08 11:57:13 -0600 )

@berak Yes, I'm saving keypoints, descriptors and the index. At startup I load them and then start processing query images from different image queues.

vitto32 ( 2020-04-08 13:41:06 -0600 )
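
A possible sketch of that save/load step (fi and haystack_descs refer to the index and stacked descriptor array from the sketch after the question; haystack_kps is assumed to be the concatenated keypoint list; file names are placeholders):

    import cv2
    import numpy as np

    # --- first run: persist descriptors, keypoints and the FLANN index ---
    np.save("haystack_descs.npy", haystack_descs)   # stacked descriptors, one row each
    fi.save("haystack.flann")                       # serialized FLANN index

    # cv2.KeyPoint objects cannot be saved directly; store their fields instead
    kp_fields = np.array([(k.pt[0], k.pt[1], k.size, k.angle,
                           k.response, k.octave, k.class_id)
                          for k in haystack_kps], dtype=np.float32)
    np.save("haystack_kps.npy", kp_fields)

    # --- later runs: reload; the index must be loaded together with the
    # same descriptor matrix it was built from ---
    haystack_descs = np.load("haystack_descs.npy")
    fi = cv2.flann_Index()
    fi.load(haystack_descs, "haystack.flann")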

@hha1 Look at cv2.flann_Index and knnSearch. I can post code if you are more specific about your problem.

vitto32 ( 2020-04-08 13:43:12 -0600 )

@vitto32 My problem is that I have about 10,000 images and thus 10,000 descriptors. All I want to do is build an index over those descriptors and then, given a query descriptor, have FLANN return the "closest" descriptors or closest matches.

The problem with my code is that it does not store the indices for each descriptor, and the search function is not returning any matches.

If you could show me how to create an index for the descriptors and search for matches given a query descriptor, I would be very grateful.

It would be most appreciated if you could share a snippet of your code.

Thank you!

hha1 ( 2020-04-08 14:44:53 -0600 )

@hha1 Each image should have more than one descriptor! In my case, 10K images means a few million descriptors (how many depends on your parameters and image size).

What I do is concatenate all the results from detectAndCompute into two huge arrays (haystackKps, haystackDess). I also populate a structure to keep track of which image each descriptor belongs to (e.g. positions 0 to 1500 refer to image #1, etc.).

Then I build the FLANN index from haystackDess (millions of descriptors) and run knnSearch with the query descriptors. I loop over all the results and assign the matches (and their distances) to the related images. At the end I analyze the results and pick the image with the most matches (more tuning can be done using distances and RANSAC).

PS: I save the index, keypoints and descriptors to file for the second run.

vitto32 ( 2020-04-09 04:55:34 -0600 )
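
A minimal sketch of the pipeline described in the comment above, using the ORB + LSH configuration from the question (the glob pattern, the knn=2 search and the 0.75 ratio test are illustrative assumptions, not taken from the original code):

    import glob
    import cv2
    import numpy as np

    FLANN_INDEX_LSH = 6
    detector = cv2.ORB_create()

    image_paths = sorted(glob.glob("db/*.jpg"))   # the 10K database images (placeholder path)

    all_descs = []
    image_ids = []                                # image_ids[row] -> image that row came from

    for img_id, path in enumerate(image_paths):
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        kps, descs = detector.detectAndCompute(img, None)
        if descs is None:
            continue
        all_descs.append(descs)
        image_ids.extend([img_id] * len(descs))

    haystack_descs = np.vstack(all_descs)         # a few million rows for 10K images
    image_ids = np.array(image_ids)

    fi = cv2.flann_Index(haystack_descs,
                         dict(algorithm=FLANN_INDEX_LSH, table_number=8,
                              key_size=32, multi_probe_level=0))

    # Query: search, then vote for the database image each good match belongs to
    query = cv2.imread("query.jpg", cv2.IMREAD_GRAYSCALE)
    q_kps, q_descs = detector.detectAndCompute(query, None)
    indices, dists = fi.knnSearch(q_descs, 2, params={})

    votes = np.zeros(len(image_paths), dtype=np.int32)
    for (i1, i2), (d1, d2) in zip(indices, dists):
        if d1 < 0.75 * d2:                        # ratio test to drop ambiguous matches
            votes[image_ids[i1]] += 1

    best_image = image_paths[int(np.argmax(votes))]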

The "haystack" problem can be overcome by some form of clustering, e.g. Bag of Words, so you get a fixed descriptor size per image. Even with K=10000 you'll get a huge compression, reducing the data to the most significant features.

Unfortunately, OpenCV's implementation uses kmeans and thus cannot be used with binary descriptors like ORB, but you can use various other clustering algorithms in Python, e.g. from scikit-learn.

berak ( 2020-04-09 05:11:59 -0600 )
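
For the float-descriptor path, OpenCV's own Bag-of-Words classes can produce the fixed-size per-image descriptor mentioned here; a rough sketch (the vocabulary size, sample size and file pattern are arbitrary placeholders):

    import glob
    import cv2

    # Float (KAZE-style) descriptors, since BOWKMeansTrainer clusters with kmeans
    detector = cv2.AKAZE_create(descriptor_type=cv2.AKAZE_DESCRIPTOR_KAZE)

    # 1) Train a vocabulary on a sample of the database descriptors
    bow_trainer = cv2.BOWKMeansTrainer(1000)              # K = vocabulary size
    for path in sorted(glob.glob("db/*.jpg"))[:500]:
        img = cv2.imread(path, cv2.IMREAD_GRAYSCALE)
        _, descs = detector.detectAndCompute(img, None)
        if descs is not None:
            bow_trainer.add(descs)
    vocabulary = bow_trainer.cluster()

    # 2) One fixed-size histogram per image, independent of the keypoint count
    bow_extractor = cv2.BOWImgDescriptorExtractor(detector, cv2.FlannBasedMatcher())
    bow_extractor.setVocabulary(vocabulary)

    img = cv2.imread("some_image.jpg", cv2.IMREAD_GRAYSCALE)
    hist = bow_extractor.compute(img, detector.detect(img, None))   # shape (1, 1000)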

Thanks @berak, but scikit-learn is a lot slower (500x in its KDTree implementation). I've also tried OpenCV's kmeans (using float descriptors) and it can be a good alternative to KDTree, but it is more difficult to balance (compression vs speed vs precision). And I'm still stuck with float descriptors.

vitto32 ( 2020-04-09 12:09:42 -0600 )

Looking at the C++ code, it seems the distance type can be passed as the third parameter to the build function of any Index. HAMMING should be 9, but in Python I get "Unknown/unsupported distance type".

vitto32 ( 2020-04-11 08:02:01 -0600 )