# facenet math

Hello, i hope this is not too off topic but i want to try anyway please.

Pre conditions I have:

• database filled with 128 floating vectors for different persons
• the vector is a unit vector (length of 1)

So i am able to calculate the distance for a given vector on the database (apache solr) successfully. My naive approach calculates all distances for a request and sorts by distance - which is working correctly.

Now my question: Is it possible to limit the records the db needs to scan for a given max distance without calculating the distance for all records? I want to use fixed fields(no db functions) to make this prediction.

I tried so far : mean, variance, using anchor points but i see no correlation with the calculated distance. Sorry - i am still learning but math is always a bit difficult for me (i try to improve).

Thank you + Greetings, Holger

edit retag close merge delete

Some (stupid) thoughts :

Can i somehow put points(determined by the vector) close to each other(certain distance) in a cluster/cloud? And then before if make a query - i determine in which cluster i should search in to get the real distance?

• And if this clustering is possible - how to compute it? - a word to google for would be enough.

So i am able to calculate the distance for a given vector on the database (apache solr) successfully

just curious, how that would look in code. would it be possible, to put some snippet on a gist or such ? (again, it's not important, only my own, personal curiosity)

(no db functions)

"db functions" would mean something like this, right ? http://www.solrtutorial.com/custom-so...

1

are you even storing the facenet features there ? how ?

my naive idea here would be:

when adding a new entity to the db, calculate distances to all others, store in a table like [id_a, id_b, dist] , one row for each distance. easy to index, and easy to retrieve with queries (idk their syntax) like "where id=217 and dist=min" or such.

1

Hi Berak - i can summarize a bit why i did right now:

1. Retrieve the feature vector of person(s)
2. For each vector component - a corresponding schema field in solr exists i.e embv_0, embv_1, embv_2,..., embv_n. So total of 128 fields
3. Store the vector for person(s) in solr
4. When i perform a search for similarity, i get the feature vector and the use dist function(https://lucene.apache.org/solr/guide/...) to sort by distance i.e dist(2, embv_0, embv_1, embv_2,..., embv_n, value0, value1, value2, ..., value_n)

This is performing not bad and correct but solr is forced to to a "full table scan" to get the min distance.I want to prevent this. My aim is to give the query some hints which element are unlike to match the query("filter query

1

Ok let me try to simulate your suggestion berak : I insert a vector and before that - i calculate all the distances to all existing vectors. For each calculated distance, i create a new record with distance and which vector were compared.

This works great if you only compare entries from the db with each other. You can tell which person from the db is most similar to other persons in the db by min query nearly instant - well great!

But what happens if you get a vector from a random image and want to tell if that person is in the db with a certain min distance? In that case i guess you need to again compare that vector with every other vector. That's a full table scan again and what i want to prevent.

Btw: i am not working for a company ^^ Maybe i want to found one ...(more)

So my top guess for this is right now:

Before insertion:

1. cluster points together using Hierarchical Clustering or any other clustering algorithm (K-Means, etc..)
2. insert the result into db

When doing query

1. get feature vector
2. get cluster centroids
3. compute the min distance to a centroid (average of the points in the cluster)
4. use that cluster id as a filter for the query

In my head this makes totally sense right now, but i need to verify. God thanks i have plenty of data to test on(Thank you Wikipedia).

And thank you again for your interest - that alone somehow helps.

Alternative would be to use in memory and distribute the data over several nodes to parallelize the search. But wait - that's exactly what apache solr in cluster mode (minus the in memory thing) is doing :-)

I don't want to reinvent wheel. While writing this i get some ideas how to improve my situation (distribute data on several nodes). Thank you!