offtopic: facenet model phenomenon [closed]

asked 2018-07-06 05:55:13 -0500

holger

updated 2018-07-06 06:14:48 -0500

berak

Hello, I hope the following CNN/DNN questions are not too off-topic:


  • I read about siamese networks and triplet loss, and understand that facenet is a model architecture.
  • I also understand that the model openface.nn4.small2.v1.t7 used by the OpenCV demo is an implementation of the facenet architecture.

As the anchor image (every following picture is compared with it) I picked a frontal picture of Arnold Schwarzenegger at age 40. During testing I noticed the following:

  • Slightly different angles of the same person lead to a very low similarity score.
  • Different ages of the same person lead to a very low similarity score.
  • I picked images of an old woman, a young woman (the famous "lena" picture) and a baby. To my surprise, these images (which clearly show different people than Arnold) are closer to the anchor than the pictures of the same person (different age / angle).

I am afraid I cannot ask the net why it thinks Lena is more similar to Arnold than a picture of himself at a slightly different age. Can anyone with decent knowledge comment on this "phenomenon"?

On the other hand, as long as the network detects the same person whenever the score is high (0.8 seems to be a good threshold), which I can confirm, is everything fine?
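As a rough illustration of the threshold idea in the paragraph above, here is a minimal C++ sketch. It assumes the scores are dot products of L2-normalized embeddings, and the function name `acceptedMatches` and the default of 0.8 are made up for this example, not taken from the OpenCV demo.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the indices of all candidates whose similarity
// to the anchor reaches the threshold. Scores are assumed to be dot products
// of L2-normalized embeddings, so they lie in [-1, 1].
std::vector<std::size_t> acceptedMatches(const std::vector<double>& scores,
                                         double threshold = 0.8) {
    assert(threshold >= -1.0 && threshold <= 1.0);
    std::vector<std::size_t> accepted;
    for (std::size_t i = 0; i < scores.size(); ++i) {
        if (scores[i] >= threshold) {
            accepted.push_back(i);  // treated as "same person as the anchor"
        }
    }
    return accepted;
}
```

With a decision rule like this, only the side of the threshold matters: how the rejected images rank among themselves (for example Lena versus an older Arnold) does not change the outcome.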

Maybe all of this is just "network magic"? Any comments on this are highly welcome. Maybe I should also try the TensorFlow implementation of facenet.

Greetings, Holger


Closed by holger for the following reason: question is off-topic or not relevant
close date 2018-07-06 11:04:32.690126


Could you add how you calculate the "similarity score"?

berak ( 2018-07-06 06:38:40 -0500 )

Yes, of course. I made sure the detection itself works correctly (the code you provided) and took it as a template.

My code (too big to paste in full):

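// Method to create an embedding based on an extracted face
Mat createEmbedding(Mat face) {
    Mat blob = blobFromImage(face, 1.0 / 255, Size(96, 96), Scalar(0,0,0,0), true, false);
    identityNet.setInput(blob); // feed the blob to the network; this call was missing from the paste
    Mat result = identityNet.forward().clone();
    return result;
}

// Compute the score using .dot; using it on the anchor itself gives a similarity of 1.0, as expected.
// The right-hand side was lost in the paste; "anchor" is an assumed name for the anchor embedding.
double score = anchor.dot(anchor);
cout << "score for anchor " << score << endl;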

I could upload the full code (80 lines) to GitHub.

holger ( 2018-07-06 07:09:07 -0500 )

Hmm, you gave me an idea: maybe I shouldn't do it all in one step, but extract the faces first, write them to disk, and use those files as input for further analysis. Will do this.

holger ( 2018-07-06 07:15:43 -0500 )

berak ( 2018-07-06 07:25:30 -0500 )

OK, if I understand you correctly (please comment):

  • Don't copy code from the JS example
  • Don't use the dot product
  • Read about the Euclidean distance calculation again and implement it correctly

Did I get you right? I hope so, because then I can "fix" this. Thank you, Holger

holger ( 2018-07-06 07:55:50 -0500 )

No, sorry, the dot product is OK (the embeddings are L2-normalized; I forgot that).

berak ( 2018-07-06 07:57:20 -0500 )
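Why the dot product is fine here: for unit-length vectors a and b, ||a - b||² = ||a||² + ||b||² - 2·(a·b) = 2 - 2·(a·b), so ranking by a high dot product and ranking by a low Euclidean distance give the same order. Below is a small self-contained C++ sketch of that identity. It uses plain `std::vector` instead of `cv::Mat` to stay dependency-free, and the names `Embedding`, `dot`, `l2Normalize` and `squaredDistance` are assumptions for this example only.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Embedding = std::vector<double>;  // illustrative stand-in for a 1xN cv::Mat

// Dot product of two embeddings of equal length.
double dot(const Embedding& a, const Embedding& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}

// Scales a non-zero vector to unit Euclidean length.
Embedding l2Normalize(const Embedding& v) {
    const double norm = std::sqrt(dot(v, v));
    assert(norm > 0.0);
    Embedding out(v.size());
    for (std::size_t i = 0; i < v.size(); ++i) {
        out[i] = v[i] / norm;
    }
    return out;
}

// Squared Euclidean distance between two embeddings of equal length.
double squaredDistance(const Embedding& a, const Embedding& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        const double d = a[i] - b[i];
        sum += d * d;
    }
    return sum;
}
```

For normalized inputs, `squaredDistance(a, b)` and `2.0 - 2.0 * dot(a, b)` agree up to floating-point rounding, and `dot(a, a)` is 1.0, which matches the anchor-versus-anchor score reported above.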

OK then, I am not complaining about facenet at all. The positives are valid and fine (score > 0.75). I am just wondering about the negatives and their scores. From a human perspective it's really stupid.

How can Lena have a higher score than a picture of Arnold himself? A human looking at the images could clearly tell that Lena (a beautiful woman) is obviously not more similar to the anchor than a picture of an older Arnold.

Anyway, maybe this leads to nothing and I should just accept the fact that the model contains some "magic" it learned from pixels and is not a human.

holger ( 2018-07-06 08:05:59 -0500 )

OK then, converting the image to a 3-channel black-and-white image produces more reasonable results for my little dataset.


Interesting. My theory is that this way I force the net to pay more attention to face landmarks. It is only a theory; I need to verify it on a bigger dataset (the LFW dataset). But maybe I just f* up something this way, so I will need to measure.

holger ( 2018-07-06 08:52:58 -0500 )
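The conversion code itself is not shown in the thread. Assuming "black and white" means grayscale, the step could look roughly like the sketch below, which works on a raw interleaved 8-bit BGR buffer so that it stays self-contained. With OpenCV the same effect would presumably come from `cv::cvtColor` with `cv::COLOR_BGR2GRAY` followed by `cv::COLOR_GRAY2BGR`. The function name `toThreeChannelGray` is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper: converts an interleaved 8-bit BGR buffer to gray and
// writes the gray value into all three channels, so the network still
// receives a 3-channel image.
std::vector<std::uint8_t> toThreeChannelGray(const std::vector<std::uint8_t>& bgr) {
    assert(bgr.size() % 3 == 0);
    std::vector<std::uint8_t> out(bgr.size());
    for (std::size_t i = 0; i < bgr.size(); i += 3) {
        const double b = bgr[i];
        const double g = bgr[i + 1];
        const double r = bgr[i + 2];
        // Standard luma weights (the ones OpenCV documents for RGB to gray).
        const double gray = 0.114 * b + 0.587 * g + 0.299 * r;
        const auto value = static_cast<std::uint8_t>(gray + 0.5);  // round to nearest
        out[i] = value;
        out[i + 1] = value;
        out[i + 2] = value;
    }
    return out;
}
```

After this step the three channels are identical, so the network gets no color information. Whether that explains the improved scores is untested here, as the comment above already says.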

"that the model contains some "magic" it learned from pixels and is not a human."

Hehe, right. It's also a known fact that you can fool CNNs, e.g. with adversarial noise.

berak ( 2018-07-06 08:59:03 -0500 )

Well, I guess a neural network just "sees" things differently and is maybe right in its own world. Anyway, an interesting discussion, but I will close it.

Thank you for your input! Greetings, Holger

holger ( 2018-07-06 09:02:45 -0500 )