Please see this webpage --

"The network training started with randomly initialized weights and used a structured metric loss that tries to project all the identities into non-overlapping balls of radius 0.6. The loss is basically a type of pair-wise hinge loss that runs over all pairs in a mini-batch and includes hard-negative mining at the mini-batch level. The training code is obviously also available, since that sort of thing is basically the point of dlib. You can find all details on training and model specifics by reading the example program and consulting the referenced parts of dlib. There is also a Python API for accessing the face recognition model."
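
To illustrate what the 0.6 radius means at comparison time, here is a minimal sketch. It assumes the embeddings are plain lists of floats and that 0.6 is used directly as a Euclidean distance cutoff. The name `same_identity` is hypothetical and is not part of dlib.

```python
import math

# Assumption: cutoff taken from the "radius 0.6" mentioned in the quote above.
DISTANCE_THRESHOLD = 0.6


def same_identity(desc_a, desc_b, threshold=DISTANCE_THRESHOLD):
    """Return True if two embedding vectors are closer than the threshold."""
    # math.dist computes the Euclidean distance between two points (Python 3.8+).
    return math.dist(desc_a, desc_b) < threshold


# Toy 2-D vectors for illustration only; real descriptors have more dimensions.
print(same_identity([0.0, 0.0], [0.3, 0.4]))  # distance 0.5
print(same_identity([0.0, 0.0], [0.6, 0.8]))  # distance 1.0
```

In real use the two vectors would come from the face recognition model rather than being typed in by hand.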

Also see the loss function code and documentation here --

"WHAT THIS OBJECT REPRESENTS: This object implements the loss layer interface defined above by EXAMPLE_LOSS_LAYER_. In particular, it allows you to learn to map objects into a vector space where objects sharing the same class label are close to each other, while objects with different labels are far apart.

To be specific, it optimizes the following loss function which considers all pairs of objects in a mini-batch and computes a different loss depending on their respective class labels. So if objects A1 and A2 in a mini-batch share the same class label then their contribution to the loss is:
max(0, length(A1-A2)-get_distance_threshold() + get_margin())

While if A1 and B1 have different class labels then their contribution to the loss function is:
max(0, get_distance_threshold()-length(A1-B1) + get_margin())

Therefore, this loss layer optimizes a version of the hinge loss. Moreover, the loss is trying to make sure that all objects with the same label are within get_distance_threshold() distance of each other. Conversely, if two objects have different labels then they should be more than get_distance_threshold() distance from each other in the learned embedding. So this loss function gives you a natural decision boundary for deciding if two objects are from the same class.

Finally, the loss balances the number of negative pairs relative to the number of positive pairs. Therefore, if there are N pairs that share the same identity in a mini-batch then the algorithm will only include the N worst non-matching pairs in the loss. That is, the algorithm performs hard negative mining on the non-matching pairs. This is important since there are in general way more non-matching pairs than matching pairs. So to avoid imbalance in the loss this kind of hard negative mining is useful."
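
To make the quoted description concrete, here is a minimal sketch of the loss value in plain Python. It is not dlib's implementation: the function name `pairwise_metric_loss` is made up, the default threshold and margin are illustrative assumptions, and it skips gradients and any normalization dlib may apply.

```python
import math
from itertools import combinations


def pairwise_metric_loss(embeddings, labels, distance_threshold=0.6, margin=0.04):
    """Sum of pairwise hinge losses with mini-batch hard negative mining.

    The default distance_threshold and margin are assumptions for illustration.
    """
    pos_losses = []  # pairs that share a label
    neg_losses = []  # pairs with different labels

    for i, j in combinations(range(len(embeddings)), 2):
        d = math.dist(embeddings[i], embeddings[j])
        if labels[i] == labels[j]:
            # Matching pair: penalized when farther apart than threshold - margin.
            pos_losses.append(max(0.0, d - distance_threshold + margin))
        else:
            # Non-matching pair: penalized when closer than threshold + margin.
            neg_losses.append(max(0.0, distance_threshold - d + margin))

    # Hard negative mining: with N matching pairs, keep only the N worst
    # (largest-loss) non-matching pairs.
    neg_losses.sort(reverse=True)
    hardest_negatives = neg_losses[:len(pos_losses)]

    return sum(pos_losses) + sum(hardest_negatives)


# One matching pair (labels 0, 0) and two non-matching pairs in the batch.
batch = [[0.0, 0.0], [0.1, 0.0], [0.2, 0.0]]
ids = [0, 0, 1]
print(pairwise_metric_loss(batch, ids))
```

In this toy batch the matching pair is already close enough to contribute nothing, and only the single worst non-matching pair (the one at distance 0.1) is counted, because there is just one matching pair to balance against.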

If anything is still unclear, I recommend reading the DeepFace and FaceNet papers.