# De-duplicate images of faces

I have a group of images of faces (.jpg and .png). Some of the images are taken from the same source but have been cropped, rotated, flipped, resized, compressed, color modified, brightened, darkened, etc., and so are not exact duplicates. So the freeware I currently use will not find these 'transformed' near-duplicate images.

Since the images are 'almost,' but not quite, identical, I hope facial recognition software can help identify the duplicates. Also, I need a system that, once each face has been compared with each of the others, can tell me which files match so that I can compare the files and decide which one of the pair to keep and which one to discard.

I'm not a programmer, but my son can do the programming. I am trying to find possible tools he can use to implement a solution. (He says tools that can be integrated with Linux would be the easiest but that he can work with most anything.)

I've read the facerec_tutorial but don't really understand it. Can the OpenCV facial recognition software find the matches and then output some sort of data that can be used for human comparison of the files? If not, what tools would he need to use in order to take the output from the recognition software and make it usable so that a human could examine the images?

Ah, being a newbie, I can't yet answer my own question, so, in answer to berak's reply below, there are about 150,000 images that need to be deduplicated.


The approach with face detection is probably too complicated. The modifications point rather to a feature-based approach (SIFT, SURF, ...).

( 2015-01-31 02:16:55 -0500 )

i don't think opencv's face reco can help you here. it does 'supervised' learning: you give it a few imgs of person A, a few of person B, etc., and then it predicts whether a new img shows A or B. unfortunately, shown an img of person C, it will still say A or B, because that's all it knows.

but your case seems to be the 'unsupervised' one (you don't know the identities), so all you can do is check each img against each other for similarity.
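
For a sense of scale, checking each image against each other means n(n-1)/2 comparisons. A minimal C++ sketch, assuming every image has already been reduced to some comparable value; `pairCount`, `findCandidatePairs`, the distance function and the threshold are illustrative placeholders, not OpenCV API:

```cpp
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Number of unordered pairs among n images: n * (n - 1) / 2.
std::uint64_t pairCount(std::uint64_t n) {
    return n < 2 ? 0 : n * (n - 1) / 2;
}

// Hypothetical helper: compare every item with every later item and return
// the index pairs whose distance is at or below the threshold. The returned
// pairs are the candidates a human would then inspect side by side.
template <typename T, typename Dist>
std::vector<std::pair<std::size_t, std::size_t>> findCandidatePairs(
    const std::vector<T>& items, Dist distance, double threshold) {
    std::vector<std::pair<std::size_t, std::size_t>> matches;
    for (std::size_t i = 0; i < items.size(); ++i) {
        for (std::size_t j = i + 1; j < items.size(); ++j) {
            if (distance(items[i], items[j]) <= threshold) {
                matches.emplace_back(i, j);
            }
        }
    }
    return matches;
}
```

With the roughly 150,000 images mentioned in the question, `pairCount(150000)` comes to about 11.2 billion comparisons, so whatever is compared per pair has to be cheap.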

( 2015-01-31 02:41:09 -0500 )

btw, how many images are there ?

( 2015-01-31 08:01:15 -0500 )

try phash first.

( 2015-01-31 09:47:32 -0500 )

"I can't yet answer my own question," - oh, just make a comment, we can convert it to an answer, if necessary

( 2015-01-31 14:19:48 -0500 )

I used the phash demo on two pairs of .jpg files. Each pair contained the original image and a cropped image. The first pair was heavily cropped and I didn't expect good results. The second pair was a 'typical' crop, cropping off both sides of the image and leaving only the face in the middle. I got no matches either time. Here are the results from the 'typical' image:

Select 2 JPEG or BMP images to compare and click Submit.
- Images may be saved for statistical analysis and to improve the pHash algorithms. Images will never be redistributed.
- Algorithm: RADISH (radial hash); DCT hash; Marr/Mexican hat wavelet
- RADISH: pHash determined your images are not similar with PCC = 0.366513. Threshold set to 0.85.

( 2015-01-31 21:01:08 -0500 )

- DCT: pHash determined your images are not similar with hamming distance = 30.000000. Threshold set to 26.00.
- Marr/Mexican: pHash determined your images are not similar with normalized hamming distance = 0.468750. Threshold set to 0.40.

Then I cropped the image lightly, maybe 10% of the total area and got a report that the images are similar. Here is the result I got on just one test: -pHash determined your images are similar with PCC = 0.993786. Threshold set to 0.85.
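
The DCT result above is a Hamming distance, that is, the number of bits that differ between two hashes. A rough sketch, assuming each image has already been reduced to a 64-bit hash; the function names are made up for illustration, and the default threshold of 26 is simply the value the demo reported:

```cpp
#include <bitset>
#include <cstdint>

// Number of differing bits between two 64-bit perceptual hashes.
int hammingDistance(std::uint64_t a, std::uint64_t b) {
    return static_cast<int>(std::bitset<64>(a ^ b).count());
}

// Hypothetical decision rule. Assumption: a distance at or below the
// threshold counts as "similar"; the demo does not say how it treats a tie.
bool looksSimilar(std::uint64_t a, std::uint64_t b, int threshold = 26) {
    return hammingDistance(a, b) <= threshold;
}
```

Under this rule, the distance of 30 reported for the 'typical' crop falls on the "not similar" side of the threshold.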

( 2015-01-31 21:02:24 -0500 )

I then rotated the image 45 degrees (a typical rotation in the project I'm working on) and got a report that the images are not similar: -pHash determined your images are not similar with PCC = 0.308770. Threshold set to 0.85.

So I believe that, as currently configured, phash will not work for my project, unless ...

"Face detection" images can be extracted from the original images I'm working with and those extracted images rotated to the horizontal on the eyes, then resized, then compared with each other using phash.

Does OpenCV have tools that can detect and then extract faces from images, so that the extracted faces can then be compared?

( 2015-01-31 21:03:23 -0500 )

@VHarris, of course! Use the Viola-Jones cascade classifier to find faces in the image. Then cut out the detections and run an eye detector on each of them. Both face and eye models ship with OpenCV. Then use the center points of your eye detections to align the faces.

( 2015-02-01 02:26:02 -0500 )

Just want to make sure I've got this right so far.

1) use face_cascade to find the face
2) use eye_cascade to find the eyes within the face
3) use the center point of the eye detections to align the face. How is this typically done? And once the faces are aligned, how are the faces extracted from the image for use by pHash?
4) use phash to compare the images

( 2015-02-01 13:57:34 -0500 )


time to start preliminary answers, i guess...

1) yes, try a CascadeClassifier. if it turns out that you can crop your images to the significant face part, you've already won something.

2) 3) optionally align to the eyes. you can use another CascadeClassifier to find them, or some landmarks, like from flandmark or dlib. then, it's basically this:

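    // eye_l, eye_r: detected eye centers (e.g. Point2f) in the input image 'test'
    double eyeXdis = eye_r.x - eye_l.x;
    double eyeYdis = eye_r.y - eye_l.y;
    // atan2 also copes with eyeXdis == 0
    double angle   = atan2(eyeYdis, eyeXdis);
    double degree  = angle * 180 / CV_PI;
    double desired_eye_distance = 44.0;
    // scale by the full eye distance, not only its x component
    double scale   = desired_eye_distance / sqrt(eyeXdis * eyeXdis + eyeYdis * eyeYdis);

    Mat res;
    Point2f center(test.cols / 2.0f, test.rows / 2.0f);
    Mat rot = getRotationMatrix2D(center, degree, scale);
    warpAffine(test, res, rot, test.size(), INTER_CUBIC, BORDER_CONSTANT, Scalar(127));
    // probably crop this again from center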
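
To see what the angle and scale in the snippet mean without needing OpenCV, they can be computed on their own. This is only a sketch: `EyePoint`, `Alignment` and `computeAlignment` are illustrative names, not OpenCV types, and this variant uses atan2 and the full distance between the eyes.

```cpp
#include <cmath>

struct EyePoint { double x, y; };            // hypothetical stand-in for an eye center
struct Alignment { double degrees, scale; };

// Rotation angle (in degrees) and scale factor that would level the eyes and
// bring them desiredEyeDistance pixels apart. Assumes two distinct eye positions.
Alignment computeAlignment(EyePoint left, EyePoint right,
                           double desiredEyeDistance = 44.0) {
    const double kPi = 3.14159265358979323846;
    double dx = right.x - left.x;
    double dy = right.y - left.y;
    double degrees = std::atan2(dy, dx) * 180.0 / kPi;
    double scale = desiredEyeDistance / std::hypot(dx, dy);
    return {degrees, scale};
}
```

Eyes that are already level and 44 pixels apart give an angle of 0 and a scale of 1, so such an image would pass through unchanged. The two eye positions have to come from a detector (an eye cascade, flandmark or dlib); the value 44 is just the target size chosen in the snippet.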


4) now it's time to compare the images. of course you can save them to disk with imwrite() and use phash on that, but while you're in opencv, you might as well try some other things:

• the simplest comparison would be a straight double dist = norm(a,b); checked against some threshold
• if you can somehow get a good number of 'same' / 'not-same' pairs, you could train that distance threshold, or use machine learning, like an svm trained on distances
• instead of using the plain image pixels, you could advance to 'features' derived from them, like dct (similar to phash), hog, lbph, or the mentioned surf or orb descriptors
• this is called 'face verification', an active research topic ;)
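
The first bullet, a plain distance checked against a threshold, can be sketched without OpenCV as follows. This assumes both faces were already normalized to the same size and flattened into vectors; `l2Distance` and `sameFace` are illustrative names, and any threshold would have to be tuned on real data.

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Euclidean (L2) distance between two equally sized pixel or feature vectors.
double l2Distance(const std::vector<float>& a, const std::vector<float>& b) {
    if (a.size() != b.size()) throw std::invalid_argument("size mismatch");
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        double d = static_cast<double>(a[i]) - static_cast<double>(b[i]);
        sum += d * d;
    }
    return std::sqrt(sum);
}

// Hypothetical decision rule: treat two faces as the same source image when
// their distance does not exceed the threshold.
bool sameFace(const std::vector<float>& a, const std::vector<float>& b,
              double threshold) {
    return l2Distance(a, b) <= threshold;
}
```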

berak: Do you call the code you provided FaceNormalizer? Can you help me understand where to find the values for the variables in your FaceNormalizer?

( 2015-02-01 21:05:20 -0500 )

So have I got this right for a rough draft of one possible solution:

1) Use opencv_traincascade to train face_cascade
2) Use face_cascade to find the face within each image
3) Use Flandmark (or dlib) to find point detectors within the face
4) Use FaceNormalizer to extract the face from the image(?), rotate, and resize the extracted face
5) Use imwrite() to save the face-file to disk
6) Use ? to flip the image in the horizontal
7) Use imwrite() to save the now-flipped face-file to disk
8) Repeat until all images have been processed
9) Use phash to find similar faces among the files

( 2015-02-01 22:34:31 -0500 )
• hmm, no idea if making copies of 150000 images is feasible. you could maybe do it all in memory: load one img, preprocess it, load a 2nd, preprocess & compare, load a 3rd, preprocess & compare, etc.
• eye-alignment / normalizing will only work if both eyes are visible. that's a problem if you've got profile faces, too. (what kind of variables did you mean, besides eye-distance and scale / crop factors?)
• i don't get the 'flipped' part. are there images rotated 90° or such? maybe one would have to rotate / flip the image until the face detection finds something, and restrict it to the frontal / left profile case.
• again, phash is only one means to calculate a distance. within opencv, i'd rather try lbph, hog or sift features to compare
( 2015-02-02 00:50:44 -0500 )

let me add that the default LBP and HAAR models for frontal faces have a region of interest that should normally lie inside the face. This reduces unnecessary background information. If that is not the case on your set, then crop all your detections around their center point by, for example, 15%.
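
One way to read "crop around the center point by 15%" is to shrink the detection box by 15% of its width and height while keeping its center fixed. That reading is an assumption, and `Box` and `shrinkAroundCenter` are illustrative names rather than OpenCV types:

```cpp
#include <cmath>

struct Box { int x, y, width, height; };     // hypothetical stand-in for a detection rectangle

// Shrink a detection box around its center by the given fraction of its
// width and height (0.15 keeps the central 85% in each direction).
Box shrinkAroundCenter(Box box, double fraction = 0.15) {
    int newWidth  = static_cast<int>(std::lround(box.width  * (1.0 - fraction)));
    int newHeight = static_cast<int>(std::lround(box.height * (1.0 - fraction)));
    Box out;
    out.x = box.x + (box.width  - newWidth)  / 2;   // integer division rounds down
    out.y = box.y + (box.height - newHeight) / 2;
    out.width  = newWidth;
    out.height = newHeight;
    return out;
}
```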

( 2015-02-02 03:35:18 -0500 )

so, i tried my own dogfood ...

(to my surprise, a simple L2 or cosine norm seems to do fairly well)
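
For reference, the cosine measure can be sketched like this, assuming two equally sized pixel or feature vectors; `cosineSimilarity` is an illustrative name, and returning 0 for an all-zero vector is a convention picked for this sketch:

```cpp
#include <cmath>
#include <cstddef>
#include <stdexcept>
#include <vector>

// Cosine similarity: 1.0 for vectors pointing the same way, 0.0 for
// orthogonal ones.
double cosineSimilarity(const std::vector<float>& a, const std::vector<float>& b) {
    if (a.size() != b.size()) throw std::invalid_argument("size mismatch");
    double dot = 0.0, normA = 0.0, normB = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        dot   += static_cast<double>(a[i]) * b[i];
        normA += static_cast<double>(a[i]) * a[i];
        normB += static_cast<double>(b[i]) * b[i];
    }
    if (normA == 0.0 || normB == 0.0) return 0.0;   // convention for all-zero input
    return dot / (std::sqrt(normA) * std::sqrt(normB));
}
```

Multiplying every value of one vector by the same positive factor leaves the result unchanged, which may help with copies that differ mainly by a uniform brightness scaling. Real brightening filters are not necessarily that simple.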

( 2015-02-02 03:51:30 -0500 )

That sample just loves the way Angelina is looking at you!

( 2015-02-02 03:56:26 -0500 )

Hi guys, a 'flipped' image is one that is a mirror image of another. That is, the image is loaded into, say, GIMP, then that image is flipped side-to-side and stored in the collection. So, for example, after flipping Angelina's image above, we would have two images of Angelina, one with her looking off to her right and one with her looking off to her left. There are a significant number of flipped pairs in the image collection. Would using cv::flip be a good way to find these mirror images?

( 2015-02-02 10:02:03 -0500 )

^^ ok, ok. so just horizontal flip, or cv::flip(img,img,1);

no idea btw, how to detect, which one is the flipped version.
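
Even without knowing which copy is the flipped one, a pair can still be matched by comparing against both orientations and keeping the smaller distance. A sketch on flat row-major grayscale buffers; `flipHorizontal` and `mirrorAwareDistance` are illustrative names, and the distance function is whatever measure is used elsewhere:

```cpp
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Mirror a row-major grayscale image left-to-right (a horizontal flip).
// Assumes gray.size() == width * height.
std::vector<std::uint8_t> flipHorizontal(const std::vector<std::uint8_t>& gray,
                                         int width, int height) {
    std::vector<std::uint8_t> out(gray);
    for (int y = 0; y < height; ++y) {
        auto rowBegin = out.begin() + static_cast<std::ptrdiff_t>(y) * width;
        std::reverse(rowBegin, rowBegin + width);
    }
    return out;
}

// Hypothetical mirror-aware comparison: the smaller of the distances from a
// to b as-is and from a to the mirrored b.
template <typename Dist>
double mirrorAwareDistance(const std::vector<std::uint8_t>& a,
                           const std::vector<std::uint8_t>& b,
                           int width, int height, Dist distance) {
    double direct   = distance(a, b);
    double mirrored = distance(a, flipHorizontal(b, width, height));
    return std::min(direct, mirrored);
}
```

This doubles the work per pair but avoids having to store a flipped copy of every file on disk.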

( 2015-02-02 10:06:36 -0500 )

Sometimes a visual inspection gives clues as to which one is the flipped version. Otherwise I just keep the one with the higher dimensions or larger file size.

( 2015-02-02 10:20:32 -0500 )

berak, could you give me an explanation of what the code that you provided above does? Does it use the output from Flandmark (or dlib)? Or does it depend on the result from eye_cascade? I'd like to be able to describe to my son what each step in the process, from beginning to end, does.

( 2015-02-02 10:25:04 -0500 )
