
How to search for near identical images?

asked 2018-05-24 18:18:42 -0500

Pzixel

updated 2018-06-08 03:44:14 -0500

I want to implement an image DB. If I've understood it correctly, I need to use BoW, and then I would be able to tell whether I have seen a new image before.

But all examples I have seen before (e.g. this) have some set of training images; once training is done, BoW is used to classify the rest.

My situation is different: I have to search through the BoW for an image, and if there is no match then we consider it a new image; otherwise we don't add it and say "already exists". Skewed, rescaled and otherwise transformed images should be considered the same, so I used SIFT as a feature detector.

So the problem in a nutshell: at the beginning I have an empty database of images (so I have no data to train on) and a stream of incoming images. I can calculate the features of each image. Then I check whether there is some stored image whose descriptors match this image's descriptors better than some threshold; if so, I say "existing", otherwise I add these descriptors as a new image and return "new image". If I get the same (or a similar) image in the future, I say "existing".

This requires rebuilding the BoW with every new image, which seems to be a very complicated solution. All search engines on the internet are doing this, so I'm surprised there is almost no information about it on the web.

Okay, I tried to use PHash to perform the search operation, but it fails to recognize slightly different images. Here is sample code (it's a thin Rust wrapper over the C++ interface, so don't be afraid to see an alien language):

pub trait Database {
    fn save_image(&mut self, image: &[u8]);
    fn load_images(&self) -> Vec<Vec<u8>>;
}

pub enum ImageVariant {
    New,
    AlreadyExists
}

pub struct Storage<T: Database> {
    database: T,
    hasher: PHash,
    images: Vec<Mat>
}

impl<T: Database> Storage<T> {
    pub fn new(database: T) -> Self {
        Self {
            database,
            hasher: PHash::new(),
            images: Vec::new()
        }
    }
}

// the main execution happens here
impl<T: Database> Storage<T> {
    pub fn save_image_if_new(&mut self, image: &[u8], filename: &str) -> ImageVariant {
        const DIFF: f64 = 0.5;

        let mat = Mat::image_decode(image, ImageReadMode::Grayscale);
        let mat = self.hasher.compute(&mat);
        // find the smallest hash distance to any stored image
        let mut last_diff = std::f64::INFINITY;
        for image in self.images.iter() {
            let diff = self.hasher.compare(&mat, &image);
            if diff < last_diff {
                last_diff = diff;
            }
        }
        if last_diff < DIFF {
            return ImageVariant::AlreadyExists;
        }
        // no close match: store the hash and report a new image
        self.database.save_image(image);
        self.images.push(mat);
        ImageVariant::New
    }
}
Then I run the test on the following images:

  1. lenna.png (the original)
  2. lenna_demotivator.png (a modified version of image 1)
  3. an unrelated third image

So it gives distance 28 for images 1 and 2 (which should get a small distance), while it gives 30 for images 1 and 3 (which should get a big distance). So it practically doesn't let me tell whether I have already seen an image before.

The entire code is available here and here

Test code:

fn it_works() {
    let lenna = fs::read(get_asset_path("lenna.png")).unwrap();
    let lenna_demotivator = fs::read(get_asset_path("lenna_demotivator.png")).unwrap();
    let ...


sorry, but your question is too fuzzy, or something, to be answered, or something.

also, you're kinda asking 5 questions at the same time.

split it up into small, separate questions / problems, show what you tried so far and at what point you got stuck, then we can (try to) help with it.

berak ( 2018-05-25 03:22:27 -0500 )

@berak okay, I removed some of the considerations, if you like. My question is straightforward: how can such a DB be implemented? All examples I have seen (e.g. this) have a training set of images and a working set of images, while I have to have a dynamically growing storage. There are no examples or information on how to do so, while rebuilding the BoW every time I add an image seems to be impractical.

Pzixel ( 2018-05-25 03:28:22 -0500 )

no, you only build the bow vocabulary once (and save it somehow, e.g. using cv::FileStorage). you need enough data for this, though, so you find representative clusters even for unseen (future) images.

if you're using a flann::Index, (on BowimageDescriptors) you'll have to train it on startup of your program (it's pretty fast), you can serialize the train data, though, and add new items on demand.

how to store it? idk. opencv does not have any (good) means for that. using a db for this is a bit impractical, since you need ALL of the training data at once to train your index. the better ideas i've seen here went like: make chunks of ~1000 feature vectors, and append new data to the last. once it's full, add a new chunk.

berak ( 2018-05-25 03:40:31 -0500 )
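The chunked storage berak sketches (append to the last chunk, open a new one once it is full, retrain the index from all chunks) might look like this in plain Rust; the type and names here are illustrative, not from any existing crate:

```rust
// Minimal sketch of chunked feature storage: fixed-size chunks of
// feature vectors, appended to the last chunk; a new chunk is opened
// once the last one is full. All vectors can still be iterated at
// once to retrain an index from scratch.
const CHUNK_SIZE: usize = 1000;

struct ChunkedStore {
    // each inner Vec is one chunk holding up to CHUNK_SIZE feature vectors
    chunks: Vec<Vec<Vec<f32>>>,
}

impl ChunkedStore {
    fn new() -> Self {
        Self { chunks: vec![Vec::new()] }
    }

    fn append(&mut self, feature: Vec<f32>) {
        if self.chunks.last().unwrap().len() >= CHUNK_SIZE {
            self.chunks.push(Vec::new()); // last chunk full: open a new one
        }
        self.chunks.last_mut().unwrap().push(feature);
    }

    // gather every stored vector, e.g. to retrain a flann::Index on startup
    fn all_features(&self) -> impl Iterator<Item = &Vec<f32>> {
        self.chunks.iter().flat_map(|c| c.iter())
    }
}
```

In practice each chunk could map to one file or one DB row, so appending only ever rewrites the last chunk.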

@berak I have rewritten my question more clearly; please feel free to give new feedback if I'm doing something wrong. I tried to provide some context (just in case) together with an exact question: "How can a BoW be incrementally filled and used simultaneously?"

Pzixel ( 2018-05-25 03:43:55 -0500 )

I can calculate features of this image. Then I have check if there is some images which descriptors match descriptors of this image

no, wait, Bow does not work like this.

instead you compare your image features to the BoW cluster centers, and increase a bin for the best matching cluster feature. the resulting histogram is the final, BoW feature, that goes into your index (for either testing or training)

berak ( 2018-05-25 03:50:08 -0500 )
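The histogram construction berak describes (compare each local feature to the vocabulary's cluster centers, increase the bin of the best-matching cluster) can be sketched in plain Rust; this is an illustration of the idea, not OpenCV's BOWImgDescriptorExtractor API:

```rust
// Build a BoW histogram: assign each local descriptor to its nearest
// vocabulary cluster center (squared L2 distance) and count the hits
// per cluster. The resulting fixed-length histogram is the final BoW
// feature, regardless of how many descriptors the image produced.
fn bow_histogram(descriptors: &[Vec<f32>], centers: &[Vec<f32>]) -> Vec<u32> {
    let mut hist = vec![0u32; centers.len()];
    for d in descriptors {
        // index of the closest cluster center
        let best = centers
            .iter()
            .enumerate()
            .min_by(|(_, a), (_, b)| {
                let da: f32 = a.iter().zip(d).map(|(x, y)| (x - y) * (x - y)).sum();
                let db: f32 = b.iter().zip(d).map(|(x, y)| (x - y) * (x - y)).sum();
                da.partial_cmp(&db).unwrap()
            })
            .map(|(i, _)| i)
            .unwrap();
        hist[best] += 1; // increase the bin for the best-matching cluster
    }
    hist
}
```

This also shows why the histogram fixes the "3 features in one image, 17 in another" problem mentioned below: the output length depends only on the vocabulary size.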

Well, maybe BoW is not the way to go here and it's an XY problem. This is why I added so much context, to make a weighted decision. What approach should I choose then? I would love to mark it as an answer if you have any idea how it could be implemented.

Pzixel ( 2018-05-25 03:53:24 -0500 )

and again, afaik, you cannot retrain an index, and retain the old tree structure (append data).

you have to retrain it from scratch, but at least you can try to store the features in a (more or less) dynamic way

berak ( 2018-05-25 03:56:35 -0500 )

Well, maybe BoW is a no way here

again, you cannot use matching sift features from images directly here, because most of the matches are insignificant for clustering/classification. there is also the dimension problem (3 features in 1 image, 17 in another), which makes it impossible to use with an index (or any other ml).

the Bow approach is an attempt to overcome those problems. (and commonly used for CBIR like situations)

maybe you have to look at improvements of that idea, like VLAD or fisher-vectors ?

berak ( 2018-05-25 04:06:07 -0500 )

to get over the XY problem, you'd have to go all the way back and give us a specification of what you're trying to achieve, using what kind of data, etc.

berak ( 2018-05-25 04:07:28 -0500 )

I have a chat. I have a webhook that triggers when someone adds an image to the chat. I have to say whether this image was previously posted or not. An image is considered previously posted if it looks similar to one of the previously posted ones, so rescaled, skewed etc. images are recognized. That's the whole backstory.

Pzixel ( 2018-05-25 04:31:24 -0500 )

1 answer


answered 2018-05-25 04:37:38 -0500

berak

updated 2018-06-08 01:33:09 -0500

I have a chat. I have a webhook that triggers when someone adds an image to the chat. I have to say whether this image was previously posted or not. An image is considered previously posted if it looks similar to one of the previously posted ones, so rescaled, skewed etc. images are recognized. That's the whole backstory.

maybe that problem (finding (close) duplicates) can be solved far more easily (and much more cheaply) using image hashes, instead of SIFT/BoW features. e.g. phash results in a 64-bit entry that can be easily stored and searched in a common relational database (like postgres, which supports a HAMMING norm, iirc)


while i would not say your 2 lena images are the same, or even "sufficiently similar", it's you who has to decide that, not me...

wrote a little test code:

void hash_fun(Ptr< img_hash::ImgHashBase > hasher) {
    Mat i1; hasher->compute(imread("lena1.png"), i1);
    Mat i2; hasher->compute(imread("lena2.png"), i2);
    Mat i3; hasher->compute(imread("solvay.jpg"), i3);

    cout << "lena1 -> lena2 : " << hasher->compare(i1,i2) << endl;
    cout << "lena1 -> solvay: " << hasher->compare(i1,i3) << endl;
    cout << "lena2 -> solvay: " << hasher->compare(i2,i3) << endl;
}

int main(int argc, char** argv)
{
    cout << endl << "average" << endl;
    hash_fun(img_hash::AverageHash::create());
    cout << endl << "blockmean" << endl;
    hash_fun(img_hash::BlockMeanHash::create());
    cout << endl << "colormoments" << endl;
    hash_fun(img_hash::ColorMomentHash::create());
    cout << endl << "marrhildreth" << endl;
    hash_fun(img_hash::MarrHildrethHash::create());
    cout << endl << "radialvariance" << endl;
    hash_fun(img_hash::RadialVarianceHash::create());
    cout << endl << "phash" << endl;
    hash_fun(img_hash::PHash::create());
    return 0;
}

resulting in:

average
lena1 -> lena2 : 31
lena1 -> solvay: 30
lena2 -> solvay: 23

blockmean
lena1 -> lena2 : 131
lena1 -> solvay: 113
lena2 -> solvay: 136

colormoments
lena1 -> lena2 : 0.6555
lena1 -> solvay: 22.5625
lena2 -> solvay: 22.4344

marrhildreth
lena1 -> lena2 : 273
lena1 -> solvay: 307
lena2 -> solvay: 324

radialvariance
lena1 -> lena2 : 0.522041
lena1 -> solvay: 0.30779
lena2 -> solvay: 0.485308

phash
lena1 -> lena2 : 28
lena1 -> solvay: 30
lena2 -> solvay: 40

so, it's like it always is: we have to watch our data closely, and pick the best-fitting algo for it.
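Since a pHash is only 64 bits, the compare step above boils down to a Hamming distance (popcount of the XOR of the two hashes), which is also what a DB-side HAMMING norm would compute. A minimal Rust sketch, with made-up helper names:

```rust
// Hamming distance between two 64-bit image hashes:
// count the bit positions where the hashes differ.
fn hamming(a: u64, b: u64) -> u32 {
    (a ^ b).count_ones()
}

// "already seen?" check over a flat list of stored hashes:
// any stored hash within max_dist bits counts as a duplicate.
fn is_duplicate(stored: &[u64], candidate: u64, max_dist: u32) -> bool {
    stored.iter().any(|&h| hamming(h, candidate) <= max_dist)
}
```

A linear scan like this is cheap even for many thousands of images, since each comparison is a single XOR plus popcount; the threshold (`max_dist`) is the knob the data above forces you to tune per hash algorithm.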



I upvote it for now; I'll accept it when I try it (if you don't mind).

Pzixel ( 2018-05-25 05:01:58 -0500 )

whatever you please ... (last thing on earth i need is: more karma here, though, hehe ;)

berak ( 2018-05-25 05:05:44 -0500 )

Well, I face an issue: OpenCV(3.4.1) Error: Assertion failed (_src1.sameSize(_src2) && _src1.type() == _src2.type()) in norm. How can it be solved? The database may contain images of different sizes...

Pzixel ( 2018-06-06 16:31:36 -0500 )

@berak please, see edit.

Pzixel ( 2018-06-07 16:25:52 -0500 )

^^ you probably tried to compare() the images directly, but it expects hashes instead.

berak ( 2018-06-08 01:24:45 -0500 )

@berak I did that once, but I got an error that the Mats have different sizes. So I studied the documentation, and I do actually compare hashes. You can examine the code (it's Rust, but it's not complicated at all; see the save_image_if_new method); it actually compares hashes. You can run it on these images yourself and see what results you get. As I see, you did, and got the same numbers, 28 vs 30. I'm afraid that gives a very bad signal/error level. Well, I'll try radialvariance; however, I'm not sure it will work fine.

Pzixel ( 2018-06-08 03:43:00 -0500 )

Hmm, I didn't notice that colormoments gave very good results. Going to try it. Thank you.

Pzixel ( 2018-06-08 03:55:03 -0500 )

you definitely need to try with more data !

(and the rust code actually looks nice! i don't think you'll have problems there)

berak ( 2018-06-08 04:16:59 -0500 )

@berak well, I implemented them all :D so I won't need to choose one.

Pzixel ( 2018-06-08 15:59:53 -0500 )
