Matching kmeans clusters generating non random sequence
I've been creating a bag-of-words texture classifier using Gaussian filterbanks. I've recently found a fairly fundamental flaw: after I collect and save a set of 'model' histograms from training images, if I then generate another histogram from one of the training images using an identical procedure and match it with compareHist using chi-squared, it doesn't give a perfect match. Instead I get a set of seemingly random distances, which recur exactly if the process is repeated from scratch.
I've done this in a loop (generating a histogram and matching it to a saved histogram of the same image), and an example of the distances compareHist returns is below.
{4,6,6,4,3,4,5,6,4,6,5,3,3,5,6,5,6,5,6,4,5,5,4,5,4,4,6,3,5}
I can't understand why the distance:
- isn't zero
- but is also identical each time I repeat it
I'm using the bag-of-words trainer to generate the clusters, with KMEANS_PP_CENTERS used to calculate the initial centres. I then compare those clusters with chi-squared.
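For reference, here is a minimal plain C++ sketch of the chi-squared distance as I understand CV_COMP_CHISQR to compute it: the sum over bins of (a - b)^2 / a, skipping bins where a is zero. The function name chiSquare is made up for illustration and is not part of OpenCV. It shows that two identical histograms must score exactly zero, so any non-zero distance means the histograms themselves differ.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper (not an OpenCV function): chi-squared distance
// computed as sum_i (a[i] - b[i])^2 / a[i], skipping bins where a[i] is zero.
double chiSquare(const std::vector<double>& a, const std::vector<double>& b) {
    double distance = 0.0;
    // Only compare the bins both histograms have.
    const std::size_t bins = a.size() < b.size() ? a.size() : b.size();
    for (std::size_t i = 0; i < bins; ++i) {
        if (a[i] != 0.0) {
            const double diff = a[i] - b[i];
            distance += diff * diff / a[i];
        }
    }
    return distance;
}
```

Note that this form is not symmetric: swapping the two histograms can change the result, because only the first histogram appears in the denominator.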
Is this something which could be due to my code or from the clustering?
Thank you in advance. This has been driving me crazy, and my dissertation is due in a week and a half, so it's stressful.
Note: this is a partial repost of my other post here, but that's mainly because I need an answer pretty quickly, as it's holding up my project. Thanks.
Below is a basic example of the kmeans variation I'm talking about, although not my specific problem:
    int Flags = KMEANS_PP_CENTERS;
    TermCriteria Tc(TermCriteria::MAX_ITER + TermCriteria::EPS, 1000, 0.0001);
    float histArr[] = {0, 255};
    const float* hist = {histArr};
    int histSize[] = {10};
    int channels[] = {0};
    Mat ou1;
    namedWindow("testWin", CV_WINDOW_AUTOSIZE);
    vector<Mat> compareMe;
    for (int i = 0; i < 2; i++) {
        // A fresh trainer on every pass: 30 clusters, 5 kmeans attempts.
        BOWKMeansTrainer tstTrain(30, Tc, 5, Flags);
        Mat img1 = imread("../lena.png", CV_LOAD_IMAGE_GRAYSCALE);
        imshow("testWin", img1);
        waitKey(1000);
        // filterHandle(img1, imgOut, filterbank, n_sigmas, n_orientations);
        cout << "This is the size.." << img1.rows << " cols: " << img1.cols << endl;
        Mat imgFlat = reshapeCol(img1);
        cout << "This is the size.." << imgFlat.rows << " cols: " << imgFlat.cols << endl;
        tstTrain.add(imgFlat);
        Mat clusters = tstTrain.cluster();
        // calcHist expects the range pointer (hist), not the address of the array.
        calcHist(&clusters, 1, channels, Mat(), ou1, 1, histSize, &hist, true, false);
        // Clone before storing: otherwise both entries may share ou1's buffer,
        // which the next pass overwrites, and the comparison would always be 0.
        compareMe.push_back(ou1.clone());
        tstTrain.clear();
        cout << "This is the tstTrain.size(): " << tstTrain.descriptorsCount() << endl;
    }
    double value = compareHist(compareMe[0], compareMe[1], CV_COMP_CHISQR);
    cout << "This is the Chisqr comparison.." << value << endl;
    compareMe.clear();
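Below is the reshape function:
    // Flattens a single-channel 8-bit image into a (rows*cols) x 1 float column.
    Mat reshapeCol(Mat in){
        Mat points(in.rows * in.cols, 1, CV_32F);
        int cnt = 0;
        cout << "inside. These are the rows: " << in.rows << " and cols: " << in.cols << endl;
        for (int row = 0; row < in.rows; row++) {
            for (int col = 0; col < in.cols; col++) {
                // at<>() takes (row, col); a grayscale image holds uchar, not Vec3b.
                points.at<float>(cnt, 0) = in.at<uchar>(row, col);
                cnt++;
            }
        }
        return points;
    }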
This is my GitHub; my specific problem is both the variation when I generate models ...
looking at some code might be helpful. (also, let's just pretend, that SO does not exist)
then, just to make sure i understood you: the BOW training/clustering is an offline task, right? you'd do that only once, then keep the dictionary around to later get distance-histograms from your image features (which should be totally deterministic)
When you save the histogram of the image, how do you save it? It could be due to compression errors!
Yes, absolutely: texton and model generation happen separately, and each is saved to its own XML file. There is variation in the models generated from the same image (I've put this down in part to kmeans variation), but as it's stored, that shouldn't affect the clustering.
The main reason I found this bug was that when I changed the number of training images, the results for the images changed (even if the ones I added were duplicates). I noticed this only happened to images which were loaded after the duplicates.
I generated a new instance of BOWTrainer for every new image to confirm it wasn't something carried over, and the models stay the same, so I have run out of ideas of where to look. This is the repo; the testing module is novelImgTest.cpp
let me ask again: i'd use the BOWKMeansTrainer only for the clustering phase, which would result in a dictionary of N(clusters) x M(featuresize)
then, for later training/testing on the actual dataset, i'd do it like this:
translate each feature into a distance vector, that is: get the distance from the feature to each row in the dictionary, and save it as one value in a 'distance vector'. this is our new BOW feature. (that's what the BOWImgDescriptorExtractor actually does on SIFT/SURF features.)
then train an ml model on those distance vectors (or go cheap, and use histogram comparison or cosine distance)
i don't see where repeated calls to BOWKMeansTrainer would ever happen there?
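To illustrate why the second phase can be fully deterministic, here is a rough plain C++ sketch, not the OpenCV API. It uses a common variant of the idea above: instead of keeping the distance to every dictionary row, each feature votes for its nearest row, which gives a histogram of word counts. The names Feature, nearestWord and bowHistogram are hypothetical, and the sketch assumes every feature has the same length as the dictionary rows.

```cpp
#include <cstddef>
#include <limits>
#include <vector>

using Feature = std::vector<float>;  // hypothetical: one filter-response vector

// Index of the dictionary row closest (squared Euclidean distance) to f.
std::size_t nearestWord(const Feature& f, const std::vector<Feature>& dictionary) {
    std::size_t best = 0;
    float bestDist = std::numeric_limits<float>::max();
    for (std::size_t k = 0; k < dictionary.size(); ++k) {
        float dist = 0.0f;
        for (std::size_t d = 0; d < f.size(); ++d) {
            const float diff = f[d] - dictionary[k][d];
            dist += diff * diff;
        }
        if (dist < bestDist) {
            bestDist = dist;
            best = k;
        }
    }
    return best;
}

// Count how many features land on each dictionary row.
// The dictionary is fixed and no random numbers are drawn here,
// so the same features always give the same histogram.
std::vector<int> bowHistogram(const std::vector<Feature>& features,
                              const std::vector<Feature>& dictionary) {
    std::vector<int> hist(dictionary.size(), 0);
    if (dictionary.empty()) {
        return hist;  // nothing to vote for
    }
    for (const Feature& f : features) {
        ++hist[nearestWord(f, dictionary)];
    }
    return hist;
}
```

With a saved dictionary, building the histogram for the same image twice gives identical results, so a chi-squared comparison between them would be zero.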
Isn't that for a keypoint-based approach? My implementation needs to use the MR8 filterbank, effectively recreating this
With that method you call it three times: once to generate the textons, then for the models and for the novel image, which use the same method except that the model has a known class.
i understood, that you're using a filterbank, and then use histograms from the output as a feature vector.
hmm, do you really need the BOWKMeansTrainer to generate your features for training and testing ? that sounds pretty artificial to me.
wouldn't it be just a straight norm(histogram, dictionary_row) x K ?
Possibly, although I don't want to change too much at this stage if possible, as hand-in is within a couple of weeks. Do you think that could cause the sequence which it outputs? I know kmeans can have variations, but the fact that the variations are the same each time seems to indicate that it's related to the number of iterations it runs, yet I've reinitialised a new trainer each iteration. Do you think it's something to do with the iterations or the use of BOWTrainer?
well, i bet a fiver, that your randomness comes out of kmeans. (you won't bet against me, hehe)
and that it's always the same, repeatable sequence is easily explained once you look here. the global cv::theRNG() gets initialized once on program startup with 0xffffffff, thus it will always generate the same sequence.
again, imho, you should use the BOWKMeansTrainer only for the initial clustering phase
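A small sketch of the effect described above, using std::mt19937 as a stand-in for the global OpenCV generator (an assumption for illustration only; this is neither cv::theRNG() nor kmeans++ itself). The helper names drawCentres and simulateRun are made up. One generator is seeded once per "program run" and then shared by two clusterings: each clustering gets different draws, but every run replays exactly the same draws.

```cpp
#include <cstdint>
#include <random>
#include <utility>
#include <vector>

// Hypothetical stand-in for choosing initial centres: draw `count`
// point indices in [0, numPoints) from a shared generator.
std::vector<int> drawCentres(std::mt19937& rng, int count, int numPoints) {
    std::uniform_int_distribution<int> pick(0, numPoints - 1);
    std::vector<int> centres;
    centres.reserve(count);
    for (int i = 0; i < count; ++i) {
        centres.push_back(pick(rng));
    }
    return centres;
}

// Simulates one program run: the generator is seeded once at start-up,
// then two successive clusterings pull their initial centres from it.
std::pair<std::vector<int>, std::vector<int>> simulateRun(std::uint32_t seed) {
    std::mt19937 rng(seed);
    std::vector<int> first = drawCentres(rng, 5, 1000);
    std::vector<int> second = drawCentres(rng, 5, 1000);
    return {first, second};
}
```

Calling simulateRun twice with the same seed returns the same pair both times, while the two clusterings inside one run will in general start from different centres. That matches distances that are non-zero yet identical on every repeat.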
Ah, that makes sense!! Hmm, OK, so is there any way to remove this inaccuracy? Outside of running test images or model generation multiple times per image, or just testing/adding one model at a time, I can't think how to get around this.
Actually, to rephrase: what do you think is the best method to do a filterbank-based bag of words? There has to be a method which doesn't require you to duplicate training images to increase your chances of a match.