
KMean and PCA connection

asked 2014-02-28 01:50:55 -0600

updated 2014-02-28 01:54:41 -0600

berak

As I understand pattern recognition, PCA is used to remove unnecessary data from the dataset, so that K-means has less data to process than it would with a dataset that has not gone through PCA. So I could have code (pseudocode) something like this:

 assign .csv to var DATA
 PCA_DATA = PCAcompute(DATA)
 result = Kmean(PCA_DATA)
 plotToGraph(result)

Am I correct?

I've been looking for almost a MONTH now for sample programs that import a CSV and then do some clustering with PCA. What I need to do is compare a K-means result on the raw data with a K-means result on the PCA-reduced data, using the iris dataset.


1 answer


answered 2014-02-28 06:17:12 -0600

Elix

I have not used K-means, but I have used PCA to reduce the number of features in my neural network training data, through the C++ interface of OpenCV. Let's start by reading the CSV file. My CSV file looks like this:

im_path_1;label1
im_path_2;label2

To read that CSV file, I use this function:

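#include <opencv2/core/core.hpp>
#include <opencv2/highgui/highgui.hpp>
#include <cstdlib>
#include <fstream>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

using namespace cv;
using namespace std;

// Reads lines of the form "image_path;label" and loads each image as grayscale.
void read_csv(const string& filename, vector<Mat>& images, vector<int>& labels, char separator = ';')
{
    std::ifstream file(filename.c_str(), ifstream::in);
    if (!file)
    {
        string error_message = "No valid input file was given, please check the given filename.";
        CV_Error(CV_StsBadArg, error_message);
    }
    string line, path, classlabel;
    while (getline(file, line))
    {
        stringstream liness(line);

        getline(liness, path, separator);   // text before the separator: image path
        getline(liness, classlabel);        // rest of the line: class label

        if(!path.empty() && !classlabel.empty())
        {
            Mat im = imread(path, 0);       // 0 = load as single-channel grayscale
            if (im.empty())
                continue;                   // skip files that could not be read

            images.push_back(im);
            labels.push_back(atoi(classlabel.c_str()));
        }
    }
}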
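
The function above expects image paths. For purely numeric data such as the iris dataset from the question, a reader along the following lines might work instead. This is only a sketch: the name read_numeric_csv, the comma separator, and the assumption that the last column holds the class label as text are mine, not part of OpenCV. It uses only the standard library, so the rows would still have to be copied into a CV_32F Mat (one sample per row) before going to PCA.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper (not an OpenCV function): reads rows such as
// "5.1,3.5,1.4,0.2,Iris-setosa" from any input stream.
// Every field except the last is parsed as a float feature; the last field
// is kept as the label string. Empty lines and rows containing a
// non-numeric feature (for example a header line) are skipped.
bool read_numeric_csv(std::istream& in,
                      std::vector<std::vector<float> >& features,
                      std::vector<std::string>& labels,
                      char separator = ',')
{
    std::string line;
    while (std::getline(in, line)) {
        // Strip a trailing carriage return left by Windows line endings.
        if (!line.empty() && line[line.size() - 1] == '\r')
            line.erase(line.size() - 1);
        if (line.empty())
            continue;

        // Split the line into fields on the separator.
        std::vector<std::string> fields;
        std::stringstream liness(line);
        std::string field;
        while (std::getline(liness, field, separator))
            fields.push_back(field);
        if (fields.size() < 2)
            continue;  // need at least one feature plus the label

        // Convert all but the last field to float.
        std::vector<float> row;
        bool ok = true;
        for (std::size_t i = 0; i + 1 < fields.size(); ++i) {
            std::istringstream conv(fields[i]);
            float value;
            if (!(conv >> value)) { ok = false; break; }
            row.push_back(value);
        }
        if (!ok)
            continue;

        features.push_back(row);
        labels.push_back(fields.back());
    }
    return !features.empty();
}
```

Passing a std::ifstream opened on the CSV file as the first argument fills features and labels. The rest of this answer continues with the image-based read_csv.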

read_csv stores the loaded images in a vector of Mat. OpenCV's PCA expects the data as a single Mat with one sample per row, so each image has to be flattened into a row vector. To do that:

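// Flattens each image into one row of a CV_32FC1 matrix, scaling values by 1/255.
// Assumes data is non-empty and all images are single-channel and the same size.
Mat rollVectortoMat(const vector<Mat> &data)
{
   CV_Assert(!data.empty());
   Mat dst(static_cast<int>(data.size()), data[0].rows*data[0].cols, CV_32FC1);
   for(unsigned int i = 0; i < data.size(); i++)
   {
      Mat image_row = data[i].clone().reshape(1,1);   // 1 channel, 1 row
      Mat row_i = dst.row(i);                          // header pointing into dst
      image_row.convertTo(row_i,CV_32FC1, 1/255.);     // writes the scaled values into dst
   }
   return dst;
}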

A simple usage of these functions:

int main()
{

    PCA pca;

    vector<Mat> images_train;
    vector<int> labels_train;

    read_csv("train1k.txt",images_train,labels_train);

    Mat rawTrainData = rollVectortoMat(images_train);   

    int pca_size = 500;

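    // Output matrix: one row per sample, pca_size columns.
    Mat trainData(rawTrainData.rows, pca_size,rawTrainData.type());
    // A test set (e.g. a rawTestData matrix built the same way) would be
    // projected with this same pca object.

    // Compute the PCA basis from the training rows, keeping pca_size components.
    pca(rawTrainData,Mat(),CV_PCA_DATA_AS_ROW,pca_size);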

    for(int i = 0; i < rawTrainData.rows ; i++)
        pca.project(rawTrainData.row(i),trainData.row(i));

    cout<<trainData.size()<<endl;

    return 0;
}

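The trainData variable is the reduced version of the training set. As for pca_size: instead of a fixed 500, you can give PCA a retained variance of 0.95 so that it keeps as many components as are needed to retain 95% of the variance (the exact overload for this depends on your OpenCV version). I hope this helps for the PCA part. I used this reduced data to train a neural network.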
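
For the K-means half of the question, which this answer does not cover: OpenCV provides cv::kmeans, which takes a CV_32F matrix with one sample per row, so the raw matrix and the PCA-reduced matrix can each be clustered and the groupings compared. To show what that step computes without depending on OpenCV, here is a minimal plain-C++ sketch of Lloyd's algorithm. The name kmeans_rows, the first-k-rows initialisation, and the fixed iteration cap are simplifications of mine and not how cv::kmeans is implemented.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

typedef std::vector<float> Row;

// Squared Euclidean distance between two rows of equal length.
static float sq_dist(const Row& a, const Row& b)
{
    float d = 0.f;
    for (std::size_t j = 0; j < a.size(); ++j)
        d += (a[j] - b[j]) * (a[j] - b[j]);
    return d;
}

// Hypothetical helper: plain Lloyd's k-means over rows of equal length.
// Centers start as the first k rows (a simplification).
// Returns one cluster index per input row.
std::vector<int> kmeans_rows(const std::vector<Row>& data, std::size_t k, int max_iter = 100)
{
    assert(k > 0 && data.size() >= k);
    std::vector<Row> centers(data.begin(), data.begin() + k);
    std::vector<int> labels(data.size(), -1);

    for (int iter = 0; iter < max_iter; ++iter) {
        // Assignment step: attach each row to its nearest center.
        bool changed = false;
        for (std::size_t i = 0; i < data.size(); ++i) {
            int best = 0;
            float best_d = std::numeric_limits<float>::max();
            for (std::size_t c = 0; c < k; ++c) {
                float d = sq_dist(data[i], centers[c]);
                if (d < best_d) { best_d = d; best = static_cast<int>(c); }
            }
            if (labels[i] != best) { labels[i] = best; changed = true; }
        }
        if (!changed)
            break;  // assignments are stable, so the centers would not move

        // Update step: move each center to the mean of its assigned rows.
        std::vector<Row> sums(k, Row(data[0].size(), 0.f));
        std::vector<int> counts(k, 0);
        for (std::size_t i = 0; i < data.size(); ++i) {
            ++counts[labels[i]];
            for (std::size_t j = 0; j < data[i].size(); ++j)
                sums[labels[i]][j] += data[i][j];
        }
        for (std::size_t c = 0; c < k; ++c) {
            if (counts[c] == 0)
                continue;  // an empty cluster keeps its previous center
            for (std::size_t j = 0; j < centers[c].size(); ++j)
                centers[c][j] = sums[c][j] / counts[c];
        }
    }
    return labels;
}
```

Running this once on the raw rows and once on the PCA-projected rows (trainData above) gives two label vectors to compare. Cluster indices are arbitrary, so compare which samples end up grouped together rather than the index values themselves.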

