cv::ml::StatModel::calcError not working for responses of type CV_32S

asked 2015-03-10 09:53:48 -0600

grtlr

updated 2015-03-10 10:41:49 -0600

I am using the master branch of the repository (hash: 361eb633f6e841bcda18f970193fc4fb439bc4c8). I have a feature vector consisting of several ordered variables. My responses, on the other hand, are categorical and of type CV_32S. I now want to train an RTrees model for this problem. The documentation of TrainData::create() states that responses of type CV_32S are allowed:

responses – matrix of responses. If the responses are scalar, they should be stored as a single row or as a single column. The matrix should have type CV_32F or CV_32S (in the former case the responses are considered as ordered by default; in the latter case - as categorical)

In the documentation of RTrees I can't find a reason for this to be illegal.

However if I train my RTrees as follows:

#include <iostream>
#include <random>

#include <opencv2/core.hpp>
#include <opencv2/ml.hpp>

using namespace std;
using namespace cv;
using namespace cv::ml;

int main()
{
    random_device rd;
    mt19937 gen( rd() );
    uniform_real_distribution<> dis( 0, 1 );
    uniform_int_distribution<> dis1( 0, 1 );

    int samples = 100;

    Mat_<float> train( samples, 3 );
    for ( auto & x : train ) { x = dis( gen ); }

    // CASE #1
    //Mat_<int> resp( samples, 1 );
    //for ( auto & x : resp ) { x = dis1( gen ); }

    // CASE #2
    Mat resp( samples, 1, CV_32S );
    for ( auto it = resp.begin<int>(); it != resp.end<int>(); ++it ) { *it = dis1( gen );}

    // CASE #3
    //Mat_<float> resp( samples, 1 );
    //for ( auto & x : resp ) { x = dis1( gen ); }

    Mat_<char> types( train.cols + 1, 1 );
    types.setTo( cv::Scalar( VAR_ORDERED ) );
    types( train.cols, 0 ) = VAR_CATEGORICAL;

    Ptr<TrainData> tdata = TrainData::create( train, ROW_SAMPLE, resp, noArray(), noArray(), noArray(), types );
    Ptr<RTrees> rf = RTrees::create();

    rf->train( tdata );

    Mat_<float> calc_out;
    cout << "calc error: " << rf->calcError( tdata, false, noArray() ) << endl;

    Mat_<float> pred_out;
    rf->predict( tdata->getTrainSamples(), pred_out );

    int missclass = 0;
    for ( int i = 0; i < pred_out.rows; ++i )
    {
        Mat_<float> r = tdata->getTrainResponses();
        if ( pred_out( i, 0 ) != r( i, 0 ) )
        {
            missclass++;
        }
    }
    cout << "pred error: " << missclass / ( float )samples << endl;

    return 0;
}

A Gist of this can also be found here: LINK

In Case #1 and Case #2 the output is something like the following:

calc error: 46
pred error: 0.17

Only for Case #3 is the error computed correctly:

calc error: 24
pred error: 0.24

Question #1: Is this behavior intended? If so, maybe it should be clarified in the documentation of StatModel, RTrees or TrainData?

The problem seems to be in this part of the StatModel::calcError() method:

...
float val = predict(sample);
float val0 = responses.at<float>(si);

if( isclassifier )
    err += fabs(val - val0) > FLT_EPSILON;
...

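If responses is of type CV_32S, I believe at<float>(si) reinterprets the stored integer bits as a float instead of converting the value, so val0 would be different than expected.

I think this could be fixed by checking the type of responses and switching between at<float> and at<int> accordingly.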
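
To illustrate the suspected cause without OpenCV, here is a minimal sketch that reinterprets the bytes of a 32-bit integer label as a float and then applies the same comparison as the snippet above. The helper names bitsAsFloat and countsAsError are made up for this illustration, and the sketch assumes 32-bit IEEE-754 floats.

```cpp
#include <cfloat>
#include <cmath>
#include <cstdint>
#include <cstring>

// Hypothetical helper: reinterpret the 4 bytes of an integer label as a float.
// This mimics reading CV_32S storage through at<float>(); no value conversion happens.
inline float bitsAsFloat(std::int32_t label)
{
    static_assert(sizeof(float) == sizeof(std::int32_t), "sketch assumes 32-bit float");
    float f;
    std::memcpy(&f, &label, sizeof f);
    return f;
}

// Hypothetical helper: the same test as the classifier branch quoted above.
inline bool countsAsError(float predicted, float stored)
{
    return std::fabs(predicted - stored) > FLT_EPSILON;
}
```

Under these assumptions, label 0 reads back as 0.0f, but label 1 reads back as a tiny subnormal value (about 1.4e-45) that is indistinguishable from 0 at FLT_EPSILON precision. With labels in {0, 1}, the check then reduces to "prediction is not 0", so the reported number would be the share of samples predicted as class 1, regardless of the true label. That would be consistent with a value near 50 on roughly balanced random labels.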

Question #2: I was quite confused that calcError returns a value in the range 0 <= x <= 100, although the return type is float. In my opinion, a return value in the range 0 <= x <= 1 would be more appropriate. What do you think?

Conclusion Should this be posted to ... (more)


1 answer


answered 2015-10-02 14:25:44 -0600

rafaoc

About Question #1:
I also ran into this problem with calcError and checked the source code of the function. The fix is what you propose: choosing between at<float> and at<int> depending on the type of the responses. That is, replace

float val0 = responses.at<float>(si);

with

float val0 = responses.type()== CV_32S ? (float)responses.at<int>(si) : responses.at<float>(si);

The type check is needed to determine whether the stored responses are float or integer, because cv::ml::TrainData accepts both integer and float responses when it is created.
The (float) cast is necessary because the output of the predict function used in calcError is also a float (float val = predict(sample);), and val and val0 are then compared.
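
The same idea can be sketched outside of OpenCV. The following is only an illustration under stated assumptions: Responses is a hypothetical stand-in for a single-column response matrix that holds either ints or floats (mirroring CV_32S and CV_32F), and classificationErrorPercent is a made-up name, not an OpenCV function.

```cpp
#include <cfloat>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for a response column stored as CV_32S or CV_32F.
struct Responses
{
    bool isInt;                  // true: values live in ints, false: in floats
    std::vector<int> ints;
    std::vector<float> floats;

    std::size_t size() const { return isInt ? ints.size() : floats.size(); }

    // Type-aware read: convert the stored value instead of reinterpreting its bits.
    float at(std::size_t i) const
    {
        return isInt ? static_cast<float>(ints[i]) : floats[i];
    }
};

// Misclassification rate in percent (0..100), the scale observed in the question.
// Returns 0 for empty input or mismatched sizes.
inline float classificationErrorPercent(const std::vector<float>& predicted,
                                        const Responses& responses)
{
    const std::size_t n = responses.size();
    if (n == 0 || predicted.size() != n)
        return 0.0f;

    std::size_t errors = 0;
    for (std::size_t i = 0; i < n; ++i)
    {
        if (std::fabs(predicted[i] - responses.at(i)) > FLT_EPSILON)
            ++errors;
    }
    return 100.0f * static_cast<float>(errors) / static_cast<float>(n);
}
```

With this kind of read, integer and float responses holding the same labels give the same error. Dividing the result by 100 yields the 0 <= x <= 1 fraction discussed under Question #2.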

About Question #2:
I had the same confusion, and I also think an output error in the range 0 <= x <= 1 would be better.

