cv::ml::RTrees importance calculation

asked 2019-11-21 10:23:48 -0600

JoseAFern

updated 2019-11-21 10:25:45 -0600

I am using cv::ml::RTrees and calling getVarImportance() on the model after training. From what I have read in the links provided by OpenCV, the importance is calculated by permutation: the values of each feature are permuted in turn, and if the error on the out-of-bag samples increases after permuting a specific feature, that feature is considered important (i.e. it gets a higher importance value). This all makes sense.
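To make the mechanism concrete, here is a minimal sketch of permutation importance as I understand it. It is not OpenCV's implementation: `Model`, `accuracy`, and `permutationImportance` are hypothetical names, the model is any callable standing in for a trained classifier, and the score is computed on whatever samples are passed in rather than on true out-of-bag samples.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <random>
#include <vector>

using Sample = std::vector<double>;
// Hypothetical stand-in for a trained classifier: maps one sample to a class label.
using Model = std::function<int(const Sample&)>;

// Fraction of samples the model classifies correctly.
double accuracy(const Model& model,
                const std::vector<Sample>& X,
                const std::vector<int>& y) {
    assert(X.size() == y.size() && !X.empty());
    std::size_t correct = 0;
    for (std::size_t i = 0; i < X.size(); ++i) {
        if (model(X[i]) == y[i]) ++correct;
    }
    return static_cast<double>(correct) / static_cast<double>(X.size());
}

// Drop in accuracy after shuffling one feature column across the samples.
// X is taken by value so the caller's data is left untouched.
double permutationImportance(const Model& model,
                             std::vector<Sample> X,
                             const std::vector<int>& y,
                             std::size_t feature,
                             std::mt19937& rng) {
    assert(!X.empty() && feature < X[0].size());
    const double baseline = accuracy(model, X, y);

    // Copy the column out, shuffle it, and write it back.
    std::vector<double> column(X.size());
    for (std::size_t i = 0; i < X.size(); ++i) column[i] = X[i][feature];
    std::shuffle(column.begin(), column.end(), rng);
    for (std::size_t i = 0; i < X.size(); ++i) X[i][feature] = column[i];

    return baseline - accuracy(model, X, y);
}
```

Under this scheme, a feature the model never looks at scores exactly zero, while a feature that fully determines a balanced binary label should lose roughly half of the accuracy when shuffled.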

To test this, I have generated 4 features, each with 1000 values:

  1. std::poisson_distribution<bpint32>(5)
  2. std::binomial_distribution<bpint32>(1, 0.5)
  3. std::normal_distribution<>(0, 100)
  4. std::normal_distribution<>(0, 100)
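A minimal sketch of how such features could be generated. Several details here are assumptions rather than the original code: `int` is used in place of `bpint32`, a single seeded `std::mt19937` engine drives all four distributions, each sample is stored as one row of `float` values, and the function name `makeFeatures` is hypothetical.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Builds n samples, one row per sample, with columns F1..F4.
// Row-per-sample layout and float storage are assumptions.
std::vector<std::vector<float>> makeFeatures(std::size_t n, unsigned seed) {
    std::mt19937 rng(seed);
    std::poisson_distribution<int> f1(5);        // F1: Poisson, mean 5
    std::binomial_distribution<int> f2(1, 0.5);  // F2: single fair coin flip, 0 or 1
    std::normal_distribution<> f3(0.0, 100.0);   // F3: normal, mean 0, stddev 100
    std::normal_distribution<> f4(0.0, 100.0);   // F4: normal, mean 0, stddev 100

    std::vector<std::vector<float>> rows;
    rows.reserve(n);
    for (std::size_t i = 0; i < n; ++i) {
        rows.push_back({static_cast<float>(f1(rng)),
                        static_cast<float>(f2(rng)),
                        static_cast<float>(f3(rng)),
                        static_cast<float>(f4(rng))});
    }
    return rows;
}
```

Calling `makeFeatures(1000, seed)` gives the 1000 samples described above, and fixing the seed makes consecutive runs use identical data.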

The features matrix that I get is similar to this (each FX is a column of the matrix):

F1: 4, 5, 4, 4, 3, 6, 2, ...

F2: 0, 1, 0, 1, 0, 0, 0, ...

F3: 20, 38, 11, 45, 82, 49, ...

F4: 43, 92, 104, 19, 98, 27, ...

If I set the labels column equal to F2 (0, 1, 0, 1, 0, 0, 0, ...), I would expect that, once training has run, getVarImportance() returns an overwhelmingly larger importance for F2 than for any of the other features. However, in the results I get, (1) the importance values are generally all quite similar to one another, and (2) F2 comes out on top only sometimes, seemingly at random. These are the results from 3 consecutive runs:

Run 1: F3=0.264705, F1=0.260252, F2=0.25813, F4=0.216913

Run 2: F3=0.580336, F4=0.179718, F1=0.120606, F2=0.11934

Run 3: F2=0.280113, F1=0.278775, F3=0.221927, F4=0.219185

Surely this cannot be right. Any ideas?
