Can't get Logistic Regression results to be anything other than 0's
Hello, I'm working on an embedded project and I'm currently trying to learn how to use the OpenCV library to do simple logistic regression.
I am testing this on the Titanic dataset and I've run into one major issue so far: the results matrix is always set to a vector of zeros after I call logreg->predict(trainData->getTestSamples(), results).
Here is the relevant code:
#include <iostream>
#include <opencv2/ml.hpp>

using namespace cv;
using namespace ml;
using namespace std;

// Build an untrained logistic regression model with the given hyperparameters.
Ptr<LogisticRegression> model(float learningRate, int iterations, int miniBatchSize) {
    Ptr<LogisticRegression> logreg = LogisticRegression::create();
    logreg->setLearningRate(learningRate);
    logreg->setIterations(iterations);
    logreg->setMiniBatchSize(miniBatchSize);
    logreg->setTrainMethod(LogisticRegression::BATCH);
    logreg->setRegularization(LogisticRegression::REG_L2);
    return logreg;
}

int main(int, char**)
{
    const Ptr<TrainData> trainData = TrainData::loadFromCSV("data/train_cleaned.csv",
        1,  // lines to skip
        0,  // index of label
        -1  // 1 response per line
    );
    // Keep 80% of the rows for training and the rest for testing.
    trainData->setTrainTestSplitRatio(0.8);

    Ptr<LogisticRegression> logreg = model(0.001, 10, 1);
    logreg->train(trainData);

    // Predict labels for the held-out test samples and print them as one row.
    Mat results;
    logreg->predict(trainData->getTestSamples(), results);
    cout << results.t() << endl;
    return 0;
}
I was thinking that maybe my data wasn't being processed correctly, so I tried to change trainTestSplitRatio to multiple smaller values and verified that the training and testing samples reflected the changes. There was still no difference in the predicted outputs, only a larger vector of zeros.
Example output: [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0]
Example input (data/train_cleaned.csv):
# "Survived","Pclass","Sex","Age","SibSp","Parch","Fare"
0.00000,1.00000,0.00000,0.27500,0.20000,0.00000,0.01415
The model is also correctly saved after training:
%YAML:1.0
---
opencv_ml_lr:
   format: 3
   classifier: Logistic Regression Classifier
   alpha: 1.0000000000000000e-03
   iterations: 1000
   norm: 1
   train_method: 0
   learnt_thetas: !!opencv-matrix
      rows: 1
      cols: 7
      dt: f
      data: [ -1.31384659e-04, -1.67027669e-04, 1.30452652e-04,
          -5.84131885e-05, -1.56885471e-05, -3.24403231e-07,
          1.01427850e-05 ]
   n_labels: !!opencv-matrix
      rows: 2
      cols: 1
      dt: i
      data: [ 0, 1 ]
   o_labels: !!opencv-matrix
      rows: 2
      cols: 1
      dt: i
      data: [ 0, 1 ]
Perhaps there are errors in the model.
IMHO 10 iterations are not enough. Try something like 5000.
Can you put the CSV somewhere? (Kaggle's data is behind a login wall.)
data: http://s000.tinyupload.com/index.php?...
I've tried with more iterations. The issue seems to be that the sigmoid vector returned by calc_sigmoid in the ml/lr.cpp file is always very close to 0.5 but slightly below it. With more iterations, some predictions move farther from 0.5 towards 0, so the output is still all zeros.
Thank you :)
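To make the "close to 0.5 but always lower" observation concrete, here is a small standalone sketch (plain C++, no OpenCV) that plugs the learnt_thetas from the saved model into a sigmoid for the example CSV row. It assumes the first coefficient is the bias term and that a 0.5 cutoff decides the label; both are my reading of how the prediction works, not something confirmed in this thread. The helper names are made up for illustration.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Logistic function: maps any real score to the interval (0, 1).
inline double sigmoid(double z) {
    return 1.0 / (1.0 + std::exp(-z));
}

// Probability of class 1 for one sample.
// ASSUMPTION: theta[0] is the bias term and theta[1..] line up with the
// feature columns; this layout is inferred, not confirmed by the thread.
inline double probability(const std::vector<double>& theta,
                          const std::vector<double>& features) {
    assert(theta.size() == features.size() + 1);
    double z = theta[0];
    for (std::size_t i = 0; i < features.size(); ++i) {
        z += theta[i + 1] * features[i];
    }
    return sigmoid(z);
}

// learnt_thetas copied from the saved model in the question.
inline std::vector<double> savedThetas() {
    return {-1.31384659e-04, -1.67027669e-04, 1.30452652e-04,
            -5.84131885e-05, -1.56885471e-05, -3.24403231e-07,
            1.01427850e-05};
}

// Feature columns of the example CSV row: Pclass, Sex, Age, SibSp, Parch, Fare.
inline std::vector<double> exampleRow() {
    return {1.00000, 0.00000, 0.27500, 0.20000, 0.00000, 0.01415};
}

// Hard label with a 0.5 cutoff (assumed decision rule).
inline int classify(double p) {
    return p > 0.5 ? 1 : 0;
}
```

Calling probability(savedThetas(), exampleRow()) gives a value just under 0.5, so classify returns 0. With coefficients on the order of 1e-4 and features scaled to [0, 1], the score barely moves away from zero, which would explain why every sample ends up on the same side of the cutoff.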
The test here uses lr=1.0, iter=10001 and batch=10 to solve the iris dataset.
Did you try other ML algos, like SVM?
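The hyperparameter suggestions in this thread (more iterations, a learning rate around 1.0) can be illustrated with a toy batch gradient descent in plain C++. This is only a sketch of the general technique, not OpenCV's implementation: there is no regularization, the dataset is invented, and the names fit, predictProb and toyData are hypothetical. The toy labels are deliberately imbalanced towards 0 so the bias drifts negative first, which is my guess at what happens with the Titanic data.

```cpp
#include <cmath>
#include <vector>

struct Sample {
    double x;  // single feature, already scaled to [0, 1]
    int y;     // label: 0 or 1
};

// Hypothetical toy data: four negatives, two positives.
inline std::vector<Sample> toyData() {
    return {{0.1, 0}, {0.2, 0}, {0.3, 0}, {0.4, 0}, {0.8, 1}, {0.9, 1}};
}

// Plain batch gradient descent on the logistic loss.
// Returns {bias, weight}, both starting from zero.
inline std::vector<double> fit(const std::vector<Sample>& data,
                               double learningRate, int iterations) {
    double b = 0.0, w = 0.0;
    const double n = static_cast<double>(data.size());
    for (int it = 0; it < iterations; ++it) {
        double gradB = 0.0, gradW = 0.0;
        for (const Sample& s : data) {
            const double p = 1.0 / (1.0 + std::exp(-(b + w * s.x)));
            gradB += p - s.y;          // d(loss)/d(bias)
            gradW += (p - s.y) * s.x;  // d(loss)/d(weight)
        }
        b -= learningRate * gradB / n;
        w -= learningRate * gradW / n;
    }
    return {b, w};
}

// Probability of class 1 for feature value x under parameters {bias, weight}.
inline double predictProb(const std::vector<double>& bw, double x) {
    return 1.0 / (1.0 + std::exp(-(bw[0] + bw[1] * x)));
}
```

With fit(toyData(), 0.001, 10) the parameters stay around 1e-3, so even the clearly positive sample x = 0.9 gets a probability slightly below 0.5 and is labelled 0. With fit(toyData(), 1.0, 5000) the same sample comes out above 0.5 while x = 0.1 stays below it. So it may be worth raising the learning rate as well as the iteration count before concluding the model is broken.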