Estimate camera pose (extrinsic parameters) from homography / essential matrix

I am trying to estimate the camera pose from an estimated homography, as explained in chapter 9.6.2 of Hartley & Zisserman's book.

Basically, it says the following:

Let W=(0,-1,0; 1,0,0; 0,0,1). The left camera is assumed at [I|0].

Given an SVD decomposition of the essential matrix E

SVD(E) = U*diag(1,1,0)*V'

the extrinsic camera parameters [R|t] of the right camera are one of the following four solutions:

[U W V' | U*(0,0,+1)']
[U W V' | U*(0,0,-1)']
[U W'V' | U*(0,0,+1)']
[U W'V' | U*(0,0,-1)']
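
For reference, here is how I read that recipe in OpenCV terms. This is only a rough sketch under the assumption that E is already a proper essential matrix (3x3, CV_64F); the helper name decomposeEssential is just something I made up:

    #include <opencv2/core/core.hpp>
    #include <vector>

    // Sketch: build the four candidate poses [R|t] from SVD(E) as described above.
    // Assumes E is a calibrated essential matrix stored as a 3x3 CV_64F matrix.
    std::vector<cv::Mat> decomposeEssential(const cv::Mat& E)
    {
        cv::SVD svd(E, cv::SVD::FULL_UV);

        const cv::Mat W = (cv::Mat_<double>(3, 3) <<
            0, -1, 0,
            1,  0, 0,
            0,  0, 1);

        cv::Mat R1 = svd.u * W * svd.vt;       // U W  V'
        cv::Mat R2 = svd.u * W.t() * svd.vt;   // U W' V'
        cv::Mat t  = svd.u.col(2);             // U * (0,0,1)'

        // Make sure both candidates are proper rotations (determinant +1).
        if (cv::determinant(R1) < 0) R1 = -R1;
        if (cv::determinant(R2) < 0) R2 = -R2;

        cv::Mat tneg = -t;

        std::vector<cv::Mat> poses(4);
        cv::hconcat(R1, t,    poses[0]);       // [U W  V' | +u3]
        cv::hconcat(R1, tneg, poses[1]);       // [U W  V' | -u3]
        cv::hconcat(R2, t,    poses[2]);       // [U W' V' | +u3]
        cv::hconcat(R2, tneg, poses[3]);       // [U W' V' | -u3]
        return poses;
    }

(I added the determinant check because E is only defined up to sign, so U or V can come out with determinant -1.)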

Now I am struggling with a few main issues.

1) I only have access to an estimated homography H. As I understand it, H is not a proper essential matrix unless two of its singular values are equal and the third is zero. What I am struggling with is that instead of

SVD(H) = U * diag(1,1,0) * V'

the SVD decomposes into something like

SVD(H) = U * diag(70,1.6,0.0001) * V'

It seems strange that the singular values are not almost identical and that they are so large. My first question is: why does this happen, and what should I do about it? Normalization? Scaling?
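
For context, my understanding is that an essential matrix would normally be computed from point correspondences and the intrinsics rather than from H directly, i.e. E = K' * F * K with F a fundamental matrix. A rough sketch of what I mean (K, points1, points2 and the helper name essentialFromMatches are just placeholders for my intrinsics and matched points):

    #include <opencv2/core/core.hpp>
    #include <opencv2/calib3d/calib3d.hpp>
    #include <iostream>
    #include <vector>

    // Sketch: estimate a fundamental matrix from matched points and turn it into
    // an essential matrix using the intrinsics K (assumed 3x3, CV_64F, same camera
    // on both sides). E = K' * F * K should then have two (nearly) equal singular
    // values and one near zero, which an arbitrary homography generally does not.
    cv::Mat essentialFromMatches(const std::vector<cv::Point2f>& points1,
                                 const std::vector<cv::Point2f>& points2,
                                 const cv::Mat& K)
    {
        cv::Mat F = cv::findFundamentalMat(points1, points2, cv::FM_RANSAC, 3.0, 0.99);
        cv::Mat E = K.t() * F * K;

        cv::SVD svd(E);
        cv::Mat sv = svd.w.t();
        std::cout << "singular values of E: " << sv << std::endl;  // expect (s, s, ~0)
        return E;
    }

Maybe that is the step I am missing, i.e. I should be decomposing E rather than H, but I am not sure.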

2) After a lot of thinking and more experimenting, I came up with the following implementation:

    static const cv::Mat W = (cv::Mat_<double>(3, 3) <<
        1,  0, 0,
        0, -1, 0,
        0,  0, 1
    );

    // Compute the SVD of H
    cv::SVD svd(H);

    // Candidate rotations: R1 = U * diag(w) * V',  R2 = U * diag(w) * W * V'
    cv::Mat_<double> R1 = svd.u * cv::Mat::diag(svd.w) * svd.vt;
    cv::Mat_<double> R2 = svd.u * cv::Mat::diag(svd.w) * W * svd.vt;

    // Candidate translation: third column of U
    cv::Mat_<double> t = svd.u.col(2);

This way I get four possible solutions [R1|t], [R1|-t], [R2|t], [R2|-t], which produce some sort of results.

Note that my W does not swap the x/y coordinates and does not invert the x-coordinate; only the y-coordinate is inverted.

I believe the missing swap can be explained by different image coordinate systems, so column and row might be interchanged in my implementation. But I can't explain why I only have to mirror and not rotate, and overall I am not sure if the implementation is correct.

3) In theory, I think I need to triangulate a pair of matches and determine whether the 3D point is in front of both planes. Only one of the four solutions will satisfy that condition. However, I don't know how to determine the near plane's normal and distance from the uncalibrated projection matrix.
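
Here is a sketch of the test I have in mind, assuming I can use cv::triangulatePoints; pts1/pts2 are assumed to be matched points in normalized coordinates (K^-1 already applied), and the helper name countPointsInFront is just a placeholder:

    #include <opencv2/core/core.hpp>
    #include <opencv2/calib3d/calib3d.hpp>
    #include <cmath>
    #include <vector>

    // Sketch of the cheirality test: triangulate the matches with one candidate
    // pose and count how many 3D points end up in front of both cameras
    // (positive depth). The candidate [R|t] with the most points in front
    // should be the physically valid one.
    int countPointsInFront(const cv::Mat& P1,                  // 3x4 CV_64F, [I|0]
                           const cv::Mat& P2,                  // 3x4 CV_64F, candidate [R|t]
                           const std::vector<cv::Point2f>& pts1,
                           const std::vector<cv::Point2f>& pts2)
    {
        cv::Mat points4D;
        cv::triangulatePoints(P1, P2, pts1, pts2, points4D);   // 4xN homogeneous points
        points4D.convertTo(points4D, CV_64F);

        int inFront = 0;
        for (int i = 0; i < points4D.cols; ++i)
        {
            cv::Mat X = points4D.col(i);
            double w = X.at<double>(3);
            if (std::abs(w) < 1e-9)
                continue;                                      // point at infinity, skip
            X /= w;                                            // dehomogenize

            double z1 = X.at<double>(2);                       // depth in camera 1 ([I|0])
            double z2 = cv::Mat(P2 * X).at<double>(2);         // depth in camera 2
            if (z1 > 0 && z2 > 0)
                ++inFront;
        }
        return inFront;
    }

If that is the right idea, I suppose I don't need the plane's normal and distance explicitly, only the sign of the depths, but I am not sure whether this is valid with an uncalibrated projection matrix.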

4) This is all used in the context of image stitching. My goal is to extend the current image stitching pipeline to support not only cameras that rotate about their own center, but also cameras that translate with little rotation. Some preliminary results and a follow-up question will be posted soon. (TODO)
