Estimate camera pose (extrinsic parameters) from homography / essential matrix

asked 2014-07-29 12:50:57 -0500

Duffycola

updated 2014-08-13 09:04:00 -0500

I am trying to estimate the camera pose from an estimated homography as explained in chapter 9.6.2 of Hartley & Zisserman's book.

Basically, it says the following:

Let W=(0,-1,0; 1,0,0; 0,0,1). The left camera is assumed at [I|0].

Given an SVD decomposition of the essential matrix E

SVD(E) = U*diag(1,1,0)*V'

the extrinsic camera parameters [R|t] of the right camera are one of the following four solutions:

[U W V' | U*(0,0,+1)']
[U W V' | U*(0,0,-1)']
[U W'V' | U*(0,0,+1)']
[U W'V' | U*(0,0,-1)']
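For reference, the four candidates above can be written out in a few lines. This is only a sketch in plain C++ with hypothetical helper names (`Mat3`, `Pose`, `candidatePoses`), not OpenCV code. It assumes `U` and `Vt` come from an SVD of a valid essential matrix, and it flips their signs where needed so that each R has determinant +1 (E is only defined up to sign, so this does not change the solution set).

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Mat3 = std::array<std::array<double, 3>, 3>;
using Vec3 = std::array<double, 3>;

struct Pose { Mat3 R; Vec3 t; };  // hypothetical container for [R|t]

Mat3 mul(const Mat3& A, const Mat3& B) {
    Mat3 C{};  // zero-initialized
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            for (int k = 0; k < 3; ++k)
                C[i][j] += A[i][k] * B[k][j];
    return C;
}

Mat3 transpose(const Mat3& A) {
    Mat3 T{};
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            T[i][j] = A[j][i];
    return T;
}

double det(const Mat3& A) {
    return A[0][0] * (A[1][1] * A[2][2] - A[1][2] * A[2][1])
         - A[0][1] * (A[1][0] * A[2][2] - A[1][2] * A[2][0])
         + A[0][2] * (A[1][0] * A[2][1] - A[1][1] * A[2][0]);
}

void negate(Mat3& A) {
    for (auto& row : A)
        for (auto& x : row) x = -x;
}

// U and Vt are assumed to satisfy E = U * diag(1,1,0) * Vt.
std::array<Pose, 4> candidatePoses(Mat3 U, Mat3 Vt) {
    // Sanity check: both factors should be orthogonal (|det| = 1).
    assert(std::fabs(std::fabs(det(U)) - 1.0) < 1e-6);
    assert(std::fabs(std::fabs(det(Vt)) - 1.0) < 1e-6);
    // Make both factors proper rotations so that det(R) = +1.
    if (det(U) < 0.0) negate(U);
    if (det(Vt) < 0.0) negate(Vt);

    const Mat3 W = {{{0.0, -1.0, 0.0}, {1.0, 0.0, 0.0}, {0.0, 0.0, 1.0}}};
    const Mat3 R1 = mul(mul(U, W), Vt);             // U W V'
    const Mat3 R2 = mul(mul(U, transpose(W)), Vt);  // U W' V'
    const Vec3 u3 = {U[0][2], U[1][2], U[2][2]};    // U * (0,0,1)'
    const Vec3 m3 = {-u3[0], -u3[1], -u3[2]};       // U * (0,0,-1)'

    std::array<Pose, 4> out = {{Pose{R1, u3}, Pose{R1, m3},
                                Pose{R2, u3}, Pose{R2, m3}}};
    return out;
}
```

With `U` and `Vt` both set to the identity, the two rotations reduce to W and W' and the translation to (0,0,±1)', which is a quick way to check the ordering of the factors.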

Now I am struggling with a few main issues.

1) I only have access to an estimated homography H. As I understand it, H is not exactly an essential matrix unless two of its singular values are equal and the third is zero. So what I am struggling with is that instead of

SVD(H) = U * diag(1,1,0) * V'

the SVD decomposes into something like

SVD(H) = U * diag(70,1.6,0.0001) * V'

It's really weird that the singular values are so far from identical and that the first one is so large. My first question is: why does this happen, and what should I do about it? Normalization? Scaling?
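One way to check whether a given 3x3 matrix can be an essential matrix at all, without inspecting the SVD, is the identity 2*E*E'*E - trace(E*E')*E = 0, which holds exactly when two singular values are equal and the third is zero. The sketch below (plain C++, with a hypothetical helper name `essentialResidual`) scales the input to unit Frobenius norm first, so the overall scale of the matrix does not matter. A residual far from zero suggests that the matrix is not an essential matrix and that rescaling alone will not turn it into one.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Mat3 = std::array<std::array<double, 3>, 3>;

// Frobenius norm of 2*E*E'*E - trace(E*E')*E, where E is the input
// scaled to unit Frobenius norm. Close to 0 for an essential matrix.
double essentialResidual(const Mat3& M) {
    double fro = 0.0;
    for (const auto& row : M)
        for (double x : row) fro += x * x;
    fro = std::sqrt(fro);
    assert(fro > 0.0);  // the zero matrix is not meaningful here

    Mat3 E{};
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            E[i][j] = M[i][j] / fro;

    // EEt = E * E'
    Mat3 EEt{};
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            for (int k = 0; k < 3; ++k)
                EEt[i][j] += E[i][k] * E[j][k];
    const double trace = EEt[0][0] + EEt[1][1] + EEt[2][2];

    // Accumulate the squared entries of 2*(EEt*E) - trace*E.
    double sum = 0.0;
    for (int i = 0; i < 3; ++i) {
        for (int j = 0; j < 3; ++j) {
            double eete = 0.0;
            for (int k = 0; k < 3; ++k) eete += EEt[i][k] * E[k][j];
            const double r = 2.0 * eete - trace * E[i][j];
            sum += r * r;
        }
    }
    return std::sqrt(sum);
}
```

For a matrix with singular values like (70, 1.6, 0.0001) the residual is close to 1, while for a skew-symmetric matrix [t]x (an essential matrix with R = I) it is zero up to rounding.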

2) After a lot of thinking and more experimenting, I came up with the following implementation:

    static const Mat W = (Mat_<double>(3, 3) <<
        1,  0, 0,
        0, -1, 0,
        0,  0, 1);

    // Compute SVD of H
    SVD svd(H);
    cv::Mat_<double> R1 = svd.u * Mat::diag(svd.w) * svd.vt;
    cv::Mat_<double> R2 = svd.u * Mat::diag(svd.w) * W  * svd.vt;
    cv::Mat_<double> t = svd.u.col(2);

This way I get four possible solutions [R1|t], [R1|-t], [R2|t], [R2|-t], which produce results that at least look plausible.

Apparently, in W, I don't swap x/y coordinates and I don't invert the x-coordinate. Only the y-coordinate is inverted.

I believe the swap can be explained by different image coordinate systems, so column and row might be inverted in my implementation. But I can't explain why I only have to mirror and not rotate. And overall I am not sure whether the implementation is correct.

3) In theory, I think I need to triangulate a pair of matches and determine whether the 3D point is in front of both cameras. Only one of the four solutions will satisfy that condition. However, I don't know how to determine the near plane's normal and distance from the uncalibrated projection matrix.
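A minimal sketch of such a test, assuming calibrated (normalized) image coordinates, i.e. K^-1 already applied, and the first camera at [I|0]. In that setting a point is in front of a camera when its depth (the z-coordinate in that camera's frame) is positive, so no explicit plane normal is needed. The depths z1, z2 are found by solving z2*x2 = z1*R*x1 + t in the least-squares sense. The function name `inFrontOfBothCameras` is hypothetical, and this is plain C++ rather than OpenCV code.

```cpp
#include <array>
#include <cassert>
#include <cmath>

using Mat3 = std::array<std::array<double, 3>, 3>;
using Vec3 = std::array<double, 3>;

double dot(const Vec3& a, const Vec3& b) {
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2];
}

// x1, x2: matched points in normalized homogeneous coordinates (x, y, 1).
// [R|t]: candidate pose of the second camera; the first camera is [I|0].
bool inFrontOfBothCameras(const Mat3& R, const Vec3& t,
                          const Vec3& x1, const Vec3& x2) {
    // Depths only equal z1, z2 if the last component is 1.
    assert(std::fabs(x1[2] - 1.0) < 1e-9 && std::fabs(x2[2] - 1.0) < 1e-9);

    Vec3 a{};  // a = R * x1, the first ray expressed in the second frame
    for (int i = 0; i < 3; ++i)
        for (int j = 0; j < 3; ++j)
            a[i] += R[i][j] * x1[j];

    // Solve z1*a - z2*x2 = -t via the 2x2 normal equations.
    const double aa = dot(a, a);
    const double ab = -dot(a, x2);
    const double bb = dot(x2, x2);
    const double ra = -dot(a, t);
    const double rb = dot(x2, t);
    const double d = aa * bb - ab * ab;
    if (std::fabs(d) < 1e-12) return false;  // parallel rays: depth undefined

    const double z1 = (ra * bb - ab * rb) / d;  // depth in camera 1
    const double z2 = (aa * rb - ab * ra) / d;  // depth in camera 2
    return z1 > 0.0 && z2 > 0.0;
}
```

With noisy matches it is probably safer to run this over many correspondences for each of the four candidates and keep the candidate with the most points in front of both cameras, rather than relying on a single pair.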

4) This is all used in the context of image stitching. My goal is to extend the current image stitching pipeline to support not only cameras rotating about their own center, but also translating cameras with little rotation. Some preliminary results and a follow-up question will be posted soon. (TODO)


Have you looked in the "Mastering OpenCV" book? There is a chapter that covers this topic (structure from motion) using optical flow or keypoint matching as a basis.

JohannesZ ( 2014-07-29 15:56:12 -0500 )

Thanks, that's good advice. I haven't completely tried it out yet, but at the core, where it estimates R|t from the essential matrix, it does so very much in the way it is explained in Hartley & Zisserman. There is a very detailed answer here with a python implementation:

Based on the referenced python script, I now stick to the textbook implementation and assume that my essential matrix is still incorrect. I will have to review the way I compute it and hopefully the singular values won't be that far off each other anymore.

I believe the reasons why the results with my implementation looked "useful" before are 1) the translation is almost zero anyway and 2) I ultimately only rotate/mirror the cameras.

Duffycola ( 2014-08-13 08:57:22 -0500 )