
Stereo camera pose estimation from solvePnPRansac using 3D points given wrt. the camera coordinate system

asked 2018-02-28 04:49:20 -0600 by bendikiv (updated 2018-02-28 04:58:24 -0600)

I know that there exist many posts regarding pose estimation using solvePnP/solvePnPRansac, and I've read most of them, but my case differs slightly from what seems to be the standard scenario. Even though I think I've got it, I just can't get it to work, and I'd like someone to correct me if I'm doing anything wrong. This post became quite long, but please bear with me.

I'm trying to use solvePnPRansac to calculate the motion of a stereo camera from one frame/time instance to the next. I detect features in the first frame and track them to the next frame. I'm also using a stereo camera that comes with an SDK which provides me with the corresponding 3D coordinates of the detected features. These 3D points are wrt. the camera coordinate system. In other words, the 2D/3D points in the two consecutive frames correspond to the same features, but if the camera moves between the frames, the coordinates change (even the 3D points, since they are relative to the camera origin).

I believe that the 3D input points of solvePnPRansac should be wrt a world frame, but since I don't have a world frame, I try to do the following:

1) For the very first frame: I set the initial camera pose as the world frame, since I need a constant reference for computation of relative movement. This means that the 3D points calculated in this frame now equal the world points, and that the movement of the camera is relative to the initial camera pose.

2) Call solvePnPRansac with the world points from the first frame together with the 2D features detected in the second frame as inputs. It returns rvec and tvec.

Now for my first question: Is tvec the vector from the camera origin (/the second frame) to the world origin (/the first frame), given in the camera's coordinates system?

Second question: I want the vector from the world frame to the camera/second frame given in world frame coordinates (this should be equal to the translation of the camera relative to the original pose=world frame), so I need to use translation = -(R)^T * tvec, where R is the rotation matrix given by rvec?
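That identity can be sanity-checked numerically. Below is a minimal NumPy sketch (the rotation and camera position are made-up values, not from any real solvePnP run): solvePnP-style outputs satisfy X_cam = R·X_world + tvec, so tvec = -R·C_world, and the camera position in world coordinates is recovered as -Rᵀ·tvec.

```python
import numpy as np

# Hypothetical ground truth: camera at C_world, rotated 90 deg about z.
theta = np.deg2rad(90.0)
R = np.array([[np.cos(theta), -np.sin(theta), 0.0],
              [np.sin(theta),  np.cos(theta), 0.0],
              [0.0,            0.0,           1.0]])
C_world = np.array([1.0, 2.0, 0.5])   # assumed camera position (world frame)

# What a solvePnP-style solver would return: X_cam = R @ X_world + tvec
tvec = -R @ C_world

# Recover the camera position in world coordinates: translation = -R^T * tvec
C_recovered = -R.T @ tvec
```

Here `C_recovered` matches `C_world`, confirming that translation = -Rᵀ·tvec gives the camera position in world coordinates under that convention.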

Now I'm a little confused as to which 3D points I should use in the further calculations. Should I transform the 3D points detected in the second frame (which is given wrt the camera) to the world frame? If I combine the tvec and rvec into a homogeneous-transformation matrix T (which would represent the homogeneous transformation from the world frame to the second frame), the transformation should be 3Dpoints_at_frame2_in_worldcoordinates = T^(-1) * 3Dpoints_at_frame2_in_cameracoordinates
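A quick sketch of that point transformation with made-up numbers (identity rotation for brevity): build T = [R t; 0 1] from solvePnP's outputs, then map camera-frame points to world coordinates with T⁻¹.

```python
import numpy as np

R = np.eye(3)                          # assume identity rotation for brevity
t = np.array([0.1, 0.0, 0.3])          # assumed tvec

T = np.eye(4)                          # T = [R t; 0 1], world -> camera
T[:3, :3] = R
T[:3, 3] = t

# Hypothetical 3D features expressed in the camera frame
pts_cam = np.array([[0.0,  0.0, 1.0],
                    [0.2, -0.1, 0.8]])
pts_h = np.hstack([pts_cam, np.ones((len(pts_cam), 1))])   # homogeneous coords

# 3Dpoints_in_worldcoordinates = T^(-1) * 3Dpoints_in_cameracoordinates
pts_world = (np.linalg.inv(T) @ pts_h.T).T[:, :3]
```

With identity rotation, T⁻¹ simply subtracts t, which makes the result easy to verify by eye before trusting the same code on real data.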

If I do this, I can capture a new image (third frame), track the 2D features detected in the second frame to the third frame, compute the corresponding 3D points (which are given wrt the third frame) and call solvePnPRansac with "3Dpoints_at_frame2_in_worldcoordinates" and the …


1 answer


answered 2018-02-28 05:54:37 -0600

Do PnP with previous frame 3D points (either camera or world) and current frame 2D points. If the 3D points have world coordinates you are done. If they are local camera coordinates, you must add the computed Rt transform to your global transform.

Read the following: Visual Odometry: Part 1 and Visual Odometry: Part 2



Thanks! Reading the articles helped me further understand the concept, especially regarding which 2D/3D points to use when solving PnP.

So now I'm using the previous frame's 3D points in camera coordinates and the current frame's 2D points at each iteration of solvePnP. The results seem to be OK; however, they contain a lot of noise, which makes it hard to tell if the global transform is correct.

If I construct a transformation matrix like this with each R and t (R gotten from Rodrigues and t=tvec): T_current = [R t; 0 1], and I have my T_global (which initially is a 4x4 identity matrix) is this the correct update formula: T_global = T_global * T_current? Does T_global now represent the transformation from the initial frame to the current frame? Is T_current = [R t; 0 1] correct?
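Whether the update multiplies on the left or the right depends on which direction your matrices map points, so it is worth checking with synthetic transforms before trusting real data. In the sketch below (all rotations and translations made up), each T maps points from the previous frame into the next one, i.e. X_next = T·X_prev; under that convention the accumulated initial-to-current transform puts the newest matrix on the left.

```python
import numpy as np

def make_T(R, t):
    """Assemble a 4x4 homogeneous transform T = [R t; 0 1]."""
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

def Rz(a):
    """Rotation about the z axis by angle a (radians)."""
    return np.array([[np.cos(a), -np.sin(a), 0.0],
                     [np.sin(a),  np.cos(a), 0.0],
                     [0.0,        0.0,       1.0]])

# Synthetic frame-to-frame transforms (point convention: X_next = T @ X_prev)
T_01 = make_T(Rz(0.1), np.array([0.0, 0.1, 0.0]))  # frame 0 -> frame 1
T_12 = make_T(Rz(0.2), np.array([0.1, 0.0, 0.0]))  # frame 1 -> frame 2

X0 = np.array([1.0, 2.0, 3.0, 1.0])                # a point in frame 0
X2_sequential = T_12 @ (T_01 @ X0)                 # apply step by step
X2_composed = (T_12 @ T_01) @ X0                   # newest transform on the left
```

If your T_global instead stores the camera *pose* (mapping current-frame points back to the initial frame), the multiplication order flips and each new T must be inverted first, which may be exactly the discrepancy behind noisy-looking global results.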

bendikiv (2018-02-28 12:01:51 -0600)

Use cv::composeRT to concatenate your transformations. To reduce noise, look into using keyframes, where results are aggregated before moving on to the next keyframe. You'll need a concept of stereo error. I recommend reading "Fast, Unconstrained Camera Motion Estimation from Stereo without Tracking and Robust Statistics" by Heiko Hirschmüller.
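For reference, what cv::composeRT computes (ignoring the Jacobians it also returns) is R3 = R(rvec2)·R(rvec1) and tvec3 = R(rvec2)·tvec1 + tvec2, i.e. (rvec1, tvec1) is the transform applied *first* and (rvec2, tvec2) the one applied *second*. Here is a pure-NumPy sketch of that composition; the `rodrigues` helper is my own re-implementation of the rotation-vector-to-matrix formula, not OpenCV's, and all input values are made up.

```python
import numpy as np

def rodrigues(rvec):
    """Rotation vector -> rotation matrix (Rodrigues' formula)."""
    theta = np.linalg.norm(rvec)
    if theta < 1e-12:
        return np.eye(3)
    k = np.asarray(rvec, dtype=float) / theta
    K = np.array([[0.0,  -k[2],  k[1]],
                  [k[2],  0.0,  -k[0]],
                  [-k[1], k[0],  0.0]])
    return np.eye(3) + np.sin(theta) * K + (1.0 - np.cos(theta)) * (K @ K)

def compose_rt(rvec1, tvec1, rvec2, tvec2):
    """Compose two rigid transforms: first (rvec1, tvec1), then (rvec2, tvec2)."""
    R1, R2 = rodrigues(rvec1), rodrigues(rvec2)
    return R2 @ R1, R2 @ tvec1 + tvec2

# Made-up transforms for the check
rvec1, tvec1 = np.array([0.0, 0.0, 0.1]), np.array([0.1, 0.0, 0.0])
rvec2, tvec2 = np.array([0.2, 0.0, 0.0]), np.array([0.0, 0.3, 0.0])
R3, t3 = compose_rt(rvec1, tvec1, rvec2, tvec2)

# Applying the composed transform must equal applying the two in sequence
X = np.array([1.0, 2.0, 3.0])
X_stepwise = rodrigues(rvec2) @ (rodrigues(rvec1) @ X + tvec1) + tvec2
```

Under that convention, a global transform that maps the initial frame to the previous frame would go in the (rvec1, tvec1) slot and the new frame-to-frame transform in (rvec2, tvec2) — but verify against your own matrix order before relying on it.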

Der Luftmensch (2018-02-28 13:54:56 -0600)

Thanks again.

I've actually been doing the same inlier detection step as they do in the paper (I've been kind of following this approach, where the same inlier detection step is done in step 6). However, I noticed that I've been doing it on the 2D feature points, and I understand that I need to detect inliers among the 3D points instead. As of now I only use a constant "distance error threshold" for determining whether a point is an inlier or an outlier, but I will look into the more dynamic approach described in the paper (eq. 10).

Just to be sure that I use composeRT correctly: rvec1/tvec1 equals the global transform, while rvec2/tvec2 is the relative one? That gives me the transformation from the initial frame to the current one?

bendikiv (2018-03-01 09:27:38 -0600)

*I can compute what rvec3/tvec3 equals from the formulas given in the documentation.

BUT, there is one thing that I've never quite understood, and would appreciate if you could help me with my confusion (because this affects the results from composeRT):

The rvec and tvec returned by solvePnPRansac, what exactly do they represent? What I've believed:

  • rvec/the rotation matrix formed by rvec represents the rotation from the previous coordinate frame to the current frame.

  • tvec is the vector from the origin of the current frame to the origin of the previous frame, with coordinates given in the current frame/coordinate system. I. e. t^n_{n, n-1}, if that notation makes sense to you.

Is this a correct interpretation of rvec and tvec returned by solvePnPRansac?

bendikiv (2018-03-01 10:24:43 -0600)

You have the freedom of choice to put these things together however you choose. rvec and tvec from cv::solvePnPRansac bring the model coords to the camera frame, so you will likely end up inverting the rotation matrix and negating the translation. Using the current frame's 2D points and the previous frame's 3D points is necessary for keyframe 3D-data fusion, however, you could swap it around and still do frame-to-frame VO just fine. I hope that helps. Just print out your values (translation and euler angles) for a short sequence and verify it makes sense.
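The "invert the rotation and negate the translation" step has a closed form that avoids a generic 4×4 inverse: if X_cam = R·X_model + t, then the inverse transform is R_pose = Rᵀ, t_pose = -Rᵀ·t. A small sketch with made-up values:

```python
import numpy as np

# Made-up solvePnP-style output: model coords -> camera coords
a = 0.3
R = np.array([[np.cos(a), -np.sin(a), 0.0],
              [np.sin(a),  np.cos(a), 0.0],
              [0.0,        0.0,       1.0]])
t = np.array([0.5, -0.2, 1.0])

# Closed-form inverse: camera coords -> model coords
R_pose = R.T
t_pose = -R.T @ t

# Round trip: model -> camera -> model recovers the original point
X_model = np.array([0.4, 0.1, 2.0])
X_cam = R @ X_model + t
X_back = R_pose @ X_cam + t_pose
```

Since R is orthonormal, Rᵀ is exactly R⁻¹, so this is both cheaper and numerically better behaved than inverting the full homogeneous matrix.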

Der Luftmensch (2018-03-02 08:40:52 -0600)

Ok, I opened another question about composeRT and it makes more sense to me now. However, I still wonder about the tvec returned from solvePnPRansac. Yes, rvec and tvec bring points from model coordinates to camera coordinates. But if I keep the camera absolutely still and just print out tvec, shouldn't tvec remain somewhat close to (0, 0, 0) when I use the previous 3D points (wrt. the camera frame) and the current corresponding 2D points? Because right now it doesn't do that at all (some of it I recognize as noise, but not all of it).

bendikiv (2018-03-08 10:58:11 -0600)

If you are using cv::solvePnPRansac, then there is little reason to first do inlier detection (as opposed to outlier rejection) unless you believe that the ratio of inliers to outliers is small. You also might have a very noisy 3D image or one with large errors if the baseline is small. If the 3D positions are not stable then you might observe the strange results you mention. Also, keep in mind that cv::solvePnPRansac does no weighting of points. The 3D error increases quadratically as one moves away from the camera and so in reality these points really should be given a much reduced weighting, but OpenCV does not (yet) provide such a capability. Maybe there are some visualizations you can add to better understand what is going on? Try a 3D rigid body transformation of current points?
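The quadratic growth of 3D error with distance follows from the stereo depth equation: for a rectified pair, z = f·b/d (f: focal length in pixels, b: baseline, d: disparity), and differentiating gives dz ≈ (z²/(f·b))·dd, so a fixed disparity error dd costs z² in depth. The numbers below are purely illustrative (roughly a 30 mm baseline device; the focal length and disparity error are assumptions):

```python
# Assumed parameters: focal length [px], baseline [m], disparity error [px]
f, b, dd = 700.0, 0.030, 0.25

def depth_error(z, f=f, b=b, dd=dd):
    """First-order depth uncertainty of z = f*b/d for a disparity error dd [m]."""
    return (z ** 2) / (f * b) * dd

for z in (0.5, 1.0, 2.0):
    print(f"z = {z:.1f} m -> depth error ~ {depth_error(z) * 1000:.1f} mm")
```

Doubling the distance quadruples the depth uncertainty, which is why down-weighting (or rejecting) far-away points matters so much for a short-baseline rig.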

Der Luftmensch (2018-03-08 18:16:44 -0600)

I will look into the accuracy of the 3D points, maybe try to visualize them in some way. I've also asked the developers of the stereo camera about the accuracy of the 3D points, since I get them from a function in an API/SDK provided by them. The baseline of the camera is only 30 mm, with a stated operating range of 2.5 meters, so I try to keep the scene close to the lenses (< 0.5 m).

But, just to make sure and to have some sort of goal: The tvec output from solvePnPRansac should stay close to (0, 0, 0) if I don't move the camera?

Side note: I just realized that I've been using your code provided in this answer for the implementation of a max clique approximation!

bendikiv (2018-03-09 03:42:51 -0600)

My understanding is that if you have a static scene and camera, and you are accumulating transformations frame after frame, you will observe a random walk away from 0. Are you using the Duo3D? Since I posted that answer I've improved the code to take into account 3D-error. For frame-to-frame VO, my experience is that the inlier/max-clique with 3D-error method, though a little slower, provides more accurate results than RANSAC PnP, though it has more regular catastrophic errors (probably my fault somewhere). Hirschmuller also suggests an angular rejection test, which should be simple enough to add. It would be no mean feat to do keyframe VO and weighted PnP, either via RANSAC or inlier detection. You might also want to consider cv::matchGMS as a 2D-outlier rejection scheme.

Der Luftmensch (2018-03-09 10:55:21 -0600)

Yes, I'm using the Duo3D M. I agree that by accumulating transformations frame after frame I should expect some drift. But what I meant was: if I just print out the tvec returned from solvePnPRansac at each iteration, i.e. the local translation between two consecutive frames, shouldn't it stay approximately at (0, 0, 0) if the points are accurate enough? Have you uploaded your updated max-clique script anywhere? And what exactly is the "inlier/max-clique with 3D-error method"? Does it include pose estimation (solving the PnP problem) as an alternative to cv::solvePnP(Ransac)?

bendikiv (2018-03-13 10:58:30 -0600)
