Stereo camera pose estimation with solvePnPRansac using 3D points given in the camera coordinate system
I know that there exist many posts about pose estimation using solvePnP/solvePnPRansac, and I've read most of them, but my case differs slightly from what seems to be the standard scenario. Even though I think I've understood the approach, I just can't get it to work, and I'd like someone to correct me if I'm doing anything wrong. This post became quite long, but please bear with me.
I'm trying to use solvePnPRansac to calculate the motion of a stereo camera from one frame/time instance to the next. I detect features in the first frame and track them to the next frame. The stereo camera comes with an SDK that provides the corresponding 3D coordinates of the detected features, and these 3D points are expressed in the camera coordinate system. In other words, the 2D/3D points in two consecutive frames correspond to the same features, but if the camera moves between the frames, the coordinates change (even the 3D points, since they are relative to the camera origin).
I believe that the 3D input points of solvePnPRansac should be expressed in a world frame, but since I don't have a world frame, I do the following:
1) For the very first frame, I set the initial camera pose as the world frame, since I need a constant reference for computing relative movement. This means that the 3D points calculated in this frame now equal the world points, and that the movement of the camera is relative to the initial camera pose.
2) Call solvePnPRansac with the world points from the first frame and the 2D features tracked into the second frame as inputs. It returns rvec and tvec (sketched below).
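A minimal sketch of this step in Python/OpenCV, assuming the SDK hands back undistorted features and that K holds the intrinsics of the camera the 2D features come from (the function name and RANSAC defaults are mine):

```python
import cv2
import numpy as np

def estimate_pose(points3d_world, points2d, K, dist=None):
    """Pose of the current frame w.r.t. the world frame (= frame-1 camera frame).

    points3d_world : (N, 3) 3D features in world coordinates
    points2d       : (N, 2) the same features tracked into the current frame
    K              : (3, 3) camera intrinsics; dist: distortion coeffs or None
    """
    if dist is None:
        dist = np.zeros(5)  # assuming the SDK returns undistorted features
    ok, rvec, tvec, inliers = cv2.solvePnPRansac(
        np.asarray(points3d_world, np.float64).reshape(-1, 1, 3),
        np.asarray(points2d, np.float64).reshape(-1, 1, 2),
        K, dist, iterationsCount=200, reprojectionError=2.0)
    if not ok:
        raise RuntimeError("solvePnPRansac did not find a pose")
    return rvec, tvec, inliers
```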
Now for my first question: is tvec the vector from the camera origin (the second frame) to the world origin (the first frame), expressed in the camera's coordinate system?
Second question: I want the vector from the world frame to the camera/second frame, expressed in world frame coordinates (this should equal the translation of the camera relative to the initial pose = world frame), so I need to use translation = -R^T * tvec, where R is the rotation matrix corresponding to rvec?
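In code, the inversion I have in mind would look like this (a sketch; the convention x_cam = R * x_world + tvec is my understanding of solvePnP's output, and the function name is mine):

```python
import cv2
import numpy as np

def camera_position_in_world(rvec, tvec):
    """Invert the world-to-camera transform returned by solvePnPRansac.

    If x_cam = R @ x_world + tvec, the camera center in world coordinates
    is C = -R.T @ tvec, and the camera orientation in the world is R.T.
    """
    R, _ = cv2.Rodrigues(rvec)      # 3x1 rotation vector -> 3x3 matrix
    C = -R.T @ tvec.reshape(3, 1)   # camera origin expressed in the world frame
    return C.ravel(), R.T
```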
Now I'm a little confused about which 3D points I should use in the further calculations. Should I transform the 3D points detected in the second frame (which are given w.r.t. the camera) into the world frame? If I combine tvec and rvec into a homogeneous transformation matrix T (which would represent the transformation from the world frame to the second frame), the transformation should be 3Dpoints_at_frame2_in_worldcoordinates = T^(-1) * 3Dpoints_at_frame2_in_cameracoordinates.
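As a sketch, this applies T^(-1) directly rather than building the 4x4 matrix (assuming rvec/tvec from the call above; the function name is mine):

```python
import cv2
import numpy as np

def to_world(points3d_cam, rvec, tvec):
    """Map 3D points from the current camera frame into the world frame.

    With x_cam = R @ x_world + tvec (i.e. T taking world -> camera), the
    inverse is x_world = R.T @ (x_cam - tvec), which is T^(-1) applied
    without forming the homogeneous matrix.
    """
    R, _ = cv2.Rodrigues(rvec)
    pts = np.asarray(points3d_cam, np.float64).reshape(-1, 3)
    return (R.T @ (pts.T - tvec.reshape(3, 1))).T   # (N, 3) in world coordinates
```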
If I do this, I can capture a new image (the third frame), track the 2D features detected in the second frame into the third frame, compute the corresponding 3D points (which are given w.r.t. the third frame), and call solvePnPRansac with "3Dpoints_at_frame2_in_worldcoordinates" and the 2D features tracked into the third frame as inputs, and so on for each new frame (see the sketch below).
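Put together, the per-frame loop I have in mind looks roughly like this, reusing estimate_pose() and to_world() from the sketches above and glossing over feature re-detection and correspondence bookkeeping (the frames iterable is hypothetical):

```python
import numpy as np

def run_vo_loop(points3d_frame1, frames, K):
    """Frame-to-frame pose loop; frame 1's camera frame is the world frame.

    `frames` is a hypothetical iterable yielding, for each new frame,
    (points2d, points3d_cam): the features tracked from the previous frame
    and the SDK's 3D measurements of those same features in the new frame.
    """
    points3d_world = np.asarray(points3d_frame1, np.float64)
    poses = []
    for points2d, points3d_cam in frames:
        rvec, tvec, inliers = estimate_pose(points3d_world, points2d, K)
        poses.append((rvec, tvec))
        idx = inliers.ravel()
        # Re-express this frame's 3D points in world coordinates so they can
        # serve as the object points for the next solvePnPRansac call.
        points3d_world = to_world(points3d_cam[idx], rvec, tvec)
    return poses
```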