# Pose estimation using PNP: Strange wrong results

Hello, I am trying to use the PNP algorithm implementations in OpenCV (EPNP, Iterative etc.) to get the metric pose estimates of cameras in a two-camera pair (not a conventional stereo rig; the cameras are free to move independently of each other). My source of images currently is a robot simulator (Gazebo), where two cameras are simulated in a scene of objects. The images are almost ideal, i.e., zero distortion and no artifacts.

So to start off, this is my first pair of images.

I treat the right camera as the "origin". In metric world coordinates, the left camera is at (1,1,1) and the right is at (-1,1,1) (2m baseline along X). Using feature matching, I construct the essential matrix and from it the R and t of the left camera w.r.t. the right. This is what I get.

```
R in euler angles: [-0.00462468, -0.0277675, 0.0017928]
t matrix: [-0.999999598978524; -0.0002907901840156801; -0.0008470441900959029]
```

Which is right, because the displacement is only along the X axis in the camera frame. For the second pair, the left camera is now at (1,1,2) (moved upwards by 1m).

Now the R and t of left w.r.t. right become:

```
R in euler angles: [0.0311084, -0.00627169, 0.00125991]
t matrix: [-0.894611301085138; -0.4468450866008623; -0.0002975759140359637]
```

Which again makes sense: there is no rotation, and the displacement along the Y axis is half of the baseline along X. However, this t is only a unit-length direction, so it doesn't give me the real metric estimates.

So in order to get metric estimates of pose in case 2, I constructed the 3D points using points from camera 1 and camera 2 in case 1 (taking the known 2m baseline into account), and then ran the PNP algorithm with those 3D points and the image points from case 2. Strangely, both the ITERATIVE and EPNP algorithms give me a similar and completely wrong result that looks like this:

```
Pose according to final PNP calculation is:
Rotation euler angles: [-9.68578, 15.922, -2.9001]
Metric translation in m: [-1.944911461358863; 0.11026997013253; 0.6083336931263812]
```

Am I missing something basic here? I thought this should be a relatively straightforward calculation for PNP given that there's no distortion etc. Any comments or suggestions would be very helpful, thanks!

EDIT: Code for PNP implementation

Let's say pair 1 consists of queryImg1 and trainImg1, and pair 2 consists of queryImg2 and trainImg2 (vectors of 2D points). Triangulation with pair 1 results in a vector of 3D points, points3D.

- Iterate through trainImg1 and see if the same point exists in trainImg2 (because that camera does not move).
- If the same feature is tracked in trainImg2, find the corresponding match from queryImg2.
- Form the vectors P3D_tracked (subset of tracked 3D points) and P2D_tracked (subset of tracked 2D points).

`for(int i = 0; i < (int)trainImg1.size(); i++) { vector<Point2d>::iterator iter = find(trainImg2.begin(), trainImg2.end(), trainImg1[i]); size_t index = distance(trainImg2.begin(), iter ...`

Your choice of coordinate systems seems a bit arbitrary to me. Make sure that you follow OpenCV's convention by using right-handed coordinate systems with Z pointing away from the camera in the direction of the observed scene. Also, you state that your right camera is defined as the origin. However, in the very next sentence, you state that the right camera is at (-1,1,1). If the projection matrices of your two views are wrong, the triangulated points will be wrong too, and PnP of course won't be able to calculate the correct solution.

Sorry, the (1,1,1) and (-1,1,1) are just the positions in the simulator world, which I mentioned for clarity. OpenCV does not use these values, and the projection matrices are computed using the R and t obtained through decomposing the essential matrix.

I would start off by checking your feature point correspondences (display them on your synthetic images). Then, since you know the ground truth of the corresponding 3D points from your Gazebo model, check whether the values of your triangulated feature points actually make sense.

I've tried displaying the feature points and epipolar lines on the images, and the triangulated points through PCL, and they all make sense. I've also noticed something curious here: say I have X 3D points from a correspondence. If I run PNP on these X points and the corresponding image points, I get the correct answer. But if I take a subset of these X points (i.e., only those points that are visible in the next pair of images) and then run PNP on those and the corresponding 2D points, I get this junk value. Another point: if I track two 3D points in both case 1 and case 2, compute the distance between them in each case, and take the ratio of those distances, that ratio agrees with the change in baseline too.

How large is the subset of 2D-3D correspondences that gives you the wrong pose? It would also be helpful if you could post your code (maybe append it to your question). The distance ratio is an indicator that your triangulated points are correct. So it really seems like there is something wrong with the pose estimation from the subset of correspondences.

Could be, but it happens with every two pairs of images, which is strange. Anyway, I am appending that particular piece of my code to the question.