# Stereo Calibration - what is optimised?

In the stereo camera calibration code, I'm a bit confused as to what parameters are optimised. If we ignore intrinsic parameters for now, is it:

(a) The left (or equivalently right) extrinsic (chessboard to camera) matrices and 6 parameters for the left-to-right transformation? If so, what are we taking the median of?

(b) The left and right extrinsic matrices, and then we take the median of the implied left-to-right transformation?

Any help would be much appreciated.

Thanks,

Matt



Disclaimer: I'm not that familiar with the function, and don't have time to dig into it right now :( so this will be a bit speculative. Hopefully someone else weighs in with more authority.

One of the outputs of stereoCalibrate is the (R,t) rigid transform relating the two camera coordinate frames, so I would guess that these are directly optimised. The cameras then need to be related to the target (chessboard) coordinate system, so there is probably also a collection of transformations from one camera's coordinate system to the target (or vice versa).

From 'Learning OpenCV', I think where the median comes in is the generation of the initial guess for (R,t) between cameras, that is then refined during the optimization process. In a single-camera calibration, you get a transform relating camera and target, and by combining these from each camera, you can get an estimate of (R,t). But remember that single camera calibration gives a transform between the camera and every distinct target pose, corresponding to an image. So each pair of stereo images implies some estimate of (R,t). I think you then take the median of these estimates to get the initial guess.
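If that reading is right, each stereo pair's single-camera poses imply one estimate of the left-to-right transform, and the initial guess is a median over those. A minimal sketch in plain Python (hypothetical poses and baseline values, identity rotations to keep it short; OpenCV's actual code parameterises rotations as Rodrigues vectors, which is what you would median in the general case):

```python
from statistics import median

def inv(T):
    """Inverse of a rigid transform (R, t): (R^T, -R^T t)."""
    R, t = T
    Rt = [[R[j][i] for j in range(3)] for i in range(3)]
    return (Rt, [-sum(Rt[i][k] * t[k] for k in range(3)) for i in range(3)])

def mul(Ta, Tb):
    """Compose rigid transforms: apply Tb first, then Ta."""
    R1, t1 = Ta
    R2, t2 = Tb
    R = [[sum(R1[i][k] * R2[k][j] for k in range(3)) for j in range(3)]
         for i in range(3)]
    t = [sum(R1[i][k] * t2[k] for k in range(3)) + t1[i] for i in range(3)]
    return (R, t)

I = [[1.0, 0, 0], [0, 1.0, 0], [0, 0, 1.0]]

# Hypothetical target->camera poses from single-camera calibration,
# one per stereo image pair (identity rotations for simplicity).
left_poses = [(I, [0.0, 0.0, 1.0]), (I, [0.1, 0.0, 1.2]), (I, [-0.1, 0.0, 1.5])]
# Right camera roughly 0.1 units along x from the left, with per-view noise.
baselines = [0.09, 0.10, 0.12]
right_poses = [(I, [tl[0] + b, tl[1], tl[2]])
               for (_, tl), b in zip(left_poses, baselines)]

# Each pair implies an estimate of left-to-right: T_lr = T_r o inv(T_l).
implied = [mul(Tr, inv(Tl)) for Tl, Tr in zip(left_poses, right_poses)]

# Element-wise median of the translation components across pairs
# (rotations are all identity here, so only the translation varies).
tx = median(T[1][0] for T in implied)
print(tx)  # 0.1
```

The median makes the initial guess robust to one or two badly estimated poses, which a mean would not be.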

I'm pretty sure you wouldn't do (b) because the optimization would be over-parameterized; more transformations than are needed to model the system. I don't think medians would be taken after the optimization, as the new values would not have been explicitly optimized, so may not agree nicely with e.g. the intrinsics that were used.

Hope this helps in some small way....


Hi AJW,

thanks for your answer. I agree, so let's discard option (b). However, if I look through the code I see:

    // we optimize for the inter-camera R(3),t(3), then, optionally,
    // for intrinisic parameters of each camera ((fx,fy,cx,cy,k1,k2,p1,p2) ~ 8 parameters).
    nparams = 6*(nimages+1) + (recomputeIntrinsics ? NINTRINSIC*2 : 0);


so, I think that we have option:

(c) Use cvFindExtrinsicCameraParams2 to find extrinsic parameters for each camera. This gives an implied (R,t) between the left and right camera for each image pair. Take the median of the implied (R,t) transformations as a starting estimate, then optimise all the extrinsic parameters. So for 10 images, that's 66 parameters (ignoring intrinsics for now).
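Plugging 10 images into the quoted formula directly (NINTRINSIC is 8 per the code comment):

```python
NINTRINSIC = 8  # (fx, fy, cx, cy, k1, k2, p1, p2), per the quoted comment

def nparams(nimages, recompute_intrinsics=False):
    # 6 parameters (3 rotation + 3 translation) per target pose,
    # plus 6 for the inter-camera transform, plus optional intrinsics
    # for both cameras.
    return 6 * (nimages + 1) + (NINTRINSIC * 2 if recompute_intrinsics else 0)

print(nparams(10))        # 66
print(nparams(10, True))  # 82
```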

Which still sounds like a lot of parameters.

Thanks

Matt

"A lot" has to be considered relative to the problem model.

You have a set of corner pixel coordinates, and the 3D coordinates of the corresponding physical points, expressed in the target's local coordinate system. The goal is to (virtually) project those 3D points onto the images, and compare where they landed with where you actually detected the corners in the images.

Now, in order to project 3D points onto the cameras, they need to be expressed in the camera's coordinate system. So you need a rigid transform from target system to camera. In each stereo image pair, the target is physically in a different pose, so we need a different transformation for each pose to be able to use the associated data. So that's 10 (R,t) transforms for 10 images, and one more to link the cameras.

Admittedly we don't really care about the target-to-camera transformations once we have the calibration, but they are needed during the process to model the geometry of the system.
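The geometry described above can be sketched in plain Python, with made-up poses and a simple pinhole model (no distortion), rather than OpenCV's actual code. Each target point is pushed through a target-to-left-camera pose, and through the composition with the left-to-right transform for the right camera, before being projected:

```python
def apply(T, p):
    """Apply rigid transform (R, t) to a 3D point."""
    R, t = T
    return [sum(R[i][k] * p[k] for k in range(3)) + t[i] for i in range(3)]

def compose(Ta, Tb):
    """Compose rigid transforms: apply Tb first, then Ta."""
    R1, t1 = Ta
    R2, t2 = Tb
    R = [[sum(R1[i][k] * R2[k][j] for k in range(3)) for j in range(3)]
         for i in range(3)]
    return (R, apply(Ta, t2))

def project(p_cam, fx=500.0, fy=500.0, cx=320.0, cy=240.0):
    """Pinhole projection of a point in camera coordinates (no distortion)."""
    X, Y, Z = p_cam
    return (fx * X / Z + cx, fy * Y / Z + cy)

I = [[1.0, 0, 0], [0, 1.0, 0], [0, 0, 1.0]]
T_left = (I, [0.0, 0.0, 2.0])    # target -> left camera for one view (made up)
T_lr = (I, [0.1, 0.0, 0.0])      # left -> right camera (made-up baseline)
T_right = compose(T_lr, T_left)  # target -> right camera

board_pt = [0.05, 0.02, 0.0]     # a corner in the target's local frame

uv_left = project(apply(T_left, board_pt))
uv_right = project(apply(T_right, board_pt))
print(uv_left)   # (332.5, 245.0)
print(uv_right)  # (357.5, 245.0)

# The optimiser adjusts the 6*(nimages+1) pose parameters (and optionally the
# intrinsics) to minimise the squared differences between projections like
# these and the detected corner pixels, summed over all corners and views.
```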
