Disclaimer: I'm not that familiar with the function, and don't have time to dig into it right now :( so this will be a bit speculative. Hopefully someone else weighs in with more authority.

One of the outputs of stereoCalibrate is the (R,t) rigid transform relating the two camera coordinate frames, so I would guess that these are directly optimised. The cameras then need to be related to the target (chessboard) coordinate sytem, so there is probably a collection of transformations from one camera's coordinate system to the target (or vice versa).

From 'Learning OpenCV', I think where the median comes in is the generation of the *initial guess* for (R,t) between cameras, that is then refined during the optimization process. In a single-camera calibration, you get a transform relating camera and target, and by combining these from each camera, you can get an estimate of (R,t). But remember that single camera calibration gives a transform between the camera and *every distinct target pose*, corresponding to an image. So *each pair* of stereo images implies some estimate of (R,t). I think you then take the median of these estimates to get the initial guess.

I'm pretty sure you wouldn't do (b) because the optimization would be over-parameterized; more transformations than are needed to model the system. I don't think medians would be taken after the optimization, as the new values would not have been explicitly optimized, so may not agree nicely with e.g. the intrinsics that were used.

Hope this helps in some small way....