Depends on your application. Do you have a set of fixed markers, or several markers that can move around the scene?

As you said, solvePnP gives you the RT matrix of the camera, given correspondences between the 3D coordinates of some points and their 2D projections on the image; the 3D coordinates have to be known by some other means.

For augmented reality with markers, the idea is that you know the real-world size of the markers a priori. For instance, for a 10 cm square marker you can say that the coordinates of its corners are (0,0,0), (0.1,0,0), (0,0.1,0), (0.1,0.1,0). Once you have detected the marker, solvePnP will give you the pose of the camera relative to it.
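For illustration, that corner setup can be written out like this; the pixel values are made-up detections, and the small structs are stand-ins for OpenCV's cv::Point3f/cv::Point2f:

```cpp
#include <array>

// Corners of a hypothetical 10 cm square marker in the marker's own frame
// (the marker lies in the Z = 0 plane). Paired with the detected 2D pixel
// positions, this is the objectPoints/imagePoints input to cv::solvePnP.
struct P3 { double x, y, z; };
struct P2 { double u, v; };

const std::array<P3, 4> objectPoints = {{
    {0.0, 0.0, 0.0},  // top-left
    {0.1, 0.0, 0.0},  // top-right
    {0.0, 0.1, 0.0},  // bottom-left
    {0.1, 0.1, 0.0},  // bottom-right
}};

// Made-up pixel coordinates of the detected corners, in the same order.
const std::array<P2, 4> imagePoints = {{
    {310.0, 228.0}, {402.0, 231.0}, {308.0, 319.0}, {399.0, 322.0},
}};
```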

Note that the RT matrix is the transform that converts absolute world coordinates into coordinates relative to the camera. So, if P = (0.05,0.05,0,1.0) (homogeneous coordinates) is the centre of the marker, its position relative to the camera will be RT*P. This can also be used to determine the marker orientation.
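To make the RT*P step concrete, here is a minimal sketch using plain arrays instead of cv::Mat; the pose used below is hypothetical (identity rotation, camera 1 m away along Z):

```cpp
#include <array>

using Vec4 = std::array<double, 4>;
using Mat4 = std::array<Vec4, 4>;

// Apply the 4x4 world-to-camera transform RT to a homogeneous point P:
// the result's x, y, z are P's coordinates in the camera frame.
Vec4 transform(const Mat4& RT, const Vec4& P) {
    Vec4 out = {0.0, 0.0, 0.0, 0.0};
    for (int i = 0; i < 4; i++)
        for (int j = 0; j < 4; j++)
            out[i] += RT[i][j] * P[j];
    return out;
}
```

With an identity rotation and a translation of 1 m along Z, the marker centre (0.05, 0.05, 0) ends up 1 m in front of the camera.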

Likewise, if you want to draw an overlay on top of the marker (augmented reality), you can use the marker's coordinate frame as the "world coordinates", and render the overlay based on the computed camera pose.

That said, if you have several mobile markers, you have to compute the camera's pose relative to each marker with a separate call to solvePnP.

Note that if the appearance of the markers is known but you don't have their real-world size, you will have to assign them a size in an arbitrary unit, since there are infinitely many combinations of size and 3D position that produce the same appearance in the camera.
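That ambiguity follows directly from the pinhole projection equations: scaling a point's 3D coordinates (and hence the marker's size and distance) by any factor leaves its pixel unchanged. A quick sketch, with made-up intrinsics fx, fy, cx, cy:

```cpp
// Pinhole projection: u = fx * X/Z + cx, v = fy * Y/Z + cy.
// The intrinsic values below are arbitrary, for illustration only.
struct Pixel { double u, v; };

Pixel project(double X, double Y, double Z) {
    const double fx = 800.0, fy = 800.0, cx = 320.0, cy = 240.0;
    return { fx * X / Z + cx, fy * Y / Z + cy };
}
// project(0.25, 0.5, 1.0) and project(0.5, 1.0, 2.0) give the same pixel:
// a marker twice as large but twice as far away looks identical.
```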

Important: RT is a 4x4 matrix and P is a 4x1 vector (x, y, z, w) where w is 1.0 (homogeneous coordinates). solvePnP gives you a rotation vector rvec (in axis-angle/Rodrigues form, not a rotation matrix) and a translation vector tvec.
You compute the 3x3 rotation matrix R from rvec using cv::Rodrigues. I use the following procedure to build RT from the rvec and tvec returned by solvePnP:

```
void RvecTvecToRT(const cv::Mat& Rvec, const cv::Mat& Tvec, cv::Mat& RT)
{
    RT = cv::Mat::eye(4, 4, CV_64F); // start from the identity matrix
    cv::Mat R;
    cv::Rodrigues(Rvec, R); // convert the rotation vector to a 3x3 matrix
    // Store the R and T that transform from world to camera coordinates
    for (int i = 0; i < 3; i++) {
        RT.at<double>(i, 3) = Tvec.at<double>(i, 0); // translation column
        for (int j = 0; j < 3; j++) {
            RT.at<double>(i, j) = R.at<double>(i, j); // rotation block
        }
    }
}
```
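For intuition, here is a dependency-free sketch of what cv::Rodrigues computes (in real code, just call cv::Rodrigues): the direction of rvec is the rotation axis, its norm is the angle t, and R = I + sin(t)K + (1 - cos(t))K², where K is the skew-symmetric matrix of the unit axis.

```cpp
#include <array>
#include <cmath>

using Vec3 = std::array<double, 3>;
using Mat3 = std::array<std::array<double, 3>, 3>;

// Axis-angle (Rodrigues) vector to 3x3 rotation matrix.
Mat3 rodrigues(const Vec3& rvec) {
    const double t = std::sqrt(rvec[0]*rvec[0] + rvec[1]*rvec[1] + rvec[2]*rvec[2]);
    Mat3 R{};                              // zero-initialised
    R[0][0] = R[1][1] = R[2][2] = 1.0;     // start from the identity
    if (t < 1e-12) return R;               // zero rotation
    const double kx = rvec[0]/t, ky = rvec[1]/t, kz = rvec[2]/t;
    Mat3 K{};                              // skew matrix of the unit axis
    K[0][1] = -kz; K[0][2] =  ky;
    K[1][0] =  kz; K[1][2] = -kx;
    K[2][0] = -ky; K[2][1] =  kx;
    const double s = std::sin(t), c = 1.0 - std::cos(t);
    for (int i = 0; i < 3; i++)
        for (int j = 0; j < 3; j++) {
            double k2 = 0.0;               // (K*K)(i, j)
            for (int m = 0; m < 3; m++) k2 += K[i][m] * K[m][j];
            R[i][j] += s * K[i][j] + c * k2;
        }
    return R;
}
```

For rvec = (0, 0, π/2) this yields a 90° rotation about the Z axis.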

Based on your comment, this is very similar to something I implemented a long time ago, using pictures as AR markers.

Basically, as a pre-processing step, you first have to compute the keypoints and associated descriptors for each AR marker. That is, for a marker, you will have a set of ... (more)