You can start like this:
- Detect features (optionally on the GPU), then extract descriptors. The choice of features and descriptors depends on your application (what you are trying to do). Here are some explanations of features and descriptors.
- Use FileStorage to save to .xml (and to load from XML too)
- XML is portable, so there should be no problem using the file in different environments
- To read video frames you can use VideoCapture
- To detect the object you can take inspiration from this example. For this you also need to match the descriptors
- You may also want to use tracking, so that you search only a small region instead of detecting in the whole of every frame
- I have no example for playing a video inside another, but you can use two VideoCapture objects and put the frame from one into the detected area of the frame from the other. To deform the inner frame you can use warpAffine (or whatever other geometric transformation you need).
Then you can save the new video or play it directly... Hope this helps. You can ask again once you have started something, and say what doesn't work.