Hmm, there are several things you will have to do:
- Decide whether to undistort the image or not (to cope with the fisheye lens distortion) so that you have a geometrically correct view of the room (see the undistortion and detection sketch after this list).
- Since many person detectors are not rotation invariant, you will have to define a relation between position, rotation, and size for each location in the input image.
- You can then apply your person detector.
- Use each detection together with a Kalman tracker for path and speed estimation (see the Kalman sketch after this list).
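
For the undistortion and detection steps, here is a minimal sketch using OpenCV's fisheye model and the stock HOG people detector. The intrinsics `K` and distortion coefficients `D` below are placeholders you would get from your own `cv2.fisheye.calibrate` run, and HOG is just one possible detector choice, not necessarily the best for overhead or fisheye views.

```python
import cv2
import numpy as np

# Placeholder calibration results; replace with your own camera's values.
K = np.array([[300.0, 0.0, 320.0],
              [0.0, 300.0, 240.0],
              [0.0, 0.0, 1.0]])
D = np.array([[0.1], [0.01], [0.0], [0.0]])  # fisheye distortion coefficients

def undistort(frame):
    h, w = frame.shape[:2]
    # Build the remap tables; in practice compute these once per camera.
    map1, map2 = cv2.fisheye.initUndistortRectifyMap(
        K, D, np.eye(3), K, (w, h), cv2.CV_16SC2)
    return cv2.remap(frame, map1, map2, interpolation=cv2.INTER_LINEAR)

# Stock HOG person detector as a stand-in; swap in your preferred detector.
hog = cv2.HOGDescriptor()
hog.setSVMDetector(cv2.HOGDescriptor_getDefaultPeopleDetector())

def detect_people(frame):
    rects, weights = hog.detectMultiScale(frame, winStride=(8, 8), scale=1.05)
    return rects  # list of (x, y, w, h) boxes
```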
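
For the Kalman step, a minimal sketch of a constant-velocity tracker built on `cv2.KalmanFilter`: the state is `[x, y, vx, vy]`, so the velocity components give you the speed estimate directly. The noise covariances here are rough placeholders you will have to tune.

```python
import cv2
import numpy as np

def make_tracker(cx, cy, dt=1.0):
    kf = cv2.KalmanFilter(4, 2)  # 4 state variables, 2 measured (x, y)
    kf.transitionMatrix = np.array([[1, 0, dt, 0],
                                    [0, 1, 0, dt],
                                    [0, 0, 1, 0],
                                    [0, 0, 0, 1]], np.float32)
    kf.measurementMatrix = np.array([[1, 0, 0, 0],
                                     [0, 1, 0, 0]], np.float32)
    kf.processNoiseCov = np.eye(4, dtype=np.float32) * 1e-2   # tune for your scene
    kf.measurementNoiseCov = np.eye(2, dtype=np.float32) * 1e-1
    kf.statePost = np.array([[cx], [cy], [0], [0]], np.float32)
    return kf

# Per frame: predict, then correct with the detection's center point.
# tracker = make_tracker(cx0, cy0)
# pred = tracker.predict()                                  # predicted [x, y, vx, vy]
# est = tracker.correct(np.array([[cx], [cy]], np.float32))
# speed_px_per_frame = float(np.hypot(est[2], est[3]))
```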
A less robust alternative is to assume that the background never changes drastically. In that case a simpler approach could be to:
- Apply efficient background subtraction
- If a blob appears, calculate its center point
- For each new center point, initialize a Kalman tracker (see the sketch after this list)
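
A rough sketch of that simpler pipeline, again with OpenCV: MOG2 background subtraction, contour moments for the blob centers, and the same `make_tracker` helper from the Kalman sketch above for each new center. The area threshold and morphology kernel are arbitrary values you will need to tune for your scene.

```python
import cv2

bg_sub = cv2.createBackgroundSubtractorMOG2(history=500, detectShadows=True)

def blob_centers(frame, min_area=500):
    mask = bg_sub.apply(frame)
    # Drop shadow pixels (marked 127 by MOG2) and clean up noise.
    _, mask = cv2.threshold(mask, 200, 255, cv2.THRESH_BINARY)
    mask = cv2.morphologyEx(mask, cv2.MORPH_OPEN,
                            cv2.getStructuringElement(cv2.MORPH_ELLIPSE, (5, 5)))
    # OpenCV 4.x signature; 3.x returns an extra first value.
    contours, _ = cv2.findContours(mask, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    centers = []
    for c in contours:
        if cv2.contourArea(c) < min_area:
            continue  # ignore small noise blobs
        m = cv2.moments(c)
        centers.append((m["m10"] / m["m00"], m["m01"] / m["m00"]))
    return centers  # feed each new center into make_tracker() from the sketch above
```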
Start trying some stuff out and report back if something fails :)