What you could do is implement a basic background subtraction algorithm:
- Every 20 frames, for example, capture a reference frame (once is enough if the environment is static).
- For each subsequent frame, subtract the reference frame from the current one; the non-zero differences mark the regions with change and movement.
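- You could then treat every pixel where the subtraction result is 0 as background (this gives you a binary mask) and the pixels with non-zero values as the foreground mask. In practice, a small threshold instead of exactly 0 is more robust to sensor noise.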
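
The steps above could be sketched roughly like this. It is only a minimal pure-Python illustration, assuming grayscale frames given as lists of rows of integer pixel values. The names `foreground_mask`, `subtract_background`, `REFRESH_EVERY` and `THRESHOLD` are made up for this example, and taking the absolute difference is my own choice so that darker and brighter changes are both detected. On real video you would more likely do the same thing with NumPy arrays or OpenCV (for example `cv2.absdiff`).

```python
REFRESH_EVERY = 20  # take a new reference every 20 frames, as in the list above
THRESHOLD = 0       # assumption: differences <= THRESHOLD count as "no change"


def foreground_mask(reference, frame, threshold=THRESHOLD):
    """Return a binary mask: 1 where the frame differs from the reference, else 0."""
    return [
        [1 if abs(p - r) > threshold else 0 for p, r in zip(frame_row, ref_row)]
        for frame_row, ref_row in zip(frame, reference)
    ]


def subtract_background(frames, refresh_every=REFRESH_EVERY, threshold=THRESHOLD):
    """Yield (frame_index, mask) for every frame that is not a reference frame."""
    reference = None
    for i, frame in enumerate(frames):
        if i % refresh_every == 0:
            reference = frame  # this frame becomes the new reference
            continue
        yield i, foreground_mask(reference, frame, threshold)


# Tiny 2x3 grayscale example: one pixel changes in the second frame.
background = [[10, 10, 10], [10, 10, 10]]
moved = [[10, 200, 10], [10, 10, 10]]
masks = dict(subtract_background([background, moved]))
```

Here `masks[1]` holds the mask for the second frame, with 1 at the changed pixel (foreground) and 0 everywhere else (background). Raising `threshold` above 0 makes the mask less sensitive to small brightness fluctuations.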