If you google "face anti-spoofing" or "liveness detection," you will find many articles and research papers that tackle this task. There also appears to be a fair amount of open source code you can try out online.
Face anti-spoofing/liveness detection can be thought of as a binary classification task (i.e., does a given image or video stream contain a genuine face or not?). This isn't my research area, but I would assume there are two main routes you can take:
1. Motion approach: ask the user to blink or move in a way that convinces you they are real. This will likely involve ML models to detect the desired action in a video stream (see the blink-detection sketch after this list).
2. Feature approach: extract useful features from an image or from video frames (this might require that the camera capture depth information in addition to color) and use those features to make a classification decision. This will almost certainly involve ML models to extract the features and perform the final classification.
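For the motion route, a common trick is to track the eye aspect ratio (EAR) over time and count how often it dips, which indicates blinks. Here is a minimal sketch assuming you already have per-frame eye landmarks from something like dlib or MediaPipe; the function names, threshold, and frame counts are illustrative guesses, not tuned values:

```python
import numpy as np

def eye_aspect_ratio(eye: np.ndarray) -> float:
    """eye: (6, 2) array of landmarks around one eye, ordered as in the
    classic 68-point dlib layout. The ratio drops sharply when the eye closes."""
    vertical = np.linalg.norm(eye[1] - eye[5]) + np.linalg.norm(eye[2] - eye[4])
    horizontal = np.linalg.norm(eye[0] - eye[3])
    return vertical / (2.0 * horizontal)

def count_blinks(ear_per_frame, closed_thresh=0.21, min_closed_frames=2):
    """Count blinks in a sequence of per-frame EAR values
    (illustrative threshold; tune on your own data)."""
    blinks, closed_run = 0, 0
    for ear in ear_per_frame:
        if ear < closed_thresh:
            closed_run += 1
        else:
            if closed_run >= min_closed_frames:
                blinks += 1
            closed_run = 0
    return blinks

# Hypothetical usage: ask the user to blink, then verify a blink occurred.
# ears = [eye_aspect_ratio(eye_landmarks_for_frame(f)) for f in video_frames]
# is_live = count_blinks(ears) >= 1
```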
There are likely hybrid approaches as well. Given your static image constraint, (2) is probably where you want to start; a rough sketch of that route follows.
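As one possible starting point for the feature route, a classic (if dated) baseline is to compute local binary pattern (LBP) texture histograms from grayscale face crops and feed them to an SVM; printed photos and screens tend to have different micro-texture than real skin. This is only a sketch, not the state of the art, and names like `lbp_histogram` and `train_faces` are placeholders for your own data pipeline:

```python
import numpy as np
from skimage.feature import local_binary_pattern
from sklearn.svm import SVC

def lbp_histogram(gray_face: np.ndarray, points=8, radius=1) -> np.ndarray:
    """Uniform LBP histogram of a grayscale face crop, used as a texture feature."""
    lbp = local_binary_pattern(gray_face, points, radius, method="uniform")
    hist, _ = np.histogram(lbp, bins=points + 2, range=(0, points + 2), density=True)
    return hist

# Hypothetical training/inference flow (labels: 1 = genuine, 0 = spoof):
# X = np.stack([lbp_histogram(face) for face in train_faces])
# clf = SVC(kernel="rbf", probability=True).fit(X, y)
# is_genuine = clf.predict([lbp_histogram(test_face)])[0] == 1
```

A modern system would more likely fine-tune a CNN on genuine vs. spoof images, but the overall shape (features in, binary decision out) stays the same.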