
Hand Posture Recognition using Machine Learning

asked 2016-07-15 04:42:00 -0600


updated 2016-07-16 07:36:14 -0600

Hey guys,

I am currently working on my thesis, and for that I am trying to recognize different hand postures for controlling an embedded system. My first implementation used simple skin color segmentation, calculation of contours, convex hull, defects and so on. This works, but I am not satisfied, since it is not very robust. This approach also detects other skin-colored objects, and I don't want to compute all these things every frame.

Because of that I started to think about machine learning algorithms. I thought about using a Support Vector Machine for my task. My problem is that I don't really know what to use. Is an SVM a good choice for hand posture recognition, or is another machine learning algorithm more appropriate? I am also unsure about what features to use for training my model. Currently I use SURF features, since SIFT is too slow. Are other features more suitable? And my last problem is the training set. What kind of images do I need to train my model?

So my questions are:

  • Is a Support Vector Machine a good choice for hand posture recognition?

  • What are good features to train the model?

  • What kind of images do I need for my training? How many for each class? Grayscale, Binary, Edges?

I know that these are not very specific questions, but there is so much literature out there and I need some advice on where to look.

I am working with the OpenCV Python bindings for my image processing and use the scikit-learn package for the machine learning part.


@Pedro Batista, first of all thank you very much for your detailed answer, I really appreciate it. The system should run in a laboratory environment. So the user has to interact with different devices and should be able to control some of these devices by hand postures/gestures. The background might be stable, but it is not a simple white/black background. For the moment I assume that the user places his hand close in front of the camera.

Yesterday I made a minimal example with an SVM. I took some sample images of three different hand postures (open hand, fist and two fingers), only 20 images per posture, each of size 320×240. The size and distance of the hand was nearly the same in every image.

After that I segmented the hand by simple thresholding in the YCrCb color space and performed some smoothing and opening/closing operations. By calculating the biggest contour (which I assume is the hand) I got two features: the area of the contour and its perimeter. Calculating more features should be no problem, like convexity defects, angles and so on. I used these two features to train my SVM and got the following classification (area on the x axis and perimeter on the y axis).

[Plot: SVM classification of the training samples, contour area on the x axis and perimeter on the y axis]

So in this case the simple classification works quite well, but so far I have only worked with an idealized situation. For the moment ... (more)



As you can tell by that plot, using an SVM to classify your area/perimeter data is overkill for the problem, even though it is a good, simple way to figure out how SVMs work. You can classify that data with simple heuristics (if statements). ML becomes useful when the feature space gets so big that you can't interpret it yourself and need a machine to do it for you.

Pedro Batista ( 2016-07-18 05:21:37 -0600 )

k-Nearest Neighbours is the simplest ML algorithm because it doesn't actually compute a model to classify data. To classify an unlabelled sample, it goes through all the labelled training data and finds the samples most similar to the one being classified.

The only parameter is k, the number of neighbours consulted to classify each sample. If k=15, the algorithm finds the 15 closest samples, and each of their labels then counts as one vote for the corresponding class.

Pedro Batista ( 2016-07-18 05:52:29 -0600 )
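The voting scheme described in that comment is easy to sketch without any library model (the two toy clusters below are invented for illustration):

```python
import numpy as np

def knn_classify(train_X, train_y, sample, k=15):
    """Label `sample` by majority vote among its k nearest training samples.
    There is no training step: every query scans all labelled data."""
    dists = np.linalg.norm(train_X - sample, axis=1)  # distance to every sample
    nearest = np.argsort(dists)[:k]                   # indices of the k closest
    votes = np.bincount(train_y[nearest])             # one vote per neighbour
    return int(np.argmax(votes))

# Two toy clusters: label 0 near the origin, label 1 near (10, 10).
rng = np.random.default_rng(0)
train_X = np.vstack([rng.normal(0, 1, (20, 2)), rng.normal(10, 1, (20, 2))])
train_y = np.array([0] * 20 + [1] * 20)
```

A query near (0, 0) collects mostly label-0 neighbours and is voted into class 0; larger k smooths out noisy labels at the cost of blurring class boundaries.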

1 answer


answered 2016-07-15 09:18:56 -0600

updated 2016-07-15 09:25:33 -0600

As far as I can tell, there are two different problems here.

The first is about hand segmentation, and for us to help with this you should provide more information about the algorithm's working environment (luminosity, noise, background, camera position in relation to hand, tracked hand size in relation to image, etc). I'll go on assuming that skin-colour segmentation is working well.

Then there is gesture classification.

SURF provides useful information about edge orientation and other things, but there is really not much point in using it when you are already able to segment the hand from the rest of the image.

When deciding on a ML approach, you should consider the simplest and fastest methods first, even more so when you are working on an embedded system. SVM works well when the data to be classified can be separated by a linear plane (a hyperplane); non-linear SVMs can additionally separate data with curved boundaries. SVM is not a light ML algorithm, and there is no indication that it would carry any advantage for this problem, but the only way to be sure is to try.
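The linear vs. non-linear distinction can be shown on a toy scikit-learn example, unrelated to the hand data: XOR-style points that no single straight line can separate.

```python
import numpy as np
from sklearn.svm import SVC

# XOR-style toy data: diagonally opposite corners share a class,
# so no straight line separates class 0 from class 1.
X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]], float)
y = np.array([0, 0, 1, 1])

linear_clf = SVC(kernel="linear").fit(X, y)
rbf_clf = SVC(kernel="rbf", gamma=1.0).fit(X, y)

# A linear boundary can classify at most 3 of the 4 points;
# the RBF kernel bends the boundary and fits all of them.
linear_acc = linear_clf.score(X, y)
rbf_acc = rbf_clf.score(X, y)
```

The price of the curved boundary is extra kernel evaluations per prediction, which is why the comment about starting with the simplest method applies doubly on embedded hardware.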

When you have to solve a computer vision problem, the first thing you should do is think deeply about it and describe it, so let's do that:

What is a gesture? What may differentiate two different gestures in a hand? Here are the first things that pop to my mind.

  • Number of fingers showing
  • Holes in the hand (like when someone touches the thumb to the index finger)
  • Number of convexity defects
  • Angle between hand and arm?
  • Form factor (relationship between perimeter and area)
  • Solidity (relationship between blob area and bounding box area)

If the distance from hand to camera is always roughly the same, or you can compute that distance, you can even throw in simpler features like plain area and perimeter.

So, once you realize what kind of information you need, you can go on to think about what image processing methods would obtain it. In your scenario you have already segmented the hand, which is very good. Once you use findContours() on the segmented image you can obtain a lot of the properties mentioned above, and with the convexity defects you can already obtain the number of fingers and other information. So, the trick is to add as much relevant information as possible and use it to decide which gesture the hand is making.

After making these considerations you can start to think about whether ML is a good way to go. For this problem I would think about k-Nearest Neighbours, which is a very simple ML algorithm that has an OpenCV implementation. It is light, so it can work on embedded systems; it works well when the number of features is in the realm of dozens, and for simpler problems it will largely outperform more complicated ML engines.

This answer is already too long so I won't go into a kNN explanation. I ... (more)


Seen: 1,538 times
