This is part of a series I plan to write about Deep Learning and Computer Vision, addressing the problem of Hand Recognition. In this first story I give an introduction to the problem and list the solutions available today, along with their drawbacks. It’s far from academic; I assume you know what a Neural Network and a Convolutional Network are, but I don’t dive into them, so a superficial understanding is enough.
I think Hand Recognition (HR from now on) is an interesting problem since it offers a good trade-off between the complexity of the problem and the satisfaction of seeing your algorithm actually detecting your hand live on the webcam. The detection uses Convolutional Neural Networks, which are now the standard when it comes to image recognition, and if you really understand what you want your algorithm to do, it’s easy to implement using frameworks such as Tensorflow.
Given that we use hands for almost every interaction we have with objects, it’s easy to understand the importance of being able to detect them on plain RGB images, without the use of any other sensor.
According to some market research studies I’ve found, the hand gesture recognition market is expected to reach USD 30 billion by 2025. If you think about how much the introduction of touch screens changed our software UIs and the way we interact with our devices, touchless screens are surely the next big step. While it’s not so easy to imagine how this will impact mobile phones and laptops, there is no doubt that hand recognition will make it easier to use the car’s computer or to turn the television’s volume up and down without having to figure out where the remote control is.
Before Deep Learning, older methods had to perform a lot of sophisticated calculations and heuristics to infer which of the shapes they saw were actually hands. Strong priors and physical restrictions had to be imposed, and the results weren’t that good when evaluated in the wild, outside the highly controlled conditions of the laboratory. Furthermore, the detection was usually done using depth sensors, which provided more useful information than normal cameras.
The arrival of Neural Networks and Deep Learning changed everything in Computer Vision. Not only are the algorithms now somewhat friendlier, but the trend of using depth sensors to detect body parts is over. While it’s true that RGB cameras are just another type of sensor, they are cheaper than depth sensors and are present almost everywhere.
The first step we can take on Hand Detection is to frame it as an Image Classification task. We use a classifier that receives an image as input and returns the probability that the image contains a hand. Any Neural Network classifier can be used for this, as long as it was trained on a dataset containing images labeled as hands. AlexNet, ResNet or any version of Inception will be able to perform this, each one with a different trade-off between accuracy and complexity.
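To make the idea concrete, here is a minimal sketch of what such a binary hand / no-hand classifier could look like. It assumes tf.keras with an ImageNet-pretrained MobileNetV2 backbone and a placeholder dataset; it’s not the exact setup of any particular paper.

```python
import tensorflow as tf

# Minimal sketch: reuse an ImageNet-pretrained backbone and add a binary
# "hand / no hand" head on top. Dataset and training details are placeholders.
base = tf.keras.applications.MobileNetV2(
    input_shape=(224, 224, 3), include_top=False, weights="imagenet")
base.trainable = False  # train only the new classification head at first

model = tf.keras.Sequential([
    base,
    tf.keras.layers.GlobalAveragePooling2D(),
    tf.keras.layers.Dense(1, activation="sigmoid"),  # P(image contains a hand)
])
model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])

# train_ds would be a tf.data.Dataset yielding (image, 0/1 label) pairs built
# from a folder of hand and non-hand images.
# model.fit(train_ds, epochs=5)
```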
Having a classifier able to recognize hand images is a cool first step, but it’s not really useful. Usually, detecting the presence or absence of a hand is not enough; the really interesting functionality arises once we are able to track it through the scene and detect how it’s moving.
We can interact better with hands if we are able to detect where they are located on the captured image. In the object detection task, the outputs are the different objects seen on the image and their locations. These locations are usually given in bounding box format, where for each object the coordinates of its containing rectangle are provided.
The Neural Networks used for Object Detection are Convolutional too, but they have different architectures so they can look for known objects in every part of the scene. R-CNN (the first one, released in 2014), Fast R-CNN, Faster R-CNN, YOLO and SSD are the most famous detection networks. These detection networks are used together with the same base networks used for image classification, so any combination of the two can be used: Fast R-CNN + ResNet, SSD + MobileNet, SSD + ResNet, etc.
To use any of these networks for hand detection you only need two things: the pre-trained model and the hands dataset. If you want to use Tensorflow, the models are available at the Model Zoo. The dataset should have thousands of images containing hands from different angles and, of course, the ground truth labels. The last layers of the model are re-trained with your custom data, and the result is a new model able to detect hands similar to the ones in the dataset. Tensorflow provides an easy way to do this re-training, and there are many guides for it if you Google “Tensorflow object detection”; I followed this one and got a working detection model. Although he doesn’t give much information about the training process, Victor Dibia did exactly this in this helpful repository, using SSD + MobileNet for the model and the EgoHands dataset.
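Once the re-training is done, inference is straightforward. The sketch below follows the TF 1.x frozen-graph style used by the Object Detection API (and by Victor Dibia’s repository); the model path is a placeholder and the tensor names are the standard ones exported by that API.

```python
import numpy as np
import tensorflow as tf
import cv2

# Placeholder path to a frozen graph exported by the Object Detection API.
PATH_TO_FROZEN_GRAPH = "hand_inference_graph/frozen_inference_graph.pb"

graph = tf.Graph()
with graph.as_default():
    graph_def = tf.GraphDef()
    with tf.gfile.GFile(PATH_TO_FROZEN_GRAPH, "rb") as f:
        graph_def.ParseFromString(f.read())
    tf.import_graph_def(graph_def, name="")

with tf.Session(graph=graph) as sess:
    image = cv2.cvtColor(cv2.imread("test.jpg"), cv2.COLOR_BGR2RGB)
    boxes, scores = sess.run(
        ["detection_boxes:0", "detection_scores:0"],
        feed_dict={"image_tensor:0": np.expand_dims(image, axis=0)})
    # Keep only confident detections; boxes are normalized [ymin, xmin, ymax, xmax].
    for box, score in zip(boxes[0], scores[0]):
        if score > 0.5:
            print("hand at", box, "score", round(float(score), 2))
```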
Being able to detect where a hand is is definitely more useful than just knowing there’s a hand somewhere on the image, and it lets you develop some interesting interactions with the computer, but it’s still far from providing all the functionality we need to do cool things. If we want to make a game to play rock–paper–scissors, for example, we need to know what the hand is doing, not only where it is.
The problem of detecting what the hand is doing is called gesture recognition. One approach to tackle it is to do end-to-end training of a Neural Network with a dataset for the hand gestures we want to target. So if you know which gestures you want to detect, for example “OK”, “Peace”, “horns”, etc., you need a dataset containing some thousands of hands performing these gestures, each properly tagged. This work, for example, uses a simple CNN and is able to distinguish 5 gestures. This type of gesture detector is meant to be run on images of hands, so a hand detector should first be used to find where the hands are on the image, and then the gesture classifier is run on each of these hand image patches to classify which gesture they are doing. Though this is good enough if you just want to react to certain gestures, the disadvantages are that it has to be retrained if new gestures are to be detected, and that it gets slower the more hands are seen on the image, since the gesture classifier network has to be run for each one individually.
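As an illustration, a gesture classifier along those lines can be as simple as the following sketch. It assumes 64x64 cropped hand patches and 5 target gestures; the architecture in the work mentioned above may differ.

```python
import tensorflow as tf

NUM_GESTURES = 5  # e.g. "OK", "Peace", "horns", ...

# Small CNN that takes a cropped hand patch and outputs one score per gesture.
model = tf.keras.Sequential([
    tf.keras.layers.Conv2D(32, 3, activation="relu", input_shape=(64, 64, 3)),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Conv2D(64, 3, activation="relu"),
    tf.keras.layers.MaxPooling2D(),
    tf.keras.layers.Flatten(),
    tf.keras.layers.Dense(128, activation="relu"),
    tf.keras.layers.Dense(NUM_GESTURES, activation="softmax"),
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# hand_patches: array of cropped hand images, gesture_labels: integer class ids
# model.fit(hand_patches, gesture_labels, epochs=10)
```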
The evolution of hand detection is hand keypoint detection. The hand is no longer considered a single atomic unit, but instead is divided into a model of keypoints of interest associated with the different articulations. Knowing the position of each finger should provide enough information to detect which gesture the hand is making.
OpenPose is a popular human body pose detector able to detect up to 135 keypoints across body, face, feet and hands. It uses a 21-keypoint model for hands: four points for each finger plus one for the wrist. OpenPose started as the code repository for the published paper “Realtime Multi-Person Pose Estimation”, and has since grown into a very useful platform offering people tracking, Unity integration, etc.
OpenPose’s hand detector works well and can be used through a Python API, but it has one big issue: to detect the hand keypoints, you need to tell the API where the hand is located, i.e. its bounding box on the image. Wow, isn’t that what a hand detector should actually do?
OpenPose uses different networks to detect the body and the hand keypoints. On each image, OpenPose will first run the body keypoint network, which uses a bottom-up approach: it first detects body keypoints and then wires them together in a reasonable way to construct the different human bodies. Thanks to this approach, the running time is constant; it doesn’t depend on how many people are in the image, because OpenPose’s Neural Network detects them all at once. But for the hands the mechanism is different: instead of feeding the image just once into a Neural Network and detecting all the visible hands, the parts of the image containing hand candidates are extracted and fed sequentially, one at a time, into the hand keypoint detector. Therefore, if ten people appear on the image, two hand candidate areas will be extracted for each one, and the hand detector will have to look for keypoints in each of these twenty areas. If you are running OpenPose with the hand detector turned on, the more people in the scene, the slower it will run.
These hand candidate areas are just the parts of the image likely to contain a hand, and OpenPose proposes them based on the positions of the detected wrists, elbows and shoulders. Basically, if it sees these body parts close together, it infers where the hand should be. You can check it in the source code, in the getHandFromPoseIndexes method.
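The exact geometry lives in OpenPose’s C++ code, but the idea can be illustrated with a rough Python sketch like the one below. The ratios here are made up for illustration only; they are not the ones used in getHandFromPoseIndexes.

```python
import numpy as np

def hand_rectangle_from_arm(wrist, elbow, box_scale=1.5):
    """Rough illustration: extrapolate past the wrist along the elbow->wrist
    direction and size the box from the forearm length. The constants are
    illustrative, not OpenPose's actual values."""
    wrist, elbow = np.asarray(wrist, float), np.asarray(elbow, float)
    forearm = wrist - elbow
    center = wrist + 0.3 * forearm               # the hand sits a bit past the wrist
    size = box_scale * np.linalg.norm(forearm)   # box side proportional to forearm length
    top_left = center - size / 2.0
    return (top_left[0], top_left[1], size, size)  # (x, y, width, height)

# Example: a wrist at (320, 240) with the elbow below and to the left of it.
print(hand_rectangle_from_arm(wrist=(320, 240), elbow=(300, 330)))
```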
The heuristic makes sense, as the location of people’s hands is strongly related to the position of their wrists, elbows and shoulders, but it means OpenPose won’t be able to detect a hand that isn’t attached to any of these body keypoints. While floating hands are not that common, this may happen in scenarios with occlusions, or if we point the camera directly at a hand in a close-up.
If we want to use OpenPose’s Python API, the same restriction applies. As can be seen in the API Tutorial for Hands, the wrapper needs the coordinates of the rectangles containing the hands. Trying to cheat the detector by giving it a rectangle containing the full image won’t produce any detection, as the network is trained to work with hands sized to fill its 368x368 input resolution.
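For reference, the usage looks roughly like this. It loosely follows the hand-from-image tutorial; the import path and the emplaceAndPop signature vary between OpenPose versions, and the rectangle values below are placeholders.

```python
import cv2
import pyopenpose as op  # import path depends on how OpenPose was built/installed

# Configure OpenPose to run only the hand network, with user-provided rectangles.
params = {"model_folder": "models/", "body": 0, "hand": True, "hand_detector": 2}
opWrapper = op.WrapperPython()
opWrapper.configure(params)
opWrapper.start()

# One entry per person: [left-hand rectangle, right-hand rectangle], as
# op.Rectangle(x, y, width, height). A zero rectangle means "no hand here".
handRectangles = [[op.Rectangle(320.0, 140.0, 280.0, 280.0),
                   op.Rectangle(0.0, 0.0, 0.0, 0.0)]]

datum = op.Datum()
datum.cvInputData = cv2.imread("hand.jpg")
datum.handRectangles = handRectangles
opWrapper.emplaceAndPop(op.VectorDatum([datum]))  # older versions: emplaceAndPop([datum])

print(datum.handKeypoints[0])  # left-hand keypoints: 21 rows of (x, y, confidence)
```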
So to use the OpenPose hand detection API, we either do the full body detection or we find another way to detect the bounding boxes for the hands, and then feed those into the hand keypoint detection. That’s actually what I did for the last version of the Standalone Hand Keypoint Detector, using Victor Dibia’s hand detector and then running OpenPose keypoint detection on those image patches. It works, but it has to be improved before any production use. The size relation between the bounding box and the hand it contains is of vital importance, as the keypoint detection accuracy will degrade if this relation differs from the one used at training time.
Going further, hand keypoint detection is good, but having only the 2D coordinates might not be enough. Knowing the three-dimensional position of these keypoints is better, since then we know the actual hand pose no matter what the angle to the camera was. The missing depth information is the most fundamental problem in this task, but there’s a workaround. As hands always have the same shape and articulation points, there’s a strong prior on what the 3D pose of the hand model is once the 2D keypoints are detected. A Neural Network can be trained to learn this relation between 2D pose and 3D pose, and then used together with a hand keypoint detector. The keypoint detector detects the 2D coordinates, and these are converted into 3D ones by the introduced network. This is essentially what Hand3D does.
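To get a feel for this lifting step, here is a toy sketch of such a 2D-to-3D network: a plain fully connected regressor from 21 detected 2D keypoints to 21 3D keypoints. Hand3D’s actual architecture and its canonical-frame handling are considerably more elaborate; this only illustrates the learned prior.

```python
import tensorflow as tf

# Toy "lifting" network: flattened (x, y) keypoints in, (x, y, z) keypoints out.
lifter = tf.keras.Sequential([
    tf.keras.layers.Input(shape=(21 * 2,)),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(256, activation="relu"),
    tf.keras.layers.Dense(21 * 3),
    tf.keras.layers.Reshape((21, 3)),
])
lifter.compile(optimizer="adam", loss="mse")

# Trained on pairs of (2D keypoints, ground-truth 3D keypoints), for instance
# from a synthetic hand dataset, the network learns the shape prior described above.
# lifter.fit(keypoints_2d, keypoints_3d, epochs=50)
```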
The 2019 Conference on Computer Vision and Pattern Recognition (CVPR) was held just recently, and more interesting papers and projects were released. 3D Hand Shape and Pose Estimation from a Single RGB Image performs a whole 3D mesh reconstruction of the hand, which is (if accurate) all the information you can get about a hand’s pose. We can expect more and more hand recognition papers and projects to keep appearing, some of which will slowly find their way into industry.
Even though the problem of hand recognition can seem nearly solved, there’s still a long way to go. Knowing the pose of the hand is of course fundamental, but it’s just the launch point for any real use of hand recognition. Like Iron Man designing his armor, an interactive system needs to be able to detect movements in real time and react to them accordingly. It has to be able to learn from past use to get more accurate and better suited to the user every day.
Furthermore, even if academia publishes the results of promising work, there’s a big difference between a code release and a working framework. The code released for a paper is there to be seen, and with luck someone will answer you if you open an issue, but chances are it’s never going to be updated again. That’s understandable, but the developer community is increasingly demanding frameworks for Computer Vision, as the success and impact of OpenPose demonstrated. What began as a paper’s code release evolved into a framework with a community behind it, releasing new features regularly and with lots of people using it and, therefore, testing it.
Might this be the time for Hand Recognition?
The second part of this series is already available on our blog; check it out to see how a Hand Keypoint Detection Neural Network is made!