GstInference Hand Pose Recognition application

Introduction to hand pose recognition using GstInference and NVIDIA® Jetson™

RidgeRun loves GStreamer, Computer Vision, and Machine Learning. As a result of this interest, several exciting research and commercial projects have emerged. As an NVIDIA ecosystem partner, RidgeRun also likes to explore the capabilities of the Jetson platforms for embedded systems applications. This makes the NVIDIA Jetson family a great fit for tackling real-life problems with state-of-the-art algorithms.

This wiki describes how to implement a monocular RGB hand pose recognition system with a classification accuracy of 98%, using deep learning and NVIDIA Jetson SoCs. To reduce hardware complexity and cost for consumer electronics applications, the system uses a single color camera. The document explores the design, including the hardware setup, software architecture, and overall results, and details the data set and training process.

This project is another in a growing set of results from the RidgeRun/Academy Research Partnership Program in conjunction with the Tecnologico de Costa Rica (TEC).

System setup

  • NVIDIA Jetson TX2
  • Logitech C90 Webcam
  • Jetpack 4.3
  • TensorFlow v1.14
  • CUDA 10.0
  • CUDNN 7.5.0
  • TinyYolo V3
  • MobileNet V2
  • OpenCV 3.4.0
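
Before running the models, it can help to confirm that this software stack is visible from Python. The check below is not part of the original setup and is only a minimal sketch; adjust it if your JetPack image differs.

```python
# Sanity check (illustrative only): confirm the versions listed above are installed.
import cv2
import tensorflow as tf

print("TensorFlow:", tf.__version__)                # expected 1.14.x
print("OpenCV:", cv2.__version__)                   # expected 3.4.x
print("CUDA build:", tf.test.is_built_with_cuda())  # True when the GPU-enabled wheel is installed
print("GPU available:", tf.test.is_gpu_available()) # True on the Jetson TX2 with CUDA 10.0 / cuDNN 7.5
```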

Problem statement

Due to the high demand for human-machine interaction applications, hand pose and gesture detection have become a focal point for machine learning, spanning classic Computer Vision and modern Deep Learning. Our research targeted the detection and classification of 18 different static hand poses using a single, accessible RGB camera sensor. Hand poses differ from gestures in that poses do not involve intermediate motion as part of the detection sequence. Figure 1 shows the supported hand poses.

Figure 1. Detectable hand poses

Reliable hand pose classification is still an active research topic; the high dimensionality of the problem and varying environmental conditions make it challenging for computer systems. Several approaches have been suggested, including the use of depth sensors in conjunction with RGB cameras. Although these approaches reduce the problem complexity and increase accuracy, the added hardware makes them expensive for consumer electronics applications. This project explores a simpler, single-camera setup (which increases the problem complexity) in order to reduce the hardware required to achieve hand pose classification.

As part of the problem statement, the camera shall view the human subject as a laptop or vending machine camera would. This key requirement defined the data set collected for the project.

Software design of a hand pose recognition system

The following image depicts the general software design. The implementation uses two deep neural networks (NNs). The first NN is based on TinyYolo V3 and has been trained to detect a hand within an input video frame. This stage produces a set of bounding box coordinates that are sent downstream for further processing. The next stage uses the bounding box coordinates to crop the hand region from the input image. The cropped image is then fed to a second NN, based on MobileNet V2, trained to classify the input image as one of the specified hand poses.

Figure 2. General software design
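
The detect-crop-classify sequence can be summarized with the sketch below. It is only an illustration: the model files, input resolutions, thresholds, and pre/post-processing are assumptions (the trained networks are not published in this wiki), and OpenCV's DNN module is used simply to keep the example self-contained.

```python
# Illustrative sketch of the detect -> crop -> classify flow (not the exact production code).
import cv2
import numpy as np

DETECTOR_SIZE = (416, 416)    # typical TinyYOLOv3 input resolution (assumption)
CLASSIFIER_SIZE = (224, 224)  # typical MobileNetV2 input resolution (assumption)

# Hypothetical model files standing in for the trained networks described above.
detector = cv2.dnn.readNetFromDarknet("tinyyolov3-hand.cfg", "tinyyolov3-hand.weights")
classifier = cv2.dnn.readNetFromTensorflow("mobilenetv2-poses.pb")

def detect_hand(frame, threshold=0.5):
    """First stage: return the most confident hand bounding box (x, y, w, h), or None."""
    blob = cv2.dnn.blobFromImage(frame, 1 / 255.0, DETECTOR_SIZE, swapRB=True, crop=False)
    detector.setInput(blob)
    h, w = frame.shape[:2]
    best, best_conf = None, threshold
    for output in detector.forward(detector.getUnconnectedOutLayersNames()):
        for det in output:   # each row: cx, cy, bw, bh, objectness, class scores...
            conf = float(det[4])
            if conf > best_conf:
                cx, cy, bw, bh = det[0] * w, det[1] * h, det[2] * w, det[3] * h
                best = (int(cx - bw / 2), int(cy - bh / 2), int(bw), int(bh))
                best_conf = conf
    return best

def classify_pose(crop):
    """Second stage: return the index of the most likely hand pose for the cropped hand."""
    blob = cv2.dnn.blobFromImage(crop, 1 / 127.5, CLASSIFIER_SIZE,
                                 (127.5, 127.5, 127.5), swapRB=True)
    classifier.setInput(blob)
    return int(np.argmax(classifier.forward()))

frame = cv2.imread("sample_frame.jpg")   # stand-in for a captured video frame
box = detect_hand(frame)
if box is not None:
    x, y, w, h = box
    crop = frame[max(y, 0):y + h, max(x, 0):x + w]   # crop the hand region
    if crop.size:
        print("Detected pose index:", classify_pose(crop))
```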

The GStreamer multimedia framework along with RidgeRun's GstInference element simplified the video frame capture, processing, display, streaming, and saving steps, allowing the focus to be on the TinyYolo and MobileNet models.
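
As a rough sketch of how such a pipeline can be driven from Python, the snippet below parses and runs a GstInference-style pipeline. The pipeline string is an assumption based on the architecture described above; element names, properties, and pad names should be verified against the GstInference version installed on the Jetson.

```python
# Minimal sketch: drive a GstInference-style pipeline from Python.
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst, GLib

Gst.init(None)

# Hypothetical pipeline: capture, run the hand detector, overlay the results, and display.
# Depending on the backend, extra properties (e.g. input/output layer names) may be required.
PIPELINE = (
    "v4l2src device=/dev/video0 ! videoconvert ! tee name=t "
    "t. ! queue ! videoscale ! net.sink_model "
    "t. ! queue ! net.sink_bypass "
    "tinyyolov3 name=net model-location=hand-detector.pb backend=tensorflow "
    "net.src_bypass ! inferenceoverlay ! videoconvert ! autovideosink"
)

pipeline = Gst.parse_launch(PIPELINE)
pipeline.set_state(Gst.State.PLAYING)

loop = GLib.MainLoop()
bus = pipeline.get_bus()
bus.add_signal_watch()
bus.connect("message::error", lambda *_: loop.quit())  # stop on pipeline errors
bus.connect("message::eos", lambda *_: loop.quit())    # stop at end of stream

try:
    loop.run()
finally:
    pipeline.set_state(Gst.State.NULL)
```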

Data set and training

The data set for training and validation was generated manually for this experiment. More than 100,000 real images were captured and then augmented to generate a final set of more than 450,000 images. The image distribution per hand pose is depicted in Figure 3.

Figure 3. Data set

The data set is available; if you are interested please send an email to support@ridgerun.com.
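
The exact augmentation policy used to expand the captured images is not documented here, so the following is only an illustrative sketch of the kind of geometric and photometric transforms that can multiply a captured set several times over.

```python
# Illustrative augmentation sketch (the actual policy used for this data set is not published).
import cv2
import numpy as np

def augment(image):
    """Return a randomly rotated, scaled, and brightness/contrast-shifted copy of `image`."""
    h, w = image.shape[:2]
    angle = np.random.uniform(-15, 15)    # small in-plane rotation, in degrees
    scale = np.random.uniform(0.9, 1.1)   # mild zoom in/out
    matrix = cv2.getRotationMatrix2D((w / 2, h / 2), angle, scale)
    out = cv2.warpAffine(image, matrix, (w, h), borderMode=cv2.BORDER_REFLECT)
    return cv2.convertScaleAbs(out,
                               alpha=np.random.uniform(0.8, 1.2),  # contrast jitter
                               beta=np.random.uniform(-20, 20))    # brightness jitter

# Example: generate a few augmented variants per captured frame.
frame = cv2.imread("captured_pose.jpg")   # hypothetical captured image
variants = [augment(frame) for _ in range(4)]
```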

Results

Classification results

Figure 4 provides a summary of the classification results obtained with the proposed system, and Figure 5 provides the confusion matrix for the same set of images. The overall average classification accuracy is 98%, with the worst case being Pose 17, the unsupported pose, which was not part of the original set.

Figure 4. Classification results
Figure 5. Confusion Matrix
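
For reference, metrics like these can be reproduced from the validation labels and predictions with a few lines of NumPy. The snippet below is a generic sketch, not the evaluation script used to produce the figures.

```python
# Generic sketch: accuracy and confusion matrix from arrays of true/predicted pose indices.
import numpy as np

def confusion_matrix(y_true, y_pred, num_classes=18):
    """Rows are the true poses, columns the predicted poses."""
    cm = np.zeros((num_classes, num_classes), dtype=np.int64)
    for t, p in zip(y_true, y_pred):
        cm[t, p] += 1
    return cm

def accuracy(cm):
    """Fraction of samples on the matrix diagonal."""
    return np.trace(cm) / cm.sum()

# Toy example with made-up labels for two poses only:
cm = confusion_matrix([0, 0, 1, 1, 1], [0, 1, 1, 1, 1], num_classes=2)
print(cm)
print("accuracy:", accuracy(cm))
```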

Running the system

In order to visualize the detection results, a small test application was written. Figure 6 shows a sample output from the application. The camera captures a hand pose and displays the detected pose in the upper right corner.

The system runs in real time on 640x480 @ 30 fps input video from a Logitech C90 webcam.

Figure 6. Example output

The following video shows the test application running:

[Watch the video on Vimeo](https://vimeo.com/388568768)

Sample real-life application

In order to test the project results in a real-life application, we prepared a custom application running on a Jetson TX2 and used it for polling at the RidgeRun booth at GTC 2020!

[Watch the video on Vimeo](https://vimeo.com/389351642)

More Information

Contact us via support@ridgerun.com for more information. We're happy to discuss ideas with you.

