Mixed Martial Arts Fighter Detection
Description
RidgeRun has been working on deep learning models for action recognition; two examples are the assembly line action recognition project and the action recognition for MMA project. In the latter, the network processes only the fighter region of the image, not the whole frame, during both training and inference. The first step toward a successful action recognition application, in this case for MMA, is therefore to properly identify the subjects in the scene. This is why we created a deep learning model that automatically detects the fighters, making it possible to crop the MMA fighters out of the video as a preprocessing step for the action recognition neural network.
As the name indicates, this is a neural network intended to detect MMA fighters. It was developed to automate the data extraction that precedes a different neural network, which requires the fighters to be cropped out of each frame; doing this by hand would be costly and time-consuming. We selected a neural network, trained it with a dataset that we created, converted the trained network into a DeepStream engine, and deployed it on a Jetson Nano using GStreamer, resulting in a pipeline that runs at 41 FPS with an IOU of 0.44. The pipeline also uses an NVIDIA tracker, specifically the IOU tracker. This does not affect pipeline performance, since this tracker is the lightest one available and the actual bottleneck is the network.
This document describes the model selection process, including dataset preparation and the comparison of performance metrics across different scenarios to determine the best model for the application, considering not only accuracy but also system load on embedded systems.
The dataset
The dataset is comprised of 3830 images with the following distribution:
- 3181 images with 2 fighters in the scene
- 521 images with 1 fighter in the scene
- 128 images with no fighters in the scene
All images include other people, such as spectators and referees, so the network can learn to ignore them when looking for fighters. The number of fighters per image varies from 0 to 2 to prevent the network from always expecting a fixed number of subjects in the scene.
The dataset was augmented using shear, blur, noise, and exposure variations to generate new samples. From all images, 172 were reserved for testing the network, making sure that none of their augmented variants appear in the training or validation sets.
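As an illustration, such a leakage-free split can be done by grouping every augmented image with its source image, so that no augmented copy of a test image ends up in training or validation. The sketch below is a minimal example, assuming a hypothetical naming scheme where augmented files keep the stem of their source image; the actual tooling used for this dataset may differ.

```python
import random
from collections import defaultdict
from pathlib import Path

# Hypothetical layout: every augmented file keeps the stem of its source
# image, e.g. "fight_0042_aug3.jpg" was generated from "fight_0042.jpg".
def split_by_source(image_dir, test_fraction=0.05, seed=42):
    groups = defaultdict(list)
    for path in Path(image_dir).glob("*.jpg"):
        source_id = path.stem.split("_aug")[0]  # group augmented copies with their source
        groups[source_id].append(path)

    source_ids = sorted(groups)
    random.Random(seed).shuffle(source_ids)

    n_test = int(len(source_ids) * test_fraction)
    test_ids = set(source_ids[:n_test])

    # Every augmented variant follows its source image, so no test image
    # has an augmented sibling in the training/validation pool.
    test = [p for sid in test_ids for p in groups[sid]]
    trainval = [p for sid in source_ids[n_test:] for p in groups[sid]]
    return trainval, test
```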
The following are some samples taken from the training set; note that they contain augmentation effects such as noise and rotation.
Model exploration
The goal of this project is to test different models used for object detection and determine how they perform, in terms of precision and system load, for an embedded application. The models tested for this project are listed in the table below.
| Type | Model used |
|---|---|
| Mobilenet | Mobilenet SSD v2 |
| RCNN | Faster RCNN Inception |
| YOLO | YOLOv3 and YOLOv5 |
Performance results
We divided the experimentation with the different neural networks into three stages. In the first stage, we ran the neural networks in an environment with few resource constraints and calculated each model's IOU and inference time; the models with the highest average IOU and lowest inference time were selected for the second stage. In the second stage, we ran the selected models with PyTorch on a Jetson Nano to measure the CPU, RAM, and GPU utilization of each model, choosing the one with the lowest resource utilization and the lowest inference time. In the last stage, the model selected in the previous stage was optimized by converting it into a DeepStream engine, and the same metrics were measured again.
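For reference, the per-model inference time can be measured with a small PyTorch timing loop like the sketch below. The model handle and the preprocessed images are placeholders; this is not the exact benchmarking script used in the project.

```python
import time
import torch

def average_inference_ms(model, images, device="cuda", warmup=5):
    """Average per-image inference time in milliseconds."""
    model = model.to(device).eval()
    with torch.no_grad():
        # Warm-up runs so lazy CUDA initialization does not skew the timing.
        for img in images[:warmup]:
            model(img.to(device))
        if device == "cuda":
            torch.cuda.synchronize()

        start = time.perf_counter()
        for img in images:
            model(img.to(device))
        if device == "cuda":
            torch.cuda.synchronize()  # wait for queued GPU work before stopping the clock
        elapsed = time.perf_counter() - start

    return elapsed / len(images) * 1000.0
```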
First stage
For this stage, we obtained the following inference time results:
As you can see, the TensorFlow models, Mobilenet SSD v2 and Faster RCNN Inception, are slower than all of the YOLO networks, with the RCNN model being the slowest of them all. As expected, the bigger YOLO models, YOLOv5m and YOLOv3, are the slowest of the YOLO family, but they still outperform the TensorFlow models. Among the smaller networks, YOLOv5n and YOLOv3-tiny are the fastest models, with YOLOv3-tiny being the fastest of all the YOLO networks.
When it comes to the IOU, the results are in the following image:
The network with the highest IOU is YOLOv5m, followed by YOLOv3 and YOLOv5s. The faster networks do not fall far behind, with a difference of 0.04 (YOLOv5n) and 0.14 (YOLOv3-tiny) compared to the network with the best IOU. The TensorFlow networks fall behind all the others, having the lowest IOU of them all. Taking this and the inference time into account, these networks were discarded and were not used in the second stage of testing.
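For clarity, the IOU metric used throughout these comparisons is the standard intersection over union between a predicted box and its ground-truth box. A minimal implementation, assuming boxes in (x1, y1, x2, y2) corner format, could look like this:

```python
def box_iou(box_a, box_b):
    """Intersection over Union for two boxes given as (x1, y1, x2, y2)."""
    inter_x1 = max(box_a[0], box_b[0])
    inter_y1 = max(box_a[1], box_b[1])
    inter_x2 = min(box_a[2], box_b[2])
    inter_y2 = min(box_a[3], box_b[3])

    inter_w = max(0.0, inter_x2 - inter_x1)
    inter_h = max(0.0, inter_y2 - inter_y1)
    intersection = inter_w * inter_h

    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0
```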
Second stage
In this stage, we ran the models on a Jetson Nano to check their performance on a system with tighter resource constraints. The following image shows the inference time of each model tested:
As you can see, the bigger networks, YOLOv5m and YOLOv3, are penalized by their size, which places them among the slower models in this stage. YOLOv5n and YOLOv3-tiny are still the fastest networks, but this time the difference between them is smaller, just 8 ms.
The CPU and RAM utilization of every single one of the models can be seen in these images:
The slower models, YOLOv3 and YOLOv5m, despite being bigger models, also make less use of the CPU and RAM, which is easily explained: their inference time is much longer than that of the other networks, so the system processes frames at a slower rate and therefore uses fewer resources.
When it comes to the GPU utilization these are the generated graphs:
The YOLOv5n network does not reach 100% GPU utilization and has an overall lower GPU utilization than the other networks. The slower networks use much more of the GPU, and their inference takes longer, which means the GPU stays busy for longer. The valleys visible in the graphs of all the networks correspond to the time between inferences. YOLOv5n has the lowest GPU utilization, a better IOU, and a similar inference time, CPU, and RAM utilization to YOLOv3-tiny, which is why it was selected for the last stage of experimentation.
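For reference, on a Jetson Nano these utilization figures can be sampled with the tegrastats tool. The sketch below is one possible way to collect them from Python; the exact tegrastats output format varies between JetPack releases, so the parsing expressions are an assumption and this is not the exact script used in the project.

```python
import re
import subprocess

# Minimal sketch: sample CPU/RAM/GPU load on a Jetson by reading tegrastats.
# The regular expressions below assume the usual "RAM x/yMB", "GR3D_FREQ n%"
# and per-core "n%@freq" tokens; adjust them for your JetPack version.
def sample_tegrastats(num_samples=10):
    proc = subprocess.Popen(["tegrastats"], stdout=subprocess.PIPE, text=True)
    samples = []
    try:
        for _ in range(num_samples):
            line = proc.stdout.readline()
            ram = re.search(r"RAM (\d+)/(\d+)MB", line)
            gpu = re.search(r"GR3D_FREQ (\d+)%", line)
            cpu = re.findall(r"(\d+)%@\d+", line)  # one entry per CPU core
            samples.append({
                "ram_used_mb": int(ram.group(1)) if ram else None,
                "gpu_percent": int(gpu.group(1)) if gpu else None,
                "cpu_percent_avg": sum(map(int, cpu)) / len(cpu) if cpu else None,
            })
    finally:
        proc.terminate()
    return samples
```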
Third and final stage
In this stage, we converted the YOLOv5n PyTorch model into a DeepStream engine. This model was deployed using GStreamer and the DeepStream plugins. The resulting GStreamer pipeline runs at 41 FPS, which means it takes 24.39 ms from reading a frame from memory to displaying it, making it 3.31 times faster than the PyTorch model. This improvement comes at a cost: the IOU measured for this model is 0.44, making it less precise than its PyTorch counterpart. The new model uses more CPU and RAM resources, as you can see in this image:
When it comes to the GPU resources you can check the following image:
This model, converted to a DeepStream engine, can actually make use of all the GPU resources that the Jetson Nano has. This network, running at 41 FPS on a Jetson Nano with an IOU of 0.44, is considered a success.
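For illustration, a DeepStream pipeline similar to the one described above can be launched from Python with GStreamer. The sketch below is only an approximation of the deployed pipeline: the input file, the nvinfer configuration file, the streammux dimensions, and the IOU tracker library path are placeholders that depend on the DeepStream version installed on the Jetson.

```python
import gi
gi.require_version("Gst", "1.0")
from gi.repository import Gst

Gst.init(None)

# Sketch of a DeepStream pipeline: decode, batch, infer with the YOLOv5n
# engine, track with the IOU tracker, draw the boxes, and display.
# File names, config paths and the tracker library location are placeholders.
pipeline_description = (
    "filesrc location=fight.mp4 ! qtdemux ! h264parse ! nvv4l2decoder ! mux.sink_0 "
    "nvstreammux name=mux batch-size=1 width=1920 height=1080 ! "
    "nvinfer config-file-path=yolov5n_config.txt ! "
    "nvtracker ll-lib-file=/opt/nvidia/deepstream/deepstream/lib/libnvds_mot_iou.so ! "
    "nvvideoconvert ! nvdsosd ! nvegltransform ! nveglglessink"
)

pipeline = Gst.parse_launch(pipeline_description)
pipeline.set_state(Gst.State.PLAYING)

# Run until end-of-stream or an error is posted on the bus.
bus = pipeline.get_bus()
bus.timed_pop_filtered(Gst.CLOCK_TIME_NONE,
                       Gst.MessageType.EOS | Gst.MessageType.ERROR)
pipeline.set_state(Gst.State.NULL)
```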
Final network selection
YOLO stands for You Only Look Once, meaning the whole image is processed in a single forward pass of the network. Conceptually, the prediction has two parts: first, the network divides the image into a grid of cells and predicts what object each cell contains; second, it predicts a set of candidate bounding boxes and calculates the IOU between those boxes and the cells that predict an object. The predicted box with the best IOU is selected as the prediction for that object.
When it comes to selecting the final bounding box, YOLO can do it in multiple ways: selecting the box with the best IOU, or applying non-maximum suppression. Non-maximum suppression removes the boxes that are similar to the one with the best score, so when two objects overlap heavily, one of them can get suppressed; this is why the network sometimes struggles to tell overlapping objects apart. In this case, we converted the neural network that runs in PyTorch into a DeepStream engine, which reduces the IOU but speeds up inference by a factor of 3.31.
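For reference, greedy non-maximum suppression can be sketched as follows, reusing the box_iou helper from the earlier sketch; the threshold value here is illustrative, not the one used by the deployed engine.

```python
def non_max_suppression(boxes, scores, iou_threshold=0.45):
    """Greedy NMS: keep the highest-scoring box, drop boxes that overlap it too much."""
    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)
        keep.append(best)
        # Discard remaining boxes whose IoU with the kept box exceeds the
        # threshold; this is why one of two heavily overlapping fighters
        # can end up suppressed.
        order = [i for i in order if box_iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep
```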
An example of inference is the following:
This engine runs in DeepStream using the NVIDIA inference plugins at 41 FPS with a video resolution of 1080x1920 pixels. Inference is done at a resolution of 412x412 pixels, so the video is resized and the resulting bounding boxes are scaled back to the original resolution. This is done automatically by the pipeline elements, making the implementation of the neural network even easier.
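For illustration, mapping a box predicted at the network input resolution back to the frame resolution amounts to multiplying its coordinates by the width and height ratios. The sketch below assumes a plain resize without letterboxing and uses placeholder resolutions; in the actual pipeline this is handled internally by the DeepStream elements.

```python
def scale_box_to_frame(box, net_size=(412, 412), frame_size=(1920, 1080)):
    """Map a (x1, y1, x2, y2) box from network-input pixels to full-frame pixels.

    net_size and frame_size are placeholders for the inference resolution
    and the actual stream resolution.
    """
    sx = frame_size[0] / net_size[0]
    sy = frame_size[1] / net_size[1]
    x1, y1, x2, y2 = box
    return (x1 * sx, y1 * sy, x2 * sx, y2 * sy)
```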
For direct inquiries, please refer to the contact information available on our Contact page. Alternatively, you may complete and submit the form provided at the same link. We will respond to your request at our earliest opportunity.