Qualcomm Robotics RB5/RB6 - TensorFlow: Example pipeline
The Qualcomm Robotics RB5/RB6 board flashed with the LU Image already includes the QTI GStreamer plugins, so there is no need to download or install any additional SDK on the board. The element we are using in this section is qtimletflite, which exposes the capabilities of TensorFlow Lite to GStreamer. We are going to use a trained model to test the plugin and measure its performance.
GStreamer pipeline
First, we are going to look at a simple pipeline that uses qtimletflite. In this example, we use a computer vision model for pose estimation.
1. First, we need a trained model for our pipeline. Please download the PoseNet model from Google Coral's repository in the following link. This will download the model posenet_mobilenet_v1_075_481_641_quant.tflite for testing.
2. Build the pipeline. We are going to use the following pipeline:
gst-launch-1.0 qtiqmmfsrc ! "video/x-raw,format=NV12,width=1280,height=720,framerate=30/1,camera=0" ! queue ! qtimletflite model=posenet_mobilenet_v1_075_481_641_quant.tflite postprocessing=posenet delegate=2 num-threads=2 ! queue ! qtioverlay ! waylandsink fullscreen=true qos=false sync=false enable-last-sample=false
In the above pipeline, the qtiqmmfsrc element captures frames from the camera. We then set the caps, specifying the resolution, framerate, format, and the main camera. The qtimletflite element only supports the NV12 and NV21 formats. In this element, we specify the model we are using, which is the one downloaded in the previous step; change the path to wherever you downloaded the file. We also set the postprocessing property to posenet, which determines the result metadata; the delegate property, which selects the hardware where the inference is executed; and finally, the number of threads to be used for inference. The qtioverlay element renders the pose points on top of the video stream, and lastly, the waylandsink element displays the output on the monitor.
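As a quick variation, the delegate and thread count can be changed directly on the command line. The sketch below assumes the delegate values are enumerated in the order the options are listed later in this article (so that cpu would be delegate=5, matching delegate=2 being the Hexagon DSP in the pipeline above); this numbering is an assumption, so verify the actual values on your board with gst-inspect-1.0 before relying on it:

```shell
# Hypothetical variation of the pipeline above: run inference on the
# CPU delegate with 8 threads instead of the Hexagon DSP with 2.
# The delegate=5 value is an assumption; check it with
# gst-inspect-1.0 qtimletflite on the board.
gst-launch-1.0 qtiqmmfsrc ! \
  "video/x-raw,format=NV12,width=1280,height=720,framerate=30/1,camera=0" ! \
  queue ! \
  qtimletflite model=posenet_mobilenet_v1_075_481_641_quant.tflite \
    postprocessing=posenet delegate=5 num-threads=8 ! \
  queue ! qtioverlay ! \
  waylandsink fullscreen=true qos=false sync=false enable-last-sample=false
```

This command only runs on the RB5/RB6 board itself, since it depends on the QTI camera and plugins.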
Measurements
For the pipeline shown above, we measure the CPU usage and framerate while changing the delegate option and the number of threads. The qtimletflite element has 6 delegate options, which are the following:
- nnapi[1]: Android's Neural Networks API. It provides acceleration for TensorFlow Lite models on supported hardware accelerators, including the GPU, DSP, and NPU.
- nnapi-npu: Android's Neural Networks API, using the Qualcomm Neural Processing Unit NPU230.
- hexagon-nn: Uses the Qualcomm Hexagon DSP.
- gpu: Uses the Qualcomm Adreno 650 GPU.
- xnnpack[2]: A highly optimized solution for neural network inference on ARM, x86, WebAssembly, and RISC-V platforms.
- cpu: Uses the Qualcomm Kryo 585 CPU.
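The mapping between these delegate names and the numeric values passed on the command line (such as delegate=2 in the pipeline above) can be checked on the board itself with the standard GStreamer introspection tool:

```shell
# Inspect the qtimletflite element on the board to list its properties.
# The "delegate" property section shows each enum name alongside its
# numeric value, as well as the supported postprocessing options.
gst-inspect-1.0 qtimletflite
```

This command only works on the board, where the QTI GStreamer plugins are installed.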
On the other hand, the number of threads is tested with 2, which is the default value of the property, and with 8, which is the number of CPU cores in the Qualcomm Robotics RB5/RB6 board. To measure the CPU usage percentage, we use the top command and press Shift+i to get the average CPU usage percentage per core. For the framerate, we use the gst-perf plugin from RidgeRun, which computes the mean FPS; we take the average of 30 samples.
Results
In Table 1, we can see the CPU and framerate results for every delegate available on the board for the qtimletflite element. We also test each delegate with 2 and 8 threads, to check whether the performance varies.
Delegate | Number of threads | CPU (%) | Framerate (fps) |
---|---|---|---|
nnapi | 2 | 15.5 | 30.527 |
nnapi | 8 | 15.4 | 30.377 |
nnapi-npu | 2 | 17.9 | 22.965 |
nnapi-npu | 8 | 18.1 | 24.061 |
Hexagon-nn | 2 | 13.9 | 30.689 |
Hexagon-nn | 8 | 14.3 | 30.306 |
xnnpack | 2 | 26.1 | 13.186 |
xnnpack | 8 | 79.1 | 27.184 |
CPU | 2 | 25.6 | 13.139 |
CPU | 8 | 79.5 | 26.650 |
Also, in Figure 1 we have a graph showing the average framerate for each delegate, for an easier comparison between them.
In Figure 2 we have a graph showing the average CPU utilization percentage for each delegate.
From Figures 1 and 2, we can see that the delegate with the best performance is the Hexagon DSP. The Hexagon DSP uses the HTA hardware component to execute the model, providing the best framerate, with an average of 30.689 fps, while using the least amount of CPU. This leaves CPU available for other applications while keeping good performance in the pipeline. The Neural Networks API performs almost as well, using slightly more CPU while still achieving a high framerate. Both the xnnpack and CPU delegates perform better with 8 threads, at the cost of much higher CPU usage.
References