Qualcomm Robotics RB5/RB6 - TensorFlow: Example pipeline
The Qualcomm Robotics RB5/RB6 board flashed with the LU Image already includes the QTI GStreamer plugins, so there is no need to download or install any additional SDK on the board. The element we are using in this section is qtimletflite, which exposes the capabilities of TensorFlow Lite to GStreamer. We are going to use a trained model to test the plugin and measure its performance.
GStreamer pipeline
First, we are going to look at a simple pipeline that uses qtimletflite. In this example, we use a computer vision model for pose estimation.
1. First, we need a trained model for our pipeline. Please download the PoseNet model from Google Coral's repository in the following link. This will download the model posenet_mobilenet_v1_075_481_641_quant.tflite for testing.
2. Build the pipeline. We are going to use the following pipeline:
gst-launch-1.0 qtiqmmfsrc ! "video/x-raw,format=NV12,width=1280,height=720,framerate=30/1,camera=0" ! queue ! qtimletflite model=posenet_mobilenet_v1_075_481_641_quant.tflite postprocessing=posenet delegate=2 num-threads=2 ! queue ! qtioverlay ! waylandsink fullscreen=true qos=false sync=false enable-last-sample=false
In the above pipeline, the qtiqmmfsrc element captures frames from the camera. We then set the caps, specifying the resolution, framerate, format, and the main camera. The qtimletflite element only supports the NV12 and NV21 formats. In this element, we specify the model we are using, which is the one downloaded in the previous step; change the path to wherever you downloaded the file. We also set the postprocessing property to posenet, which determines the result metadata; the delegate property, which selects the hardware where the inference is executed; and finally, the number of threads to be used for inference. The qtioverlay element renders the pose points on top of the video stream, and lastly, the waylandsink element displays the output on the monitor.
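As a quick variation, the delegate and thread count can be changed directly on the command line. The sketch below assumes the delegate values are enumerated in the order the options are listed later in this article (so that cpu would be delegate=5, matching delegate=2 being the Hexagon DSP in the pipeline above); this numbering is an assumption, so verify the actual values on your board with gst-inspect-1.0 before relying on it:

```shell
# Hypothetical variation of the pipeline above: run inference on the
# CPU delegate with 8 threads instead of the Hexagon DSP with 2.
# The delegate=5 value is an assumption; check it with
# gst-inspect-1.0 qtimletflite on the board.
gst-launch-1.0 qtiqmmfsrc ! \
  "video/x-raw,format=NV12,width=1280,height=720,framerate=30/1,camera=0" ! \
  queue ! \
  qtimletflite model=posenet_mobilenet_v1_075_481_641_quant.tflite \
    postprocessing=posenet delegate=5 num-threads=8 ! \
  queue ! qtioverlay ! \
  waylandsink fullscreen=true qos=false sync=false enable-last-sample=false
```

This command only runs on the RB5/RB6 board itself, since it depends on the QTI camera and plugins.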
Measurements
For the pipeline shown above, we measure the CPU usage and framerate while changing the delegate option and the number of threads. The qtimletflite element has 6 delegate options, which are the following:
- nnapi[1]: Android's Neural Networks API. It provides acceleration for TensorFlow Lite models on supported hardware accelerators, including the GPU, DSP, and NPU.
- nnapi-npu: Android's Neural Networks API, using the Qualcomm Neural Processing Unit NPU230.
- hexagon-nn: Uses the Qualcomm Hexagon DSP.
- gpu: Uses the Qualcomm Adreno 650 GPU.
- xnnpack[2]: A highly optimized solution for neural network inference on ARM, x86, WebAssembly, and RISC-V platforms.
- cpu: Uses the Qualcomm Kryo 585 CPU.
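The mapping between these delegate names and the numeric values passed on the command line (such as delegate=2 in the pipeline above) can be checked on the board itself with the standard GStreamer introspection tool:

```shell
# Inspect the qtimletflite element on the board to list its properties.
# The "delegate" property section shows each enum name alongside its
# numeric value, as well as the supported postprocessing options.
gst-inspect-1.0 qtimletflite
```

This command only works on the board, where the QTI GStreamer plugins are installed.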
On the other hand, the number of threads is tested with 2, which is the default value of the property, and with 8, which is the number of CPU cores in the Qualcomm Robotics RB5/RB6 board. To measure the CPU usage percentage, we use the top command and press Shift+i to get the average CPU usage percentage per core. For the framerate, we use the gst-perf plugin from RidgeRun, which computes the mean FPS; we take the average of 30 samples.
Results
In Table 1, we can see the CPU and framerate results for every delegate available on the board for the qtimletflite element. We also test each delegate with 2 and 8 threads, to check whether the performance varies.
Delegate | Number of threads | CPU (%) | Framerate (fps) |
---|---|---|---|
nnapi | 2 | 15.5 | 30.527 |
nnapi | 8 | 15.4 | 30.377 |
nnapi-npu | 2 | 17.9 | 22.965 |
nnapi-npu | 8 | 18.1 | 24.061 |
Hexagon-nn | 2 | 13.9 | 30.689 |
Hexagon-nn | 8 | 14.3 | 30.306 |
xnnpack | 2 | 26.1 | 13.186 |
xnnpack | 8 | 79.1 | 27.184 |
CPU | 2 | 25.6 | 13.139 |
CPU | 8 | 79.5 | 26.650 |
Also, in Figure 1 we have a graph showing the average framerate for each delegate, for an easier comparison between them.
In Figure 2 we have a graph showing the average CPU utilization percentage for each delegate.
From Figures 1 and 2, we can see that the delegate with the best performance is the Hexagon DSP. The Hexagon DSP uses the HTA hardware component to execute the model, providing the best framerate, with an average of 30.689 fps, while using the least amount of CPU. This leaves CPU available for other applications while keeping good performance in the pipeline. The Neural Networks API performs almost as well, using slightly more CPU while still achieving a high framerate. Both the xnnpack and CPU delegates perform better with 8 threads, at the cost of much higher CPU usage.
References