NVIDIA Jetson Orin - Research on Software Encoding Acceleration with CUDA

Introduction

This report evaluates the preliminary results of incorporating CUDA optimizations into an H.264 encoding library. It aims to help assess the feasibility and value of adding CUDA-accelerated enhancements to an existing encoder, specifically the OpenH264 encoder, chosen for its license compatibility. All performance metrics and measurements presented in this report were obtained on a Jetson Orin Nano, as improving the performance of software-based codecs on this board was the primary motivation behind this effort.

The document is structured as follows: first, we provide a comparative analysis of the baseline performance of the OpenH264 encoder, both with and without its existing CPU SIMD (Single-Instruction Multiple-Data) optimizations. This analysis includes the benchmarking results as well as profiling, from which optimization candidates are selected. Then, we describe the optimization process to establish whether offloading the encoding algorithms to the GPU benefits performance, covering the steps taken and the results obtained throughout the optimization cycles. Finally, we present our conclusions and recommendations for future work.

Baseline Performance

This section summarizes the baseline performance of the OpenH264 encoder with and without SIMD optimizations, using two different builds: one standard, and one with the compilation of the optimized code disabled in the Meson build system. To obtain the measurements presented in this section, we employed the OpenH264 plugin in a GStreamer pipeline. This allowed for a variety of encoding configurations, achieved by modifying the following parameters:

  • Complexity
  • Rate-control
  • Bitrate
  • Resolution

CPU performance was measured using the proprietary RidgeRun Profiler tool and Linux Perf. Maximum performance mode was enabled on the Orin for all measurements, and the encoder ran in a single thread.

The results shown in Table 1 were obtained through varying the complexity on the encoder, ranging between low, medium, and high complexity. This was accomplished by modifying the complexity property on the openh264enc element, while using a fixed resolution of 1920x1080. The rate control was left at its default value (quality), and the bitrate was fixed at 10Mbps. Consider the following pipeline as a reference.

gst-launch-1.0 nvarguscamerasrc num-buffers=10000 ! 'video/x-raw(memory:NVMM),width=1920, height=1080' ! nvvidconv ! openh264enc bitrate=10000000 complexity=$complexity ! perf ! filesink location=$filepath.raw

Table 1. OpenH264 performance results through varying complexity, with and without optimizations using RidgeRun Profiler and RidgeRun GstPerf.

Varying Complexity

                          With Optimizations          Without Optimizations
Complexity                Low     Medium   High       Low     Medium   High
Entire CPU Average (%)    15.7    15.49    16.12      16.33   16.5     16.33
RAM (MB)                  59.6    59.51    59.7       60.91   61.48    64.4
Encoding rate (fps)       21      12.5     12.7       7.01    4.74     4.75

The results shown in Table 2 were obtained through varying the rate control on the encoder, covering all configurations allowed by the element: quality, bitrate, buffer, and disabled. This was accomplished by modifying the rate-control property on the openh264enc element, while using a fixed input resolution of 1920x1080. The complexity was left at its default value (medium), and the bitrate was set to 10Mbps. Consider the following pipeline as a reference.

gst-launch-1.0 nvarguscamerasrc num-buffers=10000 ! 'video/x-raw(memory:NVMM),width=1920, height=1080' ! nvvidconv ! openh264enc bitrate=10000000 rate-control=$ratectl ! perf ! filesink location=$filepath.raw

Table 2. OpenH264 performance results through varying rate control, with and without optimizations using RidgeRun Profiler and RidgeRun GstPerf. The rate control is expressed as numbers: 0: quality, 1: bitrate, 2: buffer, and -1: off.


Varying Rate-control

                          With Optimizations              Without Optimizations
Rate-control              0       1       2       -1      0       1       2       -1
Entire CPU Average (%)    16.1    16.1    16.32   16.1    16.4    16.5    16.33   16.33
RAM (MB)                  59.56   65.1    62.3    57.19   61.03   66.95   60.9    63.04
Encoding rate (fps)       12.8    13      8.7     11.8    4.72    4.71    4.56    4.43

The results shown in Table 3 were obtained through varying the encoding bitrate, ranging between the constant bitrates of 1Mbps, 4Mbps, and 10Mbps, plus a variable bitrate, all with a constant camera capture. This was accomplished by modifying the bitrate property on the openh264enc element at a fixed input resolution of 1920x1080. Complexity and rate control were left at their default values. Consider the following pipeline as a reference.

gst-launch-1.0 nvarguscamerasrc num-buffers=10000 ! 'video/x-raw(memory:NVMM),width=1920, height=1080' ! nvvidconv ! openh264enc bitrate=$bitrate ! perf ! filesink location=$filepath.raw

Table 3. OpenH264 performance results through varying bitrate, with and without optimizations using RidgeRun Profiler and RidgeRun GstPerf.

Varying Bitrate

                          With Optimizations                   Without Optimizations
Bit-rate (Mbps)           1       4       10      Variable     1       4       10      Variable
Entire CPU Average (%)    13.25   15.75   15.9    10.21        16.5    -       16.48   16.5
RAM (MB)                  58.97   62.7    58.9    59           71.94   -       61.27   58.93
Encoding rate (fps)       30      19      13      29.9         4.73    4.71    4.71    4.73

The results shown in Table 4 were obtained through varying the input resolution, taken directly from the camera. The selected resolutions include 1280x720, 1920x1080, and 3280x2464. The complexity and rate control properties were left at their default values, and the bitrate was set to 4Mbps. Consider the following pipeline as a reference.

gst-launch-1.0 nvarguscamerasrc num-buffers=10000 ! 'video/x-raw(memory:NVMM),width=$width, height=$height' ! nvvidconv ! openh264enc bitrate=4000000 ! perf ! filesink location=$filepath.raw

Table 4. OpenH264 performance results through varying resolution, with and without optimizations using RidgeRun Profiler and RidgeRun GstPerf. 4K was not profiled due to the low framerate (< 1 fps).

Varying Resolution

                          With Optimizations                    Without Optimizations
Resolution                1280x720   1920x1080   3280x2464      1280x720   1920x1080   3280x2464
Entire CPU Average (%)    7.7        15.69       15.09          16.67      16.5        -
RAM (MB)                  46.3       56.6        122.52         56.52      59.7        -
Encoding rate (fps)       30         20          6.6            10.79      4.72        -

Profiling CPU without Optimization

The introductory measurements above compare the performance and highlight that the SIMD optimizations raise the framerate to values acceptable for most applications (30 fps) in certain configurations. In this section, we study the performance of the code implemented in pure C, without explicit SIMD/assembly optimizations.

Figure 1 shows the call graph obtained by running a GStreamer pipeline without optimizations in the OpenH264 library. The configuration used for the pipeline was a resolution of 1080p, a fixed bitrate of 4Mbps, and complexity and rate-control set to their default values. The hotspots are highlighted in green for clarity. Refer to Appendix A for the full call graph.

The results shown below correspond to the C implementation of the encoder and serve as a baseline for the C functions without optimizations. We compare this performance against the optimized encoder, which utilizes the ARM64 CPU SIMD (NEON) instructions of the CPU equipped in the NVIDIA Orin.

Figure 1. Call Graph without optimizations, with the critical path highlighted in green. Other parts have been skipped for simplicity (based on the one obtained by Linux Perf).

From the previous call graph, the following table can be extracted, highlighting the hotspots, their execution time and the number of samples, as well as their overall percentage of the execution.

Table 5. Hotspots table without optimizations enabled (obtained by Linux Perf).

Samples   Function name                              Execution time (ms)   Samples (%)   Executions per frame   Relevance (4.72 Hz = 212 ms)
6666      WelsEnc::WelsMdI4x4                        6666                  33.3%         44.44                  21%
1702      WelsEnc::WelsSampleSatd16x16_c             1702                  8.54%         11.34                  5.4%
590       WelsEnc::WelsSampleSatd8x8_c               590                   2.95%         3.9                    1.8%
317       WelsEnc::WelsIDctFourT4Rec_c               317                   1.585%        2.11                   -
294       WelsEnc::WelsMotionEstimateInitialPoint    294                   1.47%         1.96                   -
282       WelsEnc::WelsQuantFour4x4_c                282                   1.41%         1.88                   -
245       WelsEnc::WelsEncRecUV                      245                   1.225%        1.63                   -
185       WelsEnc::WelsDctMb                         185                   0.925%        1.23                   -
178       WelsEnc::FilteringEdgeLumaV                178                   0.89%         1.18                   -
148       WelsEnc::WelsMotionEstimateSearch          148                   0.74%         0.9                    -

From Table 5, it is noticeable that the intra-prediction using 4x4 macroblocks consumes most of the time, suggesting that it is a good candidate for optimization. It is followed by the 16x16 macroblock analysis, which employs the Sum of Absolute Transformed Differences (SATD) to compute the encoding cost. Most candidates belong to the intra-prediction step of the encoding, which is in charge of the spatial analysis of the frames.
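To make the hotspot concrete, the sketch below illustrates the idea behind a 4x4 SATD such as WelsSampleSatd4x4_c: Hadamard-transform the residual between the source block and its intra prediction, then sum the absolute transform coefficients. This is a simplified reference, not the library's exact code; OpenH264's normalization and data layout differ.

// Simplified 4x4 SATD reference (plain C++, also compilable under nvcc).
// A sketch of the idea behind WelsSampleSatd4x4_c, not the library's code.
#include <cstdio>
#include <cstdlib>

static int Satd4x4(const unsigned char* src, int srcStride,
                   const unsigned char* pred, int predStride) {
  int d[4][4], t[4][4];
  // Residual between the source block and the intra prediction.
  for (int i = 0; i < 4; i++)
    for (int j = 0; j < 4; j++)
      d[i][j] = src[i * srcStride + j] - pred[i * predStride + j];
  // Horizontal 4-point Hadamard transform (butterflies).
  for (int i = 0; i < 4; i++) {
    int s01 = d[i][0] + d[i][1], d01 = d[i][0] - d[i][1];
    int s23 = d[i][2] + d[i][3], d23 = d[i][2] - d[i][3];
    t[i][0] = s01 + s23; t[i][1] = s01 - s23;
    t[i][2] = d01 + d23; t[i][3] = d01 - d23;
  }
  // Vertical transform, accumulating the sum of absolute coefficients.
  int satd = 0;
  for (int j = 0; j < 4; j++) {
    int s01 = t[0][j] + t[1][j], d01 = t[0][j] - t[1][j];
    int s23 = t[2][j] + t[3][j], d23 = t[2][j] - t[3][j];
    satd += abs(s01 + s23) + abs(s01 - s23) + abs(d01 + d23) + abs(d01 - d23);
  }
  return satd >> 1;  // a common normalization; the exact scaling varies per encoder
}

int main() {
  unsigned char src[16], pred[16];
  for (int i = 0; i < 16; i++) { src[i] = (unsigned char)(i * 3); pred[i] = 16; }
  printf("satd = %d\n", Satd4x4(src, 4, pred, 4));
  return 0;
}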

CPU-Limit Performance using NEON (Deadline)

Subsequently, the following call graph was extracted after enabling the optimizations on the OpenH264 library. The configuration used for the GStreamer pipeline was the same as the non-optimized pipeline: a resolution of 1080p, fixed bitrate at 4Mbps, complexity and rate-control set at their default values. The hotspots are highlighted to aid readability. Refer to Appendix B to see the full call graph.

Figure 2. Call Graph using optimizations, with the critical path highlighted in green. Other parts have been skipped for simplicity.

From the previous call graph, the following table is extracted, containing the primary hotspots, their execution time and number of samples, as well as their percentage of the overall execution.

Table 6. Hotspots table with optimizations enabled.

Samples   Function name                                                        Execution time (ms)   Samples (%)   Executions per frame   Relevance (20 Hz = 50 ms)
1237      WelsEnc::WelsMdI4x4 [Best candidate]                                 1237                  13.17%        8.2                    16.4%
271       WelsIntra4x4Combined3Satd_AArch64_neon                               271                   2.91%         1.8                    3.6%
212       WelsIntra16x16Combined3Satd_AArch64_neon                             212                   2.57%         1.4                    2.8%
155       WelsSampleSatd4x4_AArch64_neon                                       155                   2.22%         1.03                   2.1%
150       WelsEnc::WelsScan4x4DcAc_c                                           150                   1.88%         1                      -
220       WelsEnc::WriteBlockResidualCavlc                                     220                   3.13%         1.46                   -
118       WelsQuantFour4x4Max_AArch64_neon                                     118                   3.16%         0.78                   -
121       WelsVP::CVpFrameWork::Process / vaa_calc_sad_bgd_loop1               121                   1.23%         0.8                    -
111       WelsEnc::CWelsPreProcess::WelsMoveMemoryWrapper / __memcpy_generic   111                   1.28%         0.74                   -
251       WelsEnc::WelsEncRecI4x4Y                                             251                   3.48%         1.67                   -
97        WelsEnc::WriteBlockResidualCavlc / CavlcParamCal_c                   97                    1.20%         0.6                    -

Analysis:

Based on the findings listed in Table 5 and Table 6, the primary candidate for optimization is the WelsMdI4x4 function, since it is a common candidate in both modes (with and without optimizations). It can also be concluded that there is significant potential in focusing on this function, based on the impact the NEON optimizations had on performance, reducing its per-frame execution time by approximately 35 ms.

The hotspots are located in the intra-prediction branch of the encoding. The most expensive path goes through WelsMdIntraFinePartition, within which WelsMdI4x4 and WelsMdI16x16 are the primary candidates for optimization.

CPU Hotspot and Opportunities

After analyzing the call graphs with and without NEON optimizations, the primary hotspot for optimization is the WelsMdInterSecondaryModesEnc function. Preliminary analysis determines that it could be attacked as a whole; however, to focus the efforts, optimization initially targets the following functions:

Table 7. Hotspot analysis and opportunities in CPU were obtained by comparing the callgraphs obtained for the optimized and non-optimized results of the OpenH264 library.

Function Name                   Time per Run (or frame)   Optimized in CPU?   Still Optimizable?
WelsEnc::WelsMdI4x4             44.4 ms                   Yes (5x gain)       Yes, on GPU
WelsEnc::WelsMdI16x16           14.3 ms                   Yes (48x gain)      Yes, on GPU
WelsEnc::WelsMdIntraChroma      5.2 ms                    Yes (5x gain)       Yes, on GPU
WelsEnc::WelsIMbChromaEncode    3.47 ms                   Yes (4x gain)       Yes, on GPU

As Table 7 shows, the principal bottlenecks have already been addressed using NEON. According to our code overview, it is still possible to accelerate these algorithms on the GPU, given that they have already been accelerated through hardware. Moreover, the code operates on macroblocks, a natural example of data-level parallelism.

On the other hand, some algorithm optimization and simplification may be required to offload the workload correctly and exploit the CUDA threads efficiently. This appears feasible, given that the NEON assembly code already applies similar transformations.

The following subsections detail the functions considered for optimization during this work.

WelsEnc::WelsMdI4x4

A function that helps in the mode decision process for selecting the best intra-prediction mode for a 4x4 block.

Inputs:

  • Structure: pointer to a context structure sWelsEncCtx, which contains encoder settings, frames, and other relevant parameters.
  • Structure: pointer to the mode decision structure (SWelsMD), which holds information about mode decisions and costs.
  • Pointer to the macroblock structure (SMB), which stores information about the current macroblock.
  • Pointer to the macroblock cache structure (SMbCache), which stores intermediate results during mode decision.

Outputs:

  • Integer representing the best cost among different intra-prediction modes, used in the mode decision process to determine the best intra-prediction mode.

Opportunities: This function is optimizable because:

  • It doesn’t display any conflicting data dependencies or branching that might interfere with parallelization. It also contains loops that point to a good optimization opportunity, and it was previously optimized on the CPU.

WelsEnc::WelsMdI16x16

A function that helps in the mode decision process for selecting the best intra-prediction mode for a 16x16 block.

Inputs:

  • Structure: Pointer to a structure SWelsFuncPtrList, used to access functions related to intra-prediction.
  • Structure: Pointer to the current decoded layer structure, used to access stride information for the decoding layer.
  • Pointer to the macroblock cache structure (SMbCache), which stores intermediate results during mode decision.
  • Integer parameter representing the lambda value, a parameter used in rate-distortion optimization, used in calculating the cost for each intra-prediction mode.

Outputs:

  • Integer representing the best cost among different intra-prediction modes, used in the mode decision process to determine the best intra-prediction mode.

Opportunities: This function is optimizable because:

  • It doesn’t display any conflicting data dependencies or branching that might interfere with parallelization.

WelsEnc::WelsMdIntraChroma

A function that helps in the mode decision process for selecting the best intra-prediction mode for a chroma block.

Inputs:

  • Structure: Pointer to a structure SWelsFuncPtrList, used to access functions related to chroma intra-prediction.
  • Structure: Pointer to the SDqLayer, decoder layer structure, which contains information about the current decoding layer for chroma encoding.
  • Pointer to the macroblock cache structure (SMbCache), which stores intermediate results during mode decision.
  • Integer parameter representing the lambda value, a parameter used in rate-distortion optimization, used in calculating the cost for each chroma intra-prediction mode.

Outputs:

  • Integer representing the best cost among different intra-prediction modes, used in the mode decision process to determine the best intra-prediction mode.

Opportunities: This function is optimizable because:

  • It doesn’t display any conflicting data dependencies or branching that might interfere with parallelization.

WelsEnc::WelsIMbChromaEncode

This function is part of the chroma encoding process.

Inputs:

  • Structure: pointer to a context structure sWelsEncCtx, which contains encoder settings, frames, and other relevant parameters.
  • Structure: Pointer to the SMB, current macroblock structure.

  • Pointer to the macroblock cache structure (SMbCache), which stores intermediate results during mode decision.

Opportunities: This function is optimizable because:

  • It doesn’t display any conflicting data dependencies or branching that might interfere with parallelization.

Relevant Note

Offloading the functions in a standalone manner (as kernel units) may lead to suboptimal performance, given that each kernel only works on 16x16 data structures. For best performance, the recommendation is to climb up to WelsMdInterLoop, so that the library can deploy kernels over multiple macroblocks and make the most of the GPU's thread-level parallelism, as sketched below.
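As an illustration, the following is a minimal sketch of that launch shape, assuming a hypothetical PerMacroblockKernel (a real WelsMdInterLoop port would carry encoder state instead of the placeholder work): one CUDA block per 16x16 macroblock, so a 1080p frame exposes thousands of independent blocks of work.

// Sketch only: one CUDA block per 16x16 macroblock of a 1080p luma plane.
// PerMacroblockKernel is a hypothetical stand-in for a ported encoder stage.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void PerMacroblockKernel(unsigned char* luma, int stride) {
  // blockIdx selects the macroblock; threadIdx selects the pixel within it.
  int x = blockIdx.x * 16 + threadIdx.x;
  int y = blockIdx.y * 16 + threadIdx.y;
  luma[y * stride + x] = (unsigned char)((x + y) & 0xFF);  // placeholder work
}

int main() {
  const int width = 1920, height = 1088;  // 1080p padded to a 16-pixel multiple
  unsigned char* luma;
  cudaMallocManaged(&luma, (size_t)width * height);
  dim3 threads(16, 16);                   // one thread per macroblock pixel
  dim3 blocks(width / 16, height / 16);   // 120 x 68 = 8160 macroblocks
  PerMacroblockKernel<<<blocks, threads>>>(luma, width);
  cudaDeviceSynchronize();
  printf("processed %d macroblocks\n", blocks.x * blocks.y);
  cudaFree(luma);
  return 0;
}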

For simplicity, this project addresses each of the candidates listed above individually for a quantitative evaluation and estimation of the efforts required to accelerate the encoder via GPU.

CUDA Optimization Process

This report addresses the optimization using CUDA, utilizing the GPU as an accelerator; the section and chapter titles could differ for another acceleration mechanism or device. Overall, the optimization process is an iterative methodology: a series of changes to optimize certain parts of the application, followed by measurements to verify the improvements, then tuning and fixes. After the fixes, another series of changes restarts the cycle, as presented in Figure 3.

Figure 3. Optimization cycle

In this section, we expect more than one optimization cycle, leading to a chapter per cycle. We identify each optimization by a number, e.g., Optimization 1 and Optimization 2. Each chapter describes the changes made for that optimization, then summarizes the performance measurements against the reference (baseline) performance. Finally, each chapter ends with alternatives and optimization opportunities for the following cycle.

For measuring the performance, we utilize the following tools:

  • RidgeRun GstPerf: to measure the framerate of a pipeline.
  • RidgeRun Profiler Library: to measure function runtime while running a pipeline and obtain the actual function time.
  • NVIDIA Nsight (NvProf and Timeline): to measure the average runtime of each ported kernel and the sparsity of the workload (how far apart the kernel executions are).

Moreover, it is essential to note that Optimizations 2-4 (explained below) are executed on test benches that run in isolation from the library, due to the integration complexity of working inside the library. Integration into the library might be covered in future research.


Required Change 1: Unified Memory Change

The first modification to add CUDA support was to make the allocations CUDA Managed. This involved replacing all malloc-like allocations with cudaMallocManaged, so that all data is available to both the GPU and CPU without incurring explicit memory copies. All default configuration values are the same as in Tables 1-4.
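A minimal sketch of the change, using hypothetical wrapper names (the actual OpenH264 allocation entry points differ):

// Sketch of the allocation change: managed memory instead of plain malloc.
// WelsMallocManaged/WelsFreeManaged are illustrative names, not library APIs.
#include <cuda_runtime.h>
#include <cstdio>

static void* WelsMallocManaged(size_t size) {
  void* ptr = nullptr;
  // Before: ptr = malloc(size);  // CPU-only memory
  if (cudaMallocManaged(&ptr, size) != cudaSuccess) return nullptr;
  return ptr;  // now reachable from both CPU code and CUDA kernels
}

static void WelsFreeManaged(void* ptr) { cudaFree(ptr); }

int main() {
  // A 1080p NV12 frame (~1.5 bytes/pixel) usable on CPU and GPU without
  // explicit cudaMemcpy calls.
  size_t size = 1920 * 1080 * 3 / 2;
  unsigned char* frame = (unsigned char*)WelsMallocManaged(size);
  if (!frame) { fprintf(stderr, "allocation failed\n"); return 1; }
  frame[0] = 128;  // CPU write; the CUDA driver migrates pages on demand
  WelsFreeManaged(frame);
  return 0;
}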

Tables 10-13 summarize the results of this modification and their impact on the framerate.

Table 10. OpenH264 framerate results through varying complexity with CUDA and UM.

Varying Complexity

Encoding rate (fps)                    Low     Mid     High
Without NEON optimizations             7.01    4.74    4.75
With NEON optimizations                21      12.5    12.7
With UM, without NEON optimizations    6.5     4.27    4.42
With UM and NEON optimizations         13.3    10.35   11.3

Table 11. OpenH264 framerate results through varying rate control with CUDA and UM.

Varying Rate-control

Encoding rate (fps)                    0       1       2       -1
Without NEON optimizations             4.7     4.71    4.56    4.43
With NEON optimizations                12.8    13      8.7     11.8
With UM, without NEON optimizations    4.2     4.4     3.9     4.2
With UM and NEON optimizations         11.09   10.85   7.45    8.1

Table 12. OpenH264 framerate results through varying bitrate with CUDA and UM.

Varying Bitrate

Encoding rate (fps)                    1 Mbps   4 Mbps   10 Mbps   Variable
Without NEON optimizations             4.73     4.71     4.71      4.73
With NEON optimizations                30       19       13        29.9
With UM, without NEON optimizations    4.9      4.71     4.3       4.73
With UM and NEON optimizations         11.5     11.25    10.5      11.9

Table 13. OpenH264 framerate results obtained through varying resolutions with CUDA and UM.

Varying Resolution

Encoding rate (fps)                    1280x720   1920x1080   3280x2464
Without NEON optimizations             10.79      4.72        -
With NEON optimizations                30         20          6.6
With UM, without NEON optimizations    10.79      4.72        -
With UM and NEON optimizations         29.5       11.25       5.25

The assumed settings (when not explicit) are the same as those presented for the CPU baseline in Tables 1-4.

Results:

There is some performance degradation, possibly due to the memory transfers and the alignment required by the SIMD instructions. However, this is the price to pay for making the logic CUDA-compatible.

Optimization 2: MdI4x4 Kernel Port

The second change corresponds to the port of the MdI4x4 function, which has the most significant relevance in the execution of the encoder. For the results in this section, the code was tested in isolation with respect to the encoder. The sum of the times of the different parts of the code (Aggregated Consumption) should be less than or equal to the metered time, given that the tested code omits some parts that are integrated into the encoding library. It is important to recall that the measurements are taken outside of the encoder library, using instrumented test benches.
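As an illustration of this methodology, the following is a minimal sketch of such an instrumented test bench (our reconstruction, not RidgeRun's actual harness): a stand-in kernel timed with CUDA events and averaged over 1000 launches, as in Table 14.

// Test-bench sketch: average kernel time over 1000 launches using CUDA events.
// Candidate4x4Kernel is a placeholder; the real port computes the MdI4x4 math.
#include <cuda_runtime.h>
#include <cstdio>

__global__ void Candidate4x4Kernel(const unsigned char* src,
                                   unsigned char* dst, int n) {
  int idx = blockIdx.x * blockDim.x + threadIdx.x;
  if (idx < n) dst[idx] = src[idx];  // stand-in for the ported computation
}

int main() {
  const int kSamples = 1000;  // matches the 1000-sample averages in Table 14
  const int n = 4 * 4;        // a single 4x4 block: the single-block worst case
  unsigned char *src, *dst;
  cudaMallocManaged(&src, n);
  cudaMallocManaged(&dst, n);

  cudaEvent_t start, stop;
  cudaEventCreate(&start);
  cudaEventCreate(&stop);

  float totalMs = 0.0f;
  for (int i = 0; i < kSamples; i++) {
    cudaEventRecord(start);
    Candidate4x4Kernel<<<1, 16>>>(src, dst, n);  // 1 block, 16 threads
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    totalMs += ms;
  }
  printf("avg kernel time: %.3f us\n", 1000.0f * totalMs / kSamples);

  cudaEventDestroy(start);
  cudaEventDestroy(stop);
  cudaFree(src);
  cudaFree(dst);
  return 0;
}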

Table 14. MdI4x4 CUDA results in contrast to the equivalent C function and the approximated NEON function execution. Data taken by using the RidgeRun Profiling Library and NvProf.

                                     Average Execution Time (us) - 1000 samples            Speedup
Function                             C function   NEON function   CUDA     Weight    vs C     vs NEON
MdI4x4 (metered time in library)     4.560        0.846           -        -         -        -
PredIntra4x4Mode                     0.000        0.000           0.000    0.000     0.000    -
WelsSampleSatd4x4_c                  0.246        0.034           6.305    0.458     0.039    0.005
LumaI4x4Pred                         0.096        0.018           5.837    0.124     0.016    0.003
WelsEncRecI4x4Y                      0.834        0.243           8.906    0.124     0.094    0.027
Aggregated Consumption               3.645        0.768           75.447   -         0.048    0.010

Table 14 summarizes the execution times for MdI4x4. The C and CUDA functions were tested and metered in isolation, whereas the NEON function time was computed from the samples obtained from the call graphs and Tables 5 and 6. From Table 14, the CUDA functions are, on average, more than 20 times slower than their C counterparts when executing a single 4x4 block. The hypothesis is that this is due to microarchitecture and clock-speed differences.

Table 15. Estimation for performance for the MdI4x4 function

Theoretical Estimation for Performance
Required dimensions to cope with NEON   98 blocks of 4x4, assuming 1-block allocation
Width                                   40 px
Height                                  40 px
Relevance                               22.90% of the total loop consumption

Table 15 shows the minimum resolution and the number of blocks required to match the performance of the NEON functions, assuming that all blocks run in the same amount of time in an embarrassingly parallel fashion. Dividing the CUDA running time by the NEON running time gives the number of 4x4 blocks: in the average case, at least 98. Taking the square root of 98 and multiplying by 4 gives the minimum width and height, a window of 40x40 pixels (100 blocks). Finally, optimizing this function would improve 22.90% of the overall running time, according to the call graph taken at the beginning of the project.
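For reference, the figures in Table 15 follow directly from the aggregated times in Table 14:

\[ N_{blocks} = \frac{t_{CUDA}}{t_{NEON}} = \frac{75.447\,\mu\text{s}}{0.768\,\mu\text{s}} \approx 98.24, \qquad W = H = \sqrt{98.24} \times 4\,\text{px} \approx 40\,\text{px} \]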

Table 16. Experimental results based on the device limitations. Data obtained from DeviceQuery and NvProf.

Experimental Device Limitations for Scalability - Using EncI4x4 in the worst-case scenario
Number of SMs                                              8
Number of CUDA cores per SM                                128
Required number of threads                                 256 (assumes each macroblock requires 16x16)
Minimum required blocks                                    98.24
Concurrent blocks allowed by the device                    8 (equal to the number of SMs, given the block size)
Theoretical missing-blocks ratio                           12.28 (minimum blocks required to match NEON divided by the
                                                           concurrent blocks allowed by the device, assuming an SM executes
                                                           one block at a time; each SM must run multiple passes to cover
                                                           all required blocks)
Estimated sparsity of the job per single block             19.28% (sparsity when launching multiple instances; 30 samples
                                                           with 16 threads each)
Experimental average at the required number of blocks      22.18 us
Experimental speedup at the required number of blocks,
compared to CUDA serial execution                          40.15 times
Theoretical average at the required number of blocks,
in NEON                                                    24.28 us
Theoretical speedup on CUDA with 100 blocks,
compared to NEON                                           1.09 times
Experimental average at 1080p                              17,755.00 us
Experimental speedup at 1080p, compared to CUDA
serial execution                                           173.35 times
Theoretical average at 1080p in NEON                       31,469.94 us
Theoretical speedup on CUDA at 1080p, compared to NEON     1.77 times

Table 16 summarizes the gains when experimentally launching the kernels with a number of blocks that matches the resolution. The first two rows indicate the platform characteristics, and the third is the number of threads required by a 16x16 macroblock. This leads to 8 concurrent blocks (given the number of SMs) with a starvation ratio of 12.28 (sixth row), computed by dividing the 98 required blocks by the number of concurrent blocks.

According to the experimental measurements, we have an occupation of ~80% (19.28% of slack from the CUDA launch sparsity). By running 100 blocks, the CUDA algorithm can match the NEON time, leading to a 1.09x speedup. When launching an equivalent 1080p job (a worst-case scenario where MdI4x4 always executes), the CUDA estimation yields a 1.77x speedup, suggesting there are gains in moving the job to the GPU, probably because the launch is better organized.
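The starvation ratio and the quoted occupancy follow directly from Table 16:

\[ \text{ratio} = \frac{98.24\ \text{blocks}}{8\ \text{concurrent blocks}} \approx 12.28, \qquad \text{occupancy} \approx 1 - 0.1928 \approx 80\% \]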

The results for 1080p suggest that the improvement from the CUDA port is about 1.18 times in the average case (given the 22.90% relevance and 1.77x average speedup). This does not include any algorithm simplification or intrinsics work that could positively impact the final results. However, these results assume that the GPU is fully committed to encoding and used by a single 1080p stream.
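The 1.18x figure appears to weight the measured speedup s by the function's relevance r; the same relation reproduces the 1.08x and 1.14x figures quoted later for Optimizations 3 and 4 (r = 3.72%, s = 3.15 and r = 9.11%, s = 2.58, respectively):

\[ S_{overall} \approx 1 + r\,(s - 1) = 1 + 0.229 \times (1.77 - 1) \approx 1.18 \]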

The recommendation: research algorithm simplification (there are merges that can be beneficial) and continue with loop optimization. As an estimate, the GPU does not seem powerful enough to sustain six 1080p streams as the CPU does; being conservative, the estimated number of streams is 2, assuming the functions can be simplified algorithmically.

Optimization 3: MdI16x16 Kernel Port

Similar to Optimization 2, this section shows the results of porting the functions involved in MdI16x16. These measurements share the same assumptions as Optimization 2 regarding the use of test benches and the aggregation of the functions.

Table 17. MdI16x16 CUDA results in contrast to the equivalent C function and the approximated NEON function execution. Data taken by using the RidgeRun Profiling Library and NvProf.

                                      Average Execution Time (us) - 1000 samples            Speedup
Function                              C function   NEON function   CUDA     Weight    vs C     vs NEON
MdI16x16 (metered time in library)    5.870        0.547           -        -         -        -
WelsI16x16LumaPred                    0.069        0.034           5.610    0.048     0.012    0.006
WelsSampleSatd16x16                   3.150        0.209           16.170   0.255     0.195    0.013
Aggregated Consumption                3.219        0.243           21.780   -         0.270    0.025

Table 17 shows the results of the aggregation of the functions involved in MdI16x16. It is possible to notice that the single-block CUDA counterpart is overall 4x slower than the C counterpart and 75x slower than the NEON counterpart.

Table 18. Estimation for performance for the MdI16x16 function

Theoretical Estimation for Performance
Required dimensions to cope with NEON   40 blocks of 16x16, assuming 1-block allocation
Width                                   101 px
Height                                  101 px
Relevance                               3.72% of the total loop consumption

Similarly to the previous optimization, we have computed the minimum resolution required to match the NEON performance, assuming perfect parallelism (Table 18). In this case, the minimum resolution is close to 100x100. Nevertheless, this function accounts for only 3.72% of the overall code's performance.

Table 19. Experimental results based on the device limitations

Experimental Device Limitations for Scalability - Using Satd16x16 in the worst-case scenario
Number of SMs                                              8
Number of CUDA cores per SM                                128
Required number of threads                                 256 (assumes each macroblock requires 16x16)
Minimum required blocks                                    39.84
Concurrent blocks allowed by the device                    8 (equal to the number of SMs, given the block size)
Theoretical missing-blocks ratio                           4.98 (minimum blocks required to match NEON divided by the
                                                           concurrent blocks allowed by the device, assuming an SM executes
                                                           one block at a time; each SM must run multiple passes to cover
                                                           all required blocks)
Estimated sparsity of the job per single block             0.43% (dividing the average reported in the table by the value
                                                           from nvprof)
Experimental average at the required number of blocks      20.78 us
Experimental speedup at the required number of blocks,
compared to CUDA serial execution                          70.03 times
Theoretical average at the required number of blocks,
in NEON                                                    18.82 us
Theoretical speedup on CUDA with 90 blocks,
compared to NEON                                           0.91 times
Experimental average at 1080p                              538.00 us
Experimental speedup at 1080p, compared to CUDA
serial execution                                           243.45 times
Theoretical average at 1080p in NEON                       1,694.00 us
Theoretical speedup on CUDA at 1080p, compared to NEON     3.15 times

Table 19 shows that MdI16x16 achieves a more significant speedup over the NEON counterpart when launching the number of blocks required to process a 1080p feed. However, the issue lies in the relevance of the function: MdI16x16 contributes only a 1.08x overall speedup.

The recommendation: The results suggest that the improvement from the CUDA port is about 1.08 times in the average case, because of the low relevance of the function in the overall execution time from the call graph comparison. Moreover, these results assume that the GPU is fully committed to encoding a single 1080p stream. The recommendation is to check whether other parts can be simplified and merged.

Optimization 4: EncRecI16x16Y Kernel Port

This optimization covers the EncRecI16x16Y kernel port. This function is much more complex than MdI16x16 and, unlike EncRecI4x4Y (which is contained in MdI4x4), it is not contained in MdI16x16.

Table 20. EncRecI16x16Y CUDA results in contrast to the equivalent C function and the approximated NEON function execution. Data was taken by using the RidgeRun Profiling Library and NvProf.

                                          Average Execution Time (us) - 1000 samples                       Speedup
Function                                  C function   NEON function   CUDA      Weight   Weighted Corr.   vs C     vs NEON
EncRecI16x16Y (metered time in library)   14.300       3.260           -         -        0.111            -        -
WelsDctMb                                 3.470        0.019           10.040    0.028    0.251            0.346    0.002
WelsHadamardT4Dc                          0.224        -               6.339     0.001    0.009            0.035    -
WelsQuant4x4Dc                            0.172        0.637           9.916     0.001    0.009            0.017    0.064
WelsScan4x4DcAc                           0.076        -               14.479    0.014    0.130            0.005    -
WelsGetNoneZeroCount                      0.013        0.000           5.930     0.001    0.009            0.002    0.000
WelsQuantFour4x4                          0.896        0.232           5.923     0.042    0.382            0.151    0.039
WelsScan4x4Ac                             0.172        0.637           9.916     0.001    0.009            0.017    0.064
WelsIHadamard4x4Dc                        0.277        -               6.222     0.001    0.009            0.045    -
WelsDequantLumaDc4x4                      0.159        -               9.847     0.001    0.009            0.016    -
WelsDequantIHadamard4x4                   -            -               -         0.001    0.009            -        -
WelsDequantFour4x4                        0.284        -               9.846     0.001    0.009            0.029    -
WelsIDctFourT4Rec                         0.901        0.196           23.847    0.017    0.156            0.038    0.008
WelsIDctRecI16x16Dc                       1.970        -               9.851     0.001    0.009            0.200    -
Aggregated Consumption                    8.614        1.721           122.156   0.176    -                0.071    0.014

Table 20 shows the results of the functions invoked by EncRecI16x16Y. The aggregated sum of the functions shows a single-block degradation of more than 10x against the C counterpart and 75x against the NEON counterpart. These results are in line with those observed in the previous optimizations.

Digging into the performance estimation, EncRecI16x16Y requires more blocks than MdI16x16 (31 more), as observed in Table 21. Still, the required window is roughly 100 times smaller in area than a 1080p frame, leaving room for parallelism.

Table 21. Estimation for performance for the EncRecI16x16Y function

Theoretical Estimation for Performance
Required dimensions to cope with NEON   71 blocks of 16x16, assuming 1-block allocation
Width                                   135 px
Height                                  135 px
Relevance                               9.11% of the total loop consumption

Table 22. Experimental results based on the device limitations

Experimental Device Limitations for Scalability - Using EncI16x16 in the worst-case scenario
Number of SMs                                              8
Number of CUDA cores per SM                                128
Required number of threads                                 256 (assumes each macroblock requires 16x16)
Minimum required blocks                                    70.97
Concurrent blocks allowed by the device                    8 (equal to the number of SMs, given the block size)
Theoretical missing-blocks ratio                           8.87 (minimum blocks required to match NEON divided by the
                                                           concurrent blocks allowed by the device, assuming an SM executes
                                                           one block at a time; each SM must run multiple passes to cover
                                                           all required blocks)
Estimated sparsity of the job per single block             20.00% (assuming the worst metric from the other runs)
Experimental average at the required number of blocks      198.51 us
Experimental speedup at the required number of blocks,
compared to CUDA serial execution                          61.54 times
Theoretical average at the required number of blocks,
in NEON                                                    172.11 us
Theoretical speedup on CUDA with 100 blocks,
compared to NEON                                           0.87 times
Experimental average at 1080p                              5,408.57 us
Experimental speedup at 1080p, compared to CUDA
serial execution                                           182.94 times
Theoretical average at 1080p in NEON                       13,941.24 us
Theoretical speedup on CUDA at 1080p, compared to NEON     2.58 times

Table 22 shows the analysis for the EncRecI16x16Y function, highlighting an expected speedup of 2.58x when running at maximum GPU occupancy. However, the actual gain, when taking the relevance of the function into consideration, is 1.14x (2.58x weighted by 9.11% relevance).

The recommendation is similar to the one given for Optimization 3.

Conclusions and Future Work

Assuming the flow is a cascade and weighting each speedup by its relevance, the total speedup provided by the three kernel-port optimizations is about 1.18 x 1.08 x 1.14 ≈ 1.45x. This is faster than the NEON implementation for a single stream running on a single CPU core. However, the GPU can be occupied by just one 1080p stream at a time, leading to 6x less throughput than the CPU (using all six cores). In other words, a single stream consumes the entire GPU yet still runs more efficiently than on a single CPU core (not the whole Orin Nano CPU). The limitation is that architectures like the Jetson Orin Nano have only one GPU, so it might be better to use a single CPU core to encode one stream than to load the GPU completely with the H.264 codec, at least with the functions evaluated in this document.

At this point, all the analysis has been done by estimating the performance of the kernels in test benches. Nevertheless, accelerating single functions is unsuitable from the scheduling and workload point of view. Hence, the first recommendation for future work is to port the MdInterLoop as a whole, including all the functions invoked along its execution. This will provide enough workload to the GPU, since the macroblocks can be executed as CUDA blocks.

Another next step in this research is to integrate the implementation developed during this work into the library for more realistic results, and to optimize through algorithm simplification, combining and merging functions mathematically as done in the NEON assembly implementation.

We have estimated that porting the implementation may allow running up to 2 simultaneous 1080p30 streams at full GPU occupation. This estimate includes function and algorithm simplification and migrating the MdInterLoop function to the GPU. From the feasibility perspective, the results suggest that running the encoder on the GPU involves trade-offs, particularly sacrificing GPU resources that could otherwise be used for image processing and AI, since the encoder is compute-hungry. On the other hand, not freeing the CPU might cause slowdowns in application and control logic.

Finally, consider that continuing the port of the encoder requires CUDA expertise and is a complex task, given that a deep understanding of the H.264 encoding process and algorithms is needed. Likewise, the encoder's architecture is CPU-friendly, and re-engineering the library is required to make it GPU-friendly. The work might take six months to one year, depending on the team.


