RidgeRun Developer Manual/Profiling Tools/Linux Perf: Difference between revisions

Revision as of 20:28, 2 November 2023

Introduction to Linux Perf

Linux Perf is one of the most complete and accurate tools for profiling an application since it has access to kernel ABI/API. It can intercept calls and get access to the hardware counters. It is possible to determine:

Branching: how much does the branching affect the performance
Cache miss rates: how much the application has a good memory access pattern
Alignment faults: same as before but related to cache line trashing
Context switches: how much the application hides
CPU clocks and migration: how much time the application uses the CPU in active or waiting mode, and the migration amongst cores
Construct the call graph: description of how each function is called

In this case, we cover the tool for application optimisation, where we want to optimise only specific parts. This is crucial for applications that are unknown and need to be accelerated.

Installing Linux Perf

Linux Distributions with full kernel

Linux distros with full kernel can be installed out-of-the-box from a generic image. This is the case for Desktop and Server computers, where you can install a generic Ubuntu, Debian, Fedora, etc.

For these cases, you can install Linux Perf by using a package manager. For the case of Ubuntu or Debian, you can use:

sudo apt install linux-tools-common linux-tools-generic linux-tools-$(uname -r)

Then, you should be able to execute perf --version to double-check.

Linux Distributions with kernel minimised

There are distributions with minimised kernels where some features have been removed to reduce their size. For this example, we are covering NVIDIA Jetson with Jetpack 5.1, since it is one of the worst corner cases of installing perf, since there is no repo in APT package manager.

1. Check if it is suitable

gunzip -c /proc/config.gz | grep PERF

Most of the options must be in Yes (y) or Module (m):

nvidia@ubuntu:~$ gunzip -c /proc/config.gz | grep PERF
CONFIG_CGROUP_PERF=y
CONFIG_CC_OPTIMIZE_FOR_PERFORMANCE=y
CONFIG_HAVE_PERF_EVENTS=y
CONFIG_PERF_EVENTS=y
# CONFIG_DEBUG_PERF_USE_VMALLOC is not set
CONFIG_HW_PERF_EVENTS=y
CONFIG_CPU_FREQ_DEFAULT_GOV_PERFORMANCE=y
CONFIG_CPU_FREQ_GOV_PERFORMANCE=y
CONFIG_HAVE_PERF_REGS=y
CONFIG_HAVE_PERF_USER_STACK_DUMP=y
# CONFIG_CLS_U32_PERF is not set
# CONFIG_PCIEASPM_PERFORMANCE is not set
# CONFIG_PCIE_BUS_PERFORMANCE is not set
# CONFIG_BCMDHD_CUSTOM_NET_PERF_TEGRA is not set
CONFIG_DEVFREQ_GOV_PERFORMANCE=y
CONFIG_NTB_PERF=m

2. Get the kernel sources and decompress the sources into a folder

From the public sources of Jetpack, move kernel_src.tbz2 to the board. You can find them on the official website of NVIDIA under the name "Driver Package (BSP) Sources".

For simplicity, assume they are in ~/kernel/. It will lead to two folders: kernel and hardware

3. If you are on a computer, load the toolchain for cross-compilation. Otherwise, if you are on the platform, continue with the following step.

4. Compile the code

cd ~/kernel/kernel/kernel-5.10/tools/perf
make

The path may change depending on the Jetpack version

5. Install Linux Perf

Install dependencies

sudo apt install libelf-dev libunwind-dev libaudit-dev systemtap-sdt-dev libslang2-dev libdw-dev  libperl-dev  libiberty-dev  libzstd-dev libcap-dev

In this case, we are just adding the path:

echo 'export PATH=/home/${USER}/kernel/kernel/kernel-5.10/tools/perf:${PATH}' >> ~/.bashrc

Close the session and open it again.

6. Unfold the kernel options:

There are some missing components required for Perf. You can add them by using:

sudo unminimize

Usage

Extracting a Call Graph

Now, it is time to create a basic call graph.

1. Disable protection

sudo su
echo -1 > /proc/sys/kernel/perf_event_paranoid

2. Launch the process

gst-launch-1.0 videotestsrc is-live=1 num-buffers=500 ! 'video/x-raw,width=1920,height=1080,framerate=30/1' ! nvvidconv !  openh264enc bitrate=4000000 ! fakesink

3. Intercept the process and start the instrumentation

sudo perf record -F1000 -g --call-graph dwarf --delay=5 -v -p $(pgrep -d, gst-launch-1.0) -a sleep 10

where the arguments are:

-p: process ID
-g: call graph generation
--call-graph dwarf: call graph with symbols in gcc
-a: all CPUs
-F: sampling frequency
-D: delay after starting the program in millis
sleep 10: take for 10 secs

This generates a file called perf.data

4. Generate the report

sudo perf report -i perf.data --stdio  -g --call-graph=flat

where the arguments are:

-i: perf file
--stdio: print information
--call-graph=flat: show the information all colapsed

The result would be similar to the following:

# Children      Self  Command          Shared Object                 Symbol                                                                       
# ........  ........  ...............  ............................  .............................................................................
#
    70.94%     0.00%  videotestsrc0:s  [unknown]                     [.] 0000000000000000
           13.95%
                0
                0x561d02977470
                0x2
                gst_openh264enc_handle_frame
                WelsEnc::CWelsH264SVCEncoder::EncodeFrameInternal
                WelsEnc::WelsEncoderEncodeExt
                WelsEnc::WelsCodeOneSlice
                WelsEnc::WelsPSliceMdEnc
                WelsEnc::WelsMdInterMbLoop
                WelsEnc::WelsSpatialWriteMbSyn
                WelsEnc::WelsWriteMbResidual
                WelsEnc::WriteBlockResidualCavlc

           5.20%
                0
                0x561d02977470
                0x2
                gst_openh264enc_handle_frame
                WelsEnc::CWelsH264SVCEncoder::EncodeFrameInternal
                WelsEnc::WelsEncoderEncodeExt
                WelsEnc::WelsCodeOneSlice
                WelsEnc::WelsPSliceMdEnc
                WelsEnc::WelsMdInterMbLoop
                WelsEnc::WelsSpatialWriteMbSyn
                WelsEnc::WelsWriteMbResidual
                WelsEnc::WriteBlockResidualCavlc
                WelsEnc::CavlcParamCal_c

Sometimes, there are symbols missing. To see actual names, activate symbols in the compilation by adding -g -fno-omit-frame-pointer -funwind-tables flag.

Compute times from the Call Graph

To have more control over time, we need to specify the number of runs of our application. In the context of GStreamer, it means that we need to fix the number of frames to process. Let's suppose we have the following pipeline:

gst-launch-1.0 videotestsrc is-live=1 num-buffers=150 ! "video/x-raw,width=1920,height=1080" ! perf ! openh264enc bitrate=1000000 ! h264parse ! mpegtsmux ! filesink location=test.ts -e

In this case, we are certain that the number of frames is 150, so we expect to have 150 runs of a given transformation function. Let's encapsulate this into a bash file called pipe.sh.

Now, for the recording, we want to fix a sampling frequency. It will add the temporal framework to our measurements. For framerates under 30 fps, 1000 Hz seems a reasonable number. The faster, the better. So, for the recording, we can use:

perf record -F1000 -g ./pipe.sh

Once finished, we can open the report with:

perf report --stdio  -g --call-graph=flat,count

It will generate the following results

# Children      Self  Command          Shared Object                 Symbol                                                                       
# ........  ........  ...............  ............................  .............................................................................
#
    70.94%     0.00%  videotestsrc0:s  [unknown]                     [.] 0000000000000000
           316
                0
                0x561d02977470
                0x2
                gst_openh264enc_handle_frame
                WelsEnc::CWelsH264SVCEncoder::EncodeFrameInternal
                WelsEnc::WelsEncoderEncodeExt
                WelsEnc::WelsCodeOneSlice
                WelsEnc::WelsPSliceMdEnc
                WelsEnc::WelsMdInterMbLoop
                WelsEnc::WelsSpatialWriteMbSyn
                WelsEnc::WelsWriteMbResidual
                WelsEnc::WriteBlockResidualCavlc

           118
                0
                0x561d02977470
                0x2
                gst_openh264enc_handle_frame
                WelsEnc::CWelsH264SVCEncoder::EncodeFrameInternal
                WelsEnc::WelsEncoderEncodeExt
                WelsEnc::WelsCodeOneSlice
                WelsEnc::WelsPSliceMdEnc
                WelsEnc::WelsMdInterMbLoop
                WelsEnc::WelsSpatialWriteMbSyn
                WelsEnc::WelsWriteMbResidual
                WelsEnc::WriteBlockResidualCavlc
                WelsEnc::CavlcParamCal_c

           1036
                0
                0x561d02977470
                0x2
                gst_openh264enc_handle_frame
                WelsEnc::CWelsH264SVCEncoder::EncodeFrameInternal
                WelsEnc::WelsEncoderEncodeExt
                WelsEnc::WelsCodeOneSlice
                WelsEnc::WelsPSliceMdEnc
                WelsEnc::WelsMdInterMbLoop

Total execution time

Please, notice the number of samples taken in WriteBlockResidualCavlc, CavlcParamCal_c, WelsMdInterMbLoop , which are 316, 118 and 1036, respectively. Since we know that we are sampling at 1000 Hz, it gives us a period of T = 1/1000 = 1ms (increasing the frequency, we have more accuracy). It leads to an execution time for the functions of 316 * 1ms = 316 ms, 118 * 1ms = 118ms, 1036 * 1ms = 1036ms, respectively, for all the execution of the pipeline.

Execution per invocation

To know how much time the execution takes per frame, we just divide the total execution time computed above by the number of frames (150 in this example), so the WriteBlockResidualCavlc, CavlcParamCal_c, WelsMdInterMbLoop functions take 316ms / 150 = 2ms, 118ms / 150 = 1.2ms, 1036ms / 150 = 6.9ms, respectively.

It is important to mention that, perhaps, all the functions do not execute per frame. However, we take an average behaviour operating the same way.

A more complete representation can be done with:

perf report --stdio  -g --call-graph=graph,count

leading to:

# Children      Self  Command          Shared Object                 Symbol                                                                       
# ........  ........  ...............  ............................  .............................................................................
#
    70.94%     0.00%  videotestsrc0:s  [unknown]                     [.] 0000000000000000
            |
            ---0
               |          
                --1441--0x561d02977470
                          0x2
                          |          
                           --1439--gst_openh264enc_handle_frame
                                     WelsEnc::CWelsH264SVCEncoder::EncodeFrameInternal
                                     |          
                                      --1438--WelsEnc::WelsEncoderEncodeExt
                                                |          
                                                |--1078--WelsEnc::WelsCodeOneSlice
                                                |          |          
                                                |          |--1065--WelsEnc::WelsPSliceMdEnc
                                                |          |          |          
                                                |          |           --1036--WelsEnc::WelsMdInterMbLoop
                                                |          |                     |          
continues...

You can still perform the same analysis as above.

In some cases, you might need to run the report as root for symbol resolution.

Previous: Profiling_Tools

Index

Next: Methodologies