Embedded GStreamer Performance Tuning

From RidgeRun Developer Wiki

This is a guide to embedded GStreamer performance pipeline tuning using a DM368 SoC running RidgeRun's SDK as the example hardware. It is based on an example case where 1080p video is recorded from a camera at 30 fps using an H264 encoder at a 12 Mbps bitrate.

Introduction

Performance is a ghost that is always haunting us when dealing with tasks that have real time deadlines. In multimedia, you must be able to process a frame every 33 milliseconds for 30 fps video or a frame will get dropped. When these real time tasks are done by an embedded system where resources are limited, tuning the system to overcome the real time constraints can become a tricky challenge. Multimedia handling is a very common field where time and resource constraints are likely to hinder our desired performance, especially when dealing with live audio/video capture.

This wiki uses a DM368 SoC running RidgeRun's SDK for the example hardware to guide you on performance tuning of a resource intensive 1080p30 GStreamer pipeline. The tricks done here should be easily extrapolated to other development environments.

Use case: 1080p30 12 Mbps bitrate video encoding

In order to guide you through the different tuning stages, a simple use case example is used. The goal of this example is to record a video stream from a camera to an SD card attached to our embedded system. The system requirements/constraints include:

  • 1920x1080 Full HD resolution.
  • 30 frames per second.
  • H264 encoded at 12 Mbps bitrate.
  • QuickTime container.
  • Write the movie file to an SD card.

Besides these explicit requirements, it is implicit that the system must operate successfully in real time, since we are capturing a live video stream. All that said, let's put on our gloves and helmet and get to work.

The Sharp Cliff of Failure

What is our actual time constraint? Looking at the requirements, the only temporal requirement is the 30 frames per second. Easy enough, we can translate this into:

T = 1 s / 30 frames
T = 0.03333 s
T = 33.33 ms

This means that we have 33.33 ms to process each frame or we won't be able to keep up with that frame rate. This leads us to our first rule of thumb:

You will be as slow as your slowest element.

If our most time consuming element can't beat the 33.33 ms deadline, the pipeline cannot be scheduled to work at this frame rate.

Initial Pipeline Profiling

Update: there is another tool, GstHarness, that may prove helpful.
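On GStreamer 1.x systems (this guide targets GStreamer 0.10, so treat this as a hedged alternative rather than the method used here), the built-in latency tracer can gather per-element timing without extra tooling. The elements below are placeholders:

GST_TRACERS=latency GST_DEBUG=GST_TRACER:7 \
gst-launch-1.0 videotestsrc num-buffers=300 ! x264enc ! fakesink 2> tracer.log

Each traced buffer produces entries in tracer.log that can be post-processed into per-element latencies.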

The initial pipeline to be used to capture and record the video is the following:

gst-launch -e \
v4l2src input-src=camera ! \
video/x-raw-yuv, width=1920, height=1088 ! \
dmaiaccel ! \
dmaiperf print-arm-load=true ! \
dmaienc_h264 targetbitrate=12000000 maxbitrate=12000000 ! \
qtmux ! \
filesink location=/media/sdcard/recording.mp4

Note how there are no queues in this pipeline. Although our bowels scream for queues, let's learn how to properly place them in the pipeline. An excess of queues can be harmful rather than helpful.

Start without queues and profile the pipeline to find the sweet spots

Profiling the above pipeline produces the diagram in Figure 1.

*The following profiling diagram was made by RidgeRun's GstTuner tool. Contact support@ridgerun.com for more information about it.
Figure 1. Initial profiling

The diagram in Figure 1 shows the current pipeline with a bunch of measurements everywhere. For now, we only pay attention to the mean and the accumulated processing time of the elements. The mean is the average time that the element needs to process a buffer, and the accumulated time is the sum of these means and represents the actual processing time needed to handle a buffer as a result of chaining the process function of each element. Table 1 summarizes this information for ease of viewing.


Table 1. Summary of mean and accumulated processing times for the initial profiling
Element       Mean (ms)  Accumulated (ms)
v4l2src       ---        ---
capsfilter    0.98       0.98
dmaiaccel     46.53      47.50
dmaiperf      2.07       49.58
dmaienc_h264  61.68      111.26
qtmux         1.02       112.28
filesink      ---        ---
INFO:
Timestamp: 1:33:29.215087770; bps: 11151032; fps: 3.55; CPU: 88; 
INFO:
Timestamp: 1:33:30.443846226; bps: 12758306; fps: 4.07; CPU: 87; 
INFO:
Timestamp: 1:33:31.665201188; bps: 12831449; fps: 4.09; CPU: 87; 
INFO:
Timestamp: 1:33:32.888005101; bps: 12820949; fps: 4.09; CPU: 87;

Table 1 shows that our slowest element is the encoder, with a processing time of 61.68 ms, so with this configuration it's not possible to schedule the pipeline at 30 fps. It is also evident that, with all the elements combined, it takes 112.28 ms to process a single buffer, which is way above our time constraint; no wonder dmaiperf reports only about 4 fps. The places where the time limit is violated are marked in red on both the diagram and the table. Let's set some properties to speed up the pipeline.

Dumping captured video header

It is often helpful to examine the headers in the video file to look at the video frame timing. For example, if you expect 30 fps but are seeing 23 to 25 fps, it might be that every 5th video frame is being dropped. Simple math makes this clear:

30 fps * 4 / 5 = 24 fps

There are different host tools depending on the video file type.

ffprobe

You can use ffprobe to dump the headers in an MP4 file. Some examples:

./ffprobe -show_frames -i video.mp4 -v debug
./ffprobe -show_packets -show_data video.mp4 | less
./ffprobe -show_frames video.mp4 | grep 'size\|coded_picture_number\|pict_type'
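To quickly estimate the effective frame rate of a recording (handy for spotting the dropped-frame symptom described above), something like the following also works; -select_streams, -count_frames and -show_entries are standard ffprobe options, though very old builds may lack some of them:

./ffprobe -v error -select_streams v:0 -count_frames \
    -show_entries stream=nb_read_frames,avg_frame_rate video.mp4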

Setting properties

Our initial pipeline approach wasn't enough to fulfill our real time requirements. However, we can still configure the elements in our pipeline to gain some processing speed.

Start by tuning the properties of the different elements

Table 2 shows some recommended element configurations and their respective descriptions. Keep in mind that although these configurations are generally recommended, they might not be ideal for every scenario.

Table 2. Properties to speed up the pipeline
Element       Property                  Description
v4l2src       always-copy=false         Uses the original v4l2 video buffers instead of copying them into new buffers. Besides avoiding the memcpy, the video buffers are contiguous, which avoids another memcpy in dmaiaccel.
dmaienc_h264  encodingpreset=2          Configures the encoder to run at high speed.
dmaienc_h264  single-nalu=true          Encodes the whole image into a single NAL unit, avoiding a walk through the buffer.
dmaienc_h264  ratecontrol=2             Uses variable bitrate coding.
filesink      enable-last-buffer=false  Tells the sink that we don't need it to keep a reference to the last buffer.

The pipeline now looks something like the following:

gst-launch -e \
v4l2src always-copy=false input-src=camera ! \
video/x-raw-yuv, width=1920, height=1088 ! \
dmaiaccel ! \
dmaiperf print-arm-load=true ! \
dmaienc_h264 encodingpreset=2 ratecontrol=2 single-nalu=true targetbitrate=12000000 maxbitrate=12000000 ! \
qtmux ! \
filesink location=/media/sdcard/recording.mp4 enable-last-buffer=false

Figure 2 shows the diagram of the profiled pipeline with the respective properties set.

Figure 2. Profiling pipeline with properties set

As before, the following table summarizes the important information at this stage of the tuning.

Table 3. Summary of mean and accumulated processing times for the profiling with properties
Element       Mean (ms)  Accumulated (ms)
v4l2src       ---        ---
capsfilter    0.74       0.74
dmaiaccel     0.98       1.71
dmaiperf      1.22       2.93
dmaienc_h264  30.53      33.46
qtmux         0.84       34.30
filesink      ---        ---
INFO:
Timestamp: 1:35:02.193290601; bps: 64009961; fps: 20.42; CPU: 33; 
INFO:
Timestamp: 1:35:03.194402098; bps: 75127432; fps: 23.97; CPU: 20; 
INFO:
Timestamp: 1:35:04.252337265; bps: 74111636; fps: 23.65; CPU: 17; 
INFO:
Timestamp: 1:35:05.255173848; bps: 75052455; fps: 23.95; CPU: 16;

Note how our timings have improved significantly. Now our most time consuming element is the encoder, with a processing time of 30.53 ms, which means the pipeline should be schedulable at 30 fps. Unfortunately, though, the accumulated time breaks our limit when the buffer is pushed into the encoder. We need to break the processing chain before the buffer enters the encoder's process function.

Parallelizing with Queues

The diagram in Figure 2 and the stats in Table 3 mark in red the element at which the accumulated time exceeds our time constraint of T = 33.33 ms. We must break the processing chain before pushing the buffer into the encoder. This is achieved by placing a queue at this location. The queue will receive the buffer from dmaiperf and return immediately, pushing the buffer to the encoder in a separate thread; in some way, one can say that they are processing in parallel. Also notice how the processing time of the encoder is around 30 ms, too close to our limit. It is a good idea to put a queue after the encoder as well and let the rest of the pipeline process on a different thread.

Place queues at the spots where T is exceeded or nearly exceeded.

The pipeline is now something like the following:

gst-launch -e \
v4l2src always-copy=false input-src=camera ! \
video/x-raw-yuv, width=1920, height=1088 ! \
dmaiaccel ! \
dmaiperf print-arm-load=true ! \
queue ! \
dmaienc_h264 encodingpreset=2 ratecontrol=2 single-nalu=true targetbitrate=12000000 maxbitrate=12000000 ! \
queue ! \
qtmux ! \
filesink location=/media/sdcard/recording.mp4 enable-last-buffer=false
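As a side note, a queue is not unlimited: by default it blocks upstream once it holds 200 buffers, 10 MB, or one second of data, whichever comes first. With raw 1080p frames of roughly 4 MB each, the byte limit is hit after only two buffers. If you need a deterministic depth, the standard queue properties can be set explicitly; the fragment below is an illustrative sketch, not the configuration used in this example:

# Bound the queue to 3 raw frames and disable the byte and time limits:
... ! queue max-size-buffers=3 max-size-bytes=0 max-size-time=0 ! dmaienc_h264 ...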

Figure 3 shows the profiling diagram with the queues in place and Table 4 shows the stats summary. Note how the accumulated time resets after each queue: every queue starts a new thread, and with it a new processing chain.

Figure 3. Profiling pipeline parallelized with queues
Table 4. Summary of mean and accumulated processing times for the profiling with queues
Element       Mean (ms)  Accumulated (ms)
v4l2src       ---        ---
capsfilter    1.67       1.67
dmaiaccel     1.06       2.72
dmaiperf      1.35       4.07
queue         1.23       1.23
dmaienc_h264  30.87      32.10
queue         1.19       1.19
qtmux         0.84       2.02
filesink      ---        ---
INFO:
Timestamp: 1:35:49.495732469; bps: 70679097; fps: 22.55; CPU: 21; 
INFO:
Timestamp: 1:35:50.499756552; bps: 81144860; fps: 25.89; CPU: 20; 
INFO:
Timestamp: 1:35:51.525898595; bps: 76350877; fps: 24.36; CPU: 21; 
INFO:
Timestamp: 1:35:52.527519641; bps: 78257742; fps: 24.97; CPU: 37;

We have gotten rid of all the red blocks, so the pipeline should now be able to schedule 30 fps. But why aren't we achieving it? There is still work to do.

Finding the Buffer Leak

So far, we've boosted performance by setting the properties appropriately and placing queues where necessary. Still, we aren't able to achieve our goal of 30 fps. To find out what is going on, a more specific analysis is needed. Figure 4 shows the scheduling information of the source pad of the v4l2src element. These measurements express the amount of time between two consecutive buffer pushes.

Figure 4. Scheduling information for v4l2 src pad
The previous profiling diagram was made by RidgeRun's GstTuner tool. Contact support@ridgerun.com for more information about it.

Observing the steady state, say after 3 seconds, the buffer push timing jumps between 33.33 ms and 66.67 ms. It's no coincidence that the upper value is exactly double the lower one. To understand what's happening, one must first know that v4l2src has a buffer pool with a fixed number of buffers, which are recycled over the whole pipeline lifetime. This means that if the pool runs out of buffers, it must wait until a buffer is freed before it can write a new image.

If v4l2 buffers are not returned fast enough, you will lose images while waiting for them to be freed.

By default v4l2src creates a 3-buffer pool. It is clear from Figure 4 that these 3 buffers aren't freed fast enough to keep up with 30 fps, so every so often a buffer is lost and the next image is grabbed 33.33 ms later, giving a total of 66.67 ms.

Easy enough, one could increase the pool size (the v4l2src queue-size property) to a higher value and expect better results. Doing so, the pipeline should look like:

gst-launch -e \
v4l2src queue-size=6 always-copy=false input-src=camera ! \
video/x-raw-yuv, width=1920, height=1088 ! \
dmaiaccel ! \
dmaiperf print-arm-load=true ! \
queue ! \
dmaienc_h264 encodingpreset=2 ratecontrol=2 single-nalu=true targetbitrate=12000000 maxbitrate=12000000 ! \
queue ! \
qtmux ! \
filesink location=/media/sdcard/recording.mp4 enable-last-buffer=false

Running the pipeline with --gst-debug=*v4l2*:4 reveals the following:

0:00:01.250882632  1126    0x15018 DEBUG                   v4l2
../../../src/sys/v4l2/gstv4l2bufferpool.c:333:gst_v4l2_buffer_pool_new:<v4l2src0> STREAMING, requesting 6 MMAP buffers
0:00:01.251481882  1126    0x15018 WARN                    v4l2
../../../src/sys/v4l2/gstv4l2bufferpool.c:351:gst_v4l2_buffer_pool_new:<v4l2src0> using 3 buffers instead

This means that the driver didn't find enough memory to create a pool of the desired size and fell back to the default size of 3. Be careful: if you don't enable this debug output, GStreamer will not notify you about the fallback.
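Assuming your SDK reserves the driver and CMEM memory through the kernel boot arguments (as RidgeRun's SDK does), a quick sanity check is to compare what the kernel was booted with against what it actually sees:

cat /proc/cmdline            # look for the mem= argument and other reserved regions
grep MemTotal /proc/meminfo  # RAM left for the kernel after the reservations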

Let's make a new memory layout for our system. First, some calculations of the amount of memory needed.

Input buffer: 
* Note the usage of 1088 instead of plain 1080: the H264 encoder works on 16x16 macroblocks, and 1088 is the next multiple of 16 above 1080.
* Need to multiply by 2 because the UYVY format uses 2 bytes per pixel.

1920*1088*2 = 4177920 = 0x3FC000

Buffer pool:
* Need the pool to be aligned to a 4MB boundary
* Compute a 6 buffer pool

(4177920*6 + (1024*1024*4)) & ~(1024*1024*4 - 1) = 25165824 = 0x1800000

Output buffer:
* In this example we do not need to display images so we need no memory for display

0

CMEM 
* For CMEM we allocate 10 times the size of one NV12 input buffer for this example
* The 1.5 factor comes from the NV12 format (12 bits per pixel) handled by the pipeline

1920*1088*1.5*10 = 31334400 = 0x1DE2000
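These numbers are easy to double-check from any shell with integer arithmetic; the figures below are just the calculations above, re-expressed:

printf '0x%X\n' $(( 1920*1088*2 ))                                        # input buffer: 0x3FC000
printf '0x%X\n' $(( (1920*1088*2*6 + 4*1024*1024) & ~(4*1024*1024 - 1) )) # buffer pool:  0x1800000
printf '0x%X\n' $(( 1920*1088*3/2*10 ))                                   # CMEM (NV12):  0x1DE2000
printf '0x%X\n' $(( 0x1800000 + 0x0 + 0x1DE2000 ))                        # total:        0x35E2000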

Using the numbers computed above we proceed to configure the overall system. For this we are using the make config menu of RidgeRun's SDK.

  • Video output
make config
 '-> Architecture configurations
   '-> Video Output
     '-> Disable
  • Video input
make config
 '-> Architecture configurations
   '-> Maximum Video Input Buffer Size
     '-> Other
       '-> 25165824
  • Amount of reserved memory from the kernel
make config
  '-> Proprietary software
    '-> Amount of reserved memory from the kernel
      '-> 0x1800000 + 0x0 + 0x1DE2000 = 0x35e2000
           (input)  (output) (CMEM)
  • Amount of reserved memory for cmemk
make config
  '-> Proprietary software
    '-> Amount of reserved memory for cmemk
      '-> 0x1DE2000
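After rebuilding and booting, it is worth confirming that cmemk actually claimed the reserved region. A hedged check follows; the exact messages and the /proc entry vary with the cmemk version shipped in your SDK:

dmesg | grep -i cmem   # cmemk reports its pool layout when it loads
cat /proc/cmem         # pool and block status on typical TI cmemk builds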

Build and install the SDK on the target board. Running the pipeline again now performs as shown below:

Scheduling of the pipeline with a queue size of 6
INFO:
Timestamp: 0:31:36.440458099; bps: 94215945; fps: 30.06; CPU: 37; 
INFO:
Timestamp: 0:31:37.472113180; bps: 94215945; fps: 30.06; CPU: 40; 
INFO:
Timestamp: 0:31:38.503786681; bps: 91176721; fps: 29.09; CPU: 34; 
INFO:
Timestamp: 0:31:39.535298349; bps: 94215945; fps: 30.06; CPU: 40;

Note how the buffers are now pushed every 33.33 ms. There is one exception between seconds 6 and 7 where a buffer is lost; however, the average framerate is maintained at 30 fps.

Tuned Pipeline

gst-launch -e \
v4l2src queue-size=6 always-copy=false input-src=camera ! \
video/x-raw-yuv, width=1920, height=1088 ! \
dmaiaccel ! \
dmaiperf print-arm-load=true ! \
queue ! \
dmaienc_h264 encodingpreset=2 ratecontrol=2 single-nalu=true targetbitrate=12000000 maxbitrate=12000000 ! \
queue ! \
qtmux ! \
filesink location=/media/sdcard/recording.mp4 enable-last-buffer=false

Additional Considerations

Configure Linux for Heavy Write Operations

What happens when the storage device can't keep up and the application keeps generating data that is passed to the system via file write() calls?

The answer is not simple; there is an entire conference on the topic, the 2012 Linux Storage, Filesystem, and Memory Management Summit. For our purposes, we can use the knowledge that Linux puts all available memory to work holding the data to be written to the device, where an SD card or USB thumb drive are two interesting cases. Kernel memory that holds data waiting to be written to a storage device is called write back cache. There is memory management information, and there are tuning parameters, that you can use to monitor and control how the buffer cache is managed.
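On most Linux systems you can list the writeback-related knobs with the command below; the exact set of entries varies across kernel versions:

ls /proc/sys/vm/ | grep dirty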

To get a feel for what is going on, write a Gbyte of zeros to an SD card or USB thumb drive. I did this on my Ubuntu desktop PC. Since the speed of the storage device is the limiting factor, you should see similar results on an embedded device.

In one window run

rm -f /tmp/meminfo.log ; while sleep 0.5 ; do cat /proc/meminfo | tee -a /tmp/meminfo.log ; done

You will want to start this right before the next command and stop it after 60 seconds or so.

dd bs=1M count=1024 if=/dev/zero of=/media/sd/zero_1G.binary ; echo "dd done" ; sync ; echo "sync done"

I used a USB thumb drive. The dd command returned after 31 seconds. The sync finished at 53 seconds. Interestingly, the activity light on the thumb drive blinked for another 5 seconds or so as the cache in the USB subsystem and in the thumb drive itself was written to NAND. Using a transfer time of 1 Gbyte in 58 seconds indicates I was saving data at around 17 Mbytes/sec.

I then pulled out the values of some entries I thought might be interesting, once with S=Writeback and once with S=Dirty:

for S in Writeback Dirty ; do
    grep "^$S:" /tmp/meminfo.log | awk '{print $2}' | grep -v '^0$' > /tmp/$S.log
done

and plotted the data. From the Writeback entry I noticed something in the kernel seems to cap the value, as it stayed between 16M and 19M for the majority of the time the thumb drive activity light was blinking.

The Dirty value stayed at zero for around 15 seconds, then shot way up, was non-uniform until an elapsed time of 32 seconds, was flat until 41 seconds, and then dropped linearly, as pretty as you please, at a rate of 20.3 Mbytes/sec (sample 83 was 549M, sample 137 was 2M, and I took samples every 1/2 second). Notice at the end of the graph that the rate at which the Dirty value changed matches my rough calculation of the thumb drive write speed: we are watching the dirty cache get written as fast as possible to the NAND in the thumb drive.

What can we learn from the Dirty cache being zero for the first 15 seconds? Maybe it has something to do with how fast /dev/zero can produce zeros. A simple test shows the kernel is blazing fast at producing zeros:

dd bs=100M count=1024 if=/dev/zero of=/dev/null

The above command reported a transfer rate of 4.3 GB/s, so producing our 1G collection of zeros took around 2% of the overall time. So if dd can produce the needed zeros in around 1 second, why did dd take 31 seconds to complete? Clearly the write() system call was not returning immediately, due to the kernel pacing how much data can be held in the page cache. Looking at our logged /proc/meminfo data, we can see the free memory stays constant until an elapsed time of 15 seconds, then drops by around 350 Mbytes. This lines up exactly with the jump in the Dirty value to 671M.


Tuning dirty_expire_centisecs

In an /etc/init.d/* file that starts the GStreamer media server, add

echo 100 > /proc/sys/vm/dirty_expire_centisecs

which minimizes the time during which the SD card sits idle even though there is data waiting to be written.
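Related knobs live in the same place. The values below are illustrative starting points rather than settings validated on this DM368 example; adjust them for your RAM size and storage speed:

echo 100 > /proc/sys/vm/dirty_expire_centisecs     # consider cached data old after 1 s
echo 100 > /proc/sys/vm/dirty_writeback_centisecs  # wake the flusher threads every 1 s
echo 5   > /proc/sys/vm/dirty_background_ratio     # start background writeback at 5% of RAM
echo 10  > /proc/sys/vm/dirty_ratio                # throttle writers at 10% of RAM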

Choose the Proper SD Card

Depending on your GStreamer application requirements, you may need to optimize for write-intense scenarios. For example, when running a dual-recording pipeline with two 1080p@30 streams, your system must be able to write data fast enough to keep up with both video streams. To increase write performance to an SD card, it is recommended to use a proper SD card (e.g. Class 10), but it is also important to tune your filesystem and possibly the kernel's virtual memory subsystem. For more details read our guide High performance SD card tuning using the EXT4 file system.
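A quick way to check whether a given card can sustain your target bitrate is a sequential write test. The mount point below is illustrative, and conv=fsync makes dd include the final flush to the card in its reported rate:

dd if=/dev/zero of=/media/sdcard/ddtest.bin bs=1M count=256 conv=fsync
rm /media/sdcard/ddtest.bin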

Leave Enough Memory for the Kernel

Consider Enabling DCache Buffers

(Patching DMAI to enable cache buffers on ARM side processing)

Implement a More Efficient Flushing Mechanism for the Filesink

(filesink patch)