AM5728 Multimedia Performance Testbench

From RidgeRun Developer Wiki


Introduction

This page presents the results of a multimedia performance test-bench for the AM5728 EVM. The test-bench was executed on an OS image built with the AM5728 RR-SDK, using GStreamer version 1.6.1. The test-bench is based on GStreamer pipelines and evaluates different video and audio tasks.


Multimedia Performance Test-bench Description

This test-bench compares GStreamer pipelines that use the hardware acceleration modules available in the AM5728 SoC against equivalent pipelines that do not use hardware acceleration.


Measurement parameters specified in the test:

  • CPU load percentage per core
  • Memory consumption
  • Memory Bandwidth
  • Frame rate
  • Encode time of a raw audio block


Multimedia tasks evaluated in the test-bench:

  • AAC audio encode
  • H264 video encode
  • MPEG4 video encode
  • H264 video decode
  • MPEG4 video decode
  • MPEG2 video decode
  • JPEG video decode
  • Resolution scale and color-space conversion


Note: The encode time of a raw audio block was measured only for the AAC audio encode task.

Different tools were used to measure the parameters specified in this test-bench. GstShark was the main tool, used to measure CPU load per core, frame rate and buffer processing time. GstShark is a benchmarking and profiling tool for GStreamer pipelines developed by RidgeRun (for more information about this tool, follow this link: GstShark). The Linux top command was used to measure memory consumption. The Bandwidth application was used to measure memory bandwidth (for more information about this tool, follow this link: Bandwidth).
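
As a rough illustration of the memory measurement, the sketch below reads the resident set size of a running process from /proc, which is close to the RES figure that top reports. The helper name rss_kb is hypothetical, and this is not necessarily the procedure used to produce the charts on this page.

```shell
#!/bin/sh
# Print the resident set size (VmRSS, in kB) of a process.
# rss_kb is a hypothetical helper, not part of top or GstShark.
rss_kb() {
    # /proc/<pid>/status contains a line such as "VmRSS:   1234 kB"
    awk '/^VmRSS:/ { print $2 }' "/proc/$1/status"
}

# Example: sample the current shell once
rss_kb "$$"
```

To sample a pipeline, one could start gst-launch-1.0 in the background and call rss_kb with its PID at regular intervals.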

AM5728 EVM Multimedia Performance Test-bench Results

This section presents the performance results obtained by running the test-bench for each multimedia task.

AAC audio encode

To evaluate AAC encoding with and without hardware acceleration, we use the FAAC GStreamer plugin. The hardware-accelerated implementation of AAC audio encode with the FAAC plugin uses the NEON and VFPv4 extensions of the ARM Cortex-A15. The software implementation uses only the ARM core. The test pipelines differ only in the faac element: one case uses the hardware-accelerated implementation and the other uses the non-accelerated implementation.

A raw audio file of 45.1 MB with a duration of 4:28 (min:s) was used as input for all the test pipelines in this section.
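
As a sanity check on these figures, the sketch below computes the size of a raw PCM file from its duration. It assumes the audioparse default format (44100 Hz, 2 channels, 16-bit samples), since the test pipelines set no audioparse properties; the page does not state the actual format of the sample file. The helper name pcm_size_mib is hypothetical.

```shell
#!/bin/sh
# Size in MiB of raw PCM audio: seconds * rate * channels * bytes_per_sample.
# pcm_size_mib is a hypothetical helper; the format values are assumptions.
pcm_size_mib() {
    awk -v s="$1" -v r="$2" -v c="$3" -v b="$4" \
        'BEGIN { printf "%.1f\n", s * r * c * b / 1048576 }'
}

# 4 min 28 s = 268 s of 44.1 kHz, stereo, 16-bit audio
pcm_size_mib 268 44100 2 2   # prints 45.1
```

Under these assumptions the result agrees with the stated file size when MB is read as MiB.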


Total execution time

Test pipeline:

gst-launch-1.0 filesrc location=/am5728-gst-tests/audio-samples/audio_sample.raw ! audioparse ! faac ! fakesink -e

Obtained Results:

AM572x-testbench-AAC-exec-time.png


The chart above clearly shows that using hardware acceleration (NEON&VFPv4 extension) significantly reduces the total execution time of the AAC audio encode pipeline. On average, the pipeline execution time is 184.8 ms shorter when the NEON&VFPv4 extension is used.
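
The page does not say how the total execution time was captured. One possible approach, sketched below, is to time a command over several runs and average the result. The helper name avg_runtime_ms is hypothetical, and the sketch assumes GNU date, which supports the %N (nanoseconds) format.

```shell
#!/bin/sh
# Average wall-clock time, in milliseconds, of a command over N runs.
# avg_runtime_ms is a hypothetical helper; it relies on GNU date (%N).
avg_runtime_ms() {
    runs=$1; shift
    total=0
    i=0
    while [ "$i" -lt "$runs" ]; do
        start=$(date +%s%N)          # nanoseconds since the epoch
        "$@" >/dev/null 2>&1         # run the command under test
        end=$(date +%s%N)
        total=$(( total + (end - start) / 1000000 ))
        i=$(( i + 1 ))
    done
    echo $(( total / runs ))
}

# Example with a stand-in command
avg_runtime_ms 3 sleep 0.1
```

The stand-in command can be replaced by the test pipeline of this section, run once with each faac implementation.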


Buffer processing time

Test pipeline:

GST_TRACER_PLUGINS="proctime" gst-launch-1.0 filesrc location=/am5728-gst-tests/audio-samples/audio_sample.raw ! audioparse ! faac ! fakesink -e

Obtained Results:

AM572x-testbench-AAC-buffertime.png


The chart above shows that, in general, using hardware acceleration (NEON&VFPv4 extension) yields a lower buffer processing time. On average, each buffer is processed 17.47 µs faster when hardware acceleration is used.


CPU load % per core

Test pipeline:

GST_TRACER_PLUGINS="cpuusage" gst-launch-1.0 filesrc location=/am5728-gst-tests/audio-samples/audio_sample.raw num-buffers=1000 ! audioparse ! faac ! fakesink sync=true -e

Obtained Results:

AM572x-testbench-AAC-cpuload.png


The chart above shows that using hardware acceleration (NEON&VFPv4 extension) does not reduce the CPU workload. On average, CPU_0_accel carries 0.5 % more load than CPU_1_unaccel, and CPU_1_accel carries 0.125 % more load than CPU_0_unaccel (a negligible value).

Frame-rate

Test pipeline:

GST_TRACER_PLUGINS="framerate" gst-launch-1.0 filesrc location=/am5728-gst-tests/audio-samples/audio_sample.raw num-buffers=1000 ! audioparse ! faac ! fakesink sync=true -e

Obtained Results:

AM572x-testbench-AAC-framerate.png


The chart above shows that, in both cases, the frame rate reaches the expected value of 43 fps and then remains stable.
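
The expected value of 43 fps can be reproduced with a short calculation, sketched below. It assumes a 44100 Hz input (the audioparse default, since the pipeline sets no rate) and the standard AAC frame length of 1024 samples; the helper name aac_fps is hypothetical.

```shell
#!/bin/sh
# Expected AAC frame rate: sample_rate / samples_per_frame.
# aac_fps is a hypothetical helper; 1024 is the standard AAC frame length.
aac_fps() {
    awk -v r="$1" 'BEGIN { printf "%.2f\n", r / 1024 }'
}

aac_fps 44100   # prints 43.07
```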

Memory consumption

Test pipeline:

gst-launch-1.0 filesrc location=/am5728-gst-tests/audio-samples/audio_sample.raw ! audioparse ! faac ! fakesink -e

Obtained Results:

AM572x-testbench-AAC-memuse.png


The chart above shows that using hardware acceleration (NEON&VFPv4 extension) does not significantly reduce memory consumption. On average, the non-accelerated pipeline consumes 120 KB more, which is a negligible value.

Memory bandwidth consumption

Test pipeline:

gst-launch-1.0 filesrc location=/am5728-gst-tests/audio-samples/audio_sample.raw ! audioparse ! faac ! fakesink -e

Note: Both charts present the memory bandwidth consumption separately for sequential (seq) and random (al, for "aleatory") memory accesses.

Memory bandwidth consumption for memory reads, obtained results:

AM572x-testbench-AAC-readbandwidth.png


The chart above shows that using hardware acceleration (NEON&VFPv4 extension) results in higher memory bandwidth consumption for reads. The average difference is 213.5 MB/s for sequential reads and 64.6 MB/s for random reads.

Memory bandwidth consumption for memory writes, obtained results:

AM572x-testbench-AAC-writebandwidth.png


The chart above shows that using hardware acceleration (NEON&VFPv4 extension) results in higher memory bandwidth consumption for writes. The average difference is 206.8 MB/s for sequential writes and 79 MB/s for random writes.

H264 video encode

This section compares the performance of H264 video encode GStreamer pipelines using a hardware-accelerated implementation and a software-only implementation. The hardware-accelerated implementation uses gst-plugins-ducati (ducatih264enc element), while the software-only implementation uses the openh264 plugin (openh264enc element). The test pipelines differ only in the H264 encode element and in the raw video format fed to it (NV12 for ducatih264enc, I420 for openh264enc).


CPU load % per core

Test pipeline (ducatih264enc):

GST_TRACER_PLUGINS="cpuusage" gst-launch-1.0 -e videotestsrc num-buffers=640 is-live=true ! 'video/x-raw,format=(string)NV12,width=720,height=420,framerate=(fraction)30/1' ! ducatih264enc ! fakesink sync=true

Test pipeline (openh264enc):

GST_TRACER_PLUGINS="cpuusage" gst-launch-1.0 -e videotestsrc num-buffers=640 is-live=true ! 'video/x-raw,format=(string)I420,width=720,height=420,framerate=(fraction)30/1' ! openh264enc ! fakesink sync=true

Obtained Results:

AM572x-testbench-H264-enc-cpuload.png


The chart above clearly shows that using hardware acceleration substantially reduces the CPU workload. On average, CPU_0_accel carries 20.2 % less load than CPU_1_unaccel. In both cases the other core is practically idle, and there is no difference between the two cases.


Frame-rate

Test pipeline (ducatih264enc):

GST_TRACER_PLUGINS="framerate" gst-launch-1.0 -e videotestsrc num-buffers=640 is-live=true ! 'video/x-raw,format=(string)NV12,width=720,height=420,framerate=(fraction)30/1' ! ducatih264enc ! fakesink sync=true

Test pipeline (openh264enc):

GST_TRACER_PLUGINS="framerate" gst-launch-1.0 -e videotestsrc num-buffers=640 is-live=true ! 'video/x-raw,format=(string)I420,width=720,height=420,framerate=(fraction)30/1' ! openh264enc ! fakesink sync=true

Obtained Results:

AM572x-testbench-H264-enc-framerate.png


The chart above shows that, in both cases, the frame rate reaches the expected value of 30 fps and then remains stable.


Memory consumption

Test pipeline (ducatih264enc):

gst-launch-1.0 -e videotestsrc num-buffers=640 is-live=true ! 'video/x-raw,format=(string)NV12,width=720,height=420,framerate=(fraction)30/1' ! ducatih264enc ! fakesink sync=true

Test pipeline (openh264enc):

gst-launch-1.0 -e videotestsrc num-buffers=640 is-live=true ! 'video/x-raw,format=(string)I420,width=720,height=420,framerate=(fraction)30/1' ! openh264enc ! fakesink sync=true

Obtained Results:

AM572x-testbench-H264-enc-memuse.png


The chart above shows that using hardware acceleration achieves a small reduction in memory consumption. On average, consumption is 363 KB lower when hardware acceleration is used.


Memory bandwidth consumption

Test pipeline (ducatih264enc):

gst-launch-1.0 -e videotestsrc is-live=true ! 'video/x-raw,format=(string)NV12,width=720,height=420,framerate=(fraction)30/1' ! ducatih264enc ! fakesink sync=true

Test pipeline (openh264enc):

gst-launch-1.0 -e videotestsrc is-live=true ! 'video/x-raw,format=(string)I420,width=720,height=420,framerate=(fraction)30/1' ! openh264enc ! fakesink sync=true

Note: Both charts present the memory bandwidth consumption separately for sequential (seq) and random (al, for "aleatory") memory accesses.

Memory bandwidth consumption for memory reads, obtained results:

AM572x-testbench-H264-enc-readbandwidth.png


The chart above shows that using hardware acceleration results in higher memory bandwidth consumption for sequential reads (average difference of 137.3 MB/s). For random reads, consumption is on average 48.3 MB/s lower when using hardware acceleration.

Memory bandwidth consumption for memory writes, obtained results:

AM572x-testbench-H264-enc-writebandwidth.png


The chart above shows that using hardware acceleration results in lower memory bandwidth consumption for writes. The average difference is 473.1 MB/s for sequential writes and 2.6 MB/s for random writes (the latter value is negligible).

MPEG4 video encode

This section compares the performance of MPEG4 video encode GStreamer pipelines using a hardware-accelerated implementation and a software-only implementation. The hardware-accelerated implementation uses gst-plugins-ducati (ducatimpeg4enc element), while the software-only implementation uses gst-plugins-libav (avenc_mpeg4 element). The test pipelines differ only in the MPEG4 encode element and in the raw video format fed to it (NV12 for ducatimpeg4enc, I420 for avenc_mpeg4).


CPU load % per core

Test pipeline (ducatimpeg4enc):

GST_TRACER_PLUGINS="cpuusage" gst-launch-1.0 -e videotestsrc num-buffers=640 is-live=true ! 'video/x-raw,format=(string)NV12,width=720,height=420,framerate=(fraction)30/1' ! ducatimpeg4enc ! fakesink sync=true

Test pipeline (avenc_mpeg4):

GST_TRACER_PLUGINS="cpuusage" gst-launch-1.0 -e videotestsrc num-buffers=640 is-live=true ! 'video/x-raw,format=(string)I420,width=720,height=420,framerate=(fraction)30/1' ! avenc_mpeg4 ! fakesink sync=true

Obtained Results:

AM572x-testbench-MPEG4-enc-cpuload.png


The chart above clearly shows that using hardware acceleration substantially reduces the CPU workload. On average, CPU_1_accel carries 48.8 % less load than CPU_1_unaccel. In both cases CPU_0 has the same average workload percentage, so there is no difference between them.


Frame-rate

Test pipeline (ducatimpeg4enc):

GST_TRACER_PLUGINS="framerate" gst-launch-1.0 -e videotestsrc num-buffers=640 is-live=true ! 'video/x-raw,format=(string)NV12,width=720,height=420,framerate=(fraction)30/1' ! ducatimpeg4enc ! fakesink sync=true

Test pipeline (avenc_mpeg4):

GST_TRACER_PLUGINS="framerate" gst-launch-1.0 -e videotestsrc num-buffers=640 is-live=true ! 'video/x-raw,format=(string)I420,width=720,height=420,framerate=(fraction)30/1' ! avenc_mpeg4 ! fakesink sync=true

Obtained Results:

AM572x-testbench-MPEG4-enc-framerate.png


The chart above shows that, in both cases, the frame rate reaches the expected value of 30 fps and then remains stable.

Memory consumption

Test pipeline (ducatimpeg4enc):

gst-launch-1.0 -e videotestsrc num-buffers=640 is-live=true ! 'video/x-raw,format=(string)NV12,width=720,height=420,framerate=(fraction)30/1' ! ducatimpeg4enc ! fakesink sync=true

Test pipeline (avenc_mpeg4):

gst-launch-1.0 -e videotestsrc num-buffers=640 is-live=true ! 'video/x-raw,format=(string)I420,width=720,height=420,framerate=(fraction)30/1' ! avenc_mpeg4 ! fakesink sync=true

Obtained Results:

AM572x-testbench-MPEG4-enc-memuse.png


The chart above shows that using hardware acceleration achieves a large reduction in memory consumption. On average, consumption is 4514 KB lower when hardware acceleration is used.

Memory bandwidth consumption

Test pipeline (ducatimpeg4enc):

gst-launch-1.0 -e videotestsrc is-live=true ! 'video/x-raw,format=(string)NV12,width=720,height=420,framerate=(fraction)30/1' ! ducatimpeg4enc ! fakesink sync=true

Test pipeline (avenc_mpeg4):

gst-launch-1.0 -e videotestsrc is-live=true ! 'video/x-raw,format=(string)I420,width=720,height=420,framerate=(fraction)30/1' ! avenc_mpeg4 ! fakesink sync=true

Note: Both charts present the memory bandwidth consumption separately for sequential (seq) and random (al, for "aleatory") memory accesses.

Memory bandwidth consumption for memory reads, obtained results:

AM572x-testbench-MPEG4-enc-readbandwidth.png


The chart above shows that using hardware acceleration results in lower memory bandwidth consumption for reads. The average difference is 358.1 MB/s for sequential reads and 446.9 MB/s for random reads.

Memory bandwidth consumption for memory writes, obtained results:

AM572x-testbench-MPEG4-enc-writebandwidth.png


The chart above shows that using hardware acceleration results in lower memory bandwidth consumption for writes. The average difference is 1832.7 MB/s for sequential writes and 499.9 MB/s for random writes.


H264 video decode

This section compares the performance of H264 video decode GStreamer pipelines using a hardware-accelerated implementation and a software-only implementation. The hardware-accelerated implementation uses gst-plugins-ducati (ducatih264dec element), while the software-only implementation uses gst-plugins-libav (avdec_h264 element). The test pipelines differ only in the H264 decode element: one case uses the hardware-accelerated element and the other uses the non-accelerated one.


CPU load % per core

Test pipeline (ducatih264dec):

GST_TRACER_PLUGINS="cpuusage" gst-launch-1.0 filesrc location=/am5728-gst-tests/video-samples/TearOfSteel-Short-1920x800-H264.mov ! qtdemux name=demux demux.video_0 ! queue ! h264parse ! ducatih264dec ! fakesink sync=true -e

Test pipeline (avdec_h264):

GST_TRACER_PLUGINS="cpuusage" gst-launch-1.0 filesrc location=/am5728-gst-tests/video-samples/TearOfSteel-Short-1920x800-H264.mov ! qtdemux name=demux demux.video_0 ! queue ! h264parse ! avdec_h264 ! fakesink sync=true -e

Obtained Results:

AM572x-testbench-H264-dec-cpuload.png


The chart above clearly shows that using hardware acceleration substantially reduces the CPU workload. On average, CPU_0_accel carries 49.2% less load than CPU_0_unaccel, and CPU_1_accel carries 39% less load than CPU_1_unaccel.

Frame-rate

Test pipeline (ducatih264dec):

GST_TRACER_PLUGINS="framerate" gst-launch-1.0 filesrc location=/am5728-gst-tests/video-samples/TearOfSteel-Short-1920x800-H264.mov ! qtdemux name=demux demux.video_0 ! queue ! h264parse ! ducatih264dec ! fakesink sync=true -e

Test pipeline (avdec_h264):

GST_TRACER_PLUGINS="framerate" gst-launch-1.0 filesrc location=/am5728-gst-tests/video-samples/TearOfSteel-Short-1920x800-H264.mov ! qtdemux name=demux demux.video_0 ! queue ! h264parse ! avdec_h264 ! fakesink sync=true -e

Obtained Results:

AM572x-testbench-H264-dec-framerate.png


The chart above shows that, in both cases, the frame rate reaches the expected value of 24 fps and then remains stable.

Memory consumption

Test pipeline (ducatih264dec):

gst-launch-1.0 filesrc location=/am5728-gst-tests/video-samples/TearOfSteel-Short-1920x800-H264.mov ! qtdemux name=demux demux.video_0 ! queue ! h264parse ! ducatih264dec ! fakesink sync=true -e

Test pipeline (avdec_h264):

gst-launch-1.0 filesrc location=/am5728-gst-tests/video-samples/TearOfSteel-Short-1920x800-H264.mov ! qtdemux name=demux demux.video_0 ! queue ! h264parse ! avdec_h264 ! fakesink sync=true -e

Obtained Results:

AM572x-testbench-H264-dec-memuse.png


The chart above shows that using hardware acceleration achieves a very large reduction in memory consumption. On average, consumption is 10 869 KB lower when hardware acceleration is used.

Memory bandwidth consumption

Test pipeline (ducatih264dec):

gst-launch-1.0 filesrc location=/am5728-gst-tests/video-samples/Wreck-It_Ralph_H264.mp4 ! qtdemux name=demux demux.video_0 ! queue ! h264parse ! ducatih264dec ! fakesink sync=true -e

Test pipeline (avdec_h264):

gst-launch-1.0 filesrc location=/am5728-gst-tests/video-samples/Wreck-It_Ralph_H264.mp4 ! qtdemux name=demux demux.video_0 ! queue ! h264parse ! avdec_h264 ! fakesink sync=true -e

Note: Both charts present the memory bandwidth consumption separately for sequential (seq) and random (al, for "aleatory") memory accesses.

Memory bandwidth consumption for memory reads, obtained results:

AM572x-testbench-H264-dec-readbandwidth.png


The chart above shows that using hardware acceleration results in higher memory bandwidth consumption for reads. The average difference is 328.6 MB/s for sequential reads and 44.4 MB/s for random reads.

Memory bandwidth consumption for memory writes, obtained results:

AM572x-testbench-H264-dec-writebandwidth.png


The chart above shows that using hardware acceleration results in lower memory bandwidth consumption for writes. The average difference is 7.9 MB/s for sequential writes and 45 MB/s for random writes, so only a small improvement is achieved.


MPEG4 video decode

This section compares the performance of MPEG4 video decode GStreamer pipelines using a hardware-accelerated implementation and a software-only implementation. The hardware-accelerated implementation uses gst-plugins-ducati (ducatimpeg4dec element), while the software-only implementation uses gst-plugins-libav (avdec_mpeg4 element). The test pipelines differ only in the MPEG4 decode element: one case uses the hardware-accelerated element and the other uses the non-accelerated one.


CPU load % per core

Test pipeline (ducatimpeg4dec):

GST_TRACER_PLUGINS="cpuusage" gst-launch-1.0 filesrc location=/am5728-gst-tests/video-samples/TearOfSteel-Short-720x420-MPEG4.mp4 ! qtdemux name=demux demux.video_0 ! queue ! mpeg4videoparse ! ducatimpeg4dec ! fakesink sync=true -e

Test pipeline (avdec_mpeg4):

GST_TRACER_PLUGINS="cpuusage" gst-launch-1.0 filesrc location=/am5728-gst-tests/video-samples/TearOfSteel-Short-720x420-MPEG4.mp4 ! qtdemux name=demux demux.video_0 ! queue ! mpeg4videoparse ! avdec_mpeg4 ! fakesink sync=true -e

Obtained Results:

AM572x-testbench-MPEG4-dec-cpuload.png


The chart above clearly shows that using hardware acceleration reduces the CPU workload. On average, CPU_1_accel carries 9.27% less load than CPU_1_unaccel. In both cases CPU_0 has a very similar average workload percentage, so there is no significant difference between them.

Frame-rate

Test pipeline (ducatimpeg4dec):

GST_TRACER_PLUGINS="framerate" gst-launch-1.0 filesrc location=/am5728-gst-tests/video-samples/TearOfSteel-Short-720x420-MPEG4.mp4 ! qtdemux name=demux demux.video_0 ! queue ! mpeg4videoparse ! ducatimpeg4dec ! fakesink sync=true -e

Test pipeline (avdec_mpeg4):

GST_TRACER_PLUGINS="framerate" gst-launch-1.0 filesrc location=/am5728-gst-tests/video-samples/TearOfSteel-Short-720x420-MPEG4.mp4 ! qtdemux name=demux demux.video_0 ! queue ! mpeg4videoparse ! avdec_mpeg4 ! fakesink sync=true -e

Obtained Results:

AM572x-testbench-MPEG4-dec-framerate.png


The chart above shows that, in both cases, the frame rate reaches the expected value of 24 fps and then remains stable.

Memory consumption

Test pipeline (ducatimpeg4dec):

gst-launch-1.0 filesrc location=/am5728-gst-tests/video-samples/TearOfSteel-Short-720x420-MPEG4.mp4 ! qtdemux name=demux demux.video_0 ! queue ! mpeg4videoparse ! ducatimpeg4dec ! fakesink sync=true -e

Test pipeline (avdec_mpeg4):

gst-launch-1.0 filesrc location=/am5728-gst-tests/video-samples/TearOfSteel-Short-720x420-MPEG4.mp4 ! qtdemux name=demux demux.video_0 ! queue ! mpeg4videoparse ! avdec_mpeg4 ! fakesink sync=true -e

Obtained Results:

AM572x-testbench-MPEG4-dec-memuse.png


The chart above shows that using hardware acceleration achieves a large reduction in memory consumption. On average, consumption is 3 977 KB lower when hardware acceleration is used.

Memory bandwidth consumption

Test pipeline (ducatimpeg4dec):

gst-launch-1.0 filesrc location=/am5728-gst-tests/video-samples/Wreck-It_Ralph_MPEG4.mp4 ! qtdemux name=demux demux.video_0 ! queue ! mpeg4videoparse ! ducatimpeg4dec ! fakesink sync=true -e

Test pipeline (avdec_mpeg4):

gst-launch-1.0 filesrc location=/am5728-gst-tests/video-samples/Wreck-It_Ralph_MPEG4.mp4 ! qtdemux name=demux demux.video_0 ! queue ! mpeg4videoparse ! avdec_mpeg4 ! fakesink sync=true -e

Note: Both charts present the memory bandwidth consumption separately for sequential (seq) and random (al, for "aleatory") memory accesses.

Memory bandwidth consumption for memory reads, obtained results:

AM572x-testbench-MPEG4-dec-readbandwidth.png


The chart above shows that using hardware acceleration results in lower memory bandwidth consumption for reads. The average difference is 185.2 MB/s for sequential reads and 214.3 MB/s for random reads.

Memory bandwidth consumption for memory writes, obtained results:

AM572x-testbench-MPEG4-dec-writebandwidth.png


The chart above shows that using hardware acceleration results in lower memory bandwidth consumption for writes. The average difference is 1313.2 MB/s for sequential writes and 157.6 MB/s for random writes.


MPEG2 video decode

This section compares the performance of MPEG2 video decode GStreamer pipelines using a hardware-accelerated implementation and a software-only implementation. The hardware-accelerated implementation uses gst-plugins-ducati (ducatimpeg2dec element), while the software-only implementation uses gst-plugins-libav (avdec_mpeg2video element). The test pipelines differ only in the MPEG2 decode element: one case uses the hardware-accelerated element and the other uses the non-accelerated one.


CPU load % per core

Test pipeline (ducatimpeg2dec):

GST_TRACER_PLUGINS="cpuusage" gst-launch-1.0 filesrc location=/am5728-gst-tests/video-samples/Wreck-It_Ralph_Trailer-MPEG2.mpg num-buffers=260 ! mpegpsdemux ! queue ! mpegvideoparse ! ducatimpeg2dec ! fakesink sync=true -e

Test pipeline (avdec_mpeg2video):

GST_TRACER_PLUGINS="cpuusage" gst-launch-1.0 filesrc location=/am5728-gst-tests/video-samples/Wreck-It_Ralph_Trailer-MPEG2.mpg num-buffers=260 ! mpegpsdemux ! queue ! mpegvideoparse ! avdec_mpeg2video ! fakesink sync=true -e

Obtained Results:

AM572x-testbench-MPEG2-dec-cpuload.png


The chart above clearly shows that using hardware acceleration reduces the CPU workload. On average, CPU_1_accel carries 21.3% less load than CPU_1_unaccel, and CPU_0_accel carries 18.1% less load than CPU_0_unaccel.

Frame-rate

Test pipeline (ducatimpeg2dec):

GST_TRACER_PLUGINS="framerate" gst-launch-1.0 filesrc location=/am5728-gst-tests/video-samples/Wreck-It_Ralph_Trailer-MPEG2.mpg num-buffers=260 ! mpegpsdemux ! queue ! mpegvideoparse ! ducatimpeg2dec ! fakesink sync=true -e

Test pipeline (avdec_mpeg2video):

GST_TRACER_PLUGINS="framerate" gst-launch-1.0 filesrc location=/am5728-gst-tests/video-samples/Wreck-It_Ralph_Trailer-MPEG2.mpg num-buffers=260 ! mpegpsdemux ! queue ! mpegvideoparse ! avdec_mpeg2video ! fakesink sync=true -e

Obtained Results:

AM572x-testbench-MPEG2-dec-framerate.png


The chart above shows that, in both cases, the frame rate reaches the expected value of 25 fps and then remains stable.

Memory consumption

Test pipeline (ducatimpeg2dec):

gst-launch-1.0 filesrc location=/am5728-gst-tests/video-samples/Wreck-It_Ralph_Trailer-MPEG2.mpg num-buffers=260 ! mpegpsdemux ! queue ! mpegvideoparse ! ducatimpeg2dec ! fakesink sync=true -e

Test pipeline (avdec_mpeg2video):

gst-launch-1.0 filesrc location=/am5728-gst-tests/video-samples/Wreck-It_Ralph_Trailer-MPEG2.mpg num-buffers=260 ! mpegpsdemux ! queue ! mpegvideoparse ! avdec_mpeg2video ! fakesink sync=true -e

Obtained Results:

AM572x-testbench-MPEG2-dec-memuse.png


The chart above shows that using hardware acceleration reduces memory consumption. On average, consumption is 1 129 KB lower when hardware acceleration is used.

Memory bandwidth consumption

Test pipeline (ducatimpeg2dec):

gst-launch-1.0 filesrc location=/am5728-gst-tests/video-samples/Wreck-It_Ralph_Trailer-MPEG2.mpg ! mpegpsdemux ! queue ! mpegvideoparse ! ducatimpeg2dec ! fakesink sync=true -e

Test pipeline (avdec_mpeg2video):

gst-launch-1.0 filesrc location=/am5728-gst-tests/video-samples/Wreck-It_Ralph_Trailer-MPEG2.mpg ! mpegpsdemux ! queue ! mpegvideoparse ! avdec_mpeg2video ! fakesink sync=true -e

Note: Both charts present the memory bandwidth consumption separately for sequential (seq) and random (al, for "aleatory") memory accesses.

Memory bandwidth consumption for memory reads, obtained results:

AM572x-testbench-MPEG2-dec-readbandwidth.png


The chart above shows that using hardware acceleration results in lower memory bandwidth consumption for reads. The average difference is 436.4 MB/s for sequential reads and 330.1 MB/s for random reads.

Memory bandwidth consumption for memory writes, obtained results:

AM572x-testbench-MPEG2-dec-writebandwidth.png


The chart above shows that using hardware acceleration results in lower memory bandwidth consumption for writes. The average difference is 1090.9 MB/s for sequential writes and 155.1 MB/s for random writes.



JPEG video decode

This section compares the performance of JPEG video decode GStreamer pipelines using a hardware-accelerated implementation and a software-only implementation. The hardware-accelerated implementation uses gst-plugins-ducati (ducatijpegdec element), while the software-only implementation uses gst-plugins-libav (avdec_mjpeg element). The test pipelines differ only in the JPEG decode element: one case uses the hardware-accelerated element and the other uses the non-accelerated one.


CPU load % per core

Test pipeline (ducatijpegdec):

GST_TRACER_PLUGINS="cpuusage" gst-launch-1.0 filesrc location=/am5728-gst-tests/video-samples/TearOfSteel-Short-1920x800-MJPEG.mov ! qtdemux name=demux demux.video_0 ! queue ! jpegparse ! ducatijpegdec ! fakesink sync=true -e

Test pipeline (avdec_mjpeg):

GST_TRACER_PLUGINS="cpuusage" gst-launch-1.0 filesrc location=/am5728-gst-tests/video-samples/TearOfSteel-Short-1920x800-MJPEG.mov ! qtdemux name=demux demux.video_0 ! queue ! jpegparse ! avdec_mjpeg ! fakesink sync=true -e

Obtained Results:

AM572x-testbench-JPEG-dec-cpuload.png


The chart above clearly shows that using hardware acceleration achieves a large reduction in CPU workload. On average, CPU_1_accel carries 42.8% less load than CPU_1_unaccel. In both cases the CPU_0 core is practically idle, and there is no difference between the two cases.

Frame-rate

Test pipeline (ducatijpegdec):

GST_TRACER_PLUGINS="framerate" gst-launch-1.0 filesrc location=/am5728-gst-tests/video-samples/TearOfSteel-Short-1920x800-MJPEG.mov ! qtdemux name=demux demux.video_0 ! queue ! jpegparse ! ducatijpegdec ! fakesink sync=true -e

Test pipeline (avdec_mjpeg):

GST_TRACER_PLUGINS="framerate" gst-launch-1.0 filesrc location=/am5728-gst-tests/video-samples/TearOfSteel-Short-1920x800-MJPEG.mov ! qtdemux name=demux demux.video_0 ! queue ! jpegparse ! avdec_mjpeg ! fakesink sync=true -e

Obtained Results:

AM572x-testbench-JPEG-dec-framerate.png


The chart above shows that, in both cases, the frame rate reaches the expected value of 25 fps and then remains stable.

Memory consumption

Test pipeline (ducatijpegdec):

gst-launch-1.0 filesrc location=/am5728-gst-tests/video-samples/TearOfSteel-Short-1920x800-MJPEG.mov ! qtdemux name=demux demux.video_0 ! queue ! jpegparse ! ducatijpegdec ! fakesink sync=true -e

Test pipeline (avdec_mjpeg):

gst-launch-1.0 filesrc location=/am5728-gst-tests/video-samples/TearOfSteel-Short-1920x800-MJPEG.mov ! qtdemux name=demux demux.video_0 ! queue ! jpegparse ! avdec_mjpeg ! fakesink sync=true -e

Obtained Results:

AM572x-testbench-JPEG-dec-memuse.png


The chart above shows that using hardware acceleration reduces memory consumption. On average, consumption is 1304 KB lower when hardware acceleration is used.

Memory bandwidth consumption

Test pipeline (ducatijpegdec):

gst-launch-1.0 filesrc location=/am5728-gst-tests/video-samples/Wreck-It_Ralph_MJPEG.mov ! qtdemux name=demux demux.video_0 ! queue ! jpegparse ! ducatijpegdec ! fakesink sync=true -e

Test pipeline (avdec_mjpeg):

gst-launch-1.0 filesrc location=/am5728-gst-tests/video-samples/Wreck-It_Ralph_MJPEG.mov ! qtdemux name=demux demux.video_0 ! queue ! jpegparse ! avdec_mjpeg ! fakesink sync=true -e

Note: In both charts, memory bandwidth consumption is shown separately for sequential (seq) and random (al) memory access.

Obtained results for memory read bandwidth consumption:

AM572x-testbench-JPEG-dec-readbandwidth.png


The chart above shows that hardware acceleration lowers the memory bandwidth consumed by reads. The average difference is 448.9 MB/s for sequential reads and 175.5 MB/s for random reads.

Obtained results for memory write bandwidth consumption:

AM572x-testbench-JPEG-dec-writebandwidth.png


The chart above shows that hardware acceleration lowers the memory bandwidth consumed by sequential writes and slightly raises the bandwidth consumed by random writes. The average difference is 1046.9 MB/s for sequential writes and 155.1 MB/s for random writes.


Resolution scale and color-space conversion

This section compares the performance of GStreamer pipelines for resolution scaling and color-space conversion with and without hardware acceleration. The hardware-accelerated implementation uses gst-plugins-vpe (vpe element), while the software-only implementation uses the videoscale and videoconvert elements. The test pipelines differ only in the element that performs the scaling and color-space conversion: one uses the hardware-accelerated element and the other uses the software elements.


CPU load % per core

Test pipeline (vpe):

GST_TRACER_PLUGINS="cpuusage" gst-launch-1.0 -e videotestsrc num-buffers=640 is-live=true ! 'video/x-raw,format=(string)YUY2,width=320,height=240,framerate=(fraction)30/1' ! vpe ! 'video/x-raw, format=(string)NV12, width=(int)1280, height=(int)1024' ! fakesink sync=true

Test pipeline (videoscale and videoconvert):

GST_TRACER_PLUGINS="cpuusage" gst-launch-1.0 -e videotestsrc num-buffers=525 is-live=true ! 'video/x-raw,format=(string)YUY2,width=320,height=240,framerate=(fraction)30/1' ! videoscale ! 'video/x-raw, format=(string)YUY2, width=(int)1280, height=(int)1024' ! videoconvert ! 'video/x-raw, format=(string)NV12, width=(int)1280, height=(int)1024' ! fakesink sync=true

Obtained Results:

AM572x-testbench-VPE-cpuload.png


The chart above clearly shows that hardware acceleration greatly reduces the CPU workload. On average, CPU_0_accel carries 80.7% less load than CPU_1_unaccel. CPU_1_accel carries 1.3% more load than CPU_0_unaccel on average, but this increase is negligible compared with the 80.7% reduction in CPU workload seen in the first comparison.
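
One way to obtain an average difference like the ones quoted in this section is to subtract the mean of one measurement series from the mean of the other. The sketch below shows only that arithmetic. It assumes the samples of each series were already exported to a plain text file with one numeric value per line; the export step, the file names and the function name mean_diff are assumptions and not part of GstShark.

```shell
#!/bin/sh
# Hypothetical helper: print mean(file $1) - mean(file $2), one decimal place.
# Each file holds one numeric sample per line. Both files must be non-empty;
# otherwise nothing is printed.
mean_diff() {
    awk 'NR == FNR { a += $1; na++; next }   # first file: accumulate series A
         { b += $1; nb++ }                   # second file: accumulate series B
         END { if (na && nb) printf "%.1f\n", a / na - b / nb }' "$1" "$2"
}
```

For example, mean_diff cpu1_unaccel.txt cpu0_accel.txt would print how many percentage points of load the first series carries above the second, on average (both file names are placeholders).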

Frame-rate

Test pipeline (vpe):

GST_TRACER_PLUGINS="framerate" gst-launch-1.0 -e videotestsrc num-buffers=640 is-live=true ! 'video/x-raw,format=(string)YUY2,width=320,height=240,framerate=(fraction)30/1' ! vpe ! 'video/x-raw, format=(string)NV12, width=(int)1280, height=(int)1024' ! fakesink sync=true

Test pipeline (videoscale and videoconvert):

GST_TRACER_PLUGINS="framerate" gst-launch-1.0 -e videotestsrc num-buffers=525 is-live=true ! 'video/x-raw,format=(string)YUY2,width=320,height=240,framerate=(fraction)30/1' ! videoscale ! 'video/x-raw, format=(string)YUY2, width=(int)1280, height=(int)1024' ! videoconvert ! 'video/x-raw, format=(string)NV12, width=(int)1280, height=(int)1024' ! fakesink sync=true

Obtained Results:

AM572x-testbench-vpe-framerate.png


The chart above shows that when the vpe coprocessor is used, the frame rate reaches the expected 30 fps and then remains stable. Without hardware acceleration, the expected 30 fps cannot be reached, and the frame rate fluctuates between 24 and 25 fps.
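
The two test pipelines use different num-buffers values (640 for vpe and 525 for videoscale and videoconvert). The source does not say why. The sketch below only works out how long each run lasts at a given frame rate, which suggests that both runs take roughly the same time. The function name run_seconds is hypothetical.

```shell
#!/bin/sh
# Hypothetical helper: seconds needed to process $1 buffers at $2 frames
# per second, printed with one decimal place.
run_seconds() {
    awk -v n="$1" -v fps="$2" 'BEGIN { printf "%.1f\n", n / fps }'
}

run_seconds 640 30   # vpe pipeline at the expected 30 fps
run_seconds 525 25   # software pipeline at the upper observed rate
run_seconds 525 24   # software pipeline at the lower observed rate
```

At 30 fps the 640-buffer run lasts about 21.3 s, and the 525-buffer run lasts between 21.0 s and 21.9 s at 25 and 24 fps respectively.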

Memory consumption

Test pipeline (vpe):

gst-launch-1.0 -e videotestsrc num-buffers=640 is-live=true ! 'video/x-raw,format=(string)YUY2,width=320,height=240,framerate=(fraction)30/1' ! vpe ! 'video/x-raw, format=(string)NV12, width=(int)1280, height=(int)1024' ! fakesink sync=true

Test pipeline (videoscale and videoconvert):

gst-launch-1.0 -e videotestsrc num-buffers=525 is-live=true ! 'video/x-raw,format=(string)YUY2,width=320,height=240,framerate=(fraction)30/1' ! videoscale ! 'video/x-raw, format=(string)YUY2, width=(int)1280, height=(int)1024' ! videoconvert ! 'video/x-raw, format=(string)NV12, width=(int)1280, height=(int)1024' ! fakesink sync=true

Obtained Results:

AM572x-testbench-vpe-memuse.png


The chart above shows that hardware acceleration increases memory consumption: on average, 2765 KB more memory is used when hardware acceleration is enabled.

Memory bandwidth consumption

Test pipeline (vpe):

gst-launch-1.0 -e videotestsrc is-live=true ! 'video/x-raw,format=(string)YUY2,width=320,height=240,framerate=(fraction)30/1' ! vpe ! 'video/x-raw, format=(string)NV12, width=(int)1280, height=(int)1024' ! fakesink sync=true

Test pipeline (videoscale and videoconvert):

gst-launch-1.0 -e videotestsrc is-live=true ! 'video/x-raw,format=(string)YUY2,width=320,height=240,framerate=(fraction)30/1' ! videoscale ! 'video/x-raw, format=(string)YUY2, width=(int)1280, height=(int)1024' ! videoconvert ! 'video/x-raw, format=(string)NV12, width=(int)1280, height=(int)1024' ! fakesink sync=true

Note: In both charts, memory bandwidth consumption is shown separately for sequential (seq) and random (al) memory access.

Obtained results for memory read bandwidth consumption:

AM572x-testbench-vpe-readbandwidth.png


The chart above shows that hardware acceleration raises the memory bandwidth consumed by sequential reads by 69.8 MB/s on average, while it lowers the bandwidth consumed by random reads by 112.3 MB/s.

Obtained results for memory write bandwidth consumption:

AM572x-testbench-vpe-writebandwidth.png


The chart above shows that hardware acceleration lowers the memory bandwidth consumed by writes. The average difference is 638.2 MB/s for sequential writes and 26.9 MB/s for random writes.