FPGA Image Signal Processor - FPGA ISP Accelerators - Convolution

Introduction

The FPGA-ISP convolution is the professional version of the V4L2 FPGA convolution, developed as a templated module that can tune its resource usage to the application's needs. Like a common convolution accelerator, it receives a video frame from kernel space, applies a custom kernel, and returns the filtered frame.

With multiple convolution accelerators, it is possible to perform more complex operations, such as demosaicing, Sobel, DoG (Difference of Gaussians), LoG (Laplacian of Gaussian), and other spatial filters.

Optimizations in the hardware description keep this module from becoming a bottleneck: thanks to data parallelism, up to eight pixels fit in a single bus transfer.
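
To illustrate the idea of fitting eight pixels in one transfer, here is a minimal sketch that packs eight 8-bit pixels into a single word. It assumes a 64-bit word with the first pixel in the least significant byte; the real bus width and byte order of the accelerator are not stated on this page, and `pack_pixels` and `unpack_pixels` are hypothetical helper names.

```python
PIXELS_PER_WORD = 8  # from the text: up to eight pixels per bus transfer
BITS_PER_PIXEL = 8   # 8-bit gray pixels


def pack_pixels(pixels):
    """Pack eight 8-bit pixels into one integer word.

    Assumption: pixel 0 goes into the least significant byte.
    """
    if len(pixels) != PIXELS_PER_WORD:
        raise ValueError("expected exactly eight pixels")
    word = 0
    for i, p in enumerate(pixels):
        if not 0 <= p <= 0xFF:
            raise ValueError("pixel out of 8-bit range")
        word |= p << (i * BITS_PER_PIXEL)
    return word


def unpack_pixels(word):
    """Recover the eight 8-bit pixels from a packed word."""
    return [(word >> (i * BITS_PER_PIXEL)) & 0xFF
            for i in range(PIXELS_PER_WORD)]
```

With this layout, one transfer carries eight neighbouring pixels of a row, so the host needs one eighth of the transfers that a pixel-per-word scheme would need.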

Our initial goal of running this accelerator at 30 fps @ 4K has been met. This little monster can reach a peak throughput of 60 fps @ 4K, limited only by the FPGA clock and the PCIe bandwidth.

Convolution I/O properties

The input/output images have limitations in format and size, which are indicated below:

Property	Input image	Output image
Min width	8	8
Max width	4096	4096
Min height	8	8
Max height	2160	2160
Formats	8-bit Gray (Mono)	8-bit Gray (Mono)
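
The limits in the table can be checked before a frame is sent to the accelerator. The sketch below is one possible way to do it and is not part of the accelerator's API: `is_supported` and `caps_string` are hypothetical helpers, and the caps string assumes the standard GStreamer raw video media type with `GRAY8` as the name for 8-bit gray.

```python
# Limits taken from the I/O properties table
MIN_WIDTH, MAX_WIDTH = 8, 4096
MIN_HEIGHT, MAX_HEIGHT = 8, 2160
SUPPORTED_FORMAT = "GRAY8"  # GStreamer name for 8-bit gray (mono)


def is_supported(width, height, fmt=SUPPORTED_FORMAT):
    """Return True if the frame geometry and format fit the table limits."""
    return (fmt == SUPPORTED_FORMAT
            and MIN_WIDTH <= width <= MAX_WIDTH
            and MIN_HEIGHT <= height <= MAX_HEIGHT)


def caps_string(width, height):
    """Build an explicit caps string for a supported frame size.

    Assumption: the pipeline uses plain video/x-raw caps.
    """
    if not is_supported(width, height):
        raise ValueError(f"unsupported frame size {width}x{height}")
    return f"video/x-raw,format={SUPPORTED_FORMAT},width={width},height={height}"
```

For example, `caps_string(1920, 1080)` yields a string that can be placed explicitly in a pipeline description, while `caps_string(4, 4)` raises an error because the frame is below the minimum size.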

Convolution in action

Currently, the convolution example is fixed to a 3x3 Gaussian kernel, represented in 16-bit fixed-point (Q0.16):

0.0625	0.125	0.0625
0.125	0.25	0.125
0.0625	0.125	0.0625
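
As a rough software model of the fixed-point representation, the sketch below converts the kernel above to integers with 16 fractional bits and applies it to one 3x3 window. The rounding of the coefficients and the truncation of the result are assumptions made for illustration; the hardware may handle both differently.

```python
FRAC_BITS = 16  # 16 fractional bits, as in the fixed-point format above

GAUSSIAN_KERNEL = [
    [0.0625, 0.125, 0.0625],
    [0.125,  0.25,  0.125],
    [0.0625, 0.125, 0.0625],
]


def to_fixed(coefficient):
    """Convert a real coefficient to an integer with 16 fractional bits."""
    return round(coefficient * (1 << FRAC_BITS))


def quantize_kernel(kernel):
    """Quantize every coefficient of a kernel."""
    return [[to_fixed(c) for c in row] for row in kernel]


def convolve_window(window, fixed_kernel):
    """Multiply-accumulate one 3x3 window of 8-bit pixels.

    The accumulator is shifted back by FRAC_BITS (truncation, an assumption)
    and saturated to the 8-bit range.
    """
    acc = sum(window[r][c] * fixed_kernel[r][c]
              for r in range(3) for c in range(3))
    return min(255, acc >> FRAC_BITS)
```

All nine coefficients are powers of two, so they are represented exactly and the quantized kernel sums to 65536, which corresponds to 1.0. A flat region therefore passes through the filter unchanged.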

In a Gaussian blur, the edges will degrade, losing their sharpness.

Another example is an edge detection kernel, described by:

-1	-1	-1
-1	8	-1
-1	-1	-1

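With the Gaussian kernel used in the example, the resulting image shows the blurring produced by the convolution; an edge detection kernel like the one above would instead highlight the edges.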
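
To see what the edge detection kernel does, the following pure-software sketch applies it to a small 8-bit image stored as a list of rows. It is not the accelerator's implementation: the border handling (border pixels left at 0) and the saturation to the range 0 to 255 are assumptions, and `convolve3x3` is a hypothetical helper.

```python
EDGE_KERNEL = [
    [-1, -1, -1],
    [-1,  8, -1],
    [-1, -1, -1],
]


def clamp8(value):
    """Saturate a result to the 8-bit range."""
    return max(0, min(255, value))


def convolve3x3(image, kernel):
    """Apply a 3x3 kernel to the interior pixels of an 8-bit image.

    The kernel is applied without flipping, which is equivalent to a
    convolution here because the kernel is symmetric.
    Assumption: border pixels are left at 0.
    """
    height, width = len(image), len(image[0])
    out = [[0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            acc = 0
            for ky in range(3):
                for kx in range(3):
                    acc += kernel[ky][kx] * image[y + ky - 1][x + kx - 1]
            out[y][x] = clamp8(acc)
    return out
```

Because the coefficients sum to zero, a flat region produces 0, while an isolated bright pixel or a sharp transition produces a strong response.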

Current throughput

The convolution accelerator achieves the following throughput at several resolutions:

Maximum framerate using a 3x3 convolution for several standard resolutions on a PicoEVB
Resolution	Peak framerate PicoEVB (fps)	Actual framerate PicoEVB - PCIe (fps)
4K	60	45.137
1080p	241	159.7
720p	542	300.2
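
The framerates in the table can be converted into pixel rates to compare resolutions on a common scale. The sketch below does that arithmetic, reading the measured values with a decimal point (45.137, 159.7, 300.2 fps). It assumes that 4K means 3840x2160, 1080p means 1920x1080 and 720p means 1280x720, which this page does not state explicitly.

```python
# Assumed frame sizes for the resolution names used in the table
RESOLUTIONS = {"4K": (3840, 2160), "1080p": (1920, 1080), "720p": (1280, 720)}

# Values copied from the throughput table
PEAK_FPS = {"4K": 60, "1080p": 241, "720p": 542}
MEASURED_FPS = {"4K": 45.137, "1080p": 159.7, "720p": 300.2}


def pixel_rate(name, fps):
    """Pixels per second for a resolution at a given framerate."""
    width, height = RESOLUTIONS[name]
    return width * height * fps


for name in RESOLUTIONS:
    peak = pixel_rate(name, PEAK_FPS[name]) / 1e6
    measured = pixel_rate(name, MEASURED_FPS[name]) / 1e6
    print(f"{name}: peak {peak:.1f} Mpixel/s, measured {measured:.1f} Mpixel/s")
```

Under these assumptions, the three peak figures all land just under 500 Mpixel/s, which suggests a single fixed pixel rate behind the peak column. Since the pixels are 8-bit, the measured pixel rate is also the number of bytes per second moved in each direction over PCIe.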

For these measurements, we used a 3x3 kernel. Larger kernels are possible without sacrificing speed, because the products for one output pixel are computed in parallel and the calculation takes a single clock cycle. The trade-off is that larger kernels consume more FPGA area. Furthermore, a multichannel image (N channels) can be convolved by splitting the channels and instantiating N convolution units. Depending on your application, this enables truly parallel processing inside the FPGA with the same execution time as a single-channel convolution.
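
The channel-splitting approach can be modelled in software as follows. This is only a sketch of the data flow: the pixel layout (interleaved channels in a flat list) is an assumption, `unit` stands for whatever single-channel filter is applied, and in this model the units run one after another rather than in parallel as they would in the FPGA.

```python
def split_channels(pixels, num_channels):
    """Split a flat interleaved pixel list into one plane per channel."""
    return [pixels[c::num_channels] for c in range(num_channels)]


def merge_channels(planes):
    """Interleave the planes back into a single flat pixel list."""
    return [value for group in zip(*planes) for value in group]


def process_multichannel(pixels, num_channels, unit):
    """Run one single-channel unit per plane and merge the results.

    `unit` is a placeholder for a mono convolution; each plane is
    independent, which is what allows N hardware units to work in parallel.
    """
    planes = split_channels(pixels, num_channels)
    return merge_channels([unit(plane) for plane in planes])
```

For example, `process_multichannel(rgb_pixels, 3, unit)` applies the same mono filter to the red, green and blue planes independently.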

Known issues

1. GStreamer autonegotiation: The caps, such as width, height, and format, must be specified in the pipeline.

2. Numerical precision: Since fixed-point arithmetic has lower precision than floating-point, results may differ from those of a floating-point implementation.
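
The precision issue can be illustrated with a coefficient that has no exact representation with 16 fractional bits. The sketch below uses a 3x3 box filter (all coefficients 1/9), chosen only for illustration and not taken from the accelerator; whether the hardware rounds or truncates coefficients and results is an assumption here.

```python
FRAC_BITS = 16
SCALE = 1 << FRAC_BITS


def quantization_error(coefficient):
    """Absolute error introduced by rounding a coefficient to 16 fractional bits."""
    return abs(round(coefficient * SCALE) / SCALE - coefficient)


def box_filter_flat(value, truncate_coefficient):
    """Fixed-point output of a 3x3 box filter on a flat region of `value`.

    The coefficient 1/9 is either truncated or rounded to 16 fractional
    bits; the accumulated result is truncated back to an integer.
    """
    scaled = SCALE / 9
    fixed = int(scaled) if truncate_coefficient else round(scaled)
    return (9 * fixed * value) >> FRAC_BITS
```

A floating-point box filter returns 255 on a flat region of 255. In this model, `box_filter_flat(255, True)` returns 254, while `box_filter_flat(255, False)` returns 255, so the difference depends on how the coefficients are quantized. Coefficients that are powers of two, such as 0.0625, are represented exactly and introduce no error.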
