Getting Started with Xilinx OpenCV in Vivado HLS
This page is under development.
Introduction to Xilinx OpenCV
Have you ever thought about accelerating image processing algorithms on an FPGA? Xilinx OpenCV (also known as xfOpenCV [1]) is a templated library optimized for FPGA High-Level Synthesis (HLS). It lets you create image processing pipelines easily, in the same fashion as you would with the well-known OpenCV library. Inside this library, you can find common algorithms such as [2]:
- Color space conversion
- Image resizing
- Border and edge detection algorithms (Canny, Sobel)
- Warp transformation
- Hog transform
- Matrix-matrix operations (addition, weighted-addition)
- And more
Currently, the whole library is intended for the Xilinx SDx tool suite. However, it is entirely possible to integrate Xilinx OpenCV into Vivado HLS for accelerator-oriented projects that are not based on a System-on-Chip (SoC) but on a standalone FPGA, such as the Artix-7 in the PicoEVB, which communicates with a host computer through a PCIe port.
Now: is it Open and Free?
Yes, it is! Xilinx OpenCV is completely supported and maintained by the community under the BSD-3 license. That means that xfOpenCV can be used in projects as long as the authors are credited when distributing the application.
In this wiki, we are going to explore how to use the Xilinx OpenCV library in Vivado HLS.
About High-Level Synthesis
To build our first Xilinx OpenCV project, we need to know how to integrate it into Vivado HLS. Please note that this tutorial is based on Vivado HLS 2018.3, which is backwards compatible.
Basic components
A basic Vivado HLS project is composed of the following components:
1. Source code: It contains the module (C++ function) of the accelerator and a header that allows integrating it with other parts, such as the testbench.
2. Testbench: C++ code that loads the data into the accelerator, executes it, and retrieves the results, so that they can be compared against a reference to verify that the implementation is correct in both C++ and RTL.
3. TCL script: It contains the commands needed by Vivado HLS to integrate the components (source code, testbench, and other resources) and the stages to perform (synthesis, C simulation, C/RTL co-simulation, RTL export).
Coding style
The coding style in Vivado HLS is quite similar to C++. You can use namespaces, classes (with limitations), arbitrary datatypes, unions, templates, etc. Nevertheless, HLS adds an extra layer of directives, which tell the preprocessor (part of the synthesizer) how the units/instructions should be mapped into hardware. With directives, you can specify:
1. Loop unrolling
2. Port representation
3. Dataflow regions
4. Memory representation: as registers, blocks of memory
5. Variable types
For more information, please, refer to [3].
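As a rough illustration of how directives sit inside ordinary C++, the sketch below applies two of them to a hypothetical 4-tap dot product. The names `dot4`, `TAPS`, and `check_dot4` are invented for this example and are not part of any library. A regular compiler ignores the `#pragma HLS` lines, so the same function can be exercised in software first.

```cpp
#include <cassert>

// Hypothetical example: number of taps of the dot product.
const int TAPS = 4;

// Dot product of two small arrays, annotated with HLS directives.
int dot4(const int a[TAPS], const int b[TAPS]) {
// Memory representation: split both arrays into individual registers
// so that every element can be read in the same cycle.
#pragma HLS ARRAY_PARTITION variable = a complete dim = 1
#pragma HLS ARRAY_PARTITION variable = b complete dim = 1
  int acc = 0;
  for (int i = 0; i < TAPS; i++) {
// Loop unrolling: replicate the loop body in hardware instead of iterating.
#pragma HLS UNROLL
    acc += a[i] * b[i];
  }
  return acc;
}

// Software-only sanity check of the function above.
void check_dot4() {
  int a[TAPS] = {1, 2, 3, 4};
  int b[TAPS] = {5, 6, 7, 8};
  assert(dot4(a, b) == 70);  // 5 + 12 + 21 + 32
}
```

The directives do not change the C++ semantics; they only guide how the tool maps the code into hardware.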
Considerations when developing for FPGAs
In general, Xilinx OpenCV uses arrays to represent the image matrices, as software does. When implemented on an FPGA, this leads to high Block RAM usage, consuming more resources than really needed without any gain in performance. This approach is useful with the Xilinx SDx suite, but it is not suitable for standalone FPGAs. For this reason, we represent the matrices as streams, which offer the same performance as arrays and lower the footprint of our accelerators.
However, some Xilinx OpenCV kernels are not available when using the stream representation of the matrices, especially those that access the data non-sequentially.
If your application requires algorithms with these requirements, please consider using a more advanced tool such as Xilinx SDx.
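To make the sequential-access restriction more concrete, here is a minimal software model, assuming a `std::queue` as a stand-in for a hardware stream. The names `pixel_fifo_t`, `threshold_stream`, and `reverse_stream` are hypothetical. A pointwise kernel consumes each pixel once and in order, so it needs no frame storage. A kernel whose first output depends on the last input must buffer the whole frame, which is exactly what the stream representation tries to avoid.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>
#include <vector>

// Software stand-in for a hardware stream: elements are read once, in order.
typedef std::queue<unsigned char> pixel_fifo_t;

// Sequential kernel: each pixel is consumed and produced in arrival order,
// so no frame buffer is needed.
void threshold_stream(pixel_fifo_t& in, pixel_fifo_t& out,
                      unsigned char level) {
  while (!in.empty()) {
    unsigned char p = in.front();
    in.pop();
    out.push(p > level ? 255 : 0);
  }
}

// Non-sequential kernel: the first output is the last input, so the whole
// frame has to be stored first. Returns the number of buffered pixels.
std::size_t reverse_stream(pixel_fifo_t& in, pixel_fifo_t& out) {
  assert(out.empty());  // the output stream is expected to start empty
  std::vector<unsigned char> frame;
  while (!in.empty()) {
    frame.push_back(in.front());
    in.pop();
  }
  for (std::size_t i = frame.size(); i > 0; i--) {
    out.push(frame[i - 1]);
  }
  return frame.size();
}
```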
Downloading Xilinx OpenCV
To download Xilinx OpenCV, you only need to clone the repo; it is then ready to use. Neither building nor compilation is needed since it is a templated library. For our purposes, we are going to use the 2018.3 release and clone it inside ./lib of our working directory, where we are going to save our source code as well.
# Create the library path
mkdir lib/
# Clone the repo
RELEASE=2018.3_release
git clone https://github.com/Xilinx/xfopencv.git -b $RELEASE lib/xfopencv
The code shown above will create the ./lib folder and clone the repo into ./lib/xfopencv. It will also select the tag 2018.3_release.
That's all you need to have xfOpenCV in your project.
Inside ./lib/xfopencv, you will find the following folders:
- HLS_Use_Model: contains two examples (a basic one and another with AXI) and a manual about how to use xfOpenCV with HLS.
- Examples: contains the example code for each kernel. It teaches how to use the functions available in xfOpenCV.
- examples_sdaccel: contains examples for the SDAccel tool (now SDx).
- include: contains all the templated library source code.
We are interested in analyzing HLS_Use_Model and include.
First example - Single Kernel Application
In this example, we are going to implement a morphological operation (erosion).
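For readers who are new to the operation, the following plain C++ sketch shows what erosion computes: each output pixel is the minimum of the input pixels selected by the non-zero entries of the structuring element. The names `image_t` and `erode_ref` are hypothetical, and ignoring neighbors that fall outside the image is an assumption of this sketch, not necessarily the border policy of OpenCV or xfOpenCV.

```cpp
#include <cassert>
#include <vector>

// Row-major 8-bit grayscale image (hypothetical helper type).
typedef std::vector<unsigned char> image_t;

// Reference erosion: minimum over the neighbors selected by the kernel.
// Assumption: neighbors outside the image are simply ignored.
image_t erode_ref(const image_t& src, int rows, int cols,
                  const unsigned char* kernel, int ksize) {
  assert(ksize % 2 == 1);
  assert(static_cast<int>(src.size()) == rows * cols);
  const int radius = ksize / 2;
  image_t dst(src.size());
  for (int y = 0; y < rows; y++) {
    for (int x = 0; x < cols; x++) {
      unsigned char minimum = 255;
      for (int ky = 0; ky < ksize; ky++) {
        for (int kx = 0; kx < ksize; kx++) {
          if (kernel[ky * ksize + kx] == 0) continue;  // not in the shape
          const int yy = y + ky - radius;
          const int xx = x + kx - radius;
          if (yy < 0 || yy >= rows || xx < 0 || xx >= cols) continue;
          const unsigned char v = src[yy * cols + xx];
          if (v < minimum) minimum = v;
        }
      }
      dst[y * cols + x] = minimum;
    }
  }
  return dst;
}
```

With a 3x3 cross and a white image that has a single black pixel in the center, the black region grows to the four direct neighbors while the corners stay white.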
Requirements
We need to take into account the following constraints:
1. The ports shall be AXI Stream: This allows a continuous data flow without addressing overhead, which results in higher throughput and better bandwidth usage.
2. The verification shall be done against the OpenCV library: The results will be compared to the OpenCV results. An important remark is that Xilinx already ships the OpenCV library inside its binaries. The OpenCV version is 2.4.8 or 3.x, depending on the tool version. The erosion operation doesn't change from one version to the other.
What does it look like in software
We are going to create an OpenCV baseline in order to contrast the results of the software implementation with those of the hardware one.
#include <stdint.h> /* Brings uint16_t */
#include <stdio.h>  /* Brings printf and fprintf */

#include "opencv2/core/core.hpp"       /* Makes available the cv::Mat and constants */
#include "opencv2/highgui/highgui.hpp" /* Allows to use cv::imread and cv::imwrite - version 2.4 */
#include "opencv2/imgproc/imgproc.hpp" /* Allows to use cv::erode */

/* Erosion parameters - same values used later by the accelerator */
const int FILTER_WIDTH_EROSION = 5;
const int EROSION_NITER = 1;

int main(int argc, char** argv) {
  /* Check arguments */
  if (argc != 2) {
    fprintf(stderr, "Invalid Number of Arguments!\nUsage:\n");
    fprintf(stderr, "%s <input image path> \n", argv[0]);
    return -1;
  }

  cv::Mat ocv_ref;
  cv::Mat in_img;

  /* Load input image - in grayscale */
  printf("Reading image...\n");
  in_img = cv::imread(argv[1], 0);
  if (in_img.data == NULL) {
    fprintf(stderr, "Cannot open image at %s\n", argv[1]);
    return -1;
  }

  uint16_t height = in_img.rows;
  uint16_t width = in_img.cols;

  /* Preparing kernels - cross-shaped structuring element */
  cv::Mat element = cv::getStructuringElement(
      cv::MORPH_CROSS, cv::Size(FILTER_WIDTH_EROSION, FILTER_WIDTH_EROSION),
      cv::Point(-1, -1));

  /* Get output */
  printf("Software processing...\n");
  ocv_ref.create(height, width, CV_8UC1);
  cv::erode(in_img, ocv_ref, element, cv::Point(-1, -1), EROSION_NITER,
            cv::BORDER_CONSTANT);

  /* Write output for reference */
  cv::imwrite("output_ocv.png", ocv_ref);
  printf("Software done...\n");
  return 0;
}
The code above shows how to perform the erosion in software with C++. The process can be summarized as:
1. Read the image into a cv::Mat (cv::imread)
2. Prepare the kernel for the erosion (cv::getStructuringElement)
3. Execute the erosion function (or kernel) (cv::erode)
4. Write the image back from a cv::Mat (cv::imwrite)
This code, with a couple of modifications, will allow us to verify our accelerator.
Implementing the accelerator in Vivado HLS and Xilinx OpenCV
The process doesn't differ much from the software approach. The main changes are the functions used and how the matrices are read and written through the ports. Also, the whole accelerator is wrapped into a module (or C++ function).
Representing the matrices
The matrices in Xilinx OpenCV are represented by a class named xf::Mat, which is templated on the pixel type, the maximum dimensions allowed by the accelerator (in order to reserve the required hardware resources), and the number of pixels per clock. A typical declaration of a matrix in Xilinx OpenCV looks like this:
static xf::Mat<TYPE, HEIGHT_MAX, WIDTH_MAX, NPC> imgOutput(height, width);
Analyzing it, we can notice the following things:
1. TYPE: It is the pixel/image type. For an 8-bit grayscale image, we can use XF_8UC1.
2. height and width: These are the current dimensions of the image that arrives at the accelerator.
3. HEIGHT_MAX and WIDTH_MAX: These are the maximum image dimensions. height and width must be lower than or equal to these maximums.
4. NPC: It is the number of pixels per clock. Some kernels can support up to 8 pixels per clock for an 8-bit grayscale image. For this case, we are using the maximum by setting it to XF_NPPC8.
5. The static keyword: It is mandatory for streams inside the accelerator. It is strongly recommended to use namespaces to avoid variable name clashes.
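The split between compile-time maximums and run-time dimensions can be sketched with a small stand-in class. `BoundedImage` below is a hypothetical simplification written for this page, not the actual xf::Mat implementation. Requiring the width to be a multiple of the pixels per clock is an assumption of the sketch.

```cpp
#include <cassert>

// Hypothetical stand-in for a matrix with compile-time maximum dimensions
// and run-time actual dimensions.
template <int ROWS_MAX, int COLS_MAX, int NPC>
struct BoundedImage {
  int rows;  // actual height of the incoming image
  int cols;  // actual width of the incoming image

  BoundedImage(int r, int c) : rows(r), cols(c) {
    // The actual dimensions must fit in the resources reserved at
    // compile time.
    assert(r > 0 && r <= ROWS_MAX);
    assert(c > 0 && c <= COLS_MAX);
    // Assumption of this sketch: each row holds a whole number of words.
    assert(c % NPC == 0);
  }

  // Words the hardware has to be sized for (known at compile time).
  static int capacity_words() { return ROWS_MAX * COLS_MAX / NPC; }

  // Words actually transferred for the current frame.
  int words_in_frame() const { return rows * cols / NPC; }
};
```

For example, with maximums of 1536 x 2048 and 8 pixels per clock, a 1080 x 1920 frame moves 259200 words, while the design is sized for up to 393216.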
Now, to fulfill the requirement of representing the matrices as data streams, we need to place a directive just after the declaration.
#pragma HLS stream variable = imgOutput.data dim = 1 depth = 2
In this case, we are telling the synthesizer that we want to implement the data member of the matrix as a stream, with a dimension of 1 and a depth of 2 elements. It is interpreted as a two-element FIFO.
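A two-element FIFO can be pictured with the following software model. `BoundedFifo` is a hypothetical class written for this page. In hardware, a producer that finds the FIFO full simply stalls; the sketch reports that situation by returning false instead.

```cpp
#include <cassert>
#include <cstddef>
#include <queue>

// Hypothetical software model of a fixed-depth hardware FIFO.
template <typename T, int DEPTH>
class BoundedFifo {
 public:
  BoundedFifo() { assert(DEPTH > 0); }

  // Returns false when the FIFO is full (the producer would stall).
  bool try_write(const T& value) {
    if (fifo_.size() >= static_cast<std::size_t>(DEPTH)) return false;
    fifo_.push(value);
    return true;
  }

  // Returns false when the FIFO is empty (the consumer would stall).
  bool try_read(T& value) {
    if (fifo_.empty()) return false;
    value = fifo_.front();
    fifo_.pop();
    return true;
  }

 private:
  std::queue<T> fifo_;
};
```

With a depth of 2, a third write only succeeds after the consumer has read one element, which is why the producer and the consumer have to run concurrently.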
Setting up the external ports
The arguments of a function are interpreted as external I/O ports. For our basic accelerator, we require the following ports:
1. Image Input
2. Image Output
3. Width: represented as 16-bit unsigned integers
4. Height: represented as 16-bit unsigned integers
For the first two arguments (image input and output), we need to tell the tool that they are AXI streams (AXI is the protocol). To do so, we place a directive of type INTERFACE for each port right after the function declaration.
#include <ap_axi_sdata.h> /* Brings the ap_axiu type */
#include <hls_stream.h>   /* Brings the stream */

typedef ap_axiu<PIXEL_WIDTH, 1, 1, 1> package_t;
typedef hls::stream<package_t> stream_t;

void erosion_accel(stream_t& stream_in, stream_t& stream_out, dim_t height,
                   dim_t width) {
#pragma HLS INTERFACE axis register both port = stream_in
#pragma HLS INTERFACE axis register both port = stream_out
  /* ... */
}
For more information about this pragma, refer to [4].
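Besides the data, each AXI Stream word carries side-channel flags. The sketch below models only two of them with a hypothetical `axi_word_t` struct (a simplified stand-in, not the real ap_axiu type). It assumes the usual AXI4-Stream video convention in which `user` marks the first word of a frame and `last` marks the final word of each row.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Hypothetical, simplified stand-in for one AXI Stream word.
struct axi_word_t {
  uint64_t data;  // packed pixels
  bool user;      // assumed convention: start of frame
  bool last;      // assumed convention: end of line
};

// Builds the sequence of words for one frame, setting only the flags.
std::vector<axi_word_t> frame_words(int rows, int words_per_row) {
  assert(rows > 0 && words_per_row > 0);
  std::vector<axi_word_t> words;
  for (int r = 0; r < rows; r++) {
    for (int w = 0; w < words_per_row; w++) {
      axi_word_t word;
      word.data = 0;                           // payload omitted in this sketch
      word.user = (r == 0 && w == 0);          // first word of the frame
      word.last = (w == words_per_row - 1);    // last word of the row
      words.push_back(word);
    }
  }
  return words;
}
```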
Interfacing streams to matrices and vice versa
The stream variables don't have any image properties such as the number of channels, pixel type, or dimensions. In order to embed these properties, you need to "cast" the stream variable into an xf::Mat matrix. There are a couple of functions that allow these conversions in both directions.
/* Converting from stream (stream_in) to xf::Mat (imgInput) */
xf::AXIvideo2xfMat(stream_in, imgInput);
/* Converting from xf::Mat (imgOutput) to stream (stream_out) */
xf::xfMat2AXIvideo(imgOutput, stream_out);
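Since the stream only carries raw words, it may help to see how several pixels could share one word. The sketch below packs eight 8-bit pixels into a 64-bit value, which matches the 8 pixels per clock used on this page. The helper names are hypothetical, and placing pixel 0 in the least-significant byte is an assumption of the sketch rather than a documented guarantee of the library.

```cpp
#include <cassert>
#include <cstdint>

// Eight 8-bit pixels per 64-bit word (8 pixels per clock).
const int PIXELS_PER_WORD = 8;

// Packs 8 pixels into one word. Assumption: pixel 0 goes into the
// least-significant byte.
uint64_t pack_pixels(const uint8_t* px) {
  assert(px);
  uint64_t word = 0;
  for (int i = 0; i < PIXELS_PER_WORD; i++) {
    word |= static_cast<uint64_t>(px[i]) << (8 * i);
  }
  return word;
}

// Inverse operation: extracts the 8 pixels from one word.
void unpack_pixels(uint64_t word, uint8_t* px) {
  assert(px);
  for (int i = 0; i < PIXELS_PER_WORD; i++) {
    px[i] = static_cast<uint8_t>((word >> (8 * i)) & 0xFF);
  }
}
```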
Completing the accelerator
Now, with all the basics already explained, we can proceed with the whole accelerator (wrapper.cpp).
/* xfOpenCV kernels */
#include "imgproc/xf_erosion.hpp"
/* Project headers */
#include "wrapper.hpp"

/* 5x5 cross-shaped structuring element */
unsigned char EROSION_KERNEL[FILTER_WIDTH_EROSION * FILTER_WIDTH_EROSION] = {
    0, 0, 1, 0, 0,
    0, 0, 1, 0, 0,
    1, 1, 1, 1, 1,
    0, 0, 1, 0, 0,
    0, 0, 1, 0, 0};

void erosion_accel(stream_t& stream_in, stream_t& stream_out, dim_t height,
                   dim_t width) {
#pragma HLS INTERFACE axis register both port = stream_in
#pragma HLS INTERFACE axis register both port = stream_out

  /* Define xf::Mat vars - They are intended to be FIFOS - csim needs these as
   * statics */
  static xf::Mat<TYPE, HEIGHT_MAX, WIDTH_MAX, NPC1> imgOutput(height, width);
  static xf::Mat<TYPE, HEIGHT_MAX, WIDTH_MAX, NPC1> imgInput(height, width);

  /* Internal streams */
#pragma HLS stream variable = imgInput.data dim = 1 depth = 2
#pragma HLS stream variable = imgOutput.data dim = 1 depth = 2

  /*
    The basic flow is
    1. From AXI Stream to xf::Mat (xf::AXIvideo2xfMat)
    2. Kernel execution (xf::erode)
    3. From xf::Mat to AXI Stream (xf::xfMat2AXIvideo)
    The dataflow directive allows us to have better throughput by
    processing in a dataflow fashion
  */
#pragma HLS dataflow
  xf::AXIvideo2xfMat(stream_in, imgInput);

  xf::erode<XF_BORDER_REPLICATE, TYPE, HEIGHT_MAX, WIDTH_MAX, XF_SHAPE_CROSS,
            FILTER_WIDTH_EROSION, FILTER_WIDTH_EROSION, EROSION_NITER, NPC1>(
      imgInput, imgOutput, EROSION_KERNEL);

  xf::xfMat2AXIvideo(imgOutput, stream_out);
}
The declarations are present in the header file (wrapper.hpp):
#pragma once

#include <ap_int.h>     /* Brings the ap_uint type */
#include <hls_stream.h> /* Brings the stream */

#include "common/xf_common.h" /* Brings the macros for channel and data width*/
#include "common/xf_infra.h"
#include "common/xf_utility.h"

/* Define 8 pixel version */
#define NPC1 XF_NPPC8

/* Define image format */
#define TYPE XF_8UC1 /* GRAY8 */
const int WIDTH_MAX = 2048;
const int HEIGHT_MAX = 1536;

/* Define the AXI Stream type */
const int PPC = 8;
const int PIXEL_WIDTH = 8 * PPC; /* 8 pixels per clock*/
typedef ap_axiu<PIXEL_WIDTH, 1, 1, 1> package_t;
typedef hls::stream<package_t> stream_t;

/* Define dimension data type */
const int DIMENSION_WIDTH = 16;
typedef ap_uint<DIMENSION_WIDTH> dim_t;

/* Accelerator specific parameters */
const int FILTER_WIDTH_EROSION = 5;
const int EROSION_NITER = 1;
extern unsigned char
    EROSION_KERNEL[FILTER_WIDTH_EROSION * FILTER_WIDTH_EROSION];

/* Declare the top function - This function needs redundant inputs */
void erosion_accel(stream_t& stream_in, stream_t& stream_out, dim_t height,
                   dim_t width);
Adding a testbench
Testbenches are optional but strongly recommended to debug and test the accelerators before deploying them to the FPGA. Taking advantage of the software version presented above, we modify it to run the accelerator, compare the outputs, and compute the error matrix.
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* OpenCV - Xilinx custom version */
#include "opencv2/core/core.hpp"
#include "opencv2/highgui/highgui.hpp"
#include "opencv2/imgproc/imgproc.hpp"

/* Software tools for testbench - They must be in this order */
#include "common/xf_sw_utils.h"
#include "common/xf_axi.h"

/* Import the accelerator */
#include "wrapper.hpp"

using namespace std;

const int max_deviation = 10; /* 4% per pixel */
const float max_error = 5.0f; /* 5% per frame */

int main(int argc, char** argv) {
  if (argc != 2) {
    fprintf(stderr, "Invalid Number of Arguments!\nUsage:\n");
    fprintf(stderr, "%s <input image path> \n", argv[0]);
    return -1;
  }

  cv::Mat out_hw, ocv_ref;
  cv::Mat in_img, diff;

  /* Load input image */
  printf("Reading image...\n");
  in_img = cv::imread(argv[1], 0);
  if (in_img.data == NULL) {
    fprintf(stderr, "Cannot open image at %s\n", argv[1]);
    return -1;
  }

  uint16_t height = in_img.rows;
  uint16_t width = in_img.cols;

  /* Preparing kernels - OpenCV cross shape, matching EROSION_KERNEL */
  cv::Mat element = cv::getStructuringElement(
      cv::MORPH_CROSS, cv::Size(FILTER_WIDTH_EROSION, FILTER_WIDTH_EROSION),
      cv::Point(-1, -1));

  /* Get output */
  printf("Software processing...\n");
  ocv_ref.create(height, width, CV_8UC1);
  cv::erode(in_img, ocv_ref, element, cv::Point(-1, -1), EROSION_NITER,
            cv::BORDER_CONSTANT);

  /* Write output for reference */
  cv::imwrite("output_ocv.png", ocv_ref);
  printf("Software done...\n");

  /* Create the hardware streams */
  printf("Hardware processing...\n");
  out_hw.create(height, width, CV_8UC1);

  /* Generate the streams from the input image */
  stream_t src_hw, sink_hw;
  cvMat2AXIvideoxf<NPC1>(in_img, src_hw);

  /* Execute accelerator */
  erosion_accel(src_hw, sink_hw, height, width);

  /* Retrieve the output and put it into a cv::Mat */
  AXIvideo2cvMatxf<NPC1>(sink_hw, out_hw);

  /* Write the output image */
  cv::imwrite("output_hls.jpg", out_hw);
  printf("Hardware done...\n");

  /* Compare */
  cv::absdiff(ocv_ref, out_hw, diff);
  cv::imwrite("out_error.jpg", diff);

  double minval = 256, maxval = 0;
  int cnt = 0;
  for (int i = 0; i < height; i++) {
    for (int j = 0; j < width; j++) {
      uchar v = diff.at<uchar>(i, j);
      if (v > max_deviation) cnt++;
      if (minval > v) minval = v;
      if (maxval < v) maxval = v;
    }
  }

  float err_per = 100.0 * (float)cnt / (in_img.rows * in_img.cols);
  fprintf(stderr,
          " Minimum error in intensity = %f\n Maximum error in intensity = "
          "%f\n Percentage of pixels above error threshold = %f\n",
          minval, maxval, err_per);

  if (err_per > max_error) return 1;

  return 0;
}
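The pass/fail criterion of the testbench can be isolated in a few lines. The sketch below uses a hypothetical helper, `percent_above`, that works on plain vectors instead of cv::Mat. A pixel counts as wrong when its absolute difference exceeds the threshold (10 out of 255 levels is roughly 3.9%, hence the "4% per pixel" comment), and the frame fails when more than 5% of the pixels are wrong.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdlib>
#include <vector>

// Percentage of pixels whose absolute difference exceeds max_deviation.
// Hypothetical helper mirroring the comparison loop of the testbench.
double percent_above(const std::vector<unsigned char>& ref,
                     const std::vector<unsigned char>& test,
                     int max_deviation) {
  assert(ref.size() == test.size());
  assert(!ref.empty());
  std::size_t count = 0;
  for (std::size_t i = 0; i < ref.size(); i++) {
    const int diff =
        std::abs(static_cast<int>(ref[i]) - static_cast<int>(test[i]));
    if (diff > max_deviation) count++;
  }
  return 100.0 * static_cast<double>(count) / static_cast<double>(ref.size());
}
```

For instance, with a reference of {0, 100, 200, 255}, a result of {5, 100, 220, 255}, and a threshold of 10, only the third pixel exceeds the threshold, so the function returns 25.0 and the frame would fail the 5% limit.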
Writing the synthesis script
We have the source code so far, but we need to synthesize it by telling Vivado HLS which files to use and what they are. It is also important to load a sample image in order to debug with a real image (see the add_files -tb sampleimage.jpg command in the script shown below). Besides, the CXX flags must specify where the xfOpenCV library is located, which is done through the -I option in CXX_FLAGS. The following script (script.tcl) does the whole job for you.
open_project erosion_accel
set_top erosion_accel

# This includes the library of xfOpenCV, cloned into the same path but under the lib directory
set CXX_FLAGS "-D__XFCV_HLS_MODE__ --std=c++11 -I./lib/xfopencv/include"

# Indicate the source code
add_files wrapper.cpp -cflags "$CXX_FLAGS"

# Indicate the testbench resources
add_files -tb sampleimage.jpg
add_files -tb wrapper_tb.cc -cflags "$CXX_FLAGS"

open_solution "solution"
# In our case, we are setting our PicoEVB FPGA
set_part {xc7a50tcsg325-2} -tool vivado
create_clock -period 10 -name default

# C software simulation
csim_design -argv {sampleimage.jpg} -clean -compiler gcc
# Synthesis
csynth_design
# C/RTL co-simulation: this is the most realistic simulation
cosim_design -argv {sampleimage.jpg}

# Copy the result images to the current path
# The path follows <project>/<solution>/csim/build
set img_path "erosion_accel/solution/csim/build"
file copy $img_path/out_error.jpg $img_path/output_hls.jpg $img_path/output_ocv.png ./
exit
To execute the script, we need to run:
vivado_hls -f script.tcl
Results
After running the code through the script, three files are generated:
1. output_ocv.png: The image generated by software (reference)
2. output_hls.jpg: The image generated by hardware (experimental)
3. out_error.jpg: The difference between the experimental result and the reference.
For our case, running on Vivado 2018.2, the results were:
- Figure 1. Output from OpenCV
- Figure 2. Output from FPGA
- Figure 3. Absolute Error
Known issues
1. Numerical representation: since xfOpenCV uses fixed-point numbers instead of floating-point, the final results may differ from those of OpenCV.
2. New release incompatibilities: This example is based on Vivado 2018.2. The latest xfOpenCV release is based on Vivado 2019.1 and is not backwards compatible. In order to use the kernels from the library, please use the release for 2018.x.
3. Timing issues in some kernels: some kernels need adjustments in order to meet timing constraints and be optimal.
4. Matrices represented as streams: kernels that access data non-sequentially will synthesize (with warnings), but they will lead to co-simulation deadlocks and hangs.
See also
1. Xilinx OpenCV. Available at https://github.com/Xilinx/xfopencv
2. Xilinx OpenCV User Guide. Available at https://www.xilinx.com/support/documentation/sw_manuals/xilinx2018_3/ug1233-xilinx-opencv-user-guide.pdf
3. Vivado HLS User Guide 2018.3. Available at https://www.xilinx.com/support/documentation/sw_manuals/xilinx2018_3/ug902-vivado-high-level-synthesis.pdf
4. Pragma HLS interface. Available at https://www.xilinx.com/html_docs/xilinx2017_4/sdaccel_doc/jit1504034365862.html