Frameworks for AI on FPGA









Introduction

Following up on AI acceleration on FPGAs, several frameworks allow accelerating inference of a model or graph. In particular, it is possible to accelerate small networks on FPGAs using network mapping and co-processing techniques. In this wiki, we introduce some of these tools.

Popular Architectures

There are two popular techniques used to implement networks on FPGAs: 1) network mapping and 2) co-processing. Both have advantages and disadvantages that we cover throughout this page.

Network Mapping

Network mapping translates an entire network into a hardware implementation based on the layers and/or the arithmetic operations performed by the model.

In the picture below, we illustrate how the hardware is implemented on the FPGA in such a way that the network behaves like a circuit rather than a group of macro-operations. In this case, we illustrate a multi-layer perceptron network, where each perceptron can be seen as a column of multipliers followed by a tree of adders that feeds an activation function.

The network mapping technique is powerful in terms of latency, even without batching inputs. However, the implementation footprint is significant, and the technique becomes challenging when large and complex models with tens of layers are used, depending on the target FPGA and its available resources.
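As a purely conceptual illustration (not code from any of the frameworks below), the following Python sketch expresses one perceptron with the same structures that network mapping turns into hardware: a column of multipliers, an adder tree, and an activation function.

```python
# Conceptual sketch only: one perceptron written as the structures that
# network mapping turns into hardware.

def adder_tree(values):
    """Reduce a list of products pairwise, mirroring a hardware adder tree."""
    while len(values) > 1:
        reduced = [a + b for a, b in zip(values[0::2], values[1::2])]
        if len(values) % 2:            # carry the odd element to the next level
            reduced.append(values[-1])
        values = reduced
    return values[0]

def perceptron(inputs, weights, bias):
    products = [x * w for x, w in zip(inputs, weights)]   # column of multipliers
    summed = adder_tree(products) + bias                  # adder tree
    return max(0.0, summed)                               # ReLU activation

print(perceptron([1.0, 2.0, 3.0, 4.0], [0.5, -0.25, 0.1, 0.2], 0.05))
```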

The most common frameworks that use this technique are:

  • HLS4ML: An Open-Source Framework from CERN for models based on TensorFlow
  • FINN-R: An Open-Source Framework from AMD for models based on PyTorch

We are going to explore the differences in the following sections.

Co-Processing

Co-processing is a lightweight technique that accelerates models by focusing on the operations rather than mapping the entire network onto the FPGA. For instance, it is possible to accelerate operations such as matrix-matrix multiplications, matrix-vector multiplications, convolutions, and matrix binary/unary element-wise operations. Other approaches focus on developing execution units that accelerate operations using SIMD or vector instructions.

We can classify the co-processing hardware into the following categories:

L1

Accelerates the operations through vector instructions. The resulting co-processor is similar to a vector unit like AVX, AMX or a DSP, with fused multiply-add (FMA) or multiply-accumulate (MAC) operations on vectors.
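As a rough software analogy (the lane count and data types are arbitrary assumptions), the sketch below shows the kind of operation such a unit performs: a multiply-accumulate across fixed-width vector lanes.

```python
# Minimal sketch of an L1-style operation: a fused multiply-add over vector lanes.
import numpy as np

LANES = 8  # assumed vector width of the hypothetical unit

def vector_mac(acc, a, b):
    """acc[i] += a[i] * b[i] for one vector register: one FMA per lane."""
    return acc + a * b

a = np.arange(LANES, dtype=np.float32)
b = np.full(LANES, 0.5, dtype=np.float32)
acc = np.zeros(LANES, dtype=np.float32)
print(vector_mac(acc, a, b))   # [0.  0.5 1.  1.5 2.  2.5 3.  3.5]
```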

L2

Accelerates matrix operations such as matrix-matrix multiplications and matrix element-wise operations. For instance, with these units, it is possible to accelerate:

  • Matrix Multiplication: dense, convolution, attention layers
  • Matrix Element-Wise Add/Multiplication: batch normalisation, concatenation, addition
  • Matrix Element-Wise Unary/Mapping: activations, scaling

It also includes per-layer acceleration, where each layer is allocated and executed by an accelerator specialised in one of the aforementioned operations.
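To illustrate why a single matrix unit covers several layer types, the hedged sketch below expresses a dense layer as one matrix multiplication and lowers a convolution to a matrix product via im2col; shapes and kernel values are arbitrary examples.

```python
# Sketch: dense and convolution layers both reduce to matrix multiplications.
import numpy as np

def dense(x, w, b):
    return x @ w + b                                      # one matrix multiply

def conv2d_as_matmul(image, kernel):
    """'Valid' convolution on a single-channel image via im2col + matmul."""
    kh, kw = kernel.shape
    oh, ow = image.shape[0] - kh + 1, image.shape[1] - kw + 1
    patches = np.stack([image[i:i + kh, j:j + kw].ravel()
                        for i in range(oh) for j in range(ow)])
    return (patches @ kernel.ravel()).reshape(oh, ow)     # matmul again

image = np.arange(25, dtype=np.float32).reshape(5, 5)
kernel = np.full((3, 3), 1.0 / 9.0, dtype=np.float32)     # box filter
print(conv2d_as_matmul(image, kernel))
```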

L3

These are more advanced and complex than the previous architectures. They usually schedule work across operation accelerators focused on deep learning tasks, such as matrix multiplication, convolution, activations, and attention. There are commercial alternatives, such as the MathWorks Deep Learning HDL Toolbox, the AMD DPU, and the AMD AIE.

Usually, they show lower performance in terms of latency, but they are more flexible and can run larger models by reusing the hardware across operations.
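Conceptually, such an accelerator behaves like the hypothetical dispatch loop sketched below: a scheduler walks the model layer by layer and sends each operation to a specialised engine, reusing the same hardware for every layer, which is why larger models fit at the cost of latency.

```python
# Conceptual sketch of an L3-style co-processor: a scheduler dispatching
# each layer to a specialised engine, time-multiplexing the hardware.
import numpy as np

def matmul_engine(x, p):
    return x @ p["w"] + p["b"]          # dense / matrix-multiplication unit

def activation_engine(x, _):
    return np.maximum(x, 0.0)           # activation (ReLU) unit

ENGINES = {"dense": matmul_engine, "relu": activation_engine}

def run_model(graph, x):
    for op, params in graph:            # static schedule, one operation at a time
        x = ENGINES[op](x, params)
    return x

graph = [
    ("dense", {"w": np.random.rand(4, 8).astype(np.float32),
               "b": np.zeros(8, dtype=np.float32)}),
    ("relu",  {}),
    ("dense", {"w": np.random.rand(8, 2).astype(np.float32),
               "b": np.zeros(2, dtype=np.float32)}),
]
print(run_model(graph, np.ones(4, dtype=np.float32)))
```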




The following picture illustrates the differences between the three levels at the architectural level.

On the left, we have the Level 1 acceleration, built on top of vector units such as AVX-512 VNNI, DSP-like processing, and other extensions. Level 2 (centre) involves matrix operations. Level 3 (right) is a more complex structure with one or several cores specialised in deep learning operations, like matrix multiplication, convolution, matrix operations, and activations.

Frameworks

There are several existing solutions that implement one or more of the aforementioned architectures. In this section, we mention some of them, highlighting their characteristics.

HLS4ML

HLS4ML is an Open-Source framework that receives a TensorFlow model and converts it to hardware using High-Level Synthesis. This creates a single accelerator based on Network Mapping, built layer by layer using a dataflow approach. It mainly targets AMD FPGAs through Vitis, although there is some support for Intel FPGAs.

It is licensed under Apache 2.0.
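A typical flow looks roughly like the sketch below, which converts a small Keras model into an HLS project. Exact option names (backend, part, granularity) depend on the hls4ml version and target board, so treat them as placeholders.

```python
# Hedged sketch of an hls4ml conversion flow for a small Keras model.
import hls4ml
from tensorflow import keras

model = keras.Sequential([
    keras.Input(shape=(10,)),
    keras.layers.Dense(16, activation="relu"),
    keras.layers.Dense(5, activation="softmax"),
])

# Derive a per-layer HLS configuration (precision, reuse factor) from the model.
config = hls4ml.utils.config_from_keras_model(model, granularity="name")

hls_model = hls4ml.converters.convert_from_keras_model(
    model,
    hls_config=config,
    output_dir="hls4ml_prj",        # generated HLS project directory
    backend="Vitis",                # AMD backend; an Intel backend also exists
    part="xcu250-figd2104-2L-e",    # example AMD part, adjust to your board
)

hls_model.compile()                 # C simulation of the generated accelerator
# hls_model.build(synth=True)       # run HLS synthesis (requires Vitis HLS)
```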

FINN

AMD FINN is an AMD framework that uses Network Mapping, mapping each layer to its own accelerator instance interconnected through AXI4-Stream. It receives a PyTorch model that can be quantised using Brevitas (a PyTorch library for quantisation-aware training), which is later translated into a high-level synthesis accelerator. It mainly targets AMD FPGAs.

It is an Open Source project licensed under BSD 3-clause.
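The usual entry point is a Brevitas-quantised PyTorch model exported to the (Q)ONNX representation that the FINN compiler consumes. The sketch below is a minimal, hedged example; the export helper name may differ between Brevitas releases.

```python
# Hedged sketch: a tiny Brevitas-quantised MLP exported for the FINN compiler.
import torch
import torch.nn as nn
from brevitas.nn import QuantIdentity, QuantLinear, QuantReLU
from brevitas.export import export_qonnx   # name varies across Brevitas versions

class TinyQuantMLP(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant_in = QuantIdentity(bit_width=8)   # quantise the input activations
        self.fc1 = QuantLinear(10, 16, bias=True, weight_bit_width=4)
        self.act1 = QuantReLU(bit_width=4)
        self.fc2 = QuantLinear(16, 5, bias=True, weight_bit_width=4)

    def forward(self, x):
        return self.fc2(self.act1(self.fc1(self.quant_in(x))))

model = TinyQuantMLP().eval()
export_qonnx(model, torch.randn(1, 10), export_path="tiny_quant_mlp.onnx")
```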

MathWorks Deep Learning HDL Toolbox

It is a set of tools for deep learning acceleration using MATLAB and Simulink. It converts a model into an HDL accelerator that can be used on FPGAs. Since it compiles to HDL, it can target various FPGAs, including AMD, Lattice, Intel, and Microchip devices.

It is a closed-source and commercial solution.

AMD DPU

The AMD Deep Learning Processing Units (DPUs) are soft accelerators for AMD FPGAs. They comprise a Co-Processing Architecture with microcode, scheduling, and on-chip memory. They require Vitis AI, which is in charge of quantising and optimising the network for its later execution on the DPU.

The AMD DPU is a closed-source solution from AMD and is subject to commercial licensing and restrictions.
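As a hedged illustration of the runtime side, the sketch below runs inference through the Vitis AI runtime (VART) on a model that has already been quantised and compiled into an .xmodel; the file name, tensor data types, and input contents are placeholders.

```python
# Hedged sketch of DPU inference with the Vitis AI runtime (VART).
import numpy as np
import vart
import xir

graph = xir.Graph.deserialize("model.xmodel")          # placeholder file name
# The DPU-executable part of the model is a child subgraph tagged "DPU".
subgraphs = graph.get_root_subgraph().toposort_child_subgraph()
dpu_subgraph = [s for s in subgraphs
                if s.has_attr("device") and s.get_attr("device").upper() == "DPU"][0]

runner = vart.Runner.create_runner(dpu_subgraph, "run")
in_tensor = runner.get_input_tensors()[0]
out_tensor = runner.get_output_tensors()[0]

input_data = np.zeros(tuple(in_tensor.dims), dtype=np.int8)    # dummy frame
output_data = np.zeros(tuple(out_tensor.dims), dtype=np.int8)

job_id = runner.execute_async([input_data], [output_data])
runner.wait(job_id)
print(output_data.shape)
```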

AMD AIE

The AMD Artificial Intelligence Engines (AIE) are accelerators based on L2 Co-Processing Architectures. They are included in Versal devices for advanced AI and DSP acceleration.

It is a closed-source solution and is subject to commercial licensing and restrictions.

Microchip

Microchip is working on a solution that maps TensorFlow Lite models onto Matrix Accelerators based on L2 Co-Processing Architectures.

We don't have information about its licensing model. Stay tuned for updates!
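As a generic reference (standard TensorFlow Lite tooling, not Microchip-specific), the sketch below produces a fully int8 TFLite model, the kind of artefact such matrix accelerators typically consume.

```python
# Generic sketch: post-training int8 quantisation with the TFLite converter.
import numpy as np
import tensorflow as tf

model = tf.keras.Sequential([
    tf.keras.Input(shape=(10,)),
    tf.keras.layers.Dense(16, activation="relu"),
    tf.keras.layers.Dense(5, activation="softmax"),
])

def representative_data():
    for _ in range(100):                       # calibration samples (random here)
        yield [np.random.rand(1, 10).astype(np.float32)]

converter = tf.lite.TFLiteConverter.from_keras_model(model)
converter.optimizations = [tf.lite.Optimize.DEFAULT]
converter.representative_dataset = representative_data
converter.target_spec.supported_ops = [tf.lite.OpsSet.TFLITE_BUILTINS_INT8]
converter.inference_input_type = tf.int8
converter.inference_output_type = tf.int8

with open("model_int8.tflite", "wb") as f:
    f.write(converter.convert())
```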

Lattice sensAI

The Lattice sensAI comprises an ecosystem with services, reference designs, model support, and IP cores for acceleration. Overall, the architectures described by sensAI are based on Co-Processing Architectures.

For more information: Lattice sensAI

Intel FPGA AI Suite

The Intel FPGA AI Suite is a framework that combines OpenVINO (the Intel framework for deep learning acceleration) and FPGAs. It uses a compiler to convert the model into reconfigurable logic that is synthesisable on the FPGA. The architecture is based on a Co-Processing Architecture.

It is a commercial solution.
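On the software side the flow is OpenVINO-based. The hedged sketch below shows standard OpenVINO inference with the CPU device as a stand-in, since the actual FPGA plugin/device name depends on the FPGA AI Suite installation.

```python
# Hedged sketch of OpenVINO inference; the FPGA device name is installation-specific.
import numpy as np
import openvino as ov

core = ov.Core()
model = core.read_model("model.xml")           # OpenVINO IR produced upstream
compiled = core.compile_model(model, "CPU")    # swap in the FPGA AI Suite device

input_port = compiled.input(0)
dummy = np.zeros(tuple(input_port.shape), dtype=np.float32)
result = compiled({input_port: dummy})[compiled.output(0)]
print(result.shape)
```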

RidgeRun Services

RidgeRun has expertise in offloading processing algorithms using FPGAs, from Image Signal Processing to AI offloading. Our services include:

  • Algorithm Acceleration using FPGAs.
  • Image Signal Processing IP Cores.
  • Linux Device Drivers.
  • Low Power AI Acceleration using FPGAs.
  • Accelerated C++ Applications.

And much more. Contact us at https://www.ridgerun.com/contact.


