Introduction to AI for FPGAs
The FPGA Minutes to Become an Expert documentation from RidgeRun is presently being developed.
Introduction
Artificial Intelligence (AI) is a hot topic nowadays thanks to the versatility and generalisation capabilities of modern techniques such as Deep Learning, Transformers and Large Language Models. FPGAs are not excluded from this world and offer interesting advantages over traditional computing hardware, i.e., CPUs and GPUs.
FPGAs are useful for the inference stage, where a trained model performs predictions such as classification, detection, anomaly detection, and other tasks. Model training, by contrast, is rarely a strength of FPGAs, given that it relies heavily on floating-point operations.
This section compares traditional architectures and FPGAs for the AI inference process, diving into execution details and showing the advantages of one architecture over the other, painting an overall landscape to support further decisions. We kindly invite you to review our other wiki pages related to hardware acceleration on CPU, GPU and FPGA.
An Overview of Deep Learning
Deep Learning techniques are widely popular in modern AI due to their ability to abstract complex functions and learn hidden patterns (aka features). Among the most popular techniques are:
- Multi-Layer Perceptron Neural Networks
- Recurrent Neural Networks
- Convolutional Neural Networks
- Transformer-Based Networks
- Hybrid Architectures
The aforementioned techniques share the possibility of having multiple layers separated by a non-linearity, often called an activation function, which prevents the mathematical simplification (or collapse) of consecutive linear operations and enhances the learning process.
Mathematical Operators
All these techniques share a common set of mathematical operations, particularly the vector-matrix and matrix-matrix multiplications. The Perceptron Model can be defined mathematically as a vector-vector dot product:

$y = f(\mathbf{w} \cdot \mathbf{x})$

where $\mathbf{x}$ is the vector of inputs, $\mathbf{w}$ is the vector of weights, $f$ is the activation function and $y$ is the output of the perceptron.
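As a minimal sketch, the perceptron dot product above can be written in a few lines of NumPy (the input values, weights and ReLU activation are illustrative choices, not from the original):

```python
import numpy as np

def relu(z):
    # Example activation function f
    return np.maximum(z, 0.0)

x = np.array([1.0, -2.0, 0.5])   # input vector
w = np.array([0.4, 0.3, -0.2])   # weight vector

# y = f(w . x): a single vector-vector dot product
y = relu(np.dot(w, x))
```

Any other activation function can be substituted for `relu` without changing the dot-product structure.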
When having multiple perceptrons in a single layer, the weights become a matrix $W$, where each column encapsulates the weights of one perceptron:

$\mathbf{y} = f(\mathbf{x} W)$

where the output $\mathbf{y}$ becomes a vector, and the function $f$ processes each element of the resulting vector element-wise.
The framework is still extensible. The input vector can be extended to a matrix, encapsulating multiple input vectors as rows of an input matrix $X$, allowing parallelism within the layer computation to process more than one sample at a time, such that:

$Y = f(X W)$

becoming a general matrix-matrix multiplication.
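The progression from one sample to a batch can be sketched in NumPy (the layer sizes and the tanh activation are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)

n_inputs, n_perceptrons, n_samples = 4, 3, 8  # illustrative sizes

W = rng.standard_normal((n_inputs, n_perceptrons))  # one column per perceptron
x = rng.standard_normal(n_inputs)                   # a single input sample
X = rng.standard_normal((n_samples, n_inputs))      # a batch: one sample per row

f = np.tanh  # element-wise activation

y = f(x @ W)   # vector-matrix product: one sample  -> shape (n_perceptrons,)
Y = f(X @ W)   # matrix-matrix product: whole batch -> shape (n_samples, n_perceptrons)
```

The batched form performs the same arithmetic per sample; it only groups the vector-matrix products into one larger matrix-matrix product.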
Another operation is the convolution, widely used for deep learning on images and 2D data, which can be expressed mathematically as:

$Y_{i,j} = \sum_{m=0}^{k-1} \sum_{n=0}^{k-1} X_{i+m,\,j+n} \, K_{m,n}$

where $(i, j)$ is the position of the output pixel (in an image), $k$ is the kernel size (i.e. 3, for a 3x3 kernel matrix), $K$ is the kernel matrix, and $Y$ is the output, often called the feature map.
There are different variants of convolution, such as the depth-wise convolution, which convolves each input channel independently with its own kernel matrix, and the point-wise convolution, which combines all the channels together using a 1x1 kernel. Despite the differences, it is possible to implement the convolutions either as a specialised unit based on the mathematical expression above, or as a matrix-matrix multiplication by flattening each input patch into a row of an input matrix while the kernel matrix is flattened to match the patch size.
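Both implementation routes can be sketched in NumPy: a direct unit following the summation formula, and the same convolution lowered to a matrix product over flattened patches (a common lowering often called im2col; the function names and the toy input are illustrative):

```python
import numpy as np

def conv2d_direct(X, K):
    # Direct convolution from the summation formula (no padding, stride 1)
    k = K.shape[0]
    H, W = X.shape
    Y = np.zeros((H - k + 1, W - k + 1))
    for i in range(Y.shape[0]):
        for j in range(Y.shape[1]):
            Y[i, j] = np.sum(X[i:i + k, j:j + k] * K)
    return Y

def conv2d_im2col(X, K):
    # Same convolution as a matrix product: each output pixel
    # corresponds to one flattened k*k input patch
    k = K.shape[0]
    H, W = X.shape
    patches = np.array([X[i:i + k, j:j + k].ravel()
                        for i in range(H - k + 1)
                        for j in range(W - k + 1)])
    return (patches @ K.ravel()).reshape(H - k + 1, W - k + 1)

X = np.arange(25, dtype=float).reshape(5, 5)  # toy 5x5 input image
K = np.ones((3, 3)) / 9.0                     # 3x3 averaging kernel
```

Both functions produce the same feature map; the im2col form trades extra memory for reuse of a highly optimised matrix-multiplication unit.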
The transformer architecture also makes use of the matrix-matrix multiplication for the attention mechanisms, where the attention operator can be defined as:

$\text{Attention}(Q, K, V) = \text{softmax}\!\left(\frac{Q K^T}{\sqrt{d_k}}\right) V$

where $Q$, $K$ and $V$ are the query, key and value matrices, respectively, and $d_k$ is the dimension of the keys.
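A minimal NumPy sketch of the attention operator, assuming illustrative matrix sizes:

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable softmax along the given axis
    e = np.exp(z - z.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # softmax(Q K^T / sqrt(d_k)) V -- dominated by two matrix-matrix products
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)
    return softmax(scores) @ V

rng = np.random.default_rng(0)
seq_len, d_k = 4, 8  # illustrative sequence length and key dimension
Q = rng.standard_normal((seq_len, d_k))
K = rng.standard_normal((seq_len, d_k))
V = rng.standard_normal((seq_len, d_k))
out = attention(Q, K, V)
```

The matrix products `Q @ K.T` and `scores @ V` are where accelerators spend most of their time, which is why attention maps so naturally onto matrix-multiplication hardware.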
There are other operators, such as element-wise operators, i.e. multiplication and addition, that perform per-element matrix operations, useful for adding biases or performing an optimised batch normalisation, which can be decomposed as:

$y = \frac{x - \mu}{\sigma} = \frac{1}{\sigma}\,x - \frac{\mu}{\sigma}$

where $\sigma$ and $\mu$ are the variance and mean normalisers, reducing the operation to one element-wise multiplication and one element-wise addition. Within the family of element-wise operators, some unary operators perform the activations, applying functions like $\tanh$, sigmoid, ReLU, GeLU, SiLU, and others.
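As a sketch of how the normalisation decomposes into element-wise operators at inference time (the statistics below are assumed example values):

```python
import numpy as np

x = np.array([0.5, -1.0, 2.0, 0.0])

# Illustrative normalisation statistics (assumed values)
mu, var, eps = 0.1, 0.9, 1e-5  # running mean, running variance, stabiliser

# The normalisation collapses into one element-wise multiply
# and one element-wise add: y = a * x + b
a = 1.0 / np.sqrt(var + eps)
b = -mu * a

y_folded = a * x + b
y_direct = (x - mu) / np.sqrt(var + eps)
```

Since `a` and `b` are constant at inference time, they can be precomputed, leaving only the two cheap element-wise operators on the datapath.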
Operator Implementation
The implementation of the operators is guided by the level of parallelism required for good performance. Platforms focused on general massive parallelism, such as GPUs, often face underutilisation issues when workloads are small. In general, there is a trade-off between raw performance (throughput) and execution latency.
Raw performance focuses on getting the maximum throughput out of the platform. In the GPU case, the ideal workload takes as many resources as possible, processing multiple samples simultaneously. For instance, for a single perceptron layer, a proper GPU implementation will try to process multiple samples at a time, extending the vector-matrix multiplication to a matrix-matrix multiplication and increasing the throughput. However, the cost of processing multiple samples is latency: the samples must be grouped into a batch to be effectively processed by the GPU, adding extra latency to the first sample.
Execution latency focuses on getting a single sample processed as fast as possible. In the GPU case, this may lead to an underutilisation of the platform, and it is highly influenced by the communication between the host and the GPU accelerator. FPGAs and ASICs are well-suited for execution latency, given that neural networks can be modelled as a cascade of arithmetic operators separated by pipelines.
Therefore, depending on the platform, the mathematical operators can be expressed in a way that optimises the platform's efficiency for acceleration.
Architectures for Running AI
Four popular architectures are used for AI acceleration: CPUs, GPUs, ASICs and FPGAs. By default, most frameworks can use CPUs to run AI. Still, they are often replaced by other accelerators, since their architecture is not well-suited to batched execution in a world driven by raw performance. GPUs, in contrast, have emerged as the most popular solution for accelerating AI and Deep Learning. They are well-suited for batched execution given their massive parallelism based on SIMT (Single Instruction, Multiple Threads).
Meanwhile, specialised hardware like ASICs and FPGAs is well-suited for AI because it implements exactly the operations needed for the AI tasks, making better use of the available hardware and improving both performance and energy efficiency. There are some units, such as the Neural Processing Unit (NPU), Tensor Processing Unit (TPU) and Deep Learning Accelerator (DLA), which are specialised but behave similarly to a GPU at the architecture level, with less flexibility.
FPGAs are becoming an interesting AI solution focused on execution latency. Given their re-programmability and their ability to connect directly to peripherals like cameras and radio transceivers, they can maintain low latency at high inference rates.
Choosing the Proper Hardware
Choosing the hardware to accelerate AI inference depends on the use case and the performance goals. The following list characterises each piece of hardware using these criteria:
- Performance Orientation: raw performance, execution latency
- Deployment Type: production, prototyping
- Applications: low-latency or relaxed
Depending on the architecture, FPGAs can exhibit multiple characteristics. For this section, we will assume that the model is entirely mapped into hardware.
CPU
CPUs can run most state-of-the-art models, given their versatility. Here are some characteristics:
- Flexible in terms of performance orientation: if execution latency is needed, they are sometimes better than a GPU.
- Production deployment is not their focus.
- They are useful for relaxed applications and prototyping.
- Support for high-resolution models that use floating-point numbers.
CPUs are recommended for prototyping and relaxed applications that do not require high performance.
GPU
GPUs are the most popular devices for AI acceleration, and they support various operators through frameworks such as PyTorch and TensorFlow. Here are some characteristics:
- Oriented to raw performance: the optimal way to execute a model is by using sample batches.
- They can be used for prototyping and production applications: NVIDIA Jetson is a clear example of a production-ready product.
- The best application use cases are focused on batched execution and relaxed latency.
- Suitable for many applications except for low-latency.
- Support for high-resolution models that use floating-point numbers. Some GPUs now integrate quantised types like int4, float8, bfloat16 and float16.
GPUs are recommended for most use cases where low latency or hard real-time response is not a constraint. These include audio processing, computer vision, large language models, and others.
FPGAs
FPGAs are rarely taken as an option for AI inference unless the environment requires hard real-time and ultra-low latency. One of the key advantages is the possibility of integrating the AI implementation along with the signal pre-processing. For instance, anomaly detection can be integrated with RF transceivers, Smart NIC filtering, and basic computer vision pipelines. Here are some characteristics:
- They are oriented towards ultra-low latency and energy efficiency.
- Intended for production.
- The application often focuses on hard real-time, determinism and low-latency, such as RF, computer vision, and others.
- Focused on quantised and optimised networks.
FPGAs are recommended when determinism and low latency are required. Moreover, they are recommended when the AI inference must sit next to the pre-processing stages. Some examples:
- Video systems with ISP included in the FPGA.
- RF systems with packet filtering.
- Smart NICs.
Other applications, like algorithm acceleration and efficient computing, will be studied in the following sections.