Quantisation for AI on FPGA

From RidgeRun Developer Wiki








Introduction

Quantisation is a popular technique that shrinks the storage size of deep learning models and speeds up computation on certain hardware architectures. On traditional architectures such as GPUs and CPUs, small data types like half-precision floating-point can deliver more than a two-times speed-up in execution time thanks to SIMD instructions, with additional advantages such as reduced data communication time and lower storage requirements.

However, traditional architectures cannot exploit a quantisation that is just right for a given application. For example, a model may retain good quality of results (accuracy, recall, etc.) at 10-bit precision, but these architectures offer only 8-bit or 16-bit data types, with nothing in between.

FPGAs, in contrast, allow the synthesis of accelerators that match the precision of the model: an FPGA can implement tensor accelerators with, say, 10-bit floating-point or fixed-point arithmetic. This brings some advantages:

  1. Optimal data communication
  2. Optimal hardware resource consumption

Apart from that, FPGAs come with the inherent advantage of energy efficiency.

This wiki will dig into some considerations before quantising a model for FPGAs.

Quantisation Techniques

We can categorise quantisation techniques according to scaling and granularity. Scaling refers to the mapping used to transform a value from one representation into another. The usual mappings are the following:

  • Linear: the numbers are converted by multiplying by a scale factor. This is the case when moving from floating-point to 8-bit integer, where each number is normalised and multiplied by the integer range (127 for a symmetric signed 8-bit representation), as shown in the sketch after this list.
  • Logarithmic: rather than a multiplicative scale, the logarithm function maps the floating-point value to a reduced-precision number, such as 8-bit float or 8-bit integer.
  • Ternary: a representation defined by two thresholds: values below the lower threshold map to the negative value, values above the higher threshold map to the positive value, and values in between map to 0.
  • Binary: similar to ternary, but with a single threshold, so every value maps to either the negative or the positive value.
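
As a concrete reference, here is a minimal sketch of the linear and ternary mappings using NumPy; the function names, the per-tensor scale and the threshold value are illustrative assumptions, not tied to any particular framework.

```python
import numpy as np

def linear_quantise_int8(x):
    """Symmetric linear quantisation of a float array to int8 with a per-tensor scale."""
    scale = np.max(np.abs(x)) / 127.0              # largest magnitude maps to the int8 edge
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def ternary_quantise(x, threshold=0.5):
    """Ternary mapping: -1 below -threshold, +1 above +threshold, 0 in between."""
    return np.where(x > threshold, 1, np.where(x < -threshold, -1, 0)).astype(np.int8)

w = np.random.randn(4, 4).astype(np.float32)
q8, scale = linear_quantise_int8(w)
print(np.max(np.abs(w - q8.astype(np.float32) * scale)))   # worst-case quantisation error
print(ternary_quantise(w))
```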

Quantisation usually targets a lower-precision numerical representation; the usual targets are listed below (a short sketch after the list contrasts the two 16-bit floating-point formats):

  • 4-bit integer
  • 8-bit integer
  • 16-bit integer
  • 16-bit floating-point
  • 16-bit brain floating-point (bfloat)
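
For reference, the short sketch below contrasts the two 16-bit floating-point targets from the list; the PyTorch casts are used purely for illustration.

```python
import torch

x = torch.randn(3)
print(x.to(torch.float16))    # IEEE half precision: 5 exponent bits, 10 mantissa bits
print(x.to(torch.bfloat16))   # bfloat16: 8 exponent bits, 7 mantissa bits (float32 range)
```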

Granularity, on the other hand, describes which parts of the model share the same quantisation. The simplest choice is a single uniform quantisation, but heterogeneous and non-uniform configurations are also possible. The most popular configurations are the following (a layerwise sketch follows below):

  • All uniform: the entire model is quantised using the same numerical representation.
  • Weights and Activations: the model has a different quantisation for weights and activations.
  • Layerwise: each layer uses a different quantisation.
  • All heterogeneous: weights, activations and all layers have different quantisations.

The first two are the most popular since they are widely compatible with CPUs and GPUs.
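
To make the layerwise case concrete, here is a minimal sketch that assigns a different bit width to each layer; the layer names, bit widths and the symmetric quantiser are hypothetical choices for illustration.

```python
import numpy as np

def linear_quantise(x, bits):
    """Symmetric linear quantisation to a signed `bits`-bit integer grid."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    return np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int32), scale

# Hypothetical layerwise configuration: each layer gets its own precision.
layer_bits = {"conv1": 8, "conv2": 6, "fc": 4}
weights = {name: np.random.randn(16, 16).astype(np.float32) for name in layer_bits}
quantised = {name: linear_quantise(w, layer_bits[name]) for name, w in weights.items()}
```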

Quantisation Opportunities in FPGAs

When moving to FPGA implementations, the fabric is flexible enough to support arbitrary numerical precisions, not restricted to 4, 8 or 16-bit representations. It also makes it possible to use custom-precision fixed-point, floating-point and posit arithmetic. Beyond that, each layer can use a different precision, depending on how critical and relevant it is to the final results.
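
As an illustration of a custom precision that no CPU or GPU data type offers, the sketch below emulates a signed 10-bit fixed-point format; the split of the 10 bits into 6 fractional bits is an arbitrary assumption.

```python
import numpy as np

def to_fixed_point(x, total_bits=10, frac_bits=6):
    """Emulate a signed fixed-point grid: `total_bits` bits, `frac_bits` of them fractional."""
    qmin = -(2 ** (total_bits - 1))
    qmax = 2 ** (total_bits - 1) - 1
    q = np.clip(np.round(x * (1 << frac_bits)), qmin, qmax)
    return q / (1 << frac_bits)      # the value a 10-bit accelerator would actually compute with

x = np.linspace(-4, 4, 9)
print(to_fixed_point(x))             # values snapped to the 10-bit fixed-point grid
```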

Frameworks like Brevitas and QKeras support quantisation to arbitrary precisions, along with refinement techniques such as post-training quantisation (PTQ) and zero-shot quantisation (ZSQ). Brevitas, in particular, enables FPGA deployment through FINN. However, depending on the degree of customisation, such as the introduction of alternative numerical representations like arbitrary-precision floating-point or posit, the work becomes more challenging because of limited tool support. Despite that, FPGAs have much to offer in this field, allowing compressed models to execute without the on-the-fly dequantisation and requantisation that many traditional architectures perform.
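
As a rough illustration of how such a flow starts, the sketch below defines a small layerwise-quantised network with Brevitas quantised layers; the bit widths and layer shapes are arbitrary, and the exact argument names may differ between Brevitas versions.

```python
import torch
import torch.nn as nn
from brevitas.nn import QuantConv2d, QuantReLU, QuantLinear

class SmallQuantNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Illustrative bit widths: 6-bit convolution and activation, 4-bit classifier.
        self.conv = QuantConv2d(1, 8, kernel_size=3, weight_bit_width=6)
        self.relu = QuantReLU(bit_width=6)
        self.fc = QuantLinear(8 * 26 * 26, 10, bias=True, weight_bit_width=4)

    def forward(self, x):
        x = self.relu(self.conv(x))
        return self.fc(x.flatten(1))

model = SmallQuantNet()
out = model(torch.randn(1, 1, 28, 28))   # a 28x28 single-channel input, e.g. MNIST-sized
```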


RidgeRun Services

RidgeRun has expertise in offloading processing algorithms using FPGAs, from Image Signal Processing to AI offloading. Our services include:

  • Algorithm Acceleration using FPGAs.
  • Image Signal Processing IP Cores.
  • Linux Device Drivers.
  • Low Power AI Acceleration using FPGAs.
  • Accelerated C++ Applications.

And much more. Contact us at https://www.ridgerun.com/contact.


