Quantisation for AI on FPGA

From RidgeRun Developer Wiki








Introduction

Quantisation is a popular technique that shrinks the storage size of deep learning models and speeds up computation on certain hardware architectures. On traditional architectures such as GPUs and CPUs, small data types like half-precision floating-point can deliver more than a two-times speed-up in execution time thanks to SIMD instructions, with additional advantages such as reduced data communication time and lower storage requirements.

However, traditional architectures cannot exploit a quantisation that is just right for a given application. For example, a model may retain good quality of results (accuracy, recall, etc.) at 10-bit precision, but these architectures offer only 8-bit or 16-bit data types, with nothing in between.

FPGAs, in contrast, allow the synthesis of accelerators that match the precision of the model: an FPGA can implement tensor accelerators with, say, 10-bit floating-point or fixed-point arithmetic. This brings some advantages:

  1. Optimal data communication
  2. Optimal hardware resource consumption

Apart from that, FPGAs come with the inherent advantage of energy efficiency.

This wiki will dig into some considerations before quantising a model for FPGAs.

Quantisation Techniques

We can categorise quantisation techniques according to scaling and granularity. Scaling refers to the mapping used to transform a value from one representation into another. The usual mappings are the following:

  • Linear: the numbers are converted by multiplying by a scale factor. This is the case when moving from floating-point to 8-bit integer, where each number is normalised and multiplied by the integer range (127 for a symmetric signed 8-bit representation), as shown in the sketch after this list.
  • Logarithmic: rather than a multiplicative scale, the logarithm function maps the floating-point value to a reduced-precision number, such as 8-bit float or 8-bit integer.
  • Ternary: a representation defined by two thresholds: values below the lower threshold map to the negative value, values above the higher threshold map to the positive value, and values in between map to 0.
  • Binary: similar to ternary, but with a single threshold, so every value maps to either the negative or the positive value.
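
As a concrete reference, here is a minimal sketch of the linear and ternary mappings using NumPy; the function names, the per-tensor scale and the threshold value are illustrative assumptions, not tied to any particular framework.

```python
import numpy as np

def linear_quantise_int8(x):
    """Symmetric linear quantisation of a float array to int8 with a per-tensor scale."""
    scale = np.max(np.abs(x)) / 127.0              # largest magnitude maps to the int8 edge
    q = np.clip(np.round(x / scale), -128, 127).astype(np.int8)
    return q, scale

def ternary_quantise(x, threshold=0.5):
    """Ternary mapping: -1 below -threshold, +1 above +threshold, 0 in between."""
    return np.where(x > threshold, 1, np.where(x < -threshold, -1, 0)).astype(np.int8)

w = np.random.randn(4, 4).astype(np.float32)
q8, scale = linear_quantise_int8(w)
print(np.max(np.abs(w - q8.astype(np.float32) * scale)))   # worst-case quantisation error
print(ternary_quantise(w))
```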

Quantisation usually targets a lower-precision numerical representation; the usual targets are listed below (a short sketch after the list contrasts the two 16-bit floating-point formats):

  • 4-bit integer
  • 8-bit integer
  • 16-bit integer
  • 16-bit floating-point
  • 16-bit brain floating-point (bfloat)
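
For reference, the short sketch below contrasts the two 16-bit floating-point targets from the list; the PyTorch casts are used purely for illustration.

```python
import torch

x = torch.randn(3)
print(x.to(torch.float16))    # IEEE half precision: 5 exponent bits, 10 mantissa bits
print(x.to(torch.bfloat16))   # bfloat16: 8 exponent bits, 7 mantissa bits (float32 range)
```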

Granularity, on the other hand, describes which parts of the model share the same quantisation. The simplest choice is a single uniform quantisation, but heterogeneous and non-uniform configurations are also possible. The most popular configurations are the following (a layerwise sketch follows below):

  • All uniform: the entire model is quantised using the same numerical representation.
  • Weights and Activations: the model has a different quantisation for weights and activations.
  • Layerwise: each layer uses a different quantisation.
  • All heterogeneous: weights, activations and all layers have different quantisations.

The first two are the most popular since they are widely compatible with CPUs and GPUs.
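
To make the layerwise case concrete, here is a minimal sketch that assigns a different bit width to each layer; the layer names, bit widths and the symmetric quantiser are hypothetical choices for illustration.

```python
import numpy as np

def linear_quantise(x, bits):
    """Symmetric linear quantisation to a signed `bits`-bit integer grid."""
    qmax = 2 ** (bits - 1) - 1
    scale = np.max(np.abs(x)) / qmax
    return np.clip(np.round(x / scale), -qmax - 1, qmax).astype(np.int32), scale

# Hypothetical layerwise configuration: each layer gets its own precision.
layer_bits = {"conv1": 8, "conv2": 6, "fc": 4}
weights = {name: np.random.randn(16, 16).astype(np.float32) for name in layer_bits}
quantised = {name: linear_quantise(w, layer_bits[name]) for name, w in weights.items()}
```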

Quantisation Opportunities in FPGAs

When moving to FPGA implementations, the fabric is flexible enough to support arbitrary numerical precisions, not restricted to 4, 8 or 16-bit representations. It also makes it possible to use custom-precision fixed-point, floating-point and posit arithmetic. Beyond that, each layer can use a different precision, depending on how critical and relevant it is to the final results.
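
As an illustration of a custom precision that no CPU or GPU data type offers, the sketch below emulates a signed 10-bit fixed-point format; the split of the 10 bits into 6 fractional bits is an arbitrary assumption.

```python
import numpy as np

def to_fixed_point(x, total_bits=10, frac_bits=6):
    """Emulate a signed fixed-point grid: `total_bits` bits, `frac_bits` of them fractional."""
    qmin = -(2 ** (total_bits - 1))
    qmax = 2 ** (total_bits - 1) - 1
    q = np.clip(np.round(x * (1 << frac_bits)), qmin, qmax)
    return q / (1 << frac_bits)      # the value a 10-bit accelerator would actually compute with

x = np.linspace(-4, 4, 9)
print(to_fixed_point(x))             # values snapped to the 10-bit fixed-point grid
```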

Frameworks like Brevitas and QKeras support quantisation to arbitrary precisions, along with refinement techniques such as post-training quantisation (PTQ) and zero-shot quantisation (ZSQ). Brevitas, in particular, enables FPGA deployment through FINN. However, depending on the degree of customisation, such as the introduction of alternative numerical representations like arbitrary-precision floating-point or posit, the work becomes more challenging because of limited tool support. Despite that, FPGAs have much to offer in this field, allowing compressed models to execute without the on-the-fly dequantisation and requantisation that many traditional architectures perform.
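
As a rough illustration of how such a flow starts, the sketch below defines a small layerwise-quantised network with Brevitas quantised layers; the bit widths and layer shapes are arbitrary, and the exact argument names may differ between Brevitas versions.

```python
import torch
import torch.nn as nn
from brevitas.nn import QuantConv2d, QuantReLU, QuantLinear

class SmallQuantNet(nn.Module):
    def __init__(self):
        super().__init__()
        # Illustrative bit widths: 6-bit convolution and activation, 4-bit classifier.
        self.conv = QuantConv2d(1, 8, kernel_size=3, weight_bit_width=6)
        self.relu = QuantReLU(bit_width=6)
        self.fc = QuantLinear(8 * 26 * 26, 10, bias=True, weight_bit_width=4)

    def forward(self, x):
        x = self.relu(self.conv(x))
        return self.fc(x.flatten(1))

model = SmallQuantNet()
out = model(torch.randn(1, 1, 28, 28))   # a 28x28 single-channel input, e.g. MNIST-sized
```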


RidgeRun Services

RidgeRun has expertise in offloading processing algorithms using FPGAs, from Image Signal Processing to AI offloading. Our services include:

  • Algorithm Acceleration using FPGAs.
  • Image Signal Processing IP Cores.
  • Linux Device Drivers.
  • Low Power AI Acceleration using FPGAs.
  • Accelerated C++ Applications.

And much more. Contact us at https://www.ridgerun.com/contact.


