Pruning for AI on FPGA


Introduction

Pruning is a model optimisation technique that removes weights that do not significantly impact the inference process. It reduces the storage and computation required for inference, but exploiting that reduction requires specialised data structures and hardware.

This wiki explains sparse data representations, hardware acceleration considerations and how FPGAs can be used to accelerate sparse operations.

Data Representations

Sparse matrix representations are used for matrix computations; they store only the valid (nonzero) elements, encoding each value together with its row and column. However, there are multiple ways to encode the rows, columns and values:

Coordinate (COO)

It encodes a dense matrix into a sparse representation through three different arrays:

  • Row indices
  • Column indices
  • Values

The three arrays have the same size, and the entries that describe one element share the same position (index) in all of them.

COO Addressing

So, to recover the dense matrix index of element i from the sparse representation, the following pseudocode can be used:

index_1d = row_indices[i] * leading_dimension + column_indices[i]
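
A minimal C sketch of this addressing (the array contents and matrix dimensions are illustrative) scatters the stored values back into a row-major dense buffer:

/* Illustrative COO example: scatter the valid elements into a
 * row-major dense buffer of ROWS x COLS entries. */
#include <stdint.h>
#include <stdio.h>

enum { ROWS = 3, COLS = 4 };

int main(void)
{
    float dense[ROWS * COLS] = {0};          /* dense matrix, row-major */

    /* COO representation: three same-sized arrays, one entry per element */
    uint16_t row_indices[]    = {0, 1, 2};
    uint16_t column_indices[] = {1, 3, 0};
    float    values[]         = {2.0f, -1.5f, 4.0f};
    const int valid_elements  = 3;

    for (int i = 0; i < valid_elements; i++) {
        /* index_1d = row_indices[i] * leading_dimension + column_indices[i] */
        int index_1d = row_indices[i] * COLS + column_indices[i];
        dense[index_1d] = values[i];
    }

    printf("dense[0][1] = %.1f\n", dense[1]);
    return 0;
}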

Assuming that the indices can be encoded as 16-bit unsigned integers and the values as 32-bit floating-point numbers, storage is saved only while the following inequality holds:

total_dense_matrix_elements >= valid_elements * (2 * sizeof(index) + sizeof(value)) / sizeof(value)

Not respecting this inequality leads to suboptimal compression, and the sparse matrix will take more storage than the dense representation. As a rule of thumb, the number of valid elements should be 50% or less of the total number of elements in the dense representation.
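
As a quick back-of-the-envelope check (the matrix size below is arbitrary), the break-even point can be computed directly from the element sizes:

/* Break-even point for COO with 16-bit indices and 32-bit float values. */
#include <stdint.h>
#include <stdio.h>

int main(void)
{
    const size_t total_elements  = 1024 * 1024;                          /* dense size */
    const size_t bytes_per_entry = 2 * sizeof(uint16_t) + sizeof(float); /* 8 bytes    */
    const size_t dense_bytes     = total_elements * sizeof(float);

    /* COO is smaller than the dense matrix only while the number of
     * valid elements stays at or below this bound (50% here). */
    const size_t max_valid = dense_bytes / bytes_per_entry;

    printf("COO saves storage while valid elements <= %zu (%.0f%% of total)\n",
           max_valid, 100.0 * (double)max_valid / (double)total_elements);
    return 0;
}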

Compressed Sparse Rows/Columns

These representations compress either the row or the column array so that it takes less space than in the previous representation. Instead of storing one index per element, they store offsets that mark where each row (or column) starts. This relaxes the number of valid elements needed to achieve compression, allowing more matrices to be compressed effectively.

COO Addressing

To compute the dense position of the k-th element stored in a given row, the following expression can be used:

index_1d = row * leading_dimension + column_indices[row_offsets[row] + k]

This representation can complicate finding a specific element, since matching a column index requires searching within the row's offset range.
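
A minimal C sketch of this traversal (assuming the usual CSR layout, where row_offsets has one extra entry marking the end of the last row; the data below is illustrative):

/* Illustrative CSR example: walk every stored element and compute its
 * position in a row-major dense matrix. */
#include <stdint.h>
#include <stdio.h>

enum { ROWS = 3, COLS = 4 };                 /* leading_dimension = COLS */

int main(void)
{
    /* CSR representation: row_offsets has ROWS + 1 entries */
    uint16_t row_offsets[]    = {0, 2, 3, 4};
    uint16_t column_indices[] = {0, 2, 3, 1};
    float    values[]         = {1.0f, 5.0f, -2.0f, 3.0f};

    for (int row = 0; row < ROWS; row++) {
        int row_elements = row_offsets[row + 1] - row_offsets[row];
        for (int k = 0; k < row_elements; k++) {
            int index_1d = row * COLS + column_indices[row_offsets[row] + k];
            printf("value %5.1f maps to dense index %d\n",
                   values[row_offsets[row] + k], index_1d);
        }
    }
    return 0;
}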

Sliced Ellpack (SELL)

The Sliced ELLPACK (SELL) format is a widely used, state-of-the-art representation. It provides a significant performance improvement for problems with low variability in the number of nonzero elements per row.

An M x N sparse matrix A is stored as a sliced matrix As with slices = ceil(M / slice_size) slice rows and N columns. Each slice is stored in column-major order to improve memory coalescing and utilisation.

The definition of the sparse matrix requires the array of slices, the slice size and the number of elements.

Sliced Ellpack

The hardware implementation for this representation is both less complex and more space-efficient, providing both benefits at once.
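
As an illustration only (the slice layout, padding scheme and data below are assumptions, not a fixed API), the following C sketch stores a small matrix in SELL form and performs a sparse matrix-vector product on it:

/* Illustrative SELL sparse matrix-vector product. Rows are grouped into
 * slices of SLICE_SIZE rows; each slice is padded to its longest row and
 * stored in column-major order inside the slice. Padding entries use 0. */
#include <stdint.h>
#include <stdio.h>

#define SLICE_SIZE 2

int main(void)
{
    /* 4x4 matrix split into 2 slices. Slice 0 is 2 elements wide,
     * slice 1 is 1 element wide. */
    int      slice_offsets[]  = {0, 4, 6};             /* start of each slice   */
    uint16_t column_indices[] = {0, 1, 2, 3,   1, 2};  /* column-major/slice    */
    float    values[]         = {1, 4, 2, 0,   5, 6};  /* 0 entries are padding */
    float    x[4] = {1, 1, 1, 1};
    float    y[4] = {0};

    for (int slice = 0; slice < 2; slice++) {
        int width = (slice_offsets[slice + 1] - slice_offsets[slice]) / SLICE_SIZE;
        for (int j = 0; j < width; j++) {               /* padded columns       */
            for (int r = 0; r < SLICE_SIZE; r++) {      /* rows inside slice    */
                int idx = slice_offsets[slice] + j * SLICE_SIZE + r;
                y[slice * SLICE_SIZE + r] += values[idx] * x[column_indices[idx]];
            }
        }
    }

    for (int row = 0; row < 4; row++)
        printf("y[%d] = %.1f\n", row, y[row]);
    return 0;
}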

Hardware Acceleration Considerations

The design of a hardware accelerator for sparse matrix operations depends on the representation used to compress the matrix. In general, such accelerators are based on operation queues that operate on the matrix elements. For matrix operations, the accelerator compares the operands from both inputs, looking for valid operations (positions where both operands carry a value).

Likewise, they are often granular and divided into Processing Elements (PEs), allowing multiple batches to be processed at a time. However, they require good scheduling to avoid underutilisation.

Adapruning

As suggested by Jiajun Li in AdaPrune: An Accelerator-Aware Pruning Technique for Sustainable CNN Accelerators[1], multi-PE sparse accelerators may be sub-optimally utilised when their loads are not balanced. Therefore, a good sparse accelerator is usually equipped with a front-end that analyses the operations and schedules them onto the multiple PEs, optimising compute time in exchange for the additional resources that a more complex accelerator requires.
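
As a simple illustration of why such a front-end helps (this greedy balancer is only a sketch, not the scheduler proposed in the paper), each row can be assigned to the PE with the smallest accumulated work, using the row's nonzero count as its cost:

/* Illustrative greedy load balancer: assign each sparse row to the least
 * loaded Processing Element so no PE sits idle while others are busy. */
#include <stdio.h>

#define NUM_PES  4
#define NUM_ROWS 8

int main(void)
{
    int row_nnz[NUM_ROWS] = {12, 3, 9, 1, 7, 7, 2, 10};   /* work per row */
    int pe_load[NUM_PES]  = {0};
    int row_to_pe[NUM_ROWS];

    for (int row = 0; row < NUM_ROWS; row++) {
        int best = 0;
        for (int pe = 1; pe < NUM_PES; pe++)               /* least-loaded PE */
            if (pe_load[pe] < pe_load[best])
                best = pe;
        row_to_pe[row] = best;
        pe_load[best] += row_nnz[row];
    }

    for (int row = 0; row < NUM_ROWS; row++)
        printf("row %d -> PE %d\n", row, row_to_pe[row]);
    for (int pe = 0; pe < NUM_PES; pe++)
        printf("PE %d total work: %d\n", pe, pe_load[pe]);
    return 0;
}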

On FPGAs, we can create custom architectures that perform sparse operations and exploit the artefacts of deep learning model pruning. We can go further with arbitrary-precision and approximate-computing operations that help reduce the resource footprint of the accelerators implemented on the FPGA.

RidgeRun Services

RidgeRun has expertise in offloading processing algorithms using FPGAs, from Image Signal Processing to AI offloading. Our services include:

  • Algorithm Acceleration using FPGAs.
  • Image Signal Processing IP Cores.
  • Linux Device Drivers.
  • Low Power AI Acceleration using FPGAs.
  • Accelerated C++ Applications.

And much more. Contact us at https://www.ridgerun.com/contact.






References

  1. Li, J. An Accelerator-Aware Pruning Technique for Sustainable CNN Accelerators. Available at: https://ieeexplore.ieee.org/document/9359522