eIQ Neutron NPU overview

From RidgeRun Developer Wiki











Introduction

The eIQ Neutron NPU is NXP’s dedicated Neural Processing Unit, designed to accelerate machine learning inference on supported i.MX platforms such as the Verdin iMX95.

It is optimized for running quantized neural networks efficiently, offloading compute-heavy operations from the CPU and significantly improving inference performance.

Neutron-S NPU Overview

The Neutron-S implementation is built around three main hardware components:

  • Neutron compute core: Handles the core tensor operations (MACs) and supports pipelining across multiple instances.
  • RISC-V controller: Manages the execution of workloads by configuring and coordinating the compute core.
  • Data Mover: A DMA-like engine responsible for transferring data between system DDR memory and the NPU’s internal TCM.

These components work together with the main SoC CPU (Cortex-A55), which remains responsible for:

  • Running the Linux OS and inference framework (TensorFlow Lite)
  • Loading models and preparing input data
  • Dispatching supported operations to the NPU

Key Features

  • Optimized for quantized CNN workloads
  • Supports 8-bit weights and 8/16-bit activations
  • Integrates with TensorFlow Lite (TFLite)
  • Automatically falls back to Cortex-A CPU for unsupported operations
  • Allows offloading of custom nodes (neutronGraph) through the TFLite delegate
  • Includes model conversion tools via the eIQ toolkit to optimize performance and memory usage
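The 8-bit support listed above follows the TFLite-style affine quantization scheme, where a float value x maps to an integer code q via q = round(x / scale) + zero_point. A minimal sketch in plain Python; the scale and zero-point values are made-up examples, not taken from a real model:

```python
# Illustrative sketch of TFLite-style affine (asymmetric) int8 quantization:
#   q = clamp(round(x / scale) + zero_point, -128, 127)
# The scale/zero_point values below are made-up examples.

def quantize(x, scale, zero_point):
    """Map a float to an int8 code."""
    q = round(x / scale) + zero_point
    return max(-128, min(127, q))

def dequantize(q, scale, zero_point):
    """Map an int8 code back to an approximate float."""
    return (q - zero_point) * scale

scale, zero_point = 0.05, 10            # example parameters
q = quantize(1.25, scale, zero_point)   # round(1.25 / 0.05) + 10 = 35
x_hat = dequantize(q, scale, zero_point)
print(q, x_hat)
```

Values that fall outside the representable range are clamped to [-128, 127], which is one reason proper calibration during conversion matters for accuracy.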

Software Architecture

The Neutron-S software stack is split into three main parts:

1. Model Conversion (Offline)

A dedicated converter tool (part of the eIQ toolkit) processes a standard TFLite model and prepares it for execution on the NPU.

During this step:

  • Supported operations are replaced with a custom neutronOp node
  • The model is augmented with:
      ◦ Firmware-specific binary data
      ◦ Precompiled weights
      ◦ Memory layout definitions for inputs and outputs

The result is a modified TFLite model that can be executed using the Neutron delegate.
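The substitution step can be pictured with a toy sketch in plain Python: each run of supported operators in the graph collapses into a single custom neutronOp node, while unsupported operators are left in place for CPU fallback. The operator names and the supported set below are illustrative assumptions, not the eIQ converter's actual behavior:

```python
# Toy illustration of the offline conversion step: runs of supported ops are
# fused into one custom "neutronOp" node; unsupported ops stay as-is so the
# CPU can execute them. The op names and SUPPORTED set are made up.

SUPPORTED = {"CONV_2D", "DEPTHWISE_CONV_2D", "FULLY_CONNECTED"}

def convert(ops):
    """Replace each maximal run of supported ops with one neutronOp node."""
    out = []
    for op in ops:
        if op in SUPPORTED:
            if out and out[-1] == "neutronOp":
                continue  # extend the current fused region
            out.append("neutronOp")
        else:
            out.append(op)
    return out

model = ["CONV_2D", "DEPTHWISE_CONV_2D", "SOFTMAX", "FULLY_CONNECTED"]
print(convert(model))  # ['neutronOp', 'SOFTMAX', 'neutronOp']
```

Note how the unsupported SOFTMAX splits the model into two offloaded regions; fewer CPU/NPU transitions generally mean less data movement overhead.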

2. Runtime Stack (Cortex-A)

On the application side (running on the Cortex-A CPU under Linux), the following components are involved:

  • TensorFlow Lite inference engine
  • Neutron delegate library
  • User-space Neutron driver
  • Linux kernel driver for Neutron

This stack is responsible for executing the model and routing supported operations to the NPU.
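From an application's point of view, wiring this stack together means handing the delegate library to the TFLite interpreter. A hedged sketch using tflite_runtime's standard external-delegate API; the delegate library path and file name are assumptions about the target BSP, not verified values:

```python
# Hypothetical sketch: create a TFLite interpreter with the Neutron delegate.
# The delegate path below is an ASSUMED install location/name on the target
# BSP; adjust it to match your image. load_delegate() and
# Interpreter(experimental_delegates=...) are standard tflite_runtime APIs.

NEUTRON_DELEGATE = "/usr/lib/libneutron-delegate.so"  # assumed path/name

def build_interpreter(model_path, delegate_path=NEUTRON_DELEGATE):
    # Imported inside the function so the sketch can be read on a host
    # machine that does not have tflite_runtime installed.
    from tflite_runtime.interpreter import Interpreter, load_delegate

    delegate = load_delegate(delegate_path)
    interp = Interpreter(model_path=model_path,
                         experimental_delegates=[delegate])
    interp.allocate_tensors()
    return interp
```

The model passed in must be the converter's output; handing the delegate an unconverted model leaves everything on the CPU, as noted below.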

3. NPU Firmware (RISC-V)

Inside the Neutron-S hardware, the RISC-V controller runs firmware that:

  • Interprets the compiled neutronGraph
  • Schedules execution on the compute cores
  • Coordinates data transfers via the Data Mover
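The firmware's role can be caricatured as a loop over the compiled graph: stage a node's data into TCM through the Data Mover, execute it on a compute core, and move results back to DDR. This is purely conceptual, with stand-in callbacks, and does not reflect the actual firmware implementation:

```python
# Purely conceptual model of the RISC-V firmware loop: for each node in the
# compiled neutronGraph, stage inputs into TCM via the Data Mover, run it on
# a compute core, then move the results back out to DDR.

def run_graph(graph, dma_in, compute, dma_out):
    results = []
    for node in graph:
        tile = dma_in(node)           # Data Mover: DDR -> TCM
        out = compute(node, tile)     # compute core executes the node's MACs
        results.append(dma_out(out))  # Data Mover: TCM -> DDR
    return results

# Tiny stand-in callbacks just to make the sketch runnable:
trace = []
res = run_graph(
    ["conv1", "conv2"],
    dma_in=lambda n: f"{n}:in",
    compute=lambda n, t: f"{n}:out",
    dma_out=lambda o: trace.append(o) or o,
)
print(res)  # ['conv1:out', 'conv2:out']
```

In the real hardware these three stages can overlap (the compute core supports pipelining across instances, as noted earlier), which this sequential toy loop does not capture.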

Architecture Diagram

The following diagram illustrates how the different components interact across the system:

Neutron-S software architecture

The model conversion step is handled by an offline tool that prepares a standard TensorFlow Lite model for execution on the Neutron-S NPU. During this process, supported operators are replaced with a custom neutronOp node. This node embeds all required artifacts for execution on the NPU, including firmware-specific binaries, model weights, and memory definitions for inputs and outputs. The result is a modified TFLite model that can be executed by the TensorFlow Lite runtime together with the Neutron delegate.

On the runtime side, the Cortex-A Linux stack includes all necessary components to execute and offload inference workloads. This includes the TensorFlow Lite interpreter, the Neutron delegate library, a user-space driver, and the corresponding Linux kernel driver. Together, these components enable communication with the Neutron-S hardware and ensure that supported operations are properly accelerated.

Notes

  • Only supported operators are offloaded to the NPU; the rest are executed on the CPU.
  • Proper model conversion is required to take full advantage of the Neutron-S acceleration.
  • The combination of delegate + converted model is essential—using one without the other will not enable NPU acceleration.