Documentation/accel/amdxdna/amdnpu.rst

.. SPDX-License-Identifier: GPL-2.0-only

.. include:: <isonum.txt>

=========
 AMD NPU
=========

:Copyright: |copy| 2024 Advanced Micro Devices, Inc.
:Author: Sonal Santan <sonal.santan@amd.com>

Overview
========

The AMD NPU (Neural Processing Unit) is a multi-user AI inference accelerator
integrated into AMD client APUs. The NPU enables efficient execution of machine
learning workloads such as CNNs and LLMs. It is based on the
`AMD XDNA Architecture`_ and is managed by the **amdxdna** driver.


Hardware Description
====================

The AMD NPU consists of the following hardware components:

AMD XDNA Array
--------------

The AMD XDNA Array comprises a 2D array of compute and memory tiles built with
`AMD AI Engine Technology`_. Each column has 4 rows of compute tiles and 1
row of memory tiles. Each compute tile contains a VLIW processor with its own
dedicated program and data memory. The memory tile acts as L2 memory. The 2D
array can be partitioned at a column boundary, creating a spatially isolated
partition which can be bound to a workload context.

Each column also has dedicated DMA engines to move data between host DDR and
the memory tile.

AMD Phoenix and AMD Hawk Point client NPUs have a 4x5 topology, i.e., 4 rows of
compute tiles arranged into 5 columns. The NPU in the AMD Strix Point client
APU has a 4x8 topology, i.e., 4 rows of compute tiles arranged into 8 columns.

Shared L2 Memory
----------------

The single row of memory tiles creates a pool of software-managed on-chip L2
memory. DMA engines are used to move data between host DDR and the memory
tiles. AMD Phoenix and AMD Hawk Point NPUs have a total of 2560 KB of L2
memory. The AMD Strix Point NPU has a total of 4096 KB of L2 memory.
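
The topology and L2 figures above can be cross-checked with a little
arithmetic. The sketch below is illustrative only: the class and dictionary
names are hypothetical, and the per-column figure assumes the L2 total is split
evenly across the one memory tile per column, which this document does not
state explicitly.

.. code-block:: python

   from dataclasses import dataclass


   @dataclass(frozen=True)
   class NpuTopology:
       """Figures quoted in this document for one NPU generation."""
       compute_rows: int
       columns: int
       l2_total_kb: int

       @property
       def compute_tiles(self) -> int:
           # Compute tiles form a rows x columns grid.
           return self.compute_rows * self.columns

       @property
       def l2_per_column_kb(self) -> int:
           # Assumption: L2 is divided evenly across columns.
           return self.l2_total_kb // self.columns


   TOPOLOGIES = {
       "phoenix": NpuTopology(compute_rows=4, columns=5, l2_total_kb=2560),
       "hawk_point": NpuTopology(compute_rows=4, columns=5, l2_total_kb=2560),
       "strix_point": NpuTopology(compute_rows=4, columns=8, l2_total_kb=4096),
   }

Under that assumption, both the 4x5 and the 4x8 parts work out to the same
amount of L2 memory per column.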

Microcontroller
---------------

A microcontroller runs the NPU Firmware, which is responsible for command
processing, XDNA Array partition setup, XDNA Array configuration, workload
context management and workload orchestration.

The NPU Firmware uses a dedicated instance of an isolated non-privileged
context called ERT to service each workload context. ERT is also used to
execute the user-provided ``ctrlcode`` associated with the workload context.

The NPU Firmware uses a single isolated privileged context called MERT to
service management commands from the amdxdna driver.

Mailboxes
---------

The microcontroller and the amdxdna driver use a privileged channel for
management tasks such as setting up contexts, telemetry, queries, error
handling, and setting up user channels. As mentioned before, privileged channel
requests are serviced by MERT. The privileged channel is bound to a single
mailbox.

The microcontroller and the amdxdna driver use a dedicated user channel per
workload context. The user channel is primarily used for submitting work to
the NPU. As mentioned before, user channel requests are serviced by an
instance of ERT. Each user channel is bound to its own dedicated mailbox.

PCIe EP
-------

The NPU is visible to the x86 host CPU as a PCIe device with multiple BARs and
some MSI-X interrupt vectors. The NPU uses a dedicated high-bandwidth SoC-level
fabric for reading from or writing to host memory. Each instance of ERT gets
its own dedicated MSI-X interrupt. MERT gets a single MSI-X interrupt.

The number of PCIe BARs varies depending on the specific device. Based on their
functions, PCIe BARs can generally be categorized into the following types.

* PSP BAR: Exposes the AMD PSP (Platform Security Processor) function
* SMU BAR: Exposes the AMD SMU (System Management Unit) function
* SRAM BAR: Exposes ring buffers for the mailbox
* Mailbox BAR: Exposes the mailbox control registers (head, tail, ISR
  registers, etc.)
* Public Register BAR: Exposes public registers

On specific devices, the above-mentioned BAR types might be combined into a
single physical PCIe BAR, or a module might require two physical PCIe BARs to
be fully functional. For example:

* On the AMD Phoenix device, the PSP, SMU, and Public Register BARs are on
  PCIe BAR index 0.
* On the AMD Strix Point device, the Mailbox and Public Register BARs are on
  PCIe BAR index 0. The PSP has some registers in PCIe BAR index 0 (Public
  Register BAR) and PCIe BAR index 4 (PSP BAR).
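
The two examples above can be captured as a lookup table from logical BAR type
to physical BAR index. This is a minimal sketch, not the driver's data
structure: ``BAR_LAYOUT`` and ``physical_bars`` are hypothetical names, and the
table holds only the placements stated in this section. Anything else is
reported as unknown rather than guessed.

.. code-block:: python

   # Logical BAR type -> physical PCIe BAR indices, limited to what is
   # stated in the text. Placements that are not described are left out.
   BAR_LAYOUT = {
       "phoenix": {
           "psp": (0,),
           "smu": (0,),
           "public_reg": (0,),
       },
       "strix_point": {
           "mailbox": (0,),
           "public_reg": (0,),
           "psp": (0, 4),  # PSP registers are split across two BARs
       },
   }


   def physical_bars(device: str, bar_type: str) -> tuple:
       """Return the physical BAR indices backing a logical BAR type."""
       try:
           return BAR_LAYOUT[device][bar_type]
       except KeyError:
           raise LookupError(
               f"placement of {bar_type!r} on {device!r} is not described"
           ) from None

For example, ``physical_bars("strix_point", "psp")`` returns both indices,
which shows why a single module may need more than one BAR to be mapped.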

Process Isolation Hardware
--------------------------

As explained before, the XDNA Array can be dynamically divided into isolated
spatial partitions, each of which may have one or more columns. A spatial
partition is set up by the microcontroller, which programs the column isolation
registers. Each spatial partition is associated with a PASID, which is also
programmed by the microcontroller. Hence, multiple spatial partitions in the
NPU can access host memory concurrently, each protected by its PASID.

The NPU FW itself uses microcontroller MMU-enforced isolated contexts for
servicing user and privileged channel requests.


Mixed Spatial and Temporal Scheduling
=====================================

The AMD XDNA architecture supports mixed spatial and temporal (time-sharing)
scheduling of the 2D array. This means that spatial partitions may be set up
and torn down dynamically to accommodate various workloads. A *spatial*
partition may be *exclusively* bound to one workload context while another
partition may be *temporarily* bound to more than one workload context. The
microcontroller updates the PASID for a temporarily shared partition to match
the context that is bound to the partition at any moment.
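
The difference between an exclusively bound and a time-shared partition can be
modeled in a few lines. This is a toy model under stated assumptions, not
firmware behavior: the ``Partition`` class, its fields and the context names
are all hypothetical, and the actual time-slicing policy is not described in
this document.

.. code-block:: python

   from dataclasses import dataclass, field
   from typing import Dict, Optional


   @dataclass
   class Partition:
       """Toy model of one spatial partition (a run of adjacent columns)."""
       start_col: int
       num_cols: int
       contexts: Dict[str, int] = field(default_factory=dict)  # name -> PASID
       active_pasid: Optional[int] = None

       def bind(self, ctx: str, pasid: int) -> None:
           """Bind a workload context; the first one bound becomes active."""
           self.contexts[ctx] = pasid
           if self.active_pasid is None:
               self.active_pasid = pasid

       @property
       def exclusive(self) -> bool:
           """True while exactly one context is bound to the partition."""
           return len(self.contexts) == 1

       def schedule(self, ctx: str) -> None:
           """Give the partition to `ctx` and switch to that context's PASID."""
           self.active_pasid = self.contexts[ctx]  # KeyError if not bound

Binding a second context turns an exclusive partition into a shared one, and
each call to ``schedule()`` mirrors the PASID update described above.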

Resource Solver
---------------

The Resource Solver component of the amdxdna driver manages the allocation
of the 2D array among various workloads. Every workload describes, in its
metadata, the number of columns required to run the NPU binary. The Resource
Solver uses hints passed by the workload and its own heuristics to decide the
2D array (re)partition strategy and the mapping of workloads for spatial and
temporal sharing of columns. The FW enforces the context-to-column(s) resource
binding decisions made by the Resource Solver.

AMD Phoenix and AMD Hawk Point client NPUs can support 6 concurrent workload
contexts. AMD Strix Point can support 16 concurrent workload contexts.
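
To make the allocation problem concrete, the sketch below places column
requests using first-fit and falls back to temporal sharing of an existing
partition of the same width. It is deliberately simplistic and is not the
Resource Solver's actual heuristic, which this document does not specify.
``ToySolver`` and ``first_fit`` are hypothetical names, and releasing contexts
is omitted.

.. code-block:: python

   def first_fit(free, ncols):
       """Return the start of the first run of `ncols` free columns, or None."""
       run = 0
       for col, is_free in enumerate(free):
           run = run + 1 if is_free else 0
           if run == ncols:
               return col - ncols + 1
       return None


   class ToySolver:
       """Illustrative column allocator; not the driver's algorithm."""

       def __init__(self, total_cols, max_contexts):
           self.free = [True] * total_cols
           self.max_contexts = max_contexts
           self.placements = {}  # context name -> (start_col, ncols)

       def request(self, ctx, ncols):
           if len(self.placements) >= self.max_contexts:
               raise RuntimeError("context limit reached")
           start = first_fit(self.free, ncols)
           if start is None:
               # Temporal sharing: reuse a partition of the same width.
               for used_start, used_cols in self.placements.values():
                   if used_cols == ncols:
                       start = used_start
                       break
               else:
                   raise RuntimeError("no suitable partition")
           for col in range(start, start + ncols):
               self.free[col] = False
           self.placements[ctx] = (start, ncols)
           return start

With ``ToySolver(total_cols=5, max_contexts=6)``, which matches the Phoenix
figures above, a second 4-column request can only be satisfied by sharing the
columns already given to the first one.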


Application Binaries
====================

An NPU application workload comprises two separate binaries, which are
generated by the NPU compiler.

1. AMD XDNA Array overlay, which is used to configure an NPU spatial partition.
   The overlay contains instructions for setting up the stream switch
   configuration and the ELF for the compute tiles. The overlay is loaded on
   the spatial partition bound to the workload by the associated ERT instance.
   Refer to the
   `Versal Adaptive SoC AIE-ML Architecture Manual (AM020)`_ for more details.

2. ``ctrlcode``, used for orchestrating the overlay loaded on the spatial
   partition. ``ctrlcode`` is executed by the ERT running in protected mode on
   the microcontroller in the context of the workload. ``ctrlcode`` is made up
   of a sequence of opcodes named ``XAie_TxnOpcode``. Refer to the
   `AI Engine Run Time`_ for more details.


Special Host Buffers
====================

Per-context Instruction Buffer
------------------------------

Every workload context uses a host-resident 64 MB buffer which is memory
mapped into the ERT instance created to service the workload. The ``ctrlcode``
used by the workload is copied into this special memory. This buffer is
protected by PASID like all other input/output buffers used by that workload.
The instruction buffer is also mapped into the user space of the workload.

Global Privileged Buffer
------------------------

In addition, the driver allocates a single buffer for maintenance tasks
such as recording errors from MERT. This global buffer uses the global IOMMU
domain and is only accessible by MERT.


High-level Use Flow
===================

Here are the steps to run a workload on the AMD NPU:

1.  Compile the workload into an overlay and a ``ctrlcode`` binary.
2.  Userspace opens a context in the driver and provides the overlay.
3.  The driver checks with the Resource Solver to provision a set of columns
    for the workload.
4.  The driver then asks MERT to create a context on the device with the desired
    columns.
5.  MERT then creates an instance of ERT. MERT also maps the Instruction Buffer
    into ERT memory.
6.  Userspace then copies the ``ctrlcode`` to the Instruction Buffer.
7.  Userspace then creates a command buffer with pointers to the input, output,
    and instruction buffers; it then submits the command buffer to the driver
    and goes to sleep waiting for completion.
8.  The driver sends the command over the Mailbox to ERT.
9.  ERT *executes* the ``ctrlcode`` in the instruction buffer.
10. Execution of the ``ctrlcode`` kicks off DMAs to and from the host DDR while
    the AMD XDNA Array is running.
11. When ERT reaches the end of the ``ctrlcode``, it raises an MSI-X interrupt
    to signal completion to the driver, which then wakes up the waiting
    workload.
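
The ordering of the steps above can be traced with a small simulation. It is
only a sketch of the sequence: ``ToyNpu`` and ``run_workload`` are hypothetical
stand-ins, no real driver or firmware interface is modeled, and the Resource
Solver step is reduced to a column count passed in by the caller.

.. code-block:: python

   class ToyNpu:
       """Hypothetical stand-in for MERT and a single ERT instance."""

       def __init__(self):
           self.log = []
           self.instruction_buffer = bytearray()

       def create_context(self, num_columns):
           # Steps 4-5: MERT creates the context and its ERT instance.
           self.log.append(f"MERT: context on {num_columns} column(s)")

       def submit(self, command):
           # Steps 8-11: ERT runs the ctrlcode, then signals completion.
           size = len(command["instructions"])
           self.log.append(f"ERT: execute {size} ctrlcode bytes")
           self.log.append("ERT: MSI-X completion")
           return "completed"


   def run_workload(npu, num_columns, ctrlcode, inputs, outputs):
       npu.create_context(num_columns)         # steps 2-5
       npu.instruction_buffer[:] = ctrlcode    # step 6
       command = {                             # step 7
           "inputs": inputs,
           "outputs": outputs,
           "instructions": npu.instruction_buffer,
       }
       return npu.submit(command)              # steps 8-11

The resulting ``log`` shows one management request handled by MERT followed by
work handled by ERT, matching the split between the privileged channel and the
user channel.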


Boot Flow
=========

The amdxdna driver uses PSP to securely load the signed NPU FW and kick off the
boot of the NPU microcontroller. The driver then waits for the alive signal in
a special location on BAR 0. The NPU is switched off during SoC suspend and
turned on after resume, at which point the NPU FW is reloaded and the handshake
is performed again.
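
Waiting for the alive signal amounts to polling with a timeout. The sketch
below shows that general pattern only. ``read_alive`` is a hypothetical callable
standing in for a read of the BAR 0 location, and the timeout and poll interval
are arbitrary values, since this document does not specify how the driver
waits or for how long.

.. code-block:: python

   import time


   def wait_for_alive(read_alive, timeout_s=1.0, poll_interval_s=0.01):
       """Poll `read_alive()` until it is truthy or the timeout expires.

       Returns True if the alive signal was seen, False on timeout.
       """
       deadline = time.monotonic() + timeout_s
       while time.monotonic() < deadline:
           if read_alive():
               return True
           time.sleep(poll_interval_s)
       return False

Because the NPU FW is reloaded after resume, the same wait would be repeated
on every resume, not only at first boot.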


Userspace components
====================

Compiler
--------

Peano is an LLVM-based open-source compiler for AMD XDNA Array compute tiles,
available at:
https://github.com/Xilinx/llvm-aie

The open-source IREE compiler supports graph compilation of ML models for the
AMD NPU and uses Peano underneath. It is available at:
https://github.com/nod-ai/iree-amd-aie

Usermode Driver (UMD)
---------------------

The open-source XRT runtime stack interfaces with the amdxdna kernel driver.
XRT can be found at:
https://github.com/Xilinx/XRT

The open-source XRT shim for NPU can be found at:
https://github.com/amd/xdna-driver


DMA Operation
=============

DMA operation instructions are encoded in the ``ctrlcode`` as the
``XAIE_IO_BLOCKWRITE`` opcode. When ERT executes ``XAIE_IO_BLOCKWRITE``, DMA
operations between host DDR and L2 memory are carried out.


Error Handling
==============

When MERT detects an error in the AMD XDNA Array, it pauses execution for that
workload context and sends an asynchronous message to the driver over the
privileged channel. The driver then sends a buffer pointer to MERT to capture
the register states for the partition bound to the faulting workload context.
The driver then decodes the error by reading the contents of the buffer.


Telemetry
=========

MERT can report various kinds of telemetry information, such as:

* L1 interrupt counter
* DMA counter
* Deep Sleep counter


References
==========

- `AMD XDNA Architecture <https://www.amd.com/en/technologies/xdna.html>`_
- `AMD AI Engine Technology <https://www.xilinx.com/products/technology/ai-engine.html>`_
- `Peano <https://github.com/Xilinx/llvm-aie>`_
- `Versal Adaptive SoC AIE-ML Architecture Manual (AM020) <https://docs.amd.com/r/en-US/am020-versal-aie-ml>`_
- `AI Engine Run Time <https://github.com/Xilinx/aie-rt/tree/release/main_aig>`_