llvm/docs/RISCV/RISCVVectorExtension.rst

   1 =========================
   2  RISC-V Vector Extension
   3 =========================
   4
   5 .. contents::
   6    :local:
   7
   8 The RISC-V target supports the 1.0 version of the `RISC-V Vector Extension (RVV) <https://github.com/riscv/riscv-v-spec/blob/v1.0/v-spec.adoc>`_.
   9 This guide gives an overview of how it's modelled in LLVM IR and how the backend generates code for it.
  10
  11 Mapping to LLVM IR types
  12 ========================
  13
  14 RVV adds 32 VLEN sized registers, where VLEN is an unknown constant to the compiler. To be able to represent VLEN sized values, the RISC-V backend takes the same approach as AArch64's SVE and uses `scalable vector types <https://llvm.org/docs/LangRef.html#t-vector>`_.
  15
  16 Scalable vector types are of the form ``<vscale x n x ty>``, which indicates a vector with a multiple of ``n`` elements of type ``ty``.
  17 On RISC-V ``n`` and ``ty`` control LMUL and SEW respectively.
  18
  19 LLVM only supports ELEN=32 or ELEN=64, so ``vscale`` is defined as VLEN/64 (see ``RISCV::RVVBitsPerBlock``).
  20 Note this means that VLEN must be at least 64, so VLEN=32 isn't currently supported.
  21
  22 +-------------------+---------------+----------------+------------------+-------------------+-------------------+-------------------+-------------------+
  23 |                   | LMUL=⅛        | LMUL=¼         | LMUL=½           | LMUL=1            | LMUL=2            | LMUL=4            | LMUL=8            |
  24 +===================+===============+================+==================+===================+===================+===================+===================+
  25 | i64 (ELEN=64)     | N/A           | N/A            | N/A              | <v x 1 x i64>     | <v x 2 x i64>     | <v x 4 x i64>     | <v x 8 x i64>     |
  26 +-------------------+---------------+----------------+------------------+-------------------+-------------------+-------------------+-------------------+
  27 | i32               | N/A           | N/A            | <v x 1 x i32>    | <v x 2 x i32>     | <v x 4 x i32>     | <v x 8 x i32>     | <v x 16 x i32>    |
  28 +-------------------+---------------+----------------+------------------+-------------------+-------------------+-------------------+-------------------+
  29 | i16               | N/A           | <v x 1 x i16>  | <v x 2 x i16>    | <v x 4 x i16>     | <v x 8 x i16>     | <v x 16 x i16>    | <v x 32 x i16>    |
  30 +-------------------+---------------+----------------+------------------+-------------------+-------------------+-------------------+-------------------+
  31 | i8                | <v x 1 x i8>  | <v x 2 x i8>   | <v x 4 x i8>     | <v x 8 x i8>      | <v x 16 x i8>     | <v x 32 x i8>     | <v x 64 x i8>     |
  32 +-------------------+---------------+----------------+------------------+-------------------+-------------------+-------------------+-------------------+
  33 | double (ELEN=64)  | N/A           | N/A            | N/A              | <v x 1 x double>  | <v x 2 x double>  | <v x 4 x double>  | <v x 8 x double>  |
  34 +-------------------+---------------+----------------+------------------+-------------------+-------------------+-------------------+-------------------+
  35 | float             | N/A           | N/A            | <v x 1 x float>  | <v x 2 x float>   | <v x 4 x float>   | <v x 8 x float>   | <v x 16 x float>  |
  36 +-------------------+---------------+----------------+------------------+-------------------+-------------------+-------------------+-------------------+
  37 | half              | N/A           | <v x 1 x half> | <v x 2 x half>   | <v x 4 x half>    | <v x 8 x half>    | <v x 16 x half>   | <v x 32 x half>   |
  38 +-------------------+---------------+----------------+------------------+-------------------+-------------------+-------------------+-------------------+
  39
  40 (Read ``<v x k x ty>`` as ``<vscale x k x ty>``)
  41
  42
  43 Mask vector types
  44 -----------------
  45
  46 Mask vectors are physically represented using a layout of densely packed bits in a vector register.
  47 They are mapped to the following LLVM IR types:
  48
  49 - ``<vscale x 1 x i1>``
  50 - ``<vscale x 2 x i1>``
  51 - ``<vscale x 4 x i1>``
  52 - ``<vscale x 8 x i1>``
  53 - ``<vscale x 16 x i1>``
  54 - ``<vscale x 32 x i1>``
  55 - ``<vscale x 64 x i1>``
  56
  57 Two types with the same SEW/LMUL ratio will have the same related mask type.
  58 For instance, two different comparisons one under SEW=64, LMUL=2 and the other under SEW=32, LMUL=1 will both generate a mask ``<vscale x 2 x i1>``.
  59
  60 Representation in LLVM IR
  61 =========================
  62
  63 Vector instructions can be represented in three main ways in LLVM IR:
  64
  65 1. Regular instructions on both scalable and fixed-length vector types
  66
  67    .. code-block:: llvm
  68
  69        %c = add <vscale x 4 x i32> %a, %b
  70        %f = add <4 x i32> %d, %e
  71
  72 2. RISC-V vector intrinsics, which mirror the `C intrinsics specification <https://github.com/riscv-non-isa/rvv-intrinsic-doc>`_
  73
  74    These come in unmasked variants:
  75
  76    .. code-block:: llvm
  77
  78        %c = call @llvm.riscv.vadd.nxv4i32.nxv4i32(
  79               <vscale x 4 x i32> %passthru,
  80               <vscale x 4 x i32> %a,
  81               <vscale x 4 x i32> %b,
  82               i64 %avl
  83             )
  84
  85    As well as masked variants:
  86
  87    .. code-block:: llvm
  88
  89        %c = call @llvm.riscv.vadd.mask.nxv4i32.nxv4i32(
  90               <vscale x 4 x i32> %passthru,
  91               <vscale x 4 x i32> %a,
  92               <vscale x 4 x i32> %b,
  93               <vscale x 4 x i1> %mask,
  94               i64 %avl,
  95               i64 0 ; policy (must be an immediate)
  96             )
  97
  98    Both allow setting the AVL as well as controlling the inactive/tail elements via the passthru operand, but the masked variant also provides operands for the mask and ``vta``/``vma`` policy bits.
  99
 100    The only valid types are scalable vector types.
 101
 102 3. :ref:`Vector predication (VP) intrinsics <int_vp>`
 103
 104    .. code-block:: llvm
 105
 106        %c = call @llvm.vp.add.nxv4i32(
 107               <vscale x 4 x i32> %a,
 108               <vscale x 4 x i32> %b,
 109               <vscale x 4 x i1> %m
 110               i32 %evl
 111             )
 112
 113    Unlike RISC-V intrinsics, VP intrinsics are target agnostic so they can be emitted from other optimisation passes in the middle-end (like the loop vectorizer). They also support fixed-length vector types.
 114
 115    VP intrinsics also don't have passthru operands, but tail/mask undisturbed behaviour can be emulated by using the output in a ``@llvm.vp.merge``.
 116    It will get lowered as a ``vmerge``, but will be merged back into the underlying instruction's mask via ``RISCVDAGToDAGISel::performCombineVMergeAndVOps``.
 117
 118
 119 The different properties of the above representations are summarized below:
 120
 121 +----------------------+--------------+-----------------+----------+------------------+----------------------+-----------------+
 122 |                      | AVL          | Masking         | Passthru | Scalable vectors | Fixed-length vectors | Target agnostic |
 123 +======================+==============+=================+==========+==================+======================+=================+
 124 | LLVM IR instructions | Always VLMAX | No              | None     | Yes              | Yes                  | Yes             |
 125 +----------------------+--------------+-----------------+----------+------------------+----------------------+-----------------+
 126 | RVV intrinsics       | Yes          | Yes             | Yes      | Yes              | No                   | No              |
 127 +----------------------+--------------+-----------------+----------+------------------+----------------------+-----------------+
 128 | VP intrinsics        | Yes (EVL)    | Yes             | No       | Yes              | Yes                  | Yes             |
 129 +----------------------+--------------+-----------------+----------+------------------+----------------------+-----------------+
 130
 131 SelectionDAG lowering
 132 =====================
 133
 134 For most regular **scalable** vector LLVM IR instructions, their corresponding SelectionDAG nodes are legal on RISC-V and don't require any custom lowering.
 135
 136 .. code-block::
 137
 138    t5: nxv4i32 = add t2, t4
 139
 140 RISC-V vector intrinsics also don't require any custom lowering.
 141
 142 .. code-block::
 143
 144    t12: nxv4i32 = llvm.riscv.vadd TargetConstant:i64<10056>, undef:nxv4i32, t2, t4, t6
 145
 146 Fixed-length vectors
 147 --------------------
 148
 149 Because there are no fixed-length vector patterns, fixed-length vectors need to be custom lowered and performed in a scalable "container" type:
 150
 151 1. The fixed-length vector operands are inserted into scalable containers with ``insert_subvector`` nodes. The container type is chosen such that its minimum size will fit the fixed-length vector (see ``getContainerForFixedLengthVector``).
 152 2. The operation is then performed on the container type via a **VL (vector length) node**. These are custom nodes defined in ``RISCVInstrInfoVVLPatterns.td`` that mirror target agnostic SelectionDAG nodes, as well as some RVV instructions. They contain an AVL operand, which is set to the number of elements in the fixed-length vector.
 153    Some nodes also have a passthru or mask operand, which will usually be set to ``undef`` and all ones when lowering fixed-length vectors.
 154 3. The result is put back into a fixed-length vector via ``extract_subvector``.
 155
 156 .. code-block::
 157
 158        t2: nxv2i32,ch = CopyFromReg t0, Register:nxv2i32 %0
 159        t6: nxv2i32,ch = CopyFromReg t0, Register:nxv2i32 %1
 160      t4: v4i32 = extract_subvector t2, Constant:i64<0>
 161      t7: v4i32 = extract_subvector t6, Constant:i64<0>
 162    t8: v4i32 = add t4, t7
 163
 164    // is custom lowered to:
 165
 166        t2: nxv2i32,ch = CopyFromReg t0, Register:nxv2i32 %0
 167        t6: nxv2i32,ch = CopyFromReg t0, Register:nxv2i32 %1
 168        t15: nxv2i1 = RISCVISD::VMSET_VL Constant:i64<4>
 169      t16: nxv2i32 = RISCVISD::ADD_VL t2, t6, undef:nxv2i32, t15, Constant:i64<4>
 170    t17: v4i32 = extract_subvector t16, Constant:i64<0>
 171
 172 VL nodes often have a passthru or mask operand, which are usually set to ``undef`` and all ones for fixed-length vectors.
 173
 174 The ``insert_subvector`` and ``extract_subvector`` nodes responsible for wrapping and unwrapping will get combined away, and eventually we will lower all fixed-length vector types to scalable. Note that fixed-length vectors at the interface of a function are passed in a scalable vector container.
 175
 176 .. note::
 177
 178    The only ``insert_subvector`` and ``extract_subvector`` nodes that make it through lowering are those that can be performed as an exact subregister insert or extract. This means that any fixed-length vector ``insert_subvector`` and ``extract_subvector`` nodes that aren't legalized must lie on a register group boundary, so the exact VLEN must be known at compile time (i.e., compiled with ``-mrvv-vector-bits=zvl`` or ``-mllvm -riscv-v-vector-bits-max=VLEN``, or have an exact ``vscale_range`` attribute).
 179
 180 Vector predication intrinsics
 181 -----------------------------
 182
 183 VP intrinsics also get custom lowered via VL nodes.
 184
 185 .. code-block::
 186
 187    t12: nxv2i32 = vp_add t2, t4, t6, Constant:i64<8>
 188
 189    // is custom lowered to:
 190
 191    t18: nxv2i32 = RISCVISD::ADD_VL t2, t4, undef:nxv2i32, t6, Constant:i64<8>
 192
 193 The VP EVL and mask are used for the VL node's AVL and mask respectively, whilst the passthru is set to ``undef``.
 194
 195 Instruction selection
 196 =====================
 197
 198 ``vl`` and ``vtype`` need to be configured correctly, so we can't just directly select the underlying vector ``MachineInstr``. Instead pseudo instructions are selected, which carry the extra information needed to emit the necessary ``vsetvli``\s later.
 199
 200 .. code-block::
 201
 202    %c:vrm2 = PseudoVADD_VV_M2 %passthru:vrm2(tied-def 0), %a:vrm2, %b:vrm2, %vl:gpr, 5 /*sew*/, 3 /*policy*/
 203
 204 Each vector instruction has multiple pseudo instructions defined in ``RISCVInstrInfoVPseudos.td``.
 205 There is a variant of each pseudo for each possible LMUL, as well as a masked variant. So a typical instruction like ``vadd.vv`` would have the following pseudos:
 206
 207 .. code-block::
 208
 209    %rd:vr = PseudoVADD_VV_MF8 %passthru:vr(tied-def 0), %rs2:vr, %rs1:vr, %avl:gpr, sew:imm, policy:imm
 210    %rd:vr = PseudoVADD_VV_MF4 %passthru:vr(tied-def 0), %rs2:vr, %rs1:vr, %avl:gpr, sew:imm, policy:imm
 211    %rd:vr = PseudoVADD_VV_MF2 %passthru:vr(tied-def 0), %rs2:vr, %rs1:vr, %avl:gpr, sew:imm, policy:imm
 212    %rd:vr = PseudoVADD_VV_M1 %passthru:vr(tied-def 0), %rs2:vr, %rs1:vr, %avl:gpr, sew:imm, policy:imm
 213    %rd:vrm2 = PseudoVADD_VV_M2 %passthru:vrm2(tied-def 0), %rs2:vrm2, %rs1:vrm2, %avl:gpr, sew:imm, policy:imm
 214    %rd:vrm4 = PseudoVADD_VV_M4 %passthru:vrm4(tied-def 0), %rs2:vrm4, %rs1:vrm4, %avl:gpr, sew:imm, policy:imm
 215    %rd:vrm8 = PseudoVADD_VV_M8 %passthru:vrm8(tied-def 0), %rs2:vrm8, %rs1:vrm8, %avl:gpr, sew:imm, policy:imm
 216    %rd:vr = PseudoVADD_VV_MF8_MASK %passthru:vr(tied-def 0), %rs2:vr, %rs1:vr, mask:$v0, %avl:gpr, sew:imm, policy:imm
 217    %rd:vr = PseudoVADD_VV_MF4_MASK %passthru:vr(tied-def 0), %rs2:vr, %rs1:vr, mask:$v0, %avl:gpr, sew:imm, policy:imm
 218    %rd:vr = PseudoVADD_VV_MF2_MASK %passthru:vr(tied-def 0), %rs2:vr, %rs1:vr, mask:$v0, %avl:gpr, sew:imm, policy:imm
 219    %rd:vr = PseudoVADD_VV_M1_MASK %passthru:vr(tied-def 0), %rs2:vr, %rs1:vr, mask:$v0, %avl:gpr, sew:imm, policy:imm
 220    %rd:vrm2 = PseudoVADD_VV_M2_MASK %passthru:vrm2(tied-def 0), %rs2:vrm2, %%rs1:vrm2, mask:$v0, %avl:gpr, sew:imm, policy:imm
 221    %rd:vrm4 = PseudoVADD_VV_M4_MASK %passthru:vrm4(tied-def 0), %rs2:vrm4, %rs1:vrm4, mask:$v0, %avl:gpr, sew:imm, policy:imm
 222    %rd:vrm8 = PseudoVADD_VV_M8_MASK %passthru:vrm8(tied-def 0), %rs2:vrm8, %rs1:vrm8, mask:$v0, %avl:gpr, sew:imm, policy:imm
 223
 224 .. note::
 225
 226    Whilst the SEW can be encoded in an operand, we need to use separate pseudos for each LMUL since different register groups will require different register classes: see :ref:`rvv_register_allocation`.
 227
 228
 229 Pseudos have operands for the AVL and SEW (encoded as a power of 2), as well as potentially the mask, policy or rounding mode if applicable.
 230 The passthru operand is tied to the destination register which will determine the inactive/tail elements.
 231
 232 For scalable vectors that should use VLMAX, the AVL is set to a sentinel value of ``-1``.
 233
 234 There are patterns for target agnostic SelectionDAG nodes in ``RISCVInstrInfoVSDPatterns.td``, VL nodes in ``RISCVInstrInfoVVLPatterns.td`` and RVV intrinsics in ``RISCVInstrInfoVPseudos.td``.
 235
 236 Instructions that operate only on masks like VMAND or VMSBF uses pseudo instructions suffixed with B1, B2, B4, B8, B16, B32, or B64 where the number is SEW/LMUL representing
 237 the ratio between SEW and LMUL needed in vtype. These instructions always operate as if EEW=1 and always use a value of 0 as their SEW operand.
 238
 239 Mask patterns
 240 -------------
 241
 242 For masked pseudos the mask operand is copied to the physical ``$v0`` register during instruction selection with a glued ``CopyToReg`` node:
 243
 244 .. code-block::
 245
 246      t23: ch,glue = CopyToReg t0, Register:nxv4i1 $v0, t6
 247    t25: nxv4i32 = PseudoVADD_VV_M2_MASK Register:nxv4i32 $noreg, t2, t4, Register:nxv4i1 $v0, TargetConstant:i64<8>, TargetConstant:i64<5>, TargetConstant:i64<1>, t23:1
 248
 249 The patterns in ``RISCVInstrInfoVVLPatterns.td`` only match masked pseudos to reduce the size of the match table, even if the node's mask is all ones and could be an unmasked pseudo.
 250 ``RISCVFoldMasks::convertToUnmasked`` will detect if the mask is all ones and convert it into its unmasked form.
 251
 252 .. code-block::
 253
 254    $v0 = PseudoVMSET_M_B16 -1, 32
 255    %rd:vrm2 = PseudoVADD_VV_M2_MASK %passthru:vrm2(tied-def 0), %rs2:vrm2, %rs1:vrm2, $v0, %avl:gpr, sew:imm, policy:imm
 256
 257    // gets optimized to:
 258
 259    %rd:vrm2 = PseudoVADD_VV_M2 %passthru:vrm2(tied-def 0), %rs2:vrm2, %rs1:vrm2, %avl:gpr, sew:imm, policy:imm
 260
 261 .. note::
 262
 263    Any ``vmset.m`` can be treated as an all ones mask since the tail elements past AVL are ``undef`` and can be replaced with ones.
 264
 265 .. _rvv_register_allocation:
 266
 267 Register allocation
 268 ===================
 269
 270 Register allocation is split between vector and scalar registers, with vector allocation running first:
 271
 272 .. code-block::
 273
 274   $v8m2 = PseudoVADD_VV_M2 $v8m2(tied-def 0), $v8m2, $v10m2, %vl:gpr, 5, 3
 275
 276 .. note::
 277
 278    Register allocation is split so that :ref:`RISCVInsertVSETVLI` can run after vector register allocation, but before scalar register allocation. It needs to be run before scalar register allocation as it may need to create a new virtual register to set the AVL to VLMAX.
 279
 280    Performing ``RISCVInsertVSETVLI`` after vector register allocation imposes fewer constraints on the machine scheduler since it cannot schedule instructions past ``vsetvli``\s, and it allows us to emit further vector pseudos during spilling or constant rematerialization.
 281
 282 There are four register classes for vectors:
 283
 284 - ``VR`` for vector registers (``v0``, ``v1,``, ..., ``v32``). Used when :math:`\text{LMUL} \leq 1` and mask registers.
 285 - ``VRM2`` for vector groups of length 2 i.e., :math:`\text{LMUL}=2` (``v0m2``, ``v2m2``, ..., ``v30m2``)
 286 - ``VRM4`` for vector groups of length 4 i.e., :math:`\text{LMUL}=4` (``v0m4``, ``v4m4``, ..., ``v28m4``)
 287 - ``VRM8`` for vector groups of length 8 i.e., :math:`\text{LMUL}=8` (``v0m8``, ``v8m8``, ..., ``v24m8``)
 288
 289 :math:`\text{LMUL} \lt 1` types and mask types do not benefit from having a dedicated class, so ``VR`` is used in their case.
 290
 291 Some instructions have a constraint that a register operand cannot be ``V0`` or overlap with ``V0``, so for these cases we also have ``VRNoV0`` variants.
 292
 293 .. _RISCVInsertVSETVLI:
 294
 295 RISCVInsertVSETVLI
 296 ==================
 297
 298 After vector registers are allocated, the ``RISCVInsertVSETVLI`` pass will insert the necessary ``vsetvli``\s for the pseudos.
 299
 300 .. code-block::
 301
 302   dead $x0 = PseudoVSETVLI %vl:gpr, 209, implicit-def $vl, implicit-def $vtype
 303   $v8m2 = PseudoVADD_VV_M2 $v8m2(tied-def 0), $v8m2, $v10m2, $noreg, 5, implicit $vl, implicit $vtype
 304
 305 The physical ``$vl`` and ``$vtype`` registers are implicitly defined by the ``PseudoVSETVLI``, and are implicitly used by the ``PseudoVADD``.
 306 The ``vtype`` operand (``209`` in this example) is encoded as per the specification via ``RISCVVType::encodeVTYPE``.
 307
 308 ``RISCVInsertVSETVLI`` performs dataflow analysis to emit as few ``vsetvli``\s as possible. It will also try to minimize the number of ``vsetvli``\s that set VL, i.e., it will emit ``vsetvli x0, x0`` if only ``vtype`` needs changed but ``vl`` doesn't.
 309
 310 Pseudo expansion and printing
 311 =============================
 312
 313 After scalar register allocation, the ``RISCVExpandPseudoInsts.cpp`` pass expands the ``PseudoVSETVLI`` instructions.
 314
 315 .. code-block::
 316
 317    dead $x0 = VSETVLI $x1, 209, implicit-def $vtype, implicit-def $vl
 318    renamable $v8m2 = PseudoVADD_VV_M2 $v8m2(tied-def 0), $v8m2, $v10m2, $noreg, 5, implicit $vl, implicit $vtype
 319
 320 Note that the vector pseudo remains as it's needed to encode the register class for the LMUL. Its AVL and SEW operands are no longer used.
 321
 322 ``RISCVAsmPrinter`` will then lower the pseudo instructions into real ``MCInst``\s.
 323
 324 .. code-block:: nasm
 325
 326    vsetvli a0, zero, e32, m2, ta, ma
 327    vadd.vv v8, v8, v10
 328
 329
 330
 331 See also
 332 ========
 333
 334 - `[llvm-dev] [RFC] Code generation for RISC-V V-extension <https://lists.llvm.org/pipermail/llvm-dev/2020-October/145850.html>`_
 335 - `2023 LLVM Dev Mtg - Vector codegen in the RISC-V backend <https://youtu.be/-ox8iJmbp0c?feature=shared>`_
 336 - `2023 LLVM Dev Mtg - How to add an C intrinsic and code-gen it, using the RISC-V vector C intrinsics <https://youtu.be/t17O_bU1jks?feature=shared>`_
 337 - `2021 LLVM Dev Mtg “Optimizing code for scalable vector architectures” <https://youtu.be/daWLCyhwrZ8?feature=shared>`_