llvm/docs/VectorizationPlan.rst

   1 ==================
   2 Vectorization Plan
   3 ==================
   4
   5 .. contents::
   6    :local:
   7
   8 Abstract
   9 ========
  10 The vectorization transformation can be rather complicated, involving several
  11 potential alternatives, especially for outer-loops [1]_ but also possibly for
  12 innermost loops. These alternatives may have significant performance impact,
  13 both positive and negative. A cost model is therefore employed to identify the
  14 best alternative, including the alternative of avoiding any transformation
  15 altogether.
  16
  17 The Vectorization Plan is an explicit model for describing vectorization
  18 candidates. It serves for both optimizing candidates including estimating their
  19 cost reliably, and for performing their final translation into IR. This
  20 facilitates dealing with multiple vectorization candidates.
  21
  22 Current Status
  23 ==============
  24 VPlan is currently used to drive code-generation in LoopVectorize. VPlans are
  25 constructed after all cost-based and most legality-related decisions have been
  26 taken. As def-use chains between recipes are now fully modeled in VPlan,
  27 VPlan-based analyses and transformations are used to simplify and modularize
  28 the vectorization process [10]_. Those include transformations to
  29
  30 1. Legalize the initial VPlan, e.g. by introducing specialized recipes for
  31    reductions and interleave groups.
  32
  33 2. Optimize the legalized VPlan, e.g. by removing redundant recipes or
  34    introducing active-lane-masks.
  35
  36 3. Apply unroll- and vectorization-factor specific optimizations, e.g. removing
  37    the backedge to reiterate the vector loop based on VF & UF.
  38
  39 Refer to :numref:`fig-vplan-transform-pipeline` for an overview of the current
  40 transformation pipeline.
  41
  42 Note that some legality checks are already done in VPlan, including checking if
  43 all users of a fixed-order recurrence can be re-ordered. This is implemented as
  44 a VPlan-to-VPlan transformation that either applies a valid re-ordering or
  45 bails out marking the VPlan as invalid.
  46
  47 .. _fig-vplan-transform-pipeline:
  48 .. figure:: ./vplan-transform-pipeline.png
  49    :width: 800 px
  50
  51    VPlan Transformation Pipeline in 2024
  52
  53
  54 VPlan currently models the complete vector loop, as well as additional parts of
  55 the vectorization skeleton. Refer to :numref:`fig-vplan-scope` for an overview
  56 of the scope covered by VPlan.
  57
  58 .. _fig-vplan-scope:
  59 .. figure:: ./vplan-scope.png
  60    :width: 800 px
  61
  62    Scope modeled in VPlan in 2024
  63
  64
  65 High-level Design
  66 =================
  67
  68 Vectorization Workflow
  69 ----------------------
  70 VPlan-based vectorization involves three major steps, taking a "scenario-based
  71 approach" to vectorization planning:
  72
  73 1. Legal Step: check if a loop can be legally vectorized; encode constraints and
  74    artifacts if so.
  75 2. Plan Step:
  76
  77    a. Build initial VPlans following the constraints and decisions taken by
  78       Legal Step 1, and compute their cost.
  79    b. Apply optimizations to the VPlans, possibly forking additional VPlans.
  80       Prune sub-optimal VPlans having relatively high cost.
  81 3. Execute Step: materialize the best VPlan. Note that this is the only step
  82    that modifies the IR.
  83
  84 Design Guidelines
  85 -----------------
  86 In what follows, the term "input IR" refers to code that is fed into the
  87 vectorizer whereas the term "output IR" refers to code that is generated by the
  88 vectorizer. The output IR contains code that has been vectorized or "widened"
  89 according to a loop Vectorization Factor (VF), and/or loop unroll-and-jammed
  90 according to an Unroll Factor (UF).
  91 The design of VPlan follows several high-level guidelines:
  92
  93 1. Analysis-like: building and manipulating VPlans must not modify the input IR.
  94    In particular, if the best option is not to vectorize at all, the
  95    vectorization process terminates before reaching Step 3, and compilation
  96    should proceed as if VPlans had not been built.
  97
  98 2. Align Cost & Execute: each VPlan must support both estimating the cost and
  99    generating the output IR code, such that the cost estimation evaluates the
 100    to-be-generated code reliably.
 101
 102 3. Support vectorizing additional constructs:
 103
 104    a. Outer-loop vectorization. In particular, VPlan must be able to model the
 105       control-flow of the output IR which may include multiple basic-blocks and
 106       nested loops.
 107    b. SLP vectorization.
 108    c. Combinations of the above, including nested vectorization: vectorizing
 109       both an inner loop and an outer-loop at the same time (each with its own
 110       VF and UF), mixed vectorization: vectorizing a loop with SLP patterns
 111       inside [4]_, (re)vectorizing input IR containing vector code.
 112    d. Function vectorization [2]_.
 113
 114 4. Support multiple candidates efficiently. In particular, similar candidates
 115    related to a range of possible VF's and UF's must be represented efficiently.
 116    Potential versioning needs to be supported efficiently.
 117
 118 5. Support vectorizing idioms, such as interleaved groups of strided loads or
 119    stores. This is achieved by modeling a sequence of output instructions using
 120    a "Recipe", which is responsible for computing its cost and generating its
 121    code.
 122
 123 6. Encapsulate Single-Entry Single-Exit regions (SESE). During vectorization
 124    such regions may need to be, for example, predicated and linearized, or
 125    replicated VF*UF times to handle scalarized and predicated instructions.
 126    Innerloops are also modelled as SESE regions.
 127
 128 7. Support instruction-level analysis and transformation, as part of Planning
 129    Step 2.b: During vectorization instructions may need to be traversed, moved,
 130    replaced by other instructions or be created. For example, vector idiom
 131    detection and formation involves searching for and optimizing instruction
 132    patterns.
 133
 134 Definitions
 135 ===========
 136 The low-level design of VPlan comprises of the following classes.
 137
 138 :LoopVectorizationPlanner:
 139   A LoopVectorizationPlanner is designed to handle the vectorization of a loop
 140   or a loop nest. It can construct, optimize and discard one or more VPlans,
 141   each VPlan modelling a distinct way to vectorize the loop or the loop nest.
 142   Once the best VPlan is determined, including the best VF and UF, this VPlan
 143   drives the generation of output IR.
 144
 145 :VPlan:
 146   A model of a vectorized candidate for a given input IR loop or loop nest. This
 147   candidate is represented using a Hierarchical CFG. VPlan supports estimating
 148   the cost and driving the generation of the output IR code it represents.
 149
 150 :Hierarchical CFG:
 151   A control-flow graph whose nodes are basic-blocks or Hierarchical CFG's. The
 152   Hierarchical CFG data structure is similar to the Tile Tree [5]_, where
 153   cross-Tile edges are lifted to connect Tiles instead of the original
 154   basic-blocks as in Sharir [6]_, promoting the Tile encapsulation. The terms
 155   Region and Block are used rather than Tile [5]_ to avoid confusion with loop
 156   tiling.
 157
 158 :VPBlockBase:
 159   The building block of the Hierarchical CFG. A pure-virtual base-class of
 160   VPBasicBlock and VPRegionBlock, see below. VPBlockBase models the hierarchical
 161   control-flow relations with other VPBlocks. Note that in contrast to the IR
 162   BasicBlock, a VPBlockBase models its control-flow successors and predecessors
 163   directly, rather than through a Terminator branch or through predecessor
 164   branches that "use" the VPBlockBase.
 165
 166 :VPBasicBlock:
 167   VPBasicBlock is a subclass of VPBlockBase, and serves as the leaves of the
 168   Hierarchical CFG. It represents a sequence of output IR instructions that will
 169   appear consecutively in an output IR basic-block. The instructions of this
 170   basic-block originate from one or more VPBasicBlocks. VPBasicBlock holds a
 171   sequence of zero or more VPRecipes that model the cost and generation of the
 172   output IR instructions.
 173
 174 :VPRegionBlock:
 175   VPRegionBlock is a subclass of VPBlockBase. It models a collection of
 176   VPBasicBlocks and VPRegionBlocks which form a SESE subgraph of the output IR
 177   CFG. A VPRegionBlock may indicate that its contents are to be replicated a
 178   constant number of times when output IR is generated, effectively representing
 179   a loop with constant trip-count that will be completely unrolled. This is used
 180   to support scalarized and predicated instructions with a single model for
 181   multiple candidate VF's and UF's.
 182
 183 :VPRecipeBase:
 184   A pure-virtual base class modeling a sequence of one or more output IR
 185   instructions, possibly based on one or more input IR instructions. These
 186   input IR instructions are referred to as "Ingredients" of the Recipe. A Recipe
 187   may specify how its ingredients are to be transformed to produce the output IR
 188   instructions; e.g., cloned once, replicated multiple times or widened
 189   according to selected VF.
 190
 191 :VPValue:
 192   The base of VPlan's def-use relations class hierarchy. When instantiated, it
 193   models a constant or a live-in Value in VPlan. It has users, which are of type
 194   VPUser, but no operands.
 195
 196 :VPUser:
 197   A VPUser represents an entity that uses a number of VPValues as operands.
 198   VPUser is similar in some aspects to LLVM's User class.
 199
 200 :VPDef:
 201   A VPDef represents an entity that defines zero, one or multiple VPValues.
 202   It is used to model the fact that recipes in VPlan can define multiple
 203   VPValues.
 204
 205 :VPInstruction:
 206   A VPInstruction is a recipe characterized by a single opcode and optional
 207   flags, free of ingredients or other meta-data. VPInstructions also extend
 208   LLVM IR's opcodes with idiomatic operations that enrich the Vectorizer's
 209   semantics.
 210
 211 :VPTransformState:
 212   Stores information used for generating output IR, passed from
 213   LoopVectorizationPlanner to its selected VPlan for execution, and used to pass
 214   additional information down to VPBlocks and VPRecipes.
 215
 216 The Planning Process and VPlan Roadmap
 217 ======================================
 218
 219 Transforming the Loop Vectorizer to use VPlan follows a staged approach. First,
 220 VPlan was only used to record the final vectorization decisions, and to execute
 221 them: the Hierarchical CFG models the planned control-flow, and Recipes capture
 222 decisions taken inside basic-blocks. Currently, VPlan is used also as the basis
 223 for taking these decisions, effectively turning them into a series of
 224 VPlan-to-VPlan algorithms. Finally, VPlan will support the planning process
 225 itself including cost-based analyses for making these decisions, to fully
 226 support compositional and iterative decision making.
 227
 228 Some decisions are local to an instruction in the loop, such as whether to widen
 229 it into a vector instruction or replicate it, keeping the generated instructions
 230 in place. Other decisions, however, involve moving instructions, replacing them
 231 with other instructions, and/or introducing new instructions. For example, a
 232 cast may sink past a later instruction and be widened to handle first-order
 233 recurrence; an interleave group of strided gathers or scatters may effectively
 234 move to one place where they are replaced with shuffles and a common wide vector
 235 load or store; new instructions may be introduced to compute masks, shuffle the
 236 elements of vectors, and pack scalar values into vectors or vice-versa.
 237
 238 In order for VPlan to support making instruction-level decisions and analyses,
 239 it needs to model the relevant instructions along with their def/use relations.
 240 This too follows a staged approach: first, the new instructions that compute
 241 masks are modeled as VPInstructions, along with their induced def/use subgraph.
 242 This effectively models masks in VPlan, facilitating VPlan-based predication.
 243 Next, the logic embedded within each Recipe for generating its instructions at
 244 VPlan execution time, will instead take part in the planning process by modeling
 245 them as VPInstructions. Finally, only logic that applies to instructions as a
 246 group will remain in Recipes, such as interleave groups and potentially other
 247 idiom groups having synergistic cost.
 248
 249 Related LLVM components
 250 -----------------------
 251 1. SLP Vectorizer: one can compare the VPlan model with LLVM's existing SLP
 252    tree, where TSLP [3]_ adds Plan Step 2.b.
 253
 254 2. RegionInfo: one can compare VPlan's H-CFG with the Region Analysis as used by
 255    Polly [7]_.
 256
 257 3. Loop Vectorizer: the Vectorization Plan aims to upgrade the infrastructure of
 258    the Loop Vectorizer and extend it to handle outer loops [8]_, [9]_.
 259
 260 References
 261 ----------
 262 .. [1] "Outer-loop vectorization: revisited for short SIMD architectures", Dorit
 263     Nuzman and Ayal Zaks, PACT 2008.
 264
 265 .. [2] "Proposal for function vectorization and loop vectorization with function
 266     calls", Xinmin Tian, [`cfe-dev
 267     <http://lists.llvm.org/pipermail/cfe-dev/2016-March/047732.html>`_].,
 268     March 2, 2016.
 269     See also `review <https://reviews.llvm.org/D22792>`_.
 270
 271 .. [3] "Throttling Automatic Vectorization: When Less is More", Vasileios
 272     Porpodas and Tim Jones, PACT 2015 and LLVM Developers' Meeting 2015.
 273
 274 .. [4] "Exploiting mixed SIMD parallelism by reducing data reorganization
 275     overhead", Hao Zhou and Jingling Xue, CGO 2016.
 276
 277 .. [5] "Register Allocation via Hierarchical Graph Coloring", David Callahan and
 278     Brian Koblenz, PLDI 1991
 279
 280 .. [6] "Structural analysis: A new approach to flow analysis in optimizing
 281     compilers", M. Sharir, Journal of Computer Languages, Jan. 1980
 282
 283 .. [7] "Enabling Polyhedral Optimizations in LLVM", Tobias Grosser, Diploma
 284     thesis, 2011.
 285
 286 .. [8] "Introducing VPlan to the Loop Vectorizer", Gil Rapaport and Ayal Zaks,
 287     European LLVM Developers' Meeting 2017.
 288
 289 .. [9] "Extending LoopVectorizer: OpenMP4.5 SIMD and Outer Loop
 290     Auto-Vectorization", Intel Vectorizer Team, LLVM Developers' Meeting 2016.
 291
 292 .. [10] "VPlan: Status Update and Future Roadmap", Ayal Zaks and Florian Hahn,
 293     LLVM Developers' Meeting 2023, https://www.youtube.com/watch?v=SzGP4PgMuLE