docs/Proposals/VectorizationPlan.rst

   1 ==================
   2 Vectorization Plan
   3 ==================
   4
   5 .. contents::
   6    :local:
   7
   8 Abstract
   9 ========
  10 The vectorization transformation can be rather complicated, involving several
  11 potential alternatives, especially for outer-loops [1]_ but also possibly for
  12 innermost loops. These alternatives may have significant performance impact,
  13 both positive and negative. A cost model is therefore employed to identify the
  14 best alternative, including the alternative of avoiding any transformation
  15 altogether.
  16
  17 The Vectorization Plan is an explicit model for describing vectorization
  18 candidates. It serves for both optimizing candidates including estimating their
  19 cost reliably, and for performing their final translation into IR. This
  20 facilitates dealing with multiple vectorization candidates.
  21
  22 High-level Design
  23 =================
  24
  25 Vectorization Workflow
  26 ----------------------
  27 VPlan-based vectorization involves three major steps, taking a "scenario-based
  28 approach" to vectorization planning:
  29
  30 1. Legal Step: check if a loop can be legally vectorized; encode constraints and
  31    artifacts if so.
  32 2. Plan Step:
  33
  34    a. Build initial VPlans following the constraints and decisions taken by
  35       Legal Step 1, and compute their cost.
  36    b. Apply optimizations to the VPlans, possibly forking additional VPlans.
  37       Prune sub-optimal VPlans having relatively high cost.
  38 3. Execute Step: materialize the best VPlan. Note that this is the only step
  39    that modifies the IR.
  40
  41 Design Guidelines
  42 -----------------
  43 In what follows, the term "input IR" refers to code that is fed into the
  44 vectorizer whereas the term "output IR" refers to code that is generated by the
  45 vectorizer. The output IR contains code that has been vectorized or "widened"
  46 according to a loop Vectorization Factor (VF), and/or loop unroll-and-jammed
  47 according to an Unroll Factor (UF).
  48 The design of VPlan follows several high-level guidelines:
  49
  50 1. Analysis-like: building and manipulating VPlans must not modify the input IR.
  51    In particular, if the best option is not to vectorize at all, the
  52    vectorization process terminates before reaching Step 3, and compilation
  53    should proceed as if VPlans had not been built.
  54
  55 2. Align Cost & Execute: each VPlan must support both estimating the cost and
  56    generating the output IR code, such that the cost estimation evaluates the
  57    to-be-generated code reliably.
  58
  59 3. Support vectorizing additional constructs:
  60
  61    a. Outer-loop vectorization. In particular, VPlan must be able to model the
  62       control-flow of the output IR which may include multiple basic-blocks and
  63       nested loops.
  64    b. SLP vectorization.
  65    c. Combinations of the above, including nested vectorization: vectorizing
  66       both an inner loop and an outer-loop at the same time (each with its own
  67       VF and UF), mixed vectorization: vectorizing a loop with SLP patterns
  68       inside [4]_, (re)vectorizing input IR containing vector code.
  69    d. Function vectorization [2]_.
  70
  71 4. Support multiple candidates efficiently. In particular, similar candidates
  72    related to a range of possible VF's and UF's must be represented efficiently.
  73    Potential versioning needs to be supported efficiently.
  74
  75 5. Support vectorizing idioms, such as interleaved groups of strided loads or
  76    stores. This is achieved by modeling a sequence of output instructions using
  77    a "Recipe", which is responsible for computing its cost and generating its
  78    code.
  79
  80 6. Encapsulate Single-Entry Single-Exit regions (SESE). During vectorization
  81    such regions may need to be, for example, predicated and linearized, or
  82    replicated VF*UF times to handle scalarized and predicated instructions.
  83    Innerloops are also modelled as SESE regions.
  84
  85 Low-level Design
  86 ================
  87 The low-level design of VPlan comprises of the following classes.
  88
  89 :LoopVectorizationPlanner:
  90   A LoopVectorizationPlanner is designed to handle the vectorization of a loop
  91   or a loop nest. It can construct, optimize and discard one or more VPlans,
  92   each VPlan modelling a distinct way to vectorize the loop or the loop nest.
  93   Once the best VPlan is determined, including the best VF and UF, this VPlan
  94   drives the generation of output IR.
  95
  96 :VPlan:
  97   A model of a vectorized candidate for a given input IR loop or loop nest. This
  98   candidate is represented using a Hierarchical CFG. VPlan supports estimating
  99   the cost and driving the generation of the output IR code it represents.
 100
 101 :Hierarchical CFG:
 102   A control-flow graph whose nodes are basic-blocks or Hierarchical CFG's. The
 103   Hierarchical CFG data structure is similar to the Tile Tree [5]_, where
 104   cross-Tile edges are lifted to connect Tiles instead of the original
 105   basic-blocks as in Sharir [6]_, promoting the Tile encapsulation. The terms
 106   Region and Block are used rather than Tile [5]_ to avoid confusion with loop
 107   tiling.
 108
 109 :VPBlockBase:
 110   The building block of the Hierarchical CFG. A pure-virtual base-class of
 111   VPBasicBlock and VPRegionBlock, see below. VPBlockBase models the hierarchical
 112   control-flow relations with other VPBlocks. Note that in contrast to the IR
 113   BasicBlock, a VPBlockBase models its control-flow successors and predecessors
 114   directly, rather than through a Terminator branch or through predecessor
 115   branches that "use" the VPBlockBase.
 116
 117 :VPBasicBlock:
 118   VPBasicBlock is a subclass of VPBlockBase, and serves as the leaves of the
 119   Hierarchical CFG. It represents a sequence of output IR instructions that will
 120   appear consecutively in an output IR basic-block. The instructions of this
 121   basic-block originate from one or more VPBasicBlocks. VPBasicBlock holds a
 122   sequence of zero or more VPRecipes that model the cost and generation of the
 123   output IR instructions.
 124
 125 :VPRegionBlock:
 126   VPRegionBlock is a subclass of VPBlockBase. It models a collection of
 127   VPBasicBlocks and VPRegionBlocks which form a SESE subgraph of the output IR
 128   CFG. A VPRegionBlock may indicate that its contents are to be replicated a
 129   constant number of times when output IR is generated, effectively representing
 130   a loop with constant trip-count that will be completely unrolled. This is used
 131   to support scalarized and predicated instructions with a single model for
 132   multiple candidate VF's and UF's.
 133
 134 :VPRecipeBase:
 135   A pure-virtual base class modeling a sequence of one or more output IR
 136   instructions, possibly based on one or more input IR instructions. These
 137   input IR instructions are referred to as "Ingredients" of the Recipe. A Recipe
 138   may specify how its ingredients are to be transformed to produce the output IR
 139   instructions; e.g., cloned once, replicated multiple times or widened
 140   according to selected VF.
 141
 142 :VPTransformState:
 143   Stores information used for generating output IR, passed from
 144   LoopVectorizationPlanner to its selected VPlan for execution, and used to pass
 145   additional information down to VPBlocks and VPRecipes.
 146
 147 Related LLVM components
 148 -----------------------
 149 1. SLP Vectorizer: one can compare the VPlan model with LLVM's existing SLP
 150    tree, where TSLP [3]_ adds Plan Step 2.b.
 151
 152 2. RegionInfo: one can compare VPlan's H-CFG with the Region Analysis as used by
 153    Polly [7]_.
 154
 155 References
 156 ----------
 157 .. [1] "Outer-loop vectorization: revisited for short SIMD architectures", Dorit
 158     Nuzman and Ayal Zaks, PACT 2008.
 159
 160 .. [2] "Proposal for function vectorization and loop vectorization with function
 161     calls", Xinmin Tian, [`cfe-dev
 162     <http://lists.llvm.org/pipermail/cfe-dev/2016-March/047732.html>`_].,
 163     March 2, 2016.
 164     See also `review <https://reviews.llvm.org/D22792>`_.
 165
 166 .. [3] "Throttling Automatic Vectorization: When Less is More", Vasileios
 167     Porpodas and Tim Jones, PACT 2015 and LLVM Developers' Meeting 2015.
 168
 169 .. [4] "Exploiting mixed SIMD parallelism by reducing data reorganization
 170     overhead", Hao Zhou and Jingling Xue, CGO 2016.
 171
 172 .. [5] "Register Allocation via Hierarchical Graph Coloring", David Callahan and
 173     Brian Koblenz, PLDI 1991
 174
 175 .. [6] "Structural analysis: A new approach to flow analysis in optimizing
 176     compilers", M. Sharir, Journal of Computer Languages, Jan. 1980
 177
 178 .. [7] "Enabling Polyhedral Optimizations in LLVM", Tobias Grosser, Diploma
 179     thesis, 2011.
 180
 181 .. [8] "Introducing VPlan to the Loop Vectorizer", Gil Rapaport and Ayal Zaks,
 182     European LLVM Developers' Meeting 2017.