3 This document outlines the design of the MLIR quantization system. While the
4 term "quantization" is highly overloaded, in this case, it refers to a fairly
narrow scope of techniques used to convert floating-point
6 computations to corresponding and plausible variants expressed in integer math
7 for inference, as has historically been supported by low-bit depth inference
8 engines such as TFLite, various accelerator hardware, and many DSPs.
10 Much of this is inspired by the approach taken
11 [in this paper](https://arxiv.org/abs/1712.05877) with many extensions and
12 adaptations folded in. It specifically documents the positions that MLIR has
13 taken on the topic, and is not a general reference.
17 ## Uniform quantization
The primary quantization mechanism supported by MLIR is a scheme which can
express fixed point and affine transformations via uniformly spaced points on
the Real number line.
23 Further, the scheme can be applied:
25 * *per-layer* : Applying to every value within the target type.
*   *per-axis* (also called *per-channel*) : Applying individually to each
    index along a specific axis of a tensor type (a short sketch follows this
    list).
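To make the distinction concrete, a minimal numpy sketch with hypothetical
scale values (zero points elided for brevity):

```python
import numpy as np

# A 2x3 tensor of Real values; axis 0 might be output channels.
real = np.array([[0.0, 1.5, -3.0],
                 [2.5, -0.5, 4.0]])

# Per-layer: a single scale applies to every element of the tensor.
per_layer_scale = 0.5
per_layer_scaled = np.round(real / per_layer_scale)

# Per-axis: one scale per index along axis 0.
per_axis_scales = np.array([0.5, 1.0]).reshape(-1, 1)
per_axis_scaled = np.round(real / per_axis_scales)
```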
29 ### Fixed point values
[Fixed point](https://en.wikipedia.org/wiki/Fixed-point_arithmetic) values are
[Real](https://en.wikipedia.org/wiki/Real_number) numbers divided by a *scale*.
We will call the result of this division the *scaled value*.
35 $$ real\_value = scaled\_value * scale $$
37 The scale can be interpreted as the distance, in Real units, between neighboring
38 scaled values. For example, if the scale is $$ \pi $$, then fixed point values
39 with this scale can only represent multiples of $$ \pi $$, and nothing in
40 between. The maximum rounding error to convert an arbitrary Real to a fixed
41 point value with a given $$ scale $$ is $$ \frac{scale}{2} $$. Continuing the
previous example, when $$ scale = \pi $$, the maximum rounding error will be
$$ \frac{\pi}{2} $$.
45 Multiplication can be performed on scaled values with different scales, using
46 the same algorithm as multiplication of Real values (note that product scaled
47 value has $$ scale_{product} = scale_{left \mbox{ } operand} * scale_{right
48 \mbox{ } operand} $$). Addition can be performed on scaled values, as long as
49 they have the same scale, using the same algorithm as addition of Real values.
50 This makes it convenient to represent scaled values on a computer as signed
51 integers, and perform arithmetic on those signed integers, because the results
52 will be correct scaled values.
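To make the scale bookkeeping concrete, a small Python sketch with hypothetical
values:

```python
# Two Real values represented as scaled values with different scales.
scale_a, scale_b = 0.5, 0.25
a_scaled = round(3.0 / scale_a)  # 6, represents 3.0
b_scaled = round(1.5 / scale_b)  # 6, represents 1.5

# Multiplication works across scales; the product's scale is the product
# of the operands' scales.
product_scaled = a_scaled * b_scaled  # 36
product_scale = scale_a * scale_b     # 0.125
assert product_scaled * product_scale == 3.0 * 1.5

# Addition requires a common scale.
c_scaled = round(2.0 / scale_a)  # 4, represents 2.0
assert (a_scaled + c_scaled) * scale_a == 3.0 + 2.0
```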
### Affine values

Mathematically speaking, affine values are the result of
[adding a Real-valued *zero point* to a scaled value](https://en.wikipedia.org/wiki/Affine_transformation#Representation).
Or equivalently, subtracting a zero point from an affine value results in a
scaled value:
61 $$ real\_value = scaled\_value * scale = (affine\_value - zero\_point) * scale $$
63 Essentially, affine values are a shifting of the scaled values by some constant
64 amount. Arithmetic (i.e., addition, subtraction, multiplication, division)
65 cannot, in general, be directly performed on affine values; you must first
66 [convert](#affine-to-fixed-point) them to the equivalent scaled values.
68 As alluded to above, the motivation for using affine values is to more
69 efficiently represent the Real values that will actually be encountered during
70 computation. Frequently, the Real values that will be encountered are not
71 symmetric around the Real zero. We also make the assumption that the Real zero
72 is encountered during computation, and should thus be represented.
74 In this case, it's inefficient to store scaled values represented by signed
75 integers, as some of the signed integers will never be used. The bit patterns
76 corresponding to those signed integers are going to waste.
78 In order to exactly represent the Real zero with an integral-valued affine
79 value, the zero point must be an integer between the minimum and maximum affine
80 value (inclusive). For example, given an affine value represented by an 8 bit
81 unsigned integer, we have: $$ 0 \leq zero\_point \leq 255$$. This is important,
82 because in deep neural networks' convolution-like operations, we frequently
83 need to zero-pad inputs and outputs, so zero must be exactly representable, or
84 the result will be biased.
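As a sketch of how these constraints shape parameter selection, one common way
to derive a scale and an integral zero point for a uint8 affine representation
from an observed Real range (the helper and its inputs are illustrative, not an
MLIR API):

```python
def choose_affine_params(real_min, real_max, qmin=0, qmax=255):
    # Widen the range to include the Real zero so that it is exactly
    # representable (avoiding bias in zero-padded operations).
    real_min = min(real_min, 0.0)
    real_max = max(real_max, 0.0)
    scale = (real_max - real_min) / (qmax - qmin)
    # The zero point must be an integer in [qmin, qmax].
    zero_point = int(round(qmin - real_min / scale))
    return scale, max(qmin, min(qmax, zero_point))

scale, zero_point = choose_affine_params(-1.0, 3.0)
# Quantizing 0.0 yields exactly zero_point, so the Real zero round-trips.
```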
### Relation

Real values, fixed point values, and affine values relate through the following
equation, which demonstrates how to convert one type of number to another:
91 $$ real\_value = scaled\_value * scale = (affine\_value - zero\_point) * scale $$
93 Note that computers generally store mathematical values using a finite number of
94 bits. Thus, while the above conversions are exact, to store the result in a
95 finite number of bits, we must, in general, round the result of the conversion
96 (this applies to both cases: storing using floating point and storing using
97 fixed point). Note that a full discussion of rounding behavior is outside the
98 scope of this document, and it is safe to assume unless otherwise stated that
rounding should be according to the IEEE754 default of RNE (where hardware
permits).
102 ### Converting between Real and fixed point or affine
104 To convert a Real value to a fixed point value, you must know the scale. To
105 convert a Real value to an affine value, you must know the scale and zero point.
#### Real to affine

To convert an input tensor of Real-valued elements (usually represented by a
110 floating point format, frequently
111 [Single precision](https://en.wikipedia.org/wiki/Single-precision_floating-point_format))
112 to a tensor of affine elements represented by an integral type (e.g. 8-bit
113 unsigned integer), the following conversion can be performed (note that it is
114 not required that all representable values of the integral type are used):
$$
\begin{align*}
af&fine\_value_{uint8 \, or \, uint16} \\
      &= clampToTargetSize(roundToNearestInteger( \frac{real\_value_{Single}}{scale_{Single}})_{sint32} + zero\_point_{uint8 \, or \, uint16})
\end{align*}
$$
123 In the above, we assume that $$real\_value$$ is a Single, $$scale$$ is a Single,
124 $$roundToNearestInteger$$ returns a signed 32 bit integer, and $$zero\_point$$
is an unsigned 8 or 16 bit integer. Note that the bit depths and numbers of
fixed point values above are indicative of common types on typical hardware;
the scheme is not constrained to particular bit depths, nor is it required that
the entire range of an N-bit integer be used.
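A direct transcription of this conversion into Python (the function name is
illustrative; note that Python's round implements round-half-even, matching the
RNE default noted above):

```python
def quantize(real_value, scale, zero_point, qmin=0, qmax=255):
    # roundToNearestInteger, carried out in wider signed arithmetic.
    rounded = round(real_value / scale)
    # Add the zero point, then clamp to the storage type's range.
    return max(qmin, min(qmax, rounded + zero_point))
```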
#### Affine to Real

To convert an output tensor of affine elements represented by uint8
or uint16 to a tensor of Real-valued elements (usually represented with a
floating point format, frequently Single precision), the following conversion
can be performed:
$$
\begin{align*}
re&al\_value_{Single} \\
      &= roundToNearestFloat((affine\_value_{uint8 \, or \, uint16} - zero\_point_{uint8 \, or \, uint16})_{sint32})_{Single} * scale_{Single}
\end{align*}
$$
144 In the above, we assume that the result of subtraction is in 32-bit signed
145 integer format, and that $$roundToNearestFloat$$ returns a Single.
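And the inverse as a Python sketch (again, the name is illustrative):

```python
def dequantize(affine_value, scale, zero_point):
    # The subtraction occurs in signed integer arithmetic; the result is
    # then converted to floating point and multiplied by the scale.
    return float(affine_value - zero_point) * scale
```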
147 #### Affine to fixed point
149 When the affine and fixed point scales are the same, subtract the zero point
150 from the affine value to get the equivalent fixed point value.
$$
scaled\_value = affine\_value_{non\mbox{-}negative} - zero\_point_{non\mbox{-}negative}
$$
156 #### Fixed point to affine
158 When the affine and fixed point scales are the same, add the zero point to the
159 fixed point value to get the equivalent affine value.
$$
affine\_value_{non\mbox{-}negative} = scaled\_value + zero\_point_{non\mbox{-}negative}
$$
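Putting the two shifts together, a sketch of why the conversion matters: to add
two affine values that share a scale and zero point, convert each operand to
fixed point, add, and convert back (the helper is illustrative):

```python
def add_affine(lhs_affine, rhs_affine, zero_point):
    # Affine -> fixed point: subtract the zero point from each operand.
    lhs_scaled = lhs_affine - zero_point
    rhs_scaled = rhs_affine - zero_point
    # Fixed point addition at the common scale, then back to affine.
    return (lhs_scaled + rhs_scaled) + zero_point
```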
## Usage within MLIR

There are several components to the quantization system being developed within
MLIR:
170 * *Quantization* dialect containing:
172 * A family of [QuantizedTypes](#quantized-type) which represent the
173 mapping between *expressed* values (typically of a floating point
computer type) and *storage* values (typically of an integral computer
type).
176 * [Type conversion ops](#quantized-type-conversion-ops) for converting
between types based on a QuantizedType and its *expressed* and *storage*
values.
179 * [Instrumentation ops](#instrumentation-and-constraint-ops) for assigning
180 instrumentation points within the computation where runtime statistics
181 may help guide the quantization process.
183 * [Integration with simulated quantization at training time](#integration-with-simulated-quantization-at-training-time)
185 * [TFLite native quantization](#tflite-native-quantization)
187 * The TFLite op-set natively supports uniform-quantized variants.
188 * Passes and tools exist to convert directly from the *TensorFlow* dialect
189 to the TFLite quantized op-set.
191 * [*FxpMath* dialect](#fxpmath-dialect) containing (experimental) generalized
192 representations of fixed-point math ops and conversions:
194 * [Real math ops](#real-math-ops) representing common combinations of
195 arithmetic operations that closely match corresponding fixed-point math
concepts (as opposed to being spread across multiple ops as is typical in
source dialects).
*   [Fixed-point math ops](#fixed-point-math-ops) for carrying out
    computations on integers, as are typically needed by uniform quantization
    schemes.
201 * Passes to lower from real math ops to fixed-point math ops.
*   [Solver tools](#solver-tools) which can (experimentally and generically)
204 operate on computations expressed in the *FxpMath* dialect in order to
205 convert from floating point types to appropriate *QuantizedTypes*, allowing
206 the computation to be further lowered to integral math ops.
208 Not every application of quantization will use all facilities. Specifically, the
209 TensorFlow to TensorFlow Lite conversion uses the QuantizedTypes but has its own
210 ops for type conversion and expression of the backing math.
212 ## Quantization Dialect
### Quantized type

TODO: Flesh this section out.
218 * QuantizedType base class
219 * UniformQuantizedType
221 ### Quantized type conversion ops
223 * qcast : Convert from an expressed type to QuantizedType
224 * dcast : Convert from a QuantizedType to its expressed type
225 * scast : Convert between a QuantizedType and its storage type
227 ### Instrumentation and constraint ops
229 * const_fake_quant : Emulates the logic of the historic TensorFlow
230 fake_quant_with_min_max_args op.
231 * stats_ref : Declares that statistics should be gathered at this point with a
232 unique key and made available to future passes of the solver.
233 * stats : Declares inline statistics (per layer and per axis) for the point in
234 the computation. stats_ref ops are generally converted to stats ops once
235 trial runs have been performed.
236 * coupled_ref : Declares points in the computation to be coupled from a type
237 inference perspective based on a unique key.
239 ## Integration with simulated quantization at training time
241 TensorFlow has historically used the
242 [tf.quantization.fake_quant_\*](https://www.tensorflow.org/api_docs/python/tf/quantization/fake_quant_with_min_max_args)
243 family of operations to simulate the effect of quantization at training time.
245 As originally implemented, TensorFlow Lite was the primary user of such
246 operations at inference time. When quantized inference was enabled, if every
247 eligible tensor passed through an appropriate fake_quant node (the rules of
248 which tensors can have fake_quant applied are somewhat involved), then
249 TensorFlow Lite would use the attributes of the fake_quant ops to make a
250 judgment about how to convert to use kernels from its quantized ops subset.
252 In MLIR-based quantization, fake_quant_\* ops are handled by converting them to
253 a sequence of *qcast* (quantize) followed by *dcast* (dequantize) with an
254 appropriate *UniformQuantizedType* as the target of the qcast operation.
256 This allows subsequent compiler passes to preserve the knowledge that
257 quantization was simulated in a certain way while giving the compiler
258 flexibility to move the casts as it simplifies the computation and converts it
259 to a form based on integral arithmetic.
261 This scheme also naturally allows computations that are *partially quantized*
262 where the parts which could not be reduced to integral ops are still carried out
263 in floating point with appropriate conversions at the boundaries.
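Numerically, the qcast/dcast pair reproduces what fake_quant simulated: the
value is quantized and immediately dequantized, so downstream computation sees
the rounding and clamping error that integral execution would introduce. A
minimal sketch in terms of the hypothetical quantize/dequantize helpers above:

```python
def simulated_quantize(real_value, scale, zero_point):
    # qcast: convert the Real value to its quantized storage value.
    q = quantize(real_value, scale, zero_point)
    # dcast: convert back to a Real value, now carrying quantization error.
    return dequantize(q, scale, zero_point)
```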
265 ## TFLite Native Quantization
TODO: Flesh this out.
269 ### General algorithm
1.  Take input min/max information and set the ArrayInfo (which really is
    InputOrOutputArrayInfo).
1.  In LegalizeTF, convert ArrayInfo min/max to tf.Quantize and tf.Dequantize
    nodes (or tf.FakeQuant). Convert all constant FakeQuants to (tf.FQ ->
    tfl.Q -> tfl.DQ).
1.  Hardcoded logic/propagation needs to happen here.
277 1. Run TF constant folding.
278 1. In PrepareTFL, convert all tf.FQ to (tfl.Q -> tfl.DQ).
1.  Run the quantization pass that takes (tfl.DQ (for both input and weights)
    -> op -> tfl.Q) and replaces it with (op). Also replace (constant_float ->
    tfl.Q) with (constant_quant).
## FxpMath Dialect

### Real math ops

Note that these all support explicit clamps, which allows for simple fusions
and representation of some common sequences of quantization-compatible math. In
addition, some support explicit biases, which are often represented as separate
adds in source dialects.
292 TODO: This op set is still evolving and needs to be completed.
315 ### Fixed-point math ops
TODO: This op set only has enough ops to lower a simple power-of-two
RealAddEwOp.
320 * RoundingDivideByPotFxpOp
## Solver tools

Solver tools exist to analyze an MLIR computation, expressed in either a
supported source dialect or in the *real math ops* set, and solve for
appropriate QuantizedTypes that allow the computation to be lowered to integral
math.
329 These tools are an active area of work and may be expanded in the future to
330 adjacent areas such as solving for transformations to other kinds of lower
precision types (e.g. bfloat16 or fp16).
333 Solver tools are expected to operate in several modes, depending on the
334 computation and the manner in which it was trained:
336 * *Transform* : With all available information in the MLIR computation, infer
337 boundaries where the computation can be carried out with integral math and
338 change types accordingly to appropriate QuantizedTypes:
340 * For passthrough ops which do not perform active math, change them to
operate directly on the storage type, converting in and out at the edges
via scast ops.
343 * For ops that have the *Quantizable* trait, the type can be set directly.
This includes ops from the [real math ops set](#real-math-ops).
345 * For others, encase them in appropriate dcast/qcast ops, presuming that
346 some follow-on pass will know what to do with them.
348 * *Instrument* : Most of the time, there are not sufficient implied
349 constraints within a computation to perform many transformations. For this
350 reason, the solver can insert instrumentation ops at points where additional
351 runtime statistics may yield solutions. It is expected that such
352 computations will be lowered as-is for execution, run over an appropriate
353 eval set, and statistics at each instrumentation point made available for a
354 future invocation of the solver.
356 * *Simplify* : A variety of passes and simplifications are applied once
QuantizedTypes are added in order to arrive at a computation that is expressed
with as much integral math, and as few casts, as possible.