mlir/docs/Quantization.md

   1 # Quantization
   2
   3 This document outlines the design of the MLIR quantization system. While the
   4 term "quantization" is highly overloaded, in this case, it refers to a fairly
   5 narrow scope of techniques in use to enable conversion of floating-point
   6 computations to corresponding and plausible variants expressed in integer math
   7 for inference, as has historically been supported by low-bit depth inference
   8 engines such as TFLite, various accelerator hardware, and many DSPs.
   9
  10 Much of this is inspired by the approach taken
  11 [in this paper](https://arxiv.org/abs/1712.05877) with many extensions and
  12 adaptations folded in. It specifically documents the positions that MLIR has
  13 taken on the topic, and is not a general reference.
  14
  15 [TOC]
  16
  17 ## Uniform quantization
  18
  19 The primary quantization mechanism supported by MLIR is a scheme which can
  20 express fixed point and affine transformations via uniformly spaced point on the
  21 [Real](https://en.wikipedia.org/wiki/Real_number) number line.
  22
  23 Further, the scheme can be applied:
  24
  25 *   *per-layer* : Applying to every value within the target type.
  26 *   *per-axis* (also called *per-channel*) : Applying individually to each index
  27     along a specific axis of a tensor type.
  28
  29 ### Fixed point values
  30
  31 [Fixed point](https://en.wikipedia.org/wiki/Fixed-point_arithmetic) values are a
  32 [Real](https://en.wikipedia.org/wiki/Real_number) number divided by a *scale*.
  33 We will call the result of the divided real the *scaled value*.
  34
  35 $$ real\\_value = scaled\\_value * scale $$
  36
  37 The scale can be interpreted as the distance, in real units, between neighboring
  38 scaled values. For example, if the scale is $ \pi $, then fixed point values
  39 with this scale can only represent multiples of $ \pi $, and nothing in
  40 between. The maximum rounding error to convert an arbitrary Real to a fixed
  41 point value with a given $ scale $ is $ \frac{scale}{2} $. Continuing the
  42 previous example, when $ scale = \pi $, the maximum rounding error will be $
  43 \frac{\pi}{2} $.
  44
  45 Multiplication can be performed on scaled values with different scales, using
  46 the same algorithm as multiplication of real values (note that product scaled
  47 value has $ scale_{product} = scale_{left \mbox{ } operand} * scale_{right
  48 \mbox{ } operand} $). Addition can be performed on scaled values, so long as
  49 they have the same scale, using the same algorithm for addition of real values.
  50 This makes it convenient to represent scaled values on a computer as signed
  51 integers, and perform arithmetic on those signed integers, because the results
  52 will be correct scaled values.
  53
  54 ### Affine values
  55
  56 Mathematically speaking, affine values are the result of
  57 [adding a Real-valued *zero point*, to a scaled value](https://en.wikipedia.org/wiki/Affine_transformation#Representation).
  58 Alternatively (and equivalently), subtracting a zero point from an affine value results in a
  59 scaled value:
  60
  61 $$ real\\_value = scaled\\_value * scale = (affine\\_value - zero\\_point) * scale $$
  62
  63 Essentially, affine values are a shift of the scaled values by some constant
  64 amount. Arithmetic (i.e., addition, subtraction, multiplication, division)
  65 cannot, in general, be directly performed on affine values; they must first be
  66 [converted](#affine-to-fixed-point) to the equivalent scaled values.
  67
  68 As alluded to above, the motivation for using affine values is to more
  69 efficiently represent real values that will actually be encountered during
  70 computation. Frequently, real values that will be encountered are not
  71 symmetric around the real zero. We also make the assumption that the real zero
  72 is encountered during computation, and should thus be represented.
  73
  74 In this case, it is inefficient to store scaled values represented by signed
  75 integers, as some of the signed integers will never be used. In effect, the bit patterns
  76 corresponding to those signed integers are going to waste.
  77
  78 In order to exactly represent the real zero with an integral-valued affine
  79 value, the zero point must be an integer between the minimum and maximum affine
  80 value (inclusive). For example, given an affine value represented by an 8 bit
  81 unsigned integer, we have: $ 0 \leq zero\\_point \leq 255 $. This is important,
  82 because in convolution-like operations of deep neural networks, we frequently
  83 need to zero-pad inputs and outputs, so zero must be exactly representable, or
  84 the result will be biased.
  85
  86 ### Relation
  87
  88 Real values, fixed point values, and affine values relate through the following
  89 equation, which demonstrates how to convert one type of number to another:
  90
  91 $$ real\\_value = scaled\\_value * scale = (affine\\_value - zero\\_point) * scale $$
  92
  93 Note that computers generally store mathematical values using a finite number of
  94 bits. Thus, while the above conversions are exact, to store the result in a
  95 finite number of bits, we must, in general, round the result of the conversion
  96 (this applies to both cases: storing using floating point and storing using
  97 fixed point). Note that a full discussion of rounding behavior is outside the
  98 scope of this document, and it is safe to assume unless otherwise stated that
  99 rounding should be according to the IEEE754 default of RNE (where hardware
 100 permits).
 101
 102 ### Converting between real and fixed point or affine
 103
 104 To convert a real value to a fixed point value, we must know the scale. To
 105 convert a real value to an affine value, we must know the scale and the zero point.
 106
 107 #### Real to affine
 108
 109 To convert an input tensor of real-valued elements (usually represented by a
 110 floating point format, frequently
 111 [Single precision](https://en.wikipedia.org/wiki/Single-precision_floating-point_format))
 112 to a tensor of affine elements represented by an integral type (e.g. 8-bit
 113 unsigned integer), the following conversion can be performed (note that it is
 114 not required that all representable values of the integral type are used):
 115
 116 $$
 117 \begin{align*}
 118 af&fine\\\_value \\\\
 119   &= clampToTargetSize(roundToNearestInteger( \frac{real\\\_value}{scale}) + zero\\\_point \\\\
 120 \end{align*}
 121 $$
 122
 123 where we assume the following types:
 124
 125 - `real_value`: Single
 126 - `scale`: Single
 127 - `roundToNearestInteger`: returns a 32-bit integer
 128 - `zero_point`: 8-bit or 16-bit integer
 129 - `affine_value`: 8-bit or 16-bit integer
 130
 131 Note that bit depth and number of fixed point values are indicative
 132 of common types on typical hardware but is not constrained to
 133 particular bit depths or a requirement that the entire range of an
 134 N-bit integer is used.
 135
 136 #### Affine to real
 137
 138 To convert an output tensor of affine elements represented by uint8
 139 or uint16 to a tensor of real-valued elements (usually represented with a
 140 floating point format, frequently Single precision), the following conversion
 141 can be performed:
 142
 143 $$
 144 \begin{align*}
 145 re&al\\\_value \\\\
 146       &= roundToNearestFloat(affine\\\_value - zero\\\_point) * scale
 147 \end{align*}
 148 $$
 149
 150 where we assume the following types:
 151
 152 - `real_value`: Single
 153 - `scale`: Single
 154 - `affine_value`: 8-bit or 16-bit integer
 155 - `zero_point`: 8-bit or 16-bit integer
 156 - `roundToNearestFloat`: returns a Single
 157 - `-` (subtraction): returns a 32-bit signed integer
 158
 159 #### Affine to fixed point
 160
 161 When the affine and fixed point scales are the same, subtract the zero point
 162 from the affine value to get the equivalent fixed point value.
 163
 164 $$
 165 \begin{align*}
 166   scaled\\\_value = affine\\\_value_{non\mbox{-}negative} - zero\\\_point_{non\mbox{-}negative}
 167 \end{align*}
 168 $$
 169
 170 #### Fixed point to affine
 171
 172 When the affine and fixed point scales are the same, add the zero point to the
 173 fixed point value to get the equivalent affine value.
 174
 175 $$
 176 \begin{align*}
 177   affine\\\_value_{non\mbox{-}negative} = scaled\\\_value + zero\\\_point_{non\mbox{-}negative}
 178 \end{align*}
 179 $$
 180
 181 ## Usage within MLIR
 182
 183 There are several components to the quantization system being developed within
 184 MLIR:
 185
 186 *   *Quantization* dialect containing:
 187
 188     *   A family of [QuantizedTypes](#quantized-type) which represent the
 189         mapping between *expressed* values (typically of a floating point
 190         computer type) and *storage* values (typically of an integral computer
 191         type).
 192     *   [Type conversion ops](#quantized-type-conversion-operations) for converting
 193         between types based on a QuantizedType and its *expressed* and *storage*
 194         sub-types.
 195     *   [Instrumentation ops](#instrumentation-and-constraint-operations) for assigning
 196         instrumentation points within the computation where runtime statistics
 197         may help guide the quantization process.
 198
 199 *   [Integration with simulated quantization at training time](#integration-with-simulated-quantization-at-training-time)
 200
 201 *   [TFLite native quantization](#tflite-native-quantization)
 202
 203     *   The TFLite op-set natively supports uniform-quantized variants.
 204     *   Passes and tools exist to convert directly from the *TensorFlow* dialect
 205         to the TFLite quantized operation set.
 206
 207 Not every application of quantization will use all of these facilities. Specifically, the
 208 TensorFlow to TensorFlow Lite conversion uses the QuantizedTypes but has its own
 209 operations for type conversion and expression of the supporting math.
 210
 211 ## Quantization Dialect
 212
 213 ### Quantized type
 214
 215 TODO: Flesh this section out.
 216
 217 *   QuantizedType base class
 218 *   UniformQuantizedType
 219
 220 ### Quantized type conversion operations
 221
 222 *   qcast : Convert from an expressed type to QuantizedType
 223 *   dcast : Convert from a QuantizedType to its expressed type
 224 *   scast : Convert between a QuantizedType and its storage type
 225
 226 ### Instrumentation and constraint operations
 227
 228 *   const_fake_quant : Emulates the logic of the historic TensorFlow
 229     fake_quant_with_min_max_args operation.
 230 *   stats_ref : Declares that statistics should be gathered at this point with a
 231     unique key and made available to future passes of the solver.
 232 *   stats : Declares inline statistics (per layer and per axis) for the point in
 233     the computation. stats_ref ops are generally converted to statistical operations once
 234     trial runs have been performed.
 235 *   coupled_ref : Declares points in the computation to be coupled from a type
 236     inference perspective based on a unique key.
 237
 238 ## Integration with simulated quantization at training time
 239
 240 TensorFlow has historically used the
 241 [tf.quantization.fake_quant_\*](https://www.tensorflow.org/api_docs/python/tf/quantization/fake_quant_with_min_max_args)
 242 family of operations to simulate the effect of quantization at training time.
 243
 244 As originally implemented, TensorFlow Lite was the primary user of such
 245 operations at inference time. When quantized inference was enabled, if every
 246 eligible tensor passed through an appropriate fake_quant node (the rules of
 247 which tensors can have fake_quant applied are somewhat involved), then
 248 TensorFlow Lite would use the attributes of the fake_quant operations to make a
 249 judgment about how to convert to use kernels from its quantized operations subset.
 250
 251 In MLIR-based quantization, fake_quant_\* operations are handled by converting them to
 252 a sequence of *qcast* (quantize) followed by *dcast* (dequantize) with an
 253 appropriate *UniformQuantizedType* as the target of the qcast operation.
 254
 255 This allows subsequent compiler passes to preserve the knowledge that
 256 quantization was simulated in a certain way, while giving the compiler
 257 flexibility to move the casts as it simplifies the computation and converts it
 258 to a form based on integral arithmetic.
 259
 260 This scheme also naturally allows computations that are *partially quantized*
 261 where the parts which could not be reduced to integral operations are still carried out
 262 in floating point with appropriate conversions at the boundaries.
 263
 264 ## TFLite native quantization
 265
 266 TODO: Flesh this out
 267
 268 ### General algorithm
 269
 270 1.  Take input min/max information and set the ArrayInfo (which really is
 271     InputOrOutputArrayInfo.
 272 1.  In LegalizeTF, convert ArrayInfo min/max to tf.Quantize and tf.Dequantize
 273     nodes. (or tf.FakeQuant) Convert all constant FakeQuants to (tf.FQ -> tfl.Q
 274     -> tfl.DQ).
 275 1.  Hardcode logic/propagation needs to happen here.
 276 1.  Run TF constant folding.
 277 1.  In PrepareTFL, convert all tf.FQ to (tfl.Q -> tfl.DQ).
 278 1.  Run quantization pass that take (tfl.DQ (for both input and weights) -> op
 279     -> tfl.Q) and replaces with (op). Also replace (constant_float -> tfl.Q)
 280     with (constant_quant).