3 This document outlines the design of the MLIR quantization system. While the
4 term "quantization" is highly overloaded, in this case, it refers to a fairly
5 narrow scope of techniques in use to enable conversion of floating-point
6 computations to corresponding and plausible variants expressed in integer math
7 for inference, as has historically been supported by low-bit depth inference
8 engines such as TFLite, various accelerator hardware, and many DSPs.
10 Much of this is inspired by the approach taken
11 [in this paper](https://arxiv.org/abs/1712.05877) with many extensions and
12 adaptations folded in. It specifically documents the positions that MLIR has
13 taken on the topic, and is not a general reference.
17 ## Uniform quantization
19 The primary quantization mechanism supported by MLIR is a scheme which can
20 express fixed point and affine transformations via uniformly spaced point on the
21 [Real](https://en.wikipedia.org/wiki/Real_number) number line.
23 Further, the scheme can be applied:
25 * *per-layer* : Applying to every value within the target type.
26 * *per-axis* (also called *per-channel*) : Applying individually to each index
27 along a specific axis of a tensor type.
29 ### Fixed point values
31 [Fixed point](https://en.wikipedia.org/wiki/Fixed-point_arithmetic) values are a
32 [Real](https://en.wikipedia.org/wiki/Real_number) number divided by a *scale*.
33 We will call the result of the divided real the *scaled value*.
35 $$ real\\_value = scaled\\_value * scale $$
37 The scale can be interpreted as the distance, in real units, between neighboring
38 scaled values. For example, if the scale is $ \pi $, then fixed point values
39 with this scale can only represent multiples of $ \pi $, and nothing in
40 between. The maximum rounding error to convert an arbitrary Real to a fixed
41 point value with a given $ scale $ is $ \frac{scale}{2} $. Continuing the
42 previous example, when $ scale = \pi $, the maximum rounding error will be $
45 Multiplication can be performed on scaled values with different scales, using
46 the same algorithm as multiplication of real values (note that product scaled
47 value has $ scale_{product} = scale_{left \mbox{ } operand} * scale_{right
48 \mbox{ } operand} $). Addition can be performed on scaled values, so long as
49 they have the same scale, using the same algorithm for addition of real values.
50 This makes it convenient to represent scaled values on a computer as signed
51 integers, and perform arithmetic on those signed integers, because the results
52 will be correct scaled values.
56 Mathematically speaking, affine values are the result of
57 [adding a Real-valued *zero point*, to a scaled value](https://en.wikipedia.org/wiki/Affine_transformation#Representation).
58 Alternatively (and equivalently), subtracting a zero point from an affine value results in a
61 $$ real\\_value = scaled\\_value * scale = (affine\\_value - zero\\_point) * scale $$
63 Essentially, affine values are a shift of the scaled values by some constant
64 amount. Arithmetic (i.e., addition, subtraction, multiplication, division)
65 cannot, in general, be directly performed on affine values; they must first be
66 [converted](#affine-to-fixed-point) to the equivalent scaled values.
68 As alluded to above, the motivation for using affine values is to more
69 efficiently represent real values that will actually be encountered during
70 computation. Frequently, real values that will be encountered are not
71 symmetric around the real zero. We also make the assumption that the real zero
72 is encountered during computation, and should thus be represented.
74 In this case, it is inefficient to store scaled values represented by signed
75 integers, as some of the signed integers will never be used. In effect, the bit patterns
76 corresponding to those signed integers are going to waste.
78 In order to exactly represent the real zero with an integral-valued affine
79 value, the zero point must be an integer between the minimum and maximum affine
80 value (inclusive). For example, given an affine value represented by an 8 bit
81 unsigned integer, we have: $ 0 \leq zero\\_point \leq 255 $. This is important,
82 because in convolution-like operations of deep neural networks, we frequently
83 need to zero-pad inputs and outputs, so zero must be exactly representable, or
84 the result will be biased.
88 Real values, fixed point values, and affine values relate through the following
89 equation, which demonstrates how to convert one type of number to another:
91 $$ real\\_value = scaled\\_value * scale = (affine\\_value - zero\\_point) * scale $$
93 Note that computers generally store mathematical values using a finite number of
94 bits. Thus, while the above conversions are exact, to store the result in a
95 finite number of bits, we must, in general, round the result of the conversion
96 (this applies to both cases: storing using floating point and storing using
97 fixed point). Note that a full discussion of rounding behavior is outside the
98 scope of this document, and it is safe to assume unless otherwise stated that
99 rounding should be according to the IEEE754 default of RNE (where hardware
102 ### Converting between real and fixed point or affine
104 To convert a real value to a fixed point value, we must know the scale. To
105 convert a real value to an affine value, we must know the scale and the zero point.
109 To convert an input tensor of real-valued elements (usually represented by a
110 floating point format, frequently
111 [Single precision](https://en.wikipedia.org/wiki/Single-precision_floating-point_format))
112 to a tensor of affine elements represented by an integral type (e.g. 8-bit
113 unsigned integer), the following conversion can be performed (note that it is
114 not required that all representable values of the integral type are used):
118 af&fine\\\_value \\\\
119 &= clampToTargetSize(roundToNearestInteger( \frac{real\\\_value}{scale}) + zero\\\_point \\\\
123 where we assume the following types:
125 - `real_value`: Single
127 - `roundToNearestInteger`: returns a 32-bit integer
128 - `zero_point`: 8-bit or 16-bit integer
129 - `affine_value`: 8-bit or 16-bit integer
131 Note that bit depth and number of fixed point values are indicative
132 of common types on typical hardware but is not constrained to
133 particular bit depths or a requirement that the entire range of an
134 N-bit integer is used.
138 To convert an output tensor of affine elements represented by uint8
139 or uint16 to a tensor of real-valued elements (usually represented with a
140 floating point format, frequently Single precision), the following conversion
146 &= roundToNearestFloat(affine\\\_value - zero\\\_point) * scale
150 where we assume the following types:
152 - `real_value`: Single
154 - `affine_value`: 8-bit or 16-bit integer
155 - `zero_point`: 8-bit or 16-bit integer
156 - `roundToNearestFloat`: returns a Single
157 - `-` (subtraction): returns a 32-bit signed integer
159 #### Affine to fixed point
161 When the affine and fixed point scales are the same, subtract the zero point
162 from the affine value to get the equivalent fixed point value.
166 scaled\\\_value = affine\\\_value_{non\mbox{-}negative} - zero\\\_point_{non\mbox{-}negative}
170 #### Fixed point to affine
172 When the affine and fixed point scales are the same, add the zero point to the
173 fixed point value to get the equivalent affine value.
177 affine\\\_value_{non\mbox{-}negative} = scaled\\\_value + zero\\\_point_{non\mbox{-}negative}
183 There are several components to the quantization system being developed within
186 * *Quantization* dialect containing:
188 * A family of [QuantizedTypes](#quantized-type) which represent the
189 mapping between *expressed* values (typically of a floating point
190 computer type) and *storage* values (typically of an integral computer
192 * [Type conversion ops](#quantized-type-conversion-operations) for converting
193 between types based on a QuantizedType and its *expressed* and *storage*
195 * [Instrumentation ops](#instrumentation-and-constraint-operations) for assigning
196 instrumentation points within the computation where runtime statistics
197 may help guide the quantization process.
199 * [Integration with simulated quantization at training time](#integration-with-simulated-quantization-at-training-time)
201 * [TFLite native quantization](#tflite-native-quantization)
203 * The TFLite op-set natively supports uniform-quantized variants.
204 * Passes and tools exist to convert directly from the *TensorFlow* dialect
205 to the TFLite quantized operation set.
207 Not every application of quantization will use all of these facilities. Specifically, the
208 TensorFlow to TensorFlow Lite conversion uses the QuantizedTypes but has its own
209 operations for type conversion and expression of the supporting math.
211 ## Quantization Dialect
215 TODO: Flesh this section out.
217 * QuantizedType base class
218 * UniformQuantizedType
220 ### Quantized type conversion operations
222 * qcast : Convert from an expressed type to QuantizedType
223 * dcast : Convert from a QuantizedType to its expressed type
224 * scast : Convert between a QuantizedType and its storage type
226 ### Instrumentation and constraint operations
228 * const_fake_quant : Emulates the logic of the historic TensorFlow
229 fake_quant_with_min_max_args operation.
230 * stats_ref : Declares that statistics should be gathered at this point with a
231 unique key and made available to future passes of the solver.
232 * stats : Declares inline statistics (per layer and per axis) for the point in
233 the computation. stats_ref ops are generally converted to statistical operations once
234 trial runs have been performed.
235 * coupled_ref : Declares points in the computation to be coupled from a type
236 inference perspective based on a unique key.
238 ## Integration with simulated quantization at training time
240 TensorFlow has historically used the
241 [tf.quantization.fake_quant_\*](https://www.tensorflow.org/api_docs/python/tf/quantization/fake_quant_with_min_max_args)
242 family of operations to simulate the effect of quantization at training time.
244 As originally implemented, TensorFlow Lite was the primary user of such
245 operations at inference time. When quantized inference was enabled, if every
246 eligible tensor passed through an appropriate fake_quant node (the rules of
247 which tensors can have fake_quant applied are somewhat involved), then
248 TensorFlow Lite would use the attributes of the fake_quant operations to make a
249 judgment about how to convert to use kernels from its quantized operations subset.
251 In MLIR-based quantization, fake_quant_\* operations are handled by converting them to
252 a sequence of *qcast* (quantize) followed by *dcast* (dequantize) with an
253 appropriate *UniformQuantizedType* as the target of the qcast operation.
255 This allows subsequent compiler passes to preserve the knowledge that
256 quantization was simulated in a certain way, while giving the compiler
257 flexibility to move the casts as it simplifies the computation and converts it
258 to a form based on integral arithmetic.
260 This scheme also naturally allows computations that are *partially quantized*
261 where the parts which could not be reduced to integral operations are still carried out
262 in floating point with appropriate conversions at the boundaries.
264 ## TFLite native quantization
268 ### General algorithm
270 1. Take input min/max information and set the ArrayInfo (which really is
271 InputOrOutputArrayInfo.
272 1. In LegalizeTF, convert ArrayInfo min/max to tf.Quantize and tf.Dequantize
273 nodes. (or tf.FakeQuant) Convert all constant FakeQuants to (tf.FQ -> tfl.Q
275 1. Hardcode logic/propagation needs to happen here.
276 1. Run TF constant folding.
277 1. In PrepareTFL, convert all tf.FQ to (tfl.Q -> tfl.DQ).
278 1. Run quantization pass that take (tfl.DQ (for both input and weights) -> op
279 -> tfl.Q) and replaces with (op). Also replace (constant_float -> tfl.Q)
280 with (constant_quant).