1 ==================================================
2 How to manually use the Individual pieces of Polly
3 ==================================================
5 Execute the individual Polly passes manually
6 ============================================
8 .. sectionauthor:: Singapuram Sanjay Srivallabh
10 This example presents the individual passes that are involved when optimizing
11 code with Polly. We show how to execute them individually and explain for
12 each which analysis is performed or what transformation is applied. In this
13 example the polyhedral transformation is user-provided to show how much
14 performance improvement can be expected by an optimal automatic optimizer.
16 1. **Create LLVM-IR from the C code**
17 -------------------------------------
18 Polly works on LLVM-IR. Hence it is necessary to translate the source
19 files into LLVM-IR. If more than one file should be optimized the
20 files can be combined into a single file with llvm-link.
22 .. code-block:: console
24 clang -S -emit-llvm matmul.c -Xclang -disable-O0-optnone -o matmul.ll
27 2. **Prepare the LLVM-IR for Polly**
28 ------------------------------------
30 Polly is only able to work with code that matches a canonical form.
31 To translate the LLVM-IR into this form we use a set of
32 canonicalization passes. They are scheduled by using
33 '-polly-canonicalize'.
35 .. code-block:: console
37 opt -S -polly-canonicalize matmul.ll -o matmul.preopt.ll
39 3. **Show the SCoPs detected by Polly (optional)**
40 --------------------------------------------------
42 To understand if Polly was able to detect SCoPs, we print the structure
43 of the detected SCoPs. In our example two SCoPs are detected. One in
44 'init_array' the other in 'main'.
46 .. code-block:: console
48 $ opt -basic-aa -polly-ast -analyze matmul.preopt.ll -polly-process-unprofitable -polly-use-llvm-names
52 :: isl ast :: init_array :: %for.cond1.preheader---%for.end19
56 for (int c0 = 0; c0 <= 1535; c0 += 1)
57 for (int c1 = 0; c1 <= 1535; c1 += 1)
58 Stmt_for_body3(c0, c1);
61 { /* original code */ }
63 :: isl ast :: main :: %for.cond1.preheader---%for.end30
67 for (int c0 = 0; c0 <= 1535; c0 += 1)
68 for (int c1 = 0; c1 <= 1535; c1 += 1) {
69 Stmt_for_body3(c0, c1);
70 for (int c2 = 0; c2 <= 1535; c2 += 1)
71 Stmt_for_body8(c0, c1, c2);
75 { /* original code */ }
77 4. **Highlight the detected SCoPs in the CFGs of the program (requires graphviz/dotty)**
78 ----------------------------------------------------------------------------------------
80 Polly can use graphviz to graphically show a CFG in which the detected
81 SCoPs are highlighted. It can also create '.dot' files that can be
82 translated by the 'dot' utility into various graphic formats.
85 .. code-block:: console
87 $ opt -polly-use-llvm-names -basic-aa -view-scops -disable-output matmul.preopt.ll
88 $ opt -polly-use-llvm-names -basic-aa -view-scops-only -disable-output matmul.preopt.ll
90 The output for the different functions:
92 - view-scops : main_, init_array_, print_array_
93 - view-scops-only : main-scopsonly_, init_array-scopsonly_, print_array-scopsonly_
95 .. _main: http://polly.llvm.org/experiments/matmul/scops.main.dot.png
96 .. _init_array: http://polly.llvm.org/experiments/matmul/scops.init_array.dot.png
97 .. _print_array: http://polly.llvm.org/experiments/matmul/scops.print_array.dot.png
98 .. _main-scopsonly: http://polly.llvm.org/experiments/matmul/scopsonly.main.dot.png
99 .. _init_array-scopsonly: http://polly.llvm.org/experiments/matmul/scopsonly.init_array.dot.png
100 .. _print_array-scopsonly: http://polly.llvm.org/experiments/matmul/scopsonly.print_array.dot.png
102 5. **View the polyhedral representation of the SCoPs**
103 ------------------------------------------------------
105 .. code-block:: console
107 $ opt -polly-use-llvm-names -basic-aa -polly-scops -analyze matmul.preopt.ll -polly-process-unprofitable
111 [...]Printing analysis 'Polly - Create polyhedral description of Scops' for region: 'for.cond1.preheader => for.end19' in function 'init_array':
113 Region: %for.cond1.preheader---%for.end19
115 Invariant Accesses: {
124 float MemRef_A[*][1536]; // Element size 4
125 float MemRef_B[*][1536]; // Element size 4
127 Arrays (Bounds as pw_affs) {
128 float MemRef_A[*][ { [] -> [(1536)] } ]; // Element size 4
129 float MemRef_B[*][ { [] -> [(1536)] } ]; // Element size 4
136 { Stmt_for_body3[i0, i1] : 0 <= i0 <= 1535 and 0 <= i1 <= 1535 };
138 { Stmt_for_body3[i0, i1] -> [i0, i1] };
139 MustWriteAccess := [Reduction Type: NONE] [Scalar: 0]
140 { Stmt_for_body3[i0, i1] -> MemRef_A[i0, i1] };
141 MustWriteAccess := [Reduction Type: NONE] [Scalar: 0]
142 { Stmt_for_body3[i0, i1] -> MemRef_B[i0, i1] };
144 [...]Printing analysis 'Polly - Create polyhedral description of Scops' for region: 'for.cond1.preheader => for.end30' in function 'main':
146 Region: %for.cond1.preheader---%for.end30
148 Invariant Accesses: {
157 float MemRef_C[*][1536]; // Element size 4
158 float MemRef_A[*][1536]; // Element size 4
159 float MemRef_B[*][1536]; // Element size 4
161 Arrays (Bounds as pw_affs) {
162 float MemRef_C[*][ { [] -> [(1536)] } ]; // Element size 4
163 float MemRef_A[*][ { [] -> [(1536)] } ]; // Element size 4
164 float MemRef_B[*][ { [] -> [(1536)] } ]; // Element size 4
171 { Stmt_for_body3[i0, i1] : 0 <= i0 <= 1535 and 0 <= i1 <= 1535 };
173 { Stmt_for_body3[i0, i1] -> [i0, i1, 0, 0] };
174 MustWriteAccess := [Reduction Type: NONE] [Scalar: 0]
175 { Stmt_for_body3[i0, i1] -> MemRef_C[i0, i1] };
178 { Stmt_for_body8[i0, i1, i2] : 0 <= i0 <= 1535 and 0 <= i1 <= 1535 and 0 <= i2 <= 1535 };
180 { Stmt_for_body8[i0, i1, i2] -> [i0, i1, 1, i2] };
181 ReadAccess := [Reduction Type: NONE] [Scalar: 0]
182 { Stmt_for_body8[i0, i1, i2] -> MemRef_C[i0, i1] };
183 ReadAccess := [Reduction Type: NONE] [Scalar: 0]
184 { Stmt_for_body8[i0, i1, i2] -> MemRef_A[i0, i2] };
185 ReadAccess := [Reduction Type: NONE] [Scalar: 0]
186 { Stmt_for_body8[i0, i1, i2] -> MemRef_B[i2, i1] };
187 MustWriteAccess := [Reduction Type: NONE] [Scalar: 0]
188 { Stmt_for_body8[i0, i1, i2] -> MemRef_C[i0, i1] };
192 6. **Show the dependences for the SCoPs**
193 -----------------------------------------
195 .. code-block:: console
197 $ opt -basic-aa -polly-use-llvm-names -polly-dependences -analyze matmul.preopt.ll -polly-process-unprofitable
201 [...]Printing analysis 'Polly - Calculate dependences' for region: 'for.cond1.preheader => for.end19' in function 'init_array':
208 Reduction dependences:
210 Transitive closure of reduction dependences:
212 [...]Printing analysis 'Polly - Calculate dependences' for region: 'for.cond1.preheader => for.end30' in function 'main':
214 { Stmt_for_body3[i0, i1] -> Stmt_for_body8[i0, i1, 0] : 0 <= i0 <= 1535 and 0 <= i1 <= 1535; Stmt_for_body8[i0, i1, i2] -> Stmt_for_body8[i0, i1, 1 + i2] : 0 <= i0 <= 1535 and 0 <= i1 <= 1535 and 0 <= i2 <= 1534 }
218 { Stmt_for_body3[i0, i1] -> Stmt_for_body8[i0, i1, 0] : 0 <= i0 <= 1535 and 0 <= i1 <= 1535; Stmt_for_body8[i0, i1, i2] -> Stmt_for_body8[i0, i1, 1 + i2] : 0 <= i0 <= 1535 and 0 <= i1 <= 1535 and 0 <= i2 <= 1534 }
219 Reduction dependences:
221 Transitive closure of reduction dependences:
224 7. **Export jscop files**
225 -------------------------
227 .. code-block:: console
229 $ opt -basic-aa -polly-use-llvm-names -polly-export-jscop matmul.preopt.ll -polly-process-unprofitable
233 [...]Writing JScop '%for.cond1.preheader---%for.end19' in function 'init_array' to './init_array___%for.cond1.preheader---%for.end19.jscop'.
235 Writing JScop '%for.cond1.preheader---%for.end30' in function 'main' to './main___%for.cond1.preheader---%for.end30.jscop'.
239 8. **Import the changed jscop files and print the updated SCoP structure (optional)**
240 -------------------------------------------------------------------------------------
242 Polly can reimport jscop files, in which the schedules of the statements
243 are changed. These changed schedules are used to describe
244 transformations. It is possible to import different jscop files by
245 providing the postfix of the jscop file that is imported.
247 We apply three different transformations on the SCoP in the main
248 function. The jscop files describing these transformations are
249 hand written (and available in docs/experiments/matmul).
253 As a baseline we do not call any Polly code generation, but only apply the normal -O3 optimizations.
255 .. code-block:: console
257 $ opt -basic-aa -polly-use-llvm-names matmul.preopt.ll -polly-import-jscop -polly-ast -analyze -polly-process-unprofitable
262 :: isl ast :: main :: %for.cond1.preheader---%for.end30
266 for (int c0 = 0; c0 <= 1535; c0 += 1)
267 for (int c1 = 0; c1 <= 1535; c1 += 1) {
268 Stmt_for_body3(c0, c1);
269 for (int c3 = 0; c3 <= 1535; c3 += 1)
270 Stmt_for_body8(c0, c1, c3);
274 { /* original code */ }
277 **Loop Interchange (and Fission to allow the interchange)**
279 We split the loops and can now apply an interchange of the loop dimensions that enumerate Stmt_for_body8.
281 .. Although I feel (and have created a .jscop) we can avoid splitting the loops.
283 .. code-block:: console
285 $ opt -basic-aa -polly-use-llvm-names matmul.preopt.ll -polly-import-jscop -polly-import-jscop-postfix=interchanged -polly-ast -analyze -polly-process-unprofitable
290 :: isl ast :: main :: %for.cond1.preheader---%for.end30
295 for (int c1 = 0; c1 <= 1535; c1 += 1)
296 for (int c2 = 0; c2 <= 1535; c2 += 1)
297 Stmt_for_body3(c1, c2);
298 for (int c1 = 0; c1 <= 1535; c1 += 1)
299 for (int c2 = 0; c2 <= 1535; c2 += 1)
300 for (int c3 = 0; c3 <= 1535; c3 += 1)
301 Stmt_for_body8(c1, c3, c2);
305 { /* original code */ }
308 **Interchange + Tiling**
310 In addition to the interchange we now tile the second loop nest.
312 .. code-block:: console
314 $ opt -basic-aa -polly-use-llvm-names matmul.preopt.ll -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled -polly-ast -analyze -polly-process-unprofitable
319 :: isl ast :: main :: %for.cond1.preheader---%for.end30
324 for (int c1 = 0; c1 <= 1535; c1 += 1)
325 for (int c2 = 0; c2 <= 1535; c2 += 1)
326 Stmt_for_body3(c1, c2);
327 for (int c1 = 0; c1 <= 1535; c1 += 64)
328 for (int c2 = 0; c2 <= 1535; c2 += 64)
329 for (int c3 = 0; c3 <= 1535; c3 += 64)
330 for (int c4 = c1; c4 <= c1 + 63; c4 += 1)
331 for (int c5 = c3; c5 <= c3 + 63; c5 += 1)
332 for (int c6 = c2; c6 <= c2 + 63; c6 += 1)
333 Stmt_for_body8(c4, c6, c5);
337 { /* original code */ }
341 **Interchange + Tiling + Strip-mining to prepare vectorization**
343 To later allow vectorization we create a so called trivially
344 parallelizable loop. It is innermost, parallel and has only four
345 iterations. It can be replaced by 4-element SIMD instructions.
347 .. code-block:: console
349 $ opt -basic-aa -polly-use-llvm-names matmul.preopt.ll -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled -polly-ast -analyze -polly-process-unprofitable
354 :: isl ast :: main :: %for.cond1.preheader---%for.end30
359 for (int c1 = 0; c1 <= 1535; c1 += 1)
360 for (int c2 = 0; c2 <= 1535; c2 += 1)
361 Stmt_for_body3(c1, c2);
362 for (int c1 = 0; c1 <= 1535; c1 += 64)
363 for (int c2 = 0; c2 <= 1535; c2 += 64)
364 for (int c3 = 0; c3 <= 1535; c3 += 64)
365 for (int c4 = c1; c4 <= c1 + 63; c4 += 1)
366 for (int c5 = c3; c5 <= c3 + 63; c5 += 1)
367 for (int c6 = c2; c6 <= c2 + 63; c6 += 4)
368 for (int c7 = c6; c7 <= c6 + 3; c7 += 1)
369 Stmt_for_body8(c4, c7, c5);
373 { /* original code */ }
376 9. **Codegenerate the SCoPs**
377 -----------------------------
379 This generates new code for the SCoPs detected by polly. If
380 -polly-import-jscop is present, transformations specified in the
381 imported jscop files will be applied.
384 .. code-block:: console
386 $ opt -S matmul.preopt.ll | opt -S -O3 -o matmul.normalopt.ll
388 .. code-block:: console
390 $ opt -S matmul.preopt.ll -basic-aa -polly-use-llvm-names -polly-import-jscop -polly-import-jscop-postfix=interchanged -polly-codegen -polly-process-unprofitable | opt -S -O3 -o matmul.polly.interchanged.ll
394 Reading JScop '%for.cond1.preheader---%for.end19' in function 'init_array' from './init_array___%for.cond1.preheader---%for.end19.jscop.interchanged'.
395 File could not be read: No such file or directory
396 Reading JScop '%for.cond1.preheader---%for.end30' in function 'main' from './main___%for.cond1.preheader---%for.end30.jscop.interchanged'.
398 .. code-block:: console
400 $ opt -S matmul.preopt.ll -basic-aa -polly-use-llvm-names -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled -polly-codegen -polly-process-unprofitable | opt -S -O3 -o matmul.polly.interchanged+tiled.ll
404 Reading JScop '%for.cond1.preheader---%for.end19' in function 'init_array' from './init_array___%for.cond1.preheader---%for.end19.jscop.interchanged+tiled'.
405 File could not be read: No such file or directory
406 Reading JScop '%for.cond1.preheader---%for.end30' in function 'main' from './main___%for.cond1.preheader---%for.end30.jscop.interchanged+tiled'.
408 .. code-block:: console
410 $ opt -S matmul.preopt.ll -basic-aa -polly-use-llvm-names -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled+vector -polly-codegen -polly-vectorizer=polly -polly-process-unprofitable | opt -S -O3 -o matmul.polly.interchanged+tiled+vector.ll
414 Reading JScop '%for.cond1.preheader---%for.end19' in function 'init_array' from './init_array___%for.cond1.preheader---%for.end19.jscop.interchanged+tiled+vector'.
415 File could not be read: No such file or directory
416 Reading JScop '%for.cond1.preheader---%for.end30' in function 'main' from './main___%for.cond1.preheader---%for.end30.jscop.interchanged+tiled+vector'.
418 .. code-block:: console
420 $ opt -S matmul.preopt.ll -basic-aa -polly-use-llvm-names -polly-import-jscop -polly-import-jscop-postfix=interchanged+tiled+vector -polly-codegen -polly-vectorizer=polly -polly-parallel -polly-process-unprofitable | opt -S -O3 -o matmul.polly.interchanged+tiled+openmp.ll
424 Reading JScop '%for.cond1.preheader---%for.end19' in function 'init_array' from './init_array___%for.cond1.preheader---%for.end19.jscop.interchanged+tiled+vector'.
425 File could not be read: No such file or directory
426 Reading JScop '%for.cond1.preheader---%for.end30' in function 'main' from './main___%for.cond1.preheader---%for.end30.jscop.interchanged+tiled+vector'.
429 10. **Create the executables**
430 ------------------------------
432 .. code-block:: console
434 $ llc matmul.normalopt.ll -o matmul.normalopt.s -relocation-model=pic
435 $ gcc matmul.normalopt.s -o matmul.normalopt.exe
436 $ llc matmul.polly.interchanged.ll -o matmul.polly.interchanged.s -relocation-model=pic
437 $ gcc matmul.polly.interchanged.s -o matmul.polly.interchanged.exe
438 $ llc matmul.polly.interchanged+tiled.ll -o matmul.polly.interchanged+tiled.s -relocation-model=pic
439 $ gcc matmul.polly.interchanged+tiled.s -o matmul.polly.interchanged+tiled.exe
440 $ llc matmul.polly.interchanged+tiled+vector.ll -o matmul.polly.interchanged+tiled+vector.s -relocation-model=pic
441 $ gcc matmul.polly.interchanged+tiled+vector.s -o matmul.polly.interchanged+tiled+vector.exe
442 $ llc matmul.polly.interchanged+tiled+vector+openmp.ll -o matmul.polly.interchanged+tiled+vector+openmp.s -relocation-model=pic
443 $ gcc matmul.polly.interchanged+tiled+vector+openmp.s -lgomp -o matmul.polly.interchanged+tiled+vector+openmp.exe
445 11. **Compare the runtime of the executables**
446 ----------------------------------------------
448 By comparing the runtimes of the different code snippets we see that a
449 simple loop interchange gives here the largest performance boost.
450 However in this case, adding vectorization and using OpenMP degrades
453 .. code-block:: console
455 $ time ./matmul.normalopt.exe
460 $ time ./matmul.polly.interchanged.exe
465 $ time ./matmul.polly.interchanged+tiled.exe
470 $ time ./matmul.polly.interchanged+tiled+vector.exe
475 $ time ./matmul.polly.interchanged+tiled+vector+openmp.exe