# Data flow analysis: an informal introduction
This document introduces data flow analysis in an informal way. The goal is to
give the reader an intuitive understanding of how it works, and show how it
applies to a range of refactoring and bug finding problems.
Data flow analysis is a well-established technique; it is described in many
papers, books, and videos. If you would like a more formal or more thorough
explanation of the concepts mentioned in this document, please refer to the
following resources:
* [The Lattice article in Wikipedia](https://en.wikipedia.org/wiki/Lattice_\(order\)).
* Videos on the PacketPrep YouTube channel that introduce lattices and the
  necessary background information:
  [#20](https://www.youtube.com/watch?v=73j_FXBXGm8),
  [#21](https://www.youtube.com/watch?v=b5sDjo9tfE8),
  [#22](https://www.youtube.com/watch?v=saOG7Uooeho),
  [#23](https://www.youtube.com/watch?v=3EAYX-wZH0g),
  [#24](https://www.youtube.com/watch?v=KRkHwQtW6Cc),
  [#25](https://www.youtube.com/watch?v=7Gwzsc4rAgw).
* [Introduction to Dataflow Analysis](https://www.youtube.com/watch?v=OROXJ9-wUQE)
* [Introduction to abstract interpretation](http://www.cs.tau.ac.il/~msagiv/courses/asv/absint-1.pdf).
* [Introduction to symbolic execution](https://www.cs.umd.edu/~mwh/se-tutorial/symbolic-exec.pdf).
* [Static Program Analysis by Anders Møller and Michael I. Schwartzbach](https://cs.au.dk/~amoeller/spa/).
* [EXE: automatically generating inputs of death](https://css.csail.mit.edu/6.858/2020/readings/exe.pdf)
  (a paper that successfully applies symbolic execution to real-world
  software).
### The purpose of data flow analysis

Data flow analysis is a static analysis technique that proves facts about a
program or its fragment. It can make conclusions about all paths through the
program, while taking control flow into account and scaling to large programs.
The basic idea is propagating facts about the program through the edges of the
control flow graph (CFG) until a fixpoint is reached.

### Sample problem and an ad-hoc solution
We would like to explain data flow analysis by discussing an example. Let's
imagine that we want to track the possible values of an integer variable in our
program. Here is how a human could annotate the code:
We use sets of integers to represent possible values of `x`. Local variables
have unambiguous values between statements, so we annotate program points
between statements with sets of possible values.

Here is how we arrived at these annotations. Assigning a constant to `x` allows
us to conclude that `x` can only have one value. When control flow from the
"then" and "else" branches joins, `x` can have either value.
Abstract algebra provides a nice formalism that models this kind of structure,
namely, a lattice. A join-semilattice is a partially ordered set in which every
two elements have a least upper bound (called a *join*):

```
join(a, b) ⩾ a and join(a, b) ⩾ b and join(x, x) = x
```
For this problem we will use the lattice of subsets of integers, with the set
inclusion relation as ordering and set union as the join.
Lattices are often represented visually as Hasse diagrams. Here is a Hasse
diagram for our lattice that tracks subsets of integers:

![Hasse diagram for a lattice of integer sets](DataFlowAnalysisIntroImages/IntegerSetsInfiniteLattice.svg)
Computing the join in the lattice corresponds to finding the lowest common
ancestor (LCA) between two nodes in its Hasse diagram. There is a vast amount
of literature on efficiently implementing LCA queries for a DAG; however,
Efficient Implementation of Lattice Operations (1989)
([CiteSeerX](https://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.106.4911),
[doi](https://doi.org/10.1145%2F59287.59293)) describes a scheme that is
particularly well-suited for programmatic implementation.
### Too much information and "top" values

Let's try to find the possible sets of values of `x` in a function that
modifies `x` in a loop:

```
void ExampleOfInfiniteSets() {
  int x = 0; // x is {0}
  while (condition()) {
    x += 1; // x is {0; 1; 2; …}
  }
  print(x); // x is {0; 1; 2; …}
}
```
We have an issue: if the program operated on mathematical integers, `x` could
have any non-negative value, which is an infinite set. In C++ `int` is limited
by `INT_MAX`, so technically we have the set `{0; 1; …; INT_MAX}`, which is
finite but still impractically large to track.
To make our analysis practical to compute, we have to limit the amount of
information that we track. In this case, we can, for example, arbitrarily limit
the size of sets to 3 elements. If at a certain program point `x` has more than
3 possible values, we stop tracking specific values at that program point.
Instead, we denote the possible values of `x` with the symbol `⊤` (pronounced
"top" according to a convention in abstract algebra).

```
void ExampleOfTopWithALoop() {
  int x = 0; // x is {0}
  while (condition()) {
    x += 1; // x is ⊤
  }
  print(x); // x is ⊤
}
```
The statement "at this program point, `x`'s possible values are `⊤`" is
understood as "at this program point `x` can have any value because we have too
much information, or the information is conflicting".

Note that we can get more than 3 possible values even without a loop:
```
void ExampleOfTopWithoutLoops(int n) {
  int x = 0; // x is {0}
  switch (n) {
  case 0: x = 1; break; // x is {1}
  case 1: x = 9; break; // x is {9}
  case 2: x = 7; break; // x is {7}
  default: x = 3; break; // x is {3}
  }
  // x is ⊤
}
```
### Uninitialized variables and "bottom" values

When `x` is declared but not initialized, it has no possible values. We
represent this fact symbolically as `⊥` (pronounced "bottom").

```
void ExampleOfBottom() {
  int x; // x is ⊥
  x = 42; // x is {42}
  print(x); // x is {42}
}
```
Note that using values read from uninitialized variables is undefined behavior
in C++. Generally, compilers and static analysis tools can assume undefined
behavior does not happen. We must model uninitialized variables only when we
are implementing a checker that specifically tries to find uninitialized reads.
In this example we show how to model uninitialized variables only to
demonstrate the concept of "bottom", and how it applies to possible value
analysis. We describe an analysis that finds uninitialized reads in a section
below.
### A practical lattice that tracks sets of concrete values

Taking into account all corner cases covered above, we can put together a
lattice that we can use in practice to track possible values of integer
variables. This lattice represents sets of integers with 1, 2, or 3 elements,
as well as top and bottom. Here is a Hasse diagram for it:

![Hasse diagram for a lattice of integer sets](DataFlowAnalysisIntroImages/IntegerSetsFiniteLattice.svg)

Let's consider a slightly more complex example, and think about how we can
compute the sets of possible values algorithmically.
```
void Example(int n) {
  int x = 0; // x is {0}
  if (n > 0) {
    x = 44; // x is {44}
  } else {
    x = 5; // x is {5}
  }
  print(x); // x is {44; 5}
}
```
As humans, we understand the control flow from the program text. We used our
understanding of control flow to find program points where two flows join.
Formally, control flow is represented by a CFG (control flow graph):

![CFG for the code above](DataFlowAnalysisIntroImages/CFGExample.svg)
We can compute sets of possible values by propagating them through the CFG of
the function:

* When `x` is declared but not initialized, its possible values are `{}`. The
  empty set plays the role of `⊥` in this lattice.

* When `x` is assigned a concrete value, its possible set of values contains
  just that specific value.

* When `x` is assigned some unknown value, it can have any value. We represent
  this fact as `⊤`.

* When two control flow paths join, we compute the set union of incoming
  values (limiting the number of elements to 3, representing larger sets as
  `⊤`).
The sets of possible values are influenced by:

* Statements, for example, assignments.

* Joins in control flow, for example, ones that appear at the end of "if"
  statements and loops.
**Effects of statements** are modeled by what is formally known as a transfer
function. A transfer function takes two arguments: the statement, and the state
of `x` at the previous program point. It produces the state of `x` at the next
program point. For example, the transfer function for assignment ignores the
state at the previous program point:
```
// GIVEN: x is {42; 44}
x = 0;
// CONCLUSION: x is {0}
```
The transfer function for `+` performs arithmetic on every set member:

```
// GIVEN: x is {42; 44}
x += 100;
// CONCLUSION: x is {142; 144}
```
**Effects of control flow** are modeled by joining the knowledge from all
possible previous program points:

```
if (...) {
  x = 42;
  // GIVEN: x is {42}
} else {
  x = 44;
  // GIVEN: x is {44}
}
// CONCLUSION: x is {42; 44}
```
The predicate that we marked "given" is usually called a precondition, and the
conclusion is called a postcondition.

In terms of the CFG, we join the information from all predecessor basic blocks.

![Modeling the effects of a CFG basic block](DataFlowAnalysisIntroImages/CFGJoinRule.svg)

Putting it all together, to model the effects of a basic block we compute:

```
out = transfer(basic_block, join(in_1, in_2, ..., in_n))
```
(Note that there are other ways to write this equation that produce higher
precision analysis results. The trick is to keep exploring the execution paths
separately and delay joining until later. However, we won't discuss those
variations here.)

To make a conclusion about all paths through the program, we repeat this
computation on all basic blocks until we reach a fixpoint. In other words, we
keep propagating information through the CFG until the computed sets of values
stop changing.
If the lattice has a finite height and the transfer functions are monotonic,
the algorithm is guaranteed to terminate. Each iteration of the algorithm can
change computed values only to larger values from the lattice. In the worst
case, all computed values become `⊤`, which is not very useful, but at least
the analysis terminates at that point, because it can't change any of the
values.
Fixpoint iteration can be optimized by only reprocessing basic blocks whose
inputs changed on the previous iteration. This is typically implemented using a
worklist queue. With this optimization the time complexity becomes
`O(m * |L|)`, where `m` is the number of basic blocks in the CFG and `|L|` is
the size of the lattice used by the analysis.
## Symbolic execution: a very short informal introduction

In the previous example where we tried to figure out what values a variable can
have, the analysis had to be seeded with a concrete value. What if there are no
assignments of concrete values in the program? We can still deduce some
interesting information by representing unknown input values symbolically, and
computing results as symbolic expressions:
```
void PrintAbs(int x) {
  int result;
  if (x >= 0) {
    result = x; // result is {x}
  } else {
    result = -x; // result is {-x}
  }
  print(result); // result is {x; -x}
}
```

We can't say what specific value gets printed, but we know that it is either
`x` or `-x`.
Data flow analysis is an instance of abstract interpretation, and does not
dictate how exactly the lattice and transfer functions should be designed,
beyond the necessary conditions for the analysis to converge. Nevertheless, we
can use symbolic execution ideas to guide our design of the lattice and
transfer functions: lattice values can be symbolic expressions, and transfer
functions can construct more complex symbolic expressions from the symbolic
expressions that represent their arguments. See [this StackOverflow
discussion](https://cstheory.stackexchange.com/questions/19708/symbolic-execution-is-a-case-of-abstract-interpretation)
for a further comparison of abstract interpretation and symbolic execution.
A human can say about the previous example that the function returns `x` when
`x >= 0`, and `-x` when `x < 0`. We can make this conclusion programmatically
by tracking a flow condition. A flow condition is a predicate written in terms
of the program state that is true at a specific program point regardless of the
execution path that led to this statement. For example, the flow condition for
the program point right before evaluating `result = x` is `x >= 0`.

If we enhance the lattice to be a set of pairs of values and predicates, the
data flow analysis computes the following values:
```
void PrintAbs(int x) {
  int result;
  if (x >= 0) {
    // Flow condition: x >= 0.
    result = x; // result is {x if x >= 0}
  } else {
    // Flow condition: x < 0.
    result = -x; // result is {-x if x < 0}
  }
  print(result); // result is {x if x >= 0; -x if x < 0}
}
```
Of course, in a program with loops, symbolic expressions for flow conditions
can grow unbounded. A practical static analysis system must control this growth
to keep the symbolic representations manageable and ensure that the data flow
analysis terminates. For example, it can use a constraint solver to prune
impossible flow conditions, and/or it can abstract them, losing precision,
after their symbolic representations grow beyond some threshold. This is
similar to how we had to limit the sizes of computed sets of possible values to
3 elements.
### Symbolic pointers

This approach proves to be particularly useful for modeling pointer values,
since we don't care about specific addresses but just want to give a unique
identifier to a memory location.
```
void ExampleOfSymbolicPointers(bool b) {
  int x = 0;     // x is {0}
  int* ptr = &x; // x is {0}      ptr is {&x}
  if (b) {
    *ptr = 42;   // x is {42}     ptr is {&x}
  }
  print(x);      // x is {0; 42}  ptr is {&x}
}
```
## Example: finding output parameters

Let's explore how data flow analysis can help with a problem that is hard to
solve with other tools in Clang.

### Problem description
Output parameters are function parameters of pointer or reference type whose
pointee is completely overwritten by the function, and not read before it is
overwritten. They are common in pre-C++11 code due to the absence of move
semantics. In modern C++ output parameters are non-idiomatic, and return values
are used instead.

Imagine that we would like to refactor output parameters to return values to
modernize old code. The first step is to identify refactoring candidates
through static analysis.

For example, in the following code snippet the pointer `c` is an output
parameter:
```
struct Customer {
  int account_id;
  string name;
};

void GetCustomer(Customer *c) {
  c->account_id = ...;
  if (...) {
    c->name = ...;
  } else {
    c->name = ...;
  }
}
```
We would like to refactor this code into:

```
Customer GetCustomer() {
  Customer c;
  c.account_id = ...;
  if (...) {
    c.name = ...;
  } else {
    c.name = ...;
  }
  return c;
}
```
However, in the function below the parameter `c` is not an output parameter
because its field `name` is not overwritten on every path through the function:

```
void GetCustomer(Customer *c) {
  c->account_id = ...;
  if (...) {
    c->name = ...;
  }
}
```
The function also must not read the value of the parameter before overwriting
it:

```
void GetCustomer(Customer *c) {
  print(c->name); // Reads the pointee before overwriting it.
  c->account_id = ...;
  c->name = ...;
}
```
Functions that escape the pointer also block the refactoring:

```
Customer* kGlobalCustomer;

void GetCustomer(Customer *c) {
  c->account_id = ...;
  c->name = ...;
  kGlobalCustomer = c; // The pointer escapes.
}
```
To identify a candidate function for refactoring, we need to do the following:

* Find a function with a non-const pointer or reference parameter.

* Find the definition of that function.

* Prove that the function completely overwrites the pointee on all paths
  before returning.

* Prove that the function reads the pointee only after overwriting it.

* Prove that the function does not persist the pointer in a data structure
  that is live after the function returns.
There are also requirements that all usage sites of the candidate function must
satisfy, for example, that function arguments do not alias, that users are not
taking the address of the function, and so on. Let's consider verifying usage
site conditions to be a separate static analysis problem.
To analyze the function body we can use a lattice which consists of normal
states and failure states. A normal state describes program points where we are
sure that no behaviors that block the refactoring have occurred. Normal states
keep track of all of the parameter's member fields that are known to be
overwritten on every path from function entry to the corresponding program
point. Failure states accumulate observed violations (unsafe reads and pointer
escapes) that block the refactoring.
In the partial order of the lattice, failure states compare greater than normal
states, which guarantees that they "win" when joined with normal states. The
order between failure states is determined by the inclusion relation on the set
of accumulated violations (the lattice's `⩽` is `⊆` on the set of violations).
The order between normal states is determined by the reversed inclusion
relation on the set of overwritten parameter's member fields (the lattice's `⩽`
is `⊇` on the set of overwritten fields).
![Lattice for data flow analysis that identifies output parameters](DataFlowAnalysisIntroImages/OutputParameterIdentificationLattice.svg)

To determine whether a statement reads or writes a field we can implement
symbolic evaluation of `DeclRefExpr`s, `LValueToRValue` casts, the pointer
dereference operator, and `MemberExpr`s.
### Using data flow results to identify output parameters

Let's take a look at how we use data flow analysis to identify an output
parameter. The refactoring can be safely done when the data flow algorithm
computes a normal state with all of the fields proven to be overwritten in the
exit basic block of the function:
```
void GetCustomer(Customer* c) {
  // Overwritten: {}
  c->account_id = ...; // Overwritten: {c->account_id}
  if (...) {
    c->name = ...; // Overwritten: {c->account_id, c->name}
  } else {
    c->name = ...; // Overwritten: {c->account_id, c->name}
  }
  // Overwritten: {c->account_id, c->name}
}
```
When the data flow algorithm computes a normal state, but not all fields are
proven to be overwritten, we can't perform the refactoring:

```
void target(bool b, Customer* c) {
  // Overwritten: {}
  if (b) {
    c->account_id = 42; // Overwritten: {c->account_id}
  } else {
    c->name = "Konrad"; // Overwritten: {c->name}
  }
  // Overwritten: {}
}
```
Similarly, when the data flow algorithm computes a failure state, we also can't
perform the refactoring:

```
Customer* kGlobalCustomer;

void GetCustomer(Customer* c) {
  // Overwritten: {}
  c->account_id = ...; // Overwritten: {c->account_id}
  if (...) {
    print(c->name); // Unsafe read
  } else {
    kGlobalCustomer = c; // Pointer escape
  }
  // Unsafe read, Pointer escape
}
```
## Example: finding dead stores

Let's say we want to find redundant stores, because they indicate potential
bugs:

```
x = ...; // This value is never read.
x = ...;
print(x);
```

The first store to `x` is never read; probably there is a bug.
The implementation of dead store analysis is very similar to output parameter
analysis: we need to track stores and loads, and find stores that were never
read.
[Liveness analysis](https://en.wikipedia.org/wiki/Live_variable_analysis) is a
generalization of this idea, which is often used to answer many related
questions, for example:

* finding dead stores,
* finding uninitialized variables,
* finding a good point to deallocate memory,
* finding out if it would be safe to move an object.
## Example: definitive initialization

Definitive initialization proves that variables are known to be initialized
when read. If we find a variable which is read when not initialized, then we
generate a warning:

```
int x; // x is uninitialized
if (cond()) {
  x = 10; // x is initialized
} else {
  x = 20; // x is initialized
}
print(x); // x is initialized
```

However, if `x` is initialized only on some of the paths, reading it is
flagged:

```
int x; // x is uninitialized
if (cond()) {
  x = 10; // x is initialized
}
print(x); // x is maybe uninitialized, x is being read, report a bug.
```
For this purpose we can use a lattice in the form of a mapping from variable
declarations to initialization states; each initialization state is represented
by the following lattice:
![Lattice for definitive initialization analysis](DataFlowAnalysisIntroImages/DefinitiveInitializationLattice.svg)

A lattice element could also capture the source locations of the branches that
lead us to the corresponding program point. Diagnostics would use this
information to show a sample buggy code path to the user.
## Example: refactoring raw pointers to `unique_ptr`

Modern idiomatic C++ uses smart pointers to express memory ownership; however,
in pre-C++11 code one can often find raw pointers that own heap memory blocks.
Imagine that we would like to refactor raw pointers that own memory to
`unique_ptr`. There are multiple ways to design a data flow analysis for this
problem; let's look at one way to do it.
For example, we would like to refactor the following code that uses raw
pointers:

```
void UniqueOwnership1() {
  int *pi;
  pi = new int;
  if (...) {
    Borrow(pi);
    delete pi;
  } else {
    TakeOwnership(pi);
  }
}
```

into code that uses `unique_ptr`:
```
void UniqueOwnership1() {
  auto pi = std::make_unique<int>();
  if (...) {
    Borrow(pi.get());
  } else {
    TakeOwnership(pi.release());
  }
}
```
This problem can be solved with a lattice in the form of a map from value
declarations to pointer states:

![Lattice that identifies candidates for unique_ptr refactoring](DataFlowAnalysisIntroImages/UniquePtrLattice.svg)

We can perform the refactoring if at the exit of a function `pi` is in the
`Compatible` state:
```
void UniqueOwnership1() {
  int *pi; // pi is Compatible
  pi = new int; // pi is Defined
  if (...) {
    Borrow(pi); // pi is Defined
    delete pi; // pi is Compatible
  } else {
    TakeOwnership(pi); // pi is Compatible
  }
  // pi is Compatible
}
```
Let's look at an example where the raw pointer owns two different memory
blocks:

```
void UniqueOwnership2() {
  int *pi = new int; // pi is Defined
  Borrow(pi);
  delete pi; // pi is Compatible

  pi = new int; // pi is Defined
  Borrow(pi);
  delete pi; // pi is Compatible
}
```
It can be refactored to use `unique_ptr` like this:

```
void UniqueOwnership2() {
  auto pi = make_unique<int>();
  Borrow(pi.get());

  pi = make_unique<int>();
  Borrow(pi.get());
}
```
In the following example, the raw pointer is used to access the heap object
after the ownership has been transferred:

```
void UniqueOwnership3() {
  int *pi = new int; // pi is Defined
  if (...) {
    Borrow(pi);
    delete pi; // pi is Compatible
  } else {
    vector<unique_ptr<int>> v = {std::unique_ptr(pi)}; // pi is Compatible
    print(*pi);
  }
}
```
We can refactor this code to use `unique_ptr`, however we would have to
introduce a non-owning pointer variable, since we can't use the moved-from
`unique_ptr` to access the object:

```
void UniqueOwnership3() {
  std::unique_ptr<int> pi = std::make_unique<int>();
  if (...) {
    Borrow(pi.get());
  } else {
    int *pi_non_owning = pi.get();
    vector<unique_ptr<int>> v = {std::move(pi)};
    print(*pi_non_owning);
  }
}
```
If the original code didn't call `delete` at the very end of the function, then
our refactoring may change the point at which we run the destructor and release
memory. Specifically, if there is some user code after `delete`, then extending
the lifetime of the object until the end of the function may hold locks for
longer than necessary, introduce memory overhead, etc.
One solution is to always replace `delete` with a call to `reset()`, and then
perform another analysis that removes unnecessary `reset()` calls:

```
void AddedMemoryOverhead() {
  HugeObject *ho = new HugeObject();
  Use(ho);
  delete ho; // Release the large amount of memory quickly.
  LongRunningFunction();
}
```
This analysis will refuse to refactor code that mixes borrowed pointer values
and unique ownership. In the following code, `GetPtr()` returns a borrowed
pointer, which is assigned to `pi`. Then, `pi` is used to hold a uniquely-owned
pointer. We don't distinguish between these two assignments, and we want each
assignment to be paired with a corresponding sink; otherwise, we transition the
pointer to a `Conflicting` state, as in this example:

```
void ConflictingOwnership() {
  int *pi; // pi is Compatible
  pi = GetPtr(); // pi is Defined
  Borrow(pi); // pi is Defined

  pi = new int; // pi is Conflicting
  Borrow(pi); // pi is Conflicting
  delete pi; // pi is Conflicting
}
```
We could still handle this case by finding a maximal range in the code where
`pi` could be in the `Compatible` state, and only refactoring that part:

```
void ConflictingOwnership() {
  int *pi; // Keeps holding the borrowed pointer.
  pi = GetPtr();
  Borrow(pi);

  std::unique_ptr<int> pi_unique = std::make_unique<int>();
  Borrow(pi_unique.get());
}
```
## Example: finding redundant branch conditions

In the code below `b1` should not be checked in both the outer and inner "if"
statements. It is likely there is a bug in this code:
```
int F(bool b1, bool b2) {
  if (b1) {
    if (b1 && b2) { // Check `b1` again -- unnecessary!
      return 1;
    }
  }
  return 0;
}
```
A checker that finds this pattern syntactically is already implemented in
ClangTidy using AST matchers (`bugprone-redundant-branch-condition`).

To implement it using the data flow analysis framework, we can produce a
warning if any part of the branch condition is implied by the flow condition:
```
int F(bool b1, bool b2) {
  // Flow condition: true.
  if (b1) {
    // Flow condition: b1.
    if (b1 && b2) { // `b1` is implied by the flow condition.
      return 1;
    }
  }
  return 0;
}
```
One way to check this implication is to use a SAT solver. Without a SAT solver,
we could keep the flow condition in CNF form, which would make it easy to check
the implication.
## Example: finding unchecked `std::optional` unwraps

Calling `optional::value()` is only valid if `optional::has_value()` is true.
We want to show that when `x.value()` is executed, the flow condition implies
`x.has_value()`.
In the example below `x.value()` is accessed safely because it is guarded by
the `x.has_value()` check:

```
void Example(std::optional<int> &x) {
  if (x.has_value()) {
    print(x.value());
  }
}
```
When entering the "if" branch we deduce that `x.has_value()` is implied by the
flow condition:

```
void Example(std::optional<int> x) {
  // Flow condition: true.
  if (x.has_value()) {
    // Flow condition: x.has_value() == true.
    print(x.value());
  }
  // Flow condition: true.
}
```
We also need to prove that `x` is not modified between the check and the value
access. The modification of `x` may be very subtle:

```
void unknown_function(std::optional<int> &x);

void Example(std::optional<int> &x) {
  if (x.has_value()) {
    // Flow condition: x.has_value() == true.
    unknown_function(x); // may change x.
    // Flow condition: true.
    print(x.value()); // Can't prove that this access is safe.
  }
}
```
## Example: finding dead code behind A/B experiment flags

Finding dead code is a classic application of data flow analysis.

Unused A/B experiment flags hide dead code. However, this flavor of dead code
is invisible to the compiler because the flag can be turned on at any moment.

We could make a tool that deletes experiment flags. The user tells us which
flag they want to delete, and we assume that the flag's value is a given
constant.
For example, the user could use the tool to remove `example_flag` from this
code:

```
DEFINE_FLAG(std::string, example_flag, "", "A sample flag.");

void Example() {
  bool x = GetFlag(FLAGS_example_flag).empty();
  if (x) {
    ...
  } else {
    ...
  }
}
```

The tool would simplify the code to:

```
void Example() {
  ...
}
```
We can solve this problem with a classic constant propagation lattice combined
with symbolic evaluation.
## Example: finding inefficient usages of associative containers

Real-world code often accidentally performs repeated lookups in associative
containers:

```
map<int, Employee> xs;
xs[42].name = "...";
xs[42].title = "...";
```

To find the above inefficiency we can use the available expressions analysis to
understand that `xs[42]` is evaluated twice. The code can then be rewritten to
do the lookup once:
```
map<int, Employee> xs;
Employee &e = xs[42];
e.name = "...";
e.title = "...";
```
We can also track the `xs.contains()` check in the flow condition to find
redundant checks, like in the example below:

```
std::map<int, Employee> xs;
if (!xs.contains(42)) {
  xs.insert({42, someEmployee});
}
```
## Example: refactoring types that implicitly convert to each other

Refactoring one strong type to another is difficult, but the compiler can help:
once you refactor one reference to the type, the compiler will flag other
places where this information flows, with type mismatch errors. Unfortunately,
this strategy does not work when you are refactoring types that implicitly
convert to each other, for example, replacing `int32_t` with `int64_t`.
Imagine that we want to change user IDs from 32 to 64-bit integers. In other
words, we need to find all integers tainted with user IDs. We can use data flow
analysis to implement taint analysis.
```
void UseUser(int32_t user_id) {
  int32_t id = user_id;
  // Variable `id` is tainted with a user ID.
  ...
}
```

Taint analysis is very well suited to this problem because the program rarely
branches on user IDs, and almost certainly does not perform any computation on
them.