The C++ grammar is highly ambiguous, so the GLR parser produces a forest of
parses, represented compactly by a DAG.
A real C++ parser finds the correct parse through semantic analysis: mostly
resolving names. But we can't do that, as we don't parse the headers.
Our disambiguation phase should take the parse forest, and choose a single parse
tree that is most likely.
It might **optionally** use some other hints (e.g. coding style, or what
specific names tend to mean in this codebase).
There are some grammatical ambiguities that can be resolved without semantic
analysis, e.g. whether `int <declarator>{}` is a function-definition.
We eliminate these earlier, e.g. with rule guards. By "disambiguation" we mean
choosing between interpretations that we can't reject confidently and locally.
We have limited information to go on, and strive to use heuristics similar to
those a human reader would.
### Likely and unlikely structure
In some cases, the shape of a particular interpretation is unlikely but not
impossible. For example, the statement `x(a);` might:

- call a function `x` (likely)
- construct a temporary of class type `x` (less likely)
- define a variable `a` of type `x`, which is an alias for e.g. `int`
We model this as a bonus/penalty for including a particular forest node in the
chosen parse. For each rule we want to apply, we can write some code to
recognize the corresponding pattern in the parse tree, and run these recognizers
at each node to assign the bonuses.
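As a rough sketch (the node shape, rule strings, and weights here are illustrative assumptions, not the real implementation), such recognizers could be simple functions from forest nodes to score deltas:

```cpp
#include <functional>
#include <string>
#include <vector>

// Hypothetical forest node: identified by the grammar rule that produced it.
struct ForestNode {
  std::string Rule; // e.g. "expr := expr ( expr )"
};

// A recognizer returns a bonus (>0) or penalty (<0) for including a node.
using Recognizer = std::function<int(const ForestNode &)>;

// Run every recognizer at a node, accumulating its total score.
int scoreNode(const ForestNode &N, const std::vector<Recognizer> &Rs) {
  int Score = 0;
  for (const auto &R : Rs)
    Score += R(N);
  return Score;
}

// Illustrative recognizers for the `x(a);` ambiguity above.
std::vector<Recognizer> exampleRecognizers() {
  return {
      // Function calls are likely: reward this shape.
      [](const ForestNode &N) {
        return N.Rule == "expr := expr ( expr )" ? 1 : 0;
      },
      // A parenthesized declarator here is unlikely: penalize it.
      [](const ForestNode &N) {
        return N.Rule == "declarator := ( declarator )" ? -1 : 0;
      },
  };
}
```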
### Interpreting names
Just as resolving names allows a C++ parser to choose the right parse (rejecting
others), chunks of a parse tree imply things about how names resolve.

Because the same name often means the same thing in different contexts, we can
apply knowledge from elsewhere. This can be as simple as knowing "`vector` is
usually a type", and adding bonuses to nodes that include that interpretation.
However, we can also transfer knowledge across the same file we're parsing:
```cpp
// Is `Builder` a class or a namespace?
void Builder::build() { ... }

// `Builder` is a type.
```
We can use this to understand more-ambiguous code based on names in a section
we're more sure about. It also pushes us to provide a consistent parse, rather
than interpreting each occurrence of an unclear name differently.
Again, we can define bonuses/penalties for forest nodes that interpret names,
but this time those bonuses change as we disambiguate. Specifically:

- we can group identifiers into **classes**, most importantly "all identifiers
  with text 'foo'" but also "all snake_case identifiers".
- clusters of nodes immediately above the identifiers in the parse forest are
  **interpretations**; they bind the identifier to a **kind** such as "type",
  "value", "namespace", "other".
- for each class we can query information once from an external source (such as
  an index or hard-coded list), yielding a vector of weights (one per kind)
- at each point we can compute metrics based on interpretations in the forest:
  - the number of identifiers in the class that are interpreted as each kind
    (e.g. *all* remaining interpretations of 'foo' at 3:7 are 'type')
  - the number of identifiers in the class that *may* be interpreted as each
    kind (e.g. 'foo' at 3:7 might be a 'type').
- we can mash these metrics together into a vector of bonuses for a class (e.g.
  for identifiers with text 'bar': 'type' => +5, 'namespace' => +1, 'value' => -2).
- these bonuses are assigned to the corresponding interpretations in the graph
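One possible shape for this mashing step (the kind set, the metric names, and the 3/1/1 weights are all illustrative assumptions, not a specification):

```cpp
#include <array>

// Kinds an identifier interpretation can bind to (illustrative set).
enum Kind { Type, Value, Namespace, Other, NumKinds };

// Metrics for one identifier class (e.g. all identifiers spelled "bar").
struct ClassMetrics {
  std::array<int, NumKinds> Must;   // identifiers *only* interpretable as kind K
  std::array<int, NumKinds> May;    // identifiers *possibly* interpretable as K
  std::array<int, NumKinds> Extern; // weights from an index or hard-coded list
};

// Mash the metrics into one bonus per kind. The weights (3, 1, 1) are
// illustrative, not tuned: confirmed in-file evidence counts most.
std::array<int, NumKinds> bonuses(const ClassMetrics &M) {
  std::array<int, NumKinds> B{};
  for (int K = 0; K < NumKinds; ++K)
    B[K] = 3 * M.Must[K] + 1 * M.May[K] + 1 * M.Extern[K];
  return B;
}
```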
Another aspect of a name is whether it names a template (type or value). This
is ambiguous in many more cases since CTAD allowed template arguments to be
omitted.
A fairly simple heuristic appears sufficient here: things that look like
templates usually are, so if a node for certain rules exists in the forest
(e.g. `template-id := template-name < template-argument-list >`) then we treat
the template name as a probable template, and apply a bonus to every node that
interprets it that way. We do this even if alternate parses are possible
(`a < b > :: c` might be a comparison, but is still evidence `a` is a template).
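A sketch of this heuristic (the flattened node shape, rule strings, and the +2 bonus are illustrative):

```cpp
#include <set>
#include <string>
#include <vector>

// Illustrative flattened view of forest nodes: the rule that produced the
// node, plus the identifier it mentions (if any).
struct Node {
  std::string Rule;
  std::string Name;
};

// If *any* parse sees `name < args >` as a template-id, `name` is probably
// a template, even when alternate parses of the same tokens exist.
std::set<std::string> probableTemplates(const std::vector<Node> &Forest) {
  std::set<std::string> Result;
  for (const auto &N : Forest)
    if (N.Rule == "template-id := template-name < template-argument-list >")
      Result.insert(N.Name);
  return Result;
}

// Bonus for a forest node interpreting `Name` as a template (+2 is an
// illustrative weight, not a tuned value).
int templateBonus(const std::string &Name,
                  const std::set<std::string> &Probable) {
  return Probable.count(Name) ? 2 : 0;
}
```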
With static node scores, finding the best tree is a very tractable problem
with an efficient solution.
With dynamic scores it becomes murky and we have to settle for approximations.
These build on the same idea, so we'll look at the simple version first.
### Naive version (static scores)
At a high level, we want to assign bonuses to nodes, and find the tree that
maximizes the total score. If bonuses were fixed, independent of other
disambiguation decisions, then we could simply walk bottom-up, aggregating
scores and replacing each ambiguous node with the top-scoring alternative
subtree. This could happen directly on the parse tree.
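This bottom-up pass can be sketched as follows (the tree type and scores are illustrative; the real structure is a forest DAG with sharing):

```cpp
#include <cstddef>
#include <memory>
#include <vector>

// Illustrative tree: an ambiguous node's children are the alternatives to
// choose between; any other node's children are ordinary subnodes.
struct Tree {
  int Bonus = 0; // static score for including this node
  bool Ambiguous = false;
  std::vector<std::unique_ptr<Tree>> Children;
};

// Returns the subtree's total score, collapsing each ambiguous node to its
// best-scoring alternative on the way back up.
int resolve(Tree &N) {
  if (N.Ambiguous) {
    int BestScore = 0;
    std::size_t Best = 0;
    for (std::size_t I = 0; I < N.Children.size(); ++I) {
      int S = resolve(*N.Children[I]);
      if (I == 0 || S > BestScore) {
        BestScore = S;
        Best = I;
      }
    }
    // Keep only the winning alternative.
    auto Winner = std::move(N.Children[Best]);
    N.Children.clear();
    N.Children.push_back(std::move(Winner));
    N.Ambiguous = false;
    return BestScore;
  }
  int Total = N.Bonus;
  for (auto &C : N.Children)
    Total += resolve(*C);
  return Total;
}
```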
Given this tree as input:
```mermaid
graph TD
  class idA,open,idB,close,semi token;

  typeA["type := IDENTIFIER"] --- idA
  exprA["expr := IDENTIFIER"] --- idA
  exprB["expr := IDENTIFIER"] --- idB
  declB["declarator := IDENTIFIER"] --- idB

  stmtAmbig["stmt?"]:::ambig
  stmtAmbig === stmtExpr["stmt := expr ;"]
  stmtExpr --- exprAmbig["expr?"]:::ambig
  exprAmbig === funcall["expr := expr ( expr )"]:::good
  funcall --- exprB["expr := IDENTIFIER"]

  exprAmbig -.- construct["expr := type ( expr )"]:::bad

  stmtAmbig -.- stmtDecl["stmt := decl"]
  stmtDecl --- decl["decl := type declarator ;"]
  decl --- declParens["declarator := ( declarator )"]:::bad

  classDef ambig fill:blue,color:white;
  classDef token fill:yellow;
  classDef good fill:lightgreen
  classDef bad fill:pink
```
A post-order traversal reaches the ambiguous node `expr?` first.
The left alternative has a total score of +1 (green bonus for
`expr := expr ( expr )`) and the right alternative has a total score of -1
(red penalty for `expr := type ( expr )`). So we replace `expr?` with its
left alternative.

As we continue traversing, we reach `stmt?` next: again we have +1 in the left
subtree and -1 in the right subtree, so we pick the left one. Result:
```mermaid
graph TD
  class idA,open,idB,close,semi token;

  typeA["type := IDENTIFIER"] --- idA
  exprA["expr := IDENTIFIER"] --- idA
  exprB["expr := IDENTIFIER"] --- idB
  declB["declarator := IDENTIFIER"] --- idB

  stmtExpr["stmt := expr ;"]
  stmtExpr --- funcall["expr := expr ( expr )"]
  funcall --- exprB["expr := IDENTIFIER"]

  classDef token fill:yellow;
```
### Degrees of freedom
We must traverse the DAG bottom-up in order to make score-based decisions:
if an ambiguous node has ambiguous descendants, then we can't calculate its
score until those are resolved.

This gives us a topological **partial** order, but we don't have to go from
left-to-right. At any given point there is a "frontier" of ambiguous nodes with
no ambiguous descendants. The sequence we choose matters: each choice adds
more interpretations that should affect future choices.
Initially, most of the ambiguous nodes in the frontier will be e.g. "is this
identifier a type or a value". If we had to work left-to-right then we'd
immediately be forced to resolve the first name in the file, likely with
little to go on and a high chance of a mistake.
But there are probably names where we have strong evidence, e.g. we've seen an
(unambiguous) declaration of a variable `foo`, so other occurrences of `foo`
are very likely to be values rather than types. We can disambiguate these with
high confidence, and these choices are likely to "unlock" other conclusions that
we can use for further disambiguation.
This is intuitively similar to how people decipher ambiguous code: they find a
piece that's easy to understand, read that to learn what names mean, and use
the knowledge gained to study the more difficult parts.
To formalize this a little:
- we prioritize making the **highest confidence** decisions first
- we define **confidence** as the score of the accepted alternative, minus the
  score of the best rejected alternative.
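For a single ambiguous node, this confidence is just the margin between the top two alternative scores, e.g. (a minimal sketch, assuming scores are already known):

```cpp
#include <algorithm>
#include <functional>
#include <vector>

// Confidence of resolving one ambiguous node: the score of the alternative
// we accept, minus the score of the best alternative we reject.
// Assumes at least two alternatives (otherwise there is no ambiguity).
int confidence(std::vector<int> AlternativeScores) {
  std::sort(AlternativeScores.begin(), AlternativeScores.end(),
            std::greater<int>());
  return AlternativeScores[0] - AlternativeScores[1];
}
```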
### Removing the bottom-up restriction
Strictly only resolving "frontier" ambiguities may cause problems.
Consider the following example:
```mermaid
graph TD
  ambig1 --- choice1:::good
  choice1 --- ambig2:::ambig
  ambig2 --- aType["a is class"] --- a
  ambig2 --- aTemplate["a is CTAD"] --- a
  choice1 --- bValue["b is variable"] --- b

  classDef ambig fill:blue,color:white;
  classDef token fill:yellow;
  classDef good fill:lightgreen
  classDef bad fill:pink
```
We have some evidence that `choice1` is good. If we selected it, we would know
that `b` is a variable and could use this in disambiguating the rest of the
file. However we can't select `choice1` until we decide exactly how to interpret
`a`, and there might be little info to do so. Gating higher-confidence decisions
on lower-confidence ones increases our chance of making an error.
A possible fix would be to generalize to a **range** of possible scores for
nodes above the frontier, and rank by **minimum confidence**, i.e. the
min-score of the accepted alternative, minus the highest max-score among the
rejected alternatives.
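Sketching this generalization (the interval representation and combination rules are assumptions about how score ranges would be tracked):

```cpp
#include <algorithm>

// Score range for an alternative above the frontier: [Min, Max] bounds on
// what its score could become once the ambiguities below it are resolved.
struct Range {
  int Min, Max;
};

// An unresolved ambiguous node spans all of its alternatives' ranges.
Range join(Range A, Range B) {
  return {std::min(A.Min, B.Min), std::max(A.Max, B.Max)};
}

// Sibling subtrees contribute additively.
Range sum(Range A, Range B) { return {A.Min + B.Min, A.Max + B.Max}; }

// Minimum confidence: even in the worst case for the accepted alternative
// and the best case for the rejected one, how far ahead is the accepted?
int minConfidence(Range Accepted, Range Rejected) {
  return Accepted.Min - Rejected.Max;
}
```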
The remaining challenges are mainly:
- defining the score function for an alternative. This is TBD, pending
- finding a data structure and algorithm to efficiently resolve/re-evaluate
  in a loop until we've resolved all ambiguities.
### Disambiguation DAG
Rather than operate on the forest directly, it's simpler to consider a reduced
view that hides complexity unrelated to disambiguation:
```mermaid
graph TD
  class open,a,star,b,semi,close token

  compound-stmt --- open
  compound-stmt --- stmt?
  compound-stmt --- close

  stmt?:::ambig --- decl-stmt
  decl-stmt --- type-name
  type-name --a is type--- a
  decl-stmt --- declarator
  declarator --- declarator_b["declarator"]
  declarator_b --b is value--- b

  stmt?:::ambig --- expr-stmt
  expr-stmt --- expr1["expr"]
  expr-stmt --- expr_b["expr"]
  expr_a --a is value--- a
  expr_b --b is value--- b

  classDef ambig fill:blue,color:white;
  classDef token fill:yellow;
```
```mermaid
graph TD
  stmt?:::ambig --- decl-stmt
  decl-stmt --a is type--- a
  decl-stmt --b is value--- b

  stmt?:::ambig --- expr-stmt
  expr-stmt --a is value--- a
  expr-stmt --b is value--- b

  classDef ambig fill:blue,color:white;
  classDef token fill:yellow;
```
Here the clusters of non-ambiguous forest nodes are grouped together, so that
the DAG is bipartite: ambiguous nodes and cluster nodes, with interpretation
edges at the bottom.

Scoring the clusters and selecting which to include is equivalent to
disambiguating the full graph.
### Resolving the ambiguity DAG
The static scores of the forest nodes are aggregated into static scores for the
clusters.
The interpretation edges of the frontier clusters can be scored based on the
context available so far, and the scores "bubble up" to parent nodes, with
ambiguous nodes creating score ranges as described above.

The main dynamic signal is when a token has been fully resolved, which happens
when all the interpretations leading to it have the same label.
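That check is straightforward, e.g. (representing kind labels as plain strings, illustratively):

```cpp
#include <string>
#include <vector>

// A token is fully resolved once every interpretation edge still reaching it
// carries the same kind label.
bool fullyResolved(const std::vector<std::string> &RemainingLabels) {
  for (const auto &L : RemainingLabels)
    if (L != RemainingLabels.front())
      return false;
  return !RemainingLabels.empty();
}
```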
The naive algorithm is to score all clusters, choose the best to resolve, and
repeat.
However this is very slow:
- there are **many** ambiguities available at first, therefore many clusters to
  score
- each time we resolve an ambiguity, we invalidate previously computed scores
- while the clusters become fewer over time, there are more interpretations per
  cluster
It's tempting to use a priority queue to avoid repeatedly scanning clusters.
However, if we invalidate a large fraction of the heap's elements each round, we
lose the efficiency benefits it brings.
We could reuse scores if the resolved cluster doesn't tell us much about the
target cluster.
The simplest idea is to recalculate only clusters with an overlapping word, but
this may not save much (consider `std`) as clusters get larger.
A heuristic to estimate how much one cluster affects another may help.
To stop the clusters having too many interpretation edges (and thus taking too
long to score), we can drop the edges for any token that is fully resolved. We
need to track these anyway (for scoring of interpretations of other identifiers
with the same text). And once only a single interpretation exists, removing it
has no impact on scores.
So for now the sketch is:
- build the ambiguity DAG
- compute scores for all clusters
- place confidences (score difference) for each cluster in a priority queue
- while there is still ambiguity:
  - take the most confident cluster C and resolve it
  - propagate the score change to all of C's ancestors
  - work out which identifiers are now resolved, record that, and remove the
    interpretations from the graph
  - recompute scores for the K clusters most affected by resolving those
    identifiers, and their ancestors
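To illustrate the loop's shape (the confidence function is a stand-in for the TBD scoring, and the version counter implements lazy invalidation of stale queue entries rather than in-place heap updates):

```cpp
#include <functional>
#include <queue>
#include <vector>

// One priority-queue entry per (cluster, score version). Entries computed
// before the cluster's last invalidation are stale and skipped on pop.
struct Entry {
  int Confidence, Cluster, Version;
  bool operator<(const Entry &O) const { return Confidence < O.Confidence; }
};

// Returns the order in which clusters get resolved. `Confidence` stands in
// for the (still TBD) scoring function; here it is fixed per cluster, while
// the real one would change as identifiers resolve.
std::vector<int> resolutionOrder(int NumClusters,
                                 std::function<int(int)> Confidence) {
  std::vector<int> Version(NumClusters, 0), Order;
  std::vector<bool> Resolved(NumClusters, false);
  std::priority_queue<Entry> Queue;
  for (int C = 0; C < NumClusters; ++C)
    Queue.push({Confidence(C), C, 0});
  while (!Queue.empty()) {
    Entry E = Queue.top();
    Queue.pop();
    if (Resolved[E.Cluster] || E.Version != Version[E.Cluster])
      continue; // stale entry: scores changed since it was queued
    Resolved[E.Cluster] = true;
    Order.push_back(E.Cluster);
    // The real algorithm would re-score only the K most-affected clusters;
    // this sketch conservatively invalidates and re-queues every unresolved one.
    for (int C = 0; C < NumClusters; ++C)
      if (!Resolved[C])
        Queue.push({Confidence(C), C, ++Version[C]});
  }
  return Order;
}
```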