flang/docs/ParserCombinators.md

   1 <!--===- docs/ParserCombinators.md
   2
   3    Part of the LLVM Project, under the Apache License v2.0 with LLVM Exceptions.
   4    See https://llvm.org/LICENSE.txt for license information.
   5    SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
   6
   7 -->
   8
   9 # Parser Combinators
  10
  11 ```{contents}
  12 ---
  13 local:
  14 ---
  15 ```
  16
  17 This document is a primer on Parser Combinators and their use in Flang.
  18
  19 ## Concept
  20 The Fortran language recognizer here can be classified as an LL recursive
  21 descent parser.  It is composed from a *parser combinator* library that
  22 defines a few fundamental parsers and a few ways to compose them into more
  23 powerful parsers.
  24
  25 For our purposes here, a *parser* is any object that attempts to recognize
  26 an instance of some syntax from an input stream.  It may succeed or fail.
  27 On success, it may return some semantic value to its caller.
  28
  29 In C++ terms, a parser is any instance of a class that
  30 1. has a `constexpr` default constructor,
  31 1. defines a type named `resultType`, and
  32 1. provides a function (`const` member or `static`) that accepts a reference to a
  33 `ParseState` as its argument and returns a `std::optional<resultType>` as a
  34 result, with the presence or absence of a value in the `std::optional<>`
  35 signifying success or failure, respectively.
  36 ```
  37 std::optional<resultType> Parse(ParseState &) const;
  38 ```
  39 The `resultType` of a parser is typically the class type of some particular
  40 node type in the parse tree.
  41
  42 `ParseState` is a class that encapsulates a position in the source stream,
  43 collects messages, and holds a few state flags that determine tokenization
  44 (e.g., are we in a character literal?).  Instances of `ParseState` are
  45 independent and complete -- they are cheap to duplicate whenever necessary to
  46 implement backtracking.
  47
  48 The `constexpr` default constructor of a parser is important.  The functions
  49 (below) that operate on instances of parsers are themselves all `constexpr`.
  50 This use of compile-time expressions allows the entirety of a recursive
  51 descent parser for a language to be constructed at compilation time through
  52 the use of templates.
  53
  54 ### Fundamental Predefined Parsers
  55 These objects and functions are (or return) the fundamental parsers:
  56
  57 * `ok` is a trivial parser that always succeeds without advancing.
  58 * `pure(x)` returns a trivial parser that always succeeds without advancing,
  59   returning some value `x`.
  60 * `pure<T>()` is `pure(T{})` but does not require that T be copy-constructible.
  61 * `fail<T>(msg)` denotes a trivial parser that always fails, emitting the
  62   given message as a side effect.  The template parameter is the type of
  63   the value that the parser never returns.
  64 * `nextCh` consumes the next character and returns its location,
  65   and fails at EOF.
  66 * `"xyz"_ch` succeeds if the next character consumed matches any of those
  67   in the string and returns its location.  Be advised that the source
  68   will have been normalized to lower case (miniscule) letters outside
  69   character and Hollerith literals and edit descriptors before parsing.
  70
  71 ### Combinators
  72 These functions and operators combine existing parsers to generate new parsers.
  73 They are `constexpr`, so they should be viewed as type-safe macros.
  74
  75 * `!p` succeeds if p fails, and fails if p succeeds.
  76 * `p >> q` fails if p does, otherwise running q and returning its value when
  77   it succeeds.
  78 * `p / q` fails if p does, otherwise running q and returning p's value
  79   if q succeeds.
  80 * `p || q` succeeds if p does, otherwise running q.  The two parsers must
  81   have the same type, and the value returned by the first succeeding parser
  82   is the value of the combination.
  83 * `first(p1, p2, ...)` returns the value of the first parser that succeeds.
  84   All of the parsers in the list must return the same type.
  85   It is essentially the same as `p1 || p2 || ...` but has a slightly
  86   faster implementation and may be easier to format in your code.
  87 * `lookAhead(p)` succeeds if p does, but doesn't modify any state.
  88 * `attempt(p)` succeeds if p does, safely preserving state on failure.
  89 * `many(p)` recognizes a greedy sequence of zero or more nonempty successes
  90   of p, and returns `std::list<>` of their values.  It always succeeds.
  91 * `some(p)` recognized a greedy sequence of one or more successes of p.
  92   It fails if p immediately fails.
  93 * `skipMany(p)` is the same as `many(p)`, but it discards the results.
  94 * `maybe(p)` tries to match p, returning an `std::optional<T>` value.
  95   It always succeeds.
  96 * `defaulted(p)` matches p, and when p fails it returns a
  97   default-constructed instance of p's resultType.  It always succeeds.
  98 * `nonemptySeparated(p, q)` repeatedly matches "p q p q p q ... p",
  99   returning a `std::list<>` of only the values of the p's.  It fails if
 100   p immediately fails.
 101 * `extension<feature>([msg,]p)` parses p if strict standard compliance is
 102   disabled, or with an optional warning when nonstandard usage warnings
 103   are enabled.
 104 * `deprecated(p)` parses p if strict standard compliance is disabled,
 105   with a warning if deprecated usage warnings are enabled.
 106 * `inContext(msg, p)` runs p within an error message context; any
 107   message that `p` generates will be tagged with `msg` as its
 108   context.  Contexts may nest.
 109 * `withMessage(msg, p)` succeeds if `p` does, and if it does not,
 110   it discards the messages from `p` and fails with the specified message.
 111 * `recovery(p, q)` is equivalent to `p || q`, except that error messages
 112   generated from the first parser are retained, and a flag is set in
 113   the ParseState to remember that error recovery was necessary.
 114 * `localRecovery(msg, p, q)` is equivalent to
 115   `recovery(withMessage(msg, p), q >> pure<A>())` where `A` is the
 116   result type of 'p'.
 117   It is useful for targeted error recovery situations within statements.
 118
 119 Note that
 120 ```
 121 a >> b >> c / d / e
 122 ```
 123 matches a sequence of five parsers, but returns only the result that was
 124 obtained by matching `c`.
 125
 126 ### Applicatives
 127 The following *applicative* combinators combine parsers and modify or
 128 collect the values that they return.
 129
 130 * `construct<T>(p1, p2, ...)` matches zero or more parsers in succession,
 131   collecting their results and then passing them with move semantics to a
 132   constructor for the type T if they all succeed.
 133   If there is a single parser as the argument and it returns no usable
 134   value but only success or failure (_e.g.,_ `"IF"_tok`), the default
 135   nullary constructor of the type `T` is called.
 136 * `sourced(p)` matches p, and fills in its `source` data member with the
 137   locations of the cooked character stream that it consumed
 138 * `applyFunction(f, p1, p2, ...)` matches one or more parsers in succession,
 139   collecting their results and passing them as rvalue reference arguments to
 140   some function, returning its result.
 141 * `applyLambda([](&&x){}, p1, p2, ...)` is the same thing, but for lambdas
 142   and other function objects.
 143 * `applyMem(mf, p1, p2, ...)` is the same thing, but invokes a member
 144   function of the result of the first parser for updates in place.
 145
 146 ### Token Parsers
 147 Last, we have these basic parsers on which the actual grammar of the Fortran
 148 is built.  All of the following parsers consume characters acquired from
 149 `nextCh`.
 150
 151 * `space` always succeeds after consuming any spaces
 152 * `spaceCheck` always succeeds after consuming any spaces, and can emit
 153   a warning if there was no space in free form code before a character
 154   that could continue a name or keyword
 155 * `digit` matches one cooked decimal digit (0-9)
 156 * `letter` matches one cooked letter (A-Z)
 157 * `"..."_tok` match the content of the string, skipping spaces before and
 158   after.  Internal spaces are optional matches.  The `_tok` suffix is
 159   optional when the parser appears before the combinator `>>` or after
 160   the combinator `/`.  If the quoted string ends in a character that
 161   could appear in an identifier, a missing space will be diagnosed in
 162   free form source in pedantic mode if the next character could also
 163   be part of an identifier -- add a trailing blank to avoid this.
 164 * `"..."_sptok` is a string match in which the spaces are required in
 165    free form source.
 166 * `"..."_id` is a string match for a complete identifier (not a prefix of
 167    a longer identifier or keyword).
 168 * `parenthesized(p)` is shorthand for `"(" >> p / ")"`.
 169 * `bracketed(p)` is shorthand for `"[" >> p / "]"`.
 170 * `nonemptyList(p)` matches a comma-separated list of one or more
 171   instances of p.
 172 * `nonemptyList(errorMessage, p)` is equivalent to
 173   `withMessage(errorMessage, nonemptyList(p))`, which allows one to supply
 174   a meaningful error message in the event of an empty list.
 175 * `optionalList(p)` is the same thing, but can be empty, and always succeeds.
 176
 177 ### Debugging Parser
 178 Last, a string literal `"..."_debug` denotes a parser that emits the string to
 179 `llvm::errs` and succeeds.  It is useful for tracing while debugging a parser but should
 180 obviously not be committed for production code.