3 ===========================================
4 Kaleidoscope: Implementing a Parser and AST
5 ===========================================
10 Chapter 2 Introduction
11 ======================
13 Welcome to Chapter 2 of the "`Implementing a language with
14 LLVM <index.html>`_" tutorial. This chapter shows you how to use the
15 lexer, built in `Chapter 1 <LangImpl01.html>`_, to build a full
16 `parser <http://en.wikipedia.org/wiki/Parsing>`_ for our Kaleidoscope
17 language. Once we have a parser, we'll define and build an `Abstract
18 Syntax Tree <http://en.wikipedia.org/wiki/Abstract_syntax_tree>`_ (AST).
20 The parser we will build uses a combination of `Recursive Descent
21 Parsing <http://en.wikipedia.org/wiki/Recursive_descent_parser>`_ and
23 Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_ to
24 parse the Kaleidoscope language (the latter for binary expressions and
25 the former for everything else). Before we get to parsing though, let's
26 talk about the output of the parser: the Abstract Syntax Tree.
28 The Abstract Syntax Tree (AST)
29 ==============================
31 The AST for a program captures its behavior in such a way that it is
32 easy for later stages of the compiler (e.g. code generation) to
33 interpret. We basically want one object for each construct in the
34 language, and the AST should closely model the language. In
35 Kaleidoscope, we have expressions, a prototype, and a function object.
36 We'll start with expressions first:
40 /// ExprAST - Base class for all expression nodes.
46 /// NumberExprAST - Expression class for numeric literals like "1.0".
47 class NumberExprAST : public ExprAST {
51 NumberExprAST(double Val) : Val(Val) {}
54 The code above shows the definition of the base ExprAST class and one
55 subclass which we use for numeric literals. The important thing to note
56 about this code is that the NumberExprAST class captures the numeric
57 value of the literal as an instance variable. This allows later phases
58 of the compiler to know what the stored numeric value is.
60 Right now we only create the AST, so there are no useful accessor
61 methods on them. It would be very easy to add a virtual method to pretty
62 print the code, for example. Here are the other expression AST node
63 definitions that we'll use in the basic form of the Kaleidoscope
68 /// VariableExprAST - Expression class for referencing a variable, like "a".
69 class VariableExprAST : public ExprAST {
73 VariableExprAST(const std::string &Name) : Name(Name) {}
76 /// BinaryExprAST - Expression class for a binary operator.
77 class BinaryExprAST : public ExprAST {
79 std::unique_ptr<ExprAST> LHS, RHS;
82 BinaryExprAST(char op, std::unique_ptr<ExprAST> LHS,
83 std::unique_ptr<ExprAST> RHS)
84 : Op(op), LHS(std::move(LHS)), RHS(std::move(RHS)) {}
87 /// CallExprAST - Expression class for function calls.
88 class CallExprAST : public ExprAST {
90 std::vector<std::unique_ptr<ExprAST>> Args;
93 CallExprAST(const std::string &Callee,
94 std::vector<std::unique_ptr<ExprAST>> Args)
95 : Callee(Callee), Args(std::move(Args)) {}
98 This is all (intentionally) rather straight-forward: variables capture
99 the variable name, binary operators capture their opcode (e.g. '+'), and
100 calls capture a function name as well as a list of any argument
101 expressions. One thing that is nice about our AST is that it captures
102 the language features without talking about the syntax of the language.
103 Note that there is no discussion about precedence of binary operators,
104 lexical structure, etc.
106 For our basic language, these are all of the expression nodes we'll
107 define. Because it doesn't have conditional control flow, it isn't
108 Turing-complete; we'll fix that in a later installment. The two things
109 we need next are a way to talk about the interface to a function, and a
110 way to talk about functions themselves:
114 /// PrototypeAST - This class represents the "prototype" for a function,
115 /// which captures its name, and its argument names (thus implicitly the number
116 /// of arguments the function takes).
119 std::vector<std::string> Args;
122 PrototypeAST(const std::string &name, std::vector<std::string> Args)
123 : Name(name), Args(std::move(Args)) {}
125 const std::string &getName() const { return Name; }
128 /// FunctionAST - This class represents a function definition itself.
130 std::unique_ptr<PrototypeAST> Proto;
131 std::unique_ptr<ExprAST> Body;
134 FunctionAST(std::unique_ptr<PrototypeAST> Proto,
135 std::unique_ptr<ExprAST> Body)
136 : Proto(std::move(Proto)), Body(std::move(Body)) {}
139 In Kaleidoscope, functions are typed with just a count of their
140 arguments. Since all values are double precision floating point, the
141 type of each argument doesn't need to be stored anywhere. In a more
142 aggressive and realistic language, the "ExprAST" class would probably
145 With this scaffolding, we can now talk about parsing expressions and
146 function bodies in Kaleidoscope.
151 Now that we have an AST to build, we need to define the parser code to
152 build it. The idea here is that we want to parse something like "x+y"
153 (which is returned as three tokens by the lexer) into an AST that could
154 be generated with calls like this:
158 auto LHS = std::make_unique<VariableExprAST>("x");
159 auto RHS = std::make_unique<VariableExprAST>("y");
160 auto Result = std::make_unique<BinaryExprAST>('+', std::move(LHS),
163 In order to do this, we'll start by defining some basic helper routines:
167 /// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
168 /// token the parser is looking at. getNextToken reads another token from the
169 /// lexer and updates CurTok with its results.
171 static int getNextToken() {
172 return CurTok = gettok();
175 This implements a simple token buffer around the lexer. This allows us
176 to look one token ahead at what the lexer is returning. Every function
177 in our parser will assume that CurTok is the current token that needs to
183 /// LogError* - These are little helper functions for error handling.
184 std::unique_ptr<ExprAST> LogError(const char *Str) {
185 fprintf(stderr, "LogError: %s\n", Str);
188 std::unique_ptr<PrototypeAST> LogErrorP(const char *Str) {
193 The ``LogError`` routines are simple helper routines that our parser will
194 use to handle errors. The error recovery in our parser will not be the
195 best and is not particular user-friendly, but it will be enough for our
196 tutorial. These routines make it easier to handle errors in routines
197 that have various return types: they always return null.
199 With these basic helper functions, we can implement the first piece of
200 our grammar: numeric literals.
202 Basic Expression Parsing
203 ========================
205 We start with numeric literals, because they are the simplest to
206 process. For each production in our grammar, we'll define a function
207 which parses that production. For numeric literals, we have:
211 /// numberexpr ::= number
212 static std::unique_ptr<ExprAST> ParseNumberExpr() {
213 auto Result = std::make_unique<NumberExprAST>(NumVal);
214 getNextToken(); // consume the number
215 return std::move(Result);
218 This routine is very simple: it expects to be called when the current
219 token is a ``tok_number`` token. It takes the current number value,
220 creates a ``NumberExprAST`` node, advances the lexer to the next token,
223 There are some interesting aspects to this. The most important one is
224 that this routine eats all of the tokens that correspond to the
225 production and returns the lexer buffer with the next token (which is
226 not part of the grammar production) ready to go. This is a fairly
227 standard way to go for recursive descent parsers. For a better example,
228 the parenthesis operator is defined like this:
232 /// parenexpr ::= '(' expression ')'
233 static std::unique_ptr<ExprAST> ParseParenExpr() {
234 getNextToken(); // eat (.
235 auto V = ParseExpression();
240 return LogError("expected ')'");
241 getNextToken(); // eat ).
245 This function illustrates a number of interesting things about the
248 1) It shows how we use the LogError routines. When called, this function
249 expects that the current token is a '(' token, but after parsing the
250 subexpression, it is possible that there is no ')' waiting. For example,
251 if the user types in "(4 x" instead of "(4)", the parser should emit an
252 error. Because errors can occur, the parser needs a way to indicate that
253 they happened: in our parser, we return null on an error.
255 2) Another interesting aspect of this function is that it uses recursion
256 by calling ``ParseExpression`` (we will soon see that
257 ``ParseExpression`` can call ``ParseParenExpr``). This is powerful
258 because it allows us to handle recursive grammars, and keeps each
259 production very simple. Note that parentheses do not cause construction
260 of AST nodes themselves. While we could do it this way, the most
261 important role of parentheses are to guide the parser and provide
262 grouping. Once the parser constructs the AST, parentheses are not
265 The next simple production is for handling variable references and
272 /// ::= identifier '(' expression* ')'
273 static std::unique_ptr<ExprAST> ParseIdentifierExpr() {
274 std::string IdName = IdentifierStr;
276 getNextToken(); // eat identifier.
278 if (CurTok != '(') // Simple variable ref.
279 return std::make_unique<VariableExprAST>(IdName);
282 getNextToken(); // eat (
283 std::vector<std::unique_ptr<ExprAST>> Args;
286 if (auto Arg = ParseExpression())
287 Args.push_back(std::move(Arg));
295 return LogError("Expected ')' or ',' in argument list");
303 return std::make_unique<CallExprAST>(IdName, std::move(Args));
306 This routine follows the same style as the other routines. (It expects
307 to be called if the current token is a ``tok_identifier`` token). It
308 also has recursion and error handling. One interesting aspect of this is
309 that it uses *look-ahead* to determine if the current identifier is a
310 stand alone variable reference or if it is a function call expression.
311 It handles this by checking to see if the token after the identifier is
312 a '(' token, constructing either a ``VariableExprAST`` or
313 ``CallExprAST`` node as appropriate.
315 Now that we have all of our simple expression-parsing logic in place, we
316 can define a helper function to wrap it together into one entry point.
317 We call this class of expressions "primary" expressions, for reasons
318 that will become more clear `later in the
319 tutorial <LangImpl6.html#user-defined-unary-operators>`_. In order to parse an arbitrary
320 primary expression, we need to determine what sort of expression it is:
325 /// ::= identifierexpr
328 static std::unique_ptr<ExprAST> ParsePrimary() {
331 return LogError("unknown token when expecting an expression");
333 return ParseIdentifierExpr();
335 return ParseNumberExpr();
337 return ParseParenExpr();
341 Now that you see the definition of this function, it is more obvious why
342 we can assume the state of CurTok in the various functions. This uses
343 look-ahead to determine which sort of expression is being inspected, and
344 then parses it with a function call.
346 Now that basic expressions are handled, we need to handle binary
347 expressions. They are a bit more complex.
349 Binary Expression Parsing
350 =========================
352 Binary expressions are significantly harder to parse because they are
353 often ambiguous. For example, when given the string "x+y\*z", the parser
354 can choose to parse it as either "(x+y)\*z" or "x+(y\*z)". With common
355 definitions from mathematics, we expect the later parse, because "\*"
356 (multiplication) has higher *precedence* than "+" (addition).
358 There are many ways to handle this, but an elegant and efficient way is
359 to use `Operator-Precedence
360 Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_.
361 This parsing technique uses the precedence of binary operators to guide
362 recursion. To start with, we need a table of precedences:
366 /// BinopPrecedence - This holds the precedence for each binary operator that is
368 static std::map<char, int> BinopPrecedence;
370 /// GetTokPrecedence - Get the precedence of the pending binary operator token.
371 static int GetTokPrecedence() {
372 if (!isascii(CurTok))
375 // Make sure it's a declared binop.
376 int TokPrec = BinopPrecedence[CurTok];
377 if (TokPrec <= 0) return -1;
382 // Install standard binary operators.
383 // 1 is lowest precedence.
384 BinopPrecedence['<'] = 10;
385 BinopPrecedence['+'] = 20;
386 BinopPrecedence['-'] = 20;
387 BinopPrecedence['*'] = 40; // highest.
391 For the basic form of Kaleidoscope, we will only support 4 binary
392 operators (this can obviously be extended by you, our brave and intrepid
393 reader). The ``GetTokPrecedence`` function returns the precedence for
394 the current token, or -1 if the token is not a binary operator. Having a
395 map makes it easy to add new operators and makes it clear that the
396 algorithm doesn't depend on the specific operators involved, but it
397 would be easy enough to eliminate the map and do the comparisons in the
398 ``GetTokPrecedence`` function. (Or just use a fixed-size array).
400 With the helper above defined, we can now start parsing binary
401 expressions. The basic idea of operator precedence parsing is to break
402 down an expression with potentially ambiguous binary operators into
403 pieces. Consider, for example, the expression "a+b+(c+d)\*e\*f+g".
404 Operator precedence parsing considers this as a stream of primary
405 expressions separated by binary operators. As such, it will first parse
406 the leading primary expression "a", then it will see the pairs [+, b]
407 [+, (c+d)] [\*, e] [\*, f] and [+, g]. Note that because parentheses are
408 primary expressions, the binary expression parser doesn't need to worry
409 about nested subexpressions like (c+d) at all.
411 To start, an expression is a primary expression potentially followed by
412 a sequence of [binop,primaryexpr] pairs:
417 /// ::= primary binoprhs
419 static std::unique_ptr<ExprAST> ParseExpression() {
420 auto LHS = ParsePrimary();
424 return ParseBinOpRHS(0, std::move(LHS));
427 ``ParseBinOpRHS`` is the function that parses the sequence of pairs for
428 us. It takes a precedence and a pointer to an expression for the part
429 that has been parsed so far. Note that "x" is a perfectly valid
430 expression: As such, "binoprhs" is allowed to be empty, in which case it
431 returns the expression that is passed into it. In our example above, the
432 code passes the expression for "a" into ``ParseBinOpRHS`` and the
433 current token is "+".
435 The precedence value passed into ``ParseBinOpRHS`` indicates the
436 *minimal operator precedence* that the function is allowed to eat. For
437 example, if the current pair stream is [+, x] and ``ParseBinOpRHS`` is
438 passed in a precedence of 40, it will not consume any tokens (because
439 the precedence of '+' is only 20). With this in mind, ``ParseBinOpRHS``
445 /// ::= ('+' primary)*
446 static std::unique_ptr<ExprAST> ParseBinOpRHS(int ExprPrec,
447 std::unique_ptr<ExprAST> LHS) {
448 // If this is a binop, find its precedence.
450 int TokPrec = GetTokPrecedence();
452 // If this is a binop that binds at least as tightly as the current binop,
453 // consume it, otherwise we are done.
454 if (TokPrec < ExprPrec)
457 This code gets the precedence of the current token and checks to see if
458 if is too low. Because we defined invalid tokens to have a precedence of
459 -1, this check implicitly knows that the pair-stream ends when the token
460 stream runs out of binary operators. If this check succeeds, we know
461 that the token is a binary operator and that it will be included in this
466 // Okay, we know this is a binop.
468 getNextToken(); // eat binop
470 // Parse the primary expression after the binary operator.
471 auto RHS = ParsePrimary();
475 As such, this code eats (and remembers) the binary operator and then
476 parses the primary expression that follows. This builds up the whole
477 pair, the first of which is [+, b] for the running example.
479 Now that we parsed the left-hand side of an expression and one pair of
480 the RHS sequence, we have to decide which way the expression associates.
481 In particular, we could have "(a+b) binop unparsed" or "a + (b binop
482 unparsed)". To determine this, we look ahead at "binop" to determine its
483 precedence and compare it to BinOp's precedence (which is '+' in this
488 // If BinOp binds less tightly with RHS than the operator after RHS, let
489 // the pending operator take RHS as its LHS.
490 int NextPrec = GetTokPrecedence();
491 if (TokPrec < NextPrec) {
493 If the precedence of the binop to the right of "RHS" is lower or equal
494 to the precedence of our current operator, then we know that the
495 parentheses associate as "(a+b) binop ...". In our example, the current
496 operator is "+" and the next operator is "+", we know that they have the
497 same precedence. In this case we'll create the AST node for "a+b", and
498 then continue parsing:
502 ... if body omitted ...
506 LHS = std::make_unique<BinaryExprAST>(BinOp, std::move(LHS),
508 } // loop around to the top of the while loop.
511 In our example above, this will turn "a+b+" into "(a+b)" and execute the
512 next iteration of the loop, with "+" as the current token. The code
513 above will eat, remember, and parse "(c+d)" as the primary expression,
514 which makes the current pair equal to [+, (c+d)]. It will then evaluate
515 the 'if' conditional above with "\*" as the binop to the right of the
516 primary. In this case, the precedence of "\*" is higher than the
517 precedence of "+" so the if condition will be entered.
519 The critical question left here is "how can the if condition parse the
520 right hand side in full"? In particular, to build the AST correctly for
521 our example, it needs to get all of "(c+d)\*e\*f" as the RHS expression
522 variable. The code to do this is surprisingly simple (code from the
523 above two blocks duplicated for context):
527 // If BinOp binds less tightly with RHS than the operator after RHS, let
528 // the pending operator take RHS as its LHS.
529 int NextPrec = GetTokPrecedence();
530 if (TokPrec < NextPrec) {
531 RHS = ParseBinOpRHS(TokPrec+1, std::move(RHS));
536 LHS = std::make_unique<BinaryExprAST>(BinOp, std::move(LHS),
538 } // loop around to the top of the while loop.
541 At this point, we know that the binary operator to the RHS of our
542 primary has higher precedence than the binop we are currently parsing.
543 As such, we know that any sequence of pairs whose operators are all
544 higher precedence than "+" should be parsed together and returned as
545 "RHS". To do this, we recursively invoke the ``ParseBinOpRHS`` function
546 specifying "TokPrec+1" as the minimum precedence required for it to
547 continue. In our example above, this will cause it to return the AST
548 node for "(c+d)\*e\*f" as RHS, which is then set as the RHS of the '+'
551 Finally, on the next iteration of the while loop, the "+g" piece is
552 parsed and added to the AST. With this little bit of code (14
553 non-trivial lines), we correctly handle fully general binary expression
554 parsing in a very elegant way. This was a whirlwind tour of this code,
555 and it is somewhat subtle. I recommend running through it with a few
556 tough examples to see how it works.
558 This wraps up handling of expressions. At this point, we can point the
559 parser at an arbitrary token stream and build an expression from it,
560 stopping at the first token that is not part of the expression. Next up
561 we need to handle function definitions, etc.
566 The next thing missing is handling of function prototypes. In
567 Kaleidoscope, these are used both for 'extern' function declarations as
568 well as function body definitions. The code to do this is
569 straight-forward and not very interesting (once you've survived
575 /// ::= id '(' id* ')'
576 static std::unique_ptr<PrototypeAST> ParsePrototype() {
577 if (CurTok != tok_identifier)
578 return LogErrorP("Expected function name in prototype");
580 std::string FnName = IdentifierStr;
584 return LogErrorP("Expected '(' in prototype");
586 // Read the list of argument names.
587 std::vector<std::string> ArgNames;
588 while (getNextToken() == tok_identifier)
589 ArgNames.push_back(IdentifierStr);
591 return LogErrorP("Expected ')' in prototype");
594 getNextToken(); // eat ')'.
596 return std::make_unique<PrototypeAST>(FnName, std::move(ArgNames));
599 Given this, a function definition is very simple, just a prototype plus
600 an expression to implement the body:
604 /// definition ::= 'def' prototype expression
605 static std::unique_ptr<FunctionAST> ParseDefinition() {
606 getNextToken(); // eat def.
607 auto Proto = ParsePrototype();
608 if (!Proto) return nullptr;
610 if (auto E = ParseExpression())
611 return std::make_unique<FunctionAST>(std::move(Proto), std::move(E));
615 In addition, we support 'extern' to declare functions like 'sin' and
616 'cos' as well as to support forward declaration of user functions. These
617 'extern's are just prototypes with no body:
621 /// external ::= 'extern' prototype
622 static std::unique_ptr<PrototypeAST> ParseExtern() {
623 getNextToken(); // eat extern.
624 return ParsePrototype();
627 Finally, we'll also let the user type in arbitrary top-level expressions
628 and evaluate them on the fly. We will handle this by defining anonymous
629 nullary (zero argument) functions for them:
633 /// toplevelexpr ::= expression
634 static std::unique_ptr<FunctionAST> ParseTopLevelExpr() {
635 if (auto E = ParseExpression()) {
636 // Make an anonymous proto.
637 auto Proto = std::make_unique<PrototypeAST>("", std::vector<std::string>());
638 return std::make_unique<FunctionAST>(std::move(Proto), std::move(E));
643 Now that we have all the pieces, let's build a little driver that will
644 let us actually *execute* this code we've built!
649 The driver for this simply invokes all of the parsing pieces with a
650 top-level dispatch loop. There isn't much interesting here, so I'll just
651 include the top-level loop. See `below <#full-code-listing>`_ for full code in the
652 "Top-Level Parsing" section.
656 /// top ::= definition | external | expression | ';'
657 static void MainLoop() {
659 fprintf(stderr, "ready> ");
663 case ';': // ignore top-level semicolons.
673 HandleTopLevelExpression();
679 The most interesting part of this is that we ignore top-level
680 semicolons. Why is this, you ask? The basic reason is that if you type
681 "4 + 5" at the command line, the parser doesn't know whether that is the
682 end of what you will type or not. For example, on the next line you
683 could type "def foo..." in which case 4+5 is the end of a top-level
684 expression. Alternatively you could type "\* 6", which would continue
685 the expression. Having top-level semicolons allows you to type "4+5;",
686 and the parser will know you are done.
691 With just under 400 lines of commented code (240 lines of non-comment,
692 non-blank code), we fully defined our minimal language, including a
693 lexer, parser, and AST builder. With this done, the executable will
694 validate Kaleidoscope code and tell us if it is grammatically invalid.
695 For example, here is a sample interaction:
700 ready> def foo(x y) x+foo(y, 4.0);
701 Parsed a function definition.
702 ready> def foo(x y) x+y y;
703 Parsed a function definition.
704 Parsed a top-level expr
705 ready> def foo(x y) x+y );
706 Parsed a function definition.
707 Error: unknown token when expecting an expression
708 ready> extern sin(a);
709 ready> Parsed an extern
713 There is a lot of room for extension here. You can define new AST nodes,
714 extend the language in many ways, etc. In the `next
715 installment <LangImpl03.html>`_, we will describe how to generate LLVM
716 Intermediate Representation (IR) from the AST.
721 Here is the complete code listing for our running example. Because this
722 uses the LLVM libraries, we need to link them in. To do this, we use the
723 `llvm-config <http://llvm.org/cmds/llvm-config.html>`_ tool to inform
724 our makefile/command line about which options to use:
729 clang++ -g -O3 toy.cpp `llvm-config --cxxflags`
735 .. literalinclude:: ../../../examples/Kaleidoscope/Chapter2/toy.cpp
738 `Next: Implementing Code Generation to LLVM IR <LangImpl03.html>`_