1 ===========================================
2 Kaleidoscope: Implementing a Parser and AST
3 ===========================================
11 Welcome to Chapter 2 of the "`Implementing a language with
12 LLVM <index.html>`_" tutorial. This chapter shows you how to use the
13 lexer, built in `Chapter 1 <LangImpl01.html>`_, to build a full
14 `parser <http://en.wikipedia.org/wiki/Parsing>`_ for our Kaleidoscope
15 language. Once we have a parser, we'll define and build an `Abstract
16 Syntax Tree <http://en.wikipedia.org/wiki/Abstract_syntax_tree>`_ (AST).
18 The parser we will build uses a combination of `Recursive Descent
19 Parsing <http://en.wikipedia.org/wiki/Recursive_descent_parser>`_ and
21 Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_ to
22 parse the Kaleidoscope language (the latter for binary expressions and
23 the former for everything else). Before we get to parsing though, let's
24 talk about the output of the parser: the Abstract Syntax Tree.
26 The Abstract Syntax Tree (AST)
27 ==============================
29 The AST for a program captures its behavior in such a way that it is
30 easy for later stages of the compiler (e.g. code generation) to
31 interpret. We basically want one object for each construct in the
32 language, and the AST should closely model the language. In
33 Kaleidoscope, we have expressions, a prototype, and a function object.
34 We'll start with expressions first:
38 /// ExprAST - Base class for all expression nodes.
44 /// NumberExprAST - Expression class for numeric literals like "1.0".
45 class NumberExprAST : public ExprAST {
49 NumberExprAST(double Val) : Val(Val) {}
52 The code above shows the definition of the base ExprAST class and one
53 subclass which we use for numeric literals. The important thing to note
54 about this code is that the NumberExprAST class captures the numeric
55 value of the literal as an instance variable. This allows later phases
56 of the compiler to know what the stored numeric value is.
58 Right now we only create the AST, so there are no useful accessor
59 methods on them. It would be very easy to add a virtual method to pretty
60 print the code, for example. Here are the other expression AST node
61 definitions that we'll use in the basic form of the Kaleidoscope
66 /// VariableExprAST - Expression class for referencing a variable, like "a".
67 class VariableExprAST : public ExprAST {
71 VariableExprAST(const std::string &Name) : Name(Name) {}
74 /// BinaryExprAST - Expression class for a binary operator.
75 class BinaryExprAST : public ExprAST {
77 std::unique_ptr<ExprAST> LHS, RHS;
80 BinaryExprAST(char op, std::unique_ptr<ExprAST> LHS,
81 std::unique_ptr<ExprAST> RHS)
82 : Op(op), LHS(std::move(LHS)), RHS(std::move(RHS)) {}
85 /// CallExprAST - Expression class for function calls.
86 class CallExprAST : public ExprAST {
88 std::vector<std::unique_ptr<ExprAST>> Args;
91 CallExprAST(const std::string &Callee,
92 std::vector<std::unique_ptr<ExprAST>> Args)
93 : Callee(Callee), Args(std::move(Args)) {}
96 This is all (intentionally) rather straight-forward: variables capture
97 the variable name, binary operators capture their opcode (e.g. '+'), and
98 calls capture a function name as well as a list of any argument
99 expressions. One thing that is nice about our AST is that it captures
100 the language features without talking about the syntax of the language.
101 Note that there is no discussion about precedence of binary operators,
102 lexical structure, etc.
104 For our basic language, these are all of the expression nodes we'll
105 define. Because it doesn't have conditional control flow, it isn't
106 Turing-complete; we'll fix that in a later installment. The two things
107 we need next are a way to talk about the interface to a function, and a
108 way to talk about functions themselves:
112 /// PrototypeAST - This class represents the "prototype" for a function,
113 /// which captures its name, and its argument names (thus implicitly the number
114 /// of arguments the function takes).
117 std::vector<std::string> Args;
120 PrototypeAST(const std::string &name, std::vector<std::string> Args)
121 : Name(name), Args(std::move(Args)) {}
123 const std::string &getName() const { return Name; }
126 /// FunctionAST - This class represents a function definition itself.
128 std::unique_ptr<PrototypeAST> Proto;
129 std::unique_ptr<ExprAST> Body;
132 FunctionAST(std::unique_ptr<PrototypeAST> Proto,
133 std::unique_ptr<ExprAST> Body)
134 : Proto(std::move(Proto)), Body(std::move(Body)) {}
137 In Kaleidoscope, functions are typed with just a count of their
138 arguments. Since all values are double precision floating point, the
139 type of each argument doesn't need to be stored anywhere. In a more
140 aggressive and realistic language, the "ExprAST" class would probably
143 With this scaffolding, we can now talk about parsing expressions and
144 function bodies in Kaleidoscope.
149 Now that we have an AST to build, we need to define the parser code to
150 build it. The idea here is that we want to parse something like "x+y"
151 (which is returned as three tokens by the lexer) into an AST that could
152 be generated with calls like this:
156 auto LHS = llvm::make_unique<VariableExprAST>("x");
157 auto RHS = llvm::make_unique<VariableExprAST>("y");
158 auto Result = std::make_unique<BinaryExprAST>('+', std::move(LHS),
161 In order to do this, we'll start by defining some basic helper routines:
165 /// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
166 /// token the parser is looking at. getNextToken reads another token from the
167 /// lexer and updates CurTok with its results.
169 static int getNextToken() {
170 return CurTok = gettok();
173 This implements a simple token buffer around the lexer. This allows us
174 to look one token ahead at what the lexer is returning. Every function
175 in our parser will assume that CurTok is the current token that needs to
181 /// LogError* - These are little helper functions for error handling.
182 std::unique_ptr<ExprAST> LogError(const char *Str) {
183 fprintf(stderr, "LogError: %s\n", Str);
186 std::unique_ptr<PrototypeAST> LogErrorP(const char *Str) {
191 The ``LogError`` routines are simple helper routines that our parser will
192 use to handle errors. The error recovery in our parser will not be the
193 best and is not particular user-friendly, but it will be enough for our
194 tutorial. These routines make it easier to handle errors in routines
195 that have various return types: they always return null.
197 With these basic helper functions, we can implement the first piece of
198 our grammar: numeric literals.
200 Basic Expression Parsing
201 ========================
203 We start with numeric literals, because they are the simplest to
204 process. For each production in our grammar, we'll define a function
205 which parses that production. For numeric literals, we have:
209 /// numberexpr ::= number
210 static std::unique_ptr<ExprAST> ParseNumberExpr() {
211 auto Result = llvm::make_unique<NumberExprAST>(NumVal);
212 getNextToken(); // consume the number
213 return std::move(Result);
216 This routine is very simple: it expects to be called when the current
217 token is a ``tok_number`` token. It takes the current number value,
218 creates a ``NumberExprAST`` node, advances the lexer to the next token,
221 There are some interesting aspects to this. The most important one is
222 that this routine eats all of the tokens that correspond to the
223 production and returns the lexer buffer with the next token (which is
224 not part of the grammar production) ready to go. This is a fairly
225 standard way to go for recursive descent parsers. For a better example,
226 the parenthesis operator is defined like this:
230 /// parenexpr ::= '(' expression ')'
231 static std::unique_ptr<ExprAST> ParseParenExpr() {
232 getNextToken(); // eat (.
233 auto V = ParseExpression();
238 return LogError("expected ')'");
239 getNextToken(); // eat ).
243 This function illustrates a number of interesting things about the
246 1) It shows how we use the LogError routines. When called, this function
247 expects that the current token is a '(' token, but after parsing the
248 subexpression, it is possible that there is no ')' waiting. For example,
249 if the user types in "(4 x" instead of "(4)", the parser should emit an
250 error. Because errors can occur, the parser needs a way to indicate that
251 they happened: in our parser, we return null on an error.
253 2) Another interesting aspect of this function is that it uses recursion
254 by calling ``ParseExpression`` (we will soon see that
255 ``ParseExpression`` can call ``ParseParenExpr``). This is powerful
256 because it allows us to handle recursive grammars, and keeps each
257 production very simple. Note that parentheses do not cause construction
258 of AST nodes themselves. While we could do it this way, the most
259 important role of parentheses are to guide the parser and provide
260 grouping. Once the parser constructs the AST, parentheses are not
263 The next simple production is for handling variable references and
270 /// ::= identifier '(' expression* ')'
271 static std::unique_ptr<ExprAST> ParseIdentifierExpr() {
272 std::string IdName = IdentifierStr;
274 getNextToken(); // eat identifier.
276 if (CurTok != '(') // Simple variable ref.
277 return llvm::make_unique<VariableExprAST>(IdName);
280 getNextToken(); // eat (
281 std::vector<std::unique_ptr<ExprAST>> Args;
284 if (auto Arg = ParseExpression())
285 Args.push_back(std::move(Arg));
293 return LogError("Expected ')' or ',' in argument list");
301 return llvm::make_unique<CallExprAST>(IdName, std::move(Args));
304 This routine follows the same style as the other routines. (It expects
305 to be called if the current token is a ``tok_identifier`` token). It
306 also has recursion and error handling. One interesting aspect of this is
307 that it uses *look-ahead* to determine if the current identifier is a
308 stand alone variable reference or if it is a function call expression.
309 It handles this by checking to see if the token after the identifier is
310 a '(' token, constructing either a ``VariableExprAST`` or
311 ``CallExprAST`` node as appropriate.
313 Now that we have all of our simple expression-parsing logic in place, we
314 can define a helper function to wrap it together into one entry point.
315 We call this class of expressions "primary" expressions, for reasons
316 that will become more clear `later in the
317 tutorial <LangImpl6.html#user-defined-unary-operators>`_. In order to parse an arbitrary
318 primary expression, we need to determine what sort of expression it is:
323 /// ::= identifierexpr
326 static std::unique_ptr<ExprAST> ParsePrimary() {
329 return LogError("unknown token when expecting an expression");
331 return ParseIdentifierExpr();
333 return ParseNumberExpr();
335 return ParseParenExpr();
339 Now that you see the definition of this function, it is more obvious why
340 we can assume the state of CurTok in the various functions. This uses
341 look-ahead to determine which sort of expression is being inspected, and
342 then parses it with a function call.
344 Now that basic expressions are handled, we need to handle binary
345 expressions. They are a bit more complex.
347 Binary Expression Parsing
348 =========================
350 Binary expressions are significantly harder to parse because they are
351 often ambiguous. For example, when given the string "x+y\*z", the parser
352 can choose to parse it as either "(x+y)\*z" or "x+(y\*z)". With common
353 definitions from mathematics, we expect the later parse, because "\*"
354 (multiplication) has higher *precedence* than "+" (addition).
356 There are many ways to handle this, but an elegant and efficient way is
357 to use `Operator-Precedence
358 Parsing <http://en.wikipedia.org/wiki/Operator-precedence_parser>`_.
359 This parsing technique uses the precedence of binary operators to guide
360 recursion. To start with, we need a table of precedences:
364 /// BinopPrecedence - This holds the precedence for each binary operator that is
366 static std::map<char, int> BinopPrecedence;
368 /// GetTokPrecedence - Get the precedence of the pending binary operator token.
369 static int GetTokPrecedence() {
370 if (!isascii(CurTok))
373 // Make sure it's a declared binop.
374 int TokPrec = BinopPrecedence[CurTok];
375 if (TokPrec <= 0) return -1;
380 // Install standard binary operators.
381 // 1 is lowest precedence.
382 BinopPrecedence['<'] = 10;
383 BinopPrecedence['+'] = 20;
384 BinopPrecedence['-'] = 20;
385 BinopPrecedence['*'] = 40; // highest.
389 For the basic form of Kaleidoscope, we will only support 4 binary
390 operators (this can obviously be extended by you, our brave and intrepid
391 reader). The ``GetTokPrecedence`` function returns the precedence for
392 the current token, or -1 if the token is not a binary operator. Having a
393 map makes it easy to add new operators and makes it clear that the
394 algorithm doesn't depend on the specific operators involved, but it
395 would be easy enough to eliminate the map and do the comparisons in the
396 ``GetTokPrecedence`` function. (Or just use a fixed-size array).
398 With the helper above defined, we can now start parsing binary
399 expressions. The basic idea of operator precedence parsing is to break
400 down an expression with potentially ambiguous binary operators into
401 pieces. Consider, for example, the expression "a+b+(c+d)\*e\*f+g".
402 Operator precedence parsing considers this as a stream of primary
403 expressions separated by binary operators. As such, it will first parse
404 the leading primary expression "a", then it will see the pairs [+, b]
405 [+, (c+d)] [\*, e] [\*, f] and [+, g]. Note that because parentheses are
406 primary expressions, the binary expression parser doesn't need to worry
407 about nested subexpressions like (c+d) at all.
409 To start, an expression is a primary expression potentially followed by
410 a sequence of [binop,primaryexpr] pairs:
415 /// ::= primary binoprhs
417 static std::unique_ptr<ExprAST> ParseExpression() {
418 auto LHS = ParsePrimary();
422 return ParseBinOpRHS(0, std::move(LHS));
425 ``ParseBinOpRHS`` is the function that parses the sequence of pairs for
426 us. It takes a precedence and a pointer to an expression for the part
427 that has been parsed so far. Note that "x" is a perfectly valid
428 expression: As such, "binoprhs" is allowed to be empty, in which case it
429 returns the expression that is passed into it. In our example above, the
430 code passes the expression for "a" into ``ParseBinOpRHS`` and the
431 current token is "+".
433 The precedence value passed into ``ParseBinOpRHS`` indicates the
434 *minimal operator precedence* that the function is allowed to eat. For
435 example, if the current pair stream is [+, x] and ``ParseBinOpRHS`` is
436 passed in a precedence of 40, it will not consume any tokens (because
437 the precedence of '+' is only 20). With this in mind, ``ParseBinOpRHS``
443 /// ::= ('+' primary)*
444 static std::unique_ptr<ExprAST> ParseBinOpRHS(int ExprPrec,
445 std::unique_ptr<ExprAST> LHS) {
446 // If this is a binop, find its precedence.
448 int TokPrec = GetTokPrecedence();
450 // If this is a binop that binds at least as tightly as the current binop,
451 // consume it, otherwise we are done.
452 if (TokPrec < ExprPrec)
455 This code gets the precedence of the current token and checks to see if
456 if is too low. Because we defined invalid tokens to have a precedence of
457 -1, this check implicitly knows that the pair-stream ends when the token
458 stream runs out of binary operators. If this check succeeds, we know
459 that the token is a binary operator and that it will be included in this
464 // Okay, we know this is a binop.
466 getNextToken(); // eat binop
468 // Parse the primary expression after the binary operator.
469 auto RHS = ParsePrimary();
473 As such, this code eats (and remembers) the binary operator and then
474 parses the primary expression that follows. This builds up the whole
475 pair, the first of which is [+, b] for the running example.
477 Now that we parsed the left-hand side of an expression and one pair of
478 the RHS sequence, we have to decide which way the expression associates.
479 In particular, we could have "(a+b) binop unparsed" or "a + (b binop
480 unparsed)". To determine this, we look ahead at "binop" to determine its
481 precedence and compare it to BinOp's precedence (which is '+' in this
486 // If BinOp binds less tightly with RHS than the operator after RHS, let
487 // the pending operator take RHS as its LHS.
488 int NextPrec = GetTokPrecedence();
489 if (TokPrec < NextPrec) {
491 If the precedence of the binop to the right of "RHS" is lower or equal
492 to the precedence of our current operator, then we know that the
493 parentheses associate as "(a+b) binop ...". In our example, the current
494 operator is "+" and the next operator is "+", we know that they have the
495 same precedence. In this case we'll create the AST node for "a+b", and
496 then continue parsing:
500 ... if body omitted ...
504 LHS = llvm::make_unique<BinaryExprAST>(BinOp, std::move(LHS),
506 } // loop around to the top of the while loop.
509 In our example above, this will turn "a+b+" into "(a+b)" and execute the
510 next iteration of the loop, with "+" as the current token. The code
511 above will eat, remember, and parse "(c+d)" as the primary expression,
512 which makes the current pair equal to [+, (c+d)]. It will then evaluate
513 the 'if' conditional above with "\*" as the binop to the right of the
514 primary. In this case, the precedence of "\*" is higher than the
515 precedence of "+" so the if condition will be entered.
517 The critical question left here is "how can the if condition parse the
518 right hand side in full"? In particular, to build the AST correctly for
519 our example, it needs to get all of "(c+d)\*e\*f" as the RHS expression
520 variable. The code to do this is surprisingly simple (code from the
521 above two blocks duplicated for context):
525 // If BinOp binds less tightly with RHS than the operator after RHS, let
526 // the pending operator take RHS as its LHS.
527 int NextPrec = GetTokPrecedence();
528 if (TokPrec < NextPrec) {
529 RHS = ParseBinOpRHS(TokPrec+1, std::move(RHS));
534 LHS = llvm::make_unique<BinaryExprAST>(BinOp, std::move(LHS),
536 } // loop around to the top of the while loop.
539 At this point, we know that the binary operator to the RHS of our
540 primary has higher precedence than the binop we are currently parsing.
541 As such, we know that any sequence of pairs whose operators are all
542 higher precedence than "+" should be parsed together and returned as
543 "RHS". To do this, we recursively invoke the ``ParseBinOpRHS`` function
544 specifying "TokPrec+1" as the minimum precedence required for it to
545 continue. In our example above, this will cause it to return the AST
546 node for "(c+d)\*e\*f" as RHS, which is then set as the RHS of the '+'
549 Finally, on the next iteration of the while loop, the "+g" piece is
550 parsed and added to the AST. With this little bit of code (14
551 non-trivial lines), we correctly handle fully general binary expression
552 parsing in a very elegant way. This was a whirlwind tour of this code,
553 and it is somewhat subtle. I recommend running through it with a few
554 tough examples to see how it works.
556 This wraps up handling of expressions. At this point, we can point the
557 parser at an arbitrary token stream and build an expression from it,
558 stopping at the first token that is not part of the expression. Next up
559 we need to handle function definitions, etc.
564 The next thing missing is handling of function prototypes. In
565 Kaleidoscope, these are used both for 'extern' function declarations as
566 well as function body definitions. The code to do this is
567 straight-forward and not very interesting (once you've survived
573 /// ::= id '(' id* ')'
574 static std::unique_ptr<PrototypeAST> ParsePrototype() {
575 if (CurTok != tok_identifier)
576 return LogErrorP("Expected function name in prototype");
578 std::string FnName = IdentifierStr;
582 return LogErrorP("Expected '(' in prototype");
584 // Read the list of argument names.
585 std::vector<std::string> ArgNames;
586 while (getNextToken() == tok_identifier)
587 ArgNames.push_back(IdentifierStr);
589 return LogErrorP("Expected ')' in prototype");
592 getNextToken(); // eat ')'.
594 return llvm::make_unique<PrototypeAST>(FnName, std::move(ArgNames));
597 Given this, a function definition is very simple, just a prototype plus
598 an expression to implement the body:
602 /// definition ::= 'def' prototype expression
603 static std::unique_ptr<FunctionAST> ParseDefinition() {
604 getNextToken(); // eat def.
605 auto Proto = ParsePrototype();
606 if (!Proto) return nullptr;
608 if (auto E = ParseExpression())
609 return llvm::make_unique<FunctionAST>(std::move(Proto), std::move(E));
613 In addition, we support 'extern' to declare functions like 'sin' and
614 'cos' as well as to support forward declaration of user functions. These
615 'extern's are just prototypes with no body:
619 /// external ::= 'extern' prototype
620 static std::unique_ptr<PrototypeAST> ParseExtern() {
621 getNextToken(); // eat extern.
622 return ParsePrototype();
625 Finally, we'll also let the user type in arbitrary top-level expressions
626 and evaluate them on the fly. We will handle this by defining anonymous
627 nullary (zero argument) functions for them:
631 /// toplevelexpr ::= expression
632 static std::unique_ptr<FunctionAST> ParseTopLevelExpr() {
633 if (auto E = ParseExpression()) {
634 // Make an anonymous proto.
635 auto Proto = llvm::make_unique<PrototypeAST>("", std::vector<std::string>());
636 return llvm::make_unique<FunctionAST>(std::move(Proto), std::move(E));
641 Now that we have all the pieces, let's build a little driver that will
642 let us actually *execute* this code we've built!
647 The driver for this simply invokes all of the parsing pieces with a
648 top-level dispatch loop. There isn't much interesting here, so I'll just
649 include the top-level loop. See `below <#full-code-listing>`_ for full code in the
650 "Top-Level Parsing" section.
654 /// top ::= definition | external | expression | ';'
655 static void MainLoop() {
657 fprintf(stderr, "ready> ");
661 case ';': // ignore top-level semicolons.
671 HandleTopLevelExpression();
677 The most interesting part of this is that we ignore top-level
678 semicolons. Why is this, you ask? The basic reason is that if you type
679 "4 + 5" at the command line, the parser doesn't know whether that is the
680 end of what you will type or not. For example, on the next line you
681 could type "def foo..." in which case 4+5 is the end of a top-level
682 expression. Alternatively you could type "\* 6", which would continue
683 the expression. Having top-level semicolons allows you to type "4+5;",
684 and the parser will know you are done.
689 With just under 400 lines of commented code (240 lines of non-comment,
690 non-blank code), we fully defined our minimal language, including a
691 lexer, parser, and AST builder. With this done, the executable will
692 validate Kaleidoscope code and tell us if it is grammatically invalid.
693 For example, here is a sample interaction:
698 ready> def foo(x y) x+foo(y, 4.0);
699 Parsed a function definition.
700 ready> def foo(x y) x+y y;
701 Parsed a function definition.
702 Parsed a top-level expr
703 ready> def foo(x y) x+y );
704 Parsed a function definition.
705 Error: unknown token when expecting an expression
706 ready> extern sin(a);
707 ready> Parsed an extern
711 There is a lot of room for extension here. You can define new AST nodes,
712 extend the language in many ways, etc. In the `next
713 installment <LangImpl03.html>`_, we will describe how to generate LLVM
714 Intermediate Representation (IR) from the AST.
719 Here is the complete code listing for our running example. Because this
720 uses the LLVM libraries, we need to link them in. To do this, we use the
721 `llvm-config <http://llvm.org/cmds/llvm-config.html>`_ tool to inform
722 our makefile/command line about which options to use:
727 clang++ -g -O3 toy.cpp `llvm-config --cxxflags`
733 .. literalinclude:: ../../examples/Kaleidoscope/Chapter2/toy.cpp
736 `Next: Implementing Code Generation to LLVM IR <LangImpl03.html>`_