fix PR9856, an incorrectly conservative assertion: a global can be
[llvm/stm8.git] / docs / tutorial / LangImpl2.html
blobacccd20a09048426ca3dbd62e2dfaa18d4b36f7a
1 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
2 "http://www.w3.org/TR/html4/strict.dtd">
4 <html>
5 <head>
6 <title>Kaleidoscope: Implementing a Parser and AST</title>
7 <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
8 <meta name="author" content="Chris Lattner">
9 <link rel="stylesheet" href="../llvm.css" type="text/css">
10 </head>
12 <body>
14 <h1>Kaleidoscope: Implementing a Parser and AST</h1>
16 <ul>
17 <li><a href="index.html">Up to Tutorial Index</a></li>
18 <li>Chapter 2
19 <ol>
20 <li><a href="#intro">Chapter 2 Introduction</a></li>
21 <li><a href="#ast">The Abstract Syntax Tree (AST)</a></li>
22 <li><a href="#parserbasics">Parser Basics</a></li>
23 <li><a href="#parserprimexprs">Basic Expression Parsing</a></li>
24 <li><a href="#parserbinops">Binary Expression Parsing</a></li>
25 <li><a href="#parsertop">Parsing the Rest</a></li>
26 <li><a href="#driver">The Driver</a></li>
27 <li><a href="#conclusions">Conclusions</a></li>
28 <li><a href="#code">Full Code Listing</a></li>
29 </ol>
30 </li>
31 <li><a href="LangImpl3.html">Chapter 3</a>: Code generation to LLVM IR</li>
32 </ul>
34 <div class="doc_author">
35 <p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a></p>
36 </div>
38 <!-- *********************************************************************** -->
39 <h2><a name="intro">Chapter 2 Introduction</a></h2>
40 <!-- *********************************************************************** -->
42 <div>
44 <p>Welcome to Chapter 2 of the "<a href="index.html">Implementing a language
45 with LLVM</a>" tutorial. This chapter shows you how to use the lexer, built in
46 <a href="LangImpl1.html">Chapter 1</a>, to build a full <a
47 href="http://en.wikipedia.org/wiki/Parsing">parser</a> for
48 our Kaleidoscope language. Once we have a parser, we'll define and build an <a
49 href="http://en.wikipedia.org/wiki/Abstract_syntax_tree">Abstract Syntax
50 Tree</a> (AST).</p>
52 <p>The parser we will build uses a combination of <a
53 href="http://en.wikipedia.org/wiki/Recursive_descent_parser">Recursive Descent
54 Parsing</a> and <a href=
55 "http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
56 Parsing</a> to parse the Kaleidoscope language (the latter for
57 binary expressions and the former for everything else). Before we get to
58 parsing though, lets talk about the output of the parser: the Abstract Syntax
59 Tree.</p>
61 </div>
63 <!-- *********************************************************************** -->
64 <h2><a name="ast">The Abstract Syntax Tree (AST)</a></h2>
65 <!-- *********************************************************************** -->
67 <div>
69 <p>The AST for a program captures its behavior in such a way that it is easy for
70 later stages of the compiler (e.g. code generation) to interpret. We basically
71 want one object for each construct in the language, and the AST should closely
72 model the language. In Kaleidoscope, we have expressions, a prototype, and a
73 function object. We'll start with expressions first:</p>
75 <div class="doc_code">
76 <pre>
77 /// ExprAST - Base class for all expression nodes.
78 class ExprAST {
79 public:
80 virtual ~ExprAST() {}
83 /// NumberExprAST - Expression class for numeric literals like "1.0".
84 class NumberExprAST : public ExprAST {
85 double Val;
86 public:
87 NumberExprAST(double val) : Val(val) {}
89 </pre>
90 </div>
92 <p>The code above shows the definition of the base ExprAST class and one
93 subclass which we use for numeric literals. The important thing to note about
94 this code is that the NumberExprAST class captures the numeric value of the
95 literal as an instance variable. This allows later phases of the compiler to
96 know what the stored numeric value is.</p>
98 <p>Right now we only create the AST, so there are no useful accessor methods on
99 them. It would be very easy to add a virtual method to pretty print the code,
100 for example. Here are the other expression AST node definitions that we'll use
101 in the basic form of the Kaleidoscope language:
102 </p>
104 <div class="doc_code">
105 <pre>
106 /// VariableExprAST - Expression class for referencing a variable, like "a".
107 class VariableExprAST : public ExprAST {
108 std::string Name;
109 public:
110 VariableExprAST(const std::string &amp;name) : Name(name) {}
113 /// BinaryExprAST - Expression class for a binary operator.
114 class BinaryExprAST : public ExprAST {
115 char Op;
116 ExprAST *LHS, *RHS;
117 public:
118 BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
119 : Op(op), LHS(lhs), RHS(rhs) {}
122 /// CallExprAST - Expression class for function calls.
123 class CallExprAST : public ExprAST {
124 std::string Callee;
125 std::vector&lt;ExprAST*&gt; Args;
126 public:
127 CallExprAST(const std::string &amp;callee, std::vector&lt;ExprAST*&gt; &amp;args)
128 : Callee(callee), Args(args) {}
130 </pre>
131 </div>
133 <p>This is all (intentionally) rather straight-forward: variables capture the
134 variable name, binary operators capture their opcode (e.g. '+'), and calls
135 capture a function name as well as a list of any argument expressions. One thing
136 that is nice about our AST is that it captures the language features without
137 talking about the syntax of the language. Note that there is no discussion about
138 precedence of binary operators, lexical structure, etc.</p>
140 <p>For our basic language, these are all of the expression nodes we'll define.
141 Because it doesn't have conditional control flow, it isn't Turing-complete;
142 we'll fix that in a later installment. The two things we need next are a way
143 to talk about the interface to a function, and a way to talk about functions
144 themselves:</p>
146 <div class="doc_code">
147 <pre>
148 /// PrototypeAST - This class represents the "prototype" for a function,
149 /// which captures its name, and its argument names (thus implicitly the number
150 /// of arguments the function takes).
151 class PrototypeAST {
152 std::string Name;
153 std::vector&lt;std::string&gt; Args;
154 public:
155 PrototypeAST(const std::string &amp;name, const std::vector&lt;std::string&gt; &amp;args)
156 : Name(name), Args(args) {}
159 /// FunctionAST - This class represents a function definition itself.
160 class FunctionAST {
161 PrototypeAST *Proto;
162 ExprAST *Body;
163 public:
164 FunctionAST(PrototypeAST *proto, ExprAST *body)
165 : Proto(proto), Body(body) {}
167 </pre>
168 </div>
170 <p>In Kaleidoscope, functions are typed with just a count of their arguments.
171 Since all values are double precision floating point, the type of each argument
172 doesn't need to be stored anywhere. In a more aggressive and realistic
173 language, the "ExprAST" class would probably have a type field.</p>
175 <p>With this scaffolding, we can now talk about parsing expressions and function
176 bodies in Kaleidoscope.</p>
178 </div>
180 <!-- *********************************************************************** -->
181 <h2><a name="parserbasics">Parser Basics</a></h2>
182 <!-- *********************************************************************** -->
184 <div>
186 <p>Now that we have an AST to build, we need to define the parser code to build
187 it. The idea here is that we want to parse something like "x+y" (which is
188 returned as three tokens by the lexer) into an AST that could be generated with
189 calls like this:</p>
191 <div class="doc_code">
192 <pre>
193 ExprAST *X = new VariableExprAST("x");
194 ExprAST *Y = new VariableExprAST("y");
195 ExprAST *Result = new BinaryExprAST('+', X, Y);
196 </pre>
197 </div>
199 <p>In order to do this, we'll start by defining some basic helper routines:</p>
201 <div class="doc_code">
202 <pre>
203 /// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
204 /// token the parser is looking at. getNextToken reads another token from the
205 /// lexer and updates CurTok with its results.
206 static int CurTok;
207 static int getNextToken() {
208 return CurTok = gettok();
210 </pre>
211 </div>
214 This implements a simple token buffer around the lexer. This allows
215 us to look one token ahead at what the lexer is returning. Every function in
216 our parser will assume that CurTok is the current token that needs to be
217 parsed.</p>
219 <div class="doc_code">
220 <pre>
222 /// Error* - These are little helper functions for error handling.
223 ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
224 PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
225 FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
226 </pre>
227 </div>
230 The <tt>Error</tt> routines are simple helper routines that our parser will use
231 to handle errors. The error recovery in our parser will not be the best and
232 is not particular user-friendly, but it will be enough for our tutorial. These
233 routines make it easier to handle errors in routines that have various return
234 types: they always return null.</p>
236 <p>With these basic helper functions, we can implement the first
237 piece of our grammar: numeric literals.</p>
239 </div>
241 <!-- *********************************************************************** -->
242 <h2><a name="parserprimexprs">Basic Expression Parsing</a></h2>
243 <!-- *********************************************************************** -->
245 <div>
247 <p>We start with numeric literals, because they are the simplest to process.
248 For each production in our grammar, we'll define a function which parses that
249 production. For numeric literals, we have:
250 </p>
252 <div class="doc_code">
253 <pre>
254 /// numberexpr ::= number
255 static ExprAST *ParseNumberExpr() {
256 ExprAST *Result = new NumberExprAST(NumVal);
257 getNextToken(); // consume the number
258 return Result;
260 </pre>
261 </div>
263 <p>This routine is very simple: it expects to be called when the current token
264 is a <tt>tok_number</tt> token. It takes the current number value, creates
265 a <tt>NumberExprAST</tt> node, advances the lexer to the next token, and finally
266 returns.</p>
268 <p>There are some interesting aspects to this. The most important one is that
269 this routine eats all of the tokens that correspond to the production and
270 returns the lexer buffer with the next token (which is not part of the grammar
271 production) ready to go. This is a fairly standard way to go for recursive
272 descent parsers. For a better example, the parenthesis operator is defined like
273 this:</p>
275 <div class="doc_code">
276 <pre>
277 /// parenexpr ::= '(' expression ')'
278 static ExprAST *ParseParenExpr() {
279 getNextToken(); // eat (.
280 ExprAST *V = ParseExpression();
281 if (!V) return 0;
283 if (CurTok != ')')
284 return Error("expected ')'");
285 getNextToken(); // eat ).
286 return V;
288 </pre>
289 </div>
291 <p>This function illustrates a number of interesting things about the
292 parser:</p>
295 1) It shows how we use the Error routines. When called, this function expects
296 that the current token is a '(' token, but after parsing the subexpression, it
297 is possible that there is no ')' waiting. For example, if the user types in
298 "(4 x" instead of "(4)", the parser should emit an error. Because errors can
299 occur, the parser needs a way to indicate that they happened: in our parser, we
300 return null on an error.</p>
302 <p>2) Another interesting aspect of this function is that it uses recursion by
303 calling <tt>ParseExpression</tt> (we will soon see that <tt>ParseExpression</tt> can call
304 <tt>ParseParenExpr</tt>). This is powerful because it allows us to handle
305 recursive grammars, and keeps each production very simple. Note that
306 parentheses do not cause construction of AST nodes themselves. While we could
307 do it this way, the most important role of parentheses are to guide the parser
308 and provide grouping. Once the parser constructs the AST, parentheses are not
309 needed.</p>
311 <p>The next simple production is for handling variable references and function
312 calls:</p>
314 <div class="doc_code">
315 <pre>
316 /// identifierexpr
317 /// ::= identifier
318 /// ::= identifier '(' expression* ')'
319 static ExprAST *ParseIdentifierExpr() {
320 std::string IdName = IdentifierStr;
322 getNextToken(); // eat identifier.
324 if (CurTok != '(') // Simple variable ref.
325 return new VariableExprAST(IdName);
327 // Call.
328 getNextToken(); // eat (
329 std::vector&lt;ExprAST*&gt; Args;
330 if (CurTok != ')') {
331 while (1) {
332 ExprAST *Arg = ParseExpression();
333 if (!Arg) return 0;
334 Args.push_back(Arg);
336 if (CurTok == ')') break;
338 if (CurTok != ',')
339 return Error("Expected ')' or ',' in argument list");
340 getNextToken();
344 // Eat the ')'.
345 getNextToken();
347 return new CallExprAST(IdName, Args);
349 </pre>
350 </div>
352 <p>This routine follows the same style as the other routines. (It expects to be
353 called if the current token is a <tt>tok_identifier</tt> token). It also has
354 recursion and error handling. One interesting aspect of this is that it uses
355 <em>look-ahead</em> to determine if the current identifier is a stand alone
356 variable reference or if it is a function call expression. It handles this by
357 checking to see if the token after the identifier is a '(' token, constructing
358 either a <tt>VariableExprAST</tt> or <tt>CallExprAST</tt> node as appropriate.
359 </p>
361 <p>Now that we have all of our simple expression-parsing logic in place, we can
362 define a helper function to wrap it together into one entry point. We call this
363 class of expressions "primary" expressions, for reasons that will become more
364 clear <a href="LangImpl6.html#unary">later in the tutorial</a>. In order to
365 parse an arbitrary primary expression, we need to determine what sort of
366 expression it is:</p>
368 <div class="doc_code">
369 <pre>
370 /// primary
371 /// ::= identifierexpr
372 /// ::= numberexpr
373 /// ::= parenexpr
374 static ExprAST *ParsePrimary() {
375 switch (CurTok) {
376 default: return Error("unknown token when expecting an expression");
377 case tok_identifier: return ParseIdentifierExpr();
378 case tok_number: return ParseNumberExpr();
379 case '(': return ParseParenExpr();
382 </pre>
383 </div>
385 <p>Now that you see the definition of this function, it is more obvious why we
386 can assume the state of CurTok in the various functions. This uses look-ahead
387 to determine which sort of expression is being inspected, and then parses it
388 with a function call.</p>
390 <p>Now that basic expressions are handled, we need to handle binary expressions.
391 They are a bit more complex.</p>
393 </div>
395 <!-- *********************************************************************** -->
396 <h2><a name="parserbinops">Binary Expression Parsing</a></h2>
397 <!-- *********************************************************************** -->
399 <div>
401 <p>Binary expressions are significantly harder to parse because they are often
402 ambiguous. For example, when given the string "x+y*z", the parser can choose
403 to parse it as either "(x+y)*z" or "x+(y*z)". With common definitions from
404 mathematics, we expect the later parse, because "*" (multiplication) has
405 higher <em>precedence</em> than "+" (addition).</p>
407 <p>There are many ways to handle this, but an elegant and efficient way is to
408 use <a href=
409 "http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
410 Parsing</a>. This parsing technique uses the precedence of binary operators to
411 guide recursion. To start with, we need a table of precedences:</p>
413 <div class="doc_code">
414 <pre>
415 /// BinopPrecedence - This holds the precedence for each binary operator that is
416 /// defined.
417 static std::map&lt;char, int&gt; BinopPrecedence;
419 /// GetTokPrecedence - Get the precedence of the pending binary operator token.
420 static int GetTokPrecedence() {
421 if (!isascii(CurTok))
422 return -1;
424 // Make sure it's a declared binop.
425 int TokPrec = BinopPrecedence[CurTok];
426 if (TokPrec &lt;= 0) return -1;
427 return TokPrec;
430 int main() {
431 // Install standard binary operators.
432 // 1 is lowest precedence.
433 BinopPrecedence['&lt;'] = 10;
434 BinopPrecedence['+'] = 20;
435 BinopPrecedence['-'] = 20;
436 BinopPrecedence['*'] = 40; // highest.
439 </pre>
440 </div>
442 <p>For the basic form of Kaleidoscope, we will only support 4 binary operators
443 (this can obviously be extended by you, our brave and intrepid reader). The
444 <tt>GetTokPrecedence</tt> function returns the precedence for the current token,
445 or -1 if the token is not a binary operator. Having a map makes it easy to add
446 new operators and makes it clear that the algorithm doesn't depend on the
447 specific operators involved, but it would be easy enough to eliminate the map
448 and do the comparisons in the <tt>GetTokPrecedence</tt> function. (Or just use
449 a fixed-size array).</p>
451 <p>With the helper above defined, we can now start parsing binary expressions.
452 The basic idea of operator precedence parsing is to break down an expression
453 with potentially ambiguous binary operators into pieces. Consider ,for example,
454 the expression "a+b+(c+d)*e*f+g". Operator precedence parsing considers this
455 as a stream of primary expressions separated by binary operators. As such,
456 it will first parse the leading primary expression "a", then it will see the
457 pairs [+, b] [+, (c+d)] [*, e] [*, f] and [+, g]. Note that because parentheses
458 are primary expressions, the binary expression parser doesn't need to worry
459 about nested subexpressions like (c+d) at all.
460 </p>
463 To start, an expression is a primary expression potentially followed by a
464 sequence of [binop,primaryexpr] pairs:</p>
466 <div class="doc_code">
467 <pre>
468 /// expression
469 /// ::= primary binoprhs
471 static ExprAST *ParseExpression() {
472 ExprAST *LHS = ParsePrimary();
473 if (!LHS) return 0;
475 return ParseBinOpRHS(0, LHS);
477 </pre>
478 </div>
480 <p><tt>ParseBinOpRHS</tt> is the function that parses the sequence of pairs for
481 us. It takes a precedence and a pointer to an expression for the part that has been
482 parsed so far. Note that "x" is a perfectly valid expression: As such, "binoprhs" is
483 allowed to be empty, in which case it returns the expression that is passed into
484 it. In our example above, the code passes the expression for "a" into
485 <tt>ParseBinOpRHS</tt> and the current token is "+".</p>
487 <p>The precedence value passed into <tt>ParseBinOpRHS</tt> indicates the <em>
488 minimal operator precedence</em> that the function is allowed to eat. For
489 example, if the current pair stream is [+, x] and <tt>ParseBinOpRHS</tt> is
490 passed in a precedence of 40, it will not consume any tokens (because the
491 precedence of '+' is only 20). With this in mind, <tt>ParseBinOpRHS</tt> starts
492 with:</p>
494 <div class="doc_code">
495 <pre>
496 /// binoprhs
497 /// ::= ('+' primary)*
498 static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
499 // If this is a binop, find its precedence.
500 while (1) {
501 int TokPrec = GetTokPrecedence();
503 // If this is a binop that binds at least as tightly as the current binop,
504 // consume it, otherwise we are done.
505 if (TokPrec &lt; ExprPrec)
506 return LHS;
507 </pre>
508 </div>
510 <p>This code gets the precedence of the current token and checks to see if if is
511 too low. Because we defined invalid tokens to have a precedence of -1, this
512 check implicitly knows that the pair-stream ends when the token stream runs out
513 of binary operators. If this check succeeds, we know that the token is a binary
514 operator and that it will be included in this expression:</p>
516 <div class="doc_code">
517 <pre>
518 // Okay, we know this is a binop.
519 int BinOp = CurTok;
520 getNextToken(); // eat binop
522 // Parse the primary expression after the binary operator.
523 ExprAST *RHS = ParsePrimary();
524 if (!RHS) return 0;
525 </pre>
526 </div>
528 <p>As such, this code eats (and remembers) the binary operator and then parses
529 the primary expression that follows. This builds up the whole pair, the first of
530 which is [+, b] for the running example.</p>
532 <p>Now that we parsed the left-hand side of an expression and one pair of the
533 RHS sequence, we have to decide which way the expression associates. In
534 particular, we could have "(a+b) binop unparsed" or "a + (b binop unparsed)".
535 To determine this, we look ahead at "binop" to determine its precedence and
536 compare it to BinOp's precedence (which is '+' in this case):</p>
538 <div class="doc_code">
539 <pre>
540 // If BinOp binds less tightly with RHS than the operator after RHS, let
541 // the pending operator take RHS as its LHS.
542 int NextPrec = GetTokPrecedence();
543 if (TokPrec &lt; NextPrec) {
544 </pre>
545 </div>
547 <p>If the precedence of the binop to the right of "RHS" is lower or equal to the
548 precedence of our current operator, then we know that the parentheses associate
549 as "(a+b) binop ...". In our example, the current operator is "+" and the next
550 operator is "+", we know that they have the same precedence. In this case we'll
551 create the AST node for "a+b", and then continue parsing:</p>
553 <div class="doc_code">
554 <pre>
555 ... if body omitted ...
558 // Merge LHS/RHS.
559 LHS = new BinaryExprAST(BinOp, LHS, RHS);
560 } // loop around to the top of the while loop.
562 </pre>
563 </div>
565 <p>In our example above, this will turn "a+b+" into "(a+b)" and execute the next
566 iteration of the loop, with "+" as the current token. The code above will eat,
567 remember, and parse "(c+d)" as the primary expression, which makes the
568 current pair equal to [+, (c+d)]. It will then evaluate the 'if' conditional above with
569 "*" as the binop to the right of the primary. In this case, the precedence of "*" is
570 higher than the precedence of "+" so the if condition will be entered.</p>
572 <p>The critical question left here is "how can the if condition parse the right
573 hand side in full"? In particular, to build the AST correctly for our example,
574 it needs to get all of "(c+d)*e*f" as the RHS expression variable. The code to
575 do this is surprisingly simple (code from the above two blocks duplicated for
576 context):</p>
578 <div class="doc_code">
579 <pre>
580 // If BinOp binds less tightly with RHS than the operator after RHS, let
581 // the pending operator take RHS as its LHS.
582 int NextPrec = GetTokPrecedence();
583 if (TokPrec &lt; NextPrec) {
584 <b>RHS = ParseBinOpRHS(TokPrec+1, RHS);
585 if (RHS == 0) return 0;</b>
587 // Merge LHS/RHS.
588 LHS = new BinaryExprAST(BinOp, LHS, RHS);
589 } // loop around to the top of the while loop.
591 </pre>
592 </div>
594 <p>At this point, we know that the binary operator to the RHS of our primary
595 has higher precedence than the binop we are currently parsing. As such, we know
596 that any sequence of pairs whose operators are all higher precedence than "+"
597 should be parsed together and returned as "RHS". To do this, we recursively
598 invoke the <tt>ParseBinOpRHS</tt> function specifying "TokPrec+1" as the minimum
599 precedence required for it to continue. In our example above, this will cause
600 it to return the AST node for "(c+d)*e*f" as RHS, which is then set as the RHS
601 of the '+' expression.</p>
603 <p>Finally, on the next iteration of the while loop, the "+g" piece is parsed
604 and added to the AST. With this little bit of code (14 non-trivial lines), we
605 correctly handle fully general binary expression parsing in a very elegant way.
606 This was a whirlwind tour of this code, and it is somewhat subtle. I recommend
607 running through it with a few tough examples to see how it works.
608 </p>
610 <p>This wraps up handling of expressions. At this point, we can point the
611 parser at an arbitrary token stream and build an expression from it, stopping
612 at the first token that is not part of the expression. Next up we need to
613 handle function definitions, etc.</p>
615 </div>
617 <!-- *********************************************************************** -->
618 <h2><a name="parsertop">Parsing the Rest</a></h2>
619 <!-- *********************************************************************** -->
621 <div>
624 The next thing missing is handling of function prototypes. In Kaleidoscope,
625 these are used both for 'extern' function declarations as well as function body
626 definitions. The code to do this is straight-forward and not very interesting
627 (once you've survived expressions):
628 </p>
630 <div class="doc_code">
631 <pre>
632 /// prototype
633 /// ::= id '(' id* ')'
634 static PrototypeAST *ParsePrototype() {
635 if (CurTok != tok_identifier)
636 return ErrorP("Expected function name in prototype");
638 std::string FnName = IdentifierStr;
639 getNextToken();
641 if (CurTok != '(')
642 return ErrorP("Expected '(' in prototype");
644 // Read the list of argument names.
645 std::vector&lt;std::string&gt; ArgNames;
646 while (getNextToken() == tok_identifier)
647 ArgNames.push_back(IdentifierStr);
648 if (CurTok != ')')
649 return ErrorP("Expected ')' in prototype");
651 // success.
652 getNextToken(); // eat ')'.
654 return new PrototypeAST(FnName, ArgNames);
656 </pre>
657 </div>
659 <p>Given this, a function definition is very simple, just a prototype plus
660 an expression to implement the body:</p>
662 <div class="doc_code">
663 <pre>
664 /// definition ::= 'def' prototype expression
665 static FunctionAST *ParseDefinition() {
666 getNextToken(); // eat def.
667 PrototypeAST *Proto = ParsePrototype();
668 if (Proto == 0) return 0;
670 if (ExprAST *E = ParseExpression())
671 return new FunctionAST(Proto, E);
672 return 0;
674 </pre>
675 </div>
677 <p>In addition, we support 'extern' to declare functions like 'sin' and 'cos' as
678 well as to support forward declaration of user functions. These 'extern's are just
679 prototypes with no body:</p>
681 <div class="doc_code">
682 <pre>
683 /// external ::= 'extern' prototype
684 static PrototypeAST *ParseExtern() {
685 getNextToken(); // eat extern.
686 return ParsePrototype();
688 </pre>
689 </div>
691 <p>Finally, we'll also let the user type in arbitrary top-level expressions and
692 evaluate them on the fly. We will handle this by defining anonymous nullary
693 (zero argument) functions for them:</p>
695 <div class="doc_code">
696 <pre>
697 /// toplevelexpr ::= expression
698 static FunctionAST *ParseTopLevelExpr() {
699 if (ExprAST *E = ParseExpression()) {
700 // Make an anonymous proto.
701 PrototypeAST *Proto = new PrototypeAST("", std::vector&lt;std::string&gt;());
702 return new FunctionAST(Proto, E);
704 return 0;
706 </pre>
707 </div>
709 <p>Now that we have all the pieces, let's build a little driver that will let us
710 actually <em>execute</em> this code we've built!</p>
712 </div>
714 <!-- *********************************************************************** -->
715 <h2><a name="driver">The Driver</a></h2>
716 <!-- *********************************************************************** -->
718 <div>
720 <p>The driver for this simply invokes all of the parsing pieces with a top-level
721 dispatch loop. There isn't much interesting here, so I'll just include the
722 top-level loop. See <a href="#code">below</a> for full code in the "Top-Level
723 Parsing" section.</p>
725 <div class="doc_code">
726 <pre>
727 /// top ::= definition | external | expression | ';'
728 static void MainLoop() {
729 while (1) {
730 fprintf(stderr, "ready&gt; ");
731 switch (CurTok) {
732 case tok_eof: return;
733 case ';': getNextToken(); break; // ignore top-level semicolons.
734 case tok_def: HandleDefinition(); break;
735 case tok_extern: HandleExtern(); break;
736 default: HandleTopLevelExpression(); break;
740 </pre>
741 </div>
743 <p>The most interesting part of this is that we ignore top-level semicolons.
744 Why is this, you ask? The basic reason is that if you type "4 + 5" at the
745 command line, the parser doesn't know whether that is the end of what you will type
746 or not. For example, on the next line you could type "def foo..." in which case
747 4+5 is the end of a top-level expression. Alternatively you could type "* 6",
748 which would continue the expression. Having top-level semicolons allows you to
749 type "4+5;", and the parser will know you are done.</p>
751 </div>
753 <!-- *********************************************************************** -->
754 <h2><a name="conclusions">Conclusions</a></h2>
755 <!-- *********************************************************************** -->
757 <div>
759 <p>With just under 400 lines of commented code (240 lines of non-comment,
760 non-blank code), we fully defined our minimal language, including a lexer,
761 parser, and AST builder. With this done, the executable will validate
762 Kaleidoscope code and tell us if it is grammatically invalid. For
763 example, here is a sample interaction:</p>
765 <div class="doc_code">
766 <pre>
767 $ <b>./a.out</b>
768 ready&gt; <b>def foo(x y) x+foo(y, 4.0);</b>
769 Parsed a function definition.
770 ready&gt; <b>def foo(x y) x+y y;</b>
771 Parsed a function definition.
772 Parsed a top-level expr
773 ready&gt; <b>def foo(x y) x+y );</b>
774 Parsed a function definition.
775 Error: unknown token when expecting an expression
776 ready&gt; <b>extern sin(a);</b>
777 ready&gt; Parsed an extern
778 ready&gt; <b>^D</b>
780 </pre>
781 </div>
783 <p>There is a lot of room for extension here. You can define new AST nodes,
784 extend the language in many ways, etc. In the <a href="LangImpl3.html">next
785 installment</a>, we will describe how to generate LLVM Intermediate
786 Representation (IR) from the AST.</p>
788 </div>
790 <!-- *********************************************************************** -->
791 <h2><a name="code">Full Code Listing</a></h2>
792 <!-- *********************************************************************** -->
794 <div>
797 Here is the complete code listing for this and the previous chapter.
798 Note that it is fully self-contained: you don't need LLVM or any external
799 libraries at all for this. (Besides the C and C++ standard libraries, of
800 course.) To build this, just compile with:</p>
802 <div class="doc_code">
803 <pre>
804 # Compile
805 g++ -g -O3 toy.cpp
806 # Run
807 ./a.out
808 </pre>
809 </div>
811 <p>Here is the code:</p>
813 <div class="doc_code">
814 <pre>
815 #include &lt;cstdio&gt;
816 #include &lt;cstdlib&gt;
817 #include &lt;string&gt;
818 #include &lt;map&gt;
819 #include &lt;vector&gt;
821 //===----------------------------------------------------------------------===//
822 // Lexer
823 //===----------------------------------------------------------------------===//
825 // The lexer returns tokens [0-255] if it is an unknown character, otherwise one
826 // of these for known things.
827 enum Token {
828 tok_eof = -1,
830 // commands
831 tok_def = -2, tok_extern = -3,
833 // primary
834 tok_identifier = -4, tok_number = -5
837 static std::string IdentifierStr; // Filled in if tok_identifier
838 static double NumVal; // Filled in if tok_number
840 /// gettok - Return the next token from standard input.
841 static int gettok() {
842 static int LastChar = ' ';
844 // Skip any whitespace.
845 while (isspace(LastChar))
846 LastChar = getchar();
848 if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
849 IdentifierStr = LastChar;
850 while (isalnum((LastChar = getchar())))
851 IdentifierStr += LastChar;
853 if (IdentifierStr == "def") return tok_def;
854 if (IdentifierStr == "extern") return tok_extern;
855 return tok_identifier;
858 if (isdigit(LastChar) || LastChar == '.') { // Number: [0-9.]+
859 std::string NumStr;
860 do {
861 NumStr += LastChar;
862 LastChar = getchar();
863 } while (isdigit(LastChar) || LastChar == '.');
865 NumVal = strtod(NumStr.c_str(), 0);
866 return tok_number;
869 if (LastChar == '#') {
870 // Comment until end of line.
871 do LastChar = getchar();
872 while (LastChar != EOF &amp;&amp; LastChar != '\n' &amp;&amp; LastChar != '\r');
874 if (LastChar != EOF)
875 return gettok();
878 // Check for end of file. Don't eat the EOF.
879 if (LastChar == EOF)
880 return tok_eof;
882 // Otherwise, just return the character as its ascii value.
883 int ThisChar = LastChar;
884 LastChar = getchar();
885 return ThisChar;
888 //===----------------------------------------------------------------------===//
889 // Abstract Syntax Tree (aka Parse Tree)
890 //===----------------------------------------------------------------------===//
892 /// ExprAST - Base class for all expression nodes.
893 class ExprAST {
894 public:
895 virtual ~ExprAST() {}
898 /// NumberExprAST - Expression class for numeric literals like "1.0".
899 class NumberExprAST : public ExprAST {
900 double Val;
901 public:
902 NumberExprAST(double val) : Val(val) {}
905 /// VariableExprAST - Expression class for referencing a variable, like "a".
906 class VariableExprAST : public ExprAST {
907 std::string Name;
908 public:
909 VariableExprAST(const std::string &amp;name) : Name(name) {}
912 /// BinaryExprAST - Expression class for a binary operator.
913 class BinaryExprAST : public ExprAST {
914 char Op;
915 ExprAST *LHS, *RHS;
916 public:
917 BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
918 : Op(op), LHS(lhs), RHS(rhs) {}
921 /// CallExprAST - Expression class for function calls.
922 class CallExprAST : public ExprAST {
923 std::string Callee;
924 std::vector&lt;ExprAST*&gt; Args;
925 public:
926 CallExprAST(const std::string &amp;callee, std::vector&lt;ExprAST*&gt; &amp;args)
927 : Callee(callee), Args(args) {}
930 /// PrototypeAST - This class represents the "prototype" for a function,
931 /// which captures its name, and its argument names (thus implicitly the number
932 /// of arguments the function takes).
933 class PrototypeAST {
934 std::string Name;
935 std::vector&lt;std::string&gt; Args;
936 public:
937 PrototypeAST(const std::string &amp;name, const std::vector&lt;std::string&gt; &amp;args)
938 : Name(name), Args(args) {}
942 /// FunctionAST - This class represents a function definition itself.
943 class FunctionAST {
944 PrototypeAST *Proto;
945 ExprAST *Body;
946 public:
947 FunctionAST(PrototypeAST *proto, ExprAST *body)
948 : Proto(proto), Body(body) {}
952 //===----------------------------------------------------------------------===//
953 // Parser
954 //===----------------------------------------------------------------------===//
956 /// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
957 /// token the parser is looking at. getNextToken reads another token from the
958 /// lexer and updates CurTok with its results.
959 static int CurTok;
960 static int getNextToken() {
961 return CurTok = gettok();
964 /// BinopPrecedence - This holds the precedence for each binary operator that is
965 /// defined.
966 static std::map&lt;char, int&gt; BinopPrecedence;
968 /// GetTokPrecedence - Get the precedence of the pending binary operator token.
969 static int GetTokPrecedence() {
970 if (!isascii(CurTok))
971 return -1;
973 // Make sure it's a declared binop.
974 int TokPrec = BinopPrecedence[CurTok];
975 if (TokPrec &lt;= 0) return -1;
976 return TokPrec;
979 /// Error* - These are little helper functions for error handling.
980 ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
981 PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
982 FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
984 static ExprAST *ParseExpression();
986 /// identifierexpr
987 /// ::= identifier
988 /// ::= identifier '(' expression* ')'
989 static ExprAST *ParseIdentifierExpr() {
990 std::string IdName = IdentifierStr;
992 getNextToken(); // eat identifier.
994 if (CurTok != '(') // Simple variable ref.
995 return new VariableExprAST(IdName);
997 // Call.
998 getNextToken(); // eat (
999 std::vector&lt;ExprAST*&gt; Args;
1000 if (CurTok != ')') {
1001 while (1) {
1002 ExprAST *Arg = ParseExpression();
1003 if (!Arg) return 0;
1004 Args.push_back(Arg);
1006 if (CurTok == ')') break;
1008 if (CurTok != ',')
1009 return Error("Expected ')' or ',' in argument list");
1010 getNextToken();
1014 // Eat the ')'.
1015 getNextToken();
1017 return new CallExprAST(IdName, Args);
1020 /// numberexpr ::= number
1021 static ExprAST *ParseNumberExpr() {
1022 ExprAST *Result = new NumberExprAST(NumVal);
1023 getNextToken(); // consume the number
1024 return Result;
1027 /// parenexpr ::= '(' expression ')'
1028 static ExprAST *ParseParenExpr() {
1029 getNextToken(); // eat (.
1030 ExprAST *V = ParseExpression();
1031 if (!V) return 0;
1033 if (CurTok != ')')
1034 return Error("expected ')'");
1035 getNextToken(); // eat ).
1036 return V;
1039 /// primary
1040 /// ::= identifierexpr
1041 /// ::= numberexpr
1042 /// ::= parenexpr
1043 static ExprAST *ParsePrimary() {
1044 switch (CurTok) {
1045 default: return Error("unknown token when expecting an expression");
1046 case tok_identifier: return ParseIdentifierExpr();
1047 case tok_number: return ParseNumberExpr();
1048 case '(': return ParseParenExpr();
1052 /// binoprhs
1053 /// ::= ('+' primary)*
1054 static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
1055 // If this is a binop, find its precedence.
1056 while (1) {
1057 int TokPrec = GetTokPrecedence();
1059 // If this is a binop that binds at least as tightly as the current binop,
1060 // consume it, otherwise we are done.
1061 if (TokPrec &lt; ExprPrec)
1062 return LHS;
1064 // Okay, we know this is a binop.
1065 int BinOp = CurTok;
1066 getNextToken(); // eat binop
1068 // Parse the primary expression after the binary operator.
1069 ExprAST *RHS = ParsePrimary();
1070 if (!RHS) return 0;
1072 // If BinOp binds less tightly with RHS than the operator after RHS, let
1073 // the pending operator take RHS as its LHS.
1074 int NextPrec = GetTokPrecedence();
1075 if (TokPrec &lt; NextPrec) {
1076 RHS = ParseBinOpRHS(TokPrec+1, RHS);
1077 if (RHS == 0) return 0;
1080 // Merge LHS/RHS.
1081 LHS = new BinaryExprAST(BinOp, LHS, RHS);
1085 /// expression
1086 /// ::= primary binoprhs
1088 static ExprAST *ParseExpression() {
1089 ExprAST *LHS = ParsePrimary();
1090 if (!LHS) return 0;
1092 return ParseBinOpRHS(0, LHS);
1095 /// prototype
1096 /// ::= id '(' id* ')'
1097 static PrototypeAST *ParsePrototype() {
1098 if (CurTok != tok_identifier)
1099 return ErrorP("Expected function name in prototype");
1101 std::string FnName = IdentifierStr;
1102 getNextToken();
1104 if (CurTok != '(')
1105 return ErrorP("Expected '(' in prototype");
1107 std::vector&lt;std::string&gt; ArgNames;
1108 while (getNextToken() == tok_identifier)
1109 ArgNames.push_back(IdentifierStr);
1110 if (CurTok != ')')
1111 return ErrorP("Expected ')' in prototype");
1113 // success.
1114 getNextToken(); // eat ')'.
1116 return new PrototypeAST(FnName, ArgNames);
1119 /// definition ::= 'def' prototype expression
1120 static FunctionAST *ParseDefinition() {
1121 getNextToken(); // eat def.
1122 PrototypeAST *Proto = ParsePrototype();
1123 if (Proto == 0) return 0;
1125 if (ExprAST *E = ParseExpression())
1126 return new FunctionAST(Proto, E);
1127 return 0;
1130 /// toplevelexpr ::= expression
1131 static FunctionAST *ParseTopLevelExpr() {
1132 if (ExprAST *E = ParseExpression()) {
1133 // Make an anonymous proto.
1134 PrototypeAST *Proto = new PrototypeAST("", std::vector&lt;std::string&gt;());
1135 return new FunctionAST(Proto, E);
1137 return 0;
1140 /// external ::= 'extern' prototype
1141 static PrototypeAST *ParseExtern() {
1142 getNextToken(); // eat extern.
1143 return ParsePrototype();
1146 //===----------------------------------------------------------------------===//
1147 // Top-Level parsing
1148 //===----------------------------------------------------------------------===//
1150 static void HandleDefinition() {
1151 if (ParseDefinition()) {
1152 fprintf(stderr, "Parsed a function definition.\n");
1153 } else {
1154 // Skip token for error recovery.
1155 getNextToken();
1159 static void HandleExtern() {
1160 if (ParseExtern()) {
1161 fprintf(stderr, "Parsed an extern\n");
1162 } else {
1163 // Skip token for error recovery.
1164 getNextToken();
1168 static void HandleTopLevelExpression() {
1169 // Evaluate a top-level expression into an anonymous function.
1170 if (ParseTopLevelExpr()) {
1171 fprintf(stderr, "Parsed a top-level expr\n");
1172 } else {
1173 // Skip token for error recovery.
1174 getNextToken();
1178 /// top ::= definition | external | expression | ';'
1179 static void MainLoop() {
1180 while (1) {
1181 fprintf(stderr, "ready&gt; ");
1182 switch (CurTok) {
1183 case tok_eof: return;
1184 case ';': getNextToken(); break; // ignore top-level semicolons.
1185 case tok_def: HandleDefinition(); break;
1186 case tok_extern: HandleExtern(); break;
1187 default: HandleTopLevelExpression(); break;
1192 //===----------------------------------------------------------------------===//
1193 // Main driver code.
1194 //===----------------------------------------------------------------------===//
1196 int main() {
1197 // Install standard binary operators.
1198 // 1 is lowest precedence.
1199 BinopPrecedence['&lt;'] = 10;
1200 BinopPrecedence['+'] = 20;
1201 BinopPrecedence['-'] = 20;
1202 BinopPrecedence['*'] = 40; // highest.
1204 // Prime the first token.
1205 fprintf(stderr, "ready&gt; ");
1206 getNextToken();
1208 // Run the main "interpreter loop" now.
1209 MainLoop();
1211 return 0;
1213 </pre>
1214 </div>
1215 <a href="LangImpl3.html">Next: Implementing Code Generation to LLVM IR</a>
1216 </div>
1218 <!-- *********************************************************************** -->
1219 <hr>
1220 <address>
1221 <a href="http://jigsaw.w3.org/css-validator/check/referer"><img
1222 src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a>
1223 <a href="http://validator.w3.org/check/referer"><img
1224 src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!"></a>
1226 <a href="mailto:sabre@nondot.org">Chris Lattner</a><br>
1227 <a href="http://llvm.org/">The LLVM Compiler Infrastructure</a><br>
1228 Last modified: $Date$
1229 </address>
1230 </body>
1231 </html>