PIC16 does allow colon after MBB labels, simplify EmitBasicBlockStart.
[llvm/avr.git] / docs / tutorial / LangImpl2.html
blob30215608a8ce53a109dec728c8dac10a178c7094
1 <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN"
2 "http://www.w3.org/TR/html4/strict.dtd">
4 <html>
5 <head>
6 <title>Kaleidoscope: Implementing a Parser and AST</title>
7 <meta http-equiv="Content-Type" content="text/html; charset=utf-8">
8 <meta name="author" content="Chris Lattner">
9 <link rel="stylesheet" href="../llvm.css" type="text/css">
10 </head>
12 <body>
14 <div class="doc_title">Kaleidoscope: Implementing a Parser and AST</div>
16 <ul>
17 <li><a href="index.html">Up to Tutorial Index</a></li>
18 <li>Chapter 2
19 <ol>
20 <li><a href="#intro">Chapter 2 Introduction</a></li>
21 <li><a href="#ast">The Abstract Syntax Tree (AST)</a></li>
22 <li><a href="#parserbasics">Parser Basics</a></li>
23 <li><a href="#parserprimexprs">Basic Expression Parsing</a></li>
24 <li><a href="#parserbinops">Binary Expression Parsing</a></li>
25 <li><a href="#parsertop">Parsing the Rest</a></li>
26 <li><a href="#driver">The Driver</a></li>
27 <li><a href="#conclusions">Conclusions</a></li>
28 <li><a href="#code">Full Code Listing</a></li>
29 </ol>
30 </li>
31 <li><a href="LangImpl3.html">Chapter 3</a>: Code generation to LLVM IR</li>
32 </ul>
34 <div class="doc_author">
35 <p>Written by <a href="mailto:sabre@nondot.org">Chris Lattner</a></p>
36 </div>
38 <!-- *********************************************************************** -->
39 <div class="doc_section"><a name="intro">Chapter 2 Introduction</a></div>
40 <!-- *********************************************************************** -->
42 <div class="doc_text">
44 <p>Welcome to Chapter 2 of the "<a href="index.html">Implementing a language
45 with LLVM</a>" tutorial. This chapter shows you how to use the lexer, built in
46 <a href="LangImpl1.html">Chapter 1</a>, to build a full <a
47 href="http://en.wikipedia.org/wiki/Parsing">parser</a> for
48 our Kaleidoscope language. Once we have a parser, we'll define and build an <a
49 href="http://en.wikipedia.org/wiki/Abstract_syntax_tree">Abstract Syntax
50 Tree</a> (AST).</p>
52 <p>The parser we will build uses a combination of <a
53 href="http://en.wikipedia.org/wiki/Recursive_descent_parser">Recursive Descent
54 Parsing</a> and <a href=
55 "http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
56 Parsing</a> to parse the Kaleidoscope language (the latter for
57 binary expressions and the former for everything else). Before we get to
58 parsing though, lets talk about the output of the parser: the Abstract Syntax
59 Tree.</p>
61 </div>
63 <!-- *********************************************************************** -->
64 <div class="doc_section"><a name="ast">The Abstract Syntax Tree (AST)</a></div>
65 <!-- *********************************************************************** -->
67 <div class="doc_text">
69 <p>The AST for a program captures its behavior in such a way that it is easy for
70 later stages of the compiler (e.g. code generation) to interpret. We basically
71 want one object for each construct in the language, and the AST should closely
72 model the language. In Kaleidoscope, we have expressions, a prototype, and a
73 function object. We'll start with expressions first:</p>
75 <div class="doc_code">
76 <pre>
77 /// ExprAST - Base class for all expression nodes.
78 class ExprAST {
79 public:
80 virtual ~ExprAST() {}
83 /// NumberExprAST - Expression class for numeric literals like "1.0".
84 class NumberExprAST : public ExprAST {
85 double Val;
86 public:
87 explicit NumberExprAST(double val) : Val(val) {}
89 </pre>
90 </div>
92 <p>The code above shows the definition of the base ExprAST class and one
93 subclass which we use for numeric literals. The important thing to note about
94 this code is that the NumberExprAST class captures the numeric value of the
95 literal as an instance variable. This allows later phases of the compiler to
96 know what the stored numeric value is.</p>
98 <p>Right now we only create the AST, so there are no useful accessor methods on
99 them. It would be very easy to add a virtual method to pretty print the code,
100 for example. Here are the other expression AST node definitions that we'll use
101 in the basic form of the Kaleidoscope language:
102 </p>
104 <div class="doc_code">
105 <pre>
106 /// VariableExprAST - Expression class for referencing a variable, like "a".
107 class VariableExprAST : public ExprAST {
108 std::string Name;
109 public:
110 explicit VariableExprAST(const std::string &amp;name) : Name(name) {}
113 /// BinaryExprAST - Expression class for a binary operator.
114 class BinaryExprAST : public ExprAST {
115 char Op;
116 ExprAST *LHS, *RHS;
117 public:
118 BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
119 : Op(op), LHS(lhs), RHS(rhs) {}
122 /// CallExprAST - Expression class for function calls.
123 class CallExprAST : public ExprAST {
124 std::string Callee;
125 std::vector&lt;ExprAST*&gt; Args;
126 public:
127 CallExprAST(const std::string &amp;callee, std::vector&lt;ExprAST*&gt; &amp;args)
128 : Callee(callee), Args(args) {}
130 </pre>
131 </div>
133 <p>This is all (intentionally) rather straight-forward: variables capture the
134 variable name, binary operators capture their opcode (e.g. '+'), and calls
135 capture a function name as well as a list of any argument expressions. One thing
136 that is nice about our AST is that it captures the language features without
137 talking about the syntax of the language. Note that there is no discussion about
138 precedence of binary operators, lexical structure, etc.</p>
140 <p>For our basic language, these are all of the expression nodes we'll define.
141 Because it doesn't have conditional control flow, it isn't Turing-complete;
142 we'll fix that in a later installment. The two things we need next are a way
143 to talk about the interface to a function, and a way to talk about functions
144 themselves:</p>
146 <div class="doc_code">
147 <pre>
148 /// PrototypeAST - This class represents the "prototype" for a function,
149 /// which captures its name, and its argument names (thus implicitly the number
150 /// of arguments the function takes).
151 class PrototypeAST {
152 std::string Name;
153 std::vector&lt;std::string&gt; Args;
154 public:
155 PrototypeAST(const std::string &amp;name, const std::vector&lt;std::string&gt; &amp;args)
156 : Name(name), Args(args) {}
159 /// FunctionAST - This class represents a function definition itself.
160 class FunctionAST {
161 PrototypeAST *Proto;
162 ExprAST *Body;
163 public:
164 FunctionAST(PrototypeAST *proto, ExprAST *body)
165 : Proto(proto), Body(body) {}
167 </pre>
168 </div>
170 <p>In Kaleidoscope, functions are typed with just a count of their arguments.
171 Since all values are double precision floating point, the type of each argument
172 doesn't need to be stored anywhere. In a more aggressive and realistic
173 language, the "ExprAST" class would probably have a type field.</p>
175 <p>With this scaffolding, we can now talk about parsing expressions and function
176 bodies in Kaleidoscope.</p>
178 </div>
180 <!-- *********************************************************************** -->
181 <div class="doc_section"><a name="parserbasics">Parser Basics</a></div>
182 <!-- *********************************************************************** -->
184 <div class="doc_text">
186 <p>Now that we have an AST to build, we need to define the parser code to build
187 it. The idea here is that we want to parse something like "x+y" (which is
188 returned as three tokens by the lexer) into an AST that could be generated with
189 calls like this:</p>
191 <div class="doc_code">
192 <pre>
193 ExprAST *X = new VariableExprAST("x");
194 ExprAST *Y = new VariableExprAST("y");
195 ExprAST *Result = new BinaryExprAST('+', X, Y);
196 </pre>
197 </div>
199 <p>In order to do this, we'll start by defining some basic helper routines:</p>
201 <div class="doc_code">
202 <pre>
203 /// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
204 /// token the parser is looking at. getNextToken reads another token from the
205 /// lexer and updates CurTok with its results.
206 static int CurTok;
207 static int getNextToken() {
208 return CurTok = gettok();
210 </pre>
211 </div>
214 This implements a simple token buffer around the lexer. This allows
215 us to look one token ahead at what the lexer is returning. Every function in
216 our parser will assume that CurTok is the current token that needs to be
217 parsed.</p>
219 <div class="doc_code">
220 <pre>
222 /// Error* - These are little helper functions for error handling.
223 ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
224 PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
225 FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
226 </pre>
227 </div>
230 The <tt>Error</tt> routines are simple helper routines that our parser will use
231 to handle errors. The error recovery in our parser will not be the best and
232 is not particular user-friendly, but it will be enough for our tutorial. These
233 routines make it easier to handle errors in routines that have various return
234 types: they always return null.</p>
236 <p>With these basic helper functions, we can implement the first
237 piece of our grammar: numeric literals.</p>
239 </div>
241 <!-- *********************************************************************** -->
242 <div class="doc_section"><a name="parserprimexprs">Basic Expression
243 Parsing</a></div>
244 <!-- *********************************************************************** -->
246 <div class="doc_text">
248 <p>We start with numeric literals, because they are the simplest to process.
249 For each production in our grammar, we'll define a function which parses that
250 production. For numeric literals, we have:
251 </p>
253 <div class="doc_code">
254 <pre>
255 /// numberexpr ::= number
256 static ExprAST *ParseNumberExpr() {
257 ExprAST *Result = new NumberExprAST(NumVal);
258 getNextToken(); // consume the number
259 return Result;
261 </pre>
262 </div>
264 <p>This routine is very simple: it expects to be called when the current token
265 is a <tt>tok_number</tt> token. It takes the current number value, creates
266 a <tt>NumberExprAST</tt> node, advances the lexer to the next token, and finally
267 returns.</p>
269 <p>There are some interesting aspects to this. The most important one is that
270 this routine eats all of the tokens that correspond to the production and
271 returns the lexer buffer with the next token (which is not part of the grammar
272 production) ready to go. This is a fairly standard way to go for recursive
273 descent parsers. For a better example, the parenthesis operator is defined like
274 this:</p>
276 <div class="doc_code">
277 <pre>
278 /// parenexpr ::= '(' expression ')'
279 static ExprAST *ParseParenExpr() {
280 getNextToken(); // eat (.
281 ExprAST *V = ParseExpression();
282 if (!V) return 0;
284 if (CurTok != ')')
285 return Error("expected ')'");
286 getNextToken(); // eat ).
287 return V;
289 </pre>
290 </div>
292 <p>This function illustrates a number of interesting things about the
293 parser:</p>
296 1) It shows how we use the Error routines. When called, this function expects
297 that the current token is a '(' token, but after parsing the subexpression, it
298 is possible that there is no ')' waiting. For example, if the user types in
299 "(4 x" instead of "(4)", the parser should emit an error. Because errors can
300 occur, the parser needs a way to indicate that they happened: in our parser, we
301 return null on an error.</p>
303 <p>2) Another interesting aspect of this function is that it uses recursion by
304 calling <tt>ParseExpression</tt> (we will soon see that <tt>ParseExpression</tt> can call
305 <tt>ParseParenExpr</tt>). This is powerful because it allows us to handle
306 recursive grammars, and keeps each production very simple. Note that
307 parentheses do not cause construction of AST nodes themselves. While we could
308 do it this way, the most important role of parentheses are to guide the parser
309 and provide grouping. Once the parser constructs the AST, parentheses are not
310 needed.</p>
312 <p>The next simple production is for handling variable references and function
313 calls:</p>
315 <div class="doc_code">
316 <pre>
317 /// identifierexpr
318 /// ::= identifier
319 /// ::= identifier '(' expression* ')'
320 static ExprAST *ParseIdentifierExpr() {
321 std::string IdName = IdentifierStr;
323 getNextToken(); // eat identifier.
325 if (CurTok != '(') // Simple variable ref.
326 return new VariableExprAST(IdName);
328 // Call.
329 getNextToken(); // eat (
330 std::vector&lt;ExprAST*&gt; Args;
331 if (CurTok != ')') {
332 while (1) {
333 ExprAST *Arg = ParseExpression();
334 if (!Arg) return 0;
335 Args.push_back(Arg);
337 if (CurTok == ')') break;
339 if (CurTok != ',')
340 return Error("Expected ')' or ',' in argument list");
341 getNextToken();
345 // Eat the ')'.
346 getNextToken();
348 return new CallExprAST(IdName, Args);
350 </pre>
351 </div>
353 <p>This routine follows the same style as the other routines. (It expects to be
354 called if the current token is a <tt>tok_identifier</tt> token). It also has
355 recursion and error handling. One interesting aspect of this is that it uses
356 <em>look-ahead</em> to determine if the current identifier is a stand alone
357 variable reference or if it is a function call expression. It handles this by
358 checking to see if the token after the identifier is a '(' token, constructing
359 either a <tt>VariableExprAST</tt> or <tt>CallExprAST</tt> node as appropriate.
360 </p>
362 <p>Now that we have all of our simple expression-parsing logic in place, we can
363 define a helper function to wrap it together into one entry point. We call this
364 class of expressions "primary" expressions, for reasons that will become more
365 clear <a href="LangImpl6.html#unary">later in the tutorial</a>. In order to
366 parse an arbitrary primary expression, we need to determine what sort of
367 expression it is:</p>
369 <div class="doc_code">
370 <pre>
371 /// primary
372 /// ::= identifierexpr
373 /// ::= numberexpr
374 /// ::= parenexpr
375 static ExprAST *ParsePrimary() {
376 switch (CurTok) {
377 default: return Error("unknown token when expecting an expression");
378 case tok_identifier: return ParseIdentifierExpr();
379 case tok_number: return ParseNumberExpr();
380 case '(': return ParseParenExpr();
383 </pre>
384 </div>
386 <p>Now that you see the definition of this function, it is more obvious why we
387 can assume the state of CurTok in the various functions. This uses look-ahead
388 to determine which sort of expression is being inspected, and then parses it
389 with a function call.</p>
391 <p>Now that basic expressions are handled, we need to handle binary expressions.
392 They are a bit more complex.</p>
394 </div>
396 <!-- *********************************************************************** -->
397 <div class="doc_section"><a name="parserbinops">Binary Expression
398 Parsing</a></div>
399 <!-- *********************************************************************** -->
401 <div class="doc_text">
403 <p>Binary expressions are significantly harder to parse because they are often
404 ambiguous. For example, when given the string "x+y*z", the parser can choose
405 to parse it as either "(x+y)*z" or "x+(y*z)". With common definitions from
406 mathematics, we expect the later parse, because "*" (multiplication) has
407 higher <em>precedence</em> than "+" (addition).</p>
409 <p>There are many ways to handle this, but an elegant and efficient way is to
410 use <a href=
411 "http://en.wikipedia.org/wiki/Operator-precedence_parser">Operator-Precedence
412 Parsing</a>. This parsing technique uses the precedence of binary operators to
413 guide recursion. To start with, we need a table of precedences:</p>
415 <div class="doc_code">
416 <pre>
417 /// BinopPrecedence - This holds the precedence for each binary operator that is
418 /// defined.
419 static std::map&lt;char, int&gt; BinopPrecedence;
421 /// GetTokPrecedence - Get the precedence of the pending binary operator token.
422 static int GetTokPrecedence() {
423 if (!isascii(CurTok))
424 return -1;
426 // Make sure it's a declared binop.
427 int TokPrec = BinopPrecedence[CurTok];
428 if (TokPrec &lt;= 0) return -1;
429 return TokPrec;
432 int main() {
433 // Install standard binary operators.
434 // 1 is lowest precedence.
435 BinopPrecedence['&lt;'] = 10;
436 BinopPrecedence['+'] = 20;
437 BinopPrecedence['-'] = 20;
438 BinopPrecedence['*'] = 40; // highest.
441 </pre>
442 </div>
444 <p>For the basic form of Kaleidoscope, we will only support 4 binary operators
445 (this can obviously be extended by you, our brave and intrepid reader). The
446 <tt>GetTokPrecedence</tt> function returns the precedence for the current token,
447 or -1 if the token is not a binary operator. Having a map makes it easy to add
448 new operators and makes it clear that the algorithm doesn't depend on the
449 specific operators involved, but it would be easy enough to eliminate the map
450 and do the comparisons in the <tt>GetTokPrecedence</tt> function. (Or just use
451 a fixed-size array).</p>
453 <p>With the helper above defined, we can now start parsing binary expressions.
454 The basic idea of operator precedence parsing is to break down an expression
455 with potentially ambiguous binary operators into pieces. Consider ,for example,
456 the expression "a+b+(c+d)*e*f+g". Operator precedence parsing considers this
457 as a stream of primary expressions separated by binary operators. As such,
458 it will first parse the leading primary expression "a", then it will see the
459 pairs [+, b] [+, (c+d)] [*, e] [*, f] and [+, g]. Note that because parentheses
460 are primary expressions, the binary expression parser doesn't need to worry
461 about nested subexpressions like (c+d) at all.
462 </p>
465 To start, an expression is a primary expression potentially followed by a
466 sequence of [binop,primaryexpr] pairs:</p>
468 <div class="doc_code">
469 <pre>
470 /// expression
471 /// ::= primary binoprhs
473 static ExprAST *ParseExpression() {
474 ExprAST *LHS = ParsePrimary();
475 if (!LHS) return 0;
477 return ParseBinOpRHS(0, LHS);
479 </pre>
480 </div>
482 <p><tt>ParseBinOpRHS</tt> is the function that parses the sequence of pairs for
483 us. It takes a precedence and a pointer to an expression for the part that has been
484 parsed so far. Note that "x" is a perfectly valid expression: As such, "binoprhs" is
485 allowed to be empty, in which case it returns the expression that is passed into
486 it. In our example above, the code passes the expression for "a" into
487 <tt>ParseBinOpRHS</tt> and the current token is "+".</p>
489 <p>The precedence value passed into <tt>ParseBinOpRHS</tt> indicates the <em>
490 minimal operator precedence</em> that the function is allowed to eat. For
491 example, if the current pair stream is [+, x] and <tt>ParseBinOpRHS</tt> is
492 passed in a precedence of 40, it will not consume any tokens (because the
493 precedence of '+' is only 20). With this in mind, <tt>ParseBinOpRHS</tt> starts
494 with:</p>
496 <div class="doc_code">
497 <pre>
498 /// binoprhs
499 /// ::= ('+' primary)*
500 static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
501 // If this is a binop, find its precedence.
502 while (1) {
503 int TokPrec = GetTokPrecedence();
505 // If this is a binop that binds at least as tightly as the current binop,
506 // consume it, otherwise we are done.
507 if (TokPrec &lt; ExprPrec)
508 return LHS;
509 </pre>
510 </div>
512 <p>This code gets the precedence of the current token and checks to see if if is
513 too low. Because we defined invalid tokens to have a precedence of -1, this
514 check implicitly knows that the pair-stream ends when the token stream runs out
515 of binary operators. If this check succeeds, we know that the token is a binary
516 operator and that it will be included in this expression:</p>
518 <div class="doc_code">
519 <pre>
520 // Okay, we know this is a binop.
521 int BinOp = CurTok;
522 getNextToken(); // eat binop
524 // Parse the primary expression after the binary operator.
525 ExprAST *RHS = ParsePrimary();
526 if (!RHS) return 0;
527 </pre>
528 </div>
530 <p>As such, this code eats (and remembers) the binary operator and then parses
531 the primary expression that follows. This builds up the whole pair, the first of
532 which is [+, b] for the running example.</p>
534 <p>Now that we parsed the left-hand side of an expression and one pair of the
535 RHS sequence, we have to decide which way the expression associates. In
536 particular, we could have "(a+b) binop unparsed" or "a + (b binop unparsed)".
537 To determine this, we look ahead at "binop" to determine its precedence and
538 compare it to BinOp's precedence (which is '+' in this case):</p>
540 <div class="doc_code">
541 <pre>
542 // If BinOp binds less tightly with RHS than the operator after RHS, let
543 // the pending operator take RHS as its LHS.
544 int NextPrec = GetTokPrecedence();
545 if (TokPrec &lt; NextPrec) {
546 </pre>
547 </div>
549 <p>If the precedence of the binop to the right of "RHS" is lower or equal to the
550 precedence of our current operator, then we know that the parentheses associate
551 as "(a+b) binop ...". In our example, the current operator is "+" and the next
552 operator is "+", we know that they have the same precedence. In this case we'll
553 create the AST node for "a+b", and then continue parsing:</p>
555 <div class="doc_code">
556 <pre>
557 ... if body omitted ...
560 // Merge LHS/RHS.
561 LHS = new BinaryExprAST(BinOp, LHS, RHS);
562 } // loop around to the top of the while loop.
564 </pre>
565 </div>
567 <p>In our example above, this will turn "a+b+" into "(a+b)" and execute the next
568 iteration of the loop, with "+" as the current token. The code above will eat,
569 remember, and parse "(c+d)" as the primary expression, which makes the
570 current pair equal to [+, (c+d)]. It will then evaluate the 'if' conditional above with
571 "*" as the binop to the right of the primary. In this case, the precedence of "*" is
572 higher than the precedence of "+" so the if condition will be entered.</p>
574 <p>The critical question left here is "how can the if condition parse the right
575 hand side in full"? In particular, to build the AST correctly for our example,
576 it needs to get all of "(c+d)*e*f" as the RHS expression variable. The code to
577 do this is surprisingly simple (code from the above two blocks duplicated for
578 context):</p>
580 <div class="doc_code">
581 <pre>
582 // If BinOp binds less tightly with RHS than the operator after RHS, let
583 // the pending operator take RHS as its LHS.
584 int NextPrec = GetTokPrecedence();
585 if (TokPrec &lt; NextPrec) {
586 <b>RHS = ParseBinOpRHS(TokPrec+1, RHS);
587 if (RHS == 0) return 0;</b>
589 // Merge LHS/RHS.
590 LHS = new BinaryExprAST(BinOp, LHS, RHS);
591 } // loop around to the top of the while loop.
593 </pre>
594 </div>
596 <p>At this point, we know that the binary operator to the RHS of our primary
597 has higher precedence than the binop we are currently parsing. As such, we know
598 that any sequence of pairs whose operators are all higher precedence than "+"
599 should be parsed together and returned as "RHS". To do this, we recursively
600 invoke the <tt>ParseBinOpRHS</tt> function specifying "TokPrec+1" as the minimum
601 precedence required for it to continue. In our example above, this will cause
602 it to return the AST node for "(c+d)*e*f" as RHS, which is then set as the RHS
603 of the '+' expression.</p>
605 <p>Finally, on the next iteration of the while loop, the "+g" piece is parsed
606 and added to the AST. With this little bit of code (14 non-trivial lines), we
607 correctly handle fully general binary expression parsing in a very elegant way.
608 This was a whirlwind tour of this code, and it is somewhat subtle. I recommend
609 running through it with a few tough examples to see how it works.
610 </p>
612 <p>This wraps up handling of expressions. At this point, we can point the
613 parser at an arbitrary token stream and build an expression from it, stopping
614 at the first token that is not part of the expression. Next up we need to
615 handle function definitions, etc.</p>
617 </div>
619 <!-- *********************************************************************** -->
620 <div class="doc_section"><a name="parsertop">Parsing the Rest</a></div>
621 <!-- *********************************************************************** -->
623 <div class="doc_text">
626 The next thing missing is handling of function prototypes. In Kaleidoscope,
627 these are used both for 'extern' function declarations as well as function body
628 definitions. The code to do this is straight-forward and not very interesting
629 (once you've survived expressions):
630 </p>
632 <div class="doc_code">
633 <pre>
634 /// prototype
635 /// ::= id '(' id* ')'
636 static PrototypeAST *ParsePrototype() {
637 if (CurTok != tok_identifier)
638 return ErrorP("Expected function name in prototype");
640 std::string FnName = IdentifierStr;
641 getNextToken();
643 if (CurTok != '(')
644 return ErrorP("Expected '(' in prototype");
646 // Read the list of argument names.
647 std::vector&lt;std::string&gt; ArgNames;
648 while (getNextToken() == tok_identifier)
649 ArgNames.push_back(IdentifierStr);
650 if (CurTok != ')')
651 return ErrorP("Expected ')' in prototype");
653 // success.
654 getNextToken(); // eat ')'.
656 return new PrototypeAST(FnName, ArgNames);
658 </pre>
659 </div>
661 <p>Given this, a function definition is very simple, just a prototype plus
662 an expression to implement the body:</p>
664 <div class="doc_code">
665 <pre>
666 /// definition ::= 'def' prototype expression
667 static FunctionAST *ParseDefinition() {
668 getNextToken(); // eat def.
669 PrototypeAST *Proto = ParsePrototype();
670 if (Proto == 0) return 0;
672 if (ExprAST *E = ParseExpression())
673 return new FunctionAST(Proto, E);
674 return 0;
676 </pre>
677 </div>
679 <p>In addition, we support 'extern' to declare functions like 'sin' and 'cos' as
680 well as to support forward declaration of user functions. These 'extern's are just
681 prototypes with no body:</p>
683 <div class="doc_code">
684 <pre>
685 /// external ::= 'extern' prototype
686 static PrototypeAST *ParseExtern() {
687 getNextToken(); // eat extern.
688 return ParsePrototype();
690 </pre>
691 </div>
693 <p>Finally, we'll also let the user type in arbitrary top-level expressions and
694 evaluate them on the fly. We will handle this by defining anonymous nullary
695 (zero argument) functions for them:</p>
697 <div class="doc_code">
698 <pre>
699 /// toplevelexpr ::= expression
700 static FunctionAST *ParseTopLevelExpr() {
701 if (ExprAST *E = ParseExpression()) {
702 // Make an anonymous proto.
703 PrototypeAST *Proto = new PrototypeAST("", std::vector&lt;std::string&gt;());
704 return new FunctionAST(Proto, E);
706 return 0;
708 </pre>
709 </div>
711 <p>Now that we have all the pieces, let's build a little driver that will let us
712 actually <em>execute</em> this code we've built!</p>
714 </div>
716 <!-- *********************************************************************** -->
717 <div class="doc_section"><a name="driver">The Driver</a></div>
718 <!-- *********************************************************************** -->
720 <div class="doc_text">
722 <p>The driver for this simply invokes all of the parsing pieces with a top-level
723 dispatch loop. There isn't much interesting here, so I'll just include the
724 top-level loop. See <a href="#code">below</a> for full code in the "Top-Level
725 Parsing" section.</p>
727 <div class="doc_code">
728 <pre>
729 /// top ::= definition | external | expression | ';'
730 static void MainLoop() {
731 while (1) {
732 fprintf(stderr, "ready&gt; ");
733 switch (CurTok) {
734 case tok_eof: return;
735 case ';': getNextToken(); break; // ignore top-level semicolons.
736 case tok_def: HandleDefinition(); break;
737 case tok_extern: HandleExtern(); break;
738 default: HandleTopLevelExpression(); break;
742 </pre>
743 </div>
745 <p>The most interesting part of this is that we ignore top-level semicolons.
746 Why is this, you ask? The basic reason is that if you type "4 + 5" at the
747 command line, the parser doesn't know whether that is the end of what you will type
748 or not. For example, on the next line you could type "def foo..." in which case
749 4+5 is the end of a top-level expression. Alternatively you could type "* 6",
750 which would continue the expression. Having top-level semicolons allows you to
751 type "4+5;", and the parser will know you are done.</p>
753 </div>
755 <!-- *********************************************************************** -->
756 <div class="doc_section"><a name="conclusions">Conclusions</a></div>
757 <!-- *********************************************************************** -->
759 <div class="doc_text">
761 <p>With just under 400 lines of commented code (240 lines of non-comment,
762 non-blank code), we fully defined our minimal language, including a lexer,
763 parser, and AST builder. With this done, the executable will validate
764 Kaleidoscope code and tell us if it is grammatically invalid. For
765 example, here is a sample interaction:</p>
767 <div class="doc_code">
768 <pre>
769 $ <b>./a.out</b>
770 ready&gt; <b>def foo(x y) x+foo(y, 4.0);</b>
771 Parsed a function definition.
772 ready&gt; <b>def foo(x y) x+y y;</b>
773 Parsed a function definition.
774 Parsed a top-level expr
775 ready&gt; <b>def foo(x y) x+y );</b>
776 Parsed a function definition.
777 Error: unknown token when expecting an expression
778 ready&gt; <b>extern sin(a);</b>
779 ready&gt; Parsed an extern
780 ready&gt; <b>^D</b>
782 </pre>
783 </div>
785 <p>There is a lot of room for extension here. You can define new AST nodes,
786 extend the language in many ways, etc. In the <a href="LangImpl3.html">next
787 installment</a>, we will describe how to generate LLVM Intermediate
788 Representation (IR) from the AST.</p>
790 </div>
792 <!-- *********************************************************************** -->
793 <div class="doc_section"><a name="code">Full Code Listing</a></div>
794 <!-- *********************************************************************** -->
796 <div class="doc_text">
799 Here is the complete code listing for this and the previous chapter.
800 Note that it is fully self-contained: you don't need LLVM or any external
801 libraries at all for this. (Besides the C and C++ standard libraries, of
802 course.) To build this, just compile with:</p>
804 <div class="doc_code">
805 <pre>
806 # Compile
807 g++ -g -O3 toy.cpp
808 # Run
809 ./a.out
810 </pre>
811 </div>
813 <p>Here is the code:</p>
815 <div class="doc_code">
816 <pre>
817 #include &lt;cstdio&gt;
818 #include &lt;cstdlib&gt;
819 #include &lt;string&gt;
820 #include &lt;map&gt;
821 #include &lt;vector&gt;
823 //===----------------------------------------------------------------------===//
824 // Lexer
825 //===----------------------------------------------------------------------===//
827 // The lexer returns tokens [0-255] if it is an unknown character, otherwise one
828 // of these for known things.
829 enum Token {
830 tok_eof = -1,
832 // commands
833 tok_def = -2, tok_extern = -3,
835 // primary
836 tok_identifier = -4, tok_number = -5,
839 static std::string IdentifierStr; // Filled in if tok_identifier
840 static double NumVal; // Filled in if tok_number
842 /// gettok - Return the next token from standard input.
843 static int gettok() {
844 static int LastChar = ' ';
846 // Skip any whitespace.
847 while (isspace(LastChar))
848 LastChar = getchar();
850 if (isalpha(LastChar)) { // identifier: [a-zA-Z][a-zA-Z0-9]*
851 IdentifierStr = LastChar;
852 while (isalnum((LastChar = getchar())))
853 IdentifierStr += LastChar;
855 if (IdentifierStr == "def") return tok_def;
856 if (IdentifierStr == "extern") return tok_extern;
857 return tok_identifier;
860 if (isdigit(LastChar) || LastChar == '.') { // Number: [0-9.]+
861 std::string NumStr;
862 do {
863 NumStr += LastChar;
864 LastChar = getchar();
865 } while (isdigit(LastChar) || LastChar == '.');
867 NumVal = strtod(NumStr.c_str(), 0);
868 return tok_number;
871 if (LastChar == '#') {
872 // Comment until end of line.
873 do LastChar = getchar();
874 while (LastChar != EOF &amp;&amp; LastChar != '\n' &amp;&amp; LastChar != '\r');
876 if (LastChar != EOF)
877 return gettok();
880 // Check for end of file. Don't eat the EOF.
881 if (LastChar == EOF)
882 return tok_eof;
884 // Otherwise, just return the character as its ascii value.
885 int ThisChar = LastChar;
886 LastChar = getchar();
887 return ThisChar;
890 //===----------------------------------------------------------------------===//
891 // Abstract Syntax Tree (aka Parse Tree)
892 //===----------------------------------------------------------------------===//
894 /// ExprAST - Base class for all expression nodes.
895 class ExprAST {
896 public:
897 virtual ~ExprAST() {}
900 /// NumberExprAST - Expression class for numeric literals like "1.0".
901 class NumberExprAST : public ExprAST {
902 double Val;
903 public:
904 explicit NumberExprAST(double val) : Val(val) {}
907 /// VariableExprAST - Expression class for referencing a variable, like "a".
908 class VariableExprAST : public ExprAST {
909 std::string Name;
910 public:
911 explicit VariableExprAST(const std::string &amp;name) : Name(name) {}
914 /// BinaryExprAST - Expression class for a binary operator.
915 class BinaryExprAST : public ExprAST {
916 char Op;
917 ExprAST *LHS, *RHS;
918 public:
919 BinaryExprAST(char op, ExprAST *lhs, ExprAST *rhs)
920 : Op(op), LHS(lhs), RHS(rhs) {}
923 /// CallExprAST - Expression class for function calls.
924 class CallExprAST : public ExprAST {
925 std::string Callee;
926 std::vector&lt;ExprAST*&gt; Args;
927 public:
928 CallExprAST(const std::string &amp;callee, std::vector&lt;ExprAST*&gt; &amp;args)
929 : Callee(callee), Args(args) {}
932 /// PrototypeAST - This class represents the "prototype" for a function,
933 /// which captures its name, and its argument names (thus implicitly the number
934 /// of arguments the function takes).
935 class PrototypeAST {
936 std::string Name;
937 std::vector&lt;std::string&gt; Args;
938 public:
939 PrototypeAST(const std::string &amp;name, const std::vector&lt;std::string&gt; &amp;args)
940 : Name(name), Args(args) {}
944 /// FunctionAST - This class represents a function definition itself.
945 class FunctionAST {
946 PrototypeAST *Proto;
947 ExprAST *Body;
948 public:
949 FunctionAST(PrototypeAST *proto, ExprAST *body)
950 : Proto(proto), Body(body) {}
954 //===----------------------------------------------------------------------===//
955 // Parser
956 //===----------------------------------------------------------------------===//
958 /// CurTok/getNextToken - Provide a simple token buffer. CurTok is the current
959 /// token the parser is looking at. getNextToken reads another token from the
960 /// lexer and updates CurTok with its results.
961 static int CurTok;
962 static int getNextToken() {
963 return CurTok = gettok();
966 /// BinopPrecedence - This holds the precedence for each binary operator that is
967 /// defined.
968 static std::map&lt;char, int&gt; BinopPrecedence;
970 /// GetTokPrecedence - Get the precedence of the pending binary operator token.
971 static int GetTokPrecedence() {
972 if (!isascii(CurTok))
973 return -1;
975 // Make sure it's a declared binop.
976 int TokPrec = BinopPrecedence[CurTok];
977 if (TokPrec &lt;= 0) return -1;
978 return TokPrec;
981 /// Error* - These are little helper functions for error handling.
982 ExprAST *Error(const char *Str) { fprintf(stderr, "Error: %s\n", Str);return 0;}
983 PrototypeAST *ErrorP(const char *Str) { Error(Str); return 0; }
984 FunctionAST *ErrorF(const char *Str) { Error(Str); return 0; }
986 static ExprAST *ParseExpression();
988 /// identifierexpr
989 /// ::= identifier
990 /// ::= identifier '(' expression* ')'
991 static ExprAST *ParseIdentifierExpr() {
992 std::string IdName = IdentifierStr;
994 getNextToken(); // eat identifier.
996 if (CurTok != '(') // Simple variable ref.
997 return new VariableExprAST(IdName);
999 // Call.
1000 getNextToken(); // eat (
1001 std::vector&lt;ExprAST*&gt; Args;
1002 if (CurTok != ')') {
1003 while (1) {
1004 ExprAST *Arg = ParseExpression();
1005 if (!Arg) return 0;
1006 Args.push_back(Arg);
1008 if (CurTok == ')') break;
1010 if (CurTok != ',')
1011 return Error("Expected ')' or ',' in argument list");
1012 getNextToken();
1016 // Eat the ')'.
1017 getNextToken();
1019 return new CallExprAST(IdName, Args);
1022 /// numberexpr ::= number
1023 static ExprAST *ParseNumberExpr() {
1024 ExprAST *Result = new NumberExprAST(NumVal);
1025 getNextToken(); // consume the number
1026 return Result;
1029 /// parenexpr ::= '(' expression ')'
1030 static ExprAST *ParseParenExpr() {
1031 getNextToken(); // eat (.
1032 ExprAST *V = ParseExpression();
1033 if (!V) return 0;
1035 if (CurTok != ')')
1036 return Error("expected ')'");
1037 getNextToken(); // eat ).
1038 return V;
1041 /// primary
1042 /// ::= identifierexpr
1043 /// ::= numberexpr
1044 /// ::= parenexpr
1045 static ExprAST *ParsePrimary() {
1046 switch (CurTok) {
1047 default: return Error("unknown token when expecting an expression");
1048 case tok_identifier: return ParseIdentifierExpr();
1049 case tok_number: return ParseNumberExpr();
1050 case '(': return ParseParenExpr();
1054 /// binoprhs
1055 /// ::= ('+' primary)*
1056 static ExprAST *ParseBinOpRHS(int ExprPrec, ExprAST *LHS) {
1057 // If this is a binop, find its precedence.
1058 while (1) {
1059 int TokPrec = GetTokPrecedence();
1061 // If this is a binop that binds at least as tightly as the current binop,
1062 // consume it, otherwise we are done.
1063 if (TokPrec &lt; ExprPrec)
1064 return LHS;
1066 // Okay, we know this is a binop.
1067 int BinOp = CurTok;
1068 getNextToken(); // eat binop
1070 // Parse the primary expression after the binary operator.
1071 ExprAST *RHS = ParsePrimary();
1072 if (!RHS) return 0;
1074 // If BinOp binds less tightly with RHS than the operator after RHS, let
1075 // the pending operator take RHS as its LHS.
1076 int NextPrec = GetTokPrecedence();
1077 if (TokPrec &lt; NextPrec) {
1078 RHS = ParseBinOpRHS(TokPrec+1, RHS);
1079 if (RHS == 0) return 0;
1082 // Merge LHS/RHS.
1083 LHS = new BinaryExprAST(BinOp, LHS, RHS);
1087 /// expression
1088 /// ::= primary binoprhs
1090 static ExprAST *ParseExpression() {
1091 ExprAST *LHS = ParsePrimary();
1092 if (!LHS) return 0;
1094 return ParseBinOpRHS(0, LHS);
1097 /// prototype
1098 /// ::= id '(' id* ')'
1099 static PrototypeAST *ParsePrototype() {
1100 if (CurTok != tok_identifier)
1101 return ErrorP("Expected function name in prototype");
1103 std::string FnName = IdentifierStr;
1104 getNextToken();
1106 if (CurTok != '(')
1107 return ErrorP("Expected '(' in prototype");
1109 std::vector&lt;std::string&gt; ArgNames;
1110 while (getNextToken() == tok_identifier)
1111 ArgNames.push_back(IdentifierStr);
1112 if (CurTok != ')')
1113 return ErrorP("Expected ')' in prototype");
1115 // success.
1116 getNextToken(); // eat ')'.
1118 return new PrototypeAST(FnName, ArgNames);
1121 /// definition ::= 'def' prototype expression
1122 static FunctionAST *ParseDefinition() {
1123 getNextToken(); // eat def.
1124 PrototypeAST *Proto = ParsePrototype();
1125 if (Proto == 0) return 0;
1127 if (ExprAST *E = ParseExpression())
1128 return new FunctionAST(Proto, E);
1129 return 0;
1132 /// toplevelexpr ::= expression
1133 static FunctionAST *ParseTopLevelExpr() {
1134 if (ExprAST *E = ParseExpression()) {
1135 // Make an anonymous proto.
1136 PrototypeAST *Proto = new PrototypeAST("", std::vector&lt;std::string&gt;());
1137 return new FunctionAST(Proto, E);
1139 return 0;
1142 /// external ::= 'extern' prototype
1143 static PrototypeAST *ParseExtern() {
1144 getNextToken(); // eat extern.
1145 return ParsePrototype();
1148 //===----------------------------------------------------------------------===//
1149 // Top-Level parsing
1150 //===----------------------------------------------------------------------===//
1152 static void HandleDefinition() {
1153 if (FunctionAST *F = ParseDefinition()) {
1154 fprintf(stderr, "Parsed a function definition.\n");
1155 } else {
1156 // Skip token for error recovery.
1157 getNextToken();
1161 static void HandleExtern() {
1162 if (PrototypeAST *P = ParseExtern()) {
1163 fprintf(stderr, "Parsed an extern\n");
1164 } else {
1165 // Skip token for error recovery.
1166 getNextToken();
1170 static void HandleTopLevelExpression() {
1171 // Evaluate a top-level expression into an anonymous function.
1172 if (FunctionAST *F = ParseTopLevelExpr()) {
1173 fprintf(stderr, "Parsed a top-level expr\n");
1174 } else {
1175 // Skip token for error recovery.
1176 getNextToken();
1180 /// top ::= definition | external | expression | ';'
1181 static void MainLoop() {
1182 while (1) {
1183 fprintf(stderr, "ready&gt; ");
1184 switch (CurTok) {
1185 case tok_eof: return;
1186 case ';': getNextToken(); break; // ignore top-level semicolons.
1187 case tok_def: HandleDefinition(); break;
1188 case tok_extern: HandleExtern(); break;
1189 default: HandleTopLevelExpression(); break;
1194 //===----------------------------------------------------------------------===//
1195 // Main driver code.
1196 //===----------------------------------------------------------------------===//
1198 int main() {
1199 // Install standard binary operators.
1200 // 1 is lowest precedence.
1201 BinopPrecedence['&lt;'] = 10;
1202 BinopPrecedence['+'] = 20;
1203 BinopPrecedence['-'] = 20;
1204 BinopPrecedence['*'] = 40; // highest.
1206 // Prime the first token.
1207 fprintf(stderr, "ready&gt; ");
1208 getNextToken();
1210 MainLoop();
1211 return 0;
1213 </pre>
1214 </div>
1215 <a href="LangImpl3.html">Next: Implementing Code Generation to LLVM IR</a>
1216 </div>
1218 <!-- *********************************************************************** -->
1219 <hr>
1220 <address>
1221 <a href="http://jigsaw.w3.org/css-validator/check/referer"><img
1222 src="http://jigsaw.w3.org/css-validator/images/vcss" alt="Valid CSS!"></a>
1223 <a href="http://validator.w3.org/check/referer"><img
1224 src="http://www.w3.org/Icons/valid-html401" alt="Valid HTML 4.01!"></a>
1226 <a href="mailto:sabre@nondot.org">Chris Lattner</a><br>
1227 <a href="http://llvm.org">The LLVM Compiler Infrastructure</a><br>
1228 Last modified: $Date: 2007-10-17 11:05:13 -0700 (Wed, 17 Oct 2007) $
1229 </address>
1230 </body>
1231 </html>