\chapter{Lexical analysis}

A Python program is read by a {\em parser}.  Input to the parser is a
stream of {\em tokens}, generated by the {\em lexical analyzer}.  This
chapter describes how the lexical analyzer breaks a file into tokens.
\index{lexical analysis}
\index{parser}
\index{token}

\section{Line structure}

A Python program is divided into a number of logical lines.  The end of
a logical line is represented by the token NEWLINE.  Statements cannot
cross logical line boundaries except where NEWLINE is allowed by the
syntax (e.g., between statements in compound statements).
\index{line structure}
\index{logical line}
\index{NEWLINE token}

\subsection{Comments}

A comment starts with a hash character (\verb@#@) that is not part of
a string literal, and ends at the end of the physical line.  A comment
always signifies the end of the logical line.  Comments are ignored by
the syntax.
\index{comment}
\index{logical line}
\index{physical line}
\index{hash character}

\subsection{Line joining}

Two or more physical lines may be joined into logical lines using
backslash characters (\verb/\/), as follows: when a physical line ends
in a backslash that is not part of a string literal or comment, it is
joined with the following line, forming a single logical line; the
backslash and the following end-of-line character are deleted.  For
example:
\index{physical line}
\index{line joining}
\index{backslash character}
%
\begin{verbatim}
month_names = ['Januari', 'Februari', 'Maart',     \
               'April',   'Mei',      'Juni',      \
               'Juli',    'Augustus', 'September', \
               'Oktober', 'November', 'December']
\end{verbatim}
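
The joining rule can be sketched in present-day Python as follows.
This is an illustration only, not the interpreter's implementation:
the name \verb@join_physical_lines@ is hypothetical, the input is
assumed to be a list of physical lines without their end-of-line
characters, and for brevity the sketch does not check whether a
trailing backslash lies inside a string literal or comment.

```python
def join_physical_lines(physical_lines):
    """Join physical lines that end in a backslash into logical lines.

    Simplification (assumption of this sketch): no check is made for a
    backslash that is part of a string literal or a comment.
    """
    logical_lines = []
    pending = ""
    for line in physical_lines:
        if line.endswith("\\"):
            # Delete the backslash; the end-of-line is already absent.
            pending += line[:-1]
        else:
            logical_lines.append(pending + line)
            pending = ""
    if pending:
        # The last physical line ended in a backslash.
        logical_lines.append(pending)
    return logical_lines


source = [
    "month_names = ['Januari', 'Februari', \\",
    "               'Maart']",
    "x = 1",
]
joined = join_physical_lines(source)
```

Here \verb@joined@ holds two logical lines: the joined assignment to
\verb@month_names@ and the unchanged line \verb@x = 1@.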

\subsection{Blank lines}

A logical line that contains only spaces, tabs, and possibly a
comment, is ignored (i.e., no NEWLINE token is generated), except that
during interactive input of statements, an entirely blank logical line
terminates a multi-line statement.
\index{blank line}

\subsection{Indentation}

Leading whitespace (spaces and tabs) at the beginning of a logical
line is used to compute the indentation level of the line, which in
turn is used to determine the grouping of statements.
\index{indentation}
\index{whitespace}
\index{leading whitespace}
\index{space}
\index{tab}
\index{grouping}
\index{statement grouping}

First, tabs are replaced (from left to right) by one to eight spaces
such that the total number of characters up to there is a multiple of
eight (this is intended to be the same rule as used by {\UNIX}).  The
total number of spaces preceding the first non-blank character then
determines the line's indentation.  Indentation cannot be split over
multiple physical lines using backslashes.
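
The tab rule amounts to advancing the column to the next multiple of
eight.  A minimal sketch in present-day Python, where the function
name \verb@indentation@ and the parameter \verb@tab_size@ are
hypothetical:

```python
def indentation(line, tab_size=8):
    """Return the indentation of a line, counted in columns."""
    column = 0
    for ch in line:
        if ch == " ":
            column += 1
        elif ch == "\t":
            # A tab stands for one to eight spaces: move to the next
            # multiple of tab_size.
            column = (column // tab_size + 1) * tab_size
        else:
            break  # first non-blank character
    return column


examples = {
    "\tx": indentation("\tx"),          # tab alone
    "   \tx": indentation("   \tx"),    # three spaces, then a tab
    "\t  x": indentation("\t  x"),      # a tab, then two spaces
}
```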

The indentation levels of consecutive lines are used to generate
INDENT and DEDENT tokens, using a stack, as follows.
\index{INDENT token}
\index{DEDENT token}

Before the first line of the file is read, a single zero is pushed on
the stack; this will never be popped off again.  The numbers pushed on
the stack will always be strictly increasing from bottom to top.  At
the beginning of each logical line, the line's indentation level is
compared to the top of the stack.  If it is equal, nothing happens.
If it is larger, it is pushed on the stack, and one INDENT token is
generated.  If it is smaller, it {\em must} be one of the numbers
occurring on the stack; all numbers on the stack that are larger are
popped off, and for each number popped off a DEDENT token is
generated.  At the end of the file, a DEDENT token is generated for
each number remaining on the stack that is larger than zero.
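
The stack procedure can be sketched in present-day Python as shown
here.  The function name \verb@indent_tokens@ and the placeholder
token \verb@"LINE"@ (standing in for the other tokens of each logical
line) are hypothetical; the input is assumed to be the list of
indentation levels of the non-blank logical lines, in order.

```python
def indent_tokens(levels):
    """Return the INDENT/DEDENT tokens produced for the given levels."""
    stack = [0]  # the initial zero is never popped
    tokens = []
    for level in levels:
        if level > stack[-1]:
            stack.append(level)
            tokens.append("INDENT")
        elif level < stack[-1]:
            if level not in stack:
                raise IndentationError(
                    "inconsistent dedent to column %d" % level)
            while stack[-1] > level:
                stack.pop()
                tokens.append("DEDENT")
        tokens.append("LINE")  # placeholder for the line's own tokens
    # End of file: one DEDENT for each level still larger than zero.
    while stack[-1] > 0:
        stack.pop()
        tokens.append("DEDENT")
    return tokens


nested = indent_tokens([0, 4, 8])
```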

Here is an example of a correctly (though confusingly) indented piece
of Python code:

\begin{verbatim}
def perm(l):
        # Compute the list of all permutations of l

    if len(l) <= 1:
                  return [l]
    r = []
    for i in range(len(l)):
             s = l[:i] + l[i+1:]
             p = perm(s)
             for x in p:
              r.append(l[i:i+1] + x)
    return r
\end{verbatim}

The following example shows various indentation errors:

\begin{verbatim}
    def perm(l):                        # error: first line indented
    for i in range(len(l)):             # error: not indented
        s = l[:i] + l[i+1:]
            p = perm(l[:i] + l[i+1:])   # error: unexpected indent
            for x in p:
                    r.append(l[i:i+1] + x)
                return r                # error: inconsistent dedent
\end{verbatim}

(Actually, the first three errors are detected by the parser; only the
last error is found by the lexical analyzer --- the indentation of
\verb@return r@ does not match a level popped off the stack.)

\section{Other tokens}

Besides NEWLINE, INDENT and DEDENT, the following categories of tokens
exist: identifiers, keywords, literals, operators, and delimiters.
Spaces and tabs are not tokens, but serve to delimit tokens.  Where
ambiguity exists, a token comprises the longest possible string that
forms a legal token, when read from left to right.

\section{Identifiers}

Identifiers (also referred to as names) are described by the following
lexical definitions:
\index{identifier}
\index{name}

\begin{verbatim}
identifier:     (letter|"_") (letter|digit|"_")*
letter:         lowercase | uppercase
lowercase:      "a"..."z"
uppercase:      "A"..."Z"
digit:          "0"..."9"
\end{verbatim}

Identifiers are unlimited in length.  Case is significant.

\subsection{Keywords}

The following identifiers are used as reserved words, or {\em
keywords} of the language, and cannot be used as ordinary
identifiers.  They must be spelled exactly as written here:
\index{keyword}
\index{reserved word}

\begin{verbatim}
and        del        for        in         print
break      elif       from       is         raise
class      else       global     not        return
continue   except     if         or         try
def        finally    import     pass       while
\end{verbatim}

%       # This Python program sorts and formats the above table
%       import string
%       l = []
%       try:
%               while 1:
%                       l = l + string.split(raw_input())
%       except EOFError:
%               pass
%       l.sort()
%       for i in range((len(l)+4)/5):
%               for j in range(i, len(l), 5):
%                       print string.ljust(l[j], 10),
%               print
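
The identifier definition and the keyword table translate directly
into a regular expression and a set lookup.  The following sketch, in
present-day Python, is an illustration only; the names
\verb@IDENTIFIER@, \verb@KEYWORDS@ and \verb@classify@ are
hypothetical, and the keyword set is copied from the table above
rather than taken from any running interpreter.

```python
import re

# identifier: (letter|"_") (letter|digit|"_")*, with ASCII letters only
IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")

# The keywords listed in this section.
KEYWORDS = frozenset("""
    and del for in print break elif from is raise
    class else global not return continue except if or try
    def finally import pass while
""".split())


def classify(word):
    """Return 'keyword', 'identifier', or None if word is neither."""
    if IDENTIFIER.fullmatch(word) is None:
        return None
    return "keyword" if word in KEYWORDS else "identifier"


kinds = [classify(w) for w in ("print", "Print", "_tmp1", "1abc")]
```

Because case is significant, \verb@Print@ is classified as an ordinary
identifier while \verb@print@ is a keyword.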

\section{Literals} \label{literals}

Literals are notations for constant values of some built-in types.
\index{literal}
\index{constant}

\subsection{String literals}

String literals are described by the following lexical definitions:
\index{string literal}

\begin{verbatim}
stringliteral:  "'" stringitem* "'"
stringitem:     stringchar | escapeseq
stringchar:     <any ASCII character except newline or "\" or "'">
escapeseq:      "\" <any ASCII character except newline>
\end{verbatim}
\index{ASCII}

String literals cannot span physical line boundaries.  Escape
sequences in strings are actually interpreted according to rules
similar to those used by Standard C.  The recognized escape sequences
are:
\index{physical line}
\index{escape sequence}
\index{Standard C}
\index{C}

\begin{center}
\begin{tabular}{|l|l|}
\hline
\verb/\\/       & Backslash (\verb/\/) \\
\verb/\'/       & Single quote (\verb/'/) \\
\verb/\a/       & ASCII Bell (BEL) \\
\verb/\b/       & ASCII Backspace (BS) \\
%\verb/\E/      & ASCII Escape (ESC) \\
\verb/\f/       & ASCII Formfeed (FF) \\
\verb/\n/       & ASCII Linefeed (LF) \\
\verb/\r/       & ASCII Carriage Return (CR) \\
\verb/\t/       & ASCII Horizontal Tab (TAB) \\
\verb/\v/       & ASCII Vertical Tab (VT) \\
\verb/\/{\em ooo}       & ASCII character with octal value {\em ooo} \\
\verb/\x/{\em xx...}    & ASCII character with hex value {\em xx...} \\
\hline
\end{tabular}
\end{center}
\index{ASCII}

In strict compatibility with Standard C, up to three octal digits are
accepted, but an unlimited number of hex digits is taken to be part of
the hex escape (and then the lower 8 bits of the resulting hex number
are used in all current implementations...).

All unrecognized escape sequences are left in the string unchanged,
i.e., {\em the backslash is left in the string.}  (This behavior is
useful when debugging: if an escape sequence is mistyped, the
resulting output is more easily recognized as broken.  It also helps a
great deal for string literals used as regular expressions or
otherwise passed to other modules that do their own escape handling.)
\index{unrecognized escape sequence}
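
The escape rules above can be sketched in present-day Python as
follows.  This is an illustration under stated assumptions, not the
interpreter's code: the name \verb@decode_escapes@ is hypothetical,
the argument is the text between the quotes, octal values are masked
to 8 bits like hex values, and \verb/\x/ not followed by a hex digit
is treated as an unrecognized escape.

```python
SIMPLE_ESCAPES = {
    "\\": "\\", "'": "'", "a": "\a", "b": "\b", "f": "\f",
    "n": "\n", "r": "\r", "t": "\t", "v": "\v",
}
OCTAL_DIGITS = "01234567"
HEX_DIGITS = "0123456789abcdefABCDEF"


def decode_escapes(body):
    """Decode the escape sequences in the body of a string literal."""
    out = []
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\" or i + 1 == len(body):
            out.append(ch)
            i += 1
            continue
        nxt = body[i + 1]
        if nxt in SIMPLE_ESCAPES:
            out.append(SIMPLE_ESCAPES[nxt])
            i += 2
        elif nxt in OCTAL_DIGITS:
            j = i + 1
            while j < len(body) and j < i + 4 and body[j] in OCTAL_DIGITS:
                j += 1  # at most three octal digits
            # Masking octal values to 8 bits is an assumption.
            out.append(chr(int(body[i + 1:j], 8) & 0xFF))
            i = j
        elif nxt == "x" and i + 2 < len(body) and body[i + 2] in HEX_DIGITS:
            j = i + 2
            while j < len(body) and body[j] in HEX_DIGITS:
                j += 1  # any number of hex digits
            out.append(chr(int(body[i + 2:j], 16) & 0xFF))  # low 8 bits
            i = j
        else:
            out.append("\\" + nxt)  # unrecognized: backslash is kept
            i += 2
    return "".join(out)


pattern = decode_escapes(r"\d+\.")
```

The value of \verb@pattern@ still contains both backslashes, which is
the behavior that makes such literals convenient for regular
expressions.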

\subsection{Numeric literals}

There are three types of numeric literals: plain integers, long
integers, and floating point numbers.
\index{number}
\index{numeric literal}
\index{integer literal}
\index{plain integer literal}
\index{long integer literal}
\index{floating point literal}
\index{hexadecimal literal}
\index{octal literal}
\index{decimal literal}

Integer and long integer literals are described by the following
lexical definitions:

\begin{verbatim}
longinteger:    integer ("l"|"L")
integer:        decimalinteger | octinteger | hexinteger
decimalinteger: nonzerodigit digit* | "0"
octinteger:     "0" octdigit+
hexinteger:     "0" ("x"|"X") hexdigit+

nonzerodigit:   "1"..."9"
octdigit:       "0"..."7"
hexdigit:       digit|"a"..."f"|"A"..."F"
\end{verbatim}

Although both lower case `l' and upper case `L' are allowed as a suffix
for long integers, it is strongly recommended to always use `L', since
the letter `l' looks too much like the digit `1'.

Plain integer decimal literals must be at most $2^{31} - 1$ (i.e., the
largest positive integer, assuming 32-bit arithmetic).  Plain octal and
hexadecimal literals may be as large as $2^{32} - 1$, but values
larger than $2^{31} - 1$ are converted to a negative value by
subtracting $2^{32}$.  There is no limit for long integer literals.

Some examples of plain and long integer literals:

\begin{verbatim}
7     2147483647                        0177    0x80000000
3L    79228162514264337593543950336L    0377L   0x100000000L
\end{verbatim}
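
The range rules for plain integers can be illustrated with a short
sketch in present-day Python.  The function name
\verb@plain_integer_value@ is hypothetical; the argument is assumed to
be a lexically valid plain integer literal (no \verb@L@ suffix), and
32-bit arithmetic is assumed as in the text.

```python
def plain_integer_value(literal):
    """Return the value of a plain integer literal, assuming 32 bits."""
    if literal[:2] in ("0x", "0X"):
        value, is_decimal = int(literal[2:], 16), False
    elif len(literal) > 1 and literal[0] == "0":
        value, is_decimal = int(literal[1:], 8), False
    else:
        value, is_decimal = int(literal, 10), True
    # Decimal literals stop at 2**31 - 1; octal and hex at 2**32 - 1.
    limit = 2**31 - 1 if is_decimal else 2**32 - 1
    if value > limit:
        raise OverflowError("plain integer literal too large: " + literal)
    if value > 2**31 - 1:
        value -= 2**32  # octal and hex literals wrap to a negative value
    return value


values = [plain_integer_value(s)
          for s in ("7", "2147483647", "0177", "0x80000000")]
```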

Floating point literals are described by the following lexical
definitions:

\begin{verbatim}
floatnumber:    pointfloat | exponentfloat
pointfloat:     [intpart] fraction | intpart "."
exponentfloat:  (intpart | pointfloat) exponent
intpart:        digit+
fraction:       "." digit+
exponent:       ("e"|"E") ["+"|"-"] digit+
\end{verbatim}

The allowed range of floating point literals is
implementation-dependent.

Some examples of floating point literals:

\begin{verbatim}
3.14    10.    .001    1e100    3.14e-10
\end{verbatim}

Note that numeric literals do not include a sign; a phrase like
\verb@-1@ is actually an expression composed of the operator
\verb@-@ and the literal \verb@1@.
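
The floating point definitions can be checked against candidate
strings with a regular expression.  The following sketch in
present-day Python is one possible transcription of the grammar; the
names \verb@FLOATNUMBER@ and \verb@is_float_literal@ are hypothetical.

```python
import re

FLOATNUMBER = re.compile(r"""
    (?: \d* \. \d+      # pointfloat: optional intpart, then fraction
      | \d+ \.          # pointfloat: intpart followed by a point
      | \d+ (?=[eE])    # bare intpart, allowed only before an exponent
    )
    (?: [eE] [+-]? \d+ )?   # optional exponent
""", re.VERBOSE)


def is_float_literal(text):
    """Return True if text is a floatnumber according to the grammar."""
    return FLOATNUMBER.fullmatch(text) is not None


accepted = [s for s in ("3.14", "10.", ".001", "1e100", "3.14e-10", "10", "-1.0")
            if is_float_literal(s)]
```

The last two candidates are rejected: \verb@10@ is an integer literal,
and \verb@-1.0@ is an expression, since the sign is not part of the
literal.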

\section{Operators}

The following tokens are operators:
\index{operators}

\begin{verbatim}
+       -       *       /       %
<<      >>      &       |       ^       ~
<       ==      >       <=      <>      !=      >=
\end{verbatim}

The comparison operators \verb@<>@ and \verb@!=@ are alternate
spellings of the same operator.
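
The longest-match rule stated in the section on other tokens matters
for these operators, since several of them are prefixes of others.  A
minimal sketch in present-day Python, restricted to operator tokens
and using the hypothetical names \verb@OPERATORS@, \verb@BY_LENGTH@
and \verb@split_operators@:

```python
OPERATORS = [
    "+", "-", "*", "/", "%",
    "<<", ">>", "&", "|", "^", "~",
    "<", "==", ">", "<=", "<>", "!=", ">=",
]
# Try longer spellings first so that "<=" wins over "<".
BY_LENGTH = sorted(OPERATORS, key=len, reverse=True)


def split_operators(text):
    """Split a run of operator characters, longest match first."""
    tokens = []
    i = 0
    while i < len(text):
        if text[i] in " \t":
            i += 1  # spaces and tabs only delimit tokens
            continue
        for op in BY_LENGTH:
            if text.startswith(op, i):
                tokens.append(op)
                i += len(op)
                break
        else:
            raise ValueError("no operator at position %d" % i)
    return tokens


adjacent = split_operators("<>>")
separated = split_operators("< >>")
```

Read from left to right, \verb@<>>@ yields \verb@<>@ followed by
\verb@>@, whereas the space in \verb@< >>@ forces \verb@<@ followed
by \verb@>>@.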

\section{Delimiters}

The following tokens serve as delimiters or otherwise have a special
meaning:
\index{delimiters}

\begin{verbatim}
(       )       [       ]       {       }
;       ,       :       .       `       =
\end{verbatim}

The following printing ASCII characters are not used in Python.  Their
occurrence outside string literals and comments is an unconditional
error:
\index{ASCII}

\begin{verbatim}
@       $       "       ?
\end{verbatim}

They may be used by future versions of the language though!
 348
 349 They may be used by future versions of the language though!