Doc/lib/libre.tex

   1 \section{\module{re} ---
   2          Regular expression operations}
   3 \declaremodule{standard}{re}
   4 \moduleauthor{Fredrik Lundh}{fredrik@pythonware.com}
   5 \sectionauthor{Andrew M. Kuchling}{amk@amk.ca}
   6
   7
   8 \modulesynopsis{Regular expression search and match operations with a
   9                 Perl-style expression syntax.}
  10
  11
  12 This module provides regular expression matching operations similar to
  13 those found in Perl.  Regular expression pattern strings may not
  14 contain null bytes, but can specify the null byte using the
  15 \code{\e\var{number}} notation.  Both patterns and strings to be
  16 searched can be Unicode strings as well as 8-bit strings.  The
  17 \module{re} module is always available.
  18
  19 Regular expressions use the backslash character (\character{\e}) to
  20 indicate special forms or to allow special characters to be used
  21 without invoking their special meaning.  This collides with Python's
  22 usage of the same character for the same purpose in string literals;
  23 for example, to match a literal backslash, one might have to write
  24 \code{'\e\e\e\e'} as the pattern string, because the regular expression
  25 must be \samp{\e\e}, and each backslash must be expressed as
  26 \samp{\e\e} inside a regular Python string literal.
  27
  28 The solution is to use Python's raw string notation for regular
  29 expression patterns; backslashes are not handled in any special way in
  30 a string literal prefixed with \character{r}.  So \code{r"\e n"} is a
  31 two-character string containing \character{\e} and \character{n},
  32 while \code{"\e n"} is a one-character string containing a newline.
  33 Usually patterns will be expressed in Python code using this raw
  34 string notation.
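
As a small illustration of the paragraphs above (a sketch only; the
sample subject string and the variable names are made up for this
example), the following compares the doubled-backslash spelling with the
raw-string spelling of the same pattern:

\begin{verbatim}
import re

# Without raw strings, each backslash meant for the RE must be doubled.
plain = '\\\\'       # a Python string holding two backslashes
raw = r'\\'          # the same two-character string, written raw
same = (plain == raw)

# Either spelling matches one literal backslash in the subject string.
m = re.search(raw, 'C:\\temp')
found_at = m.start()           # index of the backslash in the subject

# r"\n" holds a backslash and an 'n'; "\n" holds a single newline.
lengths = (len(r"\n"), len("\n"))
\end{verbatim}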
  35
  36 \begin{seealso}
  37   \seetitle{Mastering Regular Expressions}{Book on regular expressions
  38             by Jeffrey Friedl, published by O'Reilly.  The second
  39             edition of the book no longer covers Python at all,
  40             but the first edition covered writing good regular expression
  41             patterns in great detail.}
  42 \end{seealso}
  43
  44
  45 \subsection{Regular Expression Syntax \label{re-syntax}}
  46
  47 A regular expression (or RE) specifies a set of strings that matches
  48 it; the functions in this module let you check if a particular string
  49 matches a given regular expression (or if a given regular expression
  50 matches a particular string, which comes down to the same thing).
  51
  52 Regular expressions can be concatenated to form new regular
  53 expressions; if \emph{A} and \emph{B} are both regular expressions,
  54 then \emph{AB} is also a regular expression.  In general, if a string
  55 \emph{p} matches \emph{A} and another string \emph{q} matches \emph{B},
  56 the string \emph{pq} will match \emph{AB}.  This holds unless \emph{A} or
  57 \emph{B} contains low-precedence operations, boundary conditions between
  58 \emph{A} and \emph{B}, or numbered group references.  Thus, complex
  59 expressions can easily be constructed from simpler primitive
  60 expressions like the ones described here.  For details of the theory
  61 and implementation of regular expressions, consult the Friedl book
  62 referenced above, or almost any textbook about compiler construction.
  63
  64 A brief explanation of the format of regular expressions follows.  For
  65 further information and a gentler presentation, consult the Regular
  66 Expression HOWTO, accessible from \url{http://www.python.org/doc/howto/}.
  67
  68 Regular expressions can contain both special and ordinary characters.
  69 Most ordinary characters, like \character{A}, \character{a}, or
  70 \character{0}, are the simplest regular expressions; they simply match
  71 themselves.  You can concatenate ordinary characters, so \regexp{last}
  72 matches the string \code{'last'}.  (In the rest of this section, we'll
  73 write REs in \regexp{this special style}, usually without quotes, and
  74 strings to be matched \code{'in single quotes'}.)
  75
  76 Some characters, like \character{|} or \character{(}, are special.
  77 Special characters either stand for classes of ordinary characters, or
  78 affect how the regular expressions around them are interpreted.
  79
  80 The special characters are:
  81
  82 \begin{list}{}{\leftmargin 0.7in \labelwidth 0.65in}
  83
  84 \item[\character{.}] (Dot.)  In the default mode, this matches any
  85 character except a newline.  If the \constant{DOTALL} flag has been
  86 specified, this matches any character including a newline.
  87
  88 \item[\character{\textasciicircum}] (Caret.)  Matches the start of the
  89 string, and in \constant{MULTILINE} mode also matches immediately
  90 after each newline.
  91
  92 \item[\character{\$}] Matches the end of the string or just before the
  93 newline at the end of the string, and in \constant{MULTILINE} mode
  94 also matches before a newline.  \regexp{foo} matches both 'foo' and
  95 'foobar', while the regular expression \regexp{foo\$} matches only
  96 'foo'.  More interestingly, searching for \regexp{foo.\$} in
  97 'foo1\textbackslash nfoo2\textbackslash n' matches 'foo2' normally,
  98 but 'foo1' in \constant{MULTILINE} mode.
  99
 100 \item[\character{*}] Causes the resulting RE to
 101 match 0 or more repetitions of the preceding RE, as many repetitions
 102 as are possible.  \regexp{ab*} will
 103 match 'a', 'ab', or 'a' followed by any number of 'b's.
 104
 105 \item[\character{+}] Causes the
 106 resulting RE to match 1 or more repetitions of the preceding RE.
 107 \regexp{ab+} will match 'a' followed by any non-zero number of 'b's; it
 108 will not match just 'a'.
 109
 110 \item[\character{?}] Causes the resulting RE to
 111 match 0 or 1 repetitions of the preceding RE.  \regexp{ab?} will
 112 match either 'a' or 'ab'.
 113
 114 \item[\code{*?}, \code{+?}, \code{??}] The \character{*},
 115 \character{+}, and \character{?} quantifiers are all \dfn{greedy}; they
 116 match as much text as possible.  Sometimes this behaviour isn't
 117 desired; if the RE \regexp{<.*>} is matched against
 118 \code{'<H1>title</H1>'}, it will match the entire string, and not just
 119 \code{'<H1>'}.  Adding \character{?} after the quantifier makes it
 120 perform the match in \dfn{non-greedy} or \dfn{minimal} fashion; as
 121 \emph{few} characters as possible will be matched.  Using \regexp{.*?}
 122 in the previous expression will match only \code{'<H1>'}.
 123
 124 \item[\code{\{\var{m}\}}]
 125 Specifies that exactly \var{m} copies of the previous RE should be
 126 matched; fewer matches cause the entire RE not to match.  For example,
 127 \regexp{a\{6\}} will match exactly six \character{a} characters, but
 128 not five.
 129
 130 \item[\code{\{\var{m},\var{n}\}}] Causes the resulting RE to match from
 131 \var{m} to \var{n} repetitions of the preceding RE, attempting to
 132 match as many repetitions as possible.  For example, \regexp{a\{3,5\}}
 133 will match from 3 to 5 \character{a} characters.  Omitting \var{m}
 134 specifies a lower bound of zero,
 135 and omitting \var{n} specifies an infinite upper bound.  As an
 136 example, \regexp{a\{4,\}b} will match \code{aaaab} or a thousand
 137 \character{a} characters followed by a \code{b}, but not \code{aaab}.
 138 The comma may not be omitted or the quantifier would be confused with
 139 the previously described form.
 140
 141 \item[\code{\{\var{m},\var{n}\}?}] Causes the resulting RE to
 142 match from \var{m} to \var{n} repetitions of the preceding RE,
 143 attempting to match as \emph{few} repetitions as possible.  This is
 144 the non-greedy version of the previous quantifier.  For example, on the
 145 6-character string \code{'aaaaaa'}, \regexp{a\{3,5\}} will match 5
 146 \character{a} characters, while \regexp{a\{3,5\}?} will only match 3
 147 characters.
 148
 149 \item[\character{\e}] Either escapes special characters (permitting
 150 you to match characters like \character{*}, \character{?}, and so
 151 forth), or signals a special sequence; special sequences are discussed
 152 below.
 153
 154 If you're not using a raw string to
 155 express the pattern, remember that Python also uses the
 156 backslash as an escape sequence in string literals; if the escape
 157 sequence isn't recognized by Python's parser, the backslash and
 158 subsequent character are included in the resulting string.  However,
 159 if Python would recognize the resulting sequence, the backslash must
 160 be doubled.  This is complicated and hard to understand, so
 161 it's highly recommended that you use raw strings for all but the
 162 simplest expressions.
 163
 164 \item[\code{[]}] Used to indicate a set of characters.  Characters can
 165 be listed individually, or a range of characters can be indicated by
 166 giving two characters and separating them by a \character{-}.  Special
 167 characters are not active inside sets.  For example, \regexp{[akm\$]}
 168 will match any of the characters \character{a}, \character{k},
 169 \character{m}, or \character{\$}; \regexp{[a-z]}
 170 will match any lowercase letter, and \code{[a-zA-Z0-9]} matches any
 171 letter or digit.  Character classes such as \code{\e w} or \code{\e S}
 172 (defined below) are also acceptable inside a set.  If you want to
 173 include a \character{]} or a \character{-} inside a set, precede it with a
 174 backslash, or place it as the first character.  The
 175 pattern \regexp{[]]} will match \code{']'}, for example.
 176
 177 You can match the characters not within a set by \dfn{complementing}
 178 the set.  This is indicated by including a
 179 \character{\textasciicircum} as the first character of the set;
 180 \character{\textasciicircum} elsewhere will simply match the
 181 \character{\textasciicircum} character.  For example,
 182 \regexp{[{\textasciicircum}5]} will match
 183 any character except \character{5}, and
 184 \regexp{[\textasciicircum\code{\textasciicircum}]} will match any character
 185 except \character{\textasciicircum}.
 186
 187 \item[\character{|}]\code{A|B}, where A and B can be arbitrary REs,
 188 creates a regular expression that will match either A or B.  An
 189 arbitrary number of REs can be separated by the \character{|} in this
 190 way.  This can be used inside groups (see below) as well.  As the target
 191 string is scanned, REs separated by \character{|} are tried from left to
 192 right. When one pattern completely matches, that branch is accepted.
 193 This means that once \code{A} matches, \code{B} will not be tested further,
 194 even if it would produce a longer overall match.  In other words, the
 195 \character{|} operator is never greedy.  To match a literal \character{|},
 196 use \regexp{\e|}, or enclose it inside a character class, as in \regexp{[|]}.
 197
 198 \item[\code{(...)}] Matches whatever regular expression is inside the
 199 parentheses, and indicates the start and end of a group; the contents
 200 of a group can be retrieved after a match has been performed, and can
 201 be matched later in the string with the \regexp{\e \var{number}} special
 202 sequence, described below.  To match the literals \character{(} or
 203 \character{)}, use \regexp{\e(} or \regexp{\e)}, or enclose them
 204 inside a character class: \regexp{[(] [)]}.
 205
 206 \item[\code{(?...)}] This is an extension notation (a \character{?}
 207 following a \character{(} is not meaningful otherwise).  The first
 208 character after the \character{?}
 209 determines what the meaning and further syntax of the construct is.
 210 Extensions usually do not create a new group;
 211 \regexp{(?P<\var{name}>...)} is the only exception to this rule.
 212 Following are the currently supported extensions.
 213
 214 \item[\code{(?iLmsux)}] (One or more letters from the set \character{i},
 215 \character{L}, \character{m}, \character{s}, \character{u},
 216 \character{x}.)  The group matches the empty string; the letters set
 217 the corresponding flags (\constant{re.I}, \constant{re.L},
 218 \constant{re.M}, \constant{re.S}, \constant{re.U}, \constant{re.X})
 219 for the entire regular expression.  This is useful if you wish to
 220 include the flags as part of the regular expression, instead of
 221 passing a \var{flag} argument to the \function{compile()} function.
 222
 223 Note that the \regexp{(?x)} flag changes how the expression is parsed.
 224 It should be used first in the expression string, or after one or more
 225 whitespace characters.  If there are non-whitespace characters before
 226 the flag, the results are undefined.
 227
 228 \item[\code{(?:...)}] A non-grouping version of regular parentheses.
 229 Matches whatever regular expression is inside the parentheses, but the
 230 substring matched by the
 231 group \emph{cannot} be retrieved after performing a match or
 232 referenced later in the pattern.
 233
 234 \item[\code{(?P<\var{name}>...)}] Similar to regular parentheses, but
 235 the substring matched by the group is accessible via the symbolic group
 236 name \var{name}.  Group names must be valid Python identifiers, and
 237 each group name must be defined only once within a regular expression.  A
 238 symbolic group is also a numbered group, just as if the group were not
 239 named.  So the group named 'id' in the example below can also be
 240 referenced as the numbered group 1.
 241
 242 For example, if the pattern is
 243 \regexp{(?P<id>[a-zA-Z_]\e w*)}, the group can be referenced by its
 244 name in arguments to methods of match objects, such as
 245 \code{m.group('id')} or \code{m.end('id')}, and also by name in
 246 pattern text (for example, \regexp{(?P=id)}) and replacement text
 247 (such as \code{\e g<id>}).
 248
 249 \item[\code{(?P=\var{name})}] Matches whatever text was matched by the
 250 earlier group named \var{name}.
 251
 252 \item[\code{(?\#...)}] A comment; the contents of the parentheses are
 253 simply ignored.
 254
 255 \item[\code{(?=...)}] Matches if \regexp{...} matches next, but doesn't
 256 consume any of the string.  This is called a lookahead assertion.  For
 257 example, \regexp{Isaac (?=Asimov)} will match \code{'Isaac~'} only if it's
 258 followed by \code{'Asimov'}.
 259
 260 \item[\code{(?!...)}] Matches if \regexp{...} doesn't match next.  This
 261 is a negative lookahead assertion.  For example,
 262 \regexp{Isaac (?!Asimov)} will match \code{'Isaac~'} only if it's \emph{not}
 263 followed by \code{'Asimov'}.
 264
 265 \item[\code{(?<=...)}] Matches if the current position in the string
 266 is preceded by a match for \regexp{...} that ends at the current
 267 position.  This is called a \dfn{positive lookbehind assertion}.
 268 \regexp{(?<=abc)def} will find a match in \samp{abcdef}, since the
 269 lookbehind will back up 3 characters and check if the contained
 270 pattern matches.  The contained pattern must only match strings of
 271 some fixed length, meaning that \regexp{abc} or \regexp{a|b} are
 272 allowed, but \regexp{a*} and \regexp{a\{3,4\}} are not.  Note that
 273 patterns which start with positive lookbehind assertions will never
 274 match at the beginning of the string being searched; you will most
 275 likely want to use the \function{search()} function rather than the
 276 \function{match()} function:
 277
 278 \begin{verbatim}
 279 >>> import re
 280 >>> m = re.search('(?<=abc)def', 'abcdef')
 281 >>> m.group(0)
 282 'def'
 283 \end{verbatim}
 284
 285 This example looks for a word following a hyphen:
 286
 287 \begin{verbatim}
 288 >>> m = re.search(r'(?<=-)\w+', 'spam-egg')
 289 >>> m.group(0)
 290 'egg'
 291 \end{verbatim}
 292
 293 \item[\code{(?<!...)}] Matches if the current position in the string
 294 is not preceded by a match for \regexp{...}.  This is called a
 295 \dfn{negative lookbehind assertion}.  Similar to positive lookbehind
 296 assertions, the contained pattern must only match strings of some
 297 fixed length.  Patterns which start with negative lookbehind
 298 assertions may match at the beginning of the string being searched.
 299
 300 \end{list}
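
The greedy and non-greedy repetition forms and the left-to-right
behaviour of \character{|} described in the list above can be checked
with a short sketch.  The variable names are chosen only for this
illustration; the patterns and subject strings are the ones used in the
descriptions, plus \regexp{a|ab} as an assumed extra example:

\begin{verbatim}
import re

text = '<H1>title</H1>'

# Greedy: .* takes as much as it can, so the whole string is matched.
greedy = re.match(r'<.*>', text).group(0)
# Non-greedy: .*? stops at the first '>' that lets the RE succeed.
minimal = re.match(r'<.*?>', text).group(0)

# {m,n} is greedy as well; {m,n}? takes as few repetitions as allowed.
most = re.match(r'a{3,5}', 'aaaaaa').group(0)
fewest = re.match(r'a{3,5}?', 'aaaaaa').group(0)

# Alternation keeps the first branch that succeeds, even though the
# second branch would have matched more text.
first = re.match(r'a|ab', 'ab').group(0)
\end{verbatim}

Here \code{greedy} is the entire string while \code{minimal} is
\code{'<H1>'}; \code{most} holds five characters and \code{fewest}
three; and \code{first} is \code{'a'} rather than \code{'ab'}.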
 301
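The grouping extensions can be exercised in the same way.  The
following is a minimal sketch, assuming made-up subject strings such as
\code{'spam=spam'}; it shows that a named group is also a numbered
group, that \regexp{(?:...)} does not create a group, and that lookahead
assertions consume no text:

\begin{verbatim}
import re

# A named group, referenced again later in the pattern with (?P=id).
m = re.match(r'(?P<id>[a-zA-Z_]\w*)=(?P=id)', 'spam=spam')
by_name = m.group('id')
by_number = m.group(1)       # the same group, by its number
end_of_id = m.end('id')

# (?:...) groups without capturing, so only (c) is a numbered group.
n = re.match(r'(?:ab)+(c)', 'ababc')
captured = n.groups()

# Lookahead checks what follows but leaves it out of the match.
ahead = re.match(r'Isaac (?=Asimov)', 'Isaac Asimov')
matched_span = ahead.span()
not_ahead = re.match(r'Isaac (?!Asimov)', 'Isaac Asimov')   # no match
\end{verbatim}
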
 302 The special sequences consist of \character{\e} and a character from the
 303 list below.  If the ordinary character is not on the list, then the
 304 resulting RE will match the second character.  For example,
 305 \regexp{\e\$} matches the character \character{\$}.
 306
 307 \begin{list}{}{\leftmargin 0.7in \labelwidth 0.65in}
 308
 309 \item[\code{\e \var{number}}] Matches the contents of the group of the
 310 same number.  Groups are numbered starting from 1.  For example,
 311 \regexp{(.+) \e 1} matches \code{'the the'} or \code{'55 55'}, but not
 312 \code{'the end'} (note
 313 the space after the group).  This special sequence can only be used to
 314 match one of the first 99 groups.  If the first digit of \var{number}
 315 is 0, or \var{number} is 3 octal digits long, it will not be interpreted
 316 as a group match, but as the character with octal value \var{number}.
 317 Inside the \character{[} and \character{]} of a character class, all numeric
 318 escapes are treated as characters.
 319
 320 \item[\code{\e A}] Matches only at the start of the string.
 321
 322 \item[\code{\e b}] Matches the empty string, but only at the
 323 beginning or end of a word.  A word is defined as a sequence of
 324 alphanumeric or underscore characters, so the end of a word is indicated by
 325 whitespace or a non-alphanumeric, non-underscore character.  Note that
 326 {}\code{\e b} is defined as the boundary between \code{\e w} and \code{\e
 327 W}, so the precise set of characters deemed to be alphanumeric depends on the
 328 values of the \code{UNICODE} and \code{LOCALE} flags.  Inside a character
 329 range, \regexp{\e b} represents the backspace character, for compatibility
 330 with Python's string literals.
 331
 332 \item[\code{\e B}] Matches the empty string, but only when it is \emph{not}
 333 at the beginning or end of a word.  This is just the opposite of {}\code{\e
 334 b}, so is also subject to the settings of \code{LOCALE} and \code{UNICODE}.
 335
 336 \item[\code{\e d}]Matches any decimal digit; this is
 337 equivalent to the set \regexp{[0-9]}.
 338
 339 \item[\code{\e D}]Matches any non-digit character; this is
 340 equivalent to the set \regexp{[{\textasciicircum}0-9]}.
 341
 342 \item[\code{\e s}]Matches any whitespace character; this is
 343 equivalent to the set \regexp{[ \e t\e n\e r\e f\e v]}.
 344
 345 \item[\code{\e S}]Matches any non-whitespace character; this is
 346 equivalent to the set \regexp{[\textasciicircum\ \e t\e n\e r\e f\e v]}.
 347
 348 \item[\code{\e w}]When the \constant{LOCALE} and \constant{UNICODE}
 349 flags are not specified, matches any alphanumeric character and the
 350 underscore; this is equivalent to the set
 351 \regexp{[a-zA-Z0-9_]}.  With \constant{LOCALE}, it will match the set
 352 \regexp{[0-9_]} plus whatever characters are defined as alphanumeric for
 353 the current locale.  If \constant{UNICODE} is set, this will match the
 354 characters \regexp{[0-9_]} plus whatever is classified as alphanumeric
 355 in the Unicode character properties database.
 356
 357 \item[\code{\e W}]When the \constant{LOCALE} and \constant{UNICODE}
 358 flags are not specified, matches any non-alphanumeric character; this
 359 is equivalent to the set \regexp{[{\textasciicircum}a-zA-Z0-9_]}.   With
 360 \constant{LOCALE}, it will match any character not in the set
 361 \regexp{[0-9_]}, and not defined as alphanumeric for the current locale.
 362 If \constant{UNICODE} is set, this will match anything other than
 363 \regexp{[0-9_]} and characters marked as alphanumeric in the Unicode
 364 character properties database.
 365
 366 \item[\code{\e Z}]Matches only at the end of the string.
 367
 368 \end{list}
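
A few of the special sequences above can be tried out as follows.  This
is only a sketch with invented subject strings; it uses compiled
pattern objects where a flag is needed:

\begin{verbatim}
import re

# \1 must repeat exactly what group 1 matched.
twice = re.match(r'(.+) \1', 'the the')
differ = re.match(r'(.+) \1', 'the end')        # no match

# \b matches only at a word boundary, so the 'cat' inside
# 'concatenate' is skipped.
words = re.findall(r'\bcat\b', 'cat concatenate cat.')

# In MULTILINE mode ^ also matches after each newline, whereas \A
# still matches only at the start of the whole string.
lines = 'one\ntwo'
caret = re.compile(r'^\w+', re.M).findall(lines)
start = re.compile(r'\A\w+', re.M).findall(lines)
\end{verbatim}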
 369
 370 Most of the standard escapes supported by Python string literals are
 371 also accepted by the regular expression parser:
 372
 373 \begin{verbatim}
 374 \a      \b      \f      \n
 375 \r      \t      \v      \x
 376 \\
 377 \end{verbatim}
 378
 379 Octal escapes are included in a limited form: if the first digit is a
 380 0, or if there are three octal digits, it is considered an octal
 381 escape. Otherwise, it is a group reference.
 382
 383
 384 % Note the lack of a period in the section title; it causes problems
 385 % with readers of the GNU info version.  See http://www.python.org/sf/581414.
 386 \subsection{Matching vs Searching \label{matching-searching}}
 387 \sectionauthor{Fred L. Drake, Jr.}{fdrake@acm.org}
 388
 389 Python offers two different primitive operations based on regular
 390 expressions: match and search.  If you are accustomed to Perl's
 391 semantics, the search operation is what you're looking for.  See the
 392 \function{search()} function and corresponding method of compiled
 393 regular expression objects.
 394
 395 Note that match may differ from search using a regular expression
 396 beginning with \character{\textasciicircum}:
 397 \character{\textasciicircum} matches only at the
 398 start of the string, or in \constant{MULTILINE} mode also immediately
 399 following a newline.  The ``match'' operation succeeds only if the
 400 pattern matches at the start of the string regardless of mode, or at
 401 the starting position given by the optional \var{pos} argument
 402 regardless of whether a newline precedes it.
 403
 404 % Examples from Tim Peters:
 405 \begin{verbatim}
 406 re.compile("a").match("ba", 1)           # succeeds
 407 re.compile("^a").search("ba", 1)         # fails; 'a' not at start
 408 re.compile("^a").search("\na", 1)        # fails; 'a' not at start
 409 re.compile("^a", re.M).search("\na", 1)  # succeeds
 410 re.compile("^a", re.M).search("ba", 1)   # fails; no preceding \n
 411 \end{verbatim}
 412
 413
 414 \subsection{Module Contents}
 415 \nodename{Contents of Module re}
 416
 417 The module defines the following functions and constants, and an exception:
 418
 419
 420 \begin{funcdesc}{compile}{pattern\optional{, flags}}
 421   Compile a regular expression pattern into a regular expression
 422   object, which can be used for matching using its \function{match()} and
 423   \function{search()} methods, described below.
 424
 425   The expression's behaviour can be modified by specifying a
 426   \var{flags} value.  Values can be any of the following variables,
 427   combined using bitwise OR (the \code{|} operator).
 428
 429 The sequence
 430
 431 \begin{verbatim}
 432 prog = re.compile(pat)
 433 result = prog.match(str)
 434 \end{verbatim}
 435
 436 is equivalent to
 437
 438 \begin{verbatim}
 439 result = re.match(pat, str)
 440 \end{verbatim}
 441
 442 but the version using \function{compile()} is more efficient when the
 443 expression will be used several times in a single program.
 444 %(The compiled version of the last pattern passed to
 445 %\function{re.match()} or \function{re.search()} is cached, so
 446 %programs that use only a single regular expression at a time needn't
 447 %worry about compiling regular expressions.)
 448 \end{funcdesc}
 449
 450 \begin{datadesc}{I}
 451 \dataline{IGNORECASE}
 452 Perform case-insensitive matching; expressions like \regexp{[A-Z]}
 453 will match lowercase letters, too.  This is not affected by the
 454 current locale.
 455 \end{datadesc}
 456
 457 \begin{datadesc}{L}
 458 \dataline{LOCALE}
 459 Make \regexp{\e w}, \regexp{\e W}, \regexp{\e b}, and
 460 \regexp{\e B} dependent on the current locale.
 461 \end{datadesc}
 462
 463 \begin{datadesc}{M}
 464 \dataline{MULTILINE}
 465 When specified, the pattern character \character{\textasciicircum}
 466 matches at the beginning of the string and at the beginning of each
 467 line (immediately following each newline); and the pattern character
 468 \character{\$} matches at the end of the string and at the end of each
 469 line (immediately preceding each newline).  By default,
 470 \character{\textasciicircum} matches only at the beginning of the
 471 string, and \character{\$} only at the end of the string and
 472 immediately before the newline (if any) at the end of the string.
 473 \end{datadesc}
 474
 475 \begin{datadesc}{S}
 476 \dataline{DOTALL}
 477 Make the \character{.} special character match any character at all,
 478 including a newline; without this flag, \character{.} will match
 479 anything \emph{except} a newline.
 480 \end{datadesc}
 481
 482 \begin{datadesc}{U}
 483 \dataline{UNICODE}
 484 Make \regexp{\e w}, \regexp{\e W}, \regexp{\e b}, and
 485 \regexp{\e B} dependent on the Unicode character properties database.
 486 \versionadded{2.0}
 487 \end{datadesc}
 488
 489 \begin{datadesc}{X}
 490 \dataline{VERBOSE}
 491 This flag allows you to write regular expressions that look nicer.
 492 Whitespace within the pattern is ignored,
 493 except when in a character class or preceded by an unescaped
 494 backslash, and, when a line contains a \character{\#} neither in a
 495 character class nor preceded by an unescaped backslash, all characters
 496 from the leftmost such \character{\#} through the end of the line are
 497 ignored.
 498 % XXX should add an example here
 499 \end{datadesc}
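
As an illustration of \constant{VERBOSE}, and of combining flags with
bitwise OR, consider the following sketch.  The patterns (a simple
decimal number and a run of letters) and the variable names are
assumptions made for this example only:

\begin{verbatim}
import re

# A pattern for a simple decimal number, written with comments.
number = re.compile(r"""
    \d+        # integer part
    \.         # literal decimal point
    \d*        # optional fractional digits
    """, re.VERBOSE)

# The same pattern written compactly, without VERBOSE.
compact = re.compile(r'\d+\.\d*')

# Flags are combined with the | operator.
word = re.compile(r"""
    [a-z]+     # letters; IGNORECASE lets this accept capitals too
    """, re.VERBOSE | re.IGNORECASE)

verbose_result = number.match('3.14').group(0)
compact_result = compact.match('3.14').group(0)
word_result = word.match('Hello').group(0)
\end{verbatim}

Both \code{number} and \code{compact} match \code{'3.14'} in full, and
\code{word} matches \code{'Hello'} despite the lowercase range.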
 500
 501
 502 \begin{funcdesc}{search}{pattern, string\optional{, flags}}
 503   Scan through \var{string} looking for a location where the regular
 504   expression \var{pattern} produces a match, and return a
 505   corresponding \class{MatchObject} instance.
 506   Return \code{None} if no
 507   position in the string matches the pattern; note that this is
 508   different from finding a zero-length match at some point in the string.
 509 \end{funcdesc}
 510
 511 \begin{funcdesc}{match}{pattern, string\optional{, flags}}
 512   If zero or more characters at the beginning of \var{string} match
 513   the regular expression \var{pattern}, return a corresponding
 514   \class{MatchObject} instance.  Return \code{None} if the string does not
 515   match the pattern; note that this is different from a zero-length
 516   match.
 517
 518   \note{If you want to locate a match anywhere in
 519   \var{string}, use \method{search()} instead.}
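
  The difference between \function{match()} and \function{search()},
  and between ``no match'' and a zero-length match, can be seen in this
  short sketch (the subject string \code{'abc'} is just an example):

\begin{verbatim}
import re

at_start = re.match(r'b', 'abc')      # None: 'b' is not at the start
anywhere = re.search(r'b', 'abc')     # a match object
where = anywhere.start()

# A zero-length match is still a match object, not None.
empty = re.match(r'x*', 'abc')
empty_text = empty.group(0)
\end{verbatim}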
 520 \end{funcdesc}
 521
 522 \begin{funcdesc}{split}{pattern, string\optional{, maxsplit\code{ = 0}}}
 523   Split \var{string} by the occurrences of \var{pattern}.  If
 524   capturing parentheses are used in \var{pattern}, then the text of all
 525   groups in the pattern is also returned as part of the resulting list.
 526   If \var{maxsplit} is nonzero, at most \var{maxsplit} splits
 527   occur, and the remainder of the string is returned as the final
 528   element of the list.  (Incompatibility note: in the original Python
 529   1.5 release, \var{maxsplit} was ignored.  This has been fixed in
 530   later releases.)
 531
 532 \begin{verbatim}
 533 >>> re.split(r'\W+', 'Words, words, words.')
 534 ['Words', 'words', 'words', '']
 535 >>> re.split(r'(\W+)', 'Words, words, words.')
 536 ['Words', ', ', 'words', ', ', 'words', '.', '']
 537 >>> re.split(r'\W+', 'Words, words, words.', 1)
 538 ['Words', 'words, words.']
 539 \end{verbatim}
 540
 541   This function combines and extends the functionality of
 542   the old \function{regsub.split()} and \function{regsub.splitx()}.
 543 \end{funcdesc}
 544
 545 \begin{funcdesc}{findall}{pattern, string}
 546   Return a list of all non-overlapping matches of \var{pattern} in
 547   \var{string}.  If one or more groups are present in the pattern,
 548   return a list of groups; this will be a list of tuples if the
 549   pattern has more than one group.  Empty matches are included in the
 550   result unless they touch the beginning of another match.
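
  How the shape of the result depends on the number of groups can be
  sketched as follows; the subject string is invented for the
  illustration:

\begin{verbatim}
import re

text = 'x=1, y=22, z=333'

# No groups: the list holds the matched substrings.
whole = re.findall(r'\w=\d+', text)
# One group: the list holds that group's text.
one = re.findall(r'\w=(\d+)', text)
# More than one group: the list holds one tuple per match.
pairs = re.findall(r'(\w)=(\d+)', text)
\end{verbatim}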
 551   \versionadded{1.5.2}
 552 \end{funcdesc}
 553
 554 \begin{funcdesc}{finditer}{pattern, string}
 555   Return an iterator over all non-overlapping matches for the RE
 556   \var{pattern} in \var{string}.  For each match, the iterator returns
 557   a match object.  Empty matches are included in the result unless they
 558   touch the beginning of another match.
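
  A minimal sketch of iterating over the match objects, using a
  made-up subject string, might look like this:

\begin{verbatim}
import re

spans = []
for m in re.finditer(r'\d+', 'a1b22c333'):
    # Each m is a match object, so positions are available too.
    spans.append((m.group(0), m.start(), m.end()))
\end{verbatim}

  Afterwards \code{spans} holds one tuple per run of digits, giving the
  matched text together with its start and end offsets.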
 559   \versionadded{2.2}
 560 \end{funcdesc}
 561
 562 \begin{funcdesc}{sub}{pattern, repl, string\optional{, count}}
 563   Return the string obtained by replacing the leftmost non-overlapping
 564   occurrences of \var{pattern} in \var{string} by the replacement
 565   \var{repl}.  If the pattern isn't found, \var{string} is returned
 566   unchanged.  \var{repl} can be a string or a function; if it is a
 567   string, any backslash escapes in it are processed.  That is,
 568   \samp{\e n} is converted to a single newline character, \samp{\e r}
 569   is converted to a carriage return, and so forth.  Unknown escapes such as
 570   \samp{\e j} are left alone.  Backreferences, such as \samp{\e6}, are
 571   replaced with the substring matched by group 6 in the pattern.  For
 572   example:
 573
 574 \begin{verbatim}
 575 >>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
 576 ...        r'static PyObject*\npy_\1(void)\n{',
 577 ...        'def myfunc():')
 578 'static PyObject*\npy_myfunc(void)\n{'
 579 \end{verbatim}
 580
 581   If \var{repl} is a function, it is called for every non-overlapping
 582   occurrence of \var{pattern}.  The function takes a single match
 583   object argument, and returns the replacement string.  For example:
 584
 585 \begin{verbatim}
 586 >>> def dashrepl(matchobj):
 587 ...     if matchobj.group(0) == '-': return ' '
 588 ...     else: return '-'
 589 >>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
 590 'pro--gram files'
 591 \end{verbatim}
 592
 593   The pattern may be a string or an RE object; if you need to specify
 594   regular expression flags, you must use an RE object, or use embedded
 595   modifiers in a pattern; for example, \samp{sub("(?i)b+", "x", "bbbb
 596   BBBB")} returns \code{'x x'}.
 597
 598   The optional argument \var{count} is the maximum number of pattern
 599   occurrences to be replaced; \var{count} must be a non-negative
 600   integer.  If omitted or zero, all occurrences will be replaced.
 601   Empty matches for the pattern are replaced only when not adjacent to
 602   a previous match, so \samp{sub('x*', '-', 'abc')} returns
 603   \code{'-a-b-c-'}.
 604
 605   In addition to character escapes and backreferences as described
 606   above, \samp{\e g<name>} will use the substring matched by the group
 607   named \samp{name}, as defined by the \regexp{(?P<name>...)} syntax.
 608   \samp{\e g<number>} uses the corresponding group number;
 609   \samp{\e g<2>} is therefore equivalent to \samp{\e 2}, but isn't
 610   ambiguous in a replacement such as \samp{\e g<2>0}.  \samp{\e 20}
 611   would be interpreted as a reference to group 20, not a reference to
 612   group 2 followed by the literal character \character{0}.  The
 613   backreference \samp{\e g<0>} substitutes in the entire substring
 614   matched by the RE.
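
  The \samp{\e g<...>} forms and the \var{count} argument can be
  illustrated with a short sketch.  The group names \code{key} and
  \code{val} and the subject strings are assumptions made for this
  example:

\begin{verbatim}
import re

# Swap 'key=value' into 'value=key' using named groups.
swapped = re.sub(r'(?P<key>\w+)=(?P<val>\w+)',
                 r'\g<val>=\g<key>',
                 'a=1 b=2')

# \g<1>0 is group 1 followed by a literal '0'.
padded = re.sub(r'(\d)', r'\g<1>0', 'a1b2')

# \g<0> stands for the entire matched substring.
quoted = re.sub(r'\d+', r'<\g<0>>', 'a1b22')

# count limits how many occurrences are replaced.
once = re.sub(r'-', '+', 'a-b-c', count=1)
\end{verbatim}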
 615 \end{funcdesc}
 616
 617 \begin{funcdesc}{subn}{pattern, repl, string\optional{, count}}
 618   Perform the same operation as \function{sub()}, but return a tuple
 619   \code{(\var{new_string}, \var{number_of_subs_made})}.
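
  For instance, collapsing runs of whitespace in a made-up string (a
  sketch only):

\begin{verbatim}
import re

result = re.subn(r'\s+', ' ', 'a  b \t c')
new_string, number_of_subs_made = result
\end{verbatim}

  Here \code{result} is the tuple \code{('a b c', 2)}.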
 620 \end{funcdesc}
 621
 622 \begin{funcdesc}{escape}{string}
 623   Return \var{string} with all non-alphanumerics backslashed; this is
 624   useful if you want to match an arbitrary literal string that may have
 625   regular expression metacharacters in it.
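
  The effect can be sketched as follows, assuming a made-up literal
  string that happens to contain the metacharacters \character{.} and
  \character{*}:

\begin{verbatim}
import re

literal = 'a.b*c'
subject = 'a.b*c axbbc'

# Used directly, the string is read as an RE: 'a', any character,
# zero or more 'b's, then 'c'.
as_regex = re.findall(literal, subject)

# After escaping, only the literal text itself is matched.
as_literal = re.findall(re.escape(literal), subject)
\end{verbatim}

  Used as an RE, the string matches \code{'axbbc'} but not itself;
  once escaped, it matches only the literal \code{'a.b*c'}.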
 626 \end{funcdesc}
 627
 628 \begin{excdesc}{error}
 629   Exception raised when a string passed to one of the functions here
 630   is not a valid regular expression (for example, it might contain
 631   unmatched parentheses) or when some other error occurs during
 632   compilation or matching.  It is never an error if a string contains
 633   no match for a pattern.
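
  A minimal sketch of handling the exception, using a deliberately
  malformed pattern invented for this example:

\begin{verbatim}
import re

try:
    re.compile(r'(unclosed')       # unmatched parenthesis
    compiled = True
except re.error:
    compiled = False

# A pattern that simply fails to match raises nothing.
no_match = re.search(r'x', 'abc')
\end{verbatim}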
 634 \end{excdesc}
 635
 636
 637 \subsection{Regular Expression Objects \label{re-objects}}
 638
 639 Compiled regular expression objects support the following methods and
 640 attributes:
 641
 642 \begin{methoddesc}[RegexObject]{match}{string\optional{, pos\optional{,
 643                                        endpos}}}
 644   If zero or more characters at the beginning of \var{string} match
 645   this regular expression, return a corresponding
 646   \class{MatchObject} instance.  Return \code{None} if the string does not
 647   match the pattern; note that this is different from a zero-length
 648   match.
 649
 650   \note{If you want to locate a match anywhere in
 651   \var{string}, use \method{search()} instead.}
 652
 653   The optional second parameter \var{pos} gives an index in the string
 654   where the search is to start; it defaults to \code{0}.  This is not
 655   completely equivalent to slicing the string; the
 656   \code{'\textasciicircum'} pattern
 657   character matches at the real beginning of the string and at positions
 658   just after a newline, but not necessarily at the index where the search
 659   is to start.
 660
 661   The optional parameter \var{endpos} limits how far the string will
 662   be searched; it will be as if the string is \var{endpos} characters
 663   long, so only the characters from \var{pos} to \code{\var{endpos} -
 664   1} will be searched for a match.  If \var{endpos} is less than
 665   \var{pos}, no match will be found; otherwise, if \var{rx} is a
 666   compiled regular expression object,
 667   \code{\var{rx}.match(\var{string}, 0, 50)} is equivalent to
 668   \code{\var{rx}.match(\var{string}[:50], 0)}.
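
  The difference between \var{pos} and slicing, and the effect of
  \var{endpos}, can be sketched as follows (the patterns and strings
  are illustrative assumptions):

```python
import re

anchored = re.compile(r'^b')
s = 'ab'

# Slicing creates a new string, so '^' matches at its beginning.
sliced = anchored.match(s[1:])
# With pos=1, index 1 is not the real beginning of the string.
offset = anchored.match(s, 1)

digits = re.compile(r'\d+')
t = '1234567'
# endpos=3 behaves as if the string were only 3 characters long.
limited = digits.match(t, 0, 3)
same = digits.match(t[:3], 0)
# endpos less than pos: no match is found.
empty = digits.match(t, 4, 2)
```

  \code{sliced} matches \code{'b'} while \code{offset} is \code{None};
  \code{limited} and \code{same} both match \code{'123'}, and
  \code{empty} is \code{None}.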
 669 \end{methoddesc}
 670
 671 \begin{methoddesc}[RegexObject]{search}{string\optional{, pos\optional{,
 672                                         endpos}}}
 673   Scan through \var{string} looking for a location where this regular
 674   expression produces a match, and return a
 675   corresponding \class{MatchObject} instance.  Return \code{None} if no
 676   position in the string matches the pattern; note that this is
 677   different from finding a zero-length match at some point in the string.
 678
 679   The optional \var{pos} and \var{endpos} parameters have the same
 680   meaning as for the \method{match()} method.
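
  As a sketch of the distinction between \code{None} and a zero-length
  match, and between \method{match()} and \method{search()} (sample
  patterns and strings are assumptions chosen for illustration):

```python
import re

star = re.compile(r'x*')
plus = re.compile(r'x+')

# 'x*' can match the empty string, so this is a zero-length match.
zero = star.match('abc')
# 'x+' needs at least one 'x' at the start, so match() returns None.
missing = plus.match('abcxx')
# search() scans forward and finds the run of 'x' characters.
found = plus.search('abcxx')
```

  \code{zero.span()} is \code{(0, 0)}, \code{missing} is \code{None},
  and \code{found.span()} is \code{(3, 5)}.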
 681 \end{methoddesc}
 682
 683 \begin{methoddesc}[RegexObject]{split}{string\optional{,
 684                                        maxsplit\code{ = 0}}}
 685 Identical to the \function{split()} function, using the compiled pattern.
 686 \end{methoddesc}
 687
 688 \begin{methoddesc}[RegexObject]{findall}{string}
 689 Identical to the \function{findall()} function, using the compiled pattern.
 690 \end{methoddesc}
 691
 692 \begin{methoddesc}[RegexObject]{finditer}{string}
 693 Identical to the \function{finditer()} function, using the compiled pattern.
 694 \end{methoddesc}
 695
 696 \begin{methoddesc}[RegexObject]{sub}{repl, string\optional{, count\code{ = 0}}}
 697 Identical to the \function{sub()} function, using the compiled pattern.
 698 \end{methoddesc}
 699
 700 \begin{methoddesc}[RegexObject]{subn}{repl, string\optional{,
 701                                       count\code{ = 0}}}
 702 Identical to the \function{subn()} function, using the compiled pattern.
 703 \end{methoddesc}
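
The correspondence between these methods and the module-level
functions can be sketched as follows; the separator pattern and input
string are hypothetical:

```python
import re

pattern = r'\s*,\s*'
rx = re.compile(pattern)
data = 'a , b,c'

# Each method gives the same result as the function called with the pattern.
parts = rx.split(data)
same_parts = re.split(pattern, data)
joined = rx.sub(';', data)
separators = rx.findall(data)
spans = [m.span() for m in rx.finditer(data)]
```

Here \code{parts} and \code{same_parts} are both
\code{['a', 'b', 'c']}, \code{joined} is \code{'a;b;c'},
\code{separators} is \code{[' , ', ',']}, and \code{spans} is
\code{[(1, 4), (5, 6)]}.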
 704
 705
 706 \begin{memberdesc}[RegexObject]{flags}
 707 The flags argument used when the RE object was compiled, or
 708 \code{0} if no flags were provided.
 709 \end{memberdesc}
 710
 711 \begin{memberdesc}[RegexObject]{groupindex}
 712 A dictionary mapping any symbolic group names defined by
 713 \regexp{(?P<\var{id}>)} to group numbers.  The dictionary is empty if no
 714 symbolic groups were used in the pattern.
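
A brief sketch, assuming a pattern with two named groups and one
unnamed group (the group names are made up for this example):

```python
import re

rx = re.compile(r'(?P<int>\d+)\.(?P<frac>\d*)(x)?')
# Copy into a plain dict for comparison; group 3 has no name.
index = dict(rx.groupindex)

plain = re.compile(r'(\d+)\.(\d*)')
no_names = dict(plain.groupindex)
```

\code{index} is \code{\{'int': 1, 'frac': 2\}} and \code{no_names} is
an empty dictionary.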
 715 \end{memberdesc}
 716
 717 \begin{memberdesc}[RegexObject]{pattern}
 718 The pattern string from which the RE object was compiled.
 719 \end{memberdesc}
 720
 721
 722 \subsection{Match Objects \label{match-objects}}
 723
 724 \class{MatchObject} instances support the following methods and
 725 attributes:
 726
 727 \begin{methoddesc}[MatchObject]{expand}{template}
 728  Return the string obtained by doing backslash substitution on the
 729 template string \var{template}, as done by the \method{sub()} method.
 730 Escapes such as \samp{\e n} are converted to the appropriate
 731 characters, and numeric backreferences (\samp{\e 1}, \samp{\e 2}) and
 732 named backreferences (\samp{\e g<1>}, \samp{\e g<name>}) are replaced
 733 by the contents of the corresponding group.
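
A minimal sketch of template expansion; the name being matched and
the template are illustrative only:

```python
import re

m = re.match(r'(?P<first>\w+) (\w+)', 'Isaac Newton')
# \2 and \g<first> are backreferences; \n becomes a newline character.
expanded = m.expand(r'\2, \g<first>\n')
```

The result \code{expanded} is \code{'Newton, Isaac'} followed by a
newline character.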
 734 \end{methoddesc}
 735
 736 \begin{methoddesc}[MatchObject]{group}{\optional{group1, \moreargs}}
 737 Return one or more subgroups of the match.  If there is a single
 738 argument, the result is a single string; if there are
 739 multiple arguments, the result is a tuple with one item per argument.
 740 Without arguments, \var{group1} defaults to zero (the whole match
 741 is returned).
 742 If a \var{groupN} argument is zero, the corresponding return value is the
 743 entire matching string; if it is in the inclusive range [1..99], it is
 744 the string matching the corresponding parenthesized group.  If a
 745 group number is negative or larger than the number of groups defined
 746 in the pattern, an \exception{IndexError} exception is raised.
 747 If a group is contained in a part of the pattern that did not match,
 748 the corresponding result is \code{None}.  If a group is contained in a
 749 part of the pattern that matched multiple times, the last match is
 750 returned.
 751
 752 If the regular expression uses the \regexp{(?P<\var{name}>...)} syntax,
 753 the \var{groupN} arguments may also be strings identifying groups by
 754 their group name.  If a string argument is not used as a group name in
 755 the pattern, an \exception{IndexError} exception is raised.
 756
 757 A moderately complicated example:
 758
 759 \begin{verbatim}
 760 m = re.match(r"(?P<int>\d+)\.(\d*)", '3.14')
 761 \end{verbatim}
 762
 763 After performing this match, \code{m.group(1)} is \code{'3'}, as is
 764 \code{m.group('int')}, and \code{m.group(2)} is \code{'14'}.
 765 \end{methoddesc}
 766
 767 \begin{methoddesc}[MatchObject]{groups}{\optional{default}}
 768 Return a tuple containing all the subgroups of the match, from 1 up to
 769 however many groups are in the pattern.  The \var{default} argument is
 770 used for groups that did not participate in the match; it defaults to
 771 \code{None}.  (Incompatibility note: in the original Python 1.5
 772 release, if the tuple was one element long, a string would be returned
 773 instead.  In later versions (from 1.5.1 on), a singleton tuple is
 774 returned in such cases.)
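
The effect of \var{default} can be sketched with a pattern whose
second group is optional (the pattern and input are assumptions for
illustration):

```python
import re

# The second group does not participate when there is no fractional part.
m = re.match(r'(\d+)\.?(\d+)?', '24')
without_default = m.groups()
with_default = m.groups('0')
```

\code{without_default} is \code{('24', None)} and
\code{with_default} is \code{('24', '0')}.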
 775 \end{methoddesc}
 776
 777 \begin{methoddesc}[MatchObject]{groupdict}{\optional{default}}
 778 Return a dictionary containing all the \emph{named} subgroups of the
 779 match, keyed by the subgroup name.  The \var{default} argument is
 780 used for groups that did not participate in the match; it defaults to
 781 \code{None}.
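
A short sketch, using hypothetical group names and an optional
trailing group that does not take part in the match:

```python
import re

m = re.match(r'(?P<first>\w+) (?P<last>\w+)(?: (?P<suffix>\w+))?',
             'Ada Lovelace')
named = m.groupdict()
filled = m.groupdict('')
```

\code{named['suffix']} is \code{None}, whereas
\code{filled['suffix']} is the empty string; the \code{'first'} and
\code{'last'} entries are \code{'Ada'} and \code{'Lovelace'} in both.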
 782 \end{methoddesc}
 783
 784 \begin{methoddesc}[MatchObject]{start}{\optional{group}}
 785 \methodline{end}{\optional{group}}
 786 Return the indices of the start and end of the substring
 787 matched by \var{group}; \var{group} defaults to zero (meaning the whole
 788 matched substring).
 789 Return \code{-1} if \var{group} exists but
 790 did not contribute to the match.  For a match object
 791 \var{m}, and a group \var{g} that did contribute to the match, the
 792 substring matched by group \var{g} (equivalent to
 793 \code{\var{m}.group(\var{g})}) is
 794
 795 \begin{verbatim}
 796 m.string[m.start(g):m.end(g)]
 797 \end{verbatim}
 798
 799 Note that
 800 \code{m.start(\var{group})} will equal \code{m.end(\var{group})} if
 801 \var{group} matched a null string.  For example, after \code{\var{m} =
 802 re.search('b(c?)', 'cba')}, \code{\var{m}.start(0)} is 1,
 803 \code{\var{m}.end(0)} is 2, \code{\var{m}.start(1)} and
 804 \code{\var{m}.end(1)} are both 2, and \code{\var{m}.start(2)} raises
 805 an \exception{IndexError} exception.
 806 \end{methoddesc}
 807
 808 \begin{methoddesc}[MatchObject]{span}{\optional{group}}
 809 For \class{MatchObject} \var{m}, return the 2-tuple
 810 \code{(\var{m}.start(\var{group}), \var{m}.end(\var{group}))}.
 811 Note that if \var{group} did not contribute to the match, this is
 812 \code{(-1, -1)}.  Again, \var{group} defaults to zero.
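
The \code{(-1, -1)} case can be sketched with an alternation in which
only one branch matches (the pattern is chosen for illustration):

```python
import re

m = re.match(r'(a)|(b)', 'b')
whole = m.span()        # group defaults to zero
unused = m.span(1)      # group 1 did not contribute to the match
used = m.span(2)
# Slicing with start()/end() recovers the text of a contributing group.
text = m.string[m.start(2):m.end(2)]
```

Here \code{whole} and \code{used} are \code{(0, 1)}, \code{unused} is
\code{(-1, -1)}, and \code{text} is \code{'b'}.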
 813 \end{methoddesc}
 814
 815 \begin{memberdesc}[MatchObject]{pos}
 816 The value of \var{pos} which was passed to the \method{search()} or
 817 \method{match()} method of the \class{RegexObject}.  This is the
 818 index into the string at which the RE engine started looking for a
 819 match.
 820 \end{memberdesc}
 821
 822 \begin{memberdesc}[MatchObject]{endpos}
 823 The value of \var{endpos} which was passed to the \method{search()}
 824 or \method{match()} method of the \class{RegexObject}.  This is the
 825 index into the string beyond which the RE engine will not go.
 826 \end{memberdesc}
 827
 828 \begin{memberdesc}[MatchObject]{lastindex}
 829 The integer index of the last matched capturing group, or \code{None}
 830 if no group was matched at all. For example, the expressions
 831 \regexp{(a)b}, \regexp{((a)(b))}, and \regexp{((ab))} will have
 832 \code{lastindex == 1} if applied to the string \code{'ab'},
 833 while the expression \regexp{(a)(b)} will have \code{lastindex == 2}
 834 if applied to the same string.
 835 \end{memberdesc}
 836
 837 \begin{memberdesc}[MatchObject]{lastgroup}
 838 The name of the last matched capturing group, or \code{None} if the
 839 group didn't have a name, or if no group was matched at all.
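
The behaviour of \member{lastindex} and \member{lastgroup} can be
sketched as follows; the group names \code{head} and \code{tail} are
made up for this example:

```python
import re

patterns = [r'(a)b', r'((a)(b))', r'((ab))', r'(a)(b)']
indexes = [re.match(p, 'ab').lastindex for p in patterns]

# The last matched group is named, so lastgroup gives its name.
named = re.match(r'(?P<head>a)(?P<tail>b)', 'ab').lastgroup
# The last matched group (number 2) has no name.
unnamed = re.match(r'(?P<head>a)(b)', 'ab').lastgroup
# A pattern without groups has no last matched group at all.
no_group = re.match(r'ab', 'ab').lastindex
```

\code{indexes} is \code{[1, 1, 1, 2]}, \code{named} is
\code{'tail'}, and both \code{unnamed} and \code{no_group} are
\code{None}.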
 840 \end{memberdesc}
 841
 842 \begin{memberdesc}[MatchObject]{re}
 843 The regular expression object whose \method{match()} or
 844 \method{search()} method produced this \class{MatchObject} instance.
 845 \end{memberdesc}
 846
 847 \begin{memberdesc}[MatchObject]{string}
 848 The string passed to \function{match()} or \function{search()}.
 849 \end{memberdesc}
 850
 851 \subsection{Examples}
 852
 853 \leftline{\strong{Simulating \cfunction{scanf()}}}
 854
 855 Python does not currently have an equivalent to \cfunction{scanf()}.
 856 \ttindex{scanf()}
 857 Regular expressions are generally more powerful, though also more
 858 verbose, than \cfunction{scanf()} format strings.  The table below
 859 offers some more-or-less equivalent mappings between
 860 \cfunction{scanf()} format tokens and regular expressions.
 861
 862 \begin{tableii}{l|l}{textrm}{\cfunction{scanf()} Token}{Regular Expression}
 863   \lineii{\code{\%c}}
 864          {\regexp{.}}
 865   \lineii{\code{\%5c}}
 866          {\regexp{.\{5\}}}
 867   \lineii{\code{\%d}}
 868          {\regexp{[-+]?\e d+}}
 869   \lineii{\code{\%e}, \code{\%E}, \code{\%f}, \code{\%g}}
 870          {\regexp{[-+]?(\e d+(\e.\e d*)?|\e d*\e.\e d+)([eE][-+]?\e d+)?}}
 871   \lineii{\code{\%i}}
 872          {\regexp{[-+]?(0[xX][\e dA-Fa-f]+|0[0-7]*|\e d+)}}
 873   \lineii{\code{\%o}}
 874          {\regexp{[-+]?[0-7]+}}
 875   \lineii{\code{\%s}}
 876          {\regexp{\e S+}}
 877   \lineii{\code{\%u}}
 878          {\regexp{\e d+}}
 879   \lineii{\code{\%x}, \code{\%X}}
 880          {\regexp{[-+]?(0[xX])?[\e dA-Fa-f]+}}
 881 \end{tableii}
 882
 883 To extract the filename and numbers from a string like
 884
 885 \begin{verbatim}
 886     /usr/sbin/sendmail - 0 errors, 4 warnings
 887 \end{verbatim}
 888
 889 you would use a \cfunction{scanf()} format like
 890
 891 \begin{verbatim}
 892     %s - %d errors, %d warnings
 893 \end{verbatim}
 894
 895 The equivalent regular expression would be
 896
 897 \begin{verbatim}
 898     (\S+) - (\d+) errors, (\d+) warnings
 899 \end{verbatim}
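
Applied from Python, this might look like the following sketch; the
variable names are illustrative, and the numeric fields are converted
with \function{int()} because groups are always returned as strings:

```python
import re

line = '/usr/sbin/sendmail - 0 errors, 4 warnings'
m = re.match(r'(\S+) - (\d+) errors, (\d+) warnings', line)

filename = m.group(1)
# scanf() would convert %d fields itself; here the conversion is explicit.
n_errors = int(m.group(2))
n_warnings = int(m.group(3))
```

This yields \code{'/usr/sbin/sendmail'}, \code{0}, and \code{4}
respectively.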
 900
 901 \leftline{\strong{Avoiding recursion}}
 902
 903 If you create regular expressions that require the engine to perform a
 904 lot of recursion, you may encounter a \exception{RuntimeError} exception with
 905 the message \code{maximum recursion limit exceeded}.  For example,
 906
 907 \begin{verbatim}
 908 >>> import re
 909 >>> s = 'Begin ' + 1000*'a very long string ' + 'end'
 910 >>> re.match('Begin (\w| )*? end', s).end()
 911 Traceback (most recent call last):
 912   File "<stdin>", line 1, in ?
 913   File "/usr/local/lib/python2.3/sre.py", line 132, in match
 914     return _compile(pattern, flags).match(string)
 915 RuntimeError: maximum recursion limit exceeded
 916 \end{verbatim}
 917
 918 You can often restructure your regular expression to avoid recursion.
 919
 920 Starting with Python 2.3, simple uses of the \regexp{*?} pattern are
 921 special-cased to avoid recursion.  Thus, the above regular expression
 922 can avoid recursion by being recast as
 923 \regexp{Begin [a-zA-Z0-9_ ]*?end}.  As a further benefit, such regular
 924 expressions will run faster than their recursive equivalents.
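
As a sketch, the recast expression can be applied to the same test
string as in the failing session above:

```python
import re

s = 'Begin ' + 1000*'a very long string ' + 'end'
# A character class replaces the (\w| ) alternation inside the repeat.
m = re.match('Begin [a-zA-Z0-9_ ]*?end', s)
matched_all = (m.end() == len(s))
```

The match extends to the end of the string, so \code{matched_all} is
true.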