Doc/lib/libre.tex

   1 \section{\module{re} ---
   2          Regular expression operations}
   3 \declaremodule{standard}{re}
   4 \moduleauthor{Andrew M. Kuchling}{amk1@bigfoot.com}
   5 \moduleauthor{Fredrik Lundh}{effbot@telia.com}
   6 \sectionauthor{Andrew M. Kuchling}{amk1@bigfoot.com}
   7
   8
   9 \modulesynopsis{Regular expression search and match operations with a
  10                 Perl-style expression syntax.}
  11
  12
  13 This module provides regular expression matching operations similar to
  14 those found in Perl.  Regular expression pattern strings may not
  15 contain null bytes, but can specify the null byte using the
  16 \code{\e\var{number}} notation.  Both patterns and strings to be
  17 searched can be Unicode strings as well as 8-bit strings.  The
  18 \module{re} module is always available.
  19
  20 Regular expressions use the backslash character (\character{\e}) to
  21 indicate special forms or to allow special characters to be used
  22 without invoking their special meaning.  This collides with Python's
  23 usage of the same character for the same purpose in string literals;
  24 for example, to match a literal backslash, one might have to write
  25 \code{'\e\e\e\e'} as the pattern string, because the regular expression
  26 must be \samp{\e\e}, and each backslash must be expressed as
  27 \samp{\e\e} inside a regular Python string literal.
  28
  29 The solution is to use Python's raw string notation for regular
  30 expression patterns; backslashes are not handled in any special way in
  31 a string literal prefixed with \character{r}.  So \code{r"\e n"} is a
  32 two-character string containing \character{\e} and \character{n},
  33 while \code{"\e n"} is a one-character string containing a newline.
  34 Usually patterns will be expressed in Python code using this raw
  35 string notation.
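
As a minimal sketch of the difference (all variable names are
illustrative only), the following compares a regular and a raw string
literal, and shows that both spellings of the backslash pattern denote
the same regular expression:

```python
import re

# A raw string keeps the backslash; a regular string turns "\n" into a newline.
raw_literal = r"\n"
plain_literal = "\n"
print(len(raw_literal), len(plain_literal))   # prints: 2 1

# Two spellings of the same RE, which matches one literal backslash.
escaped_pattern = "\\\\"   # regular string literal: four backslashes typed
raw_pattern = r"\\"        # raw string literal: two backslashes typed
same_pattern = (escaped_pattern == raw_pattern)

# The subject string "a\\b" holds three characters: a, one backslash, b.
match = re.search(raw_pattern, "a\\b")
print(same_pattern, match.span())   # prints: True (1, 2)
```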
  36
  37 \strong{Implementation note:}
  38 The \module{re}\refstmodindex{pre} module has two distinct
  39 implementations: \module{sre} is the default implementation and
  40 includes Unicode support, but may run into stack limitations for some
  41 patterns.  Though this will be fixed for a future release of Python,
  42 the older implementation (without Unicode support) is still available
  43 as the \module{pre}\refstmodindex{pre} module.
  44
  45
  46 \subsection{Regular Expression Syntax \label{re-syntax}}
  47
  48 A regular expression (or RE) specifies a set of strings that matches
  49 it; the functions in this module let you check if a particular string
  50 matches a given regular expression (or if a given regular expression
  51 matches a particular string, which comes down to the same thing).
  52
  53 Regular expressions can be concatenated to form new regular
  54 expressions; if \emph{A} and \emph{B} are both regular expressions,
then \emph{AB} is also a regular expression.  If a string \emph{p}
  56 matches A and another string \emph{q} matches B, the string \emph{pq}
  57 will match AB.  Thus, complex expressions can easily be constructed
  58 from simpler primitive expressions like the ones described here.  For
  59 details of the theory and implementation of regular expressions,
  60 consult the Friedl book referenced below, or almost any textbook about
  61 compiler construction.
  62
  63 A brief explanation of the format of regular expressions follows.  For
  64 further information and a gentler presentation, consult the Regular
  65 Expression HOWTO, accessible from \url{http://www.python.org/doc/howto/}.
  66
  67 Regular expressions can contain both special and ordinary characters.
  68 Most ordinary characters, like \character{A}, \character{a}, or \character{0},
  69 are the simplest regular expressions; they simply match themselves.
  70 You can concatenate ordinary characters, so \regexp{last} matches the
string \code{'last'}.  (In the rest of this section, we'll write REs in
  72 \regexp{this special style}, usually without quotes, and strings to be
  73 matched \code{'in single quotes'}.)
  74
  75 Some characters, like \character{|} or \character{(}, are special.  Special
  76 characters either stand for classes of ordinary characters, or affect
  77 how the regular expressions around them are interpreted.
  78
  79 The special characters are:
  80
  81 \begin{list}{}{\leftmargin 0.7in \labelwidth 0.65in}
  82
  83 \item[\character{.}] (Dot.)  In the default mode, this matches any
  84 character except a newline.  If the \constant{DOTALL} flag has been
  85 specified, this matches any character including a newline.
  86
  87 \item[\character{\^}] (Caret.)  Matches the start of the string, and in
  88 \constant{MULTILINE} mode also matches immediately after each newline.
  89
\item[\character{\$}] Matches the end of the string or just before the
newline at the end of the string, and in \constant{MULTILINE} mode also
matches before a newline.  \regexp{foo} matches both \code{'foo'} and
\code{'foobar'}, while the regular expression \regexp{foo\$} matches
only \code{'foo'}.
  94
  95 \item[\character{*}] Causes the resulting RE to
  96 match 0 or more repetitions of the preceding RE, as many repetitions
  97 as are possible.  \regexp{ab*} will
  98 match 'a', 'ab', or 'a' followed by any number of 'b's.
  99
 100 \item[\character{+}] Causes the
 101 resulting RE to match 1 or more repetitions of the preceding RE.
 102 \regexp{ab+} will match 'a' followed by any non-zero number of 'b's; it
 103 will not match just 'a'.
 104
 105 \item[\character{?}] Causes the resulting RE to
 106 match 0 or 1 repetitions of the preceding RE.  \regexp{ab?} will
 107 match either 'a' or 'ab'.
 108 \item[\code{*?}, \code{+?}, \code{??}] The \character{*}, \character{+}, and
 109 \character{?} qualifiers are all \dfn{greedy}; they match as much text as
 110 possible.  Sometimes this behaviour isn't desired; if the RE
 111 \regexp{<.*>} is matched against \code{'<H1>title</H1>'}, it will match the
 112 entire string, and not just \code{'<H1>'}.
 113 Adding \character{?} after the qualifier makes it perform the match in
 114 \dfn{non-greedy} or \dfn{minimal} fashion; as \emph{few} characters as
 115 possible will be matched.  Using \regexp{.*?} in the previous
 116 expression will match only \code{'<H1>'}.
 117
 118 \item[\code{\{\var{m},\var{n}\}}] Causes the resulting RE to match from
 119 \var{m} to \var{n} repetitions of the preceding RE, attempting to
 120 match as many repetitions as possible.  For example, \regexp{a\{3,5\}}
 121 will match from 3 to 5 \character{a} characters.  Omitting \var{n}
 122 specifies an infinite upper bound; you can't omit \var{m}.
 123
 124 \item[\code{\{\var{m},\var{n}\}?}] Causes the resulting RE to
 125 match from \var{m} to \var{n} repetitions of the preceding RE,
 126 attempting to match as \emph{few} repetitions as possible.  This is
 127 the non-greedy version of the previous qualifier.  For example, on the
 128 6-character string \code{'aaaaaa'}, \regexp{a\{3,5\}} will match 5
 129 \character{a} characters, while \regexp{a\{3,5\}?} will only match 3
 130 characters.
 131
 132 \item[\character{\e}] Either escapes special characters (permitting
 133 you to match characters like \character{*}, \character{?}, and so
 134 forth), or signals a special sequence; special sequences are discussed
 135 below.
 136
 137 If you're not using a raw string to
 138 express the pattern, remember that Python also uses the
 139 backslash as an escape sequence in string literals; if the escape
 140 sequence isn't recognized by Python's parser, the backslash and
 141 subsequent character are included in the resulting string.  However,
 142 if Python would recognize the resulting sequence, the backslash should
 143 be repeated twice.  This is complicated and hard to understand, so
 144 it's highly recommended that you use raw strings for all but the
 145 simplest expressions.
 146
 147 \item[\code{[]}] Used to indicate a set of characters.  Characters can
 148 be listed individually, or a range of characters can be indicated by
 149 giving two characters and separating them by a \character{-}.  Special
 150 characters are not active inside sets.  For example, \regexp{[akm\$]}
 151 will match any of the characters \character{a}, \character{k},
 152 \character{m}, or \character{\$}; \regexp{[a-z]}
 153 will match any lowercase letter, and \code{[a-zA-Z0-9]} matches any
 154 letter or digit.  Character classes such as \code{\e w} or \code{\e S}
 155 (defined below) are also acceptable inside a range.  If you want to
 156 include a \character{]} or a \character{-} inside a set, precede it with a
 157 backslash, or place it as the first character.  The
 158 pattern \regexp{[]]} will match \code{']'}, for example.
 159
 160 You can match the characters not within a range by \dfn{complementing}
 161 the set.  This is indicated by including a
 162 \character{\^} as the first character of the set; \character{\^} elsewhere will
 163 simply match the \character{\^} character.  For example, \regexp{[{\^}5]}
 164 will match any character except \character{5}.
 165
 166 \item[\character{|}]\code{A|B}, where A and B can be arbitrary REs,
 167 creates a regular expression that will match either A or B.  An
 168 arbitrary number of REs can be separated by the \character{|} in this
 169 way.  This can be used inside groups (see below) as well.  REs
 170 separated by \character{|} are tried from left to right, and the first
 171 one that allows the complete pattern to match is considered the
 172 accepted branch.  This means that if \code{A} matches, \code{B} will
 173 never be tested, even if it would produce a longer overall match.  In
 174 other words, the \character{|} operator is never greedy.  To match a
 175 literal \character{|}, use \regexp{\e|}, or enclose it inside a
 176 character class, as in \regexp{[|]}.
 177
 178 \item[\code{(...)}] Matches whatever regular expression is inside the
 179 parentheses, and indicates the start and end of a group; the contents
 180 of a group can be retrieved after a match has been performed, and can
 181 be matched later in the string with the \regexp{\e \var{number}} special
 182 sequence, described below.  To match the literals \character{(} or
 183 \character{)}, use \regexp{\e(} or \regexp{\e)}, or enclose them
 184 inside a character class: \regexp{[(] [)]}.
 185
 186 \item[\code{(?...)}] This is an extension notation (a \character{?}
 187 following a \character{(} is not meaningful otherwise).  The first
 188 character after the \character{?}
 189 determines what the meaning and further syntax of the construct is.
 190 Extensions usually do not create a new group;
 191 \regexp{(?P<\var{name}>...)} is the only exception to this rule.
 192 Following are the currently supported extensions.
 193
 194 \item[\code{(?iLmsux)}] (One or more letters from the set \character{i},
 195 \character{L}, \character{m}, \character{s}, \character{u},
 196 \character{x}.)  The group matches the empty string; the letters set
 197 the corresponding flags (\constant{re.I}, \constant{re.L},
 198 \constant{re.M}, \constant{re.S}, \constant{re.U}, \constant{re.X})
 199 for the entire regular expression.  This is useful if you wish to
 200 include the flags as part of the regular expression, instead of
passing a \var{flags} argument to the \function{compile()} function.
 202
 203 Note that the \regexp{(?x)} flag changes how the expression is parsed.
 204 It should be used first in the expression string, or after one or more
 205 whitespace characters.  If there are non-whitespace characters before
 206 the flag, the results are undefined.
 207
 208 \item[\code{(?:...)}] A non-grouping version of regular parentheses.
 209 Matches whatever regular expression is inside the parentheses, but the
 210 substring matched by the
 211 group \emph{cannot} be retrieved after performing a match or
 212 referenced later in the pattern.
 213
 214 \item[\code{(?P<\var{name}>...)}] Similar to regular parentheses, but
 215 the substring matched by the group is accessible via the symbolic group
 216 name \var{name}.  Group names must be valid Python identifiers.  A
 217 symbolic group is also a numbered group, just as if the group were not
named.  So the group named \code{'id'} in the example below can also be
referenced as the numbered group 1.
 220
 221 For example, if the pattern is
 222 \regexp{(?P<id>[a-zA-Z_]\e w*)}, the group can be referenced by its
 223 name in arguments to methods of match objects, such as \code{m.group('id')}
 224 or \code{m.end('id')}, and also by name in pattern text
 225 (e.g. \regexp{(?P=id)}) and replacement text (e.g. \code{\e g<id>}).
 226
 227 \item[\code{(?P=\var{name})}] Matches whatever text was matched by the
 228 earlier group named \var{name}.
 229
 230 \item[\code{(?\#...)}] A comment; the contents of the parentheses are
 231 simply ignored.
 232
 233 \item[\code{(?=...)}] Matches if \regexp{...} matches next, but doesn't
 234 consume any of the string.  This is called a lookahead assertion.  For
 235 example, \regexp{Isaac (?=Asimov)} will match \code{'Isaac~'} only if it's
 236 followed by \code{'Asimov'}.
 237
 238 \item[\code{(?!...)}] Matches if \regexp{...} doesn't match next.  This
 239 is a negative lookahead assertion.  For example,
 240 \regexp{Isaac (?!Asimov)} will match \code{'Isaac~'} only if it's \emph{not}
 241 followed by \code{'Asimov'}.
 242
 243 \item[\code{(?<=...)}] Matches if the current position in the string
 244 is preceded by a match for \regexp{...} that ends at the current
 245 position.  This is called a positive lookbehind assertion.
 246 \regexp{(?<=abc)def} will match \samp{abcdef}, since the lookbehind
 247 will back up 3 characters and check if the contained pattern matches.
 248 The contained pattern must only match strings of some fixed length,
 249 meaning that \regexp{abc} or \regexp{a|b} are allowed, but \regexp{a*}
 250 isn't.
 251
 252 \item[\code{(?<!...)}] Matches if the current position in the string
 253 is not preceded by a match for \regexp{...}.  This
 254 is called a negative lookbehind assertion.  Similar to positive lookbehind
 255 assertions, the contained pattern must only match strings of some
 256 fixed length.
 257
 258 \end{list}
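
The following sketch exercises a few of the constructs listed above:
greedy and non-greedy qualifiers, lookahead assertions, and a named
group.  The sample strings are the ones used in the descriptions,
apart from the last one, which is made up for this illustration.

```python
import re

html = '<H1>title</H1>'

# Greedy: .* runs to the last '>' in the string.
greedy = re.match(r'<.*>', html).group(0)
# Non-greedy: .*? stops at the first '>'.
minimal = re.match(r'<.*?>', html).group(0)
print(greedy)    # prints: <H1>title</H1>
print(minimal)   # prints: <H1>

# Bounded repetition on the 6-character string 'aaaaaa'.
most = re.match(r'a{3,5}', 'aaaaaa').group(0)
fewest = re.match(r'a{3,5}?', 'aaaaaa').group(0)
print(len(most), len(fewest))   # prints: 5 3

# Lookahead assertions test the following text without consuming it.
followed = re.match(r'Isaac (?=Asimov)', 'Isaac Asimov')
not_followed = re.match(r'Isaac (?!Asimov)', 'Isaac Asimov')
print(followed.group(0) == 'Isaac ', not_followed)   # prints: True None

# A named group is also reachable by its number.
m = re.match(r'(?P<id>[a-zA-Z_]\w*)', 'spam_1 = 2')
print(m.group('id'), m.group(1))   # prints: spam_1 spam_1
```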
 259
 260 The special sequences consist of \character{\e} and a character from the
 261 list below.  If the ordinary character is not on the list, then the
 262 resulting RE will match the second character.  For example,
 263 \regexp{\e\$} matches the character \character{\$}.
 264
 265 \begin{list}{}{\leftmargin 0.7in \labelwidth 0.65in}
 266
 267 \item[\code{\e \var{number}}] Matches the contents of the group of the
 268 same number.  Groups are numbered starting from 1.  For example,
 269 \regexp{(.+) \e 1} matches \code{'the the'} or \code{'55 55'}, but not
 270 \code{'the end'} (note
 271 the space after the group).  This special sequence can only be used to
 272 match one of the first 99 groups.  If the first digit of \var{number}
 273 is 0, or \var{number} is 3 octal digits long, it will not be interpreted
 274 as a group match, but as the character with octal value \var{number}.
 275 Inside the \character{[} and \character{]} of a character class, all numeric
 276 escapes are treated as characters.
 277
 278 \item[\code{\e A}] Matches only at the start of the string.
 279
 280 \item[\code{\e b}] Matches the empty string, but only at the
 281 beginning or end of a word.  A word is defined as a sequence of
 282 alphanumeric characters, so the end of a word is indicated by
 283 whitespace or a non-alphanumeric character.  Inside a character range,
 284 \regexp{\e b} represents the backspace character, for compatibility with
 285 Python's string literals.
 286
 287 \item[\code{\e B}] Matches the empty string, but only when it is
 288 \emph{not} at the beginning or end of a word.
 289
 290 \item[\code{\e d}]Matches any decimal digit; this is
 291 equivalent to the set \regexp{[0-9]}.
 292
 293 \item[\code{\e D}]Matches any non-digit character; this is
 294 equivalent to the set \regexp{[{\^}0-9]}.
 295
 296 \item[\code{\e s}]Matches any whitespace character; this is
 297 equivalent to the set \regexp{[ \e t\e n\e r\e f\e v]}.
 298
 299 \item[\code{\e S}]Matches any non-whitespace character; this is
 300 equivalent to the set \regexp{[\^\ \e t\e n\e r\e f\e v]}.
 301
 302 \item[\code{\e w}]When the \constant{LOCALE} and \constant{UNICODE}
 303 flags are not specified,
 304 matches any alphanumeric character; this is equivalent to the set
 305 \regexp{[a-zA-Z0-9_]}.  With \constant{LOCALE}, it will match the set
 306 \regexp{[0-9_]} plus whatever characters are defined as letters for
 307 the current locale.  If \constant{UNICODE} is set, this will match the
 308 characters \regexp{[0-9_]} plus whatever is classified as alphanumeric
 309 in the Unicode character properties database.
 310
 311 \item[\code{\e W}]When the \constant{LOCALE} and \constant{UNICODE}
 312 flags are not specified, matches any non-alphanumeric character; this
 313 is equivalent to the set \regexp{[{\^}a-zA-Z0-9_]}.   With
 314 \constant{LOCALE}, it will match any character not in the set
 315 \regexp{[0-9_]}, and not defined as a letter for the current locale.
 316 If \constant{UNICODE} is set, this will match anything other than
\regexp{[0-9_]} and characters marked as alphanumeric in the Unicode
 318 character properties database.
 319
 320 \item[\code{\e Z}]Matches only at the end of the string.
 321
 322 \item[\code{\e \e}] Matches a literal backslash.
 323
 324 \end{list}
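
A short sketch of three of the special sequences above: a numbered
backreference, the word-boundary assertion, and the digit class.  The
first three test strings come from the description of backreferences;
the others are made up for this illustration.

```python
import re

# A group, one space, then the same text again via the backreference \1.
doubled = re.compile(r'(.+) \1')
results = [(text, doubled.match(text) is not None)
           for text in ('the the', '55 55', 'the end')]
print(results)   # prints: [('the the', True), ('55 55', True), ('the end', False)]

# \b matches only at a word boundary, so the 'cat' inside 'concat' is skipped.
whole_words = re.findall(r'\bcat\b', 'cat concat cat.')
print(whole_words)   # prints: ['cat', 'cat']

# \d+ collects runs of decimal digits.
digits = re.findall(r'\d+', 'a1b22')
print(digits)   # prints: ['1', '22']
```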
 325
 326
 327 \subsection{Matching vs. Searching \label{matching-searching}}
 328 \sectionauthor{Fred L. Drake, Jr.}{fdrake@acm.org}
 329
 330 Python offers two different primitive operations based on regular
 331 expressions: match and search.  If you are accustomed to Perl's
 332 semantics, the search operation is what you're looking for.  See the
 333 \function{search()} function and corresponding method of compiled
 334 regular expression objects.
 335
 336 Note that match may differ from search using a regular expression
 337 beginning with \character{\^}: \character{\^} matches only at the
 338 start of the string, or in \constant{MULTILINE} mode also immediately
 339 following a newline.  The ``match'' operation succeeds only if the
 340 pattern matches at the start of the string regardless of mode, or at
 341 the starting position given by the optional \var{pos} argument
 342 regardless of whether a newline precedes it.
 343
 344 % Examples from Tim Peters:
 345 \begin{verbatim}
import re

re.compile("a").match("ba", 1)           # succeeds
re.compile("^a").search("ba", 1)         # fails; 'a' not at start
re.compile("^a").search("\na", 1)        # fails; 'a' not at start
re.compile("^a", re.M).search("\na", 1)  # succeeds
re.compile("^a", re.M).search("ba", 1)   # fails; no preceding \n
 351 \end{verbatim}
 352
 353
 354 \subsection{Module Contents}
 355 \nodename{Contents of Module re}
 356
 357 The module defines the following functions and constants, and an exception:
 358
 359
 360 \begin{funcdesc}{compile}{pattern\optional{, flags}}
 361   Compile a regular expression pattern into a regular expression
 362   object, which can be used for matching using its \function{match()} and
 363   \function{search()} methods, described below.
 364
 365   The expression's behaviour can be modified by specifying a
 366   \var{flags} value.  Values can be any of the following variables,
 367   combined using bitwise OR (the \code{|} operator).
 368
 369 The sequence
 370
 371 \begin{verbatim}
 372 prog = re.compile(pat)
 373 result = prog.match(str)
 374 \end{verbatim}
 375
 376 is equivalent to
 377
 378 \begin{verbatim}
 379 result = re.match(pat, str)
 380 \end{verbatim}
 381
 382 but the version using \function{compile()} is more efficient when the
 383 expression will be used several times in a single program.
 384 %(The compiled version of the last pattern passed to
 385 %\function{regex.match()} or \function{regex.search()} is cached, so
 386 %programs that use only a single regular expression at a time needn't
 387 %worry about compiling regular expressions.)
 388 \end{funcdesc}
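
As an illustration of combining flags (documented below) with bitwise
OR, the following sketch compiles a made-up pattern with both
\constant{I} and \constant{M}, and checks that the one-shot
\function{search()} function gives the same result:

```python
import re

# Combine flags with bitwise OR; re.I and re.M are the short names.
prog = re.compile(r'^spam$', re.I | re.M)

text = 'eggs\nSPAM\nham'
result = prog.search(text)
print(result.group(0))   # prints: SPAM

# The module-level function accepts the same flags value.
same = re.search(r'^spam$', text, re.I | re.M)
print(same.group(0) == result.group(0))   # prints: True

# Without the flags, neither the case nor the line position fits.
print(re.search(r'^spam$', text))   # prints: None
```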
 389
 390 \begin{datadesc}{I}
 391 \dataline{IGNORECASE}
 392 Perform case-insensitive matching; expressions like \regexp{[A-Z]} will match
 393 lowercase letters, too.  This is not affected by the current locale.
 394 \end{datadesc}
 395
 396 \begin{datadesc}{L}
 397 \dataline{LOCALE}
 398 Make \regexp{\e w}, \regexp{\e W}, \regexp{\e b}, and
 399 \regexp{\e B} dependent on the current locale.
 400 \end{datadesc}
 401
 402 \begin{datadesc}{M}
 403 \dataline{MULTILINE}
 404 When specified, the pattern character \character{\^} matches at the
 405 beginning of the string and at the beginning of each line
 406 (immediately following each newline); and the pattern character
 407 \character{\$} matches at the end of the string and at the end of each line
 408 (immediately preceding each newline).
 409 By default, \character{\^} matches only at the beginning of the string, and
 410 \character{\$} only at the end of the string and immediately before the
 411 newline (if any) at the end of the string.
 412 \end{datadesc}
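
A small sketch of the difference, using a made-up two-line string that
ends with a newline:

```python
import re

text = 'first line\nsecond line\n'

# Default: ^ matches only at the start of the string; $ matches at the
# very end and just before the final newline.
default_starts = re.findall(r'^\w+', text)
default_ends = re.findall(r'\w+$', text)
print(default_starts, default_ends)   # prints: ['first'] ['line']

# MULTILINE: ^ and $ also match at every line boundary.
multi_starts = re.findall(r'^\w+', text, re.M)
multi_ends = re.findall(r'\w+$', text, re.M)
print(multi_starts, multi_ends)   # prints: ['first', 'second'] ['line', 'line']
```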
 413
 414 \begin{datadesc}{S}
 415 \dataline{DOTALL}
 416 Make the \character{.} special character match any character at all,
 417 including a newline; without this flag, \character{.} will match
 418 anything \emph{except} a newline.
 419 \end{datadesc}
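
For instance, with a made-up two-line string:

```python
import re

text = 'one\ntwo'

# Without DOTALL, '.' stops at the newline.
first_line = re.match(r'.+', text).group(0)
print(first_line)   # prints: one

# With DOTALL (re.S), '.' also matches the newline.
everything = re.match(r'.+', text, re.S).group(0)
print(everything == text)   # prints: True
```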
 420
 421 \begin{datadesc}{U}
 422 \dataline{UNICODE}
 423 Make \regexp{\e w}, \regexp{\e W}, \regexp{\e b}, and
 424 \regexp{\e B} dependent on the Unicode character properties database.
 425 \versionadded{2.0}
 426 \end{datadesc}
 427
 428 \begin{datadesc}{X}
 429 \dataline{VERBOSE}
 430 This flag allows you to write regular expressions that look nicer.
 431 Whitespace within the pattern is ignored,
 432 except when in a character class or preceded by an unescaped
 433 backslash, and, when a line contains a \character{\#} neither in a character
class nor preceded by an unescaped backslash, all characters from the
 435 leftmost such \character{\#} through the end of the line are ignored.
 436 % XXX should add an example here
 437 \end{datadesc}
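
As a sketch of what this allows, the two objects below are intended to
match the same strings.  The pattern (a simple decimal number) and the
names \code{number} and \code{compact} are made up for this
illustration.

```python
import re

# A hypothetical pattern for simple decimal numbers, laid out with re.VERBOSE.
number = re.compile(r"""
    \d+        # integer part
    \.         # literal decimal point
    \d*        # optional fractional digits
    """, re.VERBOSE)

# The same pattern written without whitespace or comments.
compact = re.compile(r"\d+\.\d*")

outcomes = [(text, bool(number.match(text)), bool(compact.match(text)))
            for text in ('3.14', '42.', 'pi')]
print(outcomes)
# prints: [('3.14', True, True), ('42.', True, True), ('pi', False, False)]
```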
 438
 439
 440 \begin{funcdesc}{search}{pattern, string\optional{, flags}}
 441   Scan through \var{string} looking for a location where the regular
 442   expression \var{pattern} produces a match, and return a
 443   corresponding \class{MatchObject} instance.
 444   Return \code{None} if no
 445   position in the string matches the pattern; note that this is
 446   different from finding a zero-length match at some point in the string.
 447 \end{funcdesc}
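
The distinction between no match and a zero-length match can be seen
in this sketch (the patterns are chosen only for illustration):

```python
import re

# 'x*' can match zero characters, so it succeeds with an empty match at 0.
empty = re.search(r'x*', 'abc')
print(empty is not None, repr(empty.group(0)), empty.span())
# prints: True '' (0, 0)

# 'x+' needs at least one 'x', so there is no match at all.
missing = re.search(r'x+', 'abc')
print(missing)   # prints: None

# Compare against None rather than inspecting the matched text.
if empty is not None:
    print('matched at', empty.start())   # prints: matched at 0
```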
 448
 449 \begin{funcdesc}{match}{pattern, string\optional{, flags}}
 450   If zero or more characters at the beginning of \var{string} match
 451   the regular expression \var{pattern}, return a corresponding
 452   \class{MatchObject} instance.  Return \code{None} if the string does not
 453   match the pattern; note that this is different from a zero-length
 454   match.
 455
 456   \strong{Note:}  If you want to locate a match anywhere in
 457   \var{string}, use \method{search()} instead.
 458 \end{funcdesc}
 459
 460 \begin{funcdesc}{split}{pattern, string\optional{, maxsplit\code{ = 0}}}
 461   Split \var{string} by the occurrences of \var{pattern}.  If
  capturing parentheses are used in \var{pattern}, then the text of all
  groups in the pattern is also returned as part of the resulting list.
 464   If \var{maxsplit} is nonzero, at most \var{maxsplit} splits
 465   occur, and the remainder of the string is returned as the final
 466   element of the list.  (Incompatibility note: in the original Python
 467   1.5 release, \var{maxsplit} was ignored.  This has been fixed in
 468   later releases.)
 469
 470 \begin{verbatim}
>>> re.split(r'\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'\W+', 'Words, words, words.', 1)
['Words', 'words, words.']
 477 \end{verbatim}
 478
 479   This function combines and extends the functionality of
 480   the old \function{regsub.split()} and \function{regsub.splitx()}.
 481 \end{funcdesc}
 482
 483 \begin{funcdesc}{findall}{pattern, string}
 484 Return a list of all non-overlapping matches of \var{pattern} in
 485 \var{string}.  If one or more groups are present in the pattern,
 486 return a list of groups; this will be a list of tuples if the pattern
 487 has more than one group.  Empty matches are included in the result.
 488 \versionadded{1.5.2}
 489 \end{funcdesc}
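
The shape of the result depends on the number of groups, as this
sketch with a made-up input string shows:

```python
import re

text = 'width=80, height=24'

# No groups: a list of the matched substrings.
plain = re.findall(r'\d+', text)
print(plain)      # prints: ['80', '24']

# One group: a list of that group's text.
one_group = re.findall(r'(\w+)=\d+', text)
print(one_group)  # prints: ['width', 'height']

# Several groups: a list of tuples, one tuple per match.
pairs = re.findall(r'(\w+)=(\d+)', text)
print(pairs)      # prints: [('width', '80'), ('height', '24')]
```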
 490
 491 \begin{funcdesc}{sub}{pattern, repl, string\optional{, count\code{ = 0}}}
 492 Return the string obtained by replacing the leftmost non-overlapping
 493 occurrences of \var{pattern} in \var{string} by the replacement
 494 \var{repl}.  If the pattern isn't found, \var{string} is returned
 495 unchanged.  \var{repl} can be a string or a function; if a function,
 496 it is called for every non-overlapping occurrence of \var{pattern}.
 497 The function takes a single match object argument, and returns the
 498 replacement string.  For example:
 499
 500 \begin{verbatim}
>>> def dashrepl(matchobj):
...     if matchobj.group(0) == '-': return ' '
...     else: return '-'
...
>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
'pro--gram files'
 506 \end{verbatim}
 507
 508 The pattern may be a string or a
 509 regex object; if you need to specify
 510 regular expression flags, you must use a regex object, or use
 511 embedded modifiers in a pattern; e.g.
 512 \samp{sub("(?i)b+", "x", "bbbb BBBB")} returns \code{'x x'}.
 513
 514 The optional argument \var{count} is the maximum number of pattern
 515 occurrences to be replaced; \var{count} must be a non-negative integer, and
 516 the default value of 0 means to replace all occurrences.
 517
 518 Empty matches for the pattern are replaced only when not adjacent to a
 519 previous match, so \samp{sub('x*', '-', 'abc')} returns \code{'-a-b-c-'}.
 520
 521 If \var{repl} is a string, any backslash escapes in it are processed.
 522 That is, \samp{\e n} is converted to a single newline character,
\samp{\e r} is converted to a carriage return, and so forth.  Unknown escapes
 524 such as \samp{\e j} are left alone.  Backreferences, such as \samp{\e 6}, are
 525 replaced with the substring matched by group 6 in the pattern.
 526
 527 In addition to character escapes and backreferences as described
 528 above, \samp{\e g<name>} will use the substring matched by the group
 529 named \samp{name}, as defined by the \regexp{(?P<name>...)} syntax.
 530 \samp{\e g<number>} uses the corresponding group number; \samp{\e
 531 g<2>} is therefore equivalent to \samp{\e 2}, but isn't ambiguous in a
 532 replacement such as \samp{\e g<2>0}.  \samp{\e 20} would be
 533 interpreted as a reference to group 20, not a reference to group 2
 534 followed by the literal character \character{0}.
 535 \end{funcdesc}
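
The replacement forms described above can be sketched as follows.  The
patterns and input strings are made up for this illustration, and
\var{count} is passed by keyword here purely for readability.

```python
import re

# Numeric backreferences: swap the text on either side of '='.
swapped = re.sub(r'(\w+)=(\w+)', r'\2=\1', 'a=1 b=2')
print(swapped)   # prints: 1=a 2=b

# \g<name> refers to a named group.
named = re.sub(r'(?P<key>\w+)=(?P<val>\w+)', r'\g<val>:\g<key>', 'a=1')
print(named)     # prints: 1:a

# \g<2>0 means group 2 followed by a literal '0'.
padded = re.sub(r'(\w+)=(\w+)', r'\g<2>0', 'a=1')
print(padded)    # prints: 10

# count limits the number of replacements made.
limited = re.sub(r'\d', '#', '1 2 3', count=2)
print(limited)   # prints: # # 3
```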
 536
 537 \begin{funcdesc}{subn}{pattern, repl, string\optional{, count\code{ = 0}}}
 538 Perform the same operation as \function{sub()}, but return a tuple
 539 \code{(\var{new_string}, \var{number_of_subs_made})}.
 540 \end{funcdesc}
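
For example (the variable names are illustrative only):

```python
import re

# subn() returns the new string together with the number of replacements.
new_string, subs_made = re.subn(r'-', '+', 'a-b-c')
print(new_string, subs_made)   # prints: a+b+c 2

# When nothing matches, the string is unchanged and the count is 0.
unchanged = re.subn(r'x', 'y', 'abc')
print(unchanged)   # prints: ('abc', 0)
```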
 541
 542 \begin{funcdesc}{escape}{string}
 543   Return \var{string} with all non-alphanumerics backslashed; this is
 544   useful if you want to match an arbitrary literal string that may have
 545   regular expression metacharacters in it.
 546 \end{funcdesc}
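
A sketch of why escaping matters, using a made-up literal string that
happens to contain metacharacters:

```python
import re

literal = 'a.b*c'

# Used directly, '.' and '*' keep their special meaning.
loose = re.match(literal, 'aXbbc')
print(loose is not None)   # prints: True

# After escape(), only the literal text itself matches.
strict = re.compile(re.escape(literal))
print(strict.match('aXbbc'))                # prints: None
print(strict.match('a.b*c') is not None)    # prints: True
```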
 547
 548 \begin{excdesc}{error}
 549   Exception raised when a string passed to one of the functions here
 550   is not a valid regular expression (e.g., unmatched parentheses) or
 551   when some other error occurs during compilation or matching.  It is
 552   never an error if a string contains no match for a pattern.
 553 \end{excdesc}
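
One possible use is to check a pattern before relying on it.  The
helper \code{compiles()} below is hypothetical and not part of the
module.

```python
import re

def compiles(pattern):
    """Return True if pattern is a valid RE (hypothetical helper)."""
    try:
        re.compile(pattern)
    except re.error:
        return False
    return True

print(compiles('(a|b)'))   # prints: True
print(compiles('(a|b'))    # prints: False

# A failed match is not an error; it simply yields None.
no_match = re.search('x', 'abc')
print(no_match)            # prints: None
```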
 554
 555
 556 \subsection{Regular Expression Objects \label{re-objects}}
 557
 558 Compiled regular expression objects support the following methods and
 559 attributes:
 560
 561 \begin{methoddesc}[RegexObject]{search}{string\optional{, pos\optional{,
 562                                         endpos}}}
 563   Scan through \var{string} looking for a location where this regular
 564   expression produces a match, and return a
 565   corresponding \class{MatchObject} instance.  Return \code{None} if no
 566   position in the string matches the pattern; note that this is
 567   different from finding a zero-length match at some point in the string.
 568
 569   The optional \var{pos} and \var{endpos} parameters have the same
 570   meaning as for the \method{match()} method.
 571 \end{methoddesc}
 572
 573 \begin{methoddesc}[RegexObject]{match}{string\optional{, pos\optional{,
 574                                        endpos}}}
 575   If zero or more characters at the beginning of \var{string} match
 576   this regular expression, return a corresponding
 577   \class{MatchObject} instance.  Return \code{None} if the string does not
 578   match the pattern; note that this is different from a zero-length
 579   match.
 580
 581   \strong{Note:}  If you want to locate a match anywhere in
 582   \var{string}, use \method{search()} instead.
 583
 584   The optional second parameter \var{pos} gives an index in the string
 585   where the search is to start; it defaults to \code{0}.  This is not
 586   completely equivalent to slicing the string; the \code{'\^'} pattern
 587   character matches at the real beginning of the string and at positions
 588   just after a newline, but not necessarily at the index where the search
 589   is to start.
 590
 591   The optional parameter \var{endpos} limits how far the string will
 592   be searched; it will be as if the string is \var{endpos} characters
 593   long, so only the characters from \var{pos} to \var{endpos} will be
 594   searched for a match.
 595 \end{methoddesc}
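
The following sketch shows how \var{pos} differs from slicing, and how
\var{endpos} hides the tail of the string.  The patterns and strings
are made up for this illustration.

```python
import re

start_a = re.compile(r'^a')

# pos=1 starts the search at index 1, but '^' still means the real start.
with_pos = start_a.search('ba', 1)
print(with_pos)   # prints: None

# Slicing creates a new string that really does begin with 'a'.
with_slice = start_a.search('ba'[1:])
print(with_slice is not None)   # prints: True

abc = re.compile(r'abc')
# endpos=4 hides the final 'c', as if the string were 'xxab'.
cut_short = abc.search('xxabc', 0, 4)
print(cut_short)   # prints: None
full = abc.search('xxabc', 0, 5)
print(full.span())   # prints: (2, 5)
```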
 596
 597 \begin{methoddesc}[RegexObject]{split}{string\optional{,
 598                                        maxsplit\code{ = 0}}}
 599 Identical to the \function{split()} function, using the compiled pattern.
 600 \end{methoddesc}
 601
 602 \begin{methoddesc}[RegexObject]{findall}{string}
 603 Identical to the \function{findall()} function, using the compiled pattern.
 604 \end{methoddesc}
 605
 606 \begin{methoddesc}[RegexObject]{sub}{repl, string\optional{, count\code{ = 0}}}
 607 Identical to the \function{sub()} function, using the compiled pattern.
 608 \end{methoddesc}
 609
 610 \begin{methoddesc}[RegexObject]{subn}{repl, string\optional{,
 611                                       count\code{ = 0}}}
 612 Identical to the \function{subn()} function, using the compiled pattern.
 613 \end{methoddesc}
 614
 615
 616 \begin{memberdesc}[RegexObject]{flags}
 617 The flags argument used when the regex object was compiled, or
 618 \code{0} if no flags were provided.
 619 \end{memberdesc}
 620
 621 \begin{memberdesc}[RegexObject]{groupindex}
 622 A dictionary mapping any symbolic group names defined by
 623 \regexp{(?P<\var{id}>)} to group numbers.  The dictionary is empty if no
 624 symbolic groups were used in the pattern.
 625 \end{memberdesc}
 626
 627 \begin{memberdesc}[RegexObject]{pattern}
 628 The pattern string from which the regex object was compiled.
 629 \end{memberdesc}
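
A brief sketch of these attributes on a made-up pattern.  The flags
value is tested with a bitwise AND rather than printed, since its exact
integer value is an implementation detail.

```python
import re

name = re.compile(r'(?P<first>\w+) (?P<last>\w+)', re.I)

# groupindex maps each symbolic name to its group number.
index = dict(name.groupindex)
print(index == {'first': 1, 'last': 2})   # prints: True

# pattern is the original pattern string.
print(name.pattern == r'(?P<first>\w+) (?P<last>\w+)')   # prints: True

# flags records the flags in effect for the compiled object.
ignore_case = bool(name.flags & re.I)
print(ignore_case)   # prints: True
```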
 630
 631
 632 \subsection{Match Objects \label{match-objects}}
 633
 634 \class{MatchObject} instances support the following methods and attributes:
 635
 636 \begin{methoddesc}[MatchObject]{expand}{template}
 637  Return the string obtained by doing backslash substitution on the
 638 template string \var{template}, as done by the \method{sub()} method.
 639 Escapes such as \samp{\e n} are converted to the appropriate
 640 characters, and numeric backreferences (\samp{\e 1}, \samp{\e 2}) and named
 641 backreferences (\samp{\e g<1>}, \samp{\e g<name>}) are replaced by the contents of the
 642 corresponding group.
 643 \end{methoddesc}
 644
 645 \begin{methoddesc}[MatchObject]{group}{\optional{group1, \moreargs}}
 646 Returns one or more subgroups of the match.  If there is a single
 647 argument, the result is a single string; if there are
 648 multiple arguments, the result is a tuple with one item per argument.
 649 Without arguments, \var{group1} defaults to zero (i.e. the whole match
 650 is returned).
 651 If a \var{groupN} argument is zero, the corresponding return value is the
 652 entire matching string; if it is in the inclusive range [1..99], it is
 653 the string matching the corresponding parenthesized group.  If a
 654 group number is negative or larger than the number of groups defined
 655 in the pattern, an \exception{IndexError} exception is raised.
 656 If a group is contained in a part of the pattern that did not match,
 657 the corresponding result is \code{None}.  If a group is contained in a
 658 part of the pattern that matched multiple times, the last match is
 659 returned.
 660
 661 If the regular expression uses the \regexp{(?P<\var{name}>...)} syntax,
 662 the \var{groupN} arguments may also be strings identifying groups by
 663 their group name.  If a string argument is not used as a group name in
 664 the pattern, an \exception{IndexError} exception is raised.
 665
 666 A moderately complicated example:
 667
 668 \begin{verbatim}
 669 m = re.match(r"(?P<int>\d+)\.(\d*)", '3.14')
 670 \end{verbatim}
 671
 672 After performing this match, \code{m.group(1)} is \code{'3'}, as is
 673 \code{m.group('int')}, and \code{m.group(2)} is \code{'14'}.
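
The remaining cases (several arguments, a group that did not
participate, a repeated group, and an invalid group number) can be
sketched as follows; the patterns and strings are illustrative
assumptions:

\begin{verbatim}
import re

m = re.match(r'(\w+)@(\w+)(\.com)?', 'user@example')

whole = m.group(0)        # 'user@example'
pair = m.group(1, 2)      # ('user', 'example')
optional = m.group(3)     # None: group 3 took no part in the match

# A group that matched several times reports only its last match.
last_digit = re.match(r'(\d)+', '123').group(1)    # '3'

# The pattern defines only three groups, so 4 is out of range.
try:
    m.group(4)
    out_of_range = False
except IndexError:
    out_of_range = True
\end{verbatim}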
 674 \end{methoddesc}
 675
 676 \begin{methoddesc}[MatchObject]{groups}{\optional{default}}
 677 Return a tuple containing all the subgroups of the match, from 1 up to
 678 however many groups are in the pattern.  The \var{default} argument is
 679 used for groups that did not participate in the match; it defaults to
 680 \code{None}.  (Incompatibility note: in the original Python 1.5
 681 release, if the tuple was one element long, a string would be returned
 682 instead.  In later versions (from 1.5.1 on), a singleton tuple is
 683 returned in such cases.)
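
A small sketch of the \var{default} argument, assuming an illustrative
pattern whose second group is optional:

\begin{verbatim}
import re

m = re.match(r'(\d+)\.?(\d+)?', '24')

plain = m.groups()          # ('24', None)
filled = m.groups('0')      # ('24', '0')
\end{verbatim}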
 684 \end{methoddesc}
 685
 686 \begin{methoddesc}[MatchObject]{groupdict}{\optional{default}}
 687 Return a dictionary containing all the \emph{named} subgroups of the
 688 match, keyed by the subgroup name.  The \var{default} argument is
 689 used for groups that did not participate in the match; it defaults to
 690 \code{None}.
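
For example, assuming an illustrative pattern with a named key and an
optional named value (the unnamed group around the value does not
appear in the dictionary):

\begin{verbatim}
import re

m = re.match(r'(?P<key>\w+)(=(?P<value>\w+))?', 'debug')

plain = m.groupdict()
# plain == {'key': 'debug', 'value': None}

filled = m.groupdict('')
# filled == {'key': 'debug', 'value': ''}
\end{verbatim}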
 691 \end{methoddesc}
 692
 693 \begin{methoddesc}[MatchObject]{start}{\optional{group}}
 694 \funcline{end}{\optional{group}}
 695 Return the indices of the start and end of the substring
 696 matched by \var{group}; \var{group} defaults to zero (meaning the whole
 697 matched substring).
 698 Return \code{-1} if \var{group} exists but
 699 did not contribute to the match.  For a match object
 700 \var{m}, and a group \var{g} that did contribute to the match, the
 701 substring matched by group \var{g} (equivalent to
 702 \code{\var{m}.group(\var{g})}) is
 703
 704 \begin{verbatim}
 705 m.string[m.start(g):m.end(g)]
 706 \end{verbatim}
 707
 708 Note that
 709 \code{m.start(\var{group})} will equal \code{m.end(\var{group})} if
 710 \var{group} matched a null string.  For example, after \code{\var{m} =
 711 re.search('b(c?)', 'cba')}, \code{\var{m}.start(0)} is 1,
 712 \code{\var{m}.end(0)} is 2, \code{\var{m}.start(1)} and
 713 \code{\var{m}.end(1)} are both 2, and \code{\var{m}.start(2)} raises
 714 an \exception{IndexError} exception.
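
One possible use of these indices is to cut the matched text out of
the original string.  The sketch below assumes an illustrative address
and marker text, and also shows the \code{-1} result for a group that
did not contribute:

\begin{verbatim}
import re

email = 'tony@tiremove_thisger.net'
m = re.search('remove_this', email)
cleaned = email[:m.start()] + email[m.end():]
# cleaned == 'tony@tiger.net'

# Group 1 exists but takes no part in this match, so -1 is returned.
alt = re.match(r'(a)|(b)', 'b')
unused = (alt.start(1), alt.end(1))    # (-1, -1)
used = (alt.start(2), alt.end(2))      # (0, 1)
\end{verbatim}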
 715 \end{methoddesc}
 716
 717 \begin{methoddesc}[MatchObject]{span}{\optional{group}}
 718 For \class{MatchObject} \var{m}, return the 2-tuple
 719 \code{(\var{m}.start(\var{group}), \var{m}.end(\var{group}))}.
 720 Note that if \var{group} did not contribute to the match, this is
 721 \code{(-1, -1)}.  Again, \var{group} defaults to zero.
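
A minimal sketch, assuming an illustrative pattern whose second group
is optional and absent from the input:

\begin{verbatim}
import re

m = re.search(r'(\d+)-(x)?', 'ab 42-')

whole = m.span()       # (3, 6)
digits = m.span(1)     # (3, 5)
absent = m.span(2)     # (-1, -1)
\end{verbatim}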
 722 \end{methoddesc}
 723
 724 \begin{memberdesc}[MatchObject]{pos}
 725 The value of \var{pos} which was passed to the \method{search()} or
 726 \method{match()} method of the \class{RegexObject}.  This is the index into
 727 the string at which the regex engine started looking for a match.
 728 \end{memberdesc}
 729
 730 \begin{memberdesc}[MatchObject]{endpos}
 731 The value of \var{endpos} which was passed to the \method{search()} or
 732 \method{match()} method of the \class{RegexObject}.  This is the index into
 733 the string beyond which the regex engine will not go.
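
The \member{pos} and \member{endpos} attributes can be sketched
together.  The pattern, the string and the two indices below are
illustrative assumptions:

\begin{verbatim}
import re

digits = re.compile(r'\d+')
text = 'abc 123 456'

# Restrict the search to text[5:9], which is '23 4'.
m = digits.search(text, 5, 9)
found = m.group()                  # '23'
window = (m.pos, m.endpos)         # (5, 9)

# Without explicit bounds the whole string is searched.
full = digits.search(text)
default_window = (full.pos, full.endpos)    # (0, 11)
\end{verbatim}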
 734 \end{memberdesc}
 735
 736 \begin{memberdesc}[MatchObject]{lastgroup}
 737 The name of the last matched capturing group, or \code{None} if the
 738 group didn't have a name, or if no group was matched at all.
 739 \end{memberdesc}
 740
 741 \begin{memberdesc}[MatchObject]{lastindex}
 742 The integer index of the last matched capturing group, or \code{None}
 743 if no group was matched at all.
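
The relationship between \member{lastindex} and \member{lastgroup} can
be sketched with an illustrative pattern in which only the first
alternative is named:

\begin{verbatim}
import re

pat = re.compile(r'(?P<word>[a-z]+)|(\d+)')

named = pat.match('abc')
named_result = (named.lastindex, named.lastgroup)        # (1, 'word')

# Group 2 matched, but it has no name.
unnamed = pat.match('2024')
unnamed_result = (unnamed.lastindex, unnamed.lastgroup)  # (2, None)

# A pattern without groups leaves both attributes as None.
bare = re.match(r'[a-z]+', 'abc')
bare_result = (bare.lastindex, bare.lastgroup)           # (None, None)
\end{verbatim}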
 744 \end{memberdesc}
 745
 746 \begin{memberdesc}[MatchObject]{re}
 747 The regular expression object whose \method{match()} or
 748 \method{search()} method produced this \class{MatchObject} instance.
 749 \end{memberdesc}
 750
 751 \begin{memberdesc}[MatchObject]{string}
 752 The string passed to \function{match()} or \function{search()}.
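
A short sketch showing how \member{re} and \member{string} refer back
to the objects that produced the match; the pattern and input are
illustrative assumptions:

\begin{verbatim}
import re

pat = re.compile(r'\w+')
m = pat.match('hello world')

same_pattern = (m.re is pat)                # True
subject = m.string                          # 'hello world'
matched = m.string[m.start():m.end()]       # 'hello'
\end{verbatim}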
 753 \end{memberdesc}
 754
 755 \begin{seealso}
 756 \seetext{Jeffrey Friedl, \citetitle{Mastering Regular Expressions},
 757 O'Reilly.  The Python material in this book dates from before the
 758 \module{re} module, but it covers writing good regular expression
 759 patterns in great detail.}
 760 \end{seealso}
 761