Doc/lib/libre.tex

   1 \section{\module{re} ---
   2          Perl-style regular expression operations.}
   3 \declaremodule{standard}{re}
   4 \moduleauthor{Andrew M. Kuchling}{amk1@bigfoot.com}
   5 \sectionauthor{Andrew M. Kuchling}{amk1@bigfoot.com}
   6
   7
   8 \modulesynopsis{Perl-style regular expression search and match
   9 operations.}
  10
  11
  12 This module provides regular expression matching operations similar to
  13 those found in Perl.  It's 8-bit clean: the strings being processed
  14 may contain both null bytes and characters whose high bit is set.  Regular
  15 expression pattern strings may not contain null bytes, but can specify
  16 the null byte using the \code{\e\var{number}} notation.
  17 Characters with the high bit set may be included.  The \module{re}
  18 module is always available.
  19
  20 Regular expressions use the backslash character (\character{\e}) to
  21 indicate special forms or to allow special characters to be used
  22 without invoking their special meaning.  This collides with Python's
  23 usage of the same character for the same purpose in string literals;
  24 for example, to match a literal backslash, one might have to write
  25 \code{'\e\e\e\e'} as the pattern string, because the regular expression
  26 must be \samp{\e\e}, and each backslash must be expressed as
  27 \samp{\e\e} inside a regular Python string literal.
  28
  29 The solution is to use Python's raw string notation for regular
  30 expression patterns; backslashes are not handled in any special way in
  31 a string literal prefixed with \character{r}.  So \code{r"\e n"} is a
  32 two-character string containing \character{\e} and \character{n},
  33 while \code{"\e n"} is a one-character string containing a newline.
  34 Usually patterns will be expressed in Python code using this raw
  35 string notation.
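
The difference between the two notations can be checked directly.  The
following is a minimal sketch, assuming a current Python 3 interpreter;
the variable names are illustrative only.

```python
import re

# Both literals spell the same two-character pattern: backslash, backslash.
plain_pattern = '\\\\'
raw_pattern = r'\\'
same_pattern = plain_pattern == raw_pattern

# That pattern matches exactly one literal backslash in the subject.
subject = 'C:\\temp'              # the seven characters  C:\temp
found = re.search(raw_pattern, subject)
backslash_index = found.start()   # 2

# r"\n" holds two characters, "\n" holds one (a newline).
lengths = (len(r"\n"), len("\n"))  # (2, 1)
```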
  36
  37 \subsection{Regular Expression Syntax \label{re-syntax}}
  38
  39 A regular expression (or RE) specifies a set of strings that matches
  40 it; the functions in this module let you check if a particular string
  41 matches a given regular expression (or if a given regular expression
  42 matches a particular string, which comes down to the same thing).
  43
  44 Regular expressions can be concatenated to form new regular
  45 expressions; if \emph{A} and \emph{B} are both regular expressions,
  46 then \emph{AB} is also a regular expression.  If a string \emph{p}
  47 matches A and another string \emph{q} matches B, the string \emph{pq}
  48 will match AB.  Thus, complex expressions can easily be constructed
  49 from simpler primitive expressions like the ones described here.  For
  50 details of the theory and implementation of regular expressions,
  51 consult the Friedl book referenced below, or almost any textbook about
  52 compiler construction.
  53
  54 A brief explanation of the format of regular expressions follows.  For
  55 further information and a gentler presentation, consult the Regular
  56 Expression HOWTO, accessible from \url{http://www.python.org/doc/howto/}.
  57
  58 Regular expressions can contain both special and ordinary characters.
  59 Most ordinary characters, like \character{A}, \character{a}, or \character{0},
  60 are the simplest regular expressions; they simply match themselves.
  61 You can concatenate ordinary characters, so \regexp{last} matches the
  62 string \code{'last'}.  (In the rest of this section, we'll write REs in
  63 \regexp{this special style}, usually without quotes, and strings to be
  64 matched \code{'in single quotes'}.)
  65
  66 Some characters, like \character{|} or \character{(}, are special.  Special
  67 characters either stand for classes of ordinary characters, or affect
  68 how the regular expressions around them are interpreted.
  69
  70 The special characters are:
  71
  72 \begin{list}{}{\leftmargin 0.7in \labelwidth 0.65in}
  73
  74 \item[\character{.}] (Dot.)  In the default mode, this matches any
  75 character except a newline.  If the \constant{DOTALL} flag has been
  76 specified, this matches any character including a newline.
  77
  78 \item[\character{\^}] (Caret.)  Matches the start of the string, and in
  79 \constant{MULTILINE} mode also matches immediately after each newline.
  80
  81 \item[\character{\$}] Matches the end of the string, and in
  82 \constant{MULTILINE} mode also matches before a newline.
  83 \regexp{foo} matches both 'foo' and 'foobar', while the regular
  84 expression \regexp{foo\$} matches only 'foo'.
  85
  86 \item[\character{*}] Causes the resulting RE to
  87 match 0 or more repetitions of the preceding RE, as many repetitions
  88 as are possible.  \regexp{ab*} will
  89 match 'a', 'ab', or 'a' followed by any number of 'b's.
  90
  91 \item[\character{+}] Causes the
  92 resulting RE to match 1 or more repetitions of the preceding RE.
  93 \regexp{ab+} will match 'a' followed by any non-zero number of 'b's; it
  94 will not match just 'a'.
  95
  96 \item[\character{?}] Causes the resulting RE to
  97 match 0 or 1 repetitions of the preceding RE.  \regexp{ab?} will
  98 match either 'a' or 'ab'.
  99 \item[\code{*?}, \code{+?}, \code{??}] The \character{*}, \character{+}, and
 100 \character{?} qualifiers are all \dfn{greedy}; they match as much text as
 101 possible.  Sometimes this behaviour isn't desired; if the RE
 102 \regexp{<.*>} is matched against \code{'<H1>title</H1>'}, it will match the
 103 entire string, and not just \code{'<H1>'}.
 104 Adding \character{?} after the qualifier makes it perform the match in
 105 \dfn{non-greedy} or \dfn{minimal} fashion; as \emph{few} characters as
 106 possible will be matched.  Using \regexp{.*?} in the previous
 107 expression will match only \code{'<H1>'}.
 108
 109 \item[\code{\{\var{m},\var{n}\}}] Causes the resulting RE to match from
 110 \var{m} to \var{n} repetitions of the preceding RE, attempting to
 111 match as many repetitions as possible.  For example, \regexp{a\{3,5\}}
 112 will match from 3 to 5 \character{a} characters.  Omitting \var{n}
 113 specifies an infinite upper bound; you can't omit \var{m}.
 114
 115 \item[\code{\{\var{m},\var{n}\}?}] Causes the resulting RE to
 116 match from \var{m} to \var{n} repetitions of the preceding RE,
 117 attempting to match as \emph{few} repetitions as possible.  This is
 118 the non-greedy version of the previous qualifier.  For example, on the
 119 6-character string \code{'aaaaaa'}, \regexp{a\{3,5\}} will match 5
 120 \character{a} characters, while \regexp{a\{3,5\}?} will only match 3
 121 characters.
 122
 123 \item[\character{\e}] Either escapes special characters (permitting
 124 you to match characters like \character{*}, \character{?}, and so
 125 forth), or signals a special sequence; special sequences are discussed
 126 below.
 127
 128 If you're not using a raw string to
 129 express the pattern, remember that Python also uses the
 130 backslash as an escape sequence in string literals; if the escape
 131 sequence isn't recognized by Python's parser, the backslash and
 132 subsequent character are included in the resulting string.  However,
 133 if Python would recognize the resulting sequence, the backslash should
 134 be doubled.  This is complicated and hard to understand, so
 135 it's highly recommended that you use raw strings for all but the
 136 simplest expressions.
 137
 138 \item[\code{[]}] Used to indicate a set of characters.  Characters can
 139 be listed individually, or a range of characters can be indicated by
 140 giving two characters and separating them by a \character{-}.  Special
 141 characters are not active inside sets.  For example, \regexp{[akm\$]}
 142 will match any of the characters \character{a}, \character{k},
 143 \character{m}, or \character{\$}; \regexp{[a-z]}
 144 will match any lowercase letter, and \code{[a-zA-Z0-9]} matches any
 145 letter or digit.  Character classes such as \code{\e w} or \code{\e S}
 146 (defined below) are also acceptable inside a set.  If you want to
 147 include a \character{]} or a \character{-} inside a set, precede it with a
 148 backslash, or place it as the first character.  The
 149 pattern \regexp{[]]} will match \code{']'}, for example.
 150
 151 You can match the characters not within a range by \dfn{complementing}
 152 the set.  This is indicated by including a
 153 \character{\^} as the first character of the set; \character{\^} elsewhere will
 154 simply match the \character{\^} character.  For example, \regexp{[{\^}5]}
 155 will match any character except \character{5}.
 156
 157 \item[\character{|}]\code{A|B}, where A and B can be arbitrary REs,
 158 creates a regular expression that will match either A or B.  This can
 159 be used inside groups (see below) as well.  To match a literal \character{|},
 160 use \regexp{\e|}, or enclose it inside a character class, as in  \regexp{[|]}.
 161
 162 \item[\code{(...)}] Matches whatever regular expression is inside the
 163 parentheses, and indicates the start and end of a group; the contents
 164 of a group can be retrieved after a match has been performed, and can
 165 be matched later in the string with the \regexp{\e \var{number}} special
 166 sequence, described below.  To match the literals \character{(} or
 167 \character{)}, use \regexp{\e(} or \regexp{\e)}, or enclose them
 168 inside a character class: \regexp{[(] [)]}.
 169
 170 \item[\code{(?...)}] This is an extension notation (a \character{?}
 171 following a \character{(} is not meaningful otherwise).  The first
 172 character after the \character{?}
 173 determines what the meaning and further syntax of the construct is.
 174 Extensions usually do not create a new group;
 175 \regexp{(?P<\var{name}>...)} is the only exception to this rule.
 176 Following are the currently supported extensions.
 177
 178 \item[\code{(?iLmsux)}] (One or more letters from the set \character{i},
 179 \character{L}, \character{m}, \character{s}, \character{u},
 180 \character{x}.)  The group matches the empty string; the letters set
 181 the corresponding flags (\constant{re.I}, \constant{re.L},
 182 \constant{re.M}, \constant{re.S}, \constant{re.U}, \constant{re.X})
 183 for the entire regular expression.  This is useful if you wish to
 184 include the flags as part of the regular expression, instead of
 185 passing a \var{flags} argument to the \function{compile()} function.
 186
 187 \item[\code{(?:...)}] A non-grouping version of regular parentheses.
 188 Matches whatever regular expression is inside the parentheses, but the
 189 substring matched by the
 190 group \emph{cannot} be retrieved after performing a match or
 191 referenced later in the pattern.
 192
 193 \item[\code{(?P<\var{name}>...)}] Similar to regular parentheses, but
 194 the substring matched by the group is accessible via the symbolic group
 195 name \var{name}.  Group names must be valid Python identifiers.  A
 196 symbolic group is also a numbered group, just as if the group were not
 197 named.  So the group named 'id' in the example below can also be
 198 referenced as the numbered group 1.
 199
 200 For example, if the pattern is
 201 \regexp{(?P<id>[a-zA-Z_]\e w*)}, the group can be referenced by its
 202 name in arguments to methods of match objects, such as \code{m.group('id')}
 203 or \code{m.end('id')}, and also by name in pattern text
 204 (e.g. \regexp{(?P=id)}) and replacement text (e.g. \code{\e g<id>}).
 205
 206 \item[\code{(?P=\var{name})}] Matches whatever text was matched by the
 207 earlier group named \var{name}.
 208
 209 \item[\code{(?\#...)}] A comment; the contents of the parentheses are
 210 simply ignored.
 211
 212 \item[\code{(?=...)}] Matches if \regexp{...} matches next, but doesn't
 213 consume any of the string.  This is called a lookahead assertion.  For
 214 example, \regexp{Isaac (?=Asimov)} will match \code{'Isaac~'} only if it's
 215 followed by \code{'Asimov'}.
 216
 217 \item[\code{(?!...)}] Matches if \regexp{...} doesn't match next.  This
 218 is a negative lookahead assertion.  For example,
 219 \regexp{Isaac (?!Asimov)} will match \code{'Isaac~'} only if it's \emph{not}
 220 followed by \code{'Asimov'}.
 221
 222 \end{list}
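
Several of the constructs listed above can be tried out in a few lines.
This is only a sketch, assuming a current Python 3 interpreter; the
subject strings are taken from the descriptions above where possible,
and the variable names are illustrative.

```python
import re

# Greedy versus non-greedy repetition.
html = '<H1>title</H1>'
greedy = re.match(r'<.*>', html).group(0)      # the entire string
minimal = re.match(r'<.*?>', html).group(0)    # '<H1>'

# Counted repetition on the 6-character string 'aaaaaa'.
longest = re.match(r'a{3,5}', 'aaaaaa').group(0)     # 'aaaaa'
shortest = re.match(r'a{3,5}?', 'aaaaaa').group(0)   # 'aaa'

# A named group is also available as numbered group 1.
m = re.match(r'(?P<id>[a-zA-Z_]\w*)', 'spam_1 = 2')
by_name, by_number = m.group('id'), m.group(1)

# Lookahead assertions test what follows without consuming it.
positive = re.match(r'Isaac (?=Asimov)', 'Isaac Asimov')
negative = re.match(r'Isaac (?!Asimov)', 'Isaac Asimov')   # None
other = re.match(r'Isaac (?!Asimov)', 'Isaac Newton')
```

Note that \code{positive.group(0)} is \code{'Isaac '} only; the text
inspected by the assertion is not part of the match.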
 223
 224 The special sequences consist of \character{\e} and a character from the
 225 list below.  If the ordinary character is not on the list, then the
 226 resulting RE will match the second character.  For example,
 227 \regexp{\e\$} matches the character \character{\$}.
 228
 229 \begin{list}{}{\leftmargin 0.7in \labelwidth 0.65in}
 230
 231 \item[\code{\e \var{number}}] Matches the contents of the group of the
 232 same number.  Groups are numbered starting from 1.  For example,
 233 \regexp{(.+) \e 1} matches \code{'the the'} or \code{'55 55'}, but not
 234 \code{'the end'} (note
 235 the space after the group).  This special sequence can only be used to
 236 match one of the first 99 groups.  If the first digit of \var{number}
 237 is 0, or \var{number} is 3 octal digits long, it will not be interpreted
 238 as a group match, but as the character with octal value \var{number}.
 239 Inside the \character{[} and \character{]} of a character class, all numeric
 240 escapes are treated as characters.
 241
 242 \item[\code{\e A}] Matches only at the start of the string.
 243
 244 \item[\code{\e b}] Matches the empty string, but only at the
 245 beginning or end of a word.  A word is defined as a sequence of
 246 alphanumeric characters, so the end of a word is indicated by
 247 whitespace or a non-alphanumeric character.  Inside a character range,
 248 \regexp{\e b} represents the backspace character, for compatibility with
 249 Python's string literals.
 250
 251 \item[\code{\e B}] Matches the empty string, but only when it is
 252 \emph{not} at the beginning or end of a word.
 253
 254 \item[\code{\e d}]Matches any decimal digit; this is
 255 equivalent to the set \regexp{[0-9]}.
 256
 257 \item[\code{\e D}]Matches any non-digit character; this is
 258 equivalent to the set \regexp{[{\^}0-9]}.
 259
 260 \item[\code{\e s}]Matches any whitespace character; this is
 261 equivalent to the set \regexp{[ \e t\e n\e r\e f\e v]}.
 262
 263 \item[\code{\e S}]Matches any non-whitespace character; this is
 264 equivalent to the set \regexp{[\^\ \e t\e n\e r\e f\e v]}.
 265
 266 \item[\code{\e w}]When the \constant{LOCALE} and \constant{UNICODE}
 267 flags are not specified,
 268 matches any alphanumeric character; this is equivalent to the set
 269 \regexp{[a-zA-Z0-9_]}.  With \constant{LOCALE}, it will match the set
 270 \regexp{[0-9_]} plus whatever characters are defined as letters for
 271 the current locale.  If \constant{UNICODE} is set, this will match the
 272 characters \regexp{[0-9_]} plus whatever is classified as alphanumeric
 273 in the Unicode character properties database.
 274
 275 \item[\code{\e W}]When the \constant{LOCALE} and \constant{UNICODE}
 276 flags are not specified, matches any non-alphanumeric character; this
 277 is equivalent to the set \regexp{[{\^}a-zA-Z0-9_]}.   With
 278 \constant{LOCALE}, it will match any character not in the set
 279 \regexp{[0-9_]}, and not defined as a letter for the current locale.
 280 If \constant{UNICODE} is set, this will match anything other than
 281 \regexp{[0-9_]} and characters marked as alphanumeric in the Unicode
 282 character properties database.
 283
 284 \item[\code{\e Z}]Matches only at the end of the string.
 285
 286 \item[\code{\e \e}] Matches a literal backslash.
 287
 288 \end{list}
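
The special sequences can be illustrated with a short sketch, assuming a
current Python 3 interpreter and the default flags; the sample strings
and variable names are made up for the illustration.

```python
import re

# \1 must repeat exactly what group 1 matched.
repeated = re.match(r'(.+) \1', 'the the')
not_repeated = re.match(r'(.+) \1', 'the end')     # None

# \b matches at word boundaries, \B everywhere else.
text = 'concatenate cat'
whole_word = re.search(r'\bcat\b', text)     # the separate word
inside_word = re.search(r'\Bcat\B', text)    # the 'cat' inside the word

# \d and \w select digits and alphanumerics (plus underscore).
digits = re.findall(r'\d+', 'room 101, floor 3')
words = re.findall(r'\w+', 'spam_1, eggs!')
```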
 289
 290
 291 \subsection{Matching vs. Searching \label{matching-searching}}
 292 \sectionauthor{Fred L. Drake, Jr.}{fdrake@acm.org}
 293
 294 Python offers two different primitive operations based on regular
 295 expressions: match and search.  If you are accustomed to Perl's
 296 semantics, the search operation is what you're looking for.  See the
 297 \function{search()} function and corresponding method of compiled
 298 regular expression objects.
 299
 300 Note that match may differ from search using a regular expression
 301 beginning with \character{\^}: \character{\^} matches only at the
 302 start of the string, or in \constant{MULTILINE} mode also immediately
 303 following a newline.  The ``match'' operation succeeds only if the
 304 pattern matches at the start of the string regardless of mode, or at
 305 the starting position given by the optional \var{pos} argument
 306 regardless of whether a newline precedes it.
 307
 308 % Examples from Tim Peters:
 309 \begin{verbatim}
 310 re.compile("a").match("ba", 1)           # succeeds
 311 re.compile("^a").search("ba", 1)         # fails; 'a' not at start
 312 re.compile("^a").search("\na", 1)        # fails; 'a' not at start
 313 re.compile("^a", re.M).search("\na", 1)  # succeeds
 314 re.compile("^a", re.M).search("ba", 1)   # fails; no preceding \n
 315 \end{verbatim}
 316
 317
 318 \subsection{Module Contents}
 319 \nodename{Contents of Module re}
 320
 321 The module defines the following functions and constants, and an exception:
 322
 323
 324 \begin{funcdesc}{compile}{pattern\optional{, flags}}
 325   Compile a regular expression pattern into a regular expression
 326   object, which can be used for matching using its \function{match()} and
 327   \function{search()} methods, described below.
 328
 329   The expression's behaviour can be modified by specifying a
 330   \var{flags} value.  Values can be any of the following variables,
 331   combined using bitwise OR (the \code{|} operator).
 332
 333 The sequence
 334
 335 \begin{verbatim}
 336 prog = re.compile(pat)
 337 result = prog.match(str)
 338 \end{verbatim}
 339
 340 is equivalent to
 341
 342 \begin{verbatim}
 343 result = re.match(pat, str)
 344 \end{verbatim}
 345
 346 but the version using \function{compile()} is more efficient when the
 347 expression will be used several times in a single program.
 348 %(The compiled version of the last pattern passed to
 349 %\function{regex.match()} or \function{regex.search()} is cached, so
 350 %programs that use only a single regular expression at a time needn't
 351 %worry about compiling regular expressions.)
 352 \end{funcdesc}
 353
 354 \begin{datadesc}{I}
 355 \dataline{IGNORECASE}
 356 Perform case-insensitive matching; expressions like \regexp{[A-Z]} will match
 357 lowercase letters, too.  This is not affected by the current locale.
 358 \end{datadesc}
 359
 360 \begin{datadesc}{L}
 361 \dataline{LOCALE}
 362 Make \regexp{\e w}, \regexp{\e W}, \regexp{\e b}, and
 363 \regexp{\e B} dependent on the current locale.
 364 \end{datadesc}
 365
 366 \begin{datadesc}{M}
 367 \dataline{MULTILINE}
 368 When specified, the pattern character \character{\^} matches at the
 369 beginning of the string and at the beginning of each line
 370 (immediately following each newline); and the pattern character
 371 \character{\$} matches at the end of the string and at the end of each line
 372 (immediately preceding each newline).
 373 By default, \character{\^} matches only at the beginning of the string, and
 374 \character{\$} only at the end of the string and immediately before the
 375 newline (if any) at the end of the string.
 376 \end{datadesc}
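
The effect of \constant{MULTILINE} on \character{\^} and \character{\$}
can be sketched as follows, assuming a current Python 3 interpreter.  The
flag is passed to \function{compile()}; the sample text and names are
illustrative.

```python
import re

text = 'first\nsecond\n'

# Without the flag, '^' anchors only at the start of the string and '$'
# only at the end (or just before a final newline).
default_starts = re.compile(r'^\w+').findall(text)
default_ends = re.compile(r'\w+$').findall(text)

# With the flag, both anchors also apply at every line.
multiline_starts = re.compile(r'^\w+', re.M).findall(text)
multiline_ends = re.compile(r'\w+$', re.M).findall(text)
```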
 377
 378 \begin{datadesc}{S}
 379 \dataline{DOTALL}
 380 Make the \character{.} special character match any character at all,
 381 including a newline; without this flag, \character{.} will match
 382 anything \emph{except} a newline.
 383 \end{datadesc}
 384
 385 \begin{datadesc}{U}
 386 \dataline{UNICODE}
 387 Make \regexp{\e w}, \regexp{\e W}, \regexp{\e b}, and
 388 \regexp{\e B} dependent on the Unicode character properties database.
 389 \versionadded{2.0}
 390 \end{datadesc}
 391
 392 \begin{datadesc}{X}
 393 \dataline{VERBOSE}
 394 This flag allows you to write regular expressions that look nicer.
 395 Whitespace within the pattern is ignored,
 396 except when in a character class or preceded by an unescaped
 397 backslash, and, when a line contains a \character{\#} neither in a character
 398 class nor preceded by an unescaped backslash, all characters from the
 399 leftmost such \character{\#} through the end of the line are ignored.
 400 % XXX should add an example here
 401 \end{datadesc}
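
As an illustration of \constant{VERBOSE}, the two patterns below should
behave identically.  This is a sketch assuming a current Python 3
interpreter; the group names \code{int} and \code{frac} are chosen only
for this example.

```python
import re

# The same pattern written twice: once spread out with comments,
# once in compact form.
verbose = re.compile(r"""
    (?P<int> \d+ )    # integer part
    \.                # a literal dot
    (?P<frac> \d* )   # fractional digits, possibly none
    """, re.VERBOSE)
compact = re.compile(r"(?P<int>\d+)\.(?P<frac>\d*)")

m = verbose.match('3.14')
same_groups = m.groups() == compact.match('3.14').groups()

# Flags are combined with the bitwise OR operator.
loose = re.compile(r" spam \s+ eggs ", re.VERBOSE | re.IGNORECASE)
found = loose.match('SPAM   Eggs')
```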
 402
 403
 404 \begin{funcdesc}{search}{pattern, string\optional{, flags}}
 405   Scan through \var{string} looking for a location where the regular
 406   expression \var{pattern} produces a match, and return a
 407   corresponding \class{MatchObject} instance.
 408   Return \code{None} if no
 409   position in the string matches the pattern; note that this is
 410   different from finding a zero-length match at some point in the string.
 411 \end{funcdesc}
 412
 413 \begin{funcdesc}{match}{pattern, string\optional{, flags}}
 414   If zero or more characters at the beginning of \var{string} match
 415   the regular expression \var{pattern}, return a corresponding
 416   \class{MatchObject} instance.  Return \code{None} if the string does not
 417   match the pattern; note that this is different from a zero-length
 418   match.
 419
 420   \strong{Note:}  If you want to locate a match anywhere in
 421   \var{string}, use \method{search()} instead.
 422 \end{funcdesc}
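
The distinction between no match and a zero-length match, and between
\function{search()} and \function{match()}, can be sketched like this
(assuming a current Python 3 interpreter; names are illustrative):

```python
import re

# No position matches: the result is None.
no_match = re.search(r'x+', 'abc')

# A zero-length match is still a match object.
empty_match = re.search(r'x*', 'abc')
empty_span = empty_match.span()          # (0, 0)

# match() only tries the beginning of the string; search() scans.
at_start = re.match(r'b', 'abc')         # None
anywhere = re.search(r'b', 'abc')
```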
 423
 424 \begin{funcdesc}{split}{pattern, string\optional{, maxsplit\code{ = 0}}}
 425   Split \var{string} by the occurrences of \var{pattern}.  If
 426   capturing parentheses are used in \var{pattern}, then the text of all
 427   groups in the pattern are also returned as part of the resulting list.
 428   If \var{maxsplit} is nonzero, at most \var{maxsplit} splits
 429   occur, and the remainder of the string is returned as the final
 430   element of the list.  (Incompatibility note: in the original Python
 431   1.5 release, \var{maxsplit} was ignored.  This has been fixed in
 432   later releases.)
 433
 434 \begin{verbatim}
 435 >>> re.split(r'\W+', 'Words, words, words.')
 436 ['Words', 'words', 'words', '']
 437 >>> re.split(r'(\W+)', 'Words, words, words.')
 438 ['Words', ', ', 'words', ', ', 'words', '.', '']
 439 >>> re.split(r'\W+', 'Words, words, words.', 1)
 440 ['Words', 'words, words.']
 441 \end{verbatim}
 442
 443   This function combines and extends the functionality of
 444   the old \function{regsub.split()} and \function{regsub.splitx()}.
 445 \end{funcdesc}
 446
 447 \begin{funcdesc}{findall}{pattern, string}
 448 Return a list of all non-overlapping matches of \var{pattern} in
 449 \var{string}.  If one or more groups are present in the pattern,
 450 return a list of groups; this will be a list of tuples if the pattern
 451 has more than one group.  Empty matches are included in the result.
 452 \versionadded{1.5.2}
 453 \end{funcdesc}
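
How the shape of the result depends on the number of groups can be seen
in a small sketch (assuming a current Python 3 interpreter; the sample
text is invented):

```python
import re

text = 'width=80 height=24'

no_groups = re.findall(r'\w+=\d+', text)       # whole matches
one_group = re.findall(r'\w+=(\d+)', text)     # contents of the group
two_groups = re.findall(r'(\w+)=(\d+)', text)  # one tuple per match
```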
 454
 455 \begin{funcdesc}{sub}{pattern, repl, string\optional{, count\code{ = 0}}}
 456 Return the string obtained by replacing the leftmost non-overlapping
 457 occurrences of \var{pattern} in \var{string} by the replacement
 458 \var{repl}.  If the pattern isn't found, \var{string} is returned
 459 unchanged.  \var{repl} can be a string or a function; if a function,
 460 it is called for every non-overlapping occurrence of \var{pattern}.
 461 The function takes a single match object argument, and returns the
 462 replacement string.  For example:
 463
 464 \begin{verbatim}
 465 >>> def dashrepl(matchobj):
 466 ...     if matchobj.group(0) == '-': return ' '
 467 ...     else: return '-'
 468 >>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
 469 'pro--gram files'
 470 \end{verbatim}
 471
 472 The pattern may be a string or a
 473 regex object; if you need to specify
 474 regular expression flags, you must use a regex object, or use
 475 embedded modifiers in a pattern; e.g.
 476 \samp{sub("(?i)b+", "x", "bbbb BBBB")} returns \code{'x x'}.
 477
 478 The optional argument \var{count} is the maximum number of pattern
 479 occurrences to be replaced; \var{count} must be a non-negative integer, and
 480 the default value of 0 means to replace all occurrences.
 481
 482 Empty matches for the pattern are replaced only when not adjacent to a
 483 previous match, so \samp{sub('x*', '-', 'abc')} returns \code{'-a-b-c-'}.
 484
 485 If \var{repl} is a string, any backslash escapes in it are processed.
 486 That is, \samp{\e n} is converted to a single newline character,
 487 \samp{\e r} is converted to a carriage return, and so forth.  Unknown escapes
 488 such as \samp{\e j} are left alone.  Backreferences, such as \samp{\e 6}, are
 489 replaced with the substring matched by group 6 in the pattern.
 490
 491 In addition to character escapes and backreferences as described
 492 above, \samp{\e g<name>} will use the substring matched by the group
 493 named \samp{name}, as defined by the \regexp{(?P<name>...)} syntax.
 494 \samp{\e g<number>} uses the corresponding group number; \samp{\e
 495 g<2>} is therefore equivalent to \samp{\e 2}, but isn't ambiguous in a
 496 replacement such as \samp{\e g<2>0}.  \samp{\e 20} would be
 497 interpreted as a reference to group 20, not a reference to group 2
 498 followed by the literal character \character{0}.
 499 \end{funcdesc}
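
The replacement syntax can be exercised with a brief sketch, assuming a
current Python 3 interpreter.  The group names \code{first} and
\code{second} are hypothetical, and \var{count} is passed by keyword.

```python
import re

# \g<name> refers to a named group in the replacement text.
swapped = re.sub(r'(?P<first>\w+) (?P<second>\w+)',
                 r'\g<second> \g<first>', 'hello world')

# \g<1>0 is group 1 followed by a literal '0', not group 10.
padded = re.sub(r'(\d)', r'\g<1>0', 'a1b2')

# count limits how many occurrences are replaced.
limited = re.sub(r'-', '+', 'a-b-c-d', count=2)
```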
 500
 501 \begin{funcdesc}{subn}{pattern, repl, string\optional{, count\code{ = 0}}}
 502 Perform the same operation as \function{sub()}, but return a tuple
 503 \code{(\var{new_string}, \var{number_of_subs_made})}.
 504 \end{funcdesc}
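
A short sketch of the returned tuple, assuming a current Python 3
interpreter (the input string is invented):

```python
import re

# Collapse each run of whitespace into a single space and count the runs.
new_string, number_of_subs_made = re.subn(r'\s+', ' ', 'a  b \t c')
```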
 505
 506 \begin{funcdesc}{escape}{string}
 507   Return \var{string} with all non-alphanumerics backslashed; this is
 508   useful if you want to match an arbitrary literal string that may have
 509   regular expression metacharacters in it.
 510 \end{funcdesc}
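
The following sketch shows why escaping matters, assuming a current
Python 3 interpreter.  The exact text returned by \function{escape()} may
differ between versions, so only its matching behaviour is relied on
here.

```python
import re

literal = 'cost: $5.00 (approx.)'

# Used directly as a pattern, '$', '.', '(' and ')' are special and the
# pattern does not match its own text.
unescaped = re.match(literal, literal)

# After escaping, the pattern matches the literal text exactly.
escaped = re.match(re.escape(literal), literal)
```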
 511
 512 \begin{excdesc}{error}
 513   Exception raised when a string passed to one of the functions here
 514   is not a valid regular expression (e.g., unmatched parentheses) or
 515   when some other error occurs during compilation or matching.  It is
 516   never an error if a string contains no match for a pattern.
 517 \end{excdesc}
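
A sketch of handling the exception, assuming a current Python 3
interpreter; the invalid pattern is made up for the illustration:

```python
import re

# An unmatched parenthesis is rejected when the pattern is compiled.
try:
    re.compile('(unclosed')
    outcome = 'compiled'
except re.error:
    outcome = 'rejected'

# Failing to find a match is not an error; the result is simply None.
no_match = re.search('z', 'abc')
```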
 518
 519
 520 \subsection{Regular Expression Objects \label{re-objects}}
 521
 522 Compiled regular expression objects support the following methods and
 523 attributes:
 524
 525 \begin{methoddesc}[RegexObject]{search}{string\optional{, pos\optional{,
 526                                         endpos}}}
 527   Scan through \var{string} looking for a location where this regular
 528   expression produces a match, and return a
 529   corresponding \class{MatchObject} instance.  Return \code{None} if no
 530   position in the string matches the pattern; note that this is
 531   different from finding a zero-length match at some point in the string.
 532
 533   The optional \var{pos} and \var{endpos} parameters have the same
 534   meaning as for the \method{match()} method.
 535 \end{methoddesc}
 536
 537 \begin{methoddesc}[RegexObject]{match}{string\optional{, pos\optional{,
 538                                        endpos}}}
 539   If zero or more characters at the beginning of \var{string} match
 540   this regular expression, return a corresponding
 541   \class{MatchObject} instance.  Return \code{None} if the string does not
 542   match the pattern; note that this is different from a zero-length
 543   match.
 544
 545   \strong{Note:}  If you want to locate a match anywhere in
 546   \var{string}, use \method{search()} instead.
 547
 548   The optional second parameter \var{pos} gives an index in the string
 549   where the search is to start; it defaults to \code{0}.  This is not
 550   completely equivalent to slicing the string; the \code{'\^'} pattern
 551   character matches at the real beginning of the string and at positions
 552   just after a newline, but not necessarily at the index where the search
 553   is to start.
 554
 555   The optional parameter \var{endpos} limits how far the string will
 556   be searched; it will be as if the string is \var{endpos} characters
 557   long, so only the characters from \var{pos} to \var{endpos} will be
 558   searched for a match.
 559 \end{methoddesc}
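
The difference between passing \var{pos} and slicing the string, and the
effect of \var{endpos}, can be sketched as follows (assuming a current
Python 3 interpreter; names are illustrative):

```python
import re

prog = re.compile(r'^a')

# In the slice, 'a' really is at the beginning of the string.
sliced = prog.search('ba'[1:])

# With pos=1 the search starts at index 1, but '^' still refers to the
# real beginning of the string, so nothing matches.
offset = prog.search('ba', 1)

# endpos makes the string behave as if it were only that long.
digits = re.compile(r'\d+')
limited = digits.search('ab12345', 0, 4)   # only 'ab12' is examined
```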
 560
 561 \begin{methoddesc}[RegexObject]{split}{string\optional{,
 562                                        maxsplit\code{ = 0}}}
 563 Identical to the \function{split()} function, using the compiled pattern.
 564 \end{methoddesc}
 565
 566 \begin{methoddesc}[RegexObject]{findall}{string}
 567 Identical to the \function{findall()} function, using the compiled pattern.
 568 \end{methoddesc}
 569
 570 \begin{methoddesc}[RegexObject]{sub}{repl, string\optional{, count\code{ = 0}}}
 571 Identical to the \function{sub()} function, using the compiled pattern.
 572 \end{methoddesc}
 573
 574 \begin{methoddesc}[RegexObject]{subn}{repl, string\optional{,
 575                                       count\code{ = 0}}}
 576 Identical to the \function{subn()} function, using the compiled pattern.
 577 \end{methoddesc}
 578
 579
 580 \begin{memberdesc}[RegexObject]{flags}
 581 The flags argument used when the regex object was compiled, or
 582 \code{0} if no flags were provided.
 583 \end{memberdesc}
 584
 585 \begin{memberdesc}[RegexObject]{groupindex}
 586 A dictionary mapping any symbolic group names defined by
 587 \regexp{(?P<\var{id}>)} to group numbers.  The dictionary is empty if no
 588 symbolic groups were used in the pattern.
 589 \end{memberdesc}
 590
 591 \begin{memberdesc}[RegexObject]{pattern}
 592 The pattern string from which the regex object was compiled.
 593 \end{memberdesc}
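
These attributes can be inspected as in the sketch below, assuming a
current Python 3 interpreter.  The group names are hypothetical, and
\code{dict()} is applied only to get a plain dictionary for comparison.

```python
import re

prog = re.compile(r'(?P<key>\w+)=(?P<value>\w+)')

index = dict(prog.groupindex)      # names mapped to group numbers
pattern_text = prog.pattern        # the original pattern string

# A pattern without symbolic groups has an empty mapping.
plain_index = dict(re.compile(r'\w+').groupindex)
```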
 594
 595
 596 \subsection{Match Objects \label{match-objects}}
 597
 598 \class{MatchObject} instances support the following methods and attributes:
 599
 600 \begin{methoddesc}[MatchObject]{group}{\optional{group1, \moreargs}}
 601 Returns one or more subgroups of the match.  If there is a single
 602 argument, the result is a single string; if there are
 603 multiple arguments, the result is a tuple with one item per argument.
 604 Without arguments, \var{group1} defaults to zero (i.e. the whole match
 605 is returned).
 606 If a \var{groupN} argument is zero, the corresponding return value is the
 607 entire matching string; if it is in the inclusive range [1..99], it is
 608 the string matching the the corresponding parenthesized group.  If a
 609 group number is negative or larger than the number of groups defined
 610 in the pattern, an \exception{IndexError} exception is raised.
 611 If a group is contained in a part of the pattern that did not match,
 612 the corresponding result is \code{None}.  If a group is contained in a
 613 part of the pattern that matched multiple times, the last match is
 614 returned.
 615
 616 If the regular expression uses the \regexp{(?P<\var{name}>...)} syntax,
 617 the \var{groupN} arguments may also be strings identifying groups by
 618 their group name.  If a string argument is not used as a group name in
 619 the pattern, an \exception{IndexError} exception is raised.
 620
 621 A moderately complicated example:
 622
 623 \begin{verbatim}
 624 m = re.match(r"(?P<int>\d+)\.(\d*)", '3.14')
 625 \end{verbatim}
 626
 627 After performing this match, \code{m.group(1)} is \code{'3'}, as is
 628 \code{m.group('int')}, and \code{m.group(2)} is \code{'14'}.
 629 \end{methoddesc}
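
A few further calls on the same kind of match object, as a sketch
assuming a current Python 3 interpreter (the second pattern is invented
for the illustration):

```python
import re

m = re.match(r"(?P<int>\d+)\.(\d*)", '3.14')

whole = m.group()          # no argument: the whole match
pair = m.group(1, 2)       # several arguments: a tuple

# A group that matched several times reports its last match.
last = re.match(r'(?:(\w)-)+', 'a-b-c-').group(1)

# An undefined group number raises IndexError.
try:
    m.group(3)
    raised = False
except IndexError:
    raised = True
```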
 630
 631 \begin{methoddesc}[MatchObject]{groups}{\optional{default}}
 632 Return a tuple containing all the subgroups of the match, from 1 up to
 633 however many groups are in the pattern.  The \var{default} argument is
 634 used for groups that did not participate in the match; it defaults to
 635 \code{None}.  (Incompatibility note: in the original Python 1.5
 636 release, if the tuple was one element long, a string would be returned
 637 instead.  In later versions (from 1.5.1 on), a singleton tuple is
 638 returned in such cases.)
 639 \end{methoddesc}
 640
 641 \begin{methoddesc}[MatchObject]{groupdict}{\optional{default}}
 642 Return a dictionary containing all the \emph{named} subgroups of the
 643 match, keyed by the subgroup name.  The \var{default} argument is
 644 used for groups that did not participate in the match; it defaults to
 645 \code{None}.
 646 \end{methoddesc}
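
The \var{default} argument of \method{groups()} and \method{groupdict()}
can be sketched as follows, assuming a current Python 3 interpreter; the
pattern and group names are illustrative.

```python
import re

# The fractional part is optional, so 'frac' does not participate here.
m = re.match(r'(?P<int>\d+)(?:\.(?P<frac>\d+))?', '42')

plain_groups = m.groups()          # ('42', None)
filled_groups = m.groups('0')      # ('42', '0')
plain_dict = m.groupdict()
filled_dict = m.groupdict('0')

# A pattern with one group still yields a tuple.
single = re.match(r'(\d+)', '7').groups()
```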
 647
 648 \begin{methoddesc}[MatchObject]{start}{\optional{group}}
 649 \funcline{end}{\optional{group}}
 650 Return the indices of the start and end of the substring
 651 matched by \var{group}; \var{group} defaults to zero (meaning the whole
 652 matched substring).
 653 Return \code{-1} if \var{group} exists but
 654 did not contribute to the match.  For a match object
 655 \var{m}, and a group \var{g} that did contribute to the match, the
 656 substring matched by group \var{g} (equivalent to
 657 \code{\var{m}.group(\var{g})}) is
 658
 659 \begin{verbatim}
 660 m.string[m.start(g):m.end(g)]
 661 \end{verbatim}
 662
 663 Note that
 664 \code{m.start(\var{group})} will equal \code{m.end(\var{group})} if
 665 \var{group} matched a null string.  For example, after \code{\var{m} =
 666 re.search('b(c?)', 'cba')}, \code{\var{m}.start(0)} is 1,
 667 \code{\var{m}.end(0)} is 2, \code{\var{m}.start(1)} and
 668 \code{\var{m}.end(1)} are both 2, and \code{\var{m}.start(2)} raises
 669 an \exception{IndexError} exception.
 670 \end{methoddesc}
 671
 672 \begin{methoddesc}[MatchObject]{span}{\optional{group}}
 673 For \class{MatchObject} \var{m}, return the 2-tuple
 674 \code{(\var{m}.start(\var{group}), \var{m}.end(\var{group}))}.
 675 Note that if \var{group} did not contribute to the match, this is
 676 \code{(-1, -1)}.  Again, \var{group} defaults to zero.
 677 \end{methoddesc}
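
The index methods can be checked against the example above with a short
sketch, assuming a current Python 3 interpreter (the second pattern is
invented):

```python
import re

m = re.search('b(c?)', 'cba')

whole_span = m.span()       # (1, 2)
empty_span = m.span(1)      # (2, 2): group 1 matched a null string
recovered = m.string[m.start():m.end()]

# A group that exists but did not contribute reports -1.
alt = re.search(r'(a)|(b)', 'b')
unused_span = alt.span(1)
unused_start = alt.start(1)
```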
 678
 679 \begin{memberdesc}[MatchObject]{pos}
 680 The value of \var{pos} which was passed to the
 681 \function{search()} or \function{match()} function.  This is the index into
 682 the string at which the regex engine started looking for a match.
 683 \end{memberdesc}
 684
 685 \begin{memberdesc}[MatchObject]{endpos}
 686 The value of \var{endpos} which was passed to the
 687 \function{search()} or \function{match()} function.  This is the index into
 688 the string beyond which the regex engine will not go.
 689 \end{memberdesc}
 690
 691 \begin{memberdesc}[MatchObject]{re}
 692 The regular expression object whose \method{match()} or
 693 \method{search()} method produced this \class{MatchObject} instance.
 694 \end{memberdesc}
 695
 696 \begin{memberdesc}[MatchObject]{string}
 697 The string passed to \function{match()} or \function{search()}.
 698 \end{memberdesc}
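
The attributes of a match object can be inspected as in this sketch,
assuming a current Python 3 interpreter; the subject string is invented.

```python
import re

prog = re.compile(r'\d+')
subject = 'ab12345'
m = prog.search(subject, 2, 5)   # examine only subject[2:5]

matched = m.group()              # '123'
bounds = (m.pos, m.endpos)       # the pos and endpos that were passed
same_pattern = m.re is prog
same_string = m.string == subject
```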
 699
 700 \begin{seealso}
 701 \seetext{Jeffrey Friedl, \citetitle{Mastering Regular Expressions},
 702 O'Reilly.  The Python material in this book dates from before the
 703 \module{re} module, but it covers writing good regular expression
 704 patterns in great detail.}
 705 \end{seealso}
 706