
   1 \section{\module{re} ---
   2          Perl-style regular expression operations.}
   3 \declaremodule{standard}{re}
   4 \moduleauthor{Andrew M. Kuchling}{akuchling@acm.org}
   5 \sectionauthor{Andrew M. Kuchling}{akuchling@acm.org}
   6
   7
   8 \modulesynopsis{Perl-style regular expression search and match
   9 operations.}
  10
  11
  12 This module provides regular expression matching operations similar to
  13 those found in Perl.  It's 8-bit clean: the strings being processed
  14 may contain both null bytes and characters whose high bit is set.  Regular
  15 expression pattern strings may not contain null bytes, but can specify
  16 the null byte using the \code{\e\var{number}} notation.
  17 Characters with the high bit set may be included.  The \module{re}
  18 module is always available.
  19
  20 Regular expressions use the backslash character (\character{\e}) to
  21 indicate special forms or to allow special characters to be used
  22 without invoking their special meaning.  This collides with Python's
  23 usage of the same character for the same purpose in string literals;
  24 for example, to match a literal backslash, one might have to write
  25 \code{'\e\e\e\e'} as the pattern string, because the regular expression
  26 must be \samp{\e\e}, and each backslash must be expressed as
  27 \samp{\e\e} inside a regular Python string literal.
  28
  29 The solution is to use Python's raw string notation for regular
  30 expression patterns; backslashes are not handled in any special way in
  31 a string literal prefixed with \character{r}.  So \code{r"\e n"} is a
  32 two-character string containing \character{\e} and \character{n},
  33 while \code{"\e n"} is a one-character string containing a newline.
  34 Usually patterns will be expressed in Python code using this raw
  35 string notation.
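
As a minimal sketch of the difference (the variable names are chosen
here purely for illustration), the following compares a raw and an
ordinary string literal, and then uses a raw string to search for a
literal backslash:

\begin{verbatim}
import re

raw = r"\n"        # two characters: a backslash and 'n'
cooked = "\n"      # one character: a newline
lengths = (len(raw), len(cooked))           # (2, 1)

# Both literals spell the same two-character RE, which matches a
# single literal backslash.
same_pattern = (r"\\" == "\\\\")            # true
m = re.search(r"\\", "C:\\temp")            # the subject is C:\temp
where = m.start()                           # 2
\end{verbatim}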
  36
  37 \subsection{Regular Expression Syntax \label{re-syntax}}
  38
  39 A regular expression (or RE) specifies a set of strings that matches
  40 it; the functions in this module let you check if a particular string
  41 matches a given regular expression (or if a given regular expression
  42 matches a particular string, which comes down to the same thing).
  43
  44 Regular expressions can be concatenated to form new regular
  45 expressions; if \emph{A} and \emph{B} are both regular expressions,
then \emph{AB} is also a regular expression.  If a string \emph{p}
  47 matches A and another string \emph{q} matches B, the string \emph{pq}
  48 will match AB.  Thus, complex expressions can easily be constructed
  49 from simpler primitive expressions like the ones described here.  For
  50 details of the theory and implementation of regular expressions,
  51 consult the Friedl book referenced below, or almost any textbook about
  52 compiler construction.
  53
  54 A brief explanation of the format of regular expressions follows.  For
  55 further information and a gentler presentation, consult the Regular
  56 Expression HOWTO, accessible from \url{http://www.python.org/doc/howto/}.
  57
  58 Regular expressions can contain both special and ordinary characters.
  59 Most ordinary characters, like \character{A}, \character{a}, or \character{0},
  60 are the simplest regular expressions; they simply match themselves.
  61 You can concatenate ordinary characters, so \regexp{last} matches the
string \code{'last'}.  (In the rest of this section, we'll write REs in
  63 \regexp{this special style}, usually without quotes, and strings to be
  64 matched \code{'in single quotes'}.)
  65
  66 Some characters, like \character{|} or \character{(}, are special.  Special
  67 characters either stand for classes of ordinary characters, or affect
  68 how the regular expressions around them are interpreted.
  69
  70 The special characters are:
  71
  72 \begin{list}{}{\leftmargin 0.7in \labelwidth 0.65in}
  73
  74 \item[\character{.}] (Dot.)  In the default mode, this matches any
  75 character except a newline.  If the \constant{DOTALL} flag has been
  76 specified, this matches any character including a newline.
  77
  78 \item[\character{\^}] (Caret.)  Matches the start of the string, and in
  79 \constant{MULTILINE} mode also matches immediately after each newline.
  80
  81 \item[\character{\$}] Matches the end of the string, and in
  82 \constant{MULTILINE} mode also matches before a newline.
  83 \regexp{foo} matches both 'foo' and 'foobar', while the regular
  84 expression \regexp{foo\$} matches only 'foo'.
  85
  86 \item[\character{*}] Causes the resulting RE to
  87 match 0 or more repetitions of the preceding RE, as many repetitions
  88 as are possible.  \regexp{ab*} will
  89 match 'a', 'ab', or 'a' followed by any number of 'b's.
  90
  91 \item[\character{+}] Causes the
  92 resulting RE to match 1 or more repetitions of the preceding RE.
  93 \regexp{ab+} will match 'a' followed by any non-zero number of 'b's; it
  94 will not match just 'a'.
  95
  96 \item[\character{?}] Causes the resulting RE to
  97 match 0 or 1 repetitions of the preceding RE.  \regexp{ab?} will
  98 match either 'a' or 'ab'.
  99 \item[\code{*?}, \code{+?}, \code{??}] The \character{*}, \character{+}, and
 100 \character{?} qualifiers are all \dfn{greedy}; they match as much text as
 101 possible.  Sometimes this behaviour isn't desired; if the RE
 102 \regexp{<.*>} is matched against \code{'<H1>title</H1>'}, it will match the
 103 entire string, and not just \code{'<H1>'}.
 104 Adding \character{?} after the qualifier makes it perform the match in
 105 \dfn{non-greedy} or \dfn{minimal} fashion; as \emph{few} characters as
 106 possible will be matched.  Using \regexp{.*?} in the previous
 107 expression will match only \code{'<H1>'}.
 108
 109 \item[\code{\{\var{m},\var{n}\}}] Causes the resulting RE to match from
 110 \var{m} to \var{n} repetitions of the preceding RE, attempting to
 111 match as many repetitions as possible.  For example, \regexp{a\{3,5\}}
 112 will match from 3 to 5 \character{a} characters.  Omitting \var{n}
 113 specifies an infinite upper bound; you can't omit \var{m}.
 114
 115 \item[\code{\{\var{m},\var{n}\}?}] Causes the resulting RE to
 116 match from \var{m} to \var{n} repetitions of the preceding RE,
 117 attempting to match as \emph{few} repetitions as possible.  This is
 118 the non-greedy version of the previous qualifier.  For example, on the
 119 6-character string \code{'aaaaaa'}, \regexp{a\{3,5\}} will match 5
 120 \character{a} characters, while \regexp{a\{3,5\}?} will only match 3
 121 characters.
 122
 123 \item[\character{\e}] Either escapes special characters (permitting
 124 you to match characters like \character{*}, \character{?}, and so
 125 forth), or signals a special sequence; special sequences are discussed
 126 below.
 127
 128 If you're not using a raw string to
 129 express the pattern, remember that Python also uses the
 130 backslash as an escape sequence in string literals; if the escape
 131 sequence isn't recognized by Python's parser, the backslash and
 132 subsequent character are included in the resulting string.  However,
if Python would recognize the resulting sequence, the backslash should
be doubled.  This is complicated and hard to understand, so
 135 it's highly recommended that you use raw strings for all but the
 136 simplest expressions.
 137
 138 \item[\code{[]}] Used to indicate a set of characters.  Characters can
 139 be listed individually, or a range of characters can be indicated by
 140 giving two characters and separating them by a \character{-}.  Special
 141 characters are not active inside sets.  For example, \regexp{[akm\$]}
 142 will match any of the characters \character{a}, \character{k},
 143 \character{m}, or \character{\$}; \regexp{[a-z]}
 144 will match any lowercase letter, and \code{[a-zA-Z0-9]} matches any
 145 letter or digit.  Character classes such as \code{\e w} or \code{\e S}
(defined below) are also acceptable inside a set.  If you want to
 147 include a \character{]} or a \character{-} inside a set, precede it with a
 148 backslash, or place it as the first character.  The
 149 pattern \regexp{[]]} will match \code{']'}, for example.
 150
 151 You can match the characters not within a range by \dfn{complementing}
 152 the set.  This is indicated by including a
 153 \character{\^} as the first character of the set; \character{\^} elsewhere will
 154 simply match the \character{\^} character.  For example, \regexp{[{\^}5]}
 155 will match any character except \character{5}.
 156
 157 \item[\character{|}]\code{A|B}, where A and B can be arbitrary REs,
 158 creates a regular expression that will match either A or B.  This can
 159 be used inside groups (see below) as well.  To match a literal \character{|},
 160 use \regexp{\e|}, or enclose it inside a character class, as in  \regexp{[|]}.
 161
 162 \item[\code{(...)}] Matches whatever regular expression is inside the
 163 parentheses, and indicates the start and end of a group; the contents
 164 of a group can be retrieved after a match has been performed, and can
 165 be matched later in the string with the \regexp{\e \var{number}} special
 166 sequence, described below.  To match the literals \character{(} or
\character{)}, use \regexp{\e(} or \regexp{\e)}, or enclose them
 168 inside a character class: \regexp{[(] [)]}.
 169
 170 \item[\code{(?...)}] This is an extension notation (a \character{?}
 171 following a \character{(} is not meaningful otherwise).  The first
 172 character after the \character{?}
 173 determines what the meaning and further syntax of the construct is.
 174 Extensions usually do not create a new group;
 175 \regexp{(?P<\var{name}>...)} is the only exception to this rule.
 176 Following are the currently supported extensions.
 177
 178 \item[\code{(?iLmsx)}] (One or more letters from the set \character{i},
 179 \character{L}, \character{m}, \character{s}, \character{x}.)  The group matches
 180 the empty string; the letters set the corresponding flags
 181 (\constant{re.I}, \constant{re.L}, \constant{re.M}, \constant{re.S},
 182 \constant{re.X}) for the entire regular expression.  This is useful if
 183 you wish to include the flags as part of the regular expression, instead
 184 of passing a \var{flag} argument to the \function{compile()} function.
 185
 186 \item[\code{(?:...)}] A non-grouping version of regular parentheses.
 187 Matches whatever regular expression is inside the parentheses, but the
 188 substring matched by the
 189 group \emph{cannot} be retrieved after performing a match or
 190 referenced later in the pattern.
 191
 192 \item[\code{(?P<\var{name}>...)}] Similar to regular parentheses, but
 193 the substring matched by the group is accessible via the symbolic group
 194 name \var{name}.  Group names must be valid Python identifiers.  A
 195 symbolic group is also a numbered group, just as if the group were not
named.  So the group named 'id' in the example below can also be
 197 referenced as the numbered group 1.
 198
 199 For example, if the pattern is
 200 \regexp{(?P<id>[a-zA-Z_]\e w*)}, the group can be referenced by its
 201 name in arguments to methods of match objects, such as \code{m.group('id')}
 202 or \code{m.end('id')}, and also by name in pattern text
 203 (e.g. \regexp{(?P=id)}) and replacement text (e.g. \code{\e g<id>}).
 204
 205 \item[\code{(?P=\var{name})}] Matches whatever text was matched by the
 206 earlier group named \var{name}.
 207
 208 \item[\code{(?\#...)}] A comment; the contents of the parentheses are
 209 simply ignored.
 210
 211 \item[\code{(?=...)}] Matches if \regexp{...} matches next, but doesn't
 212 consume any of the string.  This is called a lookahead assertion.  For
 213 example, \regexp{Isaac (?=Asimov)} will match \code{'Isaac~'} only if it's
 214 followed by \code{'Asimov'}.
 215
 216 \item[\code{(?!...)}] Matches if \regexp{...} doesn't match next.  This
 217 is a negative lookahead assertion.  For example,
 218 \regexp{Isaac (?!Asimov)} will match \code{'Isaac~'} only if it's \emph{not}
 219 followed by \code{'Asimov'}.
 220
 221 \end{list}
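
The repetition qualifiers, named groups and lookahead assertions
described above can be tried out with a short sketch such as the
following.  Most sample strings are taken from the descriptions; the
string \code{'x1=x1'} and all variable names are invented here for
illustration.

\begin{verbatim}
import re

text = '<H1>title</H1>'
greedy = re.match(r'<.*>', text).group(0)      # '<H1>title</H1>'
minimal = re.match(r'<.*?>', text).group(0)    # '<H1>'

# {m,n} takes as many repetitions as it can, {m,n}? as few.
most = re.match(r'a{3,5}', 'aaaaaa').group(0)      # 'aaaaa'
fewest = re.match(r'a{3,5}?', 'aaaaaa').group(0)   # 'aaa'

# A named group is also a numbered group; (?P=id) must match the
# same text again.
m = re.match(r'(?P<id>[a-zA-Z_]\w*)=(?P=id)', 'x1=x1')
by_name, by_number = m.group('id'), m.group(1)     # both 'x1'

# Lookahead assertions test what follows without consuming it.
pos = re.match(r'Isaac (?=Asimov)', 'Isaac Asimov')   # matches 'Isaac '
neg = re.match(r'Isaac (?!Asimov)', 'Isaac Asimov')   # None
\end{verbatim}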
 222
 223 The special sequences consist of \character{\e} and a character from the
 224 list below.  If the ordinary character is not on the list, then the
 225 resulting RE will match the second character.  For example,
 226 \regexp{\e\$} matches the character \character{\$}.
 227
 228 \begin{list}{}{\leftmargin 0.7in \labelwidth 0.65in}
 229
 230 %
 231 \item[\code{\e \var{number}}] Matches the contents of the group of the
 232 same number.  Groups are numbered starting from 1.  For example,
 233 \regexp{(.+) \e 1} matches \code{'the the'} or \code{'55 55'}, but not
 234 \code{'the end'} (note
 235 the space after the group).  This special sequence can only be used to
 236 match one of the first 99 groups.  If the first digit of \var{number}
 237 is 0, or \var{number} is 3 octal digits long, it will not be interpreted
 238 as a group match, but as the character with octal value \var{number}.
 239 Inside the \character{[} and \character{]} of a character class, all numeric
 240 escapes are treated as characters.
 241 %
 242 \item[\code{\e A}] Matches only at the start of the string.
 243 %
 244 \item[\code{\e b}] Matches the empty string, but only at the
 245 beginning or end of a word.  A word is defined as a sequence of
 246 alphanumeric characters, so the end of a word is indicated by
 247 whitespace or a non-alphanumeric character.  Inside a character range,
 248 \regexp{\e b} represents the backspace character, for compatibility with
 249 Python's string literals.
 250 %
 251 \item[\code{\e B}] Matches the empty string, but only when it is
 252 \emph{not} at the beginning or end of a word.
 253 %
 254 \item[\code{\e d}]Matches any decimal digit; this is
 255 equivalent to the set \regexp{[0-9]}.
 256 %
 257 \item[\code{\e D}]Matches any non-digit character; this is
 258 equivalent to the set \regexp{[{\^}0-9]}.
 259 %
 260 \item[\code{\e s}]Matches any whitespace character; this is
 261 equivalent to the set \regexp{[ \e t\e n\e r\e f\e v]}.
 262 %
 263 \item[\code{\e S}]Matches any non-whitespace character; this is
 264 equivalent to the set \regexp{[\^\ \e t\e n\e r\e f\e v]}.
 265 %
 266 \item[\code{\e w}]When the \constant{LOCALE} flag is not specified,
 267 matches any alphanumeric character; this is equivalent to the set
 268 \regexp{[a-zA-Z0-9_]}.  With \constant{LOCALE}, it will match the set
 269 \regexp{[0-9_]} plus whatever characters are defined as letters for the
 270 current locale.
 271 %
 272 \item[\code{\e W}]When the \constant{LOCALE} flag is not specified,
 273 matches any non-alphanumeric character; this is equivalent to the set
 274 \regexp{[{\^}a-zA-Z0-9_]}.   With \constant{LOCALE}, it will match any
 275 character not in the set \regexp{[0-9_]}, and not defined as a letter
 276 for the current locale.
 277
 278 \item[\code{\e Z}]Matches only at the end of the string.
 279 %
 280
 281 \item[\code{\e \e}] Matches a literal backslash.
 282
 283 \end{list}
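
A few of these special sequences are shown together in the sketch
below.  The back-reference examples reuse the strings given above; the
remaining strings and the variable names are assumptions made for
illustration.

\begin{verbatim}
import re

# \1 must repeat exactly what group 1 matched.
doubled = re.match(r'(.+) \1', 'the the')        # matches
not_doubled = re.match(r'(.+) \1', 'the end')    # None

# \b matches only at a word boundary, \B only away from one.
whole_word = re.search(r'\bcat\b', 'a cat sat')      # starts at index 2
inside_word = re.search(r'\bcat\b', 'concatenate')   # None
embedded = re.search(r'\Bcat\B', 'concatenate')      # starts at index 3

# \w, \s and \d used as shorthand for character sets.
fields = re.match(r'(\w+)\s+(\d+)', 'width   80').groups()
# fields is ('width', '80')
\end{verbatim}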
 284
 285
 286 \subsection{Matching vs. Searching \label{matching-searching}}
 287 \sectionauthor{Fred L. Drake, Jr.}{fdrake@acm.org}
 288
 289 \strong{XXX This section is still incomplete!}
 290
 291 Python offers two different primitive operations based on regular
 292 expressions: match and search.  If you are accustomed to Perl's
 293 semantics, the search operation is what you're looking for.  See the
 294 \function{search()} function and corresponding method of compiled
 295 regular expression objects.
 296
 297 Note that match may differ from search using a regular expression
 298 beginning with \character{\^}:  \character{\^} matches only at the start
 299 of the string, or in \constant{MULTILINE} mode also immediately
 300 following a newline.  "match" succeeds only if the pattern matches at
 301 the start of the string regardless of mode, or at the starting
 302 position given by the optional \var{pos} argument regardless of
 303 whether a newline precedes it.
 304
 305 % Examples from Tim Peters:
 306 \begin{verbatim}
import re

re.compile("a").match("ba", 1)           # succeeds
re.compile("^a").search("ba", 1)         # fails; 'a' not at start
re.compile("^a").search("\na", 1)        # fails; 'a' not at start
re.compile("^a", re.M).search("\na", 1)  # succeeds
re.compile("^a", re.M).search("ba", 1)   # fails; no preceding \n
 312 \end{verbatim}
 313
 314
 315 \subsection{Module Contents}
 316 \nodename{Contents of Module re}
 317
 318 The module defines the following functions and constants, and an exception:
 319
 320
 321 \begin{funcdesc}{compile}{pattern\optional{, flags}}
 322   Compile a regular expression pattern into a regular expression
 323   object, which can be used for matching using its \function{match()} and
 324   \function{search()} methods, described below.
 325
 326   The expression's behaviour can be modified by specifying a
 327   \var{flags} value.  Values can be any of the following variables,
 328   combined using bitwise OR (the \code{|} operator).
 329
 330 The sequence
 331
 332 \begin{verbatim}
 333 prog = re.compile(pat)
 334 result = prog.match(str)
 335 \end{verbatim}
 336
 337 is equivalent to
 338
 339 \begin{verbatim}
 340 result = re.match(pat, str)
 341 \end{verbatim}
 342
 343 but the version using \function{compile()} is more efficient when the
 344 expression will be used several times in a single program.
 345 %(The compiled version of the last pattern passed to
 346 %\function{regex.match()} or \function{regex.search()} is cached, so
 347 %programs that use only a single regular expression at a time needn't
 348 %worry about compiling regular expressions.)
 349 \end{funcdesc}
 350
 351 \begin{datadesc}{I}
 352 \dataline{IGNORECASE}
 353 Perform case-insensitive matching; expressions like \regexp{[A-Z]} will match
 354 lowercase letters, too.  This is not affected by the current locale.
 355 \end{datadesc}
 356
 357 \begin{datadesc}{L}
 358 \dataline{LOCALE}
Make \regexp{\e w}, \regexp{\e W}, \regexp{\e b}, and
\regexp{\e B} dependent on the current locale.
 361 \end{datadesc}
 362
 363 \begin{datadesc}{M}
 364 \dataline{MULTILINE}
 365 When specified, the pattern character \character{\^} matches at the
 366 beginning of the string and at the beginning of each line
 367 (immediately following each newline); and the pattern character
 368 \character{\$} matches at the end of the string and at the end of each line
 369 (immediately preceding each newline).
 370 By default, \character{\^} matches only at the beginning of the string, and
 371 \character{\$} only at the end of the string and immediately before the
 372 newline (if any) at the end of the string.
 373 \end{datadesc}
 374
 375 \begin{datadesc}{S}
 376 \dataline{DOTALL}
 377 Make the \character{.} special character match any character at all, including a
 378 newline; without this flag, \character{.} will match anything \emph{except}
 379 a newline.
 380 \end{datadesc}
 381
 382 \begin{datadesc}{X}
 383 \dataline{VERBOSE}
 384 This flag allows you to write regular expressions that look nicer.
 385 Whitespace within the pattern is ignored,
 386 except when in a character class or preceded by an unescaped
 387 backslash, and, when a line contains a \character{\#} neither in a character
class nor preceded by an unescaped backslash, all characters from the
 389 leftmost such \character{\#} through the end of the line are ignored.
 390 % XXX should add an example here
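
As one possible illustration (the pattern, the name \code{number} and
the sample strings are made up for this sketch), a verbose pattern can
carry its own comments, and several flags can be combined with
\code{|}:

\begin{verbatim}
import re

number = re.compile(r"""
    (\d+)     # integer part
    \.        # literal decimal point
    (\d*)     # fractional digits, possibly none
    [ ]kg     # the space inside the class is significant
    """, re.VERBOSE | re.IGNORECASE)

m = number.match('3.14 KG')
parts = m.groups()                   # ('3', '14')
no_space = number.match('3.14KG')    # None: the space is required
\end{verbatim}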
 391 \end{datadesc}
 392
 393
 394 \begin{funcdesc}{search}{pattern, string\optional{, flags}}
 395   Scan through \var{string} looking for a location where the regular
 396   expression \var{pattern} produces a match, and return a
 397   corresponding \class{MatchObject} instance.
 398   Return \code{None} if no
 399   position in the string matches the pattern; note that this is
 400   different from finding a zero-length match at some point in the string.
 401 \end{funcdesc}
 402
 403 \begin{funcdesc}{match}{pattern, string\optional{, flags}}
 404   If zero or more characters at the beginning of \var{string} match
 405   the regular expression \var{pattern}, return a corresponding
 406   \class{MatchObject} instance.  Return \code{None} if the string does not
 407   match the pattern; note that this is different from a zero-length
 408   match.
 409
 410   \strong{Note:}  If you want to locate a match anywhere in
 411   \var{string}, use \method{search()} instead.
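
  The difference between the two functions, and between no match and
  a zero-length match, can be seen in a small sketch; the pattern and
  variable names below are illustrative only.

\begin{verbatim}
import re

found = re.search('b+', 'aabbcc')      # match object for 'bb'
at_start = re.match('b+', 'aabbcc')    # None: no 'b' at index 0
empty = re.match('b*', 'aabbcc')       # zero-length match, not None

if found is not None:
    span = (found.start(), found.end())    # (2, 4)
\end{verbatim}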
 412 \end{funcdesc}
 413
\begin{funcdesc}{split}{pattern, string\optional{, maxsplit\code{ = 0}}}
 415   Split \var{string} by the occurrences of \var{pattern}.  If
 416   capturing parentheses are used in \var{pattern}, then the text of all
  groups in the pattern is also returned as part of the resulting list.
 418   If \var{maxsplit} is nonzero, at most \var{maxsplit} splits
 419   occur, and the remainder of the string is returned as the final
 420   element of the list.  (Incompatibility note: in the original Python
 421   1.5 release, \var{maxsplit} was ignored.  This has been fixed in
 422   later releases.)
 423
 424 \begin{verbatim}
>>> re.split(r'\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'\W+', 'Words, words, words.', 1)
['Words', 'words, words.']
 431 \end{verbatim}
 432
 433   This function combines and extends the functionality of
 434   the old \function{regsub.split()} and \function{regsub.splitx()}.
 435 \end{funcdesc}
 436
 437 \begin{funcdesc}{findall}{pattern, string}
 438 Return a list of all non-overlapping matches of \var{pattern} in
 439 \var{string}.  If one or more groups are present in the pattern,
 440 return a list of groups; this will be a list of tuples if the pattern
 441 has more than one group.  Empty matches are included in the result.
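
How the shape of the result depends on the number of groups can be
sketched as follows; the sample text and variable names are assumptions
made for this example.

\begin{verbatim}
import re

text = 'x=1, y=22, z=333'
whole = re.findall(r'\d+', text)          # ['1', '22', '333']
one_group = re.findall(r'(\w)=\d+', text) # ['x', 'y', 'z']
two_groups = re.findall(r'(\w)=(\d+)', text)
# two_groups is [('x', '1'), ('y', '22'), ('z', '333')]
\end{verbatim}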
 442 \versionadded{1.5.2}
 443 \end{funcdesc}
 444
 445 \begin{funcdesc}{sub}{pattern, repl, string\optional{, count\code{ = 0}}}
 446 Return the string obtained by replacing the leftmost non-overlapping
 447 occurrences of \var{pattern} in \var{string} by the replacement
 448 \var{repl}.  If the pattern isn't found, \var{string} is returned
 449 unchanged.  \var{repl} can be a string or a function; if a function,
 450 it is called for every non-overlapping occurrence of \var{pattern}.
 451 The function takes a single match object argument, and returns the
 452 replacement string.  For example:
 453
 454 \begin{verbatim}
>>> def dashrepl(matchobj):
...     if matchobj.group(0) == '-': return ' '
...     else: return '-'
...
>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
'pro--gram files'
 460 \end{verbatim}
 461
 462 The pattern may be a string or a
 463 regex object; if you need to specify
 464 regular expression flags, you must use a regex object, or use
 465 embedded modifiers in a pattern; e.g.
 466 \samp{sub("(?i)b+", "x", "bbbb BBBB")} returns \code{'x x'}.
 467
 468 The optional argument \var{count} is the maximum number of pattern
 469 occurrences to be replaced; \var{count} must be a non-negative integer, and
 470 the default value of 0 means to replace all occurrences.
 471
 472 Empty matches for the pattern are replaced only when not adjacent to a
 473 previous match, so \samp{sub('x*', '-', 'abc')} returns \code{'-a-b-c-'}.
 474
 475 If \var{repl} is a string, any backslash escapes in it are processed.
 476 That is, \samp{\e n} is converted to a single newline character,
\samp{\e r} is converted to a carriage return, and so forth.  Unknown escapes
 478 such as \samp{\e j} are left alone.  Backreferences, such as \samp{\e 6}, are
 479 replaced with the substring matched by group 6 in the pattern.
 480
 481 In addition to character escapes and backreferences as described
 482 above, \samp{\e g<name>} will use the substring matched by the group
 483 named \samp{name}, as defined by the \regexp{(?P<name>...)} syntax.
 484 \samp{\e g<number>} uses the corresponding group number; \samp{\e
 485 g<2>} is therefore equivalent to \samp{\e 2}, but isn't ambiguous in a
 486 replacement such as \samp{\e g<2>0}.  \samp{\e 20} would be
 487 interpreted as a reference to group 20, not a reference to group 2
 488 followed by the literal character \character{0}.
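
The replacement syntax can be tried with a sketch along these lines;
the patterns, sample strings and variable names are invented for
illustration, and \var{count} is passed by keyword only for clarity.

\begin{verbatim}
import re

# \g<name> refers to a named group in the replacement string.
swapped = re.sub(r'(?P<key>\w+)=(?P<val>\w+)', r'\g<val>=\g<key>',
                 'a=1 b=2')                    # '1=a 2=b'

# \g<1>0 is group 1 followed by a literal '0'.
tens = re.sub(r'(\d)', r'\g<1>0', 'a1b2')      # 'a10b20'

# count limits the number of replacements made.
first_only = re.sub('-', '+', '1-2-3', count=1)    # '1+2-3'
\end{verbatim}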
 489 \end{funcdesc}
 490
 491 \begin{funcdesc}{subn}{pattern, repl, string\optional{, count\code{ = 0}}}
 492 Perform the same operation as \function{sub()}, but return a tuple
 493 \code{(\var{new_string}, \var{number_of_subs_made})}.
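
For instance, in the following sketch (sample strings chosen here for
illustration) the second element of the tuple reports how many
replacements took place:

\begin{verbatim}
import re

replaced = re.subn('-', '+', '1-2-3')    # ('1+2+3', 2)
unchanged = re.subn('x', 'y', 'abc')     # ('abc', 0)
\end{verbatim}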
 494 \end{funcdesc}
 495
 496 \begin{funcdesc}{escape}{string}
 497   Return \var{string} with all non-alphanumerics backslashed; this is
 498   useful if you want to match an arbitrary literal string that may have
 499   regular expression metacharacters in it.
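
  A small sketch of the effect, using a made-up literal string; the
  exact form of the escaped pattern is deliberately not shown, since
  only its matching behaviour matters here.

\begin{verbatim}
import re

literal = 'a.b*c'
escaped = re.escape(literal)

exact = re.search(escaped, 'see a.b*c here')  # finds the literal text
loose = re.search(literal, 'axc')    # matches: '.' and '*' are special
strict = re.search(escaped, 'axc')   # None: only the literal text matches
\end{verbatim}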
 500 \end{funcdesc}
 501
 502 \begin{excdesc}{error}
 503   Exception raised when a string passed to one of the functions here
 504   is not a valid regular expression (e.g., unmatched parentheses) or
 505   when some other error occurs during compilation or matching.  It is
 506   never an error if a string contains no match for a pattern.
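
  The distinction can be sketched as follows; the invalid pattern and
  the variable names are assumptions made for this example.

\begin{verbatim}
import re

try:
    re.compile('(unclosed')
    compiled = True
except re.error:
    compiled = False               # the parenthesis is never closed

no_match = re.search('x', 'abc')   # None; no exception is raised
\end{verbatim}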
 507 \end{excdesc}
 508
 509
 510 \subsection{Regular Expression Objects \label{re-objects}}
 511
 512 Compiled regular expression objects support the following methods and
 513 attributes:
 514
 515 \begin{methoddesc}[RegexObject]{search}{string\optional{, pos}\optional{,
 516                                         endpos}}
 517   Scan through \var{string} looking for a location where this regular
 518   expression produces a match, and return a
 519   corresponding \class{MatchObject} instance.  Return \code{None} if no
 520   position in the string matches the pattern; note that this is
 521   different from finding a zero-length match at some point in the string.
 522
 523   The optional \var{pos} and \var{endpos} parameters have the same
 524   meaning as for the \method{match()} method.
 525 \end{methoddesc}
 526
 527 \begin{methoddesc}[RegexObject]{match}{string\optional{, pos}\optional{,
 528                                        endpos}}
 529   If zero or more characters at the beginning of \var{string} match
 530   this regular expression, return a corresponding
 531   \class{MatchObject} instance.  Return \code{None} if the string does not
 532   match the pattern; note that this is different from a zero-length
 533   match.
 534
 535   \strong{Note:}  If you want to locate a match anywhere in
 536   \var{string}, use \method{search()} instead.
 537
 538   The optional second parameter \var{pos} gives an index in the string
 539   where the search is to start; it defaults to \code{0}.  This is not
 540   completely equivalent to slicing the string; the \code{'\^'} pattern
 541   character matches at the real beginning of the string and at positions
 542   just after a newline, but not necessarily at the index where the search
 543   is to start.
 544
 545   The optional parameter \var{endpos} limits how far the string will
 546   be searched; it will be as if the string is \var{endpos} characters
 547   long, so only the characters from \var{pos} to \var{endpos} will be
 548   searched for a match.
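
  The following sketch, with patterns and strings invented for
  illustration, contrasts \var{pos} with slicing and shows the effect
  of \var{endpos}:

\begin{verbatim}
import re

prog = re.compile('^a')
sliced = prog.match('ba'[1:])    # matches: the slice starts with 'a'
offset = prog.match('ba', 1)     # None: '^' refers to the real start

digits = re.compile(r'\d+')
limited = digits.search('ab1234', 0, 4).group(0)
# limited is '12'; characters from index 4 on are not examined
\end{verbatim}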
 549 \end{methoddesc}
 550
\begin{methoddesc}[RegexObject]{split}{string\optional{,
                                       maxsplit\code{ = 0}}}
 553 Identical to the \function{split()} function, using the compiled pattern.
 554 \end{methoddesc}
 555
 556 \begin{methoddesc}[RegexObject]{findall}{string}
 557 Identical to the \function{findall()} function, using the compiled pattern.
 558 \end{methoddesc}
 559
 560 \begin{methoddesc}[RegexObject]{sub}{repl, string\optional{, count\code{ = 0}}}
 561 Identical to the \function{sub()} function, using the compiled pattern.
 562 \end{methoddesc}
 563
 564 \begin{methoddesc}[RegexObject]{subn}{repl, string\optional{,
 565                                       count\code{ = 0}}}
 566 Identical to the \function{subn()} function, using the compiled pattern.
 567 \end{methoddesc}
 568
 569
 570 \begin{memberdesc}[RegexObject]{flags}
 571 The flags argument used when the regex object was compiled, or
 572 \code{0} if no flags were provided.
 573 \end{memberdesc}
 574
 575 \begin{memberdesc}[RegexObject]{groupindex}
 576 A dictionary mapping any symbolic group names defined by
 577 \regexp{(?P<\var{id}>)} to group numbers.  The dictionary is empty if no
 578 symbolic groups were used in the pattern.
 579 \end{memberdesc}
 580
 581 \begin{memberdesc}[RegexObject]{pattern}
 582 The pattern string from which the regex object was compiled.
 583 \end{memberdesc}
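
These attributes can be inspected as in the sketch below; the pattern
and the variable names are chosen here for illustration.

\begin{verbatim}
import re

prog = re.compile(r'(?P<int>\d+)\.(?P<frac>\d*)', re.IGNORECASE)

index = prog.groupindex               # maps 'int' to 1 and 'frac' to 2
source = prog.pattern                 # the pattern string given above
has_i = prog.flags & re.IGNORECASE    # non-zero: the flag is set
\end{verbatim}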
 584
 585
 586 \subsection{Match Objects \label{match-objects}}
 587
 588 \class{MatchObject} instances support the following methods and attributes:
 589
 590 \begin{methoddesc}[MatchObject]{group}{\optional{group1, group2, ...}}
 591 Returns one or more subgroups of the match.  If there is a single
 592 argument, the result is a single string; if there are
 593 multiple arguments, the result is a tuple with one item per argument.
 594 Without arguments, \var{group1} defaults to zero (i.e. the whole match
 595 is returned).
 596 If a \var{groupN} argument is zero, the corresponding return value is the
 597 entire matching string; if it is in the inclusive range [1..99], it is
the string matching the corresponding parenthesized group.  If a
 599 group number is negative or larger than the number of groups defined
 600 in the pattern, an \exception{IndexError} exception is raised.
 601 If a group is contained in a part of the pattern that did not match,
 602 the corresponding result is \code{None}.  If a group is contained in a
 603 part of the pattern that matched multiple times, the last match is
 604 returned.
 605
 606 If the regular expression uses the \regexp{(?P<\var{name}>...)} syntax,
 607 the \var{groupN} arguments may also be strings identifying groups by
 608 their group name.  If a string argument is not used as a group name in
 609 the pattern, an \exception{IndexError} exception is raised.
 610
 611 A moderately complicated example:
 612
 613 \begin{verbatim}
 614 m = re.match(r"(?P<int>\d+)\.(\d*)", '3.14')
 615 \end{verbatim}
 616
 617 After performing this match, \code{m.group(1)} is \code{'3'}, as is
 618 \code{m.group('int')}, and \code{m.group(2)} is \code{'14'}.
 619 \end{methoddesc}
 620
 621 \begin{methoddesc}[MatchObject]{groups}{\optional{default}}
 622 Return a tuple containing all the subgroups of the match, from 1 up to
 623 however many groups are in the pattern.  The \var{default} argument is
 624 used for groups that did not participate in the match; it defaults to
 625 \code{None}.  (Incompatibility note: in the original Python 1.5
 626 release, if the tuple was one element long, a string would be returned
 627 instead.  In later versions (from 1.5.1 on), a singleton tuple is
 628 returned in such cases.)
 629 \end{methoddesc}
 630
 631 \begin{methoddesc}[MatchObject]{groupdict}{\optional{default}}
 632 Return a dictionary containing all the \emph{named} subgroups of the
 633 match, keyed by the subgroup name.  The \var{default} argument is
 634 used for groups that did not participate in the match; it defaults to
 635 \code{None}.
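
The effect of \var{default} on both \method{groups()} and
\method{groupdict()} can be sketched as follows, using a pattern made
up for this example in which the second group is optional:

\begin{verbatim}
import re

m = re.match(r'(?P<int>\d+)(?:\.(?P<frac>\d+))?', '24')

plain = m.groups()          # ('24', None)
filled = m.groups('0')      # ('24', '0')
named = m.groupdict('0')    # {'int': '24', 'frac': '0'}
\end{verbatim}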
 636 \end{methoddesc}
 637
 638 \begin{methoddesc}[MatchObject]{start}{\optional{group}}
 639 \funcline{end}{\optional{group}}
 640 Return the indices of the start and end of the substring
 641 matched by \var{group}; \var{group} defaults to zero (meaning the whole
 642 matched substring).
Return \code{-1} if \var{group} exists but
 644 did not contribute to the match.  For a match object
 645 \var{m}, and a group \var{g} that did contribute to the match, the
 646 substring matched by group \var{g} (equivalent to
 647 \code{\var{m}.group(\var{g})}) is
 648
 649 \begin{verbatim}
 650 m.string[m.start(g):m.end(g)]
 651 \end{verbatim}
 652
 653 Note that
 654 \code{m.start(\var{group})} will equal \code{m.end(\var{group})} if
 655 \var{group} matched a null string.  For example, after \code{\var{m} =
 656 re.search('b(c?)', 'cba')}, \code{\var{m}.start(0)} is 1,
 657 \code{\var{m}.end(0)} is 2, \code{\var{m}.start(1)} and
 658 \code{\var{m}.end(1)} are both 2, and \code{\var{m}.start(2)} raises
 659 an \exception{IndexError} exception.
 660 \end{methoddesc}
 661
 662 \begin{methoddesc}[MatchObject]{span}{\optional{group}}
 663 For \class{MatchObject} \var{m}, return the 2-tuple
 664 \code{(\var{m}.start(\var{group}), \var{m}.end(\var{group}))}.
 665 Note that if \var{group} did not contribute to the match, this is
\code{(-1, -1)}.  Again, \var{group} defaults to zero.
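
A brief sketch, reusing the \code{'b(c?)'} pattern from the description
of \method{start()}; the variable names are illustrative only.

\begin{verbatim}
import re

m = re.search('b(c?)', 'cba')

whole = m.span()       # (1, 2)
inner = m.span(1)      # (2, 2): group 1 matched the empty string
text = m.string[m.start():m.end()]    # 'b'
\end{verbatim}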
 667 \end{methoddesc}
 668
 669 \begin{memberdesc}[MatchObject]{pos}
 670 The value of \var{pos} which was passed to the
\method{search()} or \method{match()} method.  This is the index into
 672 the string at which the regex engine started looking for a match.
 673 \end{memberdesc}
 674
 675 \begin{memberdesc}[MatchObject]{endpos}
 676 The value of \var{endpos} which was passed to the
\method{search()} or \method{match()} method.  This is the index into
 678 the string beyond which the regex engine will not go.
 679 \end{memberdesc}
 680
 681 \begin{memberdesc}[MatchObject]{re}
 682 The regular expression object whose \method{match()} or
 683 \method{search()} method produced this \class{MatchObject} instance.
 684 \end{memberdesc}
 685
 686 \begin{memberdesc}[MatchObject]{string}
 687 The string passed to \function{match()} or \function{search()}.
 688 \end{memberdesc}
 689
 690 \begin{seealso}
 691 \seetext{Jeffrey Friedl, \emph{Mastering Regular Expressions},
 692 O'Reilly.  The Python material in this book dates from before the
 693 \module{re} module, but it covers writing good regular expression
 694 patterns in great detail.}
 695 \end{seealso}
 696