1 \section{\module{re} ---
2 Regular expression operations}
3 \declaremodule{standard}{re}
4 \moduleauthor{Fredrik Lundh}{fredrik@pythonware.com}
5 \sectionauthor{Andrew M. Kuchling}{amk@amk.ca}
8 \modulesynopsis{Regular expression search and match operations with a
9 Perl-style expression syntax.}
12 This module provides regular expression matching operations similar to
13 those found in Perl. Regular expression pattern strings may not
14 contain null bytes, but can specify the null byte using the
15 \code{\e\var{number}} notation. Both patterns and strings to be
16 searched can be Unicode strings as well as 8-bit strings. The
17 \module{re} module is always available.
19 Regular expressions use the backslash character (\character{\e}) to
20 indicate special forms or to allow special characters to be used
21 without invoking their special meaning. This collides with Python's
22 usage of the same character for the same purpose in string literals;
23 for example, to match a literal backslash, one might have to write
24 \code{'\e\e\e\e'} as the pattern string, because the regular expression
25 must be \samp{\e\e}, and each backslash must be expressed as
26 \samp{\e\e} inside a regular Python string literal.
28 The solution is to use Python's raw string notation for regular
29 expression patterns; backslashes are not handled in any special way in
30 a string literal prefixed with \character{r}. So \code{r"\e n"} is a
31 two-character string containing \character{\e} and \character{n},
32 while \code{"\e n"} is a one-character string containing a newline.
33 Usually patterns will be expressed in Python code using this raw
34 string notation.
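As a small illustration of the difference, the following sketch compares the two spellings of a pattern that matches one literal backslash. The variable names are chosen for this example only and are not part of the module.

```python
import re

# Without raw strings, every backslash in the pattern is written twice:
# the literal below is the two-character regular expression \\ .
plain = '\\\\'
raw = r'\\'
same_pattern = (plain == raw)               # True

# Both forms match the single backslash in the string C:\temp.
target = 'C:\\temp'
position = re.search(raw, target).start()   # 2

# r"\n" keeps the backslash; "\n" is a single newline character.
raw_length = len(r"\n")                     # 2
plain_length = len("\n")                    # 1
```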
36 \begin{seealso}
37 \seetitle{Mastering Regular Expressions}{Book on regular expressions
38 by Jeffrey Friedl, published by O'Reilly. The second
39 edition of the book no longer covers Python at all,
40 but the first edition covered writing good regular expression
41 patterns in great detail.}
42 \end{seealso}
45 \subsection{Regular Expression Syntax \label{re-syntax}}
47 A regular expression (or RE) specifies a set of strings that matches
48 it; the functions in this module let you check if a particular string
49 matches a given regular expression (or if a given regular expression
50 matches a particular string, which comes down to the same thing).
52 Regular expressions can be concatenated to form new regular
53 expressions; if \emph{A} and \emph{B} are both regular expressions,
then \emph{AB} is also a regular expression. In general, if a string
\emph{p} matches \emph{A} and another string \emph{q} matches \emph{B},
the string \emph{pq} will match \emph{AB}. This holds unless \emph{A} or
\emph{B} contains low-precedence operations, there are boundary conditions
between \emph{A} and \emph{B}, or either expression has numbered group
references. Thus, complex
59 expressions can easily be constructed from simpler primitive
60 expressions like the ones described here. For details of the theory
61 and implementation of regular expressions, consult the Friedl book
62 referenced above, or almost any textbook about compiler construction.
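The caveat about low-precedence operations can be seen with alternation. The sketch below is only an illustration: the names \code{A}, \code{B}, \code{naive} and \code{safe} are invented here, and wrapping each part in a non-grouping \regexp{(?:...)} is one possible way to keep the parts intact, not the only one.

```python
import re

A = 'a|b'    # matches 'a' or 'b'
B = 'c'      # matches 'c'

# Naive concatenation gives 'a|bc', which means "a" or "bc",
# so only the first character of 'ac' is matched.
naive = re.match(A + B, 'ac').group()

# Wrapping each part in a non-grouping (?:...) keeps the parts intact.
safe = re.match('(?:' + A + ')(?:' + B + ')', 'ac').group()
```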
64 A brief explanation of the format of regular expressions follows. For
65 further information and a gentler presentation, consult the Regular
66 Expression HOWTO, accessible from \url{http://www.python.org/doc/howto/}.
68 Regular expressions can contain both special and ordinary characters.
69 Most ordinary characters, like \character{A}, \character{a}, or
70 \character{0}, are the simplest regular expressions; they simply match
71 themselves. You can concatenate ordinary characters, so \regexp{last}
72 matches the string \code{'last'}. (In the rest of this section, we'll
write REs in \regexp{this special style}, usually without quotes, and
74 strings to be matched \code{'in single quotes'}.)
76 Some characters, like \character{|} or \character{(}, are special.
77 Special characters either stand for classes of ordinary characters, or
78 affect how the regular expressions around them are interpreted.
80 The special characters are:
82 \begin{list}{}{\leftmargin 0.7in \labelwidth 0.65in}
84 \item[\character{.}] (Dot.) In the default mode, this matches any
85 character except a newline. If the \constant{DOTALL} flag has been
86 specified, this matches any character including a newline.
88 \item[\character{\textasciicircum}] (Caret.) Matches the start of the
89 string, and in \constant{MULTILINE} mode also matches immediately
90 after each newline.
92 \item[\character{\$}] Matches the end of the string or just before the
93 newline at the end of the string, and in \constant{MULTILINE} mode
94 also matches before a newline. \regexp{foo} matches both 'foo' and
95 'foobar', while the regular expression \regexp{foo\$} matches only
96 'foo'. More interestingly, searching for \regexp{foo.\$} in
97 'foo1\textbackslash nfoo2\textbackslash n' matches 'foo2' normally,
98 but 'foo1' in \constant{MULTILINE} mode.
100 \item[\character{*}] Causes the resulting RE to
101 match 0 or more repetitions of the preceding RE, as many repetitions
102 as are possible. \regexp{ab*} will
103 match 'a', 'ab', or 'a' followed by any number of 'b's.
105 \item[\character{+}] Causes the
106 resulting RE to match 1 or more repetitions of the preceding RE.
107 \regexp{ab+} will match 'a' followed by any non-zero number of 'b's; it
108 will not match just 'a'.
110 \item[\character{?}] Causes the resulting RE to
111 match 0 or 1 repetitions of the preceding RE. \regexp{ab?} will
112 match either 'a' or 'ab'.
114 \item[\code{*?}, \code{+?}, \code{??}] The \character{*},
115 \character{+}, and \character{?} qualifiers are all \dfn{greedy}; they
116 match as much text as possible. Sometimes this behaviour isn't
117 desired; if the RE \regexp{<.*>} is matched against
118 \code{'<H1>title</H1>'}, it will match the entire string, and not just
119 \code{'<H1>'}. Adding \character{?} after the qualifier makes it
120 perform the match in \dfn{non-greedy} or \dfn{minimal} fashion; as
121 \emph{few} characters as possible will be matched. Using \regexp{.*?}
122 in the previous expression will match only \code{'<H1>'}.
124 \item[\code{\{\var{m}\}}]
125 Specifies that exactly \var{m} copies of the previous RE should be
126 matched; fewer matches cause the entire RE not to match. For example,
127 \regexp{a\{6\}} will match exactly six \character{a} characters, but
128 not five.
130 \item[\code{\{\var{m},\var{n}\}}] Causes the resulting RE to match from
131 \var{m} to \var{n} repetitions of the preceding RE, attempting to
132 match as many repetitions as possible. For example, \regexp{a\{3,5\}}
133 will match from 3 to 5 \character{a} characters. Omitting \var{m}
134 specifies a lower bound of zero,
135 and omitting \var{n} specifies an infinite upper bound. As an
136 example, \regexp{a\{4,\}b} will match \code{aaaab} or a thousand
137 \character{a} characters followed by a \code{b}, but not \code{aaab}.
138 The comma may not be omitted or the modifier would be confused with
139 the previously described form.
141 \item[\code{\{\var{m},\var{n}\}?}] Causes the resulting RE to
142 match from \var{m} to \var{n} repetitions of the preceding RE,
143 attempting to match as \emph{few} repetitions as possible. This is
144 the non-greedy version of the previous qualifier. For example, on the
145 6-character string \code{'aaaaaa'}, \regexp{a\{3,5\}} will match 5
146 \character{a} characters, while \regexp{a\{3,5\}?} will only match 3
147 characters.
149 \item[\character{\e}] Either escapes special characters (permitting
150 you to match characters like \character{*}, \character{?}, and so
151 forth), or signals a special sequence; special sequences are discussed
152 below.
154 If you're not using a raw string to
155 express the pattern, remember that Python also uses the
156 backslash as an escape sequence in string literals; if the escape
157 sequence isn't recognized by Python's parser, the backslash and
158 subsequent character are included in the resulting string. However,
if Python would recognize the resulting sequence, the backslash must
be doubled (written as two backslashes). This is complicated and hard to understand, so
161 it's highly recommended that you use raw strings for all but the
162 simplest expressions.
164 \item[\code{[]}] Used to indicate a set of characters. Characters can
165 be listed individually, or a range of characters can be indicated by
166 giving two characters and separating them by a \character{-}. Special
167 characters are not active inside sets. For example, \regexp{[akm\$]}
168 will match any of the characters \character{a}, \character{k},
169 \character{m}, or \character{\$}; \regexp{[a-z]}
170 will match any lowercase letter, and \code{[a-zA-Z0-9]} matches any
letter or digit. Character classes such as \code{\e w} or \code{\e S}
(defined below) are also acceptable inside a set. If you want to
173 include a \character{]} or a \character{-} inside a set, precede it with a
174 backslash, or place it as the first character. The
175 pattern \regexp{[]]} will match \code{']'}, for example.
177 You can match the characters not within a range by \dfn{complementing}
178 the set. This is indicated by including a
179 \character{\textasciicircum} as the first character of the set;
180 \character{\textasciicircum} elsewhere will simply match the
181 \character{\textasciicircum} character. For example,
182 \regexp{[{\textasciicircum}5]} will match
183 any character except \character{5}, and
184 \regexp{[\textasciicircum\code{\textasciicircum}]} will match any character
185 except \character{\textasciicircum}.
187 \item[\character{|}]\code{A|B}, where A and B can be arbitrary REs,
188 creates a regular expression that will match either A or B. An
189 arbitrary number of REs can be separated by the \character{|} in this
190 way. This can be used inside groups (see below) as well. As the target
191 string is scanned, REs separated by \character{|} are tried from left to
192 right. When one pattern completely matches, that branch is accepted.
193 This means that once \code{A} matches, \code{B} will not be tested further,
194 even if it would produce a longer overall match. In other words, the
195 \character{|} operator is never greedy. To match a literal \character{|},
196 use \regexp{\e|}, or enclose it inside a character class, as in \regexp{[|]}.
198 \item[\code{(...)}] Matches whatever regular expression is inside the
199 parentheses, and indicates the start and end of a group; the contents
200 of a group can be retrieved after a match has been performed, and can
201 be matched later in the string with the \regexp{\e \var{number}} special
202 sequence, described below. To match the literals \character{(} or
203 \character{)}, use \regexp{\e(} or \regexp{\e)}, or enclose them
204 inside a character class: \regexp{[(] [)]}.
206 \item[\code{(?...)}] This is an extension notation (a \character{?}
207 following a \character{(} is not meaningful otherwise). The first
208 character after the \character{?}
209 determines what the meaning and further syntax of the construct is.
210 Extensions usually do not create a new group;
211 \regexp{(?P<\var{name}>...)} is the only exception to this rule.
212 Following are the currently supported extensions.
214 \item[\code{(?iLmsux)}] (One or more letters from the set \character{i},
215 \character{L}, \character{m}, \character{s}, \character{u},
216 \character{x}.) The group matches the empty string; the letters set
217 the corresponding flags (\constant{re.I}, \constant{re.L},
218 \constant{re.M}, \constant{re.S}, \constant{re.U}, \constant{re.X})
219 for the entire regular expression. This is useful if you wish to
220 include the flags as part of the regular expression, instead of
221 passing a \var{flag} argument to the \function{compile()} function.
223 Note that the \regexp{(?x)} flag changes how the expression is parsed.
224 It should be used first in the expression string, or after one or more
225 whitespace characters. If there are non-whitespace characters before
226 the flag, the results are undefined.
228 \item[\code{(?:...)}] A non-grouping version of regular parentheses.
229 Matches whatever regular expression is inside the parentheses, but the
230 substring matched by the
231 group \emph{cannot} be retrieved after performing a match or
232 referenced later in the pattern.
234 \item[\code{(?P<\var{name}>...)}] Similar to regular parentheses, but
235 the substring matched by the group is accessible via the symbolic group
236 name \var{name}. Group names must be valid Python identifiers, and
237 each group name must be defined only once within a regular expression. A
238 symbolic group is also a numbered group, just as if the group were not
named. So the group named 'id' in the example below can also be
referenced as the numbered group 1.
242 For example, if the pattern is
243 \regexp{(?P<id>[a-zA-Z_]\e w*)}, the group can be referenced by its
244 name in arguments to methods of match objects, such as
245 \code{m.group('id')} or \code{m.end('id')}, and also by name in
246 pattern text (for example, \regexp{(?P=id)}) and replacement text
247 (such as \code{\e g<id>}).
249 \item[\code{(?P=\var{name})}] Matches whatever text was matched by the
250 earlier group named \var{name}.
252 \item[\code{(?\#...)}] A comment; the contents of the parentheses are
253 simply ignored.
255 \item[\code{(?=...)}] Matches if \regexp{...} matches next, but doesn't
256 consume any of the string. This is called a lookahead assertion. For
257 example, \regexp{Isaac (?=Asimov)} will match \code{'Isaac~'} only if it's
258 followed by \code{'Asimov'}.
260 \item[\code{(?!...)}] Matches if \regexp{...} doesn't match next. This
261 is a negative lookahead assertion. For example,
262 \regexp{Isaac (?!Asimov)} will match \code{'Isaac~'} only if it's \emph{not}
263 followed by \code{'Asimov'}.
265 \item[\code{(?<=...)}] Matches if the current position in the string
266 is preceded by a match for \regexp{...} that ends at the current
267 position. This is called a \dfn{positive lookbehind assertion}.
268 \regexp{(?<=abc)def} will find a match in \samp{abcdef}, since the
269 lookbehind will back up 3 characters and check if the contained
270 pattern matches. The contained pattern must only match strings of
271 some fixed length, meaning that \regexp{abc} or \regexp{a|b} are
272 allowed, but \regexp{a*} and \regexp{a\{3,4\}} are not. Note that
273 patterns which start with positive lookbehind assertions will never
274 match at the beginning of the string being searched; you will most
275 likely want to use the \function{search()} function rather than the
276 \function{match()} function:
\begin{verbatim}
>>> import re
>>> m = re.search('(?<=abc)def', 'abcdef')
>>> m.group(0)
'def'
\end{verbatim}
285 This example looks for a word following a hyphen:
\begin{verbatim}
>>> m = re.search(r'(?<=-)\w+', 'spam-egg')
>>> m.group(0)
'egg'
\end{verbatim}
293 \item[\code{(?<!...)}] Matches if the current position in the string
294 is not preceded by a match for \regexp{...}. This is called a
295 \dfn{negative lookbehind assertion}. Similar to positive lookbehind
296 assertions, the contained pattern must only match strings of some
297 fixed length. Patterns which start with negative lookbehind
298 assertions may match at the beginning of the string being searched.
300 \end{list}
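Several of the special characters described in the list above can be tried out directly. The following is a minimal sketch using the examples from the text; the variable names are illustrative only.

```python
import re

html = '<H1>title</H1>'
greedy = re.match(r'<.*>', html).group()      # '<H1>title</H1>'
minimal = re.match(r'<.*?>', html).group()    # '<H1>'

# Bounded repetition, greedy and non-greedy.
most = re.match(r'a{3,5}', 'aaaaaa').group()      # 'aaaaa'
fewest = re.match(r'a{3,5}?', 'aaaaaa').group()   # 'aaa'

# '$' with and without MULTILINE.
text = 'foo1\nfoo2\n'
default_end = re.search(r'foo.$', text).group()           # 'foo2'
multiline_end = re.search(r'foo.$', text, re.M).group()   # 'foo1'

# Alternation takes the first branch that matches, not the longest.
first_branch = re.match(r'a|ab', 'ab').group()    # 'a'

# A named group is also a numbered group.
m = re.match(r'(?P<id>[a-zA-Z_]\w*)', 'spam42 eggs')
by_name = m.group('id')       # 'spam42'
by_number = m.group(1)        # 'spam42'

# Lookahead assertions test what follows without consuming it.
positive = re.match(r'Isaac (?=Asimov)', 'Isaac Asimov').group()   # 'Isaac '
negative = re.match(r'Isaac (?!Asimov)', 'Isaac Asimov')           # None
```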
302 The special sequences consist of \character{\e} and a character from the
303 list below. If the ordinary character is not on the list, then the
304 resulting RE will match the second character. For example,
305 \regexp{\e\$} matches the character \character{\$}.
307 \begin{list}{}{\leftmargin 0.7in \labelwidth 0.65in}
309 \item[\code{\e \var{number}}] Matches the contents of the group of the
310 same number. Groups are numbered starting from 1. For example,
311 \regexp{(.+) \e 1} matches \code{'the the'} or \code{'55 55'}, but not
312 \code{'the end'} (note
313 the space after the group). This special sequence can only be used to
314 match one of the first 99 groups. If the first digit of \var{number}
315 is 0, or \var{number} is 3 octal digits long, it will not be interpreted
316 as a group match, but as the character with octal value \var{number}.
317 Inside the \character{[} and \character{]} of a character class, all numeric
318 escapes are treated as characters.
320 \item[\code{\e A}] Matches only at the start of the string.
322 \item[\code{\e b}] Matches the empty string, but only at the
323 beginning or end of a word. A word is defined as a sequence of
324 alphanumeric or underscore characters, so the end of a word is indicated by
325 whitespace or a non-alphanumeric, non-underscore character. Note that
326 {}\code{\e b} is defined as the boundary between \code{\e w} and \code{\e
327 W}, so the precise set of characters deemed to be alphanumeric depends on the
328 values of the \code{UNICODE} and \code{LOCALE} flags. Inside a character
329 range, \regexp{\e b} represents the backspace character, for compatibility
330 with Python's string literals.
332 \item[\code{\e B}] Matches the empty string, but only when it is \emph{not}
333 at the beginning or end of a word. This is just the opposite of {}\code{\e
334 b}, so is also subject to the settings of \code{LOCALE} and \code{UNICODE}.
336 \item[\code{\e d}]Matches any decimal digit; this is
337 equivalent to the set \regexp{[0-9]}.
339 \item[\code{\e D}]Matches any non-digit character; this is
340 equivalent to the set \regexp{[{\textasciicircum}0-9]}.
342 \item[\code{\e s}]Matches any whitespace character; this is
343 equivalent to the set \regexp{[ \e t\e n\e r\e f\e v]}.
345 \item[\code{\e S}]Matches any non-whitespace character; this is
346 equivalent to the set \regexp{[\textasciicircum\ \e t\e n\e r\e f\e v]}.
348 \item[\code{\e w}]When the \constant{LOCALE} and \constant{UNICODE}
349 flags are not specified, matches any alphanumeric character and the
350 underscore; this is equivalent to the set
351 \regexp{[a-zA-Z0-9_]}. With \constant{LOCALE}, it will match the set
352 \regexp{[0-9_]} plus whatever characters are defined as alphanumeric for
353 the current locale. If \constant{UNICODE} is set, this will match the
354 characters \regexp{[0-9_]} plus whatever is classified as alphanumeric
355 in the Unicode character properties database.
357 \item[\code{\e W}]When the \constant{LOCALE} and \constant{UNICODE}
358 flags are not specified, matches any non-alphanumeric character; this
359 is equivalent to the set \regexp{[{\textasciicircum}a-zA-Z0-9_]}. With
360 \constant{LOCALE}, it will match any character not in the set
361 \regexp{[0-9_]}, and not defined as alphanumeric for the current locale.
362 If \constant{UNICODE} is set, this will match anything other than
363 \regexp{[0-9_]} and characters marked as alphanumeric in the Unicode
364 character properties database.
366 \item[\code{\e Z}]Matches only at the end of the string.
368 \end{list}
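A few of the special sequences listed above, shown as a short sketch. The sample strings and variable names are made up for illustration.

```python
import re

# \1 must repeat exactly what group 1 matched.
repeated = re.match(r'(.+) \1', 'the the') is not None     # True
different = re.match(r'(.+) \1', 'the end') is not None    # False

# \b matches only at word boundaries, so 'foobar' is skipped.
words = re.findall(r'\bfoo\b', 'foo foobar (foo) foo.')    # ['foo', 'foo', 'foo']

# \d is a shorthand for [0-9].
digits = re.findall(r'\d+', 'a1b22c333')     # ['1', '22', '333']

# Three octal digits name a character, not a group: \101 is 'A'.
octal = re.match(r'\101', 'A') is not None   # True

# Inside a character class, \b is the backspace character.
backspace = re.search(r'[\b]', 'a\bc') is not None   # True
```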
370 Most of the standard escapes supported by Python string literals are
371 also accepted by the regular expression parser:
\begin{verbatim}
\a      \b      \f      \n
\r      \t      \v      \x
\end{verbatim}
379 Octal escapes are included in a limited form: If the first digit is a
380 0, or if there are three octal digits, it is considered an octal
381 escape. Otherwise, it is a group reference.
384 % Note the lack of a period in the section title; it causes problems
385 % with readers of the GNU info version. See http://www.python.org/sf/581414.
386 \subsection{Matching vs Searching \label{matching-searching}}
387 \sectionauthor{Fred L. Drake, Jr.}{fdrake@acm.org}
389 Python offers two different primitive operations based on regular
390 expressions: match and search. If you are accustomed to Perl's
391 semantics, the search operation is what you're looking for. See the
392 \function{search()} function and corresponding method of compiled
393 regular expression objects.
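The basic difference between the two operations can be sketched as follows, using an arbitrary sample string; the variable names are illustrative only.

```python
import re

# match() only tries the pattern at the start of the string...
at_start = re.match('c', 'abcdef')     # None

# ...while search() scans forward until it finds a match.
found = re.search('c', 'abcdef')
found_at = found.start()               # 2
```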
395 Note that match may differ from search using a regular expression
396 beginning with \character{\textasciicircum}:
397 \character{\textasciicircum} matches only at the
398 start of the string, or in \constant{MULTILINE} mode also immediately
399 following a newline. The ``match'' operation succeeds only if the
400 pattern matches at the start of the string regardless of mode, or at
401 the starting position given by the optional \var{pos} argument
402 regardless of whether a newline precedes it.
404 % Examples from Tim Peters:
\begin{verbatim}
re.compile("a").match("ba", 1)           # succeeds
re.compile("^a").search("ba", 1)         # fails; 'a' not at start
re.compile("^a").search("\na", 1)        # fails; 'a' not at start
re.compile("^a", re.M).search("\na", 1)  # succeeds
re.compile("^a", re.M).search("ba", 1)   # fails; no preceding \n
\end{verbatim}
414 \subsection{Module Contents}
415 \nodename{Contents of Module re}
417 The module defines the following functions and constants, and an exception:
420 \begin{funcdesc}{compile}{pattern\optional{, flags}}
421 Compile a regular expression pattern into a regular expression
422 object, which can be used for matching using its \function{match()} and
423 \function{search()} methods, described below.
425 The expression's behaviour can be modified by specifying a
426 \var{flags} value. Values can be any of the following variables,
427 combined using bitwise OR (the \code{|} operator).
429 The sequence
\begin{verbatim}
prog = re.compile(pat)
result = prog.match(str)
\end{verbatim}

is equivalent to

\begin{verbatim}
result = re.match(pat, str)
\end{verbatim}
442 but the version using \function{compile()} is more efficient when the
443 expression will be used several times in a single program.
444 %(The compiled version of the last pattern passed to
445 %\function{re.match()} or \function{re.search()} is cached, so
446 %programs that use only a single regular expression at a time needn't
447 %worry about compiling regular expressions.)
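As a rough sketch of combining flags and reusing a compiled object, consider the following. The pattern and the sample strings are invented for this example.

```python
import re

# IGNORECASE and MULTILINE combined with the | operator.
prog = re.compile(r'^hello\s+world$', re.I | re.M)

# The compiled object is reused for several strings.
lines = ['HELLO   World', 'say\nhello world\n', 'hello there']
results = [prog.search(line) is not None for line in lines]
# results == [True, True, False]
```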
448 \end{funcdesc}
450 \begin{datadesc}{I}
451 \dataline{IGNORECASE}
452 Perform case-insensitive matching; expressions like \regexp{[A-Z]}
453 will match lowercase letters, too. This is not affected by the
454 current locale.
455 \end{datadesc}
457 \begin{datadesc}{L}
458 \dataline{LOCALE}
459 Make \regexp{\e w}, \regexp{\e W}, \regexp{\e b}, and
460 \regexp{\e B} dependent on the current locale.
461 \end{datadesc}
463 \begin{datadesc}{M}
464 \dataline{MULTILINE}
465 When specified, the pattern character \character{\textasciicircum}
466 matches at the beginning of the string and at the beginning of each
467 line (immediately following each newline); and the pattern character
468 \character{\$} matches at the end of the string and at the end of each
469 line (immediately preceding each newline). By default,
470 \character{\textasciicircum} matches only at the beginning of the
471 string, and \character{\$} only at the end of the string and
472 immediately before the newline (if any) at the end of the string.
473 \end{datadesc}
475 \begin{datadesc}{S}
476 \dataline{DOTALL}
477 Make the \character{.} special character match any character at all,
478 including a newline; without this flag, \character{.} will match
479 anything \emph{except} a newline.
480 \end{datadesc}
482 \begin{datadesc}{U}
483 \dataline{UNICODE}
484 Make \regexp{\e w}, \regexp{\e W}, \regexp{\e b}, and
485 \regexp{\e B} dependent on the Unicode character properties database.
486 \versionadded{2.0}
487 \end{datadesc}
489 \begin{datadesc}{X}
490 \dataline{VERBOSE}
491 This flag allows you to write regular expressions that look nicer.
492 Whitespace within the pattern is ignored,
493 except when in a character class or preceded by an unescaped
backslash, and, when a line contains a \character{\#} neither in a
character class nor preceded by an unescaped backslash, all characters
496 from the leftmost such \character{\#} through the end of the line are
497 ignored.
498 % XXX should add an example here
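One possible illustration: the two compiled objects in this sketch are intended to be equivalent, with the second spread over several commented lines. The pattern and names are chosen for this example only.

```python
import re

# The same pattern written twice: once compactly, once with VERBOSE.
compact = re.compile(r'\d+\.\d*')
verbose = re.compile(r"""
    \d+     # the integer part
    \.      # the decimal point
    \d*     # any fractional digits
    """, re.X)

sample = 'pi is roughly 3.14159'
compact_hit = compact.search(sample).group()   # '3.14159'
verbose_hit = verbose.search(sample).group()   # '3.14159'
```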
499 \end{datadesc}
502 \begin{funcdesc}{search}{pattern, string\optional{, flags}}
503 Scan through \var{string} looking for a location where the regular
504 expression \var{pattern} produces a match, and return a
505 corresponding \class{MatchObject} instance.
506 Return \code{None} if no
507 position in the string matches the pattern; note that this is
508 different from finding a zero-length match at some point in the string.
509 \end{funcdesc}
511 \begin{funcdesc}{match}{pattern, string\optional{, flags}}
512 If zero or more characters at the beginning of \var{string} match
513 the regular expression \var{pattern}, return a corresponding
514 \class{MatchObject} instance. Return \code{None} if the string does not
515 match the pattern; note that this is different from a zero-length
516 match.
\note{If you want to locate a match anywhere in
\var{string}, use \function{search()} instead.}
520 \end{funcdesc}
522 \begin{funcdesc}{split}{pattern, string\optional{, maxsplit\code{ = 0}}}
523 Split \var{string} by the occurrences of \var{pattern}. If
capturing parentheses are used in \var{pattern}, then the text of all
groups in the pattern is also returned as part of the resulting list.
526 If \var{maxsplit} is nonzero, at most \var{maxsplit} splits
527 occur, and the remainder of the string is returned as the final
528 element of the list. (Incompatibility note: in the original Python
529 1.5 release, \var{maxsplit} was ignored. This has been fixed in
530 later releases.)
\begin{verbatim}
>>> re.split(r'\W+', 'Words, words, words.')
['Words', 'words', 'words', '']
>>> re.split(r'(\W+)', 'Words, words, words.')
['Words', ', ', 'words', ', ', 'words', '.', '']
>>> re.split(r'\W+', 'Words, words, words.', 1)
['Words', 'words, words.']
\end{verbatim}
541 This function combines and extends the functionality of
542 the old \function{regsub.split()} and \function{regsub.splitx()}.
543 \end{funcdesc}
545 \begin{funcdesc}{findall}{pattern, string}
546 Return a list of all non-overlapping matches of \var{pattern} in
547 \var{string}. If one or more groups are present in the pattern,
548 return a list of groups; this will be a list of tuples if the
549 pattern has more than one group. Empty matches are included in the
550 result unless they touch the beginning of another match.
551 \versionadded{1.5.2}
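The three shapes of the result can be sketched as follows; the sample text and variable names are invented for illustration.

```python
import re

text = 'a=1, b=22, c=333'

# No groups: a list of the matched substrings.
whole = re.findall(r'\w=\d+', text)        # ['a=1', 'b=22', 'c=333']

# One group: a list of that group's contents.
names = re.findall(r'(\w)=\d+', text)      # ['a', 'b', 'c']

# Several groups: a list of tuples, one tuple per match.
pairs = re.findall(r'(\w)=(\d+)', text)    # [('a', '1'), ('b', '22'), ('c', '333')]
```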
552 \end{funcdesc}
554 \begin{funcdesc}{finditer}{pattern, string}
555 Return an iterator over all non-overlapping matches for the RE
556 \var{pattern} in \var{string}. For each match, the iterator returns
557 a match object. Empty matches are included in the result unless they
558 touch the beginning of another match.
559 \versionadded{2.2}
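A minimal sketch of iterating over matches; the list name \code{spans} is an assumption of this example.

```python
import re

spans = []
for m in re.finditer(r'\d+', 'a1b22c333'):
    # Each item is a match object, so positions are available too.
    spans.append((m.start(), m.end(), m.group()))
# spans == [(1, 2, '1'), (3, 5, '22'), (6, 9, '333')]
```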
560 \end{funcdesc}
562 \begin{funcdesc}{sub}{pattern, repl, string\optional{, count}}
563 Return the string obtained by replacing the leftmost non-overlapping
564 occurrences of \var{pattern} in \var{string} by the replacement
565 \var{repl}. If the pattern isn't found, \var{string} is returned
566 unchanged. \var{repl} can be a string or a function; if it is a
567 string, any backslash escapes in it are processed. That is,
\samp{\e n} is converted to a single newline character, \samp{\e r}
is converted to a carriage return, and so forth. Unknown escapes such as
570 \samp{\e j} are left alone. Backreferences, such as \samp{\e6}, are
571 replaced with the substring matched by group 6 in the pattern. For
572 example:
\begin{verbatim}
>>> re.sub(r'def\s+([a-zA-Z_][a-zA-Z_0-9]*)\s*\(\s*\):',
...        r'static PyObject*\npy_\1(void)\n{',
...        'def myfunc():')
'static PyObject*\npy_myfunc(void)\n{'
\end{verbatim}
581 If \var{repl} is a function, it is called for every non-overlapping
582 occurrence of \var{pattern}. The function takes a single match
583 object argument, and returns the replacement string. For example:
\begin{verbatim}
>>> def dashrepl(matchobj):
...     if matchobj.group(0) == '-': return ' '
...     else: return '-'
...
>>> re.sub('-{1,2}', dashrepl, 'pro----gram-files')
'pro--gram files'
\end{verbatim}
The pattern may be a string or an RE object. If you need to specify
regular expression flags, you must use an RE object or embed the
modifiers in the pattern; for example, \samp{sub("(?i)b+", "x", "bbbb
BBBB")} returns \code{'x x'}.
598 The optional argument \var{count} is the maximum number of pattern
599 occurrences to be replaced; \var{count} must be a non-negative
600 integer. If omitted or zero, all occurrences will be replaced.
601 Empty matches for the pattern are replaced only when not adjacent to
602 a previous match, so \samp{sub('x*', '-', 'abc')} returns
603 \code{'-a-b-c-'}.
605 In addition to character escapes and backreferences as described
606 above, \samp{\e g<name>} will use the substring matched by the group
607 named \samp{name}, as defined by the \regexp{(?P<name>...)} syntax.
608 \samp{\e g<number>} uses the corresponding group number;
609 \samp{\e g<2>} is therefore equivalent to \samp{\e 2}, but isn't
610 ambiguous in a replacement such as \samp{\e g<2>0}. \samp{\e 20}
611 would be interpreted as a reference to group 20, not a reference to
612 group 2 followed by the literal character \character{0}. The
613 backreference \samp{\e g<0>} substitutes in the entire substring
614 matched by the RE.
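The \samp{\e g<...>} forms can be sketched as follows. The group names \code{first} and \code{second} and the sample strings are invented for this example.

```python
import re

# Swap two words using named groups in the replacement.
swapped = re.sub(r'(?P<first>\w+) (?P<second>\w+)',
                 r'\g<second> \g<first>', 'hello world')   # 'world hello'

# \g<2>0 is group 2 followed by a literal '0'; r'\20' would mean group 20.
numbered = re.sub(r'(a)(b)', r'\g<2>0', 'ab')    # 'b0'

# \g<0> stands for the whole match.
wrapped = re.sub(r'\d+', r'<\g<0>>', 'a1b22')    # 'a<1>b<22>'
```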
615 \end{funcdesc}
617 \begin{funcdesc}{subn}{pattern, repl, string\optional{, count}}
618 Perform the same operation as \function{sub()}, but return a tuple
619 \code{(\var{new_string}, \var{number_of_subs_made})}.
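For instance, under the assumption that \var{count} is passed by keyword, a short sketch:

```python
import re

everything = re.subn('-', ' ', 'a-b-c')            # ('a b c', 2)
only_first = re.subn('-', ' ', 'a-b-c', count=1)   # ('a b-c', 1)
nothing = re.subn('x', 'y', 'a-b-c')               # ('a-b-c', 0)
```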
620 \end{funcdesc}
622 \begin{funcdesc}{escape}{string}
623 Return \var{string} with all non-alphanumerics backslashed; this is
624 useful if you want to match an arbitrary literal string that may have
625 regular expression metacharacters in it.
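The effect can be sketched with a string containing several metacharacters. The exact escaped text is not shown here because it may differ between versions; the sample string is invented for illustration.

```python
import re

literal = '1+1=2 (really?)'

# Used directly, '+', '(', ')' and '?' are treated as metacharacters,
# so the pattern does not match its own text.
unescaped = re.search(literal, literal)            # None

# After escaping, the pattern matches the text character for character.
escaped = re.search(re.escape(literal), literal)
escaped_text = escaped.group()                     # '1+1=2 (really?)'
```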
626 \end{funcdesc}
628 \begin{excdesc}{error}
629 Exception raised when a string passed to one of the functions here
630 is not a valid regular expression (for example, it might contain
631 unmatched parentheses) or when some other error occurs during
632 compilation or matching. It is never an error if a string contains
633 no match for a pattern.
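A minimal sketch of the two situations, using invented variable names:

```python
import re

# An unbalanced parenthesis is rejected when the pattern is compiled.
try:
    re.compile('(abc')
    compiled = True
except re.error:
    compiled = False

# A valid pattern that finds nothing simply returns None.
no_match = re.search('xyz', 'abc')
```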
634 \end{excdesc}
637 \subsection{Regular Expression Objects \label{re-objects}}
639 Compiled regular expression objects support the following methods and
640 attributes:
642 \begin{methoddesc}[RegexObject]{match}{string\optional{, pos\optional{,
643 endpos}}}
644 If zero or more characters at the beginning of \var{string} match
645 this regular expression, return a corresponding
646 \class{MatchObject} instance. Return \code{None} if the string does not
647 match the pattern; note that this is different from a zero-length
648 match.
650 \note{If you want to locate a match anywhere in
651 \var{string}, use \method{search()} instead.}
653 The optional second parameter \var{pos} gives an index in the string
654 where the search is to start; it defaults to \code{0}. This is not
655 completely equivalent to slicing the string; the
656 \code{'\textasciicircum'} pattern
657 character matches at the real beginning of the string and at positions
658 just after a newline, but not necessarily at the index where the search
659 is to start.
661 The optional parameter \var{endpos} limits how far the string will
662 be searched; it will be as if the string is \var{endpos} characters
663 long, so only the characters from \var{pos} to \code{\var{endpos} -
664 1} will be searched for a match. If \var{endpos} is less than
\var{pos}, no match will be found; otherwise, if \var{rx} is a
666 compiled regular expression object,
667 \code{\var{rx}.match(\var{string}, 0, 50)} is equivalent to
668 \code{\var{rx}.match(\var{string}[:50], 0)}.
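The behaviour of \var{pos} and \var{endpos} can be sketched as follows; the names \code{rx} and \code{caret} and the sample strings are assumptions of this example.

```python
import re

rx = re.compile('dog')

# match() at index 0 fails, but succeeds when started at index 3.
at_zero = rx.match('hotdog')           # None
at_three = rx.match('hotdog', 3)

# endpos=5 hides the final 'g', exactly as slicing would.
limited = rx.search('hotdog', 0, 5)    # None
sliced = rx.search('hotdog'[:5])       # None

# pos is not the same as slicing: '^' still refers to the real start.
caret = re.compile('^a')
with_pos = caret.search('ba', 1)       # None
with_slice = caret.search('ba'[1:])
```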
669 \end{methoddesc}
671 \begin{methoddesc}[RegexObject]{search}{string\optional{, pos\optional{,
672 endpos}}}
673 Scan through \var{string} looking for a location where this regular
674 expression produces a match, and return a
675 corresponding \class{MatchObject} instance. Return \code{None} if no
676 position in the string matches the pattern; note that this is
677 different from finding a zero-length match at some point in the string.
679 The optional \var{pos} and \var{endpos} parameters have the same
680 meaning as for the \method{match()} method.
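
The difference between \method{match()} and \method{search()}, and
between ``no match'' and a zero-length match, can be sketched as follows
(sample strings are illustrative):

```python
import re

# Illustrative data only.
number = re.compile(r"\d+")
text = "order 66 shipped"

at_start = number.match(text)    # None: the string does not start with a digit
anywhere = number.search(text)   # finds '66' further along in the string

# A pattern that can match the empty string still yields a match object;
# this is different from getting None.
empty = re.compile(r"x*").search("abc")
empty_span = empty.span()        # (0, 0): a zero-length match at index 0
```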
681 \end{methoddesc}
683 \begin{methoddesc}[RegexObject]{split}{string\optional{,
684 maxsplit\code{ = 0}}}
685 Identical to the \function{split()} function, using the compiled pattern.
686 \end{methoddesc}
688 \begin{methoddesc}[RegexObject]{findall}{string}
689 Identical to the \function{findall()} function, using the compiled pattern.
690 \end{methoddesc}
692 \begin{methoddesc}[RegexObject]{finditer}{string}
693 Identical to the \function{finditer()} function, using the compiled pattern.
694 \end{methoddesc}
696 \begin{methoddesc}[RegexObject]{sub}{repl, string\optional{, count\code{ = 0}}}
697 Identical to the \function{sub()} function, using the compiled pattern.
698 \end{methoddesc}
700 \begin{methoddesc}[RegexObject]{subn}{repl, string\optional{,
701 count\code{ = 0}}}
702 Identical to the \function{subn()} function, using the compiled pattern.
703 \end{methoddesc}
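
The five methods above take the same arguments as the corresponding
module-level functions, minus the pattern.  A short sketch, using
made-up sample data:

```python
import re

# Hypothetical sample data, chosen only for illustration.
text = "3 apples, 12 pears"
separator = re.compile(r"\s*,\s*")
number = re.compile(r"\d+")

fields = separator.split("a , b,c")                   # ['a', 'b', 'c']
first_split = separator.split("a , b,c", maxsplit=1)  # ['a', 'b,c']

found = number.findall(text)                          # ['3', '12']
spans = [m.span() for m in number.finditer(text)]     # [(0, 1), (10, 12)]

replaced = number.sub("N", text)                      # 'N apples, N pears'
replaced_once = number.subn("N", text, count=1)       # ('N apples, 12 pears', 1)
```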
706 \begin{memberdesc}[RegexObject]{flags}
707 The flags argument used when the RE object was compiled, or
708 \code{0} if no flags were provided.
709 \end{memberdesc}
711 \begin{memberdesc}[RegexObject]{groupindex}
712 A dictionary mapping any symbolic group names defined by
713 \regexp{(?P<\var{id}>)} to group numbers. The dictionary is empty if no
714 symbolic groups were used in the pattern.
715 \end{memberdesc}
717 \begin{memberdesc}[RegexObject]{pattern}
718 The pattern string from which the RE object was compiled.
719 \end{memberdesc}
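
A small sketch of the three attributes above, using a hypothetical date
pattern.  On Python versions where string patterns are Unicode by
default, \member{flags} may also contain implicitly set bits, so the
sketch only checks the flag that was passed explicitly:

```python
import re

# The pattern and group names are illustrative assumptions.
date = re.compile(r"(?P<year>\d{4})-(?P<month>\d{2})", re.IGNORECASE)

source = date.pattern                     # the original pattern string
names = dict(date.groupindex)             # {'year': 1, 'month': 2}
ignores_case = bool(date.flags & re.IGNORECASE)

# A pattern without named groups has an empty groupindex.
no_names = dict(re.compile(r"(\d+)").groupindex)
```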
722 \subsection{Match Objects \label{match-objects}}
724 \class{MatchObject} instances support the following methods and
725 attributes:
727 \begin{methoddesc}[MatchObject]{expand}{template}
728 Return the string obtained by doing backslash substitution on the
729 template string \var{template}, as done by the \method{sub()} method.
730 Escapes such as \samp{\e n} are converted to the appropriate
731 characters, and numeric backreferences (\samp{\e 1}, \samp{\e 2}) and
732 named backreferences (\samp{\e g<1>}, \samp{\e g<name>}) are replaced
733 by the contents of the corresponding group.
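
For instance, assuming a hypothetical two-word name pattern,
\method{expand()} can reorder the matched groups:

```python
import re

# Illustrative pattern and input.
m = re.match(r"(?P<first>\w+) (\w+)", "Isaac Newton")

# Numeric and named backreferences are replaced by the group contents.
swapped = m.expand(r"\2, \g<first>")      # 'Newton, Isaac'

# Escapes such as \n in the template become the real character.
with_newline = m.expand(r"\g<2>\n")       # 'Newton' followed by a newline
```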
734 \end{methoddesc}
736 \begin{methoddesc}[MatchObject]{group}{\optional{group1, \moreargs}}
Return one or more subgroups of the match.  If there is a single
argument, the result is a single string; if there are
739 multiple arguments, the result is a tuple with one item per argument.
740 Without arguments, \var{group1} defaults to zero (the whole match
741 is returned).
742 If a \var{groupN} argument is zero, the corresponding return value is the
743 entire matching string; if it is in the inclusive range [1..99], it is
744 the string matching the corresponding parenthesized group. If a
745 group number is negative or larger than the number of groups defined
746 in the pattern, an \exception{IndexError} exception is raised.
747 If a group is contained in a part of the pattern that did not match,
748 the corresponding result is \code{None}. If a group is contained in a
749 part of the pattern that matched multiple times, the last match is
750 returned.
752 If the regular expression uses the \regexp{(?P<\var{name}>...)} syntax,
753 the \var{groupN} arguments may also be strings identifying groups by
754 their group name. If a string argument is not used as a group name in
755 the pattern, an \exception{IndexError} exception is raised.
757 A moderately complicated example:
\begin{verbatim}
m = re.match(r"(?P<int>\d+)\.(\d*)", '3.14')
\end{verbatim}
763 After performing this match, \code{m.group(1)} is \code{'3'}, as is
764 \code{m.group('int')}, and \code{m.group(2)} is \code{'14'}.
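
The remaining cases described above (several arguments, groups that did
not match, repeated groups, and invalid group references) can be
sketched like this; the extra patterns are illustrative:

```python
import re

m = re.match(r"(?P<int>\d+)\.(\d*)", "3.14")

whole = m.group()          # '3.14': no argument means group 0
both = m.group(1, 2)       # ('3', '14'): one tuple item per argument

# A group inside a part of the pattern that did not match gives None.
optional = re.match(r"(a)?b", "b")
missing = optional.group(1)             # None

# A group that matched several times reports only its last match.
repeated = re.match(r"(?:(\w)-)+", "a-b-c-")
last = repeated.group(1)                # 'c'

# Unknown group numbers or names raise IndexError.
errors = []
for bad in (3, "frac"):
    try:
        m.group(bad)
    except IndexError:
        errors.append(bad)
```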
765 \end{methoddesc}
767 \begin{methoddesc}[MatchObject]{groups}{\optional{default}}
768 Return a tuple containing all the subgroups of the match, from 1 up to
769 however many groups are in the pattern. The \var{default} argument is
770 used for groups that did not participate in the match; it defaults to
771 \code{None}. (Incompatibility note: in the original Python 1.5
772 release, if the tuple was one element long, a string would be returned
773 instead. In later versions (from 1.5.1 on), a singleton tuple is
774 returned in such cases.)
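
A brief sketch of the \var{default} argument, using an illustrative
pattern whose second group is optional:

```python
import re

# The fractional part is optional, so group 2 may not participate.
m = re.match(r"(\d+)\.?(\d+)?", "24")

plain = m.groups()         # ('24', None)
filled = m.groups("0")     # ('24', '0'): default replaces the missing group

# A pattern with a single group still yields a tuple.
single = re.match(r"(\d+)", "7").groups()   # ('7',)
```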
775 \end{methoddesc}
777 \begin{methoddesc}[MatchObject]{groupdict}{\optional{default}}
778 Return a dictionary containing all the \emph{named} subgroups of the
779 match, keyed by the subgroup name. The \var{default} argument is
780 used for groups that did not participate in the match; it defaults to
781 \code{None}.
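
For example, with hypothetical group names:

```python
import re

# Group names and inputs are made up for illustration.
m = re.match(r"(?P<first_name>\w+) (?P<last_name>\w+)", "Malcolm Reynolds")
names = m.groupdict()
# {'first_name': 'Malcolm', 'last_name': 'Reynolds'}

# Named groups that did not participate get the default value.
option = re.match(r"(?P<key>\w+)(?:=(?P<value>\w+))?", "debug")
with_default = option.groupdict("")
# {'key': 'debug', 'value': ''}
```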
782 \end{methoddesc}
784 \begin{methoddesc}[MatchObject]{start}{\optional{group}}
785 \methodline{end}{\optional{group}}
786 Return the indices of the start and end of the substring
787 matched by \var{group}; \var{group} defaults to zero (meaning the whole
788 matched substring).
789 Return \code{-1} if \var{group} exists but
790 did not contribute to the match. For a match object
791 \var{m}, and a group \var{g} that did contribute to the match, the
792 substring matched by group \var{g} (equivalent to
793 \code{\var{m}.group(\var{g})}) is
\begin{verbatim}
m.string[m.start(g):m.end(g)]
\end{verbatim}
799 Note that
800 \code{m.start(\var{group})} will equal \code{m.end(\var{group})} if
801 \var{group} matched a null string. For example, after \code{\var{m} =
802 re.search('b(c?)', 'cba')}, \code{\var{m}.start(0)} is 1,
803 \code{\var{m}.end(0)} is 2, \code{\var{m}.start(1)} and
804 \code{\var{m}.end(1)} are both 2, and \code{\var{m}.start(2)} raises
805 an \exception{IndexError} exception.
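
One possible use of these indices is cutting the matched text out of
the original string.  The address below is invented for the sketch:

```python
import re

# Illustrative input: remove a marker from the middle of a string.
email = "tony@tiremove_thisger.net"
m = re.search("remove_this", email)
cleaned = email[:m.start()] + email[m.end():]   # 'tony@tiger.net'

# The slice identity from the text above holds for participating groups.
n = re.search(r"b(c?)", "cba")
same = n.string[n.start(1):n.end(1)] == n.group(1)

# A group that exists but did not contribute reports -1.
optional = re.match(r"(a)?b", "b")
unused_start = optional.start(1)                # -1
```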
806 \end{methoddesc}
808 \begin{methoddesc}[MatchObject]{span}{\optional{group}}
809 For \class{MatchObject} \var{m}, return the 2-tuple
810 \code{(\var{m}.start(\var{group}), \var{m}.end(\var{group}))}.
811 Note that if \var{group} did not contribute to the match, this is
812 \code{(-1, -1)}. Again, \var{group} defaults to zero.
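
A short sketch, reusing the pattern from the \method{start()} example:

```python
import re

m = re.search(r"b(c?)", "cba")
whole_span = m.span()       # (1, 2): group 0 is the default
empty_span = m.span(1)      # (2, 2): group 1 matched the empty string

# A group that did not contribute to the match gives (-1, -1).
unused_span = re.match(r"(a)?b", "b").span(1)
```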
813 \end{methoddesc}
815 \begin{memberdesc}[MatchObject]{pos}
816 The value of \var{pos} which was passed to the \function{search()} or
817 \function{match()} method of the \class{RegexObject}. This is the
818 index into the string at which the RE engine started looking for a
819 match.
820 \end{memberdesc}
822 \begin{memberdesc}[MatchObject]{endpos}
823 The value of \var{endpos} which was passed to the \function{search()}
824 or \function{match()} method of the \class{RegexObject}. This is the
825 index into the string beyond which the RE engine will not go.
826 \end{memberdesc}
828 \begin{memberdesc}[MatchObject]{lastindex}
829 The integer index of the last matched capturing group, or \code{None}
if no group was matched at all.  For example, the expressions
\regexp{(a)b}, \regexp{((a)(b))}, and \regexp{((ab))} will have
\code{lastindex == 1} if applied to the string \code{'ab'},
while the expression \regexp{(a)(b)} will have \code{lastindex == 2}
if applied to the same string.
835 \end{memberdesc}
837 \begin{memberdesc}[MatchObject]{lastgroup}
838 The name of the last matched capturing group, or \code{None} if the
839 group didn't have a name, or if no group was matched at all.
840 \end{memberdesc}
842 \begin{memberdesc}[MatchObject]{re}
843 The regular expression object whose \method{match()} or
844 \method{search()} method produced this \class{MatchObject} instance.
845 \end{memberdesc}
847 \begin{memberdesc}[MatchObject]{string}
848 The string passed to \function{match()} or \function{search()}.
849 \end{memberdesc}
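
The attributes above can be inspected together.  This is only a sketch;
the pattern, the padded input string, and the index values are
illustrative:

```python
import re

# Hypothetical pattern: a named word group followed by optional digits.
rx = re.compile(r"(?P<word>[a-z]+)(\d+)?")
text = "  abc  "
m = rx.search(text, 2, 6)

start_index = m.pos         # 2: the pos argument that was passed in
stop_index = m.endpos       # 6: the endpos argument that was passed in
last_number = m.lastindex   # 1: group 2 did not participate
last_name = m.lastgroup     # 'word'
same_pattern = m.re is rx   # the object whose search() produced m
same_string = m.string is text

# lastindex follows the group that closed last, as described above.
nested = re.match(r"((a)(b))", "ab").lastindex   # 1
flat = re.match(r"(a)(b)", "ab").lastindex       # 2
```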
851 \subsection{Examples}
853 \leftline{\strong{Simulating \cfunction{scanf()}}}
855 Python does not currently have an equivalent to \cfunction{scanf()}.
856 \ttindex{scanf()}
857 Regular expressions are generally more powerful, though also more
858 verbose, than \cfunction{scanf()} format strings. The table below
859 offers some more-or-less equivalent mappings between
860 \cfunction{scanf()} format tokens and regular expressions.
862 \begin{tableii}{l|l}{textrm}{\cfunction{scanf()} Token}{Regular Expression}
863 \lineii{\code{\%c}}
864 {\regexp{.}}
865 \lineii{\code{\%5c}}
866 {\regexp{.\{5\}}}
867 \lineii{\code{\%d}}
868 {\regexp{[-+]?\e d+}}
869 \lineii{\code{\%e}, \code{\%E}, \code{\%f}, \code{\%g}}
870 {\regexp{[-+]?(\e d+(\e.\e d*)?|\e d*\e.\e d+)([eE][-+]?\e d+)?}}
871 \lineii{\code{\%i}}
872 {\regexp{[-+]?(0[xX][\e dA-Fa-f]+|0[0-7]*|\e d+)}}
  \lineii{\code{\%o}}
         {\regexp{[-+]?[0-7]+}}
875 \lineii{\code{\%s}}
876 {\regexp{\e S+}}
877 \lineii{\code{\%u}}
878 {\regexp{\e d+}}
  \lineii{\code{\%x}, \code{\%X}}
         {\regexp{[-+]?(0[xX])?[\e dA-Fa-f]+}}
881 \end{tableii}
To extract the filename and numbers from a string like

\begin{verbatim}
/usr/sbin/sendmail - 0 errors, 4 warnings
\end{verbatim}

you would use a \cfunction{scanf()} format like

\begin{verbatim}
%s - %d errors, %d warnings
\end{verbatim}

The equivalent regular expression would be

\begin{verbatim}
(\S+) - (\d+) errors, (\d+) warnings
\end{verbatim}
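
Applied from Python, the expression might be used as in the following
sketch.  The variable names are illustrative, and the numeric fields are
converted with \function{int()} because groups are always returned as
strings:

```python
import re

line = "/usr/sbin/sendmail - 0 errors, 4 warnings"
m = re.match(r"(\S+) - (\d+) errors, (\d+) warnings", line)

# Unlike scanf(), the match yields strings; convert numbers explicitly.
filename = m.group(1)
error_count = int(m.group(2))
warning_count = int(m.group(3))
```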
901 \leftline{\strong{Avoiding recursion}}
If you create regular expressions that require the engine to perform a
lot of recursion, you may encounter a \exception{RuntimeError} exception
with the message \code{maximum recursion limit exceeded}.  For example,
\begin{verbatim}
>>> import re
>>> s = 'Begin ' + 1000*'a very long string ' + 'end'
>>> re.match('Begin (\w| )*? end', s).end()
Traceback (most recent call last):
  File "<stdin>", line 1, in ?
  File "/usr/local/lib/python2.3/sre.py", line 132, in match
    return _compile(pattern, flags).match(string)
RuntimeError: maximum recursion limit exceeded
\end{verbatim}
918 You can often restructure your regular expression to avoid recursion.
920 Starting with Python 2.3, simple uses of the \regexp{*?} pattern are
921 special-cased to avoid recursion. Thus, the above regular expression
922 can avoid recursion by being recast as
923 \regexp{Begin [a-zA-Z0-9_ ]*?end}. As a further benefit, such regular
924 expressions will run faster than their recursive equivalents.
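
As a sketch, the recast expression can be applied to the same test
string.  The alternation \regexp{(\e w| )} is replaced by a single
character class, so the engine no longer needs a group per repetition:

```python
import re

s = 'Begin ' + 1000 * 'a very long string ' + 'end'

# Character class instead of a repeated alternation group.
m = re.match(r'Begin [a-zA-Z0-9_ ]*?end', s)

# The non-greedy repeat stops at the first 'end', which here is the
# final three characters of the string.
matched_to_end = (m.end() == len(s))
```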