\section{\module{regex} ---
         Regular expression search and match operations.}
\declaremodule{builtin}{regex}

\modulesynopsis{Regular expression search and match operations.}

This module provides regular expression matching operations similar to
those found in Emacs.

\strong{Obsolescence note:}
This module is obsolete as of Python version 1.5; it is still being
maintained because much existing code still uses it.  All new code in
need of regular expressions should use the new
\code{re}\refstmodindex{re} module, which supports the more powerful
and regular Perl-style regular expressions.  Existing code should be
converted.  The standard library module
\code{reconvert}\refstmodindex{reconvert} helps in converting
\code{regex} style regular expressions to \code{re}\refstmodindex{re}
style regular expressions.  (For more conversion help, see Andrew
Kuchling's\index{Kuchling, Andrew} ``\module{regex-to-re} HOWTO'' at
\url{http://www.python.org/doc/howto/regex-to-re/}.)

By default the patterns are Emacs-style regular expressions, with one
exception: Emacs' \samp{\e s} pattern is not supported, since the
original implementation references the Emacs syntax tables.  The
syntax can be changed to match that of several well-known \UNIX{}
utilities; see \code{set_syntax()} below.

This module is 8-bit clean: both patterns and strings may contain null
bytes and characters whose high bit is set.

\strong{Please note:} Python does not remove the backslash from a
string literal when it is followed by an unrecognized escape
character.  As a result you don't usually have to worry about
doubling backslashes, even though they are used to escape special
characters in string literals as well as in regular expressions.
\emph{However}, if you want to include a literal \dfn{backslash} in a
regular expression represented as a string literal, you have to
\emph{quadruple} it or enclose it in a singleton character class.
E.g.\ to extract \LaTeX\ \samp{\e section\{\textrm{\ldots}\}} headers
from a document, you can use this pattern:
\code{'[\e ]section\{\e (.*\e )\}'}.  \emph{Another exception:}
the escape sequence \samp{\e b} is significant in string literals
(where it means the ASCII backspace character) as well as in Emacs
regular expressions (where it stands for a word boundary), so in order
to search for a word boundary, you should use the pattern
\code{'\e \e b'}.  Similarly, a backslash followed by a digit 0-7
should be doubled to avoid interpretation as an octal escape.
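
The rules above concern only Python string literals, so they can be
checked without any matcher.  The following is a minimal sketch with
illustrative variable names; it sticks to recognized escapes, because
recent interpreters warn about unrecognized ones such as
\code{'\e ('}.

\begin{verbatim}
# '\b' is a recognized escape: a single backspace character (code 8).
backspace = '\b'
# Doubling the backslash leaves the two characters the matcher needs
# for a word boundary.
word_boundary = '\\b'
# '\1' is an octal escape (character code 1), not a back-reference.
octal_one = '\1'
back_reference = '\\1'
# Four backslashes in the literal become two in the string, which the
# matcher then reads as one literal backslash.
literal_backslash = '\\\\'

print(len(backspace), ord(backspace))    # 1 8
print(len(word_boundary), len(back_reference), len(literal_backslash))
\end{verbatim}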

\subsection{Regular Expressions}

A regular expression (or RE) specifies a set of strings that matches
it; the functions in this module let you check if a particular string
matches a given regular expression (or if a given regular expression
matches a particular string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular
expressions; if \emph{A} and \emph{B} are both regular expressions,
then \emph{AB} is also a regular expression.  If a string \emph{p}
matches \emph{A} and another string \emph{q} matches \emph{B}, the
string \emph{pq} will match \emph{AB}.  Thus, complex expressions can
easily be constructed from simpler ones like the primitives described
here.  For details of the theory and implementation of regular
expressions, consult almost any textbook about compiler construction.

% XXX The reference could be made more specific, say to
% "Compilers: Principles, Techniques and Tools", by Alfred V. Aho,
% Ravi Sethi, and Jeffrey D. Ullman, or some FA text.

A brief explanation of the format of regular expressions follows.

Regular expressions can contain both special and ordinary characters.
Ordinary characters, like '\code{A}', '\code{a}', or '\code{0}', are
the simplest regular expressions; they simply match themselves.  You
can concatenate ordinary characters, so '\code{last}' matches the
characters 'last'.  (In the rest of this section, we'll write REs in
\code{this special font}, usually without quotes, and strings to be
matched 'in single quotes'.)

Special characters either stand for classes of ordinary characters, or
affect how the regular expressions around them are interpreted.

The special characters are:
\begin{itemize}
\item[\code{.}] (Dot.)  Matches any character except a newline.
\item[\code{\^}] (Caret.)  Matches the start of the string.
\item[\code{\$}] Matches the end of the string.
\code{foo} matches both 'foo' and 'foobar', while the regular
expression '\code{foo\$}' matches only 'foo'.
\item[\code{*}] Causes the resulting RE to
match 0 or more repetitions of the preceding RE.  \code{ab*} will
match 'a', 'ab', or 'a' followed by any number of 'b's.
\item[\code{+}] Causes the
resulting RE to match 1 or more repetitions of the preceding RE.
\code{ab+} will match 'a' followed by any non-zero number of 'b's; it
will not match just 'a'.
\item[\code{?}] Causes the resulting RE to
match 0 or 1 repetitions of the preceding RE.  \code{ab?} will
match either 'a' or 'ab'.

\item[\code{\e}] Either escapes special characters (permitting you to match
characters like '*?+\&\$'), or signals a special sequence; special
sequences are discussed below.  Remember that Python also uses the
backslash as an escape character in string literals; if the escape
sequence isn't recognized by Python's parser, the backslash and
subsequent character are included in the resulting string.  However,
if Python would recognize the resulting sequence, the backslash should
be doubled.

\item[\code{[]}] Used to indicate a set of characters.  Characters can
be listed individually, or a range is indicated by giving two
characters and separating them by a '-'.  Special characters are
not active inside sets.  For example, \code{[akm\$]}
will match any of the characters 'a', 'k', 'm', or '\$'; \code{[a-z]} will
match any lowercase letter.

If you want to include a \code{]} inside a
set, it must be the first character of the set; to include a \code{-},
place it as the first or last character.

Characters \emph{not} within a range can be matched by including a
\code{\^} as the first character of the set; \code{\^} elsewhere will
simply match the '\code{\^}' character.
\end{itemize}
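
The characters listed above behave the same way in the
\module{re} module for the simple cases shown here, so their effect
can be sketched with \module{re} where \module{regex} is not
installed.  This is an illustration only, not the \module{regex} API;
\code{matches()} is a hypothetical helper.

\begin{verbatim}
import re

def matches(pattern, string):
    """Return True if pattern matches at the start of string."""
    return re.match(pattern, string) is not None

print(matches('foo', 'foobar'))     # True
print(matches('foo$', 'foobar'))    # False
print(matches('ab*', 'a'))          # True
print(matches('ab+', 'a'))          # False
print(matches('ab?', 'ab'))         # True
print(matches('[akm$]', '$'))       # True
print(matches('[^a-z]', 'q'))       # False
\end{verbatim}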

The special sequences consist of '\code{\e}' and a character
from the list below.  If the character is not on the list,
then the resulting RE will match that character.  For example,
\code{\e\$} matches the character '\$'.  Sequences whose backslash
must be doubled in string literals are shown with the backslash
already doubled.

\begin{itemize}
\item[\code{\e|}] \code{A\e|B}, where A and B can be arbitrary REs,
creates a regular expression that will match either A or B.  This can
be used inside groups (see below) as well.

\item[\code{\e( \e)}] Indicates the start and end of a group; the
contents of a group can be matched later in the string with the
\code{\e [1-9]} special sequence, described next.
\end{itemize}

\begin{fulllineitems}
\item[\code{\e \e 1, ... \e \e 7, \e 8, \e 9}]
Matches the contents of the group of the same
number.  For example, \code{\e (.+\e ) \e \e 1} matches 'the the' or
'55 55', but not 'the end' (note the space after the group).  This
special sequence can only be used to match one of the first 9 groups;
groups with higher numbers can be matched using the \code{\e v}
sequence.  (\code{\e 8} and \code{\e 9} don't need a double backslash
because they are not octal digits.)
\end{fulllineitems}
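
As a rough illustration of back-references, the sketch below uses the
\module{re} module rather than \module{regex}; note that \module{re}
writes groups with bare parentheses, and that the raw string avoids
the backslash doubling.  The name \code{repeated} is illustrative.

\begin{verbatim}
import re

# The regex-style pattern '\(.+\) \\1' corresponds to this re-style one.
repeated = re.compile(r'(.+) \1')

for text in ['the the', '55 55', 'the end']:
    print(text, repeated.match(text) is not None)
# the the True
# 55 55 True
# the end False
\end{verbatim}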

\begin{itemize}
\item[\code{\e \e b}] Matches the empty string, but only at the
beginning or end of a word.  A word is defined as a sequence of
alphanumeric characters, so the end of a word is indicated by
whitespace or a non-alphanumeric character.

\item[\code{\e B}] Matches the empty string, but only when it is
\emph{not} at the beginning or end of a word.

\item[\code{\e v}] Must be followed by a two digit decimal number, and
matches the contents of the group of the same number.  The group
number must be between 1 and 99, inclusive.

\item[\code{\e w}] Matches any alphanumeric character; this is
equivalent to the set \code{[a-zA-Z0-9]}.

\item[\code{\e W}] Matches any non-alphanumeric character; this is
equivalent to the set \code{[\^a-zA-Z0-9]}.
\item[\code{\e <}] Matches the empty string, but only at the beginning of a
word.  A word is defined as a sequence of alphanumeric characters, so
the end of a word is indicated by whitespace or a non-alphanumeric
character.
\item[\code{\e >}] Matches the empty string, but only at the end of a
word.

\item[\code{\e \e \e \e}] Matches a literal backslash.

% In Emacs, the following two are start of buffer/end of buffer.  In
% Python they seem to be synonyms for ^$.
\item[\code{\e `}] Like \code{\^}, this only matches at the start of the
string.
\item[\code{\e \e '}] Like \code{\$}, this only matches at the end of
the string.
% end of buffer
\end{itemize}
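
The word-boundary sequences can be tried out with the \module{re}
module, which also provides \code{\e b} and \code{\e B}.  The sketch
below is an assumption-laden stand-in for \module{regex}: in
\module{re} a word character also includes the underscore, and the
sample text is made up.

\begin{verbatim}
import re

text = 'cat concat catalog'

# \b on both sides: 'cat' only as a whole word.
whole_words = [m.start() for m in re.finditer(r'\bcat\b', text)]
# \B in front: 'cat' only where it does not start a word.
inside_words = [m.start() for m in re.finditer(r'\Bcat', text)]

print(whole_words)     # [0]
print(inside_words)    # [7]
\end{verbatim}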

\subsection{Module Contents}
\nodename{Contents of Module regex}

The module defines these functions, and an exception:

\begin{funcdesc}{match}{pattern, string}
Return how many characters at the beginning of \var{string} match
the regular expression \var{pattern}.  Return \code{-1} if the
string does not match the pattern (this is different from a
zero-length match!).
\end{funcdesc}

\begin{funcdesc}{search}{pattern, string}
Return the first position in \var{string} that matches the regular
expression \var{pattern}.  Return \code{-1} if no position in the string
matches the pattern (this is different from a zero-length match
anywhere!).
\end{funcdesc}
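
The difference between a failed match (\code{-1}) and a zero-length
match (\code{0}) is easy to overlook.  The sketch below emulates the
two return conventions on top of the \module{re} module;
\code{match_length()} and \code{search_position()} are hypothetical
helpers, not part of \module{regex}, and the patterns use only
constructs that read the same in both syntaxes.

\begin{verbatim}
import re

def match_length(pattern, string):
    """Length of the match at the start of string, or -1."""
    m = re.match(pattern, string)
    return -1 if m is None else m.end()

def search_position(pattern, string):
    """Index of the first match in string, or -1."""
    m = re.search(pattern, string)
    return -1 if m is None else m.start()

print(match_length('ab*', 'abbbc'))    # 4
print(match_length('b*', 'abc'))       # 0, a zero-length match
print(match_length('b+', 'abc'))       # -1, no match at the start
print(search_position('b+', 'abc'))    # 1
print(search_position('x', 'abc'))     # -1
\end{verbatim}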

\begin{funcdesc}{compile}{pattern\optional{, translate}}
Compile a regular expression pattern into a regular expression
object, which can be used for matching using its \code{match()} and
\code{search()} methods, described below.  The optional argument
\var{translate}, if present, must be a 256-character string
indicating how characters (both of the pattern and of the strings to
be matched) are translated before comparing them; the \var{i}-th
element of the string gives the translation for the character with
\ASCII{} code \var{i}.  This can be used to implement
case-insensitive matching; see the \code{casefold} data item below.

The sequence

\begin{verbatim}
prog = regex.compile(pat)
result = prog.match(str)
\end{verbatim}

is equivalent to

\begin{verbatim}
result = regex.match(pat, str)
\end{verbatim}

but the version using \code{compile()} is more efficient when multiple
regular expressions are used concurrently in a single program.  (The
compiled version of the last pattern passed to \code{regex.match()} or
\code{regex.search()} is cached, so programs that use only a single
regular expression at a time needn't worry about compiling regular
expressions.)
\end{funcdesc}

\begin{funcdesc}{set_syntax}{flags}
Set the syntax to be used by future calls to \code{compile()},
\code{match()} and \code{search()}.  (Already compiled expression
objects are not affected.)  The argument is an integer which is the
OR of several flag bits.  The return value is the previous value of
the syntax flags.  Names for the flags are defined in the standard
module \code{regex_syntax}\refstmodindex{regex_syntax}; read the
file \file{regex_syntax.py} for more information.
\end{funcdesc}

\begin{funcdesc}{get_syntax}{}
Return the current value of the syntax flags as an integer.
\end{funcdesc}

\begin{funcdesc}{symcomp}{pattern\optional{, translate}}
This is like \code{compile()}, but supports symbolic group names: if a
parenthesis-enclosed group begins with a group name in angle
brackets, e.g. \code{'\e(<id>[a-z][a-z0-9]*\e)'}, the group can
be referenced by its name in arguments to the \code{group()} method of
the resulting compiled regular expression object, like this:
\code{p.group('id')}.  Group names may contain alphanumeric characters
and \code{'_'} only.
\end{funcdesc}
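
For code being converted, the closest counterpart in the \module{re}
module is a named group written as \code{(?P<name>...)}.  The
following sketch shows that \module{re} form, not \code{symcomp()}
itself; the sample string is made up.

\begin{verbatim}
import re

# regex.symcomp('\(<id>[a-z][a-z0-9]*\)') corresponds roughly to:
p = re.compile(r'(?P<id>[a-z][a-z0-9]*)')

m = p.search('let x42 = 1')
print(m.group('id'))        # let
print(dict(p.groupindex))   # {'id': 1}
\end{verbatim}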

\begin{excdesc}{error}
Exception raised when a string passed to one of the functions here
is not a valid regular expression (e.g., unmatched parentheses) or
when some other error occurs during compilation or matching.  (It is
never an error if a string contains no match for a pattern.)
\end{excdesc}

\begin{datadesc}{casefold}
A string suitable to pass as the \var{translate} argument to
\code{compile()} to map all uppercase characters to their lowercase
equivalents.
\end{datadesc}
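
To make the shape of a \var{translate} string concrete, the sketch
below builds a 256-character table by hand.  It assumes that only the
\ASCII{} letters A-Z are folded; the actual contents of
\code{casefold} are not reproduced here, and
\code{make_casefold_table()} is a hypothetical name.

\begin{verbatim}
def make_casefold_table():
    """Build a 256-character table folding ASCII uppercase letters."""
    chars = []
    for i in range(256):
        c = chr(i)
        # Position i holds the translation for character code i.
        if 'A' <= c <= 'Z':
            c = c.lower()
        chars.append(c)
    return ''.join(chars)

table = make_casefold_table()
print(len(table))                          # 256
print(table[ord('Q')])                     # q
print(table[ord('q')], table[ord('7')])    # q 7
\end{verbatim}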

\noindent
Compiled regular expression objects support these methods:

\setindexsubitem{(regex method)}
\begin{funcdesc}{match}{string\optional{, pos}}
Return how many characters at the beginning of \var{string} match
the compiled regular expression.  Return \code{-1} if the string
does not match the pattern (this is different from a zero-length
match!).

The optional second parameter, \var{pos}, gives an index in the string
where the search is to start; it defaults to \code{0}.  This is not
completely equivalent to slicing the string; the \code{'\^'} pattern
character matches at the real beginning of the string and at positions
just after a newline, not necessarily at the index where the search
is to start.
\end{funcdesc}
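
Compiled patterns of the \module{re} module accept a \var{pos}
argument with the same caveat about \code{'\^'}, so the difference
from slicing can be sketched with \module{re} as a stand-in.  The
names \code{prog} and \code{text} are illustrative.

\begin{verbatim}
import re

prog = re.compile('^b')
text = 'ab'

# Starting at index 1: '^' still refers to the real start of the string.
print(prog.match(text, 1) is None)         # True
# Slicing creates a new string whose start is the old index 1.
print(prog.match(text[1:]) is not None)    # True
\end{verbatim}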

\begin{funcdesc}{search}{string\optional{, pos}}
Return the first position in \var{string} that matches the compiled
regular expression.  Return \code{-1} if no position in the
string matches the pattern (this is different from a zero-length
match anywhere!).

The optional second parameter has the same meaning as for the
\code{match()} method.
\end{funcdesc}

\begin{funcdesc}{group}{index, index, ...}
This method is only valid when the last call to the \code{match()}
or \code{search()} method found a match.  It returns one or more
groups of the match.  If there is a single \var{index} argument,
the result is a single string; if there are multiple arguments, the
result is a tuple with one item per argument.  If the \var{index} is
zero, the corresponding return value is the entire matching string; if
it is in the inclusive range [1..99], it is the string matching
the corresponding parenthesized group (using the default syntax,
groups are parenthesized using \code{{\e}(} and \code{{\e})}).  If no
such group exists, the corresponding result is \code{None}.

If the regular expression was compiled by \code{symcomp()} instead of
\code{compile()}, the \var{index} arguments may also be strings
identifying groups by their group name.
\end{funcdesc}
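
The single-argument and multiple-argument forms of \code{group()} can
be illustrated with the \module{re} module, whose match objects offer
a similar method.  This is a sketch under that substitution: in
\module{re} a group that took no part in the match yields
\code{None}, whereas an index beyond the last group raises
\code{IndexError}.

\begin{verbatim}
import re

m = re.match(r'(\w+)-(\d+)(x)?', 'item-42')

print(m.group(0))       # item-42
print(m.group(1))       # item
print(m.group(1, 2))    # ('item', '42')
print(m.group(3))       # None
\end{verbatim}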

\noindent
Compiled regular expressions support these data attributes:

\setindexsubitem{(regex attribute)}

\begin{datadesc}{regs}
When the last call to the \code{match()} or \code{search()} method found a
match, this is a tuple of pairs of indices corresponding to the
beginning and end of all parenthesized groups in the pattern.  Indices
are relative to the string argument passed to \code{match()} or
\code{search()}.  The 0-th pair gives the beginning and end of the
text matched by the whole pattern.  When the last match or search
failed, this is \code{None}.
\end{datadesc}
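
A tuple of the same shape can be assembled from a \module{re} match
object, which may help when converting code that reads \code{regs}.
The sketch assumes \module{re} in place of \module{regex}; the sample
string is made up.

\begin{verbatim}
import re

m = re.search(r'(\d+)-(\d+)', 'range 10-25 here')

# Pair 0 covers the whole match; pairs 1..n cover the groups.
regs = tuple(m.span(i) for i in range(m.re.groups + 1))
print(regs)    # ((6, 11), (6, 8), (9, 11))
\end{verbatim}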

\begin{datadesc}{last}
When the last call to the \code{match()} or \code{search()} method found a
match, this is the string argument passed to that method.  When the
last match or search failed, this is \code{None}.
\end{datadesc}

\begin{datadesc}{translate}
This is the value of the \var{translate} argument to
\code{regex.compile()} that created this regular expression object.  If
the \var{translate} argument was omitted in the \code{regex.compile()}
call, this is \code{None}.
\end{datadesc}

\begin{datadesc}{givenpat}
The regular expression pattern as passed to \code{compile()} or
\code{symcomp()}.
\end{datadesc}

\begin{datadesc}{realpat}
The regular expression after stripping the group names for regular
expressions compiled with \code{symcomp()}.  Same as \code{givenpat}
otherwise.
\end{datadesc}

\begin{datadesc}{groupindex}
A dictionary giving the mapping from symbolic group names to numerical
group indexes for regular expressions compiled with \code{symcomp()}.
\code{None} otherwise.
\end{datadesc}