\section{\module{regex} ---
         Regular expression search and match operations.}
\declaremodule{builtin}{regex}

\modulesynopsis{Regular expression search and match operations.}


This module provides regular expression matching operations similar to
those found in Emacs.

\strong{Obsolescence note:}
This module is obsolete as of Python version 1.5; it is still being
maintained because much existing code still uses it.  All new code in
need of regular expressions should use the new
\code{re}\refstmodindex{re} module, which supports the more powerful
and regular Perl-style regular expressions.  Existing code should be
converted.  The standard library module
\code{reconvert}\refstmodindex{reconvert} helps in converting
\code{regex} style regular expressions to \code{re}\refstmodindex{re}
style regular expressions.  (For more conversion help, see Andrew
Kuchling's\index{Kuchling, Andrew} ``\module{regex-to-re} HOWTO'' at
\url{http://www.python.org/doc/howto/regex-to-re/}.)

By default the patterns are Emacs-style regular expressions, with one
exception: Emacs' \samp{\e s} pattern is not supported, since the
original implementation references the Emacs syntax tables.  The
syntax can be changed to match that of several well-known \UNIX{}
utilities; see \code{set_syntax()} below.

This module is 8-bit clean: both patterns and strings may contain null
bytes and characters whose high bit is set.

\strong{Please note:} There is a little-known fact about Python string
literals which means that you don't usually have to worry about
doubling backslashes, even though they are used to escape special
characters in string literals as well as in regular expressions.  This
is because Python doesn't remove backslashes from string literals if
they are followed by an unrecognized escape character.
\emph{However}, if you want to include a literal \dfn{backslash} in a
regular expression represented as a string literal, you have to
\emph{quadruple} it or enclose it in a singleton character class.
E.g.\  to extract \LaTeX\ \samp{\e section\{\textrm{\ldots}\}} headers
from a document, you can use this pattern:
\code{'[\e ]section\{\e (.*\e )\}'}.  \emph{Another exception:}
the escape sequence \samp{\e b} is significant in string literals
(where it means the ASCII backspace character) as well as in Emacs
regular expressions (where it stands for a word boundary), so in order
to search for a word boundary, you should use the pattern
\code{'\e \e b'}.
Similarly, a backslash followed by a digit 0-7 should be doubled to
avoid interpretation as an octal escape.

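
As an illustration, the rules above can be checked on string literals
alone, without calling this module.  The following is only a sketch;
the variable names are made up for the example.

\begin{verbatim}
# Each literal is shown with the characters the resulting string holds.
quadrupled = '\\\\'      # two backslashes: the RE for one literal backslash
word_boundary = '\\b'    # a backslash followed by 'b'
control_char = '\b'      # one character, the ASCII backspace (code 8)
octal_escape = '\1'      # one character with code 1
group_ref = '\\1'        # a backslash followed by '1'

lengths = [len(s) for s in (quadrupled, word_boundary, control_char,
                            octal_escape, group_ref)]
\end{verbatim}
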
\subsection{Regular Expressions}

A regular expression (or RE) specifies a set of strings that matches
it; the functions in this module let you check if a particular string
matches a given regular expression (or if a given regular expression
matches a particular string, which comes down to the same thing).

Regular expressions can be concatenated to form new regular
expressions; if \emph{A} and \emph{B} are both regular expressions,
then \emph{AB} is also a regular expression.  If a string \emph{p}
matches A and another string \emph{q} matches B, the string \emph{pq}
will match AB.  Thus, complex expressions can easily be constructed
from simpler ones like the primitives described here.  For details of
the theory and implementation of regular expressions, consult almost
any textbook about compiler construction.

% XXX The reference could be made more specific, say to
% "Compilers: Principles, Techniques and Tools", by Alfred V. Aho,
% Ravi Sethi, and Jeffrey D. Ullman, or some FA text.

A brief explanation of the format of regular expressions follows.

Regular expressions can contain both special and ordinary characters.
Ordinary characters, like '\code{A}', '\code{a}', or '\code{0}', are
the simplest regular expressions; they simply match themselves.  You
can concatenate ordinary characters, so '\code{last}' matches the
characters 'last'.  (In the rest of this section, we'll write REs in
\code{this special font}, usually without quotes, and strings to be
matched 'in single quotes'.)

Special characters either stand for classes of ordinary characters, or
affect how the regular expressions around them are interpreted.

The special characters are:
\begin{itemize}
\item[\code{.}] (Dot.)  Matches any character except a newline.
\item[\code{\^}] (Caret.)  Matches the start of the string.
\item[\code{\$}] Matches the end of the string.
\code{foo} matches both 'foo' and 'foobar', while the regular
expression '\code{foo\$}' matches only 'foo'.
\item[\code{*}] Causes the resulting RE to
match 0 or more repetitions of the preceding RE.  \code{ab*} will
match 'a', 'ab', or 'a' followed by any number of 'b's.
\item[\code{+}] Causes the
resulting RE to match 1 or more repetitions of the preceding RE.
\code{ab+} will match 'a' followed by any non-zero number of 'b's; it
will not match just 'a'.
\item[\code{?}] Causes the resulting RE to
match 0 or 1 repetitions of the preceding RE.  \code{ab?} will
match either 'a' or 'ab'.

\item[\code{\e}] Either escapes special characters (permitting you to match
characters like '*?+\&\$'), or signals a special sequence; special
sequences are discussed below.  Remember that Python also uses the
backslash as an escape character in string literals; if the escape
sequence isn't recognized by Python's parser, the backslash and
subsequent character are included in the resulting string.  However,
if Python would recognize the resulting sequence, the backslash should
be doubled.

\item[\code{[]}] Used to indicate a set of characters.  Characters can
be listed individually, or a range is indicated by giving two
characters and separating them by a '-'.  Special characters are
not active inside sets.  For example, \code{[akm\$]}
will match any of the characters 'a', 'k', 'm', or '\$'; \code{[a-z]} will
match any lowercase letter.

If you want to include a \code{]} inside a
set, it must be the first character of the set; to include a \code{-},
place it as the first or last character.

Characters \emph{not} listed in a set can be matched by complementing
it: include a \code{\^} as the first character of the set.  A
\code{\^} elsewhere in the set simply matches the '\code{\^}'
character.
\end{itemize}

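
The repetition operators and sets listed above are spelled the same
way in the newer \module{re} module, so their behaviour can be tried
out there.  The sketch below uses \module{re} rather than this module,
and the helper name \code{matched_text} is an assumption made for the
example.

\begin{verbatim}
import re

def matched_text(pattern, string):
    """Return the text matched at the start of string, or None."""
    m = re.match(pattern, string)
    if m is None:
        return None
    return m.group(0)

results = [
    matched_text('ab*', 'abbbc'),     # 'abbb'
    matched_text('ab+', 'a'),         # None: at least one 'b' is required
    matched_text('ab?', 'abb'),       # 'ab'
    matched_text('[akm$]', '$x'),     # '$': not special inside a set
    matched_text('foo$', 'foobar'),   # None: 'foo' is not at the end
    matched_text('[^a-z]', 'Q'),      # 'Q': complemented set
]
\end{verbatim}
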

The special sequences consist of '\code{\e}' and a character
from the list below.  If the ordinary character is not on the list,
then the resulting RE will match the second character.  For example,
\code{\e\$} matches the character '\$'.  Ones where the backslash
should be doubled in string literals are indicated.

\begin{itemize}
\item[\code{\e|}]\code{A\e|B}, where A and B can be arbitrary REs,
creates a regular expression that will match either A or B.  This can
be used inside groups (see below) as well.
%
\item[\code{\e( \e)}] Indicates the start and end of a group; the
contents of a group can be matched later in the string with the
\code{\e [1-9]} special sequence, described next.
\end{itemize}

\begin{fulllineitems}
\item[\code{\e \e 1, ... \e \e 7, \e 8, \e 9}]
Matches the contents of the group of the same
number.  For example, \code{\e (.+\e ) \e \e 1} matches 'the the' or
'55 55', but not 'the end' (note the space after the group).  This
special sequence can only be used to match one of the first 9 groups;
groups with higher numbers can be matched using the \code{\e v}
sequence.  (\code{\e 8} and \code{\e 9} don't need a double backslash
because they are not octal digits.)
\end{fulllineitems}

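
In \module{re} syntax groups are written with unescaped parentheses,
so the back-reference example corresponds to \code{(.+) \e 1} there.
The following sketch uses \module{re}, not this module, and a raw
string so that the backslash needs no doubling.

\begin{verbatim}
import re

# re counterpart of the Emacs-style pattern \(.+\) \1
doubled_word = re.compile(r'(.+) \1')

repeated = doubled_word.match('the the')      # a match object
digits = doubled_word.match('55 55')          # a match object
different = doubled_word.match('the end')     # None: 'end' is not 'the'
\end{verbatim}
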
\begin{itemize}
\item[\code{\e \e b}] Matches the empty string, but only at the
beginning or end of a word.  A word is defined as a sequence of
alphanumeric characters, so the end of a word is indicated by
whitespace or a non-alphanumeric character.
%
\item[\code{\e B}] Matches the empty string, but only when it is
\emph{not} at the beginning or end of a word.
%
\item[\code{\e v}] Must be followed by a two-digit decimal number, and
matches the contents of the group of the same number.  The group
number must be between 1 and 99, inclusive.
%
\item[\code{\e w}] Matches any alphanumeric character; this is
equivalent to the set \code{[a-zA-Z0-9]}.
%
\item[\code{\e W}] Matches any non-alphanumeric character; this is
equivalent to the set \code{[\^a-zA-Z0-9]}.
\item[\code{\e <}] Matches the empty string, but only at the beginning of a
word.  A word is defined as a sequence of alphanumeric characters, so
the end of a word is indicated by whitespace or a non-alphanumeric
character.
\item[\code{\e >}] Matches the empty string, but only at the end of a
word.

\item[\code{\e \e \e \e}] Matches a literal backslash.

% In Emacs, the following two are start of buffer/end of buffer.  In
% Python they seem to be synonyms for ^$.
\item[\code{\e `}] Like \code{\^}, this only matches at the start of the
string.
\item[\code{\e \e '}] Like \code{\$}, this only matches at the end of
the string.
% end of buffer
\end{itemize}

\subsection{Module Contents}
\nodename{Contents of Module regex}

The module defines these functions, an exception, and a data item:


\begin{funcdesc}{match}{pattern, string}
  Return how many characters at the beginning of \var{string} match
  the regular expression \var{pattern}.  Return \code{-1} if the
  string does not match the pattern (this is different from a
  zero-length match!).
\end{funcdesc}

\begin{funcdesc}{search}{pattern, string}
  Return the first position in \var{string} that matches the regular
  expression \var{pattern}.  Return \code{-1} if no position in the string
  matches the pattern (this is different from a zero-length match
  anywhere!).
\end{funcdesc}

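
The return conventions of \code{match()} and \code{search()} can be
emulated with the \module{re} module as follows.  This is a sketch
only: the helper names are hypothetical, and the patterns must be
written in \module{re} syntax.

\begin{verbatim}
import re

def match_length(pattern, string):
    """Number of characters matched at the start of string, or -1."""
    m = re.match(pattern, string)
    if m is None:
        return -1
    return m.end()

def search_position(pattern, string):
    """Index of the first match in string, or -1."""
    m = re.search(pattern, string)
    if m is None:
        return -1
    return m.start()

zero_length = match_length('x*', 'abc')       # 0: matches the empty string
no_match = match_length('x+', 'abc')          # -1: no match at all
first_pos = search_position('b+', 'aabbb')    # 2
not_found = search_position('z', 'aabbb')     # -1
\end{verbatim}
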
\begin{funcdesc}{compile}{pattern\optional{, translate}}
  Compile a regular expression pattern into a regular expression
  object, which can be used for matching using its \code{match()} and
  \code{search()} methods, described below.  The optional argument
  \var{translate}, if present, must be a 256-character string
  indicating how characters (both of the pattern and of the strings to
  be matched) are translated before comparing them; the \var{i}-th
  element of the string gives the translation for the character with
  \ASCII{} code \var{i}.  This can be used to implement
  case-insensitive matching; see the \code{casefold} data item below.

  The sequence

\begin{verbatim}
prog = regex.compile(pat)
result = prog.match(str)
\end{verbatim}
%
is equivalent to

\begin{verbatim}
result = regex.match(pat, str)
\end{verbatim}

but the version using \code{compile()} is more efficient when multiple
regular expressions are used concurrently in a single program.  (The
compiled version of the last pattern passed to \code{regex.match()} or
\code{regex.search()} is cached, so programs that use only a single
regular expression at a time needn't worry about compiling regular
expressions.)
\end{funcdesc}

\begin{funcdesc}{set_syntax}{flags}
  Set the syntax to be used by future calls to \code{compile()},
  \code{match()} and \code{search()}.  (Already compiled expression
  objects are not affected.)  The argument is an integer which is the
  OR of several flag bits.  The return value is the previous value of
  the syntax flags.  Names for the flags are defined in the standard
  module \code{regex_syntax}\refstmodindex{regex_syntax}; read the
  file \file{regex_syntax.py} for more information.
\end{funcdesc}

\begin{funcdesc}{get_syntax}{}
  Return the current value of the syntax flags as an integer.
\end{funcdesc}

\begin{funcdesc}{symcomp}{pattern\optional{, translate}}
This is like \code{compile()}, but supports symbolic group names: if a
parenthesis-enclosed group begins with a group name in angle
brackets, e.g.\ \code{'\e(<id>[a-z][a-z0-9]*\e)'}, the group can
be referenced by its name in arguments to the \code{group()} method of
the resulting compiled regular expression object, like this:
\code{p.group('id')}.  Group names may contain alphanumeric characters
and \code{'_'} only.
\end{funcdesc}

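
The \module{re} module offers a comparable feature through its
\code{(?P<name>...)} notation.  The sketch below shows the
\module{re} counterpart of the \code{symcomp()} example; it does not
call this module, and the sample input is made up.

\begin{verbatim}
import re

# re counterpart of the symcomp() pattern '\(<id>[a-z][a-z0-9]*\)'
ident = re.compile(r'(?P<id>[a-z][a-z0-9]*)')
m = ident.match('x25 = 1')

by_name = m.group('id')                  # the group, looked up by name
by_number = m.group(1)                   # the same group, by index
name_to_index = dict(ident.groupindex)   # maps each name to its index
\end{verbatim}
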
\begin{excdesc}{error}
  Exception raised when a string passed to one of the functions here
  is not a valid regular expression (e.g., unmatched parentheses) or
  when some other error occurs during compilation or matching.  (It is
  never an error if a string contains no match for a pattern.)
\end{excdesc}

\begin{datadesc}{casefold}
A string suitable to pass as the \var{translate} argument to
\code{compile()} to map all uppercase characters to their lowercase
equivalents.
\end{datadesc}

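
A translation string of the required shape can also be built by hand.
The function below is a sketch with a hypothetical name; it folds only
the letters 'A' to 'Z', and whether \code{casefold} treats characters
outside \ASCII{} in the same way is an assumption that has not been
checked against the implementation.

\begin{verbatim}
def make_casefold_table():
    """Build a 256-character string mapping 'A'-'Z' to 'a'-'z'."""
    chars = []
    for i in range(256):
        c = chr(i)
        if 'A' <= c <= 'Z':
            c = c.lower()      # fold ASCII uppercase letters only
        chars.append(c)        # every other character maps to itself
    return ''.join(chars)

table = make_casefold_table()
\end{verbatim}
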

\noindent
Compiled regular expression objects support these methods:

\setindexsubitem{(regex method)}
\begin{funcdesc}{match}{string\optional{, pos}}
  Return how many characters at the beginning of \var{string} match
  the compiled regular expression.  Return \code{-1} if the string
  does not match the pattern (this is different from a zero-length
  match!).

  The optional second parameter, \var{pos}, gives an index in the string
  where the search is to start; it defaults to \code{0}.  This is not
  completely equivalent to slicing the string; the \code{'\^'} pattern
  character matches at the real beginning of the string and at positions
  just after a newline, not necessarily at the index where the search
  is to start.
\end{funcdesc}

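
The \module{re} module documents the same distinction for the
\var{pos} argument of its pattern methods, so it can serve to
illustrate the point.  A sketch using \module{re}, not this module:

\begin{verbatim}
import re

prog = re.compile('^b')

with_pos = prog.match('ab', 1)     # None: index 1 is not the real beginning
sliced = prog.match('ab'[1:])      # a match: the slice itself begins with 'b'
\end{verbatim}
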
\begin{funcdesc}{search}{string\optional{, pos}}
  Return the first position in \var{string} that matches the compiled
  regular expression.  Return \code{-1} if no position in the
  string matches the pattern (this is different from a zero-length
  match anywhere!).

  The optional second parameter has the same meaning as for the
  \code{match()} method.
\end{funcdesc}

\begin{funcdesc}{group}{index, index, ...}
This method is only valid when the last call to the \code{match()}
or \code{search()} method found a match.  It returns one or more
groups of the match.  If there is a single \var{index} argument,
the result is a single string; if there are multiple arguments, the
result is a tuple with one item per argument.  If the \var{index} is
zero, the corresponding return value is the entire matching string; if
it is in the inclusive range [1..99], it is the string matching the
corresponding parenthesized group (using the default syntax,
groups are parenthesized using \code{{\e}(} and \code{{\e})}).  If no
such group exists, the corresponding result is \code{None}.

If the regular expression was compiled by \code{symcomp()} instead of
\code{compile()}, the \var{index} arguments may also be strings
identifying groups by their group name.
\end{funcdesc}

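
Match objects of the \module{re} module have a \code{group()} method
that follows similar rules for its arguments.  The sketch below uses
\module{re}, not this module.  Note that \module{re} raises
\code{IndexError} for a group number that does not occur in the
pattern, so the example only shows an optional group that took no part
in the match.

\begin{verbatim}
import re

prog = re.compile(r'(\d+)-(\d+)(x)?')
m = prog.match('10-20')

whole = m.group(0)        # index 0 is the entire matching string
first = m.group(1)        # a single index gives a single string
both = m.group(1, 2)      # several indexes give a tuple, one item each
unused = m.group(3)       # None: group 3 took no part in the match
\end{verbatim}
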

\noindent
Compiled regular expressions support these data attributes:

\setindexsubitem{(regex attribute)}

\begin{datadesc}{regs}
When the last call to the \code{match()} or \code{search()} method found a
match, this is a tuple of pairs of indexes corresponding to the
beginning and end of all parenthesized groups in the pattern.  Indexes
are relative to the string argument passed to \code{match()} or
\code{search()}.  The 0-th pair gives the beginning and end of the
whole match.  When the last match or search failed, this is
\code{None}.
\end{datadesc}

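
A tuple of the same shape can be derived from a \module{re} match
object.  The helper \code{regs_of} below is hypothetical and only
sketches the layout of the value; it does not use this module.

\begin{verbatim}
import re

def regs_of(match):
    """Return (start, end) pairs for the whole match and each group."""
    if match is None:
        return None
    # Index 0 is the whole match; 1..groups are the parenthesized groups.
    return tuple(match.span(i) for i in range(match.re.groups + 1))

m = re.search('(b+)(c)', 'aabbbcd')
pairs = regs_of(m)
failed = regs_of(re.search('z', 'aabbbcd'))
\end{verbatim}
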
\begin{datadesc}{last}
When the last call to the \code{match()} or \code{search()} method found a
match, this is the string argument passed to that method.  When the
last match or search failed, this is \code{None}.
\end{datadesc}

\begin{datadesc}{translate}
This is the value of the \var{translate} argument to
\code{regex.compile()} that created this regular expression object.  If
the \var{translate} argument was omitted in the \code{regex.compile()}
call, this is \code{None}.
\end{datadesc}

\begin{datadesc}{givenpat}
The regular expression pattern as passed to \code{compile()} or
\code{symcomp()}.
\end{datadesc}

\begin{datadesc}{realpat}
The regular expression after stripping the group names for regular
expressions compiled with \code{symcomp()}.  Same as \code{givenpat}
otherwise.
\end{datadesc}

\begin{datadesc}{groupindex}
A dictionary giving the mapping from symbolic group names to numerical
group indexes for regular expressions compiled with \code{symcomp()}.
\code{None} otherwise.
\end{datadesc}