.\" $NetBSD: regex.3,v 1.19 2009/04/11 15:44:42 joerg Exp $
.\"
.\" Copyright (c) 1992, 1993, 1994
.\" The Regents of the University of California. All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" Henry Spencer.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\" notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in the
.\" documentation and/or other materials provided with the distribution.
.\" 3. Neither the name of the University nor the names of its contributors
.\" may be used to endorse or promote products derived from this software
.\" without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" Henry Spencer.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\" notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in the
.\" documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\" must display the following acknowledgement:
.\" This product includes software developed by the University of
.\" California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\" may be used to endorse or promote products derived from this software
.\" without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\" @(#)regex.3 8.4 (Berkeley) 3/20/94
.\"
.Dt REGEX 3
.Os
.Sh NAME
.Nm regex ,
.Nm regcomp ,
.Nm regexec ,
.Nm regerror ,
.Nm regfree
.Nd regular-expression library
.Sh LIBRARY
.Lb libc
.Sh SYNOPSIS
.In regex.h
.Ft int
.Fn regcomp "regex_t * restrict preg" "const char * restrict pattern" "int cflags"
.Ft int
.Fn regexec "const regex_t * restrict preg" "const char * restrict string" "size_t nmatch" "regmatch_t pmatch[]" "int eflags"
.Ft size_t
.Fn regerror "int errcode" "const regex_t * restrict preg" "char * restrict errbuf" "size_t errbuf_size"
.Ft void
.Fn regfree "regex_t *preg"
.Sh DESCRIPTION
These routines implement
.St -p1003.2-92
regular expressions (``RE''s);
see
.Xr re_format 7 .
.Fn regcomp
compiles an RE written as a string into an internal form,
.Fn regexec
matches that internal form against a string and reports results,
.Fn regerror
transforms error codes from either into human-readable messages,
and
.Fn regfree
frees any dynamically-allocated storage used by the internal form
of an RE.
.Pp
The header
.In regex.h
declares two structure types,
.Fa regex_t
and
.Fa regmatch_t ,
the former for compiled internal forms and the latter for match reporting.
It also declares the four functions,
a type
.Fa regoff_t ,
and a number of constants with names starting with ``REG_''.
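.Pp
As a minimal sketch of how the four functions fit together, the following
C fragment compiles a pattern, runs it once, reports a compile error if
there is one, and releases the compiled form.
The helper name re_matches and its return convention are assumptions made
for illustration; they are not part of the library.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/*
 * Hypothetical helper: returns 1 if text matches the extended RE in
 * pattern, 0 if it does not, and -1 on any other error.
 */
int
re_matches(const char *pattern, const char *text)
{
    regex_t re;
    char msg[128];
    int rc;

    assert(pattern != NULL && text != NULL);

    /* REG_NOSUB: only success or failure is wanted, not offsets. */
    rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);
    if (rc != 0) {
        /* Turn the error code into a printable message. */
        regerror(rc, &re, msg, sizeof(msg));
        fprintf(stderr, "regcomp: %s\n", msg);
        return -1;
    }

    /* nmatch is 0, so the pmatch argument is ignored. */
    rc = regexec(&re, text, 0, NULL, 0);
    regfree(&re);

    if (rc == 0)
        return 1;
    return rc == REG_NOMATCH ? 0 : -1;
}
```

A caller might write re_matches("^a[0-9]+$", "a42") and treat a negative
result as a bad pattern rather than as a failed match.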
.Pp
.Fn regcomp
compiles the regular expression contained in the
.Fa pattern
string,
subject to the flags in
.Fa cflags ,
and places the results in the
.Fa regex_t
structure pointed to by
.Fa preg .
.Fa cflags
is the bitwise OR of zero or more of the following flags:
.Bl -tag -width XXXREG_EXTENDED
.It Dv REG_EXTENDED
Compile modern (``extended'') REs, rather than the obsolete
(``basic'') REs that are the default.
.It Dv REG_BASIC
This is a synonym for 0,
provided as a counterpart to REG_EXTENDED to improve readability.
.It Dv REG_NOSPEC
Compile with recognition of all special characters turned off.
All characters are thus considered ordinary, so the ``RE'' is a literal
string.
This is an extension, compatible with but not specified by
.St -p1003.2-92 ,
and should be used with caution in software intended to be portable to
other systems.
.Dv REG_EXTENDED
and
.Dv REG_NOSPEC
may not be used in the same call to
.Fn regcomp .
.It Dv REG_ICASE
Compile for matching that ignores upper/lower case distinctions.
See
.Xr re_format 7 .
.It Dv REG_NOSUB
Compile for matching that need only report success or failure, not
what was matched.
.It Dv REG_NEWLINE
Compile for newline-sensitive matching.
By default, newline is a completely ordinary character with no special
meaning in either REs or strings.
With this flag,
`[^' bracket expressions and `.' never match newline,
a `^' anchor matches the null string after any newline in the string
in addition to its normal function,
and the `$' anchor matches the null string before any newline in the
string in addition to its normal function.
.It Dv REG_PEND
The regular expression ends, not at the first NUL, but just before the
character pointed to by the
.Fa re_endp
member of the structure pointed to by
.Fa preg .
The
.Fa re_endp
member is of type
.Fa "const\ char\ *" .
This flag permits inclusion of NULs in the RE; they are considered
ordinary characters.
This is an extension, compatible with but not specified by
.St -p1003.2-92 ,
and should be used with caution in software intended to be portable to
other systems.
.El
.Pp
When successful,
.Fn regcomp
returns 0 and fills in the structure pointed to by
.Fa preg .
One member of that structure (other than
.Fa re_endp )
is publicized:
.Fa re_nsub ,
of type
.Fa size_t ,
contains the number of parenthesized subexpressions within the RE
(except that the value of this member is undefined if the
.Dv REG_NOSUB
flag was used).
If
.Fn regcomp
fails, it returns a non-zero error code;
see
.Sx DIAGNOSTICS .
.Pp
.Fn regexec
matches the compiled RE pointed to by
.Fa preg
against the
.Fa string ,
subject to the flags in
.Fa eflags ,
and reports results using
.Fa nmatch ,
.Fa pmatch ,
and the returned value.
The RE must have been compiled by a previous invocation of
.Fn regcomp .
The compiled form is not altered during execution of
.Fn regexec ,
so a single compiled RE can be used simultaneously by multiple threads.
.Pp
By default,
the NUL-terminated string pointed to by
.Fa string
is considered to be the text of an entire line, minus any terminating
newline.
The
.Fa eflags
argument is the bitwise OR of zero or more of the following flags:
.Bl -tag -width XXXREG_NOTBOL
.It Dv REG_NOTBOL
The first character of the string
is not the beginning of a line, so the `^' anchor should not match before it.
This does not affect the behavior of newlines under
.Dv REG_NEWLINE .
.It Dv REG_NOTEOL
The NUL terminating the string does not end a line, so the `$' anchor
should not match before it.
This does not affect the behavior of newlines under
.Dv REG_NEWLINE .
.It Dv REG_STARTEND
The string is considered to start at
.Fa string
+
.Fa pmatch[0].rm_so
and to have a terminating NUL located at
.Fa string
+
.Fa pmatch[0].rm_eo
(there need not actually be a NUL at that location),
regardless of the value of
.Fa nmatch .
See below for the definition of
.Fa pmatch
and
.Fa nmatch .
This is an extension, compatible with but not specified by
.St -p1003.2-92 ,
and should be used with caution in software intended to be portable to
other systems.
Note that a non-zero
.Fa rm_so
does not imply
.Dv REG_NOTBOL ;
.Dv REG_STARTEND
affects only the location of the string, not how it is matched.
.El
.Pp
See
.Xr re_format 7
for a discussion of what is matched in situations where an RE or a
portion thereof could match any of several substrings of
.Fa string .
.Pp
Normally,
.Fn regexec
returns 0 for success and the non-zero code
.Dv REG_NOMATCH
for failure.
Other non-zero error codes may be returned in exceptional situations;
see
.Sx DIAGNOSTICS .
.Pp
If
.Dv REG_NOSUB
was specified in the compilation of the RE, or if
.Fa nmatch
is 0,
.Fn regexec
ignores the
.Fa pmatch
argument (but see below for the case where
.Dv REG_STARTEND
is specified).
Otherwise,
.Fa pmatch
points to an array of
.Fa nmatch
structures of type
.Fa regmatch_t .
Such a structure has at least the members
.Fa rm_so
and
.Fa rm_eo ,
both of type
.Fa regoff_t
(a signed arithmetic type at least as large as an
.Fa off_t
and a
.Fa ssize_t ) ,
containing respectively the offset of the first character of a substring
and the offset of the first character after the end of the substring.
Offsets are measured from the beginning of the
.Fa string
argument given to
.Fn regexec .
An empty substring is denoted by equal offsets,
both indicating the character following the empty substring.
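.Pp
Because offsets are relative to the string handed to each call, a scan
for successive matches can advance a pointer by rm_eo and pass REG_NOTBOL
on later calls.
The sketch below assumes REG_NEWLINE is not in use (so no position after
the first is a line start); count_matches is a hypothetical helper name.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: counts non-overlapping matches of pattern in text. */
size_t
count_matches(const char *pattern, const char *text)
{
    regex_t re;
    regmatch_t m[1];
    const char *p = text;
    size_t count = 0;
    int eflags = 0;

    assert(pattern != NULL && text != NULL);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return 0;

    while (regexec(&re, p, 1, m, eflags) == 0) {
        count++;
        if (m[0].rm_eo == m[0].rm_so) {
            /* Empty match: step over one character to make progress. */
            if (p[m[0].rm_eo] == '\0')
                break;
            p += m[0].rm_eo + 1;
        } else {
            /* Offsets are measured from p, not from text. */
            p += m[0].rm_eo;
        }
        /* The rest of the string no longer starts a line. */
        eflags = REG_NOTBOL;
    }
    regfree(&re);
    return count;
}
```

Without REG_NOTBOL on the later calls, a pattern such as `^a' would match
again at every restart position instead of only at the true start.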
.Pp
The 0th member of the
.Fa pmatch
array is filled in to indicate what substring of
.Fa string
was matched by the entire RE.
Remaining members report what substring was matched by parenthesized
subexpressions within the RE;
member
.Fa i
reports subexpression
.Fa i ,
with subexpressions counted (starting at 1) by the order of their
opening parentheses in the RE, left to right.
Unused entries in the array\(emcorresponding either to subexpressions that
did not participate in the match at all, or to subexpressions that do not
exist in the RE (that is,
.Fa i
\*[Gt]
.Fa preg-\*[Gt]re_nsub )
\(emhave both
.Fa rm_so
and
.Fa rm_eo
set to \-1.
If a subexpression participated in the match several times,
the reported substring is the last one it matched.
(Note, as an example in particular, that when the RE `(b*)+' matches `bbb',
the parenthesized subexpression matches each of the three `b's and then
an infinite number of empty strings following the last `b',
so the reported substring is one of the empties.)
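.Pp
The reporting of subexpressions might be used as in the following sketch,
which copies one subexpression out of a match and treats an rm_so of \-1
as ``did not participate''.
The names get_group and MAX_GROUPS are hypothetical, chosen for this
example only.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

#define MAX_GROUPS 10   /* hypothetical cap on reported subexpressions */

/*
 * Hypothetical helper: copies subexpression idx (0 = whole match) into
 * out, truncating to fit, and returns the number of bytes copied.
 * Returns -1 on compile failure, no match, or an unused entry.
 */
int
get_group(const char *pattern, const char *text, size_t idx,
    char *out, size_t outsz)
{
    regex_t re;
    regmatch_t m[MAX_GROUPS];
    int len = -1;

    assert(out != NULL && outsz > 0);
    if (idx >= MAX_GROUPS || regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    if (regexec(&re, text, MAX_GROUPS, m, 0) == 0 && m[idx].rm_so != -1) {
        len = (int)(m[idx].rm_eo - m[idx].rm_so);
        if ((size_t)len >= outsz)
            len = (int)outsz - 1;
        /* Offsets are measured from the start of text. */
        memcpy(out, text + m[idx].rm_so, (size_t)len);
        out[len] = '\0';
    }
    regfree(&re);
    return len;
}
```

For the RE `(a)|(b)' matched against "b", entry 1 is unused, so the helper
returns \-1 for idx 1 while idx 2 yields "b".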
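.Pp
If
.Dv REG_STARTEND
is specified,
.Fa pmatch
must point to at least one
.Fa regmatch_t
(even if
.Fa nmatch
is 0 or
.Dv REG_NOSUB
was specified),
to hold the input offsets for
.Dv REG_STARTEND .
Use for output is still entirely controlled by
.Fa nmatch ;
if
.Fa nmatch
is 0 or
.Dv REG_NOSUB
was specified,
the value of
.Fa pmatch[0]
will not be changed by a successful
.Fn regexec .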
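.Pp
A sketch of REG_STARTEND follows: the input range is loaded into pmatch[0]
before the call, and the same entry receives the result.
Since the flag is an extension, the code is guarded with #ifdef and reports
an error where it is unavailable; match_in_range is a hypothetical helper
name.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: searches text[start, end) for pattern.
 * Returns 1 on match (storing offsets in *out if out is non-NULL),
 * 0 on no match, -1 on error or if REG_STARTEND is not available.
 */
int
match_in_range(const char *pattern, const char *text,
    size_t start, size_t end, regmatch_t *out)
{
#ifdef REG_STARTEND
    regex_t re;
    regmatch_t m[1];
    int rc;

    assert(pattern != NULL && text != NULL && start <= end);
    if (regcomp(&re, pattern, REG_EXTENDED) != 0)
        return -1;

    /* Input offsets: where the string starts and where it "ends". */
    m[0].rm_so = (regoff_t)start;
    m[0].rm_eo = (regoff_t)end;
    rc = regexec(&re, text, 1, m, REG_STARTEND);
    regfree(&re);

    if (rc == 0) {
        if (out != NULL)
            *out = m[0];
        return 1;
    }
    return rc == REG_NOMATCH ? 0 : -1;
#else
    (void)pattern; (void)text; (void)start; (void)end; (void)out;
    return -1;
#endif
}
```

The offsets reported on success are still measured from the beginning of
the string argument, not from the start of the range.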
.Pp
.Fn regerror
maps a non-zero
.Fa errcode
from either
.Fn regcomp
or
.Fn regexec
to a human-readable, printable message.
If
.Fa preg
is non-NULL,
the error code should have arisen from use of the
.Fa regex_t
pointed to by
.Fa preg ,
and if the error code came from
.Fn regcomp ,
it should have been the result from the most recent
.Fn regcomp
using that
.Fa regex_t .
.Po Fn regerror
may be able to supply a more detailed message using information
from the
.Fa regex_t . Pc
.Fn regerror
places the NUL-terminated message into the buffer pointed to by
.Fa errbuf ,
limiting the length (including the NUL) to at most
.Fa errbuf_size
bytes.
If the whole message won't fit,
as much of it as will fit before the terminating NUL is supplied.
In any case,
the returned value is the size of buffer needed to hold the whole
message (including terminating NUL).
If
.Fa errbuf_size
is 0,
.Fa errbuf
is ignored but the return value is still correct.
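.Pp
The size-query behavior lends itself to a two-call idiom: ask for the
needed size with a zero-length buffer, then allocate and fetch the message.
The following is a sketch under that reading; re_strerror is a hypothetical
helper name, and the caller is assumed to free the result.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>

/*
 * Hypothetical helper: returns a freshly allocated copy of the message
 * for errcode, or NULL if memory is exhausted. The caller frees it.
 */
char *
re_strerror(int errcode, const regex_t *preg)
{
    size_t need;
    char *buf;

    assert(errcode != 0);
    /* errbuf_size of 0: errbuf is ignored, the size is still returned. */
    need = regerror(errcode, preg, NULL, 0);
    buf = malloc(need);
    if (buf != NULL)
        regerror(errcode, preg, buf, need);   /* need includes the NUL */
    return buf;
}
```

This avoids guessing a buffer size and never truncates the message.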
.Pp
If the
.Fa errcode
given to
.Fn regerror
is first ORed with
.Dv REG_ITOA ,
the ``message'' that results is the printable name of the error code,
e.g. ``REG_NOMATCH'',
rather than an explanation thereof.
If
.Fa errcode
is
.Dv REG_ATOI ,
then
.Fa preg
shall be non-NULL and the
.Fa re_endp
member of the structure it points to
must point to the printable name of an error code;
in this case, the result in
.Fa errbuf
is the decimal digits of
the numeric value of the error code
(0 if the name is not recognized).
.Dv REG_ITOA
and
.Dv REG_ATOI
are intended primarily as debugging facilities;
they are extensions, compatible with but not specified by
.St -p1003.2-92 ,
and should be used with caution in software intended to be portable to
other systems.
Be warned also that they are considered experimental and changes are possible.
.Pp
.Fn regfree
frees any dynamically-allocated storage associated with the compiled RE
pointed to by
.Fa preg .
The remaining
.Fa regex_t
is no longer a valid compiled RE
and the effect of supplying it to
.Fn regexec
or
.Fn regerror
is undefined.
.Pp
None of these functions references global variables except for tables
of constants;
all are safe for use from multiple threads if the arguments are safe.
.Sh IMPLEMENTATION CHOICES
There are a number of decisions that
.St -p1003.2-92
leaves up to the implementor,
either by explicitly saying ``undefined'' or by virtue of them being
forbidden by the RE grammar.
This implementation treats them as follows.
.Pp
See
.Xr re_format 7
for a discussion of the definition of case-independent matching.
.Pp
There is no particular limit on the length of REs,
except insofar as memory is limited.
Memory usage is approximately linear in RE size, and largely insensitive
to RE complexity, except for bounded repetitions.
See BUGS for one short RE using them
that will run almost any system out of memory.
.Pp
A backslashed character other than one specifically given a magic meaning
by
.St -p1003.2-92
(such magic meanings occur only in obsolete [``basic''] REs)
is taken as an ordinary character.
.Pp
Any unmatched [ is a
.Dv REG_EBRACK
error.
.Pp
Equivalence classes cannot begin or end bracket-expression ranges.
The endpoint of one range cannot begin another.
.Pp
.Dv RE_DUP_MAX ,
the limit on repetition counts in bounded repetitions, is 255.
.Pp
A repetition operator (?, *, +, or bounds) cannot follow another
repetition operator.
A repetition operator cannot begin an expression or subexpression
or follow `^' or `|'.
.Pp
`|' cannot appear first or last in a (sub)expression or after another `|',
i.e. an operand of `|' cannot be an empty subexpression.
An empty parenthesized subexpression, `()', is legal and matches an
empty (sub)string.
An empty string is not a legal RE.
.Pp
A `{' followed by a digit is considered the beginning of bounds for a
bounded repetition, which must then follow the syntax for bounds.
A `{'
.Em not
followed by a digit is considered an ordinary character.
.Pp
`^' and `$' beginning and ending subexpressions in obsolete (``basic'')
REs are anchors, not ordinary characters.
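.Sh DIAGNOSTICS
Non-zero error codes from
.Fn regcomp
and
.Fn regexec
include the following:
.Pp
.Bl -tag -width XXXREG_ECOLLATE -compact
.It Dv REG_NOMATCH
.Fn regexec
failed to match
.It Dv REG_BADPAT
invalid regular expression
.It Dv REG_ECOLLATE
invalid collating element
.It Dv REG_ECTYPE
invalid character class
.It Dv REG_EESCAPE
\e applied to unescapable character
.It Dv REG_ESUBREG
invalid backreference number
.It Dv REG_EBRACK
brackets [ ] not balanced
.It Dv REG_EPAREN
parentheses ( ) not balanced
.It Dv REG_EBRACE
braces { } not balanced
.It Dv REG_BADBR
invalid repetition count(s) in { }
.It Dv REG_ERANGE
invalid character range in [ ]
.It Dv REG_ESPACE
ran out of memory
.It Dv REG_BADRPT
?, *, or + operand invalid
.It Dv REG_EMPTY
empty (sub)expression
.It Dv REG_ASSERT
``can't happen''\(emyou found a bug
.It Dv REG_INVARG
invalid argument, e.g. negative-length string
.El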
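.Pp
As an illustration of how these codes might be handled, the sketch below
returns the code produced by compiling a pattern and maps codes to their
names.
Both helpers (compile_error and re_code_name) are hypothetical; the three
codes that are not in the standard are guarded with #ifdef, on the
assumption that systems providing them define them as macros.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/* Hypothetical helper: the regcomp() result for pattern (0 on success). */
int
compile_error(const char *pattern)
{
    regex_t re;
    int rc;

    assert(pattern != NULL);
    rc = regcomp(&re, pattern, REG_EXTENDED);
    if (rc == 0)
        regfree(&re);   /* only a successful compile needs freeing */
    return rc;
}

/* Hypothetical helper: printable name of an error code. */
const char *
re_code_name(int code)
{
    switch (code) {
    case 0:            return "OK";
    case REG_NOMATCH:  return "REG_NOMATCH";
    case REG_BADPAT:   return "REG_BADPAT";
    case REG_ECOLLATE: return "REG_ECOLLATE";
    case REG_ECTYPE:   return "REG_ECTYPE";
    case REG_EESCAPE:  return "REG_EESCAPE";
    case REG_ESUBREG:  return "REG_ESUBREG";
    case REG_EBRACK:   return "REG_EBRACK";
    case REG_EPAREN:   return "REG_EPAREN";
    case REG_EBRACE:   return "REG_EBRACE";
    case REG_BADBR:    return "REG_BADBR";
    case REG_ERANGE:   return "REG_ERANGE";
    case REG_ESPACE:   return "REG_ESPACE";
    case REG_BADRPT:   return "REG_BADRPT";
#ifdef REG_EMPTY
    case REG_EMPTY:    return "REG_EMPTY";
#endif
#ifdef REG_ASSERT
    case REG_ASSERT:   return "REG_ASSERT";
#endif
#ifdef REG_INVARG
    case REG_INVARG:   return "REG_INVARG";
#endif
    default:           return "unknown";
    }
}
```

Which code a given malformed pattern produces can differ between
implementations, so portable callers are better off testing for non-zero
and passing the code to regerror.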
.Sh SEE ALSO
.Xr re_format 7
.Pp
.St -p1003.2-92 ,
sections 2.8 (Regular Expression Notation)
and
B.5 (C Binding for Regular Expression Matching).
.Sh HISTORY
Originally written by Henry Spencer.
Altered for inclusion in the
.Bx 4.4
distribution.
.Sh BUGS
There is one known functionality bug.
The implementation of internationalization is incomplete:
the locale is always assumed to be the default one of
.St -p1003.2-92 ,
and only the collating elements etc. of that locale are available.
.Pp
The back-reference code is subtle and doubts linger about its correctness
in complex cases.
.Pp
.Fn regexec
performance is poor.
This will improve with later releases.
.Fa nmatch
exceeding 0 is expensive;
.Fa nmatch
exceeding 1 is worse.
.Fn regexec
is largely insensitive to RE complexity
.Em except
that back references are massively expensive.
RE length does matter; in particular, there is a strong speed bonus
for keeping RE length under about 30 characters,
with most special characters counting roughly double.
.Pp
.Fn regcomp
implements bounded repetitions by macro expansion,
which is costly in time and space if counts are large
or bounded repetitions are nested.
An RE like, say,
`((((a{1,100}){1,100}){1,100}){1,100}){1,100}'
will (eventually) run almost any existing machine out of swap space.
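.Pp
The scale of the problem can be estimated with a rough model that assumes
each bounded repetition with upper bound n is expanded into about n copies
of its operand; the real expansion may differ in detail.
Under that assumption the innermost operand is copied as many times as the
product of the upper bounds, which the hypothetical function below computes.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * Hypothetical estimate: copies of the innermost operand after expanding
 * `depth' nested bounded repetitions with the given upper bounds.
 * Overflow of the 64-bit product is not checked.
 */
uint64_t
expanded_copies(const unsigned *upper_bounds, size_t depth)
{
    uint64_t copies = 1;
    size_t i;

    assert(upper_bounds != NULL || depth == 0);
    for (i = 0; i < depth; i++)
        copies *= upper_bounds[i];   /* each level multiplies the total */
    return copies;
}
```

For the five nested {1,100} bounds above, the model gives 100 to the fifth
power, that is 10,000,000,000 copies of `a', before any per-copy overhead.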
.Pp
There are suspected problems with response to obscure error conditions.
Notably,
certain kinds of internal overflow,
produced only by truly enormous REs or by multiply nested bounded repetitions,
are probably not handled well.
.Pp
Due to a mistake in
.St -p1003.2-92 ,
things like `a)b' are legal REs because `)' is a special character
only in the presence of a previous unmatched `('.
This can't be fixed until the spec is fixed.
.Pp
The standard's definition of back references is vague.
For example, does
`a\e(\e(b\e)*\e2\e)*d' match `abbbd'?
Until the standard is clarified, behavior in such cases should not be
relied on.
.Pp
The implementation of word-boundary matching is a bit of a kludge,
and bugs may lurk in combinations of word-boundary matching and anchoring.