.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
.\" Copyright (c) 1992, 1993, 1994
.\" The Regents of the University of California. All rights reserved.
.\" This code is derived from software contributed to Berkeley by
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" 1. Redistributions of source code must retain the above copyright
.\" notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\" notice, this list of conditions and the following disclaimer in the
.\" documentation and/or other materials provided with the distribution.
.\" 3. Neither the name of the University nor the names of its contributors
.\" may be used to endorse or promote products derived from this software
.\" without specific prior written permission.
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED. IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" @(#)regex.3 8.4 (Berkeley) 3/20/94
.\" $FreeBSD: src/lib/libc/regex/regex.3,v 1.9 2001/10/01 16:08:58 ru Exp $
.Nd regular-expression library
.Fn regcomp "regex_t *restrict preg" "const char *restrict pattern" "int cflags"
.Fa "const regex_t *restrict preg" "const char *restrict string"
.Fa "size_t nmatch" "regmatch_t pmatch[restrict]" "int eflags"
.Fa "int errcode" "const regex_t *restrict preg"
.Fa "char *restrict errbuf" "size_t errbuf_size"
.Fn regfree "regex_t *preg"
These routines implement
compiles an RE written as a string into an internal form,
matches that internal form against a string and reports results,
transforms error codes from either into human-readable messages,
frees any dynamically-allocated storage used by the internal form
declares two structure types,
the former for compiled internal forms and the latter for match reporting.
It also declares the four functions,
and a number of constants with names starting with
compiles the regular expression contained in the
subject to the flags in
and places the results in the
structure pointed to by
is the bitwise OR of zero or more of the following flags:
.Bl -tag -width REG_EXTENDED
rather than the obsolete
This is a synonym for 0,
provided as a counterpart to
to improve readability.
Compile with recognition of all special characters turned off.
All characters are thus considered ordinary,
This is an extension,
compatible with but not specified by
and should be used with
caution in software intended to be portable to other systems.
Compile for matching that ignores upper/lower case distinctions.
Compile for matching that need only report success or failure,
not what was matched.
Compile for newline-sensitive matching.
By default, newline is a completely ordinary character with no special
meaning in either REs or strings.
bracket expressions and
anchor matches the null string after any newline in the string
in addition to its normal function,
anchor matches the null string before any newline in the
string in addition to its normal function.
The regular expression ends,
not at the first NUL,
but just before the character pointed to by the
member of the structure pointed to by
This flag permits inclusion of NULs in the RE;
they are considered ordinary characters.
This is an extension,
compatible with but not specified by
and should be used with
caution in software intended to be portable to other systems.
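.Pp
As a minimal sketch of how compilation flags combine, the following helper
compiles a pattern, tests one string, and frees the compiled form.
The name
.Fn re_matches
and its return convention are assumptions made for illustration;
only the standard flags
.Dv REG_EXTENDED ,
.Dv REG_ICASE ,
.Dv REG_NEWLINE ,
and
.Dv REG_NOSUB
are used.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: return 1 if `text` matches `pattern`, 0 if it
 * does not (or regexec() reports any other error), and -1 if the
 * pattern fails to compile.
 */
int
re_matches(const char *pattern, const char *text, int cflags)
{
	regex_t re;
	int rc;

	assert(pattern != NULL && text != NULL);

	/* Only success or failure is wanted, so add REG_NOSUB. */
	if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
		return -1;

	/* With REG_NOSUB, nmatch and pmatch are ignored. */
	rc = regexec(&re, text, 0, NULL, 0);

	/* Release the storage held by the compiled form. */
	regfree(&re);
	return rc == 0 ? 1 : 0;
}
```

.Pp
For example, calling the helper with
.Dv REG_EXTENDED | REG_ICASE
makes a lower-case pattern accept upper-case input, and adding
.Dv REG_NEWLINE
lets the anchors match at line boundaries inside a multi-line string.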
returns 0 and fills in the structure pointed to by
One member of that structure
contains the number of parenthesized subexpressions within the RE
(except that the value of this member is undefined if the
fails, it returns a non-zero error code;
matches the compiled RE pointed to by
subject to the flags in
and reports results using
and the returned value.
The RE must have been compiled by a previous invocation of
The compiled form is not altered during execution of
so a single compiled RE can be used simultaneously by multiple threads.
the NUL-terminated string pointed to by
is considered to be the text of an entire line, minus any terminating
argument is the bitwise OR of zero or more of the following flags:
.Bl -tag -width REG_STARTEND
The first character of
is not the beginning of a line, so the
anchor should not match before it.
This does not affect the behavior of newlines under
does not end a line, so the
anchor should not match before it.
This does not affect the behavior of newlines under
The string is considered to start at
.Fa pmatch Ns [0]. Ns Va rm_so
and to have a terminating NUL located at
.Fa pmatch Ns [0]. Ns Va rm_eo
(there need not actually be a NUL at that location),
regardless of the value of
See below for the definition of
This is an extension,
compatible with but not specified by
and should be used with
caution in software intended to be portable to other systems.
affects only the location of the string,
not how it is matched.
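.Pp
The following sketch shows one way
.Dv REG_STARTEND
might be used to search part of a buffer without copying it.
The helper name
.Fn search_range
and its return convention are assumptions made for illustration.
Because the flag is an extension, the sketch checks that it is defined
and reports failure on systems that lack it.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper: search only the bytes text[start..end) for the
 * ERE `pattern`.  Returns the offset of the match, measured from the
 * beginning of `text`, or -1 if there is no match, the pattern does not
 * compile, or REG_STARTEND is unavailable.
 */
long
search_range(const char *pattern, const char *text, size_t start, size_t end)
{
#ifdef REG_STARTEND
	regex_t re;
	regmatch_t pm[1];
	int rc;

	assert(start <= end);
	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return -1;

	/* On input, pmatch[0] delimits the region to be searched. */
	pm[0].rm_so = (regoff_t)start;
	pm[0].rm_eo = (regoff_t)end;
	rc = regexec(&re, text, 1, pm, REG_STARTEND);
	regfree(&re);

	/* On output, pmatch[0] holds offsets relative to `text`. */
	return rc == 0 ? (long)pm[0].rm_so : -1;
#else
	(void)pattern; (void)text; (void)start; (void)end;
	return -1;	/* extension not provided by this system */
#endif
}
```

.Pp
Note that the same array entry serves as input and output, so the
caller must reload it before each search.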
for a discussion of what is matched in situations where an RE or a
portion thereof could match any of several substrings of
returns 0 for success and the non-zero code
Other non-zero error codes may be returned in exceptional situations;
was specified in the compilation of the RE,
argument (but see below for the case where
points to an array of
Such a structure has at least the members
(a signed arithmetic type at least as large as an
containing respectively the offset of the first character of a substring
and the offset of the first character after the end of the substring.
Offsets are measured from the beginning of the
An empty substring is denoted by equal offsets,
both indicating the character following the empty substring.
The 0th member of the
array is filled in to indicate what substring of
was matched by the entire RE.
Remaining members report what substring was matched by parenthesized
subexpressions within the RE;
reports subexpression
with subexpressions counted (starting at 1) by the order of their opening
parentheses in the RE, left to right.
Unused entries in the array (corresponding either to subexpressions that
did not participate in the match at all, or to subexpressions that do not
exist in the RE (that is,
.Fa preg Ns -> Ns Va re_nsub ) )
If a subexpression participated in the match several times,
the reported substring is the last one it matched.
(Note, in particular, that when the RE
the parenthesized subexpression matches each of the three
an infinite number of empty strings following the last
so the reported substring is one of the empties.)
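.Pp
As a sketch of how the match array is typically consumed, the following
helper copies the text matched by one subexpression into a caller-supplied
buffer.
The name
.Fn copy_group ,
the constant
.Dv MAX_GROUPS ,
and the return convention are assumptions made for illustration.
The sketch relies on unused entries having an
.Va rm_so
of \-1, as
.St -p1003.2
requires.

```c
#include <assert.h>
#include <regex.h>
#include <string.h>

#define MAX_GROUPS 8	/* hypothetical bound chosen for this sketch */

/*
 * Hypothetical helper: match `text` against the ERE `pattern` and copy
 * the substring matched by subexpression `group` (0 is the whole match)
 * into `out`, truncating if needed.  Returns the number of bytes copied,
 * or -1 on compile failure, no match, or an unused subexpression.
 */
int
copy_group(const char *pattern, const char *text, size_t group,
    char *out, size_t outsize)
{
	regex_t re;
	regmatch_t pm[MAX_GROUPS];
	int len = -1;

	assert(out != NULL && outsize > 0);
	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return -1;

	/* re_nsub counts subexpressions; entry 0 is the whole match. */
	if (group <= re.re_nsub && group < MAX_GROUPS &&
	    regexec(&re, text, MAX_GROUPS, pm, 0) == 0 &&
	    pm[group].rm_so != -1) {
		size_t n = (size_t)(pm[group].rm_eo - pm[group].rm_so);

		if (n >= outsize)
			n = outsize - 1;	/* leave room for the NUL */
		memcpy(out, text + pm[group].rm_so, n);
		out[n] = '\0';
		len = (int)n;
	}
	regfree(&re);
	return len;
}
```

.Pp
With the pattern
.Ql "(a)|(b)"
and the string
.Ql b ,
subexpression 1 does not participate in the match, so the helper
returns \-1 for it while subexpression 2 yields
.Ql b .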
must point to at least one
to hold the input offsets for
Use for output is still entirely controlled by
will not be changed by a successful
to a human-readable, printable message.
.No non\- Ns Dv NULL ,
the error code should have arisen from use of
and if the error code came from
it should have been the result from the most recent
may be able to supply a more detailed message using information
places the NUL-terminated message into the buffer pointed to by
limiting the length (including the NUL) to at most
If the whole message won't fit,
as much of it as will fit before the terminating NUL is supplied.
the returned value is the size of buffer needed to hold the whole
message (including terminating NUL).
is ignored but the return value is still correct.
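.Pp
Because the return value is the size needed for the whole message,
a caller can ask for the size first and then allocate exactly that much.
The following is a minimal sketch of that two-call pattern;
the helper name
.Fn re_strerror
is an assumption made for illustration.

```c
#include <assert.h>
#include <regex.h>
#include <stdlib.h>

/*
 * Hypothetical helper: return a malloc'd copy of the message for
 * `errcode`, or NULL if allocation fails.  The caller frees the result.
 */
char *
re_strerror(int errcode, const regex_t *preg)
{
	size_t need, got;
	char *msg;

	/* With a size of 0 the buffer is ignored; only the size is returned. */
	need = regerror(errcode, preg, NULL, 0);

	msg = malloc(need);
	if (msg != NULL) {
		/* The second call fills the buffer, NUL included. */
		got = regerror(errcode, preg, msg, need);
		assert(got == need);
		(void)got;
	}
	return msg;
}
```

.Pp
A typical caller passes the non-zero value returned by
.Fn regcomp
together with the same
.Ft regex_t
pointer, prints the message, and frees it.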
that results is the printable name of the error code,
rather than an explanation thereof.
member of the structure it points to
must point to the printable name of an error code;
in this case, the result in
is the decimal digits of
the numeric value of the error code
(0 if the name is not recognized).
are intended primarily as debugging facilities;
compatible with but not specified by
and should be used with
caution in software intended to be portable to other systems.
Be warned also that they are considered experimental and changes are possible.
frees any dynamically-allocated storage associated with the compiled RE
is no longer a valid compiled RE
and the effect of supplying it to
None of these functions references global variables except for tables
all are safe for use from multiple threads if the arguments are safe.
.Sh IMPLEMENTATION CHOICES
There are a number of decisions that
leaves up to the implementor,
either by explicitly saying
or by virtue of their being
forbidden by the RE grammar.
This implementation treats them as follows.
for a discussion of the definition of case-independent matching.
There is no particular limit on the length of REs,
except insofar as memory is limited.
Memory usage is approximately linear in RE size, and largely insensitive
to RE complexity, except for bounded repetitions.
for one short RE using them
that will run almost any system out of memory.
A backslashed character other than one specifically given a magic meaning
(such magic meanings occur only in obsolete
is taken as an ordinary character.
Equivalence classes cannot begin or end bracket-expression ranges.
The endpoint of one range cannot begin another.
the limit on repetition counts in bounded repetitions, is 255.
A repetition operator
cannot follow another
A repetition operator cannot begin an expression or subexpression
cannot appear first or last in a (sub)expression or after another
cannot be an empty subexpression.
An empty parenthesized subexpression,
is legal and matches an
An empty string is not a legal RE.
followed by a digit is considered the beginning of bounds for a
bounded repetition, which must then follow the syntax for bounds.
followed by a digit is considered an ordinary character.
beginning and ending subexpressions in obsolete
REs are anchors, not ordinary characters.
sections 2.8 (Regular Expression Notation)
B.5 (C Binding for Regular Expression Matching).
Non-zero error codes from
include the following:
.Bl -tag -width REG_ECOLLATE -compact
invalid regular expression
invalid collating element
invalid character class
applied to unescapable character
invalid backreference number
invalid repetition count(s) in
invalid character range in
empty (sub)expression
can't happen - you found a bug
invalid argument, e.g. negative-length string
Originally written by
Altered for inclusion in the
This is an alpha release with known defects.
Please report problems.
The back-reference code is subtle and doubts linger about its correctness
This will improve with later releases.
exceeding 0 is expensive;
exceeding 1 is worse.
is largely insensitive to RE complexity
references are massively expensive.
RE length does matter; in particular, there is a strong speed bonus
for keeping RE length under about 30 characters,
with most special characters counting roughly double.
implements bounded repetitions by macro expansion,
which is costly in time and space if counts are large
or bounded repetitions are nested.
.Ql "((((a{1,100}){1,100}){1,100}){1,100}){1,100}"
will (eventually) run almost any existing machine out of swap space.
There are suspected problems with response to obscure error conditions.
certain kinds of internal overflow,
produced only by truly enormous REs or by multiply nested bounded repetitions,
are probably not handled well.
are legal REs because
a special character only in the presence of a previous unmatched
This can't be fixed until the spec is fixed.
The standard's definition of back references is vague.
.Ql "a\e(\e(b\e)*\e2\e)*d"
Until the standard is clarified,
behavior in such cases should not be relied on.
The implementation of word-boundary matching is a bit of a kludge,
and bugs may lurk in combinations of word-boundary matching and anchoring.