.\" Copyright (c) 1992, 1993, 1994 Henry Spencer.
.\" Copyright (c) 1992, 1993, 1994
.\"     The Regents of the University of California.  All rights reserved.
.\"
.\" This code is derived from software contributed to Berkeley by
.\" Henry Spencer of the University of Toronto.
.\"
.\" Redistribution and use in source and binary forms, with or without
.\" modification, are permitted provided that the following conditions
.\" are met:
.\" 1. Redistributions of source code must retain the above copyright
.\"    notice, this list of conditions and the following disclaimer.
.\" 2. Redistributions in binary form must reproduce the above copyright
.\"    notice, this list of conditions and the following disclaimer in the
.\"    documentation and/or other materials provided with the distribution.
.\" 3. All advertising materials mentioning features or use of this software
.\"    must display the following acknowledgement:
.\"     This product includes software developed by the University of
.\"     California, Berkeley and its contributors.
.\" 4. Neither the name of the University nor the names of its contributors
.\"    may be used to endorse or promote products derived from this software
.\"    without specific prior written permission.
.\"
.\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
.\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
.\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
.\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
.\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
.\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
.\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
.\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
.\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
.\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
.\" SUCH DAMAGE.
.\"
.\"     @(#)regex.3     8.2 (Berkeley) 3/16/94
.\"
.TH REGEX 3 "March 16, 1994"
.de ZR
.\" one other place knows this name:  the SEE ALSO section
.IR re_format (7) \\$1
..
.SH NAME
regcomp, regexec, regerror, regfree \- regular-expression library
.SH SYNOPSIS
.ft B
.\".na
#include <sys/types.h>
.br
#include <regex.h>
.HP 10
int regcomp(regex_t\ *preg, const\ char\ *pattern, int\ cflags);
.HP
int\ regexec(const\ regex_t\ *preg, const\ char\ *string,
size_t\ nmatch, regmatch_t\ pmatch[], int\ eflags);
.HP
size_t\ regerror(int\ errcode, const\ regex_t\ *preg,
char\ *errbuf, size_t\ errbuf_size);
.HP
void\ regfree(regex_t\ *preg);
.\".ad
.ft
.SH DESCRIPTION
These routines implement POSIX 1003.2 regular expressions (``RE''s);
see
.ZR .
.I Regcomp
compiles an RE written as a string into an internal form,
.I regexec
matches that internal form against a string and reports results,
.I regerror
transforms error codes from either into human-readable messages,
and
.I regfree
frees any dynamically-allocated storage used by the internal form
of an RE.
.PP
The header
.I <regex.h>
declares two structure types,
.I regex_t
and
.IR regmatch_t ,
the former for compiled internal forms and the latter for match reporting.
It also declares the four functions,
a type
.IR regoff_t ,
and a number of constants with names starting with ``REG_''.
.PP
.I Regcomp
compiles the regular expression contained in the
.I pattern
string,
subject to the flags in
.IR cflags ,
and places the results in the
.I regex_t
structure pointed to by
.IR preg .
.I Cflags
is the bitwise OR of zero or more of the following flags:
.IP REG_EXTENDED \w'REG_EXTENDED'u+2n
Compile modern (``extended'') REs,
rather than the obsolete (``basic'') REs that
are the default.
.IP REG_BASIC
This is a synonym for 0,
provided as a counterpart to REG_EXTENDED to improve readability.
.IP REG_NOSPEC
Compile with recognition of all special characters turned off.
All characters are thus considered ordinary,
so the ``RE'' is a literal string.
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
REG_EXTENDED and REG_NOSPEC may not be used
in the same call to
.IR regcomp .
.IP REG_ICASE
Compile for matching that ignores upper/lower case distinctions.
See
.ZR .
.IP REG_NOSUB
Compile for matching that need only report success or failure,
not what was matched.
.IP REG_NEWLINE
Compile for newline-sensitive matching.
By default, newline is a completely ordinary character with no special
meaning in either REs or strings.
With this flag,
`[^' bracket expressions and `.' never match newline,
a `^' anchor matches the null string after any newline in the string
in addition to its normal function,
and the `$' anchor matches the null string before any newline in the
string in addition to its normal function.
.IP REG_PEND
The regular expression ends,
not at the first NUL,
but just before the character pointed to by the
.I re_endp
member of the structure pointed to by
.IR preg .
The
.I re_endp
member is of type
.IR const\ char\ * .
This flag permits inclusion of NULs in the RE;
they are considered ordinary characters.
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
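.PP
The effect of these flags can be seen with a small wrapper.
The following is a minimal sketch, not part of this library:
.I re_matches
is a hypothetical helper name,
and only flags specified by POSIX 1003.2 are exercised,
since the extensions above may be missing on other systems.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the library): compile pattern with
 * the given cflags and report whether it matches anywhere in text.
 * Returns 1 on a match, 0 on no match, -1 on any error.
 */
int
re_matches(const char *pattern, const char *text, int cflags)
{
	regex_t re;
	int rc;

	assert(pattern != NULL && text != NULL);

	/* REG_NOSUB: only success or failure is wanted, not offsets. */
	if (regcomp(&re, pattern, cflags | REG_NOSUB) != 0)
		return -1;
	rc = regexec(&re, text, 0, NULL, 0);
	regfree(&re);

	if (rc == 0)
		return 1;
	return rc == REG_NOMATCH ? 0 : -1;
}
```

.PP
With this sketch,
the RE `^b$' should fail against a string holding the three lines
`a', `b', and `c' when only REG_EXTENDED is given,
and succeed once REG_NEWLINE is added.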
.PP
When successful,
.I regcomp
returns 0 and fills in the structure pointed to by
.IR preg .
One member of that structure
(other than
.IR re_endp )
is publicized:
.IR re_nsub ,
of type
.IR size_t ,
contains the number of parenthesized subexpressions within the RE
(except that the value of this member is undefined if the
REG_NOSUB flag was used).
If
.I regcomp
fails, it returns a non-zero error code;
see DIAGNOSTICS.
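.PP
As a sketch of how
.I re_nsub
might be consulted,
the hypothetical helper
.I count_subexpressions
below compiles an extended RE and returns the member's value.
The helper's name and its use of \-1 to signal failure
are assumptions made for illustration.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the library): return the number of
 * parenthesized subexpressions in an extended RE, or -1 if the RE does
 * not compile.
 */
long
count_subexpressions(const char *pattern)
{
	regex_t re;
	long n;

	assert(pattern != NULL);

	/* REG_NOSUB is deliberately absent: it leaves re_nsub undefined. */
	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return -1;
	n = (long)re.re_nsub;
	regfree(&re);
	return n;
}
```

.PP
For the RE `(a)(b(c))' this helper is expected to return 3,
one for each opening parenthesis.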
.PP
.I Regexec
matches the compiled RE pointed to by
.I preg
against the
.IR string ,
subject to the flags in
.IR eflags ,
and reports results using
.IR nmatch ,
.IR pmatch ,
and the returned value.
The RE must have been compiled by a previous invocation of
.IR regcomp .
The compiled form is not altered during execution of
.IR regexec ,
so a single compiled RE can be used simultaneously by multiple threads.
.PP
By default,
the NUL-terminated string pointed to by
.I string
is considered to be the text of an entire line, minus any terminating
newline.
The
.I eflags
argument is the bitwise OR of zero or more of the following flags:
.IP REG_NOTBOL \w'REG_STARTEND'u+2n
The first character of
the string
is not the beginning of a line, so the `^' anchor should not match before it.
This does not affect the behavior of newlines under REG_NEWLINE.
.IP REG_NOTEOL
The NUL terminating
the string
does not end a line, so the `$' anchor should not match before it.
This does not affect the behavior of newlines under REG_NEWLINE.
.IP REG_STARTEND
The string is considered to start at
\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_so\fR
and to have a terminating NUL located at
\fIstring\fR\ + \fIpmatch\fR[0].\fIrm_eo\fR
(there need not actually be a NUL at that location),
regardless of the value of
.IR nmatch .
See below for the definition of
.IR pmatch
and
.IR nmatch .
This is an extension,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
Note that a non-zero \fIrm_so\fR does not imply REG_NOTBOL;
REG_STARTEND affects only the location of the string,
not how it is matched.
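.PP
The two portable flags, REG_NOTBOL and REG_NOTEOL,
can be tried out with a sketch like the following.
.I re_matches_eflags
is a hypothetical helper, not part of this library;
it always compiles with REG_EXTENDED, which is an arbitrary choice
made to keep the example short.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the library): compile an extended
 * RE and run it against text with the given eflags.
 * Returns 1 on a match, 0 on no match, -1 on any error.
 */
int
re_matches_eflags(const char *pattern, const char *text, int eflags)
{
	regex_t re;
	int rc;

	assert(pattern != NULL && text != NULL);

	if (regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB) != 0)
		return -1;
	/* nmatch is 0, so the pmatch argument is ignored. */
	rc = regexec(&re, text, 0, NULL, eflags);
	regfree(&re);

	if (rc == 0)
		return 1;
	return rc == REG_NOMATCH ? 0 : -1;
}
```

.PP
Here `^foo' should match `foobar' with
.I eflags
of 0 but not with REG_NOTBOL,
and `bar$' should stop matching once REG_NOTEOL is given.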
.PP
See
.ZR
for a discussion of what is matched in situations where an RE or a
portion thereof could match any of several substrings of
.IR string .
.PP
Normally,
.I regexec
returns 0 for success and the non-zero code REG_NOMATCH for failure.
Other non-zero error codes may be returned in exceptional situations;
see DIAGNOSTICS.
.PP
If REG_NOSUB was specified in the compilation of the RE,
or if
.I nmatch
is 0,
.I regexec
ignores the
.I pmatch
argument (but see below for the case where REG_STARTEND is specified).
Otherwise,
.I pmatch
points to an array of
.I nmatch
structures of type
.IR regmatch_t .
Such a structure has at least the members
.I rm_so
and
.IR rm_eo ,
both of type
.I regoff_t
(a signed arithmetic type at least as large as an
.I off_t
and a
.IR ssize_t ),
containing respectively the offset of the first character of a substring
and the offset of the first character after the end of the substring.
Offsets are measured from the beginning of the
.I string
argument given to
.IR regexec .
An empty substring is denoted by equal offsets,
both indicating the character following the empty substring.
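.PP
Because offsets are relative to the
.I string
argument,
a caller that wants every match can advance the pointer by
.I rm_eo
and call
.I regexec
again.
The sketch below counts non-overlapping matches that way.
.I count_matches
is a hypothetical helper, not part of this library,
and stepping one character past an empty match is a design choice
of the sketch rather than anything this library mandates.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the library): count the
 * non-overlapping matches of an extended RE in text.
 * Returns the count, or -1 on any error.
 */
long
count_matches(const char *pattern, const char *text)
{
	regex_t re;
	regmatch_t m[1];
	const char *p;
	long count = 0;
	int rc = 0;

	assert(pattern != NULL && text != NULL);

	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return -1;

	p = text;
	for (;;) {
		/* After the first call, p no longer begins a line. */
		rc = regexec(&re, p, 1, m, p == text ? 0 : REG_NOTBOL);
		if (rc != 0)
			break;
		count++;
		/* Offsets are measured from p, the string given to regexec. */
		if (m[0].rm_eo > m[0].rm_so) {
			p += m[0].rm_eo;
		} else {
			/* Empty match (rm_so == rm_eo): step one character. */
			if (p[m[0].rm_eo] == '\0')
				break;
			p += m[0].rm_eo + 1;
		}
	}
	regfree(&re);
	return (rc == 0 || rc == REG_NOMATCH) ? count : -1;
}
```

.PP
For example, the RE `a' occurs three times in `banana',
so the helper is expected to return 3 for that input.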
.PP
The 0th member of the
.I pmatch
array is filled in to indicate what substring of
.I string
was matched by the entire RE.
Remaining members report what substring was matched by parenthesized
subexpressions within the RE;
member
.I i
reports subexpression
.IR i ,
with subexpressions counted (starting at 1) by the order of their opening
parentheses in the RE, left to right.
Unused entries in the array\(emcorresponding either to subexpressions that
did not participate in the match at all, or to subexpressions that do not
exist in the RE (that is, \fIi\fR\ > \fIpreg\fR\->\fIre_nsub\fR)\(emhave both
.I rm_so
and
.I rm_eo
set to \-1.
If a subexpression participated in the match several times,
the reported substring is the last one it matched.
(Note, as an example in particular, that when the RE `(b*)+' matches `bbb',
the parenthesized subexpression matches each of the three `b's and then
an infinite number of empty strings following the last `b',
so the reported substring is one of the empties.)
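.PP
A sketch of reading the
.I pmatch
array follows.
.I copy_group
and the limit MAX_GROUPS are hypothetical names introduced for the
example, not part of this library;
the return-value convention is likewise an assumption.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

#define MAX_GROUPS 10	/* arbitrary limit chosen for this sketch */

/*
 * Hypothetical helper (not part of the library): match an extended RE
 * against text and copy the substring reported for subexpression
 * `group` (0 means the whole match) into out.
 * Returns the substring length, -1 if the entry is unused
 * (rm_so == -1), or -2 on no match, error, or too small a buffer.
 */
long
copy_group(const char *pattern, const char *text, size_t group,
    char *out, size_t outsz)
{
	regex_t re;
	regmatch_t m[MAX_GROUPS];
	long len = -2;

	assert(pattern != NULL && text != NULL && out != NULL);

	if (group >= MAX_GROUPS || outsz == 0)
		return -2;
	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return -2;

	if (regexec(&re, text, MAX_GROUPS, m, 0) == 0) {
		if (m[group].rm_so == -1) {
			len = -1;	/* did not participate, or no such group */
		} else {
			size_t n = (size_t)(m[group].rm_eo - m[group].rm_so);

			if (n < outsz) {
				memcpy(out, text + m[group].rm_so, n);
				out[n] = '\0';
				len = (long)n;
			}
		}
	}
	regfree(&re);
	return len;
}
```

.PP
With the RE `([a-z]+)@([a-z.]+)' and the string
`mail bob@example.org now',
entry 1 is expected to yield `bob' and entry 2 `example.org'.
With `(a)|(b)' matched against `b',
entry 1 is unused and the helper returns \-1.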
.PP
If REG_STARTEND is specified,
.I pmatch
must point to at least one
.I regmatch_t
(even if
.I nmatch
is 0 or REG_NOSUB was specified),
to hold the input offsets for REG_STARTEND.
Use for output is still entirely controlled by
.IR nmatch ;
if
.I nmatch
is 0 or REG_NOSUB was specified,
the value of
.IR pmatch [0]
will not be changed by a successful
.IR regexec .
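.PP
The following sketch shows one way REG_STARTEND might be used to search
part of a buffer.
.I find_in_range
is a hypothetical helper, not part of this library.
Because REG_STARTEND is an extension,
the sketch tests for the macro and gives up where it is not defined;
that fallback is an assumption of the example.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the library): search the bytes
 * buf[start] .. buf[end - 1] for an extended RE.
 * Returns the offset of the match measured from buf, -1 if there is
 * no match, or -2 on error or where REG_STARTEND is unavailable.
 */
long
find_in_range(const char *pattern, const char *buf, size_t start, size_t end)
{
#ifdef REG_STARTEND
	regex_t re;
	regmatch_t m[1];
	int rc;

	assert(pattern != NULL && buf != NULL && start <= end);

	if (regcomp(&re, pattern, REG_EXTENDED) != 0)
		return -2;
	/* Input: pmatch[0] delimits the region to be searched. */
	m[0].rm_so = (regoff_t)start;
	m[0].rm_eo = (regoff_t)end;
	rc = regexec(&re, buf, 1, m, REG_STARTEND);
	regfree(&re);

	if (rc == 0)
		return (long)m[0].rm_so;	/* output: measured from buf */
	return rc == REG_NOMATCH ? -1 : -2;
#else
	assert(pattern != NULL && buf != NULL && start <= end);
	return -2;
#endif
}
```

.PP
Note that no NUL is needed at
.IR buf [ end ];
the region's end is taken from
.I rm_eo
alone.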
.PP
.I Regerror
maps a non-zero
.I errcode
from either
.I regcomp
or
.I regexec
to a human-readable, printable message.
If
.I preg
is non-NULL,
the error code should have arisen from use of
the
.I regex_t
pointed to by
.IR preg ,
and if the error code came from
.IR regcomp ,
it should have been the result from the most recent
.I regcomp
using that
.IR regex_t .
.RI ( Regerror
may be able to supply a more detailed message using information
from the
.IR regex_t .)
.I Regerror
places the NUL-terminated message into the buffer pointed to by
.IR errbuf ,
limiting the length (including the NUL) to at most
.I errbuf_size
bytes.
If the whole message won't fit,
as much of it as will fit before the terminating NUL is supplied.
In any case,
the returned value is the size of buffer needed to hold the whole
message (including terminating NUL).
If
.I errbuf_size
is 0,
.I errbuf
is ignored but the return value is still correct.
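.PP
The size-query behavior lends itself to a two-call idiom:
ask for the needed size with an
.I errbuf_size
of 0, allocate, then fetch the message.
The sketch below does this in a hypothetical helper,
.IR compile_error_message ,
which is not part of this library.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not part of the library): try to compile
 * pattern and, on failure, return a malloc'd copy of the regerror
 * message.  Returns NULL if the pattern compiles or if memory runs
 * out.  The caller frees the result.
 */
char *
compile_error_message(const char *pattern, int cflags)
{
	regex_t re;
	char *msg;
	size_t need;
	int rc;

	assert(pattern != NULL);

	rc = regcomp(&re, pattern, cflags);
	if (rc == 0) {
		regfree(&re);
		return NULL;
	}
	/* errbuf_size of 0: errbuf is ignored, the needed size is returned. */
	need = regerror(rc, &re, NULL, 0);
	msg = malloc(need);
	if (msg != NULL)
		(void)regerror(rc, &re, msg, need);
	return msg;
}
```

.PP
The wording of the message is up to the implementation,
so callers should not compare it against fixed strings.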
.PP
If the
.I errcode
given to
.I regerror
is first ORed with REG_ITOA,
the ``message'' that results is the printable name of the error code,
e.g. ``REG_NOMATCH'',
rather than an explanation thereof.
If
.I errcode
is REG_ATOI,
then
.I preg
shall be non-NULL and the
.I re_endp
member of the structure it points to
must point to the printable name of an error code;
in this case, the result in
.I errbuf
is the decimal digits of
the numeric value of the error code
(0 if the name is not recognized).
REG_ITOA and REG_ATOI are intended primarily as debugging facilities;
they are extensions,
compatible with but not specified by POSIX 1003.2,
and should be used with
caution in software intended to be portable to other systems.
Be warned also that they are considered experimental and changes are possible.
.PP
.I Regfree
frees any dynamically-allocated storage associated with the compiled RE
pointed to by
.IR preg .
The remaining
.I regex_t
is no longer a valid compiled RE
and the effect of supplying it to
.I regexec
or
.I regerror
is undefined.
.PP
None of these functions references global variables except for tables
of constants;
all are safe for use from multiple threads if the arguments are safe.
.SH IMPLEMENTATION CHOICES
There are a number of decisions that 1003.2 leaves up to the implementor,
either by explicitly saying ``undefined'' or by virtue of them being
forbidden by the RE grammar.
This implementation treats them as follows.
.PP
See
.ZR
for a discussion of the definition of case-independent matching.
.PP
There is no particular limit on the length of REs,
except insofar as memory is limited.
Memory usage is approximately linear in RE size, and largely insensitive
to RE complexity, except for bounded repetitions.
See BUGS for one short RE using them
that will run almost any system out of memory.
.PP
A backslashed character other than one specifically given a magic meaning
by 1003.2 (such magic meanings occur only in obsolete [``basic''] REs)
is taken as an ordinary character.
.PP
Any unmatched [ is a REG_EBRACK error.
.PP
Equivalence classes cannot begin or end bracket-expression ranges.
The endpoint of one range cannot begin another.
.PP
RE_DUP_MAX, the limit on repetition counts in bounded repetitions, is 255.
.PP
A repetition operator (?, *, +, or bounds) cannot follow another
repetition operator.
A repetition operator cannot begin an expression or subexpression
or follow `^' or `|'.
.PP
`|' cannot appear first or last in a (sub)expression or after another `|',
i.e. an operand of `|' cannot be an empty subexpression.
An empty parenthesized subexpression, `()', is legal and matches an
empty (sub)string.
An empty string is not a legal RE.
.PP
A `{' followed by a digit is considered the beginning of bounds for a
bounded repetition, which must then follow the syntax for bounds.
A `{' \fInot\fR followed by a digit is considered an ordinary character.
.PP
`^' and `$' beginning and ending subexpressions in obsolete (``basic'')
REs are anchors, not ordinary characters.
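.PP
Since these choices differ between implementations,
portable software may want to probe them at run time rather than
assume them.
A minimal sketch, using a hypothetical helper
.I compile_status
that is not part of this library:

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>

/*
 * Hypothetical helper (not part of the library): return the result of
 * regcomp for pattern; 0 means the RE was accepted, anything else is
 * one of the codes listed under DIAGNOSTICS.
 */
int
compile_status(const char *pattern, int cflags)
{
	regex_t re;
	int rc;

	assert(pattern != NULL);

	rc = regcomp(&re, pattern, cflags);
	if (rc == 0)
		regfree(&re);	/* only a successful compile needs freeing */
	return rc;
}
```

.PP
Under the rules above, this implementation would be expected to reject
REs such as `a||b' or the empty string;
other implementations may accept them,
so the result should be treated as a property of the library in use.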
.SH SEE ALSO
grep(1), re_format(7)
.PP
POSIX 1003.2, sections 2.8 (Regular Expression Notation)
and
B.5 (C Binding for Regular Expression Matching).
.SH DIAGNOSTICS
Non-zero error codes from
.I regcomp
and
.I regexec
include the following:
.PP
.nf
.ta \w'REG_ECOLLATE'u+3n
REG_NOMATCH	regexec() failed to match
REG_BADPAT	invalid regular expression
REG_ECOLLATE	invalid collating element
REG_ECTYPE	invalid character class
REG_EESCAPE	\e applied to unescapable character
REG_ESUBREG	invalid backreference number
REG_EBRACK	brackets [ ] not balanced
REG_EPAREN	parentheses ( ) not balanced
REG_EBRACE	braces { } not balanced
REG_BADBR	invalid repetition count(s) in { }
REG_ERANGE	invalid character range in [ ]
REG_ESPACE	ran out of memory
REG_BADRPT	?, *, or + operand invalid
REG_EMPTY	empty (sub)expression
REG_ASSERT	``can't happen''\(emyou found a bug
REG_INVARG	invalid argument, e.g. negative-length string
.fi
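.PP
Where the REG_ITOA and REG_ATOI extensions are unavailable,
a program can keep its own table of code names.
The sketch below is one portable approximation, not this library's
mechanism;
.I reg_code_name
and
.I reg_name_code
are hypothetical helpers,
and the codes that POSIX 1003.2 does not specify are included only
where the header defines them as macros.

```c
#include <assert.h>
#include <regex.h>
#include <stddef.h>
#include <string.h>

/* Table for this sketch; the extension codes are guarded by #ifdef. */
static const struct {
	int code;
	const char *name;
} reg_codes[] = {
	{ REG_NOMATCH,	"REG_NOMATCH" },
	{ REG_BADPAT,	"REG_BADPAT" },
	{ REG_ECOLLATE,	"REG_ECOLLATE" },
	{ REG_ECTYPE,	"REG_ECTYPE" },
	{ REG_EESCAPE,	"REG_EESCAPE" },
	{ REG_ESUBREG,	"REG_ESUBREG" },
	{ REG_EBRACK,	"REG_EBRACK" },
	{ REG_EPAREN,	"REG_EPAREN" },
	{ REG_EBRACE,	"REG_EBRACE" },
	{ REG_BADBR,	"REG_BADBR" },
	{ REG_ERANGE,	"REG_ERANGE" },
	{ REG_ESPACE,	"REG_ESPACE" },
	{ REG_BADRPT,	"REG_BADRPT" },
#ifdef REG_EMPTY
	{ REG_EMPTY,	"REG_EMPTY" },
#endif
#ifdef REG_ASSERT
	{ REG_ASSERT,	"REG_ASSERT" },
#endif
#ifdef REG_INVARG
	{ REG_INVARG,	"REG_INVARG" },
#endif
};

#define NCODES (sizeof reg_codes / sizeof reg_codes[0])

/* Hypothetical helper: printable name of a code, or "unknown". */
const char *
reg_code_name(int code)
{
	size_t i;

	for (i = 0; i < NCODES; i++)
		if (reg_codes[i].code == code)
			return reg_codes[i].name;
	return "unknown";
}

/* Hypothetical helper: code for a name, or 0 if not recognized. */
int
reg_name_code(const char *name)
{
	size_t i;

	assert(name != NULL);

	for (i = 0; i < NCODES; i++)
		if (strcmp(reg_codes[i].name, name) == 0)
			return reg_codes[i].code;
	return 0;
}
```

.PP
Returning 0 for an unrecognized name mirrors the REG_ATOI convention
described earlier;
the numeric values themselves differ between implementations and
should not be stored or exchanged.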
.SH HISTORY
Originally written by Henry Spencer at University of Toronto.
Altered for inclusion in the 4.4BSD distribution.
.SH BUGS
This is an alpha release with known defects.
Please report problems.
.PP
There is one known functionality bug.
The implementation of internationalization is incomplete:
the locale is always assumed to be the default one of 1003.2,
and only the collating elements etc. of that locale are available.
.PP
The back-reference code is subtle and doubts linger about its correctness
in complex cases.
.PP
.I Regexec
performance is poor.
This will improve with later releases.
.I Nmatch
exceeding 0 is expensive;
.I nmatch
exceeding 1 is worse.
.I Regexec
is largely insensitive to RE complexity \fIexcept\fR that back
references are massively expensive.
RE length does matter; in particular, there is a strong speed bonus
for keeping RE length under about 30 characters,
with most special characters counting roughly double.
.PP
.I Regcomp
implements bounded repetitions by macro expansion,
which is costly in time and space if counts are large
or bounded repetitions are nested.
An RE like, say,
`((((a{1,100}){1,100}){1,100}){1,100}){1,100}'
will (eventually) run almost any existing machine out of swap space.
.PP
There are suspected problems with response to obscure error conditions.
Notably,
certain kinds of internal overflow,
produced only by truly enormous REs or by multiply nested bounded repetitions,
are probably not handled well.
.PP
Due to a mistake in 1003.2, things like `a)b' are legal REs because `)' is
a special character only in the presence of a previous unmatched `('.
This can't be fixed until the spec is fixed.
.PP
The standard's definition of back references is vague.
For example, does
`a\e(\e(b\e)*\e2\e)*d' match `abbbd'?
Until the standard is clarified,
behavior in such cases should not be relied on.
.PP
The implementation of word-boundary matching is a bit of a kludge,
and bugs may lurk in combinations of word-boundary matching and anchoring.
 537 The implementation of word-boundary matching is a bit of a kludge,
 538 and bugs may lurk in combinations of word-boundary matching and anchoring.