lib/libcompat/regexp/regexp.3

   1 .\" Copyright (c) 1991, 1993
   2 .\"     The Regents of the University of California.  All rights reserved.
   3 .\"
   4 .\" Redistribution and use in source and binary forms, with or without
   5 .\" modification, are permitted provided that the following conditions
   6 .\" are met:
   7 .\" 1. Redistributions of source code must retain the above copyright
   8 .\"    notice, this list of conditions and the following disclaimer.
   9 .\" 2. Redistributions in binary form must reproduce the above copyright
  10 .\"    notice, this list of conditions and the following disclaimer in the
  11 .\"    documentation and/or other materials provided with the distribution.
  12 .\" 3. Neither the name of the University nor the names of its contributors
  13 .\"    may be used to endorse or promote products derived from this software
  14 .\"    without specific prior written permission.
  15 .\"
  16 .\" THIS SOFTWARE IS PROVIDED BY THE REGENTS AND CONTRIBUTORS ``AS IS'' AND
  17 .\" ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE
  18 .\" IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE
  19 .\" ARE DISCLAIMED.  IN NO EVENT SHALL THE REGENTS OR CONTRIBUTORS BE LIABLE
  20 .\" FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL
  21 .\" DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS
  22 .\" OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION)
  23 .\" HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT
  24 .\" LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY
  25 .\" OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF
  26 .\" SUCH DAMAGE.
  27 .\"
  28 .\"     from: @(#)regexp.3      8.1 (Berkeley) 6/4/93
  29 .\"     $NetBSD: regexp.3,v 1.15 2003/08/07 16:44:17 agc Exp $
  30 .\"
  31 .Dd June 4, 1993
  32 .Dt REGEXP 3
  33 .Os
  34 .Sh NAME
  35 .Nm regcomp ,
  36 .Nm regexec ,
  37 .Nm regsub ,
  38 .Nm regerror
  39 .Nd obsolete "'regexp'" regular expression handlers
  40 .Sh LIBRARY
  41 .Lb libcompat
  42 .Sh SYNOPSIS
  43 .In regexp.h
  44 .Ft regexp *
  45 .Fn regcomp "const char *exp"
  46 .Ft int
  47 .Fn regexec "const regexp *prog" "const char *string"
  48 .Ft void
  49 .Fn regsub "const regexp *prog" "const char *source" "char *dest"
  50 .Ft void
  51 .Fn regerror "const char *msg"
  52 .Sh DESCRIPTION
  53 .Bf -symbolic
  54 This interface is made obsolete by
  55 .Xr regex 3 .
  56 It is available from the compatibility library, libcompat.
  57 .Ef
  58 .Pp
  59 The
  60 .Fn regcomp ,
  61 .Fn regexec ,
  62 .Fn regsub ,
  63 and
  64 .Fn regerror
  65 functions implement
  66 .Xr egrep 1 Ns -style
  67 regular expressions and supporting facilities.
  68 .Pp
  69 The
  70 .Fn regcomp
  71 function
  72 compiles a regular expression into a structure of type
  73 .Em regexp ,
  74 and returns a pointer to it.
  75 The space has been allocated using
  76 .Xr malloc 3
  77 and may be released by
  78 .Xr free 3 .
  79 .Pp
  80 The
  81 .Fn regexec
  82 function
  83 matches a
  84 .Dv NUL Ns -terminated
  85 .Fa string
  86 against the compiled regular expression
  87 in
  88 .Fa prog .
  89 It returns 1 for success and 0 for failure, and adjusts the contents of
  90 .Fa prog Ns 's
  91 .Em startp
  92 and
  93 .Em endp
  94 (see below) accordingly.
  95 .Pp
  96 The members of a
  97 .Em regexp
  98 structure include at least the following (not necessarily in order):
  99 .Bd -literal -offset indent
 100 char *startp[NSUBEXP];
 101 char *endp[NSUBEXP];
 102 .Ed
 103 .Pp
 104 where
 105 .Dv NSUBEXP
 106 is defined (as 10) in the header file.
 107 Once a successful
 108 .Fn regexec
 109 has been done using the
 110 .Fn regexp ,
 111 each
 112 .Em startp Ns - Em endp
 113 pair describes one substring
 114 within the
 115 .Fa string ,
 116 with the
 117 .Em startp
 118 pointing to the first character of the substring and
 119 the
 120 .Em endp
 121 pointing to the first character following the substring.
 122 The 0th substring is the substring of
 123 .Fa string
 124 that matched the whole
 125 regular expression.
 126 The others are those substrings that matched parenthesized expressions
 127 within the regular expression, with parenthesized expressions numbered
 128 in left-to-right order of their opening parentheses.
 129 .Pp
 130 The
 131 .Fn regsub
 132 function
 133 copies
 134 .Fa source
 135 to
 136 .Fa dest ,
 137 making substitutions according to the
 138 most recent
 139 .Fn regexec
 140 performed using
 141 .Fa prog .
 142 Each instance of `\*[Am]' in
 143 .Fa source
 144 is replaced by the substring
 145 indicated by
 146 .Em startp Ns Bq
 147 and
 148 .Em endp Ns Bq .
 149 Each instance of
 150 .Sq \e Ns Em n ,
 151 where
 152 .Em n
 153 is a digit, is replaced by
 154 the substring indicated by
 155 .Em startp Ns Bq Em n
 156 and
 157 .Em endp Ns Bq Em n .
 158 To get a literal `\*[Am]' or
 159 .Sq \e Ns Em n
 160 into
 161 .Fa dest ,
 162 prefix it with `\e';
 163 to get a literal `\e' preceding `\*[Am]' or
 164 .Sq \e Ns Em n ,
 165 prefix it with
 166 another `\e'.
 167 .Pp
 168 The
 169 .Fn regerror
 170 function
 171 is called whenever an error is detected in
 172 .Fn regcomp ,
 173 .Fn regexec ,
 174 or
 175 .Fn regsub .
 176 The default
 177 .Fn regerror
 178 writes the string
 179 .Fa msg ,
 180 with a suitable indicator of origin,
 181 on the standard
 182 error output
 183 and invokes
 184 .Xr exit 3 .
 185 The
 186 .Fn regerror
 187 function
 188 can be replaced by the user if other actions are desirable.
 189 .Sh REGULAR EXPRESSION SYNTAX
 190 A regular expression is zero or more
 191 .Em branches ,
 192 separated by `|'.
 193 It matches anything that matches one of the branches.
 194 .Pp
 195 A branch is zero or more
 196 .Em pieces ,
 197 concatenated.
 198 It matches a match for the first, followed by a match for the second, etc.
 199 .Pp
 200 A piece is an
 201 .Em atom
 202 possibly followed by `*', `+', or `?'.
 203 An atom followed by `*' matches a sequence of 0 or more matches of the atom.
 204 An atom followed by `+' matches a sequence of 1 or more matches of the atom.
 205 An atom followed by `?' matches a match of the atom, or the null string.
 206 .Pp
 207 An atom is a regular expression in parentheses (matching a match for the
 208 regular expression), a
 209 .Em range
 210 (see below), `.'
 211 (matching any single character), `^' (matching the null string at the
 212 beginning of the input string), `$' (matching the null string at the
 213 end of the input string), a `\e' followed by a single character (matching
 214 that character), or a single character with no other significance
 215 (matching that character).
 216 .Pp
 217 A
 218 .Em range
 219 is a sequence of characters enclosed in `[]'.
 220 It normally matches any single character from the sequence.
 221 If the sequence begins with `^',
 222 it matches any single character
 223 .Em not
 224 from the rest of the sequence.
 225 If two characters in the sequence are separated by `\-', this is shorthand
 226 for the full list of
 227 .Tn ASCII
 228 characters between them
 229 (e.g. `[0-9]' matches any decimal digit).
 230 To include a literal `]' in the sequence, make it the first character
 231 (following a possible `^').
 232 To include a literal `\-', make it the first or last character.
 233 .Sh AMBIGUITY
 234 If a regular expression could match two different parts of the input string,
 235 it will match the one which begins earliest.
 236 If both begin in the same place but match different lengths, or match
 237 the same length in different ways, life gets messier, as follows.
 238 .Pp
 239 In general, the possibilities in a list of branches are considered in
 240 left-to-right order, the possibilities for `*', `+', and `?' are
 241 considered longest-first, nested constructs are considered from the
 242 outermost in, and concatenated constructs are considered leftmost-first.
 243 The match that will be chosen is the one that uses the earliest
 244 possibility in the first choice that has to be made.
 245 If there is more than one choice, the next will be made in the same manner
 246 (earliest possibility) subject to the decision on the first choice.
 247 And so forth.
 248 .Pp
 249 For example,
 250 .Sq Li (ab|a)b*c
 251 could match
 252 `abc' in one of two ways.
 253 The first choice is between `ab' and `a'; since `ab' is earlier, and does
 254 lead to a successful overall match, it is chosen.
 255 Since the `b' is already spoken for,
 256 the `b*' must match its last possibility\(emthe empty string\(emsince
 257 it must respect the earlier choice.
 258 .Pp
 259 In the particular case where no `|'s are present and there is only one
 260 `*', `+', or `?', the net effect is that the longest possible
 261 match will be chosen.
 262 So
 263 .Sq Li ab* ,
 264 presented with `xabbbby', will match `abbbb'.
 265 Note that if
 266 .Sq Li ab* ,
 267 is tried against `xabyabbbz', it
 268 will match `ab' just after `x', due to the begins-earliest rule.
 269 (In effect, the decision on where to start the match is the first choice
 270 to be made, hence subsequent choices must respect it even if this leads them
 271 to less-preferred alternatives.)
 272 .Sh RETURN VALUES
 273 The
 274 .Fn regcomp
 275 function
 276 returns
 277 .Dv NULL
 278 for a failure
 279 .Pf ( Fn regerror
 280 permitting),
 281 where failures are syntax errors, exceeding implementation limits,
 282 or applying `+' or `*' to a possibly-null operand.
 283 .Sh SEE ALSO
 284 .Xr ed 1 ,
 285 .Xr egrep 1 ,
 286 .Xr ex 1 ,
 287 .Xr expr 1 ,
 288 .Xr fgrep 1 ,
 289 .Xr grep 1 ,
 290 .Xr regex 3
 291 .Sh HISTORY
 292 Both code and manual page for
 293 .Fn regcomp ,
 294 .Fn regexec ,
 295 .Fn regsub ,
 296 and
 297 .Fn regerror
 298 were written at the University of Toronto
 299 and appeared in
 300 .Bx 4.3 tahoe .
 301 They are intended to be compatible with the Bell V8
 302 .Xr regexp 3 ,
 303 but are not derived from Bell code.
 304 .Sh BUGS
 305 Empty branches and empty regular expressions are not portable to V8.
 306 .Pp
 307 The restriction against
 308 applying `*' or `+' to a possibly-null operand is an artifact of the
 309 simplistic implementation.
 310 .Pp
 311 Does not support
 312 .Xr egrep 1 Ns 's
 313 newline-separated branches;
 314 neither does the V8
 315 .Xr regexp 3 ,
 316 though.
 317 .Pp
 318 Due to emphasis on
 319 compactness and simplicity,
 320 it's not strikingly fast.
 321 It does give special attention to handling simple cases quickly.