doc/6.1.regular-expressions

regular expressions
*******************

  Normally, regular expressions have no place in a programming
language like C.  And we don't have to use them!  Having said
that, what lwc provides is a regexp-to-C code expander: i.e.
C code is generated to implement the DFAs that make up a regexp.

  This is interesting in two cases:  1) to replace calls to
strcmp, strcasecmp, strchr, strstr, strsep and the infamous
sscanf family of functions.  The generated C code is probably
more efficient and definitely more powerful.  2) for an ultra
fast regexp implementation.  Still, perl is better for most
tasks involving text processing.

  For an introduction to regular expressions see the man
page perlretut.


1. The Code
***********

A regexp definition has the syntax:

        [static|inline] RegExp <name> (<recipe-string> [, OPTIONS]);

For example, put this line in a file:

        // in the recipe string you don't have to
        // escape backslashes!!
        RegExp email ("(\w[\w.-]*@[\w.-]*\w)");

And compile it with lwc.  The output may seem scary:
"14 functions for this little regexp!!".  But wait.  The trick
is that we let the C compiler do the hard work for us.  Any decent
C compiler should be able to recursively inline those 14 static
inline functions into ONE function.  [Use 'gcc -S -O3' to see
what's inlined]

Generally, you can imagine that testing a string 's' against the
regexp "^foo$" will, after inlining, boil down to code equivalent to:

        if (s [0] == 'f' && s [1] == 'o' && s [2] == 'o' && s [3] == 0)

Similarly, the regexp "^([cr]at|dog)" will be the same as:

        if (((s [0] == 'c' || s [0] == 'r') && s [1] == 'a' && s [2] == 't')
        || (s [0] == 'd' && s [1] == 'o' && s [2] == 'g'))

lwc can use strncmp/strncasecmp for fixed strings.  So in the current
version, unless the option NOSTRFUNC is specified, lwc will convert the
regular expression "^Hello[12]world" to

        if (!strncmp (s, "Hello", 5) && (s [5] == '1' || s [5] == '2') &&
        !strncmp (s + 6, "world", 5))

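  To make the three expansions above concrete, here is a plain C
sketch (hand-written, not actual lwc output) that wraps each condition
in a function.  The function names are made up for the illustration;
the code lwc really generates will look different.

```c
#include <string.h>

/* Hand-written equivalent of "^foo$": the whole string must be "foo". */
static int foo_exact (const char *s)
{
        return s [0] == 'f' && s [1] == 'o' && s [2] == 'o' && s [3] == 0;
}

/* Hand-written equivalent of "^([cr]at|dog)": a test on the prefix only. */
static int pet_prefix (const char *s)
{
        return ((s [0] == 'c' || s [0] == 'r') && s [1] == 'a' && s [2] == 't')
            || (s [0] == 'd' && s [1] == 'o' && s [2] == 'g');
}

/* Hand-written equivalent of "^Hello[12]world", with strncmp
   for the two fixed parts. */
static int hello_prefix (const char *s)
{
        return !strncmp (s, "Hello", 5) && (s [5] == '1' || s [5] == '2')
            && !strncmp (s + 6, "world", 5);
}
```

Because && and || short-circuit, none of these functions reads past
the terminating 0 of a string that is too short to match.
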
In the generated code there's a function called email_match() and a
string array called email_recipe.


2. Using regexps from lwc
*************************

Here's a sample email grepping program.

        #include <stdio.h>
        #include <string.h>

        RegExp email ("\w[\w.-]*@[\w.-]*\w");

        int main ()
        {
                char tmp [1024];

                while (fgets (tmp, sizeof tmp, stdin)) {
                        // strip the trailing newline, if there is one
                        tmp [strcspn (tmp, "\n")] = 0;
                        if (email_match (tmp))
                                printf ("MATCH: %s\n", tmp);
                }
        }

  Pretty small.  Apart from regexp definitions, it is possible to
have just 'regexp declarations'.  This is the case where the recipe
string is not provided, and lwc will just generate the prototype
of the name_match() function.  The actual code may be in another
object file.  The declaration qualifiers static, extern and inline
can be applied before regexp declarations and definitions:
static and inline apply to the _match() function, static and extern
to the _recipe string.  For example:

        extern RegExp URL;

        ... if (URL_match (s)) ...

  If you saw the code generated in section 1, the email_match function
has two arguments.  The second argument has a default value of '0' if
not specified; otherwise it is used to extract matches.


3. Practical Extraction
***********************

  In regexps, extraction happens for things that are enclosed in
parentheses.  The matches are placed in charp_len structures, where
'p' points to the start of the match in the string, and 'i' has
the length of the match.  Here is the e-mail grepping program
extracting and printing the email addresses only.

        #include <stdio.h>
        #include <string.h>

        RegExp email ("(\w[\w.-]*@[\w.-]*\w)");

        int main ()
        {
                char tmp [1024];
                charp_len e [1];

                while (fgets (tmp, sizeof tmp, stdin)) {
                        // strip the trailing newline, if there is one
                        tmp [strcspn (tmp, "\n")] = 0;
                        if (email_match (tmp, e)) {
                                char address [100];
                                int n = e [0].i;
                                // never copy more than address can hold
                                if (n > (int) sizeof address - 1)
                                        n = sizeof address - 1;
                                memcpy (address, e [0].p, n);
                                address [n] = 0;
                                printf ("address: %s\n", address);
                        }
                }
        }

Some useful things to know:

- the lwc regexps do NOT extract matches inside repetitions.
  Such extraction is rarely needed and in most cases unwanted.
  For example in the regexp
                ((foo)*)(\d\d)
  there are only 2 matches extracted: the ENTIRE foo sequence and the
  two digits.  The parentheses around (foo) are used only for grouping.

- in cases where matches are OR'd, the 'p' of every extract that
  did not match is set to '0'.
                (cat)|(dog)|(bird)
  There are 3 extracts but only one will be set.  The other two
  will point to NULL.

- it is possible to turn off extracting from certain parentheses
  with the embedded modifier (?:), as in perl.


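  The extraction rules above can be sketched in plain C.  The
charp_len layout below is an assumption based only on the description
in this section ('p' is a pointer, 'i' is a length); lwc's own
definition may differ.  copy_extract and first_set are hypothetical
helpers written for this illustration, they are not part of lwc.

```c
#include <stddef.h>
#include <string.h>

/* Assumed layout, following the description in the text. */
typedef struct {
        const char *p;  /* start of the match, or NULL if not set */
        int i;          /* length of the match */
} charp_len;

/* Copy an extract into dst (of size dstsz), truncating if needed.
   Returns the number of characters copied, or -1 if the extract
   is not set or dst has no room at all. */
static int copy_extract (char *dst, size_t dstsz, const charp_len *e)
{
        size_t n;

        if (!e->p || dstsz == 0)
                return -1;
        n = (size_t) e->i < dstsz - 1 ? (size_t) e->i : dstsz - 1;
        memcpy (dst, e->p, n);
        dst [n] = 0;
        return (int) n;
}

/* For OR'd groups like (cat)|(dog)|(bird): index of the first
   extract that was set, or -1 if none was. */
static int first_set (const charp_len *e, int count)
{
        int k;

        for (k = 0; k < count; k++)
                if (e [k].p)
                        return k;
        return -1;
}
```

With the (cat)|(dog)|(bird) regexp, first_set (e, 3) would tell which
animal was found, and copy_extract would give its text as a
0-terminated string.
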
4. Regular Expression Details & Misc notes
******************************************

The differences from perl's regexps are:

[1] Backreferences are *not* supported.  Although you can do
 some things with backreferences to impress your friends,
 backreferences cost: we would have to store early matches
 even if we are not sure we will have an entire match.  So they
 are not supported (for now at least).  Generally, anything
 which has to do with previous states of the regexp state
 machine is not supported.

[2] Anchors and lookahead/lookbehind are not implemented.
 It is not difficult to implement lookahead, but is it really
 practical?

[3] The dot '.' matches newlines and there's no option to
 turn it off.  Whitespace \s matches newlines too.

[4] POSIX class specifiers like [:digit:], [:alpha:], etc. are
 not implemented at all.

[5] The case where the first character may or may not be the
 start of the line is not implemented.  For example: (^a|bc).

---
User class abbreviations:

  It's possible to define custom abbreviations for character
classes, like \w, \s and \d.  The syntax uses 'abbrev' in the
place where the regexp name is expected.

        RegExp abbrev ('e', "[1-4a-d+-]");
        RegExp abbrev ('b', "[^\w\e]");

Custom user classes only work for lowercase letters; the
uppercase letter stands for the negated set:  \E == [^\e]
The characters \r, \n, \f, \a, \t and \x are reserved because
they already have a meaning when escaped.
---
Global replacements:

 It is always possible to recheck the string with the _match()
function *after* the end of the previous match.  In pseudo-code:

        SET s = start-of-string
        WHILE regexp_match (s, e)
                PRINT (e)
                s = e.p + e.i

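 The loop above can be written out in plain C.  In this sketch,
digits_match is a hand-written stand-in for an lwc-generated _match()
function (it behaves roughly like "(\d+)"), and the charp_len layout
is assumed from the description in section 3.  Both are illustrations
only, not lwc output.

```c
#include <ctype.h>

/* Assumed layout: 'p' is the start of the match, 'i' its length. */
typedef struct {
        const char *p;
        int i;
} charp_len;

/* Stand-in for a generated _match(): finds the first run of digits
   at or after s.  Returns 1 on a match, 0 otherwise. */
static int digits_match (const char *s, charp_len *e)
{
        while (*s && !isdigit ((unsigned char) *s))
                s++;
        if (!*s)
                return 0;
        e [0].p = s;
        while (isdigit ((unsigned char) *s))
                s++;
        e [0].i = (int) (s - e [0].p);
        return 1;
}

/* Count every match by restarting right after the previous one. */
static int count_matches (const char *s)
{
        charp_len e [1];
        int n = 0;

        while (digits_match (s, e)) {
                n++;
                s = e [0].p + e [0].i;
        }
        return n;
}
```

Note that a loop like this only terminates if every match has a
length of at least 1; a regexp that can match the empty string would
need an extra step to move 's' forward.
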
---
Options:

 NOCASE: make matches case insensitive.

 NOEX: parentheses are used only for grouping; nothing is extracted.

 PACKED: pack the ctbl array.  Usually it goes down to 1/4th of
 its size, but this is not as fast.  Other things that minimize
 the amount of code are enabled too, at the cost of speed.

 NOCTBL: use switch-case statements instead of ctbl for
 character classes.

 NOSTRFUNC: do not use strncmp/strncasecmp, or any other
 functions at all.  Implement all the comparisons with ==

Using the options:  RegExp foo ("...", NOCASE PACKED NOEX);

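 The difference between a table and a switch for character classes
can be illustrated with a generic C sketch.  The real layout of lwc's
ctbl array is not described in this document, so everything below
(cls_tbl, cls_tbl_init, in_class_tbl, in_class_switch) is hypothetical
and only shows the general trade-off, here for the class [1-4a-d+-].

```c
/* Table-driven test: one byte per possible character value.
   This table is an illustration, NOT lwc's actual ctbl. */
static unsigned char cls_tbl [256];

static void cls_tbl_init (void)
{
        const char *members = "1234abcd+-";

        while (*members)
                cls_tbl [(unsigned char) *members++] = 1;
}

static int in_class_tbl (int c)
{
        return cls_tbl [(unsigned char) c];
}

/* The same class tested with a switch: no table in memory,
   the compiler decides how to implement the comparisons. */
static int in_class_switch (int c)
{
        switch (c) {
        case '1': case '2': case '3': case '4':
        case 'a': case 'b': case 'c': case 'd':
        case '+': case '-':
                return 1;
        default:
                return 0;
        }
}
```

A table costs memory but answers with a single lookup; a switch
needs no data at all.  Which one is faster depends on the class and
on the compiler.
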
---
Extensions:

 Because we want to be able to use regexps instead of strcmp
and friends, we want the code to be very small and efficient.
For that we have a new embedded modifier, the ?/ which says
that the enclosed regexp is to be taken for granted: it will
definitely be there.  For example:

        RegExp NameAge ("^(?/NAME: )([a-zA-Z]+) (\d{1,3})");

In this example, the regexp supposes that the strings will
absolutely definitely start with "NAME: ", and so it will
just skip the first 6 characters without checking.  Although
no checks are done, the enclosed regexp matters for the
optimizations and it must be of fixed size.

        RegExp zoo ("^(?/(?:Entry|Field):)(\w+)");

Both alternatives here are 5 characters long, so the skipped part
is always 6 characters.

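 As a rough idea of what the ?/ modifier buys, here is a hand-written
C approximation of the NameAge regexp.  It is a sketch, not lwc
output: nameage_match is a made-up name, the charp_len layout is
assumed from section 3, and isalpha() only equals [a-zA-Z] in the
default "C" locale.

```c
#include <ctype.h>

/* Assumed layout: 'p' is the start of the match, 'i' its length. */
typedef struct {
        const char *p;
        int i;
} charp_len;

/* Approximation of "^(?/NAME: )([a-zA-Z]+) (\d{1,3})".
   The caller promises that s starts with "NAME: ", so those
   6 characters are skipped without being compared. */
static int nameage_match (const char *s, charp_len *e)
{
        const char *q = s + 6;          /* trusted prefix, not checked */

        e [0].p = q;                    /* ([a-zA-Z]+) */
        while (isalpha ((unsigned char) *q))
                q++;
        e [0].i = (int) (q - e [0].p);
        if (e [0].i == 0 || *q != ' ')
                return 0;
        q++;
        e [1].p = q;                    /* (\d{1,3}) */
        while (q - e [1].p < 3 && isdigit ((unsigned char) *q))
                q++;
        e [1].i = (int) (q - e [1].p);
        return e [1].i > 0;
}
```

The price of skipping the check is that the promise must hold: if the
string is shorter than 6 characters the function reads past its end.
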
---
Interesting code is generated for:

        RegExp cdate ("((?:Mon|Tue|Wed|Thu|Fri|Sat|Sun)\s+"
                      "(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+"
                      "\d{1,2}\s+\d\d:\d\d:\d\d\s+\d\d\d\d)", NOCTBL);

---
At this time, regexps are not reentrant.

---
Benchmarks vs. egrep:

egrep is cheating at i/o.  Benchmarks with fgets() are hopeless!


5. The perl operator -- a tribute to perl
*****************************************

  This is simply a shortcut for the above.  The perl regexp
operator is =~ and it's activated only if it is followed
by a string literal.

        if (datestr =~ "Jan [12] ")
                printf ("Happy New Year!\n");

is the same as:

        static inline RegExp some_unique_name ("Jan [12] ", NOEX PACKED);
        if (some_unique_name_match (datestr))
                printf ("Happy New Year!\n");

These regexps are for fast strcmp'ing.  Their code is packed to
the minimum and they don't extract anything.
=~ has the same priority as the other relational operators >= <= < >

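  For comparison, this is roughly what the same test looks like when
written by hand with the C library.  The sketch assumes that the
regexp may match anywhere in the string (it has no ^), and
is_new_year is a made-up name for the illustration.

```c
#include <string.h>

/* Hand-written equivalent of  datestr =~ "Jan [12] "
   assuming the match may start anywhere in the string. */
static int is_new_year (const char *datestr)
{
        const char *s = datestr;

        while ((s = strstr (s, "Jan ")) != NULL) {
                if ((s [4] == '1' || s [4] == '2') && s [5] == ' ')
                        return 1;
                s++;    /* retry after this occurrence */
        }
        return 0;
}
```

The =~ form says the same thing in one line and leaves the character
by character comparisons to the generated code.
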
----
The regexp facilities of lwc are superfluous.  A library would suffice.