Normally, regular expressions have no place in a programming
language like C. And we don't have to use them! Having said
that, what lwc provides is a regexp-to-C code expander: i.e.
C code is generated to implement the DFAs that make up a regexp.

This is interesting in two cases: 1) to replace calls to
strcmp, strcasecmp, strchr, strstr, strsep and the infamous
sscanf family of functions. The generated C code is probably
more efficient and definitely more powerful. 2) for an ultra
fast regexp implementation. Still, perl is better for most
tasks involving text processing.

For an introduction to regular expressions see the man
A regexp definition has the syntax:

    [static|inline] RegExp <name> (<recipe-string> [, OPTIONS]);

For example, in a file put the line:

    // in the recipe string you don't have to
    // escape backslashes!!
    RegExp email ("(\w[\w.-]*@[\w.-]*\w)");

And compile it with lwc. The output may seem scary:
"14 functions for this little regexp!!". But wait. The trick
is that we let the C compiler do the hard work for us. Any decent
C compiler should be able to recursively inline those 14 static
inline functions into ONE function. [Use 'gcc -S -O3' to see what's inlined]

Generally, you can imagine that testing a string 's' against the
regexp "^foo$" will, after inlining, boil down to code equivalent to:

    if (s [0] == 'f' && s [1] == 'o' && s [2] == 'o' && s [3] == 0)

Similarly, the regexp "^([cr]at|dog)" will be the same as:

    if (((s [0] == 'c' || s [0] == 'r') && s [1] == 'a' && s [2] == 't')
        || (s [0] == 'd' && s [1] == 'o' && s [2] == 'g'))

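To make this concrete, here is a minimal hand-written C sketch of
the two expansions above. It is NOT the code lwc generates; the
function names match_foo and match_pet are made up for illustration.
Note that the short-circuit && stops at the first mismatch, so the
code never reads past the terminating 0 of a short string.

```c
#include <stdbool.h>

/* Hypothetical hand-written equivalent of the regexp "^foo$". */
static inline bool match_foo(const char *s)
{
    return s[0] == 'f' && s[1] == 'o' && s[2] == 'o' && s[3] == 0;
}

/* Hypothetical equivalent of "^([cr]at|dog)": anchored at the start
   only, so anything may follow the three matched characters. */
static inline bool match_pet(const char *s)
{
    return ((s[0] == 'c' || s[0] == 'r') && s[1] == 'a' && s[2] == 't')
        || (s[0] == 'd' && s[1] == 'o' && s[2] == 'g');
}
```

With these, match_foo ("foo") is true and match_foo ("fool") is not,
while match_pet accepts "rat", "cats" and "dogma" but not "bat".
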
lwc can use strncmp/strncasecmp for fixed strings. So in the current
version, unless the option NOSTRFUNC is specified, lwc will convert the
regular expression "^Hello[12]world" to

    if (!strncmp (s, "Hello", 5) && (s [5] == '1' || s [5] == '2') &&
        !strncmp (s + 6, "world", 5))

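As a rough illustration of the difference NOSTRFUNC makes, the
sketch below writes the same test both ways by hand. The names
hello_strfunc and hello_nostrfunc are hypothetical, and the second
function only approximates what "all comparisons with ==" might
look like; it is not taken from lwc's output.

```c
#include <stdbool.h>
#include <string.h>

/* Default style: fixed substrings are compared with strncmp. */
static bool hello_strfunc(const char *s)
{
    return !strncmp(s, "Hello", 5) && (s[5] == '1' || s[5] == '2')
        && !strncmp(s + 6, "world", 5);
}

/* NOSTRFUNC style (assumed shape): every character compared with ==. */
static bool hello_nostrfunc(const char *s)
{
    return s[0] == 'H' && s[1] == 'e' && s[2] == 'l' && s[3] == 'l'
        && s[4] == 'o' && (s[5] == '1' || s[5] == '2')
        && s[6] == 'w' && s[7] == 'o' && s[8] == 'r' && s[9] == 'l'
        && s[10] == 'd';
}
```

Both functions accept and reject the same strings; which one is
faster depends on the compiler and the C library.
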
In the generated code there's a function called email_match() and a
string array called email_recipe.


2. Using regexps from lwc
*************************

Here's a sample email grepping program.

    RegExp email ("\w[\w.-]*@[\w.-]*\w");

        while (fgets (tmp, 1024, stdin)) {
            tmp [strlen (tmp) - 1] = 0;
            if (email_match (tmp))
                printf ("MATCH: %s\n", tmp);

Pretty small. Besides regexp definitions, it is possible to
have just 'regexp declarations'. This is the case where the recipe
string is not provided, and lwc will just generate the prototype
of the name_match() function. The actual code may be in another
object file. The declaration qualifiers static, extern and inline
can be applied before regexp declarations and definitions.
static and inline apply to the _match() function; static and extern
apply to the _recipe string. For example:

        ... if (URL_match (s)) ...

If you saw the code from section 1, the email_match function has two
arguments. The second argument defaults to '0' if not specified;
otherwise it is used to extract matches.


3. Practical Extraction
***********************

In regexps, extraction happens for things that are enclosed in
parentheses. The matches are placed in charp_len structures, where
'p' points to the start of the match in the string, and 'i' holds
the length of the match. Here is the e-mail grepping program
extracting and printing only the email addresses.

    RegExp email ("(\w[\w.-]*@[\w.-]*\w)");

        while (fgets (tmp, 1024, stdin)) {
            tmp [strlen (tmp) - 1] = 0;
            if (email_match (tmp, e)) {

                strncpy (address, e [0].p, e [0].i);
                address [e [0].i] = 0;
                printf ("address: %s\n", address);

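Since an extract is just a pointer and a length, copying it out is
where buffer overflows can creep in. The plain C sketch below shows
one bounded way to do it. The struct is a stand-in: only the field
names 'p' and 'i' come from the text above, while the type name
'struct extract' and the helper copy_extract are assumptions made
for this example.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for lwc's charp_len (assumed layout). */
struct extract {
    const char *p;  /* start of the match inside the subject string */
    int i;          /* length of the match */
};

/* Copy one extract into dst (capacity cap), always 0-terminating and
   truncating if needed. Returns the number of characters copied, or
   -1 if the extract is unset or the arguments are unusable. */
static int copy_extract(char *dst, size_t cap, const struct extract *e)
{
    if (e->p == NULL || e->i < 0 || cap == 0)
        return -1;
    size_t n = (size_t)e->i < cap - 1 ? (size_t)e->i : cap - 1;
    memcpy(dst, e->p, n);
    dst[n] = 0;
    return (int)n;
}
```

In the loop above this would replace the strncpy pair, e.g.
copy_extract (address, sizeof address, &e [0]).
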
Some useful things to know:

- the lwc regexps do NOT extract matches inside repetition operators.
  This is rare and in most cases unwanted. For example in the regexp

  there are only 2 matches extracted: the ENTIRE foo sequence and the
  two digits. The parentheses around (foo) are used only for grouping.

- in cases where matches are OR'd, the other extract's 'p' is

  There are 3 extracts but only one will be set. The other two
  will point to NULL.

- it is possible to turn off extracting from certain parentheses
  with the embedded modifier (?:), as in perl.

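The OR'd case can be sketched in plain C as follows. This is a
hand-written stand-in, not lwc output: the pattern
"^(?:(cat)|(dog)|(bird))", the struct name and both function names
are invented for the example, and resetting every slot to NULL
before matching is an assumption about how unset extracts look.

```c
#include <stddef.h>
#include <string.h>

/* Stand-in for lwc's charp_len (assumed layout). */
struct extract {
    const char *p;
    int i;
};

/* Hypothetical matcher for "^(?:(cat)|(dog)|(bird))":
   one extract slot per alternative, at most one of them set. */
static int match_animal(const char *s, struct extract e[3])
{
    static const char *const alt[3] = { "cat", "dog", "bird" };
    for (int k = 0; k < 3; k++) {
        e[k].p = NULL;
        e[k].i = 0;
    }
    for (int k = 0; k < 3; k++) {
        size_t n = strlen(alt[k]);
        if (strncmp(s, alt[k], n) == 0) {
            e[k].p = s;
            e[k].i = (int)n;
            return 1;
        }
    }
    return 0;
}

/* Index of the alternative that matched, or -1 if none is set. */
static int which_alternative(const struct extract e[3])
{
    for (int k = 0; k < 3; k++)
        if (e[k].p != NULL)
            return k;
    return -1;
}
```

The caller checks 'p' against NULL before touching an extract,
because the slots of the alternatives that did not match carry
no usable pointer.
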

4. Regular Expression Details & Misc notes
******************************************

The differences from perl's regexps are:

[1] Backreferences are *not* supported. Generally, although
    with backreferences you can do some things to impress your
    friends, backreferences are costly. They mean that we have to
    store early matches even if we are not sure we will have an
    entire match. So they are not supported (for now at least).
    Generally, anything which has to do with previous states of
    the regexp state machine is not supported.

[2] Anchors and lookahead/lookbacks are not implemented.
    It is not difficult to implement lookahead, but is it really

[3] The dot '.' matches newlines and there's no option to
    turn it off. Whitespace \s matches newlines too.

[4] POSIX class specifiers like [:digit:], [:alpha:], etc. are
    not implemented at all.

[5] The case where the first character may or may not be the
    start of the line is not implemented. For example: (^a|bc).


User class abbreviations:

It's possible to define custom abbreviations for character
classes like \w, \s and \d. The syntax uses 'abbrev' in the
place where the regexp name is expected.

    RegExp abbrev ('e', "[1-4a-d+-]");
    RegExp abbrev ('b', "[^\w\e]");

Custom user classes only work for lowercase letters, and the
uppercase letter stands for the negated set: \E == [^\e].
The characters \r, \n, \f, \a, \t and \x are reserved because
they already have a meaning when escaped.

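In plain C terms, a user class is just a predicate on one
character and its uppercase twin is the negation. The sketch
below spells out the \e class defined above by hand; the function
names class_e and class_E are made up, and how lwc really tests
membership is not shown here.

```c
#include <stdbool.h>

/* \e from the example above: the set [1-4a-d+-]. */
static bool class_e(unsigned char c)
{
    return (c >= '1' && c <= '4')
        || (c >= 'a' && c <= 'd')
        || c == '+' || c == '-';
}

/* \E is the negated set [^\e]. */
static bool class_E(unsigned char c)
{
    return !class_e(c);
}
```

So '3', 'c', '+' and '-' belong to \e, while '5' and 'e' belong
to \E.
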

It is always possible to recheck the string with the _match()
function *after* the point of the previous match. A pseudo-code

    SET s = start-of-string
    WHILE regexp_match (s, e)

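A runnable C sketch of this rescanning idea follows. The matcher
number_match is a hand-written stand-in for what a searching
regexp "(\d+)" might do, and the struct is an assumed equivalent
of charp_len; neither is lwc output. The key step is restarting
at e [0].p + e [0].i, i.e. right after the previous match.

```c
#include <ctype.h>
#include <stddef.h>

/* Stand-in for lwc's charp_len (assumed layout). */
struct extract {
    const char *p;
    int i;
};

/* Hypothetical matcher for "(\d+)": finds the first run of digits
   at or after s and stores it in e[0]. Returns 1 on a match. */
static int number_match(const char *s, struct extract *e)
{
    while (*s && !isdigit((unsigned char)*s))
        s++;
    if (!*s)
        return 0;
    e[0].p = s;
    e[0].i = 0;
    while (isdigit((unsigned char)s[e[0].i]))
        e[0].i++;
    return 1;
}

/* Count every match by restarting just after the previous one. */
static int count_numbers(const char *s)
{
    struct extract e[1];
    int n = 0;
    while (number_match(s, e)) {
        n++;
        s = e[0].p + e[0].i;
    }
    return n;
}
```

The loop terminates because every match here is at least one
character long; a regexp that can match the empty string would
need an extra step forward to avoid looping forever.
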

NOCASE: to make case insensitive matches

NOEX: parentheses are used only for grouping

PACKED: pack the ctbl array. Usually its size goes down to 1/4th
    but this is not as fast. Other things that minimize the amount
    of code are enabled, at the cost of speed.

NOCTBL: use switch-case statements instead of ctbl for

NOSTRFUNC: do not use strncmp/strncasecmp. Do not use any other
    functions at all. Implement all the comparisons with ==

Using the options: RegExp foo ("...", NOCASE PACKED NOEX);

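The trade-off behind NOCTBL can be illustrated with two generic
ways of testing class membership in C. This sketch ASSUMES that
ctbl is a per-character lookup table; the text above does not
spell that out, and the table layout, the CLASS_NUM bit and all
function names below are invented for the example.

```c
#include <stdbool.h>

enum { CLASS_NUM = 1 };          /* hypothetical bit for [0-9.-] */

static unsigned char ctbl[256];  /* one entry per character value */

/* Fill the table once before use. */
static void init_ctbl(void)
{
    for (int c = '0'; c <= '9'; c++)
        ctbl[c] |= CLASS_NUM;
    ctbl['.'] |= CLASS_NUM;
    ctbl['-'] |= CLASS_NUM;
}

/* Table style: one memory load and one mask per character. */
static bool in_class_table(unsigned char c)
{
    return (ctbl[c] & CLASS_NUM) != 0;
}

/* Switch style: no table in memory, the compiler picks the code. */
static bool in_class_switch(unsigned char c)
{
    switch (c) {
    case '0': case '1': case '2': case '3': case '4':
    case '5': case '6': case '7': case '8': case '9':
    case '.': case '-':
        return true;
    default:
        return false;
    }
}
```

Both give the same answer for every character; the table costs
256 bytes of data while the switch costs code.
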

Because we want to be able to use regexps instead of strcmp
and friends, we want the code to be very small and efficient.
For that we have a new embedded modifier, ?/, which says
that the enclosed regexp is to be taken for granted: it will
definitely be there. For example:

    RegExp NameAge ("^(?/NAME: )([a-zA-Z]+) (\d{1,3})");

In this example, the regexp assumes that the strings will
definitely start with "NAME: ", and so it will
just skip the first 6 characters without checking. Although
no checks are done, the enclosed regexp matters for the
optimizations and it must be of fixed size.

    RegExp zoo ("^(?/(?:Entry|Field):)(\w+)");

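What "skip without checking" means can be sketched by hand for the
NameAge example. The code below is an approximation written for
this text, not what lwc generates; the struct and the function
name name_age_match are assumptions. The caller must guarantee
that the string is at least 6 characters long, because the prefix
is never looked at.

```c
#include <ctype.h>

/* Stand-in for lwc's charp_len (assumed layout). */
struct extract {
    const char *p;
    int i;
};

/* Hand-written sketch of "^(?/NAME: )([a-zA-Z]+) (\d{1,3})".
   Relies on the default "C" locale, where isalpha is [a-zA-Z]. */
static int name_age_match(const char *s, struct extract e[2])
{
    s += 6;                      /* trusted prefix: skipped, not compared */
    const char *p = s;
    while (isalpha((unsigned char)*p))
        p++;
    if (p == s || *p != ' ')
        return 0;                /* need a name followed by one space */
    e[0].p = s;
    e[0].i = (int)(p - s);
    const char *d = ++p;
    while (p - d < 3 && isdigit((unsigned char)*p))
        p++;
    if (p == d)
        return 0;                /* need at least one digit */
    e[1].p = d;
    e[1].i = (int)(p - d);
    return 1;
}
```

Note the flip side: "XXXX: Bob 7" matches as well, since the first
6 characters are taken on trust.
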

Interesting code with

    RegExp cdate ("((?:Mon|Tue|Wed|Thu|Fri|Sat|Sun)\s+"
                   "(?:Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec)\s+"
                   "\d{1,2}\s+\d\d:\d\d:\d\d\s+\d\d\d\d)", NOCTBL);


At this time, regexps are not reentrant.

egrep is cheating at i/o. Benchmarks with fgets() are hopeless!


5. The perl operator -- a tribute to perl
*****************************************

This is simply a shortcut for the above. The perl regexp
operator is =~ and it's activated only if it is followed

    if (datestr =~ "Jan [12] ")
        printf ("Happy New Year!\n");


    static inline RegExp some_unique_name ("Jan [12] ", NOEX PACKED);
    if (some_unique_name_match (datestr))
        printf ("Happy New Year!\n");

These regexps are for fast strcmp'ing. Their code is packed to the
minimum and they don't extract anything.
=~ has the same precedence as the other relational operators >= <= < >

the regexp facilities of lwc are superfluous. A library would suffice.