1 .TH FLEX 1 "26 May 1990" "Version 2.3"
3 flexdoc - fast lexical analyzer generator
6 .B [-bcdfinpstvFILT8 -C[efmF] -Sskeleton]
10 is a tool for generating
12 programs which recognized lexical patterns in text.
15 the given input files, or its standard input if no file names are given,
16 for a description of a scanner to generate. The description is in
18 of regular expressions and C code, called
20 generates as output a C source file,
22 which defines a routine
24 This file is compiled and linked with the
26 library to produce an executable. When the executable is run,
27 it analyzes its input for occurrences
28 of the regular expressions. Whenever it finds one, it executes
29 the corresponding C code.
30 .SH SOME SIMPLE EXAMPLES
32 First some simple examples to get the flavor of how one uses
36 input specifies a scanner which whenever it encounters the string
37 "username" will replace it with the user's login name:
41 username printf( "%s", getlogin() );
44 By default, any text not matched by a
47 is copied to the output, so the net effect of this scanner is
48 to copy its input file to its output with each occurrence
49 of "username" expanded.
50 In this input, there is just one rule. "username" is the
52 and the "printf" is the
54 The "%%" marks the beginning of the rules.
56 Here's another simple example:
59 int num_lines = 0, num_chars = 0;
62 \\n ++num_lines; ++num_chars;
69 printf( "# of lines = %d, # of chars = %d\\n",
70 num_lines, num_chars );
74 This scanner counts the number of characters and the number
75 of lines in its input (it produces no output other than the
76 final report on the counts). The first line
77 declares two globals, "num_lines" and "num_chars", which are accessible
82 routine declared after the second "%%". There are two rules, one
83 which matches a newline ("\\n") and increments both the line count and
84 the character count, and one which matches any character other than
85 a newline (indicated by the "." regular expression).
87 A somewhat more complicated example:
90 /* scanner for a toy Pascal-like language */
93 /* need this for the call to atof() below */
103 printf( "An integer: %s (%d)\\n", yytext,
107 {DIGIT}+"."{DIGIT}* {
108 printf( "A float: %s (%g)\\n", yytext,
112 if|then|begin|end|procedure|function {
113 printf( "A keyword: %s\\n", yytext );
116 {ID} printf( "An identifier: %s\\n", yytext );
118 "+"|"-"|"*"|"/" printf( "An operator: %s\\n", yytext );
120 "{"[^}\\n]*"}" /* eat up one-line comments */
122 [ \\t\\n]+ /* eat up whitespace */
124 . printf( "Unrecognized character: %s\\n", yytext );
132 ++argv, --argc; /* skip over program name */
134 yyin = fopen( argv[0], "r" );
142 This is the beginnings of a simple scanner for a language like
143 Pascal. It identifies different types of
145 and reports on what it has seen.
147 The details of this example will be explained in the following
149 .SH FORMAT OF THE INPUT FILE
152 input file consists of three sections, separated by a line with just
166 section contains declarations of simple
168 definitions to simplify the scanner specification, and declarations of
170 which are explained in a later section.
172 Name definitions have the form:
178 The "name" is a word beginning with a letter or an underscore ('_')
179 followed by zero or more letters, digits, '_', or '-' (dash).
180 The definition is taken to begin at the first non-white-space character
181 following the name and continuing to the end of the line.
182 The definition can subsequently be referred to using "{name}", which
183 will expand to "(definition)". For example,
190 defines "DIGIT" to be a regular expression which matches a
192 "ID" to be a regular expression which matches a letter
193 followed by zero-or-more letters-or-digits.
194 A subsequent reference to
206 and matches one-or-more digits followed by a '.' followed
207 by zero-or-more digits.
213 input contains a series of rules of the form:
219 where the pattern must be unindented and the action must begin
222 See below for a further description of patterns and actions.
224 Finally, the user code section is simply copied to
227 It is used for companion routines which call or are called
228 by the scanner. The presence of this section is optional;
229 if it is missing, the second
231 in the input file may be skipped, too.
233 In the definitions and rules sections, any
235 text or text enclosed in
239 is copied verbatim to the output (with the %{}'s removed).
240 The %{}'s must appear unindented on lines by themselves.
242 In the rules section,
243 any indented or %{} text appearing before the
244 first rule may be used to declare variables
245 which are local to the scanning routine and (after the declarations)
246 code which is to be executed whenever the scanning routine is entered.
247 Other indented or %{} text in the rule section is still copied to the output,
248 but its meaning is not well-defined and it may well cause compile-time
249 errors (this feature is present for
251 compliance; see below for other such features).
253 In the definitions section, an unindented comment (i.e., a line
254 beginning with "/*") is also copied verbatim to the output up
255 to the next "*/". Also, any line in the definitions section
256 beginning with '#' is ignored, though this style of comment is
257 deprecated and may go away in the future.
259 The patterns in the input are written using an extended set of regular
260 expressions. These are:
263 x match the character 'x'
264 . any character except newline
265 [xyz] a "character class"; in this case, the pattern
266 matches either an 'x', a 'y', or a 'z'
267 [abj-oZ] a "character class" with a range in it; matches
268 an 'a', a 'b', any letter from 'j' through 'o',
270 [^A-Z] a "negated character class", i.e., any character
271 but those in the class. In this case, any
272 character EXCEPT an uppercase letter.
273 [^A-Z\\n] any character EXCEPT an uppercase letter or
275 r* zero or more r's, where r is any regular expression
277 r? zero or one r's (that is, "an optional r")
278 r{2,5} anywhere from two to five r's
279 r{2,} two or more r's
281 {name} the expansion of the "name" definition
284 the literal string: [xyz]"foo
285 \\X if X is an 'a', 'b', 'f', 'n', 'r', 't', or 'v',
286 then the ANSI-C interpretation of \\x.
287 Otherwise, a literal 'X' (used to escape
288 operators such as '*')
289 \\123 the character with octal value 123
290 \\x2a the character with hexadecimal value 2a
291 (r) match an r; parentheses are used to override
292 precedence (see below)
295 rs the regular expression r followed by the
296 regular expression s; called "concatenation"
299 r|s either an r or an s
302 r/s an r but only if it is followed by an s. The
303 s is not part of the matched text. This type
304 of pattern is called as "trailing context".
305 ^r an r, but only at the beginning of a line
306 r$ an r, but only at the end of a line. Equivalent
310 <s>r an r, but only in start condition s (see
311 below for discussion of start conditions)
313 same, but in any of start conditions s1,
317 <<EOF>> an end-of-file
319 an end-of-file when in start condition s1 or s2
322 The regular expressions listed above are grouped according to
323 precedence, from highest precedence at the top to lowest at the bottom.
324 Those grouped together have equal precedence. For example,
336 since the '*' operator has higher precedence than concatenation,
337 and concatenation higher than alternation ('|'). This pattern
342 the string "ba" followed by zero-or-more r's.
343 To match "foo" or zero-or-more "bar"'s, use:
349 and to match zero-or-more "foo"'s-or-"bar"'s:
356 Some notes on patterns:
358 A negated character class such as the example "[^A-Z]"
360 .I will match a newline
361 unless "\\n" (or an equivalent escape sequence) is one of the
362 characters explicitly present in the negated character class
363 (e.g., "[^A-Z\\n]"). This is unlike how many other regular
364 expression tools treat negated character classes, but unfortunately
365 the inconsistency is historically entrenched.
366 Matching newlines means that a pattern like [^"]* can match an entire
367 input (overflowing the scanner's input buffer) unless there's another
370 A rule can have at most one instance of trailing context (the '/' operator
371 or the '$' operator). The start condition, '^', and "<<EOF>>" patterns
372 can only occur at the beginning of a pattern, and, as well as with '/' and '$',
373 cannot be grouped inside parentheses. A '^' which does not occur at
374 the beginning of a rule or a '$' which does not occur at the end of
375 a rule loses its special properties and is treated as a normal character.
377 The following are illegal:
384 Note that the first of these, can be written "foo/bar\\n".
386 The following will result in '$' or '^' being treated as a normal character:
393 If what's wanted is a "foo" or a bar-followed-by-a-newline, the following
394 could be used (the special '|' action is explained below):
398 bar$ /* action goes here */
401 A similar trick will work for matching a foo or a
402 bar-at-the-beginning-of-a-line.
403 .SH HOW THE INPUT IS MATCHED
404 When the generated scanner is run, it analyzes its input looking
405 for strings which match any of its patterns. If it finds more than
406 one match, it takes the one matching the most text (for trailing
407 context rules, this includes the length of the trailing part, even
408 though it will then be returned to the input). If it finds two
409 or more matches of the same length, the
410 rule listed first in the
412 input file is chosen.
414 Once the match is determined, the text corresponding to the match
417 is made available in the global character pointer
419 and its length in the global integer
423 corresponding to the matched pattern is then executed (a more
424 detailed description of actions follows), and then the remaining
425 input is scanned for another match.
427 If no match is found, then the
429 is executed: the next character in the input is considered matched and
430 copied to the standard output. Thus, the simplest legal
438 which generates a scanner that simply copies its input (one character
439 at a time) to its output.
441 Each pattern in a rule has a corresponding action, which can be any
442 arbitrary C statement. The pattern ends at the first non-escaped
443 whitespace character; the remainder of the line is its action. If the
444 action is empty, then when the pattern is matched the input token
445 is simply discarded. For example, here is the specification for a program
446 which deletes all occurrences of "zap me" from its input:
453 (It will copy all other characters in the input to the output since
454 they will be matched by the default rule.)
456 Here is a program which compresses multiple blanks and tabs down to
457 a single blank, and throws away whitespace found at the end of a line:
461 [ \\t]+ putchar( ' ' );
462 [ \\t]+$ /* ignore this token */
466 If the action contains a '{', then the action spans till the balancing '}'
467 is found, and the action may cross multiple lines.
469 knows about C strings and comments and won't be fooled by braces found
470 within them, but also allows actions to begin with
472 and will consider the action to be all the text up to the next
474 (regardless of ordinary braces inside the action).
476 An action consisting solely of a vertical bar ('|') means "same as
477 the action for the next rule." See below for an illustration.
479 Actions can include arbitrary C code, including
481 statements to return a value to whatever routine called
485 is called it continues processing tokens from where it last left
486 off until it either reaches
487 the end of the file or executes a return. Once it reaches an end-of-file,
488 however, then any subsequent call to
490 will simply immediately return, unless
492 is first called (see below).
494 Actions are not allowed to modify yytext or yyleng.
496 There are a number of special directives which can be included within
500 copies yytext to the scanner's output.
503 followed by the name of a start condition places the scanner in the
504 corresponding start condition (see below).
507 directs the scanner to proceed on to the "second best" rule which matched the
508 input (or a prefix of the input). The rule is chosen as described
509 above in "How the Input is Matched", and
513 set up appropriately.
514 It may either be one which matched as much text
515 as the originally chosen rule but came later in the
517 input file, or one which matched less text.
518 For example, the following will both count the
519 words in the input and call the routine special() whenever "frob" is seen:
525 frob special(); REJECT;
526 [^ \\t\\n]+ ++word_count;
531 any "frob"'s in the input would not be counted as words, since the
532 scanner normally executes only one action per token.
535 are allowed, each one finding the next best choice to the currently
536 active rule. For example, when the following scanner scans the token
537 "abcd", it will write "abcdabcaba" to the output:
545 .|\\n /* eat up any unmatched character */
548 (The first three rules share the fourth's action since they use
549 the special '|' action.)
551 is a particularly expensive feature in terms scanner performance;
554 of the scanner's actions it will slow down
556 of the scanner's matching. Furthermore,
558 cannot be used with the
564 Note also that unlike the other special actions,
568 code immediately following it in the action will
573 tells the scanner that the next time it matches a rule, the corresponding
576 onto the current value of
578 rather than replacing it. For example, given the input "mega-kludge"
579 the following will write "mega-mega-kludge" to the output:
583 mega- ECHO; yymore();
587 First "mega-" is matched and echoed to the output. Then "kludge"
588 is matched, but the previous "mega-" is still hanging around at the
593 for the "kludge" rule will actually write "mega-kludge".
596 in the scanner's action entails a minor performance penalty in the
597 scanner's matching speed.
600 returns all but the first
602 characters of the current token back to the input stream, where they
603 will be rescanned when the scanner looks for the next match.
607 are adjusted appropriately (e.g.,
611 ). For example, on the input "foobar" the following will write out
616 foobar ECHO; yyless(3);
622 will cause the entire current input string to be scanned again. Unless you've
623 changed how the scanner will subsequently process its input (using
625 for example), this will result in an endless loop.
630 back onto the input stream. It will be the next character scanned.
631 The following action will take the current token and cause it
632 to be rescanned enclosed in parentheses.
638 for ( i = yyleng - 1; i >= 0; --i )
646 puts the given character back at the
648 of the input stream, pushing back strings must be done back-to-front.
651 reads the next character from the input stream. For example,
652 the following is one way to eat up C comments:
661 while ( (c = input()) != '*' &&
663 ; /* eat up text of comment */
667 while ( (c = input()) == '*' )
670 break; /* found the end */
675 error( "EOF in comment" );
682 (Note that if the scanner is compiled using
686 is instead referred to as
688 in order to avoid a name clash with the
690 stream by the name of
694 can be used in lieu of a return statement in an action. It terminates
695 the scanner and returns a 0 to the scanner's caller, indicating "all done".
696 Subsequent calls to the scanner will immediately return unless preceded
702 is also called when an end-of-file is encountered. It is a macro and
704 .SH THE GENERATED SCANNER
709 which contains the scanning routine
711 a number of tables used by it for matching tokens, and a number
712 of auxiliary routines and macros. By default,
714 is declared as follows:
719 ... various definitions and the actions in here ...
723 (If your environment supports function prototypes, then it will
724 be "int yylex( void )".) This definition may be changed by redefining
725 the "YY_DECL" macro. For example, you could use:
729 #define YY_DECL float lexscan( a, b ) float a, b;
732 to give the scanning routine the name
734 returning a float, and taking two floats as arguments. Note that
735 if you give arguments to the scanning routine using a
736 K&R-style/non-prototyped function declaration, you must terminate
737 the definition with a semi-colon (;).
741 is called, it scans tokens from the global input file
743 (which defaults to stdin). It continues until it either reaches
744 an end-of-file (at which point it returns the value 0) or
745 one of its actions executes a
748 In the former case, when called again the scanner will immediately
753 at the new input file. (
755 takes one argument, a
758 In the latter case (i.e., when an action
759 executes a return), the scanner may then be called again and it
760 will resume scanning where it left off.
762 By default (and for purposes of efficiency), the scanner uses
763 block-reads rather than simple
765 calls to read characters from
767 The nature of how it gets its input can be controlled by redefining the
770 YY_INPUT's calling sequence is "YY_INPUT(buf,result,max_size)". Its
771 action is to place up to
773 characters in the character array
775 and return in the integer variable
778 number of characters read or the constant YY_NULL (0 on Unix systems)
779 to indicate EOF. The default YY_INPUT reads from the
780 global file-pointer "yyin".
782 A sample redefinition of YY_INPUT (in the definitions
783 section of the input file):
788 #define YY_INPUT(buf,result,max_size) \\
790 int c = getchar(); \\
791 result = (c == EOF) ? YY_NULL : (buf[0] = c, 1); \\
796 This definition will change the input processing to occur
797 one character at a time.
799 You also can add in things like keeping track of the
800 input line number this way; but don't expect your scanner to
803 When the scanner receives an end-of-file indication from YY_INPUT,
808 returns false (zero), then it is assumed that the
809 function has gone ahead and set up
811 to point to another input file, and scanning continues. If it returns
812 true (non-zero), then the scanner terminates, returning 0 to its
817 always returns 1. Presently, to redefine it you must first
818 "#undef yywrap", as it is currently implemented as a macro. As indicated
819 by the hedging in the previous sentence, it may be changed to
820 a true function in the near future.
822 The scanner writes its
826 global (default, stdout), which may be redefined by the user simply
827 by assigning it to some other
832 provides a mechanism for conditionally activating rules. Any rule
833 whose pattern is prefixed with "<sc>" will only be active when
834 the scanner is in the start condition named "sc". For example,
837 <STRING>[^"]* { /* eat up the string body ... */
842 will be active only when the scanner is in the "STRING" start
846 <INITIAL,STRING,QUOTE>\\. { /* handle an escape ... */
851 will be active only when the current start condition is
852 either "INITIAL", "STRING", or "QUOTE".
855 are declared in the definitions (first) section of the input
856 using unindented lines beginning with either
860 followed by a list of names.
863 start conditions, the latter
865 start conditions. A start condition is activated using the
867 action. Until the next
869 action is executed, rules with the given start
870 condition will be active and
871 rules with other start conditions will be inactive.
872 If the start condition is
874 then rules with no start conditions at all will also be active.
879 rules qualified with the start condition will be active.
880 A set of rules contingent on the same exclusive start condition
881 describe a scanner which is independent of any of the other rules in the
883 input. Because of this,
884 exclusive start conditions make it easy to specify "mini-scanners"
885 which scan portions of the input that are syntactically different
886 from the rest (e.g., comments).
888 If the distinction between inclusive and exclusive start conditions
889 is still a little vague, here's a simple example illustrating the
890 connection between the two. The set of rules:
895 <example>foo /* do something */
903 <INITIAL,example>foo /* do something */
909 any unmatched character) remains active in start conditions.
912 returns to the original state where only the rules with
913 no start conditions are active. This state can also be
914 referred to as the start-condition "INITIAL", so
918 (The parentheses around the start condition name are not required but
919 are considered good style.)
922 actions can also be given as indented code at the beginning
923 of the rules section. For example, the following will cause
924 the scanner to enter the "SPECIAL" start condition whenever
926 is called and the global variable
938 <SPECIAL>blahblahblah
939 ...more rules follow...
943 To illustrate the uses of start conditions,
944 here is a scanner which provides two different interpretations
945 of a string like "123.456". By default it will treat it as
946 as three tokens, the integer "123", a dot ('.'), and the integer "456".
947 But if the string is preceded earlier in the line by the string
949 it will treat it as a single token, the floating-point number
959 expect-floats BEGIN(expect);
961 <expect>[0-9]+"."[0-9]+ {
962 printf( "found a float, = %f\\n",
966 /* that's the end of the line, so
967 * we need another "expect-number"
968 * before we'll recognize any more
975 printf( "found an integer, = %d\\n",
979 "." printf( "found a dot\\n" );
982 Here is a scanner which recognizes (and discards) C comments while
983 maintaining a count of the current input line.
992 <comment>[^*\\n]* /* eat anything that's not a '*' */
993 <comment>"*"+[^*/\\n]* /* eat up '*'s not followed by '/'s */
994 <comment>\\n ++line_num;
995 <comment>"*"+"/" BEGIN(INITIAL);
998 Note that start-conditions names are really integer values and
999 can be stored as such. Thus, the above could be extended in the
1009 comment_caller = INITIAL;
1016 comment_caller = foo;
1020 <comment>[^*\\n]* /* eat anything that's not a '*' */
1021 <comment>"*"+[^*/\\n]* /* eat up '*'s not followed by '/'s */
1022 <comment>\\n ++line_num;
1023 <comment>"*"+"/" BEGIN(comment_caller);
1026 One can then implement a "stack" of start conditions using an
1027 array of integers. (It is likely that such stacks will become
1030 feature in the future.) Note, though, that
1031 start conditions do not have their own name-space; %s's and %x's
1032 declare names in the same fashion as #define's.
1033 .SH MULTIPLE INPUT BUFFERS
1034 Some scanners (such as those which support "include" files)
1035 require reading from several input streams. As
1037 scanners do a large amount of buffering, one cannot control
1038 where the next input will be read from by simply writing a
1040 which is sensitive to the scanning context.
1042 is only called when the scanner reaches the end of its buffer, which
1043 may be a long time after scanning a statement such as an "include"
1044 which requires switching the input source.
1046 To negotiate these sorts of problems,
1048 provides a mechanism for creating and switching between multiple
1049 input buffers. An input buffer is created by using:
1052 YY_BUFFER_STATE yy_create_buffer( FILE *file, int size )
1057 pointer and a size and creates a buffer associated with the given
1058 file and large enough to hold
1060 characters (when in doubt, use
1062 for the size). It returns a
1064 handle, which may then be passed to other routines:
1067 void yy_switch_to_buffer( YY_BUFFER_STATE new_buffer )
1070 switches the scanner's input buffer so subsequent tokens will
1074 .B yy_switch_to_buffer()
1075 may be used by yywrap() to sets things up for continued scanning, instead
1076 of opening a new file and pointing
1081 void yy_delete_buffer( YY_BUFFER_STATE buffer )
1084 is used to reclaim the storage associated with a buffer.
1088 .B yy_create_buffer(),
1089 provided for compatibility with the C++ use of
1093 for creating and destroying dynamic objects.
1096 .B YY_CURRENT_BUFFER
1099 handle to the current buffer.
1101 Here is an example of using these features for writing a scanner
1102 which expands include files (the
1104 feature is discussed below):
1107 /* the "incl" state is used for picking up the name
1108 * of an include file
1113 #define MAX_INCLUDE_DEPTH 10
1114 YY_BUFFER_STATE include_stack[MAX_INCLUDE_DEPTH];
1115 int include_stack_ptr = 0;
1119 include BEGIN(incl);
1122 [^a-z\\n]*\\n? ECHO;
1124 <incl>[ \\t]* /* eat the whitespace */
1125 <incl>[^ \\t\\n]+ { /* got the include file name */
1126 if ( include_stack_ptr >= MAX_INCLUDE_DEPTH )
1128 fprintf( stderr, "Includes nested too deeply" );
1132 include_stack[include_stack_ptr++] =
1135 yyin = fopen( yytext, "r" );
1140 yy_switch_to_buffer(
1141 yy_create_buffer( yyin, YY_BUF_SIZE ) );
1147 if ( --include_stack_ptr < 0 )
1153 yy_switch_to_buffer(
1154 include_stack[include_stack_ptr] );
1158 .SH END-OF-FILE RULES
1159 The special rule "<<EOF>>" indicates
1160 actions which are to be taken when an end-of-file is
1161 encountered and yywrap() returns non-zero (i.e., indicates
1162 no further files to process). The action must finish
1163 by doing one of four things:
1169 has been pointed at a new file to process;
1179 or, switching to a new buffer using
1180 .B yy_switch_to_buffer()
1181 as shown in the example above.
1183 <<EOF>> rules may not be used with other
1184 patterns; they may only be qualified with a list of start
1185 conditions. If an unqualified <<EOF>> rule is given, it
1188 start conditions which do not already have <<EOF>> actions. To
1189 specify an <<EOF>> rule for only the initial start condition, use
1196 These rules are useful for catching things like unclosed comments.
1203 ...other rules for dealing with quotes...
1206 error( "unterminated quote" );
1212 yyin = fopen( *filelist, "r" );
1220 .SH MISCELLANEOUS MACROS
1223 can be redefined to provide an action
1224 which is always executed prior to the matched rule's action. For example,
1225 it could be #define'd to call a routine to convert yytext to lower-case.
1229 may be redefined to provide an action which is always executed before
1230 the first scan (and before the scanner's internal initializations are done).
1231 For example, it could be used to call a routine to read
1232 in a data table or open a logging file.
1234 In the generated scanner, the actions are all gathered in one large
1235 switch statement and separated using
1237 which may be redefined. By default, it is simply a "break", to separate
1238 each rule's action from the following rule's.
1241 allows, for example, C++ users to
1242 #define YY_BREAK to do nothing (while being very careful that every
1243 rule ends with a "break" or a "return"!) to avoid suffering from
1244 unreachable statement warnings where because a rule's action ends with
1248 .SH INTERFACING WITH YACC
1249 One of the main uses of
1251 is as a companion to the
1255 parsers expect to call a routine named
1257 to find the next input token. The routine is supposed to
1258 return the type of the next token as well as putting any associated
1269 to instruct it to generate the file
1271 containing definitions of all the
1275 input. This file is then included in the
1277 scanner. For example, if one of the tokens is "TOK_NUMBER",
1278 part of the scanner might look like:
1287 [0-9]+ yylval = atoi( yytext ); return TOK_NUMBER;
1290 .SH TRANSLATION TABLE
1291 In the name of POSIX compliance,
1294 .I translation table
1295 for mapping input characters into groups.
1296 The table is specified in the first section, and its format looks like:
1301 2 ABCDEFGHIJKLMNOPQRSTUVWXYZ
1307 This example specifies that the characters 'a', 'b', 'c', and 'd'
1308 are to all be lumped into group #1, upper-case letters
1309 in group #2, digits in group #52, tabs, blanks, and newlines into
1312 no other characters will appear in the patterns.
1313 The group numbers are actually disregarded by
1316 serves, though, to lump characters together. Given the above
1317 table, for example, the pattern "a(AA)*5" is equivalent to "d(ZQ)*0".
1318 They both say, "match any character in group #1, followed by
1319 zero-or-more pairs of characters
1320 from group #2, followed by a character from group #52." Thus
1322 provides a crude way for introducing equivalence classes into
1323 the scanner specification.
1327 option (see below) coupled with the equivalence classes which
1329 automatically generates take care of virtually all the instances
1330 when one might consider using
1332 But what the hell, it's there if you want it.
1335 has the following options:
1338 Generate backtracking information to
1340 This is a list of scanner states which require backtracking
1341 and the input characters on which they do so. By adding rules one
1342 can remove backtracking states. If all backtracking states
1347 is used, the generated scanner will run faster (see the
1349 flag). Only users who wish to squeeze every last cycle out of their
1350 scanners need worry about this option. (See the section on PERFORMANCE
1351 CONSIDERATIONS below.)
1354 is a do-nothing, deprecated option included for POSIX compliance.
1357 in previous releases of
1360 specified table-compression options. This functionality is
1363 flag. To ease the the impact of this change, when
1367 it currently issues a warning message and assumes that
1369 was desired instead. In the future this "promotion" of
1373 will go away in the name of full POSIX compliance (unless
1374 the POSIX meaning is removed first).
1377 makes the generated scanner run in
1379 mode. Whenever a pattern is recognized and the global
1381 is non-zero (which is the default),
1382 the scanner will write to
1387 --accepting rule at line 53 ("the matched text")
1390 The line number refers to the location of the rule in the file
1391 defining the scanner (i.e., the file that was fed to flex). Messages
1392 are also generated when the scanner backtracks, accepts the
1393 default rule, reaches the end of its input buffer (or encounters
1394 a NUL; at this point, the two look the same as far as the scanner's concerned),
1395 or reaches an end-of-file.
1398 specifies (take your pick)
1402 No table compression is done. The result is large but fast.
1403 This option is equivalent to
1412 scanner. The case of letters given in the
1415 be ignored, and tokens in the input will be matched regardless of case. The
1416 matched text given in
1418 will have the preserved case (i.e., it will not be folded).
1421 is another do-nothing, deprecated option included only for
1425 generates a performance report to stderr. The report
1426 consists of comments regarding features of the
1428 input file which will cause a loss of performance in the resulting scanner.
1429 Note that the use of
1431 and variable trailing context (see the BUGS section in flex(1))
1432 entails a substantial performance penalty; use of
1439 flag entail minor performance penalties.
1444 (that unmatched scanner input is echoed to
1446 to be suppressed. If the scanner encounters input that does not
1447 match any of its rules, it aborts with an error. This option is
1448 useful for finding holes in a scanner's rule set.
1453 to write the scanner it generates to standard output instead
1462 a summary of statistics regarding the scanner it generates.
1463 Most of the statistics are meaningless to the casual
1466 first line identifies the version of
1468 which is useful for figuring
1469 out where you stand with respect to patches and new releases,
1470 and the next two lines give the date when the scanner was created
1471 and a summary of the flags which were in effect.
1476 scanner table representation should be used. This representation is
1477 about as fast as the full table representation
1479 and for some sets of patterns will be considerably smaller (and for
1480 others, larger). In general, if the pattern set contains both "keywords"
1481 and a catch-all, "identifier" rule, such as in the set:
1484 "case" return TOK_CASE;
1485 "switch" return TOK_SWITCH;
1487 "default" return TOK_DEFAULT;
1488 [a-z]+ return TOK_ID;
1491 then you're better off using the full table representation. If only
1492 the "identifier" rule is present and you then use a hash table or some such
1493 to detect the keywords, you're better off using
1496 This option is equivalent to
1505 scanner. Normally, scanners generated by
1507 always look ahead one
1508 character before deciding that a rule has been matched. At the cost of
1509 some scanning overhead,
1511 will generate a scanner which only looks ahead
1512 when needed. Such scanners are called
1514 because if you want to write a scanner for an interactive system such as a
1515 command shell, you will probably want the user's input to be terminated
1516 with a newline, and without
1518 the user will have to type a character in addition to the newline in order
1519 to have the newline recognized. This leads to dreadful interactive
1522 If all this seems to confusing, here's the general rule: if a human will
1523 be typing in input to your scanner, use
1525 otherwise don't; if you don't care about squeezing the utmost performance
1526 from your scanner and you
1527 don't want to make any assumptions about the input to your scanner,
1533 cannot be used in conjunction with
1548 directives. Without this option,
1550 peppers the generated scanner
1551 with #line directives so error messages in the actions will be correctly
1552 located with respect to the original
1554 input file, and not to
1555 the fairly meaningless line numbers of
1559 does not presently generate the necessary directives
1560 to "retarget" the line numbers for those parts of
1562 which it generated. So if there is an error in the generated code,
1563 a meaningless line number is reported.)
1570 mode. It will generate a lot of messages to
1573 the form of the input and the resultant non-deterministic and deterministic
1574 finite automata. This option is mostly for use in maintaining
1580 to generate an 8-bit scanner, i.e., one which can recognize 8-bit
1581 characters. On some sites,
1583 is installed with this option as the default. On others, the default
1584 is 7-bit characters. To see which is the case, check the verbose
1586 output for "equivalence classes created". If the denominator of
1587 the number shown is 128, then by default
1589 is generating 7-bit characters. If it is 256, then the default is
1590 8-bit characters and the
1592 flag is not required (but may be a good idea to keep the scanner
1593 specification portable). Feeding a 7-bit scanner 8-bit characters
1594 will result in infinite loops, bus errors, or other such fireworks,
1595 so when in doubt, use the flag. Note that if equivalence classes
1596 are used, 8-bit scanners take only slightly more table space than
1597 7-bit scanners (128 bytes, to be exact); if equivalence classes are
1598 not used, however, then the tables may grow up to twice their
1602 controls the degree of table compression.
1608 .I equivalence classes,
1609 i.e., sets of characters
1610 which have identical lexical properties (for example, if the only
1611 appearance of digits in the
1613 input is in the character class
1614 "[0-9]" then the digits '0', '1', ..., '9' will all be put
1615 in the same equivalence class). Equivalence classes usually give
1616 dramatic reductions in the final table/object file sizes (typically
1617 a factor of 2-5) and are pretty cheap performance-wise (one array
1618 look-up per character scanned).
1623 scanner tables should be generated -
1625 should not compress the
1626 tables by taking advantages of similar transition functions for
1630 specifies that the alternate fast scanner representation (described
1640 .I meta-equivalence classes,
1641 which are sets of equivalence classes (or characters, if equivalence
1642 classes are not being used) that are commonly used together. Meta-equivalence
1643 classes are often a big win when using compressed tables, but they
1644 have a moderate performance impact (one or two "if" tests and one
1645 array look-up per character scanned).
1649 specifies that the scanner tables should be compressed but neither
1650 equivalence classes nor meta-equivalence classes should be used.
1658 do not make sense together - there is no opportunity for meta-equivalence
1659 classes if the table is not being compressed. Otherwise the options
1660 may be freely mixed.
1662 The default setting is
1664 which specifies that
1666 should generate equivalence classes
1667 and meta-equivalence classes. This setting provides the highest
1668 degree of table compression. You can trade off
1669 faster-executing scanners at the cost of larger tables with
1670 the following generally being true:
1683 Note that scanners with the smallest tables are usually generated and
1684 compiled the quickest, so
1685 during development you will usually want to use the default, maximal
1689 is often a good compromise between speed and size for production
1693 options are not cumulative; whenever the flag is encountered, the
1694 previous -C settings are forgotten.
1697 overrides the default skeleton file from which
1699 constructs its scanners. You'll never need this option unless you are doing
1701 maintenance or development.
1702 .SH PERFORMANCE CONSIDERATIONS
1703 The main design goal of
1705 is that it generate high-performance scanners. It has been optimized
1706 for dealing well with large sets of rules. Aside from the effects
1707 of table compression on scanner speed outlined above,
1708 there are a number of options/actions which degrade performance. These
1709 are, from most expensive to least:
1714 pattern sets that require backtracking
1715 arbitrary trailing context
1717 '^' beginning-of-line operator
1721 with the first three all being quite expensive and the last two
1725 should be avoided at all costs when performance is important.
1726 It is a particularly expensive option.
1728 Getting rid of backtracking is messy and often may be an enormous
1729 amount of work for a complicated scanner. In principal, one begins
1734 file. For example, on the input
1738 foo return TOK_KEYWORD;
1739 foobar return TOK_KEYWORD;
1742 the file looks like:
1745 State #6 is non-accepting -
1746 associated rule line numbers:
1748 out-transitions: [ o ]
1749 jam-transitions: EOF [ \\001-n p-\\177 ]
1751 State #8 is non-accepting -
1752 associated rule line numbers:
1754 out-transitions: [ a ]
1755 jam-transitions: EOF [ \\001-` b-\\177 ]
1757 State #9 is non-accepting -
1758 associated rule line numbers:
1760 out-transitions: [ r ]
1761 jam-transitions: EOF [ \\001-q s-\\177 ]
1763 Compressed tables always backtrack.
1766 The first few lines tell us that there's a scanner state in
1767 which it can make a transition on an 'o' but not on any other
1768 character, and that in that state the currently scanned text does not match
1769 any rule. The state occurs when trying to match the rules found
1770 at lines 2 and 3 in the input file.
1771 If the scanner is in that state and then reads
1772 something other than an 'o', it will have to backtrack to find
1773 a rule which is matched. With
1774 a bit of headscratching one can see that this must be the
1775 state it's in when it has seen "fo". When this has happened,
1776 if anything other than another 'o' is seen, the scanner will
1777 have to back up to simply match the 'f' (by the default rule).
1779 The comment regarding State #8 indicates there's a problem
1780 when "foob" has been scanned. Indeed, on any character other
1781 than a 'b', the scanner will have to back up to accept "foo".
1782 Similarly, the comment for State #9 concerns when "fooba" has
1785 The final comment reminds us that there's no point going to
1786 all the trouble of removing backtracking from the rules unless
1791 since there's no performance gain doing so with compressed scanners.
1793 The way to remove the backtracking is to add "error" rules:
1797 foo return TOK_KEYWORD;
1798 foobar return TOK_KEYWORD;
1803 /* false alarm, not really a keyword */
1809 Eliminating backtracking among a list of keywords can also be
1810 done using a "catch-all" rule:
1814 foo return TOK_KEYWORD;
1815 foobar return TOK_KEYWORD;
1817 [a-z]+ return TOK_ID;
1820 This is usually the best solution when appropriate.
1822 Backtracking messages tend to cascade.
1823 With a complicated set of rules it's not uncommon to get hundreds
1824 of messages. If one can decipher them, though, it often
1825 only takes a dozen or so rules to eliminate the backtracking (though
1826 it's easy to make a mistake and have an error rule accidentally match
1827 a valid token. A possible future
1829 feature will be to automatically add rules to eliminate backtracking).
1832 trailing context (where both the leading and trailing parts do not have
1833 a fixed length) entails almost the same performance loss as
1835 (i.e., substantial). So when possible a rule like:
1839 mouse|rat/(cat|dog) run();
1846 mouse/cat|dog run();
1854 mouse|rat/cat run();
1855 mouse|rat/dog run();
1858 Note that here the special '|' action does
1860 provide any savings, and can even make things worse (see
1864 Another area where the user can increase a scanner's performance
1865 (and one that's easier to implement) arises from the fact that
1866 the longer the tokens matched, the faster the scanner will run.
1867 This is because with long tokens the processing of most input
1868 characters takes place in the (short) inner scanning loop, and
1869 does not often have to go through the additional work of setting up
1870 the scanning environment (e.g.,
1872 for the action. Recall the scanner for C comments:
1879 "/*" BEGIN(comment);
1882 <comment>"*"+[^*/\\n]*
1883 <comment>\\n ++line_num;
1884 <comment>"*"+"/" BEGIN(INITIAL);
1887 This could be sped up by writing it as:
1894 "/*" BEGIN(comment);
1897 <comment>[^*\\n]*\\n ++line_num;
1898 <comment>"*"+[^*/\\n]*
1899 <comment>"*"+[^*/\\n]*\\n ++line_num;
1900 <comment>"*"+"/" BEGIN(INITIAL);
1903 Now instead of each newline requiring the processing of another
1904 action, recognizing the newlines is "distributed" over the other rules
1905 to keep the matched text as long as possible. Note that
1909 slow down the scanner! The speed of the scanner is independent
1910 of the number of rules or (modulo the considerations given at the
1911 beginning of this section) how complicated the rules are with
1912 regard to operators such as '*' and '|'.
1914 A final example in speeding up a scanner: suppose you want to scan
1915 through a file containing identifiers and keywords, one per line
1916 and with no other extraneous characters, and recognize all the
1917 keywords. A natural first approach is:
1926 while /* it's a keyword */
1928 .|\\n /* it's not a keyword */
1931 To eliminate the back-tracking, introduce a catch-all rule:
1940 while /* it's a keyword */
1943 .|\\n /* it's not a keyword */
1946 Now, if it's guaranteed that there's exactly one word per line,
1947 then we can reduce the total number of matches by a half by
1948 merging in the recognition of newlines with that of the other
1958 while\\n /* it's a keyword */
1961 .|\\n /* it's not a keyword */
1964 One has to be careful here, as we have now reintroduced backtracking
1965 into the scanner. In particular, while
1967 know that there will never be any characters in the input stream
1968 other than letters or newlines,
1970 can't figure this out, and it will plan for possibly needing backtracking
1971 when it has scanned a token like "auto" and then the next character
1972 is something other than a newline or a letter. Previously it would
1973 then just match the "auto" rule and be done, but now it has no "auto"
1974 rule, only a "auto\\n" rule. To eliminate the possibility of backtracking,
1975 we could either duplicate all rules but without final newlines, or,
1976 since we never expect to encounter such an input and therefore don't
1977 how it's classified, we can introduce one more catch-all rule, this
1978 one which doesn't include a newline:
1987 while\\n /* it's a keyword */
1991 .|\\n /* it's not a keyword */
1996 this is about as fast as one can get a
1998 scanner to go for this particular problem.
2002 is slow when matching NUL's, particularly when a token contains
2004 It's best to write rules which match
2006 amounts of text if it's anticipated that the text will often include NUL's.
2007 .SH INCOMPATIBILITIES WITH LEX AND POSIX
2009 is a rewrite of the Unix
2011 tool (the two implementations do not share any code, though),
2012 with some extensions and incompatibilities, both of which
2013 are of concern to those who wish to write scanners acceptable
2014 to either implementation. At present, the POSIX
2017 very close to the original
2019 implementation, so some of these
2020 incompatibilities are also in conflict with the POSIX draft. But
2021 the intent is that except as noted below,
2023 as it presently stands will
2024 ultimately be POSIX conformant (i.e., that those areas of conflict with
2025 the POSIX draft will be resolved in
2027 favor). Please bear in
2028 mind that all the comments which follow are with regard to the POSIX
2030 standard of Summer 1989, and not the final document (or subsequent
2031 drafts); they are included so
2033 users can be aware of the standardization issues and those areas where
2035 may in the near future undergo changes incompatible with
2036 its current definition.
2039 is fully compatible with
2041 with the following exceptions:
2045 scanner internal variable
2047 is not supported. It is difficult to support this option efficiently,
2048 since it requires examining every character scanned and reexamining
2049 the characters when the scanner backs up.
2050 Things get more complicated when the end of buffer or file is reached or a
2051 NUL is scanned (since the scan must then be restarted with the proper line
2052 number count), or the user uses the yyless(), unput(), or REJECT actions,
2053 or the multiple input buffer functions.
2055 The fix is to add rules which, upon seeing a newline, increment
2056 yylineno. This is usually an easy process, though it can be a drag if some
2057 of the patterns can match multiple newlines along with other characters.
2059 yylineno is not part of the POSIX draft.
2063 routine is not redefinable, though it may be called to read characters
2064 following whatever has been matched by a rule. If
2066 encounters an end-of-file the normal
2068 processing is done. A ``real'' end-of-file is returned by
2073 Input is instead controlled by redefining the
2081 cannot be redefined is in accordance with the POSIX draft, but
2083 has not yet been accepted into the draft (and probably won't; it looks
2084 like the draft will simply not specify any way of controlling the
2085 scanner's input other than by making an initial assignment to
2089 scanners do not use stdio for input. Because of this, when writing an
2090 interactive scanner one must explicitly call fflush() on the
2091 stream associated with the terminal after writing out a prompt.
2094 such writes are automatically flushed since
2098 for their input. Also, when writing interactive scanners with
2105 scanners are not as reentrant as
2107 scanners. In particular, if you have an interactive scanner and
2108 an interrupt handler which long-jumps out of the scanner, and
2109 the scanner is subsequently called again, you may get the following
2113 fatal flex scanner internal error--end of buffer missed
2116 To reenter the scanner, first use
2127 macro is done to the file-pointer
2132 The POSIX draft mentions that an
2134 routine exists but currently gives no details as to what it does.
2137 does not support exclusive start conditions (%x), though they
2138 are in the current POSIX draft.
2140 When definitions are expanded,
2142 encloses them in parentheses.
2143 With lex, the following:
2148 foo{NAME}? printf( "Found it\\n" );
2152 will not match the string "foo" because when the macro
2153 is expanded the rule is equivalent to "foo[A-Z][A-Z0-9]*?"
2154 and the precedence is such that the '?' is associated with
2157 the rule will be expanded to
2158 "foo([A-Z][A-Z0-9]*)?" and so the string "foo" will match.
2159 Note that because of this, the
2163 operators cannot be used in a
2167 The POSIX draft interpretation is the same as
2170 To specify a character class which matches anything but a left bracket (']'),
2173 one can use "[^]]" but with
2175 one must use "[^\\]]". The latter works with
2182 (generate a Ratfor scanner) option is not supported. It is not part
2185 If you are providing your own yywrap() routine, you must include a
2186 "#undef yywrap" in the definitions section (section 1). Note that
2187 the "#undef" will have to be enclosed in %{}'s.
2190 specifies that yywrap() is a function and this is very unlikely to change; so
2191 .I flex users are warned
2194 is likely to be changed to a function in the near future.
2201 are undefined until the next token is matched. This is not the case with
2203 or the present POSIX draft.
2205 The precedence of the
2207 (numeric range) operator is different.
2209 interprets "abc{1,3}" as "match one, two, or
2210 three occurrences of 'abc'", whereas
2212 interprets it as "match 'ab'
2213 followed by one, two, or three occurrences of 'c'". The latter is
2214 in agreement with the current POSIX draft.
2216 The precedence of the
2218 operator is different.
2220 interprets "^foo|bar" as "match either 'foo' at the beginning of a line,
2221 or 'bar' anywhere", whereas
2223 interprets it as "match either 'foo' or 'bar' if they come at the beginning
2224 of a line". The latter is in agreement with the current POSIX draft.
2226 To refer to yytext outside of the scanner source file,
2227 the correct definition with
2229 is "extern char *yytext" rather than "extern char yytext[]".
2230 This is contrary to the current POSIX draft but a point on which
2232 will not be changing, as the array representation entails a
2233 serious performance penalty. It is hoped that the POSIX draft will
2234 be emended to support the
2236 variety of declaration (as this is a fairly painless change to
2257 the first time the scanner is called, providing
2259 has not already been assigned to a non-NULL value. The difference is
2260 subtle, but the net effect is that with
2264 does not have a valid value until the scanner has been called.
2266 The special table-size declarations such as
2278 is #define'd so scanners may be written for use with either
2285 features are not included in
2287 or the POSIX draft standard:
2294 %{}'s around actions
2296 comments beginning with '#' (deprecated)
2297 multiple actions on a line
2300 This last feature refers to the fact that with
2302 you can put multiple actions on the same line, separated with
2303 semi-colons, while with
2308 foo handle_foo(); ++num_foos_seen;
2311 is (rather surprisingly) truncated to
2318 does not truncate the action. Actions that are not enclosed in
2319 braces are simply terminated at the end of the line.
2321 .I reject_used_but_not_detected undefined
2323 .I yymore_used_but_not_detected undefined -
2324 These errors can occur at compile time. They indicate that the
2331 failed to notice the fact, meaning that
2333 scanned the first two sections looking for occurrences of these actions
2334 and failed to find any, but somehow you snuck some in (via a #include
2335 file, for example). Make an explicit reference to the action in your
2337 input file. (Note that previously
2341 mechanism for dealing with this problem; this feature is still supported
2342 but now deprecated, and will go away soon unless the author hears from
2343 people who can argue compellingly that they need it.)
2345 .I flex scanner jammed -
2346 a scanner compiled with
2348 has encountered an input string which wasn't matched by
2351 .I flex input buffer overflowed -
2352 a scanner rule matched a string long enough to overflow the
2353 scanner's internal input buffer (16K bytes by default - controlled by
2355 in "flex.skel". Note that to redefine this macro, you must first
2359 .I scanner requires -8 flag -
2360 Your scanner specification includes recognizing 8-bit characters and
2361 you did not specify the -8 flag (and your site has not installed flex
2362 with -8 as the default).
2365 fatal flex scanner internal error--end of buffer missed -
2366 This can occur in an scanner which is reentered after a long-jump
2367 has jumped out (or over) the scanner's activation frame. Before
2368 reentering the scanner, use:
2375 .I too many %t classes! -
2376 You managed to put every single character into its own %t class.
2378 requires that at least one of the classes share characters.
2379 .SH DEFICIENCIES / BUGS
2383 flex(1), lex(1), yacc(1), sed(1), awk(9).
2385 M. E. Lesk and E. Schmidt,
2386 .I LEX - Lexical Analyzer Generator
2388 Vern Paxson, with the help of many ideas and much inspiration from
2389 Van Jacobson. Original version by Jef Poskanzer. The fast table
2390 representation is a partial implementation of a design done by Van
2391 Jacobson. The implementation was done by Kevin Gong and Vern Paxson.
2395 beta-testers, feedbackers, and contributors, especially Casey
2396 Leedom, benson@odi.com, Keith Bostic,
2397 Frederic Brehm, Nick Christopher, Jason Coughlin,
2398 Scott David Daniels, Leo Eskin,
2399 Chris Faylor, Eric Goldman, Eric
2400 Hughes, Jeffrey R. Jones, Kevin B. Kenny, Ronald Lamprecht,
2401 Greg Lee, Craig Leres, Mohamed el Lozy, Jim Meyering, Marc Nozell, Esmond Pitt,
2402 Jef Poskanzer, Jim Roskind,
2403 Dave Tallman, Frank Whaley, Ken Yap, and those whose names
2404 have slipped my marginal mail-archiving skills but whose contributions
2405 are appreciated all the same.
2407 Thanks to Keith Bostic, John Gilmore, Craig Leres, Bob
2408 Mulcahy, Rich Salz, and Richard Stallman for help with various distribution
2411 Thanks to Esmond Pitt and Earle Horton for 8-bit character support;
2412 to Benson Margulies and Fred
2413 Burke for C++ support; to Ove Ewerlid for the basics of support for
2414 NUL's; and to Eric Hughes for the basics of support for multiple buffers.
2416 Work is being done on extending
2418 to generate scanners in which the
2419 state machine is directly represented in C code rather than tables.
2420 These scanners may well be substantially faster than those generated
2421 using -f or -F. If you are working in this area and are interested
2422 in comparing notes and seeing whether redundant work can be avoided,
2423 contact Ove Ewerlid (ewerlid@mizar.DoCS.UU.SE).
2425 This work was primarily done when I was at the Real Time Systems Group
2426 at the Lawrence Berkeley Laboratory in Berkeley, CA. Many thanks to all there
2427 for the support I received.
2433 Computer Science Department
2436 Ithaca, NY 14853-7501
2442 .\" ref. to awk(9) man page corrected -- ASW 2005-01-15