1 -- Copyright 2006-2017 Mitchell mitchell.att.foicica.com. See LICENSE.
3 local M = {}
5 --[=[ This comment is for LuaDoc.
6 ---
7 -- Lexes Scintilla documents with Lua and LPeg.
8 --
9 -- ## Overview
11 -- Lexers highlight the syntax of source code. Scintilla (the editing component
12 -- behind [Textadept][] and [SciTE][]) traditionally uses static, compiled C++
13 -- lexers which are notoriously difficult to create and/or extend. On the other
14 -- hand, Lua makes it easy to rapidly create new lexers, extend existing
15 -- ones, and embed lexers within one another. Lua lexers tend to be more
16 -- readable than C++ lexers too.
18 -- Lexers are Parsing Expression Grammars, or PEGs, composed with the Lua
19 -- [LPeg library][]. The following table comes from the LPeg documentation and
20 -- summarizes all you need to know about constructing basic LPeg patterns. This
21 -- module provides convenience functions for creating and working with other
22 -- more advanced patterns and concepts.
24 -- Operator | Description
25 -- ---------------------|------------
26 -- `lpeg.P(string)` | Matches `string` literally.
27 -- `lpeg.P(`_`n`_`)` | Matches exactly _`n`_ characters.
28 -- `lpeg.S(string)` | Matches any character in set `string`.
29 -- `lpeg.R("`_`xy`_`")` | Matches any character between range `x` and `y`.
30 -- `patt^`_`n`_ | Matches at least _`n`_ repetitions of `patt`.
31 -- `patt^-`_`n`_ | Matches at most _`n`_ repetitions of `patt`.
32 -- `patt1 * patt2` | Matches `patt1` followed by `patt2`.
33 -- `patt1 + patt2` | Matches `patt1` or `patt2` (ordered choice).
34 -- `patt1 - patt2` | Matches `patt1` if `patt2` does not match.
35 -- `-patt` | Equivalent to `("" - patt)`.
36 -- `#patt` | Matches `patt` but consumes no input.
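-- As a rough illustration of the operators in the table above, the sketch
-- below combines them into a few small patterns. It assumes only that the LPeg
-- library is installed and loadable as `lpeg`. The pattern names (`integer`,
-- `identifier`, `until_semicolon`) are made up for this example and are not
-- part of this module.

```lua
local lpeg = require('lpeg')
local P, R, S = lpeg.P, lpeg.R, lpeg.S

-- An optional sign followed by one or more digits.
local digit = R('09')
local integer = S('+-')^-1 * digit^1

-- A letter or underscore followed by any number of letters, digits, or
-- underscores.
local word_start = R('az', 'AZ') + P('_')
local identifier = word_start * (word_start + digit)^0

-- Any run of characters up to, but not including, the first ';'.
local until_semicolon = (P(1) - P(';'))^0

-- A successful match returns the position just after the matched text;
-- a failed match returns nil.
local after_integer = integer:match('-42abc')
local after_identifier = identifier:match('foo_1 bar')
local no_match = integer:match('abc')
local after_statement = until_semicolon:match('x = 1; y = 2')
```

-- Patterns without captures report only where the match ended, which is why
-- the token helpers described later add captures of their own.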
38 -- The first part of this document deals with rapidly constructing a simple
39 -- lexer. The next part deals with more advanced techniques, such as custom
40 -- coloring and embedding lexers within one another. Following that is a
41 -- discussion about code folding, or being able to tell Scintilla which code
42 -- blocks are "foldable" (temporarily hideable from view). After that are
43 -- instructions on how to use LPeg lexers with the aforementioned Textadept and
44 -- SciTE editors. Finally there are comments on lexer performance and
45 -- limitations.
47 -- [LPeg library]: http://www.inf.puc-rio.br/~roberto/lpeg/lpeg.html
48 -- [Textadept]: http://foicica.com/textadept
49 -- [SciTE]: http://scintilla.org/SciTE.html
51 -- ## Lexer Basics
53 -- The *lexers/* directory contains all lexers, including your new one. Before
54 -- attempting to write one from scratch though, first determine if your
55 -- programming language is similar to any of the 80+ languages supported. If so,
56 -- you may be able to copy and modify that lexer, saving some time and effort.
57 -- The filename of your lexer should be the name of your programming language in
58 -- lower case followed by a *.lua* extension. For example, a new Lua lexer has
59 -- the name *lua.lua*.
61 -- Note: Try to refrain from using one-character language names like "c", "d",
62 -- or "r". For example, Scintillua uses "ansi_c", "dmd", and "rstats",
63 -- respectively.
65 -- ### New Lexer Template
67 -- There is a *lexers/template.txt* file that contains a simple template for a
68 -- new lexer. Feel free to use it, replacing the '?'s with the name of your
69 -- lexer:
71 -- -- ? LPeg lexer.
73 -- local l = require('lexer')
74 -- local token, word_match = l.token, l.word_match
75 -- local P, R, S = lpeg.P, lpeg.R, lpeg.S
77 -- local M = {_NAME = '?'}
79 -- -- Whitespace.
80 -- local ws = token(l.WHITESPACE, l.space^1)
82 -- M._rules = {
83 --   {'whitespace', ws},
84 -- }
86 -- M._tokenstyles = {
88 -- }
90 -- return M
92 -- The first three lines of code simply define often-used convenience
93 -- variables. The `local M` and `return M` lines define and return the lexer
94 -- object Scintilla uses; they are very important and must be part of every
95 -- lexer. The `local ws` line defines a "token", an essential building block
96 -- of lexers, covered shortly. The rest of the code defines a set of grammar
97 -- rules and token styles. You will learn about those later. Note, however, the
98 -- `M.` prefix in front of `_rules` and `_tokenstyles`: not only do these tables
99 -- belong to their respective lexers, but any non-local variables need the `M.`
100 -- prefix too so as not to affect Lua's global environment. All in all, this is
101 -- a minimal, working lexer that you can build on.
103 -- ### Tokens
105 -- Take a moment to think about your programming language's structure. What kind
106 -- of key elements does it have? In the template shown earlier, one predefined
107 -- element all languages have is whitespace. Your language probably also has
108 -- elements like comments, strings, and keywords. Lexers refer to these elements
109 -- as "tokens". Tokens are the fundamental "building blocks" of lexers. Lexers
110 -- break down source code into tokens for coloring, which results in the syntax
111 -- highlighting familiar to you. It is up to you how specific your lexer is when
112 -- it comes to tokens. Perhaps only distinguishing between keywords and
113 -- identifiers is necessary, or maybe recognizing constants and built-in
114 -- functions, methods, or libraries is desirable. The Lua lexer, for example,
115 -- defines 11 tokens: whitespace, comments, strings, numbers, keywords, built-in
116 -- functions, constants, built-in libraries, identifiers, labels, and operators.
117 -- Even though constants, built-in functions, and built-in libraries are subsets
118 -- of identifiers, Lua programmers find it helpful for the lexer to distinguish
119 -- between them all. It is perfectly acceptable to just recognize keywords and
120 -- identifiers.
122 -- In a lexer, tokens consist of a token name and an LPeg pattern that matches a
123 -- sequence of characters recognized as an instance of that token. Create tokens
124 -- using the [`lexer.token()`]() function. Let us examine the "whitespace" token
125 -- defined in the template shown earlier:
127 -- local ws = token(l.WHITESPACE, l.space^1)
129 -- At first glance, the first argument does not appear to be a string name and
130 -- the second argument does not appear to be an LPeg pattern. Perhaps you
131 -- expected something like:
133 -- local ws = token('whitespace', S('\t\v\f\n\r ')^1)
135 -- The `lexer` (`l`) module actually provides a convenient list of common token
136 -- names and common LPeg patterns for you to use. Token names include
137 -- [`lexer.DEFAULT`](), [`lexer.WHITESPACE`](), [`lexer.COMMENT`](),
138 -- [`lexer.STRING`](), [`lexer.NUMBER`](), [`lexer.KEYWORD`](),
139 -- [`lexer.IDENTIFIER`](), [`lexer.OPERATOR`](), [`lexer.ERROR`](),
140 -- [`lexer.PREPROCESSOR`](), [`lexer.CONSTANT`](), [`lexer.VARIABLE`](),
141 -- [`lexer.FUNCTION`](), [`lexer.CLASS`](), [`lexer.TYPE`](), [`lexer.LABEL`](),
142 -- [`lexer.REGEX`](), and [`lexer.EMBEDDED`](). Patterns include
143 -- [`lexer.any`](), [`lexer.ascii`](), [`lexer.extend`](), [`lexer.alpha`](),
144 -- [`lexer.digit`](), [`lexer.alnum`](), [`lexer.lower`](), [`lexer.upper`](),
145 -- [`lexer.xdigit`](), [`lexer.cntrl`](), [`lexer.graph`](), [`lexer.print`](),
146 -- [`lexer.punct`](), [`lexer.space`](), [`lexer.newline`](),
147 -- [`lexer.nonnewline`](), [`lexer.nonnewline_esc`](), [`lexer.dec_num`](),
148 -- [`lexer.hex_num`](), [`lexer.oct_num`](), [`lexer.integer`](),
149 -- [`lexer.float`](), and [`lexer.word`](). You may use your own token names if
150 -- none of the above fit your language, but an advantage to using predefined
151 -- token names is that your lexer's tokens will inherit the universal syntax
152 -- highlighting color theme used by your text editor.
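-- To make the idea of a token concrete, here is a minimal sketch of a
-- `token()`-like helper written with plain LPeg. It assumes a token can be
-- modeled as a constant capture of the token name, the pattern itself, and a
-- position capture. The helper name `token_sketch` is hypothetical, and the
-- sketch is not presented as this module's actual implementation.

```lua
local lpeg = require('lpeg')
local S, Cc, Cp = lpeg.S, lpeg.Cc, lpeg.Cp

-- Hypothetical helper: yields the token name and the position just after the
-- matched text.
local function token_sketch(name, patt)
  return Cc(name) * patt * Cp()
end

-- The same character set as the hand-written whitespace example above.
local space = S('\t\v\f\n\r ')
local ws = token_sketch('whitespace', space^1)

-- Because the pattern has captures, matching returns them instead of the end
-- position: first the token name, then the position after the whitespace.
local name, pos = ws:match('  \tfoo')
```

-- A name paired with an end position is enough for a caller to know which
-- style to apply to which stretch of text.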
154 -- #### Example Tokens
156 -- So, how might you define other tokens like comments, strings, and keywords?
157 -- Here are some examples.
159 -- **Comments**
161 -- Line-style comments with a prefix character(s) are easy to express with LPeg:
163 -- local shell_comment = token(l.COMMENT, '#' * l.nonnewline^0)
164 -- local c_line_comment = token(l.COMMENT, '//' * l.nonnewline_esc^0)
166 -- The comments above start with a '#' or "//" and go to the end of the line.
167 -- The second comment recognizes the next line also as a comment if the current
168 -- line ends with a '\' escape character.
170 -- C-style "block" comments with a start and end delimiter are also easy to
171 -- express:
173 -- local c_comment = token(l.COMMENT, '/*' * (l.any - '*/')^0 * P('*/')^-1)
175 -- This comment starts with a "/\*" sequence and contains anything up to and
176 -- including an ending "\*/" sequence. The ending "\*/" is optional so the lexer
177 -- can recognize unfinished comments as comments and highlight them properly.
179 -- **Strings**
181 -- It is tempting to think that a string is not much different from the block
182 -- comment shown above in that both have start and end delimiters:
184 -- local dq_str = '"' * (l.any - '"')^0 * P('"')^-1
185 -- local sq_str = "'" * (l.any - "'")^0 * P("'")^-1
186 -- local simple_string = token(l.STRING, dq_str + sq_str)
188 -- However, most programming languages allow escape sequences in strings such
189 -- that a sequence like "\\"" in a double-quoted string indicates that the
190 -- '"' is not the end of the string. The above token incorrectly matches
191 -- such a string. Instead, use the [`lexer.delimited_range()`]() convenience
192 -- function.
194 -- local dq_str = l.delimited_range('"')
195 -- local sq_str = l.delimited_range("'")
196 -- local string = token(l.STRING, dq_str + sq_str)
198 -- In this case, the lexer treats '\' as an escape character in a string
199 -- sequence.
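-- The difference between the naive string pattern and an escape-aware one can
-- be sketched with plain LPeg. The escape-aware pattern below is only a
-- simplified stand-in for what `lexer.delimited_range()` provides, under the
-- assumption that '\' escapes exactly one following character; the real
-- function may differ in detail.

```lua
local lpeg = require('lpeg')
local P, S = lpeg.P, lpeg.S

-- Naive version: stops at the first '"', even an escaped one.
local naive_dq = P('"') * (P(1) - P('"'))^0 * P('"')^-1

-- Escape-aware version: a backslash always consumes the character after it.
local escape = P('\\') * P(1)
local plain = P(1) - S('"\\')
local escaped_dq = P('"') * (escape + plain)^0 * P('"')^-1

-- The subject is the eight characters: "a\"b" c
local subject = [["a\"b" c]]
local naive_end = naive_dq:match(subject)     -- ends after the escaped quote
local escaped_end = escaped_dq:match(subject) -- ends after the closing quote
```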
201 -- **Keywords**
203 -- Instead of matching _n_ keywords with _n_ `P('keyword_`_`n`_`')` ordered
204 -- choices, use another convenience function: [`lexer.word_match()`](). It is
205 -- much easier and more efficient to write word matches like:
207 -- local keyword = token(l.KEYWORD, l.word_match{
208 --   'keyword_1', 'keyword_2', ..., 'keyword_n'
209 -- })
211 -- local case_insensitive_keyword = token(l.KEYWORD, l.word_match({
212 --   'KEYWORD_1', 'keyword_2', ..., 'KEYword_n'
213 -- }, nil, true))
215 -- local hyphened_keyword = token(l.KEYWORD, l.word_match({
216 --   'keyword-1', 'keyword-2', ..., 'keyword-n'
217 -- }, '-'))
219 -- By default, characters considered to be in keywords are in the set of
220 -- alphanumeric characters and underscores. The last token demonstrates how to
221 -- allow '-' (hyphen) characters to be in keywords as well.
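-- One way to picture why a word matcher beats a long chain of ordered choices
-- is the sketch below: consume a whole word once, then look it up in a set.
-- The helper `word_match_sketch` is hypothetical and only mirrors the three
-- arguments used in the examples above; it is not this module's
-- `word_match()` itself.

```lua
local lpeg = require('lpeg')
local P, R, S, C, Cmt = lpeg.P, lpeg.R, lpeg.S, lpeg.C, lpeg.Cmt

-- Hypothetical helper for illustration only.
local function word_match_sketch(words, extra_chars, case_insensitive)
  -- Build a set for constant-time membership tests.
  local set = {}
  for _, word in ipairs(words) do
    set[case_insensitive and word:lower() or word] = true
  end
  -- Word characters: alphanumerics and underscores, plus any extras.
  local word_char = R('az', 'AZ', '09') + P('_')
  if extra_chars then word_char = word_char + S(extra_chars) end
  -- Consume a whole word, then accept it only if it is in the set.
  return Cmt(C(word_char^1), function(_, index, word)
    if case_insensitive then word = word:lower() end
    return set[word] and index or nil
  end)
end

local keyword = word_match_sketch{'if', 'end'}
local sql_keyword = word_match_sketch({'SELECT'}, nil, true)
local css_property = word_match_sketch({'font-size'}, '-')
```

-- Because the whole word is consumed before the lookup, "iffy" does not match
-- the keyword "if", which a bare `P('if')` would accept as a prefix.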
223 -- **Numbers**
225 -- Most programming languages have the same format for integer and float tokens,
226 -- so it might be as simple as using a couple of predefined LPeg patterns:
228 -- local number = token(l.NUMBER, l.float + l.integer)
230 -- However, some languages allow postfix characters on integers.
232 -- local integer = P('-')^-1 * (l.dec_num * S('lL')^-1)
233 -- local number = token(l.NUMBER, l.float + l.hex_num + integer)
235 -- Your language may need other tweaks, but it is up to you how fine-grained you
236 -- want your highlighting to be. After all, you are not writing a compiler or
237 -- interpreter!
239 -- ### Rules
241 -- Programming languages have grammars, which specify valid token structure. For
242 -- example, comments usually cannot appear within a string. Grammars consist of
243 -- rules, which are simply combinations of tokens. Recall from the lexer
244 -- template the `_rules` table, which defines all the rules used by the lexer
245 -- grammar:
247 -- M._rules = {
248 --   {'whitespace', ws},
249 -- }
251 -- Each entry in a lexer's `_rules` table consists of a rule name and its
252 -- associated pattern. Rule names are completely arbitrary and serve only to
253 -- identify and distinguish between different rules. Rule order is important: if
254 -- text does not match the first rule, the lexer tries the second rule, and so
255 -- on. This simple grammar says to match whitespace tokens under a rule named
256 -- "whitespace".
258 -- To illustrate the importance of rule order, here is an example of a
259 -- simplified Lua grammar:
261 -- M._rules = {
262 --   {'whitespace', ws},
263 --   {'keyword', keyword},
264 --   {'identifier', identifier},
265 --   {'string', string},
266 --   {'comment', comment},
267 --   {'number', number},
268 --   {'label', label},
269 --   {'operator', operator},
270 -- }
272 -- Note how identifiers come after keywords. In Lua, as with most programming
273 -- languages, the characters allowed in keywords and identifiers are in the same
274 -- set (alphanumerics plus underscores). If the lexer specified the "identifier"
275 -- rule before the "keyword" rule, all keywords would match identifiers and thus
276 -- incorrectly highlight as identifiers instead of keywords. The same idea
277 -- applies to function, constant, etc. tokens that you may want to distinguish
278 -- between: their rules should come before identifiers.
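-- The effect of rule order can be reproduced with LPeg's ordered choice alone.
-- The sketch below uses two simplified, hypothetical rules (a two-word keyword
-- list and a basic identifier) and tags each with a constant capture so the
-- winning rule is visible.

```lua
local lpeg = require('lpeg')
local P, R, Cc = lpeg.P, lpeg.R, lpeg.Cc

local word_char = R('az', 'AZ', '09') + P('_')
-- A keyword must not be followed by another word character.
local keyword = Cc('keyword') * (P('end') + P('if')) * -word_char
local identifier = Cc('identifier') * (R('az', 'AZ') + P('_')) * word_char^0

-- Ordered choice tries the left alternative first.
local keyword_first = keyword + identifier
local identifier_first = identifier + keyword

local good = keyword_first:match('end')           -- keyword rule wins
local bad = identifier_first:match('end')         -- identifier rule shadows it
local still_identifier = keyword_first:match('ending')
```

-- With identifiers first, the keyword alternative is never reached for any
-- text made of word characters.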
280 -- So what about text that does not match any rules? For example in Lua, the '!'
281 -- character is meaningless outside a string or comment. Normally the lexer
282 -- skips over such text. If instead you want to highlight these "syntax errors",
283 -- add an additional end rule:
285 -- M._rules = {
286 --   {'whitespace', ws},
287 --   {'error', token(l.ERROR, l.any)},
288 -- }
290 -- This identifies and highlights any character not matched by an existing
291 -- rule as a `lexer.ERROR` token.
293 -- Even though the rules defined in the examples above contain a single token,
294 -- rules may consist of multiple tokens. For example, a rule for an HTML tag
295 -- could consist of a tag token followed by an arbitrary number of attribute
296 -- tokens, allowing the lexer to highlight all tokens separately. The rule might
297 -- look something like this:
299 -- {'tag', tag_start * (ws * attributes)^0 * tag_end^-1}
301 -- Note however that lexers with complex rules like these are more prone to lose
302 -- track of their state.
304 -- ### Summary
306 -- Lexers primarily consist of tokens and grammar rules. At your disposal are a
307 -- number of convenience patterns and functions for rapidly creating a lexer. If
308 -- you choose to use predefined token names for your tokens, you do not have to
309 -- define how the lexer highlights them. The tokens will inherit the default
310 -- syntax highlighting color theme your editor uses.
312 -- ## Advanced Techniques
314 -- ### Styles and Styling
316 -- The most basic form of syntax highlighting is assigning different colors to
317 -- different tokens. Instead of highlighting with just colors, Scintilla allows
318 -- for more rich highlighting, or "styling", with different fonts, font sizes,
319 -- font attributes, and foreground and background colors, just to name a few.
320 -- The unit of this rich highlighting is called a "style". Styles are simply
321 -- strings of comma-separated property settings. By default, lexers associate
322 -- predefined token names like `lexer.WHITESPACE`, `lexer.COMMENT`,
323 -- `lexer.STRING`, etc. with particular styles as part of a universal color
324 -- theme. These predefined styles include [`lexer.STYLE_CLASS`](),
325 -- [`lexer.STYLE_COMMENT`](), [`lexer.STYLE_CONSTANT`](),
326 -- [`lexer.STYLE_ERROR`](), [`lexer.STYLE_EMBEDDED`](),
327 -- [`lexer.STYLE_FUNCTION`](), [`lexer.STYLE_IDENTIFIER`](),
328 -- [`lexer.STYLE_KEYWORD`](), [`lexer.STYLE_LABEL`](), [`lexer.STYLE_NUMBER`](),
329 -- [`lexer.STYLE_OPERATOR`](), [`lexer.STYLE_PREPROCESSOR`](),
330 -- [`lexer.STYLE_REGEX`](), [`lexer.STYLE_STRING`](), [`lexer.STYLE_TYPE`](),
331 -- [`lexer.STYLE_VARIABLE`](), and [`lexer.STYLE_WHITESPACE`](). As with
332 -- predefined token names and LPeg patterns, you may define your own styles. At
333 -- their core, styles are just strings, so you may create new ones and/or modify
334 -- existing ones. Each style consists of the following comma-separated settings:
336 -- Setting | Description
337 -- ---------------|------------
338 -- font:_name_ | The name of the font the style uses.
339 -- size:_int_ | The size of the font the style uses.
340 -- [not]bold | Whether or not the font face is bold.
341 -- weight:_int_ | The weight or boldness of a font, between 1 and 999.
342 -- [not]italics | Whether or not the font face is italic.
343 -- [not]underlined| Whether or not the font face is underlined.
344 -- fore:_color_ | The foreground color of the font face.
345 -- back:_color_ | The background color of the font face.
346 -- [not]eolfilled | Does the background color extend to the end of the line?
347 -- case:_char_ | The case of the font ('u': upper, 'l': lower, 'm': normal).
348 -- [not]visible | Whether or not the text is visible.
349 -- [not]changeable| Whether the text is changeable or read-only.
351 -- Specify font colors in either "#RRGGBB" format, "0xBBGGRR" format, or the
352 -- decimal equivalent of the latter. As with token names, LPeg patterns, and
353 -- styles, there is a set of predefined color names, but they vary depending on
354 -- the current color theme in use. Therefore, it is generally not a good idea to
355 -- manually define colors within styles in your lexer since they might not fit
356 -- into a user's chosen color theme. Try to refrain from even using predefined
357 -- colors in a style because that color may be theme-specific. Instead, the best
358 -- practice is to either use predefined styles or derive new color-agnostic
359 -- styles from predefined ones. For example, Lua "longstring" tokens use the
360 -- existing `lexer.STYLE_STRING` style instead of defining a new one.
362 -- #### Example Styles
364 -- Defining styles is pretty straightforward. An empty style that inherits the
365 -- default theme settings is simply an empty string:
367 -- local style_nothing = ''
369 -- A similar style but with a bold font face looks like this:
371 -- local style_bold = 'bold'
373 -- If you want the same style, but also with an italic font face, define the new
374 -- style in terms of the old one:
376 -- local style_bold_italic = style_bold..',italics'
378 -- This allows you to derive new styles from predefined ones without having to
379 -- rewrite them. This operation leaves the old style unchanged. Thus if you
380 -- had a "static variable" token whose style you wanted to base on
381 -- `lexer.STYLE_VARIABLE`, it would probably look like:
383 -- local style_static_var = l.STYLE_VARIABLE..',italics'
385 -- The color theme files in the *lexers/themes/* folder give more examples of
386 -- style definitions.
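-- Since styles are plain comma-separated strings, deriving and inspecting them
-- needs nothing beyond Lua's string library. The sketch below splits a style
-- into its settings. The `parse_style` helper is hypothetical, and the rule
-- that a later setting overrides an earlier one is an assumption of this
-- sketch, not a documented guarantee of this module.

```lua
-- Hypothetical helper for illustration only; not part of this module.
local function parse_style(style)
  local settings = {}
  for setting in style:gmatch('[^,]+') do
    local key, value = setting:match('^(%w+):(.+)$')
    local negated = setting:match('^not(%w+)$')
    if key then
      settings[key] = value         -- e.g. "size:12"
    elseif negated then
      settings[negated] = false     -- e.g. "notbold"
    else
      settings[setting] = true      -- e.g. "italics"
    end
  end
  return settings
end

-- Derive a new style by concatenation; the old string is left untouched.
local style_bold = 'bold'
local style_bold_italic = style_bold .. ',italics'
local parsed = parse_style(style_bold_italic .. ',size:12')
local overridden = parse_style('bold,notbold')
```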
388 -- ### Token Styles
390 -- Lexers use the `_tokenstyles` table to assign tokens to particular styles.
391 -- Recall the token definition and `_tokenstyles` table from the lexer template:
393 -- local ws = token(l.WHITESPACE, l.space^1)
395 -- ...
397 -- M._tokenstyles = {
399 -- }
401 -- Why is a style not assigned to the `lexer.WHITESPACE` token? As mentioned
402 -- earlier, lexers automatically associate tokens that use predefined token
403 -- names with a particular style. Only tokens with custom token names need
404 -- manual style associations. As an example, consider a custom whitespace token:
406 -- local ws = token('custom_whitespace', l.space^1)
408 -- Assigning a style to this token looks like:
410 -- M._tokenstyles = {
411 --   custom_whitespace = l.STYLE_WHITESPACE
412 -- }
414 -- Do not confuse token names with rule names. They are completely different
415 -- entities. In the example above, the lexer assigns the "custom_whitespace"
416 -- token the existing style for `WHITESPACE` tokens. If instead you want to
417 -- color the background of whitespace a shade of grey, it might look like:
419 -- local custom_style = l.STYLE_WHITESPACE..',back:$(color.grey)'
420 -- M._tokenstyles = {
421 --   custom_whitespace = custom_style
422 -- }
424 -- Notice that the lexer performs Scintilla/SciTE-style "$()" property expansion.
425 -- You may also use "%()". Remember to refrain from assigning specific colors in
426 -- styles, but in this case, all user color themes probably define the
427 -- "color.grey" property.
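-- Property expansion of this kind can be pictured as a single `string.gsub`
-- pass. The sketch below is only an approximation: the `props` table, its
-- "#808080" value, and the choice to expand unknown properties to an empty
-- string are all assumptions made for illustration, and nested references are
-- not handled.

```lua
-- Hypothetical property store; a real theme would supply these values.
local props = {['color.grey'] = '#808080'}

-- Replace each "$(name)" or "%(name)" with the value of property `name`.
local function expand(style)
  return (style:gsub('[$%%]%b()', function(ref)
    local name = ref:sub(3, -2) -- strip the leading "$(" or "%(" and the ")"
    return props[name] or ''
  end))
end

local dollar_form = expand('back:$(color.grey)')
local percent_form = expand('back:%(color.grey)')
local unknown = expand('fore:$(color.missing)')
```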
429 -- ### Line Lexers
431 -- By default, lexers match the arbitrary chunks of text passed to them by
432 -- Scintilla. These chunks may be a full document, only the visible part of a
433 -- document, or even just portions of lines. Some lexers need to match whole
434 -- lines. For example, a lexer for the output of a file "diff" needs to know if
435 -- the line started with a '+' or '-' and then style the entire line
436 -- accordingly. To indicate that your lexer matches by line, use the
437 -- `_LEXBYLINE` field:
439 -- M._LEXBYLINE = true
441 -- Now the input text for the lexer is a single line at a time. Keep in mind
442 -- that line lexers do not have the ability to look ahead at subsequent lines.
444 -- ### Embedded Lexers
446 -- Lexers embed within one another very easily, requiring minimal effort. In the
447 -- following sections, the lexer being embedded is called the "child" lexer and
448 -- the lexer a child is being embedded in is called the "parent". For example,
449 -- consider an HTML lexer and a CSS lexer. Either lexer stands alone for styling
450 -- their respective HTML and CSS files. However, CSS can be embedded inside
451 -- HTML. In this specific case, the CSS lexer is the "child" lexer with the HTML
452 -- lexer being the "parent". Now consider an HTML lexer and a PHP lexer. This
453 -- sounds a lot like the case with CSS, but there is a subtle difference: PHP
454 -- _embeds itself_ into HTML while CSS is _embedded in_ HTML. This fundamental
455 -- difference results in two types of embedded lexers: a parent lexer that
456 -- embeds other child lexers in it (like HTML embedding CSS), and a child lexer
457 -- that embeds itself within a parent lexer (like PHP embedding itself in HTML).
459 -- #### Parent Lexer
461 -- Before embedding a child lexer into a parent lexer, the parent lexer needs to
462 -- load the child lexer. This is done with the [`lexer.load()`]() function. For
463 -- example, loading the CSS lexer within the HTML lexer looks like:
465 -- local css = l.load('css')
467 -- The next part of the embedding process is telling the parent lexer when to
468 -- switch over to the child lexer and when to switch back. The lexer refers to
469 -- these indications as the "start rule" and "end rule", respectively; both are
470 -- just LPeg patterns. Continuing with the HTML/CSS example, the transition from
471 -- HTML to CSS is when the lexer encounters a "style" tag with a "type"
472 -- attribute whose value is "text/css":
474 -- local css_tag = P('<style') * P(function(input, index)
475 --   if input:find('^[^>]+type="text/css"', index) then
476 --     return index
477 --   end
478 -- end)
480 -- This pattern looks for the beginning of a "style" tag and searches its
481 -- attribute list for the text "`type="text/css"`". (In this simplified example,
482 -- the Lua pattern does not allow for whitespace around the '=', nor does it
483 -- consider that using single quotes is valid.) If there is a match, the
484 -- functional pattern returns a value instead of `nil`. In this case, the value
485 -- returned does not matter because we ultimately want to style the "style" tag
486 -- as an HTML tag, so the actual start rule looks like this:
488 -- local css_start_rule = #css_tag * tag
490 -- Now that the parent knows when to switch to the child, it needs to know when
491 -- to switch back. In the case of HTML/CSS, the switch back occurs when the
492 -- lexer encounters an ending "style" tag, though the lexer should still style
493 -- the tag as an HTML tag:
495 -- local css_end_rule = #P('</style>') * tag
497 -- Once the parent loads the child lexer and defines the child's start and end
498 -- rules, it embeds the child with the [`lexer.embed_lexer()`]() function:
500 -- l.embed_lexer(M, css, css_start_rule, css_end_rule)
502 -- The first parameter is the parent lexer object to embed the child in, which
503 -- in this case is `M`. The other three parameters are the child lexer object
504 -- loaded earlier followed by its start and end rules.
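-- The start and end rules from this section can be tried out on their own
-- with plain LPeg. In the sketch below, `tag` is a hypothetical stand-in for
-- the HTML lexer's tag token (it simply consumes text up to the next '>'),
-- and no actual embedding takes place; the point is only to show when each
-- rule fires.

```lua
local lpeg = require('lpeg')
local P = lpeg.P

-- Succeeds, consuming only "<style", when the rest of the tag contains
-- type="text/css" before the closing '>'.
local css_tag = P('<style') * P(function(input, index)
  if input:find('^[^>]+type="text/css"', index) then
    return index
  end
end)

-- Hypothetical stand-in for the HTML lexer's `tag` token.
local tag = P('<') * (P(1) - P('>'))^0 * P('>')

-- Look ahead for the condition, then consume the whole tag as HTML.
local css_start_rule = #css_tag * tag
local css_end_rule = #P('</style>') * tag

local opening = '<style type="text/css">'
local start_end = css_start_rule:match(opening .. 'body{}')
local not_css = css_start_rule:match('<style media="print">')
local close_end = css_end_rule:match('</style>')
```

-- The lookahead (`#`) keeps the decision separate from the styling: the rule
-- fires only for matching tags, yet the tag text is still consumed by `tag`.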
506 -- #### Child Lexer
508 -- The process for instructing a child lexer to embed itself into a parent is
509 -- very similar to embedding a child into a parent: first, load the parent lexer
510 -- into the child lexer with the [`lexer.load()`]() function and then create
511 -- start and end rules for the child lexer. However, in this case, swap the
512 -- lexer object arguments to [`lexer.embed_lexer()`](). For example, in the PHP
513 -- lexer:
515 -- local html = l.load('html')
516 -- local php_start_rule = token('php_tag', '<?php ')
517 -- local php_end_rule = token('php_tag', '?>')
518 -- l.embed_lexer(html, M, php_start_rule, php_end_rule)
520 -- ### Lexers with Complex State
522 -- A vast majority of lexers are not stateful and can operate on any chunk of
523 -- text in a document. However, there may be rare cases where a lexer does need
524 -- to keep track of some sort of persistent state. Rather than using `lpeg.P`
525 -- function patterns that set state variables, it is recommended to make use of
526 -- Scintilla's built-in, per-line state integers via [`lexer.line_state`](). It
527 -- was designed to accommodate up to 32 bit flags for tracking state.
528 -- [`lexer.line_from_position()`]() will return the line for any position given
529 -- to an `lpeg.P` function pattern. (Any positions derived from that position
530 -- argument will also work.)
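-- Packing several on/off facts into one per-line integer can be sketched with
-- plain arithmetic, which works the same across Lua versions. Everything
-- named below is hypothetical: the `line_state` table merely stands in for
-- the per-line storage, and the two flags are invented for this example.

```lua
-- Hypothetical stand-in for per-line state storage, indexed by line number.
local line_state = {}

-- Hypothetical flags; each must be a distinct power of two.
local IN_HEREDOC, IN_RAW_BLOCK = 1, 2

local function has_flag(state, flag)
  return math.floor(state / flag) % 2 == 1
end

local function set_flag(state, flag)
  return has_flag(state, flag) and state or state + flag
end

local function clear_flag(state, flag)
  return has_flag(state, flag) and state - flag or state
end

-- Set both flags on line 10, then clear the first one again.
line_state[10] = set_flag(line_state[10] or 0, IN_HEREDOC)
line_state[10] = set_flag(line_state[10], IN_RAW_BLOCK)
line_state[10] = clear_flag(line_state[10], IN_HEREDOC)
```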
532 -- Writing stateful lexers is beyond the scope of this document.
534 -- ## Code Folding
536 -- When reading source code, it is occasionally helpful to temporarily hide
537 -- blocks of code like functions, classes, comments, etc. This is the concept of
538 -- "folding". In the Textadept and SciTE editors for example, little indicators
539 -- in the editor margins appear next to code that can be folded at places called
540 -- "fold points". When the user clicks an indicator, the editor hides the code
541 -- associated with the indicator until the user clicks the indicator again. The
542 -- lexer specifies these fold points and what code exactly to fold.
544 -- The fold points for most languages occur on keywords or character sequences.
545 -- Examples of fold keywords are "if" and "end" in Lua and examples of fold
546 -- character sequences are '{', '}', "/\*", and "\*/" in C for code block and
547 -- comment delimiters, respectively. However, these fold points cannot occur
548 -- just anywhere. For example, lexers should not recognize fold keywords that
549 -- appear within strings or comments. The lexer's `_foldsymbols` table allows
550 -- you to conveniently define fold points with such granularity. For example,
551 -- consider C:
553 -- M._foldsymbols = {
554 --   [l.OPERATOR] = {['{'] = 1, ['}'] = -1},
555 --   [l.COMMENT] = {['/*'] = 1, ['*/'] = -1},
556 --   _patterns = {'[{}]', '/%*', '%*/'}
557 -- }
559 -- The first assignment states that any '{' or '}' that the lexer recognizes as
560 -- a `lexer.OPERATOR` token is a fold point. The integer `1` indicates the
561 -- match is a beginning fold point and `-1` indicates the match is an ending
562 -- fold point. Likewise, the second assignment states that any "/\*" or "\*/"
563 -- that the lexer recognizes as part of a `lexer.COMMENT` token is a fold point.
564 -- The lexer does not consider any occurrences of these characters outside their
565 -- defined tokens (such as in a string) as fold points. Finally, every
566 -- `_foldsymbols` table must have a `_patterns` field that contains a list of
567 -- [Lua patterns][] that match fold points. If the lexer encounters text that
568 -- matches one of those patterns, the lexer looks up the matched text in its
569 -- token's table in order to determine whether or not the text is a fold point.
570 -- In the example above, the first Lua pattern matches any '{' or '}'
571 -- characters. When the lexer comes across one of those characters, it checks if
572 -- the match is a `lexer.OPERATOR` token. If so, the lexer identifies the match
573 -- as a fold point. The same idea applies for the other patterns. (The '%' is in
574 -- the other patterns because '\*' is a special character in Lua patterns that
575 -- needs escaping.) How do you specify fold keywords? Here is an example for
576 -- Lua:
578 -- M._foldsymbols = {
579 --   [l.KEYWORD] = {
580 --     ['if'] = 1, ['do'] = 1, ['function'] = 1,
581 --     ['end'] = -1, ['repeat'] = 1, ['until'] = -1
582 --   },
583 --   _patterns = {'%l+'}
584 -- }
586 -- Any time the lexer encounters a lower case word, if that word is a
587 -- `lexer.KEYWORD` token and in the associated list of fold points, the lexer
588 -- identifies the word as a fold point.
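-- The lookup described above can be approximated in a few lines of plain Lua.
-- The sketch below sums the fold values found on one line. It makes a large
-- simplifying assumption: every match is treated as if the lexer had already
-- tagged it with the given token name, so the "is this really a keyword
-- token?" check is skipped. The `fold_delta` helper and the plain 'keyword'
-- table key are hypothetical.

```lua
-- A fold table shaped like the Lua example above, keyed by a plain string.
local foldsymbols = {
  keyword = {
    ['if'] = 1, ['do'] = 1, ['function'] = 1,
    ['end'] = -1, ['repeat'] = 1, ['until'] = -1
  },
  _patterns = {'%l+'}
}

-- Hypothetical helper: net change in fold level contributed by one line.
local function fold_delta(line, token_name)
  local symbols = foldsymbols[token_name]
  local delta = 0
  for _, patt in ipairs(foldsymbols._patterns) do
    for match in line:gmatch(patt) do
      delta = delta + (symbols[match] or 0)
    end
  end
  return delta
end

local opens = fold_delta('if x then', 'keyword')
local closes = fold_delta('end', 'keyword')
local balanced = fold_delta('for i = 1, 3 do print(i) end', 'keyword')
```

-- A positive result means the line opens a fold, a negative one means it
-- closes a fold, and zero leaves the fold level unchanged.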
590 -- If your lexer has case-insensitive keywords as fold points, simply add a
591 -- `_case_insensitive = true` option to the `_foldsymbols` table and specify
592 -- keywords in lower case.
594 -- If your lexer needs to do some additional processing to determine if a match
595 -- is a fold point, assign a function that returns an integer. Returning `1` or
596 -- `-1` indicates the match is a fold point. Returning `0` indicates it is not.
597 -- For example:
599 -- local function fold_strange_token(text, pos, line, s, match)
600 --   if ... then
601 --     return 1 -- beginning fold point
602 --   elseif ... then
603 --     return -1 -- ending fold point
604 --   end
605 --   return 0
606 -- end
608 -- M._foldsymbols = {
609 --   ['strange_token'] = {['|'] = fold_strange_token},
610 --   _patterns = {'|'}
611 -- }
613 -- Any time the lexer encounters a '|' that is a "strange_token", it calls the
614 -- `fold_strange_token` function to determine if '|' is a fold point. The lexer
615 -- calls these functions with the following arguments: the text to identify fold
616 -- points in, the beginning position of the current line in the text to fold,
617 -- the current line's text, the position in the current line the matched text
618 -- starts at, and the matched text itself.
620 -- [Lua patterns]: http://www.lua.org/manual/5.2/manual.html#6.4.1
622 -- ### Fold by Indentation
624 -- Some languages have significant whitespace and/or no delimiters that indicate
625 -- fold points. If your lexer falls into this category and you would like to
626 -- mark fold points based on changes in indentation, use the
627 -- `_FOLDBYINDENTATION` field:
629 -- M._FOLDBYINDENTATION = true
631 -- ## Using Lexers
633 -- ### Textadept
635 -- Put your lexer in your *~/.textadept/lexers/* directory so you do not
636 -- overwrite it when upgrading Textadept. Also, lexers in this directory
637 -- override default lexers. Thus, Textadept loads a user *lua* lexer instead of
638 -- the default *lua* lexer. This is convenient for tweaking a default lexer to
639 -- your liking. Then add a [file type][] for your lexer if necessary.
641 -- [file type]: _M.textadept.file_types.html
643 -- ### SciTE
645 -- Create a *.properties* file for your lexer and `import` it in either your
646 -- *SciTEUser.properties* or *SciTEGlobal.properties*. The contents of the
647 -- *.properties* file should contain:
649 -- file.patterns.[lexer_name]=[file_patterns]
650 -- lexer.$(file.patterns.[lexer_name])=[lexer_name]
652 -- where `[lexer_name]` is the name of your lexer (minus the *.lua* extension)
653 -- and `[file_patterns]` is a set of file extensions to use your lexer for.
--
-- Please note that Lua lexers ignore any styling information in *.properties*
-- files. Your theme file in the *lexers/themes/* directory contains styling
-- information.
--
-- ## Considerations
--
-- ### Performance
--
-- There might be some slight overhead when initializing a lexer, but loading a
-- file from disk into Scintilla is usually more expensive. On modern computer
-- systems, I see no difference in speed between LPeg lexers and Scintilla's C++
-- ones. Optimize lexers for speed by re-arranging rules in the `_rules` table
-- so that the most common rules match first. Do keep in mind that order matters
-- for similar rules.
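--
-- As a sketch of why order matters, assume `ws`, `keyword`, and `identifier`
-- are token patterns defined elsewhere in the lexer (the names are
-- illustrative only). An identifier pattern would typically also match any
-- keyword, so the keyword rule has to stay ahead of it however common
-- identifiers are:
--
--     M._rules = {
--       {'whitespace', ws},
--       {'keyword', keyword},
--       {'identifier', identifier},
--     }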
--
-- ### Limitations
--
-- Embedded preprocessor languages like PHP cannot completely embed in their
-- parent languages in that the parent's tokens do not support start and end
-- rules. This mostly goes unnoticed, but code like
--
--     <div id="<?php echo $id; ?>">
--
-- or
--
--     <div <?php if ($odd) { echo 'class="odd"'; } ?>>
--
-- will not style correctly.
--
-- ### Troubleshooting
--
-- Errors in lexers can be tricky to debug. Lexers print Lua errors to
-- `io.stderr` and `_G.print()` statements to `io.stdout`. Running your editor
-- from a terminal is the easiest way to see errors as they occur.
--
-- ### Risks
--
-- Poorly written lexers have the ability to crash Scintilla (and thus its
-- containing application), so unsaved data might be lost. However, I have only
-- observed these crashes in early lexer development, when syntax errors or
-- pattern errors are present. Once the lexer actually starts styling text
-- (either correctly or incorrectly, it does not matter), I have not observed
-- any crashes.
--
-- ### Acknowledgements
--
-- Thanks to Peter Odding for his [lexer post][] on the Lua mailing list
-- that inspired me, and thanks to Roberto Ierusalimschy for LPeg.
--
-- [lexer post]: http://lua-users.org/lists/lua-l/2007-04/msg00116.html
-- @field LEXERPATH (string)
--   The path used to search for a lexer to load.
--   Identical in format to Lua's `package.path` string.
--   The default value is `package.path`.
-- @field DEFAULT (string)
--   The token name for default tokens.
-- @field WHITESPACE (string)
--   The token name for whitespace tokens.
-- @field COMMENT (string)
--   The token name for comment tokens.
-- @field STRING (string)
--   The token name for string tokens.
-- @field NUMBER (string)
--   The token name for number tokens.
-- @field KEYWORD (string)
--   The token name for keyword tokens.
-- @field IDENTIFIER (string)
--   The token name for identifier tokens.
-- @field OPERATOR (string)
--   The token name for operator tokens.
-- @field ERROR (string)
--   The token name for error tokens.
-- @field PREPROCESSOR (string)
--   The token name for preprocessor tokens.
-- @field CONSTANT (string)
--   The token name for constant tokens.
-- @field VARIABLE (string)
--   The token name for variable tokens.
-- @field FUNCTION (string)
--   The token name for function tokens.
-- @field CLASS (string)
--   The token name for class tokens.
-- @field TYPE (string)
--   The token name for type tokens.
-- @field LABEL (string)
--   The token name for label tokens.
-- @field REGEX (string)
--   The token name for regex tokens.
-- @field STYLE_CLASS (string)
--   The style typically used for class definitions.
-- @field STYLE_COMMENT (string)
--   The style typically used for code comments.
-- @field STYLE_CONSTANT (string)
--   The style typically used for constants.
-- @field STYLE_ERROR (string)
--   The style typically used for erroneous syntax.
-- @field STYLE_FUNCTION (string)
--   The style typically used for function definitions.
-- @field STYLE_KEYWORD (string)
--   The style typically used for language keywords.
-- @field STYLE_LABEL (string)
--   The style typically used for labels.
-- @field STYLE_NUMBER (string)
--   The style typically used for numbers.
-- @field STYLE_OPERATOR (string)
--   The style typically used for operators.
-- @field STYLE_REGEX (string)
--   The style typically used for regular expression strings.
-- @field STYLE_STRING (string)
--   The style typically used for strings.
-- @field STYLE_PREPROCESSOR (string)
--   The style typically used for preprocessor statements.
-- @field STYLE_TYPE (string)
--   The style typically used for static types.
-- @field STYLE_VARIABLE (string)
--   The style typically used for variables.
-- @field STYLE_WHITESPACE (string)
--   The style typically used for whitespace.
-- @field STYLE_EMBEDDED (string)
--   The style typically used for embedded code.
-- @field STYLE_IDENTIFIER (string)
--   The style typically used for identifier words.
-- @field STYLE_DEFAULT (string)
--   The style all styles are based off of.
-- @field STYLE_LINENUMBER (string)
--   The style used for all margins except fold margins.
-- @field STYLE_BRACELIGHT (string)
--   The style used for highlighted brace characters.
-- @field STYLE_BRACEBAD (string)
--   The style used for unmatched brace characters.
-- @field STYLE_CONTROLCHAR (string)
--   The style used for control characters.
--   Color attributes are ignored.
-- @field STYLE_INDENTGUIDE (string)
--   The style used for indentation guides.
-- @field STYLE_CALLTIP (string)
--   The style used by call tips if [`buffer.call_tip_use_style`]() is set.
--   Only the font name, size, and color attributes are used.
-- @field STYLE_FOLDDISPLAYTEXT (string)
--   The style used for fold display text.
-- @field any (pattern)
--   A pattern that matches any single character.
-- @field ascii (pattern)
--   A pattern that matches any ASCII character (codes 0 to 127).
-- @field extend (pattern)
--   A pattern that matches any ASCII extended character (codes 0 to 255).
-- @field alpha (pattern)
--   A pattern that matches any alphabetic character ('A'-'Z', 'a'-'z').
-- @field digit (pattern)
--   A pattern that matches any digit ('0'-'9').
-- @field alnum (pattern)
--   A pattern that matches any alphanumeric character ('A'-'Z', 'a'-'z',
--   '0'-'9').
-- @field lower (pattern)
--   A pattern that matches any lower case character ('a'-'z').
-- @field upper (pattern)
--   A pattern that matches any upper case character ('A'-'Z').
-- @field xdigit (pattern)
--   A pattern that matches any hexadecimal digit ('0'-'9', 'A'-'F', 'a'-'f').
-- @field cntrl (pattern)
--   A pattern that matches any control character (ASCII codes 0 to 31).
-- @field graph (pattern)
--   A pattern that matches any graphical character ('!' to '~').
-- @field print (pattern)
--   A pattern that matches any printable character (' ' to '~').
-- @field punct (pattern)
--   A pattern that matches any punctuation character ('!' to '/', ':' to '@',
--   '[' to '`', '{' to '~').
-- @field space (pattern)
--   A pattern that matches any whitespace character ('\t', '\v', '\f', '\n',
--   '\r', space).
-- @field newline (pattern)
--   A pattern that matches any set of end of line characters.
-- @field nonnewline (pattern)
--   A pattern that matches any single, non-newline character.
-- @field nonnewline_esc (pattern)
--   A pattern that matches any single, non-newline character or any set of end
--   of line characters escaped with '\'.
-- @field dec_num (pattern)
--   A pattern that matches a decimal number.
-- @field hex_num (pattern)
--   A pattern that matches a hexadecimal number.
-- @field oct_num (pattern)
--   A pattern that matches an octal number.
-- @field integer (pattern)
--   A pattern that matches either a decimal, hexadecimal, or octal number.
-- @field float (pattern)
--   A pattern that matches a floating point number.
-- @field word (pattern)
--   A pattern that matches a typical word. Words begin with a letter or
--   underscore and consist of alphanumeric and underscore characters.
-- @field FOLD_BASE (number)
--   The initial (root) fold level.
-- @field FOLD_BLANK (number)
--   Flag indicating that the line is blank.
-- @field FOLD_HEADER (number)
--   Flag indicating that the line is a fold point.
-- @field fold_level (table, Read-only)
--   Table of fold level bit-masks for line numbers starting from zero.
--   Fold level masks are composed of an integer level combined with any of the
--   following bits:
--
--   * `lexer.FOLD_BASE`
--     The initial fold level.
--   * `lexer.FOLD_BLANK`
--     The line is blank.
--   * `lexer.FOLD_HEADER`
--     The line is a header, or fold point.
-- @field indent_amount (table, Read-only)
--   Table of indentation amounts in character columns, for line numbers
--   starting from zero.
-- @field line_state (table)
--   Table of integer line states for line numbers starting from zero.
--   Line states can be used by lexers for keeping track of persistent states.
-- @field property (table)
--   Map of key-value string pairs.
-- @field property_expanded (table, Read-only)
--   Map of key-value string pairs with `$()` and `%()` variable replacement
--   performed in values.
-- @field property_int (table, Read-only)
--   Map of key-value pairs with values interpreted as numbers, or `0` if not
--   found.
-- @field style_at (table, Read-only)
--   Table of style names at positions in the buffer starting from 1.
module('lexer')]=]

lpeg = require('lpeg')
local lpeg_P, lpeg_R, lpeg_S, lpeg_V = lpeg.P, lpeg.R, lpeg.S, lpeg.V
local lpeg_Ct, lpeg_Cc, lpeg_Cp = lpeg.Ct, lpeg.Cc, lpeg.Cp
local lpeg_Cmt, lpeg_C = lpeg.Cmt, lpeg.C
local lpeg_match = lpeg.match

M.LEXERPATH = package.path

-- Table of loaded lexers.
M.lexers = {}

-- Keep track of the last parent lexer loaded. This lexer's rules are used for
-- proxy lexers (those that load parent and child lexers to embed) that do not
-- declare a parent lexer.
local parent_lexer

if not package.searchpath then
  -- Searches for the given *name* in the given *path*.
  -- This is an implementation of Lua 5.2's `package.searchpath()` function for
  -- Lua 5.1.
  function package.searchpath(name, path)
    local tried = {}
    for part in path:gmatch('[^;]+') do
      local filename = part:gsub('%?', name)
      local f = io.open(filename, 'r')
      if f then f:close() return filename end
      tried[#tried + 1] = ("no file '%s'"):format(filename)
    end
    return nil, table.concat(tried, '\n')
  end
end

-- Adds a rule to a lexer's current ordered list of rules.
-- @param lexer The lexer to add the given rule to.
-- @param id The name associated with this rule. It is used for other lexers
--   to access this particular rule from the lexer's `_RULES` table. It does not
--   have to be the same as the name passed to `token`.
-- @param rule The LPeg pattern of the rule.
local function add_rule(lexer, id, rule)
  if not lexer._RULES then
    lexer._RULES = {}
    -- Contains an ordered list (by numerical index) of rule names. This is used
    -- in conjunction with lexer._RULES for building _TOKENRULE.
    lexer._RULEORDER = {}
  end
  lexer._RULES[id] = rule
  lexer._RULEORDER[#lexer._RULEORDER + 1] = id
end

-- Adds a new Scintilla style to Scintilla.
-- @param lexer The lexer to add the given style to.
-- @param token_name The name of the token associated with this style.
-- @param style A Scintilla style created from `style()`.
-- @see style
local function add_style(lexer, token_name, style)
  local num_styles = lexer._numstyles
  if num_styles == 32 then num_styles = num_styles + 8 end -- skip predefined
  if num_styles >= 255 then print('Too many styles defined (255 MAX)') end
  lexer._TOKENSTYLES[token_name], lexer._numstyles = num_styles, num_styles + 1
  lexer._EXTRASTYLES[token_name] = style
end

-- (Re)constructs `lexer._TOKENRULE`.
-- @param lexer The parent lexer.
local function join_tokens(lexer)
  local patterns, order = lexer._RULES, lexer._RULEORDER
  local token_rule = patterns[order[1]]
  for i = 2, #order do token_rule = token_rule + patterns[order[i]] end
  lexer._TOKENRULE = token_rule + M.token(M.DEFAULT, M.any)
  return lexer._TOKENRULE
end

-- Adds a given lexer and any of its embedded lexers to a given grammar.
-- @param grammar The grammar to add the lexer to.
-- @param lexer The lexer to add.
local function add_lexer(grammar, lexer)
  local token_rule = join_tokens(lexer)
  local lexer_name = lexer._NAME
  for i = 1, #lexer._CHILDREN do
    local child = lexer._CHILDREN[i]
    if child._CHILDREN then add_lexer(grammar, child) end
    local child_name = child._NAME
    local rules = child._EMBEDDEDRULES[lexer_name]
    local rules_token_rule = grammar['__'..child_name] or rules.token_rule
    grammar[child_name] = (-rules.end_rule * rules_token_rule)^0 *
                          rules.end_rule^-1 * lpeg_V(lexer_name)
    local embedded_child = '_'..child_name
    grammar[embedded_child] = rules.start_rule * (-rules.end_rule *
                              rules_token_rule)^0 * rules.end_rule^-1
    token_rule = lpeg_V(embedded_child) + token_rule
  end
  grammar['__'..lexer_name] = token_rule -- can contain embedded lexer rules
  grammar[lexer_name] = token_rule^0
end

-- (Re)constructs `lexer._GRAMMAR`.
-- @param lexer The parent lexer.
-- @param initial_rule The name of the rule to start lexing with. The default
--   value is `lexer._NAME`. Multilang lexers use this to start with a child
--   rule if necessary.
local function build_grammar(lexer, initial_rule)
  local children = lexer._CHILDREN
  if children then
    local lexer_name = lexer._NAME
    if not initial_rule then initial_rule = lexer_name end
    local grammar = {initial_rule}
    add_lexer(grammar, lexer)
    lexer._INITIALRULE = initial_rule
    lexer._GRAMMAR = lpeg_Ct(lpeg_P(grammar))
  else
    lexer._GRAMMAR = lpeg_Ct(join_tokens(lexer)^0)
  end
end

local string_upper = string.upper
-- Default styles.
local default = {
  'nothing', 'whitespace', 'comment', 'string', 'number', 'keyword',
  'identifier', 'operator', 'error', 'preprocessor', 'constant', 'variable',
  'function', 'class', 'type', 'label', 'regex', 'embedded'
}
for i = 1, #default do
  local name, upper_name = default[i], string_upper(default[i])
  M[upper_name] = name
  if not M['STYLE_'..upper_name] then
    M['STYLE_'..upper_name] = ''
  end
end
-- Predefined styles.
local predefined = {
  'default', 'linenumber', 'bracelight', 'bracebad', 'controlchar',
  'indentguide', 'calltip', 'folddisplaytext'
}
for i = 1, #predefined do
  local name, upper_name = predefined[i], string_upper(predefined[i])
  M[upper_name] = name
  if not M['STYLE_'..upper_name] then
    M['STYLE_'..upper_name] = ''
  end
end

---
-- Initializes or loads and returns the lexer of string name *name*.
-- Scintilla calls this function in order to load a lexer. Parent lexers also
-- call this function in order to load child lexers and vice-versa. The user
-- calls this function in order to load a lexer when using Scintillua as a Lua
-- library.
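--
-- A minimal sketch of stand-alone use, assuming this module can be loaded with
-- `require('lexer')` and that `LEXERPATH` holds a template under which
-- *lexers/lua.lua* can be found (the path shown is hypothetical):
--
--     local lexer = require('lexer')
--     lexer.LEXERPATH = '/path/to/?.lua' -- hypothetical location
--     local lua = lexer.load('lua')
--     if not lua then print('could not load the lua lexer') end
--
-- Checking the result matters because this function returns `nil` when the
-- lexer file cannot be found or fails to run.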
-- @param name The name of the lexing language.
-- @param alt_name The alternate name of the lexing language. This is useful for
--   embedding the same child lexer with multiple sets of start and end tokens.
-- @param cache Flag indicating whether or not to load lexers from the cache.
--   This should only be `true` when initially loading a lexer (e.g. not from
--   within another lexer for embedding purposes).
--   The default value is `false`.
-- @return lexer object
-- @name load
function M.load(name, alt_name, cache)
  if cache and M.lexers[alt_name or name] then return M.lexers[alt_name or name] end
  parent_lexer = nil -- reset

  -- When using Scintillua as a stand-alone module, the `property` and
  -- `property_int` tables do not exist (they are not useful). Create them to
  -- prevent errors from occurring.
  if not M.property then
    M.property, M.property_int = {}, setmetatable({}, {
      __index = function(t, k) return tonumber(M.property[k]) or 0 end,
      __newindex = function() error('read-only property') end
    })
  end

  -- Load the language lexer with its rules, styles, etc.
  M.WHITESPACE = (alt_name or name)..'_whitespace'
  local lexer_file, error = package.searchpath('lexers/'..name, M.LEXERPATH)
  local ok, lexer = pcall(dofile, lexer_file or '')
  if not ok then
    return nil
  end
  if alt_name then lexer._NAME = alt_name end

  -- Create the initial maps for token names to style numbers and styles.
  local token_styles = {}
  for i = 1, #default do token_styles[default[i]] = i - 1 end
  for i = 1, #predefined do token_styles[predefined[i]] = i + 31 end
  lexer._TOKENSTYLES, lexer._numstyles = token_styles, #default
  lexer._EXTRASTYLES = {}

  -- If the lexer is a proxy (loads parent and child lexers to embed) and does
  -- not declare a parent, try and find one and use its rules.
  if not lexer._rules and not lexer._lexer then lexer._lexer = parent_lexer end

  -- If the lexer is a proxy or a child that embedded itself, add its rules and
  -- styles to the parent lexer. Then set the parent to be the main lexer.
  if lexer._lexer then
    local l, _r, _s = lexer._lexer, lexer._rules, lexer._tokenstyles
    if not l._tokenstyles then l._tokenstyles = {} end
    if _r then
      for i = 1, #_r do
        -- Prevent rule id clashes.
        l._rules[#l._rules + 1] = {lexer._NAME..'_'.._r[i][1], _r[i][2]}
      end
    end
    if _s then
      for token, style in pairs(_s) do l._tokenstyles[token] = style end
    end
    lexer = l
  end

  -- Add the lexer's styles and build its grammar.
  if lexer._rules then
    if lexer._tokenstyles then
      for token, style in pairs(lexer._tokenstyles) do
        add_style(lexer, token, style)
      end
    end
    for i = 1, #lexer._rules do
      add_rule(lexer, lexer._rules[i][1], lexer._rules[i][2])
    end
    build_grammar(lexer)
  end
  -- Add the lexer's unique whitespace style.
  add_style(lexer, lexer._NAME..'_whitespace', M.STYLE_WHITESPACE)

  -- Process the lexer's fold symbols.
  if lexer._foldsymbols and lexer._foldsymbols._patterns then
    local patterns = lexer._foldsymbols._patterns
    for i = 1, #patterns do patterns[i] = '()('..patterns[i]..')' end
  end

  lexer.lex, lexer.fold = M.lex, M.fold
  M.lexers[alt_name or name] = lexer
  return lexer
end

---
-- Lexes a chunk of text *text* (that has an initial style number of
-- *init_style*) with lexer *lexer*.
-- If *lexer* has a `_LEXBYLINE` flag set, the text is lexed one line at a time.
-- Otherwise the text is lexed as a whole.
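--
-- The returned table is a flat list that alternates between a token name and
-- the position just after that token, so a caller can walk it in pairs. A
-- minimal sketch, assuming *lexer* was returned by `load()` and that an
-- *init_style* of 0 is acceptable for the text (`start` is an illustrative
-- name only):
--
--     local tokens = lexer:lex(text, 0)
--     local start = 1
--     for i = 1, #tokens, 2 do
--       -- tokens[i] is the token name, tokens[i + 1] the position after it.
--       print(tokens[i], text:sub(start, tokens[i + 1] - 1))
--       start = tokens[i + 1]
--     end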
-- @param lexer The lexer object to lex with.
-- @param text The text in the buffer to lex.
-- @param init_style The current style. Multiple-language lexers use this to
--   determine which language to start lexing in.
-- @return table of token names and positions.
-- @name lex
function M.lex(lexer, text, init_style)
  if not lexer._GRAMMAR then return {M.DEFAULT, #text + 1} end
  if not lexer._LEXBYLINE then
    -- For multilang lexers, build a new grammar whose initial_rule is the
    -- current language.
    if lexer._CHILDREN then
      for style, style_num in pairs(lexer._TOKENSTYLES) do
        if style_num == init_style then
          local lexer_name = style:match('^(.+)_whitespace') or lexer._NAME
          if lexer._INITIALRULE ~= lexer_name then
            build_grammar(lexer, lexer_name)
          end
          break
        end
      end
    end
    return lpeg_match(lexer._GRAMMAR, text)
  else
    local tokens = {}
    local function append(tokens, line_tokens, offset)
      for i = 1, #line_tokens, 2 do
        tokens[#tokens + 1] = line_tokens[i]
        tokens[#tokens + 1] = line_tokens[i + 1] + offset
      end
    end
    local offset = 0
    local grammar = lexer._GRAMMAR
    for line in text:gmatch('[^\r\n]*\r?\n?') do
      local line_tokens = lpeg_match(grammar, line)
      if line_tokens then append(tokens, line_tokens, offset) end
      offset = offset + #line
      -- Use the default style to the end of the line if none was specified.
      if tokens[#tokens] ~= offset then
        tokens[#tokens + 1], tokens[#tokens + 2] = 'default', offset + 1
      end
    end
    return tokens
  end
end

---
-- Determines fold points in a chunk of text *text* with lexer *lexer*.
-- *text* starts at position *start_pos* on line number *start_line* with a
-- beginning fold level of *start_level* in the buffer. If *lexer* has a `_fold`
-- function or a `_foldsymbols` table, that field is used to perform folding.
-- Otherwise, if *lexer* has a `_FOLDBYINDENTATION` field set, or if a
-- `fold.by.indentation` property is set, folding by indentation is done.
-- @param lexer The lexer object to fold with.
-- @param text The text in the buffer to fold.
-- @param start_pos The position in the buffer *text* starts at, starting at
--   zero.
-- @param start_line The line number *text* starts on.
-- @param start_level The fold level *text* starts on.
-- @return table of fold levels.
-- @name fold
function M.fold(lexer, text, start_pos, start_line, start_level)
  local folds = {}
  if text == '' then return folds end
  local fold = M.property_int['fold'] > 0
  local FOLD_BASE = M.FOLD_BASE
  local FOLD_HEADER, FOLD_BLANK = M.FOLD_HEADER, M.FOLD_BLANK
  if fold and lexer._fold then
    return lexer._fold(text, start_pos, start_line, start_level)
  elseif fold and lexer._foldsymbols then
    local lines = {}
    for p, l in (text..'\n'):gmatch('()(.-)\r?\n') do
      lines[#lines + 1] = {p, l}
    end
    local fold_zero_sum_lines = M.property_int['fold.on.zero.sum.lines'] > 0
    local fold_symbols = lexer._foldsymbols
    local fold_symbols_patterns = fold_symbols._patterns
    local fold_symbols_case_insensitive = fold_symbols._case_insensitive
    local style_at, fold_level = M.style_at, M.fold_level
    local line_num, prev_level = start_line, start_level
    local current_level = prev_level
    for i = 1, #lines do
      local pos, line = lines[i][1], lines[i][2]
      if line ~= '' then
        if fold_symbols_case_insensitive then line = line:lower() end
        local level_decreased = false
        for j = 1, #fold_symbols_patterns do
          for s, match in line:gmatch(fold_symbols_patterns[j]) do
            local symbols = fold_symbols[style_at[start_pos + pos + s - 1]]
            local l = symbols and symbols[match]
            if type(l) == 'function' then l = l(text, pos, line, s, match) end
            if type(l) == 'number' then
              current_level = current_level + l
              if l < 0 and current_level < prev_level then
                -- Potential zero-sum line. If the level were to go back up on
                -- the same line, the line may be marked as a fold header.
                level_decreased = true
              end
            end
          end
        end
        folds[line_num] = prev_level
        if current_level > prev_level then
          folds[line_num] = prev_level + FOLD_HEADER
        elseif level_decreased and current_level == prev_level and
               fold_zero_sum_lines then
          if line_num > start_line then
            folds[line_num] = prev_level - 1 + FOLD_HEADER
          else
            -- Typing within a zero-sum line.
            local level = fold_level[line_num - 1] - 1
            if level > FOLD_HEADER then level = level - FOLD_HEADER end
            if level > FOLD_BLANK then level = level - FOLD_BLANK end
            folds[line_num] = level + FOLD_HEADER
            current_level = current_level + 1
          end
        end
        if current_level < FOLD_BASE then current_level = FOLD_BASE end
        prev_level = current_level
      else
        folds[line_num] = prev_level + FOLD_BLANK
      end
      line_num = line_num + 1
    end
  elseif fold and (lexer._FOLDBYINDENTATION or
                   M.property_int['fold.by.indentation'] > 0) then
    -- Indentation based folding.
    -- Calculate indentation per line.
    local indentation = {}
    for indent, line in (text..'\n'):gmatch('([\t ]*)([^\r\n]*)\r?\n') do
      indentation[#indentation + 1] = line ~= '' and #indent
    end
    -- Find the first non-blank line before start_line. If the current line is
    -- indented, make that previous line a header and update the levels of any
    -- blank lines in between. If the current line is blank, match the level of
    -- the previous non-blank line.
    local current_level = start_level
    for i = start_line - 1, 0, -1 do
      local level = M.fold_level[i]
      if level >= FOLD_HEADER then level = level - FOLD_HEADER end
      if level < FOLD_BLANK then
        local indent = M.indent_amount[i]
        if indentation[1] and indentation[1] > indent then
          folds[i] = FOLD_BASE + indent + FOLD_HEADER
          for j = i + 1, start_line - 1 do
            folds[j] = start_level + FOLD_BLANK
          end
        elseif not indentation[1] then
          current_level = FOLD_BASE + indent
        end
        break
      end
    end
    -- Iterate over lines, setting fold numbers and fold flags.
    for i = 1, #indentation do
      if indentation[i] then
        current_level = FOLD_BASE + indentation[i]
        folds[start_line + i - 1] = current_level
        for j = i + 1, #indentation do
          if indentation[j] then
            if FOLD_BASE + indentation[j] > current_level then
              folds[start_line + i - 1] = current_level + FOLD_HEADER
              current_level = FOLD_BASE + indentation[j] -- for any blanks below
            end
            break
          end
        end
      else
        folds[start_line + i - 1] = current_level + FOLD_BLANK
      end
    end
  else
    -- No folding, reset fold levels if necessary.
    local current_line = start_line
    for _ in text:gmatch('\r?\n') do
      folds[current_line] = start_level
      current_line = current_line + 1
    end
  end
  return folds
end

-- The following are utility functions lexers will have access to.

-- Common patterns.
M.any = lpeg_P(1)
M.ascii = lpeg_R('\000\127')
M.extend = lpeg_R('\000\255')
M.alpha = lpeg_R('AZ', 'az')
M.digit = lpeg_R('09')
M.alnum = lpeg_R('AZ', 'az', '09')
M.lower = lpeg_R('az')
M.upper = lpeg_R('AZ')
M.xdigit = lpeg_R('09', 'AF', 'af')
M.cntrl = lpeg_R('\000\031')
M.graph = lpeg_R('!~')
M.print = lpeg_R(' ~')
-- The third range runs from '[' to '`'; a range ending at '\'' would be empty.
M.punct = lpeg_R('!/', ':@', '[`', '{~')
M.space = lpeg_S('\t\v\f\n\r ')

M.newline = lpeg_S('\r\n\f')^1
M.nonnewline = 1 - M.newline
M.nonnewline_esc = 1 - (M.newline + '\\') + '\\' * M.any

M.dec_num = M.digit^1
M.hex_num = '0' * lpeg_S('xX') * M.xdigit^1
M.oct_num = '0' * lpeg_R('07')^1
M.integer = lpeg_S('+-')^-1 * (M.hex_num + M.oct_num + M.dec_num)
M.float = lpeg_S('+-')^-1 *
          ((M.digit^0 * '.' * M.digit^1 + M.digit^1 * '.' * M.digit^0) *
           (lpeg_S('eE') * lpeg_S('+-')^-1 * M.digit^1)^-1 +
           (M.digit^1 * lpeg_S('eE') * lpeg_S('+-')^-1 * M.digit^1))

M.word = (M.alpha + '_') * (M.alnum + '_')^0

---
-- Creates and returns a token pattern with token name *name* and pattern
-- *patt*.
-- If *name* is not a predefined token name, its style must be defined in the
-- lexer's `_tokenstyles` table.
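--
-- As a sketch, a lexer that emits a custom 'annotation' token might map that
-- name to an existing style. This assumes `M` is the lexer being defined and
-- `l` is this module; reusing `STYLE_PREPROCESSOR` is only an illustrative
-- choice:
--
--     local annotation = token('annotation', '@' * l.word)
--     M._tokenstyles = {annotation = l.STYLE_PREPROCESSOR}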
-- @param name The name of token. If this name is not a predefined token name,
--   then a style needs to be associated with it in the lexer's `_tokenstyles`
--   table.
-- @param patt The LPeg pattern associated with the token.
-- @return pattern
-- @usage local ws = token(l.WHITESPACE, l.space^1)
-- @usage local annotation = token('annotation', '@' * l.word)
-- @name token
function M.token(name, patt)
  return lpeg_Cc(name) * patt * lpeg_Cp()
end

---
-- Creates and returns a pattern that matches a range of text bounded by
-- *chars* characters.
-- This is a convenience function for matching more complicated delimited ranges
-- like strings with escape characters and balanced parentheses. *single_line*
-- indicates whether or not the range must be on a single line, *no_escape*
-- indicates whether or not to ignore '\' as an escape character, and *balanced*
-- indicates whether or not to handle balanced ranges like parentheses and
-- requires *chars* to be composed of two characters.
-- @param chars The character(s) that bound the matched range.
-- @param single_line Optional flag indicating whether or not the range must be
--   on a single line.
-- @param no_escape Optional flag indicating whether or not to ignore '\\' as
--   an escape character. When `true`, the range end character cannot be
--   escaped.
-- @param balanced Optional flag indicating whether or not to match a balanced
--   range, like the "%b" Lua pattern. This flag only applies if *chars*
--   consists of two different characters (e.g. "()").
-- @return pattern
-- @usage local dq_str_escapes = l.delimited_range('"')
-- @usage local dq_str_noescapes = l.delimited_range('"', false, true)
-- @usage local unbalanced_parens = l.delimited_range('()')
-- @usage local balanced_parens = l.delimited_range('()', false, false, true)
-- @see nested_pair
-- @name delimited_range
function M.delimited_range(chars, single_line, no_escape, balanced)
  local s = chars:sub(1, 1)
  local e = #chars == 2 and chars:sub(2, 2) or s
  local range
  local b = balanced and s or ''
  local n = single_line and '\n' or ''
  if no_escape then
    local invalid = lpeg_S(e..n..b)
    range = M.any - invalid
  else
    local invalid = lpeg_S(e..n..b) + '\\'
    range = M.any - invalid + '\\' * M.any
  end
  if balanced and s ~= e then
    return lpeg_P{s * (range + lpeg_V(1))^0 * e}
  else
    return s * range^0 * lpeg_P(e)^-1
  end
end

---
-- Creates and returns a pattern that matches pattern *patt* only at the
-- beginning of a line.
-- @param patt The LPeg pattern to match on the beginning of a line.
-- @return pattern
-- @usage local preproc = token(l.PREPROCESSOR, l.starts_line('#') *
--   l.nonnewline^0)
-- @name starts_line
function M.starts_line(patt)
  return lpeg_Cmt(lpeg_C(patt), function(input, index, match, ...)
    local pos = index - #match
    if pos == 1 then return index, ... end
    local char = input:sub(pos - 1, pos - 1)
    if char == '\n' or char == '\r' or char == '\f' then return index, ... end
  end)
end

---
-- Creates and returns a pattern that verifies that string set *s* contains the
-- first non-whitespace character behind the current match position.
-- @param s String character set like one passed to `lpeg.S()`.
-- @return pattern
-- @usage local regex = l.last_char_includes('+-*!%^&|=,([{') *
--   l.delimited_range('/')
-- @name last_char_includes
function M.last_char_includes(s)
  s = '['..s:gsub('[-%%%[]', '%%%1')..']'
  return lpeg_P(function(input, index)
    if index == 1 then return index end
    local i = index
    while input:sub(i - 1, i - 1):match('[ \t\r\n\f]') do i = i - 1 end
    if input:sub(i - 1, i - 1):match(s) then return index end
  end)
end

1431 -- Returns a pattern that matches a balanced range of text that starts with
1432 -- string *start_chars* and ends with string *end_chars*.
1433 -- With single-character delimiters, this function is identical to
1434 -- `delimited_range(start_chars..end_chars, false, true, true)`.
1435 -- @param start_chars The string starting a nested sequence.
1436 -- @param end_chars The string ending a nested sequence.
1437 -- @return pattern
1438 -- @usage local nested_comment = l.nested_pair('/*', '*/')
1439 -- @see delimited_range
1440 -- @name nested_pair
1441 function M.nested_pair(start_chars, end_chars)
1442 local s, e = start_chars, lpeg_P(end_chars)^-1
1443 return lpeg_P{s * (M.any - s - end_chars + lpeg_V(1))^0 * e}
---
-- Creates and returns a pattern that matches any single word in list *words*.
-- Words consist of alphanumeric and underscore characters, as well as the
-- characters in string set *word_chars*. *case_insensitive* indicates whether
-- or not to ignore case when matching words.
-- This is a convenience function for simplifying a set of ordered choice word
-- patterns.
-- @param words A table of words.
-- @param word_chars Optional string of additional characters considered to be
--   part of a word. By default, word characters are alphanumerics and
--   underscores ("%w_" in Lua). This parameter may be `nil` or the empty string
--   in order to indicate no additional word characters.
-- @param case_insensitive Optional boolean flag indicating whether or not the
--   word match is case-insensitive. The default is `false`.
-- @return pattern
-- @usage local keyword = token(l.KEYWORD, word_match{'foo', 'bar', 'baz'})
-- @usage local keyword = token(l.KEYWORD, word_match({'foo-bar', 'foo-baz',
--   'bar-foo', 'bar-baz', 'baz-foo', 'baz-bar'}, '-', true))
-- @name word_match
function M.word_match(words, word_chars, case_insensitive)
  local word_list = {}
  for i = 1, #words do
    word_list[case_insensitive and words[i]:lower() or words[i]] = true
  end
  local chars = M.alnum + '_'
  if word_chars then chars = chars + lpeg_S(word_chars) end
  return lpeg_Cmt(chars^1, function(input, index, word)
    if case_insensitive then word = word:lower() end
    return word_list[word] and index or nil
  end)
end

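-- A minimal sketch of how `word_match()` behaves, assuming this module has
-- been loaded as `l` and LPeg as `lpeg` (both names are assumptions made for
-- illustration). The whole run of word characters is consumed first and only
-- then looked up in the word list, so a listed word does not match as a
-- prefix of a longer word.
--
--     local keyword = l.word_match({'foo', 'bar'}, nil, true)
--     lpeg.match(keyword, 'FOO')    -- matches, since case is ignored
--     lpeg.match(keyword, 'foobar') -- nil, since 'foobar' is not in the list
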
---
-- Embeds child lexer *child* in parent lexer *parent* using patterns
-- *start_rule* and *end_rule*, which signal the beginning and end of the
-- embedded lexer, respectively.
-- @param parent The parent lexer.
-- @param child The child lexer.
-- @param start_rule The pattern that signals the beginning of the embedded
--   lexer.
-- @param end_rule The pattern that signals the end of the embedded lexer.
-- @usage l.embed_lexer(M, css, css_start_rule, css_end_rule)
-- @usage l.embed_lexer(html, M, php_start_rule, php_end_rule)
-- @usage l.embed_lexer(html, ruby, ruby_start_rule, ruby_end_rule)
-- @name embed_lexer
function M.embed_lexer(parent, child, start_rule, end_rule)
  -- Add child rules.
  if not child._EMBEDDEDRULES then child._EMBEDDEDRULES = {} end
  if not child._RULES then -- creating a child lexer to be embedded
    if not child._rules then error('Cannot embed language with no rules') end
    for i = 1, #child._rules do
      add_rule(child, child._rules[i][1], child._rules[i][2])
    end
  end
  child._EMBEDDEDRULES[parent._NAME] = {
    ['start_rule'] = start_rule,
    token_rule = join_tokens(child),
    ['end_rule'] = end_rule
  }
  if not parent._CHILDREN then parent._CHILDREN = {} end
  local children = parent._CHILDREN
  children[#children + 1] = child
  -- Add child styles.
  if not parent._tokenstyles then parent._tokenstyles = {} end
  local tokenstyles = parent._tokenstyles
  tokenstyles[child._NAME..'_whitespace'] = M.STYLE_WHITESPACE
  if child._tokenstyles then
    for token, style in pairs(child._tokenstyles) do
      tokenstyles[token] = style
    end
  end
  -- Add child fold symbols.
  if not parent._foldsymbols then parent._foldsymbols = {} end
  if child._foldsymbols then
    for token, symbols in pairs(child._foldsymbols) do
      if not parent._foldsymbols[token] then parent._foldsymbols[token] = {} end
      for k, v in pairs(symbols) do
        if type(k) == 'number' then
          parent._foldsymbols[token][#parent._foldsymbols[token] + 1] = v
        elseif not parent._foldsymbols[token][k] then
          parent._foldsymbols[token][k] = v
        end
      end
    end
  end
  child._lexer = parent -- use parent's tokens if child is embedding itself
  parent_lexer = parent -- use parent's tokens if the calling lexer is a proxy
end

-- Determines if the previous line is a comment.
-- This is used for determining if the current comment line is a fold point.
-- @param prefix The prefix string defining a comment.
-- @param text The text passed to a fold function.
-- @param pos The pos passed to a fold function.
-- @param line The line passed to a fold function.
-- @param s The s passed to a fold function.
local function prev_line_is_comment(prefix, text, pos, line, s)
  local start = line:find('%S')
  if start < s and not line:find(prefix, start, true) then return false end
  local p = pos - 1
  if text:sub(p, p) == '\n' then
    p = p - 1
    if text:sub(p, p) == '\r' then p = p - 1 end
    if text:sub(p, p) ~= '\n' then
      while p > 1 and text:sub(p - 1, p - 1) ~= '\n' do p = p - 1 end
      while text:sub(p, p):find('^[\t ]$') do p = p + 1 end
      return text:sub(p, p + #prefix - 1) == prefix
    end
  end
  return false
end

-- Determines if the next line is a comment.
-- This is used for determining if the current comment line is a fold point.
-- @param prefix The prefix string defining a comment.
-- @param text The text passed to a fold function.
-- @param pos The pos passed to a fold function.
-- @param line The line passed to a fold function.
-- @param s The s passed to a fold function.
local function next_line_is_comment(prefix, text, pos, line, s)
  local p = text:find('\n', pos + s)
  if p then
    p = p + 1
    while text:sub(p, p):find('^[\t ]$') do p = p + 1 end
    return text:sub(p, p + #prefix - 1) == prefix
  end
  return false
end

---
-- Returns a fold function (to be used within the lexer's `_foldsymbols` table)
-- that folds consecutive line comments that start with string *prefix*.
-- @param prefix The prefix string defining a line comment.
-- @usage [l.COMMENT] = {['--'] = l.fold_line_comments('--')}
-- @usage [l.COMMENT] = {['//'] = l.fold_line_comments('//')}
-- @name fold_line_comments
function M.fold_line_comments(prefix)
  local property_int = M.property_int
  return function(text, pos, line, s)
    if property_int['fold.line.comments'] == 0 then return 0 end
    if s > 1 and line:match('^%s*()') < s then return 0 end
    local prev_line_comment = prev_line_is_comment(prefix, text, pos, line, s)
    local next_line_comment = next_line_is_comment(prefix, text, pos, line, s)
    if not prev_line_comment and next_line_comment then return 1 end
    if prev_line_comment and not next_line_comment then return -1 end
    return 0
  end
end

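-- A worked example of the values the fold function above would return,
-- assuming prefix '--', a non-zero `fold.line.comments` property, and a run of
-- three comment lines that each start at the beginning of their line (the
-- comment text is hypothetical):
--
--     -- first     no comment line before, comment line after  ->  1
--     -- second    comment lines before and after              ->  0
--     -- third     comment line before, none after             -> -1
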
M.property_expanded = setmetatable({}, {
  -- Returns the string property value associated with string property *key*,
  -- replacing any "$()" and "%()" expressions with the values of their keys.
  __index = function(t, key)
    return M.property[key]:gsub('[$%%]%b()', function(key)
      return t[key:sub(3, -2)]
    end)
  end,
  __newindex = function() error('read-only property') end
})

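-- A minimal sketch of the expansion performed by `property_expanded`, assuming
-- `M.property` behaves like a plain table of strings and holds the
-- hypothetical keys `a` and `b`. Nested references are expanded recursively
-- because the replacement function indexes `property_expanded` itself.
--
--     M.property['a'] = 'x'
--     M.property['b'] = '$(a)/y'
--     print(M.property_expanded['b']) -- x/y
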
--[[ The functions and fields below were defined in C.

---
-- Returns the line number of the line that contains position *pos*, which
-- starts from 1.
-- @param pos The position to get the line number of.
-- @return number
local function line_from_position(pos) end

---
-- Individual fields for a lexer instance.
-- @field _NAME The string name of the lexer.
-- @field _rules An ordered list of rules for a lexer grammar.
--   Each rule is a table containing an arbitrary rule name and the LPeg pattern
--   associated with the rule. The order of rules is important, as rules are
--   matched sequentially.
--   Child lexers should not use this table to access and/or modify their
--   parent's rules and vice-versa. Use the `_RULES` table instead.
-- @field _tokenstyles A map of non-predefined token names to styles.
--   Remember to use token names, not rule names. It is recommended to use
--   predefined styles or color-agnostic styles derived from predefined styles
--   to ensure compatibility with user color themes.
-- @field _foldsymbols A table of recognized fold points for the lexer.
--   Keys are token names with table values defining fold points. Those table
--   values have string keys of keywords or characters that indicate a fold
--   point whose values are integers. A value of `1` indicates a beginning fold
--   point and a value of `-1` indicates an ending fold point. Values can also
--   be functions that return `1`, `-1`, or `0` (indicating no fold point) for
--   keys which need additional processing.
--   There is also a required `_patterns` key whose value is a table containing
--   Lua pattern strings that match all fold points (the string keys contained
--   in token name table values). When the lexer encounters text that matches
--   one of those patterns, the matched text is looked up in its token's table
--   to determine whether or not it is a fold point.
--   There is also an optional `_case_insensitive` option that indicates whether
--   or not fold point keys are case-insensitive. If `true`, fold point keys
--   should be in lower case.
-- @field _fold If this function exists in the lexer, it is called for folding
--   the document instead of using `_foldsymbols` or indentation.
-- @field _lexer The parent lexer object whose rules should be used. This field
--   is only necessary to disambiguate a proxy lexer that loaded parent and
--   child lexers for embedding and ended up having multiple parents loaded.
-- @field _RULES A map of rule name keys with their associated LPeg pattern
--   values for the lexer.
--   This is constructed from the lexer's `_rules` table and accessible to other
--   lexers for embedded lexer applications like modifying parent or child
--   rules.
-- @field _LEXBYLINE Indicates the lexer can only process one whole line of text
--   (instead of an arbitrary chunk of text) at a time.
--   The default value is `false`. Line lexers cannot look ahead to subsequent
--   lines.
-- @field _FOLDBYINDENTATION Declares the lexer does not define fold points and
--   that fold points should be calculated based on changes in indentation.
-- @class table
-- @name lexer
local lexer
]]

return M