Class Lexer<T>
- Type Parameters:
T - the type of token types
Disambiguation
The definition of the lexical structure of a language can contain two kinds of ambiguity:
- Token rules can overlap, and thus assign more than one token type to the same input. This ambiguity can be detected and resolved statically, using only information in the rule set.
- Potential matches of different length can arise for a particular input. This ambiguity is generally not dealt with statically. Instead, a longest-match strategy is implemented: The production of a token is deferred until it is certain that the matched input is not part of an even longer token.
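The two disambiguation steps can be illustrated with a minimal, self-contained sketch. It does not use this class; the rule names, the use of java.util.regex, and the convention that earlier rules win ties are assumptions made for illustration only. In particular, the sketch resolves overlaps by rule order at run time and sees the whole input at once, whereas the text above describes static resolution and deferred token production on a stream.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LongestMatchSketch {
    // Hypothetical rule set. Insertion order doubles as priority:
    // when two rules match the same text, the earlier one wins.
    static final Map<String, Pattern> RULES = new LinkedHashMap<>();
    static {
        RULES.put("IF", Pattern.compile("if"));
        RULES.put("IDENT", Pattern.compile("[a-z]+"));
        RULES.put("SPACE", Pattern.compile(" +"));
    }

    /** Splits the input into "TYPE:text" strings using longest match. */
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestType = null;
            int bestEnd = pos;
            for (Map.Entry<String, Pattern> rule : RULES.entrySet()) {
                Matcher m = rule.getValue().matcher(input);
                m.region(pos, input.length());
                // Only a strictly longer match replaces the candidate, so
                // equal lengths keep the earlier (higher-priority) rule,
                // and empty matches are ignored.
                if (m.lookingAt() && m.end() > bestEnd) {
                    bestType = rule.getKey();
                    bestEnd = m.end();
                }
            }
            if (bestType == null) {
                throw new IllegalArgumentException("no rule matches at offset " + pos);
            }
            tokens.add(bestType + ":" + input.substring(pos, bestEnd));
            pos = bestEnd;
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(tokenize("if iffy"));
    }
}
```

For the input "if iffy", the first word matches both IF and IDENT with the same length, so the overlap is decided by priority in favor of IF. The second word is matched as a single IDENT token, because the longer match "iffy" beats the shorter match "if".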
Special Tokens
Besides its main task of segmenting a valid input sequence into tokens, lexical analysis also has to deal with errors and the end of input. For each of these two cases, a dedicated token type can be set:
- Every time an error occurs, a token with the error token type and the error message text is emitted. If the error is recoverable, non-error tokens may follow.
- When the end of the input has been reached, any further demand is met with an unlimited number of tokens with the end-of-input token type and no text.
By default, both token types are null. It is recommended to set the token types to two distinct reserved non-null values before use.
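How a client might deal with the two special token types can be sketched as follows. The token source below is a hypothetical stand-in written for this example, not the LookaheadTokenSource returned by lex; the type names NUMBER, ERROR and END, the error message format, and the use of char instead of code points are all assumptions.

```java
import java.util.ArrayList;
import java.util.List;

class SpecialTokensSketch {
    // Hypothetical reserved token types: distinct and non-null.
    static final String NUMBER = "NUMBER";
    static final String ERROR = "ERROR";
    static final String END = "END";

    /** Hypothetical stand-in for a token source reading digit runs. */
    static class DigitTokenSource {
        private final String input;
        private int pos = 0;

        DigitTokenSource(String input) {
            this.input = input;
        }

        /** Returns {type, text}; the text is null for end-of-input tokens. */
        String[] next() {
            // Skip blanks between tokens.
            while (pos < input.length() && input.charAt(pos) == ' ') pos++;
            if (pos >= input.length()) {
                // Every further demand yields another end-of-input token.
                return new String[] { END, null };
            }
            int start = pos;
            while (pos < input.length() && Character.isDigit(input.charAt(pos))) pos++;
            if (pos > start) {
                return new String[] { NUMBER, input.substring(start, pos) };
            }
            // Recoverable error: report it as a token, skip one character,
            // and let later calls continue with the rest of the input.
            char bad = input.charAt(pos++);
            return new String[] { ERROR, "unexpected character '" + bad + "'" };
        }
    }

    /** Collects "TYPE:text" strings up to the first end-of-input token. */
    static List<String> consume(DigitTokenSource source) {
        List<String> seen = new ArrayList<>();
        while (true) {
            String[] token = source.next();
            if (token[0].equals(END)) break; // one end token is enough
            seen.add(token[0] + ":" + token[1]);
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(consume(new DigitTokenSource("12 x 34")));
    }
}
```

The consumer stops at the first end-of-input token and treats error tokens as ordinary stream elements, so the number after the offending character is still delivered. Because the two reserved types are distinct, the consumer can tell a failure from normal termination.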
Constructor Summary
Method Summary
- <D> LookaheadTokenSource<D, T, int[]> lex(CodePointSource input): Returns a token source that produces tokens from the given input source.
- <D> LookaheadTokenSource<D, T, int[]> lex(CodePointSource input, D documentId, int firstLine, int firstColumn): Returns a token source that produces tokens from the given input source.
- setEndType(T endType): Sets the token type indicating the end of input.
- setErrorType(T errorType): Sets the token type indicating an error in the lexical analysis.
Constructor Details
Lexer
Creates a new instance.
The given token rule set is not interpreted literally during token construction, but compiled to an efficient internal representation by this constructor.
- Parameters:
rules - the token rule set to use
Method Details
setEndType
Sets the token type indicating the end of input.
The given type should not also be associated with any proper token rule. The type may be null.
The client that consumes the token stream should be prepared to deal with an unbounded number of tokens of this type following the actual input. It is not necessary to consume more than one of them, since the first token of this type already indicates that the input has been processed completely.
It is possible, although not recommended, to call this method while a lexical analysis is ongoing. The effect occurs immediately.
- Parameters:
endType - the token type indicating the end of input
- Returns:
this
setErrorType
Sets the token type indicating an error in the lexical analysis.
The given type should not also be associated with any proper token rule. The type may be null.
The client that consumes the token stream should be prepared to deal with spontaneous occurrences of this token. In some application-dependent cases, subsequent tokens may still be meaningful, and error recovery may be attempted.
It is possible, although not recommended, to call this method while a lexical analysis is ongoing. The effect occurs immediately.
- Parameters:
errorType - the token type indicating an error in the lexical analysis
- Returns:
this
lex
Returns a token source that produces tokens from the given input source.
Location information in the produced tokens will not contain a document identifier or line number; columns start from 0. This is suitable, e.g., for reading from dynamically created strings.
- Type Parameters:
D - the fictional type of document identifiers in token locations; arbitrary since all values are null
- Parameters:
input - the input source to analyze
- Returns:
a token source that produces tokens from the given input source
lex
public <D> LookaheadTokenSource<D,T,int[]> lex(CodePointSource input, @Opt D documentId, int firstLine, int firstColumn)
Returns a token source that produces tokens from the given input source.
- Type Parameters:
D - the type of document identifiers in token locations
- Parameters:
input - the input source to analyze
documentId - the document identifier of the document underlying the input, or null if not available
firstLine - the non-negative first line number of the document underlying the input, or one of the special values Location.UNKNOWN or Location.NOT_APPLICABLE
firstColumn - the non-negative first column number of the document underlying the input, or one of the special values Location.UNKNOWN or Location.NOT_APPLICABLE
- Returns:
a token source that produces tokens from the given input source