Class Lexer<T>
- Type Parameters:
T
- the type of token types
Disambiguation
The definition of the lexical structure of a language can contain two kinds of ambiguity:
- Token rules can overlap, and thus assign more than one token type to the same input. This ambiguity can be detected and resolved statically, using only information in the rule set.
- Potential matches of different lengths can arise for a particular input. This ambiguity is generally not dealt with statically. Instead, a longest-match strategy is implemented: the production of a token is deferred until it is certain that the matched input is not part of an even longer token.
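The longest-match strategy can be sketched independently of this class. The following stand-alone example is illustrative only; the Rule type and the literal rule set are hypothetical and not part of this API. It keeps the longest rule that matches a prefix of the input, so "<=" wins over "<":

```java
import java.util.List;

// Illustrative sketch of longest-match token selection; the rule
// representation here is hypothetical and not part of the Lexer API.
public class LongestMatch {

    // A rule pairs a token type name with a literal pattern.
    record Rule(String type, String pattern) {}

    static final List<Rule> RULES = List.of(
            new Rule("LT", "<"),
            new Rule("LE", "<="),     // overlaps with "<" on its first character
            new Rule("ASSIGN", "="));

    // Returns the type of the longest rule matching a prefix of the input,
    // or null if no rule matches.
    static String longestMatch(String input) {
        Rule best = null;
        for (Rule r : RULES) {
            if (input.startsWith(r.pattern())
                    && (best == null || r.pattern().length() > best.pattern().length())) {
                best = r;
            }
        }
        return best == null ? null : best.type();
    }

    public static void main(String[] args) {
        // Token production for "<" is deferred until the longer
        // candidate "<=" has been matched or ruled out.
        System.out.println(longestMatch("<=5"));  // LE
        System.out.println(longestMatch("<5"));   // LT
    }
}
```

A real lexer applies this check incrementally while reading, rather than re-scanning prefixes, but the selection criterion is the same.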
Special Tokens
Besides its main task of segmenting a valid input sequence into tokens, the lexical analysis also has to deal with errors and with the end of input. For each of these two cases, a dedicated token type can be set:
- Every time an error occurs, a token with the error token type and the error message text is emitted. If the error is recoverable, non-error tokens may follow.
- When the end of the input has been reached, any further demand is met with an unlimited number of tokens with the end-of-input token type and no text.
Both token types default to null. It is recommended to set the token types to two distinct reserved non-null values before use.
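In practice the end and error types are reserved constants of the token-type enum. The sketch below is self-contained: the Token record and the hard-coded token list stand in for a real token source produced by this class. It shows a consumer that reports error tokens, keeps going while recovery is possible, and stops at the first end-of-input token:

```java
import java.util.ArrayList;
import java.util.List;

// Self-contained sketch; Token and the hard-coded stream stand in for
// the real token source produced by Lexer.lex(...).
public class SpecialTokens {

    // Reserved non-null values for the two special cases, alongside
    // the ordinary token types.
    enum Tok { NUMBER, PLUS, ERROR, END }

    record Token(Tok type, String text) {}

    // Drains the stream: ordinary tokens are kept, error tokens are
    // reported, and the first END token terminates consumption; there
    // is no need to read any of the unbounded END tokens after it.
    static List<String> consume(Iterable<Token> stream) {
        List<String> out = new ArrayList<>();
        for (Token t : stream) {
            if (t.type() == Tok.END) break;           // input fully processed
            if (t.type() == Tok.ERROR) {
                out.add("error: " + t.text());        // recoverable: keep going
            } else {
                out.add(t.type() + "(" + t.text() + ")");
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<Token> stream = List.of(
                new Token(Tok.NUMBER, "1"),
                new Token(Tok.ERROR, "unexpected '#'"),
                new Token(Tok.PLUS, "+"),
                new Token(Tok.NUMBER, "2"),
                new Token(Tok.END, ""));
        // prints [NUMBER(1), error: unexpected '#', PLUS(+), NUMBER(2)]
        System.out.println(consume(stream));
    }
}
```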
Constructor Summary
- Lexer(rules): Creates a new instance from the given token rule set.
-
Method Summary
- <D> TokenSource<D,T> lex(CodePointSource input): Returns a token source that produces tokens from the given input source.
- <D> TokenSource<D,T> lex(CodePointSource input, D documentId, int firstLine, int firstColumn): Returns a token source that produces tokens from the given input source.
- Lexer<T> setEndType(T endType): Sets the token type indicating the end of input.
- Lexer<T> setErrorType(T errorType): Sets the token type indicating an error in the lexical analysis.
-
Constructor Details
-
Lexer
Creates a new instance. The given token rule set is not interpreted literally during token construction; instead, it is compiled to an efficient internal representation in this constructor.
- Parameters:
rules
- the token rule set to use
-
-
Method Details
-
setEndType
Sets the token type indicating the end of input. The given type should not also be associated with any token rule.
The client that consumes the token stream should be prepared to deal with an unbounded number of tokens of this type following the actual input. It is not necessary to consume more than one of them; the first token of this type already indicates that the input has been processed completely.
It is possible, although not recommended, to call this method while a lexical analysis is ongoing. The effect occurs immediately.
- Parameters:
endType
- the token type indicating the end of input
- Returns:
this
-
setErrorType
Sets the token type indicating an error in the lexical analysis. The given type should not also be associated with any token rule.
The client that consumes the token stream should be prepared to deal with spontaneous occurrences of this token type. In some application-dependent cases, the following tokens may still be meaningful, and error recovery may be attempted.
It is possible, although not recommended, to call this method while a lexical analysis is ongoing. The effect occurs immediately.
- Parameters:
errorType
- the token type indicating an error in the lexical analysis
- Returns:
this
-
lex
Returns a token source that produces tokens from the given input source. Location information in the produced tokens will not contain a document identifier or line number, whereas columns start from 0. This is suitable, e.g., for reading from dynamically created strings.
- Type Parameters:
D
- the fictional type of document identifiers in token locations; arbitrary, since all values are null
- Parameters:
input
- the input source to analyze
- Returns:
- a token source that produces tokens from the given input source
-
lex
public <D> TokenSource<D,T> lex(CodePointSource input, @Opt D documentId, int firstLine, int firstColumn)
Returns a token source that produces tokens from the given input source.
- Type Parameters:
D
- the type of document identifiers in token locations
- Parameters:
input
- the input source to analyze
documentId
- the document identifier of the document underlying the input, or null if not available
firstLine
- the non-negative first line number of the document underlying the input, or one of the special values Location.UNKNOWN or Location.NOT_APPLICABLE
firstColumn
- the non-negative first column number of the document underlying the input, or one of the special values Location.UNKNOWN or Location.NOT_APPLICABLE
- Returns:
- a token source that produces tokens from the given input source
-