Class Lexer<T>

java.lang.Object
eu.bandm.tools.lexic.Lexer<T>
Type Parameters:
T - the type of token types

public class Lexer<T> extends Object
Lexical analyzer that maps code point sources to token sources.

Disambiguation

The definition of the lexical structure of a language can contain two kinds of ambiguity:
  • Token rules can overlap, and thus assign more than one token type to the same input. This ambiguity can be detected and resolved statically, using only information in the rule set.
  • Potential matches of different lengths can arise for a particular input. This ambiguity is generally not resolved statically. Instead, a longest-match strategy is implemented: the production of a token is deferred until it is certain that the matched input is not part of an even longer token (see the illustration below).
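
As an illustration of the longest-match strategy, consider rules with overlapping prefixes. The rule names and token types here are hypothetical, not taken from any concrete rule set:

  // Hypothetical rules:  LT matches "<",  SHL matches "<<",  SHLEQ matches "<<="
  // Input "<<=":
  //   After reading "<<", both SHL and a prefix of SHLEQ have matched.
  //   The lexer defers emission until "=" has been read, then produces a
  //   single SHLEQ token, rather than SHL followed by a stray "=".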

Special Tokens

Besides its main task of segmenting a valid input sequence into tokens, the lexical analysis also has to deal with errors and with the end of input. For each of these two cases, a dedicated token type can be set:

  • Every time an error occurs, a token with the error token type and the error message text is emitted. If the error is recoverable, non-error tokens may follow.
  • When the end of the input has been reached, any further demand is met with an unlimited number of tokens with the end-of-input token type and no text.
The initial default value for both the error and the end-of-input token type is null. It is recommended to set them to two distinct, reserved, non-null values before use, as in the sketch below.
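
A minimal configuration sketch, assuming an application-defined enum of token types and a rule set built elsewhere; Ty, buildRules, and the reserved values EOF and ERROR are hypothetical names:

  enum Ty { IDENT, NUMBER, EOF, ERROR }

  TokenRuleSet<Ty> rules = buildRules();  // hypothetical helper; the rule definition API is not shown on this page
  Lexer<Ty> lexer = new Lexer<>(rules)
      .setEndType(Ty.EOF)       // reserved, non-null, not matched by any proper rule
      .setErrorType(Ty.ERROR);  // reserved, non-null, not matched by any proper rule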
  • Constructor Details

    • Lexer

      public Lexer(TokenRuleSet<T> rules)
      Creates a new instance.

      The given token rule set is not interpreted literally during token construction; instead, it is compiled to an efficient internal representation by this constructor.

      Parameters:
      rules - the token rule set to use
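
      Because the rules are compiled once, in this constructor, a single instance can be reused across many inputs without recompilation. A sketch under that assumption (Ty, rules, inputs, and consume are hypothetical):

        Lexer<Ty> lexer = new Lexer<>(rules);  // rules compiled once, here
        for (CodePointSource in : inputs)      // reuse for several inputs
            consume(lexer.lex(in));            // no recompilation per call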
  • Method Details

    • setEndType

      public Lexer<T> setEndType(@Opt T endType)
      Sets the token type indicating the end of input.

      The given type should not also be associated with any proper token rule. The type may be null.

      The client that consumes the token stream should be prepared to deal with an unbounded number of tokens of this type following the actual input. It is not necessary to consume more than one of them; the first token of this type already indicates that the input has been processed completely.

      It is possible, although not recommended, to call this method while a lexical analysis is ongoing. The effect occurs immediately.

      Parameters:
      endType - the token type indicating the end of input
      Returns:
      this
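
      A consumer-loop sketch that stops at the first end-of-input token; source is a hypothetical CodePointSource, and the accessors get() on the token source and getType() on tokens are assumptions here, not documented on this page:

        LookaheadTokenSource<Void, Ty, int[]> tokens = lexer.lex(source);
        Token<Void, Ty> t;
        do {
            t = tokens.get();                 // assumed accessor
            consume(t);                       // hypothetical consumer
        } while (t.getType() != Ty.EOF);      // the first end token ends the loop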
    • setErrorType

      public Lexer<T> setErrorType(@Opt T errorType)
      Sets the token type indicating an error in the lexical analysis.

      The given type should not also be associated with any proper token rule. The type may be null.

      The client that consumes the token stream should be prepared to deal with spontaneous occurrences of this token. In some application-dependent cases, subsequent tokens may still be meaningful, and error recovery may be attempted.

      It is possible, although not recommended, to call this method while a lexical analysis is ongoing. The effect occurs immediately.

      Parameters:
      errorType - the token type indicating an error in the lexical analysis
      Returns:
      this
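
      A sketch of tolerant consumption: an error token is reported, but reading continues, since non-error tokens may follow a recoverable error. The accessors get(), getType(), and getText() are assumptions, as above:

        Token<Void, Ty> t = tokens.get();
        if (t.getType() == Ty.ERROR)
            System.err.println("lexical error: " + t.getText());  // the error message text
        // application-dependent: continue reading to attempt recovery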
    • lex

      public <D> LookaheadTokenSource<D,T,int[]> lex(CodePointSource input)
      Returns a token source that produces tokens from the given input source.

      Location information in the produced tokens contains neither a document identifier nor a line number; column numbers start at 0. This is suitable, e.g., for reading from dynamically created strings.

      Type Parameters:
      D - the nominal type of document identifiers in token locations; arbitrary, since all such values are null
      Parameters:
      input - the input source to analyze
      Returns:
      a token source that produces tokens from the given input source
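
      Since all document identifier values are null, any type argument may be chosen for D. A sketch, where stringSource is a hypothetical CodePointSource over a dynamically created string:

        LookaheadTokenSource<Void, Ty, int[]> tokens = lexer.<Void>lex(stringSource);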
    • lex

      public <D> LookaheadTokenSource<D,T,int[]> lex(CodePointSource input, @Opt D documentId, int firstLine, int firstColumn)
      Returns a token source that produces tokens from the given input source.
      Type Parameters:
      D - the type of document identifiers in token locations
      Parameters:
      input - the input source to analyze
      documentId - the document identifier of the document underlying the input, or null if not available
      firstLine - the non-negative first line number of the document underlying the input, or one of the special values Location.UNKNOWN or Location.NOT_APPLICABLE
      firstColumn - the non-negative first column number of the document underlying the input, or one of the special values Location.UNKNOWN or Location.NOT_APPLICABLE
      Returns:
      a token source that produces tokens from the given input source
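
      A sketch with full location information, using java.io.File as the document identifier type; fileSource is a hypothetical CodePointSource over that file:

        LookaheadTokenSource<File, Ty, int[]> tokens =
            lexer.lex(fileSource, new File("example.lang"), 1, 0);  // lines from 1, columns from 0

        // without a usable line counter, one of the special values is allowed:
        lexer.lex(fileSource, null, Location.UNKNOWN, 0);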