Package eu.bandm.tools.lexic
Toolkit for the dynamic construction of lexical analyzers.

Input

The input of lexical analysis is assumed to be a sequence of Unicode code points, encoded according to UTF-32 as int values, with the special value -1 indicating the end of the input sequence. This data structure is embodied in the supplier interface CodePointSource.
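
As an illustration of this contract (a sketch only, not the library's code), a source that draws code points from a String could look as follows; java.util.function.IntSupplier stands in for the actual CodePointSource interface, whose method names are not assumed here:

    import java.util.function.IntSupplier;

    // Sketch of the code point contract described above; IntSupplier is
    // used here instead of the actual CodePointSource interface.
    class StringCodePoints implements IntSupplier {
        private final String text;
        private int index = 0;

        StringCodePoints(String text) { this.text = text; }

        // Returns the next UTF-32 code point, or -1 at the end of the input.
        @Override
        public int getAsInt() {
            if (index >= text.length()) return -1;   // end of sequence
            int cp = text.codePointAt(index);        // full code point, not a char
            index += Character.charCount(cp);        // advance by one or two chars
            return cp;
        }
    }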

Often the input sequence is drawn from the contents of a document. In that case it is useful to obtain metadata about the location of each input element in the original document, for later back-reference. The datatype Location serves that purpose and is available from the interface LocationCodePointSource.
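
A minimal sketch of such location tracking, assuming nothing about the actual Location type beyond line and column counting:

    // Sketch: wraps the code point sketch above and tracks line/column
    // positions; the real LocationCodePointSource and Location offer more.
    class LocatedCodePoints {
        private final StringCodePoints source;
        private int line = 1, column = 0;

        LocatedCodePoints(StringCodePoints source) { this.source = source; }

        // Delivers the next code point and updates the tracked position.
        int next() {
            int cp = source.getAsInt();
            if (cp == '\n') { line++; column = 0; }
            else if (cp != -1) column++;
            return cp;
        }

        // Position of the most recently delivered code point.
        String position() { return "line " + line + ", column " + column; }
    }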

Output

The output of lexical analysis is a sequence of Token objects, each consisting of a token type, a segment of the input text, and a Location object. Note that, in contrast to the input, the text segment of a token is encoded as a String, i.e., in UTF-16. The end of the token sequence is indicated by a token with a distinguished token type and an empty text segment. The end token type can be chosen arbitrarily, as long as the supplier and the consumer agree. This data structure is embodied in the supplier interface TokenSource.
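
A token value of this shape can be sketched as a small record; this illustrates the structure described above and is not the package's Token or any of its implementation classes (the location is simplified to a String):

    // Sketch of a token as described above; T is the application-chosen
    // token type, typically an enum.
    record MiniToken<T>(T type, String text, String location) {

        // End-of-sequence token: distinguished type, empty text segment.
        static <T> MiniToken<T> end(T endType, String location) {
            return new MiniToken<>(endType, "", location);
        }
    }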

A token sequence can be processed in a pipeline of TokenProcessor objects. The most common processing task is filtering, implemented by subclasses of TokenFilter. Convenience methods such as TokenSource.removeTypes(T[]) support type-based filtering.
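
In the spirit of such type-based filtering (the method below is a stand-alone sketch, not the signature of TokenSource.removeTypes(T[])), whitespace and comments could be dropped like this, reusing the MiniToken sketch above:

    import java.util.List;
    import java.util.Set;

    // Hypothetical token types for the sketches in this text.
    enum Ty { IDENT, NUMBER, WHITESPACE, COMMENT, ERROR, EOF }

    class FilterDemo {
        // Keeps only the tokens whose type is not in the given set.
        static List<MiniToken<Ty>> removeTypes(List<MiniToken<Ty>> in, Set<Ty> drop) {
            return in.stream()
                     .filter(t -> !drop.contains(t.type()))
                     .toList();
        }
    }

Calling FilterDemo.removeTypes(tokens, Set.of(Ty.WHITESPACE, Ty.COMMENT)) then yields only the significant tokens.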

Error Handling

Neither CodePointSource nor TokenSource may convey errors by returning null results or by throwing exceptions.

An error during the production of a code point sequence must be communicated between the supplier and the consumer by application-specific means. An error may or may not terminate the sequence. See CodePointSource.read(java.io.Reader,java.util.function.Consumer) for an example contract.
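
A hedged re-creation of that idea (the real method's behavior may differ in detail): an I/O failure is reported out of band to a Consumer, and the code point sequence simply ends:

    import java.io.IOException;
    import java.io.Reader;
    import java.util.function.Consumer;

    // Sketch: code points from a Reader; errors go to a side channel.
    class ReaderCodePoints {
        private final Reader reader;
        private final Consumer<IOException> errorHandler;
        private boolean finished = false;

        ReaderCodePoints(Reader reader, Consumer<IOException> errorHandler) {
            this.reader = reader;
            this.errorHandler = errorHandler;
        }

        // Next code point, or -1 at the end of input or after an error.
        int next() {
            if (finished) return -1;
            try {
                int high = reader.read();
                if (high < 0) { finished = true; return -1; }
                if (Character.isHighSurrogate((char) high)) {
                    int low = reader.read();
                    if (low >= 0 && Character.isLowSurrogate((char) low))
                        return Character.toCodePoint((char) high, (char) low);
                }
                return high; // BMP code point (unpaired surrogates passed through)
            } catch (IOException e) {
                errorHandler.accept(e);  // communicate the error out of band ...
                finished = true;         // ... and terminate the sequence
                return -1;
            }
        }
    }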

An error during the production of a token sequence should be communicated between the supplier and the consumer by a token with a distinguished token type, carrying an error message instead of a text segment. The error token type can be chosen arbitrarily, as long as the supplier and the consumer agree. An error may or may not terminate the sequence. See Lexer.setErrorType(T) for an example contract.
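
Reusing the Ty and MiniToken sketches from above, an in-band error report then looks like an ordinary token whose type is the agreed-upon error type:

    class ErrorTokenDemo {
        // The consumer recognizes errors purely by the distinguished type;
        // for error tokens the text field carries the message.
        static void report(MiniToken<Ty> token) {
            if (token.type() == Ty.ERROR)
                System.err.println(token.location() + ": " + token.text());
        }
    }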

Rule Definitions

The lexical structure of the input language is defined by a set of token rules. The class TokenRuleSet provides collections of token rules and capabilities for disambiguation of overlapping token rules.

Each token rule consists of a token fragment that defines a regular language, and an associated token type. The class TokenRule provides such pairs.

Token fragments are compositional regular languages. The class TokenFragment provides factory methods for all the usual constructs on regular languages, such as single characters, union, concatenation, and intersection. Every fragment is automatically compiled to a nondeterministic finite automaton at construction time.
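
The factory method names of TokenFragment are not reproduced here; instead, the following self-contained sketch shows what "compositional regular languages" means, modeling a fragment as a function from an input and a start index to the set of indices where a match may end:

    import java.util.Set;
    import java.util.TreeSet;
    import java.util.function.BiFunction;

    // Sketch of compositional regular languages (not the library's
    // TokenFragment): union and concatenation compose smaller fragments
    // into larger ones; intersection, repetition etc. work analogously.
    interface Frag extends BiFunction<String, Integer, Set<Integer>> {

        // Fragment matching a single code point.
        static Frag ch(int cp) {
            return (s, i) -> {
                Set<Integer> ends = new TreeSet<>();
                if (i < s.length() && s.codePointAt(i) == cp)
                    ends.add(i + Character.charCount(cp));
                return ends;
            };
        }

        // Union: matches whatever either fragment matches.
        default Frag or(Frag other) {
            return (s, i) -> {
                Set<Integer> ends = new TreeSet<>(this.apply(s, i));
                ends.addAll(other.apply(s, i));
                return ends;
            };
        }

        // Concatenation: the second fragment continues where the first ended.
        default Frag then(Frag other) {
            return (s, i) -> {
                Set<Integer> ends = new TreeSet<>();
                for (int mid : this.apply(s, i)) ends.addAll(other.apply(s, mid));
                return ends;
            };
        }
    }

For example, Frag.ch('a').then(Frag.ch('b').or(Frag.ch('c'))) matches exactly the inputs ab and ac.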

Note that regular expressions are also a complete notation for regular languages, but real-world implementations such as java.util.regex have a different practical focus and different theoretical properties.

Automata

Token fragments have executable implementations in the form of nondeterministic finite automata, embodied by the class NAutomaton. Some operations on finite automata use the nondeterministic form; others require deterministic automata, embodied by the class DAutomaton. Conversions between the two are performed automatically when needed.

For efficient execution, deterministic automata can be simplified to a zero-overhead form, embodied by the class ZAutomaton.

All types of automata can be used to process a sequence of input code points in a uniform way, via the iterator-like interface Automaton.Trace.
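
The following sketch mimics that iterator-like style for a hard-coded deterministic automaton; the actual Automaton.Trace API is richer and is not quoted here:

    // Sketch of trace-style consumption: a mutable cursor over an automaton.
    class TinyTrace {
        private final int[][] delta;       // state -> code point -> next state
        private final boolean[] accepting; // acceptance per state
        private int state = 0;             // -1 once the automaton is stuck

        TinyTrace(int[][] delta, boolean[] accepting) {
            this.delta = delta;
            this.accepting = accepting;
        }

        // Consumes one code point; returns false once no continuation exists.
        boolean step(int codePoint) {
            if (state < 0) return false;
            state = (codePoint >= 0 && codePoint < delta[state].length)
                    ? delta[state][codePoint] : -1;
            return state >= 0;
        }

        boolean isAccepting() { return state >= 0 && accepting[state]; }
    }

A lexer drives such a trace by feeding it code points until step(int) returns false, remembering the last position at which isAccepting() was true; the prefix up to that position is the longest match.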

Usage

FIXME
Class Summary

  • Automaton: Base class of finite automata.
  • Automaton.State: State of an automaton.
  • Automaton.Trace: Iterator-like mutable API for tracking the consumption of an input sequence of code points by an automaton.
  • Behavior of an automaton in a particular state.
  • Immutable map of Unicode code point keys encoded as int values to arbitrary values.
  • CodePointSource: A specialized supplier of Unicode code points.
  • DAutomaton: Deterministic finite-state labeled automaton.
  • This class contains only static factories for pretty-printable representations of collections.
  • Lexer: Lexical analyzer that maps code point sources to token sources.
  • LocationCodePointSource: A secondary code point source that tracks location information.
  • NAutomaton: Nondeterministic finite-state labeled automaton.
  • Simple immutable token implementation.
  • Token<D,T>: Abstract interface of parser tokens.
  • TokenFilter: Abstract base class for token processors that filter out certain tokens.
  • TokenFragment: Syntactic fragment as building block for a token rule.
  • Singleton type indicating successful matching.
  • TokenProcessor: Abstract base class for secondary token sources that feed on other token sources.
  • TokenRule: Associates a token type with a syntax fragment.
  • TokenRuleSet: A set of token rules together with a precedence relation between token types.
  • TokenSource: A specialized supplier of tokens.
  • Indicates that the implementor can process code point sequences like an automaton.
  • ZAutomaton: Zero-overhead automaton that is identical to the behavior of its own initial state.