Package eu.bandm.tools.lexic
Input
The input of lexical analysis is assumed to be a sequence of Unicode code points, encoded according to UTF-32 as int values, with the special
value -1 indicating the end of the input sequence. This data
structure is embodied in the supplier interface CodePointSource.
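The convention described above can be sketched with plain JDK types. The following is a minimal illustration, not the library's CodePointSource: it assumes java.util.function.IntSupplier as a stand-in for the supplier interface, and the class and method names are hypothetical.

```java
import java.util.function.IntSupplier;

// Hypothetical stand-in for a code point supplier; not the library's CodePointSource.
class CodePointSourceSketch {

    /** Returns a supplier that yields the code points of text, then -1 on every further call. */
    static IntSupplier fromString(final String text) {
        return new IntSupplier() {
            private int index = 0; // position in UTF-16 code units

            @Override
            public int getAsInt() {
                if (index >= text.length()) {
                    return -1; // end-of-input marker
                }
                int cp = text.codePointAt(index);
                index += Character.charCount(cp); // advance by 1 or 2 code units
                return cp;
            }
        };
    }

    public static void main(String[] args) {
        // 'a', then U+1D11E (outside the Basic Multilingual Plane), then 'b'
        IntSupplier source = fromString("a\uD834\uDD1Eb");
        for (int cp = source.getAsInt(); cp != -1; cp = source.getAsInt()) {
            System.out.printf("U+%04X%n", cp);
        }
    }
}
```

The consumer sees three code points although the underlying String holds four UTF-16 code units.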
Often the input sequence is drawn from the contents of a document. In
that case it is useful for back-reference to obtain metadata about the
location of input elements in the original document. The datatype Location is used for that purpose, available from
the interface LocationCodePointSource.
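As an illustration of location tracking, the sketch below wraps a code point supplier and records where each code point came from. It is not the library's LocationCodePointSource: the contents of Location are not specified here, so a 1-based line and column pair is assumed, and all names are hypothetical.

```java
import java.util.PrimitiveIterator;
import java.util.function.IntSupplier;

// Hypothetical sketch; line/column are an assumed notion of "location".
class LocationTrackingSketch {

    /** Code point supplier over a String, with -1 as the end marker. */
    static IntSupplier fromString(String text) {
        final PrimitiveIterator.OfInt it = text.codePoints().iterator();
        return () -> it.hasNext() ? it.nextInt() : -1;
    }

    /** Forwards code points and remembers the position of the last one returned. */
    static final class Tracking implements IntSupplier {
        private final IntSupplier in;
        private int line = 1, column = 1;         // position of the next code point
        private int lastLine = 1, lastColumn = 1; // position of the last code point returned

        Tracking(IntSupplier in) {
            this.in = in;
        }

        @Override
        public int getAsInt() {
            int cp = in.getAsInt();
            if (cp == -1) {
                return -1; // end of input: location stays unchanged
            }
            lastLine = line;
            lastColumn = column;
            if (cp == '\n') {
                line++;
                column = 1;
            } else {
                column++;
            }
            return cp;
        }

        /** Location of the last code point returned, formatted as line:column. */
        String lastLocation() {
            return lastLine + ":" + lastColumn;
        }
    }

    public static void main(String[] args) {
        Tracking source = new Tracking(fromString("ab\nc"));
        for (int cp = source.getAsInt(); cp != -1; cp = source.getAsInt()) {
            System.out.printf("U+%04X at %s%n", cp, source.lastLocation());
        }
    }
}
```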
Output
The output of lexical analysis is a sequence of Token objects, each
consisting of a token type, a segment of the input text, and a Location object. Note that the text segment of a
token is now encoded as a String in UTF-16. The end of the token
sequence is indicated by a token with a distinguished token type and empty
text segment. The end token type can be chosen arbitrarily, as long as the
supplier and the consumer agree. This data structure is embodied in the
supplier interface TokenSource.
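The end-token convention and the change of encoding can be sketched as follows. This is an illustration under assumptions, not the library's Token or TokenSource: java.util.function.Supplier stands in for the supplier interface, the Tok class and the type names WORD and END are hypothetical, and the Location component is omitted for brevity.

```java
import java.util.PrimitiveIterator;
import java.util.function.Supplier;

// Hypothetical sketch of a token supplier with a distinguished end token.
class TokenSequenceSketch {

    static final String END = "END"; // end token type agreed on by supplier and consumer

    /** Minimal token: a type and a text segment (as a UTF-16 String). */
    static final class Tok {
        final String type;
        final String text;

        Tok(String type, String text) {
            this.type = type;
            this.text = text;
        }
    }

    /** Yields one WORD token per run of non-space code points, then END tokens. */
    static Supplier<Tok> words(String input) {
        final PrimitiveIterator.OfInt it = input.codePoints().iterator();
        return () -> {
            StringBuilder text = new StringBuilder();
            while (it.hasNext()) {
                int cp = it.nextInt();
                if (cp == ' ') {
                    if (text.length() > 0) {
                        break; // a space ends the current word
                    }
                } else {
                    text.appendCodePoint(cp); // re-encodes the code point as UTF-16
                }
            }
            return text.length() == 0 ? new Tok(END, "") : new Tok("WORD", text.toString());
        };
    }

    public static void main(String[] args) {
        Supplier<Tok> tokens = words("hi \uD834\uDD1E");
        for (Tok t = tokens.get(); !t.type.equals(END); t = tokens.get()) {
            System.out.println(t.type + " chars=" + t.text.length()
                    + " codePoints=" + t.text.codePointCount(0, t.text.length()));
        }
    }
}
```

The second token holds a single code point but has a String length of 2, which is the effect of the UTF-16 encoding mentioned above.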
A token sequence can be processed in a pipeline of TokenProcessor
objects. The most common processing task is filtering, implemented by
subclasses of TokenFilter. Convenience methods such as TokenSource.removeTypes(T[]) support type-based filtering.
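Type-based filtering can be pictured as a wrapper around a token supplier. The sketch below is written in the spirit of the convenience method mentioned above, but it is not that method: Supplier stands in for TokenSource, and dropTypes, fromList and Tok are hypothetical names.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Iterator;
import java.util.List;
import java.util.Set;
import java.util.function.Supplier;

// Hypothetical sketch of a filtering stage in a token pipeline.
class TokenFilterSketch {

    static final String END = "END"; // assumed end token type

    static final class Tok {
        final String type;
        final String text;

        Tok(String type, String text) {
            this.type = type;
            this.text = text;
        }
    }

    /** Token supplier over a list; yields END tokens once the list is exhausted. */
    static Supplier<Tok> fromList(List<Tok> tokens) {
        final Iterator<Tok> it = tokens.iterator();
        return () -> it.hasNext() ? it.next() : new Tok(END, "");
    }

    /** Skips tokens whose type is listed; the end token always passes through. */
    static Supplier<Tok> dropTypes(final Supplier<Tok> in, String... types) {
        final Set<String> dropped = new HashSet<>(Arrays.asList(types));
        return () -> {
            Tok t = in.get();
            while (!t.type.equals(END) && dropped.contains(t.type)) {
                t = in.get();
            }
            return t;
        };
    }

    public static void main(String[] args) {
        List<Tok> raw = Arrays.asList(
                new Tok("WORD", "let"), new Tok("SPACE", " "), new Tok("WORD", "x"));
        Supplier<Tok> filtered = dropTypes(fromList(raw), "SPACE");
        for (Tok t = filtered.get(); !t.type.equals(END); t = filtered.get()) {
            System.out.println(t.type + " " + t.text);
        }
    }
}
```

Letting the end token bypass the filter keeps the wrapper from looping forever if the end type is listed by mistake.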
Error Handling
Neither CodePointSource nor TokenSource may convey errors by
returning null results or by raising exceptions.
An error during the production of a code point sequence must be
communicated between the supplier and the consumer by application-specific
means. An error may or may not terminate the sequence. See CodePointSource.read(java.io.Reader,java.util.function.Consumer) for an
example contract.
An error during the production of a token sequence should be communicated
between the supplier and the consumer by a token with a distinguished token
type, and an error message instead of a text segment. The error token type can be
chosen arbitrarily, as long as the supplier and the consumer agree. An error
may or may not terminate the sequence. See Lexer.setErrorType(T) for an
example contract.
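The error-token convention can be illustrated with a small hand-written scanner. This sketch does not use the library's Lexer: the token types NUM, ERROR and END, the Tok class and the message format are assumptions, and the result is collected in a List only to keep the example short.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: errors travel in-band as tokens, never as null or exceptions.
class ErrorTokenSketch {

    static final String NUM = "NUM", ERROR = "ERROR", END = "END"; // assumed token types

    static final class Tok {
        final String type;
        final String text; // for ERROR tokens this holds the message

        Tok(String type, String text) {
            this.type = type;
            this.text = text;
        }
    }

    /** Scans digit runs separated by spaces; any other code point yields an ERROR token. */
    static List<Tok> lex(String input) {
        List<Tok> out = new ArrayList<>();
        int i = 0;
        while (i < input.length()) {
            int cp = input.codePointAt(i);
            if (Character.isDigit(cp)) {
                int start = i;
                while (i < input.length() && Character.isDigit(input.codePointAt(i))) {
                    i += Character.charCount(input.codePointAt(i));
                }
                out.add(new Tok(NUM, input.substring(start, i)));
            } else if (cp == ' ') {
                i++; // separators produce no token
            } else {
                out.add(new Tok(ERROR, String.format("unexpected code point U+%04X", cp)));
                i += Character.charCount(cp); // the error does not terminate the sequence
            }
        }
        out.add(new Tok(END, ""));
        return out;
    }

    public static void main(String[] args) {
        for (Tok t : lex("12 x 3")) {
            System.out.println(t.type + " [" + t.text + "]");
        }
    }
}
```

In this sketch the sequence continues after the error; a supplier could equally decide to emit the end token right after the first error token.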
Rule Definitions
The lexical structure of the input language is defined by a set of token rules. The class TokenRuleSet provides collections of token rules
and capabilities for disambiguation of overlapping token rules.
Each token rule consists of a token fragment that defines a regular
language, and an associated token type. The class TokenRule provides
such pairs.
Token fragments are compositional regular languages. The class TokenFragment provides factory methods for all the usual constructs on
regular languages, e.g., single characters, union, concatenation,
intersection, etc. Every fragment is automatically compiled to a
nondeterministic finite automaton at construction time.
Note that regular expressions are also a complete notation for
regular languages, but real-world implementations such as java.util.regex have a different practical focus and different theoretical properties.
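One such difference can be observed directly with the JDK. In java.util.regex, alternation is ordered: the matcher commits to the first alternative that succeeds. In a purely language-theoretic reading, a|ab simply denotes the set {a, ab}, and a lexer would typically look for the longest prefix in that set. The helper names below are hypothetical, and the longest-prefix search is a deliberately naive illustration, not how an automaton-based lexer works.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Contrasts ordered alternation in java.util.regex with a longest-prefix reading.
class RegexAlternationSketch {

    /** Prefix of input matched by lookingAt(), or null if there is none. */
    static String regexPrefix(String regex, String input) {
        Matcher m = Pattern.compile(regex).matcher(input);
        return m.lookingAt() ? m.group() : null;
    }

    /** Longest prefix of input that belongs to the language of regex, or null. */
    static String longestPrefix(String regex, String input) {
        Pattern p = Pattern.compile(regex);
        for (int end = input.length(); end >= 0; end--) {
            String candidate = input.substring(0, end);
            if (p.matcher(candidate).matches()) {
                return candidate;
            }
        }
        return null;
    }

    public static void main(String[] args) {
        System.out.println(regexPrefix("a|ab", "ab"));   // prints "a"
        System.out.println(longestPrefix("a|ab", "ab")); // prints "ab"
    }
}
```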
Automata
Token fragments have executable implementations in the form of nondeterministic finite automata, embodied by the class NAutomaton.
Some operations on finite automata use the nondeterministic form; others
require deterministic automata, embodied by the class DAutomaton.
Conversions between the two are performed automatically when needed.
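The textbook way to turn a nondeterministic automaton into a deterministic one is the subset construction. Whether NAutomaton and DAutomaton convert in exactly this way is not stated here, so the following is only a generic sketch: it assumes an automaton without epsilon moves, integer states, and code point labels, and all class names are hypothetical.

```java
import java.util.ArrayDeque;
import java.util.Collections;
import java.util.Deque;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Generic subset construction; not the library's conversion code.
class SubsetConstructionSketch {

    /** NFA without epsilon moves: delta.get(state).get(codePoint) is the successor set. */
    static final class Nfa {
        final Map<Integer, Map<Integer, Set<Integer>>> delta = new HashMap<>();
        final Set<Integer> accepting = new HashSet<>();
        final int start = 0;

        void edge(int from, int codePoint, int to) {
            delta.computeIfAbsent(from, k -> new HashMap<>())
                 .computeIfAbsent(codePoint, k -> new HashSet<>())
                 .add(to);
        }
    }

    /** DFA whose states are sets of NFA states. */
    static final class Dfa {
        final Map<Set<Integer>, Map<Integer, Set<Integer>>> delta = new HashMap<>();
        final Set<Set<Integer>> accepting = new HashSet<>();
        Set<Integer> start;

        boolean accepts(String input) {
            Set<Integer> state = start;
            for (int cp : input.codePoints().toArray()) {
                state = delta.getOrDefault(state, Collections.emptyMap()).get(cp);
                if (state == null) {
                    return false; // no transition: reject
                }
            }
            return accepting.contains(state);
        }
    }

    /** Explores all reachable sets of NFA states, starting from {start}. */
    static Dfa determinize(Nfa nfa) {
        Dfa dfa = new Dfa();
        dfa.start = Collections.singleton(nfa.start);
        dfa.delta.put(dfa.start, new HashMap<>());
        Deque<Set<Integer>> todo = new ArrayDeque<>();
        todo.add(dfa.start);
        while (!todo.isEmpty()) {
            Set<Integer> current = todo.remove();
            Map<Integer, Set<Integer>> out = dfa.delta.get(current);
            for (Integer s : current) {
                if (nfa.accepting.contains(s)) {
                    dfa.accepting.add(current); // accepting if any member is accepting
                }
                for (Map.Entry<Integer, Set<Integer>> e
                        : nfa.delta.getOrDefault(s, Collections.emptyMap()).entrySet()) {
                    out.computeIfAbsent(e.getKey(), k -> new HashSet<>()).addAll(e.getValue());
                }
            }
            for (Set<Integer> target : out.values()) {
                if (!dfa.delta.containsKey(target)) { // newly discovered DFA state
                    dfa.delta.put(target, new HashMap<>());
                    todo.add(target);
                }
            }
        }
        return dfa;
    }

    /** Example NFA accepting all strings over {a, b} that end in "ab". */
    static Nfa endsWithAb() {
        Nfa nfa = new Nfa();
        nfa.edge(0, 'a', 0);
        nfa.edge(0, 'b', 0);
        nfa.edge(0, 'a', 1);
        nfa.edge(1, 'b', 2);
        nfa.accepting.add(2);
        return nfa;
    }

    public static void main(String[] args) {
        Dfa dfa = determinize(endsWithAb());
        System.out.println(dfa.delta.size() + " DFA states");
        System.out.println(dfa.accepts("aab") + " " + dfa.accepts("aba"));
    }
}
```

For the example automaton the construction reaches three state sets, {0}, {0, 1} and {0, 2}, of which only the last is accepting.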
For efficient execution, deterministic automata can be simplified to a
zero-overhead form, embodied by the class ZAutomaton.
All types of automata can be used to process a sequence of input code
points in a uniform way, via the iterator-like interface Automaton.Trace.
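The idea of an iterator-like trace can be sketched as a small mutable cursor that is fed one code point at a time and can be queried after each step. The method names step, isAccepting and isDead are assumptions made for this illustration and are not taken from Automaton.Trace; the automaton itself is hard-coded for the language [a-z][a-z0-9]*.

```java
// Hypothetical sketch of a trace: a mutable cursor over a tiny hard-coded DFA.
class TraceSketch {

    static final class Trace {
        private int state = 0; // 0 = start, 1 = inside an identifier, -1 = dead

        /** Consumes one code point and moves to the successor state. */
        void step(int cp) {
            boolean letter = cp >= 'a' && cp <= 'z';
            boolean digit = cp >= '0' && cp <= '9';
            if (state == 0) {
                state = letter ? 1 : -1;
            } else if (state == 1) {
                state = (letter || digit) ? 1 : -1;
            }
            // a dead trace stays dead
        }

        boolean isAccepting() {
            return state == 1;
        }

        boolean isDead() {
            return state == -1;
        }
    }

    /** Number of code points in the longest accepted prefix, or -1 if there is none. */
    static int longestMatch(String input) {
        Trace trace = new Trace();
        int consumed = 0;
        int best = -1;
        for (int cp : input.codePoints().toArray()) {
            trace.step(cp);
            if (trace.isDead()) {
                break; // no longer prefix can match
            }
            consumed++;
            if (trace.isAccepting()) {
                best = consumed;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        System.out.println(longestMatch("x1+y")); // prints 2
        System.out.println(longestMatch("1x"));   // prints -1
    }
}
```

Because the driver loop only talks to the trace, the same loop would work for any automaton representation that offers such a cursor.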
Usage
The class Lexer acts as the factory for automaton-driven TokenSource instances. It takes a TokenRuleSet and an optional
Classifier, and prepares an accepting automaton. Instances require a
fixed CodePointSource at creation.
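To make the role of a lexer concrete, the sketch below turns an input text into typed tokens from an ordered set of rules. It does not use the Lexer, TokenRuleSet or automaton classes: rules are plain functions, and the disambiguation policy (longest match wins, ties go to the rule listed first) is an assumption chosen for illustration. All names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.function.IntPredicate;

// Hypothetical rule-driven scanner; not the library's Lexer.
class LexerSketch {

    /** A rule returns the end index of its match starting at from, or -1. */
    interface Rule {
        int matchEnd(String s, int from);
    }

    /** Rule matching one fixed word. */
    static Rule literal(final String word) {
        return (s, from) -> s.startsWith(word, from) ? from + word.length() : -1;
    }

    /** Rule matching a non-empty run of code points that satisfy p. */
    static Rule run(final IntPredicate p) {
        return (s, from) -> {
            int i = from;
            while (i < s.length() && p.test(s.codePointAt(i))) {
                i += Character.charCount(s.codePointAt(i));
            }
            return i > from ? i : -1;
        };
    }

    /** Example rule set; insertion order doubles as precedence. */
    static LinkedHashMap<String, Rule> defaultRules() {
        LinkedHashMap<String, Rule> rules = new LinkedHashMap<>();
        rules.put("KEYWORD", literal("if"));
        rules.put("IDENT", run(Character::isLetter));
        rules.put("SPACE", run(cp -> cp == ' '));
        return rules;
    }

    /** Produces "TYPE:text" entries, ending with "END:". */
    static List<String> lex(String input, LinkedHashMap<String, Rule> rules) {
        List<String> tokens = new ArrayList<>();
        int pos = 0;
        while (pos < input.length()) {
            String bestType = "ERROR";
            int bestEnd = -1;
            for (Map.Entry<String, Rule> e : rules.entrySet()) {
                int end = e.getValue().matchEnd(input, pos);
                if (end > pos && end > bestEnd) { // strictly longer: earlier rule wins ties
                    bestType = e.getKey();
                    bestEnd = end;
                }
            }
            if (bestEnd == -1) { // no rule matched: one code point becomes an error token
                bestEnd = pos + Character.charCount(input.codePointAt(pos));
            }
            tokens.add(bestType + ":" + input.substring(pos, bestEnd));
            pos = bestEnd;
        }
        tokens.add("END:");
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(lex("if iffy 1", defaultRules()));
    }
}
```

With these rules, "if" is reported as KEYWORD because both KEYWORD and IDENT match two characters and KEYWORD is listed first, while "iffy" is reported as IDENT because the longer match wins.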
Transferrable Lookahead
Parallelization
Code Point Classification
Class: Description
Automaton<L, T>: Base class of finite automata.
[…]: State of an automaton.
[…]: Iterator-like mutable API for tracking the consumption of an input sequence of code points by an automaton.
Behavior<L, T>: Behavior of an automaton in a particular state.
[…]: A function on code points that substitutes representatives.
CodePointMap<V>: Immutable map of Unicode code point keys encoded as int values to arbitrary values.
[…]: A specialized supplier of Unicode code points.
DAutomaton<V>: Deterministic finite-state labeled automaton.
[…]: This class contains only static factories for pretty-printable representations of collections.
Lexer<T>: Lexical analyzer that maps code point sources to token sources.
[…]: Indicates that an unexpected or illegal code point has been found in an input document.
[…]: A secondary code point source that tracks location information.
LookaheadTokenFilter<D, T, L>: Abstract base class for token processors that filter out certain tokens.
LookaheadTokenMultiplexer<K, D, T, L>: A multiplexer between token source channels with internal lookahead buffer, selected by a key.
LookaheadTokenProcessor<D, T, L>: Abstract base class for secondary token sources that feed on other token sources.
LookaheadTokenSource<D, T, L>: A supplier of tokens with internal lookahead buffer.
LookaheadTokenSourceProxy<D, T, L>: A dynamic proxy that can be reconfigured which token source to forward at any time.
NAutomaton<V>: Nondeterministic finite-state labeled automaton.
SimpleToken<D, T>: Simple immutable token implementation.
Token<D, T>: Abstract interface of lexical tokens.
Token.Factory<D, T, R extends Token<D, T>>: Functional interface for the creation of tokens.
TokenFilter<D, T>: Abstract base class for token processors that filter out certain tokens.
[…]: Syntactic fragment as building block for a token rule.
[…]: Singleton type indicating successful matching.
TokenRule<T>: Associates a token type with a syntax fragment.
TokenRuleSet<T>: A set of token rules together with a precedence relation between token types.
TokenSource<D, T>: A specialized supplier of tokens.
Traceable<L>: Indicates that the implementor can process code point sequences like an automaton.
Traversable<L, S>: Indicates that the implementor can be traversed as the transition graph of an automaton.
ZAutomaton<V>: Zero-overhead automaton that is identical to the behavior of its own initial state.