Package eu.bandm.tools.lexic
Input
The input of lexical analysis is assumed to be a sequence of Unicode code points, encoded according to UTF-32 as int values, with the special value -1 indicating the end of the input sequence. This data structure is embodied in the supplier interface CodePointSource.
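This end-of-input convention can be sketched in plain Java without assuming anything about the library's actual API; the interface name SimpleCodePointSource and its method get() below are illustrative stand-ins for CodePointSource, not its real signature:

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

// Illustrative stand-in for a code point supplier: delivers UTF-32
// code points as int values and -1 at the end of the input.
interface SimpleCodePointSource {
    int get(); // next code point, or -1 at end of input

    // Wrap a java.io.Reader, combining UTF-16 surrogate pairs into
    // single code points (malformed-input handling simplified).
    static SimpleCodePointSource of(Reader in) {
        return () -> {
            try {
                int high = in.read();
                if (high < 0) return -1; // end of input
                if (Character.isHighSurrogate((char) high)) {
                    int low = in.read();
                    if (low >= 0 && Character.isLowSurrogate((char) low))
                        return Character.toCodePoint((char) high, (char) low);
                }
                return high;
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        };
    }
}

class CodePointDemo {
    public static void main(String[] args) {
        // "𝕊" (U+1D54A) needs a surrogate pair in UTF-16 but is one code point.
        SimpleCodePointSource src = SimpleCodePointSource.of(new StringReader("a𝕊"));
        System.out.println(src.get()); // 97 ('a')
        System.out.println(src.get()); // 120138 (U+1D54A)
        System.out.println(src.get()); // -1, end of input
    }
}
```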
Often the input sequence is drawn from the contents of a document. In that case it is useful, for back-references, to obtain metadata about the location of input elements in the original document. The datatype Location serves that purpose; it is made available by the interface LocationCodePointSource.
Output
The output of lexical analysis is a sequence of Token objects, each consisting of a token type, a segment of the input text, and a Location object. Note that the text segment of a token is encoded as a String, i.e., in UTF-16 rather than UTF-32. The end of the token sequence is indicated by a token with a distinguished token type and an empty text segment. The end token type can be chosen arbitrarily, as long as the supplier and the consumer agree. This data structure is embodied in the supplier interface TokenSource.
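A minimal sketch of this token shape and the end-token convention, in plain Java; the record MiniToken and the enum Type are illustrative, not the library's classes:

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Illustrative token: a type plus a UTF-16 text segment (location omitted).
enum Type { IDENT, NUMBER, EOF }

record MiniToken(Type type, String text) {}

class TokenDemo {
    // Drain a token sequence until the agreed-upon end token appears:
    // a token with the distinguished type and an empty text segment.
    static List<MiniToken> drain(Iterator<MiniToken> source) {
        List<MiniToken> result = new ArrayList<>();
        while (true) {
            MiniToken t = source.next();
            result.add(t);
            if (t.type() == Type.EOF) return result;
        }
    }

    public static void main(String[] args) {
        // The text segment is a UTF-16 String: a code point outside the
        // BMP occupies two chars.
        String s = "𝕊";
        System.out.println(s.length());                      // 2 chars
        System.out.println(s.codePointCount(0, s.length())); // 1 code point

        List<MiniToken> tokens = List.of(
            new MiniToken(Type.IDENT, "x"),
            new MiniToken(Type.NUMBER, "42"),
            new MiniToken(Type.EOF, "")); // end token by convention
        System.out.println(drain(tokens.iterator()).size()); // 3
    }
}
```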
A token sequence can be processed in a pipeline of TokenProcessor objects. The most common processing task is filtering, implemented by subclasses of TokenFilter. Convenience methods such as TokenSource.removeTypes(T[]) support type-based filtering.
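The effect of type-based filtering can be sketched in a few lines of plain Java; the names Tok, Kind, and the helper removeTypes are illustrative, mirroring only the spirit of TokenSource.removeTypes:

```java
import java.util.EnumSet;
import java.util.List;
import java.util.Set;

// Illustrative token types; SPACE and COMMENT are typically filtered out.
enum Kind { WORD, SPACE, COMMENT, EOF }

record Tok(Kind kind, String text) {}

class FilterDemo {
    // Keep only tokens whose type is NOT in the given set.
    static List<Tok> removeTypes(List<Tok> tokens, Set<Kind> drop) {
        return tokens.stream().filter(t -> !drop.contains(t.kind())).toList();
    }

    public static void main(String[] args) {
        List<Tok> in = List.of(
            new Tok(Kind.WORD, "a"), new Tok(Kind.SPACE, " "),
            new Tok(Kind.COMMENT, "//x"), new Tok(Kind.WORD, "b"),
            new Tok(Kind.EOF, ""));
        // Typical use: strip whitespace and comments before parsing.
        System.out.println(removeTypes(in, EnumSet.of(Kind.SPACE, Kind.COMMENT)));
    }
}
```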
Error Handling
Neither CodePointSource nor TokenSource may convey errors by returning null results or by raising exceptions. An error during the production of a code point sequence must be communicated between the supplier and the consumer by application-specific means. An error may or may not terminate the sequence. See CodePointSource.read(java.io.Reader, java.util.function.Consumer) for an example contract.
An error during the production of a token sequence should be communicated between the supplier and the consumer by a token with a distinguished token type and an error message in place of the text segment. The error token type can be chosen arbitrarily, as long as the supplier and the consumer agree. An error may or may not terminate the sequence. See Lexer.setErrorType(T) for an example contract.
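The in-band error-token protocol can be sketched as follows; the names TKind, TTok, and lex are illustrative, and the example shows an error that does not terminate the sequence:

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative token types: a distinguished ERROR type carries a
// message in the text slot instead of an input segment.
enum TKind { CHAR, ERROR, EOF }

record TTok(TKind kind, String text) {}

class ErrorDemo {
    // Toy lexer: letters become CHAR tokens; anything else becomes an
    // ERROR token, after which lexing simply continues.
    static List<TTok> lex(String input) {
        List<TTok> out = new ArrayList<>();
        for (int i = 0; i < input.length(); i++) {
            char c = input.charAt(i);
            if (Character.isLetter(c))
                out.add(new TTok(TKind.CHAR, String.valueOf(c)));
            else
                out.add(new TTok(TKind.ERROR, "unexpected character at offset " + i));
        }
        out.add(new TTok(TKind.EOF, "")); // end token
        return out;
    }

    public static void main(String[] args) {
        System.out.println(lex("a#b")); // CHAR, ERROR, CHAR, EOF
    }
}
```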
Rule Definitions
The lexical structure of the input language is defined by a set of token rules. The class TokenRuleSet provides collections of token rules and capabilities for the disambiguation of overlapping token rules.
Each token rule consists of a token fragment that defines a regular language and an associated token type. The class TokenRule represents such pairs.
Token fragments are compositional regular languages. The class TokenFragment provides factory methods for all the usual constructs on regular languages: single characters, union, concatenation, intersection, and so on. Every fragment is automatically compiled to a nondeterministic finite automaton at construction time.
Note that regular expressions are also a complete notation for regular languages, but real-world implementations such as java.util.regex have a different practical focus and different theoretical properties.
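As a loose analogy only, the same compositional style can be sketched on top of java.util.regex; the helper names ch, seq, union, and star are illustrative and not the library's factory methods, and Pattern cannot express intersection, which TokenFragment supports:

```java
import java.util.regex.Pattern;

// Compositional construction of a regular language, sketched with
// java.util.regex syntax strings instead of TokenFragment objects.
class FragmentSketch {
    static String ch(char c)                { return Pattern.quote(String.valueOf(c)); }
    static String seq(String a, String b)   { return "(?:" + a + ")(?:" + b + ")"; }
    static String union(String a, String b) { return "(?:" + a + "|" + b + ")"; }
    static String star(String a)            { return "(?:" + a + ")*"; }

    public static void main(String[] args) {
        String letter = "[a-z]";
        String digit  = "[0-9]";
        // identifier = letter (letter | digit)*
        String ident = seq(letter, star(union(letter, digit)));
        System.out.println(Pattern.matches(ident, "x42")); // true
        System.out.println(Pattern.matches(ident, "42x")); // false
    }
}
```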
Automata
Token fragments have executable implementations in the form of nondeterministic finite automata, embodied by the class NAutomaton. Some operations on finite automata use the nondeterministic form; others require deterministic automata, embodied by the class DAutomaton. Conversions between the two are performed automatically when needed.
For efficient execution, deterministic automata can be simplified to a zero-overhead form, embodied by the class ZAutomaton.
All types of automata can be used to process a sequence of input code points in a uniform way, via the iterator-like interface Automaton.Trace.
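The iterator-like execution style can be illustrated with a hand-built deterministic automaton in plain Java; the Trace class below is an illustrative stand-in for Automaton.Trace, not the library's API:

```java
// A trace holds the current automaton state and is advanced one input
// code point at a time. The hand-built DFA below accepts non-empty
// decimal digit strings.
class TraceDemo {
    static final class Trace {
        // States: 0 = start (not accepting), 1 = seen digits (accepting), 2 = dead.
        private int state = 0;

        // Advance by one code point; return false once no match is possible.
        boolean consume(int codePoint) {
            if (state != 2 && codePoint >= '0' && codePoint <= '9')
                state = 1;
            else
                state = 2; // dead state
            return state != 2;
        }

        boolean accepting() { return state == 1; }
    }

    static boolean matches(String input) {
        Trace t = new Trace();
        // Iterate UTF-32 code points, as a lexer front end would.
        return input.codePoints().allMatch(t::consume) && t.accepting();
    }

    public static void main(String[] args) {
        System.out.println(matches("2024")); // true
        System.out.println(matches(""));     // false
        System.out.println(matches("20x4")); // false
    }
}
```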
Usage
FIXME-
Class                                Description
Automaton<L,T>                       Base class of finite automata.
Automaton.State                      State of an automaton.
Automaton.Trace                      Iterator-like mutable API for tracking the consumption of an input sequence of code points by an automaton.
Behavior<L,T>                        Behavior of an automaton in a particular state.
CodePointMap<V>                      Immutable map of Unicode code point keys, encoded as int values, to arbitrary values.
CodePointSource                      A specialized supplier of Unicode code points.
DAutomaton<V>                        Deterministic finite-state labeled automaton.
                                     This class contains only static factories for pretty-printable representations of collections.
Lexer<T>                             Lexical analyzer that maps code point sources to token sources.
LocationCodePointSource              A secondary code point source that tracks location information.
LookaheadTokenFilter<D,T,L>          Abstract base class for token processors that filter out certain tokens.
LookaheadTokenMultiplexer<K,D,T,L>
LookaheadTokenProcessor<D,T,L>       Abstract base class for secondary token sources that feed on other token sources.
LookaheadTokenSource<D,T,L>          A supplier of tokens with an internal lookahead buffer.
NAutomaton<V>                        Nondeterministic finite-state labeled automaton.
SimpleToken<D,T>                     Simple immutable token implementation.
Token<D,T>                           Abstract interface of lexical tokens.
TokenFilter<D,T>                     Abstract base class for token processors that filter out certain tokens.
TokenFragment                        Syntactic fragment as a building block for a token rule.
                                     Singleton type indicating successful matching.
TokenProcessor<D,T>                  Abstract base class for secondary token sources that feed on other token sources.
TokenRule<T>                         Associates a token type with a syntax fragment.
TokenRuleSet<T>                      A set of token rules together with a precedence relation between token types.
TokenSource<D,T>                     A specialized supplier of tokens.
Traceable<L>                         Indicates that the implementor can process code point sequences like an automaton.
ZAutomaton<V>                        Zero-overhead automaton that is identical to the behavior of its own initial state.
values to arbitrary values.A specialized supplier of unicode code points.DAutomaton<V>Deterministic finite-state labeled automaton.This class contains only static factories for pretty-printable representations of collections.Lexer<T>Lexical analyzer that maps code point sources to token sources.A secondary code point source that tracks location information.LookaheadTokenFilter<D,T, L> Abstract base class for token processors that filter out certain tokens.LookaheadTokenMultiplexer<K,D, T, L> LookaheadTokenProcessor<D,T, L> Abstract base class for secondary token sources that feed on other token sources.LookaheadTokenSource<D,T, L> A supplier of tokens with internal lookahead buffer.NAutomaton<V>Nondeterministic finite-state labeled automaton.SimpleToken<D,T> Simple immutable token implementation.Token<D,T> Abstract interface of lexical tokens.TokenFilter<D,T> Abstract base class for token processors that filter out certain tokens.Syntactic fragment as building block for a token rule.Singleton type indicating successful matching.TokenProcessor<D,T> Abstract base class for secondary token sources that feed on other token sources.TokenRule<T>Associates a token type with a syntax fragment.TokenRuleSet<T>A set of token rules together with a precendence relation between token types.TokenSource<D,T> A specialized supplier of tokens.Traceable<L>Indicates that the implementor can process code point sequences like an automaton.ZAutomaton<V>Zero-overhead automaton that is identical to the behavior of its own initial state.