Package eu.bandm.tools.lexic
Input
The input of lexical analysis is assumed to be a sequence of Unicode
code points, encoded according to UTF-32 as int values, with the
special value -1 indicating the end of the input sequence. This data
structure is embodied in the supplier interface CodePointSource.
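For illustration, the following standalone sketch imitates that
contract with the standard IntSupplier interface; the helper
fromString is hypothetical and merely stands in for a real
CodePointSource implementation.

    import java.util.function.IntSupplier;

    class CodePointDemo {

        // Hypothetical stand-in: each call yields the next code point
        // as an int, with -1 signalling the end of the input.
        static IntSupplier fromString(String s) {
            int[] cps = s.codePoints().toArray();
            int[] pos = {0};
            return () -> pos[0] < cps.length ? cps[pos[0]++] : -1;
        }

        public static void main(String[] args) {
            IntSupplier src = fromString("a\u00DF\uD83D\uDE00"); // 'a', 'ß', U+1F600
            for (int cp = src.getAsInt(); cp != -1; cp = src.getAsInt()) {
                // supplementary characters arrive as single code points,
                // not as UTF-16 surrogate pairs
                System.out.printf("U+%04X%n", cp);
            }
        }
    }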
Often the input sequence is drawn from the contents of a
document. In that case it is useful for back-reference to obtain
metadata about the location of input elements in the original
document. The datatype Location
is
used for that purpose, available from the interface LocationCodePointSource
.
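How such location metadata can be derived is sketched below; the
line/column bookkeeping is a hypothetical illustration, not the actual
implementation of LocationCodePointSource.

    import java.util.function.IntSupplier;

    // Hypothetical sketch: wraps a code point source and keeps track
    // of line/column positions, roughly what LocationCodePointSource
    // does with full Location metadata.
    class TrackingSource {
        private final IntSupplier in;
        int line = 1, column = 0; // position of the last code point read

        TrackingSource(IntSupplier in) { this.in = in; }

        int next() {
            int cp = in.getAsInt();
            if (cp == '\n') { line++; column = 0; }
            else if (cp != -1) { column++; }
            return cp;
        }
    }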
Output
The output of lexical analysis is a sequence of Token objects, each
consisting of a token type, a segment of the input text, and a
Location object. Note that, in contrast to the input, the text segment
of a token is encoded as a String in UTF-16. The end of the token
sequence is indicated by a token with a distinguished token type and
an empty text segment. The end token type can be chosen arbitrarily,
as long as the supplier and the consumer agree. This data structure is
embodied in the supplier interface TokenSource.
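A minimal sketch of this token shape and the end-of-sequence
convention, using hypothetical stand-in types rather than the real
Token interface:

    // Hypothetical minimal token shape; the real Token interface is richer.
    class TokenDemo {
        enum TokType { IDENT, NUMBER, EOF }

        // the text segment is a UTF-16 String, unlike the int-based input
        record Tok(TokType type, String text) {
            static final Tok END = new Tok(TokType.EOF, ""); // agreed-upon end token
        }

        public static void main(String[] args) {
            Tok[] stream = { new Tok(TokType.IDENT, "x"),
                             new Tok(TokType.NUMBER, "42"),
                             Tok.END };
            for (Tok t : stream) {
                if (t.type() == TokType.EOF) break; // consumer stops at the end type
                System.out.println(t.type() + " \"" + t.text() + "\"");
            }
        }
    }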
A token sequence can be processed in a pipeline of TokenProcessor
objects. The most common processing task is filtering, implemented by
subclasses of TokenFilter. Convenience methods such as
TokenSource.removeTypes(T[]) support type-based filtering.
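Type-based filtering boils down to dropping tokens whose type lies in
a given set, as in this standalone sketch (the enum, record, and
stream pipeline are illustrative stand-ins, not the library's own
types):

    import java.util.EnumSet;
    import java.util.List;

    class FilterDemo {
        enum T { IDENT, SPACE, COMMENT, EOF }
        record Tok(T type, String text) {}

        public static void main(String[] args) {
            List<Tok> in = List.of(new Tok(T.IDENT, "x"), new Tok(T.SPACE, " "),
                                   new Tok(T.COMMENT, "// hi"), new Tok(T.EOF, ""));
            EnumSet<T> drop = EnumSet.of(T.SPACE, T.COMMENT); // like removeTypes(...)
            List<Tok> out = in.stream()
                              .filter(t -> !drop.contains(t.type()))
                              .toList();
            System.out.println(out); // IDENT and the end token survive
        }
    }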
Error Handling
Neither CodePointSource nor TokenSource may convey errors by returning
null results or by raising exceptions.
An error during the production of a code point sequence must be
communicated between the supplier and the consumer by
application-specific means. An error may or may not terminate the
sequence. See CodePointSource.read(java.io.Reader,java.util.function.Consumer)
for an example contract.
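That pattern can be imitated as follows; this is a hypothetical
re-implementation, not the method itself, and for brevity it reads
single UTF-16 units instead of full code points.

    import java.io.IOException;
    import java.io.Reader;
    import java.io.StringReader;
    import java.util.function.Consumer;
    import java.util.function.IntSupplier;

    class ReaderSourceDemo {

        // Hypothetical analogue: I/O errors are handed to `onError`
        // instead of being thrown; in this sketch an error also ends
        // the sequence.
        static IntSupplier fromReader(Reader r, Consumer<IOException> onError) {
            return () -> {
                try {
                    return r.read(); // one UTF-16 unit; already -1 at end of stream
                } catch (IOException e) {
                    onError.accept(e);
                    return -1;
                }
            };
        }

        public static void main(String[] args) {
            IntSupplier src = fromReader(new StringReader("ok"),
                                         e -> System.err.println("I/O error: " + e));
            for (int cp = src.getAsInt(); cp != -1; cp = src.getAsInt())
                System.out.printf("U+%04X%n", cp);
        }
    }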
An error during the production of a token sequence should be
communicated between the supplier and the consumer by a token with a
distinguished token type and an error message instead of a text
segment. The error token type can be chosen arbitrarily, as long as
the supplier and the consumer agree. An error may or may not terminate
the sequence. See Lexer.setErrorType(T) for an example contract.
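A sketch of the error-token convention, again with hypothetical
stand-in types (cf. Lexer.setErrorType(T) for the real contract):

    class ErrorTokenDemo {
        enum T { IDENT, ERROR, EOF }
        record Tok(T type, String textOrMessage) {}

        public static void main(String[] args) {
            // the ERROR token carries a message in place of a text segment
            Tok[] stream = { new Tok(T.IDENT, "x"),
                             new Tok(T.ERROR, "unexpected character U+0007 in line 3"),
                             new Tok(T.EOF, "") };
            for (Tok t : stream) {
                if (t.type() == T.EOF) break;
                if (t.type() == T.ERROR) System.err.println("lexer: " + t.textOrMessage());
                else System.out.println(t);
            }
        }
    }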
Rule Definitions
The lexical structure of the input language is defined by a set of
token rules. The class TokenRuleSet provides collections of token
rules and capabilities for disambiguation of overlapping token rules.
Each token rule consists of a token fragment that defines a
regular language, and an associated token type. The class TokenRule
provides such pairs.
Token fragments are compositional regular languages. The class
TokenFragment provides factory methods for all the usual constructs on
regular languages, such as single characters, union, concatenation,
and intersection. Every fragment is automatically compiled to a
nondeterministic finite automaton at construction time.
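The compositional style can be imitated in a few lines with Brzozowski
derivatives; this sketch is only a conceptual analogue of
TokenFragment, with invented names, and unlike the real class it
interprets fragments directly instead of compiling them to automata.

    import java.util.function.IntFunction;

    final class Frag {
        final boolean nullable;          // does the language contain the empty word?
        final IntFunction<Frag> derive;  // residual language after one code point

        private Frag(boolean nullable, IntFunction<Frag> derive) {
            this.nullable = nullable;
            this.derive = derive;
        }

        static Frag fail()    { return new Frag(false, cp -> fail()); }
        static Frag eps()     { return new Frag(true,  cp -> fail()); }
        static Frag ch(int c) { return new Frag(false, cp -> cp == c ? eps() : fail()); }

        static Frag or(Frag a, Frag b) {
            return new Frag(a.nullable || b.nullable,
                            cp -> or(a.derive.apply(cp), b.derive.apply(cp)));
        }

        static Frag seq(Frag a, Frag b) {
            return new Frag(a.nullable && b.nullable,
                            cp -> a.nullable
                                  ? or(seq(a.derive.apply(cp), b), b.derive.apply(cp))
                                  : seq(a.derive.apply(cp), b));
        }

        boolean matches(String s) {
            Frag f = this;
            for (int i = 0; i < s.length(); i += Character.charCount(s.codePointAt(i))) {
                f = f.derive.apply(s.codePointAt(i));
            }
            return f.nullable;
        }

        public static void main(String[] args) {
            Frag f = seq(ch('a'), or(ch('b'), ch('c'))); // "ab" or "ac"
            System.out.println(f.matches("ab")); // true
            System.out.println(f.matches("ad")); // false
        }
    }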
Note that regular expressions are also a complete notation for regular
languages, but real-world implementations such as java.util.regex have
a different practical focus and different theoretical properties.
Automata
Token fragments have executable implementations in the form of
nondeterministic finite automata, embodied by the class NAutomaton.
Some operations on finite automata use the nondeterministic form;
others require deterministic automata, embodied by the class
DAutomaton. Conversions between the two are performed automatically
when needed.
For efficient execution, deterministic automata can be simplified to a
zero-overhead form, embodied by the class ZAutomaton.
All types of automata can be used to process a sequence of input code
points in a uniform way, via the iterator-like interface
Automaton.Trace.
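The trace idiom can be pictured with a toy deterministic automaton;
the cursor below is a hypothetical stand-in for Automaton.Trace, not
the real interface.

    class TraceDemo {

        // Toy DFA over {a, b} accepting every string that ends in "ab".
        // States: 0 = start, 1 = saw 'a', 2 = saw "ab" (accepting).
        static final int[][] DELTA = {
            //        'a' 'b'
            /* 0 */ { 1,  0 },
            /* 1 */ { 1,  2 },
            /* 2 */ { 1,  0 },
        };

        // Hypothetical Trace-like cursor: feed one code point at a
        // time, query acceptance after each step.
        static class Cursor {
            int state = 0;
            void step(int cp) {
                if (cp == 'a') state = DELTA[state][0];
                else if (cp == 'b') state = DELTA[state][1];
                else state = 0; // other input: fall back to the start state
            }
            boolean accepting() { return state == 2; }
        }

        public static void main(String[] args) {
            Cursor t = new Cursor();
            "aab".codePoints().forEach(t::step);
            System.out.println(t.accepting()); // true
        }
    }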
Usage
FIXME
Class                      Description
Automaton<L,T>             Base class of finite automata.
Automaton.State            State of an automaton.
Automaton.Trace            Iterator-like mutable API for tracking the
                           consumption of an input sequence of code
                           points by an automaton.
Behavior<L,T>              Behavior of an automaton in a particular state.
CodePointMap<V>            Immutable map of Unicode code point keys,
                           encoded as int values, to arbitrary values.
CodePointSource            A specialized supplier of Unicode code points.
DAutomaton<V>              Deterministic finite-state labeled automaton.
?                          This class contains only static factories for
                           pretty-printable representations of collections.
Lexer<T>                   Lexical analyzer that maps code point sources
                           to token sources.
LocationCodePointSource    A secondary code point source that tracks
                           location information.
NAutomaton<V>              Nondeterministic finite-state labeled automaton.
SimpleToken<D,T>           Simple immutable token implementation.
Token<D,T>                 Abstract interface of parser tokens.
TokenFilter<D,T>           Abstract base class for token processors that
                           filter out certain tokens.
TokenFragment              Syntactic fragment as building block for a
                           token rule.
?                          Singleton type indicating successful matching.
TokenProcessor<D,T>        Abstract base class for secondary token sources
                           that feed on other token sources.
TokenRule<T>               Associates a token type with a syntax fragment.
TokenRuleSet<T>            A set of token rules together with a precedence
                           relation between token types.
TokenSource<D,T>           A specialized supplier of tokens.
Traceable<L>               Indicates that the implementor can process code
                           point sequences like an automaton.
ZAutomaton<V>              Zero-overhead automaton that is identical to
                           the behavior of its own initial state.
values to arbitrary values.A specialized supplier of unicode code points.DAutomaton<V>Deterministic finite-state labeled automaton.This class contains only static factories for pretty-printable representations of collections.Lexer<T>Lexical analyzer that maps code point sources to token sources.A secondary code point source that tracks location information.NAutomaton<V>Nondeterministic finite-state labeled automaton.SimpleToken<D,T> Simple immutable token implementation.Token<D,T> Abstract interface of parser tokens.TokenFilter<D,T> Abstract base class for token processors that filter out certain tokens.Syntactic fragment as building block for a token rule.Singleton type indicating successful matching.TokenProcessor<D,T> Abstract base class for secondary token sources that feed on other token sources.TokenRule<T>Associates a token type with a syntax fragment.TokenRuleSet<T>A set of token rules together with a precendence relation between token types.TokenSource<D,T> A specialized supplier of tokens.Traceable<L>Indicates that the implementor can process code point sequences like an automaton.ZAutomaton<V>Zero-overhead automaton that is identical to the behavior of its own initial state.