Class TokenFragment

java.lang.Object
eu.bandm.tools.lexic.TokenFragment
All Implemented Interfaces:
FormatClient, Serializable

public class TokenFragment extends Object implements FormatClient, Serializable
Syntactic fragment as building block for a token rule.

Each fragment contains an accepting nondeterministic automaton as its implementation. The fragment is said to match a sequence of input code points if and only if the sequence is accepted by its implementation automaton.

The constructor of this class is hidden from applications. The creation of fragments is effected by factory methods.

Fragments are immutable. Factory methods are provided for non-destructive updates that derive new fragments from old ones.

Fragments are compositional: any fragment can serve directly in a token rule, no matter whether it is primitive or complex.

The language of fragments shares many constructs with the language of regular expressions. Others correspond more directly with set theory. The constructs are chosen such that there is an effective implementation in terms of nondeterministic automata.
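
As an illustration of the overlap with regular expressions, the following sketch pairs a few fragment expressions with java.util.regex patterns. The fragment expressions appear only in comments and are composed from the factory methods documented below. The class name FragmentRegexAnalogy, the method matchesWhole and the chosen patterns are illustrative assumptions and not part of this library. Because a fragment matches a whole sequence of code points, the regex side uses Matcher.matches() rather than find().

```java
import java.util.regex.Pattern;

final class FragmentRegexAnalogy {
    // TokenFragment.range('0', '9').plus()  ~  [0-9]+
    static final Pattern DIGITS = Pattern.compile("[0-9]+");

    // TokenFragment.of('-').optional()
    //     .andThen(TokenFragment.range('0', '9').plus())  ~  -?[0-9]+
    static final Pattern INTEGER = Pattern.compile("-?[0-9]+");

    // TokenFragment.anyOf("true", "false")  ~  true|false
    static final Pattern BOOLEAN = Pattern.compile("true|false");

    // TokenFragment.except('"').star()  ~  [^"]*
    static final Pattern UNQUOTED = Pattern.compile("[^\"]*");

    /** Whole-input matching, analogous to a fragment matching a sequence. */
    static boolean matchesWhole(Pattern pattern, String input) {
        return pattern.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesWhole(INTEGER, "-42"));    // true
        System.out.println(matchesWhole(INTEGER, "-42a"));   // false
        System.out.println(matchesWhole(BOOLEAN, "true"));   // true
        System.out.println(matchesWhole(UNQUOTED, "a\"b"));  // false
    }
}
```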

  • Field Details

    • success

      public static final TokenFragment.Success success
      Singleton value indicating successful matching.
  • Method Details

    • toString

      public String toString()
      Overrides:
      toString in class Object
    • format

      public <F> F format(FormatServer<F> server)
      Description copied from interface: FormatClient
      Represent this or the underlying object in a human-readable, pretty-printable way.
      Specified by:
      format in interface FormatClient
      Type Parameters:
      F - the type of format objects to produce
      Parameters:
      server - a factory object that can produce format objects
      Returns:
      a format object produced by the server
    • getImplementation

      public NAutomaton<TokenFragment.Success> getImplementation()
      Returns the accepting automaton that implements this fragment.

      A sequence of input code points is accepted if the automaton can consume the whole sequence and ends up in a state labeled with Collections.singleton(success). Otherwise, that is, if the automaton fails to consume the whole sequence or ends up in a state labeled with Collections.emptySet(), the input sequence is rejected.

      Returns:
      the accepting automaton that implements this fragment
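
      The acceptance rule can be sketched with a small hand-written nondeterministic automaton. This is not the NAutomaton API: the class AcceptanceSketch, its methods and the enum Success are hypothetical stand-ins, and the sketch assumes the usual reading that an input is accepted if at least one reachable state carries the success label.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

final class AcceptanceSketch {
    /** Hypothetical stand-in for TokenFragment.Success. */
    enum Success { SUCCESS }

    private final int initial;
    // state -> (code point -> successor states)
    private final Map<Integer, Map<Integer, Set<Integer>>> transitions = new HashMap<>();
    // state -> label; unlabeled states count as labeled with the empty set
    private final Map<Integer, Set<Success>> labels = new HashMap<>();

    AcceptanceSketch(int initial) {
        this.initial = initial;
    }

    void addTransition(int from, int codePoint, int to) {
        transitions.computeIfAbsent(from, k -> new HashMap<>())
                   .computeIfAbsent(codePoint, k -> new HashSet<>())
                   .add(to);
    }

    void markSuccess(int state) {
        labels.put(state, Collections.singleton(Success.SUCCESS));
    }

    boolean accepts(String input) {
        Set<Integer> current = new HashSet<>();
        current.add(initial);
        for (int codePoint : input.codePoints().toArray()) {
            Set<Integer> next = new HashSet<>();
            for (int state : current) {
                next.addAll(transitions.getOrDefault(state, Collections.emptyMap())
                                       .getOrDefault(codePoint, Collections.emptySet()));
            }
            if (next.isEmpty()) {
                return false; // failed to consume the whole sequence
            }
            current = next;
        }
        for (int state : current) {
            Set<Success> label = labels.getOrDefault(state, Collections.emptySet());
            if (label.equals(Collections.singleton(Success.SUCCESS))) {
                return true; // ended in a state labeled with the success singleton
            }
        }
        return false; // consumed everything, but only reached empty labels
    }

    /** Automaton for one or more repetitions of "ab". */
    static AcceptanceSketch abPlus() {
        AcceptanceSketch automaton = new AcceptanceSketch(0);
        automaton.addTransition(0, 'a', 1);
        automaton.addTransition(1, 'b', 2);
        automaton.addTransition(2, 'a', 1);
        automaton.markSuccess(2);
        return automaton;
    }
}
```

      With this automaton, "ab" and "abab" are accepted, "abc" is rejected because the final code point cannot be consumed, and "a" is rejected because the run ends in an unlabeled state.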
    • epsilon

      public static TokenFragment epsilon()
      Returns a token fragment that matches zero code points of input.
      Returns:
      a token fragment that matches zero code points of input
    • fail

      public static TokenFragment fail()
      Returns a token fragment that does not match any input.
      Returns:
      a token fragment that does not match any input
    • of

      public static TokenFragment of(int codePoint)
      Returns a token fragment that matches the given input code point.

      This construct corresponds to a single ordinary or quoted character in a regular expression.

      Parameters:
      codePoint - the code point to match
      Returns:
      a token fragment that matches the given input code point
      Throws:
      IllegalArgumentException - if the given number is not a valid code point
    • of

      public static TokenFragment of(String text)
      Returns a token fragment that matches the input code point sequence specified by the given string.

      This construct corresponds to a substring of ordinary or quoted characters in a regular expression.

      Parameters:
      text - the string to match
      Returns:
      a token fragment that matches the input code point sequence specified by the given string
    • anyOf

      public static TokenFragment anyOf(String... text)
      Returns a token fragment that matches any of the input code point sequences specified by the given strings.

      This construct corresponds to a choice of substrings of ordinary or quoted characters in a regular expression.

      Parameters:
      text - the array of strings to match
      Returns:
      a token fragment that matches the input code point sequence specified by one of the given strings
    • anyOf

      public static TokenFragment anyOf(int... codePoints)
      Returns a token fragment that matches any one of the given input code points.

      This construct corresponds to a simple character class in a regular expression.

      Parameters:
      codePoints - the code points to match
      Returns:
      a token fragment that matches any one of the given input code points
    • range

      public static TokenFragment range(int from, int to)
      Returns a token fragment that matches any input code point in the given interval.

      Both given end points of the interval are inclusive; i.e., a code point c is matched if from <= c && c <= to.

      This construct corresponds to a character range in a regular expression.

      Parameters:
      from - the lower end of the interval of code points to match
      to - the upper end of the interval of code points to match
      Returns:
      a token fragment that matches any input code point in the given interval
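
      The inclusive interval test operates on code points, not on char values, so a supplementary character counts as one unit even though it occupies two chars in a String. The following sketch shows the check using only the standard library; RangeSketch and its methods are hypothetical names and not part of this class.

```java
final class RangeSketch {
    /** The inclusive interval test quoted above. */
    static boolean inRange(int from, int to, int codePoint) {
        return from <= codePoint && codePoint <= to;
    }

    /** True if the input is exactly one code point that lies in [from, to]. */
    static boolean isSingleCodePointInRange(int from, int to, String input) {
        return input.codePointCount(0, input.length()) == 1
            && inRange(from, to, input.codePointAt(0));
    }

    public static void main(String[] args) {
        String grinning = "\uD83D\uDE00"; // U+1F600, two chars but one code point
        System.out.println(grinning.length());                                        // 2
        System.out.println(isSingleCodePointInRange(0x1F600, 0x1F64F, grinning));     // true
        System.out.println(isSingleCodePointInRange('a', 'z', "z"));                  // true
        System.out.println(isSingleCodePointInRange('a', 'z', "{"));                  // false
    }
}
```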
    • except

      public static TokenFragment except(int... codePoints)
      Returns a token fragment that matches any input code point except for the given ones.

      This construct corresponds to a negated character class in a regular expression.

      Parameters:
      codePoints - the code points not to match
      Returns:
      a token fragment that matches any input code point except for the given ones
    • any

      public static TokenFragment any()
      Returns a token fragment that matches any input code point.

      This construct corresponds to a wildcard character in a regular expression.

      Returns:
      a token fragment that matches any input code point
    • andThen

      public TokenFragment andThen(TokenFragment other)
      Returns a token fragment that matches input matched by this fragment followed by the given other fragment.

      This construct corresponds to a followed-by sequence in a regular expression.

      Parameters:
      other - the other fragment
      Returns:
      a token fragment that matches input matched by this fragment followed by the given other fragment
    • orElse

      public TokenFragment orElse(TokenFragment other)
      Returns a token fragment that matches input matched by this fragment, by the given other fragment, or by both.

      An input sequence matched by both parts of the combined fragment simultaneously is not considered ambiguous.

      This construct corresponds to an either-or choice in a regular expression.

      Parameters:
      other - the other fragment
      Returns:
      a token fragment that matches input matched by this fragment, by the given other fragment, or by both
    • optional

      public TokenFragment optional()
      Returns a token fragment that matches input matched by this fragment, or alternatively zero code points.

      This construct corresponds to a ? operator in a regular expression.

      Returns:
      a token fragment that matches input matched by this fragment, or alternatively zero code points
    • plus

      public TokenFragment plus()
      Returns a token fragment that matches input matched by one or more repetitions of this fragment.

      This construct corresponds to a + operator in a regular expression.

      Returns:
      a token fragment that matches input matched by one or more repetitions of this fragment
    • star

      public TokenFragment star()
      Returns a token fragment that matches input matched by zero or more repetitions of this fragment.

      This construct corresponds to a * operator in a regular expression.

      Returns:
      a token fragment that matches input matched by zero or more repetitions of this fragment
    • butNot

      public TokenFragment butNot(TokenFragment other)
      Returns a token fragment that matches input matched by this fragment but not by the given other fragment.

      This construct does not correspond to negative lookahead, or any other typical feature, in a regular expression. It does, however, correspond to the set difference of the respective sublanguages, a concept that is sometimes used semiformally in definitions of the lexical structure of a language.

      Parameters:
      other - the other fragment
      Returns:
      a token fragment that matches input matched by this fragment but not by the given other fragment
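
      The set-difference reading can be sketched with two whole-input regex matches, here for the familiar rule "identifier characters, but not a keyword". The sketch uses java.util.regex only; DifferenceSketch, the keyword list and the pattern names are illustrative assumptions. A naive negative lookahead is included for contrast, since it rejects any input that merely starts with a keyword.

```java
import java.util.regex.Pattern;

final class DifferenceSketch {
    static final Pattern IDENTIFIER_CHARS = Pattern.compile("[A-Za-z_][A-Za-z_0-9]*");
    static final Pattern KEYWORD = Pattern.compile("if|else|while");

    // Unanchored negative lookahead: rejects every input with a keyword prefix.
    static final Pattern NAIVE_LOOKAHEAD =
        Pattern.compile("(?!if|else|while)[A-Za-z_][A-Za-z_0-9]*");

    /** Set difference: matched by IDENTIFIER_CHARS but not by KEYWORD. */
    static boolean isIdentifier(String input) {
        return IDENTIFIER_CHARS.matcher(input).matches()
            && !KEYWORD.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isIdentifier("if"));                            // false
        System.out.println(isIdentifier("iffy"));                          // true
        System.out.println(NAIVE_LOOKAHEAD.matcher("iffy").matches());     // false
    }
}
```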
    • butOnly

      public TokenFragment butOnly(TokenFragment other)
      Returns a token fragment that matches input matched by both this fragment and also the given other fragment.

      This construct does not correspond to positive lookahead, or any other typical feature, in a regular expression. It does, however, correspond to the set intersection of the respective sublanguages.

      Parameters:
      other - the other fragment
      Returns:
      a token fragment that matches input matched by both this fragment and also the given other fragment
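
      Set intersection can be sketched in the same style: an input belongs to the intersection if both whole-input matches succeed. IntersectionSketch and its patterns are hypothetical, and the fragment expression in the comment is only an assumed composition of the methods documented on this page.

```java
import java.util.regex.Pattern;

final class IntersectionSketch {
    // Roughly: TokenFragment.range('a', 'z').orElse(TokenFragment.range('0', '9')).plus()
    static final Pattern LOWER_ALNUM = Pattern.compile("[a-z0-9]+");

    // Roughly: TokenFragment.range('0', '9').contained()
    static final Pattern HAS_DIGIT = Pattern.compile("(?s).*[0-9].*");

    /** Set intersection: the whole input is matched by both patterns. */
    static boolean matchesBoth(String input) {
        return LOWER_ALNUM.matcher(input).matches()
            && HAS_DIGIT.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesBoth("abc1")); // true
        System.out.println(matchesBoth("abc"));  // false
        System.out.println(matchesBoth("ABC1")); // false
    }
}
```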
    • contained

      public TokenFragment contained()
      Returns a token fragment that matches any input which contains a contiguous section matched by this fragment.

      While it is common to have a regular expression match only some substring of a given input string, there is no direct correspondence for this construct.

      Returns:
      a token fragment that matches any input which contains a contiguous section matched by this fragment
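
      In java.util.regex terms, the closest analogue is the difference between Matcher.find() and Matcher.matches(): the former looks for a matching section, the latter requires the whole input to match. The sketch below shows that searching for a section agrees with a whole-input match of the same pattern surrounded by arbitrary input. ContainedSketch and the marker pattern are illustrative assumptions.

```java
import java.util.regex.Pattern;

final class ContainedSketch {
    static final Pattern MARKER = Pattern.compile("TODO|FIXME");

    // The same pattern wrapped in arbitrary leading and trailing input.
    // DOTALL ((?s)) lets '.' also match line terminators.
    static final Pattern MARKER_ANYWHERE = Pattern.compile("(?s).*(?:TODO|FIXME).*");

    /** Search semantics: some contiguous section matches MARKER. */
    static boolean containsSection(String input) {
        return MARKER.matcher(input).find();
    }

    /** Whole-input semantics: the complete input matches the wrapped pattern. */
    static boolean matchesWhole(String input) {
        return MARKER_ANYWHERE.matcher(input).matches();
    }

    public static void main(String[] args) {
        String source = "int x; // TODO: rename\nint y;";
        System.out.println(containsSection(source)); // true
        System.out.println(matchesWhole(source));    // true
        System.out.println(matchesWhole("int x;"));  // false
    }
}
```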
    • andThenUntil

      public TokenFragment andThenUntil(TokenFragment delimiter)
      Returns a token fragment that matches input matched by this fragment, followed by arbitrary input that ends with exactly one match of the given delimiter fragment.

      Matches of the delimiter in the middle section are forbidden. Hence this construct is useful to counteract a longest-match strategy and to prevent variable-length token rules from matching too much input.

      Parameters:
      delimiter - the delimiter fragment
      Returns:
      a fragment that matches variable-length input with the given beginning and end
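
      A typical use is a block comment that starts with "/*" and ends at the first "*/". The following sketch models the intended behaviour for fixed, non-empty start and delimiter strings using only String operations; it does not use this class, and UntilSketch and matchesStartUntil are hypothetical names. A greedy regex is shown for contrast, since it runs across an intermediate delimiter.

```java
import java.util.regex.Pattern;

final class UntilSketch {
    // Greedy analogue without the "until" restriction.
    static final Pattern GREEDY_COMMENT = Pattern.compile("(?s)/\\*.*\\*/");

    /**
     * True if input consists of start, then a middle section without any
     * occurrence of delimiter, then exactly one delimiter at the very end.
     * Assumes that start and delimiter are non-empty.
     */
    static boolean matchesStartUntil(String start, String delimiter, String input) {
        if (!input.startsWith(start)) {
            return false;
        }
        String rest = input.substring(start.length());
        int first = rest.indexOf(delimiter);
        // The first delimiter after the start must also be the end of the input.
        return first >= 0 && first + delimiter.length() == rest.length();
    }

    public static void main(String[] args) {
        String two = "/* a */ x /* b */";
        System.out.println(matchesStartUntil("/*", "*/", "/* a */")); // true
        System.out.println(matchesStartUntil("/*", "*/", two));       // false
        System.out.println(GREEDY_COMMENT.matcher(two).matches());    // true
    }
}
```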
    • andThenUntil

      public TokenFragment andThenUntil(TokenFragment body, TokenFragment delimiter)
      Returns a token fragment that matches input matched by this fragment, followed by input matched by the given body fragment, and finally by exactly one match of the given delimiter fragment.

      Matches of the delimiter in the middle section are forbidden. Hence this construct is useful to counteract a longest-match strategy and to prevent variable-length token rules from matching too much input.

      Parameters:
      body - the body fragment
      delimiter - the delimiter fragment
      Returns:
      a fragment that matches variable-length input with the given beginning and end
    • andThenWithout

      public TokenFragment andThenWithout(TokenFragment delimiter)
    • andThenWithout

      public TokenFragment andThenWithout(TokenFragment body, TokenFragment delimiter)
    • normalize

      public TokenFragment normalize()
      Returns a token fragment that matches the same inputs as this fragment, but with a simple implementation.

      The implementing automaton of the result shall have no branching transitions, and no redundant or dead states. This does not imply that the number of states is non-increasing.

      Returns:
      a token fragment that matches the same inputs as this fragment, but with a simple implementation