






Xantlr, Representing the Result of LA-LL(k) Parsing as a Sequence of SAX Events



(Related API documentation: package xantlr, package xantlrtdom)


1          Xantlr's Basic Functionality
2          Specifying the Generated SAX Events
2.1          Normalization of Content Models
3          Running Xantlr
3.1          Messages and Unreachable Nonterminals
3.2          antlrC Grammar Inheritance and Xantlr
4          Running the Resulting Compiler
5          Notes on the Implementation

^ToC 1 Xantlr's Basic Functionality

Xantlr can be thought of as a preprocessor to the famous antlrC tool: it automatically inserts certain "semantic actions" into the grammar definition. When the parser generated by antlrC runs, these actions emit XML SAX events ([Sax04]) which represent the abstract syntax tree resulting from the parsing process.

Furthermore, an XML DTD is generated by Xantlr, which exactly defines the structure of the generated sequence of SAX events.

The advantages of this technique are ...

  1. the semantic actions are defined in a declarative style and are therefore less error-prone,
  2. the structure of the resulting tree is documented and can be automatically verified,
  3. the further processing of the result can be performed automatically.

In nearly all cases we pipe the output of an Xantlr parser into a Tdom generated model for further visitor-based processing.

More details on all these issues can be found in [tlw01a].

Please also refer to the documentation of the antlr version employed by Xantlr ([antlr2doc]).

^ToC 2 Specifying the Generated SAX Events

Basically, Xantlr emits a sequence of SAX events which represents the "abstract syntax tree" (or "AST") of the parsing result. An AST is a thinned-out version of the parse tree, in which redundant front-end non-terminals no longer appear.

Every inner node in the AST is represented as an XML element. Its subnodes are represented as this element's content, respecting sequential order.

Every leaf node which corresponds to a recognized terminal is by default not represented in the output data at all, but can be configured to appear in the SAX output stream with its character data.
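The mapping from tree to event sequence can be illustrated with a minimal sketch. It is not Xantlr code: the class `SaxTreeSketch` and its `Node` type are hypothetical and exist only for this illustration; only the `org.xml.sax` types are the standard SAX API. Each inner node becomes a start/end element pair, children are emitted in sequential order, and character data appears only where a leaf carries text.

```java
import java.util.ArrayList;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

final class SaxTreeSketch {

    /** Hypothetical AST node, used only for this illustration. */
    static final class Node {
        final String tag;                       // element name
        final String text;                      // character data, or null
        final List<Node> children = new ArrayList<>();

        Node(String tag, String text) { this.tag = tag; this.text = text; }

        Node add(Node child) { children.add(child); return this; }

        /** Emits this node as an element; sub-nodes follow in sequential order. */
        void emit(ContentHandler out) throws SAXException {
            out.startElement("", tag, tag, new AttributesImpl());
            if (text != null) {
                out.characters(text.toCharArray(), 0, text.length());
            }
            for (Node child : children) {
                child.emit(out);
            }
            out.endElement("", tag, tag);
        }
    }

    /** Collects the events of a tree in a flat string, so their order is visible. */
    static String render(Node root) {
        final StringBuilder log = new StringBuilder();
        DefaultHandler recorder = new DefaultHandler() {
            @Override public void startElement(String uri, String local,
                                               String qName, Attributes atts) {
                log.append('<').append(qName).append('>');
            }
            @Override public void endElement(String uri, String local, String qName) {
                log.append("</").append(qName).append('>');
            }
            @Override public void characters(char[] ch, int start, int length) {
                log.append(ch, start, length);
            }
        };
        try {
            root.emit(recorder);
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return log.toString();
    }
}
```

For a node "parameterlist" with two children "parameter" carrying the texts "x" and "y", `render` yields `<parameterlist><parameter>x</parameter><parameter>y</parameter></parameterlist>`.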

To control the kind of SAX events generated, new options have been added to the existing antlr options.

The definition of rule options from antlr is extended by two new options, each of which may be given at most once for each non-terminal in the parser definition:

rule_options ::= ( ... | sax_event_type | xml_tag ) *
sax_event_type ::= xmlNodeType =
        ( pcdata | content | entity | abstract )
xml_tag ::= xmlNodeName = stringValue

The effect of these options is ...

  1. xmlNodeName = stringValue :
    If this option is set to a string value, this string gives the element's "tag value" used in the corresponding SAX events.
    If it is not set, the tag value is identical to the name of the non-terminal.
  2. xmlNodeType is not given:
    This is the default case, and there are two important and significant sub-cases:
    1. When the production contains references to other non-terminals, i.e. rules given in the "parser" part, then the appearance of the non-terminal in the AST is encoded as an ELEMENT, the content of which is the serialization of the sub-tree ruled by this non-terminal, as described above.
    2. If the production contains only terminals, i.e. references to rules from the "lexer" part of the source, or character sequences defined as "in-line" constants, then the non-terminal does not appear at all in the output.
  3. xmlNodeType = pcdata
    This option may be assigned to productions which expand to terminals (= antlrC "lexer rules") or disjunctions of these. The front-end representation derived from the non-terminal is included as PCDATA content in the corresponding element.
    Please note that antlrC "lexer rules" themselves never contribute to the SAX event stream. Non-terminals containing only references to these (even in an arbitrarily complex regular expression!) will produce an ELEMENT with EMPTY content if no xmlNodeType option is selected.
  4. xmlNodeType = content
    The non-terminal is neither contained in the SAX serialization, nor in the generated DTD. Instead, in both places it is treated like a "macro" definition, and its content directly replaces all its references.
  5. xmlNodeType = entity
    The non-terminal is not contained in the SAX serialization, as in the preceding case. In the DTD its definition is represented as an ENTITY declaration, and all its appearances in other productions as ENTITY references in the corresponding content model.
  6. xmlNodeType = abstract
    This option may be assigned to a non-terminal, the production rule of which is a mere disjunction of non-terminals. The non-terminals contained therein may not be referred to anywhere else in the grammar.
    This option only makes sense if Xantlr is used in conjunction with Tdom. It instructs Tdom to create the class representing the containing non-terminal as an "abstract" Java class, and to generate the classes for the contained non-terminals as derived from this abstract class.
    This derivation is respected by the generated Visitor's scheduling discipline in the sense of "inheritance". In certain cases, this modeling may lead to much more elegant code for the further visitor-based processing of the models.
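To give an idea of what the "abstract" modelling buys in visitor code, here is a small sketch of one possible reading. All names in it (`Statement`, `Assignment`, `Call`, `Visitor`) are hypothetical and stand for a rule like `statement ::= assignment | call` marked as abstract; the code actually generated by Tdom may look different. The assumption shown is that visiting a derived class falls back to the visit method of the abstract class unless it is overridden.

```java
final class AbstractNodeSketch {

    /** Hypothetical class for the non-terminal marked xmlNodeType = abstract. */
    abstract static class Statement {
        abstract void accept(Visitor v);
    }

    /** Hypothetical classes for the alternatives of the disjunction. */
    static final class Assignment extends Statement {
        @Override void accept(Visitor v) { v.visit(this); }
    }

    static final class Call extends Statement {
        @Override void accept(Visitor v) { v.visit(this); }
    }

    /** Assumed scheduling: a visit of a subclass defaults to the superclass visit. */
    static class Visitor {
        void visit(Statement s) { /* common handling, empty by default */ }
        void visit(Assignment a) { visit((Statement) a); }
        void visit(Call c) { visit((Statement) c); }
    }

    /** User code: one method for all statements, one extra for calls only. */
    static final class Counter extends Visitor {
        int statements;
        int calls;
        @Override void visit(Statement s) { statements++; }
        @Override void visit(Call c) { calls++; super.visit(c); }
    }
}
```

With this shape, code common to all alternatives is written once against the abstract class, and only the alternatives that need special treatment get their own method.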

^ToC 2.1 Normalization of Content Models

The regular expressions as they appear in the non-terminal rules of the original Xantlr grammar and as they appear in the resulting DTD may differ subtly. (These differences are especially important when the DTD is fed into a Tdom model, because name mangling will be affected.)

You should always read the resulting DTD carefully.

The first issue is that alternatives with empty content ("epsilon") cannot be expressed in a DTD, but have to be modelled by modifying neighbours or parents. So a standard transformation is of the type ...

    A | #eps | B     -->  (A | B)?
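The epsilon transformation can be sketched as follows. This is an illustration on plain strings, not the Xantlr implementation; the class name `EpsilonSketch` and the choice to return "EMPTY" when only epsilon alternatives remain are assumptions made for this example.

```java
import java.util.ArrayList;
import java.util.List;

final class EpsilonSketch {

    static final String EPS = "#eps";

    /**
     * Drops all epsilon alternatives and, if at least one was present,
     * marks the group of remaining alternatives as optional.
     */
    static String normalize(List<String> alternatives) {
        List<String> kept = new ArrayList<>();
        boolean hadEps = false;
        for (String alt : alternatives) {
            if (EPS.equals(alt)) {
                hadEps = true;
            } else {
                kept.add(alt);
            }
        }
        if (kept.isEmpty()) {
            return "EMPTY";   // assumption: nothing but epsilon is left
        }
        String group = "(" + String.join(" | ", kept) + ")";
        return hadEps ? group + "?" : group;
    }
}
```

Here `EpsilonSketch.normalize(List.of("A", "#eps", "B"))` returns `(A | B)?`.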

Secondly, there is an important and widespread simplifying transformation which is very helpful and corresponds to the transition between front-end representation ("parse" tree) and semantics ("AST"):

Consider e.g. the front-end syntax definition ...

 
     parameterlist ::= ( parameter ( "," parameter ) * )? 

This is a typical case in the parsing of programming languages: either you can enter no "parameter" at all, or just one, or more than one, separated by a front-end token which does not appear in the semi-AST.

The DTD content model corresponding verbatim to the parser grammar is ...

 
      parameterlist ::= ( parameter ( parameter ) * )?

In Xantlr, each regular (sub-)expression of form "X X*" is rewritten to "X+". So we get ...

      parameterlist ::= ( parameter + )?

Each regular (sub-)expression of form "(X+)?" is rewritten to "X*". So we get at last ...

      parameterlist ::= parameter * 

So we get the most convenient form for further processing.
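The two rewriting steps can be reproduced with a small sketch on a regular-expression tree. The class `ContentModel` and its factory methods are hypothetical and written only for this illustration; Xantlr's own data structures are not shown here. Structural equality of sub-expressions is decided by comparing their printed form, which is sufficient for this example.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class ContentModel {
    enum Kind { REF, SEQ, STAR, PLUS, OPT }

    final Kind kind;
    final String name;               // only used for REF
    final List<ContentModel> kids;   // operands of SEQ, STAR, PLUS, OPT

    private ContentModel(Kind kind, String name, List<ContentModel> kids) {
        this.kind = kind; this.name = name; this.kids = kids;
    }

    static ContentModel ref(String n) { return new ContentModel(Kind.REF, n, List.of()); }
    static ContentModel seq(ContentModel... xs) { return new ContentModel(Kind.SEQ, null, Arrays.asList(xs)); }
    static ContentModel star(ContentModel x) { return new ContentModel(Kind.STAR, null, List.of(x)); }
    static ContentModel plus(ContentModel x) { return new ContentModel(Kind.PLUS, null, List.of(x)); }
    static ContentModel opt(ContentModel x) { return new ContentModel(Kind.OPT, null, List.of(x)); }

    /** Applies "X X*" -> "X+" and "(X+)?" -> "X*" bottom-up. */
    static ContentModel normalize(ContentModel r) {
        if (r.kind == Kind.REF) return r;
        List<ContentModel> kids = new ArrayList<>();
        for (ContentModel k : r.kids) kids.add(normalize(k));
        switch (r.kind) {
            case SEQ: {
                List<ContentModel> out = new ArrayList<>();
                for (int i = 0; i < kids.size(); i++) {
                    ContentModel cur = kids.get(i);
                    boolean ownStarFollows = i + 1 < kids.size()
                        && kids.get(i + 1).kind == Kind.STAR
                        && kids.get(i + 1).kids.get(0).toString().equals(cur.toString());
                    if (ownStarFollows) { out.add(plus(cur)); i++; }  // X X* -> X+
                    else out.add(cur);
                }
                // A sequence with a single item is replaced by that item.
                return out.size() == 1 ? out.get(0) : new ContentModel(Kind.SEQ, null, out);
            }
            case OPT: {
                ContentModel body = kids.get(0);
                if (body.kind == Kind.PLUS) return star(body.kids.get(0));  // (X+)? -> X*
                return opt(body);
            }
            default:
                return new ContentModel(r.kind, null, kids);
        }
    }

    @Override public String toString() {
        switch (kind) {
            case REF:  return name;
            case SEQ: {
                List<String> parts = new ArrayList<>();
                for (ContentModel k : kids) parts.add(k.toString());
                return "(" + String.join(" ", parts) + ")";
            }
            case STAR: return kids.get(0) + "*";
            case PLUS: return kids.get(0) + "+";
            default:   return kids.get(0) + "?";
        }
    }
}
```

Building `opt(seq(p, star(p)))` with `p = ref("parameter")` prints as `(parameter parameter*)?`, and `normalize` turns it into `parameter*`, following the same two steps as in the text.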

In principle, you should always read the DTD generated by Xantlr carefully and, whenever a program evolves, consider how changes in the grammar definition file affect the DTD.

^ToC 3 Running Xantlr

Xantlr is applied to a grammar source file exactly like the underlying antlr tool, but with the meta_tools classes preceding the antlrC classes in the classpath:

/usr/bin/java -classpath metatools.jar:antlr.jar  antlr.Tool  mygrammar.g

This will generate the sources for the parser, lexer, vocabulary etc. as usual with antlrC ("MyParser.java", "MyLexer.java", "MyTokenTypes.java", "MyTokenTypes.txt", etc.).
The generated source text for the parser is enhanced with the above-mentioned SAX event generating code.
Additionally, a DTD file is generated for each parser in the grammar source file, named "MyParser.dtd".

^ToC 3.1 Messages and Unreachable Nonterminals

In addition to the usual progress and error messages generated by antlrC, Xantlr outputs the following:

  1. "warning: unreachable element: xml-node-name "
    This warning is issued whenever a parser production is neither declared as public nor indirectly called by a public production.
    The element definition is nevertheless contained in the generated DTD file!
    This is for the convenience of the programmers: Such non-reachabilities can arise temporarily during development and testing. When the DTD is used to create a Tdom model (which will mostly be the case), tedious commenting-out could otherwise be necessary in user code derived from the Visitor class whenever an Element class is suddenly missing. Furthermore, this would be rather error-prone, since one can easily forget to re-activate the code when the parser is consistent again.
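The condition behind this warning is a plain reachability check on the grammar's call graph. The following sketch shows the idea under simple assumptions: the grammar is given as a map from production name to the productions it references, and the public productions are given as a set. The class `ReachabilitySketch` is hypothetical and not part of Xantlr.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.SortedSet;
import java.util.TreeSet;

final class ReachabilitySketch {

    /**
     * Returns all productions which are neither public themselves
     * nor called, directly or indirectly, by a public production.
     */
    static SortedSet<String> unreachable(Map<String, List<String>> references,
                                         Set<String> publicRules) {
        Set<String> seen = new HashSet<>(publicRules);
        Deque<String> todo = new ArrayDeque<>(publicRules);
        while (!todo.isEmpty()) {
            String rule = todo.pop();
            for (String callee : references.getOrDefault(rule, List.of())) {
                if (seen.add(callee)) {      // first visit of this production
                    todo.push(callee);
                }
            }
        }
        SortedSet<String> result = new TreeSet<>(references.keySet());
        result.removeAll(seen);
        return result;
    }
}
```

For every name in the returned set a warning of the form "warning: unreachable element: ..." would be printed, while the element declaration itself stays in the DTD.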

^ToC 3.2 antlrC Grammar Inheritance and Xantlr

In principle the Xantlr mechanisms are compatible with the inheritance mechanism defined by antlrC .

Please note that if you have a grammar "X extends Y", the parsers and lexers belonging to "X" normally must be produced, even if they are not used themselves. Mostly the generated files "X_parser_TokenTypes.java" and/or "X_parser_TokenTypes.txt" are required.

^ToC 4 Running the Resulting Compiler

Each compiler generated by Xantlr is derived from a base class called <METATOOLS>/xantlr/runtime/X_LLkParser.

This class is an extension of the original antlr-class "LLkParser", and has two additional fields, which hold the receivers (1) of the generated events, and (2) of all error messages.

The first field must be set to an object of type <METATOOLS>/xantlr/runtime/EventGenerator. This interface offers the methods which are called by the automatically inserted semantic actions, namely EventGenerator.startElement(tag), EventGenerator.endElement(tag), etc.

An implementation of this interface has to be set by calling X_LLkParser.setEventGenerator(). Currently this is always an instance of SAXEventGenerator, which maps the start/end calls mentioned above to the corresponding SAX events.

A SAXEventGenerator must in turn be linked to the target of these SAX events, by calling SAXEventGenerator.setContentHandler(org.xml.sax.ContentHandler).

For the error messages one has to call X_LLkParser.setMessageReceiver().
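The layering described here (tag-based calls from the parser, forwarded to a SAX ContentHandler) can be sketched with stand-in types. `TagEvents` and `SaxForwarder` below are hypothetical substitutes for EventGenerator and SAXEventGenerator; their real signatures are not reproduced here, and details such as namespace handling or attributes are assumptions of this example.

```java
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

final class EventAdapterSketch {

    /** Stand-in for the tag-based interface; NOT the real EventGenerator. */
    interface TagEvents {
        void startElement(String tag);
        void endElement(String tag);
    }

    /** Stand-in for SAXEventGenerator: forwards tag calls as SAX events. */
    static final class SaxForwarder implements TagEvents {
        private ContentHandler target;

        void setContentHandler(ContentHandler target) { this.target = target; }

        @Override public void startElement(String tag) {
            try {
                // Assumption: no namespace, no attributes.
                target.startElement("", tag, tag, new AttributesImpl());
            } catch (SAXException e) {
                throw new IllegalStateException(e);
            }
        }

        @Override public void endElement(String tag) {
            try {
                target.endElement("", tag, tag);
            } catch (SAXException e) {
                throw new IllegalStateException(e);
            }
        }
    }
}
```

The parser only sees the narrow tag-based interface, so the receiving side can be exchanged without touching the generated semantic actions.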

antlr itself has an option "defaultErrorHandler", which can be set "for an entire (.g) file, ...for a grammar, ...for a rule", as explained in http://www.antlr2.org/doc/options.html#File, Grammar, and Rule. This option is described in http://www.antlr2.org/doc/options.html#defaultErrorHandler.

When set to false, antlrC throws some exceptions; when set to true, these are caught and fed into the functions defined in the base class antlr/Parser (reportError(String), reportError(RecognitionException), reportWarning(String)), which simply print to System.err. These methods are redefined in <METATOOLS>/xantlr/runtime/X_LLkParser to generate correct message objects and to send these to the above-mentioned message receiver.
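The redefinition pattern can be sketched with stand-in classes. `BaseParser` plays the role of antlr's Parser and `RoutingParser` that of X_LLkParser; both are hypothetical, and the use of plain strings and `Consumer<String>` instead of message objects and the real receiver type is a simplification made for this example.

```java
import java.util.function.Consumer;

final class ErrorRoutingSketch {

    /** Stand-in for the antlr base class: reports go to System.err. */
    static class BaseParser {
        public void reportError(String s) { System.err.println("error: " + s); }
        public void reportWarning(String s) { System.err.println("warning: " + s); }
    }

    /** Stand-in for X_LLkParser: same methods, but routed to a receiver. */
    static class RoutingParser extends BaseParser {
        private Consumer<String> receiver = s -> { };   // ignores messages until set

        void setMessageReceiver(Consumer<String> receiver) { this.receiver = receiver; }

        @Override public void reportError(String s) { receiver.accept("error: " + s); }
        @Override public void reportWarning(String s) { receiver.accept("warning: " + s); }
    }
}
```

Because the generated parser code only calls the inherited report methods, the redirection takes effect without any change to the generated sources.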

Note 1: reportWarning() can be called explicitly from any of the user-defined rules by antlr semantic actions.

Note 2: You can substitute a class of your own as the base class of the generated parser, as defined in the last paragraph of http://www.antlr2.org/doc/metalang.html#Parser Class Definitions, by starting the grammar file with the declaration

class MyParser extends Parser("path.of.my.own.ParserBaseClass");

But in the case of Xantlr, your "ParserBaseClass" must also derive from xantlr.X_LLkParser.

When an Xantlr-generated parser is to be run, it must not be plugged into an antlrC-generated lexer directly, but via an intermediate HistoryToken.

Assuming that myInStream is a Java InputStream object which delivers the text to parse, and MyParser and MyLexer are the names of the generated classes, then the following code will print the abstract syntax tree on the terminal:

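   // The antlrC-generated lexer reads the raw input text.
   MyLexer lexer = new MyLexer(myInStream);
   // The parser is connected via HistoryToken, never to the lexer directly.
   MyParser parser = new MyParser(HistoryToken.chain(lexer));
   SAXEventGenerator gen = new SAXEventGenerator(parser);
   // Errors and warnings are printed to System.err.
   parser.setMessageReceiver(new MessagePrinter(new PrintWriter_flushing(System.err)));
   // The SAX events are printed, too (here also to System.err).
   gen.setContentHandler(new ContentPrinter
                          (new PrintWriter_flushing(System.err), 
                          false, false));
   // Start parsing at the top-level rule of the grammar.
   parser.topNonTerminal();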

(Please note: Whenever the result of the parsing process is to be fed into Tdom, there is a glue class <METATOOLS>/xantlrtdom/XantlrTdom, which does all this plugging automatically. See the page on the co-operation of xantlr and tdom.)

^ToC 5 Notes on the Implementation

Currently Xantlr works as a modification of antlrC only in version 2.7.4.

Since this version can no longer be found on the net, we keep a copy at http://bandm.eu/software/mirror/index.html

In the original antlrC implementation, the semantic actions directly called modification methods of a (non-specified) automaton, which plugs together the generated parser code.
From the sources in antlr.g we created xantlr.g, splitting this process into two phases:

  1. First, there is a "text parser" called ANTLRSyntaxChecker. It reads the grammar source and builds a (newly defined) intermediate model. This model is based on the tree model antlrC comes with.
  2. Secondly, there is class ANTLRParser extends TreeParser, which performs the original code generating method calls, but delayed.

Because the latter has the same name as the parser of the original antlrC implementation, this splitting operation is transparent to the rest of the antlrC code, which can therefore be executed as usual.

On this intermediate model the newly defined Xantlr modifications can be implemented in a clean and maintainable way. They include (1) interpreting the newly defined options, (2) translating them into semantic actions (i.e. Java method calls) and (3) deleting the options so as not to confuse the genuine antlr process.

These are the sources processed by antlrC:

 xantlr.g             --->    ANTLRSyntaxChecker
                      --->    ANTLRParser.java
 visitor.g            --->    XANTLRVisitor.java
 dtd.g                --->    DTDGenerator.java            extends XANTLRVisitor
 expander.g           --->    XmlRepresentationExpander    extends XANTLRVisitor
 filter.g             --->    XmlRepresentationFilter      extends XANTLRVisitor

All these phases are plugged together in ANTLRParser.grammar(), created by antlrC from xantlr.g. Since the name of the generated class is antlr.ANTLRParser, the unmodified antlr.Tool will run as usual and perform all the additional Xantlr tasks "without even noticing it", provided that the meta_tools jar-file (containing the new ANTLRParser) precedes the antlrC jar-file in the classpath.





made 2025-01-09_11h54 by lepper on happy-ubuntu

produced with eu.bandm.metatools.d2d and XSLT