






Tdom , a Generator for Typed XML Models



(related API documentation: package tdom.runtime   )


1          Principles of Tdom
2          Mapping from DTD to Java
2.1          Relevant Information content of a DTD
2.1.1          Process instructions
2.2          XML Namespaces
2.3          The own XML Document Id
2.4          Pre-Defined Infra-Structure, Runtime Classes
2.5          Generated Java Classes for the top-level DTD. Reflection.
2.6          Generated Java Classes for Element declarations. General Name Translation.
2.6.1          Abstract Java Classes as Realisations of DTD Content Model Alternatives
2.6.2          Name Mangling from DTD Elements' Contents and Sub-Contents to Java Classes
2.6.3          Inner Classes Generated for Sub-Content
2.6.4          Retrieval, Update and Visit Methods
2.6.5          Unsafe retrieval methods and alternative checked list generation
2.6.6          Inner Classes Generated for Embedded Sequences
2.6.7          Inner Classes Generated for Choices
2.6.8          Text Content and Mixed Content
2.7          Attributes
2.7.1          Generated Classes for Attributes
2.7.2          Checking Value Assignments
2.7.3          Enumeration Attributes with Integer Tokens Only
2.7.4          Unsetting Attributes
2.7.5          Common Classes for Common Attributes
2.7.6          Attributes with Attribute Types "ID", "IDREF" and "IDREFS"
2.8          Auxiliary methods for numeric contents of elements and attributes
2.9          Additional Documentation Text
2.10          Ethereals: Comments and Processing Instructions as Second Class Inhabitants
3          Construction of Tdom models
3.1          Error Cases and Exception Hierarchy
3.2          Explicit Constructor Application for Elements and Sub-Content
3.2.1          Creating Elements with Structured Contents, Statically Typed
3.2.2          Creating Elements with Structured Contents, Dynamically Typed, by Semi-Parser
3.2.3          Creating Elements with Mixed or Pure PCData Contents, Statically Typed
3.2.4          Creating Elements with Mixed or Pure PCData Contents, Dynamically Typed
3.2.5          Creating Elements with Attributes
3.3          Automated Construction of Documents and Elements
4          Visitors and Patterns
4.1          The Generated Visitor Class and Deriving User Defined Visitors
4.2          Calling a User Defined Visitor
4.3          Default Visiting Strategy of Generated Visitors and User Defined Explicit Control
4.4          Untyped Visitors
4.5          Generated Paisley patterns
5          The Extension Mechanism
6          Serialization and Conversions
6.1          Generating SAX Events
6.2          Visualization of a Tdom Model
6.3          Format Generation for a Tdom Model
6.3.1          Stand-alone format description file
6.3.2          Process instructions in a DTD
6.3.3          Options from an Xantlr Source
6.4          Creating a W3C (untyped) DOM Representation
6.5          Compressed De-/Serialization
7          Using the Tdom Tool
7.1          Calling the Tdom Tool
7.2          Outputs and Error Messages
8          Xantlr and Tdom --- Special Issues of Their Co-Operation
8.1          Information Interchange by Option Controlled DTD Generation
8.2          Different Layers of Ambiguity
8.3          Xantlr, Tdom, Glueing Code and Error Messaging Issues

^ToC 1 Principles of Tdom

Tdom is a tool for generating typed data models of an XML text body according to a definition given as an XML DTD [xml] . "Typed" model means that (a) the validity of the model w.r.t. the DTD is guaranteed by all creation and modification methods, 1 and (b) that this can be proved at compile time. 2

The Tdom generated model behaves "partly algebraically": each node behaves like an algebraic expression and knows nothing about the context(s) it appears in. So, in contrast to W3C DOM ([w3cDom]), you can employ sharing, even between different "documents". Nodes exist independently from a global document object, and can be created, processed and stored in a freely compositional and local way. This is a fundamental requirement for a "functional style" of programming.
Tdom nodes do not behave algebraically in two respects: they can be treated as mutable (but think twice whether this is really necessary for your purposes, since you lose sharing !-), and they do not support algebraic equals().
(This could be added in some later version !?!)

The fact that they "do not know" their parent and their siblings makes Tdom nodes behave more like nodes in a tree in the mathematical sense. Software architects used to W3C DOM and similar models may consider this restriction to be a drawback. Processing and creating trees is of course fundamentally different in this paradigm: creating goes most naturally bottom-up, processing goes most naturally by visiting top-down (see chapter 4) and memorizing all required context information "on the fly".
(For all our applications, we found this a most convenient, safe and easy to debug way of coding !-)
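The following is a minimal sketch of this paradigm, not Tdom code: the class Node and the helper SharingDemo are hypothetical stand-ins. It shows a sub-tree that is built once, bottom-up, and shared by two trees, and a top-down traversal that passes the required context (here the path from the root) down as an argument, because a node cannot ask for its parent.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical stand-in for a parent-less tree node; NOT a Tdom runtime class.
final class Node {
    final String tag;
    final List<Node> children;

    Node(String tag, Node... children) {
        this.tag = tag;
        this.children = Collections.unmodifiableList(Arrays.asList(children));
    }
}

final class SharingDemo {
    // Top-down traversal: the path from the root is handed down as an argument.
    static void collectPaths(Node n, String pathSoFar, List<String> out) {
        String path = pathSoFar + "/" + n.tag;
        out.add(path);
        for (Node child : n.children) {
            collectPaths(child, path, out);
        }
    }

    static List<String> paths(Node root) {
        List<String> out = new ArrayList<>();
        collectPaths(root, "", out);
        return out;
    }

    public static void main(String[] args) {
        Node shared = new Node("note");                    // built once, bottom-up
        Node docA = new Node("a", shared);                 // used in one tree ...
        Node docB = new Node("b", new Node("c", shared));  // ... and in another
        System.out.println(paths(docA));                   // [/a, /a/note]
        System.out.println(paths(docB));                   // [/b, /b/c, /b/c/note]
    }
}
```

The same "note" object appears under two different paths, which would be impossible if it stored a single parent reference.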

Applying the Tdom compiler to a DTD yields a collection of Java source files, forming a single package. This package will be processed by a Java compiler. It relies on the presence of a collection of base classes in the package <METATOOLS>/tdom/runtime, see Section 2.4 . The generated collection provides (at least) one Java class definition for each type of node defined by the DTD. This includes ...

  1. one or two classes for each ELEMENT declaration,
  2. one class for each attribute declaration,
  3. and one class for each structured sub-content of an ELEMENT's content.

All these Java classes are called "node classes" in the following. A Tdom model of a certain text corpus is realized by a structured collection of instances of these node classes. Each such instance represents a certain part of the document, and each node class represents a certain type of these document parts.

All generated node classes provide ...

  1. constructor definitions,
  2. methods for retrieving sub-nodes of a given node,
  3. methods for updating sub-nodes of a given node while preserving the correctness w.r.t. the DTD,
    (For all these topics see chapter 2 and chapter 3)
  4. parsing methods for creating documents from SAX streams or W3C-DOMs (see Section 3.3),
  5. methods for translating a Tdom model into a SAX stream,
  6. methods for a compressed serialization and de-serialization (for these topics see chapter 6).

Additionally the generated package contains ...

  1. a visitor base class Visitor and a visitor template VisitorTemplate for declarative processing of the models, as described in chapter 4,
  2. a class derived from the run-time class TypedDTD, which contains a DTD model. Please note that this class additionally needs the textual representation of the DTD as an (error-free, parseable) runtime resource, accessible via the relative path "./original.dtd" from the DTD class.

After these classes have been compiled by a Java compiler, you can create a Tdom text model ...

  1. ...either "manually", i.e. by explicitly calling constructors for node classes, thereby explicitly creating the document tree bottom-up, see chapter 2
  2. or by creating a whole Document with the generated parsing methods: Either with the validating SAX receiver from a SAX event stream, or with the validation DOM interpreter from a W3C-DOM (e.g. [xercesj]), see Section 3.3.

Each Tdom model, or each fragment thereof, can then ...

  1. ...be analysed by visitor based code, see chapter 4,
  2. ...be modified by applying the update methods of those Java standard libraries which make up the model, see chapter 2,
  3. ...and finally be written out by different serialization methods, see chapter 6

Some of the public examples on the page "downloads & licenses" make extensive use of Tdom .
See esp. the "BandM booking" book keeping software, where a dedicated DTD models the business objects, and the d2d based "Wiki" where type correct XHTML 1.0 is constructed in small pieces bottom-up.

^ToC 2 Mapping from DTD to Java

In the following simplified examples, let "pkg" be the name of the generated package, as given to the Tdom tool by the command line parameter --pkgname, see Section 7.1.

^ToC 2.1 Relevant Information content of a DTD

The Tdom tool processes one single DTD file and generates one package of Java source files. (The contents of further DTD files, which are included directly or indirectly in this file by the famous "external parameter entity" mechanism, are of course also considered and are processed as if contained directly in the top file.)
From the DTD it uses ...

  1. ...all element definitions, their (expanded) content models and their attribute definitions.
  2. ...process instructions of the form "<?tdom ... ?>"
  3. ...entity definitions.

In most cases, entity definitions are only used implicitly. Tdom uses the meta_tools component dtd , expanding entity references in a transparent way.

^ToC 2.1.1 Process instructions

The translation into Java code is controlled by a whole zoo of "process instructions" addressing Tdom , as defined in [xml]. Here are the most important:

  1. <?tdom xmlns=..?>
    <?tdom xmlns:..=..?> -- XML namespace definition, see Section 2.2.
  2. <?tdom SYSTEM/PUBLIC ..?> -- defines the xml document id for this DTD file, see Section 2.3.
  3. <?tdom public/private/default ..?> -- defines which elements can be used as top of a model, as so called "document elements", see Section 2.6.
  4. <?tdom abstract ..?>
    <?tdom abstract-entity ..?> -- generate abstract classes for alternatives in DTD content models, for leaner code, see Section 2.6.1. Ending with an ellipsis it is also used in the expansion mechanism, see chapter 5.
  5. <?tdom attribute ..?>
    <?tdom attribute-entity ..?> -- generate a common class for an attribute used in different elements, for leaner code, see Section 2.7.5
  6. <?tdom import ..?> -- import another Tdom model for extending it, see chapter 5
  7. <?tdom doc ..?> -- define additional documentation text to integrate it into the generated API doc, see Section 2.9
  8. <?tdom package ..?> -- ?? FIXME MORE (_)

^ToC 2.2 XML Namespaces

The names of all elements, entities and attributes are represented as instances of NamespaceName.

As documented there, this class can represent names in "non-namespace-mode" and in "namespace-mode". In the first case the character ":" is treated in no way special. In the latter there must be at most one such character, and the prefix is mapped to a "namespace URI", as it is the standard way with XML namespaces, see [xml-ns].

For Tdom , the namespace mode is activated by process instructions which exactly follow the standard XML syntax:

<?tdom xmlns="myMainModule" ?>
<?tdom xmlns:mathml="http://www.w3.org/1998/Math/MathML" ?>

For the runtime namespace logic, the prefixes are ignored and "equals()" etc. is ruled only by the namespace URI, -- the usual way with namespace aware XML. In this concern, all prefixes must be different, but can be arbitrary.

But for the code generation the prefixes are kept and only the colon ":" is replaced by an underscore "_". (There are more characters to be replaced, see the paragraph on name translation in Section 2.6.) So the selected prefixes appear in the name of the generated Java classes and should be selected accordingly.
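As a minimal sketch of these two views on a name, assuming a hypothetical class QualifiedName (this is not the NamespaceName class of meta_tools, whose API is not reproduced here): equality and hashing look only at namespace URI and local name, while the prefix survives solely for deriving the generated Java name. The further character replacements mentioned above are left out.

```java
import java.util.Objects;

// Hypothetical illustration; this is NOT the NamespaceName class of meta_tools.
final class QualifiedName {
    final String prefix;        // arbitrary, only relevant for generated class names
    final String namespaceUri;  // decides identity at run time
    final String localName;

    QualifiedName(String prefix, String namespaceUri, String localName) {
        this.prefix = prefix;
        this.namespaceUri = namespaceUri;
        this.localName = localName;
    }

    // Run-time identity: the prefix is ignored.
    @Override
    public boolean equals(Object o) {
        if (!(o instanceof QualifiedName)) return false;
        QualifiedName other = (QualifiedName) o;
        return namespaceUri.equals(other.namespaceUri)
            && localName.equals(other.localName);
    }

    @Override
    public int hashCode() {
        return Objects.hash(namespaceUri, localName);
    }

    // Name fragment for code generation: the colon becomes an underscore.
    String javaName() {
        return prefix.isEmpty() ? localName : prefix + "_" + localName;
    }
}
```

With the MathML declaration from above, a name written as "mathml:mi" would contribute "mathml_mi" to a generated class name, while comparing equal to the same element written with any other prefix bound to the same URI.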

^ToC 2.3 The own XML Document Id

With the PIs ...

<?tdom SYSTEM "xslt.dtd"?>
--- or ---
<?tdom PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
              "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" ?>

... the XMLDocumentIdentifier of the DTD file itself is made known to Tdom . It will be stored in the generated DTD class and is accessible by the method getDocumentId().

^ToC 2.4 Pre-Defined Infra-Structure, Runtime Classes

The classes generated by Tdom to represent Elements, Attributes and sub-expressions of content models, all inherit from pre-defined runtime classes. These classes are contained in eu.bandm.tools.tdom.runtime and provide basic functionalities.

The following figure indicates symbolically their inheritance tree, and the places where the generated classes are inserted:
(Please note that this graph is only symbolic and leaves out many details. For more details, please refer to the current API documentation contained in
<METATOOLS>/eu/bandm/tools/tdom/runtime !)


  INTERFACES:
  eu.bandm.tools.tdom.runtime.
    TypedContent 

    TypedElement.MixedContentContainer
    | TypedElement.PCDataContainer

    Visitable<V>
    ImpliedAttribute
    Identifiable
    ... // etc

  CLASSES:
  eu.bandm.tools.tdom.runtime.
    TypedDTD
    | pkg.DTD                    <<<<< GENERATED once for each Tdom model

    TypedNode
    | TypedDocument
    |
    | TypedSubstantial 
    | | TypedPCData IMPLEMENTS Matchable // FIXME visitable / matchable??? WHICH APPLIES??
    | |
    | | TypedElement IMPLEMENTS TypedContent 
    | | | pkg.Element             <<<<< GENERATED once for each Tdom model
    | | | | pkg.Element_<tagX>    <<<<< GENERATED once for each ELEMENT declaration
    | | | | pkg.Element_<tagY>    <<<<< "
    |
    | TypedEthereal
    | | TypedComment
    | | TypedProcessingInstruction
    |
    | TypedSubtree IMPLEMENTS TypedContent 
    | | TypedChoice
    |
    | TypedAttribute
    | | CDataAttribute
    | | EnumerationAttribute
          pkg.Element_<tag>.Attr_<attname>   <<<<< GENERATED for each attribute declaration
    | | NmTokenAttribute
    | | | IdAttribute
    | | | IdRefAttribute
            pkg.Element_<tag>.Attr_<attname> <<<<< GENERATED for each attribute declaration
    | | NmTokensAttribute
    | | | IdRefsAttribute
            pkg.Element_<tag>.Attr_<attname> <<<<< GENERATED for each attribute declaration
    |        ...
    |

    TypedElement.MixedContent IMPLEMENTS TypedContent
    | pkg.Element_<tagX>.Content    <<<<< GENERATED once for each mixed-content ELEMENT

    TypedElement.MixedContentFactory
    //...    

^ToC 2.5 Generated Java Classes for the top-level DTD. Reflection.

In each Tdom run, among others one top-level class pkg.Element is generated, from which all element classes are derived, and one prototype class pkg.Visitor. The pkg.Element class implements tdom.runtime.Visitable<pkg.Visitor>; these co-operate in visitor dispatching. See chapter 4 below.

For a beginner, the different classes and instances called "dtd" or similar may be confusing:

  1. In most cases, the DTD to translate into Java source files is given to the implementation of the Tdom compiler as a DTD text file, as it is defined in [xml]. See Section 7.1 below.
  2. In the generated package "pkg", a class called "pkg.DTD" will be generated. This is the central means for retrieving all kinds of reflective information, when later running the generated code.
  3. For this purpose, it inherits from <METATOOLS>/tdom/tdom/runtime/TypedDTD .
  4. The method <METATOOLS>/tdom/tdom/runtime/TypedDTD/getInterfaceInfo delivers an instance of <METATOOLS>/tdom/tdom/runtime/TypedDTD.DTDInfo . This gives access to very different objects containing reflective information about the elements and attributes in this Tdom model.
    As an example see the preparational code for xslt transformations, which analyses the target tdom by these reflection mechanisms in the constructor of xslt.base.ResultContext.
  5. Esp. the generated pkg.DTD provides a public static field called "dtd", which gives access to one instance of itself.
  6. The original DTD source text must be accessible to the package's initialization code in a text file named "original.dtd".
  7. This is parsed on initialization, and the resulting instance of <METATOOLS>/dtd/DTD.Dtd, which is an umod model, is made accessible by <METATOOLS>/tdom/tdom/runtime/TypedDTD.getDTD().

^ToC 2.6 Generated Java Classes for Element declarations. General Name Translation.

For each <!ELEMENT tag ...> declaration in the source DTD, a node class is generated in the generated package "pkg". Instances of this class will be used to represent the document's sub-trees corresponding to this element. Such a class is called "element class" in the following. Its name is Element_<tag'>, where <tag'> is the DTD name translated to a Java name.

The name translation from DTD to Java is necessary for all kinds of names, such as element tags, attribute names, entity names etc. In all these cases, every single occurrence of a minus sign "-", a colon ":" or a dot "." is replaced by a single underscore "_".

The Tdom tool does not check whether ambiguities are created by this translation. Instead, you will get error messages from the subsequent Java compilation process, accordingly. This will happen for an input like

   <!ELEMENT a.b-c ...>
   <!ELEMENT a-b.c ...>
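Since the tool itself does not detect such clashes, they can be checked beforehand. The following sketch re-implements the replacement rule as stated above and groups DTD names that collapse to the same Java name. The class NameTranslation and both method names are hypothetical helpers, not part of Tdom.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper, not part of the Tdom tool.
final class NameTranslation {
    // Replace each '-', ':' and '.' by a single '_', as described in the text.
    static String toJavaName(String dtdName) {
        return dtdName.replace('-', '_').replace(':', '_').replace('.', '_');
    }

    // Return only those groups of DTD names which map to one and the same Java name.
    static Map<String, List<String>> clashes(List<String> dtdNames) {
        Map<String, List<String>> byJavaName = new LinkedHashMap<>();
        for (String name : dtdNames) {
            byJavaName.computeIfAbsent(toJavaName(name), k -> new ArrayList<>()).add(name);
        }
        byJavaName.values().removeIf(group -> group.size() < 2);
        return byJavaName;
    }
}
```

For the two declarations above, both tags translate to "a_b_c", so clashes(...) reports them as one group under that key.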

For those <!ELEMENT tag ...> declarations which can serve as the top-level node of a document, a further class named Document_<tag'> is created. This inherits from <METATOOLS>.tdom.runtime.TypedDocument and additionally contains the methods for parsing a document as a whole from some external source (SAX or W3C-DOM). The indication whether a given element is such a top-level one is encoded in the DTD by process instructions: The positive case (=yes, E can be top-level element) is indicated in the DTD by the process instruction

<?tdom public E ?>

The negative case by

<?tdom private E ?>

For all those element declarations which are not explicitly mentioned in this way the default is defined by

<?tdom default private ?>
-- or --
<?tdom default public ?>

As mentioned above, every element class may contain inner Java classes which realize complex sub-contents, and update and retrieve methods ("set_<..>(..)" and "get_<..>(..)") for all sub-contents.

Please note that you can even derive further hand-coded sub-classes from generated element classes. This is possible because, whenever a typed element needs to know the identity of the class it is an instance of (e.g. for visiting or for serialization), it does not use the Java language .getClass()-method, but the generated method public int getTagIndex(), which will be inherited by your derived classes and is specified in TypedElement.getTagIndex().
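A minimal sketch of why an index-based dispatch tolerates hand-coded sub-classes. All classes here are hypothetical stand-ins (the real base class is TypedElement), and the constant name TAG_INDEX is an assumption made for this illustration only.

```java
// Hypothetical stand-ins; the real base class is TypedElement in the Tdom runtime.
abstract class ElementBase {
    public abstract int getTagIndex();
}

class Element_p extends ElementBase {
    static final int TAG_INDEX = 0;   // constant name is an assumption
    @Override
    public int getTagIndex() { return TAG_INDEX; }
}

class Element_hr extends ElementBase {
    static final int TAG_INDEX = 1;
    @Override
    public int getTagIndex() { return TAG_INDEX; }
}

// Hand-coded sub-class: it inherits getTagIndex() and is still treated as "p".
class HighlightedParagraph extends Element_p { }

final class TagDispatch {
    // Dispatch on the tag index instead of on getClass().
    static String describe(ElementBase e) {
        switch (e.getTagIndex()) {
            case Element_p.TAG_INDEX:  return "p";
            case Element_hr.TAG_INDEX: return "hr";
            default:                   return "unknown";
        }
    }
}
```

A comparison against Element_p.class would fail for a HighlightedParagraph instance, whereas the switch over getTagIndex() still selects the "p" branch.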

The tag name of a given element can be read statically, when the class is known, by ...

Element_<>.TAG_NAME

...due to the generated definition ...

public static final String TAG_NAME ;

The dynamic way is defined in <METATOOLS>/tdom/runtime/TypedElement by

NamespaceName el.getName()
String        el.getTagName()
String        el.getNamespaceURI()
String        el.getLocalName()

^ToC 2.6.1 Abstract Java Classes as Realisations of DTD Content Model Alternatives

The Tdom tool does support some abstraction of isomorphic content definitions. This abstraction is rather limited, due to the nature of DTD, but nevertheless an important means for increasing re-usability, while preserving static type safety.

Any content model which is an undecorated choice expression of element references can be translated into an abstract class, in the Java terminology. The benefit comes from the further consequences: (a) all element declarations referred to in this alternative will be translated into an element class which is derived from this abstract class, and (b) whenever this choice clause will appear in a certain content model, it will be replaced by a simple reference to this abstract class. This happens independently of the sequential order of the alternatives, and also when the choice alternatives are a true subset of the alternatives of a bigger choice clause.
In the case of nested choices, the largest sub-expression of every choice which matches such an abstraction is replaced by the corresponding abstract class.
(This statement must be refined for overlapping definitions, see below.)
The single inheritance property of Java implies that each element may appear at most in one of these declarations.

This mechanism is controlled by a tdom process instruction like

<?tdom abstract  a (b|c|d) ?>

The content model must be a disjunction of plain element types. The PI declares that an abstract class "Element_a" will be generated. This will be the superclass of the classes which realize the elements in the disjunction, here: Element_b, Element_c and Element_d. We assume that the name "a" is fresh, i.e. does not appear as an element or entity declaration in the rest of the DTD. Whenever this choice appears in a content model, the complicated interface for choices (see Section 2.6.7 below) is replaced by the much simpler interface for a single element reference.

<!ELEMENT x (y, (b|c|d)*, z, (b|c|d)) >

==> yields a code interface containing
    class Element_x { 
       public Element_a[] getElems_1_a() {...}
       public void  setElems_1_a(Element_a[] es) {...}
       public Element_a  getElem_2_a() {...}
       public Element_a  setElem_2_a(Element_b e) {...}
       public Element_a  setElem_2_a(Element_c e) {...}
       public Element_a  setElem_2_a(Element_d e) {...}
    }

A more realistic example, simplified from our XHTML model:

<?tdom abstract-entity  misc.inline ?>  

<!ENTITY % misc.inline "ins | del | script">

<?tdom abstract misc (noscript | misc.inline) ?>

<?tdom abstract heading (h1|h2|h3|h4|h5|h6) ?>

<?tdom abstract block
  (p | heading | div | ul | ol | dl  | pre | hr | blockquote | address 
  |blocktext | fieldset | table) ?>

<?tdom abstract form.content (block | misc) ?>

<?tdom abstract block.content (form.content | form) ?>

The last example shows that these definitions may be nested, as long as no cycles do result.

Whenever such a choice expression is defined by the contents of a DTD "parameter entity", this entity can be used directly: its contents will define the subclasses, and its name will be used as the name of the abstract class. This is shown in the first line of the example above. The entity's expansion text may carry further decoration which will be stripped to get the alternative expressions, as in

<?tdom abstract-entity  form.content ?>
<!ENTITY % form.content "(%block; | %misc;)*">

This abstraction mechanism significantly increases reusability and versatility: With the XHTML example, it is possible now to collect the very different sub-classes of Element_block_content into one single storage, e.g. an ArrayList<Element_block_content>, and later insert this sequence into an arbitrarily chosen instance of Element_object, Element_map, Element_fieldset, Element_noscript, Element_body, or Element_blockquote.

Without this abstraction, the first, collecting step would already require the wrapping of the elements into a certain alternative of a certain choice type of a certain hosting element class, and the collected sequence could not be used anywhere else, without "hacking" and losing static type safety.
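The effect can be illustrated with hand-written stand-ins for the classes which "<?tdom abstract a (b|c|d) ?>" is described to yield. This is only a sketch: the method tagName() and the class AbstractChoiceDemo are hypothetical, and real generated classes carry far more functionality.

```java
import java.util.ArrayList;
import java.util.List;

// Hand-written stand-ins, NOT generated Tdom code.
abstract class Element_a {
    abstract String tagName();
}
final class Element_b extends Element_a { String tagName() { return "b"; } }
final class Element_c extends Element_a { String tagName() { return "c"; } }
final class Element_d extends Element_a { String tagName() { return "d"; } }

final class AbstractChoiceDemo {
    // Elements of different classes are collected in one statically typed list ...
    static List<Element_a> sample() {
        List<Element_a> collected = new ArrayList<>();
        collected.add(new Element_c());
        collected.add(new Element_b());
        collected.add(new Element_d());
        return collected;
    }

    // ... which any host accepting (b|c|d)* can consume, without per-host wrappers.
    static String render(String hostTag, List<Element_a> content) {
        StringBuilder sb = new StringBuilder("<" + hostTag + ">");
        for (Element_a e : content) {
            sb.append("<").append(e.tagName()).append("/>");
        }
        return sb.append("</").append(hostTag).append(">").toString();
    }
}
```

The list returned by sample() can be handed to render("x", ...) and render("y", ...) alike, which corresponds to inserting one collected sequence into different hosting elements.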

The behaviour in case of overlapping choice expressions is not fully defined:

<?tdom abstract x  (a | b) ?>  
<?tdom abstract y  (b | c) ?>  

// not clear what will happen here:
<!ELEMENT  z  (a | b | c) >

So please look into the generated code to find out, or avoid such definitions.

Up to here we assumed that the name of the abstract class is a fresh name.
A different case is to declare an existing element declaration as abstract. The preconditions and consequences are the same: the content model of the element must be an undecorated choice clause.

Additionally, the corresponding node is no longer represented in the generated model by an object instance of its own, but is represented indirectly, by an instance of one of its sub-classes, which corresponds to an element from its content model's choice expression. This instance transparently represents two (2) or even more nodes of the conceptual document tree, namely the chosen leaf element and the containing, abstract element. "Transparently" means that visitor code and attribute accessing methods are not affected.

FIXME IS THIS CORRECT??? EXAMPLE ??? The code in TypedDOMGenerator seems NOT to support a "<?tdom abstract node?>" !?

This method is frequently employed in the Tdom /Xantlr co-operation, to eliminate unnecessary nodes which only present alternatives in the derivation tree. This is controlled by an "options {xmlNodeType=abstract;}" option in the Xantlr grammar file, which is translated to the Tdom PI automatically, see chapter 8 below.

^ToC 2.6.2 Name Mangling from DTD Elements' Contents and Sub-Contents to Java Classes

(This section describes how Java class definitions and their naming are derived from the contents model of a certain element. Please note that the mere name translation for eliminating those characters, which are valid identifier components in XML but not in Java, as described in this chapter above, must happen anyhow and independently!)

The mapping rules between DTD and Java class definitions act locally on each "<!ELEMENT...>" declaration. The structure of the generated Java classes and their naming convention immediately reflect the usage of parentheses in the regular expression describing the element's contents, as given by the compiled DTD.
Please note that there is no implicit normalization of DTD content models. For the name mangling purpose there is a difference between the content models

  a, (b, c)

...and ...

 a, b, c

Name mangling is basically defined on sequences.

Therefore, as a first step, the top-level of the element's content definition is always interpreted as such a sequence, possibly of length one (1).

Each sequence consists of content particles. Each such content particle is ...

  1. either an element reference,
  2. or an embedded choice,
  3. or an embedded sub-sequence.

All content particles may be decorated with a quantification symbol "?", "*" or "+". The naming convention assigns position numbers separately (a) to all references to elements of a certain tag, (b) to all embedded choices and (c) to all embedded sub-sequences.
These numberings always start with the number one(1).

From these numbers (and the tag strings in case (a)) particle names are generated.
E.g. the sub-particles of the top-level in the following DTD content model are addressed by the particle names put beneath:

 <!ELEMENT NC ( TA, (TB, TC, TB)+, TA, (TB)*, (TE | TF | TG), (TB, TC, TB)  ) >
                |   |              |   |      |               |            
                |   Seq_1          |   |      Choice_1        Seq_2        
                |                  |   Elem_1_TB                           
                Elem_1_TA          Elem_2_TA                               

The quantification symbols "?", "*" or "+" and all parentheses around singletons (i.e. not enclosing sub-sequences or alternatives) are ignored in the definition of particle names.
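The numbering rule for one level of a sequence can be sketched as follows. This is a re-implementation from the description above, not the tool's own code. The input notation is a simplifying assumption: the string "SEQ" stands for an embedded sub-sequence, "CHOICE" for an embedded choice, and any other string is taken as the tag of an element reference, with quantifiers and singleton parentheses already stripped.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper which mimics the particle naming of one sequence level.
final class ParticleNames {
    static List<String> name(List<String> particles) {
        Map<String, Integer> elemCount = new HashMap<>();  // one counter per tag
        int seqs = 0;                                       // counter for sub-sequences
        int choices = 0;                                    // counter for choices
        List<String> names = new ArrayList<>();
        for (String p : particles) {
            if (p.equals("SEQ")) {
                names.add("Seq_" + (++seqs));
            } else if (p.equals("CHOICE")) {
                names.add("Choice_" + (++choices));
            } else {
                int n = elemCount.merge(p, 1, Integer::sum);
                names.add("Elem_" + n + "_" + p);
            }
        }
        return names;
    }
}
```

For the content model of NC shown above, the input TA, SEQ, TA, TB, CHOICE, SEQ yields Elem_1_TA, Seq_1, Elem_2_TA, Elem_1_TB, Choice_1, Seq_2, which agrees with the names in the diagram.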

If the top-level content model as written in the DTD is an alternative, the top-level content for Tdom is considered as a singleton sequence.
In this case we get a very simple top-level naming, like ...

 <!ELEMENT NC ( TA | (TB, TC, TD)+ | (TC)+ | (TD, TC, TD) )  >
              |
              Choice_1  

 <!ELEMENT NC ( TA | (TB, TC, TD)+ | (TC)+ | (TD, TC, TD) )*  >
              |
              Choice_1  

^ToC 2.6.3 Inner Classes Generated for Sub-Content

For each sub-sequence and each choice contained in the top-level sequence, an inner class is defined in the class representing the element. The names of these classes are identical with the particle names, as defined above.

As with element classes, each such class provides update and retrieve methods ("set_<..>" and "get_<..>") for its sub-contents. An instance of such an inner class must be used as an argument for the constructors and for update methods, and is returned as result of a corresponding retrieve method.

Please note that in the current implementation there is no algebraic equality defined on content models. Therefore in the example above the types of Seq_1 and Seq_2 are not compatible, in spite of having the same contents definition. The same fact holds for embedded choices.

^ToC 2.6.4 Retrieval, Update and Visit Methods

Built upon the particle names, the Java class generated for the DTD element "NC" provides methods for retrieving and updating the contents of a given instance. Which methods are generated is controlled by the quantification decoration of the content particle.

Let <pname> be the particle name, and <plural> be its plural form (i.e. "Elems_1_TA" for "Elem_1_TA", "Choices_2" for "Choice_2" and "Seqs_1" for "Seq_1").
Let <pclass> be the class representing a sub-content (i.e. "Element_TA" for an Element reference, and "Element_NC.Seq_1" or "Element_NC.Choice_2" for embedded sub-content).

In case of undecorated particles the generated methods are ...

  public <pclass> get<pname>(){..}           // deliver current content
                                             //   this is always != null
  public <pclass> set<pname>(<pclass> e){..} // update current content
                                             //   if e==null, throw exception

If the modifier "?" is present, we get ...

  public <pclass> get<pname>(){..}            // deliver current content,
                                              //   but may return null
  public <pclass> set<pname>(<pclass> e){..}  // update current content
                                              //   and accept null as argument
  public boolean has<pname>(){..}             // return whether component is currently
                                              //   contained in the higher-level content

If one of the modifiers "*" or "+" is present, we get ...

  public <pclass>[] get<plural>(){..}         // deliver whole sequence as an array
  public <pclass> get<pname>(int pos){..}     // deliver content at the given position 
                                              //   of the sequence
  public <pclass>[] set<plural>(<pclass>[] e){..}
                                              // update current content totally 
                                              //   to a whole sequence 
  public <pclass> set<pname>(int pos,<pclass> e){..} 
                                              // update current content
                                              //   at the given position 
  public int count<plural>(){..}              // return number of components currently
                                              //   contained in the higher-level content
  public void visit<plural>(Visitor v){..}    // apply visitor to all particles in the
                                              //   sub-content

Please note that for your convenience every "set<>()"-method (including those described in the following sections!) always returns the old, overwritten value as its result.
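The contract of these accessors can be imitated by hand. The sketch below covers a hypothetical element declared as <!ELEMENT NC (TA, TB?)>. It is not generated code, and the exception type used for the rejected null is an assumption, since the text above only says that an exception is thrown.

```java
// Hand-written imitation of a generated class; NOT actual Tdom output.
final class Element_TA { }
final class Element_TB { }

final class Element_NC {
    private Element_TA elem_1_TA;   // undecorated particle: never null
    private Element_TB elem_1_TB;   // particle with "?": may be null

    Element_NC(Element_TA ta) {
        setElem_1_TA(ta);
    }

    public Element_TA getElem_1_TA() { return elem_1_TA; }

    // Undecorated: null is rejected; the old value is returned.
    public Element_TA setElem_1_TA(Element_TA e) {
        if (e == null) {
            // exception type is an assumption of this sketch
            throw new IllegalArgumentException("Elem_1_TA is mandatory");
        }
        Element_TA old = elem_1_TA;
        elem_1_TA = e;
        return old;
    }

    public Element_TB getElem_1_TB() { return elem_1_TB; }

    // Optional: null is accepted and removes the component; the old value is returned.
    public Element_TB setElem_1_TB(Element_TB e) {
        Element_TB old = elem_1_TB;
        elem_1_TB = e;
        return old;
    }

    public boolean hasElem_1_TB() { return elem_1_TB != null; }
}
```

Returning the overwritten value allows a caller to move a sub-node elsewhere in one step, e.g. by passing the result of setElem_1_TB(null) on to another constructor.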

The meaning of the method "visit<plural>(Visitor v)" will be explained in Section 4.2.

^ToC 2.6.5 Unsafe retrieval methods and alternative checked list generation

In the preceding list of methods, the retrieval functions for plural sub-contents like "getElems_1_TA()" (or like "getSeqs_2()" and "getChoices_1()", as introduced in the next sections) give direct access to the Java array object which realizes the contents of the Tdom model instance. Due to a weakness of the Java language, this array is not protected against vandalism, i.e. storing a forbidden null value into it.

The same holds for the plural update methods like "set_Elems1_TA(TA[])" (or like "setSeqs_1(TN.Sequ_1[])" and "setChoices_1(TN.Choice_1[])"), which additionally can violate the "+" specification when applied to zero-length arguments. So type correctness w.r.t. the original DTD is only guaranteed if these API methods are not used.

An alternative is the usage of checked lists for the implementation. These lists are proxy classes above the Java list classes and prohibit the insertion of null, see their api doc. This mode is selected when running tdom with the command line switch "--generateLists", see Section 7.1.

If lists are selected, then in the method signatures below and above all types "x[]" (appearing as result types or parameters) must be replaced by "CheckedList<x>". In this mode, all code using tdom is always type correct w.r.t. the original DTD.
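The idea of such a checked list can be sketched as follows. This is only an illustration under assumptions: the class name NonNullListSketch and its minSize parameter are hypothetical, and the code is not the CheckedList class of the Tdom runtime, whose actual API is described in its api doc.

```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in, NOT the Tdom CheckedList class. */
final class NonNullListSketch<E> extends AbstractList<E> {
    private final List<E> backing = new ArrayList<>();
    private final int minSize;   // 0 models "*", 1 models "+"

    NonNullListSketch(int minSize, List<? extends E> initial) {
        this.minSize = minSize;
        for (E e : initial) add(e);   // goes through the null check below
        if (backing.size() < minSize)
            throw new IllegalArgumentException("too few elements for '+'");
    }

    @Override public E get(int i) { return backing.get(i); }
    @Override public int size() { return backing.size(); }

    @Override public void add(int i, E e) {
        backing.add(i, requireNonNull(e));
    }
    @Override public E set(int i, E e) {
        return backing.set(i, requireNonNull(e)); // returns the old value
    }
    @Override public E remove(int i) {
        if (backing.size() <= minSize)
            throw new IllegalStateException("would violate minimum size");
        return backing.remove(i);
    }
    private static <T> T requireNonNull(T e) {
        if (e == null) throw new IllegalArgumentException("null is forbidden");
        return e;
    }
}
```

Because every mutating operation passes through the proxy, neither a null entry nor an empty "+" sequence can be produced, which a bare Java array cannot guarantee.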

^ToC 2.6.6 Inner Classes Generated for Embedded Sequences

In case of embedded sequences, the whole top-level procedure (particle naming scheme, inner class definition for sub-structures and generation of methods) is simply applied recursively.

A difference in the implementation is that the classes for sub-content (i.e. embedded sequences and choices) are not nested inside the inner class which represents the containing content, but reside as direct inner classes of the element's class.
The nesting is only represented by their name, which is a concatenation of the particle names of all levels, connected by an underscore "_".

The following example shows some of the get methods and the resulting types (classes). The names of both are again constructed by the particle names:

 <!ELEMENT NC ( TA, (TB, TC, (TA, TB)*  ) )>
                    |        |    |
                    |        |    nc.getSeq_1().getSeq_1(3).getElem_1_TB()=>Element_TB
                    |        |   
                    |        nc.getSeq_1().getSeq_1(3)=>NC.Seq_1_Seq_1
                    |            
                    nc.getSeq_1()=>NC.Seq_1

^ToC 2.6.7 Inner Classes Generated for Choices

The inner classes generated for choices are sub-classes of the pre-defined runtime class TypedChoice. Additionally, for each alternative of a choice an inner class is generated, which is again a sub-class of this "typed choice class". The name of such an alternative class is the name of the choice class with the appendix "_Alt_<n>".

This "<n>" used to identify an alternative is the position number w.r.t. the containing choice in the original DTD formula. This numbering starts with 1 (one)!

In our example from above, the naming is ...

 <!ELEMENT NC ( TA, (TB, TC, TB), TA, (TB)*, (TE | TF | TG), (TB, TC, TB) ) >
                                             ||    |    |
                                             ||    |    Choice_1_Alt_3
                                             ||    Choice_1_Alt_2
                                             |Choice_1_Alt_1
                                             Choice_1

The methods generated for the choice class are ...

public class Element_NC extends eu.bandm.tools.tdom.Element {
  ...
  public Choice_1 setChoice_1(Choice_1 e){...} 
                                        // change content accordingly.
  public Choice_1 getChoice_1(){...}    // deliver current content


  public abstract class Choice_1 extends TypedChoice {
    public int getAltIndex(){...}       // deliver the index of the
                                        //   currently contained alternative
    public Choice_1_Alt_1 toAlt_1(){..} // convert to the corresponding class,
    public Choice_1_Alt_2 toAlt_2(){..} //   if current content represents this
                                        //   alternative. Otherwise, return null

    public boolean isAlt_1(){..}        // return true iff current content 
    public boolean isAlt_2(){..}        //   is of the mentioned alternative.
    ...
  }

  public class Choice_1_Alt_1 extends Choice_1 {
  ... // update/retrieve/visit methods like a top-level element/sequence class !!
  }
}

The contents of each Choice_<m>_Alt_<n> class is again treated as a sequence (possibly a singleton sequence), and the top-level naming and code generation scheme is applied recursively.
Again, no further nesting of inner classes is applied; the representing classes are direct inner classes of the element's class, and their names are created by concatenating the names of the particle hierarchy.

An example for retrieving:

 <!ELEMENT NC ( TA, (TB | TC |  TD, (TE | TF)* )  ) >
                    |           |   |
                    |           |   nc.getChoice_1().toAlt_3().getChoice_1(8)
                    |           |      =>NC.Choice_1_Alt_3_Choice_1
                    |           |   
                    |           nc.getChoice_1().toAlt_3()=>NC.Choice_1_Alt_3
                    |            
                    nc.getChoice_1()=>NC.Choice_1
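The "is/to" protocol of the choice classes can be sketched with hand-written stand-ins. All names below (ChoiceSketch, its nested Alt_1 and Alt_2, the fields tb and tc, and the helper describe) are hypothetical and only imitate the shape of the generated code; they are not produced by Tdom.

```java
/** Hypothetical stand-ins, hand-written; not generated Tdom code. */
abstract class ChoiceSketch {
    abstract int getAltIndex();                 // numbering starts with 1
    boolean isAlt_1() { return getAltIndex() == 1; }
    boolean isAlt_2() { return getAltIndex() == 2; }
    // Each conversion returns null if another alternative is present.
    Alt_1 toAlt_1() { return isAlt_1() ? (Alt_1) this : null; }
    Alt_2 toAlt_2() { return isAlt_2() ? (Alt_2) this : null; }

    static final class Alt_1 extends ChoiceSketch {
        final String tb;
        Alt_1(String tb) { this.tb = tb; }
        @Override int getAltIndex() { return 1; }
    }
    static final class Alt_2 extends ChoiceSketch {
        final String tc;
        Alt_2(String tc) { this.tc = tc; }
        @Override int getAltIndex() { return 2; }
    }

    /** Dispatches on the alternative by testing the conversion results. */
    static String describe(ChoiceSketch c) {
        Alt_1 a1 = c.toAlt_1();
        if (a1 != null) return "TB:" + a1.tb;
        Alt_2 a2 = c.toAlt_2();
        if (a2 != null) return "TC:" + a2.tc;
        throw new IllegalStateException("unknown alternative");
    }
}
```

Client code thus never needs a cast of its own: it either asks isAlt_<n>() first or tests the result of toAlt_<n>() against null.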

^ToC 2.6.8 Text Content and Mixed Content

Mixed content and plain character content are treated specially. Mixed content could be considered a "choice-type with *-quantification", but in contrast to the standard implementation described above, the layer which explicitly addresses the choices is skipped for the sake of the user's convenience.

Instead, a specialized Content class is defined in the element's implementing class, which can contain either character data, or one of the elements listed in the mixed content declaration.

So the DTD definition ...

<!ELEMENT NB (#PCDATA) >
<!ELEMENT NC ( #PCDATA | TA | TB )* >

...is translated to ...

public class Element_NB extends Element 
  implements TypedElement.TypedPCDataContainer ... {
  public static class Content extends TypedElement.MixedContent {
    ...
  }
  public List<Content> getContent(){..}   // returns the modifiable list of particles
  public String getPCData() {return getPCData(this);} // convenience function

}

public class Element_NC extends Element 
  implements TypedElement.MixedContentContainer ... {
  ...
  public static class Content extends TypedElement.MixedContent {
    public Content (Element_TA el){...}   // create the variant with element TA
    public boolean isElement_TA(){...}    // returns whether content particle is a TA
    public Element_TA toElement_TA(){...} // returns casted content or null

    public Content (Element_TB el){...}   // create the variant with element TB
    public boolean isElement_TB(){...}    // returns whether content particle is a TB
    public Element_TB toElement_TB(){...} // returns casted content or null

    // inherited from TypedElement.MixedContent :
    public Content (String s){...}        // create the variant with pcdata
    public Content (TypedPCData s){...}   // dto.
    public boolean isPCData(){...}        // returns whether content particle is PCData
    public TypedPCData toPCData(){...}    // returns casted content or null
  }
  ...
  public List<Content> getContent(){..}   // returns the modifiable list of particles
}

// to get the character content of the pcdata particles, you additionally need:

public class TypedPCData extends TypedNode {
 ...
  public String getPCData(){...} // returns text content of this content particle

}

Let Elx be a generated element class, and elx a reference to an instance of it. Reading the character data of the content particles is done as in

  for (Elx.Content c : elx.getContent()) {
    if (c.isPCData()) {
      String charSeq = c.toPCData().getPCData();
    }
  }

This is rather tedious, of course.
The PCData objects themselves are algebraic: to change the text contents, you have to create a new instance and insert it into the list of elx.getContent(). For convenience there is a constructor which implies the new PCData():

   elx.getContent().add(new Elx.Content("text value"));

All elements which are defined by the DTD wording (#PCDATA) or (#PCDATA)*, i.e. which are pcdata ONLY, are realized as instances of PCDataContainer, a sub-class of MixedContentContainer.

Please note that also in this case you never can make any assumption on how many content particles exist, the concatenation of which represents the plain text.

Anyhow, processing should not happen on this technical level of representation. For convenience, these objects directly offer the method getPCData(), which concatenates all fragments into one string.
Setting the contents nevertheless requires creating the intermediate container level by executing elx.setContent(new Elx.Content("newstringvalue"));

Beside this low-level treatment there is a general method

TypedElement {  String getDeepPCData() ; }

It descends the whole subtree rooted at the element and collects all character data recursively. This corresponds to the notion of "string-value" in XPath [XPath 1.0/5.2], to XPath's "string()" function and to "xsl:value-of" in [xslt1_0].
(The implementation requires the instantiation of a Visitor. This code is specific for the model, and thus realized in the generated code for Element.)

(The runtime class TypedElement additionally offers both functionalities wrapped into static function objects:
public static final Function<MixedContentContainer, String> getFlatPCData,
public static final Function<MixedContentContainer, String> getDeepPCData,
)
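The difference between the flat and the deep collection of character data can be sketched on a small stand-in tree. The class MixedSketch and its methods flatPCData() and deepPCData() are hypothetical; in particular, the real implementation uses a generated Visitor, while this sketch simply recurses.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for an element with mixed content. */
final class MixedSketch {
    /** One content particle: either text or a child element, never both. */
    static final class Content {
        final String pcdata;      // non-null iff this particle is character data
        final MixedSketch child;  // non-null iff this particle is an element
        Content(String s) { pcdata = s; child = null; }
        Content(MixedSketch el) { pcdata = null; child = el; }
        boolean isPCData() { return pcdata != null; }
    }

    final List<Content> content = new ArrayList<>();

    MixedSketch add(String s) { content.add(new Content(s)); return this; }
    MixedSketch add(MixedSketch el) { content.add(new Content(el)); return this; }

    /** Concatenates only the text particles directly contained here. */
    String flatPCData() {
        StringBuilder sb = new StringBuilder();
        for (Content c : content) if (c.isPCData()) sb.append(c.pcdata);
        return sb.toString();
    }
    /** Concatenates all text in the subtree, in document order. */
    String deepPCData() {
        StringBuilder sb = new StringBuilder();
        for (Content c : content)
            sb.append(c.isPCData() ? c.pcdata : c.child.deepPCData());
        return sb.toString();
    }
}
```

For an outer element containing the text "a ", an inner element with the text "bold", and the text " c", the flat variant skips the inner element, while the deep variant yields the XPath-like string-value "a bold c".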

^ToC 2.7 Attributes

^ToC 2.7.1 Generated Classes for Attributes

The definition of "attributes" in XML is rather awkward and impractical. E.g.

  1. technically, the scope of a concrete attribute definition is local to one certain ELEMENT. But the pragmatics of all attributes with the same name are in most cases defined globally, w.r.t. the DTD as a whole, --- which is not represented syntactically.
  2. The granularity of their "type system" is rather unbalanced;
  3. the semantics of "ID" type attributes are non-compositional w.r.t. the validity of the containing document;
  4. the declaration of "#FIXED" values mixes the realms of type definition and of data;
  5. all enumeration types (accidentally meeting in one element type) "should" have disjoint value sets,
    ([xml] , last sentence of section 3.3.1 says
    "For interoperability, the same Nmtoken SHOULD NOT occur more than once in the enumerated attribute types of a single element type." )
  6. etc

(Indeed we met well-experienced XSLT programmers who admitted that in their daily work the first step of every processing is the replacement of all attributes by additional ELEMENTs.)

The Tdom support of attributes is as follows:
For every pair of ELEMENT declaration and attribute definition a new inner class is defined in the element's class, which is derived from that subclass of <METATOOLS>/tdom/runtime/TypedAttribute which corresponds to the attribute's "type", see table below. (Only exception: Common classes for attributes of different elements as described in Section 2.7.5.)

The naming convention for this inner class and for the retrieval/update methods is similar to that of content particles as described above: The attribute named "X" from the DTD is addressed as "Attr_<X'>!" in the Java code, where X' is the mangled character sequence from X, as described for elements in Section 2.6. This Java name is used directly for the inner class which implements the attributes, and in the names of the retrieval methods.

The attribute objects serve as storages for values, not as values: They are created with the element object automatically, but a value has to be assigned to them explicitly (by the user of the API or via the parsed XML source). The identity of the attribute objects related to a particular element instance is totally under the control of the Tdom code: No explicit assignment by the user is possible; initial sharing is terminated automatically by write access; therefore references to attribute objects should better not be cached.

There are two retrieval methods:
element.readAttr_X() delivers the current attribute object. In case that this attribute has the default value, a common default object is returned. In this case the attempt to set a new value will result in an UnsupportedOperationException("!mutable"). But this is the better method to read an attribute value, because default objects can be shared. This method is also applied by all generated visitor code.

element.getAttr_X() delivers an individual object anyhow. The value of this object may be read and written. This method should only be used when writing is indeed intended, because the common default object is replaced by a dedicated, writable copy.
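The interplay of the two retrieval methods can be sketched as a simple copy-on-write scheme. The classes AttrSketch and ElementSketch and the default value "left" are hypothetical stand-ins; the generated code may organise the sharing differently.

```java
/** Hypothetical stand-ins; not the generated Tdom attribute classes. */
final class AttrSketch {
    private String value;
    private final boolean mutable;
    AttrSketch(String value, boolean mutable) {
        this.value = value;
        this.mutable = mutable;
    }
    String getValue() { return value; }
    void setValue(String v) {
        if (!mutable) throw new UnsupportedOperationException("!mutable");
        value = v;
    }
}

final class ElementSketch {
    /** One immutable default object, shared by all elements. */
    static final AttrSketch DEFAULT_X = new AttrSketch("left", false);
    private AttrSketch attrX = DEFAULT_X;

    /** Cheap read access; may return the shared default. */
    AttrSketch readAttr_X() { return attrX; }

    /** Write access; ends the sharing by installing a private copy. */
    AttrSketch getAttr_X() {
        if (attrX == DEFAULT_X) attrX = new AttrSketch(DEFAULT_X.getValue(), true);
        return attrX;
    }
}
```

The sketch also shows why references should not be cached: an object obtained by readAttr_X() before the first write is no longer the element's attribute object after getAttr_X() has been called.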

The retrieved Attribute object in turn provides methods for setting and getting the values, namely V getValue() and setValue(V). getValue() returns null only for an attribute which has been declared #IMPLIED and currently has an "absent" value.
String getStringValue() and static String getStringValue(V) deliver the value as it would appear in an XML standard text serialization.
String getTypeString() delivers the text of the type declaration in the DTD.
boolean isOptional() / isFixed() / isRequired() delivers whether the value has been declared in the DTD as #IMPLIED / #FIXED / #REQUIRED.
boolean isSpecified() delivers whether attribute has been set explicitly, when creating the containing element instance or afterwards. (For details see Section 2.7.4 below.)
V getDefaultValue() delivers the default value as declared in the DTD. The value null represents "attribute value is absent" and corresponds to the declaration "#IMPLIED".

(The V getValue(), setValue(V) and few other methods can be realized directly in the generated code, or inherited from the corresponding base classes from tdom.runtime, so please have a look to that api doc and into the generated sources.)

The type expression V depends on the attributes "type" as it appears in the DTD:

DTD attribute "type"                                 | realizing class derived from | Java type <V>
NMTOKEN                                              | NmTokenAttribute             | String
Id                                                   | IdAttribute                  | String
IdRef                                                | IdRefAttribute               | String
CData                                                | CDataAttribute               | String
Enumeration                                          | EnumerationAttribute         | Enum<V>, dedicated type, generated for (and locally to) this attribute
Enumeration with integer values only (Section 2.7.3) | SelectedIntegersAttribute    | Integer
NMTOKENS                                             | NmTokensAttribute            | List<String>
IdRefs                                               | IdRefsAttribute              | List<String>

(For the inheritance relation between the different attribute classes see the tdom runtime class tree.)

In case of enumeration type attributes, the value must be one item from the enumeration class. This is a public inner class of the generated Attribute's class and always has the name "Value". When "s" is the name of one particular enumeration value as written in the DTD, then the corresponding enumeration item has the name

   "Value_" + (s.replaceAll("[-.:]", "_"))

This translation is the same as described above, see Section 2.6. (Please note that name clashes may result!-)

The enumeration items offer a method "String getStringValue()", which delivers the original DTD wording; the EnumerationAttribute's class offers a method "Map<String,Value> getValueMap()" for the inverse translation.
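The name translation for enumeration items, and the possibility of name clashes, can be tried out with the following sketch. The class EnumNameSketch and its methods are hypothetical helpers; only the replacement rule itself is taken from the text above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class EnumNameSketch {
    /** Applies the rule quoted above: '-', '.' and ':' become '_'. */
    static String javaName(String dtdToken) {
        return "Value_" + dtdToken.replaceAll("[-.:]", "_");
    }

    /** Maps each Java name to its DTD token; throws on the first clash. */
    static Map<String, String> mangleAll(String... dtdTokens) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String t : dtdTokens) {
            String previous = result.putIfAbsent(javaName(t), t);
            if (previous != null)
                throw new IllegalArgumentException("clash: " + previous + " / " + t);
        }
        return result;
    }
}
```

For instance, the DTD tokens "a-b" and "a.b" both translate to "Value_a_b", so an enumeration containing both of them produces such a clash.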

^ToC 2.7.2 Checking Value Assignments

The setValue() method executes validity tests on its parameters as follows:

  1. For an EnumerationAttribute there is a typed setValue(V) method, which only checks for null values. There is additionally a setValue(String) method, which fails if the conversion into V fails.
  2. For a SelectedIntegersAttribute the typed setValue(int) method checks for allowed integer values. There is additionally a setValue(String) method, which fails if the conversion fails.
  3. For NMTOKEN, NMTOKENS, IDREF, IDREFS and ID special syntax rules must be matched, see checkNmToken(..).
  4. Thus CDATA attributes are the only ones which accept an empty string value.
  5. For each attribute declared as "#FIXED" only that one declared character sequence is a valid attribute value. (Nevertheless, setting the value explicitly is not superfluous, because it changes the "isSet" flag, see Section 2.7.4.)
  6. Only for attributes which are declared as "#IMPLIED" the value null may be supplied, representing "not-present".

All these violations throw a TdomAttributeSyntaxException (or a subclass thereof). See Section 3.1 for the hierarchy of TdomException. So forbidden null values and violated #FIXED attributes are treated as special cases of failed syntax checks.
Therefore all generated setValue(V) methods are declared with "throws TdomAttributeSyntaxException", except the setValue(V) of an EnumerationAttribute and of a CDataAttribute, if and only if they have the default value #IMPLIED, which additionally allows null. Compare e.g. the setValue(V) methods for Attr_http_equiv, Attr_lang (common attribute, inherited method) and Attr_content (inherited method) in Element_meta of the XHTML tdom.
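Two of the checks listed above, the null check and the #FIXED check, can be sketched for a CDATA-like attribute as follows. The class CheckedAttrSketch, its nested SyntaxException and Mode, and the helper tryValue are all hypothetical; the Tdom runtime uses TdomAttributeSyntaxException and its own class hierarchy.

```java
/** Hypothetical stand-in for a CDATA-like attribute with declaration checks. */
final class CheckedAttrSketch {
    /** Checked exception, playing the role of a syntax exception. */
    static class SyntaxException extends Exception {
        SyntaxException(String msg) { super(msg); }
    }
    enum Mode { IMPLIED, REQUIRED, FIXED, DEFAULTED }

    private final Mode mode;
    private final String declared;   // the #FIXED or default value, else null
    private String value;

    CheckedAttrSketch(Mode mode, String declared) {
        this.mode = mode;
        this.declared = declared;
        this.value = declared;
    }
    String getValue() { return value; }

    void setValue(String v) throws SyntaxException {
        if (v == null && mode != Mode.IMPLIED)
            throw new SyntaxException("null only allowed for #IMPLIED");
        if (mode == Mode.FIXED && !declared.equals(v))
            throw new SyntaxException("#FIXED value is \"" + declared + "\"");
        value = v;   // the empty string is accepted, as for CDATA
    }

    /** Convenience for callers that prefer a boolean over an exception. */
    boolean tryValue(String v) {
        try { setValue(v); return true; } catch (SyntaxException e) { return false; }
    }
}
```

Treating both violations as one exception type mirrors the statement above that forbidden null values and violated #FIXED declarations are special cases of failed syntax checks.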

^ToC 2.7.3 Enumeration Attributes with Integer Tokens Only

In many standard DTDs one frequently finds enumeration attributes which only contain selected integer values. The standard implementation would require a chain of three redundant conversions when executing calculations (text representation to enumeration value to text representation to integer value). Therefore the attribute with name "att1" in element "ele1" and the shared attribute "att2" (see Section 2.7.5) are treated specially when declared in the DTD by the tdom processing instruction

  <?tdom selectedIntegers ele1@att1  @att2  ?>

This leads to the generation of a subclass of SelectedIntegersAttribute. This class implements storage, retrieval and validity check directly on Java "int" values.

^ToC 2.7.4 Unsetting Attributes

Attributes which are declared with a default value (including the special value "#IMPLIED") can be "not present" in the textual representation of an XML document, or similarly, when creating the document by constructor calls.
In Tdom , this fact is memorized by an internal flag. The dedicated method

  class Attr_[XY] {
   ...
   public void clearValue(){..}
}

clears that flag and sets the value back to the default value. Afterwards, the attribute will not be written out when serializing the document model. This can be changed by executing e.g.

  el.getAttr_XY().clearValue();
  el.getAttr_XY().setValue(el.readAttr_XY().getValue());
-- or --
  el.getAttr_XY().setValue(el.readAttr_XY().getDefaultValue());

After this, the value is also the default value, but the attribute is considered "set" and will be written out (unless the value is null in case of #IMPLIED).
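The effect of the internal flag on serialization can be sketched as follows. The class FlaggedAttrSketch and its serialize() method are hypothetical, and the sketch does not escape special characters in the value.

```java
/** Hypothetical stand-in showing the "specified" flag described above. */
final class FlaggedAttrSketch {
    private final String name;
    private final String defaultValue;  // null models #IMPLIED
    private String value;
    private boolean specified = false;

    FlaggedAttrSketch(String name, String defaultValue) {
        this.name = name;
        this.defaultValue = defaultValue;
        this.value = defaultValue;
    }
    String getValue() { return value; }
    String getDefaultValue() { return defaultValue; }
    boolean isSpecified() { return specified; }

    /** Every explicit assignment marks the attribute as specified. */
    void setValue(String v) { value = v; specified = true; }

    /** Back to the default value; the attribute is no longer written out. */
    void clearValue() { value = defaultValue; specified = false; }

    /** Text as it would appear in a start tag; empty if nothing is written. */
    String serialize() {
        return (specified && value != null) ? " " + name + "=\"" + value + "\"" : "";
    }
}
```

Assigning the default value explicitly therefore changes the serialized text although getValue() returns the same value before and after.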

^ToC 2.7.5 Common Classes for Common Attributes

So far, no two different attributes are ever assignment compatible, even if they carry the same name, type and default value. This corresponds to the definitions of DTDs, which do not impose any semantics on attributes, besides the mere string value.

To impose an abstraction on attributes, Tdom understands process instructions like ...

<?tdom attribute
   attA     CDATA       #IMPLIED
   attB     (B1|B2)     "B1" 
?>

The meaning is, that on top-level of the generated package (i.e. not as a part of any ELEMENTs code) a stand-alone attribute class is generated. This class is named and behaves like the "local" attribute classes described above.

A local attribute class is still created in any Element's class, as described above. But for each attribute which matches a "global" attribute, this class (1) uses the global class as its base class, (2) inherits its methods and most of its fields, and (3) will be recognized by the more abstract matching methods of the generated Matcher class. This last feature, as described in Section 4.2, is the main purpose of this abstraction.

In many DTDs from practical use, common attributes are declared in ENTITYs, which are included in different ATTLISTs. In this case the effect of creating common base classes can be achieved for all attributes defined in such an entity by the process instruction ...

<?tdom attribute-entity entA entB entC ?>

In this case all entities "entA", "entB", "entC" must expand to complete attribute declarations (one or more), and the process instruction is processed exactly as explained above, after expanding these entities.

Some remarks are practically important:

First: Currently only the "name" of the attributes is used for name mangling, so there can be only one common attribute class with a certain name. Tdom behaves like DTD (as ugly as it is !-) insofar as the first definition wins over any subsequent attempt to re-define. A warning is issued in this case.

Second: A common attribute is only recognized if all three dimensions (name, "type", and initial value) are exactly identical. So the following two declarations do not match:

<?tdom  attribute
   attA     CDATA       #IMPLIED
?>
<!ATTLIST E
   attA     CDATA       #REQUIRED
>

The Tdom tool will issue a hint, whenever a common attribute is not recognized due to such minimal differences.

Third: The entity names themselves and the grouping of the attributes are in no way reflected; they are simply "unpacked" to a list of attributes, which are independently compiled as common attributes, as described.

^ToC 2.7.6 Attributes with Attribute Types "ID", "IDREF" and "IDREFS"

Attributes of "type" ID, IDREF and IDREFS are special because they are intended to model references between sub-trees of a document. An XML document is only "valid" if (a) there is at most one (1) attribute declaration of "type" ID in each element's attribute list, and (b) the string value of every instance of an IDREF attribute, and each single "NAME" token in the value of an IDREFS attribute, corresponds to exactly one (1) instance of an ID attribute carrying the same value (see [xml], "Validity constraint: One ID per Element Type" and "Validity constraint: IDREF").
Of course these conditions are not really sensible.
E.g. for changing the value of an ID attribute without violating these rules, first all referring IDREF/IDREFS tokens must be deleted. Then the attribute's value may be changed, and only after this may all referring attributes be visited a second time, to set them to the new value.

Therefore Tdom does not check ID/IDREF/IDREFS attributes by default. Instead, this can be done explicitly, when a model is completely constructed, by the following methods:

// in the generated package:

  class Document_<DOC> {

    /** @return  the id-based map string->Element
      * @throws  tdom.runtime.HomonymousIdException if one(1) id is used for 
      * two(2) different elements
      * @throws tdom.runtime.SynonymousIdException if two(2) ids are used for
      * one(1) element
      */
    public ElementDictionary<Element> createDictionary() {
    }
  }

// in package tdom.runtime : 

   /** Indicates the presence of an ID attribute. **/
   public interface Identifiable {
     /** @return  the current id, but does not supply an automatically generated one.*/
     String @Opt getId() ;
   }

  class ElementDictionary<E>{
    /** @return  the element with the given id, or null.*/
    public @Opt E get(String s){
    }
  }

  class IdRefAttribute{
    /** @return the element with the given id, or null.*/
    public @Opt E getValue(ElementDictionary<E>){
    }
  }
  class IdRefsAttribute{
     /** @return a list of all elements with the ids, including "null" for failures.*/
    public java.util.List<@Opt E> getValues(ElementDictionary<E>){
    }
  }

There are some more useful methods for handling the mappings explicitly. Please refer to the api doc of the involved runtime classes.

Please note that "SynonymousIdException" cannot occur when only using generated code for filling the map: According to [xml], "Validity constraint: One ID per Element Type", each element definition may have only one attribute of "ID" flavour, and this is checked statically when translating the DTD and can cause an error message.

The constraint [xml], "Validity constraint: IDREF / second phrase" is not checked at all automatically: every return value ==null can be treated as an error by the caller explicitly, if appropriate.
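The explicit checking step can be sketched with plain maps. The class IdDictionarySketch, its nested Node, and the use of IllegalStateException are hypothetical simplifications; the Tdom runtime uses ElementDictionary and HomonymousIdException instead.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-ins; the real runtime classes have a richer API. */
final class IdDictionarySketch {
    static final class Node {
        final String id;              // null if the element has no ID value
        final List<String> idrefs;    // tokens of an IDREFS attribute
        Node(String id, List<String> idrefs) { this.id = id; this.idrefs = idrefs; }
    }

    /** Builds id -> node; fails if one id is used for two different nodes. */
    static Map<String, Node> createDictionary(List<Node> allNodes) {
        Map<String, Node> dict = new HashMap<>();
        for (Node n : allNodes) {
            if (n.id == null) continue;
            Node previous = dict.put(n.id, n);
            if (previous != null && previous != n)
                throw new IllegalStateException("homonymous id: " + n.id);
        }
        return dict;
    }

    /** Resolves every token; unresolved references yield null entries. */
    static List<Node> resolve(Node from, Map<String, Node> dict) {
        List<Node> targets = new ArrayList<>();
        for (String ref : from.idrefs) targets.add(dict.get(ref));
        return targets;
    }
}
```

As in the description above, a dangling reference is not an error in itself: the caller inspects the result list and decides whether a null entry must be reported.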

^ToC 2.8 Auxiliary methods for numeric contents of elements and attributes

In most DTDs, the contents of many attributes and of PCDATA-only elements is employed to encode numeric contents, i.e. integer or floating point numbers. For easy decoding of these data, the class TypedNode offers some overloaded methods.

Their names are "asInt(..)", "asBigInteger(..)", "asDouble(..)", "asHexInt(..)", etc. They behave robustly and deliver null in case of null input or conversion error. The overloading allows them to be applied to the appropriate attribute types and to element contents in a uniform way. (Hence their location as static methods in "TypedNode"!)

For details please refer to the api doc.
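The robust behaviour can be sketched on plain strings. The class NumericSketch is hypothetical and covers only String arguments; whether the Tdom methods trim surrounding white space, as this sketch does, is an assumption.

```java
import java.math.BigInteger;

/** Hypothetical helpers in the spirit of asInt(..) etc.; not the Tdom API. */
final class NumericSketch {
    /** Decimal integer, or null for null input or malformed text. */
    static Integer asInt(String s) {
        if (s == null) return null;
        try { return Integer.valueOf(s.trim()); }
        catch (NumberFormatException e) { return null; }
    }
    /** Hexadecimal integer without prefix, or null on failure. */
    static Integer asHexInt(String s) {
        if (s == null) return null;
        try { return Integer.valueOf(s.trim(), 16); }
        catch (NumberFormatException e) { return null; }
    }
    /** Arbitrary-size integer, or null on failure. */
    static BigInteger asBigInteger(String s) {
        if (s == null) return null;
        try { return new BigInteger(s.trim()); }
        catch (NumberFormatException e) { return null; }
    }
}
```

Returning null instead of throwing lets the caller chain these conversions directly behind an attribute or element access which itself may deliver null.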

^ToC 2.9 Additional Documentation Text

The generated Java source contains automatically generated API documentation to explain the fundamental technical aspects of handling the generated classes and methods. But of course, Tdom knows nothing about their intended meaning.
Therefore it is possible to attach explicit author's documentation text to

  1. the model as a whole
  2. each element class
  3. each single attribute of an element,
  4. each "common" attribute, i.e. which is declared unrelated to a particular element, as described in Section 2.7.5.

The means are process instructions in the DTD, as described in Section 2.1.1.

For the above-mentioned categories this looks like ...

<?tdom doc -   documentation text for the model as a whole.  ?>
<?tdom doc oneElement   documentation text for this particular element 
 called "oneElement" ?>
<?tdom doc oneElement@oneAttribute   documentation text for this particular
 attribute of this particular element ?>
<?tdom doc @oneAttribute   documentation text for this particular 
 abstract/shared attribute ?>

At the end of the Tdom run, all PIs of this kind which have not been processed, e.g. because they address a non-existing target due to a typo, are reported by one warning message each.

More than one entry with the same documentation target may occur; their text will be concatenated in source order.

The treatment of these "semantic" or "author's" documentation texts is similar to that in umod : During the normal rendering process (by javadoc) the resulting "doc comments" are rendered specially: included in "<div class='bandmUser'>" tags. They appear in green color, if the stylesheet "bandmApiDoc.css" is appended to the generated stylesheet. This shall clarify the difference between the mere technical documentation and the semantic level of meaning.

Going even further, one can employ
javadoc ... -doclet eu.bandm.tools.tdom.doclet.TdomUserDoc

In this case most technical fields and methods are totally omitted, and only those with the annotation "@User" are included in the documentation. The result is much leaner and focussed and may be more useful when programming "around" the tdom model.

^ToC 2.10 Ethereals: Comments and Processing Instructions as Second Class Inhabitants

It is not self-evident that Processing Instructions and Comments are part of a "model" in the narrow sense, and originally Tdom did not support them.

Anyhow, the requirements and application contexts are various, so it may be sensible to include them. We introduced them as "second class" inhabitants, which have to be attached to a "substantial" inhabitant for being stored and retrieved.

Every Element and every PCData fragment as a "Substantial" has two "decorative" sequences of "Ethereals" (see the symbolic class tree in Section 2.4 !), one "preceding" and one "following".

Furthermore, every Element and Document has a "leading" and a "trailing" sequence, which can be used if no Substantials are contained. The access methods are

  List<TypedEthereal> TypedDocument.[get/read]LeadingEthereals()
  List<TypedEthereal> TypedDocument.[get/read]TrailingEthereals()
  List<TypedEthereal> TypedElement.[get/read]LeadingEthereals()
  List<TypedEthereal> TypedElement.[get/read]TrailingEthereals()

  List<TypedEthereal> TypedSubstantial.[get/read]PrecedingEthereals()
  List<TypedEthereal> TypedSubstantial.[get/read]FollowingEthereals()

The ..read.. variant delivers a read-only-list (which can be shared iff empty), the ..get.. variant delivers a list the user can modify.

There are n+1 possibilities to store a sequence of n Ethereals w.r.t. the two neighbouring Substantials:

  <el>                  |  IN          OR            OR              OR
     <!-- comment -->   |    el.leading el.leading      el.leading      subel.preceding
     <?target text ?>   |    el.leading el.leading      subel.preceding subel.preceding
     <!-- comment -->   |    el.leading subel.preceding subel.preceding subel.preceding
     <subel/>           |
     <!-- comment -->   |     IN subel.following OR subel2.preceding 
     <subel2/>          |
  </el>                 |

Where an Ethereal is stored does not have any meaning a priori. A parser is allowed to choose any solution arbitrarily. Of course, on the next conceptual level the user may define a "meta-syntax" of relations and meanings. (E.g., is a comment related to the follower, or to the element just opened? Is a comment related to a processing instruction?) In such a case the model must be traversed and these relations constructed explicitly, implemented by additional data.
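The count of n+1 storage possibilities follows from choosing one cut position in the sequence of Ethereals, as this sketch enumerates. The class EtherealSplitSketch and its method splits are hypothetical and work on arbitrary lists, not on TypedEthereal objects.

```java
import java.util.ArrayList;
import java.util.List;

final class EtherealSplitSketch {
    /**
     * Lists all ways to split n ethereals between the "left" storage
     * (e.g. el.leading or subel.following) and the "right" storage
     * (e.g. subel.preceding), keeping document order.
     * Each result entry is a pair: index 0 = left part, index 1 = right part.
     */
    static <T> List<List<List<T>>> splits(List<T> ethereals) {
        List<List<List<T>>> result = new ArrayList<>();
        for (int cut = 0; cut <= ethereals.size(); cut++) {
            List<List<T>> pair = new ArrayList<>();
            pair.add(new ArrayList<>(ethereals.subList(0, cut)));
            pair.add(new ArrayList<>(ethereals.subList(cut, ethereals.size())));
            result.add(pair);
        }
        return result;
    }
}
```

For the three Ethereals preceding <subel/> in the example above, this yields the four columns of the table, though not necessarily in the same order.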

^ToC 3 Construction of Tdom models

^ToC 3.1 Error Cases and Exception Hierarchy

Corresponding to the possible error conditions when constructing a Tdom instance, there is a hierarchy of checked exceptions. (In Java terminology, "checked" means that they must be declared explicitly.)

All these classes memorize the information about the offending value, attribute, element context and location, as far as known. There are further subtypes of these classes for particular cases, see the api doc.

To narrow the scope of the necessary exception declarations, there is the class TypedAttribute.SafeValues with its only instance, which is used as a flag to distinguish between safe and unsafe methods and constructors, i.e. those which do not throw and those which do throw a TdomAttributeException. For details see Section 3.2.5.

^ToC 3.2 Explicit Constructor Application for Elements and Sub-Content

Two central design issues of Tdom are (a) that all existing models at every instant of their life-time are type-correct sub-trees w.r.t. the corresponding DTD, and (b) that this is checked statically, at compile time, as far as possible.
Therefore most of the generated public constructors always require complete and type-correct contents as their argument.
As a consequence, a larger Tdom model must be constructed bottom-up, in a term-like fashion. This (at a first glance possibly annoying) strict discipline implies especially that Tdom models are always finite by construction.

Please note that constructing a large Tdom model by explicit constructor calls is a tedious task. Explicit constructor calls only make sense as the back-end of some automated translation procedures.
For constructing a Tdom model from a pre-existent XML text file one can use the SAX interface or the w3c Dom interface. These are described in Section 3.3 below.

^ToC 3.2.1 Creating Elements with Structured Contents, Statically Typed

The basic structural element, to which the generated Java constructors correspond, is again the sequence. So constructors are generated for top-level content regular expressions, considered as a sequence, and for all sub-sequences and alternatives, which are sequences again. Since the Java method signature corresponds to the DTD content model, no TdomContentException is thrown by the invocation of such a "statically typed" constructor. (That no TdomAttributeSyntaxException is thrown can be selected by supplying safeValues.) All variants are illustrated by the following example:

  <!ELEMENT NC (TA, (TB | TC, TD*), (TE, TF?)*, (TG)? )>

...is translated into ...

public class Element_NC extends Element {
  public Element_NC (SafeValues s,
                     Element_TA x1,
                     Element_NC.Choice_1 x2,
                     Element_NC.Seq_1[] x3,
                     Element_TG x4) {...} 

  public Element_NC (Element_TA x1,
                     Element_NC.Choice_1 x2,
                     Element_NC.Seq_1[] x3,
                     Element_TG x4) throws TdomAttributeException {...} 

  public abstract static class Choice_1 
            extends eu.bandm.tools.tdom.TypedChoice {...}

  public static class Choice_1_Alt_1 extends Choice_1 {
    ...
    public Choice_1_Alt_1(Element_TB x1){...}
    ...
  }

  public static class Choice_1_Alt_2 extends Choice_1 {
    ...
    public Choice_1_Alt_2(Element_TC x1, Element_TD[] x2){...}
    // please notice this array reflecting the "*" ^^
    ...
  }

  public abstract static class Seq_1 
            extends eu.bandm.tools.tdom.TypedSubTree{
    ...
    public Seq_1 (Element_TE x1, Element_TF x2) {...}
    ...
  }
}

This allows a constructor call like ...

  new Element_NC( aTA, 
                  new Element_NC.Choice_1_Alt_1(aTB),
                  new Element_NC.Seq_1[0],
                  (Element_TG)null ) ;   // throws TdomAttributeSyntaxException
  -- or -- 
  new Element_NC( safeValues,
                  aTA, 
                  new Element_NC.Choice_1_Alt_1(aTB),
                  new Element_NC.Seq_1[0],
                  (Element_TG)null ) ; 
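
The shape of such statically typed constructors can be tried out with the following minimal sketch. It is hand-written for the reduced content model "(TA, (TB | (TC, TD*)), (TG)?)"; all class names (NcSketch, Choice1, Choice1Alt1, etc.) are hypothetical stand-ins and not Tdom-generated code.

```java
import java.util.Objects;

// Sketch only: hand-written stand-in for the content model
//   (TA, (TB | (TC, TD*)), (TG)?)
// None of these classes is generated by Tdom.
class NcSketch {
    static final class TA {}
    static final class TB {}
    static final class TC {}
    static final class TD {}
    static final class TG {}

    // One abstract class per choice, one concrete subclass per alternative.
    abstract static class Choice1 {
        abstract int getAltIndex();
    }

    static final class Choice1Alt1 extends Choice1 {
        final TB tb;
        Choice1Alt1(TB tb) { this.tb = Objects.requireNonNull(tb); }
        @Override int getAltIndex() { return 0; }
    }

    static final class Choice1Alt2 extends Choice1 {
        final TC tc;
        final TD[] tds;                       // the array reflects the "*"
        Choice1Alt2(TC tc, TD... tds) {
            this.tc = Objects.requireNonNull(tc);
            this.tds = tds.clone();
        }
        @Override int getAltIndex() { return 1; }
    }

    final TA ta;
    final Choice1 choice;
    final TG tg;                              // null encodes the absent "(TG)?"

    // The signature mirrors the content model: a wrong order or a wrong
    // element class is rejected by the compiler, not at run time.
    NcSketch(TA ta, Choice1 choice, TG tg) {
        this.ta = Objects.requireNonNull(ta);
        this.choice = Objects.requireNonNull(choice);
        this.tg = tg;
    }

    String describe() {
        return "alt=" + choice.getAltIndex()
             + " tg=" + (tg == null ? "absent" : "present");
    }

    public static void main(String[] args) {
        NcSketch nc = new NcSketch(new TA(),
                                   new Choice1Alt2(new TC(), new TD(), new TD()),
                                   null);
        System.out.println(nc.describe());
    }
}
```

Since every field is final and all contents must be supplied to the constructor, such a value can only be built bottom-up.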

For convenience an array parameter at the last position is declared as a "vararg", so that the explicit construction of an intermediate array for this position is not required (though still possible!):

  <!ELEMENT NC (A*, B*)>

--> leads to -->

  new Element_NC( Element_A[] elems_A_1, Element_B... elems_B_1)

^ToC 3.2.2 Creating Elements with Structured Contents, Dynamically Typed, by Semi-Parser

Often it is more convenient to simply enumerate the sequence of Java objects which shall make up the contents of a newly created element. In this case, a simplified parsing process can be applied to the classes of these objects. We call it "semi-parser", because it parses only one layer of content and does not descend into the contents of sub-elements, as the full-fledged parsers described in Section 3.3 do.

For this purpose there is an untyped constructor ...

   new Element_NC (Element... elements) 
        throws TdomContentException, TdomAttributeSyntaxException  {..}

("Element extends tdom.runtime.TypedElement" is the top-level element class generated specifically for this particular model, so the constructor is "not completely" untyped !-)

A TdomContentException is thrown whenever the supplied sequence of Java objects cannot be mapped to the content model.
(Since exceptions must be caught or declared anyhow, there is no variant with "safeValues" preventing TdomAttributeSyntaxExceptions.)
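
The idea of such a one-layer "semi-parser" can be sketched as follows. This is only an illustration for the hypothetical content model "(A, B*, C?)"; the classes SemiParserSketch, Node and ContentException are assumptions made for this example and do not show how Tdom implements it.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: a hand-written one-layer parser for the content model
//   (A, B*, C?)
// It inspects the classes of the supplied objects, never their contents.
class SemiParserSketch {
    static class Node {}
    static final class A extends Node {}
    static final class B extends Node {}
    static final class C extends Node {}

    // Stand-in for a checked content exception.
    static final class ContentException extends Exception {
        ContentException(String msg) { super(msg); }
    }

    final A a;
    final List<B> bs;
    final C c;                                // null if the optional C is absent

    SemiParserSketch(Node... items) throws ContentException {
        int i = 0;
        if (i < items.length && items[i] instanceof A) {
            a = (A) items[i++];
        } else {
            throw new ContentException("expected A at position " + i);
        }
        List<B> collected = new ArrayList<>();
        while (i < items.length && items[i] instanceof B) {
            collected.add((B) items[i++]);
        }
        bs = collected;
        c = (i < items.length && items[i] instanceof C) ? (C) items[i++] : null;
        if (i < items.length) {
            throw new ContentException("unexpected item at position " + i);
        }
    }

    // Helper which reports the outcome as a string instead of throwing.
    static String tryBuild(Node... items) {
        try {
            SemiParserSketch s = new SemiParserSketch(items);
            return "ok: " + s.bs.size() + " B, C "
                 + (s.c != null ? "present" : "absent");
        } catch (ContentException e) {
            return "rejected: " + e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(tryBuild(new A(), new B(), new B()));
        System.out.println(tryBuild(new A(), new C(), new B()));
    }
}
```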

Since "vararg" arguments can be represented by an array, alternatives for content creation can also be defined by a pure expression, using the concatenation operations defined in <METATOOLS>/eu/bandm/tools/ops/Arrays , as used in our "Dtd to Html renderer" <METATOOLS>/eu/bandm/tools/dtm/HtmlRenderer according to this scheme:

import eu.bandm.tools.ops.Arrays ; 

// ...

    final Element_html el_html = new Element_html
      (new Element_head
       (new Element_head.Choice_1[0],
	new Element_head.Choice_2_Alt_1
	(new Element_title("windowTitle")),
        new Element_head.Choice_2[0]),
       new Element_body
       (Arrays.append
	((htmlIsDynamic)
	 ?new Element_block_content[]
	  {new Element_noscript
	   (new Element_div
	    (new Element_div.Content
	     ("<!-- please switch on JAVA SCRIPT for dynamic behaviour! -->"))
	     {@Override protected void initAttrs(){
	       getAttr_class().setValue(class_alert);
	     }})}
	 :new Element_block_content[0],

	 new Element_block_content[]
	 {new Element_pre(preItems.toArray
			  (new Element_pre.Content[preItems.size()])),
	  new Element_hr(),
	  makeFooter(basicFileName, 
		     "http://bandm.eu/metatools/docs/usage/dtd.html#txt_dtd_tool")
	 }
	 )));
// ...

Element_p makeFooter(String a, String b){...}

List<Element_pre.Content> preItems = ...

The header part of the created html element is constructed statically typed. The body part is dynamically typed, using Arrays.append and function calls to write case distinctions in a fully compositional way.

(Please note that in our xhtml model "block_content" is an abstraction of different Element classes, controlled by a content model entity, as described in Section 2.6.1.)

^ToC 3.2.3 Creating Elements with Mixed or Pure PCData Contents, Statically Typed

In case of mixed content, e.g. a declaration like ...

  <!ELEMENT NM (#PCDATA | TA |TB)* >

...the generated constructors are ...

   public Element_NM (Element_NM.Content... content) throws TdomAttributeSyntaxException {...} 
   public Element_NM (SafeValues, Element_NM.Content... content){...}
   public Element_NM (String content) throws TdomAttributeSyntaxException {...}
   public Element_NM (SafeValues, String content){...}

The "SafeValues" flag has the same role as described above.
The last two constructors are short-cuts for the case of pure character content.
The canonical constructors are the first two, where all components have to be wrapped into the correct content class, like in ...

  new Element_NM (new Element_NM.Content("characters with embedded TA "),
                  new Element_NM.Content(aTA),
                  new Element_NM.Content(new TypedPCData(" followed by a TB ")),
                  new Element_NM.Content(aTB) )

The first argument in this example is possible because of the short-cut constructor ...

  public class TypedElement.MixedContent {
    ...
    public MixedContent (String data){ this(new TypedPCData(data)); }
    ...
  }

Of course, instead of "vararg" parameters you can always supply an array, e.g. one delivered by Collection.toArray(T[]).
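
The interplay of the content wrapper, the String short-cut and an array argument can be sketched as follows. The classes MixedSketch, Content and TA are hypothetical stand-ins written for this illustration; they only mimic the constructor shapes shown above.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: stand-in for an element with mixed content (#PCDATA | TA)*.
class MixedSketch {
    static final class TA {}

    // Wrapper class: each component is either text or an element.
    static final class Content {
        final String text;
        final TA element;
        Content(String text) { this.text = text; this.element = null; }
        Content(TA element)  { this.text = null; this.element = element; }
    }

    final Content[] content;

    MixedSketch(Content... content) { this.content = content.clone(); }

    // Short-cut for pure character content.
    MixedSketch(String text) { this(new Content(text)); }

    // Concatenates all character data, skipping embedded elements.
    String getPCData() {
        StringBuilder sb = new StringBuilder();
        for (Content c : content) {
            if (c.text != null) sb.append(c.text);
        }
        return sb.toString();
    }

    static String demo() {
        List<Content> items = new ArrayList<>();
        items.add(new Content("characters with embedded TA "));
        items.add(new Content(new TA()));
        items.add(new Content("followed by text"));
        // An array can always be supplied in place of the vararg list.
        MixedSketch viaArray = new MixedSketch(items.toArray(new Content[0]));
        MixedSketch viaShortcut = new MixedSketch("pure text");
        return viaArray.getPCData() + "|" + viaShortcut.getPCData()
             + "|" + viaArray.content.length;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```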

^ToC 3.2.4 Creating Elements with Mixed or Pure PCData Contents, Dynamically Typed

The dynamically typed constructor for elements with mixed contents has the signature

   new Element_NM (Object ...) throws TdomContentException, TdomAttributeSyntaxException ;

It behaves like all other semi-parsers, as described in Section 3.2.2: It throws a TdomContentException whenever the supplied sequence of Java objects cannot be mapped to the content model.

The techniques for constructing the argument list as described for structured content in Section 3.2.2 can be used accordingly.

^ToC 3.2.5 Creating Elements with Attributes

When an element instance is constructed which has #REQUIRED attributes, then these must be set by the caller and checked by the constructor code for validity, before the constructor is allowed to return normally, meaning success. This is required by the Tdom philosophy of only producing type correct instances.

Setting attributes in a constructor call is done by defining an anonymous inline class derived from the real element class. By overriding the methods public void initAttrs() throws TdomAttributeSyntaxException and public void initAttrsSafe() the caller can set arbitrary attribute values.

No TdomAttributeSyntaxException can leave the second variant, and only this is called by the "safe constructor" (= the constructor with the safeValues flag). So basically three variants are possible for construction:

   new Element_e (a, b, c)
     { @Override public void initAttrs() throws TdomAttributeSyntaxException {
        getAttr_a1().setValue(v1);
       }
       @Override public void initAttrsSafe() {
         getAttr_a2().setValue(v2);
       }
     };  
-- or -- 
   new Element_e (safeValues, a, b, c)
     { @Override public void initAttrsSafe() {
         getAttr_a2().setValue(v2);
       }
     };  
-- or -- 
-- if there are NO unsafe attributes: 
   new Element_e (a, b, c)
     { @Override public void initAttrs() {
         getAttr_a2().setValue(v2);
       }
     };  

Of course the safe constructor can (and should) always be used if no attribute at all is set.

ATTENTION: The safe constructor calls only initAttrsSafe(), but not initAttrs(). Putting initialization code into the latter while meaning the former is a hard-to-find error which cannot be detected statically.
If there are no unsafe attribute values at all, then only initAttrs() is generated (throwing nothing). This can be checked at compile time by the @Override annotation.
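
The pitfall can be reproduced with a small self-contained sketch. The classes below (AttrInitSketch, ElementE, SyntaxException, SafeValues) are hypothetical stand-ins, and it is an assumption of this sketch that the unsafe constructor runs both initialization hooks.

```java
// Sketch only: hypothetical stand-ins, not the Tdom runtime classes.
class AttrInitSketch {
    static final class SyntaxException extends Exception {
        SyntaxException(String msg) { super(msg); }
    }

    enum SafeValues { safeValues }            // flag selecting the safe constructor

    static class ElementE {
        String a1;                            // "unsafe" attribute: setter may throw
        String a2;                            // "safe" attribute: setter never throws

        // Unsafe constructor: assumed here to run both hooks.
        ElementE() throws SyntaxException { initAttrs(); initAttrsSafe(); }

        // Safe constructor: runs ONLY initAttrsSafe().
        ElementE(SafeValues flag) { initAttrsSafe(); }

        protected void initAttrs() throws SyntaxException {}
        protected void initAttrsSafe() {}

        void setA1(String v) throws SyntaxException {
            if (v.isEmpty()) throw new SyntaxException("a1 must not be empty");
            a1 = v;
        }
        void setA2(String v) { a2 = v; }
    }

    static String demo() {
        ElementE ok = new ElementE(SafeValues.safeValues) {
            @Override protected void initAttrsSafe() { setA2("v2"); }
        };
        // Pitfall: this override compiles, but the safe constructor never calls it.
        ElementE pitfall = new ElementE(SafeValues.safeValues) {
            @Override protected void initAttrs() throws SyntaxException { setA1("v1"); }
        };
        return "a2=" + ok.a2 + " a1=" + pitfall.a1;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```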

To set a required attribute value to a null value by an explicit java call yields an attribute syntax error, but to forget to set it at all yields a missing attribute error.

If a TdomAttributeSyntaxException is possible, it must be caught and treated locally in order to use the safe variant, like in

   new Element_e (safeValues, a, b, c)
     { @Override public void initAttrsSafe() {
         try {
           getAttr_a1().setValue("myconst");
         } catch (final TdomAttributeSyntaxException e){
            throw new ImpossibleError(" cannot happen, 'myconst' is a valid NMTOKEN.");
         }
       }
     };  

For convenience there is the method

import static eu.bandm.tools.tdom.runtime.TypedAttribute.assertSetAttrValid ; 
   new Element_e (safeValues, a, b, c)
     { @Override public void initAttrsSafe() {
        assertSetAttrValid(getAttr_a1(), "myconst");
       }
     }; 

which wraps the TdomAttributeSyntaxException, which is assumed to never happen, into an unchecked AssertionError.

After the execution of these user-defined initialization methods, the constructor checks for completeness of required attributes.
If an attribute declared as #REQUIRED is not set explicitly, then a TdomAttributeMissingException is thrown. All constructors of elements which have such an attribute are thus declared like

package xhtml_1_0 ; 
  public 
  Element_script() 
    throws TdomAttributeMissingException, TdomAttributeSyntaxException {
       ...}

The definedness of an attribute by a user-defined initAttrs() method cannot be checked statically, therefore this exception must be caught somewhere. Because a "clean functional style" of programming leads to deeply nested constructor calls, the catch clause would be far away from its cause. Therefore the class TdomAttributeMissingSupplier provides a wrapper method which can be used to translate all TdomAttributeMissingExceptions into an unchecked AssertionError when they are known not to happen. This is a fragment from "dtm/HtmlRenderer", which builds a complete Html header element by one single expression:

import static eu.bandm.tools.tdom.runtime.TdomAttributeMissingSupplier.assertAttrsComplete ; 

      (..., ... , 
           Element_head.Choice_2_Alt_1_Choice_1.alt
            (assertAttrsComplete(() -> new Element_script(safeValues, "&#xffef;")
              {@Override protected void initAttrsSafe(){
                getAttr_src().setValue(path_to_javascript);
                assertSetAttrValid(getAttr_type(), "text/javascript");
              }})
            )... )
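
The principle of such a wrapper can be sketched in a few lines. The names used here (MissingWrapperSketch, MissingSupplier, assertComplete, Script) are hypothetical; the sketch shows the general technique of turning a checked exception into an AssertionError by means of a throwing functional interface, not the actual code of TdomAttributeMissingSupplier.

```java
// Sketch only: hypothetical names, not the Tdom runtime classes.
class MissingWrapperSketch {
    // Stand-in for a checked "attribute missing" exception.
    static final class MissingException extends Exception {
        MissingException(String msg) { super(msg); }
    }

    // A supplier which is allowed to throw the checked exception.
    @FunctionalInterface
    interface MissingSupplier<T> {
        T get() throws MissingException;
    }

    // Runs the supplier; a MissingException is reported as an unchecked error.
    static <T> T assertComplete(MissingSupplier<T> supplier) {
        try {
            return supplier.get();
        } catch (MissingException e) {
            throw new AssertionError("required attribute missing", e);
        }
    }

    // Stand-in for an element class with one required attribute.
    static final class Script {
        final String src;
        Script(String src) throws MissingException {
            if (src == null) throw new MissingException("src");
            this.src = src;
        }
    }

    static String demo() {
        // No try/catch and no "throws" clause is needed at the call site.
        Script s = assertComplete(() -> new Script("main.js"));
        return s.src;
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```

The lambda keeps the constructor call inside the surrounding expression, so deeply nested constructor expressions stay free of catch clauses.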

((
By the way: a serious pitfall is attempting an abstraction like ...

   private Element_e makeIt  (final String name){
     return new Element_e(){
                  @Override public void initAttrs(){
                    this.getAttr_name().setValue(name);
                                              // ^^^^ refers to local field of Element_e class
                 }} ;
     }

The method's parameter is NOT addressed by "name", since the local field of the element's class is in the narrower lexical scope!
))
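
This scoping rule of Java can be verified with the following stand-alone sketch. The class ElementE and its field "name" are hypothetical stand-ins for a generated element class which happens to have a field of the same name as the method parameter.

```java
// Sketch only: demonstrates that an inherited field hides a parameter of the
// enclosing method inside an anonymous class body.
class ShadowingSketch {
    static class ElementE {
        protected String name = "field-of-ElementE";  // stand-in for an inherited field
        String attr;                                   // stand-in for the attribute value
        ElementE() { initAttrs(); }
        protected void initAttrs() {}
    }

    // Wrong: "name" resolves to the inherited field, not to the parameter.
    static ElementE makeItWrong(final String name) {
        return new ElementE() {
            @Override protected void initAttrs() { attr = name; }
        };
    }

    // Right: a parameter name which does not collide with any inherited member.
    static ElementE makeItRight(final String attrName) {
        return new ElementE() {
            @Override protected void initAttrs() { attr = attrName; }
        };
    }

    public static void main(String[] args) {
        System.out.println(makeItWrong("param").attr);
        System.out.println(makeItRight("param").attr);
    }
}
```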

^ToC 3.3 Automated Construction of Documents and Elements

As mentioned above, large Tdom models are constructed from given text files via the generated SAX parser or the generated W3C-DOM validator.

Both kinds of creation methods are only defined for the "Document_<tag>" classes, not for pure "Element_<tag>" classes. This is due to the fact that both construction methods possibly require global information, like namespace mapping and collections of "ID"-type attribute values, things not existing with simple elements.

The creation methods are provided by the Java class implementing the DTD. Let the package containing Tdom generated classes (i.e. all element classes, document classes and the DTD class) be called "<myModel>". Let "<tagA>", "<tagB>", etc. be the tags of those elements which can serve as the top-level element of a document, according to the process instructions as described in Section 2.1.1.

Then you can create a Tdom Document object by calling one of the following methods:

  package <myModel> ;
  import eu.bandm.tools.util.SAXEventStream ;
  import eu.bandm.tools.tdom.runtime.TdomAttributeException ; 
  import eu.bandm.tools.tdom.runtime.TdomContentException ; 
  import eu.bandm.tools.tdom.runtime.TdomXmlException ; 
  import eu.bandm.tools.tdom.runtime.TypedDTD ; 

  public final class DTD extends TypedDTD {
    createDocument_<tagA> (Element_<tagA> el) {...}
    createDocument_<tagB> (Element_<tagB> el) {...}
    ...
    createDocument_<tagA> (org.w3c.dom.Document document) 
      throws TdomContentException, TdomAttributeException 
      {...}
    createDocument_<tagB> (org.w3c.dom.Document document) 
      throws TdomContentException, TdomAttributeException 
      {...}
    createDocument_<tagA> (SAXEventStream s)
      throws TdomContentException, TdomAttributeException, TdomXmlException
      {...}
    createDocument_<tagB> (SAXEventStream s)
      throws TdomContentException, TdomAttributeException, TdomXmlException
      {...}
    createDocument_<tagA> (java.io.InputStream in) throws java.io.IOException {...}
    createDocument_<tagB> (java.io.InputStream in) throws java.io.IOException {...}

  }

The first two methods only complete the "manually bottom-up creation" as described in Section 3.2: You create a document by first creating its top-level element and then giving it as an argument to the constructor.

For large documents the following methods are more convenient:

If the argument to "createDocument_<tag>()" is a W3C DOM, then this DOM object is validated against the DTD, and, in case of conformance, a Tdom model is returned. Otherwise a TdomException is thrown.

If the argument to "createDocument_<tag>()" is a SAXEventStream, then the content models of the DTD must be LL(1), and the SAX events are consumed to construct the Tdom model. In case of non-conformance, a TdomException is thrown.

A <METATOOLS>/util/SAXEventStream is an interface which provides access to a "frozen" sequence of SAX calls. This freezing is necessary because LL(1) parsing needs (what a surprise!) a look-ahead of depth 1, which is not provided when using the SAX interface directly.

The implementation currently provided is contained in <METATOOLS>/util/SAXEventQueue.
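
Why a buffered event sequence is sufficient for LL(1) parsing can be illustrated by the following sketch. It is not the SAXEventStream interface: the class EventQueue and its methods peek() and next() are assumptions made for this example, and the events are reduced to the tag names of the child elements.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Sketch only: a buffered event sequence with a look-ahead of depth 1, and an
// LL(1) recognizer for the content model (B, (C)*, (D)?) built on top of it.
class LookaheadSketch {
    static final class EventQueue {
        private final Deque<String> events = new ArrayDeque<>();
        void add(String startTag) { events.addLast(startTag); }
        String peek() { return events.peekFirst(); }   // look-ahead, null at the end
        String next() { return events.pollFirst(); }   // consume, null at the end
    }

    static String parseContentOfA(EventQueue in) {
        if (!"B".equals(in.next())) {
            throw new IllegalStateException("expected B");
        }
        int countC = 0;
        // The decision "one more C or not" needs exactly one event of look-ahead.
        while ("C".equals(in.peek())) {
            in.next();
            countC++;
        }
        boolean hasD = "D".equals(in.peek());
        if (hasD) in.next();
        if (in.peek() != null) {
            throw new IllegalStateException("unexpected " + in.peek());
        }
        return "C=" + countC + " D=" + hasD;
    }

    static String demo() {
        EventQueue q = new EventQueue();
        q.add("B");
        q.add("C");
        q.add("C");
        return parseContentOfA(q);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```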

The W3C DOM and SAX based construction methods can throw TdomContentExceptions and TdomAttributeExceptions, as with the explicit constructor invocations above. Additionally the SAX based methods can throw TdomXmlException in case of erroneous XML input files.

The SAX interface's handling of attributes is rather complicated and expensive. Therefore, currently, we do not completely type-check the SAX event stream as such!
Of course, when there is no value for an attribute which is "required" according to the DTD (i.e. neither declared as "#IMPLIED", nor having a default value), then a TdomAttributeMissingException is thrown.
But we currently do not check for undefined attributes, i.e. attribute names which are not declared in the DTD and thus not represented in the model. (The foreseen TdomAttributeUndefException is not thrown.)
This is a violation of the "Validity constraint: Attribute Value Type" from [xml] , which says "The attribute MUST have been declared".
The same holds for the even more primitive "Well-formedness constraint: Unique Att Spec", which says "An attribute name MUST NOT appear more than once in the same start-tag or empty-element tag."

The format of the SAX event would make both checks rather expensive.

The practical problem is that these kinds of errors often result from a misspelled attribute name. But the absence of the attribute actually meant will not be signalled if it has a default value!

If you have to create large sub-structures of a Tdom model (e.g. starting with a top-level element Element_<tagX>) out of your own program code, it may be nevertheless the method of choice to use the SAX interface to create a complete Document_<tagX>:
Simply send SAX events to a <METATOOLS>/util/SAXEventQueue, the other side of which is consumed by the method aDTD.createDocument_<tagX>(SAXEventStream s).
Then extract the desired element by calling the (generated and therefore strongly typed) method ...

public class Document_<tagX> extends ... tdom.runtime.TypedDocument {
  public Element_<tagX> getDocumentElement() {...} // returns top-level element
} 

For this purpose it is necessary to previously declare as "public" all those element declarations in the DTD which are intended as the top element of such sub-trees. This is described in Section 2.1.1 above, and tells Tdom to create the required Document_<> classes.

The last method (aDTD.createDocument_<tagX>(java.io.InputStream)) is related to our own compression method, and explained in Section 6.5.

The class <METATOOLS>/xantrltdom/TdomReader provides the glueing code between a file input stream or similar source of text, and the construction of a tdom model. Its usage is demonstrated in ../../examples/doctypes/xhtml/Main.java.

^ToC 4 Visitors and Patterns

^ToC 4.1 The Generated Visitor Class and Deriving User Defined Visitors

As mentioned above, the most elegant way of processing a Tdom model to some other format is the application of Visitors.
With every Tdom model the base class Visitor.java is generated, from which you can derive your processing tools. This class defines a "visit(final <generatedClass> node)" method for each node class generated by Tdom . This includes element classes, classes representing sub-sequences, choices, alternatives and attributes. A user defines a transformation by deriving from this visitor class and overriding only those methods where he/she wants to extract some information or perform some update.

On the other side, the generated Element class (which is the top of all generated element classes) implements the interface Visitable<Visitor> and its method host(Visitor). This method is the counterpart which causes the visitor to call its visit method on this. (This method is needed to apply a visitor to any element without knowing its concrete class at compile time.)

The definition of a derived visitor is most conveniently done by editing a copy of the generated VisitorTemplate.java. This file contains method declarations for all visit() methods acting on element classes. These empty method templates are preceded by a "Javadoc" comment which contains the corresponding content definition from the original DTD.
Please note that the method declarations for classes representing sub-content (e.g. "visit(Element_TA.Choice_1_Alt_2 x)") are not included in VisitorTemplate.java, but have to be added manually, whenever required.

((It may be convenient to have a look at "Visitor.java" for doing "copy and paste" on some more complicated method declarations of this kind.))

^ToC 4.2 Calling a User Defined Visitor

The Tdom visitors are of the simplest kind, compared to the more complex ones generated by umod . They only provide the above-mentioned single method per visited class, namely "visit(<generatedclass> x)". This method can be called from external ("hand-written") code for the initial invocation of the visitor. It is also used internally by the visit() code of the generated base visitor itself, for descending to the child nodes.

If the class of an object is known statically, this call is optimal w.r.t. performance. If the class is not known, there is the method "visit(generatedPackage.Element element)", which does a switch/case-based look up of the element's tag index.

All node classes support the method "x.host(Visitor v)". This method calls the most-narrowly statically typed "v.visit(x)" method of the visitor v. This allows visiting sub-content which contains choices without the need to know in advance which alternatives are present in the concrete model data.

This "x.host(Visitor v)" method is also realized by the generated Document_<> classes.
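
The co-operation of "host()" and "visit()" is the classical double dispatch. A minimal sketch, with the hypothetical stand-in classes HostSketch, Element, ElementB and ElementC in place of generated code:

```java
// Sketch only: double dispatch via host(), with hand-written stand-in classes.
class HostSketch {
    static class Visitor {
        final StringBuilder log = new StringBuilder();
        void visit(ElementB b) { log.append("B "); }
        void visit(ElementC c) { log.append("C "); }
    }

    abstract static class Element {
        abstract void host(Visitor v);
    }

    static final class ElementB extends Element {
        // "this" has the static type ElementB, so visit(ElementB) is selected.
        @Override void host(Visitor v) { v.visit(this); }
    }

    static final class ElementC extends Element {
        @Override void host(Visitor v) { v.visit(this); }
    }

    static String demo() {
        // The concrete classes are not known statically at the call site.
        Element[] mixed = { new ElementB(), new ElementC(), new ElementB() };
        Visitor v = new Visitor();
        for (Element e : mixed) {
            e.host(v);
        }
        return v.log.toString().trim();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```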

Further, all node classes which have repeated sub-content, like "elems_1_A", "choices_1" or "seqs_1", offer a method like "visit_choices_1(Visitor v)" which does the stepping through the sub-contents automatically, as mentioned already in Section 2.6.4 above.

^ToC 4.3 Default Visiting Strategy of Generated Visitors and User Defined Explicit Control

All these direct ways of calling (i.e. the skipping of a "match()" multiplexer as needed in the umod visitors) are possible because the structure of the model is almost completely statically defined at compile time, and because there are no specialization relations (= "inheritance") between distinct classes.
The only places where dynamic decisions are required come (a) from alternatives (including abstract classes) and (b) from quantification decorations "?", "*" or "+" in the original DTD.
The high performance of the Tdom visitors results from the fact that in both cases only simple and constant int values need to be considered, --- e.g. the result of "final int getAltIndex()", a method generated with every sub-class of TypedAlt and which returns the value of a static final int assigned to the generated class at compile time, or of "<>.count<plural>()" in case of repetitions.

The generated base visitor does nothing more than descending the document tree in depth-first textual order.

E.g. the DTD declaration ...

<!ELEMENT A (B, (C)*, (D)?) >

...generates a method like ...

package MySemiAst ;
public class Visitor {
  ...
  public void visit (Element_A el){
    visit (el.getElem_1_B());
    for (int i = 0, n = el.countElems_1_C() ; i < n ; i++) 
      visit(el.getElem_1_C(i));
    if (el.hasElem_1_D())
      visit(el.getElem_1_D());
  }
  ...
}

Special processing of nodes of a certain class is implemented by deriving from this base class. If you want to descend into the sub-tree structure starting at the currently visited node el, you simply call "super.visit(el)", or you start with a new, specialized visitor:

package transformations ; 
import MySemiAst.* ;

public class Transform_1 extends Visitor {

  protected class SpecialTransformation extends Transform_1 {
    ...
  }
  @Override public void visit (Element_A el){
    final int value = Integer.parseInt (el.getElem_1_B().getPCData());
    // FIXME: the following call DOES NOT WORK !!?? :-(
    new SpecialTransformation().visit(el.getElems_1_C());
    final int secondvalue = new Visitor(){
      protected int result = 0 ;
      public int process(Element el){visit(el); return result;}
      @Override public void visit (Element_C el){
        result += Integer.parseInt (el.getPCData());
      }
    }.process(el);
    super.visit(el);
  }
}
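
The derivation scheme can be tried out with the following self-contained sketch. It uses hand-written stand-ins (SumVisitorSketch, ElementA, ElementB, ElementC and a base Visitor with the default descent) instead of generated classes, and shows a derived visitor which sums up the numeric text content of all C children.

```java
import java.util.Arrays;
import java.util.List;

// Sketch only: stand-ins for a generated model  <!ELEMENT A (B, (C)*)>
// and for the generated base visitor with its default descent.
class SumVisitorSketch {
    static final class ElementB {
        final String pcData;
        ElementB(String pcData) { this.pcData = pcData; }
    }

    static final class ElementC {
        final String pcData;
        ElementC(String pcData) { this.pcData = pcData; }
    }

    static final class ElementA {
        final ElementB b;
        final List<ElementC> cs;
        ElementA(ElementB b, ElementC... cs) {
            this.b = b;
            this.cs = Arrays.asList(cs.clone());
        }
    }

    // Base visitor: only descends the tree in textual order.
    static class Visitor {
        public void visit(ElementA el) {
            visit(el.b);
            for (ElementC c : el.cs) visit(c);
        }
        public void visit(ElementB el) {}
        public void visit(ElementC el) {}
    }

    // Derived visitor: overrides only the method it is interested in.
    static final class SumOfC extends Visitor {
        private int result = 0;

        @Override public void visit(ElementC el) {
            result += Integer.parseInt(el.pcData);
        }

        int process(ElementA el) {
            result = 0;
            visit(el);          // inherited descent, calls visit(ElementC) above
            return result;
        }
    }

    static int demo() {
        ElementA a = new ElementA(new ElementB("7"),
                                  new ElementC("1"), new ElementC("2"), new ElementC("3"));
        return new SumOfC().process(a);
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```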

The call graph for a content declaration "<!ELEMENT A ((B)*(..|..|..))>" can be symbolically sketched like ...

visit(Element_A e)          ---------> {for (i=0;i<e.countElem_1_B();i++)
                                          visit(e.getElem_1_B);
                                       } ;             visit(e.getChoice_1())


visit(Element_A.Choice_1 c)  --------> switch(c.getAltIndex()){
                                         case 0:visit(c.toAlt_1());
                                         case 1:visit(c.toAlt_2());
                                       }
 
visit(Element_A.Choice_1_Alt_1 a) ---> visit(...)

...and the scheme for deriving the transformation tools "UserDefV" like ...

Visitor.visit(Element_A e)  ---------> {..visit(e.getElem_1_B)..} ; visit(...)
                                             |
   +-----------------------------------------+  
   V
UserDefV.visit(Element_B e) ------>  --->  super.visit(e)
                                                            |
   +--------------------------------------------------------+  
   V
Visitor.visit(Element_B e) ---------->  

^ToC 4.4 Untyped Visitors

Sometimes (e.g. for simulating a W3C DOM and for directly applying tpath expressions) an untyped view to a Tdom instance is required. For this purpose, the class UntypedVisitor is provided. It co-operates with additional "hosting" methods in the generated node classes called "__dumpElementSnapshot(List)" and "__getAllAttrs(List)".
It is esp. suited for accessing and collecting Ethereals, or all attributes of a certain type, etc. It is created without any parameters and applied to any element or document by calling "match(e)". Its behaviour is defined, as usual, by deriving and overriding the diverse "action(e)" or "descend_<..>(e)" methods, see the api doc.
As appropriate for this untyped view, no action or match methods for structural sub-contents are provided.

^ToC 4.5 Generated Paisley patterns

The command line option --patterns activates the generation of Paisley patterns, see the paisley documentation.

FIXME MORE

^ToC 5 The Extension Mechanism

Tdom includes a mechanism for extending a model by one or more others. Its prime purpose is to support the reuse of visitor-based code.

On the source level, an extension has to be declared in the DTD. A short example is contained in metatools/examples/tdom/extend . Therein, the file arith.dtd , further simplified, reads as follows:

<?tdom abstract expr (nullary | unary | binary) ?>
<?tdom abstract nullary (const | ...) ?>
<?tdom abstract unary_op (neg | ...) ?>
<?tdom abstract binary_op (add | sub | mul | div | ...) ?>

<!ENTITY % expr.extension "">
<!ENTITY % unary_op.extension "">
<!ENTITY % binary_op.extension "">

<!ENTITY % expr "(nullary | unary | binary %expr.extension;)">
<!ELEMENT unary ( (neg  %unary_op.extension;),   %expr;)>
<!ELEMENT binary (%expr;, (add | sub | mul | div %binary_op.extension;),  %expr;)>



The "?tdom abstract" process instruction is terminated by the ellipsis "...", which tells the Tdom code generator to foresee the plugging-in of more element types.
The same is done on the genuine DTD level by defining one "ENTITY" for each, which is of course empty by default.

The usage of the extension mechanism can be seen in the file logic.dtd , which, slightly simplified, reads as follows:

<?tdom import arith SYSTEM "arith.dtd" ?>

<!ENTITY % arith.dtd SYSTEM "arith.dtd">
%arith.dtd;

<!ELEMENT prop (%expr;)*>
<!ATTLIST prop pred NMTOKEN #REQUIRED>

Again, the import is executed on the DTD level and on the Tdom level in parallel, which requires a seemingly redundant duplication.

FIXME EXAMPLE BROKEN ??? SOMETHING has been LOST here !?!?!

^ToC 6 Serialization and Conversions

^ToC 6.1 Generating SAX Events

Among the generated classes there is always a Dumper class, extending the Visitor class. Its constructors are defined as ...

  public Dumper(org.xml.sax.ContentHandler contentHandler) { ..}
  public Dumper(org.xml.sax.ContentHandler contentHandler,
                org.xml.sax.ext.LexicalHandler commentHandler) { ..}

Whenever visit() is called on any element of the model, the corresponding SAX events [Sax04] are generated for this element and its attributes, for any related Ethereals (see Section 2.10) and for all elements contained therein recursively.

The LexicalHandler is optional and is used to receive TypedComment objects. If the first constructor is called and the argument happens to support both interfaces, it will be used in both roles.
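
How a visitor-like traversal feeds a SAX ContentHandler can be sketched with the standard org.xml.sax classes of the JDK. The class SaxDumpSketch with its tiny Node tree is a hypothetical stand-in for a Tdom model and its generated Dumper; attributes and Ethereals are left out.

```java
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

// Sketch only: emits SAX events for a tiny hand-written tree.
class SaxDumpSketch {
    static final class Node {
        final String tag;
        final String text;                    // may be null
        final Node[] children;
        Node(String tag, String text, Node... children) {
            this.tag = tag;
            this.text = text;
            this.children = children.clone();
        }
    }

    // Recursively generates the events for a node and all nodes contained therein.
    static void dump(Node n, ContentHandler handler) throws SAXException {
        handler.startElement("", n.tag, n.tag, new AttributesImpl());
        if (n.text != null) {
            handler.characters(n.text.toCharArray(), 0, n.text.length());
        }
        for (Node child : n.children) {
            dump(child, handler);
        }
        handler.endElement("", n.tag, n.tag);
    }

    static String demo() {
        final StringBuilder out = new StringBuilder();
        // A minimal receiver which renders the events as text.
        DefaultHandler collector = new DefaultHandler() {
            @Override public void startElement(String uri, String localName,
                                               String qName, Attributes atts) {
                out.append('<').append(qName).append('>');
            }
            @Override public void characters(char[] ch, int start, int length) {
                out.append(ch, start, length);
            }
            @Override public void endElement(String uri, String localName, String qName) {
                out.append("</").append(qName).append('>');
            }
        };
        try {
            dump(new Node("a", null, new Node("b", "hi")), collector);
        } catch (SAXException e) {
            throw new AssertionError(e);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(demo());
    }
}
```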

^ToC 6.2 Visualization of a Tdom Model

You can easily print a whole model or an arbitrary sub-tree to a console or to a text file by combining the above-mentioned SAX event generation with our ContentPrinter. E.g.:

  new Dumper(new ContentPrinter(new PrintWriter_flushing(System.out), true, true)
            ).visit(myModel);

^ToC 6.3 Format Generation for a Tdom Model

The class formatfrontends.Tdom2format contains an instantiation of the generic format compiler.

The generated code is a specialization of the Visitor class generated with the tdom code. It offers a public method "toFormat()" which translates a tdom model (or any sub-expression of it) into a format object. The outline of such a converter, called "myFormatter" and generated for some Tdom generated package "myModel", is:

import myModel.* ; // Element_A, Element_B, etc. 
 
public class myFormatter extends myModel.Visitor {

  // public interface 
  public Format toFormat ( Visitable<? super Visitor> element){
    result=Format.empty;
    visit(element); 
    return result ;
  }
  public int default_indent = 2 ;   // can be modified  
  public Format default_delimiter = Format.empty ; // for debugging only

  // auxiliary functions 

  protected Format __throwIt(){...}
  protected void visit (TypedPCData){...}
  protected Format toFormat_throwing(Visitable){...} // if ==null then throw!
  protected Collection<Format> toFormat_array(Visitable[]){...}// "map" 
   
  // user defined visitor functions  

  public void visit (Element_A element){  
    result = // format generating function
  }
  // etc.
}

A "visit()" method is added for each node class of the Tdom model, if and only if the user gives a format description.

The default case is that the visitor reaches "visit(typedPCData)". This simply concatenates all character content into one big, unformatted Append format. This is exactly what you want in most cases for the lowest layers of the structure definition.

(( There is a field "public Format default_delimiter" in the generated code, which is initialized to an empty format. For debugging it can be set to something like "---", indicating the borders of the different concatenated pcdata ranges. ))

The language to define the format for a given Tdom class explicitly, is an instance of the generic format definition language. The following instantiations are specific to its application to Tdom :

DOMAIN_SPECIFIC_DATA_ADRESSING ::= element | choice | sequence | $pcdata | $quoteDTDstyle ( blankCharacter ) + formatDescription
element ::= tag ( $ nat ) ?
choice ::= ( $C | $Choice ) nat
alt ::= ( $A | $Alt ) nat
sequence ::= ( $S | $Seq ) nat
nat ::= // a natural number including zero(0) ;

The reference "$quoteDTDstyle data" means a "DTD-like" quotation of the (possibly concatenated) string content of the "data" format, done by Format.quoteDTDstyle(Format). This means simply framing the data with single quotes if it contains double quotes, and vice versa. And doing anything if the data contains both !-).
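
The quoting rule can be sketched as a plain string function. This is not the code of Format.quoteDTDstyle(Format): the function name quoteDtdStyle is hypothetical, and two points are assumptions of this sketch, namely that double quotes are used when the data contains no quotes at all, and that data containing both kinds of quotes is rejected.

```java
// Sketch only: DTD-like quotation of a string, under the stated assumptions.
class QuoteSketch {
    static String quoteDtdStyle(String data) {
        boolean hasDouble = data.indexOf('"') >= 0;
        boolean hasSingle = data.indexOf('\'') >= 0;
        if (hasDouble && hasSingle) {
            // Assumption: the text above leaves this case open.
            throw new IllegalArgumentException("data contains both kinds of quotes");
        }
        // Single quotes if the data contains double quotes, else double quotes.
        char quote = hasDouble ? '\'' : '"';
        return quote + data + quote;
    }

    public static void main(String[] args) {
        System.out.println(quoteDtdStyle("say \"hi\""));
        System.out.println(quoteDtdStyle("it's"));
    }
}
```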

The reference "$pcdata" means the text content of the current tdom element, as it is delivered by the generated getPCData() method.

Appearing in a format code, "<tag>$<nat>" means a reference to the format generated for "getElem<tag>_<nat>()", --- "<tag>" without number defaults to "getElem<tag>_1()".
Analogously: "$C<nat>"/"$Choice<nat>" means a reference to the format generated for "getChoice_<nat>()" and "$S<nat>"/"$Seq<nat>" means "getSeq_<nat>()".

If the selected element/choice/sequence appears in the DTD under a list combinator ("+" or "*"), one of the list format descriptors must be applied ("[]", "[,,/]", "[;;/]{}", etc.), as described for the generic case.

Please note that currently there is no type checking between the format descriptors and the DTD of the model. So addressing singular instead of plural (i.e. leaving out a list format descriptor instead of using it, --- or vice versa) makes the compilation of the generated code fail (there is either a "getElems_1_A()" or a "getElem_1_A()", as described above).

DOMAIN_SPECIFIC_SWITCH_SELECTOR ::= element | choice | sequence
DOMAIN_SPECIFIC_CASE_LABEL ::= nat

In case the DOMAIN_SPECIFIC_SWITCH_SELECTOR refers to an element, this element must appear in the current content model with a "?" modifier. The selector tags may be "0" for absent and "1" for present.

In case the DOMAIN_SPECIFIC_SWITCH_SELECTOR refers to a choice, then the selector tags correspond to the allowed values of "altIndex", i.e. the position number of the present alternative in the DTD choice construct.

If in a switch no default case is given, this defaults to "$throw".

Currently there are three ways to write down these definitions: (1) in a stand-alone file, (2) as PIs in a DTD, or (3) by option values from an Xantlr source.

^ToC 6.3.1 Stand-alone format description file

All format definitions in such a file must have the form ...

formatRule ::= tag ( seq| choice alt ) * ( choice ) ? blankCharacter * = formatDescription .

Please note the dot "." as a delimiter at the end, which is required because in a formatDescription all whitespace is significant. The description begins directly after the "=".

The expression to the left of the "=" is translated verbatim into a "visit(Element_<tag><tail>)" rule, where <tail> is the name of the inner class of the element class, as denoted by the sequence of selectors, which are translated the same way as in the context of the format description, as described above.

The call of the tool is controlled by these parameters:

( definitions from file ../../src/eu/bandm/tools/formatfrontends/tdom2format.options )

-G --sourceroot uri
  file system position of the root of the java source tree
-1 --packagename string
  name of the package which will contain the formatting code
-2 --basevisitor string
  name of the base visitor class from which the formatting code inherits.
-3 --classname string
  name of the generated visitor class
-4 --sourcetype ( dtd | nondtd )
  indication of the type of the source file
-5 --sourcefile uri
  path to the source file to translate
-w --linewidth int(=79)
  width of the generated source files
--targetclasspath uri(=$na)
  position in file system for loading referred existing classes

Enumeration sourcetypes:

dtd
nondtd

ATTENTION Currently these parameters are not yet decoded as documented, but

  1. the parameters with numbers as abbrevs can only be given by position
  2. the other option values may be specified by System.getProperty("eu.bandm.tools.formatfrontends.tdom2format.<OPTIONNAME>").
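Reading such an option value from a system property could look like the following sketch. It is not taken from the tool's sources: the class name is hypothetical, and both the spelling of the key prefix (following the package name eu.bandm.tools.formatfrontends) and the use of the lower-case long option name "linewidth" as <OPTIONNAME> are assumptions. The default of 79 is the one listed for --linewidth above.

```java
final class OptionPropertySketch {

    // Assumption: key prefix spelled like the package name of the tool.
    static final String PREFIX = "eu.bandm.tools.formatfrontends.tdom2format.";

    /** Returns the line width from the system property, or the documented default. */
    static int lineWidth() {
        // Assumption: <OPTIONNAME> is the long option name in lower case.
        return Integer.parseInt(System.getProperty(PREFIX + "linewidth", "79"));
    }

    public static void main(String[] args) {
        System.out.println(lineWidth());
    }
}
```

Under these assumptions the value would be set on the command line by a JVM argument of the form -D<key>=<value> placed before the class name.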

^ToC 6.3.2 Process instructions in a DTD

The format of the declarations is exactly as above, but all are wrapped into process instructions, each containing one or more declarations:

  <?tdom2format  a =expr. ?>
  <?tdom2format  b =expr. c  =expr.  ?>

Please note again the dot "." which ends the format descriptions.

The call of the tool is now ...

  ${JAVA} eu.bandm.tools.formatfrontends.Tdom2format  -- model.DTD   ???

^ToC 6.3.3 Options from an Xantlr Source

If the Tdom meta-model is an AST meta-model created from an Xantlr grammar, then the format directives can be formulated directly in the grammar source file by a special kind of rule-wise options.

A construct in the grammar file like ...

  public myNonterm  options { format = "expr" ; } : grammarExpr ; 

... will be translated by Xantlr into a process instruction in the generated DTD:

  <?tdom2format myNonterm =expr.  ?>

Here the dot is added by this translation process, so the whole string constant in the "option" statement (including trailing whitespace!) is seen as format description.

But you can add more format expressions into the one option statement, if you want to define format rules for sub-expressions like choices or sub-sequences. Simply terminate the "main" expression for the non-terminal with an explicit dot, and append the further declarations for these sub-expressions (or arbitrary unrelated nonterminals !-) as explicit rules.

Simply consider that Xantlr prepends the name of the non-terminal and the equal sign, and appends one dot; then you can insert further rules arbitrarily.

So ...

  public myNonterm  options { format = "expr. myN $C1=sub1. myN $C2$A1 = sub2" ; }
                           :a(b|c)(d|e) ; 

... will be translated by Xantlr into a process instruction with three format declarations in the generated DTD:

  <?tdom2format myNonterm =expr. myN $C1=sub1. myN $C2$A1 = sub2.   ?>
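The prepend-and-append rule described above can be written down as a small string function. This is a sketch of the rule only, not Xantlr's code: the class and method names are hypothetical, and the exact amount of whitespace Xantlr emits around the declarations is an assumption.

```java
final class FormatOptionSketch {

    /**
     * Builds the process instruction for one "format" option:
     * the non-terminal name and "=" are prepended, one dot is appended.
     */
    static String toProcessInstruction(String nonterminal, String formatOption) {
        return "<?tdom2format " + nonterminal + " =" + formatOption + ". ?>";
    }

    public static void main(String[] args) {
        System.out.println(toProcessInstruction("myNonterm", "expr"));
        System.out.println(toProcessInstruction(
            "myNonterm", "expr. myN $C1=sub1. myN $C2$A1 = sub2"));
    }
}
```

Because only one dot is appended, every rule in the option string except the last one has to carry its own terminating dot.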

^ToC 6.4 Creating a W3C (untyped) DOM Representation

Creating a W3C DOM representation can be done via this SAX output: meta_tools provides a generic translation module in <METATOOLS>util/SAX2DOMConverter. It implements org.xml.sax.ContentHandler and can therefore be connected as a drain to the Dumper described in the preceding paragraph.

You have to plug in a W3C DOM implementation, e.g. Xerces-J [xercesj], by calling the public method ...

public void setDOMImplementation(org.w3c.dom.DOMImplementation domImpl){...}

After all SAX events have been sent completely, you can access the generated Document by calling ...

public org.w3c.dom.Document getDocument()
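To illustrate the principle of such a converter, the following sketch builds a DOM from SAX events using only JDK classes. It is not the SAX2DOMConverter of meta_tools: the class SaxToDomSketch is hypothetical and merely mirrors the two method names shown above; it ignores namespaces, and it takes the DOM implementation bundled with the JDK instead of Xerces-J.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

final class SaxToDomSketch extends DefaultHandler {
    private DOMImplementation domImpl;
    private Document document;
    private final Deque<Node> open = new ArrayDeque<>(); // stack of open elements

    public void setDOMImplementation(DOMImplementation domImpl) { this.domImpl = domImpl; }

    public Document getDocument() { return document; }

    @Override
    public void startElement(String uri, String localName, String qName, Attributes atts) {
        Element e;
        if (document == null) {
            // The first element event creates the document and its root element.
            document = domImpl.createDocument(null, qName, null);
            e = document.getDocumentElement();
        } else {
            e = document.createElement(qName);
            open.peek().appendChild(e);
        }
        for (int i = 0; i < atts.getLength(); i++) {
            e.setAttribute(atts.getQName(i), atts.getValue(i));
        }
        open.push(e);
    }

    @Override
    public void endElement(String uri, String localName, String qName) { open.pop(); }

    @Override
    public void characters(char[] ch, int start, int length) {
        open.peek().appendChild(document.createTextNode(new String(ch, start, length)));
    }

    /** Feeds the events for <a><b x="1">hello</b></a> by hand and returns the result. */
    static Document buildDemo() {
        SaxToDomSketch h = new SaxToDomSketch();
        try {
            h.setDOMImplementation(
                DocumentBuilderFactory.newInstance().newDocumentBuilder().getDOMImplementation());
        } catch (ParserConfigurationException ex) {
            throw new IllegalStateException(ex);
        }
        AttributesImpl atts = new AttributesImpl();
        atts.addAttribute("", "x", "x", "CDATA", "1");
        h.startElement("", "a", "a", new AttributesImpl());
        h.startElement("", "b", "b", atts);
        char[] text = "hello".toCharArray();
        h.characters(text, 0, text.length);
        h.endElement("", "b", "b");
        h.endElement("", "a", "a");
        return h.getDocument();
    }

    public static void main(String[] args) {
        System.out.println(buildDemo().getDocumentElement().getTagName());
    }
}
```

In real use the events would not be fed by hand but come from the Dumper, with the handler registered as its ContentHandler.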

^ToC 6.5 Compressed De-/Serialization

Since the standard XML encoding (using opening and closing tags and many different layers of escaping and quoting) is very redundant, Tdom supports a compressed binary storage format, in which all tags are encoded in a minimal way, controlled by the DTD. So the tagging information is a kind of "binary", while the text contents are left unchanged (i.e. they are encoded using UTF-8).

The writing out of a document or element is initiated by a sequence like ...

  import eu.bandm.tools.tdom.runtime.EncodingOutputStream ;
  final EncodingOutputStream os = new EncodingOutputStream(anyOutputstream);
  myElement.encode(os);
// OR   myDocument.encode(os);

The encode methods are defined at the level of the runtime classes, on top of which the generated classes are defined, so see <METATOOLS>/tdom/runtime/TypedDocument or <METATOOLS>/tdom/runtime/TypedElement for details.

Reading back is only possible on Document level, by calling the appropriate factory method defined with the DTD class, as already mentioned in Section 3.3 above:

    createDocument_<tagA> (java.io.InputStream in) throws java.io.IOException {...}

^ToC 7 Using the Tdom Tool

^ToC 7.1 Calling the Tdom Tool

The Tdom tool is called from the command line:

( definitions from file ../../src/eu/bandm/tools/tdom_withOptions/Options.xml )

-0 --destdir string
  file system mounting point of the generated source tree
-1 --pkgname string
  name of the generated package. Determines the position of generated resources relative to destdir
-2 --sourcedtd string
  file position of dtd to compile
-C --baseClass string(=eu.bandm.tools.tdom.runtime.TypedElement)
  base class for all generated element classes
--commonContentClass
  whether to generate only one common subclass of MixedContent for PCData only elements per model
--generateLists bool(=false)
  generate checked list classes for repeated sub-contents, not arrays
--linewidth int(=78)
  number of columns for the generated Java source text. This is NOT a strict limit, but a strong orientation.
--noCompress
  whether the decode/encode methods shall be omitted
--patterns
  whether "paisley" pattern access methods shall be generated.

Currently the parameters --commonContentClass, --baseClass and --noCompress are not supported.

Additionally there is a macro in the metatools make system defined in etc/calltools.mk, which is called from a Makefile as ...

$(call tdom, <destDir>, <pkgName>, <sourceDtd>)

This macro in particular takes care of the conversion of slashes and backslashes between a Unix and a Cygwin environment.

^ToC 7.2 Outputs and Error Messages

As an output, Tdom generates one source file for each <!ELEMENT ..> declaration, according to the naming conventions explained above in Section 2.6.

Additionally there will be generated ...

  1. A file named sources, listing all source files generated by Tdom, and included by the Makefile, esp. for realizing "make clean".
  2. The abstract super classes Element.java and Document.java.
  3. The binary version of the DTD, contained in DTD.java.
  4. Visitor.java and VisitorTemplate.java, as described above in chapter 4.

Tdom constructs a whole zoo of parsers and validators (from SAX, W3C-DOM, Java constructors, etc). In this context it is central to detect ambiguities in the DTD grammar. In practice we often find badly written DTDs. But many of the ambiguities are two alternatives which both include "epsilon" in their language, e.g. ...

  ... ( a* | b* ) ...

These ambiguities do not violate the "LL(1)" requirement, as interpreted by W3C. Nevertheless, it is important to have a close look at all points of ambiguity. The Tdom tool prints them in a precise format. E.g. when trying to translate "xhtml1-flat.dtd", you get the typical warning

warning: table: conflict on [thead, tbody, tr, tfoot] between alts 0 and 1 in rule 
[thead, col, tbody, tr, colgroup, tfoot] 
-> ( [thead, col, tbody, tr, tfoot] 
            -> {[col] -> (col)* -> [thead, tbody, tr, tfoot]}
   | [thead, tbody, tr, colgroup, tfoot]  
            -> {[colgroup] -> (colgroup)* -> [thead, tbody, tr, tfoot]}
   )

In the innermost nesting (in braces) you see the first and follow sets of some grammar expressions :

             { [firstSet] ->  grammarExpr  -> [followSet] }

The conflict is here caused by a disjunction. On the next higher level you see the disjunction of two such constructs, each preceded by its own first set (which is identical to the internal first set, or to the union of the internal first and follow set, if the grammar expression can produce epsilon, as we all have learned from the Dragon Book !-)

 ( [firstSet_A] -> {...}
 | [firstSet_B] -> {...}
 )

Before this bracket, again, there is the first set of the disjunction as a whole (which is not very informative, it is just the union), but in the warning message you find the intersection of these two sets, which is the cause for the ambiguity.
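The set arithmetic behind this warning can be reproduced in a few lines. The sketch below is not Tdom's implementation; the class and method names are hypothetical. It only recomputes, for the "table" example above, the first set of each alternative (internal first set plus follow set, because "(col)*" and "(colgroup)*" can both produce epsilon) and their intersection.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

final class FirstSetConflictSketch {

    /** First set as seen from outside: the internal first set, plus the follow set if nullable. */
    static Set<String> effectiveFirst(Set<String> first, Set<String> follow, boolean nullable) {
        Set<String> result = new TreeSet<>(first);
        if (nullable) {
            result.addAll(follow);
        }
        return result;
    }

    /** The conflict between two alternatives is the intersection of their first sets. */
    static Set<String> conflict(Set<String> firstA, Set<String> firstB) {
        Set<String> result = new TreeSet<>(firstA);
        result.retainAll(firstB);
        return result;
    }

    static Set<String> setOf(String... names) {
        return new TreeSet<>(Arrays.asList(names));
    }

    public static void main(String[] args) {
        Set<String> follow = setOf("thead", "tbody", "tr", "tfoot");
        Set<String> alt0 = effectiveFirst(setOf("col"), follow, true);      // (col)*
        Set<String> alt1 = effectiveFirst(setOf("colgroup"), follow, true); // (colgroup)*
        System.out.println("alt 0: " + alt0);
        System.out.println("alt 1: " + alt1);
        System.out.println("conflict: " + conflict(alt0, alt1));
    }
}
```

The computed conflict set contains exactly the four tags thead, tbody, tr and tfoot named in the warning (printed here in alphabetical order, since the sketch uses sorted sets).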

^ToC 8 Xantlr and Tdom --- Special Issues of Their Co-Operation

^ToC 8.1 Information Interchange by Option Controlled DTD Generation

First of all, the grammar file fed into Xantlr must contain the global, parser-level option

options {
  dtdMode = tdom ;
}

This makes the generated DTD contain special "process instructions" addressed to Tdom (as described above in Section 2.1.1), reflecting the settings in the grammar definition file:

  1. the Xantlr rule-level option "xmlNodeType=abstract" (cf. xantlr sax event types) adds to the DTD something like "<?tdom abstract a (b|c) ?>".
  2. an Xantlr rule-level "private" or "public" modifier is translated to a process instruction accordingly, cf. Section 2.1.1.

^ToC 8.2 Different Layers of Ambiguity

When Xantlr and Tdom are plugged together, two parsers are involved: First comes the antlrC parser, which consumes front-end characters and emits the standard antlrC error messages. The output of the Xantlr generated parser is a SAX event stream, which is fed into the different SAX receivers created by the Tdom compiler.

Already when translating the DTD, Tdom may issue error messages concerning this second level of parsing, mostly caused by ambiguities, i.e. violations of the LL(1) criterion. This kind of ambiguity should not be confused with the antlrC front-end ambiguities. Consider a definition (taken from the d2d grammar) like ...

       definition  ::= "list" "of" reference
                    |  "short" "for" reference 
                    | ...

The antlrC generated front-end parser has no problems with ambiguity, because there are terminal tokens guarding the alternatives. Because these tokens do not automatically contribute to the semi-AST, the generated DTD would correspond to

       definition  ::= reference
                    |  reference 
                    | ... ;

This is ambiguous, and the following Tdom translation will issue an error message, as explained above in Section 7.2.

As a remedy, you should either wrap one of the terminals in a non-terminal (with an empty DTD content model), like ...

       definition   : LIST "of" reference
                    | "short" "for" reference 
                    | ... ;
       LIST : "list" ;

...or wrap one of the alternatives as a whole in a non-terminal:

       definition   : "list" "of" reference
                    | shortcutdefinition 
                    | ... ;
       shortcutdefinition : "short" "for" reference ; 

Now the generated SAX events suffice to distinguish between the alternatives.
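The effect of both remedies can be checked with a deliberately simplified model: each alternative is a list of symbols, quoted terminals are erased because they do not contribute to the semi-AST, and two alternatives that become identical after erasure cannot be told apart. This is only the simplest kind of LL(1) conflict and not the analysis Tdom actually performs; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

final class ErasedAlternativesSketch {

    /** Drops quoted terminals; only non-terminals show up in the SAX event stream. */
    static List<String> erase(List<String> alternative) {
        List<String> result = new ArrayList<>();
        for (String symbol : alternative) {
            if (!symbol.startsWith("\"")) {
                result.add(symbol);
            }
        }
        return result;
    }

    /** True if two alternatives coincide once their terminals are erased. */
    static boolean ambiguous(List<List<String>> alternatives) {
        Set<List<String>> seen = new HashSet<>();
        for (List<String> alternative : alternatives) {
            if (!seen.add(erase(alternative))) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        List<List<String>> original = Arrays.asList(
            Arrays.asList("\"list\"", "\"of\"", "reference"),
            Arrays.asList("\"short\"", "\"for\"", "reference"));
        List<List<String>> wrappedTerminal = Arrays.asList(
            Arrays.asList("LIST", "\"of\"", "reference"),
            Arrays.asList("\"short\"", "\"for\"", "reference"));
        List<List<String>> wrappedAlternative = Arrays.asList(
            Arrays.asList("\"list\"", "\"of\"", "reference"),
            Arrays.asList("shortcutdefinition"));
        System.out.println(ambiguous(original));
        System.out.println(ambiguous(wrappedTerminal));
        System.out.println(ambiguous(wrappedAlternative));
    }
}
```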

^ToC 8.3 XantlrTdom, Glue Code and Error Messaging Issues

meta_tools provides some glue code for plugging together Xantlr and Tdom. The central class is <METATOOLS>/xantlrtdom/XantlrTdom, which internally creates buffers, auxiliary message pipes, etc., and plugs it all together.

Let "XXX" be the name of your grammar, and "YYY" the top production, then the usage pattern is ...

    final XXX_Lexer lexer = new XXX_Lexer(stream);
    lexer.setFilename(filename);
    final XXX_Parser parser = new XXX_Parser(HistoryToken.chain(lexer));
    tee = (tracing) 
        ? new ContentPrinter(new PrintWriter_flushing(System.err), true, true)
        : null ; 
    final XantlrTdom link 
      = XantlrTdom.link (parser,
                         msg1,
                         1024,
                         tee,
                         DTD.dtd,
                         msg2);
    final Document_YYY  document_module 
      = link.parse("YYY", Document_YYY.class);

Please refer to the API doc.



1 But see the discussion on unsafe API methods in Section 2.6.5 below

2 Only exception: When taking them seriously, the constraints on id/idref/idrefs attributes imposed by [xml] are nearly impossible to maintain! And they would be very expensive to check automatically with every model update. Therefore they are only evaluated on demand, controlled by the user. See Section 2.7.6 for details.

3 They can become cyclic when later calling set_..() methods: another reason for better treating model elements as immutable!

4 This code fragment is written for documentation purpose using the access methods from the user interface, as described above. The real implementation, since in the same package as the element classes, of course uses the direct access to protected variables for the sake of efficiency, e.g. "a.elems_1_C.length" instead of "a.countElems_1_C()"







made 2025-01-09_11h54 by lepper on happy-ubuntu

produced with eu.bandm.metatools.d2d and XSLT