
Tdom , a Generator for Typed XML Models

(related API documentation: package tdom.runtime   )

# ^ToC 1 Principles of Tdom

Tdom is a tool for generating typed data models of an xml text body according to a definition given as XML DTD [xml] . "Typed" model means that (a) the validity of the model w.r.t. the DTD is guaranteed by all creation and modification methods, and (b) that this can be proved at compile time. 1

The model generated by Tdom behaves "partly algebraically": each node behaves like an algebraic expression and knows nothing about the context(s) it appears in. So, in contrast to W3C DOM ([w3cDom]), you can employ sharing, even between different "documents". Nodes exist independently of a global document object, and can be created, processed and stored in a freely compositional and local way. This is a fundamental requirement for a "functional style" of programming.
Tdom nodes do not behave algebraically in that they can be treated as mutable (but think twice whether this is really necessary for your purposes: you lose sharing !-), and in that they do not support an algebraic equals().
(This could be added in some later version !?!)

The fact that they "do not know" their parent and their siblings makes Tdom nodes behave more like nodes of a tree in the mathematical sense. Software architects used to W3C DOM and similar models may consider this restriction a drawback. Processing and creating trees is of course fundamentally different in this paradigm: creating goes most naturally bottom-up, processing goes most naturally by visiting top-down (see chapter 4) and memorizing all required context information "on the fly".
(For all our applications, we found this a most convenient, safe and easy to debug way of coding !-)
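The consequences of parent-less nodes can be illustrated with a small hand-written sketch. The class `Node` below is a hypothetical stand-in and not a Tdom class; it only assumes what the text states, namely that a node holds its children but no reference to its parent. One sub-tree is built bottom-up and then shared by two containing trees, and the top-down traversal passes the context (here: the nesting depth) down as a parameter.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

class SharingDemo {
  public static void main(String[] args) {
    // Built bottom-up: the leaves first, then the containing nodes.
    Node shared = new Node("footer", new Node("p"));
    Node doc1 = new Node("html", new Node("body"), shared);
    Node doc2 = new Node("report", shared);   // the same instance is reused

    StringBuilder out = new StringBuilder();
    doc1.print(0, out);
    doc2.print(0, out);
    System.out.print(out);
  }
}

// Hypothetical stand-in for a parent-less tree node; NOT a Tdom class.
final class Node {
  final String tag;
  final List<Node> children;

  Node(String tag, Node... children) {
    this.tag = tag;
    this.children = Collections.unmodifiableList(Arrays.asList(children));
  }

  // Top-down processing: the depth is context information handed down
  // by the caller, because the node itself does not know its parent.
  void print(int depth, StringBuilder out) {
    for (int i = 0; i < depth; i++) out.append("  ");
    out.append(tag).append('\n');
    for (Node c : children) c.print(depth + 1, out);
  }
}
```

Because `shared` carries no parent pointer, it can appear in both trees without copying; a model with parent links would have to clone it first.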

Applying the Tdom compiler to a DTD yields a collection of Java source files, forming a single package. This package will be processed by a Java compiler. It relies on the presence of a collection of base classes in the package <METATOOLS>/tdom/runtime, see Section 2.4. The generated collection provides (at least) one Java class definition for each type of node defined by the DTD. This includes ...

1. one or two classes for each ELEMENT declaration,
2. one class for each attribute declaration,
3. and one class for each structured sub-content of an ELEMENT's content.

All these Java classes are called "node classes" in the following. A Tdom model of a certain text corpus is realized by a structured collection of instances of these node classes. Each such instance represents a certain part of the document, and each node class represents a certain type of these document parts.

All generated node classes provide ...

1. constructor definitions,
2. methods for retrieving sub-nodes of a given node,
3. methods for updating sub-nodes of a given node while preserving the correctness w.r.t. the DTD,
(For all these topics see chapter 2 and chapter 3)
4. parsing methods for creating documents from SAX streams or W3C-DOMs (see Section 3.2),
5. methods for translating a Tdom model into a SAX stream,
6. methods for a compressed serialization and de-serialization (for these topics see chapter 6).

Additionally the generated package contains ...

1. a visitor base class Visitor and a visitor template VisitorTemplate for declarative processing of the models, as described in chapter 4,
2. a class derived from the run-time class TypedDTD, which contains a DTD model. Please note that this class additionally needs the textual representation of the DTD as an (error-free, parseable) runtime resource, accessible via the relative path "./original.dtd" from the DTD class.

After these classes have been compiled by a Java compiler, you can create a Tdom text model ...

1. ...either "manually", i.e. by explicitly calling constructors for node classes, thereby explicitly creating the document tree bottom-up, see chapter 2
2. or by creating a whole Document with the generated parsing methods: Either with the validating SAX receiver from a SAX event stream, or with the validation DOM interpreter from a W3C-DOM (e.g. [xercesj]), see Section 3.2.

Each Tdom model, or each fragment thereof, can then ...

1. ...be analysed by visitor based code, see chapter 4,
2. ...be modified by applying the update methods of those Java standard libraries which make up the model, see chapter 2,
3. ...and finally be written out by different serialization methods, see chapter 6

Some of the public examples on page download & licences make extensive use of Tdom .
See esp. the "BandM booking" book keeping software, where a dedicated DTD models the business objects, and the d2d based "Wiki" where type correct XHTML 1.0 is constructed in small pieces bottom-up.

# ^ToC 2 Mapping from DTD to Java

## ^ToC 2.1 Relevant Information content of a DTD

The Tdom tool processes one single DTD file and generates one package of Java source files. From the DTD it uses ...

1. ...all element definitions, their (expanded) content models and their attribute definitions.
2. ...process instructions of the form "<?tdom ... ?>"
3. ...entity definitions.

In most cases, entity definitions are only used implicitly. Tdom uses the meta_tools component dtd , expanding entity references in a transparent way.

### ^ToC 2.1.1 Process instructions

The translation into Java code is controlled by a whole zoo of "process instructions" addressing Tdom, as defined in [xml]. Here are the most important:

1. <?tdom xmlns=..?>
<?tdom xmlns:..=..?> -- XML namespace definition, see Section 2.2.
2. <?tdom SYSTEM/PUBLIC ..?> -- defines the xml document id for this DTD file, see Section 2.3.
3. <?tdom public/private/default ..?> -- defines which elements can be used as top of a model, as so called "document elements", see Section 2.6.
4. <?tdom abstract ..?>
<?tdom abstract-entity ..?> -- generated abstract classes for alternatives, for leaner code, see Section 2.6.1. Ending with an ellipsis it is also used in the expansion mechanism, see chapter 5.
5. <?tdom attribute ..?>
<?tdom attribute-entity ..?> -- generate a common class for an attribute used in different elements, for leaner code, see Section 2.14
6. <?tdom import ..?> -- import another Tdom model for extending it, see chapter 5
7. <?tdom doc ..?> -- define additional text to integrate it into the generated API doc. FIXME MORE (_)
8. <?tdom package ..?> -- ?? FIXME MORE (_)

## ^ToC 2.2 XML Namespaces

As documented there, this class can represent names in "non-namespace-mode" and in "namespace-mode". In the first case the character ":" is treated in no way special. In the latter there must be at most one such character, and the prefix is mapped to a "namespace URI", as it is the standard way with XML namespaces, see [xml-ns].

For Tdom , the namespace mode is activated by process instructions which exactly follow the standard XML syntax:

  

For the runtime namespace logic, the prefixes are ignored and "equals()" etc. is ruled only by the namespace URI, -- the usual way with namespace aware XML. In this concern, all prefixes must be different, but can be arbitrary.

But for the code generation the prefixes are kept and only the colon ":" is replaced by an underscore "_". (There are more characters to be replaced, see the paragraph on name translation in Section 2.6.) So the selected prefixes appear in the name of the generated Java classes and should be selected accordingly.
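The two different roles of the prefix can be sketched as follows. The class `NsName` is a hypothetical, hand-written stand-in and not the Tdom runtime class; the prefixes and the URI are example values. It assumes only what the text says: equality is decided by namespace URI and local name, while the Java name keeps the prefix and replaces the colon (and, per the name translation rule, "-" and ".") by an underscore.

```java
import java.util.Objects;

class NamespaceDemo {
  public static void main(String[] args) {
    NsName a = new NsName("h", "http://www.w3.org/1999/xhtml", "p");
    NsName b = new NsName("xhtml", "http://www.w3.org/1999/xhtml", "p");
    System.out.println(a.equals(b));     // true: prefixes are ignored
    System.out.println(a.javaName());    // h_p
    System.out.println(b.javaName());    // xhtml_p
  }
}

// Hypothetical name class for illustration; NOT the Tdom runtime class.
final class NsName {
  final String prefix;
  final String uri;
  final String local;

  NsName(String prefix, String uri, String local) {
    this.prefix = prefix;
    this.uri = uri;
    this.local = local;
  }

  // Runtime logic: only the namespace URI and the local name count.
  @Override public boolean equals(Object o) {
    if (!(o instanceof NsName)) return false;
    NsName n = (NsName) o;
    return uri.equals(n.uri) && local.equals(n.local);
  }

  @Override public int hashCode() { return Objects.hash(uri, local); }

  // Code generation: the prefix is kept, ':' '-' '.' become '_'.
  String javaName() {
    return (prefix + ":" + local).replaceAll("[-:.]", "_");
  }
}
```

The sketch shows why the prefixes should be chosen with care: they do not influence equality, but they end up in the names of the generated classes.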

## ^ToC 2.3 The own XML Document Id

With the PIs ...

  --- or --- 

... the XMLDocumentIdentifier of the DTD file itself is made known to Tdom. It will be stored in the generated DTD class and is accessible by the method getDocumentId().

## ^ToC 2.4 Pre-Defined Class Infra-Structure

The classes generated by Tdom to represent Elements, Attributes and sub-contents of the XML document are derived from pre-defined runtime classes. These classes provide basic functionalities.

    INTERFACES:   eu.bandm.tools.tdom.runtime.
      TypedContent
      TypedElement.MixedContentContainer
      | TypedElement.PCDataContainer
      Visitable
      ImpliedAttribute
      Identifiable
      ...   // etc

    CLASSES:      eu.bandm.tools.tdom.runtime.
      TypedDTD
      | pkg.DTD                      <<<<< GENERATED once for each Tdom model
      TypedNode
      | TypedDocument
      | TypedSubstantial
      | | TypedPCData IMPLEMENTS Matchable  // FIXME visitable / matchable??? Which one applies??
      | | TypedElement IMPLEMENTS TypedContent
      | | | pkg.Element              <<<<< GENERATED once for each Tdom model
      | | | | pkg.Element_           <<<<< GENERATED once for each ELEMENT declaration
      | | | | pkg.Element_           <<<<<     "
      | TypedEthereal
      | | TypedComment
      | | TypedProcessingInstruction
      | TypedSubtree IMPLEMENTS TypedContent
      | | TypedChoice
      | TypedAttribute
      | | CDataAttribute
      | | EnumerationAttribute
      | | | pkg.Element_.Attr_       <<<<< GENERATED for each attribute declaration
      | | NmTokenAttribute
      | | | IdAttribute
      | | | IdRefAttribute
      | | | pkg.Element_.Attr_       <<<<< GENERATED for each attribute declaration
      | | NmTokensAttribute
      | | | IdRefsAttribute
      | | | pkg.Element_.Attr_       <<<<< GENERATED for each attribute declaration
      | ...
      TypedElement.MixedContent IMPLEMENTS TypedContent
      | pkg.Element_.Content         <<<<< GENERATED once for each mixed-content ELEMENT
      TypedElement.MixedContentFactory
      //...

## ^ToC 2.5 Generated Java Classes for the top-level DTD

For a beginner, the different classes and instances called "dtd" or sim. may be confusing:

1. In most cases, the DTD to translate into Java source files is given to the implementation of the Tdom compiler as a DTD text file, as it is defined in [xml]. See Section 7.1 below.
2. In the generated package "pkg", a class called "pkg.DTD" will be generated. This is the central means for retrieving all kinds of reflective information, when later running the generated code.
3. For this purpose, it inherits from <METATOOLS>/tdom/tdom/runtime/TypedDTD .
4. The method <METATOOLS>/tdom/tdom/runtime/TypedDTD/getInterfaceInfo delivers different class objects, and an instance of <METATOOLS>/tdom/tdom/runtime/TypedDTD.AbstractElementInfo .
5. Esp. the generated pkg.DTD provides a public static field called "dtd", which gives access to one instance of itself.
6. The original DTD source text must be accessible to the package's initialization code in a text file named "original.dtd".
7. This is parsed on initialization, and the resulting instance of <METATOOLS>/dtd/DTD.Dtd, which is an umod model, is made accessible by <METATOOLS>/tdom/tdom/runtime/TypedDTD/getDTD().

## ^ToC 2.6 Generated Java Classes for Element declarations. General Name Translation.

For each <!ELEMENT tag ...> declaration in the source DTD, a node class is generated in the generated package "pkg". Instances of this class will be used to represent the document's sub-trees corresponding to this element. Such a class is called "element class" in the following. Its name is Element_<tag'>, where <tag'> is the DTD name translated to a Java name.

The name translation from DTD to Java is necessary for all kinds of names, such as element tags, attribute names, entity names etc. In all these cases, every single occurrence of a minus sign "-", a colon ":" or a dot "." is replaced by a single underscore "_".
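The replacement rule above is simple enough to restate as code. The following is a hand-written sketch, not the generator's own implementation (which is not shown here); the method name `toJavaName` and the example names are assumptions made for illustration. It also shows that two distinct DTD names can be mapped to the same Java name.

```java
class NameTranslation {
  // Replace every '-', ':' and '.' by '_', as described in the text.
  static String toJavaName(String dtdName) {
    return dtdName.replace('-', '_').replace(':', '_').replace('.', '_');
  }

  public static void main(String[] args) {
    System.out.println(toJavaName("xhtml:block-content"));  // xhtml_block_content
    // Two different DTD names, one Java name:
    System.out.println(toJavaName("a-b"));                  // a_b
    System.out.println(toJavaName("a.b"));                  // a_b
  }
}
```

Since the translation is not injective, a clash of this kind only surfaces when the generated sources are compiled.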

The Tdom tool does not check whether ambiguities are created by this translation. Instead, you will get error messages from the subsequent Java compilation process, accordingly. This will happen for an input like

  

For those <!ELEMENT tag ...> declarations which can serve as the top-level node of a document, a further class named Document_<tag'> is created. This inherits from <METATOOLS>.tdom.runtime.TypedDocument and additionally contains the methods for parsing a document as a whole from some external source (SAX or W3C-DOM). Whether a given element is such a top-level one is encoded in the DTD by process instructions: The positive case (=yes, E can be top-level element) is indicated in the DTD by the process instruction

  

The negative case by

  

For all those element declarations which are not explicitly mentioned in this way the default is defined by

  -- or -- 

Please note that you can even derive further hand-coded sub-classes from generated element classes. This is possible because, whenever a typed element needs to know the identity of the class it is an instance of (e.g. for visiting or for serialization), it does not use the Java language .getClass()-method, but the generated method public int getTagIndex(), which will be inherited by your derived classes and is specified in TypedElement.getTagIndex().
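Why a tag index makes hand-coded sub-classes safe can be seen in a minimal sketch. All classes below are hypothetical stand-ins written by hand; in particular the constant `TAG_INDEX` and the concrete index values are assumptions, only the method name `getTagIndex()` is taken from the text.

```java
class TagIndexDemo {
  // Dispatch on the tag index instead of on getClass().
  static String describe(BaseElement e) {
    switch (e.getTagIndex()) {
      case Element_p.TAG_INDEX:   return "p";
      case Element_div.TAG_INDEX: return "div";
      default:                    return "unknown";
    }
  }

  public static void main(String[] args) {
    BaseElement generated = new Element_p();
    BaseElement derived = new MyParagraph();
    System.out.println(generated.getClass() == derived.getClass());       // false
    System.out.println(generated.getTagIndex() == derived.getTagIndex()); // true
    System.out.println(describe(derived));                                // p
  }
}

// Hypothetical stand-ins for the runtime base class and two element classes.
abstract class BaseElement {
  public abstract int getTagIndex();
}

class Element_p extends BaseElement {
  static final int TAG_INDEX = 0;   // assumed value
  @Override public int getTagIndex() { return TAG_INDEX; }
}

class Element_div extends BaseElement {
  static final int TAG_INDEX = 1;   // assumed value
  @Override public int getTagIndex() { return TAG_INDEX; }
}

// A hand-coded sub-class simply inherits the tag index of its parent.
class MyParagraph extends Element_p { }
```

Code that compared `getClass()` would treat `MyParagraph` as an unknown kind of node, whereas the inherited index keeps it indistinguishable from a plain `Element_p` for dispatching purposes.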

The tag name of a given element can be read statically, when the class is known, by ...

 Element_<>.TAG_NAME 

...due to the generated definition ...

 public static final String TAG_NAME ; 

The dynamic way is defined in <METATOOLS>/tdom/runtime/TypedElement by

    NamespaceName el.getName()
    String        el.getTagName()
    String        el.getNamespaceURI()
    String        el.getLocalName()

### ^ToC 2.6.1 Abstract Java Classes as Realisations of DTD Alternatives

The Tdom tool does support some abstraction of isomorphic content definitions. This abstraction is rather limited, due to the nature of DTD, but nevertheless an important means for increasing re-usability, while preserving static type safety.

Any content model which is an undecorated choice expression of element references can be translated into an abstract class, in the Java terminology. The benefit comes from the further consequences: (a) all element declarations referred to in this alternative will be translated into an element class which is derived from this abstract class, and (b) whenever this choice clause will appear in a certain content model, it will be replaced by a simple reference to this abstract class. This happens independently of the sequential order of the alternatives, and also when the choice alternatives are a true subset of the alternatives of a bigger choice clause.
In the case of nested choices, the largest sub-expression of every choice which matches such an abstraction is replaced by the corresponding abstract class.
(This statement must be refined for overlapping definitions, see below.)
The single inheritance property of Java implies that each element may appear at most in one of these declarations.

This mechanism is controlled by a tdom process instruction like

  
      ==> yields a code interface containing

      class Element_x {
        public Element_a[] getElems_1_a() {...}
        public void setElems_1_a() {...}
      }

where the setter method simply accepts any instance of class b, c, or d.

A more realistic example, simplified from our XHTML model:

  

The last example shows that these definitions may be nested in an a-cyclic way.

Whenever such a choice expression is defined by the contents of a DTD "parameter entity", this entity can be used directly. Its contents will define the subclasses, and its name will be used as the name of the abstract class. This is shown in the first line of the example above. The entity's expansion text may carry further decoration which will be stripped to get the alternative expressions, as in

  

This abstraction mechanism significantly increases reusability and versatility: With the XHTML example, it is possible now to collect the very different sub-classes of Element_block_content into one single storage, e.g. an ArrayList<Element_block_content>, and later insert this sequence into an arbitrarily chosen instance of Element_object, Element_map, Element_fieldset, Element_noscript, Element_body, or Element_blockquote.

Without this abstraction, the first, collecting step would already require the wrapping of the elements into a certain alternative of a certain choice type of a certain hosting element class, and the collected sequence could not be used anywhere else, without "hacking" and losing static type safety.
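The gain in reusability can be sketched with a few hand-written classes. They are hypothetical stand-ins that merely borrow the naming pattern of the text; `Element_p` and `Element_ul` are assumed members of the block content alternative, and the constructors shown are not the generated ones.

```java
import java.util.ArrayList;
import java.util.List;

class AbstractChoiceDemo {
  public static void main(String[] args) {
    // Collect very different element kinds in one statically typed list ...
    List<Element_block_content> collected = new ArrayList<>();
    collected.add(new Element_p("hello"));
    collected.add(new Element_ul());

    // ... and insert the same sequence into different hosting elements.
    Element_body body = new Element_body(collected);
    Element_blockquote quote = new Element_blockquote(collected);
    System.out.println(body.content.size() + " " + quote.content.size()); // 2 2
  }
}

// Abstract class standing for the choice expression.
abstract class Element_block_content { }

// Each alternative of the choice derives from the abstract class.
final class Element_p extends Element_block_content {
  final String text;
  Element_p(String text) { this.text = text; }
}

final class Element_ul extends Element_block_content { }

// Hosting elements refer to the abstract class only.
final class Element_body {
  final List<Element_block_content> content;
  Element_body(List<Element_block_content> c) { content = new ArrayList<>(c); }
}

final class Element_blockquote {
  final List<Element_block_content> content;
  Element_blockquote(List<Element_block_content> c) { content = new ArrayList<>(c); }
}
```

Adding an instance of a class outside the alternative to `collected` would be rejected by the Java compiler, so the static type safety is preserved.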

The behaviour in case of overlapping choice expressions is not fully defined:

  // not clear what will happen here: 

So please look into the generated code to find out, or avoid such definitions.

Up to here we assumed that the name of the abstract class is a fresh name.
A different case is to declare an existing element declaration as abstract. The preconditions and consequences are the same, the contents model of the element must be an undecorated choice clause.

Additionally, the corresponding node is no longer represented in the generated model by an object instance of its own, but is represented indirectly, by an instance of one of its sub-classes, which corresponds to an element from the choice expression of its content model. This instance transparently represents two or even more nodes of the conceptual document tree, namely the chosen leaf element and the containing, abstract element. "Transparently" means that visitor code and attribute accessing methods are not affected.

FIXME IS THIS CORRECT??? EXAMPLE ??? The code in TypedDOMGenerator does NOT seem to support a "<?tdom abstract node?>" !?

This method is frequently employed in the Tdom /Xantlr co-operation, to eliminate unnecessary nodes which only present alternatives in the derivation tree. This is controlled by a "options {xmlNodeType=abstract;}" option in the Xantlr grammar file, which is translated to the Tdom PI automatically, see chapter 8 below.

## ^ToC 2.7 Name Mangling from DTD Elements' Contents and Sub-Contents to Java Classes

The mapping rules between DTD and Java class definitions act locally on each "<!ELEMENT...>" declaration. The structure of the generated Java classes and their naming convention immediately reflect the usage of parentheses in the regular expression describing the element's contents, as given by the compiled DTD.
Please note that there is no implicit normalization of DTD content models. For the name mangling purpose there is a difference between the content models

  a, (b, c) 

...and ...

  a, b, c 

Name mangling is basically defined on sequences.

Therefore, as a first step, the top-level of the element's content definition is always interpreted as such a sequence , possibly of length 1(one).

Each sequence consists of content particles. Each such content particle is ...

1. either an element reference,
2. or an embedded choice,
3. or an embedded sub-sequence.

All content particles may be decorated with a quantification symbol "?", "*" or "+". The naming convention assigns position numbers separately to (a) to all references to elements of a certain tag, (b) to all embedded choices and (c) to all embedded sub-sequences.

From these numbers (and the tag strings in case (a)) particle names are generated.
E.g. the sub-particles of the top-level in the following DTD content model are addressed by the particle names put beneath:

  | | | | | | | Seq_1 | | Choice_1 Seq_2 | | Elem_1_TB Elem_1_TA Elem_2_TA 

The quantification symbols "?", "*" or "+" and all parentheses around singletons (i.e. not enclosing sub-sequences or alternatives) are ignored in the definition of particle names.
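The numbering scheme described above can be restated as a small algorithm. The sketch below is hand-written and not taken from the Tdom sources; the encoding of a top-level sequence as a list of strings ("#choice", "#seq", or an element tag) is an assumption made to keep the example short. Quantifiers are simply absent from this encoding, since they do not influence the names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class ParticleNames {
  // Assigns particle names to the members of one sequence.
  // Counters are kept separately per element tag, for choices, and for
  // sub-sequences, as described in the text.
  static List<String> name(List<String> particles) {
    Map<String, Integer> perTag = new HashMap<>();
    int choices = 0;
    int seqs = 0;
    List<String> result = new ArrayList<>();
    for (String p : particles) {
      if (p.equals("#choice")) {
        result.add("Choice_" + (++choices));
      } else if (p.equals("#seq")) {
        result.add("Seq_" + (++seqs));
      } else {
        int n = perTag.merge(p, 1, Integer::sum);  // position among same-tag references
        result.add("Elem_" + n + "_" + p);
      }
    }
    return result;
  }

  public static void main(String[] args) {
    // A sequence shaped like:  TB, (..|..), (..,..), TA, TA, (..,..)
    System.out.println(name(Arrays.asList("TB", "#choice", "#seq", "TA", "TA", "#seq")));
    // [Elem_1_TB, Choice_1, Seq_1, Elem_1_TA, Elem_2_TA, Seq_2]
  }
}
```

For embedded sequences and choices the same function would be applied again to their own members, which is what the recursive scheme of the following sections amounts to.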

If the top-level content model as written in the DTD is an alternative, the top-level content for Tdom is considered as a singleton sequence.
In this case we get a very simple top-level naming, like ...

  | Choice_1 | Choice_1 

## ^ToC 2.8 Inner Classes Generated for Sub-Content

For each sub-sequence and each choice contained in the top-level sequence, an inner class is defined in the class representing the element. The names of these classes are identical with the particle names, as defined above.

An instance of such an inner class must be used as an argument for the constructors and for update methods, and is returned as result of a corresponding retrieve method.

Please note that in the current implementation there is no algebraic equality defined on content models. Therefore in the example above the types of Seq_1 and Seq_2 are not compatible, in spite of having the same contents definition. The same fact holds for embedded choices.

## ^ToC 2.9 Retrieval, Update and Visit Methods

Built upon the particle names, the Java class generated for the DTD element "NC" provides methods for retrieving and updating the contents of a given instance. Which methods are generated is controlled by the quantification decoration of the content particle.

Let <pname> be the particle name, and <plural> be its plural form (i.e. "Elems_1_TA" for "Elem_1_TA", "Choices_2" for "Choice_2" and "Seqs_1" for "Seq_1").
Let <pclass> be the class representing a sub-content (i.e. "Element_TA" for an Element reference, and "Element_NC.Seq_1" or "Element_NC.Choice_2" for embedded sub-content).

In case of undecorated particles the generated methods are ...

    public <pclass> get<pname>(){..}            // deliver current content
                                                // this is always != null
    public <pclass> set<pname>(<pclass> e){..}  // update current content
                                                // if e==null, throw exception

If the modifier "?" is present, we get ...

    public <pclass> get<pname>(){..}            // deliver current content,
                                                // but may return null
    public <pclass> set<pname>(<pclass> e){..}  // update current content
                                                // and accept null as argument
    public boolean has<pname>(){..}             // return whether component is currently
                                                // contained in the higher-level content

If one of the modifiers "*" or "+" is present, we get ...

    public <pclass>[] get<plural>(){..}                  // deliver whole sequence as an array
    public <pclass> get<pname>(int pos){..}              // deliver content at the given position
                                                         // of the sequence
    public <pclass>[] set<plural>(<pclass>[] e){..}      // update current content totally
                                                         // to a whole sequence
    public <pclass> set<pname>(int pos, <pclass> e){..}  // update current content
                                                         // at the given position
    public int count<plural>(){..}                       // return number of components currently
                                                         // contained in the higher-level content
    public void visit<plural>(Visitor v){..}             // apply visitor to all particles in the
                                                         // sub-content

Please note that for your convenience every "set<>()"-method (including those described in the following sections!) always returns the old, overwritten value as its result.
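The three groups of methods can be tried out on a hand-written stand-in. The class `NC` below is not generated code; it assumes a content model of the shape (TA, TB?, TC*), represents the sub-elements by plain strings, and chooses `IllegalArgumentException` for the null check, since the text does not say which exception is thrown. The method names follow the naming pattern of this section and are themselves assumptions.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class AccessorDemo {
  public static void main(String[] args) {
    NC nc = new NC("a1");
    System.out.println(nc.setElem_1_TA("a2"));     // a1 (the old value)
    System.out.println(nc.hasElem_1_TB());         // false
    nc.setElem_1_TB("b");
    System.out.println(nc.hasElem_1_TB());         // true
    nc.setElems_1_TC(new String[] {"c1", "c2"});
    System.out.println(nc.setElem_1_TC(1, "c3"));  // c2 (the old value)
    System.out.println(nc.countElems_1_TC());      // 2
  }
}

// Hand-written stand-in for an element class with content (TA, TB?, TC*).
final class NC {
  private String ta;                                 // undecorated: never null
  private String tb;                                 // "?": may be null
  private final List<String> tc = new ArrayList<>(); // "*": a sequence

  NC(String ta) { setElem_1_TA(ta); }

  String getElem_1_TA() { return ta; }
  String setElem_1_TA(String e) {
    if (e == null) throw new IllegalArgumentException("TA must not be null");
    String old = ta; ta = e; return old;
  }

  String getElem_1_TB() { return tb; }
  String setElem_1_TB(String e) { String old = tb; tb = e; return old; }
  boolean hasElem_1_TB() { return tb != null; }

  String[] getElems_1_TC() { return tc.toArray(new String[0]); }
  String getElem_1_TC(int pos) { return tc.get(pos); }
  String[] setElems_1_TC(String[] e) {
    String[] old = getElems_1_TC();
    tc.clear();
    tc.addAll(Arrays.asList(e));
    return old;
  }
  String setElem_1_TC(int pos, String e) { return tc.set(pos, e); }
  int countElems_1_TC() { return tc.size(); }
}
```

Returning the overwritten value from every setter allows a caller to move a sub-node elsewhere in one step, without a preceding get call.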

## ^ToC 2.10 Inner Classes Generated for Embedded Sequences

In case of embedded sequences, the whole top-level procedure (particle naming scheme, inner class definition for sub-structures and generation of methods) is simply applied recursively.

A difference in the implementation is that the classes for sub-content (i.e. embedded Sequences and Choices) are not inner classes of the inner class representing the sub-content, but reside as direct inner classes of the element's class.
The nesting is only represented by their name, which is a concatenation of the particle names of all levels, connected by an underscore "_".

The following example shows some of the get methods and the resulting types (classes). The names of both are again constructed by the particle names:

    nc.getSeq_1()                             =>  NC.Seq_1
    nc.getSeq_1().getSeq_1(3)                 =>  NC.Seq_1_Seq_1
    nc.getSeq_1().getSeq_1(3).getElem_1_TB()  =>  Element_TB

## ^ToC 2.11 Inner Classes Generated for Choices

The inner classes generated for choices are sub-classes of the pre-defined runtime class TypedChoice. Additionally, for each alternative of a choice an inner class is generated, which is again a sub-class of this "typed choice class". The name of such an alternative class is the name of the choice class with the appendix "_Alt_<n>".

This "<n>" used to identify an alternative is the position number w.r.t. the containing choice in the original DTD formula. This numbering starts with 1(one) !

In our example from above, the naming is ...

    Choice_1
      Choice_1_Alt_1
      Choice_1_Alt_2
      Choice_1_Alt_3

The methods generated for the choice class are ...

    public class Element_NC extends eu.bandm.tools.tdom.Element {
      ...
      public Choice_1 setChoice_1(Choice_1 e){...}  // change content accordingly.
      public Choice_1 getChoice_1(){...}            // deliver current content

      public abstract class Choice_1 extends TypedChoice {
        public int getAltIndex(){...}          // deliver the index of the
                                               // currently contained alternative
        public Choice_1_Alt_1 toAlt_1(){..}    // convert to the corresponding class,
        public Choice_1_Alt_2 toAlt_2(){..}    // if current content represents this
                                               // alternative. Otherwise, return null
        public boolean isAlt_1(){..}           // return true iff current content
        public boolean isAlt_2(){..}           // is of the mentioned alternative.
        ...
      }

      public class Choice_1_Alt_1 extends Choice_1 {
        ...  // update/retrieve/visit methods like a top-level element/sequence class !!
      }
    }

The contents of each Choice_<m>_Alt_<n> class is again treated as a sequence (possibly a singleton sequence), and the top-level naming and code generation scheme is applied recursively.
Again, no further nesting of inner classes will be applied, but the representing classes are direct inner classes of the element's class, and their names created by concatenation of the naming particle hierarchy.
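How the choice class and its alternative classes cooperate can be shown with a hand-written sketch. The classes below only mimic the method names given above; they are top-level classes instead of inner classes, do not extend TypedChoice, and carry made-up payload fields (`text`, `number`), all of which are simplifications for illustration.

```java
class ChoiceDemo {
  public static void main(String[] args) {
    Choice_1 c = new Choice_1_Alt_2(42);
    System.out.println(c.getAltIndex());      // 2 (numbering starts with 1)
    System.out.println(c.isAlt_1());          // false
    System.out.println(c.toAlt_1() == null);  // true
    System.out.println(c.toAlt_2().number);   // 42
  }
}

// Stand-in for the choice class: one abstract class per choice.
abstract class Choice_1 {
  abstract int getAltIndex();

  boolean isAlt_1() { return getAltIndex() == 1; }
  boolean isAlt_2() { return getAltIndex() == 2; }

  // Conversion returns null if a different alternative is contained.
  Choice_1_Alt_1 toAlt_1() { return isAlt_1() ? (Choice_1_Alt_1) this : null; }
  Choice_1_Alt_2 toAlt_2() { return isAlt_2() ? (Choice_1_Alt_2) this : null; }
}

// One concrete class per alternative, each with its own content.
final class Choice_1_Alt_1 extends Choice_1 {
  final String text;
  Choice_1_Alt_1(String text) { this.text = text; }
  @Override int getAltIndex() { return 1; }
}

final class Choice_1_Alt_2 extends Choice_1 {
  final int number;
  Choice_1_Alt_2(int number) { this.number = number; }
  @Override int getAltIndex() { return 2; }
}
```

A caller typically tests with `isAlt_<n>()` or switches on `getAltIndex()` and then converts with `toAlt_<n>()`, after which the accessors of the alternative are available with their precise static types.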

An example for retrieving:

    nc.getChoice_1()                           =>  NC.Choice_1
    nc.getChoice_1().toAlt_3()                 =>  NC.Choice_1_Alt_3
    nc.getChoice_1().toAlt_3().getChoice_1(8)  =>  NC.Choice_1_Alt_3_Choice_1

## ^ToC 2.12 Text Content and Mixed Content

Mixed content and plain character content are treated specially. Mixed content could be considered a "choice-type with *-quantification", but in contrast to the standard implementation described above, the layer which explicitly addresses the choices is skipped for the sake of the user's convenience.

Instead, a specialized Content class is defined in the element's implementing class, which can contain either character data, or one of the elements listed in the mixed content declaration.

So the DTD definition ...

  

...is translated to ...

    public class Element_NB extends Element
        implements TypedElement.TypedPCDataContainer ... {
      public static class Content extends TypedElement.MixedContent {
        ...
      }
      public List getContent(){..}   // returns the modifiable list of particles
      public String getPCData() {return getPCData(this);}   // convenience function
    }

    public class Element_NC extends Element
        implements TypedElement.MixedContentContainer ... {
      ...
      public static class Content extends TypedElement.MixedContent {
        public Content (Element_TA el){...}      // create the variant with element TA
        public boolean isElement_TA(){...}       // returns whether content particle is a TA
        public Element_TA toElement_TA(){...}    // returns casted content or null
        public Content (Element_TB el){...}      // create the variant with element TB
        public boolean isElement_TB(){...}       // returns whether content particle is a TB
        public Element_TB toElement_TB(){...}    // returns casted content or null

        // inherited from TypedElement.MixedContent :
        public Content (String s){...}           // create the variant with pcdata
        public Content (TypedPCData s){...}      // ditto
        public boolean isPCData(){...}           // returns whether content particle is PCData
        public TypedPCData toPCData(){...}       // returns casted content or null
      }
      ...
      public List getContent(){..}   // returns the modifiable list of particles
    }

    // to get the character content of the pcdata particles, you additionally need:
    public class TypedPCData extends TypedNode {
      ...
      public String getPCData(){}    // returns text content of this content particle
    }

Let Elx be a generated element class, and elx a reference to an instance of it. The character data of a given content particle can then be read as in

    for (Elx.Content c : elx.getContent()) {
      if (c.isPCData()) {
        String charSeq = c.toPCData().getPCData();
        // ... process charSeq here
      }
    }

This is rather tedious, of course.
The PCData objects themselves are algebraic: to change the text contents, you have to create a new instance and insert it into the list of elx.getContent(). For convenience there is a constructor which implies the new PCData():

  elx.getContent().add(new Elx.Content("text value")); 

All elements which are defined by the DTD wording (#PCDATA) or (#PCDATA)*, i.e. which are pcdata ONLY, are realized as instances of PCDataContainer, a sub-class of MixedContentContainer.

Please note that also in this case you can never make any assumption on how many content particles exist, the concatenation of which represents the plain text.

Anyhow, processing should not happen on this technical level of representation. Additionally, for convenience, these objects offer directly the method getPCData(), which concatenates all fragments into one string.
Setting the contents nevertheless requires creating the intermediate container level by executing elx.setContent(new Elx.Content("newstringvalue"));

Beside this low-level treatment there is a general method

 TypedElement { String getDeepPCData() ; } 

It descends the whole subtree rooted at the element and collects all character data recursively. This corresponds to the notion of "string-value" in XPath [XPath 1.0/5.2], to XPath's "string()" function and to "xsl:value-of" in [xslt1_0].
(The implementation requires the instantiation of a Visitor. This code is specific for the model, and thus realized in the generated code for Element.)
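The difference between the flat and the deep collection of character data can be sketched as follows. The class `Elem` is a hypothetical stand-in whose content list holds either strings (for PCData) or nested elements; the sketch uses plain recursion instead of a generated Visitor, so it illustrates the result only, not the actual implementation.

```java
import java.util.ArrayList;
import java.util.List;

class PCDataDemo {
  // Flat: concatenate only the character data directly contained in e.
  static String flat(Elem e) {
    StringBuilder sb = new StringBuilder();
    for (Object c : e.content) {
      if (c instanceof String) sb.append((String) c);
    }
    return sb.toString();
  }

  // Deep: descend into nested elements, like the XPath string-value.
  static String deep(Elem e) {
    StringBuilder sb = new StringBuilder();
    for (Object c : e.content) {
      if (c instanceof String) sb.append((String) c);
      else sb.append(deep((Elem) c));
    }
    return sb.toString();
  }

  public static void main(String[] args) {
    Elem em = new Elem("em").add("big ");
    Elem p = new Elem("p").add("Hello, ").add(em).add("world");
    System.out.println(flat(p));   // Hello, world
    System.out.println(deep(p));   // Hello, big world
  }
}

// Hypothetical mixed-content element: particles are Strings or Elems.
final class Elem {
  final String tag;
  final List<Object> content = new ArrayList<>();

  Elem(String tag) { this.tag = tag; }

  Elem add(Object particle) { content.add(particle); return this; }
}
```

Note that the text of `p` is spread over two separate PCData particles, which is why code should not rely on the number of particles.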

(The runtime class TypedElement offers both functionalities additionally wrapped into static function objects:
public static final Function<MixedContentContainer, String> getFlatPCData,
public static final Function<MixedContentContainer, String> getDeepPCData,
)

## ^ToC 2.13 Attributes

The definition of "attributes" in XML is rather awkward and impractical. E.g.

1. technically, the scope of a concrete attribute definition is local to one certain ELEMENT. But the pragmatics of all attributes with the same name are in most cases defined globally, w.r.t. the DTD as a whole, --- which is not represented syntactically.
2. the granularity of their "type system" is rather unbalanced;
3. the semantics of "ID" type attributes are non-compositional w.r.t. the validity of the containing document;
4. the declaration as "fixed" mixes the realms of type definition and of data;
5. all enumeration types (accidentally meeting in one element type) "should" have disjoint value sets ([xml], last sentence of section 3.3.1, says: "For interoperability, the same Nmtoken SHOULD NOT occur more than once in the enumerated attribute types of a single element type.");
6. etc

(Indeed we met well-experienced XSLT programmers who admitted that in their daily work the first step of every processing is the replacement of all attributes by additional ELEMENTs.)

The Tdom support of attributes is as follows:
For every pair of ELEMENT declaration and attribute definition a new inner class is defined in the element's class, which is derived from that subclass of <METATOOLS>/tdom/runtime/TypedAttribute which corresponds to the attribute's "type". (Only exception: Common classes for attributes of different elements as described in the next section.)

The naming convention for this inner class and for the retrieval/update methods is similar to that of content particles as described above.

These attribute objects serve as value storages for values, not as values: They are created with the element object automatically, but a value has to be assigned to them explicitly (by the user of the API or via the parsed XML source).

There are two retrieval methods:
element.readAttr_X() delivers the current attribute object. In case that this attribute has the default value, a common default object is returned. In this case the attempt to set a new value will result in an UnsupportedOperationException("!mutable"). But this is the better method to read an attribute value, because default objects can be shared. This method is also applied by all generated visitor code.

element.getAttr_X() delivers an individual object anyhow. The value of this object may be read and written. This method should only be used when writing is indeed intended, because the common default object is replaced by a dedicated, writable copy.
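The interplay of the shared default object and the writable copy can be sketched in a few lines. The classes below are hand-written stand-ins, not generated code; the attribute is reduced to a string value, and only the method names `readAttr_status()` and `getAttr_status()` and the exception message are taken from the text.

```java
class AttrDemo {
  public static void main(String[] args) {
    ElementA e1 = new ElementA();
    ElementA e2 = new ElementA();
    // Both elements read the same shared default object.
    System.out.println(e1.readAttr_status() == e2.readAttr_status());  // true
    try {
      e1.readAttr_status().setValue("failed");
    } catch (UnsupportedOperationException ex) {
      System.out.println(ex.getMessage());                             // !mutable
    }
    // getAttr_status() replaces the default by an own, writable copy.
    e1.getAttr_status().setValue("failed");
    System.out.println(e1.readAttr_status().getValue());               // failed
    System.out.println(e2.readAttr_status().getValue());               // passed
  }
}

// Simplified attribute object: a value cell that may be read-only.
final class AttrStatus {
  private String value;
  private final boolean mutable;

  AttrStatus(String value, boolean mutable) {
    this.value = value;
    this.mutable = mutable;
  }

  String getValue() { return value; }

  void setValue(String v) {
    if (!mutable) throw new UnsupportedOperationException("!mutable");
    value = v;
  }
}

final class ElementA {
  // One default object, shared by all elements (assumed default: "passed").
  private static final AttrStatus DEFAULT = new AttrStatus("passed", false);
  private AttrStatus own;   // null as long as the default applies

  AttrStatus readAttr_status() { return own != null ? own : DEFAULT; }

  AttrStatus getAttr_status() {
    if (own == null) own = new AttrStatus(DEFAULT.getValue(), true);
    return own;
  }
}
```

As long as only `readAttr_status()` is called, no attribute object is allocated per element, which is the sharing benefit mentioned above.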

The methods V getValue(), setValue(V) and a few others can be realized directly in the generated code, or inherited from the corresponding base classes in tdom.runtime, so have a look at that API doc and into the generated sources.
The type expression V depends on the attributes "type" as it appears in the DTD:

    DTD attribute "type"    Java type
    NMTOKEN                 String
    Id                      String
    IdRef                   String
    CData                   String
    Enumeration             Enum, dedicated type, generated for (and locally to) this attribute
    NMTOKENS                List
    IdRefs                  List

The setValue() method executes validity tests on its parameters, e.g. checks for allowed characters in all cases except CData. In case of enumeration type attributes, the value must be taken from the public inner enumeration class. Its name, and the names of its values, can be seen in the following example.

A value of ==null is only allowed for attributes which are specified as #IMPLIED, and means that the attribute is re-set to the state of being "unspecified" in the source document. All attributes which DO have a default value cannot be set to null but only to this value, but keep the status of being "specified". The method to read is getDefaultValue().

Additionally there is always a method getStringValue() which returns the current value as one single string, as it would appear in a document, and the method isSpecified(), indicating whether the value delivered is an explicit user-defined value, not the default value. The method getValue() returns null only for an unspecified attribute with #IMPLIED initialization.

Attributes which are declared as #REQUIRED must be defined whenever creating a new element object, as described below in Section 3.1.5.

So the following declaration ...
  

...generates code like ...

    package mySemiAst ;
    ...
    class Element_A extends mySemiAst.Element {
      ...
      public Attr_status readAttr_status() {..} // returns attribute object, maybe default
      public Attr_status getAttr_status() {..}  // returns own, writable attribute object
      public Attr_prefix readAttr_prefix() {..}
      public Attr_prefix getAttr_prefix() {..}
      public Attr_name readAttr_name() {..}
      public Attr_name getAttr_name() {..}

      public static class Attr_status extends ... .tdom.runtime.EnumerationAttribute {
        public static enum Value implements EnumerationValue {
          Value_passed("passed"), Value_failed("failed"), // etc ...
        }
        static final Value defaultValue = Value.Value_passed ;
        ...
        // inherited from .../tdom/runtime/EnumerationAttribute :
        public void setValue(Value value) { ... }
        public Value getValue() { ... }
        public String getValueString() { ... }
      }

      public static class Attr_prefix extends ... .tdom.runtime.NmTokenAttribute {
        static final String DEFAULT_VALUE = "test" ;
        public Attr_prefix(String value){...}      // create instance
        public Attr_prefix(){this(DEFAULT_VALUE);}
        ...
        // inherited from CDataAttribute:
        public String getValue(){...}              // retrieve current value
        public void setValue(String newvalue){...} // check if newvalue is an
                                                   // "NMTOKEN" and update current value
      }

      public static class Attr_name extends ... .tdom.runtime.CDataAttribute {
        static final String DEFAULT_VALUE = null ;
        public Attr_name(String value){...}        // create instance
        public Attr_name(){this(DEFAULT_VALUE);}   // author's note (ml to bt): is this correct ??
        ...
        // inherited from CDataAttribute:
        public String getValue(){...}              // retrieve current value
        public void setValue(String newvalue){...} // update current value
      }
      ...
    }

Setting the value of an attribute which is declared as "#FIXED" requires that the new value is equal to the current value. Otherwise an UnsupportedOperationException("#FIXED") is thrown.

## ^ToC 2.14 Common Classes for Common Attributes

So far, no two different attributes are ever assignment compatible, even if they carry the same name, type and default value. This corresponds to the definitions of DTDs, which do not impose any semantics on attributes, beside the mere string value.

To impose an abstraction on attributes, Tdom understands processing instructions like ...

  

The meaning is that at the top level of the generated package (i.e. not as a part of any ELEMENT's code) a stand-alone attribute class is generated. This class is named and behaves like the "local" attribute classes described above.

In many DTDs from practical use, common attributes are declared in ENTITYs, which are included in different ATTLISTs. In this case the effect of creating common base classes can be achieved for all attributes defined in such an entity by the processing instruction ...

  

In this case all entities "entA", "entB", "entC" must expand to complete attribute declarations (one or more), and the processing instruction is processed exactly as explained above, after expanding these entities.

Some remarks are practically important:

First: Currently only the "name" of the attributes is used for name mangling, so there can be only one common attribute class with a certain name. Tdom behaves like DTD (as ugly as it is !-) insofar as the first definition wins over any subsequent attempt to re-define. A warning is issued in this case.

Second: A common attribute is only recognized if all three dimensions (name, "type", and initial value) are exactly identical. So the following two declarations do not match:

  

The Tdom tool will issue a hint, whenever a common attribute is not recognized due to such minimal differences.

Third: The entity names themselves and the grouping of the attributes are in no way reflected; they are simply "unpacked" to a list of attributes, which are independently compiled as common attributes, as described.

## ^ToC 2.15 Attributes with Attribute Types "ID", "IDREF" and "IDREFS"

Attributes of "type ID, IDREF and IDREFS" are special because they are intended to model references between sub-trees of a document. An XML document is only valid if (a) there is at most one(1) attribute declaration of "type ID" in each element's attribute list, and (b) every instance of an IDREF attribute, and each single token in an IDREFS attribute, corresponds to exactly one(1) instance of an ID attribute carrying the same value (see [xml], "Validity constraint: One ID per Element Type" and "Validity constraint: IDREF").
Of course it is not really sensible to enforce these conditions at every intermediate step of a modification.
E.g. for changing the value of an ID attribute without violating these rules, first all referring IDREF/IDREFS tokens must be deleted. Then the attribute's value may be changed, and only after that all referring attributes may be visited a second time, to set them to the new value.

Therefore Tdom does not check ID/IDREF/IDREFS attributes by default. Instead, this can be done explicitly, when a model is completely constructed, by the following methods:

    // in the generated package:
    class Document_ {
      public ElementDictionary createDictionary() {
        // creates the id-based map string->Element
        // throws tdom.runtime.HomonymousIdException if one(1) id is used for
        //        two(2) different elements
        // throws tdom.runtime.SynonymousIdException if two(2) ids are used for
        //        one(1) element
      }
    }

    // in package tdom.runtime :
    /** Indicates the presence of an ID attribute. **/
    public interface Identifiable {
      /** delivers the current id, but does not supply an automatically generated one! */
      String getId() ;
    }
    class ElementDictionary{
      public E get(String s){
        //returns the element with the given id, or null
      }
    }
    class IdRefAttribute{
      public E getValue(ElementDictionary){
        //returns the element with the given id, or null
      }
    }
    class IdRefsAttribute{
      public java.util.List getValues(ElementDictionary){
        //returns the elements with the ids,
        // contained in the attribute's current value
      }
    }

There are some more useful methods for handling the mappings explicitly. Please refer to the api doc of the involved runtime classes.
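The idea of a deferred, explicit ID check can be sketched in a few lines of plain Java. This is only an illustration under stated assumptions: `Node` is a hypothetical stand-in for an element, the exception is a local class that merely borrows the name used above, and the real tdom.runtime classes may work differently.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class IdDictionarySketch {
    /** Hypothetical stand-in for an element with an optional ID and some IDREF tokens. */
    static class Node {
        final String id; // null if the element carries no ID attribute
        final List<String> idRefs = new ArrayList<>();
        final List<Node> children = new ArrayList<>();
        Node(String id) { this.id = id; }
    }

    /** Local stand-in for the exception named in the text, not the tdom.runtime class. */
    static class HomonymousIdException extends RuntimeException {
        HomonymousIdException(String id) { super("id used for two elements: " + id); }
    }

    /** Collects the map id -> element; called once, when the model is complete. */
    static Map<String, Node> createDictionary(Node root) {
        Map<String, Node> dict = new HashMap<>();
        collect(root, dict);
        return dict;
    }

    private static void collect(Node n, Map<String, Node> dict) {
        // put() returns the previous entry, so a non-null result means a duplicate id
        if (n.id != null && dict.put(n.id, n) != null) throw new HomonymousIdException(n.id);
        for (Node c : n.children) collect(c, dict);
    }

    /** Returns all IDREF tokens below n which refer to no element in the dictionary. */
    static List<String> danglingRefs(Node n, Map<String, Node> dict) {
        List<String> result = new ArrayList<>();
        for (String ref : n.idRefs) if (!dict.containsKey(ref)) result.add(ref);
        for (Node c : n.children) result.addAll(danglingRefs(c, dict));
        return result;
    }

    public static void main(String[] args) {
        Node root = new Node("a");
        Node child = new Node("b");
        child.idRefs.add("a");
        child.idRefs.add("x"); // refers to nothing
        root.children.add(child);
        Map<String, Node> dict = createDictionary(root);
        System.out.println(dict.keySet() + " dangling: " + danglingRefs(root, dict));
    }
}
```

Because the check runs only on demand, intermediate states of the model (e.g. while an ID is being renamed) never raise an error.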

## ^ToC 2.16 Ethereals: Comments and Processing Instructions as Second Class Inhabitants

It is not self-evident that Processing Instructions and Comments are part of a "model" in the narrow sense, and originally Tdom did not support them.

Anyhow, the requirements and application contexts are various, so it may be sensible to include them. We introduced them as "second class" inhabitants, which have to be attached to a "substantial" inhabitant for being stored and retrieved.

Every Element and every PCData fragment as a "Substantial" has two "decorative" sequences of "Ethereals" (see the symbolic class tree in Section 2.4 !), one "preceding" and one "following".

Furthermore, every Element and Document has a "leading" and a "trailing" sequence, which can be used if no Substantials are contained. The access methods are

    List TypedDocument.[get/read]LeadingEthereals()
    List TypedDocument.[get/read]TrailingEthereals()
    List TypedElement.[get/read]LeadingEthereals()
    List TypedElement.[get/read]TrailingEthereals()
    List TypedSubstantial.[get/read]PrecedingEthereals()
    List TypedSubstantial.[get/read]FollowingEthereals()

The ..read.. variant delivers a read-only-list (which can be shared iff empty), the ..get.. variant delivers a list the user can modify.

There are n+1 possibilities to store a sequence of n Ethereals w.r.t. the two neighbouring Substantials:

                     | IN           OR               OR               OR
    1st Ethereal     | el.leading   el.leading       el.leading       subel.preceding
    2nd Ethereal     | el.leading   el.leading       subel.preceding  subel.preceding
    3rd Ethereal     | el.leading   subel.preceding  subel.preceding  subel.preceding

    Ethereal between subel and subel2
                     | IN subel.following OR subel2.preceding

Where an Ethereal is stored does not have any meaning a priori. A parser is allowed to choose any solution, arbitrarily. Of course, on the next conceptual level the user may define a "meta-syntax" of relations and meanings. (E.g., is a comment related to the follower, or to the element just opened? Is a comment related to a processing instruction?) In such a case the model must be traversed and these relations constructed explicitly, implemented by additional data.
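The count of n+1 possibilities can be checked with a tiny enumeration. The sketch below is plain Java without any Tdom classes; the slot names are simply the strings used in the text, and the assumption is that the n Ethereals stand between the start of an element "el" and its first sub-element "subel".

```java
import java.util.ArrayList;
import java.util.List;

public class EtherealSplitSketch {
    /** Lists all ways to distribute n Ethereals over el.leading and subel.preceding. */
    static List<String> splits(int n) {
        List<String> result = new ArrayList<>();
        for (int k = n; k >= 0; k--) { // k = number of Ethereals stored in el.leading
            StringBuilder sb = new StringBuilder();
            for (int i = 0; i < n; i++) {
                if (i > 0) sb.append(' ');
                sb.append(i < k ? "el.leading" : "subel.preceding");
            }
            result.add(sb.toString());
        }
        return result;
    }

    public static void main(String[] args) {
        for (String s : splits(3)) System.out.println(s);
    }
}
```

Since the order of the Ethereals is fixed, a distribution is determined by the single split position k, which ranges over 0..n.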

# ^ToC 3 Construction of Tdom models

## ^ToC 3.1 Explicit Constructor Application for Elements and Sub-Content

Two central design issues of Tdom are (a) that all existing models at every instant of their life-time are type-correct sub-trees w.r.t. the corresponding DTD, and (b) that this is checked statically, at compile time, as far as possible.
Therefore the generated public constructors always require complete and type-correct contents as their argument.
As a consequence, a larger Tdom model must be constructed bottom-up, in a term-like fashion. This (at a first glance possibly annoying) strict discipline implies especially, that Tdom models are always finite by construction, --- a fact which cannot be guaranteed by the construction interface e.g. of W3C DOM. 2

Please note that constructing a large Tdom model by explicit constructor calls is a tedious task. Explicit constructor calls only make sense as the back-end of some automated translation procedures.
For constructing a Tdom model from a pre-existent XML text file one can use the SAX interface or the w3c Dom interface. These are described in Section 3.2 below.

### ^ToC 3.1.1 Creating Elements with Structured Contents, Statically Typed

The basic structural element, to which the generated Java constructors correspond, is again the sequence. So constructors are generated for top-level content regular expressions, considered as a sequence, and for all sub-sequences and alternatives, which are sequences again. All variants are illustrated by the following example:

 

...is translated into ...

    public class Element_NC extends Element {
      public Element_NC (Element_TA x1,
                         Element_NC.Choice_1 x2,
                         Element_NC.Seq_1[] x3,
                         Element_TG x4) {...}

      public abstract static class Choice_1 extends eu.bandm.tools.tdom.TypedChoice {...}

      public static class Choice_1_Alt_1 extends Choice_1 {
        ...
        public Choice_1_Alt_1(Element_TB x1){...}
        ...
      }
      public static class Choice_1_Alt_2 extends Choice_1 {
        ...
        public Choice_1_Alt_2(Element_TC x1, Element_TD[] x2){...}
        // please notice this array reflecting the "*"     ^^
        ...
      }

      public abstract static class Seq_1 extends eu.bandm.tools.tdom.TypedSubTree{
        ...
        public Seq_1 (Element_TE x1, Element_TF x2) {...}
        ...
      }
    }

This allows a constructor call like ...

    new Element_NC( aTA,
                    new Element_NC.Choice_1_Alt_1(aTB),
                    new Element_NC.Seq_1[0],
                    (Element_TG)null ),

For convenience an array parameter at the last position is declared as a "vararg", so that the explicit construction of an intermediate array for this position is not required (though still possible!):

  --> leads to --> new Element_NC( Element_A[] elems_A_1, Element_B... elems_B_1) 

### ^ToC 3.1.2 Creating Elements with Structured Contents, Dynamically Typed

For this purpose there is an untyped constructor ...

  new Element_NC (Element... elements) throws TDOMException { 

("Element extends tdom.runtime.TypedElement" is the top-level element class generated specially with this certain model, so the method is "not completely" untyped !-)

A TDOMException is thrown whenever the supplied sequence of Java objects cannot be mapped to the content model.
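What "cannot be mapped to the content model" means can be illustrated by treating the content model as a regular expression over tag names. The following sketch is not the Tdom implementation. It uses `java.util.regex` on a string of tag names, and the content model `(TA, (TB | (TC, TD*)), (TE, TF)*, TG?)` is only an assumption, derived from the shape of the generated constructors in Section 3.1.1.

```java
import java.util.regex.Pattern;

public class DynamicContentCheckSketch {
    // Assumed content model (TA, (TB | (TC, TD*)), (TE, TF)*, TG?),
    // written as a regular expression over space-terminated tag names.
    private static final Pattern MODEL =
        Pattern.compile("TA (TB |TC (TD )*)(TE TF )*(TG )?");

    /** Returns true iff the sequence of tag names can be mapped to the content model. */
    static boolean matches(String... tags) {
        StringBuilder sb = new StringBuilder();
        for (String t : tags) sb.append(t).append(' ');
        return MODEL.matcher(sb).matches();
    }

    public static void main(String[] args) {
        System.out.println(matches("TA", "TB"));
        System.out.println(matches("TA", "TC", "TD", "TD", "TE", "TF", "TG"));
        System.out.println(matches("TA", "TB", "TE")); // TE without its TF
    }
}
```

A dynamically typed constructor would raise its exception in exactly those cases where such a match fails, whereas the statically typed constructors exclude them already at compile time.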

Since the "vararg" arguments can be represented by an array, alternatives for content creation can also be defined by a pure expression, using the concatenation operations defined in <METATOOLS>/eu/bandm/tools/ops/Arrays , as used in our "Dtd to Html renderer" <METATOOLS>/eu/bandm/tools/dtm/HtmlRenderer according to this scheme:

    import eu.bandm.tools.ops.Arrays ;
    // ...
    final Element_html el_html =
      new Element_html
        (new Element_head
           (new Element_head.Choice_1[0],
            new Element_head.Choice_2_Alt_1 (new Element_title("windowTitle")),
            new Element_head.Choice_2[0]),
         new Element_body
           (Arrays.append
              ((htmlIsDynamic)
                 ?new Element_block_content[]
                     {new Element_noscript
                        (new Element_div (new Element_div.Content (""))
                           {@Override protected void initAttrs(){
                              getAttr_class().setValue(class_alert);
                            }})}
                 :new Element_block_content[0],
               new Element_block_content[]
                 {new Element_pre(preItems.toArray
                                    (new Element_pre.Content[preItems.size()])),
                  new Element_hr(),
                  makeFooter(basicFileName,
                             "http://bandm.eu/metatools/docs/usage/dtd.html#txt_dtd_tool")
                 }
              )));
    // ...
    Element_p makeFooter(String a, String b){...}
    List preItems = ...

The header part of the created html element is constructed statically typed. The body part is dynamically typed, using Arrays.append and function calls to write case distinctions in a fully compositional way.
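The compositional style relies on nothing more than array concatenation as an expression. A minimal sketch of such a helper is given below. The method `append` here is a hypothetical, locally defined function and makes no claim about the signatures in eu.bandm.tools.ops.Arrays; plain strings stand in for the element objects.

```java
import java.util.Arrays;

public class AppendSketch {
    /** Returns a new array holding the elements of a followed by those of b. */
    static <T> T[] append(T[] a, T[] b) {
        T[] result = Arrays.copyOf(a, a.length + b.length);
        System.arraycopy(b, 0, result, a.length, b.length);
        return result;
    }

    /** Builds a content array as one single expression, with an optional first part. */
    static String[] body(boolean dynamic) {
        return append(dynamic ? new String[] {"noscript"} : new String[0],
                      new String[] {"pre", "hr", "p"});
    }

    public static void main(String[] args) {
        System.out.println(Arrays.toString(body(true)));
        System.out.println(Arrays.toString(body(false)));
    }
}
```

The case distinction is written as a conditional expression yielding either a one-element or an empty array, so no intermediate mutable list is needed.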

(Please note that in our xhtml model "block_content" is an abstraction of different Element classes, controlled by a content model entity, as described in Section 2.6.1.)

### ^ToC 3.1.3 Creating Elements with Mixed or Pure PCData Contents, Statically Typed

In case of mixed content, e.g. a declaration like ...

  

...the generated constructors are ...

    public Element_NM (Element_NM.Content... content){...}
    public Element_NM (String content){...}

The second is a shortcut for creating an element with just text content; the canonical constructor is the first one, and all particles have to be wrapped into the correct content class, like in ...

    new Element_NM (new Element_NM.Content("characters with embedded TA "),
                    new Element_NM.Content(aTA),
                    new Element_NM.Content(new TypedPCData(" followed by a TB ")),
                    new Element_NM.Content(aTB)
                   )

The first argument is possible because of the short-cut constructor ...

    public class TypedElement.MixedContent {
      ...
      public MixedContent (String data){
        this(new TypedPCData(data));
      }
      ...
    }

Of course, instead of "vararg"-parameters you can always supply an array, e.g. delivered by Collection.toArray([]).

### ^ToC 3.1.4 Creating Elements with Mixed or Pure PCData Contents, Dynamically Typed

The dynamically typed constructor for elements with mixed contents has the signature

  new Element_NM (Object ...); 

It behaves like the other kind of dynamically typed constructor, i.e. it throws a TDOMException whenever the supplied sequence of Java objects cannot be mapped to the content model.

The techniques for constructing the argument list as described for structured content in Section 3.1.2 can be used accordingly.

### ^ToC 3.1.5 Creating Elements with Attributes

When a constructor is called for an element class which has #REQUIRED attributes, then these attributes have to be defined by the caller, before the construction process can be completed.

Syntactically, this is done by creating an instance of an anonymous, in-line class derived from the real element class. By overriding the method public void initAttrs() the caller has to set all attribute values. This method is called by the constructor. Then the constructor code performs a completeness (non-null) check on all required attributes, before it can successfully return to the caller.

  

...results in a constructor call like ...

    Element_A a = new Element_A((Element_B)null){
      @Override public void initAttrs(){
        this.getAttr_attA().setValue("");
        this.getAttr_attB().setValue("thisIsAToken");
      }
    } ;

This guarantees correctness w.r.t. the DTD at least dynamically.

In case a required attribute is not supplied, a TDOMException describing the error is thrown.
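The interplay of constructor, overridden initAttrs() and completeness check can be reproduced in a self-contained sketch. `ElementA` and the local `TDOMException` below are hypothetical stand-ins which only borrow the names from the text; they show the pattern, not the generated code.

```java
public class InitAttrsSketch {
    /** Local stand-in for the exception named in the text. */
    static class TDOMException extends RuntimeException {
        TDOMException(String msg) { super(msg); }
    }

    /** Hypothetical element class with one #REQUIRED attribute "attB". */
    static class ElementA {
        private String attB = null;

        ElementA() {
            initAttrs();          // hook: the caller supplies the attribute values
            if (attB == null)     // completeness check on required attributes
                throw new TDOMException("required attribute attB is not set");
        }

        /** Overridden in-line by the caller; does nothing by default. */
        protected void initAttrs() {}

        public void setAttB(String v) { attB = v; }
        public String getAttB() { return attB; }
    }

    public static void main(String[] args) {
        ElementA ok = new ElementA() {
            @Override protected void initAttrs() { setAttB("thisIsAToken"); }
        };
        System.out.println(ok.getAttB());
        try {
            new ElementA(); // no initAttrs() override: the check must fail
        } catch (TDOMException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

One Java detail is worth keeping in mind with this pattern: initAttrs() runs inside the base class constructor, so fields declared in the anonymous class itself are not yet initialized at that moment.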

## ^ToC 3.2 Automated Construction of Documents and Elements

As mentioned above, the way to construct large Tdom models from given text files is via the generated SAX parser or the generated W3C-DOM validator.

Both kinds of creation methods are only defined for the "Document_<tag>" classes, not for pure "Element_<tag>" classes. This is due to the fact that both construction methods possibly require global information, like namespace mapping and collections of "ID"-type attribute values, things not existing with simple elements.

Then you can create a Tdom Document object by calling one of the following methods:

    package ;
    import eu.bandm.tools.util.SAXEventStream ;
    import eu.bandm.tools.tdom.runtime.TDOMException ;
    import eu.bandm.tools.tdom.runtime.TypedDTD ;

    public final class DTD extends TypedDTD {
      createDocument_ (Element_ el) {...}
      createDocument_ (Element_ el) {...}
      ...
      createDocument_ (org.w3c.dom.Document document) throws TDOMException {...}
      createDocument_ (org.w3c.dom.Document document) throws TDOMException {...}
      ...
      createDocument_ (SAXEventStream s) throws TDOMException {...}
      createDocument_ (SAXEventStream s) throws TDOMException {...}
      ...
      createDocument_ (java.io.InputStream in) throws java.io.IOExcept {...}
      createDocument_ (java.io.InputStream in) throws java.io.IOExcept {...}
    }

The creation methods of the first kind complete the "manually bottom-up creation": You create a document by creating its top-level element and then giving it as an argument to the constructor.

For large documents the following methods are more convenient:

If the argument to "createDocument_<tag>()" is a W3C DOM, then this DOM object is validated against the DTD, and, in case of conformance, a Tdom model is returned. Otherwise a TDOMException is thrown.

If the argument to "createDocument_<tag>()" is a SAXEventStream, then the content models of the DTD must be LL(1), and the SAX events are consumed to construct the Tdom model. In case of non-conformance, a TDOMException is thrown.

A <METATOOLS>/util/SAXEventStream is an interface which provides access to a "frozen" sequence of SAX calls. This freezing is necessary because LL(1) parsing needs (what a surprise!) a look-ahead of depth 1, which is not provided when using the SAX interface directly.

The implementation currently provided is contained in <METATOOLS>/util/SAXEventQueue.
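Why a buffered event sequence is needed can be seen in a small sketch. The class `EventQueue` below is a hypothetical simplification (events are reduced to start-tag names) and does not reflect the interface of SAXEventStream or SAXEventQueue; the content model `(B, C*, D?)` is likewise just an example.

```java
import java.util.ArrayDeque;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;

public class LookaheadSketch {
    /** Hypothetical "frozen" event sequence; an event is just a tag name here. */
    static class EventQueue {
        private final Deque<String> events = new ArrayDeque<>();
        EventQueue(List<String> tags) { events.addAll(tags); }
        String lookahead() { return events.peekFirst(); } // null at end of input
        String consume() { return events.removeFirst(); }
    }

    /** Parses the content model (B, C*, D?) and returns the number of C children. */
    static int parseA(EventQueue q) {
        expect(q, "B");
        int countC = 0;
        // every decision is taken by inspecting exactly one pending event
        while ("C".equals(q.lookahead())) { q.consume(); countC++; }
        if ("D".equals(q.lookahead())) q.consume();
        if (q.lookahead() != null)
            throw new IllegalStateException("unexpected " + q.lookahead());
        return countC;
    }

    private static void expect(EventQueue q, String tag) {
        if (!tag.equals(q.lookahead()))
            throw new IllegalStateException("expected " + tag);
        q.consume();
    }

    public static void main(String[] args) {
        System.out.println(parseA(new EventQueue(Arrays.asList("B", "C", "C", "D"))));
    }
}
```

With plain SAX callbacks the parser is pushed one event at a time and cannot ask "what comes next?" before deciding whether a repetition or an optional part has ended; the queue turns this push interface into a pull interface with one event of look-ahead.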

The SAX interface's handling of attributes is rather complicated and expensive. Therefore, currently, we do not totally type-check the SAX event stream as such!
Of course, when there is no value for an attribute which is "required" as described by the DTD (i.e. neither declared as "#IMPLIED", nor having a default value), then a TDOMException is thrown.
But we do NOT check for "wild" attributes, i.e. attribute names which are not declared in the DTD and thus not represented in the model.
This is a violation of the "Validity constraint: Attribute Value Type" from [xml] , which says "The attribute MUST have been declared".
The same holds for the even more primitive "Well-formedness constraint: Unique Att Spec", which says "An attribute name MUST NOT appear more than once in the same start-tag or empty-element tag."

The format of the SAX event would make both checks rather expensive.

The practical problem is that these kinds of errors often result from a misspelled attribute name. But the absence of the attribute which was really meant will not be signalled iff it has a default value!

If you have to create large sub-structures of a Tdom model (e.g. starting with a top-level element Element_<tagX>) out of your own program code, it may be nevertheless the method of choice to use the SAX interface to create a complete Document_<tagX>:
Simply send SAX events to a <METATOOLS>/util/SAXEventQueue, the other side of which is consumed by the method aDTD.createDocument_<tagX>(SAXEventStream s).
Then extract the desired element by calling the (generated and therefore strongly typed) method ...

    public class Document_ extends ... tdom.runtime.TypedDocument {
      public Element_ getDocumentElement() {...} // returns top-level element
    }

For this purpose it is necessary to previously declare as "public" all those element declarations in the DTD which are intended as the top element of such sub-trees. This is described in Section 2.1.1 above, and tells Tdom to create the required Document_<> classes.

The last method (aDTD.createDocument_<tagX>(java.io.InputStream)) is related to our own compression method, and explained in Section 6.5.

The class <METATOOLS>/xantrltdom/TdomReader provides the glueing code between a file input stream or similar source of text, and the construction of a tdom model. Its usage is demonstrated in ../../examples/doctypes/xhtml/Makefile.

# ^ToC 4 Visitors

## ^ToC 4.1 The Generated Visitor Class and Deriving User Defined Visitors

As mentioned above, the most elegant way of processing a Tdom model to some other format is the application of Visitors.
With every Tdom model the base class Visitor.java is generated, from which you can derive your processing tools. This class defines a "visit(final <generatedClass> node)" method for each node class generated by Tdom . This includes element classes, classes representing sub-sequences, choices, alternatives and attributes. A user defines a transformation by deriving from this visitor class and overriding only those methods where he/she wants to extract some information or perform some update.

On the other side the generated Element class (which is the top of all generated element classes) implements the interface Visitable<Visitor>, and the method host(Visitor). This method is the counterpart, which causes the visitor to call its visit method on this. (This method is needed to apply a visitor to any element without knowing its concrete class at compile time.)

The definition of a derived visitor is most conveniently done by editing a copy of the generated VisitorTemplate.java. This file contains method declarations for all visit() methods acting on element classes. These empty method templates are preceded by a "Javadoc" comment which contains the corresponding content definition from the original DTD.
Please note that the method declarations for classes representing sub-content (e.g. "visit(Element_TA.Choice_1_Alt_2 x)") are not included in VisitorTemplate.java, but have to be added manually, whenever required.

((It may be convenient to have a look at "Visitor.java" for doing "copy and paste" on some more complicated method declarations of this kind.))

## ^ToC 4.2 Calling a User Defined Visitor

The Tdom visitors are of the most simple kind, compared to the more complex ones generated by umod . They only provide the above-mentioned single method per visited class, namely "visit(<generatedclass> x)". This method can be called from external ("hand-written") code for the initial invocation of the visitor. It is also used internally by the visit() code of the generated base visitor itself, for descending to its child nodes.

If the class of an object is known statically, this call is optimal w.r.t. performance. If the class is not known, there is the method "visit(generatedPackage.Element element)", which does a switch/case-based look up of the element's tag index.

All node classes support the method "x.host(Visitor v)". This method calls the most-narrowly statically typed "v.visit(x)" method of the visitor v. This allows visiting sub-content which contains choices without the need to know in advance which alternatives are present in the concrete model data.

This "x.host(Visitor v)" method is also realized by the generated Document_<> classes.
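The cooperation of visit() and host() is the classical double dispatch. The following self-contained sketch shows it with two hypothetical node classes, `ElementA` and `ElementC`; all names are invented for illustration and the generated Tdom classes contain considerably more.

```java
import java.util.ArrayList;
import java.util.List;

public class VisitorSketch {
    /** Top of all element classes; host() selects the statically typed visit(). */
    static abstract class Element {
        abstract void host(Visitor v);
    }

    static class ElementC extends Element {
        final String pcdata;
        ElementC(String pcdata) { this.pcdata = pcdata; }
        @Override void host(Visitor v) { v.visit(this); } // resolves to visit(ElementC)
    }

    static class ElementA extends Element {
        // the concrete classes of the children are not known statically
        final List<Element> children = new ArrayList<>();
        @Override void host(Visitor v) { v.visit(this); } // resolves to visit(ElementA)
    }

    /** Base visitor: does nothing but descend. */
    static class Visitor {
        void visit(ElementA el) { for (Element c : el.children) c.host(this); }
        void visit(ElementC el) {}
    }

    /** User defined visitor: overrides only the method of interest. */
    static class Sum extends Visitor {
        int result = 0;
        @Override void visit(ElementC el) { result += Integer.parseInt(el.pcdata); }
    }

    static int sum(Element root) {
        Sum s = new Sum();
        root.host(s); // works without knowing the concrete class of root
        return s.result;
    }

    public static void main(String[] args) {
        ElementA root = new ElementA();
        root.children.add(new ElementC("1"));
        ElementA inner = new ElementA();
        inner.children.add(new ElementC("41"));
        root.children.add(inner);
        System.out.println(sum(root));
    }
}
```

Calling `visit(x)` directly is possible whenever the static type of `x` is a concrete node class; `x.host(v)` is the fallback when only the abstract type is known.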

Further, all node classes which have repeated sub-content, like "elems_1_A", "choices_1" or "seqs_1", offer a method like "visit_choices_1(Visitor v)" which does the stepping through the sub-contents automatically, as mentioned already in Section 2.9 above.

## ^ToC 4.3 Default Visiting Strategy of Generated Visitors and User Defined Explicit Control

All these direct ways of calling (i.e. the skipping of a "match()" multiplexer as needed in the umod visitors) are possible because the structure of the model is almost completely statically defined at compile time, and because there are no specialization relations (= "inheritance") between distinct classes.
The only places where dynamic decisions are required come (a) from alternatives (including abstract classes) and (b) from quantification decorations "?", "*" or "+" in the original DTD.
The high performance of the Tdom visitors results from the fact that in both cases only simple and constant int values need to be considered, --- e.g. the result of "final int getAltIndex()", a method generated with every sub-class of TypedAlt and which returns the value of a static final int assigned to the generated class at compile time, or of "<>.count<plural>()" in case of repetitions.

The generated base visitor does nothing more than descending the document tree in depth-first textual order.

E.g. the DTD declaration ...

  

...generates a method like 3 ...

    package MySemiAst ;
    public class Visitor {
      ...
      public void visit (Element_A el){
        visit (el.getElem_1_B());
        for (int i = 0, n = el.countElems_1_C() ; i < n ; i++)
          visit(el.getElem_1_C(i));
        if (el.hasElem_1_D())
          visit(el.getElem_1_D());
      }
      ...
    }

Special processing of nodes of a certain class is implemented by deriving from this base class. If you want to descend into the sub-tree structure starting at the currently visited node el, you simply call "super.visit(el)", or you start with a new, specialized visitor:

    package transformations ;
    import MySemiAst.* ;

    public class Transform_1 extends Visitor {
      protected class SpecialTransformation extends Transform_1 { ... }

      @Override public void visit (Element_A el){
        final int value = Integer.parseInt (el.getElem_1_B().getPCData());
        // author's note: the following call DOES NOT WORK !!?? :-(
        new SpecialTransformation().visit(el.getElems_1_C());
        final int secondvalue = new Visitor(){
            protected int result = 0 ;
            public int process(Element el){visit(el); return result;}
            @Override public void visit (Element_C el){
              result += Integer.parseInt (el.getPCData());
            }
          }.process(el);
        super.visit(el);
      }
    }

The call graph for a content declaration "<!ELEMENT A ((B)*(..|..|..))>" can be symbolically sketched like ...

 visit(Element_A e) ---------> {for (i=0;i switch(c.getAltIndex()){ case 0:visit(c.toAlt_1()); case 1:visit(c.toAlt_2()); } visit(Element_A.Choice_1_Alt_1 a) ---> visit(...) 

...and the scheme for deriving the transformation tools "UserDefV" like ...

    Visitor.visit(Element_A e) ---------> {..visit(e.getElem_1_B)..} ; visit(...)
                                              |
       +--------------------------------------+
       V
    UserDefV.visit(Element_B e) ------> ---> super.visit(e)
                                                  |
       +------------------------------------------+
       V
    Visitor.visit(Element_B e) ---------->

# ^ToC 5 The Extension Mechanism

Tdom includes a mechanism for extending a model by one or more others. Its prime purpose is to support reuse of visitor based code.

On the source level, an extension has to be declared in the DTD. A short example is contained in metatools/examples/tdom/extend . Therein, the file arith.dtd , further simplified, reads as follows:

  

The "?tdom abstract" processing instruction is terminated by the ellipsis "...", which tells the Tdom code generator to provide for the plugging-in of more element types.
The same is done on the genuine DTD level by defining an "ENTITY" each, which is per default of course empty.

The usage of the extension mechanism can be seen in the file logic.dtd , which, slightly simplified, reads as follows:

  %arith.dtd; 

Again, the import is executed on the DTD level and on the Tdom level in parallel, requiring a seemingly redundant doubling.

FIXME: EXAMPLE BROKEN ??? SOMETHING HAS BEEN LOST HERE !?!?!

# ^ToC 6 Serialization and Conversions

## ^ToC 6.1 Generating SAX Events

Among the generated classes there is always a Dumper class, extending the Visitor class. Its constructors are defined as ...

    public Dumper(org.xml.sax.ContentHandler contentHandler) { ..}
    public Dumper(org.xml.sax.ContentHandler contentHandler,
                  org.xml.sax.ext.LexicalHandler commentHandler) { ..}

Whenever visit() is called on any element of the model, the corresponding SAX events [Sax04] are generated for this element and its attributes, for any related Ethereals (see Section 2.16) and for all elements contained therein recursively.

The LexicalHandler is optional and is used to receive TypedComment objects. If the first constructor is called and the argument happens to support both interfaces, it will be used in both roles.
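How a tree is turned into a sequence of SAX events can be sketched with the standard org.xml.sax interfaces from the JDK. The class `Node` and the methods `dump` and `trace` are hypothetical names for this illustration only; the generated Dumper additionally handles Ethereals, namespaces and the typed attribute classes.

```java
import java.util.Arrays;
import java.util.List;
import org.xml.sax.Attributes;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

public class SaxDumpSketch {
    /** Hypothetical, untyped stand-in for a model node with one optional attribute. */
    static class Node {
        final String tag, status, text;
        final List<Node> children;
        Node(String tag, String status, String text, Node... children) {
            this.tag = tag; this.status = status; this.text = text;
            this.children = Arrays.asList(children);
        }
    }

    /** Emits the SAX events for n, its attribute, its text and all children. */
    static void dump(Node n, ContentHandler h) throws SAXException {
        AttributesImpl atts = new AttributesImpl();
        if (n.status != null) atts.addAttribute("", "status", "status", "CDATA", n.status);
        h.startElement("", n.tag, n.tag, atts);
        if (n.text != null) h.characters(n.text.toCharArray(), 0, n.text.length());
        for (Node c : n.children) dump(c, h);
        h.endElement("", n.tag, n.tag);
    }

    /** Records the events in a simple textual form, for demonstration only. */
    static String trace(Node n) {
        final StringBuilder sb = new StringBuilder();
        try {
            dump(n, new DefaultHandler() {
                @Override public void startElement(String u, String l, String q, Attributes a) {
                    sb.append('<').append(q);
                    for (int i = 0; i < a.getLength(); i++)
                        sb.append(' ').append(a.getQName(i))
                          .append("='").append(a.getValue(i)).append('\'');
                    sb.append('>');
                }
                @Override public void characters(char[] ch, int start, int len) {
                    sb.append(ch, start, len);
                }
                @Override public void endElement(String u, String l, String q) {
                    sb.append("</").append(q).append('>');
                }
            });
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(trace(new Node("a", "ok", null, new Node("b", null, "hi"))));
    }
}
```

Any consumer of SAX events (a serializer, a validator, another model builder) can be plugged in as the ContentHandler, which is what makes this form of output generic.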

## ^ToC 6.2 Visualization of a Tdom Model

You can easily print a whole model or an arbitrary sub-tree to a console or to a text file by combining the above-mentioned SAX event generation with our ContentPrinter. E.g.:

    new Dumper(new ContentPrinter(new PrintWriter_flushing(System.out),
                                  true, true)
              ).visit(myModel);

## ^ToC 6.3 Format Generation for a Tdom Model

The class formatfrontends.Tdom2format contains an instantiation of the generic format compiler.

The generated code is a specialization of the Visitor class generated with the tdom code. It offers a public method "toFormat()" which translates a tdom model (or any sub-expression of it) into a format object. The outlines of such a converter called "myFormatter", generated for some Tdom generated package "myModel", are:

    import myModel.* ; // Element_A, Element_B, etc.

    public class myFormatter extends myModel.Visitor {
      // public interface
      public Format toFormat ( Visitable element){
        result=Format.empty;
        visit(element);
        return result ;
      }
      public int default_indent = 2 ;                   // can be modified
      public Format default_delimiter = Format.empty ;  // for debugging only

      // auxiliary functions
      protected Format __throwIt(){...}
      protected void visit (TypedPCData){...}
      protected Format toFormat_throwing(Visitable){...}    // if ==null then throw!
      protected Collection toFormat_array(Visitable[]){...} // "map"

      // user defined visitor functions
      public void visit (Element_A element){
        result = // format generating function
      }
      // etc.
    }

A "visit()" method is added for each node class of the Tdom model, if and only if the user gives a format description.

The default case is that the visitor reaches "visit(typedPCData)". This simply concatenates all character content into one big, unformatted Append format. This is exactly what you want in most cases for the lowest layers of the structure definition.

(( There is a field "public Format default_delimiter" in the generated code, which is initialized to an empty format. For debugging it can be set to something like "---", indicating the borders of the different concatenated pcdata ranges. ))

The language to define the format for a given Tdom class explicitly, is an instance of the generic format definition language. The following instantiations are specific to its application to Tdom :

    DOMAIN_SPECIFIC_DATA_ADRESSING ::= element | choice | sequence | $pcdata
                                     | $quoteDTDstyle ( blankCharacter ) + formatDescription
    element  ::= tag ( $nat ) ?
    choice   ::= ($C | $Choice ) nat
    alt      ::= ($A | $Alt ) nat
    sequence ::= ($S | $Seq ) nat
    nat      ::= // a natural number including zero(0) ;

The reference "$quoteDTDstyle data" means a "DTD-like" quotation of the (possibly concatenated) string content of the "data" format, done by Format.quoteDTDstyle(Format). This means simply framing the data with single quotes if it contains double quotes, and vice versa. And doing anything if the data contains both !-).
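The quoting rule can be written down as a small function on strings. The sketch below is not Format.quoteDTDstyle itself: the method `quote` is a hypothetical helper, the choice of double quotes for data without any quote character is an assumption, and so is the rejection of data containing both kinds of quotes (a case the text leaves open).

```java
public class QuoteDtdStyleSketch {
    /** Frames s with the kind of quote it does not contain. */
    static String quote(String s) {
        boolean hasDouble = s.indexOf('"') >= 0;
        boolean hasSingle = s.indexOf('\'') >= 0;
        if (hasDouble && hasSingle)
            // assumption of this sketch: reject, since no plain quoting is possible
            throw new IllegalArgumentException("data contains both kinds of quotes");
        char q = hasDouble ? '\'' : '"'; // assumption: double quotes are the default
        return q + s + q;
    }

    public static void main(String[] args) {
        System.out.println(quote("plain"));
        System.out.println(quote("say \"hello\""));
        System.out.println(quote("it's"));
    }
}
```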

The reference "$pcdata" means the text content of the current tdom element, as it is delivered by the generated getPCData() method. Appearing in a format code, "<tag>$<nat>" means a reference to the format generated for "getElem<tag>_<nat>()", --- "<tag>" without number defaults to "getElem<tag>_1()".
Analogously: "$C<nat>"/"$Choice<nat>" means a reference to the format generated for "getChoice_<nat>()" and "$S<nat>"/"$Seq<nat>" means "getSeq_<nat>()".

If the selected element/choice/sequence appears in the DTD under a list combinator ("+" or "*"), one of the list format descriptors must be applied ("[]", "[,,/]", "[;;/]{}", etc.), as described for the generic case.

Please note that currently there is no type-checking between the format descriptors and the DTD of the model. So addressing singular instead of plural (i.e. leaving out a list format descriptor instead of using it, --- or vice versa) will make the compilation of the generated code fail (there is either a "getElems_1_A()" or a "getElem_1_A()", as described above).

 DOMAIN_SPECIFIC_SWITCH_SELECTOR ::= element | choice | sequence

In case the DOMAIN_SPECIFIC_SWITCH_SELECTOR refers to an element, this element must appear in the current content model with a "?" modifier. The selector tags may be "0" for absent and "1" for present.

In case the DOMAIN_SPECIFIC_SWITCH_SELECTOR refers to a choice, then the selector tags correspond to the allowed values of "altIndex", i.e. the position number of the present alternative in the DTD choice construct.
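The selector semantics can be pictured with a small Java sketch. It is not the generated code: all names are hypothetical, the case bodies are placeholder strings instead of format descriptions, and failing with an exception when neither a matching case nor a default exists is how this sketch reads the "$throw" default of switches.

```java
import java.util.Map;

class SwitchSelectorSketch {
  /** Selector value for an element declared with "?": 0 = absent, 1 = present. */
  static int selectorForOptional(Object elementOrNull) {
    return elementOrNull == null ? 0 : 1;
  }

  /**
   * Picks the case for a selector value. For a choice, the selector would be
   * the altIndex of the alternative which is present.
   */
  static String selectCase(Map<Integer, String> cases, int selector, String defaultCase) {
    if (cases.containsKey(selector)) {
      return cases.get(selector);
    }
    if (defaultCase != null) {
      return defaultCase;
    }
    // No matching case and no default: the switch fails.
    throw new IllegalStateException("no case for selector value " + selector);
  }
}
```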

If in a switch no default case is given, this defaults to "$throw".

Currently there are three ways to write down these definitions: (1) in a stand-alone file, (2) as PIs in a DTD, or (3) by option values from an Xantlr source.

### ^ToC 6.3.1 Stand-alone format description file

All format definitions in such a file must have the form ...

 formatRule ::= tag ( seq| choice alt ) * ( choice ) ? blankCharacter * = formatDescription .

Please note the dot "." as a delimiter at the end, which is required because in a formatDescription all whitespace is significant. The format description begins directly after the "=".

The expression to the left of the "=" is translated verbatim into a "visit(Element_<tag><tail>)" rule, where <tail> is the name of the inner class of the element class, as denoted by the sequence of selectors, which are translated the same way as in the context of the format description, as described above.

The call of the tool is controlled by these parameters:

( definitions from file ../../src/eu/bandm/tools/formatfrontends/tdom2format.options )

 -G --sourceroot uri  file system position of the root of the java source tree
 -w --linewidth int(=79)  width of the generated source files
 -1 --packagename string  name of the package which will contain the formatting code
 -2 --basevisitor string  name of the base visitor class from which the formatting code inherits.
 -3 --classname string  name of the generated visitor class
 -4 --sourcetype ( dtd| nondtd)  indication of the type of the source file
 -5 --sourcefile uri  path to the source file to translate
    --targetclasspath uri(=$na)  position in file system for loading referred existing classes

ATTENTION Currently these parameters are not yet decoded as documented, but

1. the parameters with numbers as abbrevs can only be given by position
2. the other option values may be specified by System.getProperty("eu.bandm.tools.formatfrontedns.tdom2format.<OPTIONNAME>").
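Such system properties are usually set with "-D" on the java command line and read with a fallback value. The following Java sketch shows a lookup of this kind. The class and method names are hypothetical; the prefix is passed in as a parameter, so the exact property name documented above can be supplied; and whether the option name in the key is spelled exactly like the long option name is an assumption.

```java
class OptionPropertySketch {
  /** Reads the property "<prefix>.<optionName>", or returns the fallback if unset. */
  static String option(String prefix, String optionName, String fallback) {
    return System.getProperty(prefix + "." + optionName, fallback);
  }

  /** Same lookup for numeric options such as the line width. */
  static int intOption(String prefix, String optionName, int fallback) {
    String raw = option(prefix, optionName, null);
    return raw == null ? fallback : Integer.parseInt(raw.trim());
  }
}
```

With such a lookup, a call like `java -D<prefix>.linewidth=100 ...` would override the default of 79 columns.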

### ^ToC 6.3.2 Process instructions in a DTD

The format of the declarations is exactly as above, but all are wrapped into process instructions, each containing one or more declarations:

  

Please note again the dot "." which ends the format descriptions.

The call of the tool is now ...

  ${JAVA} eu.bandm.tools.formatfrontends.Tdom2format -- model.DTD ???

### ^ToC 6.3.3 Options from an Xantlr Source

If the Tdom meta-model is an AST meta-model created from an Xantlr grammar, then the format directives can be formulated directly in the grammar source file by a special kind of rule-wise options. A construct in the grammar file like ...

  public myNonterm options { format = "expr" ; } : grammarExpr ;

... will be translated by Xantlr into a process instruction in the generated DTD:

Here the dot is added by this translation process, so the whole string constant in the "option" statement (including trailing whitespace!) is seen as format description. But you can add more format expressions into the one option statement, if you want to define format rules for sub-expressions like choices or sub-sequences. Simply terminate the "main" expression for the non-terminal with an explicit dot, and append the further declarations for these sub-expressions (or arbitrary unrelated nonterminals !-) as explicit rules. Simply consider that Xantlr prepends the name of the non-terminal and the equal sign, and appends one dot; then you can insert further rules arbitrarily. So ...

  public myNonterm options { format = "expr. myN$C1=sub1. myN $C2$A1 = sub2" ; } :a(b|c)(d|e) ;

... will be translated by Xantlr into a process instruction with three format declarations in the generated DTD:

  

## ^ToC 6.4 Creating a W3C (untyped) DOM Representation

Creating a W3C DOM representation can be done via this SAX output: meta_tools provides a generic translation module in <METATOOLS>/util/SAX2DOMConverter. It implements org.xml.sax.ContentHandler and can therefore be connected as a drain to the Dumper described in the preceding paragraph.

You have to plug in a W3C DOM implementation, e.g. Xerces-J [xercesj], by calling the public method ...

 public void setDOMImplementation(org.w3c.dom.DOMImplementation domImpl){...} 

After all SAX events have been sent completely, you can access the generated Document by calling ...

 public org.w3c.dom.Document getDocument() 
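To make the data flow concrete, here is a minimal Java sketch of a SAX-to-DOM drain with the same two methods. It is a stand-in written for illustration, not the meta_tools SAX2DOMConverter: the class name MiniSaxToDom and the demo method are hypothetical, namespaces are ignored, and the DOM implementation is taken from the JAXP implementation bundled with the JDK.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.DOMImplementation;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

class MiniSaxToDom extends DefaultHandler {
  private DOMImplementation domImpl;
  private Document document;
  private final Deque<Element> open = new ArrayDeque<>();

  public void setDOMImplementation(DOMImplementation domImpl) { this.domImpl = domImpl; }

  public Document getDocument() { return document; }

  @Override
  public void startElement(String uri, String localName, String qName, Attributes atts) {
    Element element;
    if (document == null) {
      // The first start tag creates the document together with its root element.
      document = domImpl.createDocument(null, qName, null);
      element = document.getDocumentElement();
    } else {
      element = document.createElement(qName);
      open.peek().appendChild(element);
    }
    for (int i = 0; i < atts.getLength(); i++) {
      element.setAttribute(atts.getQName(i), atts.getValue(i));
    }
    open.push(element);
  }

  @Override
  public void characters(char[] ch, int start, int length) {
    if (!open.isEmpty()) {
      open.peek().appendChild(document.createTextNode(new String(ch, start, length)));
    }
  }

  @Override
  public void endElement(String uri, String localName, String qName) { open.pop(); }

  /** Feeds the events for a table with one row and one cell, then returns the document. */
  static Document demo() {
    MiniSaxToDom drain = new MiniSaxToDom();
    try {
      drain.setDOMImplementation(
          DocumentBuilderFactory.newInstance().newDocumentBuilder().getDOMImplementation());
    } catch (ParserConfigurationException e) {
      throw new IllegalStateException(e);
    }
    AttributesImpl atts = new AttributesImpl();
    atts.addAttribute("", "id", "id", "CDATA", "t1");
    drain.startElement("", "table", "table", atts);
    drain.startElement("", "tr", "tr", new AttributesImpl());
    drain.startElement("", "td", "td", new AttributesImpl());
    drain.characters("cell".toCharArray(), 0, 4);
    drain.endElement("", "td", "td");
    drain.endElement("", "tr", "tr");
    drain.endElement("", "table", "table");
    return drain.getDocument();
  }
}
```

The order of calls is the point of the sketch: first the DOM implementation is plugged in, then all SAX events are sent, and only then is the document fetched.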

## ^ToC 6.5 Compressed De-/Serialization

Since the standard XML encoding (using opening and closing tags and many different layers of escaping and quoting) is very redundant, Tdom supports a compressed binary storage format, in which all tags are encoded in a minimal way, controlled by the DTD. So the tagging information is a kind of "binary", while the text contents are left unchanged (i.e. they are encoded using UTF-8).

The writing out of a document or element is initiated by a sequence like ...

  import eu.bandm.tools.tdom.runtime.EncodingOutputStream ;

  EncodingOutputStream os = new EncodingOutputStream(anyOutputstream);
  myElement.encode(os);
  // OR
  myDocument.encode(os);

Reading back is only possible on Document level, by calling the appropriate factory method defined with the DTD class, as already mentioned in Section 3.2 above:

  createDocument_ (java.io.InputStream in) throws java.io.IOException {...} 
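The principle stated at the beginning of this section, tags encoded compactly relative to the DTD and text kept as UTF-8, can be illustrated with a toy encoder. The Java sketch below is not Tdom's actual storage format: the byte layout (one byte tag index, four bytes text length, then the UTF-8 bytes) and all names are assumptions made for the illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

class TagIndexEncodingSketch {
  /**
   * Encodes one leaf element: the tag becomes its index in the list of
   * element names declared by the DTD, the text is stored as UTF-8.
   */
  static byte[] encodeLeaf(List<String> dtdTags, String tag, String text) {
    int index = dtdTags.indexOf(tag);
    if (index < 0) {
      throw new IllegalArgumentException("tag not declared in DTD: " + tag);
    }
    byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
    ByteArrayOutputStream buffer = new ByteArrayOutputStream();
    try (DataOutputStream out = new DataOutputStream(buffer)) {
      out.writeByte(index);       // the "binary" tagging information
      out.writeInt(utf8.length);  // length prefix instead of a closing tag
      out.write(utf8);            // text content, unchanged
    } catch (IOException e) {
      throw new IllegalStateException(e); // cannot happen for an in-memory stream
    }
    return buffer.toByteArray();
  }
}
```

In this toy layout a "td" element with the text "x" takes six bytes, while the textual form `<td>x</td>` takes ten.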

# ^ToC 7 Using the Tdom Tool

## ^ToC 7.1 Calling the Tdom Tool

The Tdom tool is called from the command line:

( definitions from file ../../src/eu/bandm/tools/tdom_withOptions/Options.xml )

 -0 --destdir string  file system mounting point of the generated source tree
 -1 --pkgname string  name of the generated package. Determines the position of generated resources relative to destdir
 -2 --sourcedtd string  file position of dtd to compile
    --linewidth int(=78)  number of columns for the generated Java source text. This is NOT a strict limit, but a strong orientation.
    --nocompress  whether the decode/encode methods shall be omitted
    --patterns  whether pattern access methods shall be generated.

ATTENTION Currently these parameters are not yet decoded as documented, but

1. the parameters with numbers as abbrevs can only be given by position
2. the other option values may be specified by System.getProperty("eu.bandm.tools.formatfrontedns.tdom2format.<OPTIONNAME>")

As an alternative there is a macro in the make system included in etc/calltools.mk, which is called from a Makefile as ...

 $(call tdom, , , ) 

This macro takes care, in particular, of the conversion of slashes and backslashes between a Unix and a Cygwin environment.

## ^ToC 7.2 Outputs and Error Messages

As an output, Tdom generates one source file for each <!ELEMENT ..> declaration, according to the naming conventions explained above in Section 2.6.

Additionally there will be generated ...

1. A file named sources, listing all source files generated by Tdom , and included by the Makefile, esp. for realizing "make clean"
2. The abstract super classes Element.java and Document.java.
3. The binary version of the DTD, contained in DTD.java
4. Visitor.java and VisitorTemplate.java, as described above in chapter 4.

  ... ( a* | b* ) ... 

These ambiguities do not violate the "LL(1)" requirement, as interpreted by W3C. Nevertheless, it is important to have a close look at all points of ambiguity. The Tdom tool prints them in a precise format. E.g. when trying to translate "xhtml1-flat.dtd", you get the typical warning

 warning: table: conflict on [thead, tbody, tr, tfoot] between alts 0 and 1 in rule
   [thead, col, tbody, tr, colgroup, tfoot] ->
   ( [thead, col, tbody, tr, tfoot] -> {[col] -> (col)* -> [thead, tbody, tr, tfoot]}
   | [thead, tbody, tr, colgroup, tfoot] -> {[colgroup] -> (colgroup)* -> [thead, tbody, tr, tfoot]} )

In the innermost nesting (in braces) you see the first and follow sets of some grammar expressions :

  { [firstSet] -> grammarExpr -> [followSet] } 

The conflict is here caused by a disjunction. On the next higher level you see the disjunction of two such constructs, each preceded by its own first set (which is identical to the internal first set, or to the union of the internal first and follow set, if the grammar expression can produce epsilon, --- as we all have learned from the Dragon Book !-)

  ( [firstSet_A] -> {...} | [firstSet_B] -> {...} ) 

Before this bracket, again, there is the first set of the disjunction as a whole (which is not very informative, it is just the union), but in the warning message you find the intersection of these two sets, which is the cause for the ambiguity.
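The sets in the warning can be recomputed by hand. The following Java sketch does so for the table example; the class and method names are hypothetical, and it only replays the set arithmetic described above, not the analysis implemented in the Tdom tool.

```java
import java.util.Arrays;
import java.util.Set;
import java.util.TreeSet;

class FirstSetConflictSketch {
  static Set<String> setOf(String... names) {
    return new TreeSet<>(Arrays.asList(names));
  }

  /** First set of an alternative: the follow set is added if it can produce epsilon. */
  static Set<String> effectiveFirst(Set<String> first, Set<String> follow, boolean canBeEmpty) {
    Set<String> result = new TreeSet<>(first);
    if (canBeEmpty) {
      result.addAll(follow);
    }
    return result;
  }

  /** The conflict set is the intersection of the first sets of both alternatives. */
  static Set<String> conflict(Set<String> firstA, Set<String> firstB) {
    Set<String> result = new TreeSet<>(firstA);
    result.retainAll(firstB);
    return result;
  }
}
```

With the follow set [thead, tbody, tr, tfoot], the alternatives (col)* and (colgroup)* both can be empty, so both first sets contain the whole follow set, and their intersection is exactly the conflict set printed in the warning.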

# ^ToC 8 Xantlr and Tdom --- Special Issues of Their Co-Operation

## ^ToC 8.1 Information Interchange by Option Controlled DTD Generation

First of all, the grammar file fed into Xantlr must contain the global, parser-level option

 options { dtdMode = tdom ; } 

This makes the generated DTD contain special "process instructions" addressed to Tdom (as described above in Section 2.1.1), reflecting the settings in the grammar definition file:

1. the Xantlr rule-level option "xmlNodeType=abstract" (cf. xantlr sax event types) adds to the DTD something like "<?tdom abstract a (b|c) ?>".
2. an Xantlr rule-level "private" or "public" modifier is translated to a process instruction accordingly, cf. Section 2.1.1.

## ^ToC 8.2 Different Layers of Ambiguity

When Xantlr and Tdom are plugged together, two parsers are involved: First comes the antlrC parser, which consumes front-end characters and emits the standard antlrC error messages. The output of the Xantlr generated parser is a SAX event stream, which is fed into the different SAX receivers created by the Tdom compiler.

Already when translating the DTD, Tdom possibly issues error messages concerning this second level of parsing, --- mostly caused by ambiguities, i.e. violations of the LL(1) criterion. This kind of ambiguity should not be mixed up with the antlrC front-end ambiguities. Consider a definition (taken from the d2d grammar) like ...

  definition ::= "list" "of" reference | "short" "for" reference | ... 

The antlrC generated front-end parser has no problems with ambiguity, because there are terminal tokens guarding the alternatives. Because these tokens do not automatically contribute to the semi-AST, the generated DTD would correspond to

  definition ::= reference | reference | ... ; 

This is ambiguous, and the following Tdom translation will issue an error message, as explained above in Section 7.2.

As a remedy, you should either wrap one of the terminals in a non-terminal (with an empty DTD content model), like ...

  definition : LIST "of" reference | "short" "for" reference | ... ;
  LIST : "list" ;

...or wrap one of the alternatives as a whole in a non-terminal:

  definition : "list" "of" reference | shortcutdefinition | ... ;
  shortcutdefinition : "short" "for" reference ;

Now the generated SAX events suffice to distinguish between the alternatives.
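The effect of both remedies can be checked with a deliberately simplified model. In the Java sketch below, an alternative is a list of symbols, quoted terminals are dropped as they do not reach the semi-AST, and two alternatives count as distinguishable if their first remaining symbols differ. The names are hypothetical, and the model ignores optional symbols and follow sets.

```java
import java.util.List;
import java.util.Objects;

class AstAmbiguitySketch {
  /** First symbol which survives in the DTD; quoted terminals are dropped. */
  static String firstVisible(List<String> alternative) {
    for (String symbol : alternative) {
      if (!symbol.startsWith("\"")) {
        return symbol;
      }
    }
    return null; // the alternative contributes nothing to the content model
  }

  /** Simplified LL(1) check: the first visible symbols must differ. */
  static boolean distinguishable(List<String> altA, List<String> altB) {
    return !Objects.equals(firstVisible(altA), firstVisible(altB));
  }
}
```

Applied to the example, "list" "of" reference and "short" "for" reference both start with reference and collide, while LIST "of" reference and the wrapped shortcutdefinition each differ from the other alternative in their first visible symbol.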

## ^ToC 8.3 XantlrTdom, Glueing Code and Error Messaging Issues

meta_tools provides some glueing code for plugging together Xantlr and Tdom . The central class is <METATOOLS>/xantlrtdom/XantlrTdom, which internally creates buffers and auxiliary message pipes, etc, and plugs it all together.

Let "XXX" be the name of your grammar, and "YYY" the top production, then the usage pattern is ...

  final XXX_Lexer lexer = new XXX_Lexer(stream);
  lexer.setFilename(filename);
  final XXX_Parser parser = new XXX_Parser(HistoryToken.chain(lexer));
  tee = (tracing)
      ? new ContentPrinter(new PrintWriter_flushing(System.err), true, true)
      : null ;
  final XantlrTdom link = XantlrTdom.link (parser, msg1, 1024, tee, DTD.dtd, msg2);
  final Document_YYY document_module = link.parse("YYY", Document_YYY.class);

Please refer to the API doc.

1 Only exception: When taking them seriously, the constraints on id/idref/idrefs attributes imposed by [xml] are nearly impossible to maintain! And they would be very expensive to check automatically with every model update. Therefore they are only evaluated on demand, controlled by the user. See Section 2.15 for details.

2 They can become cyclic by later alterations; another reason for better treating model elements as immutable!

3 This code fragment is written for documentation purpose using the access methods from the user interface, as described above. The real implementation, since in the same package as the element classes, of course uses the direct access to protected variables for the sake of efficiency, e.g. "a.elems_1_C.length" instead of "a.countElems_1_C()"

made    2017-07-13_09h09   by    lepper   on    linux-q699.site

produced with eu.bandm.metatools.d2d and XSLT