Tdom, a Generator for Typed XML Models
(related API documentation: package tdom.runtime)
1 Principles of Tdom
2 Mapping from DTD to Java
2.1 Relevant Information content of a DTD
2.1.1 Process instructions
2.2 XML Namespaces
2.3 The own XML Document Id
2.4 Pre-Defined Infra-Structure, Runtime Classes
2.5 Generated Java Classes for the top-level DTD. Reflection.
2.6 Generated Java Classes for Element declarations. General Name Translation.
2.6.1 Abstract Java Classes as Realisations of DTD Content Model Alternatives
2.6.2 Name Mangling from DTD Elements' Contents and Sub-Contents to Java Classes
2.6.3 Inner Classes Generated for Sub-Content
2.6.4 Retrieval, Update and Visit Methods
2.6.5 Unsafe retrieval methods and alternative checked list generation
2.6.6 Inner Classes Generated for Embedded Sequences
2.6.7 Inner Classes Generated for Choices
2.6.8 Text Content and Mixed Content
2.7 Attributes
2.7.1 Generated Classes for Attributes
2.7.2 Checking Value Assignments
2.7.3 Enumeration Attributes with Integer Tokens Only
2.7.4 Unsetting Attributes
2.7.5 Common Classes for Common Attributes
2.7.6 Attributes with Attribute Types "ID", "IDREF" and "IDREFS"
2.8 Auxiliary methods for numeric contents of elements and attributes
2.9 Additional Documentation Text
2.10 Ethereals: Comments and Processing Instructions as Second Class Inhabitants
3 Construction of Tdom models
3.1 Error Cases and Exception Hierarchy
3.2 Explicit Constructor Application for Elements and Sub-Content
3.2.1 Creating Elements with Structured Contents, Statically Typed
3.2.2 Creating Elements with Structured Contents, Dynamically Typed, by Semi-Parser
3.2.3 Creating Elements with Mixed or Pure PCData Contents, Statically Typed
3.2.4 Creating Elements with Mixed or Pure PCData Contents, Dynamically Typed
3.2.5 Creating Elements with Attributes
3.3 Automated Construction of Documents and Elements
4 Visitors and Patterns
4.1 The Generated Visitor Class and Deriving User Defined Visitors
4.2 Calling a User Defined Visitor
4.3 Default Visiting Strategy of Generated Visitors and User Defined Explicit Control
4.4 Untyped Visitors
4.5 Generated Paisley patterns
5 The Extension Mechanism
6 Serialization and Conversions
6.1 Generating SAX Events
6.2 Visualization of a Tdom Model
6.3 Format Generation for a Tdom Model
6.3.1 Stand-alone format description file
6.3.2 Process instructions in a DTD
6.3.3 Options from an Xantlr Source
6.4 Creating a W3C (untyped) DOM Representation
6.5 Compressed De-/Serialization
7 Using the Tdom Tool
7.1 Calling the Tdom Tool
7.2 Outputs and Error Messages
8 Xantlr and Tdom --- Special Issues of Their Co-Operation
8.1 Information Interchange by Option Controlled DTD Generation
8.2 Different Layers of Ambiguity
8.3 Xantlr/Tdom, Glueing Code and Error Messaging Issues
Tdom is a tool for generating typed data models of an XML text body according to a definition given as an XML DTD [xml]. "Typed" model means that (a) the validity of the model w.r.t. the DTD is guaranteed by all creation and modification methods,¹ and (b) that this can be proved at compile time.²
The Tdom-generated model behaves "partly algebraically", since each node
behaves like an algebraic expression and knows nothing about the context(s)
it appears in.
So, in contrast to W3C DOM ([w3cDom]), you can employ sharing,
even between different "documents".
Nodes exist independently of a global document object, and can
be created, processed and stored in a freely compositional and local way.
This is a fundamental requirement for a "functional style" of programming.
Tdom nodes do not behave algebraically in the sense that
they can be treated as
mutable (but think twice whether this is really necessary for your purposes,
since you lose sharing !-),
and that they do not support an algebraic equals().
(This could be added in some later version !?!)
The fact that they "do not know" their parent and their siblings makes
Tdom nodes behave more like nodes in a tree in the mathematical sense.
Software architects used to W3C DOM and similar models may consider this restriction
to be a drawback.
Processing and creating trees is of course fundamentally different in this
paradigm: creation goes most naturally bottom-up, processing goes most
naturally by visiting top-down (see chapter 4) and memorizing all required
context information "on the fly".
(For all our applications, we found this a most convenient, safe and
easy-to-debug way of coding !-)
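The following is a minimal sketch of this paradigm, not Tdom code: the class `Node` and the method `collectPaths` are hypothetical stand-ins written only for illustration. It shows bottom-up construction, the sharing of one sub-tree between two "documents", and a top-down traversal that carries its context (here the path from the root) along as a parameter, because the nodes themselves do not know their parents.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical miniature of a context-free tree; this is NOT the Tdom runtime API. */
final class SharingSketch {

    /** A node knows its children, but neither its parent nor its siblings. */
    static final class Node {
        final String tag;
        final List<Node> children;

        Node(String tag, Node... children) {
            this.tag = tag;
            this.children = List.of(children);
        }
    }

    /** Top-down traversal; the path so far is the context carried "on the fly". */
    static void collectPaths(Node n, String parentPath, List<String> out) {
        String path = parentPath + "/" + n.tag;
        out.add(path);
        for (Node child : n.children) {
            collectPaths(child, path, out);
        }
    }

    public static void main(String[] args) {
        // Bottom-up construction: leaves first, then the containing nodes.
        Node shared = new Node("p", new Node("em"));
        Node docA = new Node("body", shared);
        Node docB = new Node("div", shared, shared);

        List<String> paths = new ArrayList<>();
        collectPaths(docB, "", paths);
        System.out.println(paths);

        // The very same object instance appears in both trees.
        System.out.println(docA.children.get(0) == docB.children.get(1));
    }
}
```

Because `shared` carries no parent pointer, inserting it several times needs neither copying nor "adopting"; the price is that any later mutation of `shared` would be visible in every place it is used.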
Applying the Tdom compiler to a DTD yields a collection of Java source files, forming a single package. This package is then processed by a Java compiler. It relies on the presence of a collection of base classes in the package <METATOOLS>/tdom/runtime, see Section 2.4. The generated collection provides (at least) one Java class definition for each type of node defined by the DTD. This includes ...
All these Java classes are called "node classes" in the following. A Tdom model of a certain text corpus is realized by a structured collection of instances of these node classes. Each such instance represents a certain part of the document, and each node class represents a certain type of these document parts.
All generated node classes provide ...
Additionally the generated package contains ...
After these classes have been compiled by a Java compiler, you can create a Tdom text model ...
Each Tdom model, or each fragment thereof, can then ...
Some of the public examples on the page "downloads & licenses" make extensive use of Tdom.
See esp. the "BandM booking" bookkeeping
software, where a dedicated DTD models the business objects, and
the d2d-based "Wiki", where type-correct XHTML 1.0 is constructed in
small pieces bottom-up.
In the following simplified examples, let "pkg" be the name of the generated package, as given to the Tdom tool by the command line parameter --pkgname, see Section 7.1.
The Tdom tool
processes one single DTD file and generates one package of Java source files.
(The contents of further DTD files, which are
included directly or indirectly in this file by the
well-known "external parameter entity" mechanism, are of course also considered
and are processed as if contained directly in the top file.)
From the DTD it uses ...
In most cases, entity definitions are only used implicitly. Tdom uses the meta_tools component dtd, which expands entity references in a transparent way.
The translation into Java code is controlled by a whole zoo of "process instructions" addressing Tdom, as defined in [xml]. Here are the most important:
The names of all elements, entities and attributes are implemented as instances of NamespaceName.
As documented there, this class can represent names in "non-namespace-mode" and in "namespace-mode". In the first case the character ":" is not treated specially in any way. In the latter there must be at most one such character, and the prefix is mapped to a "namespace URI", as is the standard way with XML namespaces, see [xml-ns].
For Tdom, the namespace mode is activated by process instructions which exactly follow the standard XML syntax:
    <?tdom xmlns="myMainModule" ?>
    <?tdom xmlns:mathml="http://www.w3.org/1998/Math/MathML" ?>
For the runtime namespace logic, the prefixes are ignored and "equals()" etc. is governed only by the namespace URI, which is the usual way with namespace-aware XML. In this respect, all prefixes must be different, but can be arbitrary.
But for the code generation the prefixes are kept and only the colon ":" is replaced by an underscore "_". (There are more characters to be replaced, see the paragraph on name translation in Section 2.6.) So the selected prefixes appear in the name of the generated Java classes and should be selected accordingly.
With the PIs ...
    <?tdom SYSTEM "xslt.dtd"?>
        --- or ---
    <?tdom PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN"
                  "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" ?>
... the XMLDocumentIdentifier of the DTD file itself is made known to Tdom. It will be stored in the generated DTD class and is accessible by the method getDocumentId().
The classes generated by Tdom to represent Elements, Attributes and sub-expressions of content models, all inherit from pre-defined runtime classes. These classes are contained in eu.bandm.tools.tdom.runtime and provide basic functionalities.
The following figure indicates symbolically their inheritance tree,
and the places where the generated classes are inserted:
(Please note that this graph is only symbolic and leaves out many details.
For more details, please refer to the current API documentation contained in
<METATOOLS>/eu/bandm/tools/tdom/runtime !)
    INTERFACES:  eu.bandm.tools.tdom.runtime.
      TypedContent
      TypedElement.MixedContentContainer
      |   TypedElement.PCDataContainer
      Visitable<V>
      ImpliedAttribute
      Identifiable
      ... // etc

    CLASSES:  eu.bandm.tools.tdom.runtime.
      TypedDTD
      |   pkg.DTD                       <<<<< GENERATED once for each Tdom model
      TypedNode
      |   TypedDocument
      |
      |   TypedSubstantial
      |   |   TypedPCData  IMPLEMENTS Matchable   // FIXME Visitable / Matchable??? Which one applies??
      |   |
      |   |   TypedElement  IMPLEMENTS TypedContent
      |   |   |   pkg.Element               <<<<< GENERATED once for each Tdom model
      |   |   |   |   pkg.Element_<tagX>    <<<<< GENERATED once for each ELEMENT declaration
      |   |   |   |   pkg.Element_<tagY>    <<<<<     "
      |
      |   TypedEthereal
      |   |   TypedComment
      |   |   TypedProcessingInstruction
      |
      |   TypedSubtree  IMPLEMENTS TypedContent
      |   |   TypedChoice
      |
      |   TypedAttribute
      |   |   CDataAttribute
      |   |   EnumerationAttribute
                  pkg.Element_<tag>.Attr_<attname>   <<<<< GENERATED for each attribute declaration
      |   |   NmTokenAttribute
      |   |   |   IdAttribute
      |   |   |   IdRefAttribute
                  pkg.Element_<tag>.Attr_<attname>   <<<<< GENERATED for each attribute declaration
      |   |   NmTokensAttribute
      |   |   |   IdRefsAttribute
                  pkg.Element_<tag>.Attr_<attname>   <<<<< GENERATED for each attribute declaration
      |   ...
      |   TypedElement.MixedContent  IMPLEMENTS TypedContent
      |       pkg.Element_<tagX>.Content             <<<<< GENERATED once for each mixed-content ELEMENT
      TypedElement.MixedContentFactory
      //...
In each Tdom run, among others, one top-level class pkg.Element is generated, from which all element classes are derived, and one prototype class pkg.Visitor. The pkg.Element class implements tdom.runtime.Visitable<pkg.Visitor>; these co-operate in visitor dispatching, see chapter 4 below.
For a beginner, the different classes and instances called "dtd" or similar may be confusing:
For each <!ELEMENT tag ...> declaration in the source DTD, a node class is generated in the generated package "pkg". Instances of this class will be used to represent the document's sub-trees corresponding to this element. Such a class is called "element class" in the following. Its name is Element_<tag'>, where <tag'> is the DTD name translated to a Java name.
The name translation from DTD to Java is necessary for all kinds of names, such as element tags, attribute names, entity names etc. What happens in all these cases is that every single occurrence of a minus sign "-", a colon ":" or a dot "." is replaced by a single underscore "_".
The Tdom tool does not check whether ambiguities are created by this translation. Instead, you will get the corresponding error messages from the subsequent Java compilation process. This will happen for an input like
    <!ELEMENT a.b-c ...>
    <!ELEMENT a-b.c ...>
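The translation rule and the resulting clash can be sketched in a few lines. This is an illustration of the rule as stated in the text, not the code of the Tdom tool; the class and method names (`NameTranslationSketch`, `toJavaName`, `clashes`) are invented for this example.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical re-implementation of the documented rule; not the Tdom tool's own code. */
final class NameTranslationSketch {

    /** Replaces every single '-', ':' and '.' by one '_'. */
    static String toJavaName(String dtdName) {
        return dtdName.replaceAll("[-:.]", "_");
    }

    /** Groups DTD names by their translated name and keeps only the ambiguous groups. */
    static Map<String, List<String>> clashes(List<String> dtdNames) {
        Map<String, List<String>> byJavaName = new LinkedHashMap<>();
        for (String name : dtdNames) {
            byJavaName.computeIfAbsent(toJavaName(name), k -> new ArrayList<>()).add(name);
        }
        byJavaName.values().removeIf(group -> group.size() < 2);
        return byJavaName;
    }

    public static void main(String[] args) {
        System.out.println(toJavaName("mathml:mrow"));
        System.out.println(clashes(List.of("a.b-c", "a-b.c", "x")));
    }
}
```

Running such a check over the element names of a DTD before calling the generator would report the ambiguity earlier than the Java compiler does.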
For those <!ELEMENT tag ...> declarations which can serve as the top-level node of a document, a further class named Document_<tag'> is created. This inherits from <METATOOLS>.tdom.runtime.TypedDocument and additionally contains the methods for parsing a document as a whole from some external source (SAX or W3C-DOM). Whether a given element is such a top-level one is encoded in the DTD by process instructions: The positive case (= yes, E can be a top-level element) is indicated in the DTD by the process instruction
    <?tdom public E ?>
The negative case by
    <?tdom private E ?>
For all those element declarations which are not explicitly mentioned in this way the default is defined by
    <?tdom default private ?>
        -- or --
    <?tdom default public ?>
As mentioned above, every element class may contain inner Java classes which realize complex sub-contents, and update and retrieve methods ("set_<..>(..)" and "get_<..>(..)") for all sub-contents.
Please note that you can even derive further hand-coded sub-classes from generated element classes. This is possible because, whenever a typed element needs to know the identity of the class it is an instance of (e.g. for visiting or for serialization), it does not use the Java language .getClass()-method, but the generated method public int getTagIndex(), which will be inherited by your derived classes and is specified in TypedElement.getTagIndex().
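Why dispatching on a tag index keeps hand-coded subclasses working can be shown with a small sketch. Only the method name `getTagIndex()` and the constant `TAG_NAME` are taken from the text; the constant `TAG_INDEX`, the class `MyParagraph` and the method `dispatch` are assumptions made for this illustration and need not match the generated code.

```java
/** Hypothetical stand-ins for generated classes; the real generated code may look different. */
final class TagIndexSketch {

    abstract static class Element {
        abstract int getTagIndex();
    }

    static class Element_p extends Element {
        static final String TAG_NAME = "p";
        static final int TAG_INDEX = 0;          // assumed constant, for illustration only
        @Override int getTagIndex() { return TAG_INDEX; }
    }

    static class Element_em extends Element {
        static final String TAG_NAME = "em";
        static final int TAG_INDEX = 1;          // assumed constant, for illustration only
        @Override int getTagIndex() { return TAG_INDEX; }
    }

    /** A hand-coded subclass: it inherits getTagIndex() and is therefore still treated as a "p". */
    static class MyParagraph extends Element_p {
        String editorNote = "";
    }

    /** Dispatches on the tag index instead of on getClass(). */
    static String dispatch(Element e) {
        switch (e.getTagIndex()) {
            case Element_p.TAG_INDEX:  return Element_p.TAG_NAME;
            case Element_em.TAG_INDEX: return Element_em.TAG_NAME;
            default: throw new IllegalStateException("unknown tag index");
        }
    }

    public static void main(String[] args) {
        Element e = new MyParagraph();
        System.out.println(dispatch(e));                       // handled as a "p"
        System.out.println(e.getClass() == Element_p.class);   // a getClass() test would not match
    }
}
```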
The tag name of a given element can be read statically, when the class is known, by ...
    Element_<>.TAG_NAME
...due to the generated definition ...
    public static final String TAG_NAME ;
The dynamic way is defined in <METATOOLS>/tdom/runtime/TypedElement by
    NamespaceName el.getName()
    String        el.getTagName()
    String        el.getNamespaceURI()
    String        el.getLocalName()
The Tdom tool does support some abstraction of isomorphic content definitions. This abstraction is rather limited, due to the nature of DTD, but nevertheless an important means for increasing re-usability, while preserving static type safety.
Any content model which is an undecorated choice expression of element
references can be translated into an abstract class, in the
Java terminology.
The benefit comes from the further consequences:
(a) all element declarations referred to
in this alternative will be translated into an element class which is
derived from this abstract class, and (b) whenever this choice clause
will appear in a certain content model, it will be replaced by a
simple reference to this abstract class.
This happens independently of the sequential order of the
alternatives, and also when the choice alternatives are a
proper subset of the alternatives of a bigger choice clause.
In the case of nested choices, the largest sub-expression of every
choice which matches such an abstraction is replaced by the
corresponding abstract class.
(This statement must be refined for overlapping definitions, see below.)
The single inheritance property of Java implies that each element may
appear at most in one of these declarations.
This mechanism is controlled by a tdom process instruction like
    <?tdom abstract a (b|c|d) ?>
The content model must be a disjunction of plain element types. The PI declares that an abstract class "Element_a" will be generated. This will be the superclass of the classes which realize the elements in the disjunction, here: Element_b, Element_c and Element_d. We assume that the name "a" is fresh, i.e. does not appear as an element or entity declaration in the rest of the DTD. Whenever this choice appears in a content model, the complicated interface for choices (see Section 2.6.7 below) is replaced by the much simpler interface for a single element reference.
    <!ELEMENT x (y, (b|c|d)*, z, (b|c|d)) >

    ==> yields a code interface containing

    class Element_x {
      public Element_a[] getElems_1_a() {...}
      public void setElems_1_a(Element_a[]) {...}
      public Elem_a getElem_2_a() {...}
      public Elem_a setElem_2_a(Element_b e) {...}
      public Elem_a setElem_2_a(Element_c e) {...}
      public Elem_a setElem_2_a(Element_d e) {...}
    }
A more realistic example, simplified from our XHTML model:
    <?tdom abstract-entity misc.inline ?>
    <!ENTITY % misc.inline "ins | del | script">
    <?tdom abstract misc (noscript | misc.inline) ?>
    <?tdom abstract heading (h1|h2|h3|h4|h5|h6) ?>
    <?tdom abstract block (p | heading | div | ul | ol | dl | pre | hr
                           | blockquote | address | blocktext | fieldset | table) ?>
    <?tdom abstract form.content (block | misc) ?>
    <?tdom abstract block.content (form.content | form) ?>
The last example shows that these definitions may be nested, as long as no cycles result.
Whenever such a choice expression is defined by the contents of a DTD "parameter entity", this entity can be used directly. Its contents will define the subclasses, and its name will be used as the name of the abstract class. This is shown in the first line of the example above. The entity's expansion text may carry further decoration which will be stripped to get the alternative expressions, as in
    <?tdom abstract-entity form.content ?>
    <!ENTITY % form.content "(%block; | %misc;)*">
This abstraction mechanism significantly increases reusability and versatility: With the XHTML example, it is possible now to collect the very different sub-classes of Element_block_content into one single storage, e.g. an ArrayList<Element_block_content>, and later insert this sequence into an arbitrarily chosen instance of Element_object, Element_map, Element_fieldset, Element_noscript, Element_body, or Element_blockquote.
Without this abstraction, the first, collecting step would already require the wrapping of the elements into a certain alternative of a certain choice type of a certain hosting element class, and the collected sequence could not be used anywhere else, without "hacking" and losing static type safety.
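The benefit described here can be sketched with hand-written stand-ins for generated classes. All classes below are hypothetical and much simpler than real Tdom output: `Element_block` plays the role of an abstract class introduced by a PI like "<?tdom abstract block (p | div) ?>", and the two host classes stand for elements whose content models both contain this choice.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-ins for generated classes; not real Tdom output. */
final class AbstractChoiceSketch {

    /** Plays the role of the abstract class generated for the choice (p | div). */
    abstract static class Element_block {
        abstract String tag();
    }

    static final class Element_p extends Element_block {
        @Override String tag() { return "p"; }
    }

    static final class Element_div extends Element_block {
        @Override String tag() { return "div"; }
    }

    /** Two unrelated hosts whose content models both contain (p | div)*. */
    static final class Element_body {
        final List<Element_block> blocks;
        Element_body(List<Element_block> blocks) { this.blocks = List.copyOf(blocks); }
    }

    static final class Element_blockquote {
        final List<Element_block> blocks;
        Element_blockquote(List<Element_block> blocks) { this.blocks = List.copyOf(blocks); }
    }

    public static void main(String[] args) {
        // Collect first, without knowing the future host ...
        List<Element_block> collected = new ArrayList<>();
        collected.add(new Element_p());
        collected.add(new Element_div());

        // ... and decide later where the sequence is inserted; both uses are statically typed.
        Element_body body = new Element_body(collected);
        Element_blockquote quote = new Element_blockquote(collected);
        System.out.println(body.blocks.size() + " " + quote.blocks.size());
    }
}
```

Without the common superclass, `collected` would need an element type that is specific to one host, such as a choice class nested in `Element_body`, and could not be passed to `Element_blockquote`.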
The behaviour in case of overlapping choice expressions is not fully defined:
    <?tdom abstract x (a | b) ?>
    <?tdom abstract y (b | c) ?>
    // not clear what will happen here:
    <!ELEMENT z (a | b | c) >
So please look into the generated code to find out, or avoid such definitions.
Up to here we assumed that the name of the abstract class is a fresh name.
A different case is to declare an existing element declaration as abstract.
The preconditions and consequences
are the same, the contents model of the element must
be an undecorated choice clause.
Additionally, the corresponding node is no longer represented in the generated model by an object instance of its own, but is represented indirectly, by an instance of one of its sub-classes, which corresponds to an element from its content model's choice expression. This instance transparently represents two or even more nodes of the conceptual document tree, namely the chosen leaf element and the containing, abstract element. "Transparently" means that visitor code and attribute accessing methods are not affected.
FIXME: IS THIS CORRECT??? EXAMPLE ??? The code in TypedDOMGenerator does NOT seem to support a "<?tdom abstract node?>" !?
This method is frequently employed in the Tdom/Xantlr co-operation, to eliminate unnecessary nodes which only present alternatives in the derivation tree. This is controlled by an "options {xmlNodeType=abstract;}" option in the Xantlr grammar file, which is translated to the Tdom PI automatically, see chapter 8 below.
(This section describes how Java class definitions and their naming are derived from the contents model of a certain element. Please note that the mere name translation for eliminating those characters, which are valid identifier components in XML but not in Java, as described in this chapter above, must happen anyhow and independently!)
The mapping rules between DTD and Java class definitions act
locally on each "<!ELEMENT...>" declaration.
The structure of the generated Java classes and their naming
convention immediately reflect the usage of parentheses in
the regular expression describing the element's contents, as given by the
compiled DTD.
Please note that there is no implicit normalization of DTD content models.
For the name mangling purpose there is a difference between the
content models
    a, (b, c)
...and ...
    a, b, c
Name mangling is basically defined on sequences.
Therefore, as a first step, the top-level of the element's content definition is always interpreted as such a sequence, possibly of length 1 (one).
Each sequence consists of content particles. Each such content particle is ...
All content particles may be decorated with a
quantification symbol "?", "*" or "+".
The naming convention assigns position numbers separately
(a) to all references to elements of a certain tag,
(b) to all embedded choices
and (c) to all embedded sub-sequences.
These numberings always start with the number one (1).
From these numbers (and the tag strings in case (a))
particle names are generated.
E.g. the sub-particles of the top-level in the following
DTD content model are addressed by the particle names put beneath:
    <!ELEMENT NC ( TA, (TB, TC, TB)+, TA, (TB)*, (TE | TF | TG), (TB, TC, TB) ) >
                   |   |              |   |      |               |
                   |   Seq_1          |   |      Choice_1        Seq_2
                   |                  |   Elem_1_TB
                   Elem_1_TA          Elem_2_TA
The quantification symbols "?", "*" or "+" and all parentheses around singletons (i.e. not enclosing sub-sequences or alternatives) are ignored in the definition of particle names.
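The numbering rule can be restated as a short algorithm. The sketch below is written from the description above and is not taken from the Tdom sources; the types `Kind` and `Particle` and the method `particleNames` are invented for this illustration. Quantifiers and parentheses around singletons are assumed to be stripped already, since they do not influence the names.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical restatement of the documented naming rule; not the Tdom tool's code. */
final class ParticleNamingSketch {

    enum Kind { ELEMENT, SEQUENCE, CHOICE }

    /** One particle of a sequence; the tag is only meaningful for Kind.ELEMENT. */
    static final class Particle {
        final Kind kind;
        final String tag;
        private Particle(Kind kind, String tag) { this.kind = kind; this.tag = tag; }
        static Particle elem(String tag) { return new Particle(Kind.ELEMENT, tag); }
        static Particle seq()            { return new Particle(Kind.SEQUENCE, null); }
        static Particle choice()         { return new Particle(Kind.CHOICE, null); }
    }

    /** Assigns names with separate counters per element tag, for sequences and for choices. */
    static List<String> particleNames(List<Particle> sequence) {
        Map<String, Integer> perTag = new HashMap<>();
        int seqs = 0;
        int choices = 0;
        List<String> names = new ArrayList<>();
        for (Particle p : sequence) {
            switch (p.kind) {
                case ELEMENT: {
                    int n = perTag.merge(p.tag, 1, Integer::sum);
                    names.add("Elem_" + n + "_" + p.tag);
                    break;
                }
                case SEQUENCE:
                    names.add("Seq_" + (++seqs));
                    break;
                case CHOICE:
                    names.add("Choice_" + (++choices));
                    break;
            }
        }
        return names;
    }

    public static void main(String[] args) {
        // ( TA, (TB, TC, TB)+, TA, (TB)*, (TE | TF | TG), (TB, TC, TB) )
        List<Particle> nc = List.of(
            Particle.elem("TA"), Particle.seq(), Particle.elem("TA"),
            Particle.elem("TB"), Particle.choice(), Particle.seq());
        System.out.println(particleNames(nc));
    }
}
```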
If the top-level content model as written in the DTD is an alternative,
the top-level content for Tdom is considered as a singleton sequence.
In this case we get a very simple top-level naming, like ...
    <!ELEMENT NC ( TA | (TB, TC, TD)+ | (TC)+ | (TD, TC, TD) ) >
                 |
                 Choice_1

    <!ELEMENT NC ( TA | (TB, TC, TD)+ | (TC)+ | (TD, TC, TD) )* >
                 |
                 Choice_1
For each sub-sequence and each choice contained in the top-level sequence, an inner class is defined in the class representing the element. The names of these classes are identical with the particle names, as defined above.
As with element classes, each such class provides update and retrieve methods ("set_<..>" and "get_<..>") for its sub-contents. An instance of such an inner class must be used as an argument for the constructors and for update methods, and is returned as result of a corresponding retrieve method.
Please note that in the current implementation there is no algebraic equality defined on content models. Therefore in the example above the types of Seq_1 and Seq_2 are not compatible, in spite of having the same contents definition. The same fact holds for embedded choices.
Built upon the particle names, the Java class generated for the DTD element "NC" provides methods for retrieving and updating the contents of a given instance. Which methods are generated is controlled by the quantification decoration of the content particle.
Let <pname> be the particle name, and <plural> be its plural form
(i.e. "Elems_1_TA" for "Elem_1_TA",
"Choices_2" for "Choice_2" and
"Seqs_1" for "Seq_1").
Let <pclass> be the class representing a sub-content (i.e.
"Element_TA" for an Element reference, and "Element_NC.Seq_1" or
"Element_NC.Choice_2" for embedded sub-content).
In case of undecorated particles the generated methods are ...
    public <pclass> get<pname>(){..}               // deliver current content
                                                   // this is always != null
    public <pclass> set<pname>(<pclass> e){..}     // update current content
                                                   // if e==null, throw exception
If the modifier "?" is present, we get ...
    public <pclass> get<pname>(){..}               // deliver current content,
                                                   // but may return null
    public <pclass> set<pname>(<pclass> e){..}     // update current content
                                                   // and accept null as argument
    public boolean has<pname>(){..}                // return whether component is currently
                                                   // contained in the higher-level content
If one of the modifiers "*" or "+" is present, we get ...
    public <pclass>[] get<plural>(){..}                  // deliver whole sequence as an array
    public <pclass> get<pname>(int pos){..}              // deliver content at the given position
                                                         // of the sequence
    public <pclass>[] set<plural>(<pclass>[] e){..}      // update current content totally
                                                         // to a whole sequence
    public <pclass> set<pname>(int pos,<pclass> e){..}   // update current content
                                                         // at the given position
    public int count<plural>(){..}                       // return number of components currently
                                                         // contained in the higher-level content
    public void visit<plural>(Visitor v){..}             // apply visitor to all particles in the
                                                         // sub-content
Please note that for your convenience every "set<>()"-method (including those described in the following sections!) always returns the old, overwritten value as its result.
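The contract of the singular accessors can be sketched as follows. The classes are hand-written stand-ins for an element declared as "<!ELEMENT NC (TA, TA?)>" and are not real Tdom output; in particular the use of IllegalArgumentException is an assumption, since the text only says that an exception is thrown.

```java
/** Hypothetical stand-ins for generated classes; the exception type is an assumption. */
final class AccessorSketch {

    static final class Element_TA { }

    /** Models <!ELEMENT NC (TA, TA?)>: one required and one optional particle. */
    static final class Element_NC {
        private Element_TA elem_1_TA;   // undecorated particle: never null
        private Element_TA elem_2_TA;   // particle with "?": may be null

        Element_NC(Element_TA first) {
            setElem_1_TA(first);
        }

        Element_TA getElem_1_TA() { return elem_1_TA; }

        /** Rejects null and returns the old, overwritten value. */
        Element_TA setElem_1_TA(Element_TA e) {
            if (e == null) {
                throw new IllegalArgumentException("Elem_1_TA must not be null");
            }
            Element_TA old = elem_1_TA;
            elem_1_TA = e;
            return old;
        }

        Element_TA getElem_2_TA() { return elem_2_TA; }

        /** Accepts null (meaning "absent") and returns the old, overwritten value. */
        Element_TA setElem_2_TA(Element_TA e) {
            Element_TA old = elem_2_TA;
            elem_2_TA = e;
            return old;
        }

        boolean hasElem_2_TA() { return elem_2_TA != null; }
    }

    public static void main(String[] args) {
        Element_TA a = new Element_TA();
        Element_TA b = new Element_TA();
        Element_NC nc = new Element_NC(a);
        System.out.println(nc.setElem_1_TA(b) == a);   // the setter hands back the replaced value
        System.out.println(nc.hasElem_2_TA());
    }
}
```

Returning the old value allows swapping a sub-tree out and re-using it elsewhere in one statement, which fits the sharing-friendly design described in the introduction.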
The meaning of the method "visit<plural>(Visitor v)" will be explained in Section 4.2.
In the preceding list of methods the retrieval functions for plural sub-contents like "getElems_1_TA()" (or like "getSeqs_2()" and "getChoices_1()" as introduced in the next sections) deliver direct access to the Java "array" data object which realizes the contents of the Tdom model instance. Due to a weakness of the Java language, this array is not protected against vandalism, i.e. storing a forbidden null value into it.
The same holds for the plural update methods like "setElems_1_TA(Element_TA[])" (or like "setSeqs_1(Element_NC.Seq_1[])" and "setChoices_1(Element_NC.Choice_1[])"), which additionally can violate the "+" specification when applied to zero-length arguments. So the type correctness w.r.t. the original DTD is only guaranteed if these API methods are not used.
An alternative is the usage of checked lists for the implementation. These lists are proxy classes above the Java list classes and prohibit the insertion of null, see their API doc. This mode is selected when running Tdom with the command line switch "--generateLists", see Section 7.1.
If lists are selected, then in the method signatures below and above all types "x[]" (appearing as result types or parameters) must be replaced by "CheckedList<x>". In this mode, all code using Tdom is always type correct w.r.t. the original DTD.
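The idea of a list that refuses null can be illustrated with plain JDK means. The class `NonNullList` below is a hypothetical sketch built on java.util.AbstractList; it is not the CheckedList class of the Tdom runtime, whose actual behaviour and exception types should be taken from its API documentation.

```java
import java.util.AbstractList;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of a null-rejecting list; NOT the Tdom CheckedList class. */
final class NonNullListSketch {

    static final class NonNullList<E> extends AbstractList<E> {
        private final List<E> backing = new ArrayList<>();

        private static <T> T check(T e) {
            if (e == null) {
                throw new IllegalArgumentException("null elements are not allowed");
            }
            return e;
        }

        @Override public E get(int index) { return backing.get(index); }

        @Override public int size() { return backing.size(); }

        @Override public E set(int index, E e) { return backing.set(index, check(e)); }

        // add(E) and addAll(..) of AbstractList are routed through this method.
        @Override public void add(int index, E e) {
            backing.add(index, check(e));
            modCount++;
        }

        @Override public E remove(int index) {
            modCount++;
            return backing.remove(index);
        }
    }

    public static void main(String[] args) {
        List<String> list = new NonNullList<>();
        list.add("a");
        try {
            list.add(null);
        } catch (IllegalArgumentException ex) {
            System.out.println("rejected: " + ex.getMessage());
        }
        System.out.println(list);
    }
}
```

In contrast to a raw array, every write passes through a method that can enforce the constraint. A "+" quantification would additionally need a lower bound on the size, which this sketch leaves out.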
In case of embedded sequences, the whole top-level procedure (particle naming scheme, inner class definition for sub-structures and generation of methods) is simply applied recursively.
A difference in the implementation is that the classes for
sub-content (i.e. embedded Sequences and Choices) are not
inner classes of the inner class representing the sub-content, but
reside as direct inner classes of the element's class.
The nesting is only represented by their name, which is a concatenation
of the particle names of all levels, connected by an underscore "_".
The following example shows some of the get methods and the resulting types (classes). The names of both are again constructed from the particle names:
    <!ELEMENT NC ( TA, (TB, TC, (TA, TB)* ) )>
                       |        |    |
                       |        |    nc.getSeq_1().getSeq_1(3).getElem_1_TB()=>Element_TB
                       |        |
                       |        nc.getSeq_1().getSeq_1(3)=>NC.Seq_1_Seq_1
                       |
                       nc.getSeq_1()=>NC.Seq_1
The inner classes generated for choices are sub-classes of the pre-defined runtime class TypedChoice. Additionally, for each alternative of a choice an inner class is generated, which is again a sub-class of this "typed choice class". The name of such an alternative class is the name of the choice class with the appendix "_Alt_<n>".
This "<n>" used to identify an alternative is the position number w.r.t. the containing choice in the original DTD formula. This numbering starts with 1(one) !
In our example from above, the naming is ...
    <!ELEMENT NC ( TA, (TB, TC, TB), TA, (TB)*, (TE | TF | TG), (TB, TC, TB) ) >
                                                ||    |    |
                                                ||    |    Choice_1_Alt_3
                                                ||    Choice_1_Alt_2
                                                |Choice_1_Alt_1
                                                Choice_1
The methods generated for the choice class are ...
    public class Element_NC extends eu.bandm.tools.tdom.Element {
      ...
      public Choice_1 setChoice_1(Choice_1 e){...}   // change content accordingly.
      public Choice_1 getChoice_1(){...}             // deliver current content

      public abstract class Choice_1 extends TypedChoice {
        public int getAltIndex(){...}          // deliver the index of the
                                               // currently contained alternative
        public Choice_1_Alt_1 toAlt_1(){..}    // convert to the corresponding class,
        public Choice_1_Alt_2 toAlt_2(){..}    // if current content represents this
                                               // alternative. Otherwise, return null
        public boolean isAlt_1(){..}           // return true iff current content
        public boolean isAlt_2(){..}           // is of the mentioned alternative.
        ...
      }
      public class Choice_1_Alt_1 extends Choice_1 {
        ...  // update/retrieve/visit methods like a top-level element/sequence class !!
      }
    }
The contents of each Choice_<m>_Alt_<n> class is again treated as
a sequence (possibly a singleton sequence), and the top-level naming and
code generation scheme is applied recursively.
Again, no further nesting of inner classes will be applied, but the
representing classes are direct inner classes of the element's class,
and their names created by concatenation of the naming particle
hierarchy.
An example for retrieving:
    <!ELEMENT NC ( TA, (TB | TC | TD, (TE | TF)* ) ) >
                       |          |   |
                       |          |   nc.getChoice_1().toAlt_3().getChoice_1(8)
                       |          |   =>NC.Choice_1_Alt_3_Choice_1
                       |          |
                       |          nc.getChoice_1().toAlt_3()=>NC.Choice_1_Alt_3
                       |
                       nc.getChoice_1()=>NC.Choice_1
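How client code typically works with such a choice class can be sketched with simplified stand-ins. The method names getAltIndex(), isAlt_<n>() and toAlt_<n>() follow the scheme described above; everything else (the string payloads, the method `describe`, and the fact that the classes are not nested inside an element class) is a simplification made for this illustration.

```java
/** Hypothetical, simplified stand-ins for generated choice classes; not real Tdom output. */
final class ChoiceSketch {

    abstract static class Choice_1 {
        /** 1-based index of the alternative this instance represents. */
        abstract int getAltIndex();

        boolean isAlt_1() { return getAltIndex() == 1; }
        boolean isAlt_2() { return getAltIndex() == 2; }

        /** Returns this instance as alternative 1, or null if it is a different alternative. */
        Choice_1_Alt_1 toAlt_1() { return isAlt_1() ? (Choice_1_Alt_1) this : null; }

        /** Returns this instance as alternative 2, or null if it is a different alternative. */
        Choice_1_Alt_2 toAlt_2() { return isAlt_2() ? (Choice_1_Alt_2) this : null; }
    }

    static final class Choice_1_Alt_1 extends Choice_1 {
        final String te;   // stands for the content of alternative 1
        Choice_1_Alt_1(String te) { this.te = te; }
        @Override int getAltIndex() { return 1; }
    }

    static final class Choice_1_Alt_2 extends Choice_1 {
        final String tf;   // stands for the content of alternative 2
        Choice_1_Alt_2(String tf) { this.tf = tf; }
        @Override int getAltIndex() { return 2; }
    }

    /** Typical client code: branch on the index, then convert without an explicit cast. */
    static String describe(Choice_1 c) {
        switch (c.getAltIndex()) {
            case 1:  return "TE:" + c.toAlt_1().te;
            case 2:  return "TF:" + c.toAlt_2().tf;
            default: throw new IllegalStateException("unknown alternative");
        }
    }

    public static void main(String[] args) {
        System.out.println(describe(new Choice_1_Alt_2("x")));
        System.out.println(new Choice_1_Alt_1("y").toAlt_2());
    }
}
```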
Mixed content and plain character content are treated specially. Mixed content could be considered a "choice-type with *-quantification", but in contrast to the standard implementation described above, the layer which explicitly addresses the choices is skipped for the sake of the user's convenience.
Instead, a specialized Content class is defined in the element's implementing class, which can contain either character data, or one of the elements listed in the mixed content declaration.
So the DTD definition ...
    <!ELEMENT NB (#PCDATA) >
    <!ELEMENT NC ( #PCDATA | TA | TB )* >
...is translated to ...
    public class Element_NB extends Element
           implements TypedElement.TypedPCDataContainer ... {
      public static class Content extends TypedElement.MixedContent { ... }
      public List<Content> getContent(){..}   // returns the modifiable list of particles
      public String getPCData() {return getPCData(this);}   // convenience function
    }

    public class Element_NC extends Element
           implements TypedElement.MixedContentContainer ... {
      ...
      public static class Content extends TypedElement.MixedContent {
        public Content (Element_TA el){...}     // create the variant with element TA
        public boolean isElement_TA(){...}      // returns whether content particle is a TA
        public Element_TA toElement_TA(){...}   // returns casted content or null
        public Content (Element_TB el){...}     // create the variant with element TB
        public boolean isElement_TB(){...}      // returns whether content particle is a TB
        public Element_TB toElement_TB(){...}   // returns casted content or null
        // inherited from TypedElement.MixedContent :
        public Content (String s){...}          // create the variant with pcdata
        public Content (TypedPCData s){...}     // dto.
        public boolean isPCData(){...}          // returns whether content particle is PCData
        public TypedPCData toPCData(){...}      // returns casted content or null
      }
      ...
      public List<Content> getContent(){..}   // returns the modifiable list of particles
    }

    // to get the character content of the pcdata particles, you additionally need:
    public class TypedPCData extends TypedNode {
      ...
      public String getPCData(){...}   // returns text content of this content particle
    }
Let Elx be a generated element class, and elx a reference to an instance of it. Reading the character data of a given content particle is done as in
    for (Elx.Content c : elx.getContent()) {
      if (c.isPCData()) {
        String charSeq = c.toPCData().getPCData();
        // ... process charSeq
      }
    }
This is rather tedious, of course.
The PCData objects themselves are algebraic: to change the text contents, you have
to create a new instance and insert it into the list of elx.getContent().
For convenience there is a constructor which implies the new PCData():
    elx.getContent().add(new Elx.Content("text value"));
All elements which are defined by the DTD wording (#PCDATA) or (#PCDATA)*, i.e. which are pcdata ONLY, are realized as instances of PCDataContainer, a sub-class of MixedContentContainer.
Please note that also in this case you never can make any assumption on how many content particles exist, the concatenation of which represents the plain text.
Anyhow, processing should not happen on this technical level of representation.
Additionally, for convenience,
these objects offer directly the method
getPCData(), which concatenates all fragments into one string.
Setting the contents nevertheless requires creating the intermediate
container level by executing
elx.setContent(new Elx.Content("newstringvalue"));
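The concatenation of text fragments described above can be sketched as follows. This is a minimal illustration only: the class FlatPCDataSketch and its nested Content class are hypothetical stand-ins, not the classes Tdom generates, and the sketch assumes that only the direct text particles are joined while child elements are skipped.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-ins for illustration; NOT the classes generated by Tdom. */
final class FlatPCDataSketch {

    /** One content particle of a mixed-content element: either text or a child element. */
    static final class Content {
        final String pcData;      // non-null for a text particle
        final String elementTag;  // non-null for an element particle

        private Content(String pcData, String elementTag) {
            this.pcData = pcData;
            this.elementTag = elementTag;
        }
        static Content text(String s) { return new Content(s, null); }
        static Content element(String tag) { return new Content(null, tag); }
        boolean isPCData() { return pcData != null; }
    }

    /** Joins the text particles in order; element particles are skipped, not descended into. */
    static String flatPCData(List<Content> particles) {
        StringBuilder sb = new StringBuilder();
        for (Content c : particles) {
            if (c.isPCData()) {
                sb.append(c.pcData);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Content> content = new ArrayList<>();
        content.add(Content.text("Hello, "));
        content.add(Content.element("TA"));
        content.add(Content.text("world"));
        System.out.println(flatPCData(content));
    }
}
```

The caller never learns how many fragments there were, which is exactly why no assumption about their number should be made.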
Besides this low-level treatment there is a general method
TypedElement { String getDeepPCData() ; } |
It descends the whole subtree rooted at the element
and collects all character data recursively.
This corresponds to the notion of "string-value" in XPath [XPath 1.0/5.2],
to XPath's "string()" function and to "xsl:value-of" in [xslt1_0].
(The implementation requires the instantiation of a Visitor. This code is
specific for the model, and thus realized in the generated code for
Element.)
(The runtime class TypedElement additionally offers both functionalities
wrapped into static function objects:
public static final Function<MixedContentContainer, String> getFlatPCData,
public static final Function<MixedContentContainer, String> getDeepPCData,
)
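To illustrate what "collecting all character data recursively" means, the following sketch uses the standard W3C DOM instead of a Tdom model (an assumption made only to keep the example self-contained). For an element node, Node.getTextContent() concatenates the text of all descendants and leaves out comments and processing instructions, which is the same idea as the XPath string-value. The class name DeepTextSketch is hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

/** Illustration with the standard W3C DOM; this is NOT the Tdom implementation. */
final class DeepTextSketch {

    /** Parses the XML text and returns all character data below the root element. */
    static String deepText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            // For an element, getTextContent() descends the whole subtree
            // and skips comment and processing-instruction nodes.
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalArgumentException("input is not well-formed XML", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(deepText("<p>a<b>b<i>c</i></b>d<!-- ignored --></p>")); // prints "abcd"
    }
}
```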
The definition of "attributes" in XML is rather awkward and impractical.
(Indeed we met well-experienced XSLT programmers who admitted that in their daily work the first step of every processing is the replacement of all attributes by additional ELEMENTs.)
The Tdom support of attributes is as follows:
For every pair of ELEMENT declaration and attribute definition
a new inner class is defined in the element's class, which is derived
from that subclass of
<METATOOLS>/tdom/runtime/TypedAttribute which corresponds to the attribute's "type",
see table below.
(Only exception: Common classes for attributes of different elements as described
in Section 2.7.5.)
The naming convention for this inner class and for the retrieval/update methods is similar to that of content particles as described above: The attribute named "X" from the DTD is addressed as "Attr_<X'>!" in the Java code, where X' is the mangled character sequence from X, as described for elements in Section 2.6. This Java name is used directly for the inner class which implements the attributes, and in the names of the retrieval methods.
The attribute objects serve as storages for values, not as values: They are created with the element object automatically, but a value has to be assigned to them explicitly (by the user of the API or via the parsed XML source). The identity of the attribute objects related to a particular element instance is totally under the control of the Tdom code: No explicit assignment by the user is possible; initial sharing is terminated automatically by write access; therefore references to attribute objects should not be cached.
There are two retrieval methods:
element.readAttr_X() delivers the current attribute object.
In case that this attribute has the default value, a common default object
is returned. In this case the attempt to set a new value
will result in an UnsupportedOperationException("!mutable").
But this is the better method to read an attribute value, because
default objects can be shared. This method is also applied by all generated
visitor code.
element.getAttr_X() delivers an individual object anyhow. The value of this object may be read and written. This method should only be used when writing is indeed intended, because the common default object is replaced by a dedicated, writable copy.
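The difference between the two retrieval methods can be sketched as a simple copy-on-write scheme. All classes below are hypothetical stand-ins written for this illustration; only the method names readAttr_.../getAttr_... and the exception text are taken from the description above, and the real generated code may be organized differently.

```java
/** Hypothetical stand-ins illustrating shared default objects; NOT generated Tdom code. */
final class AttrAccessSketch {

    /** An attribute object: a storage cell which may be a shared, read-only default. */
    static final class Attr {
        private String value;
        private final boolean sharedDefault;

        Attr(String value, boolean sharedDefault) {
            this.value = value;
            this.sharedDefault = sharedDefault;
        }
        String getValue() { return value; }
        void setValue(String v) {
            if (sharedDefault) {
                throw new UnsupportedOperationException("!mutable");
            }
            value = v;
        }
    }

    /** One common default object, shared by all elements which never wrote the attribute. */
    static final Attr DEFAULT_LANG = new Attr("en", true);

    static final class Element {
        private Attr lang = DEFAULT_LANG;

        /** Read access: may return the shared default object. */
        Attr readAttr_lang() { return lang; }

        /** Write access: replaces the shared default by a dedicated, writable copy. */
        Attr getAttr_lang() {
            if (lang == DEFAULT_LANG) {
                lang = new Attr(DEFAULT_LANG.getValue(), false);
            }
            return lang;
        }
    }

    public static void main(String[] args) {
        Element e = new Element();
        System.out.println(e.readAttr_lang() == DEFAULT_LANG); // still shared
        e.getAttr_lang().setValue("de");                        // sharing ends here
        System.out.println(e.readAttr_lang().getValue());
        System.out.println(DEFAULT_LANG.getValue());            // default is untouched
    }
}
```

The sketch also shows why a cached reference is dangerous: an object obtained by the read method before the first write is no longer the element's attribute object afterwards.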
The retrieved Attribute object in turn provides methods for setting and getting
the values, namely V getValue() and setValue(V).
getValue() returns null only for an attribute which has been
declared #IMPLIED and currently has an "absent" value.
String getStringValue()
and static String getStringValue(V) deliver the value as it would appear in an
XML standard text serialization.
String getTypeString() delivers the text of the type declaration in the DTD.
boolean isOptional() / isFixed() /
isRequired() delivers whether the value has been declared in the
DTD as #IMPLIED / #FIXED / #REQUIRED.
boolean isSpecified() delivers whether the attribute has been set explicitly,
when creating the containing element instance or afterwards. (For details see
Section 2.7.4 below.)
V getDefaultValue() delivers the default value as declared in the DTD.
The value null represents "attribute value is absent" and corresponds to
the declaration "#IMPLIED".
(The V getValue(), setValue(V) and a few other methods can be realized directly in the generated code, or inherited from the corresponding base classes from tdom.runtime, so please have a look at that API doc and into the generated sources.)
The type expression V depends on the attribute's "type" as it appears in the DTD:
DTD attribute "type" | realizing class derived from | Java type <V> |
NMTOKEN | NmTokenAttribute | String |
Id | IdAttribute | String |
IdRef | IdRefAttribute | String |
CData | CDataAttribute | String |
Enumeration | EnumerationAttribute | Enum<V>, dedicated type, generated for (and locally to) this attribute |
Enumeration, if enum values are all integers, see Section 2.7.3. | SelectedIntegersAttribute | Integer |
NMTOKENS | NmTokensAttribute | List<String> |
IdRefs | IdRefsAttribute | List<String> |
(For the inheritance relation between the different attribute classes see the tdom runtime class tree.)
In case of enumeration type attributes, the value must be one item from the enumeration class. This is a public inner class of the generated Attribute's class and always has the name "Value". When "s" is the name of one particular enumeration value as written in the DTD, then the corresponding enumeration items have the name
"Value_" + (s.replaceAll("[-.:]", "_")) |
This translation is the same as described above, see Section 2.6. (Please note that name clashes may result!-)
The enumeration items offer a method "String getStringValue()", which delivers the original DTD wording; the EnumerationAttribute's class offers a method "Map<String,Value> getValueMap()" for the inverse translation.
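The name translation and the possible name clash can be tried out with the following sketch. The class EnumNameSketch and its methods are hypothetical; the sketch uses String.replaceAll because the pattern "[-.:]" is a regular expression, whereas String.replace would look for that text literally.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper illustrating the naming rule; NOT part of Tdom. */
final class EnumNameSketch {

    /** Translates an enumeration token from the DTD into a Java item name. */
    static String mangle(String dtdToken) {
        return "Value_" + dtdToken.replaceAll("[-.:]", "_");
    }

    /**
     * Builds the map from Java item name back to the DTD token.
     * Throws IllegalStateException if two tokens are mangled to the same name.
     */
    static Map<String, String> mangleAll(String... dtdTokens) {
        Map<String, String> javaNameToToken = new LinkedHashMap<>();
        for (String token : dtdTokens) {
            String previous = javaNameToToken.put(mangle(token), token);
            if (previous != null) {
                throw new IllegalStateException(
                    "name clash: \"" + previous + "\" and \"" + token + "\" both give " + mangle(token));
            }
        }
        return javaNameToToken;
    }

    public static void main(String[] args) {
        System.out.println(mangle("http-equiv"));       // Value_http_equiv
        System.out.println(mangleAll("top", "bottom")); // no clash
        try {
            mangleAll("a-b", "a.b");                    // both become Value_a_b
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```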
The setValue() method executes validity tests on its parameters as follows:
All these violations throw a TdomAttributeSyntaxException (or a subclass thereof).
See Section 3.1 for the hierarchy of TdomException.
So disallowed null values and violated #FIXED attributes are treated
as special cases of failed syntax checks.
Therefore all generated
setValue(V) methods are declared with "throws TdomAttributeSyntaxException",
except the setValue(V) of an EnumerationAttribute and
of a CDataAttribute, if and only if they have the default value
#IMPLIED, which additionally allows null.
Compare e.g. the setValue(V) methods for Attr_http_equiv, Attr_lang (common attribute,
inherited method) and Attr_content (inherited method) in
Element_meta of the XHTML tdom.
In many standard DTDs one frequently finds enumeration attributes which only contain selected integer values. The standard implementation would require a chain of three redundant conversions when executing calculations (text representation to enumeration value to text representation to integer value). Therefore the attribute with name "att1" in element "ele1" and the shared attribute "att2" (see Section 2.7.5) are treated specially when declared in the dtd by the tdom pi
<?tdom selectedIntegers ele1@att1 @att2 ?> |
This leads to the generation of a subclass of SelectedIntegersAttribute. This class implements storage, retrieval and validity check directly on Java "int" values.
Attributes which are declared with a default value (including the special value
"#IMPLIED") can be "not present" in the textual representation of an XML document,
or similarly, when creating the document by constructor calls.
In Tdom, this fact is memorized by an internal flag.
The dedicated method
class Attr_[XY] { ... public void clearValue(){..} } |
clears that flag and sets the value back to the default value. Afterwards, the attribute will not be written out when serializing the document model. This can be changed by executing e.g.
el.getAttr_XY().clearValue(); el.getAttr_XY().setValue(el.readAttr_XY().readValue()); -- or -- el.getAttr_XY().setValue(el.readAttr_XY().getDefaultValue()); |
After this, the value is also the default value, but the attribute is considered "set" and will be written out (unless the value is null in case of #IMPLIED).
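The interplay of value, default value and the internal flag can be sketched as follows. The class is a hypothetical stand-in which only reuses the method names from the text; in particular the serialization shown is an assumption and does no escaping of special characters.

```java
/** Hypothetical stand-in illustrating the "specified" flag; NOT generated Tdom code. */
final class SpecifiedFlagSketch {

    static final class Attr {
        private final String name;
        private final String defaultValue; // null stands for #IMPLIED
        private String value;
        private boolean specified;

        Attr(String name, String defaultValue) {
            this.name = name;
            this.defaultValue = defaultValue;
            this.value = defaultValue;
        }
        String getValue() { return value; }
        String getDefaultValue() { return defaultValue; }
        boolean isSpecified() { return specified; }

        /** Every explicit assignment marks the attribute as "set", even with the default value. */
        void setValue(String v) {
            value = v;
            specified = true;
        }

        /** Back to the default value, and the attribute counts as "not present" again. */
        void clearValue() {
            value = defaultValue;
            specified = false;
        }

        /** Written out only if set explicitly and not absent (no escaping in this sketch). */
        String serialize() {
            return (specified && value != null) ? " " + name + "=\"" + value + "\"" : "";
        }
    }

    public static void main(String[] args) {
        Attr a = new Attr("attB", "B1");
        System.out.println("[" + a.serialize() + "]"); // not written
        a.setValue(a.getDefaultValue());
        System.out.println("[" + a.serialize() + "]"); // written, although it is the default
        a.clearValue();
        System.out.println("[" + a.serialize() + "]"); // not written again
    }
}
```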
So far, no two different attributes are ever assignment compatible, even if they carry the same name, type and default value. This corresponds to the definitions of DTDs, which do not impose any semantics on attributes besides the mere string value.
To impose an abstraction on attributes, Tdom understands process instructions like ...
<?tdom attribute attA CDATA #IMPLIED attB (B1|B2) "B1" ?> |
The meaning is, that on top-level of the generated package (i.e. not as a part of any ELEMENTs code) a stand-alone attribute class is generated. This class is named and behaves like the "local" attribute classes described above.
A local attribute class is still created in any Element's class, as described above. But for each attribute which matches a "global" attribute, this class (1) uses the global class as its base class, (2) inherits its methods and most of its fields, and (3) will be recognized by the more abstract matching methods of the generated Matcher class. This last feature, as described in Section 4.2, is the main purpose of this abstraction.
In many DTDs from practical use, common attributes are declared in ENTITYs, which are included in different ATTLISTs. In this case the effect of creating common base classes can be achieved for all attributes defined in such an entity by the process instruction ...
<?tdom attribute-entity entA entB entC ?> |
In this case all entities "entA", "entB", "entC" must expand to complete attribute declarations (one or more), and the process instruction is processed exactly as explained above, after expanding these entities.
Some remarks are practically important:
First: Currently only the "name" of the attributes is used for name mangling, so there can be only one common attribute class with a certain name. Tdom behaves like DTD (as ugly as it is !-) insofar as the first definition wins over any subsequent attempt to re-define. A warning is issued in this case.
Second: A common attribute is only recognized if all three dimensions (name, "type", and initial value) are exactly identical. So the following two declarations do not match:
<?tdom attribute attA CDATA #IMPLIED ?> <!ATTLIST E attA CDATA #REQUIRED > |
The Tdom tool will issue a hint, whenever a common attribute is not recognized due to such minimal differences.
Third: The entity names themselves and the grouping of the attributes are in no way reflected; they are simply "unpacked" to a list of attributes, which are independently compiled as common attributes, as described.
Attributes of "type" ID, IDREF and IDREFS are special because they
are intended to model references between sub-trees of a document.
An XML document is only "valid" if (a) there is at most one(1)
attribute declaration of "type ID" in each element's attribute list, and
(b) the string value of every instance of an IDREF attribute, and each single "NAME"
token in the value of an IDREFS attribute,
corresponds to exactly one(1) instance of an ID attribute carrying
the same value (see [xml], "Validity constraint: One ID per Element Type"
and "Validity constraint: IDREF").
Of course it is not sensible to enforce these conditions at every moment of a model's life-time.
E.g. for changing the value of an ID attribute without violating these rules,
first all referring IDREF/IDREFS tokens must be deleted.
Only then may the attribute's value be changed, and only after that
can all referring attributes be visited a second time, to set them to the new value.
Therefore Tdom does not check ID/IDREF/IDREFS attributes by default. Instead, this can be done explicitly, when a model is completely constructed, by the following methods:
// in the generated package:
class Document_<DOC> {
  /** @return the id-based map string->Element
   *  @throws tdom.runtime.HomonymousIdException if one(1) id is used for
   *          two(2) different elements
   *  @throws tdom.runtime.SynonymousIdException if two(2) ids are used for
   *          one(1) element */
  public ElementDictionary<Element> createDictionary() { }
}
// in package tdom.runtime :
/** Indicates the presence of an ID attribute. **/
public interface Identifiable {
  /** @return the current id, but does not supply an automatically generated one.*/
  String @Opt getId() ;
}
class ElementDictionary<E>{
  /** @return the element with the given id, or null.*/
  public @Opt E get(String s){ }
}
class IdRefAttribute{
  /** @return the element with the given id, or null.*/
  public @Opt E getValue(ElementDictionary<E>){ }
}
class IdRefsAttribute{
  /** @return a list of all elements with the ids, including "null" for failures.*/
  public java.util.List<@Opt E> getValues(ElementDictionary<E>){ }
}
There are some more useful methods for handling the mappings explicitly. Please refer to the api doc of the involved runtime classes.
Please note that "SynonymousIdException" cannot occur when only using generated code for filling the map: According to [xml], "Validity constraint: One ID per Element Type", each element definition may have only one attribute of "ID" flavour, and this is checked statically when translating the DTD and can cause an error message.
The constraint [xml], "Validity constraint: IDREF / second phrase" is not checked at all automatically: every return value ==null can be treated as an error by the caller explicitly, if appropriate.
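The idea of checking identifiers only on demand, after the model is complete, can be sketched with a plain map. Everything in this sketch is hypothetical: the Element class, the unchecked exception (made unchecked here only to keep the sketch short) and both methods merely mimic the roles of createDictionary() and getValues() described above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical stand-ins illustrating an id dictionary; NOT the Tdom runtime classes. */
final class IdDictionarySketch {

    /** Raised if one id is used for two different elements. */
    static final class HomonymousIdException extends RuntimeException {
        HomonymousIdException(String id) { super("id used twice: " + id); }
    }

    static final class Element {
        final String tag;
        final String id; // null if the element carries no ID attribute
        Element(String tag, String id) { this.tag = tag; this.id = id; }
    }

    /** Collects all ids of a completed model; this is the moment the uniqueness check happens. */
    static Map<String, Element> createDictionary(List<Element> elements) {
        Map<String, Element> dictionary = new HashMap<>();
        for (Element e : elements) {
            if (e.id == null) continue;
            Element previous = dictionary.put(e.id, e);
            if (previous != null && previous != e) {
                throw new HomonymousIdException(e.id);
            }
        }
        return dictionary;
    }

    /** Resolves IDREFS tokens; a dangling reference yields null, left to the caller to judge. */
    static List<Element> resolve(Map<String, Element> dictionary, List<String> idRefs) {
        List<Element> result = new ArrayList<>();
        for (String ref : idRefs) {
            result.add(dictionary.get(ref));
        }
        return result;
    }

    public static void main(String[] args) {
        List<Element> model = List.of(new Element("a", "x1"), new Element("b", null), new Element("c", "x2"));
        Map<String, Element> dictionary = createDictionary(model);
        for (Element e : resolve(dictionary, List.of("x2", "nowhere"))) {
            System.out.println(e == null ? "dangling reference" : e.tag);
        }
    }
}
```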
In most DTDs, the contents of many attributes and of PCDATA-only elements are employed to encode numeric values, i.e. integer or floating point numbers. For easy decoding of these data, the class TypedNode offers some overloaded methods.
Their names are "asInt(..)", "asBigInteger(..)", "asDouble(..)", "asHexInt(..)", etc. They behave robustly and deliver null in case of null input or conversion error. The overloading allows them to be applied to the appropriate attribute types and to element contents in a uniform way. (Hence their location as static methods in "TypedNode"!)
For details please refer to the api doc.
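The "robust" behaviour can be pictured by the following sketch on plain strings. Only the method names are borrowed from the text; the signatures, the trimming of white space and the accepted number formats are assumptions of this sketch, so the API doc remains the authority for the real methods.

```java
/** Hypothetical re-implementation on plain strings; NOT the methods of TypedNode. */
final class NumericDecodeSketch {

    /** Returns the decimal integer value, or null for null input or unparsable text. */
    static Integer asInt(String s) {
        if (s == null) return null;
        try {
            return Integer.valueOf(s.trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }

    /** Returns the value of a hexadecimal numeral without prefix, or null on failure. */
    static Integer asHexInt(String s) {
        if (s == null) return null;
        try {
            return Integer.valueOf(s.trim(), 16);
        } catch (NumberFormatException e) {
            return null;
        }
    }

    /** Returns the floating point value, or null on failure. */
    static Double asDouble(String s) {
        if (s == null) return null;
        try {
            return Double.valueOf(s.trim());
        } catch (NumberFormatException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        System.out.println(asInt(" 42 "));
        System.out.println(asInt("forty-two"));
        System.out.println(asHexInt("ff"));
        System.out.println(asDouble("2.5"));
    }
}
```

Since null signals both "no input" and "conversion error", the caller decides in one place whether a missing number is an error.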
The generated Java source contains automatically generated API documentation to explain
the fundamental technical aspects of handling the generated classes and methods.
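But of course, Tdom knows nothing about their intended meaning.
Therefore it is possible to attach explicit author's documentation text to the model as a whole, to a particular element, to a particular attribute of a particular element, and to an abstract/shared attribute.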
The means are process instructions in the DTD, as described in Section 2.1.1.
For the above-mentioned categories this looks like ...
<?tdom doc - documentation text for the model as a whole. ?> <?tdom doc oneElement documentation text for this particular element called "oneElement" ?> <?tdom doc oneElement@oneAttribute documentation text for this particular attribute of this particular element ?> <?tdom doc @oneAttribute documentation text for this particular abstract/shared attribute ?> |
At the end of the Tdom run all PIs which have not been processed, e.g. because they address a non-existing target due to a typo, are reported by one warning message each.
More than one entry with the same documentation target may occur; their text will be concatenated in source order.
The treatment of these "semantic" or "author's" documentation texts is similar to that in umod : During the normal rendering process (by javadoc) the resulting "doc comments" are rendered specially: included in "<div class='bandmUser'>" tags. They appear in green color, if the stylesheet "bandmApiDoc.css" is appended to the generated stylesheet. This shall clarify the difference between the mere technical documentation and the semantic level of meaning.
Going even further, one can employ
javadoc ... -doclet eu.bandm.tools.tdom.doclet.TdomUserDoc
In this case most technical fields and methods are totally omitted, and only those
with the annotation "@User" are included in the documentation. The result
is much leaner and focussed and may be more useful when programming "around" the tdom
model.
It is not self-evident that processing instructions and comments are part of a "model" in the narrow sense, and originally Tdom did not support them.
Anyhow, the requirements and application contexts are various, so it may be sensible to include them. We introduced them as "second class" inhabitants, which have to be attached to a "substantial" inhabitant for being stored and retrieved.
Every Element and every PCData fragment as a "Substantial" has two "decorative" sequences of "Ethereals" (see the symbolic class tree in Section 2.4 !), one "preceding" and one "following".
Furthermore, every Element and Document has a "leading" and a "trailing" sequence, which can be used if no Substantials are contained. The access methods are
List<TypedEthereal> TypedDocument.[get/read]LeadingEthereals() List<TypedEthereal> TypedDocument.[get/read]TrailingEthereals() List<TypedEthereal> TypedElement.[get/read]LeadingEthereals() List<TypedEthereal> TypedElement.[get/read]TrailingEthereals() List<TypedEthereal> TypedSubstantial.[get/read]PrecedingEthereals() List<TypedEthereal> TypedSubstantial.[get/read]FollowingEthereals() |
The ..read.. variant delivers a read-only-list (which can be shared iff empty), the ..get.. variant delivers a list the user can modify.
There are n+1 possibilities to store a sequence of n Ethereals w.r.t. the two neighbouring Substantials:
XML source | IN | OR | OR | OR |
<el> | | | | |
<!-- comment --> | el.leading | el.leading | el.leading | subel.preceding |
<?target text ?> | el.leading | el.leading | subel.preceding | subel.preceding |
<!-- comment --> | el.leading | subel.preceding | subel.preceding | subel.preceding |
<subel/> | | | | |
<!-- comment --> | subel.following | subel2.preceding | | |
<subel2/> | | | | |
</el> | | | | |
Where an Ethereal is stored does not have any meaning a priori. A parser is allowed to choose any solution, arbitrarily. Of course, on the next conceptual level the user may define a "meta-syntax" of relations and meanings. (E.g., is a comment related to the follower, or to the element just opened? Is a comment related to a processing instruction?) In such a case the model must be traversed and these relations constructed explicitly, implemented by additional data.
According to the possible error conditions when constructing a Tdom instance, there is a class tree of the following hierarchy of checked exceptions. (Java speak "checked" means that they must be declared explicitly.)
All these classes memorize the information about the offending value, attribute, element context and location, as far as known. There are further subtypes of these classes for particular cases, see the api doc.
For narrowing the scope of the necessary exception declarations, there is the class TypedAttribute.SafeValues with its single instance, which is used as a flag to distinguish between safe and unsafe methods and constructors, i.e. those which do not throw and those which throw a TdomAttributeException. For details see Section 3.2.5.
Two central design issues of Tdom are (a) that all existing models
at every instant of their life-time are type-correct sub-trees
w.r.t. the corresponding DTD, and (b) that this is checked statically, at
compile time, as far as possible.
Therefore most of the generated public constructors always require complete and
type-correct contents as their argument.
As a consequence, a larger Tdom model must be constructed bottom-up, in
a term-like fashion. This (at a first glance possibly annoying) strict
discipline implies especially that Tdom models are always finite by
construction.
Please note that constructing a large Tdom model by explicit constructor
calls is a tedious task. Explicit constructor calls only make sense as the back-end
of some automated translation procedures.
For constructing a Tdom model from a pre-existent XML text file
one can use the SAX interface or the W3C DOM interface.
These are described in Section 3.3 below.
The basic structural element, to which the generated Java constructors correspond, is again the sequence. So constructors are generated for top-level content regular expressions, considered as a sequence, and for all sub-sequences and alternatives, which are sequences again. Since the Java method signature corresponds to the DTD content model, no TdomContentException is thrown by the invocation of such a "statically typed" constructor. (That no TdomAttributeSyntaxException is thrown can be selected by supplying safeValues.) All variants are illustrated by the following example:
<!ELEMENT NC (TA, (TB | (TC, TD*)), (TE, TF?)*, (TG)? ) > |
...is translated into ...
public class Element_NC extends Element {
  public Element_NC (SafeValues s, Element_TA x1, Element_NC.Choice_1 x2,
                     Element_NC.Seq_1[] x3, Element_TG x4) {...}
  public Element_NC (Element_TA x1, Element_NC.Choice_1 x2,
                     Element_NC.Seq_1[] x3, Element_TG x4)
      throws TdomAttributeException {...}
  public abstract static class Choice_1 extends eu.bandm.tools.tdom.TypedChoice {...}
  public static class Choice_1_Alt_1 extends Choice_1 {
    ...
    public Choice_1_Alt_1(Element_TB x1){...}
    ...
  }
  public static class Choice_1_Alt_2 extends Choice_1 {
    ...
    public Choice_1_Alt_2(Element_TC x1, Element_TD[] x2){...}
    // please notice this array reflecting the "*" ^^
    ...
  }
  public abstract static class Seq_1 extends eu.bandm.tools.tdom.TypedSubTree{
    ...
    public Seq_1 (Element_TE x1, Element_TF x2) {...}
    ...
  }
}
This allows a constructor call like ...
new Element_NC( aTA, new Element_NC.Choice_1_Alt_1(aTB), new Element_NC.Seq_1[0], (Element_TG)null ) ; // throws TdomAttributeSyntaxException -- or -- new Element_NC( safeValues, aTA, new Element_NC.Choice_1_Alt_1(aTB), new Element_NC.Seq_1[0], (Element_TG)null ) ; |
For convenience an array parameter at the last position is declared as a "vararg", so that the explicit construction of an intermediate array for this position is not required (though still possible!):
<!ELEMENT NC (A*, B*)> --> leads to --> new Element_NC( Element_A[] elems_A_1, Element_B... elems_B_1) |
Often it is more convenient to simply enumerate the sequence of Java objects which shall make up the contents of a newly created element. In this case, a simplified parsing process can be applied to the classes of these elements. We call it "semi-parser", because it parses only one layer of content, but does not descend into the depth, into contents of sub-elements, as the full-fledged parsers do, as described in Section 3.3.
For this purpose there is an untyped constructor ...
new Element_NC (Element... elements) throws TdomContentException, TdomAttributeSyntaxException {..} |
("Element extends tdom.runtime.TypedElement" is the top-level element class generated specially with this certain model, so the method is "not completely" untyped !-)
A TdomContentException is thrown whenever the supplied sequence of Java
objects cannot be mapped to the content model.
(Since exceptions must be caught or declared anyhow, there is no variant
with "safeValues" preventing TdomAttributeSyntaxExceptions.)
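What "parsing only one layer of content" amounts to can be sketched with a regular expression over the tags of the supplied children. This is an illustration only and not how Tdom implements its semi-parser; the class name is hypothetical, and the content model of NC from the example above is read with the choice as "TB, or TC followed by any number of TD".

```java
import java.util.regex.Pattern;

/** Illustration of a one-layer content check; NOT the Tdom semi-parser. */
final class SemiParserSketch {

    // Content model (TA, (TB | (TC, TD*)), (TE, TF?)*, (TG)?),
    // written over tag names which are each followed by one blank.
    private static final Pattern NC_CONTENT =
        Pattern.compile("TA (TB |TC (TD )*)(TE (TF )?)*(TG )?");

    /** Tells whether the given sequence of child tags fits the content model of NC. */
    static boolean matchesNC(String... childTags) {
        StringBuilder sequence = new StringBuilder();
        for (String tag : childTags) {
            sequence.append(tag).append(' ');
        }
        // Only the tags of the direct children are inspected, never their contents.
        return NC_CONTENT.matcher(sequence).matches();
    }

    public static void main(String[] args) {
        System.out.println(matchesNC("TA", "TB"));                               // fits
        System.out.println(matchesNC("TA", "TC", "TD", "TD", "TE", "TF", "TG")); // fits
        System.out.println(matchesNC("TB", "TA"));                               // wrong order
    }
}
```

A sequence which does not fit is the situation in which the untyped constructor reports a TdomContentException.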
Since the "vararg" arguments can be represented by an array, also alternatives for content creation can be defined by a pure expression, using the concatenation operations defined in <METATOOLS>/eu/bandm/tools/ops/Arrays , as used in our "Dtd to Html renderer" <METATOOLS>/eu/bandm/tools/dtm/HtmlRenderer according to this scheme:
import eu.bandm.tools.ops.Arrays ; // ... final Element_html el_html = new Element_html (new Element_head (new Element_head.Choice_1[0], new Element_head.Choice_2_Alt_1 (new Element_title("windowTitle")), new Element_head.Choice_2[0]), new Element_body (Arrays.append ((htmlIsDynamic) ?new Element_block_content[] {new Element_noscript (new Element_div (new Element_div.Content ("<!-- please switch on JAVA SCRIPT for dynamic behaviour! -->")) {@Override protected void initAttrs(){ getAttr_class().setValue(class_alert); }})} :new Element_block_content[0], new Element_block_content[] {new Element_pre(preItems.toArray (new Element_pre.Content[preItems.size()])), new Element_hr(), makeFooter(basicFileName, "http://bandm.eu/metatools/docs/usage/dtd.html#txt_dtd_tool") } ))); // ... Element_p makeFooter(String a, String b){...} List<Element_pre.Content> preItems = ... |
The header part of the created html element is constructed statically typed. The body part is dynamically typed, using Arrays.append and function calls to write case distinctions in a fully compositional way.
(Please note that in our xhtml model "block_content" is an abstraction of different Element classes, controlled by a content model entity, as described in Section 2.6.1.)
In case of mixed content, e.g. a declaration like ...
<!ELEMENT NM (#PCDATA | TA |TB)* > |
...the generated constructors are ...
public Element_NM (Element_NM.Content... content) throws TdomAttributeSyntaxException {...} public Element_NM (SafeValues, Element_NM.Content... content){...} public Element_NM (String content) throws TdomAttributeSyntaxException {...} public Element_NM (SafeValues, String content){...} |
The "SafeValues" flag has the same role as described above.
The last two constructors are short-cuts for the case of pure character content.
The canonical constructors are the first two, where all components have to be
wrapped into the correct content class, like in ...
new Element_NM (new Element_NM.Content("characters with embedded TA "), new Element_NM.Content(aTA), new Element_NM.Content(new TypedPCData(" followed by a TB ")), new Element_NM.Content(aTB) ) |
The first argument in this example is possible because of the short-cut constructor ...
public class TypedElement.MixedContent { ... public MixedContent (String data){ this(new TypedPCData(data)); } ... } |
Of course, instead of "vararg"-parameters you can always supply an array, e.g. delivered by Collection.toArray([]).
The dynamically typed constructor for elements with mixed contents has the signature
new Element_NM (Object ...) throws TdomContentException, TdomAttributeSyntaxException ; |
It behaves like all other semi-parsers, as described in Section 3.2.2: It throws a TdomContentException whenever the supplied sequence of Java objects cannot be mapped to the content model.
The techniques for constructing the argument list as described for structured content in Section 3.2.2 can be used accordingly.
When an element instance is constructed which has #REQUIRED attributes, then these must be set by the caller and checked by the constructor code for validity, before the constructor is allowed to return normally, meaning success. This is required by the Tdom philosophy of only producing type correct instances.
Setting attributes in a constructor call is done by defining an anonymous inline class derived from the real element class. By overriding the methods public void initAttrs() throws TdomAttributeSyntaxException and public void initAttrsSafe() the caller can set arbitrary attribute values.
No TdomAttributeSyntaxException can leave the second variant, and only this is called by the "safe constructor" (= the constructor with the safeValues flag). So basically three variants are possible for construction:
new Element_e (a, b, c) { @Override public void initAttrs() throws TdomAttributeSyntaxException { getAttr_a1().setValue(v1); } @Override public void initAttrsSafe() { getAttr_a2().setValue(v2); } }; -- or -- new Element_e (safeValues, a, b, c) { @Override public void initAttrsSafe() { getAttr_a2().setValue(v2); } }; -- or -- -- if there are NO unsafe attributes: new Element_e (a, b, c) { @Override public void initAttrs() { getAttr_a2().setValue(v2); } }; |
Of course the safe constructor can (and should) always be used if no attribute at all is set.
ATTENTION: The safe constructor calls only
initAttrsSafe(), but not initAttrs().
Putting initialization code in the latter while meaning the former
is a hard-to-find error which cannot be detected statically.
If there are no unsafe attribute values at all, then only initAttrs() is generated
(throwing nothing). This can be checked at compile time by the @Override annotation.
Setting a required attribute to null by an explicit Java call yields an attribute syntax error, whereas forgetting to set it at all yields a missing attribute error.
If a TdomAttributeSyntaxException is possible, it must be caught and treated locally for using the safe variant, like in
new Element_e (safeValues, a, b, c) {
  public void initAttrsSafe() {
    try {
      getAttr_a1().setValue("myconst");
    } catch (final TdomAttributeSyntaxException e){
      throw new ImpossibleError(" cannot happen, 'myconst' is a valid NMTOKEN.");
    }
  }
};
For convenience there is the method
import static eu.bandm.tools.tdom.runtime.TypedAttribute.assertSetAttrValid ;
new Element_e (safeValues, a, b, c) {
  public void initAttrsSafe() {
    assertSetAttrValid(getAttr_a1(), "myconst");
  }
};
which wraps the TdomAttributeSyntaxException, which is assumed to never happen, into an unchecked AssertionError.
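The wrapping technique itself is plain Java and can be sketched independently of Tdom. All names below are hypothetical stand-ins; in particular the NMTOKEN test is a coarse ASCII-only approximation made up for this sketch, and the method only mimics the role of assertSetAttrValid.

```java
/** Hypothetical stand-ins illustrating the wrapper technique; NOT the Tdom runtime. */
final class AssertValidSketch {

    /** Checked exception, standing in for an attribute syntax exception. */
    static final class SyntaxException extends Exception {
        SyntaxException(String message) { super(message); }
    }

    static final class NmTokenAttr {
        private String value;

        /** Checked setter; the test is a coarse ASCII-only approximation of NMTOKEN. */
        void setValue(String v) throws SyntaxException {
            if (v == null || !v.matches("[A-Za-z0-9._:-]+")) {
                throw new SyntaxException("not an NMTOKEN: " + v);
            }
            value = v;
        }
        String getValue() { return value; }
    }

    /** For values known to be valid: a failure is a programming error, hence unchecked. */
    static void assertSetAttrValid(NmTokenAttr attr, String v) {
        try {
            attr.setValue(v);
        } catch (SyntaxException e) {
            throw new AssertionError("value was assumed to be valid", e);
        }
    }

    public static void main(String[] args) {
        NmTokenAttr attr = new NmTokenAttr();
        assertSetAttrValid(attr, "myconst"); // no try/catch needed at the call site
        System.out.println(attr.getValue());
    }
}
```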
After the execution of these user-defined initialization methods, the constructor
checks for completeness of required attributes.
If an attribute declared as #REQUIRED is not set explicitly, then a
TdomAttributeMissingException is thrown.
All constructors of elements which have such an attribute are thus declared like
package xhtml_1_0 ; public Element_script() throws TdomAttributeMissingException, TdomAttributeSyntaxException { ...} |
The definedness of an attribute by a user-defined initAttrs() method cannot be checked statically, therefore this exception must be caught somewhere. Because a "clean functional style" of programming leads to deeply nested constructor calls, the catch clause would be far away from its cause. Therefore the class TdomAttributeMissingSupplier provides a wrapper method which can be used to translate all TdomAttributeMissingExceptions into an unchecked AssertionError, when they are known not to happen. This is a fragment from "dtm/HtmlRenderer", which builds a complete Html header element by one single expression:
import static eu.bandm.tools.tdom.runtime.TdomAttributeMissingSupplier.assertAttrsComplete ; (..., ... , Element_head.Choice_2_Alt_1_Choice_1.alt (assertAttrsComplete(() -> new Element_script(safeValues, "￯") {@Override protected void initAttrsSafe(){ getAttr_src().setValue(path_to_javascript); assertSetAttrValid(getAttr_type(), "text/javascript"); }}) )... ) |
(By the way: a serious pitfall is trying an abstraction like ...
private Element_e makeIt (final String name){
  return new Element_e(){
    @Override public void initAttrs(){
      this.getAttr_name().setValue(name);
      //                           ^^^^ refers to local field of Element_e class
    }
  };
}
The method's parameter is NOT addressed by "name",
since the local field of the element's class is the narrower lexical scope!)
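This pitfall is a general property of Java's scoping rules and can be reproduced without Tdom. In the following sketch all names are hypothetical; the class Base merely plays the role of an element class which happens to have an accessible field called "name".

```java
/** Plain-Java reproduction of the scoping pitfall; all names are hypothetical. */
final class ShadowingSketch {

    static class Base {
        protected String name = "field-of-Base"; // plays the role of a field of the element class
        String label;
        void init() {}
    }

    /** Pitfall: inside the anonymous class, "name" is the inherited field, not the parameter. */
    static Base makeIt(final String name) {
        Base b = new Base() {
            @Override void init() {
                label = name; // resolves to Base.name
            }
        };
        b.init();
        return b;
    }

    /** Remedy: give the parameter a name which cannot clash with a member of the class. */
    static Base makeItFixed(final String nameArg) {
        Base b = new Base() {
            @Override void init() {
                label = nameArg;
            }
        };
        b.init();
        return b;
    }

    public static void main(String[] args) {
        System.out.println(makeIt("param").label);      // prints "field-of-Base"
        System.out.println(makeItFixed("param").label); // prints "param"
    }
}
```

The compiler accepts both variants without complaint, so only a distinct parameter name protects against the error.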
As mentioned above, large Tdom models are constructed from given text files via the generated SAX parser or the generated W3C-DOM validator.
Both kinds of creation methods are only defined for the "Document_<tag>" classes, not for pure "Element_<tag>" classes. This is due to the fact that both construction methods possibly require global information, like namespace mapping and collections of "ID"-type attribute values, things not existing with simple elements.
The creation methods are provided by the Java class implementing the DTD. Let the package containing Tdom generated classes (i.e. all element classes, document classes and the DTD class) be called "<myModel>". Let "<tagA>", "<tagB>", etc. be the tags of those elements which can serve as the top-level element of a document, according to the process instructions as described in Section 2.1.1.
Then you can create a Tdom Document object by calling one of the following methods:
package <myModel> ; import eu.bandm.tools.util.SAXEventStream ; import eu.bandm.tools.tdom.runtime.TdomAttributeException ; import eu.bandm.tools.tdom.runtime.TdomContentException ; import eu.bandm.tools.tdom.runtime.TdomXmlException ; import eu.bandm.tools.tdom.runtime.TypedDTD ; public final class DTD extends TypedDTD { createDocument_<tagA> (Element_<tagA> el) {...} createDocument_<tagB> (Element_<tagB> el) {...} ... createDocument_<tagA> (org.w3c.dom.Document document) throws TdomContentException, TdomAttributeException {...} createDocument_<tagB> (org.w3c.dom.Document document) throws TdomContentException, TdomAttributeException {...} createDocument_<tagA> (SAXEventStream s) throws TdomContentException, TdomAttributeException, TdomXmlException {...} createDocument_<tagB> (SAXEventStream s) throws TdomContentException, TdomAttributeException, TdomXmlException {...} createDocument_<tagA> (java.io.InputStream in) throws java.io.IOExcept {...} createDocument_<tagB> (java.io.InputStream in) throws java.io.IOExcept {...} } |
The first two methods only complete the "manually bottom-up creation" as described in Section 3.2: You create a document by first creating its top-level element and then giving it as an argument to the constructor.
For large documents the following methods are more convenient:
If the argument to "createDocument_<tag>()" is a W3C DOM, then this DOM object is validated against the DTD, and, in case of conformance, a Tdom model is returned. Otherwise a TdomException is thrown.
If the argument to "createDocument_<tag>()" is a SAXEventStream, then the content models of the DTD must be LL(1), and the SAX events are consumed to construct the Tdom model. In case of non-conformance, a TdomException is thrown.
A <METATOOLS>/util/SAXEventStream is an interface which provides access to a "frozen" sequence of SAX calls. This freezing is necessary because LL(1) parsing needs (what a surprise!) a look-ahead of depth 1, which is not provided when using the SAX interface directly.
The implementation currently provided is contained in <METATOOLS>/util/SAXEventQueue.
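The role of the look-ahead can be illustrated by a minimal, self-contained sketch. The class FrozenEvents and its methods peek() and next() are hypothetical names chosen for this illustration only; they are not the API of SAXEventStream or SAXEventQueue. The sketch parses the start tags of a content model like (B, (C)*, (D)?) and takes every decision by inspecting exactly one upcoming tag.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.List;

// Hypothetical stand-in for a "frozen" sequence of SAX start-tag events.
final class FrozenEvents {
  private final Deque<String> tags = new ArrayDeque<>();

  FrozenEvents(List<String> startTags) { tags.addAll(startTags); }

  // Look-ahead of depth 1: inspect the next tag without consuming it.
  String peek() { return tags.peekFirst(); }

  // Consume the next tag.
  String next() { return tags.removeFirst(); }

  // LL(1) parse of the content model (B, (C)*, (D)?).
  // Returns { number of C elements, 1 if D is present else 0 }.
  static int[] parseA(FrozenEvents in) {
    if (!"B".equals(in.peek())) throw new IllegalStateException("expected B");
    in.next();
    int countC = 0;
    while ("C".equals(in.peek())) { in.next(); countC++; }
    int countD = 0;
    if ("D".equals(in.peek())) { in.next(); countD = 1; }
    if (in.peek() != null) throw new IllegalStateException("unexpected " + in.peek());
    return new int[] { countC, countD };
  }

  public static void main(String[] args) {
    int[] r = parseA(new FrozenEvents(List.of("B", "C", "C", "D")));
    System.out.println("C count = " + r[0] + ", D present = " + (r[1] == 1));
  }
}
```

A plain SAX ContentHandler only receives the current event by callback and has no peek(); this is the reason why the events have to be buffered first.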
The W3C DOM and SAX based construction methods can throw TdomContentExceptions and TdomAttributeExceptions, as with the explicit constructor invocations above. Additionally the SAX based methods can throw TdomXmlException in case of erroneous XML input files.
The SAX interface's handling of attributes is rather complicated
and expensive. Therefore, currently, we
do not totally type-check the SAX event stream as such!
Of course, when there is no value for an attribute which is
"required" as described by the DTD (i.e. neither declared as
"#IMPLIED", nor having a default value),
then a TdomAttributeMissingException is thrown.
But we currently do not check for undefined attributes, i.e. attribute names which are
not declared in the DTD and thus not represented in the model.
(The foreseen TdomAttributeUndefException is not thrown.)
This is a violation of the
"Validity constraint: Attribute Value Type" from [xml] , which says
"The attribute MUST have been declared"
The same holds for the even more primitive
"Well-formedness constraint: Unique Att Spec", which says
"An attribute name MUST NOT appear more than once in the same start-tag
or empty-element tag."
The format of the SAX events would make both checks rather expensive.
The practical problem is that these kinds of errors often result from a misspelled attribute name. But the absence of the attribute that was really meant will not be signalled if it has a default value!
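If these two checks are needed, they can be performed outside of Tdom, for instance in a filter placed in front of the SAX consumer. The following is only a sketch under assumptions: the class AttributeNameCheck and the way the set of declared names is obtained are hypothetical and not part of the Tdom runtime. Only org.xml.sax.Attributes and org.xml.sax.helpers.AttributesImpl are standard SAX classes.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import org.xml.sax.Attributes;
import org.xml.sax.helpers.AttributesImpl;

// Hypothetical helper: reports undeclared and repeated attribute names.
final class AttributeNameCheck {

  // Returns one message per attribute that is undeclared or repeated.
  static List<String> check(Attributes atts, Set<String> declared) {
    List<String> problems = new ArrayList<>();
    Set<String> seen = new HashSet<>();
    for (int i = 0; i < atts.getLength(); i++) {
      String name = atts.getQName(i);
      if (!declared.contains(name)) problems.add("undeclared: " + name);
      if (!seen.add(name)) problems.add("duplicate: " + name);
    }
    return problems;
  }

  public static void main(String[] args) {
    AttributesImpl atts = new AttributesImpl();
    // "widht" is a misspelling of the declared attribute "width".
    atts.addAttribute("", "widht", "widht", "CDATA", "10");
    System.out.println(check(atts, Set.of("width", "height")));
  }
}
```

With such a check the misspelled name is reported even when the attribute that was meant has a default value.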
If you have to create large sub-structures of a Tdom
model
(e.g. starting with a top-level element Element_<tagX>)
out of your own program code, it may be nevertheless the method of choice
to use the SAX interface to
create a complete Document_<tagX>:
Simply send SAX events to a
<METATOOLS>/util/SAXEventQueue, the other side of which is consumed
by the method aDTD.createDocument_<tagX>(SAXEventStream s).
Then extract the desired element by calling the (generated and therefore
strongly typed) method ...
public class Document_<tagX> extends ... tdom.runtime.TypedDocument { public Element_<tagX> getDocumentElement() {...} // returns top-level element } |
For this purpose it is necessary to declare beforehand as "public" all those element declarations in the DTD which are intended as the top element of such sub-trees. This is described in Section 2.1.1 above, and tells Tdom to create the required Document_<> classes.
The last method (aDTD.createDocument_<tagX>(java.io.InputStream)) is related to our own compression method, and explained in Section 6.5.
The class <METATOOLS>/xantlrtdom/TdomReader provides the glue code between a file input stream or similar source of text, and the construction of a Tdom model. Its usage is demonstrated in ../../examples/doctypes/xhtml/Main.java.
As mentioned above, the most elegant way of processing a Tdom
model
to some other format is the application of Visitors.
With every Tdom
model the base class Visitor.java is generated,
from which you can derive your processing tools.
This class defines a "visit(final <generatedClass> node)" method
for each node class generated
by Tdom.
This includes element classes, classes representing sub-sequences, choices,
alternatives and attributes.
A user defines a transformation by deriving from this visitor class and
overriding only those methods where he/she wants to extract some information
or perform some update.
On the other side the generated Element class (which is the top of all generated element classes) implements the interface Visitable<Visitor>, and the method host(Visitor). This method is the counterpart, which causes the visitor to call its visit method on this. (This method is needed to apply a visitor to any element without knowing its concrete class at compile time.)
The definition of a derived visitor is most
conveniently done by editing a copy of the generated VisitorTemplate.java.
This file contains method declarations for all visit() methods
acting on element classes. These empty
method templates are preceded by a "Javadoc" comment which contains
the corresponding content definition from the original DTD.
Please note that the method declarations for classes representing sub-content
(e.g. "visit(Element_TA.Choice_1_Alt_2 x)")
are
not included in VisitorTemplate.java, but have to be added
manually, whenever required.
((It may be convenient to have a look at "Visitor.java" for
doing "copy and paste" on some more complicated method declarations of this
kind.))
The Tdom visitors are of the simplest kind, compared to the more complex ones generated by umod . They only provide the above-mentioned single method per visited class, namely "visit(<generatedclass> x)". This method can be called from external ("hand-written") code for the initial invocation of the visitor. It is also used internally by the visit() code of the generated base visitor itself, for descending to the child nodes.
If the class of an object is known statically, this call is optimal w.r.t. performance. If the class is not known, there is the method "visit(generatedPackage.Element element)", which does a switch/case-based look-up of the element's tag index.
All node classes support the method "x.host(Visitor v)". This method calls the most narrowly statically typed "v.visit(x)" method of the visitor v. This makes it possible to visit sub-content which contains choices without the need to know in advance which alternatives are present in the concrete model data.
This "x.host(Visitor v)" method is also realized by the generated Document_<> classes.
Further, all node classes which have repeated sub-content, like "elems_1_A", "choices_1" or "seqs_1", offer a method like "visit_choices_1(Visitor v)" which does the stepping through the sub-contents automatically, as mentioned already in Section 2.6.4 above.
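The interplay of "host()" and "visit()" is the classical double dispatch. It can be sketched with a few hand-written classes. All names below (MiniNode, ElemB, ElemC, MiniVisitor, NameCollector) are hypothetical miniatures written for this illustration; the classes generated by Tdom are richer and are named according to the conventions of Section 2.6.

```java
// Hypothetical miniature of generated node classes and their visitor.
abstract class MiniNode {
  // Each concrete class calls the visit() overload for its own type.
  abstract void host(MiniVisitor v);
}

final class ElemB extends MiniNode {
  @Override void host(MiniVisitor v) { v.visit(this); }
}

final class ElemC extends MiniNode {
  @Override void host(MiniVisitor v) { v.visit(this); }
}

// Base visitor with one visit() method per node class; does nothing here.
class MiniVisitor {
  void visit(ElemB b) { }
  void visit(ElemC c) { }
}

// User-defined visitor: records the class of each visited node.
final class NameCollector extends MiniVisitor {
  final StringBuilder seen = new StringBuilder();

  @Override void visit(ElemB b) { seen.append('B'); }
  @Override void visit(ElemC c) { seen.append('C'); }

  static String collect(MiniNode... nodes) {
    NameCollector v = new NameCollector();
    // The static type is only MiniNode; host() selects the right overload.
    for (MiniNode n : nodes) n.host(v);
    return v.seen.toString();
  }

  public static void main(String[] args) {
    System.out.println(collect(new ElemC(), new ElemB(), new ElemC()));
  }
}
```

Calling "v.visit(n)" directly would not compile in collect(), because no overload accepts the abstract type; "n.host(v)" resolves the concrete class at run time.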
All these direct ways of calling (i.e. the skipping of a "match()" multiplexer as
needed in the
umod
visitors) are
possible because the structure of the model is almost
completely statically defined at compile time, and because there
are no specialization relations (= "inheritance") between distinct classes.
The only places where dynamic decisions are required come
(a) from alternatives (including abstract classes) and (b) from
quantification decorations "?", "*" or "+" in the original DTD.
The high performance of the Tdom
visitors results from the fact that
in both cases only simple and constant int values need to be
considered, --- e.g. the result of "final int getAltIndex()", a method
generated with every sub-class of TypedAlt and which returns the value of a
static final int assigned to the generated class at compile time,
or of "<>.count<plural>()" in case of repetitions.
The generated base visitor does nothing more than descending the document tree in depth-first textual order.
E.g. the DTD declaration ...
<!ELEMENT A (B, (C)*, (D)?) > |
...generates a method like this (see footnote 4) ...
package MySemiAst ;
public class Visitor {
  ...
  public void visit (Element_A el){
    visit (el.getElem_1_B());
    for (int i = 0, n = el.countElems_1_C() ; i < n ; i++)
      visit(el.getElem_1_C(i));
    if (el.hasElem_1_D())
      visit(el.getElem_1_D());
  }
  ...
}
Special processing of nodes of a certain class is implemented by deriving from this base class. If you want to descend into the sub-tree structure starting at the currently visited node el, you simply call "super.visit(el)", or you start with a new, specialized visitor:
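package transformations ;
import MySemiAst.* ;

public class Transform_1 extends Visitor {

  protected class SpecialTransformation extends Transform_1 { ... }

  @Override public void visit (Element_A el){
    final int value = Integer.parseInt (el.getElem_1_B().getPCData());
    // NOTE: "new SpecialTransformation().visit(el.getElems_1_C())" does NOT work,
    // because there is no visit() method for the array as a whole.
    // Therefore the repeated elements are visited one by one:
    final SpecialTransformation special = new SpecialTransformation();
    for (int i = 0, n = el.countElems_1_C() ; i < n ; i++)
      special.visit(el.getElem_1_C(i));
    // an anonymous visitor which sums up the numeric contents of all C elements:
    final int secondvalue = new Visitor(){
        protected int result = 0 ;
        public int process(Element el){ visit(el); return result; }
        @Override public void visit (Element_C el){
          result += Integer.parseInt (el.getPCData());
        }
      }.process(el);
    super.visit(el); // descend into the sub-tree in the default way
  }
}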
The call graph for a content declaration "<!ELEMENT A ((B)*(..|..|..))>" can be symbolically sketched like ...
visit(Element_A e) ---------> { for (i=0; i<e.countElems_1_B(); i++) visit(e.getElem_1_B(i)); } ; visit(e.getChoice_1())
visit(Element_A.Choice_1 c) --------> switch(c.getAltIndex()){ case 0: visit(c.toAlt_1()); case 1: visit(c.toAlt_2()); }
visit(Element_A.Choice_1_Alt_1 a) ---> visit(...)
...and the scheme for deriving the transformation tools "UserDefV" like ...
Visitor.visit(Element_A e) ---------> {..visit(e.getElem_1_B)..} ; visit(...)
                                            |
  +-----------------------------------------+
  V
UserDefV.visit(Element_B e) ------> ---> super.visit(e)
                                                        |
  +--------------------------------------------------------+
  V
Visitor.visit(Element_B e) ---------->
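Both schemes can be tried out with a small hand-written model. The following sketch assumes a hypothetical content model <!ELEMENT A ((B)*, (C | D))>; the classes SkA, SkB, SkChoice, SkVisitor and SkTrace are illustrative stand-ins and not generated code. It shows the base visitor descending in textual order, the dispatch on a plain int alternative index, and a derived visitor which continues the descent by calling "super.visit()".

```java
// Hypothetical miniature of a model for <!ELEMENT A ((B)*, (C | D))>.
final class SkB {
  final String text;
  SkB(String text) { this.text = text; }
}

final class SkChoice {
  private final int altIndex;              // 0 = alternative C, 1 = alternative D
  SkChoice(int altIndex) { this.altIndex = altIndex; }
  int getAltIndex() { return altIndex; }
}

final class SkA {
  final SkB[] elemsB;
  final SkChoice choice;
  SkA(SkChoice choice, SkB... elemsB) { this.choice = choice; this.elemsB = elemsB; }
}

// Base visitor: only descends, depth-first and in textual order.
class SkVisitor {
  void visit(SkA a) {
    for (SkB b : a.elemsB) visit(b);
    visit(a.choice);
  }
  void visit(SkB b) { }
  void visit(SkChoice c) {
    switch (c.getAltIndex()) {             // dynamic decision on a simple int
      case 0: visitC(); break;
      case 1: visitD(); break;
      default: throw new IllegalStateException("bad alternative index");
    }
  }
  void visitC() { }
  void visitD() { }
}

// User-defined visitor: overrides selected methods only.
final class SkTrace extends SkVisitor {
  final StringBuilder log = new StringBuilder();

  @Override void visit(SkA a) { log.append("A("); super.visit(a); log.append(")"); }
  @Override void visit(SkB b) { log.append(b.text); }
  @Override void visitD() { log.append("|D"); }

  static String trace(SkA a) {
    SkTrace t = new SkTrace();
    t.visit(a);
    return t.log.toString();
  }

  public static void main(String[] args) {
    System.out.println(trace(new SkA(new SkChoice(1), new SkB("x"), new SkB("y"))));
  }
}
```

The call "super.visit(a)" hands control back to the base visitor, which in turn calls the overridden "visit(SkB)" of the derived class for every child. This is the zig-zag shown in the second diagram.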
Sometimes (e.g. for simulating a W3C DOM and for
directly applying tpath expressions)
an untyped view to a Tdom instance is required.
For this purpose, the class
UntypedVisitor is provided. It co-operates with additional "hosting" methods
in the generated node classes called
"__dumpElementSnapshot(List)" and
"__getAllAttrs(List)".
It is esp. suited for accessing and collecting
Ethereals, or all attributes of a certain type, etc.
It is created without any parameters and applied to any element or document by
calling "match(e)". Its behaviour is defined, as usual, by deriving and overriding
the diverse "action(e)" or "descend_<..>(e)" methods, see
the api doc.
As appropriate for this untyped view, no action or match methods for structural sub-contents
are provided.
The command line option --patterns activates the generation of Paisley patterns, see the paisley documentation.
FIXME MORE
Tdom includes a mechanism for extending a model by one or more others. Its prime purpose is to support the reuse of visitor-based code.
On the source level, an extension has to be declared in the DTD. A short example is contained in metatools/examples/tdom/extend . Therein, the file arith.dtd , further simplified, reads as follows:
<?tdom abstract expr (nullary | unary | binary) ?>
<?tdom abstract nullary (const | ...) ?>
<?tdom abstract unary_op (neg | ...) ?>
<?tdom abstract binary_op (add | sub | mul | div | ...) ?>
<!ENTITY % expr.extension "">
<!ENTITY % unary_op.extension "">
<!ENTITY % binary_op.extension "">
<!ENTITY % expr "(nullary | unary | binary %expr.extension;)">
<!ELEMENT unary ( (neg %unary_op.extension;), %expr;)>
<!ELEMENT binary (%expr;, (add | sub | mul | div %binary_op.extension;), %expr;)>
The "?tdom abstract" process instruction is terminated by the ellipsis
"...", which tells the Tdom
code generator to
foresee the plugging-in of more element types.
The same is done on the genuine DTD level by defining an "ENTITY" each,
which is per default of course empty.
The usage of the extension mechanism can be seen in the file logic.dtd , which, slightly simplified, reads as follows:
<?tdom import arith SYSTEM "arith.dtd" ?> <!ENTITY % arith.dtd SYSTEM "arith.dtd"> %arith.dtd; <!ELEMENT prop (%expr;)*> <!ATTLIST prop pred NMTOKEN #REQUIRED> |
Again, the import is executed on the DTD level and on the Tdom level in parallel, requiring a seemingly redundant duplication.
FIXME EXAMPLE BROKEN ??? SOMETHING GOT LOST here !?!?!
Among the generated classes there is always a Dumper class, extending the Visitor class. Its constructors are defined as ...
public Dumper(org.xml.sax.ContentHandler contentHandler) { ..} public Dumper(org.xml.sax.ContentHandler contentHandler, org.xml.sax.ext.LexicalHandler commentHandler) { ..} |
Whenever visit() is called on any element of the model, the corresponding SAX events [Sax04] are generated for this element and its attributes, for any related Ethereals (see Section 2.10) and for all elements contained therein recursively.
The LexicalHandler is optional and is used to receive TypedComment objects. If the first constructor is called and the argument happens to support both interfaces, it will be used in both roles.
You can easily print a whole model or an arbitrary sub-tree to a console or to a text file by combining the above-mentioned SAX event generation with our ContentPrinter. E.g.:
new Dumper(new ContentPrinter(new PrintWriter_flushing(System.out), true, true)).visit(myModel);
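The principle of printing via SAX events can be tried out with JDK classes alone. In the following sketch the method emit() plays the role of a Dumper and produces the events for a tiny fixed document; the JDK's identity TransformerHandler stands in for ContentPrinter, whose own API is not reproduced here. The class name SaxPrintSketch and the example document are assumptions made for this illustration.

```java
import java.io.StringWriter;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.ContentHandler;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;

final class SaxPrintSketch {

  // Stand-in for a Dumper: emits the SAX events for <a><b>text</b></a>.
  static void emit(ContentHandler h) throws SAXException {
    AttributesImpl none = new AttributesImpl();
    h.startDocument();
    h.startElement("", "a", "a", none);
    h.startElement("", "b", "b", none);
    char[] text = "text".toCharArray();
    h.characters(text, 0, text.length);
    h.endElement("", "b", "b");
    h.endElement("", "a", "a");
    h.endDocument();
  }

  // Serialises the events to a string with the JDK's identity transformer.
  static String print() {
    try {
      SAXTransformerFactory factory =
          (SAXTransformerFactory) SAXTransformerFactory.newInstance();
      TransformerHandler handler = factory.newTransformerHandler();
      handler.getTransformer().setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
      StringWriter out = new StringWriter();
      handler.setResult(new StreamResult(out));
      emit(handler);
      return out.toString();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    System.out.println(print());
  }
}
```

TransformerHandler implements both ContentHandler and LexicalHandler, so an object of this kind could serve in both roles mentioned above.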
The class formatfrontends.Tdom2format contains an instantiation of the generic format compiler.
The generated code is a specialization of the Visitor class generated with the tdom code. It offers a public method "toFormat()" which translates a tdom model (or any sub-expression of it) into a format object. The outline of such a converter called "myFormatter", generated for some Tdom generated package "myModel", is:
import myModel.* ; // Element_A, Element_B, etc.

public class myFormatter extends myModel.Visitor {

  // public interface
  public Format toFormat (Visitable<? super Visitor> element){
    result = Format.empty ;
    visit(element);
    return result ;
  }
  public int default_indent = 2 ; // can be modified
  public Format default_delimiter = Format.empty ; // for debugging only

  // auxiliary functions
  protected Format __throwIt(){...}
  protected void visit (TypedPCData){...}
  protected Format toFormat_throwing(Visitable){...} // if ==null then throw!
  protected Collection<Format> toFormat_array(Visitable[]){...} // "map"

  // user defined visitor functions
  public void visit (Element_A element){
    result = // format generating function
  }
  // etc.
}
A "visit()" method is added for each node class of the Tdom model, if and only if the user gives a format description.
The default case is that the visitor reaches "visit(typedPCData)". This simply concatenates all character content into one big, unformatted Append format. This is exactly what you want in most cases for the lowest layers of the structure definition.
(( There is a field "public Format default_delimiter" in the generated code, which is initialized to an empty format. For debugging it can be set to something like "---", indicating the borders of the different concatenated pcdata ranges. ))
The language to define the format for a given Tdom class explicitly, is an instance of the generic format definition language. The following instantiations are specific to its application to Tdom :
DOMAIN_SPECIFIC_DATA_ADRESSING ::= element choice sequence $pcdata $quoteDTDstyle blankCharacter formatDescription |
element ::= tag $ nat |
choice ::= $C $Choice nat |
alt ::= $A $Alt nat |
sequence ::= $S $Seq nat |
nat ::= // a natural number including zero(0) ; |
The reference "$quoteDTDstyle data" means a "DTD-like" quotation of the (possibly concatenated) string content of the "data" format, done by Format.quoteDTDstyle(Format). This means simply framing the data with single quotes if it contains double quotes, and vice versa. And doing anything if the data contains both !-).
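The quoting rule can be written down in a few lines. The following is a sketch on plain strings, not the implementation of Format.quoteDTDstyle(Format). The class name DtdQuoteSketch is hypothetical, and two points are assumptions of the sketch: data without any quote character is framed with double quotes, and data containing both kinds of quotes is rejected, since the text above leaves that case open.

```java
// Hypothetical string-level sketch of "DTD-like" quotation.
final class DtdQuoteSketch {

  // Frames the data with the kind of quote that does not occur inside it.
  static String quote(String data) {
    boolean hasDouble = data.indexOf('"') >= 0;
    boolean hasSingle = data.indexOf('\'') >= 0;
    if (hasDouble && hasSingle) {
      // Assumption of this sketch: refuse data containing both kinds of quotes.
      throw new IllegalArgumentException("data contains both kinds of quotes");
    }
    // Assumption of this sketch: double quotes are the default frame.
    return hasDouble ? "'" + data + "'" : "\"" + data + "\"";
  }

  public static void main(String[] args) {
    System.out.println(quote("say \"hi\""));
    System.out.println(quote("it's"));
  }
}
```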
The reference "$pcdata" means the text content of the current tdom element, as it is delivered by the generated getPCData() method.
Appearing in a format code "<tag>$<nat>" means a reference to the
format generated for "getElem<tag>_<nat>()", --- "<tag>" without number
defaults to "getElem<tag>_1()".
Analogously:
"$C<nat>"/"$Choice<nat>" means a reference to the
format generated for "getChoice_<nat>()" and
"$S<nat>"/"$Seq<nat>" means "getSeq_<nat>()".
If the selected element/choice/sequence appears in the DTD under a list combinator ("+" or "*"), one of the list format descriptors must be applied ("[]", "[,,/]", "[;;/]{}", etc.), as described for the generic case.
Please note that currently there is no type checking between the format descriptors and the DTD of the model. So addressing singular instead of plural (i.e. leaving out a list format descriptor instead of using it, or vice versa) will make the compilation of the generated code fail (there is either a "getElems_1_A()" or a "getElem_1_A()", as described above).
DOMAIN_SPECIFIC_SWITCH_SELECTOR ::= element choice sequence |
DOMAIN_SPECIFIC_CASE_LABEL ::= nat |
In case the DOMAIN_SPECIFIC_SWITCH_SELECTOR refers to an element, this element must appear in the current content model with a "?" modifier. The selector tags may be "0" for absent and "1" for present.
In case the DOMAIN_SPECIFIC_SWITCH_SELECTOR refers to a choice, then the selector tags correspond to the allowed values of "altIndex", i.e. the position number of the present alternative in the DTD choice construct.
If in a switch no default case is given, this defaults to "$throw".
Currently there are three ways to write down these definitions: (1) in a stand-alone file, (2) as PIs in a DTD, or (3) by option values from an Xantlr source.
All format definitions in such a file must have the form ...
formatRule ::= tag seq choice alt choice blankCharacter = formatDescription . |
Please note the dot "." as a delimiter at the end, which is required because in a formatDescription all whitespace is significant. The description begins directly after the "=".
The expression to the left of the "=" is translated verbatim into a "visit(Element_<tag><tail>)" rule, where <tail> is the name of the inner class of the element class, as denoted by the sequence of selectors, which are translated the same way as in the context of the format description, as described above.
The call of the tool is controlled by these parameters:
( definitions from file ../../src/eu/bandm/tools/formatfrontends/tdom2format.options )
-G | --sourceroot | uri | file system position of the root of the java source tree
-1 | --packagename | string | name of the package which will contain the formatting code
-2 | --basevisitor | string | name of the base visitor class from which the formatting code inherits
-3 | --classname | string | name of the generated visitor class
-4 | --sourcetype | (dtd | nondtd) | indication of the type of the source file
-5 | --sourcefile | uri | path to the source file to translate
-w | --linewidth | int(=79) | width of the generated source files
   | --targetclasspath | uri(=$na) | position in file system for loading referred existing classes
ATTENTION Currently these parameters are not yet decoded as documented, but
The format of the declarations is exactly as above, but all are wrapped into process instructions, one or more declarations each:
<?tdom2format a =expr. ?> <?tdom2format b =expr. c =expr. ?> |
Please note again the dot "." which ends the format descriptions.
The call of the tool is now ...
${JAVA} eu.bandm.tools.formatfrontends.Tdom2format -- model.DTD ??? |
If the Tdom meta-model is an AST meta-model created from an Xantlr grammar, then the format directives can be formulated directly in the grammar source file by a special kind of rule-wise options.
A construct in the grammar file like ...
public myNonterm options { format = "expr" ; } : grammarExpr ; |
... will be translated by Xantlr into a process instruction in the generated DTD:
<?tdom2format myNonterm =expr. ?> |
Here the dot is added by this translation process, so the whole string constant in the "option" statement (including trailing whitespace!) is seen as format description.
But you can add more format expressions into the one option statement, if you want to define format rules for sub-expressions like choices or sub-sequences. Simply terminate the "main" expression for the non-terminal with an explicit dot, and append the further declarations for these sub-expressions (or arbitrary unrelated nonterminals !-) as explicit rules.
Simply consider that Xantlr prepends the name of the non-terminal and the equals sign, and appends one dot; then you can insert further rules arbitrarily.
So ...
public myNonterm options { format = "expr. myN $C1=sub1. myN $C2$A1 = sub2" ; } :a(b|c)(d|e) ; |
... will be translated by Xantlr into a process instruction with three format declarations in the generated DTD:
<?tdom2format myNonterm =expr. myN $C1=sub1. myN $C2$A1 = sub2. ?> |
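The translation rule is a pure string concatenation, which the following sketch reproduces for the two examples above. The class FormatOptionSketch and its method are hypothetical and written for illustration only; they are not part of Xantlr.

```java
// Hypothetical sketch of the rule "prepend name and equals sign, append one dot".
final class FormatOptionSketch {

  // Builds the process instruction from a non-terminal name and the
  // string constant of its "format" option.
  static String toProcessInstruction(String nonterminal, String formatOption) {
    return "<?tdom2format " + nonterminal + " =" + formatOption + ". ?>";
  }

  public static void main(String[] args) {
    System.out.println(toProcessInstruction("myNonterm", "expr"));
    System.out.println(
        toProcessInstruction("myNonterm", "expr. myN $C1=sub1. myN $C2$A1 = sub2"));
  }
}
```

Since only the last dot is supplied by the translation, every additional rule inside the option string must bring its own name, equals sign and, except for the last rule, its own terminating dot.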
Creating a W3C DOM representation can be done via this SAX output: meta_tools provides a generic translation module in <METATOOLS>/util/SAX2DOMConverter . It implements org.xml.sax.ContentHandler and therefore can be connected as a drain to the Dumper described in the preceding paragraph.
You have to plug in a W3C-Dom implementation, e.g. a Xerces-J [xercesj] by calling the public method ...
public void setDOMImplementation(org.w3c.dom.DOMImplementation domImpl){...}
After all SAX events have been sent completely, you can access the generated Document by calling ...
public org.w3c.dom.Document getDocument() |
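The same data flow, SAX events in and a W3C DOM Document out, can be reproduced with JDK classes alone. The following sketch does not use SAX2DOMConverter; it uses the JDK's identity TransformerHandler with a DOMResult as a stand-in. The class name SaxToDomSketch and the example events are assumptions of the illustration.

```java
import javax.xml.transform.dom.DOMResult;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.sax.TransformerHandler;
import org.w3c.dom.Document;
import org.xml.sax.helpers.AttributesImpl;

final class SaxToDomSketch {

  // Feeds the SAX events for <a><b>text</b></a> into a DOM-building drain.
  static Document build() {
    try {
      SAXTransformerFactory factory =
          (SAXTransformerFactory) SAXTransformerFactory.newInstance();
      TransformerHandler handler = factory.newTransformerHandler();
      DOMResult result = new DOMResult();   // an empty result: a Document is created
      handler.setResult(result);

      AttributesImpl none = new AttributesImpl();
      char[] text = "text".toCharArray();
      handler.startDocument();
      handler.startElement("", "a", "a", none);
      handler.startElement("", "b", "b", none);
      handler.characters(text, 0, text.length);
      handler.endElement("", "b", "b");
      handler.endElement("", "a", "a");
      handler.endDocument();

      // The Document is complete only after all events have been sent.
      return (Document) result.getNode();
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    Document doc = build();
    System.out.println(doc.getDocumentElement().getNodeName());
  }
}
```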
Since the standard XML encoding (using opening and closing tags and many different layers of escaping and quoting) is very redundant, Tdom supports a compressed binary storage format, in which all tags are encoded in a minimal way, controlled by the DTD. So the tagging information is a kind of "binary", while the text contents are left unchanged (i.e. it is encoded using UTF-8)
The writing out of a document or element is initiated by a sequence like ...
import eu.bandm.tools.tdom.runtime.EncodingOutputStream ;

EncodingOutputStream os = new EncodingOutputStream(anyOutputstream);
myElement.encode(os);
// OR
myDocument.encode(os);
The encode methods are defined on the level of the runtime classes from which the generated classes are derived, so see <METATOOLS>/tdom/runtime/TypedDocument or <METATOOLS>/tdom/runtime/TypedElement for details.
Reading back is only possible on the Document level, by calling the appropriate factory method defined with the DTD class, as already mentioned in Section 3.3 above:
createDocument_<tagA> (java.io.InputStream in) throws java.io.IOException {...}
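The idea of replacing textual tags by small numbers, while leaving the text as UTF-8, can be illustrated by a toy codec. Please note that this is NOT the storage format of Tdom: the byte layout (one byte for the tag index, a four byte length, then the UTF-8 bytes), the tag table and the class name TagCodecSketch are all assumptions invented for this sketch.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;

// Toy codec for one text-only element; NOT the real Tdom format.
final class TagCodecSketch {

  // Assumed tag table; in a real setting it would be derived from the DTD.
  static final String[] TAGS = { "title", "para" };

  static byte[] encode(int tagIndex, String text) {
    try {
      ByteArrayOutputStream bytes = new ByteArrayOutputStream();
      DataOutputStream out = new DataOutputStream(bytes);
      byte[] utf8 = text.getBytes(StandardCharsets.UTF_8);
      out.writeByte(tagIndex);     // one byte instead of "<para>" and "</para>"
      out.writeInt(utf8.length);   // the length prefix replaces the closing tag
      out.write(utf8);             // text content stays plain UTF-8
      out.flush();
      return bytes.toByteArray();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  // Decodes to the textual XML form (without escaping, for brevity).
  static String decodeToXml(byte[] data) {
    try {
      DataInputStream in = new DataInputStream(new ByteArrayInputStream(data));
      String tag = TAGS[in.readUnsignedByte()];
      byte[] utf8 = new byte[in.readInt()];
      in.readFully(utf8);
      return "<" + tag + ">" + new String(utf8, StandardCharsets.UTF_8) + "</" + tag + ">";
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
  }

  public static void main(String[] args) {
    byte[] data = encode(1, "abc");
    System.out.println(data.length + " bytes, decoded: " + decodeToXml(data));
  }
}
```

Decoding needs the tag table, i.e. knowledge of the DTD. This matches the fact that reading back is offered by the DTD class and not by single elements.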
The Tdom
tool is called from the command line:
( definitions from file ../../src/eu/bandm/tools/tdom_withOptions/Options.xml )
-0 | --destdir | string | file system mounting point of the generated source tree
-1 | --pkgname | string | name of the generated package. Determines the position of generated resources relative to destdir
-2 | --sourcedtd | string | file position of dtd to compile
-C | --baseClass | string(=eu.bandm.tools.tdom.runtime.TypedElement) | base class for all generated element classes
   | --commonContentClass | | whether to generate only one common subclass of MixedContent for PCData only elements per model
   | --generateLists | bool(=false) | generate checked list classes for repeated sub-contents, not arrays
   | --linewidth | int(=78) | number of columns for the generated Java source text. This is NOT a strict limit, but a strong orientation.
   | --noCompress | | whether the decode/encode methods shall be omitted
   | --patterns | | whether "paisley" pattern access methods shall be generated
Currently the parameters --commonContentClass, --baseClass and --noCompress are not supported.
Additionally there is a macro in the metatools make system defined in etc/calltools.mk, which is called from a Makefile as ...
$(call tdom, <destDir>, <pkgName>, <sourceDtd>) |
This macro esp. cares for the conversion of slashes and backslashes between a unix and a cygwin environment.
As an output, Tdom generates one source file for each <!ELEMENT..> declaration, according to the naming conventions explained above in Section 2.6.
Additionally there will be generated ...
Tdom constructs a whole zoo of parsers and validators (from SAX, W3C-DOM, Java constructors, etc). In this context it is central to detect ambiguities in the DTD grammar. In practice we often find badly written DTDs. But many of the ambiguities are two alternatives which both include "epsilon" in their language, e.g. ...
... ( a* | b* ) ... |
These ambiguities do not violate the "LL(1)" requirement, as interpreted by W3C. Nevertheless, it is important to have a close look at all points of ambiguity. The Tdom tool prints them in a precise format. E.g. when trying to translate "xhmlt1-flat.dtd", you get the typical warning
warning: table: conflict on [thead, tbody, tr, tfoot] between alts 0 and 1 in rule
[thead, col, tbody, tr, colgroup, tfoot] -> (
    [thead, col, tbody, tr, tfoot] -> {[col] -> (col)* -> [thead, tbody, tr, tfoot]}
  | [thead, tbody, tr, colgroup, tfoot] -> {[colgroup] -> (colgroup)* -> [thead, tbody, tr, tfoot]}
)
In the innermost nesting (in braces) you see the first and follow sets of some grammar expressions :
{ [firstSet] -> grammarExpr -> [followSet] } |
The conflict is here caused by a disjunction. On the next higher level you see the disjunction of two such constructs, each preceded by its own first set (which is identical to the internal first set, or to the union of the internal first and follow set, if the grammar expression can produce epsilon, --- as we all have learned from the Dragon Book !-)
( [firstSet_A] -> {...} | [firstSet_B] -> {...} ) |
Before this bracket, again, there is the first set of the disjunction as a whole (which is not very informative, it is just the union), but in the warning message you find the intersection of these two sets, which is the cause for the ambiguity.
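The sets shown in the warning can be recomputed by hand. The following sketch uses the sets from the "table" example above; the class FirstSetSketch and its two helper methods are hypothetical and only illustrate the set arithmetic, not the analysis code inside Tdom.

```java
import java.util.Set;
import java.util.TreeSet;

final class FirstSetSketch {

  // First set of an expression which can produce epsilon:
  // its own first set united with its follow set.
  static Set<String> firstOfNullable(Set<String> first, Set<String> follow) {
    Set<String> result = new TreeSet<>(first);
    result.addAll(follow);
    return result;
  }

  // The conflict between two alternatives is the intersection of their first sets.
  static Set<String> conflict(Set<String> firstA, Set<String> firstB) {
    Set<String> result = new TreeSet<>(firstA);
    result.retainAll(firstB);
    return result;
  }

  public static void main(String[] args) {
    Set<String> follow = Set.of("thead", "tbody", "tr", "tfoot");
    // alternative 0 is (col)*, alternative 1 is (colgroup)*; both are nullable
    Set<String> alt0 = firstOfNullable(Set.of("col"), follow);
    Set<String> alt1 = firstOfNullable(Set.of("colgroup"), follow);
    System.out.println("conflict on " + conflict(alt0, alt1));
  }
}
```

The intersection contains exactly the four follow tags, because both alternatives can produce epsilon and therefore "start" with whatever follows them. A next tag "col" or "colgroup" is decided without any problem.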
First of all, the grammar file fed into Xantlr must contain the global, parser-level option
options { dtdMode = tdom ; } |
This makes the generated DTD contain special "process instructions" addressed to Tdom (as described above in Section 2.1.1), reflecting the settings in the grammar definition file:
When Xantlr and Tdom are plugged together, two parsers are involved: First comes the antlrC parser, which consumes front-end characters and emits the standard antlrC error messages. The output of the Xantlr generated parser is a SAX event stream, which is fed into the different SAX receivers created by the Tdom compiler.
Already when translating the DTD, Tdom possibly issues error messages concerning this second level of parsing, --- mostly caused by ambiguities, i.e. violations of the LL(1) criterion. This kind of ambiguity should not be mixed up with the antlrC front-end ambiguities. Consider a definition (taken from the d2d grammar) like ...
definition ::= "list" "of" reference | "short" "for" reference | ... |
The antlrC generated front-end parser has no problems with ambiguity, because there are terminal tokens guarding the alternatives. Because these tokens do not automatically contribute to the semi-AST, the generated DTD would correspond to
definition ::= reference | reference | ... ; |
This is ambiguous, and the following Tdom translation will issue an error message, as explained above in Section 7.2.
As a remedy, you should either wrap one of the terminals in a non-terminal (with an empty DTD content model), like ...
definition : LIST "of" reference | "short" "for" reference | ... ; LIST : "list" ; |
...or wrap one of the alternatives as a whole in a non-terminal:
definition : "list" "of" reference | shortcutdefinition | ... ; shortcutdefinition : "short" "for" reference ; |
Now the generated SAX events suffice to distinguish between the alternatives.
meta_tools provides some glueing code for plugging together Xantlr and Tdom . The central class is <METATOOLS>/xantlrtdom/XantlrTdom, which internally creates buffers and auxiliary message pipes, etc, and plugs it all together.
Let "XXX" be the name of your grammar, and "YYY" the top production, then the usage pattern is ...
final XXX_Lexer lexer = new XXX_Lexer(stream);
lexer.setFilename(filename);
final XXX_Parser parser = new XXX_Parser(HistoryToken.chain(lexer));
tee = (tracing)
    ? new ContentPrinter(new PrintWriter_flushing(System.err), true, true)
    : null ;
final XantlrTdom link = XantlrTdom.link (parser, msg1, 1024, tee, DTD.dtd, msg2);
final Document_YYY document_module = link.parse("YYY", Document_YYY.class);
Please refer to the API doc.
1 But see the discussion on unsafe API methods in Section 2.6.5 below
2 Only exception: When taking them seriously, the constraints on id/idref/idrefs attributes imposed by [xml] are nearly impossible to maintain! And they would be very expensive to check automatically with every model update. Therefore they are only evaluated on demand, controlled by the user. See Section 2.7.6 for details.
3 They can become cyclic when later calling set_..() methods: another reason for better treating model elements as immutable!
4 This code fragment is written for documentation purpose using the access methods from the user interface, as described above. The real implementation, since in the same package as the element classes, of course uses the direct access to protected variables for the sake of efficiency, e.g. "a.elems_1_C.length" instead of "a.countElems_1_C()"
made 2024-08-30_17h46 by lepper on happy-ubuntu, produced with eu.bandm.metatools.d2d and XSLT