D2d --- XML Made Useful For Authors
1
Purpose and Way of Operation
1.1
Intentions
1.2
Use Cases
1.3
Three Fields of Parsing in the text2xml Transformation
2
Parsing a Text Input File Into An XML Model
2.1
Top-Level File Structure and Document Type Declaration
2.1.1
Top Element Declaration and Input Data
2.1.2
Local Definition Modules
2.2
Tokenization
2.2.1
Comments
2.2.2
Tags
2.2.3
Meta-Commands
2.2.4
All Special Characters
2.3
Tagged Parsing
2.3.1
Structure and Effect of Tags
2.3.2
Declaration and Execution of Tag Parsers
2.3.3
Difference between Tags, References and Insertions
2.3.4
Character Data Input, Whitespace and Discarding Whitespace.
2.3.5
LL(1) Constraint
2.3.6
Tag Recognition and Evolution of Document Types
2.3.7
DTD Conformance
2.4
Character Based Parsing
2.4.1
Definition and Semantics of Character Parsers
2.4.2
Character Parsers and DTD Content Models
2.4.3
Execution of Character Parsers and Their Limitedness
2.5
Modifiers
2.5.1
Tuning the XML Representation of the Generated Output
2.5.2
Verbatim Input Mode
2.6
Enumerations
2.7
Incomplete Input and Signaling of Errors
2.8
Post-Processing
2.9
Macros and Inclusions
3
Modules, Substitutions and Parametrization
3.1
Default Declarations
3.2
Local Definitions
3.3
XML Name Spaces
3.4
Importing Definitions From Other Modules
3.5
Parameterization Of Modules and Substitutions
3.6
Combinations of Substitutions and Insertions
3.7
External Document Type Definitions
3.7.1
Using W3c XML DTDs
3.7.2
Denotating Values of umod Models
4
The Xslt Mode
4.1
Additional Xml Name Space Declarations
4.2
Issues with Type-checking and Coverage when Expanding XSLT Constructs.
Shadowing caused by Missing Context Information
4.3
Further Caveats for the Xslt Mode
5
Documentations And Transformations
5.1
Adjoining Documentation Text And Transformation Rules to Definitions
5.2
Generating Documentation of a Module's Structure
5.3
Defining a Collection of Transformation Rules Decentrally
6
Tool Implementations
6.1
D2d Main Tool
6.2
D2d Batch Tool
7
"d2d_gp" --- a General Purpose Text Architecture
7.1
The "basic" Module and Its Sub-Modules
7.2
Special Modules For Technical Documentation
7.3
The XSLT Transformations Into XHTML 1.0
7.3.1
Required Rotations when Translating to XHTML 1.0
D2d ("Triple-Dee"/"DDD"/"Directly To Document"/"Direct Document Denotation") enables authors of scientific texts, as well as poets, essayists and novelists, to employ XML based document denotation in their creative process.
The first aim is to interrupt the creative flow of writing as little as possible. Therefore d2d is implemented as a compiler, and basically uses only one single escape character. This design also supports the creation of XML documents by speech input.
The second aim is to give the user full control of the structure he or she creates. Depending on their level of interest and experience, users can control type definition, construction and processing of text bodies with a scalable level of detail. So what the user types really "is" pure XML, only encoded in a writable and readable way. This is what the name "direct document denotation" refers to.
(For the design principles see also this short buzzword list and a German language concept sheet.)
There have been several applications of d2d, from small text based data bases (bibliographies, book-keeping), over medium-scale technical texts, like this documentation you are just reading, up to poems and novels. The use cases span a wide range from stand-alone application, generating XML for further processing, to programmatically fully integrated as a user's front-end in some dedicated application. Two extreme examples are a larger analytic text in musicology, against our integrated book-keeping software (both in German language).
D2d is a co-operative of different concepts and applications. The central tool is the text2xml compiler, which creates an XML text file from some input text in the d2d front-end format.
The parsing process is controlled by the document type. This can be given in some third-party standard format. Currently only W3C-DTD is supported, see [xml].
But d2d also has its own definition language, referred to as "d2d definition format" or "ddf" or "dd2", and called "ddf" in the following. It provides (1) much finer control of the text parsing process, down to character granularity, (2) extensive modularization, and (3) adaptability not by parametrization, but by applying free transformations to structure-defining expressions.
Further there are tools (1) for transforming between ddf and dtd, as far as possible, (2) for generating documentation on ddfs, and (3) a special mode for interleaving XSLT commands with document structure.
There is (4) a special parsing mode (plus a collection of pre-defined element types realizing combinators), which allows parsing the external representation of umod models, according to their serialization rules. This allows the direct denotation of data models.
Planned is (5) a syntax controlled editor, either on java base, or as an emacs mode.
Last but not least, there is a first generic application: a modular architecture
of document type definitions, called "d2d_gp", because it should be
sufficient for all "*g*eneral *p*urposes".
It is described here briefly in chapter 7, but documented mainly by
the automatically generated documentation pages, extracted
from the ddf files.
It comes with an elaborate xslt system for generating xhtml1.0
[xhtml10] and covers a wide range of traditional publishing.
It can be taken as a basis for own
developments using the above-mentioned re-writing mechanism.
We plan to add translation rules into LaTeX, and into the XML version
of a pseudo-standard wordprocessing format from Redmond, WA.
When transforming an input document in the d2d input syntax into an xml model (and the corresponding textual representation), this process can be seen as a combination of three different parsing situations. In each of these a different parsing technique is applied:
The text-to-xml frontend compiler of d2d takes some character input
and parses it to an internal XML model.
In most cases, this model will immediately be written out by the tool implementation
to some traditional XML-encoded text file, for further processing by the user.
Each file which shall be subject to this d2d parsing must have a structure like ...
    foo foo foo
    #d2d 2.0 text using <MODULE> : <ELEMENT>
    <INPUT TEXT>
    #eof
    bla bla bla
The prefix of the file, i.e. everything preceding the "#d2d" is ignored.
This is followed by the "magic words" "#d2d 2.0" and a document type declaration: By "<MODULE>" a certain ddf definition module will be referred to. The definition text of this module must be locatable by the parser implementation.
In the ddf format, modules may be nested,
and the module identifier "<MODULE>" is a sequence of names, separated
by dot ".", reflecting this nesting.
The name of the top-level module must correspond to the
name of the text file which contains the text of its definition, and this file
must reside in one of the directories specified as
"modules search path." See Section 6.1 below.
Currently this implementation additionally accepts umod type definitions and w3c's dtd files [xml] as module definitions. The corresponding files are considered iff no ddf module with the given name can be found. In this case, since these formats do not support nesting, the name of the module must be a simple identifier.
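The lookup order described here can be sketched as a list of candidate files to try. This is only an illustration: the function name `candidate_files` and the file extensions `.dd2`, `.umod` and `.dtd` are assumptions made for this sketch, not a documented part of the tool.

```python
import posixpath

def candidate_files(module_id, search_path,
                    ddf_ext=".dd2", fallback_exts=(".umod", ".dtd")):
    """List the files to try, in order, for a (possibly dotted) module id.

    Hypothetical helper: the extensions are assumptions of this sketch.
    """
    # Only the top-level module name corresponds to a file name.
    top = module_id.split(".")[0]
    names = [top + ddf_ext]
    # umod and dtd definitions are a fallback, and they do not support nesting,
    # so they are only candidates for a simple (undotted) module identifier.
    if "." not in module_id:
        names += [top + ext for ext in fallback_exts]
    # All ddf candidates come first, so a fallback is reached only if no ddf exists.
    return [posixpath.join(d, n) for n in names for d in search_path]

print(candidate_files("localmodule2", ["/a", "/b"]))
print(candidate_files("basic.text", ["/a"]))
```

A caller would open the first candidate that exists and report an error if none does.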
As "<ELEMENT>" the name of a parser definition from this module must be supplied. This is the definition of the top-level element that will be generated by the parsing process, and its contents definition is the rule initially defining the parsing process. (Normally, this top-most and initial parser will be a tagged parser, not a character parser.)
The end of the parseable input data must be marked with the meta-tag eof. Subsequent input in the input file will be ignored.
(This strict framing may seem annoying, but it makes it easy to embed d2d sources in bodies of emails and other badly specified contexts.)
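How such a framed region could be located inside arbitrary surrounding text is sketched here. The helper `extract_d2d_text` is hypothetical and simplified: it assumes the default command character "#", requires the "2.0" version marker, and does not look at comments or local module definitions.

```python
import re

# Hypothetical helper, not part of the d2d tools.
HEADER = re.compile(
    r"#d2d\s+2\.0\s+text\s+using\s+"
    r"([A-Za-z][A-Za-z0-9_.\-]*)\s*:\s*([A-Za-z][A-Za-z0-9_\-]*)"
)

def extract_d2d_text(raw):
    """Return (module, element, input_text), or None if no framed region exists."""
    head = HEADER.search(raw)
    if head is None:
        return None          # everything is "prefix", nothing to parse
    end = raw.find("#eof", head.end())
    if end < 0:
        return None          # the terminating meta-tag is mandatory
    return head.group(1), head.group(2), raw[head.end():end]

mail = ("Hi Bob,\n"
        "#d2d 2.0 text using localmodule2 : myparser\n"
        "#p Hello\n"
        "#eof\n"
        "regards")
print(extract_d2d_text(mail))
```

Text before the header and after `#eof` never reaches the parser, which is what makes embedding in an e-mail body harmless.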
Additionally, preceding the input data, a source file may contain local document type definitions, similar to the "internal subset" in the world of w3c dtd.
They take the form of one or more module definitions, as described below in chapter 3. Again, this is led in by a reserved character sequence. A file with this format looks like ...
    foo foo foo
    #d2d 2.0 module localmodule
      // here some new parser definitions
    end module
    this will be ignored again, like foo foo foo foo
    #d2d module localmodule2 // "2.0" may be omitted (but shouldn't !-)
      // here some definitions
      tags myparser = // etc.
    end module
    #d2d 2.0 text using localmodule2 : myparser
    // input text
    #eof
Again, all characters outside the marked regions will be ignored. This is done for an easy embedding into e.g. a piece of e-mail without the need of complicated wrappers.
The first layer of parsing is tokenization. Its rules are fixed, but configurable. The parsing process splits the input into tokens and recognizes ...
Comments follow the well-known "C" discipline: At each
point of the input text there is one current and
configurable comment-lead-in character, or comment character for short.
It defaults to "/".
Two of these start a one-line comment, which extends up to the end of the line.
The trailing new line character is not consumed, i.e. it is part of the
next-to-consume input.
One comment character together with an adjacent asterisk
"*" start a multi-line comment, which
extends up to the inverse sequence.
The text forming the comment is totally ignored by the parsing process.
There are no further rules of nesting, so the contents of the comment text do
not open further levels of comment, etc.:
    input // this is a one line comment
    input // this also /* only a one line comment
    input /* this is a // multi
    /* line comment */ input continued
Comments may be inserted nearly everywhere where whitespace can appear. Comments do not contribute to the resulting model and are totally ignored by the parsing process.
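The comment rules can be sketched as a small filter. The function `strip_comments` is an illustration only; it knows nothing about tags, meta-commands or character parsers, which in the real tool may swallow input before it is seen as a comment.

```python
def strip_comments(text, comment_char="/"):
    """Remove one-line and multi-line comments for a given comment character."""
    out = []
    i, n = 0, len(text)
    while i < n:
        two = text[i:i + 2]
        if two == comment_char * 2:
            # One-line comment: skip up to the newline, which is NOT consumed.
            nl = text.find("\n", i)
            i = n if nl < 0 else nl
        elif two == comment_char + "*":
            # Multi-line comment: ends at the inverse sequence, no nesting.
            close = text.find("*" + comment_char, i + 2)
            if close < 0:
                raise ValueError("unterminated multi-line comment")
            i = close + 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

sample = ("input // this is a one line comment\n"
          "input // this also /* only a one line comment\n"
          "input /* this is a // multi\n"
          "/* line comment */ input continued")
print(repr(strip_comments(sample)))
```

Because the contents of a comment open no further comment levels, a single forward search for the closing sequence is enough.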
Tags and meta-commands are recognized by the command character. This is configurable, too, and defaults to the pound sign "#".
After a command character, all whitespace and further command characters are ignored, i.e. the command character is idempotent.
Then an alphanumeric identifier must follow (either directly, or prefixed by end tag characters, see below). This identifier is a character sequence starting with an ASCII letter, followed by ASCII digits or letters or the underscore or the minus sign.
This identifier is either the name of a built-in meta-command, or the start of a tag.
Currently there are only four meta-commands defined:
Please note that #eof, instead of a meta-command, could also be seen as a special, pre-defined tag, implicitly appended to the top-level grammar expression. The effect in parsing would be the same. But the names of all meta-commands are reserved and cannot be used as tags in a user defined document type definition.
The choice of the command character is esp. important for our central aim, not to disturb the flow of authoring. The pound sign "#" is convenient e.g. on a German keyboard layout, where it can be typed with one single key press, without modifying keys. It should be changed according to the preferences of the author.
The following text brings all possible combinations (It is not meant as a practical example !-)
    #setcommand !
    !setcomment %
    %* this is a %%%%% multi line comment !! *%
    !setcommand##setcomment!!!a one line comment
All further examples on this documentation page
stick to the default settings for both of these
characters.
(Consequently, the
source text
of this documentation had to change both of them !-)
Here, between a meta-command and its character argument, is the only place where whitespace is permitted, but not a comment:
#setcommand /*new command character follows*/! |
...will raise an error, because the command character cannot be set to the character "/".
In sum, there are four (4) special, reserved characters involved in the text2xml parsing process: these two, plus the asterisk "*" for framing multi-line comments, plus the character for constructing closing tags, "/", called end tag character in the following. The last two are not configurable, and for the first two there are some forbidden combinations:
name | default | restriction |
commandchar | # | Cannot be set to the end tag character /; cannot be set to the current comment character |
commentchar | / | Cannot be set to asterisk *; cannot be set to the current command character |
end tag character | / | Not configurable |
multi-line comment | * | Not configurable |
These are indeed four(4) reserved characters, but since explicit closing tags are very rarely required, and comments do not count anyway, we still daresay that d2d realizes (nearly) all tagging with "only one(1)" reserved character, namely the command character.
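The restrictions from the table can be written down as a small state object. The class `LexerConfig` and its method names are hypothetical and only mirror the rules stated above.

```python
class LexerConfig:
    """Holds the two configurable characters and enforces the stated restrictions."""

    END_TAG_CHAR = "/"     # not configurable
    MULTILINE_CHAR = "*"   # not configurable

    def __init__(self):
        self.command_char = "#"
        self.comment_char = "/"

    def set_command(self, c):
        if c == self.END_TAG_CHAR:
            raise ValueError("command character cannot be the end tag character")
        if c == self.comment_char:
            raise ValueError("command character cannot be the current comment character")
        self.command_char = c

    def set_comment(self, c):
        if c == self.MULTILINE_CHAR:
            raise ValueError("comment character cannot be the asterisk")
        if c == self.command_char:
            raise ValueError("comment character cannot be the current command character")
        self.comment_char = c

cfg = LexerConfig()
cfg.set_command("!")
cfg.set_comment("%")
print(cfg.command_char, cfg.comment_char)
```

Note that both checks refer to the current value of the other character, so the order of two consecutive changes can matter.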
Finally, all input data which is not preceded by a command char and not part of a comment, is recognized as "character data".
The tagged parsers form the upper, coarse layer of user-defined parsing. For each open tag recognized in the input stream, a new XML node (element or attribute) in the model is created. In case of an element, all subsequently recognized input (character data and further nodes) will be included therein as its contents, until the parsing process decides to "close" this element.
This corresponds to the well-known XML/SGML approach of tagging, i.e. of parsing an XML encoded source text. But with the d2d approach there are some important simplifications:
In combination with the character parsers explained below, this leads to a much more intuitive handling, esp. during the process of creative authoring, as in the following example:
This is a #emph!source text! which refers to some #cite citation and to some #ref text_label in the same document. #p This is a new paragraph, containing #list #i a first #i and a second list item, refering to a #link http://bandm.eu #text link#/link #/list |
It is obvious that you can enter such a text with your favourite text editor. This is not much fun with the standard XML frontend representation:
    <p>This is a <emph>source text</emph> which refers to some <cite key="citation"/>
    and to some <ref>text_label</ref> in the same document.
    </p>
    <p>
    This is a new paragraph, containing
    <list>
    <listitem>a first </listitem>
    <listitem> and a second list item, refering to a
    <link href="http://bandm.eu"><text>link</text></link>
    </listitem>
    </list>
    </p>
Besides these (and some more minor) simplifications, the grammar defining the document type and ruling the parsing process must still conform to the well-known DTD standards, i.e. it must be "LL-1-parseable" in the sense of [xml], after all closing tags inferred by the d2d parser have been inserted, see Section 2.3.5 below.
Tags are recognized in two cases:
(1) whenever a command character is followed
by an identifier which is not one of the reserved meta-commands from
Section 2.2.3,
and
(2) whenever a command character is followed
by such an identifier, which is preceded by one(1) or three(3) end tag characters.
In both cases there may be arbitrary whitespace following the command character, but no whitespace between the end tag characters and the identifier.
In the first case we have an open or empty tag, in the second we have a close tag.
In all cases a whitespace character (blank, tab, newline, etc.) following a tag which is not an empty tag, is interpreted and consumed for delimiting that tag and will not be part of any character contents.
After an open tag some input must follow which matches the definition of the node (attribute or element) to which the tag refers. The translation process will create such a node instance in the model under construction, and parse the subsequent input to create its contents.
An empty element may be denoted by an empty tag, see below.
A close tag explicitly ends the parsing process for an element currently under construction, and continues with the parsing context which governed the parsing before this element had been opened.
Similar to the well-known XML/SGML front-end representation, a close tag is constructed by an end tag character immediately preceding the identifier.
An empty element can be represented by an empty tag, which is an identifier immediately followed by an end tag character. In XML this is restricted ("for interoperability") to elements defined as empty. [xml, 3.1] In d2d it can be used anyway. E.g. with the declaration tags a = b *, the input #a/ is legal.
In contrast to XML, d2d additionally provides both kinds of tags in a form with three(3) end tag characters. These are "premature end" tags. They indicate that the contents of the element are not yet complete, but will be completed in a later working phase. This is e.g. useful for abstracts, bibliographic entries, examples, etc. which are left open in the first course of writing, and shall be explicitly marked as incomplete.
The empty tag has also such a "premature" form which is written "#a///".
In contrast to XML, d2d additionally provides a generic closing tag consisting of the end tag character followed by a whitespace character. This means to end the latest opened element, i.e. it is a close tag to the open tag which has been recognized last. The whitespace character is consumed as part of the tag, and is consequently not subject to further parsing, i.e. it will not be translated to contents of any re-opened element.
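The tag forms described so far can be told apart on a purely lexical basis. The following sketch assumes the default command character; the function `classify_tag` and its result tuples are hypothetical, and meta-commands are not distinguished from tags here.

```python
import re

IDENT = re.compile(r"[A-Za-z][A-Za-z0-9_\-]*")

def classify_tag(text, pos=0, command_char="#", end_char="/"):
    """Classify the tag starting at text[pos]; returns (kind, name, premature)."""
    if text[pos] != command_char:
        raise ValueError("not at a command character")
    i = pos
    # Whitespace and repeated command characters are skipped (idempotence).
    while i < len(text) and (text[i] == command_char or text[i].isspace()):
        i += 1
    lead = 0
    while i < len(text) and text[i] == end_char:
        lead += 1
        i += 1
    if lead == 1 and i < len(text) and text[i].isspace():
        return ("generic-close", None, False)   # closes the latest opened element
    if lead not in (0, 1, 3):
        raise ValueError("a close tag takes one or three end tag characters")
    m = IDENT.match(text, i)
    if m is None:
        raise ValueError("identifier expected")
    if lead:
        return ("close", m.group(), lead == 3)
    i, trail = m.end(), 0
    while i < len(text) and text[i] == end_char:
        trail += 1
        i += 1
    if trail == 0:
        return ("open", m.group(), False)
    if trail in (1, 3):
        return ("empty", m.group(), trail == 3)
    raise ValueError("an empty tag takes one or three end tag characters")

for src in ("#ident text", "#/ident", "#///ident", "#ident/", "#ident///", "#/ here"):
    print(src, "->", classify_tag(src))
```

The third component of the result marks the "premature end" variants written with three end tag characters.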
An open tag may consist not only of the identifier, but may also include the one(1) immediately following character:
Whenever a whitespace character follows, this is consumed as part of the open tag. So this character will not be considered by further parsing, e.g. for the content of the element just opened.
Whenever an opening parenthesis character follows, the corresponding closing parenthesis character is implicitly assigned to the role of the corresponding closing tag.
This supports a kind of tagging which is known from the \verb%...% construct in LaTeX or the -e "s/../../g" construct in sed.
Currently the following parentheses are implemented:
opening character | ( | < | [ | { | . | ! | \ | : | $ | ^ |
closing character | ) | > | ] | } | . | ! | \ | : | $ | ^ |
The assignment of a certain close tag to a certain input character is valid
up to the first appearance of this character (outside of that what is swallowed
by a character parser, cf the warnings below in Section 2.4.3!).
This character will be replaced by the closing tag by the parsing algorithm.
(As with explicit tags,
arbitrary combinations of opening, closing and empty tags may occur before,
--- as long as they comply with the document type, of course !-)
The parenthesis
assignments are stacked, so that at this moment the next-older assignment
pops up and becomes valid again. So the same parenthesis character can be
re-used in a nested way:
    this is #bold!bold and #ital!bold italic! text!!
    --- yields
    this is <bold>bold and <ital>bold italic</ital> text</bold>!
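The stacking of parenthesis assignments can be illustrated by a small rewriting function. This is a sketch under assumptions: the function `expand_parentheses` is hypothetical, it keeps one stack per closing character (how assignments for different characters interact is not specified in this text), it handles only plain open tags, and it never infers close tags from a content model.

```python
import re

PARENS = {"(": ")", "<": ">", "[": "]", "{": "}",
          ".": ".", "!": "!", "\\": "\\", ":": ":", "$": "$", "^": "^"}
OPEN_TAG = re.compile(r"#([A-Za-z][A-Za-z0-9_\-]*)")

def expand_parentheses(text):
    """Rewrite parenthesis-style tags into explicit XML-like open/close tags."""
    out = []
    pending = {}   # closing character -> stack of element names waiting for it
    i = 0
    while i < len(text):
        m = OPEN_TAG.match(text, i)
        if m:
            name = m.group(1)
            out.append("<%s>" % name)
            i = m.end()
            if i < len(text) and text[i] in PARENS:
                # The follower character defines the closing character.
                pending.setdefault(PARENS[text[i]], []).append(name)
                i += 1
            elif i < len(text) and text[i].isspace():
                i += 1   # one whitespace character belongs to the tag
            continue
        ch = text[i]
        if pending.get(ch):
            # First appearance of the closing character: pop the latest assignment.
            out.append("</%s>" % pending[ch].pop())
        else:
            out.append(ch)
        i += 1
    return "".join(out)

print(expand_parentheses("this is #bold!bold and #ital!bold italic! text!!"))
print(expand_parentheses("#pat. Finally she came."))
```

The second call shows how a follower character is taken as an open parenthesis regardless of what the element may contain, since only the lexical level is consulted.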
Only when the character following the identifier in the open tag is neither whitespace nor a special character usable as an open parenthesis (nor an alphanumeric one, of course, because then it would be part of the identifier, and not a follower !-) it is taken as input for the further parsing.
The following example illustrates all these cases (The lines are to be read not as a part of one continuous text, but each as a separate example, in some larger but arbitrary context!)
    foo foo #ident this char data goes into an <ident>..</ident> element
    foo foo #ident  this char data starts with one blank space
    foo foo #ident(this char data ends here) foo foo foo
    foo foo #ident!this char data ends here! foo foo foo
    foo foo #ident=but this char data starts with an equal sign
    foo foo #/ident after the element <ident>..</ident> after the element
    foo foo #///ident here it ends PREMATURELY </ident> here // MISSING FIXME CONSUME ONE BLANK !!!
    foo foo #ident/ this is an empty element <ident></ident> this is ..
    foo foo #ident/this is an empty element <ident></ident>this is ..
    foo foo #ident/// this is an empty element, but waiting for later completion
    foo foo #/ here ends the top element currently under construction, and the character data starts with a "h"
    foo foo #/  ditto, but the character data starts with a blank.
There are two major caveats with the parenthesis mode:
First, the assignment of the role of the closing tag to the parenthesis character is made purely on the lexical level. It does not consider the content model of the parser definition the tag refers to. The latter information lives on a higher level, and will not influence the lower level of tokenization. This seems somewhat inconvenient, but is done for good reasons, as discussed in Section 2.3.6.
Consequently, it is possible to write things like ...
foo foo #el^goes to first el#/el foo foo #el goes to second el^ foo foo |
The first tag opens a new element of type "el" and defines the caret sign "^" as a denotation for the corresponding end tag. But then the text does not use this short-cut notation at all, instead it uses the explicit end tag "/el". So the role of "/el" is still assigned to the character "^" and used afterwards, for the second element.
Our current implementation issues a warning when recognizing the first, explicit end tag, because the closing parenthesis character is intended to be used in correct nesting, i.e. to replace the closing tag which corresponds to the open tag it was defined with.
Secondly, this non-awareness of the content definitions can lead to unwanted effects with empty elements. Assume an author writing a novel and referring to its protagonist by "#pat", which is defined as an empty element. Then the following text cannot be parsed:
#ben stood there and waited for #pat. Finally she came. |
The content of the element pat is #EMPTY (written in the dtd style). Nevertheless the dot character "." is used as an open parenthesis for its contents. Consequently, the second dot character will be read as the closing tag "#/pat", and " Finally she came" as the intended content of the element, which violates the document type, because this content must be empty.
Here are some correct alternatives, meaning the same. (The last one is given only to demonstrate a further trap, not as a suggestion !-)
    #ben stood there and waited for #pat/. Finally she came.
    #ben stood there and waited for #pat(). Finally she came.
    #ben stood there and waited for #pat... Finally she came.
Because there may be significant distance between the place where such an unintended end tag comes into effect, and the place of its unintended definition, every error message generated by the d2d tool concerning an unexpected closing tag resulting from a parenthesis is followed by a hint message giving the position of its definition.
The content model of each element filled by tag parsing and the corresponding parser process are declared in a ddf module by a definition statement compliant to the following grammar:
    tags_parser_def ::= [ "public" ] "tags" ident ( "," ident )* "=" d_expr modifiers
    d_expr          ::= [ "#implicit" ] expr | "#empty" | "#GENERIC"
    expr            ::= decor | decor "," decor | decor "|" decor | decor "&" decor ...
    decor           ::= atom | atom "?" | atom "*" | atom "+"
    atom            ::= "#chars" | "(" expr ")" | reference | insertion | ...
    insertion       ::= "@" reference
Here is a typical example, which defines a hypertext link, closely following the classical HTML <a> element:
    tags link = #implicit url, ( text? & (blank|top|inframe|framename)? & loc? & refdate?)
    tags url, text, loc, refdate, framename = #chars
    tags blank, top, inframe = #empty
The corresponding input will appear like ...
please refer to #link http://bandm.eu/metatools/doc/d2d.html #text this link #loc txt_label #top for more information |
Every tags_parser_def defines the contents of one or more element types, and thus defines the corresponding tag-driven parsers which can convert input text into a model for these elements.
At the beginning of the parsing process, when looking at the text input as a whole, content must follow which is accepted by the parser corresponding to the indicated top-level element, see Section 2.1.1 above.
"Being accepted by a tag parser p" means that the input can be partitioned into a sequence of sub-inputs. Each of these starts with a tag, and is followed by some content which in turn is accepted by the parser indicated by that tag. The sequence of all these tags must match the regular expression which is given with the definition of "p".
The only exception is the #implicit declaration: This keyword suppresses the very first tag of a top-level sequence, so that only the content, but not the tag, must follow.
In practice, this is esp. useful for mandatory entries which always, without alternatives, stand at the beginning of a structure, like ids or numbers or keys.
As soon as the whole contents are recognized as "complete", i.e. as soon as no further input can match the contents definition, a close tag is inferred and inserted in the parsing process.
Nevertheless, the close tag may appear in the input text explicitly, e.g. for readability by humans.
With respect to the top-level element and the parsing of the input file as a whole, the text must be terminated by the meta-command eof.
The declaration as #empty defines that the contents of the element are empty.
If not, then the constructors of the regular expressions have the usual meaning:
Some caveats w.r.t. the "&" operator:
1
The "&" operator is associative. The "," and the "|" operators
are also associative, but with them this is clear from their semantics. For
the "&" operator it must be stated explicitly. The different
kind of notation in the list above is meant to indicate this symbolically.
2
The "&" operator means permutation, and not interleaving!
E.g.
    tags x = a & (b, c)
    ----
    #x #b #c #a // is legal!
    #x #b #a #c // is NOT legal!
This restriction is closely related to the LL(1) restriction, see Section 2.3.5.
3
Permutation happens only in the input text, not
in the data model. The data model will always be normalized to the
sequence of the declaration.
The "&" operator is meant as a mechanism for
writing down e.g. the different "columns"
of an entry in a data base without the
possibly tedious need to respect a totally arbitrary and meaningless
sequential order.
It is not intended to denotate a sequential order as such!
(In real-world data models, a sequential order which carries
any semantics is in nearly all cases
only sensible between different values of the same type, e.g. the
participants of a sports tournament, and not between different types, each
appearing exactly once with only one value each, as described by the
& operator.)
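The difference between permutation and interleaving, and the normalization of the result, can be checked with a few lines of code. The function `match_permutation` is a hypothetical, heavily simplified model: every operand of "&" is a fixed tag sequence, without optional or repeated parts.

```python
from itertools import permutations

def match_permutation(operands, tags):
    """Match tags against the permutation of whole operands.

    operands: one tuple of tags per operand, e.g. a & (b, c) -> [("a",), ("b", "c")]
    Returns the tags normalized to declaration order, or None if not accepted.
    """
    for order in permutations(operands):
        flat = [t for operand in order for t in operand]
        if flat == list(tags):
            # The data model always follows the order of the declaration.
            return [t for operand in operands for t in operand]
    return None

x = [("a",), ("b", "c")]                      # tags x = a & (b, c)
print(match_permutation(x, ["b", "c", "a"]))  # operands swapped as a whole
print(match_permutation(x, ["b", "a", "c"]))  # "a" interleaved into (b, c)
```

The second input is rejected because the operand (b, c) would have to be split, which permutation does not allow.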
A reference, as it appears in the expression of a
parser definition, must always be resolvable as the name
of an existing definition, which may be a tag parser, a
character parser, or an enumeration.
When such a reference appears in a regular
expression expr, this implies
(1) the appearance of the tag of the referred
parser in the parsed input text, (2) followed by some input which is accepted
by that referred parser,
and (3) the construction of a corresponding node (element or attribute,
to speak with XML) in the recognized output model.
Definitions may refer to themselves, or to others in a mutually recursive way.
In the world of the input texts all parsers and element types are referred to by tags. The tag for every parser is a simple identifier, namely the ident which appears to the left of the equal sign in the tags_parser_def which defined this parser.
This holds for those defined directly on the top-level of a module, as well as for those which are nested as local definitions (see Section 3.2).
In contrast, in the world of definitions, whenever referring to a parser in another parser definition, e.g. when constructing a reference, a kind of "qualified name" is used, which in case of local definitions or of imported modules is a sequence of more than one identifier. But such a construct never appears in text inputs.
Referring to another tag parser in the form of an insertion means referring only to its content model, i.e. inserting the regular expression which is the content model of the referred parser at this point into the construction of the regular expression.
In contrast to a "normal" reference, an insertion exists only in the world of definitions. It is not "visible" in the input text.
In the current implementation, tag parsers do not allow cycles, i.e. mutually recursive insertions, but char parsers do.
Insertions are introduced so that no special class of definitions is required for content models as such (as with the two constructs "define" and "element" in [relaxng], or with "ENTITY" and "ELEMENT" in XML DTDs [xml]). In practice, this leads frequently to parsers (tag parsers or character parsers) which are only defined for the sake of their content model to be used in the definition of other parsers, and which are never applied to text input on their own.
This insertion mechanism works also for the character parsers, as described below, Section 2.4. But, of course, both kinds of parsers cannot be mixed: only tag parsers can be inserted into the expression of a tag parser, and only character parsers into character parsers!
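The effect of an insertion, as opposed to a reference, can be modelled by inlining content models. In this sketch content models are reduced to flat sequences of atoms, and the names `expand_insertions`, `common`, `chapter` and `loop` are made up for illustration.

```python
def expand_insertions(defs, name, active=()):
    """Inline every '@x' atom of defs[name] with the expanded content model of x."""
    if name in active:
        # Mutually recursive insertions are rejected, as stated for tag parsers.
        raise ValueError("cyclic insertion: " + " -> ".join(active + (name,)))
    result = []
    for atom in defs[name]:
        if atom.startswith("@"):
            result.extend(expand_insertions(defs, atom[1:], active + (name,)))
        else:
            result.append(atom)   # a plain reference keeps its own tag and node
    return result

defs = {
    "common":  ["id", "title"],    # defined only for the sake of its content model
    "chapter": ["@common", "p"],
    "loop":    ["@loop"],
}
print(expand_insertions(defs, "chapter"))
```

After expansion no trace of "common" is left, which is why an insertion is never visible in the input text.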
Last but not least, in tag parsers the expression "#chars" corresponds to all character data. These are all those fragments of the input which are not recognized as tags or comments or meta-commands, as described in Section 2.2 above.
Besides the "#implicit" declaration, this is the second case in which an open tag is inferred: The "invisible tag" for "#chars" is inserted whenever the beginning of such character data is recognized by the tokenization process.
This implicit tagging could be (formally not correct) depicted as ...
    p = (hr | br | #chars)*
    hr = br*
    br = #empty
    ----- applied to text input --->
    #p this is a paragraph, #hr#br and it continues here
    ----- implicit tagging yields -->
    // #p #chars this is a paragraph, #hr#br#chars and it continues here
    ----- standard d2d parsing yields -->
    <p>this is a paragraph, <hr><br/></hr>and it continues here
The auxiliary second line shows that after the insertion of the implicit "#chars" tag, the parsing and tag inference mechanism of d2d can be applied as usual, without further special treatment.
Nevertheless it turned out to be necessary to treat certain character data in a special way, namely whitespace character input.
If the currently parsed content model does not allow character data,
then every non-whitespace data will be tagged with "#chars" and
influence the parsing situation, as described above. In the example
above, it terminated the collection for the "hr" element.
This "active role" of character data is not sensible with white-space input:
Instead, whenever the currently parsed content definition does not
accept any character data, it is more convenient that white-space is ignored.
For better readability of complex nested contents this is even necessary,
since it allows indentation in the source text which has no effect on the
parsed result.
So the insertion of blank characters in the following version
will not change the parsing result,
because the currently growing element ("hr") does not accept character data:
    #p this is a paragraph, #hr   #br and it continues here
    //                         ^^^ has no effect
    ----- standard d2d parsing yields -->
    <p> this is a paragraph, <hr><br/></hr>and it continues here
...but inserting non-blank characters (of course !-) will:
    #p this is a paragraph, #hr XX#br and it continues here
    ----- standard d2d parsing yields -->
    <p> this is a paragraph, <hr/> XX<br/>and it continues here
The situation is complementary, if the currently parsed element, the current
parsing situation, does allow character input.
Firstly, the "#chars" tag would not lead to closing one or more
parser levels, but is consumed and inserted into the currently parsed contents.
Consequently, white-space must never be ignored.
This can lead to some possibly surprising effects. A widespread example is found
in standard XHTML:
    // slightly simplified contents model:
    link = (id? & href? & name? & style? & class?), (#chars | p | img | div )*
    ----- applied to text input --->
    #link #href thisIsAHref#/link
    ----- yields, as expected --->
    <link href="thisIsAHref"/>
    ---- but applied to text input (note the two blanks after the tag) --->
    #link  #href thisIsAHref#/link
    ---- is implicitly tagged as --->
    #link #chars #href thisIsAHref#/link
    ----- and thus standard d2d parsing yields -->
    ERROR, href not allowed
These effects are caused by the fact that "link" does accept character data, and therefore every blank character beyond the first one is considered as input to the "(#chars|...)*" part of its content model. This leaves the initial permutation expression behind, once and forever!
So when the "href" is meant as a part of the "link" element, it has to be input like
    #link #href thisIsAHref
    -- or even
    #link#href thisIsAHref
    -- but not, with two blanks, as
    #link  #href thisIsAHref
One single blank is swallowed as part of the tag (see Section 2.3.1), but the second blank must be treated as character input. This requires entering the second parenthesis, so "href" is no longer applicable.
A similar effect comes with the definition of paragraph in our standard text format d2d_gp, see chapter 7:
    tags p = (kind? & lang?), (#chars | @PHYSICAL_MARKUP | @DOMAINSPECIFIC_MARKUP)*
    chars kind = @S:ident
    chars lang = @XML:lang
    ----- must be written like --->
    #p#kind motto Here starts the paragraph.
    #p #kind motto Here starts the paragraph.
    ----- but NOT, with two blanks, like --->
    #p  #kind motto Here starts the paragraph.
The discarding and respecting of whitespace happens in exactly the same way when re-entering an element's content after an explicit closing tag:
#p #hr #br/ #br/ #/hr continue // ^^^^ ^^^ ^^^^^ ignored // ^^^^^^^ not ignored #p #hr #br/ #br/ continue // ^^^^ ^^^ ignored // ^^^^^^^^^^^^^^^^ not ignored |
(You will note that in the last case there is a kind of "backward parsing": Not before the non-white-space character data is recognized and leads to an inferred "/hr", the whitespace sequence will be classified as relevant input!)
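The front-end whitespace rule described so far can be summarized in a small sketch. It is written in Python purely for illustration and does not call any d2d code; the function name `classify_char_data` and the three result labels are hypothetical.

```python
def classify_char_data(fragment: str, accepts_chars: bool) -> str:
    """Decide what happens to a run of character data found between two tags.

    fragment      -- the raw characters delivered by the tokenization
    accepts_chars -- whether the currently parsed content model allows #chars
    """
    if accepts_chars:
        # Character data is consumed as content; whitespace is never dropped.
        return "consume"
    if fragment.strip() == "":
        # Pure whitespace where no character data is allowed: simply ignored.
        return "ignore"
    # Non-whitespace data gets the implicit "#chars" tag and may thereby
    # close one or more parser levels.
    return "implicit-chars-tag"


# "hr = br*" does not accept character data:
print(classify_char_data("   ", accepts_chars=False))
print(classify_char_data(" XX", accepts_chars=False))
# "link" accepts character data, so even a blank counts as input:
print(classify_char_data(" ", accepts_chars=True))
```

The sketch deliberately ignores the single blank which is swallowed as part of a tag; it only models the decision taken for the remaining characters.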
Please note that all these "front-end" rules and considerations are
independent from whether the
whitespace will be stored to the resulting elements. The
character data contents of elements may be "trimmed" at both ends,
according to a declaration of "trimming", see
Section 2.5.1.
So possibly the whitespace recognized as the text start of the
"link" element above will nevertheless be discarded when constructing
the result element!
But this "back-end" feature is in no way related to the
decision mechanism described so far!
This is sensible for two reasons:
(a) The front-end rules for whitespace also apply to the points of
re-entering a contents model after an explicit closing tag,
as in the last example above. This case is not covered by the back-end "trimming",
which only affects the very ends.
(b) If trimming were considered in the parsing process, then later
changes in the back-end representation would influence the parsability of
source documents in the front-end. This does not seem wise.
So please do not mix up those two layers. After short practice, the front-end rules will surely turn out to be much less complicated than they may seem, since they are useful and easy to grasp intuitively!
All content declarations must (locally!) fulfill the LL(1) restriction on all repetitions and alternatives (but only w.r.t. all input in which all closing tags are contained explicitly!).
This is not only for ease of implementation; it is even more for readability by a user. We aim at users not coming from informatics or language theory, but from administrative practice, lyrics, journalism, etc. And we want "visibility of control", that is: no inference rules or backtracking behind the scenes. And we want easy compositionality, explainable to the above-mentioned users.
(With the character parsers, as described below, the principles are just the opposite. There we want utmost convenience of usage, and the text areas in which "magical recognition" takes place are normally rather small and explicitly bounded.)
The fact that the & operator means permutation, not interleaving,
is closely related to these features:
If we had an interleave operator, then the LL(1) property
would have to hold for all possible sequences of interleaving.
This would be rather tedious to implement, not very friendly to the user,
would introduce non-compositionality, and would restrict the applicability of
this operator in the world of declaration expressions
more than it would bring freedom in the world of input texts.
The first central target of the d2d approach can be summarized as "recognizing opening tags with as little typing as possible, and inferring closing tags."
This may seem very trivial, but is, on the contrary, surprisingly complex. Not least because it deals not only with a computer-to-human interfacing problem, but also with a second target, namely supporting the dynamic evolution of document types.
During the years of development, application and improvement we revised some design decisions which initially seemed very sensible according to the first target, but turned out to be unfeasible w.r.t. the second.
This is the third version of d2d. Only details were changed in the last version step, but with heavy impact. E.g., in the preceding version, the command character could be omitted whenever the current parsing situation implied that only a tag could follow, not character data.
So you could write ...
#h1 title This is a hierarchy-level-one title |
...which now has to be written as ...
#h1 #title This is a hierarchy-level-one title |
The reason for this new restriction comes from the possible evolution of schemata and document type declarations: we want them to be evolvable while preserving downward compatibility with the existing documents. In practice this is an important issue, because two things happen frequently: (1) further alternatives are added to existing content models, and (2) content definitions are refined. The former happens when fields of application grow, the latter when fragments of information require a finer analysis than was foreseen when the documents were initially created.
So the typical evolutions of document type definitions, seen from standpoint of one particular element type and its contents definition, form these two groups of movements:
    a certain sub-element is ...
        ... nonExisting       ... optional        ... obligate
            1 ------------------>
                                  3 ------------------>
            2 ---------------------------------------->
                                  <------------------ 4
            <------------------ 5
            <---------------------------------------- 6

    the contents of an element are ...
        ... empty             ... unstructured    ... a structured
                                  #chars              character parser
                                  7 ------------------>
            8 ------------------>
                                                  ... a structured
                                                      tag parser
            9 ---------------------------------------->
The first group of transformations refers to tag parsers and their content models, consisting of elements. The second group of transformations replaces unstructured "#chars" data by finer analysis. (Character parsers are introduced below, in Section 2.4.)
In detail the consequences are ...
1) A new element is introduced as optional. No existing document is invalidated.
2) A new element is introduced as obligate. All
existing documents are invalidated, which may be exactly what you want.
3) All documents not yet containing this element are invalidated.
4) No impact on existing documents; future documents may be simpler.
5) and 6) All those documents which do contain this element are invalidated.
This seems an unlikely case (unless the Supreme Court forbids storage of certain
data, etc.)
7) This is a rather frequent case in practice: e.g. "calendaric date" or "course number" are first entered as mere character data, and later, when further processing becomes necessary, refined to a character parser. Some documents may become invalid: This may either be caused by typos in the document, or it is an indication that the new structured character parser is defined too restrictively and does not consider all legal possibilities.
Please note that there are two methods for specifying unstructured character data, either as a tag or as a character parser:
    //a
    tags calendaric = #chars
    //b
    chars calendaric = (#S:all - '#')*
    //c
    chars calendaric = (#S:non_whitespace)*
    //d
    chars calendaric = (#S:all)*
In particular, the variants a and b are not totally equivalent:
The input recognized by variant a is terminated by a
command character, which can dynamically be redefined by the input source.
But the input to b is always terminated by the character "#", and ONLY by this!
The variant d will swallow the rest of the input file and probably result in
an error. See the warnings in Section 2.4.3.
8) In most cases this transformation is not advisable if the element ever occurred in "mixed content": with its empty content, closing tags were always immediately inferred. So they probably do not appear explicitly in the existing document sources, but after this transformation they are required to prevent the element from swallowing all following character data.
9)
This should pose no severe problems, as long as the sets of
involved tags (esp. the "first" sets!) are chosen carefully. There are two sub-cases:
9-a) The new content definition can produce "epsilon", so
the empty content is legal. Then normally no existing
document becomes invalid. (The only problematic case is that the director
sets of the new content model and of the application context are not disjoint, see 9-b.)
9-b) The new content definition cannot produce "epsilon". Then all
existing documents are either invalid, or the enhanced element now
swallows as its children elements which before had been its subsequent siblings
and nephews, which keeps it syntactically valid, but may not be what is
intended semantically.
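The consequences listed for the first group of transformations (cases 1 to 6) can be checked mechanically. The following Python sketch is only an illustration under stated assumptions: the assignment of case numbers to status changes is read off the list of consequences above, a document is reduced to the single fact whether it contains the sub-element, and all names (`accepts`, `TRANSITIONS`, `invalidated`) are hypothetical.

```python
def accepts(status: str, contains_element: bool) -> bool:
    """Is a document valid w.r.t. one sub-element with the given status?"""
    if status == "nonExisting":
        return not contains_element
    if status == "optional":
        return True
    if status == "obligate":
        return contains_element
    raise ValueError(f"unknown status: {status}")


# Case numbers as used in the text (assumed reading of the diagram).
TRANSITIONS = {
    1: ("nonExisting", "optional"),
    2: ("nonExisting", "obligate"),
    3: ("optional", "obligate"),
    4: ("obligate", "optional"),
    5: ("optional", "nonExisting"),
    6: ("obligate", "nonExisting"),
}


def invalidated(case: int) -> list:
    """Kinds of formerly valid documents (True = contains the element)
    which are no longer valid after the transition."""
    old, new = TRANSITIONS[case]
    return [contains for contains in (False, True)
            if accepts(old, contains) and not accepts(new, contains)]


for case in sorted(TRANSITIONS):
    print(case, invalidated(case))
```

An empty list means that no existing document is affected; `[False]` means that the documents not containing the element become invalid, `[True]` those which do contain it.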
It is easy to see that if the content model were considered in the underlying tokenization process, as in the example above and in the preceding version of d2d, many more documents would be affected by type evolution!
Consider e.g. a paragraph which consists of character data mixed with typical inline elements:
    tags p = (#chars | cite | pers | opus)*
    ----
    #p This is about #pers Beethoven#/ and #pers Schiller#/ .
Later we want to extend the definition with a possible label:
    chars label = @S:ident // use some standard identifier grammar
    tags p = label?, (#chars | cite | pers | opus)*
    ----
    #p #label beeschill This is about #pers Beethoven#/ and #pers Schiller#/ .
    #p label is not used by this paragraph
    #p nor by this
When we decide to make the label obligate by evolving the content model from "label?," to "label,", then we want error messages like "missing label" for the last two input lines. Instead, the preceding version of d2d would "suddenly" infer a command lead-in character and react as follows:
    chars label = @S:ident // use some standard identifier grammar
    tags p = label, (#chars | cite | pers | opus)*
    ----
    #p #label beeschill This is about #pers Beethoven#/ and #pers Schiller#/ .
    #p label is not used by this paragraph
    // <p><label>is</label>not used by this paragraph</p>
    #p nor by this
    // error unknown tag "nor"
The opposite effect would be even more confusing: Assume "label" is obligate ...
    tags p = label, (#chars | cite | pers | opus)*
    ----
    #p label beeschill This is about #pers Beethoven#/ and #pers Schiller#/ .
    // delivers <p><label>beeschill</label> This is ... </p>
...and is now changed to be optional:
    tags p = label?, (#chars | cite | pers | opus)*
    ----
    #p label beeschill This is about #pers Beethoven#/ and #pers Schiller#/ .
    // delivers <p>label beeschill This is ... </p>
So the feature of inferring the command character is nice w.r.t. typing, but hardly compatible with evolution, and has consequently been abandoned.
Whenever a DTD shall be constructed from a ddf model definition, this is guaranteed to be possible when only the following patterns are used:
    tags x1 = #empty
    --- or ---
    tags x2 = #chars
    --- or ---
    tags x3 = #implicit a, (b1 & b2? & b3?), (#chars, c1, c2, c3)*
    --- or ---
    tags x4 = #implicit a, (b1 & b2? & b3?), c1+, (c2 & c3)*, ...
              // any expr not containing #chars
    --- for being dtd-compatible, the last two patterns necessarily require
        the following content models and xml representations for their child nodes:
    tags b1, b2, b3 = #chars with xmlrep attribute
    // or
    tags b1, b2, b3 = #empty with xmlrep attribute
    // or
    chars b1, b2, b3 = // ... contents without structure!
          with xmlrep attribute
    tags c1 = .... with xmlrep element
    chars c2 = .... with xmlrep element
    enum c3 = .... with xmlrep element
    // "a" may be of any type
    // as an element, it is prepended to the grammar in x4, and a member
    // of the "mixture" in x3 (thus not restricted to only one(1) instance!)
The first two patterns convert trivially.
In the last two patterns some of the categories (indicated by the above examples) may be left empty. In particular, the child element tagged with "#implicit" does not always make sense.
The first example will be translated to "mixed content", while the second will create a "grammar kind" content model. In both cases the permutation among the b1, b2, .. is automatically realized by the "attribute" mechanism of XML.
The permutations in expressions over elements, ie. in the DTD content grammar, are replaced by sequentialization, and the output will be arranged in the sequence of the definition, not in the (arbitrary) sequence of the input. This corresponds to the semantics of the "&" operator, which is also not reflected in the model as such, cf. Section 2.3.2.
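The sequentialization of permutations can be pictured as a stable re-ordering of the parsed children. The following Python sketch only illustrates this idea; the function `sequentialize` and the representation of children as (tag, content) pairs are assumptions made for the example, not the data structures of the tool.

```python
def sequentialize(definition_order, parsed_children):
    """Arrange children in the order of the definition, not of the input.

    definition_order -- tags in the order they appear in the '&' expression
    parsed_children  -- list of (tag, content) pairs in input order
    """
    position = {tag: index for index, tag in enumerate(definition_order)}
    # sorted() is stable, so nothing else about the input order is disturbed.
    return sorted(parsed_children, key=lambda child: position[child[0]])


# Definition "(b1 & b2? & b3?)", input given in the order b3, b1:
children = [("b3", "third"), ("b1", "first")]
print(sequentialize(["b1", "b2", "b3"], children))
```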
At the lowest level of nesting a certain tag (explicitly mentioned or inferred by a "#implicit" declaration) can initiate a second kind of parsing process. This operates without any tags, but is totally character based.
The grammars corresponding to these "character parsers" can be arbitrary context free. They are parsed in an expensive "longest match" discipline. This is possible since the character sequences parsed here are normally rather short. These character parsers are used for structured data elements like calendaric dates, personal names, reference numbers, etc.
E.g. see the following data base entry:
    #course ET-11f-09
    #lecturer Prof. Dr. Peter Pepper
    #start 23. Jan 2009
    #abstract ... ETC.
The standard XML representation of this input would be
    <course>
      <key>
        <faculty>ET</faculty>
        <number>11</number>
        <grade>f</grade>
        <year>09</year>
      </key>
      <lecturer>
        <prename>Peter</prename>
        <name>Pepper</name>
        <honour>Prof. Dr.</honour>
      </lecturer>
      <start>
        <day>23</day>
        <month>1</month>
        <year>2009</year>
      </start>
      <abstract> ... ETC.
What do you prefer to type ? (or even simply, to read !-)
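What such a character parser achieves for the course key can be imitated with a regular expression. This is a rough Python sketch only: the pattern is a guess at a plausible grammar for keys like "ET-11f-09" (the real definition is not shown in this text), and `COURSE_KEY` and `parse_course_key` are hypothetical names.

```python
import re

# Assumed shape of a course key: faculty letters, number, grade letter, year.
COURSE_KEY = re.compile(
    r"(?P<faculty>[A-Z]+)-(?P<number>\d+)(?P<grade>[a-z])-(?P<year>\d{2})"
)


def parse_course_key(text: str) -> dict:
    """Split a compact course key into the fields of the XML representation."""
    match = COURSE_KEY.fullmatch(text.strip())
    if match is None:
        raise ValueError(f"not a course key: {text!r}")
    return match.groupdict()


print(parse_course_key("ET-11f-09"))
```

The named groups play the role of the sub-elements "faculty", "number", "grade" and "year" in the XML shown above.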
The definition of character parsers follows basically the same syntax as for tag parsers:
    chars_parser_def ::= public? chars ident ( , ident )* = d_expr modifiers?
The basic expressions have to be extended as follows:
    atom ::= ... | stringconst | charset | decnum | hexnum | char_range
           | atom U atom | atom A atom | atom - atom | ...
stringconst ::= " char " |
charset ::= ' char ' |
    decnum ::= ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 )+
    hexnum ::= ( 0x | 0X ) ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 | a | b | c | d | e | f | A | B | C | D | E | F )+
char_range ::= expr .. expr |
stringconst stands for a sequence of characters enclosed in double quotes. Used as a character parser, it matches exactly this sequence in the input.
All constructs which represent a character set can be used as a character
parser and match exactly one of the characters contained therein:
charset stands for a sequence of characters enclosed in single quotes.
The sequence denotes a character set constant containing these
characters.
(Please note that the empty instances of stringconst and charset
have contradictory meanings in the semantics: the empty
string "" always matches. The empty character set
'' never matches !-)
A hexnum is a hexadecimal constant. It starts with "0x" or "0X".
A decnum is a decimal constant, i.e. a sequence of decimal digits.
Both represent the singleton set which contains
only the one Unicode character with that numeric value.
A char_range "a..b" is a character set which contains all
those characters whose numeric values are larger than or equal to that
of "a" and smaller than or equal to that of "b".
The operators "U", "A" and "-" stand for union, intersection
and subtraction of those character sets.
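The semantics of these atoms can be modelled with ordinary sets of characters. The following Python sketch is only a model for illustration, not the implementation of the tool; all function names (`char_range`, `singleton`, `match_string`, `match_set`) are hypothetical.

```python
def char_range(low: str, high: str) -> set:
    """Model of "low..high": all characters with code points in the range."""
    return {chr(code) for code in range(ord(low), ord(high) + 1)}


def singleton(code: int) -> set:
    """Model of decnum / hexnum: the one character with this numeric value."""
    return {chr(code)}


def match_string(const: str, text: str, pos: int):
    """A stringconst matches exactly its character sequence."""
    return pos + len(const) if text.startswith(const, pos) else None


def match_set(chars: set, text: str, pos: int):
    """A character set matches exactly one of its characters."""
    return pos + 1 if pos < len(text) and text[pos] in chars else None


digits = char_range("0", "9")
hex_digits = digits | char_range("a", "f") | char_range("A", "F")  # "U"
lower_hex = hex_digits & char_range("a", "z")                      # "A"
non_zero = digits - set("0")                                       # "-"

print(sorted(lower_hex), sorted(non_zero))
print(singleton(0x41) == singleton(65) == {"A"})
# The empty string always matches, the empty character set never does:
print(match_string("", "abc", 3), match_set(set(), "abc", 0))
```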
The grammar one level higher, for expr,
must also be extended for the construction of character parsers.
As new combinators we get ...
    expr ::= ... | decor ~ decor | > decor
The first new combinator is the tight sequence, ie. the concatenation without intervening whitespace. Please note that the "," operator is still applicable, even with character parsers, and means juxtaposition with arbitrary intervening whitespace.
The second operator is a prefix which causes greedy parsing: in contrast to non-deterministic parsing, the expression (=x) will be applied to the input as long as possible and no subsequent expression will be started as an alternative. Please note that no backtracking will happen, ie. after x has eaten all it can, the following expression may fail even when it would have succeeded if x had stopped earlier.
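The difference between greedy and non-deterministic application can be made concrete with a toy matcher. The Python sketch below is an assumption-laden illustration, not the parsing engine of d2d: it handles only a repetition over a character set followed by a string constant, and the names `star_ends`, `then_literal` and `parse` are invented for this example.

```python
def star_ends(chars: set, text: str, pos: int) -> list:
    """All positions at which a repetition over `chars` may stop."""
    ends = [pos]
    while pos < len(text) and text[pos] in chars:
        pos += 1
        ends.append(pos)
    return ends


def then_literal(ends: list, literal: str, text: str) -> list:
    """Continue from every candidate end position with a string constant."""
    return [end + len(literal) for end in ends if text.startswith(literal, end)]


def parse(text: str, greedy: bool) -> list:
    """Apply  x, "b"  with  x = ('a' U 'b')*  and return the end positions."""
    ends = star_ends({"a", "b"}, text, 0)
    if greedy:
        # Greedy: only the longest match of x survives, no alternatives.
        ends = ends[-1:]
    return then_literal(ends, "b", text)


print(parse("aab", greedy=False))  # x may stop before the final "b"
print(parse("aab", greedy=True))   # x has eaten the "b"; nothing is left
```

In the greedy case the result is empty although the input would have been accepted if x had stopped one character earlier, which is exactly the situation described above.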
The operators for permutation and substitution are not applicable in the definition of character parsers.
As new decorators we get ...
expr ::= ... atom ~* atom ~+ |
Both also mean tight repetition, ie. repetition without intervening whitespace. Please note that the "*" and "+" operators are still applicable, even with character parsers, and mean repetition with arbitrary intervening whitespace.
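The difference between tight and loose repetition has a close analogue in ordinary regular expressions. The following Python sketch uses the standard `re` module only as an analogy; the digit class and the names `TIGHT_PLUS`, `LOOSE_PLUS` and `matched_prefix` are chosen for the example and are not d2d definitions.

```python
import re

# Analogue of  digit ~+  : digits directly adjacent to each other.
TIGHT_PLUS = re.compile(r"[0-9]+")
# Analogue of  digit +   : digits with arbitrary whitespace in between.
LOOSE_PLUS = re.compile(r"[0-9](?:\s*[0-9])*")


def matched_prefix(pattern, text: str):
    """Return the prefix of `text` consumed by the pattern, or None."""
    match = pattern.match(text)
    return match.group(0) if match else None


print(repr(matched_prefix(TIGHT_PLUS, "12 34x")))
print(repr(matched_prefix(LOOSE_PLUS, "12 34x")))
```

The tight variant stops at the first blank, while the loose variant continues across it and only stops at the non-digit.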
Finally we have ...
atom ::= ... resultContainer ... |
resultContainer ::= [ ident expr ] |
This construct defines a sub-element in the generated output model.
As with tag parsers, there are two ways of referring to other
definitions:
An identifier preceded by an "@" character is an
insertion and means the inlining of the definition's grammar.
An identifier without this prefix is a reference, which will result in
a sub-element in the generated xml model.
Such a reference can be to another tag parser (in case of non-insertion
this may even constitute cycles!), or to an enumeration.
In the context of character parsing,
a referred enum_def is treated as the disjunction of
stringconsts, which correspond to the front-end representation of the
enumeration items.
There are some type constraints which have to be fulfilled inside the grammar definitions of character parsers:
In detail each character parser contributes as follows to the constructed result:
The disjunction of all result containers with the same name in the same module yields an expression describing all possible contents with this tag in the generated XML output. But this expression is possibly not conformant to the LL(1)-rules of the XML/DTD standard [xml]. It is feasible to derive a true "DTD content model" automatically, but the result is in most cases not very ergonomic. Furthermore, changes in the front-end grammar can cause hardly predictable results in the data structure, which in no case seems desirable.
Therefore d2d takes the opposite approach when exporting character parser definitions to a DTD, e.g. for data definition: the simple disjunction expression is used as content model, but can be overridden by an explicit data description grammar expression.
A typical use case is that different variants of the input parsing grammar correspond to an application of the permutation operator in the data description. The generated XML document follows the structure of the data contents definition, esp. w.r.t. the sequential order of permutation expressions, etc.
These data declarations are realized by a further kind of top level definitions:
chars_data_def ::= data ident , ident = d_expr modifiers |
The identifiers are those of the character parsers and result containers whose content models are to be defined; the expression must mention only those result containers which appear directly in the subexpressions of the parser definitions.
The equivalence of these two expressions is checked as soon as the DTD export task is executed. More details can be found in an article dedicated to this problem [FIXME_missing_txtVsData not found].
A character parser is initiated the same way as a tag parser, namely by its tag appearing in the input text.
As mentioned above, they are executed in a parallel, breadth first search. They swallow all kinds of characters which are accepted by their regular expression, regardless of the current settings of command character and comment character!
W.r.t. the precedence of tag level command character and comment lead-in character, vs. user defined parsers, three solutions seem sensible:
Currently, the solution (1) is taken, i.e. user defined character parsers always take highest precedence.
As a negative consequence, it therefore becomes really easy to write (unintentionally !-) a character parser which is non-terminating! This parser will swallow the whole file contents, and terminate with a "premature end of file" error.
So it is better to design the character parsers in a way which guarantees safe termination. This can be called limitedness, and is most easily accomplished e.g. by not accepting certain groups of characters. E.g. the parser for "identifier" from the standard application, "d2d_gp.basic.sets:ident", accepts only alphanumeric characters plus "-" and "_" and rejects any whitespace. This is nice to type and sure to terminate soon.
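The effect of limitedness can be demonstrated with two regular expressions. The Python sketch below is an analogy only: `IDENT` approximates the identifier parser by ASCII letters and digits plus "-" and "_" (the real definition in d2d_gp.basic.sets may accept more), `SWALLOW_ALL` stands for an unlimited parser, and the helper `consumed` is hypothetical.

```python
import re

# Limited: stops at the first character outside the accepted group.
IDENT = re.compile(r"[A-Za-z0-9_-]+")
# Unlimited: accepts every character, so it runs to the end of the input.
SWALLOW_ALL = re.compile(r".*", re.DOTALL)


def consumed(pattern, text: str, pos: int = 0):
    """Return (matched text, position after the match) or None."""
    match = pattern.match(text, pos)
    if match is None:
        return None
    return match.group(0), match.end()


source = "xzy foo foo foo\n#p and the rest of the file"
print(consumed(IDENT, source))
print(consumed(SWALLOW_ALL, source)[1] == len(source))
```

The limited parser returns control after three characters; the unlimited one consumes the whole remaining input, including the following tag.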
After a character parser has completed, i.e. does not accept any more input, there may always follow an explicit closing tag, corresponding to the opening tag which started the parser. This can be either by a command character with an explicit end tag, or a parenthesis character defined with the open tag, cf. Section 2.3.1. A closing tag is never required and normally is omitted.
With the abovementioned parser for identifiers, the following lines of input are equivalent:
    foo foo foo #ident xzy foo foo foo
    foo foo foo #ident xzy #/ident foo foo foo
    foo foo foo #ident(xzy)foo foo foo
The definitions so far are completed by modifiers which allow modifications of the generated output:
modifiers ::= with modifier , modifier |
modifier ::= xmlspec editspec inputspec postprocessor localnodes |
xmlspec ::= xmlrep
trimmed
untrimmed
att el cdata = ident stringconst stringconst |
editspec ::= // foreseen for fine-tuning the behaviour of a syntax controlled editor |
The meaning of postprocessor is explained below, in Section 2.8,
the meaning of localnodes in Section 3.2.
editspec is reserved for the tuning of a future syntax controlled editor.
The other kinds of modifier are explained in the following sections.
The xmlspec specifies the kind of generated nodes. As a default, an element is generated. If an attribute is wanted instead, this has to be specified by
... with xmlrep att |
Character data contents can be trimmed or untrimmed w.r.t. leading and trailing whitespace. The default for this can be set on a per module basis or by default groups (Section 3.1). This default can be overridden here for each parser definition individually.
As a tag for the input text always that ident is taken which
appears in the definition of the parser, to the left of the equals sign
in the rule tags_parser_def, chars_parser_def or
enum_def.
The tag used in the xml output also defaults to this tag, but can be overridden
here by the ident or stringconst.
It must be overridden in case a name space shall be used for this
tag which is not the default name space, see Section 3.3.
If the element itself is empty but shall be represented by a certain attribute value, then the second string parameter (the very last stringconst) gives this value.
Example: In our "easy-to-type" version of xslt (chapter 4),
the "disable output escaping" property of an xslt:text element,
which is "off" as a default, is switched on by simply including one
empty element. In the generated model it must of course
be realized as defined by the standard, ie. by
an attribute with one certain value.
This is achieved by the definitions ...
    tags text = noescape?, (#chars)* with xmlrep untrimmed element
    tags valueof = #implicit xpath, noescape?
    tags noescape = #empty with xmlrep att = "disable-output-escaping" "yes"
The verbatim input mode is a modification of the parser and lexer processes for parsing character content. It is meant for the content of those element types which quote syntactic structures from other computer languages, which should be preserved in a verbatim way, so that the d2d parsing interferes as little as possible.
The verbatim mode is activated by a certain alternative in the modifier production from above:
inputspec ::= with input verbatim |
The generation of warnings can be suppressed by using a built-in meta-command like suppressVerbatimCommandCharWarning 17, which will suppress the next 17 warnings.
FIXME
IF an enumeration appears EXCLUSIVELY EMBEDDED in a char parser,
then it does not need an XML tag. It can more sensibly appear simply as a disjunction.
(Therefore "xmlrep none" is NOT SENSIBLE there!)
OTHERWISE it needs an encoding (all kinds of xml ns-names can be used!
Even more than one!)
Occurrences: embedded in char parser, tagged in char parser, tagged in tag parser!
The WARNING in Resolver4 about the re-use of an xml tag ("listSymbol") arises because with XRegExp an "@X" operator replaces the definition X by its CONTENT, which in turn consists of definitions. So "X" does not appear as a definition in the expanded module at all. This does not work for enumeration definitions, however, because they are needed for parsing. Hence the WARNING. In DTD generation, however, "@enum" should also be resolved. FIXME. (So instead of the complicated tagged enum translation simply PCDATA!)
-----------------------------------------------
Enumerations are just for convenience, and could as well be realized by
character parsers or tag parsers, followed by some xslt processing.
Their definitions consist of alternatives of
alternatives, and offer more ways of automatically encoding the selected value
into the output.
Their definition follows the syntax
enum_def ::= enum
ident
=
#GENERIC enum_item , enum_item enum_modifier |
enum_item ::= ident stringconst decnum |
enum_modifier ::= with
xmlrep
empty
attribute element = stringconst numeric first name as is |
Every enumeration consists of a set of strings, called enumeration items. Each item is assigned to a numeric value.
The definitions with ident and stringconst are totally equivalent and may be mixed arbitrarily. The latter may always be used. It must be used when the identifier is not an "ident" in the sense of the ddf lexer. So it is possible to define enumerations with items which are not identifiers:
enum alignments = "<-", "->", "-><-" , "<->" with xmlrep attribute numeric |
The forms with and without decnum may not be mixed: Either each enum item is given a numeric representation explicitly, or no single number appears in the definition and all items are numbered automatically. If numbers are given explicitly, the same number may be assigned to more than one item.
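The numbering rule can be stated as a few lines of Python. This is a sketch under an explicit assumption: automatic numbering is taken to start at 0, which is consistent with the "alignment" example further below, where the fourth item "<->" is encoded as "3". The function name `number_items` is hypothetical.

```python
def number_items(items: list) -> dict:
    """Assign numeric values to enumeration items.

    items -- either plain item strings (automatic numbering) or
             (item, number) pairs (explicit numbering), never both.
    """
    explicit = [isinstance(item, tuple) for item in items]
    if any(explicit) and not all(explicit):
        raise ValueError("explicit and automatic numbering may not be mixed")
    if items and all(explicit):
        # The same number may be assigned to more than one item.
        return dict(items)
    # Assumption: automatic numbering starts at 0.
    return {name: index for index, name in enumerate(items)}


print(number_items(["<-", "->", "-><-", "<->"]))
print(number_items([("left", 1), ("links", 1), ("right", 2)]))
```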
The enum_modifier defines how the recognized item is represented in the xml output:
Enumerations can (1) be referred to in tag parsers, and (2) referred to and
(3) inserted into character parsers.
In case (1) they appear in the input with their name as their tag,
followed by the selected enum item.
In cases (2) and (3) the tag does not appear in the input, as usual with
character parsers. The difference will be in the output:
In (2) a node is constructed as in (1), but in (3)
the character parser will interpret the enumeration only as
one big alternative of string constants, discarding all encoding and
numerical values.
    tags p = (alignment?), (#chars|q|r)*
    chars q = (alignment?), [rest(S:digit)*]
    chars r = (@alignment?), (S:digit)*
    ------
    #p #alignment <->here start the chars#/p
    // yields
    // <p alignment="3">here start the chars</p>
    #p #q<->17 here start the chars#/p
    // yields
    // <p><q alignment="3"><rest>17</rest></q> here start the chars</p>
    #p #r<->17 here start the chars#/p
    // yields
    // <p><r><->17</r> here start the chars</p>
So obviously enumerations do not contribute essentially new features, but support easier processing by high flexibility of encoding and a further translation step. This makes a post-processing superfluous which would imply distribution or duplication of information.
When a DTD is generated from a d2d type definition, the enum definitions are translated according to the selected representation. Attributes are translated to XML "enumerated attributes", iff every value used for representation is conformant to the XML "Nmtoken" syntax rule, see [xml, rule 57+59]. Otherwise they simply get CDATA.
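The decision between an enumerated attribute type and CDATA can be sketched as follows. The Python code is a simplification made for illustration: `NMTOKEN_ASCII` covers only the ASCII part of the XML "Nmtoken" rule (the standard also admits many non-ASCII name characters), and the function `dtd_attribute_type` is hypothetical.

```python
import re

# ASCII approximation of the XML Nmtoken rule: one or more name characters.
NMTOKEN_ASCII = re.compile(r"[A-Za-z0-9._:-]+")


def dtd_attribute_type(values: list) -> str:
    """Return the attribute type to be written into an ATTLIST declaration."""
    if values and all(NMTOKEN_ASCII.fullmatch(value) for value in values):
        return "(" + " | ".join(values) + ")"
    return "CDATA"


print(dtd_attribute_type(["0", "1", "2", "3"]))     # numeric encoding
print(dtd_attribute_type(["left", "right"]))        # names as tokens
print(dtd_attribute_type(["<-", "->", "<->"]))      # not Nmtokens
```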
In the first course of an authoring process parts of a document may stay incomplete. E.g. bibliographic entries, abstracts, or certain paragraphs of the text body may require completion in a later phase of work. This may refer to semantic completeness of the contents, as well as missing child elements which are syntactically required.
The latter case is supported by d2d by the "brute force" end tags and empty tags, as already mentioned above, Section 2.3.1.
    #p This PhD thesis is organized as follows:
    #list/// //structure still missing, FIXME !
    Evidently, this clean structure will help a lot ....

    #bibentry rahner59 #title Das Kirchenjahr
    // shit, WHERE was it publihed ?? TODO find out !!!
    #///bibentry
The first example defines a "list" element which must contain at least one "list item", according to its definition. But the author does not yet know how to fill it, so he/she leaves it voluntarily in an incorrect state w.r.t. syntax.
In the second example there are many obligate fields defined for a bib entry (like author, title, place, year, kind, etc.) Many of them are missing, so the element is syntactically invalid and marked as such.
The reaction of the d2d tool to
incomplete input is basically the same,
whether this is intentional and marked as such, or erroneous:
It generates special "meta-elements" in the generated model.
These follow roughly the definition
    module d2d-meta
      xmlns d2d = "http://bandm.eu/doctypes/d2d_gp/d2d-meta" is default
      import S = basic.sets
      public tags parsingError = kind & tag & (expected | skipped)
      tags expected, skipped = #chars
      enum kind = open, close with xmlrep att
      chars tag = @S:ident with xmlrep att
    end module
E.g., the standard xslt translation from our general purpose text model into xhtml makes these positions of the source document visible by presenting the information concerning the missing contents with most ugly colouring.
There are always two cases:
First, an input tag may be found somewhere later in the currently active stack of content models. In this case a sequence of nodes is considered to be missing. The tool inserts the wording of the offending tag, and a synthesized grammar expression, which represents the structure of the missing input. Please note that this grammar may include required components of different levels of the document type definition, so it may never appear in the definition texts as such!
For example:
    type definition :
      tags a = b, c, d
      tags b = e, (f|d)+
      tags c,d,e,f = #empty

    input text:
      #a #b #e #d

    resulting document:
      <a>
        <b>
          <e/>
          <d2d-meta:parsingError kind="open" tag="d">
            <d2d-meta:expected>(f|d)+,c</d2d-meta:expected>
          </d2d-meta:parsingError>
        </b>
      </a>
The second case is that a tag cannot be considered as "correct but too early". In this case the tag must be discarded. All input up to the next command character is skipped and reported as such in the generated model. The parsing process continues with the next tag, which, of course, can again run into this error situation.
An example:
    type definition :
      tags a = b, c
      tags b,c = #empty

    input text:
      #a #b some text #x more text #c#d

    resulting document:
      <a>
        <b/>
        <d2d-meta:parsingError kind="open" tag="#chars">
          <d2d-meta:skipped>some text </d2d-meta:skipped>
        </d2d-meta:parsingError>
        <d2d-meta:parsingError kind="open" tag="x">
          <d2d-meta:skipped>#x more text </d2d-meta:skipped>
        </d2d-meta:parsingError>
        <c/>
      </a>
Similar reactions are of course possible on misplaced closing tags.
(Currently, erroneously appearing closing tags and explicitly written
premature-closing tags are not distinguished in the generated model,
but this could change.)
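A consumer of the generated model may want to list these meta-elements. The following Python sketch is not part of the d2d tools; it assumes that the output document binds the namespace URI from the module definition above, and the function name `find_parsing_errors` is hypothetical.

```python
import xml.etree.ElementTree as ET

# Namespace URI as it appears in the "d2d-meta" module definition above.
# That the generated document binds exactly this URI is an assumption.
D2D_META_NS = "http://bandm.eu/doctypes/d2d_gp/d2d-meta"


def find_parsing_errors(xml_text):
    """Collect (kind, tag, child name, text) for every parsingError element."""
    root = ET.fromstring(xml_text)
    errors = []
    for err in root.iter("{%s}parsingError" % D2D_META_NS):
        for child in err:
            # Strip the "{uri}" part to get "expected" or "skipped".
            local_name = child.tag.split("}", 1)[-1]
            errors.append((err.get("kind"), err.get("tag"),
                           local_name, child.text or ""))
    return errors
```

Such a scan can be used, for instance, to refuse publishing a document as long as it still contains intentionally incomplete elements.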
Beside this generation of output, the tool can be configured
Sometimes a parsed element must comply with special semantic constraints, or the contents of some additional (often "hidden") data field must be calculated, or some normalization shall be performed.
Whenever the (purely syntactical) means of d2d ddf document definitions are not sufficient, and these processes shall not be delayed to a later processing phase, a Java method can be employed to perform arbitrary immediate post-processing of a freshly generated model fragment.
The usage of this feature should be restricted to very few very special cases. E.g. enriching the elements which represent the calendaric date in a locally specific format by one element or attribute which represents the same date in a normalized "UTC" encoding is a typical application of this kind of post-processing.
The post-processing is declared by referring to a certain Java class by a variant of the modifier above:
postprocessor ::= postproc classAsString |
classAsString ::= " ident . ident " |
The class referred to must (a) of course be locatable by the tool via its classpath, etc., and (b) derive from a certain class and offer a certain processing function. Details can be found in the API doc of the PostProcessor class.
(Macros and source file inclusions could be very helpful in particular situations, but are not yet implemented.)
As mentioned above, all parser definitions are grouped into modules.
Modules can contain other modules.
For parsing a d2d input text, this text must begin with the
identification of the top-level parser. This is done by its name, preceded
by the path to the module, see Section 2.1.1.
The syntax of each module is defined as ...
module ::= module ident defaultDeclaration importItem nameSpaceDecl module defaultGroup definition end module
definition ::= tags_parser_def chars_parser_def enum_def chars_data_def ... |
ident ::= ASCII_letter ASCII_letter ASCII_digit _ - |
An ident is a simple identifier as known from "ancient C", since only ASCII letters and digits are permitted.
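The operators of the ident rule are not visible in the rule as printed here. The following Python sketch therefore rests on an assumed reading: one ASCII letter, followed by any number of ASCII letters, ASCII digits, "_" or "-". The names `IDENT_RE` and `is_ident` are hypothetical.

```python
import re

# Assumed reading of the rule: a leading ASCII letter, then any number of
# ASCII letters, ASCII digits, "_" or "-". The real tool may differ in details.
IDENT_RE = re.compile(r"[A-Za-z][A-Za-z0-9_-]*")


def is_ident(text):
    """Return True if `text` is an identifier under the assumed rule."""
    return IDENT_RE.fullmatch(text) is not None
```

Under this reading, module names used in this document such as "d2d-meta" or "xhtml_1_0" are accepted, while names starting with a digit or containing non-ASCII letters are rejected.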
Modules may be nested, and the name of a module is its identifying path, i.e. the concatenation of the names of all containing modules, in descending order, combined by a dot ".".
The nesting of modules does not imply any inheritance. It is just for organizing the names of modules and for putting more than one module into one text file. Each definition file must contain one top-level module, and the name of that module and the name of the file must be in a certain relation which allows the tool implementation to find the definition file from the name of a module.
The purpose of a leaf module is to contain definitions.
Each definition will assign one or more idents to the parsers it defines, see the grammar rules for the different kinds of definitions tags_parser_def, chars_parser_def or enum_def.
Additionally, a module can contain other modules, and a module can contain importItems which import definitions from other modules.
For these different kinds of objects there is only one single name space per module. So every ident can only be used once on the top level of a certain module as the name of a module, of a parser definition, or for an import item. Besides this, the enum_items belonging to one enum_def form a separate name space, and so do local parser definitions, see Section 3.2 below.
Currently the following default declarations are supported:
defaultDeclaration ::= tagManglingDirective globalTrimming |
globalTrimming ::= plain text is trimmed untrimmed |
The globalTrimming simply sets the default decision, whether element and attribute content shall be trimmed from whitespace at both ends of their string contents. This default itself defaults to false. It may be overridden for every parser definition individually, see Section 2.5.1.
The tagManglingDirective defines how the tags for the XML back-end are derived for locally defined parsers, and is explained in the context of Section 3.2.
These defaults can be defined at the beginning of a module, but can also be put into a dedicated defaultGroup, the only purpose of which is to restrict the scope of a certain default. This may be very useful e.g. if a certain group of parsers shall have trimmed content, or a specially mangled xml tag, but the rest of the module does not care or wants it explicitly different. The construct, as appearing in the definition of module above, is
defaultGroup ::= begin defaults defaultDeclaration definition defaultGroup end defaults
Many conventionally used names are re-used quite frequently in very different contexts. So idents like "title", "name", "id" or "ident", "number" or "num", "key" or "caption" are sensible in many different contexts, as tags for possibly very different sub-structures.
d2d supports the re-usage of idents as tags and as names by supporting local definitions.
As mentioned above, the regexp for modifier includes as one of its alternatives localnodes, which is defined as ...
localnodes ::= local definition end local |
All the definitions which are local to a "containing" definition share one dedicated name space. So the same ident may be used for very different parser definitions, as long as they reside in different local scopes.
Normally these definitions are intended to be used in the regular expression of that "containing parser". Therefore all references in the expr which serves as the containing parser's content model definition are (in a first step) resolved against this local scope. If no definition is found, then the next higher containing definition is searched for a local definition, and finally the global scope of the module.
But this is only a convention and an abbreviated notation. This co-incidence is the most frequent case, but it is not necessarily so! Indeed, locally defined parsers can be used, i.e. referred to or inserted, everywhere. Outside of the expr of the containing parser they must be identified in a different way, namely by using their "qualified name", which is the sequence of the idents of all parsers they are contained in, starting from module level, separated by a dot ".".
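The lookup order and the qualified names described above can be modelled in a few lines. This Python sketch is only an illustration of the rule as stated in the text, not the tool's implementation; the class `Definition` and the function `resolve` are hypothetical.

```python
class Definition:
    """A parser definition that may carry a scope of local definitions."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent   # containing definition, None at module level
        self.locals = {}       # ident -> Definition

    def add_local(self, name):
        child = Definition(name, parent=self)
        self.locals[name] = child
        return child

    def qualified_name(self):
        """Idents of all containing parsers, from module level, dot-separated."""
        parts = []
        node = self
        while node is not None:
            parts.append(node.name)
            node = node.parent
        return ".".join(reversed(parts))


def resolve(ident, containing, module_scope):
    """Search the local scope, then each enclosing one, then the module scope."""
    node = containing
    while node is not None:
        if ident in node.locals:
            return node.locals[ident]
        node = node.parent
    return module_scope.get(ident)
```

With a definition "link" that has a local definition "url", the abbreviated reference "url" inside "link" resolves to the definition whose qualified name is "link.url"; any ident without a local definition falls through to the module scope.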
For example ...
    tags link = #implicit url, (text?, when?, loc?)
      with local
        chars url = (S:letter~+)~"//~([frame S:letter~*]~"/")*
        // etc.,let it be a rather complete and complicate parser declaration!
      end local
    tags image = link.url, alt, (width? & height?)
      with local
        alt = #chars
      end local
    tags table = xxxx, (caption?), xxxx
      with local
        caption = @image.alt
      end local
Please note that the tagging in the input source is independent from the position of the definition: it is only defined by the ident used in the definition of the parser. These tags must be unambiguous w.r.t. the LL(1) property anyhow, totally independent from the position of the defining parser in the scope of the defining module. The world of definitions and the world of text input is only loosely related.
It is sensible on the input side to let the user simply write "title", in the context of a picture or a person or a chapter.
On the other hand, in most cases the tags in the generated XML output model shall differ, so that subsequent processing and modelling can differentiate. So here a tag like "picture_title" or "chapterTitle" would be welcome.
First of all, the xml tag of a local node can be overridden explicitly to a certain string value, as it is possible with top-level definitions, by including this tag string in an xmlspec, cf Section 2.5.1.
Whenever this is not done, an automated tag mangling process takes place which is controlled by
tagManglingDirective ::= local node xmlrep naming = join by stringconst join upcased no mangling
Such a tagManglingDirective can appear at the start of the module, or in a dedicated defaultGroup. It defines how the xml tag of a local definition is derived from its ident and the xml tag of the containing definition.
So the xml tag of a local node is either ...
If no mangling directive is set in the source file, it defaults to "no mangling", and a warning is issued whenever this default is applied.
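The three directive variants can be illustrated by a small Python sketch. The exact derivation rules are not spelled out at this point of the text, so the behaviour below is an assumption inferred from the keywords and from the sample tags "picture_title" and "chapterTitle" mentioned earlier; the function `mangle` is hypothetical.

```python
def mangle(parent_tag, ident, directive="no mangling", joiner=""):
    """Derive the xml tag of a local definition (assumed semantics).

    "join by"      -> parent tag, the given string constant, then the ident
    "join upcased" -> parent tag, then the ident with its first letter upcased
    "no mangling"  -> the ident unchanged
    """
    if directive == "join by":
        return parent_tag + joiner + ident
    if directive == "join upcased":
        return parent_tag + ident[:1].upper() + ident[1:]
    if directive == "no mangling":
        return ident
    raise ValueError("unknown mangling directive: " + directive)
```

Under these assumptions, a local "title" inside "picture" becomes "picture_title" when joined by "_", and a local "title" inside "chapter" becomes "chapterTitle" when joined upcased.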
The generation process for the tags in the output model may involve usage of xml name spaces, see [xml-ns].
The namespaces for which tags are generated can be declared at the beginning of each module:
nameSpaceDecl ::= xmlns ident = stringconst is element default
One or more of these statements may appear at the start of a module. All these statements will be inherited by all directly contained modules. But these can of course override.
In each of these statements the ident defines a prefix by which this namespace will be addressed in the xmlspec definitions of the following parser definitions. This prefix is totally arbitrary. It only needs to be unique among all nameSpaceDecls of this module.
The stringconst is the namespace URI which is to be addressed. This URI connects to "outer reality" and in most cases has a well-known meaning, an owner, and a role defined by convention, etc.
The prefix defined here is likely to be the prefix chosen when writing out
the model, but of course this cannot be guaranteed, and it has only some
small impact, namely on the readability by humans.
((
This is not quite true because some browsers, in spite of claiming
to support xhtml, which is an instance of xml, nevertheless
require a certain prefix
to be used for recognizing xhtml elements. This is bad behaviour! We support
this only in so far as we can say it is likely that the same prefix will be
used for writing out an XML model to a text file as has been used for
declaring the namespace in the ddf definition.
))
The assignment of xml name spaces to xml tags for the different parsers works as follows:
Modules may import other modules for referring to parsers defined therein:
importItem ::= import ident = #GENERIC modulePath moduleSubst
modulePath ::= ident . ident |
The imported module can be subject to substitutions (=re-writings), specified by moduleSubst. This is the central means for parametrization of modules and described later in detail.
References to definitions contained in the imported module are written in the following text using the declared identifier as a prefix, as in ...
import E = basic.elements // .. tags myDef = E.importedDef |
The descending into local definitions, as described in Section 3.2 above, is also possible with imported definitions. So we get the same grammar as above, but with a different meaning:
reference ::= ident . ident |
The leading sequence of idents can be a sequence of import prefixes, pointing to an import statement in the current module, then to an import statement in this imported module, etc. This sequence is followed by a second sequence of idents, which descends into definitions and local definitions.
Since there is only one name space for all things contained in a module, as mentioned above (this chapter) (import keys, local module names and definition names) these sequences are always unambiguous.
(Please note that importing a module makes accessible all contained definitions using the reference expression, but not the contained local modules. These, when needed, must be imported separately and explicitly.)
(Please note that modules do inherit nearly nothing from the context they are contained in. So their source text can be moved around arbitrarily.)
The only context dependencies concern the locating algorithm for imported modules. The modulePath of every imported module is resolved by first finding a module according to the first ident component. Then we descend into the sub-modules of this module, treating the following components as their names. This top-most ident is resolved by testing ...
(Please note that this is just "syntactic sugar" to allow a more convenient
moving around of module source fragments, when developing and refactoring.)
(The price for this is, that a top-level module with the same name as a
child or sibling module can currently not be addressed.)
For library modules to be useful for most different purposes, there must be some mechanism for tailoring, parametrizing and extending them. Here, d2d takes a "glass box parametrization" approach, based on free rewriting.
(The theory of this approach is discussed in [rewriting2018] and [rewriting2024], but is not yet completely understood !-)
Dedicated references may be declared as being parameters, which the user of a library module must instantiate when using the module. This is done by declaring a tag parser, a character parser or an enumeration as "#GENERIC" (as can be seen in the definitions d_expr and enum_def above) and then referring to this placeholder, or by writing a generic module import, cf. importItem.
But this is only a special case. Basically, a user may re-write any definition of the imported libraries, in any way he/she wants to. But this of course requires access to the source and knowledge of the definitions' structure. The advantage is that the author of the library just presents a prototype, and does not restrict its usage. The library can evolve in directions unforeseen and unforeseeable by that author!
The parameterizations of an imported module fall into three classes:
The corresponding rules are ...
moduleSubst ::= ^ ( ident / ident ) in reference substitution
substitution ::= ^ ( expr #none / reference ) |
So there are several, very different kinds of substitutions. But the syntax of all of them follows the same pattern:

    in this context
    |          insert
    |          | this expr
    |          | |              instead of this reference
    |          | |              |
    modulePath ^ ( importPrefix / importPrefix )
    modulePath ^ ( expr         / reference )
    expr       ^ ( expr         / reference )

The first form is special, and only allowed in module import commands. Here the substitution is one step more indirect than the headline suggests: the ident preceding the "/" must correspond to a second importItem in the same, importing module. The second ident, however, must refer to an import item in the imported module itself.
This allows the exchange of whole groups of definitions:
    module webpage_italian
      import B = basic.deliverables ^(MYCAL / CAL)
      import MYCAL = calendaric_italian

      module calendaric_italian
        enum month = januario, februario,// etc.
        // etc.
        chars date = // etc
      end module

      public tags website = @B:website
    end module
This substitution instantiates the module "basic.deliverables", but replaces the "calendaric module" (i.e. the module which is imported by this module as a library for calendaric data definitions and parsers) by a new version, contained in the module "webpage_italian.calendaric_italian".
This module (instead of the original) will be addressed by all references containing the prefix "CAL:" in the instantiated module. So finally a new, own parser definition "website" can be constructed, by inserting the parser definition from this instantiated module "B".
Prerequisite for this to work is that the user knows the names of all definitions which are used in "basic.deliverables" with the prefix "CAL:", i.e. the signature of imported definitions, because those must be replaced completely and in a type correct way.
Please note that only simple prefixes without any "." inside can be used for this kind of substitution. The forms "import A = a ^(B.X/C)" and "import A = a ^(B/C.X)" are both illegal as soon as B or C is an import prefix.
In the second, general form the appearing non-terminals are what they seem: The re-written part (=the replaced sub-expression of "some context") is always a reference, and what is to replace it is an arbitrary expr, or the special value "#none".
(Currently there is no way of replacing complex expressions. But since in most cases parametrization of modules and parsers aims at extending some pre-defined alternatives, this was not yet necessary in practice !-)
In case of a moduleSubst, the expr to insert is (of course !-) evaluated in the context of the importing module.
In case of a moduleSubst, the optional "in reference"-part allows to restrict the substitution to the body of one certain parser in this imported module. This target is identified by this reference, of course resolved in the imported module.
Substitutions may also occur independently of a module import. This is reflected by extending the rule from above by ...
atom ::= ... atom substitution |
This allows derivation of new parser definitions from existing ones. Both mechanisms are extensively used in the construction of our standard text format "d2d_gp", as explained in chapter 7. Please see there for instructive examples.
For the special value "#none" see below Section 3.6.
The usage of substitution is esp. powerful in connection with insertion. But here also some caveats have to be considered:
First, the references of inserted expressions coming from imported definitions are evaluated in their original context. In other words: statically bound references are inserted after binding, not mere front-end identifiers:
    module outer
      module A
        tags a = b, c
        tags b,c = #empty
      end module
      import A = A
      tags x = @A:a
      tags b,c = #chars
    end module

Here, the definition x will refer to the empty elements b and c, as defined in module A.
Nevertheless, a substitution always takes a front-end identifier and replaces it with an evaluated expression, i.e. evaluated in the context where the substitution is written down:
    module outer
      module A
        tags a = b, c
        tags b,c = #empty
      end module
      import A = A
      tags x = (@A:a) ^(b/c)
      tags b,c = #chars
    end module

Now the contents of x are defined as a sequence of two elements, both called b: the first is empty, defined in module A; the second may contain #chars data, defined in module outer.
When nesting substitutions, the outer one is applied to the expression part of the inner one (above the "/"), AND to the rewriting result, but not to the (mere front-end) reference text below the slash:
    tags x1 = ( a,b,c ^(b / c) ) ^(d / b)
    // -->    ( a,b,c ^(d / c) ) ^(d / b)
    // -->    ( a,b,d ) ^(d / b)
    // -->    ( a,d,d )

    tags x2 = ( a,b,c ^(b / c) ) ^(d / c)
    // -->    ( a,b,b ) ^(d / c)
    // -->    ( a,b,b ) + WARNING, "c" did not occur

    tags x3 = ( a,b,c ^(@b / c) ) ^(c / b)
    // -->    ( a,b,c ^(@c / c) ) ^(c / b)
    // -->    ( a,b,(c, c)? ) ^(c / b)
    // -->    ( a,c,(c, c)? )

    tags b = (d, d)?
    tags c = (c, c)?

    /* ==== when replacing from inner to outer the last example would
            instead resolve to ...
    tags x3 = ( a,b,c ^(@b / c) ) ^(c / b)
    // -->    ( a,b,(d, d)? ) ^(c / b)
    // -->    ( a,c,(d, d)? )
    ==== */
As mentioned, what is re-written is determined only by the front-end representation of the reference. The "semantics" of the resolved references are not involved in the matching, only their front-end denotation. Consider the following example:
    module OUTER
      module INNER
        import I = one_particular_module
        import J = one_particular_module
        tags t = I:a
        tags u = J:a
      end module // INNER
      import IN = INNER ^ ( b / I:a)
      tags b = // etc ...

Here only the occurrences of "one_particular_module:a" in the definition of "t" will be replaced by "b". The reference "J:a" in the definition of "u" is not touched, although it points to the same definition in "one_particular_module". Only the front-end representations of the references are subject to substitutions, not the declarations referred to!
Nevertheless, these front-end representations are the fully qualified ones,
after the resolution of the abbreviated access to local definitions!
Consider ...
    module m
      tags t = a, b, c
        with local a = // ...
        end locals
      tags u = @t ^ (x/a)   // this will NOT insert any ref to "x"
      tags v = @t ^ (x/t.a) // this WILL insert a ref to "x"
    end module
Whenever a substitution does not result in any rewriting, a warning is issued by the tool. This is the case with the "u" definition, because there is no reference to "a" in the regular expression of "t", to which the substitution is applied. This is very easy to see when the notation above is read just as an abbreviation for
tags t = t.a, b, c with ... |
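The matching on fully qualified front-end names can be sketched in Python. The sketch models the body of "t" as a flat list of already qualified references, which is a strong simplification of real expressions; the function `substitute` and the constant `T_BODY` are hypothetical.

```python
def substitute(body, new, old):
    """Replace every occurrence of the qualified reference `old` in `body`.

    `body` is a list of fully qualified front-end references. The number of
    replacements is returned as well, so that a caller can issue a warning
    whenever the substitution did not rewrite anything.
    """
    result = [new if ref == old else ref for ref in body]
    replaced = sum(1 for ref in body if ref == old)
    return result, replaced


# Body of "t" after the abbreviated local reference "a" has been
# qualified to "t.a", as in the expanded notation above.
T_BODY = ["t.a", "b", "c"]
```

Applied to `T_BODY`, replacing "a" rewrites nothing and reports zero replacements (the warning case of "u"), whereas replacing "t.a" yields a body starting with "x" (the case of "v").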
Considering this carefully is especially important when there is some re-usage of idents (the necessity of which indeed had been the reason to introduce local scopes !-) In the following example "a" does exist, but what is referred to in the body of "t" is "t.a", not "a":

    module
      tags a = // ..
      tags t = a, b, c
        with local a = // ...
        end locals
      tags u = @t ^ (x/a)
Any insertion only works with a single reference as its argument: The reference must point to a parser definition (of the same kind as the containing definition!) and then its value is inserted. Whenever a substitution yields anything not a reference, we get a typing error, as in ...
tags b = @a ^(x, y / a) |
Of course, it is not likely that anyone writes down such an erroneous form in this directly visible way. But with module imports this happens quite frequently:
import M = m ^( (a|b|c) / x) |
...ignoring that inside of M we have ...
    module m
      tags x = #GENERIC
      tags y = @x
To insertions, replacements are applied twice: first to the unresolved, then to the resolved form:
    tags a = a?, b
    tags b = b?, a
    tags c = a, b

    tags x = @a ^(b / a)
    // -->   @b ^(b / a)
    // -->   b?, a ^(b / a)
    // -->   b?, b

    tags y = @a ^(a / b)
    // -->   a?, b ^(a / b)
    // -->   a?, a

    tags z = @c ^(a / c)
    // -->   @a ^(a / c)
    // -->   a?, b ^(a / c)
    // -->   a?, b
Of course, x and y in this example describe infinite types, and it is not possible to notate instances of them. The effects of this kind of definitions soon become unforeseeable, and the "instantiated" version of the generated documentation may be helpful, as described in Section 5.2.
Last but not least there is the special value #none.
It is special because it can be used only on top of a substitution slash, and it means different things when inserted into different target contexts:
When inserted into a sequence or a permutation, it stands for replacing the reference with the "empty sequence", which always matches but never produces any output. This means, the reference is simply deleted from the sequence.
When inserted into an alternative, it means the "impossible input", which never matches and also never produces any output. So again, the reference is simply forgotten.
These effects are independent from any individual decoration ("?", "*" or "+") the reference has in the target context.
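The structural effect of #none can be sketched in Python. Expressions are modelled as tuples of a kind ("seq", "perm" or "alt") and a list of children, where a child is either a nested tuple or a reference string with an optional trailing decoration. This representation, the function names, and the choice to carry the decoration over to a replacement other than #none are assumptions made for illustration only.

```python
NONE = "#none"
DECORATIONS = ("?", "*", "+")


def split_decoration(ref):
    """Split a reference like "b?" into ("b", "?")."""
    if ref.endswith(DECORATIONS):
        return ref[:-1], ref[-1]
    return ref, ""


def substitute(expr, new, old):
    """Rewrite `old` to `new`; with #none the reference is simply dropped."""
    kind, children = expr
    result = []
    for child in children:
        if isinstance(child, tuple):
            result.append(substitute(child, new, old))
            continue
        base, decoration = split_decoration(child)
        if base != old:
            result.append(child)
        elif new != NONE:
            # Assumption: the decoration is kept on the replacement.
            result.append(new + decoration)
        # else: #none, the reference vanishes together with its decoration
    return (kind, result)
```

In a sequence the dropped reference behaves like the empty sequence, in an alternative like the impossible input; in both cases the resulting expression simply no longer mentions it.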
Beside its own definition format "ddf", as described above, d2d can also use document type definitions in other formats for directing the parsing and model generation of some input text in the d2d syntax.
How this is recognized depends on the implementation.
The current tools (see chapter 6 below) first search all
positions in their search path for ddf definition modules,
e.g. for files ending with ".ddf", ".dd2", etc.
Only if none is found do they search for other document type definition
formats.
Currently
are recognized.
Every internal module definition can be exported back into the genuine d2d text format. This is done by the main tool with "--mode ddf2format", see Section 6.1.
This can be esp. useful for controlling the details of the translation result when an externally defined model is read in. E.g., for the dtd of xhtml (together with the additional, preparatory declarations of namespaces, etc.) the call would be
d2d2.base --mode ddf2format -0 xhtml_1_0 -1 recognizedHtmlModel.ddf --expanded 1 --path // must be set to find the xhtml_1_0.dtd |
Please note that an "expanded" module is an instance of ResolvedModule . It has several differences to a "raw" input module:
Interpreting a dtd as a ddf combines attributes and element contents for each single element type into one single ddf tag parser definition.
For each element definition, its attribute list is (lists are) translated into one single permutation expression. This is pre-pended before the translation of the element contents. The latter is straight-forward, mapping DTD constructors to ddf constructors, and falls into one of two(2) categories:
    <!ELEMENT x (c, (d|e)*, f?)>
    <!ATTLIST x a1 XXX #IMPLIED
                a2 XXX 'default'
                a3 XXX #REQUIRED>
    ===> is read as ===>
    tags x = (a1? & a2? & a3), c, (d|e)*, f?

    <!ELEMENT x (#PCDATA | c | d | e)* >
    <!ATTLIST x a1 (m1|m2|m3) #IMPLIED
                a2 XXX 'default'
                a3 XXX #REQUIRED>
    ===> is read as ===>
    tags x = (a1? & a2? & a3), (#chars | c | d | e)*
    enum a1 = m1|m2|m3
    tags a2,a3 = #chars
Due to the "principle of least surprise", the "#implicit" feature of ddf is never synthesized.
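The translation of an attribute list into a permutation expression, as shown in the example above, can be sketched in Python. The sketch only covers the three default kinds appearing in that example; the functions `attlist_to_permutation` and `element_to_tags` are hypothetical, and the content model is assumed to be already translated into ddf notation.

```python
def attlist_to_permutation(attributes):
    """Build the permutation expression for a list of (name, default) pairs.

    Only "#REQUIRED" attributes stay mandatory; "#IMPLIED" attributes and
    attributes with a default value become optional, as in the example above.
    """
    parts = []
    for name, default in attributes:
        parts.append(name if default == "#REQUIRED" else name + "?")
    return "(" + " & ".join(parts) + ")"


def element_to_tags(name, attributes, content):
    """Prepend the attribute permutation to the translated content model."""
    if not attributes:
        return "tags %s = %s" % (name, content)
    return "tags %s = %s, %s" % (name, attlist_to_permutation(attributes), content)
```

Called with the element "x" and its three attributes from the first example, this yields the same definition line as shown there.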
Every identifier serving as an ATTRIBUTE name in the dtd is translated to the reference to a synthetic node definition in the local scope of one single synthetic pseudo-element, mostly named "ATT". This is primarily to avoid name clashes on the definition level, i.e. in the top-level scope of the constructed ddf module.
For avoiding clashes of the tags, in the later text input, attributes can further be prefixed:
If there is a name clash between the name of an optional attribute and the "first set" of the regular expression of the content model, the attribute's tag will be prefixed by a string like "A-" or "att".
(The particular cases and the chosen prefixes are printed out as warnings, when the dtd is loaded for the first time.)
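The clash rule for attribute tags can be written down as a small Python sketch. It follows the wording above literally (only optional attributes are checked against the first set) and uses "A-" as the prefix; which prefix the tool actually chooses in a given case is not specified here, and the function `attribute_tag` is hypothetical.

```python
def attribute_tag(name, optional, first_set, prefix="A-"):
    """Return the input tag for an attribute of one element type.

    `first_set` is the set of tags which may start the element's content.
    An optional attribute whose name occurs in that set gets prefixed;
    every other attribute keeps its genuine name.
    """
    if optional and name in first_set:
        return prefix + name
    return name
```

For an element whose content may start with a child "cite", an optional attribute "cite" is thus addressed as "A-cite", while the same attribute on an element without such a child keeps the tag "cite".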
((
E.g. in case of xhtml1.0, the elements "ins",
"del" and "q" have a child element
with tag "cite", and an attribute with the same name.
Consequently, this attribute in these elements is
only addressable as "A-cite".
Elements without this clash, here only "blockquote", keep the
reference to the attribute by its genuine name.
Contrarily, in the same example there is an attribute
"script" which is nearly ubiquitous, and an element
"script", only appearing in the contents of "header".
But since this element is one of the
few not allowing this attribute, there is no clash, and
attribute and element are addressable with their genuine name, which is
the same.
))
Please note that this kind of clash resolution does not follow the principles of compatible evolution, as they are fundamental for text type definitions in the genuine d2d 2.0 format, and discussed in Section 2.3.6. E.g. adding an element "y" which goes into the first-set of an element's content model will re-define an attribute's tag from "y" to "A-y". Later insertion of a further, non-optional child "x" in front of "y" will re-name the attribute back. But DTDs are assumed to be fixed. (No one wants to maintain them !-)
In the presence of namespaces there is a second level of disambiguation:
First, the namespaces must be declared by "<?tdom ..>" processing
instructions.
The prefix which is declared there is used for translating:
It is mangled into the names of elements and possibly attributes, for disambiguation.
E.g. in the ddf model of xhtml there is the ubiquitous attribute
"xml-lang".
In extreme cases this may lead to further mangling steps for further disambiguation, involving numbering, and must possibly be combined with the above-mentioned prefixing of attributes! See the following (non-real!) example:
    <!ELEMENT xml_img (#PCDATA)* >
    <!ELEMENT x (#PCDATA | xml:lang | xml:img)*>
    <!ATTLIST x xml:lang #IMPLIED>
    ===> is read as ===>
    tags xml_img = #chars*
    tags x = A_xml_lang, (#chars | xml_lang | xml_0_img)*
Every single occurrence of such mangling and renaming is reported to the user by a warning.
Please note that currently only dtds with certain restrictions on the lexical structure of tags can be imported. A tag which does not fulfill the production ident would require further mangling, which is currently not supported.
As described in the umod documentation, there is a canonical definition for an XML serialization of models.
The implicitly induced document type definition can be used for direct denotation of umod data models.
A d2d source file can be declared to be a source for an xslt program, i.e. declared to contain xslt templates which generate fragments of a certain document type.
Currently we support XSLT 1.0 [xslt1_0].
The required declaration is as follows:
    #d2d 2.0 xslt text producing <MODULE> : <ELEMENT>
    //       ^^^^      ^^^^^^^^^
    //       these keywords are different

    #d2d 2.0 xslt text producing #tdom <CLASSNAME> : <ELEMENT>
    //       ^^^^      ^^^^^^^^^^^^^^^
    //       these keywords are different

The first variant finds the definition module with the normal module-finding strategy, as described above. The second variant uses the DTD underlying the compiled (tdom) model identified by its fully qualified class name.
In xslt mode, the selected element is not the root of the generated document. This is instead xslt:stylesheet, as defined in our (slightly simplified) version of xslt, see the d2d source and the generated graphic documentation of the d2d xslt model.
Consequently, the top-level constructs appearing in this file can be #import, #output, #preserve-space, #template, etc., as usual for an xslt source text.
The element declaration indicated in the text type declaration, however, defines the target language of the xslt rules: All elements which are reachable from the indicated root element are recognized in all those contexts of the following text in which a target element may appear, as defined by xslt. Technically, they are collected into one big alternative which is assigned to the generic definition "RESULT_ELEMENTS" in our xslt model.
These target elements can of course contain a hierarchy of further elements of the target language, as long as they conform to their contents definition. But they also can contain, vice versa, again certain xslt elements, namely those which produce content (e.g. "valueof", "if", "call").
These are identified by the d2d parsing algorithm by looking at the definition of "INSTRUCTIONS" . These, in turn, can again contain target language elements (either directly like "if" or indirectly like "choose/when"), etc., ad lib.
So we get a "sandwich" of alternating hierarchies of xslt elements and target language elements. Here an example, symbolically depicted:
xslt:stylesheet | | variable template | if........................ | html:p | | | br a ........................| choose | when...................... | image ........................| call-template |
The combinability is defined by classifying the xslt elements into ...
xslt:stylesheet | | (T) | variable | (T) template (B).......................... (target language elements a first group of nesting, top slices of hierarchy ) | (<---x1) (P/C)........................| | (content-producing xslt elements) | | (B).|........................ | | (<---x2) | (target language elements etc. nesting continued ) |
The implementation is realized by two(2) state machines, operating
independently as co-routines.
The d2d inference mechanism works for both parts independently.
Seen from the user, both parsers and both sets of tags are unified
in a transparent way.
The point "x1" is the crucial point where new name clashes
can occur, because there all tags of the target and many xslt
tags are permitted.
The clash comes from the production TEMPLATE.
This nonterminal does not model the definition of an xslt template
(that is done by template!),
but stands for "all which occurs as the contents
of a template in the widest sense", e.g. including values for constants
and template arguments, for if-branches, etc.
Of course, only here the back-end elements
(by @RESULT_ELEMENTS) and certain xslt elements (by @INSTRUCTIONS)
are combined and can clash.
First of all, no new clashes may occur with ATTRIBUTE-like definitions: Only elements, not attributes, of the target language are involved in this recursive embedding: Neither may an attribute contain an xslt instruction, nor may it appear on the top level of the contents of an xslt template. This is the only situation where the chosen "xml representation kind" does affect the d2d parsing process. An attribute named "choose" or "if" will therefore never clash with the xslt element with the same name. (See more details at Section 4.3.)
But additional clashes can occur between tags from different scopes of the target language. DTD defined languages do not have different scopes, but in ddf defined target languages there are tags with module scope and tags with per-element scope. This is explained in detail in Section 4.2.
But most significant and most likely is a clash between top-level elements of
the target language and of xslt content producing "instructions".
This is solved as follows:
All xslt elements are additionally addressable by
tags which are prefixed by "X" or by "x-".
In case (a) that an xslt tag also appears as a target language tag, and both
tags are applicable in a certain context, the
target language tag has priority
and thus the prefixed version must be used for the xslt tag.
In case (b) that a prefixed version itself is used by the target language elements,
the prefix is replicated as often as necessary.
E.g. if the target language contains a tag like "x-if", then the corresponding
prefixed version of the xslt tag will be "x-x-if".
The tool will issue a corresponding warning in both cases.
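The prefix rule described above can be illustrated by a small sketch. It is not the tool's implementation: the function names are hypothetical, only the "x-" prefix is modelled, and the context dependency ("both tags are applicable in a certain context") is ignored by treating the target language as one flat set of tags.

```python
from typing import Set


def prefixed_alias(xslt_tag: str, target_tags: Set[str], prefix: str = "x-") -> str:
    """Return the prefixed tag for an xslt element (hypothetical helper).

    The prefix is replicated as often as necessary until the result
    no longer collides with a tag of the target language (case b).
    """
    alias = prefix + xslt_tag
    while alias in target_tags:
        alias = prefix + alias
    return alias


def usable_tag(xslt_tag: str, target_tags: Set[str], prefix: str = "x-") -> str:
    """Return the shortest tag that still addresses the xslt element.

    The target language tag has priority (case a), so the plain name
    is usable only if the target language does not define it.
    """
    if xslt_tag not in target_tags:
        return xslt_tag
    return prefixed_alias(xslt_tag, target_tags, prefix)


if __name__ == "__main__":
    xhtml_like = {"var", "param", "p", "br"}   # tiny stand-in, not the real xhtml tag set
    print(usable_tag("var", xhtml_like))       # clash: prefixed version is needed
    print(usable_tag("if", xhtml_like))        # no clash: plain tag stays usable
    print(prefixed_alias("if", {"x-if"}))      # prefix replicated once more
```

In this simplified model a clash is decided per tag name only; the real parser decides it per parsing context.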
((
In case of xhtml as target language, there are element definitions
"var" and "param".
The first clashes with xslt "var", so the xslt element must be addressed
by "x-var" or by "Xvar" or similar.
In contrast, there is no clash between the two roles of "param":
The xslt version can only appear in the prefix part of template,
and at this position no target language elements are allowed.
Strictly speaking there is nevertheless a clash, but the resulting non-determinism
is easily resolved by the algorithm being greedy:
module xslt
  // ...
  public tags template = (match |name), (mode? & prior? & X:space? & param*),
                         @TEMPLATE

--- when instantiated with xhtml_1_0 as a target language, will read as -->

  public tags template = (match |name), (mode? & prior? & X:space? & param*),
                         ( (if|call|apply|valueof|..)   // xslt INSTRUCTIONS
                         | (html|head|p|..|param|..     // xhtml RESULT_ELEMENTS
                         //                ^^^^ this is not LL(1) !!!
                           )
                         )*
As a consequence we get the following interpretations:
#template #name x   #param p1 #param/   #param p2 #param/   #element param
//                  xslt                xslt                xhtml
// one trick to start template part with a xhtml param

#template #name x   #param p1 #param/   #param p2 #param/   #message!!   #param
//                  xslt                xslt                             xhtml
// a different way to start template part with a xhtml param
The name spaces related to the output model, i.e. the xml corpus which will be created by the xslt source text, are imported automatically and applied to the generated output.
They are copied to the output with their original prefix definitions, i.e. the prefixes can be used in embedded xpath expressions and template match patterns to refer to the namespaces.
But further name spaces, esp. those of all input documents, must be declared explicitly. This is done as follows:
#d2d 2.0 xslt text producing <MODULE> : <ELEMENT>
  from a = http://bandm.eu/doctypes/options
       b = http://www.w3.org/1998/Math/MathML
         = #empty
This assigns the prefixes a and b to the namespace uris, and the empty prefix to the empty uri. These prefixes can now be used e.g. in embedded xpath expressions, with the declared meaning, since these declarations will be copied to the generated output.
Please note that there is no (real) typechecking between
the upper and the lower hierarchy of target language elements in the
picture above!
At all junction points "B", all target language
tags may appear.
The set of allowed tags at point (x2) in the picture above
does not depend on the situation at point (x1).
(E.g. often at point x2 some arguments to a function ("call template" or
"apply templates" in xslt terminology) are constructed, which will be
part of the result (inserted at place x1) only after further wrapping,
or which will even be totally discarded. This shows that there is
no trivial relation between x1 and x2.)
As a first consequence of this un-relatedness,
the user must be aware that xslt code can be written which will
produce incorrect results w.r.t. the target language document type.
This has mathematical reasons: The type-checking problem is only solvable
for a restricted subset of xslt [marnev05].
While this problem is a general one of xslt, the second issue is related to the central inference mechanism inherent to d2d (which intends to simplify denotation and increase readability): the d2d parsing process must use some heuristics to find out how to continue after the embedded content-producing xslt element, e.g. when returning to the stack-level of (x1).
See the following example of contents definition and xslt code:
tags a = b,c,d,e,f+,g
tags b,c,d,e,f,g = #empty
-----
xslt-source 1:
  #template #match link
    #a #b #call myNamedTemplate#/call #f #f #g
  #/template
xslt-source 2:
  #template #match link
    #a #b #call myNamedTemplate#/call #c #d #e #f #g
  #/template
xslt-source 3:
  #template #match link
    #a #b #call myNamedTemplate#/call #/a
  #/template
All three xslt fragments are (possibly!) correct: In fragment 1 the call
to myNamedTemplate must deliver the sequence of three elements as
its result, namely a c-, a d-, and an e-element.
It may produce some final f-elements.
In fragment 2 the call must deliver "nothing", the empty sequence (e.g.
just executing a debug message output).
In fragment 3 the call must deliver the whole required rest of
a's content definition.
But in every case the parser will not know how the called template will
behave. So it must be assumed that every content producing
xslt command (which is embedded into
the target language's structure definition and is expanded later, when
applying the transformation as a whole)
may cover any valid continuation sequence w.r.t. the
currently parsed nonterminal of the target language
and its contents model.
Those components of the current contents
model which are left out thus define the minimum coverage
of the expansion of the xslt construct.
A very important property in this context is that every such xslt function can only expand to a true sub-expression of the current content model, but never beyond! That is because it always delivers well-formed (sub-)trees, not arbitrary sequences of tokens. Therefore the d2d xslt parsing mechanism need never look farther than the end of the current content model, and that is always known.
The implementation currently issues this minimum coverage by the following hints:
case 1: xslt expression is assumed to cover at least (c, d, e)
case 3: xslt expression is assumed to cover at least (c, d, e, (f)+, g)
So there is indeed a kind of "minimal type checking" done automatically. At least, the user is clearly informed about what the code has to deliver for correct overall output.
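The idea of the minimum coverage can be sketched for the simplest situation only. The following is an illustration under strong assumptions, not the algorithm of the tool: the content model is a flat sequence, every item is mandatory, and every tag name occurs exactly once. The function name and its parameters are hypothetical.

```python
from typing import List, Optional


def minimum_cover(model: List[str], before: str, after: Optional[str]) -> List[str]:
    """Items of a flat sequence model that an xslt expansion must deliver.

    model  -- e.g. ["b", "c", "d", "e", "f+", "g"] for "a = b,c,d,e,f+,g"
    before -- the last target tag written before the xslt instruction
    after  -- the first target tag written after it, or None if the
              enclosing element is closed right after the instruction
    """
    names = [item.rstrip("+*?") for item in model]   # strip repetition operators
    start = names.index(before) + 1                  # first item not yet written
    end = len(model) if after is None else names.index(after, start)
    return model[start:end]                          # everything that was left out


if __name__ == "__main__":
    a_model = ["b", "c", "d", "e", "f+", "g"]
    print(minimum_cover(a_model, "b", "f"))    # xslt-source 1
    print(minimum_cover(a_model, "b", "c"))    # xslt-source 2
    print(minimum_cover(a_model, "b", None))   # xslt-source 3
```

For the three fragments above this reproduces the reported hints: (c, d, e) for case 1, nothing for case 2, and the whole rest of a's content model for case 3.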
Obviously, this "wild card character" of the xslt expansion destroys the strict LL(1) discipline:
The same kind of example as above, but more complicated:
tags a = (b,c,d)*, (x,c,d)
tags b,c,d,x = #empty
-----
xslt-source 4:
  #template #match link
    #a #b #call myNamedTemplate#/call #d #/a
  #/template
  --> minimal cover c, d, (b, c, d)*, x, c
xslt-source 5:
  #template #match link
    #a #b #call myNamedTemplate#/call #d #x #c #d #/a
  #/template
  --> minimal cover c, (b, c, d)*, #/template
The difference between both cases cannot be recognized by the normal LL(1) parsing of d2d. After the called template, the parser does not know whether to continue with the "first" or the "second" reference to d.
Currently, we always decide for the first. The declarative operator covers is intended to list a sequence of tags. IT IS CURRENTLY NOT YET SUPPORTED! Its meaning shall be to indicate that the corresponding elements are always contained in the content produced by the xslt expansion (as guaranteed, or at least as intended, by the user). Its effect is to shift the parsing process past the first appearance of this tag. Since the content model (apart from the wildcard of the xslt expansion) is always LL(1), there is always such a tag which can be used for this disambiguation (????)
Both examples from above (and a third, new one) are correctly written as ...
tags a = (b,c,d)*, (x,c,d)
tags b,c,d,x = #empty
-----
#template #match link
  #a #b #call myNamedTemplate#/call #cover x #d #/a
#/template
#template #match link
  #a #b #call myNamedTemplate#/call #d #x #c #d #/a
#/template
#template #match link
  #a #b #call myNamedTemplate#/call #d #b #c #d #x #c #d #/a
#/template
A third consequence of the above-mentioned un-relatedness of the grammars ruling the parsing points x1 and x2 arises as soon as a certain d2d name used for an element definition on module level conflicts with a local element definition with the same name, see Section 3.2. Each tag which appears "freely floating" when re-entering the world of target tags at point x2 (and also in the top-level contents of a template, a variable content definition, etc.), will thus be interpreted as a reference to the module level definition.
The local definition will only be recognized when the tag appears not on top-level, but in the content model of a target language element, which serves as a dis-ambiguation.
In any case, xslt itself always allows elements to be constructed explicitly, using xslt:element.
The following table summarizes all supported tags and assigns them
to the interface categories.
Please note that our version simplifies the wording of the tags,
and replaces some attributes (with a "yes/no" kind of value) with an
empty element (inserted for "yes" and left out for "no").
You may refer additionally to the source of the definition module for our version of xslt and the generated documentation.
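Column "top-level": T = directly under "stylesheet".
Column "producing": character (and struct.) producing (under template); (c) = under template, but nothing (directly) producing.
Column "contained": producing structured content; B = back-end = target lang. elems. directly contained; c = instructions contained, but only producing plain char data.

tag                        | top-level | producing | contained |
stylesheet                 |           |           |           |
transform                  |           |           |           |
import                     | T         |           |           |
include                    | T         |           |           |
strip-space                | T         |           |           |
preserve-space             | T         |           |           |
output                     | T         |           |           |
param                      | (T)       | (c)       | B         |
key                        | T         |           |           |
decimal-format             | T         |           |           |
namespace-alias            | T         |           |           |
template                   | T         |           | B         |
value-of                   |           | C         |           |
copy-of                    |           | C         |           |
number                     |           | C         |           |
apply ("-templates")       |           | C         |           |
apply-imports              |           | C         |           |
foreach ("for-each")       |           | C         | B         |
sort                       |           |           |           |
if                         |           | C         | B         |
choose                     |           | C         |           |
when                       |           |           | B         |
other ("otherwise")        |           |           | B         |
attribute-set              | T         |           |           |
call (="call-template")    |           | C         |           |
arg (="with-param")        |           |           | B         |
variable                   | (T)       | (c)       | B         |
text                       |           | C         |           |
processing-instruction     |           | P         | c         |
element                    |           | P         | B         |
attribute                  |           | P         | c         |
comment                    |           | P         | c         |
copy                       |           | P         | B         |
message                    |           | (c)       | c         |
fallback                   |           |           | B         |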
For us, it turned out to be quite comfortable to write xslt programs using the d2d input front-end. Nevertheless, there are certain severe drawbacks and caveats. Some of them (e.g., w.r.t. the problem of shadowing of tags and attribute names) are already mentioned above. Some other caveats are described briefly in this section.
You still write XML "verbatim"
That means that all idiosyncrasies of XML still apply, and partly affect
the parsing process.
Esp., XML "attributes" and "elements" still behave differently. This is not visible in d2d's standard front-end representation. Indeed, it was one of the design goals of d2d to eliminate all the complicated "junctims" coming with the dichotomy between "xml elements vs. attributes". But here this unification does have draw-backs, and you should be careful! For example:
#template #match link #a #href #choose #when XXXX |
The #choose will close the contents of href immediately,
because this is (in "XML and XSL-T reality") still "only" an attribute,
with "only" character content, not an "element" with "element content".
(The same mechanism, by the way, closes the match attribute
as soon as the #a tag is parsed, which appears to be a quite sensible behaviour!)
What you mean when calculating the character contents of an attribute is perhaps an "attribute value template", which is written as an xpath expression in curly brackets:
#template #match link #a #href {concat($myVar,'-',text())} |
The differences between the syntactic roles of attributes and elements in xslt are also the reason why attribute names never clash with xslt element tags: attributes simply cannot appear "free floating" on the top level of an xslt template at all:
#template #name encodeLinkTarget #href {concat($myVar,'-',text())} |
...is simply syntactically impossible in the XML world, because for the "attribute" href there would be no hosting "open tag"! The verbatim translation to the genuine xslt xml representation yields something which is not valid xslt:
<xsl:template name="encodeLinkTarget">
  href="{concat($myVar,'-',text())}"
</xsl:template>
The syntactically correct way of generating an attribute node for the current context is of course creating the node explicitly:
#template #name encodeLinkTarget #attribute #name!href! #valueof concat($myVar,'-',text()) |
d2d parsing uses different kinds of brackets and parentheses, independently of the contents structure. So when writing the following, the braces will be consumed by the d2d parsing algorithm and you will not write an "attribute value template":
#template #match link
  #a #href{concat($myVar,'-',text())}
--> results in
<a href="concat($myVar,'-',text())" ...
Instead, if you want explicit parentheses for the contents of href, you could choose other parentheses, like
#template #match link
  #a #href!{concat($myVar,'-',text())}!
-- or --
  #a #href<{concat($myVar,'-',text())}>
-- or --
  #a #href {concat($myVar,'-',text())}#/href
Contrarily, in the xslt context no target language element is treated as empty! Even if neither attributes nor element contents are allowed, there could still be xslt elements which do not produce any output (like debug message output) included in the element as its contents.
But most elements have at least some ubiquitous attribute like "xml:lang" or "id". These attributes could possibly also be created by xslt code. This leads to a sometimes surprising behaviour of the parsing:
#p #br #call mytemplate
-- yields -->
<p>
  <br>
    <call name="mytemplate"/>
  </br>
</p>
In many cases the user will have meant something different, namely
#p #br/ #call mytemplate
-- yields -->
<p>
  <br/>
  <call name="mytemplate"/>
</p>
d2d parsing does not respect ("know of") the XPath syntax, which is wrapped into xslt constructs. As long as the command character is #, you cannot write ...
#template #match link
  #a #href {concat($myVar,"#",text())}
-- results in -->
<a href='{concat($myVar,"'
#",text())}
^ PARSING ERROR, tag expected
Last but not least, some xslt constructs behave somewhat
unexpectedly w.r.t. contents and inferred closing.
For "var" (which should better be called "const", and when
applied to the xhtml target must be written "x-var"), "param"
and "arg" we provide two versions for writing
the value: either an
xpath expression, introduced by "xp", or a "template", which may contain
"nearly everything".
In case of var and param this includes the construct itself, recursively!
In case of constants (called "var") this does even make sense, in some
rare cases:
// NOT intended nesting:
#param p1 #text this text goes as a value into the param p1
#param p2 #text // <-- and this param TOO :
---- yields surprisingly --->
<param name="p1">
  <text>this text goes as a value into the param p1</text>
  <param name="p2">
    <text/>
  </param>
</param>

// sensible and intended nesting:
#var v1
  #var v2 #text complicated constant for replication #/var
  #valueof $v2#valueof $v2#valueof $v2#valueof $v2#valueof $v2
#/var
So better write explicit close tags for all these elements which otherwise would swallow anything:
#param p1 #text this text goes as a value into the param p1 #/param
#param p2 #text // etc.
#/param
The ddf modules, i.e. the document type definition files in the d2d architecture, support an integrated documentation and transformation system.
A collection of documentation texts and processing instructions can be adjoined to different items in the source text. This collection is indexed by a user defined (processing) key, which is used in subsequent processing. This is achieved by statements in the source text which follow the grammar
definition ::= ... documentation |
documentation ::= docu ident localrefs = docutext |
localrefs ::= localref , localref |
localref ::= ident . ident |
docutext ::= stringconst #d2d char #/d2d |
A localref is like a reference, but excluding references to
imported definitions. Instead it may refer to an import statement as such,
and even to single values/items of an enumeration definition.
When a list of localrefs appears in a documentation, then
the following docutext is assigned to the source text
components with these names.
(With enumeration items, the texts are stored with the numeric value,
i.e. assigning different texts to Jan and to Januar in the module
basic.calendaric_de will concatenate both texts for both values.)
In the rule for documentation, if localrefs is left out, the text refers to the whole module as such.
In the rule for documentation, the ident always gives the processing key.
By convention (which the current tool implementation also adheres to!) there are currently two kinds of processing keys:
The docutext can be given in double quotes like "...", not including line breaks, or limited by the tags #d2d ... #/d2d, with line breaks. In either case, d2d tagging can be applied freely in the text, as defined by the expression (@p | p*), taking p from basic.structure.
(ATTENTION, the current implementation requires "def before use" for documentation assignment---in contrast to all other references.)
The current implementation of the main tool (see Section 6.1, using the command line parameter "--mode ddf2doc") allows to generate an xhtml documentation page for each particular d2d text format definition module. There are two kinds of documentation: uninstantiated vs. instantiated, also called "static" vs. "dynamic".
The static documentation represents the source text of a module---the dynamic represents its usable contents: when one of its definitions is made the top-level element of a document, and all imports, insertions, and substitutions have been resolved.
Documentations of both versions are generated as XHTML texts.
They contain, among others, a list of all
tags per module, a list of homonyms,
the information which definition refers to which other definitions,
and, for each relevant definition,
a depiction of the syntax and the above-mentioned documentation text
from the definition module, introduced by "docu user_<L>",
with <L> being the particular selected human language.
The language to select the documentation fragments and the language of
the fixed text components is the same, as selected by a command line option.
The documentation texts are parsed and translated to XHTML according to the definition d2d-meta : docutext. Therefore they are "rich text", like any non-tailored "basic.deliverables:webpage" content, using physical mark-up, links, tables, lists, external images, etc. ad lib.
The text fragments contained in the definition source related to the same definition (or module), with the same key, will all be concatenated and preceded by an implicit "#p " source text. Therefore most such doc texts can simply begin with readable text, not caring about formats. If they become longer, they contain more paragraphs by simply using the #p-tag. These details can easily be checked when comparing the source texts and the generated definitions of the d2d gp basic module, as described below in chapter 7.
In case of "static" documentation, simply all definitions and sub-modules of the module source text are explained.
In case of an instantiated documentation, descending from the top-level public definitions of the module, all references are checked and instantiated, the same way as when the module is employed for text parsing. All those (and only those) definitions reachable from the top-level definitions are documented, totally independent from the static module which contains their source text.
Thus the dynamic documentation shows the really effective content definitions, after resolving all the (possibly rather complicated) substitutions and insertions (see Section 3.5 and Section 3.6). It shows the contents models which really rule the sequence of input tags when you have to denotate correct text input.
Please note that in this instantiated form, the origins of definitions (for instance all resolved insertions) are not visible anymore. So when designing own instantiations and variants of existing library modules, a look at the un-instantiated documentation and source text will be necessary.
As an example may serve the documentation of the d2d model of xslt, and the documentation pages of the "general purpose" document architecture "d2d_gp", a link to which is found at the beginning of chapter 7.
In practice, centrally defined xslt scripts for further transformation of a certain d2d model turned out to be hard to maintain. Therefore d2d offers a way to denotate transformations directly with the parser definition.
The syntax is the same as for documentation. The text attached to each content model must be a fragment of an xslt text for a certain backend <b>, and the key must have the wording "to_<b>". An example from "basic.physical":
tags hr = #empty
  docu to_xhtml_1_0 hr = "#hr"
  docu to_latex hr = "\\\hrule{}"
tags emph = @EMPH_CONTENT
  docu to_xhtml_1_0 emph = "#i#apply"
  docu to_latex emph = "\emph{#apply}"
The current tool allows to extract and concatenate all xslt rules with a certain key. If the indicated target language is an XML model itself, then this will be used for the construction of the templates' contents.
A command line like "<D2D_TOOL> --mode ddf2xslt --key xhtml_1_0 --sourcefile basic.physical --outputfile x.xslt.d2d" will leave the file x.xslt.d2d with something like
#d2d 2.0 xslt text using xhtml : html
#stylesheet #version 1.0
#template match hr
  #hr
#toplevel // INSERTED for terminating pending "#var" contents, etc.
#template match emph
  #i#apply
#eof
This file, in turn, can be converted to an xslt source in the conventional
format as described above in chapter 4, e.g. by a command line like
"<D2D_TOOL> --mode text2xml --sourcefile x.xslt.d2d --outputfile x.xslt"
Then you have a classical xslt source file, which can be applied by
any xslt processor to any input, e.g. by a command line like
"eu.bandm.tools.utils3.CallXslt --in xmlsource.xml --xsl x.xslt
--out result.html"
or, using the meta_tools make system,
" $(call xml2xml, xmlsource.xml, x.xslt, result.html, $(PARAMETERS))"
All these steps can be done internally, in one single step, by calling ...
"eu.bandm.tools.d2d2.base.Main --mode text2target
--sourcefile mysource.d2d --key xhtml_1_0 --outputfile result.html
$(FURTHER_PARAMS)"
or, again using the make macros, ...
"$(call d2d2target, mysource.d2d, xhtml_1_0, result.html, $(FURTHER_PARAMS))"
This is most convenient for batch converting of text input, esp. because the caching of definition modules and xslt rules spares a lot of processing time. Nevertheless, when developing new transformation systems, it may be helpful for the debugging to perform these steps separately.
Currently there are different executable Java classes for different purposes: The "Main" tool allows to execute all functions described in this text from the command line---the "Batch" tool allows to transform multiple source texts into multiple output formats each.
The main tool eu.bandm.tools.d2d2.base.Main allows to execute all functions described in this text from the command line.
For this purpose, it relies on the class "Tasks", which defines static methods for these functions and may be used for programmatic execution---see its api doc.
When called from the command line, the parameter behind "--mode" identifies the function the user wants to perform. Some of the further options are re-used with slightly different meanings for the different functions. The syntax of the command line call is defined by these option definitions:
( definitions from file ../../src/eu/bandm/tools/d2d2/base/d2dOptions.xml )
-v | --version | |
show application and version information | ||
-m | --mode | ( text2xml | text2texts | ddf2dtd | ddf2format | ddf2doc | ddf2xslt | ddf2tsoap | dtd2ddf ) |
for what kind of task this application is called | ||
-p | --path | (string)* |
where to look for type definition modules | ||
not modes.dtd2ddf==mode0 | ||
-0 | --source | uri |
path of source file or name of module to process, depends on 'mode' | ||
-k | --key | string |
the target language for which documentation shall be generated; or the pair 'module:element' into which xslt code shall be extracted | ||
(modes.ddf2doc==mode0 or modes.ddf2xslt==mode0 or modes.ddf2dtd==mode0) | ||
--static | ||
Whether the static = un-instantiated structure of the given source text module shall be documented, not its instantiation. | ||
modes.ddf2doc==mode0 | ||
--additionalSources | (uri)* | |
additional source files, currently only used for documentation and transformation definitions. | ||
(modes.ddf2doc==mode0 or modes.ddf2xslt==mode0) | ||
--outputfile | uri | |
output file | ||
-d | --debug | int(=0) |
debug level, 0=silent 100=maximal verbose | ||
--interactive | int(=0) | |
which info to print in case of error: (=1) stack situation (=2) generated output so far | ||
(modes.text2xml==mode0 or modes.ddf2xslt==mode0) | ||
--partialdocs | ||
whether partially correct but incomplete documents may be produced. | ||
(modes.text2xml==mode0 or modes.ddf2xslt==mode0) | ||
--stylesheetParams | (string string)* | |
explicit pairs key:value of parameters for an xslt style sheet processing. | ||
modes.ddf2doc==mode0 | ||
--stylesheetParamFiles | (uri)* | |
list of files containing parameters for xslt processing. | ||
modes.ddf2doc==mode0 | ||
--lineWidth | int(=70) | |
width of a line---for most print out procedures. | ||
not (modes.text2xml==mode0 or modes.ddf2tsoap==mode0) | ||
Enumeration modes:
Select very different operation modes of this tool
text2xml | Translate one (1) d2d text input file into one (1) standard xml file. |
text2texts | Translate one d2d by xslt. DEPRECATED, use Batch tool! |
ddf2dtd | Generate a standard dtd from a d2d module definition. |
ddf2format | Export a pretty-printed source text of the d2d definition module |
ddf2doc | Generate xhtml documentation for a d2d module (select reader's language.) |
ddf2xslt | Write out the xslt rules (in d2d format) for particular d2d module and target format |
ddf2tsoap | Write down the tsoap serialization of the d2d definition module |
dtd2ddf | Import a dtd and convert it into a d2d definition module. |
The diverse functions selected by the argument of "--mode" are ...
The -p/--path option allows to specify the places where module source texts will be searched. (These are realized by SourceHabitat.) Every string in the argument declares such a place, and they are searched from left to right for the first match for a given file name (which normally is created by appending a suffix to a top-level module name, see Section 2.1.1 above.)
Currently, three formats are supported:
Xslt style sheet parameters can be given by the options stylesheetParams and stylesheetParamFiles. These parameters can be declared and evaluated by the programmer of some xslt code, and control its behavior, see "top-level parameter" in [xslt1_0, sect. 11.4].
Every style sheet parameter definition is
a pair of a NamespaceName as key, and a constant String value.
First the files listed after stylesheetParamFiles are loaded, in the given
sequential order. Each text line (separated by line feed) contains one
such key/value pair.
Then the explicit definitions from the command line arguments after
stylesheetParams are read. Every definition may override a previous one
with the same key, without further notice.
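The loading order described above can be sketched as follows. This is only an illustration: the function name is hypothetical, and since the text does not specify how key and value are separated within a line of a parameter file, the sketch simply assumes "first run of whitespace" and lets the caller replace that rule.

```python
from typing import Callable, Dict, Iterable, List, Tuple


def split_at_whitespace(line: str) -> Tuple[str, str]:
    """ASSUMED line format: key, whitespace, value. Not taken from the tool."""
    key, value = line.split(None, 1)
    return key, value


def merge_stylesheet_params(
    file_texts: List[str],
    cli_pairs: Iterable[Tuple[str, str]],
    split_line: Callable[[str], Tuple[str, str]] = split_at_whitespace,
) -> Dict[str, str]:
    """Merge parameter definitions in the order described in the text."""
    params: Dict[str, str] = {}
    for text in file_texts:                 # files first, in the given order
        for line in text.split("\n"):       # one key/value pair per line
            if line.strip():                # skip empty lines
                key, value = split_line(line)
                params[key] = value         # later definition overrides silently
    for key, value in cli_pairs:            # then the explicit command line pairs
        params[key] = value
    return params


if __name__ == "__main__":
    files = ["lang en\nwidth 70", "width 80"]
    print(merge_stylesheet_params(files, [("lang", "de")]))
```

A plain dictionary is sufficient here because overriding happens "without further notice"; no record of earlier definitions is kept.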
The "Batch" tool allows to transform multiple source texts into multiple output formats each. It uses bandm's own typed and tdom-based xslt implementation.
The command line options are
( definitions from file ../../src/eu/bandm/tools/d2d2/base/batchOptions.xml )
-d | --debug | int(=0) |
debug level, 0=silent=only errors and failures; 1=few major loggings; 2=all loggings and few warnings; 3=all warnings and some hints; 4=full info=additional hints plus extended context 10=some synthesized source texts; 20=full debugging, stack traces, tracing, and synthesized texts. | ||
-v | --version | |
show application and version information | ||
--fileBase | uri(=) | |
directory position as common prefix for all input and output files | ||
--inputPattern | string(=%) | |
file name pattern to analyse the input files. | ||
--interactive | int(=0) | |
which info to print in case of error: (=1) stack situation (=2) generated output so far | ||
--lineWidth | int(=70) | |
width of a text line for diverse print out procedures. | ||
--noTxsl | ||
do not use bandm txsl, but jre built-in xslt processor. (currently NOT SUPPORTED) | ||
--path | (string )+(= "RES_eu.bandm.tools.doctypes.DocTypes/d2d_gp" ) | |
where to look for type definition modules | ||
--partialdocs | ||
whether partially correct but incomplete documents may be produced. | ||
--pedantic | bool(=true) | |
whether to follow specification even when it appears hardly sensible | ||
--sources | (uri)+ | |
paths of source files to process, maybe relative to common file base | ||
--stylesheetParams | (string string)* | |
parameters for an xslt style sheet processing. | ||
--stylesheetParamFiles | (uri)* | |
list of files containing parameters for xslt processing. | ||
--strictCheck | bool(=true) | |
whether to type check unreachable code | ||
--transformations | (string string string string)* | |
tuples of target name, tdom class and top element name (both possibly empty = '-') and output file pattern. | ||
--tpathFunctions | string | |
name of a subclass of eu.bandm.tools.tpath.runtime.FunctionLibrary, defining tpath functions. Defaults to tpath+xslt pre-defined functions. | ||
--txslTraceFlags | ( silent | justWarnings | extendedInfo | showSource | templateDirs | globalValues | resourceFinding | templateResolution | templateCall | tpathEval | tpathTypes | tpathFunctions | varAssignment | testDecisions | exceptionStackTraces | elementReducing | verbatimText )* | |
sequence of flags which enable different debug tracings of the txsl processor separately | ||
--writeXsltFile | ||
Write out intermediate files to disk (for debugging purpose) when collecting xslt code. | ||
--xmlResult | string | |
file name pattern where to store the xml version of the d2d inputs | ||
Enumeration traceflag:
Switch on/off different trace outputs individually.
silent | Suppress all log messages, even warnings. Print only errors. |
justWarnings | Suppress all log messages. Print errors and warnings. (Overrides "silent") |
extendedInfo | Print additional information with most messages. |
showSource | Display source text as soon as parsed, and as additional information. |
templateDirs | Dump the template directories as soon they are filled. |
globalValues | Trace calculation of all global values (top-level parameters and variables). |
resourceFinding | Trace the resolution of URLs/URIs/file names and the access to resource objects. |
templateResolution | Trace finding of templates. Includes xsl:for-each application. |
templateCall | Trace parameter set up and code execution start for template calls and applies. |
tpathEval | Trace evaluation of all tpath expressions. |
tpathTypes | Print the results of type checking tpath expressions. |
tpathFunctions | Trace calls, parameters, and results of functions called from tpath expressions. |
varAssignment | Trace assignments to variables. |
testDecisions | Trace the deciding expressions in xsl:if and xsl:choose expressions. |
exceptionStackTraces | Print the java stack trace in case of exceptions. |
elementReducing | Trace the reduce step of a collected result element, i.e. the call of the tdom semi-parser. |
verbatimText | Dump all verbatim-inserted XML fragments. |
The parameter filebase is merely for convenience and
allows input and output files to be specified relative to it.
(The parameter noTxsl is currently not supported: All
transformations use the bandm txsl processor and generate
well-typed results w.r.t. a tdom model.)
The parameter path is used to locate the files containing
definition modules, as described with the Main tool.
(See also there for other parameters
like stylesheetParams, stylesheetParamFiles,
debug, etc.)
The parameter transformations controls the operation. It is a list of quadruples: target key, target tdom package name, target top-level element, output filename pattern. The second and third components may be empty, meaning a transformation into unstructured text.
The input files are given as the list argument to sources, either absolute file names or relative to filebase.
The parameter inputPattern controls how these file names are reduced to stems. These stems are inserted into output filename patterns of the transformations.
The optional value of xmlResult indicates that the pure XML file shall also be written. All these patterns use the character "percent" = "%" to define or quote the stem. For instance:
    --inputPattern %.d2d
    --fileBase ./documents
    --writeXml .DUMP_%
    --transformations to_xhtml_1_0 eu.bandm.tools.doctypes.xhtml html %.html
                      to_embedded - - %-embedded.sh
    --sources myFile.d2d /home/donald/diary/christmas.d2d
will process
    ./documents/myFile.d2d
into
    ./documents/myFile.html
    ./documents/myFile-embedded.sh
    ./documents/.DUMP_myFile
and
    /home/donald/diary/christmas.d2d
into
    /home/donald/diary/christmas.html
    /home/donald/diary/christmas-embedded.sh
    /home/donald/diary/.DUMP_christmas
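The pattern mechanics of this example can be sketched in a few lines of Python. This is only an illustration of the behaviour described above, not the tool's actual implementation; the function names stem_of and output_name are hypothetical, and the error handling for non-matching file names is an assumption.

```python
import posixpath


def stem_of(source, input_pattern="%.d2d"):
    """Split a source path into (directory, stem).

    The "%" in input_pattern marks the position of the stem;
    everything around it must match the file name literally.
    """
    directory, name = posixpath.split(source)
    prefix, suffix = input_pattern.split("%", 1)
    fits = (name.startswith(prefix) and name.endswith(suffix)
            and len(name) >= len(prefix) + len(suffix))
    if not fits:
        # Assumption: a non-matching name is treated as an error.
        raise ValueError(f"{name!r} does not match pattern {input_pattern!r}")
    return directory, name[len(prefix):len(name) - len(suffix)]


def output_name(source, output_pattern, input_pattern="%.d2d", file_base="."):
    """Resolve source against file_base and insert its stem into output_pattern."""
    if not posixpath.isabs(source):
        source = posixpath.join(file_base, source)
    directory, stem = stem_of(source, input_pattern)
    # The output file is placed next to the (resolved) input file.
    return posixpath.join(directory, output_pattern.replace("%", stem))


if __name__ == "__main__":
    for src in ("myFile.d2d", "/home/donald/diary/christmas.d2d"):
        for pattern in ("%.html", "%-embedded.sh", ".DUMP_%"):
            print(output_name(src, pattern, file_base="./documents"))
```

Run as a script, this prints the six output file names listed in the example, in the order html, embedded, dump for each of the two sources.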
d2d_gp is a general purpose text architecture which has been
developed in parallel with the different historic states of d2d itself.
It follows roughly well-known structural ideas from LaTeX, BibTeX, DocBook,
HTML and other text type architectures. Its goal is not to invent new
notions of text and its components, but to present established concepts
in a versatile, modifiable and extensible way.
Therefore some modules, e.g. those which define
table and list structures, are intentionally kept very primitive.
Users may plug in their favourites, and shall not be confronted
with complexity inadequate to their current project.
So d2d_gp is intended to be parameterized and modified by the user, for
easy definition of textual document types and their subsequent transformations,
according to the needs of a certain project or to personal preferences.
Besides, it serves as a demonstration object for the mechanisms of d2d itself.
It consists of a collection of ddf modules, including user documentation text
(currently only in the English language) and xslt transformations (currently
only to xhtml 1.0).
The translation system into xhtml 1.0 is well-proven and
has been used for creating the documentation texts you are currently reading.
We plan to derive a translation system into LaTeX, and perhaps into "word-ml"
so that you can translate into these de facto/pseudo standard formats
(into the latter without the need for ugly tools from Redmond !-)
The following sections briefly describe the basic architecture and the usage. For details, please refer to the ddf definition sources and the automatically generated documentation.
The module "basic" is the central definition collection in the d2d_gp architecture. Its contents are grouped into sub-modules. The first of them define character sets and simple parsers of the lowest definition level. More complex and dedicated structures follow. Everything culminates in the definitions of standard publishing formats, found in the top-level sub-module basic.deliverables.
Here is a short characterization of the sub-modules. (The links lead to the automatically generated documentation. Please note that this documentation represents the instantiated case. Consequently, one and the same module source text may appear there more than once, each time with different parameters. So, when looking for a certain module, please consult the list of instantiated modules at the beginning of the documentation.)
A first group of more specialized modules is contained in http://bandm.eu/doctypes/d2d_gp/technicalDoc.dd2. These modules support the technical documentation of software.
The definition of the meta_tools documentation itself, i.e. the text you are currently reading, is derived from these modules and contained in http://bandm.eu/doctypes/d2d_gp/mtdocpage.dd2
The transformation system from d2d_gp into xhtml 1.0 is multilingual. This is controlled by a translation table, which must be extended when extending the rule set, and which may be copied and edited when adding support for new languages.
Most of the transformations from d2d_gp into xhtml 1.0 are straightforward:
The transformation to xhtml 1.0 is directed by a collection of
style sheet parameters.
Their names always start with "$user.", to prevent naming clashes with
internal parameters and constants.
(The file /doctypes/d2d_gp/mtdocpage_xhtml.css.prototype holds
an automatically collected list of all "css-class" definitions and of
ALL these variables. There is a script
extractClasses.xslt
which generates this file automatically when you apply it
to the extracted "<XXX>_to_xhtml_1_0.xslt". For creating this
textual representation of the Xslt rules,
cf. above Section 5.3.)
The most important of them are currently:
$user.user | Name of the person initiating the transformation process. This is currently used by the generated "standard footer". |
$user.date | Date and time of the rendering process. Goes also to the "standard footer". |
$user.host | Name of the machine on which the rendering process is run. Goes to the "standard footer". |
$user.mulitable | URI where the translation table can be found. Is needed by the translation process. |
$user.defaultLang | Serves as a very low priority default language; will be overridden by any lang or langs value in the source text. |
$user.collectiontitle | name of the collection the webpage is part of. Used for the "standard header". |
$user.currentkey | stem of the file which is currently processed (file name w/o directory prefix and ".d2d" or ".html" suffix). Needed for navigation links, etc. |
$user.url.sitemap | url of the sitemap, in the sitemap format. Required if header, footer, or other html elements shall include navigation. |
$user.bibLocation | url of the file which contains the bibliography, iff it is not the current file. Iff a value is given, then all xhtml links which correspond to "cite" source elements will link there. (If no value is given, they will stay local to the generated file.) |
$user.biblistHideUrl | "==yes" indicates that a "clickable" entry visually shows only an indication of the document type to download (like ".pdf document" or ".html-Datei") instead of the full URL. |
$user.linkurlprefices | A "self-structuring" list of strings.
(This is the concatenation of string values
separated by an arbitrary delimiter, which is defined by
the first character in the string.)
Defaults to "http:/".
The at most ten (10) elements of this list are used as prefixes for every url in an explicit source "link" which starts with a decimal digit. This creates a kind of "mount points" for href prefixes which (a) occur frequently and (b) may have to be relocated when rendering in different contexts. E.g. with "user.linkurlprefices='%http:/%http://aa.bb/c%file://a/%'" every source text " #link 2def#/link" will be rendered as " #link file://a/def#/link". The zero-th prefix should always be set to http:/, which is also the default value of the parameter, since many entries, e.g. in bibliographic lists, rely on this abbreviation. |
$user.linktextprefices | A self-structured list of strings, which are used as a prefix in a similar way as $user.linkurlprefices, but for the text part of link elements. |
$user.jsUrls | self-structured list of urls of "java script" files. One link for each of the mentioned files will be inserted into the html output. |
$user.cssUrls | self-structured list of urls of "css" files. One link for each of the mentioned files will be inserted into the html output. |
$user.iconUrl | one single url to the "icon" used in the title bar of a html browser. Encoded in html by "<link rel="icon" ...>" |
$user.showLabels | iff =='yes', then all labels in the text are visible, for debugging purpose. |
$user.pageSource | iff =='yes', then a link to the source text will be inserted in the footer "for your information". It has the form of a relative url, lying parallel to the html file. |
$user.footerSignet | defines the first element in the standard footer. Is output as text without escaping, so it can contain arbitrary XHTML constructs, like links and styles and ascii-art, etc. |
($p_kind_filter) | Must contain the string ' * ' to include ALL kinds of paragraphs,
or a list of identifiers and only paragraphs with this "kind" will
be included in the output.
NOT YET PUBLIC, controlled by "$publicVersion", should be public, FIXME |
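The "self-structuring" list format and the digit-based prefix substitution of $user.linkurlprefices can be illustrated by the following Python sketch. It is not taken from the xslt sources: the function names are hypothetical, and two details are assumptions, namely that a trailing delimiter does not add an empty last element, and that a digit without a corresponding list entry leaves the url unchanged.

```python
def parse_self_structuring(value):
    """Split a self-structuring list: the first character is the delimiter."""
    if not value:
        return []
    delimiter = value[0]
    parts = value[1:].split(delimiter)
    # Assumption: a trailing delimiter (as in the example in the table)
    # only terminates the last element and does not add an empty one.
    if parts and parts[-1] == "":
        parts.pop()
    return parts


def expand_link(url, prefixes_value):
    """Replace a leading decimal digit of url by the prefix with that index."""
    prefixes = parse_self_structuring(prefixes_value)
    if url and url[0] in "0123456789":
        index = int(url[0])
        if index < len(prefixes):
            return prefixes[index] + url[1:]
    # Assumption: urls without a leading digit, or with a digit that has
    # no entry in the list, are passed through unchanged.
    return url


if __name__ == "__main__":
    table = "%http:/%http://aa.bb/c%file://a/%"
    print(parse_self_structuring(table))
    print(expand_link("2def", table))
```

With the value from the table entry, "2def" expands to "file://a/def", matching the rendering described there.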
Further parameters are currently only used for the particular stylesheet generating the documentation you are currently reading, namely http://bandm.eu/doctypes/d2d_gp/mtdocpage.dd2, but are intended to be abstracted in the near future:
$user.publicVersion | iff =='yes', then all proof-reading info and all paragraphs of "internal" kind are suppressed. |
$user.collectiontitle | The pure name of the collection, used for generating navigation bars, links, meta tags, etc. |
$user.collectiontitle_html | Overrides $user.collectiontitle in cases where full XHTML may be used. Is included without escaping, so it may contain arbitrary verbatim XHTML! |
$user.url.sitemap | A file following the sitemap format for generating navigation devices. |
In HTML/xhtml the construct "table" is not under
a paragraph, as a child node, but on the same level, as a sibling.
The same is true for the list constructs "ul" and "ol", and for
the horizontal ruler "hr".
(In contrast, "br" may only appear inside a
"p", or inside another "block element".)
The definition of the source-level elements "list" and "table" in d2d_gp is different, since they are children of a "p". In d2d_gp, "p" is the central means for organizing attributes like "lang" or "kind" or "label", etc. So here, a "p" wraps all, including list and table, etc.
When translating to xhtml, the containment-relationship must thus be rotated:
       body                    body
        |                      /|\
        |                     / | \
        p                    p  |  p
       /|\                   |  |  |
      / | \                  |  |  |
     /  |  \                 |  |  |
    /   |   \                |  |  |
chars table chars       chars table chars
The algorithm is as follows:
    [p]alpha[/p]  ==>  trans(alpha)
    trans(alpha)  ==>  f(e, e, alpha)

    f(top, sub, e)                        ==>  top;p'(sub)
    f(top, sub, [list]beta[/list];alpha)  ==>  f(top;p'(sub);[ul]beta[/ul], e, alpha)
    f(top, sub, nonlist;alpha)            ==>  f(top, sub;nonlist, alpha)
    f(top, e, e)                          ==>  top

    p'(e)      =  e
    p'(alpha)  =  [p]alpha[/p]
This algorithm is implemented in the functions (= "named templates") "new_p", "from_p" and "in_p" in the xslt code contained in basic.structure.
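For illustration, the rewrite rules above can be transcribed nearly literally into Python. This is only a sketch of the rules as stated, not a rendering of the actual xslt templates; the data representation is an assumption: the content of a source "p" is a Python list in which character data are plain strings and a source "list" is a tuple ("list", beta), and the result is a list of ("p", ...) and ("ul", ...) tuples on the same level.

```python
def p_prime(sub):
    """p'(e) = e ; p'(alpha) = [p]alpha[/p]"""
    return [("p", sub)] if sub else []


def f(top, sub, rest):
    """top: finished siblings; sub: pending content of the current p;
    rest: not yet processed content of the source p."""
    if not rest:
        # f(top, sub, e) ==> top;p'(sub)   (covers f(top, e, e) ==> top)
        return top + p_prime(sub)
    head, tail = rest[0], rest[1:]
    if isinstance(head, tuple) and head[0] == "list":
        # Close the pending p, emit the list as a sibling, start afresh.
        return f(top + p_prime(sub) + [("ul", head[1])], [], tail)
    # Non-list content is collected into the pending p.
    return f(top, sub + [head], tail)


def trans(alpha):
    """[p]alpha[/p] ==> trans(alpha) ==> f(e, e, alpha)"""
    return f([], [], alpha)


if __name__ == "__main__":
    source_p = ["chars before", ("list", ["item 1", "item 2"]), "chars after"]
    for node in trans(source_p):
        print(node)
```

The example input corresponds to the left-hand tree in the diagram above (with a list in place of the table); the three printed nodes, a "p", a "ul" and a second "p", correspond to the siblings under "body" in the right-hand tree.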
A similar rotation is necessary when translating into "ms-word xml".
made 2024-09-02_18h09 by lepper on happy-ubuntu, produced with eu.bandm.metatools.d2d and XSLT