
D2d --- XML Made Useful For Authors

^ToC 1 Purpose and Way of Operation

^ToC 1.1 Intentions

D2d ("Triple-Dee"/"DDD"/"Directly To Document"/"Direct Document Denotation") enables authors of scientific texts, as well as poets, essayists and novelists, to employ XML based document denotation in their creative process.

The first aim is to interrupt the creative flow of writing as little as possible. Therefore d2d is implemented as a compiler, and basically uses only a single escape character. This design also supports the creation of XML documents by speech input.

The second aim is to give the user full control of the structure he or she creates. Depending on their level of interest and experience, users can control type definition, construction and processing of text bodies with a scalable level of detail. So what the user types really "is" pure XML, only encoded in a writable and readable way. This is what the name "direct document denotation" refers to.

(For the design principles see also this short buzzword list and a German language concept sheet.)

There have been several applications of d2d, from small text-based databases (bibliographies, book-keeping), through medium-scale technical texts like the documentation you are reading, up to poems and novels. The use cases span a wide range, from stand-alone use, generating XML for further processing, to full programmatic integration as a user front-end in some dedicated application. Two extreme examples are a larger analytic text in musicology and our integrated book-keeping software (both in German).

^ToC 1.2 Use Cases

D2d combines several co-operating concepts and applications. The central tool is the text2xml compiler, which creates an XML text file from input text in the d2d front-end format.

The parsing process is controlled by the document type. This can be given in some third-party standard format. Currently only W3C-DTD is supported, see [xml] .

But there is also a dedicated definition language, referred to as "d2d definition format" or "ddf" or "dd2", and called "ddf" in the following. It provides (1) much finer control over the text parsing process, down to character granularity, (2) extensive modularization, and (3) adaptability not by parametrization, but by applying free transformations to structure-defining expressions.

Further there are tools (1) for transforming between ddf and dtd, as far as possible, (2) for generating documentation on ddfs, and (3) a special mode for interleaving XSLT commands with document structure.

There is (4) a special parsing mode (plus a collection of pre-defined element types realizing combinators), which allows parsing the external representation of umod models, according to their serialization rules. This allows the direct denotation of data models.

Planned is (5) a syntax-controlled editor, either Java-based or as an Emacs mode.

Last but not least, there is a first generic application: a modular architecture of document type definitions, called "d2d_gp", because it should be sufficient for all "*g*eneral *p*urposes".
It is described here briefly in chapter 6, but documented mainly by the automatically generated documentation pages, extracted from the ddf files.
It comes with an elaborate xslt system for generating xhtml1.0 [xhtml10] and covers a wide range of traditional publishing. It can be taken as a basis for own developments using the above-mentioned re-writing mechanism.
We plan to add translation rules into LaTeX, and into the XML version of a pseudo-standard wordprocessing format from Redmond, WA.

^ToC 1.3 Three Fields of Parsing in the text2xml Transformation

When transforming an input document in the d2d input syntax into an xml model (and the corresponding textual representation), this process can be seen as a combination of three different parsing situations. In each of these a different parsing technique is applied:

1. First there is the parsing of the characters of the input stream, which recognizes the different kinds of opening and closing tags, comments, meta-commands and character data. This parser can be thought of as a "lexer" for the subsequent tag parsing. Its rules are not defined by the user, but fixed by the d2d definition; they are, however, configurable.
2. Parsing of the explicit tags and of the contents framed by those, and creating the corresponding hierarchy of elements of the resulting XML tree-shaped text model. This works in LL(1) discipline, is controlled by the user's text type declarations, and esp. supports the inference of closing tags.
3. User-defined parsers on character level. These are usually employed for the analysis of the fine-granular structure of certain input text components, like calendar dates, personal names, code numbers, citation keys, text labels, formulas, identifiers, etc. These parsers use a "longest match" discipline, which is rather expensive to execute, but affordable, because the input data is always rather limited in extent.

^ToC 2 Parsing a Text Input File Into An XML Model

^ToC 2.1 Top-Level File Structure and Document Type Declaration

^ToC 2.1.1 Top Element Declaration and Input Data

The text-to-xml frontend compiler of d2d takes some character input and parses it to an internal XML model.
In most cases, this model will immediately be written out by the tool implementation to some traditional XML-encoded text file, for further processing by the user.

Each file which shall be subject to this d2d parsing must have a structure like ...

  foo foo foo
  #d2d 2.0 text using <MODULE> : <ELEMENT>
  // input text
  #eof
  bla bla bla

The prefix of the file, i.e. everything preceding the "#d2d" is ignored.

This is followed by the "magic words" "#d2d 2.0" and a document type declaration: By "<MODULE>" a certain ddf definition module will be referred to. The definition text of this module must be locatable by the parser implementation.

Currently this implementation additionally accepts umod type definitions and w3c's dtd files [xml] as module definitions. The corresponding files are considered iff no ddf module with the given name can be found. In this case, since these formats do not support nesting, the name of the module must be a simple identifier.

As "<ELEMENT>" the name of a parser definition from this module must be supplied. This is the definition of the top-level element that will be generated by the parsing process, and its contents definition is the rule initially defining the parsing process. (Normally, this top-most and initial parser will be a tagged parser, not a character parser.)

The end of the parseable input data must be marked with the meta-command #eof. Subsequent input in the input file will be ignored.

(This strict framing may seem annoying, but it makes it easy to embed d2d sources in bodies of emails and other badly specified contexts.)
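The framing described above can be illustrated by a small Python sketch. It is not part of d2d; the function name `extract_d2d_body` and the exact whitespace handling of the header are assumptions made for illustration, and the sketch ignores local module definitions, comments and changed command characters.

```python
import re

# Assumed shape of the header: "#d2d 2.0 text using <MODULE> : <ELEMENT>".
HEADER = re.compile(
    r"#d2d\s+2\.0\s+text\s+using\s+(?P<module>\S+)\s*:\s*"
    r"(?P<element>[A-Za-z][A-Za-z0-9_-]*)"
)

def extract_d2d_body(raw):
    """Return (module, element, body); text before the header and after #eof is dropped."""
    header = HEADER.search(raw)
    if header is None:
        raise ValueError("no '#d2d 2.0 text using' header found")
    end = raw.find("#eof", header.end())
    if end == -1:
        raise ValueError("input is not terminated by #eof")
    return header.group("module"), header.group("element"), raw[header.end():end]

# "mymod" and "doc" are hypothetical module and element names.
mail = "foo foo\n#d2d 2.0 text using mymod : doc\nHello #emph!world!\n#eof\nbla bla"
module, element, body = extract_d2d_body(mail)
```

Everything outside the framed region, such as a mail signature, never reaches the parser.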

^ToC 2.1.2 Local Definition Modules

Additionally, preceding the input data, a source file may contain local document type definitions, similar to the "internal subset" in the world of w3c dtd.

They take the form of one or more module definitions, as described below in Section 2.8. Again, each is introduced by a reserved character sequence. A file with this format looks like ...

  foo foo foo
  #d2d 2.0 module localmodule
    // here some new parser definitions
  end module
  this will be ignored again, like foo foo foo foo
  #d2d module localmodule2 // "2.0" may be omitted (but shouldn't !-)
    // here some definitions
    tags myparser = // etc.
  end module
  #d2d 2.0 text using localmodule2 : myparser
  // input text
  #eof

Again, all characters outside the marked regions will be ignored. This is done for an easy embedding into e.g. a piece of e-mail without the need of complicated wrappers.

^ToC 2.2 Tokenization

The first layer of parsing is tokenization. Its rules are fixed, but configurable. The parsing process splits the input into tokens and recognizes ...

1. comments
2. open, close and empty tags
3. meta commands
4. character data

Comments follow the well-known "C" discipline: At each point of the input text there is one current and configurable comment-lead-in character, or comment character for short. It defaults to "/".
Two of these start a one-line comment, which extends up to the end of the line.
One comment character together with an adjacent asterisk "*" start a multi-line comment, which extends up to the inverse sequence.
The text forming the comment is totally ignored by the parsing process. There are no further rules of nesting, so the contents of the comment text do not open further levels of comment, etc.:

  input // this is a one line comment
  input // this also /* only a one line comment
  input /* this is a // multi
        /* line comment */ input continued

Comments may be inserted nearly everywhere where whitespace can appear. Comments do not contribute to the resulting model and are totally ignored by the parsing process.
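The comment discipline can be sketched in a few lines of Python. This is only an illustration under simplifying assumptions: the function name `strip_comments` is hypothetical, the comment character is fixed for the whole input (no #setcomment), and verbatim modes and character parsers are not considered.

```python
def strip_comments(text, comment_char="/"):
    """Remove one-line and multi-line comments; comments do not nest."""
    line_start = comment_char * 2      # e.g. "//"
    block_start = comment_char + "*"   # e.g. "/*"
    block_end = "*" + comment_char     # e.g. "*/", the inverse sequence
    out = []
    i = 0
    while i < len(text):
        if text.startswith(line_start, i):
            # One-line comment: skip up to (not including) the end of the line.
            newline = text.find("\n", i)
            i = len(text) if newline == -1 else newline
        elif text.startswith(block_start, i):
            # Multi-line comment: skip up to and including the first closing sequence.
            close = text.find(block_end, i + 2)
            if close == -1:
                raise ValueError("unterminated multi-line comment")
            i = close + 2
        else:
            out.append(text[i])
            i += 1
    return "".join(out)

sample = "input // this also /* only a one line comment\ninput /* a // multi\n/* line */ continued"
stripped = strip_comments(sample)
```

Because the scan always looks for the first closing sequence, a second "/*" inside a multi-line comment has no effect, which mirrors the "no nesting" rule.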

Tags and meta-commands are recognized by the command character. This is configurable, too, and defaults to the pound sign "#".

Then an alphanumeric identifier must follow (either directly, or prefixed by end tag characters, see below). This identifier is a character sequence starting with an ASCII letter, followed by ASCII digits or letters or the underscore or the minus sign.

This identifier is either the name of a built-in meta-command, or the start of a tag. The built-in meta-commands are:

1. #setcommand changes the current command character.
2. #setcomment changes the current comment character.
3. #suppressVerbatimCommandCharWarning, see Section 2.5.2.
4. #eof marks the end of the input which is subject to parsing.

Please note that #eof, instead of a meta-command, could also be seen as a special, pre-defined tag, implicitly appended to the top-level grammar expression. The effect in parsing would be the same. But the names of all meta-commands are reserved and cannot be used as tags in a user-defined document type definition.

The choice of the command character is esp. important for our central aim, not to disturb the flow of authoring. The pound sign "#" is convenient e.g. on a German keyboard layout, where it can be typed with one single key press, without modifying keys. It should be changed according to the preferences of the author.

The following text shows all possible combinations. (It is not meant as a practical example !-)

  #setcommand !
  !setcomment %
  %* this is a %%%%%
     multi line comment !! *%
  !setcommand##setcomment!!!a one line comment

All further examples on this documentation page stick to the default settings for both of these characters.
(Consequently, the source text of this documentation had to change both of them !-)

Here, between a meta-command and its character argument, is the only place where whitespace is permitted, but not a comment:

  #setcommand /*new command character follows*/! 

...will raise an error, because the command character cannot be set to the character "/".

In sum, there are four (4) special, reserved characters involved in the text2xml parsing process: these two, plus the asterisk "*" for framing multi-line comments, plus the character for constructing closing tags, "/", called the end tag character in the following. The last two are not configurable, and for the first two there are some forbidden combinations:

  name                 default   restriction
  commandchar          #         Cannot be set to the end tag character /
                                 Cannot be set to the current comment character
  commentchar          /         Cannot be set to asterisk *
                                 Cannot be set to the current command character
  end tag character    /         Not configurable
  multi-line comment   *         Not configurable
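The restrictions from this table can be expressed as a small Python sketch. The class `LexerConfig` and its method names are hypothetical and not part of the d2d tool; only the four rules themselves are taken from the table.

```python
class LexerConfig:
    """Holds the two configurable characters and enforces the documented restrictions."""
    END_TAG_CHAR = "/"    # not configurable
    MULTI_LINE_CHAR = "*"  # not configurable

    def __init__(self):
        self.command_char = "#"  # default command character
        self.comment_char = "/"  # default comment character

    def set_command_char(self, char):
        # Corresponds to "#setcommand <char>".
        if char == self.END_TAG_CHAR:
            raise ValueError("command character cannot be the end tag character")
        if char == self.comment_char:
            raise ValueError("command character cannot be the current comment character")
        self.command_char = char

    def set_comment_char(self, char):
        # Corresponds to "#setcomment <char>".
        if char == self.MULTI_LINE_CHAR:
            raise ValueError("comment character cannot be the asterisk")
        if char == self.command_char:
            raise ValueError("comment character cannot be the current command character")
        self.comment_char = char

# Replays the sequence "#setcommand !", "!setcomment %", "!setcommand #", "#setcomment !".
config = LexerConfig()
config.set_command_char("!")
config.set_comment_char("%")
config.set_command_char("#")
config.set_comment_char("!")
```

Note that the order of the changes matters: setting the command character to "!" is only legal as long as "!" is not the current comment character.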

These are indeed four(4) reserved characters, but since explicit closing tags are very rarely required, and comments do not count anyway, we still daresay that d2d realizes (nearly) all tagging with "only one(1)" reserved character, namely the command character.

Finally, all input data which is not preceded by a command char and not part of a comment, is recognized as "character data".

^ToC 2.3 Tagged Parsing

The tagged parsers form the upper, coarse layer of user-defined parsing. For each open tag recognized in the input stream, a new XML node (element or attribute) in the model is created. In case of an element, all subsequently recognized input (character data and further nodes) will be included therein as its contents, until the parsing process decides to "close" this element.

This corresponds to the well-known XML/SGML approach of tagging, i.e. of parsing an XML-encoded source text. But with the d2d approach there are some important simplifications:

1. Tags are written with one single escape character, called command character. As mentioned above, it defaults to the "pound sign" #, but can be re-assigned, whenever appropriate.
2. The syntax of allowed d2d tags is much more restricted, and given by the production for ident. In particular, [xml] allows the underscore and even the colon as a start character, and the dot and many non-ASCII characters as subsequent characters. In d2d, only ASCII letters, digits and the two (2) special characters minus "-" and underscore "_" are allowed. The first character must be a letter.
3. Closing tags can be omitted in most cases and are inferred by the parser, but can also be inserted explicitly.
4. If neither is possible or convenient (e.g. in cases of nested mixed content), two special closing notations (similar to LaTeX's \verb and \begin{verbatim} constructs) can be applied, see Section 2.3.1 and Section 2.5.2.

In combination with the character parsers explained below, this leads to a much more intuitive handling, esp. during the process of creative authoring, as in the following example:

  This is a #emph!source text! which refers to some #cite citation
  and to some #ref text_label in the same document.
  #p This is a new paragraph, containing
  #list
  #i a first
  #i and a second list item, referring to a
  #link http://bandm.eu #text link#/link
  #/list

Obviously, you can enter such a text with your favourite text editor. This is much less pleasant with the standard XML front-end representation:

 

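  This is a <emph>source text</emph> which refers to some <cite>citation</cite>
  and to some <ref>text_label</ref> in the same document.
  <p>This is a new paragraph, containing
  <list>
  <i>a first</i>
  <i>and a second list item, referring to a
  <link><url>http://bandm.eu</url><text>link</text></link></i>
  </list></p>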



Besides these (and some more minor) simplifications, the grammar defining the document type and ruling the parsing process must still conform to the well-known DTD standards, i.e. it must be "LL-1-parseable" in the sense of [xml], after all closing tags inferred by the d2d parser have been inserted, see Section 2.3.5 below.

^ToC 2.3.1 Structure and Effect of Tags

Tags are recognized in two cases:
(1) whenever a command character is followed by an identifier which is not one of the reserved meta-commands from Section 2.2, and
(2) whenever a command character is followed by one (1) or three (3) end tag characters and then by such an identifier.

In both cases there may be arbitrary whitespace following the command character, but no whitespace between the end tag characters and the identifier.

In the first case we have an open or empty tag, in the second we have a close tag.

After an open tag some input must follow which matches the definition of the node (attribute or element) to which the tag refers. The translation process will create such a node instance in the model under construction, and parse the subsequent input to create its contents.

An empty element may be denoted by an empty tag, see below.

A close tag explicitly ends the parsing process for an element currently under construction, and continues with the parsing context which was in effect before this element was opened.

Similar to the well-known XML/SGML front-end representation, a close tag is constructed by an end tag character immediately preceding the identifier, and an empty tag is constructed by an end tag character immediately following the identifier.

In contrast to XML, d2d additionally provides both kinds of tags in a form with three(3) end tag characters. These are "premature end" tags. They indicate that the contents of the element are not yet complete, but will be completed in a later working phase. This is e.g. useful for abstracts, bibliographic entries, examples, etc. which are left open in the first course of writing, and shall be explicitly marked as incomplete.

In contrast to XML, d2d additionally provides a generic closing tag consisting of the end tag character followed by a whitespace character. This means to end the latest opened element, i.e. it is a close tag to the open tag which has been recognized last. The whitespace character is consumed as part of the tag, and is consequently not subject to further parsing, i.e. it will not be translated to contents of any re-opened element.
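The different tag forms described so far can be summarized in a Python sketch. It is a rough illustration, not the d2d tokenizer: the function name `classify_tag` and the result labels are invented here, and the regular expression only covers the forms named in the text (open, close, empty, their "premature" variants with three end tag characters, the generic close tag, and the meta-commands of Section 2.2).

```python
import re

IDENT = r"[A-Za-z][A-Za-z0-9_-]*"
META_COMMANDS = {"setcommand", "setcomment", "suppressVerbatimCommandCharWarning", "eof"}

def classify_tag(text, command_char="#"):
    """Classify the tag at the start of text; returns (kind, identifier) or None."""
    pattern = re.compile(
        re.escape(command_char) + r"\s*(?:"
        r"(?P<pre>///|/)(?P<close_id>" + IDENT + r")"       # "#/ident", "#///ident"
        r"|(?P<open_id>" + IDENT + r")(?P<post>///|/)?"      # "#ident", "#ident/", "#ident///"
        r"|(?P<generic>/)(?=\s)"                              # "#/ " closes the latest element
        r")"
    )
    m = pattern.match(text)
    if m is None:
        return None
    if m.group("generic"):
        return ("generic-close", None)
    if m.group("close_id"):
        kind = "premature-close" if m.group("pre") == "///" else "close"
        return (kind, m.group("close_id"))
    ident = m.group("open_id")
    if ident in META_COMMANDS:
        return ("meta", ident)
    if m.group("post") is None:
        return ("open", ident)
    return ("premature-empty" if m.group("post") == "///" else "empty", ident)

kinds = [classify_tag(t) for t in ("#ident x", "#/ident", "#///ident", "#ident/", "#ident///", "#/ x", "#eof")]
```

The sketch does not model what follows an open tag (consumed whitespace or a parenthesis character); that part of the rule is described in the following paragraphs.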

An open tag may consist not only of the identifier, but may also include the one(1) immediately following character:

Whenever a whitespace character follows, this is consumed as part of the open tag. So this character will not be considered by further parsing, e.g. for the content of the element just opened.

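Whenever one of the parenthesis characters listed below follows, it is also consumed as part of the open tag, and the corresponding closing character is assigned the role of the close tag for this element. This supports a kind of tagging which is known from the \verb%...% construct in LaTeX or the -e "s?..?..?g" construct in sed.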

Currently the following parentheses are implemented:

  opening character   ( < [ { . ! \ : $ ^
  closing character   ) > ] } . ! \ : $ ^

The assignment of a certain close tag to a certain input character is valid up to the first appearance of this character (outside of what is swallowed by a character parser, cf. the warnings below in Section 2.4.3!). The parsing algorithm replaces this character by the closing tag.
(As with explicit tags, arbitrary combinations of opening, closing and empty tags may occur before, --- as long as they comply with the document type, of course !-)
The parenthesis assignments are stacked, so that at this moment the next-older assignment pops up and is valid again. So the same parenthesis character can be re-used in a nested way:

  this is #bold!bold and #ital!bold italic! text!!
  --- yields
  this is <bold>bold and <ital>bold italic</ital> text</bold>!
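The stacking of parenthesis assignments can be mimicked with a short Python sketch. It rests on one interpretation of the text, namely that each closing character carries its own stack of pending element names; the function name `expand_parentheses` is hypothetical, and the sketch handles only open tags in parenthesis form, not explicit close tags, tag inference or content models.

```python
import re

# Opening characters and their closing counterparts, as listed in the table above.
PAIRS = {"(": ")", "<": ">", "[": "]", "{": "}", ".": ".",
         "!": "!", "\\": "\\", ":": ":", "$": "$", "^": "^"}
OPEN_TAG = re.compile(r"#([A-Za-z][A-Za-z0-9_-]*)(.)", re.S)

def expand_parentheses(text):
    """Rewrite parenthesis-mode tags into explicit XML-style open and close tags."""
    out = []
    stacks = {}  # closing character -> stack of element names waiting for it
    i = 0
    while i < len(text):
        m = OPEN_TAG.match(text, i)
        if m and m.group(2) in PAIRS:
            ident, opener = m.group(1), m.group(2)
            stacks.setdefault(PAIRS[opener], []).append(ident)
            out.append("<" + ident + ">")
            i = m.end()
            continue
        char = text[i]
        if stacks.get(char):
            # First appearance of an assigned character: it becomes the close tag.
            out.append("</" + stacks[char].pop() + ">")
        else:
            out.append(char)
        i += 1
    return "".join(out)

nested = expand_parentheses("this is #bold!bold and #ital!bold italic! text!!")
```

Since the assignment is purely lexical, the sketch also turns "#pat. Finally she came." into an element with contents, whether or not the element type allows any.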

Only when the character following the identifier in the open tag is neither whitespace nor a special character usable as an open parenthesis (nor an alphanumeric one, of course, because then it would be part of the identifier, and not a follower !-) is it taken as input for the further parsing.

The following example illustrates all these cases (The lines are to be read not as a part of one continuous text, but each as a separate example, in some larger but arbitrary context!)

  foo foo #ident this char data goes into an .. element
  foo foo #ident  this char data starts with one blank space
  foo foo #ident(this char data ends here) foo foo foo
  foo foo #ident!this char data ends here! foo foo foo
  foo foo #ident=but this char data starts with an equal sign
  foo foo #/ident after the element .. after the element
  foo foo #///ident here it ends PREMATURELY here
     // MISSING FIXME CONSUME ONE BLANK !!!
  foo foo #ident/ this is an empty element this is ..
  foo foo #ident/this is an empty element this is ..
  foo foo #ident/// this is an empty element, but waiting for later completion
  foo foo #/ here ends the top element currently under construction, and the character data starts with a "h"
  foo foo #/  ditto, but the character data starts with a blank.

There are two major caveats with the parenthesis mode:

First, the assignment of the role of the closing tag to the parenthesis character is made purely on the lexical level. It does not consider the content model of the parser definition the tag refers to. The latter information lives on a higher level, and will not influence the lower level of tokenization. This may seem somewhat inconvenient, but there are good reasons for it, as discussed in Section 2.3.6.

Consequently, it is possible to write things like ...

  foo foo #el^goes to first el#/el foo foo #el goes to second el^ foo foo 

The first tag opens a new element of type "el" and defines the caret sign "^" as a denotation for the corresponding end tag. But then the text does not use this short-cut notation at all, instead it uses the explicit end tag "/el". So the role of "/el" is still assigned to the character "^" and used afterwards, for the second element.

Our current implementation issues a warning when recognizing the first, explicit end tag, because the closing parenthesis character is intended to be used in correct nesting, i.e. to replace the closing tag which corresponds to the open tag it was defined with.

Secondly, this non-awareness of the content definitions can lead to unwanted effects with empty elements. Assume an author writing a novel and referring to its protagonist by "#pat", which is defined as an empty element. Then the following text cannot be parsed:

 #ben stood there and waited for #pat. Finally she came. 

The content of the element pat is #EMPTY (written in the dtd style). Nevertheless the dot character "." is used as an open parenthesis for its contents. Consequently, the second dot character will be read as the closing tag "#/pat", and " Finally she came" as the intended contents of the element, which is a type error, because the contents must be empty.

Here are some correct alternatives, meaning the same. (The last one is given only to demonstrate a further trap, not as a suggestion !-)

  #ben stood there and waited for #pat/. Finally she came.
  #ben stood there and waited for #pat(). Finally she came.
  #ben stood there and waited for #pat... Finally she came.

Because there may be significant distance between the place where such an unintended end tag comes into effect, and the place of its unintended definition, every error message generated by the d2d tool concerning an unexpected closing tag resulting from a parenthesis is followed by a hint message giving the position of its definition.

^ToC 2.3.2 Declaration and Execution of Tag Parsers

The content model of each element filled by tag parsing and the corresponding parser process are declared in a ddf module by a definition statement compliant to the following grammar:

 tags_parser_def ::= ( public ) ? tags ident ( , ident ) *         = d_expr ( modifiers ? )
 d_expr ::= ( #implicit ) ? expr | #empty | #GENERIC
 expr ::= decor | decor , decor | decor | decor | decor & decor | ...
 decor ::= atom | atom ? | atom * | atom +
 atom ::= #chars | ( expr ) | reference | insertion | ...
 insertion ::= @ reference

Here is a typical example, which defines a hypertext link, closely following the classical HTML <a> element:

  tags link = #implicit url, ( text? & (blank|top|inframe|framename)? & loc? & refdate?)
  tags url, text, loc, refdate, framename = #chars
  tags blank, top, inframe = #empty

The corresponding input will appear like ...

  please refer to #link http://bandm.eu/metatools/doc/d2d.html #text this link #loc txt_label #top for more information 

Every tags_parser_def defines the contents of one or more element types, and thus defines the corresponding tag-driven parsers which can convert input text into a model for these elements.

At the beginning of the parsing process, when looking at the text input as a whole, content must follow which is accepted by the parser corresponding to the indicated top-level element, see Section 2.1.1 above.

"Being accepted by a tag parser p" means that the input can be partitioned into a sequence of sub-inputs. Each of these starts with a tag, and is followed by some content which in turn is accepted by the parser indicated by that tag. The sequence of all these tags must match the regular expression which is given with the definition of "p".

The only exception is the #implicit declaration: This keyword suppresses the very first tag of a top-level sequence, so that only the content, but not the tag, must follow.

In practice, this is esp. useful for obligatory entries which always, without alternatives, stand at the beginning of a structure, like ids or numbers or keys.

As soon as the whole contents are recognized as "complete", i.e. as soon as no further input can match the contents definition, a close tag is inferred and inserted in the parsing process.

Nevertheless, the close tag may appear in the input text explicitly, e.g. for readability by humans.

With respect to the top-level element and the parsing of the input file as a whole, the text must be terminated by the meta-command #eof.

The declaration as #empty defines that the contents of the element are empty.

If not, then the constructors of the regular expressions have the usual meaning:

1. #chars means character data
2. e? means that e may be omitted.
3. e* means that e may be omitted or repeated.
4. e+ means that e may be repeated.
5. e,f means e followed by f
6. e|f means e or f
7. e&f&g&h means any permutation of e, f , g and h.

Some caveats w.r.t. the "&" operator:

1. The "&" operator is associative. The "," and the "|" operators are also associative, but for them this is clear from their semantics. For the "&" operator it must be stated explicitly. The notation "e&f&g&h" in the list above is meant to indicate this.

2. The "&" operator means permutation, and not interleaving! E.g.

  tags x = a & (b, c)
  ----
  #x #b #c #a // is legal!
  #x #b #a #c // is NOT legal!

3. Permutation happens only in the input text, not in the data model. The data model will always be normalized to the sequence of the declaration. The "&" operator is meant as a mechanism for writing down e.g. the different "columns" of an entry in a database without the possibly tedious need to respect a totally arbitrary and meaningless sequential order. It is not intended to denote a sequential order as such!
(In real-world data models, a sequential order which carries any semantics is in nearly all cases only sensible between different values of the same type, e.g. the participants of a sports tournament, and not between different types, each appearing exactly once with only one value each, as it is described by the & operator.)
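The difference between permutation and interleaving, and the normalization to declaration order, can be made concrete with a small Python sketch. The helper names `accepted_tag_sequences` and `normalize` are hypothetical; each operand of "&" is modelled simply as a fixed tuple of tags, so optional or repeated operands are outside the scope of this illustration.

```python
from itertools import permutations

def accepted_tag_sequences(operands):
    """All tag sequences accepted by 'op1 & op2 & ...'; each operand stays contiguous."""
    return {
        tuple(tag for operand in order for tag in operand)
        for order in permutations(operands)
    }

def normalize(sequence, operands):
    """Reorder an accepted sequence into the order of the declaration."""
    declared = [tag for operand in operands for tag in operand]
    return tuple(sorted(sequence, key=declared.index))

# Models "tags x = a & (b, c)": the operands are "a" and the sequence "b, c".
OPERANDS = [("a",), ("b", "c")]
accepted = accepted_tag_sequences(OPERANDS)
```

The interleaved sequence b, a, c is not in the accepted set, because the operand "(b, c)" may only move as a whole.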

^ToC 2.3.3 Difference between Tags, References and Insertions

A reference, as it appears in the expression of a parser definition, must always be resolvable as the name of an existing definition, which may be a tag parser, a character parser, or an enumeration. When such a reference appears in a regular expression expr, this implies (1) the appearance of the tag of the referred parser in the parsed input text, (2) followed by some input which is accepted by that referred parser, and (3) the construction of a corresponding node (element or attribute, to speak with XML) in the recognized output model.
Definitions may refer to themselves, or to others in a mutually recursive way.

In the world of the input texts all parsers and element types are referred to by tags. The tag for every parser is a simple identifier, namely the ident which appears to the left of the equal sign in the tags_parser_def which defined this parser.

This holds for those defined directly on the top-level of a module, as well as for those which are nested as local definitions (see Section 2.8.2).

In contrast, in the world of definitions, whenever referring to a parser in another parser definition, e.g. when constructing a reference, a kind of "qualified name" is used, which in case of local definitions or of imported modules is a sequence of more than one identifier. But such a construct never appears in text inputs.

Referring to another tag parser in the form of an insertion means referring only to its content model, i.e. inserting the regular expression which is the content model of the referred parser at this point into the construction of the regular expression.

In contrast to a "normal" reference, an insertion exists only in the world of definitions. It is not "visible" in the input text.

In the current implementation, tag parsers do not allow cycles, i.e. mutually recursive insertions, but char parsers do.

Insertions are introduced so that no special class of definitions is required for content models as such (as with the two constructs "define" and "element" in [relaxng] , or with "ENTITY" and "ELEMENT" in XML DTDs [xml] ). In practice, this leads frequently to parsers (tag parsers or character parsers) which are only defined for the sake of their content model to be used in the definition of other parsers, and which are never applied to text input on their own.

This insertion mechanism works also for the character parsers, as described below, Section 2.4. But, of course, both kinds of parsers cannot be mixed: only tag parsers can be inserted into the expression of a tag parser, and only character parsers into character parsers!
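How an insertion differs from a reference can be sketched in Python as a purely textual expansion. This is only a toy model: definitions are kept as strings in a dictionary, the function name `expand_insertions` and the definition names `p` and `MARKUP` are invented, only simple (unqualified) names are handled, and a cycle is reported as an error, as the text states for tag parsers.

```python
import re

INSERTION = re.compile(r"@\s*([A-Za-z][A-Za-z0-9_-]*)")

def expand_insertions(name, definitions, active=()):
    """Replace every '@other' in the content model of name by the content model of other."""
    if name in active:
        raise ValueError("cyclic insertion: " + " -> ".join(active + (name,)))
    expression = definitions[name]
    return INSERTION.sub(
        lambda m: "(" + expand_insertions(m.group(1), definitions, active + (name,)) + ")",
        expression,
    )

# Hypothetical definitions: MARKUP exists only for the sake of its content model.
definitions = {
    "p": "(#chars | @MARKUP)*",
    "MARKUP": "emph | bold",
}
expanded = expand_insertions("p", definitions)
```

After the expansion the name MARKUP has disappeared from the content model of p, so no tag "#MARKUP" ever appears in the input text; only the tags "#emph" and "#bold" do.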

^ToC 2.3.4 Character Data Input, Whitespace and Discarding Whitespace.

Last but not least, in tag parsers the expression "#chars" corresponds to all character data. These are all those fragments of the input which are not recognized as tags or comments or meta-commands, as described in Section 2.2 above.

Beside the "#implicit" declaration, this is the second case in which an open tag is inferred: The "invisible tag" for "#chars" is inserted whenever the beginning of such character data is recognized by the tokenization process.

This implicit tagging could be (formally not correct) depicted as ...

  p = (hr | br | #chars)*
  hr = br*
  br = #empty
  ----- applied to text input --->
  #p this is a paragraph, #hr#br and it continues here
  ----- implicit tagging yields -->
  // #p #chars this is a paragraph, #hr#br#chars and it continues here
  ----- standard d2d parsing yields -->
  <p>this is a paragraph, <hr><br/></hr>and it continues here</p>

The auxiliary second line shows that after the insertion of the implicit "#chars" tag, the parsing and tag inference mechanism of d2d can be applied as usual, without further special treatment.

Nevertheless it turned out to be sensible to treat certain character data in a special way, namely whitespace character input.

If the currently parsed content model does not allow character data, then every non-whitespace data will be tagged with "#chars" and influence the parsing situation, as described above. In the example above, it terminated the collection for the "hr" element. This "active role" of character data is not sensible with white-space input: Instead, whenever the currently parsed content definition does not accept any character data, it is more convenient that white-space is ignored.
For better readability of complex nested contents this is even necessary, since it allows indentation in the source text which has no effect on the parsed result.
So the insertion of blank characters in the following version will not change the parsing result, because the currently growing element ("hr") does not accept character data:

  #p this is a paragraph, #hr   #br and it continues here
  //                         ^^^ has no effect
  ----- standard d2d parsing yields -->
  <p>this is a paragraph, <hr><br/></hr>and it continues here</p>

...but inserting non-blank characters (of course !-) will:

  #p this is a paragraph, #hr XX#br and it continues here
  ----- standard d2d parsing yields -->
  <p>this is a paragraph, <hr/>XX<br/>and it continues here</p>

The situation is complementary, if the currently parsed element, the current parsing situation, does allow character input.
Firstly, the "#chars" tag would not lead to closing one or more parser levels, but is consumed and inserted into the currently parsed contents. Consequently, white-space must never be ignored.
This can lead to some possibly surprising effects. A widespread example is found in standard XHTML:

  // slightly simplified contents model:
  link = (id? & href? & name? & style? & class?), (#chars | p | img | div )*
  ----- applied to text input --->
  #link #href thisIsAHref#/link
  ----- yields, as expected --->
  <link href="thisIsAHref"/>
  ---- but applied to text input --->
  #link  #href thisIsAHref#/link
  ---- is implicitly tagged as --->
  #link #chars #href thisIsAHref#/link
  ----- and thus standard d2d parsing yields -->
  ERROR, href not allowed

These effects are caused by the fact that "link" does accept character data, and therefore every blank character beyond the first is considered as input to the "(#chars|...)*" part of its content model. This leaves the initial permutation expression behind, once and forever!

So when the "href" is meant as a part of the "link" element, it has to be input like

  #link #href thisIsAHref
  -- or even
  #link#href thisIsAHref
  -- but not as
  #link  #href thisIsAHref

One single blank is swallowed as part of the tag (see Section 2.3.1), but the second blank must be treated as character input. This requires entering the second parenthesized group, so "href" is no longer applicable.
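The whitespace rule of this section can be paraphrased as a small Python sketch. It is only an illustration under simplifying assumptions, not the d2d implementation: the function name `classify_whitespace` and the flag `accepts_chars` are invented here, and only the run of blanks directly after an opening tag is modelled.

```python
def classify_whitespace(blanks_after_tag, accepts_chars):
    """Classify the run of blanks that directly follows an opening tag.

    blanks_after_tag: number of blank characters after the tag
    accepts_chars:    whether the current content model allows #chars
    """
    if not accepts_chars:
        # No character data in the content model: whitespace is ignored.
        return "ignored"
    if blanks_after_tag <= 1:
        # A single blank is swallowed as part of the tag itself.
        return "swallowed"
    # Every further blank counts as character input (#chars).
    return "chars"


# "#link #href ..."  : one blank, "href" is still applicable
print(classify_whitespace(1, accepts_chars=True))
# "#link  #href ..." : the second blank opens the #chars part
print(classify_whitespace(2, accepts_chars=True))
# "#hr    #br"       : "hr" takes no character data
print(classify_whitespace(4, accepts_chars=False))
```

The point of the sketch is that the decision depends only on the content model of the current element and on the number of blanks, never on any later "trimming" of the result.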

A similar effect comes with the definition of paragraph in our standard text format d2d_gp, see chapter 6:

  tags p = (kind? & lang?), (#chars | @PHYSICAL_MARKUP | @DOMAINSPECIFIC_MARKUP)*
  chars kind = @S:ident
  chars lang = @XML:lang
  ----- must be written like --->
  #p#kind motto Here starts the paragraph.
  #p #kind motto Here starts the paragraph.
  ----- but NOT like --->
  #p  #kind motto Here starts the paragraph.

The discarding and respecting of whitespace happens in exactly the same way when re-entering an element's content after an explicit closing tag:

  #p #hr    #br/   #br/     #/hr       continue
  //    ^^^^    ^^^    ^^^^^  ignored
  //                            ^^^^^^^  not ignored
  #p #hr    #br/   #br/                continue
  //    ^^^^    ^^^  ignored
  //                   ^^^^^^^^^^^^^^^^  not ignored

(You will note that in the last case there is a kind of "backward parsing": only when the non-whitespace character data is recognized and leads to an inferred "/hr" can the preceding whitespace sequence be classified as relevant input!)

Please note that all these "front-end" rules and considerations are independent from whether the whitespace will be stored to the resulting elements. The character data contents of elements may be "trimmed" at both ends, according to a declaration of "trimming", see Section 2.5.1. So possibly the whitespace recognized as the text start of the "link" element above will nevertheless be discarded when constructing the result element!
But this "back-end" feature is in no way related to the decision mechanism described so far!
This is sensible for two reasons:
(a) The front-end rules for whitespace also apply to the points of re-entering a contents model after an explicit closing tag, as in the last example above. This case is not covered by the back-end "trimming", which only affects the very ends.
(b) If trimming were considered in the parsing process, then later changes in the back-end representation would influence the parsability of source documents in the front-end. This does not seem wise.

So please do not mix up those two layers. After short practice, the front-end rules will surely turn out to be much less complicated than they may seem, since they are useful and intuitive!

^ToC 2.3.5 LL(1) Constraint

All content declarations must (locally!) fulfill the LL(1) restriction on all repetitions and alternatives (but only w.r.t. all input in which all closing tags are contained explicitly!).

This is not only for ease of implementation. It is even more for readability by the user. We aim at users not coming from informatics or language theory, but from administrative practice, lyrics, journalism, etc. And we want "visibility of control", which means: no inference rules or backtracking behind the scenes. And we want easy compositionality, explainable to the above-mentioned users.

(With the character parsers, as described below, the principles are just the opposite. There we want utmost convenience of usage, and the text areas in which "magical recognition" takes place are normally rather small and explicitly bounded.)

The fact that the & operator means permutation, not interleaving, is closely related to this feature:
If we had an interleave operator, then LL(1) would have to hold for all possible sequences of interleaving. This would be rather tedious to implement, not very friendly to the user, would introduce non-compositionality, and would restrict the applicability of this operator in the world of declaration expressions more than it would bring freedom in the world of input texts.
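The LL(1) condition on alternatives can be illustrated by a short Python sketch. It assumes that the set of tags which may start each alternative (its "first" set) is already known; the function name `ll1_conflicts` and the second example model are hypothetical. The sketch covers alternatives only, not the additional conditions on repetitions.

```python
from itertools import combinations


def ll1_conflicts(first_sets):
    """Return (alt_a, alt_b, shared_tokens) for every overlapping pair.

    first_sets maps a (hypothetical) alternative name to the set of
    tokens that may start it.
    """
    conflicts = []
    for (a, fa), (b, fb) in combinations(first_sets.items(), 2):
        shared = fa & fb
        if shared:
            conflicts.append((a, b, shared))
    return conflicts


# (#chars | cite | pers | opus)* : every alternative starts differently
ok = {"#chars": {"#chars"}, "cite": {"cite"}, "pers": {"pers"}, "opus": {"opus"}}
print(ll1_conflicts(ok))

# hypothetical model in which two alternatives may both start with "title"
bad = {"article": {"title", "author"}, "book": {"title"}}
print(ll1_conflicts(bad))
```

With a single token of look-ahead the parser must be able to choose exactly one alternative, so any non-empty intersection is reported as a conflict.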

^ToC 2.3.6 Tag Recognition and Evolution of Document Types

The first central target of the d2d approach can be summarized as "recognizing opening tags with as little typing as possible, and inferring closing tags."

This may seem trivial, but is, on the contrary, surprisingly complex. Not least because the approach does not only deal with a computer-to-human interfacing problem, but also has a second target, namely to support the dynamic evolution of document types.

During the years of development, application and improvement we revised some design decisions which initially seemed very sensible according to the first target, but turned out to be unfeasible w.r.t. the second.

This is the third version of d2d. Only details were changed in the last version step, but with heavy impact. E.g., in the preceding version, the command character could be omitted whenever the current parsing situation implied that only a tag could follow, not character data.

So you could write ...

  #h1 title This is a hierarchy-level-one title 

...which now has to be written as ...

  #h1 #title This is a hierarchy-level-one title 

The reason for this new restriction comes from the possible evolution of schemata and document type declarations: we want them to be evolvable, preserving downward compatibility with the existing documents. In practice this is an important issue, because what frequently happens is (1) the addition of further alternatives to existing content models, and (2) the refinement of content definitions. The former happens when fields of application grow, the latter when fragments of information require a finer analysis which had not been considered when the documents were initially created.

So the typical evolutions of document type definitions, seen from standpoint of one particular element type and its contents definition, form these two groups of movements:

  a certain sub-element is ...

  ... nonExisting      ... optional      ... obligate
      1 ----------->       3 ------------->
      2 ------------------------------->
        <------------ 4      <------------- 5
        <----------------------------- 6

  the contents of an element are ...

  ... empty      ... unstructured      ... a structured
                     #chars                character parser
                     7 --------------->
      8 -------------->
                                       structured tag parser
      9 --------------->

In detail the consequences are ...
1) A new element is introduced as optional. No existing document is invalidated.
2) A new element is introduced as obligate. All existing documents are invalidated, which may be exactly what you want.
3) All documents not yet containing this element are invalidated.
4) No impact on existing documents; future documents may be simpler.
5) and 6) All those documents which do contain this element are invalidated. This seems an unlikely case (unless the Supreme Court forbids storage of certain data, etc.)

7) This is a rather frequent case in practice: e.g. "calendaric date" or "course number" are first entered as mere character data, and later, when further processing becomes necessary, refined to a character parser. Some documents may become invalid: this may be caused either by typos in the document, or it is an indication that the new structured character parser is defined too restrictively and does not consider all legal possibilities.

Please note that there are two methods for specifying unstructured character data, either as a tag or as a character parser:

 //a
 tags calendaric = #chars
 //b
 chars calendaric = (#S:all - '#')*
 //c
 chars calendaric = (#S:non_whitespace)*
 //d
 chars calendaric = (#S:all)*

In particular, variants a and b are not totally equivalent: the input recognized by variant a is terminated by a command character, which can dynamically be redefined by the input source. But the input to b is always terminated by the character "#", and ONLY by this!
The variant d will swallow the rest of the input file and probably result in an error. See the warnings in Section 2.4.3.

8) In most cases this transformation is not advisable if the element ever occurred in "mixed content": with its empty content, closing tags were always immediately inferred. So they probably do not appear explicitly in the existing document sources, but after this transformation they are required to prevent the element from swallowing all following character data.

9) This should impose no severe problems, as long as the sets of involved tags (esp. the "first" sets!) are chosen carefully. There are two sub-cases:
9-a) The new content definition can produce "epsilon", so the empty content is legal. Then normally no existing document becomes invalid. (The only problematic case is that the director sets of the new content model and of the application context are not disjoint, see 9-b.)
9-b) The new content definition cannot produce "epsilon". Then all existing documents are either invalid, or the enhanced element now swallows as its children those elements which before had been its subsequent siblings and nephews. This keeps the document syntactically valid, but may not be what is intended semantically.

It is easy to see that if the content model were considered in the underlying tokenization process, as in the example above and in the preceding version of d2d, many more documents would be affected by type evolution!

Consider e.g. a paragraph which consists of character data mixed with typical inline elements:

  tags p = (#chars | cite | pers | opus)*
  ----
  #p This is about #pers Beethoven#/ and #pers Schiller#/ .

Later we want to extend the definition with a possible label:

  chars label = @S:ident // use some standard identifier grammar
  tags p = label?, (#chars | cite | pers | opus)*
  ----
  #p #label beeschill This is about #pers Beethoven#/ and #pers Schiller#/ .
  #p label is not used by this paragraph
  #p nor by this

When we decide to make the label obligate by evolving the content model from "label?," to "label,", then we want error messages like "missing label" for the last two input lines. Instead, the preceding version of d2d would "suddenly" infer a command lead-in character and react as follows:

  chars label = @S:ident // use some standard identifier grammar
  tags p = label, (#chars | cite | pers | opus)*
  ----
  #p #label beeschill This is about #pers Beethoven#/ and #pers Schiller#/ .
  #p label is not used by this paragraph
  //

not used by this paragraph

#p nor by this // error unknown tag "nor" 

The opposite effect would be even more confusing: Assume "label" is obligate ...

  tags p = label, (#chars | cite | pers | opus)*
  ----
  #p label beeschill This is about #pers Beethoven#/ and #pers Schiller#/ .
  // delivers

This is ...



...and is now changed to be optional:

  tags p = label?, (#chars | cite | pers | opus)*
  ----
  #p label beeschill This is about #pers Beethoven#/ and #pers Schiller#/ .
  // delivers

label beeschill This is ...



So the feature of inferring the command character is nice w.r.t. typing, but hardly compatible with evolution, and has consequently been abandoned.
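The difference between the two versions can be condensed into a deliberately simplified Python sketch. It looks only at the first word after "#p" and assumes a content model that starts with either an obligate or an optional "label"; the function names and the flag `label_required` are invented for this illustration and do not reflect the real tokenizer.

```python
def first_token_old(word, label_required):
    """Preceding d2d version (simplified): the command character is
    inferred whenever only a tag can follow."""
    if word.startswith("#"):
        return "tag:" + word[1:]
    if label_required:
        # No character data is possible here, so the word is taken as a tag.
        return "tag:" + word
    return "chars:" + word


def first_token_new(word, label_required):
    """Current version (simplified): a tag always needs the command
    character, so the content model is deliberately not consulted."""
    if word.startswith("#"):
        return "tag:" + word[1:]
    return "chars:" + word


for required in (True, False):
    print(required, first_token_old("label", required), first_token_new("label", required))
```

In the old scheme the same source word changes its token class when the model evolves; in the new scheme the token class stays stable and an obligate but missing label can be reported as an error by the later parsing stage.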

^ToC 2.3.7 DTD Conformance

Whenever a DTD shall be constructed from a ddf model definition, this is guaranteed to be possible when only the following patterns are used:

  tags x1 = #empty
  --- or ---
  tags x2 = #chars
  --- or ---
  tags x3 = #implicit a, (b1 & b2? & b3?), (#chars, c1, c2, c3)*
  --- or ---
  tags x4 = #implicit a, (b1 & b2? & b3?), c1+, (c2 & c3)*, ... // any expr not containing #chars

  --- for being dtd-compatible, the last two patterns necessarily require
      the following content models and xml representations for their child nodes:

  tags b1, b2, b3 = #chars with xmlrep attribute
  // or
  tags b1, b2, b3 = #empty with xmlrep attribute
  // or
  chars b1, b2, b3 = // ... contents without structure!
       with xmlrep attribute

  tags c1 = .... with xmlrep element
  chars c2 = .... with xmlrep element
  enum c3 = .... with xmlrep element

  // "a" may be of any type
  // as an element, it is prepended to the grammar in x3, and a member
  // of the "mixture" in x4 (thus not restricted to only one(1) instance!)

The first two patterns convert trivially.

In the last two patterns some of the categories (indicated by the above examples) may be left empty. In particular, the child element tagged with "#implicit" does not always make sense.

The first example will be translated to "mixed content", while the second will create a "grammar kind" content model. In both cases the permutation among the b1, b2, .. is automatically realized by the "attribute" mechanism of XML.

The permutations in expressions over elements, ie. in the DTD content grammar, are replaced by sequentialization, and the output will be arranged in the sequence of the definition, not in the (arbitrary) sequence of the input. This corresponds to the semantics of the "&" operator, which is also not reflected in the model as such, cf. Section 2.3.2.

^ToC 2.4 Character Based Parsing

At the lowest level of nesting a certain tag (explicitly mentioned or inferred by a "#implicit" declaration) can initiate a second kind of parsing process. This operates without any tags, but is totally character based.

The grammars corresponding to these "character parsers" can be arbitrary context free. They are parsed in an expensive "longest match" discipline. This is possible since the character sequences parsed here are normally rather short. These character parsers are used for structured data elements like calendaric dates, personal names, reference numbers, etc.

E.g. see the following data base entry:

  #course ET-11f-09
  #lecturer Prof. Dr. Peter Pepper
  #start 23. Jan 2009
  #abstract ... ETC.

The standard XML representation of this input would be

  ET 11 f 09 Peter Pepper Prof. Dr. 23 1 2009 ... ETC. 

Which do you prefer to type? (Or even simply to read !-)

^ToC 2.4.1 Definition and Semantics of Character Parsers

The definition of character parsers follows basically the same syntax as for tag parsers:

 chars_parser_def ::= ( public ) ? chars ident ( , ident ) *
         = d_expr ( modifiers ? )

The basic expressions have to be extended as follows:

 atom ::= ... | stringconst | charset | decnum | hexnum | char_range
         | atom U atom | atom A atom | atom - atom | ...
 stringconst ::= " char * "
 charset ::= ' char * '
 decnum ::= ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9 ) +
 hexnum ::= ( 0x | 0X )
         ( 0 | 1 | 2 | 3 | 4 | 5 | 6 | 7 | 8 | 9
         | a | b | c | d | e | f | A | B | C | D | E | F ) +
 char_range ::= expr .. expr

stringconst stands for a sequence of characters enclosed in double quotes. Used as a character parser, it matches exactly this sequence in the input.

All constructs which represent a character set can be used as a character parser and match exactly one of the characters contained therein:
charset stands for a sequence of characters enclosed in single quotes. The sequence denotes a character set constant containing these characters.
(Please note that the empty instances of stringconst and charset have opposite meanings: the empty string "" always matches. The empty character set '' never matches !-)
A hexnum is a hexadecimal constant. It starts with "0x" or "0X". A decnum is a decimal constant, i.e. a sequence of decimal digits. Both represent the singleton set which contains only the one Unicode character with that numeric value.
A char_range "a..b" is a character set which contains all those characters whose numeric values are larger than or equal to that of "a" and smaller than or equal to that of "b".
The operators "U", "A" and "-" stand for union, intersection and subtraction of those character sets.
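The semantics of these character set constructs can be mimicked with Python sets. This is only an analogy: the helper names `charset`, `codepoint`, `char_range` and `matches_one` are invented here and are not part of ddf.

```python
def charset(chars):
    """Set constant, like 'abc' in single quotes."""
    return frozenset(chars)


def codepoint(n):
    """Singleton set, like a decnum or hexnum."""
    return frozenset({chr(n)})


def char_range(lo, hi):
    """Like lo..hi, both bounds inclusive (given as numeric values)."""
    return frozenset(chr(c) for c in range(lo, hi + 1))


def matches_one(cs, text):
    """A character set used as a parser matches exactly one character."""
    return len(text) >= 1 and text[0] in cs


digits = char_range(ord("0"), ord("9"))
hexletters = char_range(ord("a"), ord("f")) | char_range(ord("A"), ord("F"))  # U
hexdigits = digits | hexletters                                              # U
hex_vowels = hexletters & charset("aeiouAEIOU")                               # A
no_zero = digits - charset("0")                                               # -

print(codepoint(0x41) == codepoint(65))   # same singleton set {"A"}
print(sorted(hex_vowels))
print(matches_one(no_zero, "0abc"), matches_one(hexdigits, "0abc"))
# The empty string always matches, the empty set never does:
print("abc".startswith(""), matches_one(charset(""), "abc"))
```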

The grammar one level higher, for expr, must also be extended for the construction of character parsers.
As new combinators we get ...

 expr ::= ... | decor ~ decor | > decor

The first new combinator is the tight sequence, ie. the concatenation without intervening whitespace. Please note that the "," operator is still applicable, even with character parsers, and means juxtaposition with arbitrary intervening whitespace.

The second operator is a prefix which causes greedy parsing: in contrast to non-deterministic parsing, the expression (=x) will be applied to the input as long as possible and no subsequent expression will be started as an alternative. Please note that no backtracking will happen, ie. after x has eaten all it can, the following expression may fail even when it would have succeeded if x had stopped earlier.
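The effect of the greedy prefix can be shown with a hand-written Python sketch for one fixed, hypothetical grammar: a repetition of digits followed by the string constant "0". Both function names are invented for this illustration; the real parser is generic and works differently internally.

```python
def count_leading_digits(text):
    n = 0
    while n < len(text) and text[n].isdigit():
        n += 1
    return n


def parse_nondeterministic(text):
    """digit* followed by "0": every length of digit* is a candidate."""
    n = count_leading_digits(text)
    for k in range(n, -1, -1):
        if text[k:k + 1] == "0":
            return (text[:k], "0")
    return None


def parse_greedy(text):
    """> digit* followed by "0": digit* eats all it can, no way back."""
    n = count_leading_digits(text)
    if text[n:n + 1] == "0":
        return (text[:n], "0")
    return None


print(parse_nondeterministic("120"))
print(parse_greedy("120"))
```

For the input "120" the non-deterministic variant lets the repetition stop after "12", so the final "0" is still available. The greedy variant consumes all three digits and then fails, exactly the situation described above.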

The operators for permutation and substitution are not applicable in the definition of character parsers.

As new decorators we get ...

 expr ::= ... | atom ~* | atom ~+

Both of these mean tight repetition, i.e. repetition without intervening whitespace. Please note that the "*" and "+" operators are still applicable, even with character parsers, and mean repetition with arbitrary intervening whitespace.
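As a rough analogy, the difference between tight and ordinary repetition can be expressed with Python regular expressions. The two patterns are assumptions made for this sketch (a repetition over decimal digits) and say nothing about how d2d implements the operators.

```python
import re

# digit ~+ : one or more digits, no whitespace in between
TIGHT_PLUS = re.compile(r"[0-9]+")
# digit +  : one or more digits, arbitrary whitespace allowed in between
LOOSE_PLUS = re.compile(r"[0-9](?:\s*[0-9])*")

for text in ("2009", "2 0 0 9"):
    print(repr(text),
          bool(TIGHT_PLUS.fullmatch(text)),
          bool(LOOSE_PLUS.fullmatch(text)))
```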

Finally we have ...

 atom ::= ... | resultContainer | ...
 resultContainer ::= [ ident expr ]

This construct defines a sub-element in the generated output model.

As with tag parsers, there are two ways of referring to other definitions:
An identifier preceded by an "@" character is an insertion and means the inlining of the definition's grammar.
An identifier without this prefix is a reference, which will result in a sub-element in the generated xml model.
Such a reference can be to another character parser (in case of non-insertion this may even constitute cycles!), or to an enumeration. In the context of character parsing, a referred enum_def is treated as the disjunction of stringconsts which correspond to the front-end representation of the enumeration items.

There are some type constraints which have to be fulfilled inside the grammar definitions of character parsers:

1. Only character parsers and enumerations may be referred to by identifiers, not tag parsers.
2. All character parsers which describe exclusively sets of character sequences of the length "1", and only those, can be used as (/are called) character sets. (Enumerations are excluded!)
3. Only these character sets may appear as arguments of the character set operators "A", "U" and "-", either directly, as the value of expressions, or indirectly, by referring to definitions the value of which is a character set.
4. As operands of a character set operator, direct references and references with the insertion operator "@" are treated the same.

In detail each character parser contributes as follows to the constructed result:

1. In any case one (top-level) element node is constructed which is related to the character parser as its type. The tag used for this node in the output model is determined in the same way as with a tag parser: it is either implicitly the ident from the chars_parser_def, or a tag given explicitly by an xmlspec, see Section 2.5.1 below.
2. Iff there is no single sub-node defined in the parser's definition, then all accepted character input is taken verbatim as the contents of that node.
3. A character parser or an enumeration referred to in the grammar expression (directly, not as an insertion!) causes the generation of a new sub-node in the constructed output, according to the referred definition. (This definition also determines whether this node is an XML attribute or element, see Section 2.5.1 below.)
4. An explicit resultContainer causes the generation of a sub-element with the given ident as its tag.
[[FIXME no way to tailor xml encoding !?!?!]]
5. For a resultContainer which does contain sub-structure (other result containers, or non-insertion references to character parsers or enumerations), the sequence of the result nodes delivered by the parsing of these sub-structures makes up its contents.
All intervening character data is discarded, i.e. no "mixed content" is generated.
6. But for a resultContainer which is in a leaf position, i.e. does not contain further structure, the corresponding node in the output model is filled with all the accepted character data as its contents.

^ToC 2.4.2 Character Parsers and DTD Content Models

The disjunction of all result containers with the same name in the same module yields an expression describing all possible contents with this tag in the generated XML output. But this expression is possibly not conformant to the LL(1) rules of the XML/DTD standard [xml]. It is feasible to derive a true "DTD content model" automatically, but the result is in most cases not very ergonomic. Furthermore, changes in the front-end grammar can cause hardly predictable results in the data structure, which is never desirable.

Therefore d2d takes the opposite approach when exporting character parser definitions to a DTD, e.g. for data definition: the simple disjunction expression is used as content model, but can be overridden by an explicit data description grammar expression.

A typical use case is that different variants of the input parsing grammar correspond to an application of the permutation operator in the data description. The generated XML document follows the structure of the data contents definition, esp. w.r.t. the sequential order of permutation expressions, etc.

These data declarations are realized by a further kind of top level definitions:

 chars_data_def ::= data ident ( , ident ) * = d_expr ( modifiers ? )

The identifiers are those of the character parsers and result containers whose content models are to be defined; the expression may mention only those result containers which appear directly in the subexpressions of the parser definitions.

The equivalence of both expressions is checked as soon as the DTD export task is executed. More details can be found in an article dedicated to this problem [FIXME_missing_txtVsData not found].

^ToC 2.4.3 Execution of Character Parsers and Their Limitedness

As mentioned above, character parsers are executed in a parallel, breadth-first search. They swallow all kinds of characters which are accepted by their regular expression, regardless of the current settings of command character and comment character!

W.r.t. the precedence of tag level command character and comment lead-in character, vs. user defined parsers, three solutions seem sensible:

1. The user defined character parser take precedence over the current command character and all comment syntax.
2. Exactly the other way round: the current command character and all comment syntax takes precedence over the user defined character parsers.
3. The user defined character parser can refer to the state of the current command character by using the character "#". As soon as the command character is mapped to a different character, all user defined parsers follow this mapping and refer to the new command character, wherever "#" appears. This could be useful for defining parsers which shall look like genuine d2d command syntax, and change their look and behaviour in sync (cf. our "mathMl" frontend).

Currently, the solution (1) is taken, i.e. user defined character parsers always take highest precedence.

As a negative consequence, it therefore becomes really easy to write (unintentionally !-) a character parser which is non-terminating! This parser will swallow the whole file contents, and terminate with a "premature end of file" error.

So it is better to design the character parsers in a way which guarantees safe termination. This can be called limitedness, and is most easily accomplished e.g. by not accepting certain groups of characters. E.g. the parser for "identifier" from the standard application, "d2d_gp.basic.sets:ident", accepts only alphanumeric characters plus "-" and "_" and rejects any whitespace. This is nice to type and sure to terminate soon.

With the abovementioned parser for identifiers, the following lines of input are equivalent:

  foo foo foo #ident xzy foo foo foo
  foo foo foo #ident xzy #/ident foo foo foo
  foo foo foo #ident(xzy)foo foo foo
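Limitedness can be illustrated with Python regular expressions. The sketch assumes an ASCII-only reading of "alphanumeric characters plus '-' and '_'"; the names `IDENT`, `UNLIMITED` and `parse_ident` are invented here and are not taken from d2d_gp.

```python
import re

# Limited: stops at the first character outside the accepted set.
IDENT = re.compile(r"[A-Za-z0-9_-]+")
# Unlimited, comparable to (S:all)*: accepts everything up to end of input.
UNLIMITED = re.compile(r".*", re.DOTALL)


def parse_ident(text, pos=0):
    """Return (identifier, end position) or None if nothing matches."""
    m = IDENT.match(text, pos)
    return (m.group(), m.end()) if m else None


print(parse_ident("xzy foo foo foo"))   # stops at the blank
print(parse_ident("xzy)foo foo foo"))   # stops at ")"
print(UNLIMITED.match("xzy foo\nbar").end())  # swallows the whole input
```

Because the identifier parser rejects whitespace and punctuation, it ends by itself after a few characters, whereas the unlimited variant only ends at the end of the input.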

^ToC 2.5 Modifiers

The definitions so far are completed by modifiers which allow modifications of the generated output:

 modifiers ::= with modifier ( , modifier ) *
 modifier ::= xmlspec | editspec | inputspec | postprocessor | localnodes
 xmlspec ::= xmlrep ( trimmed | untrimmed ) ?
         ( att | el | cdata ) ?
         ( = ( ident | stringconst ) stringconst ? ) ?
 editspec ::= // foreseen for fine-tuning the behaviour of a syntax controlled editor

The meaning of postprocessor is explained below, in Section 2.10, the meaning of localnodes in Section 2.8.2.
editspec is reserved for the tuning of a future syntax controlled editor.
The other kinds of modifier are explained in the following sections.

^ToC 2.5.1 Tuning the XML Representation of the Generated Output

The xmlspec specifies the kind of generated nodes. As a default, an element is generated. If an attribute is wanted instead, this has to be specified by

  ... with xmlrep att 

Character data contents can be trimmed or untrimmed w.r.t. leading and trailing whitespace. The default for this can be set on a per module basis or by default groups (Section 2.8.1). This default can be overridden here for each parser definition individually.

The tag for the input text is always the ident which appears in the definition of the parser, to the left of the equals sign in the rule tags_parser_def, chars_parser_def or enum_def.
The tag used in the xml output also defaults to this tag, but can be overridden here by the ident or stringconst.
It must be overridden in case a name space shall be used for this tag which is not the default name space, see Section 2.8.3.

If the element itself is empty but shall be represented by a certain attribute value, then the second string parameter (the very last stringconst) gives this value.

Example: In our "easy-to-type" version of xslt (chapter 3), the "disable output escaping" property of an xslt:text element, which is "off" as a default, is switched on by simply including one empty element. In the generated model it must of course be realized as defined by the standard, i.e. by an attribute with one certain value.
This is achieved by the definitions ...

  tags text = noescape?, (#chars)*
       with xmlrep untrimmed element
  tags valueof = #implicit xpath, noescape?
  tags noescape = #empty
       with xmlrep att = "disable-output-escaping" "yes"
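What this mapping means for the generated output can be sketched in Python with the standard ElementTree module. The table `ATTRIBUTE_REP` and the function `build_text` are hypothetical stand-ins for the modifier evaluation, and the xslt name space is left out for brevity.

```python
import xml.etree.ElementTree as ET

# Assumed mapping, mirroring the "noescape" definition above:
# empty child element -> (attribute name, attribute value)
ATTRIBUTE_REP = {"noescape": ("disable-output-escaping", "yes")}


def build_text(empty_children, chars):
    """Build the output node for a parsed "text" element."""
    node = ET.Element("text")
    for child in empty_children:
        name, value = ATTRIBUTE_REP[child]
        node.set(name, value)      # the empty element becomes an attribute
    node.text = chars              # untrimmed: whitespace is kept as typed
    return ET.tostring(node, encoding="unicode")


print(build_text(["noescape"], "<br>"))
print(build_text([], "plain"))
```

The author only types the empty element; the attribute name and its fixed value come entirely from the definition.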

^ToC 2.5.2 Verbatim Input Mode

The verbatim input mode is a modification of the parser and lexer processes for parsing character content. It is meant for the content of those element types which quote syntactic structures from other computer languages. These should be preserved verbatim, so the d2d parsing should interfere as little as possible.

1. When the character content of an element "e" is parsed in verbatim mode then no closing tags are inferred.
2. That means that only the opening tags of those nodes are recognized which are directly enlisted in the content model of "e".
3. These tags are only recognized if nothing (neither whitespace nor comments) stands between the command character and the tag name.
4. Whenever a command character appears but is NOT followed by the name of a directly contained child node, this command character is treated as normal character data input.
In this case a warning is issued by the tool.
5. Esp. there is no more idempotency of the command character (cf. Section 2.2). From a sequence of more than one commandchars, all but the last are normal character data in any case. (Because they are not followed immediately by a tag !-)
6. Also the end tag of "e" itself has to be written this way!
7. Comments however are recognized in verbatim mode in a normal way.
8. The built-in meta commands (see Section 2.2) are NOT recognized (only, as mentioned above, the tags contained directly in the content model of "e"!)

The verbatim mode is activated by a certain alternative in the modifier production from above:

 inputspec ::= with input verbatim

The generation of warnings can be suppressed by using a built-in meta-command like suppressVerbatimCommandCharWarning 17, which will suppress the next 17 warnings.
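The recognition rules of the verbatim mode can be approximated by the following Python sketch. It is a simplification under stated assumptions: comments are not modelled, no blank is swallowed after a recognized tag, and the names `tokenize_verbatim`, `NAME` and `recognized` are invented here. The caller passes the child tags of "e" plus the closing tag of "e" itself.

```python
import re

NAME = re.compile(r"/?[A-Za-z][A-Za-z0-9_-]*")


def tokenize_verbatim(text, recognized, cmd="#"):
    """Split text into ("tag", name) and ("chars", data) tokens.

    Returns the token list and the number of warnings, i.e. of command
    characters which were treated as plain character data.
    """
    tokens, buf, warnings = [], [], 0
    i = 0
    while i < len(text):
        if text[i] == cmd:
            m = NAME.match(text, i + 1)   # name must follow immediately
            if m and m.group() in recognized:
                if buf:
                    tokens.append(("chars", "".join(buf)))
                    buf = []
                tokens.append(("tag", m.group()))
                i = m.end()
                continue
            warnings += 1                 # plain data, but worth a warning
        buf.append(text[i])
        i += 1
    if buf:
        tokens.append(("chars", "".join(buf)))
    return tokens, warnings


tokens, warnings = tokenize_verbatim("#include <x> ##b bold#/code", {"b", "/code"})
for token in tokens:
    print(token)
print("warnings:", warnings)
```

In the example "#include" is not a child of the hypothetical element "code" and therefore stays character data, and of the two adjacent command characters only the last one introduces the tag "b".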

^ToC 2.6 Enumerations

FIXME
IF an enumeration occurs EXCLUSIVELY EMBEDDED in a char parser, then it does not need an XML tag. It can more sensibly appear simply as a disjunction. (Therefore "xmlrep none" is NOT SENSIBLE there!)
OTHERWISE it needs an encoding (all kinds of xml ns-names can be used! Also more than one!)
Occurrences: embedded in char parser, tagged in char parser, tagged in tag parser!

The WARNING in Resolver4 about the re-use of an xml tag ("listSymbol") is caused by the fact that in XRegExp an "@X" operator replaces the definition X by its CONTENT, which in turn consists of definitions. So "X" does not appear as a definition in the expanded module at all. But this does not work for enumeration definitions, because they are needed for parsing. Hence the WARNING. In DTD generation, however, "@enum" should also be resolved. FIXME. (i.e. simply PCDATA instead of the complicated tagged enum translation!)

-----------------------------------------------

Enumerations are just for convenience, and could as well be realized by character parsers or tag parsers, followed by some xslt processing.
Their definitions consist of alternatives of alternatives, and offer more ways of automatically encoding the selected value into the output.

Their definition follows the syntax

 enum_def ::= enum ident =
         ( #GENERIC | enum_item ( , enum_item ) * enum_modifier ? )
 enum_item ::= ( ident | stringconst ) decnum ?
 enum_modifier ::= with xmlrep ( empty ) ?
         ( attribute | element ) ( = stringconst ) ?
         ( numeric | first name | as is ) ?

Every enumeration consists of a set of strings, called enumeration items. Each item is assigned to a numeric value.

The definitions with ident and stringconst are totally equivalent and may be mixed arbitrarily. The latter may always be used. It must be used when the identifier is not an "ident" in the sense of the ddf lexer. So it is possible to define enumerations with items which are not identifiers:

  enum alignments = "<-", "->", "-><-" , "<->" with xmlrep attribute numeric 

The forms with and without decnum may not be mixed: Either each enum item is given a numeric representation explicitly, or no single number appears in the definition and all items are numbered automatically. If numbers are given explicitly, the same number may be assigned to more than one item.

The enum_modifier defines how the recognized item is represented in the xml output:

1. If "numeric" is given, then the numeric value of the selected enum item appears in the output. If "first name" is given, then the identifier of the first item in the sequence of definitions which has the same numeric value as the recognized one is taken. If "as is" is given, the input character sequence is taken verbatim; this is the default case.
2. "attribute" and "element" select the kind of node which is constructed.
3. If there is a stringconst, then this is used as the stem/prefix of the name of the created XML node, otherwise the name of the enum is used.
4. If "empty" is not given, then the result is simply a node with this stem as its name, and the representation of the enum item, as determined above, as its content.
5. Otherwise, i.e. if "empty" is given, the result is an empty element with the concatenation of the stem and the representation of the item as its name, or an attribute with this name and identical (redundant) value. Of course this encoding can only be selected if the representations are valid parts of identifiers, e.g. not with the "alignments" type above!
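The encoding rules listed above can be traced with a small Python sketch. It is a simplified model, not the d2d code: the function `encode_enum`, its parameters and the sample enumeration "weekday" are hypothetical, and the "empty" form is only shown for elements.

```python
from xml.sax.saxutils import escape, quoteattr


def encode_enum(items, selected, stem, rep="as is", node="element", empty=False):
    """Render one recognized enum item as an XML fragment (string).

    items:    list of (name, number) pairs in definition order
    selected: the item text found in the input
    stem:     name (or name prefix) of the generated node
    """
    number = dict(items)[selected]
    if rep == "numeric":
        value = str(number)
    elif rep == "first name":
        # first item in definition order carrying the same number
        value = next(name for name, num in items if num == number)
    else:  # "as is", the default
        value = selected
    if empty:
        # stem and representation together form the element name
        return "<" + stem + value + "/>"
    if node == "attribute":
        return stem + "=" + quoteattr(value)
    return "<" + stem + ">" + escape(value) + "</" + stem + ">"


# hypothetical enum with explicit numbers; "mon" and "monday" share number 1
weekday = [("mon", 1), ("monday", 1), ("tue", 2)]
print(encode_enum(weekday, "monday", "weekday"))
print(encode_enum(weekday, "monday", "weekday", rep="numeric", node="attribute"))
print(encode_enum(weekday, "monday", "weekday", rep="first name"))
print(encode_enum(weekday, "monday", "day-", rep="first name", empty=True))
```

The "first name" variant is what makes synonyms in the input collapse into one canonical value in the output.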

Enumerations can (1) be referred to in tag parsers, and (2) referred to and (3) inserted into character parsers.
In case (1) they appear in the input with their name as their tag, followed by the selected enum item.
In cases (2) and (3) the tag does not appear in the input, as usual with character parsers. The difference will be in the output: In (2) a node is constructed as in (1), but in (3) the character parser will interpret the enumeration only as one big alternative of string constants, discarding all encoding and numerical values.

  tags p = (alignment?), (#chars|q|r)*
  chars q = (alignment?), [rest(S:digit)*]
  chars r = (@alignment?), (S:digit)*
  ------
  #p #alignment <->here start the chars#/p
  // yields //

here start the chars

#p #q<->17 here start the chars#/p // yields //

17 here start the chars

#p #r<->17 here start the chars#/p // yields //

<->17 here start the chars



So obviously enumerations do not contribute essentially new features, but support easier processing by high flexibility of encoding and a further translation step. This makes a post-processing superfluous which would imply distribution or duplication of information.

^ToC 2.7 Incomplete Input and Signaling of Errors

In the first course of an authoring process parts of a document may stay incomplete. E.g. bibliographic entries, abstracts, or certain paragraphs of the text body may require completion in a later phase of work. This may refer to semantic completeness of the contents, as well as missing child elements which are syntactically required.

The latter case is supported by d2d by the "brute force" end tags and empty tags, as already mentioned above, Section 2.3.1.

  #p This PhD thesis is organized as follows:
  #list/// //structure still missing, FIXME !
  Evidently, this clean structure will help a lot ....

  #bibentry rahner59 #title Das Kirchenjahr
  // shit, WHERE was it published ?? TODO find out !!!
  #///bibentry

The first example defines a "list" element which must contain at least one "list item", according to its definition. But the author does not yet know how to fill it, so he/she leaves it voluntarily in an incorrect state w.r.t. syntax.

In the second example there are many obligatory fields defined for a bib entry (like author, title, place, year, kind, etc.). Many of them are missing, so the element is syntactically invalid and marked as such.

The reaction of the d2d tool to incomplete input is basically the same, whether this is intentional and marked as such, or erroneous: It generates special "meta-elements" in the generated model.
These follow roughly the definition

 module d2d-meta
   xmlns d2d = "http://bandm.eu/doctypes/d2d_gp/d2d-meta" is default
   import S from basic.sets
   public tags parsingError = kind & tag & (expected | skipped)
   tags expected, skipped = #chars
   enum kind = open, close with xmlrep att
   chars tag = @S:ident with xmlrep att
 end module

E.g., the standard xslt translation from our general purpose text model into xhtml makes these positions of the source document visible by presenting the information concerning the missing contents with most ugly colouring.

There are always two cases:

First, an input tag may be found somewhere later in the currently active stack of content models. In this case a sequence of nodes is considered to be missing. The tool inserts the wording of the offending tag, and a synthesized grammar expression, which represents the structure of the missing input. Please note that this grammar may include required components of different levels of the document type definition, so it may never appear in the definition texts as such!

For example:

 type definition:
   tags a = b, c, d
   tags b = e, (f|d)+
   tags c,d,e,f = #empty
 input text:
   #a #b #e #d
 resulting document:
   (f|d)+,c

The second case is that a tag cannot be considered as "correct but too early". In this case the tag must be discarded. All input up to the next command character is skipped and reported as such in the generated model. The parsing process continues with the next tag, which, of course, can again run into this error situation.

An example:

 type definition:
   tags a = b, c
   tags b,c = #empty
 input text:
   #a #b some text #x more text #c#d
 resulting document:
   some text #x more text

Similar reactions are of course possible on misplaced closing tags.
(Currently, erroneously appearing closing tags and explicitly written premature-closing tags are not distinguished in the generated model, but this could change.)

Beside this generation of output, the tool can be configured

1. whether to accept more than one such error,
2. whether in an error situation it shall present exhaustive information about the parsing context via an interactive channel, e.g. the terminal,
3. whether to accept explicitly marked incomplete contents as type correct or as erroneous,
4. whether the successful translation of an incomplete document shall be signalled as success or as failure to the caller, e.g. a make system.

^ToC 2.8 Modules, Substitutions and Parametrization

The syntax of each module is defined as ...

 module ::= module ident         ( defaultDeclaration | importItem | nameSpaceDecl ) *         ( module | defaultGroup | definition ) *         end module
 definition ::= tags_parser_def | chars_parser_def | enum_def | chars_data_def | ...
 ident ::= ASCII_letter ( ASCII_letter| ASCII_digit| _ | - ) *

An ident is a simple identifier as known from "ancient C", since only ASCII letters and digits are permitted.

Modules may be nested, and the name of a module is its identifying path, i.e. the concatenation of the names of all containing modules, in descending order, combined by a dot ".".

The nesting of modules does not imply any inheritance. It is just for organizing the names of modules and for putting more than one module into one text file. Each definition file must contain one top-level module, and the name of that module and the name of the file must be in a certain relation which allows the tool implementation to find the definition file from the name of a module.

The purpose of a leaf module is to contain definition s.

Each definition will assign one or more ident s to the parsers it defines, see the grammar rules for the different kinds of definitions tags_parser_def, chars_parser_def or enum_def.

Additionally, a module can contain other modules, and a module can contain importItem s which import definitions from other modules.

For these different kinds of objects there is only one single name space per module. So every ident can only be used once on the top level of a certain module as the name of a module, of a parser definition, or for an import item. Beside this, the enum_item s belonging to one enum_def form a separate name space, and so do local parser definitions, see Section 2.8.2 below.

^ToC 2.8.1 Default Declarations

Currently the following default declarations are supported:

 defaultDeclaration ::= tagManglingDirective | globalTrimming
 globalTrimming ::= plain text is ( trimmed | untrimmed )

The globalTrimming simply sets the default decision whether element and attribute content shall be trimmed of whitespace at both ends of their string contents. This default itself defaults to "untrimmed". It may be overridden for every parser definition individually, see Section 2.5.1.

The tagManglingDirective defines how the tags for the XML back-end are derived for locally defined parsers, and is explained in the context of Section 2.8.2.

These defaults can be defined at the beginning of a module, but can also be put into a dedicated defaultGroup, the only purpose of which is to restrict the scope of a certain default. This may be very useful e.g. if a certain group of parsers shall have trimmed content, or a specially mangled xml tag, but the rest of the module does not care or wants it explicitly different. The construct, as appearing in the definition of module above, is

 defaultGroup ::= begin defaults ( defaultDeclaration ) *         ( definition | defaultGroup ) *         end defaults

^ToC 2.8.2 Local Definitions

Many conventionally used names are re-used quite frequently in very different contexts. So ident s like "title", "name", "id" or "ident" , "number" or "num", "key" or "caption" are sensible in many different contexts, --- as tags for possibly very different sub-structures.

d2d supports the re-usage of ident s as tags and as names by supporting local definitions.

As mentioned above, the regexp for modifier includes as one of its alternatives localnodes, which is defined as ...

 localnodes ::= local definition * end local

All the definition s which are local to a "containing" definition share one dedicated name space. So the same ident may be used for very different parser definitions, as long as they reside in different local scopes.

Normally these definitions are intended to be used in the regular expression of that "containing parser". Therefore all reference s in the expr which serve as the containing parser's content model definition are (in a first step) resolved against this local scope. If no definition is found, then the next higher containing definition is searched for a local definition, and finally the global scope of the module.
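The lookup order just described (innermost local scope first, then each containing definition, finally the module's global scope) can be sketched as follows. The data layout and the function name are assumptions for this illustration only; they do not mirror the tool's internal API.

```python
def resolve_reference(name, containing, local_defs, global_defs):
    """Resolve an unqualified reference written inside nested definitions.

    containing  -- idents of the enclosing definitions, outermost first
    local_defs  -- maps a qualified definition name to the set of idents
                   defined in its "with local ... end local" part
    global_defs -- idents defined on module level
    Returns the qualified name of the definition that is found.
    """
    path = list(containing)
    while path:                       # walk from the innermost scope outwards
        scope = ".".join(path)
        if name in local_defs.get(scope, ()):
            return scope + "." + name
        path.pop()
    if name in global_defs:           # finally the global scope of the module
        return name
    raise KeyError(name)


local_defs = {"link": {"url"}, "image": {"alt"}}
global_defs = {"link", "image", "text"}
print(resolve_reference("url", ["link"], local_defs, global_defs))    # link.url
print(resolve_reference("text", ["link"], local_defs, global_defs))   # text
```

In this model a reference to "url" written inside "image" fails, which is why that definition has to be addressed by a qualified name such as "link.url".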

But this is only a convention and an abbreviated notation. This coincidence is the most frequent case, but it is not necessarily so! Indeed, locally defined parsers can be used, i.e. referred to or inserted, everywhere. Outside of the expr of the containing parser they must be identified in a different way, namely by using their "qualified name", which is the sequence of the ident s of all parsers they are contained in, starting from module level, separated by a dot ".".

For example ...

  tags link = #implicit url, (text?, when?, loc?)
    with local
      chars url = (S:letter~+)~"//~([frame S:letter~*]~"/")*
      // etc.,let it be a rather complete and complicate parser declaration!
    end local
  tags image = link.url, alt, (width? & height?)
    with local
      alt = #chars
    end local
  tags table = xxxx, (caption?), xxxx
    with local
      caption = @image.alt
    end local

Please note that the tagging in the input source is independent from the position of the definition: it is only defined by the ident used in the definition of the parser. These tags must be unambiguous w.r.t. the LL(1) property anyhow, totally independent from the position of the defining parser in the scope of the defining module. The world of definitions and the world of text input is only loosely related.

It is sensible on the input side to let the user simply write "title", in the context of a picture or a person or a chapter.

On the other hand, in most cases the tags in the generated XML output model shall differ, so that subsequent processing and modelling can differentiate. So here a tag like "picture_title" or "chapterTitle" would be welcome.

First of all, the xml tag of a local node can be overridden explicitly to a certain string value, as it is possible with top-level definitions, by including this tag string in an xmlspec, cf Section 2.5.1.

 tagManglingDirective ::= local node xmlrep naming =         ( join by stringconst | join upcased | no mangling )

Such a tagManglingDirective can appear at the start of the module, or in a dedicated defaultGroup. It defines how the xml tag of a local definition is derived from its ident and the xml tag of the containing definition.

So the xml tag of a local node is either ...

1. given explicitly,
2. or equal to its identifier (if mangling = "no mangling"),
3. or equal to the concatenation of the xml tag of the containing parser and its up-cased ident (if mangling = "join upcased")
4. or equal to the concatenation of the xml tag of the containing parser, followed by a certain separator string, and finally its own ident (if mangling = "join by ...")

If no mangling directive is set in the source file, it defaults to "no mangling", and a warning is issued whenever this default is applied.
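The four ways of deriving the xml tag of a local node can be summarized in a short sketch. It is written in Python for illustration and is not the tool's code; in particular, reading "up-cased ident" as "first letter capitalized" is an assumption, suggested by the "chapterTitle" example above.

```python
def local_xml_tag(ident, parent_tag, mangling="no mangling",
                  separator="", explicit=None):
    """Derive the xml tag of a local definition (hypothetical helper)."""
    if explicit is not None:            # 1. given explicitly in an xmlspec
        return explicit
    if mangling == "no mangling":       # 2. the ident itself
        return ident
    if mangling == "join upcased":      # 3. parent tag + up-cased ident (assumed: first letter)
        return parent_tag + ident[:1].upper() + ident[1:]
    if mangling == "join by":           # 4. parent tag + separator + ident
        return parent_tag + separator + ident
    raise ValueError("unknown mangling mode: " + mangling)


print(local_xml_tag("title", "chapter", "join upcased"))               # chapterTitle
print(local_xml_tag("title", "picture", "join by", separator="_"))     # picture_title
```

Since the parent tag is itself the result of this rule, deeper nesting simply accumulates the names of all containing definitions.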

^ToC 2.8.3 XML Name Spaces

The namespaces for which tags are generated can be declared at the beginning of each module:

 nameSpaceDecl ::= xmlns ident = stringconst         ( is ( element ) ? default ) ?

One or more of these statements may appear at the start of a module. All these statements will be inherited by all directly contained modules. But these can of course override.

In each of these statements the ident defines a prefix by which this namespace will be addressed in the xmlspec definitions of the following parser definitions. This prefix is totally arbitrary. It only needs to be unique among all nameSpaceDecl s of this module.

The stringconst is the namespace uri which is to be addressed. This uri connects to "outer reality" and in most cases has a well-known meaning, an owner, and a role defined by convention, etc.

The prefix defined here is likely to be the prefix chosen when writing out the model, but of course this cannot be guaranteed, and it has only some small impact, namely on the readability by humans.
((
This is not quite true because some browsers, in spite of claiming to support xhtml, which is an instance of xml, nevertheless require a certain prefix to be used for recognizing xhtml elements. This is bad behaviour! We support this only insofar as we can say it is likely that the same prefix will be used for writing out an XML model to a text file as has been used for declaring the namespace in the ddf definition.
))

The assignment of xml name spaces to xml tags for the different parsers works as follows:

1. At most one of the nameSpaceDecl s may include the suffix "is default".
2. All xml tags (regardless of whether given explicitly, or implicitly by the ident of the parser, or calculated by the mangling rule as described in Section 2.8.2) may contain at most one colon character ":".
3. If there is such a colon, then the characters before are treated as a prefix, and the characters after make up the "local" xml tag. Then there must exist a namespace statement in this module which assigns a namespace uri to this prefix. The node is assigned the local tag as its xml tag, living in the indicated name space.
4. If there is no such nameSpaceDecl for the prefix, an error is signalled.
5. If there is no such colon character in the xml tag, the tag is taken as a whole.
If there is an xml name space declaration marked with "is default", then such a tag is living in the namespace identified by the uri of this declaration.
If there is an xml name space declaration marked with "is element default", then the tag is living in the namespace identified by the uri of this declaration iff the node is an element node, not an attribute node.
6. If no such declaration is present, the tag lives in the "empty string uri" namespace, also called the "no namespace namespace" in the very cryptic sense of xml [xml-ns].
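The assignment rules above can be read as a small decision procedure. The following Python sketch only illustrates that reading; the function signature, the way the two kinds of default are passed in, and the precedence between them are assumptions of this example.

```python
def assign_namespace(tag, prefixes, default_uri=None,
                     element_default_uri=None, is_element=True):
    """Return (namespace uri, local tag) for an xml tag (hypothetical helper).

    prefixes            -- maps declared prefixes to namespace uris
    default_uri         -- uri of the declaration marked "is default", if any
    element_default_uri -- uri of the one marked "is element default", if any
    """
    if tag.count(":") > 1:                       # rule 2
        raise ValueError("more than one colon in " + tag)
    if ":" in tag:                               # rule 3
        prefix, local = tag.split(":")
        if prefix not in prefixes:               # rule 4
            raise KeyError("undeclared prefix " + prefix)
        return prefixes[prefix], local
    if default_uri is not None:                  # rule 5, "is default"
        return default_uri, tag
    if element_default_uri is not None and is_element:   # rule 5, elements only
        return element_default_uri, tag
    return "", tag                               # rule 6: no namespace


ns = {"d2d": "http://bandm.eu/doctypes/d2d_gp/d2d-meta"}
print(assign_namespace("d2d:kind", ns))
print(assign_namespace("kind", ns, element_default_uri="urn:example", is_element=False))
```

The second call shows the difference between the two kinds of default: an attribute without prefix stays outside the "element default" namespace. The uri "urn:example" is a placeholder.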

^ToC 2.8.4 Importing Definitions From Other Modules

Modules may import other modules for referring to parsers defined therein:

 importItem ::= import ident from         ( #GENERIC | modulePath moduleSubst * )
 modulePath ::= ident ( . ident ) *

The imported module can be subject to substitutions (=re-writings), specified by moduleSubst. This is the central means for parametrization of modules and described later in detail.

References to definitions contained in the imported module are written in the following text using the declared identifier as a prefix, as in ...

  import E from basic.elements // .. tags myDef = E.importedDef 

The descending into local definitions, as described in Section 2.8.2 above, is also possible with imported definitions. So we get the same grammar as above, but with a different meaning:

 reference ::= ( ident . ) * ident

The leading sequence of ident s can be a sequence of import prefixes, pointing to an import statement in the current module, then to an import statement in this imported module, etc. This sequence is followed by a second sequence of ident s, which descends into definitions and local definitions.

Since there is only one name space for all things contained in a module (import keys, local module names and definition names), as mentioned above, these sequences are always unambiguous.

(Please note that modules do inherit nearly nothing from the context they are contained in. So their source text can be moved around arbitrarily.)

The only context dependencies concern the locating algorithm for imported modules. The modulePath of every imported module is resolved by first finding a module according to the first ident component. Then we descend into the sub-modules of this module, treating the following components as their names. The top-most ident is resolved by testing ...

1. whether it denotes a module relative to the importing module, i.e. declared therein as its sub-module,
2. or, if this is not the case, whether it is the name of a sibling, i.e. a sub-module of the containing module of the importing module,
3. and only at last it is tried as the name of a top-level module. This is mapped to a file (or some other resource) in the same way as with the initial module when parsing a text input, see Section 2.1.1. This is implementation dependent.

(Please note that this is just "syntactic sugar" to allow a more convenient moving around of module source fragments, when developing and refactoring.)
(The price for this is, that a top-level module with the same name as a child or sibling module can currently not be addressed.)
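The three-step search for the first component of a modulePath can be sketched as below. This Python model represents all known modules as a set of dotted paths, which is an assumption of the example; the real tool maps top-level names to files in an implementation-dependent way.

```python
def locate_module(module_path, importing, known_modules):
    """Find the full path of an imported module (hypothetical helper).

    module_path   -- the modulePath as written in the import, e.g. "a.b"
    importing     -- full dotted path of the importing module
    known_modules -- set of full dotted paths of all existing modules
    """
    first, *rest = module_path.split(".")
    candidates = [importing + "." + first]            # 1. sub-module of the importer
    if "." in importing:                              # 2. sibling of the importer
        parent = importing.rsplit(".", 1)[0]
        candidates.append(parent + "." + first)
    candidates.append(first)                          # 3. top-level module
    for root in candidates:
        if root in known_modules:
            full = ".".join([root, *rest])            # descend into sub-modules
            if full not in known_modules:
                raise LookupError("no module " + full)
            return full
    raise LookupError("no module " + module_path)


known = {"webpage_italian", "webpage_italian.calendaric_italian",
         "calendaric_italian", "basic", "basic.deliverables"}
print(locate_module("calendaric_italian", "webpage_italian", known))
print(locate_module("basic.deliverables", "webpage_italian", known))
```

The first call also shows the price mentioned above: the hypothetical top-level module "calendaric_italian" is shadowed by the sub-module of the same name.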

^ToC 2.8.5 Parameterization Of Modules and Substitutions

For library modules to be useful for most different purposes, there must be some mechanism for tailoring, parametrizing and extending them. Here, d2d takes an approach of "glass box parametrization", based on free rewriting.

Dedicated references may be declared as being parameters, which the user of a library module must instantiate when using the module. This is done by declaring a tag parser, a character parser or an enumeration as "#GENERIC" (as can be seen in the definitions d_expr and enum_def above) and then referring to this placeholder, or by writing a generic module import, cf. importItem .

But this is only a special case. Basically, a user may re-write any definition of the imported libraries, in any way he/she wants to. But this of course requires the access to the source, the knowledge of the definitions' structure. The advantage is that the author of the library just presents a prototype, and does not restrict its usage. The library can evolve in directions unforeseen and unforeseeable by that author!

The parameterizations of an imported module fall into three classes:

1. replace an imported module as a whole by a different one,
2. replace all references with a certain name with an expression,
3. replace all references with a certain name with an expression, but only in one certain declaration.

The corresponding rules are ...

 moduleSubst ::= ^ ( ident / ident )
         | ( in reference ) ? substitution
 substitution ::= ^ ( ( expr | #none ) / reference )

So there are several, very different kinds of substitutions. But the syntax of all of them follows the same pattern:

 in this context
 |          insert
 |          |   this expr
 |          |   |              instead of this reference
 |          |   |              |
 modulePath ^ ( importPrefix / importPrefix )
 modulePath ^ ( expr         / reference    )
 expr       ^ ( expr         / reference    )

The first form is special, and only allowed in module import commands. Here the substitution is one step more indirect than the headline suggests: the ident preceding the "/" must correspond to a second importItem in the same, importing module. The second ident, however, must refer to an import item in the imported module itself.

This allows the exchange of whole groups of definitions:

  module webpage_italian
    import B from basic.deliverables ^(MYCAL / CAL)
    import MYCAL from calendaric_italian
    module calendaric_italian
      enum month = januario, februario,// etc.
      // etc.
      chars date = // etc
    end module
    public tags website = @B:website
  end module

This substitution instantiates the module "basic.deliverables", but replaces the "calendaric module" (i.e. the module which is imported by this module as a library for calendaric data definitions and parsers) by a new version, contained in the module "webpage_italian.calendaric_italian".

This module (instead of the original) will be addressed by all references containing the prefix "CAL:" in the instantiated module. So finally a new, own parser definition "website" can be constructed, by inserting the parser definition from this instantiated module "B".

Prerequisite for this to work is that the user knows the names of all definitions which are used in "basic.deliverables" with the prefix "CAL:", i.e. the signature of imported definitions, because those must be replaced completely and in a type correct way.

Please note that only simple prefixes without any "." inside can be used for this kind of substitution. The forms "import A from a ^(B.X/C)" and "import A from a ^(B/C.X)" are both illegal as soon as B or C is an import prefix.

In the second, general form the appearing non-terminals are what they seem: The re-written part (=the replaced sub-expression of "some context") is always a reference, and what is to replace it is an arbitrary expr, or the special value "#none".

(Currently there is no way of replacing complex expressions. But since in most cases parametrization of modules and parsers aims at extending some pre-defined alternatives, this was not yet necessary in practice !-)

In case of a moduleSubst, the expr to insert is (of course !-) evaluated in the context of the importing module.

In case of a moduleSubst, the optional "in reference"-part allows to restrict the substitution to the body of one certain parser in this imported module. This target is identified by this reference, of course resolved in the imported module.

Substitutions may also occur independently of a module import. This is reflected by extending the rule from above by ...

 atom ::= ... | atom substitution

This allows derivation of new parser definitions from existing ones. Both mechanisms are extensively used in the construction of our standard text format "d2d_gp", as explained in chapter 6. Please see there for instructive examples.

^ToC 2.8.6 Combinations of Substitutions and Insertions

The usage of substitution is esp. powerful in connection with insertion. But here also some caveats have to be considered:

First, the references of inserted expressions coming from imported definitions are evaluated in their original context. In other words: statically bound references are inserted after binding, not mere front-end identifiers:

  module outer
    module A
      tags a = b, c
      tags b,c = #empty
    end module
    import A from A
    tags x = @A:a
    tags b,c = #chars
  end module

Here, the definition x will refer to the empty elements b and c, as defined in module A.

Nevertheless, a substitution always takes a front-end identifier and replaces it with an evaluated expression, i.e. evaluated in the context where the substitution is written down:

  module outer
    module A
      tags a = b, c
      tags b,c = #empty
    end module
    import A from A
    tags x = (@A:a) ^(b/c)
    tags b,c = #chars
  end module

Now the contents of x are defined as a sequence of two elements, both called b: the first is empty, as defined in module A; the second may contain #chars data, as defined in module outer.

When nesting substitutions, the outer one is applied to the expression part of the inner one (above the "/"), AND to the rewriting result, but not to the (mere front-end) reference text below the slash:

  tags x1 = ( a,b,c ^(b / c) ) ^(d / b)
  // --> ( a,b,c ^(d / c) ) ^(d / b)
  // --> ( a,b,d ) ^(d / b)
  // --> ( a,d,d )
  tags x2 = ( a,b,c ^(b / c) ) ^(d / c)
  // --> ( a,b,b ) ^(d / c)
  // --> ( a,b,d ) + WARNING, "c" did not occur
  tags x3 = ( a,b,c ^(@b / c) ) ^(c / b)
  // --> ( a,b,c ^(@c / c) ) ^(c / b)
  // --> ( a,b,(c, c)? ) ^(c / b)
  // --> ( a,c,(c, c)? )
  tags b = (d, d)?
  tags c = (c, c)?
  /* ==== when replacing from inner to outer the last example would instead resolve to ...
  tags x3 = ( a,b,c ^(@b / c) ) ^(c / b)
  // --> ( a,b,(d, d)? ) ^(c / b)
  // --> ( a,c,(d, d)? )
  ==== */
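The nesting rule (the outer substitution is applied to the replacement expression of the inner one and to the inner rewriting result, but not to the reference text below the slash) can be modeled in a few lines. This Python sketch is a simplified reading of that rule, not the tool's algorithm: it handles only plain references and sequences, ignores insertions ("@") and warnings, and all class and function names are invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Seq:
    items: tuple          # sequence of sub-expressions; references are plain strings


@dataclass(frozen=True)
class Subst:
    body: object          # expression the substitution is attached to
    repl: object          # expression above the slash
    target: str           # front-end reference text below the slash


def rewrite(expr, repl, target):
    """Apply an (outer) substitution ^(repl / target) to expr."""
    if isinstance(expr, str):
        return repl if expr == target else expr
    if isinstance(expr, Seq):
        return Seq(tuple(rewrite(item, repl, target) for item in expr.items))
    # expr is an inner substitution: rewrite its replacement expression first,
    # leave its target text alone, resolve it, then rewrite the result.
    inner_repl = rewrite(expr.repl, repl, target)
    result = rewrite(expr.body, inner_repl, expr.target)
    return rewrite(result, repl, target)


def resolve(expr):
    """Eliminate all substitutions from expr."""
    if isinstance(expr, Subst):
        return rewrite(expr.body, resolve(expr.repl), expr.target)
    if isinstance(expr, Seq):
        return Seq(tuple(resolve(item) for item in expr.items))
    return expr


# tags x1 = ( a,b,c ^(b / c) ) ^(d / b)
x1 = Subst(Seq(("a", "b", Subst("c", "b", "c"))), "d", "b")
print(resolve(x1))
```

For x1 this model yields the sequence a, d, d, as in the trace of the first example.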

As mentioned, what is re-written is determined only by the front-end representation of the reference. The "semantics" of the resolved references are not involved in the matching, only their front-end denotation. Consider the following example:

  module OUTER
    module INNER
      import I from one_certain_module
      import J from one_certain_module
      tags t = I:a
      tags u = J:a
    end module // INNER
    import IN from INNER ^ ( b / I:a)
    tags b = // etc ...

Here only the occurrences of "one_certain_module:a" in the definition of "t" will be replaced by "b". The reference "J:a" in the definition of "u" is not touched, in spite of pointing to the same definition in "one_certain_module". Only the front-end representations of the references are subject to substitutions, not the declarations referred to!

Nevertheless, these front-end representations are the fully qualified ones, after the resolution of the abbreviated access to local definitions!
Consider ...

 module m
   tags t = a, b, c
     with local
       a = // ...
     end local
   tags u = @t ^ (x/a)    // this will NOT insert any ref to "x"
   tags v = @t ^ (x/t.a)  // this WILL insert a ref to "x"
 end module

Whenever a substitution does not result in any rewriting, a warning is issued by the tool. This is the case with the "u" definition, because there is no reference to "a" in the regular expression of "t", to which the substitution is applied. This is very easy to see when the notation above is read just as an abbreviation for

  tags t = t.a, b, c with ... 
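The effect of matching on fully qualified front-end names can be reproduced with a toy model. In this Python sketch a content model is reduced to a flat list of reference names, which is a strong simplification; both function names are made up for the illustration.

```python
def qualify(body, owner, local_names):
    """Expand abbreviated references to local definitions of 'owner'."""
    return [owner + "." + ref if ref in local_names else ref for ref in body]


def substitute(body, replacement, target):
    """Replace 'target' by 'replacement'; also report whether a warning is due."""
    hits = [ref for ref in body if ref == target]
    new_body = [replacement if ref == target else ref for ref in body]
    return new_body, not hits          # warning if nothing was rewritten


# tags t = a, b, c  with local a = ... end local
t = qualify(["a", "b", "c"], "t", {"a"})
print(t)                               # ['t.a', 'b', 'c']
print(substitute(t, "x", "a"))         # nothing replaced, warning flag set
print(substitute(t, "x", "t.a"))       # 'x' inserted, no warning
```

The substitution is applied to the qualified list, so a target of "a" finds nothing to replace, while "t.a" matches.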

Considering this carefully is esp. important when there is some re-usage of ident s (the necessity of which indeed had been the reason to introduce local scopes !-) In the following example "a" does exist, but what is referred to in the body of "t" is "t.a", not "a":

  module
    tags a = // ..
    tags t = a, b, c
      with local
        a = // ...
      end local
    tags u = @t ^ (x/a)

Any insertion only works with a single reference as its argument: The reference must point to a parser definition (of the same kind as the containing definition!) and then its value is inserted. Whenever a substitution yields anything not a reference, we get a typing error, as in ...

  tags b = @a ^(x, y / a) 

Of course, it is not likely that such an erroneous form is written down in this directly visible way. But with module imports this happens quite frequently:

  import M from m ^( (a|b|c) / x) 

...ignoring that inside of M we have ...

  module m
    tags x = #GENERIC
    tags y = @x

Replacements are also applied twice to insertions: first to the unresolved, then to the resolved form:

  tags a = a?, b
  tags b = b?, a
  tags c = a, b
  tags x = @a ^(b / a)
  // --> @b ^(b / a)
  // --> b?, a ^(b / a)
  // --> b?, b
  tags y = @a ^(a / b)
  // --> a?, b ^(a / b)
  // --> a?, a
  tags z = @c ^(a / c)
  // --> @a ^(a / c)
  // --> a?, b ^(a / c)
  // --> a?, b

Of course, x and y in this example describe infinite types, of which no instances can be denoted. The effects of this kind of definition soon become unforeseeable, and the "instantiated" version of the generated documentation may be helpful, as described in Section 4.2.

Last but not least there is the special value #none.

It is special because it can be used only on top of a substitution slash, and it means different things when inserted into different target contexts:

When inserted into a sequence or a permutation, it stands for replacing the reference with the "empty sequence", which always matches but never produces any output. This means, the reference is simply deleted from the sequence.

When inserted into an alternative, it means the "impossible input", which never matches and also never produces any output. So again, the reference is simply forgotten.

These effects are of course independent of any individual decoration ("?", "*" or "+") the reference has in the target context.
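In both kinds of context the visible effect of #none is therefore the same: the reference disappears together with its decoration. A minimal Python sketch of this reading, with an invented list-of-pairs representation of a sequence, permutation or alternative:

```python
def substitute_none(items, target):
    """Model of ^(#none / target) on one sequence, permutation or alternative.

    items -- list of (reference, decoration) pairs, decoration being
             "", "?", "*" or "+"
    The empty sequence is neutral in a sequence or permutation, and the
    impossible input is neutral in an alternative, so in every case the
    matching entries are simply dropped.
    """
    return [(ref, deco) for ref, deco in items if ref != target]


print(substitute_none([("a", ""), ("b", "+"), ("c", "*")], "b"))
# [('a', ''), ('c', '*')]
```

What happens when the last member of an expression is removed this way is not covered by the text above and is left open here as well.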

^ToC 2.9 External Document Type Definitions

Beside its own definition format "ddf", as described above, d2d can also use document type definitions in other formats for directing the parsing and model generation of some input text in the d2d syntax.

How this is recognized depends on the implementation.
The current tool (see chapter 5 below) first searches all positions in its search path for ddf definition modules, e.g. for files ending with ".ddf", ".dd2", etc.
Only if none is found does it search for other document type definition formats.

Currently

1. W3C XML DTDs,
2. and umod model definitions

are recognized.

Every internal module definition can be exported back into the genuine d2d text format. This is done by the main tool with "--mode ddf2format", see chapter 5.

This can be esp. useful for controlling the details of the translation result when an externally defined model is read in. E.g., for the dtd of xhtml (together with the additional, preparatory declarations of namespaces, etc.) the call would be

  d2d2.base --mode ddf2format -0 xhtml_1_0 -1 recognizedHtmlModel.ddf --expanded 1 --path // must be set to find the xhtml_1_0.dtd 

Please note that an "expanded" module is an instance of ResolvedModule . It has several differences to a "raw" input module:

1. All organizing structures (like imported modules, local definitions, one definition with multiple names) are removed; the list of definitions (Chars and Tags Parsers, Enumerations) is flat.
2. As a consequence, the names appearing to the left of the "=" contain the "." separator, reflecting the former organizing structures, and so do all references.
3. Trimming defaults, name space defaults etc. are no longer required on the module level, but already distributed to each single definition and valid locally.
4. All insertions and replacements "@" and "^(a/b)" have been resolved; and further "syntactic sugar" is not applied.
5. The print-out of the expanded version of "module mymod" starts with "resolved module mymod$expanded". The textual rendering of a resolved module is not (currently not yet) re-accepted by the d2d definitions parser.

^ToC 2.9.1 Using W3C XML DTDs

Interpreting a dtd as a ddf combines attributes and element contents for each single element type into one single ddf tag parser definition.

For each element definition, its attribute list is (or lists are) translated into one single permutation expression. This is prepended to the translation of the element contents. The latter is straightforward, mapping DTD constructors to ddf constructors, and falls into one of two categories:

  ===> is read as ===>
  tags x = (a1? & a2? & a3), c, (d|e)*, f?

  ===> is read as ===>
  tags x = (a1? & a2? & a3), (#chars | c | d | e)*
  enum a1 = m1|m2|m3
  tags a2,a3 = #chars

Due to the "principle of least surprise", the "#implicit" feature of ddf is never synthesized.

Every identifier serving as an ATTRIBUTE name in the dtd is translated to the reference to a synthetic node definition in the local scope of one single synthetic pseudo-element, mostly named "ATT". This is primarily to avoid name clashes on the definition level, i.e. in the top-level scope of the constructed ddf module.

To avoid clashes of the tags in the later text input, attributes can further be prefixed: If there is a name clash between the name of an optional attribute and the "first set" of the regular expression of the content model, the attribute's tag will be prefixed by a string like "A-" or "att".

((
E.g. in case of xhtml1.0, the elements "ins", "del" and "q" have a child element with tag "cite", and an attribute with the same name. Consequently, this attribute in these elements is only addressable as "A-cite". Elements without this clash, here only "blockquote", keep the reference to the attribute by its genuine name. Contrarily, in the same example there is an attribute "script" which is nearly ubiquitous, and an element "script", only appearing in the contents of "header".
But since this element is one of the few not allowing this attribute, there is no clash, and attribute and element are addressable with their genuine name, which is the same.
))

Please note that this kind of clash resolution does not follow the principles of compatible evolution (strategies for evolution and extending), as they are fundamental for text type definitions in the genuine d2d 2.0 format, and discussed in Section 2.3.6. E.g. adding an element "y" which goes into the first-set of an element's content model will re-define an attribute's tag from "y" to "A-y". Later insertion of a further, non-optional child "x" in front of "y" will re-name the attribute back. But DTDs are assumed to be fixed. (No one wants to maintain them !-)

In the presence of namespaces there is a second level of disambiguation: First, the namespaces must be declared by "<?tdom ..>" processing instructions. The prefix which is declared there is used for translating: It is mangled into the names of elements and possibly attributes, for disambiguation. E.g. in the ddf model of xhtml there is the ubiquitous attribute "xml-lang".

In extreme cases this may lead to further mangling steps for further disambiguation, involving numbering, and must possibly be combined with the above-mentioned prefixing of attributes! See the following (non-real!) example:

  ===> is read as ===>
  tags xml_img = #chars*
  tags x = A_xml_lang, (#chars | xml_lang | xml_0_img)*

Every single occurrence of such mangling and renaming is reported to the user by a warning.

Please note that currently only dtds with certain restrictions on the structure of tags can be imported. A tag which does not fulfill the production ident would require further mangling, which is currently not supported.

^ToC 2.9.2 Denotating Values of umod Models

As described in the umod documentation, there is a canonical definition for an XML serialization of models.
The implicitly induced document type definition can be used for direct denotation of umod data models.

^ToC 2.10 Post-Processing

Sometimes a parsed element must comply with special semantic constraints, or the contents of some additional (often "hidden") data field must be calculated, or some normalization shall be performed. Whenever the (purely syntactical) means of d2d ddf document definitions are not sufficient, and these processes shall not be delayed to a later processing phase, a Java method can be employed to perform arbitrary immediate post-processing of a freshly generated model fragment.

The usage of this feature should be restricted to very few, very special cases. A typical application of this kind of post-processing is enriching the elements which represent a calendar date in a locale-specific format by one element or attribute which represents the same date in a normalized "UTC" encoding.

 postprocessor ::= postproc classAsString
 classAsString ::= " ident ( . ident ) * "

The class referred to must (a) of course be locatable by the tool via its classpath, etc., and (b) derive from a certain class and offer a certain processing function. Details can be found in the API doc of the PostProcessor class.

^ToC 2.11 Macros and Inclusions

^ToC 2.11.1 Macros

^ToC 2.11.2 Inclusions

^ToC 3 The Xslt Mode

A d2d source file can be declared to be a source for an xslt program, i.e. declared to contain xslt templates which generate fragments of a certain document type. The required declaration is as follows:

  #d2d 2.0 xslt text producing :
  //       ^^^^      ^^^^^^^^^
  //       these keywords are different

In this case the selected element is not the root of the generated document.
This is instead xslt:stylesheet, as defined in our (slightly simplified) version of xslt, see the d2d source and the static documentation of the d2d xslt model. Consequently, the top-level constructs appearing in this file can be #import, #output, #preserve-space, #template, etc., as usual for an xslt source text.

The element declaration indicated in the text type declaration instead defines the target language of the xslt rules: All elements which are reachable from the indicated root element are recognized in all those contexts of the following text in which a target element may appear, as defined by xslt. Technically, they are collected into one big alternative which is assigned to the generic definition "RESULT_ELEMENTS" in our xslt model.

These target elements can of course contain a hierarchy of further elements of the target language, as long as they conform to their contents definition. But they can also contain, vice versa, certain xslt elements, namely those which produce content (e.g. "valueof", "if", "call"). These are identified by the d2d parsing algorithm by looking at the definition of "INSTRUCTIONS". These, in turn, can again contain target language elements (either directly like "if" or indirectly like "choose/when"), etc., ad lib.

So we get a "sandwich" of alternating hierarchies of xslt elements and target language elements. Here is an example, symbolically depicted:

  xslt:stylesheet
    |        |
  variable  template
             |
             if
  ...........|..............
           html:p
           |  |  |
          br  a  |
  ...............|..........
               choose
                 |
               when
  ...............|..........
               image
  ...............|..........
            call-template

The combinability is defined by classifying the xslt elements into ...

1. type "T", can be contained directly in the top-level element "stylesheet". In the ddf, this is modeled by the definition of TOPLEVEL.
2. type "B", can contain target language elements. This is modeled by the definition TEMPLATE.
3. type "C", produces character data (and possibly also structured content) for the target language elements, modeled by CHAR_PRODUCING.
4. type "P", always produces structured content for the target language elements, modeled by STRUCTURE_PRODUCING.

  xslt:stylesheet
    |        |
   (T)       |
  variable   |
            (T)
          template
            (B)..........................
      (target language elements:
       a first group of nesting,
       top slices of hierarchy)
             |        (<---x1)
           (P/C)........................
            |  |
      (content-producing xslt elements)
            |  |
           (B).|........................
            |  |      (<---x2)
            |
      (target language elements etc.,
       nesting continued)

The implementation is realized by two state machines, operating independently as co-routines. The d2d inference mechanism works for both parts independently. Seen from the user, both parsers and both sets of tags are unified in a transparent way.

The point "x1" is the crucial point where new name clashes can occur, because there all tags of the target and many xslt tags are permitted. The clash comes from the production TEMPLATE. This nonterminal does not model the definition of an xslt template (that is done by template!), but "everything which occurs as the contents of a template in the widest sense", e.g. including values for constants and template arguments, for if-branches, etc. Of course, only here the back-end elements (by @RESULT_ELEMENTS) and certain xslt elements (by @INSTRUCTIONS) are combined and can clash.

First of all, no new clashes may occur with ATTRIBUTE-like definitions: Only elements, not attributes, of the target language are involved in this recursive embedding: Neither may an attribute contain an xslt instruction, nor may it appear on the top level of the contents of an xslt template. This is the only situation where the chosen "xml representation kind" does affect the d2d parsing process. An attribute named "choose" or "if" will therefore never clash with the xslt element of the same name. (See more details at Section 3.3.)
But additional clashes can occur between tags from different scopes of the target language. DTD-defined languages do not have different scopes, but in ddf-defined target languages there are tags with module scope and tags with per-element scope. This is explained in detail in Section 3.2.

But most significant and most likely is a clash between top-level elements of the target language and xslt content-producing "instructions". This is solved as follows: All xslt elements are additionally addressable by tags which are prefixed by "X" and "x-". In case (a) that an xslt tag also appears as a target language tag, and both tags are applicable in a certain context, the target language tag has priority and thus the prefixed version must be used for the xslt tag. In case (b) that a prefixed version itself is used by the target language elements, the prefix is replicated as often as necessary. E.g. if the target language contains a tag like "x-if", then the corresponding prefixed version of the xslt tag will be "x-x-if". The tool will issue a corresponding warning in both cases.

(( In case of xhtml as target language, there are element definitions "var" and "param". The first clashes with xslt "var", so this must be addressed by "x-var" or by "Xvar" or similar. Contrarily, there is no clash between the two roles of "param": The xslt version can only appear in the prefix part of template, and at this position no target language elements are allowed. Strictly speaking there is nevertheless a clash, but its non-determinism is easily resolved by the algorithm being greedy:

  module xslt
  // ...
  public tags template = (match |name), (mode? & prior? & X:space? & param*), @TEMPLATE

  --- when instantiated with xhtml_1_0 as a target language, will read as -->

  public tags template = (match |name), (mode? & prior? & X:space? & param*),
      ( (if|call|apply|valueof|..)      // xslt INSTRUCTIONS
      | (html|head|p|..|param|..        // xhtml RESULT_ELEMENTS
      //                ^^^^ this is not LL(1) !!!
        )
      )*

As a consequence we get the following interpretations:

  #template #name x #param p1 #param/ #param p2 #param/ #element param
  //                xslt              xslt                       xhtml
  // one trick to start the template part with an xhtml param

  #template #name x #param p1 #param/ #param p2 #param/ #message!! #param
  //                xslt              xslt                         xhtml
  // a different way to start the template part with an xhtml param

^ToC 3.1 Additional Xml Name Space Declarations

The name spaces related to the output model, i.e. the xml corpus which will be created by the xslt source text, are imported automatically and applied to the generated output. They are copied to the output with their original prefix definitions, i.e. the prefixes can be used in embedded xpath expressions and template match patterns to refer to the namespaces.

But further name spaces, esp. those of all input documents, must be declared explicitly. This is done like ...

  #d2d 2.0 xslt text producing :
    from a = http://bandm.eu/doctypes/options
         b = http://www.w3.org/1998/Math/MathML
           = #empty

This assigns the prefixes a and b to the namespace URIs, and the empty prefix to the empty URI. These prefixes can now be used e.g. in embedded xpath expressions, with the declared meaning, since these declarations will be copied to the generated output.

^ToC 3.2 Issues with Type-checking and Coverage when Expanding XSLT Constructs. Shadowing caused by Missing Context Information

Please note that there is no (real) type checking between the upper and the lower hierarchy of target language elements in the picture above! At all junction points "B", all target language tags may appear. The set of allowed tags at point (x2) in the picture above does not depend on the situation at point (x1). (E.g. often at point x2 some arguments to a function ("call template" or "apply templates" in xslt terminology) are constructed, which will be part of the result (inserted at place x1) only after further wrapping, or which will even be totally discarded.
This shows that there is no trivial relation between x1 and x2.)

As a first consequence of this unrelatedness, the user must be aware that xslt code can be denotated which will produce incorrect results w.r.t. the target language document type. This has mathematical reasons: The type-checking problem is only solvable for a restricted subset of xslt [marnev05].

While this problem is a general one of xslt, the second issue is related to the central inference mechanism inherent to d2d (which intends to simplify denotation and increase readability): the d2d parsing process must use some heuristics to find out how to continue after the embedded content-producing xslt element, e.g. when returning to the stack level of (x1). See the following example of contents definition and xslt code:

  tags a = b,c,d,e,f+,g
  tags b,c,d,e,f,g = #empty
  -----
  xslt-source 1:
  #template #match link #a #b #call myNamedTemplate#/call #f #f #g #/template
  xslt-source 2:
  #template #match link #a #b #call myNamedTemplate#/call #c #d #e #f #g #/template
  xslt-source 3:
  #template #match link #a #b #call myNamedTemplate#/call #/a #/template

All three xslt fragments are (possibly!) correct: In fragment 1 the call to myNamedTemplate must deliver a sequence of three elements as its result, namely a c-, a d-, and an e-element. It may additionally produce some trailing f-elements. In fragment 2 the call must deliver "nothing", the empty sequence (e.g. just executing a debug message output). In fragment 3 the call must deliver the whole required rest of a's content definition.

But in every case the parser will not know how the called template will behave. So it must be assumed that every content-producing xslt command (which is embedded into the target language's structure definition and is expanded later, when applying the transformation as a whole) may cover any valid continuation sequence w.r.t. the currently parsed nonterminal of the target language and its contents model.
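For a plain sequence content model like "b,c,d,e,f+,g" from the example above, the part which an embedded call must at least deliver can be pictured as the gap between the last tag written before the call and the first tag written after it. The following Python sketch illustrates only this idea. It is not the algorithm of the d2d tool; the function names and the restriction to flat sequences of required items (optionally repeatable by "+") are assumptions made for this illustration.

```python
def parse_sequence(spec):
    """Split 'b,c,d,e,f+,g' into (tag, repeatable) pairs; every item is required."""
    items = []
    for part in spec.split(","):
        part = part.strip()
        items.append((part.rstrip("+"), part.endswith("+")))
    return items


def advance(items, pos, tags):
    """Consume explicitly written tags, starting at position pos of the sequence."""
    for tag in tags:
        if pos > 0 and items[pos - 1] == (tag, True):
            continue  # one more repetition of an item declared with '+'
        if pos < len(items) and items[pos][0] == tag:
            pos += 1
        else:
            raise ValueError(f"tag {tag!r} does not fit the content model here")
    return pos


def minimum_coverage(spec, before, after):
    """Return the items the embedded call must at least produce.

    before: tags written before the call; after: tags written after it
    (empty if the element is closed directly after the call).
    Only the first tag of 'after' is inspected, and the first matching
    position is chosen; the rest of 'after' is not checked here.
    """
    items = parse_sequence(spec)
    pos = advance(items, 0, before)
    if not after:
        target = len(items)  # the call must cover the whole required rest
    elif pos > 0 and items[pos - 1] == (after[0], True):
        target = pos  # continuing a repetition: nothing is skipped
    else:
        target = next((i for i in range(pos, len(items))
                       if items[i][0] == after[0]), None)
        if target is None:
            raise ValueError(f"tag {after[0]!r} cannot follow the call")
    return [tag + ("+" if rep else "") for tag, rep in items[pos:target]]


MODEL = "b,c,d,e,f+,g"
fragment_1 = minimum_coverage(MODEL, ["b"], ["f", "f", "g"])
fragment_2 = minimum_coverage(MODEL, ["b"], ["c", "d", "e", "f", "g"])
fragment_3 = minimum_coverage(MODEL, ["b"], [])
```

With the three fragments from above, this yields the items c, d, e for fragment 1, nothing for fragment 2, and the complete rest c, d, e, f+, g for fragment 3.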
Those components of the current contents model which are left out thus define the minimum coverage of the expansion of the xslt construct.

A very important property in this context is that every such xslt function can only expand to a true sub-expression of the current content model, but never beyond! That is because it always delivers well-formed (sub-)trees, not arbitrary sequences of tokens. Therefore the d2d xslt parsing mechanism never needs to look farther than the end of the current content model, and that is always known.

The implementation currently reports this minimum coverage by the following hints:

  case 1: xslt expression is assumed to cover at least (c, d, e)
  case 3: xslt expression is assumed to cover at least (c, d, e, (f)+, g)

So there is indeed a kind of "minimal type checking" done automatically. At least, the user is clearly informed about what the code has to deliver for correct overall output.

Obviously, this "wild card" character of the xslt expansion destroys the strict LL(1) discipline. Consider the same kind of example as above, but more complicated:

  tags a = (b,c,d)*, (x,c,d)
  tags b,c,d,x = #empty
  -----
  xslt-source 4:
  #template #match link #a #b #call myNamedTemplate#/call #d #/a #/template
  -->minimal cover c, d, (b, c, d)*, x, c
  xslt-source 5:
  #template #match link #a #b #call myNamedTemplate#/call #d #x #c #d #/a #/template
  -->minimal cover c, (b, c, d)*, #/template

The difference between both cases cannot be recognized by the normal LL(1) parsing of d2d. After the called template, the parser does not know whether to continue with the "first" or the "second" reference to d. Currently, we always decide for the first.

The declarative operator covers is intended to list a sequence of tags. IT IS CURRENTLY NOT YET SUPPORTED! Its meaning shall be to indicate that the corresponding elements are always contained in the produced content of the xslt expansion (as guaranteed, or at least as intended, by the user).
The effect of this is to shift the parsing process over the first appearance of this tag. Since the content model (besides the wildcard of the xslt expansion) is always LL(1), there is always such a tag which can be used for this disambiguation (????)

Both examples from above (and a third, new one) are correctly written as ...

  tags a = (b,c,d)*, (x,c,d)
  tags b,c,d,x = #empty
  -----
  #template #match link #a #b #call myNamedTemplate#/call #cover x #d #/a #/template
  #template #match link #a #b #call myNamedTemplate#/call #d #x #c #d #/a #/template
  #template #match link #a #b #call myNamedTemplate#/call #d #b #c #d #x #c #d #/a #/template

A third consequence of the above-mentioned unrelatedness of the grammars ruling the parsing points x1 and x2 arises as soon as a certain d2d name used for an element definition on module level conflicts with a local element definition of the same name, see Section 2.8.2. Each tag which appears "freely floating" when re-entering the world of target tags at point x2 (and also in the top-level contents of a template, a variable content definition, etc.) will thus be interpreted as a reference to the module-level definition. The local definition will only be recognized when the tag appears not on top level, but in the content model of a target language element, which serves as a disambiguation. Anyhow, xslt itself always allows constructing elements explicitly, using xslt:element.

The following table summarizes all supported tags and assigns them to the interface categories. Please note that our version simplifies the wording of the tags, and replaces some attributes (with a "yes/no" kind of value) with an empty element (inserted for "yes" and left out for "no"). You may additionally refer to the source of the definition module for our version of xslt and the generated documentation.

  T = TOP-level (directly under "stylesheet")
  C = Character (and struct.) producing (under template); (c) = under template, but nothing (directly) producing
  P = Producing structured content
  B = back-end = target lang. elems. directly contained; c = instructions contained, but only producing plain char data

  tag                       T     C     P     B
  stylesheet
  transform
  import                    T
  include                   T
  strip-space               T
  preserve-space            T
  output                    T
  param                     (T)   (c)         B
  key                       T
  decimal-format            T
  namespace-alias           T
  template                  T                 B
  value-of                        C
  copy-of                         C
  number                          C
  apply ("-templates")            C
  apply-imports                   C
  foreach ("for-each")            C           B
  sort
  if                              C           B
  choose                          C
  when                                        B
  other ("otherwise")                         B
  attribute-set             T
  call (="call-template")         C
  arg (="with-param")                         B
  variable                  (T)   (c)         B
  text                            C
  processing-instruction                P     c
  element                               P     B
  attribute                             P     c
  comment                               P     c
  copy                                  P     B
  message                         (c)         c
  fallback                                    B

^ToC 3.3 Further Caveats for the Xslt Mode

For us, it turned out to be quite comfortable to write xslt programs using the d2d input front-end. Nevertheless, there are certain severe drawbacks and caveats. Some of them (e.g., w.r.t. the problem of shadowing of tags and attribute names) are already mentioned above. Some other caveats are described briefly in this section.

You still write XML "verbatim"

That means that all idiosyncrasies of XML still apply, and partly affect the parsing process. Esp., XML "attributes" and "elements" still behave differently. This is not visible in d2d's standard front-end representation. Indeed, it was one of the design goals of d2d to eliminate all the complicated "junctims" coming with the dichotomy between "xml elements vs. attributes". But here this unification does have drawbacks, and you should be careful! For example:

  #template #match link #a #href #choose #when XXXX

The #choose will close the contents of href immediately, because this is (in "XML and XSL-T reality") still "only" an attribute, with "only" character content, not an "element" with "element content". (The same mechanism, by the way, closes the match attribute as soon as the #a tag is parsed, which appears to be a quite sensible behaviour!)
What you mean when calculating the character contents of an attribute is perhaps an "attribute value template", which is written as an xpath expression in curly brackets:

  #template #match link #a #href {concat($myVar,'-',text())}
  #template #name encodeLinkTarget #href {concat($myVar,'-',text())}

...is simply syntactically impossible in the XML world, because for the "attribute" href there would be no hosting "open tag"! The verbatim translation to the genuine xslt xml representation yields something which is not valid xslt:

The syntactically correct way of generating an attribute node for the current context is of course creating the node explicitly:

  #template #name encodeLinkTarget #attribute #name!href! #valueof concat($myVar,'-',text())

d2d parsing uses different kinds of brackets and parentheses, independently of the contents structure. So when writing the following, the braces will be consumed by the d2d parsing algorithm and you will not write an "attribute value template":

  #template #match link #a #href{concat($myVar,'-',text())}
  --> results in

Instead, if you want explicit parentheses for the contents of href, you could choose other parentheses, like

  #template #match link #a #href!{concat($myVar,'-',text())}!
  -- or --
  #a #href<{concat($myVar,'-',text())}>
  -- or --
  #a #href {concat($myVar,'-',text())}#/href

Contrarily, in the xslt context no target language element is treated as empty! Even if neither attributes nor element contents are allowed, there could still be xslt elements which do not produce any output (like debug message output) included in the element as its contents.

But most elements have at least some ubiquitous attribute like "xml:lang" or "id". These attributes could possibly also be created by xslt code. This leads to a sometimes surprising behaviour of the parsing:

  #p #br #call mytemplate -- yields -->



In many cases the user will have meant something different, namely

  #p #br/ #call mytemplate -- yields -->



d2d parsing does not respect ("know of") the XPath syntax, which is wrapped into xslt constructs. As long as the command character is #, you cannot write ...

  #template #match link #a #href {concat($myVar,"#",text())}
  -- results in -->

Last but not least, some xslt constructs behave somewhat unexpectedly w.r.t. contents and inferred closing. For "var" (which should better be called "const", and when applied to the xhtml target must be written "x-var"), "param" and "arg" we provide two versions for denotating the value: either an xpath expression, led in by "xp", or a "template", which may contain "nearly everything". In case of var and param, this includes itself recursively! In case of constants (called "var") this does even make sense, in some rare cases:

  // NOT intended nesting:
  #param p1 #text this text goes as a value into the param p1
  #param p2 #text // <-- and this param TOO :
  ---- yields surprisingly --->
  this text goes as a value into the param p1

  // sensible and intended nesting:
  #var v1
    #var v2 #text complicated constant for replication #/var
    #valueof$v2#valueof $v2#valueof$v2#valueof $v2#valueof$v2
  #/var

So better write explicit close tags for all these elements which otherwise would swallow anything:

  #param p1 #text this text goes as a value into the param p1 #/param
  #param p2 #text // etc.
  #/param

^ToC 4 Documentations And Transformations

^ToC 4.1 Adjoining Documentation Text And Transformation Rules to Definitions

The ddf modules, ie. the document type definition files in the d2d architecture, support an integrated documentation and transformation system.

A collection of documentation texts and processing instructions can be adjoined to every definition. This collection is indexed by a user defined key, which is used in subsequent processing. This is achieved by statements in the source text which follow the grammar

 definition ::= ... | documentation
 documentation ::= docu ident localrefs ? stringconst
 localrefs ::= localref ( , localref ) *
 localref ::= ident ( . ident ) *

A localref is like a reference, but excluding references to imported definitions. When a list of localrefs appears in a documentation, then the following text is assigned to the definitions with these names. If localrefs is left out, the text refers to the whole module as such. The ident always gives the key.

By convention (which the current tool implementation also adheres to!) there are currently two kinds of keys:

1. "user_<languageCode>", for adjoining human readable explanation text, eg. "user_en".
2. "to_<targetFormat>", for defining xslt transformations, eg. "to_xhtml_1_0".
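The two key conventions can be told apart by their prefix alone. The following Python sketch shows such a classification; the function name and the returned labels are hypothetical and chosen only for this illustration, they are not part of the d2d tool.

```python
def classify_docu_key(key):
    """Classify a docu key by the naming convention described above.

    Returns a pair (kind, rest), where kind is 'documentation' for
    'user_<languageCode>', 'transformation' for 'to_<targetFormat>',
    and 'unknown' for everything else.
    """
    for prefix, kind in (("user_", "documentation"), ("to_", "transformation")):
        if key.startswith(prefix) and len(key) > len(prefix):
            return kind, key[len(prefix):]
    return "unknown", key
```

For example, "user_en" is classified as documentation text for the language code "en", and "to_xhtml_1_0" as a transformation to the target format "xhtml_1_0".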

^ToC 4.2 Generating Documentation of a Module's Structure

The current implementation of the main tool (see chapter 5, by the command line parameter "--mode ddf2doc") allows generating an xhtml documentation page for each module.
The generated documentation page combines
1. a listing of all tags reachable from definitions in this module,
2. a graphical representation of the reference relation,
3. for each reachable definition, a depiction of the syntax and the above-mentioned documentation text from the definition module, led in by "docu user_<L>", with L being the single selected human language.
The language in which to generate the documentation is selectable.

The documentation texts are parsed and translated to xhtml according to the definition basic.deliverables:docutext. Therefore they are "rich text", like any non-tailored "basic.deliverables:webpage" content, using physical mark-up, links, tables, lists, external images, etc., ad lib.

The text fragments contained in the definition source related to the same definition (or module), with the same key, will all be concatenated and preceded by an implicit "#p " source text. Therefore most such doc texts can simply begin with readable text, not caring about formats. If they become longer, they can contain more paragraphs by simply using the #p-tag. These details can easily be checked by comparing the source texts and the generated definitions of the d2d gp basic module, as described below in chapter 6.
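The concatenation rule can be sketched in a few lines of Python. This is only an illustration of the rule as stated above, not the implementation of the tool; the function name, the tuple layout of the input, and the newline used as separator between fragments are assumptions.

```python
def assemble_doc_texts(fragments, separator="\n"):
    """Concatenate documentation fragments per (key, definition name).

    fragments: iterable of (key, name, text) triples in source order;
    name may be None for text referring to the module as a whole.
    The separator between fragments is an assumption of this sketch.
    """
    grouped = {}
    for key, name, text in fragments:
        grouped.setdefault((key, name), []).append(text)
    # every assembled text is preceded by the implicit "#p " source text
    return {slot: "#p " + separator.join(texts) for slot, texts in grouped.items()}
```

A first fragment "A horizontal rule." and a later fragment "#p Use it sparingly.", both with key "user_en" and name "hr", would thus be parsed as two paragraphs, without the author writing the first #p explicitly.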

There are two different flavours of this generated documentation: It can either be related to the source text of an un-instantiated module ("static mode"). In this case all sub-modules are included in the documentation, but no instantiation takes place, so no reference is resolved or type-checked, etc.

This is because the definitions contained in a module can be replaced and modified in many different ways before the module is instantiated, so type checking outside of a concrete instantiation may turn out not to be sensible.

The contrast is the "dynamic" mode, in which one certain module is instantiated and all its top-level definitions are documented. Here all references are checked and instantiated, the same way as when the module is employed for text parsing. All those (and only those) definitions reachable from the top-level definitions are documented, totally independent from the static module which contains their source text.

Both forms may be necessary for your information. The instantiated form shows the really effective content definitions, after resolving all the (possibly rather complicated) substitutions and insertions (see Section 2.8.5 and Section 2.8.6). The instantiated documentation shows the contents models which really rule the sequence of input tags when you have to denotate correct text input.

But in the instantiated form the origins of definitions, e.g. the insertions already resolved, are not visible anymore. So when designing own instantiations and variants of existing library modules, the static, un-instantiated form of documentation may be more appropriate.

Examples are the documentation of the d2d model of xslt, and the documentation pages of the "general purpose" document architecture "d2d_gp", a link to which is found at the beginning of chapter 6.

^ToC 4.3 Defining a Collection of Transformation Rules Decentrally

In practice, centrally defined xslt scripts for further transformation of a certain d2d model turned out to be hard to maintain. Therefore d2d offers a way to denotate transformations directly with the parser definition.

The syntax is the same as for documentation. The text attached to each content model must be a fragment of an xslt text for a certain backend <b>, and the key must have the wording "to_<b>". An example from "basic.physical":

  tags hr = #empty
  docu to_xhtml_1_0 hr = "#hr"
  docu to_latex hr = "\\\hrule{}"
  tags emph = @EMPH_CONTENT
  docu to_xhtml_1_0 emph = "#i#apply"
  docu to_latex emph = "\emph{#apply}"

The current tool allows extracting and concatenating all xslt rules with a certain key. If the indicated target language is an XML model itself, then this will be used for the construction of the templates' contents.

A command line like "<D2D_TOOL> --mode ddf2xslt --key xhtml_1_0 --sourcefile basic.physical --outputfile x.xslt.d2d" will leave the file x.xslt.d2d with something like

  #d2d 2.0 xslt text using xhtml : html
  #stylesheet #version 1.0
  #template match hr #hr
  #toplevel // INSERTED for terminating pending "#var" contents, etc.
  #template match emph #i#apply
  #eof
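The extraction step can be pictured as a simple filter over all docu entries of a module. The following Python sketch mimics the shape of the file shown above; it is not the implementation of the d2d tool. The function name, the tuple layout of the input and the header argument are assumptions, and the "#toplevel" separator lines which the tool inserts are left out.

```python
def extract_rules(docu_entries, backend, header):
    """Collect all transformation fragments for one backend into one text.

    docu_entries: iterable of (key, name, fragment) triples in source order.
    backend: the target format, e.g. "xhtml_1_0"; the matching key is "to_" + backend.
    header: the first line of the generated d2d xslt source (taken as given).
    """
    key = "to_" + backend
    lines = [header, "#stylesheet #version 1.0"]
    for entry_key, name, fragment in docu_entries:
        if entry_key == key:
            # one template per definition which carries a fragment for this key
            lines.append("#template match " + name + " " + fragment)
    lines.append("#eof")
    return "\n".join(lines)


ENTRIES = [
    ("to_xhtml_1_0", "hr", "#hr"),
    ("to_latex", "hr", "\\\\\\hrule{}"),
    ("to_xhtml_1_0", "emph", "#i#apply"),
    ("to_latex", "emph", "\\emph{#apply}"),
]
generated = extract_rules(ENTRIES, "xhtml_1_0", "#d2d 2.0 xslt text using xhtml : html")
```

Only the two fragments stored under the key "to_xhtml_1_0" end up in the generated text; the LaTeX fragments of the same definitions are ignored.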

This file, in turn, can be converted to an xslt source in the conventional format as described above in chapter 3, e.g. by a command line like
"<D2D_TOOL> --mode text2xml --sourcefile x.xslt.d2d --outputfile x.xslt"

Then you have a classical xslt source file, which can be applied by any xslt processor to any input, e.g. by a command line like
"eu.bandm.tools.utils3.CallXslt --in xmlsource.xml --xsl x.xslt --out result.html"
or, using the meta_tools make system, " $(call xml2xml, xmlsource.xml, x.xslt, result.html,$(PARAMETERS))"

All these steps can be done internally, in one single step, by calling ...
"eu.bandm.tools.d2d2.base.Main --mode text2target --sourcefile mysource.d2d --key xhtml_1_0 --outputfile result.html $(FURTHER_PARAMS)" or, again using the make macros, ... "$(call d2d2target, mysource.d2d, xhtml_1_0, result.html, $(FURTHER_PARAMS))" This is most convenient for batch converting of text input, esp. because the caching of definition modules and xslt rules spares a lot of processing time. Nevertheless, when developing new transformation systems, it may be helpful for the debugging to perform these steps separately. ^ToC 5 The d2d Main Tool Currently there is one central, universal implementation of a public void main() which allows to activate all functions described in this text from the command line. ( definitions from file ../../src/eu/bandm/tools/d2d2/base/d2dOptions.xml )  -0 --source uri path of source file or name of module to process, depends on 'mode' -d --debug int(=0) debug level, 0=silent 100=maximal verbose not modes.dtd2ddf==mode0 -K --keys ( stringuri) + pairs of target indications 'module:element' into which the xslt result shall be generated, and the output file position modes.text2texts==mode0 -k --key string the target language for which documentation shall be generated; or the pair 'module:element' into which xslt code shall be extracted (modes.ddf2doc==mode0 or modes.ddf2xslt==mode0) -m --mode ( text2xml| text2texts| test| ddf2dtd| ddf2doc| ddf2xslt| ddf2htmlform| ddf2tsoap| ddf2format| dtd2ddf) for what kind of task this application is called -p --path ( string) * where to look for type definition modules not modes.dtd2ddf==mode0 -v --version show version number null==mode0 --additionalSources ( uri) * additional source files, currently only used for documentation and transformation definitions. 
(modes.ddf2doc==mode0 or modes.ddf2xslt==mode0 or modes.text2texts==mode0) --interactive int(=0) which info to print in case of error: (=1) stack situation (=2) generated output so far (modes.text2xml==mode0 or modes.text2texts==mode0) --lineWidth int(=70) width of a line for most print out procedures. --outputfile uri output file not modes.text2texts==mode0 --partialdocs whether partially correct but incomplete documents may be produced. (modes.text2xml==mode0 or modes.text2texts==mode0) --stylesheetParams ( stringstring) * explicit pairs key:value of parameters for an xslt style sheet processing. (modes.ddf2doc==mode0 or modes.text2texts==mode0) --stylesheetParamFiles ( uri) * list of files containing parameters for xslt processing. (modes.ddf2doc==mode0 or modes.text2texts==mode0) The diverse functions selected by the argument of "--mode" are ... 1. "text2xml" --- parse an input file in the d2d format, and write it out to a standard xml-tagged text file. This is described in detail in the preceding chapters chapter 2 for plain text models, and in chapter 3 for xslt programs. 2. "ddf2doc" --- generate documentation text for a certain module, as described in Section 4.2. 3. "ddf2dtd" --- generate an xml dtd file which (roughly !-) corresponds to an instantiated d2d definition module. All generated xml text which follows this module is guaranteed to comply with the dtd, and vice versa. In practice this is needed mostly for defining the model in the more convenient ddf notation, and then deriving a tdom model. 4. "ddf2xslt" --- extract all transformation rules from an instantiated module into one xslt source text file. 5. "ddf2tsoap" --- serialize a module definition as such. 6. "ddf2htmlform" --- generate html "<form>" elements corresponding to the type definitions in the ddf module. NOT YET ACTIVE 7. "dtd2ddf" --- write out a ddf translation which corresponds to a given dtd. 
This model is (of course !-) the same as what is constructed internally when using the dtd as type definition for text input, see Section 2.9.1. 8. "dumpddf" --- write out the front-end representation of a ddf module. This can be an instantiated or an un-instantatiated, depending on the value of the expanded switch. It is esp. useful for controling the interpretation of modules defined in a third party language, cf. Section 2.9 above. 9. "text2text" --- combine the translation of a d2d source text into an xml corpus, the selection of an xslt script, and finally its application to the xml for generating directly a target language output. The "--key" parameter is interpreted as in the case "ddf2xslt". So a line like <d2d> --mode text2text --input d2d.d2d --output d2d.html --key xhmlt_1_0:html directly converts the source text (of this file) into the html webpage you are currently reading. The syntax of the key parameter is slightly enhanced: Prefixing it like "xml+xhmlt_1_0:html" will additionally create the xml file. The -p/--path option allows to specify the places where module source texts will be searched. Every string in the argument declares such a place, and they are searched from the left to the rigth for the first match for a given filen name (which normally is created by appending a suffix to a top-level module name, see Section 2.1.1 above.) Currently, three formats are supported: 1. FILE_/a/b/c/d searches for module files at the given position in the file system. 2. RES_a.b.c.D/e/f searches for Resources contained in the class file context (e.g., a ".jar" file). The search is for a resource relative to class "a.b.c.D", prefixed by the directory "e/f". 3. LIB_GP searches in the d2d standard library (containing d2d_gp, see chapter 6, but also other standard formats). Currently it is simply an abbbreviation for RES_eu.bandm.tools.doctypes.DocTypes/d2d_gp/. Xslt style sheet parameters can be given by the options stylesheetParams and stylesheetParamFiles. 
Every style sheet parameter definition is a 2-tuple (NamespaceName, constant String value). It is visible in the xslt code when it is defined there as a top-level "Parameter", see [xslt1_0, sect. 11.4]. First the files listed after stylesheetParamFiles are loaded, in the given sequential order. Each text line (separated by line feed) contains one such key/value pair. Then the explicit definitions from the command line arguments after stylesheetParams are read. Every definition may override a previous one with the same key.

^ToC 6 "d2d_gp" --- a General Purpose Text Architecture

d2d_gp is a general purpose text architecture which has been developed in parallel with the different historic states of d2d itself. It roughly follows well-known structural ideas from LaTeX, BibTeX, DocBook, HTML and other text type architectures. Its goal is not to invent new notions of text and its components, but to present established concepts in a versatile, modifiable and extensible way. Therefore some modules, e.g. those which define table and list structures, are intentionally left very primitive. Users may plug in their favourites, and shall not be confronted with complexity inadequate to their current project.

So d2d_gp is intended to be parameterized and modified by the user, for easy definition of textual document types and their subsequent transformations, according to the needs of a certain project or to personal preferences. Besides, it serves as a demonstration object for the mechanisms of d2d itself.

It consists of a collection of ddf modules, including user documentation text (currently only in the English language) and xslt transformations (currently only to xhtml 1.0). The translation system into xhtml 1.0 is well-proven and has been used for creating the documentation texts you are currently reading.
We plan to derive a translation system into LaTeX, and perhaps into "word-ml", so that you can translate into these de facto/pseudo standard formats (into the latter without the need for ugly tools from Redmond !-)

^ToC 6.1 The "basic" Module and Its Sub-Modules

Here is a short characterization of the sub-modules. (The links bring you to the automatically generated documentation. Please note that this doc represents the instantiated case. Consequently, one and the same module source text may appear there more than once, namely differently parameterized! So when looking for a certain module, please consult the list of instantiated modules at the beginning of the documentation.)

1. basic.sets defines basic character sets and low-level, but ubiquitous parsers, e.g. for idents.
2. basic.xmlInfra defines the very few basic definitions which are fundamental and ubiquitous in XML. Esp. the xml representation of these elements and attributes is linked to the correct "namespace", cf. Section 2.8.3.
3. basic.physical contains physical mark-up, according to the old-fashioned HTML practice (bold, italic, blinking !-). This is of course deprecated. Nevertheless this module might still be useful in certain contexts. It also contains the definition of "verbatimHtml", which allows to generate target language material directly, comparable to the "\special" mechanism in DVI, dvips, etc. Currently it also holds "#src", giving teletype-fonts and special colors to in-line source text fragments. This is a fundamental means when writing technical texts on information technology, and could be considered specialized semantic markup, but is currently, due to its ubiquity, implemented as mere "physical".
4. basic.inlineElements defines fragments of a text line with dedicated functions or semantics, like "pers" for marking persons being discussed, or "label" and "ref" for marking text positions and for referring to such positions, etc.
5. basic.calendaric_de is an example instantiation for denoting calendar dates in a particular language. Parallel modules for other languages shall follow.
6. basic.interDocuments contains structures for creating hyperlinks, for performing source tree level inclusion, and for exporting sources for third-party rendering tools, the results of which will be inserted into the rendered output. On a posix platform, this is accomplished by a special Xslt run, which concatenates all this data into one shell script. This is processed by a dedicated make sub-system which feeds the source fragments into the different tools and arranges their output in correctly named and formatted .png files. So e.g. musical notation can be integrated into texts on music theory, by including sources for "lilypond" or "musixTex" into a basic.interDocuments:embed structure.
7. basic.personal_names_de is again only one instance of several possible ways of parsing and encoding objects of its semantic category. Currently it is not used in the rest of the system (but should be !-)
8. basic.citation contains the definitions related to bibliographic entries and citation keys. Here some research and definition work has still to be done.
9. basic.simpleLists is a very simple list module. It is really used in all applications, but is a candidate to be replaced by some more sophisticated solution (but please downward compatible !-)
10. The same holds for basic.simpleTable!
11. basic.floatings contains everything about floating objects and the corresponding directories.
12. basic.structure combines the elements above to define hierarchies, and parameterizes all modules with the required "tag mixtures".
13. basic.deliverables defines some top-level elements which correspond to a certain tradition of publishing, e.g. "article" as in LaTeX, or "webpage" as defined by html.
^ToC 6.2 Special Modules For Technical Documentation

A first group of more specialized modules is contained in
  http://bandm.eu/doctypes/d2d_gp/technicalDoc.dd2
These modules support the technical documentation of software.

1. technicalDoc.syntaxDescription for defining context free syntax definitions, terminal symbols, non-terminals, and for referring to them.
2. technicalDoc.commandLineDoc for including an Xhtml documentation of a command line option definition into a documentation text. See the description of the option module.

The definition of the meta_tools documentation itself, i.e. the text you are currently reading, is derived from these modules and contained in
  http://bandm.eu/doctypes/d2d_gp/mtdocpage.dd2

^ToC 6.3 The XSLT Transformations Into XHTML 1.0

The transformation system from d2d_gp into xhtml 1.0 is multilingual. This is controlled by a translation table, which must be extended when extending the rule set, and which may be copied and edited when adding support for new languages.

Most of the transformations from d2d_gp into xhtml 1.0 are straightforward:

1. Xpath expressions are used for determining the position number of sub-structures whenever necessary. This may happen repeatedly, without any performance considerations, because the implementing machine should support caching, not the transformation code!
2. Normal translation consists of three passes over the text, indicated by mode parameters: The first pass (mode='intoc') collects all headlines. In this mode many special structure-defining elements (like link and label) are reduced to their mere text contents, or even to "epsilon". The second pass (mode='') translates everything fully, except footnote objects, which are translated to footnote marks. The final pass (mode='footnote') renders the footnotes' text contents at the end of the document.
3. A dedicated auxiliary but global mode (mode='getreftext') collects the wording for references.
These are constructed using the #label/#ref elements. Labels may appear nearly ubiquitously, and are rendered according to the containing context, like "figure 7" or "section 3.2", translated according to mulitable. Additionally, the position of the reference influences the execution of this mode, to generate relative references like "point 3 in this list", or "in the next section".
4. A special, auxiliary and local mode (mode='puretext') collects the pure text data of elements whenever required, e.g. for constructing Xhtml attribute values.

There are some more "modes" like getnumber, which simply represent local functions or visitors.

The transformation to xhtml 1.0 is directed by a collection of style sheet parameters. Their names always start with "$user.", to prevent naming clashes with internal parameters and constants.
(The file /doctypes/d2d_gp/mtdocpage_xhtml.css.prototype holds an automatically collected list of all "css-class" definitions and of ALL these variables. There is a script extractClasses.xslt which generates this file automatically when you apply it to the extracted "<XXX>_to_xhtml_1_0.xslt". For creating this textual representation of the Xslt rules, cf. above Section 4.3.)
The most important of them are currently:

$user.user
    Name of the person initiating the transformation process. This is currently used by the generated "standard footer".
$user.date
    Date and time of the rendering process. Goes also to the "standard footer".
$user.host
    Name of the machine on which the rendering process is run. Goes to the "standard footer".
$user.mulitable
    URI where the translation table can be found. Is needed by the translation process.
$user.defaultLang
    Serves as a very low priority default language; will be overridden by any lang or langs value in the source text.
$user.collectiontitle
    Name of the collection the webpage is part of. Used for the "standard header".
$user.currentkey
    Stem of the file which is currently processed (file name w/o directory prefix and ".d2d" or ".html" suffix). Needed for navigation links, etc.
$user.url.sitemap
    Url of the sitemap, in the sitemap format. Required if header, footer, or other html elements shall include navigation.
$user.bibLocation
    Url of the file which contains the bibliography, iff it is not the current file. Iff a value is given, then all xhtml links which correspond to "cite" source elements will link there. (If no value is given, they will stay local to the generated file.)
$user.biblistHideUrl
    "==yes" indicates that a "clickable" entry visually shows only an indication of the document type to download (like ".pdf document" or ".html-Datei") instead of the full URL.
$user.linkurlprefices
    A "self-structuring" list of strings. (This is the concatenation of string values separated by an arbitrary delimiter, which is defined by the first character in the string.) Defaults to "http:/". The at most ten (10) elements of this list are used as prefixes for every url in an explicit source "link" which starts with a decimal digit. This makes a kind of "mount points" for href prefixes which (a) occur frequently and (b) must possibly be re-located when rendering in different contexts. E.g. with "user.linkurlprefices='%http:/%http://aa.bb/c%file://a/%'" every source text " #link 2def#/link" will be rendered as " #link file://a/def#/link". The zero-th prefix should always be set to http:/, which is also the default value of the parameter, since many entries, e.g. in bibliographic lists, rely on this abbreviation.
$user.linktextprefices
    A self-structured list of strings, which are used as prefixes in a similar way as $user.linkurlprefices, but for the text part of link elements.
$user.jsUrls
    Self-structured list of urls of "java script" files. One link for each of the mentioned files will be inserted into the html output.
$user.cssUrls
    Self-structured list of urls of "css" files. One link for each of the mentioned files will be inserted into the html output.
$user.iconUrl
    One single url to the "icon" used in the title bar of a html browser. Encoded in html by ""
$user.showLabels
    Iff =='yes', then all labels in the text are visible, for debugging purposes.
$user.pageSource
    Iff =='yes', then a link to the source text will be inserted in the footer "for your information". It has the form of a relative url, lying parallel to the html file.
$user.footerSignet
    Defines the first element in the standard footer. Is output as text without escaping, can contain arbitrary Xhtml constants, like links and styles and ascii-art, etc. // cf /home/lepper/ml/web12/common.mk TESTED ???? FIXME !!
$user.xhtmlVariant
    Defaults to =='strict', can be changed to =='strict'. Declarations are set accordingly (NOT YET OPERATIVE, FIXME), and things like are eliminated, causing a warning.
($p_kind_filter)
    Must contain the string ' * ' to include ALL kinds of paragraphs, or a list of identifiers, in which case only paragraphs with one of these "kinds" will be included in the output. NOT YET PUBLIC, controlled by "$publicVersion", should be public, FIXME
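The "self-structuring" list format and the digit-based prefix expansion described for $user.linkurlprefices can be illustrated with a small sketch. It is written in Python only for illustration; the real logic lives in the xslt code. The function names are hypothetical, and two details are assumptions not stated above: a trailing delimiter is treated as a mere terminator, and a digit without a corresponding list element leaves the url unchanged.

```python
def parse_self_structuring(value):
    """Split a "self-structuring" list: its first character is the delimiter."""
    if not value:
        return []
    delimiter, body = value[0], value[1:]
    parts = body.split(delimiter)
    # Assumption: a trailing delimiter only terminates the list.
    if parts and parts[-1] == "":
        parts.pop()
    return parts


def expand_link(url, prefixes):
    """Replace a leading decimal digit by the prefix with that index."""
    if url and url[0] in "0123456789":
        index = int(url[0])
        # Assumption: a digit without a matching prefix is left untouched.
        if index < len(prefixes):
            return prefixes[index] + url[1:]
    return url


# The example value from the parameter description above.
PREFIXES = parse_self_structuring("%http:/%http://aa.bb/c%file://a/%")
print(expand_link("2def", PREFIXES))          # file://a/def
print(expand_link("0/bandm.eu/x", PREFIXES))  # http://bandm.eu/x
```

The second call shows why the zero-th prefix "http:/" is a convenient abbreviation: a source url "0/bandm.eu/x" expands to a complete http url.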

Further parameters are currently only used for the particular stylesheet generating the documentation you are currently reading, namely http://bandm.eu/doctypes/d2d_gp/mtdocpage.dd2, but are intended to be abstracted in the near future:

$user.publicVersion
    Iff =='yes', then all proof-reading info and all paragraphs of "internal" kind are suppressed.
$user.collectiontitle
    The pure name of the collection, used for generating navigation bars, links, meta tags, etc.
$user.collectiontitle_html
    Overrides $user.collectiontitle in cases where full Xhtml may be used. Is included without escaping, so it may construct arbitrary Xhtml verbatim!
$user.url.sitemap
    A file following the sitemap format for generating navigation devices.

^ToC 6.3.1 Required Rotations when Translating to XHTML 1.0

In HTML/xhtml the construct "table" is not under a paragraph, as a child node, but on the same level, as a sibling. The same is true for the list constructs "ul" and "ol", and for the horizontal ruler "hr".
(Contrarily, "br" may only appear inside a "p", or inside another "block element".)

The definition of the source-level elements "list" and "table" in d2d_gp is different, since they are children of a "p". In d2d_gp, "p" is the central means for organizing attributes like "lang" or "kind" or "label", etc. So here, a "p" wraps all, including list and table, etc.

When translating to xhtml, the containment-relationship must thus be rotated:

        body                    body
          |                     / | \
          |                   /   |   \
          p                 p     |     p
         /|\                |     |     |
        / | \               |     |     |
       /  |  \              |     |     |
      /   |   \             |     |     |
  chars table chars       chars table chars

The algorithm is as follows:

  [p]alpha[/p] ==> trans(alpha)
  trans(alpha) ==> f(e, e, alpha)
  f(top, sub, e) ==> top;p'(sub)
      p'(e)=e
      p'(alpha)=[p]alpha[/p]
  f(top, sub, [list]beta[/list];alpha) ==> f(top;p'(sub);[ul]beta[/ul], e, alpha)
  f(top, sub, nonlist;alpha) ==> f(top, sub;nonlist, alpha)
  f(top, e, e) ==> top

This algorithm is implemented in the functions (= "named templates") "new_p", "from_p" and "in_p" in the xslt code contained in basic.structure.
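As an illustration only, the rewriting rules for f can be transcribed into a short Python sketch. This is not the xslt implementation; the data model is an assumption made for the example: the children of a source "p" are given as a Python list, a list element is the tuple ("list", beta), and everything else counts as "nonlist". The function name rotate_paragraph is hypothetical.

```python
def rotate_paragraph(children):
    """Turn the children of one source "p" into a sequence of siblings.

    Iterative form of f(top, sub, alpha): "top" holds the finished
    top-level nodes, "sub" the pending non-list content.
    """
    top = []
    sub = []

    def flush():
        # p'(sub): wrap pending content in a "p"; p'(e) = e emits nothing.
        if sub:
            top.append(("p", list(sub)))
            sub.clear()

    for child in children:
        if isinstance(child, tuple) and child[0] == "list":
            # f(top, sub, [list]beta[/list];alpha)
            #   ==> f(top;p'(sub);[ul]beta[/ul], e, alpha)
            flush()
            top.append(("ul", child[1]))
        else:
            # f(top, sub, nonlist;alpha) ==> f(top, sub;nonlist, alpha)
            sub.append(child)
    # f(top, sub, e) ==> top;p'(sub)
    flush()
    return top


print(rotate_paragraph(["chars A", ("list", ["item 1", "item 2"]), "chars B"]))
# [('p', ['chars A']), ('ul', ['item 1', 'item 2']), ('p', ['chars B'])]
```

The sketch covers only the "list" case, as the rules above do; a "table" child would presumably be handled in the same way.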

A similar rotation is necessary when translating into "ms-word xml".

made    2023-01-09_11h39   by    lepper   on    washington-ubuntu

produced with eu.bandm.metatools.d2d and XSLT