bibliography	bandm ^meta_tools	white papers 2

Collected White Papers on Technical Details

1          Intentions
2          Identifying, Searching and Finding of Resources
2.1          XML Document Identifier
2.2          XML-DIs in Documents
2.3          Locating Documents by URL/URI/URN
3          Locations and Locators, i.e. position IN a particular document
3.1          Locators in SAX
3.2          Locations in the d2d/xml/xslt pipelines
4          XML Namespaces
4.1          Namespace Definition, Encoding in Documents
4.2          Namespaces in SAX
4.3          Reserved Namespaces
5          XML Namespaces in D2d processing
6          Error Signalling and Processing
6.1          General Strategies
6.2          The ANTLR / xantlr / tdom / xantlrtdom Error Signalling Pipeline
6.3          Errors in SAX and JAXP/TRAX

^{^ToC} 1 Intentions

In ^meta_tools , most areas of processing are neatly separated and cleanly implemented. But there are some processes and data-flows the details of which are not only rather complicated, but also distributed over different modules and layers of the architecture.

(In most cases, this is due to external standards and tools we have to interface !-)

For these topics, a central "White Paper" is desirable which describes at one central place the coding principles (the basic design decisions as well as the details) and sketches the co-operation of the modules involved.

Some of these "little white papers" will be collected here. The collection itself is rather ad hoc, ie. not carefully structured or aiming at completeness. It will grow as soon as new topics are identified as useful and desired.

^{^ToC} 2 Identifying, Searching and Finding of Resources

^{^ToC} 2.1 XML Document Identifier

The XML Document Identifier (=XML-DI in this text) is implicitly defined in http://www.w3.org/TR/xml11/#NT-ExternalID[=XML]
It is implemented in ^meta_tools in message.XMLDocumentIdentifier
The "SYSTEM" part must be a URL.
http://webdesign.about.com/od/dtds/qt/tipdoctype.htm says that in case a PUBLIC id is present "The rest of the DOCTYPE identifier is optional..", but the syntax graph in [XML] says otherwise!

The "PUBLIC" part is not defined in the XML standard. Only traditionally it is a "Formal Public Identifier" = "FPI" as defined in SGML (= ISO 8879:1986) / (ISO 9070:1991)

See http://en.wikipedia.org/wiki/Formal_Public_Identifier[=WIKIFPI]

http://www.ietf.org/rfc/rfc3151.txt contains a proposal for mapping FPIs into a URN namespace

Following [WIKIFPI], WE as a domain owner have a registered (!!) FPI, namely
+//IDN eu.bandm//... This is supported in some methods in <METATOOLS>/dtd/Utilities.html

^{^ToC} 2.2 XML-DIs in Documents

A doctype declaration of an XML encoded text consists of a TAG, namely the tag of the root element, and a DTD reference. The latter has the form of an XML-DI.
Our implementation of DTD includes a field for this XML-DI.
This is conceptually NOT CLEAN, because the "system" part of "the same" DTD may differ between computer systems or even applications or even runs, while the "public" part must be identical.
Nevertheless, XML requires the system part to be always present, and to treat both parts as a unit.
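The three parts of a doctype declaration (root tag, public part, system part) can be observed with plain SAX. The following is only a sketch under assumptions: the class name DoctypeProbe is hypothetical and not part of ^meta_tools, and the external DTD is deliberately answered with empty content so that nothing is fetched from the network.

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.ext.DefaultHandler2;

/** Hypothetical demo class; not part of metatools. */
final class DoctypeProbe extends DefaultHandler2 {
  String rootTag, publicId, systemId;

  /** LexicalHandler callback: reports the parts of the DOCTYPE declaration. */
  @Override public void startDTD(String name, String publicId, String systemId) {
    this.rootTag = name;
    this.publicId = publicId;
    this.systemId = systemId;
  }

  /** Answer every external entity with empty content (no network access). */
  @Override public InputSource resolveEntity(String name, String publicId,
                                             String baseURI, String systemId) {
    return new InputSource(new StringReader(""));
  }

  /** Parses the text and returns the handler holding the DOCTYPE parts. */
  static DoctypeProbe probe(String xml) {
    try {
      DoctypeProbe h = new DoctypeProbe();
      XMLReader r = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      r.setProperty("http://xml.org/sax/properties/lexical-handler", h);
      r.setEntityResolver(h);
      r.setContentHandler(h);
      r.parse(new InputSource(new StringReader(xml)));
      return h;
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }

  public static void main(String[] args) {
    DoctypeProbe h = probe(
        "<!DOCTYPE html PUBLIC \"-//W3C//DTD XHTML 1.0 Strict//EN\" "
      + "\"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd\"><html/>");
    System.out.println(h.rootTag + " / " + h.publicId + " / " + h.systemId);
  }
}
```

The public and the system part arrive as two separate strings, but always together in one callback, which mirrors the "unit" character of the XML-DI.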

The XML-DI of a DTD can be encoded into a PI for tdom, so that TDOM gets to know it. This is done in the WRAPPER DTD, in case a third-party DTD shall be processed by tdom. The TypedDOMGenerator compiles this id into a static final field in the generated sub-class of TypedDTD. It can be retrieved by calling <METATOOLS>/tdom/runtime/TypedDTD.html#getDocumentId()

The method dtd.Utilities.addTdomPI_documentId(Dtd) generates such a PI for a dtd as an internal model.

Whenever a ContentPrinter gets the first "startElement" SAX call, it prints XML version and encoding, followed by a synthesized "DOCTYPE" declaration. For this, the "document locator" must have been set in advance by a call to the method <METATOOLS>/util2/ContentPrinter.html#setDocumentLocator(org.xml.sax.Locator)
This is a standard function from org.xml.sax.ContentHandler [=SAX-ContenHandler]

In our context, this method is called e.g. by DUMPER CODE GENERATED by tdom. See eg. <METATOOLS>/src/eu/bandm/tools/doctypes/xhtml/Dumper.java, method "match (Document_html)"

BUT THIS SEEMS AN ABUSE, because [SAX-ContenHandler] says this location is the ORIGIN of the sax events, i.e. the document itself, not the origin of its type definition !?!?!?

^{^ToC} 2.3 Locating Documents by URL/URI/URN

The class <METATOOLS>/doctypes/DocTypes.html implements a "URI Resolver" interface. The procedure definitions in "calltools.mk" named "xml2text", "xml2html", "xml2xml" (more may come!) all call "xalan". They extend the Xalan command line by "-URIRESOLVER eu.bandm.tools.doctypes.DocTypes".

For requests for URIs like bandm.eu/doctypes/<XXX>/<YYY> and http://bandm.eu/doctypes/<XXX>/<YYY>, the code of resolve(uri,base) delivers a "javax.xml.transform.Source" instance which points to a "getResourceAsStream()" resource named <XXX>/<YYY>, relative to the class file of DocTypes.

This is employed e.g. to find some files with xslt-procedures which are included in the main transformation file.
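The mapping from URI to class-relative resource can be sketched as follows. This is NOT the code of DocTypes: the class name ClasspathUriResolver and the helper resourcePath() are hypothetical, and returning null for foreign URIs (so that the XSLT processor falls back to its own resolution) is an assumption of this sketch.

```java
import java.io.InputStream;
import javax.xml.transform.Source;
import javax.xml.transform.URIResolver;
import javax.xml.transform.stream.StreamSource;

/** Hypothetical stand-in for the resolver described above. */
final class ClasspathUriResolver implements URIResolver {
  private static final String ROOT = "bandm.eu/doctypes/";

  /** Maps "[http://]bandm.eu/doctypes/X/Y" to "X/Y"; null for foreign URIs. */
  static String resourcePath(String href) {
    String s = href.startsWith("http://") ? href.substring("http://".length()) : href;
    return (s.startsWith(ROOT) && s.length() > ROOT.length())
        ? s.substring(ROOT.length())
        : null;
  }

  @Override public Source resolve(String href, String base) {
    String path = resourcePath(href);
    if (path == null) return null;        // not ours: let the processor resolve it
    // A path without leading slash is looked up relative to this class file.
    InputStream in = ClasspathUriResolver.class.getResourceAsStream(path);
    return (in == null) ? null : new StreamSource(in, href);
  }
}
```

When calling the transformer from Java instead of the command line, such an object could be installed with TransformerFactory.setURIResolver(...).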

remark 1)
The "PUBLIC ID" (which does belong to us !-) is currently NOT DECODED.

remark 2)
The currently used "CMDLINE-WRAPPER" for xantlr does NOT use the uri decoder for the top-level (=commandline argument) files! But it should be eliminated anyhow by a wrapper around the JAXP/TRAX interface.

^{^ToC} 3 Locations and Locators, i.e. position IN a particular document

^{^ToC} 3.1 Locators in SAX

There is an interface definition org.xml.sax.Locator which offers different methods for inquiring the different coordinates of the current input position (e.g. getColumnNumber(), getSystemId()).

The function ContentHandler.setDocumentLocator(org.xml.sax.Locator locator) should be called "under the hood" by an XML "stream" parser to set the implementing object. It is NOT guaranteed that every XML parser does so.
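A handler which wants to use the locator should therefore be prepared for the case that it was never set. A minimal sketch (the class name LineTracker and the "-1" convention for "unknown" are assumptions made for illustration):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.Locator;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical demo handler: records "tag@line" for every start tag. */
final class LineTracker extends DefaultHandler {
  private Locator locator;                 // stays null if the parser never sets it
  final List<String> seen = new ArrayList<>();

  @Override public void setDocumentLocator(Locator locator) {
    this.locator = locator;
  }

  @Override public void startElement(String uri, String local, String qName,
                                     Attributes atts) {
    int line = (locator == null) ? -1 : locator.getLineNumber();
    seen.add(qName + "@" + line);
  }

  /** Parses the text (non-namespace mode) and returns the recorded entries. */
  static List<String> track(String xml) {
    try {
      LineTracker h = new LineTracker();
      SAXParserFactory.newInstance().newSAXParser()
          .parse(new InputSource(new StringReader(xml)), h);
      return h.seen;
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }
}
```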

^{^ToC} 3.2 Locations in the d2d/xml/xslt pipelines

(This is a more internal topic, but we need a kind of memo, so we put it here!-)

Moduleregistry.loadXslt() loads the ddf module as usual. All trees in the AST-TDom get their normal XML location information.
ADDITIONALLY the documentation texts
(resulting from inputs like "docu to_filelist article" or "docu user_en article")
are wrapped into a special "LocString" object (read: located String), because LINE BREAKS in contiguous char data must be respected for the location map.
Then the Def2Xslt processor extracts the source text of the xslt rules into one large string and fills the map "stringpos2filepos" accordingly.
The Text2Udom parser constructs XMLDocumentIdentifier from the text input, may it be file or string, and stores them in the ResultingStructures. Since a MemString is used, line numbering starts with zero!
Interleaved, the MemString data is integrated without changes into the output and comes with the original location information.
The Udom2Sax generates Sax Events and implements the Sax Locator callback interface, which translates (via a cache!) these infos.
In case the Sax events are fed into an XSLT compiler (like XALAN), and this finds an error, we expect it to call that call-back method.
Most errors are reported by thrown exceptions. This is very annoying. The exceptions must be UN-WRAPPED to find the locator information. In most cases a javax.xml.transform.TransformerException can be found wrapped inside, and this carries the location information.
This can e.g. be wrapped into a org.apache.xml.utils.WrappedRuntimeException .
Or into a TransformerException or into a SAXException or into ...
CAVEAT ONE: The very first variant can only be compiled when an appropriate "Xalan-mixture" is in the classpath of the COMPILATION STEP!
CAVEAT TWO: It is UNSPECIFIED which exceptions can be thrown, and what they contain. We try to decode as much as possible to generate sensible error messages, but this is not guaranteed. As a last resort we catch the most general type (Exception) and generate an "error not understood" message.
Both points are VERY ANNOYING!
Once found, the location information is easily re-transposed for the user with the map "stringpos2filepos" mentioned above!
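The un-wrapping can be sketched as a walk along the chain of causes. This is an illustration under assumptions, not the actual ^meta_tools code: the class name LocationDigger and the output format "systemId:line:column" are invented here, and only wrappers from the standard library are covered. A Xalan-specific wrapper like org.apache.xml.utils.WrappedRuntimeException may need an additional branch, which is exactly the compile-time dependency of CAVEAT ONE.

```java
import javax.xml.transform.SourceLocator;
import javax.xml.transform.TransformerException;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;

/** Hypothetical helper: digs location information out of nested exceptions. */
final class LocationDigger {
  /** Returns "systemId:line:column" of the innermost located exception, or null. */
  static String findLocation(Throwable top) {
    String found = null;
    Throwable t = top;
    for (int depth = 0; t != null && depth < 32; depth++) {  // guard against cycles
      if (t instanceof TransformerException) {
        SourceLocator loc = ((TransformerException) t).getLocator();
        if (loc != null) {
          found = loc.getSystemId() + ":" + loc.getLineNumber()
                + ":" + loc.getColumnNumber();
        }
      } else if (t instanceof SAXParseException) {
        SAXParseException p = (SAXParseException) t;
        found = p.getSystemId() + ":" + p.getLineNumber() + ":" + p.getColumnNumber();
      }
      Throwable next = t.getCause();
      if (next == null && t instanceof SAXException) {
        next = ((SAXException) t).getException();  // older SAX wrapping style
      }
      t = (next == t) ? null : next;
    }
    return found;
  }
}
```

A caller would wrap the whole transformation in "catch (Exception e)", call findLocation(e), and fall back to the "error not understood" message when the result is null.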

^{^ToC} 4 XML Namespaces

^{^ToC} 4.1 Namespace Definition, Encoding in Documents

XML "Namespaces" and "Namespace Names" are complicated because they reflect a complex history of parallel/sequential, competing/co-operative, clean/hacking developments. Different interfaces and third-party-modules (eg SAX, Xalan) impose different constraints on their input wrt. name spaces, and these are not always fully and exactly specified!

Basically, some tools and sub-specifications of the XML family do not consider namespaces at all (e.g. the XML base specification, the DTD mechanism, [xml]). Others can work in two different modes, either namespace aware or not. A third group does always consider name spaces (XPath, XSL-T).

Whenever name spaces are considered, this applies to all kinds of identifiers, namely element tags, attribute names, and (eg. in XPath and XSLT) names of variables, functions, templates, modes, etc.

Our class NamespaceName in a first step seems to make things even more complicated, to make them easier, namely interoperable, in a second step! The basic philosophy of this class is trying to combine requirements from different sources:

In XML in general [xml] , in non-namespace mode the colon ":" is a normal character which may occur arbitrarily often in any identifier (tag or attribute name, etc.)

But in namespace mode there may be maximally ONE(1) colon ":", separating a non-empty prefix and a non-empty local name, see [xml-ns] .

Identifiers without colon have an "empty prefix", but the canonical notation ":localname" is nevertheless not allowed.

In every textual representation of a document, every prefix, including the empty prefix, must be mapped to a certain namespace URI. This namespace URI defines the "identity" of a certain namespace, i.e. is used for "equals()" tests. The assignments from prefixes to URIs are valid in nested scopes. In a textual representation of a document, such a scope is defined by the contents of an element (including its own name and the names of its attributes, see below).

A special case is the empty URI "" which can only be represented by the empty prefix.
Per default, the empty prefix is mapped to the empty URI.
Further mappings are established (in the textual representation) using the following syntax, which looks like an attribute:

    <pref:ELEMENT  xmlns="http://uri1" 
                   xmlns:pref="http://uri2"> 
       ...
    </pref:ELEMENT>

In this example, the empty prefix is mapped to "uri1", the prefix pref to "uri2". This mapping is valid "backwards", already for the tag of the containing element!

These assignments only look like attributes, but are no attributes in the sense of SAX and DOM and tdom.
(FIXME says they ARE attributes, but FIXME says they are not!?)

    <ELEMENT  xmlns="" xmlns:pref="" >

Here the first assignment is valid, the second NOT.

The mapping of the empty prefix is ONLY VALID for element tags, not for attribute names. The latter stay with the empty URI iff they have an empty prefix. This table shows the allowed combinations:

                  pref=""    pref non-empty
uri=""            ATT        --
                  EL         --
uri non-empty     --         ATT
                  EL         EL
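The asymmetry between elements and attributes can be checked with a namespace-aware SAX parser. The class name DefaultNsProbe is hypothetical, and the sketch assumes that the first element carries at least one ordinary attribute:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical demo: records the namespace URIs seen on the root element. */
final class DefaultNsProbe extends DefaultHandler {
  String elementUri, attributeUri;

  @Override public void startElement(String uri, String local, String qName,
                                     Attributes atts) {
    if (elementUri == null) {          // only look at the first element
      elementUri = uri;
      attributeUri = atts.getURI(0);   // URI of its first ordinary attribute
    }
  }

  static DefaultNsProbe probe(String xml) {
    try {
      SAXParserFactory f = SAXParserFactory.newInstance();
      f.setNamespaceAware(true);
      DefaultNsProbe h = new DefaultNsProbe();
      f.newSAXParser().parse(new InputSource(new StringReader(xml)), h);
      return h;
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }
}
```

For the input <e xmlns="http://uri1" a="1"/> the element ends up in "http://uri1", while the attribute "a" stays with the empty URI.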

An instance of our implementation, NamespaceName, represents all possible identifiers in an XML context, either in namespace mode or in non-namespace mode.
In namespace mode it MUST have a local name and a (possibly empty) URI.
It MAY additionally have a prefix value.

All field values are always != null. All arguments to a constructor call must be !=null ! "Not having" means being the empty string "". This corresponds to the requirements of SAX.

The following combinations are possible, selected by different constructor signatures.

         URI         localName          prefix        enableNameSpace

new NamespaceName(uri,localname)
        "NONEMPTY"   "NONEMPTY"         ""             true
        ""           "NONEMPTY"         ""             true

new NamespaceName(uri,prefix,localname)
        "NONEMPTY"   "NONEMPTY"         "NONEMPTY"     true
        ""           "NONEMPTY"         "NONEMPTY"     --> ERROR

new NamespaceName(qname/localname)
        ""           "::NON::EMPTY::"   ""             false

("NONEMPTY" does not contain colons!)

Only URI and localname are considered for the definition of "equals()"! As a consequence, non-namespace-names and namespace names in the no-uri-space are considered EQUAL iff they have the same local name!

(
The prefix is only used to memorize some previous external representation, for writing out the name with the same prefix again, iff possible. It is only used for "ergonomic" reasons, but semantically not significant! When later serializing namespaces, it can serve as a MERE HINT for constructing an external representation, resembling the one originally read in. It is not really clean to keep it here, but the basic philosophy of namespace names is not ours. Anyhow, this partitioning is much less redundant and entangled than the SAX "qnames", which contain the prefix and double the local name.
)
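The rules of this section can be condensed into a small value class. This is a hypothetical sketch called NameSketch, NOT the real NamespaceName: the factory methods, the field names and the choice of IllegalArgumentException are assumptions, only the constraints themselves are taken from the text above.

```java
import java.util.Objects;

/** Hypothetical sketch of the rules above; NOT the metatools NamespaceName. */
final class NameSketch {
  final String uri, prefix, localName;   // never null, "" means "not there"
  final boolean namespaceMode;

  private NameSketch(String uri, String prefix, String localName, boolean nsMode) {
    this.uri = Objects.requireNonNull(uri);
    this.prefix = Objects.requireNonNull(prefix);
    this.localName = Objects.requireNonNull(localName);
    this.namespaceMode = nsMode;
    if (localName.isEmpty())
      throw new IllegalArgumentException("empty local name");
    if (nsMode && (localName.contains(":") || prefix.contains(":")))
      throw new IllegalArgumentException("colon not allowed in namespace mode");
    if (uri.isEmpty() && !prefix.isEmpty())
      throw new IllegalArgumentException("empty URI requires empty prefix");
  }

  static NameSketch of(String uri, String localName) {
    return new NameSketch(uri, "", localName, true);
  }
  static NameSketch of(String uri, String prefix, String localName) {
    return new NameSketch(uri, prefix, localName, true);
  }
  /** Non-namespace mode: colons are ordinary characters. */
  static NameSketch plain(String qName) {
    return new NameSketch("", "", qName, false);
  }

  /** Only URI and local name count; the prefix is a mere serialization hint. */
  @Override public boolean equals(Object o) {
    return o instanceof NameSketch
        && uri.equals(((NameSketch) o).uri)
        && localName.equals(((NameSketch) o).localName);
  }
  @Override public int hashCode() {
    return Objects.hash(uri, localName);
  }
}
```

With these definitions, of("http://u","p","e") equals of("http://u","q","e"), and of("","e") equals plain("e"), as stated above for the no-uri-space.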

^{^ToC} 4.2 Namespaces in SAX

The two kinds of SAX events which represent tags (startElement() and endElement()) have "uri", "localname" and "qName" as three independent arguments, e.g. to the "startElement()" method in ContentHandler.
Its doc text says (not very clearly):
"If the http://xml.org/sax/features/namespaces-property is true (default!), then uri and localname must be provided. If false, both may together be there or may be not there.
If the http://xml.org/sax/features/namespace-prefixes-property is true then the qName is required. If false (default!) then it is optional."
To repeat: "not being there" always means being the empty string value, but not null!
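The effect of the second feature can be observed directly: with "namespace-prefixes" switched on, the "xmlns..." declarations additionally show up in the attribute list of startElement(). A sketch, assuming the parser bundled with the JDK; the class name FeatureProbe is hypothetical:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical demo: counts the attributes reported for the last start tag. */
final class FeatureProbe extends DefaultHandler {
  int attributeCount;

  @Override public void startElement(String uri, String local, String qName,
                                     Attributes atts) {
    attributeCount = atts.getLength();
  }

  static int countAttributes(String xml, boolean namespacePrefixes) {
    try {
      SAXParserFactory f = SAXParserFactory.newInstance();
      f.setNamespaceAware(true);   // http://xml.org/sax/features/namespaces = true
      XMLReader r = f.newSAXParser().getXMLReader();
      r.setFeature("http://xml.org/sax/features/namespace-prefixes", namespacePrefixes);
      FeatureProbe h = new FeatureProbe();
      r.setContentHandler(h);
      r.parse(new InputSource(new StringReader(xml)));
      return h.attributeCount;
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
  }
}
```

For <p:e xmlns:p="http://u"/> the handler sees no attribute in the default setting, and one attribute when "namespace-prefixes" is true.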

It is not clear in which mode which data takes which precedence! So, to be on the safe side, all arguments should be supplied redundantly when using badly documented third-party software. Esp. "qname" should comply (redundantly) to the current mapping of prefixes to uris.

SAX includes the events startPrefixMapping(String,String) and endPrefixMapping(String) to grow and shrink this map.
It is NOT SPECIFIED what a "start" does when a prefix is already in use, and whether an "end" will let old, shadowed mappings pop up again.

OUR code which consumes SAX events treats these questions systematically and in a compositional way.
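One compositional reading is "one stack per prefix": a "start" shadows the older mapping, an "end" makes it visible again. The following sketch only illustrates this reading. It is an assumption, not a description of the actual ^meta_tools code, and the class name PrefixScopes is invented.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical sketch: nested prefix scopes with shadowing. */
final class PrefixScopes {
  private final Map<String, Deque<String>> stacks = new HashMap<>();

  /** To be called from ContentHandler.startPrefixMapping(prefix, uri). */
  void start(String prefix, String uri) {
    stacks.computeIfAbsent(prefix, p -> new ArrayDeque<>()).push(uri);
  }

  /** To be called from ContentHandler.endPrefixMapping(prefix). */
  void end(String prefix) {
    Deque<String> s = stacks.get(prefix);
    if (s == null || s.isEmpty())
      throw new IllegalStateException("unbalanced end for prefix: " + prefix);
    s.pop();                       // a shadowed mapping pops up again
  }

  /** Current URI of a prefix; the empty prefix defaults to the empty URI. */
  String uriOf(String prefix) {
    Deque<String> s = stacks.get(prefix);
    if (s != null && !s.isEmpty()) return s.peek();
    return prefix.isEmpty() ? "" : null;
  }
}
```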

The api doc of org.xml.sax.ContentHandler.startPrefixMapping(String,String) says :
"There should never be start/endPrefixMapping events for the "xml" prefix, since it is predeclared and immutable."

But all library functions which make up the SAX processing pipeline do not treat the "xml:" ns-prefix specially. It is NOT YET CLEAR at which points, and how often, an explicit

startPrefixMapping("xml", "http://www.w3.org/XML/1998/namespace")

is permitted, or even required!
FIXME !!

It is also not yet clear how to avoid the following situation when translating from a less idiosyncratic model, e.g. writing out d2d "Udom" structures:
If a (non-empty) namespace-uri is assigned to the empty prefix, this is fine for all elements. But when later an ATTRIBUTE has to be written with this namespace-uri, then that must be matched to some non-empty prefix additionally to the empty one.
So in this case, "synonyms" seem really necessary.
((
Perhaps it would be the best strategy to leave the empty prefix immutably assigned to the empty URI, because the latter cannot be expressed in another way!?!?
And, even better, do never use the empty URI, if not required by some legacy format!?
))

Generally, our SAX processing pipelines operate in namespace mode. The value of "prefix" is normally ignored. But there are some applications in the outer world which do require a certain prefix! In this context, the original prefix (eg. recognized by a third-party parser and communicated to a metatools internal structure involving NamespaceName instances, via a SAX event) can be stored in the "prefix" field of the NamespaceName instance. It then can be passed on transparently, and at last will be used on final re-serialization. BUT this only as a hint, not carrying real semantics!

The serialization of NamespaceName instances includes the emitting of the corresponding "prefix mapping" events. In ^meta_tools , the serialization is performed by ContentPrinter.
But this device is NOT namespace-aware, it prints only q-names! So, to use it in namespace mode, a NamespaceEmbedder is placed in front of it! This consumes the start/end mapping events, keeps track of all open scopes, and maps them to "PSEUDO-ATTRIBUTE" definitions. It fills in the resulting qName for all attributes and elements. This is of course an ABUSE of the SAX interface and does only work because the behaviour of our ContentPrinter is known and specified BEYOND the SAX convention!
((
Consequently, in a first version, this code did remove the Uri and the local name from the sax events. But this DOES NOT WORK for Xalan Template Construction, see below, so for this case we have to produce redundant output!
))

Its code requires that there are NO HOMONYMS, ie. nested assignments of the same prefix to different (or the same?) uris. To ensure this, NamespaceHomonymResolver can be included in the pipeline.

The Udom2Sax-serializer from d2d, eg, looks only whether the Uri of the namespaceName to write out next is already mapped to any prefix. If so, it does nothing! Only if not, a currently unused prefix is selected and assigned to this Uri. For this, the prefix stored in the namespace name is the first candidate.

In d2d namespaces can be declared in genuine "ddf" definitions by prefix and ns-uri. Both must be non-empty.
In the "with xmlrep element = <XMLTAG>" part, the xml tag can have a prefix, which then will be mapped back to the declared namespace uri to construct a namespace name.
If there is no prefix in the xml tag, and no namespace has been declared to be "default", then a namespace name is constructed with empty prefix, mapped to the empty uri. Otherwise, the declared default is taken.

If a dtd is imported to d2d as a definition module, then the "<?tdom xmlns:... ?>" namespace declarations are respected and required.

The following graphic tries to summarize the situation between d2d output and ContentPrinter:

  STARTPREFIXMAPPING   uuuu           pppp
  STARTELEMENT         uuuu  llll
  ENDELEMENT           uuuu  llll    
  ENDPREFIXMAPPING                    pppp

==================[namespaceEmbedder] ======== >

  STARTPREFIXMAPPING   uuuu           pppp
  STARTELEMENT         uuuu  llll     pppp:llll
     ATTRIBUTE         ----  pppp     xmlns:pppp   = uuuu
  ENDELEMENT           uuuu  llll     pppp:llll
  ENDPREFIXMAPPING                    pppp

//original intention of namespaceEmbedder, to be used with content printer,
// was different, namely:

==================[namespaceEmbedder] ======== >

  STARTELEMENT         ----  ----     pppp:llll
     ATTRIBUTE         ----  ----     xmlns:pppp   = uuuu
  ENDELEMENT           ----  ----     pppp:llll

((
The original design did NOT WORK when the sax events were piped into a JAXP-TemplateHandler receiver for constructing an XSLT Transformer. This SEEMS TO require the redundant version, as we found out after two days' reverse engineering !-(
))
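The redundant variant from the first diagram can be sketched as a SAX filter. This is a hypothetical re-construction called EmbedderSketch, NOT the code of the real NamespaceEmbedder. It assumes that there are no homonyms (as required above), and it does not solve the problem of attributes whose URI is only mapped to the empty prefix.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;
import org.xml.sax.helpers.XMLFilterImpl;

/** Hypothetical sketch; NOT the metatools NamespaceEmbedder. */
final class EmbedderSketch extends XMLFilterImpl {
  /** Mappings announced since the last start tag, as {prefix, uri}. */
  private final List<String[]> pending = new ArrayList<>();
  /** uri -> prefix for all open scopes (assumes: no homonyms). */
  private final Map<String, String> prefixOf = new HashMap<>();

  @Override public void startPrefixMapping(String prefix, String uri) throws SAXException {
    pending.add(new String[] {prefix, uri});
    prefixOf.put(uri, prefix);
    super.startPrefixMapping(prefix, uri);     // passed on redundantly
  }

  @Override public void endPrefixMapping(String prefix) throws SAXException {
    prefixOf.values().remove(prefix);
    super.endPrefixMapping(prefix);
  }

  private String qName(String uri, String local) {
    String p = prefixOf.get(uri);
    return (p == null || p.isEmpty()) ? local : p + ":" + local;
  }

  @Override public void startElement(String uri, String local, String ignoredQName,
                                     Attributes atts) throws SAXException {
    AttributesImpl out = new AttributesImpl();
    for (String[] m : pending) {               // the "pseudo-attributes"
      String q = m[0].isEmpty() ? "xmlns" : "xmlns:" + m[0];
      out.addAttribute("", m[0], q, "CDATA", m[1]);
    }
    pending.clear();
    for (int i = 0; i < atts.getLength(); i++) {
      String u = atts.getURI(i), l = atts.getLocalName(i);
      out.addAttribute(u, l, qName(u, l), atts.getType(i), atts.getValue(i));
    }
    super.startElement(uri, local, qName(uri, local), out);
  }

  @Override public void endElement(String uri, String local, String ignoredQName)
      throws SAXException {
    super.endElement(uri, local, qName(uri, local));
  }

  /** Usage example: feeds the events of the diagram, returns what arrives downstream. */
  static String demo() {
    StringBuilder log = new StringBuilder();
    EmbedderSketch f = new EmbedderSketch();
    f.setContentHandler(new DefaultHandler() {
      @Override public void startElement(String u, String l, String q, Attributes a) {
        log.append(q).append(' ').append(a.getQName(0)).append('=').append(a.getValue(0));
      }
    });
    try {
      f.startPrefixMapping("pppp", "uuuu");
      f.startElement("uuuu", "llll", "", new AttributesImpl());
      f.endElement("uuuu", "llll", "");
      f.endPrefixMapping("pppp");
    } catch (SAXException e) {
      throw new IllegalStateException(e);
    }
    return log.toString();
  }
}
```

In this sketch, uri and local name are passed on unchanged and the qName is filled in additionally, which corresponds to the redundant output required for the Xalan template construction.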

^{^ToC} 4.3 Reserved Namespaces

The most frequent reserved namespace is

  xmlns:xml="http://www.w3.org/XML/1998/namespace"

It is special in very different concerns, which make it hard to treat it consistently:

The namespace uri "http://www.w3.org/XML/1998/namespace" is always bound to the prefix "xml:".
This may be declared, but
it is not required to be declared, ie. this mapping is always valid, implicitly.
This namespace name must not be bound to a different prefix, and
the prefix "xml:" must not be bound to a different namespace name.

See [NS in XML:sect 3, last paras].
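The SAX helper class org.xml.sax.helpers.NamespaceSupport from the Java standard library follows the same rules and can serve as a small illustration. The wrapper class XmlPrefixDemo and the URI "http://example.org/other" are invented for this sketch:

```java
import javax.xml.XMLConstants;
import org.xml.sax.helpers.NamespaceSupport;

/** Hypothetical demo of the predeclared, immutable "xml" prefix. */
final class XmlPrefixDemo {
  /** The mapping is valid implicitly, without any declaration. */
  static boolean xmlPrefixIsPredeclared() {
    NamespaceSupport ns = new NamespaceSupport();
    return XMLConstants.XML_NS_URI.equals(ns.getURI("xml"));
  }

  /** An attempt to bind "xml" to another URI is refused (returns false). */
  static boolean canRebindXmlPrefix() {
    NamespaceSupport ns = new NamespaceSupport();
    ns.pushContext();
    return ns.declarePrefix("xml", "http://example.org/other");
  }
}
```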

Esp. the last points are crucial, because our normal pipeline cannot treat this prefix and uri in the canonical way. Current firefox (36.0.1) indeed rejects documents which attempt this. VERY UGLY!

^{^ToC} 5 XML Namespaces in D2d processing

In the context of d2d , there are some very different settings involving name spaces:

1--

The fundamental task of d2d is to produce an XML encoded output from a "readable" d2d source. This is described in d2d.html in detail. A dedicated section describes how name spaces are defined in the text type definition, and how element and attribute definitions can refer to name spaces (via prefixes) when defining an explicit xml tag.

2--

When a DTD is used to define the text format and the parsing rules, then, for namespaces to be active, the tags (=element names/attribute names) in the DTD must have the form with one colon ("ab:cde"), and there must be TDOM-PIs like "<?tdom xmlns:ab="nameSpaceUri" ?>" in the DTD. The d2d parser will do as it always does: accept (and need) only the pure "local" tags, like "cde", and generate the XML output in the intended name space.

3--

When a d2d source text is used to describe an XSLT transformation, then the target format is given by the header:
"#d2d 2.0 xslt text producing <module> : <toplevelelement>"
By this the (zero or more) namespace declarations occurring in the target text type are known, i.e. the corresponding mappings of prefixes to uris. The namespace declaration for xslt itself is known anyhow and uses some fixed prefix.

But additional name spaces can be required, namely those occurring in the input document, to which the defined xslt code later shall be applied, or for calling some "tpath runtime extension functions", which can live in any name space. For this the xslt mode defines an additional syntax, see the dedicated section in the d2d xslt mode documentation.

4--

When the xslt rules are defined in the ddf text format definition (with "docu to_<X>" constructs, as explained in the d2d documentation), then the namespaces of the input document are known, because this is the d2d model just created, and the target type is indicated by the "<X>" in the transformation definition.

All additional name spaces needed for tpath extension functions etc. can currently only be added by extending the namespace declaration mechanism: All namespace declarations of all modules involved will be combined into the namespace declarations of the synthesized xslt source. Of course this is not an optimal solution, because conflicts of prefixes must be "manually" avoided in advance.

(An automated solution is of course possible as soon as the xslt sources are parsed by ourselves and not left to an external tool. But this is not the general case.)

5--

Sometimes a d2d text type definition (ddf) is translated into a DTD. This is executed as a "Task" from the d2d man tool, see the tool doc.
(This is frequently done to compile the DTD via TDOM into a strongly typed Java model.)

In this case, the namespace declarations from the ddf text type definition will re-appear as "<?tdom xmlns:ab=".." ?>" processing instructions, as described above.

^{^ToC} 6 Error Signalling and Processing

^{^ToC} 6.1 General Strategies

Errors and Exceptions can be treated by different mechanisms, depending on the usage context of the code, whether it is sensible to continue the operation of a certain module after the error, and depending on the size of the software module.

Small, low-level and general purpose modules may signal an error by throwing an exception without any declaration, ie. a subclass of <JAVA-API>/java.lang.RuntimeException. With throwing an exception, the execution of the current method call is finished. So this is really a very limited protocol.

Then some higher layer in the software architecture is responsible for translating this error into some diagnosis sensible for a user. An exception should NEVER reach the user's level.

Larger, specialized software modules get a MessageReceiver as a constructor parameter. They send warnings, errors and hints explicitly to this channel.

This can happen arbitrarily often, without terminating execution of the module's functionality.

Additionally, a result may be set ==null or to some other dedicated value, to indicate that the execution has failed and any pipeline processing cannot be continued without special reactions.

The message receiver can "tee" the messages to a MessageCounter, for easily detecting if severe errors have occurred in a transformation step.
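The interplay of these strategies can be sketched as follows. All names (MessagingSketch, Message, CountingTee, sumOfNumbers) are hypothetical and do NOT reflect the real MessageReceiver or MessageCounter API of ^meta_tools. The sketch only shows the protocol: the low-level parser throws an undeclared exception, the larger module translates it into a message, continues, and finally signals failure by a null result.

```java
import java.util.List;
import java.util.function.Consumer;

/** Hypothetical sketch of the messaging protocol; not the metatools API. */
final class MessagingSketch {
  enum Kind { HINT, WARNING, ERROR }

  static final class Message {
    final Kind kind;
    final String text;
    Message(Kind kind, String text) { this.kind = kind; this.text = text; }
  }

  /** Counts errors and forwards every message to a second receiver ("tee"). */
  static final class CountingTee implements Consumer<Message> {
    private final Consumer<Message> next;
    int errors;
    CountingTee(Consumer<Message> next) { this.next = next; }
    @Override public void accept(Message m) {
      if (m.kind == Kind.ERROR) errors++;
      next.accept(m);
    }
  }

  /** A "larger module": reports each problem, keeps going, returns null on failure. */
  static Integer sumOfNumbers(List<String> tokens, Consumer<Message> msg) {
    int sum = 0;
    boolean failed = false;
    for (String t : tokens) {
      try {
        sum += Integer.parseInt(t);            // low level: throws RuntimeException
      } catch (NumberFormatException e) {      // higher layer: translate for the user
        msg.accept(new Message(Kind.ERROR, "not a number: " + t));
        failed = true;
      }
    }
    return failed ? null : sum;
  }
}
```

A pipeline driver would inspect the error count of the tee after each step and stop when it is greater than zero.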

^{^ToC} 6.2 The ANTLR / xantlr / tdom / xantlrtdom Error Signalling Pipeline

MORE TO COME

^{^ToC} 6.3 Errors in SAX and JAXP/TRAX

org.xml.sax.XMLReader defines a method "setErrorHandler(org.xml.sax.ErrorHandler)".

Such an error handler must implement three methods (warning(), error() and fatalError()), all parametrized with a SAXParseException object.
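A minimal handler which only collects the reports may look as follows. The class name CollectingErrorHandler and the message format are assumptions for illustration. Note that a parser may still throw the exception after fatalError() has returned, so the caller needs a catch clause anyhow.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.ErrorHandler;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.SAXParseException;
import org.xml.sax.XMLReader;

/** Hypothetical demo: collects all reports instead of printing or throwing. */
final class CollectingErrorHandler implements ErrorHandler {
  final List<String> log = new ArrayList<>();

  private void note(String level, SAXParseException e) {
    log.add(level + " at line " + e.getLineNumber() + ": " + e.getMessage());
  }

  @Override public void warning(SAXParseException e) { note("warning", e); }
  @Override public void error(SAXParseException e) { note("error", e); }
  @Override public void fatalError(SAXParseException e) { note("fatal", e); }

  /** Parses the text and returns everything that was reported. */
  static List<String> check(String xml) {
    CollectingErrorHandler h = new CollectingErrorHandler();
    try {
      XMLReader r = SAXParserFactory.newInstance().newSAXParser().getXMLReader();
      r.setErrorHandler(h);
      r.parse(new InputSource(new StringReader(xml)));
    } catch (SAXException e) {
      // already recorded via fatalError(); parsing simply ends here
    } catch (Exception e) {
      throw new IllegalStateException(e);
    }
    return h.log;
  }
}
```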



made 2018-12-30_11h02 by lepper on linux-q699.site

produced with eu.bandm.metatools.d2d and XSLT
