


Collected White Papers on Technical Details --2--

1          XML Entities, Definition and Usage
1.1          Categories and Declaration Syntax
1.2          References to Entities
1.3          Using Entities / Expanding Entity References
1.3.1          Recursive Entity Expansion
2          Encoding and Escaping in XML
2.1          Character References as Second Level Encoding
2.2          Character References for Escaping Active Characters
2.3          Additionally "Parameter Value Normalization"
2.4          CDATA section
3          Adjacent Text Nodes in W3C DOM implementations
4          Text Encoding, Java Readers and Writers, XML Parsers and Encoders

1 XML Entities, Definition and Usage

The notions, categories and roles of "Entities" in XML are quite confusing. As so often with these standards, very different concepts have been thoroughly mangled.

This article tries to give a survey. It is based on
Extensible Markup Language (XML) 1.0 (Third Edition)
W3C Recommendation 04 February 2004
(version: http://www.w3.org/TR/2004/REC-xml-20040204)
See also [xml]

Most of the wording in the following description follows in a one-to-one relation the formal "non-terminals" in that text, even when these are not very sensible, e.g. "Reference" vs. "EntityRef" vs. "PEReference".

In the following in double round parentheses we refer to

sections of that spec, like ((4.2)), ((begin of 4))
well-formed and validity constraints contained therein by ((WFC: In DTD))
grammar production rules ((72))

1.1 Categories and Declaration Syntax

With each single XML document, there are always one(1) or two(2) implicit, unnamed entities:
Document entity = the top level physical container, somehow "virtual" but always present. ((4, 4.8))

The (maximally one) DTD file referred to in the document type declaration is called "external subset". This is also a kind of entity! ((4, 2.8))

All other entities are identified by name and defined in an Entity Declaration in the DTD (in the "external subset") or in the doctype definition in the document ("internal subset"). ((4.2, 70pp))

                                          / unparsed 
                                  external
                                 /        \ parsed
           General Entity_______/                   
         /   (="Entity")        \ 
Entities                          internal  (parsed)
         \          
          \                       external  (parsed)
            Parameter Entity ____/
              (="PE")            \internal  (parsed)

EXAMPLES of the corresponding declarations (EntityDecl) :

  <!ENTITY geName2 SYSTEM 'http://bandm.eu/logo80.jpg'
                   NDATA  jpeg >               // general external unparsed 
                                               // ((71+73))

  <!ENTITY geName1 SYSTEM '/var/log/messages'> // general external ((71+73))
  <!ENTITY geName0 'geReplacementText'>        // general internal ((71+73)) 

  <!ENTITY % peName1 
         PUBLIC "-//W3C//ENTITIES Latin 1 for XHTML//EN" 
                "http://w3c.org/xhtml/xhtml-lat1.ent">
                                               // parameter external ((72+74))
  <!ENTITY % peName0 'peText'>                 // parameter internal ((72+74))

If there is more than one declaration with the same name in the same name space, the first declaration wins (most of our tools issue a WARNING). ((4.2))
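The "first declaration wins" rule can be observed with any conforming parser. The following is a minimal sketch, assuming the JAXP parser bundled with the JDK; the class name FirstDeclarationWins and the helper rootText are hypothetical names chosen for this illustration only.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Sketch: two declarations of the same general entity; only the first one binds. */
final class FirstDeclarationWins {

    /** Parses the given document text and returns the text content of its root element. */
    static String rootText(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String xml =
              "<!DOCTYPE doc [\n"
            + "  <!ENTITY geName0 'first'>\n"
            + "  <!ENTITY geName0 'second'>\n"   // ignored: the name is already bound
            + "]>\n"
            + "<doc>&geName0;</doc>";
        System.out.println(rootText(xml)); // prints "first"
    }
}
```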

Entities must be defined before use! (otherwise it is an ERROR. "Use" means "being referred to", and this means "being expanded". The temporal sequence of that expansion is described below in more detail.)
((VC Entity Declared))

General entities are primarily used in the document. They are referred to by "&xxx;". (see below)
(Attention: In descriptions, sometimes the word "general" is left out !-)
((begin of 4))
They can appear in (a) document character contents, (b) attribute values in the document and (c) default values for these attribute values, which are defined in the DTD.
("used"/"appear" means always "referred to" and "later expanded")
((begin of 4))

Parameter entities are used solely in the DTD, referred to by "%xxx;" (see below) ((WFC: In DTD))

The name spaces for parameter entities and general entities are distinct.
((begin of 4))

Entity declarations (general and parameter) can take one of two forms:
internal entity -- directly defines a (literal) entity value. ((4.2.1))
external entity -- point to an external resource (e.g file) with "SYSTEM" and "PUBLIC" syntax. The contents of this file (after a text declaration is stripped) defines the entity value. ((4.2.2))

Iff an external entity declaration has an "NDATA" part, then this gives the "notation" of the entity, and the entity is an Unparsed Entity (the complete systematic name would be "general external unparsed entity").
These are "non-text files", and "notation" gives their encoding. These are referred to by attribute values of attribute type ENTITY or ENTITIES.

All other entities are parsed entities.
These are referred to by references, as described below.
When parsing the XML document, these references are replaced by the replacement text of the referred entity. This process is called expansion.

An entity value may contain references to OTHER entities.
Cycles are forbidden. ((WFC: No Recursion))
The replacement text of an entity is identical to the entity's literal value, after being expanded itself, recursively. (See next section for details)
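A minimal sketch of nested and cyclic references, assuming the JAXP parser bundled with the JDK. The entity names outer, inner, a and b and the class NestedEntityDemo are made up for this illustration. Note that "inner" is declared after "outer": this is fine, because the general reference is only expanded when "outer" is used in the document content.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Sketch: an entity value referring to another entity, and a forbidden cycle. */
final class NestedEntityDemo {

    /** Parses the document text and returns the text content of the root element. */
    static String rootText(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    /** True if the parser refuses the document (any parse failure counts). */
    static boolean isRejected(String xml) {
        try {
            rootText(xml);
            return false;
        } catch (Exception e) {
            return true;
        }
    }

    public static void main(String[] args) throws Exception {
        String nested =
              "<!DOCTYPE doc [ <!ENTITY outer 'x&inner;y'> <!ENTITY inner 'B'> ]>"
            + "<doc>&outer;</doc>";
        System.out.println(rootText(nested));   // prints "xBy"

        String cyclic =
              "<!DOCTYPE doc [ <!ENTITY a '&b;'> <!ENTITY b '&a;'> ]>"
            + "<doc>&a;</doc>";
        System.out.println(isRejected(cyclic)); // prints "true" (WFC: No Recursion)
    }
}
```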

Parsed general external entities should begin with a text declaration, which must be given VERBATIM (not by references to entities) ((4.3.1))

Parsed general external entities must be well-formed (ergo => no close tags dangling!) ((4.3.2))

Each parsed general external entity may define its own ENCODING. ((4.3.3))

1.2 References to Entities

                         / CharRef    
               Reference              // ((67))
             /           \ EntityRef
   xReference (=our naming!)
             \ 
               PEReference            // ((69))

A Reference is a CharRef or an EntityRef and appears in a document. It may also appear in the default attribute value in a DTD, because this can be seen as part of all later documents.

Its syntax is like "&xxx;", all characters without intervening whitespace.

An EntityRef refers to a general entity. (A more systematic name would be "GEReference").

A reference to a CHARACTER is done by its unicode value (All characters without intervening whitespace):

     &#123;    // decimal character value
     &#xf0e;   // hexadecimal character value

The character must be a legal character for input. ((4.1, 66))
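The two numeric forms and the legality check can be sketched as follows. This is an illustration only, not part of our tools; the class CharRefDemo and its methods are hypothetical names. The range test follows production [2] "Char" of XML 1.0.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Sketch: decoding "&#123;" / "&#xf0e;" and checking the result against production [2]. */
final class CharRefDemo {

    // No whitespace is permitted anywhere between "&#" and ";".
    private static final Pattern CHAR_REF =
            Pattern.compile("&#(?:([0-9]+)|x([0-9a-fA-F]+));");

    /** Production [2] "Char" of XML 1.0. */
    static boolean isLegalXmlChar(int cp) {
        return cp == 0x9 || cp == 0xA || cp == 0xD
            || (cp >= 0x20 && cp <= 0xD7FF)
            || (cp >= 0xE000 && cp <= 0xFFFD)
            || (cp >= 0x10000 && cp <= 0x10FFFF);
    }

    /** Returns the code point denoted by a character reference, or throws IllegalArgumentException. */
    static int codePointOf(String charRef) {
        Matcher m = CHAR_REF.matcher(charRef);
        if (!m.matches()) {
            throw new IllegalArgumentException("not a character reference: " + charRef);
        }
        // parseInt throws NumberFormatException (an IllegalArgumentException) on overflow.
        int cp = (m.group(1) != null)
                ? Integer.parseInt(m.group(1), 10)
                : Integer.parseInt(m.group(2), 16);
        if (!isLegalXmlChar(cp)) {
            throw new IllegalArgumentException("not a legal XML character: " + charRef);
        }
        return cp;
    }

    public static void main(String[] args) {
        System.out.println(codePointOf("&#123;"));                       // prints "123"
        System.out.println(Integer.toHexString(codePointOf("&#xf0e;"))); // prints "f0e"
    }
}
```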

amp, lt, gt, apos, quot are pre-defined (general) entities. ((4.6))

Section ((4.6 "Predefined Entities")) says that they SHOULD be declared like all other general entities "for interoperability", which describes some existing SGML based tools. WE just ignore this ...

BUT IF you declare them, the replacement text for "lt" and "amp" must again be a numeric character reference, i.e. "<" and "&" must be escaped twice. The text brings the example

<!ENTITY lt     "&#38;#60;">
<!ENTITY gt     "&#62;">
<!ENTITY amp    "&#38;#38;">
<!ENTITY apos   "&#39;">
<!ENTITY quot   "&#34;">

A PEReference refers to a parameter entity and occurs in a DTD.

Its syntax is like "%xxx;", all characters without intervening whitespace. ((4.1, 69, WFC: In DTD))

1.3 Using Entities / Expanding Entity References

Unparsed entities are referred to by using their name as the value for some attribute of type "ENTITY" or "ENTITIES". ((WFC: Parsed Entity)) The systematic name for these is "general external unparsed entities".

The rest of this section is about parsed entities only.

The expansion of entity references is defined by [xml] in an imperative way ((4.5)):
Whenever an XML document and the corresponding DTD are "parsed", i.e. the textual representation is transformed into some internal data model, and this is somehow communicated to some "application", then the entity references contained therein are (possibly!) expanded.
At the same time also character references are translated to character data.
The exact behaviour is rather complicated. It depends on the kind of entity and the current parsing context.
The contexts are :

C -- document contents (outside of tags)
A -- in an attribute value (inside an open tag in a document, OR in the default value declaration in a DTD)
E -- in the "literal entity value" of a parsed internal (general or parameter) entity declaration. (This value will be, after being expanded itself, the "replacement text" of the declared entity. When the entity is "used", ie, a reference to it is parsed, the replacement text will possibly be expanded a second time.)
D -- in a DTD, but outside of any entity decl, attribute default value, PI, comment, system/public id, or IGNORE section.

The table from [xml] , section 4.4, in a simplified form:

      param. ref (int/ext)   internal general ref   ext. parsed general ref   character ref
C     --                     expand                 expand(V)                 expand
A     --                     expand(L)              ERROR                     expand
E     expand(L)              --                     --                        expand
D     expand(T)              ERROR                  ERROR                     ERROR

"--" simply means to leave the reference untouched, as normal text, or for later expansion,

"expand" means to replace the reference with the replacement text of the referred entity / with the character, and continue the parsing process with this data instead of the reference. At the end of that data the parsing process will continue with the data which follows the closing semicolon of the reference.

This "recursive parsing" leads to the expansion of further entity and character references which are contained in the replacement text.

(Parameter entities are never contained in replacement text, but resolved "when" the entity is declared, see next section!)

"expand(V)" means the same, but the external entity may be left un-parsed iff the processing tool does not claim to be "validating".

(The parse() method of our TunedDTDParser has a boolean parameter errorOnExpand which, when set to true, modifies parsing in a similar way: An unreachable external entity leads to an error only if it is actually expanded.)

"expand(L)" means "include into a literal":
The same as above: the entity's replacement text is subject to the parsing process. With one small exception: Double and single quotes from the replacement text are treated as normal characters and do not terminate the text constant denotation into which they are inserted. ((4.4.5)) This allows including an entity value which contains such characters into an attribute value or into an entity value declaration. (These both may be delimited by single or double quotes, you never know !-)
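The "include into a literal" behaviour can be made visible with a small experiment. This is a sketch under the assumption that the JAXP parser bundled with the JDK is used; the entity name q and the class IncludedInLiteralDemo are invented for the example. The double quotes inside the replacement text do not end the attribute value, although the value itself is delimited by double quotes.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Sketch: quotes coming from an entity's replacement text do not close an attribute literal. */
final class IncludedInLiteralDemo {

    /** Parses the document text and returns the named attribute of the root element. */
    static String rootAttribute(String xml, String attributeName) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getAttribute(attributeName);
    }

    public static void main(String[] args) throws Exception {
        String xml =
              "<!DOCTYPE doc [ <!ENTITY q 'say \"hi\"'> ]>"
            + "<doc a=\"&q;\"/>";                      // ER: <doc a="&q;"/>
        System.out.println(rootAttribute(xml, "a"));   // prints: say "hi"
    }
}
```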

"expand(T)" means "expand as tokens". This is applied in DTD grammars of content models: The replacement text is framed by whitespace, so that only integral tokens can be inserted.
(This makes feasible the implementation technique in TunedDTDParser: Since they are framed with whitespace, their expansion can only occur at a place where whitespace is syntactically allowed, so that it is sufficient to look for them (and expand them) when parsing (optional or obligate) whitespace!)

"Error" simply means it should not happen !-)

There are further "validity and well formedness constraints" imposed on the declaration and usage of entities:

In INTERNAL subsets, PEReferences may only occur between MarkUp declarations, and (consequently) must be defined as a sequence of these. ((WFC: PEs in Internal Subset))
(Conditional sections are not allowed anyhow!)((begin of 3.4))

In EXTERNAL subsets they can appear anywhere, with certain explicitly listed exceptions (PubId, System Id, PI (what really is a pity!), comment, IGNORE section, text declarations) ((2.8))

The literal entity value of a PE must not contain dangling parentheses, neither round ones (as used in content models) ((VC: Proper Group/PE Nesting)), nor "angle brackets" as used for markup declarations ((VC: Proper Declaration/PE Nesting)), nor parts of the complicated, three-part square bracket constructions used for conditional sections ((VC: Proper Conditional Section/PE Nesting)).

1.3.1 Recursive Entity Expansion

The literal entity value of any entity (general/parameter, internal/external), as defined with the entity declaration, may contain further entity references (but only in an a-cyclic way, as mentioned above!)

These references must be contained completely in the text value, i.e. from starting character up to and including the closing semicolon. ((4.5, para 2))

The expansion of these references is executed differently:

Parsed external entities are not expanded on their definition. Their replacement text is identical with the "literal entity value", i.e. the file's contents, after removing only the text/encoding declaration. ((4.5)) References contained therein are expanded when the replacement text itself is subject to parsing, due to the expansion of a reference to it.((4.5))

Different for internal entities: The replacement text is derived from the literal entity value "when" they are declared. ((4.5, para 2))

("When" means "with the current parsing state", i.e. all preceding entity definitions are already contained in the respective name space, all subsequent ones are still undefined!)

This is done by

expanding all character references,
and by expanding all parameter entity references.

All general entity references are left unexpanded.
((4.5, para 2))

The co-operation of these different steps of expansion allows the dynamic creation of declarations. [xml] brings an example, the crucial part of which is the construction ((Appendix D))

 <!ENTITY % zz '&#60;!ENTITY tricky "error-prone" >' >
 %zz;

Including this into a DTD leads to the execution of the declaration of the entity "tricky" by expanding the entity "zz".
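The complete example from ((Appendix D)) can be fed to a parser to watch the generated declaration take effect. A minimal sketch, assuming the JAXP parser bundled with the JDK; the class name TrickyEntityDemo is hypothetical, the document text is the one from [xml].

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Sketch: the "tricky" example of [xml], Appendix D, run through a parser. */
final class TrickyEntityDemo {

    static final String SAMPLE =
          "<?xml version='1.0'?>\n"
        + "<!DOCTYPE test [\n"
        + "<!ELEMENT test (#PCDATA) >\n"
        + "<!ENTITY % xx '&#37;zz;'>\n"                                 // replacement text: %zz;
        + "<!ENTITY % zz '&#60;!ENTITY tricky \"error-prone\" >' >\n"   // replacement text: a declaration
        + "%xx;\n"                                                      // expands to %zz;, which declares "tricky"
        + "]>\n"
        + "<test>This sample shows a &tricky; method.</test>";

    /** Parses the document text and returns the text content of the root element. */
    static String rootText(String xml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(rootText(SAMPLE)); // prints "This sample shows a error-prone method."
    }
}
```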

Here are some more ugly examples:

     <!ENTITY % protz '&#37;'>
     <!ENTITY %protz; mype 'this is my parameter entity'>

    --and--
     <!ENTITY % entnamename 'entname'>
     <!ENTITY % %entnamename; 'value of entity entname' >

    --and--
     <!ENTITY % entvalue "'value of myentity'">
     <!ENTITY % myentity1 %entvalue; >
     <!ENTITY % myentity2 '%entvalue;' >  
     <!ENTITY % myentity3 "%entvalue;" >

Please note that the last entity definitions can also appear without the percent sign. The resulting "general entities" can then be included into the default value of an attribute declaration.

And please do not ignore ...

     <!ENTITY % IGNORE 'IGNORE'>
     <!ENTITY %IGNORE; '%IGNORE;'>
<![ %IGNORE; [
     <!ELEMENT %IGNORE; (%IGNORE;)* >
     <!ATTLIST %IGNORE; %IGNORE; (%IGNORE;) '&IGNORE;'  >
]]>

2 Encoding and Escaping in XML

Encoding and escaping in XML are again topics where very different concerns have been intermingled by the designers of the standard.
This makes the mechanisms hard to understand, and many bugs result therefrom (e.g. the Xalan bug which currently hinders our work, see https://issues.apache.org/jira/browse/XALANJ-2419 and http://stackoverflow.com/questions/11952289/serializing-supplementary-unicode-characters-into-xml-documents-with-java )

What we are talking about is the mapping between the external representation (ER in the following), i.e. the contents of a file or a stream, seen as a sequence of binary data, e.g. bytes, and the internal data model (IM in the following).

The IM is defined in [xml] to consist of a tree (a contiguous, cycle-free, directed graph) of nodes, which are either text nodes or elements. A text node contains only text data, as defined below. Each element has (1) one identifier as its "tag", (2) a finite mapping from attribute identifiers to text data, and (3) a finite sequence of nodes as its contents.

The text contents in attributes is called CDATA in the following, that in text nodes PCDATA. Both are finite sequences of Unicode characters, with certain exclusions (Byte order marks and utf16-surrogate blocks), see [xml, Sect. 2.2]

It is of central importance to always be aware in which of these two worlds, ER or IM, we operate! The confusion about this is one of the main sources of problems in practice, and costs billions of euros and dollars.

E.g., if the definition of XHTML-objects was a real instance of xml, the correct rendering of an xhtml file would only depend on the IM. But this is (currently) not the case with most browsers: The namespace PREFIX of the xhtml elements must be the empty one, which is a property only of the ER, and not reflected in the semantics of an IM at all!

Another example is the long lasting dispute about "disable-output-escaping" in Xslt/Xalan, see https://bugzilla.mozilla.org/show_bug.cgi?id=98168
Most of the users who required this feature came obviously with "string-generating" experiences, wanted to manipulate tags explicitly, not nodes, and made xslt code examples which were not even well-formed, because they mentally operated on ER and not on IM.

The XML specification even promotes these misunderstandings, since the IM is not defined explicitly but in an operative way, describing the interaction between an "XML processor" and an "application", in sentences like
"Before the value of an attribute is passed to the application or checked for validity, the XML processor MUST normalize the attribute value [...]" [xml, section 3.3.3]
But of course, what this dubious "application" sees, what is "passed to it", is just the IM, and what it does not see does not exist. The historically later "w3c-DOM" specification treats the model aspect more systematically.

A sensible notion of equality and semantics can be constructed only on the IM. (That is what it has been invented for, and why we do not continue using un-formal ASCII-text !-)

There are always MANY ways of translating an IM into an ER, and all the resulting ERs are equivalent after being re-parsed into an IM. Well, at least they should be, otherwise something is wrong in the transformation !-)

2.1 Character References as Second Level Encoding

Consider the process of linearization, of "writing out", of translating an IM into an ER, a data model into a file. For this, (1) the identifiers (tags and attribute names) of all nodes must be written out, (2) together with additional syntactic characters which make them uniquely recognizable (like "=" and "<" and quotes), and (3) the character data (PCDATA, CDATA) from the IM must be written to the ER.

All this must be done in some re-parseable encoding. Such an encoding is, as mentioned above, a mapping from Unicode code points (in which the data (1) to (3) is formulated) to binary stream elements.

This encoding is (1) arbitrarily chosen, (2) must be well-defined and specified, and (3) must be applicable for later re-parsing, i.e. translating ER to IM.

In ancient times of HTML, before Unicode conquered the Net, the only common subset understood by all these proprietary encoding schemes was US-ASCII, mapping the byte values 0x00 to 0x7F to selected characters.
Therefore a mechanism called "character reference" was introduced, which allows encoding all characters in the IM (which can be nearly all Unicode characters, see above), even those for which the file encoding format does not include a mapping to binary data.

     &#252;
     &#xfc;
     &uuml;

This text allows addressing a certain Unicode character by a decimal notation of its numeric code point value, by a hexadecimal notation [xml, syntax rule 66], and by an identifier [xml, syntax rule 68].
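How a serializer can decide when this second level encoding is needed may be sketched as follows. This is not the code of our ContentPrinter; the class SecondLevelEncodingDemo and the method withCharRefs are hypothetical names. The sketch uses only the standard java.nio.charset API and deliberately ignores the escaping of "&" and "<", which is a separate concern.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

/** Sketch: write a character reference only where the file encoding cannot express the character. */
final class SecondLevelEncodingDemo {

    /** Replaces every code point which the given file encoding cannot map by "&#x...;". */
    static String withCharRefs(String text, Charset fileEncoding) {
        CharsetEncoder encoder = fileEncoding.newEncoder();
        StringBuilder out = new StringBuilder();
        text.codePoints().forEach(cp -> {
            String ch = new String(Character.toChars(cp));
            if (encoder.canEncode(ch)) {
                out.append(ch);                        // first level encoding is sufficient
            } else {
                out.append("&#x").append(Integer.toHexString(cp)).append(';');
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "Gr\u00fc\u00dfe";               // u-umlaut and sharp s
        System.out.println(withCharRefs(text, StandardCharsets.US_ASCII));
        // prints "Gr&#xfc;&#xdf;e"
        System.out.println(withCharRefs(text, StandardCharsets.UTF_8).equals(text));
        // prints "true": with UTF-8 no character reference is needed
    }
}
```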

The last form is different in two aspects:
(1) The identifier and its expansion have to be declared somewhere in the DTD as a "general internal entity", see our whitepaper on XML entities.
Exceptions are

     &gt; &lt; &apos; &quot; &amp;

which are pre-defined [xml, 2.4, 4.6]
(2) The entity identifier follows the syntax rule of XML for "Name" [xml, syntax rule [5]], and this may contain characters not in the range 0x00 to 0x7F.

In general: Whenever using a file coding system which covers all Unicode characters (/code points), like UTF-8 or UTF-16, this second level encoding is no longer necessary. So please never use it!

((
A very different thing is the human interface, i.e. rendering and authoring. Whenever your tools do not support this for a particular character, it is possible to denote or to render it using one of these character references. BUT the most adequate form for this purpose, the use of named entities, is not universally applicable because it is context dependent, which causes severe technical and architectural problems when building concrete processing systems!
So better update your tools
!-))

A good source on XHTML etc. is http://schneegans.de/web/xhtml (German language, modified Thu 26 Sep 2013.) The author says that "&apos;" is not known by HTML agents. (In case that XHTML must be processed by an HTML agent, see [XHTML 1.0:C 16]) So we adjusted our ContentPrinter.

2.2 Character References for Escaping Active Characters

So far, all "character references" and "general parsed entity references" in an ER can be replaced by the corresponding unicode character data, obeying the encoding which rules the ER/stream/file. The resulting IM must stay totally identical.

But there is a second, totally different role of these character and entity references, which has nothing to do with the aspects discussed so far, but which has been mangled into the same mechanism, namely the escaping of syntactically active characters.

These are three(3) different characters, which have a certain role in the ER (not in the IM!):

Whenever appearing in the ER of PCDATA, the "less than" sign = "<" indicates the beginning of a TAG. (I.e. the end of the ER of the PCDATA = the end of the ER of the text node under construction. The tag in turn can indicate the opening tag of a new element, or the closing tag of the currently latest opened.)
Whenever appearing in the ER of CDATA (= an attribute value), the single-quote or the double-quote ends the character data representation. This role depends on the selection of the quote character which opened the ER of the CDATA.
Whenever appearing in the ER of PCDATA or CDATA, the ampersand = "&" starts either a "general entity reference" or a "(numeric) character reference".

Vice versa: whenever these characters are themselves part of character data in the IM, the serialization process must write them out using not the character itself, in the selected file encoding system, but must use a reference (= by name, e.g. a general parsed entity reference, or a numeric character reference) instead. These are the only situations where these forms are not interchangeable!

This is a schematic example for all these situations:

// IM = internal model, use [] to delimit character strings :
         Element tag=[X]
         \ 
          attribute name=y value=[<'"&]
          contents: node pcdata [xxx<'"&xxx]
                    node element tag = [y], contents empty
                    node pcdata [zzz]

// ER = external representation :
    <X y="&lt;&apos;&quot;&amp;">xxx&lt;&apos;&quot;&amp;xxx<y/>zzz</X>

// must be escaped, for logical reasons:
//        attribute's CDATA:   element's PCDATA:
//        "&                   <&
//     or '&

// alternatives, allowed:

(1) <X y="&lt;'&quot;&amp;">xxx&lt;'"&amp;xxx<y/>zzz</X>
(2) <X y='&lt;&apos;"&amp;'>xxx&lt;'"&amp;xxx<y/>zzz</X>
(3) <X y="&#60;&#39;&#34;&#38;">xxx&lt;'"&amp;xxx<y/>zzz</X>

// theoretical alternative, FORBIDDEN:

(4) <X y="<'&quot;&">xxx   ...etc.

One easily sees that in element's PCDATA contents the "<" must be escaped, and in an attribute value the opening delimiter of the string denotation.
The ampersand must be escaped in both situations, because otherwise it would start a general entity reference. (This COULD of course be narrowed to ampersands followed by letters or by a hashmark, which would allow the bare ampersand in line (4) from above. But this would change the paradigm of parsing from one-look-ahead to two-look-ahead, and would not yield much, so we think it's okay to be more restrictive here.)
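That the first ER and the allowed alternatives (1) to (3) really denote the same IM can be checked mechanically. The following is a sketch, assuming the JAXP parser bundled with the JDK and the DOM method isEqualNode; the class EquivalentSerializationsDemo is a hypothetical name. In alternative (3) the apostrophe is written as "&#39;". The trees are normalized first, so that a possible split into adjacent text nodes does not disturb the comparison.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Element;
import org.xml.sax.InputSource;

/** Sketch: different external representations, one internal model. */
final class EquivalentSerializationsDemo {

    static final String REFERENCE =
        "<X y=\"&lt;&apos;&quot;&amp;\">xxx&lt;&apos;&quot;&amp;xxx<y/>zzz</X>";
    static final String ALT_1 =
        "<X y=\"&lt;'&quot;&amp;\">xxx&lt;'\"&amp;xxx<y/>zzz</X>";
    static final String ALT_2 =
        "<X y='&lt;&apos;\"&amp;'>xxx&lt;'\"&amp;xxx<y/>zzz</X>";
    static final String ALT_3 =
        "<X y=\"&#60;&#39;&#34;&#38;\">xxx&lt;'\"&amp;xxx<y/>zzz</X>";
    static final String FORBIDDEN_4 =
        "<X y=\"<'&quot;&\">xxx</X>";

    /** Parses the ER and returns the normalized root element of the resulting IM. */
    static Element parseRoot(String xml) throws Exception {
        Element root = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)))
                .getDocumentElement();
        root.normalize();
        return root;
    }

    /** True iff both ERs parse into structurally equal trees. */
    static boolean sameModel(String er1, String er2) throws Exception {
        return parseRoot(er1).isEqualNode(parseRoot(er2));
    }

    /** True iff the parser accepts the ER. */
    static boolean isAccepted(String xml) {
        try {
            parseRoot(xml);
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        System.out.println(sameModel(REFERENCE, ALT_1)); // prints "true"
        System.out.println(sameModel(REFERENCE, ALT_2)); // prints "true"
        System.out.println(sameModel(REFERENCE, ALT_3)); // prints "true"
        System.out.println(isAccepted(FORBIDDEN_4));     // prints "false"
    }
}
```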

BUT of course these rules, so far covering only the really necessary, would have been too easy for an international standard. THEREFORE additional rules have been added which do not really make sense, but make parsing a little bit more complicated. So the text says:

[XML 1.0:2.4, para2] The ampersand character (&) and the left angle bracket (<) MUST NOT appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section.

(For "CDATA section" see below.) This means that also in attribute values (where it would not do any harm!-) the "<" must be quoted.
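A serializer which follows exactly these rules, and nothing more, may look like the following sketch. It is an illustration only and not the code of our ContentPrinter; the class MinimalEscaper and its methods are hypothetical. Attribute values are assumed to be written with double quotes as delimiters; whitespace normalization and characters illegal in XML are not handled here.

```java
/** Sketch: the minimal escaping required by [XML 1.0:2.4] when writing character data. */
final class MinimalEscaper {

    /** Escapes PCDATA: "&", "<", and the ">" which would complete a "]]>" sequence. */
    static String escapeText(String pcdata) {
        StringBuilder out = new StringBuilder(pcdata.length() + 16);
        for (int i = 0; i < pcdata.length(); i++) {
            char c = pcdata.charAt(i);
            if (c == '&') {
                out.append("&amp;");
            } else if (c == '<') {
                out.append("&lt;");
            } else if (c == '>' && i >= 2
                    && pcdata.charAt(i - 1) == ']' && pcdata.charAt(i - 2) == ']') {
                out.append("&gt;");
            } else {
                out.append(c);
            }
        }
        return out.toString();
    }

    /** Escapes an attribute value which will be delimited by double quotes. */
    static String escapeAttribute(String cdata) {
        StringBuilder out = new StringBuilder(cdata.length() + 16);
        for (int i = 0; i < cdata.length(); i++) {
            char c = cdata.charAt(i);
            switch (c) {
                case '&': out.append("&amp;");  break;
                case '<': out.append("&lt;");   break;  // required by the spec, although harmless here
                case '"': out.append("&quot;"); break;  // the delimiter chosen for the literal
                default:  out.append(c);
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(escapeAttribute("<'\"&"));  // prints: &lt;'&quot;&amp;
        System.out.println(escapeText("xxx<'\"&xxx")); // prints: xxx&lt;'"&amp;xxx
    }
}
```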

2.3 Additionally "Parameter Value Normalization"

((AGAIN superfluous, but defined ... ))
Partly INDEPENDENT from the rest: "\n" by character ref, etc. FIXME MORE

2.4 CDATA section

To make it more complicated, there is another variant in serialization, called "CDATA section". Its form is

  <![CDATA[  xxx   ]]>

Between its start and end brackets, all character data is taken verbatim. Effectively, only "<" and "&" are not recognized as mark-up, i.e. as start of a tag or start of a general entity reference. So the whole mechanism is superfluous for encoding, it just simulates the quoting of these two characters. (It may have been useful for AUTHORING, see above.)

The only possibility to include the end sequence "]]>" into the contained data is to split it: close the CDATA section after the "]]" and open a new one which starts with the ">":

  <![CDATA[  xxx   ]]]]><![CDATA[> yyy ]]>
// represents the character data contents ->
          "  xxx   ]]> yyy "

While it is NOT POSSIBLE to escape the "]]>" character sequence in a CDATA section, (where it would make at least SOME sense), it is EVEN REQUIRED in ALL OTHER CHARACTER DATA, where it does not make any sense at all [XML 1.0:2.4, para 2]. But this is XML.
(Okay, they apologize that this is only for historic reasons, "for compatibility" to older SGML specs. Currently, the serializer employed by us simply serializes every ">" in character data as "&gt;", even when it is not required.)
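Writing arbitrary text as CDATA sections therefore needs a splitting step. One common way, shown here as a sketch with the hypothetical class CdataDemo, is to cut every "]]>" between the "]]" and the ">". The result is re-parsed with the JAXP parser bundled with the JDK to confirm that the character data survives unchanged.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

/** Sketch: serializing text as CDATA sections, splitting at every "]]>". */
final class CdataDemo {

    /** Wraps the text into one or more CDATA sections; "]]>" is cut between "]]" and ">". */
    static String asCdataSections(String text) {
        return "<![CDATA[" + text.replace("]]>", "]]]]><![CDATA[>") + "]]>";
    }

    /** Serializes the text into a small document, re-parses it, returns the character data. */
    static String roundTrip(String text) throws Exception {
        String xml = "<r>" + asCdataSections(text) + "</r>";
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        return doc.getDocumentElement().getTextContent();
    }

    public static void main(String[] args) throws Exception {
        String text = "  xxx   ]]> yyy ";
        System.out.println(asCdataSections(text));
        // prints: <![CDATA[  xxx   ]]]]><![CDATA[> yyy ]]>
        System.out.println(roundTrip(text).equals(text)); // prints "true"
    }
}
```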

3 Adjacent Text Nodes in W3C DOM implementations

Under which conditions may a W3C DOM contain adjacent text nodes? The situation is rather unclear. Here is a collection of some (partly contradictory) facts:

(1) In our commandline tool which calls txsl, an xml encoded text is converted into a w3c dom model by calling (eu.bandm.tools.xslt.base.Main line 624pp):

        javax.xml.parsers.DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance() ;
        javax.xml.parsers.DocumentBuilder db = dbf.newDocumentBuilder() ;
        org.w3c.dom.Document inputDom = db.parse(/*File*/ inCorpus);

In 201609xx this call repeatedly created a DOM with adjacent text nodes.

(2) XPATH http://www.w3.org/TR/1999/REC-xpath-19991116#section-Text-Nodes says:

5.7 Text Nodes

Character data is grouped into text nodes. As much character data as possible is grouped into each text node: a text node never has an immediately following or preceding sibling that is a text node. The string-value of a text node is the character data. A text node always has at least one character of data.

(3) The xpath expression "a/text()" evaluates to a "node set" with all text node children of element a, in document order.

(4) Assume "a" has only text contents. Applying an xpath function like "X(a/text())" implicitly converts the node sequence "a/text()" into the very first node only. Thus "X()" processes only an unspecified prefix of the text contents of "a". So does our txsl implementation.

(5) Before that we used XALAN, and this seemingly NORMALIZED the elements contents and let "X()" always process the whole text.

(6) This occurred in the xslt rules of "doctypes/d2d_gp/basic.ddf" at several places, namely (~ line 874)

    docu to_xhtml_1_0 treeInclude =
    #d2d
// NO, w3cdom Xerces does not find base url for text node !?!?!?
//      #foreach document(./a:url/text())
      #foreach document(./a:url)
  ....

(~ line 1238)

    docu to_xhtml_1_0 url = #d2d
      #x-var l-start #xp substring(.,1,1)
// XALAN: FIXME      #x-var l-start #xp substring(text(),1,1)
      #choose 
        #when contains('0123456789',$l-start) 
          #x-var l-urlprefix 
            #call a:splitbyfirst
              #arg p-select #xp $l-start
              #arg p-list   #xp $user.linkurlprefices
              #arg p-errormsg  #xp . // XALAN : text()
            #/call
          #/x-var
         .....

(~ line 1238)

    docu to_xhtml_1_0 isbn = #d2d 
      #a #href {$const.ISBN_OFFICIAL_CATAOLOG_RQ}{.}
         #text!ISBN !#valueof . #br
// attention: "#valueof text()" yields a node set of text nodes and converts the FIRST ONLY!
    #/d2d

(7) Document Object Model (DOM) Level 3 Core Specification Version 1.0 W3C Recommendation 07 April 2004 https://www.w3.org/TR/DOM-Level-3-Core/core.html
says:

Interface Text
[...]
If there is no markup inside an element's content, the text is contained in a single object implementing the Text interface that is the only child of the element.
[...]
When a document is first made available via the DOM, there is only one Text node for each block of text. Users may create adjacent Text nodes that represent the contents of a given element [...]
The Node.normalize() method merges any such adjacent Text objects into a single node for each block of text. [...]

[method] splitText
Breaks this node into two nodes at the specified offset, keeping both in the tree as siblings.

(8) https://www.w3.org/2003/01/dom2-javadoc/org/w3c/dom/Text.html
says
[...] If there is no markup inside an element's content, the text is contained in a single object implementing the Text interface that is the only child of the element.

(9) https://www.w3.org/2003/01/dom2-javadoc/org/w3c/dom/Node.html
says
[method] void normalize()
Puts all Text nodes in the full depth of the sub-tree underneath this Node, including attribute nodes, into a "normal" form where only structure (e.g., elements, comments, processing instructions, CDATA sections, and entity references) separates Text nodes, i.e., there are neither adjacent Text nodes nor empty Text nodes.
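The behaviour described in (7) and (9) can be reproduced in a few lines. A minimal sketch, assuming the JAXP parser bundled with the JDK; the class AdjacentTextNodesDemo is a hypothetical name. It only shows how adjacent text nodes are created by the user and merged again; it does not explain why a parser may deliver them on its own.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Text;
import org.xml.sax.InputSource;

/** Sketch: creating adjacent Text nodes with splitText() and merging them with normalize(). */
final class AdjacentTextNodesDemo {

    /** Returns the child counts of "a": after parsing, after splitText, after normalize. */
    static int[] childCounts() throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader("<a>hello world</a>")));
        Element a = doc.getDocumentElement();

        int parsed = a.getChildNodes().getLength();

        // Break the single text node into "hello" and " world"; both stay in the tree as siblings.
        ((Text) a.getFirstChild()).splitText(5);
        int split = a.getChildNodes().getLength();

        // Merge adjacent text nodes again.
        a.normalize();
        int normalized = a.getChildNodes().getLength();

        return new int[] { parsed, split, normalized };
    }

    public static void main(String[] args) throws Exception {
        int[] counts = childCounts();
        System.out.println(counts[0] + " " + counts[1] + " " + counts[2]); // prints "1 2 1"
    }
}
```

Code which relies on "a/text()" yielding a single node can thus protect itself by calling normalize() on the input DOM before processing.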

4 Text Encoding, Java Readers and Writers, XML Parsers and Encoders

The naming of the higher-level IO devices is not very regular. This is a cheat sheet:

                                      Writer
                                       (accepts Strings)
                                        /_\ 
                                         |
                                         OutputStreamWriter
    OutputStream    <--------------------- is target (ENCODING can be given)  
     (accepts bytes and                      /_\
      sends them to a sink.                   |
      write(int) writes a byte)               |
     /_\                                      |
      |                                       |
      FileOutputStream                      FileWriter (target is a file, 
       (target is a file)                               ENCODING must be given) 
      ByteArrayOutputStream              CharArrayWriter (target is a char[])   
                                         StringWriter (target is a String) 
      ObjectOutputStream                 
       (accepts OBJECTS, not bytes!)     
      PipedOutputStream                  PipedWriter (target is a PipedReader)
        (target is PipedInputStream)
      FilterOutputStream                 FilterWriter (target is a Writer)
       (target is an OutputStream)
          /_\           
           |
          BufferedOutputStream           BufferedWriter (target is a Writer) 
          DataOutputStream
          PrintStream                    PrintWriter    
           (offers print(double),          (offers print(double),
            println(..), etc.               println(..), etc.
            target is a File+ENCODING,      target is a File+ENCODING,
            or an OutputStream)             or an OutputStream)

Please note that in the Java/Oracle world the word "utf8" is used for the OLD version, which is now called "CESU-8".

                                         Reader
                                          (yields chars)
                                            /_\
                                             |
                                             BufferedReader
                                              (source is Reader)
                                               /_\
                                                |
                                                LineNumberReader
     InputStream                             InputStreamReader
      (yields bytes as int)   ---------------->  ENCODING must be given
      /_\                                      /_\
       |                                        |
      FileInputStream                           FileReader
       (source is a File)                        (source is File+ENCODING)
      FilterInputStream
       (source is an InputStream)
         /_\
          |
          BufferedInputStream
          DataInputStream
          PushbackInputStream
          (deprecated LineNumberI.S.)
      ObjectInputStream
      PipedInputStream                       PipedReader
        (source is PipedOutputStream)          (source is PipedWriter)
      SequenceInputStream
      (deprecated StringBufferI.S.)          CharArrayReader
                                             StringReader
                                              (source is CharArray)
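The bridge between the byte world and the character world is always OutputStreamWriter or InputStreamReader, and this is where the encoding enters. A small sketch using only java.io and java.nio.charset; the class EncodingRoundTripDemo and its methods are hypothetical names.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

/** Sketch: chars to bytes and back, with an explicitly given encoding on both sides. */
final class EncodingRoundTripDemo {

    /** Writer side: chars go into an OutputStreamWriter, bytes arrive in the OutputStream. */
    static byte[] encode(String text, Charset encoding) throws IOException {
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        try (Writer writer = new OutputStreamWriter(sink, encoding)) {
            writer.write(text);
        }
        return sink.toByteArray();
    }

    /** Reader side: bytes come from an InputStream, chars are yielded by the InputStreamReader. */
    static String decode(byte[] bytes, Charset encoding) throws IOException {
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(new ByteArrayInputStream(bytes), encoding)) {
            char[] buffer = new char[256];
            int n;
            while ((n = reader.read(buffer)) != -1) {
                text.append(buffer, 0, n);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        String text = "Gr\u00fc\u00dfe";
        System.out.println(encode(text, StandardCharsets.UTF_8).length);      // prints "7"
        System.out.println(encode(text, StandardCharsets.ISO_8859_1).length); // prints "5"
        byte[] utf8 = encode(text, StandardCharsets.UTF_8);
        System.out.println(decode(utf8, StandardCharsets.UTF_8).equals(text));      // prints "true"
        System.out.println(decode(utf8, StandardCharsets.ISO_8859_1).equals(text)); // prints "false"
    }
}
```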

Currently we use the XML decoder which comes with the javax-w3c-dom implementation. This is called for instance in TdomReader. This calls XMLReaderFactory.createXMLReader().

The encoding of the text file to parse should be included in the XML declaration.
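Why the declaration matters can be seen when the parser gets raw bytes and has to find out the encoding itself. A minimal sketch, assuming the JAXP parser bundled with the JDK (not our TdomReader); the class DeclaredEncodingDemo is a hypothetical name. Without a declaration (and without a byte order mark) the parser has to assume UTF-8, and the ISO-8859-1 byte 0xFC is not valid UTF-8.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

/** Sketch: the parser picks the decoder according to the XML declaration. */
final class DeclaredEncodingDemo {

    /** Parses raw bytes and returns the text content of the root element. */
    static String rootText(byte[] bytes) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(bytes));
        return doc.getDocumentElement().getTextContent();
    }

    /** True iff the bytes can be parsed at all. */
    static boolean isParseable(byte[] bytes) {
        try {
            rootText(bytes);
            return true;
        } catch (Exception e) {
            return false;
        }
    }

    public static void main(String[] args) throws Exception {
        byte[] declared = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><a>\u00fc</a>"
                .getBytes(StandardCharsets.ISO_8859_1);
        byte[] undeclared = "<a>\u00fc</a>".getBytes(StandardCharsets.ISO_8859_1);

        System.out.println(rootText(declared).equals("\u00fc")); // prints "true"
        System.out.println(isParseable(undeclared));             // prints "false"
    }
}
```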

When writing out with our class ContentPrinter, the encoding will be part of the XML-declaration, iff this has been activated before the output starts, and if the encoding is known. (Currently the constructor with PrintWriter does not disclose an encoding !?!?!)



made 2025-07-16_16h18 by lepper on happy-ubuntu

produced with eu.bandm.metatools.d2d and XSLT
