[all pages:] introduction message / location / muli format dtd xantlr tdom ops paisley metajava umod option auxiliaries d2d downloads & licenses people bibliography APPENDICES:: white papers white papers 2 white papers 3 project struct proposal cygwin tips SOURCE:option.dtd SOURCE:dtd.umod DOC:deliverables.ddf DOC-DE:deliverables.ddf DOC:mtdocpage.ddf DOC-DE:mtdocpage.ddf SOURCE:basic.dd2 SOURCE:xslt.dd2
white papers | bandm meta_tools | white papers 3 |
Collected White Papers on Technical Details --2--
1 XML Entities, Definition and Usage
1.1 Categories and Declaration Syntax
1.2 References to Entities
1.3 Using Entities / Expanding Entity References
1.3.1 Recursive Entity Expansion
2 Encoding and Escaping in XML
2.1 Character References as Second Level Encoding
2.2 Character References for Escaping Active Characters
2.3 Additionally "Parameter Value Normalization"
2.4 CDATA section
3 Adjacent Text Nodes in W3C DOM implementations
4 Text Encoding, Java Readers and Writers, XML Parsers and Encoders
The notions, categories and roles of "Entities" in XML are quite confusing. As so often with these standards, very different concepts have been thoroughly mangled.
This article tries to give a survey. It is based on
Extensible Markup Language (XML) 1.0 (Third Edition)
W3C Recommendation 04 February 2004
(version: http://www.w3.org/TR/2004/REC-xml-20040204)
See also [xml]
Most of the wording in the following description corresponds one-to-one to the formal "non-terminals" in that text, even when these are not very sensible, e.g. "Reference" vs. "EntityRef" vs. "PEReference".
In the following, double round parentheses enclose references to section numbers, syntax rule numbers and constraint names (VC, WFC) of that text.
With each single XML document,
there are always one(1) or two(2) implicit, unnamed entities:
Document entity = the top level physical container, somehow "virtual"
but always present. ((4, 4.8))
The (maximally one) DTD file referred to in the document type declaration is called "external subset". This is also a kind of entity! ((4, 2.8))
All other entities are identified by name and defined in an Entity Declaration in the DTD (in the "external subset") or in the doctype definition in the document ("internal subset"). ((4.2, 70pp))
                                  / unparsed
                        external /
                                 \ parsed
  General Entity_______/
 /  (="Entity")        \
Entities                 internal (parsed)
 \
  \                      external (parsed)
   Parameter Entity ____/
    (="PE")             \internal (parsed)
EXAMPLES of the corresponding declarations (EntityDecl) :
<!ENTITY geName2 SYSTEM 'http://bandm.eu/logo80.jpg' NDATA jpeg >  // general external unparsed ((71+73))
<!ENTITY geName1 SYSTEM '/var/log/messages'>                       // general external ((71+73))
<!ENTITY geName0 'geReplacementText'>                              // general internal ((71+73))
<!ENTITY % peName1 PUBLIC "-//W3C//ENTITIES Latin 1 for XHTML//EN"
                          "http://w3c.org/xhtml/xhtml-lat1.ent">   // parameter external ((72+74))
<!ENTITY % peName0 'peText'>                                       // parameter internal ((72+74))
If there is more than one declaration with the same name in the same name space, the first one wins (most of our tools issue a WARNING). ((4.2))
Entities must be defined before use!
(Otherwise it is an ERROR. "Use" means "being referred to", and this means
"being expanded". The
temporal sequence of that expansion is described below in more detail.)
((VC Entity Declared))
General entities are primarily used in the document.
They are referred to by "&xxx;". (see below)
(Attention: In descriptions, sometimes the word "general" is left out !-)
((begin of 4))
They can appear in (a) document character contents,
(b) document attribute values and (c) in default values for
these attribute values, which are defined in the DTD.
("used"/"appear" means always "referred to" and "later expanded")
((begin of 4))
Parameter entities are used solely in the DTD, referred to by "%xxx;" (see below) ((WFC: In DTD))
The name spaces for parameter entities and general entities are distinct.
((begin of 4))
Entity declarations (general and parameter) can take one of two forms:
internal entity -- directly defines a (literal) entity value. ((4.2.1))
external entity -- points to an
external resource (e.g. a file) with "SYSTEM" and "PUBLIC" syntax.
The contents of this file (after a text declaration is stripped) defines
the entity value. ((4.2.2))
Iff an external entity declaration has an "NDATA" part, then this
gives the "notation" of the entity, and the entity is an Unparsed Entity
(the complete systematic name would be "general external unparsed entity").
These are "non-text files", and the "notation" names their format.
They are referred to by attribute values of attribute type ENTITY or ENTITIES.
All other entities are parsed entities.
These are referred to by references, as described below.
When parsing the XML document, these references are replaced by
the replacement text of the referred entity. This process
is called expansion.
An entity value may contain references to OTHER entities.
Cycles are forbidden. ((WFC: No Recursion))
The replacement text of an entity is identical to the
entity's literal value, after being expanded itself, recursively.
(See next section for details)
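The recursive expansion can be tried out with a few lines of Java. This is only a minimal sketch, assuming the XML parser bundled with the JDK (reached via javax.xml.parsers); the class name, the method names and the entities "inner", "outer", "x" and "y" are made up for illustration.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class NestedEntityDemo {
    private static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    /** Text content of the root element, after all references are expanded. */
    static String rootText(String xml) {
        try {
            return parse(xml).getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    /** False iff the parser rejects the text with a parse error. */
    static boolean isWellFormed(String xml) {
        try {
            parse(xml);
            return true;
        } catch (SAXException e) {
            return false;
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // "outer" refers to "inner"; the reference is expanded when "outer" is used.
        String nested = "<!DOCTYPE a [<!ENTITY inner 'world'>"
                + "<!ENTITY outer 'hello &inner;'>]><a>&outer;</a>";
        System.out.println(rootText(nested));
        // A cycle x -> y -> x violates ((WFC: No Recursion)).
        String cyclic = "<!DOCTYPE a [<!ENTITY x '&y;'>"
                + "<!ENTITY y '&x;'>]><a>&x;</a>";
        System.out.println(isWellFormed(cyclic));
    }
}
```

Note that the cycle is only detected when "x" is actually referred to; the JDK parser additionally reports the error on stderr.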
Parsed general external entities should begin with a text declaration, which must be given VERBATIM (not by references to entities) ((4.3.1))
Parsed general external entities must be well-formed (ergo => no close tags dangling!) ((4.3.2))
Each parsed general external entity may define its own ENCODING. ((4.3.3))
              / CharRef
    Reference                  // ((67))
  /           \ EntityRef
xReference (=our naming!)
  \ PEReference                // ((69))
A Reference is a CharRef or an EntityRef and appears in a document. It may also appear in the default attribute value in a DTD, because this can be seen as part of all later documents.
Its syntax is like "&xxx;", all characters without intervening whitespace.
An EntityRef refers to a general entity. (A more systematic name would be "GEReference").
A reference to a CHARACTER is done by its Unicode value (all characters without intervening whitespace):
&#123;     // decimal character value
&#x0F0E;   // hexadecimal character value
The character must be a legal character for input. ((4.1, 66))
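The two numeric notations can be produced and read back with plain string operations. A minimal sketch (the class and method names are hypothetical, and no check for "legal character" is made):

```java
final class CharRefDemo {
    /** Decimal character reference for a Unicode code point. */
    static String decimalRef(int codePoint) {
        return "&#" + codePoint + ";";
    }

    /** Hexadecimal character reference; the "x" must be lower case. */
    static String hexRef(int codePoint) {
        return "&#x" + Integer.toHexString(codePoint).toUpperCase() + ";";
    }

    /** Code point denoted by a decimal or hexadecimal character reference. */
    static int decode(String ref) {
        if (!ref.startsWith("&#") || !ref.endsWith(";")) {
            throw new IllegalArgumentException("not a character reference: " + ref);
        }
        String digits = ref.substring(2, ref.length() - 1);
        return digits.startsWith("x")
                ? Integer.parseInt(digits.substring(1), 16)
                : Integer.parseInt(digits, 10);
    }

    public static void main(String[] args) {
        System.out.println(decimalRef('{'));      // the code point of '{' is 123
        System.out.println(hexRef(0x0F0E));
        System.out.println(decode("&#xF0E;") == decode("&#3854;"));
    }
}
```

Leading zeros are allowed in both notations, so "&#x0F0E;" and "&#xF0E;" denote the same character.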
amp, lt, gt, apos, quot are pre-defined (general) entities. ((4.6))
Section ((4.6 "Predefined Entities")) says that they SHOULD be declared like all other general entities "for interoperability", which refers to some existing SGML based tools. WE just ignore this ...
BUT IF you declare them, the replacement text of "lt" and "amp" must again be a numeric character reference to "<" or "&" respectively (double escaping). The text brings the example
<!ENTITY lt   "&#38;#60;">
<!ENTITY gt   "&#62;">
<!ENTITY amp  "&#38;#38;">
<!ENTITY apos "&#39;">
<!ENTITY quot "&#34;">
A PEReference refers to a parameter entity and occurs in a DTD.
Its syntax is like "%xxx;", all characters without intervening whitespace. ((4.1, 69, WFC: In DTD))
Unparsed entities are referred to by using their name as the value for some attribute of type "ENTITY" or "ENTITIES". ((WFC: Parsed Entity)) The systematic name for these is "general external unparsed entities".
The rest of this section is about parsed entities only.
The expansion of entity references is defined by [xml] in an imperative
way ((4.5)):
Whenever an XML document and the corresponding DTD is "parsed", i.e.
the textual representation is transformed into some internal
data model, and this is somehow communicated to some "application",
then the entity references contained therein are (possibly!) expanded.
At the same time also character references are translated to character data.
The exact behaviour is rather complicated. It depends on the kind of
entity and the current parsing context.
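The contexts are: C = reference in content, A = reference in an attribute value, E = reference in an entity value (in an entity declaration), D = reference in the DTD outside of such literals.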
The table from [xml] , section 4.4, in a simplified form:
  | param. ref (int/ext) | internal general ref | ext. parsed general ref | character ref |
C | --                   | expand               | expand(V)               | expand        |
A | --                   | expand(L)            | ERROR                   | expand        |
E | expand(L)            | --                   | --                      | expand        |
D | expand(T)            | ERROR                | ERROR                   | ERROR         |
"--" simply means to leave the reference untouched, as normal text, or for later expansion,
"expand" means to replace the reference with the replacement text of the referred entity / with the character, and continue the parsing process with this data instead of the reference. At the end of that data the parsing process will continue with the data which follows the closing semicolon of the reference.
This "recursive parsing" leads to the expansion of further entity and character references which are contained in the replacement text.
(Parameter entities are never contained in replacement text, but resolved "when" the entity is declared, see next section!)
"expand(V)" means the same, but the external entity may be left unparsed iff the processing tool does not claim to be "validating".
(The parse() method of our TunedDTDParser has a boolean parameter errorOnExpand which, when set to true, modifies parsing in a similar way: An unreachable external entity leads to an error only iff it is really expanded.)
"expand(L)" means "include into a literal":
The same as above: the entity's replacement text is subject to the parsing process.
With one small exception:
Double and single quotes from the replacement text are treated as normal
characters and do not terminate the text constant denotation
into which they are inserted. ((4.4.5))
This makes it possible to include an entity value which contains such characters into
an attribute value or into an entity value declaration. (Both of these
may be delimited by single or double quotes, you never know !-)
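The effect of "expand(L)" can be observed with the JDK's DOM parser. This is a sketch under assumptions: the entity "q", the element "a" and the attribute "x" are invented for the example, and the class name is ours.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class IncludeInLiteralDemo {
    /** Value of the named attribute of the root element, after expansion. */
    static String attributeValue(String xml, String attrName) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getAttribute(attrName);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // The replacement text of "q" contains double quotes; the attribute
        // value is delimited by double quotes, too.
        String xml = "<!DOCTYPE a [<!ENTITY q 'say \"hi\"'>]><a x=\"&q;\"/>";
        System.out.println(attributeValue(xml, "x"));
    }
}
```

The double quotes coming from the replacement text end up as plain data in the attribute value; they do not close the attribute's string denotation.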
"expand(T)" means "expand as tokens". This is applied in DTD grammars of
content models: The replacement text is framed by whitespace, so
that only integral tokens can be inserted.
(This makes the implementation technique in
TunedDTDParser feasible:
Since they are framed with whitespace, their expansion can only
occur at a place where whitespace is syntactically allowed, so that it is
sufficient to look for them (and expand them) when parsing (optional or
obligatory) whitespace!)
"ERROR" simply means it should not happen !-)
There are further "validity and well formedness constraints" imposed on the declaration and usage of entities:
In INTERNAL subsets, PEReferences may only occur between MarkUp declarations,
and (consequently) must be defined as a sequence of these.
((WFC: PEs in Internal Subset))
(Conditional sections are not allowed anyhow!)((begin of 3.4))
In EXTERNAL subsets they can appear anywhere, with certain explicitly listed exceptions (PubId, System Id, PI (what really is a pity!), comment, IGNORE section, text declarations) ((2.8))
The literal entity value of a PE must not contain dangling parentheses, neither round ones (as used in content models) ((VC: Proper Group/PE Nesting)), nor "angle brackets" as used for markup declarations ((VC: Proper Declaration/PE Nesting)), nor parts of the complicated, three-part square bracket constructions used for conditional sections ((VC: Proper Conditional Section/PE Nesting)).
The literal entity value of any entity (general/parameter, internal/external), as defined with the entity declaration, may contain further entity references (but only in an a-cyclic way, as mentioned above!)
These references must be contained completely in the text value, i.e. from starting character up to and including the closing semicolon. ((4.5, para 2))
The expansion of these references is executed differently:
Parsed external entities are not expanded on their definition. Their replacement text is identical with the "literal entity value", i.e. the file's contents, after removing only the text/encoding declaration. ((4.5)) References contained therein are expanded when the replacement text itself is subject to parsing, due to the expansion of a reference to it.((4.5))
Internal entities are treated differently: The replacement text is derived from the literal entity value "when" they are declared. ((4.5, para 2))
("When" means "with the current parsing state", i.e. all preceding entity definitions are already contained in the respective name space, all subsequently following ones are still undefined!)
This is done by replacing all character references by the referred characters and by expanding all parameter entity references.
All general entity references are left unexpanded.
((4.5, para 2))
The co-operation of these different steps of expansion allows the dynamic creation of declarations. [xml] brings an example, the crucial part of which is the construction ((Appendix D))
<!ENTITY % zz '<!ENTITY tricky "error-prone" >' >
%zz;
Including this into a DTD leads to the execution of the declaration of the entity "tricky" by expanding the entity "zz".
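A short Java sketch shows the effect, assuming the JDK's bundled parser; the surrounding document (element "test" and its text) follows the example in [xml], the class name is hypothetical.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

final class DynamicDeclarationDemo {
    /** The general entity "tricky" is declared only by expanding "%zz;". */
    static final String SAMPLE =
            "<!DOCTYPE test [\n"
          + "<!ENTITY % zz '<!ENTITY tricky \"error-prone\" >' >\n"
          + "%zz;\n"
          + "]>\n"
          + "<test>This sample shows a &tricky; method.</test>";

    /** Text content of the root element, after all references are expanded. */
    static String rootText(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(rootText(SAMPLE));
    }
}
```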
Here are some more ugly examples:
<!ENTITY % protz '&#37;'>
<!ENTITY %protz; mype 'this is my parameter entity'>
--and--
<!ENTITY % entnamename 'entname'>
<!ENTITY % %entnamename; 'value of entity entname' >
--and--
<!ENTITY % entvalue "'value of myentity'">
<!ENTITY % myentity1 %entvalue; >
<!ENTITY % myentity2 '%entvalue;' >
<!ENTITY % myentity3 "%entvalue;" >
Please note that the last entity definitions can also appear without the percent sign. The resulting "general entities" can then be included into the default value of an attribute declaration.
And please do not ignore ...
<!ENTITY % IGNORE 'IGNORE'>
<!ENTITY %IGNORE; '%IGNORE;'>
<![ %IGNORE; [
  <!ELEMENT %IGNORE; (%IGNORE;)* >
  <!ATTLIST %IGNORE; %IGNORE; (%IGNORE;) '&IGNORE;' >
]]>
Encoding and escaping in XML are again topics where very different
concerns have been intermingled by the designers of the standard.
This makes the mechanisms hard to understand, and many bugs result
from this (e.g. the Xalan bug which currently hinders our work, see
https://issues.apache.org/jira/browse/XALANJ-2419
and
http://stackoverflow.com/questions/11952289/serializing-supplementary-unicode-characters-into-xml-documents-with-java
)
What we are talking about is the mapping between the external representation (ER in the following), i.e. the contents of a file or a stream, seen as a sequence of binary data, e.g. bytes, and the internal data model (IM in the following).
The IM is defined in [xml] to consist of a tree (a contiguous, cycle-free, directed graph) of nodes, which are either text nodes or elements. A text node contains only text data, as defined below. Each element has (1) one identifier as its "tag", (2) a finite mapping from attribute identifiers to text data, and (3) a finite sequence of nodes as its contents.
The text contents in attributes is called CDATA in the following, that in text nodes PCDATA. Both are finite sequences of Unicode characters, with certain exclusions (Byte order marks and utf16-surrogate blocks), see [xml, Sect. 2.2]
It is of central importance to be always aware in which of these two worlds, ER or IM, we operate! The confusion about this is one of the main sources of problems in practice, and costs billions of Euros and Dollars.
E.g., if the definition of XHTML-objects was a real instance of xml, the correct rendering of an xhtml file would only depend on the IM. But this is (currently) not the case with most browsers: The namespace PREFIX of the xhtml elements must be the empty one, which is a property only of the ER, and not reflected in the semantics of an IM at all!
Another example is the long lasting dispute about "disable-output-escaping" in
Xslt/Xalan, see
https://bugzilla.mozilla.org/show_bug.cgi?id=98168
Most of the users who required this feature came obviously with "string-generating"
experiences, wanted to manipulate tags explicitly, not nodes, and made
xslt code examples which were not even well-formed,
because they mentally operated on ER and not on IM.
The XML specification even promotes these misunderstandings, since
the IM is not defined explicitly but
in an operative way, describing the interaction between an "XML processor" and
an "application", in sentences like
"Before the value of an attribute is passed to the application or checked for validity,
the XML processor MUST normalize the attribute value [...]" [xml, section 3.3.3]
But of course, what this dubious "application" sees, what is "passed to it",
is just the IM, and what it does not see
does not exist. The historically later "w3c-DOM" specification treats the
model aspect more systematically.
A sensible notion of equality and semantics can be constructed only on the IM. (That is what it has been invented for, and why we do not continue using informal ASCII-text !-)
There are always MANY ways of translating an IM into an ER, and all the resulting ERs are equivalent after being re-parsed into an IM. Well, at least they should be, otherwise something is wrong in the transformation !-)
Consider the process of linearization, of "writing out", of translating an IM into an ER, a data model into a file. For this, (1) the identifiers (tags and attribute names) of all nodes must be written out, (2) together with additional syntactic characters which make them uniquely recognizable (like "=" and "<" and quotes), and (3) the character data (PCDATA, CDATA) from the IM must be written to the ER.
All this must be done in some re-parseable encoding. Such an encoding is, as mentioned above, a mapping from Unicode code points (in which the data (1) to (3) is formulated) to binary stream elements.
This encoding is (1) arbitrarily chosen, (2) must be well-defined and specified, and (3) must be applicable for later re-parsing, e.g. translating ER to IM.
In ancient times of HTML, before Unicode conquered the Net,
the only common subset understood by all these proprietary encoding schemes was
US-ASCII, mapping the byte values 0x00 to 0x7F to selected characters.
Therefore a mechanism called "character reference" was introduced, which makes it possible
to encode all characters in the IM (which can be nearly all Unicode characters, see above),
even those for which the file encoding format does not include a mapping to binary data.
&#252;   &#xFC;   &uuml;
These notations address a certain Unicode character by a decimal notation of its numeric code point value, by a hexadecimal notation [xml, syntax rule 66], or by an identifier [xml, syntax rule 68].
The last form is different in two aspects:
(1) The identifier and its expansion has to be declared somewhere
in the DTD as an "general internal entity", see
our whitepaper on XML entities.
Exceptions are
&gt; &lt; &apos; &quot; &amp;
which are pre-defined [xml, 2.4, 4.6]
(2)
The entity identifier follows the syntax rule of XML for "Name"
[xml, syntax rule [5]], and this may contain characters
not in the range 0x00 to 0x7F.
In general: Whenever using a file coding system which covers all Unicode characters (/code points), like UTF-8 or UTF-16, this second level encoding is no longer necessary. So please never use it!
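For the cases where a restricted file encoding cannot be avoided, the "second level encoding" can be sketched in Java as follows. This is only an illustration with hypothetical names, not the code of any of our tools: every code point which the chosen file encoding cannot represent is written as a hexadecimal character reference. Iterating over code points (not over Java chars) keeps supplementary characters in one piece.

```java
import java.nio.charset.Charset;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.StandardCharsets;

final class SecondLevelEncoding {
    /**
     * Replaces every code point which cannot be encoded in fileEncoding by a
     * hexadecimal character reference. The escaping of "<" and "&" is a
     * separate concern and NOT done here.
     */
    static String withCharRefs(String text, Charset fileEncoding) {
        CharsetEncoder encoder = fileEncoding.newEncoder();
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            String ch = new String(Character.toChars(cp));
            if (encoder.canEncode(ch)) {
                out.append(ch);
            } else {
                out.append("&#x")
                   .append(Integer.toHexString(cp).toUpperCase())
                   .append(';');
            }
            i += Character.charCount(cp);
        }
        return out.toString();
    }

    public static void main(String[] args) {
        String text = "f\u00FCr";   // contains U+00FC
        System.out.println(withCharRefs(text, StandardCharsets.US_ASCII));
        System.out.println(withCharRefs(text, StandardCharsets.UTF_8).equals(text));
    }
}
```

With UTF-8 as the file encoding the text passes through unchanged, which illustrates why the second level encoding is not needed there.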
((
A very different thing is the human interface, i.e. rendering and authoring.
Whenever your tools do not support this for a particular character, it
is possible to denote or to render it using one of these character references.
BUT the most adequate form for this purpose, the use
of named entities, is not universally applicable
because it is context dependent, which causes severe technical and
architectural problems when building concrete processing systems!
So better update your tools
!-))
A good source on XHTML etc. is http://schneegans.de/web/xhtml (German language, modified Thu 26 Sep 2013). The author says that "&apos;" is not known by HTML agents. (In case that XHTML must be processed by an HTML agent, see [XHTML 1.0:C 16]) So we adjusted our ContentPrinter.
So far, all "character references" and "general parsed entity references" in an ER can be replaced by the corresponding unicode character data, obeying the encoding which rules the ER/stream/file. The resulting IM must stay totally identical.
But there is a second, totally different role of these character and entity references, which has nothing to do with the aspects discussed so far, but which has been mangled into the same mechanism, namely the escaping of syntactically active characters.
These are three(3) different characters, which have a certain role in the ER (not in the IM!):
Vice versa: whenever these characters are themselves part of character data in the IM, the serialization process must not write out the character itself in the selected file encoding system, but must use a reference instead (by name, i.e. a general parsed entity reference, or a numeric character reference). These are the only situations where these forms are not interchangeable!
This is a schematic example for all these situations:
// IM = internal model, use [] to delimit character strings :
Element tag=[X]
  \ attribute name=y value=[<'"&]
  contents:
    node pcdata [xxx<'"&xxx]
    node element tag = [y], contents empty
    node pcdata [zzz]

// ER = external representation :
<X y="<'"&">xxx<'"&xxx<y/>zzz</X>

// must be escaped, for logical reasons:
//   attribute's CDATA:      element's PCDATA:
//   "&                      <&
//   or '&

// alternatives, allowed:
(1) <X y="<'"&">xxx<'"&xxx<y/>zzz</X>
(2) <X y='<'"&'>xxx<'"&xxx<y/>zzz</X>
(3) <X y="<"&">xxx<'"&xxx<y/>zzz</X>

// theoretical alternative, FORBIDDEN:
(4) <X y="<'"&">xxx ...etc.
One easily sees that in element's PCDATA contents the "<" must be escaped, and
in an attribute value the opening delimiter of the string denotation.
The ampersand must be escaped in both situations, because otherwise it would
start a general entity reference. (This COULD of course be narrowed
to ampersands followed by letters or by a hashmark, which would allow
line (2) from above. But this would change the paradigm of parsing
from one-look-ahead to two-look-ahead, and would not yield much, so
we think it's okay to be more restrictive here.)
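The escaping of active characters during serialization can be sketched in Java like this. It is a minimal illustration with hypothetical names, not our serializer: it always uses double quotes as attribute delimiters, always escapes "<" (also in attribute values, as [xml] requires), and does not care about "]]>" or about whitespace in attribute values.

```java
final class XmlEscaper {
    /** Escapes PCDATA: "&" must be replaced first, then "<". */
    static String escapeText(String pcdata) {
        return pcdata.replace("&", "&amp;").replace("<", "&lt;");
    }

    /** Escapes CDATA for an attribute value delimited by double quotes. */
    static String escapeAttribute(String cdata) {
        return escapeText(cdata).replace("\"", "&quot;");
    }

    public static void main(String[] args) {
        // The IM from the schematic example: attribute value and text both
        // contain the four characters < ' " &
        String er = "<X y=\"" + escapeAttribute("<'\"&") + "\">"
                + escapeText("xxx<'\"&xxx") + "<y/>zzz</X>";
        System.out.println(er);
    }
}
```

The ampersand is replaced first; doing it last would escape the ampersands of the references just written.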
BUT of course these rules, so far covering only what is really necessary, would have been too easy for an international standard. THEREFORE additional rules have been added which do not really make sense, but make parsing a little bit more complicated. So the text says:
[XML 1.0:2.4, para2] The ampersand character (&) and the left angle bracket (<) MUST NOT appear in their literal form, except when used as markup delimiters, or within a comment, a processing instruction, or a CDATA section.
(For "CDATA section" see below.) This means that also in attribute values (where it would not do any harm!-) the "<" must be quoted.
((AGAIN superfluous, but defined ... ))
Partly INDEPENDENT from the rest: "\n" by character ref, etc. FIXME MORE
To make it more complicated, there is another variant in serialization, called "CDATA section". Its form is
<![CDATA[ xxx ]]>
Between its start and end brackets, all character data is taken verbatim. Effectively, only "<" and "&" are not recognized as mark-up, i.e. as start of a tag or start of a general entity reference. So the whole mechanism is superfluous for encoding, it just simulates the quoting of these two characters. (It may have been useful for AUTHORING, see above.)
The only possibility to include the end sequence "]]>" into the contained data is to close the CDATA section and to open it again:
<![CDATA[ xxx ]]>]]&gt;<![CDATA[ yyy ]]>
// represents the character data contents -> " xxx ]]> yyy "
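The technique of closing and re-opening the section can be sketched in Java as follows. This is an illustration only (class and method name are hypothetical): the text is cut at every "]]>", each piece goes into its own CDATA section, and the end sequence itself is written between the sections as ordinary, escaped character data.

```java
import java.util.regex.Pattern;

final class CDataWriter {
    /** Serializes arbitrary character data as a sequence of CDATA sections. */
    static String toCDataSections(String text) {
        // limit -1 keeps empty pieces at the end
        String[] pieces = text.split(Pattern.quote("]]>"), -1);
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < pieces.length; i++) {
            if (i > 0) {
                out.append("]]&gt;");   // the end sequence, outside of any section
            }
            out.append("<![CDATA[").append(pieces[i]).append("]]>");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(toCDataSections(" xxx ]]> yyy "));
    }
}
```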
While it is NOT POSSIBLE to escape the "]]>" character sequence in a CDATA section,
(where it would make at least SOME sense), it is
EVEN REQUIRED in ALL OTHER CHARACTER DATA, where it does not make any sense at all
[XML 1.0:2.4, para 2].
But this is XML.
(Okay, they apologize that this
is only for historic reasons, "for compatibility" with older SGML specs.
Currently, the serializer employed by us simply serializes every ">" in
character data as "&gt;", even when it is not required.)
Under which conditions may a W3C DOM contain adjacent text nodes ??? The situation is rather unclear. Here is a collection of some (partly contradicting) facts:
(1) In our commandline tool which calls txsl, an xml encoded text is converted into a w3c dom model by calling (eu.bandm.tools.xslt.base.Main line 624pp):
// requires: import javax.xml.parsers.DocumentBuilderFactory;
javax.xml.parsers.DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
javax.xml.parsers.DocumentBuilder db = dbf.newDocumentBuilder();
org.w3c.dom.Document inputDom = db.parse(/*File*/ inCorpus);
This call in 201609xx has created repeatedly tdom with adjacent text nodes.
(2) XPATH http://www.w3.org/TR/1999/REC-xpath-19991116#section-Text-Nodes says:
5.7 Text Nodes
Character data is grouped into text nodes. As much character data as
possible is grouped into each text node: a text node never has an
immediately following or preceding sibling that is a text node. The
string-value of a text node is the character data. A text node always
has at least one character of data.
(3) The xpath expression "a/text()" evaluates to a "node set" with all text node children of element a, in document order.
(4) Assume "a" has only text contents. Applying an xpath function like "X(a/text())" implicitly converts the node sequence "a/text()" into the very first node only. Thus "X()" processes only an unspecified prefix of the text contents of "a". So does our txsl implementation.
(5) Before that we used XALAN, and this seemingly NORMALIZED the element's contents and let "X()" always process the whole text.
(6) This occurred in the xslt rules of "doctypes/d2d_gp/basic.ddf" at several places, namely (~ line 874)
docu to_xhtml_1_0 treeInclude = #d2d
  // NO, w3cdom Xerces does not find base url for text node !?!?!?
  // #foreach document(./a:url/text())
  #foreach document(./a:url)
  ....
(~ line 1238)
docu to_xhtml_1_0 url = #d2d
  #x-var l-start #xp substring(.,1,1)
  // XALAN: FIXME #x-var l-start #xp substring(text(),1,1)
  #choose
    #when contains('0123456789',$l-start)
      #x-var l-urlprefix
        #call a:splitbyfirst
          #arg p-select #xp $l-start
          #arg p-list #xp $user.linkurlprefices
          #arg p-errormsg #xp . // XALAN : text()
        #/call
      #/x-var
  .....
(~ line 1238)
docu to_xhtml_1_0 isbn = #d2d
  #a #href {$const.ISBN_OFFICIAL_CATAOLOG_RQ}{.} #text!ISBN !#valueof .
  #br
  // attention: "#valueof text()" yields a node set of text nodes and converts the FIRST ONLY!
#/d2d
(7)
Document Object Model (DOM) Level 3 Core Specification
Version 1.0 W3C Recommendation 07 April 2004
https://www.w3.org/TR/DOM-Level-3-Core/core.html
says:
Interface Text
[...]
If there is no markup inside an element's content, the text is
contained in a single object implementing the Text interface that is
the only child of the element.
[...]
When a document is first made available via the DOM, there is only one
Text node for each block of text. Users may create adjacent Text nodes
that represent the contents of a given element
[...]
The Node.normalize()
method merges any such adjacent Text objects into a single node for
each block of text.
[...]
[method] splitText
Breaks this node into two nodes at the specified offset, keeping both in the tree
as siblings.
(8)
https://www.w3.org/2003/01/dom2-javadoc/org/w3c/dom/Text.html
says
[...] If there is no markup inside an element's content, the text is
contained in a single object implementing the Text interface that is
the only child of the element.
(9)
https://www.w3.org/2003/01/dom2-javadoc/org/w3c/dom/Node.html
says
[method] void normalize()
Puts all Text nodes in the full depth of the sub-tree
underneath this Node, including attribute nodes, into a
"normal" form where only structure (e.g., elements,
comments, processing instructions, CDATA sections, and
entity references) separates Text nodes, i.e., there are
neither adjacent Text nodes nor empty Text nodes.
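That adjacent text nodes can exist at all, and what normalize() does with them, is easily shown with the JDK's DOM implementation. A minimal sketch (class and method names are hypothetical); it builds the nodes by hand and thus says nothing about what a parser delivers:

```java
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

final class AdjacentTextNodes {
    /** Creates an element "a" which has two adjacent Text children. */
    static Element elementWithTwoTextNodes() {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder().newDocument();
            Element a = doc.createElement("a");
            doc.appendChild(a);
            a.appendChild(doc.createTextNode("first "));
            a.appendChild(doc.createTextNode("second"));
            return a;
        } catch (ParserConfigurationException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        Element a = elementWithTwoTextNodes();
        System.out.println(a.getChildNodes().getLength());   // two Text nodes
        a.normalize();                                       // merges them
        System.out.println(a.getChildNodes().getLength());
        System.out.println(a.getFirstChild().getNodeValue());
    }
}
```

Calling normalize() on the document element before applying "text()" based expressions is therefore a possible workaround for the situation described in (4).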
made 2018-12-30_11h02 by lepper on linux-q699.site, produced with eu.bandm.metatools.d2d and XSLT