Difference between revisions of "Whitespace"

From TEIWiki
(redirect to XML Whitespace per request from John McCaskey)
 
TEI has robust features for specifying space, gaps, line breaks, and related aspects of the space between text. But TEI is an XML vocabulary, and XML itself, and programs that read and process XML files, have their own ways to deal with what they call whitespace, that is, space, tab, carriage return and linefeed characters. Sometimes the standards, constraints, and conventions imposed by XML cause problems for TEI encodings and for programs that process TEI files.
 
 
This article explains interactions between TEI and XML's treatment of whitespace and concludes with recommendations for both producers of TEI encodings and authors of programs that process TEI encodings.
 
 
 
==Where XML Considers Whitespace to be Significant==
 
 
 
In XML documents, some whitespace is significant, some is not. For example, inside the brackets that mark XML elements extra whitespace is not significant. For any program processing these as pieces of XML,
 
 
 
<span style="background-color:#C3E6FC; margin-left:1em"><tt>&lt;title type="main"&gt;</tt></span>
 
 
 
and
 
 
 
<span style="background-color:#C3E6FC; margin-left:1em"><tt>&lt;title&nbsp;&nbsp;&nbsp;&nbsp;    type =  "main"  &gt;</tt></span>
 
 
 
are the same. There is no significance to the extra space. By XML rules, no application that processes the data in this XML file (processing it as XML and not just as text) is allowed to treat these two representations differently. A person or computer editing this file is free to use either one, based merely on readability and aesthetics. The fact that there is whitespace between <span style="background-color:#C3E6FC"><tt>title</tt></span> and <span style="background-color:#C3E6FC"><tt>type</tt></span> is significant, but how much or of what kind (space characters, tabs, carriage returns, new lines) is not significant. The space between <span style="background-color:#C3E6FC"><tt>type</tt></span> and <span style="background-color:#C3E6FC"><tt>=</tt></span> is not significant.
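As a rough illustration of the point above, the following Python sketch parses both spellings of the start tag with the standard library's <tt>xml.etree.ElementTree</tt> and compares what the parser reports. The element content <tt>Walden</tt> is a made-up value added only so that the fragments are complete documents.

```python
import xml.etree.ElementTree as ET

# Two spellings of the same start tag; "Walden" is hypothetical filler content.
compact = ET.fromstring('<title type="main">Walden</title>')
spaced = ET.fromstring('<title      type =  "main"  >Walden</title>')

# The parser hands an application the same name, attributes, and content.
print(compact.tag == spaced.tag)
print(compact.attrib == spaced.attrib)
print(compact.text == spaced.text)
```

An application working from the parsed document has no way to tell which spelling was in the file.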
 
 
 
Whitespace can be significant, however, in the content of an element. For example,
 
 
 
<span style="background-color:#C3E6FC; margin-left:1em"><tt>&lt;name&gt;JoAnn&lt;/name&gt;</tt></span>
 
 
 
and
 
 
 
<span style="background-color:#C3E6FC; margin-left:1em; line-height:1em"><tt>&lt;name&gt;Jo Ann&lt;/name&gt;</tt></span>
 
 
 
are different because of that space between <span style="background-color:#C3E6FC"><tt>Jo</tt></span> and <span style="background-color:#C3E6FC"><tt>Ann</tt></span>, and any program reading this element in an XML file is obliged to maintain the distinction.
 
 
 
But things can get complicated. Consider this:
 
 
 
<span style="background-color:#C3E6FC; margin-left:1em">&lt;persName&gt;    </span>
 
<span style="background-color:#C3E6FC; margin-left:1em">    &lt;forename&gt;Jo&lt;/forename&gt;</span>
 
<span style="background-color:#C3E6FC; margin-left:1em">    &lt;forename&gt;Ann&lt;/forename&gt;</span>
 
<span style="background-color:#C3E6FC; margin-left:1em">    &lt;surname&gt;Henry&lt;/surname&gt;</span>
 
<span style="background-color:#C3E6FC; margin-left:1em">&lt;/persName&gt;</span>
 
 
 
Should the carriage returns and new lines matter? Should it matter if that
 
open area before <span style="background-color:#C3E6FC"><tt>&lt;surname&gt;</tt></span> is a tab or is instead four space characters? Should it matter that there is extra space after <span style="background-color:#C3E6FC"><tt>&lt;persName&gt;</tt></span>?
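Whatever the answer, a parser reports all of this whitespace to the application. The sketch below, which assumes Python's <tt>xml.etree.ElementTree</tt> and uses a tab before <tt>&lt;surname&gt;</tt> purely for illustration, prints the whitespace exactly as the parser delivers it.

```python
import xml.etree.ElementTree as ET

# The persName example, with a tab (\t) before <surname> for illustration.
source = (
    "<persName>    \n"
    "    <forename>Jo</forename>\n"
    "    <forename>Ann</forename>\n"
    "\t<surname>Henry</surname>\n"
    "</persName>"
)
pers_name = ET.fromstring(source)

# .text is the text before the first child; .tail is the text after each child.
print(repr(pers_name.text))
for child in pers_name:
    print(child.tag, repr(child.text), repr(child.tail))
```

Nothing has been collapsed or trimmed at this stage; any normalization is a decision made later by the consuming program.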
 
 
 
== Normalize = Collapse + Trim ==
 
 
 
Many applications, including web browsers and many programs that read XML files will, unless instructed otherwise, “collapse” XML whitespace, that is, they will replace any contiguous string of space characters (0x20), tabs (0x09), carriage returns (0x0D) and line feeds (0x0A) with just one space character. So
 
 
 
  <span style="background-color:#C3E6FC">&lt;name&gt;Jo Ann&lt;/name&gt;</span>
 
 
 
  <span style="background-color:#C3E6FC">&lt;name&gt;Jo    Ann&lt;/name&gt;</span>
 
 
 
  <span style="background-color:#C3E6FC">&lt;name&gt;Jo</span>
 
  <span style="background-color:#C3E6FC">    Ann&lt;/name&gt;</span>
 
 
 
would all be treated as if there were just one space character between <span style="background-color:#C3E6FC"><tt>Jo</tt></span> and <span style="background-color:#C3E6FC"><tt>Ann</tt></span>. Moreover many applications will remove, or &ldquo;trim&rdquo;, leading and trailing XML whitespace. So these, too,
 
 
 
  <span style="background-color:#C3E6FC">&lt;name&gt; Jo Ann&lt;/name &gt;</span>
 
 
 
  <span style="background-color:#C3E6FC">&lt;name&gt;</span>
 
  <span style="background-color:#C3E6FC">    Jo Ann&lt;/name&gt;</span>
 
 
 
  <span style="background-color:#C3E6FC">&lt;name&gt;</span>
 
  <span style="background-color:#C3E6FC">    Jo  </span>
 
  <span style="background-color:#C3E6FC">    Ann</span>
 
  <span style="background-color:#C3E6FC">&lt;/name&gt;</span>
 
 
 
would be treated as if the XML had been simply <span style="background-color:#C3E6FC"><tt>&lt;name&gt;Jo Ann&lt;/name&gt;</tt></span>.
 
 
 
Sometimes, as in the XPath function &ldquo;normalize-space()&rdquo; that XSLT programs use, the term &ldquo;normalize&rdquo; refers to the combination of collapsing XML whitespace and then trimming. Other times, as in XML Schema, &ldquo;collapse&rdquo; is the name of the combined operation. This article uses the XSLT terminology: normalizing is collapsing plus trimming.
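The three operations can be sketched in a few lines of Python. The function names <tt>collapse</tt>, <tt>trim</tt>, and <tt>normalize</tt> are chosen here to match this article's terminology; they are not taken from any library.

```python
import re

# XML whitespace: space (0x20), tab (0x09), carriage return (0x0D), line feed (0x0A).
XML_WHITESPACE = re.compile(r"[ \t\r\n]+")

def collapse(text: str) -> str:
    """Replace every run of XML whitespace with a single space."""
    return XML_WHITESPACE.sub(" ", text)

def trim(text: str) -> str:
    """Remove leading and trailing XML whitespace."""
    return text.strip(" \t\r\n")

def normalize(text: str) -> str:
    """Collapse, then trim."""
    return trim(collapse(text))

samples = ["Jo Ann", "Jo    Ann", "Jo\n    Ann", "\n    Jo  \n    Ann\n"]
print([normalize(s) for s in samples])
```

Each of the sample strings normalizes to <tt>Jo Ann</tt>, while <tt>collapse</tt> on its own would leave a single leading or trailing space in place.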
 
 
 
Normalizing XML whitespace is very common. It is so pervasive that it is easy to overlook that it is happening and even difficult to know which program processing an XML file is doing the normalizing—is it the XSLT processor, the XSL program, the web browser, the print routine, or some combination?
 
 
 
== @xml:space ==
 
 
 
XML defines an attribute, <span style="background-color:#DBF0FD"><tt>xml:space</tt></span>, that when set to <span style="background-color:#DBF0FD"><tt>preserve</tt></span> instructs applications to suspend default trimming, collapsing, and normalizing and instead keep all the spaces, tabs, carriage returns, and line feeds just as they are. If <span style="background-color:#DBF0FD"><tt>@xml:space</tt></span> is set to <span style="background-color:#DBF0FD">default</span> or is simply left off, no such request is made; the application is free to do whatever its developer thinks best.
 
 
 
The attribute <span style="background-color:#DBF0FD"><tt>xml:space</tt></span> is inherited by child elements. One could, for example, put <span style="background-color:#DBF0FD"><tt>xml:space="preserve"</tt></span> into a TEI <span style="background-color:#DBF0FD"><tt>&lt;text&gt;</tt></span> element but not in <span style="background-color:#DBF0FD"><tt>&lt;teiHeader&gt;</tt></span>, to indicate that the request applies to all of the text but to none of the header.
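One way a processor might determine the value in force for a given element is to walk up from that element to the nearest declaration. The sketch below assumes Python's <tt>xml.etree.ElementTree</tt>; the helper name <tt>effective_space</tt> and the tiny sample document (which omits the TEI namespace for brevity) are illustrative only.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:space under the reserved XML namespace.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

def effective_space(root, target):
    """Return the nearest xml:space value on target or an ancestor, else 'default'."""
    # ElementTree has no parent pointers, so build a child-to-parent map first.
    parent_of = {child: parent for parent in root.iter() for child in parent}
    node = target
    while node is not None:
        value = node.get(XML_SPACE)
        if value is not None:
            return value
        node = parent_of.get(node)
    return "default"

# Hypothetical, namespace-free miniature of a TEI document.
doc = ET.fromstring(
    "<TEI><teiHeader><title>T</title></teiHeader>"
    '<text xml:space="preserve"><body><p> a  b </p></body></text></TEI>'
)
print(effective_space(doc, doc.find(".//p")))
print(effective_space(doc, doc.find(".//title")))
```

The paragraph inside <tt>&lt;text&gt;</tt> picks up <tt>preserve</tt> from its ancestor, while the title in the header falls back to <tt>default</tt>.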
 
 
 
TEI allows <span style="background-color:#DBF0FD"><tt>xml:space</tt></span> to be used on any element. But since TEI has so much rich functionality for encoding spaces, gaps, line breaks, and so on, the <span style="background-color:#DBF0FD"><tt>xml:space</tt></span> attribute is rarely used. Whatever could be accomplished by setting its value to <span style="background-color:#DBF0FD"><tt>preserve</tt></span> would be better accomplished by using native TEI elements. So the value is normally left as <span style="background-color:#DBF0FD"><tt>default</tt></span> by simply not including the attribute. Downstream processors are then left free to treat XML whitespace however the application developers want.
 
 
 
== Default Whitespace Processing ==
 
When <span style="background-color:#DBF0FD"><tt>xml:space</tt></span> is left as <span style="background-color:#DBF0FD"><tt>default</tt></span>, '''nothing in XML or TEI specifies how consumers of a TEI XML file should treat whitespace.'''
 
 
 
There are, however, unspecified conventions. TEI encodings generally assume that space will be normalized, that in this encoding
 
 
 
  <span style="background-color:#DBF0FD">&lt;p&gt;  </span>
 
  <span style="background-color:#DBF0FD">    We hold these truths to be self-evident,  that all men are  </span>
 
  <span style="background-color:#DBF0FD">    created equal,  that they are endowed by their creator </span>
 
  <span style="background-color:#DBF0FD">    with certain inalienable Rights,  that among these are Life,  </span>
 
  <span style="background-color:#DBF0FD">    Liberty and the pursuit of Happiness.</span>
 
  <span style="background-color:#DBF0FD">&lt;/p&gt;</span>
 
 
 
some downstream processor will collapse spaces, tabs, carriage returns, and line feeds and will trim the space just after the <span style="background-color:#DBF0FD"><tt>&lt;p&gt;</tt></span> and just before the <span style="background-color:#DBF0FD"><tt>&lt;/p&gt;</tt></span>, and that in [http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-persName.html this encoding],
 
 
 
  <span style="background-color:#DBF0FD">&lt;persName&gt;</span>
 
  <span style="background-color:#DBF0FD">    &lt;forename&gt;Edward&lt;/forename&gt;</span>
 
  <span style="background-color:#DBF0FD">    &lt;forename&gt;George&lt;/forename&gt;</span>
 
  <span style="background-color:#DBF0FD">    &lt;surname type="linked"&gt;Bulwer-Lytton&lt;/surname&gt;, &lt;roleName&gt;Baron Lytton of</span>
 
  <span style="background-color:#DBF0FD">    &lt;placeName&gt;Knebworth&lt;/placeName&gt;</span>
 
  <span style="background-color:#DBF0FD">    &lt;/roleName&gt;</span>
 
  <span style="background-color:#DBF0FD">&lt;/persName&gt;</span>
 
 
 
the man's name is <span style="background-color:#DBF0FD"><tt>Edward George Bulwer-Lytton, Baron Lytton of Knebworth</tt></span>, and not <span style="background-color:#DBF0FD"><tt>&nbsp;Edward George Bulwer-Lytton, Baron Lytton of Knebworth&nbsp;</tt></span> with space on the outsides, or <span style="background-color:#DBF0FD"><tt>EdwardGeorgeBulwer-Lytton,BaronLyttonofKnebworth</tt></span>, or some name with carriage returns in it.
 
 
 
=== Collapsing ===
 
 
 
A TEI encoder should assume that any string of whitespace characters will be collapsed into one space character. In theory, this can be circumvented by setting <span style="background-color:#DBF0FD"><tt>xml:space='preserve'</tt></span>, but not all downstream processors honor such requests. Web browsers, for example, do not. It is safer to use TEI's <span style="background-color:#DBF0FD"><tt>&lt;space&gt;</tt></span> element.
 
 
 
Programmers of downstream applications should feel free to collapse whitespace but should also honor <span style="background-color:#DBF0FD"><tt>xml:space='preserve'</tt></span> unless they can be certain that doing so is unnecessary.
 
 
 
=== Trimming ===
 
 
 
Whether text in an element should or will be trimmed depends on whether it is the only text in the element or it has siblings that are themselves elements.
 
 
 
==== Text-Only Elements ====
 
 
 
Even when specifications may be unclear on the matter, XML culture, conventions, product features, programming habits, and general best practices are allied not only to collapse but to trim whitespace from elements that contain only text. Encoders and consumers of TEI data should accept this. Unless <span style="background-color:#DBF0FD"><tt>@xml:space</tt></span> has been set to <span style="background-color:#DBF0FD"><tt>'preserve'</tt></span>, consumers of TEI files should trim such space and encoders should assume such space will be trimmed.
 
 
 
When this is done, these encodings
 
 
 
    <span style="background-color:#DBF0FD">&lt;country&gt;Australia&lt;/country&gt;</span>
 
 
 
    <span style="background-color:#DBF0FD">&lt;country&gt;  Australia  &lt;/country&gt;</span>
 
 
 
    <span style="background-color:#DBF0FD">&lt;country&gt;</span>
 
    <span style="background-color:#DBF0FD">        Australia    </span>
 
    <span style="background-color:#DBF0FD">&lt;/country&gt;</span>
 
 
 
will all produce the same result. If the processing software were extracting data for use in a database, the resulting field would be <tt>country:&nbsp;<span style="border: 1px black solid"><span style="background-color:#DBF0FD" >Australia</span>&nbsp;&nbsp;&nbsp;&nbsp;</span></tt> in all three cases. If an encoder wants leading and trailing space to be preserved, if, for example,
 
 
 
    <span style="background-color:#DBF0FD"><tt>&lt;emph rend='underline'&gt; Yes! &lt;/emph&gt;</tt></span>
 
 
 
is meant to underline the space before and after the word, then <span style="background-color:#DBF0FD"><tt>xml:space='preserve'</tt></span> must be included in the <span style="background-color:#DBF0FD"><tt>&lt;emph&gt;</tt></span> element '''and''' it must be ensured that downstream processors actually honor <span style="background-color:#DBF0FD"><tt>xml:space='preserve'</tt></span>. If the underlining is meant to extend for not one but several spaces, only heroic care by encoder and consumer will ensure that it does. Use of <span style="background-color:#DBF0FD"><tt>&lt;space rend='underline'&gt;</tt></span> will be more reliable.
 
 
 
With both collapsing and trimming&mdash;that is, with normalizing&mdash;all of the following encodings would yield the same result.
 
 
 
    <span style="background-color:#DBF0FD">&lt;name&gt;Ralph Waldo Emerson&lt;/name&gt;</span>
 
 
 
    <span style="background-color:#DBF0FD">&lt;name&gt;  Ralph Waldo  Emerson  &lt;/name&gt;</span>
 
 
 
    <span style="background-color:#DBF0FD">&lt;name&gt;</span>
 
    <span style="background-color:#DBF0FD">        Ralph    </span>
 
    <span style="background-color:#DBF0FD">        Waldo    </span>
 
    <span style="background-color:#DBF0FD">      Emerson  </span>
 
    <span style="background-color:#DBF0FD">&lt;/name&gt;</span>
 
 
 
==== Mixed-Content Elements ====
 
 
 
If an element contains not just text, but other elements, where and when space should be trimmed is more complicated. Consider the following encoding.
 
 
 
  <span style="background-color:#DBF0FD">&lt;p&gt;&nbsp;&nbsp;The&nbsp;&lt;emph&gt;&nbsp;cat&nbsp;&lt;/emph&gt; ate&nbsp;&nbsp;the&nbsp;&lt;foreign&gt;grande&nbsp;croissant&lt;/foreign&gt;.&nbsp;I didn't!</span>
 
  <span style="background-color:#DBF0FD">  &lt;/p&gt;</span>
 
 
 
The <span style="background-color:#DBF0FD"><tt>&lt;p&gt;</tt></span> element contains five child nodes.
 
{| class="wikitable"
 
|-
 
| <tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</tt><span style="background-color:#DBF0FD"><tt>&nbsp;&nbsp;The&nbsp;</tt></span>
 
|| A text node
 
|-
 
| <tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</tt><span style="background-color:#DBF0FD"><tt>&lt;emph&gt;&nbsp;cat&nbsp;&lt;/emph&gt;</tt></span>
 
|| An <span style="background-color:#DBF0FD"><tt>&lt;emph&gt;</tt></span> element that itself contains one text node
 
|-
 
| <tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</tt><span style="background-color:#DBF0FD"><tt>&nbsp;ate&nbsp;&nbsp;the&nbsp;</tt></span>
 
|| A text node
 
|-
 
| <tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</tt><span style="background-color:#DBF0FD"><tt>&lt;foreign&gt;grande&nbsp;croissant&lt;/foreign&gt;</tt></span>
 
|| A <span style="background-color:#DBF0FD"><tt>&lt;foreign&gt;</tt></span> element that itself contains one text node
 
|-
 
| <tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="background-color:#DBF0FD">.&nbsp;I&nbsp;didn't!</span><br>&nbsp;&nbsp;&nbsp;<span style="background-color:#DBF0FD">&nbsp;&nbsp;</span></tt>
 
|| A text node that includes a carriage return and then two spaces
 
|-
 
|}
 
 
 
By convention, it is presumed that this encodes a passage that could have been equivalently encoded one of these ways:
 
 
 
  <span style="background-color:#DBF0FD">&lt;p&gt;The&nbsp;</span>
 
  <span style="background-color:#DBF0FD">&lt;emph&gt;cat&lt;/emph&gt;</span>
 
  <span style="background-color:#DBF0FD">ate&nbsp;the&nbsp;</span>
 
  <span style="background-color:#DBF0FD">&lt;foreign&gt;grande&nbsp;croissant&lt;/foreign&gt;. </span>
 
  <span style="background-color:#DBF0FD">I didn't!&lt;/p&gt;</span>
 
 
 
  <span style="background-color:#DBF0FD">&lt;p&gt;The&nbsp;&lt;emph&gt;cat&lt;/emph&gt;&nbsp;ate&nbsp;the &lt;foreign&gt;grande&nbsp;croissant&lt;/foreign&gt;.&nbsp;I&nbsp;didn't!&lt;/p&gt;</span>
 
 
 
The algorithm to normalize space in mixed content is:
 
* Collapse all white space, then
 
* trim:
 
** trim leading space on the first text node in an element and
 
** trim trailing space on the last text node in an element,
 
** trim both if a text node is both first and last, i.e., is the only text node in the element.
 
 
 
Applying that algorithm to the above passage:
 
{| class="wikitable"
 
|-
 
| <tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</tt><span style="background-color:#DBF0FD"><tt>&nbsp;&nbsp;The&nbsp;</tt></span><br>
 
<tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</tt><span style="background-color:#DBF0FD"><tt>The&nbsp;</tt></span>
 
|| Because this is the first node in the <span style="background-color:#DBF0FD">&lt;p&gt;</span> element, leading space is trimmed and trailing space is collapsed but not trimmed.
 
|-
 
| <tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</tt><span style="background-color:#DBF0FD"><tt>&lt;emph&gt;&nbsp;cat&nbsp;&lt;/emph&gt;</tt></span><br>
 
<tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="background-color:#DBF0FD">cat</span>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</tt>
 
|| Because the only thing inside the <span style="background-color:#DBF0FD">&lt;emph&gt;</span> element is a text node, the text there gets collapsed and trimmed.
 
|-
 
| <tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</tt><span style="background-color:#DBF0FD"><tt>&nbsp;ate&nbsp;&nbsp;the&nbsp;</tt></span><br>
 
<tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</tt><span style="background-color:#DBF0FD"><tt>&nbsp;ate&nbsp;the&nbsp;</tt></span>
 
|| Space is collapsed but not trimmed on either side.
 
|-
 
| <tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="background-color:#DBF0FD">&lt;foreign&gt;grande&nbsp;croissant&lt;/foreign&gt;</span></tt><br>
 
<tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="background-color:#DBF0FD">grande&nbsp;croissant</span></tt>
 
 
 
|| Space in this text-only node is collapsed and trimmed, but no change results.
 
|-
 
| <tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="background-color:#DBF0FD">.&nbsp;I&nbsp;didn't!</span></tt><br><tt>&nbsp;&nbsp;&nbsp;<span style="background-color:#DBF0FD">&nbsp;&nbsp;</span></tt><br>
 
<tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;<span style="background-color:#DBF0FD">.&nbsp;I&nbsp;didn't!</span></tt>
 
|| Because this is the last node in the <span style="background-color:#DBF0FD">&lt;p&gt;</span> element, trailing space is trimmed and leading space is collapsed but not trimmed.
 
|-
 
|}
 
The result is as if the encoding had been
 
  <span style="background-color:#DBF0FD">&lt;p&gt;The&nbsp;&lt;emph&gt;cat&lt;/emph&gt;&nbsp;ate&nbsp;the &lt;foreign&gt;grande&nbsp;croissant&lt;/foreign&gt;.&nbsp;I&nbsp;didn't!&lt;/p&gt;</span>
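The walkthrough above can be sketched in Python with <tt>xml.etree.ElementTree</tt>. This is only one possible reading of the algorithm: it takes "first" and "last" to mean the text at the very start and very end of an element's content, it honors <tt>xml:space</tt>, and the function name <tt>normalize_mixed</tt> is invented for this example.

```python
import re
import xml.etree.ElementTree as ET

WS = re.compile(r"[ \t\r\n]+")
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

def normalize_mixed(elem, preserve=False):
    """Collapse whitespace inside elem in place, trimming the two ends of its content."""
    space = elem.get(XML_SPACE)
    if space is not None:
        preserve = (space == "preserve")
    children = list(elem)
    if not preserve:
        # Collapse: elem.text precedes the first child; each child.tail follows that child.
        elem.text = WS.sub(" ", elem.text or "")
        for child in children:
            child.tail = WS.sub(" ", child.tail or "")
        # Trim: leading space at the start of the content, trailing space at the end.
        elem.text = elem.text.lstrip(" ")
        if children:
            children[-1].tail = children[-1].tail.rstrip(" ")
        else:
            elem.text = elem.text.rstrip(" ")
    for child in children:
        normalize_mixed(child, preserve)

p = ET.fromstring(
    "<p>  The <emph> cat </emph> ate  the "
    "<foreign>grande croissant</foreign>. I didn't!\n  </p>"
)
normalize_mixed(p)
print(ET.tostring(p, encoding="unicode"))
```

Run on the sample paragraph, the sketch yields the same single-line encoding shown above. The space after <tt>The</tt> and the space before <tt>ate</tt> survive because they sit outside <tt>&lt;emph&gt;</tt>.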
 
 
 
'''Note:''' The normalization process would have corrupted the text had the encoder put spaces ''inside'' the <span style="background-color:#DBF0FD">&lt;emph&gt;</span>, like this:
 
  <span style="background-color:#DBF0FD">&lt;p&gt;The&lt;emph&gt;&nbsp;cat&nbsp;&lt;/emph&gt;ate&nbsp;the &lt;foreign&gt;grande&nbsp;croissant&lt;/foreign&gt;.&nbsp;I&nbsp;didn't!&lt;/p&gt;</span>
 
 
 
The resulting text would be:
 
 
 
  The<u>cat</u>ate the ''grande croissant''. I didn't!
 
 
 
'''An encoder must assume that an element that includes nothing but text ''will'' get trimmed.'''
 
 
 
== Structured Elements and xsl:strip-space ==
 
 
 
As mentioned above, normalization of whitespace is very common. Programmers implement it without asking encoders, and encoders presume some downstream application will perform it. This complex encoding of a person's name, taken from the TEI P5 Guidelines and mentioned earlier,
 
 
 
  <span style="background-color:#DBF0FD">&lt;persName&gt;</span>
 
  <span style="background-color:#DBF0FD">    &lt;forename&gt;Edward&lt;/forename&gt;</span>
 
  <span style="background-color:#DBF0FD">    &lt;forename&gt;George&lt;/forename&gt;</span>
 
  <span style="background-color:#DBF0FD">    &lt;surname type="linked"&gt;Bulwer-Lytton&lt;/surname&gt;, &lt;roleName&gt;Baron Lytton of</span>
 
  <span style="background-color:#DBF0FD">    &lt;placeName&gt;Knebworth&lt;/placeName&gt;</span>
 
  <span style="background-color:#DBF0FD">    &lt;/roleName&gt;</span>
 
  <span style="background-color:#DBF0FD">&lt;/persName&gt;</span>
 
 
 
presumes&mdash;though without saying so&mdash;that a downstream program will normalize space according to the algorithm above and produce the name <span style="background-color:#DBF0FD"><tt>Edward George Bulwer-Lytton, Baron Lytton of Knebworth</tt></span>.
 
 
 
Note here that the <tt><span style="background-color:#DBF0FD">&lt;persName&gt;</span></tt> element contains both text and elements. Note the comma. And note that had the forenames been encoded without intervening whitespace, the result would have been <tt><span style="background-color:#DBF0FD">EdwardGeorge</span></tt>.
 
 
 
A problem lurks. Part of defining an XML vocabulary such as TEI is specifying whether an element may contain text and elements or just elements. In TEI P5, <span style="background-color:#DBF0FD"><tt>&lt;address&gt;</tt></span>, for example, unlike <tt><span style="background-color:#DBF0FD">&lt;persName&gt;</span></tt>, may only contain other elements. This
 
 
 
  <span style="background-color:#DBF0FD">&lt;address&gt;</span>
 
  <span style="background-color:#DBF0FD">  &lt;street&gt;10 Downing Street&lt;/street&gt;</span>
 
  <span style="background-color:#DBF0FD">  &lt;settlement&gt;London&lt;/settlement&gt;</span>
 
  <span style="background-color:#DBF0FD">  &lt;postCode&gt;SW1A 2AA&lt;/postCode&gt;</span>
 
  <span style="background-color:#DBF0FD">&lt;/address&gt;</span>
 
 
 
is valid TEI. But this
 
 
 
  <span style="background-color:#DBF0FD">&lt;address&gt;</span>
 
  <span style="background-color:#DBF0FD">  &lt;street&gt;10 Downing Street&lt;/street&gt;,</span>
 
  <span style="background-color:#DBF0FD">  &lt;settlement&gt;London&lt;/settlement&gt;</span>
 
  <span style="background-color:#DBF0FD">  &lt;postCode&gt;SW1A 2AA&lt;/postCode&gt;</span>
 
  <span style="background-color:#DBF0FD">&lt;/address&gt;</span>
 
 
 
is not, because of that comma after the <span style="background-color:#DBF0FD"><tt>&lt;street&gt;</tt></span> element. Free non-whitespace text is not allowed between the elements that comprise the <span style="background-color:#DBF0FD"><tt>&lt;address&gt;</tt></span> element. Though the term is sometimes used more loosely, <span style="background-color:#DBF0FD"><tt>&lt;address&gt;</tt></span> would commonly be called a "structured element."
 
 
 
Elements that do not allow free non-whitespace text&mdash;structured elements, strictly speaking&mdash;mimic database records. When XML is used to move data between databases, such elements are the norm; indeed many XSLT programmers have never worked on anything but structured data. In a TEI file, structured data is more common in the header than in the text. A program extracting metadata from a TEI file will often be looking for structured data in the header, so that it can populate database fields, maybe like this:
 
 
 
  street:      <span style="border:1px solid black; line-height:1.5em"><span style="background-color:#DBF0FD">10 Downing Street</span>    </span>
 
  settlement:  <span style="border:1px solid black; line-height:1.5em"><span style="background-color:#DBF0FD">London</span>              </span>
 
  postCode:    <span style="border:1px solid black; line-height:1.5em"><span style="background-color:#DBF0FD">SW1A 2AA</span>            </span>
 
 
 
Defining an element as a structured element specifies that space between child elements can be completely ignored. Thus these two encodings
 
 
 
  <span style="background-color:#DBF0FD">&lt;address&gt;</span>
 
  <span style="background-color:#DBF0FD">  &lt;settlement type="city"&gt;London&lt;/settlement&gt;</span>
 
  <span style="background-color:#DBF0FD">  &lt;postCode&gt;SW1A 2AA&lt;/postCode&gt;</span>
 
  <span style="background-color:#DBF0FD">&lt;/address&gt;</span>
 
 
 
  <span style="background-color:#DBF0FD">&lt;address&gt;</span>
 
  <span style="background-color:#DBF0FD">  &lt;settlement type="city"&gt;London&lt;/settlement&gt;&lt;postCode&gt;SW1A 2AA&lt;/postCode&gt;</span>
 
  <span style="background-color:#DBF0FD">&lt;/address&gt;</span>
 
 
 
are equivalent. They encode:
 
 
 
  city:    <span style="border:1px solid black; line-height:1.5em"><span style="background-color:#DBF0FD">London</span>            </span>
 
  postCode: <span style="border:1px solid black; line-height:1.5em"><span style="background-color:#DBF0FD">SW1A 2AA</span>          </span>
 
 
 
Nothing in the encoding indicates that there should be a space, a comma, a new-line, or anything else between "London" and "SW1A 2AA". What, if anything, will be there is left to the processing application. When rendering prose, the application might insert a comma; when printing a mailing label, it might insert a new-line. It might use different punctuation when mailing to different countries.
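A processing application might therefore read the fields first and choose the separator afterwards. The following Python sketch shows one way to do that; the helper names <tt>address_fields</tt>, <tt>as_prose</tt>, and <tt>as_label</tt> are invented for this illustration.

```python
import xml.etree.ElementTree as ET

def address_fields(address):
    """Map each child element's name to its trimmed text, in document order."""
    return {child.tag: (child.text or "").strip() for child in address}

def as_prose(fields):
    """Join the fields with commas, as running prose might."""
    return ", ".join(fields.values())

def as_label(fields):
    """Put each field on its own line, as a mailing label might."""
    return "\n".join(fields.values())

address = ET.fromstring(
    "<address>\n"
    '  <settlement type="city">London</settlement>\n'
    "  <postCode>SW1A 2AA</postCode>\n"
    "</address>"
)
fields = address_fields(address)
print(as_prose(fields))
print(as_label(fields))
```

The whitespace between the child elements plays no part in either rendering; the separator comes entirely from the application.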
 
 
 
To correctly process structured elements, XSL programmers insert an instruction, <tt><span style="background-color:#DBF0FD">&lt;xsl:strip-space&gt;</span></tt>, at the beginning of their programs, with its <tt>elements</tt> attribute listing the names of the structured elements. Among other things, this ensures that all whitespace between the children of structured elements will be removed. It will be as if such whitespace were collapsed, trimmed, and made to disappear completely.
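The effect can be imitated outside XSLT. The Python sketch below is an approximation rather than a reimplementation of the XSLT instruction: for each named element it removes child text that consists only of whitespace and leaves everything else alone. The function name <tt>strip_space</tt> is borrowed from the instruction for readability.

```python
import xml.etree.ElementTree as ET

XML_WS = " \t\r\n"

def strip_space(root, structured_names):
    """Drop whitespace-only text found directly inside the named elements."""
    for elem in root.iter():
        if elem.tag not in structured_names:
            continue
        # Text before the first child.
        if elem.text is not None and not elem.text.strip(XML_WS):
            elem.text = None
        # Text after each child (still inside elem).
        for child in elem:
            if child.tail is not None and not child.tail.strip(XML_WS):
                child.tail = None

address = ET.fromstring(
    "<address>\n"
    "  <street>10 Downing Street</street>\n"
    "  <settlement>London</settlement>\n"
    "  <postCode>SW1A 2AA</postCode>\n"
    "</address>"
)
strip_space(address, {"address"})
print(ET.tostring(address, encoding="unicode"))
```

After the call, the children of <tt>&lt;address&gt;</tt> sit directly next to one another, while the space inside <tt>10 Downing Street</tt> is untouched because it is not whitespace-only text.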
 
 
 
This situation can produce a temptation best resisted. An encoder may want to request that space be inserted between the components of a structured element, that, for example,
 
 
 
  <span style="background-color:#DBF0FD">&lt;address&gt;</span>
 
  <span style="background-color:#DBF0FD">  &lt;settlement type="city"&gt;London&lt;/settlement&gt;</span>
 
  <span style="background-color:#DBF0FD">  &lt;postCode&gt;SW1A 2AA&lt;/postCode&gt;</span>
 
  <span style="background-color:#DBF0FD">&lt;/address&gt;</span>
 
 
 
should be taken to encode "London SW1A 2AA". To implement this, the downstream processor could simply treat <tt><span style="background-color:#DBF0FD">&lt;address&gt;</span></tt> not as a structured element but as a mixed-content element. The encoder could then leave whitespace between the child elements, and the regular normalization algorithm described above would collapse it to one space character.
 
 
 
The temptation is all the more seductive because (1) XML validation will not signal an error, (2) demands on the programmers of downstream applications are reduced, and (3) it is easy to succumb unknowingly. In XSLT, the programmer intentionally or inadvertently leaves <tt><span style="background-color:#DBF0FD">&lt;xsl:strip-space&gt;</span></tt> off, something the programmer is happy to do since gathering the list of structured elements was inconvenient anyway, and all seems to be well.
 
 
 
But the better practice is indeed to burden the application with properly formatting structured elements. This burden is part of what it means for an element to be structured. If the project team agrees that whitespace in structured elements will be significant, the schema should be customized to make these elements mixed-content elements instead of structured elements. This ensures that future users of the XML files will be able to understand the files' contents. It also signals that
 
 
 
<tt><span style="background-color:#DBF0FD">&lt;settlement&gt;New&lt;/settlement&gt;&lt;settlement&gt;York&lt;/settlement&gt;</span></tt>
 
 
 
and
 
 
 
<tt><span style="background-color:#DBF0FD">&lt;settlement&gt;New&lt;/settlement&gt;&nbsp;&lt;settlement&gt;York&lt;/settlement&gt;</span></tt>
 
 
 
are different, which they would not be if the element were a structured one.
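The difference is easy to see once the parent is treated as mixed content and its text is simply concatenated. In this Python sketch the wrapping <tt>&lt;address&gt;</tt> element is an assumption added so that each fragment parses as a complete document.

```python
import xml.etree.ElementTree as ET

# Hypothetical wrappers around the two fragments shown above.
joined = ET.fromstring(
    "<address><settlement>New</settlement><settlement>York</settlement></address>"
)
spaced = ET.fromstring(
    "<address><settlement>New</settlement> <settlement>York</settlement></address>"
)

# Reading the content as mixed content keeps the space between the children.
print("".join(joined.itertext()))
print("".join(spaced.itertext()))
```

Read as mixed content, the first fragment runs the two words together and the second keeps them apart. Read as a structured element, both would deliver the same two fields.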
 
 
 
== Recommendations ==
 
 
 
* Programmers should, unless instructed otherwise by <span style="background-color:#DBF0FD">@xml:space='preserve'</span>, implement code that normalizes space.
 
* Encoders should presume such normalization will be done but should include a note in the <tt><span style="background-color:#DBF0FD">&lt;encodingDesc&gt;</span></tt> announcing the presumption.
 
* Encoders should use <span style="background-color:#DBF0FD">xml:space='preserve'</span> only with the utmost care. Whatever could be accomplished by using it is usually accomplished with less risk by using native TEI elements.
 
* Project teams should not intentionally or inadvertently use structured elements as if they were mixed-content elements. If this must be done, the schema should be customized to record the change.
 
 
 
== XSL Normalization Code ==
 
 
 
XSL's <span style="background-color:#DBF0FD">normalize-space()</span> function cannot simply be used on all text nodes. The program must consider where a text node is among its siblings.
 
 
 
The following code implements the trimming algorithm described above. It works for both text-only and mixed-content elements. The code overrides the built-in template for the first and last text nodes, so it may simply be added to XSLT stylesheets. Interior text nodes fall through to the built-in template and are copied unchanged, so collapsing their whitespace is left to whatever consumes the output, such as a web browser.
 
 
 
    <xsl:template priority=".7" match="text()[position()=1 and not((ancestor::node()/@xml:space)[position()=last()]='preserve')]">
 
        <xsl:value-of select="normalize-space()"/>
 
        <xsl:if test="normalize-space(substring(., string-length(.))) = ''">
 
            <xsl:text> </xsl:text>
 
        </xsl:if>
 
    </xsl:template>
 
    <xsl:template priority=".7" match="text()[position()=last() and not((ancestor::node()/@xml:space)[position()=last()]='preserve')]">
 
        <xsl:if test="normalize-space(substring(., 1, 1)) = ''">
 
            <xsl:text> </xsl:text>
 
        </xsl:if>
 
        <xsl:value-of select="normalize-space()"/>
 
    </xsl:template>
 
    <xsl:template priority=".8" match="text()[position()=1 and position()=last() and not((ancestor::node()/@xml:space)[position()=last()]='preserve')]" >
 
        <xsl:value-of select="normalize-space(.)"/>
 
    </xsl:template>
 
 
 
Filtering on <span style="background-color:#DBF0FD">@xml:space</span> allows the value <span style="background-color:#DBF0FD">preserve</span> to override. The expression in each <span style="background-color:#DBF0FD">test</span> attribute is just a way to test for whitespace. The priorities resolve the conflict caused when a node is the only text node in an element, and thus both the first and the last.
 

Latest revision as of 18:42, 5 August 2012
