Difference between revisions of "Whitespace"

From TEIWiki
(redirect to XML Whitespace per request from John McCaskey)
 
TEI has robust features for specifying space, gaps, line breaks, and related aspects of the space between text. But TEI is an XML vocabulary, and XML itself, and programs that read and process XML files, have their own ways to deal with what they call whitespace, that is, space, tab, carriage return and linefeed characters. Sometimes the standards, constraints, and conventions imposed by XML cause problems for TEI encodings and for programs that process TEI files.
 
 
This article explains interactions between TEI and XML's treatment of whitespace and concludes with recommendations for both producers of TEI encodings and authors of programs that process TEI encodings.
 
 
 
==Where XML Considers Whitespace to be Significant==
 
 
 
In XML documents, some whitespace is significant and some is not. For example, extra whitespace inside the angle brackets of a tag is not significant. For any program processing these as pieces of XML,
 
 
 
<span style="background-color:#C3E6FC; margin-left:1em"><tt>&lt;title type="main"&gt;</tt></span>
 
 
 
and
 
 
 
<span style="background-color:#C3E6FC; margin-left:1em"><tt>&lt;title&nbsp;&nbsp;&nbsp;&nbsp;    type =  "main"  &gt;</tt></span>
 
 
 
are the same. There is no significance to the extra space. By XML rules, no application that processes the data in this XML file (processing it as XML and not just as text) is allowed to treat these two representations differently. A person or computer editing this file is free to use either one, based merely on readability and aesthetics. The fact that there is whitespace between <span style="background-color:#C3E6FC"><tt>title</tt></span> and <span style="background-color:#C3E6FC"><tt>type</tt></span> is significant, but how much or of what kind (space characters, tabs, carriage returns, new lines) is not significant. The space between <span style="background-color:#C3E6FC"><tt>type</tt></span> and <span style="background-color:#C3E6FC"><tt>=</tt></span> is not significant.
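A minimal sketch of this point, assuming Python's standard <tt>xml.etree.ElementTree</tt> parser stands in for "any program processing these as pieces of XML". The two start tags are written as empty elements only so that each string is a well-formed document; <tt>same_markup</tt> is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# The same tag, once compact and once with extra whitespace inside the brackets.
compact = ET.fromstring('<title type="main"/>')
spaced = ET.fromstring('<title     type =  "main"  />')


def same_markup(a, b):
    """Compare element name and attributes as the parser reports them."""
    return a.tag == b.tag and a.attrib == b.attrib


print(same_markup(compact, spaced))
```

Once parsed, nothing remains of the extra whitespace: both elements report the same name and the same single attribute.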
 
 
 
Whitespace can be significant, however, in the content of an element. For example,
 
 
 
<span style="background-color:#C3E6FC; margin-left:1em"><tt>&lt;name&gt;JoAnn&lt;/name&gt;</tt></span>
 
 
 
and
 
 
 
<span style="background-color:#C3E6FC; margin-left:1em; line-height:1em"><tt>&lt;name&gt;Jo Ann&lt;/name&gt;</tt></span>
 
 
 
are different because of that space between <span style="background-color:#C3E6FC"><tt>Jo</tt></span> and <span style="background-color:#C3E6FC"><tt>Ann</tt></span>, and any program reading this element in an XML file is obliged to maintain the distinction.
 
 
 
But things can get complicated. Consider this:
 
 
 
<span style="background-color:#C3E6FC; margin-left:1em">&lt;persName&gt;    </span>
 
<span style="background-color:#C3E6FC; margin-left:1em">    &lt;forename&gt;Jo&lt;/forename&gt;</span>
 
<span style="background-color:#C3E6FC; margin-left:1em">    &lt;forename&gt;Ann&lt;/forename&gt;</span>
 
<span style="background-color:#C3E6FC; margin-left:1em">    &lt;surname&gt;Henry&lt;/surname&gt;</span>
 
<span style="background-color:#C3E6FC; margin-left:1em">&lt;/persName&gt;</span>
 
 
 
Should the carriage returns and new lines matter? Should it matter if that open area before <span style="background-color:#C3E6FC"><tt>&lt;surname&gt;</tt></span> is a tab or is instead four space characters? Should it matter that there is extra space after <span style="background-color:#C3E6FC"><tt>&lt;persName&gt;</tt></span>?
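To see what a parser actually hands to an application in this case, here is a small sketch using Python's standard <tt>xml.etree.ElementTree</tt>. The sample is given a matching <tt>&lt;/surname&gt;</tt> end tag so that it is well-formed, and <tt>whitespace_between_children</tt> is a hypothetical helper. ElementTree exposes the text around child elements as <tt>.text</tt> and <tt>.tail</tt> strings.

```python
import xml.etree.ElementTree as ET

SOURCE = (
    "<persName>    \n"
    "    <forename>Jo</forename>\n"
    "    <forename>Ann</forename>\n"
    "    <surname>Henry</surname>\n"
    "</persName>"
)

pers_name = ET.fromstring(SOURCE)


def whitespace_between_children(element):
    """Return the raw text before the first child and after each child."""
    return [element.text] + [child.tail for child in element]


for chunk in whitespace_between_children(pers_name):
    print(repr(chunk))
```

The parser itself discards nothing: every space and line break between the child elements is passed along, and the decision about whether it matters is left to the consuming program.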
 
 
 
== Normalize = Collapse + Trim ==
 
 
 
Many applications, including web browsers and many programs that read XML files, will, unless instructed otherwise, “collapse” XML whitespace; that is, they will replace any contiguous run of space characters (0x20), tabs (0x09), carriage returns (0x0D), and line feeds (0x0A) with just one space character. So
 
 
 
  <span style="background-color:#C3E6FC">&lt;name&gt;Jo Ann&lt;/name&gt;</span>
 
 
 
  <span style="background-color:#C3E6FC">&lt;name&gt;Jo    Ann&lt;/name&gt;</span>
 
 
 
  <span style="background-color:#C3E6FC">&lt;name&gt;Jo</span>
 
  <span style="background-color:#C3E6FC">    Ann&lt;/name&gt;</span>
 
 
 
would all be treated as if there were just one space character between <span style="background-color:#C3E6FC"><tt>Jo</tt></span> and <span style="background-color:#C3E6FC"><tt>Ann</tt></span>. Moreover many applications will remove, or &ldquo;trim&rdquo;, leading and trailing XML whitespace. So these, too,
 
 
 
  <span style="background-color:#C3E6FC">&lt;name&gt; Jo Ann &lt;/name&gt;</span>
 
 
 
  <span style="background-color:#C3E6FC">&lt;name&gt;</span>
 
  <span style="background-color:#C3E6FC">    Jo Ann&lt;/name&gt;</span>
 
 
 
  <span style="background-color:#C3E6FC">&lt;name&gt;</span>
 
  <span style="background-color:#C3E6FC">    Jo  </span>
 
  <span style="background-color:#C3E6FC">    Ann</span>
 
  <span style="background-color:#C3E6FC">&lt;/name&gt;</span>
 
 
 
would be treated as if the XML had been simply <span style="background-color:#C3E6FC"><tt>&lt;name&gt;Jo Ann&lt;/name&gt;</tt></span>.
 
 
 
Sometimes, as with the XPath function &ldquo;normalize-space()&rdquo; used in XSLT, the term &ldquo;normalize&rdquo; refers to the combination of collapsing XML whitespace and then trimming. Other times, as in XML Schema, &ldquo;collapse&rdquo; is the name of the combined operation. This article uses the XSLT terminology: normalizing is collapsing plus trimming.
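The three operations can be sketched in a few lines of Python. This is an illustration of the terminology used in this article, not the implementation of any particular processor; the function names are chosen here for clarity. Only the four XML whitespace characters are touched, so a no-break space (U+00A0), for example, is left alone.

```python
import re

XML_WHITESPACE = " \t\r\n"  # the four characters XML treats as whitespace


def collapse(text):
    """Replace each run of XML whitespace with a single space character."""
    return re.sub(r"[ \t\r\n]+", " ", text)


def trim(text):
    """Remove leading and trailing XML whitespace."""
    return text.strip(XML_WHITESPACE)


def normalize(text):
    """Normalize = collapse, then trim."""
    return trim(collapse(text))


print(repr(normalize("\n    Jo  \n    Ann\n")))
```

Keeping <tt>collapse</tt> and <tt>trim</tt> separate is deliberate: later sections argue that a program may need to collapse without trimming.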
 
 
 
Normalizing XML whitespace is very common. It is so pervasive that it is easy to overlook that it is happening and even difficult to know which program processing an XML file is doing the normalizing—is it the XSLT processor, the XSL program, the web browser, the print routine, or some combination?
 
 
 
== @xml:space ==
 
 
 
XML defines an attribute, <span style="background-color:#DBF0FD"><tt>xml:space</tt></span>, that when set to <span style="background-color:#DBF0FD"><tt>preserve</tt></span> asks applications to suspend their default collapsing and trimming and instead keep all the spaces, tabs, carriage returns, and line feeds just as they are. If <span style="background-color:#DBF0FD"><tt>@xml:space</tt></span> is set to <span style="background-color:#DBF0FD"><tt>default</tt></span> or is simply left off, no such request is made; the application is free to do whatever its developer thinks best.
 
 
 
The attribute <span style="background-color:#DBF0FD"><tt>xml:space</tt></span> is inherited by child elements. One could, for example, put <span style="background-color:#DBF0FD"><tt>xml:space="preserve"</tt></span> into a TEI <span style="background-color:#DBF0FD"><tt>&lt;text&gt;</tt></span> element but not in <span style="background-color:#DBF0FD"><tt>&lt;teiHeader&gt;</tt></span>, to indicate that the request applies to all of the text but to none of the header.
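A sketch of how a program might work out which <tt>xml:space</tt> value is in force for each element, assuming Python's standard <tt>xml.etree.ElementTree</tt>. The sample document is a simplification (it omits the TEI namespace and most required header content), and <tt>effective_space</tt> is a hypothetical helper name.

```python
import xml.etree.ElementTree as ET

# ElementTree reports the xml: prefix as this fixed namespace URI.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

SOURCE = """<TEI>
  <teiHeader><title>Sample</title></teiHeader>
  <text xml:space="preserve"><body><p>  kept  </p></body></text>
</TEI>"""


def effective_space(root):
    """Map every element to the xml:space value in force for it."""
    result = {}

    def walk(element, inherited):
        # An explicit attribute overrides whatever was inherited.
        value = element.get(XML_SPACE, inherited)
        result[element] = value
        for child in element:
            walk(child, value)

    walk(root, "default")
    return result


root = ET.fromstring(SOURCE)
for element, value in effective_space(root).items():
    print(element.tag, value)
```

The value set on <tt>&lt;text&gt;</tt> reaches <tt>&lt;body&gt;</tt> and <tt>&lt;p&gt;</tt>, while everything under <tt>&lt;teiHeader&gt;</tt> stays at <tt>default</tt>.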
 
 
 
TEI allows <span style="background-color:#DBF0FD"><tt>xml:space</tt></span> to be used on any element. But since TEI has such rich functionality for encoding spaces, gaps, line breaks, and so on, the <span style="background-color:#DBF0FD"><tt>xml:space</tt></span> attribute is rarely used. Whatever could be accomplished by setting its value to <span style="background-color:#DBF0FD"><tt>preserve</tt></span> would be better accomplished by using native TEI elements. So the value is normally left as <span style="background-color:#DBF0FD"><tt>default</tt></span> by simply not including the attribute. Downstream processors are then left free to treat XML whitespace however the application developers want.
 
 
 
== Default Whitespace Processing ==
 
When <span style="background-color:#DBF0FD"><tt>xml:space</tt></span> is left as <span style="background-color:#DBF0FD"><tt>default</tt></span>, '''nothing in XML or TEI specifies how consumers of a TEI XML file should treat whitespace.'''
 
 
 
There are, however, unspecified conventions. TEI encodings generally assume that space will be normalized, that in this encoding
 
 
 
  <span style="background-color:#DBF0FD">&lt;p&gt;  </span>
 
  <span style="background-color:#DBF0FD">    We hold these truths to be self-evident,  that all men are  </span>
 
  <span style="background-color:#DBF0FD">    created equal,  that they are endowed by their creator </span>
 
  <span style="background-color:#DBF0FD">    with certain inalienable Rights,  that among these are Life,  </span>
 
  <span style="background-color:#DBF0FD">    Liberty and the pursuit of Happiness.</span>
 
  <span style="background-color:#DBF0FD">&lt;/p&gt;</span>
 
 
 
some downstream processor will collapse spaces, tabs, carriage returns, and line feeds and will trim the space just after the <span style="background-color:#DBF0FD"><tt>&lt;p&gt;</tt></span> and just before the <span style="background-color:#DBF0FD"><tt>&lt;/p&gt;</tt></span>, and that in [http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-persName.html this encoding],
 
 
 
  <span style="background-color:#DBF0FD">&lt;persName&gt;</span>

  <span style="background-color:#DBF0FD">    &lt;forename&gt;Edward&lt;/forename&gt;</span>

  <span style="background-color:#DBF0FD">    &lt;forename&gt;George&lt;/forename&gt;</span>

  <span style="background-color:#DBF0FD">    &lt;surname type="linked"&gt;Bulwer-Lytton&lt;/surname&gt;, &lt;roleName&gt;Baron Lytton of</span>

  <span style="background-color:#DBF0FD">    &lt;placeName&gt;Knebworth&lt;/placeName&gt;</span>

  <span style="background-color:#DBF0FD">    &lt;/roleName&gt;</span>

  <span style="background-color:#DBF0FD">&lt;/persName&gt;</span>
 
 
 
the man's name is <span style="background-color:#DBF0FD"><tt>Edward George Bulwer-Lytton, Baron Lytton of Knebworth</tt></span>, and not <span style="background-color:#DBF0FD"><tt>&nbsp;Edward George Bulwer-Lytton, Baron Lytton of Knebworth&nbsp;</tt></span> with space on the outsides, or <span style="background-color:#DBF0FD"><tt>EdwardGeorgeBulwer-Lytton,BaronLyttonofKnebworth</tt></span>, or some name with carriage returns in it.
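The convention can be checked with a short sketch, assuming Python's standard <tt>xml.etree.ElementTree</tt>. It gathers all the text inside <tt>&lt;persName&gt;</tt>, collapses runs of XML whitespace, and trims the ends; <tt>normalized_text</tt> is a hypothetical helper name.

```python
import re
import xml.etree.ElementTree as ET

SOURCE = """<persName>
    <forename>Edward</forename>
    <forename>George</forename>
    <surname type="linked">Bulwer-Lytton</surname>, <roleName>Baron Lytton of
    <placeName>Knebworth</placeName>
    </roleName>
</persName>"""


def normalized_text(element):
    """Concatenate all descendant text, then collapse and trim XML whitespace."""
    raw = "".join(element.itertext())
    return re.sub(r"[ \t\r\n]+", " ", raw).strip(" \t\r\n")


print(normalized_text(ET.fromstring(SOURCE)))
```

Note that the spaces between the name parts come from the indentation and line breaks between elements. A processor that discarded whitespace-only text between elements instead of collapsing it would run the names together.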
 
 
 
=== Collapsing ===
 
 
 
If the TEI file will be displayed in a web browser, the assumption that text will be normalized is generally safe. Authors of web browsers have accepted responsibility for normalizing such text. The authors of other applications that read TEI XML files may need to be instructed to normalize.
 
 
 
Ideally there would be a formal mechanism for such communication, but XML has no such mechanism, and TEI P5 does not either.
 
 
 
Because the assumption made above is so common, the author of a program that reads a TEI XML file should&mdash;unless instructed otherwise by the encoders&mdash;assume the burden of collapsing, ''but not necessarily of trimming'', space in text nodes such as those above.
 
 
 
=== Trimming ===
 
 
 
Whether text in an element should be trimmed depends (1) on TEI's specification of the element's ''parent'', (2) on whether the element has siblings, and if so, whether the element is first of those siblings, is the last one, or is in the middle, and (3) on practices of the encoders, practices unspecified in TEI.
 
 
 
==== 1. A Child of a Structured Element ====
 
Part of defining an XML vocabulary such as TEI is specifying whether an element may contain text and elements or just elements. In TEI, <span style="background-color:#DBF0FD"><tt>&lt;address&gt;</tt></span>, for example, may only contain other elements. This
 
 
 
  <span style="background-color:#DBF0FD">&lt;address&gt;</span>
 
  <span style="background-color:#DBF0FD">  &lt;street&gt;10 Downing Street&lt;/street&gt;</span>
 
  <span style="background-color:#DBF0FD">  &lt;settlement&gt;London&lt;/settlement&gt;</span>
 
  <span style="background-color:#DBF0FD">  &lt;postCode&gt;SW1A 2AA&lt;/postCode&gt;</span>
 
  <span style="background-color:#DBF0FD">&lt;/address&gt;</span>
 
 
 
is valid TEI. But this
 
 
 
  <span style="background-color:#DBF0FD">&lt;address&gt;</span>
 
  <span style="background-color:#DBF0FD">  &lt;street&gt;10 Downing Street&lt;/street&gt;,</span>
 
  <span style="background-color:#DBF0FD">  &lt;settlement&gt;London&lt;/settlement&gt;</span>
 
  <span style="background-color:#DBF0FD">  &lt;postCode&gt;SW1A 2AA&lt;/postCode&gt;</span>
 
  <span style="background-color:#DBF0FD">&lt;/address&gt;</span>
 
 
 
is not, because of that comma after the <span style="background-color:#DBF0FD"><tt>&lt;street&gt;</tt></span> element. Free non-whitespace text is not allowed between the elements that make up the <span style="background-color:#DBF0FD"><tt>&lt;address&gt;</tt></span> element. Though the term is sometimes used more loosely, <span style="background-color:#DBF0FD"><tt>&lt;address&gt;</tt></span> would commonly be called a "structured element."
 
 
 
Elements that do not allow free non-whitespace text&mdash;structured elements, strictly speaking&mdash;mimic database records. When XML is used to move data between databases, such elements are the norm; indeed many XSLT programmers have never worked on anything but structured data. In a TEI file, structured data is more common in the header than in the text. A program extracting metadata from a TEI file will often be looking for structured data in the header, so that it can populate database fields:
 
 
 
  country: <span style="border:1px solid black; line-height:1.5em">                    </span>
 
  post:    <span style="border:1px solid black; line-height:1.5em"><span style="background-color:#DBF0FD">SW1A 2AA</span>            </span>
 
  street:  <span style="border:1px solid black; line-height:1.5em"><span style="background-color:#DBF0FD">10 Downing Street</span>    </span>
 
  city:    <span style="border:1px solid black; line-height:1.5em"><span style="background-color:#DBF0FD">London</span>              </span>
 
 
 
 
 
That comma seen above does not belong in the database. A program reading the database will later decide how to format the full address&mdash;with commas, new-lines, spaces, etc.
 
 
 
An encoder should assume that, if an element's parent does not allow free non-whitespace text, space within the element will be trimmed. A database extractor, for example, will populate a database field the same way given any of the following encodings:
 
 
 
  <span style="background-color:#DBF0FD">&lt;settlement&gt;  London  &lt;/settlement&gt;</span>
 
 
 
  <span style="background-color:#DBF0FD">&lt;settlement&gt;London&lt;/settlement&gt;</span>
 
 
 
  <span style="background-color:#DBF0FD">&lt;settlement&gt;</span>
 
  <span style="background-color:#DBF0FD">      London  </span>
 
  <span style="background-color:#DBF0FD">&lt;/settlement&gt;</span>
 
 
 
Although TEI Guidelines are silent on the issue, an encoder should not write
 
 
 
  <span style="background-color:#DBF0FD">&lt;settlement&gt;Sydney&nbsp;&lt;/settlement&gt;&lt;country&gt;Australia&lt;/country&gt;</span>
 
 
 
and expect that space between "Sydney" and "Australia" to survive subsequent processing. All of XML culture, specifications, conventions, practices, software libraries, and programming habits are allied against that space. By defining <span style="background-color:#DBF0FD"><tt>&lt;address&gt;</tt></span> as a structured element, TEI gives processing software responsibility for joining the names of the city and the country with a space, a comma, a carriage return, or whatever is appropriate for the downstream application.
 
 
 
To eliminate any false sense of security and any miscommunication, best practice is to leave leading and trailing space out of the child elements of structured elements, that is, of elements that do not allow non-whitespace text between the child elements.
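A sketch of the kind of database extractor described in this section, assuming Python's standard <tt>xml.etree.ElementTree</tt>. The helper name <tt>address_fields</tt> and the idea of returning a dictionary are illustrative choices only. Each child of the structured element becomes one field, trimmed of leading and trailing XML whitespace, and the whitespace between the children is simply ignored.

```python
import xml.etree.ElementTree as ET

SOURCE = (
    "<address>\n"
    "  <street>10 Downing Street</street>\n"
    "  <settlement>\n"
    "      London  \n"
    "  </settlement>\n"
    "  <postCode>SW1A 2AA</postCode>\n"
    "</address>"
)


def address_fields(address):
    """Return one trimmed field per child element of a structured element."""
    return {
        child.tag: (child.text or "").strip(" \t\r\n")
        for child in address
    }


print(address_fields(ET.fromstring(SOURCE)))
```

Whatever leading or trailing space an encoder puts inside <tt>&lt;settlement&gt;</tt> is gone by the time the field is stored, which is why it is safer not to put it there in the first place.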
 
 
 
==== 2. Child among Siblings in Mixed-Content Element ====
 
 
 
If an element's parent ''does'' allow free non-whitespace text between elements&mdash;that is, if it is a "mixed-content" element&mdash;then whether or not to trim depends on where the element is among its siblings.
 
 
 
TEI's paragraph element, <span style="background-color:#DBF0FD"><tt>&lt;p&gt;</tt></span>, can contain both free non-whitespace text and other elements. This <span style="background-color:#DBF0FD"><tt>&lt;p&gt;</tt></span> element
 
 
 
  <span style="background-color:#DBF0FD">&lt;p&gt;  The &lt;emph&gt; cat &lt;/emph&gt; ate the &lt;foreign&gt;croissant&lt;/foreign&gt;. It wasn't me!</span>
 
  <span style="background-color:#DBF0FD">  &lt;/p&gt;</span>
 
 
 
has five children:
 
{| class="wikitable"
 
|-
 
| <tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</tt><span style="background-color:#DBF0FD"><tt>&nbsp;&nbsp;The&nbsp;</tt></span>
 
|| A text node
 
|-
 
| <tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</tt><span style="background-color:#DBF0FD"><tt>&nbsp;&lt;emph&gt;&nbsp;cat&nbsp;&lt;/emph&gt;</tt></span>
 
|| An <span style="background-color:#DBF0FD"><tt>&lt;emph&gt;</tt></span> element that itself contains one text node
 
|-
 
| <tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</tt><span style="background-color:#DBF0FD"><tt>&nbsp;ate&nbsp;the&nbsp;</tt></span>
 
|| A text node
 
|-
 
| <tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</tt><span style="background-color:#DBF0FD"><tt>&lt;foreign&gt;croissant&lt;/foreign&gt;</tt></span>
 
|| A <span style="background-color:#DBF0FD"><tt>&lt;foreign&gt;</tt></span> element that itself contains one text node
 
|-
 
| <tt>&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;</tt><span style="background-color:#DBF0FD"><tt>.&nbsp;It&nbsp;wasn't&nbsp;me!</tt></span><br><tt>&nbsp;&nbsp;&nbsp;</tt><span style="background-color:#DBF0FD"><tt>&nbsp;&nbsp;</tt></span>
 
|| A text node that includes a carriage return and two spaces
 
|-
 
|}
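The same breakdown can be reproduced with a short sketch, assuming Python's standard <tt>xml.etree.ElementTree</tt>. That library stores text as <tt>.text</tt> and <tt>.tail</tt> strings attached to elements rather than as separate nodes, so the hypothetical helper <tt>child_nodes</tt> rebuilds the sequence of children in document order.

```python
import xml.etree.ElementTree as ET

SOURCE = (
    "<p>  The <emph> cat </emph> ate the "
    "<foreign>croissant</foreign>. It wasn't me!\n"
    "  </p>"
)


def child_nodes(element):
    """List the element's children in document order as (kind, value) pairs."""
    nodes = []
    if element.text:
        nodes.append(("text", element.text))
    for child in element:
        nodes.append(("element", child.tag))
        if child.tail:
            # Text that follows a child element belongs to the parent.
            nodes.append(("text", child.tail))
    return nodes


for kind, value in child_nodes(ET.fromstring(SOURCE)):
    print(kind, repr(value))
```

The listing shows three text nodes and two elements, with all of the original leading, trailing, and interior whitespace still in place.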
 
 
 
By convention, it is presumed that, pared to its essentials, the encoding above is the same as this:
 
 
 
  <span style="background-color:#DBF0FD">&lt;p&gt;The &lt;emph&gt;cat&lt;/emph&gt; ate the &lt;foreign&gt;croissant&lt;/foreign&gt;. It wasn't me!&lt;/p&gt;</span>
 
 
 
That is, an application reading the two snippets of XML (as XML) would produce identical results. The second is a normalized version of the first.
 
 
 
== Recommendations ==
 
 
 
=== Encoders ===
 
 
 
=== Programmers ===
 

Latest revision as of 18:42, 5 August 2012
