Difference between revisions of "Whitespace"

From TEIWiki
(redirect to XML Whitespace per request from John McCaskey)
 
(101 intermediate revisions by one other user not shown)
==Managing XML’s Whitespace in TEI Documents==
 
 
TEI has robust features for specifying space, gaps, line breaks, and related aspects of the space between text. But TEI is an XML vocabulary, and XML itself, along with the programs that read and process XML files, has its own way of handling what it calls whitespace: the space, tab, carriage return, and line feed characters. Often the standards, constraints, and conventions imposed by XML cause no problem for TEI encodings. But the interactions between XML's features and TEI's can cause subtle problems, and sometimes significant damage, when a TEI document is processed.
 
 
 
This page offers an introduction to those interactions.
 
 
 
==Where XML Considers Whitespace to be Significant==
 
 
 
In XML documents, some whitespace is significant and some is not. For example, extra whitespace inside the angle brackets of a tag is not significant. To any program processing them as XML,
 
 
 
<span style="background-color:#C3E6FC; margin-left:1em"><tt>&lt;title type="main"&gt;</tt></span>
 
 
 
and
 
 
 
<span style="background-color:#C3E6FC; margin-left:1em"><tt>&lt;title&nbsp;&nbsp;&nbsp;&nbsp;    type =  "main"  &gt;</tt></span>
 
 
 
are the same. There is no significance to the extra space. By XML rules, no application that processes the data in this XML file (processing it as XML and not just as text) is allowed to treat these two representations differently. A person or computer editing this file is free to use either one, based merely on readability and aesthetics. The fact that there is whitespace between <span style="background-color:#C3E6FC"><tt>title</tt></span> and <span style="background-color:#C3E6FC"><tt>type</tt></span> is significant, but how much or of what kind (space characters, tabs, carriage returns, line feeds) is not significant. The space between <span style="background-color:#C3E6FC"><tt>type</tt></span> and <span style="background-color:#C3E6FC"><tt>=</tt></span> is not significant.
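
The equivalence of the two tags can be checked with any conforming XML parser. The following is a minimal sketch using Python's standard <tt>xml.etree.ElementTree</tt> module; the element content <tt>Hamlet</tt> is an assumption added only to make each fragment a complete element.

```python
import xml.etree.ElementTree as ET

# Two spellings of the same start tag; only the whitespace inside the
# angle brackets differs. The content "Hamlet" is illustrative.
compact = ET.fromstring('<title type="main">Hamlet</title>')
spaced = ET.fromstring('<title     type =  "main"  >Hamlet</title>')

# After parsing, nothing remains of the extra whitespace inside the tag.
same = (
    compact.tag == spaced.tag
    and compact.attrib == spaced.attrib
    and compact.text == spaced.text
)
print(same)
```

Once parsed, the two elements carry the same name, the same attribute, and the same content, so a program working on the parsed data has no way to tell which spelling the file used.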
 
 
 
Whitespace can be significant, however, in the content of an element. For example,
 
 
 
<span style="background-color:#C3E6FC; margin-left:1em"><tt>&lt;name&gt;JoAnn&lt;/name&gt;</tt></span>
 
 
 
and
 
 
 
<span style="background-color:#C3E6FC; margin-left:1em; line-height:1em"><tt>&lt;name&gt;Jo Ann&lt;/name&gt;</tt></span>
 
 
 
are different because of that space between <span style="background-color:#C3E6FC"><tt>Jo</tt></span> and <span style="background-color:#C3E6FC"><tt>Ann</tt></span>, and any program reading this element in an XML file is obliged to maintain the distinction.
 
 
 
But things can get complicated. Consider this:
 
 
 
<span style="background-color:#C3E6FC; margin-left:1em">&lt;persName&gt;    </span>
 
<span style="background-color:#C3E6FC; margin-left:1em">    &lt;forename&gt;Jo&lt;/forename&gt;</span>
 
<span style="background-color:#C3E6FC; margin-left:1em">    &lt;forename&gt;Ann&lt;/forename&gt;</span>
 
<span style="background-color:#C3E6FC; margin-left:1em">    &lt;surname&gt;Henry&lt;/surname&gt;</span>
 
<span style="background-color:#C3E6FC; margin-left:1em">&lt;/persName&gt;</span>
 
 
 
Should the carriage returns and line feeds matter? Should it matter if that open area before <span style="background-color:#C3E6FC"><tt>&lt;surname&gt;</tt></span> is a tab or is instead four space characters? Should it matter that there is extra space after <span style="background-color:#C3E6FC"><tt>&lt;persName&gt;</tt></span>?
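
One way to see why these questions arise is to look at what a parser hands to an application. The sketch below, again assuming Python's standard <tt>xml.etree.ElementTree</tt> module and a well-formed version of the <tt>persName</tt> example, shows that the whitespace between the child elements is reported as ordinary character data. Other parsers and toolkits may be configured differently.

```python
import xml.etree.ElementTree as ET

# A well-formed version of the persName example, with four spaces after
# the start tag and four-space indentation before each child element.
SOURCE = (
    "<persName>    \n"
    "    <forename>Jo</forename>\n"
    "    <forename>Ann</forename>\n"
    "    <surname>Henry</surname>\n"
    "</persName>"
)

root = ET.fromstring(SOURCE)

# Whitespace before the first child is kept as the parent's text.
leading = root.text

# Whitespace after each child is kept as that child's "tail".
tails = [child.tail for child in root]

# All character data in document order, whitespace included.
full_text = "".join(root.itertext())

print(repr(leading))
print(tails)
print(repr(full_text))
```

The parser keeps every space and line feed; whether they end up mattering depends on what the next program in the chain decides to do with them.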
 
 
 
==&ldquo;Collapse,&rdquo; &ldquo;Trim,&rdquo; &ldquo;Normalize&rdquo;==
 
 
 
Many applications, including web browsers and some programs that convert XML into HTML, unless instructed otherwise, “collapse” XML whitespace, that is, they replace any contiguous run of space (0x20), tab (0x09), carriage return (0x0D), and line feed (0x0A) characters with a single space character. So
 
 
 
  <span style="background-color:#C3E6FC">&lt;name&gt;Jo Ann&lt;/name&gt;</span>
 
 
 
  <span style="background-color:#C3E6FC">&lt;name&gt;Jo    Ann&lt;/name&gt;</span>
 
 
 
  <span style="background-color:#C3E6FC">&lt;name&gt;Jo</span>
 
  <span style="background-color:#C3E6FC">    Ann&lt;/name&gt;</span>
 
 
 
are all treated as if there were just one space character between <span style="background-color:#C3E6FC"><tt>Jo</tt></span> and <span style="background-color:#C3E6FC"><tt>Ann</tt></span>. Moreover, many applications also remove (&ldquo;trim&rdquo;) leading and trailing XML whitespace. So these, too,
 
 
 
  <span style="background-color:#C3E6FC">&lt;name&gt; Jo Ann&lt;/name &gt;</span>
 
 
 
  <span style="background-color:#C3E6FC">&lt;name&gt;</span>
 
  <span style="background-color:#C3E6FC">    Jo Ann&lt;/name&gt;</span>
 
 
 
  <span style="background-color:#C3E6FC">&lt;name&gt;</span>
 
  <span style="background-color:#C3E6FC">    Jo  </span>
 
  <span style="background-color:#C3E6FC">    Ann</span>
 
  <span style="background-color:#C3E6FC">&lt;/name&gt;</span>
 
 
 
are treated as if the XML had been simply <span style="background-color:#C3E6FC"><tt>&lt;name&gt;Jo Ann&lt;/name&gt;</tt></span>.
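
The two operations can be sketched in a few lines of Python. This is an illustration of the behavior described above, not the code of any particular browser or converter; the function names <tt>collapse</tt>, <tt>trim</tt>, and <tt>normalize</tt> are chosen here for the example.

```python
import re

# The four XML whitespace characters: space, tab, carriage return, line feed.
XML_WHITESPACE = " \t\r\n"
_WHITESPACE_RUN = re.compile(r"[ \t\r\n]+")


def collapse(text: str) -> str:
    """Replace each run of XML whitespace with a single space."""
    return _WHITESPACE_RUN.sub(" ", text)


def trim(text: str) -> str:
    """Remove leading and trailing XML whitespace."""
    return text.strip(XML_WHITESPACE)


def normalize(text: str) -> str:
    """Collapse, then trim."""
    return trim(collapse(text))


print(repr(collapse("Jo\n    Ann")))                 # 'Jo Ann'
print(repr(normalize("\n    Jo  \n    Ann\n")))     # 'Jo Ann'
```

The pattern lists the four characters explicitly rather than using Python's <tt>\s</tt> or <tt>str.split()</tt>, because those also match other characters, such as the no-break space, that XML does not count as whitespace.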
 
 
 
Sometimes, as in the XPath function &ldquo;normalize-space()&rdquo; used in XSLT, the term &ldquo;normalize&rdquo; refers to the combination of collapsing XML whitespace and then trimming. Other times, as in XML Schema, &ldquo;collapse&rdquo; is the name of the combined operation.
 
 
 
Normalizing XML whitespace is very common. It is so pervasive that it is easy to overlook that it is happening, and it can be difficult to know which program processing an XML file is doing the normalizing: is it the XSLT processor, the XSL stylesheet, the web browser, the print routine, or some combination?
 
 
 
== @xml:space ==
 
 
 
XML defines an attribute, <span style="background-color:#C3E6FC"><tt>xml:space</tt></span>, that, when set to <span style="background-color:#C3E6FC"><tt>preserve</tt></span>, indicates that applications should suspend their default trimming, collapsing, and normalizing and instead keep all the spaces, tabs, carriage returns, and line feeds just as they appear. If <span style="background-color:#C3E6FC">@xml:space</span> is set to <span style="background-color:#C3E6FC">default</span> or is simply left off, no such request is made; the application is free to apply its own default whitespace handling.
 
 
 
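The value of <span style="background-color:#C3E6FC">xml:space</span> applies to the element that carries it and to all of that element's descendants, unless a descendant overrides it with its own <span style="background-color:#C3E6FC">xml:space</span>. One could, for example, put <span style="background-color:#C3E6FC">xml:space="preserve"</span> on <span style="background-color:#C3E6FC">&lt;text&gt;</span> but not on <span style="background-color:#C3E6FC">&lt;teiHeader&gt;</span> to indicate that the request applies to all of the text but to none of the header.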
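
The scoping rule can be illustrated with a short Python sketch that walks a parsed document and records which <tt>xml:space</tt> value is in effect for each element. The helper name <tt>effective_space</tt> and the sample document are assumptions made for this example; the sample is deliberately simplified (it has no TEI namespace and is not a valid TEI file).

```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:space under the reserved XML namespace.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def effective_space(root):
    """Return (tag, value) pairs giving the xml:space value in effect
    for each element, in document order."""
    result = []

    def walk(element, inherited):
        # An element's own xml:space wins; otherwise the value in effect
        # for its parent carries over.
        value = element.get(XML_SPACE, inherited)
        result.append((element.tag, value))
        for child in element:
            walk(child, value)

    walk(root, "default")
    return result


# Simplified, hypothetical sample: xml:space="preserve" on <text> only.
DOC = (
    "<TEI>"
    "<teiHeader><title>Sample</title></teiHeader>"
    '<text xml:space="preserve"><body><p>Jo   Ann</p></body></text>'
    "</TEI>"
)

modes = dict(effective_space(ET.fromstring(DOC)))
print(modes)
```

In this sample, <tt>text</tt>, <tt>body</tt>, and <tt>p</tt> come out as <tt>preserve</tt>, while <tt>TEI</tt>, <tt>teiHeader</tt>, and <tt>title</tt> come out as <tt>default</tt>. The attribute only records a request; whether a given processor honors it is a separate matter.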
 

Latest revision as of 18:42, 5 August 2012
