Difference between revisions of "Whitespace"

From TEIWiki

Revision as of 21:22, 27 July 2012

Managing XML’s Whitespace in TEI Documents

TEI has robust features for specifying space, gaps, line breaks, and related aspects of the space between text. But TEI is an XML vocabulary, and XML itself, and programs that read and process XML files, have their own ways to deal with what they call whitespace, that is, space, tab, carriage return and linefeed characters. Often the standards, constraints, and conventions imposed by XML cause no problem for TEI encodings. But the interactions between XML's features and TEI's can sometimes cause subtle problems and sometimes even significant damage during processing of a TEI document.

This page offers an introduction to those interactions.

Where XML Considers Whitespace to be Significant

In XML documents, some whitespace is significant, some is not. For example, inside the brackets that mark XML elements extra whitespace is not significant. For any program processing these as pieces of XML,

<title type="main">

and

<title     type = "main" >

are the same. There is no significance to the extra space. By XML rules, no application that processes the data in this XML file (processing it as XML and not just as text) is allowed to treat these two representations differently. A person or computer editing this file is free to use either one, based merely on readability and aesthetics. The fact that there is whitespace between title and type is significant, but how much or of what kind (space characters, tabs, carriage returns, new lines) is not significant. The space between type and = is not significant.
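The equivalence of the two start tags can be checked with any conforming XML parser. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the element content `Example` and the helper name `same_element` are assumptions added for illustration.

```python
import xml.etree.ElementTree as ET

# The two start tags from the text, each given the same (hypothetical) content.
compact = ET.fromstring('<title type="main">Example</title>')
spaced = ET.fromstring('<title     type = "main" >Example</title>')


def same_element(a, b):
    """Compare name, attributes, and text as the parser reports them."""
    return a.tag == b.tag and a.attrib == b.attrib and a.text == b.text


tags_match = same_element(compact, spaced)
print(tags_match)
```

Once the file has been parsed, the extra whitespace inside the tag is no longer visible to the application at all, which is why no XML-aware program can treat the two forms differently.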

Whitespace can be significant, however, in the content of an element. For example,

<name>JoAnn</name>

and

<name>Jo Ann</name>

are different because of that space between Jo and Ann, and any program reading this element in an XML file is obliged to maintain the distinction.

But things can get complicated. Consider this:

<persName>    
    <forename>Jo</forename>
    <forename>Ann</forename>
    <surname>Henry</surname>
</persName>

Should the carriage returns and new lines matter? Should it matter if that open area before <surname> is a tab or is instead four space characters? Should it matter that there is extra space after <persName>?
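To see what a parser actually hands to an application in this case, the sketch below parses the sample with Python's standard `xml.etree.ElementTree`. It assumes every start tag has a matching end tag (a parser rejects anything else); the variable names are illustrative only.

```python
import xml.etree.ElementTree as ET

SAMPLE = (
    "<persName>    \n"
    "    <forename>Jo</forename>\n"
    "    <forename>Ann</forename>\n"
    "    <surname>Henry</surname>\n"
    "</persName>"
)

pers = ET.fromstring(SAMPLE)

# Whitespace before the first child is kept in .text;
# whitespace after each child is kept in that child's .tail.
whitespace_runs = [pers.text] + [child.tail for child in pers]

# The full text content, exactly as delivered by the parser.
raw_text = "".join(pers.itertext())

for run in whitespace_runs:
    print(repr(run))
print(repr(raw_text))
```

The parser passes every space and line break through untouched. Whether they matter is left to whatever program consumes the result.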

“Collapse,” “Trim,” “Normalize”

Many applications, including web browsers and some programs that convert XML into HTML, unless instructed otherwise, “collapse” XML whitespace, that is, they replace any contiguous string of space characters (0x20), tabs (0x09), carriage returns (0x0D) and line feeds (0x0A) with just one space character. So

  <name>Jo Ann</name>
  <name>Jo    Ann</name>
  <name>Jo 
      Ann</name>

are all treated as if there were just one space character between Jo and Ann. Moreover many applications also remove (“trim”) leading and trailing XML whitespace. So these, too,

  <name> Jo Ann</name >
  <name>
      Jo Ann</name>
  <name>
      Jo   
      Ann
  </name>

are treated as if the XML had been simply <name>Jo Ann</name>.

Sometimes, as in the XPath function normalize-space() (widely used in XSLT), the term “normalize” refers to the combination of collapsing XML whitespace and trimming it. Other times, as in XML Schema, “collapse” is the name of the combined operation.
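The three operations can be sketched in a few lines of Python. This is an illustration of the definitions given above, not the code of any particular processor; the function names `collapse`, `trim`, and `normalize` are chosen here for convenience.

```python
import re

# XML whitespace is exactly these four characters.
XML_WS = " \t\r\n"
_WS_RUN = re.compile("[ \t\r\n]+")


def collapse(text):
    """Replace each run of XML whitespace with a single space."""
    return _WS_RUN.sub(" ", text)


def trim(text):
    """Remove leading and trailing XML whitespace."""
    return text.strip(XML_WS)


def normalize(text):
    """Collapse, then trim."""
    return trim(collapse(text))


samples = [
    "Jo Ann",
    "Jo    Ann",
    "Jo \n      Ann",
    " Jo Ann",
    "\n      Jo   \n      Ann\n  ",
]
results = [normalize(s) for s in samples]
print(results)
```

The sketch deliberately avoids Python's bare `str.split()` and `str.strip()`, which also treat characters such as the non-breaking space as whitespace; XML's definition covers only the four characters listed.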

Normalizing XML whitespace is very common. It is so pervasive that it is easy to overlook, and it can be difficult to know which program processing an XML file is doing the normalizing: is it the XSLT processor, the XSL program, the web browser, the print routine, or some combination?

@xml:space

XML defines an attribute, xml:space, that when set to preserve indicates that applications should suspend default trimming, collapsing, and normalizing and instead keep all the spaces, tabs, carriage returns and line feeds just as they appear. If @xml:space is set to default or is simply left off, no such request is made; the application is free to do whatever its developer thinks best.

The attribute xml:space is inherited by child elements. One could, for example, put xml:space="preserve" into a TEI <text> element but not in <teiHeader> to indicate that the request applies to all of the text but to none of the header.
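Generic parsers typically report the attribute only on the element that carries it, so an application that wants to honor the request has to work out the inherited value itself. The following is a minimal sketch using Python's standard `xml.etree.ElementTree`; the document is a simplified, namespace-free stand-in for a TEI file, and `effective_space` is a hypothetical helper, not part of any library.

```python
import xml.etree.ElementTree as ET

# ElementTree reports xml:space under the reserved XML namespace.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

# Simplified stand-in for a TEI document (no TEI namespace, for brevity).
DOC = """<TEI>
  <teiHeader><title>Sample</title></teiHeader>
  <text xml:space="preserve"><body><p>  keep   this  </p></body></text>
</TEI>"""


def effective_space(root):
    """Map each element to the xml:space value in force for it."""
    result = {}

    def walk(elem, inherited):
        # An explicit attribute overrides whatever the ancestors requested.
        current = elem.get(XML_SPACE, inherited)
        result[elem] = current
        for child in elem:
            walk(child, current)

    walk(root, "default")
    return result


root = ET.fromstring(DOC)
modes = {elem.tag: mode for elem, mode in effective_space(root).items()}
print(modes)
```

Here everything under `<text>` resolves to `preserve`, while `<teiHeader>` and its children stay at `default`.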

TEI allows xml:space to be used on any element. But since TEI has such rich functionality for encoding spaces, gaps, line breaks, and so on, the xml:space attribute is rarely used. Whatever could be accomplished by setting its value to preserve would be better accomplished by using native TEI elements. So the value is normally left as default by simply not including the attribute. Downstream processors are thus left free to treat XML whitespace however their developers want.

Default Whitespace Processing

When xml:space is left as default, nothing in XML or TEI specifies how processors should treat whitespace.

There are, however, unspecified conventions. TEI encoders generally assume that in this encoding,

  <p>
  We hold these truths to be self-evident, that all men are
  created equal, that they are endowed by their creator
  with certain inalienable Rights, that among these are Life, 
  Liberty and the pursuit of Happiness.
  </p>

downstream processors will collapse each run of whitespace to a single space and trim the whitespace just after the opening paragraph tag and just before the closing one.


Also, the examples in the TEI Guidelines generally assume that text will get normalized by some process, somewhere along the line. The examples, therefore, place a demand on downstream processors without telling those processors, and without alerting users of the Guidelines, that normalization is being presumed.

Consider the following example, from the specification for <persName>.