Whitespace
TEI has robust features for specifying space, gaps, line breaks, and related aspects of the space between text. But TEI is an XML vocabulary, and XML itself, and programs that read and process XML files, have their own ways to deal with what they call whitespace, that is, space, tab, carriage return and linefeed characters. Sometimes the standards, constraints, and conventions imposed by XML cause problems for TEI encodings and for programs that process TEI files.
Where XML Considers Whitespace to be Significant
In XML documents, some whitespace is significant and some is not. For example, inside the angle brackets that delimit an XML tag, extra whitespace is not significant. For any program processing these as pieces of XML,
<title type="main">
and
<title type = "main" >
are the same. There is no significance to the extra space. By XML rules, no application that processes the data in this XML file (processing it as XML and not just as text) is allowed to treat these two representations differently. A person or computer editing this file is free to use either one, based merely on readability and aesthetics. The fact that there is whitespace between title and type is significant, but how much or of what kind (space characters, tabs, carriage returns, new lines) is not significant. The space between type and = is not significant.
Whitespace can be significant, however, in the content of an element. For example,
<name>JoAnn</name>
and
<name>Jo Ann</name>
are different because of that space between Jo and Ann, and any program reading this element in an XML file is obliged to maintain the distinction.
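The two cases above can be checked with any XML parser. The following is a minimal sketch using Python's standard xml.etree.ElementTree module; the element content "Example" and the variable names are assumptions added for illustration, since the examples above show only start tags.

```python
import xml.etree.ElementTree as ET

# Whitespace inside a tag: both spellings parse to the same element.
# The content "Example" is a placeholder, not part of the original example.
compact = ET.fromstring('<title type="main">Example</title>')
spaced = ET.fromstring('<title   type = "main"  >Example</title>')
print(compact.attrib == spaced.attrib)              # True
print(ET.tostring(compact) == ET.tostring(spaced))  # True

# Whitespace inside element content: the parser keeps the distinction.
joined = ET.fromstring("<name>JoAnn</name>")
separated = ET.fromstring("<name>Jo Ann</name>")
print(joined.text == separated.text)                # False
```

Once parsed, the extra spaces inside the start tag are gone, so a program working on the parsed document cannot tell the two spellings apart.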
But things can get complicated. Consider this:
<persName>
    <forename>Jo</forename>
    <forename>Ann</forename>
    <surname>Henry</surname>
</persName>
Should the carriage returns and new lines matter? Should it matter if that open area before <surname> is a tab or is instead four space characters? Should it matter that there is extra space after <persName>?
Normalize = Collapse + Trim
Many applications, including web browsers and many programs that read XML files, will, unless instructed otherwise, “collapse” XML whitespace; that is, they will replace any contiguous run of space characters (0x20), tabs (0x09), carriage returns (0x0D), and line feeds (0x0A) with just one space character. So
<name>Jo Ann</name>
<name>Jo     Ann</name>
<name>Jo
      Ann</name>
are all treated as if there were just one space character between Jo and Ann. Moreover many applications will remove (“trim”) leading and trailing XML whitespace. So these, too,
<name> Jo Ann</name >
<name> Jo Ann</name>
<name> Jo Ann </name>
would be treated as if the XML had been simply <name>Jo Ann</name>.
Sometimes, as in the XSLT function “normalize-space()”, the term “normalize” refers to the combination of collapsing XML whitespace and then trimming. Other times, as in XML Schema, “collapse” is the name of the combined operation. This article uses the XSLT terminology: normalizing is collapsing plus trimming.
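The terminology can be made concrete with a short sketch. The function names collapse, trim, and normalize below are hypothetical helpers chosen to match this article's vocabulary; they are not taken from any XML library. The character set is restricted to the four XML whitespace characters named above.

```python
import re

# The four XML whitespace characters: space, tab, carriage return, line feed.
XML_WS_RUN = re.compile(r"[ \t\r\n]+")

def collapse(text: str) -> str:
    """Replace each run of XML whitespace with a single space."""
    return XML_WS_RUN.sub(" ", text)

def trim(text: str) -> str:
    """Remove leading and trailing XML whitespace."""
    return text.strip(" \t\r\n")

def normalize(text: str) -> str:
    """Collapse, then trim (the sense used by normalize-space())."""
    return trim(collapse(text))

sample = "  Jo \t\n  Ann \n"
print(repr(collapse(sample)))   # ' Jo Ann '
print(repr(normalize(sample)))  # 'Jo Ann'
```

Note that collapsing alone leaves one leading and one trailing space in place; only the combined operation yields the bare name.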
Normalizing XML whitespace is very common. It is so pervasive that it is easy to overlook that it is happening and even difficult to know which program processing an XML file is doing the normalizing—is it the XSLT processor, the XSL program, the web browser, the print routine, or some combination?
@xml:space
XML defines an attribute, xml:space, that when set to preserve instructs applications to suspend default trimming, collapsing, and normalizing and instead keep all the spaces, tabs, carriage returns, and line feeds just as they are. If @xml:space is set to default or is simply left off, no such request is made; the application is free to do whatever its developer thinks best.
The attribute xml:space is inherited by child elements. One could, for example, put xml:space="preserve" into a TEI <text> element but not in <teiHeader>, to indicate that the request applies to all of the text but to none of the header.
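Inheritance of this kind can be sketched as a walk down the tree that carries the nearest ancestor's value along. The example below is only an illustration: effective_space is a hypothetical helper, and the document is a deliberately simplified, namespace-free stand-in for a real TEI file. ElementTree exposes xml:space under the reserved XML namespace URI.

```python
import xml.etree.ElementTree as ET

# ElementTree stores xml:space under the reserved XML namespace.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"

def effective_space(root):
    """Return (tag, effective xml:space value) pairs in document order."""
    result = []

    def walk(elem, inherited):
        # An explicit attribute overrides the inherited value.
        value = elem.get(XML_SPACE, inherited)
        result.append((elem.tag, value))
        for child in elem:
            walk(child, value)

    walk(root, "default")
    return result

# Simplified stand-in for a TEI document (no TEI namespace, minimal header).
doc = ET.fromstring(
    "<TEI>"
    "<teiHeader><title>Sample</title></teiHeader>"
    '<text xml:space="preserve"><body><p>Jo   Ann</p></body></text>'
    "</TEI>"
)
for tag, value in effective_space(doc):
    print(tag, value)
```

In this sketch the header elements report default, while <text> and everything inside it report preserve.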
TEI allows xml:space to be used on any element. But since TEI has so much rich functionality for encoding spaces, gaps, line breaks, and so on, the xml:space attribute is rarely used. Whatever could be accomplished by setting its value to preserve would be better accomplished by using native TEI elements. So the value is normally left as default by simply not including the attribute. Downstream processors are then left free to treat XML whitespace however the application developers want.
Default Whitespace Processing
When xml:space is left as default, nothing in XML or TEI specifies how consumers of a TEI XML file should treat whitespace. There are, however, unspecified conventions.
Collapsing
TEI encoders generally assume that some downstream processor will collapse spaces, tabs, carriage returns, and line feeds and will trim the space at the beginnings and ends of encodings such as this:
<p> We hold these truths to be self-evident, that all men are created equal, that they are endowed by their creator with certain inalienable Rights, that among these are Life, Liberty and the pursuit of Happiness. </p>
If the TEI file will be displayed in a web browser, the assumption is generally safe. Authors of web browsers have accepted responsibility for normalizing such text. The authors of other applications that read TEI XML files may need to be instructed to do the same.
Ideally there would be a formal mechanism for such communication, but XML has no such mechanism, and TEI P5 does not either. Because the assumption illustrated above is so common, the author of a program that reads a TEI XML file should, unless instructed otherwise by the encoders, assume the burden of collapsing, but not necessarily of trimming, whitespace in text nodes such as the one above.
Trimming
Whether text in an element should be trimmed depends (1) on TEI's specification of the element's parent, (2) on whether the element has siblings, and if so, whether the element is first of those siblings, is the last one, or is in the middle, and (3) on practices of the encoders, practices unspecified in TEI.
1. Children of Structured Elements
Part of defining an XML vocabulary such as TEI is specifying whether an element may contain text and elements or just elements. In TEI, <address> may only contain other elements. This
<address> <street>10 Downing Street</street> <settlement>London</settlement> <postCode>SW1A 2AA</postCode> </address>
is valid TEI. But this
<address> <street>10 Downing Street</street>, <settlement>London</settlement> <postCode>SW1A 2AA</postCode> </address>
is not, because of that comma after the <street> element. Free non-whitespace text is not allowed between the elements that comprise the <address> element. Though the term is sometimes used more loosely, <address> would commonly be called a "structured element."
Elements that do not allow free non-whitespace text—structured elements, strictly speaking—mimic database records. When XML is used to move data between databases, such elements are the norm. And often, a program extracting metadata from a TEI file will be looking for structured data, so that it can populate database fields:
street: 10 Downing Street
city: London
postal: SW1A 2AA
country:
The comma does not belong in the database. A program using the database will later decide how to format the full address—with commas, new-lines, spaces, etc.
If an element's parent does not allow free non-whitespace text, space within an element should be trimmed, that is, leading and trailing whitespace should be removed. A database extractor, for example, should populate a database field the same way for any of these encodings:
<settlement> London </settlement>
<settlement>London</settlement>
<settlement> London </settlement>
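A database extractor of this kind might look like the following sketch. The function address_record and the use of element names as field names are assumptions for illustration; a real extractor would map TEI elements to its own schema.

```python
import xml.etree.ElementTree as ET

def address_record(address_xml: str) -> dict:
    """Map each child of a structured <address> to a trimmed field value."""
    address = ET.fromstring(address_xml)
    # Trim only XML whitespace (space, tab, CR, LF) from each child's text.
    return {child.tag: (child.text or "").strip(" \t\r\n") for child in address}

# Three encodings that differ only in leading and trailing whitespace.
variants = [
    "<address><settlement> London </settlement></address>",
    "<address><settlement>London</settlement></address>",
    "<address>\n  <settlement>\n    London\n  </settlement>\n</address>",
]
records = [address_record(v) for v in variants]
print(records)
```

All three variants produce the same record, so the database field is populated identically however the encoder laid out the element.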
2. Child Among Siblings
If an element's parent does allow free non-whitespace text between elements (that is, if it is a "mixed-content" element), then whether or not to trim depends on where the element is among its siblings.
TEI's paragraph element, <p>, can contain both free non-whitespace text and other elements. This <p> element
<p> The <emph> cat </emph> ate the <foreign>croissant</foreign>. It wasn't me! </p>
has five children:
The | A text node |
<emph> cat </emph> | An <emph> element that itself contains one text node |
ate the | A text node |
<foreign>croissant</foreign> | A <foreign> element that itself contains one text node |
. It wasn't me! | A text node that includes a carriage return and two spaces |
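The five children can be listed programmatically. The sketch below uses Python's xml.etree.ElementTree, which has no separate text-node objects: text before the first child is stored in the parent's text attribute, and text following a child is stored in that child's tail attribute. The helper child_nodes is hypothetical, and the exact position of the line break in the sample is an assumption.

```python
import xml.etree.ElementTree as ET

def child_nodes(elem):
    """List an element's children, text nodes included, in document order."""
    nodes = []
    if elem.text:
        nodes.append(("text", elem.text))
    for child in elem:
        nodes.append(("element", child.tag))
        if child.tail:  # text between this child and the next sibling
            nodes.append(("text", child.tail))
    return nodes

# The line break and two-space indent after the full stop are assumed.
p = ET.fromstring(
    "<p> The <emph> cat </emph> ate the <foreign>croissant</foreign>."
    "\n  It wasn't me! </p>"
)
for kind, value in child_nodes(p):
    print(kind, repr(value))
```

The first and last text nodes carry the paragraph's leading and trailing whitespace, while the middle text node's surrounding spaces separate words from the neighboring elements. This is why position among siblings matters for trimming.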
Default Whitespace Processing in Elements That Contain Elements
Elements with only text, such as the one immediately above, generally cause little trouble. Without thinking much about it, encoders assume space will get normalized, and programmers have tools that do so. Elements that contain other elements, either alone or with text, are much more troublesome. Different encoders will make different assumptions. Programmers will make yet others. How elements will be processed may depend on how they were specified, and consumers, producers, or both may be unaware of the details.
Consider first elements that contain both other elements and text, so-called mixed-content elements.