Difference between revisions of "SIG:Overlap"

From TEIWiki
(Discussion)
  
 
* David Barnard, Lou Burnard, Jean-Pierre Gaspart, Lynne A. Price, Michael Sperberg-McQueen, and Giovanni Battista Varile. "Hierarchical Encoding of Text: Technical Problems and SGML Solutions." In The Text Encoding Initiative: Background and Contents. Guest Editors: Nancy Ide and Jean Veronis. Computers and the Humanities 29/3 (1995), pages 211-231. (http://xml.coverpages.org/bib-ab.html#barnardHierarchicalCHUM)
 
 
== Discussion ==
 
 
This section is for discussing the trials and tribulations of overlapping markup - please use at will. To start, I include Wendell's response to Lou's response to Syd's response to Lou, and David's response to Wendell, in the thread RE: alternated attributes on the OL Listserv.
 
 
''There must be a better way than this to run a discussion on a Wiki, but I'm pretty new at this - if anyone out there has experience running a Wiki discussion, please contact me ([mailto:dporter@uky.edu dporter@uky.edu])''
 
 
 
===Wendell said:===
 
 
Lou,
 
 
At 04:46 AM 7/7/2005, you wrote (citing Syd):
 
 
''Lou: It seems quite a radical departure from the way we currently teach people that XML works:''
 
 
''Syd: I don't see how this is a radical departure from XML at all. It is almost syntactic sugar for the spanTo=, after all.''
 
 
''Lou: Well, here are two differences I see at once:''
 
 
''1. spanTo  uses existing and well-tested and well understood (by software if not by people) id/idref mechanism to establish links''
 
 
''the horse uses the completely different idea of co-reference (shared naming would be a better term), which is used in only one other place in the Guidelines (Feature structures) and is not implemented by any software at all that I know of''
 
 
 
Respectfully, I think your stress on what's "implemented by software" is something of a red herring, since I have little doubt (the only doubt I have is the  studied paranoia of a programmer claiming anything is possible before it's been done) that I could write an XSLT transform that would convert markup from one form into the other, assuming the source format (whichever one it was) conformed to the constraints defined for it.
 
 
I don't see what using ID/IDREF traversal to "link" the two milestones gets you, since what's important here is not the two ends, but what's between them. There are also numerous ways these days to establish links besides ID/IDREF, which is commonly ignored in processing layers because of its DTD dependency, in the face of a common requirement to process files known only to be well-formed. Keys to link commonly-labeled nodes, whether nominally ID/IDREF or not, are generally a snap to set up. Heck, Steve or Syd could call sID an ID, and eID an IDREF, with no loss (and no gain either), except that XML prevents you from having two ID-typed attributes on a single element (IIRC), so sID would encroach on an ID already there. (Also this would be the wrong thing to do because sID and eID designate the *range*, not the element marking the start or end of the range.)
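The point about keys can be illustrated with a minimal sketch. It uses Python's standard library rather than an XSLT key, and the sample document and the name `pair_milestones` are invented for illustration; it only assumes that a range is labelled by equal `sID` and `eID` values, with no DTD and no ID/IDREF typing involved.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a quotation that starts in one paragraph and ends in the next.
SAMPLE = (
    '<text><p>Alpha <q sID="q1"/>beta</p>'
    '<p>gamma<q eID="q1"/> delta</p></text>'
)

def pair_milestones(root):
    """Map each range label to its (start marker, end marker) elements."""
    starts, ends = {}, {}
    for elem in root.iter():  # iter() walks the tree in document order
        if "sID" in elem.attrib:
            starts[elem.get("sID")] = elem
        if "eID" in elem.attrib:
            ends[elem.get("eID")] = elem
    # Only labels seen at both ends form a complete range.
    return {label: (starts[label], ends[label]) for label in starts if label in ends}

pairs = pair_milestones(ET.fromstring(SAMPLE))
```

The lookup works on a file known only to be well-formed, which is the situation described above.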
 
 
Besides, I refer you to my Extreme paper of last year (where I believe you sat in the audience), where I did implement transformations from one hierarchy into another using Steve's CLIX convention for marking up overlapping ranges. I am the last person to claim that this demonstration shows we are now ready to do this on a large scale -- I myself have many questions about how it would work in practice, especially at scale, and misgivings regarding its design. Nonetheless I think your assertion that somehow CLIX is harder to implement than the alternative syntax is made without much foundation.
 
 
I'd be glad to see it backed up with evidence. Where is there software that does much of anything with either kind of workaround to XML's single hierarchy? What syntactic form(s) does it assume, and how hard would it be to adapt to the other?
 
 
''Lou: 2. In the normal run of events, a start-tag marks the start of something, an end tag marks the end of something, and an empty tag marks a point. In the crazy world of horse, an empty tag may be any of those three, depending entirely on  a configuration of attributes and the way the wind is blowing. I think that's more than syntactic sugar.''
 
 
 
I think "which way the wind is blowing" is both needlessly invidious, and misleading, since DeRose and Bauman have proposed both clear rules for telling these differences, and a validation mechanism to test the integrity of an instance that claims to conform to them. I don't imagine this would be so hard for the alternative you propose, but I haven't seen it done yet either. I don't believe Syd's Schematron checks the direction of the wind.
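As a rough idea of what such an integrity test might check, here is a sketch in Python. It is not Syd's Schematron and does not claim to reproduce the DeRose/Bauman rules; the three checks (markers are empty, every `eID` follows a matching `sID`, both ends sit on the same element type) are assumptions chosen for illustration, as is the name `check_ranges`.

```python
import xml.etree.ElementTree as ET

def check_ranges(xml_text):
    """Return a list of problems; an empty list means the assumed checks passed."""
    problems = []
    open_ranges = {}  # label -> element name of the start marker
    closed = set()    # labels whose range has already been closed
    for elem in ET.fromstring(xml_text).iter():
        sid, eid = elem.get("sID"), elem.get("eID")
        if sid is None and eid is None:
            continue  # an ordinary element, nothing to check
        if len(elem) or (elem.text or "").strip():
            problems.append(f"marker <{elem.tag}> is not empty")
        if sid is not None:
            if sid in open_ranges or sid in closed:
                problems.append(f"duplicate sID {sid!r}")
            open_ranges[sid] = elem.tag
        if eid is not None:
            start_tag = open_ranges.pop(eid, None)
            if start_tag is None:
                problems.append(f"eID {eid!r} has no earlier sID")
            else:
                closed.add(eid)
                if start_tag != elem.tag:
                    problems.append(f"eID {eid!r} is on a different element type")
    for label in open_ranges:
        problems.append(f"sID {label!r} is never closed")
    return problems
```

For example, `check_ranges('<l><s sID="a"/>x<s eID="a"/></l>')` returns an empty list, while swapping the two markers yields two problems.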
 
 
Besides, what makes this criticism less true of any other milestone convention?
 
 
(Personally I think XML syntax is simply the wrong tool for the job, but that's me. Jeni Tennison and I have also proposed a syntax we consider better, as you know, to go along with the data model we have also proposed. I'd be thrilled to have the support to bring these proposals to a more workable state. But it's like building a city in the wilderness: first, one has to dig a well. And in the meantime there are plenty of more urgent jobs luring me back at home.)
 
 
''Lou: I worry that the average TEI user will be confused by it and start thinking that <foo start="x"/> ... <foo end="x"/> is just as good a way of saying <foo> ... </foo> as any other, when it really isn't in any practical sense.''
 
 
''Syd: I really doubt this would be a problem. Of course it would happen on occasion, but on the same level as people using <lb> to denote lines of poetry.''
 
 
''Lou: It is, in general, a bad idea to introduce a mechanism which is easy to abuse, especially if the same goals can be achieved without doing so.''
 
 
 
It seems to me that this is arguing that our preferred mechanism should be as clumsy and ungainly as possible, so people are less likely to use it. If you define "abuse" as "using milestone-marking instead of clean element containment when the latter is possible", I think that risk simply comes with the territory.
 
 
Stepping back: I'd like to see the TEI SIG leave off the question of syntax, which is both the least important of the questions we face, and the most likely to embroil us in unproductive debate over irrelevancies such as the direction of the wind. If people working in the field really can't stand the variety of weeds and wildflowers springing up (personally I have no problem with them), as an alternative, I'd recommend concentrating on standoff approaches to dealing with overlap. While using standoff data structures (whether maintained as text files, in a database or whatever) is less appealing to those of us like me who prefer to get our hands dirty with instance markup (and who therefore distrust the maintenance model that standoff entails), it does take you to the problems that really matter (IMO), namely the data model, the API you build over it, and (finally!) the operations you can then perform. And if you like you can even leave your markup perfectly uncontaminated while doing so.
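To make the standoff idea concrete, a minimal sketch of one possible data model follows: the text is kept untouched and each range is recorded as a name plus character offsets. The `Range` class, the sample text and the `overlaps` test are all hypothetical, not a model anyone in this thread has proposed.

```python
from dataclasses import dataclass

TEXT = "One two. Three four."

@dataclass(frozen=True)
class Range:
    name: str
    start: int  # offset of the first character
    end: int    # offset one past the last character

def overlaps(a, b):
    """True when two ranges cross without either containing the other."""
    intersect = a.start < b.end and b.start < a.end
    a_in_b = b.start <= a.start and a.end <= b.end
    b_in_a = a.start <= b.start and b.end <= a.end
    return intersect and not (a_in_b or b_in_a)

p1 = Range("p", 0, 8)    # "One two."
p2 = Range("p", 9, 20)   # "Three four."
q = Range("q", 4, 14)    # "two. Three", crossing the paragraph boundary
```

Here `overlaps(q, p1)` and `overlaps(q, p2)` are both true, while the two paragraphs do not overlap each other; the source text carries no markup at all.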
 
 
In the meantime, if you want to implement even as much as I have, with "half-LMNL", over the syntax you prefer -- please do. (I think you know how to find that Extreme paper.) ''(ed:  [http://www.mulberrytech.com/Extreme/Proceedings/html/2004/Piez01/EML2004Piez01.html here])''
 
 
But personally I'm bored by arguments over syntax. They can be amusing, in the way that arguing over the differences between British and American orthography can be amusing and even illuminating in a small way, as one considers the history of orthography. And some syntaxes are certainly fairer to behold than others. (My opinion is that *any* milestone syntax is ugly, reflecting the very ugliness of the idea of retrofitting overlap into XML. Overlap, properly considered, indicates a superset of XML, not a special case of it.) But finally the importance of a syntax is in what one can do with it, just as when, if the prose is good and the page legible, I don't much care about what "colour" you use. If we agree on a syntax but don't do anything about the more interesting and difficult problems, what have we achieved? The rule "we shall spell things as they are taught at Oxford" doesn't teach us how to write good prose.
 
 
The worst enemy of group decision-making is premature consensus. Or maybe it's the prioritizing of trivia over what is really consequential.
 
 
Regards,
 
Wendell
 
 
===David Durand Said:===
 
 
I won't quote what Wendell said, because I agree with almost all of it. I like the CLIX solution as described here. (I've proposed a variant of it at least once at ACH, years ago, so I may be biased).
 
 
I think that it's better than spanTo because it's a simpler proposal, in a formal sense. You can see that simplicity partly in the hard-to-answer questions raised for spanTo:
 
 
1. What spanTo means for non-empty elements
 
2. Does the element I spanTo have to have the same element type?
 
3. What does it mean if the element I spanTo has content? (i.e. is the content of that element inside the span or not?)
 
 
These questions don't arise for CLIX because the limitation to empty elements means that  we are labeling points, and showing how two points define a span. If the CLIX syntax tempts people to use it when inappropriate, this is perhaps a commentary on people's willingness to adopt non-hierarchical markup when it is possible.
 
 
I think that software is a non-issue: neither proposal is hard to implement (modulo the issue of defining the answers to the unanswered questions above). I'm willing to bet that both are hard to work with meaningfully in XSLT, as it's just a bad language for dealing with things that violate and overlap the tree structure.
 
 
On the other hand, linking versus co-reference is _probably_ a real issue:
 
 
Linking and IDs have been used in many places in the TEI to "build data structures," and it's always been a practice that creates confusion, since most of those pointers are not "references" in the normal sense of navigable link. The fact that you can only have one ID is a limitation for document management (e.g. of tables, figures, etc.). Another problem is that you have to have a DTD subset (or XSD validator) around to declare the ID attribute types.
 
 
In fact, I think the use of shared attribute values to implement "homegrown" ID references is now very common, because of the ease with which it can be done in XSLT. I can't say whether it's more common, but I generally don't bother with IDREF anymore at all. ID/IDREF mostly are used for their validation effects, in my experience.
 
 
In answer to Wendell's call to look at some non-syntactic/political issues, here are some open problems that I think are important, and which the CLIX paper probably addresses (I haven't had time to read it yet):
 
 
Different element types sometimes share the same endpoint. The requirement that each span have a distinct start and end element means that the endpoints of spans are always totally ordered with respect to each other.
 
 
Alternatively, if the interpretation is that a span labels a position between characters and not one between characters and elements then:
 
...  <foo eID="fooend"/><bar sID="barstart"/> ...
 
is equivalent to:
 
... <bar sID="barstart"/><foo eID="fooend"/> ...
 
 
In this case, elements representing span starts and ends are unlike other elements because they are not ordered with respect to the scope of normal elements.
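A small sketch of what that equivalence means in practice, assuming the "position between characters" interpretation: if markers that sit at the same point are unordered, two serialisations should compare equal once each run of adjacent markers is put into some canonical order. The token representation (strings for character data, tuples for markers) and the function `canonical` are invented for this illustration.

```python
def canonical(tokens):
    """Sort each run of adjacent markers so co-located markers compare equal."""
    out, run = [], []
    for tok in tokens:
        if isinstance(tok, tuple):   # a marker such as ("foo", "eID", "fooend")
            run.append(tok)
        else:                        # character data ends the current run
            out.extend(sorted(run))
            run = []
            out.append(tok)
    out.extend(sorted(run))          # flush a run at the very end
    return out

a = ["...", ("foo", "eID", "fooend"), ("bar", "sID", "barstart"), "..."]
b = ["...", ("bar", "sID", "barstart"), ("foo", "eID", "fooend"), "..."]
```

The two token lists differ as written, but `canonical(a) == canonical(b)` holds. Under the other interpretation, where markers are ordered with respect to each other, no such normalisation would be legitimate.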
 
 
These questions of co-punctuality are independent of the syntax chosen, and reflect different decisions about what to model when modeling spans. Gavin Nichol would say that spans are inherently non-hierarchical, and that the equivalence above is a good thing. I don't like this because I'd rather see a traditional marked up document as a special case of spans that happen to nest in a nice way, but this depends on spans having the ability to nest. This imposes additional complexity on out of line markup, however, because it makes the document addressing model more complex. The elements and spans that you choose to look at affect document addresses. In particular, adding a span can create addresses that didn't exist previously. For example, consider the content of the foo element:
 
 
<foo>cat</foo>
 
 
this has 4 positions:  before the 'c', after the 'c', after the 'a', after the 't'.
 
 
Adding a span changes things:
 
 
<bar sID="joe"/> ... <foo>ca<bar eID="joe"/>t</foo>
 
 
Because there's a new position between the 'a' and the 't': after the end of span "bar joe". This can be nice if you're editing a document, because you now have a principled way to express whether the 'r' inserted to change a 'cat' into a 'cart' should be part of the <foo> or not. On the other hand, separate editing of overlapping spans is much easier if there's a fundamental coordinate system that isn't affected by other spans.
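The arithmetic behind the 'cat' example can be spelled out in a few lines. This is only a sketch of the counting, assuming that span markers are ordered with respect to the characters around them; the token representation and the name `count_positions` are made up for the purpose.

```python
def count_positions(content):
    """Count insertion points in content made of text chunks and span markers."""
    chars = sum(len(tok) for tok in content if isinstance(tok, str))
    markers = sum(1 for tok in content if not isinstance(tok, str))
    # n characters give n + 1 positions; each marker splits one position in two.
    return chars + 1 + markers

END_OF_BAR_JOE = ("bar", "eID", "joe")   # stands for <bar eID="joe"/>
plain = ["cat"]                          # content of <foo>cat</foo>
with_span = ["ca", END_OF_BAR_JOE, "t"]  # content of <foo>ca<bar eID="joe"/>t</foo>
```

`count_positions(plain)` gives 4 and `count_positions(with_span)` gives 5. A coordinate system that ignored markers would give 4 in both cases, which is the trade-off described above.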
 
 
One small matter: I don't like the names sID and eID, as they create a mental confusion with XML IDs. I do see the perspective that says that they are the same thing conceptually -- the unique name for an element of a particular element type -- but I think people will expect other similarities and be confused by them. At the moment this is just a gut feeling, however.
 
 
===Dot says:===
 
 
I'm especially interested in Wendell's comment that "[he]'d like to see the TEI SIG leave off the question of syntax" and concentrate on standoff approaches to dealing with OL. I admit that the Wiki is heavy on suggestions for how to express overlapping markup, but I agree with Wendell that the data model for dealing with overlapping markup in TEI is more important than the syntax used to express it. On the other hand, I think it would be good for TEI to adopt a single, consistent approach to expressing overlapping markup/hierarchies (the simpler the better, which is why I'm excited about CLIX/Horse and unsure about standoff markup - but this could just be my ignorance).
 
 
So, what is the mission of the TEI OL SIG? At the moment, our goal is "to bring together users of the TEI who are acutely interested in issues of multiple hierarchies and in particular handling those in XML" - it looks like we've done that. What's next?
 

Revision as of 00:14, 8 July 2005

Introduction

The goal of the TEI Overlapping Markup SIG is to bring together users of the TEI who are acutely interested in issues of multiple hierarchies and in particular handling those in XML.

It will do this by:

  1. running a mailing list about overlapping hierarchies and solutions to encoding them
  2. assessing the TEI and suggesting improvements and alterations to the TEI Council

The SIG is convened by Dot Porter (dporter@uky.edu). If you have developed an approach to overlapping markup, would like to comment on the existing approaches below, or would like to add a citation to the bibliographies, please feel free to log into the Wiki and add your contributions.

The SIG runs a mailing list on this topic. To join visit [1]

Approaches to Handling Overlapping XML Markup

Multiple Hierarchies

The TEI P4 Guidelines provide a chapter that discusses some ways to deal with markup that is not hierarchical (Chapter 31, "Multiple Hierarchies", http://www.tei-c.org/P4X/NH.html). Specific problems mentioned in that chapter include many that should be familiar to even the most basic user of TEI markup:

  • in narrative, a speech by a character may begin in the middle of a paragraph and continue for several more paragraphs
  • in a verse text, the encoder may need to tag both the formal structure of the verse (its stanzas and lines) and its syntactic structures (which sometimes nest within the metrical structure and sometimes cross metrical boundaries)
  • in any kind of text, the encoder may wish to record the physical structure of volume, page, column, and line, as well as the formal or logical structure of chapters and paragraphs or acts and scenes, etc.
  • in verse drama, the structure of acts, scenes, and speeches often conflicts with the metrical structure
  • in any kind of text, an embedded text (e.g. a play within a play, or a song) may be interrupted by other matter; the encoder may wish to establish explicitly the logical unity of the embedded material (e.g. to identify the song as a single song, and to mark its internal formal structure)
  • in a dictionary, different types of information (e.g. orthography, syllabification, and hyphenation) may be combined within a single notation; the encoder may wish both to preserve the presentation of the material in the source text and to disentangle the logically distinct pieces of information in the interests of more convenient processing of the lexical information

Below are some approaches for using multiple hierarchies in XML, both for encoding them and for processing them.

The chapter on Multiple Hierarchies for P5 is currently under revision by Andreas Witt.

Kentucky GODDAG

Bibliography:

ABSTRACT: This document provides semantics of the Extended XPath language (EXPath) for Concurrent Markup Hierarchies (CMH).


ABSTRACT: XPath is a language for addressing parts of an XML document. It is used in many XML query languages and it can be used by itself for querying XML documents. While XPath is, in general, efficient for querying individual XML documents, it lacks the features for querying over collections of documents or joining parts of the same document.

As the amount of complex document-centric XML data is continually increasing, querying such documents has drawn surprisingly little attention. We propose an XPath axes extension to deal with querying collections of document-centric XML documents sharing the same content (called concurrent XML). The algorithms we propose to evaluate the extended axes work in linear time combined complexity (number of documents and total size of documents).


ABSTRACT: The problem of concurrent markup hierarchies in XML encodings of documents has attracted attention of a number of humanities researchers in recent years. The key problem with using concurrent hierarchies to encode documents is that markup in one hierarchy is not necessarily well-formed with respect to the markup in another hierarchy. Previously proposed solutions to this problem rely on the XML expertise of the editors and their ability to maintain correct DTDs for complex markup languages. In this paper, we approach the problem of maintenance of concurrent XML markup from the Computer Science perspective. We propose a framework that allows the editors to concentrate on the semantic aspects of the encoding, while leaving the burden of maintaining XML documents to the software. The paper describes the formal notion of the concurrent markup languages and the algorithms for automatic maintenance of XML documents with concurrent markup.

HORSE

Bibliography:

INTRODUCTION: "Overlap" describes cases where some markup structures do not nest neatly into others, such as when a quotation starts in the middle of one paragraph and ends in the middle of the next. OSIS [Duru03], a standard XML schema for Biblical and related materials, has to deal with extreme amounts of overlap. The simplest is book/chapter/verse and book/story/paragraph hierarchies that pervasively diverge; but many types of overlap are more complicated than this.

The basic options for dealing with overlap in the context of SGML [ISO 8879] or XML [Bray98] are described in the TEI Guidelines [TEI]. I summarize these with their strengths and weaknesses. Previous proposals for expressing overlap, or at least kinds of overlap, don't work well enough for the severe and frequent cases found in OSIS. Thus, I present a variation on TEI milestone markup that has several advantages, though it is not a panacea. This is now the normative way of encoding non-hierarchical structures in OSIS documents.

Citations:

  • [Bray98] Tim Bray, Jean Paoli, C. M. Sperberg-McQueen. Extensible Markup Language (XML) 1.0. W3C Recommendation 10-February-1998.
  • [Duru03] Patrick Durusau and Steven J. DeRose. "OSIS: A Users' Guide to the Open Scripture Information Standard." Bible Technologies Group, 2003.
  • [ISO 8879] International Organization for Standardization. 1986. ISO 8879: 1986(E). Information Processing: Text and Office Information Systems: Standard Generalized Markup Language.
  • [TEI] Michael Sperberg-McQueen and Lou Burnard (eds). Technical Topics: Multiple Hierarchies. Chapter 31 in the TEI Guidelines for Electronic Text Encoding and Interchange. http://xml.coverpages.org/teichap31.html


  • S. Bauman, "TEI HORSEing around: Handling overlap using the Trojan Horse method" Presentation at Extreme Markup 2005 (Link to be added following the conference)

ABSTRACT: The Text Encoding Initiative’s typed segment-boundary delimiter method is only one of several proposed mechanisms for handling overlap in TEI documents. HORSE (aka CLIX) defines a method by which an XML element is used normally when possible and as an improved version of the typed segment-boundary delimiter method when an overlap problem is encountered. A significant portion of the rules necessary for validation of HORSE markup can be expressed using Schematron. This, combined with an utter hack that can "HORSEify" the declaration of elements in a TEI Relax NG grammar, can provide a potential significant step forward in handling overlap in TEI documents.

Just-In-Time-Trees

Segment Trees

Bibliography:

  • J. W. Jaromczyk, et al. "A web interface to image-based concurrent markup using image maps." Proceedings, 6th ACM International Workshop on Web Information and Data Management (WIDM 2004), November 12-13, 2004, Washington, DC.
  • J. W. Jaromczyk, et al. "On Visualization of Complex Image-Based Markup," Proceedings, International Conference on Computer Vision and Geometry. Warsaw, Poland, September 2004.
  • J. W. Jaromczyk and N. Moore, "Geometric data structures for multihierarchical XML tagging of manuscripts," Proceedings of the 20th European Workshop on Computational Geometry, Seville, Spain, March 2004.


Treespaces

From Peter van Hardenberg, University of Victoria (pvh@uvic.ca)

In brief, a treespace document is a document which contains multiple XML documents within it. Treespaces are each, when viewed alone, valid XML documents, and include some syntax for assigning tags to a particular tree. Tags from differing trees may nest in whatever way is convenient.

The notion of the treespace is akin to that of the namespace. Just as a single DTD cannot encompass the full range of desired documents (particularly documents containing fragments from multiple sources), a single nesting tree cannot encode all documents. In essence, treespaces are a continuation of the OHCO hypotheses of Allen Renear. He proposes (with OHCO-3) that a document can be decomposed into multiple hierarchies, each one describing a "view" of a document. Unfortunately, this does not provide a conceptual mechanism for dealing with documents that may have overlapping trees combined from various structures.

To resolve this problem, trees must be considered to apply to a span of a document. This, in essence, creates a hybrid spanning/nesting model. Each tree is susceptible to the standard queries of any XML document and can have a DTD or Schema applied to it individually.

More work is necessary to determine useful extensions for determining relationships between trees.

Similarly, a suitably elegant syntax has not yet been developed.

Implementation of a "treespace" document structure is relatively easy aside from the above caveats. All parsers maintain a stack of "open" tags which are used for validation purposes. To extend an existing parser to test wellformedness of a treespace document requires maintaining a separate tag-context for each tree. No support for relating trees has been considered at this time -- each tree stands alone in the current model.
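The per-tree tag context can be sketched as follows. Since no treespace syntax has been settled, the sketch skips parsing altogether and assumes the tags have already been read into a hypothetical stream of `(kind, tree, tag)` events; the function name and the event shape are illustrative only.

```python
from collections import defaultdict

def treespace_well_formed(events):
    """Check well-formedness with one stack of open tags per tree.

    events: iterable of (kind, tree, tag); kind is "open", anything else is a close.
    """
    stacks = defaultdict(list)
    for kind, tree, tag in events:
        if kind == "open":
            stacks[tree].append(tag)
        elif not stacks[tree] or stacks[tree].pop() != tag:
            return False  # a close with no matching open in its own tree
    # Every tree must have closed everything it opened.
    return all(not stack for stack in stacks.values())

# A sentence (tree "syn") that starts in one verse line (tree "verse") and ends in the next.
crossing = [
    ("open", "verse", "l"), ("open", "syn", "s"), ("close", "verse", "l"),
    ("open", "verse", "l"), ("close", "syn", "s"), ("close", "verse", "l"),
]
```

`treespace_well_formed(crossing)` is true even though the same tags would fail against a single shared stack, because each tree is checked only against its own context.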

Other Approaches to Concurrent Hierarchies

ABSTRACT: The implementation of concurrent markup by Durusau and O'Donnell (Extreme Markup 2001) relies upon related but separate principles. First, markup, commonly described in tree notation, is actually metadata about PCDATA. Second, the membership of any "atom of PCDATA" in a given hierarchy can be recorded as metadata for that PCDATA. These two principles have allowed the authoring and querying of overlapping hierarchies using standard XML software.
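The second principle, membership recorded as metadata on each atom of PCDATA, might look roughly like the sketch below. The atom boundaries, the hierarchy names `verse` and `syntax`, the node labels and the function `extract` are assumptions made for illustration; this is not the authors' actual encoding.

```python
# Each atom of character data carries, per hierarchy, the node it belongs to.
# The text is the opening of Paradise Lost in modernised spelling.
ATOMS = [
    ("Of Man's first disobedience, and ", {"verse": "l1"}),
    ("the fruit",                         {"verse": "l1", "syntax": "np1"}),
    (" ",                                 {"syntax": "np1"}),
    ("Of that forbidden tree",            {"verse": "l2", "syntax": "np1"}),
    (" whose mortal taste",               {"verse": "l2"}),
]

def extract(atoms, hierarchy, node):
    """Reassemble the text of one node in one hierarchy from its member atoms."""
    return "".join(text for text, member in atoms if member.get(hierarchy) == node)
```

`extract(ATOMS, "verse", "l1")` returns the first line, while `extract(ATOMS, "syntax", "np1")` returns a phrase that crosses the line break; neither hierarchy has to nest inside the other.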

This presentation moves beyond the use of text snippets to illustrate overlapping hierarchies and applies the authors' technique to one of the classics of Western literature, John Milton's Paradise Lost. This research has resulted in the first release of overlapping texts for experimentation on overlapping hierarchies and in a firmer theoretical foundation for current and future research on this topic.


ABSTRACT: XML has a tree-structured data model, which is used to uniformly represent structured as well as semi-structured data, and also enable concise query specification in XQuery, via the use of its XPath (twig) patterns. This in turn can leverage the recently developed technology of structural join algorithms to evaluate the query efficiently. In this paper, we identify a fundamental tension in XML data modeling: (i) data represented as deep trees (which can make effective use of twig patterns) are often un-normalized, leading to update anomalies, while (ii) normalized data tends to be shallow, resulting in heavy use of expensive value-based joins in queries.

Our solution to this data modeling problem is a novel multi-colored trees (MCT) logical data model, which is an evolutionary extension of the XML data model, and permits trees with multi-colored nodes to signify their participation in multiple hierarchies. This adds significant semantic structure to individual data nodes. We extend XQuery expressions to navigate between structurally related nodes, taking color into account, and also to create new colored trees as restructurings of an MCT database. While MCT serves as a significant evolutionary extension to XML as a logical data model, one of the key roles of XML is for information exchange. To enable exchange of MCT information, we develop algorithms for optimally serializing an MCT database as XML. We discuss alternative physical representations for MCT databases, using relations and native XML databases, and describe an implementation on top of the Timber native XML database. Experimental evaluation, using our prototype implementation, shows that not only are MCT queries/updates more succinct and easier to express than equivalent shallow tree XML queries, but they can also be significantly more efficient to evaluate than equivalent deep and shallow tree XML queries/updates.


ABSTRACT: An approach to the unification of XML (Extensible Markup Language) documents with identical textual content and concurrent markup in the framework of XML-based multi-layer annotation is introduced. A Prolog program allows the possible relationships between element instances on two annotation layers that share PCDATA to be explored and also the computing of a target node hierarchy for a well-formed, merged XML document. Special attention is paid to identity conflicts between element instances, for which a default solution that takes into account metarelations that hold between element types on the different annotation layers is provided. In addition, rules can be specified by a user to prescribe how identity conflicts should be solved for certain element types.

Other Approaches to Overlapping XML Markup (not concurrent hierarchies)

Non-XML Approaches

Layered Markup and Annotation Language

LMNL, pronounced liminal: "an experimental approach to digital text encoding that supports, in SGML/XML terms, overlapping elements (ranges in LMNL) and structured attributes (annotations in LMNL)."
Project website includes a tutorial and much other informative material: http://www.lmnl.net/index.html

TexMECS

C. Huitfeldt and C. M. Sperberg-McQueen, "TexMECS: An experimental markup meta-language for complex documents", 25 January 2001, rev. 17 February 2001 (http://helmer.aksis.uib.no/claus/mlcd/papers/texmecs.html)


Other Non-XML Approaches

Bibliography:

  • M. Hilbert et al. "Making CONCUR work," Presentation at Extreme Markup 2005 (Link to be added following the conference)

ABSTRACT:

The SGML feature CONCUR allowed for a document to be simultaneously marked up in multiple conflicting hierarchical tagsets but validated and interpreted in one tagset at a time. Alas, CONCUR was rarely implemented, and XML does not address the problem of conflicting hierarchies at all. The MuLaX document syntax is a non-XML syntax that enables multiply-encoded hierarchies by distinguishing different “layers” in the hierarchy by adding a layer ID as a prefix to the element names. The IDs tie all the elements in a single hierarchy together in an “annotation layer”. Extraction of a single annotation layer results in a well-formed XML document, and each annotation layer may be associated with an XML schema. The MuLaX processing model developed works on the nodes of one annotation layer at a time. Furthermore, an alternative processing model is proposed which uses a multi-rooted trees approach. CONCUR lives!

General Bibliography

  • S. J. DeRose et al. (1990), 'What is Text, Really?', Journal of Computing in Higher Education, 1.2: 3-26.

Full-text available through ACM Portal (subscription only): http://portal.acm.org/citation.cfm?doid=264842.264843

Abstract: The way in which text is represented on a computer affects the kinds of uses to which it can be put by its creator and by subsequent users. The electronic document model currently in use is impoverished and restrictive. The authors argue that text is best represented as an ordered hierarchy of content objects (OHCO), because that is what text really is. This model conforms with emerging standards such as SGML and contains within it advantages for the writer, publisher, and researcher. The authors then describe how the hierarchical model can allow future use and reuse of the document as a database, hypertext, or network.


Abstract: We examine the claim that 'text is an ordered hierarchy of content objects'; this thesis was affirmed by the authors, and others, in the late 1980s and has been associated with certain approaches to text processing and the encoding of literary texts. First we discuss the nature of this claim and its connection with the history of text processing and text encoding standardization projects such as SGML and the Text Encoding Initiative. We then describe how the experience of the text encoding community, as represented and codified in the TEI Guidelines, has raised difficulties for this thesis. Next we consider two progressively weaker versions of this thesis formulated in response to these difficulties. Ultimately we find that no version appears to be free from counterexample.

Although none of these formulations proves to be theoretically sound, they are nonetheless methodologically illuminating as each generalizes actual encoding practices, making explicit certain assumptions that, even though they have been fundamental to the working methodologies of most text encoding projects, have never been explicitly articulated, let alone explained or defended. The counterexamples to the different versions of the OHCO thesis also arise in actual encoding projects -- so although our focus is theoretical it is grounded in the methodology and problems of contemporary encoding practices. The problems discussed here have implications not only for text encoding and our understanding of the nature of textual communication, but raise very fundamental issues in the logic and methodology of the humanities.


  • D. Barnard et al. (1988) 'SGML-Based Markup for Literary Texts: Two Problems and Some Solutions', Computers and the Humanities 22: 265-276.


  • David Barnard, Lou Burnard, Jean-Pierre Gaspart, Lynne A. Price, Michael Sperberg-McQueen, and Giovanni Battista Varile. "Hierarchical Encoding of Text: Technical Problems and SGML Solutions." In The Text Encoding Initiative: Background and Contents. Guest Editors: Nancy Ide and Jean Veronis. Computers and the Humanities 29/3 (1995), pages 211-231. (http://xml.coverpages.org/bib-ab.html#barnardHierarchicalCHUM)