SIG:Overlap

Introduction
The goal of the TEI Overlapping Markup SIG is to bring together users of the TEI who are interested in the problem of multiple hierarchies and, in particular, in handling them in XML.

It will do this by:


 * 1) running a mailing list about overlapping hierarchies and solutions to encoding them
 * 2) assessing the TEI and suggesting improvements and alterations to the TEI Council

The SIG is convened by Dot Porter ([mailto:dporter@uky.edu]). If you have developed an approach to overlapping markup, would like to comment on the existing approaches below, or would like to add a citation to the bibliographies, please feel free to log into the Wiki and add your contributions.

The SIG runs a mailing list on this topic. To join visit

Extended GODDAG
Bibliography:


 * "The Extended XPath language (EXPath) for querying Concurrent Markup Hierarchies," from the website of Ionut Emil Iacob (Research in Computing for Humanities, University of Kentucky). http://dblab.csr.uky.edu/%7Eeiaco0/docs/expath/index.html

ABSTRACT: This document provides semantics of the Extended XPath language (EXPath) for Concurrent Markup Hierarchies (CMH).


 * I. Iacob et al. "XPath Extension for Querying Concurrent XML Markup", (University of Kentucky, Department of Computer Science), Technical Report TR394-04, February 2004. http://dblab.csr.uky.edu/%7Eeiaco0/publications/TR394-04.pdf

ABSTRACT: XPath is a language for addressing parts of an XML document. It is used in many XML query languages and it can be used by itself for querying XML documents. While XPath is, in general, efficient for querying individual XML documents, it lacks the features for querying over collections of documents or joining parts of the same document.

As the amount of complex document-centric XML data is continually increasing, querying such documents has drawn surprisingly little attention. We propose an XPath axes extension to deal with querying collections of document-centric XML documents sharing the same content (called concurrent XML). The algorithms we propose to evaluate the extended axes work in linear time combined complexity (number of documents and total size of documents).
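
As a rough illustration of what querying concurrent XML involves, the sketch below takes two documents that share the same text but carry different markup, computes character offsets for every element, and finds the elements of one document that partially overlap an element of the other. The abstract does not list the extended axes, so the function names `spans` and `overlapping` and the sample documents are assumptions made for this example, not the report's actual definitions or algorithms.

```python
import xml.etree.ElementTree as ET

# Two encodings of the same text: one by verse line, one by phrase.
PHYSICAL = (
    "<doc><line>Of Man's first disobedience, and the fruit</line>"
    "<line> Of that forbidden tree</line></doc>"
)
SYNTACTIC = (
    "<doc><phrase>Of Man's first disobedience,</phrase>"
    "<phrase> and the fruit Of that forbidden tree</phrase></doc>"
)


def spans(xml_text):
    """Return (tag, start, end) character offsets for every element."""
    result = []

    def walk(elem, offset):
        start = offset
        offset += len(elem.text or "")
        for child in elem:
            offset = walk(child, offset)
            offset += len(child.tail or "")
        result.append((elem.tag, start, offset))
        return offset

    walk(ET.fromstring(xml_text), 0)
    return result


def overlapping(span, others):
    """Spans in `others` that partially overlap `span`.

    Containment in either direction does not count as overlap here.
    """
    _, start, end = span
    return [
        other for other in others
        if (start < other[1] < end < other[2])
        or (other[1] < start < other[2] < end)
    ]


first_line = spans(PHYSICAL)[0]
print(overlapping(first_line, spans(SYNTACTIC)))
```

Working from offsets over the shared text keeps each document well-formed on its own while still allowing questions that cross from one hierarchy to the other.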


 * A. Dekhtyar et al. "A Framework For Management of Concurrent XML Markup," (2005) Data and Knowledge Engineering, Vol. 52, No. 2, pp. 185 - 215. http://dblab.csr.uky.edu/%7Eeiaco0/publications/dke04-concurrent.pdf

ABSTRACT: The problem of concurrent markup hierarchies in XML encodings of documents has attracted attention of a number of humanities researchers in recent years. The key problem with using concurrent hierarchies to encode documents is that markup in one hierarchy is not necessarily well-formed with respect to the markup in another hierarchy. Previously proposed solutions to this problem rely on the XML expertise of the editors and their ability to maintain correct DTDs for complex markup languages. In this paper, we approach the problem of maintenance of concurrent XML markup from the Computer Science perspective. We propose a framework that allows the editors to concentrate on the semantic aspects of the encoding, while leaving the burden of maintaining XML documents to the software. The paper describes the formal notion of the concurrent markup languages and the algorithms for automatic maintenance of XML documents with concurrent markup.

HORSE
Bibliography:


 * S. DeRose, "Markup Overlap: A Review and a Horse", Extreme Markup Languages 2004: Proceedings http://www.mulberrytech.com/Extreme/Proceedings/html/2004/DeRose01/EML2004DeRose01.html

INTRODUCTION: "Overlap" describes cases where some markup structures do not nest neatly into others, such as when a quotation starts in the middle of one paragraph and ends in the middle of the next. OSIS [Duru03], a standard XML schema for Biblical and related materials, has to deal with extreme amounts of overlap. The simplest is book/chapter/verse and book/story/paragraph hierarchies that pervasively diverge; but many types of overlap are more complicated than this.

The basic options for dealing with overlap in the context of SGML [ISO 8879] or XML [Bray98] are described in the TEI Guidelines [TEI]. I summarize these with their strengths and weaknesses. Previous proposals for expressing overlap, or at least kinds of overlap, don't work well enough for the severe and frequent cases found in OSIS. Thus, I present a variation on TEI milestone markup that has several advantages, though it is not a panacea. This is now the normative way of encoding non-hierarchical structures in OSIS documents.

Citations:

 * [Bray98] Tim Bray, Jean Paoli, C. M. Sperberg-McQueen. Extensible Markup Language (XML) 1.0. W3C Recommendation 10-February-1998.
 * [Duru03] Patrick Durusau and Steven J. DeRose. "OSIS: A Users' Guide to the Open Scripture Information Standard." Bible Technologies Group, 2003.
 * [ISO 8879] International Organization for Standardization. 1986. ISO 8879: 1986(E). Information Processing: Text and Office Information Systems: Standard Generalized Markup Language.
 * [TEI] Michael Sperberg-McQueen and Lou Burnard (eds). Technical Topics: Multiple Hierarchies. Chapter 31 in the TEI Guidelines for Electronic Text Encoding and Interchange. http://xml.coverpages.org/teichap31.html


 * S. Bauman, "TEI HORSEing around: Handling overlap using the Trojan Horse method" Presentation at Extreme Markup 2005 (Link to be added following the conference)

ABSTRACT: The Text Encoding Initiative’s typed segment-boundary delimiter method is only one of several proposed mechanisms for handling overlap in TEI documents. HORSE (aka CLIX) defines a method by which an XML element is used normally when possible and as an improved version of the typed segment-boundary delimiter method when an overlap problem is encountered. A significant portion of the rules necessary for validation of HORSE markup can be expressed using Schematron. This, combined with an utter hack that can "HORSEify" the declaration of elements in a TEI Relax NG grammar, can provide a potential significant step forward in handling overlap in TEI documents.
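
The milestone idea behind both papers can be sketched in a few lines. The example below shows that a quotation crossing a paragraph boundary is not well-formed when written with ordinary nested tags, while the same text parses once the quotation is marked by a pair of empty elements. The `sID`/`eID` attribute names, the sample text, and the helper functions are assumptions made for illustration; consult the papers above for the actual rules of the method.

```python
import xml.etree.ElementTree as ET

# A quotation that starts in one paragraph and ends in the next.
# Written with ordinary nested tags it is not well-formed XML.
overlapping = "<text><p>He said, <q>wait.</p><p>Then go.</q> She left.</p></text>"

# The same text with the quotation marked by two empty milestone elements.
# The sID/eID attribute names are an assumption for this sketch.
milestoned = (
    '<text><p>He said, <q sID="q1"/>wait.</p>'
    '<p>Then go.<q eID="q1"/> She left.</p></text>'
)


def is_well_formed(xml_text):
    """Return True if xml_text parses as XML."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


def milestone_ranges(xml_text):
    """Map each milestone id to the text between its start and end marker."""
    root = ET.fromstring(xml_text)
    open_ids = {}   # id -> text pieces collected since the start marker
    finished = {}   # id -> complete text of the range

    def collect(piece):
        if piece:
            for pieces in open_ids.values():
                pieces.append(piece)

    def walk(elem):
        sid, eid = elem.get("sID"), elem.get("eID")
        if sid is not None:
            open_ids[sid] = []
        if eid is not None:
            finished[eid] = "".join(open_ids.pop(eid))
        collect(elem.text)
        for child in elem:
            walk(child)
            collect(child.tail)

    walk(root)
    return finished


print(is_well_formed(overlapping), is_well_formed(milestoned))
print(milestone_ranges(milestoned))
```

The cost of this approach is visible in `milestone_ranges`: a standard XML parser sees only two unrelated empty elements, so pairing them up and recovering the range is left to application code or to additional validation rules.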

Segment Trees
Bibliography:


 * J. W. Jaromczyk, et al. "A web interface to image-based concurrent markup using image maps." Proceedings, 6th ACM International Workshop on Web Information and Data Management (WIDM 2004), November 12-13, 2004, Washington, DC.


 * J. W. Jaromczyk, et al. "On Visualization of Complex Image-Based Markup," Proceedings, International Conference on Computer Vision and Geometry. Warsaw, Poland, September 2004.


 * J. W. Jaromczyk and N. Moore, "Geometric data structures for multihierarchical XML tagging of manuscripts," Proceedings of the 20th European Workshop on Computational Geometry, Seville, Spain, March 2004.

Other Approaches to Concurrent Hierarchies

 * P. Durusau and M. O'Donnell, "Concurrent Markup for XML Documents", Presentation at XML Europe 2002 (http://www.idealliance.org/papers/xmle02/dx_xmle02/papers/03-03-07/03-03-07.html)

ABSTRACT:

The implementation of concurrent markup by Durusau and O'Donnell (Extreme Markup 2001) relies upon related but separate principles. First, markup, commonly described in tree notation, is actually metadata about PCDATA. Second, the membership of any "atom of PCDATA" in a given hierarchy can be recorded as metadata for that PCDATA. These two principles have allowed the authoring and querying of overlapping hierarchies using standard XML software.

This presentation moves beyond the use of text snippets to illustrate overlapping hierarchies and applies the authors' technique to one of the classics of Western literature, John Milton's Paradise Lost. This research has resulted in the first release of overlapping texts for experimentation on overlapping hierarchies and in a firmer theoretical foundation for current and future research on this topic.
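
The two principles in the abstract can be illustrated with a small data structure. In the sketch below each atom of text carries, as metadata, its position in every hierarchy it belongs to, and a single hierarchy is recovered by grouping consecutive atoms on that metadata. The hierarchy names, the path strings, and the `view` function are hypothetical and are not the authors' actual format.

```python
from itertools import groupby

# Each atom of PCDATA records its membership in every hierarchy.
# Hierarchy names and path strings are invented for this sketch.
ATOMS = [
    ("Of Man's first disobedience,",
     {"verse": "book1/line1", "syntax": "s1/phrase1"}),
    (" and the fruit",
     {"verse": "book1/line1", "syntax": "s1/phrase2"}),
    (" Of that forbidden tree",
     {"verse": "book1/line2", "syntax": "s1/phrase2"}),
]


def view(atoms, hierarchy):
    """Group consecutive atoms that share a path in the given hierarchy."""
    grouped = groupby(atoms, key=lambda atom: atom[1][hierarchy])
    return [
        (path, "".join(text for text, _ in group))
        for path, group in grouped
    ]


for path, text in view(ATOMS, "verse"):
    print(path, repr(text))
for path, text in view(ATOMS, "syntax"):
    print(path, repr(text))
```

Because no atom is ever nested inside a competing element, the second phrase can span both verse lines without any conflict between the two hierarchies.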


 * R. Gartner presented at ACH/ALLC 2005 on integrating METS and TEI to handle concurrent hierarchies. (http://mustard.tapor.uvic.ca/cocoon/ach_abstracts/xq/xhtml.xq?id=125)

Layered Markup Annotation Language
LMNL, pronounced liminal: "an experimental approach to digital text encoding that supports, in SGML/XML terms, overlapping elements (ranges in LMNL) and structured attributes (annotations in LMNL)." Project website includes a tutorial and much other informative material: http://www.lmnl.net/index.html

Other Non-XML Approaches
Bibliography:


 * M. Hilbert et al. "Making CONCUR work," Presentation at Extreme Markup 2005 (Link to be added following the conference)

ABSTRACT:

The SGML feature CONCUR allowed for a document to be simultaneously marked up in multiple conflicting hierarchical tagsets but validated and interpreted in one tagset at a time. Alas, CONCUR was rarely implemented, and XML does not address the problem of conflicting hierarchies at all. The MuLaX document syntax is a non-XML syntax that enables multiply-encoded hierarchies by distinguishing different “layers” in the hierarchy by adding a layer ID as a prefix to the element names. The IDs tie all the elements in a single hierarchy together in an “annotation layer”. Extraction of a single annotation layer results in a well-formed XML document, and each annotation layer may be associated with an XML schema. The MuLaX processing model developed works on the nodes of one annotation layer at a time. Furthermore, an alternative processing model is proposed which uses a multi-rooted trees approach. CONCUR lives!
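
The layer-extraction step described in the abstract can be sketched as follows. The notation used here, a layer ID in parentheses before each element name, is invented for this example and may differ from the actual MuLaX syntax; the function `extract_layer` is likewise hypothetical. The point is only that dropping every tag from the other layers and removing the prefix leaves a well-formed XML document.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical layered notation: <(v)line> belongs to layer "v",
# <(s)phrase> to layer "s". This is not MuLaX's actual syntax.
LAYERED = (
    "<(v)text><(s)text>"
    "<(v)line><(s)phrase>Of Man's first disobedience,</(s)phrase>"
    "<(s)phrase> and the fruit</(v)line>"
    "<(v)line> Of that forbidden tree</(s)phrase></(v)line>"
    "</(s)text></(v)text>"
)

# Matches a start or end tag with a layer prefix and no attributes.
TAG = re.compile(r"<(/?)\((\w+)\)(\w+)>")


def extract_layer(layered, layer_id):
    """Keep the tags of one layer (without prefix) and drop all others."""
    def keep_or_drop(match):
        slash, layer, name = match.groups()
        return f"<{slash}{name}>" if layer == layer_id else ""

    return TAG.sub(keep_or_drop, layered)


for layer in ("v", "s"):
    xml_text = extract_layer(LAYERED, layer)
    root = ET.fromstring(xml_text)  # raises ParseError if not well-formed
    print(layer, [child.tag for child in root])
```

Both extracted layers contain the same text, so each can be validated against its own schema while the combined document preserves the overlap between them.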

General Bibliography

 * S. J. DeRose et al. (1990), 'What is Text, Really?', Journal of Computing in Higher Education, 1.2: 3-26.

Full-text available through ACM Portal (subscription only): http://portal.acm.org/citation.cfm?doid=264842.264843

Abstract: The way in which text is represented on a computer affects the kinds of uses to which it can be put by its creator and by subsequent users. The electronic document model currently in use is impoverished and restrictive. The authors argue that text is best represented as an ordered hierarchy of content object (OHCO), because that is what text really is. This model conforms with emerging standards such as SGML and contains within it advantages for the writer, publisher, and researcher. The authors then describe how the hierarchical model can allow future use and reuse of the document as a database, hypertext, or network.


 * A. Renear, et al. "Refining our Notion of What Text Really Is: The Problem of Overlapping Hierarchies" (http://www.stg.brown.edu/resources/stg/monographs/ohco.html)

Abstract: We examine the claim that 'text is an ordered hierarchy of content objects'; this thesis was affirmed by the authors, and others, in the late 1980s and has been associated with certain approaches to text processing and the encoding of literary texts. First we discuss the nature of this claim and its connection with the history of text processing and text encoding standardization projects such as SGML and the Text Encoding Initiative. We then describe how the experience of the text encoding community, as represented and codified in the TEI Guidelines, has raised difficulties for this thesis. Next we consider two progressively weaker versions of this thesis formulated in response to these difficulties. Ultimately we find that no version appears to be free from counterexample.

Although none of these formulations proves to be theoretically sound, they are nonetheless methodologically illuminating as each generalizes actual encoding practices, making explicit certain assumptions that, even though they have been fundamental to the working methodologies of most text encoding projects, have never been explicitly articulated, let alone explained or defended. The counterexamples to the different versions of the OHCO thesis also arise in actual encoding projects -- so although our focus is theoretical it is grounded in the methodology and problems of contemporary encoding practices. The problems discussed here have implications not only for text encoding and our understanding of the nature of textual communication, but raise very fundamental issues in the logic and methodology of the humanities.


 * D. Durand, et al. "What should markup really be? Applying theories of text to the design of markup systems," talk presented at ACH 1996 (http://cs-people.bu.edu/dgd/ach96_talk/Redefining_long.html)


 * D. Barnard et al. (1988) 'SGML-Based Markup for Literary Texts: Two Problems and Some Solutions', Computers and the Humanities 22: 265-276.