SIG:Overlap


Introduction

The goal of the TEI Overlapping Markup SIG is to bring together TEI users who are interested in the problems of multiple hierarchies, and in particular in handling them in XML. It will do this by:

  1. running a mailing list about overlapping hierarchies and solutions for encoding them
  2. assessing the TEI and suggesting improvements and alterations to the TEI Council

The SIG is convened by Dot Porter (dporter@uky.edu). If you have developed an approach to overlapping markup, would like to comment on the existing approaches below, or would like to add a citation to the bibliographies, please feel free to log into the Wiki and add your contributions. The SIG runs a mailing list on this topic. To join visit


Approaches to Handling Overlapping XML Markup

Multiple Hierarchies

The TEI P4 Guidelines provide a chapter that discusses some ways to deal with markup that is not hierarchical (Chapter 31, "Multiple Hierarchies", http://www.tei-c.org/P4X/NH.html). Specific problems mentioned in that chapter include many that should be familiar to even the most basic user of TEI markup:

  • in narrative, a speech by a character may begin in the middle of a paragraph and continue for several more paragraphs
  • in a verse text, the encoder may need to tag both the formal structure of the verse (its stanzas and lines) and its syntactic structures (which sometimes nest within the metrical structure and sometimes cross metrical boundaries)
  • in any kind of text, the encoder may wish to record the physical structure of volume, page, column, and line, as well as the formal or logical structure of chapters and paragraphs or acts and scenes, etc.
  • in verse drama, the structure of acts, scenes, and speeches often conflicts with the metrical structure
  • in any kind of text, an embedded text (e.g. a play within a play, or a song) may be interrupted by other matter; the encoder may wish to establish explicitly the logical unity of the embedded material (e.g. to identify the song as a single song, and to mark its internal formal structure)
  • in a dictionary, different types of information (e.g. orthography, syllabification, and hyphenation) may be combined within a single notation; the encoder may wish both to preserve the presentation of the material in the source text and to disentangle the logically distinct pieces of information in the interests of more convenient processing of the lexical information
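The first case in the list above can be made concrete with a small sketch. It assumes hypothetical element names (`div`, `p`, `q`) and invented sample sentences; it only shows that a speech crossing a paragraph boundary cannot be written as plain nested XML, and that the usual workaround is to split the speech into fragments.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a speech (<q>) starts in one paragraph and ends in the next.
overlapping = "<div><p>He said, <q>Go now.</p><p>And hurry.</q> She left.</p></div>"

# The same text with the speech split into two fragments so that it nests.
nested = "<div><p>He said, <q>Go now.</q></p><p><q>And hurry.</q> She left.</p></div>"


def is_well_formed(xml_text: str) -> bool:
    """Return True if xml_text parses as a single XML tree."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


print(is_well_formed(overlapping))
print(is_well_formed(nested))
```

The split version parses, but it no longer records that the two `q` fragments are one speech; recovering that unity is what the approaches discussed on this page try to address.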

Below are some approaches for using multiple hierarchies in XML, both for encoding them and for processing them.


Kentucky GODDAG

Bibliography:

ABSTRACT: This document provides semantics of the Extended XPath language (EXPath) for Concurrent Markup Hierarchies (CMH).


ABSTRACT: XPath is a language for addressing parts of an XML document. It is used in many XML query languages and it can be used by itself for querying XML documents. While XPath is, in general, efficient for querying individual XML documents, it lacks the features for querying over collections of documents or joining parts of the same document.

As the amount of complex document-centric XML data is continually increasing, querying such documents has drawn surprisingly little attention. We propose an XPath axes extension to deal with querying collections of document-centric XML documents sharing the same content (called concurrent XML). The algorithms we propose to evaluate the extended axes work in linear time combined complexity (number of documents and total size of documents).


ABSTRACT: The problem of concurrent markup hierarchies in XML encodings of documents has attracted attention of a number of humanities researchers in recent years. The key problem with using concurrent hierarchies to encode documents is that markup in one hierarchy is not necessarily well-formed with respect to the markup in another hierarchy. Previously proposed solutions to this problem rely on the XML expertise of the editors and their ability to maintain correct DTDs for complex markup languages. In this paper, we approach the problem of maintenance of concurrent XML markup from the Computer Science perspective. We propose a framework that allows the editors to concentrate on the semantic aspects of the encoding, while leaving the burden of maintaining XML documents to the software. The paper describes the formal notion of the concurrent markup languages and the algorithms for automatic maintenance of XML documents with concurrent markup.

HORSE

Bibliography:

INTRODUCTION: "Overlap" describes cases where some markup structures do not nest neatly into others, such as when a quotation starts in the middle of one paragraph and ends in the middle of the next. OSIS [Duru03], a standard XML schema for Biblical and related materials, has to deal with extreme amounts of overlap. The simplest is book/chapter/verse and book/story/paragraph hierarchies that pervasively diverge; but many types of overlap are more complicated than this.

The basic options for dealing with overlap in the context of SGML [ISO 8879] or XML [Bray98] are described in the TEI Guidelines [TEI]. I summarize these with their strengths and weaknesses. Previous proposals for expressing overlap, or at least kinds of overlap, don't work well enough for the severe and frequent cases found in OSIS. Thus, I present a variation on TEI milestone markup that has several advantages, though it is not a panacea. This is now the normative way of encoding non-hierarchical structures in OSIS documents.

Citations:

  • [Bray98] Tim Bray, Jean Paoli, C. M. Sperberg-McQueen. Extensible Markup Language (XML) 1.0. W3C Recommendation 10-February-1998.
  • [Duru03] Patrick Durusau and Steven J. DeRose. "OSIS: A Users' Guide to the Open Scripture Information Standard." Bible Technologies Group, 2003.
  • [ISO 8879] International Organization for Standardization. 1986. ISO 8879: 1986(E). Information Processing: Text and Office Information Systems: Standard Generalized Markup Language.
  • [TEI] Michael Sperberg-McQueen and Lou Burnard (eds). Technical Topics: Multiple Hierarchies. Chapter 31 in the TEI Guidelines for Electronic Text Encoding and Interchange. http://xml.coverpages.org/teichap31.html


  • S. Bauman, "TEI HORSEing around: Handling overlap using the Trojan Horse method" Presentation at Extreme Markup 2005 (Link to be added following the conference)

ABSTRACT: The Text Encoding Initiative’s typed segment-boundary delimiter method is only one of several proposed mechanisms for handling overlap in TEI documents. HORSE (aka CLIX) defines a method by which an XML element is used normally when possible and as an improved version of the typed segment-boundary delimiter method when an overlap problem is encountered. A significant portion of the rules necessary for validation of HORSE markup can be expressed using Schematron. This, combined with an utter hack that can "HORSEify" the declaration of elements in a TEI Relax NG grammar, can provide a potential significant step forward in handling overlap in TEI documents.
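As a rough illustration of the milestone idea that both abstracts in this section build on, the sketch below marks the start and end of a structure with two empty elements and then recovers the text between them. The `sID` and `eID` attribute names follow the OSIS milestone convention; the element names, the identifier `v1`, the sample text, and the helper `milestone_text` are assumptions made for this example and are not taken from either paper.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: a verse that starts in one paragraph and ends in the next,
# delimited by a pair of empty milestone elements.
doc = """<div>
<p>The first story ends here. <verse sID="v1"/>A verse begins in one paragraph</p>
<p>and ends in the next.<verse eID="v1"/> A new verse follows.</p>
</div>"""


def milestone_text(root, name, ident):
    """Return the text between <name sID=ident/> and <name eID=ident/>."""
    parts = []
    inside = False

    def walk(elem):
        nonlocal inside
        # Switch collection on at the start milestone and off at the end milestone.
        if elem.tag == name and elem.get("sID") == ident:
            inside = True
        if elem.tag == name and elem.get("eID") == ident:
            inside = False
        if inside and elem.text:
            parts.append(elem.text)
        for child in elem:
            walk(child)
            # The tail is the text that follows the child in document order.
            if inside and child.tail:
                parts.append(child.tail)

    walk(root)
    # Collapse the line breaks between paragraphs into single spaces.
    return " ".join("".join(parts).split())


print(milestone_text(ET.fromstring(doc), "verse", "v1"))
```

The document stays well-formed because the milestones are empty elements, but a generic XML validator no longer checks that every start milestone has a matching end; this is the kind of rule the HORSE abstract proposes expressing in Schematron.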

Just-In-Time-Trees

Segment Trees

Bibliography:

  • J. W. Jaromczyk, et al. "A web interface to image-based concurrent markup using image maps." Proceedings, 6th ACM International Workshop on Web Information and Data Management (WIDM 2004), November 12-13, 2004, Washington, DC.
  • J. W. Jaromczyk, et al. "On Visualization of Complex Image-Based Markup," Proceedings, International Conference on Computer Vision and Geometry. Warsaw, Poland, September 2004.
  • J. W. Jaromczyk and N. Moore, "Geometric data structures for multihierarchical XML tagging of manuscripts," Proceedings of the 20th European Workshop on Computational Geometry, Seville, Spain, March 2004.


Treespaces

From Peter van Hardenberg, University of Victoria (pvh@uvic.ca)

In brief, a treespace document is a document that contains multiple XML documents within it. Each treespace, when viewed alone, is a valid XML document, and the format includes some syntax for assigning tags to a particular tree. Tags from different trees may nest in whatever way is convenient.

The notion of the treespace is akin to that of the namespace. Just as a single DTD cannot encompass the full range of desired documents (particularly documents containing fragments from multiple sources), a single nesting tree cannot encode all documents. In essence, treespaces are a continuation of the OHCO hypotheses of Allen Renear. He proposes (with OHCO-3) that a document can be decomposed into multiple hierarchies, each one describing a "view" of a document. Unfortunately, this does not provide a conceptual mechanism for dealing with documents that may have overlapping trees combined from various structures.

To resolve this problem, trees must be considered to apply to a span of a document. This, in essence, creates a hybrid spanning/nesting model. Each tree is susceptible to the standard queries of any XML document and can have a DTD or Schema applied to it individually.

More work is necessary to determine useful extensions for determining relationships between trees.

Similarly, a suitably elegant syntax has not yet been developed.

Implementation of a "treespace" document structure is relatively easy aside from the above caveats. XML parsers maintain a stack of "open" tags, which is used to check that tags nest properly. Extending an existing parser to test the well-formedness of a treespace document requires maintaining a separate tag context for each tree. No support for relating trees has been considered at this time; each tree stands alone in the current model.
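The per-tree tag context described above might look like the following minimal sketch. Since no treespace syntax has been settled, the `<tree:tag>` notation, the tree names `log` and `phys`, and the function `treespace_well_formed` are all assumptions made purely for illustration.

```python
import re
from collections import defaultdict

# Hypothetical syntax: <tree:tag> opens and </tree:tag> closes a tag in the named tree.
TAG = re.compile(r"<(/?)(\w+):(\w+)>")


def treespace_well_formed(text: str) -> bool:
    """Check that the tags of each tree nest properly, ignoring the other trees."""
    stacks = defaultdict(list)  # one stack of open tags per tree
    for closing, tree, name in TAG.findall(text):
        stack = stacks[tree]
        if not closing:
            stack.append(name)
        elif not stack or stack.pop() != name:
            return False  # close tag does not match the innermost open tag of its tree
    # Every tree must have closed all of its tags by the end of the document.
    return all(not stack for stack in stacks.values())


crossing = "<log:p>He said, <phys:line>go now.</log:p><log:p>And hurry.</phys:line></log:p>"
broken = "<log:p><log:q>text</log:p></log:q>"

print(treespace_well_formed(crossing))
print(treespace_well_formed(broken))
```

Tags from different trees may cross freely in this sketch, while tags within one tree must still nest, which matches the statement that each tree stands alone in the current model.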

Other Approaches to Concurrent Hierarchies

ABSTRACT: The implementation of concurrent markup by Durusau and O'Donnell (Extreme Markup 2001) relies upon related but separate principles. First, markup, commonly described in tree notation, is actually metadata about PCDATA. Second, the membership of any "atom of PCDATA" in a given hierarchy can be recorded as metadata for that PCDATA. These two principles have allowed the authoring and querying of overlapping hierarchies using standard XML software.

This presentation moves beyond the use of text snippets to illustrate overlapping hierarchies and applies the authors' technique to one of the classics of Western literature, John Milton's Paradise Lost. This research has resulted in the first release of overlapping texts for experimentation on overlapping hierarchies and in a firmer theoretical foundation for current and future research on this topic.
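One way to read the two principles in this abstract is sketched below: each atom of text carries, as metadata, its position in every hierarchy, and any one hierarchy can be rebuilt by grouping atoms on that metadata. The data layout, the path strings, the function `view`, and the rough syntactic grouping are assumptions made for this example; they are not the authors' encoding or a real analysis of the poem.

```python
from itertools import groupby

# Each atom is (text, {hierarchy name: path of the atom in that hierarchy}).
# The "syntax" paths are an invented, illustrative grouping.
atoms = [
    ("Of Mans First Disobedience,", {"verse": "line[1]", "syntax": "np[1]"}),
    ("and", {"verse": "line[1]", "syntax": "conj[1]"}),
    ("the Fruit", {"verse": "line[1]", "syntax": "np[2]"}),
    ("Of that Forbidden Tree,", {"verse": "line[2]", "syntax": "np[2]"}),
]


def view(atoms, hierarchy):
    """Group consecutive atoms by their recorded path in one hierarchy."""
    return [
        (path, " ".join(text for text, _ in group))
        for path, group in groupby(atoms, key=lambda atom: atom[1][hierarchy])
    ]


for path, text in view(atoms, "verse"):
    print(path, "|", text)
for path, text in view(atoms, "syntax"):
    print(path, "|", text)
```

Here the phrase `np[2]` crosses the line boundary, yet both views come from the same flat list of atoms, so no hierarchy has to be privileged in the storage format.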


ABSTRACT: XML has a tree-structured data model, which is used to uniformly represent structured as well as semi-structured data, and also enable concise query specification in XQuery, via the use of its XPath (twig) patterns. This in turn can leverage the recently developed technology of structural join algorithms to evaluate the query efficiently. In this paper, we identify a fundamental tension in XML data modeling: (i) data represented as deep trees (which can make effective use of twig patterns) are often un-normalized, leading to update anomalies, while (ii) normalized data tends to be shallow, resulting in heavy use of expensive value-based joins in queries.

Our solution to this data modeling problem is a novel multi-colored trees (MCT) logical data model, which is an evolutionary extension of the XML data model, and permits trees with multi-colored nodes to signify their participation in multiple hierarchies. This adds significant semantic structure to individual data nodes. We extend XQuery expressions to navigate between structurally related nodes, taking color into account, and also to create new colored trees as restructurings of an MCT database. While MCT serves as a significant evolutionary extension to XML as a logical data model, one of the key roles of XML is for information exchange. To enable exchange of MCT information, we develop algorithms for optimally serializing an MCT database as XML. We discuss alternative physical representations for MCT databases, using relations and native XML databases, and describe an implementation on top of the Timber native XML database. Experimental evaluation, using our prototype implementation, shows that not only are MCT queries/updates more succinct and easier to express than equivalent shallow tree XML queries, but they can also be significantly more efficient to evaluate than equivalent deep and shallow tree XML queries/updates.


ABSTRACT: An approach to the unification of XML (Extensible Markup Language) documents with identical textual content and concurrent markup in the framework of XML-based multi-layer annotation is introduced. A Prolog program allows the possible relationships between element instances on two annotation layers that share PCDATA to be explored and also the computing of a target node hierarchy for a well-formed, merged XML document. Special attention is paid to identity conflicts between element instances, for which a default solution that takes into account metarelations that hold between element types on the different annotation layers is provided. In addition, rules can be specified by a user to prescribe how identity conflicts should be solved for certain element types.
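The relationships between element instances on two layers that share the same text can be illustrated by comparing character offsets. The sketch below is written in Python rather than Prolog and is not the program described in the abstract; the half-open `(start, end)` representation, the relation names, and the function `relation` are assumptions made for this example.

```python
def relation(a, b):
    """Classify two half-open character spans (start, end) over the same text."""
    (a_start, a_end), (b_start, b_end) = a, b
    if a == b:
        return "identity"  # same extent: the identity-conflict case
    if a_end <= b_start or b_end <= a_start:
        return "disjoint"
    if a_start <= b_start and b_end <= a_end:
        return "a_contains_b"
    if b_start <= a_start and a_end <= b_end:
        return "b_contains_a"
    return "overlap"  # the spans cross, so neither element can enclose the other


print(relation((0, 10), (0, 10)))
print(relation((0, 10), (2, 5)))
print(relation((0, 6), (4, 9)))
```

Only the `overlap` case prevents a straightforward merge into one tree. For `identity`, a merge has to decide which element becomes the parent, which is where the metarelations and user-specified rules mentioned in the abstract would come in.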

Other Approaches to Overlapping XML Markup (not concurrent hierarchies)

Non-XML Approaches

Layered Markup Annotation Language

LMNL, pronounced liminal: "an experimental approach to digital text encoding that supports, in SGML/XML terms, overlapping elements (ranges in LMNL) and structured attributes (annotations in LMNL)."
Project website includes a tutorial and much other informative material: http://www.lmnl.net/index.html

TexMECS

C. Huitfeldt and C. M. Sperberg-McQueen, "TexMECS: An experimental markup meta-language for complex documents", 25 January 2001, rev. 17 February 2001 (http://helmer.aksis.uib.no/claus/mlcd/papers/texmecs.html)


CoNLL-2005 Shared Task format

Shared Task Chairs: Xavier Carreras and Lluís Màrquez http://www.lsi.upc.edu/~srlconll/examples.html

MVD Multi-Version Document Format

Multi-version documents can be used to generally model overlapping hierarchies and textual variation without drawbacks. The MVD format encodes all overlapping structures as a directed graph with one start and one end-point. The arcs of the graph contain the content of each version or encoding perspective, which may contain markup. The versions are the different paths that the text takes through the graph from start to finish. The graph can also be written out as a sequence of pairs, each of which contains a set of versions and a piece of text. This list form can also be converted back into the graph form without loss of information. Efficient procedures exist for:

  1. listing of a given version
  2. comparing two versions
  3. searching all versions for some text
  4. finding the variants of a piece of text
  5. creating and editing an MVD

An MVD can represent insertions, deletions, substitutions (or variants) and transpositions. A paper describing the technology has been published by the International Journal of Human-Computer Studies. See http://multiversiondocs.blogspot.com for further information, in particular, the 'What is an Multi-Version Document' link.
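The "sequence of pairs" form described above can be sketched in a few lines. This is a toy model and not the actual MVD file format or its procedures: the version names `A` and `B`, the sample text, and the functions `list_version` and `search` are assumptions made for this example, and transpositions are not modelled.

```python
# Each pair holds the set of versions that share a piece of text, and the text itself.
pairs = [
    ({"A", "B"}, "The quick "),
    ({"A"}, "brown"),
    ({"B"}, "red"),
    ({"A", "B"}, " fox"),
]


def list_version(pairs, version):
    """Listing a version: concatenate the text of every pair the version belongs to."""
    return "".join(text for versions, text in pairs if version in versions)


def search(pairs, needle):
    """Searching all versions: return the versions whose full text contains needle."""
    all_versions = set().union(*(versions for versions, _ in pairs))
    return {v for v in all_versions if needle in list_version(pairs, v)}


print(list_version(pairs, "A"))
print(list_version(pairs, "B"))
print(sorted(search(pairs, "red fox")))
```

Text shared by several versions is stored once, and a substitution such as `brown` versus `red` appears as adjacent pairs with disjoint version sets.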

Other Non-XML Approaches

Bibliography:

  • M. Hilbert et al. "Making CONCUR work," Presentation at Extreme Markup 2005 (Link to be added following the conference)

ABSTRACT: The SGML feature CONCUR allowed for a document to be simultaneously marked up in multiple conflicting hierarchical tagsets but validated and interpreted in one tagset at a time. Alas, CONCUR was rarely implemented, and XML does not address the problem of conflicting hierarchies at all. The MuLaX document syntax is a non-XML syntax that enables multiply-encoded hierarchies by distinguishing different “layers” in the hierarchy by adding a layer ID as a prefix to the element names. The IDs tie all the elements in a single hierarchy together in an “annotation layer”. Extraction of a single annotation layer results in a well-formed XML document, and each annotation layer may be associated with an XML schema. The MuLaX processing model developed works on the nodes of one annotation layer at a time. Furthermore, an alternative processing model is proposed which uses a multi-rooted trees approach. CONCUR lives!

General Bibliography

  • S. J. DeRose et al. (1990), 'What is Text, Really?', Journal of Computing in Higher Education, 1.2: 3-26.

Full-text available through ACM Portal (subscription only): http://portal.acm.org/citation.cfm?doid=264842.264843

Abstract: The way in which text is represented on a computer affects the kinds of uses to which it can be put by its creator and by subsequent users. The electronic document model currently in use is impoverished and restrictive. The authors argue that text is best represented as an ordered hierarchy of content object (OHCO), because that is what text really is. This model conforms with emerging standards such as SGML and contains within it advantages for the writer, publisher, and researcher. The authors then describe how the hierarchical model can allow future use and reuse of the document as a database, hypertext, or network.

Abstract: We examine the claim that 'text is an ordered hierarchy of content objects'; this thesis was affirmed by the authors, and others, in the late 1980s and has been associated with certain approaches to text processing and the encoding of literary texts. First we discuss the nature of this claim and its connection with the history of text processing and text encoding standardization projects such as SGML and the Text Encoding Initiative. We then describe how the experience of the text encoding community, as represented and codified in the TEI Guidelines, has raised difficulties for this thesis. Next we consider two progressively weaker versions of this thesis formulated in response to these difficulties. Ultimately we find that no version appears to be free from counterexample.

Although none of these formulations proves to be theoretically sound, they are nonetheless methodologically illuminating as each generalizes actual encoding practices, making explicit certain assumptions that, even though they have been fundamental to the working methodologies of most text encoding projects, have never been explicitly articulated, let alone explained or defended. The counterexamples to the different versions of the OHCO thesis also arise in actual encoding projects -- so although our focus is theoretical it is grounded in the methodology and problems of contemporary encoding practices. The problems discussed here have implications not only for text encoding and our understanding of the nature of textual communication, but raise very fundamental issues in the logic and methodology of the humanities.


  • D. Barnard et al. (1988) 'SGML-Based Markup for Literary Texts: Two Problems and Some Solutions', Computers and the Humanities 22: 265-276.
  • David Barnard, Lou Burnard, Jean-Pierre Gaspart, Lynne A. Price, Michael Sperberg-McQueen, and Giovanni Battista Varile. "Hierarchical Encoding of Text: Technical Problems and SGML Solutions." In The Text Encoding Initiative: Background and Contents. Guest Editors: Nancy Ide and Jean Veronis. Computers and the Humanities 29/3 (1995), pages 211-231. (http://xml.coverpages.org/bib-ab.html#barnardHierarchicalCHUM)
  • D. Schmidt and R. Colomb (2009) 'A data structure for representing multi-version texts online', International Journal of Human-Computer Studies, 67.6, 497-514.