Stand-off markup

Stand-off markup (also known as remote markup or stand-off annotation) is markup that resides in a different location from the data it describes. It is thus the opposite of inline markup, where data and annotations are intermingled within a single location.

Analogy: annotating binary data
When XML describes binary data, such as images or audio/video, this kind of system comes naturally, because pieces of a binary file cannot be stored directly as element content in an XML file. For an XML annotation to describe a scanned or photographed image of a text or inscription, a reference system is required (e.g. with pixels as individual units), together with pointers that connect elements in the XML file to areas of the image. The same holds for audio/video data, which can be indexed by the time axis, by byte offset, or within another appropriate reference system.

This kind of annotation system can also be applied to texts: instead of mixing data and markup, the source text can be left as read-only (and thus secure and possibly even located on a remote server) and the markup that describes it can constitute a separate layer, linked to the original by appropriate pointers.
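The idea can be illustrated with a minimal Python sketch. It assumes the simplest possible reference system, namely 0-based character offsets into an unchanging string; the sample text, the Annotation class and the labels are hypothetical and only serve the illustration.

```python
from dataclasses import dataclass

# The primary text is never modified; annotations only point into it.
SOURCE_TEXT = "Stand-off markup keeps text and annotation apart."


@dataclass(frozen=True)
class Annotation:
    start: int  # offset of the first character (0-based, inclusive)
    end: int    # offset one past the last character (exclusive)
    label: str  # hypothetical annotation category


def resolve(text: str, ann: Annotation) -> str:
    """Return the stretch of text that the annotation points to."""
    if not 0 <= ann.start <= ann.end <= len(text):
        raise ValueError(f"pointer out of range: {ann}")
    return text[ann.start:ann.end]


# The annotation layer lives apart from the text and could be stored elsewhere.
annotations = [
    Annotation(0, 16, "term"),
    Annotation(23, 27, "noun"),
]

for ann in annotations:
    print(ann.label, repr(resolve(SOURCE_TEXT, ann)))
```

Because the pointers are plain offsets, any change to the source text would invalidate them, which is one reason for keeping the source read-only.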

Advantages

 * separation of logical layers of annotation
 * overlapping hierarchies
 * text read-only or secured, annotations free (cf. ANC)
 * etc.
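The point about overlapping hierarchies can be sketched in Python. The example assumes two hypothetical annotation layers ("line" and "phrase") expressed as character-offset spans over the same text; the text and the layer names are invented for the illustration. Spans that cross each other could not be encoded as properly nested inline elements, but they coexist without difficulty as separate stand-off layers.

```python
from itertools import product

TEXT = "one two three four five six"

# Two independent layers over the same text, as (start, end) offsets
# (0-based, end exclusive). Both layers are hypothetical.
layers = {
    "line": [(0, 13), (14, 27)],   # "one two three" / "four five six"
    "phrase": [(4, 18)],           # "two three four"
}


def crosses(a, b):
    """True if the spans overlap without one containing the other."""
    (a_start, a_end), (b_start, b_end) = a, b
    return (a_start < b_start < a_end < b_end
            or b_start < a_start < b_end < a_end)


# Every pair of spans that could not be nested inside one inline hierarchy.
crossing = [
    (line, phrase)
    for line, phrase in product(layers["line"], layers["phrase"])
    if crosses(line, phrase)
]

for line, phrase in crossing:
    print(repr(TEXT[line[0]:line[1]]), "crosses", repr(TEXT[phrase[0]:phrase[1]]))
```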

A bit of history

 * CES, XCES, ANC


 * CES image


 * mention HyTime and the TEI's contribution to XLink?


 * ISO, Ide+Romary (?)


 * mention ATLAS (?)

Current implementations

 * ANC Tool (to aggregate XCES annotations, could it be customized?)


 * mention libxml2 and xmllint as the only (?) free-standing parser that (almost) implements the entire XPointer framework (bugs + non-W3C schemes are not handled)


 * mention TEI schemes defined at http://www.w3.org/2005/04/xpointer-schemes/


 * mention, perhaps, the way to use the string-range function of the xpointer scheme
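As a rough illustration of the last item: in the xpointer() scheme, an expression of the shape xpointer(string-range(//p, "Thomas Pynchon", 8, 7)) is meant to locate every occurrence of the given string in the selected nodes and return, for each match, a range that starts at the given 1-based position within the match and spans the given number of characters. The Python sketch below only emulates that behaviour on a plain string. It is not an XPointer processor, the function name and signature are hypothetical, the empty match string is deliberately not handled, and clamping ranges to the bounds of the string is a choice made for this sketch.

```python
from typing import List, Optional, Tuple


def string_range(string_value: str, match: str,
                 position: int = 1,
                 length: Optional[int] = None) -> List[Tuple[int, int]]:
    """Emulate string-range() on a plain string (simplified sketch).

    Returns one (start, end) pair of 0-based offsets per non-overlapping
    occurrence of `match`, with `end` exclusive.
    """
    if not match:
        raise ValueError("this sketch does not handle the empty match string")
    if length is None:
        length = len(match)  # default: the range covers the whole match

    ranges = []
    found = string_value.find(match)
    while found != -1:
        # `position` is 1-based and relative to the start of the match.
        start = found + position - 1
        # Assumption of this sketch: keep the range inside the string.
        start = max(0, min(start, len(string_value)))
        end = max(start, min(start + length, len(string_value)))
        ranges.append((start, end))
        # Continue after the current match so that matches do not overlap.
        found = string_value.find(match, found + len(match))
    return ranges


sample = "Thomas Pynchon wrote; Thomas Pynchon revised."
for start, end in string_range(sample, "Thomas Pynchon", 8, 7):
    print(start, end, repr(sample[start:end]))
```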

Granularity: addressing elements/tokens vs. addressing (spans of) characters

 * James's paper
 * NKJP/Polish: catching ambiguities at sublexical level

When stand-off annotation does not need to address the sound or character level, an easy way to proceed is to build a primary resource in TEI format that is fully segmented at the word level (<w> elements, each with an ID). Annotation files can then simply refer to the corresponding IDs by means of a URL with a fragment identifier (http://www.myorganisation.org/myCorpus#w3425).
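A minimal Python sketch of this ID-based approach follows. It assumes that the primary resource marks words with TEI <w> elements carrying xml:id attributes and that an annotation is a simple record with a target URL; the sample content, the annotation record and its "pos" value are hypothetical, and the primary resource is embedded as a string rather than fetched from the URL.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urldefrag

# ElementTree exposes xml:id under the XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical primary resource, segmented at the word level.
PRIMARY_URL = "http://www.myorganisation.org/myCorpus"
PRIMARY = """<text xmlns="http://www.tei-c.org/ns/1.0">
  <body>
    <p>
      <w xml:id="w3424">stand-off</w>
      <w xml:id="w3425">markup</w>
    </p>
  </body>
</text>"""

# Hypothetical stand-off annotation pointing at one word by its ID.
annotations = [
    {"target": "http://www.myorganisation.org/myCorpus#w3425", "pos": "noun"},
]


def index_by_id(root):
    """Map every xml:id in the document to its element."""
    return {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID) is not None}


def resolve(pointer, base_url, index):
    """Split the pointer into document URL and fragment, then look up the ID."""
    url, fragment = urldefrag(pointer)
    if url != base_url:
        raise ValueError(f"pointer targets another resource: {url}")
    return index[fragment]


word_index = index_by_id(ET.fromstring(PRIMARY))
for ann in annotations:
    word = resolve(ann["target"], PRIMARY_URL, word_index)
    print(ann["pos"], repr(word.text))
```

Since the pointers rely on IDs rather than offsets, they stay valid as long as the IDs in the primary resource are stable, even if whitespace or surrounding markup changes.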

Links

 * Documentation
    * CES documentation
    * TEI Guidelines, ch. 16.9

 * Tools that go some way towards creating/handling stand-off markup
    * XSLT++ Project: XPointer implementation
    * EDITOR (the Edition as a Digital Instrument for Text-based Open Research)
    * Gate for ANC (not sure how it would handle TEI, separate plug-ins would be needed?)

 * Related papers
    * A SANE approach to annotation in the digital edition, by Peter Boot, Jahrbuch für Computerphilologie 8 (2006)