Stand-off markup

Stand-off markup (also known as remote markup or stand-off annotation) is markup that resides in a location different from that of the data it describes. It is thus the opposite of inline markup, where data and annotations are intermingled within a single location.

Analogy: annotating binary data

When XML describes binary data, such as images or audio/video, this kind of system comes naturally, because raw pieces of a binary file cannot be stored directly as element content in an XML file. For an XML annotation to describe a scanned or photographed image of a text or inscription, a reference system is required (e.g. with pixels as the individual units), together with pointers that connect elements in the XML file to areas of the image. The same holds for audio/video data, which can be indexed along the time axis, by byte offset, or within another appropriate reference system.

This kind of annotation system can also be applied to texts: instead of mixing data and markup, the source text can be left read-only (and thus secure, possibly even located on a remote server), while the markup that describes it forms a separate layer linked to the original by appropriate pointers.
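The idea of a separate annotation layer can be sketched in a few lines of Python. This is only a minimal illustration, assuming character offsets as the reference system; the names `SOURCE_TEXT`, `Annotation`, `LAYER` and `resolve` are hypothetical and do not come from any TEI tool.

```python
from dataclasses import dataclass

# The primary text is treated as read-only: nothing below modifies it.
SOURCE_TEXT = "Stand-off markup keeps the text untouched."


@dataclass(frozen=True)
class Annotation:
    """One stand-off annotation (hypothetical structure for illustration)."""
    start: int  # offset of the first character, inclusive
    end: int    # offset one past the last character, exclusive
    label: str


def resolve(text: str, ann: Annotation) -> str:
    """Follow the pointer and return the stretch of text it addresses."""
    if not 0 <= ann.start <= ann.end <= len(text):
        raise ValueError(f"pointer {ann.start}..{ann.end} is outside the text")
    return text[ann.start:ann.end]


# The annotation layer lives apart from the text and only points into it.
LAYER = [
    Annotation(0, 16, "term"),
    Annotation(27, 31, "noun"),
]

for ann in LAYER:
    print(ann.label, "->", repr(resolve(SOURCE_TEXT, ann)))
```

Because the layer holds only offsets and labels, it could be stored, versioned or distributed independently of the text it describes.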

Advantages

- separation of logical layers of annotation
- support for overlapping hierarchies
- the text can stay read-only or secured while the annotations remain free (cf. ANC)
- etc.
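To illustrate the point about overlapping hierarchies: two annotation layers over the same text may contain spans that cross each other. Such spans cannot both be encoded as properly nested inline elements, but they coexist without difficulty as stand-off spans. The sketch below assumes character offsets as pointers; the layer names and the helper functions are hypothetical.

```python
from typing import List, Tuple

Span = Tuple[int, int]  # (start, end) character offsets, end exclusive

TEXT = "one two three four"

# Two independent layers over the same text (illustrative data only).
LINES: List[Span] = [(0, 7), (8, 18)]   # "one two", "three four"
PHRASES: List[Span] = [(4, 13)]         # "two three"


def crosses(a: Span, b: Span) -> bool:
    """True if the spans partially overlap, i.e. neither contains the other."""
    return a[0] < b[0] < a[1] < b[1] or b[0] < a[0] < b[1] < a[1]


def crossing_pairs(layer_a: List[Span], layer_b: List[Span]) -> List[Tuple[Span, Span]]:
    """List every pair of spans that could not be nested in a single XML tree."""
    return [(a, b) for a in layer_a for b in layer_b if crosses(a, b)]


for a, b in crossing_pairs(LINES, PHRASES):
    print(repr(TEXT[a[0]:a[1]]), "crosses", repr(TEXT[b[0]:b[1]]))
```

Each layer is well-formed on its own; the conflict only appears if one tries to merge both into a single inline hierarchy.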

A bit of history

CES, XCES, ANC

CES image

mention HyTime and the TEI's contribution to XLink?

ISO, Ide+Romary (?)

mention ATLAS (?)

Current implementations

ANC Tool (to aggregate XCES annotations, could it be customized?)

mention libxml2 and xmllint as the only (?) free-standing parser that (almost) implements the entire XPointer framework (it has bugs, and non-W3C schemes are not handled)

mention the TEI schemes registered in the W3C XPointer scheme registry at http://www.w3.org/2005/04/xpointer-schemes/

mention, perhaps, the way to use the string-range() function of the xpointer() scheme

Granularity: addressing elements/tokens vs. addressing (spans of) characters

James's paper
NKJP/Polish: catching ambiguities at sublexical level

When annotations do not need to point at the level of individual sounds or characters, an easy way to proceed is to build a primary resource in TEI format that is fully segmented at the word level (<w> elements carrying IDs). Annotation files can then refer to the corresponding IDs by means of a simple URI whose fragment identifier names the element (e.g. http://www.myorganisation.org/myCorpus#w3425).
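The word-level approach can be sketched with the Python standard library. This is only an illustration under stated assumptions: the TEI fragment is a made-up, incomplete sample (no teiHeader), the corpus URI is the example address used above, and `index_words`, `resolve_pointer` and the `pos` annotation are hypothetical names.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urldefrag

XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # expanded name of xml:id
TEI_NS = "http://www.tei-c.org/ns/1.0"
CORPUS_URI = "http://www.myorganisation.org/myCorpus"

# Made-up primary resource, segmented at the word level.
PRIMARY = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text><body><p>
    <w xml:id="w3424">stand-off</w>
    <w xml:id="w3425">markup</w>
  </p></body></text>
</TEI>"""

# Made-up annotation layer: it holds pointers, not text.
ANNOTATIONS = [
    {"target": CORPUS_URI + "#w3425", "pos": "NN"},
]


def index_words(xml_text: str) -> dict:
    """Map each xml:id to its <w> element."""
    root = ET.fromstring(xml_text)
    return {w.get(XML_ID): w for w in root.iter(f"{{{TEI_NS}}}w") if w.get(XML_ID)}


def resolve_pointer(pointer: str, index: dict) -> ET.Element:
    """Split the URI into document and fragment, then look up the ID."""
    document, fragment = urldefrag(pointer)
    if document != CORPUS_URI:
        raise ValueError(f"unknown primary resource: {document}")
    return index[fragment]


words = index_words(PRIMARY)
for ann in ANNOTATIONS:
    print(ann["pos"], "->", resolve_pointer(ann["target"], words).text)
```

Since the pointer is an ordinary URI plus an ID, no XPointer processor is needed to resolve it; the cost is that nothing below the word level can be addressed.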

Links

Documentation
- CES documentation
- TEI Guidelines, ch. 16.9

Tools that go some way towards creating/handling stand-off markup
- XSLT++ Project: XPointer implementation
- EDITOR (the Edition as a Digital Instrument for Text-based Open Research)
- GATE for ANC (not sure how it would handle TEI; separate plug-ins would be needed?)

Related papers
- A SANE approach to annotation in the digital edition, by Peter Boot, Jahrbuch für Computerphilologie 8 (2006)
