Textual Variance

The working group on the critical apparatus chapter is part of the TEI special interest group on manuscripts (SIG:MSS).

The “Gothenburg model”: A modular architecture for computer-aided collation
Developers of CollateX and Juxta met in 2009 in Gothenburg at a joint workshop of the EU-funded research projects COST Action 32 and Interedition. Their aim was to agree on a modular software architecture, so that these two and similar projects interested in collation software would have a common base for collaborating on the development of the tools they need. As a first result, the participants identified the following four modules/tasks, which they found to be essential to computer-aided collation. The underlying ideas may therefore need to be discussed in the context of encoding the input and output of these modules as part of, or as a preliminary stage to, a critical apparatus.

Tokenizer


While collators can compare witnesses on a character-by-character basis, in the more common use case each comparand is split into tokens/segments before collation and compared at the token level. This preprocessing step, called tokenization and performed by a tokenizer, can happen at any level of granularity, e.g. at the level of syllables, words, lines, phrases, verses, paragraphs, or text nodes in a normalized DOM.
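As a minimal sketch of tokenization at different levels of granularity, the following Python function splits a witness into characters, words or lines. The function name `tokenize`, the `granularity` parameter and the sample witness are illustrative assumptions and are not taken from CollateX, Juxta or any other collation tool.

```python
import re


def tokenize(text, granularity="word"):
    """Split a witness into tokens at the requested level of granularity.

    Hypothetical helper for illustration only.
    """
    if granularity == "character":
        return list(text)
    if granularity == "word":
        # Runs of word characters; whitespace and punctuation are dropped.
        return re.findall(r"\w+", text)
    if granularity == "line":
        # Non-empty lines, e.g. verses of a poem.
        return [line for line in text.splitlines() if line.strip()]
    raise ValueError(f"unsupported granularity: {granularity}")


witness = "The quick brown fox\njumps over the dog."
print(tokenize(witness, "word"))
print(tokenize(witness, "line"))
```

Whether punctuation is dropped, kept as separate tokens or attached to the neighbouring word is a design decision of the tokenizer; the sketch simply drops it.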

Another service provided by tokenizers, and of special value to text-oriented collators, relates to marked-up comparands. As these collators primarily compare witnesses based on their textual content, embedded markup would usually get in the way and therefore needs to be filtered out and/or “pushed into the background”, so that the collator can operate on tokens of textual content. At the same time it might be valuable to have the markup context of every token available, e.g. in case one wants to make use of it in complex token comparator functions.

The figure to the right depicts this process: Think of the upper line as a witness, its characters a, b, c, d as arbitrary tokens and e1, e2 as examples of embedded markup elements. A tokenizer would transform this marked-up text into a sequence of tokens, each referring to its respective markup/tagging context. From then on, a collator can compare this witness to others based on its tokenized content and no longer has to deal with it on a syntactic level, which is specific to a particular markup language or dialect.
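The idea of tokens that carry their markup context can be sketched in Python as follows. The `Token` class, the `tokenize_xml` function and the sample input (with elements named `e1` and `e2` after the description above) are assumptions made for illustration; the nesting of the elements in the sample is invented, and the original figure may show a different arrangement.

```python
import re
import xml.etree.ElementTree as ET
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    content: str    # textual content the collator compares
    context: tuple  # names of the enclosing elements, outermost first


def tokenize_xml(xml_text):
    """Turn a marked-up witness into whitespace-separated tokens with context."""
    root = ET.fromstring(xml_text)
    tokens = []

    def emit(text, path):
        for word in re.findall(r"\S+", text or ""):
            tokens.append(Token(word, path))

    def walk(element, path):
        inner = path + (element.tag,)
        emit(element.text, inner)
        for child in element:
            walk(child, inner)
            # Text following a child element belongs to the parent's context.
            emit(child.tail, inner)

    walk(root, ())
    return tokens


# Hypothetical witness: tokens a, b, c, d with embedded elements e1 and e2.
for token in tokenize_xml("<text>a <e1>b <e2>c</e2></e1> d</text>"):
    print(token.content, token.context)
```

A collator can then compare only the `content` fields, while a more complex comparator function may also consult `context`.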

Encoding challenges

 * Tokenization can rely on a given segmentation expressed by existing markup in the witness, but it might also introduce a new layer of markup representing its outcome, which can overlap with the existing markup hierarchy.



Aligner


After the witnesses have been tokenized, collators try to align all witnesses involved. Simply put, aligning the witnesses means finding matching tokens and inserting empty tokens (gap tokens) such that the token sequences of all witnesses line up properly. This problem is computationally similar to the problem of sequence alignment encountered in bioinformatics.
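For two witnesses, the insertion of gap tokens can be sketched with a longest-common-subsequence alignment, a simple relative of the dynamic-programming methods used for sequence alignment. This is an illustrative sketch only: the function name `align` and the `GAP` marker are assumptions, it does not reproduce the algorithm of any particular collator, and aligning more than two witnesses requires further techniques such as progressive alignment.

```python
GAP = "-"


def align(a, b):
    """Align two token sequences, inserting GAP where a token has no match."""
    n, m = len(a), len(b)
    # lcs[i][j] = length of the longest common subsequence of a[:i] and b[:j]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            if a[i - 1] == b[j - 1]:
                lcs[i][j] = lcs[i - 1][j - 1] + 1
            else:
                lcs[i][j] = max(lcs[i - 1][j], lcs[i][j - 1])

    # Trace back from the end, collecting one aligned row per step.
    rows = []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
            rows.append((a[i - 1], b[j - 1]))
            i -= 1
            j -= 1
        elif i > 0 and (j == 0 or lcs[i - 1][j] >= lcs[i][j - 1]):
            rows.append((a[i - 1], GAP))  # token only in the first witness
            i -= 1
        else:
            rows.append((GAP, b[j - 1]))  # token only in the second witness
            j -= 1
    rows.reverse()
    return rows


for row in align(["a", "b", "c", "d"], ["a", "c", "d", "b"]):
    print(*row)
```

In this sketch only identical tokens are placed in the same row; a collator that should also pair differing tokens as variants would need a scoring scheme or a token comparator function.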

Looking at an example, assume that we have three witnesses: the first comprises the token sequence (a, b, c, d), the second reads (a, c, d, b) and the third (b, c, d). A collator might align these three witnesses as depicted in tabular fashion on the right. Each witness occupies a column, matching tokens are aligned in a row, and the gap tokens inserted during the alignment process are denoted by a hyphen. Depending on the perspective from which one interprets this alignment table, one can say for example that the (b) in the second row was omitted in the second witness, or that it was added in the first and the third. A similar statement can be made about the (b) in the last row by inverting the relationship of being added/omitted.

Alignment tables like the one shown can be encoded losslessly with an existing apparatus encoding scheme, in parallel segmentation mode, as long as only the textual content of the tokens needs to be represented. Each row is represented by a segment, with empty readings for gap tokens. Optionally, consecutive rows with identical readings for each witness can be compressed into a single segment, e.g.

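 * a (first and second witness), empty reading (third witness)
 * b (first and third witness), empty reading (second witness)
 * cd (all three witnesses)
 * b (second witness), empty reading (first and third witness)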
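The encoding described above can be sketched in Python as follows, assuming that the scheme in question is the TEI parallel segmentation method with its `<app>` and `<rdg>` elements. The function names, the sigla `w1` to `w3`, the `sep` parameter and the decision to merge only rows in which all witnesses agree are assumptions made for illustration.

```python
from xml.sax.saxutils import escape


def compress(rows, sep=""):
    """Merge consecutive rows in which every witness has the same token."""
    segments = []  # list of (all_agree, readings) pairs
    for row in rows:
        agrees = None not in row and len(set(row)) == 1
        if agrees and segments and segments[-1][0]:
            previous = segments[-1][1]
            segments[-1] = (True, [p + sep + t for p, t in zip(previous, row)])
        else:
            segments.append((agrees, list(row)))
    return [readings for _, readings in segments]


def to_parallel_segmentation(rows, sigla, sep=""):
    """Serialize an alignment table; None stands for a gap token."""
    apps = []
    for readings in compress(rows, sep):
        groups = {}  # reading -> witnesses attesting it
        for siglum, reading in zip(sigla, readings):
            groups.setdefault(reading, []).append("#" + siglum)
        rdgs = []
        for reading, wits in groups.items():
            wit = " ".join(wits)
            if reading is None:
                rdgs.append(f'<rdg wit="{wit}"/>')  # empty reading for a gap
            else:
                rdgs.append(f'<rdg wit="{wit}">{escape(reading)}</rdg>')
        apps.append("<app>" + "".join(rdgs) + "</app>")
    return "\n".join(apps)


# Alignment table of the three example witnesses, one row per tuple.
table = [
    ("a", "a", None),
    ("b", None, "b"),
    ("c", "c", "c"),
    ("d", "d", "d"),
    (None, "b", None),
]
print(to_parallel_segmentation(table, ["w1", "w2", "w3"]))
```

With word tokens one would typically pass `sep=" "` so that merged segments keep their word boundaries.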

Visualization
...

Modelling textual variance

 * Multi-Version Document Format. Schmidt, D. and Colomb, R., 2009. A data structure for representing multi-version texts online. International Journal of Human-Computer Studies, 67.6, 497-514. (See the related blog and the post “What's a Multi-Version Document?”.)
 * Multi-Version Texts and Collation. Schmidt, Desmond. “Merging Multi-Version Texts: a Generic Solution to the Overlap Problem.” Presented at Balisage: The Markup Conference 2009, Montréal, Canada, August 11-14, 2009. In Proceedings of Balisage: The Markup Conference 2009. Balisage Series on Markup Technologies, vol. 3 (2009). doi:10.4242/BalisageVol3.Schmidt01.

Criticism of the current critical apparatus encoding scheme

 * Vetter, L. and McDonald, J., 2003. ‘Witnessing Dickinson’s Witnesses’. Literary and Linguistic Computing, 18(2), pp. 151-165.
 * Schmidt, D., 2010. The inadequacy of embedded markup for cultural heritage texts. Literary and Linguistic Computing, 25(3), pp. 337-356.

Computer-aided collation: Concepts and algorithms

 * Matthew Spencer, Christopher J. Howe. Collating Texts Using Progressive Multiple Alignment. Computers and the Humanities, 38 (2004), pp. 253–270.
 * Michael Stolz, Friedrich Michael Dimpel. Computergestützte Kollationierung und ihre Integration in den editorischen Arbeitsfluss. 2006.

Computer-aided collation: Software

 * Collate
 * CollateX
 * Juxta
 * NMerge