The “Gothenburg model”: A modular architecture for computer-aided collation
Developers of CollateX and Juxta met in 2009 at a joint workshop of the EU-funded research projects COST Action 32 and Interedition in Gothenburg. They wanted to agree on a modular software architecture, so that these two projects, as well as similar ones interested in collation software, would have a common base for collaborating on the development of the needed tools. As a first result, the participants identified four modules/tasks that were found to be essential to computer-aided collation: tokenization, alignment, analysis and visualization. The underlying ideas might consequently need to be discussed in the context of encoding the input and output of these modules as part of (or as a preliminary stage to) a critical apparatus.
While collators can compare witnesses on a character-by-character basis, in the more common use case each comparand is split up into tokens/segments before collation and compared on the token level. This preprocessing step, called tokenization and performed by a tokenizer, can happen on any level of granularity, e.g. on the level of syllables, words, lines, phrases, verses, paragraphs, text nodes in a normalized DOM etc.
Another service provided by tokenizers, and of special value to text-oriented collators, relates to marked-up comparands: As these collators primarily compare witnesses based on their textual content, embedded markup would usually get in the way and therefore needs to be filtered out and/or “pushed into the background”, so the collator can operate on tokens of textual content. At the same time it might be valuable to have the markup context of every token available, e.g. in case one wants to make use of it in complex token comparator functions.
The figure to the right depicts this process: Think of the upper line as a witness, its characters a, b, c, d as arbitrary tokens and e1, e2 as examples of embedded markup elements. A tokenizer would transform this marked-up text into a sequence of tokens, each referring to its respective markup/tagging context. From then on, a collator can compare this witness to others based on its tokenized content and no longer has to deal with it on a syntactic level, which is rather specific to a particular markup language or dialect.
- Tokenization can rely on a given segmentation expressed by existing markup in the witness, but it might also introduce a new layer of markup representing its outcome, which can overlap with the existing markup hierarchy.
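The tokenization step described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the tokenizer of CollateX or Juxta: the names `Token`, `MarkupTokenizer` and `tokenize` are hypothetical, tokens are taken to be whitespace-separated words, and the standard library's lenient `html.parser` stands in for a proper XML parser (it lowercases element names and does not check well-formedness).

```python
from dataclasses import dataclass
from html.parser import HTMLParser


@dataclass(frozen=True)
class Token:
    text: str                 # textual content the collator compares
    context: tuple[str, ...]  # enclosing markup elements, outermost first


class MarkupTokenizer(HTMLParser):
    """Splits character data into word tokens and records the open elements."""

    def __init__(self):
        super().__init__()
        self.open_elements = []
        self.tokens = []

    def handle_starttag(self, tag, attrs):
        self.open_elements.append(tag)

    def handle_endtag(self, tag):
        # Close the innermost matching element; ignore stray end tags.
        if tag in self.open_elements:
            while self.open_elements.pop() != tag:
                pass

    def handle_data(self, data):
        for word in data.split():
            self.tokens.append(Token(word, tuple(self.open_elements)))


def tokenize(witness: str) -> list[Token]:
    parser = MarkupTokenizer()
    parser.feed(witness)
    parser.close()  # flushes any trailing character data
    return parser.tokens


# Hypothetical witness: tokens a, b, c, d with embedded elements e1 and e2.
for token in tokenize("a <e1>b <e2>c</e2></e1> d"):
    print(token.text, token.context)
```

A collator can then compare `token.text` only, while a more complex comparator function remains free to consult `token.context`.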
After the witnesses have been tokenized, collators try to align all witnesses involved. Simply put, aligning the witnesses means in this case: Find matching tokens and insert empty tokens (gap tokens) such that the token sequences of all witnesses line up properly. Interestingly, this problem is computationally similar to the problem of sequence alignment encountered in bioinformatics.
Looking at an example, assume that we have three witnesses: the first comprises the token sequence (a, b, c, d), the second reads (a, c, d, b) and the third (b, c, d). A collator might align these three witnesses as depicted in tabular fashion on the right. Each witness occupies a column, matching tokens are aligned in a row, and the gap tokens inserted during the alignment process are denoted by a hyphen. Depending on the perspective from which one interprets this alignment table, one can say, for example, that the (b) in the second row was omitted in the second witness or that it was added in the first and the third. A similar statement can be made about the (b) in the last row by just inverting the relationship of being added/omitted.
Alignment tables like the one shown can be encoded losslessly with an existing apparatus encoding scheme, in parallel segmentation mode, as long as only the textual content of the tokens needs to be represented. Each row is represented by a segment, with empty readings for gap tokens. Optionally, consecutive rows in which the same witnesses share a reading can be compressed into a single segment, e.g.
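To make the notion of gap tokens concrete, here is a minimal sketch that aligns just two witnesses by computing a longest common subsequence. It is an assumption-laden illustration: `align_pair` and `GAP` are hypothetical names, tokens are compared by simple equality, and real collators use more elaborate algorithms to align more than two witnesses at once.

```python
GAP = None  # gap token; rendered as a hyphen when printed


def align_pair(first, second):
    """Align two token sequences, inserting GAP where one has no match."""
    n, m = len(first), len(second)
    # lcs[i][j] = length of the longest common subsequence
    # of first[i:] and second[j:]
    lcs = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n - 1, -1, -1):
        for j in range(m - 1, -1, -1):
            if first[i] == second[j]:
                lcs[i][j] = lcs[i + 1][j + 1] + 1
            else:
                lcs[i][j] = max(lcs[i + 1][j], lcs[i][j + 1])

    rows, i, j = [], 0, 0
    while i < n and j < m:
        if first[i] == second[j]:
            rows.append((first[i], second[j]))  # matching tokens share a row
            i += 1
            j += 1
        elif lcs[i + 1][j] >= lcs[i][j + 1]:
            rows.append((first[i], GAP))        # token only in the first witness
            i += 1
        else:
            rows.append((GAP, second[j]))       # token only in the second witness
            j += 1
    rows.extend((token, GAP) for token in first[i:])
    rows.extend((GAP, token) for token in second[j:])
    return rows


# The first two witnesses of the example above.
for left, right in align_pair(list("abcd"), list("acdb")):
    print(left or "-", right or "-")
```

For the first two witnesses of the example this yields the rows (a, a), (b, -), (c, c), (d, d), (-, b), i.e. the first two columns of the alignment table.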
<app>
  <rdg wit="#w1 #w2">a</rdg>
  <rdg wit="#w3" />
</app>
<app>
  <rdg wit="#w1 #w3">b</rdg>
  <rdg wit="#w2" />
</app>
<app>
  <rdg wit="#w1 #w2 #w3">cd</rdg>
</app>
<app>
  <rdg wit="#w2">b</rdg>
  <rdg wit="#w1 #w3" />
</app>
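The mapping from an alignment table to such parallel-segmentation markup can be sketched as follows. The function names `signature` and `encode_apparatus`, the `joiner` parameter and the use of `None` as gap token are assumptions made for illustration; the sketch only emits the `<app>`/`<rdg>` fragments and does not produce a complete or validated TEI document.

```python
from itertools import groupby
from xml.sax.saxutils import escape, quoteattr


def signature(row, gap=None):
    """Describe which witnesses share a reading in a row, gaps listed last."""
    groups = {}
    for index, token in enumerate(row):
        groups.setdefault(token, []).append(index)
    return tuple(sorted((token is gap, tuple(members))
                        for token, members in groups.items()))


def encode_apparatus(sigla, rows, gap=None, joiner=""):
    """Encode table rows as <app> segments; rows are tuples in sigla order."""
    segments = []
    # Consecutive rows with the same signature are compressed into one segment.
    for sig, group in groupby(rows, key=lambda row: signature(row, gap)):
        group = list(group)
        readings = []
        for is_gap, members in sig:
            wit = quoteattr(" ".join("#" + sigla[i] for i in members))
            if is_gap:
                readings.append(f"<rdg wit={wit}/>")
            else:
                text = joiner.join(row[members[0]] for row in group)
                readings.append(f"<rdg wit={wit}>{escape(text)}</rdg>")
        segments.append("<app>" + "".join(readings) + "</app>")
    return "\n".join(segments)


# The alignment table of the three example witnesses, one tuple per row.
table = [
    ("a", "a", None),
    ("b", None, "b"),
    ("c", "c", "c"),
    ("d", "d", "d"),
    (None, "b", None),
]
print(encode_apparatus(["w1", "w2", "w3"], table))
```

With word-level tokens one would presumably pass `joiner=" "` so that compressed rows keep their word boundaries.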
- In case the aligned tokens shall be embedded into the apparatus encoding, only their textual content is guaranteed to be embeddable without causing markup validation problems. Otherwise there needs to be a defined way of referencing tokens in their respective markup context, because this context cannot be replicated in the apparatus without loosening the XML schema constraints substantially.
On top of the results delivered by the alignment process, a further analysis can yield additional findings. Echoing the example from the section above, one might want to think of the token (b) in rows 2 and 5 as being transposed instead of as being added/omitted separately. Some collators try to detect transpositions as part of the alignment process, some do it as a post-processing step, and others do not handle transpositions at all and/or leave it to the user to declare them beforehand. Part of the reason for the algorithmic differences in transposition handling is that the question of which tokens are actually transposed is much more a matter of interpretation than the question of matching and aligning them. While alignment results can still be judged in terms of their quality to some extent, transposition detection can only be done heuristically, as one can easily think of cases where it is impossible for a computer “to get it right”.
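As an illustration of transposition detection as a post-processing step, the following sketch flags pairs of rows in which the same token fills complementary sets of witnesses. It is one possible heuristic among many, not the method of any particular collator; the function name `transposition_candidates` is hypothetical, rows are tuples with `None` as gap token, and row indices are zero-based.

```python
def transposition_candidates(rows, gap=None):
    """Return (token, upper_row, lower_row) triples that look like transpositions."""

    def positions(row, value):
        # Indices of the witnesses that carry `value` in this row.
        return frozenset(i for i, token in enumerate(row) if token == value)

    candidates = []
    for upper_index, upper in enumerate(rows):
        for lower_index in range(upper_index + 1, len(rows)):
            lower = rows[lower_index]
            for token in sorted(set(upper) - {gap}):
                here = positions(upper, token)
                there = positions(lower, token)
                # The token must sit in the gaps of the other row, both ways.
                if (there
                        and there <= positions(upper, gap)
                        and here <= positions(lower, gap)):
                    candidates.append((token, upper_index, lower_index))
    return candidates


# The alignment table of the three example witnesses.
table = [
    ("a", "a", None),
    ("b", None, "b"),
    ("c", "c", "c"),
    ("d", "d", "d"),
    (None, "b", None),
]
print(transposition_candidates(table))
```

For the example table the only candidate is the token b in rows 2 and 5 (indices 1 and 4). Whether such a candidate is really a transposition remains a matter of interpretation, which is why the result is best treated as a suggestion for the user to confirm.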
Apart from the specific problem of transpositions, it seems generally necessary to incorporate a step in the collation process, in which the user can examine the preliminary collation result, edit and augment it according to her knowledge and possibly feed it back into the collator for another run yielding enhanced results.
- If we model transpositions as links between tokens/segments in an alignment/apparatus, we will also need a way to encode these links. Transpositions in general were not supported in the TEI until recently, when the working group “Genetic editions” proposed an encoding scheme in the context of document-oriented markup.
- The proposed encoding scheme for transpositions should be reviewed to determine whether it is suitable as input for collation tools. An encoding scheme for user-declared alignments/transpositions would help in supporting roundtrip collation.
The last module of the Gothenburg model deals with visualizing collation results. As we are concerned with modelling and encoding textual variance properly, the question of how to visualize it is of technical importance and should not be disregarded, but is essentially out of scope with regard to this discussion.
The “variant graph”: Schmidt’s model of textual variance
In a recent paper D. Schmidt and R. Colomb proposed a data model of textual variance (or “multi-version texts” as they call it in the title), which they call a variant graph:
In this model, a collation of varying texts is expressed as a directed acyclic graph, with each version/witness corresponding to one path through the graph. The textual data is annotated on the edges, each edge carrying a (common) segment of text and a set of identifiers denoting the versions/witnesses in which the segment appears. Transpositions can be superimposed on the graph by linking the edges of transposed segments.
The tabular model described above and the given graph-based model can be converted into each other, with Schmidt’s model having the advantage that it is
- more space-efficient as it combines matching segments into a single edge instead of duplicating them per row/column,
- more natural in expressing transpositions as matching segments are linked and not pairs of tokens.
The tabular model, on the other hand, might be advantageous if one wanted to keep collation results in a relational datastore.
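The conversion from the tabular model to a graph can be sketched as below. This is a deliberately naive conversion for illustration, not Schmidt and Colomb's data structure or merging algorithm: the names `Edge`, `table_to_graph` and `read_witness` are hypothetical, nodes are simply the boundaries between rows, and gap tokens become edges with empty text.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Edge:
    source: int
    target: int
    text: str             # empty string for a gap
    witnesses: frozenset  # sigla of the versions containing this segment


def table_to_graph(sigla, rows, gap=None):
    """Turn each table row into one edge per distinct reading."""
    edges = []
    for node, row in enumerate(rows):
        readings = {}
        for siglum, token in zip(sigla, row):
            readings.setdefault(token, set()).add(siglum)
        for token, members in readings.items():
            text = "" if token is gap else token
            edges.append(Edge(node, node + 1, text, frozenset(members)))
    return edges


def read_witness(edges, siglum, start=0):
    """Follow the path labelled with `siglum` and collect its segments."""
    by_source = {e.source: e for e in edges if siglum in e.witnesses}
    node, segments = start, []
    while node in by_source:
        edge = by_source[node]
        if edge.text:
            segments.append(edge.text)
        node = edge.target
    return segments


# The alignment table of the three example witnesses.
table = [
    ("a", "a", None),
    ("b", None, "b"),
    ("c", "c", "c"),
    ("d", "d", "d"),
    (None, "b", None),
]
graph = table_to_graph(["w1", "w2", "w3"], table)
print(len(graph), read_witness(graph, "w2"))
```

Even this naive conversion needs only 8 edges for the 15 cells of the example table, because witnesses that agree share an edge. Reading a witness back out by following its siglum shows that the two models carry the same information here.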
Modelling textual variance
- Multi-Version Document Format. Schmidt, D. and Colomb, R., 2009. A data structure for representing multi-version texts online, International Journal of Human-Computer Studies, 67.6, 497-514. (See the related blog and the post “What's a Multi-Version Document?”.)
- Multi-Version Texts and Collation. Schmidt, Desmond. “Merging Multi-Version Texts: a Generic Solution to the Overlap Problem.” Presented at Balisage: The Markup Conference 2009, Montréal, Canada, August 11 - 14, 2009. In Proceedings of Balisage: The Markup Conference 2009. Balisage Series on Markup Technologies, vol. 3 (2009). doi:10.4242/BalisageVol3.Schmidt01.
Computer-aided collation: Concepts and algorithms
- Matthew Spencer, Christopher J. Howe. Collating Texts Using Progressive Multiple Alignment. Computers and the Humanities. 38 (2004). pp. 253–270.
- Michael Stolz, Friedrich Michael Dimpel. Computergestützte Kollationierung und ihre Integration in den editorischen Arbeitsfluss. 2006.