The working group on the critical apparatus chapter is part of the TEI special interest group on manuscripts (SIG:MSS).
The “Gothenburg model”: A modular architecture for computer-aided collation
Developers of CollateX and Juxta met in 2009 at a joint workshop of the EU-funded research projects COST Action 32 and Interedition in Gothenburg. They wanted to agree on a modular software architecture, so that these and similar projects interested in collation software would have a common basis for collaborating on the development of the tools they need. As a first result, the participants identified the following four modules/tasks as essential to computer-aided collation. The underlying ideas might consequently need to be discussed in the context of encoding the input and output of these modules as part of, or as a preliminary stage to, a critical apparatus.
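The following minimal sketch (in Python) is only meant to illustrate how such a modular pipeline could hand data from one stage to the next; all function names and the placeholder logic are assumptions made for illustration and are not taken from CollateX or Juxta.

<syntaxhighlight lang="python">
# Hypothetical sketch of the modular pipeline: four placeholder stages,
# each passing its result to the next. Not actual CollateX/Juxta code.

def tokenize(witness):
    # Tokenizer: split a witness into word tokens.
    return witness.split()

def align(tokenized_witnesses):
    # Aligner: naive positional pairing (real aligners handle insertions,
    # deletions and transpositions; this placeholder assumes equal length).
    return list(zip(*tokenized_witnesses))

def analyze(alignment):
    # Analyzer: flag positions where the aligned tokens differ.
    return [(tokens, len(set(tokens)) > 1) for tokens in alignment]

def visualize(analysis):
    # Visualization: render a simple plain-text comparison, marking variants.
    return "\n".join(
        ("* " if variant else "  ") + " | ".join(tokens)
        for tokens, variant in analysis
    )

witnesses = ["the black cat", "the white cat"]
print(visualize(analyze(align([tokenize(w) for w in witnesses]))))
</syntaxhighlight>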
Tokenizer
While collators can compare witnesses on a character-by-character basis, in the more common use case each comparand is split up into tokens/segments before collation and compared on the token level. This preprocessing step, called tokenization and performed by a tokenizer, can happen at any level of granularity, e.g. at the level of syllables, words, lines, phrases, verses, paragraphs, text nodes in a normalized DOM, etc.
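As a rough illustration, a word-level tokenizer could be as simple as the following sketch; the regular expression and the shape of the token records are illustrative assumptions, not the behaviour of any particular collation tool, and other granularities (lines, verses, etc.) would only change the splitting rule.

<syntaxhighlight lang="python">
import re

# Hypothetical word-level tokenizer: each token keeps its textual content
# plus its character offsets in the witness.

def tokenize_words(text):
    return [
        {"content": match.group(), "start": match.start(), "end": match.end()}
        for match in re.finditer(r"\w+", text)
    ]

print(tokenize_words("Lorem ipsum dolor"))
# [{'content': 'Lorem', 'start': 0, 'end': 5},
#  {'content': 'ipsum', 'start': 6, 'end': 11},
#  {'content': 'dolor', 'start': 12, 'end': 17}]
</syntaxhighlight>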
Another service provided by tokenizers, and of special value to text-oriented collators, relates to marked-up comparands: as these collators primarily compare witnesses based on their textual content, embedded markup would usually get in the way and therefore needs to be filtered out and/or “pushed into the background”, so that the collator can operate on tokens of textual content. At the same time it might be valuable to have the markup context of every token available, e.g. in case one wants to make use of it in complex token comparator functions.
The figure to the right depicts this process: think of the upper line as a witness, of its characters a, b, c, d as arbitrary tokens, and of e1, e2 as examples of embedded markup elements. A tokenizer would transform this marked-up text into a sequence of tokens, each referring to its respective markup/tagging context. From then on a collator can compare this witness to others based on its tokenized content and no longer has to deal with it on a syntactic level that is specific to a particular markup language or dialect.
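A hedged sketch of this idea follows, using Python's standard XML parser and invented element names mirroring the figure (wit, e1, e2): the collator only sees the textual content of each token, while the markup context stays attached and could be consulted by a token comparator function. For brevity, text trailing a child element (ElementTree “tail” text) is ignored here.

<syntaxhighlight lang="python">
import xml.etree.ElementTree as ET

# Hypothetical tokenizer that "pushes markup into the background": it emits
# content tokens, each pointing back to the element it occurred in.

def tokenize_with_context(xml_witness):
    tokens = []
    root = ET.fromstring(xml_witness)
    for element in root.iter():
        for word in (element.text or "").split():
            tokens.append({"content": word, "context": element.tag})
    return tokens

witness = "<wit>a <e1>b c</e1> <e2>d</e2></wit>"
print(tokenize_with_context(witness))
# [{'content': 'a', 'context': 'wit'}, {'content': 'b', 'context': 'e1'},
#  {'content': 'c', 'context': 'e1'}, {'content': 'd', 'context': 'e2'}]
</syntaxhighlight>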
Encoding challenges
- Tokenization can rely on a given segmentation expressed by existing markup in the witness, but it might also introduce a new layer of markup representing its outcome, which can overlap with the existing markup hierarchy (see the sketch below).
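The small sketch below shows one illustrative assumption (not a TEI recommendation) of how that outcome might be represented without nesting new elements into the existing hierarchy: the tokenization is recorded as standoff annotations that reference character offsets in the witness text, so token boundaries may overlap embedded markup without breaking well-formedness.

<syntaxhighlight lang="python">
# Standoff sketch: both the existing markup and the new token layer are
# expressed as character ranges over the same extracted text.

text = "a b c d"                            # textual content of the witness
existing_markup = [("e1", 2, 5)]            # element name plus start/end offsets ("b c")
tokens = [(0, 1), (2, 3), (4, 5), (6, 7)]   # token boundaries produced by the tokenizer

for start, end in tokens:
    overlapping = [name for name, s, e in existing_markup if s < end and start < e]
    print(text[start:end], "overlaps:", overlapping or "nothing")
</syntaxhighlight>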
Aligner
Analyzer
Visualization
...
Resources / Bibliography
Modelling textual variance
- Multi-Version Document Format. Schmidt, D. and Colomb, R., 2009. A data structure for representing multi-version texts online. International Journal of Human-Computer Studies, 67.6, 497–514. (See the related blog and the post “What's a Multi-Version Document?”.)
- Multi-Version Texts and Collation. Schmidt, Desmond. “Merging Multi-Version Texts: a Generic Solution to the Overlap Problem.” Presented at Balisage: The Markup Conference 2009, Montréal, Canada, August 11–14, 2009. In Proceedings of Balisage: The Markup Conference 2009. Balisage Series on Markup Technologies, vol. 3 (2009). doi:10.4242/BalisageVol3.Schmidt01.
Criticism of the current critical apparatus encoding scheme
- Vetter, L. and McDonald, J. ‘Witnessing Dickinson’s Witnesses’, Literary and Linguistic Computing, 18.2: 2003, 151-165.
- Schmidt, D., 2010. The inadequacy of embedded markup for cultural heritage texts. Literary and Linguistic Computing, 25(3), pp. 337-356.
Computer-aided collation: Concepts and algorithms
- Matthew Spencer, Christopher J. Howe. Collating Texts Using Progressive Multiple Alignment. Computers and the Humanities, 38 (2004), pp. 253–270.
- Michael Stolz, Friedrich Michael Dimpel. Computergestützte Kollationierung und ihre Integration in den editorischen Arbeitsfluss. 2006.