Textual Variance


The working group on the critical apparatus chapter is part of the TEI special interest group on manuscripts (SIG:MSS).

The “Gothenburg model”: A modular architecture for computer-aided collation tools

Developers of CollateX and Juxta met in 2009 at a joint workshop of the EU-funded research projects COST Action 32 and Interedition in Gothenburg. They wanted to agree on a modular software architecture, so that these two and similar projects interested in collation software would have a common basis for collaborating on the development of the needed tools. As a first result the participants identified the following four modules/tasks, which they considered essential to computer-aided collation. The underlying ideas may consequently need to be discussed in the context of encoding the input and output of these modules as part of – or as a pre-stage to – a critical apparatus.

Tokenizer

A tokenized text

While collators can compare witnesses on a character-by-character basis, in the more common use case each comparand is split up into tokens/segments before collation and compared on the token level. This preprocessing step, called tokenization and performed by a tokenizer, can happen at any level of granularity, e.g. at the level of syllables, words, lines, phrases, verses, paragraphs, text nodes in a normalized DOM, etc.
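
As an illustration, the following sketch shows word-level tokenization of a plain-text witness in Python; the regular expression and the Token record are illustrative assumptions, not the API of CollateX, Juxta, or any other collation tool.

import re
from dataclasses import dataclass

@dataclass
class Token:
    content: str  # textual content of the token
    start: int    # character offset where the token begins in the witness
    end: int      # character offset just past the token

def tokenize(witness: str) -> list[Token]:
    """Split a witness into word tokens, keeping their offsets."""
    return [Token(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+", witness)]

# Two short witnesses, tokenized before being handed to an aligner.
print(tokenize("the quick brown fox"))
print(tokenize("the quik brown foxes"))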

Another service provided by tokenizers, and one of special value to text-oriented collators, relates to marked-up comparands: as these collators primarily compare witnesses based on their textual content, embedded markup would usually get in the way and therefore needs to be filtered out and/or “pushed into the background”, so that the collator can operate on tokens of textual content. At the same time it may be valuable to have the markup context of every token available, e.g. in case one wants to make use of it in complex token comparator functions.

The figure to the right depicts this process: think of the upper line as a witness, its characters a, b, c, d as arbitrary tokens, and e1, e2 as examples of embedded markup elements. A tokenizer would transform this marked-up text into a sequence of tokens, each referring to its respective markup/tagging context. From then on, a collator can compare this witness to others based on its tokenized content and no longer has to deal with it on a syntactic level that is specific to a particular markup language or dialect.
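
The following sketch, again only an assumption about how such a tokenizer might work, mirrors the figure: the element names e1/e2 and the single-character tokens a–d are purely illustrative, and each token is paired with the stack of elements enclosing it.

import xml.etree.ElementTree as ET

def tokenize_with_context(element, context=()):
    """Yield (token, markup context) pairs for the non-space characters of an element tree."""
    context = context + (element.tag,)
    for char in (element.text or ""):
        if not char.isspace():
            yield char, context
    for child in element:
        yield from tokenize_with_context(child, context)
        # text following a child element (its "tail") belongs to the parent's context
        for char in (child.tail or ""):
            if not char.isspace():
                yield char, context

witness = ET.fromstring("<w>a<e1>b</e1>c<e2>d</e2></w>")
for token, context in tokenize_with_context(witness):
    print(token, "/".join(context))
# prints:
# a w
# b w/e1
# c w
# d w/e2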

Encoding challenges


Aligner

An alignment table

Analyzer

An analyzed alignment table

Visualization

...

Resources/Bibliography

Modelling textual variance

Criticism of the current critical apparatus encoding scheme

Computer-aided collation: Concepts and algorithms

Computer-aided collation: Software