Critical Apparatus Workgroup

The Critical Apparatus workgroup is part of the TEI special interest group on manuscripts (SIG:MSS).

Participants in the preliminary workgroup:


 * Marjorie Burghart (MB)
 * James Cummings (JC)
 * Fotis Jannidis (FJ)
 * Gregor Middell (GM)
 * Dan O'Donnell (DOD)
 * Espen Ore (EO)
 * Elena Pierazzo (EP)
 * Roberto Rosselli del Turco (RDT)
 * Chris Wittern (CW)

“Critical Apparatus” vs. “Textual Variance” vs. “Textual Variants”
The very name of the chapter, "Critical apparatus", is felt by some to be a problem: the critical apparatus is simply inherited from the printed world and is only one possible physical embodiment of textual variance. EP therefore proposes to rename the chapter, moving from "critical apparatus" to "textual variance".

MB argues that, oddly, "textual variance" feels more restrictive to her than "critical apparatus": it is a notion linked with Cerquiglini's work, which does not correspond to every branch of textual criticism. On the other hand, strictly speaking, the "critical apparatus" is not limited to registering the variants of the several witnesses of a text. It also includes various kinds of notes (identification of the sources of the text, historical notes, etc.). Even texts with a single witness may have a critical apparatus. Maybe the problem with the name has its origins in the choice of giving the name "critical apparatus" to a part of the guidelines dedicated solely to the registration of textual variants.

FJ argues that for German ears the concept of textual variance is not closely connected to a specific scholar.

MB proposes to use textual variants instead, since it focuses more on actual elements in the edition, when "variance" is nothing concrete but a phenomenon.

Side remarks by MB: this vocabulary question might prove sticky in the end. The <app> element is so named because it is considered "an apparatus entry", so unless we end up recommending a change to the element names, the phrase "critical apparatus" will still be used in the module, at least to explain the tag names?

RDT argues that while backward compatibility is clearly a bonus, <app>, as MB states, stands for 'apparatus entry': we shouldn't be afraid to change its function, for instance making it a container instead of a phrase-level element. RDT stresses that he is proposing this by way of example, and to stress that our focus is on variants: these might then be organised in <app>s for traditional critical apparatus display, and/or in other, new ways for electronic display. Note that this might mean no traditional critical apparatus in a digital edition.

MB: It is characteristic of a print-based approach to encoding that the <app> element was considered as encoding an apparatus entry (hence the name), when what it really encodes is a locus where different witnesses have variant readings (which would probably have justified a different name along those lines).

JC: This points to a slightly divergent nature at the heart of the current critical apparatus recommendations: encoding an apparatus at the site of textual variance, versus encoding a structured view of a note entirely separate from the edited version of the text. (In the mass digitization of critical editions, for example, one might have an <app> in a set of notes at the bottom of the page which is not encoded at the site of variance, or indeed necessarily connected with it.) The chapter strives both to encode all sorts of legacy forms of apparatus and, simultaneously, to cater for those who are recording the structure from which they will generate an apparatus in some output. So JC would argue that the first of these is an apparatus and the second is a site/locus of textual variance.

Issues
Preliminary notice: most of the issues raised here are connected with the parallel segmentation method, not because it is the most flawed, but because it is the most used by the members of this group. While location-referenced and double-end-point-attachment may be useful for mass conversion of printed material (the former) and/or when using software to handle the encoding (the latter), the parallel segmentation method seems to be the easiest and most powerful way to encode a critical apparatus "by hand".

Also, one might point out that most of the issues raised here could be solved with standoff encoding. But standoff encoding is extremely cumbersome to handle without the aid of software, and it does not correspond to the way most people work.

Transpositions
In a nutshell: with the parallel segmentation method, it is often cumbersome to render transpositions.

Additionally, it is not possible to mark them up explicitly. Juxta, for example, works around this by storing transposition data in a custom XML format:

    

This format is not TEI-conformant, and its offset/range-based addressing (@start1/@start2 and @end1/@end2) is not proper XML markup either. A standardized encoding would be helpful.
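The underlying data model, two offset ranges judged to carry the same moved text, can be sketched as follows (Python; the class names are illustrative only and do not reflect Juxta's actual format):

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Range:
    """A half-open character range [start, end) in one witness."""
    start: int
    end: int

@dataclass(frozen=True)
class Transposition:
    """Two ranges, one per witness, judged to carry the same (moved) text."""
    witness_a: str
    range_a: Range
    witness_b: str
    range_b: Range

# Witness 1: "The cat ate the food quickly."
# Witness 2: "Quickly, the cat ate the food."
# "quickly" (characters 21-28 of witness 1) reappears at the front
# of witness 2 (characters 0-7).
t = Transposition("witness_1", Range(21, 28), "witness_2", Range(0, 7))
```

A standardized TEI encoding would have to express exactly this pairing of spans across witnesses.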

See also:


 * http://tei.markmail.org/thread/fuszgtpnn2ywf6bh

Handling of punctuation
This seems to be a common problem in textual criticism and apparatus creation, but it lacks guidelines and encoding examples:


 * http://tei.markmail.org/thread/es6byhxpsbgkrxzo

Representing omissions in an apparatus
What's the proper way to represent missing lines, paragraphs, or verses?


 * http://tei.markmail.org/thread/parztmwmlx5mqsof
 * http://tei.markmail.org/thread/4sheu6nji3dvnf64

Inclusion of structural markup in the apparatus
In a nutshell: the <app> element is phrase-level, when it really should be allowed to include paragraphs, and even <div>s.

Use case:

I'm encoding a 19th-century edition of a medieval text, and one of the witnesses has omissions of several paragraphs. Of course, the TEI schema won't let me put <p> elements inside an <app>/<rdg> element...

 * I use the parallel segmentation method.
 * It is important to me to keep a methodical link between the encoded apparatus and the note numbers in the original edition (the @n of each <app> tag bears the number of the footnote in the original edition).

Here is the scan of a page from this edition; please consider footnote number 9. The note reads: "9. Eodem anno, rex Francie… dampnificati, paragraphes omis par Bal." ("paragraphs omitted by Bal."), meaning that the Bal. witness has an omission where other witnesses have two long paragraphs, the first beginning on the previous page (see the previous page scanned).


 * http://tei.markmail.org/thread/tbzi2yj5xd4dto34

More use cases from TEI-L:


 * http://tei.markmail.org/thread/jyezaqfycaldtdcv
 * http://tei.markmail.org/thread/fbyuxyabbxq4rwbr
 * http://tei.markmail.org/thread/vrwkl7kkruulyjzh
 * http://tei.markmail.org/thread/x5agpwzn4hiwwwcx

Encoding variants in structural markup

 * http://tei.markmail.org/thread/ap62n37uf6rbfds4
 * http://tei.markmail.org/thread/hbmnsn3v4aqjabt3

Conflicts between individual readings and the semantics of structural markup that surrounds it
In a nutshell: with the parallel segmentation method, witnesses with different forms of lineation pose a problem.

Scalability
In a nutshell: the parallel segmentation method is difficult to handle when adding hundreds of conflicting witnesses.

Scaling is generally a problem with any method of indicating textual variance, but in parallel segmentation it is exacerbated: as the number of witnesses increases, so does the likelihood of needing to reformulate the reading boundaries, never mind the difficulty of reading or understanding such encodings. This may be a problem not only when looking at a single text with many witnesses, where variation in structure may be extremely difficult to represent where conflicts disrupt this very basic structure (for example, imagine a set of witnesses where some have lines in line groups, some just lines, some paragraphs, some paragraphs in divisions, but all with the same underlying text), but also where parallel segmentation is being used to record divergent interpretations of these individual witnesses by many editors (for distributed co-operative editions generated from many editorial views of a text).

Refactoring
In a nutshell: with the parallel segmentation method, it is cumbersome to add a new reading that necessitates redrawing the borders of existing readings.

Complexity
Manually crafting an apparatus is error-prone:


 * http://tei.markmail.org/thread/yuxqotf5aynxznq5

Feasibility of double-endpoint-attached method

 * http://tei.markmail.org/thread/fsj7gvojds4mwcm5
 * http://tei.markmail.org/thread/flwcnf4fxm4u7ebj

Showing a lemma different from the content of the <lem> or chosen reading in an apparatus note
In a nutshell: depending on the desired output of your digital edition, you may need to show in the apparatus entry a lemma text different from the content of the <lem> or chosen reading. This is typically the case for long omissions, when one does not display the full text omitted by one or more witnesses, but only the beginning and end of the omitted span of text.

Use case: Let's consider again the example used in a previous use case. Here is the scan of a page from this edition; please consider footnote number 9. The note reads: "9. Eodem anno, rex Francie… dampnificati, paragraphes omis par Bal.", meaning that the Bal. witness has an omission where other witnesses have two long paragraphs, the first beginning on the previous page (see the previous page scanned). You certainly do not want to generate a footnote containing these two full paragraphs just to tell the reader that one witness omits them, but on the other hand you do want to be able to represent the source according to its various witnesses, so the location-referenced method is not an option.

Representing "verbose" apparatus
In a nutshell: you may want to represent an apparatus entry written in a rather verbose way (in a print-to-digital edition). The same is true if you want to be able to generate a verbose apparatus note in a born-digital edition.

Use cases: You're encoding an existing edition and want to represent the source it edits, while keeping intact the text and apparatus of the existing edition. Some apparatus entries are easy to represent with the <app>/<lem>/<rdg> elements; others add editorial comments to the listing of the variants and are quite difficult to represent. By the way, the same goes when you are encoding a born-digital edition for which you want to be able to generate an alternative print output corresponding to the traditional standards of a collection.

A - When I have a footnote giving two lectiones from the same manuscript, one before correction and the other after:

Text: ad lectorem Venetum (b). Note: b) ms., lectionem venerum corrigé postérieurement en lectorem Venetum ("later corrected to lectorem Venetum").

If I encode it like this, with two separate <rdg> elements for the same witness, each with a different @type (for instance, "anteCorr" and "postCorr"), it gives an accurate account of the state of the witness, BUT it is an interpretation of the original note in the critical apparatus, i.e. if I do this I delete some text added by the original editor.

&lt;app n="b"&gt; &lt;lem&gt;lectorem Venetum&lt;/lem&gt; &lt;rdg wit="#ms.2" type="anteCorr"&gt;lectionem venerum&lt;/rdg&gt; &lt;rdg wit="#ms.2" type="postCorr"&gt;lectorem Venetum&lt;/rdg&gt;

&lt;/app&gt;

B - Let's consider this other note, where some text is added verbosely within the apparatus note by the editor:

Text: Hiis diebus civitas Pergamensis (b) tenebat exersitum. Note: b) se, mis indûment avant tenebat par le ms. ("se, wrongly placed before tenebat by the ms.").

Should I encode it as:

... Pergamensis
<app n="b">
  <lem/>
  <rdg type="addition" wit="#ms"><sic>se</sic></rdg>
</app>
...

If one represents this note strictly with <app>/<rdg>, it leads to suppressing remarks by the original editor. Adding a <note> in the <rdg> to preserve the editor's comments could work here, but that is not always the case. Like:

... Pergamensis
<app n="b">
  <lem/>
  <rdg type="addition" wit="#ms"><sic>se</sic>
    <note><hi rend="italics">mis indûment avant</hi> tenebat.</note>
  </rdg>
</app>

Text: …reliqui demum meos socios (d). Note: d) domum meam solito, Bal.; dni ou dm, ms.; en note meam solita.

Here we have two witnesses (Bal. and ms.), the latter with (a) an uncertain lectio ("dni" or "dm") and (b) a part of the lectio written as a note ("meam solita"). This is tricky to encode.

See also:


 * http://tei.markmail.org/thread/ib3bsrpirepp4ibc
 * http://tei.markmail.org/thread/diubpw5adw6ntcas

Representation of suggestions by the editor: lege, dele, etc.
In a nutshell: sometimes the editor provides working suggestions through apparatus notes such as lege(ndum) ("read"), dele(ndum) ("delete"), etc. They do not belong among the textual variants per se, and are not attached to witnesses, although they do belong in the critical apparatus.


 * http://tei.markmail.org/thread/vfw25psb5vgdiftw

Collations of differing granularity

 * http://tei.markmail.org/thread/bonflsyb2d3ebtp2
 * http://tei.markmail.org/thread/gqyymzd4a4xvhch7

An encoding proposal from the perspective of computer-aided collation tools
Gregor Middell gave the workgroup an overview of textual variance from a software developer's perspective on a separate page. The models described there are used in tools like CollateX, Juxta and nmerge.

Collecting ideas from the mailing list by James Cummings, Dan O'Donnell and Marjorie Burghart, as well as following the "Gothenburg model" of textual variance, a first take at separating the model from the representation of textual variance could be structured as follows.

Modelling input data: Make the units of a collation addressable in the witnesses
The Gothenburg model assumes a preprocessing step by which the witnesses get split up into tokens of the desired granularity. This granularity becomes the minimal unit of collation and can be defined as pages, paragraphs, verses, lines, words, characters or any other unit that makes sense in the context of the particular tradition under investigation. To model collation results on top of tokenized witnesses, those tokens have to be addressable.
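As an illustration of this preprocessing step, a minimal word-level tokenizer might look as follows (Python; the `tokenize` helper is a hypothetical sketch — real collation tools such as CollateX perform this internally):

```python
import re

def tokenize(witness_text, pattern=r"\w+"):
    """Split a witness into word-level tokens, keeping each token's
    character offsets so it stays addressable in the source text."""
    return [
        {"n": i + 1, "text": m.group(), "start": m.start(), "end": m.end()}
        for i, m in enumerate(re.finditer(pattern, witness_text))
    ]

tokens = tokenize("The cat ate the food quickly.")
# Each token records its position, so a pointer such as
# witness_1#xpath1(/p[1]/w[2]) can be mapped to tokens[1].
```

Swapping the pattern changes the granularity (e.g. `r"."` for characters), which is exactly the choice the Gothenburg model leaves to the project.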

The TEI defines an array of pointing mechanisms, which can be used to address anything from a whole XML document via URIs down to arbitrary content of those documents via sophisticated XPointer schemes. Projects would be free to choose among those mechanisms as long as each token is made available for later reference.

Examples:

<p xml:base="http://edition.org/witness_1">
  <w>The</w> <w>cat</w> <w>ate</w> <w>the</w> <w>food</w> <w>quickly</w>.
</p>

<p xml:base="http://edition.org/witness_2">
  <w>Quickly</w>, <w>the</w> <w>cat</w> <w>ate</w> <w>the</w> <w>food</w>.
</p>

Here tokens on the word-level could be addressed via the xpath1 XPointer scheme:


 * 1) http://edition.org/witness_1#xpath1(/p[1]/w[1])
 * 2) http://edition.org/witness_1#xpath1(/p[1]/w[2])
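For illustration, such a pointer could be resolved mechanically. The following Python sketch supports only the positional-step subset used in the examples above; the `resolve` helper is hypothetical, not part of any XPointer library:

```python
import re
import xml.etree.ElementTree as ET

WITNESS_1 = "<p><w>The</w> <w>cat</w> <w>ate</w> <w>the</w> <w>food</w> <w>quickly</w>.</p>"

def resolve(document_xml, fragment):
    """Resolve a pointer like 'xpath1(/p[1]/w[2])' against a witness.

    Only absolute paths of positional steps (name[index]) are handled."""
    path = re.fullmatch(r"xpath1\((.+)\)", fragment).group(1)
    steps = [s for s in path.split("/") if s]
    node = ET.fromstring(document_xml)
    # The first step addresses the root element itself.
    name, pos = re.fullmatch(r"(\w+)\[(\d+)\]", steps[0]).groups()
    assert node.tag == name and pos == "1"
    for step in steps[1:]:
        name, pos = re.fullmatch(r"(\w+)\[(\d+)\]", step).groups()
        node = node.findall(name)[int(pos) - 1]
    return node.text

# resolve(WITNESS_1, "xpath1(/p[1]/w[2])") yields the second word token.
```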

A less verbose scheme would rely on each container element of a token being identified via a (possibly autogenerated) xml:id attribute, as in the following verse-level tokenization.

<lg xml:base="urn:goethe:faust2">
  <l xml:id="l_1">Die Sonne sinkt, die letzten Schiffe</l>
  <l xml:id="l_2">Sie ziehen munter hafenein.</l>
  <l xml:id="l_3">Ein großer Kahn ist im Begriffe</l>
  <l xml:id="l_4">Auf dem Canale hier zu sein.</l>
</lg>


 * 1) urn:goethe:faust2#l_1
 * 2) urn:goethe:faust2#l_2

One can even think of reference schemes that are as independent of existing markup as possible. By introducing <anchor/> milestone elements at token boundaries and using the range XPointer scheme, the tokenization of arbitrary TEI documents can be accomplished, because <anchor/> is part of model.global.

Modelling collated data: Encode the alignment/linking between tokens
After tokens in the different witnesses have been made addressable, collation data can be modelled on top of them as alignments of tokens. An alignment can be expressed as a set of tokens from different witnesses or, in accordance with the corresponding Guidelines chapter, as a link between two or more tokens.

Taking the first example from above, a collation of the two given witnesses could be expressed as

<linkGrp type="collation" xml:base="http://edition.org/">
  <link target="witness_1#xpath1(/p[1]/w[1]) witness_2#xpath1(/p[1]/w[2])" />
  <link target="witness_1#xpath1(/p[1]/w[2]) witness_2#xpath1(/p[1]/w[3])" />
  <link target="witness_1#xpath1(/p[1]/w[3]) witness_2#xpath1(/p[1]/w[4])" />
  <link target="witness_1#xpath1(/p[1]/w[4]) witness_2#xpath1(/p[1]/w[5])" />
  <link target="witness_1#xpath1(/p[1]/w[5]) witness_2#xpath1(/p[1]/w[6])" />
  <link target="witness_1#xpath1(/p[1]/w[6]) witness_2#xpath1(/p[1]/w[1])" type="transposition" />
</linkGrp>

Each link in this example corresponds to a row in an alignment table as depicted in the Gothenburg model description. Omitted or added tokens are expressed implicitly by not being linked to tokens in other witnesses; that is to say, whether a set of tokens has been added to a witness or omitted from it is a matter of interpreting the collation data expressed above from the perspective of one witness or another, with regard to the way this witness aligns with the others.
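This interpretive step can be sketched in Python. Given the link set, additions and omissions fall out as unlinked tokens, and transpositions as links whose targets break the order of the alignment; token addresses are simplified here to 1-based word indices, mirroring (not reproducing) the XML above, and the helper names are illustrative:

```python
# Each link pairs a token in witness_1 with a token in witness_2,
# addressed simply by word index (1-based, as in the XML example).
links = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1)]

def unlinked(links, token_count, side):
    """Tokens of one witness (side 0 or 1) touched by no link: additions
    from this witness's perspective, omissions from the other's."""
    linked = {pair[side] for pair in links}
    return [i for i in range(1, token_count + 1) if i not in linked]

def transposed(links):
    """Links that break the monotonic order of the alignment."""
    in_order = sorted(links)
    return [pair for prev, pair in zip([(0, 0)] + in_order, in_order)
            if pair[1] <= prev[1]]

# With six tokens on each side, nothing is added or omitted here,
# and the (6, 1) link stands out as the transposition.
```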

One advantage of encoding collation data in such a set-oriented way is its scalability:


 * 1) Gradually adding witnesses to the collation may amount to adding alignments to the existing ones or modifying/augmenting the latter, depending on whether the collation is done pairwise (e. g. in relation to a base text) or via multiple alignment (e. g. without a prechosen base).
 * 2) Guiding a collation tool in producing ever more precise alignments in consecutive runs can be achieved by declaring alignments (for example transpositions), feeding those into the collator, adjusting the resulting alignment set, feeding it back into the collator for another run, and so forth. Being able to encode the initial/preliminary results of such an iterative process in a standardized way makes it possible to run different collation tools on the same text tradition, ideally with each able to make use of former results by other tools and to contribute to the overall result.

The major disadvantage of encoding collation data this way is its apparent lack of human readability: it is hardly possible to edit by hand, especially as the collated text tradition grows larger. This problem can only be solved via tool support.

Encoding the interpretation/ representation: Derive an apparatus from the collation
A TEI-encoded critical apparatus is one possible rendition of collation data, possibly enhanced with information yielded by interpreting the alignments. There are a couple of ways in which the above collation could be encoded as an apparatus.

Apparatus pointing to the collated tokens (for easier post-processing)
<app>
  <rdg wit="http://edition.org/witness_1" xml:base="http://edition.org/witness_1" />
  <rdg wit="http://edition.org/witness_2" xml:base="http://edition.org/witness_2" xml:id="w2_1">
    <ptr target="#xpath1(/p[1]/w[1])" />
  </rdg>
</app>
<app>
  <rdg wit="http://edition.org/witness_1" xml:base="http://edition.org/witness_1">
    <ptr target="#xpath1(/p[1]/w[1])" />
    <ptr target="#xpath1(/p[1]/w[2])" />
    <ptr target="#xpath1(/p[1]/w[3])" />
    <ptr target="#xpath1(/p[1]/w[4])" />
    <ptr target="#xpath1(/p[1]/w[5])" />
  </rdg>
  <rdg wit="http://edition.org/witness_2" xml:base="http://edition.org/witness_2">
    <ptr target="#xpath1(/p[1]/w[2])" />
    <ptr target="#xpath1(/p[1]/w[3])" />
    <ptr target="#xpath1(/p[1]/w[4])" />
    <ptr target="#xpath1(/p[1]/w[5])" />
    <ptr target="#xpath1(/p[1]/w[6])" />
  </rdg>
</app>
<app>
  <rdg wit="http://edition.org/witness_1" xml:base="http://edition.org/witness_1" corresp="#w2_1">
    <ptr target="#xpath1(/p[1]/w[6])" />
  </rdg>
  <rdg wit="http://edition.org/witness_2" xml:base="http://edition.org/witness_2" />
</app>

Apparatus with embedded textual content (for readability)
<app>
  <rdg wit="http://edition.org/witness_1" />
  <rdg wit="http://edition.org/witness_2" xml:id="w2_1">Quickly, </rdg>
</app>
<app>
  <rdg wit="http://edition.org/witness_1">The cat ate the food </rdg>
  <rdg wit="http://edition.org/witness_2">the cat ate the food.</rdg>
</app>
<app>
  <rdg wit="http://edition.org/witness_1" corresp="#w2_1">quickly.</rdg>
  <rdg wit="http://edition.org/witness_2" />
</app>

Some problems here:


 * @corresp vs. <link> for transpositions over more than two witnesses
 * How to derive the segment content from the original witness automatically, if the token content does not add up to it (e. g. because punctuation was excluded from the tokens from the start)?
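One conceivable answer to the second question: if each token records its character offsets in the witness source, a segment's text, punctuation included, can be recovered by slicing the witness between token boundaries. A hedged Python sketch (helper names are illustrative):

```python
import re

def token_spans(witness_text):
    """Word tokens with their character offsets in the witness."""
    return [(m.group(), m.start(), m.end())
            for m in re.finditer(r"\w+", witness_text)]

def segment_text(witness_text, spans, first, last):
    """Recover a reading's text, intervening punctuation included, by
    slicing the witness from the first token's start up to the start of
    the token following the last one (or to the end of the witness)."""
    start = spans[first][1]
    end = spans[last + 1][1] if last + 1 < len(spans) else len(witness_text)
    return witness_text[start:end]

w2 = "Quickly, the cat ate the food."
spans = token_spans(w2)
# The first segment of witness_2 covers token 0 only, but the slice
# brings the comma and following space along with it.
```

This only works if the offsets are preserved through collation, which is an argument for keeping tokens addressable in the witnesses rather than copying their content into the apparatus.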