Critical Apparatus Workgroup

The Critical Apparatus workgroup is part of the TEI Special Interest Group on Manuscripts (SIG:MSS).

Participants in the preliminary workgroup:

Marjorie Burghart (MB)
James Cummings (JC)
Fotis Jannidis (FJ)
Gregor Middell (GM)
Dan O'Donnell (DOD)
Matija Ogrin (MO)
Espen Ore (EO)
Elena Pierazzo (EP)
Roberto Rosselli del Turco (RDT)
Chris Wittern (CW)

Comments below from others:

Pascale Sutter (PS)
Stuart Yeates (SAY)

“Critical Apparatus” vs. “Textual Variance” vs. “Textual Variants”

The very name of the chapter, "Critical apparatus", is felt by some to be a problem: the critical apparatus is inherited from the printed world and is just one of the possible physical embodiments of textual variance. EP therefore proposes renaming the chapter from "critical apparatus" to "textual variance".

MB argues that, oddly, "textual variance" feels more restrictive to her than "critical apparatus": it is a notion linked with Cerquiglini's work, which does not correspond to every branch of textual criticism. On the other hand, strictly speaking, the "critical apparatus" is not limited to registering the variants of the several witnesses of a text. It also includes various kinds of notes (identification of the sources of the text, historical notes, etc.). Even texts with a single witness may have a critical apparatus. Maybe the problem with the name has its origins in the choice of giving the name "critical apparatus" to a part of the guidelines dedicated solely to the registration of textual variants.

FJ argues that for German ears the concept of textual variance is not closely connected to a specific scholar.

MB proposes to use textual variants instead, since it focuses more on actual elements in the edition, whereas "variance" is not something concrete but a phenomenon.

Side remark by MB: this vocabulary question might prove sticky in the end. The <app> element is named <app> because it is considered "an apparatus entry", so unless we end up recommending a change to the element names, the phrase "critical apparatus" will still be used in the module, at least to explain the tag names?

RDT argues that backward compatibility is clearly a bonus and that, as MB states, <app> stands for 'apparatus entry'; even so, we shouldn't be afraid to change its function, for instance by making it a container instead of a phrase-level element. RDT stresses that he proposes this by way of example, and to underline that our focus is on variants: these might then be organised in <app>s for traditional CA display, and/or in other, new ways for electronic display. Note that this might mean no traditional critical apparatus in a digital edition.

MB: It is characteristic of a print-based approach to encoding that the <app> element was considered as encoding an apparatus entry (hence the <app> name), when what it really encodes is a locus where different witnesses have variant readings (which would probably have justified a name along the lines of <locus> or whatnot).

JC: Thinks this points to a slightly divergent nature at the heart of the current critical apparatus recommendations: on the one hand, encoding an apparatus at the site of textual variance; on the other, encoding a structured view of a note entirely separate from the edited version of the text. (In mass digitization of critical editions, for example, one might have an <app> in a set of notes at the bottom of the page which are not encoded at the site of variance, or indeed necessarily connected with it.) The recommendations strive both to encode all sorts of legacy forms of apparatus and, simultaneously, to cater for those who are recording the structure from which they will generate an apparatus in some output. So JC would argue that the first of these is an apparatus and the second is a site/locus of textual variance.

SAY: Prefers either textual variants or textual variance over Critical Apparatus simply because I believe their meaning is clearer to a larger proportion of English speaking people. Clear meanings help us inter-operate with other groups and standards by making our standard easier to read by third parties.

MO argues that we should retain Critical Apparatus because it is a core philological term. In every branch of the Humanities, terminology is among the core elements most resistant to change, because old terms are deeply rooted in the practice and in the epistemological system of the disciplines. The most common way terminology changes in the Humanities is that old terms gradually take on new layers of meaning, that is, new semantics. Think, e.g., of 'digital edition': the term 'edition' absorbed new meanings, and continues to be a good, functional term. This also applies to the dilemma of 'Critical Apparatus' vs. 'Textual Variants'. In the TEI community, we should use the same fundamental terminology as is used in other, more traditional areas of philology, and show the advantage of our conception of the same term.

Issues

Preliminary notice: most of the issues raised here are connected with the parallel segmentation method, not because it is the most flawed, but because it is the one most used by the members of this group. While location-referenced and double-end-point-attachment might be useful for mass conversion of printed material (for the former) and/or when using a piece of software handling the encoding (for the latter), the parallel segmentation method seems to be the easiest and most powerful way to encode the critical apparatus "by hand".

Also, one might point out that most of the issues raised here might be solved with standoff encoding. But this is extremely cumbersome to handle without the aid of software, and it does not correspond to the way most people work.

Specific phenomena

Transpositions

In a nutshell: with the parallel segmentation method, it is often cumbersome to render transpositions.

Additionally it is not possible to mark them up explicitly. Juxta for example works around that by storing transposition data in a custom XML format:

<moves>
        <move doc1="1855 MS" space1="original" start1="9679" end1="10462" doc2="1881 1st Ed." space2="original" start2="9872" end2="10467" />
        <move doc1="1855 MS" space1="original" start1="9679" end1="10483" doc2="1870 2nd Ed." space2="original" start2="7781" end2="8376" />
        <move doc1="1855 MS" space1="original" start1="9679" end1="10504" doc2="1870 Proof" space2="original" start2="8458" end2="9056" />
        <move doc1="1855 MS" space1="original" start1="9886" end1="10525" doc2="1870 1st Ed." space2="original" start2="8546" end2="9141" />
        <move doc1="1870 Proof" space1="original" start1="1640" end1="1850" doc2="1881 1st Ed." space2="original" start2="2961" end2="3070" />
</moves>

This format is not TEI-compliant, and its offset/range-based addressing (@start1/@start2 and @end1/@end2) is not proper XML markup either. A standardized encoding would be helpful.

Handling of punctuation

This seems to be a common problem in textual criticism and apparatus creation, but guidelines and encoding examples are lacking:

http://tei.markmail.org/thread/es6byhxpsbgkrxzo

Representing omissions in an apparatus

What's the proper way to represent missing lines/ paragraphs/ verses?

Markup-related

Inclusion of structural markup in the apparatus

In a nutshell: the <app> element is phrase-level, when it really should be allowed to include paragraphs, and even <div>s.

Use case:

I'm encoding a 19th c. edition of a medieval text, and one of the witnesses omits several paragraphs. Of course, the TEI schema won't let me put <p> elements inside an <app>/<lem> element...

- I use the parallel segmentation method
- It is important to me to keep a methodical link between the encoded apparatus and the note numbers in the original edition (the @n of each <app> tag bears the number of the footnote in the original edition)

Here is the scan of a page from this edition, please consider footnote number 9. The note contains: "9. Eodem anno, rex Francie… dampnificati, paragraphes omis par Bal.", meaning that the Bal. witness has an omission where other witnesses have two long paragraphs, the first one beginning on the previous page (see the previous page scanned).

http://tei.markmail.org/thread/tbzi2yj5xd4dto34

Another use case:

We have several witnesses of a poem, and in our base text, we have 8 stanzas, in another version only 6 stanzas and in some other 7? We are not allowed to have an <lg> and not even an <l> element within the <app> entry. So, we must put it the other way round, which may be somewhat awkward, like this:

   <l>
     <app>
       <rdg wit="#A"> [the text of the verse line here] </rdg>
       <rdg wit="#B"/>
       <rdg wit="#C"/>
     </app>
   </l>

We would then have to repeat this for every <l>, while it would be more practical to "say" this only once for the entire stanza (with an <lg> within a <rdg>).

Source: http://marjorie.burghart.online.fr/?q=en/content/tei-critical-apparatus-cheatsheet#comment-15

More use cases from TEI-L:

Encoding variants in structural markup

Conflicts between individual readings and the semantics of structural markup that surrounds it

In a nutshell: with the parallel segmentation method, witnesses with different forms of lineation pose a problem.

Workflow-related

Scalability

In a nutshell: the parallel segmentation method is difficult to handle when adding hundreds of conflicting witnesses.

Scaling is generally a problem with methods of indicating textual variance, but in parallel segmentation it is exacerbated: as the number of witnesses increases, so does the likelihood of needing to reformulate the reading boundaries, never mind the difficulty of reading or understanding such encodings. This may be a problem when looking at a single text with many witnesses, where variation in structure may be extremely difficult to represent if conflicts disrupt the very basic structure (for example, imagine a set of witnesses where some have lines in linegroups, some just lines, some paragraphs, some paragraphs in divisions, but all with the same underlying text). It may equally be a problem where parallel segmentation is being used to record divergent interpretations of these individual witnesses by many editors (for distributed co-operative editions generated from many editorial views of a text). A plausible recommendation is to use a form of stand-off apparatus for such editions rather than parallel segmentation. And while some of the current methods can be used in a stand-off manner, they should be updated to reflect current P5 usage of URI-based pointers.

Refactoring

In a nutshell: with the parallel segmentation method, it is cumbersome to add a new reading that necessitates changing where the borders of readings are drawn.

Complexity

Manually crafting an apparatus is error-prone:

http://tei.markmail.org/thread/yuxqotf5aynxznq5

Feasibility of double-endpoint-attached method

Model vs. Representation

Showing a lemma different from the content of the <lem> or chosen reading in an apparatus note

In a nutshell: depending on the desired output of your digital edition, you may need to show in the apparatus entry a lemma text different from the content of the <lem> or desired <rdg>. This is typically the case for long omissions, when one does not display the full text that is omitted by one or more witnesses, but only the beginning and end of the omitted span of text.

Use case:

Let's consider again the example used in a previous use case:
Here is the scan of a page from this edition, please consider footnote number 9. The note contains: "9. Eodem anno, rex Francie… dampnificati, paragraphes omis par Bal.", meaning that the Bal. witness has an omission where other witnesses have two long paragraphs, the first one beginning on the previous page (see the previous page scanned).

You certainly do not want to generate a footnote with these two full paragraphs to tell the reader that one witness omits them, but on the other hand you want to be able to represent the source according to its various witnesses, so the location-referenced method is not an option.

Possible workaround: add a new possible child to <app>, indicating how the content of the <lem> should be displayed; something like:

   <app>
   <lemDisplay>Eodem anno, rex Francie… dampnificati</lemDisplay>
   <lem>[several lines or paragraphs]</lem>
   <rdg wit="#V"></rdg>
   </app>

Representing "verbose" apparatus

In a nutshell: it is difficult to represent an apparatus entry written in a rather verbose way (in a print-to-digital edition). The same is true if you want to be able to generate a verbose apparatus note in a "born digital" edition.

Use cases:

You're encoding an existing edition, and want to represent the source it edits, while keeping intact the text / apparatus of the existing edition. Some apparatus entries are easy to represent with the <app> / <lem> / <rdg> elements, some others add editorial comments to the listing of the variants, and are quite difficult to represent. BTW, the same goes when you are encoding a born-digital edition for which you want to be able to generate an alternative print output corresponding to the traditional standards of a collection.

A - When I have a footnote giving two lectiones from the same manuscript, one before correction and the other after:
Text: ad lectorem Venetum (b) .
Note: b) ms., lectionem venerum corrigé postérieurement en lectorem Venetum

If I encode it like this, with two separate rdg for the same witness, each with a different @type (for instance, "anteCorr" and "postCorr"), it gives an accurate account of the state of the witness, BUT it is an interpretation of the original note in the critical apparatus, i.e. if I do this I delete some text added by the original editor.

<app n="b">
<lem>lectorem Venetum</lem>
<rdg wit="#ms.2" type="anteCorr">lectionem venerum</rdg>
<rdg wit="#ms.2" type="postCorr">lectorem Venetum</rdg>

</app>

Let's consider this other note. There is some text added verbosely within the apparatus note by the editor.

Text: Hiis diebus civitas Pergamensis(b) tenebat exersitum
Note: b) se, mis indûment avant tenebat par le ms.

Should I encode it as:
... Pergamensis <app n="b">
    <lem/>
    <rdg type="addition" wit="#ms"><sic>se</sic></rdg>
</app>...

If one represents this note strictly with <app> / <rdg>, the remarks of the original editor are suppressed. Adding a note in the rdg to preserve the editor's comments could work here, but that is not always the case.
For example:
... Pergamensis <app n="b">
    <lem/>
    <rdg type="addition" wit="#ms"><sic>se</sic> <note><hi rend="italics">mis indûment avant</hi> tenebat.</note></rdg>

</app>

Text: …reliqui demum meos socios (d)
Note: d) domum meam solito, Bal.; dni ou dm, ms.; en note meam solita.

Here we have 2 witnesses (Bal. and ms.), the latter with a) an uncertain lectio ("dni" or "dm") and b) a part of the lectio which is written as a note ("meam solita"). This is tricky to encode.

Representation of suggestions by the editor: lege dele etc.

In a nutshell: sometimes the editor provides working suggestions through apparatus notes such as lege(ndum) ("read"), dele(ndum) ("delete"), etc. They do not belong to the textual variants per se, and are not attached to witnesses, although they do belong in the critical apparatus.

http://tei.markmail.org/thread/vfw25psb5vgdiftw

Collations of differing granularity

An encoding proposal from the perspective of computer-aided collation tools

Gregor Middell gave an overview of textual variance from a software developer's perspective for the workgroup on a separate page. The models described there are used in tools like CollateX, Juxta and nmerge.

Collecting ideas from the mailing list by James Cummings, Dan O'Donnell and Marjorie Burghart, as well as following the “Gothenburg model” of textual variance, a first take at separating the model from the representation of textual variance could be structured as follows.

Modelling input data: Make the units of a collation addressable in the witnesses

The Gothenburg model assumes a preprocessing step by which the witnesses get split up into tokens of the desired granularity. This granularity becomes the minimal unit of collation and can be defined as pages, paragraphs, verses, lines, words, characters or any other unit that makes sense in the context of the particular tradition under investigation. To model collation results on top of tokenized witnesses, those tokens have to be addressable.
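As a rough illustration of this preprocessing step, the following Python sketch splits a plain-text transcription into tokens at a chosen granularity. The function name `tokenize` and the granularity labels are assumptions made for this example; they are not part of the Gothenburg model or of any particular collation tool, and a real project would choose its own splitting rules.

```python
import re


def tokenize(text, granularity="word"):
    """Split a witness transcription into collation tokens.

    The granularity labels ("line", "word", "character") are hypothetical
    names used only in this sketch.
    """
    if granularity == "line":
        # One token per non-empty line, e.g. for verse-level collation.
        return [line.strip() for line in text.splitlines() if line.strip()]
    if granularity == "word":
        # \w+ keeps letters, digits and underscores; punctuation is dropped.
        return re.findall(r"\w+", text)
    if granularity == "character":
        # One token per non-whitespace character.
        return [char for char in text if not char.isspace()]
    raise ValueError(f"unknown granularity: {granularity!r}")


witness = "The cat ate the food quickly."
print(tokenize(witness))
```

Note that the word-level branch silently discards punctuation, which is one of the choices that has to be made deliberately at this stage.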

The TEI defines an array of pointing mechanisms, which can be used to address anything from a whole XML document via URIs down to arbitrary content of those documents via sophisticated XPointer schemes. Projects would be free to choose among those mechanisms as long as each token is made available for later reference.

Examples:

<p xml:base="http://edition.org/witness_1">
  <w>The</w> <w>cat</w> <w>ate</w> <w>the</w> <w>food</w> <w>quickly</w>.
</p>

<p xml:base="http://edition.org/witness_2">
  <w>Quickly</w>, <w>the</w> <w>cat</w> <w>ate</w> <w>the</w> <w>food</w>.
</p>

Here tokens on the word-level could be addressed via the xpath1() XPointer scheme:

http://edition.org/witness_1#xpath1(/p[1]/w[1])
http://edition.org/witness_1#xpath1(/p[1]/w[2])
...
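The pointer list above could be generated mechanically. The sketch below, using only Python's standard library, assumes a witness consisting of a single <p> whose word tokens are direct <w> children; the function name `token_pointers` is hypothetical.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:base under the predefined XML namespace.
XML_NS = "{http://www.w3.org/XML/1998/namespace}"

WITNESS_1 = """<p xml:base="http://edition.org/witness_1">
  <w>The</w> <w>cat</w> <w>ate</w> <w>the</w> <w>food</w> <w>quickly</w>.
</p>"""


def token_pointers(witness_xml):
    """Return (pointer, text) pairs for every <w> child of a one-paragraph witness."""
    paragraph = ET.fromstring(witness_xml)
    base = paragraph.get(XML_NS + "base")
    return [
        (f"{base}#xpath1(/p[1]/w[{position}])", word.text)
        for position, word in enumerate(paragraph.findall("w"), start=1)
    ]


for pointer, text in token_pointers(WITNESS_1):
    print(pointer, text)
```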

A less verbose scheme would rely on each container element of a token being identified via a (possibly autogenerated) xml:id attribute, like in the following verse-level tokenization.

<lg xml:base="urn:goethe:faust2">
  <l xml:id="l_1">Die Sonne sinkt, die letzten Schiffe</l>
  <l xml:id="l_2">Sie ziehen munter hafenein.</l>
  <l xml:id="l_3">Ein großer Kahn ist im Begriffe</l>
  <l xml:id="l_4">Auf dem Canale hier zu sein.</l>
</lg>

urn:goethe:faust2#l_1
urn:goethe:faust2#l_2
...
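Autogenerating the identifiers might look like the following Python sketch. It assumes that lines without an xml:id may simply be numbered in document order with a prefix; the function name `add_line_ids` and the prefix are illustrative choices, and the sketch does not check for collisions with identifiers that already exist.

```python
import xml.etree.ElementTree as ET

XML_NS = "{http://www.w3.org/XML/1998/namespace}"

VERSES = """<lg xml:base="urn:goethe:faust2">
  <l>Die Sonne sinkt, die letzten Schiffe</l>
  <l>Sie ziehen munter hafenein.</l>
</lg>"""


def add_line_ids(lg_xml, prefix="l_"):
    """Give every <l> an xml:id (if it lacks one) and return the references."""
    line_group = ET.fromstring(lg_xml)
    base = line_group.get(XML_NS + "base")
    references = []
    for number, line in enumerate(line_group.iter("l"), start=1):
        if line.get(XML_NS + "id") is None:
            line.set(XML_NS + "id", f"{prefix}{number}")
        references.append(f"{base}#{line.get(XML_NS + 'id')}")
    return line_group, references


line_group, references = add_line_ids(VERSES)
print(references)
```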

One can even think of reference schemes that are as independent of existing markup as possible. By introducing <anchor/> milestone elements at token boundaries and using the range() XPointer scheme, the tokenization of arbitrary TEI documents can be accomplished, because <anchor/> is part of model.global.

Modelling collated data: Encode the alignment/linking between tokens

After tokens in the different witnesses have been made addressable, collation data can be modelled on top of that as alignments of tokens. An alignment can be expressed as a set of tokens from different witnesses or, in accordance with the corresponding guidelines chapter, as a link between two or more tokens.

Taking the first example from above, a collation of the two given witnesses could be expressed as

<linkGrp type="collation" xml:base="http://edition.org/">
  <link target="witness_1#xpath1(/p[1]/w[1]) witness_2#xpath1(/p[1]/w[2])" />
  <link target="witness_1#xpath1(/p[1]/w[2]) witness_2#xpath1(/p[1]/w[3])" />
  <link target="witness_1#xpath1(/p[1]/w[3]) witness_2#xpath1(/p[1]/w[4])" />
  <link target="witness_1#xpath1(/p[1]/w[4]) witness_2#xpath1(/p[1]/w[5])" />
  <link target="witness_1#xpath1(/p[1]/w[5]) witness_2#xpath1(/p[1]/w[6])" />
  <link target="witness_1#xpath1(/p[1]/w[6]) witness_2#xpath1(/p[1]/w[1])" type="transposition" />
</linkGrp>

Each link in this example corresponds to a row in an alignment table as depicted in the Gothenburg model description. Omitted/added tokens are expressed implicitly by not linking to tokens in other witnesses. That is to say: whether a set of tokens has been added to a witness or omitted from it is a matter of interpreting the collation data expressed above from the perspective of one witness or another, and with regard to the way this witness aligns with the others.
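To make the implicit encoding of additions and omissions concrete, here is a minimal Python sketch. It uses a hypothetical variant of the example in which witness_1 lacks the word "quickly" (so only five links exist), reads each <link> as one alignment row, and reports the tokens of a witness that no row mentions. The function names and the variant data are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET

# Hypothetical variant of the example: witness_1 reads "The cat ate the food"
# (no "quickly"); witness_2 reads "Quickly, the cat ate the food."
LINK_GRP = """<linkGrp type="collation" xml:base="http://edition.org/">
  <link target="witness_1#xpath1(/p[1]/w[1]) witness_2#xpath1(/p[1]/w[2])" />
  <link target="witness_1#xpath1(/p[1]/w[2]) witness_2#xpath1(/p[1]/w[3])" />
  <link target="witness_1#xpath1(/p[1]/w[3]) witness_2#xpath1(/p[1]/w[4])" />
  <link target="witness_1#xpath1(/p[1]/w[4]) witness_2#xpath1(/p[1]/w[5])" />
  <link target="witness_1#xpath1(/p[1]/w[5]) witness_2#xpath1(/p[1]/w[6])" />
</linkGrp>"""


def alignment_rows(link_grp_xml):
    """Turn each <link> into one row: witness name -> token pointer."""
    rows = []
    for link in ET.fromstring(link_grp_xml).iter("link"):
        targets = link.get("target").split()
        rows.append(dict(target.split("#", 1) for target in targets))
    return rows


def unlinked_tokens(witness, token_count, rows):
    """List the pointers of a witness's tokens that appear in no alignment row."""
    linked = {row[witness] for row in rows if witness in row}
    every = [f"xpath1(/p[1]/w[{i}])" for i in range(1, token_count + 1)]
    return [pointer for pointer in every if pointer not in linked]


rows = alignment_rows(LINK_GRP)
print(unlinked_tokens("witness_2", 6, rows))  # the token "Quickly" has no partner
print(unlinked_tokens("witness_1", 5, rows))  # every token of witness_1 is linked
```

Whether the unlinked token counts as an addition in witness_2 or an omission in witness_1 is not recorded anywhere in the data; it depends on which witness is taken as the point of view.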

One advantage of encoding collation data in such a set-oriented way is its scalability:

- Gradually adding witnesses to the collation may amount to adding alignments to the existing ones or modifying/augmenting the latter, depending on whether the collation is done pairwise (e. g. in relation to a base text) or via multiple alignment (e. g. without a prechosen base).
- Guiding a collation tool in producing ever more precise alignments in consecutive runs can be achieved by declaring alignments (for example transpositions), feeding those into the collator, adjusting the resulting alignment set, feeding it back into the collator for another run and so forth. Being able to encode the initial/preliminary results of such an iterative process in a standardized way makes it possible to run different collation tools on the same text tradition, ideally each being able to make use of former results by other tools and to contribute to the overall result.

The major disadvantage of encoding collation data this way is its apparent lack of human readability and the fact that it is hardly possible to edit it by hand, especially when the collated text tradition grows larger. This problem can only be solved via tool support.

Encoding the interpretation/ representation: Derive an apparatus from the collation

A TEI-encoded critical apparatus is one possible rendition of collation data, possibly enhanced with information gained from interpreting the alignments. There are a couple of ways we could encode the above collation as an apparatus.

Apparatus pointing to the collated tokens (for easier post-processing)

<app>
  <rdg wit="http://edition.org/witness_1" xml:base="http://edition.org/witness_1" />
  <rdg wit="http://edition.org/witness_2" xml:base="http://edition.org/witness_2" xml:id="w2_1">
    <ptr target="#xpath1(/p[1]/w[1])" />
  </rdg>
</app>
<app>
  <rdg wit="http://edition.org/witness_1" xml:base="http://edition.org/witness_1">
    <ptr target="#xpath1(/p[1]/w[1])" />
    <ptr target="#xpath1(/p[1]/w[2])" />
    <ptr target="#xpath1(/p[1]/w[3])" />
    <ptr target="#xpath1(/p[1]/w[4])" />
    <ptr target="#xpath1(/p[1]/w[5])" />
  </rdg>
  <rdg wit="http://edition.org/witness_2" xml:base="http://edition.org/witness_2">
    <ptr target="#xpath1(/p[1]/w[2])" />
    <ptr target="#xpath1(/p[1]/w[3])" />
    <ptr target="#xpath1(/p[1]/w[4])" />
    <ptr target="#xpath1(/p[1]/w[5])" />
    <ptr target="#xpath1(/p[1]/w[6])" />
  </rdg>
</app>
<app>
  <rdg wit="http://edition.org/witness_1" xml:base="http://edition.org/witness_1" corresp="#w2_1">
    <ptr target="#xpath1(/p[1]/w[6])" />
  </rdg>
  <rdg wit="http://edition.org/witness_2" xml:base="http://edition.org/witness_2" />
</app>

Apparatus with embedded textual content (for readability)

<app>
  <rdg wit="http://edition.org/witness_1" />
  <rdg wit="http://edition.org/witness_2" xml:id="w2_1">Quickly,</rdg>
</app>
<app>
  <rdg wit="http://edition.org/witness_1">The cat ate the food</rdg>
  <rdg wit="http://edition.org/witness_2">the cat ate the food.</rdg>
</app>
<app>
  <rdg wit="http://edition.org/witness_1" corresp="#w2_1">quickly.</rdg>
  <rdg wit="http://edition.org/witness_2" />
</app>

Some problems here:

- @corresp vs. <link/> for transpositions over more than two witnesses
- How to derive the segment content from the original witness automatically, if the token content does not add up to it (e. g. because of punctuation being excluded from the tokens from the start)?
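The second problem can be illustrated with a small Python sketch that dereferences the <ptr> targets of a reading against its witness and joins the token texts. It assumes pointers of exactly the form used above and a witness consisting of a single <p>; the function name `reading_text` is hypothetical.

```python
import re
import xml.etree.ElementTree as ET

WITNESS_2 = """<p>
  <w>Quickly</w>, <w>the</w> <w>cat</w> <w>ate</w> <w>the</w> <w>food</w>.
</p>"""

# Matches only pointers of the form #xpath1(/p[1]/w[N]) used in this example.
POINTER = re.compile(r"#xpath1\(/p\[1\]/w\[(\d+)\]\)")


def reading_text(witness_xml, pointers):
    """Rebuild the text of a reading from the tokens its pointers address."""
    words = ET.fromstring(witness_xml).findall("w")
    picked = []
    for pointer in pointers:
        match = POINTER.fullmatch(pointer)
        if match is None:
            raise ValueError(f"unsupported pointer: {pointer}")
        picked.append(words[int(match.group(1)) - 1].text)
    return " ".join(picked)


print(reading_text(WITNESS_2, ["#xpath1(/p[1]/w[1])"]))
```

The reconstructed reading is the bare token "Quickly", whereas the apparatus with embedded text gives "Quickly," with its comma: the punctuation lives outside the <w> tokens, so it cannot be recovered from the pointers alone.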

Bibliography

O'Donnell, Daniel Paul. “The Ghost in the Machine: Revisiting an Old Model for the Dynamic Generation of Digital Editions.” HumanIT 8.1 (2005): 51-71.
Vetter, L., and J. McDonald. “Witnessing Dickinson’s Witnesses.” Literary and Linguistic Computing 18.2 (2003): 151-165.
Schmidt, D. “The Inadequacy of Embedded Markup for Cultural Heritage Texts.” Literary and Linguistic Computing 25.3 (2010): 337-356.