SIGMS Minutes 20111014

2011-10-14T08:24:45Z

Gremid:


== Participants ==

* Elena Pierazzo (EP)
* Marjorie Burghart (MB)
* Torsten Schaßan (TS)
* Christian Wittern (CW)
* Gregor Middell (GM) – minutes
...

== Introduction (EP) ==

* introduction of participants
* explains relationship and interdependencies to other SIGs (Facsimile, Libraries, ...), willingness to contribute vs. ability to do so on a persistent basis, problem of time constraints, financial and other resources as well as attribution of work in the SIGs
* outline of current activities
** '''Genetic editing''': Council adopts recommendations of the SIG WG, will be circulated for the first time shortly; likely to be included in the next release of the Guidelines
** '''Critical Apparatus''': chapter in the Guidelines has not been revised since P4, automatic collation vs. manual construction of a CA, question of relationship between the two, MB appointed as leader of a WG to improve on this situation
** '''Manuscript description''': extension of existing means with dimensions like time, geospatial information etc., problem of scope (to what extent are we talking about manuscripts or about cultural artifacts in general)
* organizational matters, some financial support by the TEI available, communication via [http://listserv.brown.edu/tei-ms-sig.html mailing list], [[SIG:MSS|wiki]] and [http://www.tei-c.org/SIG/Manuscripts/ web-site]
* call for participation

== What should we do, what could you do? ==

* TS: ODDs from [http://enrich.manuscriptorium.com/ ENRICH project] available
* support for marginal notes in the Guidelines/ SIG proposal
* manuscript description: <summary/> content model is not flexible enough for more complex descriptions (e.g. )
* CW: there are – apart from the SIG – means to address such issues, for example the [http://sourceforge.net/tracker/?atid=644065&group_id=106328&func=browse Sourceforge Issue Tracker]
* EP: SIG can act as a proxy though


2011-04-08T15:45:57Z

Gremid:

The [http://www.tei-c.org/release/doc/tei-p5-doc/en/html/TC.html Critical Apparatus] workgroup is part of the TEI special interest group on manuscripts [[SIG:MSS]].
This page provides a summary of the preliminary discussions regarding the current issues with [http://www.tei-c.org/release/doc/tei-p5-doc/en/html/TC.html the critical apparatus chapter].

Participants in the preliminary workgroup: Marjorie Burghart (MB), James Cummings (JC), Fotis Jannidis (FJ), Gregor Middell (GM), Dan O'Donnell (DOD), Espen Ore (EO), Elena Pierazzo (EP), Roberto Rosselli del Turco (RDT), Chris Wittern (CW)

== A preliminary vocabulary question ==
The very name of the chapter, "Critical apparatus", is felt by some to be a problem: the critical apparatus is inherited from the printed world and is just one of the possible physical embodiments of TEXTUAL VARIANCE. EP therefore proposes to use this new name, moving from "critical apparatus" to "textual variance".

MB argues that, oddly, "textual variance" feels more restrictive to her than "critical apparatus": it is a notion linked with Cerquiglini's work, which does not correspond to '''every''' branch of textual criticism. On the other hand, strictly speaking, the "critical apparatus" is not limited to registering the variants of the several witnesses of a text. It also includes various kinds of notes (identification of the sources of the text, historical notes, etc.). Even texts with a single witness may have a critical apparatus. Maybe the problem with the name has its origins in the choice of giving the name "critical apparatus" to a part of the guidelines dedicated solely to the registration of textual variants.

FJ argues that for German ears the concept of textual variance is not closely connected to a specific scholar.

MB proposes to use "TEXTUAL VARIANTS" instead, since it focuses more on actual elements in the edition, whereas "variance" is nothing concrete but a phenomenon.

Side remarks by MB: this vocabulary question might prove sticky in the end. The <app> element is named <app> because it is considered "an apparatus entry", so unless we end up recommending a change to the element names, the phrase "critical apparatus" will still be used in the module, at least to explain the tag names.

RDT argues that while backward compatibility is clearly a bonus, as MB states <app> stands for 'apparatus entry': we shouldn't be afraid to change its function, for instance making it a container instead of a phrase level element. RDT stresses that he is proposing this by way of example, and to stress that our focus is on variants: these might then be organised in <app>s for traditional CA display, and/or in other, new ways for electronic display. Note that this might mean no traditional critical apparatus in a digital edition.

MB: It is characteristic of a print-based approach to encoding that the <app> element was considered as encoding an apparatus entry (hence the <app> name), whereas what it really encodes is a locus where different witnesses have variant readings (which would probably have justified a name along the lines of <locus> or whatnot).

JC thinks this points to a slight divergence at the heart of the current critical apparatus recommendations: they cover both encoding an apparatus at the site of textual variance and encoding a structured view of a note that is entirely separate from the edited version of the text. (In mass digitization of critical editions, for example, one might have an <app> in a set of notes at the bottom of the page which are not encoded at the site of variance, or indeed necessarily connected with it.) The recommendations strive to encode all sorts of legacy forms of apparatus while simultaneously catering for those who record the structure from which they will generate an apparatus in some output. JC would therefore argue that the first of these is an apparatus and the second is a site/locus of textual variance.

== Issues with the current Critical Apparatus chapter/module ==

Preliminary notice: most of the issues raised here are connected with the parallel segmentation method, not because it is the most flawed, but because it is the one most used by the members of this group. While the location-referenced and double-end-point-attachment methods might be useful for mass conversion of printed material (the former) and/or when software handles the encoding (the latter), the parallel segmentation method seems to be the easiest and most powerful way to encode the critical apparatus "by hand".

Also, one might point out that most of the issues raised here might be solved with standoff encoding. But this is extremely cumbersome to handle without software support, and it does not correspond to the way most people work.

=== Inclusion of structural markup in the apparatus ===

In a nutshell: the <app> element is phrase-level, when it really should be allowed to include paragraphs, and even <div>s.

Use case:

<blockquote style="background:#FFEAEA">I'm encoding a 19th c. edition of a medieval text, and one of the
witnesses has omissions of several paragraphs. Of course, the TEI schema
won't let me put paragraph elements inside an <app>/<lem> element... 

- I use the parallel segmentation method 
- It is important to me to keep a methodical link between the encoded
apparatus and the note numbers in the original edition (the
@n of each <app> tag bears the number of the footnote in the original
edition) 

Here is the [http://baluze.univ-avignon.fr/scan/t1/%285%29.jpg scan of a page from this edition], please consider footnote number 9.
The note contains: "9. Eodem anno, rex Francie… dampnificati, paragraphes omis par Bal.", meaning that the ''Bal.'' witness has an omission where other witnesses have two long paragraphs, the first one beginning on the previous page (see the [http://baluze.univ-avignon.fr/scan/t1/%284%29.jpg previous page scanned]).
</blockquote>

* http://tei.markmail.org/thread/tbzi2yj5xd4dto34

More use cases from TEI-L:

* http://tei.markmail.org/thread/jyezaqfycaldtdcv
* http://tei.markmail.org/thread/fbyuxyabbxq4rwbr
* http://tei.markmail.org/thread/vrwkl7kkruulyjzh

=== Transpositions ===

In a nutshell: with the parallel segmentation method, it is often cumbersome to render transpositions.

Additionally it is not possible to mark them up explicitly. [http://juxtasoftware.org/ Juxta] for example works around that by storing transposition data in a custom XML format:

<pre>
<moves>
<move doc1="1855 MS" space1="original" start1="9679" end1="10462" doc2="1881 1st Ed." space2="original" start2="9872" end2="10467" />
<move doc1="1855 MS" space1="original" start1="9679" end1="10483" doc2="1870 2nd Ed." space2="original" start2="7781" end2="8376" />
<move doc1="1855 MS" space1="original" start1="9679" end1="10504" doc2="1870 Proof" space2="original" start2="8458" end2="9056" />
<move doc1="1855 MS" space1="original" start1="9886" end1="10525" doc2="1870 1st Ed." space2="original" start2="8546" end2="9141" />
<move doc1="1870 Proof" space1="original" start1="1640" end1="1850" doc2="1881 1st Ed." space2="original" start2="2961" end2="3070" />
</moves>
</pre>

Neither is this TEI-compliant, nor is the offset/range-based addressing (@start1/@start2 and @end1/@end2) proper XML markup. A standardized encoding would be helpful.
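
To illustrate what a tool has to do before such data could be re-expressed in a standardized encoding, here is a minimal Python sketch that reads <code>move</code> records of the kind shown above into pairs of ranges. The attribute names come from the excerpt; the <code>Range</code> type and the <code>parse_moves</code> function are hypothetical names, and treating the numbers as character offsets is an assumption.

```python
import xml.etree.ElementTree as ET
from typing import NamedTuple


class Range(NamedTuple):
    doc: str    # witness label as given in @doc1 / @doc2
    start: int  # offset where the moved passage begins (assumed: characters)
    end: int    # offset where the moved passage ends


def parse_moves(xml_text: str) -> list[tuple[Range, Range]]:
    """Return one (range in doc1, range in doc2) pair per <move> element."""
    root = ET.fromstring(xml_text)
    pairs = []
    for move in root.iter("move"):
        a = Range(move.get("doc1"), int(move.get("start1")), int(move.get("end1")))
        b = Range(move.get("doc2"), int(move.get("start2")), int(move.get("end2")))
        pairs.append((a, b))
    return pairs


# Two records copied from the excerpt above.
SAMPLE = """<moves>
<move doc1="1855 MS" space1="original" start1="9679" end1="10462" doc2="1881 1st Ed." space2="original" start2="9872" end2="10467" />
<move doc1="1870 Proof" space1="original" start1="1640" end1="1850" doc2="1881 1st Ed." space2="original" start2="2961" end2="3070" />
</moves>"""

moves = parse_moves(SAMPLE)
```

Each pair could then be mapped onto whatever pointing mechanism a standardized encoding would prescribe; the sketch deliberately stops short of proposing one.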

=== Scalability ===

In a nutshell: the parallel segmentation method is difficult to handle when adding hundreds of conflicting witnesses.

Also manually crafting an apparatus is error-prone:

* http://tei.markmail.org/thread/yuxqotf5aynxznq5

=== Refactoring ===
In a nutshell: with the parallel segmentation method, it is cumbersome to add a new reading that necessitates changing where the borders of readings are drawn.

=== Conflicts between individual readings and the semantics of the structural markup that surrounds them ===
In a nutshell: with the parallel segmentation method, witnesses with different forms of lineation pose a problem.

=== Showing a lemma different from the content of the <lem> or chosen reading in an apparatus note ===

In a nutshell: depending on the desired output of your digital edition, you may need to show in the apparatus entry a lemma text different from the content of the <lem> or desired <rdg>. This is typically the case for long omissions, when one does not display the full text that is omitted by one or more witnesses, but only the beginning and end of the omitted span of text.

Use case:
<blockquote style="background:#FFEAEA">Let's consider again the example used in a previous use case:
Here is the [http://baluze.univ-avignon.fr/scan/t1/%285%29.jpg scan of a page from this edition], please consider footnote number 9.
The note contains: "9. Eodem anno, rex Francie… dampnificati, paragraphes omis par Bal.", meaning that the ''Bal.'' witness has an omission where other witnesses have two long paragraphs, the first one beginning on the previous page (see the [http://baluze.univ-avignon.fr/scan/t1/%284%29.jpg previous page scanned]).
 
You certainly do not want to generate a footnote with these two full paragraphs to tell the reader that one witness omits them, but on the other hand you want to be able to represent the source according to its various witnesses, so the location-referenced method is not an option.
</blockquote>

=== Representing "verbose" apparatus ===
In a nutshell: it is difficult to represent an apparatus entry written in a rather verbose way (in a print-to-digital edition). The same is true if you want to be able to generate a verbose apparatus note in a "born digital" edition.

Use cases:
<blockquote style="background:#FFEAEA">You're encoding an existing edition, and want to represent the source it edits, while keeping intact the text / apparatus of the existing edition. Some apparatus entries are easy to represent with the <app> / <lem> / <rdg> elements, some others add editorial comments to the listing of the variants, and are quite difficult to represent. BTW, the same goes when you are encoding a born-digital edition for which you want to be able to generate an alternative print output corresponding to the traditional standards of a collection.
 
A - When I have a footnote giving two lectiones from the same manuscript, one before correction and the other after: 
Text: ad lectorem Venetum (b) .
 
Note: b) ms., lectionem venerum corrigé postérieurement en lectorem Venetum
 
 
If I encode it like this, with two separate rdg for the same
witness, each with a different @type (for instance, "anteCorr" and
"postCorr"), it gives an accurate account of the state of the witness, BUT it is an
interpretation of the original note in the critical apparatus, i.e. if
I do this I delete some text added by the original editor. 

<app n="b">
 
<lem>lectorem Venetum</lem>
 
<rdg wit="#ms.2" type="anteCorr">lectionem venerum</rdg>
 
<rdg wit="#ms.2" type="postCorr">lectorem Venetum</rdg>

 
</app>
</blockquote>

 

<blockquote style="background:#FFEAEA">
Let's consider this other note. There is some text added verbosely within the apparatus note by the editor. 
Text: Hiis diebus civitas
Pergamensis(b) tenebat exersitum 
Note: b) se, mis indûment avant tenebat par le ms.

Should I encode it as: 
... Pergamensis <app n="b"> 
    <lem/> 
    <rdg type="addition" wit="#ms"><sic>se</sic></rdg> 
</app>... 


If one represents this note strictly with <app> / <rdg>, the remarks of the original editor are suppressed. Adding a note inside the rdg to preserve the editor's comments could work here, but this is not always the case. 
Like: 
... Pergamensis <app n="b"> 
    <lem/> 
    <rdg type="addition" wit="#ms"><sic>se</sic> <note><hi rend="italics">mis indûment avant</hi> tenebat.</note></rdg> 
</app>
</blockquote>

 

<blockquote style="background:#FFEAEA">
'''Text''': …reliqui demum meos socios (d) 
'''Note''': d) domum
meam solito, Bal.; dni ou dm, ms.; en note meam solita.

Here we have 2 witnesses (Bal. and ms.), the latter with a) an uncertain
lectio ("dni" or "dm") and b) a part of the lectio which is written as
a note ("meam solita"). This is tricky to encode.
</blockquote>

=== Representation of suggestions by the editor: ''lege'' ''dele'' etc. ===

In a nutshell: sometimes the editor provides working suggestions through apparatus notes such as ''lege(ndum)'' ("read"), ''dele(ndum)'' ("delete") etc. They do not belong in the textual variants ''per se'', and are not attached to witnesses, although they do belong in the critical apparatus.

=== Handling of punctuation ===

Seems to be a common problem in textual criticism/ apparatus creation, but lacks guidelines/ encoding examples:

* http://tei.markmail.org/thread/es6byhxpsbgkrxzo

== An encoding proposal from the perspective of computer-aided collation tools ==

Gregor Middell gave an overview of textual variance from a software developer's perspective for the workgroup on a [[Textual_Variance|separate page]]. The models described there are used in tools like [http://collatex.sourceforge.net/ CollateX], [http://www.juxtasoftware.org/ Juxta] and [http://code.google.com/p/multiversiondocs/ nmerge].

Collecting ideas from the mailing list by James Cummings, Dan O'Donnell and Marjorie Burghart as well as following the “Gothenburg model” of textual variance, a first take at separating the [http://en.wikipedia.org/wiki/Model%E2%80%93view%E2%80%93controller model from the representation] of textual variance could be structured as follows.

=== Modelling input data: Make the units of a collation addressable in the witnesses ===

The Gothenburg model assumes a [[Textual_Variance#Tokenizer|preprocessing step]] by which the witnesses get split up into '''tokens''' of the desired granularity. This granularity becomes the minimal unit of collation and can be defined as pages, paragraphs, verses, lines, words, characters or any other unit that makes sense in the context of the particular tradition under investigation. To model collation results on top of tokenized witnesses, those tokens have to be addressable.

The TEI defines an [http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SA.html#SAXP array of pointing mechanisms], which can be used to address anything from a whole XML document via URIs down to arbitrary content of those documents via sophisticated XPointer schemes. Projects would be free to choose among those mechanisms as long as each token is made available for later reference.

Examples:

<pre>

<w>The</w> <w>cat</w> <w>ate</w> <w>the</w> <w>food</w> <w>quickly</w>.

</pre>

<pre>

<w>Quickly</w>, <w>the</w> <w>cat</w> <w>ate</w> <w>the</w> <w>food</w>.

</pre>

Here tokens on the word-level could be addressed via the [http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SA.html#SATSXP xpath1() XPointer scheme]:

# <nowiki>http://edition.org/witness_1#xpath1(/p[1]/w[1])</nowiki>
# <nowiki>http://edition.org/witness_1#xpath1(/p[1]/w[2])</nowiki>
# ...
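
As a minimal sketch of how such pointers could be generated rather than written by hand, the following Python snippet pairs each <code>w</code> token of a paragraph with an xpath1() pointer. It assumes that the word tokens of the first example sit inside a single <code>p</code> element (as the pointers above imply); the function name <code>word_pointers</code> is hypothetical.

```python
import xml.etree.ElementTree as ET


def word_pointers(witness_uri: str, p_xml: str, p_index: int = 1) -> list[tuple[str, str]]:
    """Pair each <w> token of one paragraph with an xpath1() pointer to it."""
    p = ET.fromstring(p_xml)
    pointers = []
    # XPath positions are 1-based, hence start=1.
    for i, w in enumerate(p.findall("w"), start=1):
        pointers.append((f"{witness_uri}#xpath1(/p[{p_index}]/w[{i}])", w.text))
    return pointers


# First witness from the example above, wrapped in an assumed <p> element.
WITNESS_1 = "<p><w>The</w> <w>cat</w> <w>ate</w> <w>the</w> <w>food</w> <w>quickly</w>.</p>"

tokens = word_pointers("http://edition.org/witness_1", WITNESS_1)
```

Note that the full stop is not part of any token here, which is exactly the situation behind the punctuation question raised further down this page.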

A less verbose scheme would rely on each container element of a token being identified via a (possibly autogenerated) <code>xml:id</code> attribute, like in the following verse-level tokenization.

<pre>
<lg xml:base="urn:goethe:faust2">
<l xml:id="l_1">Die Sonne sinkt, die letzten Schiffe</l>
<l xml:id="l_2">Sie ziehen munter hafenein.</l>
<l xml:id="l_3">Ein großer Kahn ist im Begriffe</l>
<l xml:id="l_4">Auf dem Canale hier zu sein.</l>
</lg>
</pre>

# <nowiki>urn:goethe:faust2#l_1</nowiki>
# <nowiki>urn:goethe:faust2#l_2</nowiki>
# ...
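
The "possibly autogenerated" identifiers could be produced along the following lines. This is only a sketch: the <code>l_N</code> naming pattern follows the example above, the function name <code>line_uris</code> is hypothetical, and a real tool would additionally have to make sure that generated identifiers do not collide with existing ones.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:id and xml:base under the predefined XML namespace.
XML_NS = "http://www.w3.org/XML/1998/namespace"


def line_uris(lg_xml: str) -> list[str]:
    """Give every <l> an xml:id if it lacks one and return one URI per line."""
    lg = ET.fromstring(lg_xml)
    base = lg.get(f"{{{XML_NS}}}base", "")
    id_key = f"{{{XML_NS}}}id"
    uris = []
    for n, line in enumerate(lg.iter("l"), start=1):
        if line.get(id_key) is None:
            line.set(id_key, f"l_{n}")  # autogenerated identifier
        uris.append(f"{base}#{line.get(id_key)}")
    return uris


# Shortened version of the verse example; the second line has no xml:id yet.
LG = """<lg xml:base="urn:goethe:faust2">
<l xml:id="l_1">Die Sonne sinkt, die letzten Schiffe</l>
<l>Sie ziehen munter hafenein.</l>
</lg>"""

uris = line_uris(LG)
```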

One can even think of reference schemes that are as independent of existing markup as possible. By introducing <anchor/> milestone elements at token boundaries and using the [http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SA.html#SATSRN range() XPointer scheme], arbitrary TEI documents can be tokenized, because <anchor/> is part of [http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-model.global.html model.global].

=== Modelling collated data: Encode the alignment/linking between tokens ===

After tokens in the different witnesses have been made addressable, collation data can be modelled on top of that as [[Textual_Variance#Aligner|alignments of tokens]]. An '''alignment''' can be expressed as a set of tokens from different witnesses or, in accordance with the [http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SA.html corresponding guidelines chapter] as a link between two or more tokens.

Taking the first example from above, a collation of the two given witnesses could be expressed as

<pre>
<linkGrp type="collation">
<link target="http://edition.org/witness_1#xpath1(/p[1]/w[1]) http://edition.org/witness_2#xpath1(/p[1]/w[2])" />
<link target="http://edition.org/witness_1#xpath1(/p[1]/w[2]) http://edition.org/witness_2#xpath1(/p[1]/w[3])" />
<link target="http://edition.org/witness_1#xpath1(/p[1]/w[3]) http://edition.org/witness_2#xpath1(/p[1]/w[4])" />
<link target="http://edition.org/witness_1#xpath1(/p[1]/w[4]) http://edition.org/witness_2#xpath1(/p[1]/w[5])" />
<link target="http://edition.org/witness_1#xpath1(/p[1]/w[5]) http://edition.org/witness_2#xpath1(/p[1]/w[6])" />
<link target="http://edition.org/witness_1#xpath1(/p[1]/w[6]) http://edition.org/witness_2#xpath1(/p[1]/w[1])" type="transposition" />
</linkGrp>
</pre>

Each link in this example corresponds to a row in an alignment table as depicted in the Gothenburg model description. Omitted/added tokens are expressed implicitly by not linking them to tokens in other witnesses. That is to say: whether a set of tokens has been added to a witness or omitted from it is a matter of interpreting the collation data from the perspective of one witness or another, with regard to the way this witness aligns with the others.
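
How a consumer might read such a link group back into alignment rows, and find the tokens that take part in no alignment, can be sketched in Python as follows. The function names and the default row type <code>"alignment"</code> are hypothetical, and the seventh token used in the usage lines is invented purely to show an unlinked token.

```python
import xml.etree.ElementTree as ET

W1 = "http://edition.org/witness_1"
W2 = "http://edition.org/witness_2"

# The link group from the example above.
LINKS = f"""<linkGrp type="collation">
<link target="{W1}#xpath1(/p[1]/w[1]) {W2}#xpath1(/p[1]/w[2])" />
<link target="{W1}#xpath1(/p[1]/w[2]) {W2}#xpath1(/p[1]/w[3])" />
<link target="{W1}#xpath1(/p[1]/w[3]) {W2}#xpath1(/p[1]/w[4])" />
<link target="{W1}#xpath1(/p[1]/w[4]) {W2}#xpath1(/p[1]/w[5])" />
<link target="{W1}#xpath1(/p[1]/w[5]) {W2}#xpath1(/p[1]/w[6])" />
<link target="{W1}#xpath1(/p[1]/w[6]) {W2}#xpath1(/p[1]/w[1])" type="transposition" />
</linkGrp>"""


def alignment_rows(link_grp_xml):
    """Turn each <link> into a row: ({witness URI: token pointer}, link type)."""
    rows = []
    for link in ET.fromstring(link_grp_xml).iter("link"):
        row = {}
        for target in link.get("target").split():
            witness, _, fragment = target.partition("#")
            row[witness] = fragment
        rows.append((row, link.get("type", "alignment")))
    return rows


def unaligned(all_tokens, rows, witness):
    """Tokens of `witness` that no row links to, i.e. implicit additions/omissions."""
    linked = {row[witness] for row, _ in rows if witness in row}
    return [t for t in all_tokens if t not in linked]


rows = alignment_rows(LINKS)

# Hypothetical: pretend witness 1 had a seventh word that nothing links to.
w1_tokens = [f"xpath1(/p[1]/w[{i}])" for i in range(1, 8)]
extra = unaligned(w1_tokens, rows, W1)
```

Whether the unlinked token counts as an addition in one witness or an omission in the other is left to the interpretation step.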

One advantage of encoding collation data in such a set-oriented way is its '''scalability''':

# Gradually adding witnesses to the collation may amount to adding alignments to the existing ones or modifying/augmenting the latter, depending on whether the collation is done pairwise (e. g. in relation to a base text) or via multiple alignment (e. g. without a prechosen base).
# Guiding a collation tool in producing ever more precise alignments in consecutive runs can be achieved by [[Textual_Variance#Analyzer|declaring alignments]] (for example transpositions), feeding those into the collator, adjusting the resulting alignment set, feeding it back into the collator for another run and so forth. Being able to encode the initial/preliminary results of such an iterative process in a standardized way makes it possible to run different collation tools on the same text tradition, ideally each being able to make use of former results by other tools and to contribute to the overall result.

The major disadvantage of encoding collation data this way is its apparent lack of human readability and the fact that it can hardly be edited by hand, especially when the collated text tradition grows larger. This problem can only be solved via tool support.

=== Encoding the interpretation/ representation: Derive an apparatus from the collation ===

A TEI-encoded critical apparatus is one possible rendition of collation data, possibly enhanced with information yielded from interpreting the alignments. There are a couple of ways in which we could encode the above collation as an apparatus.

==== Apparatus pointing to the collated tokens (for easier post-processing) ====

<pre>
<app>
<rdg wit="http://edition.org/witness_1" xml:base="http://edition.org/witness_1" />
<rdg wit="http://edition.org/witness_2" xml:base="http://edition.org/witness_2" xml:id="w2_1">
<ptr target="#xpath1(/p[1]/w[1])" />
</rdg>
</app>
<app>
<rdg wit="http://edition.org/witness_1" xml:base="http://edition.org/witness_1">
<ptr target="#xpath1(/p[1]/w[1])" />
<ptr target="#xpath1(/p[1]/w[2])" />
<ptr target="#xpath1(/p[1]/w[3])" />
<ptr target="#xpath1(/p[1]/w[4])" />
<ptr target="#xpath1(/p[1]/w[5])" />
</rdg>
<rdg wit="http://edition.org/witness_2" xml:base="http://edition.org/witness_2">
<ptr target="#xpath1(/p[1]/w[2])" />
<ptr target="#xpath1(/p[1]/w[3])" />
<ptr target="#xpath1(/p[1]/w[4])" />
<ptr target="#xpath1(/p[1]/w[5])" />
<ptr target="#xpath1(/p[1]/w[6])" />
</rdg>
</app>
<app>
<rdg wit="http://edition.org/witness_1" xml:base="http://edition.org/witness_1" corresp="#w2_1">
<ptr target="#xpath1(/p[1]/w[6])" />
</rdg>
<rdg wit="http://edition.org/witness_2" xml:base="http://edition.org/witness_2" />
</app>
</pre>

==== Apparatus with embedded textual content (for readability) ====

<pre>
<app>
<rdg wit="http://edition.org/witness_1" />
<rdg wit="http://edition.org/witness_2" xml:id="w2_1">Quickly,</rdg>
</app>
<app>
<rdg wit="http://edition.org/witness_1">The cat ate the food</rdg>
<rdg wit="http://edition.org/witness_2">the cat ate the food.</rdg>
</app>
<app>
<rdg wit="http://edition.org/witness_1" corresp="#w2_1">quickly.</rdg>
<rdg wit="http://edition.org/witness_2" />
</app>
</pre>

Some problems here:

* @corresp vs. <link/> for transpositions over more than two witnesses
* How to derive the segment content from the original witness automatically, if the token content does not add up to it (e. g. because of punctuation being excluded from the tokens from the start)?
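
One conceivable answer to the second question, offered only as a sketch and not as something the workgroup has agreed on, is to let the tokenizer record character offsets, so that a segment's text can be sliced from the original witness instead of being reassembled from token content. The regular expression and both function names are assumptions.

```python
import re


def tokenize(text):
    """Word tokens with their character offsets; punctuation is in no token."""
    return [(m.group(), m.start(), m.end()) for m in re.finditer(r"\w+", text)]


def segment_text(text, tokens, first, last):
    """Slice the witness from token `first` up to the start of the token after `last`.

    Punctuation that follows a token therefore stays with that token's segment.
    """
    start = tokens[first][1]
    end = tokens[last + 1][1] if last + 1 < len(tokens) else len(text)
    return text[start:end].strip()


witness_2 = "Quickly, the cat ate the food."
toks = tokenize(witness_2)

first_segment = segment_text(witness_2, toks, 0, 0)
second_segment = segment_text(witness_2, toks, 1, 5)
```

For the second witness this yields the two reading texts used in the embedded apparatus above, including the comma and the full stop that the tokens themselves do not contain.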

== Bibliography ==

* O'Donnell, Daniel Paul. [http://etjanst.hb.se/bhs/ith/1-8/dpo.pdf “The Ghost in the Machine: Revisiting an Old Model for the Dynamic Generation of Digital Editions.”] HumanIT 8.1 (2005): 51–71.
* Vetter, L. and McDonald, J. ‘Witnessing Dickinson’s Witnesses’, Literary and Linguistic Computing, 18.2: 2003, 151-165.
* [http://eprints.qut.edu.au/38436/ Schmidt, D., 2010. The inadequacy of embedded markup for cultural heritage texts. Literary and Linguistic Computing, 25(3), pp. 337-356.]

[[Category:SIG:Manuscripts]]

Critical Apparatus Workgroup

2011-04-08T15:34:12Z

Gremid: /* Scalability */

The [http://www.tei-c.org/release/doc/tei-p5-doc/en/html/TC.html Critical Apparatus] workgroup is part of the TEI special interest group on manuscripts [[SIG:MSS]].
This page provides a summary of the preliminary discussions regarding the current issues with [http://www.tei-c.org/release/doc/tei-p5-doc/en/html/TC.html the critical apparatus chapter].

Participants in the preliminary workgroup: Marjorie Burghart (MB), James Cummings (JC), Fotis Jannidis (FJ), Gregor Middell (GM), Dan O'Donnell (DOD), Espen Ore (EO), Elena Pierazzo (EP), Roberto Rosselli del Turco (RDT), Chris Wittern (CW)

== A preliminary vocabulary question ==
The very name of the chapter, "Critical apparatus", is felt by some to be a problem: the critical apparatus is inherited from the printed world and is just one of the possible physical embodiments of TEXTUAL VARIANCE. EP therefore proposes to use this new name, moving from "critical apparatus" to "textual variance".

MB argues that, oddly, "textual variance" feels more restrictive to her than "critical apparatus": it is a notion linked with Cerquiglini's work, which does not correspond to '''every''' branch of textual criticism. On the other hand, strictly speaking, the "critical apparatus" is not limited to registering the variants of the several witnesses of a text. It also includes various kinds of notes (identification of the sources of the text, historical notes, etc.). Even texts with a single witness may have a critical apparatus. Maybe the problem with the name has its origins in the choice of giving the name "critical apparatus" to a part of the guidelines dedicated solely to the registration of textual variants.

FJ argues that for German ears the concept of textual variance is not closely connected to a specific scholar.

MB proposes to use "TEXTUAL VARIANTS" instead, since it focuses more on actual elements in the edition, whereas "variance" is nothing concrete but a phenomenon.

Side remarks by MB: this vocabulary question might prove sticky in the end. The <app> element is named <app> because it is considered "an apparatus entry", so unless we end up recommending a change to the element names, the phrase "critical apparatus" will still be used in the module, at least to explain the tag names.

RDT argues that while backward compatibility is clearly a bonus, as MB states <app> stands for 'apparatus entry': we shouldn't be afraid to change its function, for instance making it a container instead of a phrase level element. RDT stresses that he is proposing this by way of example, and to stress that our focus is on variants: these might then be organised in <app>s for traditional CA display, and/or in other, new ways for electronic display. Note that this might mean no traditional critical apparatus in a digital edition.

MB: It is characteristic of a print-based approach to encoding that the <app> element was considered as encoding an apparatus entry (hence the <app> name), whereas what it really encodes is a locus where different witnesses have variant readings (which would probably have justified a name along the lines of <locus> or whatnot).

JC thinks this points to a slight divergence at the heart of the current critical apparatus recommendations: they cover both encoding an apparatus at the site of textual variance and encoding a structured view of a note that is entirely separate from the edited version of the text. (In mass digitization of critical editions, for example, one might have an <app> in a set of notes at the bottom of the page which are not encoded at the site of variance, or indeed necessarily connected with it.) The recommendations strive to encode all sorts of legacy forms of apparatus while simultaneously catering for those who record the structure from which they will generate an apparatus in some output. JC would therefore argue that the first of these is an apparatus and the second is a site/locus of textual variance.

== Issues with the current Critical Apparatus chapter/module ==

Preliminary notice: most of the issues raised here are connected with the parallel segmentation method, not because it is the most flawed, but because it is the one most used by the members of this group. While the location-referenced and double-end-point-attachment methods might be useful for mass conversion of printed material (the former) and/or when software handles the encoding (the latter), the parallel segmentation method seems to be the easiest and most powerful way to encode the critical apparatus "by hand".

Also, one might point out that most of the issues raised here might be solved with standoff encoding. But this is extremely cumbersome to handle without software support, and it does not correspond to the way most people work.

=== Inclusion of structural markup in the apparatus ===

In a nutshell: the <app> element is phrase-level, when it really should be allowed to include paragraphs, and even <div>s.

Use case:

<blockquote style="background:#FFEAEA">I'm encoding a 19th c. edition of a medieval text, and one of the
witnesses has omissions of several paragraphs. Of course, the TEI schema
won't let me put elements inside an <app>/<lem> element... 

- I use the parallel segmentation method 
- It is important to me to keep a methodical link between the encoded
apparatus and the note numbers in the original edition (the
@n of each <app> tag bears the number of the footnote in the original
edition) 

Here is the [http://baluze.univ-avignon.fr/scan/t1/%285%29.jpg scan of a page from this edition], please consider footnote number 9.
The note contains: "9. Eodem anno, rex Francie… dampnificati, paragraphes omis par Bal.", meaning that the ''Bal.'' witness has an omission where other witnesses have two long paragraphs, the first one beginning on the previous page (see the [http://baluze.univ-avignon.fr/scan/t1/%284%29.jpg previous page scanned]).
</blockquote>

* http://tei.markmail.org/thread/tbzi2yj5xd4dto34

More use cases from TEI-L:

* http://tei.markmail.org/thread/jyezaqfycaldtdcv
* http://tei.markmail.org/thread/fbyuxyabbxq4rwbr
* http://tei.markmail.org/thread/vrwkl7kkruulyjzh

=== Transpositions ===

In a nutshell: with the parallel segmentation method, it is often cumbersome to render transpositions.

Additionally it is not possible to mark them up explicitly. [http://juxtasoftware.org/ Juxta] for example works around that by storing transposition data in a custom XML format:

<pre>
<moves>
<move doc1="1855 MS" space1="original" start1="9679" end1="10462" doc2="1881 1st Ed." space2="original" start2="9872" end2="10467" />
<move doc1="1855 MS" space1="original" start1="9679" end1="10483" doc2="1870 2nd Ed." space2="original" start2="7781" end2="8376" />
<move doc1="1855 MS" space1="original" start1="9679" end1="10504" doc2="1870 Proof" space2="original" start2="8458" end2="9056" />
<move doc1="1855 MS" space1="original" start1="9886" end1="10525" doc2="1870 1st Ed." space2="original" start2="8546" end2="9141" />
<move doc1="1870 Proof" space1="original" start1="1640" end1="1850" doc2="1881 1st Ed." space2="original" start2="2961" end2="3070" />
</moves>
</pre>

Neither is this TEI-compliant, nor is the offset/range-based addressing (@start1/@start2 and @end1/@end2) proper XML markup. A standardized encoding would be helpful.

=== Scalability ===

In a nutshell: the parallel segmentation method is difficult to handle when adding hundreds of conflicting witnesses.

Also manually crafting an apparatus is error-prone:

* http://tei.markmail.org/thread/yuxqotf5aynxznq5

=== Refactoring ===
In a nutshell: with the parallel segmentation method, it is cumbersome to add a new reading that necessitates changing where the borders of readings are drawn.

=== Conflicts between individual readings and the semantics of the structural markup that surrounds them ===
In a nutshell: with the parallel segmentation method, witnesses with different forms of lineation pose a problem.

=== Showing a lemma different from the content of the <lem> or chosen reading in an apparatus note ===

In a nutshell: depending on the desired output of your digital edition, you may need to show in the apparatus entry a lemma text different from the content of the <lem> or desired <rdg>. This is typically the case for long omissions, when one does not display the full text that is omitted by one or more witnesses, but only the beginning and end of the omitted span of text.

Use case:
<blockquote style="background:#FFEAEA">Let's consider again the example used in a previous use case:
Here is the [http://baluze.univ-avignon.fr/scan/t1/%285%29.jpg scan of a page from this edition], please consider footnote number 9.
The note contains: "9. Eodem anno, rex Francie… dampnificati, paragraphes omis par Bal.", meaning that the ''Bal.'' witness has an omission where other witnesses have two long paragraphs, the first one beginning on the previous page (see the [http://baluze.univ-avignon.fr/scan/t1/%284%29.jpg previous page scanned]).
 
You certainly do not want to generate a footnote with these two full paragraphs to tell the reader that one witness omits them, but on the other hand you want to be able to represent the source according to its various witnesses, so location-referenced is not in order.
</blockquote>

=== Representing "verbose" apparatus ===
In a nutshell: it is difficult to represent an apparatus entry written in a rather verbose way (in a print-to-digital edition). The same is true if you want to be able to generate a verbose apparatus note in a "born digital" edition.

Use cases:
<blockquote style="background:#FFEAEA">You're encoding an existing edition, and want to represent the source it edits, while keeping intact the text / apparatus of the existing edition. Some apparatus entries are easy to represent with the <app> / <lem> / <rdg> elements, some others add editorial comments to the listing of the variants, and are quite difficult to represent. BTW, the same goes when you are encoding a born-digital edition for which you want to be able to generate an alternative print output corresponding to the traditional standards of a collection.
 
A - When I have a footnote giving two lectiones from the same manuscript, one before correction and the other after: 
Text: ad lectorem Venetum (b) .
 
Note: b) ms., lectionem venerum corrigé postérieurement en lectorem Venetum
 
 
If I encode it like this, with two separate rdg for the same
witness, each with a different @type (for instance, "anteCorr" and
"postCorr"), it gives an accurate account of the state of the witness, BUT it is an
interpretation of the original note in the critical apparatus, i.e. if
I do this I delete some text added by the original editor. 

<app n="b">
 
<lem>lectorem Venetum</lem>
 
<rdg wit="#ms.2" type="anteCorr">lectionem venerum</rdg>
 
<rdg wit="#ms.2" type="postCorr">lectorem Venetum</rdg>

 
</app>
</blockquote>

 

<blockquote style="background:#FFEAEA">
Let's consider this other note. There is some text added verbosely within the apparatus note by the editor. 
Text: Hiis diebus civitas
Pergamensis(b) tenebat exersitum 
Note: b) se, mis indûment avant tenebat par le ms.

Should I encode it as: 
... Pergamensis <app
n="b"> 
    <lem/> 
    <rdg
type="addition" wit="#ms"><sic>se</sic></rdg> 
</app>... 


If one represents this note strictly with <app> / <rdg>, the remarks by the original editor are suppressed. Adding a note in the rdg to preserve the editor's comments could work here, but that is not always the case. 
Like: 
... Pergamensis <app
n="b"> 
    <lem/> 
    <rdg
type="addition" wit="#ms"><sic>se</sic> <note><hi
rend="italics">mis
indûment avant</hi> tenebat.</note></rdg> 

</app>
</blockquote>

 

<blockquote style="background:#FFEAEA">
'''Text''': …reliqui demum meos socios (d) 
'''Note''': d) domum
meam solito, Bal.; dni ou dm, ms.; en note meam solita.

Here we have 2 witnesses (Bal. et ms.), the latter with a) an uncertain
lectio ("dni" or "dm") and b) a part of the lectio which is written as
a note ("meam solita"). This is tricky to encode.
</blockquote>

=== Representation of suggestions by the editor: ''lege'' ''dele'' etc. ===

In a nutshell: Sometimes, the editor provides working suggestions through apparatus notes such as ''lege(ndum)'' ("read"), ''dele(ndum)'' ("delete") etc. They do not belong in the textual variants ''per se'', and are not attached to witnesses, although they do belong in the critical apparatus.

== An encoding proposal from the perspective of computer-aided collation tools ==

Gregor Middell gave an overview of textual variance from a software developer's perspective for the workgroup on a [[Textual_Variance|separate page]]. The models described there are used in tools like [http://collatex.sourceforge.net/ CollateX], [http://www.juxtasoftware.org/ Juxta] and [http://code.google.com/p/multiversiondocs/ nmerge].

Collecting ideas from the mailing list by James Cummings, Dan O'Donnell and Marjorie Burghart, and following the “Gothenburg model” of textual variance, a first take at separating the [http://en.wikipedia.org/wiki/Model%E2%80%93view%E2%80%93controller model from the representation] of textual variance could be structured as follows.

=== Modelling input data: Make the units of a collation addressable in the witnesses ===

The Gothenburg model assumes a [[Textual_Variance#Tokenizer|preprocessing step]] by which the witnesses get split up into '''tokens''' of the desired granularity. This granularity becomes the minimal unit of collation and can be defined as pages, paragraphs, verses, lines, words, characters or any other unit that makes sense in the context of the particular tradition under investigation. To model collation results on top of tokenized witnesses, those tokens have to be addressable.

The TEI defines an [http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SA.html#SAXP array of pointing mechanisms], which can be used to address anything from a whole XML document via URIs down to arbitrary content of those documents via sophisticated XPointer schemes. Projects would be free to choose among those mechanisms as long as each token is made available for later reference.

Examples:

<pre>

<w>The</w> <w>cat</w> <w>ate</w> <w>the</w> <w>food</w> <w>quickly</w>.

</pre>

<pre>

<w>Quickly</w>, <w>the</w> <w>cat</w> <w>ate</w> <w>the</w> <w>food</w>.

</pre>

Here tokens on the word-level could be addressed via the [http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SA.html#SATSXP xpath1() XPointer scheme]:

# <nowiki>http://edition.org/witness_1#xpath1(/p[1]/w[1])</nowiki>
# <nowiki>http://edition.org/witness_1#xpath1(/p[1]/w[2])</nowiki>
# ...
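The pointer list above could be generated mechanically. The following is a minimal Python sketch, assuming the witness is available as a single <p> element whose tokens are direct <w> children; the function name <code>token_pointers</code> is hypothetical and not part of any existing tool.

```python
import xml.etree.ElementTree as ET

WITNESS_1 = ("<p><w>The</w> <w>cat</w> <w>ate</w> <w>the</w> "
             "<w>food</w> <w>quickly</w>.</p>")


def token_pointers(witness_uri, paragraph_xml):
    """Return (pointer, text) pairs for the <w> children of one <p>.

    Assumption: the paragraph is the document root, so every token can
    be addressed as /p[1]/w[n], as in the examples above.
    """
    paragraph = ET.fromstring(paragraph_xml)
    return [
        (f"{witness_uri}#xpath1(/p[1]/w[{index}])", word.text)
        for index, word in enumerate(paragraph.findall("w"), start=1)
    ]


pointers = token_pointers("http://edition.org/witness_1", WITNESS_1)
for pointer, text in pointers:
    print(pointer, text)
```

Punctuation outside the <w> elements is not addressed by this scheme; it only covers what the tokenizer has wrapped.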

A less verbose scheme would rely on each container element of a token being identified via a (possibly autogenerated) <code>xml:id</code> attribute, like in the following verse-level tokenization.

<pre>
<lg xml:base="urn:goethe:faust2">
<l xml:id="l_1">Die Sonne sinkt, die letzten Schiffe</l>
<l xml:id="l_2">Sie ziehen munter hafenein.</l>
<l xml:id="l_3">Ein großer Kahn ist im Begriffe</l>
<l xml:id="l_4">Auf dem Canale hier zu sein.</l>
</lg>
</pre>

# <nowiki>urn:goethe:faust2#l_1</nowiki>
# <nowiki>urn:goethe:faust2#l_2</nowiki>
# ...
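As an illustration of the <code>xml:id</code>-based scheme, here is a minimal Python sketch that combines the <code>xml:base</code> of the container with the <code>xml:id</code> of each verse. The function name <code>verse_pointers</code> is hypothetical, and the sketch assumes that every <l> carries an <code>xml:id</code>.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:base and xml:id under the predefined XML namespace.
XML_NS = "{http://www.w3.org/XML/1998/namespace}"

VERSES = """<lg xml:base="urn:goethe:faust2">
<l xml:id="l_1">Die Sonne sinkt, die letzten Schiffe</l>
<l xml:id="l_2">Sie ziehen munter hafenein.</l>
<l xml:id="l_3">Ein großer Kahn ist im Begriffe</l>
<l xml:id="l_4">Auf dem Canale hier zu sein.</l>
</lg>"""


def verse_pointers(lg_xml):
    """Map each verse-level token reference (base URI + '#' + id) to its text."""
    line_group = ET.fromstring(lg_xml)
    base = line_group.get(XML_NS + "base")
    return {
        f"{base}#{line.get(XML_NS + 'id')}": line.text
        for line in line_group.findall("l")
    }


verse_refs = verse_pointers(VERSES)
for reference, text in verse_refs.items():
    print(reference, text)
```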

One can even think of reference schemes that are as independent of existing markup as possible. By introducing <anchor/> milestone elements at token boundaries and using the [http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SA.html#SATSRN range() XPointer scheme], the tokenization of arbitrary TEI documents can be accomplished, because <anchor/> is part of [http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-model.global.html model.global].

=== Modelling collated data: Encode the alignment/linking between tokens ===

After tokens in the different witnesses have been made addressable, collation data can be modelled on top of that as [[Textual_Variance#Aligner|alignments of tokens]]. An '''alignment''' can be expressed as a set of tokens from different witnesses or, in accordance with the [http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SA.html corresponding guidelines chapter] as a link between two or more tokens.

Taking the first example from above, a collation of the two given witnesses could be expressed as

<pre>
<linkGrp type="collation">
<link target="http://edition.org/witness_1#xpath1(/p[1]/w[1]) http://edition.org/witness_2#xpath1(/p[1]/w[2])" />
<link target="http://edition.org/witness_1#xpath1(/p[1]/w[2]) http://edition.org/witness_2#xpath1(/p[1]/w[3])" />
<link target="http://edition.org/witness_1#xpath1(/p[1]/w[3]) http://edition.org/witness_2#xpath1(/p[1]/w[4])" />
<link target="http://edition.org/witness_1#xpath1(/p[1]/w[4]) http://edition.org/witness_2#xpath1(/p[1]/w[5])" />
<link target="http://edition.org/witness_1#xpath1(/p[1]/w[5]) http://edition.org/witness_2#xpath1(/p[1]/w[6])" />
<link target="http://edition.org/witness_1#xpath1(/p[1]/w[6]) http://edition.org/witness_2#xpath1(/p[1]/w[1])" type="transposition" />
</linkGrp>
</pre>

Each link in this example corresponds to a row in an alignment table as depicted in the Gothenburg model description. Omitted/added tokens are expressed implicitly by not linking to tokens in other witnesses. That is to say: whether a set of tokens has been added to a witness or omitted from it is a matter of interpreting the collation data from the perspective of one witness or another, with regard to the way this witness aligns with the others.
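To illustrate how additions and omissions stay implicit in such link sets, here is a minimal Python sketch. It uses a hypothetical three-token witness collated against a two-token witness (not the example above, where every token is linked); the function names are illustrative assumptions only.

```python
import xml.etree.ElementTree as ET

W1 = "http://edition.org/witness_1#xpath1(/p[1]/w[{}])"
W2 = "http://edition.org/witness_2#xpath1(/p[1]/w[{}])"

# Hypothetical collation result: the second token of witness 1 has no partner.
LINK_GRP = f"""<linkGrp type="collation">
<link target="{W1.format(1)} {W2.format(1)}" />
<link target="{W1.format(3)} {W2.format(2)}" />
</linkGrp>"""


def read_alignments(link_grp_xml):
    """Return one (type, [pointer, ...]) tuple per <link> element."""
    group = ET.fromstring(link_grp_xml)
    return [
        (link.get("type"), link.get("target").split())
        for link in group.findall("link")
    ]


def unlinked_tokens(tokens, alignments):
    """Return the tokens that no alignment refers to."""
    linked = {pointer for _type, pointers in alignments for pointer in pointers}
    return [token for token in tokens if token not in linked]


alignments = read_alignments(LINK_GRP)
witness_1_tokens = [W1.format(index) for index in (1, 2, 3)]
unlinked = unlinked_tokens(witness_1_tokens, alignments)
print(unlinked)
```

Whether the unlinked token counts as an addition in witness 1 or as an omission in witness 2 is left to the interpretation step, as described above.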

One advantage of encoding collation data in such a set-oriented way is its '''scalability''':

# Gradually adding witnesses to the collation may amount to adding alignments to the existing ones or modifying/augmenting the latter, depending on whether the collation is done pairwise (e. g. in relation to a base text) or via multiple alignment (e. g. without a prechosen base).
# Guiding a collation tool in producing ever more precise alignments in consecutive runs can be achieved by [[Textual_Variance#Analyzer|declaring alignments]] (for example transpositions), feeding those into the collator, adjusting the resulting alignment set, feeding it back into the collator for another run and so forth. Being able to encode the initial/preliminary results of such an iterative process in a standardized way makes it possible to run different collation tools on the same text tradition, ideally each being able to make use of former results by other tools and to contribute to the overall result.

The major disadvantage of encoding collation data this way is its apparent lack of human readability: it is hardly possible to edit it by hand, especially when the collated text tradition grows larger. This problem can only be solved via tool support.

=== Encoding the interpretation/ representation: Derive an apparatus from the collation ===

A TEI-encoded critical apparatus is one possible rendition of collation data, possibly enhanced with information gained from interpreting the alignments. There are a couple of ways to encode the above collation as an apparatus.

==== Apparatus pointing to the collated tokens (for easier post-processing) ====

<pre>
<app>
<rdg wit="http://edition.org/witness_1" xml:base="http://edition.org/witness_1" />
<rdg wit="http://edition.org/witness_2" xml:base="http://edition.org/witness_2" xml:id="w2_1">
<ptr target="#xpath1(/p[1]/w[1])" />
</rdg>
</app>
<app>
<rdg wit="http://edition.org/witness_1" xml:base="http://edition.org/witness_1">
<ptr target="#xpath1(/p[1]/w[1])" />
<ptr target="#xpath1(/p[1]/w[2])" />
<ptr target="#xpath1(/p[1]/w[3])" />
<ptr target="#xpath1(/p[1]/w[4])" />
<ptr target="#xpath1(/p[1]/w[5])" />
</rdg>
<rdg wit="http://edition.org/witness_2" xml:base="http://edition.org/witness_2">
<ptr target="#xpath1(/p[1]/w[2])" />
<ptr target="#xpath1(/p[1]/w[3])" />
<ptr target="#xpath1(/p[1]/w[4])" />
<ptr target="#xpath1(/p[1]/w[5])" />
<ptr target="#xpath1(/p[1]/w[6])" />
</rdg>
</app>
<app>
<rdg wit="http://edition.org/witness_1" xml:base="http://edition.org/witness_1" corresp="#w2_1">
<ptr target="#xpath1(/p[1]/w[6])" />
</rdg>
<rdg wit="http://edition.org/witness_2" xml:base="http://edition.org/witness_2" />
</app>
</pre>

==== Apparatus with embedded textual content (for readability) ====

<pre>
<app>
<rdg wit="http://edition.org/witness_1" />
<rdg wit="http://edition.org/witness_2" xml:id="w2_1">Quickly,</rdg>
</app>
<app>
<rdg wit="http://edition.org/witness_1">The cat ate the food</rdg>
<rdg wit="http://edition.org/witness_2">the cat ate the food.</rdg>
</app>
<app>
<rdg wit="http://edition.org/witness_1" corresp="#w2_1">quickly.</rdg>
<rdg wit="http://edition.org/witness_2" />
</app>
</pre>

Some problems here:

* @corresp vs. <link/> for transpositions over more than two witnesses
* How to derive the segment content from the original witness automatically, if the token content does not add up to it (e. g. because of punctuation being excluded from the tokens from the start)?
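The derivation of an apparatus with embedded content from an alignment table can be sketched in Python as follows. This is only an illustration under stated assumptions: the table is given as a list of rows mapping witness URIs to token text (a missing key marks a gap), consecutive rows with the same set of witnesses are merged into one <app>, and transpositions are not marked. The function name <code>build_apparatus</code> is hypothetical.

```python
from itertools import groupby
from xml.sax.saxutils import escape, quoteattr

W1 = "http://edition.org/witness_1"
W2 = "http://edition.org/witness_2"

# Alignment table for the two example witnesses, one row per alignment.
ROWS = [
    {W2: "Quickly"},
    {W1: "The", W2: "the"},
    {W1: "cat", W2: "cat"},
    {W1: "ate", W2: "ate"},
    {W1: "the", W2: "the"},
    {W1: "food", W2: "food"},
    {W1: "quickly"},
]


def build_apparatus(rows, witnesses):
    """Merge consecutive rows with the same witnesses into <app> entries."""
    entries = []
    for _present, group in groupby(rows, key=lambda row: frozenset(row)):
        group = list(group)
        readings = []
        for witness in witnesses:
            text = " ".join(row[witness] for row in group if witness in row)
            if text:
                readings.append(
                    f"<rdg wit={quoteattr(witness)}>{escape(text)}</rdg>")
            else:
                # An empty reading marks the gap in this witness.
                readings.append(f"<rdg wit={quoteattr(witness)} />")
        entries.append("<app>\n" + "\n".join(readings) + "\n</app>")
    return "\n".join(entries)


apparatus = build_apparatus(ROWS, [W1, W2])
print(apparatus)
```

The output contains no punctuation, because the tokens never included it. This is the second problem listed above in concrete form.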

== Bibliography ==

* O'Donnell, Daniel Paul. [http://etjanst.hb.se/bhs/ith/1-8/dpo.pdf “The Ghost in the Machine: Revisiting an Old Model for the Dynamic Generation of Digital Editions.”] HumanIT 8.1 (2005): 51-71.
* Vetter, L. and McDonald, J. ‘Witnessing Dickinson’s Witnesses’, Literary and Linguistic Computing, 18.2: 2003, 151-165.
* [http://eprints.qut.edu.au/38436/ Schmidt, D., 2010. The inadequacy of embedded markup for cultural heritage texts. Literary and Linguistic Computing, 25(3), pp. 337-356.]

[[Category:SIG:Manuscripts]]

Critical Apparatus Workgroup

2011-04-08T15:24:01Z

Gremid: /* Inclusion of structural markup in the apparatus */

The [http://www.tei-c.org/release/doc/tei-p5-doc/en/html/TC.html Critical Apparatus] workgroup is part of the TEI special interest group on manuscripts ([[SIG:MSS]]).
This page provides a summary of the preliminary discussions regarding the current issues with [http://www.tei-c.org/release/doc/tei-p5-doc/en/html/TC.html the critical apparatus chapter].

Participants in the preliminary workgroup: Marjorie Burghart (MB), James Cummings (JC), Fotis Jannidis (FJ), Gregor Middell (GM), Dan O'Donnell (DOD), Espen Ore (EO), Elena Pierazzo (EP), Roberto Rosselli del Turco (RDT), Chris Wittern (CW)

== A preliminary vocabulary question ==
The very name of the chapter, "Critical apparatus", is felt by some to be a problem: the critical apparatus is inherited from the printed world and is just one of the possible physical embodiments of TEXTUAL VARIANCE. EP therefore proposes to rename the chapter from "critical apparatus" to "textual variance".

MB argues that, oddly, "textual variance" feels more restrictive to her than "critical apparatus": it is a notion linked with Cerquiglini's work, which does not correspond to '''every''' branch of textual criticism. On the other hand, strictly speaking, the "critical apparatus" is not limited to registering the variants of the several witnesses of a text. It also includes various kinds of notes (identification of the sources of the text, historical notes, etc.). Even texts with a single witness may have a critical apparatus. Maybe the problem with the name has its origins in the choice of giving the name "critical apparatus" to a part of the guidelines dedicated solely to the registration of textual variants.

FJ argues that for German ears the concept of textual variance is not closely connected to a specific scholar.

MB proposes to use "TEXTUAL VARIANTS" instead, since it focuses more on actual elements in the edition, when "variance" is nothing concrete but a phenomenon.

Side remark by MB: this vocabulary question might prove sticky in the end. The <app> element is named <app> because it is considered "an apparatus entry", so unless we end up recommending a change to the element names, the phrase "critical apparatus" will still be used in the module, at least to explain the tag names.

RDT argues that while backward compatibility is clearly a bonus, as MB states <app> stands for 'apparatus entry': we shouldn't be afraid to change its function, for instance making it a container instead of a phrase level element. RDT stresses that he is proposing this by way of example, and to stress that our focus is on variants: these might then be organised in <app>s for traditional CA display, and/or in other, new ways for electronic display. Note that this might mean no traditional critical apparatus in a digital edition.

MB: It is characteristic of a print-based approach to encoding that the <app> element was considered as encoding an apparatus entry (hence the <app> name), when what it really encodes is a locus where different witnesses have variant readings (which would probably have justified a name along the lines of <locus> or whatnot).

JC: Thinks this points to a slightly divergent nature at the heart
of the current critical apparatus recommendations: encoding
an apparatus at the site of textual variance versus encoding a structured
view of a note entirely separate from the edited version of the text.
(In mass digitization of critical editions, for example, one might
have an <app> in a set of notes at the bottom of the page which is
not encoded at the site of variance, or indeed necessarily connected
with it.) The recommendations strive both to encode all sorts of
legacy forms of apparatus and to cater
for those who are recording the structure from which they will generate
an apparatus in some output. So JC would argue that the first of
these is an apparatus and the second is a site/locus of textual
variance.

== Issues with the current Critical Apparatus chapter/module ==

Preliminary notice: most of the issues raised here are connected with the parallel segmentation method, not because it is the most flawed, but because it is the one most used by the members of this group. While location-referenced and double-end-point-attachment might be useful for mass conversion of printed material (for the former) and/or when using a piece of software handling the encoding (for the latter), the parallel segmentation method seems to be the easiest and most powerful way to encode the critical apparatus "by hand".

Also, one might point out that most of the issues raised here might be solved with standoff encoding. But this is extremely cumbersome to handle without the aid of software, and it does not correspond to the way most people work.

=== Inclusion of structural markup in the apparatus ===

In a nutshell: the <app> element is phrase-level, when it really should be allowed to include paragraphs, and even <div>s.

Use case:

<blockquote style="background:#FFEAEA">I'm encoding a 19th c. edition of a medieval text, and one of the
witnesses has omissions of several paragraphs. Of course, the TEI schema
won't let me put elements inside an <app>/<lem> element... 

- I use the parallel segmentation method 
- It is important to me to keep a methodical link between the encoded
apparatus and the note numbers in the original edition (the
@n of each <app> tag bears the number of the footnote in the original
edition) 

Here is the [http://baluze.univ-avignon.fr/scan/t1/%285%29.jpg scan of a page from this edition], please consider footnote number 9.
The note contains: "9. Eodem anno, rex Francie… dampnificati, paragraphes omis par Bal.", meaning that the ''Bal.'' witness has an omission where other witnesses have two long paragraphs, the first one beginning on the previous page (see the [http://baluze.univ-avignon.fr/scan/t1/%284%29.jpg previous page scanned]).
</blockquote>

* http://tei.markmail.org/thread/tbzi2yj5xd4dto34

More use cases from TEI-L:

* http://tei.markmail.org/thread/jyezaqfycaldtdcv
* http://tei.markmail.org/thread/fbyuxyabbxq4rwbr
* http://tei.markmail.org/thread/vrwkl7kkruulyjzh

=== Transpositions ===

In a nutshell: with the parallel segmentation method, it is often cumbersome to render transpositions.

Additionally it is not possible to mark them up explicitly. [http://juxtasoftware.org/ Juxta] for example works around that by storing transposition data in a custom XML format:

<pre>
<moves>
<move doc1="1855 MS" space1="original" start1="9679" end1="10462" doc2="1881 1st Ed." space2="original" start2="9872" end2="10467" />
<move doc1="1855 MS" space1="original" start1="9679" end1="10483" doc2="1870 2nd Ed." space2="original" start2="7781" end2="8376" />
<move doc1="1855 MS" space1="original" start1="9679" end1="10504" doc2="1870 Proof" space2="original" start2="8458" end2="9056" />
<move doc1="1855 MS" space1="original" start1="9886" end1="10525" doc2="1870 1st Ed." space2="original" start2="8546" end2="9141" />
<move doc1="1870 Proof" space1="original" start1="1640" end1="1850" doc2="1881 1st Ed." space2="original" start2="2961" end2="3070" />
</moves>
</pre>

Neither is this TEI-compliant, nor is the offset/range-based addressing (@start1/@start2 and @end1/@end2) proper XML markup. A standardized encoding would be helpful.

=== Scalability ===

In a nutshell: the parallel segmentation method is difficult to handle when adding hundreds of conflicting witnesses.

=== Refactoring ===
In a nutshell: with the parallel segmentation method, it is cumbersome to add a new reading that necessitates changing where the borders of readings are drawn.

=== Conflicts between individual readings and the semantics of the structural markup that surrounds them ===
In a nutshell: with the parallel segmentation method, witnesses with different forms of lineation pose a problem.

=== Showing a lemma different from the content of the <lem> or chosen reading in an apparatus note ===

In a nutshell: depending on the desired output of your digital edition, you may need to show in the apparatus entry a lemma text different from the content of the <lem> or desired <rdg>. This is typically the case for long omissions, when one does not display the full text that is omitted by one or more witnesses, but only the beginning and end of the omitted span of text.

Use case:
<blockquote style="background:#FFEAEA">Let's consider again the example used in a previous use case:
Here is the [http://baluze.univ-avignon.fr/scan/t1/%285%29.jpg scan of a page from this edition], please consider footnote number 9.
The note contains: "9. Eodem anno, rex Francie… dampnificati, paragraphes omis par Bal.", meaning that the ''Bal.'' witness has an omission where other witnesses have two long paragraphs, the first one beginning on the previous page (see the [http://baluze.univ-avignon.fr/scan/t1/%284%29.jpg previous page scanned]).
 
You certainly do not want to generate a footnote with these two full paragraphs to tell the reader that one witness omits them, but on the other hand you want to be able to represent the source according to its various witnesses, so location-referenced is not in order.
</blockquote>

=== Representing "verbose" apparatus ===
In a nutshell: it is difficult to represent an apparatus entry written in a rather verbose way (in a print-to-digital edition). The same is true if you want to be able to generate a verbose apparatus note in a "born digital" edition.

Use cases:
<blockquote style="background:#FFEAEA">You're encoding an existing edition, and want to represent the source it edits, while keeping intact the text / apparatus of the existing edition. Some apparatus entries are easy to represent with the <app> / <lem> / <rdg> elements, some others add editorial comments to the listing of the variants, and are quite difficult to represent. BTW, the same goes when you are encoding a born-digital edition for which you want to be able to generate an alternative print output corresponding to the traditional standards of a collection.
 
A - When I have a footnote giving two lectiones from the same manuscript, one before correction and the other after: 
Text: ad lectorem Venetum (b) .
 
Note: b) ms., lectionem venerum corrigé postérieurement en lectorem Venetum
 
 
If I encode it like this, with two separate rdg for the same
witness, each with a different @type (for instance, "anteCorr" and
"postCorr"), it gives an accurate account of the state of the witness, BUT it is an
interpretation of the original note in the critical apparatus, i.e. if
I do this I delete some text added by the original editor. 

<app n="b">
 
<lem>lectorem Venetum</lem>
 
<rdg wit="#ms.2" type="anteCorr">lectionem venerum</rdg>
 
<rdg wit="#ms.2" type="postCorr">lectorem Venetum</rdg>

 
</app>
</blockquote>

 

<blockquote style="background:#FFEAEA">
Let's consider this other note. There is some text added verbosely within the apparatus note by the editor. 
Text: Hiis diebus civitas
Pergamensis(b) tenebat exersitum 
Note: b) se, mis indûment avant tenebat par le ms.

Should I encode it as: 
... Pergamensis <app
n="b"> 
    <lem/> 
    <rdg
type="addition" wit="#ms"><sic>se</sic></rdg> 
</app>... 


If one represents this note strictly with <app> / <rdg>, the remarks by the original editor are suppressed. Adding a note in the rdg to preserve the editor's comments could work here, but that is not always the case. 
Like: 
... Pergamensis <app
n="b"> 
    <lem/> 
    <rdg
type="addition" wit="#ms"><sic>se</sic> <note><hi
rend="italics">mis
indûment avant</hi> tenebat.</note></rdg> 

</app>
</blockquote>

 

<blockquote style="background:#FFEAEA">
'''Text''': …reliqui demum meos socios (d) 
'''Note''': d) domum
meam solito, Bal.; dni ou dm, ms.; en note meam solita.

Here we have 2 witnesses (Bal. et ms.), the latter with a) an uncertain
lectio ("dni" or "dm") and b) a part of the lectio which is written as
a note ("meam solita"). This is tricky to encode.
</blockquote>

=== Representation of suggestions by the editor: ''lege'' ''dele'' etc. ===

In a nutshell: Sometimes, the editor provides working suggestions through apparatus notes such as ''lege(ndum)'' ("read"), ''dele(ndum)'' ("delete") etc. They do not belong in the textual variants ''per se'', and are not attached to witnesses, although they do belong in the critical apparatus.

== An encoding proposal from the perspective of computer-aided collation tools ==

Gregor Middell gave an overview of textual variance from a software developer's perspective for the workgroup on a [[Textual_Variance|separate page]]. The models described there are used in tools like [http://collatex.sourceforge.net/ CollateX], [http://www.juxtasoftware.org/ Juxta] and [http://code.google.com/p/multiversiondocs/ nmerge].

Collecting ideas from the mailing list by James Cummings, Dan O'Donnell and Marjorie Burghart, and following the “Gothenburg model” of textual variance, a first take at separating the [http://en.wikipedia.org/wiki/Model%E2%80%93view%E2%80%93controller model from the representation] of textual variance could be structured as follows.

=== Modelling input data: Make the units of a collation addressable in the witnesses ===

The Gothenburg model assumes a [[Textual_Variance#Tokenizer|preprocessing step]] by which the witnesses get split up into '''tokens''' of the desired granularity. This granularity becomes the minimal unit of collation and can be defined as pages, paragraphs, verses, lines, words, characters or any other unit that makes sense in the context of the particular tradition under investigation. To model collation results on top of tokenized witnesses, those tokens have to be addressable.

The TEI defines an [http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SA.html#SAXP array of pointing mechanisms], which can be used to address anything from a whole XML document via URIs down to arbitrary content of those documents via sophisticated XPointer schemes. Projects would be free to choose among those mechanisms as long as each token is made available for later reference.

Examples:

<pre>

<w>The</w> <w>cat</w> <w>ate</w> <w>the</w> <w>food</w> <w>quickly</w>.

</pre>

<pre>

<w>Quickly</w>, <w>the</w> <w>cat</w> <w>ate</w> <w>the</w> <w>food</w>.

</pre>

Here tokens on the word-level could be addressed via the [http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SA.html#SATSXP xpath1() XPointer scheme]:

# <nowiki>http://edition.org/witness_1#xpath1(/p[1]/w[1])</nowiki>
# <nowiki>http://edition.org/witness_1#xpath1(/p[1]/w[2])</nowiki>
# ...

A less verbose scheme would rely on each container element of a token being identified via a (possibly autogenerated) <code>xml:id</code> attribute, like in the following verse-level tokenization.

<pre>
<lg xml:base="urn:goethe:faust2">
<l xml:id="l_1">Die Sonne sinkt, die letzten Schiffe</l>
<l xml:id="l_2">Sie ziehen munter hafenein.</l>
<l xml:id="l_3">Ein großer Kahn ist im Begriffe</l>
<l xml:id="l_4">Auf dem Canale hier zu sein.</l>
</lg>
</pre>

# <nowiki>urn:goethe:faust2#l_1</nowiki>
# <nowiki>urn:goethe:faust2#l_2</nowiki>
# ...

One can even think of reference schemes that are as independent of existing markup as possible. By introducing <anchor/> milestone elements at token boundaries and using the [http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SA.html#SATSRN range() XPointer scheme], the tokenization of arbitrary TEI documents can be accomplished, because <anchor/> is part of [http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-model.global.html model.global].

=== Modelling collated data: Encode the alignment/linking between tokens ===

After tokens in the different witnesses have been made addressable, collation data can be modelled on top of them as [[Textual_Variance#Aligner|alignments of tokens]]. An '''alignment''' can be expressed as a set of tokens from different witnesses or, in accordance with the [http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SA.html corresponding Guidelines chapter], as a link between two or more tokens.

Taking the first example from above, a collation of the two given witnesses could be expressed as

<pre>
<linkGrp type="collation">
<link target="http://edition.org/witness_1#xpath1(/p[1]/w[1]) http://edition.org/witness_2#xpath1(/p[1]/w[2])" />
<link target="http://edition.org/witness_1#xpath1(/p[1]/w[2]) http://edition.org/witness_2#xpath1(/p[1]/w[3])" />
<link target="http://edition.org/witness_1#xpath1(/p[1]/w[3]) http://edition.org/witness_2#xpath1(/p[1]/w[4])" />
<link target="http://edition.org/witness_1#xpath1(/p[1]/w[4]) http://edition.org/witness_2#xpath1(/p[1]/w[5])" />
<link target="http://edition.org/witness_1#xpath1(/p[1]/w[5]) http://edition.org/witness_2#xpath1(/p[1]/w[6])" />
<link target="http://edition.org/witness_1#xpath1(/p[1]/w[6]) http://edition.org/witness_2#xpath1(/p[1]/w[1])" type="transposition" />
</linkGrp>
</pre>
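
A tool would typically generate such a link group rather than have it written by hand. The following Python sketch builds the <code>linkGrp</code> above from a list of aligned token positions. The data layout (tuples of 1-based token indices plus an optional link type) and the function names are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

WITNESSES = ("http://edition.org/witness_1", "http://edition.org/witness_2")

# One entry per alignment: token positions in each witness, optional link type.
ALIGNMENT = [
    ((1, 2), None),
    ((2, 3), None),
    ((3, 4), None),
    ((4, 5), None),
    ((5, 6), None),
    ((6, 1), "transposition"),
]


def pointer(witness_uri, index):
    """Address the index-th word of a witness via the xpath1() scheme."""
    return "%s#xpath1(/p[1]/w[%d])" % (witness_uri, index)


def build_link_group(witness_uris, alignment):
    """Turn aligned token positions into a <linkGrp> of <link> elements."""
    group = ET.Element("linkGrp", {"type": "collation"})
    for indices, link_type in alignment:
        targets = [pointer(uri, i) for uri, i in zip(witness_uris, indices)]
        attributes = {"target": " ".join(targets)}
        if link_type is not None:
            attributes["type"] = link_type
        ET.SubElement(group, "link", attributes)
    return group


if __name__ == "__main__":
    group = build_link_group(WITNESSES, ALIGNMENT)
    print(ET.tostring(group, encoding="unicode"))
```

Since each alignment is a tuple with one position per witness, the same function would cover more than two witnesses without changes.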

Each link in this example corresponds to a row in an alignment table as depicted in the Gothenburg model description. Omitted or added tokens are expressed implicitly by not linking them to tokens in other witnesses. In other words, whether a set of tokens has been added to a witness or omitted from it is a matter of interpreting the collation data from the perspective of one witness or another, with regard to the way that witness aligns with the others.
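
The implicit encoding of additions and omissions can be made concrete with a small Python sketch. The two witnesses and their alignments below are invented for illustration (witness A with five tokens, witness B with four), and alignments are modelled as sets of (witness, token index) pairs, which is one possible representation rather than a prescribed one.

```python
def unlinked_tokens(token_count, alignments, witness):
    """Return the 1-based positions in a witness that occur in no alignment."""
    linked = {index
              for alignment in alignments
              for (name, index) in alignment
              if name == witness}
    return [i for i in range(1, token_count + 1) if i not in linked]


# Hypothetical data: A = "the cat ate the food", B = "the black cat ate".
ALIGNMENTS = [
    {("A", 1), ("B", 1)},  # the / the
    {("A", 2), ("B", 3)},  # cat / cat
    {("A", 3), ("B", 4)},  # ate / ate
]

if __name__ == "__main__":
    # Read from B's perspective these are additions in A, from A's omissions in B.
    print(unlinked_tokens(5, ALIGNMENTS, "A"))
    # "black" has no counterpart in A.
    print(unlinked_tokens(4, ALIGNMENTS, "B"))
```

The function only reports which tokens lack a counterpart. Labelling them as additions or omissions remains an interpretive step that depends on the chosen perspective.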

One advantage of encoding collation data in such a set-oriented way is its '''scalability''':

# Gradually adding witnesses to the collation may amount to adding alignments to the existing ones or modifying/augmenting the latter, depending on whether the collation is done pairwise (e. g. in relation to a base text) or via multiple alignment (e. g. without a prechosen base).
# Guiding a collation tool in producing ever more precise alignments in consecutive runs can be achieved by [[Textual_Variance#Analyzer|declaring alignments]] (for example transpositions), feeding those into the collator, adjusting the resulting alignment set, feeding it back into the collator for another run and so forth. Being able to encode the initial or preliminary results of such an iterative process in a standardized way makes it possible to run different collation tools on the same text tradition, ideally with each tool able to make use of earlier results from other tools and to contribute to the overall result.
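
As an illustration of the first point, the following Python sketch merges pairwise alignments against a base text into multi-witness alignment sets. It assumes that alignments are sets of (witness, token index) pairs and that two alignments sharing a token belong together. The function name <code>merge_alignments</code> and the sample data are hypothetical.

```python
def merge_alignments(alignments):
    """Merge alignments (sets of (witness, index) tokens) that share a token."""
    merged = []
    for alignment in alignments:
        current = set(alignment)
        untouched = []
        for existing in merged:
            if existing & current:
                current |= existing         # shared token: fold the sets together
            else:
                untouched.append(existing)  # unrelated alignment stays as it is
        untouched.append(current)
        merged = untouched
    return merged


if __name__ == "__main__":
    base_vs_b = [{("A", 1), ("B", 1)}, {("A", 2), ("B", 2)}]
    base_vs_c = [{("A", 1), ("C", 1)}, {("A", 2), ("C", 3)}]
    for alignment in merge_alignments(base_vs_b + base_vs_c):
        print(sorted(alignment))
```

Adding a further witness collated against the same base then amounts to calling the function again with the additional pairwise alignments appended.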

The major disadvantage of encoding collation data this way is its apparent lack of human readability: it is hardly possible to edit by hand, especially as the collated text tradition grows larger. This problem can only be solved via tool support.

=== Encoding the interpretation/ representation: Derive an apparatus from the collation ===

A TEI-encoded critical apparatus is one possible rendition of collation data, possibly enhanced with information yielded from interpreting the alignments. There are several ways to encode the above collation as an apparatus.

==== Apparatus pointing to the collated tokens (for easier post-processing) ====

<pre>
<app>
<rdg wit="http://edition.org/witness_1" xml:base="http://edition.org/witness_1" />
<rdg wit="http://edition.org/witness_2" xml:base="http://edition.org/witness_2" xml:id="w2_1">
<ptr target="#xpath1(/p[1]/w[1])" />
</rdg>
</app>
<app>
<rdg wit="http://edition.org/witness_1" xml:base="http://edition.org/witness_1">
<ptr target="#xpath1(/p[1]/w[1])" />
<ptr target="#xpath1(/p[1]/w[2])" />
<ptr target="#xpath1(/p[1]/w[3])" />
<ptr target="#xpath1(/p[1]/w[4])" />
<ptr target="#xpath1(/p[1]/w[5])" />
</rdg>
<rdg wit="http://edition.org/witness_2" xml:base="http://edition.org/witness_2">
<ptr target="#xpath1(/p[1]/w[2])" />
<ptr target="#xpath1(/p[1]/w[3])" />
<ptr target="#xpath1(/p[1]/w[4])" />
<ptr target="#xpath1(/p[1]/w[5])" />
<ptr target="#xpath1(/p[1]/w[6])" />
</rdg>
</app>
<app>
<rdg wit="http://edition.org/witness_1" xml:base="http://edition.org/witness_1" corresp="#w2_1">
<ptr target="#xpath1(/p[1]/w[6])" />
</rdg>
<rdg wit="http://edition.org/witness_2" xml:base="http://edition.org/witness_2" />
</app>
</pre>
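
One step in deriving such an apparatus automatically is to group consecutive links into segments. The Python sketch below groups links into runs in which the token positions of both witnesses advance by one, and builds one <code>app</code> element per run. It is a simplification under stated assumptions: it handles two witnesses only, the function names are hypothetical, and it does not reproduce the handling of the transposition via empty readings and <code>corresp</code> shown above.

```python
import xml.etree.ElementTree as ET

XML_BASE = "{http://www.w3.org/XML/1998/namespace}base"
WITNESSES = ("http://edition.org/witness_1", "http://edition.org/witness_2")

# Token positions linked in the collation example (witness 1, witness 2).
LINKS = [(1, 2), (2, 3), (3, 4), (4, 5), (5, 6), (6, 1)]


def segment(links):
    """Group links into runs in which both positions advance by one."""
    runs = []
    for first, second in links:
        if runs and runs[-1][-1] == (first - 1, second - 1):
            runs[-1].append((first, second))   # continues the current run
        else:
            runs.append([(first, second)])     # starts a new run
    return runs


def build_app(run):
    """Build an <app> with one <rdg> per witness pointing at the run's tokens."""
    app = ET.Element("app")
    for position, uri in enumerate(WITNESSES):
        rdg = ET.SubElement(app, "rdg", {"wit": uri, XML_BASE: uri})
        for link in run:
            target = "#xpath1(/p[1]/w[%d])" % link[position]
            ET.SubElement(rdg, "ptr", {"target": target})
    return app


if __name__ == "__main__":
    for run in segment(LINKS):
        print(ET.tostring(build_app(run), encoding="unicode"))
```

For the example, this yields one run of five links and a second run containing only the transposed token, which corresponds to the segmentation used in the apparatus above.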

==== Apparatus with embedded textual content (for readability) ====

<pre>
<app>
<rdg wit="http://edition.org/witness_1" />
<rdg wit="http://edition.org/witness_2" xml:id="w2_1">Quickly,</rdg>
</app>
<app>
<rdg wit="http://edition.org/witness_1">The cat ate the food</rdg>
<rdg wit="http://edition.org/witness_2">the cat ate the food.</rdg>
</app>
<app>
<rdg wit="http://edition.org/witness_1" corresp="#w2_1">quickly.</rdg>
<rdg wit="http://edition.org/witness_2" />
</app>
</pre>

Some problems here:

* @corresp vs. <link/> for transpositions over more than two witnesses
* How to derive the segment content from the original witness automatically, if the token content does not add up to it (e. g. because of punctuation being excluded from the tokens from the start)?

== Bibliography ==

* O'Donnell, Daniel Paul. [http://etjanst.hb.se/bhs/ith/1-8/dpo.pdf “The Ghost in the Machine: Revisiting an Old Model for the Dynamic Generation of Digital Editions.”] HumanIT 8.1 (2005): 51-71.
* Vetter, L. and McDonald, J. ‘Witnessing Dickinson’s Witnesses’, Literary and Linguistic Computing, 18.2: 2003, 151-165.
* [http://eprints.qut.edu.au/38436/ Schmidt, D., 2010. The inadequacy of embedded markup for cultural heritage texts. Literary and Linguistic Computing, 25(3), pp. 337-356.]
[[Category:SIG:Manuscripts]]