Difference between revisions of "Critical Apparatus Workgroup"
Revision as of 18:27, 1 April 2011
The Critical Apparatus workgroup is part of the TEI special interest group on manuscripts (SIG:MSS). This page provides a summary of the preliminary discussions regarding the current issues with the critical apparatus chapter.
Participants in the preliminary workgroup: Marjorie Burghart (MB), James Cummings (JC), Fotis Jannidis (FJ), Gregor Middell (GM), Dan O'Donnell (DOD), Espen Ore (EO), Elena Pierazzo (EP), Roberto Rosselli del Turco (RDT), Chris Wittern (CW)
Contents
- 1 A preliminary vocabulary question
- 2 Issues with the current Critical Apparatus chapter/module
- 2.1 A reading covering several paragraphs
- 2.2 Transpositions
- 2.3 Scalability
- 2.4 Refactoring
- 2.5 Conflicts between individual readings and the semantics of the structural markup that surrounds them
- 2.6 Showing a lemma different from the content of the <lem> or chosen reading in an apparatus note
- 2.7 Representing "verbose" apparatus
- 2.8 Representation of suggestions by the editor: lege dele etc.
- 3 An encoding proposal from the perspective of computer-aided collation tools
- 4 Bibliography
A preliminary vocabulary question
The very name of the chapter, "Critical apparatus", is felt by some to be a problem: the critical apparatus is inherited from the printed world and is just one of the possible physical embodiments of TEXTUAL VARIANCE. EP therefore proposes to use this new name, moving from "critical apparatus" to "textual variance".
MB argues that, oddly, "textual variance" feels more restrictive to her than "critical apparatus": it is a notion linked with Cerquiglini's work, which does not correspond to every branch of textual criticism. On the other hand, strictly speaking, the "critical apparatus" is not limited to registering the variants of the several witnesses of a text. It also includes various kinds of notes (identification of the sources of the text, historical notes, etc.). Even texts with a single witness may have a critical apparatus. Maybe the problem with the name has its origins in the choice of giving the name "critical apparatus" to a part of the guidelines dedicated solely to the registration of textual variants.
FJ argues that for German ears the concept of textual variance is not closely connected to a specific scholar.
MB proposes to use "TEXTUAL VARIANTS" instead, since it focuses more on actual elements in the edition, when "variance" is nothing concrete but a phenomenon.
Side remark by MB: this vocabulary question might prove sticky in the end. The <app> element is named <app> because it is considered "an apparatus entry", so unless we end up recommending a change to the element names, the phrase "critical apparatus" will still be used in the module, at least to explain the tag names.
RDT argues that while backward compatibility is clearly a bonus, as MB states <app> stands for 'apparatus entry': we shouldn't be afraid to change its function, for instance making it a container instead of a phrase level element. RDT stresses that he is proposing this by way of example, and to stress that our focus is on variants: these might then be organised in <app>s for traditional CA display, and/or in other, new ways for electronic display. Note that this might mean no traditional critical apparatus in a digital edition.
MB: It is characteristic of a print-based approach to encoding that the <app> element was considered as encoding an apparatus entry (hence the <app> name), when what it really encodes is a locus where different witnesses have variant readings (which would probably have justified a name along the lines of <locus> or whatnot).
JC thinks this points to a slight divergence at the heart of the current critical apparatus recommendations: between encoding an apparatus at the site of textual variance and encoding a structured view of a note entirely separate from the edited version of the text. (In mass digitization of critical editions, for example, one might have an <app> in a set of notes at the bottom of the page which are not encoded at the site of variance, or indeed necessarily connected with it.) The recommendations strive both to encode all sorts of legacy forms of apparatus and, simultaneously, to cater for those who are recording the structure from which they will generate an apparatus in some output. So JC would argue that the first of these is an apparatus and the second is a site/locus of textual variance.
Issues with the current Critical Apparatus chapter/module
Preliminary notice: most of the issues raised here are connected with the parallel segmentation method, not because it is the most flawed, but because it is the one most used by the members of this group. While location-referenced and double-end-point-attachment might be useful for mass conversion of printed material (for the former) and/or when using a piece of software handling the encoding (for the latter), the parallel segmentation method seems to be the easiest and most powerful way to encode the critical apparatus "by hand".
Also, one might point out that most of the issues raised here might be solved with standoff encoding. But this is extremely cumbersome to handle without the aid of software, and it does not correspond to the way most people work.
A reading covering several paragraphs
In a nutshell: the <app> element is phrase-level, when it really should be allowed to include paragraphs, and even <div>s.
Use case:
I'm encoding a 19th c. edition of a medieval text, and one of the witnesses omits several paragraphs. Of course, the TEI schema won't let me put <p> elements inside an <app>/<lem> element...
- I use the parallel segmentation method
- It is important to me to keep a methodical link between the encoded apparatus and the note numbers in the original edition (the @n of each <app> tag bears the number of the footnote in the original edition)
Here is the scan of a page from this edition, please consider footnote number 9. The note contains: "9. Eodem anno, rex Francie… dampnificati, paragraphes omis par Bal.", meaning that the Bal. witness has an omission where other witnesses have two long paragraphs, the first one beginning on the previous page (see the previous page scanned).
Transpositions
In a nutshell: with the parallel segmentation method, it is often cumbersome to render transpositions.
Additionally it is not possible to mark them up explicitly. Juxta for example works around that by storing transposition data in a custom XML format:
<moves>
  <move doc1="1855 MS" space1="original" start1="9679" end1="10462" doc2="1881 1st Ed." space2="original" start2="9872" end2="10467" />
  <move doc1="1855 MS" space1="original" start1="9679" end1="10483" doc2="1870 2nd Ed." space2="original" start2="7781" end2="8376" />
  <move doc1="1855 MS" space1="original" start1="9679" end1="10504" doc2="1870 Proof" space2="original" start2="8458" end2="9056" />
  <move doc1="1855 MS" space1="original" start1="9886" end1="10525" doc2="1870 1st Ed." space2="original" start2="8546" end2="9141" />
  <move doc1="1870 Proof" space1="original" start1="1640" end1="1850" doc2="1881 1st Ed." space2="original" start2="2961" end2="3070" />
</moves>
Neither is this TEI-compliant, nor is the offset/range-based addressing (@start1/@start2 and @end1/@end2) proper XML markup. A standardized encoding would be helpful.
Scalability
In a nutshell: the parallel segmentation method is difficult to handle when adding hundreds of conflicting witnesses.
Refactoring
In a nutshell: with the parallel segmentation method, it is cumbersome to add a new reading that necessitates changing where the borders of readings are drawn.
Conflicts between individual readings and the semantics of the structural markup that surrounds them
In a nutshell: with the parallel segmentation method, witnesses with different forms of lineation pose a problem.
Showing a lemma different from the content of the <lem> or chosen reading in an apparatus note
In a nutshell: depending on the desired output of your digital edition, you may need to show in the apparatus entry a lemma text different from the content of the <lem> or desired <rdg>. This is typically the case for long omissions, when one does not display the full text that is omitted by one or more witnesses, but only the beginning and end of the omitted span of text.
Use case:
Let's consider again the example used in a previous use case:
Here is the scan of a page from this edition, please consider footnote number 9. The note contains: "9. Eodem anno, rex Francie… dampnificati, paragraphes omis par Bal.", meaning that the Bal. witness has an omission where other witnesses have two long paragraphs, the first one beginning on the previous page (see the previous page scanned).
You certainly do not want to generate a footnote with these two full paragraphs to tell the reader that one witness omits them, but on the other hand you want to be able to represent the source according to its various witnesses, so the location-referenced method is not an option.
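As an illustration of the rendering side of this problem, here is a minimal Python sketch of how an output stage might shorten a long reading to its opening and closing words for display in a note. The function name `abbreviate_lemma`, its parameters and the filler text are assumptions made for this example only; the real paragraphs of the edition are not reproduced here, and the sketch does not address how the abbreviated form should be encoded in TEI.

```python
def abbreviate_lemma(text, head=4, tail=1, ellipsis="…"):
    """Shorten a long reading to its first `head` and last `tail` words.

    Hypothetical helper: readings short enough to show in full are
    returned unchanged (apart from whitespace normalization).
    """
    if head < 1 or tail < 1:
        raise ValueError("head and tail must both be at least 1")
    words = text.split()
    if len(words) <= head + tail:
        return " ".join(words)
    opening = " ".join(words[:head])
    closing = " ".join(words[-tail:])
    return f"{opening}{ellipsis} {closing}"


# Placeholder reading: only the first and last words come from footnote 9;
# the middle is filler standing in for the two omitted paragraphs.
long_reading = "Eodem anno, rex Francie " + "lorem " * 40 + "dampnificati"

print(abbreviate_lemma(long_reading))
```

The full text stays in the encoded reading; only the displayed lemma is shortened, which is the distinction this use case asks the encoding to support.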
Representing "verbose" apparatus
In a nutshell: it is difficult to represent an apparatus entry written in a rather verbose way (in a print-to-digital edition). The same is true if you want to be able to generate a verbose apparatus note in a "born digital" edition.
Use cases:
You're encoding an existing edition, and want to represent the source it edits, while keeping intact the text / apparatus of the existing edition. Some apparatus entries are easy to represent with the <app> / <lem> / <rdg> elements, some others add editorial comments to the listing of the variants, and are quite difficult to represent. BTW, the same goes when you are encoding a born-digital edition for which you want to be able to generate an alternative print output corresponding to the traditional standards of a collection.
A - When I have a footnote giving two lectiones from the same manuscript, one before correction and the other after:
Text: ad lectorem Venetum (b) .
Note: b) ms., lectionem venerum corrigé postérieurement en lectorem Venetum
If I encode it like this, with two separate <rdg> elements for the same witness, each with a different @type (for instance, "anteCorr" and "postCorr"), it gives an accurate account of the state of the witness, BUT it is an interpretation of the original note in the critical apparatus, i.e. if I do this I delete some text added by the original editor.
<app n="b">
<lem>lectorem Venetum</lem>
<rdg wit="#ms.2" type="anteCorr">lectionem venerum</rdg>
<rdg wit="#ms.2" type="postCorr">lectorem Venetum</rdg>
</app>
Let's consider this other note. There is some text added verbosely within the apparatus note by the editor.
Text: Hiis diebus civitas Pergamensis(b) tenebat exersitum
Note: b) se, mis indûment avant tenebat par le ms.
Should I encode it as:
... Pergamensis <app n="b">
<lem/>
<rdg type="addition" wit="#ms"><sic>se</sic></rdg>
</app>...
If one represents this note strictly with <app> / <rdg>, the remarks by the original editor are suppressed. Adding a note in the <rdg> to preserve the editor's comments could work here, but that is not always the case.
Like:
... Pergamensis <app n="b">
<lem/>
<rdg type="addition" wit="#ms"><sic>se</sic> <note><hi rend="italics">mis indûment avant</hi> tenebat.</note></rdg>
</app>
Text: …reliqui demum meos socios (d)
Note: d) domum meam solito, Bal.; dni ou dm, ms.; en note meam solita.
Here we have 2 witnesses (Bal. and ms.), the latter with a) an uncertain lectio ("dni" or "dm") and b) a part of the lectio which is written as a note ("meam solita"). This is tricky to encode.
Representation of suggestions by the editor: lege dele etc.
In a nutshell: Sometimes, the editor provides working suggestions through apparatus notes such as lege(ndum) ("read"), dele(ndum) ("delete") etc. They do not belong in the textual variants per se, and are not attached to witnesses, although they do belong in the critical apparatus.
An encoding proposal from the perspective of computer-aided collation tools
Gregor Middell gave an overview of textual variance from a software developer's perspective for the workgroup on a separate page. The models described there are used in tools like CollateX, Juxta and nmerge.
Collecting ideas from the mailing list by James Cummings, Dan O'Donnell and Marjorie Burghart as well as following the “Gothenburg model” of textual variance, a first take at separating the model from the representation of textual variance could be structured as follows.
Modelling input data: Make the units of a collation addressable in the witnesses
The Gothenburg model assumes a preprocessing step by which the witnesses get split up into tokens of the desired granularity. This granularity becomes the minimal unit of collation and can be defined as pages, paragraphs, verses, lines, words, characters or any other unit that makes sense in the context of a particular tradition under investigation. To model collation results on top of tokenized witnesses, those tokens have to be addressable.
The TEI defines an array of pointing mechanisms, which can be used to address anything from a whole XML document via URIs down to arbitrary content of those documents via sophisticated XPointer schemes. Projects would be free to choose among those mechanisms as long as each token is made available for later reference.
Examples:
<p xml:base="http://edition.org/witness_1">
  <w>The</w> <w>cat</w> <w>ate</w> <w>the</w> <w>food</w> <w>quickly</w>.
</p>
<p xml:base="http://edition.org/witness_2">
  <w>Quickly</w>, <w>the</w> <w>cat</w> <w>ate</w> <w>the</w> <w>food</w>.
</p>
Here tokens on the word-level could be addressed via the xpath1() XPointer scheme:
- http://edition.org/witness_1#xpath1(/p[1]/w[1])
- http://edition.org/witness_1#xpath1(/p[1]/w[2])
- ...
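The enumeration above can be generated mechanically. The following Python sketch, using only the standard library, derives one xpath1() pointer per <w> element of the first witness. The function name `word_pointers` is an assumption for this example, and the sketch only handles <w> elements that are direct children of the root element, as in the sample.

```python
import xml.etree.ElementTree as ET

# The xml: prefix is predeclared, so xml:base parses under this namespace.
XML_NS = "http://www.w3.org/XML/1998/namespace"

WITNESS_1 = (
    '<p xml:base="http://edition.org/witness_1">'
    "<w>The</w> <w>cat</w> <w>ate</w> <w>the</w> <w>food</w> <w>quickly</w>."
    "</p>"
)


def word_pointers(witness_xml):
    """Return (pointer, token text) pairs for each <w> child of the root."""
    root = ET.fromstring(witness_xml)
    base = root.get(f"{{{XML_NS}}}base")
    pointers = []
    # XPath positions are 1-based, hence start=1.
    for position, word in enumerate(root.findall("w"), start=1):
        path = f"/{root.tag}[1]/w[{position}]"
        pointers.append((f"{base}#xpath1({path})", word.text))
    return pointers


for pointer, text in word_pointers(WITNESS_1):
    print(pointer, text)
```

Positional pointers of this kind break as soon as a token is inserted or removed in the witness, which is one reason a project might prefer identifiers.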
A less verbose scheme would rely on each container element of a token being identified via a (possibly autogenerated) xml:id attribute, like in the following verse-level tokenization.
<lg xml:base="urn:goethe:faust2">
  <l xml:id="l_1">Die Sonne sinkt, die letzten Schiffe</l>
  <l xml:id="l_2">Sie ziehen munter hafenein.</l>
  <l xml:id="l_3">Ein großer Kahn ist im Begriffe</l>
  <l xml:id="l_4">Auf dem Canale hier zu sein.</l>
</lg>
- urn:goethe:faust2#l_1
- urn:goethe:faust2#l_2
- ...
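To illustrate the "possibly autogenerated" part, here is a small Python sketch that fills in missing xml:id attributes and returns one URI per token. The function name `token_uris`, the `l_` prefix convention and the decision to drop two of the identifiers from the sample are assumptions made for the example, not a recommendation of the workgroup.

```python
import xml.etree.ElementTree as ET

XML_NS = "http://www.w3.org/XML/1998/namespace"
ID_ATTR = f"{{{XML_NS}}}id"

# Same verses as above, but with two identifiers deliberately left out.
VERSES = """<lg xml:base="urn:goethe:faust2">
  <l xml:id="l_1">Die Sonne sinkt, die letzten Schiffe</l>
  <l>Sie ziehen munter hafenein.</l>
  <l xml:id="l_3">Ein großer Kahn ist im Begriffe</l>
  <l>Auf dem Canale hier zu sein.</l>
</lg>"""


def token_uris(container_xml, token_tag="l", prefix="l_"):
    """Give every token element an xml:id and return the token URIs."""
    root = ET.fromstring(container_xml)
    base = root.get(f"{{{XML_NS}}}base")
    tokens = root.findall(token_tag)
    used = {t.get(ID_ATTR) for t in tokens if t.get(ID_ATTR)}
    counter = 0
    uris = []
    for token in tokens:
        if token.get(ID_ATTR) is None:
            # Pick the next identifier that is not already taken.
            counter += 1
            while f"{prefix}{counter}" in used:
                counter += 1
            new_id = f"{prefix}{counter}"
            token.set(ID_ATTR, new_id)
            used.add(new_id)
        uris.append(f"{base}#{token.get(ID_ATTR)}")
    return uris


print(token_uris(VERSES))
```

Existing identifiers are kept untouched, so references that already point into the witness remain valid after the generation step.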
One can even think of reference schemes that are as independent of existing markup as possible. By introducing <anchor/> milestone elements at token boundaries and using the range() XPointer scheme, arbitrary TEI documents can be tokenized: <anchor/> is part of model.global and can therefore be placed at almost any point in a document.
Modelling collated data: Encode the alignment/linking between tokens
- link tokens from different witnesses as they are identified as being the same, different, transposed ...
- tokens which are not linked to tokens in other witnesses can be interpreted as being added/omitted
- encoding of linking/alignment via the standard means described in the corresponding guidelines chapter, i.e. via <link/> etc.; cf. alignment details
- links/alignments can be computed by a collation tool or can be declared by the user ahead of the automated collation process, thereby guiding the collation tool in cases where it cannot decide algorithmically which tokens to align; cf. analyzer details
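The points above can be sketched in a few lines of Python. Note that this uses `difflib.SequenceMatcher` from the standard library as a stand-in aligner; it is not the algorithm of CollateX, Juxta or nmerge, and the function name `align`, the lower-casing normalization and the simple transposition heuristic are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def align(tokens_a, tokens_b, normalize=str.lower):
    """Link tokens of two witnesses; report unlinked and transposed tokens.

    Tokens are identified by their index in each witness.
    """
    norm_a = [normalize(t) for t in tokens_a]
    norm_b = [normalize(t) for t in tokens_b]
    matcher = SequenceMatcher(a=norm_a, b=norm_b, autojunk=False)

    # Tokens inside a matching block are linked as "the same".
    links = []
    for block in matcher.get_matching_blocks():
        for offset in range(block.size):
            links.append((block.a + offset, block.b + offset))

    linked_a = {i for i, _ in links}
    linked_b = {j for _, j in links}
    unlinked_a = [i for i in range(len(tokens_a)) if i not in linked_a]
    unlinked_b = [j for j in range(len(tokens_b)) if j not in linked_b]

    # Naive heuristic: an unlinked token with an identical unlinked
    # counterpart in the other witness is treated as transposed.
    transposed = []
    remaining_b = list(unlinked_b)
    for i in unlinked_a:
        for j in remaining_b:
            if norm_a[i] == norm_b[j]:
                transposed.append((i, j))
                remaining_b.remove(j)
                break

    moved_a = {i for i, _ in transposed}
    return {
        "links": links,
        "transposed": transposed,
        # Whatever is left over can be read as omitted/added.
        "only_in_a": [i for i in unlinked_a if i not in moved_a],
        "only_in_b": remaining_b,
    }


witness_1 = ["The", "cat", "ate", "the", "food", "quickly"]
witness_2 = ["Quickly", "the", "cat", "ate", "the", "food"]
print(align(witness_1, witness_2))
```

For the two sample witnesses, the five words "the cat ate the food" are linked in order, and "quickly" is reported as a transposition rather than as an omission in one witness and an addition in the other. A user-declared alignment could be modelled by seeding `links` before the automatic step, which this sketch does not do.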
Encoding the interpretation/ representation: Derive an apparatus from the collation
- create a (possibly commented) apparatus from the alignment information: <app/>, <rdg/> etc.
- compress consecutive aligning tokens into segments
- link transposed segments explicitly
- either embed the textual content of the tokens into the generated apparatus (for readability) or point from the apparatus segments into the witness (for easier processing)
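As a rough illustration of the first two steps and of the "embed the textual content" option, the following Python sketch takes an alignment table (one row per aligned position, one column per witness, `None` for a gap), compresses consecutive rows into segments and serializes them with <app> and <rdg>. The table layout, the function names, the sigla W1/W2 and the <ab> wrapper are assumptions for this example; explicit linking of transposed segments is not covered.

```python
import xml.etree.ElementTree as ET


def compress(table, sigla):
    """Merge consecutive rows into (is_variant, {siglum: [tokens]}) segments."""
    segments = []
    for row in table:
        is_variant = len({row.get(s) for s in sigla}) > 1
        if segments and segments[-1][0] == is_variant:
            readings = segments[-1][1]  # extend the current segment
        else:
            readings = {s: [] for s in sigla}
            segments.append((is_variant, readings))
        for s in sigla:
            if row.get(s) is not None:
                readings[s].append(row[s])
    return segments


def append_text(parent, text):
    """Append character data after the last child, or as leading text."""
    if len(parent):
        parent[-1].tail = (parent[-1].tail or "") + text
    else:
        parent.text = (parent.text or "") + text


def to_apparatus(segments, sigla):
    """Serialize segments; witnesses sharing a reading share one <rdg>."""
    ab = ET.Element("ab")
    for is_variant, readings in segments:
        if not is_variant:
            append_text(ab, " " + " ".join(readings[sigla[0]]) + " ")
            continue
        app = ET.SubElement(ab, "app")
        grouped = {}
        for s in sigla:
            grouped.setdefault(" ".join(readings[s]), []).append(f"#{s}")
        for text, wits in grouped.items():
            rdg = ET.SubElement(app, "rdg", wit=" ".join(wits))
            rdg.text = text  # empty string means the witness omits the text
    return ET.tostring(ab, encoding="unicode")


sigla = ["W1", "W2"]
table = [
    {"W1": None, "W2": "quickly"},
    {"W1": "the", "W2": "the"},
    {"W1": "cat", "W2": "cat"},
    {"W1": "ate", "W2": "ate"},
    {"W1": "the", "W2": "the"},
    {"W1": "food", "W2": "food"},
    {"W1": "quickly", "W2": None},
]
segments = compress(table, sigla)
print(to_apparatus(segments, sigla))
```

Pointing from the apparatus into the witnesses instead of embedding text would replace the token strings in the table with token addresses such as the URIs discussed earlier.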
Bibliography
- O'Donnell, Daniel Paul. “The Ghost in the Machine: Revisiting an Old Model for the Dynamic Generation of Digital Editions.” HumanIT 8.1 (2005): 51-71.
- Vetter, L. and McDonald, J. ‘Witnessing Dickinson’s Witnesses’, Literary and Linguistic Computing, 18.2: 2003, 151-165.
- Schmidt, D., 2010. The inadequacy of embedded markup for cultural heritage texts. Literary and Linguistic Computing, 25(3), pp. 337-356.