Difference between revisions of "Stand-off use cases"

Revision as of 01:15, 11 May 2012


Use cases

This is a list of use cases for stand-off markup in TEI. The examples are an attempt to build common ground on which to move forward and are the (by-)product of a discussion on TEI-SOM. Initials in parentheses indicate discussion members who find a use case important or motivating.

  1. A third party publishes a text in non-TEI XML at a stable URL. You want to perform linguistic annotations on the text without losing reference to the underlying third party text. A TEI text is built with references back to the original non-TEI XML.
  2. A third party publishes a text in TEI at a stable URL. You want to perform linguistic annotations on the text without losing reference to the underlying third party text. A new TEI text is built with references back to the original TEI.
  3. A third party publishes a text in TEI XML at a stable URL. You want to perform linguistic annotations on the text without losing reference to the underlying third party text. A TEI text is built with references back to the original XML. (SY)
  4. There are multiple competing markups of linguistic information in a TEI text. The 'obvious' one is marked up inline and the others are relegated to stand-off markup. Tools are used to bring each alternative markup to the fore. (SY)
  5. There are multiple competing markups of linguistic information in a TEI text. Rather than privilege one markup, all are relegated to stand-off markup. Tools are used to bring each alternative markup to the fore. (SY)
  6. A third party publishes a text in non-TEI XML at a stable URL. You wish to use TEI-based tools to find and correct errors in original XML and feed those corrections back to the publisher in a format native to them. (SY)
  7. An automated tool is run against an XML source, generating a TEI document which encodes information discovered in that source. For instance, you might run a process against an RSS news feed and generate a TEI document containing an analysis of it. You might store a copy of the feed with your TEI document, but you're not really interested in editing it; you're interested in what your process discovered about it (you might do sentiment analysis or something like that). Using TEI pointers, you could point at target words or phrases in the RSS feed which form part of the analysis. (from here) (MH)
  8. A text is being published for the first time; TEI with stand-off markup is being used for the linguistic information. (SY)
  9. For the purpose of a training course, you want a group of students to do preliminary analysis of the same text (identifying salient features they would choose to mark up, delimiting structural blocks, etc.). An interface allows students to select points and ranges in the text, and attach annotations to them; these are stored in TEI as stand-off markup, and a rendering engine allows the group to view and discuss the annotations. Actual inline TEI can be generated from them when students are ready to begin their markup projects. (MH)
  10. Textual fragments from within a TEI text are extracted and used in an index into the text as part of the end-matter, in a manner akin to a back-of-the-book index. (see Papyri.info example below) (HC)
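
The last step of use case 9 (generating inline markup from stored stand-off annotations) might be sketched as follows. This is a minimal illustration rather than a TEI implementation: the `Annotation` class, the use of character offsets, and the `<seg type="...">` output are all assumptions made for the example, and only non-overlapping ranges are handled.

```python
from dataclasses import dataclass
from xml.sax.saxutils import escape, quoteattr


@dataclass(frozen=True)
class Annotation:
    """Hypothetical stand-off annotation: a half-open character range plus a label."""
    start: int
    end: int
    label: str


def to_inline(text, annotations):
    """Return a copy of `text` with each annotated range wrapped in <seg>.

    The source text itself is never modified; the annotations live beside it.
    """
    out = []
    pos = 0
    for ann in sorted(annotations, key=lambda a: a.start):
        if ann.start < pos or ann.start > ann.end or ann.end > len(text):
            raise ValueError(f"overlapping or out-of-range annotation: {ann}")
        out.append(escape(text[pos:ann.start]))
        out.append(
            f"<seg type={quoteattr(ann.label)}>"
            f"{escape(text[ann.start:ann.end])}</seg>"
        )
        pos = ann.end
    out.append(escape(text[pos:]))
    return "".join(out)


source = "In the eighth year of the reign of John"
notes = [Annotation(35, 39, "name"), Annotation(7, 18, "date")]
inline = to_inline(source, notes)
print(inline)
```

Keeping the ranges separate from the text is what lets several students annotate the same passage independently; overlapping annotations would need a richer serialisation than this sketch provides.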

Challenging aspects of TEI texts

There are a number of challenging features sometimes found in TEI-encoded texts. These include:

  1. non-ASCII character sets
  2. non-Unicode character sets (see wh ligature)
  3. multicultural naming patterns (proper nouns in different language(s) or scripts from the underlying text)
  4. texts with errors, omissions or additions
  5. compound texts which are compiled with other texts not of interest
  6. linguistically and etymologically complex features of interest: dates based on names ("In the eighth year of the reign of John"); places named after people ("Washington");
  7. existing semantic, linguistic, bibliographic or provenance tagging which needs to be integrated with the new taggings (usually the new taggings are linguistic in nature).



Example from Papyri.info of personal name standoff markup

(HC)

See O.Leid. 24

The standoff annotation section starts at line 87.

The use case is that I want to have an automated process attempt to recognize personal names in the text, and then provide an interface for users to correct the automated annotations. Because the text will be edited using a non-TEI interface that does not understand <persName> and the like, I cannot do this with inline markup. Inserting the name markup inline might be tough on the existing element hierarchy anyway. The data from the annotations will be submitted to a partner site, and we will keep the annotations as a record.

Note: I'm doing several things "wrong" here for the sake of argument. For starters, string-range() does not allow you to use a node without text in it as the first argument. I'd like to change that to allow the starting point to be, e.g. an <lb> (note, however, that if this were a sticking point, an XPath like //lb[@n='1']/following-sibling::text()[1] would work, it's just more verbose). Second, my use of string-range() is incorrect because it does not allow an XPath as the first argument. To be correct according to the current spec, I'd have to do something like:

#string-range(xpath1(//lb[@n='1']/following-sibling::text()[1]),3,6)

instead of

#string-range(//lb[@n='1'],3,6)
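
To make the intended behaviour concrete, here is a rough Python sketch of resolving such a pointer against the text that follows an `<lb>`. It rests on several assumptions: the XML fragment is made up (it is not the O.Leid. 24 transcription and has no TEI namespace), the helper name `string_range_after_lb` is hypothetical, the offset is treated as a zero-based character count, and ElementTree's `.tail` only approximates `following-sibling::text()[1]` (it is empty when another element follows the `<lb>` directly).

```python
import xml.etree.ElementTree as ET


def string_range_after_lb(root, n, offset, length):
    """Return `length` characters, starting `offset` characters into the
    text that immediately follows the <lb> whose @n equals `n`."""
    for lb in root.iter("lb"):
        if lb.get("n") == n:
            # ElementTree keeps the text after an element in its .tail
            following = lb.tail or ""
            if offset < 0 or length < 0 or offset + length > len(following):
                raise ValueError("range falls outside the text after <lb>")
            return following[offset:offset + length]
    raise LookupError(f"no <lb n={n!r}> found")


# Made-up fragment for illustration only.
doc = ET.fromstring(
    '<ab><lb n="1"/>to Κράτης son of<lb n="2"/>another line</ab>'
)
name = string_range_after_lb(doc, "1", 3, 6)
print(name)
```

Because the `<lb>` itself contains no text, the range has to be counted in the text after it, which is the behaviour the proposed relaxation of string-range() would allow.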

Note too that since I'm keeping a copy of the recognized name, a scheme like match() might be a better fit, though again, I'd propose doing away with the restriction that the node pointed to must contain the text. In that case, I might posit something like:

#match(//lb[@n='1'],'Κράτης')
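
A match()-style lookup could be sketched in the same spirit. Again this is only an illustration under assumptions: the fragment is invented, `match_after_lb` is a hypothetical helper, the pattern is treated as a Python regular expression, and the search is limited to the text directly after the `<lb>`.

```python
import re
import xml.etree.ElementTree as ET


def match_after_lb(root, n, pattern):
    """Search the text following <lb n="..."> for `pattern`.

    Returns (start, end) character offsets within that text, or None
    if the pattern does not occur there.
    """
    for lb in root.iter("lb"):
        if lb.get("n") == n:
            found = re.search(pattern, lb.tail or "")
            return found.span() if found else None
    raise LookupError(f"no <lb n={n!r}> found")


# Made-up fragment for illustration only.
doc = ET.fromstring(
    '<ab><lb n="1"/>to Κράτης son of<lb n="2"/>another line</ab>'
)
span = match_after_lb(doc, "1", "Κράτης")
print(span)
```

Since the annotation already stores the recognized name, matching on the string is less brittle than fixed offsets if the surrounding text is later corrected, although it becomes ambiguous when the same name occurs twice on a line.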