Difference between revisions of "Stand-off use cases"

Latest revision as of 06:48, 11 May 2012

For technical methods that might be used to implement these, see XPointer.

Use cases

This is a list of use cases for using stand-off markup in TEI. These examples are an attempt to build a shared common ground on which to move forward and are the (by-)product of a discussion on TEI-SOM. Initials indicate discussion members who find this use case important or motivating.

A third party publishes a text in non-TEI XML at a stable URL. You want to perform linguistic annotations on the text without losing reference to the underlying third party text. A TEI text is built with references back to the original non-TEI XML.
A third party publishes a text in TEI at a stable URL. You want to perform linguistic annotations on the text without losing reference to the underlying third party text. A new TEI text is built with references back to the original TEI.
A third party publishes a text in TEI XML at a stable URL. You want to perform linguistic annotations on the text without losing reference to the underlying third party text. A TEI text is built with references back to the original XML. (SY)
There are multiple competing markings-up of linguistic information in a TEI text. The 'obvious' one is marked up inline and the others are relegated to stand-off markup. Tools are used to bring each alternative markup to the fore. (SY)
There are multiple competing markings-up of linguistic information in a TEI text. Rather than privilege one marking-up, all are relegated to stand-off markup. Tools are used to bring each alternative markup to the fore. (SY)
A third party publishes a text in non-TEI XML at a stable URL. You wish to use TEI-based tools to find and correct errors in original XML and feed those corrections back to the publisher in a format native to them. (SY)
A third party publishes an RSS-XML source which is converted to TEI with links back to the original RSS for processing and analysis (sentiment analysis, named entity extraction, etc). The interest is not in the source text per se, but in the resulting analysis and the textual fragments in the source supporting that analysis. from here (MH)
A text is being published for the first time; TEI with stand-off markup is being used for the linguistic information. (SY)
Multiple independent markings-up of a TEI-encoded text are done in parallel, for example when there is contention as to the underlying meaning or when the text is set as a pedagogical mark-up exercise. (MH)
Textual fragments from within a TEI text are extracted and used in an index into the text as part of the end-matter, in a manner akin to a back-of-the-book index. (see papyri.info example below) (HC)

Challenging aspects of TEI texts

There are a number of challenging features sometimes found in TEI-encoded texts. These include:

Non-ASCII character sets (accents, macrons, Arabic, Far Eastern languages, etc)
Characters that do not qualify for inclusion in Unicode (see wh ligature, for example)
Multicultural naming patterns (proper nouns in a different language or script from the underlying text, etc.)
Texts with errors, omissions or additions
Texts which are compiled with other texts not of interest
Linguistically and etymologically complex features of interest: dates based on names ("In the eighth year of the reign of John"); places named after people ("Washington")
Existing semantic, linguistic, bibliographic or provenance tagging which needs to be integrated with the new taggings (usually the new taggings are linguistic in nature).
Significantly richer metadata than many linguistic processing communities generally use.

Example from Papyri.info of personal name standoff markup

(HC)

See O.Leid. 24

The standoff annotation section starts at line 87. I'm doing the annotation in the same document for the sake of argument—it might not be the ultimate solution.

The use case is that I want to have an automated process attempt to recognize personal names in the text, and then provide an interface for users to correct the automated annotations. Because the text will be edited using a non-TEI interface that does not understand <persName> and the like, I cannot do this with inline markup. Inserting the name markup inline might be tough on the existing element hierarchy anyway. The data from the annotations will be submitted to a partner site, and we will keep the annotations as a record.
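One way to picture the record-keeping described above is the following minimal Python sketch. It is not the Papyri.info implementation: the `NameAnnotation` class, its field names, the `is_stale` helper and the sample line are all hypothetical, chosen only to show how an annotation that stores a copy of the recognized name can be checked against text that was later edited outside TEI.

```python
from dataclasses import dataclass


@dataclass
class NameAnnotation:
    """A stand-off record for one recognized name (hypothetical structure)."""
    line: str        # value of n on the <lb> the range is counted from
    offset: int      # zero-based character offset into that line's text
    length: int      # number of characters covered
    text: str        # copy of the recognized name, kept for checking
    status: str = "proposed"   # a user may later confirm or reject it


def is_stale(ann: NameAnnotation, lines: dict) -> bool:
    """True if the annotated range no longer contains the stored name."""
    current = lines.get(ann.line, "")
    return current[ann.offset:ann.offset + ann.length] != ann.text


# Hypothetical line of text, not taken from O.Leid. 24.
prefix, name = "καὶ ", "Κράτης"
lines = {"1": prefix + name}
ann = NameAnnotation(line="1", offset=len(prefix), length=len(name), text=name)

print(is_stale(ann, lines))          # the range still holds the name
lines["1"] = "[" + lines["1"]        # simulate an edit made in another interface
print(is_stale(ann, lines))          # the offsets have shifted
```

Because the text itself is never modified, the editing interface needs no knowledge of the annotations; the cost is that offsets have to be re-checked after every edit.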

Note: I'm doing several things "wrong" here for the sake of argument. For starters, string-range() does not allow you to use a node without text in it as the first argument. I'd like to change that to allow the starting point to be, e.g. an <lb> (note, however, that if this were a sticking point, an XPath like //lb[@n='1']/following-sibling::text()[1] would work, it's just more verbose). Second, my use of string-range() is incorrect because it does not allow an XPath as the first argument. To be correct according to the current spec, I'd have to do something like:

#string-range(xpath1(//lb[@n='1']/following-sibling::text()[1]),3,6) instead of

#string-range(//lb[@n='1'],3,6)
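The relaxed reading proposed here, in which the range is counted from an empty `<lb>`, can be sketched in Python as follows. This is an illustration under stated assumptions, not a conforming XPointer processor: the `string_range` function and the sample fragment are hypothetical, the TEI namespace is omitted for brevity, offsets are assumed to be zero-based, and text is normalized to NFC so that offsets count precomposed characters.

```python
import unicodedata
import xml.etree.ElementTree as ET

# Hypothetical fragment (not the text of O.Leid. 24); namespace omitted.
SAMPLE = '<ab><lb n="1"/>καὶ Κράτης<lb n="2"/>χαίρειν</ab>'


def string_range(root, n, offset, length):
    """Return `length` characters starting `offset` characters into the
    text that immediately follows <lb n="..."/>.

    ElementTree keeps that text in the element's `.tail`, which plays the
    role of following-sibling::text()[1] in the XPath given above.
    """
    lb = root.find(f".//lb[@n='{n}']")
    if lb is None:
        raise LookupError(f"no <lb n='{n}'/> found")
    text = unicodedata.normalize("NFC", lb.tail or "")
    if offset < 0 or length < 0 or offset + length > len(text):
        raise ValueError("range falls outside the text after the anchor")
    return text[offset:offset + length]


root = ET.fromstring(SAMPLE)
print(string_range(root, "1", 4, 6))
```

The range cannot run past the next element, because `.tail` stops there; a fuller implementation would have to decide whether a range may cross later `<lb>` milestones.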

Note too that since I'm keeping a copy of the recognized name, a scheme like match() might be a better fit, though again, I'd propose doing away with the restriction that the node pointed to must contain the text. In that case, I might posit something like:

#match(//lb[@n='1'],'Κράτης')
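A comparable sketch for the match()-style pointer, again hypothetical and simplified: the `match` function below treats its second argument as a literal string rather than a regular expression, looks only at the text immediately following the `<lb>`, and returns a zero-based offset and length so that the result can be compared with a string-range() pointer. The sample fragment is invented, and the TEI namespace is omitted.

```python
import re
import unicodedata
import xml.etree.ElementTree as ET

# Hypothetical fragment (not the text of O.Leid. 24); namespace omitted.
SAMPLE = '<ab><lb n="1"/>καὶ Κράτης<lb n="2"/>χαίρειν</ab>'


def nfc(s):
    """Normalize to NFC so accented Greek compares consistently."""
    return unicodedata.normalize("NFC", s)


def match(root, n, literal):
    """Find the first occurrence of `literal` in the text following
    <lb n="..."/> and return (offset, length), or None if it is absent."""
    lb = root.find(f".//lb[@n='{n}']")
    if lb is None:
        raise LookupError(f"no <lb n='{n}'/> found")
    text = nfc(lb.tail or "")
    found = re.search(re.escape(nfc(literal)), text)
    if found is None:
        return None
    return found.start(), found.end() - found.start()


root = ET.fromstring(SAMPLE)
print(match(root, "1", "Κράτης"))
```

Unlike a fixed offset, the matched string survives small edits earlier in the line, which is the reason it may suit annotations that already keep a copy of the name.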

