Difference between revisions of "Stand-off use cases"
Stuartyeates (talk | contribs) (richer metadata) |
Stuartyeates (talk | contribs) (ref) |
Latest revision as of 07:48, 11 May 2012
For technical methods that might be used to implement these, see XPointer.
Use cases
This is a list of use cases for using stand-off markup in TEI. These examples are an attempt to build a shared common ground on which to move forward and are the (by-)product of a discussion on TEI-SOM. Initials indicate discussion members who find this use case important or motivating.
- A third party publishes a text in non-TEI XML at a stable URL. You want to perform linguistic annotations on the text without losing reference to the underlying third party text. A TEI text is built with references back to the original non-TEI XML.
- A third party publishes a text in TEI at a stable URL. You want to perform linguistic annotations on the text without losing reference to the underlying third party text. A new TEI text is built with references back to the original TEI.
- A third party publishes a text in TEI XML at a stable URL. You want to perform linguistic annotations on the text without losing reference to the underlying third party text. A TEI text is built with references back to the original XML. (SY)
- There are multiple competing markings-up of linguistic information in a TEI text. The 'obvious' one is marked up inline and the others are relegated to stand-off markup. Tools are used to bring each alternative markup to the fore. (SY)
- There are multiple competing markings-up of linguistic information in a TEI text. Rather than privilege one marking-up, all are relegated to stand-off markup. Tools are used to bring each alternative markup to the fore. (SY)
- A third party publishes a text in non-TEI XML at a stable URL. You wish to use TEI-based tools to find and correct errors in the original XML and feed those corrections back to the publisher in a format native to them. (SY)
- A third party publishes an RSS-XML source which is converted to TEI with links back to the original RSS for processing and analysis (sentiment analysis, named entity extraction, etc.). The interest is not in the source text per se, but in the resulting analysis and the textual fragments in the source supporting that analysis. (From http://listserv.brown.edu/?A2=TEi-SOM;8464a70e.1205) (MH)
- A text is being published for the first time; TEI with stand-off markup is being used for the linguistic information. (SY)
- Multiple independent markings-up of a TEI-encoded text are done in parallel, for example when there is contention as to the underlying meaning or when the text is set as a pedagogical mark-up exercise. (MH)
- Textual fragments from within a TEI text are extracted and used in an index into the text as part of the end-matter, in a manner akin to a back-of-the-book index. (see papyri.info example below) (HC)
Challenging aspects of TEI texts
There are a number of challenging features sometimes found in TEI-encoded texts. These include:
- Non-ASCII character sets (accents, macrons, Arabic, Far Eastern languages, etc)
- Characters that do not qualify for inclusion in Unicode (see the wh ligature, for example)
- Multicultural naming patterns (proper nouns in a different language or script from the underlying text, etc.)
- Texts with errors, omissions or additions
- Texts which are compiled with other texts not of interest
- Linguistically and etymologically complex features of interest: dates based on names ("In the eighth year of the reign of John"); places named after people ("Washington")
- Existing semantic, linguistic, bibliographic or provenance tagging which needs to be integrated with the new taggings (usually the new taggings are linguistic in nature).
- Significantly richer metadata than many linguistic processing communities generally use.
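One practical consequence of the non-ASCII item above is that a stand-off offset only means something once the counting unit is agreed. The following is a minimal sketch, not part of the original discussion, showing how the same Greek name has different lengths depending on Unicode normalization form and on whether code points or UTF-8 bytes are counted. The name is spelled with escapes so that the code points are unambiguous.

```python
import unicodedata

# "Κράτης" spelled out as explicit code points (alpha with tonos is precomposed).
name = "\u039a\u03c1\u03ac\u03c4\u03b7\u03c2"

nfc = unicodedata.normalize("NFC", name)  # accent stays precomposed
nfd = unicodedata.normalize("NFD", name)  # accent becomes a combining mark

lengths = {
    "nfc_code_points": len(nfc),
    "nfd_code_points": len(nfd),
    "nfc_utf8_bytes": len(nfc.encode("utf-8")),
    "nfd_utf8_bytes": len(nfd.encode("utf-8")),
}

for unit, count in lengths.items():
    print(f"{unit}: {count}")
```

A range such as "offset 3, length 6" therefore selects different text unless annotator and resolver share a normalization form and a counting unit.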
Example from Papyri.info of personal name standoff markup
(HC)
See O.Leid. 24
The standoff annotation section starts at line 87. I'm doing the annotation in the same document for the sake of argument—it might not be the ultimate solution.
The use case is that I want to have an automated process attempt to recognize personal names in the text, and then provide an interface for users to correct the automated annotations. Because the text will be edited using a non-TEI interface that does not understand <persName> and the like, I cannot do this with inline markup. Inserting the name markup inline might be tough on the existing element hierarchy anyway. The data from the annotations will be submitted to a partner site, and we will keep the annotations as a record.
Note: I'm doing several things "wrong" here for the sake of argument. For starters, string-range() does not allow you to use a node without text in it as the first argument. I'd like to change that to allow the starting point to be, e.g. an <lb> (note, however, that if this were a sticking point, an XPath like //lb[@n='1']/following-sibling::text()[1] would work, it's just more verbose). Second, my use of string-range() is incorrect because it does not allow an XPath as the first argument. To be correct according to the current spec, I'd have to do something like:
#string-range(xpath1(//lb[@n='1']/following-sibling::text()[1]),3,6) instead of
#string-range(//lb[@n='1'],3,6)
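To make the intended behaviour concrete, here is a rough Python sketch of what resolving such a pointer could look like. It is not an implementation of the TEI or XPointer specification: the sample fragment, the helper names `text_after_lb` and `string_range`, and the zero-based offset convention are all assumptions made for illustration, and the sample omits the TEI namespace. It relies on the fact that ElementTree stores the text immediately following an empty element in that element's `tail`, which corresponds to `following-sibling::text()[1]` only when text directly follows the `<lb>`.

```python
import xml.etree.ElementTree as ET

# Hypothetical, namespace-free fragment standing in for a line-broken text.
SAMPLE = '<ab><lb n="1"/>abcdefghijkl<lb n="2"/>mnop</ab>'


def text_after_lb(root, n):
    """Return the text node directly following <lb n="..."/>."""
    for lb in root.iter("lb"):
        if lb.get("n") == n:
            # ElementTree keeps the text after an element in its .tail.
            return lb.tail or ""
    raise LookupError(f"no <lb> with n={n!r}")


def string_range(root, n, offset, length):
    """Select `length` characters starting `offset` characters after the <lb>.

    Assumption: offsets are zero-based and counted in code points.
    """
    text = text_after_lb(root, n)
    return text[offset:offset + length]


root = ET.fromstring(SAMPLE)
fragment = string_range(root, "1", 3, 6)
print(fragment)
```

The point of the sketch is that anchoring on the `<lb>` itself is straightforward for a resolver, which is the relaxation argued for above.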
Note too that since I'm keeping a copy of the recognized name, a scheme like match() might be a better fit, though again, I'd propose doing away with the restriction that the node pointed to must contain the text. In that case, I might posit something like:
#match(//lb[@n='1'],'Κράτης')
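A comparable sketch for the match() idea follows. Again, this is only an illustration under stated assumptions: the sample fragment and the helper `match_after_lb` are hypothetical, the TEI namespace is omitted, and normalizing both strings to NFC before searching is a choice made here, not something the scheme prescribes. The function reports where the recorded name occurs in the text following the `<lb>`, or `None` if it is absent.

```python
import unicodedata
import xml.etree.ElementTree as ET

# The recognized name, written with escapes so the code points are unambiguous.
NAME = "\u039a\u03c1\u03ac\u03c4\u03b7\u03c2"  # Κράτης

# Hypothetical fragment: the name sits in the text following <lb n="1"/>.
SAMPLE = f'<ab><lb n="1"/>ab {NAME} cd<lb n="2"/>ef</ab>'


def match_after_lb(root, n, needle):
    """Find `needle` in the text directly following <lb n="..."/>.

    Returns (start, end) code-point offsets into that text, or None.
    Both strings are normalized to NFC first (an assumption of this sketch).
    """
    for lb in root.iter("lb"):
        if lb.get("n") == n:
            text = unicodedata.normalize("NFC", lb.tail or "")
            target = unicodedata.normalize("NFC", needle)
            start = text.find(target)
            return None if start < 0 else (start, start + len(target))
    raise LookupError(f"no <lb> with n={n!r}")


root = ET.fromstring(SAMPLE)
span = match_after_lb(root, "1", NAME)
print(span)
```

Because the annotation keeps a copy of the name, a resolver of this kind can also detect when the underlying text has been edited and the name no longer appears where expected.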