Prefix Definition Proposal

NOTE: This proposal has now been accepted and is a part of the TEI Guidelines: [http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SA.html#SAPU].

This page contains a draft proposal for a framework whereby private URI schemes and other similar abbreviated pointing systems used in TEI attributes with datatypes of data.pointer can be documented and dereferenced. The feature request ticket relating to this proposal is [http://purl.org/tei/fr/3576367].

NOTE: This document is a second draft of the proposal Private URI Schemes. A redraft was requested by Council at the meeting in Oxford in September 2012.

The Problem

For some time now, we have been discussing the use of "magic tokens" in attributes such as @key. Magic tokens are problematic because they are meaningful only within the context of a specific project (@key "provides an externally-defined means of identifying the entity (or entities) being named, using a coded value of some kind"). At one point it was suggested that @key attributes be documented through the use of a <taxonomy> element in the TEI header (as in the Best Practices for TEI in Libraries), but Lou has argued against this, and KH has noted this for a future revision of the BP. In any case, documentation in this way does not provide a machine-readable method of dereferencing a key.

On several occasions, Council has discussed discouraging the use of @key and friends in future, and has talked about encouraging the use of private URI schemes instead. [1] There are many good arguments against the use of private URI schemes (see for instance URI Schemes at the W3C), but as long as they are restricted to a specific project and well documented in that project, the approach seems a reasonable alternative to magic tokens.

Except that without a solid dereferencing scheme, they're actually no different from magic tokens. There's not much difference between <name key="FRED"> and <name ref="myproj:FRED">.

In addition to private URI schemes, it is easy to imagine projects making use of other abbreviated pointing methods which are similarly unintelligible if not documented, and which cannot be processed automatically.

A Possible Solution

The primary value in using a project-specific key-style attribute is that it's short and simple. In many projects, @key is used when a perfectly straightforward and reliable pointer could be provided, because that pointer would be too long to be manageable by encoders. For instance, the Colonial Despatches project uses keys like this:

 <name key="mills">John Powell Mills</name>

when what is actually meant is something like this:

 <name ref="../bios/bios.xml#mills">John Powell Mills</name>

Where the key value corresponds to a unique @xml:id within the project, and the project data is stored in an XML database, dereferencing the key to look up the <person> element to which it corresponds is simple. But if the XML data is removed from the context of the XML database and associated XQuery which enables the simple lookup, the relationship between the key and the target element becomes opaque, and any researcher working with the data will have to read the encoding description and reconstruct it.

The proposed solution is to create a method of documenting this relationship which can be mechanically dereferenced as well as being described in human readable text. This would enable a processor to reconstruct the actual path of a link without human intervention. This can be done with a search-and-replace operation, encoded for example as the second and third arguments to XPath 2.0's replace() function. The attributes @matchPattern and @replacementPattern are adopted from the existing TEI element <cRefPattern>. Imagine this in the <encodingDesc> of a document:

 <listPrefixDef>
   <prefixDef ident="bios" matchPattern="([a-z]+)" replacementPattern="../bios/bios.xml#$1">
     <p>In the context of this project, private URIs with the prefix "bios" point to <gi>person</gi> elements in the project's bios/bios.xml file.</p>
   </prefixDef>
 </listPrefixDef>

Any processor, presented with the "bios" prefix in a private URI:

 <name ref="bios:mills">John Powell Mills</name>

can look it up in the header, and apply a search/replace operation using the @matchPattern and @replacementPattern attributes to arrive at a full working URI.
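The lookup described above can be sketched in Python. This is only an illustration under several assumptions: the function names `expand` and `dereference` are hypothetical, the TEI namespace is omitted for brevity, and Python's `re` module stands in for the XPath 2.0 regular expression flavour, which differs in some details.

```python
import re
import xml.etree.ElementTree as ET

# Header fragment modelled on the "bios" example (TEI namespace omitted).
HEADER = """
<listPrefixDef>
  <prefixDef ident="bios" matchPattern="([a-z]+)"
             replacementPattern="../bios/bios.xml#$1"/>
</listPrefixDef>
"""

def expand(value, match_pattern, replacement_pattern):
    """Rough stand-in for XPath replace(): $N is replaced by captured group N."""
    def fill(m):
        # Substitute each $N in the replacement with the text of group N.
        return re.sub(r"\$(\d+)",
                      lambda ref: m.group(int(ref.group(1))) or "",
                      replacement_pattern)
    return re.sub(match_pattern, fill, value)

def dereference(uri, header):
    """Expand a private URI using the first matching prefixDef, if any."""
    prefix, sep, rest = uri.partition(":")
    for prefix_def in header.iter("prefixDef"):
        pattern = prefix_def.get("matchPattern")
        replacement = prefix_def.get("replacementPattern")
        if sep and prefix_def.get("ident") == prefix and pattern and replacement:
            return expand(rest, pattern, replacement)
    return uri  # unknown prefix or ordinary URI: leave untouched

header = ET.fromstring(HEADER)
print(dereference("bios:mills", header))  # ../bios/bios.xml#mills
```

A value whose prefix has no <prefixDef> in the header (an ordinary http: URL, for instance) is returned unchanged, so such a function could be applied to every data.pointer value without special-casing.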

The same approach can be used with external references. In the Map of Early Modern London, we use a private URI scheme like this:

 <pb facs="moleebo:18464|1"/>

This expands into a full URL that looks like this:

 <pb facs="http://eebo.chadwyck.com/fetchimage?vid=18464&amp;page=1&amp;width=1200"/>

It's pointless and error-prone to reproduce the entire URL in every @facs attribute when the only two pieces of information that matter are the document number on EEBO (here 18464) and the page number (here 1). So we could document our moleebo prefix like this:

 <listPrefixDef>
   <prefixDef ident="moleebo" matchPattern="([0-9]+)\|([0-9]+)" replacementPattern="http://eebo.chadwyck.com/fetchimage?vid=$1&amp;page=$2&amp;width=1200">
     <p>In the context of this project, private URIs with the prefix "moleebo" point to facsimile pages on the EEBO website.</p>
   </prefixDef>
 </listPrefixDef>

Since this dereferencing can be processed using XSLT 2.0 with the replace() function, this handling could easily be built into the standard TEI stylesheets.
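As a rough check of the moleebo expansion, the following Python sketch applies the same search-and-replace to the value "18464|1". It assumes that the pipe separator is escaped as `\|` in the match pattern so that it is read as a literal character rather than as regex alternation; the helper name `expand` is hypothetical, and Python's `re` module is used in place of XPath's replace(). The replacement string is shown as it appears after XML parsing, that is, with `&amp;` already resolved to `&`.

```python
import re

def expand(value, match_pattern, replacement_pattern):
    """Rough stand-in for XPath replace(): $N is replaced by captured group N."""
    def fill(m):
        return re.sub(r"\$(\d+)",
                      lambda ref: m.group(int(ref.group(1))) or "",
                      replacement_pattern)
    return re.sub(match_pattern, fill, value)

REPLACEMENT = "http://eebo.chadwyck.com/fetchimage?vid=$1&page=$2&width=1200"
value = "18464|1"  # the part of facs="moleebo:18464|1" after the prefix

# Pipe escaped: the whole value matches once, filling both groups.
escaped = expand(value, r"([0-9]+)\|([0-9]+)", REPLACEMENT)

# Pipe unescaped: "|" means "or", so each number matches separately
# and the second group is never filled.
unescaped = expand(value, r"([0-9]+)|([0-9]+)", REPLACEMENT)

print(escaped)
print(unescaped)
```

Only the escaped pattern yields the single URL shown earlier; the unescaped one produces two partial URLs joined by the leftover pipe, which is why the separator needs escaping (or a different separator character could be chosen).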

As noted in a footnote in this document, Council will most likely encourage the use of tag URIs in place of magic tokens as well as private URIs, but these are also long, and can be similarly replaced with private URIs, which can be dereferenced in the same way. That is, a tag URI such as:

 <name ref="tag:bcgenesis.uvic.ca.bios.2012-06-29:mills">John Powell Mills</name>

could more simply be encoded as:

 <name ref="bios:mills">John Powell Mills</name>

which could be dereferenced like this:

 <listPrefixDef>
   <prefixDef ident="bios" matchPattern="([a-z]+)" replacementPattern="tag:bcgenesis.uvic.ca.bios.2012-06-29:$1">
     <p>In the context of this project, private URIs with the prefix "bios" represent tag URIs based on the project domain bcgenesis.uvic.ca.</p>
   </prefixDef>
 </listPrefixDef>

Elements, attributes and datatypes

The following elements, attributes and classes are proposed:

  • <listPrefixDef>: This is a container element for a list of <prefixDef> elements, since many projects will use more than one. It would be a child of <encodingDesc>, as a member of model.encodingDescPart.
  • att.patternReplacement: a new attribute class including these attributes, currently defined on <cRefPattern>:
    • @matchPattern: data.pattern. A regular expression. OPTIONAL.
    • @replacementPattern: data.text. The replacement text (which may include captured groups from the regex, such as $1, $2). OPTIONAL.
    • Note that although both attributes are optional (allowing a pure text description of a dereferencing process which may not be encodable in terms of regular expressions), one without the other is pointless, so a Schematron constraint should be added to enforce the presence of both or neither.

<cRefPattern> would be added to this class, along with <prefixDef>.
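The "both or neither" rule is proposed as a Schematron constraint; the Python sketch below only illustrates the logic such a rule would encode and is not a substitute for it. The function names and the sample idents "prose" and "broken" are hypothetical, and the TEI namespace is omitted.

```python
import xml.etree.ElementTree as ET

def patterns_consistent(element):
    """True if @matchPattern and @replacementPattern are both present or both absent."""
    has_match = element.get("matchPattern") is not None
    has_replacement = element.get("replacementPattern") is not None
    return has_match == has_replacement

def inconsistent_prefix_defs(header):
    """Return the @ident of every prefixDef that carries only one of the two attributes."""
    return [pd.get("ident") for pd in header.iter("prefixDef")
            if not patterns_consistent(pd)]

SAMPLE = ET.fromstring("""
<listPrefixDef>
  <prefixDef ident="bios" matchPattern="([a-z]+)" replacementPattern="../bios/bios.xml#$1"/>
  <prefixDef ident="prose"><p>Described in text only.</p></prefixDef>
  <prefixDef ident="broken" matchPattern="([a-z]+)"/>
</listPrefixDef>
""")

print(inconsistent_prefix_defs(SAMPLE))  # ['broken']
```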

  • <prefixDef>: This is the core element which provides the expansion functionality for a specific private URI prefix. Its content model would be the usual globals plus model.pLike, to provide for a detailed textual description of the dereferencing process if required. It would be a member of the proposed att.patternReplacement class.
  • @ident: one instance of data.name. This defines the prefix that will be used to construct the private URIs. It would be defined on <prefixDef>, and would be REQUIRED.

Council discussions in Oxford in September 2012 suggested that it will often be useful or necessary to provide multiple expansions for a single prefix. In the example of the EEBO link above, for instance, it might be helpful to provide users who do not have access to EEBO with some alternative link to a public lower-resolution image, or some other helpful information. For this reason, @ident rather than @xml:id is used to specify the prefix, so that multiple dereferencing methods can be described. Processors may choose either to implement the first method for each specific prefix, or generate multiple links. The TEI stylesheets would probably do the former.
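The two processing options can be sketched as follows. This assumes two <prefixDef> elements sharing the ident "moleebo", with the pipe separator escaped in the match pattern; the second replacement URL (on example.org) is invented purely for illustration, as are the function names, and the TEI namespace is omitted.

```python
import re
import xml.etree.ElementTree as ET

# Two definitions for one prefix; the example.org URL is a made-up alternative.
HEADER = ET.fromstring(r"""
<listPrefixDef>
  <prefixDef ident="moleebo" matchPattern="([0-9]+)\|([0-9]+)"
             replacementPattern="http://eebo.chadwyck.com/fetchimage?vid=$1&amp;page=$2&amp;width=1200"/>
  <prefixDef ident="moleebo" matchPattern="([0-9]+)\|([0-9]+)"
             replacementPattern="http://example.org/lowres/$1/$2.jpg"/>
</listPrefixDef>
""")

def expand(value, match_pattern, replacement_pattern):
    """Rough stand-in for XPath replace(): $N is replaced by captured group N."""
    def fill(m):
        return re.sub(r"\$(\d+)",
                      lambda ref: m.group(int(ref.group(1))) or "",
                      replacement_pattern)
    return re.sub(match_pattern, fill, value)

def expansions(uri, header):
    """Return every expansion defined for the URI's prefix, in document order."""
    prefix, sep, rest = uri.partition(":")
    return [expand(rest, pd.get("matchPattern"), pd.get("replacementPattern"))
            for pd in header.iter("prefixDef")
            if sep and pd.get("ident") == prefix
            and pd.get("matchPattern") and pd.get("replacementPattern")]

all_links = expansions("moleebo:18464|1", HEADER)  # "generate multiple links"
first_only = all_links[0]                          # "implement the first method"
print(first_only)
print(all_links)
```

Because the definitions are returned in document order, a project can express its preferred target simply by listing that <prefixDef> first.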

Revisions to Guidelines

New Guidelines Section

A proposed new section for the Guidelines is available for comment and editing here:

[http://tei.svn.sourceforge.net/viewvc/tei/trunk/P5/temp-dev/prefix_definition_proposal_draft.xml?view=markup]

The section would be located between the current 16.2.2 (Pointing locally) and 16.2.3 (W3C element() Scheme), and is entitled "Using abbreviated pointers".

Change to definition of @matchPattern

The definition of @matchPattern would need to be revised slightly as it's moved into the new attribute class. Currently it says "specifies a regular expression against which the values of cRef attributes can be matched"; this would need to be redrafted, either to extend it:

"specifies a regular expression against which the values of cRef attributes (in the context of cRefPattern) or data.pointers (in the context of prefixDef) can be matched."

Alternatively, it could be generalized:

"specifies a regular expression against which the values of other attributes can be matched."

[http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-cRefPattern.html]

Addition of links to the new section

A link to the new section should be added in 16.2.5 (Canonical references) [http://www.tei-c.org/release/doc/tei-p5-doc/en/html/SA.html#SACR]

Notes

Council, please add any comments or suggestions here.

This looks fine. It's basically a more powerful version of @xml:base. (Kshawkin 23:10, 13 November 2012 (EST))

I don't think xml:base is a good analogy, but that is beside the point. What I am somewhat uneasy about here is -- as of this moment purely theoretical -- a potential security risk. Imagine a system where you include a single instance of the header into thousands of individual documents (if you don't want to imagine that, download a sample of the National Corpus of Polish). That obviously provides flexibility (that's why we have used it in this way), but, on the other hand, in contrast to tag URIs, a single attack at the directory where the header is kept results in thousands of changes throughout, and even if (theorizing feebly now...) the result is a kind of DOS attack, whereby you make all those files that use this system try to retrieve something from a single specified URL, it can still be somewhat painful in effect. Don't get me wrong: I think this system is neat. But also extremely powerful (even the fact that it acts on both sides of the '#' is a neat but also very powerful property), and robust systems naturally have the potential to open up serious security holes, when abused or even only misused. The question now is: do we bother. Piotr 12:28, 17 December 2012 (EST)

Footnotes

  1. We have also proposed the use of "tag URIs" as described in RFC 4151. See comments on soft deprecation of @key. However, tag URIs, since they are intended to be globally unique, are inevitably inconveniently long, and even where tag URIs are used, it is likely that projects will want to resort to shorter, simpler token-like attribute values, so the need for a dereferencing system still applies when tag URIs are used.