Private URI Schemes

NOTE: THIS PAGE HAS BEEN SUPERSEDED BY THE Prefix Definition Proposal following the TEI Council meeting in Oxford in September 2012.

This page contains a draft proposal for a framework whereby private URI schemes used in TEI attributes with datatypes of data.pointer can be documented and dereferenced.

The Problem
For some time now, we have been discussing the use of "magic tokens" in attributes such as @key. Magic tokens are problematic because they are meaningful only within the context of a specific project (@key "provides an externally-defined means of identifying the entity (or entities) being named, using a coded value of some kind"). At one point it was suggested that @key attributes be documented through the use of an element in the TEI header. However, this is no longer the case; KH is working on another recommendation for this. In any case, documentation of this kind does not provide a machine-readable method of dereferencing a key.

On several occasions, Council has discussed discouraging the use of @key and friends in future, and has talked about encouraging the use of private URI schemes instead. There are many good arguments against the use of private URI schemes (see for instance URI Schemes at the W3C), but as long as they are restricted to a specific project and well documented in that project, the approach seems a reasonable alternative to magic tokens.

Except that without a solid dereferencing scheme, they're actually no different from magic tokens: an undocumented private URI is just as opaque to a processor as a bare key value.

A Possible Solution
The primary value in using a project-specific key-style attribute is that it's short and simple. In many projects, @key is used when a perfectly straightforward and reliable pointer could be provided, because that pointer would be too long to be manageable by encoders. For instance, the Colonial Despatches project uses keys like this:

when what is actually meant is something like this:

Where the key value corresponds to a unique @xml:id within the project, and the project data is stored in an XML database, dereferencing the key to look up the element to which it corresponds is simple. But if the XML data is removed from the context of the XML database and associated XQuery which enables the simple lookup, the relationship between the key and the target element becomes opaque, and any researcher working with the data will have to read the encoding description and reconstruct it.

The proposed solution is to create a method of documenting this relationship which can be mechanically dereferenced as well as being described in human-readable text. This would enable a processor to reconstruct the actual path of a link without human intervention. This can be done with a search-and-replace operation, encoded for example as the second and third arguments to XPath 2.0's replace function. The attributes @matchPattern and @replacementPattern are adopted from the existing TEI element <cRefPattern>. Imagine this in the <encodingDesc> of a document:
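
A declaration of this kind might look like the following minimal sketch. Several details are assumptions made for illustration only: the container name listPrivateUri is a placeholder for the proposed container element, the file name bios.xml and the key smith_j are hypothetical, and the pattern is assumed to be matched against the whole private URI, prefix included.

```xml
<encodingDesc>
  <!-- listPrivateUri is a placeholder name for the proposed container element -->
  <listPrivateUri>
    <!-- Under these assumptions, "bios:smith_j" expands to "bios.xml#smith_j" -->
    <privateUri prefix="bios"
                matchPattern="^bios:(.+)$"
                replacementPattern="bios.xml#$1">
      <p>Private URIs with the bios prefix point to elements in the
         (hypothetical) file bios.xml, identified by their xml:id values.</p>
    </privateUri>
  </listPrivateUri>
</encodingDesc>
```

The paragraph inside the element carries the human-readable description, while the two pattern attributes carry the same information in a form a processor can apply directly.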

Any processor, presented with the "bios" prefix in a private URI:

can look it up in the header, and apply a search/replace operation using the @matchPattern and @replacementPattern attributes to arrive at a full working URI.

The same approach can be used with external references. In the Map of Early Modern London, we use a private URI scheme like this:

This expands into a full URL that looks like this:

It's pointless and error-prone to reproduce the entire URL in every @facs attribute when the only two pieces of information that matter are the document number on EEBO (here 18464) and the page number (here 1). So we could document our moleebo prefix like this:
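
As a sketch of how two pieces of information can be captured at once, a declaration for such a prefix might take the following shape. The underscore separator in the private URI and the example.org target are placeholders, not the forms actually used by the project or by EEBO; only the prefix name and the two numbers come from the text above.

```xml
<!-- Sketch only: the separator and the target URL are placeholders.
     Under these assumptions, "moleebo:18464_1" expands to
     "http://example.org/eebo/image?doc=18464&page=1" -->
<privateUri prefix="moleebo"
            matchPattern="^moleebo:(\d+)_(\d+)$"
            replacementPattern="http://example.org/eebo/image?doc=$1&amp;page=$2">
  <p>The first captured group is the document number and the second is the
     page number.</p>
</privateUri>
```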

Since this dereferencing can be processed using XSLT 2.0 with the replace function, this handling could easily be built into the standard TEI stylesheets.
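
The handling described above could be sketched as an XSLT 2.0 function along the following lines. This is an illustration under stated assumptions rather than a proposed addition to the stylesheets: the function name local:expand and its namespace are hypothetical, privateUri is assumed to live in the TEI namespace, and the pattern is assumed to be applied to the whole private URI.

```xml
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:xs="http://www.w3.org/2001/XMLSchema"
    xmlns:tei="http://www.tei-c.org/ns/1.0"
    xmlns:local="http://example.org/ns/local"
    exclude-result-prefixes="xs tei local">

  <!-- Expand a private URI using the first privateUri declaration in the
       header whose @prefix matches the part of the URI before the first
       colon. If no declaration is found, the URI is returned unchanged. -->
  <xsl:function name="local:expand" as="xs:string">
    <xsl:param name="uri" as="xs:string"/>
    <xsl:param name="doc" as="document-node()"/>
    <xsl:variable name="prefix" select="substring-before($uri, ':')"/>
    <xsl:variable name="def"
        select="($doc//tei:encodingDesc//tei:privateUri[@prefix = $prefix])[1]"/>
    <xsl:sequence select="
        if ($def)
        then replace($uri, $def/@matchPattern, $def/@replacementPattern)
        else $uri"/>
  </xsl:function>

</xsl:stylesheet>
```

A template could then call local:expand(@facs, /) wherever a pointer needs resolving. Because replace returns its input unchanged when the pattern does not match, a URI that does not fit the declared pattern passes through untouched.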

As noted, Council will most likely encourage the use of tag URIs in place of magic tokens as well as private URIs, but these are also long, and can be similarly replaced with private URIs, which can be dereferenced in the same way:

could more simply be encoded as:

which could be dereferenced like this:

Elements, attributes and datatypes
The following elements, attributes and classes are proposed:

 * A container element for a list of privateUri elements, since many projects will use more than one. It would be a child of encodingDesc.
 * att.patternReplacement: a new attribute class including these attributes, currently defined on <cRefPattern>. <cRefPattern> would be added to this class, along with <privateUri>.
    * @matchPattern: data.pattern. A regular expression. REQUIRED.
    * @replacementPattern: data.text. The replacement text (which may include captured groups from the regex, such as $1, $2). REQUIRED.
 * <privateUri>: This is the core element which provides the expansion functionality for a specific private URI prefix. Its content model would be the usual globals, plus model.pLike to provide for a detailed textual description of the dereferencing process if required. It would be a member of the proposed att.patternReplacement class.
    * @prefix: one instance of data.name. This defines the prefix that will be used to construct the private URIs. It would be defined on <privateUri>, and would be REQUIRED.

Remaining questions

 * 1) Should it be possible to provide multiple dereferencing privateUri elements for the same prefix? This seems potentially useful; one might provide a relative path through the document collection, for instance, while another might provide a web URL that would retrieve the same information from a web application.
 * 2) JC is uncomfortable with naming elements based on the actual name of a protocol (if that's what it is) such as Private URI Scheme. However, we already have attributes such as @url.