Future changes to Best Practices for TEI in Libraries
This page describes changes to the Best Practices for TEI in Libraries which will be made at some point but which did not delay release of version 3.0.
These items will be migrated into the GitHub issue tracker.
<milestone unit="typography" n="******"/> -- Is this TEI-conformant? Is there a better way to do this in any case?
- This is not TEI conformant, unless you think that "*****" is a valid way of naming something (here this particular milestone). I would suggest
<milestone type="separator" unit="nonstructural" rend="stars"/>
- Or possibly
<space dim="vertical" extent="[whatever]" rend="stars"/>
- Or you could use <ornament> if you have followed Tite in adding it. LouBurnard
- I agree with Lou that “typography” is not really a unit, and stars are not really a label for the unit, whatever it is. I’m not fond of <space> for this purpose, but perhaps there are convincing arguments I haven’t considered. I quite like
<milestone type="separator" unit="nonstructural" rend="stars"/>
- except that I think, in general, these units are structural. So I would prefer "undetermined" as the value of unit=. Syd Bauman
- Note that these values of @rend do not conform to our general recommendation to use CSS for values of @rend. (Kshawkin 12:38, 13 November 2011 (EST))
ref using a URN
The section on "key and ref" needs to be revised so that the Shakespeare example points to a Linked Data URI for this authority record, which is how this would be done now that id.loc.gov is available. Furthermore, the use of <taxonomy> in this example is not even warranted in its current form in the BP, where we recommend it simply to gloss a string used as part of the value of @key, even when that string does not belong to a typology of any sort.
So once both of these things are fixed, the Shakespeare example no longer illustrates a magic token, so a new example should be created that uses a "tag" URI for the value of @target as implemented in P5 in feature request 3437509. Then we will really no longer be recommending use of @key, so the title of this section in the BP should be revised. (Kshawkin 22:20, 17 June 2012 (EDT))
- An alternative approach would be to use a scheme for documenting private URI schemes and similar abbreviated pointing systems as proposed by Martin Holmes to the Technical Council. (Kshawkin 23:05, 13 November 2012 (EST))
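As a rough illustration of the first change described in this section, the Shakespeare example might point to an id.loc.gov URI along the following lines. This is only a sketch: the identifier n78095332 is assumed to be the LC name authority record for Shakespeare and should be checked against id.loc.gov before it goes into the BP.

```xml
<author>
  <!-- Assumption: n78095332 is the LC name authority identifier for Shakespeare; verify before use -->
  <persName ref="http://id.loc.gov/authorities/names/n78095332">Shakespeare, William, 1564-1616</persName>
</author>
```

With a resolvable URI in @ref, no <taxonomy> is needed to gloss the value.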
alternative handling of ISBNs
See outcome of http://purl.org/TEI/fr/3500566 . Solution (e) on this ticket seems especially good for machine processing, though we should probably use ref instead of ptr so that there is a way to represent the ISBN printed on an item (which might not be the real one). See AACR2 for how to represent the stated and known ISBNs.
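One possible reading of the ref-instead-of-ptr suggestion is sketched below. It is not taken from the ticket: the ISBN values are placeholders, and the use of a urn:isbn URN in @target is an assumption made for illustration.

```xml
<bibl>
  <title>[title of the item]</title>
  <!-- @target carries the known ISBN as a URN (placeholder value);
       the element content transcribes the ISBN as printed on the item -->
  <ref target="urn:isbn:9780000000002">ISBN 978-0-00-000000-2</ref>
</bibl>
```

When the printed ISBN is wrong, the element content would keep the printed form while @target carries the corrected one, which <ptr> cannot do because it is empty.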
The TEI Header
breaking out components of MARC 1xx and 7xx subfields
Consider whether various MARC 1xx and 7xx subfields could be broken out into components of persName. If so, we'll change recommendations for persName@type.
Separately, we could also simply use <date> within <author> and <editor>. So instead of
<author><persName>Shakespeare, William, 1564-1616</persName></author>
we would have
<author><persName>Shakespeare, William</persName>, <date notBefore="1564" notAfter="1616">1564-1616</date></author>
list of elements deleted and changed by our ODDs
Identifiers for outside metadata?
Pending Issues Discussed
Should we have a place in the header to indicate an identifier for an outside metadata record for the item? Examples:
- record number for the source document in the local catalog
- record number for the source document in WorldCat
- record number for this TEI document in the local catalog
- record number for this TEI document in WorldCat
Having such a link would allow a delivery system to provide an unambiguous link to this full metadata without relying on matching other information in the header like a title, ISBN, or call number. (Kshawkin)
Yes, I think we should. How about the spot where the TEI Guidelines recommend putting the code for the classification of the text (in some scheme), <classCode> inside <classDecl>, or is that too much of a stretch? (Syd)
- During the call on 2009-02-10, Syd said he no longer thinks use of classCode (and a corresponding classDecl) is a good idea. Instead, he suggested we create a new element, otherDesc, to contain elements from outside the TEI namespace for metadata not covered by the TEI header. The BP could specify how this element is used. (Kshawkin)
NOTE: we talked about this during our conf call on 2009-02-10; we decided to have a sub-group conference call on 2009-02-17 to talk in more detail about this. Emcaulay
- We didn't get to this on 2009-02-17, so we postponed to 2009-03-03. However, few people showed up, so we postponed again. As Syd put it, there are two issues to consider here:
- A. What mechanism should we use to point from the TEI header to metadata located outside the TEI document? (For example, how do you identify a MARC, METS, or MODS record that provides additional metadata about the TEI document and/or the source document?)
- B. Should we provide a recommendation on storing non-TEI metadata within the TEI document (using a different element namespace)? For example, should we allow Dublin Core elements anywhere in the TEI header?
Email discussions in late March 2009 and early April 2009 with Syd, Melanie, Kevin, Michelle and Glen did not reach a conclusion. Tentative plans for the future would do this sort of thing when an element has the @ref attribute:
<author> <persName xml:id="persName_1" ref="http://authorities.loc.gov/cgi-bin/Pwebrecon.cgi?AuthRecID=1563939&amp;v1=1&amp;HC=1&amp;SEQ=20090404152214&amp;PID=wRSbpUQ7Uptm_ypRikIdNPzF">Welles, Gideon, 1802-1878.</persName> </author>
except that in your example there's no @type or other method for describing the relationship between the content of <persName> and the value of @ref. P5 says that @ref "provides an explicit means of locating a full definition for the entity being named by means of one or more URIs", but we are looking for a typology of some sort for these links and need a place to indicate the type of link.
And we'd do this when there's no @ref:
<sourceDesc xml:id="sourceDesc_1"> [. . .] </sourceDesc>
for which you'd find elsewhere in the document:
<link type="MARCsource" target="#sourceDesc_1 http://mirlyn.lib.umich.edu/F/?func=direct&amp;doc_number=000601789&amp;local_base=MIU01_PUB"/>
link elements might be grouped together in one of these places:
- 1st child of <text>
- last child of <text>
UPDATE: We'll probably use <idno> in various header elements: see https://sourceforge.net/tracker/index.php?func=detail&aid=2493417&group_id=106328&atid=644065 . In any case, we'll need to tell people how much metadata to include in TEI header if they will also have external, possibly canonical, metadata sources.
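A minimal sketch of what the <idno> approach could look like for the source document is given below. The @type values ("OCLC", "localCatalog") and the bracketed record numbers are hypothetical placeholders, not values the BP has settled on.

```xml
<sourceDesc>
  <bibl>
    <title>[title of the source document]</title>
    <!-- Hypothetical @type values; the BP would need to define a controlled list -->
    <idno type="OCLC">[WorldCat record number for the source document]</idno>
    <idno type="localCatalog">[local catalog record number for the source document]</idno>
  </bibl>
</sourceDesc>
```

Identifiers for records describing the TEI document itself would presumably go in a corresponding place in <fileDesc>, such as <publicationStmt>.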
While at the Swinburne conference with John I happened to look over his shoulder while he was encoding, and I noticed his use of <relatedItem> in the <biblStruct>: http://www.tei-c.org/release/doc/tei-p5-doc/en/html/ref-relatedItem.html. You may have come across this already in your break-out meetings, but at a quick glance, the application seems to fit our needs: <relatedItem> contains or references some other bibliographic item which is related to the present one in some specified manner, for example as a constituent or alternative version of it. John was using it in conjunction with an embedded <note> to provide additional context. <ptr> and <ref> are also valid within <relatedItem>. (Mdalmau)
- this element is designed to reference a bibliographic item, not another metadata record or a single piece of metadata. So I don't think it's quite right. (Kshawkin)
Metadata Working Group
Per discussion at the SIG on Libraries meeting in Ann Arbor (2009-11), Syd put out a call for members of a working group to look at the relationship between the TEI header and other sources of metadata. This group will formulate recommendations for the TEI Council and/or for the text of the Best Practices. However, as decided a few times in the past, release of the Best Practices will not be held up for this group's work to finish because of the large scope of the work.
Expressing RDF in TEI
http://listserv.brown.edu/archives/cgi-bin/wa?A2=ind1106&L=TEI-L&T=0&F=&S=&P=15216 (which refers to https://sourceforge.net/tracker/index.php?func=detail&aid=3309894&group_id=106328&atid=644065 , where there is ongoing discussion)
Use of mets:mdRef
Consider using <mets:mdRef> to reference outside metadata.
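For reference, a METS metadata reference has roughly the shape sketched below. LOCTYPE and MDTYPE are the required METS attributes; the URL is a hypothetical placeholder, and where such an element could sit in a TEI header (which would need a schema customization) is left open here.

```xml
<!-- Hypothetical URL; LOCTYPE and MDTYPE values come from the METS schema's enumerated lists -->
<mets:mdRef xmlns:mets="http://www.loc.gov/METS/"
            xmlns:xlink="http://www.w3.org/1999/xlink"
            LOCTYPE="URL"
            MDTYPE="MARC"
            xlink:href="http://example.org/catalog/record/000000000"/>
```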
New container element?
A new element might be added to the header to hold non-TEI metadata. See feature request 453.
Instead of
<extent>viii, -215 p</extent> <extent>20 cm.</extent>
we might use elements from the msDescription module and do it this way:
<extent>viii, -215 p</extent> <extent> <height quantity="20" unit="cm"/> </extent>
Had a width been given for this item in the catalog record, in the header it would have been:
<extent>viii, -215 p</extent> <extent> <width quantity="30" unit="cm"/> <height quantity="20" unit="cm"/> </extent>
appInfo and application
Consider inclusion of these elements in the header. Lou wrote, "There is at least one proposal forthcoming for further work on defining the scope and usage of these elements". Wait for these proposals to make their way into P5 before revisiting this question.
children of editorialDecl
Per a change in P5, editorialDecl can now have mixed content of <p>s and specialized elements like <hyphenation>. Revisit our decision to put all content into <p>s and come up with recommended uses for specialized elements.
respStmt within imprint
Per https://sourceforge.net/tracker/?func=detail&aid=3408897&group_id=106328&atid=644065, P5 will soon allow respStmt within imprint for printer, distributor, etc. Consider recommending this as well, which will require us to dissociate the BP from version 1.9.1 of P5. Note that this will be difficult to map from MARC 260 $b.
indicating interviewers and interviewees
Section “Level 4 Oral History” currently recommends that the speaking participants in the interview be identified either as authors of the document or in profileDesc/particDesc/list/item/name. In the example encoding, interviewees and interviewers are differentiated only by text inside the given <item>. This strikes me as sloppy. It may well be appropriate to permit this encoding, in case someone is digitizing hundreds of interviews and has OCRed metadata that lists the participants this way. But certainly a best practice would make use of <listPerson>, and explicitly indicate "interviewer", "interviewee", "thirdParty" (or whatever) on the role= of <person>, no? --Syd
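A sketch of the <listPerson> approach suggested above might look like this. The @xml:id values and bracketed names are placeholders, and the role values are the ones floated in the comment, not an agreed list.

```xml
<profileDesc>
  <particDesc>
    <listPerson>
      <!-- role values are illustrative; the BP would need to fix the vocabulary -->
      <person xml:id="spk1" role="interviewer">
        <persName>[name of interviewer]</persName>
      </person>
      <person xml:id="spk2" role="interviewee">
        <persName>[name of interviewee]</persName>
      </person>
    </listPerson>
  </particDesc>
</profileDesc>
```

Utterances in the transcription could then point to these entries, for example with who="#spk1" on <u> or <sp>.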
encoding <pb>s within <note>s
Give guidance on encoding of <pb/>s within <note>s. Should these be encoded or omitted, with the <note> element appearing in the XML "within" the page on which it began? (Note that the BP allows for local practice in gathering all notes for a given section to a div at the end of the section.)
If encoding such <pb/>s will be optional or required, there will generally be two instances of <pb n="X"/> every time a note crosses a page boundary. Should we recommend use of @sameAs or @corresp to indicate that you are encoding the same page break twice? See this thread on TEI-L.
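To make the question concrete, here is a sketch of a footnote that runs from page 12 onto page 13, with the break encoded twice and tied together using @corresp. The page numbers and @xml:id values are invented for illustration, and whether @corresp or @sameAs is the better choice is exactly what remains to be decided.

```xml
<p>Main text on page 12<note place="foot">A note that begins on page 12
    <!-- second encoding of the same break, pointing back to the main one -->
    <pb n="13" xml:id="pb13-note" corresp="#pb13"/>
    and continues at the foot of page 13.</note> and more main text on page 12.</p>
<pb n="13" xml:id="pb13"/>
<p>Main text on page 13.</p>
```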
bibliographies and other lists of works cited
We give no guidance on encoding bibliographies, lists of works cited, lists of references, and other such things in documents. In TEI, you would use listBibl for this. We need to decide:
- Whether to include this in the BP and at which level
- Whether to use bibl, biblStruct, or biblFull inside the list. (I strongly argue for bibl for faithfully representing source documents.)
- Since listBibl can include a head, the listBibl element could be a sibling of a div in TEI. However, this would go against the BP, which says to use divs for everything. So should listBibl go inside a div? If so, should the head be a child of listBibl (as in TEI) or of the div (to be consistent with other parts of the BP)?
(Kshawkin 12:45, 14 July 2011 (EDT))
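One of the options raised above, with listBibl inside a div, the head on the div, and bibl for each entry, could be sketched as follows. The @type value and the bracketed content are placeholders; a numbered div could be substituted where the BP calls for one.

```xml
<div type="bibliography">
  <!-- head placed on the div, for consistency with other parts of the BP -->
  <head>Works Cited</head>
  <listBibl>
    <bibl>[first citation, transcribed as it appears in the source]</bibl>
    <bibl>[second citation, transcribed as it appears in the source]</bibl>
  </listBibl>
</div>
```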
MARCXML <--> teiHeader
- Black Mesa Technologies has undertaken this work for us (gratis)!
MODS <--> teiHeader
Tite --> Level 3.5 (Syd's recommendation)
- Syd said in Wuerzburg in October 2011 that he would do this. (Kshawkin 15:08, 13 November 2011 (EST))
providing tools for workflows
The section "Determining Data Values for the TEI Header" mentions AACR2 and ISBD(ER). Perhaps also mention RDA?
how to record non-ASCII characters
People often want to know what to do with non-ASCII characters. There are generally four options:
- insert them as they are (and assume the files will be manipulated by Unicode-aware software and that people can distinguish these from similar characters)
- insert decimal character references
- insert hexadecimal character references
- insert mnemonic entity references (which need to be declared)
We should give advice on this.
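For illustration, the small document below writes the same word four ways, one per option, using é (U+00E9). The entity name eacute is declared locally in the internal subset; in practice such declarations might come from an external entity set instead.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE p [
  <!-- declaration needed only for the mnemonic form -->
  <!ENTITY eacute "&#xE9;">
]>
<!-- literal character, decimal reference, hexadecimal reference, mnemonic entity -->
<p>café, caf&#233;, caf&#xE9;, caf&eacute;</p>
```

An XML parser delivers the same character in all four cases, so the choice mostly affects how readable and portable the files are before parsing.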
add <distinct> to Level 4
Consider adding distinct to Level 4.
The section on linking between page images and the TEI document does not advise on use of any particular METS profile. Consider developing one, especially one that would work with DFG-Viewer since this tool can be used to view any METS object regardless of where it's hosted. Torsten Schaßan developed a METS profile for manuscripts for the DFG-Viewer that uses the TEI header (as opposed to the regular book profile, which uses MODS).
Guidance on encoding serials?
People often ask how to handle encoding of serials. Should you have a single TEI document for the whole journal run, for individual volumes, for individual issues, or for individual articles? How do you encode metadata about the serial as a whole, issues, and articles?
How to indicate conformance to the BP?
In TEI, you typically use editorialDecl to say things about your specific use of the TEI. We use @n to say which encoding level you follow. And we have various p elements to describe encoding practices. But we have no place to say that the @n refers to the BP or that you follow the BP at all. We really should have this.
- You could refer to a schema using http://www.w3.org/TR/xml-model/ . Still need a mechanism to point to the ODD and/or prose documentation. This is part of a larger problem in the TEI. (Kshawkin 17:02, 7 November 2011 (EST))
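A schema association using xml-model would look roughly like the sketch below. The schema URL is a hypothetical placeholder, since no canonical location for a BP schema is given here.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical schema location; points a validator at a RELAX NG schema -->
<?xml-model href="http://example.org/schemas/bptl-level3.rng"
            type="application/xml"
            schematypens="http://relaxng.org/ns/structure/1.0"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <!-- teiHeader and text go here -->
</TEI>
```

This tells a processor which schema to validate against, but, as noted above, it does not by itself point to the ODD or the prose documentation.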
@rend vs. @html:style for CSS values
In TEI, the value of @rend is data.word+. James Cummings wrote in an email to tei-council, "While it is true that the order of data.word+ is significant and important they are meant to be individual tokens that do not necessarily have any relation to each other." (More on James's blog.) More specifically, there's nothing that says that data.word+ must be interpreted as discrete tokens with no relation to each other, but many programmers find it obvious that a set of token words will be an unordered set. This is problematic if you give CSS property-value pairs in the value of @rend. So Sebastian suggested that in the BP we use @html:style instead. This way we could continue using non-CSS values of @rend for hyphenation without violating our own recommendation.
- I'm not sure if I get this right: do you mean, the problem is that the property and the value of a property-value pair are interpreted separately because of the whitespace between them? If so, why don't you simply omit this whitespace? Besides, @html:style seems to be something very different, semantically, than @rend, so I don't think this would be a suitable alternative. --Martin de la Iglesia 04:18, 10 January 2012 (EST)
- CSS syntax, like XML syntax, generally disregards whitespace between tokens, so things like "text-align: right" and "text-align:right" are equivalent. Right now the BP is written to assume that any valid CSS could be put in @rend, so we'd have to change it to say something like "use any valid CSS but remove all whitespace". If we did this, it would basically fix the problem, as you say.
- You're right that @rend and @html:style currently have different semantics since @rend allows any description of appearance whereas @html:style requires use of the CSS vocabulary and syntax. However, the BP essentially redefines use of @rend to make it semantically equivalent to @html:style, so it seems more transparent to simply use @html:style.
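The tokenization concern can be made concrete with a small sketch. The element and CSS properties are chosen only for illustration and are not taken from the BP.

```xml
<!-- With whitespace: a processor that splits @rend on whitespace sees four tokens
     ("text-align:", "right;", "font-style:", "italic") and may treat them as an unordered set -->
<hi rend="text-align: right; font-style: italic">[text]</hi>

<!-- Whitespace removed: the whole declaration list survives as a single token -->
<hi rend="text-align:right;font-style:italic">[text]</hi>
```

Note that some CSS values need internal whitespace (for example "margin: 0 auto"), so a rule of "remove all whitespace" could not cover every valid declaration.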
Rework Level 1 and Level 2 structure?
Consider one of two ways to rework the encoding at Levels 1 and 2:
A) Kevin suggests removing the wrapper div1 and ab elements, leaving the OCR text in the body element. This would make it easier to upgrade documents to a higher level (because you wouldn't need to remove tags but could simply insert new ones).
B) Instead of putting the whole text in a single ab element, use one ab for each page. Lou suggests this:
<div1>
  <ab n="[pageno]" facs="00000001.tif">
    <!-- uncorrected OCR for first page image goes here -->
  </ab>
  <ab n="[pageno]" facs="00000002.tif">
    <!-- uncorrected OCR for second page image goes here -->
  </ab>
  <!-- etc. -->
</div1>
Then you wouldn't even need the pb element. But upgrading such a document to Level 3 would require significant reworking.
pb@xml:id and METS
The recommendation
Use the @xml:id attribute on each <pb> element and a METS document to provide correspondence between <pb> elements and one or more facsimile page images (e.g., master, web derivatives, etc.).
isn't very clear. Is the value of each pb@xml:id also included somewhere in the METS document? If so, where exactly?
revised display examples for levels 3 and 4
The display example for Level 3 has texts with lots of features not found in Level 3 and is of oral transcriptions, which don't have a clear source document. For clarity, I suggest we instead link to http://name.umdl.umich.edu/abu0246.0001.001 as a display example for Level 3 and http://webapp1.dlib.indiana.edu/vwwp/view?docId=VAB7020.xml for Level 4.
The @rend value "keep-hyphen" was meant to mean "keep the hyphen when tokenizing", not "keep the hyphen when rendering" as it appears from a quick read. It's also misleading to use @tei:rend to distinguish soft and hard hyphens when these can't be distinguished by the appearance in the source document (as @tei:rend is meant for). Consider revising this attribute value.
- Conal Tuohy proposes encoding a "hard" hyphen (such as in "so-called") as an actual hyphen in the source text (without the need for any XML markup around it), and a "soft" hyphen as
<lb rend="hyphen"/> (noting that a line-break was rendered in the source document).
- Just for the record: my current suggested solution for encoding "hard" hyphens is
<lb break="keepHyphen"/>, because I think @break is where this information fits best, semantically. --Martin de la Iglesia 08:53, 18 October 2013 (EDT)