Talk:Best Practices for TEI in Libraries

To consider in future revisions

milestone element

<milestone unit="typography" n="******"/> -- Is this TEI-conformant? Is there a better way to do this in any case?

The TEI Header

Identifiers for outside metadata?

Should we have a place in the header to indicate an identifier for an outside metadata record for the item? Examples:

record number for the source document in the local catalog
record number for the source document in WorldCat
record number for this TEI document in the local catalog
record number for this TEI document in WorldCat

Having such a link would allow a delivery system to provide an unambiguous link to this full metadata without relying on matching other information in the header like a title, ISBN, or call number. (Kshawkin)

Yes, I think we should. How about the spot where the TEI Guidelines recommend putting the code for the classification of the text (in some scheme), <classCode> inside <classDecl>, or is that too much of a stretch? (—Syd)

During the call on 2/10/09, Syd said he no longer thinks use of classCode (and a corresponding classDecl) is a good idea. Instead, he suggested we create a new element, otherDesc, to contain elements from outside the TEI namespace for metadata not covered by the TEI header. The GBP could specify how this element is used. (Kshawkin)

NOTE: we talked about this during our conf call on 2/10/09; we decided to have a sub-group conference call on 2/17/09 to talk in more detail about this. Emcaulay

We didn't get to this on 2009-02-17, so we postponed to 2009-03-03. However, few people showed up, so we postponed again. As Syd put it, there are two issues to consider here:

A. What mechanism should we use to point from the TEI header to metadata located outside the TEI document? (For example, how do you identify a MARC, METS, or MODS record that provides additional metadata about the TEI document and/or the source document?)

B. Should we provide a recommendation on storing non-TEI metadata within the TEI document (using a different element namespace)? For example, should we allow Dublin Core elements anywhere in the TEI header?

Email discussions in late March 2009 and early April 2009 with Syd, Melanie, Kevin, Michelle and Glen did not reach a conclusion. Tentative plans for the future would do this sort of thing when an element has the @ref attribute:

<author>
  <persName xml:id="persName_1" ref="http://authorities.loc.gov/cgi-bin/Pwebrecon.cgi?AuthRecID=1563939&amp;v1=1&amp;HC=1&amp;SEQ=20090404152214&amp;PID=wRSbpUQ7Uptm_ypRikIdNPzF">Welles, Gideon, 1802-1878.</persName>
</author>

except that in your example there's no @type or other method for describing the relationship between the content of <persName> and the value of @ref. P5 says that @ref "provides an explicit means of locating a full definition for the entity being named by means of one or more URIs", but we are looking for a typology of some sort for these links and need a place to indicate the type of link.

And we'd do this when there's no @ref:

<sourceDesc xml:id="sourceDesc_1">
[. . .]
</sourceDesc>

for which you'd find elsewhere in the document:

<link type="MARCsource" target="#sourceDesc_1 http://mirlyn.lib.umich.edu/F/?func=direct&amp;doc_number=000601789&amp;local_base=MIU01_PUB"/>

link elements might be grouped together in one of these places:

TEI/teiHeader/profileDesc/creation/ab/linkGrp
1st child of <text>
last child of <text>
TEI/text/back/div[@type='editorial']/linkGrp
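As a rough illustration of how a delivery system might consume such links, the following Python sketch pairs each local pointer in a <link> target with the external record URL that accompanies it. The sample document, the function name external_records, and the omission of the TEI namespace are assumptions made for brevity; only the MARCsource type and the target value come from the example above.

```python
import xml.etree.ElementTree as ET

# Qualified name under which ElementTree exposes the xml:id attribute.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# Hypothetical, namespace-free sample; a real P5 file would use the TEI namespace.
SAMPLE = """<TEI>
  <teiHeader>
    <fileDesc>
      <sourceDesc xml:id="sourceDesc_1"><p>[. . .]</p></sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <back>
      <div type="editorial">
        <linkGrp>
          <link type="MARCsource" target="#sourceDesc_1 http://mirlyn.lib.umich.edu/F/?func=direct&amp;doc_number=000601789&amp;local_base=MIU01_PUB"/>
        </linkGrp>
      </div>
    </back>
  </text>
</TEI>"""


def external_records(root):
    """Return (local element name, link type, external URL) for each <link>."""
    # Index every element that carries an xml:id.
    by_id = {el.get(XML_ID): el for el in root.iter() if el.get(XML_ID)}
    found = []
    for link in root.iter("link"):
        pointers = link.get("target", "").split()
        local = [p[1:] for p in pointers if p.startswith("#")]
        remote = [p for p in pointers if not p.startswith("#")]
        for ident in local:
            if ident in by_id:
                for url in remote:
                    found.append((by_id[ident].tag, link.get("type"), url))
    return found
```

Calling external_records(ET.fromstring(SAMPLE)) yields one tuple that ties the sourceDesc element to the Mirlyn record, with the link type available for deciding how to label it.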

Issues Pending

I updated the Harkin/Pushkin L5 example from P4 to P5. I noticed a problem that I hope I've fixed correctly. The <text> element has lang="rus" specified. But as far as I can tell, all of its content was in English. So I removed it (and made the presumption, as with all our other examples, that xml:lang="en" is specified on <TEI>). Syd

Issues Resolved, Changes made

language identification

We also need to figure out what to do about the recommendation for lang=. That's a tougher issue, because it's not just a syntax change. Our guidelines for best practice are at odds with the TEI Guidelines and the IETF best current practices. Syd 2008-10-21T19:09Z.

Syd, I think we agreed we want to conform with TEI Guidelines and IETF best current practices ... Can you point out or edit one of the offending sections so that we can get back in line (I'm not sure where the problem surfaces and how to fix it.). For me this is not a topic of debate ... Is there anyone who objects? Emcaulay

Syd brought this up again on 2009-04-03 since we haven't dealt with it. Let's align with P5. (Kshawkin)

Level 1 and Level 2 Examples and Linking to P5 Guidelines

Link to corresponding sections of the P5 Guidelines whenever possible (for each recommendation); Lisa and Rich need to add these to Level 2 and Lisa and Andrew need to add them for Level 1.
IN EXAMPLES and GUIDELINES, CHANGE the "@id" TO "@xml:id". Lisa and Rich need to make sure this is handled correctly in Level 2. Lisa and Andrew need to make sure this is handled for Level 1.
CHANGE TEI.2 as the root element to TEI. Lisa and Rich need to make sure this is handled correctly in Level 2. Lisa and Andrew need to make sure this is handled for Level 1.

Acknowledgments and Bibliography

The header portion of this document was originally prepared by Judy Ahronheim, Thomas Champagne, Lynn Marko, Kelly Webster, and Chris Wilcox of the University of Michigan Library and Jackie Shieh of the University of Virginia Library in October 1998. The source documents were the cataloging guides prepared by those two institutions (Virginia and Michigan). In addition, documentation from the Oxford Text Archive, Arts and Humanities Data Service of the United Kingdom also was made available to assist in this effort.
This text was heavily revised in 2008 by Melanie Schlosser and Kevin Hawkins, with input from other members of the SIG on Libraries.

Do we need this in the Header section or even at all? Can we just have a general acknowledgements section in the appendices for all who have contributed?

This was moved to the appendix.

Remove mention of entity references

Someone wrote that we'll need to "ditch the bit on entities". Looks like this happened.

Inline comments on the header

There are a number of inline comments we need to address. Look for colored text in this section of the GBP.

Syd, Melanie, and Kevin recommend removing the section called "Advanced TEI Header Practices" because it raises more questions than it answers.

Kshawkin 13:15, 8 March 2009 (EDT): now done because realized that none of us added that section

Level 1 and Level 2 -- should they be combined? (resolved 2009-02-24)

The levels should remain separate, with an emphasis on "no intervention in the initial tagging of structure" for Level 1.

Use of floatingText (resolved 2009-02-24)

Should we recommend or require use of floatingText in Level 3 and above? In Level 4 and above? In Level 5 only?

Resolved during 2009-02-24 meeting. See minutes.

Filenaming (resolved on 2009-01-27)

These issues were resolved in the conference call on Jan. 27 and changed in the wiki on Feb. 1. The guidelines are broad and vague, except in saying that only a narrow range of characters should be used in filenames. (Kshawkin)

{snippet from BPG text; comments about file naming} [This recommendation also seems dated (and the standard is targeted for CD-ROM file naming). I think we should recommend a consistency in file naming according to respective digital object storage practices. For example, IUDLP has guidelines in place and perhaps we can mine the more general recommendations from there like only ASCII, no spaces, 3 letter extensions, etc. (Mdalmau)] sounds like a good idea to remove or revise; as is it seems weird. (emcaulay) I'll just point out that people still use CD-ROM as an archival storage medium (I'm looking at you, Chris) as well as a file transfer mechanism [pwillettt] {end snippet}

The question isn’t whether anyone is still using CD-ROM; lots of folks probably do. The question is whether anyone is still using ISO 9660 (as opposed to UDF, ECMA-168, or ISO 13490) CD-ROMs without using Rock Ridge or Joliet extensions. Anyone even know how to do that? -- Syd

File naming is still an issue. Perry pointed out that some folks store TEI files on CD-ROM (makes sense). Perhaps it just needs to be teased out: guidelines for those who use CD-ROM for storage, and more general filenaming guidelines for server storage/delivery, like:

Standardized file naming for a particular encoding project is key for reliable online storage and delivery of these files. Consider the following best practices when determining the file name scheme for your project:
Each filename should contain an identifier that uniquely specifies a single digital object within the parent collection (e.g., a parent collection of text, images and other related materials)

Each filename should be fully specified. It should not just be a sequence number that is dependent on location within a directory structure for context

Filenames should not include spaces

Filenames should follow a predictable case convention (e.g., all lowercase, camelCase, etc.)

The first character of the filename should be an ASCII letter ('a' through 'z' or 'A' through 'Z') to comply with current restrictions on identifiers by many programming and metadata languages such as METS

The "base" filename may include only ASCII letters ('a' through 'z' and 'A' through 'Z'), ASCII digits ('0' through '9'), hyphens, underscores, and periods. Refrain from using other characters and limit period usage to only once (to separate base name from file extensions).

For those saving files to CD-ROM for storage or file transfer, file naming should follow ISO 9660 conventions: 8-character filenames, 3-character extensions, using A-Z, a-z, 0-9, underscores and hyphens.

(Mdalmau)
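The character and structure rules in the list above lend themselves to a mechanical check. The Python sketch below is one possible reading of them, not an agreed implementation: the function name filename_problems is hypothetical, "limit period usage to only once" is interpreted as exactly one period separating base name from extension, and the case-consistency and ISO 9660 length rules are deliberately left out.

```python
import re


def filename_problems(name):
    """Return a list of rule violations for a proposed filename (empty if none)."""
    problems = []
    if " " in name:
        problems.append("contains a space")
    # The first character must be an ASCII letter.
    if not re.match(r"[A-Za-z]", name):
        problems.append("does not start with an ASCII letter")
    # Only ASCII letters, digits, hyphens, underscores, and periods are allowed
    # (spaces are reported separately above).
    if re.search(r"[^A-Za-z0-9._ -]", name):
        problems.append("uses a disallowed character")
    # Assumption: exactly one period, separating base name from extension.
    if name.count(".") != 1:
        problems.append("does not have exactly one period")
    return problems
```

A project could run such a check over a directory listing before ingest; a name like welles_diary-001.xml passes, while 001.xml is flagged for its leading digit.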

While I actually think my recommendation of 2008-12-03T11:57-05 (“I was wrong”) is syntactically slightly superior, it’s time to apply Syd’s wheel reinvention prevention convention in full force. The conventions MD refers to above (i.e., IUDLP has guidelines in place) are perfectly workable. We should just refer the reader there and be done with it. Syd

I did steal these from the IU DLP guidelines, but I was selective because there's a lot of "Fedora" construct influencing our filenaming conventions. Wouldn't want the users to go there and feel overwhelmed. I selected the more "basic" factors for consideration, but certainly a pointer to DLP documentation or anywhere else could be helpful (as a footnote?). (Mdalmau)

I hate to throw another comment into the mix on this issue, but I seem to recall reading somewhere that underscores are now discouraged in file naming, because they can be mistaken for blank spaces between words. If this is incorrect, my apologies. rwisnesk

It's true about underscores: http://www.education.umd.edu/ETS/web/webNamingConventionRP.html .

I've just read this, and it is only a mild recommendation, and their argument is pretty weak. (Hard to read printed when underlined: in real printing, underlines aren't used, italics are; this is only a problem when printing from a web browser, really, in which case web browsers should do a better job.) Syd

I still wonder: does anyone really need to worry about 8-character-long filenames when burning modern CD-ROMs? Emcaulay

As I said above, no; I don't think there is any reason to worry about 8-dot-3 using modern CDs, DVDs, or Blu-ray discs. Syd

Numbered Divs (resolved on 2009-01-27: see minutes)

[This seems worth revisiting. Do we really need such a software-specific recommendation? (Kshawkin)]

[I agree. We generally avoid numbered divisions. Recent survey revealed a nearly 50/50 split on the topic, but we shouldn't advocate one or the other. (Mdalmau)]

I disagree pretty strongly: many perceive it a shortcoming of the TEI Guidelines that they often offer more than one way to do something when there is not much gain in the difference. For our purposes, I think this is one of those cases, and we should avoid causing the same confusion. We should pick either numbered or unnumbered and stick with it throughout. (And I don't think it matters much which we pick.) -- Syd

For a discussion of whether to use numbered or unnumbered divs, see the TEI P5 Guidelines Chapter 4: Default Text Structure. (emcaulay)

I'm not sure when Syd added his comment: before or after the conference call? We did pick one for the original best practices but there has been significant unhappiness about that decision ever since. As Michelle points out, the community out there is split 50-50. It's not a disagreement that's going away. (Pwillett 19:37, 9 February 2009 (EST))

I just now see the "resolved" banner above, so ignore my comments just above. (Pwillett 19:38, 9 February 2009 (EST))

Chapter 4 of the P5 Guidelines makes allowances for both, with a preference for unnumbered divisions, which more easily support arbitrary levels of nesting (as opposed to a fixed number). Unnumbered is also preferred because the levels assigned to parts of a text may change from project to project or even book to book within the same project. With unnumbered divisions, @type can be used to designate the level. To serve both those who type more semantically and those who need numbered divisions for more predictable processing, how about we rewrite the section thusly:

We recommend the use of unnumbered divisions throughout the electronic text with proper values inserted in the @type attribute. For those of you who require numbered divisions for software processing, populate the @type attribute with a number, 1-7 (?), that corresponds to the appropriate level. For those of you who prefer a semantic label (e.g., chapter, section, etc.), determine a typology beforehand and designate the appropriate level in the @type attribute. The ability to do both is also possible if it is important to maintain an explicit connection between the numbered and unnumbered labels by using @type and @subtype accordingly. However, a combination of numbered (e.g., <div1>) and unnumbered (e.g., <div>) divisions is not supported. For a more detailed discussion about numbered and unnumbered divisions, consult Chapter 4: Default Text Structure of the TEI P5 Guidelines. (Mdalmau)

This revision may impact how we display examples throughout the text. We need to keep this in mind if accepted. [Mdalmau].

The construction of "typologies" is a common activity for many of us when performing document analysis. When we are ready to expand the guidelines, I think including a section on "document analysis" is key. We can then explore issues of typology-building and how to constrain those values in the schema (or even Schematron). But defining the value list is not an easy task, which is one benefit of using numbered divs. (Mdalmau)

Using type= to mimic numbered <div>s seems too close to tag abuse for comfort. I completely agree that typologies are part of document analysis and enthusiastically support having a section on that. But I don't see how using numbered <div>s has anything to do with it. A project should develop a useful typology whether they are using numbered or unnumbered. Many projects won’t, anyway.

Would this wording be one possible compromise: Use unnumbered divisions <div>, unless your text has obvious divisions, such as chapters with no complex subdivisions, in which case begin with <div1>(rwisnesk)
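Whichever style is chosen, the one point nobody disputes above is that numbered and unnumbered divisions should not be combined in a single text. A project could check for that mechanically; the Python sketch below is a minimal illustration that assumes namespace-free element names and the div1 through div7 range, and its function names are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Numbered divisions are assumed to run from <div1> to <div7>.
NUMBERED = re.compile(r"^div[1-7]$")


def div_styles(xml_text):
    """Return the set of division styles ('numbered', 'unnumbered') found."""
    root = ET.fromstring(xml_text)
    styles = set()
    for el in root.iter():
        if el.tag == "div":
            styles.add("unnumbered")
        elif NUMBERED.match(el.tag):
            styles.add("numbered")
    return styles


def mixes_div_styles(xml_text):
    """True when a text combines numbered and unnumbered divisions."""
    return len(div_styles(xml_text)) == 2
```

The same test could equally be written as a Schematron rule, which may suit projects that already validate against a schema.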

Page Breaks (resolved on 2009-01-27: see minutes)

[Always including page breaks within a div seems quite software-specific. I suggest revisiting. (Kshawkin)] As we've discussed on a conference call, this isn't software specific. There are two points here. The historical point is that we wanted to recommend a practice, as a way of creating consistency and uniformity among encoded documents. There's a choice to be made about where to stick page breaks, so we chose one. But more importantly, it's about any software (e.g., XSLT) that will grab and return an entire DIV. You'll want to include the page break in that chunk of encoded text. In my experience, this generally works, except at the beginning of the volume, which typically would have <TEXT><BODY><PB><HEAD>Book Title</HEAD><DIV><HEAD>Chapter title</HEAD> [pwillett]

It seems that the page break blurb we have in place is not really an issue that needs to be revisited. I agree with Perry that promoting consistency is helpful (and also aids processing of text in most cases; page breaks, like all milestone tags, are hard to reckon with sometimes). The suggestion seems neutral enough that it can remain as-is. If someone disagrees, please provide the rationale for further review. Thanks! (Mdalmau)

While I am, of course, willing to be outvoted, this strikes me as less than a Best Practice, for both theoretical and practical reasons.

Theoretically, I think, as Snoopy said, “honesty is the best policy”. In general in these cases, the page break does not occur inside a division, it occurs between them. To encode it otherwise seems to me to be asking for trouble.

Practically, this recommendation favors <div> above all others. For any other element, if you wanted to know “on what page did this occur”, you would ask the question “What is the value of n= of the most recent <pb> on the preceding:: axis?”. This works for <quote>s, <said>s, <lg>s, <head>s, etc. If page breaks are encoded where they lie, it works for <div>, too. If they are moved into the division as per this recommendation, then a different question has to be asked: “What is the value of n= of my first child <pb>?”. Neither question is difficult to ask. But how do you know which to ask? —Syd
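The two questions can be made concrete with a small Python sketch. It uses the standard library's ElementTree, which has no preceding:: axis, so the first function walks the tree in document order instead, which gives the same answer for an empty element like <pb>. The sample markup, the function names, and the lack of a TEI namespace are all assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: page 1's break lies between divisions, while page 2's
# break has been moved inside the second division.
SAMPLE = """<body>
  <pb n="1"/>
  <div>
    <head>Page break encoded between divisions</head>
    <p>Some text <quote>quoted words</quote>.</p>
  </div>
  <div>
    <pb n="2"/>
    <head>Page break moved inside the division</head>
  </div>
</body>"""


def page_by_preceding(root, target):
    """n of the most recent <pb> before target in document order."""
    current = None
    for el in root.iter():
        if el is target:
            return current
        if el.tag == "pb":
            current = el.get("n")
    return None


def page_by_first_child(target):
    """n of target's first child <pb>, or None if it has none."""
    first = target.find("pb")
    return first.get("n") if first is not None else None
```

With this sample, the document-order question answers "1" for the <quote> and for both divisions, whereas the first-child question answers "2" for the second division and nothing for the first, so a processor has to know in advance which convention the encoder followed.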

Level 1 section

Paragraphs or Anonymous Block (resolved on 2009-01-27: see minutes)

Currently, level 1 contains a table with the following information for the <p> tag:

At least one "container" element per div is required (while <ab> is another option for this case, the Task Force suggests using <p> in order that the document be open to being extended to other encoding levels).

I don't remember this discussion. It doesn't seem very difficult, once the decision is made to upgrade, to transform all ab's to p's. Or? [pwillett]

I agree with Perry, and our goal is to be conformant. So the <p> could simply be changed to <ab>. Do we want to address the <p> legacy any further or maybe as an end note? (Mdalmau)

TEI Header introductory paragraphs rewritten

On the teiHeader subgroup conference call on 2/17/09, we decided to do a rewrite of the opening paragraphs related to the teiHeader (everything before the element table). Lisa has made a complete rewrite, but is placing here all the text that was in this section as of Feb 20, 2009 around 11:30AM eastern time.

OLD TEXT (cut from main article on 2/20/09)

Introduction?

The TEI header may be used to describe a collection of documents, a single item, or a portion of an item. Variances in TEI header content can result from making different choices of what is being described. Within the library domain, a TEI header is often perceived as similar to or at least related to a MARC record. However, a TEI header does not typically have a one-to-one correspondence with a MARC record: one TEI header may be described by multiple MARC analytic records, or one MARC record may be used to describe a collection of TEI documents with individual headers.

Purpose

A TEI header serves several purposes. It may contain an historical background on how the file has been treated. It can extend the information of a classic catalog record. The text center or cataloging agency can act as the gatekeeper for creators by providing standards for content. A TEI header can serve many publics: headers can be created in a text center and reflect the center's standards, or they can serve as the basis for other types of metadata system records produced by other agencies. Headers can function in detached form as records in a catalog, as a title page inherent to the document, or as a source for index displays.

Chief Sources of Information for Creating the TEI Header

Does the TEI header act as the electronic title page for the encoded document (part of the item) or as a catalog record for it (pure metadata)? Is it integral to the document it describes or independent? Depending on the community being served, the TEI elements will reflect the interest of that community. Nonetheless, it is possible to describe a set of "best practices" that will produce compatible content while accommodating this variety of purposes. Compatibility of content encourages a more understandable set of results when information about assorted items is displayed as a set of search results, a contents list, or an index, and it allows for more reasonable conversion of content information from TEI elements to elements of other metadata sets when this action seems advisable.

It is a traditional practice of librarianship to agree upon which location(s) in a document and in what order of preference one should look to identify the title, author, etc., of that document. This practice permits a certain consistency in terminology and allows for a certain amount of authentication of content. We recommend the following preferences to those who create headers and to those who attempt to use headers to create traditional catalog records that are compliant with AACR2 Anglo-American Cataloging Rules 2nd Edition and ISBD(ER) International Standard Bibliographic Description for Electronic Resources rules.

As a member of the academic community, the header creator or editor has a responsibility to verify, whenever humanly possible, the intellectual source for an electronic document that presents itself without any information regarding its source or authorship.

Who Should Create and / or Edit the TEI Header

Every group will have its own method for creating and editing TEI Headers. Generally, the person who creates the TEI Header is familiar with TEI and is also familiar with bibliographic description. (emcaulay)

For an electronic document with a digitized title page, prefer
1. Chief source of information = information coded as title page
2. Use added information from an originating paper document if absolutely certain it is the source
If no title page is present and there is no evidence of a source document, the header creator
1. May assign a title and author if appropriate
2. Enclose the information in brackets, using the standard English language convention for editorial interjections
If neither header nor title page is present but the header creator has satisfactory evidence of an originating source, that document should be used as the chief source of information for the title and author of the header. If the source cannot be fully verified as to edition, authorship, etc., this fact should be clearly indicated in a note in the <fileDesc>.

[End of old text]