Talk:Best Practices for TEI in Libraries

Some comments for TEI Header Guidelines


 * You probably want to change the @id in a few places to @xml:id. And the examples have TEI.2 as the root. Piotr 22:48, 19 October 2008 (EDT)

Yes, at some point soon we will need to change passim. Will also need to ditch the bit on entities. Maybe I can get to these later today. — done.
 * id= to xml:id=
 * &lt;TEI.2&gt; to &lt;TEI&gt;
 * target="blah" to target="#blah"
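
A rough sketch of how these syntactic changes might be scripted, using only Python's standard library. It assumes un-namespaced P4-style input, and the function name p4_to_p5 is made up for illustration. It covers only the id, target, and TEI.2 root changes mentioned in this thread and ignores everything else P5 requires (the TEI namespace, for one), so treat it as a starting point rather than a migration tool.

```python
import xml.etree.ElementTree as ET

# Clark-notation name of the xml:id attribute.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

def p4_to_p5(root):
    """Apply the syntactic changes from this thread, in place (hypothetical helper)."""
    for el in root.iter():
        # id= becomes xml:id=
        if "id" in el.attrib:
            el.set(XML_ID, el.attrib.pop("id"))
        # target="blah" becomes target="#blah"; each whitespace-separated value is prefixed
        target = el.get("target")
        if target:
            refs = [t if t.startswith("#") else "#" + t for t in target.split()]
            el.set("target", " ".join(refs))
    # The examples use TEI.2 as the root element.
    if root.tag == "TEI.2":
        root.tag = "TEI"
    return root

sample = '<TEI.2><text><body><p id="p1">See <ref target="p1">here</ref>.</p></body></text></TEI.2>'
print(ET.tostring(p4_to_p5(ET.fromstring(sample)), encoding="unicode"))
```

The lang= question raised below is deliberately left out, since it is not just a syntax change.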

We also need to figure out what to do about the recommendation for lang=. That's a tougher issue, because it's not just a syntax change. Our guidelines for best practice are at odds with the TEI Guidelines and the IETF best current practices. Syd 2008-10-21T19:09Z.

Referencing the P5 Guidelines
I think that, whenever possible, we should link to the P5 Guidelines to further illustrate examples or even as an additional reference for the prose. These guidelines mention "text hierarchy", which is akin to the structure of a text, so we should link to Chapter 4, Default Text Structure, for illustrations of the various forms, etc. Not only do we leverage existing, robust documentation, but we help all TEI users, novice or expert, "penetrate" the monolithic Guidelines when relevant. Mdalmau

agreed. emcaulay

To do

 * Link to corresponding sections of the P5 Guidelines whenever possible (for each recommendation)

Filenaming
{snippet from BPG text; comments about file naming} [This recommendation also seems dated (and the standard is targeted at CD-ROM file naming). I think we should recommend consistency in file naming according to the respective digital object storage practices. For example, IUDLP has guidelines in place, and perhaps we can mine the more general recommendations from there, like only ASCII, no spaces, 3-letter extensions, etc. (Mdalmau)] Sounds like a good idea to remove or revise; as is, it seems weird. (emcaulay) I'll just point out that people still use CD-ROM as an archival storage medium (I'm looking at you, Chris) as well as a file transfer mechanism [pwillett] {end snippet}

The question isn’t whether anyone is still using CD-ROM; lots of folks probably do. The question is whether anyone is still using ISO 9660 (as opposed to UDF, ECMA-168, or ISO 13490) CD-ROMs without Rock Ridge or Joliet extensions. Anyone even know how to do that? -- Syd

File naming is still an issue. Perry pointed out that some folks store TEI files on CD-ROM (makes sense). Perhaps we just need to separate the guidance for those who use CD-ROM for storage from more general file naming guidelines for server storage/delivery, like:

Standardized file naming for a particular encoding project is key for reliable online storage and delivery of these files. Consider the following best practices when determining the file name scheme for your project:


 * Each filename should contain an identifier that uniquely specifies a single digital object within the parent collection (e.g., a parent collection of text, images and other related materials)
 * Each filename should be fully specified. It should not just be a sequence number that is dependent on location within a directory structure for context
 * Filenames should not include spaces
 * Filenames should follow a predictable case convention (e.g., all lowercase, camelCase, etc.)
 * The first character of the filename should be an ASCII letter ('a' through 'z' or 'A' through 'Z') to comply with current restrictions on identifiers by many programming and metadata languages such as METS
 * The "base" filename may include only ASCII letters ('a' through 'z' and 'A' through 'Z'), ASCII digits ('0' through '9'), hyphens, underscores, and periods. Refrain from using other characters, and use a period only once (to separate the base name from the file extension).
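
The mechanical rules in the list above could be checked with something like the following sketch. The function name filename_problems and the sample names are invented for illustration. Uniqueness within a collection and the choice of case convention are project-level decisions and are not checked here.

```python
import re

# Characters permitted anywhere in the name: ASCII letters, digits, hyphen, underscore, period.
ALLOWED = re.compile(r"[A-Za-z0-9._-]+")

def filename_problems(name):
    """Return a list of rule violations; an empty list means the name passes."""
    problems = []
    if " " in name:
        problems.append("contains a space")
    if not re.match(r"[A-Za-z]", name):
        problems.append("does not start with an ASCII letter")
    # Spaces are already reported above, so ignore them for this check.
    if not ALLOWED.fullmatch(name.replace(" ", "") or "?"):
        problems.append("contains a character outside A-Z, a-z, 0-9, '-', '_', '.'")
    if name.count(".") > 1:
        problems.append("contains more than one period")
    return problems

for candidate in ["vaa1234.xml", "my file.xml", "1abc.xml", "abc.v2.xml"]:
    print(candidate, filename_problems(candidate))
```

Returning a list of messages rather than a single true/false makes it easier to report every problem with a name at once when checking a whole directory.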

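For those saving files to CD-ROM for storage or file transfer, file naming should follow ISO 9660 (Level 1) conventions: filenames of up to 8 characters and extensions of up to 3 characters, using only A-Z, 0-9, and underscores. Strict ISO 9660 permits neither lowercase letters nor hyphens; those depend on extensions such as Joliet or Rock Ridge.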

(Mdalmau)

While I actually think my recommendation of 2008-12-03T11:57-05 (“I was wrong”) is syntactically slightly superior, it’s time to apply Syd’s wheel reinvention prevention convention in full force. The conventions MD refers to above (i.e., IUDLP has guidelines in place) are perfectly workable. We should just refer the reader there and be done with it. Syd

Numbered Divs
[This seems worth revisiting. Do we really need such a software-specific recommendation? (Kshawkin)] [I agree. We generally avoid numbered divisions. A recent survey revealed a nearly 50/50 split on the topic, but we shouldn't advocate one or the other. (Mdalmau)] For a discussion of whether to use numbered or unnumbered divs, see the TEI P5 Guidelines Chapter 4: Default Text Structure (emcaulay)

Chapter 4 of the P5 Guidelines makes allowances for both, with a preference for unnumbered divisions because they more easily support arbitrary levels of nesting (as opposed to a fixed number). Unnumbered divisions are also preferred because the levels assigned to parts of a text may change from project to project, or even from book to book within the same project. With unnumbered divisions, @type can be used to designate the level. To serve both those who label divisions semantically and those who need numbered divisions for more predictable processing, how about we rewrite the section thusly:

We recommend the use of unnumbered divisions throughout the electronic text with proper values inserted in the @type attribute. For those of you who require numbered divisions for software processing, populate the @type attribute with a number, 1-7 (?), that corresponds to the appropriate level. For those of you who prefer a semantic label (e.g., chapter, section, etc.), determine a typology beforehand and designate the appropriate level in the @type attribute. It is also possible to do both, if it is important to maintain an explicit connection between the numbered and semantic labels, by using @type and @subtype accordingly. However, a combination of numbered (e.g., &lt;div1&gt;) and unnumbered (e.g., &lt;div&gt;) divisions is not supported. For a more detailed discussion of numbered and unnumbered divisions, consult Chapter 4: Default Text Structure of the TEI P5 Guidelines. (Mdalmau)
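
For projects that already have numbered divisions, the proposed convention could be reached mechanically. The following is only a sketch under several assumptions: the input is un-namespaced, the level number goes in @type, and any existing @type value is moved to @subtype (the proposal above does not say which attribute carries which label, so that choice is a guess). The function name unnumber_divs is made up.

```python
import re
import xml.etree.ElementTree as ET

# Matches div1 through div7 and captures the level number.
NUMBERED = re.compile(r"div([1-7])")

def unnumber_divs(root):
    """Rename divN elements to div, recording N in @type (hypothetical helper)."""
    for el in root.iter():
        match = NUMBERED.fullmatch(el.tag)
        if match is None:
            continue
        old_type = el.attrib.pop("type", None)
        el.tag = "div"
        el.set("type", match.group(1))
        # Assumption: keep any existing semantic label by moving it to @subtype.
        if old_type is not None:
            el.set("subtype", old_type)
    return root

sample = '<body><div1 type="chapter"><div2><p>Some text.</p></div2></div1></body>'
print(ET.tostring(unnumber_divs(ET.fromstring(sample)), encoding="unicode"))
```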

This revision may impact how we display examples throughout the text. We need to keep this in mind if accepted. [Mdalmau].

The construction of "typologies" is a common activity for many of us when performing document analysis. When we are ready to expand the guidelines, I think including a section on "document analysis" is key. We can then explore issues of typology-building and how to constrain those values in the schema (or even Schematron). But defining the value list is not an easy task, which is one benefit of using numbered divs. (Mdalmau)

Page Breaks
[Always including page breaks within a div seems quite software-specific. I suggest revisiting. (Kshawkin)] As we've discussed on a conference call, this isn't software-specific. There are two points here. The historical point is that we wanted to recommend a practice, as a way of creating consistency and uniformity among encoded documents. There's a choice to be made about where to stick page breaks, so we chose one. But more importantly, it's about any software (e.g., XSLT) that will grab and return an entire DIV. You'll want to include the page break in that chunk of encoded text. In my experience, this generally works, except at the beginning of the volume, which typically would have &lt;TEXT&gt;&lt;BODY&gt;&lt;PB&gt;&lt;HEAD&gt;Book Title&lt;/HEAD&gt;&lt;DIV&gt;&lt;HEAD&gt;Chapter title&lt;/HEAD&gt; [pwillett]
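
To make the "grab and return an entire DIV" point concrete, here is a small sketch comparing the two placements. The sample markup, the @n values, and the function name page_of_div are all invented for illustration, and Python's standard library stands in for whatever XSLT or other software a project actually uses.

```python
import xml.etree.ElementTree as ET

# Page break placed inside the div it opens (the recommended practice).
PB_INSIDE = (
    '<body><div n="1"><p>End of chapter one.</p></div>'
    '<div n="2"><pb n="5"/><head>Chapter 2</head><p>Text.</p></div></body>'
)
# Page break placed between divs.
PB_BETWEEN = (
    '<body><div n="1"><p>End of chapter one.</p></div><pb n="5"/>'
    '<div n="2"><head>Chapter 2</head><p>Text.</p></div></body>'
)

def page_of_div(body_xml, n):
    """Extract one div on its own and report the first page break found inside it."""
    div = ET.fromstring(body_xml).find(f"div[@n='{n}']")
    pb = div.find(".//pb")
    return pb.get("n") if pb is not None else None

print(page_of_div(PB_INSIDE, "2"))
print(page_of_div(PB_BETWEEN, "2"))
```

With the page break inside the div, the extracted chunk carries its own page number; with the page break between divs, the chunk comes back without one and the software has to look outside the div to find it.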

It seems that the page break blurb we have in place is not really an issue that needs to be revisited. I agree with Perry that promoting consistency is helpful (and it also aids processing of the text in most cases; page breaks, like all milestone tags, are hard to reckon with sometimes). The suggestion seems neutral enough that it can remain as-is. If someone disagrees, please provide the rationale for further review. Thanks! (Mdalmau)

Paragraphs or Anonymous Block
Currently, level 1 contains a table with the following information for the &lt;p&gt; tag:

At least one "container" element per div is required (while &lt;ab&gt; is another option for this case, the Task Force suggests using &lt;p&gt; in order that the document be open to being extended to other encoding levels). I don't remember this discussion. It doesn't seem very difficult, once the decision is made to upgrade, to transform all ab's to p's. Or? [pwillett]

I agree with Perry, and our goal is to be conformant. So the &lt;p&gt; could simply be changed to &lt;ab&gt;. Do we want to address the &lt;p&gt; legacy any further, or maybe as an end note? (Mdalmau)
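
On how difficult the ab-to-p transformation would be: as a bare renaming it is only a few lines. The sketch below assumes un-namespaced input, and the function name rename_elements is invented. Whether the renamed document is still valid depends on what the elements contain, so the result should be validated against the schema afterwards.

```python
import xml.etree.ElementTree as ET

def rename_elements(root, old="ab", new="p"):
    """Rename every <old> element to <new> in place; return how many were changed."""
    # Collect the matches first so that renaming cannot disturb the traversal.
    matches = list(root.iter(old))
    for el in matches:
        el.tag = new
    return len(matches)

doc = ET.fromstring("<div><ab>First block.</ab><ab>Second block.</ab></div>")
changed = rename_elements(doc)
print(changed, ET.tostring(doc, encoding="unicode"))
```

Swapping the arguments (old="p", new="ab") gives the opposite direction, should the recommendation change to &lt;ab&gt;.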