Subject: comments on guidelines document
From: Syd Bauman
Date: Wed, 22 Mar 2006 06:19:04 -0500
To: DLF-TEI@LISTS.DIGLIB.ORG

Here are some comments on what I believe to be the current (2.0?) version of the DLF "TEI Text Encoding in Libraries Guidelines for Best Encoding Practices".

I realize that these comments would probably have been a lot more useful if I'd posted them days ago, or at least more than 12 hours in advance of the conference call. Sorry. That's what conference calls are for in large part -- to get slugs like me to finally do what they should have done weeks ago, so we don't have to be embarrassed by saying "I didn't read it" on the call. I would have posted > 12 hours in advance, BTW, but there is no wireless on this flight. (Lame, lame, lame.)

I am going to apologize in advance if I am raising any issues here that we already brought up at the meeting and have decided. Please attribute any such transgressions to the late hour and my poor memory, not to any malice on my part.

General comments
------- --------

* First (and perhaps foremost) I am a strong proponent of version control. This document claims to be Version 2.0 of 2005-11-20, but Matthew sent it to us just the other day.

* The word "tag" is sometimes used when the word "element" is what is intended. I've tried to flag these, but may have missed some.

* I personally prefer "heading" to "head" to describe the thing at the beginning of a chapter which we would typically encode as a <head>.

I.
--
Minor nit-pick. When the chair's position is noted parenthetically, I think all the names should be listed alphabetically, even the chair's. ("That's easy for you to say", Perry objects, "your surname starts with 'B'!") Also, wasn't Matthew chair of the 2006-02 meeting?

III.
----
"Page breaks <pb/> should occur at the top of the page, and entirely within any division."

* There is no strong reason that this document should feel compelled to follow TEI house style.
  On the other hand, there is some advantage to such consistency, so: in TEI documents, an element name (encoded with <gi>) is rendered with "<" and ">", whether it is declared as an empty element or not. Thus "<pb>", not "<pb/>".

* Saying "Page breaks <pb>" seems to me like giving the subject of the sentence twice. (Reminds me of the pilot episode of Star Trek: the Next Generation.) Although perhaps it was just intended that the tag be parenthetical, which would be fine. But perhaps being more verbose would be clearer. I would recommend "Page breaks should be encoded using the <pb> element ..." or "The <pb> element should be used to indicate the top ..."

* Since <pb> is empty, it does not make sense to say that one <pb> should be "entirely within" any XML element, since it always must be entirely within each and every ancestor element. I think perhaps what was meant is "always", but this presents problems (discussed below).

* If the recommendation is that <pb> go within <div>[1], and that <pb> go at the top of each page (rather than between pages), it logically follows that, in general, a page break that occurs between chapters 2 & 3 should be encoded near the top of the <div> that holds chapter 3 (rather than near the bottom of the <div> that holds chapter 2). However, it may be worth stating this explicitly.

IV.1
----

* In "Rationale" the initial word "That" should be dropped or replaced with "The".

* "... using the teixlite DTD allows Level 1 texts to be compatible with more richly encoded teixlite texts for searching, ...": I am not sure it is worth changing the wording of the document, but I don't think this is strictly true. It is quite easy to imagine, e.g., two XML documents, transcriptions of similar document sources (say, two monographs in a series -- Hardy Boys or whatever), which are both valid against teixlite.dtd, but which are encoded so differently as to make context-sensitive searching pretty incompatible.
  I.e., I don't think that it is the use of teixlite that permits this compatibility, but rather adherence to far more strict rules (some of which are expressed in the document we are writing) that make the encoding consistent.

* "<div1> type="section" is the default attribute value": does that mean that type="section" should be the default specification (i.e., all of your <div1> elements should have a type= -- if you have no other idea for type=, put in type=section), or that because type=section is the default, when you encode a <div1>, if you do not encode a type=, software should presume type=section? I don't think it matters much which we pick, just that the wording should make it clear, e.g. "If no type= attribute is specified, a type= of "section" should be presumed". One of the things that makes this a bit difficult to describe is the fact that, in technical terms, type="section" is *not* the default a la the DTD.

* There is an extra semicolon and space after "... extended to other encoding levels" in the description of <p>.

* For <pb>, the description starts "This is required ...". To be consistent this should probably be just "Required ...".

* "Page images can be linked to the text using id/idref." AFAIK, the systems y'all have in place for linking a page image to a <pb> do not make use of IDREFs. Rather, they make use of the fact that there exists a file in your system whose filename matches the value of id=. If that's true, then this needs to be reworded. Either
      Page images can be linked to the text using the value of id=.
  or
      Page images can be linked to the text using IDs.

* "Because ids are unique ..." should read "Because IDs are unique". (Or perhaps "Because IDs must be unique" or "Because the values of id= are definitionally unique within any given document", etc.)

* The example has some minor indentation inconsistencies.

IV.2
----

* Should "... be displayed separate from their page images" be "... be displayed separately from their page images"?
  (Note the "ly")

* "It is recommended that the n attribute be included to record the div sequence."
  - should be "... record the sequence of divisions" or "<div> sequence" or some such
  - if we recommend using n= to record the sequence, shouldn't we give more advice about how to do so? E.g. which divisions get counted, whether or not to use hierarchical n= values (I guess that's not a problem with level1, is it? :-) , to use Arabic numerals padded on the left with sufficient zeroes, etc.?

* Example has
      <front>
        [optional text of titlepage, etc]
      </front>
  from which I'm worried that people will incorrectly infer that the <front> tags are required, but the content is not.

* Example has
      <body>
        <div1 type="chapter" n="1">
          <head>Chapter 1</head>
          <p>[text of Chapter 1 goes here interspersed with <pb/> elements pointing to page images]</p>
        </div1>
  This, I think, is a good place to point out the problem with "<pb> should always be inside a <divN>". If there were a heading of the body, which occurred on a page of its own, the <pb> element between the front matter and the body could not be recorded at the top of the <div1>.

IV.4
----

* "... a searcher could limit his or her search in a dramatic text ... to the speeches of a particular character." This is a bad example because we are not recommending use of the who= attribute, which is often essential for limiting such searches. (The contents of <speaker> is often not consistent enough to be used for this purpose.)

* "Typographically distinct text should be encoded as <foreign>, <title>, or <emph> as appropriate." Does that mean other phrase-level elements intended for typographically distinct text (e.g., <term>, <q>, <gloss>, <mentioned>, <soCalled>) should not be used?

* "It is recommended that the <sic> element be used to indicate typographic errors, with corrections noted as the value of the corr attribute." This means the recommendation is that at level 4 corr= should always be used. Is that what we intend?
  Or is it reasonable to use <sic> w/o corr= at level 4? If this is the case, we can just insert an "if desired". (Or is <sic> w/o corr= a level 3 intervention?)

* "<titlepage>" should be "<titlePage>".

* "... if present, divided with by <pb n="verso"/>." has an extra preposition.

* "... in a separate numbered div," should be either "in a separate numbered division," or "in a separate numbered <div>" (I think the latter is better, now that I think about it).

* "... with <opener>, <dateline>, <salute>, <signed>, <closer> included as appropriate." Probably need to provide more guidance on the use of these, esp. since <dateline>, <salute>, and <signed> can be used either inside <opener> (or <closer>) or without <opener> (or <closer>).

V.
--
Let me say up front that I do not think the "specify attributes in a particular order so you can tweak your files with Perl" recommendation is a good one. That was a reasonable recommendation when software that understood attributes was hard to come by and even harder to use. The advantage of using XML in the modern world is that such software is readily available, and some of it is even pretty easy to use.

Keep in mind, also, that just putting them in a specific order still does not make tweaking them with string-matching tools possible. Differences in whitespace (including within the value) and use of LIT (") vs LITA (') still mean that a pattern-matching tool is required. And it gets ugly. Things like

    s/<name\s+type\s*=["']\s*person\s*['"]([^>]*)>/<persName$1>/g;

although compact, are, I think, harder to read, write, and debug than

    <xsl:template match="name[@type='person']">
      <xsl:element name="persName">
        <xsl:copy-of select="@*[not(name()='type')]"/>
        <xsl:apply-templates/>
      </xsl:element>
    </xsl:template>

Furthermore, that Perl will fail in ways the XSLT won't (e.g., changing things inside comments or CDATA marked sections, or matching type='person" when it shouldn't).
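
Both of those failure modes are easy to reproduce. What follows is only a sketch under stated assumptions: it is in Python rather than Perl or XSLT purely for convenience, the pattern is a straight port of the Perl substitution above, and the names regex_rename, tree_rename, and SAMPLE are invented for illustration. The tree-based function stands in for the XSLT; it is not a translation of it.

```python
import re
import xml.etree.ElementTree as ET

# Port of the Perl substitution quoted above (same pattern, Python syntax).
PATTERN = re.compile(r"""<name\s+type\s*=["']\s*person\s*['"]([^>]*)>""")


def regex_rename(text):
    """String-level rewrite: knows nothing about comments or well-formedness."""
    return PATTERN.sub(r"<persName\1>", text)


def tree_rename(text):
    """Tree-level rewrite: parse first, then rename matching elements."""
    # insert_comments=True (Python 3.8+) keeps comments in the tree so they
    # survive serialization untouched.
    parser = ET.XMLParser(target=ET.TreeBuilder(insert_comments=True))
    root = ET.fromstring(text, parser=parser)
    for el in root.iter("name"):
        if el.get("type") == "person":
            el.tag = "persName"
            del el.attrib["type"]
    return ET.tostring(root, encoding="unicode")


# Hypothetical input: one commented-out start-tag, one real element.
SAMPLE = '<p><!-- was <name type="person"> --><name type="person" n="1">Syd</name></p>'

if __name__ == "__main__":
    print(regex_rename(SAMPLE))  # the text inside the comment is rewritten too
    print(tree_rename(SAMPLE))   # only the real element is renamed
    # The pattern also accepts mismatched delimiters, which no parser would.
    print(bool(PATTERN.search("<name type='person\">")))
```

In this sketch the pattern rewrites the start-tag inside the comment and leaves the real element's end-tag as </name>, so a second substitution would be needed just to get back to well-formed output. The parser-based version leaves the comment alone and rejects the mismatched-quote input outright as not well-formed.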
Besides, the XML specification is really very clear that "the order of attribute specifications in a start-tag or empty-element tag is not significant".

That said, if this section stays, a few details should be corrected.

* "... must always be declared first." should read "... must always be specified first."

* OK ...
  - type= is 1st
  - n= is last
  - id= is 1st
  - target= is same as id=
  I think this needs to be reworked a bit more coherently. Perhaps something like "Attributes should always be specified in the following order, when present: type=, id=, target=, n=, followed by all other attributes in alphabetical order, except that rend= is always last."

* "whenever multiple attributes are being used to define a tag," is problematic, because attributes don't define elements, let alone tags. Perhaps "whenever multiple attributes are being specified in a single tag," or "whenever multiple attributes are being specified on a particular element" or some such.

* "always be declared first" should be "always be specified first".

* The entry for entity= is false; entity= is how <figure> points to the target image -- it is much more like target= than id= (id= is how other things point to the <figure>).

* "Brown Women Writers Project" should be "Brown University Women Writers Project".

* "This concept allows for strings of rendition features to be included as one rend value. Rendition ladders consist of categories of renditions, with further defined values included in parentheses." should read "This system allows for sets of multiple renditional features to be included in one rend= value. Rendition ladders consist of categories of renditional features with specific values for each feature following, enclosed in parentheses." or some such.

* "Combining attributes would result in a tag with attributes such as" should read "Combining renditional features would result in a tag with attributes such as"

* I realize that the recommended rendition system is an "adaption of" the WWP rendition ladder system. Is there a reason it is an adaptation rather than an adoption of the whole system? (Is that reason perhaps that Syd has never actually *published* the whole system?) And is there a reason the adaptation is egregiously different than the WWP system? (E.g., combining slant, case, and font all into font.)

* lang=: While it is perfectly reasonable to use ISO639-2 3-letter codes preferentially over ISO639-1 2-letter codes in P4, it will not be in P5, as I posted earlier.

* "ident" should be "indent".

Note
----
[1] I think this is a bad idea, as it does not seem to represent reality. As Michael used to say, "if in doubt always prefer truth above a convenient lie." (Or something like that.)