Subject: comments on guidelines document
From: Syd Bauman
Date: Wed, 22 Mar 2006 06:19:04 -0500
To: DLF-TEI@LISTS.DIGLIB.ORG

Here are some comments on what I believe to be the current (2.0?)
version of the DLF "TEI Text Encoding in Libraries Guidelines for Best
Encoding Practices".

I realize that these comments would probably have been a lot more
useful if I'd posted them days ago, or at least more than 12 hours in
advance of the conference call. Sorry. That's what conference calls
are for in large part -- to get slugs like me to finally do what they
should have done weeks ago, so we don't have to be embarrassed by
saying "I didn't read it" on the call. I would have posted &gt; 12 hours
in advance, BTW, but there is no wireless on this flight. (Lame, lame,
lame.)

I am going to apologize in advance if I am raising any issues here
that we already brought up at the meeting and have decided. Please
attribute any such transgressions to the late hour and my poor memory,
not to any malice on my part.


General comments
------- --------

* First (and perhaps foremost) I am a strong proponent of version
  control. This document claims to be Version 2.0 of 2005-11-20, but
  Matthew sent it to us just the other day.

* The word "tag" is sometimes used when the word "element" is what is
  intended. I've tried to flag these, but may have missed some. 

* I personally prefer "heading" to "head" to describe the thing at
  the beginning of a chapter which we would typically encode as a
  &lt;head&gt;. 


I.
--

Minor nit-pick. When the chair's position is noted parenthetically, I
think all the names should be listed alphabetically, even the chair's.
("That's easy for you to say", Perry objects, "your surname starts with
'B'!")

Also, wasn't Matthew chair of the 2006-02 meeting?


III. 
----

"Page breaks &lt;pb/&gt; should occur at the top of the page, and
entirely within any division."

* There is no strong reason that this document should feel compelled
  to follow TEI house style. On the other hand, there is some
  advantage to such consistency, so: in TEI documents, an element name
  (encoded with &lt;gi&gt;) is rendered with "&lt;" and "&gt;", whether it is
  declared as an empty element or not. Thus "&lt;pb&gt;", not "&lt;pb/&gt;". 

* Saying "Page breaks &lt;pb&gt;" seems to me like giving the subject of the
  sentence twice. (Reminds me of the pilot episode of Star Trek: the
  Next Generation.) Although perhaps it was just intended that the tag
  be parenthetical, which would be fine. But perhaps being more
  verbose would be clearer. I would recommend 
    "Page breaks should be encoded using the &lt;pb&gt; element ..."
  or
    "The &lt;pb&gt; element should be used to indicate the top ..."

* Since &lt;pb&gt; is empty, it does not make sense to say that one &lt;pb&gt;
  should be "entirely within" any XML element, since it always must
  be entirely within each and every ancestor element. I think perhaps
  what was meant is "always", but this presents problems (discussed
  below). 

* If the recommendation is that &lt;pb&gt; go within &lt;div&gt;[1], and that &lt;pb&gt;
  go at the top of each page (rather than between pages), it logically
  follows that, in general, a page break that occurs between chapters
  2 & 3 should be encoded near the top of the &lt;div&gt; that holds chapter
  3 (rather than near the bottom of the &lt;div&gt; that holds chapter 2).
  However, it may be worth stating this explicitly.
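
To make that last point concrete, here is a rough sketch in Python
(standard library only). The fragment, the page number, and the
function names are all made up for illustration; nothing here comes
from the guidelines document itself. It just checks which element
holds the &lt;pb&gt; for the page on which chapter 3 starts.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: chapter 2 ends on one page, and the next page
# begins with chapter 3. The <pb> marks the TOP of that next page.
SAMPLE = """
<body>
  <div1 type="chapter" n="2">
    <head>Chapter 2</head>
    <p>... end of chapter 2.</p>
  </div1>
  <div1 type="chapter" n="3">
    <pb n="43"/>
    <head>Chapter 3</head>
    <p>Start of chapter 3 ...</p>
  </div1>
</body>
"""

def pb_parents(root):
    """Return (n= of the <pb>, name of its parent element) for every <pb>."""
    pairs = []
    for parent in root.iter():
        for child in parent:
            if child.tag == "pb":
                pairs.append((child.get("n"), parent.tag))
    return pairs

def leads_division(div):
    """True if the first child element of this division is a <pb>."""
    children = list(div)
    return bool(children) and children[0].tag == "pb"

root = ET.fromstring(SAMPLE)
chapter2, chapter3 = root.findall("div1")
print(pb_parents(root))          # every <pb> with the element that holds it
print(leads_division(chapter2))  # chapter 2 does not begin with a <pb>
print(leads_division(chapter3))  # chapter 3 does
```

A check like this only works if the rule is stated explicitly, which
is another argument for spelling it out.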


IV.1
----

* In "Rationale" the initial word "That" should be dropped or replaced
  with "The".

* "... using the teixlite DTD allows Level 1 texts to be compatible
  with more richly encoded teixlite texts for searching, ...": I am
  not sure it is worth changing the wording of the document, but I
  don't think this is strictly true. It is quite easy to imagine,
  e.g., two XML documents, transcriptions of similar document sources
  (say, two monographs in a series -- Hardy Boys or whatever), which
  are both valid against teixlite.dtd, but which are encoded so
  differently as to make context-sensitive searching pretty
  incompatible. I.e., I don't think that it is the use of teixlite
  that permits this compatibility, but rather adherence to far more
  strict rules (some of which are expressed in the document we are
  writing) that make the encoding consistent.

* "&lt;div1&gt; type="section" is the default attribute value": does that
  mean that type="section" should be the default specification (i.e.,
  all of your &lt;div1&gt; elements should have a type= -- if you have no
  other idea for type=, put in type=section), or that because
  type=section is the default, when you encode a &lt;div1&gt;, if you do not
  encode a type=, software should presume type=section? I don't think
  it matters much which we pick, just that the wording should make it
  clear, e.g. "If no type= attribute is specified, a type= of
  "section" should be presumed". One of the things that makes this a
  bit difficult to describe is the fact that, in technical terms,
  type="section" is *not* the default according to the DTD.

* There is an extra semicolon and space after "... extended to other
  encoding levels" in the description of &lt;p&gt;.

* For &lt;pb&gt;, the description starts "This is required ...". To be
  consistent this should probably be just "Required ...".

* "Page images can be linked to the text using id/idref." AFAIK, the
  systems y'all have in place for linking a page image to a &lt;pb&gt; do
  not make use of IDREFs. Rather, they make use of the fact that there
  exists a file in your system whose filename matches the value of
  id=. If that's true, then this needs to be reworded. Either
    Page images can be linked to the text using the value of id=.
  or 
    Page images can be linked to the text using IDs.

* "Because ids are unique ..." should read "Because IDs are unique".
  (Or perhaps "Because IDs must be unique" or "Because the values of
  id= are definitionally unique within any given document", etc.)

* The example has some minor indentation inconsistencies.
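
Two of the points above (a presumed type= of "section", and page
images found via the value of id= rather than via an IDREF) are easy
to show in code. This is only a sketch in Python: the sample markup,
the function names, and the ".tif" extension are assumptions for
illustration, not a description of anyone's actual system.

```python
import xml.etree.ElementTree as ET

# Hypothetical Level 1 fragment: the first <div1> has no type= at all.
SAMPLE = """
<body>
  <div1>
    <pb id="p001" n="1"/>
    <p>...</p>
  </div1>
  <div1 type="chapter">
    <pb id="p002" n="2"/>
    <p>...</p>
  </div1>
</body>
"""

def effective_type(div):
    """Presume type="section" when none is specified.

    This is a processing policy; the DTD itself declares no such default.
    """
    return div.get("type", "section")

def page_image_name(pb, extension=".tif"):
    """Build an image filename from the value of id=; no IDREF is involved."""
    return pb.get("id") + extension

root = ET.fromstring(SAMPLE)
print([effective_type(div) for div in root.findall("div1")])
print([page_image_name(pb) for pb in root.iter("pb")])
```

The point of the first function is that the default lives in the
software, not in the document or the DTD, so the wording should say
who is expected to supply it.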


IV.2
----

* Should "... be displayed separate from their page images" be "... be
  displayed separately from their page images"? (Note the "ly")

* "It is recommended that the n attribute be included to record the
  div sequence."
  - should be "... record the sequence of divisions" or "&lt;div&gt;
    sequence" or some such
  - if we recommend using n= to record the sequence, shouldn't we give
    more advice about how to do so? E.g., which divisions get counted,
    whether or not to use hierarchical n= values (I guess that's not a
    problem with level1, is it? :-), whether to use Arabic numerals
    padded on the left with sufficient zeroes, etc.?

* Example has
    &lt;front&gt;
      [optional text of titlepage, etc]
    &lt;/front&gt;
  from which I'm worried that people will incorrectly infer that the
  &lt;front&gt; tags are required, but the content is not.

* Example has
    &lt;body&gt;
      &lt;div1 type="chapter" n="1"&gt;
        &lt;head&gt;Chapter 1&lt;/head&gt;
        &lt;p&gt;[text of Chapter 1 goes here interspersed with &lt;pb/&gt;
         elements pointing to page images]&lt;/p&gt;
      &lt;/div1&gt;
  This, I think, is a good place to point out the problem with "&lt;pb&gt;
  should always be inside a &lt;divN&gt;". If there were a heading of the
  body, which occurred on a page of its own, the &lt;pb&gt; element between
  the front matter and the body could not be recorded at the top of
  the &lt;div1&gt;.
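
On the n= question above, here is one possible convention sketched in
Python: count only the &lt;div1&gt; children of &lt;body&gt;, and pad with enough
zeroes that the values sort correctly as strings. Both choices, and
the function name, are assumptions for the sake of the example; the
document does not currently say any of this.

```python
import xml.etree.ElementTree as ET

def number_divisions(body, tag="div1"):
    """Set n= on each child division to its zero-padded position."""
    divs = body.findall(tag)
    width = len(str(len(divs)))  # digits needed for the largest number
    for position, div in enumerate(divs, start=1):
        div.set("n", str(position).zfill(width))
    return [div.get("n") for div in divs]

# Hypothetical body with twelve chapters and no n= values yet.
body = ET.Element("body")
for _ in range(12):
    ET.SubElement(body, "div1", {"type": "chapter"})

print(number_divisions(body))
```

With twelve divisions the values run from "01" to "12", so a plain
string sort and a numeric sort agree.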


IV.4
----
* "... a searcher could limit his or her search in a dramatic text
  ... to the speeches of a particular character." This is a bad
  example because we are not recommending use of the who= attribute,
  which is often essential for limiting such searches. (The content
  of &lt;speaker&gt; is often not consistent enough to be used for this
  purpose.)

* "Typographically distinct text should be encoded as &lt;foreign&gt;,
  &lt;title&gt;, or &lt;emph&gt; as appropriate." Does that mean other
  phrase-level elements intended for typographically distinct text
  (e.g., &lt;term&gt;, &lt;q&gt;, &lt;gloss&gt;, &lt;mentioned&gt;, &lt;soCalled&gt;) should not be
  used? 

* "It is recommended that the &lt;sic&gt; element be used to indicate
  typographic errors, with corrections noted as the value of the corr
  attribute." This means the recommendation is that at level 4 corr=
  should always be used. Is that what we intend? Or is it reasonable
  to use &lt;sic&gt; w/o corr= at level 4? If this is the case, we can just
  insert an "if desired". (Or is &lt;sic&gt; w/o corr= a level 3
  intervention?)

* "&lt;titlepage&gt;" should be "&lt;titlePage&gt;".

* "... if present, divided with by &lt;pb n="verso"/&gt;." has an extra
  preposition. 

* "... in a separate numbered div," should be either "in a separate
  numbered division," or "in a separate numbered &lt;div&gt;" (I think the
  latter is better, now that I think about it).

* "... with &lt;opener&gt;, &lt;dateline&gt;, &lt;salute&gt;, &lt;signed&gt;, &lt;closer&gt;
  included as appropriate." Probably need to provide more guidance on
  the use of these, esp. since &lt;dateline&gt;, &lt;salute&gt;, and &lt;signed&gt; can
  be used either inside &lt;opener&gt; (or &lt;closer&gt;) or without &lt;opener&gt; (or
  &lt;closer&gt;).
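
To illustrate the who= point from the top of this section, here is a
small Python sketch. The scene, the who= values, and the speaker
labels are invented for the example; the inconsistency between "Ham."
and "Hamlet" is exactly the sort of thing one finds in printed
sources.

```python
import xml.etree.ElementTree as ET

# Hypothetical scene: the source prints the speaker label two ways.
SAMPLE = """
<div1 type="scene">
  <sp who="ham"><speaker>Ham.</speaker><p>To be, or not to be ...</p></sp>
  <sp who="oph"><speaker>Oph.</speaker><p>Good my lord ...</p></sp>
  <sp who="ham"><speaker>Hamlet</speaker><p>I humbly thank you ...</p></sp>
</div1>
"""

def speeches_by_who(root, who):
    """Select speeches using the normalized who= attribute."""
    return [sp for sp in root.iter("sp") if sp.get("who") == who]

def speeches_by_speaker(root, label):
    """Select speeches using the transcribed content of <speaker>."""
    return [sp for sp in root.iter("sp") if sp.findtext("speaker") == label]

root = ET.fromstring(SAMPLE)
print(len(speeches_by_who(root, "ham")))       # finds both speeches
print(len(speeches_by_speaker(root, "Ham.")))  # misses the "Hamlet" one
```

Without who=, the searcher has to know (or guess) every form of the
label that appears in the source.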


V.
--
Let me say up front that I do not think the "specify attributes in a
particular order so you can tweak your files with Perl" recommendation
is a good one. That was a reasonable recommendation when software that
understood attributes was hard to come by and even harder to use. The
advantage of using XML in the modern world is that such software is
readily available, and some of it is even pretty easy to use. Keep in
mind, also, that just putting them in a specific order still does not
make tweaking them with string-matching tools possible. Differences in
whitespace (including within the value) and use of LIT (") vs LITA (')
still mean that a pattern-matching tool is required. And it gets
ugly. Things like
  s/&lt;name\s+type\s*=["']\s*person\s*['"]([^&gt;]*)&gt;/&lt;persName$1&gt;/g;
although compact, are, I think, harder to read, write, and debug
than
  &lt;xsl:template match="name[@type='person']"&gt;
    &lt;xsl:element name="persName"&gt;
      &lt;xsl:copy-of select="@*[not(name()='type')]"/&gt;
      &lt;xsl:apply-templates/&gt;
    &lt;/xsl:element&gt;
  &lt;/xsl:template&gt;
Furthermore, that Perl will fail in ways the XSLT won't (e.g.,
changing things inside comments or CDATA marked sections, or matching
type='person" when it shouldn't).

Besides, the XML specification is really very clear that "the order of
attribute specifications in a start-tag or empty-element tag is not
significant."
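
For anyone who wants to see the difference for themselves, here is a
sketch in Python rather than Perl or XSLT. The regular expression is a
direct transliteration of the Perl pattern above; the sample markup
and the function names are made up for illustration. The parser is the
one in the Python standard library.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical input: one match hidden in a comment, one "normal" case,
# one case with type= second and single quotes, and one non-person name.
SAMPLE = """<p>
  <!-- was: <name type="person">Jo</name> -->
  <name type="person" n="1">Ann</name>
  <name n="2" type = 'person'>Bob</name>
  <name type="place">Rome</name>
</p>"""

# The Perl pattern from above, transliterated.
PATTERN = re.compile(r"""<name\s+type\s*=["']\s*person\s*['"]([^>]*)>""")

def regex_hits(text):
    """Return every start-tag the string-matching approach would rewrite."""
    return [match.group(0) for match in PATTERN.finditer(text)]

def rename_with_parser(text):
    """Rename name[@type='person'] to persName, dropping type=."""
    root = ET.fromstring(text)
    for element in list(root.iter("name")):
        if element.get("type") == "person":
            element.tag = "persName"
            del element.attrib["type"]
    return root

print(regex_hits(SAMPLE))
root = rename_with_parser(SAMPLE)
print([(el.tag, el.text, el.attrib) for el in root])
```

The pattern rewrites the start-tag inside the comment and misses Bob,
whose type= is neither first nor free of whitespace around the "=".
The parser-based version gets Ann and Bob, leaves Rome alone, and
never treats the commented-out markup as an element at all.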

That said, if this section stays, a few details should be corrected.

* "... must always be declared first." should read "... must always be
  specified first."

* OK ... 
  - type= is 1st
  - n= is last
  - id= is 1st
  - target= is same as id=
  I think this needs to be reworked a bit more coherently. Perhaps
  something like "Attributes should always be specified in the
  following order, when present: type=, id=, target=, n=, followed by
  all other attributes in alphabetical order, except that rend= is
  always last."

* "whenever multiple attributes are being used to define a tag," is
  problematic, because attributes don't define elements, let alone
  tags. Perhaps "whenever multiple attributes are being specified in a
  single tag," or "whenever multiple attributes are being specified on
  a particular element" or some such.

* "always be declared first" should be "always be specified first".

* The entry for entity= is false; entity= is how &lt;figure&gt; points to
  the target image -- it is much more like target= than id= (id= is
  how other things point to the &lt;figure&gt;).

* "Brown Women Writers Project" should be "Brown University Women
  Writers Project".

* "This concept allows for strings of rendition features to be
  included as one rend value. Rendition ladders consist of categories
  of renditions, with further defined values included in parentheses."
  should read "This system allows for sets of multiple renditional
  features to be included in one rend= value. Rendition ladders
  consist of categories of renditional features with specific values
  for each feature following, enclosed in parentheses." or some such.

* "Combining attributes would result in a tag with attributes such as"
  should read "Combining renditional features would result in a tag
  with attributes such as"

* I realize that the recommended rendition system is an "adaption of"
  the WWP rendition ladder system. Is there a reason it is an
  adaptation rather than an adoption of the whole system? (Is that
  reason perhaps that Syd has never actually *published* the whole
  system?) And is there a reason the adaptation is egregiously
  different than the WWP system? (E.g., combining slant, case, and
  font all into font.)

* lang=: While it is perfectly reasonable to use ISO639-2 3-letter
  codes preferentially over ISO639-1 2-letter codes in P4, it will not
  be in P5, as I posted earlier.

* "ident" should be "indent".
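
Since the description of rendition ladders above is a bit abstract,
here is a minimal Python sketch of reading one: each feature keyword
is followed by its value in parentheses, and several such pairs can
share a single rend= value. The keywords and values in the example
are illustrative only; they are not meant as the definitive WWP (or
DLF) list, and the function name is my own invention.

```python
import re

# One rung of a ladder: a feature keyword, then its value in parentheses.
RUNG = re.compile(r"([A-Za-z][\w-]*)\(([^()]*)\)")

def parse_rend(value):
    """Split a rend= value into a {feature: value} dictionary."""
    return {feature: setting.strip() for feature, setting in RUNG.findall(value)}

# Hypothetical rend= value combining three renditional features.
print(parse_rend("slant(italic) case(allcaps) align(center)"))
```

Keeping slant, case, and so on as separate keywords is what makes a
lookup like parse_rend(...)["slant"] possible; once they are all
folded into one "font" feature, software has to pick the value apart
again.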

Note
----
[1] I think this is a bad idea, as it does not seem to represent
    reality. As Michael used to say, "if in doubt always prefer truth
    above a convenient lie." (Or something like that.)