Difference between revisions of "Talk:Best Practices for TEI in Libraries"
Revision as of 22:29, 18 June 2009
Introduction
1) Definition of level 5 encoding currently reads:
"The text is generated either through corrected OCR or keyboarding, but the tagging requires substantial human intervention by encoders with subject knowledge. "
I suggest instead:
"The text is generated either through corrected OCR or keyboarding, and the tagging requires substantial human intervention by encoders with subject knowledge."
because corrected OCR, keyboarding, and expert tagging ALL require substantial human intervention (though the first two, of course, don't require subject knowledge, and perhaps that is the point of the original phrasing).
2) "If a library uses TEI Tite to outsource its encoding, it should find conversion of TEI Tite files to be trivial: to Level 3 with some loss of granularity and to Level 4 with the addition of some markup, which amounts to minimal human intervention."
Should the colon after "trivial" be there?
2.9 General Guidelines for Attribute Usage
1) Since this isn't a comprehensive list of attributes (I don't think), why bother including the "xml:id" and "target" attributes if specific details about how libraries should use them are not actually included in this document? Is the documentation for these attributes considered important to these guidelines, but too extensive to replicate? How does this differ from the specific best practices given for other attributes listed here, like "n" or "rend"?
2) Under "key and ref":
"For example,
<author><persName type="marc100" key="lccn-n78-95332">Shakespeare, William, 1564-1616</persName></author>
gives a project-interal key (lccn-n78-95332) for this name in the Library of Congress Name Authority File. Values of key attributes may be partially explained in a non-machine-readable way through use of a taxonomy element: "
Should "project-interal" be "project-internal"? Or "project-integral"? Or something else?
3) Under "rend and rendition":
"The rend and rendition attributes may be used when it is desirable to record information about how the content object was displayed in the source document. "
Is it meant to read "content object," or just "content," or even just "object?" Having both sounds strange to me, but perhaps it's TEI terminology with which I'm not familiar.
4.2 The TEI Header
1) Currently reads: "The TEI header is a metadata record that describes an electronic text encoded according to the TEI specification."
Since there are multiple levels of encoding (does this translate to multiple "specifications?"), should this read either
a) "...encoded according to a TEI specification" or b) "...encoded according to the TEI specifications" ?
4.4 The TEI Header and Other Metadata Schemas
1) Currently reads:
"Unfortunately, there is currently no mechanism for specifying that the content of an element should be drawn from an outside metadata source or that it should supplement the content of the element"
To me, the "it" was confusing/ambiguous--I suggest instead:
"Unfortunately, there is currently no mechanism for specifying that the content of an element should be drawn from an outside metadata source or that outside metadata should supplement the content of the element"
This feels a little more redundant/wordy, perhaps, but it is clearer.
4.5 Determining Data Values for the TEI Header
1) Currently reads:
"If there is no digitized title page but the header creator has satisfactory evidence of the source document, the header creator should refer to the source document for metadata creation. The lack of a title page may be for one of many reasons: for example, the original document is a manuscript item, or the electronic edition is a portion of the original object (a poem or short story that was published in a collection or an article from a serial). In all cases, it is recommended that important bibliographic evidence, such as a digitized image of the title page and title page verso for a collection, be provided to the header creator, even if just a piece of the collection is used."
Does "source document" refer to an analog (physical) source document? Or digitized pages, just lacking a title page? Or OCR or keyboarded text? Or any or all of these things? What counts as "evidence" of a source document?
Follow-up question: If the electronic text already exists, wouldn't title page information be captured in the <text> element, so that metadata for the header could be gathered from there even without a facsimile of the title page?
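To make the follow-up question concrete, a transcribed title page inside <text> might look something like the sketch below. The titlePage, docTitle, titlePart, docAuthor, and docImprint elements are standard TEI, but the content is invented for illustration, and whether a given project transcribes its title pages this way is an assumption.

```xml
<text>
  <front>
    <!-- Hypothetical title page transcription; all content is invented -->
    <titlePage>
      <docTitle>
        <titlePart type="main">Poems</titlePart>
      </docTitle>
      <docAuthor>Jane Example</docAuthor>
      <docImprint>London, 1850</docImprint>
    </titlePage>
  </front>
  <body>
    <div>
      <p>Text of the work.</p>
    </div>
  </body>
</text>
```

If the encoded text carries a titlePage like this, the header creator could in principle read the title, author, and imprint from it, though the transcription reflects the source's wording rather than cataloging-code forms.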
4.6 Element Recommendations for the TEI Header
1) Under the instructions for the title element that falls within <sourceDesc>, it currently reads:
"At least one title element is required for the title of the source document. Give the title according to the national cataloging code. Use a type attribute with a value of marc245c to give the statement of responsibility from a MARC record. "
The information in the second sentence (about marc245c) is immediately reiterated, along with other information, in a list of the possible values of the type attribute for this element. So stating it here seems unnecessary and also confusing: without having seen yet that we can also use marc245a and marc245b for the other elements of the title, I don't know why we've skipped right to statements of responsibility in a title element (but I'm not a cataloger).
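For readers who, like me, are not catalogers, a fragment along these lines may show how the three type values relate. It is only a sketch: the marc245a, marc245b, and marc245c values come from the guidelines text discussed above, the mapping to title proper, remainder of title, and statement of responsibility follows MARC field 245 subfields a, b, and c, and the title itself is invented.

```xml
<!-- Hypothetical title elements within sourceDesc; wrapper elements omitted -->
<title type="marc245a">Poems :</title>
<title type="marc245b">with a memoir of the author /</title>
<title type="marc245c">by Jane Example.</title>
```

Seen together, the marc245c value reads as the last piece of a three-part title statement, which is why introducing it before the other two is hard to follow.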
2) Within <profileDesc>, is <keywords scheme=> only used if its <term> children come from a specific controlled vocabulary? Can there be <term>s without a parent <keywords scheme=>?
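The pattern in question looks roughly like the sketch below. The scheme value and the terms are invented for illustration and are not taken from the guidelines; the fragment shows only the case where the terms do come from a named vocabulary, and does not answer whether <term> may appear without the scheme attribute.

```xml
<profileDesc>
  <textClass>
    <!-- Hypothetical scheme pointer and terms; values are illustrative only -->
    <keywords scheme="#lcsh">
      <term>English poetry</term>
      <term>Women authors</term>
    </keywords>
  </textClass>
</profileDesc>
```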
5.1.3 Rationale for Level 1 Encoding
Under the characteristics of projects best-suited for Level 1 encoding, may I suggest the following addition:
-the source documents are printed or nearly all printed
Or something along those lines, since OCR is pretty useless on manuscript materials, and even on complicated typefaces like blackletter/Fraktur.
The other characteristics listed do describe the project workflow, not the physical documents, so perhaps this type of specific characteristic is not appropriate after all. It seems worth noting somewhere in some fashion that certain collections, such as manuscript materials and some types of print, will by their very nature almost certainly require a higher level of encoding. But where or how? Or is this even necessary? Perhaps it is already implied, assumed, or generally understood.