Best Practices for TEI in Libraries

Introduction

These recommendations are for libraries using the Text Encoding Initiative’s Guidelines for Electronic Text Encoding and Interchange (P5). They are intended for use in large, library-based digitization projects, but may be useful in other scenarios as well. Consult the full TEI Guidelines for guidance beyond what is described below.

Library text digitization projects vary widely in scope and purpose. With this in mind, the Task Force has attempted to make these recommendations as inclusive as possible by developing a series of encoding levels. These levels are meant to allow for a range of practice, from wholly automated text creation and encoding to encoding that requires expert content knowledge, analysis, and editing.

Recommendations for Levels 1-4 are intended for projects wishing to create encoded electronic text with structural markup, but minimal semantic or content markup. The encoding levels are cumulative: encoding requirements at each level incorporate the requirements of lower levels. Levels 1-4 allow the conversion and encoding of texts to be performed without deep content knowledge, and texts encoded at these levels can be enriched with more markup at any time. Level 5, in contrast, requires scholarly analysis.

General Recommendations

An encoding project should strive for internal consistency and for use of standards so that the data can be modified or enhanced in the future with ease. In cases where local practice deviates from standards, there should at least be internal consistency in the local practice.

When reformatting to digital media using any level of encoding, the electronic text should begin with the transcription of the first word on the first leaf of the original work. It may be impractical or undesirable to transcribe and encode certain features of the text, such as publisher's advertisements or indexes, but if at all possible, they should be included as links to page images. Any omissions of material found in the original work should be noted in the <editorialDecl> in the TEI header.
A filename scheme should be established for the project. Filenames should ensure cross-platform compatibility: use only the characters A-Z, a-z, and 0-9 in filenames, and avoid file extensions longer than three characters.
An encoding project should use only numbered divisions (i.e., <div1>, <div2>, etc.) or unnumbered divisions (i.e., <div>) but not both. This applies both within a TEI document (i.e., within <front>, <body>, <back>, even if nested within <group> or <floatingText>) and across TEI documents in any given collection. Keep in mind that numbering of divs starts over (at <div1>) within <floatingText>, so any software that expects to process nested numbered divisions within a document will need to account for this.
Whether numbered or unnumbered divisions are used, the @type attribute of the division element is not recommended at level 1, is optional at level 2, is recommended at level 3, and required at levels 4 and 5.
Page breaks should be encoded using the <pb> element, which should mark the top of a page (i.e., the text of page seven should immediately follow <pb n="7"/>), and should always be contained within a div for ease of retrieval with indexing software. For example, a page break that occurs between chapters 2 and 3 should be encoded near the top of the <div> that holds chapter 3 (rather than near the bottom of the <div> that holds chapter 2).
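
The division and page-break recommendations above can be illustrated with a short sketch of a project that uses numbered divisions throughout. The chapter numbers, page number, and @type values are invented for the example. The <pb> for the page on which chapter 3 begins sits at the top of that chapter's division, and numbering restarts at <div1> inside <floatingText>.

```xml
<body>
  <div1 type="chapter" n="2">
    <head>Chapter 2</head>
    <p>[Text of chapter 2]</p>
  </div1>
  <div1 type="chapter" n="3">
    <!-- The page break between chapters 2 and 3 goes at the top of chapter 3's division -->
    <pb n="45"/>
    <head>Chapter 3</head>
    <p>[Text of chapter 3]</p>
    <div2 type="section">
      <head>[Section heading]</head>
      <p>[Text of the section]</p>
      <floatingText>
        <body>
          <!-- Numbering starts over at div1 inside floatingText -->
          <div1 type="letter">
            <p>[Text of an embedded letter]</p>
          </div1>
        </body>
      </floatingText>
    </div2>
  </div1>
</body>
```

A project using unnumbered divisions would write <div> in every position where this sketch has <div1> or <div2>.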

Structure of a TEI Document

A valid TEI XML document must contain the following elements:

a root <TEI> element, containing:
- a <teiHeader> element
- a <text> element

Within those two elements, there are additional requirements, which are discussed in these guidelines and in the complete TEI P5 Guidelines. The <teiHeader> element serves as a description of the document presented in the <text> element. The <text> element contains the transcription of the source document.
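
A minimal skeleton that satisfies this structure might look like the following. The bracketed text is placeholder content, and the header shown contains only the three children of <fileDesc> that the TEI schema itself requires; the recommendations later in this document ask for considerably more.

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>[Title of the TEI document]</title>
      </titleStmt>
      <publicationStmt>
        <publisher>[Party making the TEI document public]</publisher>
      </publicationStmt>
      <sourceDesc>
        <p>[Description of the source document]</p>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
  <text>
    <body>
      <div>
        <ab>[Transcription of the source document]</ab>
      </div>
    </body>
  </text>
</TEI>
```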

The TEI Header

Reference

Chapter 2, TEI Header, P5 Guidelines

The TEI Header is a metadata record that describes an electronic text encoded according to the TEI specification. The purpose of the TEI Header is to declare the bibliographic information related to the electronic document and, if appropriate, the bibliographic data for the original analog source document from which the electronic edition was created. The TEI Header often includes a description of the encoding decisions or practices used to create the electronic document. Since the advent of the TEI twenty years ago, many people have described the TEI Header as a title page for the electronic edition, and many librarians have compared it to traditional library catalog records (MARC).

As with any descriptive metadata, the metadata in the TEI Header can serve multiple audiences. In the local context, a TEI Header provides metadata about the TEI document, its source, and its provenance. The TEI Header may be used for metadata exchange, to automatically create indexes (author lists, title lists) for a collection of TEI documents, and to aid in browsing heterogeneous TEI documents. TEI Headers may also be used as a basis for other metadata records (such as MARC or Dublin Core), though generating other formats may require human intervention because those formats are often more granular than TEI Headers, or granular in different ways.

The TEI Header and MARC

While a TEI header is often perceived as similar to or at least related to a MARC record, a TEI header does not typically have a one-to-one correspondence with a MARC record. One TEI header may be described by multiple MARC analytic records, or one MARC record may be used to describe a collection of TEI documents with individual headers. Furthermore, while a MARC record captures metadata about a bibliographic entity in a library's collection, a TEI header records information both about an encoded text and about the source document for that encoded text.

Each institution and even each project may have a different approach to the way electronic texts are created in TEI and then represented in a larger public catalog through MARC. At one institution, the same unit (e.g., a cataloging department) may be responsible for creating both TEI Headers and MARC records, while at other institutions the work may be distributed among different units. Within the library domain, metadata or cataloging experts are usually required for at least review and standardization of both the TEI Header and the MARC record.

The TEI Header and Other Metadata Schemas

Several other descriptive metadata schemas are prevalent within the library domain, including Dublin Core (DC), Dublin Core Qualified (DCQ), and the Metadata Object Description Schema (MODS). Each of these schemas contains elements that capture the same data as many of the elements in the TEI Header. As with MARC, a variety of automated or manual workflows can be implemented to crosswalk metadata from one standard to another and provide for increased sharing of metadata about electronic texts in larger contexts. In particular, DC and MODS are commonly used with the Open Archives Initiative (OAI) harvesting protocol and may be particularly valuable for sharing metadata across institutions.
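
One possible crosswalk can be sketched by mapping the sample header given later in this document to unqualified Dublin Core in the oai_dc container used for OAI harvesting. The choice of which TEI element feeds which DC element is an assumption made for illustration, not a mapping prescribed by these recommendations. Here the record describes the electronic edition and points to the print source in dc:source.

```xml
<oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
           xmlns:dc="http://purl.org/dc/elements/1.1/">
  <!-- from fileDesc/titleStmt/title -->
  <dc:title>Lincoln and Seward.</dc:title>
  <!-- from fileDesc/titleStmt/author -->
  <dc:creator>Welles, Gideon, 1802-1878.</dc:creator>
  <!-- from fileDesc/publicationStmt/publisher and date -->
  <dc:publisher>University of Michigan, Digital Library Initiatives</dc:publisher>
  <dc:date>1996</dc:date>
  <!-- from fileDesc/sourceDesc: imprint of the print source -->
  <dc:source>New York: Sheldon &amp; company, 1874</dc:source>
  <!-- from profileDesc/textClass/keywords/term -->
  <dc:subject>Lincoln, Abraham, 1809-1865.</dc:subject>
  <dc:subject>Seward, William Henry, 1801-1872.</dc:subject>
</oai_dc:dc>
```

A project could equally decide that the DC record should describe the print source, in which case dc:date and dc:publisher would be drawn from the <sourceDesc> instead.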

Determining Data Values for the TEI Header

Within the library domain, there are several authoritative publications on how to create bibliographic and descriptive metadata for objects. These are usually called “content standards”; two prominent examples are the Anglo-American Cataloguing Rules, Second Edition (AACR2) and the International Standard Bibliographic Description for Electronic Resources (ISBD(ER)). These standards are extensive and outline a set of rules that enforce consistency across a voluminous amount of metadata.

Perhaps the primary purpose of these content standards is to give rules for what sources of information may be used in transcribing or generating metadata about a bibliographic entity. Within an electronic context, the analog object may not be available, so the TEI Header creator will need access to digitized images or other verifiable information to create accurate metadata.

The following sources of information are recommended in creating the TEI Header:

For an electronic document with a digitized title page and title page verso:
1. The chief source of information is the information encoded as the title page.
2. Use additional information from an originating paper document only if it is absolutely certain that the document is the source.
If there is no digitized title page but the header creator has satisfactory evidence of the source document, the header creator should refer to the source document for metadata creation. The lack of a title page may be for one of many reasons, among them: the original document is a manuscript item or the electronic edition is a portion of the original object (a poem or short story that was published in a collection or an article from a serial). In all cases, it is recommended that important bibliographic evidence, such as a digitized image of the title page and title page verso for a collection, be provided to the header creator, even if just a piece of the collection is used.
If no title page is present and there is no evidence of a source document, the header creator:
1. May assign a title and author, if appropriate.
2. Should enclose the assigned information in brackets, using the standard English-language convention for editorial interjections.
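
For the last case, a supplied title and author might be recorded as follows. The title wording and the author name are invented for the example; the brackets mark both values as supplied by the header creator rather than transcribed from a title page.

```xml
<titleStmt>
  <!-- Title and author assigned by the header creator, hence the brackets -->
  <title type="main">[Letter concerning a land purchase]</title>
  <author>
    <persName>[Smith, John]</persName>
  </author>
</titleStmt>
```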

Element Recommendations for the <teiHeader>

Element						Description
<teiHeader>
├	<fileDesc>					The fileDesc contains metadata about the TEI document. One of its child elements, `sourceDesc`, describes the source document from which the TEI document was created.
│	├	<titleStmt>
│	│	├	<title>			One or more `title` elements are used to give the title of the TEI document being created. It is suggested that titles be constructed based on the source document according to a national cataloging code. Use of the `level` attribute is not recommended since it does not apply to a TEI document in a collection. The `type` attribute may have any of the following values: `main` `sub` `alt` `short` `desc` `translated` `MARC245a` `MARC245b` `uniform`
│	│	├	<author>			One or more `author` elements (one name per element) are used to encode the names of the entities primarily responsible for the content of the TEI document. Use `persName` or `orgName` when applicable. Whenever possible, establish or use the form of the name from a national name authority file. Examples: `<author><persName>Shakespeare, William, 1564-1616</persName></author>` `<author><orgName>National Organization for Women</orgName></author>` `<author>(unknown)</author>`
│	│	├	<editor>			If applicable, use one or more `editor` elements (one name per element) to encode the names of entities besides those in `author` elements that have made a significant intellectual contribution to the work. If considered appropriate by the encoding project, the editor of the TEI document should be entered here. Use `persName` or `orgName` when applicable. Whenever possible, establish or use the form of the name from a national name authority file.
│	│	└	<respStmt>			Record the names of other persons or organizations not covered by <author> and <editor>, one per `respStmt`. Each `respStmt` may have only one `resp` child element and one `name` element, though they may occur in either order. Use `persName` or `orgName` as children of `name` when applicable. Whenever possible, establish or use the form of the name from a national name authority file.
│	├	<editionStmt>				This element contains information about the edition of the TEI document produced, not the source document.
│	├	<publicationStmt>				Use the child elements below rather than <p> for a prose description.
│	│	├	<publisher>			The publisher is the party responsible for making the file (the TEI document, not the source document) public.
│	│	├	<distributor>			The distributor is the party from whom copies of the file (the TEI document, not the source document) can be obtained. Often the same as <publisher>, in which case no <distributor> element should be specified.
│	│	├	<authority>			Only used for a text (the TEI document, not the source document) that is not formally published, but is nevertheless made available for circulation, in which case the party who makes it available should be recorded here.
│	│	├	<idno>			Any unique identification number for the TEI document determined by the publisher.
│	│	├	<availability><p>			Provide a prose rights statement for the TEI document. Provide a standard license, such as one from Creative Commons, if possible. Provide information on all applicable rights: rights in the original work, rights in page images of the source document, and rights in the encoded text.
│	│	└	<date when="____">			Refers to the date of the first publication of the TEI document. Use `when` attribute to aid machine processing.
│	├	<seriesStmt>				This element contains information about the electronic series being created. It has one required element (`<title level="s">`) and other optional elements.
│	│	└	<title level="s" type="_">			Whenever possible, establish or use the form of the name from a national name authority file for the electronic series being created. Value of level attribute is drawn from TEI Guidelines.
│	├	<notesStmt>				Optional.
│	└	<sourceDesc>				In order to effectively represent the source(s) when many documents are represented by the TEI header in the absence of structures identifying parent-child and component relationships, multiple source descriptions should be employed with relationships described in free text. Relationships also could be useful in other portions of the TEI header. Cataloger may need to do research to establish the original source.
│		└	<biblStruct>			Metadata for the source document may be automatically generated from a MARC record. Use <biblStruct> with child elements listed, in the order below, for ease of display according to ISBD.
│			└	<monogr>		Use this element to group together the elements describing the whole source document, even if the whole source document is not a "monograph", per the TEI definition of this element.
│				├	<author>	One or more `author` elements (one name per element) are used to encode the name for the personal author or corporate body responsible for the creation of the source document, even if this creator is not the main entry in the catalog record. Use `persName` or `orgName` when applicable. Whenever possible, establish or use the form of the name from a national name authority file.
│				├	<title level="_" type="_">	At least one `title` element is required for the title of the source document. Give the title according to the national cataloging code. The `level` attribute is used as in the TEI Guidelines. Use of the `type` attribute is required: it may have any of the following values: `main` `sub` `alt` `short` `desc` `translated` `MARC245a` `MARC245b` `uniform`
│				├	<title level="_" type="MARC245b">
│				├	<respStmt>	Statement of responsibility, according to the national cataloging code. If metadata is generated automatically from a MARC record, include a single empty <resp></resp> and put the entire statement of responsibility in <name>. If creating metadata manually, include one `respStmt` for each responsible party. Each `respStmt` may have only one `resp` child element and one `name` element, though they may occur in either order. Use `persName` or `orgName` as children of `name` when applicable. Whenever possible, establish or use the form of the name from a national name authority file.
│				├	<edition>	Edition statement (if present).
│				├	<pubPlace>	Place of publication from the original source (if present)
│				├	<publisher>	First publisher etc. from the original source (if present)
│				├	<date when="____">	Date of publication etc. from the original source (if present). Use `when` attribute to aid machine processing.
│				├	<extent>	Use of this element to describe the extent of the source document is recommended. If the data is generated by hand, it should include a comprehensible statement of the size of the item, such as the number of pages or leaves. If generated from a catalog record, it should include the extent of the item according to a national cataloging code.
│				├	<series>	Following the <monogr> element grouping together all the above metadata about the encoded item, information about the series to which the item belongs goes here, following the content model in the TEI Guidelines. If generating this data from a catalog record, it is likely that you will have only one child element: a `title`.
│				├	<note>	Notes about the source document, according to national cataloging codes.
│				└	<idno>	In this location, <idno> refers to identification numbers for the source document. Use type="isbn-13" and type="isbn-10" if applicable. This element may also be used to indicate the source's location in an individual institution's collection. If a formal standard location system is being used, indicate the nature of the system, e.g., <idno type="LC_call_number">.
├	<encodingDesc>
│	├	<p>				As the first child of `encodingDesc`, include a `p` element containing a prose statement about the format of the data in the header. Does the data in the `sourceDesc` follow AACR rules? How about in the rest of the `fileDesc`? Is ISBD punctuation included?
│	├	<projectDesc><p>				Enter a description of the purpose for which the electronic file was encoded.
│	├	<editorialDecl n="_"><p>				Record the encoding level for the content as an arabic numeral in the n attribute. As content of this element, record editorial decisions made during encoding and notes about omissions of material found in the original work.
│	├	<tagsDecl><namespace name="http://www.tei-c.org/ns/1.0"><tagUsage>				<tagUsage> must be one of the following: `<tagUsage gi="div1">Numbered divs used.</tagUsage>` `<tagUsage gi="div">Unnumbered divs used.</tagUsage>`
│	└	<classDecl><taxonomy xml:id="____"><bibl>				Use to document classification schemes used in the header or body of the TEI document. For example: `<taxonomy xml:id="LCC"><bibl>Library of Congress Classification</bibl></taxonomy>` `<taxonomy xml:id="LCSH"><bibl>Library of Congress Subject Headings</bibl></taxonomy>` `<taxonomy xml:id="AAT"><bibl>Art &amp; Architecture Thesaurus</bibl></taxonomy>`
├	<profileDesc>
│	└	<textClass>				The elements below are contained within this element.
│		├	<classCode scheme="___">			True classification numbers as opposed to call numbers can be entered here. The value of the scheme attribute corresponds to a classification scheme defined previously in classDecl. Example: `scheme="#LCC"`
│		└	<keywords scheme="____">			Repeat this element as many times as there are keyword schemes. If the child `term` elements contain terms from a controlled vocabulary, indicate that controlled vocabulary through the scheme attribute. The value of the scheme attribute corresponds to a classification scheme defined previously in `classDecl`. Example: `scheme="#LCSH"`
│			└	<term>		Use for terms from controlled or uncontrolled vocabularies, as indicated in the parent `keywords` element.
└	<revisionDesc>
	└	<change when="YYYY-MM-DD" who="URI">				Create a `<change>` element to record each significant change to the TEI document, in reverse chronological order (i.e., most recent first). A prose description of the change is recorded as the content of each `<change>` element. This prose may contain lists for organization, and phrase-level markup (like `<gi>`, `<ptr>`, or `<date>`), but not paragraphs. The date of the change in ISO 8601 form (YYYY-MM-DD) should be recorded in the `when=` attribute. The person who is responsible for making the change is indicated by the `who=` attribute of `<change>`. Its value is a URI that points to a `<respStmt>` or `<person>` element that encodes information about the responsible party. Note that this reference is a URI reference and not an ID/IDREF reference, and thus is not checked by validation software. Small projects sometimes take advantage of this by putting information into the URI itself, and not having a `<respStmt>` or `<person>` element. E.g., `who="#Kevin_Hawkins"`.
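
The relationship between `who=` and `<respStmt>` described in the last row can be sketched with the two fragments below. The identifier CKP is borrowed from the sample header that follows; the bracketed name, the wording of the `<resp>`, and the first change entry are placeholders invented for illustration.

```xml
<!-- Fragment 1, inside fileDesc/titleStmt: the party that change/@who points to -->
<respStmt xml:id="CKP">
  <resp>Creation of the TEI header</resp>
  <name>
    <persName>[Name of header creator]</persName>
  </name>
</respStmt>

<!-- Fragment 2, inside teiHeader: changes listed most recent first -->
<revisionDesc>
  <change when="2008-06-15" who="#CKP">Replaced <gi>sourceDesc</gi> with data exported from the catalog.</change>
  <change when="2005-05-25" who="#CKP">Header generated from export of MARC record.</change>
</revisionDesc>
```

Because `who="#CKP"` is only a URI reference, a validator will not complain if the `<respStmt>` carrying that identifier is missing; projects that rely on the link may want to check it separately.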

Sample TEI Header

  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title type="main">Lincoln and Seward.</title>
        <author>
          <persName>Welles, Gideon, 1802-1878.</persName>
        </author>
      </titleStmt>
      <publicationStmt>
        <publisher>University of Michigan, Digital Library Initiatives</publisher>
        <availability>
          <p>These pages may be freely searched and displayed. Permission must be received for
            subsequent distribution in print or electronically. Please go to
            http://www.umdl.umich.edu/ for more information.</p>
        </availability>
        <date when="1996"/>
      </publicationStmt>
      <seriesStmt>
        <title level="s" type="main">Making of America</title>
      </seriesStmt>
      <sourceDesc>
        <biblStruct>
          <monogr>
            <author>
              <persName>Welles, Gideon, 1802-1878.</persName>
            </author>
            <title level="m" type="MARC245a">Lincoln and Seward.</title>
            <title level="m" type="MARC245b">Remarks upon the memorial address of Chas. Francis
              Adams, on the late William H. Seward, with incidents and comments illustrative of the
              measures and policy of the administration of Abraham Lincoln. And views as to the
              relative positions of the late President and secretary of state.</title>
            <imprint>
              <pubPlace>New York</pubPlace>
              <publisher>Sheldon &amp; company</publisher>
              <date when="1874"/>
            </imprint>
            <extent>viii, [7]-215 p. ; 20 cm.</extent>
          </monogr>
          <note>First published in condensed form in the Galaxy, v. 16, 1873, p. [518]-530,
            [687]-700, [793]-804.</note>
          <idno type="isbn-10">1-4255-1817-6</idno>
          <idno type="LC_call_number">E456 .W44</idno>
        </biblStruct>
      </sourceDesc>
    </fileDesc>
    <encodingDesc>
      <p>Data in the <gi>sourceDesc</gi> of the header comes from a pre-AACR2 record. Other data follows
        AACR2 when applicable.</p>
      <projectDesc>
        <p>XML created for the Making of America collection.</p>
      </projectDesc>
      <editorialDecl n="1">
        <p><gi>sourceDesc</gi> created by exporting from catalog on 2008-06-15.</p>
        <p>This electronic text file was created by optical character recognition (OCR). No
          corrections have been made to the OCR-ed text and no editing has been done to the content
          of the original document. Encoding has been done using the recommendations for Level 1 of
          the TEI in Libraries Guidelines.</p>
      </editorialDecl>
      <tagsDecl>
        <namespace name="http://www.tei-c.org/ns/1.0">
          <tagUsage gi="div">Only un-numbered divisions are used.</tagUsage>
        </namespace>
      </tagsDecl>
      <classDecl>
        <taxonomy xml:id="LCC">
          <bibl>Library of Congress Classification</bibl>
        </taxonomy>
        <taxonomy xml:id="LCSH">
          <bibl>Library of Congress Subject Headings</bibl>
        </taxonomy>
      </classDecl>
    </encodingDesc>
    <profileDesc>
      <textClass>
        <classCode scheme="#LCC">E456</classCode>
        <keywords scheme="#LCSH">
          <term>Lincoln, Abraham, 1809-1865.</term>
          <term>Seward, William Henry, 1801-1872.</term>
          <term>Adams, Charles Francis, 1807-1886. Address of Charles Francis Adams ... on the life
            ... of William H. Seward.</term>
        </keywords>
      </textClass>
    </profileDesc>
    <revisionDesc>
      <change who="#CKP" when="2005-05-25">Header generated from export of MARC record</change>
    </revisionDesc>
  </teiHeader>

Linking between encoded text and images of source documents

There are two recommended mechanisms for linking between the encoded text and facsimile page images of source documents. Projects may either:

Use the @facs attribute on each <pb> element to point to the corresponding page image using a URI.
Use the @xml:id attribute on each <pb> element and a METS document to provide correspondence between <pb> elements and one or more facsimile page images (e.g., master, web derivatives, etc.).

The examples below use the former method (the @facs attribute).
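
On the TEI side, the two mechanisms look like this. The image file name and the xml:id value are illustrative only; with the second mechanism the image locations are recorded in the METS document, which is not shown here.

```xml
<!-- Mechanism 1: @facs points directly at the page image -->
<pb n="7" facs="00000007.tif"/>

<!-- Mechanism 2: @xml:id gives a METS document something to refer to -->
<pb n="7" xml:id="p007"/>
```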

Encoding Levels

LEVEL 1: Fully Automated Conversion and Encoding

Reference

Chapter 3, Elements Available in All TEI Documents

Purpose

To create electronic text with the primary purpose of keyword searching and linking to page images. The primary advantage in using the TEI at this very strictly limited level of encoding is that a TEI header is attached to the text file.

Rationale

The text is subordinate to the page image, and is not intended to stand alone as an electronic text (without page images).

Texts at Level 1 can be created and encoded by fully automated means, using uncorrected OCR of page images ("dirty OCR") or exporting from existing electronic text files. Encoding is performed automatically based on artifacts of the OCR or other document creation process (page breaks, for example) and metadata collected during the imaging or preparation process. This encoding is both minimal and reliable, and does not typically require extensive review of each page of each text.

Level 1 texts are not intended to be adequate for textual analysis; they are more likely to be suited to the goals of a preservation unit or mass digitization initiative. Though their encoding is minimal, Level 1 texts are fully valid XML texts. In addition to taking advantage of the TEI header, these texts, while lightly encoded, can be easily combined with more richly encoded texts (that also follow these guidelines) for searching. Further encoding based on document structures or content analysis can be added to a Level 1 text at any time.

Level 1 is most suitable for projects with the following characteristics:

a large volume of material is to be made available online quickly
a digital image of each page is desired
no manual intervention will be performed in the text creation process
the material is of interest to a large community of users who wish to read texts that allow keyword searching
sophisticated search and display capabilities based on the structure of the text are not necessary
extensibility is desired; that is, one desires to keep open the option for a higher level of encoding to be added at a later date

Element Recommendations for Level 1

<div1> or <div>	There should be only one child of <body>: a single <div> (or <div1>)
<ab>	There should be only one child of the <div> (or <div1>): a single <ab> wrapping all of the OCR text. If the text is ever “upgraded” to Level 3 or higher, the <ab> element will be replaced by structural elements like <p> and <table>.
<pb>	Required in Level 1. Page images can be linked to the text by specifying a jpeg or other image file as the value of the facs= attribute. Page numbers can be supplied with the n= attribute to record the number that is on the page. The Task Force sees the use of METS here as having a tremendous advantage. METS/TEI page turning documentation will be included in the near future.

Level 1 Example: Alger Hiss document

<TEI xml:id="someid" xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader>
[Source and processing information goes here]
</teiHeader>
<text>
<body>
<div1>
<ab>
<pb n="113" facs="00000001.tif"/>
POINT VIII.
BECAUSE OF UNLAWFUL SURVEILLANCE, PETITIONER'S
CONVICTION SHOULD BE VACATED; ALTERNATIVELY,
DISCOVERY AND A HEARING SHOULD BE ORDERED.
The nature and extent of surveillance of Hiss, his
family and associates was not known at the time of trial by
the defense. Even now, with the release of some of the government documents concerning FBI investigative techniques regarding
Hiss, the full extent of surveillance -- wiretapping, mail openings, mail covers, physical surveillance, and other intrusive
techniques -- is still not 'clear. Nevertheless, it is apparent
that information gathered through the exploitation of unlawful
wiretaps and other illegal surveillance was used at trial and
consequently the conviction must be reversed. Alternatively,
further discovery and a hearing is essential to a fair determination regarding these issues.
FBI surveillance of Hiss began in earnest in 1941 with
the institution of a mail cover on his incoming correspondence
at his home in connection with an FBI investigation of possible
Hatch Act violations. CN Ex. 98A. Another mail cover was placed

-113 -

<pb n="114" facs="00000002.tif"/>
on the Hiss mail in 1945, and at the same time the FBI obtained
toll call records from the Hiss residence Telephone for the
years 1943 and 1944 as well. CN Ex. 99. In September, 1945,
the FBI intercepted telegrams to Hiss as well. CN Ex. 100.
In late November, 1945, FBI surveillance of the Hiss
residence in Washington, D.C., escalated. For the third time,
a mail cover was instituted beginning on November 28, 1945,
which was continued at least until 1946. CN Ex. 101 at p. 70;
CN Ex. 102. Continuous physical surveillance of Hiss was begun
as well. CN Ex. 101 at p. 72. Although this twenty-four-hour
surveillance was discontinued on December 14, 1945, physical
surveillance was conducted frequently at various times until
September, 1947. CN Ex. 102; CN Ex. 103.
The most intrusive invasion of petitioner's rights
68/ Also before 1947, a letter from Priscilla Hiss addressed
to her son, Timothy Hobson, was intercepted and its contents
read. CN Ex. 100A at p. 167. In approximately March, 1947,
a letter from a Michael Greenberg addressed to petitioner regarding an application for employment with the United Nations
was also intercepted, in a manner not revealed by the documents. CN Ex. 100B

-114 -

<pb n="115" facs="00000003.tif"/>
occurred from December 13, 1945 until the Hisses moved from
Washington, D.C. to New York City on September 13, 1947. A
"technical surveillance," -- a wiretap -- was placed on the Hiss
telephone at their residence on P Street-in Washington, D.C.
The logs of this surveillance constitute twenty-nine volumes
of FBI serials and are roughly 2,500 pages in length, in which
an enormous amount of information concerning the Hisses' personal lives, relationships with friends and associates, and
habits is recorded.
The wiretap was installed following FBI Director Hoover's
application to the Attorney General for authorization, although
no written authorization appears in the documents released to
Hiss. The purpose of the application was to gather information
regarding Hiss' alleged contacts with Soviet espionage agents and
communists in government service, general allegations which had
been made by Elizabeth Bentley and Chambers.
As one would expect, the interception of every telephone
h9/ Hoover's initial request was answered by a note requesting information on Hiss. CN Ex. 104. Additional information
was furnished by letter dated November 30, 1945. CN Ex. 105.

-115 -

</ab>
</div1>
</body>
</text>
</TEI>

LEVEL 2: Minimal Encoding

Reference

Purpose

To create electronic text for full-text searching, linking to page images, and identifying simple structural hierarchy to improve navigation. (For example, you can create a table of contents from such encoding.)

Rationale

The text is mainly subordinate to the page image, though navigational markers (textual divisions, headings) are captured. However, the text could stand alone as electronic text (without page images) if the accuracy of its contents is suitable to its intended use and it is not necessary to display low-level typographic or structural information. Level 2 requires a set of elements more granular than those of Level 1, including bibliographic or structural information below the monographic or volume level. One of the motivations for using Level 2 is to avoid expensive analysis of textual elements and/or the expense of accurate text conversion, e.g., double-keying or detailed proofreading of automatic OCR.

Texts at Level 2 can be created and encoded by automated means, based on the typographic elements in the electronic file (for example, bold centered text at the top of the page surrounded by whitespace indicates a new chapter heading, and thus a new division). However, automated methods are not likely to be reliable across a large body of work, especially if the materials predate 1900. Level 2 encoding therefore requires some human intervention to identify each textual division and heading, but it does not require any special knowledge or manual intervention below the section level.

For the most part, Level 2 texts are not intended to be displayed separately from their page images. Level 2 encoding of sections and headings provides greater navigational possibilities than Level 1 encoding, and enables searching to be restricted within particular textual divisions (for example, searching for two phrases within the same chapter).

Level 2 is most suitable for projects in which:

a large volume of material is to be made available online quickly
a digital image of each page is desired
the material is of interest to a large community of users who wish to read texts that allow keyword searching
rudimentary search and display capabilities based on the large structures of the text are desired
each text is checked to ensure that divisions and headers are properly identified
extensibility is desired; that is, one desires to keep open the option for a higher level of encoding to be added at a later date

Element Recommendations for Level 2

Use all elements specified in Level 1 plus the following:

<front>, <back>	Optional
<div1> or <div>	If no type= attribute is specified, a type= value of "section" should be presumed.
<head>	Required if headings are present
<ab>	At least one container element is required.

Level 2 Example: Basic Structure

<TEI xmlns="http://www.tei-c.org/ns/1.0">
 <teiHeader type="text">[See above for an example of a TEI Header]</teiHeader>
 <text>
  <front>[title page information, table of contents, prefaces, etc.][optional]</front>
  <body>
   <div type="section">
    <pb xml:id="p21198-zz0002mpqr" n="1"/>
    <head>A DISSERTATION UPON Religious Worship.</head>
    <ab>[a whole section is contained within this anonymous block tag; interspersed with pb elements pointing to page
        images]<pb xml:id="p21198-zz0002mpwb" n="2"/></ab>
   </div>
   <div type="section">
    <pb xml:id="p21198-zz0002mq0c" n="27"/>
    <ab>
    </ab>
    <div type="subsection">
      <head>CHAP. I. The Origin of the Customs and Ceremonies of the Jews. their federal Divisions;
      and the various Particulars wherein they differ.</head>
        <ab>[all the paragraphs of chapter one go here with page breaks inserted]</ab>
    </div>
   </div>
  </body>
  <back> [optional] </back>
 </text>
</TEI>

Level 2 Example: Alger Hiss document

<TEI xml:id="someid" xmlns="http://www.tei-c.org/ns/1.0">
<teiHeader>
[Source and processing information goes here]
</teiHeader>
<text>
<body>
<div1>
<pb n="113" facs="00000001.tif"/>
<head>POINT VIII: BECAUSE OF UNLAWFUL SURVEILLANCE, PETITIONER'S CONVICTION SHOULD BE VACATED; ALTERNATIVELY, DISCOVERY AND A HEARING SHOULD BE ORDERED.</head>
<ab>
POINT VIII.
BECAUSE OF UNLAWFUL SURVEILLANCE, PETITIONER'S
CONVICTION SHOULD BE VACATED; ALTERNATIVELY,
DISCOVERY AND A HEARING SHOULD BE ORDERED.
The nature and extent of surveillance of Hiss, his
family and associates was not known at the time of trial by
the defense. Even now, with the release of some of the government documents concerning FBI investigative techniques regarding
Hiss, the full extent of surveillance -- wiretapping, mail openings, mail covers, physical surveillance, and other intrusive
techniques -- is still not 'clear. Nevertheless, it is apparent
that information gathered through the exploitation of unlawful
wiretaps and other illegal surveillance was used at trial and
consequently the conviction must be reversed. Alternatively,
further discovery and a hearing is essential to a fair determination regarding these issues.
FBI surveillance of Hiss began in earnest in 1941 with
the institution of a mail cover on his incoming correspondence
at his home in connection with an FBI investigation of possible
Hatch Act violations. CN Ex. 98A. Another mail cover was placed

-113 -

<pb n="114" facs="00000002.tif"/>
on the Hiss mail in 1945, and at the same time the FBI obtained
toll call records from the Hiss residence Telephone for the
years 1943 and 1944 as well. CN Ex. 99. In September, 1945,
the FBI intercepted telegrams to Hiss as well. CN Ex. 100.
In late November, 1945, FBI surveillance of the Hiss
residence in Washington, D.C., escalated. For the third time,
a mail cover was instituted beginning on November 28, 1945,
which was continued at least until 1946. CN Ex. 101 at p. 70;
CN Ex. 102. Continuous physical surveillance of Hiss was begun
as well. CN Ex. 101 at p. 72. Although this twenty-four-hour
surveillance was discontinued on December 14, 1945, physical
surveillance was conducted frequently at various times until
September, 1947. CN Ex. 102; CN Ex. 103.
The most intrusive invasion of petitioner's rights
68/ Also before 1947, a letter from Priscilla Hiss addressed
to her son, Timothy Hobson, was intercepted and its contents
read. CN Ex. 100A at p. 167. In approximately March, 1947,
a letter from a Michael Greenberg addressed to petitioner regarding an application for employment with the United Nations
was also intercepted, in a manner not revealed by the documents. CN Ex. 100B

-114 -

<pb n="115" facs="00000003.tif"/>
occurred from December 13, 1945 until the Hisses moved from
Washington, D.C. to New York City on September 13, 1947. A
"technical surveillance," -- a wiretap -- was placed on the Hiss
telephone at their residence on P Street-in Washington, D.C.
The logs of this surveillance constitute twenty-nine volumes
of FBI serials and are roughly 2,500 pages in length, in which
an enormous amount of information concerning the Hisses' personal lives, relationships with friends and associates, and
habits is recorded.
The wiretap was installed following FBI Director Hoover's
application to the Attorney General for authorization, although
no written authorization appears in the documents released to
Hiss. The purpose of the application was to gather information
regarding Hiss' alleged contacts with Soviet espionage agents and
communists in government service, general allegations which had
been made by Elizabeth Bentley and Chambers.
As one would expect, the interception of every telephone
h9/ Hoover's initial request was answered by a note requesting information on Hiss. CN Ex. 104. Additional information
was furnished by letter dated November 30, 1945. CN Ex. 105.

-115 -

</ab>
</div1>
</body>
</text>
</TEI>

LEVEL 3: Simple Analysis

Reference

Purpose

To create a stand-alone electronic text and identify hierarchy (logical structure) and typography without content analysis being of primary importance.

Rationale

Level 3 texts can be created by conversion from an electronic source such as HTML or word-processed documents or a print source with the automatic generation of full text by Optical Character Recognition software. Level 3 texts can also be created from scratch (e.g., transcription, born digital, etc.). Encoding at this level offers the advantage of the TEI header, interoperability with other TEI collections, and extensibility to higher levels of encoding. Level 3 generally requires some human editing, but the features to be encoded are determined by the logical structure and appearance of the text and not specialized content analysis.

Level 3 texts identify front and back matter, divisions within the text, and all paragraph breaks. Floating texts, or sub-texts like a poem or letter embedded in the greater text, are supported in this level. The finer granularity of encoding these features, as well as figures, notes, and all changes of typography, allows a range of options for display, delivery, and searching. For example, one has the option of identifying and, therefore, specifying the display characteristics of different typographic styles, and regularizing the display and placement of note text.

Level 3 texts can stand alone as text without page images and, therefore, can be uploaded, downloaded and delivered quickly, and require less storage space than digital collections with page images. However, the simple level of structural analysis and absence of specialized content analysis reflected in Level 3 encoding may make it desirable for some, depending on project priorities, to include page images in order to provide users with a fuller set of resources.

Level 3 is most suitable for projects with the following characteristics:

the material is of interest to a large community of users who wish to read texts that allow for keyword searching
some sophistication of display, delivery, and searching based on structure of the text is desired
each text will undergo quality control to ensure that encoding decisions have been made appropriately
the users of the texts may have limited storage or display capabilities
the creator of the texts has limited or no ability to provide content expertise to analyze, tag, or review texts
extensibility is desired; that is, one desires to keep open the option for a higher level of encoding to be added at a later date

Element Recommendations for Level 3

Use all elements specified in Levels 1 and 2, plus the following:

<front>, <back>	Required if present.
<div>	Required if present; `type` attribute is recommended.
<floatingText>	Recommended if present.
<p>	Required for paragraph breaks in prose.
<lg> and <l>	Required for identifying groups of lines and lines, respectively.
<list> and <item>	May be used in this level to indicate ordered and unordered list structures.
<table>, <row>, and <cell>	May be used to indicate table structures.
<figure>	Required to indicate figures other than page images.
<hi>	Required to indicate changes in typeface; `rend` attribute is optional.
<note>	All notes must be encoded. It is also recommended that notes that extend beyond one page be combined into one `<note>` element. Marginal notes, without reference, should occur at the beginning of the paragraph to which they refer, with the value of the `place` attribute as "margin".
<lb>	May be used to indicate line breaks.
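
As a minimal sketch of the <note>, <hi>, and <lb> recommendations in the table above, a paragraph with a marginal note might be encoded as follows. The text is invented for illustration, and the rend value "italics" is only one possible local convention:

```xml
<!-- Hypothetical text: the marginal note, which has no reference mark,
     is placed at the beginning of the paragraph it refers to -->
<p><note place="margin">Of the Sabbath.</note>The text of the paragraph begins here, with
   one <hi rend="italics">italicized</hi> phrase.<lb/>
   A second printed line follows the recorded line break.</p>
```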

General Level 3 Recommendations

Forme Works

Running heads, catch words, and other such forme work information should not be included in Level 3, with the exception of page numbers, which are recorded using <pb>. If upgrading a text from Level 1 or Level 2 that was generated using OCR, discard the forme work information.

Front Matter

<div type="contents">: Use lists to mark up the table of contents with the <ptr> element used to reference the starting page number. The <ptr> element can reference the <pb> identifier or an identifier (e.g., @xml:id) placed in the corresponding division of text.

Body

<note>: It may be desirable to move footnotes from their original location in the text. If left at the bottom of a page, a note may become included in another paragraph or section of the encoded text, and thus separated from its reference. There are options for placement of footnotes if they are moved:

Inline. The note is inserted at the point of reference. An n attribute records the value of the note reference if there is one.
End-of-Division. The notes are moved to the end of the corresponding division of the text (e.g., end of chapter).
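
The two placement options can be sketched as follows. The text, the identifier ch1_n1, and the use of <ref> to link the reference mark to the relocated note are illustrative assumptions, not requirements of these recommendations:

```xml
<!-- Inline: the note sits at its point of reference; n keeps the printed marker -->
<p>The treaty was signed the following spring.<note place="bottom" n="1">See the
   appendix for the full text.</note> Trade resumed within the year.</p>

<!-- End-of-Division: a ref in the text points to a note gathered at the end of the chapter -->
<div type="chapter">
   <p>The treaty was signed the following spring.<ref target="#ch1_n1">1</ref>
      Trade resumed within the year.</p>
   <note xml:id="ch1_n1" place="bottom" n="1">See the appendix for the full text.</note>
</div>
```

In both cases the place attribute records where the note appeared in the source, not where it has been moved in the encoding.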

Back Matter

<div type="index">: Use lists to mark up index entries with the <ref> element used to reference the corresponding page number. Add the "target" attribute (@target) to reference the <pb> identifier to generate links from the index into the text proper.
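
A minimal sketch of an index encoded this way is given here. The entries are placeholders, and the target values assume that the corresponding <pb> elements carry identifiers such as xml:id="VAA2383_011":

```xml
<!-- @target references the page break identifier -->
<div type="index">
   <head>INDEX</head>
   <list type="simple">
      <item>[index entry], <ref target="#VAA2383_011">3</ref></item>
      <item>[index entry], <ref target="#VAA2383_020">12</ref>,
         <ref target="#VAA2383_029">21</ref></item>
   </list>
</div>
```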

Level 3 Examples

Basic Structure: Prose (See full example)

<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:id="VAA2383">
 <teiHeader> [stuff] </teiHeader>
 <text>
      <front>
           <div type="frontispiece">[figure]</div>
           <titlePage>[text]</titlePage>
           <div type="dedication">[text]</div>
           <div type="contents">[text]</div>
      </front>
      <body>
           <div type="book">
           <head>[book title]</head>
                <div type="chapter">[text]</div>
                <div type="chapter">[text]</div>
                <div type="chapter">[text]</div>
                <div type="chapter">[text]</div>
                <div type="chapter">[text]</div>
           </div> 
      </body>
      <back>
           <div type="appendix">[text]</div>
           <div type="index">[text]</div>
      </back> 
 </text>
</TEI>

Basic Structure: Verse (See full example)

<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:id="VAA2383">
 <teiHeader> [info] </teiHeader>
 <text>
      <front>
           <titlePage>[text]</titlePage>
           <div type="dedication">[text]</div>
           <div type="contents">[text]</div>
      </front>
      <body>
           <div type="book">
           <head>[book title]</head>
                <div type="part">
                <head>[section title]</head>
                       <div type="poem">
                       <head>THE DAYS GONE BY.</head>
                       <lg>
                            <l n="1">O the days gone by! O the days gone by!</l>
                            <l n="2">The apples in the orchard, and the pathway through the rye;</l>
                            <l n="3">The chirrup of the robin, and the whistle of the quail</l>
                            <l n="4">As he piped across the meadows sweet as any nightingale;</l>
                            <l n="5">When the bloom was on the clover, and the blue was in the sky,</l>
                           <l n="6">And my happy heart brimmed over&#8212;in the happy days gone by.</l>
                     </lg>
                     <lg>[lines of poetry]</lg>
                     <lg>[lines of poetry]</lg>
                     <lg>[lines of poetry]</lg>
                    </div>
                </div>  
           </div> 
      </body>
 </text>
</TEI>

<!--@target references page break identifier-->
<div type="contents">
       <head>CONTENTS</head>
                <list type="simple">
                    <item>I. A Boy and His Dog <hi rend="right">3</hi>
                        <ptr target="#VAA2383_011"/></item>
                    <item>II. Romance <hi rend="right">12</hi>
                        <ptr target="#VAA2383_020"/></item>
                    <item>III. The Costume <hi rend="right">21</hi>
                        <ptr target="#VAA2383_029"/></item>
                    <item>IV. Desperation <hi rend="right">30</hi>
                        <ptr target="#VAA2383_038"/></item>
                    <item>V. The Pageant of the Table Round <hi rend="right">38</hi>
                        <ptr target="#VAA2383_046"/></item>
                </list>
</div>

Chapter with Letter

<div type="chapter">
<pb xml:id="VAA2383_126" n="118"/>
     <head type="main">CHAPTER XIV</head>
     <head type="subtitle">MAURICE LEVY'S CONSTITUTION</head>
          <p><hi rend="b">L</hi>O, SAM!" said Maurice cautiously. "What you doin'?"</p>
          <p>Penrod at that instant had a singular experience&#8212;an intellectual shock like a flash of fire in the
                    brain. Sitting in darkness, a great light flooded him with wild brilliance. He gasped!</p>
   <!--Text removed from example-->        
           <p>"What you doin'?" asked Maurice for the third time, Sam Williams not having decided upon a reply.</p>
           <pb xml:id="VAA2383_127" n="119"/>
           <p>It was Penrod who answered.</p>
           <p>"Drinkin' lickrish water," he said simply, and wiped his mouth with such delicious enjoyment that
                        Sam's jaded thirst was instantly stimulated. He took the bottle eagerly from Penrod.</p>
            <p>"A-a-h!" exclaimed Penrod, smacking his lips. "That was a good un!"</p>
                    <!--Text removed from example-->
                    <p>Penrod uttered some muffled words and then waved both arms&#8212;either in response or as an expression
                        of his condition of mind; it may have been a gesture of despair. How much intention there was in
                        this act&#8212;obviously so rash, considering the position he occupied&#8212;it is impossible to say.
                        Undeniably there must remain a suspicion of deliberate purpose.</p>
                    <!--Text removed from example-->
             <pb xml:id="VAA2383_138" n="130"/>
             <p>The damsel curtsied again and handed him the following communication, addressed to herself: </p>
                   <floatingText>
                        <body>
                            <div type="letter">
                                <p>"Dear madam Please excuse me from dancing the cotilo with you
                                    this afternoon as I have fell off the barn</p>
                                <p>"Sincerly yours<lb/>
                                    "P<hi rend="sc">ENROD</hi> S<hi rend="sc">CHOFIELD</hi>."
                                </p>
                            </div>
                        </body>
                 </floatingText>
</div>

LEVEL 4: Basic Content Analysis

Reference

Purpose

To create text that can stand alone as electronic text, identifies hierarchy and typography, specifies function of textual and structural elements, and describes the nature of the content and not merely its appearance. This level is not meant to encode or identify all structural, semantic or bibliographic features of the text.

Rationale

Greater description of function and content allows for:

flexibility of display and delivery
sophisticated searching within specified textual and structural elements
combining the broadest range of uses and audiences

Texts encoded at Level 4 are able to stand alone as part of a library collection, and do not require page images in order for them to be read by students, scholars and general readers. This level of TEI encoding allows them to be displayed or printed in a variety of ways suitable for classroom or scholarly use.

Level 4 texts contain elements and attributes that describe content. Features of the text that may contribute to meaning, such as indentation of verse lines and typographic change, are preserved. These are textual features that are not encoded at lower levels and that allow the text to be used and understood fully independent of images. The ability to stand alone as text means that Level 4 texts are more nimble and robust for exercises such as format repurposing and textual analysis.

Finally, functionally accurate encoding in Level 4 texts allows them to be searched or displayed in sophisticated ways. For example, a searcher could limit his or her search in a dramatic text to stage directions or in a verse text to only first lines. In a political tract published by subscription, a search could be confined to names that appear in lists, thus limiting a search to names of people who subscribed to a particular volume. This ability to limit searches becomes more significant as textbases become larger, and thus is of great importance to the library community as it attempts to build into the initial design and implementation of textbases the features needed to enhance interoperability.

Level 4 is most suitable for projects with the following characteristics:

sophisticated search and retrieval capabilities are desired
the texts will be used for textual analysis
extensibility is desired; that is, one desires to keep open the option for a higher level of encoding to be added by the scholarly community at a later date
the users of the texts may have limited storage or display capabilities

Element Recommendations for Level 4

Use all elements specified in Levels 1, 2 and 3, plus elements in the following table. Note that some of these elements are defined in Level 3 as well, but their use in Level 4 is more strict.

<titlePage> and appropriate child elements	Required.
<group>	Required to encode a collection of independent texts that are regarded as a single group for processing or other purposes.
<list> and <item>	Required to indicate ordered and unordered list structures.
<table>, <row>, and <cell>	Required to indicate table structures.
<hi>	Required to indicate change in rendition when a more specific element is not being used; rend attribute is optional.
<emph>, <foreign>, <gloss>, <term>, or <title>	Recommended to identify typographically distinct text.
<epigraph>, <quote>, <said>, <mentioned>, or <soCalled>	Recommended to represent speech, thought, quotation, etc.
<sic>, <corr>, or <choice>	Recommended to encode errors or typos.
<opener>, <dateline>, <salute>, <closer>, <signed>, <postscript>	Required to indicate specific parts of letters.
<argument>	Recommended to encode a "list of topics sometimes found at the start of a chapter or other division".
<trailer>	Recommended to encode heading- or title-like content at the end of a division.
<add>, <del>, <gap>, and <unclear>	Recommended to encode material that is omitted, added, marked for deletion, or is illegible, invisible, or inaudible.
<figure>, <head>, <figDesc>, and <graphic>	Used to refer to illustrative images and descriptive information about those images.
<castList>, <castItem>, <sp>, <speaker>, and <stage>	Required to encode different structures in performance texts (i.e. drama).
<sp> and <speaker>	Required to encode oral history interviews.
<persName>, <placeName>, <geogName>, and <orgName>	Recommended to encode personal, place, and organizational names referenced in a text.
<listPerson>, <listPlace>, and <listOrg>	Recommended to support normalization of personal, place, and organizational names and to capture additional information about the names. These lists should be kept in an external TEI file for easier maintenance.
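
As a minimal sketch of the performance-text elements listed in the table above, a short dramatic text might be structured as follows. The play, its characters, and its dialogue are invented for illustration:

```xml
<text>
   <front>
      <castList>
         <head>Persons of the Play</head>
         <castItem>Mrs. Hale, a widow</castItem>
         <castItem>Tom, her son</castItem>
      </castList>
   </front>
   <body>
      <div type="act" n="1">
         <head>ACT I.</head>
         <stage>A parlor. Evening.</stage>
         <sp>
            <speaker>Mrs. Hale.</speaker>
            <p>Has the post come?</p>
         </sp>
         <sp>
            <speaker>Tom.</speaker>
            <p>Not yet, mother.</p>
            <stage>He goes to the window.</stage>
         </sp>
      </div>
   </body>
</text>
```

Because speeches and stage directions are tagged separately, a search can be confined to one or the other.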

General Level 4 Recommendations

The use of <group> is required when you need to encode a body of distinct texts that are grouped together and are regarded as a unit. Most typical examples of such composite texts would be anthologies, collected works of an author, etc. Section 4.3.1 Grouped Texts states, "The presence of common front matter referring to the whole collection, possibly in addition to front matter relating to each individual text, is a good indication that a given text might usefully be encoded in this way."
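
The overall shape of such a composite text can be sketched as follows, using bracketed placeholders in the manner of the other structural examples. The arrangement shown, with two member texts, is an assumption for illustration:

```xml
<text>
   <front>[front matter for the whole collection: title page, general preface]</front>
   <group>
      <text>
         <front>[front matter for the first work, if any]</front>
         <body>[first work]</body>
      </text>
      <text>
         <body>[second work]</body>
      </text>
   </group>
   <back>[back matter for the whole collection]</back>
</text>
```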

Any ambiguous emphasized text should be encoded as <hi> (e.g. <hi rend="bold">), if more specific elements are not used.

Typographically distinct text can be encoded with more specificity. We recommend the following approaches:
- to represent speech, thought, quotation, etc.:
  - <epigraph>
  - <quote>
  - <said>
  - <mentioned>
  - <soCalled>
- to represent foreign words or phrases, linguistically emphatic or stressed words or phrases, words regarded as a technical term, etc.:
  - <emph>
  - <foreign> (e.g. <foreign xml:lang="fr">)
  - <gloss>
  - <term>
  - <title>
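
A few of these elements are sketched in context here. The sentences are invented, and the choice of element for each phrase is an editorial judgment that another encoder might make differently:

```xml
<!-- Hypothetical sentences; each phrase is italicized in the imagined source -->
<p>She called it her <soCalled>cottage</soCalled>, though it had <emph>forty</emph> rooms,
   and she closed every letter with <foreign xml:lang="fr">à bientôt</foreign>.</p>
<p>He read <title>Penrod</title> aloud and said it was <emph>not</emph> a book for
   children.</p>
```

Where the function of the italics cannot be determined, <hi> remains the fallback.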

We recommend the following three approaches to encoding errors or typos in original texts:
- <sic> is recommended to mark an error without correcting it
- <corr> is recommended if the encoder chooses to correct the error, recording only the corrected reading
- <choice> combines the two approaches, with the encoder documenting both the error as found and its correction:

<p>He has no Scruple about Fish; but won't touch a bit of Pork, it being 
    <choice>
      <sic>expresly</sic>
      <corr>expressly</corr>
    </choice> forbidden by their Law.</p>

FROM: Thomas Bluett. Some Memoirs of the Life of Job, the Son of Solomon, the High Priest of Boonda in Africa; Who was a Slave About Two Years in Maryland; and Afterwards Being Brought to England, was Set Free, and Sent to His Native Land in the Year 1734. London: Printed for R. Ford, 1734.

or

<p>4. The art of writing she obtained by her own industry and curiosity, and in so
short a time that in the year 1765, when she was not more than twelve years of
<choice>
<sic>age,she</sic>
<corr>age, she</corr>
</choice>
was capable of writing letters to her friends <pb xml:id="p11" n="11"/> on various
subjects. She also wrote to several persons in high stations.</p>

FROM: Abigail Mott, 1766-1851. Biographical Sketches and Interesting Anecdotes of Persons of Colour. To Which is Added, a Selection of Pieces in Poetry. New-York: M. Day, 1826.

Use <argument> to encode a prefatory list or prose description of topics, usually found at the beginning of a chapter. The content within the <argument> element can be presented as a list or as a paragraph:

<div type="chapter" n="1">
<pb xml:id="albert14" n="14"/>
   <head>CHAPTER I.<lb/>CHARLOTTE BROOKS.</head>
    <argument>
	<p>Causes of immorality among colored people - Charlotte Brooks - She is sold South -
Sunday work.</p>
    </argument>
	<p> ... </p>
</div>

FROM: Octavia V. Rogers Albert. The House of Bondage, or, Charlotte Brooks and Other Slaves, Original and Life Like, As They Appeared in Their Old Plantation and City Slave Life; Together with Pen-Pictures of the Peculiar Institution, with Sights and Insights into Their New Relations as Freedmen, Freemen, and Citizens. New York: Hunt & Eaton, 1890.

The <trailer> element is recommended to encode heading- or title-like content at the end of a division (i.e. chapter, book, etc.).

<TEI xmlns="http://www.tei-c.org/ns/1.0" xml:id="someid">
 <teiHeader>[stuff]</teiHeader>
 <text>
     <body>
           <div type="book">
           <head>[book title]</head>
                <div type="chapter" n="1">
                <head>[chapter title]</head>
                   <p>[text]</p>
                   <trailer>Here ends the Chapter 1.</trailer>
                </div>  
                <div type="chapter" n="2">
                <head>[chapter title]</head>
                   <p>[text]</p>
                   <trailer>Here ends the Chapter 2.</trailer>
                </div>
           <trailer>FINIS.</trailer>
           </div> 
      </body>
  </text>
</TEI>

The elements <add>, <del>, <unclear>, and <gap> can be used to indicate that text (i.e. a word or phrase, or part of one) has been added, marked for deletion, or omitted, or that the material is illegible, invisible, or inaudible (e.g. when transcribing oral history interviews).

<p>But it is well authenticated by the observation of every one, that <del
rend="overstrike" hand="JHL">their manner</del> <add rend="sup" hand="JHL">this way—i.e.
the above</add> of writing influences the style of compos. of those who practise it
considerably, when they grow up to years of manhood; for their productions, <del
hand="JHL" rend="overstrike">instead</del> far from being terse, argumentative,
convincing, are without head or tail &amp; are generally an incongruous mass mixed up in the
most disgusting manner, without divisions or heads &amp; in short without a subject (so to
speak).</p>

FROM: Class Composition of J. Horace Lacy, [January 1851]. Lacy, James Horace, 1834-1852

<p> [. . .]But I still hope for &amp; trust in God and I believe he will animate our brave
defenders with a superhuman power and we will yet drive from our soil the hated invaders
whose tread <unclear reason="ink blot"></unclear> profanation, but this is an hour to try
men's souls—Fort Donelson has been taken by the enemy.  Frank was there and covered
himself with honor but his bravery cost him a wound; he was wounded in the leg slightly—a
flesh wound only, you must not be uneasy.  [. . .]</p>

FROM: Kimberly Family Personal Correspondence, 1862-1864. Transcript of the manuscript, UNC-Chapel Hill, Southern Historical Collection.
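
The examples above show <add>, <del>, and <unclear>. The remaining element, <gap>, might be used as in the following sketch, in which the text and the attribute values are invented for illustration:

```xml
<!-- Hypothetical transcription: one word cannot be read at all (gap),
     and another is a probable but uncertain reading (unclear) -->
<p>We reached the river on the <gap reason="illegible" quantity="1" unit="word"/> of March and
   camped there <unclear reason="faded">three</unclear> nights.</p>
```

The distinction is that <gap> marks material that is not transcribed at all, while <unclear> wraps a reading the encoder has supplied with some doubt.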

Names should be encoded using the <persName>, <placeName>, <geogName>, and <orgName> tags, with the "key" attribute providing a reference to an external file that manages name normalization and collects additional information such as biographical or geospatial data. The external TEI file maintains a unique entry for each name, grouped under <listPerson>, <listPlace>, or <listOrg> as appropriate, and each entry is uniquely identified by an "xml:id" attribute. The "key" value in the source file references the "xml:id" value in the external file. References to controlled vocabularies and national or local authority files can be signified by a prefix in the "xml:id" attribute (e.g., tgn_0000000 for the Getty Thesaurus of Geographic Names). When referencing a controlled vocabulary, be sure to specify this information in the <classDecl> section of the TEI Header.

<!--Place name tagging example in source file-->      
        <p>The first Jews arrived in <placeName key="tgn_7012924">Indianapolis</placeName> 
            in the middle of the 19th century. Primarily immigrants from <placeName key="tgn_7000084">
            Germany</placeName> and other points in central Europe (though many had lived elsewhere in the 
            <placeName key="tgn_7012149">United States</placeName> before they arrived in the city), 
            they were drawn from throughout the Midwest by the growth of commerce and rail lines in
            <placeName key="tgn_7012924">Indianapolis</placeName>.
        </p>

<!--External file for maintaining place name normalization and additional information.-->
<listPlace>
              <place xml:id="tgn_7012924">
                  <placeName><settlement type="city">
                      Indianapolis
                  </settlement></placeName>
              </place>
              <place xml:id="tgn_7000084">
                  <placeName><country>
                      Deutschland
                  </country></placeName>
              </place>
              <place xml:id="tgn_7012149">
                  <placeName><country>
                      United States
                  </country></placeName>
              </place>
 </listPlace>

<!--Personal and organizational name tagging example in source file-->      
 <figure n="VAA7662-004-001">
            <p>PRIZE LIBRARY GIFT-Indiana University President <persName key="lcnaf_82134365">Elvis J. Stahr</persName> (right), 
                a former law dean and practicing attorney, reminisces with Professor of Law 
                <persName key="lcnaf_00113347">W. Howard Mann</persName> as the two inspect some
                of the nearly 3,000 volumes of <orgName key="lcnaf_79006848">U.S. Supreme Court</orgName> records recently 
                transferred to I.U. from the <orgName key="lcnaf_79109178">Indiana Supreme Court Library</orgName>. 
                The collection, dating back to 1925, is one of the oldest and most complete sets in existence. </p>
</figure>

<!--External file for maintaining personal and organization name normalization and additional information.-->
<listPerson>
                <person xml:id="lcnaf_82134365">
                    <persName>
                        <surname>Stahr</surname>, <forename type="first">Elvis</forename> 
                        <forename type="middle">J.</forename> 
                    </persName>
                    <birth when="1916">1916</birth>
                </person>
                <person xml:id="lcnaf_00113347">
                    <persName>
                        <surname>Mann</surname>, <forename type="first">W.</forename> 
                        <forename type="middle">Howard</forename>
                    </persName>
                </person>
</listPerson>

 <listOrg>
                <org xml:id="lcnaf_79006848">
                    <orgName>
                        United States. Supreme Court
                    </orgName>
                </org>
                <org xml:id="lcnaf_79109178">
                    <orgName>
                        Indiana. Supreme Court
                    </orgName>
                </org>
 </listOrg>
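
The controlled vocabularies behind the tgn_ and lcnaf_ prefixes used above could be declared in the TEI Header along the following lines. This is a sketch only: the xml:id values and the use of <bibl> to name each vocabulary are assumptions, and local practice may call for fuller citations:

```xml
<teiHeader>
   <fileDesc>[title, publication, and source statements]</fileDesc>
   <encodingDesc>
      <classDecl>
         <!-- One taxonomy per controlled vocabulary or authority file -->
         <taxonomy xml:id="tgn">
            <bibl>Getty Thesaurus of Geographic Names</bibl>
         </taxonomy>
         <taxonomy xml:id="lcnaf">
            <bibl>Library of Congress Name Authority File</bibl>
         </taxonomy>
      </classDecl>
   </encodingDesc>
</teiHeader>
```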

Level 4 Front and Back Matter

The use of the <titlePage> element with appropriate child elements describing the major features of most title pages is required. The child elements are listed in Section 4.6 "Title Pages".

<titlePage> should include the verso if present, divided by <pb n="verso"/>.
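
A minimal sketch of a title page with its verso, using the Longstreet title cited under Level 4 Letters. The division into <titlePart>, <byline>, and <docImprint>, and the content of the verso, are illustrative assumptions rather than a transcription of the actual title page:

<titlePage>
    <docTitle>
        <titlePart type="main">Master William Mitten:</titlePart>
        <titlePart type="sub">or, A Youth of Brilliant Talents, Who Was Ruined by Bad Luck.</titlePart>
    </docTitle>
    <byline>By <docAuthor>Augustus Baldwin Longstreet</docAuthor></byline>
    <docImprint>
        <pubPlace>Macon, Ga.</pubPlace>:
        <publisher>Burke, Boykin</publisher>,
        <docDate when="1864">1864</docDate>
    </docImprint>
    <pb n="verso"/>
    <!-- Hypothetical verso content; transcribe what is actually printed. -->
    <docImprint>[printer's statement or copyright notice]</docImprint>
</titlePage>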

Frontispieces should be encoded as a <figure> inside a <p>, within a separate division (numbered or unnumbered, depending on the general editorial decision for a specific encoding project).
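
As a rough sketch of the frontispiece recommendation, assuming an unnumbered division, a locally chosen type= value, and a hypothetical image file name:

<div type="frontispiece">
    <p>
        <figure>
            <!-- Hypothetical file name -->
            <graphic url="frontispiece.jpg"/>
            <head>[caption as printed]</head>
            <figDesc>[brief description of the image]</figDesc>
        </figure>
    </p>
</div>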

Tables of contents, errata, subscription lists, and lists of "other titles by the same author" should each be encoded in a separate division (numbered or unnumbered, depending on the general editorial decision for a specific encoding project), as a <list> with <item>s.
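
A table of contents encoded this way might look like the following sketch. The type= value, the chapter identifiers, and the page numbers are hypothetical:

<div type="contents">
    <head>Contents</head>
    <list type="simple">
        <!-- Each <ref> points to the xml:id of the corresponding chapter division -->
        <item>Chapter I <ref target="#ch01">1</ref></item>
        <item>Chapter II <ref target="#ch02">15</ref></item>
    </list>
</div>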
It is recommended that all prefaces, tables of contents, afterwords, appendices, endnotes and apparatus be encoded.
For publisher's advertisements, indexes, and glossaries or other front or back matter that are not considered of primary importance to the text, we propose three options:
- Fully transcribe and encode
- Link to page images (may not include an encoded transcription)
- Fully omit and note the omission in <editorialDecl>
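
The second and third options could be sketched as follows. The page number, the image path, and the wording of the declaration are assumptions made for illustration:

<!-- Option 2: a page break linked to its page image, with no transcription -->
<pb n="301" facs="images/00000301.tif"/>

<!-- Option 3: the omission recorded in the TEI header -->
<editorialDecl>
    <p>The publisher's advertisements at the end of the volume have not been transcribed.</p>
</editorialDecl>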

Level 4 Letters

Encoding letters that occur within the text body was difficult in P4. The <floatingText> element introduced in TEI P5 offers a cleaner way to encode letters or any "independent text which interrupts the text containing it at any point but after which the surrounding text resumes" (see Section 4.3.2, Floating Texts). It is recommended that quoted letters that occur as part of a text (as opposed to collections of letters) be encoded within <floatingText> <body> <div1 type="letter">, with <opener>, <dateline>, <salute>, <signed>, <closer>, and <postscript> included as appropriate.

<p>She opened and read as follows:</p>
          <floatingText>
              <body>
                <div1 type="letter">
                  <opener>
                    <dateline>AUGUSTA, March 4th, 18—</dateline>
                    <salute>
                      <hi rend="italics">Mrs. A. Mitten:</hi>
                    </salute>
                  </opener>
                  <p>"Having recently understood that you have procured a private teacher, we have
ventured to stop your advertisement, <hi rend="italics">though ordered to continue it
until forbid,</hi> under the impression that you have probably forgotten to have it
stopped. If, however, we have been misinformed, we will promptly resume the
publication of it. You will find our account below; which as we are much in want of
funds, you will oblige us by settling as soon as convenient. Hoping your teacher is
all that you could desire in one,</p>
                  <closer>
                    <salute>"We remain, your ob't. serv'ts,</salute>
                    <signed>"H—&amp; B—"</signed>
                  </closer>
                </div1>
              </body>       
         </floatingText>

FROM: Augustus Baldwin Longstreet, 1790-1870. Master William Mitten: or, A Youth of Brilliant Talents, Who Was Ruined by Bad Luck. Macon, Ga.: Burke, Boykin, 1864.

Level 4 Drama

Within the front matter (<front>) of a performance text, cast lists should be encoded as <castList>, with each item in the list encoded as a <castItem>.
If desired (though not required), each <castItem> can be uniquely identified with an xml:id attribute.

From Shakespeare's King Lear:

<front>
    <castList><head>Dramatis Personae</head>
        <castItem xml:id="kllear">LEAR king of Britain</castItem>
        <castItem xml:id="klfrance">KING OF FRANCE</castItem>        
        <castItem xml:id="klburgundy">DUKE OF BURGUNDY</castItem>
        <castItem xml:id="klcornwall">DUKE OF CORNWALL</castItem>
        <castItem xml:id="klalbany">DUKE OF ALBANY</castItem>
        <castItem xml:id="klkent">EARL OF KENT</castItem>
        <castItem xml:id="klgloucester">EARL OF GLOUCESTER</castItem>
        <castItem xml:id="kledgar">EDGAR son to Gloucester.</castItem>
        <castItem xml:id="kledmund">EDMUND bastard son to Gloucester.</castItem> 
        [. . .]
    </castList>
</front>

Within the body of performance texts, speeches are encoded as <sp>, with speakers identified by the <speaker> element, which is a child of <sp>.
Stage directions are encoded as <stage> and enclose block-level content describing scenery, etc.
When encoding the speech content itself, use elements and attributes that correspond to the type of dramatic speech presented (e.g. <p> for prose speech, with <lb> to mark a new line in a particular edition of the text, or <lg> and <l> for dramatic verse structures).
If referencing the xml:id defined in the <castList> is desired, use the who attribute on <sp>. In P5 its value is a pointer: the xml:id preceded by a number sign.

Again, from King Lear:

<div type="act" n="1">
    <head>Act 1</head>
    <div type="scene" n="1">
        <head>Scene 1</head>
           <stage>King Lear's palace.</stage>
           <stage>Enter KENT, GLOUCESTER, and EDMUND</stage>

           <sp who="#klkent">
               <speaker>KENT</speaker>
               <p>I thought the king had more affected the Duke of<lb/>
               Albany than Cornwall.</p>
           </sp>
           <sp who="#klgloucester">
               <speaker>GLOUCESTER</speaker>
               <p>It did always seem so to us: but now, in the<lb/>
               division of the kingdom, it appears not which of<lb/>
               the dukes he values most; for equalities are so<lb/>
               weighed, that curiosity in neither can make choice<lb/>
               of either's moiety.</p>
           </sp>
           <sp who="#klkent">
                <speaker>KENT</speaker>
                <p>Is not this your son, my lord?</p>
           </sp>
          [. . .]
     </div>
</div>
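
For verse speech, the same structure applies with <l> (optionally grouped in <lg>) in place of <p>. A brief sketch from later in the same scene; the speaker label is an assumption about how the edition prints it:

<sp>
    <speaker>KING LEAR</speaker>
    <!-- One <l> per verse line, so no <lb/> is needed -->
    <l>Meantime we shall express our darker purpose.</l>
    <l>Give me the map there. Know that we have divided</l>
    <l>In three our kingdom: and 'tis our fast intent</l>
    <l>To shake all cares and business from our age;</l>
</sp>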

Level 4 Oral History


Speakers in oral history interviews, i.e. interviewee(s) and interviewer(s), can be identified in the <teiHeader> in several ways:
- In the <profileDesc>, in the <particDesc>, using the <list> element, with <name> inside of <item>s.
- As a list of author <name>s within <fileDesc> / <titleStmt>

In either method, use an xml:id= on the <name> element to uniquely identify each participant.
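
The first method might be sketched as follows. Wrapping the <list> in a <p> is an assumption made here so that it fits the content model of <particDesc>; the identifiers are illustrative, and the list would be given either here or in the body, not both, since each xml:id must be unique within the document:

<profileDesc>
    <particDesc>
        <p>
            <list type="simple">
                <item><name xml:id="spk1" type="interviewee">William C. Friday</name>, interviewee</item>
                <item><name xml:id="spk2" type="interviewer">William Link</name>, interviewer</item>
            </list>
        </p>
    </particDesc>
</profileDesc>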

The interview's participants can also be listed within the body of the interview (see the example below).
Questions and answers from interviewers and interviewees are encoded as <sp> with a who= attribute whose value points to the xml:id= of the speaker in the list of interview participants; the speaker's label is recorded in a <speaker> element.

<list type="simple">
<head>Interview Participants</head>
   <item>
   <name xml:id="spk1" key="wf" reg="Friday, William C." type="interviewee">WILLIAM C. FRIDAY
   </name>, interviewee
   </item>
   <item>
   <name xml:id="spk2" key="wl" reg="Link, William" type="interviewer">WILLIAM LINK</name>, interviewer
   </item>
</list>

[. . . ]

<sp who="#spk2">
  <speaker n="2">WILLIAM LINK:</speaker>
     <p>Last time we were talking about Frank Porter Graham. And I have a couple of questions
about Graham, and I wonder if you could clear them up for me. You have mentioned that you
had worked with him as a student at North Carolina State, had you met him before?
     </p>
</sp>
<sp who="#spk1">
   <speaker n="1">WILLIAM C. FRIDAY: </speaker>
     <p>No. That budget hearing was the first that I knew of him, of course, but the first time that I ever encountered him. I was president of class at N.C. State, and that through me into this kind of public adventure. And so I went merrily on downtown and sat there in the budget hearing, along with the president of the student body, and some others. 
     </p>
</sp>

One approach to synchronizing audio and transcript, introduced in "Oral Histories of the American South", uses <milestone> with a timestamp attribute:

<milestone n="7248" unit="empty" type="stop" timestamp="00:08:54"/>

Level 4 Verse


All verse, even poems without separate stanzas or verse paragraphs, should be contained within a line group element <lg>. This will assist with automated processing and retrieval.
It is common to see informal divisions within poems, marked by a string of asterisks or periods. These should be encoded as <milestone>s with unit="typography" and an n= value recording the characters used and how many times they occur, e.g. <milestone unit="typography" n="******"/>.
<l>: It is recommended that indentation be recorded, using the rend attribute.
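
Taken together, these recommendations might produce markup like the sketch below. The lines are from Wordsworth's "I wandered lonely as a cloud"; the indentation, the row of asterisks, and the type= values are invented here purely to illustrate the markup:

<lg type="poem">
    <lg type="stanza">
        <l>I wandered lonely as a cloud</l>
        <!-- Hypothetical indentation, recorded with a rendition ladder -->
        <l rend="indent(1)">That floats on high o'er vales and hills,</l>
    </lg>
    <!-- Hypothetical typographic division: six asterisks -->
    <milestone unit="typography" n="******"/>
    <lg type="stanza">
        <l>Continuous as the stars that shine</l>
        <l rend="indent(1)">And twinkle on the milky way,</l>
    </lg>
</lg>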

LEVEL 5: Scholarly Encoding Projects

Level 5 texts are those that require subject knowledge, and encode semantic, linguistic, prosodic, or other elements beyond a basic structural level.
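
As one illustration of encoding beyond the structural level, a prosodic analysis might record meter and rhyme. The metrical notation in met= and the rhyme labels below are assumptions made for this sketch; a project would document its own notation in the TEI header:

<lg type="stanza" met="-+|-+|-+|-+" rhyme="ababcc">
    <l>I wandered lonely as a <rhyme label="a">cloud</rhyme></l>
    <l>That floats on high o'er vales and <rhyme label="b">hills</rhyme>,</l>
    <l>When all at once I saw a <rhyme label="a">crowd</rhyme>,</l>
    <l>A host, of golden <rhyme label="b">daffodils</rhyme>;</l>
    <l>Beside the lake, beneath the <rhyme label="c">trees</rhyme>,</l>
    <l>Fluttering and dancing in the <rhyme label="c">breeze</rhyme>.</l>
</lg>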


General Guidelines for Attribute Usage

Some general advice on the use of particular attributes follows.

type=: Constructing a list of acceptable attribute values for type that could find wide agreement is impossible. Instead, it is recommended that projects describe the type= attribute values used in their texts in the project ODD file and that this list be made available to people using the texts. It is worth noting that, at present, Roma, the web front-end editor for ODD files, does not have a mechanism for providing this documentation — it should be added to the ODD file directly. For a list of standard names and definitions of bibliographic features of printed books, see ABC for Book Collectors by John Carter (8th edition, New Castle, Del. and London: Oak Knoll Books and the British Library, 2004, available online at http://www.ilab.org/images/abcforbookcollectors.pdf). For those elements where type is not required, such as <head> and <title>, use the attribute values for subtitles and additional titles, but not main titles.
Example: <div1 type="volume">
n=: Sometimes an n= (number) attribute can be used by itself, for instance in the case of page breaks:
Example: <pb n="456"/>
xml:id=: If you need to uniquely identify an element so that it can be referenced from another location in one or more texts, use an @xml:id attribute. The value of this attribute must be unique within a document, must be composed of alphanumeric characters, dots, hyphens, and underscores, and must start with a letter.
Example: <note xml:id="n5" n="5">
target=: A URI. May be used to point internally within the same document by preceding the value of the target element's @xml:id with a number sign (U+0023). For example, in the case of footnotes, an <anchor xml:id="n5" n="5"/> marks a specific place in the text and is referred to by a <note target="#n5" n="5">, located elsewhere, which contains the content of the footnote.
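
The footnote case can be sketched with placeholder text as follows:

<p>[running text]<anchor xml:id="n5" n="5"/> [text continues]</p>

<!-- Elsewhere in the document, e.g. in a division of notes -->
<note target="#n5" n="5">[text of the footnote]</note>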

rend=: Difficulty using rend= attributes occurs when it is desirable to record more than one rendition feature. With this in mind, it is recommended that projects employ the following adaptation of “rendition ladders”, a concept developed at the Brown University Women Writers Project. This system allows for sets of multiple renditional features to be included in one rend= value. Rendition ladders consist of categories of renditional features with values of each of those features enclosed in parentheses.
rend= should only be used to override a default value. For instance, if all text encoded as <hi> is defined as being rendered in italics, there is no reason to encode text as <hi rend="slant(italics)">. Combining renditional features results in an element with attributes such as <l rend="slant(italics)align(right)">.

keyword	some possible values
`slant`	`italics` or `upright`
`weight`	`bold` or `normal`
`case`	`allcaps` or `lower` or `smallcaps` or `upper` or `mixed`
`align`	`left` or `right` or `center` or `centre` or `inside` or `outside` or `none`
`indent`	a number
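
A few illustrative combinations of these keywords, with placeholder content:

<hi rend="weight(bold)">[bold text]</hi>
<head rend="case(allcaps)align(center)">[centered heading in capitals]</head>
<l rend="slant(italics)indent(2)">[italic line indented by two units]</l>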

xml:lang=: Used to indicate the natural language of the content of an element. Generally not used at levels 1 or 2. At levels 3 or 4 should contain the appropriate language subtag; further subtags (script, region, variant, extension, private use) are rarely appropriate. If only the language subtag is present (the most common case), a corresponding <language> element should not be present in the TEI header. If a private use subtag is present, a corresponding <language> element must be present in the TEI header. See the TEI documentation.
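
A sketch of both cases, with placeholder content; the private use subtag "x-dialect" is hypothetical:

<!-- Language subtag only: no <language> element is needed in the header -->
<p>[English text] <foreign xml:lang="fr">[French phrase]</foreign> [English text]</p>

<!-- Private use subtag: used in the text as xml:lang="en-x-dialect" and declared in the TEI header -->
<langUsage>
    <language ident="en-x-dialect">[description of the locally defined variety]</language>
</langUsage>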


Appendix A: History of this Document

The Text Encoding Initiative Guidelines for Electronic Text Encoding and Interchange (referred to as the TEI Guidelines) were first published in 1994 and represent a tremendous achievement in electronic text standards by providing a highly sophisticated structure for encoding electronic text. Digital librarians have benefited greatly from the standardization provided by these guidelines, and the potential for interoperability and long-term preservation of digital collections facilitated by their wide adoption.

In 1998, the Digital Library Federation (DLF) sponsored the TEI and XML in Digital Libraries Workshop at the Library of Congress to discuss the use of the TEI Guidelines in libraries for electronic text, and to create a set of best practices for librarians implementing them. From this workshop, three working groups were formed, whose members represented some of the largest and most mature digital library programs in the U.S.

Group 1 was charged to recommend some best practices for TEI header content and to review the relationship between the Text Encoding Initiative header and MARC. To this end, representatives of the University of Virginia Library and the University of Michigan Library gathered in Ann Arbor in early October 1998 to develop a recommended practice guide. This work was assisted by similar efforts that had taken place in the United Kingdom under the auspices of the Oxford Text Archive the previous year. The section on the header is based on a draft of those recommended practices. It was submitted to various constituencies for comment. In 2008 and 2009, it was heavily revised by Melanie Schlosser, Kevin Hawkins, and other members of the TEI SIG on Libraries.

Group 2 was charged with developing a set of recommendations for libraries using the TEI Guidelines in electronic text encoding. This group included the following representatives from six libraries:

LeeEllen Friedland, Library of Congress
Nancy Kushigian, University of California, Davis
Christina Powell, University of Michigan
David Seaman, University of Virginia
Natalia Smith, University of North Carolina at Chapel Hill
Perry Willett, Indiana University (chair)

At the ALA mid-winter (January 1999), the DLF task force revised a draft set of best practices, called TEI Text Encoding in Libraries: Guidelines for Best Practices (referred to as TEI in Libraries Guidelines). The revised recommendations were circulated to the conference working group in May 1999 and presented at the joint annual meeting of the Association for Computers and the Humanities and the Association for Literary and Linguistic Computing in June 1999. Version 1.0 was circulated for comments in August 1999. These guidelines were endorsed by the DLF, and have been used by many digital libraries, including those of the task force members, as a model for their own local best practices. Libraries, museums and end-users have benefitted from a set of best practices for electronic text in a number of ways, including better interoperability between electronic text collections, better documented practices among digital libraries, and a starting point for discussion of best practices with commercial publishers regarding electronic text creation.

Written in 1998, this first iteration of TEI in Libraries Guidelines made no mention of XML, XSLT, or any of the other powerful tools that have now become common parlance and practice in creating digital documents and collections. Based on these important changes in markup technology, it came to the attention of the DLF and members of the original Task Force that the TEI in Libraries Guidelines required substantial revision. In 2002, the TEI Consortium published a new edition of the complete TEI Guidelines that conformed to XML specifications. In order to remain useful, the TEI in Libraries Guidelines had to be updated to reflect these developments.

Furthermore, librarians need more guidance than the original TEI in Libraries Guidelines provided. There are many library-specific encoding issues which need to be addressed and documented to ensure consistency. The intention of this document is to provide recommended paths of encoding for these issues.

In addition, these library guidelines have the potential to be much more useful if they can serve as a training document from which librarians can learn about text encoding and addressing particular encoding challenges. To fulfill this role, the guidelines require more examples and detailed explanations, giving documentation of the use of TEI in a library context. Librarians also need a set of standards and best practices for vendors and publishers who create electronic text for digital libraries, so that these collections adhere to the same archival standards as locally-created electronic text collections. With detailed guidelines that could serve as an encoding specification, librarians might encourage vendors to follow the principles in these standards, to facilitate the long-term preservation of commercially published electronic text collections, and more readily allow for cross-collection searching.

In order to facilitate the evolution of this document, another DLF-sponsored Task Force—some of the representatives of which were on the original Task Force—met on October 24-25, 2003 at the Cosmos Club in Washington, D.C.:

Richard Gartner, Oxford University Library
Matthew Gibson, University of Virginia Library
Kirk Hastings, California Digital Library
Christina Powell, University of Michigan
Merrilee Proffitt, RLG
David Seaman, Digital Library Federation
Natalia Smith, University of North Carolina at Chapel Hill
Perry Willett, Indiana University (chair)

These representatives met to revise the original TEI in Libraries Guidelines in order that they:

- reflect changes occurring within the text encoding world generally and within the TEI community specifically
- further illuminate the different levels of encoding by offering clearer and more robust examples.

After producing Version 2.0 of the Guidelines, this group (with some changes in membership) met again at the Cosmos Club on February 13-14, 2006. Those in attendance were:

Syd Bauman, The TEI Consortium
Richard Gartner, Oxford University Library (by phone)
Matthew Gibson, Virginia Foundation for the Humanities (chair)
Chris Powell, The University of Michigan
Merrilee Proffitt, RLG
David Seaman, Digital Library Federation
Natasha Smith, University of North Carolina at Chapel Hill
Perry Willett, The University of Michigan
