TEI Libraries SIG Manifesto

From TEIWiki

This is our draft of a "manifesto" setting out reasons why library administrators should consider supporting TEI. The SIG on Libraries welcomes all comments, edits, and suggested changes!

Draft

Libraries have long digitized analog materials in order to provide greater access to the content[1] and, increasingly, to preserve and enrich these materials. The Text Encoding Initiative (TEI) Guidelines,[2] developed by scholars and libraries and first published in 1994, provide a standard, non-proprietary format for enhancing digitized textual documents that promotes preservation and future reuse. Libraries have used the TEI Guidelines from their earliest incarnation, yet advances in digital technologies have made scanning and the application of optical character recognition (OCR) to page images the preferred means of mass digitization.

Scanning and OCR are not the right tools for all types of documents, nor for all scholarly uses of the content. For example:

  • Some source documents are too fragile even for overhead (planetary) scanners.
  • OCR technology works best on modern printed documents. It cannot yet reliably decipher documents degraded by faded ink or typeface bleeding, documents set in obscure fonts, or nearly any handwriting.
  • Full-text searching of OCR text and even n-gram browsers are insufficient for searching reference works by headword, for navigating canonical works by chapter and verse, for phrasal searching in multi-column documents, and for supporting fine-grained searching, analysis, and visualization of texts.

For such documents, there is still value for preservation and for support of scholarly research in transcription and analytic encoding of the text of a document. Such work need not be done manually or even in house at all; instead, this work is often outsourced to vendors, who encode to a specification,[3] while quality assurance and sometimes more detailed encoding are done in a library or institution under the direction of a scholar.
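
As a rough illustration of what analytic encoding makes possible, the sketch below contrasts full-text searching with headword searching on a tiny dictionary fragment. The `entry`, `form`, `orth`, `sense`, and `def` elements come from the TEI dictionary module; the two sample entries, the function names, and the omission of the required `teiHeader` are assumptions made to keep the example short, not a recommended encoding.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
NS = {"tei": TEI_NS}

# Invented sample data: a fragment only (a real TEI file also needs a teiHeader).
SAMPLE = """
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <text>
    <body>
      <entry>
        <form><orth>folio</orth></form>
        <sense><def>A leaf of a manuscript or book.</def></sense>
      </entry>
      <entry>
        <form><orth>quire</orth></form>
        <sense><def>A set of leaves, each of which may be called a folio.</def></sense>
      </entry>
    </body>
  </text>
</TEI>
"""


def full_text_hits(root, term):
    """Headwords of all entries that mention `term` anywhere in their text."""
    hits = []
    for entry in root.iterfind(".//tei:entry", NS):
        text = "".join(entry.itertext()).lower()
        if term.lower() in text:
            hits.append(entry.findtext("tei:form/tei:orth", namespaces=NS))
    return hits


def headword_hits(root, term):
    """Headwords that match `term` exactly, ignoring case."""
    hits = []
    for orth in root.iterfind(".//tei:entry/tei:form/tei:orth", NS):
        word = (orth.text or "").strip()
        if word.lower() == term.lower():
            hits.append(word)
    return hits


root = ET.fromstring(SAMPLE)
print(full_text_hits(root, "folio"))  # every entry that mentions the word
print(headword_hits(root, "folio"))   # only the entry whose headword it is
```

With plain OCR text, only the first kind of search is available; the second depends on the headword having been marked up as such.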

The standard format used by scholars and supported by funding agencies is TEI. While the TEI Guidelines are exhaustive in coverage, users are meant to choose a subset of features for a particular need. The Best Practices for TEI in Libraries provides an indispensable and concise guide for common library applications of TEI, offering a clear path for libraries to create extensible, repurposable digital surrogates of items in a library's collections, transcribed and tagged to an appropriate level of detail. The TEI's SIG on Libraries provides a venue for sharing initiatives, methodologies, and workflows in order to strengthen the use of TEI in libraries and to help promote libraries' support of TEI at their institutions.

TEI text encoding is often carried out in digital humanities centers, though the leading DH centers have strong ties to libraries, and DH practitioners have long proclaimed the importance of having librarians involved in their projects.

Librarians bring great expertise to the use of TEI, and institutions should support librarians working in tandem with scholars when the use of TEI is called for.

References

  1. http://www.clir.org/pubs/abstract//reports/pub80-smith
  2. http://www.tei-c.org/Guidelines/
  3. TEI member institutions have access to a discount rate on encoding through the AccessTEI program.

Outline

  • Preamble
  • Rationale for digitizing (in one sentence): why we make things digital … first principles. Reference CLIR docs.
  • For different intended uses of content, different types of access mechanisms are needed.
  • While mass digitization meets many common needs, it’s insufficient for certain purposes, such as:
    • Things that don’t OCR (especially manuscripts and early printed works) and/or are illegible or hard to read in the page image
    • Source documents that can’t be scanned because they’re too fragile
    • Reference works where you want to be able to search on a headword
    •  ????
  • For such things, we still need a non-proprietary format for representing a digital surrogate of the item that is designed for:
    • long-term preservation
    • data curation
    • interchange
  • And which will enable:
    • visualization
    • analysis
  • For textual content (but not data sets or purely tabular data published in print), the obvious choice is TEI.
  • TEI encoding can be scoped: you don’t have to (and shouldn’t!) use all of its features.
    • TEI encoding can be implemented in stages. You can take mass-digitized content and apply a light level of encoding on top of it; if a use case later arises that merits more encoding, you can do further work.
  • Encoding is often outsourced, including through AccessTEI. If your project calls for richer encoding, you can enrich the outsourced data by doing the higher-level encoding in house.
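
The staged approach in this outline might look roughly like the following sketch. Stage 1 wraps blank-line-separated OCR paragraphs in `p` elements; stage 2, done later if a use case calls for it, groups them in a `div` and retags the first paragraph as a `head`. The function names and sample text are hypothetical, the heading rule is a deliberately naive assumption, and the `teiHeader` that a real TEI document requires is left out for brevity.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ET.register_namespace("", TEI_NS)  # serialize without a namespace prefix


def q(tag):
    """Qualify a tag name with the TEI namespace."""
    return f"{{{TEI_NS}}}{tag}"


def light_encode(ocr_text):
    """Stage 1: wrap each blank-line-separated paragraph in <p>."""
    tei = ET.Element(q("TEI"))
    body = ET.SubElement(ET.SubElement(tei, q("text")), q("body"))
    for chunk in ocr_text.split("\n\n"):
        para = " ".join(chunk.split())  # collapse OCR line breaks
        if para:
            ET.SubElement(body, q("p")).text = para
    return tei


def add_structure(tei):
    """Stage 2: move paragraphs into a <div>; treat the first as its <head>."""
    body = tei.find(f"{q('text')}/{q('body')}")
    div = ET.Element(q("div"))
    for para in list(body):
        body.remove(para)
        div.append(para)
    if len(div):
        div[0].tag = q("head")  # naive assumption: first paragraph is a heading
    body.append(div)
    return tei


# Invented sample standing in for OCR output.
OCR_TEXT = "CHAPTER I\n\nIt was a dark\nand stormy night.\n\n"

tei = light_encode(OCR_TEXT)
print(ET.tostring(tei, encoding="unicode"))
add_structure(tei)
print(ET.tostring(tei, encoding="unicode"))
```

The point of the sketch is only that the second stage builds on the output of the first rather than starting over.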

Suggestions

  1. You may also want to include the argument, made by some library administrators, that TEI work is best left to digital humanities departments, not libraries. We hired a new library director two years ago who has made this argument. Our TEI work and instruction in the library have since been suspended. (R. Wisneski, Case Western Reserve University, Cleveland, OH)
    Yes, good point. Need to discuss how the most successful DH centers have institutional ties to libraries and how DH people have long praised the early involvement of librarians in projects. (Kshawkin 22:08, 29 December 2012 (EST))
  2. A list of libraries that already use TEI, with pointers to the sites they use it for and/or documentation of why they picked TEI. Stuartyeates 17:05, 14 February 2013 (EST)