Abbot

From TEIWiki

Synopsis

Abbot coordinates two phases of text preparation:

1. Normalization of XML-like text collections into TEI-A, an XML format designed to facilitate corpus-based text analysis.

2. Validation of the converted files.

Features

The first phase of the conversion reads the DTD or schema of the target collection and uses the information found there to generate a customized stylesheet that can effect the conversion from the target to TEI-A. This method, which we call "schema harvesting," is remarkably robust, but it cannot perform miracles. If your texts do not parse, or if they contain many irregular constructions, you will probably need to do some pre-processing of your own before handing them to Abbot.
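To give a feel for the kind of information a DTD exposes, here is a minimal sketch that lists the element names a DTD declares. It is not Abbot's code, and Abbot's actual harvesting step (which produces a stylesheet) is not shown in this page; the function name list_dtd_elements is hypothetical, and the pattern only handles plain <!ELEMENT name ...> declarations, not names supplied through parameter entities.

```shell
#!/usr/bin/env bash
# Hypothetical helper, not part of Abbot: print the element names
# declared in a DTD, one per line, sorted and de-duplicated.
list_dtd_elements() {
  local dtd_file="$1"
  # Find each "<!ELEMENT name" prefix, then strip the keyword so that
  # only the element name remains.
  grep -oE '<!ELEMENT[[:space:]]+[^[:space:]>(]+' "$dtd_file" \
    | sed -E 's/^<!ELEMENT[[:space:]]+//' \
    | sort -u
}
```

Called as list_dtd_elements my-collection.dtd (a hypothetical file name), it prints one element name per line. A list like this is the sort of raw material from which per-element conversion rules could be generated.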

Abbot is set up as a pipeline (with abbot.sh as the main controlling file). If you look in that file, you'll see references to a series of modular shell scripts, most of which perform quick corrections on the converted files. We used Abbot to convert some very large and very well-known text collections, so these scripts contain adjustments for common errors and irregularities (including some that are unavoidably introduced through schema harvesting). You may find it useful to use these scripts as a guide, adding to the pipeline and adjusting the existing scripts for your own circumstances.
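If you plan to add your own steps, the general shape of such a pipeline can be sketched as follows. This is an illustration only and does not reproduce abbot.sh: the function run_pipeline and the two correction steps are hypothetical names, and the sketch assumes each step reads a file on standard input and writes the corrected text to standard output.

```shell
#!/usr/bin/env bash
# Hypothetical sketch of a correction pipeline; not taken from abbot.sh.
# Usage: run_pipeline FILE STEP [STEP...]
run_pipeline() {
  local file="$1"
  shift
  local step
  for step in "$@"; do
    # Run the step into a temporary file so that a failing step
    # never leaves a half-written original behind.
    if ! "$step" < "$file" > "$file.tmp"; then
      rm -f "$file.tmp"
      echo "step failed: $step" >&2
      return 1
    fi
    mv "$file.tmp" "$file"
  done
}

# Two hypothetical "quick correction" steps.
strip_carriage_returns() { tr -d '\r'; }
trim_trailing_whitespace() { sed 's/[[:space:]]*$//'; }
```

For example, run_pipeline converted.xml strip_carriage_returns trim_trailing_whitespace applies both corrections in order. Keeping each correction in its own small step makes it easy to add, drop, or reorder steps for a particular collection.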

The second phase involves validation of the converted files against the TEI-A schema using Sun's Multi-Schema XML Validator (MSV).

Running Abbot is simply a matter of:

abbot.sh [target_dir]

Abbot will write the results to the "output" directory. Invalid files are sent to the "quarantine" directory for review.
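The output/quarantine split can be pictured with the sketch below. It is a hedged illustration, not Abbot's implementation: sort_converted_files is a hypothetical name, and the validator is passed in as a command because this page does not show how Abbot invokes MSV. The sketch assumes only that the validator takes a file name as its last argument and exits with status 0 when the file is valid.

```shell
#!/usr/bin/env bash
# Hypothetical sketch, not Abbot's code: move each converted file to an
# output directory if a validator accepts it, otherwise to quarantine.
# Usage: sort_converted_files SRC_DIR OUT_DIR QUARANTINE_DIR VALIDATOR [ARGS...]
sort_converted_files() {
  local src_dir="$1" out_dir="$2" quarantine_dir="$3"
  shift 3
  mkdir -p "$out_dir" "$quarantine_dir"
  local file
  for file in "$src_dir"/*.xml; do
    # If nothing matched, the glob is left as a literal string; skip it.
    [ -e "$file" ] || continue
    # "$@" is the validator command; its exit status decides the destination.
    if "$@" "$file" > /dev/null 2>&1; then
      mv "$file" "$out_dir"/
    else
      mv "$file" "$quarantine_dir"/
    fi
  done
}
```

As a stand-in that checks well-formedness only, and not validity against TEI-A, one could try sort_converted_files converted output quarantine xmllint --noout, where "converted" is a hypothetical directory of converted files.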

Please note that Abbot is self-contained and ships with all the necessary libraries. You should run it from the abbot directory and leave everything as it is.


User commentary

Please sign all comments. (please leave the above note about signing comments, and add signed comments here below it)

System requirements

(type in that information here)

Source code and licensing

http://monkproject.org/license.html

Support for TEI

(Does it support TEI or TEI Lite "out of the box"?) (How easily can TEI be implemented?) (Are there customized versions of the tool created for the TEI community, perhaps even by those not affiliated with the tool's creators?)

Language(s)

All of the code for Abbot is written using a combination of (bash) shell, Java, and XSLT. It was designed to run on UNIX-like systems and avails itself of a number of standard UNIX utilities (such as those found in the GNU coreutils package).

We would hesitate to run it using a version of Java earlier than 1.5. We would also hesitate to run it on a system that did not have a fast processor and at least 8 GB of RAM. Some text collections can take many hours to convert even with the latest server hardware. Your mileage may vary. A lot.


Documentation

http://monkproject.org/downloads/abbot/

Tech support

User community

  • Brian Pytlik-Zillig -- bpytlikz@unlnotes.unl.edu
  • Stephen Ramsay -- sramsay.unl@gmail.com
  • Martin Mueller -- martinmueller@northwestern.edu

Sample implementations

(links to demo sites running the tool or successful implementations of it)

Current version number and date of release

(type in that information here)

History of versions

(type in that information here)

How to download or buy

http://monkproject.org/downloads/

Additional notes

(type in that information here)
