Introduction: on automating tagging
When you begin learning text encoding, you might think that people transcribe a source document and then add all the tags into the document by hand.
Then you learn how to use an XML editor not only to validate your XML but also to save keystrokes when entering elements.
And then you learn about OCR software, which can save you effort in transcribing a source document, leaving you just to correct OCR errors.
And then you learn that there are vendors that will do some combination of scanning, OCR, and encoding documents (often through "double keyboarding" according to a vendor spec such as TEI Tite, which guarantees very high accuracy in transcription and encoding).
While vendors are generally not expected to add markup requiring specialized knowledge (this might be left to project staff), it is not a good use of resources to pay a vendor for basic structural tagging if that structure can be easily deduced from OCR output or from page images. Upconversion from a less structured format to a more structured one can sometimes be performed through rules-based techniques (like XSLT), but the data is usually inconsistent enough that several sets of rules may compete for the same content. Basic heuristic techniques that choose between competing rules can be used to make judgments about features on the page and derive markup from these. They are never perfect, but neither are human encoders.
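To make the idea of choosing between competing rules concrete, here is a minimal Python sketch. It is an illustration only, not the method of any tool listed on this page: the rules, their weights, and the function names classify and tag_lines are all hypothetical. Each rule votes for a label ("head" or "p") with a weight, and the label with the higher total wins.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical rules: (label voted for, weight, test on one line of OCR text).
# The weights are illustrative guesses, not values tuned on real data.
RULES = [
    ("head", 2.0, lambda line: line.isupper()),
    ("head", 1.0, lambda line: len(line.split()) <= 6),
    ("head", 1.0, lambda line: re.match(r"^(chapter|section)\b", line, re.I) is not None),
    ("p", 2.0, lambda line: line.rstrip().endswith((".", "?", "!", ",", ";"))),
    ("p", 1.0, lambda line: len(line.split()) > 12),
]

def classify(line):
    """Sum the weights of every rule that fires and return the winning label."""
    scores = {"head": 0.0, "p": 0.0}
    for label, weight, test in RULES:
        if test(line):
            scores[label] += weight
    # Ties fall back to the more common case, a paragraph.
    return "head" if scores["head"] > scores["p"] else "p"

def tag_lines(lines):
    """Wrap each non-empty line in <head> or <p>, escaping XML special characters."""
    tagged = []
    for line in lines:
        text = line.strip()
        if not text:
            continue
        label = classify(text)
        tagged.append(f"<{label}>{escape(text)}</{label}>")
    return tagged

if __name__ == "__main__":
    sample = ["CHAPTER I", "", "He left."]
    print("\n".join(tag_lines(sample)))
```

A short line such as "He left." triggers both the short-line rule (a vote for "head") and the final-punctuation rule (a vote for "p"); the heavier weight settles the conflict. In practice the weights would have to be adjusted against a hand-checked sample of the material.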
Furthermore, more advanced heuristics and information retrieval (IR) techniques can be used to identify named entities, topics, sentiment, tone, and complexity of a text, and IR techniques are frequently used for authorship attribution.
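As a deliberately simple illustration of deriving named-entity markup, the Python sketch below uses a gazetteer lookup, which is a much cruder technique than the heuristic and IR approaches mentioned above. The gazetteer entries and the function name tag_entities are hypothetical; persName and placeName are TEI element names.

```python
import re
from xml.sax.saxutils import escape

# Hypothetical gazetteer mapping known names to TEI element names.
GAZETTEER = {
    "Jane Austen": "persName",
    "London": "placeName",
    "Bath": "placeName",
}

def tag_entities(text, gazetteer=GAZETTEER):
    """Wrap every gazetteer name found in text in its TEI element."""
    if not gazetteer:
        return escape(text)
    # Try longer names first so a full name wins over a shorter overlapping entry.
    names = sorted(gazetteer, key=len, reverse=True)
    pattern = re.compile(r"\b(" + "|".join(re.escape(n) for n in names) + r")\b")
    pieces, pos = [], 0
    for match in pattern.finditer(text):
        # Escape the untouched text between matches, then emit the tagged name.
        pieces.append(escape(text[pos:match.start()]))
        tag = gazetteer[match.group(1)]
        pieces.append(f"<{tag}>{escape(match.group(1))}</{tag}>")
        pos = match.end()
    pieces.append(escape(text[pos:]))
    return "".join(pieces)

if __name__ == "__main__":
    print(tag_entities("Jane Austen moved from Bath to London."))
```

A lookup like this cannot tell the city of Bath from a capitalized "Bath" at the start of a sentence, which is exactly the kind of ambiguity that the more advanced techniques try to resolve from context.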
Below are resources that might be helpful for those looking to use heuristic and IR techniques to add markup to an XML document.
Adding structural markup
- various projects coming out of LAMP
- LA-PDFText (turns PDF into some sort of XML)
- scrapely -- HTML to JSON
- see “Ideas” section of http://scholrev.org/hackathon/
- see discussion of structuring OCR'd text on CODE4LIB
- Open Typesetting Stack (formerly the PKP XML Parsing Service) (source code, beta site) -- discerns structure to create various output formats