Upconversion

Introduction: on automating tagging
When you begin learning text encoding, you might think that people transcribe a source document and then add all the tags into the document by hand.

Then you learn how to use an XML editor to not only validate your XML but also save keystrokes on entering elements.

And then you learn about OCR software, which can save you effort in transcribing a source document, leaving you just to correct OCR errors.

And then you learn that there are vendors that will do some combination of scanning, OCR, and encoding documents (often through "double keyboarding" according to a vendor spec such as TEI Tite, which guarantees very high accuracy in transcription and encoding). Vendors are generally not expected to add markup requiring specialized knowledge (this might be left to project staff). Equally, it is not a good use of resources to pay a vendor for basic structural tagging if that structure can be easily deduced from OCR output or from page images. Basic heuristic techniques can be used to identify such features on the page and derive markup from them. They are never perfect, but neither are human encoders.
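
As a rough illustration of what such a basic heuristic might look like, the Python sketch below takes plain-text OCR output and guesses which blocks are headings and which are paragraphs. Everything in it is an assumption made for the example, not the method of any tool listed on this page: the function names `upconvert` and `looks_like_heading` are hypothetical, the rule "a single short line without closing punctuation is a heading" is only one plausible guess, and the `<div>`, `<head>`, and `<p>` elements are borrowed loosely from TEI.

```python
import re
from xml.sax.saxutils import escape

# Punctuation that usually ends running prose rather than a heading (an assumption).
SENTENCE_ENDINGS = (".", "!", "?", ",", ";", ":")


def looks_like_heading(block: str) -> bool:
    """Guess that a block is a heading: one short line with no closing punctuation."""
    lines = block.splitlines()
    return (
        len(lines) == 1
        and len(block) <= 60
        and not block.endswith(SENTENCE_ENDINGS)
    )


def upconvert(ocr_text: str) -> str:
    """Wrap blank-line-separated blocks of OCR text in <head> or <p> elements."""
    blocks = [b.strip() for b in re.split(r"\n\s*\n", ocr_text) if b.strip()]
    out = ["<div>"]
    for block in blocks:
        # Rejoin words hyphenated across line ends. This will also (wrongly)
        # join genuine compounds that happen to break at the hyphen.
        joined = re.sub(r"(\w)-\n(\w)", r"\1\2", block)
        # Collapse the remaining hard line breaks into single spaces.
        text = " ".join(joined.split())
        tag = "head" if looks_like_heading(block) else "p"
        # Escape &, <, and > so the output stays well-formed.
        out.append(f"  <{tag}>{escape(text)}</{tag}>")
    out.append("</div>")
    return "\n".join(out)


if __name__ == "__main__":
    sample = "CHAPTER I\n\nText encod-\ning is slow & careful\nwork.\n\nThe end."
    print(upconvert(sample))
```

A heuristic like this will misfire (a short line of verse or a caption would be tagged as a heading), so the output is best treated as a first pass to be validated and corrected in an XML editor.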

Furthermore, more advanced heuristics and information retrieval (IR) techniques can be used to identify the named entities, topics, sentiment, tone, and complexity of a text. IR techniques are also frequently used for authorship attribution.

Below are resources that might be helpful for those looking to use heuristic and IR techniques to add markup to an XML document.

Adding structural markup

 * GROBID
 * BILBO
 * ParsCit
 * pdf2xml
 * various projects coming out of LAMP
 * pdfx
 * LA-PDFText (turns PDF into some sort of XML)
 * Merops
 * scrapely -- HTML to JSON
 * pdf2htmlEX
 * see the “Ideas” section of http://scholrev.org/hackathon/

Identifying named entities

 * BILBO
 * Calais
 * SEASR

Identifying topics, sentiment, tone, and complexity

 * Calais