GROBID

Synopsis
"GROBID is a machine learning library for extracting, parsing and re-structuring raw documents such as PDF into structured TEI-encoded documents with a particular focus on technical and scientific publications." See a brief overview.

Features
According to the Grobid website:
 * Written in Java (with JNI call).
 * High performance - on a modern but low-profile MacBook Pro: header extraction from 4000 PDFs in 10 minutes, parsing of 3000 references in 18 seconds.
 * Modular and reusable machine learning models. The extractions are based on linear-chain Conditional Random Fields (CRF), which the project describes as the current state of the art in bibliographical information extraction and labeling.
 * Full encoding in TEI, both for the training corpus and the parsed results.
 * Reinforcement of extracted bibliographical data via an optional online call to Crossref, plus export in OpenURL and similar formats, for easier integration into Digital Library environments.
 * Rich bibliographical processing: fine-grained parsing of author names, dates, affiliations, addresses, etc., as well as quite reliable automatic attachment of affiliations to the corresponding authors.
 * "Automatic Generation" of pre-formatted training data from new PDF documents, to support semi-automatic training data generation.
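
To give a feel for how a linear-chain model labels a sequence of tokens, the sketch below runs Viterbi decoding, the standard way to find the best label sequence in such a model. It is not GROBID's code: the class name, the two-label set (AUTHOR, TITLE) and all scores are invented for illustration, whereas a trained CRF would derive its scores from learned feature weights.

```java
import java.util.Arrays;

class ViterbiSketch {
    // Finds the highest-scoring label sequence for a linear-chain model.
    // emit[t][y]  : score of label y at position t (toy values, an assumption)
    // trans[a][b] : score of moving from label a to label b (toy values)
    static int[] decode(double[][] emit, double[][] trans) {
        int n = emit.length;
        int k = trans.length;
        double[][] best = new double[n][k]; // best score of any path ending in y at t
        int[][] back = new int[n][k];       // previous label on that best path
        best[0] = emit[0].clone();
        for (int t = 1; t < n; t++) {
            for (int y = 0; y < k; y++) {
                int arg = 0;
                for (int p = 1; p < k; p++) {
                    if (best[t - 1][p] + trans[p][y] > best[t - 1][arg] + trans[arg][y]) {
                        arg = p;
                    }
                }
                back[t][y] = arg;
                best[t][y] = best[t - 1][arg] + trans[arg][y] + emit[t][y];
            }
        }
        // Pick the best final label, then follow the back-pointers.
        int last = 0;
        for (int y = 1; y < k; y++) {
            if (best[n - 1][y] > best[n - 1][last]) {
                last = y;
            }
        }
        int[] path = new int[n];
        path[n - 1] = last;
        for (int t = n - 1; t > 0; t--) {
            path[t - 1] = back[t][path[t]];
        }
        return path;
    }

    public static void main(String[] args) {
        // Hypothetical label set: 0 = AUTHOR, 1 = TITLE.
        double[][] emit = { {2.0, 0.0}, {0.5, 0.6}, {0.0, 2.0} };
        double[][] trans = { {1.0, 0.0}, {-2.0, 1.0} };
        System.out.println(Arrays.toString(decode(emit, trans)));
    }
}
```

The transition scores are what make the model a chain: each label is chosen together with its neighbours rather than token by token.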

User commentary
Please sign all comments.

System requirements
"Grobid should run properly on MacOS X, Linux (32 & 64) and Windows (32) environments 'out of the box'."

Source code and licensing
"Grobid is distributed under Apache 2.0 license."

Support for TEI
Output is created in TEI P5 XML.
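
Because the output is namespaced XML, it can be read with any standard XML parser. The following is a minimal sketch using only the JDK's DOM API. The class name and the sample fragment are hypothetical and hand-written for illustration (the fragment is not schema-valid TEI and is much simpler than real GROBID output); only the TEI namespace URI is taken from the TEI P5 standard.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

class TeiTitleSketch {
    // TEI P5 elements live in this namespace.
    static final String TEI_NS = "http://www.tei-c.org/ns/1.0";

    // Hypothetical, hand-written fragment; not actual GROBID output.
    static final String SAMPLE =
        "<TEI xmlns=\"http://www.tei-c.org/ns/1.0\">"
      + "<teiHeader><fileDesc><titleStmt>"
      + "<title>An Example Article</title>"
      + "</titleStmt></fileDesc></teiHeader>"
      + "</TEI>";

    // Returns the text of the first <title> in the TEI namespace, or null if none.
    static String firstTitle(String teiXml) {
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true); // required for namespace-based lookup
            Document doc = factory.newDocumentBuilder().parse(
                new ByteArrayInputStream(teiXml.getBytes(StandardCharsets.UTF_8)));
            NodeList titles = doc.getElementsByTagNameNS(TEI_NS, "title");
            return titles.getLength() == 0 ? null : titles.item(0).getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException("Could not parse TEI input", e);
        }
    }

    public static void main(String[] args) {
        System.out.println(firstTitle(SAMPLE));
    }
}
```

Looking elements up by namespace rather than by bare tag name keeps the code working regardless of which prefix, if any, a document uses for TEI elements.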

Language(s)
"Written in Java (with JNI call)."

Documentation
http://grobid.readthedocs.org/

Sample implementations

 * demo site (not working as of 2017-01-05)
 * online demo

How to download or buy
https://github.com/kermitt2/grobid