"GROBID is a machine learning library for extracting, parsing and re-structuring raw documents such as PDF into structured TEI-encoded documents with a particular focus on technical and scientific publications." See a brief overview.
According to the Grobid website:
- Written in Java (with JNI call).
- High performance: on a modern but low-profile MacBook Pro, header extraction from 4000 PDFs in 10 minutes and parsing of 3000 references in 18 seconds.
- Modular and reusable machine learning models. The extractions are based on Linear Chain Conditional Random Fields, which are currently the state of the art in bibliographical information extraction and labeling.
- Full encoding in TEI, both for the training corpus and the parsed results.
- Reinforcement of extracted bibliographical data via online call to Crossref (optional), export in OpenURL, etc. for easier integration into Digital Library environments.
- Rich bibliographical processing: fine grained parsing of author names, dates, affiliations, addresses, etc. but also quite reliable automatic attachment of affiliations to corresponding authors.
- "Automatic Generation" of pre-formatted training data based on new PDF documents, for supporting semi-automatic training data generation.
"Grobid should run properly on MacOS X, Linux (32 & 64) and Windows (32) environments 'out of the box'."
Source code and licensing
Support for TEI
Output is created in TEI P5 XML.
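Because the output is TEI P5 XML, it can be read with any namespace-aware XML parser. The following is a minimal sketch in Python, assuming a hand-written sample document rather than real GROBID output. The element layout of `SAMPLE_TEI` and the helper name `read_header` are illustrative assumptions; only the TEI namespace URI (`http://www.tei-c.org/ns/1.0`) comes from the TEI standard itself.

```python
import xml.etree.ElementTree as ET

# The TEI P5 namespace; every TEI element lives in it.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written sample for illustration only (an assumption, not actual GROBID output).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>An Example Article</title>
      </titleStmt>
      <sourceDesc>
        <biblStruct>
          <analytic>
            <author>
              <persName><forename>Ada</forename><surname>Lovelace</surname></persName>
            </author>
            <author>
              <persName><forename>Alan</forename><surname>Turing</surname></persName>
            </author>
          </analytic>
        </biblStruct>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""


def read_header(tei_xml):
    """Return (title, list of author names) from a TEI document string.

    Hypothetical helper: it assumes the title sits in titleStmt and that
    each author has a persName with forename and surname children.
    """
    root = ET.fromstring(tei_xml)
    title = root.findtext(
        ".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS
    )
    authors = []
    for pers in root.iterfind(".//tei:author/tei:persName", TEI_NS):
        forename = pers.findtext("tei:forename", default="", namespaces=TEI_NS)
        surname = pers.findtext("tei:surname", default="", namespaces=TEI_NS)
        # Join the name parts, tolerating a missing forename or surname.
        authors.append(f"{forename} {surname}".strip())
    return title.strip(), authors


if __name__ == "__main__":
    title, authors = read_header(SAMPLE_TEI)
    print(title)
    print(authors)
```

Real output may nest these elements differently or include additional ones, so the XPath expressions should be checked against an actual result file before being relied on.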