From TEIWiki



"GROBID is a machine learning library for extracting, parsing and re-structuring raw documents such as PDF into structured TEI-encoded documents with a particular focus on technical and scientific publications."[1] See a brief overview.


According to the Grobid website:[2]

  • Written in Java (with JNI call).
  • High performance: on a modern but low-profile MacBook Pro, header extraction from 4000 PDFs in 10 minutes, parsing of 3000 references in 18 seconds.
  • Modular and reusable machine learning models. The extractions are based on Linear Chain Conditional Random Fields, which are currently the state of the art in bibliographical information extraction and labeling.
  • Full encoding in TEI, both for the training corpus and the parsed results.
  • Reinforcement of extracted bibliographical data via online call to Crossref (optional), export in OpenURL, etc. for easier integration into Digital Library environments.
  • Rich bibliographical processing: fine-grained parsing of author names, dates, affiliations, addresses, etc., as well as quite reliable automatic attachment of affiliations to corresponding authors.
  • "Automatic Generation" of pre-formatted training data based on new PDF documents, for supporting semi-automatic training data generation.
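
The OpenURL export mentioned in the list above can be illustrated with a minimal sketch. This is not Grobid's own code: the function name `build_openurl`, the resolver address, and the sample reference are all hypothetical, and the sketch assumes the key/encoded-value (KEV) journal format of OpenURL 1.0 (Z39.88-2004). It only shows how parsed bibliographical fields could be turned into a link that a library resolver understands.

```python
from urllib.parse import urlencode

def build_openurl(resolver_base, ref):
    """Build an OpenURL 1.0 (KEV, journal format) link from parsed fields.

    `ref` is a plain dict of bibliographical fields; the dict layout is an
    assumption made for this sketch, not a Grobid data structure.
    """
    params = {
        "url_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
        "rft.atitle": ref.get("title", ""),
        "rft.jtitle": ref.get("journal", ""),
        "rft.aulast": ref.get("author_surname", ""),
        "rft.date": ref.get("year", ""),
        "rft.volume": ref.get("volume", ""),
        "rft.spage": ref.get("first_page", ""),
    }
    # Drop fields that were not extracted so the link stays short.
    params = {key: value for key, value in params.items() if value}
    return resolver_base + "?" + urlencode(params)

# Hypothetical resolver address and invented sample reference.
example_url = build_openurl(
    "https://resolver.example.org/openurl",
    {
        "title": "An Example Article",
        "journal": "Journal of Examples",
        "author_surname": "Lovelace",
        "year": "2010",
    },
)
print(example_url)
```

Empty fields are left out rather than sent as blank values, which keeps the link usable when only part of a reference could be parsed.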

User commentary

Please sign all comments.

System requirements

"Grobid should run properly on MacOS X, Linux (32 & 64) and Windows (32) environments 'out of the box'."[3]

Source code and licensing

"Grobid is distributed under Apache 2.0 license."[4]

Support for TEI

Output is created in TEI P5 XML.
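
Because the results are TEI P5 XML, they can be read with any namespace-aware XML parser. The following is a minimal sketch using Python's standard library. The sample document is invented for illustration and heavily trimmed (it is not schema-complete, and real Grobid output contains far more detail); the function name `read_header` is likewise hypothetical. The only fixed point is the TEI namespace, `http://www.tei-c.org/ns/1.0`.

```python
import xml.etree.ElementTree as ET

# The TEI P5 namespace; every element in a TEI document lives in it.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Invented, trimmed sample. Real output is much richer than this.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title>An Example Article</title>
      </titleStmt>
      <sourceDesc>
        <biblStruct>
          <analytic>
            <author>
              <persName><forename>Ada</forename><surname>Lovelace</surname></persName>
            </author>
            <author>
              <persName><forename>Alan</forename><surname>Turing</surname></persName>
            </author>
          </analytic>
        </biblStruct>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""

def read_header(tei_xml):
    """Return (title, [author names]) from a TEI document string."""
    root = ET.fromstring(tei_xml)
    title = root.findtext(
        ".//tei:titleStmt/tei:title", default="", namespaces=TEI_NS
    )
    authors = []
    for pers in root.iterfind(".//tei:author/tei:persName", TEI_NS):
        forename = pers.findtext("tei:forename", default="", namespaces=TEI_NS)
        surname = pers.findtext("tei:surname", default="", namespaces=TEI_NS)
        authors.append(f"{forename} {surname}".strip())
    return title, authors

title, authors = read_header(SAMPLE_TEI)
print(title)
print(authors)
```

Forgetting the namespace mapping is the most common pitfall here: a search for a bare `title` element finds nothing, because the element's full name includes the TEI namespace.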


"Written in Java (with JNI call)."[5]


Tech support

User community

Sample implementations

Current version number and date of release

History of versions

How to download or buy

Additional notes

