GROBID
Contents
- 1 Synopsis
- 2 Features
- 3 User commentary
- 4 System requirements
- 5 Source code and licensing
- 6 Support for TEI
- 7 Language(s)
- 8 Documentation
- 9 Tech support
- 10 User community
- 11 Sample implementations
- 12 Current version number and date of release
- 13 History of versions
- 14 How to download or buy
- 15 Additional notes
- 16 References
Synopsis
"GROBID is a machine learning library for extracting, parsing and re-structuring raw documents such as PDF into structured TEI-encoded documents with a particular focus on technical and scientific publications."<ref>https://github.com/kermitt2/grobid</ref> See a brief overview.
Features
According to the GROBID project page:<ref>https://github.com/kermitt2/grobid</ref>
- Written in Java (with JNI call).
- High performance: on a modern but low-profile MacBook Pro, header extraction from 4000 PDFs takes 10 minutes and parsing of 3000 references takes 18 seconds.
- Modular and reusable machine learning models. The extractions are based on Linear Chain Conditional Random Fields, which the project describes as the current state of the art in bibliographical information extraction and labeling.
- Full encoding in TEI, both for the training corpus and the parsed results.
- Reinforcement of extracted bibliographical data via online call to Crossref (optional), export in OpenURL, etc. for easier integration into Digital Library environments.
- Rich bibliographical processing: fine-grained parsing of author names, dates, affiliations, addresses, etc., as well as fairly reliable automatic attachment of affiliations to the corresponding authors.
- Automatic generation of pre-formatted training data from new PDF documents, to support semi-automatic creation of training data.
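To illustrate the OpenURL export mentioned in the list above, here is a minimal sketch of how parsed bibliographical fields could be serialized as an OpenURL 1.0 (Z39.88-2004) key/value query. It is not GROBID's own code: the resolver address, the function name `to_openurl`, the input dictionary keys and the sample reference are all assumptions made for this example.

```python
from urllib.parse import urlencode

# Hypothetical link resolver; replace with your library's resolver address.
RESOLVER_BASE = "https://resolver.example.org/openurl"

# Assumed mapping from illustrative field names to OpenURL journal-format keys.
KEY_MAP = {
    "title": "rft.atitle",
    "journal": "rft.jtitle",
    "author_last": "rft.aulast",
    "year": "rft.date",
    "volume": "rft.volume",
    "first_page": "rft.spage",
}


def to_openurl(ref, base=RESOLVER_BASE):
    """Build an OpenURL query for a journal article from a dict of fields."""
    params = [
        ("url_ver", "Z39.88-2004"),
        ("rft_val_fmt", "info:ofi/fmt:kev:mtx:journal"),
    ]
    for field, key in KEY_MAP.items():
        value = ref.get(field)
        if value:  # skip fields the extraction did not find
            params.append((key, str(value)))
    return base + "?" + urlencode(params)


# Made-up reference, used only to show the call.
sample_ref = {
    "title": "An Example Title",
    "journal": "Journal of Examples",
    "author_last": "Example",
    "year": 2010,
    "volume": "12",
    "first_page": "34",
}
url = to_openurl(sample_ref)
print(url)
```

Fields that are missing from the input are simply left out of the query, which suits extraction results where some fields were not recognized.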
User commentary
Please sign all comments.
System requirements
"Grobid should run properly on MacOS X, Linux (32 & 64) and Windows (32) environments 'out of the box'."<ref>https://github.com/kermitt2/grobid</ref>
Source code and licensing
"Grobid is distributed under Apache 2.0 license."<ref>https://github.com/kermitt2/grobid</ref>
Support for TEI
GROBID produces its output as TEI P5 XML.
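Because the output is TEI P5 XML, it can be read with any namespace-aware XML parser. The sketch below pulls the title and author names out of a TEI header using Python's standard library. The sample document is hand-written for illustration and is far simpler than real GROBID output; the element paths and the function name `read_header` are assumptions that may need adjusting for actual results.

```python
import xml.etree.ElementTree as ET

# TEI P5 elements live in this namespace.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hypothetical, hand-written sample; not actual GROBID output.
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title level="a" type="main">An Example Article Title</title>
      </titleStmt>
      <sourceDesc>
        <biblStruct>
          <analytic>
            <author>
              <persName>
                <forename type="first">Ada</forename>
                <surname>Example</surname>
              </persName>
            </author>
            <author>
              <persName>
                <forename type="first">Bob</forename>
                <surname>Sample</surname>
              </persName>
            </author>
          </analytic>
        </biblStruct>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>
"""


def read_header(tei_xml):
    """Return (title, [author names]) from a TEI document string."""
    root = ET.fromstring(tei_xml)
    title_el = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    title = None
    if title_el is not None and title_el.text:
        title = title_el.text.strip()
    authors = []
    for pers in root.iterfind(".//tei:analytic/tei:author/tei:persName", TEI_NS):
        forename = pers.findtext("tei:forename", default="", namespaces=TEI_NS)
        surname = pers.findtext("tei:surname", default="", namespaces=TEI_NS)
        # Join only the name parts that are present.
        name = " ".join(p for p in (forename.strip(), surname.strip()) if p)
        authors.append(name)
    return title, authors


title, authors = read_header(SAMPLE_TEI)
print(title)
print(authors)
```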
Language(s)
"Written in Java (with JNI call)."<ref>https://github.com/kermitt2/grobid</ref>
Documentation
http://grobid.readthedocs.org/
Tech support
User community
Sample implementations
- demo site (not working as of 2017-01-05)
- online demo
Current version number and date of release
History of versions
How to download or buy
https://github.com/kermitt2/grobid
Additional notes
References
<references/>