Difference between revisions of "GROBID"

From TEIWiki

Revision as of 19:27, 25 November 2014


Synopsis

"Grobid is a machine learning library for extracting, parsing and TEI-encoding of bibliographical information at large, with a particular focus on technical and scientific articles."<ref>https://github.com/kermitt2/grobid</ref>

Features

According to the Grobid website:<ref>https://github.com/kermitt2/grobid</ref>

  • Written in Java (with JNI call).
  • High performance: on a modern but low-profile MacBook Pro, header extraction from 4,000 PDFs takes 10 minutes and parsing of 3,000 references takes 18 seconds.
  • Modular and reusable machine learning models. The extractions are based on linear-chain Conditional Random Fields (CRFs), which are currently the state of the art in bibliographical information extraction and labeling.
  • Full encoding in TEI, both for the training corpus and the parsed results.
  • Reinforcement of extracted bibliographical data via an optional online call to Crossref, export in OpenURL, etc., for easier integration into Digital Library environments.
  • Rich bibliographical processing: fine-grained parsing of author names, dates, affiliations, addresses, etc., as well as quite reliable automatic attachment of affiliations to the corresponding authors.
  • "Automatic Generation" of pre-formatted training data based on new PDF documents, to support semi-automatic training data generation.
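
To give a sense of what labeling with a linear-chain Conditional Random Field involves, the following is a minimal sketch of Viterbi decoding, the standard way to find the best label sequence in such a model. It is not Grobid's implementation: the label set (AUTHOR, DATE, TITLE), the tokens, and all scores are invented for illustration, and a trained CRF would compute these scores from learned feature weights.

```python
from typing import Dict, List, Tuple


def viterbi(
    emissions: List[Dict[str, float]],
    transitions: Dict[Tuple[str, str], float],
    labels: List[str],
) -> List[str]:
    """Return the highest-scoring label sequence for a linear-chain model.

    emissions[t][y] is the score of label y at position t.
    transitions[(a, b)] is the score of moving from label a to label b.
    Missing entries are treated as a score of 0.0.
    """
    if not emissions:
        return []

    # best[t][y]: best score of any path that ends with label y at position t
    best = [{y: emissions[0].get(y, 0.0) for y in labels}]
    # back[t-1][y]: the previous label on that best path
    back = []

    for t in range(1, len(emissions)):
        scores = {}
        pointers = {}
        for y in labels:
            prev, score = max(
                ((p, best[t - 1][p] + transitions.get((p, y), 0.0)) for p in labels),
                key=lambda item: item[1],
            )
            scores[y] = score + emissions[t].get(y, 0.0)
            pointers[y] = prev
        best.append(scores)
        back.append(pointers)

    # Start from the best final label and follow the back-pointers.
    path = [max(labels, key=lambda y: best[-1][y])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


# Hypothetical tokens of a reference string and made-up scores.
tokens = ["Smith", "J.", "2014", "Parsing"]
labels = ["AUTHOR", "DATE", "TITLE"]
emissions = [
    {"AUTHOR": 2.0, "DATE": -1.0, "TITLE": 1.0},
    {"AUTHOR": 1.5, "DATE": -1.0, "TITLE": 0.5},
    {"AUTHOR": -1.0, "DATE": 3.0, "TITLE": 0.0},
    {"AUTHOR": 1.0, "DATE": -1.0, "TITLE": 2.0},
]
transitions = {
    ("AUTHOR", "AUTHOR"): 1.0,
    ("AUTHOR", "DATE"): 0.5,
    ("DATE", "TITLE"): 0.5,
    ("TITLE", "TITLE"): 1.0,
}

path = viterbi(emissions, transitions, labels)
for token, label in zip(tokens, path):
    print(f"{token}\t{label}")
```

The transition scores are what distinguish a linear-chain model from labeling each token independently: each label decision takes the neighbouring label into account.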

User commentary

Please sign all comments.


System requirements

"Grobid should run properly on MacOS X, Linux (32 & 64) and Windows (32) environments 'out of the box'."<ref>https://github.com/kermitt2/grobid</ref>

Source code and licensing

"Grobid is distributed under Apache 2.0 license."<ref>https://github.com/kermitt2/grobid</ref>

Support for TEI

Output is produced as TEI P5 XML.
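
As an illustration of how TEI P5 bibliographic output can be consumed downstream, here is a minimal sketch that reads a biblStruct element with Python's standard library. The sample record is hand-written for this example and is not actual Grobid output. The structure Grobid emits may differ, and the helper name summarize is hypothetical.

```python
import xml.etree.ElementTree as ET

# The TEI P5 namespace; the "tei" prefix is only a local alias for queries.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written sample record (an assumption, not real Grobid output).
SAMPLE = """<biblStruct xmlns="http://www.tei-c.org/ns/1.0">
  <analytic>
    <title level="a">An Example Article Title</title>
    <author>
      <persName><forename>Jane</forename><surname>Doe</surname></persName>
    </author>
  </analytic>
  <monogr>
    <title level="j">Journal of Examples</title>
    <imprint><date when="2014">2014</date></imprint>
  </monogr>
</biblStruct>"""


def summarize(xml_text: str) -> dict:
    """Extract title, authors, journal and year from one TEI biblStruct."""
    root = ET.fromstring(xml_text)

    authors = []
    for pers in root.findall("tei:analytic/tei:author/tei:persName", TEI_NS):
        forename = pers.findtext("tei:forename", default="", namespaces=TEI_NS)
        surname = pers.findtext("tei:surname", default="", namespaces=TEI_NS)
        authors.append(f"{forename} {surname}".strip())

    date = root.find("tei:monogr/tei:imprint/tei:date", TEI_NS)

    return {
        "title": root.findtext("tei:analytic/tei:title", namespaces=TEI_NS),
        "authors": authors,
        "journal": root.findtext("tei:monogr/tei:title", namespaces=TEI_NS),
        # Prefer the normalized @when attribute when it is present.
        "year": date.get("when") if date is not None else None,
    }


summary = summarize(SAMPLE)
print(summary)
```

Because TEI elements live in a namespace, every path step needs the prefix. Unprefixed queries such as "analytic/title" would match nothing.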

Language(s)

"Written in Java (with JNI call)."<ref>https://github.com/kermitt2/grobid</ref>

Documentation

See the "Usage" section of https://github.com/kermitt2/grobid.

Tech support

User community

Sample implementations

Current version number and date of release

History of versions

How to download or buy

https://github.com/kermitt2/grobid

Additional notes

References

<references/>