GROBID

From TEIWiki

Latest revision as of 18:47, 16 February 2017


Synopsis

"GROBID is a machine learning library for extracting, parsing and re-structuring raw documents such as PDF into structured TEI-encoded documents with a particular focus on technical and scientific publications."<ref>https://github.com/kermitt2/grobid</ref> See a [http://ercim-news.ercim.eu/en100/r-i/grobid-information-extraction-from-scientific-publications brief overview].

Features

According to the Grobid website:<ref>https://github.com/kermitt2/grobid</ref>

  • Written in Java (with JNI call).
  • High performance - on a modern but low profile MacBook Pro: header extraction from 4000 PDF in 10 minutes, parsing of 3000 references in 18 seconds.
  • Modular and reusable machine learning models. The extractions are based on Linear Chain Conditional Random Fields which is currently the state of the art in bibliographical information extraction and labeling.
  • Full encoding in TEI, both for the training corpus and the parsed results.
  • Reinforcement of extracted bibliographical data via online call to Crossref (optional), export in OpenURL, etc. for easier integration into Digital Library environments.
  • Rich bibliographical processing: fine grained parsing of author names, dates, affiliations, addresses, etc. but also quite reliable automatic attachment of affiliations to corresponding authors.
  • "Automatic Generation" of pre-formatted training data based on new pdf documents, for supporting semi-automatic training data generation.
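
The linear-chain CRF labeling mentioned in the list above can be illustrated with a toy decoder. This is only a minimal sketch of Viterbi decoding over hand-set scores; it is not GROBID's code, and the labels, tokens, and score values (`LABELS`, `EMISSIONS`, `TRANSITIONS`) are assumptions made up for the example, whereas a real model learns its scores from training data.

```python
def viterbi(emissions, transitions, labels):
    """Return the highest-scoring label sequence for a linear-chain model.

    emissions:   one dict per token, mapping label -> score for that token
    transitions: dict mapping (previous_label, current_label) -> score
    """
    if not emissions:
        return []
    # best[t][label] = score of the best path ending in `label` at token t
    best = [{lab: emissions[0][lab] for lab in labels}]
    back = []  # back[t - 1][label] = best previous label for `label` at token t
    for t in range(1, len(emissions)):
        scores, pointers = {}, {}
        for cur in labels:
            prev = max(
                labels,
                key=lambda p: best[-1][p] + transitions.get((p, cur), 0.0),
            )
            scores[cur] = (
                best[-1][prev]
                + transitions.get((prev, cur), 0.0)
                + emissions[t][cur]
            )
            pointers[cur] = prev
        best.append(scores)
        back.append(pointers)
    # Follow the back-pointers from the best final label.
    path = [max(labels, key=lambda lab: best[-1][lab])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    return list(reversed(path))


# Hypothetical toy data: three tokens of a reference string, two labels.
LABELS = ["author", "date"]
TOKENS = ["Smith", "J.", "2017"]
EMISSIONS = [
    {"author": 2.0, "date": -1.0},
    {"author": 0.5, "date": 0.4},   # ambiguous token on its own
    {"author": -1.0, "date": 2.0},
]
TRANSITIONS = {
    ("author", "author"): 0.5,
    ("author", "date"): 0.0,
    ("date", "author"): -1.0,
    ("date", "date"): 0.5,
}

decoded = viterbi(EMISSIONS, TRANSITIONS, LABELS)
print(list(zip(TOKENS, decoded)))
```

The transition scores are what let the ambiguous middle token inherit the label of its neighbour, which is the main reason a chain model is used instead of classifying each token independently.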

User commentary

Please sign all comments.


System requirements

"Grobid should run properly on MacOS X, Linux (32 & 64) and Windows (32) environments 'out of the box'."<ref>https://github.com/kermitt2/grobid</ref>

Source code and licensing

"Grobid is distributed under Apache 2.0 license."<ref>https://github.com/kermitt2/grobid</ref>

Support for TEI

Output is created in TEI P5 XML.
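
Because the output is TEI P5 XML, it can be read with any namespace-aware XML parser. The following is a minimal sketch using Python's standard library; `SAMPLE_TEI` is a hand-written fragment for illustration, not actual GROBID output, and `read_header` is a hypothetical helper whose element paths assume a title in `titleStmt` and authors under `analytic`.

```python
import xml.etree.ElementTree as ET

# The TEI P5 namespace; the "tei" prefix is only a local alias.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

# Hand-written sample for illustration only (not real GROBID output).
SAMPLE_TEI = """<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <fileDesc>
      <titleStmt>
        <title level="a" type="main">An Example Article</title>
      </titleStmt>
      <sourceDesc>
        <biblStruct>
          <analytic>
            <author>
              <persName>
                <forename type="first">Ada</forename>
                <surname>Lovelace</surname>
              </persName>
            </author>
            <author>
              <persName>
                <forename type="first">Alan</forename>
                <surname>Turing</surname>
              </persName>
            </author>
          </analytic>
        </biblStruct>
      </sourceDesc>
    </fileDesc>
  </teiHeader>
</TEI>"""


def read_header(tei_xml):
    """Extract the title and author names from a TEI document string."""
    root = ET.fromstring(tei_xml)
    title_el = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    title = None
    if title_el is not None and title_el.text:
        title = title_el.text.strip()
    authors = []
    for pers in root.iterfind(".//tei:analytic/tei:author/tei:persName", TEI_NS):
        forenames = [
            f.text.strip() for f in pers.findall("tei:forename", TEI_NS) if f.text
        ]
        surname = pers.findtext("tei:surname", default="", namespaces=TEI_NS).strip()
        authors.append(" ".join(forenames + [surname]).strip())
    return {"title": title, "authors": authors}


info = read_header(SAMPLE_TEI)
print(info)
```

Real documents may place these elements differently or omit them, so the paths should be checked against actual output before relying on them.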

Language(s)

"Written in Java (with JNI call)."<ref>https://github.com/kermitt2/grobid</ref>

Documentation

http://grobid.readthedocs.org/

Tech support

User community

Sample implementations

  • [http://scite-it.eu/ demo site] (not working as of 2017-01-05)
  • [http://cloud.science-miner.com/grobid/ online demo]

Current version number and date of release

History of versions

How to download or buy

https://github.com/kermitt2/grobid

Additional notes

References

<references/>