Piotr Banski: text prepared by Adrien Barbaresi

2020-08-06T13:30:15Z

[[Category:Tools]]

[[Category:Development tools]]
[[Category:Conversion and preprocessing tools]]
[[Category:Testing and QA tools]]
[[Category:Analysis tools]]
[[Category:All-in-one Tools]]

[[Category:Discovering]]
[[Category:Annotating]]
[[Category:Sampling]]

== Synopsis ==

Trafilatura is a Python library and command-line tool that processes
HTML documents and converts them to plain text, CSV, JSON, XML, or
TEI-XML. It downloads, parses, and scrapes web pages: it can extract
metadata, the main body text, and comments while preserving part of the
text formatting and page structure. It also includes a parser and
validator for TEI documents.

Distinguishing between the whole page and its essential parts helps
alleviate many quality problems in web texts, because it removes the
noise of recurring elements (headers and footers, ads, links/blogroll).
The extractor has to be precise enough not to miss texts or discard
valid documents, and robust yet reasonably fast. It is designed to run
in production on millions of web documents.

This effort serves the development of methods for deriving information
from web documents in order to build text databases for research,
especially for linguistic analysis and natural language processing as
part of projects by the Center for Digital Lexicography of German
([https://zdl.org ZDL] and [https://www.dwds.de DWDS]). A significant
challenge lies in extracting and pre-processing web texts so that they
meet scientific expectations: web corpus construction involves numerous
design decisions, and this software package helps facilitate collection
and enhance corpus quality.

== Features ==

* Robust extraction algorithm preserving
** Metadata (title, author, date, site name, categories and tags)
** Structural elements (paragraphs, titles, lists, quotes, code, line breaks, in-line text formatting)
* Seamless, parallelized online (including page retrieval) or offline processing
* TEI-XML supported as output format, including document validation
* Management of download/URL lists (ATOM/RSS feeds, URL queues, blacklisting)
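
The URL list management mentioned above (feeds, queues, blacklisting) can be illustrated with a minimal sketch that uses only the Python standard library. It is not Trafilatura's own implementation; the function names <code>links_from_feed</code> and <code>build_queue</code> and the example hosts are hypothetical and chosen for illustration only.

```python
import xml.etree.ElementTree as ET
from collections import deque
from typing import Deque, Iterable, List, Set
from urllib.parse import urlsplit

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def links_from_feed(feed_xml: str) -> List[str]:
    """Collect article links from an RSS 2.0 or Atom feed (hypothetical helper)."""
    root = ET.fromstring(feed_xml)
    links = []
    # RSS 2.0 stores the URL as text: <item><link>URL</link></item>
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            links.append(link.strip())
    # Atom stores the URL in an attribute: <entry><link href="URL"/></entry>
    for entry in root.iter(ATOM_NS + "entry"):
        for link in entry.findall(ATOM_NS + "link"):
            href = link.get("href")
            if href:
                links.append(href.strip())
    return links


def build_queue(urls: Iterable[str], blacklisted_hosts: Set[str]) -> Deque[str]:
    """Return a de-duplicated download queue without blacklisted hosts."""
    seen = set()
    queue = deque()
    for url in urls:
        host = urlsplit(url).hostname or ""
        if host in blacklisted_hosts or url in seen:
            continue
        seen.add(url)
        queue.append(url)
    return queue
```

The resulting queue preserves feed order, so the oldest discovered URL is downloaded first when items are taken with <code>popleft()</code>.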

== User commentary ==

'''Please sign all comments.'''

== System requirements ==

The software is tested on macOS and Linux; it is expected to work on
Windows as well. It supports all common Python 3 versions (3.4 upwards).

== Source code and licensing ==

Open-source software under [https://www.gnu.org/licenses/gpl-3.0.en.html GNU General Public License v3.0].

Source code homepage: see [https://github.com/adbar/trafilatura GitHub repository].

== Support for TEI ==

Support for TEI P5 out of the box.

== Language(s) ==

* The tool is written in Python and can be used from within Python or on the command line.
* The documentation is available in English.

== Documentation ==

* [https://trafilatura.readthedocs.io/ Documentation]
* [https://trafilatura.readthedocs.io/en/latest/tutorials.html Tutorials]
** [https://trafilatura.readthedocs.io/en/latest/tutorial2.html Production and validation of TEI files in Python]
** [http://adrien.barbaresi.eu/blog/validating-tei-xml-python.html Validating TEI-XML documents]

== Tech support ==

Support is available through this [http://adrien.barbaresi.eu/ contact info] or by filing issues
on the [https://github.com/adbar/trafilatura/issues dedicated page].

== User community ==

[https://github.com/adbar/trafilatura/blob/master/CONTRIBUTING.md Contributions] to code and documentation are welcome! These
[https://github.com/adbar/trafilatura/graphs/contributors contributors] have already submitted features and fixes.

Tutorial video in German by Simon Meier-Vieracker: [https://www.youtube.com/watch?v=9RPrVE0hHgI Content von Webseiten laden mit Trafilatura].

== History of versions ==

Current version: [https://pypi.org/project/trafilatura/ v0.5.1], 2020-07-15.

* [https://github.com/adbar/trafilatura/blob/master/HISTORY.md Change log]
* [https://github.com/adbar/trafilatura/releases GitHub releases]

== Download and installation ==

Trafilatura is packaged as a software library available from the package
repository ''PyPI''. It can be installed with <code>pip</code> or
<code>pipenv</code>, for example: <code>pip install --upgrade
trafilatura</code>.

For more details please read the [https://trafilatura.readthedocs.io/en/latest/installation.html installation documentation].
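
Once the package is installed, basic use from Python could look like the following sketch. It relies on the library's <code>fetch_url</code> and <code>extract</code> functions; the wrapper names <code>extract_main_text</code> and <code>download_and_extract</code> are hypothetical, and default settings are assumed. The import is placed inside the functions so that the module itself loads even where Trafilatura is not available.

```python
from typing import Optional


def extract_main_text(html: str) -> Optional[str]:
    """Return the main text of an HTML string, or None if nothing was extracted."""
    import trafilatura  # third-party package, installed via pip

    return trafilatura.extract(html)


def download_and_extract(url: str) -> Optional[str]:
    """Download a page and return its main text, or None on failure."""
    import trafilatura  # third-party package, installed via pip

    downloaded = trafilatura.fetch_url(url)
    if not downloaded:
        # The download failed or returned nothing usable.
        return None
    return trafilatura.extract(downloaded)
```

Both helpers may return <code>None</code>, for instance when a page is too short or contains no recognizable main content, so callers should check the result before using it.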

== Additional notes ==

'''Evaluation and alternatives:''' The extraction focuses on the main
content, which is usually the part displayed centrally, without the left
or right bars, the header or the footer, but including potential titles
and (optionally) comments. This task is also known as ''web scraping'',
''boilerplate removal'', ''DOM-based content extraction'', ''main
content identification'', or ''web page cleaning''.
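
To make the idea of boilerplate removal concrete, here is a deliberately naive sketch based on the Python standard library. It is not the algorithm used by Trafilatura: it simply keeps paragraph text and skips anything nested in elements that typically hold recurring page furniture. The class <code>MainTextParser</code>, the function <code>naive_main_text</code>, and the tag list are assumptions made for illustration.

```python
from html.parser import HTMLParser

# Elements assumed to contain boilerplate rather than main content.
BOILERPLATE_TAGS = {"header", "footer", "nav", "aside", "script", "style"}


class MainTextParser(HTMLParser):
    """Collect the text of <p> elements that are outside boilerplate elements."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0        # > 0 while inside a boilerplate element
        self.in_paragraph = False  # True while inside a retained <p>
        self.paragraphs = []
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self.skip_depth += 1
        elif tag == "p" and self.skip_depth == 0:
            self.in_paragraph = True
            self._buffer = []

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "p" and self.in_paragraph:
            # Normalize whitespace and drop empty paragraphs.
            text = " ".join("".join(self._buffer).split())
            if text:
                self.paragraphs.append(text)
            self.in_paragraph = False

    def handle_data(self, data):
        if self.in_paragraph and self.skip_depth == 0:
            self._buffer.append(data)


def naive_main_text(html: str) -> str:
    """Return the retained paragraphs, one per line."""
    parser = MainTextParser()
    parser.feed(html)
    parser.close()
    return "\n".join(parser.paragraphs)
```

A rule list of this kind fails on pages that do not use semantic elements, which is one reason why dedicated extractors combine several heuristics and are compared on annotated data.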

Reproducible results are published on the
[https://trafilatura.readthedocs.io/en/latest/evaluation.html evaluation page].
