Trafilatura

From TEIWiki


Synopsis

Trafilatura is a Python library and command-line tool that processes HTML documents and converts them to plain text, CSV, JSON, XML, or TEI-XML. It downloads, parses, and scrapes web pages: it can extract metadata, the main body text, and comments while preserving part of the text formatting and page structure. It also includes a parser and validator for TEI documents.
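As a minimal sketch of use from Python, assuming the package is installed and that its `fetch_url` and `extract` functions behave as described in the library's documentation. The wrapper names `extract_main_text` and `download_and_extract` are hypothetical and serve only as illustration:

```python
from typing import Optional


def extract_main_text(html: str, with_comments: bool = False) -> Optional[str]:
    """Return the main text of an HTML document, or None if nothing usable is found."""
    # Imported inside the function so this sketch loads even where trafilatura is absent.
    import trafilatura

    # include_comments controls whether reader comments are kept with the body text.
    return trafilatura.extract(html, include_comments=with_comments)


def download_and_extract(url: str) -> Optional[str]:
    """Download a page and extract its main text (requires network access)."""
    import trafilatura

    downloaded = trafilatura.fetch_url(url)  # None if the download failed
    if downloaded is None:
        return None
    return extract_main_text(downloaded)
```

Both wrappers return None when nothing could be downloaded or extracted, so callers should check the result before using it.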

Distinguishing between the whole page and its essential parts helps alleviate many quality problems of web texts, because it removes the noise of recurring elements (headers and footers, ads, links/blogroll). The extractor has to be precise enough not to miss texts or discard valid documents, and robust yet reasonably fast. It is designed to run in production on millions of web documents.
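The idea of separating essential parts from recurring elements can be illustrated with a toy example using only the Python standard library. This is not Trafilatura's extraction algorithm: the set of tags treated as boilerplate and the names `MainTextParser` and `main_text` are assumptions made for this sketch.

```python
from html.parser import HTMLParser

# Elements treated as boilerplate in this toy example (an assumption, not Trafilatura's rules).
BOILERPLATE_TAGS = {"header", "footer", "nav", "aside", "script", "style"}


class MainTextParser(HTMLParser):
    """Collect paragraph text that is not nested inside boilerplate elements."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0        # greater than 0 while inside a boilerplate element
        self.in_paragraph = False  # True while inside a <p> that should be kept
        self.paragraphs = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self.skip_depth += 1
        elif tag == "p" and self.skip_depth == 0:
            self.in_paragraph = True
            self.paragraphs.append("")

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1
        elif tag == "p":
            self.in_paragraph = False

    def handle_data(self, data):
        # Keep text only inside a retained paragraph and outside boilerplate.
        if self.in_paragraph and self.skip_depth == 0:
            self.paragraphs[-1] += data


def main_text(html):
    """Return the non-empty paragraphs found outside boilerplate elements."""
    parser = MainTextParser()
    parser.feed(html)
    parser.close()
    return [p.strip() for p in parser.paragraphs if p.strip()]
```

A real extractor has to cope with pages that lack such clean semantic markup, which is where precision and robustness become hard to combine.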

This effort serves the development of methods for deriving information from web documents in order to build text databases for research, especially for linguistic analysis and natural language processing as part of projects by the Center for Digital Lexicography of German (ZDL and DWDS). A significant challenge resides in the ability to extract and pre-process web texts to meet scientific expectations: Web corpus construction involves numerous design decisions, and this software package helps facilitate collection and enhance corpus quality.

Features

  • Robust extraction algorithm preserving:
    • Metadata (title, author, date, site name, categories and tags)
    • Structural elements (paragraphs, titles, lists, quotes, code, line breaks, in-line text formatting)
  • Seamless, parallelized online (including page retrieval) or offline processing
  • TEI-XML supported as output format, including document validation
  • Management of download/URL lists (ATOM/RSS feeds, URL queues, blacklisting)
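The management of URL lists mentioned in the last item can be sketched with the standard library alone. The following is an illustration under stated assumptions, not Trafilatura's own implementation: the function names `links_from_feed` and `build_queue` are hypothetical, and the blacklist is assumed to be a plain set of host names.

```python
import xml.etree.ElementTree as ET
from collections import deque
from urllib.parse import urlparse

ATOM_NS = "{http://www.w3.org/2005/Atom}"


def links_from_feed(feed_xml):
    """Return item links from an RSS 2.0 or Atom feed given as a string."""
    root = ET.fromstring(feed_xml)
    links = []
    # RSS 2.0: <item><link>URL</link></item>
    for item in root.iter("item"):
        link = item.findtext("link")
        if link:
            links.append(link.strip())
    # Atom: <entry><link href="URL"/></entry>
    for entry in root.iter(ATOM_NS + "entry"):
        for link in entry.findall(ATOM_NS + "link"):
            href = link.get("href")
            if href:
                links.append(href.strip())
    return links


def build_queue(urls, blacklisted_hosts):
    """Deduplicate URLs, drop blacklisted hosts, and return a FIFO queue."""
    seen = set()
    queue = deque()
    for url in urls:
        host = urlparse(url).netloc.lower()
        if host in blacklisted_hosts or url in seen:
            continue
        seen.add(url)
        queue.append(url)
    return queue
```

URLs can then be taken from the queue with `popleft()` and passed on to download and extraction.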

User commentary

Please sign all comments.

System requirements

The software is tested on macOS and Linux and is expected to work on Windows as well. It supports all common Python 3 versions (3.4 upwards).

Source code and licensing

Open-source software under the GNU General Public License v3.0.

Source code homepage: see GitHub repository.

Support for TEI

Support for TEI P5 out of the box.
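For orientation, the skeleton of a minimal TEI P5 document can be built with the Python standard library as follows. This is a sketch of the general TEI structure (a `teiHeader` with a file description, followed by `text` and `body`), not the exact output Trafilatura produces; the function name `build_tei` and its parameters are hypothetical.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def build_tei(title, source_url, paragraphs):
    """Build a minimal TEI P5 document and return it as a string."""
    # Serialize TEI elements in the default namespace instead of an ns0: prefix.
    ET.register_namespace("", TEI_NS)

    def tag(name):
        return "{%s}%s" % (TEI_NS, name)

    tei = ET.Element(tag("TEI"))

    # The header requires a title, a publication statement and a source description.
    header = ET.SubElement(tei, tag("teiHeader"))
    file_desc = ET.SubElement(header, tag("fileDesc"))
    title_stmt = ET.SubElement(file_desc, tag("titleStmt"))
    ET.SubElement(title_stmt, tag("title")).text = title
    publication = ET.SubElement(file_desc, tag("publicationStmt"))
    ET.SubElement(publication, tag("p")).text = "Extracted from the web"
    source_desc = ET.SubElement(file_desc, tag("sourceDesc"))
    ET.SubElement(source_desc, tag("p")).text = source_url

    # The extracted paragraphs go into text/body.
    text = ET.SubElement(tei, tag("text"))
    body = ET.SubElement(text, tag("body"))
    for paragraph in paragraphs:
        ET.SubElement(body, tag("p")).text = paragraph

    return ET.tostring(tei, encoding="unicode")
```

A document built this way would still need to be checked against the TEI schema; the sketch only shows the expected nesting.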

Language(s)

  • The tool is written in Python and can be used from within Python or on the command line.
  • The documentation is available in English.

Documentation

Tech support

Support is available through the project's contact information or by filing issues on the dedicated page.

User community

Contributions to code and documentation are welcome! These contributors have already submitted features and fixes.

Tutorial video in German by Simon Meier-Vieracker: Content von Webseiten laden mit Trafilatura.

History of versions

Current version: v0.5.1, 2020-07-15.

Download and installation

Trafilatura is packaged as a software library available from the package repository PyPI. It can be installed with pip or pipenv, for example: pip install --upgrade trafilatura

For more details please read the installation documentation.

Additional notes

Evaluation and alternatives: The extraction focuses on the main content, which is usually the part displayed centrally, without the left or right bars, the header or the footer, but including potential titles and (optionally) comments. This task is also known as web scraping, boilerplate removal, DOM-based content extraction, main content identification, or web page cleaning.

Reproducible results are published on the evaluation page.