<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.tei-c.org/index.php?action=history&amp;feed=atom&amp;title=Trafilatura</id>
	<title>Trafilatura - Revision history</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.tei-c.org/index.php?action=history&amp;feed=atom&amp;title=Trafilatura"/>
	<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=Trafilatura&amp;action=history"/>
	<updated>2026-04-24T00:41:17Z</updated>
	<subtitle>Revision history for this page on the wiki</subtitle>
	<generator>MediaWiki 1.32.0</generator>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=Trafilatura&amp;diff=16799&amp;oldid=prev</id>
		<title>Piotr Banski: text prepared by Adrien Barbaresi</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=Trafilatura&amp;diff=16799&amp;oldid=prev"/>
		<updated>2020-08-06T13:30:15Z</updated>

		<summary type="html">&lt;p&gt;text prepared by Adrien Barbaresi&lt;/p&gt;
&lt;p&gt;&lt;b&gt;New page&lt;/b&gt;&lt;/p&gt;&lt;div&gt;[[Category:Tools]]&lt;br /&gt;
&lt;br /&gt;
[[Category:Development tools]]&lt;br /&gt;
[[Category:Conversion and preprocessing tools]]&lt;br /&gt;
[[Category:Testing and QA tools]]&lt;br /&gt;
[[Category:Analysis tools]]&lt;br /&gt;
[[Category:All-in-one Tools]]&lt;br /&gt;
&lt;br /&gt;
[[Category:Discovering]]&lt;br /&gt;
[[Category:Annotating]]&lt;br /&gt;
[[Category:Sampling]]&lt;br /&gt;
&lt;br /&gt;
== Synopsis ==&lt;br /&gt;
&lt;br /&gt;
Trafilatura is a Python library and command-line tool that processes&lt;br /&gt;
HTML documents and converts them to plain text, CSV, JSON, XML or&lt;br /&gt;
TEI-XML. It downloads, parses, and scrapes web page data: it&lt;br /&gt;
can extract metadata, the main body text and comments while preserving&lt;br /&gt;
part of the text formatting and page structure. It also includes a&lt;br /&gt;
parser and validator for TEI documents.&lt;br /&gt;
&lt;br /&gt;
Distinguishing between the whole page and its essential parts helps to&lt;br /&gt;
alleviate many quality problems related to web texts, as it removes&lt;br /&gt;
the noise of recurring elements (headers and footers, ads,&lt;br /&gt;
links/blogroll). The extractor has to be precise enough not to miss&lt;br /&gt;
texts or discard valid documents, and robust yet reasonably fast. It is&lt;br /&gt;
designed to run in production on millions of web documents.&lt;br /&gt;
&lt;br /&gt;
This effort serves the development of methods for deriving information&lt;br /&gt;
from web documents in order to build text databases for research,&lt;br /&gt;
especially for linguistic analysis and natural language processing as&lt;br /&gt;
part of projects by the Center for Digital Lexicography of German&lt;br /&gt;
([https://zdl.org ZDL] and [https://www.dwds.de DWDS]). A significant&lt;br /&gt;
challenge lies in extracting and pre-processing web texts so that they&lt;br /&gt;
meet scientific expectations: web corpus construction involves numerous&lt;br /&gt;
design decisions, and this software package helps facilitate collection&lt;br /&gt;
and enhance corpus quality.&lt;br /&gt;
&lt;br /&gt;
== Features ==&lt;br /&gt;
&lt;br /&gt;
* Robust extraction algorithm preserving&lt;br /&gt;
** Metadata (title, author, date, site name, categories and tags)&lt;br /&gt;
** Structural elements (paragraphs, titles, lists, quotes, code, line breaks, in-line text formatting)&lt;br /&gt;
* Seamless, parallelized online (including page retrieval) or offline processing&lt;br /&gt;
* TEI-XML supported as output format, including document validation&lt;br /&gt;
* Management of download/URL lists (Atom/RSS feeds, URL queues, blacklisting)&lt;br /&gt;
&lt;br /&gt;
== User commentary ==&lt;br /&gt;
&lt;br /&gt;
'''Please sign all comments.'''&lt;br /&gt;
&lt;br /&gt;
== System requirements ==&lt;br /&gt;
&lt;br /&gt;
The software is tested on macOS and Linux; it is expected to work on&lt;br /&gt;
Windows as well. It supports all common Python 3 versions (3.4 upwards).&lt;br /&gt;
&lt;br /&gt;
== Source code and licensing ==&lt;br /&gt;
&lt;br /&gt;
Open-source software under [https://www.gnu.org/licenses/gpl-3.0.en.html GNU General Public License v3.0].&lt;br /&gt;
&lt;br /&gt;
Source code homepage: see [https://github.com/adbar/trafilatura GitHub repository].&lt;br /&gt;
&lt;br /&gt;
== Support for TEI ==&lt;br /&gt;
&lt;br /&gt;
TEI P5 is supported out of the box.&lt;br /&gt;
&lt;br /&gt;
== Language(s) ==&lt;br /&gt;
&lt;br /&gt;
* The tool is written in Python and can be used from within Python or on the command line.&lt;br /&gt;
* The documentation is available in English.&lt;br /&gt;
&lt;br /&gt;
== Documentation ==&lt;br /&gt;
&lt;br /&gt;
* [https://trafilatura.readthedocs.io/ Documentation]&lt;br /&gt;
* [https://trafilatura.readthedocs.io/en/latest/tutorials.html Tutorials]&lt;br /&gt;
** [https://trafilatura.readthedocs.io/en/latest/tutorial2.html Production and validation of TEI files in Python]&lt;br /&gt;
** [http://adrien.barbaresi.eu/blog/validating-tei-xml-python.html Validating TEI-XML documents]&lt;br /&gt;
&lt;br /&gt;
== Tech support ==&lt;br /&gt;
&lt;br /&gt;
Support is available through this [http://adrien.barbaresi.eu/ contact info] or by filing issues&lt;br /&gt;
on the [https://github.com/adbar/trafilatura/issues dedicated page].&lt;br /&gt;
&lt;br /&gt;
== User community ==&lt;br /&gt;
&lt;br /&gt;
[https://github.com/adbar/trafilatura/blob/master/CONTRIBUTING.md Contributions] to code and documentation are welcome! These&lt;br /&gt;
[https://github.com/adbar/trafilatura/graphs/contributors contributors] have already submitted features and fixes.&lt;br /&gt;
&lt;br /&gt;
Tutorial video in German by Simon Meier-Vieracker: [https://www.youtube.com/watch?v=9RPrVE0hHgI Content von Webseiten laden mit Trafilatura].&lt;br /&gt;
&lt;br /&gt;
== History of versions ==&lt;br /&gt;
&lt;br /&gt;
Current version: [https://pypi.org/project/trafilatura/ v0.5.1], 2020-07-15.&lt;br /&gt;
&lt;br /&gt;
* [https://github.com/adbar/trafilatura/blob/master/HISTORY.md Change log]&lt;br /&gt;
* [https://github.com/adbar/trafilatura/releases GitHub releases]&lt;br /&gt;
&lt;br /&gt;
== Download and installation ==&lt;br /&gt;
&lt;br /&gt;
Trafilatura is packaged as a software library available from the package&lt;br /&gt;
repository ''PyPI''. It can be installed with &amp;lt;code&amp;gt;pip&amp;lt;/code&amp;gt; or&lt;br /&gt;
&amp;lt;code&amp;gt;pipenv&amp;lt;/code&amp;gt;, for example: &amp;lt;code&amp;gt;pip install --upgrade trafilatura&amp;lt;/code&amp;gt;.&lt;br /&gt;
&lt;br /&gt;
For more details please read the [https://trafilatura.readthedocs.io/en/latest/installation.html installation documentation].&lt;br /&gt;
&lt;br /&gt;
== Additional notes ==&lt;br /&gt;
&lt;br /&gt;
'''Evaluation and alternatives:''' The extraction focuses on the main&lt;br /&gt;
content, which is usually the part displayed centrally, without the left&lt;br /&gt;
or right bars, the header or the footer, but including potential titles&lt;br /&gt;
and (optionally) comments. This task is also known as ''web scraping'',&lt;br /&gt;
''boilerplate removal'', ''DOM-based content extraction'', ''main&lt;br /&gt;
content identification'', or ''web page cleaning''.&lt;br /&gt;
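&lt;br /&gt;
The idea of boilerplate removal can be illustrated with a deliberately naive sketch. It is not&lt;br /&gt;
Trafilatura's algorithm: the block model, the tag list, the function names and the 0.5 link-density&lt;br /&gt;
threshold are all assumptions made for this example.&lt;br /&gt;
&lt;br /&gt;
```python
# Naive illustration only: NOT Trafilatura's actual algorithm.
# A page is modelled as (tag, text, link_chars) blocks instead of raw HTML;
# link_chars is the number of characters of the text that sit inside links.
BOILERPLATE_TAGS = {"header", "footer", "nav", "aside"}  # assumed tag list


def link_density(text, link_chars):
    """Share of a block's characters that sit inside links."""
    return link_chars / len(text) if text else 1.0


def main_content(blocks, max_link_density=0.5):
    """Keep blocks that are neither structural boilerplate nor link-heavy."""
    kept = []
    for tag, text, link_chars in blocks:
        if tag in BOILERPLATE_TAGS:
            continue  # recurring page furniture
        if link_density(text, link_chars) > max_link_density:
            continue  # mostly links, e.g. a blogroll
        kept.append(text)
    return kept


# Hypothetical page used only for this example.
page = [
    ("header", "Example News", 0),
    ("nav", "Home Politics Sports", 20),
    ("h1", "Web corpora in practice", 0),
    ("p", "Boilerplate removal keeps the article and drops the rest.", 0),
    ("p", "Related: more links here", 16),
    ("footer", "Copyright notice", 0),
]

print(main_content(page))
```
The sketch keeps only the title and the article paragraph. Production extractors typically work on the parsed HTML tree and combine many more signals than these two.&lt;br /&gt;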
&lt;br /&gt;
Reproducible results are published on the&lt;br /&gt;
[https://trafilatura.readthedocs.io/en/latest/evaluation.html evaluation page].&lt;/div&gt;</summary>
		<author><name>Piotr Banski</name></author>
		
	</entry>
</feed>