Difference between revisions of "TXM"

From TEIWiki

Revision as of 21:54, 19 August 2011


Synopsis

TXM is a free, open-source, TEI-compatible environment for text corpus analysis, with a graphical client based on CQP and R. It is available for Microsoft Windows, GNU/Linux, Mac OS X (in alpha) and J2EE. The Textométrie scientific project web site is http://textometrie.ens-lyon.fr/?lang=en.

Features

  • Works on collections of documents in various formats: TXT, XML, XML-TEI P5 (BFM project customization), XML-Transcriber, XML-TMX (aligned corpora - alpha), XML-PPS (Factiva - alpha), etc.
  • Applies various NLP tools to texts on the fly before analysis (e.g. TreeTagger for lemmatization and POS tagging)
  • Indexes words and their properties as well as the hierarchical structure of texts
  • Indexes external or internal metadata of texts or speakers
  • Lets users build various subcorpora and partitions (for contrastive analysis between text structures or groups of words)
  • Provides qualitative analysis tools: indexes and concordances of patterns based on word- and structure-level queries, navigation in rich HTML-based text editions, and display of the layout of pattern occurrences
  • Provides quantitative analysis tools: factorial correspondence analysis, contrastive word specificities, hierarchical classification, co-occurrents of patterns
  • Exports any result in CSV, XML or SVG format
  • Scriptable (in Groovy/Java) to automate repetitive tasks or extend the platform
  • Includes a text editor to edit the sources, results and scripts
  • Runs as a standalone Windows, Mac OS X or Linux application
  • Also runs as a portal web application for accessing and analyzing corpora online through a web browser (with access control management)
  • Open source: based on the best open-source components for text analysis (CQP, R, and Java and XSLT libraries)
  • Modular architecture (Eclipse RCP OSGi and J2EE-conformant): one toolbox connecting all core components is used by all the applications
  • Efficient development framework powered by Eclipse or NetBeans
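
To make the notion of a concordance from the feature list above concrete, here is a minimal keyword-in-context sketch in Python. It is an illustration only, not TXM's implementation: TXM runs queries through CQP, whereas the function name `concordance`, the `width` parameter and the sample sentence below are all hypothetical.

```python
def concordance(tokens, keyword, width=3):
    """Return (left context, hit, right context) triples for each match.

    Hypothetical helper for illustration; matching is a simple
    case-insensitive comparison on the word form.
    """
    lines = []
    for i, tok in enumerate(tokens):
        if tok.lower() == keyword.lower():
            # Up to `width` tokens on each side of the hit
            left = " ".join(tokens[max(0, i - width):i])
            right = " ".join(tokens[i + 1:i + 1 + width])
            lines.append((left, tok, right))
    return lines


tokens = "the king rode out and the queen stayed while the king slept".split()
hits = concordance(tokens, "king", width=2)
for left, hit, right in hits:
    # Right-align the left context so the hits line up in one column
    print(f"{left:>12} [{hit}] {right}")
```

Each triple keeps the hit separate from its context, which is what allows a concordance to be sorted on the left or right context.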

User commentary

Please sign all comments.

System requirements

The standalone version runs on:

  • Windows - 32-bit or 64-bit (tested on Windows XP, Vista and 7)
  • Mac OS X (tested on 10.5 and 10.6)
  • Linux - 32-bit or 64-bit (tested on Ubuntu and Debian)

The portal server should run on any JVM/J2EE-capable platform, but so far it has been tested only on Ubuntu Linux, in Tomcat and GlassFish containers.

Source code and licensing

Open source, released under the GPL v3 licence.

Support for TEI

Supports TEI and TEI Lite "out of the box" at the XML level: words are tokenized inside any #PCDATA, and the whole XML structure is imported directly as textual structure.
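
As a rough illustration of what XML-level import means, the following Python sketch tokenizes every text node and records the chain of enclosing elements for each token. It is a simplified stand-in under stated assumptions, not TXM's importer (which is written in Groovy and XSL): the regular expression, the function name `tokenize_xml` and the sample document are all hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumed tokenization rule: runs of word characters, or single punctuation marks
TOKEN_RE = re.compile(r"\w+|[^\w\s]")


def tokenize_xml(xml_string):
    """Return (token, element path) pairs in document order."""
    root = ET.fromstring(xml_string)
    tokens = []

    def walk(element, path):
        here = path + [element.tag]
        # Text directly inside this element, before its first child
        for tok in TOKEN_RE.findall(element.text or ""):
            tokens.append((tok, "/".join(here)))
        for child in element:
            walk(child, here)
            # Text after a child's end tag still belongs to the parent
            for tok in TOKEN_RE.findall(child.tail or ""):
                tokens.append((tok, "/".join(here)))

    walk(root, [])
    return tokens


sample = "<text><div><p>Hello, <hi>brave</hi> world!</p></div></text>"
result = tokenize_xml(sample)
for token, path in result:
    print(token, path)
```

In this sketch the element path plays the role of the "textual structure": a query can then restrict matches to tokens found inside a given element.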

Supports the TEI P5 encoding semantics used by the Base de Français Médiéval (BFM) project (http://bfm.ens-lyon.fr) at the TEI level: words (#PCDATA, <w>, <num>...), edition (<sic>, <corr>...), structures (<div>, <p>...), notes, etc. See the "BFM encoding manual" (in French) at http://bfm.ens-lyon.fr/article.php3?id_article=158.
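
By contrast, a TEI-level import interprets the elements themselves. The Python sketch below shows the general idea for pre-tokenized words and editorial corrections. It is not the BFM import module: the function name `extract_words`, the `prefer` parameter, the policy of keeping one reading and skipping the other, and the sample fragment are assumptions made for illustration.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"


def tei(tag):
    """Qualified name of a TEI element, as ElementTree reports it."""
    return "{%s}%s" % (TEI_NS, tag)


def extract_words(xml_string, prefer="corr"):
    """Collect (form, element name) pairs from <w> and <num> elements.

    Assumed policy: keep the `prefer` reading ("corr" or "sic") and
    skip the other one entirely.
    """
    root = ET.fromstring(xml_string)
    skip = tei("sic") if prefer == "corr" else tei("corr")
    words = []

    def walk(element):
        if element.tag == skip:
            return  # ignore the reading that was not chosen
        if element.tag in (tei("w"), tei("num")):
            form = "".join(element.itertext()).strip()
            words.append((form, element.tag.split("}")[1]))
            return
        for child in element:
            walk(child)

    walk(root)
    return words


sample = (
    '<text xmlns="http://www.tei-c.org/ns/1.0"><p>'
    "<w>li</w> <w>rois</w> "
    "<choice><sic><w>aa</w></sic><corr><w>a</w></corr></choice> "
    "<num>iii</num> <w>filz</w>"
    "</p></text>"
)
print(extract_words(sample))
print(extract_words(sample, prefer="sic"))
```

The sketch handles only text that is already wrapped in <w> or <num>; words left as plain #PCDATA would still need a tokenizer.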

The "TEI P5 BFM" TXM import module is written entirely as a set of Groovy and XSL scripts, so that users can adapt it to any specific TEI encoding practice.

TXM Import Modules also provide various import parameters to tune each import process to specific data sources.

[Note: The Presses Universitaires de Caen (PUC) center successfully tested the TXM import process on their own TEI text editions (July 2011).]

Language(s)

TXM is written in the following programming languages:

The user interface is currently available in:

  • standalone version:
    • English (EN)
    • French (FR)
  • portal version:
    • English (EN)
    • French (FR)

The documentation is currently available in:

  • standalone version:
    • English (EN)
    • French (FR)
  • portal version:
    • French (FR) (tutorial - alpha state)

TXM works natively with any Unicode-conformant corpus. Language support is specific to each NLP tool (for example, TreeTagger can tag the following languages: BG, DE, EN, ES, ET, FR, FRO, GL, IT, LA, PT, RU, SW, ZH).

Documentation

Tech support

Tech support is mainly provided through two mailing lists (see below).

Users can also use 3 different trackers:

User community

Currently, the TXM user community is mostly active on two mailing lists and a wiki:

TXM is also taught every year at the CNRS summer school called « Computing and Statistical Methods in Text Analysis » (MISAT), see http://laseldi.univ-fcomte.fr/ecole.

The JADT conference (http://jadt.org) is the main place where the TXM user community meets.

Sample implementations

The standalone version of TXM ships with several sample corpora that can be analyzed directly within TXM after installation.

The portal version of TXM has a demo running online at http://txm.risc.cnrs.fr/test/?locale=en (work in progress).

An earlier experimental web application based on TXM, applied to a single TEI-encoded text, can be found at http://txm.risc.cnrs.fr/txm/texte/quete.

Current version number and date of release

  • standalone: current version is 0.5, released March 2011
  • portal: current version is 0.3 beta 2, released July 2011

History of versions

See the Roadmap section on the developer's wiki at http://sourceforge.net/apps/mediawiki/textometrie.

How to download or buy

TXM is free to download:

Additional notes

For publications related to TXM, please visit the Textométrie project web site at http://textometrie.ens-lyon.fr/spip.php?article82&lang=en:

  • See for example:
    Heiden, S. (2010b). The TXM Platform: Building Open-Source Textual Analysis Software Compatible with the TEI Encoding Scheme. In K. I. Ryo Otoguro (Ed.), 24th Pacific Asia Conference on Language, Information and Computation - PACLIC24 (p. 389-398). Institute for Digital Enhancement of Cognitive Development, Waseda University, Sendai, Japan. Online.

Sponsors & Contributors:

  • Initial design and development of TXM (2007-2010) was supported by French ANR grant #ANR-06-CORP-029
  • Currently the platform continues its development through various contracts:
    • Lyon 3 University contract 2010: XML-Transcriber import, R GUI
    • CNRS contract 2010 (DGLFLF grant): GGHF corpus processing
    • ENS-LYON contract 2010 (Rhône-Alpes region Cluster 13 grant): Queste del saint Graal web prototype
    • ENS-LYON contract 2010-2011 (ANR CORPTEF Research Project funding): portal development
  • Other independent projects also improve TXM (community of developers):
    • LASLA project 2011: import of Ancient Latin and Greek corpora
    • GREYC-PUC project 2011: PUC corpora import, improvement of portal, test on Glassfish
    • PhD thesis on micro-finance 2011-: Factiva and Calibre import