Difference between revisions of "TXM"

From TEIWiki
Revision as of 14:27, 27 November 2011


Synopsis

TXM is a free, open-source, TEI-compatible text/corpus analysis environment and graphical client based on CQP and R. It is available for Microsoft Windows, GNU/Linux and Mac OS X (in alpha), and as a J2EE web portal.

Features

  • Provides qualitative analysis tools:
    • concordances of lexical patterns based on the efficient CQP full text search engine and its CQL query language
    • CQL pattern frequency lists for any word property (type, lemma, pos...)
    • CQL pattern occurrence graphics
    • lexical patterns are expressed in the CQL query language, based on word and structure level properties, for example:
      • "aiming" to simply search for the word 'aiming'
      • ".*ing" to search for words ending in "ing" (mostly verb forms)
      • [pos="VERB" & word=".*ing"] to search for verb forms ending in "ing" (where Part of Speech annotation is present)
      • [lemma="group"] []{0,3} [pos="VERB" & word=".*ing"] to search for the collocation <group lemma> followed by a <verb with progressive aspect> with at most 3 words in between
    • rich HTML-based text edition navigation with links from all other tools
  • Provides quantitative analysis tools, based on R packages:
    • factorial correspondence analysis
    • contrastive word specificities
    • hierarchical classification
    • analysis of cooccurring words or lexical patterns
  • May be used with any collection of Unicode encoded documents in various formats: TXT, XML, XML-TEI P5 (BFM project customization), XML-Transcriber, XML-TMX (aligned corpora - alpha), XML-PPS (Factiva - alpha), etc.
  • Applies various NLP tools to texts on the fly before analysis (e.g. TreeTagger for lemmatization and POS tagging)
  • Indexes words and their properties as well as hierarchical structure of texts
  • Indexes external or internal metadata of texts or speakers
  • Allows construction of various subcorpora and partitions (for contrastive analysis between text structures or groups of words)
  • Exports any result in CSV, XML or SVG format
  • Scripting possible for automation of repetitive tasks or platform extension (in Groovy/Java)
  • Includes a text editor to edit data sources, results and scripts
  • Runs as standalone Windows, Mac OS X or Linux application
  • Also runs as a web portal to access and analyze corpora online through a web browser (with access control management)
  • Open source: based on the best open source components for text analysis: CQP, R and Java & XSLT libraries
  • Modular architecture (Eclipse RCP OSGi and J2EE conformant): one toolbox connecting all core components is used by all the applications
  • Efficient Eclipse or Netbeans powered development framework
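
The token-level CQL patterns listed above can be illustrated with a small Python sketch. This is not how CQP is implemented and it does not use any TXM or CQP API; the token list, the function names and the dictionary-based constraint format are all assumptions made for illustration. It only mimics the semantics of a query such as [lemma="group"] []{0,3} [pos="VERB" & word=".*ing"]: each bracket constrains one token, each regular expression must match the whole property value, and []{0,3} allows up to three arbitrary tokens in between.

```python
import re

# Hypothetical annotated tokens: each token carries word, lemma and pos properties.
TOKENS = [
    {"word": "The", "lemma": "the", "pos": "DET"},
    {"word": "groups", "lemma": "group", "pos": "NOUN"},
    {"word": "are", "lemma": "be", "pos": "VERB"},
    {"word": "now", "lemma": "now", "pos": "ADV"},
    {"word": "aiming", "lemma": "aim", "pos": "VERB"},
    {"word": "high", "lemma": "high", "pos": "ADV"},
]


def token_matches(token, constraints):
    """True if every property regex matches the whole property value."""
    return all(re.fullmatch(rx, token[prop]) for prop, rx in constraints.items())


def find_pattern(tokens, first, last, max_gap):
    """Return (start, end) index pairs where a token matching `first` is
    followed, after at most `max_gap` arbitrary tokens, by one matching `last`."""
    hits = []
    for i, tok in enumerate(tokens):
        if not token_matches(tok, first):
            continue
        # The closing token may sit at most max_gap + 1 positions after the start.
        for j in range(i + 1, min(i + 2 + max_gap, len(tokens))):
            if token_matches(tokens[j], last):
                hits.append((i, j))
                break  # this sketch keeps only the nearest closing token
    return hits


# Rough equivalent of: [lemma="group"] []{0,3} [pos="VERB" & word=".*ing"]
for start, end in find_pattern(
    TOKENS, {"lemma": "group"}, {"pos": "VERB", "word": ".*ing"}, 3
):
    print(" ".join(t["word"] for t in TOKENS[start:end + 1]))
```

With the sample tokens, the only match runs from "groups" to "aiming"; lowering the gap to 1 yields no match, because two tokens separate them.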

User comments

Please sign all comments.

System requirements

The standalone version runs on:

  • Windows - 32bit or 64bit (tested on XP, Vista and 7)
  • Mac OS X (tested on 10.5 and 10.6)
  • Linux - 32bit or 64bit (tested on Ubuntu and Debian)

The portal server should run on any JVM/J2EE capable platform but has so far been tested only on Ubuntu Linux, in Tomcat or Glassfish containers.

Source code and licensing

Open Source under GPL V3 licence.

Support for TEI

Supports TEI and TEI Lite "out of the box" at the XML level: words will be tokenized inside any #PCDATA and all the XML structure will be imported directly as textual structure.
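
As a rough illustration of XML-level import, the following Python sketch tokenizes the character data of a small XML fragment and records, for each word, the element path it was found in. It is only a sketch under stated assumptions: the sample text, the function names and the simple regular-expression tokenizer are invented here, namespaces are omitted for brevity, and the real TXM import modules are Groovy and XSL scripts that are not reproduced.

```python
import re
import xml.etree.ElementTree as ET

# Invented sample fragment; real TEI documents also declare the TEI namespace.
SAMPLE = '<text><div type="chapter"><p>Li rois <w>Artus</w> tint cort.</p></div></text>'


def split_words(chars):
    """Very simple tokenizer: runs of word characters, or single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", chars)


def tokenize_xml(xml_string):
    """Return (word, element path) pairs for all character data in the document."""
    tokens = []

    def walk(elem, parents):
        path = parents + [elem.tag]
        here = "/".join(path)
        if elem.text:
            tokens.extend((w, here) for w in split_words(elem.text))
        for child in elem:
            walk(child, path)
            # Text following a child element still belongs to the current element.
            if child.tail:
                tokens.extend((w, here) for w in split_words(child.tail))

    walk(ET.fromstring(xml_string), [])
    return tokens


for word, path in tokenize_xml(SAMPLE):
    print(word, path)
```

Every word keeps a reference to its enclosing structure (here a path such as text/div/p), which is the kind of information that makes structure-level queries and partitions possible after import.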

Supports the TEI P5 encoding semantics used by the Base de Français Médiéval (BFM) project (http://bfm.ens-lyon.fr) at the TEI level: words - #PCDATA, <w>, <num>..., edition - <sic>, <corr>..., structures - <div>, <p>..., notes, etc. See the "BFM encoding manual" (in French) at http://bfm.ens-lyon.fr/article.php3?id_article=158.

The "TEI P5 BFM" TXM import module consists only of Groovy and XSL scripts, so that users can adapt it to any specific TEI encoding usage.

TXM Import Modules also provide various import parameters to tune each import process to specific data sources.

[Note: The Presses Universitaires de Caen (PUC) center has successfully used the TXM import process on their own TEI text editions (July 2011).]

Language(s)

User Interface Language(s)

The user interface is currently available in:

  • standalone version:
    • English (EN)
    • French (FR)
  • portal version:
    • English (EN)
    • French (FR)

Documentation Language(s)

The documentation is currently available in:

  • standalone version:
    • English (EN)
    • French (FR)
  • portal version:
    • French (FR) (tutorial - alpha state)

Text/Corpus Language(s)

TXM works natively with any Unicode-conformant corpus.
Language support is specific to each NLP tool used (for example, TreeTagger can tag the following languages: BG, DE, EN, ES, ET, FR, FRO, GL, IT, LA, PT, RU, SW, ZH).

Programming Language(s)

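TXM is written in the following programming languages:

  • C for CQP search engine (independent open source project http://cwb.sourceforge.net)
  • C and R for statistical packages (independent open source project http://www.r-project.org)
  • Java for the Toolbox and the Applications (driven by an independent open consortium http://jcp.org/en/home/index)
    • Eclipse RCP framework used for the standalone version (independent open source project http://wiki.eclipse.org/index.php/Rich_Client_Platform)
    • GWT framework used for the web portal version (independent open source project http://code.google.com/intl/fr/webtoolkit)
  • Groovy for the import modules and command scripts (independent open source project http://groovy.codehaus.org)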

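Documentation

  • Main entry point for documentation on TXM at the Textométrie project web site: http://textometrie.ens-lyon.fr/spip.php?article98&lang=en
    • See for example the online TXM reference manual at http://txm.sourceforge.net/doc/refman/TXMReferenceManual0.5_EN.xhtml
  • TXM user community wiki (in French) at https://listes.cru.fr/wiki/txm-users (includes a FAQ)
  • TXM developers wiki (in English) on Sourceforge: http://sourceforge.net/apps/mediawiki/txm
  • All available documentation (for users and for developers) published on Sourceforge: http://sourceforge.net/projects/txm/files/documentation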

Tech support

Tech support is mainly provided through two mailing lists (see below).

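Users can also use 3 different trackers:

  • Bug Reports - to describe bugs encountered in the software: https://sourceforge.net/tracker/?group_id=247041&atid=1190738
  • Feature requests - to describe the features, changes in interface or any other improvements required in the software: https://sourceforge.net/tracker/?group_id=247041&atid=1190851
  • Request for help - to describe a very difficult technical problem encountered in using the software: https://sourceforge.net/tracker/?group_id=247041&atid=1190852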

User community

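Currently, the TXM user community communicates using two mailing lists and a wiki:

  • International mailing list: txm-open AT lists.sourceforge.net (very low activity for the moment)
    • See archives at http://sourceforge.net/mailarchive/forum.php?forum_name=txm-open
  • The mostly French-speaking mailing list: txm-users AT cru.fr (the most active)
    • See archives at https://listes.cru.fr/sympa/arc/txm-users
  • TXM user community wiki (in French) at https://listes.cru.fr/wiki/txm-users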

Training in the use of TXM is available every year at the CNRS summer school « Computing and Statistical Methods in Text Analysis » (MISAT), see http://laseldi.univ-fcomte.fr/ecole.

The JADT conference (http://jadt.org) is the main meeting place for the TXM user community.

Sample implementations

The standalone version of TXM is delivered with several sample corpora included, which can be directly analyzed from within TXM after installation.

The portal version of TXM has a demo running online at http://txm.risc.cnrs.fr/demo/?locale=en (work in progress).

An earlier experimental web application based on TXM, applied to one TEI-encoded text, can be found at http://txm.risc.cnrs.fr/txm/texte/quete.

Current version number and date of release

  • standalone: Current version is 0.5 released March 2011
  • portal: Current version is 0.3 beta 2 released July 2011

History of versions

See the Roadmap section on the developer's wiki at http://sourceforge.net/apps/mediawiki/txm.

How to download or buy

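TXM is free to download and use:

  • standalone:
    • First point your browser to http://sourceforge.net/projects/txm
    • Then click on the green Download button to download the setup for your architecture.
    • [Note for Mac users: Running TXM on Mac is still experimental, please read the Mac setup FAQ entry at https://listes.cru.fr/wiki/txm-users/public/faq#comment_installer_txm_05_sur_mac_os_x (in French, sorry!)]
  • portal: No Web Application ARchive (WAR) release yet. Please follow the instructions on the developer's wiki at http://sourceforge.net/apps/mediawiki/txm/index.php?title=Build_the_toolbox_or_the_application#TXM-WEB_:_GWT_web_application to install and run from sources.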

Additional notes

For publications related to TXM, please visit the Textométrie project web site at http://textometrie.ens-lyon.fr/spip.php?article82&lang=en:

  • See for example:
    Heiden, S. (2010b). The TXM Platform: Building Open-Source Textual Analysis Software Compatible with the TEI Encoding Scheme. In K. I. Ryo Otoguro (Ed.), 24th Pacific Asia Conference on Language, Information and Computation - PACLIC24 (p. 389-398). Institute for Digital Enhancement of Cognitive Development, Waseda University, Sendai, Japan. Online.

Sponsors & Contributors:

  • Initial design and development of TXM (2007-2010) supported by French ANR grant #ANR-06-CORP-029
  • Currently the platform continues its development through various contracts:
    • Lyon 3 University contract 2010: XML-Transcriber import, R GUI
    • CNRS contract 2010 (DGLFLF grant): GGHF corpus processing
    • ENS-LYON contract 2010 (Rhône-Alpes region Cluster 13 grant): Queste del saint Graal web prototype
    • ENS-LYON contract 2010-2011 (ANR CORPTEF Research Project funding): portal development
  • Other independent projects also improve TXM (community of developers):
    • LASLA project 2011: import of ancient Latin and Greek corpora
    • GREYC-PUC project 2011: PUC corpora import, improvement of portal, test on Glassfish
    • PhD thesis on micro-finance 2011-: Factiva and Calibre import