[[Category:Tools]]

[[Category:Administrative tools]]
[[Category:Development tools]]
[[Category:Conversion and preprocessing tools]]
[[Category:Publishing and delivery tools]]
[[Category:Querying tools]]
[[Category:Analysis tools]]
[[Category:All-in-one Tools]]
[[Category:Interfaces]]

[[Category:Discovering]]
[[Category:Comparing]]
[[Category:Sampling]]
[[Category:Illustrating]]
[[Category:Representing]]

== Synopsis ==
TXM is a free, open-source, Unicode-, XML- and TEI-compatible text/corpus analysis environment and graphical client based on CQP and R. It is available for Microsoft Windows, Linux and Mac OS X, and as a J2EE web portal.

== Features ==
* Provides qualitative analysis tools:
** KWIC '''concordances''' of word patterns based on the efficient [http://cwb.sourceforge.net CQP] full text search engine and its powerful CQL query language
** word pattern '''frequency lists''' based on any word property (graphical form or type, lemma, pos...)
** word pattern '''progression graphics'''
** Examples of word patterns, expressed in the CQL query language which is based on word & structural level properties:
*** "aiming" to simply search for the word 'aiming'
*** ".*ing" to search for words ending in "ing" (including mainly verb forms)
*** [pos="VERB" & word=".*ing"] to search for verb forms ending in "ing" (where Part of Speech annotation is present)
*** [lemma="group"] []{0,3} [pos="VERB" & word=".*ing"] to search for the collocation <group lemma> followed by a <verb with progressive aspect> with at most 3 words in between
** rich HTML-based text edition navigation with links from all other tools
* Provides quantitative analysis tools, based on [http://www.r-project.org R packages]:
** '''factorial correspondence analysis'''
** '''cluster analysis'''
** '''specific''' word patterns analysis
** '''collocations''' analysis
* Helps to build various corpus configurations: '''sub-corpora''' or '''partitions''' (for contrastive analysis between text structures or word selections)
* Large spectrum of input formats
** several text formats (from raw to rich):
*** '''Unicode TXT'''
*** '''ODT'''
*** '''XML'''
*** '''XML/w''' (where some or all word limits and properties can be pre-encoded)
*** XML-'''TEI P4''' (according to Perseus project practice)
*** XML-'''TEI P5''' (according to various projects practice: BFM, BVH, NLTK, etc.)
** speech transcription: XML-'''TRS''' (from Transcriber software, with time synchro)
** aligned corpora: XML-'''TMX''' (with texts in relation of translation or versioning)
** news portal articles: XML-'''PPS''' (Factiva), Europresse
** etc.
* Applies various NLP tools to texts on the fly before analysis (e.g. '''TreeTagger''' for lemmatization and POS tagging)
* Indexes words and their properties as well as hierarchical structure of texts
* Indexes external or internal text metadata or speaker metadata
* '''Export'''s any result in CSV, XML or SVG format
* Provides scripting facilities for automating repetitive or lengthy tasks or for extending the platform (in the '''Groovy'''/Java dynamic language)
* Includes a complete '''text editor''' to edit data sources, results and scripts
* Runs as a desktop application for '''Windows''', '''Mac OS X''' or '''Linux'''
* Also runs as a '''web portal''' giving online access to corpora and analysis tools through any web browser (including account and access control management)
* '''Open source''' licence: based on the best open source components for text analysis: CQP, R and Java & XSLT libraries
* Modular architecture (Eclipse RCP OSGi and J2EE conformant): one toolbox connecting all core components to build the applications
* Efficient Eclipse or Netbeans powered development framework
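
The CQL examples listed above can be illustrated with a small sketch. This is not TXM or CQP code: it is a toy Python matcher over a hand-made token list, written only to show what the four example queries select and how a KWIC line is built around a hit. The function names, the sample sentence and the tag set are all assumptions made for the illustration; the real CQP engine works on indexed corpora and has its own matching strategy.

```python
import re

# Toy annotated corpus: each token carries word, lemma and pos properties.
# The sentence and the tag set (VERB, NOUN, ...) are illustrative only.
TOKENS = [
    {"word": "The", "lemma": "the", "pos": "DET"},
    {"word": "groups", "lemma": "group", "pos": "NOUN"},
    {"word": "are", "lemma": "be", "pos": "VERB"},
    {"word": "now", "lemma": "now", "pos": "ADV"},
    {"word": "aiming", "lemma": "aim", "pos": "VERB"},
    {"word": "at", "lemma": "at", "pos": "ADP"},
    {"word": "something", "lemma": "something", "pos": "PRON"},
]


def token_matches(token, constraints):
    """True if every listed property fully matches its regular expression."""
    return all(re.fullmatch(rx, token[prop]) for prop, rx in constraints.items())


def find_tokens(tokens, constraints):
    """Positions matching a one-token pattern such as [pos="VERB" & word=".*ing"]."""
    return [i for i, tok in enumerate(tokens) if token_matches(tok, constraints)]


def find_with_gap(tokens, first, second, max_gap):
    """(start, end) pairs for the pattern: first []{0,max_gap} second."""
    hits = []
    for i in find_tokens(tokens, first):
        # A gap of k tokens puts the second token at position i + 1 + k.
        for j in range(i + 1, min(i + 2 + max_gap, len(tokens))):
            if token_matches(tokens[j], second):
                hits.append((i, j))
    return hits


def kwic(tokens, start, end, width=2):
    """Return (left context, pivot, right context) as strings of word forms."""
    words = [tok["word"] for tok in tokens]
    left = " ".join(words[max(0, start - width):start])
    pivot = " ".join(words[start:end + 1])
    right = " ".join(words[end + 1:end + 1 + width])
    return left, pivot, right


if __name__ == "__main__":
    print(find_tokens(TOKENS, {"word": "aiming"}))                 # "aiming"
    print(find_tokens(TOKENS, {"word": ".*ing"}))                  # ".*ing"
    print(find_tokens(TOKENS, {"pos": "VERB", "word": ".*ing"}))   # [pos="VERB" & word=".*ing"]
    for start, end in find_with_gap(
        TOKENS, {"lemma": "group"}, {"pos": "VERB", "word": ".*ing"}, 3
    ):
        print(kwic(TOKENS, start, end))
```

In this toy corpus the plain ".*ing" pattern also selects "something", which is why the list above describes it as matching "mainly" verb forms; adding the pos constraint narrows the result to "aiming".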

== User comments ==
'''Please sign all comments.'''

== System requirements ==
The desktop version runs on:
* Windows - 32bit or 64bit (tested on XP, Vista, 7 and 8)
* Mac OS X (tested on 10.5, 10.6, 10.7, 10.8 and 10.9)
* Linux - 32bit or 64bit (tested on Ubuntu and Debian)

The portal server runs on any J2EE capable platform (tested in Tomcat and Glassfish).

== Source code and licensing ==
TXM is open source, released under the GPL v3 licence.

== Support for TEI ==
Supports TEI and TEI Lite "out of the box" '''at XML level semantics''': words will be tokenized inside any #PCDATA and all the XML structure will be imported directly as textual structure.
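
As a rough illustration of what an import "at XML level semantics" means, the sketch below walks an XML document, tokenizes every piece of character data, and records the element path each word sits in. It is a minimal Python sketch, not TXM's import code: the tokenization rule, the function names and the sample document are assumptions chosen for the example.

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical sample source; any well-formed XML would do.
SAMPLE = (
    '<text><div type="chapter">'
    "<p>Call me Ishmael.</p>"
    "<p>Some years ago, never mind how long.</p>"
    "</div></text>"
)

# Assumed tokenization rule: runs of word characters, or single punctuation marks.
WORD_RE = re.compile(r"\w+|[^\w\s]")


def local_name(tag):
    """Strip a namespace prefix of the form {uri}name, if present."""
    return tag.rsplit("}", 1)[-1]


def tokenize_xml(xml_string):
    """Return (word, element path) pairs for all character data in the document."""
    root = ET.fromstring(xml_string)
    tokens = []

    def walk(elem, parent_path):
        path = parent_path + [local_name(elem.tag)]
        # Text directly inside this element, before its first child.
        for word in WORD_RE.findall(elem.text or ""):
            tokens.append((word, "/".join(path)))
        for child in elem:
            walk(child, path)
            # Text following the child still belongs to the current element.
            for word in WORD_RE.findall(child.tail or ""):
                tokens.append((word, "/".join(path)))

    walk(root, [])
    return tokens


if __name__ == "__main__":
    for word, path in tokenize_xml(SAMPLE):
        print(f"{word}\t{path}")
```

Every element in the path plays the role of a textual structure here, without any knowledge of what <nowiki><div> or <p></nowiki> mean in TEI; interpreting those elements is what the TEI level described next adds.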

Supports various flavours of TEI P5 encoding semantics '''at TEI level semantics''':
* words and their properties: <nowiki>#PCDATA, <w>, <num>...</nowiki>
* editorial markup: <nowiki><sic>, <corr>...</nowiki>
* texts and their properties: <nowiki><TEI>, <text>...</nowiki>
* intermediate text structures and their properties: <nowiki><div>, <p>...</nowiki>
* edition rendering: <nowiki><pb/>, <p>, <lb/>...</nowiki>
* what should not be indexed but considered for edition rendering: <nowiki><teiHeader>, <note>...</nowiki>
* alignment between texts: <nowiki><teiCorpus>, <linkGrp>, <link>...</nowiki>
* words identifier policy: <nowiki>@xml:id</nowiki>
* language declaration policy: <nowiki>@xml:lang</nowiki>
See the "BFM encoding manual" for an example of TEI encoding practice interpreted by TXM, in French, http://bfm.ens-lyon.fr/article.php3?id_article=158.

The "TEI P5 BFM" TXM import module consists of Groovy and XSL scripts: they can be adapted directly by the user to any specific TEI encoding usage.

TXM Import Modules also provide various import parameters to tune each import process to specific data sources.

TEI sources from the following projects are currently imported into TXM at TEI level semantics:
* Perseus Digital Library: http://www.perseus.tufts.edu/hopper
* TextGrid: http://www.textgrid.de/en
* NLTK - Brown Corpus (TEI XML Version): http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml
* Frantext (libre): http://www.cnrtl.fr/corpus/frantext
* Base de Français Médiéval (BFM): http://bfm.ens-lyon.fr
* BVH Epistemon: http://www.bvh.univ-tours.fr/Epistemon
* Bouvard&Pécuchet: http://dossiers-flaubert.ish-lyon.cnrs.fr
* Presses Universitaires de Caen (PUC), MRSH de Caen - Revues.org: http://www.unicaen.fr/recherche/mrsh/document_numerique/outils ([http://discours.revues.org?lang=en DISCOURS scientific journal])
* TXM (TXM own pivot format): https://sourceforge.net/apps/mediawiki/txm/index.php?title=XML-TXM

TEI sources are preprocessed by several XSL stylesheets, which can be found in the TXM source code.
Some of those stylesheets are available in the online TXM XSL stylesheets library:
http://sourceforge.net/projects/txm/files/library/xsl

== Language(s) ==

=== User Interface Language(s) ===
The user interface is currently available in:
* desktop version:
** English (EN)
** French (FR)
** Russian (RU)
* portal version:
** English (EN)
** French (FR)

=== Documentation Language(s) ===
The documentation is currently available in:
* desktop version:
** English (EN)
** French (FR)
* portal version:
** French (FR) (tutorial - alpha state)

=== Text/Corpus Language(s) ===
TXM works natively with any Unicode-conformant corpus.<br/>
Language support is specific to each NLP tool used (for example, TreeTagger can tag the following languages: BG, DE, EN, ES, ET, FR, FRO, GL, IT, LA, PT, RU, SW, ZH).

=== Programming Language(s) ===
TXM is written in the following programming languages:
* '''C''' for CQP search engine (independent open source project http://cwb.sourceforge.net)
* '''C''' and '''R''' for statistical packages (independent open source project http://www.r-project.org)
* '''Java''' for the Toolbox and the Applications (driven by an independent open consortium http://jcp.org/en/home/index)
** Eclipse RCP framework used for the desktop version (independent open source project http://wiki.eclipse.org/index.php/Rich_Client_Platform)
** GWT framework used for the web portal version (independent open source project http://code.google.com/intl/fr/webtoolkit)
* '''Groovy''' for the import modules and command scripts (independent open source project http://groovy.codehaus.org)

== Documentation ==
* Main entry point for documentation on TXM at the Textométrie project web site: http://textometrie.ens-lyon.fr/spip.php?article98&lang=en
** See for example the TXM manual (in French) at http://txm.svn.sourceforge.net/viewvc/txm/trunk/doc/Manuel%20de%20TXM%200.7%20FR.pdf?revision=2332
* TXM user community wiki (in French) at https://listes.cru.fr/wiki/txm-users (includes a FAQ)
* TXM developers wiki (in English) on Sourceforge : http://sourceforge.net/apps/mediawiki/txm
* All available documentation (for users and for developers) published on Sourceforge: http://sourceforge.net/projects/txm/files/documentation

== Tech support ==
Tech support is mainly provided through two mailing lists (see below).

Users can also use 3 different trackers:
* Bug Reports - to describe bugs encountered in the software: https://sourceforge.net/tracker/?group_id=247041&atid=1190738
* Feature requests - to describe the features, changes in interface or any other improvements required in the software: https://sourceforge.net/tracker/?group_id=247041&atid=1190851
* Request for help - to describe a very difficult technical problem encountered in using the software: https://sourceforge.net/tracker/?group_id=247041&atid=1190852

== User community ==
Currently, the TXM user community communicates using two mailing lists and a wiki:
* International mailing list: txm-open AT lists.sourceforge.net (very low activity for the moment)
** See archives at http://sourceforge.net/mailarchive/forum.php?forum_name=txm-open
* The mostly French-speaking mailing list: txm-users AT cru.fr (the most active)
** See archives at https://listes.cru.fr/sympa/arc/txm-users
* TXM user community wiki (in French) at https://listes.cru.fr/wiki/txm-users

Training in the use of TXM is available every year at the CNRS summer school « Computing and Statistical Methods in Text Analysis » (MISAT), see http://laseldi.univ-fcomte.fr/ecole.

The JADT conference (http://jadt.org) is the main meeting place for the TXM user community.

== Sample implementations ==
The desktop version of TXM is delivered with several sample corpora included, which can be directly analyzed from within TXM after installation.

The portal version of TXM has a demo running online at http://portal.textometrie.org/demo/?locale=en.

== Current version number and date of release ==
* TXM desktop: Current version is 0.7.5 released February 2014
* TXM portal: Current version is 0.6alpha released June 2014

== History of versions ==
See the Roadmap section on the developer's wiki at http://sourceforge.net/apps/mediawiki/txm.

== How to download or buy ==
TXM is free to download and use:
* desktop (Windows, Mac, Linux):
** First point your browser to http://sourceforge.net/projects/txm
** Then click on the green Download button to download the setup for your architecture.
* portal (J2EE):
** First choose the archive for your architecture at [https://sourceforge.net/projects/txm/files/software/TXM%20portal https://sourceforge.net/projects/txm/files/software/TXM portal]
** Then follow installation instructions at https://sourceforge.net/apps/mediawiki/txm/index.php?title=TXM_WEB:_Quick_Install
** See also the demo portal http://portal.textometrie.org/demo/?locale=en

== Additional notes ==
For publications related to TXM, please visit the Textométrie project web site at http://textometrie.ens-lyon.fr/spip.php?article82&lang=en:
* See for example:<br/>Heiden, S. (2010b). The TXM Platform: Building Open-Source Textual Analysis Software Compatible with the TEI Encoding Scheme. In K. I. Ryo Otoguro (Ed.), 24th Pacific Asia Conference on Language, Information and Computation - [http://www.compling.jp/paclic24 PACLIC24] (p. 389-398). Institute for Digital Enhancement of Cognitive Development, Waseda University, Sendai, Japan. [http://halshs.archives-ouvertes.fr/halshs-00549764/en Online].

Sponsors & Contributors:
* Initial design and development of TXM (jan 2007- dec 2011) supported by French ANR grant #ANR-06-CORP-029
* Currently the platform continues its development through various contracts:
** ENS-LYON contract jun-aug 2009 (Rhône-Alpes region Cluster 13 grant): Queste del saint Graal web prototype
** ENS-LYON contract sept 2009 - jul 2010 (ANR CORPTEF Research Project funding): portal development
** Lyon 3 University contract jan-mar 2011: XML-Transcriber import, R GUI
** CNRS contract 2011 (DGLFLF grant): GGHF corpus processing
** Paris 1 University contract jan 2012 - dec 2014 (Matrice Equipex): TXM development and infrastructure for historians
* Other independent projects also improve TXM (community of developers):
** LASLA project 2011: import of ancient Latin and Greek corpora
** GREYC-PUC project may-jul 2011: PUC corpora import, improvement of portal, test on Glassfish
** PhD thesis on micro-finance 2011-: Factiva and Calibre import
** ANR-DFG SRCMF contract jun-jul 2012 : Tiger Search module, import & syntactic concordances

TXM

2014-06-06T19:10:29Z

Sheiden: /* Sample implementations */

[[Category:Tools]]

[[Category:Administrative tools]]
[[Category:Development tools]]
[[Category:Conversion and preprocessing tools]]
[[Category:Publishing and delivery tools]]
[[Category:Querying tools]]
[[Category:Analysis tools]]
[[Category:All-in-one Tools]]
[[Category:Interfaces]]

[[Category:Discovering]]
[[Category:Comparing]]
[[Category:Sampling]]
[[Category:Illustrating]]
[[Category:Representing]]

== Synopsis ==
TXM is free, open-source Unicode, XML & TEI compatible text/corpus analysis environment and graphical client based on CQP and R. It is available for Microsoft Windows, Linux, Mac OS X and as a J2EE web portal.

== Features ==
* Provides qualitative analysis tools:
** kwic '''concordances''' of word patterns based on the efficient [http://cwb.sourceforge.net CQP] full text search engine and its powerfull CQL query language
** word pattern '''frequency lists''' based on any word property (graphical form or type, lemma, pos...)
** word pattern '''progression graphics'''
** Examples of word patterns, expressed in the CQL query language which is based on word & structural level properties:
*** "aiming" to simply search for the word 'aiming'
*** ".*ing" to search for words ending in "ing" (including mainly verb forms)
*** [pos="VERB" & word=".*ing"] to search for verb forms ending in ".ing" (where Part of Speech annotation is present)
*** [lemma="group"] []{0,3} [pos="VERB" & word=".*ing"] to search for the collocation <group lemma> followed by a <verb with progressive aspect> with at most 3 words in between
** rich HTML-based text edition navigation with links from all other tools
* Provides quantitative analysis tools, based on [http://www.r-project.org R packages]:
** '''factorial correspondance analysis'''
** '''cluster analysis'''
** '''specific''' word patterns analysis
** '''collocations''' analysis
* Helps to build various corpus configurations: '''sub-corpora''' or '''partitions''' (for contrastive analysis between text structures or word selections)
* Large spectrum of input formats
** several text formats (from raw to rich):
*** '''Unicode TXT'''
*** '''ODT'''
*** '''XML'''
*** '''XML/w''' (where some or all word limits and properties can be pre-encoded)
*** XML-'''TEI P4''' (according to Perseus project practice)
*** XML-'''TEI P5''' (according to various projects practice: BFM, BVH, NLTK, etc.)
** speech transcription: XML-'''TRS''' (from Transcriber software, with time synchro)
** aligned corpora: XML-'''TMX''' (with texts in relation of translation or versioning)
** news portal articles: XML-'''PPS''' (Factiva), Europresse
** etc.
* Applies various NLP tools on the fly on texts before analysis (e.g. '''TreeTagger''' for lemmatization and pos tagging)
* Indexes words and their properties as well as hierarchical structure of texts
* Indexes external or internal text metadata or speaker metadata
* '''Export'''s any result in CSV, XML or SVG format
* Provides Scripting facilities for repetitive or lengthy tasks automation or for platform extension (in '''Groovy'''/Java dynamic language)
* Includes a complete '''text editor''' to edit data sources, results and scripts
* Runs as a desktop application for '''Windows''', '''Mac OS X''' or '''Linux'''
* Runs also as '''web portal''' to give corpora access and analysis online through any web browser (including account and access control management)
* '''Open source''' licence: based on the best open source components for text analysis: CQP, R and Java & XSLT libraries
* Modular architecture (Eclipse RCP OSGi and J2EE conformant): one toolbox connecting all core components to build the applications
* Efficient Eclipse or Netbeans powered development framework

== User comments ==
'''Please sign all comments.'''

== System requirements ==
The desktop version runs on:
* Windows - 32bit or 64bit (tested on XP, Vista, 7 and 8)
* Mac OS X (tested on 10.5, 10.6, 10.7, 10.8 and 10.9)
* Linux - 32bit or 64bit (tested on Ubuntu and Debian)

The portal server runs on any J2EE capable platform (tested in Tomcat and Glassfish).

== Source code and licensing ==
Open Source under GPL V3 licence.

== Support for TEI ==
Supports TEI and TEI Lite "out of the box" '''at XML level semantics''': words will be tokenized inside any #PCDATA and all the XML structure will be imported directly as textual structure.

Supports various flavours of TEI P5 encoding semantics '''at TEI level semantics''':
* words and their properties: <nowiki>#PCDATA, <w>, <num>...</nowiki>
* editorial markup: <nowiki><sic>, <corr>...</nowiki>
* texts and their properties: <nowiki><TEI>, <text>...</nowiki>
* intermediate text structures and their properties: <nowiki><div>, <p>...</nowiki>
* edition rendering: <nowiki><pb/>, <p>, <lb/>...</nowiki>
* what should not be indexed but considered for edition rendering: <nowiki><teiHeader>, <note>...</nowiki>
* alignment between texts: <nowiki><teiCorpus>, <linkGrp>, <link>...</nowiki>
* words identifier policy: <nowiki>@xml:id</nowiki>
* language declaration policy: <nowiki>@xml:lang</nowiki>
See the "BFM encoding manual" for an example of TEI encoding practice interpreted by TXM, in French, http://bfm.ens-lyon.fr/article.php3?id_article=158.

The "TEI P5 BFM" TXM import module consists of Groovy and XSL scripts: they can be adapted directly by the user to any specific TEI encoding usage.

TXM Import Modules also provide various import parameters to tune each import process to specific data sources.

TEI sources from the following projects are currently imported into TXM at TEI level semantics:
* Perseus: http://www.perseus.tufts.edu/hopper
* TextGrid: http://www.textgrid.de/en
* NLTK - Brown Corpus (TEI XML Version): http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml
* Frantext (libre): http://www.cnrtl.fr/corpus/frantext
* Base de Français Médiéval (BFM): http://bfm.ens-lyon.fr
* BVH Epistemon: http://www.bvh.univ-tours.fr/Epistemon
* Bouvard&Pécuchet: http://dossiers-flaubert.ish-lyon.cnrs.fr
* Presses Universitaires de Caen (PUC), MRSH de Caen - Revues.org: http://www.unicaen.fr/recherche/mrsh/document_numerique/outils ([[http://discours.revues.org?lang=en DISCOURS scientific journal]])
* TXM (TXM own pivot format): https://sourceforge.net/apps/mediawiki/txm/index.php?title=XML-TXM

TEI sources are preprocessed by several XSL stylesheets, one can find in TXM source code.
Some of those stylesheets are available in the online TXM XSL stylesheets library:
http://sourceforge.net/projects/txm/files/library/xsl

== Language(s) ==

=== User Interface Language(s) ===
The user interface is currently available in:
* desktop version:
** English (EN)
** French (FR)
** Russian (RU)
* portal version:
** English (EN)
** French (FR)

=== Documentation Language(s) ===
The documentation is currently available in:
* desktop version:
** English (EN)
** French (FR)
* portal version:
** French (FR) (tutorial - alpha state)

=== Text/Corpus Language(s) ===
TXM works natively with any Unicode-conformant corpus.<br/>
Language support is specific to each NLP tool used (for example, TreeTagger can tag the following languages: BG, DE, EN, ES, ET, FR, FRO, GL, IT, LA, PT, RU, SW, ZH).

=== Programming Language(s) ===
TXM is written in the following programming languages:
* '''C''' for CQP search engine (independent open source project http://cwb.sourceforge.net)
* '''C''' and '''R''' for statistical packages (independent open source project http://www.r-project.org)
* '''Java''' for the Toolbox and the Applications (driven by an independent open consortium http://jcp.org/en/home/index)
** Eclipse RCP framework used for the desktop version (independent open source project http://wiki.eclipse.org/index.php/Rich_Client_Platform)
** GWT framework used for the web portal version (independent open source project http://code.google.com/intl/fr/webtoolkit)
* '''Groovy''' for the import modules and command scripts (independent open source project http://groovy.codehaus.org)

== Documentation ==
* Main entry point for documentation on TXM at the Textométrie project web site: http://textometrie.ens-lyon.fr/spip.php?article98&lang=en
** See for example the TXM manual (in French) at http://txm.svn.sourceforge.net/viewvc/txm/trunk/doc/Manuel%20de%20TXM%200.7%20FR.pdf?revision=2332
* TXM user community wiki (in French) at https://listes.cru.fr/wiki/txm-users (includes a FAQ)
* TXM developers wiki (in English) on Sourceforge : http://sourceforge.net/apps/mediawiki/txm
* All available documentation (for users and for developers) published on Sourceforge: http://sourceforge.net/projects/txm/files/documentation

== Tech support ==
Tech support is mainly provided through two mailing lists (see below).

Users can also use 3 different trackers:
* Bug Reports - to describe bugs encountered in the software: https://sourceforge.net/tracker/?group_id=247041&atid=1190738
* Feature requests - to describe the features, changes in interface or any other improvements required in the software: https://sourceforge.net/tracker/?group_id=247041&atid=1190851
* Request for help - to describe a very difficult technical problem encountered in using the software: https://sourceforge.net/tracker/?group_id=247041&atid=1190852

== User community ==
Currently, the TXM user community communicates using two mailing lists and a wiki:
* International mailing list : txm-open AT lists.sourceforge.net (very low activity for the moment)
** See archives at http://sourceforge.net/mailarchive/forum.php?forum_name=txm-open
* The mostly French-speaking mailing list : txm-users AT cru.fr (the most active)
** See archives at https://listes.cru.fr/sympa/arc/txm-users
* TXM user community wiki (in French) at https://listes.cru.fr/wiki/txm-users

Training in the use of TXM is available every year at the CNRS summer school « Computing and Statistical Methods in Text Analysis » (MISAT), see http://laseldi.univ-fcomte.fr/ecole.

The JADT conference (http://jadt.org) is the main meeting place for the TXM user community.

== Sample implementations ==
The desktop version of TXM is delivered with several sample corpora included, which can be directly analyzed from within TXM after installation.

The portal version of TXM has a demo running online at http://portal.textometrie.org/demo/?locale=en.

== Current version number and date of release ==
* TXM desktop: Current version is 0.7.5 released February 2014
* TXM portal: Current version is 0.6alpha released June 2014

== History of versions ==
See the Roadmap section on the developer's wiki at http://sourceforge.net/apps/mediawiki/txm.

== How to download or buy ==
TXM is free to download and use:
* desktop (Windows, Mac, Linux):
** First point your browser to http://sourceforge.net/projects/txm
** Then click on the green Download button to download the setup for your architecture.
* portal (J2EE):
** First choose the archive for your architecture at [https://sourceforge.net/projects/txm/files/software/TXM%20portal https://sourceforge.net/projects/txm/files/software/TXM portal]
** Then follow installation instructions at https://sourceforge.net/apps/mediawiki/txm/index.php?title=TXM_WEB:_Quick_Install
** See also the demo portal http://portal.textometrie.org/demo/?locale=en

== Additional notes ==
For publications related to TXM, please visit the Textométrie project web site at http://textometrie.ens-lyon.fr/spip.php?article82&lang=en:
* See for example:<br/>Heiden, S. (2010b). The TXM Platform: Building Open-Source Textual Analysis Software Compatible with the TEI Encoding Scheme. In K. I. Ryo Otoguro (Ed.), 24th Pacific Asia Conference on Language, Information and Computation - [http://www.compling.jp/paclic24 PACLIC24] (p. 389-398). Institute for Digital Enhancement of Cognitive Development, Waseda University, Sendai, Japan. [http://halshs.archives-ouvertes.fr/halshs-00549764/en Online].

Sponsors & Contributors:
* Initial design and development of TXM (jan 2007- dec 2011) supported by French ANR grant #ANR-06-CORP-029
* Currently the platform continues its development through various contracts:
** ENS-LYON contract jun-aug 2009 (Rhône-Alpes region Cluster 13 grant): Queste del saint Graal web prototype
** ENS-LYON contract sept 2009 - jul 2010 (ANR CORPTEF Research Project funding): portal development
** Lyon 3 University contract jan-mar 2011: XML-Transcriber import, R GUI
** CNRS contract 2011 (DGLFLF grant): GGHF corpus processing
** Paris 1 University contract jan 2012 - dec 2014 (Matrice Equipex): TXM development and infrastructure for historians
* Other independent projects also improve TXM (community of developers):
** LASLA project 2011: import of ancient latin and greek corpora
** GREYC-PUC project may-jul 2011: PUC corpora import, improvement of portal, test on Glassfish
** PhD thesis on micro-finance 2011-: Factiva and Calibre import
** ANR-DFG SRCMF contract jun-jul 2012 : Tiger Search module, import & syntactic concordances

TXM

2014-06-06T19:09:41Z

Sheiden: /* Support for TEI */

[[Category:Tools]]

[[Category:Administrative tools]]
[[Category:Development tools]]
[[Category:Conversion and preprocessing tools]]
[[Category:Publishing and delivery tools]]
[[Category:Querying tools]]
[[Category:Analysis tools]]
[[Category:All-in-one Tools]]
[[Category:Interfaces]]

[[Category:Discovering]]
[[Category:Comparing]]
[[Category:Sampling]]
[[Category:Illustrating]]
[[Category:Representing]]

== Synopsis ==
TXM is free, open-source Unicode, XML & TEI compatible text/corpus analysis environment and graphical client based on CQP and R. It is available for Microsoft Windows, Linux, Mac OS X and as a J2EE web portal.

== Features ==
* Provides qualitative analysis tools:
** kwic '''concordances''' of word patterns based on the efficient [http://cwb.sourceforge.net CQP] full text search engine and its powerfull CQL query language
** word pattern '''frequency lists''' based on any word property (graphical form or type, lemma, pos...)
** word pattern '''progression graphics'''
** Examples of word patterns, expressed in the CQL query language which is based on word & structural level properties:
*** "aiming" to simply search for the word 'aiming'
*** ".*ing" to search for words ending in "ing" (including mainly verb forms)
*** [pos="VERB" & word=".*ing"] to search for verb forms ending in ".ing" (where Part of Speech annotation is present)
*** [lemma="group"] []{0,3} [pos="VERB" & word=".*ing"] to search for the collocation <group lemma> followed by a <verb with progressive aspect> with at most 3 words in between
** rich HTML-based text edition navigation with links from all other tools
* Provides quantitative analysis tools, based on [http://www.r-project.org R packages]:
** '''factorial correspondance analysis'''
** '''cluster analysis'''
** '''specific''' word patterns analysis
** '''collocations''' analysis
* Helps to build various corpus configurations: '''sub-corpora''' or '''partitions''' (for contrastive analysis between text structures or word selections)
* Large spectrum of input formats
** several text formats (from raw to rich):
*** '''Unicode TXT'''
*** '''ODT'''
*** '''XML'''
*** '''XML/w''' (where some or all word limits and properties can be pre-encoded)
*** XML-'''TEI P4''' (according to Perseus project practice)
*** XML-'''TEI P5''' (according to various projects practice: BFM, BVH, NLTK, etc.)
** speech transcription: XML-'''TRS''' (from Transcriber software, with time synchro)
** aligned corpora: XML-'''TMX''' (with texts in relation of translation or versioning)
** news portal articles: XML-'''PPS''' (Factiva), Europresse
** etc.
* Applies various NLP tools on the fly on texts before analysis (e.g. '''TreeTagger''' for lemmatization and pos tagging)
* Indexes words and their properties as well as hierarchical structure of texts
* Indexes external or internal text metadata or speaker metadata
* '''Export'''s any result in CSV, XML or SVG format
* Provides Scripting facilities for repetitive or lengthy tasks automation or for platform extension (in '''Groovy'''/Java dynamic language)
* Includes a complete '''text editor''' to edit data sources, results and scripts
* Runs as a desktop application for '''Windows''', '''Mac OS X''' or '''Linux'''
* Runs also as '''web portal''' to give corpora access and analysis online through any web browser (including account and access control management)
* '''Open source''' licence: based on the best open source components for text analysis: CQP, R and Java & XSLT libraries
* Modular architecture (Eclipse RCP OSGi and J2EE conformant): one toolbox connecting all core components to build the applications
* Efficient Eclipse or Netbeans powered development framework

== User comments ==
'''Please sign all comments.'''

== System requirements ==
The desktop version runs on:
* Windows - 32bit or 64bit (tested on XP, Vista, 7 and 8)
* Mac OS X (tested on 10.5, 10.6, 10.7, 10.8 and 10.9)
* Linux - 32bit or 64bit (tested on Ubuntu and Debian)

The portal server runs on any J2EE capable platform (tested in Tomcat and Glassfish).

== Source code and licensing ==
Open Source under GPL V3 licence.

== Support for TEI ==
Supports TEI and TEI Lite "out of the box" '''at XML level semantics''': words will be tokenized inside any #PCDATA and all the XML structure will be imported directly as textual structure.

Supports various flavours of TEI P5 encoding semantics '''at TEI level semantics''':
* words and their properties: <nowiki>#PCDATA, <w>, <num>...</nowiki>
* editorial markup: <nowiki><sic>, <corr>...</nowiki>
* texts and their properties: <nowiki><TEI>, <text>...</nowiki>
* intermediate text structures and their properties: <nowiki><div>, <p>...</nowiki>
* edition rendering: <nowiki><pb/>, <p>, <lb/>...</nowiki>
* what should not be indexed but considered for edition rendering: <nowiki><teiHeader>, <note>...</nowiki>
* alignment between texts: <nowiki><teiCorpus>, <linkGrp>, <link>...</nowiki>
* words identifier policy: <nowiki>@xml:id</nowiki>
* language declaration policy: <nowiki>@xml:lang</nowiki>
See the "BFM encoding manual" for an example of TEI encoding practice interpreted by TXM, in French, http://bfm.ens-lyon.fr/article.php3?id_article=158.

The "TEI P5 BFM" TXM import module consists of Groovy and XSL scripts: they can be adapted directly by the user to any specific TEI encoding usage.

TXM Import Modules also provide various import parameters to tune each import process to specific data sources.

TEI sources from the following projects are currently imported into TXM at TEI level semantics:
* Perseus: http://www.perseus.tufts.edu/hopper
* TextGrid: http://www.textgrid.de/en
* NLTK - Brown Corpus (TEI XML Version): http://nltk.googlecode.com/svn/trunk/nltk_data/index.xml
* Frantext (libre): http://www.cnrtl.fr/corpus/frantext
* Base de Français Médiéval (BFM): http://bfm.ens-lyon.fr
* BVH Epistemon: http://www.bvh.univ-tours.fr/Epistemon
* Bouvard&Pécuchet: http://dossiers-flaubert.ish-lyon.cnrs.fr
* Presses Universitaires de Caen (PUC), MRSH de Caen - Revues.org: http://www.unicaen.fr/recherche/mrsh/document_numerique/outils ([http://discours.revues.org?lang=en DISCOURS scientific journal])
* TXM (TXM's own pivot format): https://sourceforge.net/apps/mediawiki/txm/index.php?title=XML-TXM

TEI sources are preprocessed by several XSL stylesheets, which can be found in the TXM source code.
Some of those stylesheets are available in the online TXM XSL stylesheets library:
http://sourceforge.net/projects/txm/files/library/xsl

== Language(s) ==

=== User Interface Language(s) ===
The user interface is currently available in:
* desktop version:
** English (EN)
** French (FR)
** Russian (RU)
* portal version:
** English (EN)
** French (FR)

=== Documentation Language(s) ===
The documentation is currently available in:
* desktop version:
** English (EN)
** French (FR)
* portal version:
** French (FR) (tutorial - alpha state)

=== Text/Corpus Language(s) ===
TXM works natively with any Unicode-conformant corpus.<br/>
Language support is specific to each NLP tool used (for example, TreeTagger can tag the following languages: BG, DE, EN, ES, ET, FR, FRO, GL, IT, LA, PT, RU, SW, ZH).

=== Programming Language(s) ===
TXM is written in the following programming languages:
* '''C''' for the CQP search engine (independent open source project http://cwb.sourceforge.net)
* '''C''' and '''R''' for the statistical packages (independent open source project http://www.r-project.org)
* '''Java''' for the Toolbox and the Applications (driven by an independent open consortium http://jcp.org/en/home/index)
** Eclipse RCP framework used for the desktop version (independent open source project http://wiki.eclipse.org/index.php/Rich_Client_Platform)
** GWT framework used for the web portal version (independent open source project http://code.google.com/intl/fr/webtoolkit)
* '''Groovy''' for the import modules and command scripts (independent open source project http://groovy.codehaus.org)

== Documentation ==
* Main entry point for documentation on TXM at the Textométrie project web site: http://textometrie.ens-lyon.fr/spip.php?article98&lang=en
** See for example the TXM manual (in French) at http://txm.svn.sourceforge.net/viewvc/txm/trunk/doc/Manuel%20de%20TXM%200.7%20FR.pdf?revision=2332
* TXM user community wiki (in French) at https://listes.cru.fr/wiki/txm-users (includes a FAQ)
* TXM developers wiki (in English) on SourceForge: http://sourceforge.net/apps/mediawiki/txm
* All available documentation (for users and for developers) published on Sourceforge: http://sourceforge.net/projects/txm/files/documentation

== Tech support ==
Tech support is mainly provided through two mailing lists (see below).

Users can also use three different trackers:
* Bug Reports - to describe bugs encountered in the software: https://sourceforge.net/tracker/?group_id=247041&atid=1190738
* Feature requests - to describe the features, changes in interface or any other improvements required in the software: https://sourceforge.net/tracker/?group_id=247041&atid=1190851
* Request for help - to describe a very difficult technical problem encountered in using the software: https://sourceforge.net/tracker/?group_id=247041&atid=1190852

== User community ==
Currently, the TXM user community communicates using two mailing lists and a wiki:
* International mailing list: txm-open AT lists.sourceforge.net (very low activity for the moment)
** See archives at http://sourceforge.net/mailarchive/forum.php?forum_name=txm-open
* Mostly French-speaking mailing list: txm-users AT cru.fr (the most active)
** See archives at https://listes.cru.fr/sympa/arc/txm-users
* TXM user community wiki (in French) at https://listes.cru.fr/wiki/txm-users

Training in the use of TXM is available every year at the CNRS summer school "Computing and Statistical Methods in Text Analysis" (MISAT); see http://laseldi.univ-fcomte.fr/ecole.

The JADT conference (http://jadt.org) is the main meeting place for the TXM user community.

== Sample implementations ==
The desktop version of TXM is delivered with several sample corpora included, which can be directly analyzed from within TXM after installation.

The portal version of TXM has a demo running online at http://portal.textometrie.org/demo/?locale=en (work in progress).

An earlier experimental web application based on TXM, applied to a single TEI-encoded text, can be found at http://txm.ish-lyon.cnrs.fr/txm.

== Current version number and date of release ==
* TXM desktop: current version is 0.7.5, released February 2014
* TXM portal: current version is 0.6alpha, released June 2014

== History of versions ==
See the Roadmap section on the developers wiki at http://sourceforge.net/apps/mediawiki/txm.

== How to download or buy ==
TXM is free to download and use:
* desktop (Windows, Mac, Linux):
** First point your browser to http://sourceforge.net/projects/txm
** Then click on the green Download button to download the setup for your architecture.
* portal (J2EE):
** First choose the archive for your architecture at [https://sourceforge.net/projects/txm/files/software/TXM%20portal https://sourceforge.net/projects/txm/files/software/TXM portal]
** Then follow installation instructions at https://sourceforge.net/apps/mediawiki/txm/index.php?title=TXM_WEB:_Quick_Install
** See also the demo portal http://portal.textometrie.org/demo/?locale=en

== Additional notes ==
For publications related to TXM, please visit the Textométrie project web site at http://textometrie.ens-lyon.fr/spip.php?article82&lang=en:
* See for example:<br/>Heiden, S. (2010b). The TXM Platform: Building Open-Source Textual Analysis Software Compatible with the TEI Encoding Scheme. In K. I. Ryo Otoguro (Ed.), 24th Pacific Asia Conference on Language, Information and Computation - [http://www.compling.jp/paclic24 PACLIC24] (p. 389-398). Institute for Digital Enhancement of Cognitive Development, Waseda University, Sendai, Japan. [http://halshs.archives-ouvertes.fr/halshs-00549764/en Online].

Sponsors & Contributors:
* Initial design and development of TXM (Jan 2007 - Dec 2011) supported by French ANR grant #ANR-06-CORP-029
* Currently, development of the platform continues through various contracts:
** ENS-LYON contract Jun-Aug 2009 (Rhône-Alpes region Cluster 13 grant): Queste del saint Graal web prototype
** ENS-LYON contract Sept 2009 - Jul 2010 (ANR CORPTEF Research Project funding): portal development
** Lyon 3 University contract Jan-Mar 2011: XML-Transcriber import, R GUI
** CNRS contract 2011 (DGLFLF grant): GGHF corpus processing
** Paris 1 University contract Jan 2012 - Dec 2014 (Matrice Equipex): TXM development and infrastructure for historians
* Other independent projects also improve TXM (community of developers):
** LASLA project 2011: import of ancient Latin and Greek corpora
** GREYC-PUC project May-Jul 2011: PUC corpora import, improvement of portal, test on Glassfish
** PhD thesis on micro-finance 2011-: Factiva and Calibre import
** ANR-DFG SRCMF contract Jun-Jul 2012: Tiger Search module, import & syntactic concordances