<?xml version="1.0"?>
<feed xmlns="http://www.w3.org/2005/Atom" xml:lang="en">
	<id>https://wiki.tei-c.org/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Sloiseau</id>
	<title>TEIWiki - User contributions [en]</title>
	<link rel="self" type="application/atom+xml" href="https://wiki.tei-c.org/api.php?action=feedcontributions&amp;feedformat=atom&amp;user=Sloiseau"/>
	<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=Special:Contributions/Sloiseau"/>
	<updated>2026-04-21T15:24:47Z</updated>
	<subtitle>User contributions</subtitle>
	<generator>MediaWiki 1.32.0</generator>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=Wiki2TEI&amp;diff=3669</id>
		<title>Wiki2TEI</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=Wiki2TEI&amp;diff=3669"/>
		<updated>2007-10-11T11:55:57Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: /* Language(s) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Tools]]&lt;br /&gt;
[[Category:Conversion and preprocessing tools]]&lt;br /&gt;
&lt;br /&gt;
== Synopsis ==&lt;br /&gt;
&lt;br /&gt;
The MediaWiki format is used by Wikimedia Foundation wikis (Wikipedia, Wikibooks, Wikisource) and by many other wikis running the MediaWiki software. Large amounts of free, high-quality structured text are available in this format. These texts are increasingly used in NLP (natural language processing) projects. However, the MediaWiki parser is oriented towards rendition, and the MediaWiki syntax is complex and hard to parse.&lt;br /&gt;
 &lt;br /&gt;
The Wiki2Tei converter makes available the information carried by the wiki syntax (structure, highlighting, etc.) and allows the plain text to be properly retrieved. The conversion is intended to preserve all the properties of the original text. Wiki2Tei is closely coupled with the MediaWiki software, which allows all the features of the MediaWiki syntax to be converted.&lt;br /&gt;
&lt;br /&gt;
== Features ==&lt;br /&gt;
* Tools for converting from a MediaWiki database or from a collection of files.&lt;br /&gt;
* Tools for checking well-formedness and validation.&lt;br /&gt;
* The vocabulary is documented using ODD syntax.&lt;br /&gt;
* The Wiki2Tei converter works with MediaWiki 1.5.&lt;br /&gt;
&lt;br /&gt;
== System requirements ==&lt;br /&gt;
&lt;br /&gt;
You need a working MediaWiki installation, that is:&lt;br /&gt;
- a MySQL server&lt;br /&gt;
- a PHP 5 interpreter&lt;br /&gt;
- some third-party tools for specific tasks (openjade, xsltproc)&lt;br /&gt;
&lt;br /&gt;
== Source code and licensing ==&lt;br /&gt;
&lt;br /&gt;
This software is released under a BSD licence.&lt;br /&gt;
&lt;br /&gt;
== Support for TEI ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Language(s) ==&lt;br /&gt;
&lt;br /&gt;
The converter is written in PHP 5 only.&lt;br /&gt;
&lt;br /&gt;
== Documentation ==&lt;br /&gt;
&lt;br /&gt;
Documentation is available online:&lt;br /&gt;
&lt;br /&gt;
    http://wiki2tei.sourceforge.net/&lt;br /&gt;
    http://wiki2tei.sourceforge.net/Wiki2TeiHelp.html&lt;br /&gt;
&lt;br /&gt;
== Tech support ==&lt;br /&gt;
&lt;br /&gt;
Technical support is provided through the following mailing list:&lt;br /&gt;
&lt;br /&gt;
    https://lists.sourceforge.net/lists/listinfo/wiki2tei-users&lt;br /&gt;
&lt;br /&gt;
== User community ==&lt;br /&gt;
&lt;br /&gt;
The community tools of the SourceForge web site may be used.&lt;br /&gt;
&lt;br /&gt;
== Sample implementations ==&lt;br /&gt;
&lt;br /&gt;
    http://wiki2tei.sourceforge.net/demo/&lt;br /&gt;
&lt;br /&gt;
== Current version number and date of release ==&lt;br /&gt;
&lt;br /&gt;
Version 1.0, released 2007-10-10.&lt;br /&gt;
&lt;br /&gt;
== History of versions ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== How to download or buy ==&lt;br /&gt;
&lt;br /&gt;
    http://sourceforge.net/project/showfiles.php?group_id=198407&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=Wiki2TEI&amp;diff=3668</id>
		<title>Wiki2TEI</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=Wiki2TEI&amp;diff=3668"/>
		<updated>2007-10-11T11:55:38Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: /* Features */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Tools]]&lt;br /&gt;
[[Category:Conversion and preprocessing tools]]&lt;br /&gt;
&lt;br /&gt;
== Synopsis ==&lt;br /&gt;
&lt;br /&gt;
The MediaWiki format is used by Wikimedia Foundation wikis (Wikipedia, Wikibooks, Wikisource) and by many other wikis running the MediaWiki software. Large amounts of free, high-quality structured text are available in this format. These texts are increasingly used in NLP (natural language processing) projects. However, the MediaWiki parser is oriented towards rendition, and the MediaWiki syntax is complex and hard to parse.&lt;br /&gt;
 &lt;br /&gt;
The Wiki2Tei converter makes available the information carried by the wiki syntax (structure, highlighting, etc.) and allows the plain text to be properly retrieved. The conversion is intended to preserve all the properties of the original text. Wiki2Tei is closely coupled with the MediaWiki software, which allows all the features of the MediaWiki syntax to be converted.&lt;br /&gt;
&lt;br /&gt;
== Features ==&lt;br /&gt;
* Tools for converting from a MediaWiki database or from a collection of files.&lt;br /&gt;
* Tools for checking well-formedness and validation.&lt;br /&gt;
* The vocabulary is documented using ODD syntax.&lt;br /&gt;
* The Wiki2Tei converter works with MediaWiki 1.5.&lt;br /&gt;
&lt;br /&gt;
== System requirements ==&lt;br /&gt;
&lt;br /&gt;
You need a working MediaWiki installation, that is:&lt;br /&gt;
- a MySQL server&lt;br /&gt;
- a PHP 5 interpreter&lt;br /&gt;
- some third-party tools for specific tasks (openjade, xsltproc)&lt;br /&gt;
&lt;br /&gt;
== Source code and licensing ==&lt;br /&gt;
&lt;br /&gt;
This software is released under a BSD licence.&lt;br /&gt;
&lt;br /&gt;
== Support for TEI ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Language(s) ==&lt;br /&gt;
&lt;br /&gt;
The converter is written in PHP 5 only.&lt;br /&gt;
&lt;br /&gt;
== Documentation ==&lt;br /&gt;
&lt;br /&gt;
Documentation is available online:&lt;br /&gt;
&lt;br /&gt;
    http://wiki2tei.sourceforge.net/&lt;br /&gt;
    http://wiki2tei.sourceforge.net/Wiki2TeiHelp.html&lt;br /&gt;
&lt;br /&gt;
== Tech support ==&lt;br /&gt;
&lt;br /&gt;
Technical support is provided through the following mailing list:&lt;br /&gt;
&lt;br /&gt;
    https://lists.sourceforge.net/lists/listinfo/wiki2tei-users&lt;br /&gt;
&lt;br /&gt;
== User community ==&lt;br /&gt;
&lt;br /&gt;
The community tools of the SourceForge web site may be used.&lt;br /&gt;
&lt;br /&gt;
== Sample implementations ==&lt;br /&gt;
&lt;br /&gt;
    http://wiki2tei.sourceforge.net/demo/&lt;br /&gt;
&lt;br /&gt;
== Current version number and date of release ==&lt;br /&gt;
&lt;br /&gt;
Version 1.0, released 2007-10-10.&lt;br /&gt;
&lt;br /&gt;
== History of versions ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== How to download or buy ==&lt;br /&gt;
&lt;br /&gt;
    http://sourceforge.net/project/showfiles.php?group_id=198407&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=Wiki2TEI&amp;diff=3667</id>
		<title>Wiki2TEI</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=Wiki2TEI&amp;diff=3667"/>
		<updated>2007-10-11T11:54:40Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;[[Category:Tools]]&lt;br /&gt;
[[Category:Conversion and preprocessing tools]]&lt;br /&gt;
&lt;br /&gt;
== Synopsis ==&lt;br /&gt;
&lt;br /&gt;
The MediaWiki format is used by Wikimedia Foundation wikis (Wikipedia, Wikibooks, Wikisource) and by many other wikis running the MediaWiki software. Large amounts of free, high-quality structured text are available in this format. These texts are increasingly used in NLP (natural language processing) projects. However, the MediaWiki parser is oriented towards rendition, and the MediaWiki syntax is complex and hard to parse.&lt;br /&gt;
 &lt;br /&gt;
The Wiki2Tei converter makes available the information carried by the wiki syntax (structure, highlighting, etc.) and allows the plain text to be properly retrieved. The conversion is intended to preserve all the properties of the original text. Wiki2Tei is closely coupled with the MediaWiki software, which allows all the features of the MediaWiki syntax to be converted.&lt;br /&gt;
&lt;br /&gt;
== Features ==&lt;br /&gt;
* Tools for converting from a MediaWiki database or from a collection of files&lt;br /&gt;
* Tools for checking well-formedness and validation&lt;br /&gt;
* Documentation of the vocabulary used, in an ODD document&lt;br /&gt;
* Works with MediaWiki 1.5&lt;br /&gt;
&lt;br /&gt;
== System requirements ==&lt;br /&gt;
&lt;br /&gt;
You need a working MediaWiki installation, that is:&lt;br /&gt;
- a MySQL server&lt;br /&gt;
- a PHP 5 interpreter&lt;br /&gt;
- some third-party tools for specific tasks (openjade, xsltproc)&lt;br /&gt;
&lt;br /&gt;
== Source code and licensing ==&lt;br /&gt;
&lt;br /&gt;
This software is released under a BSD licence.&lt;br /&gt;
&lt;br /&gt;
== Support for TEI ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== Language(s) ==&lt;br /&gt;
&lt;br /&gt;
The converter is written in PHP 5 only.&lt;br /&gt;
&lt;br /&gt;
== Documentation ==&lt;br /&gt;
&lt;br /&gt;
Documentation is available online:&lt;br /&gt;
&lt;br /&gt;
    http://wiki2tei.sourceforge.net/&lt;br /&gt;
    http://wiki2tei.sourceforge.net/Wiki2TeiHelp.html&lt;br /&gt;
&lt;br /&gt;
== Tech support ==&lt;br /&gt;
&lt;br /&gt;
Technical support is provided through the following mailing list:&lt;br /&gt;
&lt;br /&gt;
    https://lists.sourceforge.net/lists/listinfo/wiki2tei-users&lt;br /&gt;
&lt;br /&gt;
== User community ==&lt;br /&gt;
&lt;br /&gt;
The community tools of the SourceForge web site may be used.&lt;br /&gt;
&lt;br /&gt;
== Sample implementations ==&lt;br /&gt;
&lt;br /&gt;
    http://wiki2tei.sourceforge.net/demo/&lt;br /&gt;
&lt;br /&gt;
== Current version number and date of release ==&lt;br /&gt;
&lt;br /&gt;
Version 1.0, released 2007-10-10.&lt;br /&gt;
&lt;br /&gt;
== History of versions ==&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
== How to download or buy ==&lt;br /&gt;
&lt;br /&gt;
    http://sourceforge.net/project/showfiles.php?group_id=198407&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2819</id>
		<title>CorpusReader</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2819"/>
		<updated>2006-10-26T23:54:03Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Summary =&lt;br /&gt;
&lt;br /&gt;
CorpusReader (CR) is a tool for extracting subcorpora, KWIC &lt;br /&gt;
concordances, and quantitative information from arbitrarily large &lt;br /&gt;
corpora in the TEI vocabulary. It aims to provide ways of &lt;br /&gt;
processing corpora containing milestoned annotation, and it &lt;br /&gt;
provides a mechanism for merging several XML documents together.&lt;br /&gt;
&lt;br /&gt;
= Background =&lt;br /&gt;
&lt;br /&gt;
CorpusReader was developed for quantitative corpus linguistics. &lt;br /&gt;
Its original goal was to provide a way of extracting quantitative &lt;br /&gt;
information (such as co-occurrence matrices) using all the information &lt;br /&gt;
expressed in the XML infoset.&lt;br /&gt;
 &lt;br /&gt;
As a ''quantitative'' corpus ''linguistics'' tool, CorpusReader has &lt;br /&gt;
no ''linguistic'' or ''statistical'' skills of its own... but it helps &lt;br /&gt;
with:&lt;br /&gt;
&lt;br /&gt;
* importing the output of existing linguistic tools (taggers, parsers, etc.) into a corpus, &lt;br /&gt;
* exporting quantitative data in statistical tool formats (towards Matlab, [http://www.r-project.org/ R] and [http://www.lebart.org DTM], for instance).&lt;br /&gt;
&lt;br /&gt;
The rationale is that existing linguistic and statistical tools should &lt;br /&gt;
be reused and made easy to use in the context of a TEI corpus. &lt;br /&gt;
CorpusReader thus tries to be a bridge between linguistic and statistical &lt;br /&gt;
tools for creating and empirically exploring &lt;br /&gt;
complex data. &lt;br /&gt;
&lt;br /&gt;
= Technical features =&lt;br /&gt;
&lt;br /&gt;
CR is written in Java and works with Java 1.4 and later. It relies on &lt;br /&gt;
numerous external high-quality open-source libraries.&lt;br /&gt;
&lt;br /&gt;
It runs at the command line.&lt;br /&gt;
&lt;br /&gt;
It is released under the BSD licence.&lt;br /&gt;
&lt;br /&gt;
There is a [http://panini.u-paris10.fr/~sloiseau/CR/ web site] (see the bottom of the page for more links).&lt;br /&gt;
&lt;br /&gt;
= Properties =&lt;br /&gt;
 &lt;br /&gt;
== Functions are filters ==&lt;br /&gt;
&lt;br /&gt;
The program relies on a &amp;quot;streaming API&amp;quot;: the document is processed as a stream, not as a tree. The &amp;quot;functions&amp;quot; of the program are implemented as &amp;quot;filters&amp;quot; applied to this stream. The program is mainly a collection of filters plus a mechanism for plugging the filters into a pipeline.&lt;br /&gt;
 &lt;br /&gt;
Each filter is specialised in one precise task and takes few arguments. &lt;br /&gt;
This allows modularity and reusability. While each filter performs a &lt;br /&gt;
simple task, a pipeline of filters may achieve complex tasks.&lt;br /&gt;
&lt;br /&gt;
A query document looks like this:&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;corpus.xml&amp;quot;&lt;br /&gt;
          outURI=&amp;quot;sample-manuel-1.out&amp;quot;&lt;br /&gt;
          /&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;myFilterName&amp;quot; javaClass=&amp;quot;java.class.qualified.Name&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  Argument subtree, passed to the filter if any, after&lt;br /&gt;
              validation if a schema is known for this filter.&lt;br /&gt;
        --&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
    &amp;lt;!--  etc.: as many filters as needed  --&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
In some cases, filters can communicate directly with each other &lt;br /&gt;
(in addition to communicating through the stream of events).&lt;br /&gt;
&lt;br /&gt;
== Using high level languages on large documents (XPath, XSLT, XQuery) ==&lt;br /&gt;
&lt;br /&gt;
High-level languages (XPath, XSLT and XQuery) are made &lt;br /&gt;
available: the stream of XML events is buffered, &lt;br /&gt;
transformed or queried, and thrown back to the next &lt;br /&gt;
filter in the pipeline as a stream of XML events.&lt;br /&gt;
 &lt;br /&gt;
When the corpus does not fit in memory, a mechanism allows &lt;br /&gt;
the subtrees to be addressed, buffered and transformed &lt;br /&gt;
successively, one by one, as separate documents. The &lt;br /&gt;
&amp;lt;code&amp;gt;split&amp;lt;/code&amp;gt; element in the query document divides &lt;br /&gt;
the stream into sub-documents. Each &amp;lt;code&amp;gt;filter&amp;lt;/code&amp;gt; &lt;br /&gt;
inside a &amp;lt;code&amp;gt;split&amp;lt;/code&amp;gt; element sees the corpus as several &lt;br /&gt;
documents rooted at the elements defined by &lt;br /&gt;
&amp;lt;code&amp;gt;split/@localName&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split localName=&amp;quot;TEI&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
There is also a way of addressing elements in the stream &lt;br /&gt;
through an XPath expression evaluated against each element, &lt;br /&gt;
one at a time, as if it were a stand-alone document. For instance, &lt;br /&gt;
the query document above could be written as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split elxpath=&amp;quot;*[namespace-uri()='http://www.tei-c.org/ns/1.0' &lt;br /&gt;
                      and local-name()='div']&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Several filters accept XPath syntax for addressing nodes.&lt;br /&gt;
&lt;br /&gt;
== Merging documents ==&lt;br /&gt;
&lt;br /&gt;
CR contains a mechanism for merging an external document into an already &lt;br /&gt;
annotated corpus without breaking well-formedness. This is useful for reusing &lt;br /&gt;
the output of existing linguistic annotation tools. There is [http://panini.u-paris10.fr/~sloiseau/CR/filtres/EncodingMerger.html documentation] (in French, and already old).&lt;br /&gt;
&lt;br /&gt;
== Dealing with intersecting hierarchies ==&lt;br /&gt;
&lt;br /&gt;
[TODO]&lt;br /&gt;
&lt;br /&gt;
== Using a low level API ==&lt;br /&gt;
&lt;br /&gt;
I tried to make the program stand between a &amp;quot;tool&amp;quot; and an &amp;quot;API&amp;quot;: it may be seen as a framework facilitating the use of a low-level API. Thus there are different ways of using it:&lt;br /&gt;
&lt;br /&gt;
* it provides ready-to-use functions for creating KWIC concordances, extracting subcorpora, computing co-occurrence matrices, merging markup, etc.&lt;br /&gt;
* it may be used more generally as a way of applying XSLT / XQuery to big corpora, whatever the vocabulary.&lt;br /&gt;
* it may be used for plugging in any SAX filter, reducing the complexity of SAX by letting you write only the SAX handler code while the program manages the parser, the pipeline, and the serialisation of the pipeline's output back to disk. Any class implementing the XMLFilter interface may be plugged into the pipeline by providing the qualified name of the Java class in the query document (the class must be found on the CLASSPATH), and it can interact with the existing filters.&lt;br /&gt;
* it may be used for prototyping Java code, by embedding Java code defining a SAX filter in the pipeline; see http://panini.u-paris10.fr/~sloiseau/CR/filtres/Script.html or http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#embeddedJava&lt;br /&gt;
&lt;br /&gt;
The goal was to make a low-level API easy to use. &amp;quot;CorpusReader&amp;quot; &lt;br /&gt;
is named after &amp;quot;XMLReader&amp;quot;, the name of the class that parses a document &lt;br /&gt;
in the Java SAX API. It is intended to be a &amp;quot;layer&amp;quot; on top of SAX for &lt;br /&gt;
the TEI vocabulary.&lt;br /&gt;
&lt;br /&gt;
== Intended for documents in the TEI scheme ==&lt;br /&gt;
&lt;br /&gt;
The program tries to rely on the TEI scheme: &lt;br /&gt;
some structural properties of TEI documents are sometimes needed, &lt;br /&gt;
but I try not to make it rely on a specific TEI customisation. &lt;br /&gt;
(I would like to develop a mechanism for using the &amp;quot;TEI customization&amp;quot; &lt;br /&gt;
document produced by Roma to override the default vocabulary.)&lt;br /&gt;
&lt;br /&gt;
= External Links =&lt;br /&gt;
 &lt;br /&gt;
* The site with documentation (mainly in French): http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
* Examples summing up the properties of CR, in rough English: http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#concordances&lt;br /&gt;
* Download page: http://panini.u-paris10.fr/~sloiseau/CR/download.html&lt;br /&gt;
* URL of an archive containing the program and all the required external libraries: http://panini.u-paris10.fr/~sloiseau/CR/download/CR.zip&lt;br /&gt;
* For questions (or advice!): [mailto:sylvain.loiseau@u-paris10.fr sloiseau@u-paris10.fr]&lt;br /&gt;
&lt;br /&gt;
[[Category:Tools]]&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2756</id>
		<title>CorpusReader</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2756"/>
		<updated>2006-08-29T22:28:10Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: /* Using high level languages on large documents (XPath, XSLT, XQuery) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Summary =&lt;br /&gt;
&lt;br /&gt;
CorpusReader (CR) is a tool for extracting subcorpora, KWIC &lt;br /&gt;
concordances, and quantitative information from arbitrarily large &lt;br /&gt;
corpora in the TEI vocabulary. It aims to provide ways of &lt;br /&gt;
processing corpora containing milestoned annotation, and it &lt;br /&gt;
provides a mechanism for merging several XML documents together.&lt;br /&gt;
&lt;br /&gt;
= Background =&lt;br /&gt;
&lt;br /&gt;
CorpusReader was developed for quantitative corpus linguistics. &lt;br /&gt;
Its original goal was to provide a way of extracting quantitative &lt;br /&gt;
information (such as co-occurrence matrices) using all the information &lt;br /&gt;
expressed in the XML infoset.&lt;br /&gt;
 &lt;br /&gt;
As a ''quantitative'' corpus ''linguistics'' tool, CorpusReader has &lt;br /&gt;
no ''linguistic'' or ''statistical'' skills of its own... but it helps &lt;br /&gt;
with:&lt;br /&gt;
&lt;br /&gt;
* importing the output of existing linguistic tools (taggers, parsers, etc.) into a corpus, &lt;br /&gt;
* exporting quantitative data in statistical tool formats (towards Matlab, [http://www.r-project.org/ R] and [http://www.lebart.org DTM], for instance).&lt;br /&gt;
&lt;br /&gt;
The rationale is that existing linguistic and statistical tools should &lt;br /&gt;
be reused and made easy to use in the context of a TEI corpus. &lt;br /&gt;
CorpusReader thus tries to be a bridge between linguistic and statistical &lt;br /&gt;
tools for creating and empirically exploring &lt;br /&gt;
complex data. &lt;br /&gt;
&lt;br /&gt;
= Technical features =&lt;br /&gt;
&lt;br /&gt;
CR is written in Java and works with Java 1.4 and later. It relies on &lt;br /&gt;
numerous external high-quality open-source libraries.&lt;br /&gt;
&lt;br /&gt;
It runs at the command line.&lt;br /&gt;
&lt;br /&gt;
It is released under the BSD licence.&lt;br /&gt;
&lt;br /&gt;
There is a [http://panini.u-paris10.fr/~sloiseau/CR/ web site] (see the bottom of the page for more links).&lt;br /&gt;
&lt;br /&gt;
= Properties =&lt;br /&gt;
 &lt;br /&gt;
== Functions are filters ==&lt;br /&gt;
&lt;br /&gt;
The program relies on a &amp;quot;streaming API&amp;quot;: the document is processed as a stream, not as a tree. The &amp;quot;functions&amp;quot; of the program are implemented as &amp;quot;filters&amp;quot; applied to this stream. The program is mainly a collection of filters plus a mechanism for plugging the filters into a pipeline.&lt;br /&gt;
 &lt;br /&gt;
Each filter is specialised in one precise task and takes few arguments. &lt;br /&gt;
This allows modularity and reusability. While each filter performs a &lt;br /&gt;
simple task, a pipeline of filters may achieve complex tasks.&lt;br /&gt;
&lt;br /&gt;
A query document looks like this:&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;corpus.xml&amp;quot;&lt;br /&gt;
          outURI=&amp;quot;sample-manuel-1.out&amp;quot;&lt;br /&gt;
          /&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;myFilterName&amp;quot; javaClass=&amp;quot;java.class.qualified.Name&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  Argument subtree, passed to the filter if any, after&lt;br /&gt;
              validation if a schema is known for this filter.&lt;br /&gt;
        --&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
    &amp;lt;!--  etc.: as many filters as needed  --&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
In some cases, filters can communicate directly with each other &lt;br /&gt;
(in addition to communicating through the stream of events).&lt;br /&gt;
&lt;br /&gt;
== Using high level languages on large documents (XPath, XSLT, XQuery) ==&lt;br /&gt;
&lt;br /&gt;
High-level languages (XPath, XSLT and XQuery) are made &lt;br /&gt;
available: the stream of XML events is buffered, &lt;br /&gt;
transformed or queried, and thrown back to the next &lt;br /&gt;
filter in the pipeline as a stream of XML events.&lt;br /&gt;
 &lt;br /&gt;
When the corpus does not fit in memory, a mechanism allows &lt;br /&gt;
the subtrees to be addressed, buffered and transformed &lt;br /&gt;
successively, one by one, as separate documents. The &lt;br /&gt;
&amp;lt;code&amp;gt;split&amp;lt;/code&amp;gt; element in the query document divides &lt;br /&gt;
the stream into sub-documents. Each &amp;lt;code&amp;gt;filter&amp;lt;/code&amp;gt; &lt;br /&gt;
inside a &amp;lt;code&amp;gt;split&amp;lt;/code&amp;gt; element sees the corpus as several &lt;br /&gt;
documents rooted at the elements defined by &lt;br /&gt;
&amp;lt;code&amp;gt;split/@localName&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split localName=&amp;quot;TEI&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
There is also a way of addressing elements in the stream &lt;br /&gt;
through an XPath expression evaluated against each element, &lt;br /&gt;
one at a time, as if it were a stand-alone document. For instance, &lt;br /&gt;
the query document above could be written as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split elxpath=&amp;quot;*[namespace-uri()='http://www.tei-c.org/ns/1.0' &lt;br /&gt;
                      and local-name()='div']&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Several filters accept XPath syntax for addressing nodes.&lt;br /&gt;
&lt;br /&gt;
== Merging documents ==&lt;br /&gt;
&lt;br /&gt;
CR contains a mechanism for merging an external document into an already &lt;br /&gt;
annotated corpus without breaking well-formedness. This is useful for reusing &lt;br /&gt;
the output of existing linguistic annotation tools. There is [http://panini.u-paris10.fr/~sloiseau/CR/filtres/EncodingMerger.html documentation] (in French, and already old).&lt;br /&gt;
&lt;br /&gt;
== Dealing with intersecting hierarchies ==&lt;br /&gt;
&lt;br /&gt;
[TODO]&lt;br /&gt;
&lt;br /&gt;
== Using a low level API ==&lt;br /&gt;
&lt;br /&gt;
I tried to make the program stand between a &amp;quot;tool&amp;quot; and an &amp;quot;API&amp;quot;: it may be seen as a framework facilitating the use of a low-level API. Thus there are different ways of using it:&lt;br /&gt;
&lt;br /&gt;
* it provides ready-to-use functions for creating KWIC concordances, extracting subcorpora, computing co-occurrence matrices, merging markup, etc.&lt;br /&gt;
* it may be used more generally as a way of applying XSLT / XQuery to big corpora, whatever the vocabulary.&lt;br /&gt;
* it may be used for plugging in arbitrary SAX filters, reducing the complexity of SAX: you write only the SAX handler code, while the program manages the parser, the pipeline, and the serialisation of the pipeline output back to disk. Any class implementing the XMLFilter interface may be plugged into the pipeline by giving the qualified name of the Java class in the query document (the class must be found on the CLASSPATH), and can interact with the existing filters.&lt;br /&gt;
* it may be used for prototyping Java code, by embedding in the pipeline some Java code defining a SAX filter; see http://panini.u-paris10.fr/~sloiseau/CR/filtres/Script.html or http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#embeddedJava&lt;br /&gt;
&lt;br /&gt;
The goal was to make a low-level API easy to use. &amp;quot;CorpusReader&amp;quot; &lt;br /&gt;
is named after &amp;quot;XMLReader&amp;quot;, the name of the class that parses a document &lt;br /&gt;
in the Java SAX API. It is intended to be a &amp;quot;layer&amp;quot; on top of SAX for &lt;br /&gt;
the TEI vocabulary.&lt;br /&gt;
&lt;br /&gt;
== Intended for documents in the TEI scheme ==&lt;br /&gt;
&lt;br /&gt;
The program tries to rely on the TEI scheme: &lt;br /&gt;
some structural properties of TEI documents are sometimes needed, &lt;br /&gt;
but I try not to make it rely on a specific TEI customisation. &lt;br /&gt;
(I would like to develop a mechanism for using the &amp;quot;TEI customization&amp;quot; &lt;br /&gt;
document produced by Roma to override the default vocabulary.)&lt;br /&gt;
&lt;br /&gt;
= External Links =&lt;br /&gt;
 &lt;br /&gt;
* The site with documentation (mainly in French): http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
* Examples summing up the properties of CR, in approximate English: http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#concordances&lt;br /&gt;
* Download page: http://panini.u-paris10.fr/~sloiseau/CR/download.html&lt;br /&gt;
* An archive containing the program and all the required external libraries: http://panini.u-paris10.fr/~sloiseau/CR/download/CR.zip&lt;br /&gt;
* For questions (or advice!): [mailto:sylvain.loiseau@u-paris10.fr sloiseau@u-paris10.fr]&lt;br /&gt;
&lt;br /&gt;
[[Category:Tools]]&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2755</id>
		<title>CorpusReader</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2755"/>
		<updated>2006-08-29T22:26:27Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Summary =&lt;br /&gt;
&lt;br /&gt;
CorpusReader (CR) is a tool for extracting subcorpora, KWIC concordances &lt;br /&gt;
and quantitative information from arbitrarily large corpora &lt;br /&gt;
in the TEI vocabulary. It aims to provide ways of &lt;br /&gt;
processing corpora containing milestoned annotation, &lt;br /&gt;
and provides mechanisms for merging several XML documents together.&lt;br /&gt;
&lt;br /&gt;
= Background =&lt;br /&gt;
&lt;br /&gt;
CorpusReader was developed for quantitative corpus linguistics. &lt;br /&gt;
Its original goal was to provide a way of extracting quantitative &lt;br /&gt;
information (such as co-occurrence matrices) using all the information &lt;br /&gt;
expressed in the XML infoset.&lt;br /&gt;
 &lt;br /&gt;
As a ''quantitative'' corpus ''linguistic'' tool, CorpusReader has no &lt;br /&gt;
''linguistic'' or ''statistical'' skills of its own... but it provides help &lt;br /&gt;
for:&lt;br /&gt;
&lt;br /&gt;
* importing the outputs of existing linguistic tools (taggers, parsers, etc.) into a corpus, &lt;br /&gt;
* exporting quantitative data in the formats of statistical tools (toward Matlab, [http://www.r-project.org/ R] and [http://www.lebart.org DTM], for instance).&lt;br /&gt;
&lt;br /&gt;
The rationale is that existing linguistic and statistical tools should &lt;br /&gt;
be reused and made easy to use in the context of a TEI corpus. &lt;br /&gt;
Thus CorpusReader tries to be a bridge between linguistic and statistical &lt;br /&gt;
tools for creating and empirically exploring &lt;br /&gt;
complex data. &lt;br /&gt;
&lt;br /&gt;
= Technical features =&lt;br /&gt;
&lt;br /&gt;
CR is written in Java and works with Java 1.4 and later. It relies on &lt;br /&gt;
numerous external high-quality open source libraries.&lt;br /&gt;
&lt;br /&gt;
It runs at the command line.&lt;br /&gt;
&lt;br /&gt;
It is released under the BSD licence.&lt;br /&gt;
&lt;br /&gt;
There is a [http://panini.u-paris10.fr/~sloiseau/CR/ web site] (see the bottom of this page for more links).&lt;br /&gt;
&lt;br /&gt;
= Properties =&lt;br /&gt;
 &lt;br /&gt;
== Functions are filters ==&lt;br /&gt;
&lt;br /&gt;
The program relies on a &amp;quot;streaming API&amp;quot;: the document is processed as a stream, not as a tree. The &amp;quot;functions&amp;quot; of the program are implemented as &amp;quot;filters&amp;quot; applied to this stream. The program is mainly a collection of filters, plus a mechanism for plugging the filters into a pipeline.&lt;br /&gt;
 &lt;br /&gt;
Each filter is specialised in a precise task and takes few arguments. &lt;br /&gt;
This allows modularity and reusability: while each filter performs a &lt;br /&gt;
simple task, a pipeline of filters may achieve complex tasks.&lt;br /&gt;
&lt;br /&gt;
A query document looks like:&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;corpus.xml&amp;quot;&lt;br /&gt;
          outURI=&amp;quot;sample-manuel-1.out&amp;quot;&lt;br /&gt;
          /&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;myFilterName&amp;quot; javaClass=&amp;quot;java.class.qualified.Name&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  Argument subtree, passed to the filter if any, after&lt;br /&gt;
              validation if a schema is known for this filter.&lt;br /&gt;
        --&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
    &amp;lt;!--  etc.: as many filters as needed  --&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
In some cases, filters can communicate directly with each other &lt;br /&gt;
(in addition to communicating through the stream of events).&lt;br /&gt;
&lt;br /&gt;
== Using high level languages on large documents (XPath, XSLT, XQuery) ==&lt;br /&gt;
&lt;br /&gt;
High-level languages (XPath, XSLT and XQuery) are made &lt;br /&gt;
available: the stream of XML events is buffered, &lt;br /&gt;
transformed or queried, then thrown back to the next &lt;br /&gt;
filter in the pipeline as a stream of XML events.&lt;br /&gt;
 &lt;br /&gt;
When the corpus does not fit in memory, a mechanism allows &lt;br /&gt;
the subtrees to be buffered and transformed &lt;br /&gt;
successively, one by one, as separate documents. The &lt;br /&gt;
&amp;lt;code&amp;gt;split&amp;lt;/code&amp;gt; element in the query document divides &lt;br /&gt;
the stream into sub-documents. Each &amp;lt;code&amp;gt;filter&amp;lt;/code&amp;gt; &lt;br /&gt;
inside a &amp;lt;code&amp;gt;split&amp;lt;/code&amp;gt; element sees the corpus as several &lt;br /&gt;
documents rooted at the elements defined by &lt;br /&gt;
&amp;lt;code&amp;gt;split/@localName&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split localName=&amp;quot;TEI&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
There is also a way of addressing elements in the stream &lt;br /&gt;
through an XPath expression evaluated against each element, &lt;br /&gt;
one at a time, as if it were a stand-alone document. For instance, &lt;br /&gt;
the query document above could be written as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split elxpath=&amp;quot;*[namespace-uri()='http://www.tei-c.org/ns/1.0' &lt;br /&gt;
                      and local-name()='div']&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Merging documents ==&lt;br /&gt;
&lt;br /&gt;
CR contains a mechanism for merging an external document into an already &lt;br /&gt;
annotated corpus without breaking well-formedness. This is useful for reusing &lt;br /&gt;
the outputs of existing linguistic annotation tools. Some [http://panini.u-paris10.fr/~sloiseau/CR/filtres/EncodingMerger.html documentation] is available (in French, and somewhat dated).&lt;br /&gt;
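&lt;br /&gt;
As a sketch only: the filter class and argument element below are hypothetical (they are not taken from the CR distribution; see the linked documentation for the actual interface), but following the conventions of the query documents above, a merging step might be declared along these lines:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;!--  Hypothetical merging filter: the class and argument names&lt;br /&gt;
          are illustrative only.  --&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;merge_annotations&amp;quot; javaClass=&amp;quot;tei.cr.filters.EncodingMerger&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;document URI=&amp;quot;path/to/annotations.xml&amp;quot;&amp;gt;&amp;lt;/document&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;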
&lt;br /&gt;
== Dealing with intersecting hierarchies ==&lt;br /&gt;
&lt;br /&gt;
[TODO]&lt;br /&gt;
&lt;br /&gt;
== Using a low level API ==&lt;br /&gt;
&lt;br /&gt;
I tried to make the program stand halfway between a &amp;quot;tool&amp;quot; and an &amp;quot;API&amp;quot;: it may be seen as a framework facilitating the use of a low-level API. There are thus several ways of using it:&lt;br /&gt;
&lt;br /&gt;
* it provides ready-to-use functions for creating KWIC concordances, extracting subcorpora, computing co-occurrence matrices, merging markup, etc.&lt;br /&gt;
* it may be used more generally as a way of applying XSLT / XQuery to big corpora, whatever the vocabulary.&lt;br /&gt;
* it may be used for plugging in arbitrary SAX filters, reducing the complexity of SAX: you write only the SAX handler code, while the program manages the parser, the pipeline, and the serialisation of the pipeline output back to disk. Any class implementing the XMLFilter interface may be plugged into the pipeline by giving the qualified name of the Java class in the query document (the class must be found on the CLASSPATH), and can interact with the existing filters.&lt;br /&gt;
* it may be used for prototyping Java code, by embedding in the pipeline some Java code defining a SAX filter; see http://panini.u-paris10.fr/~sloiseau/CR/filtres/Script.html or http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#embeddedJava&lt;br /&gt;
&lt;br /&gt;
The goal was to make a low-level API easy to use. &amp;quot;CorpusReader&amp;quot; &lt;br /&gt;
is named after &amp;quot;XMLReader&amp;quot;, the name of the class that parses a document &lt;br /&gt;
in the Java SAX API. It is intended to be a &amp;quot;layer&amp;quot; on top of SAX for &lt;br /&gt;
the TEI vocabulary.&lt;br /&gt;
&lt;br /&gt;
== Intended for documents in the TEI scheme ==&lt;br /&gt;
&lt;br /&gt;
The program tries to rely on the TEI scheme: &lt;br /&gt;
some structural properties of TEI documents are sometimes needed, &lt;br /&gt;
but I try not to make it rely on a specific TEI customisation. &lt;br /&gt;
(I would like to develop a mechanism for using the &amp;quot;TEI customization&amp;quot; &lt;br /&gt;
document produced by Roma to override the default vocabulary.)&lt;br /&gt;
&lt;br /&gt;
= External Links =&lt;br /&gt;
 &lt;br /&gt;
* The site with documentation (mainly in French): http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
* Examples summing up the properties of CR, in approximate English: http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#concordances&lt;br /&gt;
* Download page: http://panini.u-paris10.fr/~sloiseau/CR/download.html&lt;br /&gt;
* An archive containing the program and all the required external libraries: http://panini.u-paris10.fr/~sloiseau/CR/download/CR.zip&lt;br /&gt;
* For questions (or advice!): [mailto:sylvain.loiseau@u-paris10.fr sloiseau@u-paris10.fr]&lt;br /&gt;
&lt;br /&gt;
[[Category:Tools]]&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2751</id>
		<title>CorpusReader</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2751"/>
		<updated>2006-08-26T18:52:59Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: /* External Links */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Summary =&lt;br /&gt;
&lt;br /&gt;
CorpusReader (CR) is a tool for extracting subcorpora, KWIC concordances &lt;br /&gt;
and quantitative information from arbitrarily large corpora &lt;br /&gt;
in the TEI vocabulary. It aims to provide ways of &lt;br /&gt;
processing corpora containing milestoned annotation, &lt;br /&gt;
and provides mechanisms for merging several XML documents together.&lt;br /&gt;
&lt;br /&gt;
= Background =&lt;br /&gt;
&lt;br /&gt;
CorpusReader was developed for quantitative corpus linguistics. &lt;br /&gt;
Its original goal was to provide a way of extracting quantitative &lt;br /&gt;
information (such as co-occurrence matrices) using all the information &lt;br /&gt;
expressed in the XML infoset.&lt;br /&gt;
 &lt;br /&gt;
As a ''quantitative'' corpus ''linguistic'' tool, CorpusReader has no &lt;br /&gt;
''linguistic'' or ''statistical'' skills of its own... but it provides help &lt;br /&gt;
for:&lt;br /&gt;
&lt;br /&gt;
* importing the outputs of existing linguistic tools (taggers, parsers, etc.) into a corpus, &lt;br /&gt;
* exporting quantitative data in the formats of statistical tools (toward Matlab, [http://www.r-project.org/ R] and [http://www.lebart.org DTM], for instance).&lt;br /&gt;
&lt;br /&gt;
The rationale is that existing linguistic and statistical tools should &lt;br /&gt;
be reused and made easy to use in the context of a TEI corpus. &lt;br /&gt;
Thus CorpusReader tries to be a bridge between linguistic and statistical &lt;br /&gt;
tools for creating and empirically exploring &lt;br /&gt;
complex data. &lt;br /&gt;
&lt;br /&gt;
= Technical features =&lt;br /&gt;
&lt;br /&gt;
CR is written in Java and works with Java 1.4 and later. It relies on &lt;br /&gt;
numerous external high-quality open source libraries.&lt;br /&gt;
&lt;br /&gt;
It runs at the command line.&lt;br /&gt;
&lt;br /&gt;
It is released under the BSD licence.&lt;br /&gt;
&lt;br /&gt;
There is a [http://panini.u-paris10.fr/~sloiseau/CR/ web site] (see the bottom of this page for more links).&lt;br /&gt;
&lt;br /&gt;
= Properties =&lt;br /&gt;
 &lt;br /&gt;
== Functions are filters ==&lt;br /&gt;
&lt;br /&gt;
The program relies on a &amp;quot;streaming API&amp;quot;: the document is processed as a stream, not as a tree. The &amp;quot;functions&amp;quot; of the program are implemented as &amp;quot;filters&amp;quot; applied to this stream. The program is mainly a collection of filters, plus a mechanism for plugging the filters into a pipeline.&lt;br /&gt;
 &lt;br /&gt;
Each filter is specialised in a precise task and takes few arguments. &lt;br /&gt;
This allows modularity and reusability: while each filter performs a &lt;br /&gt;
simple task, a pipeline of filters may achieve complex tasks.&lt;br /&gt;
&lt;br /&gt;
A query document looks like:&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;corpus.xml&amp;quot;&lt;br /&gt;
          outURI=&amp;quot;sample-manuel-1.out&amp;quot;&lt;br /&gt;
          /&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;myFilterName&amp;quot; javaClass=&amp;quot;java.class.qualified.Name&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  Argument subtree, passed to the filter if any, after&lt;br /&gt;
              validation if a schema is known for this filter.&lt;br /&gt;
        --&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
    &amp;lt;!--  etc.: as many filters as needed  --&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
In some cases, filters can communicate directly with each other &lt;br /&gt;
(in addition to communicating through the stream of events).&lt;br /&gt;
&lt;br /&gt;
== Using high level languages on large documents (XPath, XSLT, XQuery) ==&lt;br /&gt;
&lt;br /&gt;
High-level languages (XPath, XSLT and XQuery) are made &lt;br /&gt;
available: the stream of XML events is buffered, &lt;br /&gt;
transformed or queried, then thrown back to the next &lt;br /&gt;
filter in the pipeline as a stream of XML events.&lt;br /&gt;
 &lt;br /&gt;
When the corpus does not fit in memory, a mechanism allows &lt;br /&gt;
the subtrees to be buffered and transformed &lt;br /&gt;
successively, one by one, as separate documents. The &lt;br /&gt;
&amp;lt;code&amp;gt;split&amp;lt;/code&amp;gt; element in the query document divides &lt;br /&gt;
the stream into sub-documents. Each &amp;lt;code&amp;gt;filter&amp;lt;/code&amp;gt; &lt;br /&gt;
inside a &amp;lt;code&amp;gt;split&amp;lt;/code&amp;gt; element sees the corpus as several &lt;br /&gt;
documents rooted at the elements defined by &lt;br /&gt;
&amp;lt;code&amp;gt;split/@localName&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split localName=&amp;quot;TEI&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
There is also a way of addressing elements in the stream &lt;br /&gt;
through an XPath expression evaluated against each element, &lt;br /&gt;
one at a time, as if it were a stand-alone document. For instance, &lt;br /&gt;
the query document above could be written as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split elxpath=&amp;quot;*[namespace-uri()='http://www.tei-c.org/ns/1.0' &lt;br /&gt;
                      and local-name()='div']&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Merging documents ==&lt;br /&gt;
&lt;br /&gt;
CR contains a mechanism for merging an external document into an already &lt;br /&gt;
annotated corpus without breaking well-formedness. This is useful for reusing &lt;br /&gt;
the outputs of existing linguistic annotation tools. Some [http://panini.u-paris10.fr/~sloiseau/CR/filtres/EncodingMerger.html documentation] is available (in French, and somewhat dated).&lt;br /&gt;
&lt;br /&gt;
== Dealing with intersecting hierarchies ==&lt;br /&gt;
&lt;br /&gt;
[TODO]&lt;br /&gt;
&lt;br /&gt;
== Using a low level API ==&lt;br /&gt;
&lt;br /&gt;
I tried to make the program stand halfway between a &amp;quot;tool&amp;quot; and an &amp;quot;API&amp;quot;: it may be seen as a framework facilitating the use of a low-level API. There are thus several ways of using it:&lt;br /&gt;
&lt;br /&gt;
* it provides ready-to-use functions for creating KWIC concordances, extracting subcorpora, computing co-occurrence matrices, merging markup, etc.&lt;br /&gt;
* it may be used more generally as a way of applying XSLT / XQuery to big corpora, whatever the vocabulary.&lt;br /&gt;
* it may be used for plugging in arbitrary SAX filters, reducing the complexity of SAX: you write only the SAX handler code, while the program manages the parser, the pipeline, and the serialisation of the pipeline output back to disk. Any class implementing the XMLFilter interface may be plugged into the pipeline by giving the qualified name of the Java class in the query document (the class must be found on the CLASSPATH), and can interact with the existing filters.&lt;br /&gt;
* it may be used for prototyping Java code, by embedding in the pipeline some Java code defining a SAX filter; see http://panini.u-paris10.fr/~sloiseau/CR/filtres/Script.html or http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#embeddedJava&lt;br /&gt;
&lt;br /&gt;
The goal was to make a low-level API easy to use. &amp;quot;CorpusReader&amp;quot; &lt;br /&gt;
is named after &amp;quot;XMLReader&amp;quot;, the name of the class that parses a document &lt;br /&gt;
in the Java SAX API. It is intended to be a &amp;quot;layer&amp;quot; on top of SAX for &lt;br /&gt;
the TEI vocabulary.&lt;br /&gt;
&lt;br /&gt;
== Fitted for documents in the TEI scheme ==&lt;br /&gt;
&lt;br /&gt;
The program tries to rely on the TEI scheme: &lt;br /&gt;
some structural properties of TEI documents are sometimes needed, &lt;br /&gt;
but I try not to make it rely on a specific TEI customisation. &lt;br /&gt;
(I would like to develop a mechanism for using the &amp;quot;TEI customization&amp;quot; &lt;br /&gt;
document produced by Roma to override the default vocabulary.)&lt;br /&gt;
&lt;br /&gt;
= External Links =&lt;br /&gt;
 &lt;br /&gt;
* The site with documentation (mainly in French): http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
* Examples summing up the properties of CR, in approximate English: http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#concordances&lt;br /&gt;
* Download page: http://panini.u-paris10.fr/~sloiseau/CR/download.html&lt;br /&gt;
* An archive containing the program and all the required external libraries: http://panini.u-paris10.fr/~sloiseau/CR/download/CR.zip&lt;br /&gt;
* For questions (or advice!): [mailto:sylvain.loiseau@u-paris10.fr sloiseau@u-paris10.fr]&lt;br /&gt;
&lt;br /&gt;
[[Category:Tools]]&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2750</id>
		<title>CorpusReader</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2750"/>
		<updated>2006-08-26T18:51:55Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: /* External Links */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Summary =&lt;br /&gt;
&lt;br /&gt;
CorpusReader (CR) is a tool for extracting subcorpora, KWIC concordances &lt;br /&gt;
and quantitative information from arbitrarily large corpora &lt;br /&gt;
in the TEI vocabulary. It aims to provide ways of &lt;br /&gt;
processing corpora containing milestoned annotation, &lt;br /&gt;
and provides mechanisms for merging several XML documents together.&lt;br /&gt;
&lt;br /&gt;
= Background =&lt;br /&gt;
&lt;br /&gt;
CorpusReader was developed for quantitative corpus linguistics. &lt;br /&gt;
Its original goal was to provide a way of extracting quantitative &lt;br /&gt;
information (such as co-occurrence matrices) using all the information &lt;br /&gt;
expressed in the XML infoset.&lt;br /&gt;
 &lt;br /&gt;
As a ''quantitative'' corpus ''linguistic'' tool, CorpusReader has no &lt;br /&gt;
''linguistic'' or ''statistical'' skills of its own... but it provides help &lt;br /&gt;
for:&lt;br /&gt;
&lt;br /&gt;
* importing the outputs of existing linguistic tools (taggers, parsers, etc.) into a corpus, &lt;br /&gt;
* exporting quantitative data in the formats of statistical tools (toward Matlab, [http://www.r-project.org/ R] and [http://www.lebart.org DTM], for instance).&lt;br /&gt;
&lt;br /&gt;
The rationale is that existing linguistic and statistical tools should &lt;br /&gt;
be reused and made easy to use in the context of a TEI corpus. &lt;br /&gt;
Thus CorpusReader tries to be a bridge between linguistic and statistical &lt;br /&gt;
tools for creating and empirically exploring &lt;br /&gt;
complex data. &lt;br /&gt;
&lt;br /&gt;
= Technical features =&lt;br /&gt;
&lt;br /&gt;
CR is written in Java and works with Java 1.4 and later. It relies on &lt;br /&gt;
numerous external high-quality open source libraries.&lt;br /&gt;
&lt;br /&gt;
It runs at the command line.&lt;br /&gt;
&lt;br /&gt;
It is released under the BSD licence.&lt;br /&gt;
&lt;br /&gt;
There is a [http://panini.u-paris10.fr/~sloiseau/CR/ web site] (see the bottom of this page for more links).&lt;br /&gt;
&lt;br /&gt;
= Properties =&lt;br /&gt;
 &lt;br /&gt;
== Functions are filters ==&lt;br /&gt;
&lt;br /&gt;
The program relies on a &amp;quot;streaming API&amp;quot;: the document is processed as a stream, not as a tree. The &amp;quot;functions&amp;quot; of the program are implemented as &amp;quot;filters&amp;quot; applied to this stream. The program is mainly a collection of filters, plus a mechanism for plugging the filters into a pipeline.&lt;br /&gt;
 &lt;br /&gt;
Each filter is specialised in a precise task and takes few arguments. &lt;br /&gt;
This allows modularity and reusability: while each filter performs a &lt;br /&gt;
simple task, a pipeline of filters may achieve complex tasks.&lt;br /&gt;
&lt;br /&gt;
A query document looks like:&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;corpus.xml&amp;quot;&lt;br /&gt;
          outURI=&amp;quot;sample-manuel-1.out&amp;quot;&lt;br /&gt;
          /&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;myFilterName&amp;quot; javaClass=&amp;quot;java.class.qualified.Name&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  Argument subtree, passed to the filter if any, after&lt;br /&gt;
              validation if a schema is known for this filter.&lt;br /&gt;
        --&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
    &amp;lt;!--  etc.: as many filters as needed  --&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
In some cases, filters can communicate directly with each other &lt;br /&gt;
(in addition to communicating through the stream of events).&lt;br /&gt;
&lt;br /&gt;
== Using high level languages on large documents (XPath, XSLT, XQuery) ==&lt;br /&gt;
&lt;br /&gt;
High-level languages (XPath, XSLT and XQuery) are made &lt;br /&gt;
available: the stream of XML events is buffered, &lt;br /&gt;
transformed or queried, then thrown back to the next &lt;br /&gt;
filter in the pipeline as a stream of XML events.&lt;br /&gt;
 &lt;br /&gt;
When the corpus does not fit in memory, a mechanism allows &lt;br /&gt;
the subtrees to be buffered and transformed &lt;br /&gt;
successively, one by one, as separate documents. The &lt;br /&gt;
&amp;lt;code&amp;gt;split&amp;lt;/code&amp;gt; element in the query document divides &lt;br /&gt;
the stream into sub-documents. Each &amp;lt;code&amp;gt;filter&amp;lt;/code&amp;gt; &lt;br /&gt;
inside a &amp;lt;code&amp;gt;split&amp;lt;/code&amp;gt; element sees the corpus as several &lt;br /&gt;
documents rooted at the elements defined by &lt;br /&gt;
&amp;lt;code&amp;gt;split/@localName&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split localName=&amp;quot;TEI&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Elements in the stream can also be addressed through an XPath &lt;br /&gt;
expression evaluated against each element, &lt;br /&gt;
one at a time, as if each were a stand-alone document. For instance, &lt;br /&gt;
the query document above could be written as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split elxpath=&amp;quot;*[namespace-uri()='http://www.tei-c.org/ns/1.0' &lt;br /&gt;
                      and local-name()='div']&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Merging documents ==&lt;br /&gt;
&lt;br /&gt;
CR contains a mechanism for merging an external document into an already &lt;br /&gt;
annotated corpus without breaking well-formedness. This is useful for reusing &lt;br /&gt;
the outputs of existing linguistic annotation tools. There is [http://panini.u-paris10.fr/~sloiseau/CR/filtres/EncodingMerger.html documentation] (in French, and somewhat outdated).&lt;br /&gt;
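&lt;br /&gt;
As a sketch only (the filter class name &amp;lt;code&amp;gt;tei.cr.filters.EncodingMerger&amp;lt;/code&amp;gt; is inferred from the documentation URL above, and the &amp;lt;code&amp;gt;document&amp;lt;/code&amp;gt; argument is hypothetical), a merge step might be declared in the query document like any other filter:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;filter name=&amp;quot;merge_annotations&amp;quot; javaClass=&amp;quot;tei.cr.filters.EncodingMerger&amp;quot;&amp;gt;&lt;br /&gt;
  &amp;lt;args&amp;gt;&lt;br /&gt;
    &amp;lt;!--  URI of the external annotated document to merge (hypothetical argument)  --&amp;gt;&lt;br /&gt;
    &amp;lt;document URI=&amp;quot;path/to/annotations.xml&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;/args&amp;gt;&lt;br /&gt;
&amp;lt;/filter&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;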
&lt;br /&gt;
== Dealing with intersecting hierarchies ==&lt;br /&gt;
&lt;br /&gt;
[TODO]&lt;br /&gt;
&lt;br /&gt;
== Using a low level API ==&lt;br /&gt;
&lt;br /&gt;
I tried to make the program stand between a &amp;quot;tool&amp;quot; and an &amp;quot;API&amp;quot;: it may be seen as a framework facilitating the use of a low-level API. Thus, there are several ways of using it:&lt;br /&gt;
&lt;br /&gt;
* it provides ready-to-use functions for creating KWIC concordances, extracting subcorpora, computing co-occurrence matrices, merging markup, etc.&lt;br /&gt;
* it may be used more generally as a way of applying XSLT / XQuery to big corpora, whatever the vocabulary.&lt;br /&gt;
* it may be used for plugging in arbitrary SAX filters, and it reduces the complexity of SAX: only the SAX handler code has to be written, while the program manages the parser, the pipeline, and the serialisation of the pipeline output back to disk. Any class implementing the XMLFilter interface may be plugged into the pipeline by giving the qualified name of the Java class in the query document (the class must be found on the CLASSPATH), and can interact with the existing filters.&lt;br /&gt;
* it may be used for prototyping Java code, by embedding Java code defining a SAX filter in the pipeline; see http://panini.u-paris10.fr/~sloiseau/CR/filtres/Script.html or http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#embeddedJava&lt;br /&gt;
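&lt;br /&gt;
As an illustration of the SAX-filter approach (a generic sketch using only the standard Java SAX API, not CR-specific code), a custom filter is simply a class extending &amp;lt;code&amp;gt;org.xml.sax.helpers.XMLFilterImpl&amp;lt;/code&amp;gt; and overriding the callbacks it cares about:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
import org.xml.sax.Attributes;&lt;br /&gt;
import org.xml.sax.SAXException;&lt;br /&gt;
import org.xml.sax.helpers.XMLFilterImpl;&lt;br /&gt;
&lt;br /&gt;
// Counts &amp;lt;w&amp;gt; elements while passing all events through unchanged.&lt;br /&gt;
public class WordCounter extends XMLFilterImpl {&lt;br /&gt;
    private int count = 0;&lt;br /&gt;
    public int getCount() { return count; }&lt;br /&gt;
    public void startElement(String uri, String localName,&lt;br /&gt;
            String qName, Attributes atts) throws SAXException {&lt;br /&gt;
        if (&amp;quot;w&amp;quot;.equals(localName)) count++;&lt;br /&gt;
        super.startElement(uri, localName, qName, atts);  // forward the event&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;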
&lt;br /&gt;
The goal was to make a low-level API easy to use. &amp;quot;CorpusReader&amp;quot; &lt;br /&gt;
is named after &amp;quot;XMLReader&amp;quot;, the class that parses a document &lt;br /&gt;
in the Java SAX API. It is intended to be a &amp;quot;layer&amp;quot; on top of SAX for &lt;br /&gt;
the TEI vocabulary.&lt;br /&gt;
&lt;br /&gt;
== Fitted for documents in the TEI scheme ==&lt;br /&gt;
&lt;br /&gt;
The program tries to rely on the TEI scheme: &lt;br /&gt;
some structural properties of TEI documents are sometimes needed, &lt;br /&gt;
but I try not to make it rely on a specific TEI customisation. &lt;br /&gt;
(I would like to develop a mechanism for using the &amp;quot;TEI customization&amp;quot; &lt;br /&gt;
document produced by Roma to override the default vocabulary.)&lt;br /&gt;
&lt;br /&gt;
= External Links =&lt;br /&gt;
 &lt;br /&gt;
* The site with documentation (mainly in French): http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
* Examples summarising the properties of CR in rough English: http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#concordances&lt;br /&gt;
* Download page: http://panini.u-paris10.fr/~sloiseau/CR/download.html&lt;br /&gt;
* URL of an archive containing the program and all the required external libraries: http://panini.u-paris10.fr/~sloiseau/CR/download/CR.zip&lt;br /&gt;
* For questions (or advice): [mailto:sylvain.loiseau@u-paris10.fr sloiseau@u-paris10.fr]&lt;br /&gt;
&lt;br /&gt;
[[Category:Tools]]&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2749</id>
		<title>CorpusReader</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2749"/>
		<updated>2006-08-26T18:51:14Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: /* External Links */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Summary =&lt;br /&gt;
&lt;br /&gt;
CorpusReader (CR) is a tool for extracting subcorpora, KWIC &lt;br /&gt;
concordances and quantitative information from arbitrarily large corpora &lt;br /&gt;
in the TEI vocabulary. It aims to provide ways of &lt;br /&gt;
processing corpora containing milestoned annotation. &lt;br /&gt;
It also provides mechanisms for merging several XML documents together.&lt;br /&gt;
&lt;br /&gt;
= Background =&lt;br /&gt;
&lt;br /&gt;
CorpusReader was developed for quantitative corpus linguistics. &lt;br /&gt;
Its original goal was to provide a way of extracting quantitative &lt;br /&gt;
information (such as co-occurrence matrices) using all the information &lt;br /&gt;
expressed in the XML infoset.&lt;br /&gt;
 &lt;br /&gt;
As a ''quantitative'' corpus ''linguistic'' tool, CorpusReader has &lt;br /&gt;
no ''linguistic'' or ''statistical'' skills of its own... but it provides help &lt;br /&gt;
with:&lt;br /&gt;
&lt;br /&gt;
* importing the outputs of existing linguistic tools (taggers, parsers, etc.) into a corpus, &lt;br /&gt;
* exporting quantitative data to statistical tool formats (for Matlab, [http://www.r-project.org/ R] and [http://www.lebart.org DTM], for instance).&lt;br /&gt;
&lt;br /&gt;
The rationale is that existing linguistic and statistical tools should &lt;br /&gt;
be reused and made easy to use in the context of a TEI corpus. &lt;br /&gt;
CorpusReader thus tries to be a bridge between linguistic and statistical &lt;br /&gt;
tools for creating and empirically exploring &lt;br /&gt;
complex data. &lt;br /&gt;
&lt;br /&gt;
= Technical features =&lt;br /&gt;
&lt;br /&gt;
CR is written in Java and works with Java 1.4 and later. It relies on &lt;br /&gt;
numerous high-quality external open-source libraries.&lt;br /&gt;
&lt;br /&gt;
It runs at the command line.&lt;br /&gt;
&lt;br /&gt;
It is released under the BSD licence.&lt;br /&gt;
&lt;br /&gt;
There is a [http://panini.u-paris10.fr/~sloiseau/CR/ web site] (See the bottom of the page for more links)&lt;br /&gt;
&lt;br /&gt;
= Properties =&lt;br /&gt;
 &lt;br /&gt;
== Functions are filters ==&lt;br /&gt;
&lt;br /&gt;
The program relies on a &amp;quot;streaming API&amp;quot;, where the document is processed as a stream rather than as a tree. The &amp;quot;functions&amp;quot; of the program are implemented as &amp;quot;filters&amp;quot; applied to this stream. The program is mainly a collection of filters and a mechanism for plugging the filters into a pipeline.&lt;br /&gt;
 &lt;br /&gt;
Each filter is specialized in a precise task and takes few arguments. &lt;br /&gt;
This allows modularity and reusability. While each filter performs a &lt;br /&gt;
simple task, a pipeline of filters may achieve complex tasks.&lt;br /&gt;
&lt;br /&gt;
A query document looks like:&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;corpus.xml&amp;quot;&lt;br /&gt;
          outURI=&amp;quot;sample-manuel-1.out&amp;quot;&lt;br /&gt;
          /&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;myFilterName&amp;quot; javaClass=&amp;quot;java.class.qualified.Name&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  Argument subtree, passed to the filter if any, after&lt;br /&gt;
              validation if a schema is known for this filter.&lt;br /&gt;
        --&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
    &amp;lt;!--  etc.: as many filters as needed  --&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
In some cases, filters can communicate directly with each other &lt;br /&gt;
(in addition to communicating through the stream of events).&lt;br /&gt;
&lt;br /&gt;
== Using high level languages on large documents (XPath, XSLT, XQuery) ==&lt;br /&gt;
&lt;br /&gt;
High-level languages (XPath, XSLT and XQuery) are also &lt;br /&gt;
available: the stream of XML events is buffered, &lt;br /&gt;
transformed or queried, and thrown back to the next &lt;br /&gt;
filter in the pipeline as a stream of XML events.&lt;br /&gt;
 &lt;br /&gt;
When the corpus does not fit in memory, a mechanism allows &lt;br /&gt;
the subtrees to be buffered and transformed &lt;br /&gt;
successively, one by one, as separate documents. The &lt;br /&gt;
&amp;lt;code&amp;gt;split&amp;lt;/code&amp;gt; element in the query document divides &lt;br /&gt;
the stream into sub-documents. Each &amp;lt;code&amp;gt;filter&amp;lt;/code&amp;gt; &lt;br /&gt;
inside a &amp;lt;code&amp;gt;split&amp;lt;/code&amp;gt; element sees the corpus as several &lt;br /&gt;
documents rooted at the elements defined by &lt;br /&gt;
&amp;lt;code&amp;gt;split/@localName&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split localName=&amp;quot;TEI&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Elements in the stream can also be addressed through an XPath &lt;br /&gt;
expression evaluated against each element, &lt;br /&gt;
one at a time, as if each were a stand-alone document. For instance, &lt;br /&gt;
the query document above could be written as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split elxpath=&amp;quot;*[namespace-uri()='http://www.tei-c.org/ns/1.0' &lt;br /&gt;
                      and local-name()='div']&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Merging documents ==&lt;br /&gt;
&lt;br /&gt;
CR contains a mechanism for merging an external document into an already &lt;br /&gt;
annotated corpus without breaking well-formedness. This is useful for reusing &lt;br /&gt;
the outputs of existing linguistic annotation tools. There is [http://panini.u-paris10.fr/~sloiseau/CR/filtres/EncodingMerger.html documentation] (in French, and somewhat outdated).&lt;br /&gt;
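&lt;br /&gt;
As a sketch only (the filter class name &amp;lt;code&amp;gt;tei.cr.filters.EncodingMerger&amp;lt;/code&amp;gt; is inferred from the documentation URL above, and the &amp;lt;code&amp;gt;document&amp;lt;/code&amp;gt; argument is hypothetical), a merge step might be declared in the query document like any other filter:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;filter name=&amp;quot;merge_annotations&amp;quot; javaClass=&amp;quot;tei.cr.filters.EncodingMerger&amp;quot;&amp;gt;&lt;br /&gt;
  &amp;lt;args&amp;gt;&lt;br /&gt;
    &amp;lt;!--  URI of the external annotated document to merge (hypothetical argument)  --&amp;gt;&lt;br /&gt;
    &amp;lt;document URI=&amp;quot;path/to/annotations.xml&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;/args&amp;gt;&lt;br /&gt;
&amp;lt;/filter&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;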
&lt;br /&gt;
== Dealing with intersecting hierarchies ==&lt;br /&gt;
&lt;br /&gt;
[TODO]&lt;br /&gt;
&lt;br /&gt;
== Using a low level API ==&lt;br /&gt;
&lt;br /&gt;
I tried to make the program stand between a &amp;quot;tool&amp;quot; and an &amp;quot;API&amp;quot;: it may be seen as a framework facilitating the use of a low-level API. Thus, there are several ways of using it:&lt;br /&gt;
&lt;br /&gt;
* it provides ready-to-use functions for creating KWIC concordances, extracting subcorpora, computing co-occurrence matrices, merging markup, etc.&lt;br /&gt;
* it may be used more generally as a way of applying XSLT / XQuery to big corpora, whatever the vocabulary.&lt;br /&gt;
* it may be used for plugging in arbitrary SAX filters, and it reduces the complexity of SAX: only the SAX handler code has to be written, while the program manages the parser, the pipeline, and the serialisation of the pipeline output back to disk. Any class implementing the XMLFilter interface may be plugged into the pipeline by giving the qualified name of the Java class in the query document (the class must be found on the CLASSPATH), and can interact with the existing filters.&lt;br /&gt;
* it may be used for prototyping Java code, by embedding Java code defining a SAX filter in the pipeline; see http://panini.u-paris10.fr/~sloiseau/CR/filtres/Script.html or http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#embeddedJava&lt;br /&gt;
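&lt;br /&gt;
As an illustration of the SAX-filter approach (a generic sketch using only the standard Java SAX API, not CR-specific code), a custom filter is simply a class extending &amp;lt;code&amp;gt;org.xml.sax.helpers.XMLFilterImpl&amp;lt;/code&amp;gt; and overriding the callbacks it cares about:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
import org.xml.sax.Attributes;&lt;br /&gt;
import org.xml.sax.SAXException;&lt;br /&gt;
import org.xml.sax.helpers.XMLFilterImpl;&lt;br /&gt;
&lt;br /&gt;
// Counts &amp;lt;w&amp;gt; elements while passing all events through unchanged.&lt;br /&gt;
public class WordCounter extends XMLFilterImpl {&lt;br /&gt;
    private int count = 0;&lt;br /&gt;
    public int getCount() { return count; }&lt;br /&gt;
    public void startElement(String uri, String localName,&lt;br /&gt;
            String qName, Attributes atts) throws SAXException {&lt;br /&gt;
        if (&amp;quot;w&amp;quot;.equals(localName)) count++;&lt;br /&gt;
        super.startElement(uri, localName, qName, atts);  // forward the event&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;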
&lt;br /&gt;
The goal was to make a low-level API easy to use. &amp;quot;CorpusReader&amp;quot; &lt;br /&gt;
is named after &amp;quot;XMLReader&amp;quot;, the class that parses a document &lt;br /&gt;
in the Java SAX API. It is intended to be a &amp;quot;layer&amp;quot; on top of SAX for &lt;br /&gt;
the TEI vocabulary.&lt;br /&gt;
&lt;br /&gt;
== Fitted for documents in the TEI scheme ==&lt;br /&gt;
&lt;br /&gt;
The program tries to rely on the TEI scheme: &lt;br /&gt;
some structural properties of TEI documents are sometimes needed, &lt;br /&gt;
but I try not to make it rely on a specific TEI customisation. &lt;br /&gt;
(I would like to develop a mechanism for using the &amp;quot;TEI customization&amp;quot; &lt;br /&gt;
document produced by Roma to override the default vocabulary.)&lt;br /&gt;
&lt;br /&gt;
= External Links =&lt;br /&gt;
 &lt;br /&gt;
* The site with documentation (mainly in French): http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
* Examples summarising the properties of CR in rough English: http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#concordances&lt;br /&gt;
* Download page: http://panini.u-paris10.fr/~sloiseau/CR/download.html&lt;br /&gt;
* URL of an archive containing the program and all the required external libraries: http://panini.u-paris10.fr/~sloiseau/CR/download/CR.zip&lt;br /&gt;
* For questions (or advice): [mailto:sylvain.loiseau@u-paris10.fr sloiseau@u-paris10.fr]&lt;br /&gt;
&lt;br /&gt;
[[Category:Tools]]&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2748</id>
		<title>CorpusReader</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2748"/>
		<updated>2006-08-26T18:50:40Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: /* External Links */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Summary =&lt;br /&gt;
&lt;br /&gt;
CorpusReader (CR) is a tool for extracting subcorpora, KWIC &lt;br /&gt;
concordances and quantitative information from arbitrarily large corpora &lt;br /&gt;
in the TEI vocabulary. It aims to provide ways of &lt;br /&gt;
processing corpora containing milestoned annotation. &lt;br /&gt;
It also provides mechanisms for merging several XML documents together.&lt;br /&gt;
&lt;br /&gt;
= Background =&lt;br /&gt;
&lt;br /&gt;
CorpusReader was developed for quantitative corpus linguistics. &lt;br /&gt;
Its original goal was to provide a way of extracting quantitative &lt;br /&gt;
information (such as co-occurrence matrices) using all the information &lt;br /&gt;
expressed in the XML infoset.&lt;br /&gt;
 &lt;br /&gt;
As a ''quantitative'' corpus ''linguistic'' tool, CorpusReader has &lt;br /&gt;
no ''linguistic'' or ''statistical'' skills of its own... but it provides help &lt;br /&gt;
with:&lt;br /&gt;
&lt;br /&gt;
* importing the outputs of existing linguistic tools (taggers, parsers, etc.) into a corpus, &lt;br /&gt;
* exporting quantitative data to statistical tool formats (for Matlab, [http://www.r-project.org/ R] and [http://www.lebart.org DTM], for instance).&lt;br /&gt;
&lt;br /&gt;
The rationale is that existing linguistic and statistical tools should &lt;br /&gt;
be reused and made easy to use in the context of a TEI corpus. &lt;br /&gt;
CorpusReader thus tries to be a bridge between linguistic and statistical &lt;br /&gt;
tools for creating and empirically exploring &lt;br /&gt;
complex data. &lt;br /&gt;
&lt;br /&gt;
= Technical features =&lt;br /&gt;
&lt;br /&gt;
CR is written in Java and works with Java 1.4 and later. It relies on &lt;br /&gt;
numerous high-quality external open-source libraries.&lt;br /&gt;
&lt;br /&gt;
It runs at the command line.&lt;br /&gt;
&lt;br /&gt;
It is released under the BSD licence.&lt;br /&gt;
&lt;br /&gt;
There is a [http://panini.u-paris10.fr/~sloiseau/CR/ web site] (See the bottom of the page for more links)&lt;br /&gt;
&lt;br /&gt;
= Properties =&lt;br /&gt;
 &lt;br /&gt;
== Functions are filters ==&lt;br /&gt;
&lt;br /&gt;
The program relies on a &amp;quot;streaming API&amp;quot;, where the document is processed as a stream rather than as a tree. The &amp;quot;functions&amp;quot; of the program are implemented as &amp;quot;filters&amp;quot; applied to this stream. The program is mainly a collection of filters and a mechanism for plugging the filters into a pipeline.&lt;br /&gt;
 &lt;br /&gt;
Each filter is specialized in a precise task and takes few arguments. &lt;br /&gt;
This allows modularity and reusability. While each filter performs a &lt;br /&gt;
simple task, a pipeline of filters may achieve complex tasks.&lt;br /&gt;
&lt;br /&gt;
A query document looks like:&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;corpus.xml&amp;quot;&lt;br /&gt;
          outURI=&amp;quot;sample-manuel-1.out&amp;quot;&lt;br /&gt;
          /&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;myFilterName&amp;quot; javaClass=&amp;quot;java.class.qualified.Name&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  Argument subtree, passed to the filter if any, after&lt;br /&gt;
              validation if a schema is known for this filter.&lt;br /&gt;
        --&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
    &amp;lt;!--  etc.: as many filters as needed  --&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
In some cases, filters can communicate directly with each other &lt;br /&gt;
(in addition to communicating through the stream of events).&lt;br /&gt;
&lt;br /&gt;
== Using high level languages on large documents (XPath, XSLT, XQuery) ==&lt;br /&gt;
&lt;br /&gt;
High-level languages (XPath, XSLT and XQuery) are also &lt;br /&gt;
available: the stream of XML events is buffered, &lt;br /&gt;
transformed or queried, and thrown back to the next &lt;br /&gt;
filter in the pipeline as a stream of XML events.&lt;br /&gt;
 &lt;br /&gt;
When the corpus does not fit in memory, a mechanism allows &lt;br /&gt;
the subtrees to be buffered and transformed &lt;br /&gt;
successively, one by one, as separate documents. The &lt;br /&gt;
&amp;lt;code&amp;gt;split&amp;lt;/code&amp;gt; element in the query document divides &lt;br /&gt;
the stream into sub-documents. Each &amp;lt;code&amp;gt;filter&amp;lt;/code&amp;gt; &lt;br /&gt;
inside a &amp;lt;code&amp;gt;split&amp;lt;/code&amp;gt; element sees the corpus as several &lt;br /&gt;
documents rooted at the elements defined by &lt;br /&gt;
&amp;lt;code&amp;gt;split/@localName&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split localName=&amp;quot;TEI&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
Elements in the stream can also be addressed through an XPath &lt;br /&gt;
expression evaluated against each element, &lt;br /&gt;
one at a time, as if each were a stand-alone document. For instance, &lt;br /&gt;
the query document above could be written as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split elxpath=&amp;quot;*[namespace-uri()='http://www.tei-c.org/ns/1.0' &lt;br /&gt;
                      and local-name()='div']&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Merging documents ==&lt;br /&gt;
&lt;br /&gt;
CR contains a mechanism for merging an external document into an already &lt;br /&gt;
annotated corpus without breaking well-formedness. This is useful for reusing &lt;br /&gt;
the outputs of existing linguistic annotation tools. There is [http://panini.u-paris10.fr/~sloiseau/CR/filtres/EncodingMerger.html documentation] (in French, and somewhat outdated).&lt;br /&gt;
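&lt;br /&gt;
As a sketch only (the filter class name &amp;lt;code&amp;gt;tei.cr.filters.EncodingMerger&amp;lt;/code&amp;gt; is inferred from the documentation URL above, and the &amp;lt;code&amp;gt;document&amp;lt;/code&amp;gt; argument is hypothetical), a merge step might be declared in the query document like any other filter:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;filter name=&amp;quot;merge_annotations&amp;quot; javaClass=&amp;quot;tei.cr.filters.EncodingMerger&amp;quot;&amp;gt;&lt;br /&gt;
  &amp;lt;args&amp;gt;&lt;br /&gt;
    &amp;lt;!--  URI of the external annotated document to merge (hypothetical argument)  --&amp;gt;&lt;br /&gt;
    &amp;lt;document URI=&amp;quot;path/to/annotations.xml&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;/args&amp;gt;&lt;br /&gt;
&amp;lt;/filter&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;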
&lt;br /&gt;
== Dealing with intersecting hierarchies ==&lt;br /&gt;
&lt;br /&gt;
[TODO]&lt;br /&gt;
&lt;br /&gt;
== Using a low level API ==&lt;br /&gt;
&lt;br /&gt;
I tried to make the program stand between a &amp;quot;tool&amp;quot; and an &amp;quot;API&amp;quot;: it may be seen as a framework facilitating the use of a low-level API. Thus, there are several ways of using it:&lt;br /&gt;
&lt;br /&gt;
* it provides ready-to-use functions for creating KWIC concordances, extracting subcorpora, computing co-occurrence matrices, merging markup, etc.&lt;br /&gt;
* it may be used more generally as a way of applying XSLT / XQuery to big corpora, whatever the vocabulary.&lt;br /&gt;
* it may be used for plugging in arbitrary SAX filters, and it reduces the complexity of SAX: only the SAX handler code has to be written, while the program manages the parser, the pipeline, and the serialisation of the pipeline output back to disk. Any class implementing the XMLFilter interface may be plugged into the pipeline by giving the qualified name of the Java class in the query document (the class must be found on the CLASSPATH), and can interact with the existing filters.&lt;br /&gt;
* it may be used for prototyping Java code, by embedding Java code defining a SAX filter in the pipeline; see http://panini.u-paris10.fr/~sloiseau/CR/filtres/Script.html or http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#embeddedJava&lt;br /&gt;
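&lt;br /&gt;
As an illustration of the SAX-filter approach (a generic sketch using only the standard Java SAX API, not CR-specific code), a custom filter is simply a class extending &amp;lt;code&amp;gt;org.xml.sax.helpers.XMLFilterImpl&amp;lt;/code&amp;gt; and overriding the callbacks it cares about:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
import org.xml.sax.Attributes;&lt;br /&gt;
import org.xml.sax.SAXException;&lt;br /&gt;
import org.xml.sax.helpers.XMLFilterImpl;&lt;br /&gt;
&lt;br /&gt;
// Counts &amp;lt;w&amp;gt; elements while passing all events through unchanged.&lt;br /&gt;
public class WordCounter extends XMLFilterImpl {&lt;br /&gt;
    private int count = 0;&lt;br /&gt;
    public int getCount() { return count; }&lt;br /&gt;
    public void startElement(String uri, String localName,&lt;br /&gt;
            String qName, Attributes atts) throws SAXException {&lt;br /&gt;
        if (&amp;quot;w&amp;quot;.equals(localName)) count++;&lt;br /&gt;
        super.startElement(uri, localName, qName, atts);  // forward the event&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;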
&lt;br /&gt;
The goal was to make a low-level API easy to use. &amp;quot;CorpusReader&amp;quot; &lt;br /&gt;
is named after &amp;quot;XMLReader&amp;quot;, the class that parses a document &lt;br /&gt;
in the Java SAX API. It is intended to be a &amp;quot;layer&amp;quot; on top of SAX for &lt;br /&gt;
the TEI vocabulary.&lt;br /&gt;
&lt;br /&gt;
== Fitted for documents in the TEI scheme ==&lt;br /&gt;
&lt;br /&gt;
The program tries to rely on the TEI scheme: &lt;br /&gt;
some structural properties of TEI documents are sometimes needed, &lt;br /&gt;
but I try not to make it rely on a specific TEI customisation. &lt;br /&gt;
(I would like to develop a mechanism for using the &amp;quot;TEI customization&amp;quot; &lt;br /&gt;
document produced by Roma to override the default vocabulary.)&lt;br /&gt;
&lt;br /&gt;
= External Links =&lt;br /&gt;
 &lt;br /&gt;
* The site with documentation (mainly in French): http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
* Examples summarising the properties of CR in rough English: http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#concordances&lt;br /&gt;
* Download page: http://panini.u-paris10.fr/~sloiseau/CR/download.html&lt;br /&gt;
* URL of an archive containing the program and all the required external libraries: http://panini.u-paris10.fr/~sloiseau/CR/download/CR.zip&lt;br /&gt;
* For questions (or advice): [mailto:sylvain.loiseau@u-paris10.fr sloiseau@u-paris10.fr]&lt;br /&gt;
&lt;br /&gt;
[[Category:Tools]]&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2747</id>
		<title>CorpusReader</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2747"/>
		<updated>2006-08-26T18:23:52Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Summary =&lt;br /&gt;
&lt;br /&gt;
CorpusReader (CR) is a tool for extracting subcorpora, KWIC &lt;br /&gt;
concordances and quantitative information from arbitrarily large corpora &lt;br /&gt;
in the TEI vocabulary. It aims to provide ways of &lt;br /&gt;
processing corpora containing milestoned annotation. &lt;br /&gt;
It also provides mechanisms for merging several XML documents together.&lt;br /&gt;
&lt;br /&gt;
= Background =&lt;br /&gt;
&lt;br /&gt;
CorpusReader was developed for quantitative corpus linguistics. &lt;br /&gt;
Its original goal was to provide a way of extracting quantitative &lt;br /&gt;
information (such as co-occurrence matrices) using all the information &lt;br /&gt;
expressed in the XML infoset.&lt;br /&gt;
 &lt;br /&gt;
As a ''quantitative'' corpus ''linguistic'' tool, CorpusReader has &lt;br /&gt;
no ''linguistic'' or ''statistical'' skills of its own... but it provides help &lt;br /&gt;
with:&lt;br /&gt;
&lt;br /&gt;
* importing the outputs of existing linguistic tools (taggers, parsers, etc.) into a corpus, &lt;br /&gt;
* exporting quantitative data to statistical tool formats (for Matlab, [http://www.r-project.org/ R] and [http://www.lebart.org DTM], for instance).&lt;br /&gt;
&lt;br /&gt;
The rationale is that existing linguistic and statistical tools should &lt;br /&gt;
be reused and made easy to use in the context of a TEI corpus. &lt;br /&gt;
Thus CorpusReader try to be a bridge between linguistic and statistical &lt;br /&gt;
tools for creating and exploring empirically &lt;br /&gt;
complex data. &lt;br /&gt;
&lt;br /&gt;
= Technical features =&lt;br /&gt;
&lt;br /&gt;
CR is written in Java and works with Java 1.4 and later. It relies on &lt;br /&gt;
numerous high-quality external open-source libraries.&lt;br /&gt;
&lt;br /&gt;
It runs at the command line.&lt;br /&gt;
&lt;br /&gt;
It is released under the BSD licence.&lt;br /&gt;
&lt;br /&gt;
There is a [http://panini.u-paris10.fr/~sloiseau/CR/ web site] (see the bottom of this page for more links).&lt;br /&gt;
&lt;br /&gt;
= Properties =&lt;br /&gt;
 &lt;br /&gt;
== Functions are filters ==&lt;br /&gt;
&lt;br /&gt;
The program relies on a &amp;quot;streaming API&amp;quot;: the document is processed as a stream rather than as a tree. The &amp;quot;functions&amp;quot; of the program are implemented as &amp;quot;filters&amp;quot; applied to this stream. The program is mainly a collection of filters, plus a mechanism for plugging those filters into a pipeline.&lt;br /&gt;
 &lt;br /&gt;
Each filter is specialized in one precise task and takes few arguments. &lt;br /&gt;
This allows modularity and reusability. While each filter performs a &lt;br /&gt;
simple task, a pipeline of filters may achieve complex tasks.&lt;br /&gt;
&lt;br /&gt;
A query document looks like:&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;corpus.xml&amp;quot;&lt;br /&gt;
          outURI=&amp;quot;sample-manuel-1.out&amp;quot;&lt;br /&gt;
          /&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;myFilterName&amp;quot; javaClass=&amp;quot;java.class.qualified.Name&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  Argument subtree, passed to the filter if any, after&lt;br /&gt;
              validation if a schema is known for this filter.&lt;br /&gt;
        --&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
    &amp;lt;!--  etc.: as many filters as needed  --&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
In some cases, filters can communicate directly with each other &lt;br /&gt;
(in addition to communicating through the stream of events).&lt;br /&gt;
&lt;br /&gt;
== Using high level languages on large documents (XPath, XSLT, XQuery) ==&lt;br /&gt;
&lt;br /&gt;
High-level languages (XPath, XSLT and XQuery) are made &lt;br /&gt;
available: the stream of XML events is buffered, &lt;br /&gt;
transformed or queried, and thrown back to the next &lt;br /&gt;
filter in the pipeline as a stream of XML events.&lt;br /&gt;
 &lt;br /&gt;
When the corpus does not fit in memory, a mechanism allows &lt;br /&gt;
the subtrees to be buffered and transformed &lt;br /&gt;
successively, one by one, as separate documents. The &lt;br /&gt;
&amp;lt;code&amp;gt;split&amp;lt;/code&amp;gt; element in the query document divides &lt;br /&gt;
the stream into sub-documents. Each &amp;lt;code&amp;gt;filter&amp;lt;/code&amp;gt; &lt;br /&gt;
inside a &amp;lt;code&amp;gt;split&amp;lt;/code&amp;gt; element sees the corpus as several &lt;br /&gt;
documents rooted at the elements defined by &lt;br /&gt;
&amp;lt;code&amp;gt;split/@localName&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split localName=&amp;quot;TEI&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
There is also a way of addressing elements in the stream &lt;br /&gt;
through an XPath expression evaluated against each element, &lt;br /&gt;
one at a time, as if each were a stand-alone document. For instance, &lt;br /&gt;
the query document above could be written as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split elxpath=&amp;quot;*[namespace-uri()='http://www.tei-c.org/ns/1.0' &lt;br /&gt;
                      and local-name()='div']&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Merging documents ==&lt;br /&gt;
&lt;br /&gt;
CR contains a mechanism for merging an external document into an already &lt;br /&gt;
annotated corpus without breaking well-formedness. This is useful for reusing &lt;br /&gt;
the output of existing linguistic annotation tools. There is some [http://panini.u-paris10.fr/~sloiseau/CR/filtres/EncodingMerger.html documentation] (in French, and already dated).&lt;br /&gt;
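&lt;br /&gt;
As a sketch only (the filter class name is guessed from the URL of the documentation above, and the argument element is purely hypothetical), such a merge step would be declared like any other filter in the query document:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;filter name=&amp;quot;merge_annotation&amp;quot; javaClass=&amp;quot;tei.cr.filters.EncodingMerger&amp;quot;&amp;gt;&lt;br /&gt;
  &amp;lt;args&amp;gt;&lt;br /&gt;
    &amp;lt;!--  Hypothetical argument: the external document to merge in.  --&amp;gt;&lt;br /&gt;
    &amp;lt;document URI=&amp;quot;path/to/annotation.xml&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;/args&amp;gt;&lt;br /&gt;
&amp;lt;/filter&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;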
&lt;br /&gt;
== Dealing with intersecting hierarchies ==&lt;br /&gt;
&lt;br /&gt;
[TODO]&lt;br /&gt;
&lt;br /&gt;
== Using a low level API ==&lt;br /&gt;
&lt;br /&gt;
I tried to make the program stand between a &amp;quot;tool&amp;quot; and an &amp;quot;API&amp;quot;: it may be seen as a framework that facilitates the use of a low-level API. It may thus be used in different ways:&lt;br /&gt;
&lt;br /&gt;
* it provides ready-to-use functions for creating KWIC concordances, extracting subcorpora, computing co-occurrence matrices, merging markup, etc.;&lt;br /&gt;
* it may be used more generally as a way of applying XSLT / XQuery to large corpora, whatever the vocabulary;&lt;br /&gt;
* it may be used for plugging in arbitrary SAX filters, reducing the complexity of SAX: you write only the SAX handler code, while the program manages the parser, the pipeline, and the serialisation of the pipeline output back to disk. Any class implementing the XMLFilter interface may be plugged into the pipeline by giving the qualified name of the Java class in the query document (the class must be on the CLASSPATH), and it can interact with the existing filters;&lt;br /&gt;
* it may be used for prototyping Java code, by embedding Java code defining a SAX filter in the pipeline; see http://panini.u-paris10.fr/~sloiseau/CR/filtres/Script.html or http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#embeddedJava&lt;br /&gt;
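&lt;br /&gt;
To illustrate the SAX-filter point with a sketch (this class is a hypothetical example, not part of CR; only the standard Java SAX API is used), a pass-through filter counting elements named &amp;quot;w&amp;quot; could look as follows, and would be plugged into the pipeline by giving its qualified name in filter/@javaClass:&lt;br /&gt;
&lt;br /&gt;

```java
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical sketch, not part of CR: a SAX filter that counts elements
// whose local name is "w" while forwarding every event unchanged downstream.
public class TokenCounter extends XMLFilterImpl {
    private int count = 0;

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        if ("w".equals(localName)) {
            count++;
        }
        // Pass the event through to the next filter in the pipeline.
        super.startElement(uri, localName, qName, atts);
    }

    public int getCount() {
        return count;
    }
}
```

Such a class, once on the CLASSPATH, interacts with the other filters only through the stream of events.&lt;br /&gt;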
&lt;br /&gt;
The goal was to make a low-level API easy to use. &amp;quot;CorpusReader&amp;quot; &lt;br /&gt;
is named after &amp;quot;XMLReader&amp;quot;, the name of the class that parses a document &lt;br /&gt;
in the Java SAX API. It is intended to be a &amp;quot;layer&amp;quot; on top of SAX for &lt;br /&gt;
the TEI vocabulary.&lt;br /&gt;
&lt;br /&gt;
== Fitted for documents in the TEI scheme ==&lt;br /&gt;
&lt;br /&gt;
The program tries to rely on the TEI scheme: &lt;br /&gt;
some structural properties of TEI documents are sometimes needed, &lt;br /&gt;
but I try not to make it rely on a specific TEI customisation. &lt;br /&gt;
(I would like to develop a mechanism for using the &amp;quot;TEI customization&amp;quot; &lt;br /&gt;
document produced by Roma to override the default vocabulary.)&lt;br /&gt;
&lt;br /&gt;
= External Links =&lt;br /&gt;
 &lt;br /&gt;
* The site, with documentation (mainly in French): http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
* Examples summing up the properties of CR, in approximate English: http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#concordances&lt;br /&gt;
* Download page: http://panini.u-paris10.fr/~sloiseau/CR/download.html&lt;br /&gt;
* URL of an archive containing the program and all the required external libraries: http://panini.u-paris10.fr/~sloiseau/CR/download/CR.zip&lt;br /&gt;
* For questions or comments: sloiseau@u-paris10.fr&lt;br /&gt;
&lt;br /&gt;
[[Category:Tools]]&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2746</id>
		<title>CorpusReader</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2746"/>
		<updated>2006-08-26T18:21:04Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Summary =&lt;br /&gt;
&lt;br /&gt;
CorpusReader (CR) is a tool for extracting subcorpora and quantitative &lt;br /&gt;
information from arbitrarily large corpora in the TEI vocabulary. It aims to provide ways of processing corpora containing milestoned annotation, and it provides a mechanism for merging several XML documents together.&lt;br /&gt;
&lt;br /&gt;
= Background =&lt;br /&gt;
&lt;br /&gt;
CorpusReader was developed for quantitative corpus linguistics. &lt;br /&gt;
Its original goal was to provide a way of extracting quantitative &lt;br /&gt;
information (such as co-occurrence matrices) using all the information &lt;br /&gt;
expressed in the XML infoset.&lt;br /&gt;
 &lt;br /&gt;
As a ''quantitative'' corpus ''linguistic'' tool, CorpusReader has no &lt;br /&gt;
''linguistic'' or ''statistical'' skills of its own... but it helps &lt;br /&gt;
with:&lt;br /&gt;
&lt;br /&gt;
* importing the output of existing linguistic tools (taggers, parsers, etc.) into a corpus, &lt;br /&gt;
* exporting quantitative data in the formats of statistical tools (towards Matlab, [http://www.r-project.org/ R] and [http://www.lebart.org DTM], for instance).&lt;br /&gt;
&lt;br /&gt;
The rationale is that existing linguistic and statistical tools should &lt;br /&gt;
be reused and made easy to use in the context of a TEI corpus. &lt;br /&gt;
CorpusReader thus tries to be a bridge between linguistic and statistical &lt;br /&gt;
tools for building and empirically exploring &lt;br /&gt;
complex data. &lt;br /&gt;
&lt;br /&gt;
= Technical features =&lt;br /&gt;
&lt;br /&gt;
CR is written in Java and works with Java 1.4 and later. It relies on &lt;br /&gt;
numerous high-quality external open-source libraries.&lt;br /&gt;
&lt;br /&gt;
It runs at the command line.&lt;br /&gt;
&lt;br /&gt;
It is released under the BSD licence.&lt;br /&gt;
&lt;br /&gt;
There is a [http://panini.u-paris10.fr/~sloiseau/CR/ web site] (see the bottom of this page for more links).&lt;br /&gt;
&lt;br /&gt;
= Properties =&lt;br /&gt;
 &lt;br /&gt;
== Functions are filters ==&lt;br /&gt;
&lt;br /&gt;
The program relies on a &amp;quot;streaming API&amp;quot;: the document is processed as a stream rather than as a tree. The &amp;quot;functions&amp;quot; of the program are implemented as &amp;quot;filters&amp;quot; applied to this stream. The program is mainly a collection of filters, plus a mechanism for plugging those filters into a pipeline.&lt;br /&gt;
 &lt;br /&gt;
Each filter is specialized in one precise task and takes few arguments. &lt;br /&gt;
This allows modularity and reusability. While each filter performs a &lt;br /&gt;
simple task, a pipeline of filters may achieve complex tasks.&lt;br /&gt;
&lt;br /&gt;
A query document looks like:&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;corpus.xml&amp;quot;&lt;br /&gt;
          outURI=&amp;quot;sample-manuel-1.out&amp;quot;&lt;br /&gt;
          /&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;myFilterName&amp;quot; javaClass=&amp;quot;java.class.qualified.Name&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  Argument subtree, passed to the filter if any, after&lt;br /&gt;
              validation if a schema is known for this filter.&lt;br /&gt;
        --&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
    &amp;lt;!--  etc.: as many filters as needed  --&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
In some cases, filters can communicate directly with each other &lt;br /&gt;
(in addition to communicating through the stream of events).&lt;br /&gt;
&lt;br /&gt;
== Using high level languages on large documents (XPath, XSLT, XQuery) ==&lt;br /&gt;
&lt;br /&gt;
High-level languages (XPath, XSLT and XQuery) are made &lt;br /&gt;
available: the stream of XML events is buffered, &lt;br /&gt;
transformed or queried, and thrown back to the next &lt;br /&gt;
filter in the pipeline as a stream of XML events.&lt;br /&gt;
 &lt;br /&gt;
When the corpus does not fit in memory, a mechanism allows &lt;br /&gt;
the subtrees to be buffered and transformed &lt;br /&gt;
successively, one by one, as separate documents. The &lt;br /&gt;
&amp;lt;code&amp;gt;split&amp;lt;/code&amp;gt; element in the query document divides &lt;br /&gt;
the stream into sub-documents. Each &amp;lt;code&amp;gt;filter&amp;lt;/code&amp;gt; &lt;br /&gt;
inside a &amp;lt;code&amp;gt;split&amp;lt;/code&amp;gt; element sees the corpus as several &lt;br /&gt;
documents rooted at the elements defined by &lt;br /&gt;
&amp;lt;code&amp;gt;split/@localName&amp;lt;/code&amp;gt;:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split localName=&amp;quot;TEI&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
There is also a way of addressing elements in the stream &lt;br /&gt;
through an XPath expression evaluated against each element, &lt;br /&gt;
one at a time, as if each were a stand-alone document. For instance, &lt;br /&gt;
the query document above could be written as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split elxpath=&amp;quot;*[namespace-uri()='http://www.tei-c.org/ns/1.0' &lt;br /&gt;
                      and local-name()='div']&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Merging documents ==&lt;br /&gt;
&lt;br /&gt;
CR contains a mechanism for merging an external document into an already &lt;br /&gt;
annotated corpus without breaking well-formedness. This is useful for reusing &lt;br /&gt;
the output of existing linguistic annotation tools. There is some [http://panini.u-paris10.fr/~sloiseau/CR/filtres/EncodingMerger.html documentation] (in French, and already dated).&lt;br /&gt;
&lt;br /&gt;
== Dealing with intersecting hierarchies ==&lt;br /&gt;
&lt;br /&gt;
[TODO]&lt;br /&gt;
&lt;br /&gt;
== Using a low level API ==&lt;br /&gt;
&lt;br /&gt;
I tried to make the program stand between a &amp;quot;tool&amp;quot; and an &amp;quot;API&amp;quot;: it may be seen as a framework that facilitates the use of a low-level API. It may thus be used in different ways:&lt;br /&gt;
&lt;br /&gt;
* it provides ready-to-use functions for creating KWIC concordances, extracting subcorpora, computing co-occurrence matrices, merging markup, etc.;&lt;br /&gt;
* it may be used more generally as a way of applying XSLT / XQuery to large corpora, whatever the vocabulary;&lt;br /&gt;
* it may be used for plugging in arbitrary SAX filters, reducing the complexity of SAX: you write only the SAX handler code, while the program manages the parser, the pipeline, and the serialisation of the pipeline output back to disk. Any class implementing the XMLFilter interface may be plugged into the pipeline by giving the qualified name of the Java class in the query document (the class must be on the CLASSPATH), and it can interact with the existing filters;&lt;br /&gt;
* it may be used for prototyping Java code, by embedding Java code defining a SAX filter in the pipeline; see http://panini.u-paris10.fr/~sloiseau/CR/filtres/Script.html or http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#embeddedJava&lt;br /&gt;
&lt;br /&gt;
The goal was to make a low-level API easy to use. &amp;quot;CorpusReader&amp;quot; &lt;br /&gt;
is named after &amp;quot;XMLReader&amp;quot;, the name of the class that parses a document &lt;br /&gt;
in the Java SAX API. It is intended to be a &amp;quot;layer&amp;quot; on top of SAX for &lt;br /&gt;
the TEI vocabulary.&lt;br /&gt;
&lt;br /&gt;
== Fitted for documents in the TEI scheme ==&lt;br /&gt;
&lt;br /&gt;
The program tries to rely on the TEI scheme: &lt;br /&gt;
some structural properties of TEI documents are sometimes needed, &lt;br /&gt;
but I try not to make it rely on a specific TEI customisation. &lt;br /&gt;
(I would like to develop a mechanism for using the &amp;quot;TEI customization&amp;quot; &lt;br /&gt;
document produced by Roma to override the default vocabulary.)&lt;br /&gt;
&lt;br /&gt;
= External Links =&lt;br /&gt;
 &lt;br /&gt;
* The site, with documentation (mainly in French): http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
* Examples summing up the properties of CR, in approximate English: http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#concordances&lt;br /&gt;
* Download page: http://panini.u-paris10.fr/~sloiseau/CR/download.html&lt;br /&gt;
* URL of an archive containing the program and all the required external libraries: http://panini.u-paris10.fr/~sloiseau/CR/download/CR.zip&lt;br /&gt;
* For questions or comments: sloiseau@u-paris10.fr&lt;br /&gt;
&lt;br /&gt;
[[Category:Tools]]&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2745</id>
		<title>CorpusReader</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2745"/>
		<updated>2006-08-26T18:14:45Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Summary =&lt;br /&gt;
&lt;br /&gt;
CorpusReader (CR) is a tool for extracting subcorpora and quantitative &lt;br /&gt;
information from arbitrarily large corpora in the TEI vocabulary. It aims to provide ways of processing corpora containing milestoned annotation, and it provides a mechanism for merging several XML documents together.&lt;br /&gt;
&lt;br /&gt;
= Background =&lt;br /&gt;
&lt;br /&gt;
CorpusReader was developed for quantitative corpus linguistics. &lt;br /&gt;
Its original goal was to provide a way of extracting quantitative &lt;br /&gt;
information (such as co-occurrence matrices) using all the information &lt;br /&gt;
expressed in the XML infoset.&lt;br /&gt;
 &lt;br /&gt;
As a ''quantitative'' corpus ''linguistic'' tool, CorpusReader has no &lt;br /&gt;
''linguistic'' or ''statistical'' skills of its own... but it helps &lt;br /&gt;
with:&lt;br /&gt;
&lt;br /&gt;
* importing the output of existing linguistic tools (taggers, parsers, etc.) into a corpus, &lt;br /&gt;
* exporting quantitative data in the formats of statistical tools (towards Matlab, [http://www.r-project.org/ R] and [http://www.lebart.org DTM], for instance).&lt;br /&gt;
&lt;br /&gt;
The rationale is that existing linguistic and statistical tools should &lt;br /&gt;
be reused and made easy to use in the context of a TEI corpus. &lt;br /&gt;
CorpusReader thus tries to be a bridge between linguistic and statistical &lt;br /&gt;
tools for building and empirically exploring &lt;br /&gt;
complex data. &lt;br /&gt;
&lt;br /&gt;
= Technical features =&lt;br /&gt;
&lt;br /&gt;
CR is written in Java and works with Java 1.4 and later. It relies on &lt;br /&gt;
numerous high-quality external open-source libraries.&lt;br /&gt;
&lt;br /&gt;
It runs at the command line.&lt;br /&gt;
&lt;br /&gt;
It is released under the BSD licence.&lt;br /&gt;
&lt;br /&gt;
There is a [http://panini.u-paris10.fr/~sloiseau/CR/ web site] (see the bottom of this page for more links).&lt;br /&gt;
&lt;br /&gt;
= Properties =&lt;br /&gt;
 &lt;br /&gt;
== Functions are filters ==&lt;br /&gt;
&lt;br /&gt;
The program relies on a &amp;quot;streaming API&amp;quot;: the document is processed as a stream rather than as a tree. The &amp;quot;functions&amp;quot; of the program are implemented as &amp;quot;filters&amp;quot; applied to this stream. The program is mainly a collection of filters, plus a mechanism for plugging those filters into a pipeline.&lt;br /&gt;
 &lt;br /&gt;
Each filter is specialized in one precise task and takes few arguments. &lt;br /&gt;
This allows modularity and reusability. While each filter performs a &lt;br /&gt;
simple task, a pipeline of filters may achieve complex tasks.&lt;br /&gt;
&lt;br /&gt;
A query document looks like:&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;corpus.xml&amp;quot;&lt;br /&gt;
          outURI=&amp;quot;sample-manuel-1.out&amp;quot;&lt;br /&gt;
          /&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;myFilterName&amp;quot; javaClass=&amp;quot;java.class.qualified.Name&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  Argument subtree, passed to the filter if any, after&lt;br /&gt;
              validation if a schema is known for this filter.&lt;br /&gt;
        --&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
    &amp;lt;!--  etc.: as many filters as needed  --&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
In some cases, filters can communicate directly with each other &lt;br /&gt;
(in addition to communicating through the stream of events).&lt;br /&gt;
&lt;br /&gt;
== Using high level languages on large documents (XPath, XSLT, XQuery) ==&lt;br /&gt;
&lt;br /&gt;
High-level languages (XPath, XSLT and XQuery) are made &lt;br /&gt;
available: the stream of XML events is buffered, &lt;br /&gt;
transformed or queried, and thrown back to the next &lt;br /&gt;
filter in the pipeline as a stream of XML events.&lt;br /&gt;
 &lt;br /&gt;
When the corpus does not fit in memory, a mechanism allows &lt;br /&gt;
the subtrees to be buffered and transformed &lt;br /&gt;
successively, one by one, as separate documents. The &lt;br /&gt;
&amp;quot;split&amp;quot; element in the query document divides the stream&lt;br /&gt;
into sub-documents. Each &amp;quot;filter&amp;quot; inside a &amp;quot;split&amp;quot; element &lt;br /&gt;
sees the corpus as several documents rooted &lt;br /&gt;
at the split/@localName elements:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split localName=&amp;quot;TEI&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
There is also a way of addressing elements in the stream &lt;br /&gt;
through an XPath expression evaluated against each element, &lt;br /&gt;
one at a time, as if each were a stand-alone document. For instance, &lt;br /&gt;
the query document above could be written as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split elxpath=&amp;quot;*[namespace-uri()='http://www.tei-c.org/ns/1.0' &lt;br /&gt;
                      and local-name()='div']&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Merging documents ==&lt;br /&gt;
&lt;br /&gt;
CR contains a mechanism for merging an external document into an already &lt;br /&gt;
annotated corpus without breaking well-formedness. This is useful for reusing &lt;br /&gt;
the output of existing linguistic annotation tools. There is some [http://panini.u-paris10.fr/~sloiseau/CR/filtres/EncodingMerger.html documentation] (in French, and already dated).&lt;br /&gt;
&lt;br /&gt;
== Dealing with intersecting hierarchies ==&lt;br /&gt;
&lt;br /&gt;
[TODO]&lt;br /&gt;
&lt;br /&gt;
== Using a low level API ==&lt;br /&gt;
&lt;br /&gt;
I tried to make the program stand between a &amp;quot;tool&amp;quot; and an &amp;quot;API&amp;quot;: it may be seen as a framework that facilitates the use of a low-level API. It may thus be used in different ways:&lt;br /&gt;
&lt;br /&gt;
* it provides ready-to-use functions for creating KWIC concordances, extracting subcorpora, computing co-occurrence matrices, merging markup, etc.;&lt;br /&gt;
* it may be used more generally as a way of applying XSLT / XQuery to large corpora, whatever the vocabulary;&lt;br /&gt;
* it may be used for plugging in arbitrary SAX filters, reducing the complexity of SAX: you write only the SAX handler code, while the program manages the parser, the pipeline, and the serialisation of the pipeline output back to disk. Any class implementing the XMLFilter interface may be plugged into the pipeline by giving the qualified name of the Java class in the query document (the class must be on the CLASSPATH), and it can interact with the existing filters;&lt;br /&gt;
* it may be used for prototyping Java code, by embedding Java code defining a SAX filter in the pipeline; see http://panini.u-paris10.fr/~sloiseau/CR/filtres/Script.html or http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#embeddedJava&lt;br /&gt;
&lt;br /&gt;
The goal was to make easy the use of a low-level API. &amp;quot;CorpusReader&amp;quot; &lt;br /&gt;
is nammed after &amp;quot;XMLReader&amp;quot;, the name of the class parsing a document &lt;br /&gt;
in the java SAX API. It is intended to be a &amp;quot;layer&amp;quot; onto SAX for &lt;br /&gt;
the TEI vocabulary.&lt;br /&gt;
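&lt;br /&gt;
The XMLFilter plug-in mechanism can be illustrated with a small stand-alone Java sketch (hypothetical code, not part of CR; the sample input is invented): a filter extending the standard helper class XMLFilterImpl observes each SAX event and passes it on unchanged.&lt;br /&gt;
&lt;br /&gt;
```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical stand-alone sketch, not part of CR: a SAX filter of the
// kind CR can plug into its pipeline. It counts start-element events and
// forwards every event unchanged to the next stage.
public class CountingFilter extends XMLFilterImpl {
    private int count = 0;

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) throws SAXException {
        count++;                                         // observe the event
        super.startElement(uri, localName, qName, atts); // then forward it
    }

    public int getCount() { return count; }

    public static void main(String[] args) throws Exception {
        CountingFilter filter = new CountingFilter();
        filter.setParent(SAXParserFactory.newInstance()
                         .newSAXParser().getXMLReader());
        filter.parse(new InputSource(new StringReader(
                "<TEI><text><body/></text></TEI>")));
        System.out.println(filter.getCount()); // prints 3
    }
}
```

In CR itself, such a class would not need the main method: it would be referenced through its qualified name in a filter/@javaClass attribute of the query document.&lt;br /&gt;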
&lt;br /&gt;
== Fitted for documents in the TEI scheme ==&lt;br /&gt;
&lt;br /&gt;
The program tries to rely on the TEI scheme: &lt;br /&gt;
some structural properties of TEI documents are sometimes needed, &lt;br /&gt;
but I try not to make it rely on a specific TEI customisation. &lt;br /&gt;
(I would like to develop a mechanism for using the &amp;quot;TEI customization&amp;quot; &lt;br /&gt;
document produced by Roma to override the default vocabulary.)&lt;br /&gt;
&lt;br /&gt;
= External Links =&lt;br /&gt;
 &lt;br /&gt;
* The site with documentation (mainly in French): http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
* Examples summing up the properties of CR, in rough English: http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#concordances&lt;br /&gt;
* Download page: http://panini.u-paris10.fr/~sloiseau/CR/download.html&lt;br /&gt;
* URL of an archive containing the program and all the required external libraries: http://panini.u-paris10.fr/~sloiseau/CR/download/CR.zip&lt;br /&gt;
* For questions or comments: sloiseau@u-paris10.fr&lt;br /&gt;
&lt;br /&gt;
[[Category:Tools]]&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2744</id>
		<title>CorpusReader</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2744"/>
		<updated>2006-08-26T18:13:33Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Summary =&lt;br /&gt;
&lt;br /&gt;
CorpusReader (CR) is a tool for extracting subcorpora and quantitative &lt;br /&gt;
information from arbitrarily large corpora in the TEI vocabulary. It aims to provide ways of processing corpora containing milestoned annotation, and it provides mechanisms for merging several XML documents together.&lt;br /&gt;
&lt;br /&gt;
= Background =&lt;br /&gt;
&lt;br /&gt;
CorpusReader was developed for quantitative corpus linguistics. &lt;br /&gt;
Its original goal was to provide a way of extracting quantitative &lt;br /&gt;
information (such as co-occurrence matrices) using all the information &lt;br /&gt;
expressed in the XML infoset.&lt;br /&gt;
 &lt;br /&gt;
As a ''quantitative'' corpus ''linguistic'' tool, CorpusReader has no &lt;br /&gt;
''linguistic'' or ''statistical'' skills of its own... but it provides help &lt;br /&gt;
for:&lt;br /&gt;
&lt;br /&gt;
* importing the outputs of existing linguistic tools (taggers, parsers, etc.) into a corpus, &lt;br /&gt;
* exporting quantitative data in the formats of statistical tools (toward Matlab, [http://www.r-project.org/ R] and [http://www.lebart.org DTM], for instance).&lt;br /&gt;
&lt;br /&gt;
The rationale is that existing linguistic and statistical tools should &lt;br /&gt;
be reused and made easy to use in the context of a TEI corpus. &lt;br /&gt;
CorpusReader thus tries to be a bridge between linguistic and statistical &lt;br /&gt;
tools for creating and empirically exploring &lt;br /&gt;
complex data. &lt;br /&gt;
&lt;br /&gt;
= Technical features =&lt;br /&gt;
&lt;br /&gt;
CR is written in Java and works with Java 1.4 and later. It relies on &lt;br /&gt;
numerous external high-quality open-source libraries.&lt;br /&gt;
&lt;br /&gt;
It runs at the command line.&lt;br /&gt;
&lt;br /&gt;
It is released under the BSD licence.&lt;br /&gt;
&lt;br /&gt;
There is a [http://panini.u-paris10.fr/~sloiseau/CR/ web site] (see the bottom of this page for more links).&lt;br /&gt;
&lt;br /&gt;
= Properties =&lt;br /&gt;
 &lt;br /&gt;
== Functions are filters ==&lt;br /&gt;
&lt;br /&gt;
The program relies on a &amp;quot;streaming API&amp;quot;: the document is processed as a stream rather than as a tree. The &amp;quot;functions&amp;quot; of the program are implemented as &amp;quot;filters&amp;quot; applied to this stream. The program is essentially a collection of filters plus a mechanism for plugging the filters into a pipeline.&lt;br /&gt;
 &lt;br /&gt;
Each filter is specialized in one precise task and takes few arguments. &lt;br /&gt;
This allows modularity and reusability: while each filter performs a &lt;br /&gt;
simple task, a pipeline of filters can achieve complex tasks.&lt;br /&gt;
&lt;br /&gt;
A query document looks like:&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;corpus.xml&amp;quot;&lt;br /&gt;
          outURI=&amp;quot;sample-manuel-1.out&amp;quot;&lt;br /&gt;
          /&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;myFilterName&amp;quot; javaClass=&amp;quot;java.class.qualified.Name&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  Argument subtree, passed to the filter if any, after&lt;br /&gt;
              validation if a schema is known for this filter.&lt;br /&gt;
        --&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
    &amp;lt;!--  etc.: as many filters as needed  --&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
In some cases, filters can communicate directly with each other &lt;br /&gt;
(in addition to communicating through the stream of events).&lt;br /&gt;
&lt;br /&gt;
== Dealing with intersecting hierarchies ==&lt;br /&gt;
&lt;br /&gt;
[TODO]&lt;br /&gt;
&lt;br /&gt;
== Using high level languages on large documents (XPath, XSLT, XQuery) ==&lt;br /&gt;
&lt;br /&gt;
High-level languages (XPath, XSLT and XQuery) are made &lt;br /&gt;
available: the stream of XML events is buffered, &lt;br /&gt;
transformed or queried, and thrown back to the next &lt;br /&gt;
filter in the pipeline as a stream of XML events.&lt;br /&gt;
 &lt;br /&gt;
When the corpus does not fit in memory, a mechanism allows &lt;br /&gt;
the subtrees to be buffered and transformed &lt;br /&gt;
successively, one by one, as separate documents. The &lt;br /&gt;
&amp;quot;split&amp;quot; element in the query document divides the stream&lt;br /&gt;
into sub-documents. Each &amp;quot;filter&amp;quot; inside a &amp;quot;split&amp;quot; element &lt;br /&gt;
sees the corpus as several documents rooted &lt;br /&gt;
at the split/@localName elements:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split localName=&amp;quot;TEI&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
There is also a way of addressing elements in the stream &lt;br /&gt;
through an XPath expression evaluated against each element, &lt;br /&gt;
one at a time, as if each were a stand-alone document. For instance, &lt;br /&gt;
the query document above could be written as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split elxpath=&amp;quot;*[namespace-uri()='http://www.tei-c.org/ns/1.0' &lt;br /&gt;
                      and local-name()='div']&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Merging several documents ==&lt;br /&gt;
&lt;br /&gt;
CR contains a mechanism for merging an external document into an already &lt;br /&gt;
annotated corpus without breaking well-formedness. This is useful for reusing &lt;br /&gt;
the outputs of existing linguistic annotation tools. There is some [http://panini.u-paris10.fr/~sloiseau/CR/filtres/EncodingMerger.html documentation] (in French, and somewhat dated).&lt;br /&gt;
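&lt;br /&gt;
As a purely hypothetical sketch (the filter's qualified class name and its argument element are assumed here for illustration, not taken from the documentation), a merge step might be plugged into a query pipeline like any other filter:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;merge_annotations&amp;quot; javaClass=&amp;quot;tei.cr.filters.EncodingMerger&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  URI of the external annotated document to merge in&lt;br /&gt;
              (element and attribute names are assumed)  --&amp;gt;&lt;br /&gt;
        &amp;lt;document URI=&amp;quot;path/to/annotations.xml&amp;quot;&amp;gt;&amp;lt;/document&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;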
&lt;br /&gt;
== Using a low level API ==&lt;br /&gt;
&lt;br /&gt;
I tried to make the program stand between a &amp;quot;tool&amp;quot; and an &amp;quot;API&amp;quot;: it may be seen as a framework facilitating the use of a low-level API. Thus there are several ways of using it:&lt;br /&gt;
&lt;br /&gt;
* it provides ready-to-use functions for creating KWIC concordances, extracting subcorpora, computing co-occurrence matrices, merging markup, etc.&lt;br /&gt;
* it may be used more generally as a way of applying XSLT / XQuery to big corpora, whatever the vocabulary is.&lt;br /&gt;
* it may be used for plugging in arbitrary SAX filters, reducing the complexity of SAX: you write only the SAX handler code, while the program manages the parser, the pipeline, and the serialisation of the pipeline's output back to disk. Any class implementing the XMLFilter interface may be plugged into the pipeline by giving the qualified name of the Java class in the query document (the class must be found on the CLASSPATH), and it can interact with the existing filters.&lt;br /&gt;
* it may be used for prototyping Java code, by embedding Java code defining a SAX filter in the pipeline; see http://panini.u-paris10.fr/~sloiseau/CR/filtres/Script.html or http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#embeddedJava&lt;br /&gt;
&lt;br /&gt;
The goal was to make a low-level API easy to use. &amp;quot;CorpusReader&amp;quot; &lt;br /&gt;
is named after &amp;quot;XMLReader&amp;quot;, the name of the class that parses a document &lt;br /&gt;
in the Java SAX API. It is intended to be a &amp;quot;layer&amp;quot; on top of SAX for &lt;br /&gt;
the TEI vocabulary.&lt;br /&gt;
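&lt;br /&gt;
The XMLFilter plug-in mechanism can be illustrated with a small stand-alone Java sketch (hypothetical code, not part of CR; the sample input is invented): a filter extending the standard helper class XMLFilterImpl observes each SAX event and passes it on unchanged.&lt;br /&gt;
&lt;br /&gt;
```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical stand-alone sketch, not part of CR: a SAX filter of the
// kind CR can plug into its pipeline. It counts start-element events and
// forwards every event unchanged to the next stage.
public class CountingFilter extends XMLFilterImpl {
    private int count = 0;

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) throws SAXException {
        count++;                                         // observe the event
        super.startElement(uri, localName, qName, atts); // then forward it
    }

    public int getCount() { return count; }

    public static void main(String[] args) throws Exception {
        CountingFilter filter = new CountingFilter();
        filter.setParent(SAXParserFactory.newInstance()
                         .newSAXParser().getXMLReader());
        filter.parse(new InputSource(new StringReader(
                "<TEI><text><body/></text></TEI>")));
        System.out.println(filter.getCount()); // prints 3
    }
}
```

In CR itself, such a class would not need the main method: it would be referenced through its qualified name in a filter/@javaClass attribute of the query document.&lt;br /&gt;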
&lt;br /&gt;
== Fitted for documents in the TEI scheme ==&lt;br /&gt;
&lt;br /&gt;
The program tries to rely on the TEI scheme: &lt;br /&gt;
some structural properties of TEI documents are sometimes needed, &lt;br /&gt;
but I try not to make it rely on a specific TEI customisation. &lt;br /&gt;
(I would like to develop a mechanism for using the &amp;quot;TEI customization&amp;quot; &lt;br /&gt;
document produced by Roma to override the default vocabulary.)&lt;br /&gt;
&lt;br /&gt;
= External Links =&lt;br /&gt;
 &lt;br /&gt;
* The site with documentation (mainly in French): http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
* Examples summing up the properties of CR, in rough English: http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#concordances&lt;br /&gt;
* Download page: http://panini.u-paris10.fr/~sloiseau/CR/download.html&lt;br /&gt;
* URL of an archive containing the program and all the required external libraries: http://panini.u-paris10.fr/~sloiseau/CR/download/CR.zip&lt;br /&gt;
* For questions or comments: sloiseau@u-paris10.fr&lt;br /&gt;
&lt;br /&gt;
[[Category:Tools]]&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2743</id>
		<title>CorpusReader</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2743"/>
		<updated>2006-08-26T18:12:22Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: /* External Links */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Summary =&lt;br /&gt;
&lt;br /&gt;
CorpusReader (CR) is a tool for extracting subcorpora and quantitative &lt;br /&gt;
information from arbitrarily large corpora in the TEI vocabulary. It aims to provide ways of processing corpora containing milestoned annotation, and it provides mechanisms for merging several XML documents together.&lt;br /&gt;
&lt;br /&gt;
= Background =&lt;br /&gt;
&lt;br /&gt;
CorpusReader was developed for quantitative corpus linguistics. &lt;br /&gt;
Its original goal was to provide a way of extracting quantitative &lt;br /&gt;
information (such as co-occurrence matrices) using all the information &lt;br /&gt;
expressed in the XML infoset.&lt;br /&gt;
 &lt;br /&gt;
As a ''quantitative'' corpus ''linguistic'' tool, CorpusReader has no &lt;br /&gt;
''linguistic'' or ''statistical'' skills of its own... but it provides help &lt;br /&gt;
for:&lt;br /&gt;
&lt;br /&gt;
* importing the outputs of existing linguistic tools (taggers, parsers, etc.) into a corpus, &lt;br /&gt;
* exporting quantitative data in the formats of statistical tools (toward Matlab, [http://www.r-project.org/ R] and [http://www.lebart.org DTM], for instance).&lt;br /&gt;
&lt;br /&gt;
The rationale is that existing linguistic and statistical tools should &lt;br /&gt;
be reused and made easy to use in the context of a TEI corpus. &lt;br /&gt;
CorpusReader thus tries to be a bridge between linguistic and statistical &lt;br /&gt;
tools for creating and empirically exploring &lt;br /&gt;
complex data. &lt;br /&gt;
&lt;br /&gt;
= Technical features =&lt;br /&gt;
&lt;br /&gt;
CR is written in Java and works with Java 1.4 and later. It relies on &lt;br /&gt;
numerous external high-quality open-source libraries.&lt;br /&gt;
&lt;br /&gt;
It runs at the command line.&lt;br /&gt;
&lt;br /&gt;
It is released under the BSD licence.&lt;br /&gt;
&lt;br /&gt;
There is a [http://panini.u-paris10.fr/~sloiseau/CR/ web site] (see the bottom of this page for more links).&lt;br /&gt;
&lt;br /&gt;
= Properties =&lt;br /&gt;
 &lt;br /&gt;
== Functions are filters ==&lt;br /&gt;
&lt;br /&gt;
The program relies on a &amp;quot;streaming API&amp;quot;: the document is processed as a stream rather than as a tree. The &amp;quot;functions&amp;quot; of the program are implemented as &amp;quot;filters&amp;quot; applied to this stream. The program is essentially a collection of filters plus a mechanism for plugging the filters into a pipeline.&lt;br /&gt;
 &lt;br /&gt;
Each filter is specialized in one precise task and takes few arguments. &lt;br /&gt;
This allows modularity and reusability: while each filter performs a &lt;br /&gt;
simple task, a pipeline of filters can achieve complex tasks.&lt;br /&gt;
&lt;br /&gt;
A query document looks like:&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;corpus.xml&amp;quot;&lt;br /&gt;
          outURI=&amp;quot;sample-manuel-1.out&amp;quot;&lt;br /&gt;
          /&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;myFilterName&amp;quot; javaClass=&amp;quot;java.class.qualified.Name&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  Argument subtree, passed to the filter if any, after&lt;br /&gt;
              validation if a schema is known for this filter.&lt;br /&gt;
        --&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
    &amp;lt;!--  etc.: as many filters as needed  --&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
In some cases, filters can communicate directly with each other &lt;br /&gt;
(in addition to communicating through the stream of events).&lt;br /&gt;
&lt;br /&gt;
== Dealing with intersecting hierarchies ==&lt;br /&gt;
&lt;br /&gt;
[TODO]&lt;br /&gt;
&lt;br /&gt;
== Using high level languages on large documents (XPath, XSLT, XQuery) ==&lt;br /&gt;
&lt;br /&gt;
High-level languages (XPath, XSLT and XQuery) are made &lt;br /&gt;
available: the stream of XML events is buffered, &lt;br /&gt;
transformed or queried, and thrown back to the next &lt;br /&gt;
filter in the pipeline as a stream of XML events.&lt;br /&gt;
 &lt;br /&gt;
When the corpus does not fit in memory, a mechanism allows &lt;br /&gt;
the subtrees to be buffered and transformed &lt;br /&gt;
successively, one by one, as separate documents. The &lt;br /&gt;
&amp;quot;split&amp;quot; element in the query document divides the stream&lt;br /&gt;
into sub-documents. Each &amp;quot;filter&amp;quot; inside a &amp;quot;split&amp;quot; element &lt;br /&gt;
sees the corpus as several documents rooted &lt;br /&gt;
at the split/@localName elements:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split localName=&amp;quot;TEI&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
There is also a way of addressing elements in the stream &lt;br /&gt;
through an XPath expression evaluated against each element, &lt;br /&gt;
one at a time, as if each were a stand-alone document. For instance, &lt;br /&gt;
the query document above could be written as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split elxpath=&amp;quot;*[namespace-uri()='http://www.tei-c.org/ns/1.0' &lt;br /&gt;
                      and local-name()='div']&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Merging several documents ==&lt;br /&gt;
&lt;br /&gt;
CR contains a mechanism for merging an external document into an already &lt;br /&gt;
annotated corpus without breaking well-formedness. This is useful for reusing &lt;br /&gt;
the outputs of existing linguistic annotation tools. There is some [http://panini.u-paris10.fr/~sloiseau/CR/filtres/EncodingMerger.html documentation] (in French, and somewhat dated).&lt;br /&gt;
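&lt;br /&gt;
As a purely hypothetical sketch (the filter's qualified class name and its argument element are assumed here for illustration, not taken from the documentation), a merge step might be plugged into a query pipeline like any other filter:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;merge_annotations&amp;quot; javaClass=&amp;quot;tei.cr.filters.EncodingMerger&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  URI of the external annotated document to merge in&lt;br /&gt;
              (element and attribute names are assumed)  --&amp;gt;&lt;br /&gt;
        &amp;lt;document URI=&amp;quot;path/to/annotations.xml&amp;quot;&amp;gt;&amp;lt;/document&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;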
&lt;br /&gt;
== Using a low level API ==&lt;br /&gt;
&lt;br /&gt;
I tried to make the program stand between a &amp;quot;tool&amp;quot; and an &amp;quot;API&amp;quot;: it may be seen as a framework facilitating the use of a low-level API. Thus there are several ways of using it:&lt;br /&gt;
&lt;br /&gt;
* it provides ready-to-use functions for creating KWIC concordances, extracting subcorpora, computing co-occurrence matrices, merging markup, etc.&lt;br /&gt;
* it may be used more generally as a way of applying XSLT / XQuery to big corpora, whatever the vocabulary is.&lt;br /&gt;
* it may be used for plugging in arbitrary SAX filters, reducing the complexity of SAX: you write only the SAX handler code, while the program manages the parser, the pipeline, and the serialisation of the pipeline's output back to disk. Any class implementing the XMLFilter interface may be plugged into the pipeline by giving the qualified name of the Java class in the query document (the class must be found on the CLASSPATH), and it can interact with the existing filters.&lt;br /&gt;
* it may be used for prototyping Java code, by embedding Java code defining a SAX filter in the pipeline; see http://panini.u-paris10.fr/~sloiseau/CR/filtres/Script.html or http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#embeddedJava&lt;br /&gt;
&lt;br /&gt;
The goal was to make a low-level API easy to use. &amp;quot;CorpusReader&amp;quot; &lt;br /&gt;
is named after &amp;quot;XMLReader&amp;quot;, the name of the class that parses a document &lt;br /&gt;
in the Java SAX API. It is intended to be a &amp;quot;layer&amp;quot; on top of SAX for &lt;br /&gt;
the TEI vocabulary.&lt;br /&gt;
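&lt;br /&gt;
The XMLFilter plug-in mechanism can be illustrated with a small stand-alone Java sketch (hypothetical code, not part of CR; the sample input is invented): a filter extending the standard helper class XMLFilterImpl observes each SAX event and passes it on unchanged.&lt;br /&gt;
&lt;br /&gt;
```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical stand-alone sketch, not part of CR: a SAX filter of the
// kind CR can plug into its pipeline. It counts start-element events and
// forwards every event unchanged to the next stage.
public class CountingFilter extends XMLFilterImpl {
    private int count = 0;

    @Override
    public void startElement(String uri, String localName,
                             String qName, Attributes atts) throws SAXException {
        count++;                                         // observe the event
        super.startElement(uri, localName, qName, atts); // then forward it
    }

    public int getCount() { return count; }

    public static void main(String[] args) throws Exception {
        CountingFilter filter = new CountingFilter();
        filter.setParent(SAXParserFactory.newInstance()
                         .newSAXParser().getXMLReader());
        filter.parse(new InputSource(new StringReader(
                "<TEI><text><body/></text></TEI>")));
        System.out.println(filter.getCount()); // prints 3
    }
}
```

In CR itself, such a class would not need the main method: it would be referenced through its qualified name in a filter/@javaClass attribute of the query document.&lt;br /&gt;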
&lt;br /&gt;
== Fitted for documents in the TEI scheme ==&lt;br /&gt;
&lt;br /&gt;
The program tries to rely on the TEI scheme: &lt;br /&gt;
some structural properties of TEI documents are sometimes needed, &lt;br /&gt;
but I try not to make it rely on a specific TEI customisation. &lt;br /&gt;
(I would like to develop a mechanism for using the &amp;quot;TEI customization&amp;quot; &lt;br /&gt;
document produced by Roma to override the default vocabulary.)&lt;br /&gt;
&lt;br /&gt;
= External Links =&lt;br /&gt;
 &lt;br /&gt;
* The site with documentation (mainly in French): http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
* Examples summing up the properties of CR, in rough English: http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#concordances&lt;br /&gt;
* Download page: http://panini.u-paris10.fr/~sloiseau/CR/download.html&lt;br /&gt;
* URL of an archive containing the program and all the required external libraries: http://panini.u-paris10.fr/~sloiseau/CR/download/CR.zip&lt;br /&gt;
&lt;br /&gt;
[[Category:Tools]]&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2742</id>
		<title>CorpusReader</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2742"/>
		<updated>2006-08-26T18:10:59Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: /* Using a low level API */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Summary =&lt;br /&gt;
&lt;br /&gt;
CorpusReader (CR) is a tool for extracting subcorpora and quantitative &lt;br /&gt;
information from arbitrarily large corpora in the TEI vocabulary. It aims to provide ways of processing corpora containing milestoned annotation, and it provides mechanisms for merging several XML documents together.&lt;br /&gt;
&lt;br /&gt;
= Background =&lt;br /&gt;
&lt;br /&gt;
CorpusReader was developed for quantitative corpus linguistics. &lt;br /&gt;
Its original goal was to provide a way of extracting quantitative &lt;br /&gt;
information (such as co-occurrence matrices) using all the information &lt;br /&gt;
expressed in the XML infoset.&lt;br /&gt;
 &lt;br /&gt;
As a ''quantitative'' corpus ''linguistic'' tool, CorpusReader has no &lt;br /&gt;
''linguistic'' or ''statistical'' skills of its own... but it provides help &lt;br /&gt;
for:&lt;br /&gt;
&lt;br /&gt;
* importing the outputs of existing linguistic tools (taggers, parsers, etc.) into a corpus, &lt;br /&gt;
* exporting quantitative data in the formats of statistical tools (toward Matlab, [http://www.r-project.org/ R] and [http://www.lebart.org DTM], for instance).&lt;br /&gt;
&lt;br /&gt;
The rationale is that existing linguistic and statistical tools should &lt;br /&gt;
be reused and made easy to use in the context of a TEI corpus. &lt;br /&gt;
CorpusReader thus tries to be a bridge between linguistic and statistical &lt;br /&gt;
tools for creating and empirically exploring &lt;br /&gt;
complex data. &lt;br /&gt;
&lt;br /&gt;
= Technical features =&lt;br /&gt;
&lt;br /&gt;
CR is written in Java and works with Java 1.4 and later. It relies on &lt;br /&gt;
numerous external high-quality open-source libraries.&lt;br /&gt;
&lt;br /&gt;
It runs at the command line.&lt;br /&gt;
&lt;br /&gt;
It is released under the BSD licence.&lt;br /&gt;
&lt;br /&gt;
There is a [http://panini.u-paris10.fr/~sloiseau/CR/ web site] (see the bottom of this page for more links).&lt;br /&gt;
&lt;br /&gt;
= Properties =&lt;br /&gt;
 &lt;br /&gt;
== Functions are filters ==&lt;br /&gt;
&lt;br /&gt;
The program relies on a &amp;quot;streaming API&amp;quot;: the document is processed as a stream rather than as a tree. The &amp;quot;functions&amp;quot; of the program are implemented as &amp;quot;filters&amp;quot; applied to this stream. The program is essentially a collection of filters plus a mechanism for plugging the filters into a pipeline.&lt;br /&gt;
 &lt;br /&gt;
Each filter is specialized in one precise task and takes few arguments. &lt;br /&gt;
This allows modularity and reusability: while each filter performs a &lt;br /&gt;
simple task, a pipeline of filters can achieve complex tasks.&lt;br /&gt;
&lt;br /&gt;
A query document looks like:&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;corpus.xml&amp;quot;&lt;br /&gt;
          outURI=&amp;quot;sample-manuel-1.out&amp;quot;&lt;br /&gt;
          /&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;myFilterName&amp;quot; javaClass=&amp;quot;java.class.qualified.Name&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  Argument subtree, passed to the filter if any, after&lt;br /&gt;
              validation if a schema is known for this filter.&lt;br /&gt;
        --&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
    &amp;lt;!--  etc.: as many filters as needed  --&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
In some cases, filters can communicate directly with each other &lt;br /&gt;
(in addition to communicating through the stream of events).&lt;br /&gt;
&lt;br /&gt;
== Dealing with intersecting hierarchies ==&lt;br /&gt;
&lt;br /&gt;
[TODO]&lt;br /&gt;
&lt;br /&gt;
== Using high level languages on large documents (XPath, XSLT, XQuery) ==&lt;br /&gt;
&lt;br /&gt;
High-level languages (XPath, XSLT and XQuery) are made &lt;br /&gt;
available: the stream of XML events is buffered, &lt;br /&gt;
transformed or queried, and thrown back to the next &lt;br /&gt;
filter in the pipeline as a stream of XML events.&lt;br /&gt;
 &lt;br /&gt;
When the corpus does not fit in memory, a mechanism allows &lt;br /&gt;
the subtrees to be buffered and transformed &lt;br /&gt;
successively, one by one, as separate documents. The &lt;br /&gt;
&amp;quot;split&amp;quot; element in the query document divides the stream&lt;br /&gt;
into sub-documents. Each &amp;quot;filter&amp;quot; inside a &amp;quot;split&amp;quot; element &lt;br /&gt;
sees the corpus as several documents rooted &lt;br /&gt;
at the split/@localName elements:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split localName=&amp;quot;TEI&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
There is also a way of addressing elements in the stream &lt;br /&gt;
through an XPath expression evaluated against each element, &lt;br /&gt;
one at a time, as if each were a stand-alone document. For instance, &lt;br /&gt;
the query document above could be written as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split elxpath=&amp;quot;*[namespace-uri()='http://www.tei-c.org/ns/1.0' &lt;br /&gt;
                      and local-name()='div']&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Merging several documents ==&lt;br /&gt;
&lt;br /&gt;
CR contains a mechanism for merging an external document into an already &lt;br /&gt;
annotated corpus without breaking well-formedness. It is useful for reusing &lt;br /&gt;
the outputs of existing linguistic annotation tools. There is some [http://panini.u-paris10.fr/~sloiseau/CR/filtres/EncodingMerger.html documentation] (in French, and somewhat dated).&lt;br /&gt;
&lt;br /&gt;
== Using a low level API ==&lt;br /&gt;
&lt;br /&gt;
I tried to make the program stand between a &amp;quot;tool&amp;quot; and an &amp;quot;API&amp;quot;: it may be seen as a framework for facilitating the use of a low-level API. Thus, it may be used in different ways:&lt;br /&gt;
&lt;br /&gt;
* it provides ready-to-use functions for creating KWIC concordances, extracting subcorpora, computing co-occurrence matrices, merging markup, etc.&lt;br /&gt;
* it may be used more generally as a way of applying XSLT/XQuery to large corpora, whatever the vocabulary.&lt;br /&gt;
* it may be used for plugging in any SAX filter, and reduces the complexity of SAX: only the SAX handler code has to be written, while the program manages the parser, the pipeline, and the serialisation of the pipeline output back to disk. Any class implementing the XMLFilter interface may be plugged into the pipeline by providing the qualified name of the Java class in the query document (the class must be found on the CLASSPATH), and can interact with the existing filters.&lt;br /&gt;
* it may be used for prototyping Java code, by embedding Java code defining a SAX filter in the pipeline; see http://panini.u-paris10.fr/~sloiseau/CR/filtres/Script.html or http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#embeddedJava&lt;br /&gt;
&lt;br /&gt;
The goal was to make a low-level API easy to use. &amp;quot;CorpusReader&amp;quot; &lt;br /&gt;
is named after &amp;quot;XMLReader&amp;quot;, the interface that parses a document &lt;br /&gt;
in the Java SAX API. It is intended to be a &amp;quot;layer&amp;quot; on top of SAX for &lt;br /&gt;
the TEI vocabulary.&lt;br /&gt;
&lt;br /&gt;
== Fitted for documents in the TEI scheme ==&lt;br /&gt;
&lt;br /&gt;
The program tries to rely on the TEI scheme: &lt;br /&gt;
some structural properties of TEI documents are sometimes needed, &lt;br /&gt;
but I try not to make it rely on a specific TEI customisation. &lt;br /&gt;
(I would like to develop a mechanism for using the &amp;quot;TEI customization&amp;quot; &lt;br /&gt;
document produced by Roma to override the default vocabulary.)&lt;br /&gt;
&lt;br /&gt;
= External Links =&lt;br /&gt;
 &lt;br /&gt;
* The site with documentation (mainly in French) and downloadable archives: http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
* Examples summing up the properties of CR, in rough English: http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#concordances&lt;br /&gt;
&lt;br /&gt;
[[Category:Tools]]&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2741</id>
		<title>CorpusReader</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2741"/>
		<updated>2006-08-26T18:08:06Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Summary =&lt;br /&gt;
&lt;br /&gt;
CorpusReader (CR) is a tool for extracting subcorpora and quantitative &lt;br /&gt;
information from arbitrarily large corpora in the TEI vocabulary. It intends to provide ways of processing corpora containing milestoned annotation, and provides a mechanism for merging several XML documents together.&lt;br /&gt;
&lt;br /&gt;
= Background =&lt;br /&gt;
&lt;br /&gt;
CorpusReader was developed for quantitative corpus linguistics. &lt;br /&gt;
Its original goal was to provide a way of extracting quantitative &lt;br /&gt;
information (such as co-occurrence matrices) using all the information &lt;br /&gt;
expressed in the XML infoset.&lt;br /&gt;
 &lt;br /&gt;
As a ''quantitative'' corpus ''linguistics'' tool, CorpusReader does not &lt;br /&gt;
have any ''linguistic'' or ''statistical'' skills... but provides help &lt;br /&gt;
for:&lt;br /&gt;
&lt;br /&gt;
* importing the outputs of existing linguistic tools (taggers, parsers, etc.) into a corpus, &lt;br /&gt;
* exporting quantitative data in the formats of statistical tools (Matlab, [http://www.r-project.org/ R] and [http://www.lebart.org DTM], for instance).&lt;br /&gt;
&lt;br /&gt;
The rationale is that existing linguistic and statistical tools should &lt;br /&gt;
be reused and made easy to use in the context of a TEI corpus. &lt;br /&gt;
Thus CorpusReader tries to be a bridge between linguistic and statistical &lt;br /&gt;
tools for creating and empirically exploring &lt;br /&gt;
complex data. &lt;br /&gt;
&lt;br /&gt;
= Technical features =&lt;br /&gt;
&lt;br /&gt;
CR is written in Java, and works with Java 1.4 and later. It relies on &lt;br /&gt;
numerous external high-quality open-source libraries.&lt;br /&gt;
&lt;br /&gt;
It runs at the command line.&lt;br /&gt;
&lt;br /&gt;
It is released under the BSD licence.&lt;br /&gt;
&lt;br /&gt;
There is a [http://panini.u-paris10.fr/~sloiseau/CR/ web site] (see the bottom of this page for more links).&lt;br /&gt;
&lt;br /&gt;
= Properties =&lt;br /&gt;
 &lt;br /&gt;
== Functions are filters ==&lt;br /&gt;
&lt;br /&gt;
The program relies on a &amp;quot;streaming API&amp;quot;: the document is processed as a stream, not as a tree. The &amp;quot;functions&amp;quot; of the program are implemented as &amp;quot;filters&amp;quot; applied to this stream. The program is mainly a collection of filters and a mechanism for plugging the filters into a pipeline.&lt;br /&gt;
 &lt;br /&gt;
Each filter is specialised in a precise task and takes few arguments. &lt;br /&gt;
This allows modularity and reusability. While each filter performs a &lt;br /&gt;
simple task, a pipeline of filters may achieve complex tasks.&lt;br /&gt;
&lt;br /&gt;
A query document looks like:&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;corpus.xml&amp;quot;&lt;br /&gt;
          outURI=&amp;quot;sample-manuel-1.out&amp;quot;&lt;br /&gt;
          /&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;myFilterName&amp;quot; javaClass=&amp;quot;java.class.qualified.Name&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  Argument subtree, passed to the filter if any, after&lt;br /&gt;
              validation if a schema is known for this filter.&lt;br /&gt;
        --&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
    &amp;lt;!--  etc.: as many filters as needed  --&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
In some cases, filters can communicate directly with each other &lt;br /&gt;
(in addition to communicating through the stream of events).&lt;br /&gt;
&lt;br /&gt;
== Dealing with intersecting hierarchies ==&lt;br /&gt;
&lt;br /&gt;
[TODO]&lt;br /&gt;
&lt;br /&gt;
== Using high level languages on large documents (XPath, XSLT, XQuery) ==&lt;br /&gt;
&lt;br /&gt;
High-level languages (XPath, XSLT, and XQuery) are made &lt;br /&gt;
available: the stream of XML events is buffered, &lt;br /&gt;
transformed or queried, and passed on to the next &lt;br /&gt;
filter in the pipeline as a stream of XML events.&lt;br /&gt;
 &lt;br /&gt;
When the corpus does not fit in memory, a mechanism allows &lt;br /&gt;
the subtrees to be buffered and transformed &lt;br /&gt;
successively, one by one, as separate documents. The &lt;br /&gt;
&amp;quot;split&amp;quot; element in the query document divides the stream&lt;br /&gt;
into sub-documents. Each &amp;quot;filter&amp;quot; inside a &amp;quot;split&amp;quot; element &lt;br /&gt;
sees the corpus as several documents rooted &lt;br /&gt;
at the split/@localName elements:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split localName=&amp;quot;TEI&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
There is also a way of addressing elements in the stream &lt;br /&gt;
through an XPath expression evaluated against each element, &lt;br /&gt;
one at a time, as if each were a stand-alone document. For instance, &lt;br /&gt;
the query document above could be written as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split elxpath=&amp;quot;*[namespace-uri()='http://www.tei-c.org/ns/1.0' &lt;br /&gt;
                      and local-name()='div']&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Merging several documents ==&lt;br /&gt;
&lt;br /&gt;
CR contains a mechanism for merging an external document into an already &lt;br /&gt;
annotated corpus without breaking well-formedness. It is useful for reusing &lt;br /&gt;
the outputs of existing linguistic annotation tools. There is some [http://panini.u-paris10.fr/~sloiseau/CR/filtres/EncodingMerger.html documentation] (in French, and somewhat dated).&lt;br /&gt;
&lt;br /&gt;
== Using a low level API ==&lt;br /&gt;
&lt;br /&gt;
I tried to make the program stand between a &amp;quot;tool&amp;quot; and an &amp;quot;API&amp;quot; &lt;br /&gt;
(a framework for facilitating the use of a low-level API). &lt;br /&gt;
Thus, it may be used in different ways:&lt;br /&gt;
&lt;br /&gt;
* it provides ready-to-use functions for creating KWIC concordances, extracting subcorpora, computing co-occurrence matrices, merging markup, etc.&lt;br /&gt;
* it may be used as a way of applying XSLT/XQuery to large corpora&lt;br /&gt;
* it may be used for plugging in any SAX filter, and reduces the complexity of SAX: only the SAX handler code has to be written, while the program manages the parser, the pipeline, and the serialisation of the pipeline output back to disk. Any class implementing the XMLFilter interface may be plugged into the pipeline by providing the qualified name of the Java class in the query document (the class must be found on the CLASSPATH). &lt;br /&gt;
* it may be used for prototyping Java code, by embedding Java code defining a SAX filter in the pipeline; see http://panini.u-paris10.fr/~sloiseau/CR/filtres/Script.html or http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#embeddedJava&lt;br /&gt;
&lt;br /&gt;
The goal was to make a low-level API easy to use. &amp;quot;CorpusReader&amp;quot; &lt;br /&gt;
is named after &amp;quot;XMLReader&amp;quot;, the interface that parses a document &lt;br /&gt;
in the Java SAX API. It is intended to be a &amp;quot;layer&amp;quot; on top of SAX for &lt;br /&gt;
the TEI vocabulary.&lt;br /&gt;
&lt;br /&gt;
== Fitted for documents in the TEI scheme ==&lt;br /&gt;
&lt;br /&gt;
The program tries to rely on the TEI scheme: &lt;br /&gt;
some structural properties of TEI documents are sometimes needed, &lt;br /&gt;
but I try not to make it rely on a specific TEI customisation. &lt;br /&gt;
(I would like to develop a mechanism for using the &amp;quot;TEI customization&amp;quot; &lt;br /&gt;
document produced by Roma to override the default vocabulary.)&lt;br /&gt;
&lt;br /&gt;
= External Links =&lt;br /&gt;
 &lt;br /&gt;
* The site with documentation (mainly in French) and downloadable archives: http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
* Examples summing up the properties of CR, in rough English: http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#concordances&lt;br /&gt;
&lt;br /&gt;
[[Category:Tools]]&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2740</id>
		<title>CorpusReader</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2740"/>
		<updated>2006-08-26T18:04:20Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Summary =&lt;br /&gt;
&lt;br /&gt;
CorpusReader (CR) is a tool for extracting subcorpora and quantitative &lt;br /&gt;
information from arbitrarily large corpora in the TEI vocabulary. It intends to provide ways of processing corpora containing milestoned annotation, and provides a mechanism for merging several XML documents together.&lt;br /&gt;
&lt;br /&gt;
= Background =&lt;br /&gt;
&lt;br /&gt;
CorpusReader was developed for quantitative corpus linguistics. &lt;br /&gt;
Its original goal was to provide a way of extracting quantitative &lt;br /&gt;
information (such as co-occurrence matrices) using all the information &lt;br /&gt;
expressed in the XML infoset.&lt;br /&gt;
 &lt;br /&gt;
As a ''quantitative'' corpus ''linguistics'' tool, CorpusReader does not &lt;br /&gt;
have any ''linguistic'' or ''statistical'' skills... but provides help &lt;br /&gt;
for:&lt;br /&gt;
&lt;br /&gt;
* importing the outputs of existing linguistic tools (taggers, parsers, etc.) into a corpus, &lt;br /&gt;
* exporting quantitative data in the formats of statistical tools (Matlab, [http://www.r-project.org/ R] and [http://www.lebart.org DTM], for instance).&lt;br /&gt;
&lt;br /&gt;
The rationale is that existing linguistic and statistical tools should &lt;br /&gt;
be reused and made easy to use in the context of a TEI corpus. &lt;br /&gt;
Thus CorpusReader tries to be a bridge between linguistic and statistical &lt;br /&gt;
tools for creating and empirically exploring &lt;br /&gt;
complex data. &lt;br /&gt;
&lt;br /&gt;
= Technical features =&lt;br /&gt;
&lt;br /&gt;
CR is written in Java, and works with Java 1.4 and later. It relies on &lt;br /&gt;
numerous external high-quality open-source libraries.&lt;br /&gt;
&lt;br /&gt;
It runs at the command line.&lt;br /&gt;
&lt;br /&gt;
It is released under the BSD licence.&lt;br /&gt;
&lt;br /&gt;
There is a [http://panini.u-paris10.fr/~sloiseau/CR/ web site] (see the bottom of this page for more links).&lt;br /&gt;
&lt;br /&gt;
= Properties =&lt;br /&gt;
 &lt;br /&gt;
== Functions are filters ==&lt;br /&gt;
&lt;br /&gt;
The program relies on a &amp;quot;streaming API&amp;quot;: the document is processed as a stream, not as a tree. The &amp;quot;functions&amp;quot; of the program are implemented as &amp;quot;filters&amp;quot; applied to this stream. The program is mainly a collection of filters and a mechanism for plugging the filters into a pipeline.&lt;br /&gt;
 &lt;br /&gt;
Each filter is specialised in a precise task and takes few arguments. &lt;br /&gt;
This allows modularity and reusability. While each filter performs a &lt;br /&gt;
simple task, a pipeline of filters may achieve complex tasks.&lt;br /&gt;
&lt;br /&gt;
The names of the filters (and their arguments), &lt;br /&gt;
which define the pipeline, are given to the program through a file &lt;br /&gt;
(an XML document, of course) called the &amp;quot;query document&amp;quot;. The URL of &lt;br /&gt;
this document is given to the program as an argument on the &lt;br /&gt;
command line.&lt;br /&gt;
 &lt;br /&gt;
A query document looks like:&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;corpus.xml&amp;quot;&lt;br /&gt;
          outURI=&amp;quot;sample-manuel-1.out&amp;quot;&lt;br /&gt;
          /&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;myFilterName&amp;quot; javaClass=&amp;quot;java.class.qualified.Name&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  Argument subtree, passed to the filter if any, after&lt;br /&gt;
              validation if a schema is known for this filter.&lt;br /&gt;
        --&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
    &amp;lt;!--  etc.: as many filters as needed  --&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
In some cases, filters can communicate directly with each other &lt;br /&gt;
(in addition to communicating through the stream of events).&lt;br /&gt;
&lt;br /&gt;
== Dealing with intersecting hierarchies ==&lt;br /&gt;
&lt;br /&gt;
[TODO]&lt;br /&gt;
&lt;br /&gt;
== Using high level languages on large documents (XPath, XSLT, XQuery) ==&lt;br /&gt;
&lt;br /&gt;
High-level languages (XPath, XSLT, and XQuery) are made available: &lt;br /&gt;
the stream of XML events is buffered, transformed or queried, and passed on to the next filter in the pipeline as a stream of XML events.&lt;br /&gt;
 &lt;br /&gt;
When the corpus does not fit in memory, a mechanism allows &lt;br /&gt;
the subtrees to be buffered and transformed &lt;br /&gt;
successively, one by one, as separate documents. This is the role of the &lt;br /&gt;
&amp;quot;split&amp;quot; element in the query document. Each &amp;quot;filter&amp;quot; inside &lt;br /&gt;
a &amp;quot;split&amp;quot; element sees the corpus as several documents rooted &lt;br /&gt;
at the split/@localName elements.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split localName=&amp;quot;TEI&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is also a way of addressing elements in the stream &lt;br /&gt;
through an XPath expression evaluated against each element, &lt;br /&gt;
one at a time, as if each were a stand-alone document. For instance, &lt;br /&gt;
the query document above could be written as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split elxpath=&amp;quot;*[namespace-uri()='http://www.tei-c.org/ns/1.0' &lt;br /&gt;
                      and local-name()='div']&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Merging several documents ==&lt;br /&gt;
&lt;br /&gt;
CR contains a mechanism for merging an external document into an already &lt;br /&gt;
annotated corpus without breaking well-formedness. It is useful for reusing &lt;br /&gt;
the output of existing linguistic annotation tools.&lt;br /&gt;
&lt;br /&gt;
== Using a low level API ==&lt;br /&gt;
&lt;br /&gt;
I tried to make the program stand between a &amp;quot;tool&amp;quot; and an &amp;quot;API&amp;quot; &lt;br /&gt;
(a framework for facilitating the use of a low-level API). &lt;br /&gt;
Thus, it may be used in different ways:&lt;br /&gt;
&lt;br /&gt;
* it provides ready-to-use functions for creating KWIC concordances, extracting subcorpora, computing co-occurrence matrices, merging markup, etc.&lt;br /&gt;
* it may be used as a way of applying XSLT/XQuery to large corpora&lt;br /&gt;
* it may be used for plugging in any SAX filter, and reduces the complexity of SAX: only the SAX handler code has to be written, while the program manages the parser, the pipeline, and the serialisation of the pipeline output back to disk. Any class implementing the XMLFilter interface may be plugged into the pipeline by providing the qualified name of the Java class in the query document (the class must be found on the CLASSPATH). &lt;br /&gt;
* it may be used for prototyping Java code, by embedding Java code defining a SAX filter in the pipeline; see http://panini.u-paris10.fr/~sloiseau/CR/filtres/Script.html or http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#embeddedJava&lt;br /&gt;
&lt;br /&gt;
The goal was to make a low-level API easy to use. &amp;quot;CorpusReader&amp;quot; &lt;br /&gt;
is named after &amp;quot;XMLReader&amp;quot;, the interface that parses a document &lt;br /&gt;
in the Java SAX API. It is intended to be a &amp;quot;layer&amp;quot; on top of SAX for &lt;br /&gt;
the TEI vocabulary.&lt;br /&gt;
&lt;br /&gt;
== Fitted for documents in the TEI scheme ==&lt;br /&gt;
&lt;br /&gt;
The program tries to rely on the TEI scheme: &lt;br /&gt;
some structural properties of TEI documents are sometimes needed, &lt;br /&gt;
but I try not to make it rely on a specific TEI customisation. &lt;br /&gt;
(I would like to develop a mechanism for using the &amp;quot;TEI customization&amp;quot; &lt;br /&gt;
document produced by Roma to override the default vocabulary.)&lt;br /&gt;
&lt;br /&gt;
= External Links =&lt;br /&gt;
 &lt;br /&gt;
* The site with documentation (mainly in french) and downloadable archives: http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
* Exemples summing up the properties of CR in a sort of english: http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#concordances&lt;br /&gt;
&lt;br /&gt;
[[Category:Tools]]&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2739</id>
		<title>CorpusReader</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2739"/>
		<updated>2006-08-26T17:59:46Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Summary =&lt;br /&gt;
&lt;br /&gt;
CorpusReader (CR) is a tool for extracting subcorpora and quantitative &lt;br /&gt;
information from arbitrarily large corpora in the TEI vocabulary. It intends to provide ways of processing corpora containing milestoned annotation, and provides a mechanism for merging several XML documents together.&lt;br /&gt;
&lt;br /&gt;
= Background =&lt;br /&gt;
&lt;br /&gt;
CorpusReader was developed for quantitative corpus linguistics. &lt;br /&gt;
Its original goal was to provide a way of extracting quantitative &lt;br /&gt;
information (such as co-occurrence matrices) using all the information &lt;br /&gt;
expressed in the XML infoset.&lt;br /&gt;
 &lt;br /&gt;
As a ''quantitative'' corpus ''linguistics'' tool, CorpusReader does not &lt;br /&gt;
have any ''linguistic'' or ''statistical'' skills... but provides help &lt;br /&gt;
for:&lt;br /&gt;
&lt;br /&gt;
* importing the outputs of existing linguistic tools (taggers, parsers, etc.) into a corpus, &lt;br /&gt;
* exporting quantitative data in the formats of statistical tools (Matlab, [http://www.r-project.org/ R] and [http://www.lebart.org DTM], for instance).&lt;br /&gt;
&lt;br /&gt;
The rationale is that existing linguistic and statistical tools should &lt;br /&gt;
be reused and made easy to use in the context of a TEI corpus. &lt;br /&gt;
Thus CorpusReader tries to be a bridge between linguistic and statistical &lt;br /&gt;
tools for creating and empirically exploring &lt;br /&gt;
complex data. &lt;br /&gt;
&lt;br /&gt;
= Technical features =&lt;br /&gt;
&lt;br /&gt;
CR is written in Java, and works with Java 1.4 and later. It relies on &lt;br /&gt;
numerous external high-quality open-source libraries.&lt;br /&gt;
&lt;br /&gt;
It runs at the command line.&lt;br /&gt;
&lt;br /&gt;
It is released under the BSD licence.&lt;br /&gt;
&lt;br /&gt;
There is a [http://panini.u-paris10.fr/~sloiseau/CR/ web site].&lt;br /&gt;
&lt;br /&gt;
= Properties =&lt;br /&gt;
 &lt;br /&gt;
== Functions are filters ==&lt;br /&gt;
&lt;br /&gt;
CR relies heavily on the SAX API. The &amp;quot;functions&amp;quot; of the program &lt;br /&gt;
are implemented as SAX filters. The program is mainly a collection of &lt;br /&gt;
SAX filters and a mechanism for plugging the filters into a pipeline.&lt;br /&gt;
 &lt;br /&gt;
Each filter is specialised in a precise task and takes few arguments. &lt;br /&gt;
This allows modularity and reusability. While each filter performs a &lt;br /&gt;
simple task, a pipeline of filters may achieve complex tasks.&lt;br /&gt;
 &lt;br /&gt;
The names of the filters (and, for some of them, their arguments), &lt;br /&gt;
which define the pipeline, are given to the program through a file &lt;br /&gt;
(an XML document, of course) called the &amp;quot;query document&amp;quot;. The URL of &lt;br /&gt;
this document is given to the program as an argument on the &lt;br /&gt;
command line.&lt;br /&gt;
 &lt;br /&gt;
A query document looks like:&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;corpus.xml&amp;quot;&lt;br /&gt;
          outURI=&amp;quot;sample-manuel-1.out&amp;quot;&lt;br /&gt;
          /&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;myFilterName&amp;quot; javaClass=&amp;quot;java.class.qualified.Name&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  Argument subtree, passed to the filter if any, after&lt;br /&gt;
              validation if a schema is known for this filter.&lt;br /&gt;
        --&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
    &amp;lt;!--  etc.: as many filters as needed  --&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
In some cases, filters can communicate directly with each other &lt;br /&gt;
(in addition to communicating through the stream of SAX events).&lt;br /&gt;
&lt;br /&gt;
== Dealing with intersecting hierarchies ==&lt;br /&gt;
&lt;br /&gt;
[TODO]&lt;br /&gt;
&lt;br /&gt;
== Using high level languages on large documents (XPath, XSLT, XQuery) ==&lt;br /&gt;
&lt;br /&gt;
High-level languages (XSLT and XQuery) are made available: &lt;br /&gt;
the stream of XML events is buffered, transformed, and thrown back &lt;br /&gt;
to the next filter in the pipeline as a stream of XML events.&lt;br /&gt;
 &lt;br /&gt;
When the corpus does not fit in memory, the &amp;quot;split&amp;quot; element &lt;br /&gt;
of the query document allows the subtrees to be buffered and &lt;br /&gt;
transformed successively, one by one, as separate documents. &lt;br /&gt;
Each &amp;quot;filter&amp;quot; inside a &amp;quot;split&amp;quot; element &lt;br /&gt;
sees the corpus as several documents rooted &lt;br /&gt;
at the split/@localName elements.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split localName=&amp;quot;TEI&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is also a way of addressing elements in the stream &lt;br /&gt;
through an XPath expression evaluated against each element, &lt;br /&gt;
one at a time, as if each were a stand-alone document. For instance, &lt;br /&gt;
the query document above could be written as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split elxpath=&amp;quot;*[namespace-uri()='http://www.tei-c.org/ns/1.0' &lt;br /&gt;
                      and local-name()='div']&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Merging several documents ==&lt;br /&gt;
&lt;br /&gt;
CR contains a mechanism for merging an external document into an already &lt;br /&gt;
annotated corpus without breaking well-formedness. It is useful for reusing &lt;br /&gt;
the output of existing linguistic annotation tools.&lt;br /&gt;
&lt;br /&gt;
== Using a low level API ==&lt;br /&gt;
&lt;br /&gt;
I tried to make the program stand between a &amp;quot;tool&amp;quot; and an &amp;quot;API&amp;quot; &lt;br /&gt;
(a framework facilitating the use of a low-level API). &lt;br /&gt;
Thus, there are different ways of using it:&lt;br /&gt;
&lt;br /&gt;
* it provides directly usable functions for creating KWIC concordances, extracting subcorpora, computing co-occurrence matrices, merging markup, etc.&lt;br /&gt;
* it may be used as a way of applying XSLT / XQuery to big corpora&lt;br /&gt;
* it may be used for plugging in any SAX filter; it reduces the complexity of SAX by letting you write only the SAX handler code, while the program manages the parser, the pipeline, and the serialisation of the output of the pipeline back to disk. Any class implementing the XMLFilter interface may be plugged into the pipeline by providing the qualified name of the Java class in the query document (this class should be found on the CLASSPATH). &lt;br /&gt;
* it may be used for prototyping Java code by embedding in the pipeline Java code defining a SAX filter; see http://panini.u-paris10.fr/~sloiseau/CR/filtres/Script.html or http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#embeddedJava&lt;br /&gt;
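&lt;br /&gt;
The last two points can be illustrated with a minimal custom filter. The following is only a sketch: the class name ElementCounter and the filter name &amp;quot;count&amp;quot; are invented for the example, while the org.xml.sax classes are the standard Java SAX API.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
import org.xml.sax.Attributes;&lt;br /&gt;
import org.xml.sax.SAXException;&lt;br /&gt;
import org.xml.sax.helpers.XMLFilterImpl;&lt;br /&gt;
&lt;br /&gt;
// A pass-through filter counting the elements it sees. It could be&lt;br /&gt;
// referenced from the query document as, e.g.,&lt;br /&gt;
// &amp;lt;filter name=&amp;quot;count&amp;quot; javaClass=&amp;quot;ElementCounter&amp;quot;/&amp;gt; (hypothetical names).&lt;br /&gt;
public class ElementCounter extends XMLFilterImpl {&lt;br /&gt;
    private int count = 0;&lt;br /&gt;
&lt;br /&gt;
    public void startElement(String uri, String localName,&lt;br /&gt;
                             String qName, Attributes atts) throws SAXException {&lt;br /&gt;
        count++;                                         // record the element&lt;br /&gt;
        super.startElement(uri, localName, qName, atts); // forward the event&lt;br /&gt;
    }&lt;br /&gt;
&lt;br /&gt;
    public void endDocument() throws SAXException {&lt;br /&gt;
        System.err.println(count + &amp;quot; elements seen&amp;quot;);&lt;br /&gt;
        super.endDocument();&lt;br /&gt;
    }&lt;br /&gt;
}&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;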
&lt;br /&gt;
The goal was to make the use of a low-level API easy. &amp;quot;CorpusReader&amp;quot; &lt;br /&gt;
is named after &amp;quot;XMLReader&amp;quot;, the name of the class that parses a document &lt;br /&gt;
in the Java SAX API. It is intended to be a &amp;quot;layer&amp;quot; on top of SAX for &lt;br /&gt;
the TEI vocabulary.&lt;br /&gt;
&lt;br /&gt;
== Fitted for documents in the TEI scheme ==&lt;br /&gt;
&lt;br /&gt;
The program tries to rely on the TEI scheme: &lt;br /&gt;
some structural properties of TEI documents are sometimes needed, &lt;br /&gt;
but I try not to make it rely on a specific TEI customisation. &lt;br /&gt;
(I would like to develop a mechanism for using the &amp;quot;TEI customization&amp;quot; &lt;br /&gt;
document produced by Roma to override the default vocabulary.)&lt;br /&gt;
&lt;br /&gt;
= External Links =&lt;br /&gt;
 &lt;br /&gt;
* The site with documentation (mainly in French) and downloadable archives: http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
* Examples summing up the properties of CR, in English: http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#concordances&lt;br /&gt;
&lt;br /&gt;
[[Category:Tools]]&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2738</id>
		<title>CorpusReader</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2738"/>
		<updated>2006-08-26T17:59:24Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Summary =&lt;br /&gt;
&lt;br /&gt;
CorpusReader (CR) is a tool for extracting subcorpora and quantitative &lt;br /&gt;
information from arbitrarily large corpora in the TEI vocabulary. It intends to provide ways of processing corpora containing milestoned annotation. It provides mechanisms for merging several XML documents together.&lt;br /&gt;
&lt;br /&gt;
= Background =&lt;br /&gt;
&lt;br /&gt;
CorpusReader was developed for quantitative corpus linguistics. &lt;br /&gt;
Its original goal was to provide a way of extracting quantitative &lt;br /&gt;
information (such as co-occurrence matrices) using all the information &lt;br /&gt;
expressed in the XML infoset.&lt;br /&gt;
 &lt;br /&gt;
As a ''quantitative'' corpus ''linguistic'' tool, CorpusReader does not &lt;br /&gt;
have any ''linguistic'' or ''statistical'' skill... but provides help &lt;br /&gt;
for:&lt;br /&gt;
&lt;br /&gt;
* importing the output of existing linguistic tools (taggers, parsers, etc.) into a corpus, &lt;br /&gt;
* exporting quantitative data in statistical tool formats (toward Matlab, [http://www.r-project.org/ R] and [http://www.lebart.org DTM] for instance).&lt;br /&gt;
&lt;br /&gt;
The rationale is that existing linguistic and statistical tools should &lt;br /&gt;
be reused and made easy to use in the context of a TEI corpus. &lt;br /&gt;
Thus CorpusReader tries to be a bridge between linguistic and statistical &lt;br /&gt;
tools for creating and exploring empirically &lt;br /&gt;
complex data. &lt;br /&gt;
&lt;br /&gt;
= Technical features =&lt;br /&gt;
&lt;br /&gt;
CR is written in Java, and works with Java 1.4 and later. It relies on &lt;br /&gt;
numerous external high-quality open source libraries.&lt;br /&gt;
&lt;br /&gt;
It runs at the command line.&lt;br /&gt;
&lt;br /&gt;
It is released under the BSD licence.&lt;br /&gt;
&lt;br /&gt;
There is a [http://panini.u-paris10.fr/~sloiseau/CR/ web site].&lt;br /&gt;
&lt;br /&gt;
= Properties =&lt;br /&gt;
 &lt;br /&gt;
== Functions are filters ==&lt;br /&gt;
&lt;br /&gt;
CR relies heavily on the SAX API. The &amp;quot;functions&amp;quot; of the program &lt;br /&gt;
are implemented as SAX filters. The program is mainly a collection of &lt;br /&gt;
SAX filters and a mechanism for plugging the filters into a pipeline.&lt;br /&gt;
 &lt;br /&gt;
Each filter is specialized in a precise task and takes few arguments. &lt;br /&gt;
This allows modularity and reusability. While each filter performs a &lt;br /&gt;
simple task, a pipeline of filters may achieve complex tasks.&lt;br /&gt;
 &lt;br /&gt;
The pipeline is defined in a file (an XML document, of course) &lt;br /&gt;
called the &amp;quot;query document&amp;quot;, which gives the names of the filters &lt;br /&gt;
and, for some of them, their arguments. The URL of &lt;br /&gt;
this document is passed to the program as an argument on the &lt;br /&gt;
command line.&lt;br /&gt;
 &lt;br /&gt;
A query document looks like:&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;corpus.xml&amp;quot;&lt;br /&gt;
          outURI=&amp;quot;sample-manuel-1.out&amp;quot;&lt;br /&gt;
          /&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;myFilterName&amp;quot; javaClass=&amp;quot;java.class.qualified.Name&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  Argument subtree, passed to the filter, if any, after&lt;br /&gt;
              validation if a schema is known for this filter.&lt;br /&gt;
        --&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
    &amp;lt;!--  etc.: as many filters as needed  --&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
In some cases, filters can communicate directly with each other &lt;br /&gt;
(in addition to communicating through the stream of SAX events).&lt;br /&gt;
&lt;br /&gt;
== Dealing with intersecting hierarchies ==&lt;br /&gt;
&lt;br /&gt;
[TODO]&lt;br /&gt;
&lt;br /&gt;
== Using high level languages on large documents (XPath, XSLT, XQuery) ==&lt;br /&gt;
&lt;br /&gt;
High-level languages (XSLT and XQuery) are made available: &lt;br /&gt;
the stream of XML events is buffered, transformed, and thrown back &lt;br /&gt;
to the next filter in the pipeline as a stream of XML events.&lt;br /&gt;
 &lt;br /&gt;
When the corpus does not fit in memory, the &amp;quot;split&amp;quot; element &lt;br /&gt;
of the query document allows the subtrees to be buffered and &lt;br /&gt;
transformed successively, one by one, as separate documents. &lt;br /&gt;
Each &amp;quot;filter&amp;quot; inside a &amp;quot;split&amp;quot; element &lt;br /&gt;
sees the corpus as several documents rooted &lt;br /&gt;
at the split/@localName elements.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split localName=&amp;quot;TEI&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is also a way of addressing elements in the stream &lt;br /&gt;
through an XPath expression evaluated against each element, &lt;br /&gt;
one at a time, as if each were a stand-alone document. For instance, &lt;br /&gt;
the query document above could be written as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split elxpath=&amp;quot;*[namespace-uri()='http://www.tei-c.org/ns/1.0' &lt;br /&gt;
                      and local-name()='div']&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Merging several documents ==&lt;br /&gt;
&lt;br /&gt;
CR contains a mechanism for merging an external document into an already &lt;br /&gt;
annotated corpus without breaking well-formedness. It is useful for reusing &lt;br /&gt;
the output of existing linguistic annotation tools.&lt;br /&gt;
&lt;br /&gt;
== Using a low level API ==&lt;br /&gt;
&lt;br /&gt;
I tried to make the program stand between a &amp;quot;tool&amp;quot; and an &amp;quot;API&amp;quot; &lt;br /&gt;
(a framework facilitating the use of a low-level API). &lt;br /&gt;
Thus, there are different ways of using it:&lt;br /&gt;
&lt;br /&gt;
* it provides directly usable functions for creating KWIC concordances, extracting subcorpora, computing co-occurrence matrices, merging markup, etc.&lt;br /&gt;
* it may be used as a way of applying XSLT / XQuery to big corpora&lt;br /&gt;
* it may be used for plugging in any SAX filter; it reduces the complexity of SAX by letting you write only the SAX handler code, while the program manages the parser, the pipeline, and the serialisation of the output of the pipeline back to disk. Any class implementing the XMLFilter interface may be plugged into the pipeline by providing the qualified name of the Java class in the query document (this class should be found on the CLASSPATH). &lt;br /&gt;
* it may be used for prototyping Java code by embedding in the pipeline Java code defining a SAX filter; see http://panini.u-paris10.fr/~sloiseau/CR/filtres/Script.html or http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#embeddedJava&lt;br /&gt;
&lt;br /&gt;
The goal was to make the use of a low-level API easy. &amp;quot;CorpusReader&amp;quot; &lt;br /&gt;
is named after &amp;quot;XMLReader&amp;quot;, the name of the class that parses a document &lt;br /&gt;
in the Java SAX API. It is intended to be a &amp;quot;layer&amp;quot; on top of SAX for &lt;br /&gt;
the TEI vocabulary.&lt;br /&gt;
&lt;br /&gt;
== Fitted for documents in the TEI scheme ==&lt;br /&gt;
&lt;br /&gt;
The program tries to rely on the TEI scheme: &lt;br /&gt;
some structural properties of TEI documents are sometimes needed, &lt;br /&gt;
but I try not to make it rely on a specific TEI customisation. &lt;br /&gt;
(I would like to develop a mechanism for using the &amp;quot;TEI customization&amp;quot; &lt;br /&gt;
document produced by Roma to override the default vocabulary.)&lt;br /&gt;
&lt;br /&gt;
= External Links =&lt;br /&gt;
 &lt;br /&gt;
* The site with documentation (mainly in French) and downloadable archives: http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
* Examples summing up the properties of CR, in English: http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#concordances&lt;br /&gt;
&lt;br /&gt;
[[Category:Tools]]&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2737</id>
		<title>CorpusReader</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2737"/>
		<updated>2006-08-26T17:59:07Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Summary =&lt;br /&gt;
&lt;br /&gt;
CorpusReader (CR) is a tool for extracting subcorpora and quantitative &lt;br /&gt;
information from arbitrarily large corpora in the TEI vocabulary. It intends to provide ways of processing corpora containing milestoned annotation. It provides mechanisms for merging several XML documents together.&lt;br /&gt;
&lt;br /&gt;
= Background =&lt;br /&gt;
&lt;br /&gt;
CorpusReader was developed for quantitative corpus linguistics. &lt;br /&gt;
Its original goal was to provide a way of extracting quantitative &lt;br /&gt;
information (such as co-occurrence matrices) using all the information &lt;br /&gt;
expressed in the XML infoset.&lt;br /&gt;
 &lt;br /&gt;
As a ''quantitative'' corpus ''linguistic'' tool, CorpusReader does not &lt;br /&gt;
have any ''linguistic'' or ''statistical'' skill... but provides help &lt;br /&gt;
for:&lt;br /&gt;
&lt;br /&gt;
* importing the output of existing linguistic tools (taggers, parsers, etc.) into a corpus, &lt;br /&gt;
* exporting quantitative data in statistical tool formats (toward Matlab, [http://www.r-project.org/ R] and [http://www.lebart.org DTM] for instance).&lt;br /&gt;
&lt;br /&gt;
The rationale is that existing linguistic and statistical tools should &lt;br /&gt;
be reused and made easy to use in the context of a TEI corpus. &lt;br /&gt;
Thus CorpusReader tries to be a bridge between linguistic and statistical &lt;br /&gt;
tools for creating and exploring empirically &lt;br /&gt;
complex data. &lt;br /&gt;
&lt;br /&gt;
= Technical features =&lt;br /&gt;
&lt;br /&gt;
CR is written in Java, and works with Java 1.4 and later. It relies on &lt;br /&gt;
numerous external high-quality open source libraries.&lt;br /&gt;
&lt;br /&gt;
It runs at the command line.&lt;br /&gt;
&lt;br /&gt;
It is released under the BSD licence.&lt;br /&gt;
&lt;br /&gt;
There is a [http://panini.u-paris10.fr/~sloiseau/CR/ web site].&lt;br /&gt;
&lt;br /&gt;
= Properties =&lt;br /&gt;
 &lt;br /&gt;
== Functions are filters ==&lt;br /&gt;
&lt;br /&gt;
CR relies heavily on the SAX API. The &amp;quot;functions&amp;quot; of the program &lt;br /&gt;
are implemented as SAX filters. The program is mainly a collection of &lt;br /&gt;
SAX filters and a mechanism for plugging the filters into a pipeline.&lt;br /&gt;
 &lt;br /&gt;
Each filter is specialized in a precise task and takes few arguments. &lt;br /&gt;
This allows modularity and reusability. While each filter performs a &lt;br /&gt;
simple task, a pipeline of filters may achieve complex tasks.&lt;br /&gt;
 &lt;br /&gt;
The pipeline is defined in a file (an XML document, of course) &lt;br /&gt;
called the &amp;quot;query document&amp;quot;, which gives the names of the filters &lt;br /&gt;
and, for some of them, their arguments. The URL of &lt;br /&gt;
this document is passed to the program as an argument on the &lt;br /&gt;
command line.&lt;br /&gt;
 &lt;br /&gt;
A query document looks like:&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;corpus.xml&amp;quot;&lt;br /&gt;
          outURI=&amp;quot;sample-manuel-1.out&amp;quot;&lt;br /&gt;
          /&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;myFilterName&amp;quot; javaClass=&amp;quot;java.class.qualified.Name&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  Argument subtree, passed to the filter, if any, after&lt;br /&gt;
              validation if a schema is known for this filter.&lt;br /&gt;
        --&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
    &amp;lt;!--  etc.: as many filters as needed  --&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
In some cases, filters can communicate directly with each other &lt;br /&gt;
(in addition to communicating through the stream of SAX events).&lt;br /&gt;
&lt;br /&gt;
== Dealing with intersecting hierarchies ==&lt;br /&gt;
&lt;br /&gt;
[TODO]&lt;br /&gt;
&lt;br /&gt;
== Using high level languages on large documents (XPath, XSLT, XQuery) ==&lt;br /&gt;
&lt;br /&gt;
High-level languages (XSLT and XQuery) are made available: &lt;br /&gt;
the stream of XML events is buffered, transformed, and thrown back &lt;br /&gt;
to the next filter in the pipeline as a stream of XML events.&lt;br /&gt;
 &lt;br /&gt;
When the corpus does not fit in memory, the &amp;quot;split&amp;quot; element &lt;br /&gt;
of the query document allows the subtrees to be buffered and &lt;br /&gt;
transformed successively, one by one, as separate documents. &lt;br /&gt;
Each &amp;quot;filter&amp;quot; inside a &amp;quot;split&amp;quot; element &lt;br /&gt;
sees the corpus as several documents rooted &lt;br /&gt;
at the split/@localName elements.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split localName=&amp;quot;TEI&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is also a way of addressing elements in the stream &lt;br /&gt;
through an XPath expression evaluated against each element, &lt;br /&gt;
one at a time, as if each were a stand-alone document. For instance, &lt;br /&gt;
the query document above could be written as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split elxpath=&amp;quot;*[namespace-uri()='http://www.tei-c.org/ns/1.0' &lt;br /&gt;
                      and local-name()='div']&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Merging several documents ==&lt;br /&gt;
&lt;br /&gt;
CR contains a mechanism for merging an external document into an already &lt;br /&gt;
annotated corpus without breaking well-formedness. It is useful for reusing &lt;br /&gt;
the output of existing linguistic annotation tools.&lt;br /&gt;
&lt;br /&gt;
== Using a low level API ==&lt;br /&gt;
&lt;br /&gt;
I tried to make the program stand between a &amp;quot;tool&amp;quot; and an &amp;quot;API&amp;quot; &lt;br /&gt;
(a framework facilitating the use of a low-level API). &lt;br /&gt;
Thus, there are different ways of using it:&lt;br /&gt;
&lt;br /&gt;
* it provides directly usable functions for creating KWIC concordances, extracting subcorpora, computing co-occurrence matrices, merging markup, etc.&lt;br /&gt;
* it may be used as a way of applying XSLT / XQuery to big corpora&lt;br /&gt;
* it may be used for plugging in any SAX filter; it reduces the complexity of SAX by letting you write only the SAX handler code, while the program manages the parser, the pipeline, and the serialisation of the output of the pipeline back to disk. Any class implementing the XMLFilter interface may be plugged into the pipeline by providing the qualified name of the Java class in the query document (this class should be found on the CLASSPATH). &lt;br /&gt;
* it may be used for prototyping Java code by embedding in the pipeline Java code defining a SAX filter; see http://panini.u-paris10.fr/~sloiseau/CR/filtres/Script.html or http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#embeddedJava&lt;br /&gt;
&lt;br /&gt;
The goal was to make the use of a low-level API easy. &amp;quot;CorpusReader&amp;quot; &lt;br /&gt;
is named after &amp;quot;XMLReader&amp;quot;, the name of the class that parses a document &lt;br /&gt;
in the Java SAX API. It is intended to be a &amp;quot;layer&amp;quot; on top of SAX for &lt;br /&gt;
the TEI vocabulary.&lt;br /&gt;
&lt;br /&gt;
== Fitted for documents in the TEI scheme ==&lt;br /&gt;
&lt;br /&gt;
The program tries to rely on the TEI scheme: &lt;br /&gt;
some structural properties of TEI documents are sometimes needed, &lt;br /&gt;
but I try not to make it rely on a specific TEI customisation. &lt;br /&gt;
(I would like to develop a mechanism for using the &amp;quot;TEI customization&amp;quot; &lt;br /&gt;
document produced by Roma to override the default vocabulary.)&lt;br /&gt;
&lt;br /&gt;
= External Links =&lt;br /&gt;
 &lt;br /&gt;
* The site with documentation (mainly in French) and downloadable archives: http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
* Examples summing up the properties of CR, in English: http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#concordances&lt;br /&gt;
&lt;br /&gt;
[[Category:Tools]]&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2736</id>
		<title>CorpusReader</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2736"/>
		<updated>2006-08-26T17:57:39Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Summary =&lt;br /&gt;
&lt;br /&gt;
CorpusReader (CR) is a tool for extracting subcorpora and quantitative &lt;br /&gt;
information from arbitrarily large corpora in the TEI vocabulary. It intends to provide ways of processing corpora containing milestoned annotation. It provides mechanisms for merging several XML documents together.&lt;br /&gt;
&lt;br /&gt;
= Background =&lt;br /&gt;
&lt;br /&gt;
CorpusReader was developed for quantitative corpus linguistics. &lt;br /&gt;
Its original goal was to provide a way of extracting quantitative &lt;br /&gt;
information (such as co-occurrence matrices) using all the information &lt;br /&gt;
expressed in the XML infoset.&lt;br /&gt;
 &lt;br /&gt;
As a ''quantitative'' corpus ''linguistic'' tool, CorpusReader does not &lt;br /&gt;
have any ''linguistic'' or ''statistical'' skill... but provides help &lt;br /&gt;
for:&lt;br /&gt;
&lt;br /&gt;
* importing the output of existing linguistic tools (taggers, parsers, etc.) into a corpus, &lt;br /&gt;
* exporting quantitative data in statistical tool formats (toward Matlab, [http://www.r-project.org/ R] and [http://www.lebart.org DTM] for instance).&lt;br /&gt;
&lt;br /&gt;
The rationale is that existing linguistic and statistical tools should &lt;br /&gt;
be reused and made easy to use in the context of a TEI corpus. &lt;br /&gt;
Thus CorpusReader tries to be a bridge between linguistic and statistical &lt;br /&gt;
tools for creating and exploring empirically &lt;br /&gt;
complex data. &lt;br /&gt;
&lt;br /&gt;
= Technical features =&lt;br /&gt;
&lt;br /&gt;
CR is written in Java, and works with Java 1.4 and later. It relies on &lt;br /&gt;
numerous external high-quality open source libraries.&lt;br /&gt;
&lt;br /&gt;
It is run only at the command line.&lt;br /&gt;
&lt;br /&gt;
It is released under the BSD licence.&lt;br /&gt;
&lt;br /&gt;
There is a web site : http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
&lt;br /&gt;
= Properties =&lt;br /&gt;
 &lt;br /&gt;
== Functions are filters ==&lt;br /&gt;
&lt;br /&gt;
CR relies heavily on the SAX API. The &amp;quot;functions&amp;quot; of the program &lt;br /&gt;
are implemented as SAX filters. The program is mainly a collection of &lt;br /&gt;
SAX filters and a mechanism for plugging the filters into a pipeline.&lt;br /&gt;
 &lt;br /&gt;
Each filter is specialized in a precise task and takes few arguments. &lt;br /&gt;
This allows modularity and reusability. While each filter performs a &lt;br /&gt;
simple task, a pipeline of filters may achieve complex tasks.&lt;br /&gt;
 &lt;br /&gt;
The names of the filters (and, for some of them, their arguments), &lt;br /&gt;
which define the pipeline, are given to the program through a file &lt;br /&gt;
(an XML document, of course) called the &amp;quot;query document&amp;quot;. The URL of &lt;br /&gt;
this document is given as an argument to the program, at the &lt;br /&gt;
command line.&lt;br /&gt;
 &lt;br /&gt;
A query document looks like:&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;corpus.xml&amp;quot;&lt;br /&gt;
          outURI=&amp;quot;sample-manuel-1.out&amp;quot;&lt;br /&gt;
          /&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;myFilterName&amp;quot; javaClass=&amp;quot;java.class.qualified.Name&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  Argument subtree, passed to the filter if any, after&lt;br /&gt;
              validation if a schema is known for this filter.&lt;br /&gt;
        --&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
    &amp;lt;!--  etc.: as many filters as needed  --&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
In some cases, filters can communicate directly with each other &lt;br /&gt;
(in addition to communicating through the stream of SAX events).&lt;br /&gt;
&lt;br /&gt;
== Dealing with intersecting hierarchies ==&lt;br /&gt;
&lt;br /&gt;
[TODO]&lt;br /&gt;
&lt;br /&gt;
== Using high level languages on large documents (XPath, XSLT, XQuery) ==&lt;br /&gt;
&lt;br /&gt;
High-level languages (XSLT and XQuery) are made available: &lt;br /&gt;
the stream of XML events is buffered, transformed, and thrown back &lt;br /&gt;
to the next filter in the pipeline as a stream of XML events.&lt;br /&gt;
 &lt;br /&gt;
When the corpus does not fit in memory, a mechanism allows &lt;br /&gt;
the subtrees to be buffered and transformed &lt;br /&gt;
successively, one by one, as separate documents. This is the &lt;br /&gt;
&amp;quot;split&amp;quot; element in the query document. Each &amp;quot;filter&amp;quot; inside &lt;br /&gt;
a &amp;quot;split&amp;quot; element sees the corpus as several documents rooted &lt;br /&gt;
at the split/@localName elements.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split localName=&amp;quot;TEI&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is also a way of addressing elements in the stream &lt;br /&gt;
through an XPath expression evaluated against each element, &lt;br /&gt;
one by one, as if each were a stand-alone document. For instance, &lt;br /&gt;
the query document above could be written as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split elxpath=&amp;quot;*[namespace-uri()='http://www.tei-c.org/ns/1.0' &lt;br /&gt;
                      and local-name()='div']&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Merging several documents ==&lt;br /&gt;
&lt;br /&gt;
CR contains a mechanism for merging an external document into an already &lt;br /&gt;
annotated corpus without breaking well-formedness. This is useful for reusing &lt;br /&gt;
the output of existing linguistic annotation tools.&lt;br /&gt;
&lt;br /&gt;
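A hypothetical query document for such a merge might look like the following (the filter class &amp;quot;tei.cr.filters.Merge&amp;quot; and its &amp;quot;document&amp;quot; argument are illustrative only, not the actual API):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;corpus.xml&amp;quot; outURI=&amp;quot;merged.xml&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;!--  Hypothetical merging filter, for illustration only  --&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;merge_annotation&amp;quot; javaClass=&amp;quot;tei.cr.filters.Merge&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  URI of the external document to be merged in  --&amp;gt;&lt;br /&gt;
        &amp;lt;document URI=&amp;quot;annotation.xml&amp;quot;&amp;gt;&amp;lt;/document&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;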
== Using a low level API ==&lt;br /&gt;
&lt;br /&gt;
I tried to make the program stand between a &amp;quot;tool&amp;quot; and an &amp;quot;API&amp;quot; &lt;br /&gt;
(a framework facilitating the use of a low-level API). &lt;br /&gt;
Thus, there are different ways of using it:&lt;br /&gt;
&lt;br /&gt;
* it provides directly usable functions for creating KWIC concordances, extracting subcorpora, computing co-occurrence matrices, merging markup, etc.&lt;br /&gt;
* it may be used as a way of applying XSLT / XQuery to big corpora&lt;br /&gt;
* it may be used for plugging in any SAX filter, reducing the complexity of SAX: only the SAX handler code needs to be written, while the program manages the parser, the pipeline, and the serialisation of the pipeline's output back to disk. Any class implementing the XMLFilter interface may be plugged into the pipeline by giving the qualified name of the Java class in the query document (the class must be found on the CLASSPATH). &lt;br /&gt;
* it may be used for prototyping Java code by embedding Java code defining a SAX filter in the pipeline; see http://panini.u-paris10.fr/~sloiseau/CR/filtres/Script.html or http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#embeddedJava&lt;br /&gt;
&lt;br /&gt;
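As a sketch of the third use, plugging a custom SAX filter needs only a &amp;quot;filter&amp;quot; element naming the class (here &amp;quot;org.example.MyFilter&amp;quot; is a hypothetical user class implementing XMLFilter):&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;corpus.xml&amp;quot; outURI=&amp;quot;corpus-out.xml&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;!--  org.example.MyFilter is hypothetical; it must be on the CLASSPATH  --&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;my_custom_filter&amp;quot; javaClass=&amp;quot;org.example.MyFilter&amp;quot;&amp;gt;&amp;lt;/filter&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;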
The goal was to make a low-level API easy to use. &amp;quot;CorpusReader&amp;quot; &lt;br /&gt;
is named after &amp;quot;XMLReader&amp;quot;, the name of the class that parses a document &lt;br /&gt;
in the Java SAX API. It is intended to be a &amp;quot;layer&amp;quot; on top of SAX for &lt;br /&gt;
the TEI vocabulary.&lt;br /&gt;
&lt;br /&gt;
== Fitted for documents in the TEI scheme ==&lt;br /&gt;
&lt;br /&gt;
The program tries to rely on the TEI scheme: &lt;br /&gt;
some structural properties of TEI documents are sometimes needed, &lt;br /&gt;
but I try not to make it rely on a specific TEI customisation. &lt;br /&gt;
(I would like to develop a mechanism for using the &amp;quot;TEI customization&amp;quot; &lt;br /&gt;
document produced by Roma to override the default vocabulary.)&lt;br /&gt;
&lt;br /&gt;
= External Links =&lt;br /&gt;
 &lt;br /&gt;
* The site with documentation (mainly in French) and downloadable archives: http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
* Examples summing up the properties of CR, in English: http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#concordances&lt;br /&gt;
&lt;br /&gt;
[[Category:Tools]]&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2735</id>
		<title>CorpusReader</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2735"/>
		<updated>2006-08-26T17:56:23Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Summary =&lt;br /&gt;
&lt;br /&gt;
CorpusReader (CR) is a tool for extracting subcorpora and quantitative &lt;br /&gt;
information from arbitrarily large corpora in the TEI vocabulary. It is intended to provide ways of processing corpora containing milestoned annotation, and it provides a mechanism for merging several XML documents together.&lt;br /&gt;
&lt;br /&gt;
= Background =&lt;br /&gt;
&lt;br /&gt;
CorpusReader was developed for quantitative corpus linguistics. &lt;br /&gt;
Its original goal was to provide a way of extracting quantitative &lt;br /&gt;
information (such as co-occurrence matrices) using all the information &lt;br /&gt;
expressed in the XML infoset.&lt;br /&gt;
 &lt;br /&gt;
As a ''quantitative'' corpus ''linguistic'' tool, CorpusReader does not &lt;br /&gt;
have any ''linguistic'' or ''statistical'' skills of its own, but it provides help &lt;br /&gt;
for:&lt;br /&gt;
&lt;br /&gt;
* importing the outputs of existing linguistic tools (taggers, parsers, etc.) into a corpus, &lt;br /&gt;
* exporting quantitative data in statistical tool formats (toward Matlab, [http://www.r-project.org/ R] and [http://www.lebart.org DTM], for instance).&lt;br /&gt;
&lt;br /&gt;
The rationale is that existing linguistic and statistical tools should &lt;br /&gt;
be reused and made easy to use in the context of a TEI corpus. &lt;br /&gt;
Thus CorpusReader is a bridge between linguistic and statistical &lt;br /&gt;
tools for creating and empirically exploring &lt;br /&gt;
complex data. &lt;br /&gt;
&lt;br /&gt;
= Technical features =&lt;br /&gt;
&lt;br /&gt;
CR is written in Java and works with Java 1.4 and later. It relies on &lt;br /&gt;
numerous external high-quality open source libraries.&lt;br /&gt;
&lt;br /&gt;
It is run from the command line only.&lt;br /&gt;
&lt;br /&gt;
It is released under the BSD licence.&lt;br /&gt;
&lt;br /&gt;
There is a web site: http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
&lt;br /&gt;
= Properties =&lt;br /&gt;
 &lt;br /&gt;
== Functions are filters ==&lt;br /&gt;
&lt;br /&gt;
CR relies heavily on the SAX API. The &amp;quot;functions&amp;quot; of the program &lt;br /&gt;
are implemented as SAX filters. The program is mainly a collection of &lt;br /&gt;
SAX filters and a mechanism for plugging the filters into a pipeline.&lt;br /&gt;
 &lt;br /&gt;
Each filter is specialized in a precise task and takes few arguments. &lt;br /&gt;
This allows modularity and reusability. While each filter performs a &lt;br /&gt;
simple task, a pipeline of filters may achieve complex tasks.&lt;br /&gt;
 &lt;br /&gt;
The names of the filters (and, for some of them, their arguments), &lt;br /&gt;
which define the pipeline, are given to the program through a file &lt;br /&gt;
(an XML document, of course) called the &amp;quot;query document&amp;quot;. The URL of &lt;br /&gt;
this document is given as an argument to the program, at the &lt;br /&gt;
command line.&lt;br /&gt;
 &lt;br /&gt;
A query document looks like:&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;corpus.xml&amp;quot;&lt;br /&gt;
          outURI=&amp;quot;sample-manuel-1.out&amp;quot;&lt;br /&gt;
          /&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;myFilterName&amp;quot; javaClass=&amp;quot;java.class.qualified.Name&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  Argument subtree, passed to the filter if any, after&lt;br /&gt;
              validation if a schema is known for this filter.&lt;br /&gt;
        --&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
    &amp;lt;!--  etc.: as many filters as needed  --&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
In some cases, filters can communicate directly with each other &lt;br /&gt;
(in addition to communicating through the stream of SAX events).&lt;br /&gt;
&lt;br /&gt;
== Dealing with intersecting hierarchies ==&lt;br /&gt;
&lt;br /&gt;
[TODO]&lt;br /&gt;
&lt;br /&gt;
== Using high level languages on large documents (XPath, XSLT, XQuery) ==&lt;br /&gt;
&lt;br /&gt;
High-level languages (XSLT and XQuery) are made available: &lt;br /&gt;
the stream of XML events is buffered, transformed, and thrown back &lt;br /&gt;
to the next filter in the pipeline as a stream of XML events.&lt;br /&gt;
 &lt;br /&gt;
When the corpus does not fit in memory, a mechanism allows &lt;br /&gt;
the subtrees to be buffered and transformed &lt;br /&gt;
successively, one by one, as separate documents. This is the &lt;br /&gt;
&amp;quot;split&amp;quot; element in the query document. Each &amp;quot;filter&amp;quot; inside &lt;br /&gt;
a &amp;quot;split&amp;quot; element sees the corpus as several documents rooted &lt;br /&gt;
at the split/@localName elements.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split localName=&amp;quot;TEI&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is also a way of addressing elements in the stream &lt;br /&gt;
through an XPath expression evaluated against each element, &lt;br /&gt;
one by one, as if each were a stand-alone document. For instance, &lt;br /&gt;
the query document above could be written as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split elxpath=&amp;quot;*[namespace-uri()='http://www.tei-c.org/ns/1.0' &lt;br /&gt;
                      and local-name()='div']&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Merging several documents ==&lt;br /&gt;
&lt;br /&gt;
CR contains a mechanism for merging an external document into an already &lt;br /&gt;
annotated corpus without breaking well-formedness. This is useful for reusing &lt;br /&gt;
the output of existing linguistic annotation tools.&lt;br /&gt;
&lt;br /&gt;
== Using a low level API ==&lt;br /&gt;
&lt;br /&gt;
I tried to make the program stand between a &amp;quot;tool&amp;quot; and an &amp;quot;API&amp;quot; &lt;br /&gt;
(a framework facilitating the use of a low-level API). &lt;br /&gt;
Thus, there are different ways of using it:&lt;br /&gt;
&lt;br /&gt;
* it provides directly usable functions for creating KWIC concordances, extracting subcorpora, computing co-occurrence matrices, merging markup, etc.&lt;br /&gt;
* it may be used as a way of applying XSLT / XQuery to big corpora&lt;br /&gt;
* it may be used for plugging in any SAX filter, reducing the complexity of SAX: only the SAX handler code needs to be written, while the program manages the parser, the pipeline, and the serialisation of the pipeline's output back to disk. Any class implementing the XMLFilter interface may be plugged into the pipeline by giving the qualified name of the Java class in the query document (the class must be found on the CLASSPATH). &lt;br /&gt;
* it may be used for prototyping Java code by embedding Java code defining a SAX filter in the pipeline; see http://panini.u-paris10.fr/~sloiseau/CR/filtres/Script.html or http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#embeddedJava&lt;br /&gt;
&lt;br /&gt;
The goal was to make a low-level API easy to use. &amp;quot;CorpusReader&amp;quot; &lt;br /&gt;
is named after &amp;quot;XMLReader&amp;quot;, the name of the class that parses a document &lt;br /&gt;
in the Java SAX API. It is intended to be a &amp;quot;layer&amp;quot; on top of SAX for &lt;br /&gt;
the TEI vocabulary.&lt;br /&gt;
&lt;br /&gt;
== Fitted for documents in the TEI scheme ==&lt;br /&gt;
&lt;br /&gt;
The program tries to rely on the TEI scheme: &lt;br /&gt;
some structural properties of TEI documents are sometimes needed, &lt;br /&gt;
but I try not to make it rely on a specific TEI customisation. &lt;br /&gt;
(I would like to develop a mechanism for using the &amp;quot;TEI customization&amp;quot; &lt;br /&gt;
document produced by Roma to override the default vocabulary.)&lt;br /&gt;
&lt;br /&gt;
= External Links =&lt;br /&gt;
 &lt;br /&gt;
* The site with documentation (mainly in French) and downloadable archives: http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
* Examples summing up the properties of CR, in English: http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#concordances&lt;br /&gt;
&lt;br /&gt;
[[Category:Tools]]&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2734</id>
		<title>CorpusReader</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2734"/>
		<updated>2006-08-26T17:55:58Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Summary =&lt;br /&gt;
&lt;br /&gt;
CorpusReader (CR) is a tool for extracting subcorpora and quantitative &lt;br /&gt;
information from arbitrarily large corpora in the TEI vocabulary. It is intended to provide ways of processing corpora containing milestoned annotation, and it provides a mechanism for merging several XML documents together.&lt;br /&gt;
&lt;br /&gt;
= Background =&lt;br /&gt;
&lt;br /&gt;
CorpusReader was developed for quantitative corpus linguistics. &lt;br /&gt;
Its original goal was to provide a way of extracting quantitative &lt;br /&gt;
information (such as co-occurrence matrices) using all the information &lt;br /&gt;
expressed in the XML infoset.&lt;br /&gt;
 &lt;br /&gt;
As a ''quantitative'' corpus ''linguistic'' tool, CorpusReader does not &lt;br /&gt;
have any ''linguistic'' or ''statistical'' skills of its own, but it provides help &lt;br /&gt;
for:&lt;br /&gt;
&lt;br /&gt;
* importing the outputs of existing linguistic tools (taggers, parsers, etc.) into a corpus, &lt;br /&gt;
* exporting quantitative data in statistical tool formats (toward Matlab, [http://www.r-project.org/ R] and [http://www.lebart.org DTM], for instance).&lt;br /&gt;
&lt;br /&gt;
The rationale is that existing linguistic and statistical tools should &lt;br /&gt;
be reused and made easy to use in the context of a TEI corpus. &lt;br /&gt;
Thus CorpusReader is a bridge between linguistic and statistical &lt;br /&gt;
tools for creating and empirically exploring &lt;br /&gt;
complex data. &lt;br /&gt;
&lt;br /&gt;
= Technical features =&lt;br /&gt;
&lt;br /&gt;
CR is written in Java and works with Java 1.4 and later. It relies on &lt;br /&gt;
numerous external high-quality open source libraries.&lt;br /&gt;
&lt;br /&gt;
It is run from the command line only.&lt;br /&gt;
&lt;br /&gt;
It is released under the BSD licence.&lt;br /&gt;
&lt;br /&gt;
There is a web site: http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
&lt;br /&gt;
= Properties =&lt;br /&gt;
 &lt;br /&gt;
== Functions are filters ==&lt;br /&gt;
&lt;br /&gt;
CR relies heavily on the SAX API. The &amp;quot;functions&amp;quot; of the program &lt;br /&gt;
are implemented as SAX filters. The program is mainly a collection of &lt;br /&gt;
SAX filters and a mechanism for plugging the filters into a pipeline.&lt;br /&gt;
 &lt;br /&gt;
Each filter is specialized in a precise task and takes few arguments. &lt;br /&gt;
This allows modularity and reusability. While each filter performs a &lt;br /&gt;
simple task, a pipeline of filters may achieve complex tasks.&lt;br /&gt;
 &lt;br /&gt;
The names of the filters (and, for some of them, their arguments), &lt;br /&gt;
which define the pipeline, are given to the program through a file &lt;br /&gt;
(an XML document, of course) called the &amp;quot;query document&amp;quot;. The URL of &lt;br /&gt;
this document is given as an argument to the program, at the &lt;br /&gt;
command line.&lt;br /&gt;
 &lt;br /&gt;
A query document looks like:&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;corpus.xml&amp;quot;&lt;br /&gt;
          outURI=&amp;quot;sample-manuel-1.out&amp;quot;&lt;br /&gt;
          /&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;myFilterName&amp;quot; javaClass=&amp;quot;java.class.qualified.Name&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  Argument subtree, passed to the filter if any, after&lt;br /&gt;
              validation if a schema is known for this filter.&lt;br /&gt;
        --&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
    &amp;lt;!--  etc.: as many filters as needed  --&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
In some cases, filters can communicate directly with each other &lt;br /&gt;
(in addition to communicating through the stream of SAX events).&lt;br /&gt;
&lt;br /&gt;
== Dealing with intersecting hierarchies ==&lt;br /&gt;
&lt;br /&gt;
[TODO]&lt;br /&gt;
&lt;br /&gt;
== Using high level languages on large documents (XPath, XSLT, XQuery) ==&lt;br /&gt;
&lt;br /&gt;
High-level languages (XSLT and XQuery) are made available: &lt;br /&gt;
the stream of XML events is buffered, transformed, and thrown back &lt;br /&gt;
to the next filter in the pipeline as a stream of XML events.&lt;br /&gt;
 &lt;br /&gt;
When the corpus does not fit in memory, a mechanism allows &lt;br /&gt;
the subtrees to be buffered and transformed &lt;br /&gt;
successively, one by one, as separate documents. This is the &lt;br /&gt;
&amp;quot;split&amp;quot; element in the query document. Each &amp;quot;filter&amp;quot; inside &lt;br /&gt;
a &amp;quot;split&amp;quot; element sees the corpus as several documents rooted &lt;br /&gt;
at the split/@localName elements.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split localName=&amp;quot;TEI&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is also a way of addressing elements in the stream &lt;br /&gt;
through an XPath expression evaluated against each element, &lt;br /&gt;
one by one, as if each were a stand-alone document. For instance, &lt;br /&gt;
the query document above could be written as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split elxpath=&amp;quot;*[namespace-uri()='http://www.tei-c.org/ns/1.0' &lt;br /&gt;
                      and local-name()='div']&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Merging several documents ==&lt;br /&gt;
&lt;br /&gt;
CR contains a mechanism for merging an external document into an already &lt;br /&gt;
annotated corpus without breaking well-formedness. This is useful for reusing &lt;br /&gt;
the output of existing linguistic annotation tools.&lt;br /&gt;
&lt;br /&gt;
== Using a low level API ==&lt;br /&gt;
&lt;br /&gt;
I tried to make the program stand between a &amp;quot;tool&amp;quot; and an &amp;quot;API&amp;quot; &lt;br /&gt;
(a framework facilitating the use of a low-level API). &lt;br /&gt;
Thus, there are different ways of using it:&lt;br /&gt;
&lt;br /&gt;
* it provides directly usable functions for creating KWIC concordances, extracting subcorpora, computing co-occurrence matrices, merging markup, etc.&lt;br /&gt;
* it may be used as a way of applying XSLT / XQuery to big corpora&lt;br /&gt;
* it may be used for plugging in any SAX filter, reducing the complexity of SAX: only the SAX handler code needs to be written, while the program manages the parser, the pipeline, and the serialisation of the pipeline's output back to disk. Any class implementing the XMLFilter interface may be plugged into the pipeline by giving the qualified name of the Java class in the query document (the class must be found on the CLASSPATH). &lt;br /&gt;
* it may be used for prototyping Java code by embedding Java code defining a SAX filter in the pipeline; see http://panini.u-paris10.fr/~sloiseau/CR/filtres/Script.html or http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#embeddedJava&lt;br /&gt;
&lt;br /&gt;
The goal was to make a low-level API easy to use. &amp;quot;CorpusReader&amp;quot; &lt;br /&gt;
is named after &amp;quot;XMLReader&amp;quot;, the name of the class that parses a document &lt;br /&gt;
in the Java SAX API. It is intended to be a &amp;quot;layer&amp;quot; on top of SAX for &lt;br /&gt;
the TEI vocabulary.&lt;br /&gt;
&lt;br /&gt;
== Fitted for documents in the TEI scheme ==&lt;br /&gt;
&lt;br /&gt;
The program tries to rely on the TEI scheme: &lt;br /&gt;
some structural properties of TEI documents are sometimes needed, &lt;br /&gt;
but I try not to make it rely on a specific TEI customisation. &lt;br /&gt;
(I would like to develop a mechanism for using the &amp;quot;TEI customization&amp;quot; &lt;br /&gt;
document produced by Roma to override the default vocabulary.)&lt;br /&gt;
&lt;br /&gt;
= External Links =&lt;br /&gt;
 &lt;br /&gt;
* The site with documentation (mainly in French) and downloadable archives: http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
* Examples summing up the properties of CR, in English: http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#concordances&lt;br /&gt;
&lt;br /&gt;
[[Category:Tools]]&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2733</id>
		<title>CorpusReader</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2733"/>
		<updated>2006-08-26T16:04:27Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: /* Using high level languages on large documents (XPath, XSLT, XQuery) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Summary =&lt;br /&gt;
&lt;br /&gt;
CorpusReader (CR) is a tool for extracting subcorpora and quantitative &lt;br /&gt;
information from arbitrarily large corpora, including corpora containing &lt;br /&gt;
milestoned annotation, in the TEI vocabulary. It also provides functionality &lt;br /&gt;
for merging several XML documents together.&lt;br /&gt;
&lt;br /&gt;
= Background =&lt;br /&gt;
&lt;br /&gt;
CorpusReader was developed for quantitative corpus linguistics. &lt;br /&gt;
Its original goal was to provide a way of extracting quantitative &lt;br /&gt;
information (such as co-occurrence matrices) using all the information &lt;br /&gt;
expressed in the XML infoset.&lt;br /&gt;
 &lt;br /&gt;
As a ''quantitative'' corpus ''linguistic'' tool, CorpusReader does not &lt;br /&gt;
have any ''linguistic'' or ''statistical'' skills of its own... but provides help &lt;br /&gt;
for:&lt;br /&gt;
&lt;br /&gt;
* importing the outputs of existing linguistic tools into a corpus, &lt;br /&gt;
* exporting quantitative data in statistical tool formats.&lt;br /&gt;
&lt;br /&gt;
The rationale is that existing linguistic and statistical tools should &lt;br /&gt;
be reused and made easy to use in the context of a TEI corpus. &lt;br /&gt;
CorpusReader is thus a bridge between linguistic and statistical &lt;br /&gt;
tools for creating and exploring empirically &lt;br /&gt;
complex data. &lt;br /&gt;
&lt;br /&gt;
= Technical features =&lt;br /&gt;
&lt;br /&gt;
CR is written in Java and works with Java 1.4 and later. It relies on &lt;br /&gt;
numerous external high-quality open-source libraries.&lt;br /&gt;
&lt;br /&gt;
It can only be run from the command line.&lt;br /&gt;
&lt;br /&gt;
It is released under the BSD licence.&lt;br /&gt;
&lt;br /&gt;
There is a web site: http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
&lt;br /&gt;
= Properties =&lt;br /&gt;
 &lt;br /&gt;
== Functions are filters ==&lt;br /&gt;
&lt;br /&gt;
CR relies heavily on the SAX API. The &amp;quot;functions&amp;quot; of the program &lt;br /&gt;
are implemented as SAX filters. The program is mainly a collection of &lt;br /&gt;
SAX filters and a mechanism for plugging the filters into a pipeline.&lt;br /&gt;
 &lt;br /&gt;
Each filter is specialized in a precise task and takes few arguments. &lt;br /&gt;
This allows modularity and reusability. While each filter performs a &lt;br /&gt;
simple task, a pipeline of filters may achieve complex tasks.&lt;br /&gt;
 &lt;br /&gt;
The names of the filters (and, for some of them, their arguments), &lt;br /&gt;
defining the pipeline, are given to the program through a file &lt;br /&gt;
(an XML document, of course) called the &amp;quot;query document&amp;quot;. The URL of &lt;br /&gt;
this document is given as an argument to the program at the &lt;br /&gt;
command line.&lt;br /&gt;
 &lt;br /&gt;
A query document looks like:&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;corpus.xml&amp;quot;&lt;br /&gt;
          outURI=&amp;quot;sample-manuel-1.out&amp;quot;&lt;br /&gt;
          /&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;myFilterName&amp;quot; javaClass=&amp;quot;java.class.qualified.Name&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  Argument subtree, passed to the filter if any, after&lt;br /&gt;
              validation if a schema is known for this filter.&lt;br /&gt;
        --&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
    &amp;lt;!--  etc.: as many filters as needed  --&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
In some cases, filters can communicate directly with each other &lt;br /&gt;
(in addition to communicating through the stream of SAX events).&lt;br /&gt;
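Since the filters are plain SAX components, writing a new one amounts to subclassing XMLFilterImpl from the standard Java SAX API. The following is a minimal sketch of such a filter, assuming nothing about CR's internals; the class name, namespace test, and counted element are purely illustrative:

```java
import java.io.StringReader;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

// Illustrative filter: counts TEI <w> (word) elements while forwarding
// every SAX event unchanged to the next stage of the pipeline.
public class WordCounter extends XMLFilterImpl {
    private int count = 0;

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        if ("http://www.tei-c.org/ns/1.0".equals(uri)
                && "w".equals(localName)) {
            count++;
        }
        super.startElement(uri, localName, qName, atts); // pass event on
    }

    public int getCount() { return count; }

    // Convenience driver: parse a document string through the filter.
    public static int countWords(String xml) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader parser = spf.newSAXParser().getXMLReader();
        WordCounter filter = new WordCounter();
        filter.setParent(parser);
        filter.parse(new InputSource(new StringReader(xml)));
        return filter.getCount();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(countWords(
            "<s xmlns='http://www.tei-c.org/ns/1.0'><w>a</w><w>b</w></s>"));
    }
}
```

A class like this, once compiled and placed on the CLASSPATH, is the kind of component that a filter/@javaClass attribute in the query document can name.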
&lt;br /&gt;
== Dealing with intersecting hierarchies ==&lt;br /&gt;
&lt;br /&gt;
[TODO]&lt;br /&gt;
&lt;br /&gt;
== Using high level languages on large documents (XPath, XSLT, XQuery) ==&lt;br /&gt;
&lt;br /&gt;
High-level languages (XSLT and XQuery) are made available: &lt;br /&gt;
the stream of XML events is buffered, transformed, and passed on &lt;br /&gt;
to the next filter in the pipeline as a stream of XML events.&lt;br /&gt;
 &lt;br /&gt;
When the corpus does not fit in memory, a mechanism allows &lt;br /&gt;
the subtrees to be buffered and transformed &lt;br /&gt;
successively, one by one, as separate documents: this is the &lt;br /&gt;
&amp;quot;split&amp;quot; element in the query document. Each &amp;quot;filter&amp;quot; inside &lt;br /&gt;
a &amp;quot;split&amp;quot; element sees the corpus as several documents rooted &lt;br /&gt;
at the split/@localName elements.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split localName=&amp;quot;TEI&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is also a way of addressing elements in the stream &lt;br /&gt;
through an XPath expression evaluated against each element, &lt;br /&gt;
one at a time, as if each were a stand-alone document. For instance, &lt;br /&gt;
the query document above could be written as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split elxpath=&amp;quot;*[namespace-uri()='http://www.tei-c.org/ns/1.0' &lt;br /&gt;
                      and local-name()='div']&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
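In JAXP, the standard mechanism for exposing a stylesheet as a SAX filter is SAXTransformerFactory.newXMLFilter; whether CR uses exactly this call is an assumption, but the sketch below shows the general technique (the inline stylesheet and element names are invented for the example):

```java
import java.io.StringReader;

import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.stream.StreamSource;

import org.xml.sax.InputSource;
import org.xml.sax.XMLFilter;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.DefaultHandler;

public class XsltAsSaxFilter {
    // Runs a tiny XSLT (renaming p to para) as a SAX filter and returns
    // the element names seen downstream of the transformation.
    public static String transformedElements() throws Exception {
        String xsl =
            "<xsl:stylesheet version='1.0'"
          + " xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
          + "<xsl:template match='p'>"
          + "<para><xsl:apply-templates/></para></xsl:template>"
          + "<xsl:template match='@*|node()'><xsl:copy>"
          + "<xsl:apply-templates select='@*|node()'/></xsl:copy></xsl:template>"
          + "</xsl:stylesheet>";

        SAXTransformerFactory stf =
            (SAXTransformerFactory) TransformerFactory.newInstance();
        // The stylesheet becomes an ordinary XMLFilter, pluggable into
        // any SAX pipeline.
        XMLFilter filter =
            stf.newXMLFilter(new StreamSource(new StringReader(xsl)));

        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader parser = spf.newSAXParser().getXMLReader();
        filter.setParent(parser);

        final StringBuilder seen = new StringBuilder();
        filter.setContentHandler(new DefaultHandler() {
            @Override
            public void startElement(String uri, String local, String qName,
                                     org.xml.sax.Attributes atts) {
                seen.append('<').append(qName).append('>');
            }
        });
        filter.parse(new InputSource(new StringReader("<doc><p>hi</p></doc>")));
        return seen.toString();
    }

    public static void main(String[] args) throws Exception {
        System.out.println(transformedElements());
    }
}
```

The resulting XMLFilter behaves like any other SAX filter, so a buffered-and-transformed stage can sit at any point of the pipeline.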
&lt;br /&gt;
== Merging several documents ==&lt;br /&gt;
&lt;br /&gt;
CR contains a mechanism for merging an external document into an already &lt;br /&gt;
annotated corpus without breaking well-formedness. It is useful for reusing &lt;br /&gt;
the output of existing linguistic annotation tools.&lt;br /&gt;
&lt;br /&gt;
== Using a low level API ==&lt;br /&gt;
&lt;br /&gt;
I tried to make the program stand between a &amp;quot;tool&amp;quot; and an &amp;quot;API&amp;quot; &lt;br /&gt;
(a framework facilitating the use of a low-level API). &lt;br /&gt;
Thus, there may be different ways of using it:&lt;br /&gt;
&lt;br /&gt;
* it provides directly usable functions for creating KWIC concordances, extracting subcorpora, computing co-occurrence matrices, merging markup, etc.&lt;br /&gt;
* it may be used as a way of applying XSLT / XQuery to big corpora&lt;br /&gt;
* it may be used for plugging in any SAX filter, and it reduces the complexity of SAX: only the SAX handler code needs to be written, while the program manages the parser, the pipeline, and the serialisation of the pipeline's output back to disk. Any class implementing the XMLFilter interface may be plugged into the pipeline by providing the qualified name of the Java class in the query document (this class should be found on the CLASSPATH). &lt;br /&gt;
* it may be used for prototyping Java code by embedding, in the pipeline, Java code defining a SAX filter; see http://panini.u-paris10.fr/~sloiseau/CR/filtres/Script.html or http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#embeddedJava&lt;br /&gt;
 &lt;br /&gt;
The goal was to make a low-level API easy to use. &amp;quot;CorpusReader&amp;quot; &lt;br /&gt;
is named after &amp;quot;XMLReader&amp;quot;, the interface that parses a document &lt;br /&gt;
in the Java SAX API. It is intended to be a &amp;quot;layer&amp;quot; on top of SAX for &lt;br /&gt;
the TEI vocabulary.&lt;br /&gt;
&lt;br /&gt;
== Fitted for documents in the TEI scheme ==&lt;br /&gt;
&lt;br /&gt;
The program tries to rely on the TEI scheme: &lt;br /&gt;
some structural properties of TEI documents are sometimes needed, &lt;br /&gt;
but I try not to make it rely on a specific TEI customisation. &lt;br /&gt;
(I would like to develop a mechanism for using the &amp;quot;TEI customization&amp;quot; &lt;br /&gt;
document produced by Roma to override the default vocabulary.)&lt;br /&gt;
&lt;br /&gt;
= External Links =&lt;br /&gt;
 &lt;br /&gt;
* The site with documentation (mainly in French) and downloadable archives: http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
* Examples summing up the properties of CR in (approximate) English: http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#concordances&lt;br /&gt;
&lt;br /&gt;
[[Category:Tools]]&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2732</id>
		<title>CorpusReader</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2732"/>
		<updated>2006-08-26T16:04:00Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: /* Fitted for documents in the TEI scheme */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Summary =&lt;br /&gt;
&lt;br /&gt;
CorpusReader (CR) is a tool for extracting subcorpora and quantitative &lt;br /&gt;
information from arbitrarily large corpora, including corpora containing &lt;br /&gt;
milestoned annotation, in the TEI vocabulary. It also provides functionality &lt;br /&gt;
for merging several XML documents together.&lt;br /&gt;
&lt;br /&gt;
= Background =&lt;br /&gt;
&lt;br /&gt;
CorpusReader was developed for quantitative corpus linguistics. &lt;br /&gt;
Its original goal was to provide a way of extracting quantitative &lt;br /&gt;
information (such as co-occurrence matrices) using all the information &lt;br /&gt;
expressed in the XML infoset.&lt;br /&gt;
 &lt;br /&gt;
As a ''quantitative'' corpus ''linguistic'' tool, CorpusReader does not &lt;br /&gt;
have any ''linguistic'' or ''statistical'' skills of its own... but provides help &lt;br /&gt;
for:&lt;br /&gt;
&lt;br /&gt;
* importing the outputs of existing linguistic tools into a corpus, &lt;br /&gt;
* exporting quantitative data in statistical tool formats.&lt;br /&gt;
&lt;br /&gt;
The rationale is that existing linguistic and statistical tools should &lt;br /&gt;
be reused and made easy to use in the context of a TEI corpus. &lt;br /&gt;
CorpusReader is thus a bridge between linguistic and statistical &lt;br /&gt;
tools for creating and exploring empirically &lt;br /&gt;
complex data. &lt;br /&gt;
&lt;br /&gt;
= Technical features =&lt;br /&gt;
&lt;br /&gt;
CR is written in Java and works with Java 1.4 and later. It relies on &lt;br /&gt;
numerous external high-quality open-source libraries.&lt;br /&gt;
&lt;br /&gt;
It can only be run from the command line.&lt;br /&gt;
&lt;br /&gt;
It is released under the BSD licence.&lt;br /&gt;
&lt;br /&gt;
There is a web site: http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
&lt;br /&gt;
= Properties =&lt;br /&gt;
 &lt;br /&gt;
== Functions are filters ==&lt;br /&gt;
&lt;br /&gt;
CR relies heavily on the SAX API. The &amp;quot;functions&amp;quot; of the program &lt;br /&gt;
are implemented as SAX filters. The program is mainly a collection of &lt;br /&gt;
SAX filters and a mechanism for plugging the filters into a pipeline.&lt;br /&gt;
 &lt;br /&gt;
Each filter is specialized in a precise task and takes few arguments. &lt;br /&gt;
This allows modularity and reusability. While each filter performs a &lt;br /&gt;
simple task, a pipeline of filters may achieve complex tasks.&lt;br /&gt;
 &lt;br /&gt;
The names of the filters (and, for some of them, their arguments), &lt;br /&gt;
defining the pipeline, are given to the program through a file &lt;br /&gt;
(an XML document, of course) called the &amp;quot;query document&amp;quot;. The URL of &lt;br /&gt;
this document is given as an argument to the program at the &lt;br /&gt;
command line.&lt;br /&gt;
 &lt;br /&gt;
A query document looks like:&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;corpus.xml&amp;quot;&lt;br /&gt;
          outURI=&amp;quot;sample-manuel-1.out&amp;quot;&lt;br /&gt;
          /&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;myFilterName&amp;quot; javaClass=&amp;quot;java.class.qualified.Name&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  Argument subtree, passed to the filter if any, after&lt;br /&gt;
              validation if a schema is known for this filter.&lt;br /&gt;
        --&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
    &amp;lt;!--  etc.: as many filters as needed  --&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
In some cases, filters can communicate directly with each other &lt;br /&gt;
(in addition to communicating through the stream of SAX events).&lt;br /&gt;
&lt;br /&gt;
== Dealing with intersecting hierarchies ==&lt;br /&gt;
&lt;br /&gt;
[TODO]&lt;br /&gt;
&lt;br /&gt;
== Using high level languages on large documents (XPath, XSLT, XQuery) ==&lt;br /&gt;
&lt;br /&gt;
High-level languages (XSLT and XQuery) are made available: &lt;br /&gt;
the stream of XML events is buffered, transformed, and passed on &lt;br /&gt;
to the next filter in the pipeline as a stream of XML events.&lt;br /&gt;
 &lt;br /&gt;
When the corpus does not fit in memory, a mechanism allows &lt;br /&gt;
the subtrees to be buffered and transformed &lt;br /&gt;
successively, one by one, as separate documents: this is the &lt;br /&gt;
&amp;quot;split&amp;quot; element in the query document. Each &amp;quot;filter&amp;quot; inside &lt;br /&gt;
a &amp;quot;split&amp;quot; element sees the corpus as several documents rooted &lt;br /&gt;
at the split/@localName elements.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split localName=&amp;quot;TEI&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is also a way of addressing elements in the stream &lt;br /&gt;
through an XPath expression evaluated against each element, &lt;br /&gt;
one at a time, as if each were a stand-alone document. For instance, the query document above could be written as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split elxpath=&amp;quot;*[namespace-uri()='http://www.tei-c.org/ns/1.0' &lt;br /&gt;
                      and local-name()='div']&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Merging several documents ==&lt;br /&gt;
&lt;br /&gt;
CR contains a mechanism for merging an external document into an already &lt;br /&gt;
annotated corpus without breaking well-formedness. It is useful for reusing &lt;br /&gt;
the output of existing linguistic annotation tools.&lt;br /&gt;
&lt;br /&gt;
== Using a low level API ==&lt;br /&gt;
&lt;br /&gt;
I tried to make the program stand between a &amp;quot;tool&amp;quot; and an &amp;quot;API&amp;quot; &lt;br /&gt;
(a framework facilitating the use of a low-level API). &lt;br /&gt;
Thus, there may be different ways of using it:&lt;br /&gt;
&lt;br /&gt;
* it provides directly usable functions for creating KWIC concordances, extracting subcorpora, computing co-occurrence matrices, merging markup, etc.&lt;br /&gt;
* it may be used as a way of applying XSLT / XQuery to big corpora&lt;br /&gt;
* it may be used for plugging in any SAX filter, and it reduces the complexity of SAX: only the SAX handler code needs to be written, while the program manages the parser, the pipeline, and the serialisation of the pipeline's output back to disk. Any class implementing the XMLFilter interface may be plugged into the pipeline by providing the qualified name of the Java class in the query document (this class should be found on the CLASSPATH). &lt;br /&gt;
* it may be used for prototyping Java code by embedding, in the pipeline, Java code defining a SAX filter; see http://panini.u-paris10.fr/~sloiseau/CR/filtres/Script.html or http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#embeddedJava&lt;br /&gt;
&lt;br /&gt;
The goal was to make a low-level API easy to use. &amp;quot;CorpusReader&amp;quot; &lt;br /&gt;
is named after &amp;quot;XMLReader&amp;quot;, the interface that parses a document &lt;br /&gt;
in the Java SAX API. It is intended to be a &amp;quot;layer&amp;quot; on top of SAX for &lt;br /&gt;
the TEI vocabulary.&lt;br /&gt;
&lt;br /&gt;
== Fitted for documents in the TEI scheme ==&lt;br /&gt;
&lt;br /&gt;
The program tries to rely on the TEI scheme: &lt;br /&gt;
some structural properties of TEI documents are sometimes needed, &lt;br /&gt;
but I try not to make it rely on a specific TEI customisation. &lt;br /&gt;
(I would like to develop a mechanism for using the &amp;quot;TEI customization&amp;quot; &lt;br /&gt;
document produced by Roma to override the default vocabulary.)&lt;br /&gt;
&lt;br /&gt;
= External Links =&lt;br /&gt;
 &lt;br /&gt;
* The site with documentation (mainly in French) and downloadable archives: http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
* Examples summing up the properties of CR in (approximate) English: http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#concordances&lt;br /&gt;
&lt;br /&gt;
[[Category:Tools]]&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2731</id>
		<title>CorpusReader</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2731"/>
		<updated>2006-08-26T16:03:20Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: /* Using a low level API */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Summary =&lt;br /&gt;
&lt;br /&gt;
CorpusReader (CR) is a tool for extracting subcorpora and quantitative &lt;br /&gt;
information from arbitrarily large corpora, including corpora containing &lt;br /&gt;
milestoned annotation, in the TEI vocabulary. It also provides functionality &lt;br /&gt;
for merging several XML documents together.&lt;br /&gt;
&lt;br /&gt;
= Background =&lt;br /&gt;
&lt;br /&gt;
CorpusReader was developed for quantitative corpus linguistics. &lt;br /&gt;
Its original goal was to provide a way of extracting quantitative &lt;br /&gt;
information (such as co-occurrence matrices) using all the information &lt;br /&gt;
expressed in the XML infoset.&lt;br /&gt;
 &lt;br /&gt;
As a ''quantitative'' corpus ''linguistic'' tool, CorpusReader does not &lt;br /&gt;
have any ''linguistic'' or ''statistical'' skills of its own... but provides help &lt;br /&gt;
for:&lt;br /&gt;
&lt;br /&gt;
* importing the outputs of existing linguistic tools into a corpus, &lt;br /&gt;
* exporting quantitative data in statistical tool formats.&lt;br /&gt;
&lt;br /&gt;
The rationale is that existing linguistic and statistical tools should &lt;br /&gt;
be reused and made easy to use in the context of a TEI corpus. &lt;br /&gt;
CorpusReader is thus a bridge between linguistic and statistical &lt;br /&gt;
tools for creating and exploring empirically &lt;br /&gt;
complex data. &lt;br /&gt;
&lt;br /&gt;
= Technical features =&lt;br /&gt;
&lt;br /&gt;
CR is written in Java and works with Java 1.4 and later. It relies on &lt;br /&gt;
numerous external high-quality open-source libraries.&lt;br /&gt;
&lt;br /&gt;
It can only be run from the command line.&lt;br /&gt;
&lt;br /&gt;
It is released under the BSD licence.&lt;br /&gt;
&lt;br /&gt;
There is a web site: http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
&lt;br /&gt;
= Properties =&lt;br /&gt;
 &lt;br /&gt;
== Functions are filters ==&lt;br /&gt;
&lt;br /&gt;
CR relies heavily on the SAX API. The &amp;quot;functions&amp;quot; of the program &lt;br /&gt;
are implemented as SAX filters. The program is mainly a collection of &lt;br /&gt;
SAX filters and a mechanism for plugging the filters into a pipeline.&lt;br /&gt;
 &lt;br /&gt;
Each filter is specialized in a precise task and takes few arguments. &lt;br /&gt;
This allows modularity and reusability. While each filter performs a &lt;br /&gt;
simple task, a pipeline of filters may achieve complex tasks.&lt;br /&gt;
 &lt;br /&gt;
The names of the filters (and, for some of them, their arguments), &lt;br /&gt;
defining the pipeline, are given to the program through a file &lt;br /&gt;
(an XML document, of course) called the &amp;quot;query document&amp;quot;. The URL of &lt;br /&gt;
this document is given as an argument to the program at the &lt;br /&gt;
command line.&lt;br /&gt;
 &lt;br /&gt;
A query document looks like:&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;corpus.xml&amp;quot;&lt;br /&gt;
          outURI=&amp;quot;sample-manuel-1.out&amp;quot;&lt;br /&gt;
          /&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;myFilterName&amp;quot; javaClass=&amp;quot;java.class.qualified.Name&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  Argument subtree, passed to the filter if any, after&lt;br /&gt;
              validation if a schema is known for this filter.&lt;br /&gt;
        --&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
    &amp;lt;!--  etc.: as many filters as needed  --&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
In some cases, filters can communicate directly with each other &lt;br /&gt;
(in addition to communicating through the stream of SAX events).&lt;br /&gt;
&lt;br /&gt;
== Dealing with intersecting hierarchies ==&lt;br /&gt;
&lt;br /&gt;
[TODO]&lt;br /&gt;
&lt;br /&gt;
== Using high level languages on large documents (XPath, XSLT, XQuery) ==&lt;br /&gt;
&lt;br /&gt;
High-level languages (XSLT and XQuery) are made available: &lt;br /&gt;
the stream of XML events is buffered, transformed, and passed on &lt;br /&gt;
to the next filter in the pipeline as a stream of XML events.&lt;br /&gt;
 &lt;br /&gt;
When the corpus does not fit in memory, a mechanism allows &lt;br /&gt;
the subtrees to be buffered and transformed &lt;br /&gt;
successively, one by one, as separate documents: this is the &lt;br /&gt;
&amp;quot;split&amp;quot; element in the query document. Each &amp;quot;filter&amp;quot; inside &lt;br /&gt;
a &amp;quot;split&amp;quot; element sees the corpus as several documents rooted &lt;br /&gt;
at the split/@localName elements.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split localName=&amp;quot;TEI&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is also a way of addressing elements in the stream &lt;br /&gt;
through an XPath expression evaluated against each element, &lt;br /&gt;
one at a time, as if each were a stand-alone document. For instance, the query document above could be written as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split elxpath=&amp;quot;*[namespace-uri()='http://www.tei-c.org/ns/1.0' &lt;br /&gt;
                      and local-name()='div']&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Merging several documents ==&lt;br /&gt;
&lt;br /&gt;
CR contains a mechanism for merging an external document into an already &lt;br /&gt;
annotated corpus without breaking well-formedness. It is useful for reusing &lt;br /&gt;
the output of existing linguistic annotation tools.&lt;br /&gt;
&lt;br /&gt;
== Using a low level API ==&lt;br /&gt;
&lt;br /&gt;
I tried to make the program stand between a &amp;quot;tool&amp;quot; and an &amp;quot;API&amp;quot;. &lt;br /&gt;
(a framework for facilitating the use of a low level API). &lt;br /&gt;
Thus, there may be different ways of using it:&lt;br /&gt;
&lt;br /&gt;
* it provides directly usable functions for creating KWIC concordances, extracting subcorpora, computing co-occurrence matrices, merging markup, etc.&lt;br /&gt;
* it may be used as a way of applying XSLT / XQuery to big corpora&lt;br /&gt;
* it may be used for plugging in any SAX filter; it reduces the complexity of SAX by letting the user write only the SAX handler code, while the program manages the parser, the pipeline, and the serialisation of the pipeline output back to disk. Any class implementing the XMLFilter interface may be plugged into the pipeline by providing the qualified name of the Java class in the query document (this class must be found on the CLASSPATH).&lt;br /&gt;
* it may be used for prototyping Java code by embedding, in the pipeline, Java code defining a SAX filter; see http://panini.u-paris10.fr/~sloiseau/CR/filtres/Script.html or http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#embeddedJava&lt;br /&gt;
&lt;br /&gt;
The goal was to make a low-level API easy to use. &amp;quot;CorpusReader&amp;quot; &lt;br /&gt;
is named after &amp;quot;XMLReader&amp;quot;, the interface for parsing a document &lt;br /&gt;
in the Java SAX API. It is intended to be a &amp;quot;layer&amp;quot; on top of SAX for &lt;br /&gt;
the TEI vocabulary.&lt;br /&gt;
&lt;br /&gt;
== Fitted for documents in the TEI scheme ==&lt;br /&gt;
&lt;br /&gt;
The program tries to rely on the TEI scheme &lt;br /&gt;
(in fact, it is already strongly associated with P5): &lt;br /&gt;
some structural properties of TEI documents are sometimes needed, &lt;br /&gt;
but I try not to make it rely on a specific TEI customisation. &lt;br /&gt;
(I would like to develop a mechanism for using the &amp;quot;TEI customization&amp;quot; &lt;br /&gt;
document produced by Roma to override the default vocabulary.)&lt;br /&gt;
&lt;br /&gt;
= External Links =&lt;br /&gt;
 &lt;br /&gt;
* The site with documentation (mainly in French) and downloadable archives: http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
* Examples summing up the properties of CR, in approximate English: http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#concordances&lt;br /&gt;
&lt;br /&gt;
[[Category:Tools]]&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2730</id>
		<title>CorpusReader</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2730"/>
		<updated>2006-08-26T16:02:54Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: /* Process all documents in the TEI scheme */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Summary =&lt;br /&gt;
&lt;br /&gt;
CorpusReader (CR) is a tool for extracting subcorpora and quantitative &lt;br /&gt;
information from arbitrarily large corpora, including corpora containing &lt;br /&gt;
milestoned annotation, in the TEI vocabulary. It also provides functionality &lt;br /&gt;
for merging several XML documents together.&lt;br /&gt;
&lt;br /&gt;
= Background =&lt;br /&gt;
&lt;br /&gt;
CorpusReader was developed for quantitative corpus linguistics. &lt;br /&gt;
Its original goal was to provide a way of extracting quantitative &lt;br /&gt;
information (such as co-occurrence matrices) using all the information &lt;br /&gt;
expressed in the XML infoset.&lt;br /&gt;
 &lt;br /&gt;
As a ''quantitative'' corpus ''linguistics'' tool, CorpusReader does not &lt;br /&gt;
have any ''linguistic'' or ''statistical'' skills of its own... but it &lt;br /&gt;
provides help for:&lt;br /&gt;
&lt;br /&gt;
* importing outputs of existing linguistic tools into a corpus, &lt;br /&gt;
* exporting quantitative data in statistical tool formats.&lt;br /&gt;
&lt;br /&gt;
The rationale is that existing linguistic and statistical tools should &lt;br /&gt;
be reused and made easy to use in the context of a TEI corpus. &lt;br /&gt;
Thus CorpusReader is a bridge between linguistic and statistical &lt;br /&gt;
tools for creating and exploring complex empirical data. &lt;br /&gt;
&lt;br /&gt;
= Technical features =&lt;br /&gt;
&lt;br /&gt;
CR is written in Java and works with Java 1.4 and later. It relies on &lt;br /&gt;
several external high-quality open-source libraries.&lt;br /&gt;
&lt;br /&gt;
It is run from the command line only.&lt;br /&gt;
&lt;br /&gt;
It is released under the BSD licence.&lt;br /&gt;
&lt;br /&gt;
There is a web site: http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
&lt;br /&gt;
= Properties =&lt;br /&gt;
 &lt;br /&gt;
== Functions are filters ==&lt;br /&gt;
&lt;br /&gt;
CR relies heavily on the SAX API. The &amp;quot;functions&amp;quot; of the program &lt;br /&gt;
are implemented as SAX filters. The program is mainly a collection of &lt;br /&gt;
SAX filters and a mechanism for plugging the filters into a pipeline.&lt;br /&gt;
 &lt;br /&gt;
Each filter is specialized in a precise task and takes few arguments. &lt;br /&gt;
This allows modularity and reusability. While each filter performs a &lt;br /&gt;
simple task, the pipeline of filters may achieve complex tasks.&lt;br /&gt;
 &lt;br /&gt;
The names of the filters (and, for some of them, their arguments), &lt;br /&gt;
defining the pipeline, are given to the program through a file &lt;br /&gt;
(an XML document, of course) called the &amp;quot;query document&amp;quot;. The URL of &lt;br /&gt;
this document is given to the program as a command-line argument.&lt;br /&gt;
 &lt;br /&gt;
A query document looks like:&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;corpus.xml&amp;quot;&lt;br /&gt;
          outURI=&amp;quot;sample-manuel-1.out&amp;quot;&lt;br /&gt;
          /&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;myFilterName&amp;quot; javaClass=&amp;quot;java.class.qualified.Name&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  Argument subtree, passed to the filter if any, after&lt;br /&gt;
              validation if a schema is known for this filter.&lt;br /&gt;
        --&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
    &amp;lt;!--  etc.: as many filters as needed  --&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
In some cases, filters can communicate directly with each other &lt;br /&gt;
(in addition to communicating through the stream of SAX events).&lt;br /&gt;
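&lt;br /&gt;
As a purely hypothetical illustration (this class is not shipped with CR), &lt;br /&gt;
a filter declared through javaClass is simply a class implementing the &lt;br /&gt;
XMLFilter interface, for instance by extending the standard XMLFilterImpl &lt;br /&gt;
helper:&lt;br /&gt;
&lt;br /&gt;
```java
// Hypothetical sketch, not part of CR: a minimal SAX filter of the kind
// that could be plugged into a pipeline via its qualified class name.
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

public class DivCounter extends XMLFilterImpl {
    private int count = 0;

    public int getCount() {
        return count;
    }

    // Count div elements while forwarding every event downstream unchanged.
    public void startElement(String uri, String localName,
                             String qName, Attributes atts)
            throws SAXException {
        if ("div".equals(localName)) {
            count = count + 1;
        }
        super.startElement(uri, localName, qName, atts);
    }
}
```
Given such a class on the CLASSPATH, naming it in the javaClass attribute &lt;br /&gt;
of a filter element would place it in the pipeline.&lt;br /&gt;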
&lt;br /&gt;
== Dealing with intersecting hierarchies ==&lt;br /&gt;
&lt;br /&gt;
[TODO]&lt;br /&gt;
&lt;br /&gt;
== Using high level languages on large documents (XPath, XSLT, XQuery) ==&lt;br /&gt;
&lt;br /&gt;
High-level languages (XSLT and XQuery) are made available: &lt;br /&gt;
the stream of XML events is buffered, transformed, and thrown back &lt;br /&gt;
to the next filter in the pipeline as a stream of XML events.&lt;br /&gt;
 &lt;br /&gt;
When the corpus does not fit in memory, a mechanism allows the &lt;br /&gt;
subtrees to be addressed, buffered, and transformed successively, &lt;br /&gt;
one by one, as separate documents: this is the &lt;br /&gt;
&amp;quot;split&amp;quot; element in the query document. Each &amp;quot;filter&amp;quot; inside &lt;br /&gt;
a &amp;quot;split&amp;quot; element sees the corpus as several documents rooted &lt;br /&gt;
at the split/@localName elements.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split localName=&amp;quot;TEI&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is also a way of addressing elements in the stream &lt;br /&gt;
through an XPath expression evaluated against each element, &lt;br /&gt;
one at a time, as if each were a stand-alone document. For instance, the query document above could be written as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split elxpath=&amp;quot;*[namespace-uri()='http://www.tei-c.org/ns/1.0' &lt;br /&gt;
                      and local-name()='div']&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Merging several documents ==&lt;br /&gt;
&lt;br /&gt;
CR contains a mechanism for merging an external document into an already &lt;br /&gt;
annotated corpus without breaking well-formedness. This is useful for reusing &lt;br /&gt;
the output of existing linguistic annotation tools.&lt;br /&gt;
&lt;br /&gt;
== Using a low level API ==&lt;br /&gt;
&lt;br /&gt;
I tried to make the program stand halfway between a &amp;quot;tool&amp;quot; and an &amp;quot;API&amp;quot; &lt;br /&gt;
(a framework facilitating the use of a low-level API). &lt;br /&gt;
Thus, there are different ways of using it:&lt;br /&gt;
&lt;br /&gt;
* it provides directly usable functions for creating KWIC concordances, extracting subcorpora, computing co-occurrence matrices, merging markup, etc.&lt;br /&gt;
* it may be used as a way of applying XSLT / XQuery to big corpora&lt;br /&gt;
* it may be used for plugging in any SAX filter; it reduces the complexity of SAX by letting the user write only the SAX handler code, while the program manages the parser, the pipeline, and the serialisation of the pipeline output back to disk. Any class implementing the XMLFilter interface may be plugged into the pipeline by providing the qualified name of the Java class in the query document (this class must be found on the CLASSPATH).&lt;br /&gt;
* it may be used for prototyping Java code by embedding, in the pipeline, Java code defining a SAX filter; see http://panini.u-paris10.fr/~sloiseau/CR/filtres/Script.html or http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#embeddedJava&lt;br /&gt;
&lt;br /&gt;
The goal was to make a low-level API easy to use. &amp;quot;CorpusReader&amp;quot; &lt;br /&gt;
is named after &amp;quot;XMLReader&amp;quot;, the interface for parsing a document &lt;br /&gt;
in the Java SAX API. It is intended to be a &amp;quot;layer&amp;quot; on top of SAX for &lt;br /&gt;
the TEI vocabulary.&lt;br /&gt;
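&lt;br /&gt;
To give an idea of the plumbing that CR spares the user, here is a &lt;br /&gt;
hypothetical sketch (not CR's actual code) of the bare SAX machinery: &lt;br /&gt;
parser creation, chaining of one identity filter, and serialisation of &lt;br /&gt;
the resulting stream back to disk:&lt;br /&gt;
&lt;br /&gt;
```java
// Hypothetical sketch, not CR's actual code: the plain SAX plumbing that
// CR manages for the user (parser, filter chain, serialisation).
import java.io.File;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.XMLFilterImpl;

public class Pipeline {
    public static void main(String[] args) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);

        // An identity filter; a real pipeline would chain several filters,
        // each taking the previous stage as its parent.
        XMLFilterImpl filter = new XMLFilterImpl();
        filter.setParent(spf.newSAXParser().getXMLReader());

        // Serialise the stream of events back to disk
        // with an identity transform.
        Transformer t = TransformerFactory.newInstance().newTransformer();
        t.transform(new SAXSource(filter, new InputSource(args[0])),
                    new StreamResult(new File(args[1])));
    }
}
```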
&lt;br /&gt;
== Fitted for documents in the TEI scheme ==&lt;br /&gt;
&lt;br /&gt;
The program tries to rely on the TEI scheme &lt;br /&gt;
(in fact, it is already strongly associated with P5): &lt;br /&gt;
some structural properties of TEI documents are sometimes needed, &lt;br /&gt;
but I try not to make it rely on a specific TEI customisation. &lt;br /&gt;
(I would like to develop a mechanism for using the &amp;quot;TEI customization&amp;quot; &lt;br /&gt;
document produced by Roma to override the default vocabulary.)&lt;br /&gt;
&lt;br /&gt;
= External Links =&lt;br /&gt;
 &lt;br /&gt;
* The site with documentation (mainly in French) and downloadable archives: http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
* Examples summing up the properties of CR, in approximate English: http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#concordances&lt;br /&gt;
&lt;br /&gt;
[[Category:Tools]]&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2729</id>
		<title>CorpusReader</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2729"/>
		<updated>2006-08-26T16:02:20Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: /* Process all document in the TEI scheme */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Summary =&lt;br /&gt;
&lt;br /&gt;
CorpusReader (CR) is a tool for extracting subcorpora and quantitative &lt;br /&gt;
information from arbitrarily large corpora, including corpora containing &lt;br /&gt;
milestoned annotation, in the TEI vocabulary. It also provides functionality &lt;br /&gt;
for merging several XML documents together.&lt;br /&gt;
&lt;br /&gt;
= Background =&lt;br /&gt;
&lt;br /&gt;
CorpusReader was developed for quantitative corpus linguistics. &lt;br /&gt;
Its original goal was to provide a way of extracting quantitative &lt;br /&gt;
information (such as co-occurrence matrices) using all the information &lt;br /&gt;
expressed in the XML infoset.&lt;br /&gt;
 &lt;br /&gt;
As a ''quantitative'' corpus ''linguistics'' tool, CorpusReader does not &lt;br /&gt;
have any ''linguistic'' or ''statistical'' skills of its own... but it &lt;br /&gt;
provides help for:&lt;br /&gt;
&lt;br /&gt;
* importing outputs of existing linguistic tools into a corpus, &lt;br /&gt;
* exporting quantitative data in statistical tool formats.&lt;br /&gt;
&lt;br /&gt;
The rationale is that existing linguistic and statistical tools should &lt;br /&gt;
be reused and made easy to use in the context of a TEI corpus. &lt;br /&gt;
Thus CorpusReader is a bridge between linguistic and statistical &lt;br /&gt;
tools for creating and exploring complex empirical data. &lt;br /&gt;
&lt;br /&gt;
= Technical features =&lt;br /&gt;
&lt;br /&gt;
CR is written in Java and works with Java 1.4 and later. It relies on &lt;br /&gt;
several external high-quality open-source libraries.&lt;br /&gt;
&lt;br /&gt;
It is run from the command line only.&lt;br /&gt;
&lt;br /&gt;
It is released under the BSD licence.&lt;br /&gt;
&lt;br /&gt;
There is a web site: http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
&lt;br /&gt;
= Properties =&lt;br /&gt;
 &lt;br /&gt;
== Functions are filters ==&lt;br /&gt;
&lt;br /&gt;
CR relies heavily on the SAX API. The &amp;quot;functions&amp;quot; of the program &lt;br /&gt;
are implemented as SAX filters. The program is mainly a collection of &lt;br /&gt;
SAX filters and a mechanism for plugging the filters into a pipeline.&lt;br /&gt;
 &lt;br /&gt;
Each filter is specialized in a precise task and takes few arguments. &lt;br /&gt;
This allows modularity and reusability. While each filter performs a &lt;br /&gt;
simple task, the pipeline of filters may achieve complex tasks.&lt;br /&gt;
 &lt;br /&gt;
The names of the filters (and, for some of them, their arguments), &lt;br /&gt;
defining the pipeline, are given to the program through a file &lt;br /&gt;
(an XML document, of course) called the &amp;quot;query document&amp;quot;. The URL of &lt;br /&gt;
this document is given to the program as a command-line argument.&lt;br /&gt;
 &lt;br /&gt;
A query document looks like:&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;corpus.xml&amp;quot;&lt;br /&gt;
          outURI=&amp;quot;sample-manuel-1.out&amp;quot;&lt;br /&gt;
          /&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;myFilterName&amp;quot; javaClass=&amp;quot;java.class.qualified.Name&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  Argument subtree, passed to the filter if any, after&lt;br /&gt;
              validation if a schema is known for this filter.&lt;br /&gt;
        --&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
    &amp;lt;!--  etc.: as many filters as needed  --&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
In some cases, filters can communicate directly with each other &lt;br /&gt;
(in addition to communicating through the stream of SAX events).&lt;br /&gt;
&lt;br /&gt;
== Dealing with intersecting hierarchies ==&lt;br /&gt;
&lt;br /&gt;
[TODO]&lt;br /&gt;
&lt;br /&gt;
== Using high level languages on large documents (XPath, XSLT, XQuery) ==&lt;br /&gt;
&lt;br /&gt;
High-level languages (XSLT and XQuery) are made available: &lt;br /&gt;
the stream of XML events is buffered, transformed, and thrown back &lt;br /&gt;
to the next filter in the pipeline as a stream of XML events.&lt;br /&gt;
 &lt;br /&gt;
When the corpus does not fit in memory, a mechanism allows the &lt;br /&gt;
subtrees to be addressed, buffered, and transformed successively, &lt;br /&gt;
one by one, as separate documents: this is the &lt;br /&gt;
&amp;quot;split&amp;quot; element in the query document. Each &amp;quot;filter&amp;quot; inside &lt;br /&gt;
a &amp;quot;split&amp;quot; element sees the corpus as several documents rooted &lt;br /&gt;
at the split/@localName elements.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split localName=&amp;quot;TEI&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is also a way of addressing elements in the stream &lt;br /&gt;
through an XPath expression evaluated against each element, &lt;br /&gt;
one at a time, as if each were a stand-alone document. For instance, the query document above could be written as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split elxpath=&amp;quot;*[namespace-uri()='http://www.tei-c.org/ns/1.0' &lt;br /&gt;
                      and local-name()='div']&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Merging several documents ==&lt;br /&gt;
&lt;br /&gt;
CR contains a mechanism for merging an external document into an already &lt;br /&gt;
annotated corpus without breaking well-formedness. This is useful for reusing &lt;br /&gt;
the output of existing linguistic annotation tools.&lt;br /&gt;
&lt;br /&gt;
== Using a low level API ==&lt;br /&gt;
&lt;br /&gt;
I tried to make the program stand halfway between a &amp;quot;tool&amp;quot; and an &amp;quot;API&amp;quot; &lt;br /&gt;
(a framework facilitating the use of a low-level API). &lt;br /&gt;
Thus, there are different ways of using it:&lt;br /&gt;
&lt;br /&gt;
* it provides directly usable functions for creating KWIC concordances, extracting subcorpora, computing co-occurrence matrices, merging markup, etc.&lt;br /&gt;
* it may be used as a way of applying XSLT / XQuery to big corpora&lt;br /&gt;
* it may be used for plugging in any SAX filter; it reduces the complexity of SAX by letting the user write only the SAX handler code, while the program manages the parser, the pipeline, and the serialisation of the pipeline output back to disk. Any class implementing the XMLFilter interface may be plugged into the pipeline by providing the qualified name of the Java class in the query document (this class must be found on the CLASSPATH).&lt;br /&gt;
* it may be used for prototyping Java code by embedding, in the pipeline, Java code defining a SAX filter; see http://panini.u-paris10.fr/~sloiseau/CR/filtres/Script.html or http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#embeddedJava&lt;br /&gt;
&lt;br /&gt;
The goal was to make a low-level API easy to use. &amp;quot;CorpusReader&amp;quot; &lt;br /&gt;
is named after &amp;quot;XMLReader&amp;quot;, the interface for parsing a document &lt;br /&gt;
in the Java SAX API. It is intended to be a &amp;quot;layer&amp;quot; on top of SAX for &lt;br /&gt;
the TEI vocabulary.&lt;br /&gt;
&lt;br /&gt;
== Process all documents in the TEI scheme ==&lt;br /&gt;
&lt;br /&gt;
The program tries to rely on the TEI scheme &lt;br /&gt;
(in fact, it is already strongly associated with P5): &lt;br /&gt;
some structural properties of TEI documents are sometimes needed, &lt;br /&gt;
but I try not to make it rely on a specific TEI customisation. &lt;br /&gt;
(I would like to develop a mechanism for using the &amp;quot;TEI customization&amp;quot; &lt;br /&gt;
document produced by Roma to override the default vocabulary.)&lt;br /&gt;
&lt;br /&gt;
= External Links =&lt;br /&gt;
 &lt;br /&gt;
* The site with documentation (mainly in French) and downloadable archives: http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
* Examples summing up the properties of CR, in approximate English: http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#concordances&lt;br /&gt;
&lt;br /&gt;
[[Category:Tools]]&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2728</id>
		<title>CorpusReader</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2728"/>
		<updated>2006-08-26T16:01:57Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: /* Using high level language on large document (XPath, XSLT, XQuery) */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Summary =&lt;br /&gt;
&lt;br /&gt;
CorpusReader (CR) is a tool for extracting subcorpora and quantitative &lt;br /&gt;
information from arbitrarily large corpora, including corpora containing &lt;br /&gt;
milestoned annotation, in the TEI vocabulary. It also provides functionality &lt;br /&gt;
for merging several XML documents together.&lt;br /&gt;
&lt;br /&gt;
= Background =&lt;br /&gt;
&lt;br /&gt;
CorpusReader was developed for quantitative corpus linguistics. &lt;br /&gt;
Its original goal was to provide a way of extracting quantitative &lt;br /&gt;
information (such as co-occurrence matrices) using all the information &lt;br /&gt;
expressed in the XML infoset.&lt;br /&gt;
 &lt;br /&gt;
As a ''quantitative'' corpus ''linguistics'' tool, CorpusReader does not &lt;br /&gt;
have any ''linguistic'' or ''statistical'' skills of its own... but it &lt;br /&gt;
provides help for:&lt;br /&gt;
&lt;br /&gt;
* importing outputs of existing linguistic tools into a corpus, &lt;br /&gt;
* exporting quantitative data in statistical tool formats.&lt;br /&gt;
&lt;br /&gt;
The rationale is that existing linguistic and statistical tools should &lt;br /&gt;
be reused and made easy to use in the context of a TEI corpus. &lt;br /&gt;
Thus CorpusReader is a bridge between linguistic and statistical &lt;br /&gt;
tools for creating and exploring complex empirical data. &lt;br /&gt;
&lt;br /&gt;
= Technical features =&lt;br /&gt;
&lt;br /&gt;
CR is written in Java and works with Java 1.4 and later. It relies on &lt;br /&gt;
several external high-quality open-source libraries.&lt;br /&gt;
&lt;br /&gt;
It is run from the command line only.&lt;br /&gt;
&lt;br /&gt;
It is released under the BSD licence.&lt;br /&gt;
&lt;br /&gt;
There is a web site: http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
&lt;br /&gt;
= Properties =&lt;br /&gt;
 &lt;br /&gt;
== Functions are filters ==&lt;br /&gt;
&lt;br /&gt;
CR relies heavily on the SAX API. The &amp;quot;functions&amp;quot; of the program &lt;br /&gt;
are implemented as SAX filters. The program is mainly a collection of &lt;br /&gt;
SAX filters and a mechanism for plugging the filters into a pipeline.&lt;br /&gt;
 &lt;br /&gt;
Each filter is specialized in a precise task and takes few arguments. &lt;br /&gt;
This allows modularity and reusability. While each filter performs a &lt;br /&gt;
simple task, the pipeline of filters may achieve complex tasks.&lt;br /&gt;
 &lt;br /&gt;
The names of the filters (and, for some of them, their arguments), &lt;br /&gt;
defining the pipeline, are given to the program through a file &lt;br /&gt;
(an XML document, of course) called the &amp;quot;query document&amp;quot;. The URL of &lt;br /&gt;
this document is given to the program as a command-line argument.&lt;br /&gt;
 &lt;br /&gt;
A query document looks like:&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;corpus.xml&amp;quot;&lt;br /&gt;
          outURI=&amp;quot;sample-manuel-1.out&amp;quot;&lt;br /&gt;
          /&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;myFilterName&amp;quot; javaClass=&amp;quot;java.class.qualified.Name&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  Argument subtree, passed to the filter if any, after&lt;br /&gt;
              validation if a schema is known for this filter.&lt;br /&gt;
        --&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
    &amp;lt;!--  etc.: as many filters as needed  --&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
In some cases, filters can communicate directly with each other &lt;br /&gt;
(in addition to communicating through the stream of SAX events).&lt;br /&gt;
&lt;br /&gt;
== Dealing with intersecting hierarchies ==&lt;br /&gt;
&lt;br /&gt;
[TODO]&lt;br /&gt;
&lt;br /&gt;
== Using high level languages on large documents (XPath, XSLT, XQuery) ==&lt;br /&gt;
&lt;br /&gt;
High-level languages (XSLT and XQuery) are made available: &lt;br /&gt;
the stream of XML events is buffered, transformed, and thrown back &lt;br /&gt;
to the next filter in the pipeline as a stream of XML events.&lt;br /&gt;
 &lt;br /&gt;
When the corpus does not fit in memory, a mechanism allows the &lt;br /&gt;
subtrees to be addressed, buffered, and transformed successively, &lt;br /&gt;
one by one, as separate documents: this is the &lt;br /&gt;
&amp;quot;split&amp;quot; element in the query document. Each &amp;quot;filter&amp;quot; inside &lt;br /&gt;
a &amp;quot;split&amp;quot; element sees the corpus as several documents rooted &lt;br /&gt;
at the split/@localName elements.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split localName=&amp;quot;TEI&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is also a way of addressing elements in the stream &lt;br /&gt;
through an XPath expression evaluated against each element, &lt;br /&gt;
one at a time, as if each were a stand-alone document. For instance, the query document above could be written as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split elxpath=&amp;quot;*[namespace-uri()='http://www.tei-c.org/ns/1.0' &lt;br /&gt;
                      and local-name()='div']&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Merging several documents ==&lt;br /&gt;
&lt;br /&gt;
CR contains a mechanism for merging an external document into an already &lt;br /&gt;
annotated corpus without breaking well-formedness. It is useful for reusing &lt;br /&gt;
the output of existing linguistic annotation tools.&lt;br /&gt;
&lt;br /&gt;
== Using a low level API ==&lt;br /&gt;
&lt;br /&gt;
I tried to make the program stand between a &amp;quot;tool&amp;quot; and an &amp;quot;API&amp;quot; &lt;br /&gt;
(a framework facilitating the use of a low-level API). &lt;br /&gt;
Thus, there are different ways of using it:&lt;br /&gt;
&lt;br /&gt;
* provides directly usable functions for creating KWIC concordances, extracting subcorpora, computing co-occurrence matrices, merging markup, etc.&lt;br /&gt;
* may be used as a way of applying XSLT / XQuery to big corpora&lt;br /&gt;
* may be used for plugging in any SAX filter, reducing the complexity of SAX: the user writes only the SAX handler code, while the program manages the parser, the pipeline, and the serialisation of the output of the pipeline back to disk. Any class implementing the XMLFilter interface may be plugged into the pipeline by providing the qualified name of the Java class in the query document (the class must be found on the CLASSPATH). &lt;br /&gt;
* may be used for prototyping Java code, by embedding in the pipeline Java code defining a SAX filter; see http://panini.u-paris10.fr/~sloiseau/CR/filtres/Script.html or http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#embeddedJava&lt;br /&gt;
&lt;br /&gt;
The goal was to make a low-level API easy to use. &amp;quot;CorpusReader&amp;quot; &lt;br /&gt;
is named after &amp;quot;XMLReader&amp;quot;, the name of the class that parses a document &lt;br /&gt;
in the Java SAX API. It is intended to be a &amp;quot;layer&amp;quot; on top of SAX for &lt;br /&gt;
the TEI vocabulary.&lt;br /&gt;
&lt;br /&gt;
== Processing all documents in the TEI scheme ==&lt;br /&gt;
&lt;br /&gt;
The program tries to rely on the TEI scheme &lt;br /&gt;
(in fact it is already strongly associated with P5): &lt;br /&gt;
some structural properties of TEI documents are sometimes needed, &lt;br /&gt;
but I try not to make it rely on a specific TEI customisation. &lt;br /&gt;
(I would like to develop a mechanism for using the &amp;quot;TEI customization&amp;quot; &lt;br /&gt;
document produced by Roma to override the default vocabulary.)&lt;br /&gt;
 &lt;br /&gt;
= External Links =&lt;br /&gt;
 &lt;br /&gt;
* The site with documentation (mainly in French) and downloadable archives: http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
* Examples summing up the properties of CR, in a sort of English: http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#concordances&lt;br /&gt;
&lt;br /&gt;
[[Category:Tools]]&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2727</id>
		<title>CorpusReader</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2727"/>
		<updated>2006-08-26T16:00:44Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Summary =&lt;br /&gt;
&lt;br /&gt;
CorpusReader (CR) is a tool for extracting subcorpora and quantitative &lt;br /&gt;
information from arbitrarily large corpora, and from corpora containing &lt;br /&gt;
milestoned annotation, in the TEI vocabulary. It also provides functionality &lt;br /&gt;
for merging several XML documents together.&lt;br /&gt;
&lt;br /&gt;
= Background =&lt;br /&gt;
&lt;br /&gt;
CorpusReader was developed for quantitative corpus linguistics. &lt;br /&gt;
Its original goal was to provide a way of extracting quantitative &lt;br /&gt;
information (such as co-occurrence matrices) using all the information &lt;br /&gt;
expressed in the XML infoset.&lt;br /&gt;
 &lt;br /&gt;
As a ''quantitative'' corpus ''linguistic'' tool, CorpusReader does not &lt;br /&gt;
have any ''linguistic'' or ''statistical'' skills... but it provides help &lt;br /&gt;
for:&lt;br /&gt;
&lt;br /&gt;
* importing outputs of existing linguistic tools into a corpus, &lt;br /&gt;
* exporting quantitative data in statistical tool formats.&lt;br /&gt;
&lt;br /&gt;
The rationale is that existing linguistic and statistical tools should &lt;br /&gt;
be reused and made easy to use in the context of a TEI corpus. &lt;br /&gt;
Thus CorpusReader is a bridge between linguistic and statistical &lt;br /&gt;
tools for creating and exploring empirically &lt;br /&gt;
complex data. &lt;br /&gt;
&lt;br /&gt;
= Technical features =&lt;br /&gt;
&lt;br /&gt;
CR is written in Java and works with Java 1.4 and later. It relies on &lt;br /&gt;
numerous external high-quality open-source libraries.&lt;br /&gt;
&lt;br /&gt;
It is run only at the command line.&lt;br /&gt;
&lt;br /&gt;
It is released under the BSD licence.&lt;br /&gt;
&lt;br /&gt;
There is a web site: http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
&lt;br /&gt;
= Properties =&lt;br /&gt;
 &lt;br /&gt;
== Functions are filters ==&lt;br /&gt;
&lt;br /&gt;
CR relies heavily on the SAX API. The &amp;quot;functions&amp;quot; of the program &lt;br /&gt;
are implemented as SAX filters. The program is mainly a collection of &lt;br /&gt;
SAX filters and a mechanism for plugging the filters into a pipeline.&lt;br /&gt;
 &lt;br /&gt;
Each filter is specialized in a precise task and takes few arguments. &lt;br /&gt;
This allows modularity and reusability. While each filter performs a &lt;br /&gt;
simple task, a pipeline of filters may achieve complex tasks.&lt;br /&gt;
 &lt;br /&gt;
The names of the filters (and, for some of them, their arguments), &lt;br /&gt;
which define the pipeline, are given to the program through a file &lt;br /&gt;
(an XML document, of course) called the &amp;quot;query document&amp;quot;. The URL of &lt;br /&gt;
this document is given to the program as an argument on the &lt;br /&gt;
command line.&lt;br /&gt;
 &lt;br /&gt;
A query document looks like:&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;corpus.xml&amp;quot;&lt;br /&gt;
          outURI=&amp;quot;sample-manuel-1.out&amp;quot;&lt;br /&gt;
          /&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;myFilterName&amp;quot; javaClass=&amp;quot;java.class.qualified.Name&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  Argument subtree, passed to the filter if any, after&lt;br /&gt;
              validation if a schema is known for this filter.&lt;br /&gt;
        --&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
    &amp;lt;!--  etc.: as many filters as needed  --&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
In some cases, filters can communicate directly with each other &lt;br /&gt;
(in addition to communicating through the stream of SAX events).&lt;br /&gt;
&lt;br /&gt;
== Dealing with intersecting hierarchies ==&lt;br /&gt;
&lt;br /&gt;
[TODO]&lt;br /&gt;
&lt;br /&gt;
== Using high-level languages on large documents (XPath, XSLT, XQuery) ==&lt;br /&gt;
&lt;br /&gt;
High-level languages (XSLT and XQuery) are made available: &lt;br /&gt;
the stream of XML events is buffered, transformed, and sent back &lt;br /&gt;
to the next filter in the pipeline as a stream of XML events.&lt;br /&gt;
 &lt;br /&gt;
When the corpus does not fit in memory, a mechanism allows &lt;br /&gt;
the subtrees to be buffered and transformed successively, &lt;br /&gt;
one by one, as separate documents: this is the &amp;quot;split&amp;quot; &lt;br /&gt;
element in the query document. Each &amp;quot;filter&amp;quot; inside &lt;br /&gt;
a &amp;quot;split&amp;quot; element sees the corpus as several documents rooted &lt;br /&gt;
at the split/@localName elements.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split localName=&amp;quot;TEI&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is also a way of addressing elements in the stream &lt;br /&gt;
through an XPath expression evaluated against each element, &lt;br /&gt;
one at a time, as if each were a stand-alone document. For instance, the query document above could be written as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split elxpath=&amp;quot;*[namespace-uri()='http://www.tei-c.org/ns/1.0' &lt;br /&gt;
                      and local-name()='div']&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Merging several documents ==&lt;br /&gt;
&lt;br /&gt;
CR contains a mechanism for merging an external document into an already &lt;br /&gt;
annotated corpus without breaking well-formedness. It is useful for reusing &lt;br /&gt;
the output of existing linguistic annotation tools.&lt;br /&gt;
&lt;br /&gt;
== Using a low level API ==&lt;br /&gt;
&lt;br /&gt;
I tried to make the program stand between a &amp;quot;tool&amp;quot; and an &amp;quot;API&amp;quot; &lt;br /&gt;
(a framework facilitating the use of a low-level API). &lt;br /&gt;
Thus, there are different ways of using it:&lt;br /&gt;
&lt;br /&gt;
* provides directly usable functions for creating KWIC concordances, extracting subcorpora, computing co-occurrence matrices, merging markup, etc.&lt;br /&gt;
* may be used as a way of applying XSLT / XQuery to big corpora&lt;br /&gt;
* may be used for plugging in any SAX filter, reducing the complexity of SAX: the user writes only the SAX handler code, while the program manages the parser, the pipeline, and the serialisation of the output of the pipeline back to disk. Any class implementing the XMLFilter interface may be plugged into the pipeline by providing the qualified name of the Java class in the query document (the class must be found on the CLASSPATH). &lt;br /&gt;
* may be used for prototyping Java code, by embedding in the pipeline Java code defining a SAX filter; see http://panini.u-paris10.fr/~sloiseau/CR/filtres/Script.html or http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#embeddedJava&lt;br /&gt;
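The filter idea above can be sketched roughly as follows. This is a minimal illustration in Python rather than Java, and the class name, namespace handling, and counting behaviour are assumptions for the sake of the example, not part of the CorpusReader API: a single-purpose SAX content handler that counts TEI "w" (word) elements, the kind of small unit a CR filter encapsulates.

```python
import xml.sax.handler

class WordCounter(xml.sax.handler.ContentHandler):
    """A single-purpose SAX handler (hypothetical example): counts start
    tags whose local name is "w" in the TEI namespace."""

    TEI_NS = "http://www.tei-c.org/ns/1.0"

    def __init__(self):
        super().__init__()
        self.count = 0

    def startElementNS(self, name, qname, attrs):
        # name is a (namespace URI, local name) pair when the reader
        # has namespace processing enabled.
        uri, localname = name
        if uri == self.TEI_NS and localname == "w":
            self.count += 1
```

Chained behind an XMLReader with namespace processing enabled, such a handler sees the same stream of events as any other stage of the pipeline; what CR adds around such small pieces of code is the management of the parser, the chaining of filters, and the serialisation of the result.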
&lt;br /&gt;
The goal was to make a low-level API easy to use. &amp;quot;CorpusReader&amp;quot; &lt;br /&gt;
is named after &amp;quot;XMLReader&amp;quot;, the name of the class that parses a document &lt;br /&gt;
in the Java SAX API. It is intended to be a &amp;quot;layer&amp;quot; on top of SAX for &lt;br /&gt;
the TEI vocabulary.&lt;br /&gt;
&lt;br /&gt;
== Processing all documents in the TEI scheme ==&lt;br /&gt;
&lt;br /&gt;
The program tries to rely on the TEI scheme &lt;br /&gt;
(in fact it is already strongly associated with P5): &lt;br /&gt;
some structural properties of TEI documents are sometimes needed, &lt;br /&gt;
but I try not to make it rely on a specific TEI customisation. &lt;br /&gt;
(I would like to develop a mechanism for using the &amp;quot;TEI customization&amp;quot; &lt;br /&gt;
document produced by Roma to override the default vocabulary.)&lt;br /&gt;
 &lt;br /&gt;
= External Links =&lt;br /&gt;
 &lt;br /&gt;
* The site with documentation (mainly in French) and downloadable archives: http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
* Examples summing up the properties of CR, in a sort of English: http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#concordances&lt;br /&gt;
&lt;br /&gt;
[[Category:Tools]]&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2726</id>
		<title>CorpusReader</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2726"/>
		<updated>2006-08-26T16:00:00Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: /* External Links */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Summary =&lt;br /&gt;
&lt;br /&gt;
CorpusReader (CR) is a tool for extracting subcorpora and quantitative &lt;br /&gt;
information from arbitrarily large corpora and from corpora containing &lt;br /&gt;
milestoned annotation. It also provides functionality for merging &lt;br /&gt;
several XML documents together.&lt;br /&gt;
&lt;br /&gt;
= Background =&lt;br /&gt;
&lt;br /&gt;
CorpusReader was developed for quantitative corpus linguistics. &lt;br /&gt;
Its original goal was to provide a way of extracting quantitative &lt;br /&gt;
information (such as co-occurrence matrices) using all the information &lt;br /&gt;
expressed in the XML infoset.&lt;br /&gt;
 &lt;br /&gt;
As a ''quantitative'' corpus ''linguistic'' tool, CorpusReader does not &lt;br /&gt;
have any ''linguistic'' or ''statistical'' skills... but it provides help &lt;br /&gt;
for:&lt;br /&gt;
&lt;br /&gt;
* importing outputs of existing linguistic tools into a corpus, &lt;br /&gt;
* exporting quantitative data in statistical tool formats.&lt;br /&gt;
&lt;br /&gt;
The rationale is that existing linguistic and statistical tools should &lt;br /&gt;
be reused and made easy to use in the context of a TEI corpus. &lt;br /&gt;
Thus CorpusReader is a bridge between linguistic and statistical &lt;br /&gt;
tools for creating and exploring empirically &lt;br /&gt;
complex data. &lt;br /&gt;
&lt;br /&gt;
= Technical features =&lt;br /&gt;
&lt;br /&gt;
CR is written in Java and works with Java 1.4 and later. It relies on &lt;br /&gt;
numerous external high-quality open-source libraries.&lt;br /&gt;
&lt;br /&gt;
It is run only at the command line.&lt;br /&gt;
&lt;br /&gt;
It is released under the BSD licence.&lt;br /&gt;
&lt;br /&gt;
There is a web site: http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
&lt;br /&gt;
= Properties =&lt;br /&gt;
 &lt;br /&gt;
== Functions are filters ==&lt;br /&gt;
&lt;br /&gt;
CR relies heavily on the SAX API. The &amp;quot;functions&amp;quot; of the program &lt;br /&gt;
are implemented as SAX filters. The program is mainly a collection of &lt;br /&gt;
SAX filters and a mechanism for plugging the filters into a pipeline.&lt;br /&gt;
 &lt;br /&gt;
Each filter is specialized in a precise task and takes few arguments. &lt;br /&gt;
This allows modularity and reusability. While each filter performs a &lt;br /&gt;
simple task, a pipeline of filters may achieve complex tasks.&lt;br /&gt;
 &lt;br /&gt;
The names of the filters (and, for some of them, their arguments), &lt;br /&gt;
which define the pipeline, are given to the program through a file &lt;br /&gt;
(an XML document, of course) called the &amp;quot;query document&amp;quot;. The URL of &lt;br /&gt;
this document is given to the program as an argument on the &lt;br /&gt;
command line.&lt;br /&gt;
 &lt;br /&gt;
A query document looks like:&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;corpus.xml&amp;quot;&lt;br /&gt;
          outURI=&amp;quot;sample-manuel-1.out&amp;quot;&lt;br /&gt;
          /&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;myFilterName&amp;quot; javaClass=&amp;quot;java.class.qualified.Name&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  Argument subtree, passed to the filter if any, after&lt;br /&gt;
              validation if a schema is known for this filter.&lt;br /&gt;
        --&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
    &amp;lt;!--  etc.: as many filters as needed  --&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
In some cases, filters can communicate directly with each other &lt;br /&gt;
(in addition to communicating through the stream of SAX events).&lt;br /&gt;
&lt;br /&gt;
== Dealing with intersecting hierarchies ==&lt;br /&gt;
&lt;br /&gt;
[TODO]&lt;br /&gt;
&lt;br /&gt;
== Using high-level languages on large documents (XPath, XSLT, XQuery) ==&lt;br /&gt;
&lt;br /&gt;
High-level languages (XSLT and XQuery) are made available: &lt;br /&gt;
the stream of XML events is buffered, transformed, and sent back &lt;br /&gt;
to the next filter in the pipeline as a stream of XML events.&lt;br /&gt;
 &lt;br /&gt;
When the corpus does not fit in memory, a mechanism allows &lt;br /&gt;
the subtrees to be buffered and transformed successively, &lt;br /&gt;
one by one, as separate documents: this is the &amp;quot;split&amp;quot; &lt;br /&gt;
element in the query document. Each &amp;quot;filter&amp;quot; inside &lt;br /&gt;
a &amp;quot;split&amp;quot; element sees the corpus as several documents rooted &lt;br /&gt;
at the split/@localName elements.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split localName=&amp;quot;TEI&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is also a way of addressing elements in the stream &lt;br /&gt;
through an XPath expression evaluated against each element, &lt;br /&gt;
one at a time, as if each were a stand-alone document. For instance, the query document above could be written as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split elxpath=&amp;quot;*[namespace-uri()='http://www.tei-c.org/ns/1.0' &lt;br /&gt;
                      and local-name()='div']&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Merging several documents ==&lt;br /&gt;
&lt;br /&gt;
CR contains a mechanism for merging an external document into an already &lt;br /&gt;
annotated corpus without breaking well-formedness. It is useful for reusing &lt;br /&gt;
the output of existing linguistic annotation tools.&lt;br /&gt;
&lt;br /&gt;
== Using a low level API ==&lt;br /&gt;
&lt;br /&gt;
I tried to make the program stand between a &amp;quot;tool&amp;quot; and an &amp;quot;API&amp;quot; &lt;br /&gt;
(a framework facilitating the use of a low-level API). &lt;br /&gt;
Thus, there are different ways of using it:&lt;br /&gt;
&lt;br /&gt;
* provides directly usable functions for creating KWIC concordances, extracting subcorpora, computing co-occurrence matrices, merging markup, etc.&lt;br /&gt;
* may be used as a way of applying XSLT / XQuery to big corpora&lt;br /&gt;
* may be used for plugging in any SAX filter, reducing the complexity of SAX: the user writes only the SAX handler code, while the program manages the parser, the pipeline, and the serialisation of the output of the pipeline back to disk. Any class implementing the XMLFilter interface may be plugged into the pipeline by providing the qualified name of the Java class in the query document (the class must be found on the CLASSPATH). &lt;br /&gt;
* may be used for prototyping Java code, by embedding in the pipeline Java code defining a SAX filter; see http://panini.u-paris10.fr/~sloiseau/CR/filtres/Script.html or http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#embeddedJava&lt;br /&gt;
&lt;br /&gt;
The goal was to make a low-level API easy to use. &amp;quot;CorpusReader&amp;quot; &lt;br /&gt;
is named after &amp;quot;XMLReader&amp;quot;, the name of the class that parses a document &lt;br /&gt;
in the Java SAX API. It is intended to be a &amp;quot;layer&amp;quot; on top of SAX for &lt;br /&gt;
the TEI vocabulary.&lt;br /&gt;
&lt;br /&gt;
== Processing all documents in the TEI scheme ==&lt;br /&gt;
&lt;br /&gt;
The program tries to rely on the TEI scheme &lt;br /&gt;
(in fact it is already strongly associated with P5): &lt;br /&gt;
some structural properties of TEI documents are sometimes needed, &lt;br /&gt;
but I try not to make it rely on a specific TEI customisation. &lt;br /&gt;
(I would like to develop a mechanism for using the &amp;quot;TEI customization&amp;quot; &lt;br /&gt;
document produced by Roma to override the default vocabulary.)&lt;br /&gt;
 &lt;br /&gt;
= External Links =&lt;br /&gt;
 &lt;br /&gt;
* The site with documentation (mainly in French) and downloadable archives: http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
* Examples summing up the properties of CR, in a sort of English: http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#concordances&lt;br /&gt;
&lt;br /&gt;
[[Category:Tools]]&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2725</id>
		<title>CorpusReader</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2725"/>
		<updated>2006-08-26T15:59:10Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: /* Using a low level API */&lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Summary =&lt;br /&gt;
&lt;br /&gt;
CorpusReader (CR) is a tool for extracting subcorpora and quantitative &lt;br /&gt;
information from arbitrarily large corpora and from corpora containing &lt;br /&gt;
milestoned annotation. It also provides functionality for merging &lt;br /&gt;
several XML documents together.&lt;br /&gt;
&lt;br /&gt;
= Background =&lt;br /&gt;
&lt;br /&gt;
CorpusReader was developed for quantitative corpus linguistics. &lt;br /&gt;
Its original goal was to provide a way of extracting quantitative &lt;br /&gt;
information (such as co-occurrence matrices) using all the information &lt;br /&gt;
expressed in the XML infoset.&lt;br /&gt;
 &lt;br /&gt;
As a ''quantitative'' corpus ''linguistic'' tool, CorpusReader does not &lt;br /&gt;
have any ''linguistic'' or ''statistical'' skills... but it provides help &lt;br /&gt;
for:&lt;br /&gt;
&lt;br /&gt;
* importing outputs of existing linguistic tools into a corpus, &lt;br /&gt;
* exporting quantitative data in statistical tool formats.&lt;br /&gt;
&lt;br /&gt;
The rationale is that existing linguistic and statistical tools should &lt;br /&gt;
be reused and made easy to use in the context of a TEI corpus. &lt;br /&gt;
Thus CorpusReader is a bridge between linguistic and statistical &lt;br /&gt;
tools for creating and exploring empirically &lt;br /&gt;
complex data. &lt;br /&gt;
&lt;br /&gt;
= Technical features =&lt;br /&gt;
&lt;br /&gt;
CR is written in Java and works with Java 1.4 and later. It relies on &lt;br /&gt;
numerous external high-quality open-source libraries.&lt;br /&gt;
&lt;br /&gt;
It is run only at the command line.&lt;br /&gt;
&lt;br /&gt;
It is released under the BSD licence.&lt;br /&gt;
&lt;br /&gt;
There is a web site: http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
&lt;br /&gt;
= Properties =&lt;br /&gt;
 &lt;br /&gt;
== Functions are filters ==&lt;br /&gt;
&lt;br /&gt;
CR relies heavily on the SAX API. The &amp;quot;functions&amp;quot; of the program &lt;br /&gt;
are implemented as SAX filters. The program is mainly a collection of &lt;br /&gt;
SAX filters and a mechanism for plugging the filters into a pipeline.&lt;br /&gt;
 &lt;br /&gt;
Each filter is specialized in a precise task and takes few arguments. &lt;br /&gt;
This allows modularity and reusability. While each filter performs a &lt;br /&gt;
simple task, a pipeline of filters may achieve complex tasks.&lt;br /&gt;
 &lt;br /&gt;
The names of the filters (and, for some of them, their arguments), &lt;br /&gt;
which define the pipeline, are given to the program through a file &lt;br /&gt;
(an XML document, of course) called the &amp;quot;query document&amp;quot;. The URL of &lt;br /&gt;
this document is given to the program as an argument on the &lt;br /&gt;
command line.&lt;br /&gt;
 &lt;br /&gt;
A query document looks like:&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;corpus.xml&amp;quot;&lt;br /&gt;
          outURI=&amp;quot;sample-manuel-1.out&amp;quot;&lt;br /&gt;
          /&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;myFilterName&amp;quot; javaClass=&amp;quot;java.class.qualified.Name&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  Argument subtree, passed to the filter if any, after&lt;br /&gt;
              validation if a schema is known for this filter.&lt;br /&gt;
        --&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
    &amp;lt;!--  etc.: as many filters as needed  --&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
In some cases, filters can communicate directly with each other &lt;br /&gt;
(in addition to communicating through the stream of SAX events).&lt;br /&gt;
&lt;br /&gt;
== Dealing with intersecting hierarchies ==&lt;br /&gt;
&lt;br /&gt;
[TODO]&lt;br /&gt;
&lt;br /&gt;
== Using high-level languages on large documents (XPath, XSLT, XQuery) ==&lt;br /&gt;
&lt;br /&gt;
High-level languages (XSLT and XQuery) are made available: &lt;br /&gt;
the stream of XML events is buffered, transformed, and sent back &lt;br /&gt;
to the next filter in the pipeline as a stream of XML events.&lt;br /&gt;
 &lt;br /&gt;
When the corpus does not fit in memory, a mechanism allows &lt;br /&gt;
the subtrees to be buffered and transformed successively, &lt;br /&gt;
one by one, as separate documents: this is the &amp;quot;split&amp;quot; &lt;br /&gt;
element in the query document. Each &amp;quot;filter&amp;quot; inside &lt;br /&gt;
a &amp;quot;split&amp;quot; element sees the corpus as several documents rooted &lt;br /&gt;
at the split/@localName elements.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split localName=&amp;quot;TEI&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is also a way of addressing elements in the stream &lt;br /&gt;
through an XPath expression evaluated against each element, &lt;br /&gt;
one at a time, as if each were a stand-alone document. For instance, the query document above could be written as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split elxpath=&amp;quot;*[namespace-uri()='http://www.tei-c.org/ns/1.0' &lt;br /&gt;
                      and local-name()='div']&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Merging several documents ==&lt;br /&gt;
&lt;br /&gt;
CR contains a mechanism for merging an external document into an already &lt;br /&gt;
annotated corpus without breaking well-formedness. This is useful for reusing &lt;br /&gt;
the output of existing linguistic annotation tools.&lt;br /&gt;
&lt;br /&gt;
== Using a low-level API ==&lt;br /&gt;
&lt;br /&gt;
I tried to make the program stand between a &amp;quot;tool&amp;quot; and an &amp;quot;API&amp;quot; &lt;br /&gt;
(a framework facilitating the use of a low-level API). &lt;br /&gt;
Thus, there are different ways of using it:&lt;br /&gt;
&lt;br /&gt;
* it provides directly usable functions for creating KWIC concordances, extracting subcorpora, computing co-occurrence matrices, merging markup, etc.&lt;br /&gt;
* it may be used as a way of applying XSLT / XQuery to big corpora&lt;br /&gt;
* it may be used for plugging in any SAX filters, reducing the complexity of SAX: the user writes only the SAX handler code, while the program manages the parser, the pipeline, and the serialisation of the pipeline's output back to disk. Any class implementing the XMLFilter interface may be plugged into the pipeline by giving the qualified name of the Java class in the query document (this class must be found on the CLASSPATH). &lt;br /&gt;
* it may be used for prototyping Java code by embedding Java code defining a SAX filter in the pipeline; see http://panini.u-paris10.fr/~sloiseau/CR/filtres/Script.html or http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#embeddedJava&lt;br /&gt;
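&lt;br /&gt;
The &amp;quot;handler code only&amp;quot; idea in the list above can be illustrated with plain JAXP. This is a hypothetical sketch, not CR's API; the class and filter names are made up. The user supplies one XMLFilter, and standard machinery does the parsing and the serialisation of the pipeline's output:&lt;br /&gt;

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;
import org.xml.sax.InputSource;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical sketch: only the filter below is user code; the JAXP
// identity transformer parses the input through the filter and
// serialises the resulting event stream back to text.
public class SerializeSketch {

    // Toy user filter: drop all character content, keep the markup.
    static class DropText extends XMLFilterImpl {
        @Override
        public void characters(char[] ch, int start, int length) {
            // swallow the event instead of forwarding it
        }
    }

    public static String run(String xml) {
        try {
            XMLReader parser = SAXParserFactory.newInstance()
                    .newSAXParser().getXMLReader();
            DropText filter = new DropText();
            filter.setParent(parser); // head of the pipeline
            Transformer t = TransformerFactory.newInstance().newTransformer();
            t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            // SAXSource drives the filter; StreamResult serialises its output.
            t.transform(new SAXSource(filter,
                    new InputSource(new StringReader(xml))),
                    new StreamResult(out));
            return out.toString();
        } catch (Exception e) {
            return "error";
        }
    }

    public static void main(String[] args) {
        System.out.println(run("<p>hello <hi>world</hi></p>"));
    }
}
```

The design point is the same as CR's: the filter author never touches the parser or the serialiser, only the event-handling methods.&lt;br /&gt;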
&lt;br /&gt;
The goal was to make a low-level API easy to use. &amp;quot;CorpusReader&amp;quot; &lt;br /&gt;
is named after &amp;quot;XMLReader&amp;quot;, the name of the class that parses a document &lt;br /&gt;
in the Java SAX API. It is intended to be a &amp;quot;layer&amp;quot; over SAX for &lt;br /&gt;
the TEI vocabulary.&lt;br /&gt;
&lt;br /&gt;
== Processing all documents in the TEI scheme ==&lt;br /&gt;
&lt;br /&gt;
The program tries to rely on the TEI scheme &lt;br /&gt;
(in fact, it is already strongly associated with P5): &lt;br /&gt;
some structural properties of TEI documents are sometimes needed, &lt;br /&gt;
but I try not to make it rely on a specific TEI customisation. &lt;br /&gt;
(I would like to develop a mechanism for using the &amp;quot;TEI customization&amp;quot; &lt;br /&gt;
document produced by Roma to override the default vocabulary.)&lt;br /&gt;
 &lt;br /&gt;
= External Links =&lt;br /&gt;
 &lt;br /&gt;
* The site (mainly in French): http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
* Examples summing up the properties of CR, in approximate English: http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#concordances&lt;br /&gt;
&lt;br /&gt;
[[Category:Tools]]&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2724</id>
		<title>CorpusReader</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2724"/>
		<updated>2006-08-26T15:58:37Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Summary =&lt;br /&gt;
&lt;br /&gt;
CorpusReader (CR) is a tool for extracting subcorpora and quantitative &lt;br /&gt;
information from arbitrarily large corpora and corpora containing &lt;br /&gt;
milestoned annotation. It also provides functionality for merging &lt;br /&gt;
several XML documents.&lt;br /&gt;
&lt;br /&gt;
= Background =&lt;br /&gt;
&lt;br /&gt;
CorpusReader was developed for quantitative corpus linguistics. &lt;br /&gt;
Its original goal was to provide a way of extracting quantitative &lt;br /&gt;
information (such as co-occurrence matrices) using all the information &lt;br /&gt;
expressed in the XML infoset.&lt;br /&gt;
 &lt;br /&gt;
As a ''quantitative'' corpus ''linguistic'' tool, CorpusReader does not &lt;br /&gt;
have any ''linguistic'' or ''statistical'' skills itself, but provides help &lt;br /&gt;
for:&lt;br /&gt;
&lt;br /&gt;
* importing outputs of existing linguistic tools into a corpus, &lt;br /&gt;
* exporting quantitative data in statistical tool formats.&lt;br /&gt;
&lt;br /&gt;
The rationale is that existing linguistic and statistical tools should &lt;br /&gt;
be reused and made easy to use in the context of a TEI corpus. &lt;br /&gt;
CorpusReader is thus a bridge between linguistic and statistical &lt;br /&gt;
tools for creating and empirically exploring &lt;br /&gt;
complex data. &lt;br /&gt;
&lt;br /&gt;
= Technical features =&lt;br /&gt;
&lt;br /&gt;
CR is written in Java and works with Java 1.4 and later. It relies on &lt;br /&gt;
numerous external high-quality open-source libraries.&lt;br /&gt;
&lt;br /&gt;
It is run from the command line only.&lt;br /&gt;
&lt;br /&gt;
It is released under the BSD licence.&lt;br /&gt;
&lt;br /&gt;
There is a web site: http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
&lt;br /&gt;
= Properties =&lt;br /&gt;
 &lt;br /&gt;
== Functions are filters ==&lt;br /&gt;
&lt;br /&gt;
CR relies heavily on the SAX API. The &amp;quot;functions&amp;quot; of the program &lt;br /&gt;
are implemented as SAX filters. The program is mainly a collection of &lt;br /&gt;
SAX filters and a mechanism for plugging the filters into a pipeline.&lt;br /&gt;
 &lt;br /&gt;
Each filter is specialized in a precise task and takes few arguments. &lt;br /&gt;
This allows modularity and reusability. While each filter performs a &lt;br /&gt;
simple task, a pipeline of filters may achieve complex tasks.&lt;br /&gt;
 &lt;br /&gt;
The pipeline is defined by the names of the filters (and, for some of &lt;br /&gt;
them, their arguments), given to the program through a file &lt;br /&gt;
(an XML document, of course) called the &amp;quot;query document&amp;quot;. The URL of &lt;br /&gt;
this document is given to the program as a command-line argument.&lt;br /&gt;
 &lt;br /&gt;
A query document looks like:&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;corpus.xml&amp;quot;&lt;br /&gt;
          outURI=&amp;quot;sample-manuel-1.out&amp;quot;&lt;br /&gt;
          /&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;myFilterName&amp;quot; javaClass=&amp;quot;java.class.qualified.Name&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  Argument subtree, passed to the filter if any, after&lt;br /&gt;
              validation if a schema is known for this filter.&lt;br /&gt;
        --&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
    &amp;lt;!--  etc.: as many filters as needed  --&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
In some cases, filters can communicate directly with each other &lt;br /&gt;
(in addition to communicating through the stream of SAX events).&lt;br /&gt;
&lt;br /&gt;
== Dealing with intersecting hierarchies ==&lt;br /&gt;
&lt;br /&gt;
[TODO]&lt;br /&gt;
&lt;br /&gt;
== Using high-level languages on large documents (XPath, XSLT, XQuery) ==&lt;br /&gt;
&lt;br /&gt;
High-level languages (XSLT and XQuery) are made available: &lt;br /&gt;
the stream of XML events is buffered, transformed, and sent back &lt;br /&gt;
to the next filter in the pipeline as a stream of XML events.&lt;br /&gt;
 &lt;br /&gt;
When the corpus does not fit in memory, a mechanism allows &lt;br /&gt;
the subtrees to be addressed, buffered, and transformed &lt;br /&gt;
successively, one by one, as separate documents. This is the &lt;br /&gt;
&amp;quot;split&amp;quot; element in the query document. Each &amp;quot;filter&amp;quot; inside &lt;br /&gt;
a &amp;quot;split&amp;quot; element sees the corpus as several documents rooted &lt;br /&gt;
at the split/@localName elements.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split localName=&amp;quot;TEI&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is also a way of addressing elements in the stream &lt;br /&gt;
through an XPath expression evaluated against each element, &lt;br /&gt;
one at a time, as if each were a stand-alone document. For instance, the query document above could be written as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split elxpath=&amp;quot;*[namespace-uri()='http://www.tei-c.org/ns/1.0' &lt;br /&gt;
                      and local-name()='div']&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Merging several documents ==&lt;br /&gt;
&lt;br /&gt;
CR contains a mechanism for merging an external document into an already &lt;br /&gt;
annotated corpus without breaking well-formedness. This is useful for reusing &lt;br /&gt;
the output of existing linguistic annotation tools.&lt;br /&gt;
&lt;br /&gt;
== Using a low-level API ==&lt;br /&gt;
&lt;br /&gt;
I tried to make the program stand between a &amp;quot;tool&amp;quot; and an &amp;quot;API&amp;quot; &lt;br /&gt;
(a framework facilitating the use of a low-level API). &lt;br /&gt;
Thus, there are different ways of using it:&lt;br /&gt;
&lt;br /&gt;
* it provides directly usable functions for creating KWIC concordances, extracting subcorpora, computing co-occurrence matrices, merging markup, etc.&lt;br /&gt;
* it may be used as a way of applying XSLT / XQuery to big corpora&lt;br /&gt;
* it may be used for plugging in any SAX filters, reducing the complexity of SAX: the user writes only the SAX handler code, while the program manages the parser, the pipeline, and the serialisation of the pipeline's output back to disk. Any class implementing the XMLFilter interface may be plugged into the pipeline by giving the qualified name of the Java class in the query document (this class must be found on the CLASSPATH). &lt;br /&gt;
* it may be used for prototyping Java code by embedding Java code defining a SAX filter in the pipeline: &lt;br /&gt;
http://panini.u-paris10.fr/~sloiseau/CR/filtres/Script.html and http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#embeddedJava&lt;br /&gt;
&lt;br /&gt;
The goal was to make a low-level API easy to use. &amp;quot;CorpusReader&amp;quot; &lt;br /&gt;
is named after &amp;quot;XMLReader&amp;quot;, the name of the class that parses a document &lt;br /&gt;
in the Java SAX API. It is intended to be a &amp;quot;layer&amp;quot; over SAX for &lt;br /&gt;
the TEI vocabulary.&lt;br /&gt;
&lt;br /&gt;
== Processing all documents in the TEI scheme ==&lt;br /&gt;
&lt;br /&gt;
The program tries to rely on the TEI scheme &lt;br /&gt;
(in fact, it is already strongly associated with P5): &lt;br /&gt;
some structural properties of TEI documents are sometimes needed, &lt;br /&gt;
but I try not to make it rely on a specific TEI customisation. &lt;br /&gt;
(I would like to develop a mechanism for using the &amp;quot;TEI customization&amp;quot; &lt;br /&gt;
document produced by Roma to override the default vocabulary.)&lt;br /&gt;
 &lt;br /&gt;
= External Links =&lt;br /&gt;
 &lt;br /&gt;
* The site (mainly in French): http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
* Examples summing up the properties of CR, in approximate English: http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#concordances&lt;br /&gt;
&lt;br /&gt;
[[Category:Tools]]&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2723</id>
		<title>CorpusReader</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2723"/>
		<updated>2006-08-26T15:57:20Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Summary =&lt;br /&gt;
&lt;br /&gt;
CorpusReader (CR) is a tool for extracting subcorpora and quantitative &lt;br /&gt;
information from arbitrarily large corpora and corpora containing &lt;br /&gt;
milestoned annotation. It also provides functionality for merging &lt;br /&gt;
several XML documents.&lt;br /&gt;
&lt;br /&gt;
= Background =&lt;br /&gt;
&lt;br /&gt;
CorpusReader was developed for quantitative corpus linguistics. &lt;br /&gt;
Its original goal was to provide a way of extracting quantitative &lt;br /&gt;
information (such as co-occurrence matrices) using all the information &lt;br /&gt;
expressed in the XML infoset.&lt;br /&gt;
 &lt;br /&gt;
As a ''quantitative'' corpus ''linguistic'' tool, CorpusReader does not &lt;br /&gt;
have any ''linguistic'' or ''statistical'' skills itself, but provides help &lt;br /&gt;
for:&lt;br /&gt;
&lt;br /&gt;
* importing outputs of existing linguistic tools into a corpus, &lt;br /&gt;
* exporting quantitative data in statistical tool formats.&lt;br /&gt;
&lt;br /&gt;
The rationale is that existing linguistic and statistical tools should &lt;br /&gt;
be reused and made easy to use in the context of a TEI corpus. &lt;br /&gt;
CorpusReader is thus a bridge between linguistic and statistical &lt;br /&gt;
tools for creating and empirically exploring &lt;br /&gt;
complex data. &lt;br /&gt;
&lt;br /&gt;
= Technical features =&lt;br /&gt;
&lt;br /&gt;
CR is written in Java and works with Java 1.4 and later. It relies on &lt;br /&gt;
numerous external high-quality open-source libraries.&lt;br /&gt;
&lt;br /&gt;
It is run from the command line only.&lt;br /&gt;
&lt;br /&gt;
It is released under the BSD licence.&lt;br /&gt;
&lt;br /&gt;
There is a web site: http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
&lt;br /&gt;
= Properties =&lt;br /&gt;
 &lt;br /&gt;
== Functions are filters ==&lt;br /&gt;
&lt;br /&gt;
CR relies heavily on the SAX API. The &amp;quot;functions&amp;quot; of the program &lt;br /&gt;
are implemented as SAX filters. The program is mainly a collection of &lt;br /&gt;
SAX filters and a mechanism for plugging the filters into a pipeline.&lt;br /&gt;
 &lt;br /&gt;
Each filter is specialized in a precise task and takes few arguments. &lt;br /&gt;
This allows modularity and reusability. While each filter performs a &lt;br /&gt;
simple task, a pipeline of filters may achieve complex tasks.&lt;br /&gt;
 &lt;br /&gt;
The pipeline is defined by the names of the filters (and, for some of &lt;br /&gt;
them, their arguments), given to the program through a file &lt;br /&gt;
(an XML document, of course) called the &amp;quot;query document&amp;quot;. The URL of &lt;br /&gt;
this document is given to the program as a command-line argument.&lt;br /&gt;
 &lt;br /&gt;
A query document looks like:&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;corpus.xml&amp;quot;&lt;br /&gt;
          outURI=&amp;quot;sample-manuel-1.out&amp;quot;&lt;br /&gt;
          /&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;myFilterName&amp;quot; javaClass=&amp;quot;java.class.qualified.Name&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  Argument subtree, passed to the filter if any, after&lt;br /&gt;
              validation if a schema is known for this filter.&lt;br /&gt;
        --&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
    &amp;lt;!--  etc.: as many filters as needed  --&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
In some cases, filters can communicate directly with each other &lt;br /&gt;
(in addition to communicating through the stream of SAX events).&lt;br /&gt;
&lt;br /&gt;
== Dealing with intersecting hierarchies ==&lt;br /&gt;
&lt;br /&gt;
[TODO]&lt;br /&gt;
&lt;br /&gt;
== Using high-level languages on large documents (XPath, XSLT, XQuery) ==&lt;br /&gt;
&lt;br /&gt;
High-level languages (XSLT and XQuery) are made available: &lt;br /&gt;
the stream of XML events is buffered, transformed, and sent back &lt;br /&gt;
to the next filter in the pipeline as a stream of XML events.&lt;br /&gt;
 &lt;br /&gt;
When the corpus does not fit in memory, a mechanism allows &lt;br /&gt;
the subtrees to be addressed, buffered, and transformed &lt;br /&gt;
successively, one by one, as separate documents. This is the &lt;br /&gt;
&amp;quot;split&amp;quot; element in the query document. Each &amp;quot;filter&amp;quot; inside &lt;br /&gt;
a &amp;quot;split&amp;quot; element sees the corpus as several documents rooted &lt;br /&gt;
at the split/@localName elements.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split localName=&amp;quot;TEI&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is also a way of addressing elements in the stream &lt;br /&gt;
through an XPath expression evaluated against each element, &lt;br /&gt;
one at a time, as if each were a stand-alone document. For instance, the query document above could be written as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split elxpath=&amp;quot;*[namespace-uri()='http://www.tei-c.org/ns/1.0' &lt;br /&gt;
                      and local-name()='div']&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Merging several documents ==&lt;br /&gt;
&lt;br /&gt;
CR contains a mechanism for merging an external document into an already &lt;br /&gt;
annotated corpus without breaking well-formedness. This is useful for reusing &lt;br /&gt;
the output of existing linguistic annotation tools.&lt;br /&gt;
&lt;br /&gt;
== Using a low-level API ==&lt;br /&gt;
&lt;br /&gt;
I tried to make the program stand between a &amp;quot;tool&amp;quot; and an &amp;quot;API&amp;quot; &lt;br /&gt;
(a framework facilitating the use of a low-level API). &lt;br /&gt;
Thus, there are different ways of using it:&lt;br /&gt;
&lt;br /&gt;
* it provides directly usable functions for creating KWIC concordances, extracting subcorpora, computing co-occurrence matrices, merging markup, etc.&lt;br /&gt;
* it may be used as a way of applying XSLT / XQuery to big corpora&lt;br /&gt;
* it may be used for plugging in any SAX filters, reducing the complexity of SAX: the user writes only the SAX handler code, while the program manages the parser, the pipeline, and the serialisation of the pipeline's output back to disk. Any class implementing the XMLFilter interface may be plugged into the pipeline by giving the qualified name of the Java class &lt;br /&gt;
in the query document (this class must be found on the CLASSPATH). &lt;br /&gt;
* it may be used for prototyping Java code by embedding Java code defining a SAX filter in the pipeline: &lt;br /&gt;
http://panini.u-paris10.fr/~sloiseau/CR/filtres/Script.html and http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#embeddedJava&lt;br /&gt;
&lt;br /&gt;
The goal was to make a low-level API easy to use. &amp;quot;CorpusReader&amp;quot; &lt;br /&gt;
is named after &amp;quot;XMLReader&amp;quot;, the name of the class that parses a document &lt;br /&gt;
in the Java SAX API. It is intended to be a &amp;quot;layer&amp;quot; over SAX for &lt;br /&gt;
the TEI vocabulary.&lt;br /&gt;
&lt;br /&gt;
== Processing all documents in the TEI scheme ==&lt;br /&gt;
&lt;br /&gt;
The program tries to rely on the TEI scheme &lt;br /&gt;
(in fact, it is already strongly associated with P5): &lt;br /&gt;
some structural properties of TEI documents are sometimes needed, &lt;br /&gt;
but I try not to make it rely on a specific TEI customisation. &lt;br /&gt;
(I would like to develop a mechanism for using the &amp;quot;TEI customization&amp;quot; &lt;br /&gt;
document produced by Roma to override the default vocabulary.)&lt;br /&gt;
 &lt;br /&gt;
= External Links =&lt;br /&gt;
 &lt;br /&gt;
* The site (mainly in French): http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
* Examples summing up the properties of CR, in approximate English: http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#concordances&lt;br /&gt;
&lt;br /&gt;
[[Category:Tools]]&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2722</id>
		<title>CorpusReader</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2722"/>
		<updated>2006-08-26T15:56:28Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Summary =&lt;br /&gt;
&lt;br /&gt;
CorpusReader (CR) is a tool for extracting subcorpora and quantitative &lt;br /&gt;
information from arbitrarily large corpora and corpora containing &lt;br /&gt;
milestoned annotation. It also provides functionality for merging &lt;br /&gt;
several XML documents.&lt;br /&gt;
&lt;br /&gt;
= Background =&lt;br /&gt;
&lt;br /&gt;
CorpusReader was developed for quantitative corpus linguistics. &lt;br /&gt;
Its original goal was to provide a way of extracting quantitative &lt;br /&gt;
information (such as co-occurrence matrices) using all the information &lt;br /&gt;
expressed in the XML infoset.&lt;br /&gt;
 &lt;br /&gt;
As a ''quantitative'' corpus ''linguistic'' tool, CorpusReader does not &lt;br /&gt;
have any ''linguistic'' or ''statistical'' skills itself, but provides help &lt;br /&gt;
for:&lt;br /&gt;
&lt;br /&gt;
* importing outputs of existing linguistic tools into a corpus, &lt;br /&gt;
* exporting quantitative data in statistical tool formats.&lt;br /&gt;
&lt;br /&gt;
The rationale is that existing linguistic and statistical tools should &lt;br /&gt;
be reused and made easy to use in the context of a TEI corpus. &lt;br /&gt;
CorpusReader is thus a bridge between linguistic and statistical &lt;br /&gt;
tools for creating and empirically exploring &lt;br /&gt;
complex data. &lt;br /&gt;
&lt;br /&gt;
= Technical features =&lt;br /&gt;
&lt;br /&gt;
CR is written in Java and works with Java 1.4 and later. It relies on &lt;br /&gt;
numerous external high-quality open-source libraries.&lt;br /&gt;
&lt;br /&gt;
It is run from the command line only.&lt;br /&gt;
&lt;br /&gt;
It is released under the BSD licence.&lt;br /&gt;
&lt;br /&gt;
There is a web site: http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
&lt;br /&gt;
= Properties =&lt;br /&gt;
 &lt;br /&gt;
== Functions are filters ==&lt;br /&gt;
&lt;br /&gt;
CR relies heavily on the SAX API. The &amp;quot;functions&amp;quot; of the program &lt;br /&gt;
are implemented as SAX filters. The program is mainly a collection of &lt;br /&gt;
SAX filters and a mechanism for plugging the filters into a pipeline.&lt;br /&gt;
 &lt;br /&gt;
Each filter is specialized in a precise task and takes few arguments. &lt;br /&gt;
This allows modularity and reusability. While each filter performs a &lt;br /&gt;
simple task, a pipeline of filters may achieve complex tasks.&lt;br /&gt;
 &lt;br /&gt;
The pipeline is defined by the names of the filters (and, for some of &lt;br /&gt;
them, their arguments), given to the program through a file &lt;br /&gt;
(an XML document, of course) called the &amp;quot;query document&amp;quot;. The URL of &lt;br /&gt;
this document is given to the program as a command-line argument.&lt;br /&gt;
 &lt;br /&gt;
A query document looks like:&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;corpus.xml&amp;quot;&lt;br /&gt;
          outURI=&amp;quot;sample-manuel-1.out&amp;quot;&lt;br /&gt;
          /&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;myFilterName&amp;quot; javaClass=&amp;quot;java.class.qualified.Name&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  Argument subtree, passed to the filter if any, after&lt;br /&gt;
              validation if a schema is known for this filter.&lt;br /&gt;
        --&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
    &amp;lt;!--  etc.: as many filters as needed  --&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
In some cases, filters can communicate directly with each other &lt;br /&gt;
(in addition to communicating through the stream of SAX events).&lt;br /&gt;
&lt;br /&gt;
== Dealing with intersecting hierarchies ==&lt;br /&gt;
&lt;br /&gt;
[TODO]&lt;br /&gt;
&lt;br /&gt;
== Using high-level languages on large documents (XPath, XSLT, XQuery) ==&lt;br /&gt;
&lt;br /&gt;
High-level languages (XSLT and XQuery) are made available: &lt;br /&gt;
the stream of XML events is buffered, transformed, and thrown back &lt;br /&gt;
to the next filter in the pipeline as a stream of XML events.&lt;br /&gt;
 &lt;br /&gt;
When the corpus does not fit in memory, a mechanism allows &lt;br /&gt;
the subtrees to be addressed, buffered, and transformed &lt;br /&gt;
successively, one by one, as separate documents. This is the &lt;br /&gt;
&amp;quot;split&amp;quot; element in the query document. Each &amp;quot;filter&amp;quot; inside &lt;br /&gt;
a &amp;quot;split&amp;quot; element sees the corpus as several documents rooted &lt;br /&gt;
at the split/@localName elements.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split localName=&amp;quot;TEI&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is also a way of addressing elements in the stream &lt;br /&gt;
through an XPath expression evaluated against each element, &lt;br /&gt;
one at a time, as if it were a stand-alone document. For instance, the query document above could be written as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split elxpath=&amp;quot;*[namespace-uri()='http://www.tei-c.org/ns/1.0' &lt;br /&gt;
                      and local-name()='div']&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Merging several documents ==&lt;br /&gt;
&lt;br /&gt;
CR contains a mechanism for merging an external document into an already &lt;br /&gt;
annotated corpus without breaking well-formedness. It is useful for reusing &lt;br /&gt;
the output of existing linguistic annotation tools.&lt;br /&gt;
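&lt;br /&gt;
Such a merge would be configured in the query document like any other filter. The following fragment is only a hypothetical sketch: the filter name, Java class, and argument element shown here are invented for illustration and are not documented CR API.&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;/&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;merge_annotation&amp;quot; javaClass=&amp;quot;hypothetical.MergeFilter&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!-- hypothetical argument: URI of the annotated document to merge in --&amp;gt;&lt;br /&gt;
        &amp;lt;annotation URI=&amp;quot;path/to/annotation&amp;quot;/&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;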
&lt;br /&gt;
== Using a low level API ==&lt;br /&gt;
&lt;br /&gt;
I tried to make the program stand between a &amp;quot;tool&amp;quot; and an &amp;quot;API&amp;quot; &lt;br /&gt;
(a framework facilitating the use of a low-level API). &lt;br /&gt;
Thus, there are different ways of using it:&lt;br /&gt;
&lt;br /&gt;
* it provides directly usable functions for creating KWIC concordances, extracting subcorpora, computing co-occurrence matrices, merging markup, etc.&lt;br /&gt;
* it may be used for applying XSLT / XQuery to big corpora&lt;br /&gt;
* it may be used for plugging in any SAX filter, reducing the complexity of SAX by letting you write only the SAX handler code while the program manages the parser, the pipeline, and the serialisation of the pipeline's output back to disk. Any class implementing the XMLFilter interface may be plugged into the pipeline by giving the qualified name of the Java class &lt;br /&gt;
in the query document (this class must be found on the CLASSPATH). &lt;br /&gt;
* it may be used for prototyping Java code by embedding, in the pipeline, Java code defining a SAX filter: &lt;br /&gt;
http://panini.u-paris10.fr/~sloiseau/CR/filtres/Script.html and http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#embeddedJava&lt;br /&gt;
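&lt;br /&gt;
As a hedged illustration of the kind of class that could be plugged in this way (a sketch, not code shipped with CR), here is a minimal SAX filter in Java: it counts elements with a given local name while forwarding every event unchanged to the next stage of the pipeline. The main method simulates a small stream of SAX events instead of parsing a document.&lt;br /&gt;
&lt;br /&gt;
```java
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical example, not part of CorpusReader: a minimal SAX filter
// that counts elements with a given local name and forwards every event
// downstream, in the spirit of the pipeline described above.
public class ElementCounter extends XMLFilterImpl {
    private final String target;
    private int count = 0;

    public ElementCounter(String target) { this.target = target; }

    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        if (target.equals(localName)) {
            count++;
        }
        // Pass the event on to the next filter or handler, if any.
        super.startElement(uri, localName, qName, atts);
    }

    public int getCount() { return count; }

    // Simulate a small stream of SAX events instead of parsing a document.
    public static void main(String[] args) throws SAXException {
        ElementCounter f = new ElementCounter("w");
        Attributes none = new AttributesImpl();
        f.startElement("", "s", "s", none);
        f.startElement("", "w", "w", none);
        f.startElement("", "w", "w", none);
        System.out.println(f.getCount()); // prints 2
    }
}
```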
&lt;br /&gt;
The goal was to make a low-level API easy to use. &amp;quot;CorpusReader&amp;quot; &lt;br /&gt;
is named after &amp;quot;XMLReader&amp;quot;, the interface for parsing a document &lt;br /&gt;
in the Java SAX API. It is intended to be a &amp;quot;layer&amp;quot; on top of SAX for &lt;br /&gt;
the TEI vocabulary.&lt;br /&gt;
&lt;br /&gt;
== Processing all documents in the TEI scheme ==&lt;br /&gt;
&lt;br /&gt;
The program tries to rely on the TEI scheme &lt;br /&gt;
(in fact it is already strongly associated with P5): &lt;br /&gt;
some structural properties of TEI documents are sometimes needed, &lt;br /&gt;
but I try not to make it rely on a specific TEI customisation. &lt;br /&gt;
(I would like to develop a mechanism for using the &amp;quot;TEI customization&amp;quot; &lt;br /&gt;
document produced by Roma to override the default vocabulary.)&lt;br /&gt;
 &lt;br /&gt;
= External Links =&lt;br /&gt;
 &lt;br /&gt;
* The site (mainly in French): http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
* Examples summing up the properties of CR (in rough English): http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#concordances&lt;br /&gt;
&lt;br /&gt;
[[Category:Tools]]&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2721</id>
		<title>CorpusReader</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2721"/>
		<updated>2006-08-26T15:54:17Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Summary =&lt;br /&gt;
&lt;br /&gt;
CorpusReader (CR) is a tool for extracting subcorpora and quantitative &lt;br /&gt;
information from arbitrarily large corpora and corpora containing &lt;br /&gt;
milestoned annotation. It also provides functionality for merging &lt;br /&gt;
several XML documents.&lt;br /&gt;
&lt;br /&gt;
= Background =&lt;br /&gt;
&lt;br /&gt;
CorpusReader was developed for quantitative corpus linguistics. &lt;br /&gt;
Its original goal was to provide a way of extracting quantitative &lt;br /&gt;
information (such as co-occurrence matrices) using all the information &lt;br /&gt;
expressed in the XML infoset.&lt;br /&gt;
 &lt;br /&gt;
As a ''quantitative'' corpus ''linguistic'' tool, CorpusReader does not &lt;br /&gt;
have any ''linguistic'' or ''statistical'' skills... but it provides help &lt;br /&gt;
for:&lt;br /&gt;
&lt;br /&gt;
* importing outputs of existing linguistic tools into a corpus, &lt;br /&gt;
* exporting quantitative data in statistical tool formats.&lt;br /&gt;
&lt;br /&gt;
The rationale is that existing linguistic and statistical tools should &lt;br /&gt;
be reused and made easy to use in the context of a TEI corpus. &lt;br /&gt;
Thus CorpusReader is a bridge between linguistic and statistical &lt;br /&gt;
tools for creating and exploring complex empirical data. &lt;br /&gt;
&lt;br /&gt;
= Technical features =&lt;br /&gt;
&lt;br /&gt;
CR is written in Java and works with Java 1.4 and later. It relies on &lt;br /&gt;
numerous external high-quality open source libraries.&lt;br /&gt;
&lt;br /&gt;
It is run from the command line only.&lt;br /&gt;
&lt;br /&gt;
It is released under the BSD licence.&lt;br /&gt;
&lt;br /&gt;
There is a web site: http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
&lt;br /&gt;
= Properties =&lt;br /&gt;
 &lt;br /&gt;
== Functions are filters ==&lt;br /&gt;
&lt;br /&gt;
CR relies heavily on the SAX API. The &amp;quot;functions&amp;quot; of the program &lt;br /&gt;
are implemented as SAX filters. The program is mainly a collection of &lt;br /&gt;
SAX filters and a mechanism for plugging the filters into a pipeline.&lt;br /&gt;
 &lt;br /&gt;
Each filter is specialized in a precise task and takes few arguments. &lt;br /&gt;
This allows modularity and reusability. While each filter performs a &lt;br /&gt;
simple task, a pipeline of filters may achieve complex tasks.&lt;br /&gt;
 &lt;br /&gt;
The names of the filters (and, for some of them, their arguments), &lt;br /&gt;
which define the pipeline, are given to the program through a file &lt;br /&gt;
(an XML document, of course) called the &amp;quot;query document&amp;quot;. The URL of &lt;br /&gt;
this document is given as an argument to the program at the &lt;br /&gt;
command line.&lt;br /&gt;
 &lt;br /&gt;
A query document looks like:&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;corpus.xml&amp;quot;&lt;br /&gt;
          outURI=&amp;quot;sample-manuel-1.out&amp;quot;&lt;br /&gt;
          /&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;myFilterName&amp;quot; javaClass=&amp;quot;java.class.qualified.Name&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  Argument subtree, passed to the filter if any, after&lt;br /&gt;
              validation if a schema is known for this filter.&lt;br /&gt;
        --&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
    &amp;lt;!--  etc.: as many filters as needed  --&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
In some cases, filters can communicate directly with each other &lt;br /&gt;
(in addition to communicating through the stream of SAX events).&lt;br /&gt;
&lt;br /&gt;
== Dealing with intersecting hierarchies ==&lt;br /&gt;
&lt;br /&gt;
[TODO]&lt;br /&gt;
&lt;br /&gt;
== Using high-level languages on large documents (XPath, XSLT, XQuery) ==&lt;br /&gt;
&lt;br /&gt;
High-level languages (XSLT and XQuery) are made available: &lt;br /&gt;
the stream of XML events is buffered, transformed, and thrown back &lt;br /&gt;
to the next filter in the pipeline as a stream of XML events.&lt;br /&gt;
 &lt;br /&gt;
When the corpus does not fit in memory, a mechanism allows &lt;br /&gt;
the subtrees to be addressed, buffered, and transformed &lt;br /&gt;
successively, one by one, as separate documents. This is the &lt;br /&gt;
&amp;quot;split&amp;quot; element in the query document:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split localName=&amp;quot;TEI&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is also a way of addressing elements in the stream &lt;br /&gt;
through an XPath expression evaluated against each element, &lt;br /&gt;
one at a time, as if it were a stand-alone document. For instance, the query document above could be written as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split elxpath=&amp;quot;*[namespace-uri()='http://www.tei-c.org/ns/1.0' &lt;br /&gt;
                      and local-name()='div']&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Merging several documents ==&lt;br /&gt;
&lt;br /&gt;
CR contains a mechanism for merging an external document into an already &lt;br /&gt;
annotated corpus without breaking well-formedness. It is useful for reusing &lt;br /&gt;
the output of existing linguistic annotation tools.&lt;br /&gt;
&lt;br /&gt;
== Using a low level API ==&lt;br /&gt;
&lt;br /&gt;
I tried to make the program stand between a &amp;quot;tool&amp;quot; and an &amp;quot;API&amp;quot; &lt;br /&gt;
(a framework facilitating the use of a low-level API). &lt;br /&gt;
Thus, there are different ways of using it:&lt;br /&gt;
&lt;br /&gt;
* it provides directly usable functions for creating KWIC concordances, extracting subcorpora, computing co-occurrence matrices, merging markup, etc.&lt;br /&gt;
* it may be used for applying XSLT / XQuery to big corpora&lt;br /&gt;
* it may be used for plugging in any SAX filter, reducing the complexity of SAX by letting you write only the SAX handler code while the program manages the parser, the pipeline, and the serialisation of the pipeline's output back to disk. Any class implementing the XMLFilter interface may be plugged into the pipeline by giving the qualified name of the Java class &lt;br /&gt;
in the query document (this class must be found on the CLASSPATH). &lt;br /&gt;
* it may be used for prototyping Java code by embedding, in the pipeline, Java code defining a SAX filter: &lt;br /&gt;
http://panini.u-paris10.fr/~sloiseau/CR/filtres/Script.html and http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#embeddedJava&lt;br /&gt;
&lt;br /&gt;
The goal was to make a low-level API easy to use. &amp;quot;CorpusReader&amp;quot; &lt;br /&gt;
is named after &amp;quot;XMLReader&amp;quot;, the interface for parsing a document &lt;br /&gt;
in the Java SAX API. It is intended to be a &amp;quot;layer&amp;quot; on top of SAX for &lt;br /&gt;
the TEI vocabulary.&lt;br /&gt;
&lt;br /&gt;
== Processing all documents in the TEI scheme ==&lt;br /&gt;
&lt;br /&gt;
The program tries to rely on the TEI scheme &lt;br /&gt;
(in fact it is already strongly associated with P5): &lt;br /&gt;
some structural properties of TEI documents are sometimes needed, &lt;br /&gt;
but I try not to make it rely on a specific TEI customisation. &lt;br /&gt;
(I would like to develop a mechanism for using the &amp;quot;TEI customization&amp;quot; &lt;br /&gt;
document produced by Roma to override the default vocabulary.)&lt;br /&gt;
 &lt;br /&gt;
= External Links =&lt;br /&gt;
 &lt;br /&gt;
* The site (mainly in French): http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
* Examples summing up the properties of CR (in rough English): http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#concordances&lt;br /&gt;
&lt;br /&gt;
[[Category:Tools]]&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2720</id>
		<title>CorpusReader</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2720"/>
		<updated>2006-08-26T15:52:59Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Summary =&lt;br /&gt;
&lt;br /&gt;
CorpusReader (CR) is a tool for extracting subcorpora and quantitative &lt;br /&gt;
information from arbitrarily large corpora and corpora containing &lt;br /&gt;
milestoned annotation. It also provides functionality for merging &lt;br /&gt;
several XML documents.&lt;br /&gt;
&lt;br /&gt;
= Background =&lt;br /&gt;
&lt;br /&gt;
CorpusReader was developed for quantitative corpus linguistics. &lt;br /&gt;
Its original goal was to provide a way of extracting quantitative &lt;br /&gt;
information (such as co-occurrence matrices) using all the information &lt;br /&gt;
expressed in the XML infoset.&lt;br /&gt;
 &lt;br /&gt;
As a ''quantitative'' corpus ''linguistic'' tool, CorpusReader does not &lt;br /&gt;
have any ''linguistic'' or ''statistical'' skills... but it provides help &lt;br /&gt;
for:&lt;br /&gt;
&lt;br /&gt;
* importing outputs of existing linguistic tools into a corpus, &lt;br /&gt;
* exporting quantitative data in statistical tool formats.&lt;br /&gt;
&lt;br /&gt;
The rationale is that existing linguistic and statistical tools should &lt;br /&gt;
be reused and made easy to use in the context of a TEI corpus. &lt;br /&gt;
Thus CorpusReader is a bridge between linguistic and statistical &lt;br /&gt;
tools for creating and exploring complex empirical data. &lt;br /&gt;
&lt;br /&gt;
= Technical features =&lt;br /&gt;
&lt;br /&gt;
CR is written in Java and works with Java 1.4 and later. It relies on &lt;br /&gt;
numerous external high-quality open source libraries.&lt;br /&gt;
&lt;br /&gt;
It is run from the command line only.&lt;br /&gt;
&lt;br /&gt;
It is released under the BSD licence.&lt;br /&gt;
&lt;br /&gt;
= Properties =&lt;br /&gt;
 &lt;br /&gt;
== Functions are filters ==&lt;br /&gt;
&lt;br /&gt;
CR relies heavily on the SAX API. The &amp;quot;functions&amp;quot; of the program &lt;br /&gt;
are implemented as SAX filters. The program is mainly a collection of &lt;br /&gt;
SAX filters and a mechanism for plugging the filters into a pipeline.&lt;br /&gt;
 &lt;br /&gt;
Each filter is specialized in a precise task and takes few arguments. &lt;br /&gt;
This allows modularity and reusability. While each filter performs a &lt;br /&gt;
simple task, a pipeline of filters may achieve complex tasks.&lt;br /&gt;
 &lt;br /&gt;
The names of the filters (and, for some of them, their arguments), &lt;br /&gt;
which define the pipeline, are given to the program through a file &lt;br /&gt;
(an XML document, of course) called the &amp;quot;query document&amp;quot;. The URL of &lt;br /&gt;
this document is given as an argument to the program at the &lt;br /&gt;
command line.&lt;br /&gt;
 &lt;br /&gt;
A query document looks like:&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;corpus.xml&amp;quot;&lt;br /&gt;
          outURI=&amp;quot;sample-manuel-1.out&amp;quot;&lt;br /&gt;
          /&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;myFilterName&amp;quot; javaClass=&amp;quot;java.class.qualified.Name&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  Argument subtree, passed to the filter if any, after&lt;br /&gt;
              validation if a schema is known for this filter.&lt;br /&gt;
        --&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
    &amp;lt;!--  etc.: as many filters as needed  --&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
In some cases, filters can communicate directly with each other &lt;br /&gt;
(in addition to communicating through the stream of SAX events).&lt;br /&gt;
&lt;br /&gt;
== Dealing with intersecting hierarchies ==&lt;br /&gt;
&lt;br /&gt;
[TODO]&lt;br /&gt;
&lt;br /&gt;
== Using high-level languages on large documents (XPath, XSLT, XQuery) ==&lt;br /&gt;
&lt;br /&gt;
High-level languages (XSLT and XQuery) are made available: &lt;br /&gt;
the stream of XML events is buffered, transformed, and thrown back &lt;br /&gt;
to the next filter in the pipeline as a stream of XML events.&lt;br /&gt;
 &lt;br /&gt;
When the corpus does not fit in memory, a mechanism allows &lt;br /&gt;
the subtrees to be addressed, buffered, and transformed &lt;br /&gt;
successively, one by one, as separate documents. This is the &lt;br /&gt;
&amp;quot;split&amp;quot; element in the query document:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split localName=&amp;quot;TEI&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is also a way of addressing elements in the stream &lt;br /&gt;
through an XPath expression evaluated against each element, &lt;br /&gt;
one at a time, as if it were a stand-alone document. For instance, the query document above could be written as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split elxpath=&amp;quot;*[namespace-uri()='http://www.tei-c.org/ns/1.0' &lt;br /&gt;
                      and local-name()='div']&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Merging several documents ==&lt;br /&gt;
&lt;br /&gt;
CR contains a mechanism for merging an external document into an already &lt;br /&gt;
annotated corpus without breaking well-formedness. It is useful for reusing &lt;br /&gt;
the output of existing linguistic annotation tools.&lt;br /&gt;
&lt;br /&gt;
== Using a low level API ==&lt;br /&gt;
&lt;br /&gt;
I tried to make the program stand between a &amp;quot;tool&amp;quot; and an &amp;quot;API&amp;quot; &lt;br /&gt;
(a framework facilitating the use of a low-level API). &lt;br /&gt;
Thus, there are different ways of using it:&lt;br /&gt;
&lt;br /&gt;
* it provides directly usable functions for creating KWIC concordances, extracting subcorpora, computing co-occurrence matrices, merging markup, etc.&lt;br /&gt;
* it may be used for applying XSLT / XQuery to big corpora&lt;br /&gt;
* it may be used for plugging in any SAX filter, reducing the complexity of SAX by letting you write only the SAX handler code while the program manages the parser, the pipeline, and the serialisation of the pipeline's output back to disk. Any class implementing the XMLFilter interface may be plugged into the pipeline by giving the qualified name of the Java class &lt;br /&gt;
in the query document (this class must be found on the CLASSPATH). &lt;br /&gt;
* it may be used for prototyping Java code by embedding, in the pipeline, Java code defining a SAX filter: &lt;br /&gt;
http://panini.u-paris10.fr/~sloiseau/CR/filtres/Script.html and http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#embeddedJava&lt;br /&gt;
&lt;br /&gt;
The goal was to make a low-level API easy to use. &amp;quot;CorpusReader&amp;quot; &lt;br /&gt;
is named after &amp;quot;XMLReader&amp;quot;, the interface for parsing a document &lt;br /&gt;
in the Java SAX API. It is intended to be a &amp;quot;layer&amp;quot; on top of SAX for &lt;br /&gt;
the TEI vocabulary.&lt;br /&gt;
&lt;br /&gt;
== Processing all documents in the TEI scheme ==&lt;br /&gt;
&lt;br /&gt;
The program tries to rely on the TEI scheme &lt;br /&gt;
(in fact it is already strongly associated with P5): &lt;br /&gt;
some structural properties of TEI documents are sometimes needed, &lt;br /&gt;
but I try not to make it rely on a specific TEI customisation. &lt;br /&gt;
(I would like to develop a mechanism for using the &amp;quot;TEI customization&amp;quot; &lt;br /&gt;
document produced by Roma to override the default vocabulary.)&lt;br /&gt;
 &lt;br /&gt;
= External Links =&lt;br /&gt;
 &lt;br /&gt;
* The site (mainly in French): http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
* Examples summing up the properties of CR (in rough English): http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#concordances&lt;br /&gt;
&lt;br /&gt;
[[Category:Tools]]&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2719</id>
		<title>CorpusReader</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2719"/>
		<updated>2006-08-26T15:50:13Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Summary =&lt;br /&gt;
&lt;br /&gt;
CorpusReader (CR) is a tool for extracting subcorpora and quantitative &lt;br /&gt;
information from arbitrarily large corpora and corpora containing &lt;br /&gt;
milestoned annotation. It also provides functionality for merging &lt;br /&gt;
several XML documents.&lt;br /&gt;
&lt;br /&gt;
= Background =&lt;br /&gt;
&lt;br /&gt;
CorpusReader was developed for quantitative corpus linguistics. &lt;br /&gt;
Its original goal was to provide a way of extracting quantitative &lt;br /&gt;
information (such as co-occurrence matrices) using all the information &lt;br /&gt;
expressed in the XML infoset.&lt;br /&gt;
 &lt;br /&gt;
As a ''quantitative'' corpus ''linguistic'' tool, CorpusReader does not &lt;br /&gt;
have any ''linguistic'' or ''statistical'' skills... but it provides help &lt;br /&gt;
for:&lt;br /&gt;
&lt;br /&gt;
* importing outputs of existing linguistic tools into a corpus, &lt;br /&gt;
* exporting quantitative data in statistical tool formats.&lt;br /&gt;
&lt;br /&gt;
The rationale is that existing linguistic and statistical tools should &lt;br /&gt;
be reused and made easy to use in the context of a TEI corpus. &lt;br /&gt;
Thus CorpusReader is a bridge between linguistic and statistical &lt;br /&gt;
tools for creating and exploring complex empirical data. &lt;br /&gt;
&lt;br /&gt;
= Technical features =&lt;br /&gt;
&lt;br /&gt;
CR is written in Java and works with Java 1.4 and later. It relies on &lt;br /&gt;
numerous external open source libraries.&lt;br /&gt;
&lt;br /&gt;
It is run only at the command line.&lt;br /&gt;
&lt;br /&gt;
It is released under the BSD licence.&lt;br /&gt;
&lt;br /&gt;
= Properties =&lt;br /&gt;
 &lt;br /&gt;
== Functions are filters ==&lt;br /&gt;
&lt;br /&gt;
The design relies heavily on the SAX API. The &amp;quot;functions&amp;quot; of the program &lt;br /&gt;
are implemented as SAX filters. The program is mainly a collection of &lt;br /&gt;
SAX filters and a mechanism for plugging the filters into a pipeline.&lt;br /&gt;
 &lt;br /&gt;
Each filter is specialized for a precise task and takes few arguments. &lt;br /&gt;
This allows modularity and reusability. While each filter performs a &lt;br /&gt;
simple task, a pipeline of filters may achieve complex tasks.&lt;br /&gt;
 &lt;br /&gt;
The names of the filters (and, for some of them, their arguments), &lt;br /&gt;
which define the pipeline, are given to the program through a file &lt;br /&gt;
(an XML document, of course) called the &amp;quot;query document&amp;quot;. The URL of &lt;br /&gt;
this document is given to the program as an argument at the &lt;br /&gt;
command line.&lt;br /&gt;
 &lt;br /&gt;
A query document looks like:&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;corpus.xml&amp;quot;&lt;br /&gt;
          outURI=&amp;quot;sample-manuel-1.out&amp;quot;&lt;br /&gt;
          /&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;myFilterName&amp;quot; javaClass=&amp;quot;java.class.qualified.Name&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  Argument subtree, passed to the filter if any, after&lt;br /&gt;
              validation if a schema is known for this filter.&lt;br /&gt;
        --&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
    &amp;lt;!--  etc.: as many filters as needed  --&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
In some cases, filters can communicate directly with each other &lt;br /&gt;
(in addition to communicating through the stream of SAX events).&lt;br /&gt;
&lt;br /&gt;
== Dealing with intersecting hierarchies ==&lt;br /&gt;
&lt;br /&gt;
[TODO]&lt;br /&gt;
&lt;br /&gt;
== Using high-level languages on large documents (XPath, XSLT, XQuery) ==&lt;br /&gt;
&lt;br /&gt;
High-level languages (XSLT and XQuery) are made available: &lt;br /&gt;
the stream of XML events is buffered, transformed, and thrown back &lt;br /&gt;
to the next filter in the pipeline as a stream of XML events.&lt;br /&gt;
 &lt;br /&gt;
When the corpus does not fit in memory, a mechanism allows the &lt;br /&gt;
subtrees to be buffered and transformed successively, one by one, &lt;br /&gt;
as separate documents. This is the &lt;br /&gt;
&amp;quot;split&amp;quot; element in the query document:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split localName=&amp;quot;TEI&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is also a way of addressing elements in the stream &lt;br /&gt;
through an XPath expression evaluated against each element, &lt;br /&gt;
one at a time, as if each were a stand-alone document. For instance, the query document above could be written as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split elxpath=&amp;quot;*[namespace-uri()='http://www.tei-c.org/ns/1.0' &lt;br /&gt;
                      and local-name()='div']&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Merging several documents ==&lt;br /&gt;
&lt;br /&gt;
CR contains a mechanism for merging an external document into an already &lt;br /&gt;
annotated corpus without breaking well-formedness. This is useful for reusing &lt;br /&gt;
the output of existing linguistic annotation tools.&lt;br /&gt;
&lt;br /&gt;
== Using a low-level API ==&lt;br /&gt;
&lt;br /&gt;
I tried to make the program stand between a &amp;quot;tool&amp;quot; and an &amp;quot;API&amp;quot; &lt;br /&gt;
(a framework facilitating the use of a low-level API). &lt;br /&gt;
Thus, there are different ways of using it:&lt;br /&gt;
&lt;br /&gt;
* it provides directly usable functions for creating KWIC concordances, extracting subcorpora, computing co-occurrence matrices, merging markup, etc.&lt;br /&gt;
* it may be used as a way of applying XSLT / XQuery to big corpora&lt;br /&gt;
* it may be used for plugging in any SAX filter; it reduces the complexity of SAX by letting you write only the SAX handler code, while the program manages the parser, the pipeline, and the serialisation of the pipeline's output back to disk. Any class implementing the XMLFilter interface may be plugged into the pipeline by providing the qualified name of the Java class &lt;br /&gt;
in the query document (this class should be found on the CLASSPATH). &lt;br /&gt;
* it may be used for prototyping Java code by embedding, in the pipeline, Java code defining a SAX filter: &lt;br /&gt;
http://panini.u-paris10.fr/~sloiseau/CR/filtres/Script.html and http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#embeddedJava&lt;br /&gt;
&lt;br /&gt;
The goal was to make a low-level API easy to use. &amp;quot;CorpusReader&amp;quot; &lt;br /&gt;
is named after &amp;quot;XMLReader&amp;quot;, the name of the class that parses a document &lt;br /&gt;
in the Java SAX API. It is intended to be a &amp;quot;layer&amp;quot; on top of SAX for &lt;br /&gt;
the TEI vocabulary.&lt;br /&gt;
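To illustrate the XMLFilter contract mentioned above, here is a minimal sketch of such a filter (a hypothetical example, not part of CR; the class name and the counted element are invented):&lt;br /&gt;
&lt;br /&gt;
```java
import org.xml.sax.Attributes;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical example, not part of CR: a minimal SAX filter that counts
// w (word) elements while forwarding all events unchanged downstream.
// XMLFilterImpl implements the XMLFilter interface, so a class like this
// is the kind of thing a javaClass attribute could name (assumption).
public class WordCountFilter extends XMLFilterImpl {
    private long count = 0;

    public void startElement(String uri, String localName,
                             String qName, Attributes atts) throws SAXException {
        if ("w".equals(localName)) {
            count++;
        }
        // pass the event on to the next stage of the pipeline
        super.startElement(uri, localName, qName, atts);
    }

    public long getCount() {
        return count;
    }
}
```
Such a filter keeps no buffer, so it works on arbitrarily large corpora; heavier filters follow the same pattern, overriding more of the handler methods.&lt;br /&gt;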
&lt;br /&gt;
== Processing all documents in the TEI scheme ==&lt;br /&gt;
&lt;br /&gt;
The program tries to rely on the TEI scheme &lt;br /&gt;
(in fact it is already strongly associated with P5): &lt;br /&gt;
some structural properties of TEI documents are sometimes needed, &lt;br /&gt;
but I try not to make it rely on a specific TEI customisation. &lt;br /&gt;
(I would like to develop a mechanism for using the &amp;quot;TEI customization&amp;quot; &lt;br /&gt;
document produced by Roma to override the default vocabulary.)&lt;br /&gt;
 &lt;br /&gt;
= External Links =&lt;br /&gt;
 &lt;br /&gt;
* The site (mainly in French): http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
* Examples summing up the properties of CR, in approximate English: http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#concordances&lt;br /&gt;
&lt;br /&gt;
[[Category:Tools]]&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2718</id>
		<title>CorpusReader</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2718"/>
		<updated>2006-08-26T15:46:30Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Summary =&lt;br /&gt;
&lt;br /&gt;
CorpusReader (CR) is a tool for extracting subcorpora and quantitative &lt;br /&gt;
information from arbitrarily large corpora, including corpora containing &lt;br /&gt;
milestoned annotation. It also provides functionality for merging &lt;br /&gt;
several XML documents together.&lt;br /&gt;
&lt;br /&gt;
= Background =&lt;br /&gt;
&lt;br /&gt;
CorpusReader was developed for quantitative corpus linguistics. &lt;br /&gt;
Its original goal was to provide a way to extract quantitative &lt;br /&gt;
information (such as co-occurrence matrices) using all the information &lt;br /&gt;
expressed in the XML infoset.&lt;br /&gt;
 &lt;br /&gt;
As a ''quantitative'' corpus ''linguistics'' tool, CorpusReader has no &lt;br /&gt;
''linguistic'' or ''statistical'' skills of its own, but it provides help &lt;br /&gt;
for: &lt;br /&gt;
&lt;br /&gt;
* importing the outputs of existing linguistic tools into a corpus, &lt;br /&gt;
* exporting quantitative data in the formats of statistical tools.&lt;br /&gt;
&lt;br /&gt;
The rationale is that existing linguistic and statistical tools should &lt;br /&gt;
be reused and made easy to use in the context of a TEI corpus. &lt;br /&gt;
CorpusReader is thus a bridge between linguistic and statistical &lt;br /&gt;
tools for creating and exploring complex empirical data. &lt;br /&gt;
&lt;br /&gt;
It contains a mechanism for merging the output of a linguistic annotation &lt;br /&gt;
tool into an already annotated corpus. Two XML documents can be merged &lt;br /&gt;
into one document without breaking well-formedness: the two streams &lt;br /&gt;
are aligned on content common to the two documents, then merged.&lt;br /&gt;
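The alignment step can be sketched in miniature as follows (a toy model for illustration only, not CR's actual code: SAX events are reduced to strings, and markup-ordering subtleties are ignored):&lt;br /&gt;
&lt;br /&gt;
```java
import java.util.ArrayList;
import java.util.List;

// Toy sketch of alignment-based merging (illustration only, not CR's code).
// Each stream is a sequence of items: markup events ("M:...") or text
// tokens ("T:..."). Both streams must carry the same text tokens in the
// same order; the merge interleaves the markup of both around that text.
public class MergeSketch {
    public static List merge(String[] a, String[] b) {
        List out = new ArrayList();
        int i = 0;
        int j = 0;
        while (true) {
            // flush any pending markup from each stream
            while (i != a.length) {
                if (!a[i].startsWith("M:")) break;
                out.add(a[i]);
                i++;
            }
            while (j != b.length) {
                if (!b[j].startsWith("M:")) break;
                out.add(b[j]);
                j++;
            }
            if (i == a.length) {
                if (j == b.length) break;  // both exhausted: done
                throw new IllegalStateException("streams do not align");
            }
            if (j == b.length) {
                throw new IllegalStateException("streams do not align");
            }
            // both streams must agree on the next text token
            if (!a[i].equals(b[j])) {
                throw new IllegalStateException("streams do not align");
            }
            out.add(a[i]);  // emit the shared text once
            i++;
            j++;
        }
        return out;
    }
}
```
For example, merging a stream carrying sentence markup with one carrying page-break markup over the same text yields a single event sequence containing both markups.&lt;br /&gt;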
&lt;br /&gt;
= Technical features =&lt;br /&gt;
&lt;br /&gt;
CR is written in Java and works with Java 1.4 and later. It relies on &lt;br /&gt;
numerous external open-source libraries.&lt;br /&gt;
&lt;br /&gt;
It is used only from the command line.&lt;br /&gt;
&lt;br /&gt;
It is released under the BSD licence.&lt;br /&gt;
&lt;br /&gt;
= Properties =&lt;br /&gt;
 &lt;br /&gt;
== Functions are filters ==&lt;br /&gt;
&lt;br /&gt;
The design relies heavily on the SAX API. The &amp;quot;functions&amp;quot; of the program &lt;br /&gt;
are implemented as SAX filters. The program is mainly a collection of &lt;br /&gt;
SAX filters and a mechanism for plugging the filters into a pipeline.&lt;br /&gt;
 &lt;br /&gt;
Each filter is specialized for a precise task and takes few arguments. &lt;br /&gt;
This allows modularity and reusability. While each filter performs a &lt;br /&gt;
simple task, a pipeline of filters may achieve complex tasks.&lt;br /&gt;
 &lt;br /&gt;
The names of the filters (and, for some of them, their arguments), &lt;br /&gt;
which define the pipeline, are given to the program through a file &lt;br /&gt;
(an XML document, of course) called the &amp;quot;query document&amp;quot;. The URL of &lt;br /&gt;
this document is given to the program as an argument at the &lt;br /&gt;
command line.&lt;br /&gt;
 &lt;br /&gt;
A query document looks like:&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;corpus.xml&amp;quot;&lt;br /&gt;
          outURI=&amp;quot;sample-manuel-1.out&amp;quot;&lt;br /&gt;
          /&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;myFilterName&amp;quot; javaClass=&amp;quot;java.class.qualified.Name&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  Argument subtree, passed to the filter if any, after&lt;br /&gt;
              validation if a schema is known for this filter.&lt;br /&gt;
        --&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
    &amp;lt;!--  etc.: as many filters as needed  --&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
In some cases, filters can communicate directly with each other &lt;br /&gt;
(in addition to communicating through the stream of SAX events).&lt;br /&gt;
&lt;br /&gt;
== Dealing with intersecting hierarchies ==&lt;br /&gt;
&lt;br /&gt;
[TODO]&lt;br /&gt;
&lt;br /&gt;
== Using high-level languages on large documents (XPath, XSLT, XQuery) ==&lt;br /&gt;
&lt;br /&gt;
High-level languages (XSLT and XQuery) are made available: &lt;br /&gt;
the stream of XML events is buffered, transformed, and thrown back &lt;br /&gt;
to the next filter in the pipeline as a stream of XML events.&lt;br /&gt;
 &lt;br /&gt;
When the corpus does not fit in memory, a mechanism allows the &lt;br /&gt;
subtrees to be buffered and transformed successively, one by one, &lt;br /&gt;
as separate documents. This is the &lt;br /&gt;
&amp;quot;split&amp;quot; element in the query document:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split localName=&amp;quot;TEI&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is also a way of addressing elements in the stream &lt;br /&gt;
through an XPath expression evaluated against each element, &lt;br /&gt;
one at a time, as if each were a stand-alone document. For instance, the query document above could be written as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split elxpath=&amp;quot;*[namespace-uri()='http://www.tei-c.org/ns/1.0' &lt;br /&gt;
                      and local-name()='div']&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Using a low-level API ==&lt;br /&gt;
&lt;br /&gt;
I tried to make the program stand between a &amp;quot;tool&amp;quot; and an &amp;quot;API&amp;quot; &lt;br /&gt;
(a framework facilitating the use of a low-level API). &lt;br /&gt;
Thus, there are different ways of using it:&lt;br /&gt;
&lt;br /&gt;
* it provides directly usable functions for creating KWIC concordances, extracting subcorpora, computing co-occurrence matrices, merging markup, etc.&lt;br /&gt;
* it may be used as a way of applying XSLT / XQuery to big corpora&lt;br /&gt;
* it may be used for plugging in any SAX filter; it reduces the complexity of SAX by letting you write only the SAX handler code, while the program manages the parser, the pipeline, and the serialisation of the pipeline's output back to disk. Any class implementing the XMLFilter interface may be plugged into the pipeline by providing the qualified name of the Java class &lt;br /&gt;
in the query document (this class should be found on the CLASSPATH). &lt;br /&gt;
* it may be used for prototyping Java code by embedding, in the pipeline, Java code defining a SAX filter: &lt;br /&gt;
http://panini.u-paris10.fr/~sloiseau/CR/filtres/Script.html and http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#embeddedJava&lt;br /&gt;
&lt;br /&gt;
The goal was to make a low-level API easy to use. &amp;quot;CorpusReader&amp;quot; &lt;br /&gt;
is named after &amp;quot;XMLReader&amp;quot;, the name of the class that parses a document &lt;br /&gt;
in the Java SAX API. It is intended to be a &amp;quot;layer&amp;quot; on top of SAX for &lt;br /&gt;
the TEI vocabulary.&lt;br /&gt;
&lt;br /&gt;
== Processing all documents in the TEI scheme ==&lt;br /&gt;
&lt;br /&gt;
The program tries to rely on the TEI scheme &lt;br /&gt;
(in fact it is already strongly associated with P5): &lt;br /&gt;
some structural properties of TEI documents are sometimes needed, &lt;br /&gt;
but I try not to make it rely on a specific TEI customisation. &lt;br /&gt;
(I would like to develop a mechanism for using the &amp;quot;TEI customization&amp;quot; &lt;br /&gt;
document produced by Roma to override the default vocabulary.)&lt;br /&gt;
 &lt;br /&gt;
= External Links =&lt;br /&gt;
 &lt;br /&gt;
* The site (mainly in French): http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
* Examples summing up the properties of CR, in approximate English: http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#concordances&lt;br /&gt;
&lt;br /&gt;
[[Category:Tools]]&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2717</id>
		<title>CorpusReader</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2717"/>
		<updated>2006-08-26T15:42:15Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Summary =&lt;br /&gt;
&lt;br /&gt;
CorpusReader (CR) is a tool for extracting subcorpora and quantitative &lt;br /&gt;
information from arbitrarily large corpora, including corpora containing &lt;br /&gt;
milestoned annotation. It also provides functionality for merging &lt;br /&gt;
several XML documents together.&lt;br /&gt;
&lt;br /&gt;
= Background =&lt;br /&gt;
&lt;br /&gt;
CorpusReader was developed for quantitative corpus linguistics. &lt;br /&gt;
Its original goal was to provide a way to extract quantitative &lt;br /&gt;
information (such as co-occurrence matrices) using all the information &lt;br /&gt;
expressed in the XML infoset.&lt;br /&gt;
 &lt;br /&gt;
As a ''quantitative'' corpus ''linguistics'' tool, CorpusReader has no &lt;br /&gt;
''linguistic'' or ''statistical'' skills of its own, but it provides help &lt;br /&gt;
for (1) importing the outputs of existing linguistic tools into a corpus, &lt;br /&gt;
and (2) exporting quantitative data in the formats of statistical tools. &lt;br /&gt;
The rationale is that existing linguistic and statistical tools should &lt;br /&gt;
be reused and made easy to use in the context of a TEI corpus. &lt;br /&gt;
CorpusReader is thus a bridge between linguistic and statistical &lt;br /&gt;
tools for creating and exploring complex empirical data. &lt;br /&gt;
It contains a mechanism for merging the output &lt;br /&gt;
of a linguistic annotation tool into an already annotated corpus, &lt;br /&gt;
merging two XML markups into one document without breaking well-formedness: &lt;br /&gt;
the two streams are aligned on content common to the two documents, &lt;br /&gt;
then merged.&lt;br /&gt;
&lt;br /&gt;
= Technical features =&lt;br /&gt;
&lt;br /&gt;
CR is written in Java and works with Java 1.4 and later.&lt;br /&gt;
 &lt;br /&gt;
It is run at the command line.&lt;br /&gt;
&lt;br /&gt;
= Properties =&lt;br /&gt;
 &lt;br /&gt;
Here are some properties I aimed for:&lt;br /&gt;
 &lt;br /&gt;
(1) Functions are filters: in order to process arbitrarily large corpora, the program is built on the SAX API.&lt;br /&gt;
(2) Dealing with intersecting hierarchies&lt;br /&gt;
(3) Using high-level languages on large documents (XPath, XSLT, XQuery)&lt;br /&gt;
(4) Using a low-level API&lt;br /&gt;
(5) Processing all documents in the TEI scheme&lt;br /&gt;
 &lt;br /&gt;
== Functions are filters ==&lt;br /&gt;
&lt;br /&gt;
The design relies heavily on the SAX API. The &amp;quot;functions&amp;quot; of the program &lt;br /&gt;
are implemented as SAX filters. The program is mainly a collection of &lt;br /&gt;
SAX filters and a mechanism for plugging the filters into a pipeline.&lt;br /&gt;
 &lt;br /&gt;
Each filter is specialized for a precise task and takes few arguments. &lt;br /&gt;
This allows modularity and reusability. While each filter performs a &lt;br /&gt;
simple task, a pipeline of filters may achieve complex tasks.&lt;br /&gt;
 &lt;br /&gt;
The names of the filters (and, for some of them, their arguments), &lt;br /&gt;
which define the pipeline, are given to the program through a file &lt;br /&gt;
(an XML document, of course) called the &amp;quot;query document&amp;quot;. The URL of &lt;br /&gt;
this document is given to the program as an argument at the &lt;br /&gt;
command line.&lt;br /&gt;
 &lt;br /&gt;
A query document looks like:&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;corpus.xml&amp;quot;&lt;br /&gt;
          outURI=&amp;quot;sample-manuel-1.out&amp;quot;&lt;br /&gt;
          /&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;myFilterName&amp;quot; javaClass=&amp;quot;java.class.qualified.Name&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  Argument subtree, passed to the filter if any, after&lt;br /&gt;
              validation if a schema is known for this filter.&lt;br /&gt;
        --&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
    &amp;lt;!--  etc.: as many filters as needed  --&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
In some cases, filters can communicate directly with each other &lt;br /&gt;
(in addition to communicating through the stream of SAX events).&lt;br /&gt;
&lt;br /&gt;
== Dealing with intersecting hierarchies ==&lt;br /&gt;
&lt;br /&gt;
[TODO]&lt;br /&gt;
&lt;br /&gt;
== Using high-level languages on large documents (XPath, XSLT, XQuery) ==&lt;br /&gt;
&lt;br /&gt;
High-level languages (XSLT and XQuery) are made available: &lt;br /&gt;
the stream of XML events is buffered, transformed, and thrown back &lt;br /&gt;
to the next filter in the pipeline as a stream of XML events.&lt;br /&gt;
 &lt;br /&gt;
When the corpus does not fit in memory, a mechanism allows the &lt;br /&gt;
subtrees to be buffered and transformed successively, one by one, &lt;br /&gt;
as separate documents. This is the &lt;br /&gt;
&amp;quot;split&amp;quot; element in the query document:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split localName=&amp;quot;TEI&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is also a way of addressing elements in the stream &lt;br /&gt;
through an XPath expression evaluated against each element, &lt;br /&gt;
one at a time, as if each were a stand-alone document. For instance, the query document above could be written as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split elxpath=&amp;quot;*[namespace-uri()='http://www.tei-c.org/ns/1.0' &lt;br /&gt;
                      and local-name()='div']&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Using a low-level API ==&lt;br /&gt;
&lt;br /&gt;
I tried to make the program stand between a &amp;quot;tool&amp;quot; and an &amp;quot;API&amp;quot; &lt;br /&gt;
(a framework facilitating the use of a low-level API). &lt;br /&gt;
Thus, there are different ways of using it:&lt;br /&gt;
&lt;br /&gt;
* it provides directly usable functions for creating KWIC concordances, extracting subcorpora, computing co-occurrence matrices, merging markup, etc.&lt;br /&gt;
* it may be used as a way of applying XSLT / XQuery to big corpora&lt;br /&gt;
* it may be used for plugging in any SAX filter; it reduces the complexity of SAX by letting you write only the SAX handler code, while the program manages the parser, the pipeline, and the serialisation of the pipeline's output back to disk. Any class implementing the XMLFilter interface may be plugged into the pipeline by providing the qualified name of the Java class &lt;br /&gt;
in the query document (this class should be found on the CLASSPATH). &lt;br /&gt;
* it may be used for prototyping Java code by embedding, in the pipeline, Java code defining a SAX filter: &lt;br /&gt;
http://panini.u-paris10.fr/~sloiseau/CR/filtres/Script.html and http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#embeddedJava&lt;br /&gt;
&lt;br /&gt;
The goal was to make a low-level API easy to use. &amp;quot;CorpusReader&amp;quot; &lt;br /&gt;
is named after &amp;quot;XMLReader&amp;quot;, the name of the class that parses a document &lt;br /&gt;
in the Java SAX API. It is intended to be a &amp;quot;layer&amp;quot; on top of SAX for &lt;br /&gt;
the TEI vocabulary.&lt;br /&gt;
&lt;br /&gt;
== Processing all documents in the TEI scheme ==&lt;br /&gt;
&lt;br /&gt;
The program tries to rely on the TEI scheme &lt;br /&gt;
(in fact it is already strongly associated with P5): &lt;br /&gt;
some structural properties of TEI documents are sometimes needed, &lt;br /&gt;
but I try not to make it rely on a specific TEI customisation. &lt;br /&gt;
(I would like to develop a mechanism for using the &amp;quot;TEI customization&amp;quot; &lt;br /&gt;
document produced by Roma to override the default vocabulary.)&lt;br /&gt;
 &lt;br /&gt;
= External Links =&lt;br /&gt;
 &lt;br /&gt;
* The site (mainly in French): http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
* Examples summing up the properties of CR, in approximate English: http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#concordances&lt;br /&gt;
&lt;br /&gt;
[[Category:Tools]]&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2716</id>
		<title>CorpusReader</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2716"/>
		<updated>2006-08-26T15:41:58Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Summary =&lt;br /&gt;
&lt;br /&gt;
CorpusReader (CR) is a tool for extracting subcorpora and quantitative &lt;br /&gt;
information from arbitrarily large corpora, including corpora containing &lt;br /&gt;
milestoned annotation. It also provides functionality for merging &lt;br /&gt;
several XML documents together.&lt;br /&gt;
&lt;br /&gt;
= Background =&lt;br /&gt;
&lt;br /&gt;
CorpusReader was developed for quantitative corpus linguistics. &lt;br /&gt;
Its original goal was to provide a way of extracting quantitative &lt;br /&gt;
information (such as co-occurrence matrices) using all the information &lt;br /&gt;
expressed in the XML infoset.&lt;br /&gt;
 &lt;br /&gt;
As a ''quantitative'' corpus ''linguistics'' tool, CorpusReader has &lt;br /&gt;
no linguistic or statistical skills of its own, but it helps &lt;br /&gt;
with (1) importing the output of existing linguistic tools into a corpus, &lt;br /&gt;
and (2) exporting quantitative data in the formats of statistical tools. &lt;br /&gt;
The rationale is that existing linguistic and statistical tools should &lt;br /&gt;
be reused and made easy to use in the context of a TEI corpus. &lt;br /&gt;
CorpusReader is thus a bridge between linguistic and statistical &lt;br /&gt;
tools for creating and empirically exploring &lt;br /&gt;
complex data. It contains a mechanism for merging the output &lt;br /&gt;
of a linguistic annotation tool into an already annotated corpus &lt;br /&gt;
by merging two XML markups into one document without breaking well-formedness:&lt;br /&gt;
the two streams are aligned on content common to the two documents, &lt;br /&gt;
then merged.&lt;br /&gt;
&lt;br /&gt;
= Technical features =&lt;br /&gt;
&lt;br /&gt;
CR is written in Java and works with Java 1.4 and later.&lt;br /&gt;
 &lt;br /&gt;
It is run from the command line.&lt;br /&gt;
&lt;br /&gt;
= Properties =&lt;br /&gt;
 &lt;br /&gt;
Here are some properties I aimed for:&lt;br /&gt;
 &lt;br /&gt;
(1) Functions are filters: in order to process arbitrarily large corpora, the program is built on the SAX API.&lt;br /&gt;
(2) Dealing with intersecting hierarchies&lt;br /&gt;
(3) Using high-level languages on large documents (XPath, XSLT, XQuery)&lt;br /&gt;
(4) Using a low-level API&lt;br /&gt;
(5) Processing any document in the TEI scheme&lt;br /&gt;
 &lt;br /&gt;
== Functions are filters ==&lt;br /&gt;
&lt;br /&gt;
The design relies heavily on the SAX API. The &amp;quot;functions&amp;quot; of the program &lt;br /&gt;
are implemented as SAX filters. The program is mainly a collection of &lt;br /&gt;
SAX filters and a mechanism for plugging the filters into a pipeline.&lt;br /&gt;
 &lt;br /&gt;
Each filter is specialized for a precise task and takes few arguments. &lt;br /&gt;
This allows modularity and reusability: while each filter performs a &lt;br /&gt;
simple task, a pipeline of filters may achieve complex tasks.&lt;br /&gt;
 &lt;br /&gt;
The names of the filters (and, for some of them, their arguments), &lt;br /&gt;
which define the pipeline, are given to the program through a file &lt;br /&gt;
(an XML document, of course) called the &amp;quot;query document&amp;quot;. The URL of &lt;br /&gt;
this document is given to the program as an argument on the &lt;br /&gt;
command line.&lt;br /&gt;
 &lt;br /&gt;
A query document looks like:&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;corpus.xml&amp;quot;&lt;br /&gt;
          outURI=&amp;quot;sample-manuel-1.out&amp;quot;&lt;br /&gt;
          /&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;myFilterName&amp;quot; javaClass=&amp;quot;java.class.qualified.Name&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  Argument subtree, passed to the filter if any, after&lt;br /&gt;
              validation if a schema is known for this filter.&lt;br /&gt;
        --&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
    &amp;lt;!--  etc.: as many filters as needed  --&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
In some cases, filters can communicate directly with each other &lt;br /&gt;
(in addition to communicating through the stream of SAX events).&lt;br /&gt;
&lt;br /&gt;
== Dealing with intersecting hierarchies ==&lt;br /&gt;
&lt;br /&gt;
[TODO]&lt;br /&gt;
&lt;br /&gt;
== Using high level language on large document (XPath, XSLT, XQuery) ==&lt;br /&gt;
&lt;br /&gt;
High-level languages (XSLT and XQuery) are made available: &lt;br /&gt;
the stream of XML events is buffered, transformed, and sent back &lt;br /&gt;
to the next filter in the pipeline as a stream of XML events.&lt;br /&gt;
 &lt;br /&gt;
When the corpus does not fit in memory, a mechanism allows &lt;br /&gt;
the subtrees to be buffered and transformed &lt;br /&gt;
successively, one by one, as separate documents. This is the &lt;br /&gt;
&amp;quot;split&amp;quot; element in the query document:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split localName=&amp;quot;TEI&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is also a way of addressing elements in the stream &lt;br /&gt;
through an XPath expression evaluated against each element, &lt;br /&gt;
one at a time, as if it were a stand-alone document. For instance, the query document above could be written as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split elxpath=&amp;quot;*[namespace-uri()='http://www.tei-c.org/ns/1.0' &lt;br /&gt;
                      and local-name()='div']&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Using a low level API ==&lt;br /&gt;
&lt;br /&gt;
I tried to make the program stand between a &amp;quot;tool&amp;quot; and an &amp;quot;API&amp;quot; &lt;br /&gt;
(a framework facilitating the use of a low-level API). &lt;br /&gt;
Thus, there are different ways of using it:&lt;br /&gt;
&lt;br /&gt;
* it provides directly usable functions for creating KWIC concordances, extracting subcorpora, computing co-occurrence matrices, merging markup, etc.&lt;br /&gt;
* it may be used as a way of applying XSLT / XQuery to big corpora&lt;br /&gt;
* it may be used for plugging in any SAX filter; it reduces the complexity of SAX by letting the user write only the SAX handler code, while the program manages the parser, the pipeline, and the serialisation of the pipeline's output back to disk. Any class implementing the XMLFilter interface may be plugged into the pipeline by providing the qualified name of the Java class &lt;br /&gt;
in the query document (this class must be found on the CLASSPATH). &lt;br /&gt;
* it may be used for prototyping Java code by embedding in the pipeline Java code defining a SAX filter: &lt;br /&gt;
http://panini.u-paris10.fr/~sloiseau/CR/filtres/Script.html and http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#embeddedJava&lt;br /&gt;
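As an illustration of the plug-in mechanism described above, a query document fragment might declare a custom filter. The filter name and class name below are hypothetical, not part of CR:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;filterList&amp;gt;&lt;br /&gt;
  &amp;lt;!--  hypothetical user-written class implementing XMLFilter,&lt;br /&gt;
        found on the CLASSPATH  --&amp;gt;&lt;br /&gt;
  &amp;lt;filter name=&amp;quot;myCustomFilter&amp;quot; javaClass=&amp;quot;org.example.MyCustomFilter&amp;quot;&amp;gt;&lt;br /&gt;
    &amp;lt;args&amp;gt;&amp;lt;/args&amp;gt;&lt;br /&gt;
  &amp;lt;/filter&amp;gt;&lt;br /&gt;
&amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;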
&lt;br /&gt;
The goal was to make a low-level API easy to use. &amp;quot;CorpusReader&amp;quot; &lt;br /&gt;
is named after &amp;quot;XMLReader&amp;quot;, the name of the document-parsing class &lt;br /&gt;
in the Java SAX API. It is intended to be a &amp;quot;layer&amp;quot; on top of SAX for &lt;br /&gt;
the TEI vocabulary.&lt;br /&gt;
&lt;br /&gt;
== Processing any document in the TEI scheme ==&lt;br /&gt;
&lt;br /&gt;
The program tries to rely only on the TEI scheme &lt;br /&gt;
(in fact it is already strongly tied to P5): &lt;br /&gt;
some structural properties of TEI documents are sometimes needed, &lt;br /&gt;
but I try not to make it rely on a specific TEI customisation. &lt;br /&gt;
(I would like to develop a mechanism for using the &amp;quot;TEI customization&amp;quot; &lt;br /&gt;
document produced by Roma to override the default vocabulary.)&lt;br /&gt;
 &lt;br /&gt;
= External Links =&lt;br /&gt;
 &lt;br /&gt;
* The site (mainly in French): http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
* Examples summing up the properties of CR, in approximate English: http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#concordances&lt;br /&gt;
&lt;br /&gt;
[[Category:Tools]]&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2715</id>
		<title>CorpusReader</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2715"/>
		<updated>2006-08-26T15:41:48Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Summary =&lt;br /&gt;
&lt;br /&gt;
CorpusReader (CR) is a tool for extracting subcorpora and quantitative &lt;br /&gt;
information from arbitrarily large corpora, including corpora containing &lt;br /&gt;
milestoned annotation. It also provides functionality for merging &lt;br /&gt;
several XML documents together.&lt;br /&gt;
&lt;br /&gt;
= Background =&lt;br /&gt;
&lt;br /&gt;
CorpusReader was developed for quantitative corpus linguistics. &lt;br /&gt;
Its original goal was to provide a way of extracting quantitative &lt;br /&gt;
information (such as co-occurrence matrices) using all the information &lt;br /&gt;
expressed in the XML infoset.&lt;br /&gt;
 &lt;br /&gt;
As a ''quantitative'' corpus ''linguistics'' tool, CorpusReader has &lt;br /&gt;
no linguistic or statistical skills of its own, but it helps &lt;br /&gt;
with (1) importing the output of existing linguistic tools into a corpus, &lt;br /&gt;
and (2) exporting quantitative data in the formats of statistical tools. &lt;br /&gt;
The rationale is that existing linguistic and statistical tools should &lt;br /&gt;
be reused and made easy to use in the context of a TEI corpus. &lt;br /&gt;
CorpusReader is thus a bridge between linguistic and statistical &lt;br /&gt;
tools for creating and empirically exploring &lt;br /&gt;
complex data. It contains a mechanism for merging the output &lt;br /&gt;
of a linguistic annotation tool into an already annotated corpus &lt;br /&gt;
by merging two XML markups into one document without breaking well-formedness:&lt;br /&gt;
the two streams are aligned on content common to the two documents, &lt;br /&gt;
then merged.&lt;br /&gt;
&lt;br /&gt;
= Technical features =&lt;br /&gt;
&lt;br /&gt;
CR is written in Java and works with Java 1.4 and later.&lt;br /&gt;
 &lt;br /&gt;
It is run from the command line.&lt;br /&gt;
&lt;br /&gt;
= Properties =&lt;br /&gt;
 &lt;br /&gt;
Here are some properties I aimed for:&lt;br /&gt;
 &lt;br /&gt;
(1) Functions are filters: in order to process arbitrarily large corpora, the program is built on the SAX API.&lt;br /&gt;
(2) Dealing with intersecting hierarchies&lt;br /&gt;
(3) Using high-level languages on large documents (XPath, XSLT, XQuery)&lt;br /&gt;
(4) Using a low-level API&lt;br /&gt;
(5) Processing any document in the TEI scheme&lt;br /&gt;
 &lt;br /&gt;
== Functions are filters ==&lt;br /&gt;
&lt;br /&gt;
The design relies heavily on the SAX API. The &amp;quot;functions&amp;quot; of the program &lt;br /&gt;
are implemented as SAX filters. The program is mainly a collection of &lt;br /&gt;
SAX filters and a mechanism for plugging the filters into a pipeline.&lt;br /&gt;
 &lt;br /&gt;
Each filter is specialized for a precise task and takes few arguments. &lt;br /&gt;
This allows modularity and reusability: while each filter performs a &lt;br /&gt;
simple task, a pipeline of filters may achieve complex tasks.&lt;br /&gt;
 &lt;br /&gt;
The names of the filters (and, for some of them, their arguments), &lt;br /&gt;
which define the pipeline, are given to the program through a file &lt;br /&gt;
(an XML document, of course) called the &amp;quot;query document&amp;quot;. The URL of &lt;br /&gt;
this document is given to the program as an argument on the &lt;br /&gt;
command line.&lt;br /&gt;
 &lt;br /&gt;
A query document looks like:&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;corpus.xml&amp;quot;&lt;br /&gt;
          outURI=&amp;quot;sample-manuel-1.out&amp;quot;&lt;br /&gt;
          /&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;myFilterName&amp;quot; javaClass=&amp;quot;java.class.qualified.Name&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  Argument subtree, passed to the filter if any, after&lt;br /&gt;
              validation if a schema is known for this filter.&lt;br /&gt;
        --&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
    &amp;lt;!--  etc.: as many filters as needed  --&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
In some cases, filters can communicate directly with each other &lt;br /&gt;
(in addition to communicating through the stream of SAX events).&lt;br /&gt;
&lt;br /&gt;
== Dealing with intersecting hierarchies ==&lt;br /&gt;
&lt;br /&gt;
[TODO]&lt;br /&gt;
&lt;br /&gt;
== Using high level language on large document (XPath, XSLT, XQuery) ==&lt;br /&gt;
&lt;br /&gt;
High-level languages (XSLT and XQuery) are made available: &lt;br /&gt;
the stream of XML events is buffered, transformed, and sent back &lt;br /&gt;
to the next filter in the pipeline as a stream of XML events.&lt;br /&gt;
 &lt;br /&gt;
When the corpus does not fit in memory, a mechanism allows &lt;br /&gt;
the subtrees to be buffered and transformed &lt;br /&gt;
successively, one by one, as separate documents. This is the &lt;br /&gt;
&amp;quot;split&amp;quot; element in the query document:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split localName=&amp;quot;TEI&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is also a way of addressing elements in the stream &lt;br /&gt;
through an XPath expression evaluated against each element, &lt;br /&gt;
one at a time, as if it were a stand-alone document. For instance, the query document above could be written as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split elxpath=&amp;quot;*[namespace-uri()='http://www.tei-c.org/ns/1.0' &lt;br /&gt;
                      and local-name()='div']&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Using a low level API ==&lt;br /&gt;
&lt;br /&gt;
I tried to make the program stand between a &amp;quot;tool&amp;quot; and an &amp;quot;API&amp;quot; &lt;br /&gt;
(a framework facilitating the use of a low-level API). &lt;br /&gt;
Thus, there are different ways of using it:&lt;br /&gt;
&lt;br /&gt;
* it provides directly usable functions for creating KWIC concordances, extracting subcorpora, computing co-occurrence matrices, merging markup, etc.&lt;br /&gt;
* it may be used as a way of applying XSLT / XQuery to big corpora&lt;br /&gt;
* it may be used for plugging in any SAX filter; it reduces the complexity of SAX by letting the user write only the SAX handler code, while the program manages the parser, the pipeline, and the serialisation of the pipeline's output back to disk. Any class implementing the XMLFilter interface may be plugged into the pipeline by providing the qualified name of the Java class &lt;br /&gt;
in the query document (this class must be found on the CLASSPATH). &lt;br /&gt;
* it may be used for prototyping Java code by embedding in the pipeline Java code defining a SAX filter: &lt;br /&gt;
http://panini.u-paris10.fr/~sloiseau/CR/filtres/Script.html and http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#embeddedJava&lt;br /&gt;
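As an illustration of the plug-in mechanism described above, a query document fragment might declare a custom filter. The filter name and class name below are hypothetical, not part of CR:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;filterList&amp;gt;&lt;br /&gt;
  &amp;lt;!--  hypothetical user-written class implementing XMLFilter,&lt;br /&gt;
        found on the CLASSPATH  --&amp;gt;&lt;br /&gt;
  &amp;lt;filter name=&amp;quot;myCustomFilter&amp;quot; javaClass=&amp;quot;org.example.MyCustomFilter&amp;quot;&amp;gt;&lt;br /&gt;
    &amp;lt;args&amp;gt;&amp;lt;/args&amp;gt;&lt;br /&gt;
  &amp;lt;/filter&amp;gt;&lt;br /&gt;
&amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;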
&lt;br /&gt;
The goal was to make a low-level API easy to use. &amp;quot;CorpusReader&amp;quot; &lt;br /&gt;
is named after &amp;quot;XMLReader&amp;quot;, the name of the document-parsing class &lt;br /&gt;
in the Java SAX API. It is intended to be a &amp;quot;layer&amp;quot; on top of SAX for &lt;br /&gt;
the TEI vocabulary.&lt;br /&gt;
&lt;br /&gt;
== Processing any document in the TEI scheme ==&lt;br /&gt;
&lt;br /&gt;
The program tries to rely only on the TEI scheme &lt;br /&gt;
(in fact it is already strongly tied to P5): &lt;br /&gt;
some structural properties of TEI documents are sometimes needed, &lt;br /&gt;
but I try not to make it rely on a specific TEI customisation. &lt;br /&gt;
(I would like to develop a mechanism for using the &amp;quot;TEI customization&amp;quot; &lt;br /&gt;
document produced by Roma to override the default vocabulary.)&lt;br /&gt;
 &lt;br /&gt;
= External Links =&lt;br /&gt;
 &lt;br /&gt;
* The site (mainly in French): http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
* Examples summing up the properties of CR, in approximate English: http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#concordances&lt;br /&gt;
&lt;br /&gt;
[[Category:Tools]]&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2714</id>
		<title>CorpusReader</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2714"/>
		<updated>2006-08-26T15:40:39Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Summary =&lt;br /&gt;
&lt;br /&gt;
CorpusReader (CR) is a tool for extracting subcorpora and quantitative &lt;br /&gt;
information from arbitrarily large corpora, including corpora containing &lt;br /&gt;
milestoned annotation. It also provides functionality for merging &lt;br /&gt;
several XML documents together.&lt;br /&gt;
&lt;br /&gt;
= Background =&lt;br /&gt;
&lt;br /&gt;
CorpusReader was developed for quantitative corpus linguistics. &lt;br /&gt;
Its original goal was to provide a way of extracting quantitative &lt;br /&gt;
information (such as co-occurrence matrices) using all the information &lt;br /&gt;
expressed in the XML infoset.&lt;br /&gt;
 &lt;br /&gt;
As a ''quantitative'' corpus ''linguistics'' tool, CorpusReader has &lt;br /&gt;
no linguistic or statistical skills of its own, but it helps &lt;br /&gt;
with (1) importing the output of existing linguistic tools into a corpus, &lt;br /&gt;
and (2) exporting quantitative data in the formats of statistical tools. &lt;br /&gt;
The rationale is that existing linguistic and statistical tools should &lt;br /&gt;
be reused and made easy to use in the context of a TEI corpus. &lt;br /&gt;
CorpusReader is thus a bridge between linguistic and statistical &lt;br /&gt;
tools for creating and empirically exploring &lt;br /&gt;
complex data. It contains a mechanism for merging the output &lt;br /&gt;
of a linguistic annotation tool into an already annotated corpus &lt;br /&gt;
by merging two XML markups into one document without breaking well-formedness:&lt;br /&gt;
the two streams are aligned on content common to the two documents, &lt;br /&gt;
then merged.&lt;br /&gt;
&lt;br /&gt;
= Technical features =&lt;br /&gt;
&lt;br /&gt;
CR is written in Java and works with Java 1.4 and later.&lt;br /&gt;
 &lt;br /&gt;
It is run from the command line.&lt;br /&gt;
&lt;br /&gt;
= Properties =&lt;br /&gt;
 &lt;br /&gt;
Here are some properties I aimed for:&lt;br /&gt;
 &lt;br /&gt;
(1) Functions are filters: in order to process arbitrarily large corpora, the program is built on the SAX API.&lt;br /&gt;
(2) Dealing with intersecting hierarchies&lt;br /&gt;
(3) Using high-level languages on large documents (XPath, XSLT, XQuery)&lt;br /&gt;
(4) Using a low-level API&lt;br /&gt;
(5) Processing any document in the TEI scheme&lt;br /&gt;
 &lt;br /&gt;
== Functions are filters ==&lt;br /&gt;
&lt;br /&gt;
The design relies heavily on the SAX API. The &amp;quot;functions&amp;quot; of the program &lt;br /&gt;
are implemented as SAX filters. The program is mainly a collection of &lt;br /&gt;
SAX filters and a mechanism for plugging the filters into a pipeline.&lt;br /&gt;
 &lt;br /&gt;
Each filter is specialized for a precise task and takes few arguments. &lt;br /&gt;
This allows modularity and reusability: while each filter performs a &lt;br /&gt;
simple task, a pipeline of filters may achieve complex tasks.&lt;br /&gt;
 &lt;br /&gt;
The names of the filters (and, for some of them, their arguments), &lt;br /&gt;
which define the pipeline, are given to the program through a file &lt;br /&gt;
(an XML document, of course) called the &amp;quot;query document&amp;quot;. The URL of &lt;br /&gt;
this document is given to the program as an argument on the &lt;br /&gt;
command line.&lt;br /&gt;
 &lt;br /&gt;
A query document looks like:&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;corpus.xml&amp;quot;&lt;br /&gt;
          outURI=&amp;quot;sample-manuel-1.out&amp;quot;&lt;br /&gt;
          /&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;myFilterName&amp;quot; javaClass=&amp;quot;java.class.qualified.Name&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  Argument subtree, passed to the filter if any, after&lt;br /&gt;
              validation if a schema is known for this filter.&lt;br /&gt;
        --&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
    &amp;lt;!--  etc.: as many filters as needed  --&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
In some cases, filters can communicate directly with each other &lt;br /&gt;
(in addition to communicating through the stream of SAX events).&lt;br /&gt;
&lt;br /&gt;
== Dealing with intersecting hierarchies ==&lt;br /&gt;
&lt;br /&gt;
[TODO]&lt;br /&gt;
&lt;br /&gt;
== Using high level language on large document (XPath, XSLT, XQuery) ==&lt;br /&gt;
&lt;br /&gt;
High-level languages (XSLT and XQuery) are made available: &lt;br /&gt;
the stream of XML events is buffered, transformed, and sent back &lt;br /&gt;
to the next filter in the pipeline as a stream of XML events.&lt;br /&gt;
 &lt;br /&gt;
When the corpus does not fit in memory, a mechanism allows &lt;br /&gt;
the subtrees to be buffered and transformed &lt;br /&gt;
successively, one by one, as separate documents. This is the &lt;br /&gt;
&amp;quot;split&amp;quot; element in the query document:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split localName=&amp;quot;TEI&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is also a way of addressing elements in the stream &lt;br /&gt;
through an XPath expression evaluated against each element, &lt;br /&gt;
one at a time, as if it were a stand-alone document. For instance, the query document above could be written as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split elxpath=&amp;quot;*[namespace-uri()='http://www.tei-c.org/ns/1.0' &lt;br /&gt;
                      and local-name()='div']&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Using a low level API ==&lt;br /&gt;
&lt;br /&gt;
I tried to make the program stand between a &amp;quot;tool&amp;quot; and an &amp;quot;API&amp;quot; &lt;br /&gt;
(a framework facilitating the use of a low-level API). &lt;br /&gt;
Thus, there are different ways of using it:&lt;br /&gt;
&lt;br /&gt;
* it provides directly usable functions for creating KWIC concordances, extracting subcorpora, computing co-occurrence matrices, merging markup, etc.&lt;br /&gt;
* it may be used as a way of applying XSLT / XQuery to big corpora&lt;br /&gt;
* it may be used for plugging in any SAX filter; it reduces the complexity of SAX by letting the user write only the SAX handler code, while the program manages the parser, the pipeline, and the serialisation of the pipeline's output back to disk. Any class implementing the XMLFilter interface may be plugged into the pipeline by providing the qualified name of the Java class &lt;br /&gt;
in the query document (this class must be found on the CLASSPATH). &lt;br /&gt;
* it may be used for prototyping Java code by embedding in the pipeline Java code defining a SAX filter: &lt;br /&gt;
http://panini.u-paris10.fr/~sloiseau/CR/filtres/Script.html and http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#embeddedJava&lt;br /&gt;
&lt;br /&gt;
The goal was to make a low-level API easy to use. &amp;quot;CorpusReader&amp;quot; &lt;br /&gt;
is named after &amp;quot;XMLReader&amp;quot;, the name of the class that parses a document &lt;br /&gt;
in the Java SAX API. It is intended to be a &amp;quot;layer&amp;quot; on top of SAX for &lt;br /&gt;
the TEI vocabulary.&lt;br /&gt;
&lt;br /&gt;
== Processing any document in the TEI scheme ==&lt;br /&gt;
&lt;br /&gt;
The program tries to rely only on the TEI scheme &lt;br /&gt;
(in fact it is already strongly tied to P5): &lt;br /&gt;
some structural properties of TEI documents are sometimes needed, &lt;br /&gt;
but I try not to make it rely on a specific TEI customisation. &lt;br /&gt;
(I would like to develop a mechanism for using the &amp;quot;TEI customization&amp;quot; &lt;br /&gt;
document produced by Roma to override the default vocabulary.)&lt;br /&gt;
 &lt;br /&gt;
= External Links =&lt;br /&gt;
 &lt;br /&gt;
* The site (mainly in French): http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
* Examples summing up the properties of CR, in approximate English: http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#concordances&lt;br /&gt;
&lt;br /&gt;
[[Category:Tools]]&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
	<entry>
		<id>https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2713</id>
		<title>CorpusReader</title>
		<link rel="alternate" type="text/html" href="https://wiki.tei-c.org/index.php?title=CorpusReader&amp;diff=2713"/>
		<updated>2006-08-26T15:39:48Z</updated>

		<summary type="html">&lt;p&gt;Sloiseau: &lt;/p&gt;
&lt;hr /&gt;
&lt;div&gt;= Summary =&lt;br /&gt;
&lt;br /&gt;
CorpusReader (CR) is a tool for extracting subcorpora and quantitative &lt;br /&gt;
information from arbitrarily large corpora, including corpora containing &lt;br /&gt;
milestone-based annotation. It also provides functionality for merging &lt;br /&gt;
several XML documents together.&lt;br /&gt;
&lt;br /&gt;
= Background =&lt;br /&gt;
&lt;br /&gt;
CorpusReader was developed for quantitative corpus linguistics. &lt;br /&gt;
Its original goal was to provide a way of extracting quantitative &lt;br /&gt;
information (such as co-occurrence matrices) using all the information &lt;br /&gt;
expressed in the XML infoset.&lt;br /&gt;
 &lt;br /&gt;
As a _quantitative_ corpus _linguistics_ tool, CorpusReader has no &lt;br /&gt;
_linguistic_ or _statistical_ skills of its own... but it provides help &lt;br /&gt;
for (1) importing the output of existing linguistic tools into a corpus, &lt;br /&gt;
and (2) exporting quantitative data in the formats of statistical tools. &lt;br /&gt;
The rationale is that existing linguistic and statistical tools should &lt;br /&gt;
be reused and made easy to use in the context of a TEI corpus. &lt;br /&gt;
Thus CorpusReader is a bridge between linguistic and statistical &lt;br /&gt;
tools for creating and exploring empirically &lt;br /&gt;
complex data. It contains a mechanism for merging the output &lt;br /&gt;
of a linguistic annotation tool into an already annotated corpus, &lt;br /&gt;
combining two XML markups into one document without breaking well-formedness: &lt;br /&gt;
the two streams are aligned using content common to the two documents, &lt;br /&gt;
then merged.&lt;br /&gt;
&lt;br /&gt;
= Technical features =&lt;br /&gt;
&lt;br /&gt;
CR is written in Java and works with Java 1.4 and later.&lt;br /&gt;
 &lt;br /&gt;
It is run from the command line.&lt;br /&gt;
&lt;br /&gt;
= Properties =&lt;br /&gt;
 &lt;br /&gt;
Here are some properties I tried to achieve:&lt;br /&gt;
 &lt;br /&gt;
(1) Functions are filters: in order to process arbitrarily large corpora, the program is built on the SAX API.&lt;br /&gt;
(2) Dealing with intersecting hierarchies&lt;br /&gt;
(3) Using high-level languages on large documents (XPath, XSLT, XQuery)&lt;br /&gt;
(4) Using a low-level API&lt;br /&gt;
(5) Processing any document in the TEI scheme&lt;br /&gt;
 &lt;br /&gt;
== Functions are filters ==&lt;br /&gt;
&lt;br /&gt;
The design relies heavily on the SAX API. The &amp;quot;functions&amp;quot; of the program &lt;br /&gt;
are implemented as SAX filters. The program is mainly a collection of &lt;br /&gt;
SAX filters and a mechanism for plugging the filters into a pipeline.&lt;br /&gt;
 &lt;br /&gt;
Each filter is specialized in a precise task and takes few arguments. &lt;br /&gt;
This allows modularity and reusability. While each filter performs a &lt;br /&gt;
simple task, a pipeline of filters may achieve complex tasks.&lt;br /&gt;
 &lt;br /&gt;
The names of the filters (and, for some of them, their arguments), &lt;br /&gt;
which define the pipeline, are given to the program through a file &lt;br /&gt;
(an XML document, of course) called the &amp;quot;query document&amp;quot;. The URL of &lt;br /&gt;
this document is given to the program as an argument on the &lt;br /&gt;
command line.&lt;br /&gt;
 &lt;br /&gt;
A query document looks like:&lt;br /&gt;
 &lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;corpus.xml&amp;quot;&lt;br /&gt;
          outURI=&amp;quot;sample-manuel-1.out&amp;quot;&lt;br /&gt;
          /&amp;gt;&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;filter name=&amp;quot;myFilterName&amp;quot; javaClass=&amp;quot;java.class.qualified.Name&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;args&amp;gt;&lt;br /&gt;
        &amp;lt;!--  Argument subtree, passed to the filter if any, after&lt;br /&gt;
              validation if a schema is known for this filter.&lt;br /&gt;
        --&amp;gt;&lt;br /&gt;
      &amp;lt;/args&amp;gt;&lt;br /&gt;
    &amp;lt;/filter&amp;gt;&lt;br /&gt;
    &amp;lt;!--  etc.: as many filters as needed  --&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
 &lt;br /&gt;
In some cases, filters can communicate directly with each other &lt;br /&gt;
(in addition to communicating through the stream of SAX events).&lt;br /&gt;
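As a sketch of what such a filter looks like, here is a minimal example using the standard Java SAX API. This is generic illustrative code, not actual CorpusReader source; the class name and the task (counting element start events) are invented for the example. It shows the design described above: a filter specialized in one simple task that forwards every event unchanged to the next stage of the pipeline.

```java
import java.io.StringReader;

import javax.xml.parsers.SAXParserFactory;

import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.XMLFilterImpl;

// A minimal SAX filter: it counts element start events and passes
// every event on to the next filter (or handler) in the pipeline.
public class CountingFilter extends XMLFilterImpl {
    private int count = 0;

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        count++;
        super.startElement(uri, localName, qName, atts); // forward the event
    }

    public int getCount() { return count; }

    public static void main(String[] args) throws Exception {
        CountingFilter filter = new CountingFilter();
        filter.setParent(SAXParserFactory.newInstance().newSAXParser().getXMLReader());
        filter.parse(new InputSource(new StringReader("<TEI><text><p/></text></TEI>")));
        System.out.println(filter.getCount() + " elements seen"); // prints "3 elements seen"
    }
}
```

Because the filter only overrides the events it cares about, it stays small and reusable, which is the point of the pipeline design.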
&lt;br /&gt;
== Dealing with intersecting hierarchies ==&lt;br /&gt;
&lt;br /&gt;
[TODO]&lt;br /&gt;
&lt;br /&gt;
== Using high-level languages on large documents (XPath, XSLT, XQuery) ==&lt;br /&gt;
&lt;br /&gt;
High-level languages (XSLT and XQuery) are made available: &lt;br /&gt;
the stream of XML events is buffered, transformed, and passed on &lt;br /&gt;
to the next filter in the pipeline as a stream of XML events.&lt;br /&gt;
 &lt;br /&gt;
When the corpus does not fit in memory, a mechanism allows one &lt;br /&gt;
to select the subtrees to be buffered and transformed &lt;br /&gt;
successively, one by one, as separate documents. This is the &lt;br /&gt;
&amp;quot;split&amp;quot; element in the query document:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split localName=&amp;quot;TEI&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
&lt;br /&gt;
There is also a way of selecting elements in the stream &lt;br /&gt;
through an XPath expression evaluated against each element, &lt;br /&gt;
one at a time, as if it were a stand-alone document. For instance, the query document above could be written as:&lt;br /&gt;
&lt;br /&gt;
&amp;lt;pre&amp;gt;&amp;lt;nowiki&amp;gt;&lt;br /&gt;
&amp;lt;query&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;header&amp;gt;&lt;br /&gt;
    &amp;lt;name&amp;gt;&amp;lt;/name&amp;gt;&lt;br /&gt;
    &amp;lt;date&amp;gt;&amp;lt;/date&amp;gt;&lt;br /&gt;
    &amp;lt;desc&amp;gt;&amp;lt;/desc&amp;gt;&lt;br /&gt;
  &amp;lt;/header&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;corpus inURI=&amp;quot;path/to/corpus&amp;quot; outURI=&amp;quot;path/to/output&amp;quot;&amp;gt;&amp;lt;/corpus&amp;gt;&lt;br /&gt;
&lt;br /&gt;
  &amp;lt;filterList&amp;gt;&lt;br /&gt;
    &amp;lt;split elxpath=&amp;quot;*[namespace-uri()='http://www.tei-c.org/ns/1.0' &lt;br /&gt;
                      and local-name()='div']&amp;quot;&amp;gt;&lt;br /&gt;
      &amp;lt;filterList&amp;gt;&lt;br /&gt;
        &amp;lt;filter name=&amp;quot;transform_my_div&amp;quot; javaClass=&amp;quot;tei.cr.filters.XSLT&amp;quot;&amp;gt;&lt;br /&gt;
          &amp;lt;args&amp;gt;&lt;br /&gt;
            &amp;lt;stylesheet URI=&amp;quot;path/to/stylesheet&amp;quot;&amp;gt;&amp;lt;/stylesheet&amp;gt;&lt;br /&gt;
          &amp;lt;/args&amp;gt;&lt;br /&gt;
        &amp;lt;/filter&amp;gt;&lt;br /&gt;
      &amp;lt;/filterList&amp;gt;&lt;br /&gt;
    &amp;lt;/split&amp;gt;&lt;br /&gt;
  &amp;lt;/filterList&amp;gt;&lt;br /&gt;
&amp;lt;/query&amp;gt;&lt;br /&gt;
&amp;lt;/nowiki&amp;gt;&amp;lt;/pre&amp;gt;&lt;br /&gt;
&lt;br /&gt;
== Using a low-level API ==&lt;br /&gt;
&lt;br /&gt;
I tried to make the program stand between a &amp;quot;tool&amp;quot; and an &amp;quot;API&amp;quot; &lt;br /&gt;
(a framework facilitating the use of a low-level API). &lt;br /&gt;
Thus, there are several ways of using it:&lt;br /&gt;
&lt;br /&gt;
* it provides directly usable functions for creating KWIC concordances, extracting subcorpora, computing co-occurrence matrices, merging markup, etc.&lt;br /&gt;
* it may be used as a way of applying XSLT / XQuery to large corpora&lt;br /&gt;
* it may be used for plugging in any SAX filter, and it reduces the complexity of SAX by letting the user write only the SAX handler code, while the program manages the parser, the pipeline, and the serialisation of the pipeline output back to disk. Any class implementing the XMLFilter interface may be plugged into the pipeline by giving the qualified name of the Java class &lt;br /&gt;
in the query document (the class must be on the CLASSPATH). &lt;br /&gt;
* it may be used for prototyping Java code by embedding, in the pipeline, Java code defining a SAX filter: &lt;br /&gt;
http://panini.u-paris10.fr/~sloiseau/CR/filtres/Script.html and http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#embeddedJava&lt;br /&gt;
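The division of labour described in the list above can be sketched with standard Java APIs (JAXP and SAX). This is not CorpusReader code; it is a generic illustration of the pattern where the framework owns the parser, the filter chain, and the serialisation, and the user contributes only an XMLFilter implementation (here a plain identity filter stands in for the user's class):

```java
import java.io.StringReader;
import java.io.StringWriter;

import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.stream.StreamResult;

import org.xml.sax.InputSource;
import org.xml.sax.XMLFilter;
import org.xml.sax.helpers.XMLFilterImpl;

public class PipelineSketch {
    public static void main(String[] args) throws Exception {
        // The user-supplied stage: any class implementing XMLFilter.
        // An identity filter is used here as a placeholder.
        XMLFilter filter = new XMLFilterImpl();

        // The framework side: a SAX parser feeding the filter chain...
        filter.setParent(SAXParserFactory.newInstance().newSAXParser().getXMLReader());

        // ...and serialisation of the pipeline output back to a stream,
        // using the JAXP identity transform as the serializer.
        StringWriter out = new StringWriter();
        TransformerFactory.newInstance().newTransformer().transform(
                new SAXSource(filter,
                        new InputSource(new StringReader("<doc><p>text</p></doc>"))),
                new StreamResult(out));
        System.out.println(out);
    }
}
```

Longer chains are built by setting each filter as the parent of the next; only the last filter is handed to the serializer.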
&lt;br /&gt;
The goal was to make a low-level API easy to use. &amp;quot;CorpusReader&amp;quot; &lt;br /&gt;
is named after &amp;quot;XMLReader&amp;quot;, the name of the class that parses a document &lt;br /&gt;
in the Java SAX API. It is intended to be a &amp;quot;layer&amp;quot; on top of SAX for &lt;br /&gt;
the TEI vocabulary.&lt;br /&gt;
&lt;br /&gt;
== Processing any document in the TEI scheme ==&lt;br /&gt;
&lt;br /&gt;
The program tries to rely only on the TEI scheme &lt;br /&gt;
(in fact it is already strongly tied to P5): &lt;br /&gt;
some structural properties of TEI documents are sometimes needed, &lt;br /&gt;
but I try not to make it rely on a specific TEI customisation. &lt;br /&gt;
(I would like to develop a mechanism for using the &amp;quot;TEI customization&amp;quot; &lt;br /&gt;
document produced by Roma to override the default vocabulary.)&lt;br /&gt;
 &lt;br /&gt;
= External Links =&lt;br /&gt;
 &lt;br /&gt;
* The site (mainly in French): http://panini.u-paris10.fr/~sloiseau/CR/&lt;br /&gt;
* Examples summing up the properties of CR, in approximate English: http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#concordances&lt;/div&gt;</summary>
		<author><name>Sloiseau</name></author>
		
	</entry>
</feed>