= Summary =

CorpusReader (CR) is a tool for extracting sub corpus and quantitative
information from arbitrarly large corpora and corpora containing
milestoned annotation. It provides a functionality for merging
several XML documents together.

= Background =

CorpusReader was developped for quantitative corpus linguistics.
Its original goal was to provide a way for extracting quantitative
information (such as cooccurrency matrix) using all information
expressed in the XML infoset.

As a _quantitative_ corpus _linguistic_ tool, CorpusReader do not
have any _linguistic_ nor _statistical_ skill... but provide help
for (1) importing outputs of existing linguistic tools into a corpus,
and for (2) exporting quantitative data in statistical tool formats.
The rationale is that existing linguistic and statistical tools should
be reused and made easy to use in the context of a TEI corpus.
Thus CorpusReader is a bridge between linguistic and statistical
tools for creating and exploring empirically
complex data. It contains mechanism for merging the output
of an linguistic annotation tool into an already annotated corpus:
by merging two XML markups into one document without breaking well-formedness:
the two streams are aligned using a common content in the two documents,
then merged.

= technical feature =

CR is written in java, and work with Java 1.4 and further.

It is run at the command line.

= Properties =

Here are some properties I tried to make:

(1) Functions are filters: in order to process arbitrarly large corpora, the program is build on the SAX API.
(2) Dealing with intersecting hierarchies
(3) Using high level language on large document (XPath, XSLT, XQuery)
(4) Using a low level API
(5) Process all document in the TEI scheme

== Functions are filters ==

The design rely heavily on the SAX API. The "functions" of the program
are implemented as SAX filters. The program is mainly a collection of
SAX filters and a mechanism for pluging the filters into a pipeline.

Each filter is specialized into a precise task and take few argument.
This allows modularity and reusability. While each filter performs a
simple task, the pipeline of filters may achieve complexe tasks.

The names of the filters (and their arguments for some of them),
defining the pipeline, are given to the program through a file
(an XML document of course), called the "query document". The URL of
this document is given as an argument to the program, at the
command line.

A query document looks like:

<pre><nowiki>
<query>
<header>
<name></name>
<date></date>
<desc></desc>
</header>

<corpus inURI="corpus.xml"
outURI="sample-manuel-1.out"
/>
<filterList>
<filter name="myFilterName" javaClass="java.class.qualified.Name">
<args>

</args>
</filter>

</filterList>
</query>
</nowiki></pre>

In some cases, filters can communicate directly with each other
(in addition to communicating through the stream of SAX events).

== Dealing with intersecting hierarchies ==

[TODO]

== Using high level language on large document (XPath, XSLT, XQuery) ==

High level language (XSLT and XQuery) are made available:
the stream of XML events is bufferised, transformed, and throw back
to the next filter in the pipeline as a stream of XML events.

When the corpus does not fit in memory, a mechanism allows
to address the subtrees to be bufferised and transformed
successively, one by one, as separate document. It is the
"split" element in the query document:

<pre><nowiki>
<query>

<header>
<name></name>
<date></date>
<desc></desc>
</header>

<corpus inURI="path/to/corpus" outURI="path/to/output"></corpus>

<filterList>
<split localName="TEI">
<filterList>
<filter name="transform_my_div" javaClass="tei.cr.filters.XSLT">
<args>
<stylesheet URI="path/to/stylesheet"></stylesheet>
</args>
</filter>
</filterList>
</split>
</filterList>
</query>
</nowiki></pre>

There is also a way for addressing element in the stream
through XPath expression evaluated against each element,
one at one, as if they were a stand-alone document. For instance, the query document above could be written as:

<pre><nowiki>
<query>

<header>
<name></name>
<date></date>
<desc></desc>
</header>

<corpus inURI="path/to/corpus" outURI="path/to/output"></corpus>

<filterList>
<split elxpath="*[namespace-uri()='http://www.tei-c.org/ns/1.0'
and local-name()='div']">
<filterList>
<filter name="transform_my_div" javaClass="tei.cr.filters.XSLT">
<args>
<stylesheet URI="path/to/stylesheet"></stylesheet>
</args>
</filter>
</filterList>
</split>
</filterList>
</query>
</nowiki></pre>

== Using a low level API ==

I tried to make the program stand between a "tool" and an "API"
(a framework for facilitating the use of a low level API).
Thus, there may be different ways of using it:

* provide directly usable function for creating kwic, extracting subcorpus, computing cooccurrency matrix, merging markup, etc.
* may be used as a way of applying XSLT / XQuery on big corpora
* may be used for plugging any SAX filters; and reduce the complexity of SAX by allowing to write only a SAX handler code, while the program manage the parser, the pipeline, and the serialisation of the output of the pipeline back to disk. Any class implementing the XMLFilter interface may be plugged in the filter by providing the qualified name of the java class
in the query document (this class should be found in the CLASSPATH variable).
* may be used for prototyping java code by embedding in the pipeline java code defining a SAX filter:
http://panini.u-paris10.fr/~sloiseau/CR/filtres/Script.html and http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#embeddedJava

The goal was to make easy the use of a low-level API. "CorpusReader"
is nammed after "XMLReader", the name of the class parsing a document
in the java SAX API. It is intended to be a "layer" onto SAX for
the TEI vocabulary.

== Process all document in the TEI scheme ==

The program tries to rely on the TEI scheme
(in fact it is already strongly associated with P5):
some structural properties of TEI document are sometime needed,
but I try not to make it relying on a specific TEI customisation.
(I would like to develop a mechanism for using the "TEI customization"
document produced by Roma for overriding the default vocabulary).

= External Links =

* The site (mainly in french): http://panini.u-paris10.fr/~sloiseau/CR/
* Exemples summing up the properties of CR in a sort of english: http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#concordances