CorpusReader

From TEIWiki
Latest revision as of 12:28, 9 July 2007

Summary

CorpusReader (CR) is a tool for extracting subcorpora, KWIC concordances, and quantitative information from arbitrarily large corpora in the TEI vocabulary. It aims to provide ways of processing corpora containing milestoned annotation, and it provides a mechanism for merging several XML documents together.

Background

CorpusReader was developed for quantitative corpus linguistics. Its original goal was to provide a way of extracting quantitative information (such as co-occurrence matrices) using all the information expressed in the XML infoset.

As a quantitative corpus linguistics tool, CorpusReader does not have any linguistic or statistical skills of its own... but it provides help for:

  • importing the outputs of existing linguistic tools (taggers, parsers, etc.) into a corpus,
  • exporting quantitative data in the formats of statistical tools (Matlab, R and DTM, for instance).

The rationale is that existing linguistic and statistical tools should be reused and made easy to use in the context of a TEI corpus. Thus CorpusReader tries to be a bridge between linguistic and statistical tools for creating and exploring empirically complex data.

Technical features

CR is written in Java and works with Java 1.4 and later. It relies on numerous high-quality, open-source external libraries.

It runs at the command line.

It is released under the BSD licence.

There is a web site: http://panini.u-paris10.fr/~sloiseau/CR/ (see the bottom of this page for more links).

Properties

Functions are filters

The program relies on a "streaming API": the document is processed as a stream rather than as a tree. The "functions" of the program are implemented as "filters" applied to this stream. The program is mainly a collection of filters and a mechanism for plugging the filters into a pipeline.

Each filter is specialized in one precise task and takes few arguments. This allows modularity and reusability: while each filter performs a simple task, a pipeline of filters may achieve complex tasks.

The names of the filters defining the pipeline (and, for some of them, their arguments) are given to the program through a file called the "query document" (an XML document, of course). The URL of this document is given as an argument to the program at the command line.

A query document looks like:

<query>
  <header>
    <name></name>
    <date></date>
    <desc></desc>
  </header>
 
  <corpus inURI="corpus.xml"
          outURI="sample-manuel-1.out"
          />
  <filterList>
    <filter name="myFilterName" javaClass="java.class.qualified.Name">
      <args>
        <!--  Argument subtree, passed to the filter if any, after
          validation if a schema is known for this filter.
        -->
      </args>
    </filter>
    <!--  etc.: as many filters as needed  -->
  </filterList>
</query>

In some cases, filters can communicate directly with each other (in addition to communicating through the stream of events).

Using high level languages on large documents (XPath, XSLT, XQuery)

High-level languages (XPath, XSLT and XQuery) are made available: the stream of XML events is buffered, transformed or queried, and thrown back to the next filter in the pipeline as a stream of XML events.
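As an illustration of this buffer-transform-replay technique (a sketch of the general mechanism, not CR's actual code), the standard JAXP API can wrap an XSLT stylesheet as a SAX filter, so that SAX events flow in, are transformed, and flow out again to the next stage:

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.SAXParserFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.sax.SAXSource;
import javax.xml.transform.sax.SAXTransformerFactory;
import javax.xml.transform.stream.StreamResult;
import javax.xml.transform.stream.StreamSource;
import org.xml.sax.InputSource;
import org.xml.sax.XMLFilter;

public class XsltAsSaxFilter {
    public static void main(String[] args) throws Exception {
        // A small stylesheet that renames <w> elements to <word>
        // and copies everything else unchanged.
        String xslt =
              "<xsl:stylesheet version='1.0'"
            + "    xmlns:xsl='http://www.w3.org/1999/XSL/Transform'>"
            + "  <xsl:template match='w'>"
            + "    <word><xsl:apply-templates/></word>"
            + "  </xsl:template>"
            + "  <xsl:template match='@*|node()'>"
            + "    <xsl:copy><xsl:apply-templates select='@*|node()'/></xsl:copy>"
            + "  </xsl:template>"
            + "</xsl:stylesheet>";

        // Wrap the transform as a SAX filter: it buffers the incoming
        // stream of SAX events, transforms it, and emits the result
        // as a stream of SAX events again.
        SAXTransformerFactory stf =
            (SAXTransformerFactory) SAXTransformerFactory.newInstance();
        XMLFilter filter = stf.newXMLFilter(
            new StreamSource(new StringReader(xslt)));

        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        filter.setParent(spf.newSAXParser().getXMLReader());

        // Serialize the filtered event stream back to text with an
        // identity transformer, as the last stage of the pipeline.
        Transformer serializer = TransformerFactory.newInstance().newTransformer();
        serializer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        StringWriter out = new StringWriter();
        serializer.transform(
            new SAXSource(filter,
                new InputSource(new StringReader("<s><w>Hello</w></s>"))),
            new StreamResult(out));

        System.out.println(out); // <s><word>Hello</word></s>
    }
}
```

The same wiring generalises to any stylesheet: the filter behaves like any other SAX stage, which is what makes XSLT usable inside an event pipeline.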

When the corpus does not fit in memory, a mechanism makes it possible to address the subtrees to be buffered and transformed successively, one by one, as separate documents. The split element in the query document divides the stream into sub-documents: each filter inside a split element sees the corpus as several documents rooted at the elements defined by split/@localName:

<query>

  <header>
    <name></name>
    <date></date>
    <desc></desc>
  </header>

  <corpus inURI="path/to/corpus" outURI="path/to/output"></corpus>

  <filterList>
    <split localName="TEI">
      <filterList>
        <filter name="transform_my_div" javaClass="tei.cr.filters.XSLT">
          <args>
            <stylesheet URI="path/to/stylesheet"></stylesheet>
          </args>
        </filter>
      </filterList>
    </split>
  </filterList>
</query>

There is also a way of addressing elements in the stream through an XPath expression evaluated against each element, one at a time, as if each were a stand-alone document. For instance, the query document above could be written as:

<query>

  <header>
    <name></name>
    <date></date>
    <desc></desc>
  </header>

  <corpus inURI="path/to/corpus" outURI="path/to/output"></corpus>

  <filterList>
    <split elxpath="*[namespace-uri()='http://www.tei-c.org/ns/1.0' 
                      and local-name()='div']">
      <filterList>
        <filter name="transform_my_div" javaClass="tei.cr.filters.XSLT">
          <args>
            <stylesheet URI="path/to/stylesheet"></stylesheet>
          </args>
        </filter>
      </filterList>
    </split>
  </filterList>
</query>

Several filters accept XPath syntax for addressing nodes.

Merging documents

CR contains a mechanism for merging an external document into an already annotated corpus without breaking well-formedness. This is useful for reusing the outputs of existing linguistic annotation tools. There is documentation at http://panini.u-paris10.fr/~sloiseau/CR/filtres/EncodingMerger.html (in French, and already somewhat dated).

Dealing with intersecting hierarchies

[TODO]

Using a low level API

I tried to make the program stand between a "tool" and an "API": it may be seen as a framework facilitating the use of a low-level API. Thus, there are different ways of using it:

  • it provides ready-to-use functions for creating KWIC concordances, extracting subcorpora, computing co-occurrence matrices, merging markup, etc.
  • it may be used more generally as a way of applying XSLT / XQuery to big corpora, whatever the vocabulary.
  • it may be used for plugging in arbitrary SAX filters, reducing the complexity of SAX by letting you write only the SAX handler code while the program manages the parser, the pipeline, and the serialisation of the pipeline output back to disk. Any class implementing the XMLFilter interface may be plugged into the pipeline by giving the qualified name of the Java class in the query document (the class must be found on the CLASSPATH), and it can interact with the existing filters.
  • it may be used for prototyping Java code by embedding Java code defining a SAX filter in the pipeline; see http://panini.u-paris10.fr/~sloiseau/CR/filtres/Script.html or http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#embeddedJava
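As a minimal sketch of what such a pluggable SAX filter looks like (the class name and the counted element are hypothetical, not part of CR), a filter is just a class extending XMLFilterImpl that overrides the callbacks it cares about and forwards every event downstream:

```java
import java.io.StringReader;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.XMLReader;
import org.xml.sax.helpers.XMLFilterImpl;

// Hypothetical filter: counts TEI <w> (word) elements while passing
// every SAX event through unchanged. A class like this one, since it
// implements the XMLFilter interface, could be named in the query
// document and plugged into a CR pipeline.
public class WordCountFilter extends XMLFilterImpl {
    private int count = 0;

    public int getCount() { return count; }

    @Override
    public void startElement(String uri, String localName, String qName,
                             Attributes atts) throws SAXException {
        if ("w".equals(localName)) {
            count++;
        }
        super.startElement(uri, localName, qName, atts); // forward downstream
    }

    public static void main(String[] args) throws Exception {
        SAXParserFactory spf = SAXParserFactory.newInstance();
        spf.setNamespaceAware(true);
        XMLReader parser = spf.newSAXParser().getXMLReader();

        WordCountFilter filter = new WordCountFilter();
        filter.setParent(parser); // the filter wraps the parser

        String doc = "<s><w>Hello</w> <w>world</w></s>";
        filter.parse(new InputSource(new StringReader(doc)));
        System.out.println(filter.getCount()); // prints 2
    }
}
```

Only the handler logic is written by hand; parsing, chaining, and serialisation stay the framework's business, which is the point of the bullet above.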

The goal was to make a low-level API easy to use. "CorpusReader" is named after "XMLReader", the interface for parsing a document in the Java SAX API. It is intended to be a "layer" on top of SAX for the TEI vocabulary.

Intended for documents in the TEI scheme

The program tries to rely on the TEI scheme: some structural properties of TEI documents are sometimes needed, but I try not to make it rely on a specific TEI customisation. (I would like to develop a mechanism for using the "TEI customization" document produced by Roma to override the default vocabulary.)

External Links

  • The site, with documentation (mainly in French): http://panini.u-paris10.fr/~sloiseau/CR/
  • Examples summing up the properties of CR in a sort of English: http://panini.u-paris10.fr/~sloiseau/CR/exemples.html#concordances
  • Download page: http://panini.u-paris10.fr/~sloiseau/CR/download.html
  • An archive containing the program and all the required external libraries: http://panini.u-paris10.fr/~sloiseau/CR/download/CR.zip
  • For questions (or advice!): sloiseau@u-paris10.fr

Categories: Tools | Querying tools | XQuery