Difference between revisions of "Oddbyexample"

From TEIWiki
Revision as of 01:19, 12 October 2012


Synopsis

This utility attempts to work out the minimal TEI customization needed to validate a collection of files. It is an XSLT (version 2) stylesheet which traverses a nominated directory tree looking for *.xml files which have <TEI> or <teiCorpus> root elements. It analyzes the collection of elements and attributes in the resulting corpus, and compares that to the whole of TEI P5. An ODD file is generated which:

  • loads the required modules
  • deletes any elements which are not used
  • deletes any attributes (including class attributes) which are not used by each element
  • for every attribute which has a TEI "data.enumerated" datatype, constructs a closed <valList> enumerating the values actually used.

From this you can construct a target schema.
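The inventory step described above can be illustrated outside XSLT. The following Python sketch is not the stylesheet's code; it only mimics the first phase (finding *.xml files with a <TEI> or <teiCorpus> root and recording which elements, attributes and attribute values occur). The function names inventory and local_name are made up for this illustration, and the comparison against TEI P5 is not shown.

```python
import os
import xml.etree.ElementTree as ET
from collections import defaultdict

TEI_NS = "http://www.tei-c.org/ns/1.0"


def local_name(tag):
    """Strip a '{namespace}' prefix from an ElementTree tag or attribute name."""
    return tag.rsplit("}", 1)[-1]


def inventory(corpus_dir):
    """Return {element name: {attribute name: set of values used}}.

    Only files whose root element is TEI or teiCorpus (in the TEI
    namespace) are read; elements in other namespaces are ignored.
    """
    used = defaultdict(lambda: defaultdict(set))
    roots = {"{%s}TEI" % TEI_NS, "{%s}teiCorpus" % TEI_NS}
    for dirpath, _dirs, files in os.walk(corpus_dir):
        for name in files:
            if not name.endswith(".xml"):
                continue
            try:
                root = ET.parse(os.path.join(dirpath, name)).getroot()
            except ET.ParseError:
                continue  # skip files that are not well-formed
            if root.tag not in roots:
                continue
            for el in root.iter():
                if not el.tag.startswith("{%s}" % TEI_NS):
                    continue
                # Looking the element up records it even if it has no attributes.
                attrs = used[local_name(el.tag)]
                for attr, value in el.attrib.items():
                    attrs[local_name(attr)].add(value)
    return used
```

The sets of attribute values collected this way correspond to the material from which a closed <valList> could be built; elements and attributes absent from the result are the candidates for deletion.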

Features

User commentary

Please sign all comments.


System requirements

Memory capacity is likely to be an issue for large corpora. The stylesheet will not be able to read a giant corpus unless you have a great deal of memory to assign to Java. In that situation, construct a smaller corpus of representative sample documents and work with that. After generating a schema, validate your entire corpus; each time you find an invalid document, add it to the smaller corpus and start again.
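The sample-and-revalidate cycle can be written as a small loop. This Python sketch only shows the control flow; build_schema and is_valid are hypothetical callables that you would have to supply yourself (for example, wrappers around running the stylesheet and a schema validator). Nothing here is part of oddbyexample.

```python
def grow_sample(corpus, sample, build_schema, is_valid):
    """Grow a sample corpus until its schema validates the whole corpus.

    corpus       -- all documents
    sample       -- the initial representative documents
    build_schema -- hypothetical callable: list of documents -> schema
    is_valid     -- hypothetical callable: (schema, document) -> bool
    """
    sample = list(sample)
    while True:
        schema = build_schema(sample)
        invalid = [doc for doc in corpus
                   if doc not in sample and not is_valid(schema, doc)]
        if not invalid:
            return sample, schema
        # Add one failing document to the sample and start again.
        sample.append(invalid[0])
```

Each pass adds one document that is not yet in the sample, so the loop ends after at most as many passes as there are documents in the corpus.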

Source code and licensing

Open source.

Support for TEI

Limitations: the utility does not attempt any of the following:

  • deriving simplified content models (beyond what Roma already does)
  • adding new elements and deriving a content model
  • dealing with non-TEI namespaces
  • generating attribute datatypes with complex regexps
  • working out Schematron constraints etc

Language(s)

XSLT

Documentation

The script assumes that the TEI package is installed and provides the file "/usr/share/xml/tei/odd/p5subset.xml". If you don't have that file, download http://www.tei-c.org/release/xml/tei/odd/p5subset.xml, save it somewhere locally, and add a "tei" parameter pointing at it.

Here's a sample command to run it:

saxon -o my.odd oddbyexample.xsl oddbyexample.xsl corpus=/wherever/you/have/yourfiles/
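If you drive the tool from a script, the command line can be assembled programmatically. This Python sketch reuses the sample command above and assumes that the "tei" parameter is passed in the same name=value form as "corpus"; the function name build_command is made up for this illustration.

```python
def build_command(corpus, output="my.odd", tei=None,
                  stylesheet="oddbyexample.xsl"):
    """Build the argument list for the sample saxon invocation.

    The stylesheet path appears twice, exactly as in the sample command.
    tei, if given, is the path to a local copy of p5subset.xml
    (assumption: it is passed as tei=PATH, like corpus=PATH).
    """
    cmd = ["saxon", "-o", output, stylesheet, stylesheet,
           "corpus=" + corpus]
    if tei is not None:
        cmd.append("tei=" + tei)
    return cmd
```

The resulting list can be handed to subprocess.run(cmd, check=True), provided a saxon wrapper script is on your PATH.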

Tech support

User community

Sample implementations

Current version number and date of release

History of versions

How to download or buy

Grab getfiles.xsl and oddbyexample.xsl from Sourceforge (http://tei.svn.sourceforge.net/viewvc/tei/trunk/Stylesheets/tools/oddbyexample.xsl)

Additional notes