Split-teiCorpus

Summary
This is a quick XSLT stylesheet that splits a <tt>&lt;teiCorpus></tt> file, filled with <tt>&lt;TEI></tt> elements, into individual files.

While one can actually do this on the Linux command line with a simple bash script or pipeline, XSLT2 is my preferred method, since the document is already XML. An extra template is included to change an element along the way, in case you want to do that.

The original post I made about it is at:

http://faqingperplxd.wordpress.com/2009/03/02/xslt-to-split-teicorpus-files-to-individual-parts/

Add any comments to the 'discussion' tab.

Run with something like: saxon -o index.xml teiCorpusFile.xml Split-teiCorpus.xsl

Required Input
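A file with a structure something like:

  <teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
    <teiHeader> ... </teiHeader>
    <TEI> ... </TEI>
    <TEI> ... </TEI>
  </teiCorpus>

That is, a <tt>&lt;teiCorpus></tt> element containing a <tt>&lt;teiHeader></tt> and multiple <tt>&lt;TEI></tt> elements. Getting one file per <tt>&lt;TEI></tt> element is fairly simple using <tt>&lt;xsl:for-each></tt> and <tt>&lt;xsl:result-document></tt>, but I also made the main output file contain a list filled with references to the files created.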

Expected Output
An index.xml file with a list referencing each file created, and an individual file for each TEI element.
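
For illustration, the index for a corpus of two documents might look roughly like the sketch below. The two titles are placeholders, and the exact list markup (an item holding a ref with a target attribute) is an assumption about what the stylesheet emits, not captured output.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <!-- copied from the corpus-level teiHeader -->
  </teiHeader>
  <text>
    <body>
      <div>
        <head>Individual Files</head>
        <list>
          <!-- one entry per file written; titles here are placeholders -->
          <item><ref target="file-0001.xml">First document title</ref></item>
          <item><ref target="file-0002.xml">Second document title</ref></item>
        </list>
      </div>
    </body>
  </text>
</TEI>
```

The file-0001.xml and file-0002.xml files would sit next to index.xml, each holding one TEI element from the corpus.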

Known Restrictions or Problems

 * It assumes you want the files numbered sequentially
 * It assumes that you have a title in the corpus header
 * It is XSLT2, so XSLT1-only processors can’t handle it
 * Comments and processing instructions that are siblings or children of the root <tt>&lt;teiCorpus></tt> are summarily dropped
 * Nested <tt>&lt;teiCorpus></tt> elements are handled by processing their <tt>&lt;TEI></tt> children as if they were children of the root <tt>&lt;teiCorpus></tt>
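
If sequential numbering is not what you want, one possible variation is to name each file after the xml:id of its TEI element. The following is only a minimal sketch, not part of the stylesheet on this page: the naming rule and the variable name are hypothetical, it assumes that some TEI elements carry an xml:id, and it does not build the index list.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet version="2.0"
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:tei="http://www.tei-c.org/ns/1.0">

  <xsl:output method="xml" indent="yes"/>

  <!-- One output file per TEI element, at any depth of teiCorpus nesting -->
  <xsl:template match="/">
    <xsl:for-each select="//tei:TEI">
      <!-- Hypothetical naming rule: use @xml:id when present,
           otherwise fall back to a zero-padded position -->
      <xsl:variable name="name"
          select="if (@xml:id) then string(@xml:id)
                  else format-number(position(), '0000')"/>
      <xsl:result-document href="{$name}.xml">
        <xsl:copy-of select="."/>
      </xsl:result-document>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Because an xml:id cannot begin with a digit, the two naming schemes should not collide with each other.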

Stylesheet
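<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
  xmlns="http://www.tei-c.org/ns/1.0" xmlns:tei="http://www.tei-c.org/ns/1.0"
  exclude-result-prefixes="#all">

<xsl:output indent="yes" method="xml"/>

<!-- Main output: an index document listing every file written -->
<xsl:template match="/tei:teiCorpus">
  <TEI>
    <xsl:apply-templates select="./tei:teiHeader"/>
    <text><body><div>
      <head>Individual Files</head>
      <list>
        <xsl:for-each select=".//tei:TEI">
          <!-- position() keeps numbers unique even for nested teiCorpus elements -->
          <xsl:variable name="file">file-<xsl:number value="position()" format="0000"/>.xml</xsl:variable>
          <xsl:variable name="title">
            <xsl:apply-templates select="./tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title[1]"/>
          </xsl:variable>
          <!-- Index entry linking to the file (item/ref markup is an assumption) -->
          <item><ref target="{$file}"><xsl:value-of select="$title"/></ref></item>
          <!-- Write this TEI element to its own file -->
          <xsl:result-document href="{$file}">
            <xsl:copy>
              <xsl:apply-templates select="@*|node()|comment()"/>
            </xsl:copy>
          </xsl:result-document>
        </xsl:for-each>
      </list>
    </div></body></text>
  </TEI>
</xsl:template>

<!-- Identity transform: copy everything not matched elsewhere -->
<xsl:template match="@*|node()|comment()" priority="-1">
  <xsl:copy>
    <xsl:apply-templates select="@*|node()|comment()"/>
  </xsl:copy>
</xsl:template>

<!-- Example of changing an element along the way -->
<xsl:template match="tei:fooBar">
  <seg type="fooBar"><xsl:apply-templates/></seg>
</xsl:template>

</xsl:stylesheet>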