Split-teiCorpus

From TEIWiki

Summary

This is a quick XSLT stylesheet that splits a <teiCorpus> file filled with <TEI> elements into individual files, one per <TEI> element.

While one can do this on the Linux command line with a simple bash script or pipeline, XSLT 2.0 is my preferred method, since the input is already an XML document. An extra template is included to change an element along the way in case you want to do that.

The original post I made about it is at:

http://faqingperplxd.wordpress.com/2009/03/02/xslt-to-split-teicorpus-files-to-individual-parts/

Add any comments to the 'discussion' tab.


Run with something like:

saxon -o index.xml teiCorpusFile.xml Split-teiCorpus.xsl


Required Input

A file with the structure something like:

<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
    <teiHeader>
        <!-- Corpus header -->
    </teiHeader>

    <!-- Individual TEI file -->
    <TEI>
        <teiHeader>
            <!-- File Header -->
        </teiHeader>
        <text>
            <!-- text of document -->
        </text>
    </TEI>

    <!-- More TEI elements as needed -->

</teiCorpus>

That is, a <teiCorpus> element containing a <teiHeader> and multiple <TEI> elements. Getting one file per <TEI> element is fairly simple using <xsl:for-each> and <xsl:result-document>, but I also made the main output file contain a list filled with references to the files created.
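Stripped of the index, the <xsl:for-each> plus <xsl:result-document> idea can be sketched roughly as follows. This is a minimal illustration, not the stylesheet given further down: the named output format "tei-file", the use of position() for numbering, and the plain-text list of file names on the main output are assumptions made for the example.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
  xmlns:tei="http://www.tei-c.org/ns/1.0">

  <!-- Main output: plain text, one line per file written -->
  <xsl:output method="text"/>
  <!-- Named output format (name chosen for this sketch) used for the split files -->
  <xsl:output name="tei-file" method="xml" indent="yes"/>

  <xsl:template match="/tei:teiCorpus">
    <!-- Only direct TEI children are handled in this sketch -->
    <xsl:for-each select="tei:TEI">
      <!-- position() is the 1-based index within the for-each sequence -->
      <xsl:variable name="file"
        select="concat('file-', format-number(position(), '0000'), '.xml')"/>
      <!-- Write the current TEI element, unchanged, to its own file -->
      <xsl:result-document href="{$file}" format="tei-file">
        <xsl:copy-of select="."/>
      </xsl:result-document>
      <!-- Record the file name on the main output -->
      <xsl:value-of select="$file"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Relative href values in <xsl:result-document> are resolved against the location of the main output, so the split files should land next to whatever file is named with -o.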

Expected Output

An index.xml file with a list referencing each file created, and an individual file for each TEI element.
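For a corpus with two <TEI> children, the index.xml produced by the stylesheet below should look roughly like this. The titles ("Example Corpus", "First Document", "Second Document") are made up for illustration, and the exact whitespace depends on how the processor applies indent="yes".

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <!-- Copy of the corpus header -->
  </teiHeader>
  <text>
    <body>
      <head>Example Corpus: Individual Files</head>
      <list>
        <item>
          <ref target="file-0001.xml">
            <name type="file">file-0001.xml</name>
            <title>First Document</title>
          </ref>
        </item>
        <item>
          <ref target="file-0002.xml">
            <name type="file">file-0002.xml</name>
            <title>Second Document</title>
          </ref>
        </item>
      </list>
    </body>
  </text>
</TEI>
```

Alongside it, file-0001.xml and file-0002.xml each contain a copy of one <TEI> element.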

Known Restrictions or Problems

  • It assumes you want the files numbered sequentially
  • It assumes that you have a title in the corpus header
  • It is XSLT 2.0, so XSLT 1.0-only processors can’t handle it
  • Comments and processing instructions that are siblings or children of the root <teiCorpus> are summarily dropped
  • Nested <teiCorpus> elements are handled by processing their <TEI> children as if they were children of the root <teiCorpus>
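If sequential numbering is not what you want, one possible workaround is to name each file after the @xml:id of its <TEI> element (the approach mentioned in a comment in the stylesheet) and fall back to a number only when no @xml:id is present. The sketch below is an illustration under assumptions, not part of the original stylesheet: appending ".xml" to the id, the fallback logic, and the "tei-file" output name are all choices made for the example. It uses level="any" so that fallback numbers stay unique across nested <teiCorpus> elements.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
  xmlns:tei="http://www.tei-c.org/ns/1.0">

  <!-- Main output: plain text list of the files written -->
  <xsl:output method="text"/>
  <!-- Named output format (name chosen for this sketch) for the split files -->
  <xsl:output name="tei-file" method="xml"/>

  <xsl:template match="/tei:teiCorpus">
    <xsl:for-each select=".//tei:TEI">
      <!-- Number counted over every TEI element in the document,
           so nested corpora do not restart at 0001 -->
      <xsl:variable name="seq">
        <xsl:number level="any" count="tei:TEI" format="0000"/>
      </xsl:variable>
      <!-- Assumption: prefer @xml:id when present, otherwise use the number -->
      <xsl:variable name="file"
        select="if (@xml:id)
                then concat(@xml:id, '.xml')
                else concat('file-', $seq, '.xml')"/>
      <xsl:result-document href="{$file}" format="tei-file">
        <xsl:copy-of select="."/>
      </xsl:result-document>
      <xsl:value-of select="$file"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

</xsl:stylesheet>
```

Since xml:id values must be unique within a document, the id-based names should not collide with each other, although an id that happens to look like "file-0003" could still clash with a fallback name.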

Stylesheet

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
  xmlns="http://www.tei-c.org/ns/1.0" xmlns:tei="http://www.tei-c.org/ns/1.0"
  exclude-result-prefixes="#all">

  <!-- 
    This file looks for a root element of teiCorpus in a TEI P5 XML file 
    and spits out an individual file for each TEI element it finds under that.
    It keeps an index of the files in the output file.
    
    It should be run something like:
    
    saxon -o index.xml teiCorpusFile.xml Split-teiCorpus.xsl
    
  -->

  <!-- Output should be indented and xml -->
  <xsl:output indent="yes" method="xml"/>

  <!-- Match a root teiCorpus element -->
  <xsl:template match="/tei:teiCorpus">
    <!-- Output for index.xml -->
    <TEI>
      <!-- copy the teiHeader -->
      <xsl:copy-of select="./tei:teiHeader"/>
      <text>
        <body>
          <head>
            <xsl:apply-templates select="./tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title[1]"/>
            <xsl:text>: Individual Files</xsl:text>
          </head>
          <list>
            <!-- Create a list, one item for each TEI element -->
            <xsl:for-each select=".//tei:TEI">
              <!-- Previously I had this using the @xml:id of each TEI element, but your needs might be different   -->
              <!-- <xsl:variable name="file"><xsl:value-of select="@xml:id"/></xsl:variable> -->
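              <!-- level="any" numbers TEI elements across the whole document, so nested teiCorpus elements do not produce duplicate file names -->
              <xsl:variable name="file">file-<xsl:number level="any" format="0000"/>.xml</xsl:variable>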
              <xsl:variable name="title">
                <xsl:apply-templates
                  select="./tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title[1]"/>
              </xsl:variable>
              <!-- Output one item per TEI file -->
              <item>
                <ref target="{$file}">
                  <name type="file">
                    <xsl:value-of select="$file"/>
                  </name>
                  <title>
                    <xsl:value-of select="$title"/>
                  </title>
                </ref>
              </item>
              <!-- Also while we are here output a result-document  -->
              <xsl:result-document href="{$file}">
                <xsl:copy-of select="."/>
              </xsl:result-document>
            </xsl:for-each>
          </list>
        </body>
      </text>
    </TEI>
  </xsl:template>

</xsl:stylesheet>
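The stylesheet above copies each <TEI> element verbatim with <xsl:copy-of>. If you want to change an element along the way, one common approach is to swap the copy for <xsl:apply-templates> in a dedicated mode, backed by an identity template plus an override for the element in question. The following is a sketch under assumptions rather than the author's template: the mode name "split", the "tei-file" output name, and the example change (turning <hi rend="italic"> into <emph>) are all hypothetical.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
  xmlns="http://www.tei-c.org/ns/1.0" xmlns:tei="http://www.tei-c.org/ns/1.0"
  exclude-result-prefixes="#all">

  <!-- Main output: plain text list of the files written -->
  <xsl:output method="text"/>
  <!-- Named output format (name chosen for this sketch) for the split files -->
  <xsl:output name="tei-file" method="xml"/>

  <xsl:template match="/tei:teiCorpus">
    <xsl:for-each select=".//tei:TEI">
      <xsl:variable name="file"
        select="concat('file-', format-number(position(), '0000'), '.xml')"/>
      <xsl:result-document href="{$file}" format="tei-file">
        <!-- Process the TEI element in "split" mode instead of copying it verbatim -->
        <xsl:apply-templates select="." mode="split"/>
      </xsl:result-document>
      <xsl:value-of select="$file"/>
      <xsl:text>&#10;</xsl:text>
    </xsl:for-each>
  </xsl:template>

  <!-- Identity template: copies every node and attribute unchanged by default -->
  <xsl:template match="@* | node()" mode="split">
    <xsl:copy>
      <xsl:apply-templates select="@* | node()" mode="split"/>
    </xsl:copy>
  </xsl:template>

  <!-- Hypothetical change: replace hi rend="italic" with emph, keeping its content -->
  <xsl:template match="tei:hi[@rend = 'italic']" mode="split">
    <emph>
      <xsl:apply-templates select="node()" mode="split"/>
    </emph>
  </xsl:template>

</xsl:stylesheet>
```

Because the identity template also copies comments and processing instructions inside each <TEI> element, only the overridden element should differ from what <xsl:copy-of> would have produced.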