Difference between revisions of "Split-teiCorpus"

From TEIWiki

Revision as of 03:49, 10 April 2011

Summary

This is a quick XSLT stylesheet to split a <teiCorpus> file, filled with <TEI> elements, into individual files.

While one can do this on the Linux command line with a simple bash script or pipeline, XSLT2 is my preferred method, since the document is already XML. An extra template is included to change an element along the way, in case you want to do that.

The original post I made about it is at:

http://faqingperplxd.wordpress.com/2009/03/02/xslt-to-split-teicorpus-files-to-individual-parts/

Add any comments to the 'discussion' tab.


Run with something like:

saxon -o index.xml teiCorpusFile.xml Split-teiCorpus.xsl


Required Input

A file with the structure something like:

<teiCorpus xmlns="http://www.tei-c.org/ns/1.0">
    <teiHeader>
        <!-- Corpus header -->
    </teiHeader>

    <!-- Individual TEI file -->
    <TEI>
        <teiHeader>
            <!-- File Header -->
        </teiHeader>
        <text>
            <!-- text of document -->
        </text>
    </TEI>

    <!-- More TEI elements as needed -->

</teiCorpus>

That is, a <teiCorpus> element containing a <teiHeader> and multiple <TEI> elements. Getting one file per <TEI> element is fairly simple using <xsl:for-each> and <xsl:result-document>, but I also made the main output file contain a list of references to the files created.
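To show just the splitting step in isolation, here is a minimal sketch that leaves out the index and the element-changing template. It is not the stylesheet from this page: the `part-N.xml` naming scheme is a hypothetical choice for illustration, and it only looks at <TEI> elements that are direct children of the root <teiCorpus>.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
    xmlns:tei="http://www.tei-c.org/ns/1.0">

    <xsl:output method="xml" indent="yes"/>

    <!-- For each TEI child of the root teiCorpus, write one result document -->
    <xsl:template match="/tei:teiCorpus">
        <xsl:for-each select="tei:TEI">
            <!-- Hypothetical naming scheme: part-1.xml, part-2.xml, ... -->
            <xsl:result-document href="part-{position()}.xml">
                <!-- Deep copy of the TEI element, unchanged -->
                <xsl:copy-of select="."/>
            </xsl:result-document>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>
```

Because this sketch writes nothing to the principal result tree, the main output is empty; the full stylesheet below uses that main output for the index instead.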

Expected Output

An index.xml file with a list referencing each file created, and an individual file (file-0001.xml, file-0002.xml, and so on) for each <TEI> element.
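For a corpus with two <TEI> elements, the index.xml should have roughly the following shape. The two titles are made-up placeholders, and the exact whitespace and XML declaration depend on the serializer, so treat this as an illustration rather than verbatim output.

```xml
<TEI xmlns="http://www.tei-c.org/ns/1.0">
    <teiHeader>
        <!-- Corpus header, copied from the input -->
    </teiHeader>
    <text>
        <div>
            <head>Individual Files</head>
            <list>
                <item>
                    <ref href="file-0001.xml">file-0001.xml: First Title</ref>
                </item>
                <item>
                    <ref href="file-0002.xml">file-0002.xml: Second Title</ref>
                </item>
            </list>
        </div>
    </text>
</TEI>
```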

Known Restrictions or Problems

  • It assumes you want the files numbered sequentially
  • It assumes that you have a title in the corpus header
  • It is XSLT2, so XSLT1-only processors can’t handle it
  • Comments and processing instructions that are siblings or children of the root <teiCorpus> are summarily dropped
  • Nested <teiCorpus> elements are handled by processing their <TEI> children as if they were children of the root <teiCorpus>
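If sequential numbering does not suit you, one possible variation is to name each file after the @xml:id of its <TEI> element and fall back to a number when there is none. The sketch below is a standalone stylesheet under that assumption, without the index; the fallback pattern `file-0000` simply mirrors the numbering used on this page, and nothing here guards against an @xml:id that happens to look like one of the fallback names.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
    xmlns:tei="http://www.tei-c.org/ns/1.0">

    <xsl:template match="/tei:teiCorpus">
        <xsl:for-each select=".//tei:TEI">
            <!-- Prefer the element's own xml:id; otherwise use its
                 zero-padded position in the selected sequence -->
            <xsl:variable name="file"
                select="if (@xml:id)
                        then concat(@xml:id, '.xml')
                        else concat('file-', format-number(position(), '0000'), '.xml')"/>
            <xsl:result-document href="{$file}" method="xml" indent="yes">
                <xsl:copy-of select="."/>
            </xsl:result-document>
        </xsl:for-each>
    </xsl:template>
</xsl:stylesheet>
```

Since @xml:id values must be unique within a document, the id-based names will not collide with each other.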

Stylesheet

<?xml version="1.0" encoding="UTF-8"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="2.0"
    xmlns="http://www.tei-c.org/ns/1.0" xmlns:tei="http://www.tei-c.org/ns/1.0"
    exclude-result-prefixes="#all">

    <!-- 
    This file looks for a root element of teiCorpus in a TEI P5 XML file 
    and spits out an individual file for each TEI element it finds under that.
    It keeps an index of the files in the output file.
    
    It should be run something like:
    
    saxon -o index.xml teiCorpusFile.xml Split-teiCorpus.xsl
    
    -->

    <!-- Output should be indented and xml -->
    <xsl:output indent="yes" method="xml"/>

    <!-- Match a root teiCorpus element -->
    <xsl:template match="/tei:teiCorpus">
        <!-- Output for index.xml -->
        <TEI>
            <!-- Only apply-templates to teiHeader, which ends up copying it -->
            <xsl:apply-templates select="./tei:teiHeader"/>
            <text>
                <div>
                    <head>Individual Files</head>
                    <list>
                        <!-- Create a list one item for each TEI element -->
                        <xsl:for-each select=".//tei:TEI">
                            <!-- Previously I had this using the @xml:id of each TEI element, but your needs might be different   -->
                            <!-- <xsl:variable name="file"><xsl:value-of select="@xml:id"/></xsl:variable> -->
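                            <!-- Number by position in the selected sequence, so that TEI elements inside a nested teiCorpus do not restart at 0001 and collide with earlier files -->
                            <xsl:variable name="file">file-<xsl:number value="position()" format="0000"
                                />.xml</xsl:variable>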
                            <xsl:variable name="title">
                                <xsl:apply-templates
                                    select="./tei:teiHeader/tei:fileDesc/tei:titleStmt/tei:title[1]"
                                />
                            </xsl:variable>
                            <!-- Output one item per TEI file -->
                            <item>
                                <ref href="{$file}">
                                    <xsl:value-of select="concat($file, ': ', $title)"/>
                                </ref>
                            </item>
                            <!-- Also while we are here output a result-document  -->
                            <xsl:result-document href="{$file}">
                                <xsl:copy>
                                    <xsl:apply-templates select="@*|node()"/>
                                </xsl:copy>
                            </xsl:result-document>
                        </xsl:for-each>
                    </list>
                </div>
            </text>
        </TEI>
    </xsl:template>

    <!-- Default action: if in doubt, copy the node and recurse (node() already covers comments and processing instructions) -->
    <xsl:template match="@*|node()" priority="-1">
        <xsl:copy>
            <xsl:apply-templates select="@*|node()"/>
        </xsl:copy>
    </xsl:template>

    <!-- If you want to change one of the elements along the way...  -->
    <xsl:template match="tei:fooBar">
        <seg type="fooBar"><!-- Just an example of how to change something else --></seg>
    </xsl:template>
</xsl:stylesheet>