Unshy.xslt

From TEIWiki

Revision as of 14:35, 3 August 2008

A stylesheet to read in TEI (or other XML) files and remove line-end soft hyphens. See the header comment for details.
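
As a rough illustration of the intended effect, here is a small sample input. The `<p>` and `<lb/>` element names and the sample text are made up for this sketch. Any XML with the soft hyphen (U+00AD) as the last non-whitespace character of a text node should behave the same way.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- hypothetical test input: each soft hyphen ends its text node -->
<p>The soft hy&#x00AD;
<lb/>phen is re&#x00AD;
<lb/>moved here.</p>
```

Tracing the templates by hand, an XSLT 1.0 processor (for example `xsltproc`) should produce something like the following, apart from the XML declaration. The second half of each broken word is pulled up in front of the intervening `<lb/>`, and a newline is written after the rejoined word.

```xml
<p>The soft hyphen
<lb/>is removed
<lb/>here.</p>
```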

<?xml version="1.0" encoding="UTF-8"?>
<!-- unshy.xslt -->
<!-- Reads in an XML file, writes out the same file with end-of-line soft hyphens -->
<!-- removed. Any whitespace between the soft hyphen and the next text node is -->
<!-- also removed. The markup between the soft hyphen and the next text node is -->
<!-- inserted immediately before the first whitespace character of the next text node -->
<!-- (or after it, if it has no internal whitespace). -->
<!-- Written 2008-07-24/26 by Syd Bauman -->
<!-- Copyleft 2008 Syd Bauman and the Brown University Women Writers Project -->
<!--  -->
<!-- Known issues (some would consider them bugs, others would say features): -->
<!-- * If there is more than one shy in a given text node, we end up -->
<!--   doing the wrong thing: deleting all the text after the *first* -->
<!--   shy, not the last. However, in this case we issue a warning -->
<!--   message, and since this case is never supposed to happen at the -->
<!--   WWP, we'd prefer the warning to better behavior. -->
<!-- * If the shy is not the last non-whitespace character, we end up -->
<!--   doing the wrong thing: deleting all the text after the shy, rather -->
<!--   than just any following whitespace. But again, this situation is -->
<!--   never supposed to happen at the WWP, so we'd prefer the warning to -->
<!--   doing the right thing. -->
<!-- * If the shy is the last text character in the document, it is silently -->
<!--   removed. It should probably be removed, but IMHO a warning would be a -->
<!--   good idea. -->
<!-- In truth, the right thing to do might be to validate that shys are placed -->
<!-- appropriately with some other software (like a Schematron schema), and then -->
<!-- in this stylesheet test for shy only at end-of-line (except for whitespace). -->
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:tei="http://www.tei-c.org/ns/1.0" >

  <!-- housekeeping -->
  <xsl:output encoding="UTF-8" omit-xml-declaration="no" method="xml"/>
  
  <!-- start at root, match any children (i.e., root element, probably /TEI or /tei:TEI, -->
  <!-- and any comments or PIs outside the root element) -->
  <xsl:template match="/">
    <xsl:apply-templates/>
  </xsl:template>
  
  <!-- For any node (element, attribute, text, PI, or comment) that is not -->
  <!-- matched more specifically below, copy it to itself. Thus, except for -->
  <!-- the templates below, this stylesheet is the identity transform. -->
  <xsl:template match="@*|node()">
    <xsl:copy>
      <xsl:apply-templates select="@*|node()"/>
    </xsl:copy>
  </xsl:template>
  
  <!-- Match (or handle when called explicitly) any text node that contains -->
  <!-- a soft hyphen. -->
  <xsl:template match="text()[contains(.,'&#x00AD;')]" name="hasShy">
    <!-- Should we be stripping off the first word of this text node? -->
    <xsl:param name="strip1st">false</xsl:param> <!-- not unless asked -->
    <!-- set the text to be processed (putting it in variable $me) -->
    <xsl:variable name="me">
      <xsl:choose>
        <!-- if we've been asked to strip off the first word, -->
        <xsl:when test="$strip1st = 'true'">
          <!-- do so -->
          <xsl:choose>
            <!-- if there's a blank, -->
            <xsl:when test="contains(normalize-space(.),' ')">
              <!-- take everything after it -->
              <xsl:value-of select="substring-after(normalize-space(.),' ')"/>
            </xsl:when>
            <xsl:otherwise><!-- output nothing if there is only one word --></xsl:otherwise>
          </xsl:choose>
        </xsl:when>
        <!-- else just normalize space, so things are handled evenly -->
        <xsl:otherwise><xsl:value-of select="normalize-space(.)"/></xsl:otherwise>
      </xsl:choose>
    </xsl:variable>
    <!-- get the string content of the next text node (put it in variable $nextText) -->
    <xsl:variable name="nextText">
      <xsl:value-of select="normalize-space(following-sibling::text()[1])"/>
    </xsl:variable>
    <!-- some rudimentary error-checking -->
    <xsl:choose>
      <xsl:when test="contains(substring-after($me,'&#x00AD;'),'&#x00AD;')">
        <xsl:message>Oh dear. More than 1 &amp;shy; in this text sequence, I'm not going to get this one right.</xsl:message>
      </xsl:when>
      <!-- $me is space-normalized, so anything left after the shy is real text -->
      <xsl:when test="substring-after($me,'&#x00AD;') != ''">
        <xsl:message>Oh dear. This &amp;shy; is not the last (non-whitespace) character of the text node. I'm probably going to mess this up.</xsl:message>
      </xsl:when>
    </xsl:choose>
    <!-- output myself up to, but not including, shy -->
    <xsl:value-of select="substring-before($me,'&#x00AD;')"/>
    <!-- first token of next text node -->
    <xsl:choose>
      <!-- if there is a blank, -->
      <xsl:when test="contains($nextText,' ')">
        <!-- take chars up to it -->
        <xsl:value-of select="substring-before($nextText,' ')"/>
      </xsl:when>
      <xsl:otherwise>
        <!-- else just take entire text string -->
        <xsl:value-of select="$nextText"/>
      </xsl:otherwise>
    </xsl:choose>
    <!-- a newline to separate this word from the rest -->
    <xsl:text>
</xsl:text>
  </xsl:template>

  <!-- Now match text nodes for which the preceding text node had a shy -->
  <!-- we want this template to be higher priority than 'hasShy', so that -->
  <!-- when we hit a line that meets both criteria (has a shy, previous -->
  <!-- text node had a shy), we come here first -->
  <xsl:template match="text()[preceding-sibling::text()[1][contains(.,'&#x00AD;')]]" priority="1">
    <!-- preceding text node ended in shy, so it was caught by the hasShy template, -->
    <!-- which grabbed and printed out the first token of this text node. So our -->
    <!-- goal is to print out the rest. But if this node also ends in a shy, we have -->
    <!-- to process it accordingly, too. -->
    <xsl:choose>
      <xsl:when test="contains(.,'&#x00AD;')">
        <xsl:call-template name="hasShy">
          <xsl:with-param name="strip1st">true</xsl:with-param>
        </xsl:call-template>
      </xsl:when>
      <xsl:otherwise>
        <xsl:value-of select="substring-after(normalize-space(.),' ')"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
  
</xsl:stylesheet>

I make no claim that this is even a good, let alone the best, way to do this. It did work on my test files, though.
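
The header comment suggests checking soft-hyphen placement with separate software such as a Schematron schema. A minimal sketch of such a check might look like the one below. It is not part of the original stylesheet and is untested. It assumes ISO Schematron with an XPath 1.0 query binding, and the rule and its message wording are invented for illustration. The rule context is the parent element rather than the text node itself, because ISO Schematron rule contexts do not include text nodes.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- Hypothetical companion schema: flags any text node in which -->
<!-- something other than whitespace follows a soft hyphen.      -->
<sch:schema xmlns:sch="http://purl.oclc.org/dsdl/schematron">
  <sch:pattern>
    <!-- fire on every element that has a text child containing a shy -->
    <sch:rule context="*[text()[contains(., '&#x00AD;')]]">
      <!-- after space normalization, nothing may follow the first shy; -->
      <!-- this also catches a second shy in the same text node          -->
      <sch:assert test="not(text()[substring-after(normalize-space(.), '&#x00AD;') != ''])">
        A soft hyphen should occur at most once per text node, and only
        as its last non-whitespace character.
      </sch:assert>
    </sch:rule>
  </sch:pattern>
</sch:schema>
```

With a check like this run first, the two warning cases listed under "Known issues" would be reported before the stylesheet ever sees the file.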