DownloadP4toP5.bash
This is a shell script that reads the category page listing the stylesheets for P4 to P5 conversion, follows each of the links from that page to the articles in the category, downloads each of those article pages, extracts the actual stylesheet code from the downloaded page, and saves it under the expected filename.
To use this program, copy-and-paste the following code into a file in an otherwise empty directory. The new file can be named whatever you like; here we assume it is called “downloadP4toP5.bash”. Ensure that this new file is executable (e.g., `chmod a+x downloadP4toP5.bash`). With the current working directory set to the directory that “downloadP4toP5.bash” is in, execute it (e.g., `./downloadP4toP5.bash`). No parameters or switches are required, but an internet connection is.
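The script below names each output file after the last path component of its URL using bash parameter expansion. A quick illustration of that idiom, using the real index-page URL from the script (any URL would do):

```shell
# ${f##*/} deletes the longest prefix matching the pattern */ , i.e.
# everything up to and including the last slash, leaving only the
# final path component.
f="http://www.tei-c.org/wiki/index.php/Category:P4toP5"
echo "${f##*/}"   # → Category:P4toP5
```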
#! /bin/bash
# downloadP4toP5.bash
# Extract the code from all of the articles listed in the P4toP5 category
# of the TEI wiki.
#
# Written 2006-10-05 by Syd Bauman
# Copyright 2006 by Syd Bauman and the Text Encoding Initiative Consortium
# See near bottom of file for complete copyleft notice
#
# This routine has been tested in Debian GNU/Linux and in Mac OS X.
# It may work in other environments, it may not. It downloads the
# files to its own directory, which had better be writeable or you'll
# see a lot of error messages and get no results.
#
# Limitations / Future Plans
# ----------- - ------ -----
# Currently, this routine requires that you have `xsltproc' and either
# `wget' or `curl' in your PATH. Perhaps in the future we should make
# it operate with other XSLT engines and URL fetchers.
#
# Should add a -d switch which will cause some tracing and the _temp
# files to remain; otherwise they should be erased.
#
# Warning
# -------
# This program is really quite fragile. It's really just a hack, and
# it will break if anything on the wiki changes in any significant
# way. E.g., if someone were to add a new article in which the
# stylesheet code itself were not the 1st <pre>, we'd mess up on that
# particular stylesheet.

#
# error subroutine
#
die() {
    echo
    echo "ERROR: $@."
    D=`date "+%Y-%m-%d %H:%M:%S"`
    echo "This was a fatal error. $D"
    exit 1
}

#
# subroutine to retrieve web page
#
if which wget > /dev/null
then
    get() {
        wget --no-directories --output-document="$2" "$1"
    }
elif which curl > /dev/null
then
    get() {
        curl --silent "$1" > "$2"
    }
else
    die "I can't find a command with which to download web files, so I'm giving up"
fi

#
# create our XSLT file for extracting the list of stylesheets
#
cat > extractList.xslt <<EOEL
<?xml version="1.0" encoding="UTF-8"?>
<!-- temporary stylesheet for use by shell script that  -->
<!-- sucks P4 to P5 stylesheets from wiki. This routine -->
<!-- reads in the P4 to P5 index page and produces a    -->
<!-- space-separated list of the links to articles con- -->
<!-- tained therein.                                    -->
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:x="http://www.w3.org/1999/xhtml"
    version="1.0">
  <xsl:output method="text"/>
  <xsl:template match="/">
    <!-- ignore everything but <table> elements -->
    <xsl:apply-templates select="//x:table"/>
  </xsl:template>
  <!-- ignore the <table> that's for the table of contents, too -->
  <xsl:template match="x:table[@id='toc']"/>
  <!-- the only non-TOC table should be the list of articles -->
  <xsl:template match="x:table">
    <!-- for each descendant link ... -->
    <xsl:for-each select=".//x:a">
      <!-- ... write out the target ... -->
      <xsl:value-of select="@href"/>
      <!-- ... and a blank to separate. -->
      <xsl:text> </xsl:text>
    </xsl:for-each>
  </xsl:template>
</xsl:stylesheet>
EOEL

#
# create our XSLT file for extracting code from a page
#
cat > extractCode.xslt <<EOEC
<?xml version="1.0" encoding="UTF-8"?>
<!-- temporary stylesheet for use by shell script that -->
<!-- sucks P4 to P5 stylesheets from wiki.             -->
<xsl:stylesheet
    xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
    xmlns:x="http://www.w3.org/1999/xhtml"
    version="1.0">
  <!-- There is already an XML declaration in the file -->
  <!-- we are extracting, and we don't want two of 'em -->
  <xsl:output method="xml" omit-xml-declaration="yes"/>
  <xsl:template match="/">
    <!-- get the title of this web page -->
    <xsl:variable name="title">
      <xsl:value-of select="//x:head/x:title"/>
    </xsl:variable>
    <xsl:choose>
      <!-- special case: if this is the Copy-All.xsl page, -->
      <!-- take the 2nd <pre> element                      -->
      <xsl:when test="starts-with(\$title,'Copy-All.')">
        <xsl:apply-templates select="//x:pre[2]"/>
      </xsl:when>
      <!-- for all others take the first (and only) <pre> -->
      <xsl:otherwise>
        <xsl:apply-templates select="//x:pre[1]"/>
      </xsl:otherwise>
    </xsl:choose>
  </xsl:template>
  <!-- when we hit the <pre> we want, just copy it -->
  <!-- entirely to the output                      -->
  <xsl:template match="x:pre">
    <xsl:apply-templates/>
  </xsl:template>
</xsl:stylesheet>
EOEC

#
# Check that we can process those stylesheets we just created
#
which xsltproc > /dev/null \
  || die "I can't find xsltproc, which is the only XSLT processor I know about"

#
# get the TEI Wiki P4 to P5 index page (which lists all the stylesheets
# we want to get)
#
f="http://www.tei-c.org/wiki/index.php/Category:P4toP5"
echo "extracting index page $f to file ${f##*/}"
get "$f" "${f##*/}" || die "Unable to fetch P4 to P5 index page"

#
# extract the list of files we want from the list of articles
# in the index page; then, for each of those files ...
#
for f in `xsltproc extractList.xslt Category:P4toP5` ; do
    # ... generate the web address and announce our intention ...
    f="http://www.tei-c.org$f"
    echo "extracting code from $f to file ${f##*/} ..."
    # ... get the page from the web, storing as a temp file ...
    get "$f" "${0}_temp" || echo "WARNING: unable to fetch $f."
    # ... process temp file into the desired file, decoding the HTML
    # character entities the wiki escapes inside <pre> ...
    xsltproc extractCode.xslt "${0}_temp" \
      | perl -pe 's/&lt;/</g; s/&gt;/>/g; s/&amp;/&/g;' > "${f##*/}"
done

# -----------------------------------------------------
# Copyright 2006 Syd Bauman and the Text Encoding Initiative
# Consortium. This program is free software; you can redistribute it
# and/or modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2 of
# the License, or (at your option) any later version. This program is
# distributed in the hope that it will be useful, but WITHOUT ANY
# WARRANTY; without even the implied warranty of MERCHANTABILITY or
# FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
# for more details. You should have received a copy of the GNU General
# Public License along with this program; if not, write to the
#    Free Software Foundation, Inc.
#    675 Mass Ave
#    Cambridge, MA 02139
#    USA
#    gnu@prep.ai.mit.edu
#
# Syd Bauman, North American Editor
# Text Encoding Initiative Consortium
# Box 1841
# Providence, RI 02912-1841
# 401-863-3835
# Syd_Bauman@Brown.edu