DownloadP4toP5.bash

This is a shell script that tries to read the P4toP5 category page, follow all of the links from that page to the articles in this category, download each of those pages, and then extract the actual code from the downloaded page and save it with the expected name.

To use this program, copy and paste the following code into a file in an otherwise empty directory. The new file can be named whatever you like; here we assume it is called "downloadP4toP5.bash". Ensure that the new file is executable (e.g., `chmod a+x downloadP4toP5.bash`). With the current working directory set to the directory that downloadP4toP5.bash is in, execute it (e.g., `./downloadP4toP5.bash`). No parameters or switches are required, but an internet connection is.
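Before running the script, it may help to confirm the prerequisites that its own header names: `xsltproc`, either `wget` or `curl`, and a writable working directory. The following is only a minimal sketch of such a check; `have` and `problems` are hypothetical names that do not appear in the script itself.

```shell
# have NAME: succeed if NAME is a command on the PATH.
# (Hypothetical helper, not part of downloadP4toP5.bash.)
have() { command -v "$1" >/dev/null 2>&1; }

problems=0

# The script needs an XSLT processor ...
if have xsltproc; then
    echo "xsltproc: found"
else
    echo "xsltproc: MISSING"; problems=1
fi

# ... and one of the two URL fetchers it knows about ...
if have wget || have curl; then
    echo "wget or curl: found"
else
    echo "wget or curl: MISSING"; problems=1
fi

# ... and it writes its results into the current directory.
if [ -w . ]; then
    echo "current directory: writable"
else
    echo "current directory: NOT writable"; problems=1
fi
```

If `problems` ends up as 1, the script would most likely stop with one of its own error messages.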

#! /bin/bash


#
# downloadP4toP5.bash
#
# Extract the code from all of the articles listed in the P4toP5 category
# of the TEI wiki.
#
# Written 2006-10-05 by Syd Bauman
# Copyright 2006 by Syd Bauman and the Text Encoding Initiative Consortium
# See near bottom of file for complete copyleft notice
#
# This routine has been tested in Debian GNU/Linux and in Mac OS X.
# It may work in other environments, it may not. It downloads the
# files to its own directory, which had better be writeable or you'll
# see a lot of error messages and get no results.
#
# Limitations / Future Plans
#
# Currently, this routine requires that you have `xsltproc` and either
# `wget` or `curl` in your PATH. Perhaps in the future we should make
# it operate with other XSLT engines and URL fetchers.
#
# Should add a -d switch which will cause some tracing and leave the
# _temp files in place; otherwise they should be erased.
#
# Warning
#
# This program is really quite fragile. It's really just a hack, and
# it will break if anything on the wiki changes in any significant
# way. E.g., if someone were to add a new article in which the
# stylesheet code itself were not in the 1st <pre> element, we'd mess
# up on that particular stylesheet.
#

# error subroutine
die() {
    echo; echo "ERROR: $*."
    D=`date "+%Y-%m-%d %H:%M:%S"`
    echo "This was a fatal error. $D"
    exit 1
}

# subroutine to retrieve web page
# usage: get URL OUTPUT-FILE
if which wget > /dev/null 2>&1 ; then
    get() {
        wget --no-directories --output-document="$2" "$1"
    }
elif which curl > /dev/null 2>&1 ; then
    get() {
        curl --silent "$1" > "$2"
    }
else
    die "I can't find a command with which to download web files, so I'm giving up"
fi

# create our XSLT file for extracting the list of stylesheets
cat > extractList.xslt <<EOEL

EOEL

# create our XSLT file for extracting code from a page
# (the x: prefix is assumed to be bound to the XHTML namespace)
cat > extractCode.xslt <<EOEC
<xsl:stylesheet version="1.0"
  xmlns:xsl="http://www.w3.org/1999/XSL/Transform"
  xmlns:x="http://www.w3.org/1999/xhtml">

<xsl:output method="xml" omit-xml-declaration="yes"/>

<xsl:template match="/">
  <xsl:variable name="title">
    <xsl:value-of select="//x:head/x:title"/>
  </xsl:variable>
  <xsl:choose>
    <xsl:when test="starts-with(\$title,'Copy-All.')">
      <xsl:apply-templates select="//x:pre[2]"/>
    </xsl:when>
    <xsl:otherwise>
      <xsl:apply-templates select="//x:pre[1]"/>
    </xsl:otherwise>
  </xsl:choose>
</xsl:template>

<xsl:template match="x:pre">
  <xsl:apply-templates/>
</xsl:template>

</xsl:stylesheet>
EOEC

# Check that we can process those stylesheets we just created
which xsltproc > /dev/null 2>&1 || die "I can't find 'xsltproc', which is the only XSLT processor I know about"

# get the TEI Wiki P4 to P5 index page (which lists all the stylesheets
# we want to get)
f="http://www.tei-c.org/wiki/index.php/Category:P4toP5"
echo "extracting index page $f to file ${f##*/}"
get "$f" "${f##*/}" || die "Unable to fetch P4 to P5 index page"

# extract the list of files we want from the list of articles
# in the index page; then, for each of those files ...
for f in `xsltproc extractList.xslt Category:P4toP5` ; do
    # ... generate the web address and announce our intention ...
    f="http://www.tei-c.org$f"
    echo "extracting code from $f to file ${f##*/} ..."
    # ... get the page from the web, storing as a temp file ...
    get "$f" "$0_temp" || echo "WARNING: unable to fetch $f."
    # ... process temp file into the desired file
    xsltproc extractCode.xslt "$0_temp" | perl -pe 's/&lt;/</g; s/&gt;/>/g;' > "${f##*/}"
done


# Copyright 2006 Syd Bauman and the Text Encoding Initiative
# Consortium. This program is free software; you can redistribute it
# and/or modify it under the terms of the GNU General Public License
# as published by the Free Software Foundation; either version 2 of
# the License, or (at your option) any later version. This program is
# distributed in the hope that it will be useful, but WITHOUT ANY
# WARRANTY; without even the implied warranty of MERCHANTABILITY or
# FITNESS FOR A PARTICULAR PURPOSE. See the GNU General Public License
# for more details. You should have received a copy of the GNU General
# Public License along with this program; if not, write to the
#        Free Software Foundation, Inc.
#        675 Mass Ave
#        Cambridge, MA  02139
#        USA
#        gnu@prep.ai.mit.edu
#
# Syd Bauman, north american editor
# Text Encoding Initiative Consortium
# Box 1841
# Providence, RI  02912-1841
# 401-863-3835
# Syd_Bauman@Brown.edu
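
Two small idioms carry most of the script's file handling: the `${f##*/}` expansion, which derives a local file name from a URL, and the replacement of `&lt;` and `&gt;` in the extracted text. The sketch below illustrates both in isolation. It swaps in `sed` for the `perl` one-liner the script uses, purely so that the example needs no Perl; the variable names `name`, `escaped`, and `unescaped` are illustrative and not taken from the script.

```shell
# The category URL used by the script.
f="http://www.tei-c.org/wiki/index.php/Category:P4toP5"

# ${f##*/} removes the longest prefix matching "*/", i.e. everything
# up to and including the last slash, leaving the final path segment.
name="${f##*/}"
echo "$name"        # prints Category:P4toP5

# xsltproc writes the stylesheet text with < and > escaped as entities;
# the script turns them back into literal characters with perl.
# Here the same two substitutions are done with sed instead.
escaped='&lt;xsl:template match="/"&gt;'
unescaped=$(printf '%s\n' "$escaped" | sed -e 's/&lt;/</g' -e 's/&gt;/>/g')
echo "$unescaped"   # prints <xsl:template match="/">
```

Note that, like the script, this sketch only undoes `&lt;` and `&gt;`; any other entity in the extracted text would be left as it is.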