TEI-to-DICT howto

From TEIWiki

Preface

This HOWTO is intended for people who need to create a DICT-compatible dictionary database out of a TEI-encoded document. The aim is to guide the reader through the process of conversion using the tools available from the FreeDict project. The tools are not documented sufficiently, which makes the process quite tricky at some points. This HOWTO should help you avoid at least some of the pitfalls.

This text assumes that you use Linux (or the Cygwin environment for Microsoft Windows) and the bash shell.

In theory, there are two more or less straightforward methods of conversion, but in practice, only one of them (the dictfmt method) is currently 'blessed' and will be focussed on here. The other, deprecated method, using the xmltei2xmldict script, is mentioned below for historical reasons.


Preparations

In order to convert a TEI document into DICT, you will need tools developed by FreeDict, an XSLT processor, and the dictd package from the DICT project. If you do not use Linux, refer first to the subsection on installing Cygwin below: a Linux-like environment is needed for compiling and running the DICT tools, and Cygwin is the option described here; everything else can be handled by other tools.

Getting FreeDict tools

Set up a CVS directory: you only need to do the following in order to initialize the system (assuming that your CVS directory is ~/CVS)[1]:

  • execute CVSROOT=~/CVS; export CVSROOT (to set the environment variable for CVS)
  • execute cvs init in ~/CVS (a new subdirectory of ~/CVS will be created, called CVSROOT/)

Having done that, download the FreeDict tools by issuing the following commands:

  • cvs -d:pserver:anonymous@freedict.cvs.sourceforge.net:/cvsroot/freedict login (just press Enter for the password)
  • cvs -z3 -d:pserver:anonymous@freedict.cvs.sourceforge.net:/cvsroot/freedict co -P tools

This will create another subdirectory of ~/CVS, namely tools/.[2]
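
The setup and download commands above can be collected into one small script. The following is only a sketch, not something shipped by FreeDict: the run helper, the fetch_freedict_tools function, and the DRY_RUN and CVS_DIR variables are hypothetical names introduced here. By default the commands are only printed, so that you can inspect them first; the repository address is the one quoted above, and the sketch does not check whether that server still answers.

```shell
#!/usr/bin/env bash
# Sketch: initialise ~/CVS and check out the FreeDict tools.
# run(), fetch_freedict_tools(), DRY_RUN and CVS_DIR are illustrative names,
# not part of CVS or of the FreeDict tools.

CVS_DIR="${CVS_DIR:-$HOME/CVS}"
FREEDICT_ROOT=':pserver:anonymous@freedict.cvs.sourceforge.net:/cvsroot/freedict'

# Print the command when DRY_RUN is 1 (the default); execute it otherwise.
run() {
  if [ "${DRY_RUN:-1}" = 1 ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

fetch_freedict_tools() {
  run mkdir -p "$CVS_DIR" || return 1
  # Local initialisation: creates the CVSROOT/ subdirectory of ~/CVS.
  run env CVSROOT="$CVS_DIR" cvs init || return 1
  # Anonymous login: just press Enter at the password prompt.
  run cvs -d"$FREEDICT_ROOT" login || return 1
  # Check out tools/ inside ~/CVS; the subshell keeps your current directory.
  (
    [ "${DRY_RUN:-1}" = 1 ] || cd "$CVS_DIR" || exit 1
    run cvs -z3 -d"$FREEDICT_ROOT" co -P tools
  )
}
```

Calling fetch_freedict_tools prints the four commands; DRY_RUN=0 fetch_freedict_tools actually runs them.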

An XSLT processor

All you need here is a lightweight XSLT processor, such as xsltproc, which comes with the libxslt package (available in most Linux distributions and in Cygwin). Install the libxslt package if it is not present in your system. Afterwards, entering the xsltproc command without arguments should print its usage info.

Getting DICT tools

Download the dictd package from SourceForge. Unpack it to a separate directory. In that directory, run

  • ./configure --disable-plugin
  • make
  • (optionally) make install

The configure and make commands may take quite a while to run; the --disable-plugin option is for clean systems without the extra libraries needed for dictd plugins – otherwise, without those extra libraries, make will fail.

You should be able to run dictfmt now, just to see if it prints a help page. If you chose to make install, you should be able to run it from any directory in your system.
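
A quick way to confirm that the preparations succeeded is to ask the shell which of the required commands it can find. The snippet below is a minimal sketch; check_tools is a hypothetical helper name, and the list of commands includes xmllint, which is used for validation further down.

```shell
#!/usr/bin/env bash
# Sketch: report which of the tools used in this HOWTO are on the PATH.
# check_tools is an illustrative name, not an existing command.

check_tools() {
  local missing=0 tool
  for tool in "$@"; do
    if command -v "$tool" >/dev/null 2>&1; then
      printf 'found:   %s\n' "$tool"
    else
      printf 'missing: %s\n' "$tool"
      missing=1
    fi
  done
  # Non-zero exit status if at least one tool was not found.
  return "$missing"
}

check_tools cvs xsltproc xmllint dictfmt ||
  echo 'Install the missing tools before continuing.'
```

If dictfmt is reported as missing although you compiled it, you probably skipped make install; in that case run it from the dictd build directory or add that directory to your PATH.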

Getting Cygwin (for Windows users)

Cygwin provides a Linux-like (POSIX) environment on Windows. Download the installer from http://www.cygwin.com/setup.exe, put it into a separate directory (it will create subdirectories), and run it. Accept the defaults where possible, and in the mirror choice dialog choose the server that seems closest. The download should visibly start almost immediately after you hit the "next" button; if it does not, cancel and try another site.

In order to be able to create FreeDict dictionary databases, you need to choose at least the following in the package selection dialog (some of them will automatically select other packages that they depend on):

  • in the "Devel" category: "binutils", "bison", "cvs" (for CVS), "flex", "gcc", "libtool", "libxslt" (for xsltproc), and "make"
  • in the "Shells" category, "bash" should be selected by default; if you are new to the command line and *nix tools, also select "mc" (Midnight Commander, remember the Norton version for DOS? Invoke it with mc)

After everything installs (watch your firewall messages; it may try to block the installation scripts), you will be able to open a shell window by clicking on the Cygwin shortcut.

Validating your dictionary

Before you attempt to compile a DICT dictionary, you may want to check that your TEI source is well formed and, optionally, that it is valid. Under Linux or Cygwin, you can do the following:

  • xmllint --noout your-dictionary.tei (to check well-formedness)
  • xmllint --noout --valid your-dictionary.tei (to validate the dictionary against the TEI DTD)
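
The two checks can be chained so that validation is attempted only on a well-formed file. This is a sketch under the assumption that the file declares its DTD in a DOCTYPE, which is what the --valid option validates against; check_tei is a hypothetical function name.

```shell
#!/usr/bin/env bash
# Sketch: check well-formedness first, then DTD validity.
# check_tei is an illustrative name; the xmllint options are those shown above.

check_tei() {
  local file="$1"
  if [ ! -r "$file" ]; then
    echo "cannot read: $file" >&2
    return 2
  fi
  if ! xmllint --noout "$file"; then
    echo "$file is not well formed" >&2
    return 1
  fi
  if ! xmllint --noout --valid "$file"; then
    echo "$file is well formed but not valid against its DTD" >&2
    return 1
  fi
  echo "$file is well formed and valid"
}
```

For example, check_tei your-dictionary.tei && echo 'ready to convert' proceeds only when both checks pass.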

Non-commandline tools

You need to run Linux or Cygwin in order to compile and run DICT tools. Any other task described here can be handled by GUI tools, typically running under any operating system. Most of the procedures described here have been tested under Kernow, XML Copy Editor, and oXygen.

Converting with XSLT and dictfmt

The following assumes that you are in a subdirectory of ~/ (other than the dictd package directory and the ~/CVS) that you created for the purpose of manipulating your TEI files.

Metadata

One of the trickiest parts of the conversion process is making all the metadata present in the TEI file appear under the correct headings in the DICT database. Earlier, the XSLT method of conversion did not process data from the TEI header, so any metadata to be included in the resulting DICT database had to appear as dictionary entries in the document's <body> element: for example, an <orth> element containing a DICT header (e.g. 00-database-short), followed by a <sense> element containing the appropriate information (e.g. "My Dictionary"). The current version of tei2c5.xsl appears to operate on the fileDesc/titleStmt/title element directly to produce 00-database-short. The -t option of dictfmt (see below) should guarantee that this is transferred to the final dictionary, together with the other relevant parts of the TEI header.

Conversion

For the purpose of this example I also assume that your TEI file is named dictionary.tei. Listed below are all the steps you have to take in order to produce a DICT database from a TEI-encoded dictionary, on the assumption that you are in a subdirectory of your ~/ and that the FreeDict tools are in ~/CVS/tools/:

  1. Transform the TEI file into something appropriate to be fed into dictfmt by entering
    xsltproc -o dictionary.c5 --novalid --stringparam current-date "$(date)" \
    ../CVS/tools/xsl/tei2c5.xsl dictionary.tei
  2. Transform the intermediate file into the database and index files by entering one (or a combination) of the following:
    dictfmt -t --utf8 my_dictionary < dictionary.c5
    dictfmt -t --headword-separator %%% --utf8 my_dictionary < dictionary.c5
    dictfmt -t -s <short_descriptive_name> --utf8 my_dictionary < dictionary.c5
  • The --headword-separator %%% option is for cases where a single <form> element contains more than one <orth> – the tei2c5.xsl stylesheet connects such <orth>s into a single sequence glued by three per-cent characters (to yield e.g. "spring%%%fall"); the --headword-separator %%% lets dictfmt know that it needs to watch out for headwords containing "%%%" and split them.[3]
  • The argument to the -s option should be a name for your dictionary – this option is not needed if your /teiHeader/fileDesc/titleStmt/title element is set, for details refer to the FreeDict TEI document.
  • The -t and --utf8 options can be treated as internal: -t tells dictfmt that it will be processing a c5 file, that it should not copy the headword inside the definition, and that it should not generate the standard preface to the dictionary, using instead the contents of the <teiHeader> element; --utf8 tells it that the dictionary is encoded in UTF-8.
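
Both steps can be wrapped in a single shell function. What follows is a sketch rather than an official FreeDict script: tei2dict and TOOLS_DIR are hypothetical names, the headword separator variant of the dictfmt call is chosen arbitrarily, and the date is passed in double quotes because the output of date contains spaces and must reach xsltproc as one argument.

```shell
#!/usr/bin/env bash
# Sketch: run both conversion steps for one TEI file.
# tei2dict and TOOLS_DIR are illustrative names, not part of the FreeDict tools.

TOOLS_DIR="${TOOLS_DIR:-$HOME/CVS/tools}"

tei2dict() {
  local tei="$1" name="$2"
  if [ -z "$tei" ] || [ -z "$name" ]; then
    echo 'usage: tei2dict FILE.tei DATABASE_NAME' >&2
    return 2
  fi
  if [ ! -r "$tei" ]; then
    echo "cannot read: $tei" >&2
    return 2
  fi
  # Intermediate file: dictionary.tei -> dictionary.c5
  local c5="${tei%.tei}.c5"

  # Step 1: TEI -> c5, using the FreeDict stylesheet.
  xsltproc -o "$c5" --novalid --stringparam current-date "$(date)" \
    "$TOOLS_DIR/xsl/tei2c5.xsl" "$tei" || return 1

  # Step 2: c5 -> NAME.dict and NAME.index in the current directory.
  dictfmt -t --headword-separator %%% --utf8 "$name" < "$c5"
}
```

With the file names used in this example, the call would be tei2dict dictionary.tei my_dictionary.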

Running dictfmt in step 2 should produce a correct DICT database and index (in this example, the filenames will be my_dictionary.dict and my_dictionary.index). All that is left to be done at this point is copying them into some permanent directory, editing the dictd configuration file, and restarting your DICT server (actually, sending it a SIGHUP with kill -1 is enough). You can issue the dict -D command to see if your new dictionary is correctly recognized by the server.
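
The entry that has to be added to the dictd configuration file is a short database block naming the data and index files. The sketch below merely prints such a block; dictd_stanza is a hypothetical helper, and /usr/local/share/dictd is an assumed location that you should replace with the permanent directory you actually chose.

```shell
#!/usr/bin/env bash
# Sketch: print a dictd configuration entry for a freshly built database.
# dictd_stanza and the /usr/local/share/dictd path are illustrative assumptions.

dictd_stanza() {
  local name="$1" dir="$2"
  printf 'database %s {\n' "$name"
  printf '  data  "%s/%s.dict"\n' "$dir" "$name"
  printf '  index "%s/%s.index"\n' "$dir" "$name"
  printf '}\n'
}

dictd_stanza my_dictionary /usr/local/share/dictd
```

Append the printed block to your dictd configuration file, send the server a SIGHUP, and then check with dict -h localhost -D that my_dictionary is listed.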

Converting with xmltei2xmldict

This method employs a Perl script developed by the FreeDict project called xmltei2xmldict. It does not require any intermediate files and the whole conversion is achieved by issuing just one command; however, in order to run, the script requires a certain Perl module, which sometimes proves to be quite difficult to install.

The requirements for this method are listed in the FreeDict tools README (consult CVS/Tools/README for details).

Note: This method is apparently deprecated (see the README for FreeDict tools) and, consequently, the present section is not maintained. Suggestions for improvement (or votes for eliminating this section altogether) are welcome on the discussion page.

Metadata

According to the FreeDict HOWTO, the xmltei2xmldict script knows how to process TEI headers. This is only true provided that an appropriate XSLT stylesheet is used: the script itself relies on XSLT internally, which means that the previous method could process the TEI header too, if only its stylesheet were upgraded. The FreeDict HOWTO claims that the stylesheet recommended for use with the script (tei2txt.xsl) converts the whole header into 00-database-info (accessible via dict -i <dictionary name>), that the <title> element becomes 00-database-short (visible when dict -D is issued), and that the <sourceDesc> element becomes 00-database-url. Therefore, when using this method, it is not necessary to include any metadata in the entries.

Conversion

The assumptions concerning the temporary directory's structure are the same as in the previous section. To convert a TEI dictionary into DICT, you need to do the following:

  1. Validate the file by typing xmllint --noout --valid dictionary.tei. This is absolutely necessary: the script will crash if the TEI file is not valid.
  2. Convert the file to the DICT format by typing
../CVS/tools/xmltei2xmldict.pl -f dictionary.tei -t ../CVS/tools/xsl/tei2txt.xsl

By now you should have a dictionary file and an index file for use with your DICT server. The only problem I ran into is that the <sourceDesc> element is not converted into 00-database-url. This is most probably due to a bug in the XSLT stylesheet.

Conclusion

You may want to make sure your dictionary works as intended by testing it locally, either under dictd or under a different server. If you run dictd under Cygwin, remember to either modify the PATH variable or invoke it as /usr/local/sbin/dictd. The configuration files reside in /usr/local/etc/ by default; you have to create them by hand according to the documentation and examples. If this is too much hassle, you can use a Java-based server such as JDictd (straightforward to configure, but it does not escape the "<" and ">" symbols that are used to enclose grammatical information, so missing POS information does not indicate a problem with your dictionary) or JavaDICT (untested).

Revision History

This HOWTO was written by Radek Moszczyński on May 22, 2005. Piotr talked Radek into donating it to TEI Wiki in November 2007. Subsequent changes are documented on the history page.

Please make sure to supply an informative edit description when you modify this file.

Notes

  1. In case you are a Windows user new to the *nix way, here is what you have to do in your Cygwin shell first: mkdir CVS to create the directory and cd CVS in order to change to that directory. cd ~ will always take you back to your home directory, and ls lists the directory contents. Midnight Commander (mc) may come in handy for browsing the directory tree and editing files.
  2. For details on CVS, refer to the SourceForge CVS page, man cvs in Cygwin/Linux, the CVS Book, or the concise gnulamp.com tutorial. The SourceForge page suggests some Windows-based CVS clients as well, in case this looks too scary.
  3. Thus, the "%%%" sequence is intended to mark orthographic variations of a single form as well as synonyms. It gets even worse when a reversed dictionary is created (this is ignored in the present HOWTO): all translation equivalents of an <orth> (regardless of subsense divisions) end up linked by "%%%" – notice that they need not even be full synonyms in such cases. Needless to say, this option opens the door for bad lexicographic practice and as such should be discouraged in cases other than those involving alternative forms of a single lexeme. Piotr 04:23, 3 October 2008 (CEST)
