HTML5 Microdata TEI Serialization

From TEIWiki


The HTML5 Microdata specification offers a way to add semantics to HTML, including the ability to specify a particular vocabulary (the equivalent of a namespace, called an "item type" in HTML5). This offers the potential for TEI to be encoded in HTML5 without loss of any semantic information.

Schema.org, a collaboration between Google, Microsoft, and Yahoo, is pushing for adoption of Microdata over alternatives such as RDFa or Microformats, citing the complexity of the former and the lack of open extensibility (namespacing) of the latter. Because these companies can offer search tools (such as Google's) that take advantage of this information, it seems increasingly likely that Microdata may become the ultimate means of encoding enhanced semantics in regular HTML. That offers a compelling argument for allowing TEI texts to be offered in such a serialization.

The question then becomes when and where to use TEI XML, and when, where, and how to use such a serialization (especially once a particular approach to serializing has been formalized).

Note that, in weighing the advantages of one format over the other, using one format in one environment does not preclude converting between the two. For example, one might create texts in TEI XML (including inside web-based editors); share them as HTML5 Microdata to take advantage of search engine discoverability and of collaborative editing tools that are built to recognize only HTML (e.g., wikis); and/or convert TEI-based HTML5 Microdata texts found on the web or in such HTML tools back into TEI for convenience of markup editing or manipulation. It would be possible to ensure lossless conversion in either direction (assuming the HTML is not enriched with features not used in TEI, such as form controls), so there is little disadvantage in providing such a serialization.
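As an illustration of what a lossless round trip could look like, here is a minimal sketch in Python. It is not a proposed standard: the conventions it uses are assumptions made only for this example. Each TEI element becomes an anonymous div whose itemprop carries the TEI element name, the root element records its name in a hypothetical item type URL (the TEI namespace plus "#" and the element name), and TEI attributes are carried in hypothetical data-tei-* attributes. The sketch also assumes that attributes are not namespaced.

```python
import xml.etree.ElementTree as ET

TEI_NS = "http://www.tei-c.org/ns/1.0"
ATTR_PREFIX = "data-tei-"  # hypothetical convention for carrying TEI attributes


def tei_to_html(elem, is_root=True):
    """Map a TEI element and its subtree to nested anonymous div elements."""
    local = elem.tag.split("}")[-1]  # drop the "{namespace}" prefix
    div = ET.Element("div")
    if is_root:
        # Hypothetical item type: TEI namespace + "#" + root element name.
        div.set("itemscope", "")
        div.set("itemtype", TEI_NS + "#" + local)
    else:
        div.set("itemprop", local)
    for name, value in elem.attrib.items():
        div.set(ATTR_PREFIX + name, value)
    # Copy text and tail so that mixed content survives unchanged.
    div.text, div.tail = elem.text, elem.tail
    for child in elem:
        div.append(tei_to_html(child, is_root=False))
    return div


def html_to_tei(div):
    """Reverse tei_to_html, restoring element names and attributes."""
    local = div.get("itemprop") or div.get("itemtype").split("#")[-1]
    elem = ET.Element("{%s}%s" % (TEI_NS, local))
    for name, value in div.attrib.items():
        if name.startswith(ATTR_PREFIX):
            elem.set(name[len(ATTR_PREFIX):], value)
    elem.text, elem.tail = div.text, div.tail
    for child in div:
        elem.append(html_to_tei(child))
    return elem


source = (
    '<p xmlns="http://www.tei-c.org/ns/1.0" rend="indent">'
    "Written by <persName>Jane Austen</persName> in "
    '<date when="1813">1813</date>.</p>'
)
tei = ET.fromstring(source)
html = tei_to_html(tei)
round_tripped = html_to_tei(html)
html_text = ET.tostring(html, encoding="unicode")
```

Comparing ET.tostring(tei) with ET.tostring(round_tripped) is one way to check that nothing was lost. A real serialization would also have to deal with namespaced attributes such as xml:id and with the fact that HTML parsers lower-case attribute names.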

Still, the entries below highlight environments and conditions under which preferring one over the other would be advantageous.

Advantages of editing original texts in TEI XML (or use cases for sometimes converting such TEI-enriched HTML5 to TEI XML)

  • Existing tools for TEI, such as Roma and the default XSL stylesheets, assume TEI XML (though stylesheets could also be designed to convert an HTML5 serialization back into TEI XML, allowing one to take advantage of tools built for TEI XML).
  • TEI XML is more succinct in syntax than an HTML5 Microdata serialization would be. This makes the text itself more human-readable, along with any queries (such as XQuery), XSL transformations, CSS selectors, JavaScript, or other programmatic manipulation written against it. It also makes the text more rapidly deliverable over networks, assuming that the likely necessary but bulky stylesheets can be cached for reuse. (It is conceivable that browsers might support XBL, a language which would allow transformations to occur opaquely while preserving the original DOM structure, thus allowing queries, etc. to work against the original syntax. However, XBL is currently specified as not supporting transformations of semantics, which is what converting to XHTML5 Microdata would be seeking to do.)
  • It is slightly easier to produce such texts, not only because of their brevity, but also because an HTML5 serialization would require editors to be familiar with more HTML markup. The degree depends on the serialization implementation chosen. Some implementations would require very little new HTML, primarily using anonymous HTML5 div, span, meta, or link elements joined with the handful of Microdata attributes to express TEI elements and attributes. Others would seek to take more advantage of HTML's native markup by converting into it wherever there is an adequate correspondence with TEI semantics (e.g., converting quote to blockquote, as the TEI XSL stylesheets do). A Microdata serialization might also translate parts of TEI, where equivalent, into specific Schema.org schemas to take advantage of recognition by search engines. Those wishing to work directly with such HTML would then need to be familiar with those schemas' semantic markup, which could differ in naming from TEI.

Advantages of using an HTML5 (Microdata) Serialization

  • Allows for taking advantage of HTML-only tools such as online WYSIWYG editors (e.g., CKEditor) or WYSIWYM editors (probably none created to date are TEI-aware). These tools may not be aware of XML or XSL and would thus otherwise be unable or unlikely to support TEI.
  • Allows for storage on sites which only allow HTML (or conversion into HTML), such as the popular MediaWiki wikis which drive sites like Wikisource (a logical place where TEI texts might be offered and made universally discoverable). In the case of MediaWiki, you might add your support to https://bugzilla.wikimedia.org/show_bug.cgi?id=28776 so that MediaWiki makes the relatively easy but productive fix of allowing Microdata on such wikis. (Allowing TEI XML would be a much harder sell, given that it is less familiar, is not HTML, and has not historically been web-oriented.) Such wikis also offer a succinct and user-friendly syntax for creating HTML without the need for all of the markup (e.g., line breaks might automatically create paragraph tags), though admittedly this could leave editors initially unsure of the exact HTML which will be produced. While the open collaborative potential of wikis is large, as evidenced by the success of Wikipedia, wikis can also be used in closed projects to offer revision control and history.
  • Allows encoding in the format already most familiar to the web community, HTML, albeit enhanced, in a standard and well-defined manner, by TEI-based semantic markup. While this could add the burden that markup creators must learn both HTML and TEI, applications could be designed to use TEI XML as the primary format, converting to HTML when sending text to the wikis and converting from HTML back to TEI XML when the editor obtains the latest copy from the server. Editors would then not need to know anything about HTML at all, since it would be used exclusively as a delivery or storage format. Conversely, people not familiar with TEI would be able to build the initial text in HTML, which could then be progressively enhanced with TEI semantics.

  • Search engines such as Google (see http://www.google.com/support/webmasters/bin/answer.py?answer=99170 ) are already becoming available which can discover such markup in a semantically aware manner. For example, if your TEI text encoded a mention of a particular date or person, your document could show up in searches for that date or person. Google's search testing tool offers examples demonstrating that it is aware of (and can expose to searchers) semantic information encoded in a document, such as calendar events, person names, etc. Traditional unstructured search indexing, by contrast, cannot know for certain that a given snippet of text represents a date, for example.
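To make the idea of semantically aware discovery concrete, the following Python sketch shows how a consumer might pull property names and values out of TEI-flavoured Microdata. It is an illustration only and does not describe how any real search engine works. The class name ItempropCollector, the sample markup, the item type URL, and the idno value are all made up for the example, and the sketch assumes well-nested input.

```python
from html.parser import HTMLParser


class ItempropCollector(HTMLParser):
    """Collect (property name, value) pairs from elements carrying itemprop."""

    # Elements with no end tag; they must not be pushed on the stack.
    VOID = {"meta", "link", "br", "img", "hr", "input"}

    def __init__(self):
        super().__init__()
        self.stack = []  # one [itemprop or None, text parts] entry per open element
        self.found = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        prop = attrs.get("itemprop")
        if tag in self.VOID:
            # For meta, the property value is taken from the content attribute.
            if prop is not None and tag == "meta":
                self.found.append((prop, attrs.get("content", "")))
            return
        self.stack.append([prop, []])

    def handle_data(self, data):
        # Text belongs to every element that is currently open.
        for entry in self.stack:
            entry[1].append(data)

    def handle_endtag(self, tag):
        if tag in self.VOID or not self.stack:
            return
        prop, parts = self.stack.pop()
        if prop is not None:
            self.found.append((prop, "".join(parts)))


sample = (
    '<div itemscope itemtype="http://www.tei-c.org/ns/1.0#p">'
    'Written by <span itemprop="persName">Jane Austen</span> in '
    '<span itemprop="date">1813</span>.'
    '<meta itemprop="idno" content="example-001">'
    "</div>"
)
collector = ItempropCollector()
collector.feed(sample)
collector.close()
properties = collector.found
```

After running this, properties holds the person name, the date, and the identifier as labelled values, which is the kind of information an unstructured index would have to guess at.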

Questions to determine in deciding on a specific serialization algorithm

Although search engines could discover TEI-encoded Microdata HTML no matter the algorithm, it would probably be of benefit to standardize on a particular approach, specifying exactly how TEI should be converted into such HTML, and exactly how HTML could then be converted back into TEI.

Some issues to consider:

  • Degree to which HTML markup will be used in place of (or in addition to) TEI semantics. The more native HTML markup is used, the more one can take advantage of HTML features, such as preview in an HTML editor, or HTML-aware search tools which look for, say, a "blockquote" but which would not understand the HTML5 TEI semantic equivalent (e.g., <div itemprop="quote">). However, the more native HTML is used, the more HTML a TEI editor would need to learn. The default TEI stylesheets already tend to take advantage of such markup where available, so these could be leveraged in preparing such an enriched conversion into HTML5. (A review of the stylesheets might also help build up a human-readable description of the default formatting typically assumed when using TEI markup, as well as enhance the Comprehensive CSS Stylesheet effort.)
  • Degree to which Schema.org schemas would be used in place of (or in addition to) TEI semantics. The more such schemas are used, the more search engines would be able to recognize the content, but the more other semantic vocabularies a TEI editor would need to learn.
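The first of these choices can be pictured as a single switch in the serializer. The Python sketch below is hypothetical: the function name serialize, the prefer_native flag, and the lookup table are assumptions for illustration, and only the quote to blockquote mapping comes from the discussion above. It handles a single element with plain text content.

```python
from html import escape

# Hypothetical lookup table of TEI elements with an adequate native HTML
# counterpart. Only quote -> blockquote is taken from the text above.
NATIVE_HTML = {"quote": "blockquote", "p": "p"}


def serialize(tei_name, text, prefer_native=True):
    """Return an HTML5 Microdata fragment for one TEI element with text content."""
    if prefer_native:
        tag = NATIVE_HTML.get(tei_name, "div")  # fall back to an anonymous div
    else:
        tag = "div"
    # itemprop is kept in both cases so the TEI name can always be recovered.
    return '<{0} itemprop="{1}">{2}</{0}>'.format(tag, escape(tei_name), escape(text))


native = serialize("quote", "To be, or not to be")
anonymous = serialize("quote", "To be, or not to be", prefer_native=False)
```

Keeping the itemprop attribute even on native elements is one way to let a converter map the HTML back to TEI regardless of which option was chosen. A similar table could in principle map TEI names to Schema.org properties, at the cost of the extra vocabulary noted above.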