Talk:SIG:Overlap

Discussion
This section is for discussing the trials and tribulations of overlapping markup - please use at will. To start, I include Wendell's response to Lou's response to Syd's response to Lou, and David's response to Wendell, in the thread RE: alternated attributes on the OL Listserv.

There must be a better way than this to run a discussion on a Wiki, but I'm pretty new at this - if anyone out there has experience running a Wiki discussion, please contact me ([mailto:dporter@uky.edu dporter@uky.edu])

Wendell said:
Lou,

At 04:46 AM 7/7/2005, you wrote (citing Syd):

Lou: It seems quite a radical departure from the way we currently teach people that XML works:

''Syd: I don't see how this is a radical departure from XML at all. It is almost syntactic sugar for the spanTo=, after all.''

Lou: Well, here are two differences I see at once:

1. spanTo uses existing and well-tested and well understood (by software if not by people) id/idref mechanism to establish links

the horse uses the completely different idea of co-reference (shared naming would be a better term), which is used in only one other place in the Guidelines (Feature structures) and is not implemented by any software at all that I know of

Respectfully, I think your stress on what's "implemented by software" is something of a red herring, since I have little doubt (the only doubt I have is the studied paranoia of a programmer claiming anything is possible before it's been done) that I could write an XSLT transform that would convert markup from one form into the other, assuming the source format (whichever one it was) conformed to the constraints defined for it.

I don't see what using ID/IDREF traversal to "link" the two milestones gets you, since what's important here is not the two ends, but what's between them. There are also numerous ways these days to establish links besides ID/IDREF, which is commonly ignored in processing layers because of its DTD dependency, in the face of a common requirement to process files known only to be well-formed. Keys to link commonly-labeled nodes, whether nominally ID/IDREF or not, are generally a snap to set up. Heck, Steve or Syd could call sID an ID, and eID an IDREF, with no loss (and no gain either), except that XML prevents you from having two ID-typed attributes on a single element (IIRC), so sID would encroach on an ID already there. (Also this would be the wrong thing to do because sID and eID designate the *range*, not the element marking the start or end of the range.)
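The point about shared labels can be illustrated with a minimal sketch, in Python rather than XSLT. It assumes the convention discussed in this thread: an empty element carrying sID opens a range and a later empty element carrying the same value in eID closes it. The sample document, the element names `s` and `l`, and the function name `ranges_by_shared_label` are invented for illustration. No DTD or ID/IDREF typing is needed; the input only has to be well-formed.

```python
import xml.etree.ElementTree as ET

# Invented sample: a sentence range (s1) and a line range (l1) that overlap.
SAMPLE = (
    '<text><s sID="s1"/>Alpha beta <l sID="l1"/>gamma.'
    '<s eID="s1"/> Delta<l eID="l1"/> end</text>'
)

def ranges_by_shared_label(xml_text):
    """Return {label: text lying between the sID milestone and its eID partner}."""
    root = ET.fromstring(xml_text)
    open_labels = set()   # labels whose range is currently open
    collected = {}        # label -> text gathered so far

    def feed(chunk):
        # Append a run of character data to every range that is open right now.
        if chunk:
            for label in open_labels:
                collected[label] += chunk

    def walk(elem):
        start, end = elem.get("sID"), elem.get("eID")
        if start is not None:
            open_labels.add(start)
            collected[start] = ""
        if end is not None:
            open_labels.discard(end)
        feed(elem.text)
        for child in elem:
            walk(child)
            feed(child.tail)

    walk(root)
    return collected

if __name__ == "__main__":
    for label, text in ranges_by_shared_label(SAMPLE).items():
        print(label, repr(text))
```

The pairing is done purely by matching attribute values, which is the "shared naming" idea; what the sketch returns is the material between the two ends, not the ends themselves.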

Besides, I refer you to my Extreme paper of last year (where I believe you sat in the audience), where I did implement transformations from one hierarchy into another using Steve's CLIX convention for marking up overlapping ranges. I am the last person to claim that this demonstration shows we are now ready to do this on a large scale -- I myself have many questions about how it would work in practice, especially at scale, and misgivings regarding its design. Nonetheless I think your assertion that somehow CLIX is harder to implement than the alternative syntax is made without much foundation.

I'd be glad to see it backed up with evidence. Where is there software that does much of anything with either kind of workaround to XML's single hierarchy? What syntactic form(s) does it assume, and how hard would it be to adapt to the other?

''Lou: 2. In the normal run of events, a start-tag marks the start of something, an end tag marks the end of something, and an empty tag marks a point. In the crazy world of horse, an empty tag may be any of those three, depending entirely on a configuration of attributes and the way the wind is blowing. I think that's more than syntactic sugar.''

I think "which way the wind is blowing" is both needlessly invidious, and misleading, since DeRose and Bauman have proposed both clear rules for telling these differences, and a validation mechanism to test the integrity of an instance that claims to conform to them. I don't imagine this would be so hard for the alternative you propose, but I haven't seen it done yet either. I don't believe Syd's Schematron checks the direction of the wind.
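To give a sense of what such an integrity test might involve, here is a hedged Python sketch. The rules it checks are assumptions made for illustration, not the rules DeRose and Bauman actually proposed: a milestone carries sID or eID but not both, milestones are empty, each sID is closed exactly once by a later eID with the same value, and both ends use the same element name. The function name `check_milestones` is hypothetical.

```python
import xml.etree.ElementTree as ET

def check_milestones(xml_text):
    """Return a list of problem descriptions; an empty list means all checks passed."""
    problems = []
    open_ranges = {}   # label -> element name of the start milestone
    closed = set()     # labels whose range has already been closed
    # iter() visits elements in document order.
    for elem in ET.fromstring(xml_text).iter():
        start, end = elem.get("sID"), elem.get("eID")
        if start is None and end is None:
            continue
        if start is not None and end is not None:
            problems.append(f"<{elem.tag}> carries both sID and eID")
            continue
        if len(elem) or (elem.text or "").strip():
            problems.append(f"milestone <{elem.tag}> is not empty")
        if start is not None:
            if start in open_ranges or start in closed:
                problems.append(f"label {start!r} is started twice")
            else:
                open_ranges[start] = elem.tag
        elif end not in open_ranges:
            problems.append(f"eID {end!r} has no open sID before it")
        else:
            if open_ranges[end] != elem.tag:
                problems.append(
                    f"range {end!r} starts on <{open_ranges[end]}> "
                    f"but ends on <{elem.tag}>"
                )
            del open_ranges[end]
            closed.add(end)
    problems.extend(f"sID {label!r} is never closed" for label in open_ranges)
    return problems

if __name__ == "__main__":
    good = '<t><a sID="x"/>one<b sID="y"/>two<a eID="x"/>three<b eID="y"/></t>'
    bad = '<t><a eID="x"/>one<a sID="x"/><b sID="z"/></t>'
    print(check_milestones(good))
    print(check_milestones(bad))
```

Nothing here depends on the attribute names; the same checks could be written over a different milestone syntax, which is the comparison Wendell is asking for.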

Besides, what makes this criticism less true of any other milestone convention?

(Personally I think XML syntax is simply the wrong tool for the job, but that's me. Jeni Tennison and I have also proposed a syntax we consider better, as you know, to go along with the data model we have also proposed. I'd be thrilled to have the support to bring these proposals to a more workable state. But it's like building a city in the wilderness: first, one has to dig a well. And in the meantime there are plenty of more urgent jobs luring me back at home.)

''Lou: I worry that the average TEI user will be confused by it and start thinking that ... is just as good a way of saying ... as any other, when it really isn't in any practical sense.''

''Syd: I really doubt this would be a problem. Of course it would happen on occasion, but on the same level as people using  to denote lines of poetry.''

Lou: It is, in general, a bad idea to introduce a mechanism which is easy to abuse, especially if the same goals can be achieved without doing so.

It seems to me that this is arguing that our preferred mechanism should be as clumsy and ungainly as possible, so people are less likely to use it. If you define "abuse" as "using milestone-marking instead of clean element containment when the latter is possible", I think that risk simply comes with the territory.

Stepping back: I'd like to see the TEI SIG leave off the question of syntax, which is both the least important of the questions we face, and the most likely to embroil us in unproductive debate over irrelevancies such as the direction of the wind. If people working in the field really can't stand the variety of weeds and wildflowers springing up (personally I have no problem with them), as an alternative, I'd recommend concentrating on standoff approaches to dealing with overlap. While using standoff data structures (whether maintained as text files, in a database or whatever) is less appealing to those of us like me who prefer to get our hands dirty with instance markup (and who therefore distrust the maintenance model that standoff entails), it does take you to the problems that really matter (IMO), namely the data model, the API you build over it, and (finally!) the operations you can then perform. And if you like you can even leave your markup perfectly uncontaminated while doing so.
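As a rough illustration of the standoff idea (data model first, then an API, then operations), the following Python sketch keeps the text untouched and stores ranges separately as character offsets. Everything in it is an assumption made for the example: the `Range` class, the half-open offset convention, the sample text, and the definition of "overlap" as intersecting without containment.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Range:
    name: str
    start: int  # offset of the first character covered
    end: int    # offset just past the last character covered

def overlaps(a, b):
    """True when the ranges intersect but neither contains the other."""
    intersect = a.start < b.end and b.start < a.end
    a_in_b = b.start <= a.start and a.end <= b.end
    b_in_a = a.start <= b.start and b.end <= a.end
    return intersect and not (a_in_b or b_in_a)

def text_of(text, r):
    """Return the characters a range covers."""
    return text[r.start:r.end]

# The text itself carries no markup at all.
TEXT = "Alpha beta gamma. Delta end"
sentence = Range("s", 0, 17)
line = Range("l", 11, 23)
phrase = Range("phr", 0, 5)

if __name__ == "__main__":
    print(repr(text_of(TEXT, sentence)))
    print(repr(text_of(TEXT, line)))
    print(overlaps(sentence, line), overlaps(sentence, phrase))
```

The sentence and the line overlap, while the phrase nests cleanly inside the sentence, so a hierarchy is just the special case where `overlaps` is never true. The cost Wendell mentions is also visible: any edit to `TEXT` forces the stored offsets to be maintained.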

In the meantime, if you want to implement even as much as I have, with "half-LMNL", over the syntax you prefer -- please do. (I think you know how to find that Extreme paper.) (ed: here)

But personally I'm bored by arguments over syntax. They can be amusing, in the way that arguing over the differences between British and American orthography can be amusing and even illuminating in a small way, as one considers the history of orthography. And some syntaxes are certainly fairer to behold than others. (My opinion is that *any* milestone syntax is ugly, reflecting the very ugliness of the idea of retrofitting overlap into XML. Overlap, properly considered, indicates a superset of XML, not a special case of it.) But finally the importance of a syntax is in what one can do with it, just as when, if the prose is good and the page legible, I don't much care about what "colour" you use. If we agree on a syntax but don't do anything about the more interesting and difficult problems, what have we achieved? The rule "we shall spell things as they are taught at Oxford" doesn't teach us how to write good prose.

The worst enemy of group decision-making is premature consensus. Or maybe it's the prioritizing of trivia over what is really consequential.

Regards, Wendell

David Durand said:
I won't quote what Wendell said, because I agree with almost all of it. I like the CLIX solution as described here. (I've proposed a variant of it at least once at ACH, years ago, so I may be biased).

I think that it's better than spanTo because it's a simpler proposal, in a formal sense. You can see that simplicity partly in the hard-to-answer questions raised for spanTo:

1. What spanTo means for non-empty elements
2. Does the element I spanTo have to have the same element type?
3. What does it mean if the element I spanTo has content? (i.e. is the content of that element inside the span or not?)

These questions don't arise for CLIX because the limitation to empty elements means that we are labeling points, and showing how two points define a span. If the CLIX syntax tempts people to use it when inappropriate, this is perhaps a commentary on people's willingness to adopt non-hierarchical markup when it is possible.

I think that software is a non-issue: Neither proposal is hard to implement (modulo the issue of defining the answers to the unanswered questions above). I'm willing to bet that both are hard to work with meaningfully in XSLT, as it's just a bad language for dealing with things that violate and overlap the tree structure.

On the other hand, linking versus co-reference is _probably_ a real issue:

Linking and IDs have been used in many places in the TEI to "build data structures," and it's always been a practice that creates confusion, since most of those pointers are not "references" in the normal sense of navigable link. The fact that you can only have one ID is a limitation for document management (e.g. of tables, figures, etc.). Another problem is that you have to have a DTD subset (or XSD validator) around to declare the ID attribute types.

In fact, I think the use of shared attribute values to implement "homegrown" ID references is now very common, because of the ease with which it can be done in XSLT. I can't say whether it's more common, but I generally don't bother with IDREF anymore at all. ID/IDREF mostly are used for their validation effects, in my experience.

In answer to Wendell's call to look at some non-syntactic/political issues, here are some open problems that I think are important, and which the CLIX paper probably addresses (I haven't had time to read it yet):

Different element types sometimes share the same endpoint. The requirement that each span have a distinct start and end element means that the endpoints of spans are always totally ordered with respect to each other.

Alternatively, if the interpretation is that a span labels a position between characters and not one between characters and elements then: ...  ... is equivalent to: ...  ...

In this case, elements representing span starts and ends are unlike other elements because they are not ordered with respect to the scope of normal elements.

These questions of co-punctuality are independent of the syntax chosen, and reflect different decisions about what to model when modeling spans. Gavin Nichol would say that spans are inherently non-hierarchical, and that the equivalence above is a good thing. I don't like this because I'd rather see a traditional marked up document as a special case of spans that happen to nest in a nice way, but this depends on spans having the ability to nest. This imposes additional complexity on out-of-line markup, however, because it makes the document addressing model more complex. The elements and spans that you choose to look at affect document addresses. In particular, adding a span can create addresses that didn't exist previously. For example, consider the content of the foo element:

cat

This has 4 positions: before the 'c', after the 'c', after the 'a', after the 't'.

Adding a span changes things:

 ... cat

Because there's a new position between the 'a' and the 't': after the end of span "bar joe". This can be nice if you're editing a document, because you now have a principled way to express whether the 'r' inserted to change a 'cat' into a 'cart' should be part of the span or not. On the other hand, separate editing of overlapping spans is much easier if there's a fundamental coordinate system that isn't affected by other spans.
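The two addressing models can be compared with a small Python sketch. It is only an illustration under stated assumptions: content is modelled as a list of tokens, a hypothetical `Milestone` class stands in for a span endpoint, and the endpoint is placed between the 'a' and the 't' as described above.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Milestone:
    name: str
    label: str
    kind: str  # "start" or "end"

def positions_with_milestones(tokens):
    """Positions when every token boundary, milestones included, is addressable."""
    return len(tokens) + 1

def positions_in_characters(tokens):
    """Positions when only characters define the coordinate system."""
    return sum(1 for t in tokens if isinstance(t, str)) + 1

def characters(tokens):
    """The character data alone, ignoring milestones."""
    return "".join(t for t in tokens if isinstance(t, str))

plain = list("cat")
marked = ["c", "a", Milestone("bar", "joe", "end"), "t"]

# Inserting 'r' before the end milestone puts it inside the span,
# inserting it after the milestone leaves it outside.
inside = marked[:2] + ["r"] + marked[2:]
outside = marked[:3] + ["r"] + marked[3:]

if __name__ == "__main__":
    print(positions_with_milestones(plain), positions_with_milestones(marked))
    print(positions_in_characters(plain), positions_in_characters(marked))
    print(characters(inside), characters(outside), inside == outside)
```

With milestones counted, "cat" goes from 4 positions to 5 and the two insertions are distinguishable. With characters only, the count stays at 4 and both insertions land at the same offset, which is the stable coordinate system that makes separate editing of overlapping spans easier.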

One small matter: I don't like the names sID and eID, as they create a mental confusion with XML IDs. I do see the perspective that says that they are the same thing conceptually -- the unique name for an element of a particular element type -- but I think people will expect other similarities and be confused by them. At the moment this is just a gut feeling, however.

Dot says:
I'm especially interested in Wendell's comment that "[he]'d like to see the TEI SIG leave off the question of syntax" and concentrate on standoff approaches to dealing with OL. I admit that the Wiki is heavy on suggestions for how to express overlapping markup, but I agree with Wendell that the data model for dealing with overlapping markup in TEI is more important than the syntax used to express it. On the other hand, I think it would be good for TEI to adopt a single, consistent approach to expressing overlapping markup/hierarchies (the simpler the better, which is why I'm excited about CLIX/Horse and unsure about standoff markup - but this could just be my ignorance).

So, what is the mission of the TEI OL SIG? At the moment, our goal is "to bring together users of the TEI who are acutely interested in issues of multiple hierarchies and in particular handling those in XML" - it looks like we've done that. What's next?