Draft minutes of 2010-04 Council meeting

The following are minutes of the 2010 Council meeting, held April 29-30 in Dublin.

A final version should be added to http://www.tei-c.org/Activities/Council/Meetings/index.xml along with the past meeting minutes.

reports of small-group discussions of feature requests
Council members met in groups of two or three to discuss various tickets that were unresolved after yesterday's discussions. Below are notes from reporting back.

CALS for TEI (2940838) – Laurent reported that he, Dan, and Kevin decided on these things to do: (1) add a reference to CALS in the Guidelines as an alternative way to encode tables and (2) use Sebastian's ODD that he developed for CALS as part of his work with ISO [?] to include in the TEI (much like MathML and SVG in the TEI).

Sebastian noted that we actually have a private re-implementation of MathML and SVG by incorporating these from the standard [at the time of ODD generation?]. He also noted that "CALS" can mean many things, not all of which are clearly defined, but most people use it to refer to the CALS exchange model, which is well-specified. He said we need to talk to Norm Walsh [who is our best contact for CALS].

Laurent said the third thing to do is to contact Norm Walsh.

Dan asked Sebastian whether his ODD needs polishing before public distribution. Sebastian replied that ____ and said he just needs a namespace to use for CALS elements.

Lou asked to clarify that there was consensus not to include CALS elements in the Guidelines. Everyone agreed that we would not do so at this time.

generic dates (2925145) – Dot reported that she, Elena, and James discussed a ticket proposing to create att.datable.generic for normalizing dates using non-Gregorian calendars and dating systems. She said they like the idea but are unsure of implementation. James added that the datatype would need to be so loose that it basically becomes free text.

Elena said that ____ would need a new date element.

James said that the ticket proposes a new form of canonical referencing, which Elena noted would need to be defined in the header.

Laurent asked whether we should attempt to rework the proposal or send it back to the author to re-propose in a different form. Dan replied that there's a risk that a new proposal would be less TEI-like.

After a discussion, it was agreed that Elena would summarize the possible ways forward [for the proposer of the ticket]. Lou asked her to monitor the ticket for future discussion and proposals.

allow @cert on choice and model.choicePart (2834505) – Julianne reported that she and Brett discussed this ticket, which included an alternative suggestion in the comment to allow @cert on seg. She and Brett did not like the latter idea, but they also didn't like the original proposal since too many attributes would be allowed on choice and model.choicePart.

Instead, Julianne and Brett proposed to make all elements in model.choicePart members of att.responsibility.

allow @to and @from on choice (2783323) – Julianne reported that she and Brett also discussed this ticket. She said the proposal is in the spirit of the TEI, but she noted that some good alternative encodings were also suggested on the ticket. She said they would like to know what Christian Wittern (who proposed the ticket) thinks of the alternative encodings. She noted that the Guidelines give no examples of @to and @from on app, so it's hard to compare the alternatives.

Dan said that @to and @from are on app because of the possibility of there being a lemma, whereas choice doesn't have these because it doesn't assume the existence of a lemma.

Brett summarized the use case given in the ticket. Elena noted that the proposal provides a simple mechanism for accomplishing something like stand-off markup, but she said it's not clear why we wouldn't allow these attributes on all elements. Laurent and Lou agreed that we need a more generic standoff mechanism and shouldn't create a hack for use only on choice.

Dan noted that, for the use proposed, there are existing mechanisms (@ref and @key) for pointing to controlled vocabularies and an existing mechanism (app) for encoding a lemma.

allowing non-numbers in idnos and allowing idno in author (2493417) – Sebastian reported that his group discussed a proposal to allow idno to contain non-numeric identifiers such as URIs and DOIs. His group agreed with this proposal to add idno to model.nameLike.

As for allowing idno in author, the group realized that author already has a content model that appears more flexible than desired (for example, allowing add and del), so they proposed to correct this by changing the content model of author to model.limitedPhrase. Kevin gave a use case for add and del within author: encoding a typewritten manuscript of a draft of a work with a bibliography, where the bibliographic citations are encoded using bibl or biblStruct and the author made corrections to author names.

After discussion, consensus was reached to no longer change the content model to model.limitedPhrase but still allow idno in author. Sebastian noted that this will have the side-effect of allowing people to use idno anywhere they might use author (not just within a bibl or biblStruct). He questioned whether we really want to do this. There was discussion.

Lou noted that having idno as a child of author goes against the principle voiced yesterday that elements should describe their parent. Kevin said there are many ways in which markup requires human inference to fully understand it and that this surely is not the only place where a TEI element does not describe the parent.

It was decided to create a separate feature request for ____.

space in core module (2794512) – Dan said the issue is that the example is actually a transcription: the space is important because it appears in the layout, not because it has a rhetorical or linguistic meaning (e.g. the leading space in an indented paragraph). So if it is important, you should invoke transcription. The confusion is that gap has two meanings: one appropriate to transcription and one appropriate to non-transcription circumstances (such as sampling). Gap was originally omit (sampling) but was renamed and expanded semantically to cover the transcription situation in the move from P2 to P3.

We need to indicate in the Guidelines that gap has two distinct meanings: one appropriate strictly to transcription (illegible/missing) and the other (sampling) more generally applicable. We might also want to consider resurrecting P2's omit for the sampling application and say that gap should be used only for transcription (reversing the P3 decision).

Dot said that we want to recommend to David Sewell (the ticket submitter) that he use space (from the transcription module).

target/targets (perhaps ticket number 2531384?) – Lou reported that there are 8 cases in the Guidelines in which @target takes a single value and 8 others in which it takes 2. Only 4 of these instances have the attribute value defined by an attribute class. His group proposed to introduce an attribute class for all instances of this attribute which would allow 1 to many values. However, the prose of the Guidelines will need to explain that for some elements, it doesn't make sense to have multiple values for @target.

Once this change is made, it will no longer make sense to use the @targets attribute. We will leave this attribute in the Guidelines but discourage its use. The three elements that have the @targets attribute should be added to the new attribute class.
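[Illustration, not part of the discussion] A minimal sketch of how processing software might read a @target that allows one to many values, assuming the values are whitespace-separated pointers and that the discouraged @targets is read only as a fallback. The function names and the single_valued set are hypothetical; which elements should be limited to one value is exactly what the Guidelines prose would have to say.

```python
import xml.etree.ElementTree as ET


def pointer_values(elem):
    """Return the pointers on an element as a list.

    Assumption: @target holds one or more whitespace-separated pointers,
    and the discouraged @targets is consulted only when @target is absent.
    """
    raw = elem.get("target")
    if raw is None:
        raw = elem.get("targets", "")
    return raw.split()


def check_target(elem, single_valued=frozenset()):
    """Report whether the number of pointers suits the element.

    single_valued is a hypothetical, project-supplied set of element names
    for which more than one pointer makes no sense.
    """
    values = pointer_values(elem)
    if not values:
        return "no pointer"
    if elem.tag in single_valued and len(values) > 1:
        return "too many values"
    return "ok"


# Sample fragments, written without the TEI namespace for brevity.
link = ET.fromstring('<link targets="#a #b"/>')
note = ET.fromstring('<note target="#p1 #p2"/>')
print(pointer_values(link))
print(check_target(note, single_valued={"note"}))
```

Treating "note" as single-valued here is only for the sake of the example.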

It was also agreed that the discussion of cRef on the ticket should be "spun off" into a different ticket.

hyphenation
Lou summarized the debate on how to handle encoding of hyphenation.

If you are transcribing early printed books with hyphens at the end of lines, there are a number of ways to do it. If your goal is to transcribe text, including hyphens, faithfully or to encode the text in a way that will allow you to process lexical items (generally speaking, words) without marking up these words with w elements, you will need to represent hyphenation in the source document.

If your encoding will mark line breaks (using lb), this complicates the method for encoding hyphenation and requires any tokenization software to be capable of ignoring elements that can appear within words (like lb). Alternatively, a derived text with lb and other intraword elements removed could be produced for the concordance software from the master encoded text.
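[Illustration, not part of the discussion] A minimal sketch of producing such a derived text, assuming that lb is the only intraword element, that no hyphen or whitespace surrounds an lb inside a word, and that the sample is written without the TEI namespace. The function name is hypothetical.

```python
import xml.etree.ElementTree as ET


def text_without_lb(elem):
    """Return the text content of elem, dropping every lb element.

    Text before and after each lb is joined directly, so a word that
    is split by an lb comes out whole.
    """
    parts = [elem.text or ""]
    for child in elem:
        if child.tag != "lb":
            parts.append(text_without_lb(child))
        parts.append(child.tail or "")
    return "".join(parts)


sample = ET.fromstring(
    "<p>a very extra<lb/>ordinary day <hi>in<lb/>deed</hi></p>"
)
derived = text_without_lb(sample)
tokens = derived.split()
print(tokens)  # ['a', 'very', 'extraordinary', 'day', 'indeed']
```

An lb that falls between words would need whitespace next to it in the source for this naive join to keep the words apart.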

In short, the Guidelines are currently not helpful in giving guidance on encoding hyphens that appear to be accidents of line breaks (where a hyphen would not appear in the word had there not been a line break). It has been suggested to use the Unicode soft hyphen character for these cases, and Lou initially thought this would be appropriate; however, Deborah Anderson asked senior Unicode people about this and they told us that use of the soft hyphen for such cases is inappropriate. (The soft hyphen is meant for cases where processing software might choose to break a word, not where it was previously broken.)

So how, then, to indicate on an lb element whether the word was broken across lines? You might use the type attribute to indicate whether a lexical unit has been broken by the hyphen, use whitespace before the element, or use the rend attribute to describe the hyphenation. Lou suggested using the type attribute to indicate whether the hyphen marks the boundary between lexical items and the rend attribute to describe how this boundary is indicated (using a hyphen, semicolon, etc.). With this method, no hyphen is left as character data in the XML document.
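[Illustration, not part of the discussion] A minimal sketch of how software could recover both a lexical reading and a line-by-line reading from lb elements encoded this way. The attribute values type="inWord" and rend="hyphen" are hypothetical, since no values were fixed; the sample omits the TEI namespace and puts no whitespace around lb.

```python
import xml.etree.ElementTree as ET

# Hypothetical mapping from rend values to the mark printed in the source.
MARKS = {"hyphen": "-", "semicolon": ";"}


def render(elem, diplomatic):
    """Render elem as a lexical text or as a line-by-line transcription.

    diplomatic=True restores the mark named by rend and the line break;
    diplomatic=False joins the halves of a broken word and puts a space
    at any lb that does not fall inside a word.
    """
    out = [elem.text or ""]
    for child in elem:
        if child.tag == "lb":
            if diplomatic:
                out.append(MARKS.get(child.get("rend"), "") + "\n")
            elif child.get("type") != "inWord":
                out.append(" ")
        else:
            out.append(render(child, diplomatic))
        out.append(child.tail or "")
    return "".join(out)


sample = ET.fromstring(
    '<p>an extra<lb type="inWord" rend="hyphen"/>ordinary<lb/>day</p>'
)
lexical = render(sample, diplomatic=False)
lines = render(sample, diplomatic=True)
print(lexical)  # an extraordinary day
print(lines)
```

Because the hyphen lives only in the rend attribute, the lexical reading needs no special handling of hyphen characters.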

Elena said this is exactly how the Austen project handled hyphenation.

Lou continued that this leaves the problem of handling hyphenation of words across line breaks in languages like Dutch and German, where letters within the word are sometimes duplicated before and after the hyphen. Julianne noted that Old Irish did something similar.

Lou said that it's not clear where to put the lb element and said it would seem you would need to repeat it. Brett suggested a standoff choice element. Elena said that the Austen project used xml:id and corresp. Matthew said that [some project he was involved in] used sameAs with xml:id.

Brett noted that we still haven't given advice on how to handle ambiguous cases (where it's not clear whether the source document's author would have used a hyphen had there not been a line break).

After discussion, it was decided that users could use any of the following values for the rend attribute for cases of ambiguity:


 * hyphen
 * soft or hard hyphen
 * ambiguous

Lou asked whether the Council needs to give advice to ____ or to those revising the Best Practices for TEI in Libraries (both of which had questions about this issue). Kevin said those working on the Best Practices would use the Council decision in order to revise their work.

ODD (r)evolutions
Sebastian said several tickets have been submitted for problems with and suggestions for the ODD architecture. A mailing list ["tei-meta"] was formed to discuss the future of the ODD system, but there was little discussion. He summarized the desired changes that were detailed in a message to the list on 2010-04-02.

Laurent noted that the proposal for including and excluding elements individually assumes that there's something behind ____ that is pointed to.

There was discussion of the proposed changes.

Dan noted that these changes work best for elements that are in classes but asked how it would affect elements not in classes. There was further discussion, during which Sebastian said that there would need to be a "magic module" -- a TEI module that would automatically select ___.

There was further discussion.

Laurent asked whether we should provide an explicit mechanism to say which ___ [is/are] in the TEI [module?].

There was much further discussion.

Dan and Laurent said we need to have the source [of what?] specify itself to avoid ambiguity. Sebastian disagreed that there's any ambiguity.

Brett asked whether all elements would be included or excluded by default. Sebastian said that currently _____.

There was a discussion on the merits of including or excluding elements as you go while constructing a project-specific schema.

Council agreed to support the further development of ODD.

Sebastian noted that we will have a problem with combining elements with the same name from different namespaces. Currently, our classes are named after exemplary elements, but it wouldn't be clear which namespace these exemplary elements belong to. He suggested three ways to fix this:


 * 1) Change the way model classes are specified in the ODDs.  This unfortunately breaks any ODDs currently in use.
 * 2) Use namespaces in the names of model classes.  This would not be valid RelaxNG, but perhaps such a feature could be suggested to the RelaxNG developers.
 * 3) Create fake namespaces within the names of model classes, perhaps using the raised dot Unicode character as a delimiter (since it's one of the few characters allowed where needed).

Kevin suggested choosing the first option but adding a feature to Roma that will upgrade any existing ODDs when they are uploaded to use the new system of model classes.

Dan asked whether the first solution still leaves the problem of the third solution. He suggested a fourth solution: adding an attribute (perhaps called "prefix") to the element specification in the ODD language.

After discussion, Council decided that Sebastian will choose the best way to handle homonymous elements from different namespaces. Dan added that he should strive to make ODD mechanisms generalizable beyond the TEI.