Minutes from December 2, 2008
Present:
- Syd Bauman (Brown University)
- Michelle Dalmau (Indiana University)
- Matthew Gibson (University of Virginia)
- Kevin Hawkins (University of Michigan)
- Lisa McCauley (University of California, Los Angeles)
- Chris Powell (University of Michigan)
- Andrew Rouner (Washington University in St. Louis)
- Natasha Smith (University of North Carolina)
- Perry Willett (California Digital Library)
- Rich Wisneski (Case Western Reserve University)
- Glen Worthey (Stanford University)
The meeting was called to order at 1:10 p.m., EST.
Contents
- 1 Overall organizational structure of the Guidelines for Best Practices (GBP)
- 2 Discussion of "general recommendations" section
- 3 Discussion of relationship of TEI Lite to GBP
- 4 Discussion of "general recommendations" section, continued
- 5 Discussion of header revisions
- 6 Level 1 and METS
- 7 Monthly meetings, deadlines, and DLF Spring Forum 2009
Overall organizational structure of the Guidelines for Best Practices (GBP)
We agreed to move the acknowledgements to the end and remove the section on "recommendation additions to teixlite DTD". [I left the acknowledgement section where it is because it only applies to the header portion of the document. But I clarified this fact in the text. If someone wants to do a more serious overhaul of the appendix to incorporate it, please feel free. (Kshawkin)]
Discussion of "general recommendations" section
Encoding level recorded in editorialDecl
Chris noted that Michigan uses the n attribute on editorialDecl to record the encoding level. Kevin noted that this makes it machine-readable and that he neglected to include this in the proposed revisions.
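To illustrate what "machine-readable" means here, the following is a minimal sketch of how software might read the encoding level from the n attribute. The header fragment, the value "4", and the function name encoding_level are assumptions made for illustration, not taken from Michigan's files; the fragment also omits the TEI namespace that a P5 document would carry.

```python
import xml.etree.ElementTree as ET

# Hypothetical header fragment: the value "4" and the prose are illustrative only.
HEADER = """
<teiHeader>
  <encodingDesc>
    <editorialDecl n="4">
      <p>Encoded to Level 4 of the best practices guidelines.</p>
    </editorialDecl>
  </encodingDesc>
</teiHeader>
""".strip()


def encoding_level(header_xml):
    """Return the value of n on the first editorialDecl, or None if absent."""
    root = ET.fromstring(header_xml)
    decl = root.find(".//editorialDecl")
    return decl.get("n") if decl is not None else None


print(encoding_level(HEADER))
```

Because the level sits in an attribute rather than in the prose of the p element, a program can retrieve it without parsing free text.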
There was a discussion of the content model of editorialDecl, which allows either a prose description or a structured description of editorial practices but not both. Kevin suggested library-based encoding projects were unlikely to include detailed descriptions of editorial practices, so using the p element is probably sufficient.
Lisa said that since the GBP are based on TEI Lite, we can't use the structured description.
Discussion of relationship of TEI Lite to GBP
Syd said he wants the GBP to become a set of TEI customizations, not a TEI Lite customization, as this makes much more sense given how P5 should be used. Furthermore, the latter is not currently possible with TEI-provided software.
Many said we should first update the GBP to work with TEI Lite P5 and then consider whether TEI Lite is sufficient for our needs. Lisa suggested we start a new wiki page for desiderata relating to customization of P5 for libraries for us to return to later.
Discussion of "general recommendations" section, continued
Syd said he's not entirely happy with using the n attribute on editorialDecl to record the encoding level, since n is canonically used to record a label, e.g. an iteration number ("the first editorialDecl, the second editorialDecl", etc.). However, he saw no other place to record this information and agreed that it is the most appropriate place to record it. Kevin said he would edit the GBP and his sample header to add this attribute value. Michelle said we could discuss it further by email.
File naming conventions
There was a discussion of whether ISO 9660 is still relevant. Many said it is not, but Kevin said he was unable to find a standard that superseded it. Syd suggested prescribing the Rock Ridge Interchange Protocol and said he would send a link to the group.
Numbered divisions
Syd said there are two separate issues: whether to use numbered or unnumbered divs and whether each text should include at least one div1 (or div).
Perry noted that the description of each encoding level prescribes the use of a div1.
Kevin suggested that we prescribe numbered or unnumbered divs to ease interchange but also remind users that documents produced according to the GBP need not be the same files used in their delivery system. Instead, these documents could be thought of as the archival format.
Syd agreed that we should prescribe one or the other. He said it didn't matter to him which we choose, though he prefers unnumbered divs.
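Whichever form is prescribed, converting from numbered to unnumbered divs is mechanically simple. The following is a rough sketch, assuming numbered divisions named div1 through div7 and a document without namespaces; the function name unnumber_divs and the sample fragment are hypothetical.

```python
import re
import xml.etree.ElementTree as ET

# Assumption: numbered divisions are named div1 through div7.
NUMBERED_DIV = re.compile(r"div[1-7]")


def unnumber_divs(root):
    """Rename every div1..div7 element to div, in place, and return the root."""
    for el in root.iter():
        if NUMBERED_DIV.fullmatch(el.tag):
            el.tag = "div"
    return root


SAMPLE = "<body><div1><head>A</head><div2><p>x</p></div2></div1></body>"
root = unnumber_divs(ET.fromstring(SAMPLE))
print(ET.tostring(root, encoding="unicode"))
```

Going the other way (unnumbered to numbered) would additionally require tracking nesting depth, which this sketch does not attempt.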
Michelle suggested we all look at chapter 4 of the guidelines and discuss it further by email.
Page break locations
Syd said he doesn't understand the logic behind requiring page breaks to be inside of a div. Chris said this makes it easier for software to grab the whole div for display, including displaying the page break at which that division begins.
Andrew noted that it will be harder to write a stylesheet to transform from one way of encoding page breaks to another than it would be to go from numbered to unnumbered divs. Syd added that the usual algorithm for determining the page on which an element occurs is to look for the preceding pb. If, however, a page break between two divisions is encoded at the top of the latter division, then finding the page on which that second division begins requires looking for the first child pb of the div instead.
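The two lookups Syd described can be sketched as follows. This is an illustration under stated assumptions, not anyone's production code: the sample fragment, the function names page_by_preceding_pb and starting_page, and the absence of namespaces are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment: the break for page 2 is encoded at the top of the second division.
SAMPLE = """
<body>
  <pb n="1"/>
  <div1><p>First division.</p></div1>
  <div1><pb n="2"/><p>Second division.</p></div1>
</body>
""".strip()


def page_by_preceding_pb(root, target):
    """Return n of the last pb that precedes target in document order."""
    current = None
    for el in root.iter():  # iter() walks the tree in document order
        if el is target:
            return current
        if el.tag == "pb":
            current = el.get("n")
    return None


def starting_page(root, div):
    """Return the page on which div begins.

    If the div opens with a pb (no text before it), use that pb;
    otherwise fall back to the preceding pb.
    """
    children = list(div)
    if children and children[0].tag == "pb" and not (div.text or "").strip():
        return children[0].get("n")
    return page_by_preceding_pb(root, div)


root = ET.fromstring(SAMPLE)
div_a, div_b = root.findall("div1")
print(page_by_preceding_pb(root, div_b))  # page according to the preceding pb only
print(starting_page(root, div_b))         # page according to the leading child pb
```

For the second division the two functions disagree, which is the extra case software has to handle when page breaks are placed inside divs.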
Kevin said he is reconsidering his opposition to this on the grounds that it is software-specific because the ontological reality here is not all that important, whereas it seems that having page breaks inside divs makes the document easier to process.
Perry said we should keep in mind that we want to guide inexperienced encoders.
Kevin said we should pick one way or the other to aid interchange. Perry noted that we've discovered that people don't interchange documents. Kevin said they might aggregate them.
Michelle said she prefers to represent ontological reality at the expense of processability.
Internal consistency
Lisa suggested adding a new bullet to recommend internal consistency in encoding. Syd concurred.
Discussion of header revisions
Michelle noted that the mechanism for encoding rendition differs between P4 and P5. Syd replied that the old mechanism still exists, so we don't need to adopt the new, additional one.
We discussed how to best resolve the many outstanding issues in the header section. Syd suggested moving the discussion points to the "talk" page of the wiki, with links from the GBP text to it.
Lisa suggested doing an online survey to answer the questions. Michelle said Indiana University has an account on SurveyMonkey that we could use. Perry asked whether this would be open to just those of us participating in this call. Everyone agreed that that would be the case.
Lisa suggested that we try to reach some consensus before putting a question to a survey.
Michelle said she, Matthew, and Kevin would go through the comments, resolve what can be resolved, and survey others for the remainder.
Level 1 and METS
Lisa said she'd like to know how Level 1 could work with METS. Chris said you basically export the catalog record and create a header. She said this was developed for the work with Google but that she could send a sample to Lisa.
Michelle said Indiana University does Levels 3 and 4 automatically as well. She asked Chris whether she has documentation. Chris said she could put something together. She noted that there's already a public collection in HathiTrust which contains Level 1 texts digitized through Michigan's workflows.
Natasha asked Chris to send it to everyone. Chris asked what list to send this to. Kevin said there's no list for our current group, so just "reply to all" on the latest email.
Kevin (not realizing the discussion was about METS and not just Level 1) noted that Michigan's use of Level 1 predates the Google partnership and is based simply on automated processes based on OCR text.
Michelle said Indiana University attempted mass digitization using Level 1, creating "TEI shells" automatically. She said this might also be helpful.
Rich said he is also interested in this. He asked whether Chris creates the METS file as well.
Chris said she has a script for moving content from Michigan's old repository (DLXS) into the new one (HathiTrust), which uses METS. The script creates a manifest in METS based on timestamps on files and other provenance information. What is produced fits into HathiTrust alongside content from Google. While its internal structure is not identical to that of content from Google, it is interoperable within the architecture.
Michelle noted that METS might be useful at other encoding levels as well.
Monthly meetings, deadlines, and DLF Spring Forum 2009
Michelle suggested that we continue meeting at this same time of the week (Tuesdays at 1 p.m.) but meet every month. She named the future meeting dates:
- January 13, 2009
- February 10, 2009
- March 10, 2009
- April 13 [actually 14!], 2009
Natasha asked when the DLF Spring Forum 2009 will be. ___ said it will be May 4-9 in Raleigh.
Michelle said she would like to have a draft revision of the GBP ready by March 30 for public comment, giving people two weeks to respond. This would leave us more than two weeks, during which there's a call scheduled, to incorporate these comments into the document.
Kevin asked Michelle to send the dates by email.
Michelle asked Matthew to coordinate a meeting in Raleigh. She said meeting the morning of the conference would be good, but it's fine to fit it in at a different time as well.
Natasha said she could also host in nearby Chapel Hill. Michelle said this is a good back-up plan.
The meeting adjourned at 2:07 p.m.