SIG:CMC/CLARIN-D schema draft for representing CMC in TEI (2015)
This page is part of the wiki space of the TEI-SIG “Computer-mediated communication”.
ODD file, RNG schema and encoding examples
ZIP archive (ODD and RNG files):
- Tei-CMC CHATKORPUS2CLARIN(odd).zip (October 21, 2015)
HTML versions of the ODD for online browsing:
- ODD, short (new & modified models only): tei_CMC_CHATCORPUS2CLARIN_SHORT.html
- ODD, complete: tei_CMC_CHATCORPUS2CLARIN.html
First manually annotated sample XML files (with empty teiHeader):
- chat interaction (logfile from the Dortmund Chat Corpus)
- user discussion on a Wikipedia talk page (logfile from the German Wikipedia Corpus in DeReKo)
About this schema
This ODD describes an encoding schema for genres of computer-mediated communication (CMC) / social media. It is meant as a contribution to the work and discussions in the special interest group “Computer-mediated communication” (CMC-SIG) of the Text Encoding Initiative (TEI). The schema has been developed in the context of the CLARIN-D curation project “ChatCorpus2CLARIN”.
Authors: Michael Beißwenger, Eric Ehrhardt, Axel Herold, Harald Lüngen and Angelika Storrer.
The schema is based on version P5 (2.9.0) of the TEI Guidelines for Electronic Text Encoding and Interchange (henceforth: ‘TEI-P5’). It customizes the models defined in TEI-P5 so that they can capture the structural and linguistic peculiarities of CMC genres. The schema takes into consideration previous schema drafts developed by members of the SIG (the ‘DeRiK schema’ described in Beißwenger et al. 2012 and the ‘CoMeRe schema’ described in Chanier et al. 2014). It also draws on feedback and discussions on these drafts at the TEI conferences 2011 and 2013 and at workshops held in the context of the DFG scientific network Empirikom.
Status of the schema
This schema should be considered a draft and a basis for further discussion. A rationale for the models included in the schema will be given as part of the panel “TEI across corpora, languages and genres: Towards a standard for the representation of social media and computer-mediated communication” at the TEI Conference and Members Meeting 2015 in Lyon. We are looking forward to feedback and further suggestions at the conference, via the SIG space in the TEI wiki and/or via the SIG’s mailing list (firstname.lastname@example.org).
Characteristics of the schema: TEI customizations and best practices
The schema uses four types of customizations:
- The content models of three elements from TEI-P5 (<s>, <p>, <quote>) have been modified to include the model class model.floatP.cmc (see below).
- The three new elements <post>, <prod>, and <signatureContent> have been introduced.
- Two attribute classes have been modified to introduce the CMC-specific new attribute auto, and to allow the existing attribute who to appear on all elements: att.ascribed, att.global
- Two model classes have been introduced: model.divPart.cmc combines the new, CMC-specific elements, and model.floatP.cmc combines existing TEI-P5 elements for less restricted usage in CMC documents (<opener>, <closer>, <signed>, <postscript>, <trailer>).
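As a rough illustration of how customizations of this kind are usually written in ODD, the fragment below sketches the declaration of the two new model classes, the addition of <opener> to model.floatP.cmc, and the declaration of <post> as a member of model.divPart.cmc. It is a sketch under assumptions, not an excerpt from the actual ODD file: the schemaSpec identifier, the descriptions and the omitted content model are hypothetical, and the real declarations in the ZIP archive may differ.

```xml
<!-- Hypothetical sketch; not copied from the ChatCorpus2CLARIN ODD. -->
<schemaSpec ident="tei_cmc_sketch" start="TEI">

  <!-- New model class for existing TEI-P5 elements that may "float" in CMC documents -->
  <classSpec ident="model.floatP.cmc" type="model" mode="add">
    <desc>Groups existing TEI-P5 elements for less restricted usage in CMC documents (assumed wording).</desc>
  </classSpec>

  <!-- New model class for the CMC-specific elements -->
  <classSpec ident="model.divPart.cmc" type="model" mode="add">
    <desc>Groups the new, CMC-specific elements (assumed wording).</desc>
  </classSpec>

  <!-- Existing element: keep its TEI-P5 definition, add a class membership -->
  <elementSpec ident="opener" module="textstructure" mode="change">
    <classes mode="change">
      <memberOf key="model.floatP.cmc"/>
    </classes>
  </elementSpec>

  <!-- New element: declared from scratch and placed in the CMC class -->
  <elementSpec ident="post" mode="add">
    <desc>A single user posting (assumed wording).</desc>
    <classes>
      <memberOf key="model.divPart.cmc"/>
    </classes>
    <!-- content model omitted here; see the actual ODD for the real definition -->
  </elementSpec>

</schemaSpec>
```

The same pattern (an elementSpec in change mode with an added memberOf) would apply to <closer>, <signed>, <postscript> and <trailer>.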
In addition to these customizations, we have defined best practices for using the TEI-P5 models <w>, <phr>, <signed>, <time>, <div>, <name> and others for annotating CMC phenomena and for adding part-of-speech information for every word token. We will describe these best practices in the panel at the TEI Conference and Members Meeting in Lyon.
Description of the DeRiK schema:
- Michael Beißwenger, Maria Ermakova, Alexander Geyken, Lothar Lemnitzer, and Angelika Storrer (2012). «A TEI Schema for the Representation of Computer-mediated Communication», Journal of the Text Encoding Initiative, Issue 3. DOI: 10.4000/jtei.476
Description of the CoMeRe project:
- Chanier, T., Poudat, C., Sagot, B., Antoniadis, G., Wigham, C. R., Hriba, L., Longhi, J. & Seddah, D. (2014). «The CoMeRe corpus for French: structuring and annotating heterogeneous CMC genres». Special issue on «Building And Annotating Corpora Of Computer-Mediated Discourse: Issues and Challenges at the Interface of Corpus and Computational Linguistics». JLCL (Journal of Language Technology and Computational Linguistics), pp. 1-31