SIG:CMC/CMC-core schema for representing CMC in TEI (2019)

This page is part of the wiki space of the TEI-SIG “Computer-mediated communication”.

ODD file, RNG schema and encoding examples
ZIP archive (ODD and RNG files):
 *  []  (September 24, 2019)
HTML versions of the ODD for online browsing:
 * ODD, short (new & modified models only): []
 * ODD, complete: []
Manually annotated sample XML files:
 * 1) []
 * 2) []

About this schema
This ODD describes an encoding schema for genres of computer-mediated communication (CMC) / social media. It defines the basic models that are needed to encode CMC corpora but are not yet part of the TEI.

Authors: Michael Beißwenger, Laura Herzberg, Harald Lüngen and Ciara R. Wigham.

The schema is based on version P5 (3.3.0) of the TEI Guidelines for Electronic Text Encoding and Interchange (henceforth: ‘TEI-P5’) and customizes the models defined in TEI-P5 to represent the structural and linguistic peculiarities of CMC genres. The schema takes into account earlier schema drafts developed by members of the SIG (the ‘DeRiK schema’ described in Beißwenger et al. 2012, the ‘CoMeRe schema’ described in Chanier et al. 2014, and the ‘CLARIN-D schema’ described in Lüngen et al. 2016) as well as discussions on a core schema at the TEI conference 2016.

Status of the schema
Consider this schema a draft and a basis for further discussion. A rationale for the models included in the schema will be given as part of the panel “TEI across corpora, languages and genres: Towards a standard for the representation of social media and computer-mediated communication” at the TEI Conference and Members Meeting 2015 in Lyon. We look forward to feedback and further suggestions at the conference, via the SIG space in the TEI wiki and/or via the SIG’s mailing list (tei-cmc@googlegroups.com).

Characteristics of the schema: TEI customizations and best practices
The schema uses four types of customizations:
 * 1) The content models of three elements from TEI-P5 have been modified (<s>, <p>, ) to include the model class model.floatP.cmc (see below).
 * 2) The three new elements, , and  have been introduced.
 * 3) Two attribute classes (att.ascribed, att.global) have been modified to introduce the new CMC-specific attribute auto and to allow the existing attribute who to appear on all elements.
 * 4) Two classes have been introduced to combine the new, CMC-specific elements (model.divPart.cmc), or to combine existing TEI-P5 elements for less restricted usage in CMC documents (model.floatP.cmc):,  , ,.
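
For readers unfamiliar with ODD, the following fragment is a rough sketch of what customizations of types 3) and 4) can look like in TEI ODD syntax. It is not taken from the CMC-core ODD itself: the schemaSpec identifier, the selection of modules, the descriptions, and the decision to add who directly to att.global are assumptions made for illustration only. The actual declarations are in the ODD file in the ZIP archive.

```xml
<!-- Illustrative sketch only; NOT copied from the CMC-core ODD. -->
<schemaSpec xmlns="http://www.tei-c.org/ns/1.0"
            ident="cmc-core-sketch" start="TEI">
  <!-- Assumed minimal set of TEI-P5 modules -->
  <moduleRef key="tei"/>
  <moduleRef key="header"/>
  <moduleRef key="core"/>
  <moduleRef key="textstructure"/>

  <!-- Type 4: declare a new model class (name as given on this page) -->
  <classSpec ident="model.divPart.cmc" type="model" mode="add">
    <desc>Groups the new, CMC-specific elements (hypothetical description).</desc>
  </classSpec>

  <!-- Type 3: change an existing attribute class so that "who"
       becomes available on all elements (assumed mechanism) -->
  <classSpec ident="att.global" type="atts" mode="change" module="tei">
    <attList>
      <attDef ident="who" mode="add">
        <desc>Points to the person or persons to whom the content is ascribed
          (hypothetical description).</desc>
        <datatype maxOccurs="unbounded">
          <dataRef key="teidata.pointer"/>
        </datatype>
      </attDef>
    </attList>
  </classSpec>
</schemaSpec>
```

A fragment of this kind would normally sit inside the body of a TEI document and be compiled to RNG with the TEI stylesheets or Roma; the new attribute auto and the new elements would need their own attDef and elementSpec declarations, which are omitted here.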

In addition to these customizations, we have defined best practices for using the TEI-P5 models ,, , , <div>, and others for annotating CMC phenomena and for adding part-of-speech information for every word token. We'll describe these best practices in the panel at the TEI-MM in Lyon.

References
Description of the DeRiK schema:

 * Beißwenger, M., Ermakova, M., Geyken, A., Lemnitzer, L. & Storrer, A. (2012). «A TEI Schema for the Representation of Computer-mediated Communication». Journal of the Text Encoding Initiative, Issue 3. DOI: 10.4000/jtei.476

Description of the CoMeRe project:

 * Chanier, T., Poudat, C., Sagot, B., Antoniadis, G., Wigham, C. R., Hriba, L., Longhi, J. & Seddah, D. (2014). «The CoMeRe corpus for French: structuring and annotating heterogeneous CMC genres». Special issue on «Building And Annotating Corpora Of Computer-Mediated Discourse: Issues and Challenges at the Interface of Corpus and Computational Linguistics». JLCL (Journal of Language Technology and Computational Linguistics), pp. 1-31.