SIG:CMC/CoMeRe schema draft for representing CMC in TEI (2014)

Status of this draft
This page describes a draft of a basic schema for representing genres of computer-mediated communication (CMC) in TEI. The draft has been created by members of the TEI SIG "Computer-Mediated Communication", whose members are developing databanks of CMC corpora encoded in TEI in various European languages.

The SIG encourages everybody to discuss this draft and give their feedback/comments using the "discussion" function on top of this page. The comments/discussions will be carefully taken into consideration in the further development of the schema.

The history of the draft is documented on the main wiki page of the SIG. This page should be read in parallel with SIG:CMC/Draft: A metadata schema for CMC.

Authors of this draft: Thierry Chanier, N.N., N.N.

Interaction types
Figure 1: Interaction Space and multimodal interactions

Interaction
Participants are in the same interaction space (IS) when they can interact (but do not necessarily do so, cf. lurkers). They interact through input devices (microphone, keyboard, mouse, gloves, etc.), which let them use the modality tools, and through output devices, which mainly produce visual or oral signals (these, however, will not be described in this article). Hence, when participants can neither hear nor see the other participants' actions, they are not in the same IS. Of course, participants may not be present during the whole time frame of the IS: they can enter late or leave early.

In an IS, actions occur between participants. Let us call the trace of an action within an environment and one particular modality an “act”. Acts are generated by participants, and sometimes by the system. Some of them may be considered as directly communicative (verbal ones in synchronous text or oral modalities). Others may not be directly communicative but may represent the cause of communicative reaction / interaction (e.g. when participants write collaboratively in an online word processor and comment on their work). Participants see and hear what others are doing. These actions may represent the rationale for participants to be there and to interact (produce something collectively). Hence the distinction between acts, directly communicative or not, is irrelevant.

An important distinction may be made between an IS where only one modality (tool) is used by participants and an IS where several are used. In the next section, we start by presenting some examples of monomodal environments where actions occur en bloc (Beisswenger et al., 2012). A more complicated case arises when an IS uses several modalities; an example of this is also given in this article.

Within other multimodal environments, verbal (speech, text chat) and nonverbal acts occur simultaneously. The main purpose of transcriptions is then to describe inter-relations amongst acts and within acts: the participant's utterance may be re-planned while s/he talks, depending on other specific acts occurring at the same time (see Wigham & Chanier, 2013). Indeed, written communication can be simultaneously combined with other modalities. For example, there are situations where a participant does not plan an utterance as a one-shot process before it is sent as an en bloc message to a server, which in turn displays it to the other participants as a non-modifiable piece of language (e.g. as a text chat turn). Undeniably, an utterance can also be planned, then modified in the throes of the interaction while taking into account what is happening in other modalities of communication (e.g. in an audio chat turn).

Table 1: CMC environments and modalities

Examples of macrostructures
Following Beisswenger et al. (2012), we refer to the macro-structure when considering the general information attached to an interaction (addressees, copy-to, title, label, readers, attached files, etc.) as well as issues dealing with the way of arranging sets of interactions per modality or interaction space. The micro-structure of the text (see the Microstructure section) refers to the type of elements found in the actual contents of the interaction (<post>, <u> or <prod>), for example interaction words, emoticons, hash codes, etc.

Before assembling general proposals concerning the interaction elements and their attributes, let us consider some examples of CMC interactions.

Multimodal example
From: cmr-copeas-tei-v1. Context: Lyceum audio-graphic conference environment, 3 learners (English L2) working in a word processor: one writing, the others helping.
 * (1.2): collaborative word processor
 * (1.3): audio, clarification
 * (1.4): textchat, correction (with error)
 * (1.5): textchat, request clarification

(1)
(1.2) (modify),paragraph (ad,For example:to have comparaison between web sites, to know more criterias for a good site)
(1.3) euh + to euh + to use the good euh + good euh vocabulary + I think euh + we euh + we wrote euh for me I have euh +++ I've euh much progress in euh in use the good vocabulary to euh + to evaluate euh a website
(1.4) according to the differents criteria
(1.5) ?

Textchat
From cmr-getalp_org-rhone-alpes-tei-v1. Textchat turns correspond here, respectively, to :
 * the first 3: messages
 * 4: change alias
 * 5: change rights

(2)
Apres je vé faire ma physique c aussi les equation bilan
Aujourd'hui c la journée equation
lol
Changement de pseudo: Tsu -> Tsu[H
#Rhone-alpes:changement de mode(s) '+o Mega-Link' par Hera!services@olympe.epiknet.org

SMS
From cmr-smslareunion-tei-v1. Two SMS messages sent by the same person to the same addressee. Note that the addressee is not explicitly encoded here: the phone numbers of addressees have not been collected, hence are not known, and there is no attribute in TEI to encode addressees (the same problem arises with the TEI chapter on speech).

(3)
é@??$?Le + triste c ke tu na aucune phraz agréabl et ke tu va encor me dir ke c moi ki Merde par mon attitu2! Moi je deman2 pa mieu ke klke mot agréabl échangé […]
...2 te comporter comme ca avec moi. Je ve bien admettr mes erreur kan j'agi vraimen mal comm hier mé fo pa exagérer. Si t pa d'accor c ton droi. Si tentain.le rest c à dirreposer dé question sur 1 sujet déjà expliké c pa 1 raison valabl pr ke tu te monte contr moi.pr moi ossi ca suffi.

Discussion forum
From cmr-simuligne-tei-v1. The author of the message is a native speaker of French who is replying to a post made by a learner of French. Each person mentioned has been identified in the message structure (author, list of readers -here shortened-) and in its contents (addressee, signature of the author, attached file). This information may lead to other types of research on discourse and group interactions. For example, who takes the position of a leader, or an animator in a group? Can subgroups of communication be traced within a group, thanks to an analysis of clusters, cliques?

(4)  les sons du Suffolk    Read [other readers]  Puisqu'on parlait de ce qui est par la fenêtre, j'ai mis mon micro tout près de la fenêtre ce soir... Le coucou s'était déjà couché, malheureusement, mais les autres chantent très fort! Les 'Bullocks' ont tous le même père: un grand taureau Charolais, qui est le père de tous les boeufs de la région! bonne nuit à tous<name ref="#cmr-Simu-Al5" type="person"> Marja

Wikipedia discussion
to be added

Blog
From cmr-infral-tei-v1. One message and its comment.

(5)
<post xml:id="cmr-blog-a2" synch="#T2" who="#P2" type="blog-message"> Présentation de ma personne étapeE1 ; Bon soir à tous!<lb/>
Maintenant, je vais commencer avec les présentations.......<lb/>
Je pense que vous avez vu que je m'appelle <name ref="#P2">Kerstin</name>. J'ai 22 ans. Mon nom est un nom suédois qui est très fréquent en Allemagne. Comme vous savez peut-être, on a commencé nos études en master cette semaine.<lb/>
Ma famille - mes parents et mes deux soeurs -habite à Osnabrueck. C'est une ville qui est pas loin de Brême. Après avoir passé mon bac à Osnabrueck, j'ai commencé mes études de francais et de sport à Brême. La raison pour laquelle j'ai choisi ces deux matières est que j'aime faire du sport (jouer au tennis, nager) et que j'adore la culture francaise. J'adore la langue francaise et le pays me plaît beaucoup (le paysage francais....).<lb/>
Les deux étés passés, j'ai fait un stage de plus que deux mois en Suisse francophone et en France (près de Lyon) pour améliorer mes connaissances de la langue francaise et la pratique du francais à l'oral.<lb/>
En ce qui concerne mes études de francais, ce qui me plaît surtout, c'est, d'explorer la culture francaise d'une manière différente (les textes littéraires, les séquences vidéos......).<lb/>
J'attends vos présentations et je vous souhaite encore un bon soir........<lb/>
A bientôt, <name ref="#P2">Kerstin</name>
</post>
<post xml:id="cmr-blog-a3" synch="#T3" who="#P3" type="blog-comment" ref="#cmr-blog-a2"> Hallo Kirstin! J'ai lu que tu as fait des stages e...
Hallo <name ref="#P2">Kirstin</name> ! J'ai lu que tu as fait des stages en Suisse francophone ! Où exactement car j'habite près de la frontière suisse (à 1h de Lausanne !)! Je pense qu'on aura l'occasion d'en reparler ! Bis Bald
</post>

Email
From cmr-simuligne-tei-v1. One email sent to one person and read by this person.

(6)
<post xml:id="cmr-Simu-Aq-At-Outbox-0080" when="2001-05-12T01:15:00" who="#cmr-Simu-At" type="email-message"> ta photo
<listPerson>
<person corresp="#cmr-Simu-Al6"> <event type="SendTo"> SendTo </event> </person>
<person corresp="#cmr-Simu-Al6"> <event type="Read" when="2001-05-12T01:15:00"> Read </event> </person>
</listPerson>
Coucou <name ref="#cmr-Simu-Al6" type="person">Mia</name>, Tu peux aller te voir dans Publications : maintenant, tu y existe en totalité ! A bientôt, <name ref="#cmr-Simu-At" type="person">Anna</name>
</post>

The <post> element
In our schema, the <post> element is the basic structural element of a CMC document, corresponding to textual "en bloc" interactions. We consider it a macrostructural element, but it is the pivot between the higher-level macrostructural components (thread and logfile) and the microstructure of the content which it encloses. The structure of <post> is based on that of the existing <div> element.

The <div> and <post> elements have the following similarities:
 * <div> and <post> are high-level elements, belonging to the same class (model.divLike);
 * <div> and <post> contain the major divisions of text;
 * <div> and <post> have similar internal content.

It is important to note that, like <div>, <post> does not belong to the class of pLike elements. One <post> may consist of one or more paragraphs, similar to a <div>. While a division may represent, for example, a chapter of a book, <post> represents one user contribution to some computer-mediated communication event (forum, blog, web discussion, or chat). Such a contribution can contain multiple paragraphs, just like a <div>. In a chat, all postings consist of exactly one paragraph and the portion of text exhibits no special markup, but on the Wikipedia talk page given in figure 2, some of the postings contain divisions and markup that the authors inserted into their postings in order to structure their content. Therefore, <post> cannot be a model.pLike element.

The <div> and <post> elements have the following differences:
 * <div> is a self-nesting element, while <post> is not;
 * <post>s can only appear inside of a division which encloses one complete CMC document (such as an entire forum thread, an entire blog with user comments, or a chat logfile).

In other words, <post> is a child element of <div> and shares its content model, except that it does not contain divisions and does not embed itself. Normally, <post> consists of one or more paragraphs. In some cases a posting contains a head, typically with a title.
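As a minimal sketch of the content model described above (assuming the draft's post element and the attribute names used in the examples on this page; all ids, dates and wording are invented for illustration), a thread division holding a message and its comment might look like this:

```xml
<!-- Illustrative sketch only: ids, dates and wording are invented. -->
<div type="thread">
  <!-- One user contribution: an optional head followed by one or more paragraphs -->
  <post xml:id="ex-a1" who="#ex-P1" when="2014-01-10T09:00:00" type="blog-message">
    <head>Example title</head>
    <p>First paragraph of the contribution.</p>
    <p>Second paragraph of the same contribution.</p>
  </post>
  <!-- The comment is a sibling of the message, not a child: post does not nest.
       The link back to the message is carried by @ref, as in example (5). -->
  <post xml:id="ex-a2" who="#ex-P2" when="2014-01-10T09:12:00" type="blog-comment" ref="#ex-a1">
    <p>A one-paragraph reply.</p>
  </post>
</div>
```

Keeping replies as siblings linked by a pointer, rather than nesting them, is what keeps the element non-self-nesting while still letting a thread structure be recovered.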

Attributes for
Here is a summary of the attributes and other information which may be attached to different types of CMC environments. Note that information relative to StatusRead and Receiver have been encoded in TEI in the of the (close to the - for email, forum, blog- and (for blog)). Attributes relative to Wikipedia forums have not yet been used (see the corresponding section).

Table 2: attributes of the <post> element

Macrostructure & multimodality: u and prod elements
Figure2: non verbal acts in a 3D CMC environment

Types of divisions for the interaction space
As already seen, an interaction space may be described at two very different levels:
 * 1) the meta level (see SIG:CMC/Draft: A metadata schema for CMC);
 * 2) the interactions themselves, i.e. the set of acts. These acts, in all the examples given here, are included within a division which corresponds to a session, or within a division nested in such a division.

We may distinguish several types of divisions :


 * div type="thread", e.g. forum, blog, with different tools
   * child element: <post>, with different types
 * div type="logfile", e.g. textchat, SMS, with different tools
   * child element: <post>, with different types, for example within a textchat
 * div type="oral-discourse", for audiochat
   * child element: <u>, see the TEI chapter on speech
 * div type="multi-modalities"
   * child element: <u>
   * child element: <post>
   * child element: <prod> (for iconic acts - vote, raise_hand, brief_absence_act, etc. -, all collective tools - word processor, semantic map, whiteboard, etc. -, nonverbal communication). As an example of a classification of non-verbal acts, see Figure 2, which represents non-verbal acts in Second Life as encoded by Wigham & Chanier (2013).
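The four division types listed above can be pictured with the following skeleton. It is only a sketch under the assumptions of this draft: the ids and contents are invented, and attributes are reduced to a minimum.

```xml
<!-- Illustrative skeleton only: ids and contents are invented. -->
<body>
  <!-- Asynchronous tools (forum, blog): posts grouped in a thread -->
  <div type="thread">
    <post xml:id="ex-f1" who="#ex-P1" type="blog-message"><p>...</p></post>
  </div>
  <!-- Textchat, SMS: posts grouped in a logfile -->
  <div type="logfile">
    <post xml:id="ex-c1" who="#ex-P2" type="chat-message"><p>...</p></post>
  </div>
  <!-- Audiochat: utterances as in the TEI chapter on speech -->
  <div type="oral-discourse">
    <u xml:id="ex-o1" who="#ex-P1">...</u>
  </div>
  <!-- Several modalities in one interaction space: the three kinds of acts are siblings -->
  <div type="multi-modalities">
    <u xml:id="ex-m1" who="#ex-P1">...</u>
    <post xml:id="ex-m2" who="#ex-P2" type="chat-message"><p>...</p></post>
    <prod xml:id="ex-m3" who="#ex-P3" type="vote">agree</prod>
  </div>
</body>
```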

<prod> element
As explained, the <prod> element refers to acts which are non-verbal and are part of the interaction process, at the same level as the <post> and <u> elements. After example (1) given above, here is another example (7) of interactions between one tutor and learners (from: cmr-copeas-tei-v1; context: Lyceum audio-graphic conference environment).
 * (7.1) audio act : yes/no question by the tutor
 * (7.2) positive answer given through a non-verbal modality by a learner (iconic system, modality here named "vote" with content "agree")
 * (7.3) audio act : yes/no question by the tutor
 * (7.4) textchat act : complementary info given by a learner
 * (7.5) to (7.9) : yes / no answers of 5 participants through the iconic system

(7)
(7.1) <u xml:id="cmr-copeas-R2_lobby-a_1297" xml:lang="eng" start="#cmr-copeas-tl_r-w107" end="#cmr-copeas-tl_r-w109" who="#AR4"> euh no + euh I don't know the + the style ++ in french it's a band + named + {les enfoirés} ++ you know euh + {enfoirés} |+++ </u>
(7.2) <prod xml:id="cmr-copeas-R2_lobby-a_1298" synch="#cmr-copeas-tl_r-w108" who="#AR7" type="vote">agree</prod>
(7.3) <u xml:id="cmr-copeas-R2_lobby-a_1299" xml:lang="eng" start="#cmr-copeas-tl_r-w109" end="#cmr-copeas-tl_r-w110" who="#TutR">anybody else know | </u>
(7.4) <post xml:id="cmr-copeas-R2_lobby-a_1301" xml:lang="unk" synch="#cmr-copeas-tl_r-w110" who="#AR6" type="chat-message"> french's singers </post>
(7.5) <prod xml:id="cmr-copeas-R2_lobby-a_1302" synch="#cmr-copeas-tl_r-w111" who="#AR3" type="vote">agree</prod>
(7.6) <prod xml:id="cmr-copeas-R2_lobby-a_1303" synch="#cmr-copeas-tl_r-w111" who="#AR2" type="vote">agree</prod>
(7.7) <prod xml:id="cmr-copeas-R2_lobby-a_1304" synch="#cmr-copeas-tl_r-w112" who="#AR6" type="vote">agree</prod>
(7.8) <prod xml:id="cmr-copeas-R2_lobby-a_1305" synch="#cmr-copeas-tl_r-w113" who="#TutR" type="vote">disagree</prod>
(7.9) <prod xml:id="cmr-copeas-R2_lobby-a_1306" synch="#cmr-copeas-tl_r-w114" who="#AR1" type="vote">disagree</prod>

The content of the <prod> element is fairly simple in (7), whereas it is much more complicated in the <prod> element of example (1). In (1) it corresponds to an act of typing within a collaborative word processor. It is up to the researchers who transcribe (from video screen captures) actions within online collaborative tools to decide which kind of coding scheme they want. We should not impose anything on this content. The only mandatory information should be restricted to attributes.

This element does not exist in the current TEI version. Of course, the element name may be debatable (here the name "prod" reflects the fact that the corresponding non-verbal act is a production made by a participant), but not its function.

We considered some TEI elements related to non-verbal features before introducing <prod>.

Elements that cannot be used as an act of type prod

 * <activity>: "contains a brief informal description of what a participant in a language interaction is doing other than speaking, if anything." It is only a brief description, has no @who attribute, and is too low-level.
 * <kinesic>: "marks any communicative phenomenon, not necessarily vocalized, for example a gesture, frown, etc." It has been designed to be integrated inside <u>, but may be used at the same level. However, the name is wrong: kinesics is a specific non-verbal notion related to gaze, posture and gesture, not a general one.
 * <incident>: "marks any phenomenon or occurrence, not necessarily vocalized or communicative, for example incidental noises or other events affecting communication." The name is unacceptable: it runs against the interaction and communicative framework.

It is the whole philosophy / theoretical standpoint of the TEI chapter on speech that cannot be applied to non-verbal descriptions placed at the same level as text and speech. These TEI elements, attached to an utterance, are not really considered as being part of the interaction at the same level as the utterance itself. Their naming is also unacceptable and cannot refer to the concepts we have mentioned here.

Attributes of <u> and <prod>
Figure 3: interplay between modalities

<u> and <prod> elements share the main attributes listed in the upper part of Table 2 for <post>. However, there is a striking difference with respect to time. Time attributes attached to <post> correspond to en bloc acts. The time encoded corresponds to the time at which the post was received/sent by the server. This is also the time at which other participants saw the result of the act appearing on their screen. They can only react to this act after this time. All participants perceive this act (let us say a textchat turn) as instantaneous, despite the fact that it is not. By contrast, all <u> and most <prod> acts have a duration, i.e. other participants perceive oral or visual cues of the act and may react before it ends. Their reaction may lead the first participant to change, in the throes of the interaction, what s/he intended to say or do. For example, Figure 3 displays in the left column the transcript of an audio act and in the three right columns the contents of textchat acts. While the first participant utters his message, the other three react. Consequently, the first one changes twice what he had initially planned to utter (extract from a Second Life environment).
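The difference between an instantaneous en bloc act and an act with a duration can be sketched as follows. The sketch assumes, as in example (7), that acts point to anchors of a TEI timeline: @synch for an en bloc act, @start and @end for an act with a duration. All ids and time values are invented.

```xml
<!-- Illustrative sketch only: ids and time values are invented. -->
<body>
  <timeline unit="s" origin="#ex-t0">
    <when xml:id="ex-t0" absolute="2014-01-10T09:00:00"/>
    <when xml:id="ex-t1" interval="3" since="#ex-t0"/>
    <when xml:id="ex-t2" interval="8" since="#ex-t0"/>
  </timeline>
  <div type="multi-modalities">
    <!-- Audio act with a duration: it runs from ex-t0 to ex-t2 -->
    <u xml:id="ex-a1" who="#ex-P1" start="#ex-t0" end="#ex-t2">...</u>
    <!-- En bloc chat turn: a single point in time, here falling inside the utterance above -->
    <post xml:id="ex-a2" who="#ex-P2" synch="#ex-t1" type="chat-message"><p>...</p></post>
  </div>
</body>
```

Because the chat turn is anchored at a point between the start and the end of the utterance, the overlap that may lead the speaker to re-plan can be recovered from the time anchors alone.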

Attribute representing Addressee / Receiver for <post> and more elements
In Table1 we mentioned the "Receiver" information, which describes to whom a message (email) was sent. Since no such attribute exists in TEI, we encoded it inside the <post>, in a quite verbose way (see the <listPerson> block in example (6)). If we had known the addressees of SMS messages (we could not collect them, for ethical reasons), we would also have needed such a feature within a <post>.

It appears that in oral corpora it is often necessary to clearly mention the addressee of a speech utterance. See, for example, the Alipe project on child-parent interactions, where addressees are clearly identified and have been described through verbose feature structures.

A great improvement over these circumlocutions would be to design an @addressee attribute which could be added to <post> and <u> at least. We may also use it for the <addressingTerm> element (see next section).
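A minimal sketch of how such an attribute might be used is given below. Note that @addressee is only the proposal made in this draft and is not a TEI attribute; treating its value as a pointer to a participant, in the same way as @who, is an assumption made for this illustration, and all ids are invented.

```xml
<!-- Hypothetical: @addressee is a proposal of this draft, not an existing TEI attribute. -->
<body>
  <div type="logfile">
    <!-- A message sent by ex-P1 to ex-P2 -->
    <post xml:id="ex-s1" who="#ex-P1" addressee="#ex-P2" when="2014-01-10T09:00:00" type="chat-message">
      <p>...</p>
    </post>
  </div>
  <div type="oral-discourse">
    <!-- A speech utterance explicitly addressed to one participant -->
    <u xml:id="ex-u1" who="#ex-P3" addressee="#ex-P4">...</u>
  </div>
</body>
```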

Microstructure


The micro-structure refers to the elements and their attributes found in the actual contents of the interaction (here mainly within <post>). A large part of this section comes from Beisswenger et al. (2012).

There is no common terminology to classify the elements of Internet jargon, nor consensus about the status of these elements in a natural language grammar framework. To fill this gap, we have developed an annotation schema for these phenomena on the microstructure level of CMC documents. The basic linguistic description category of our approach is termed an interaction sign; in the schema, instances of interaction signs such as emoticons, acronyms, etc. are represented using the element <interactionTerm>.

In our schema, we introduce an element <interactionTerm> as a phrase-level element (in the model.phrase class) which encloses one or more instances of subclasses of interaction signs.

New child elements of <interactionTerm>
The <interactionTerm> element can have members of att.global as attributes. In addition, we introduce elements for the following subclasses of interaction signs:


 * <emoticon> : Emoticons are iconic units created using the keyboard. They are often used to portray facial expressions, and they typically serve as emotion, illocution, or irony markers. Due to their iconic character, the use of emoticons is not restricted to CMC in one particular language; instead, the same emoticons can be found in CMC data in different languages. The <emoticon> element is assigned to the gLike element class. Conventionally, elements of this class contain non-Unicode characters and glyphs. Although most emoticons are produced as a sequence of keyboard characters (dot, comma, colon, and the like), the resulting figure is comparable in its semiotic status to graphic characters. While some smiley faces have been included in Unicode, the variety of emoticons is still larger than can be captured by Unicode characters alone. That is why we place the <emoticon> element in the class of gLike elements.


 * <interactionWord> : Interaction words are symbolic linguistic units. Their morphologic construction is based on a word or a phrase of a given language which describes expressions, gestures, bodily actions, or virtual events; for example, the units sing, g (< grins, "grin"), fg (< fat grin), s (< smile), wildsei ("being wild") are used as emotion or illocution markers, irony markers, or to playfully mimic simulated bodily activity. The element <interactionWord> in our schema is a member of model.global.spoken. It shares properties of the , , and elements in TEI.


 * <interactionTemplate> : Interaction templates are units that the user does not generate with the keyboard but by activating a template which automatically inserts a previously prepared text or graphical element into a space of the user's choice. The category of interaction templates includes graphic smileys, chosen by the user of a CMC environment from a finite list of elements. These often portray facial expressions but can depict almost anything; in the case of animated GIFs, they can even portray entire scenes as moving pictures. This clearly goes beyond what can be expressed using only keyboard-generated emoticons. On the other hand, users can invent new emoticons by combining keyboard characters, while template-generated units are always bound to predefined templates. The element <interactionTemplate> in our schema belongs to the model.global class of elements.


 * <addressingTerm> : Addressing terms address an utterance to a particular interlocutor. The most widely used form here is the one made out of the "@" character together with a specification of the addressee's name. The element <addressingTerm> in our schema belongs to the model.nameLike class of elements. While this element usually uses no attributes, our customization includes the att.global attributes. The content of <addressingTerm> is restricted to two elements:
   * The <addressMarker> element belongs to the class model.labelLike (used to gloss or explain parts of a document) and is provided with the att.global class of attributes. The purpose of <addressMarker> is to identify or to highlight the addressee in a posting. This is typically achieved by using the "at" sign ("@") or one of a set of fixed phrases (English: "to"; German: "an" or "für").
   * The <addressee> element is placed in the model.nameLike.agent class.
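The microstructure elements described above might be combined as in the following sketch. It assumes the element names of this draft (none of them exist in standard TEI), with <emoticon> and <addressee> being our reading of the two element names left unnamed in the list; the chat turn, the user name and the ids are invented for illustration.

```xml
<!-- Illustrative sketch only: the turn, the name "Anna" and the ids are invented. -->
<post xml:id="ex-c7" who="#ex-P1" type="chat-message">
  <p>
    <!-- Addressing term: a marker followed by the addressee's name -->
    <interactionTerm>
      <addressingTerm>
        <addressMarker>@</addressMarker>
        <addressee>Anna</addressee>
      </addressingTerm>
    </interactionTerm>
    good idea
    <!-- Keyboard-generated emoticon -->
    <interactionTerm><emoticon>:-)</emoticon></interactionTerm>
    <!-- Interaction word derived from "grins" -->
    <interactionTerm><interactionWord>g</interactionWord></interactionTerm>
  </p>
</post>
```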

Anonymization
In order to be able to distribute the collected CMC data as widely as possible, we need to anonymize the data. Our anonymization strategy shall support the following goals:
 * Every user of the data shall be able to associate a certain set of interaction acts in a CMC document to a user. This user, however, shall not be identifiable as an individual of the “real world”.
 * In some corpora, researchers have collected a great deal of information about users (sociolinguistic data, language biography, etc.), which it is important to release with the corpora for future research analysis.

To achieve these particular goals, we perform the following steps:
 * All of the recoverable personal data of a CMC participant (or group) are collected into a person profile in a <person> element. This profile is provided with a user ID (@xml:id) which is unique within the particular TEI document. All person profiles are stored in the header of the document; thus, they can easily be separated from the body of the document and therefore be hidden from the less privileged users of the data (cf. more explanation on participants - individuals, groups - in SIG:CMC/Draft: A metadata schema for CMC).
 * Each interaction act is linked to a person profile via the @who attribute, which points to the value of an @xml:id of a person element.
 * Instances of user names in segments of a given posting are also linked to a user ID.
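The three steps above can be sketched as follows, under the assumptions of this draft. The ids, the profile data and the chat turn are invented; the mandatory parts of the TEI header other than the participant description are omitted for brevity.

```xml
<!-- Illustrative sketch only: ids, profile data and the chat turn are invented. -->
<TEI xmlns="http://www.tei-c.org/ns/1.0">
  <teiHeader>
    <!-- fileDesc and other header parts omitted in this sketch -->
    <profileDesc>
      <particDesc>
        <listPerson>
          <!-- Step 1: one profile per participant, identified by a document-unique @xml:id -->
          <person xml:id="ex-P1">
            <age value="22">22</age>
            <langKnowledge>
              <langKnown tag="de">German</langKnown>
            </langKnowledge>
          </person>
          <person xml:id="ex-P2"/>
        </listPerson>
      </particDesc>
    </profileDesc>
  </teiHeader>
  <text>
    <body>
      <div type="logfile">
        <!-- Step 2: the act points to its author's profile through @who -->
        <post xml:id="ex-c1" who="#ex-P1" type="chat-message">
          <!-- Step 3: a name in the content is replaced by a keyword and linked to a user ID -->
          <p>Hello <name ref="#ex-P2" type="person">[_forename_]</name> !</p>
        </post>
      </div>
    </body>
  </text>
</TEI>
```

Since the body only carries pointers, the header holding the profiles can be withheld or filtered without touching the interaction data.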

Anonymization is a tedious process, but it has already been accomplished for numerous CMC corpora. When it is done, several types of information are kept and coded into feature structures (who carried out the process? how many characters have been changed? etc.). We will not detail them here. Every time such a process is performed, a piece of information is replaced by another, as in the SMS message in (8).

(8) Bon, d'accord ! Elle est ou miss [_forename_] ? Usa ? On se retrouve ds 1 coin sympa ? Qu'est-ce qui t'arrange ?

In (9) we list the types of information which have been anonymized in the CoMeRe databank, together with the keywords used for replacement, aligned with the TEI elements used to encode them.

(9)
[_forename_]    [_forename_]
[_surname_]     [_surname_]
[_addName_]     <addName>[_addName_]</addName>
[_tel_]         <rs>[_tel_]</rs>
[_email_]       [_email_]
[_url_]         <rs>[_url_]</rs>
[_code_]        <rs>[_code_]</rs>
[_address_]     [_address_]