SIG:CMC/CoMeRe schema draft for representing CMC in TEI (2014)


This page is part of the wiki space of the TEI-SIG “Computer-mediated communication".



ODD file

This ZIP file contains the ODD file defining the new elements and new attributes corresponding to the TEI-CMC extension proposal, plus an HTML file documenting the ODD file.

Status of this draft

This page describes a draft for a basic schema for representing genres of computer-mediated communication (CMC) in TEI. The draft has been created by members of the TEI-SIG "Computer-Mediated Communication", whose members are developing databanks of CMC corpora encoded in TEI in various European languages (e.g. [1] [2]). This ODD has been applied to all corpora of the CoMeRe project, which can be downloaded in full from the repository http://hdl.handle.net/11403/comere . Documentation on the CoMeRe project is available on its website: http://comere.org

The SIG encourages everybody to discuss this draft and to give feedback/comments using the "discussion" function at the top of this page. The comments/discussions will be carefully taken into consideration in the further development of the schema.

The history of the draft is documented on the main wiki page of the SIG. This page should be read in parallel with SIG:CMC/Draft: A metadata schema for CMC.

Authors of this draft: Thierry Chanier.

Interaction types

Figure 1: Interaction Space and multimodal interactions

Interaction type.jpg

Interaction

Participants are in the same interaction space (IS) when they can interact (but do not necessarily do so, cf. lurkers). They interact through input devices (microphone, keyboard, mouse, gloves, etc.), which let them use the modality tools, and output devices, mainly producing visual or oral signals. (These, however, will not be described in this article.) Hence, when participants can neither hear nor see the other participants' actions, they are not in the same IS. Of course, participants may not be participants during the whole time frame of the IS: they can enter late, or leave early.

In an IS, actions occur between participants. Let us call the trace of an action within an environment and one particular modality an "act". Acts are generated by participants, and sometimes by the system. Some of them may be considered directly communicative (verbal ones in synchronous text or oral modalities). Others may not be directly communicative but may represent the cause of a communicative reaction / interaction (e.g. when participants write collaboratively in an online word processor and comment on their work). Participants see and hear what others are doing. These actions may represent the rationale for participants to be there and to interact (to produce something collectively). Hence the distinction between directly communicative acts and others is irrelevant here.

An important distinction may be made between an IS where only one modality (tool) is used by participants, and an IS where several occur. In the next section, we start by presenting some examples of mono-modality environments where actions occur en bloc (Beisswenger et al., 2012)[3]. A more complicated case appears when an IS uses several modalities; we will also find an example of this in this article.

Within other multimodal environments, verbal (speech, text chat) and nonverbal acts occur simultaneously. The main purpose of transcription is then to describe inter-relations amongst acts and within acts: the participant's utterance may be re-planned as s/he talks, depending on other specific acts occurring at the same time (see Wigham & Chanier, 2013)[4][5][6][7]. Indeed, written communication can be simultaneously combined with other modalities. For example, there are situations where a participant does not plan an utterance as a one-shot process before it is sent as an en bloc message to a server, which in turn displays it to the other participants as a non-modifiable piece of language (e.g. as a text chat turn). An utterance can also be planned, then modified in the throes of the interaction while taking into account what is happening in other modalities of communication (e.g. in an audio chat turn).
Table 1: CMC environments and modalities

| CMC environment | Mode & modality | TEI element | Main macrostructure issues with TEI | Corpora presently being processed into TEI |
|---|---|---|---|---|
| SMS | Text, synchronous | <post> | No notion of addressee (in <head>?) | Y |
| Textchat | Text, synchronous | <post> | @alias, @type, @subtype | Y |
| Email | Text, asynchronous | <post> | Addressees, readers, copy, attached file, etc. (in <head> or <trailer>) | Y |
| Tweet | Text, asynchronous | <post> | Microstructure | Y |
| Discussion forum | Text, asynchronous | <post> | Threads (opening, answering): @ref | Y |
| Wikipedia discussion forum | Text, asynchronous | <post> | Reply difficult to identify (indent): @ref | Y |
| Blog | Text + image, asynchronous | <post> | Message & comment: @ref; set of images attached to the TEI file? | Y |
| Audio conferencing system (e.g. Skype) | Text + audio, synchronous | <post>, <u> | | N |
| Complex CMC environments: | | | | |
| LMS (Learning Management System: WebCT, Moodle): email + textchat + discussion forum | Text, asynchronous & synchronous | <post> | One <teiCorpus> file; one <text> per main interaction space (IS, i.e. group); one main <div> per IS subspace and/or type of modality | Y |
| Audio-graphic conference system (e.g. Lyceum, Centra) | Text + audio + nonverbal, synchronous | <post>, <u>, <prod> | Every element at the same level, i.e. mixing of these elements within a <div>; set of audio files attached to the TEI file? | Y |
| Video-graphic or 3D environment (e.g. Second Life) | Text + audio + nonverbal, synchronous | <post>, <u>, <prod> | Idem + set of video files attached to the TEI file | Y |

Examples of macrostructures

Following Beisswenger et al. (2012)[3], we refer to macrostructure when considering the general information attached to an interaction (addressees, copy-to, title, label, readers, attached files, etc.) as well as issues dealing with ways of arranging sets of interactions per modality or interaction space. The microstructure of the text (next section) refers to the type of elements found in the actual contents of the interaction (<post>, <u> or <prod>), for example interaction words, emoticons, hashtags, etc.

Before assembling general proposals concerning <post> and <prod> and their attributes, let us consider some examples of CMC interactions.

Multimodal example

From: cmr-copeas-tei-v1[2]. Context: Lyceum audio-graphic conference environment; 3 learners (English L2) working in a word processor: one writing, the others helping.

  • (1.2): collaborative word processor
  • (1.3): audio, clarification
  • (1.4): textchat, correction (with error)
  • (1.5): textchat, request clarification
(1)
(1.2)<prod xml:id="cmr-copeas-T8_s101_ecriture_multimodale-a_14481" xml:lang="unk"
                  start="#cmr-copeas-tl_t-w1979" end="#cmr-copeas-tl_t-w1987" who="#AT3"
                  type="text_doc"> (modify),paragraph (ad,For example:to have comparaison between
                  web sites, to know more criterias for a good site)</prod>
(1.3)<u xml:id="cmr-copeas-T8_s101_ecriture_multimodale-a_14482" xml:lang="eng"
                  start="#cmr-copeas-tl_t-w1988" end="#cmr-copeas-tl_t-w1993" who="#AT1">euh + to
                  euh + to use the good euh + good euh vocabulary + I think euh + we euh + we wrote
                  euh for me I have euh +++ I've euh much progress in euh in use the good vocabulary
                  to euh + to evaluate euh a website</u>
(1.4)<post xml:id="cmr-copeas-T8_s101_ecriture_multimodale-a_14483" xml:lang="unk"
                  synch="#cmr-copeas-tl_t-w1990" who="#AT6" type="chat-message">
                  <p>according to the differents criteria</p>
               </post>
(1.5)<post xml:id="cmr-copeas-T8_s101_ecriture_multimodale-a_14484" xml:lang="unk"
                  synch="#cmr-copeas-tl_t-w1991" who="#AT6" type="chat-message">
                  <p>?</p>
               </post>

Textchat

From cmr-getalp_org-rhone-alpes-tei-v1[2]. The textchat turns here correspond, respectively, to:

  • the first 3: messages
  • 4: change alias
  • 5: change rights
(2)
<post xml:id="cmr-get-c065-a21693" when-iso="2004-03-18T14:09" who="#cmr-get-c065-p39174" alias="cortex_taff" type="chat-message">
  <p>Apres je vé faire ma physique c aussi les equation bilan</p></post>
<post xml:id="cmr-get-c065-a21694" when-iso="2004-03-18T14:09" who="#cmr-get-c065-p39174" alias="cortex_taff" type="chat-message">
   <p>Aujourd'hui c la journée equation</p></post>
<post xml:id="cmr-get-c065-a21697" when-iso="2004-03-18T14:11" who="#cmr-get-c065-p36208" alias="roulie" type="chat-message">
   <p>lol</p></post>
<post xml:id="cmr-get-c065-a21699" when-iso="2004-03-18T14:13" who="#cmr-get-c065-p120845" alias="Tsu" type="chat-event" subtype="changementpseudo">
    <p>Changement de pseudo: Tsu -&gt; Tsu[H</p><add><code>alias_change(Tsu,Tsu[H)</code></add></post>
<post xml:id="cmr-get-c065-a21705" when-iso="2004-03-18T14:18" who="#unknow" type="chat-event" subtype="changementmode">
    <p>#Rhone-alpes:changement de mode(s) '+o Mega-Link' par Hera!services@olympe.epiknet.org</p></post>

SMS

From cmr-smslareunion-tei-v1[2]. Two SMS messages sent by the same person to the same addressee. Note that here the addressee is not explicitly encoded: phone numbers of addressees were not collected, hence are not known, and there is no attribute in TEI to encode addressees (the same problem arises with the speech / oral chapter of TEI).

(3)
<post xml:id="cmr-slr-c001-a00011" when-iso="2008-04-14T10:17:11" who="#cmr-slr-c001-p010" type="sms">
      	<p>é@??$?Le + triste c ke tu na aucune phraz agréabl et ke tu va encor me dir ke c moi ki Merde par mon attitu2! 
             Moi je deman2 pa mieu ke klke mot agréabl échangé</p></post>
 	[…]
<post xml:id="cmr-slr-c001-a00304" when-iso="2008-04-15T20:23:59" who="#cmr-slr-c001-p010" type="sms">
      	<p>...2 te comporter comme ca avec moi. Je ve bien admettr mes erreur kan j'agi vraimen mal comm hier mé fo pa exagérer. 
         Si t pa d'accor c ton droi. Si tentain.le rest c à dirreposer dé question sur 1 sujet déjà expliké c pa 1 raison valabl pr 
         ke tu te monte contr moi.pr moi ossi ca suffi.</p></post>

Discussion forum

From cmr-simuligne-tei-v1[2]. The author of the message is a native speaker of French who is replying to a post made by a learner of French. Each person mentioned has been identified in the message structure (author, list of readers, here shortened) and in its contents (addressee, signature of the author, attached file). This information may lead to other types of research on discourse and group interactions. For example, who takes the position of a leader, or of an animator, in a group? Can subgroups of communication be traced within a group, thanks to an analysis of clusters and cliques?

(4)
<post xml:id="cmr-Simu-Aquitania-Principal_27.04-13.05.01-59"
	when="2001-05-02T07:58:00" who="#cmr-Simu-Al5" type="forum-message">
	<head>
		<title>les sons du Suffolk</title>
		<listPerson>
			<person corresp="#cmr-Simu-An3">
				<event type="Read" when="2001-05-02T07:58:00">
					<label>Read</label>
				</event>
			</person>
                       [other readers]
			
		</listPerson>
	</head>
	<p>Puisqu'on parlait de ce qui est par la fenêtre, j'ai mis mon micro
		tout près de la fenêtre ce soir... Le coucou s'était déjà couché,
		malheureusement, mais les autres chantent très fort! Les 'Bullocks'
		ont tous le même père: un grand taureau Charolais, qui est le père
		de tous les boeufs de la région! bonne nuit à tous<name ref="#cmr-Simu-Al5" type="person"><forename>Marja</forename></name>
	</p>
	<trailer>
		<ref type="attached_file" target="#Simu_Aqui_forum_attach_988833475"
			>suffolk.qcp</ref>
	</trailer>
</post>

Wikipedia discussion

to be added

Blog

From cmr-infral-tei-v1[2]. Example (5) shows one message and its comment.

(5)
       <post xml:id="cmr-blog-a2" synch="#T2" who="#P2" type="blog-message">
            <head>
               <title>Présentation de ma personne</title>
               <label>étapeE1 ; </label>
            </head>
            <p>Bon soir à tous!<lb/> Maintenant, je vais commencer avec les présentations.......
               <lb/> Je pense que vous avez vu que je m'appelle <name ref="#P2">Kerstin</name> .
               J'ai 22 ans. Mon nom est un nom suédois qui est très fréquent en Allemagne. Comme
               vous savez peut-être, on a commencé nos études en master cette semaine. <lb/> Ma
               famille - mes parents et mes deux soeurs -habite à Osnabrueck. C'est une ville qui
               est pas loin de Brême. Après avoir passé mon bac à Osnabrueck, j'ai commencé mes
               études de francais et de sport à Brême. La raison pour laquelle j'ai choisi ces deux
               matières est que j'aime faire du sport (jouer au tennis, nager) et que j'adore la
               culture francaise. J'adore la langue francaise et le pays me plaît beaucoup (le
               paysage francais....). <lb/> Les deux étés passés, j'ai fait un stage de plus que
               deux mois en Suisse francophone et en France (près de Lyon) pour améliorer mes
               connaissances de la langue francaise et la pratique du francais à l'oral. <lb/> En ce
               qui concerne mes études de francais, ce qui me plaît surtout, c'est, d'explorer la
               culture francaise d'une manière différente (les textes littéraires, les séquences
               vidéos......). <lb/> J'attends vos présentations et je vous souhaite encore un bon
               soir........ <lb/> A bientôt, <name ref="#P2">Kerstin</name></p>
         </post>
         <post xml:id="cmr-blog-a3" synch="#T3" who="#P3" type="blog-comment" ref="#cmr-blog-a2">
            <head>
               <title>Hallo Kirstin! J'ai lu que tu as fait des stages e...</title>
            </head>
            <p>Hallo<name ref="#P2">Kirstin</name> ! J'ai lu que tu as fait des stages en Suisse
               francophone ! Où exactement car j'habite près de la frontière suisse (à 1h de
               Lausanne !)! Je pense qu'on aura l'occasion d'en reparler ! Bis Bald </p>
         </post>

Email

From cmr-simuligne-tei-v1[2]. One email sent to one person and read by this person.

(6)
<post xml:id="cmr-Simu-Aq-At-Outbox-0080" when="2001-05-12T01:15:00"
	who="#cmr-Simu-At" type="email-message">
	<head>
	 <title>ta photo</title>
		<listPerson>
		<person corresp="#cmr-Simu-Al6">
		 <event type="SendTo">
			<label>SendTo</label></event></person>
		<person corresp="#cmr-Simu-Al6">
		  <event type="Read" when="2001-05-12T01:15:00">
			<label>Read</label></event></person>
		</listPerson>
	</head>
<p>Coucou<name ref="#cmr-Simu-Al6" type="person"><forename>Mia</forename></name>, Tu peux aller te voir dans
Publications : maintenant, tu y existe en totalité ! A bientôt,<name ref="#cmr-Simu-At" type="person"><forename>Anna</forename></name>
</p>
</post>

Tweets

See #Example for Tweets

Macrostructure: the <post> element

The element

In our schema, the element <post> is the basic structural element of a CMC document, corresponding to textual en bloc interactions. We consider it a macrostructural element, but it is the pivot between the higher-level macrostructural components (thread and logfile) and the microstructure of the content which it encloses. The structure of <post> is based on that of the existing <div> element.

The <div> and <post> elements have the following similarities:

  • <div> and <post> are high-level elements, belonging to the same class (model.divLike);
  • <div> and <post> contain the major divisions of text;
  • <div> and <post> have similar internal content.

It is important to note that <post>, like <div>, does not belong to the class of pLike elements. One <post> may consist of one or more paragraphs, similar to a <div>. While a division may represent, for example, a chapter of a book, <post> represents one user contribution to some computer-mediated communication event (forum, blog, web discussion, or chat). Such a contribution can contain multiple paragraphs, just like <div>. In chat, all posts consist of exactly one paragraph and the portion of text exhibits no special markup, but on the Wikipedia talk page given in figure 2, some of the posts contain divisions and markup that the authors inserted into the content of their posts in order to structure them. Therefore, <post> cannot be a model.pLike element.

The <div> and <post> elements have the following differences:

  • <div> is a self-nesting element, while <post> is not;
  • <post> elements can only appear inside a division which encloses one complete CMC document (such as an entire forum thread, an entire blog with user comments, or a chat logfile).

In other words, <post> is a child element of <div> and shares its content model, except that it does not contain divisions and does not embed itself. Normally, a <post> consists of one or more paragraphs. In some cases a post contains a head, typically with a title.
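The constraints above can be summed up in a minimal sketch (the identifiers and content are invented for illustration, following the conventions of the corpus examples on this page):

```xml
<div type="logfile">
   <!-- a post with a head: allowed, as in a div -->
   <post xml:id="ex-a001" who="#ex-p01" type="chat-message">
      <head><title>hello</title></head>
      <p>first paragraph of the contribution</p>
      <p>a second paragraph: a post, unlike a p, may contain several</p>
   </post>
   <!-- a second post at the same level: posts do not nest, -->
   <!-- and they never contain a div -->
   <post xml:id="ex-a002" who="#ex-p02" type="chat-message">
      <p>one-paragraph reply</p>
   </post>
</div>
```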

Attributes for <post>

Here is a summary of the attributes and other information which may be attached to different types of CMC environments. Note that information relating to StatusRead and Receiver has been encoded in TEI in the <head> of the <post>, close to the <title> (for email, forum and blog) and <label> (for blog). In order to minimize the number of attributes, we have also used the <trailer> element to encode the description of attached files (email, forum, etc.) or various information concerning tweets (see the corresponding section). Attributes relating to Wikipedia forums have not yet been used (see the corresponding section).

Table 2: attributes of the <post> element

| Attribute | SMS | Textchat | Blog | Email | Discussion forum | Wiki forum |
|---|---|---|---|---|---|---|
| @xml:lang | y | y | y | y | y | y |
| @xml:id | y | y | y | y | y | y |
| @who | y | y | y | y | y | y |
| @when / @when-iso / @synch | y | y | y | y | y | y |
| @type | y | y | y | y | y | y |
| @subtype | N | Option | N | N | N | N |
| @ref | N | N | Y (comment) | Y (respond) | Y (respond) | Y (respond) |
| Not in TEI: | | | | | | |
| @alias | | Option | | | | |
| Receiver | | | | SendTo, Cc, Bcc | | |
| StatusRead | | | Option | Option | Option | |
| @revisedWhen | | | | | | Y |
| @revisedBy | | | | | | Y |
| @identLevel | N | N | N | N | N | Y |

Macrostructure & multimodality: the <u> and <prod> elements

Figure 2: non-verbal acts in a 3D CMC environment

Non verbal.jpg



Types of divisions for the interaction space

As already seen, an interaction space may be described at two very different levels:

  • 1) the meta level (see SIG:CMC/Draft: A metadata schema for CMC);
  • 2) the interactions themselves, i.e. the set of acts. These acts, in all the examples given here, are included within a division which corresponds to a session, or within a division inside a division.





We may distinguish several types of divisions:

  • <div type="thread">, e.g. forum, blog, with different tools, and then
    • child element: <post>, with different types
  • <div type="logfile">, e.g. textchat, SMS, with different tools
    • child element: <post>, with different types, for example within a textchat
  • <div type="oral-discourse">, for audiochat
    • child element: <u> (see the TEI chapter on speech)
  • <div type="multi-modalities">
    • child element: <post>
    • child element: <u>
    • child element: <prod> (for iconic acts (vote, raise_hand, brief_absence_act, etc.), all collective tools (word processor, semantic map, whiteboard, etc.) and nonverbal communication). As an example of a classification of non-verbal acts, see the figure above, which represents non-verbal acts in Second Life as encoded by Wigham & Chanier (2013)[4].
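The last case can be sketched as a skeleton (the identifiers and content are invented; the full attribute sets appear in the corpus examples on this page):

```xml
<div type="multi-modalities">
   <!-- a spoken act, a textchat act and an iconic act, all siblings -->
   <u xml:id="ex-a101" who="#ex-p01" xml:lang="eng">so what do you think |</u>
   <post xml:id="ex-a102" who="#ex-p02" type="chat-message">
      <p>fine for me</p>
   </post>
   <prod xml:id="ex-a103" who="#ex-p03" type="vote">agree</prod>
</div>
```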

<prod> element

As explained above, the <prod> element refers to acts which are non-verbal but are part of the interaction process, at the same level as the <post> and <u> elements. After example (1) given above, here is another example (7) of interactions between one tutor and learners (from cmr-copeas-tei-v1[2]; context: Lyceum audio-graphic conference environment).

  • (7.1) audio act: yes/no question by the tutor
  • (7.2) positive answer given through a non-verbal modality by a learner (iconic system, modality here named "vote", with content "agree")
  • (7.3) audio act: yes/no question by the tutor
  • (7.4) textchat act: complementary information given by a learner
  • (7.5) to (7.9): yes/no answers of 5 participants through the iconic system
(7)
(7.1)<u xml:id="cmr-copeas-R2_lobby-a_1297" xml:lang="eng" start="#cmr-copeas-tl_r-w107"
         end="#cmr-copeas-tl_r-w109" who="#AR4">
      euh no + euh I don't know the + the style ++ in french it's a band + named + 
      {les enfoirés} ++ you know euh + {enfoirés} |+++</u>
(7.2)<prod xml:id="cmr-copeas-R2_lobby-a_1298" synch="#cmr-copeas-tl_r-w108" who="#AR7"
        type="vote">agree</prod>
(7.3)<u xml:id="cmr-copeas-R2_lobby-a_1299" xml:lang="eng" start="#cmr-copeas-tl_r-w109"
        end="#cmr-copeas-tl_r-w110" who="#TutR">anybody else know |</u>
(7.4)<post xml:id="cmr-copeas-R2_lobby-a_1301" xml:lang="unk" synch="#cmr-copeas-tl_r-w110" who="#AR6" type="chat-message">
        <p>french's singers</p>
       </post>
(7.5)<prod xml:id="cmr-copeas-R2_lobby-a_1302" synch="#cmr-copeas-tl_r-w111" who="#AR3"
        type="vote">agree</prod>
(7.6)<prod xml:id="cmr-copeas-R2_lobby-a_1303" synch="#cmr-copeas-tl_r-w111" who="#AR2"
        type="vote">agree</prod>
(7.7)<prod xml:id="cmr-copeas-R2_lobby-a_1304" synch="#cmr-copeas-tl_r-w112" who="#AR6"
        type="vote">agree</prod>
(7.8)<prod xml:id="cmr-copeas-R2_lobby-a_1305" synch="#cmr-copeas-tl_r-w113" who="#TutR"
         type="vote">disagree</prod>
(7.9)<prod xml:id="cmr-copeas-R2_lobby-a_1306" synch="#cmr-copeas-tl_r-w114" who="#AR1"
        type="vote">disagree</prod>

The contents of the <prod> element are fairly simple in (7), whereas they are much more complicated in the <prod> element of example (1). In (1) it corresponds to an act of typing within a collaborative word processor. It is up to the researchers who transcribe (from video screen captures) actions within online collaborative tools to decide which kind of coding scheme they want. We should not impose anything for this content. The only mandatory information should be restricted to attributes.

This element does not exist in the current TEI version. Of course, the element name may be debatable (here the name "prod" corresponds to the fact that the corresponding non-verbal act is a production made by a participant), but not its function.

We considered some TEI elements related to non-verbal features before introducing <prod>.

Elements that cannot be used as an act of type prod

  • <activity>: "contains a brief informal description of what a participant in a language interaction is doing other than speaking, if anything." It is only a brief description, has no @who attribute, and is too low-level.
  • <kinesic>: "marks any communicative phenomenon, not necessarily vocalized, for example a gesture, frown, etc." It has been designed to be integrated inside <u>, but may be used at the same level. However, the name is wrong: kinesic is a specific non-verbal notion related to gaze, posture and gesture, not a general one.
  • <incident>: "marks any phenomenon or occurrence, not necessarily vocalized or communicative, for example incidental noises or other events affecting communication." The name is unacceptable; it runs against the interaction and communicative framework.

This is the whole philosophy / theoretical standpoint of the TEI chapter on speech, which cannot be applied to non-verbal description placed at the same level as text and speech. These TEI elements, related to an utterance, are not really considered as being part of the interaction at the same level as the utterance itself. Their naming is also unacceptable and cannot refer to the concepts we have mentioned here.

Attributes of <u> and <prod>

Figure 3: interplay between modalities

Modality interplay.jpg

<u> and <prod> share the same main top-level attributes as those listed in the upper part of Table 2 for <post>. However, there is a striking difference with respect to time. Time attributes attached to <post> correspond to en bloc acts: the time encoded corresponds to the time the post was received/sent by the server. This is also the time at which other participants saw the result of the act appear on their screen; they can only react to the act after this time. All participants perceive such an act (let us say a textchat turn) as instantaneous, despite the fact that it is not [8]. On the contrary, all <u> and most <prod> acts have a duration, i.e. other participants perceive oral or visual cues of the act and may react before it ends. Their reaction may lead the first participant to change, in the throes of the interaction, what s/he intended to say or do. For example, the figure on the left displays in the left column the transcript of an audio act, and in the three right columns the contents of textchat acts. While the first participant utters his message, the other three react. Consequently, the first one changes twice what he had initially planned to utter (extract from a Second Life environment [5]).
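The difference shows up directly in the attributes; a minimal sketch (timeline anchors and identifiers are invented, following the pattern of examples (1) and (7)):

```xml
<!-- en bloc act: a single point in time (@synch, or @when / @when-iso) -->
<post xml:id="ex-a201" synch="#ex-tl-w10" who="#ex-p01" type="chat-message">
   <p>ok</p>
</post>
<!-- act with a duration: a start and an end anchor on the timeline -->
<u xml:id="ex-a202" start="#ex-tl-w10" end="#ex-tl-w14" who="#ex-p02"
   xml:lang="eng">well + I was going to say + actually no</u>
```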

Attribute representing the Addressee / Receiver for <u>, <post> and other elements

In Table 2 we mentioned the "Receiver" information, which describes to whom a message (email) was sent. Since no such attribute exists in TEI, we encoded it in the <head> of the <post>, in a quite verbose way. If we had known the addressees of the SMS messages (we could not, for ethical reasons, when collecting the SMS data), we would also have needed such a feature within <post>.

It appears that in oral corpora it is often a necessity to clearly mention the addressee of a speech utterance. See, for example, the Alipe project ([8]), with child-parent interactions where addressees are clearly identified and have been described through verbose feature structures.

A great improvement on these circumlocutions would be to design an @addressee attribute which could be added to <u> and <post> at least. We might also use it for the <addressingTerm> element (see next section).
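A sketch of what such an encoding could look like (note that this @addressee attribute is only a proposal and does not exist in TEI; the identifiers are invented):

```xml
<!-- proposed @addressee attribute pointing at a participant ID -->
<post xml:id="ex-a301" who="#ex-p01" addressee="#ex-p02" type="chat-message">
   <p>yes I agree with you</p>
</post>
<u xml:id="ex-a302" who="#ex-p02" addressee="#ex-p01" xml:lang="eng">thanks</u>
```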

Microstructure

Figure 4: Hierarchy of interaction signs

Interaction sign.png

The microstructure refers to the elements and their attributes found in the actual contents of the interaction (here, mainly within <post>). A large part of this section comes from (Beisswenger et al., 2012)[3].

There is no common terminology to classify the elements of Internet jargon, nor consensus about the status of these elements in a natural language grammar framework. To fill this gap, we have developed an annotation schema for these phenomena on the microstructure level of CMC documents. The basic linguistic description category of our approach is termed an interaction sign; in the schema, instances of interaction signs such as emoticons, acronyms, etc. are represented using the element <interactionTerm>.

In our schema, we introduce an element <interactionTerm> as a phrase-level element (in the model.phrase class) which encloses one or more instances of subclasses of interaction signs.


New elements child of InteractionTerm

The <interactionTerm> element can have members of att.global as attributes. In addition, we introduce elements for the following subclasses of interaction signs:

  • <emoticon>: Emoticons are iconic units created using the keyboard. They are often used to portray facial expressions, and they typically serve as emotion, illocution, or irony markers. Due to their iconic character, the use of emoticons is not restricted to CMC in one particular language; instead, the same emoticons can be found in CMC data in different languages. The <emoticon> element is assigned to the gLike element class. Conventionally, elements of this class contain non-Unicode characters and glyphs. Although most emoticons are produced as a sequence of keyboard characters (dot, comma, colon, and the like), the resulting figure is comparable in its semiotic status to graphic characters. While some smiley faces have been included in Unicode, the variety of emoticons is still larger than can be captured by Unicode characters alone. That is why we place the <emoticon> element in the class of gLike elements.
  • <interactionWord>: Interaction words are symbolic linguistic units. Their morphological construction is based on a word or a phrase of a given language which describes expressions, gestures, bodily actions, or virtual events: for example, the units sing, g (< grins, "grin"), fg (< fat grin), s (< smile), wildsei ("being wild") are used as emotion or illocution markers, as irony markers, or to playfully mimic simulated bodily activity. The element <interactionWord> in our schema is a member of model.global.spoken. It shares properties of the <kinesic>, <incident>, and <vocal> elements in TEI.
  • <interactionTemplate>: Interaction templates are units that the user does not generate with the keyboard but by activating a template which automatically inserts a previously prepared text or graphical element into a space of the user's choice. The category of interaction templates includes graphic smileys, chosen by the user of a CMC environment from a finite list of elements. These often portray facial expressions but can depict almost anything; in the case of animated GIFs, they can even portray entire scenes as moving pictures. This clearly goes beyond what can be expressed using only keyboard-generated emoticons. On the other hand, users can invent new emoticons by combining keyboard characters, while template-generated units are always bound to predefined templates. The element <interactionTemplate> in our schema belongs to the model.global class of elements.
  • <addressingTerm>: Addressing terms address an utterance to a particular interlocutor. The most widely used form is the one made of the "@" character together with a specification of the addressee's name. The element <addressingTerm> in our schema belongs to the model.nameLike class of elements. While this element usually uses no attributes, our customization includes the att.global attributes. The content of <addressingTerm> is restricted to two elements:
    • The <addressMarker> element belongs to the class model.labelLike (used to gloss or explain parts of a document) and is provided with the att.global class of attributes. The purpose of <addressMarker> is to identify or to highlight the addressee in a posting. This is typically achieved by using the "at" sign ("@") or one of a set of fixed phrases (English: "to"; German: "an" or "für").
    • The element <addressee> is placed in the model.nameLike.agent class.
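Taken together, a chat contribution containing such interaction signs could be encoded along the following lines (a sketch with invented identifiers, attribute values and content; compare Figure 5):

```xml
<post xml:id="ex-a401" who="#ex-p01" type="chat-message">
   <p>
      <interactionTerm>
         <addressingTerm>
            <addressMarker>@</addressMarker>
            <addressee ref="#ex-p02">anna</addressee>
         </addressingTerm>
      </interactionTerm> good point
      <interactionTerm>
         <!-- a keyboard-generated smiley, marking positive emotion -->
         <emoticon style="Western"
            systemicFunction="emotionMarker:positive">:-)</emoticon>
      </interactionTerm>
      <interactionTerm>
         <!-- abbreviated interaction word: g < grins, "grin" -->
         <interactionWord semioticSource="mimic"
            formType="abbreviated">g</interactionWord>
      </interactionTerm>
   </p>
</post>
```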

Attributes of these elements

Table 3: attributes of elements corresponding to interaction terms

| Attribute | <emoticon> | <interactionWord> | <interactionTemplate> | <addressee> |
|---|---|---|---|---|
| @style | Western, Japanese, Korean, Other | y | N | N |
| @systemicFunction | emotionMarker:positive, emotionMarker:negative, emotionMarker:neutral, emotionMarker:unspec, responsive, ironyMarker, illocutionMarker, virtualEvent | y | Y | N |
| @contextFunction | y | y | Y | N |
| @topology | y | y | N | N |
| @semioticSource | y | mimic, gesture, bodilyReaction, sound, action, sentiment, process, emotion | N | N |
| @formType | N | simple, complex, abbreviated | N | persNameFull, persNameAbbreviation, persNameNickname |
| @type | N | N | iconic, verbal, iconic-verbal | N |
| @motion | N | N | static, animated | N |
| @who | N | N | N | ID |
| @scope | N | N | N | all, group, individual, unspec |

Example taken from a textchat

Figure 5: example of encoding into TEI of interaction terms used in a textchat

InteractionTerm.jpg

Example for Tweets

From cmr-politweets-tei-v1[2]. (10.1) gives an example of a tweet in which a variety of Twitter syntax phenomena appear (for the sake of presentation, we have created this example out of several real ones).

  • reference to twitter-account: @JLMelenchon
  • Retweet: RT @LePG:
  • Tweet transmitted via another one: via @LeHuffPostBlog
  • hashtag: #France2

In order to mark the specific Twitter syntax, we use the <distinct> element with a specific type. <addressingTerm> and its subcomponents are included within <distinct>, as shown in (10.2). Specific characters and terms such as '#', 'RT' or 'via' are enclosed in an <ident> element. One may wonder whether the same should apply to the <addressMarker>. When the Twitter account is described in the list of persons in the <teiHeader>, its ID is used as the value of @ref; otherwise the value is the URL of the account's Twitter page.

Note the other tweet-specific features, which are encoded in the <trailer> as a feature structure (<fs>).

(10.1)
RT @LePG: A 07h50, @JLMelenchon est l'invité des #4vérités sur #France2. Nous
     live-tweeterons. #Chômage #Municipales2014 http://t.co/5Qi1eB1nTm via
     @LeHuffPostBlog
(10.2)
<post xml:id="cmr-politweets-a449071343175471104" when="2014-03-27T07:32:06Z"
  who="#cmr-politweets-p80820758" type="tweet" xml:lang="fra">
  <p>
     <distinct type="twitter-retweet">
	<ident>RT</ident>
	<addressingTerm>
	   <addressMarker>@</addressMarker>
	   <addressee type="twitter-account" ref="https://twitter.com/LePG 12214546"
	      >LePG</addressee>
	</addressingTerm> : </distinct> A 07h50, 
     <addressingTerm>
	<addressMarker>@</addressMarker>
	<!-- when user was listed in the <listPerson> above, we use the id in the text instead of URL -->
	<addressee type="twitter-account"
	   ref="#cmr-politweets-p80820758">JLMelenchon</addressee> 
     </addressingTerm> est l'invité des <distinct type="twitter-hashtag">
	<ident>#</ident>
	<rs ref="https://twitter.com/search?q=%234vérités&src=hash"
	   >4vérités</rs>
     </distinct> sur <distinct type="twitter-hashtag">
	<ident>#</ident>
	<rs ref="https://twitter.com/search?q=%23France2&src=hash">France2</rs>
     </distinct> . Nous live-tweeterons. <distinct type="twitter-hashtag">
	<ident>#</ident>
	<rs ref="https://twitter.com/search?q=%23Chômage&src=hash">Chômage</rs>
     </distinct>
     <distinct type="twitter-hashtag">
	<ident>#</ident>
	<rs ref="https://twitter.com/search?q=%23Municipales2014&src=hash"
	   >Municipales2014</rs>
     </distinct>
     <ref target="http://t.co/5Qi1eB1nTm http://huff.to/1fMDTo7"
	>http://t.co/5Qi1eB1nTm</ref>
     <distinct type="twitter-via">
	<ident>via</ident>
	<addressingTerm>
	   <addressMarker>@</addressMarker>
	   <addressee type="twitter-account"
	      ref="https://twitter.com/LeHuffPostBlog">LeHuffPostBlog</addressee>
	</addressingTerm>
     </distinct>
  </p>
  <trailer>
     <location>
	<!-- place -->
	<placeName>België</placeName>
	<!-- geo_lat and geo_long -->
	<geo>41.687142 -74.870109</geo>
     </location>
     <fs>
	<!-- these features are omitted when they have no value -->
	<f name="medium"><string>web</string></f>
	<f name="favoritecount"><numeric value="30"/></f> 
	<f name="retweetcount"><numeric value="73"/></f> 
	<f name="isRetweet"><binary value="true"/></f>
	<f name="isTruncated"><binary value="true"/></f> 
	<f name="isFavorited"><binary value="true"/></f>
	<f name="retweetedstatus_id"><numeric value="444087690850750464"/></f>
	<f name="inReplyToUserId"><numeric value="23339387"/></f>
	<f name="inReplyToScreenName"><string>josebove</string></f>
     </fs>
  </trailer>
</post>

Anonymization

In order to be able to distribute the collected CMC data as widely as possible, we need to anonymize the data. Our anonymization strategy shall support the following goals:

  • Every user of the data shall be able to associate a certain set of interaction acts in a CMC document with a user. This user, however, shall not be identifiable as an individual in the “real world”.
  • In some corpora, researchers have collected a wealth of information about users (sociolinguistic profiles, language biographies, etc.) which it is important to release with the corpora for future research.

To achieve these particular goals, we perform the following steps:

  • All of the recoverable personal data of a CMC participant (or group) are collected into a person profile in a <person> element. This profile is given a user ID (@xml:id) which is unique within the particular TEI document. All person profiles are stored in the header of the document; thus, they can easily be separated from the body of the document and hidden from less privileged users of the data (cf. more explanation on participants - individuals, groups - in SIG:CMC/Draft: A metadata schema for CMC).
  • Each interaction act is linked to a person profile via the @who attribute, which points to the value of an @xml:id of a person element.
  • Instances of user names in segments of a given posting are also linked to a user ID.
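
The link between header and body described in these steps can be sketched as follows (a minimal sketch with invented IDs; the profile sits in the <teiHeader>, the posting in the body):

```xml
<!-- in the <teiHeader>: the person profile, easily separable from the body -->
<listPerson>
   <person xml:id="p042">
      <persName>[_forename_] [_surname_]</persName>
   </person>
</listPerson>

<!-- in the body: each interaction act points back to the profile via @who,
     and user names in the text are linked via @ref -->
<post who="#p042" type="chat-message">
   <p>Hello <name ref="#p042" type="person">[_forename_]</name>!</p>
</post>
```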

Anonymization is a tedious process, but it has already been accomplished for numerous CMC corpora. Once it is done, several types of information are kept and encoded in feature structures (who performed the process, how many characters were changed, etc.). We will not detail them here. Every time such a process is performed, a piece of information is replaced by another, as in the SMS message in (8).

(8)  Bon, d'accord ! Elle est ou miss [_forename_] ? Usa ? On se retrouve ds 1 coin sympa ? Qu'est-ce qui t'arrange ?

In (9.1) we list the types of information which have been anonymized in the CoMeRe databank and the keywords used for replacement, aligned with the TEI elements used to encode them. (9.2) shows an example of use within an email. The corresponding information is encoded within a feature structure which has to be declared in <teiHeader> / <encodingDesc> / <fsdDecl>.

(9.1)
                [_forename_]          <forename>[_forename_]</forename>
                [_surname_]           <surname>[_surname_]</surname>
                [_addName_]           <addName>[_addName_]</addName>
                [_tel_]               <rs type="telephone">[_tel_]</rs>
                [_email_]             <email>[_email_]</email>
                [_url_]               <rs type="url">[_url_]</rs>
                [_code_]              <rs type="code">[_code_]</rs>
                [_address_]           <address>[_address_]</address>
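
The anonymisation feature structure used in (9.2) has to be declared in the header. A minimal sketch of such a declaration (feature names taken from (9.2); the prose descriptions are our own illustrations, not part of the schema) might look like:

```xml
<encodingDesc>
   <fsdDecl>
      <fsDecl type="anonymisation">
         <fDecl name="origfrom">
            <fDescr>origin of the anonymised information</fDescr>
         </fDecl>
         <fDecl name="anonyString">
            <fDescr>replacement keyword, e.g. [_surname_]</fDescr>
         </fDecl>
      </fsDecl>
   </fsdDecl>
</encodingDesc>
```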
(9.2)
<post xml:id="cmr-Simu-Aq-At-Inbox-0075" when="2001-05-10T09:53:00"
		who="#cmr-Simu-Al9" type="email-message"
		ref="#cmr-Simu-Aq-At-Outbox-0033">
	<head>
		<title>Forum Prep Interculture</title>
		[...]
	</head>
	<p>Salut<name ref="#cmr-Simu-At" type="person"
			><forename>Anna</forename></name>, Malheuresement,j?ai un
		problème qui affect mon participation.. It is better to explain in
		english., Since Simuligne commenced we have here a recently
		announced General Election,which was of course delayed by one month
		and should have been on May 7th. I am very heavily commited to
		contribute to the campaign for my Party and as a result have been
		unable to keep contact with you all. Now I find myself unable even
		to keep our link up tomorrow evening. This activity will continue
		for at least one month and so I must advise that I have decided to
		discontinue my participation.Very sorry and best wishes to
			Simuligne....regards,<name ref="#cmr-Simu-Al9" type="person"
				><forename>Howard</forename><surname><fs
					type="anonymisation">
					<f name="origfrom"><string>Depositor</string></f>
					<f name="anonyString"><string>[_surname_]</string></f>
				</fs></surname></name>.. </p>
</post>

References

  1. DeRiK (2013). Description of the DeRiK (Deutsches Referenzkorpus zur internetbasierten Kommunikation) project: a databank of CMC corpora in German encoded in TEI. [1]
  2. CoMeRe (2014). Website documentation of the CoMeRe (Communication Médiée par les Réseaux) project: a databank of CMC corpora in French encoded in TEI. [2]
  3. Beißwenger, M., Ermakova, M., Geyken, A., Lemnitzer, L. & Storrer, A. (2012). "A TEI Schema for the Representation of Computer-mediated Communication". Journal of the Text Encoding Initiative, 3. DOI: 10.4000/jtei.476 [3]
  4. Wigham, C.R. & Chanier, T. (2013a). "A study of verbal and nonverbal communication in Second Life: the ARCHI21 experience". ReCALL 25(1), Cambridge Journals. DOI: 10.1017/S0958344012000250 [4]
  5. Wigham, C.R. & Chanier, T. (to appear, 2013b). "Interactions between text chat and audio modalities for L2 communication and feedback in the synthetic world Second Life". Computer Assisted Language Learning (CALL). DOI: 10.1080/09588221.2013.851702 [5]
  6. Chanier, T. & Vetter, A. (2006). "Multimodalité et expression en langue étrangère dans une plate-forme audio-synchrone" [Multimodality and foreign-language expression on an audio-synchronous platform]. Apprentissage des Langues et Systèmes d'Information et de Communication (Alsic), 9, 61-101. DOI: 10.4000/alsic.270 [6]
  7. Ciekanski, M. & Chanier, T. (2008). "Developing online multimodal verbal communication to enhance the writing process in an audio-graphic conferencing environment". ReCALL 20(2), Cambridge University Press, 162-182. DOI: 10.1017/S0958344008000426 [7]
  8. The duration of the typing of a textchat message is not encoded here and is not perceived by the other participants. In some CMC environments there are also visual cues showing that participants are in the process of typing in the textchat. If one wants to take these cues into account (they often play a role in the interaction, and participants may refer to them), they would be described as nonverbal acts (<prod>).