SIG:CMC/CoMeRe metadata schema draft for CMC (2014)
This page is part of the wiki space of the TEI-SIG "Computer-Mediated Communication".
Status of this draft
This page describes a draft metadata schema for genres of computer-mediated communication (CMC) in TEI. The draft was created by members of the TEI-SIG "Computer-Mediated Communication" and has been applied to all corpora of the CoMeRe project; these corpora can be downloaded in full from the repository http://hdl.handle.net/11403/comere . Documentation on the CoMeRe project is available on its website: http://comere.org
The SIG encourages everybody to discuss this draft and to give feedback and comments using the "discussion" function at the top of this page. Comments and discussions will be carefully taken into consideration in the further development of the schema.
The history of the draft is documented on the main wiki page of the SIG. This page should be read in parallel to SIG:CMC/Draft: A basic schema for representing CMC in TEI.
Authors of this draft: Thierry Chanier.
Rationales for Modelling CMC discourse
Note: we use the term CMC (which stands for Computer-Mediated Communication) in a broad sense, to refer to all kinds of network-mediated communication (cf. SMS).
Annotation is basically an interpretation, and TEI markup naturally encompasses hypotheses concerning what a text is and what it should be. Although the TEI was historically dedicated to the markup of literary texts, various extensions have been developed for the annotation of other genres and discourses, including poetry, dictionaries, language corpora and speech transcriptions. If one still wants to apply the word “text” to a coherent and circumscribed set of CMC interactions, it is not so much in the sense originally developed by the TEI. Indeed, it would be closer to the meaning adopted by Baldry & Thibault (2006). These authors consider (ibid: 4) “texts to be meaning-making events whose functions are defined in particular social contexts”, following Halliday (1989:10), who declared that “any instance of living language that is playing some part in a context of situation, we shall call it a text. It may be either spoken or written, or indeed in any other medium of expression that we like to think of.”
Bearing the above in mind, we found it more relevant to start from a general framework, which we term “Interaction Space”, encompassing, from the outset, the richest and most complex CMC genres and situations. Therefore, we did not work genre by genre, nor with scales that would, for instance, oppose simple and complex situations (e.g. unimodal versus multimodal environments): as said, our goal is to release guidelines for all CMC documents and not for each CMC genre. This also explains why we did not limit ourselves solely to written communication. For these reasons, we take multimodality into account, and our approach is akin to the one under discussion in European networks dealing with TEI and oral corpora: they tend to reject the collection and study of oral corpora as self-contained elements and prefer to study oral and multimodal corpora within a common framework.
Figure 1: Interaction Space
Interaction space: time, location, participants
An Interaction Space (henceforth referred to as IS) is an abstract concept, located in time (with a beginning and ending date with absolute time, hence a time frame) where interactions between a set of participants occur within an online location. The online location is defined by the properties of the set of environments used by the set of participants. Online means that interactions have been transmitted through networks, Internet, Intranet, telephone, etc. The set of participants is composed of individual members or groups. It can be a predefined learner group or a circumscribed interest group. A mandatory property of a group is the listing of its participants.
The range of types of interactions (and their related locations) is wide. At one end of the scale, we find simple types with one environment based on one modality / tool (e.g., one email system or one text chat system). At the other end of the scale, we find complex environments such as LMSs, where several types of communication modalities are integrated (see the example below with the LMS WebCT, which uses only textual modalities, both synchronous (text chat) and asynchronous (email and forum)).
Environment, mode and modality
An environment may be synchronous or asynchronous, monomodal or multimodal. Multimodality refers to environments that offer several interaction tools, integrated within the same interface. Every tool uses one mode of communication (e.g., oral, text, icon, nonverbal) and one modality (e.g., a text chat has a specific textual modality, different from the modality of a collective word processor, although both are based on the same textual mode). Every modality has its own grammar, which constrains interactions. For example, the icon modality within an audio-graphic environment is composed of a finite set of icons (raise hand, clap hand, is_talking, momentarily absent, etc.). Consequently, an interaction may be multimodal because several modes are used and/or several modalities.
An environment offers the participants one or more locations / places in which to interact. For example, a conference system may have several rooms where a set of participants may work separately in sub-groups or gather in one place. In a 3D environment such as the synthetic world Second Life, a location may be an island or a plot. A plot may even be divided into small sub-plots where verbal communication (text chat, audio) is impossible from one to another. Hence we say that participants are in the same location / place if they can interact at a given time. Notions of location and interaction are closely related and are defined by the affordances of the environment. Lastly, an IS is an abstract space where interaction occurs. When the same participants interact over several weeks, different interaction sessions will occur.
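The notions introduced so far (time frame, participants, environment, tool, mode, modality, location) can be sketched as a small data model. This is only an illustrative sketch, not part of the schema: all class and field names are our assumptions, and attaching the synchronous/asynchronous property to each tool (rather than to the environment as a whole) is a simplification chosen so that an LMS mixing both kinds of tools can be expressed.

```python
from dataclasses import dataclass, field
from datetime import datetime


@dataclass(frozen=True)
class Tool:
    """One interaction tool: exactly one mode and one modality (names are hypothetical)."""
    name: str
    mode: str          # e.g. "text", "oral", "icon", "nonverbal"
    modality: str      # e.g. "textchat", "email", "forum"
    synchronous: bool  # assumption: synchrony is recorded per tool


@dataclass
class Environment:
    name: str
    tools: list[Tool] = field(default_factory=list)
    locations: list[str] = field(default_factory=list)  # rooms, channels, plots, ...

    def is_multimodal(self) -> bool:
        # Multimodal when several modes and/or several modalities are offered.
        modes = {tool.mode for tool in self.tools}
        modalities = {tool.modality for tool in self.tools}
        return len(modes) > 1 or len(modalities) > 1


@dataclass
class InteractionSpace:
    start: datetime          # beginning of the time frame (absolute time)
    end: datetime            # end of the time frame (absolute time)
    participants: set[str]   # identifiers of individuals or groups
    environments: list[Environment] = field(default_factory=list)


# A one-tool IRC environment versus an LMS offering three textual modalities.
irc = Environment("IRC", [Tool("chat", "text", "textchat", True)])
webct = Environment("WebCT", [
    Tool("chat", "text", "textchat", True),
    Tool("mail", "text", "email", False),
    Tool("forum", "text", "forum", False),
])
print(irc.is_multimodal(), webct.is_multimodal())
```

In this sketch the LMS counts as multimodal although it offers a single (textual) mode, because it integrates several modalities.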
More information on interactions in SIG:CMC/Draft: A basic schema for representing CMC in TEI
Describing the interaction space of a monomodal environment within the TEI header
In this section we present the way Interaction Spaces have been described for monomodal environments. The next section will consider examples of multimodal environments.
Environments and affordances
The first step when describing an environment is to define the general features attached to the overall environment type to which it belongs (e.g., IRC text chat systems). However, this needs to be refined in order to elicit specific features of the system. For example, (1a) describes, in TEI, the general text chat modality, where inside one public channel every connected participant may interact with the other participants. Example (1b), however, details the affordances related to the specific IRC system used in cmr-getalp_org. This simplified extract displays the three main types of chat actions (message, command, and event) and some of the event subtypes.
(1a)
<textDesc xml:lang="en-GB">
  <channel mode="w" xml:lang="en-GB">
    <term ref="#texchat-epiknet">text chat</term>
  </channel>
  <constitution>Messages typed by participants inside EpikNet IRC Channels and then collected by Botstats.com</constitution>
  <derivation type="original"/>
  <domain type="public"/>
  <factuality type="fact"/>
  <interaction type="complete" active="plural" passive="many"/>
  <preparedness type="spontaneous"/>
  <purpose degree="high"><note>Informal discussion</note></purpose>
</textDesc>
(1b)
<classDecl>
  <taxonomy>
    <category xml:id="texchat-epiknet">
      <catDesc>Definition of the modality textchat. Type of messages used in cmr-getalp_org. Textchat features are those coming from EPIKNET <ref target="http://www.epiknet.org/"/></catDesc>
    </category>
    <category xml:id="chat-message"/>
    <category xml:id="chat-command"/>
    <category xml:id="chat-event">
      <category xml:id="connexion"/>
      <category xml:id="deconnexion"/>
      <category xml:id="changementpseudo"/>
      [...]
Structure of a textchat message, the <post> element
Part of the description of a textchat turn may be applied to any kind of textchat environment. But the particularities imposed by a specific environment (here, again, IRC EpikNet) have to be detailed in order to guide further research analyses (see the explanations of the attributes @who and @alias, and of the time).
(2)
<editorialDecl>
  <normalization>
    <p>for details about encoding before TEI, see the attached document <idno>cmr-getalp_org-tei-v1-manuel.pdf</idno></p>
    […]
  </normalization>
  <stdVals>
    <p>The time of a post is not known to the second in the textchat logfile. Hence the values of <att>when-iso</att> on the <gi>time</gi> element always end in the format <val>HH:MM</val>; i.e., seconds, fractions thereof, and time zone designators are not present.</p>
  </stdVals>
  <segmentation>
    <p><gi>post</gi> corresponds to a textchat turn</p>
  </segmentation>
</editorialDecl>
<tagsDecl>
  <namespace name="http://www.tei-c.org/ns/1.0">
    <tagUsage gi="post">one post corresponds to one textchat turn, i.e. one participant's utterance.
      <list>
        <item><att>xml:id</att> ID of the posting.</item>
        <item><att>alias</att> is the participant's alias. It does not identify a participant since a participant may change her/his alias (cf. <att>type</att> chat-command). Moreover two participants may use the same alias (we have never checked this).</item>
        <item><att>who</att> is the login ID given by the system to a participant present in the channel at one given moment. In other words, if the participant leaves the channel and then comes back, s/he will receive another system ID. This system ID does not identify a participant in the whole channel. It only identifies a participant during a short period of interaction. Two different participants cannot have the same system ID. Tracking aliases' use and relating it to system IDs may be one way of approaching this identification. This identification (knowing the exhaustive list of posts sent by the same person) may be a topic of investigation for future analyses.</item>
        <item><att>type</att> type of message, cf. taxonomy.</item>
        <item><att>sub-type</att> subtype of message in the taxonomy</item>
        <item><att>synch</att> absolute time when the IRC channel displayed the post</item>
      </list>
    </tagUsage>
  </namespace>
</tagsDecl>
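The remark above on relating aliases to system IDs can be illustrated with a short Python sketch. It is not part of the CoMeRe tooling: the function name and the sample posts are hypothetical (the who values, aliases and message texts are invented), and the sample omits the TEI namespace for brevity. The sketch only collects, for each system ID found in @who, the aliases used under that ID, in order of first appearance.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# Hypothetical sample: ids, aliases and message texts are invented for illustration.
SAMPLE_POSTS = """
<body>
  <post xml:id="post1" who="sys101" alias="pYcOoZ" type="chat-message">salut</post>
  <post xml:id="post2" who="sys101" alias="pYcOoZ|away" type="chat-message">a+</post>
  <post xml:id="post3" who="sys102" alias="pYcOoZ" type="chat-message">re</post>
</body>
"""


def aliases_by_system_id(xml_text):
    """Map each @who value to the list of distinct @alias values seen with it."""
    root = ET.fromstring(xml_text)
    seen = defaultdict(list)
    for post in root.iter("post"):
        who = post.get("who")
        alias = post.get("alias")
        if who is None or alias is None:
            continue  # skip posts lacking one of the two attributes
        if alias not in seen[who]:
            seen[who].append(alias)
    return dict(seen)


result = aliases_by_system_id(SAMPLE_POSTS)
for system_id, aliases in result.items():
    print(system_id, aliases)
```

Such a table is only a starting point: since a system ID is valid for a short period of interaction, two IDs sharing an alias may or may not belong to the same person.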
Interaction spaces within the environment
Location and time frames
Example (3) describes the general location of the server, then a particular channel with its time frame. 80 other channels (in distinct TEI files) have been described in a similar way in cmr-getalp_org_tei_v1.
(3)
<profileDesc>
  <creation>
    <date from="2004-02-03" to="2004-04-09"/>
    <location type="online_environment">
      <placeName>whereas epiknet.org was the place where IRC channels occurred, botstats.com collected the logfiles of the interactions</placeName>
      <geogName>
        <rs type="city">Blanquefort, France</rs>
        <rs type="TGN">7008161</rs>
        <rs type="URL">http://www.botstats.com</rs>
        <rs type="URL">http://www.epiknet.org</rs>
      </geogName>
    </location>
  </creation>
  […]
  <settingDesc>
    <setting>
      <name>rhone-alpes</name>
      <locale>one IRC EpikNet channel</locale>
      <time from-iso="2004-03-09T00:00" to-iso="2004-04-09T12:08">beginning time of the first session and end time of the last textchat session</time>
      <activity>participants type on keyboard. They can only see threads of messages of the IRC Channel</activity>
    </setting>
  </settingDesc>
</profileDesc>
In (4), for SMS collected from volunteers, we distinguish the dates and location of the company in charge of collecting the data (creation) from the participants' locations and times (settingDesc) (cmr-smslareunion-tei-v1 corpus).
(4)
<profileDesc>
  <creation>
    <date from="2008-04-10" to="2008-06-30"/>
    <location type="telephone network">
      <placeName>A private company collected the messages and sent them to "Laboratoire de recherche sur les espaces Créolophones et Francophones", Université de la Réunion. All participants were located in La Réunion</placeName>
      <geogName>
        <rs type="city">La Réunion, France</rs>
        <rs type="TGN">1000184</rs>
      </geogName>
    </location>
  </creation>
  […]
  <settingDesc>
    <setting>
      <name>La Réunion</name>
      <locale>private phones (or phones given by their company) of authors of sms</locale>
      <time from-iso="2008-04-10T10:57" to-iso="2008-06-30T21:35">beginning time of the first message received by the project server and time of the last message received by the server.</time>
      <activity>participants living in La Réunion freely accepted to send a copy of their SMS to the server of the project. The copy was sent by authors via a specific process, i.e. a process different from the SMS sent to their correspondent.</activity>
    </setting>
  </settingDesc>
</profileDesc>
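As a minimal sketch of how the time frame of an Interaction Space could be read back from such a header, the following Python code parses the @from-iso and @to-iso attributes of the time element in settingDesc. The function name is hypothetical, the sample is a shortened, namespace-free copy of the setting in (4), and the code assumes Python 3.7 or later (for datetime.fromisoformat) and values in the minute-precision format used in these examples.

```python
import xml.etree.ElementTree as ET
from datetime import datetime

# Shortened copy of the <settingDesc> of example (4); no TEI namespace is assumed.
SETTING_DESC = """
<settingDesc>
  <setting>
    <locale>private phones of authors of sms</locale>
    <time from-iso="2008-04-10T10:57" to-iso="2008-06-30T21:35"/>
  </setting>
</settingDesc>
"""


def time_frame(setting_desc_xml):
    """Return (start, end) of the interaction space as naive datetime objects."""
    root = ET.fromstring(setting_desc_xml)
    time_el = root.find("./setting/time")
    if time_el is None:
        raise ValueError("no <time> element found in <setting>")
    # strip() guards against stray whitespace inside the attribute values
    start = datetime.fromisoformat(time_el.get("from-iso").strip())
    end = datetime.fromisoformat(time_el.get("to-iso").strip())
    if end < start:
        raise ValueError("time frame ends before it begins")
    return start, end


start, end = time_frame(SETTING_DESC)
print("interaction space lasted", end - start)
```

Because the headers carry no time zone designators, the resulting values are naive datetimes and should only be compared with each other within the same corpus.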
Figure 2: complex CMC environment with several modalities
(5)
<textDesc xml:lang="en-GB">
  <channel mode="w" xml:lang="en-GB">
    <term ref="#webCT">Learning Management System (LMS), WebCT</term>
  </channel>
  <constitution>This corpus is made of interactions between participants (learners, natives, tutors, researchers) during the online language learning Simuligne experiment (2001). All these interactions happened within the LMS and are made of textchat turns, emails and forum messages. Participants were organized in groups (learning groups): four following "scenario 1", a fifth one gathering all participants during the Interculture activity ("scenario 2", see <gi>projectDesc</gi>), and a sixth restricted to tutors. All details about groups are in <gi>particDesc</gi>. Data have been collected by the <ref target="CR">corpus compiler of the first LETEC Simuligne corpus (2009)</ref>. Since WebCT had no export facilities, data have been extracted from the WebCT internal database, then structured and anonymized.</constitution>
  <derivation type="original"/>
  <domain>education</domain>
  <factuality type="fact"/>
  <interaction type="complete" active="many"><note>Interactions happened according to the guidelines of the learning activities (see <gi>projectDesc</gi> for access to guidelines)</note></interaction>
  <preparedness type="spontaneous"/>
  <purpose degree="high">learn and practice French, develop intercultural competences</purpose>
</textDesc>
(6)
<creation>
  <date from="2001-04-09" to="2001-07-06"/>
  <location type="online_environment">
    <placeName>Interactions collected on the LMS server in Besançon. Participants were members of two universities, one in the UK, the other in France, but could be online anywhere in the world.</placeName>
    <geogName>
      <rs type="city">Milton Keynes, United Kingdom</rs>
      <rs type="city">Besançon, France</rs>
      <rs type="TGN">7026232</rs>
      <rs type="TGN">7008356</rs>
    </geogName>
  </location>
</creation>
[…]
<settingDesc>
  <setting>
    <name>Simuligne online language learning course</name>
    <locale>Locations of interactions were different from one learning group to another, but followed the same learning scenario. They all happened through the communication tools (email, forum, textchat) of the LMS environment and their respective spaces: mailboxes, forums, chat rooms.</locale>
    <time from-iso="2001-04-09T00:00" to-iso="2001-07-06T12:10">beginning time of first post and end time of last post</time>
    <activity>Participants interacted while following the learning activities and their guidelines (see <gi>textDesc</gi>, <gi>classDecl</gi>, and <gi>tagsDecl</gi> for more information).</activity>
  </setting>
</settingDesc>
Participant descriptions may vary considerably from one corpus to another. When CMC interactions happen in a free/open environment, it is almost impossible to collect any information on participants, except what can be automatically processed. See (7) for an extract from the open IRC textchat already mentioned above (cmr-getalp_org-tei-v1 corpus). The participant with ID cmr-get-c024-p45906 has been processed as being different from the participant with ID cmr-get-c024-p45905. They share some aliases, and it is impossible to tell whether they are the same person; here they are considered as two people. Within such an environment, participants tend to change their aliases constantly in order to tell other participants what their current mood or occupation is.
(7)
<person xml:id="cmr-get-c024-p45906">
  <persName>
    <addName type="alias">[tilulu]pYcOoZ</addName>
    <addName type="alias">pYcOoZ</addName>
    <addName type="alias">pYcOoZ|away</addName>
  </persName>
</person>
<person xml:id="cmr-get-c024-p45905">
  <persName>
    <addName type="alias">[tilulu]pYcOoZ</addName>
    <addName type="alias">pYcOoZ</addName>
  </persName>
</person>
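To show how such overlaps could be detected automatically, here is a small Python sketch that lists the aliases shared by more than one person entry. It is an illustration only: the function name is ours, the wrapping listPerson element is an assumption added to make the extract (7) a single well-formed fragment, and no TEI namespace is used.

```python
import xml.etree.ElementTree as ET
from collections import defaultdict

# ElementTree exposes xml:id under the predefined XML namespace.
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"

# The two <person> entries of example (7), wrapped in an assumed <listPerson>.
PERSONS = """
<listPerson>
  <person xml:id="cmr-get-c024-p45906">
    <persName>
      <addName type="alias">[tilulu]pYcOoZ</addName>
      <addName type="alias">pYcOoZ</addName>
      <addName type="alias">pYcOoZ|away</addName>
    </persName>
  </person>
  <person xml:id="cmr-get-c024-p45905">
    <persName>
      <addName type="alias">[tilulu]pYcOoZ</addName>
      <addName type="alias">pYcOoZ</addName>
    </persName>
  </person>
</listPerson>
"""


def shared_aliases(xml_text):
    """Return {alias: sorted person ids} for aliases used by several persons."""
    root = ET.fromstring(xml_text)
    owners = defaultdict(set)
    for person in root.iter("person"):
        person_id = person.get(XML_ID)
        for add_name in person.iter("addName"):
            if add_name.get("type") == "alias" and add_name.text:
                owners[add_name.text.strip()].add(person_id)
    return {alias: sorted(ids) for alias, ids in owners.items() if len(ids) > 1}


shared = shared_aliases(PERSONS)
for alias, ids in shared.items():
    print(alias, "->", ids)
```

A shared alias only flags a candidate pair; as stated above, it does not prove that the two entries are the same person.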
Conversely, when CMC happens in learning situations, researchers can collect rich information on users, which needs to be incorporated in order to support further research. (8) gives an extract of the participants (individuals and groups) from the aforementioned cmr-simuligne-tei-v1 corpus. It is only an extract; much more information (on language biography, education, etc.) can be present.
(8)
<particDesc>
  <listPerson>
    <person role="learner" xml:id="Gl1">
      [here gender, TEI element cannot be written in the wiki!]
      <age value="51"/>
      <residence>United Kingdom</residence>
      <affiliation>
        <orgName>Open University</orgName>
      </affiliation>
      <persName>
        <addName type="alias">Alba</addName>
      </persName>
    </person>
    [other participants]
    <personGrp role="learnerGroup" xml:id="Simu-g-Ga">
      <persName>
        <addName type="alias">Gallia</addName>
      </persName>
    </personGrp>
    [other groups]
    <listRelation corresp="#Simu-g-Ga">
      <relation type="social" name="tutor" active="#Gt"/>
      <relation type="social" name="native" active="#Gn1 #Gn2"/>
      <relation type="social" name="learner" active="#Gl1 #Gl2 #Gl3 #Gl4 #Gl5 #Gl6 #Gl8 #Gl9 #Gl10"/>
      <relation type="social" name="researcher" active="#Tm"/>
    </listRelation>
  </listPerson>
</particDesc>
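Since a mandatory property of a group is the listing of its participants, it can be useful to resolve the pointers of a listRelation into an explicit membership table. The Python sketch below does this for the listRelation of (8); the function name is hypothetical, the fragment is parsed without a TEI namespace, and we assume that @corresp holds a single pointer to the group and that @active holds a space-separated list of pointers, as in the extract.

```python
import xml.etree.ElementTree as ET

# The <listRelation> of example (8), copied without any TEI namespace.
LIST_RELATION = """
<listRelation corresp="#Simu-g-Ga">
  <relation type="social" name="tutor" active="#Gt"/>
  <relation type="social" name="native" active="#Gn1 #Gn2"/>
  <relation type="social" name="learner"
            active="#Gl1 #Gl2 #Gl3 #Gl4 #Gl5 #Gl6 #Gl8 #Gl9 #Gl10"/>
  <relation type="social" name="researcher" active="#Tm"/>
</listRelation>
"""


def group_roles(list_relation_xml):
    """Return (group id, {role name: [participant ids]}) for one <listRelation>."""
    root = ET.fromstring(list_relation_xml)
    group_id = root.get("corresp", "").lstrip("#")
    roles = {}
    for relation in root.iter("relation"):
        # @active is a whitespace-separated list of "#id" pointers
        members = [ptr.lstrip("#") for ptr in relation.get("active", "").split()]
        roles.setdefault(relation.get("name"), []).extend(members)
    return group_id, roles


group_id, roles = group_roles(LIST_RELATION)
print(group_id)
for role, members in roles.items():
    print(f"  {role}: {', '.join(members)}")
```

The resulting identifiers can then be matched against the xml:id values of the person elements in the same listPerson.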
<teiHeader> elements used and some limitations to the current TEI version
When presenting metadata information attached to CMC corpora, we have mainly used the following elements of the <teiHeader>. Note that the @mode attribute of <channel> in <textDesc> does not have enough values for describing multimodal texts.
<encodingDesc>: <projectDesc>, <editorialDecl>, <tagsDecl>, <classDecl> (with <taxonomy>)
<profileDesc>: <creation>, <textDesc>, <particDesc>
Multimodal corpus and TEI file
When you consider a multimodal corpus, it is made of a set of files (audio files, video files, documents, etc.), among which the TEI file only represents the structured data related to interactions (see, for example, a LEarning and TEaching Corpus (LETEC), which gathers all these files into one single corpus, one of which corresponds to the structured interactions, here in the XML manifest). These interactions are transcriptions of audio and video files or come from logfiles. Hence these files play an important role and should be clearly referred to in the TEI file. As far as we know, there is no easy or standard way of referencing them within the TEI file. This difficulty has also been stressed by consortia working on oral corpora. There is a clear distinction between a multimodal corpus and a TEI corpus made from a manuscript or a paper-based book: audio and video files are part of the corpus and go hand in hand with the TEI file.