Difference between revisions of "SIG:CMC/Technical Meeting on CMC at DARIAH VCC 2014"

(48 intermediate revisions by 2 users not shown)
'''NOTE: The content of this page is preliminary and tentative.''' This page describes a proposal and tentative program for a technical meeting on issues related to the modeling of CMC corpora, submitted by members of the CMC-SIG for the 4th DARIAH-EU VCC meeting 2014 in Rome.
+
<p style="background:#87CEFA;">'''This page is part of the wiki space of the [[SIG:Computer-Mediated Communication|TEI-SIG “Computer-mediated communication”]].'''</p>
 +
<table width="90%" border="0" align="center">
 +
<tr>
 +
<td>
 +
<big>
 +
This page describes the goal and tentative program for a community session/technical meeting on issues related to the modeling of CMC corpora, organized by members of the CMC-SIG at the [http://www.dariah.eu/activities/general-vcc-meetings/4th-general-vcc-meeting.html 4th DARIAH-EU VCC meeting 2014] in Rome.
  
= PROPOSAL =
+
'''Date:''' September 17-18, 2014
  
Proposal for a technical meeting on the topic
+
'''Location:''' Rome, [http://www.iliesi.cnr.it/EN/sede.shtml Villa Mirafiori]
  
<big>'''Models and tools for structuring & annotating corpora of social media / computer-mediated communication'''</big>
+
Main page of the CMC-SIG in this wiki: <u>[[SIG:Computer-Mediated Communication]]</u>
  
to be held during the 4th General Virtual Competency Centre (VCC) meeting of DARIAH-EU, Rome, 17-19 September 2014
+
== DESCRIPTION ==
  
Corpora of ''computer-mediated communication'' (''CMC'') are a desideratum for many scholars in the Humanities who are interested in doing empirical research of language use and of emerging communicative genres on the Internet and in social media applications. Important steps for building such corpora and for representing them in an interoperable way are
+
<big>'''TEI CMC: Models and tools for structuring & annotating corpora of social media / computer-mediated communication'''</big><br/>See [http://dariah.eu/activities/general-vcc-meetings/4th-general-vcc-meeting/programme/community-sessions.html PDF version] on the DARIAH website
  
*to create models for the representation of CMC (including text-only as well as multimodal genres) which comply with a standard or de-facto standard in the field of Humanities;
+
Corpora of ''computer-mediated communication'' (CMC) are a desideratum for many scholars in the humanities who are interested in doing empirical research of language use and of emerging communicative genres on the Internet and in social media applications. Important steps for building such corpora and for representing them in an interoperable way are:
*to adapt tools for boilerplate removal and natural language processing to the structural and linguistic peculiarities of these genres (e.g., tools for converting raw CMC data into a standard format; tools for linguistic processing such as tokenizers, POS taggers, parsers etc.).
+
*To create models for the representation of CMC (including text-only as well as multimodal genres) which comply with a standard or de-facto standard in the field of humanities;
 +
*To adapt tools for boilerplate removal and natural language processing to the structural and linguistic peculiarities of these genres (e.g., tools for converting raw CMC data into a standard format; tools for linguistic processing, such as tokenizers, POS taggers, parsers etc.).
  
Researchers at a European level are already aware that many of the challenges in building CMC corpora in the Humanities are the same for every language; therefore CMC corpus projects for different languages can benefit from sharing knowledge and experiences with each other and from facing the challenges as a joint task. Since 2013, a group of corpus projects from France, Germany, Italy and the Netherlands has started to exchange expertise and experiences in building CMC corpora (= the network "Building and annotating CMC corpora", https://wiki.itmc.tu-dortmund.de/cmc/) and to jointly work on a proposal for an extension to the TEI standard which is adapted to the particularities of a broad range of CMC genres (= the TEI special interest group on CMC, http://www.tei-c.org/Activities/SIG/). The DARIAH technical meeting will gather a restricted number of researchers, coming from different European countries, involved in projects aiming at building, structuring and annotating CMC corpora including:
+
Researchers at a European level are already aware that many of the challenges in building CMC corpora in the humanities are the same for every language; therefore CMC corpus projects for different languages can benefit from sharing knowledge and experience with each other and from facing the challenges as a joint task. Since 2013, a group of corpus projects from France, Germany, Italy and the Netherlands has started to exchange expertise and experience in building CMC corpora (= the network "Building and annotating CMC corpora", https://wiki.itmc.tu-dortmund.de/cmc/) and to jointly work on a proposal for an extension to the TEI standard which is adapted to the particularities of a broad range of CMC genres (= the TEI special interest group on CMC, http://www.tei-c.org/Activities/SIG/). The DARIAH technical meeting will gather a restricted number of researchers, coming from different European countries, involved in projects aiming at building, structuring, annotating and analyzing CMC corpora, including:
 
 
*CoMeRe: Project for a corpus of French CMC: http://comere.org
 
*Dortmund Chat Corpus: http://www.chatkorpus.tu-dortmund.de
 
*DeRiK: project for a reference corpus of German CMC: http://www.tinyurl.com/derik-llc
 
*SoNaR ?
 
*Web2CorpusIT, Pilot Corpus of Italian Computer-Mediated Communication: http://www.glottoweb.org/web2corpus/
 
*Wikipedia corpora in DeReKo / IDS Mannheim: www.ids-mannheim.de/dereko
 
 
 
The expected outcomes of the meeting are, amongst others:
 
  
 +
*''CoMeRe'': Project for a corpus of French CMC: http://comere.org
 +
*''Dortmund Chat Corpus'': http://www.chatkorpus.tu-dortmund.de
 +
*''DeRiK'': project for a reference corpus of German CMC: http://www.tinyurl.com/derik-llc
 +
*''Web2CorpusIT'': Pilot Corpus of Italian Computer-Mediated Communication: http://www.glottoweb.org/web2corpus/
 +
*''Wikipedia corpora in DeReKo'' / IDS Mannheim: http://www.ids-mannheim.de/dereko
 +
*''KobRA'': Corpus-based linguistic analysis with the help of data mining: http://www.kobra.tu-dortmund.de
 +
The expected outcomes of the meeting are, amongst others:
 
*an advanced proposal for representing CMC genres in TEI (which subsequently shall be presented as a proposal to the TEI community in 2015);
 
*an advanced proposal for representing CMC genres in TEI (which subsequently shall be presented as a proposal to the TEI community in 2015);
*a first exchange about experiences in automatically structuring and processing CMC data and a concept for a common platform (to be set up in 2015) for the documentation and exchange of NLP tools and annotation experiments with other projects and research groups interested in building CMC corpora in different languages;
+
*a first exchange about experience in automatically structuring and processing CMC data as well as a concept for a common platform (to be set up in 2015) for the documentation and exchange of NLP tools and annotation experiments with other projects and research groups interested in building CMC corpora in different languages;
*plans for international scientific events (extended workshops, conferences) around these topics in 2015/16.
+
* plans for international scientific events (extended workshops, conferences) based on these topics in 2015/16.
 
 
=TENTATIVE SCHEDULE / PROGRAM=
 
 
 
*September 17, 2014: '''short intro/presentation''' (3-5 mins) on the work of our community
 
*September 18, 2014: '''technical meeting''' (1.5 + 1.5 + 2 hours).
 
  
'''Precise times for the meeting and presentations will be fixed after DARIAH has decided on the proposal. All titles given below are working titles.'''
+
==PROGRAM AND SLIDES==
  
==Technical meeting pt. I: Current state of building & modeling corpora (1.5 hrs)==
+
===Pt. I: "Lightning talks" session (Wednesday, September 17)===
  
* A presentation from CoMeRé ?
+
* Michael Beißwenger (Dortmund), Thierry Chanier (Clermont) & Isabella Chiari (Rome):<br/>'''Models and tools for structuring & annotating corpora of social media / computer-mediated communication''' ( [http://wiki.tei-c.org/images/a/aa/Dariah-cmc-0_lightningtalk.pdf slides as pdf] )
* A presentation by Isabella and/or Isabelle and Axel ?
 
* Harald Luengen & Eliza Margareta (IDS Mannheim): Experiences with modeling Wikipedia corpora in TEI
 
* N.N.
 
  
==Technical meeting pt. II: NLP for CMC / social media corpora (1.5 hrs)==
+
===Pt. II: Community Session (Thursday, September 18)===
  
* members of the KobRA project (M. Beißwenger/C. Pölitz) on adapting machine learning methods for the analysis and annotation of CMC features in Wikipedia corpora (using the Wikipedia corpora of the IDS)
+
* Thierry Chanier & Kun Jin (UBP Clermont):<br/>'''End of phase 1 of the CoMeRe project: application of the Interaction Space (TEI-CMC) model to various CMC corpora in French''' ( slides as pdf: [http://wiki.tei-c.org/images/b/bd/Dariah-cmc-1_comere-1.pdf pt. 1] , [http://wiki.tei-c.org/images/9/9f/Dariah-cmc-1_comere-2.pdf pt. 2] )
* a presentation from someone who's dealing with "NLP4CMC" issues in the French community?
+
* Harald Lüngen & Eliza Margareta (IDS Mannheim):<br/>'''Applying the TEI CMC SIG proposal to Wikipedia discussion corpora''' ( [http://wiki.tei-c.org/images/1/13/Dariah-cmc-2_wikipediacorpora.pdf slides as pdf] )
* ? A report from the project of preparing a shared task on the linguistic annotation of CMC (= intermediate results from developing a tagset and guidelines for POS tagging of German CMC) (members of the Empirikom network)
+
* Michael Beißwenger & Christian Pölitz (TU Dortmund):<br/>'''Analyzing CMC corpora using machine learning methods: report from the KobRA project''' ( [http://wiki.tei-c.org/images/b/b1/Dariah-cmc-3_kobra-machinelearning.pdf slides as pdf] )
* N.N.
+
* Benoit Sagot (Alpage/CNRS & INRIA):<br/>'''Application of a general POS tagger for French to the CoMeRe-CMC corpora''' ( [http://wiki.tei-c.org/images/3/34/Dariah-cmc-4_Sagot.pdf slides as pdf] )
  
==Technical meeting pt. III: schedule for further work on standards and joint scientific activities (2 hrs)==
+
===Pt. III: Round table: further work on standards and joint scientific activities (Thursday, September 18)===
  
* schedule for extending the standards (e.g., the next step in the TEI-CMC SIG)
+
* Discussion of the current draft schema from the perspective of different projects and initiatives and of the interface between POS annotations and the microstructure in the TEI schema
* plans for scientific events in 2015-16 focused on annotations, NLP4CMC, work on Wikipedia & other genres
+
* schedule for extending the standards / next steps of work in the TEI CMC-SIG
* platform for the documentation and exchange of NLP tools and annotation experiments?
+
* plans for scientific events and initiatives in 2015-16 focused on linguistic annotations
 +
* platform for the exchange of tools and experiences in annotating CMC corpora
 +
</big>
 +
</td>
 +
</tr>
 +
</table>

Latest revision as of 08:51, 19 October 2015

This page is part of the wiki space of the TEI-SIG “Computer-mediated communication”.

This page describes the goal and tentative program for a community session/technical meeting on issues related to the modeling of CMC corpora, organized by members of the CMC-SIG at the 4th DARIAH-EU VCC meeting 2014 in Rome.

Date: September 17-18, 2014

Location: Rome, Villa Mirafiori

Main page of the CMC-SIG in this wiki: SIG:Computer-Mediated Communication

DESCRIPTION

TEI CMC: Models and tools for structuring & annotating corpora of social media / computer-mediated communication
See PDF version on the DARIAH website

Corpora of computer-mediated communication (CMC) are a desideratum for many scholars in the humanities who are interested in doing empirical research of language use and of emerging communicative genres on the Internet and in social media applications. Important steps for building such corpora and for representing them in an interoperable way are:

  • To create models for the representation of CMC (including text-only as well as multimodal genres) which comply with a standard or de-facto standard in the field of humanities;
  • To adapt tools for boilerplate removal and natural language processing to the structural and linguistic peculiarities of these genres (e.g., tools for converting raw CMC data into a standard format; tools for linguistic processing, such as tokenizers, POS taggers, parsers etc.).

Researchers at a European level are already aware that many of the challenges in building CMC corpora in the humanities are the same for every language; therefore CMC corpus projects for different languages can benefit from sharing knowledge and experience with each other and from facing the challenges as a joint task. Since 2013, a group of corpus projects from France, Germany, Italy and the Netherlands has started to exchange expertise and experience in building CMC corpora (= the network "Building and annotating CMC corpora", https://wiki.itmc.tu-dortmund.de/cmc/) and to jointly work on a proposal for an extension to the TEI standard which is adapted to the particularities of a broad range of CMC genres (= the TEI special interest group on CMC, http://www.tei-c.org/Activities/SIG/). The DARIAH technical meeting will gather a restricted number of researchers, coming from different European countries, involved in projects aiming at building, structuring, annotating and analyzing CMC corpora, including:

  • CoMeRe: Project for a corpus of French CMC: http://comere.org
  • Dortmund Chat Corpus: http://www.chatkorpus.tu-dortmund.de
  • DeRiK: project for a reference corpus of German CMC: http://www.tinyurl.com/derik-llc
  • Web2CorpusIT: Pilot Corpus of Italian Computer-Mediated Communication: http://www.glottoweb.org/web2corpus/
  • Wikipedia corpora in DeReKo / IDS Mannheim: http://www.ids-mannheim.de/dereko
  • KobRA: Corpus-based linguistic analysis with the help of data mining: http://www.kobra.tu-dortmund.de

The expected outcomes of the meeting are, amongst others:

  • an advanced proposal for representing CMC genres in TEI (which subsequently shall be presented as a proposal to the TEI community in 2015);
  • a first exchange about experience in automatically structuring and processing CMC data as well as a concept for a common platform (to be set up in 2015) for the documentation and exchange of NLP tools and annotation experiments with other projects and research groups interested in building CMC corpora in different languages;
  • plans for international scientific events (extended workshops, conferences) based on these topics in 2015/16.

PROGRAM AND SLIDES

Pt. I: "Lightning talks" session (Wednesday, September 17)

  • Michael Beißwenger (Dortmund), Thierry Chanier (Clermont) & Isabella Chiari (Rome):
    Models and tools for structuring & annotating corpora of social media / computer-mediated communication ( slides as pdf )

Pt. II: Community Session (Thursday, September 18)

  • Thierry Chanier & Kun Jin (UBP Clermont):
    End of phase 1 of the CoMeRe project: application of the Interaction Space (TEI-CMC) model to various CMC corpora in French ( slides as pdf: pt. 1 , pt. 2 )
  • Harald Lüngen & Eliza Margareta (IDS Mannheim):
    Applying the TEI CMC SIG proposal to Wikipedia discussion corpora ( slides as pdf )
  • Michael Beißwenger & Christian Pölitz (TU Dortmund):
    Analyzing CMC corpora using machine learning methods: report from the KobRA project ( slides as pdf )
  • Benoit Sagot (Alpage/CNRS & INRIA):
    Application of a general POS tagger for French to the CoMeRe-CMC corpora ( slides as pdf )

Pt. III: Round table: further work on standards and joint scientific activities (Thursday, September 18)

  • Discussion of the current draft schema from the perspective of different projects and initiatives and of the interface between POS annotations and the microstructure in the TEI schema
  • schedule for extending the standards / next steps of work in the TEI CMC-SIG
  • plans for scientific events and initiatives in 2015-16 focused on linguistic annotations
  • platform for the exchange of tools and experiences in annotating CMC corpora