Developer:
GECCo Team at Saarland University, Department of Applied Linguistics, Translation and Interpreting
Authors:
Ekaterina Lapshinova-Koltunski, Kerstin Kunz, Katrin Menzel, Erich Steiner, Jose Manuel Martinez Martinez, Stefania Degaetano-Ortlieb, Marilisa Amoia
Contact person(s):
Ekaterina Lapshinova-Koltunski, Kerstin Kunz
Availability:
The corpus is available via the CLARIN-DE repository (http://hdl.handle.net/11858/00-246C-0000-0023-8CF7-A). It can be queried at http://corpora.clarin-d.uni-saarland.de/cqpweb/. Access to the corpus is restricted; please contact the authors.
Available translations:
English-to-German translations for the written part of the English original subcorpus
German-to-English translations for the written part of the German original subcorpus
Genre:
fiction
science
interactional (social networks, SMS, everyday conversation, etc.)
Genre (detailed):
academic speeches, political essays, fictional texts, interviews, instruction manuals, popular-scientific articles, letters to shareholders, prepared political speeches, tourism leaflets, texts from various corporate websites
Register (2):
spontaneous
semi-spontaneous
non-spontaneous
Text type:
instructive
narrative
expository
descriptive
argumentative
Years of the data origin:
written subcorpora: 1992 - 2006; spoken subcorpora: 2008 - 2012
Document structure:
sentence boundaries, turns in the spoken part, and text boundaries throughout the corpus (not always full texts)
Unit of segmentation used:
Tools for annotation:
mpro (morphological analysis), TreeTagger, Stanford Parser, MATE parser, CWB modules, CQP, MMAX2
Tools for querying:
CQP and CQPweb, also MMAX2
Types of DSDs annotated:
explicit inter-sentential (cohesive) relations and intra-sentential relations between clauses
Method of annotation:
semi-automatic (automatic with manual post-correction)
Style/theory of annotation:
Systemic Functional Linguistics
Version number, release date:
GECCO2013, GECCO-SPOKEN2014 (spoken only)
Previous versions and their release dates:
Pointers to related corpora:
CroCo (all written subcorpora are extracted from here), MICASE (English academic speeches), BACKBONE and ELISA (German and English interviews), BNC (English medical consultations and sermon speeches)
Citation (text format):
Lapshinova-Koltunski, E., K. Kunz and M. Amoia (2012). Compiling a Multilingual Corpus. In Heliana Mello, Massimo Pettorino and Tommaso Raso (eds). Proceedings of the VIIth GSCP-2012 International Conference: Speech and Corpora. Firenze: Firenze University Press. pp. 29-34.
Lapshinova-Koltunski, E. and K. Kunz (2014). Annotating Cohesion for Multilingual Analysis. In Proceedings of the 10th Joint ACL - ISO Workshop on Interoperable Semantic Annotation, Reykjavik, May 26, 2014.
Citation (bibTeX format):
@INPROCEEDINGS{LapshinovaEtal2012,
  author    = {Lapshinova-Koltunski, Ekaterina and Kunz, Kerstin and Amoia, Marilisa},
  editor    = {Heliana Mello and Massimo Pettorino and Tommaso Raso},
  title     = {Compiling a Multilingual Spoken Corpus},
  booktitle = {Proceedings of the VIIth GSCP International Conference: Speech and corpora},
  publisher = {Firenze University Press},
  address   = {Firenze},
  pages     = {79--84},
  year      = {2012},
  url       = {http://store.torrossa.it/resources/9788866553519}
}
@InProceedings{LapshinovaKunz:2014:ISA,
  author    = {Lapshinova-Koltunski, Ekaterina and Kunz, Kerstin},
  title     = {Annotating Cohesion for Multilingual Analysis},
  booktitle = {Proceedings of the 10th Joint ACL - ISO Workshop on Interoperable Semantic Annotation},
  month     = {May},
  year      = {2014},
  address   = {Reykjavik, Iceland},
  publisher = {LREC}
}
Further info about the discourse relations:
senses/semantic labels are annotated for the relations
Other annotation layers:
sentence morphosyntax, parse structure
anaphora (coreference, bridging)
substitution, ellipsis, lexical cohesion