Corpus for the Analysis of German-English Contrasts in Cohesion

Corpus acronym: 
GECCo
Developer: 
GECCo Team at Saarland University, Department of Applied Linguistics, Translation and Interpreting
Authors: 
Ekaterina Lapshinova-Koltunski, Kerstin Kunz, Katrin Menzel, Erich Steiner, Jose Manuel Martinez Martinez, Stefania Degaetano-Ortlieb, Marilisa Amoia
Contact person(s): 
Ekaterina Lapshinova-Koltunski, Kerstin Kunz
Availability: 
The corpus is available via the CLARIN-DE repository (http://hdl.handle.net/11858/00-246C-0000-0023-8CF7-A). It can be queried at http://corpora.clarin-d.uni-saarland.de/cqpweb/. Access to the corpus is restricted; please contact the authors.
Languages covered: 
English, German
Available translations: 
English-to-German translations for the written part of the English original subcorpus
German-to-English translations for the written part of the German original subcorpus
Corpus size (documents): 
604
Corpus size (sentences): 
78,942
Corpus size (tokens): 
1,693,386
Mode: 
written
spoken
Genre: 
fiction
science
interactional (social networks, SMS, everyday conversation, etc.)
Genre (detailed): 
academic speeches, political essays, fictional texts, interviews, instruction manuals, popular-scientific articles, letters to shareholders, prepared political speeches, tourism leaflets, texts from various corporate websites
Register: 
casual
semi-formal
formal
Register (2): 
spontaneous
semi-spontaneous
non-spontaneous
Text type: 
instructive
narrative
expository
descriptive
argumentative
Years of the data origin: 
written subcorpora: 1992 - 2006; spoken subcorpora: 2008 - 2012
Document structure: 
sentence boundaries, turns in the spoken part, and text boundaries throughout the corpus (not always full texts)
Unit of segmentation used: 
turns
Tools for annotation: 
mpro (morphological analysis), TreeTagger, Stanford Parser, MATE parser, CWB modules, CQP, MMAX2
Tools for browsing: 
CQP and CQP Web
Tools for querying: 
CQP and CQP Web, also MMAX2
Types of DSDs annotated: 
explicit inter-sentential (cohesive) relations and intra-sentential relations between clauses
Number of DSD instances: 
51,790
Method of annotation: 
semi-automatic (automatic with manual post-correction)
Style/theory of annotation: 
Systemic Functional Linguistics
Format: 
CQP XML
Version number, release date: 

GECCO2013, GECCO-SPOKEN2014 (spoken only)

Previous versions and their release dates: 

previous GECCo versions

Citation (text format): 

Lapshinova-Koltunski, E., K. Kunz and M. Amoia (2012). Compiling a Multilingual Corpus. In Heliana Mello, Massimo Pettorino and Tommaso Raso (eds). Proceedings of the VIIth GSCP-2012 International Conference: Speech and Corpora. Firenze: Firenze University Press. pp. 29-34.
Lapshinova-Koltunski, E. and K. Kunz (2014). Annotating Cohesion for Multilingual Analysis. In Proceedings of the 10th Joint ACL - ISO Workshop on Interoperable Semantic Annotation, Reykjavik, May 26, 2014.

Citation (bibTeX format): 

@INPROCEEDINGS{LapshinovaEtal2012,
  author = {Lapshinova-Koltunski, Ekaterina and Kerstin Kunz and Marilisa Amoia},
  editor = {Heliana Mello and Massimo Pettorino and Tommaso Raso},
  title = {Compiling a Multilingual Spoken Corpus},
  booktitle = {Proceedings of the VIIth GSCP International Conference: Speech and corpora},
  publisher = {Firenze University Press},
  address = {Firenze},
  pages = {79--84},
  year = {2012},
  url = {http://store.torrossa.it/resources/9788866553519}
}

@InProceedings{LapshinovaKunz:2014:ISA,
  author = {Lapshinova-Koltunski, Ekaterina and Kunz, Kerstin},
  title = {Annotating Cohesion for Multilingual Analysis},
  booktitle = {Proceedings of the 10th Joint ACL - ISO Workshop on Interoperable Semantic Annotation},
  month = {May},
  year = {2014},
  address = {Reykjavik, Iceland},
  publisher = {LREC}
}

Notes: 
Further info about the discourse relations: 
senses/semantic labels are annotated for the relations
Other annotation layers: 
sentence morphosyntax, parse structure
anaphora (coreference, bridging)
substitution, ellipsis, lexical cohesion