DisFrEn

Corpus acronym: 
DisFrEn
Developer: 
Université catholique de Louvain
Authors: 
Ludivine Crible
Contact person(s): 
Ludivine Crible
Availability: 
annotation complete; extracted annotations available for the database
Languages covered: 
English, French
Available translations: 
the corpus is comparable, not translated
Corpus size (hours): 
15
Corpus size (documents): 
111
Corpus size (tokens): 
161700
Mode: 
spoken
Genre: 
journalistic
science
interactional (social networks, sms, everyday conversation, etc.)
Genre (detailed): 
interview (face-to-face and radio), conversation, phone calls, news broadcast, political speech, classroom lessons, sports commentaries
Register: 
casual
semi-formal
formal
Register (2): 
spontaneous
semi-spontaneous
non-spontaneous
Years of the data origin: 
1991-2010
Unit of segmentation used: 
word
Tools for annotation: 
EXMARaLDA (Partitur Editor)
Tools for browsing: 
EXMARaLDA (Exakt)
Tools for querying: 
EXMARaLDA (Exakt)
Types of DSDs annotated: 
explicit only; relational and non-relational (e.g. "because" and "well"); each takes scope over at least one unit equal to or larger than a clause (excludes intra-sentential conjunctions)
Number of DSD instances: 
8743
Method of annotation: 
manual
Style/theory of annotation: 
PDTB style
Format: 
.exb (EXMARaLDA), Praat TextGrid, XML ...
Version number, release date: 

will be version 1.0, finished by end of 2015

Previous versions and their release dates: 

NA

Citation (text format): 

Crible, L. 2017. "Discourse markers and (dis)fluency across registers: A contrastive usage-based study in English and French". PhD thesis, Université catholique de Louvain.

Citation (bibTeX format): 

@phdthesis{crib17, author = "Crible, Ludivine", title = "Discourse markers and (dis)fluency across registers: A contrastive usage-based study in English and French", school = "Université catholique de Louvain", year = "2017"}

Notes: 

The transcription and audio files are not mine (they come either from freely available corpora or from corpora made available by agreement). Most of them underwent technical processing to be homogenized into the same format; some of them were sound-aligned. All annotations are mine.

Audio/video annotation: 
alignment of the audio/video to the transcriptions
Further info about the discourse relations: 
senses/semantic labels are annotated for the relations
Other annotation layers: 
syntactic position of the discourse markers; POS; disfluency markers