Prague Discourse Treebank 2.0

Corpus acronym: 
PDiT 2.0
Developer: 
Charles University, Institute of Formal and Applied Linguistics
Authors: 
Rysová Magdaléna
Synková Pavlína
Mírovský Jiří
Hajičová Eva
Nedoluzhko Anna
Ocelák Radek
Pergler Jiří
Poláková Lucie
Scheller Veronika
Zdeňková Jana
Zikánová Šárka
Contact person(s): 
Jiří Mírovský
Availability: 
publicly available in the LINDAT-CLARIN repository: http://hdl.handle.net/11234/1-1905
Languages covered: 
Czech
Corpus size (documents): 
3,165
Corpus size (sentences): 
49,431
Corpus size (tokens): 
833,124
Mode: 
written
Genre: 
journalistic
Genre (detailed): 
newspaper news, journal articles, interviews, and others
Register: 
formal
Register (2): 
semi-spontaneous
non-spontaneous
Text type: 
instructive
narrative
expository
descriptive
argumentative
Years of the data origin: 
1991-1995
Document structure: 
documents, paragraph boundaries, headings
Tools for annotation: 
TrEd (http://ufal.mff.cuni.cz/tred/)
Tools for browsing: 
TrEd (http://ufal.mff.cuni.cz/tred/)
Tools for querying: 
PML-TQ (http://ufal.mff.cuni.cz/pmltq/)
Direct link to the PML-TQ search server: https://lindat.mff.cuni.cz/services/pmltq/#!/treebank/pdit20/help
Types of DSDs annotated: 
Relations marked by explicit connectives (primary and secondary), both intra- and inter-sentential.
Number of DSD instances: 
21,668
Method of annotation: 
primary connectives: manual for inter-sentential, automatic with manual correction for intra-sentential; secondary connectives: manual at places semi-automatically identified in the texts
Style/theory of annotation: 
PDTB style adapted for dependency trees
Format: 
XML
Version number, release date: 

PDiT 2.0, December 2016

Previous versions and their release dates: 

PDT 3.0 (2013), PDiT 1.0 (2012)

Citation (text format): 

Rysová Magdaléna, Synková Pavlína, Mírovský Jiří, Hajičová Eva, Nedoluzhko Anna, Ocelák Radek, Pergler Jiří, Poláková Lucie, Scheller Veronika, Zdeňková Jana, Zikánová Šárka: Prague Discourse Treebank 2.0. Data/software, ÚFAL MFF UK, Prague, Czech Republic, Lindat/Clarin: http://hdl.handle.net/11234/1-1905, Dec 2016

Citation (bibTeX format): 

@misc{biblio:RySyMiPragueDiscourse2016,
  title = {Prague Discourse Treebank 2.0},
  author = {Magdal{\'{e}}na Rysov{\'{a}} and Pavl{\'{i}}na J{\'{i}}nov{\'{a}} and Ji{\v{r}}{\'{i}} M{\'{i}}rovsk{\'{y}} and Eva Haji{\v{c}}ov{\'{a}} and Anna Nedoluzhko and Radek Ocel{\'{a}}k and Ji{\v{r}}{\'{i}} Pergler and Lucie Pol{\'{a}}kov{\'{a}} and Jana Zde{\v{n}}kov{\'{a}} and Veronika Scheller and {\v{S}}{\'{a}}rka Zik{\'{a}}nov{\'{a}}},
  year = {2016},
  publisher = {{\'{U}}{FAL} {MFF} {UK}},
  address = {Prague, Czech Republic},
  note = {Lindat/Clarin},
  url = {http://hdl.handle.net/11234/1-1905},
}

Notes: 

With regard to the annotation of discourse relations, PDiT 2.0 (Prague Discourse Treebank 2.0) is an update of PDT 3.0 (Prague Dependency Treebank 3.0), which in turn was an update of PDiT 1.0 (Prague Discourse Treebank 1.0). The main addition in comparison with PDT 3.0 is the annotation of secondary connectives (in English, e.g., "for this reason", "due to this", "under these conditions").
We realize that the alternating titles (Prague Discourse Treebank vs. Prague Dependency Treebank, i.e. PDiT vs. PDT) are confusing. Unfortunately, we were not allowed to publish the new discourse annotation as a new version of PDT, hence the name PDiT 2.0.

Further info about the discourse relations: 
information about arguments of each relation is available
senses/semantic labels are annotated for the relations
Other annotation layers: 
sentence morphosyntax, parse structure
anaphora (coreference, bridging)
information structure
tectogrammatics (deep syntax)