Polish Sejm Corpus The Polish Sejm Corpus contains annotated utterances of members of the Polish Sejm from terms of office 1-6 (years 1991-2011). Corpus files contain information about text segmentation (paragraphs, sentences, tokens), disambiguated morphosyntactic description (lemma, POS tag, MSD tag), syntactic description (syntactic words and groups) and named entities (person names, locations, organizations). The data is a valuable source of linguistic information: it is a large (100 M segments) collection of quasi-spoken content and forms the basis for the audio/video recordings of sessions, which began in 2011 and are planned to be added to the corpus over time. PSC http://clip.ipipan.waw.pl/PSC NOT_DEFINED_FOR_V2 401 available-unrestrictedUse BSD-style attribution downloadable http://clip.ipipan.waw.pl/PSC free of charge Institute of Computer Science, Polish Academy of Sciences IPI PAN Linguistic Engineering Group ipi@ipipan.waw.pl http://www.ipipan.eu
Jana Kazimierza 5
01-248 Warsaw +48 22 38 00 500 +48 22 38 00 510
Ogrodniczuk Maciej maciej.ogrodniczuk@ipipan.waw.pl http://zil.ipipan.waw.pl/MaciejOgrodniczuk
Jana Kazimierza 5
01-248 Warsaw +48 22 38 00 563 +48 22 38 00 510
Assistant Professor Institute of Computer Science, Polish Academy of Sciences IPI PAN Linguistic Engineering Group ipi@ipipan.waw.pl http://www.ipipan.eu
Jana Kazimierza 5
01-248 Warsaw +48 22 38 00 500 +48 22 38 00 510
2011-10-17 Ogrodniczuk Maciej maciej.ogrodniczuk@ipipan.waw.pl http://zil.ipipan.waw.pl/MaciejOgrodniczuk
CESAR en 2011-11-26
1.0 2011-11-12 Central and South-East European Resources CESAR http://www.cesar-project.net euFunds nationalFunds ownFunds European Commission (50%) Polish Ministry of Science and Higher Education (40%) Institute of Computer Science, Polish Academy of Sciences (10%) Poland 2011-02-01 2013-01-31 corpus text monolingual PL Polish 2.6 gb 114000000 tokens other false paragraph text/xml TEI automatic transcripts of one session day represented as a set of files: header.xml – header information about the individual session days, text_structure.xml – session structure and individual utterances, ann_segmentation.xml.gz – compressed sentence-level and token-level segmentation, ann_morphosyntax.xml.gz – disambiguated morphosyntactic description (lemma, POS tag and MSD tag), ann_words.xml.gz – syntactic words, ann_groups.xml.gz – syntactic groups, ann_named.xml.gz – named entities scripts developed internally 2011-03-01 2011-11-26 other false paragraph text/xml TEI automatic representation of the session structure (taken over from the transcripts) in <div> elements scripts developed internally 2011-03-01 2011-11-26 segmentation false utterance text/xml TEI automatic segmentation into utterances taken over from the transcripts; each utterance is marked with the speaker identifier (resolved in the transcript header) scripts developed internally 2011-03-01 2011-11-26 segmentation true sentence text/xml TEI automatic individual utterances split into sentences Pantera 2011-03-01 2011-11-26 segmentation true word text/xml NKJP tagset TEI automatic individual sentences split into tokens (word-like segments – see documentation of Morfeusz SGJP for details) Morfeusz SGJP 2011-03-01 2011-11-26 lemmatization true word text/xml TEI automatic lemma variants (all available interpretations) output by Morfeusz, then disambiguated by Pantera tagger Morfeusz SGJP 2011-03-01 2011-11-26 morphosyntacticAnnotation-bPosTagging true word text/xml NKJP tagset TEI automatic MSD tag variants (all 
available morphosyntactic interpretations) output by Morfeusz, then disambiguated by Pantera tagger Morfeusz SGJP 2011-03-01 2011-11-26 morphosyntacticAnnotation-posTagging true word text/xml NKJP tagset TEI automatic POS tag (CTAG) variants (all available interpretations) output by Morfeusz, then disambiguated by Pantera tagger Pantera 2011-03-01 2011-11-26 structuralAnnotation true word text/xml TEI automatic syntactic words (word-like compounds) detected by Spejd with NKJP shallow parsing grammar; see NKJP documentation for details Spejd 2011-03-01 2011-11-26 syntacticAnnotation-shallowParsing true wordGroup text/xml TEI automatic syntactic groups (phrase-like constructs) detected by Spejd with NKJP shallow parsing grammar; see NKJP documentation for details Spejd 2011-03-01 2011-11-26 semanticAnnotation-namedEntities true wordGroup text/xml TEI automatic named entities (person names, organizations, locations compatible with NKJP hierarchy) detected by Nerf Nerf 2011-03-01 2011-11-26 quasi-spoken formal Sprawozdanie Stenograficzne. Kancelaria Sejmu Rzeczypospolitej Polskiej, ul. Wiejska 4/6/8, 00-902, Warszawa, Poland. Wydawnictwo Sejmowe, 1991-2011. ISSN 08672768. http://www.sejm.gov.pl automatic Texts from terms 1-4 converted from HTML files, terms 5-6 converted from XML files delivered by Sejm. Audio and video sample from day 3 of sitting 89, term 6 added as an example of multimodal content. Morfeusz SGJP, a tokenizer, morphological analyzer and lemmatizer for Polish; Pantera, a Brill tagger for Polish; Spejd, a shallow parser of Polish; Nerf, a named entity recognizer for Polish
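The record above lists several annotation layers stored as gzip-compressed TEI XML files (for example ann_segmentation.xml.gz). The following is a minimal sketch of how such a file might be read in a streaming fashion with the Python standard library. It rests on assumptions not confirmed by this record: that elements live in the standard TEI namespace and that token-level segments are encoded as `seg` elements. The function names `count_elements` and `make_sample` are hypothetical, and the sample document is synthetic, not taken from the corpus.

```python
import gzip
import io
import xml.etree.ElementTree as ET

# Standard TEI P5 namespace; assumed (not confirmed by the record) to be
# the namespace used in the corpus annotation files.
TEI_NS = "http://www.tei-c.org/ns/1.0"


def count_elements(gz_source, local_name="seg"):
    """Count elements with the given local name in a gzip-compressed TEI file.

    gz_source may be a path or a binary file object. The element name
    "seg" is an assumption about how token segments are encoded.
    """
    tag = f"{{{TEI_NS}}}{local_name}"
    count = 0
    with gzip.open(gz_source, "rb") as xml_stream:
        # iterparse streams the document, so large files need not fit in memory.
        for _event, elem in ET.iterparse(xml_stream, events=("end",)):
            if elem.tag == tag:
                count += 1
                elem.clear()  # drop the element's children and text once counted
    return count


def make_sample():
    """Build a tiny synthetic gzip-compressed TEI document for demonstration."""
    xml = (
        f'<teiCorpus xmlns="{TEI_NS}"><TEI><text><body>'
        "<p><s><seg/><seg/><seg/></s></p>"
        "</body></text></TEI></teiCorpus>"
    ).encode("utf-8")
    return io.BytesIO(gzip.compress(xml))


sample_count = count_elements(make_sample())
print(sample_count)
```

With a real download, the call would take a path such as `count_elements("ann_segmentation.xml.gz")`; the actual element names should be checked against the corpus documentation before relying on the counts.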