Polish Sejm Corpus

The Polish Sejm Corpus contains annotated utterances of members of the Polish Sejm from terms of office 1-6 (1991-2011). The corpus files contain text segmentation (paragraphs, sentences, tokens), a disambiguated morphosyntactic description (lemma, POS tag, MSD tag), a syntactic description (syntactic words and groups) and named entities (person names, locations, organizations).
The data is a valuable source of linguistic information: it is a large (100 M segments) collection of quasi-spoken content and forms the basis for audio/video recordings of sessions, which started in 2011 and are planned to be appended to the corpus successively.

Short name: PSC
URL: http://clip.ipipan.waw.pl/PSC
META-SHARE ID: NOT_DEFINED_FOR_V2
Identifier: 401

Availability: available-unrestrictedUse
Licence: BSD-style
Restrictions of use: attribution
Distribution medium: downloadable
Download location: http://clip.ipipan.waw.pl/PSC
Fee: free of charge

Institute of Computer Science, Polish Academy of Sciences (IPI PAN)
Linguistic Engineering Group
ipi@ipipan.waw.pl
http://www.ipipan.eu
Jana Kazimierza 5
01-248 Warsaw
Tel.: +48 22 38 00 500
Fax: +48 22 38 00 510

Maciej Ogrodniczuk
maciej.ogrodniczuk@ipipan.waw.pl
http://zil.ipipan.waw.pl/MaciejOgrodniczuk
Jana Kazimierza 5
01-248 Warsaw
Tel.: +48 22 38 00 563
Fax: +48 22 38 00 510
Position: Assistant Professor

Institute of Computer Science, Polish Academy of Sciences (IPI PAN)
Linguistic Engineering Group
ipi@ipipan.waw.pl
http://www.ipipan.eu
Jana Kazimierza 5
01-248 Warsaw
Tel.: +48 22 38 00 500
Fax: +48 22 38 00 510

Metadata creation date: 2011-10-17

Maciej Ogrodniczuk
maciej.ogrodniczuk@ipipan.waw.pl
http://zil.ipipan.waw.pl/MaciejOgrodniczuk
Jana Kazimierza 5
01-248 Warsaw
Tel.: +48 22 38 00 563
Fax: +48 22 38 00 510
Position: Assistant Professor

Institute of Computer Science, Polish Academy of Sciences (IPI PAN)
Linguistic Engineering Group
ipi@ipipan.waw.pl
http://www.ipipan.eu
Jana Kazimierza 5
01-248 Warsaw
Tel.: +48 22 38 00 500
Fax: +48 22 38 00 510

Metadata language: en
Metadata last updated: 2011-11-26
Version: 1.0
Version last updated: 2011-11-12

Funding project: Central and South-East European Resources (CESAR)
Project URL: http://www.cesar-project.net
Funding types: euFunds, nationalFunds, ownFunds
Funders: European Commission (50%); Polish Ministry of Science and Higher Education (40%); Institute of Computer Science, Polish Academy of Sciences (10%)
Funding country: Poland
Project dates: 2011-02-01 to 2013-01-31

Resource type: corpus
Media type: text
Linguality: monolingual
Language: PL (Polish)
Size: 2.6 gb
Size: 114000000 tokens

Annotation type: other
Stand-off: false
Segmentation level: paragraph
Format: text/xml
Standard: TEI
Mode: automatic
Details: transcripts of one session day represented as a set of files:
  header.xml: header information about the individual session days
  text_structure.xml: session structure and individual utterances
  ann_segmentation.xml.gz: compressed sentence-level and token-level segmentation
  ann_morphosyntax.xml.gz: disambiguated morphosyntactic description (lemma, POS tag and MSD tag)
  ann_words.xml.gz: syntactic words
  ann_groups.xml.gz: syntactic groups
  ann_named.xml.gz: named entities
Tool: scripts developed internally
Dates: 2011-03-01 to 2011-11-26

Annotation type: other
Stand-off: false
Segmentation level: paragraph
Format: text/xml
Standard: TEI
Mode: automatic
Details: representation of the session structure (taken over from the transcripts) in <div> elements
Tool: scripts developed internally
Dates: 2011-03-01 to 2011-11-26

Annotation type: segmentation
Stand-off: false
Segmentation level: utterance
Format: text/xml
Standard: TEI
Mode: automatic
Details: segmentation into utterances taken over from the transcripts; each utterance is marked with the speaker identifier (resolved in the transcript header)
Tool: scripts developed internally
Dates: 2011-03-01 to 2011-11-26

Annotation type: segmentation
Stand-off: true
Segmentation level: sentence
Format: text/xml
Standard: TEI
Mode: automatic
Details: individual utterances split into sentences
Tool: Pantera
Dates: 2011-03-01 to 2011-11-26

Annotation type: segmentation
Stand-off: true
Segmentation level: word
Format: text/xml
Tagset: NKJP tagset
Standard: TEI
Mode: automatic
Details: individual sentences split into tokens (word-like segments; see the documentation of Morfeusz SGJP for details)
Tool: Morfeusz SGJP
Dates: 2011-03-01 to 2011-11-26

Annotation type: lemmatization
Stand-off: true
Segmentation level: word
Format: text/xml
Standard: TEI
Mode: automatic
Details: lemma variants (all available interpretations) output by Morfeusz, then disambiguated by the Pantera tagger
Tool: Morfeusz SGJP
Dates: 2011-03-01 to 2011-11-26

Annotation type: morphosyntacticAnnotation-bPosTagging
Stand-off: true
Segmentation level: word
Format: text/xml
Tagset: NKJP tagset
Standard: TEI
Mode: automatic
Details: MSD tag variants (all available morphosyntactic interpretations) output by Morfeusz, then disambiguated by the Pantera tagger
Tool: Morfeusz SGJP
Dates: 2011-03-01 to 2011-11-26

Annotation type: morphosyntacticAnnotation-posTagging
Stand-off: true
Segmentation level: word
Format: text/xml
Tagset: NKJP tagset
Standard: TEI
Mode: automatic
Details: POS tag (CTAG) variants (all available interpretations) output by Morfeusz, then disambiguated by the Pantera tagger
Tool: Pantera
Dates: 2011-03-01 to 2011-11-26

Annotation type: structuralAnnotation
Stand-off: true
Segmentation level: word
Format: text/xml
Standard: TEI
Mode: automatic
Details: syntactic words (word-like compounds) detected by Spejd with the NKJP shallow parsing grammar; see the NKJP documentation for details
Tool: Spejd
Dates: 2011-03-01 to 2011-11-26

Annotation type: syntacticAnnotation-shallowParsing
Stand-off: true
Segmentation level: wordGroup
Format: text/xml
Standard: TEI
Mode: automatic
Details: syntactic groups (phrase-like constructs) detected by Spejd with the NKJP shallow parsing grammar; see the NKJP documentation for details
Tool: Spejd
Dates: 2011-03-01 to 2011-11-26

Annotation type: semanticAnnotation-namedEntities
Stand-off: true
Segmentation level: wordGroup
Format: text/xml
Standard: TEI
Mode: automatic
Details: named entities (person names, organizations, locations compatible with the NKJP hierarchy) detected by Nerf
Tool: Nerf
Dates: 2011-03-01 to 2011-11-26

Text classification: quasi-spoken
Register: formal
Original source: Sprawozdanie Stenograficzne. Kancelaria Sejmu Rzeczypospolitej Polskiej, ul. Wiejska 4/6/8, 00-902, Warszawa, Poland. Wydawnictwo Sejmowe, 1991-2011. ISSN 0867-2768. http://www.sejm.gov.pl
Creation mode: automatic
Creation details: Texts from terms 1-4 were converted from HTML files; texts from terms 5-6 were converted from XML files delivered by the Sejm. An audio and video sample from day 3 of sitting 89, term 6, was added as an example of multimodal content.

Tools:
Morfeusz SGJP, a tokenizer, morphological analyzer and lemmatizer for Polish
Pantera, a Brill tagger for Polish
Spejd, a shallow parser of Polish
Nerf, a named entity recognizer for Polish
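
The record describes each session day as a set of files, with the annotation layers stored as gzip-compressed XML (ann_segmentation.xml.gz, ann_morphosyntax.xml.gz, ann_words.xml.gz, ann_groups.xml.gz, ann_named.xml.gz). As a minimal sketch of how such a directory might be inspected, the following Python counts the XML elements in each layer that is present. It uses only the standard library and only the file names given in the record. It makes no assumption about which TEI elements the files contain, and the function names and the directory path are hypothetical, chosen for illustration.

```python
import gzip
import xml.etree.ElementTree as ET
from collections import Counter
from pathlib import Path

# Annotation layers named in the record; all are gzip-compressed XML.
ANNOTATION_LAYERS = [
    "ann_segmentation.xml.gz",
    "ann_morphosyntax.xml.gz",
    "ann_words.xml.gz",
    "ann_groups.xml.gz",
    "ann_named.xml.gz",
]


def local_name(tag):
    """Drop a '{namespace}' prefix from an ElementTree tag, if there is one."""
    return tag.rsplit("}", 1)[-1]


def count_elements(path):
    """Stream one gzip-compressed XML file and count elements by local name."""
    counts = Counter()
    with gzip.open(path, "rb") as stream:
        for _event, element in ET.iterparse(stream, events=("end",)):
            counts[local_name(element.tag)] += 1
            element.clear()  # free the element's content once it is counted
    return counts


def summarize_session_day(day_dir):
    """Return {file name: element counts} for the layers found in day_dir."""
    day_dir = Path(day_dir)
    return {
        name: count_elements(day_dir / name)
        for name in ANNOTATION_LAYERS
        if (day_dir / name).exists()
    }
```

Calling `summarize_session_day("path/to/session-day")` (a placeholder path, not one taken from the corpus) returns one counter per layer. Streaming with `iterparse` is a design choice for this sketch: it avoids loading a whole annotation file into memory at once.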