Comparative characteristics of Polish and Russian language corpora

The aim of this paper is to present and compare the most representative language corpora of Polish and Russian according to selected criteria determining both their potential as a source of linguistic data for various types of linguistic analyses and their availability for researchers. Moreover, the...

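Bibliographic details
Date: 2006
Author: Grabowski, Ł.
Format: Article
Language: English
Published: Інститут української мови НАН України (Institute of the Ukrainian Language, National Academy of Sciences of Ukraine), 2006
Journal: Лексикографічний бюлетень (Lexicographic Bulletin)
Topics: Corpus linguistics
Online access: https://nasplib.isofts.kiev.ua/handle/123456789/72849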
Repository: Digital Library of Periodicals of National Academy of Sciences of Ukraine
Cite as: Grabowski, Ł. Comparative characteristics of Polish and Russian language corpora. Лексикографічний бюлетень: collection of scholarly papers. Kyiv: Інститут української мови НАН України, 2006. Issue 13, pp. 29-33. Bibliography: 12 titles. In English.

Лексикографічний бюлетень 29

Ł. Grabowski*
Institute of East-Slavonic Studies
University of Opole (Opole, Poland)

УДК 81'322

COMPARATIVE CHARACTERISTICS OF POLISH AND RUSSIAN LANGUAGE CORPORA

The aim of this paper is to present and compare the most representative language corpora of Polish and Russian according to selected criteria that determine both their potential as a source of linguistic data for various types of linguistic analysis and their availability to researchers. Moreover, the paper indicates areas for improvement as regards the possibilities offered by the corpora and access to them.

1. Introduction

Corpus linguistics, which is both a branch of linguistics and a research methodology, concentrates on the study of texts with the use of dedicated computer software [4:143]. Any collection of texts stored in electronic form is called a corpus. The linguistic data stored in a corpus are running words, which are classified in terms of types and tokens [2:16]. A token is every running word-form (segment) the corpus is composed of, whereas a type is a group of identical tokens (e.g. the sentence 'Моя мама живёт в городе, а моя тётя живёт в деревне' consists of 8 types and 11 tokens).

Since language corpora, unlike single coherent texts, are complex and multidimensional objects, the answer to the question of how similar or different they are will itself be complex and multidimensional. A study comparing corpora that represent two different languages is more challenging still. Corpora as objects of comparison will always be similar in some respects and different in others. For the comparison to yield objective observations, one has to adopt a framework of criteria for such a study.
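The type/token distinction introduced above can be checked mechanically. The following is a minimal sketch, not a method used by any of the corpora discussed: it assumes that tokens are maximal runs of word characters and that word-forms are compared case-insensitively, and the function name `count_types_and_tokens` is purely illustrative.

```python
import re


def count_types_and_tokens(text: str) -> tuple[int, int]:
    """Return (number of types, number of tokens) for a text.

    Assumptions of this sketch: a token is a maximal run of word
    characters, and two tokens belong to the same type when they are
    identical after lower-casing.
    """
    tokens = re.findall(r"\w+", text.lower())
    return len(set(tokens)), len(tokens)


sentence = "Моя мама живёт в городе, а моя тётя живёт в деревне"
types, tokens = count_types_and_tokens(sentence)
print(types, tokens)  # 8 11
```

Under these assumptions the example sentence yields the 8 types and 11 tokens given in the text; a different tokenisation or a case-sensitive comparison would give different figures.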
Although multiple quantitative methods based on word frequencies and n-gram frequencies have been proposed [1], a purely statistical approach to comparing corpora of different languages falls beyond the scope of this paper. Its aim is instead to present the possibilities offered by the most representative commercial corpora of Polish and Russian with respect to the following criteria:
a) the overall number of tokens collected in the corpora;
b) availability of meta-linguistic annotation and lemmatisation of the corpus data;
c) types of queries offered by the corpora;
d) availability of the corpora and access to them.

* © Ł. Grabowski, 2006

The Polish corpora are represented by Korpus Języka Polskiego Wydawnictwa Naukowego PWN (henceforth PWN Corpus), Korpus Instytutu Podstaw Informatyki Polskiej Akademii Nauk (henceforth IPI PAN Corpus) and the PELCRA Corpus. The Russian ones are the Russian National Corpus (henceforth the RNC), the Comparable Corpus of English and Russian News Texts (henceforth CCERNT Corpus) and the Computational Corpus of Russian Newspaper Texts at the End of the Twentieth Century (henceforth CCRNTE20 Corpus). The selection is not accidental, since these resources are the most representative corpora compiled for Polish and Russian, respectively.

2. Characteristics of Selected Polish Corpora

The PWN Corpus is a synchronic and balanced corpus compiled by the commercial publisher Wydawnictwo Naukowe PWN S. A., one of the largest publishing houses on the Polish market. The corpus is available on CD-ROM, on the Internet [7] and at the compilers' headquarters. The mode of access determines which corpus sample is available to researchers. The full corpus available at the PWN S. A. headquarters consists of 100 million tokens, of which 70 million make up the balanced corpus, comprising modern Polish literature (41 %), press articles (45.5 %), and dialogues, leaflets, manuals and Internet websites (13.5 %); the remaining part includes Polish literature and press archives.

The opportunities provided to the Internet user are limited, however. When accessing the corpus on-line, one can choose either the web sample of the corpus (wersja sieciowa) or the demonstration version (wersja demonstracyjna). The difference between the two is crucial: the web sample provides access to a corpus of 40 million tokens (of which 22 million constitute the balanced corpus), whereas the demonstration version offers a sample of 7.5 million tokens, of which only 3.5 million form a balanced collection. Both samples are available through a concordancer on the above website, but access to the web sample is paid, whereas the demonstration version is free of charge. Moreover, queries executed in the demonstration version do not allow the user to adjust the width of the left- and right-hand context of the keyword.

As for annotation, the corpus data contain only meta-situational and meta-textual information, in a format that conforms to the Text Encoding Initiative (TEI) guidelines. Although such annotation rules out part-of-speech queries, an undeniable advantage of the concordancer is its ability to display all inflected forms of the keyword. Finally, the corpus sample available on CD-ROM, which was distributed free of charge to all Institutes of Polish Studies at Polish universities, is the same as the demonstration version available on the Internet.
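The role of the left- and right-hand context width mentioned above can be illustrated with a small keyword-in-context (KWIC) routine. This is only a sketch of the general concordancing technique, not the PWN concordancer: the function `kwic`, its parameters and the sample sentence are all hypothetical.

```python
def kwic(tokens: list[str], keyword: str, left: int = 3, right: int = 3) -> list[str]:
    """Return keyword-in-context lines with an adjustable context width.

    `left` and `right` give the number of tokens shown on each side of
    every occurrence of `keyword` (matched case-insensitively).
    """
    lines = []
    for i, token in enumerate(tokens):
        if token.lower() == keyword.lower():
            left_ctx = " ".join(tokens[max(0, i - left):i])
            right_ctx = " ".join(tokens[i + 1:i + 1 + right])
            lines.append(f"{left_ctx} [{token}] {right_ctx}".strip())
    return lines


text = "the corpus is large and the corpus is balanced".split()
for line in kwic(text, "corpus", left=1, right=2):
    print(line)
# the [corpus] is large
# the [corpus] is balanced
```

A concordancer that fixes `left` and `right` in advance, as the demonstration version is described as doing, shows the same hits but gives the researcher no control over how much surrounding text is displayed.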
The next corpus covered by this study is the IPI PAN Corpus, which was developed at the Institute of Computer Science of the Polish Academy of Sciences. The full version of the corpus comprises almost 300 million segments. It is accessed through the Poliqarp concordancer, which serves as an integral tool for browsing the corpus on-line and can also be downloaded as a compressed tar file. Of the subcorpora available free of charge from the http://korpus.pl website, three require further presentation. The source version (próbka źródłowa) of the IPI PAN Corpus comprises 100 million tokens, which yield 286,000 types. The preliminary sample (próbka wstępna) comprises 70 million segments, which yield 364,000 types. The sample available on-line is composed of 15 million tokens, which yield 217,000 types; it is an opportunistic corpus, 90 % of which falls into the category of modern Polish literature and socio-political journalism. All the above samples can be downloaded from the corpus website as tar archive files, which can be decompressed with the free 7-Zip application. As far as annotation is concerned, the corpus data are enriched with morphosyntactic tags in a slightly modified version of the XML Corpus Encoding Standard (XCES). The meta-textual and meta-situational information is still incomplete and subject to major improvements.
The Poliqarp search engine and concordancer (available in three versions: on-line, graphical and GNU/Linux, the last being the most advanced) is equipped with a user-friendly interface. It enables researchers to run many types of queries, which may contain standard regular expressions: the base-form query, the grammatical class query (based on grammatical tags specifying the part of speech), the grammatical category query, queries constrained to matches within sentences or paragraphs, and queries constrained by meta-situational or meta-textual information [5; 6].

The last Polish corpus described here is the PELCRA Corpus, which is being developed at the Institute of English at the University of Lodz in co-operation with the University of Lancaster. The Reference Corpus of Polish, the major subcorpus within the PELCRA project, comprises 93,129,588 tokens. The methodology used in compiling this corpus was very similar to the one adopted for the British National Corpus (BNC). It is a synchronic corpus of modern written and spoken Polish; the spoken part comprises only 600,000 tokens, which is the only major difference from the BNC. The target figure for the spoken subcorpus, however, is 1 million tokens. As for annotation, the source form of the corpus is binary XML-annotated corpus text and its target form is a collection of data stored in a MySQL relational database. As a result, the meta-textual, meta-situational and meta-linguistic annotation of the corpus is hierarchical, and the corpus data are stored in the tables and columns of the MySQL database [3: 107]. This solution makes possible extensive use of SQL (Structured Query Language), whose benefit is the opportunity to carry out many kinds of queries using the search tool available on the corpus website.
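How relational storage supports queries constrained by annotation can be sketched as follows. This is an illustration only: the table and column names are hypothetical and do not reproduce the PELCRA schema, and Python's built-in sqlite3 module stands in for MySQL so that the example is self-contained.

```python
import sqlite3

# In-memory database standing in for the MySQL server (an assumption of
# this sketch); the schema below is invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE texts (text_id INTEGER PRIMARY KEY, genre TEXT);
    CREATE TABLE tokens (
        token_id INTEGER PRIMARY KEY,
        text_id INTEGER REFERENCES texts(text_id),
        form TEXT, lemma TEXT, pos TEXT
    );
    INSERT INTO texts VALUES (1, 'press'), (2, 'fiction');
    INSERT INTO tokens (text_id, form, lemma, pos) VALUES
        (1, 'domu', 'dom', 'noun'),
        (1, 'czyta', 'czytać', 'verb'),
        (2, 'domem', 'dom', 'noun'),
        (2, 'domy', 'dom', 'noun');
""")


def forms_of(lemma: str, genre: str) -> list[str]:
    """Base-form query constrained by a meta-textual property (genre)."""
    rows = conn.execute(
        "SELECT t.form FROM tokens AS t "
        "JOIN texts AS x ON x.text_id = t.text_id "
        "WHERE t.lemma = ? AND x.genre = ? ORDER BY t.token_id",
        (lemma, genre),
    ).fetchall()
    return [form for (form,) in rows]


print(forms_of("dom", "fiction"))  # ['domem', 'domy']
```

Because the linguistic and textual annotation live in separate but linked tables, a single join is enough to combine a lemma constraint with a constraint on the text a token comes from.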
The most important types of queries, which may again contain standard regular expressions, are the following: the base-form query, the inflection query, the phrase query, the collocation query and the MI3 collocation query. The queries may be constrained by meta-situational and meta-textual information, which enables researchers to specify the features of the corpus samples to be searched. Moreover, the search tool offers a wide variety of statistical analyses concerning, in particular, the frequencies of words and collocations. However, for the ordinary user the corpus is limited to executing a small number of queries, and full access to the corpus data is paid. It should be emphasized that, since the Reference Corpus of Polish described here is available on the World Wide Web, its interface is user-friendly and easy to operate.

3. Characteristics of Selected Russian Corpora

The description of commercially available corpora of Russian starts with the Russian National Corpus (Национальный корпус русского языка), which has been compiled at the V. Vinogradov Institute of the Russian Language of the Russian Academy of Sciences in Moscow. This corpus, which is available through the website http://ruscorpora.ru, includes 120 million tokens (as of February 7, 2006), which make up a representative collection of Russian texts in electronic form with meta-situational, meta-textual and meta-linguistic annotation. Particular emphasis should be put on the meta-linguistic annotation, which is very detailed and comprehensive in that it comprises grammatical and semantic tagging. The former covers the part of speech of the tokens, their case, gender, degree of comparison, number, tense, aspect etc., whereas the semantic annotation is even more extensive and covers semantic categories, taxonomy and axiology.
As a result, the concordance search tool available on the website allows one to execute two types of queries, which may contain standard regular expressions: the 'exact word-forms query' (Поиск точных форм) and the 'lexico-grammatical query' (Лексико-грамматический поиск). The latter enables researchers to constrain queries by grammatical properties (specific morphological properties of the keyword) and by semantic properties of the keyword. Moreover, the search tool offers an additional option which enables one to constrain the search to a specific genre or to the linguistic environment in which a given text functions.

The corpus has been equipped with the options of 'reduced homonymy' (снятие омонимии) and 'non-reduced homonymy', which allow one to display either disambiguated or non-disambiguated results of the query. The former assigns each word-form to a lemma of one specific part of speech, so that inflectional forms are attributed to the correct lemma. The latter admits inaccuracy in the case of multifunctional words, such as the Russian lemma печь, which may function either as a verb or as a noun. Moreover, the sample of the corpus with reduced homonymy allows concordances to be displayed with marked accents, which is essential for many language teachers browsing the corpus for linguistic material to be used in their classes.

Apart from that, one has access to the subcorpus of spoken Russian and to the parallel Russian-English and English-Russian subcorpora, invaluable collections of texts for translators and lexicographers, comprising aligned text units (sentences and paragraphs). These separate corpora with non-reduced homonymy are equipped with a concordance search tool which allows researchers to execute lexico-grammatical searches in them.
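The effect of homonymy reduction on a lexico-grammatical query can be shown with a toy annotated sample. The sketch below is not the RNC search engine: the `Token` structure, the tag labels "V" and "S" and the four sample tokens are illustrative assumptions.

```python
from typing import NamedTuple, Optional


class Token(NamedTuple):
    form: str
    lemma: str
    pos: str  # "V" = verb, "S" = noun (tag labels chosen for this sketch)


# A tiny hand-annotated sample; the two lemmas печь (verb 'to bake',
# noun 'stove') share a spelling but differ in part of speech.
corpus = [
    Token("печь", "печь", "V"),
    Token("печи", "печь", "S"),
    Token("пеку", "печь", "V"),
    Token("мама", "мама", "S"),
]


def lemma_query(tokens: list[Token], lemma: str, pos: Optional[str] = None) -> list[str]:
    """Return word-forms of `lemma`, optionally restricted to one part of speech."""
    return [t.form for t in tokens if t.lemma == lemma and (pos is None or t.pos == pos)]


print(lemma_query(corpus, "печь"))       # ['печь', 'печи', 'пеку']
print(lemma_query(corpus, "печь", "V"))  # ['печь', 'пеку']
```

Without the part-of-speech constraint the query mixes forms of the verb and the noun, which corresponds to the inaccuracy described for the non-reduced option; adding the constraint separates them.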
The Comparable Corpus of English and Russian News Texts was compiled at the University of Leeds under the supervision of Sergei Sharoff. It consists of multiple subcorpora which are available on the website. The Russian part consists of a morphologically tagged and lemmatised collection of articles from the Izvestia daily newspaper (published in 2000-2001), which add up to 14 million tokens; the grammatical properties tagged include part of speech, case, gender, animacy, number, person, degree of comparison, aspect and tense. The search tools allow one to constrain the search by grammatical properties of the word-forms and to display rich statistical data on the frequency and distribution of the keyword. The interface also enables researchers to execute searches in the Russian National Corpus (in addition to the Izvestia Corpus) and in a small corpus of modern Russian fiction (500,000 tokens), which was also compiled in Leeds. The search results are displayed as concordances (whose width can be extended); double-clicking on a selected concordance line displays its meta-textual and meta-situational information. The corpus is available free of charge to users who log in with an access code.

The last Russian corpus described here is the Computational Corpus of Russian Newspaper Texts at the End of the Twentieth Century (Компьютерный корпус текстов русских газет конца ХХ-ого века). This synchronic and opportunistic corpus, which can be accessed through the http://www.philol.msu.ru/~lex/corpus website, is available either on-line (205,000 tokens in 446 press articles) or at the Institute of General and Computational Lexicography and Lexicology at Moscow State University, where it was compiled. The full version of the corpus comprises 11,401,479 tokens in 23,110 press articles.
Although the size of the corpus is limited, it contains meta-linguistic (morphosyntactic), meta-situational and meta-textual annotation. This collection of texts is also lemmatised and comprises samples of full issues of thirteen Russian newspapers dating from 1994-1997. The search tool offers users two types of queries, which may be constrained by selected properties of the annotation: the 'exact word-form query' (Буквальное совпадение) and the 'indirect query' (Подстрока). The latter allows researchers to search for a sequence of characters which may either constitute the full keyword or be part of it. The results of the queries are displayed as either concordances or lists of word-forms.

4. Conclusions

Having presented the general characteristics of the Polish and Russian corpora according to the four criteria laid down above, it is possible to arrive at the following observations. As far as the overall number of tokens is concerned, the three Polish corpora oscillate around the figure of 100 million, which may be the result of treating the British National Corpus as a reference point (the BNC has 100,106,008 tokens). Although the IPI PAN Corpus stands out in that it includes 300 million tokens, what is actually available starts with a source sample of 100 million tokens. Moreover, the sheer numbers are not conclusive, since none of the Polish corpora has been fully completed so far. What is crucial, however, is access for researchers working outside Poland and the question of which corpus to use if one wants to avoid charges and to work on-line. In this respect the IPI PAN Corpus comes to the fore, as its sample available on-line includes 15 million tokens. Although the PELCRA Corpus offers over 93 million tokens, the search is in practice limited to one query at a time. Another advantage of the IPI PAN Corpus is that one can download all three samples (the source, preliminary and on-line versions) together with the Poliqarp search tool designed specifically to access the corpus. As a result, after decompressing the samples stored in tar files, one may execute searches directly on the hard drive.

As for the Russian corpora, the biggest one is obviously the RNC, which comprises 120 million tokens available on-line and free of charge, which makes it a promising corpus in terms of both size and access. For quantitative research, the Comparable Corpus of English and Russian News Texts compiled at Leeds is a good reference point, as it provides extended statistical information about the keyword. The last Russian corpus, the Computational Corpus of Russian Newspaper Texts at the End of the Twentieth Century, has the advantage of being a specialist collection of texts, but its relatively small overall size (over 11 million tokens and only 200,500 tokens available on-line) and opportunistic nature make it suitable mainly for corpus-oriented researchers studying the language of the modern Russian press.

As for the meta-linguistic annotation, the IPI PAN Corpus and the PELCRA Corpus are the best annotated Polish corpora, because they are equipped with all three types of annotation (meta-situational, meta-textual and meta-linguistic), which extends the scope of executable queries. This is clearly visible in the PELCRA Corpus, whose user-friendly interface in Polish and English allows one to execute multiple types of queries (as referred to above), which shows that the designers and compilers of this corpus aptly took advantage of its rich annotation. Among the Russian corpora, the RNC is the most promising one, as its detailed meta-linguistic annotation includes morphological, semantic and axiological features, which widen the scope of executable queries.
As a result, one may constrain any lexico-grammatical query by grammatical and semantic properties, which allows narrow and precisely specified searches. The two remaining Russian corpora also use the three major types of annotation, but their meta-linguistic annotation is limited to morphological properties.

Finally, a look at access to the Polish and Russian corpora yields an interesting observation: nearly all of them rely on on-line access, which amounts to browsing the corpus by means of a search engine, much as one looks for information on the Internet using search engines such as Google or AltaVista. In other words, the idea of developing a dedicated corpus client as a separate software application to be installed on a local computer has not found much favour among the compilers of the analysed corpora. The exception is the IPI PAN Corpus of Polish, which allows one to download Poliqarp, a dedicated search engine and concordancer for accessing and browsing the corpus. This solution mirrors the one adopted by the compilers of the British National Corpus, which can be accessed either on-line or from the hard drive by means of the SARA client, a search engine and concordancer that offers, among other things, multiple query options (including building a single query from several query types), immediate printing of results and manipulation of the data and their format. The only corpus available on CD-ROM is the PWN Corpus of Polish. This is worth noting, since distributing corpora on CD-ROMs is both profitable for their compilers and convenient for armchair researchers, who gain fast and efficient access to electronic collections of linguistic data even without access to the Internet.
Summing up, the present comparison of selected Polish and Russian corpora shows that corpora vary and that even the use of specific criteria does not guarantee objective and comprehensive results. Nevertheless, having become familiar with the above collections of texts, researchers interested in extensive linguistic analyses of Polish and Russian using corpus methodology are undoubtedly in a better position to choose the corpus that meets their requirements and corresponds with the goals of their research.

References

1. Kilgarriff, A. (2001). "Comparing Corpora". Retrieved on 10 Jan. 2006 from: http://www.itri.brighton.ac.uk/~Adam.Kilgarriff/publications.html
2. Mason, O. (2000). Programming for Corpus Linguistics. Edinburgh: Edinburgh University Press.
3. Pęzik, P., Uzar, R., Levin, E. (2005). Zastosowania baz danych w językoznawstwie. In: B. Lewandowska-Tomaszczyk, Podstawy językoznawstwa korpusowego (p. 95-115). Łódź: Wydawnictwo Uniwersytetu Łódzkiego.
4. Piotrowski, T. (2003). Językoznawstwo korpusowe: wprowadzenie do problematyki. In: S. Gajda (Ed.), Językoznawstwo w Polsce. Stan i perspektywy (p. 143-154). Opole: Wydawnictwo Uniwersytetu Opolskiego.
5. Przepiórkowski, A. (2004). "The IPI PAN Corpus in Numbers". Retrieved on 11 Feb. 2006 from: http://nlp.ipipan.waw.pl/~adamp/Papers/2005-ltc-numbers/
6. Przepiórkowski, A. (2005). "The Potential of IPI PAN Corpus". Retrieved on 11 Feb. 2006 from: http://nlp.ipipan.waw.pl/~adamp/Papers/2005-psicl-numbers/
7. Korpus Języka Polskiego Wydawnictwa Naukowego PWN. Retrieved on 12 Feb. 2006 from: http://www.korpus.pwn.pl
8. Korpus Języka Polskiego IPI PAN. Retrieved on 12 Feb. 2006 from: http://korpus.pl
9. The PELCRA Reference Corpus of Polish. Retrieved on 12 Feb. 2006 from: http://korpus.ia.uni.lodz.pl
10. Национальный корпус русского языка. Retrieved on 14 Feb. 2006 from: http://ruscorpora.ru
11. Компьютерный корпус текстов русских газет конца ХХ-ого века. Retrieved on 14 Feb. 2006 from: http://www.philol.msu.ru/~lex/corpus
12. The Comparable Corpus of English and Russian News Texts. Retrieved on 14 Feb. 2006 from: http://corpus.leeds.ac.uk