Towards Easier Querying of XML-based Linguistic Corpora

The paper proves that any dead-end completion of a partial Boolean function of the class (n, 1, k) has a zero region of uncertainty. Conditions are identified under which the completion of a function of the class (n, 1, k) is unique. ...

Bibliographic Details
Published in: Таврический вестник информатики и математики
Date: 2009
Main Authors: Gladkova, G.P., Drozd, A.A.
Format: Article
Language: English
Published: Кримський науковий центр НАН України і МОН України, 2009
Online Access: https://nasplib.isofts.kiev.ua/handle/123456789/18232
Journal Title: Digital Library of Periodicals of National Academy of Sciences of Ukraine
Cite this: Towards Easier Querying of XML-based Linguistic Corpora / G.P. Gladkova, A.A. Drozd // Таврический вестник информатики и математики. 2009. No. 2. P. 71-77. Bibliogr.: 12 titles. In English.

Institution

Digital Library of Periodicals of National Academy of Sciences of Ukraine
_version_ 1860021194686726144
author Gladkova, G.P.
Drozd, A.A.
author_facet Gladkova, G.P.
Drozd, A.A.
citation_txt Towards Easier Querying of XML-based Linguistic Corpora / G.P. Gladkova, A.A. Drozd // Таврический вестник информатики и математики. 2009. No. 2. P. 71-77. Bibliogr.: 12 titles. In English.
collection DSpace DC
container_title Таврический вестник информатики и математики
description The paper proves that any dead-end completion of a partial Boolean function of the class (n, 1, k) has a zero region of uncertainty. Conditions are identified under which the completion of a function of the class (n, 1, k) is unique. The paper evaluates general-purpose XML querying tools with respect to linguistic corpora. A specialized pattern-based query language is suggested and implemented in the XCorp software.
first_indexed 2025-12-07T16:47:21Z
format Article
fulltext UDC 004.6

TOWARDS EASIER QUERYING OF XML-BASED LINGUISTIC CORPORA

Gladkova G.P., Drozd A.A.

Kiev National Taras Shevchenko University
Institute of Philology
Department of English Philology
01601 Taras Shevchenko Boulevard 14, Kiev, Ukraine
e-mail: anna.gld@gmail.com

Moscow State University (Sevastopol Branch)
Programming Department
99000 Geroev Sevastopolya 9, Sevastopol, Ukraine
e-mail: alexander.drozd@gmail.com

Abstract. The paper evaluates general-purpose XML querying tools with respect to linguistic corpora. A specialized pattern-based query language is suggested and implemented in the XCorp software.

Introduction

Corpus linguistics is one of the most actively developing trends in applied linguistics. Corpora are widely understood to be merely "large bodies of machine-readable text containing thousands or millions of words" [6, p.48], and many popular tools for corpus analysis, such as Laurence Anthony's AntConc [2], presuppose that the input consists of simple plain text files. But current tasks in phonology, semantics or syntax of a natural language require more complex annotation of linguistic data, not to mention issues in pragmatics and cognitive analysis of language. This leads to the problem of incorporating additional data into the text and of complex querying of this information.

Corpora may be stored in a variety of formats, including the so-called vertical format and SGML. While these formats may be more advantageous for certain kinds of tasks, the most flexible solution remains XML, as shown by the fact that many corpus projects have developed their own XML-based formats optimized for storage of task-specific information (well-known examples are generic TEI XML, TigerXML, etc.). Moreover, a great many utilities for tagging text on the levels of syntax and morphology can produce XML output.
Yet while the XML format itself is flexible and may be tailored to the needs of a particular study with some basic knowledge of applied linguistics, querying the resulting data poses a more serious problem. The aim of this paper is to analyse the data model used in corpus linguistics and the applicability of standard XML querying tools in this sphere, as well as to suggest a more convenient specialized querying tool.

The general issues of data retrieval from XML databases are discussed in a variety of sources [4,5,11,12], but one can judge the applicability of general XML queries to corpus linguistics by the fact that all major corpus projects generally develop their own tools for querying their data (e.g. Xaira for the British National Corpus [3]). As a result, there exist a number of solutions developed for specialized annotation sets (e.g. TigerSearch for TigerXML). The problem of general applicability of standard tools is aggravated by the fact that they are all developed for IT specialists and may be difficult for linguists to use.

1. General-Purpose Tools for Querying XML

XML is claimed to be the universal format for data representation, and a great many universal solutions have been developed for querying XML data. However, as is often the case, universal solutions may be less suitable for specific tasks. First of all, XML itself, as well as the default XML querying tools, was developed for use in a different information processing paradigm: its main purpose is storage and retrieval of structured, database-like data, where all elements of the same level are considered equal and no importance is given to their consecutive order (employee profiles or movie collections are the typical examples in XML tutorials). In fact such XML files are merely an alternative format to a database, and typical queries for such files closely resemble database requests (e.g. "find all the employees with salaries higher than 1000$").

However, linguistic corpora possess certain characteristics that make the standard XML querying tools less suitable for them. Elements of a text in a natural language are sequences of words that combine into phrases, sentences and paragraphs. The order of those elements is important for the researcher, and so is the distance between the elements. Sometimes linguists need to combine the annotation data with patterns present in the plain text data in their search requests. For example, a study of alliteration may involve searching for sequences of words starting with the same consonant at a particular distance. A study of word-formation may require searching for roots and words derived from them occurring within one sentence or one paragraph. Letter case may or may not be important for a given task. Some of these tasks may be solved by means of standard XML querying tools, but this might pose difficulties even for an IT expert, while for an average linguist they turn into an unsolvable problem.

The standard means of querying XML data is the XQuery language, developed, like XML itself, by the W3C. However, the current version of XQuery is poorly suited for use with XML-annotated text corpora: typical tasks involving search for sequences of elements in a given order are very difficult or impossible to solve. The necessity of augmenting XQuery with text querying functionality is acknowledged by the fact that the W3C itself started work on extending XQuery for better support of text searching (XQuery and XPath Full Text [12]). The suggested changes partly solve the problem of querying XML data as a text in a natural language. However, the complexity of XQuery for an expert in the humanities is only aggravated by further sophistication of the language.
Besides that, the aforementioned changes to XQuery are still in the draft stage, and it is hard to predict when a new release of XQuery will appear, not to mention the development of software tools to support the new search mechanisms.

Besides XQuery there exist a number of less well-known XML querying languages, but none of them meets both of the aforementioned requirements at once (simplicity and support for full-text search). For example, XML-QL [4] is simpler than XQuery, but it offers no support for regular expressions or for searching for elements that occur at a given distance.

There also exist specialized software tools developed for specific corpus projects. The most famous example is Xaira [3], the successor of the SGML-based SARA tool distributed with the British National Corpus. While its architecture is general, the drawbacks include the complexity of corpus compilation, the necessity of huge indices (sometimes five times as big as the source XML files with heavy annotation), and instability in operation. Alinea [7] is a parallel corpus tool which is somewhat more difficult to use for single-language corpora. The problem with many programs such as the UAM CorpusTool [9] is that they have been implemented in scripting languages and are rather unstable when working with large-scale corpora. Therefore this paper suggests a general tool for querying XML-based corpora that has been developed in view of the most common tasks in the analysis of linguistic data that can be easily automated.

2. Sample Task in Corpus Analysis

It is worth stressing that even if a particular research project in linguistics has seemingly nothing to do with applied or computational linguistics, it is always based on a corpus of text data. Using electronic texts may considerably shorten the time spent on retrieving evidence of linguistic facts.
The general scheme of a research project in linguistics is the following: at first a classification scheme or typology for some language phenomenon is developed. It is then applied to the analysis of text data, and statistics are drawn up to test the preliminary hypothesis. The traditional approach with index cards, for example, is not only susceptible to mistakes, but is also difficult to follow when each item to be analyzed has more than two parameters to be classified by (which is the case with all complex studies involving, e.g., analysis on the levels of semantics, syntax and pragmatics).

Let us analyze a sample task posed in a study of the peculiarities of English abstract nouns ending in -ness [1]. While it is relatively easy to find such nouns in a text with regular expressions (though odd words like witness or governess have to be eliminated), the task involves analysis of the semantics as well as the syntactic and pragmatic behavior of such nouns in a corpus of classical British novels. Semantic analysis is reduced to defining the semantic domain of a particular noun: according to the nature of the referent, five general domains have been specified, four of which describe various qualities of people (physical qualities, psychological qualities, states of mind, and qualities denoting social behavior and attitudes) and one of which is reserved for other kinds of referents. Words belonging to these domains are further subdivided into a number of thematic groups. Syntactic behavior is analyzed in terms of the most common distribution models of syntactic groups including nouns ending in -ness. Pragmatics is studied in terms of who the speaker is and to which character the quality denoted by the -ness noun is attributed, as well as whether that quality is evaluated positively or negatively in the context of a novel. Therefore five units of information are to be added to each -ness noun in the corpus.
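The first step mentioned above, finding -ness candidates with a regular expression and discarding look-alikes, might be sketched in Python as follows. This is only an illustration: the function name and the exclusion set are assumptions, the paper names only witness and governess as examples, and a real study would need a longer hand-built list.

```python
import re

# Hypothetical exclusion list: words that end in "ness" but are not abstract
# nouns formed with the suffix (the two examples named in the text).
NOT_SUFFIXED = {"witness", "governess"}

# One or more letters followed by "ness", delimited by word boundaries.
NESS_WORD = re.compile(r"\b[A-Za-z]+ness\b")


def find_ness_nouns(text):
    """Return candidate -ness nouns in order of occurrence, lower-cased."""
    candidates = (m.group(0).lower() for m in NESS_WORD.finditer(text))
    return [word for word in candidates if word not in NOT_SUFFIXED]


sample = "Mr. Darcy is all politeness, said the governess, a witness to his rudeness."
print(find_ness_nouns(sample))  # ['politeness', 'rudeness']
```

Such a pattern only locates candidates in plain text; the semantic, syntactic and pragmatic information discussed above still has to be added by the researcher.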
Besides that, the corpus has to be tokenized, and part-of-speech information has to be added to every word in order to enable the distribution analysis. Thus a sentence from "Pride and Prejudice" by Jane Austen incorporating all this information in XML format would look like this (the pos-tag information has been simplified for viewing purposes):

<paragraph id="40">
  <sentence id="78">
    <w pos="noun">Mr.</w>
    <w pos="noun">Darcy</w>
    <w pos="link_verb">is</w>
    <w pos="adjective">all</w>
    <w pos="noun" semantics="social_polite" evaluation="+" speaker="Elizabeth" qualified="Darcy">politeness</w>
    <w pos="verb">said</w>
    <w pos="noun">Elizabeth</w>
    <w pos="participle">smiling</w>
  </sentence>
</paragraph>

Such annotation enables the linguist to perform complex queries: to check whether some character is more likely to use words from a certain semantic domain, how he evaluates other characters and is characterized by them, and whether words from one semantic domain are more likely to appear in certain syntactic models than in others. It is also possible to learn whether several such nouns appear in consecutive sentences or in the same paragraph (which is of interest because -ness nouns used in groups in the same context, or together with the words they are derived from, produce a stylistic effect).

3. XCorp Query Language

Since one of the problems of the standard XML querying languages is their excessive complexity for an average linguist, we suggest a query language based on patterns. It was developed in view of typical tasks and situations that professional linguists face when working with text corpora. The proposed tool offers general querying functionality for XML corpora that covers and simplifies such typical tasks, which include finding segments of text matching certain criteria and gathering statistics.
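As a rough illustration of "finding segments of text matching certain criteria and gathering statistics", the annotated sentence from Section 2 can be queried with Python's standard xml.etree.ElementTree module. This is not XCorp itself, only a sketch of the kind of question the annotation makes answerable; the helper name annotated_words is an assumption, and the sample string reproduces the sentence with attribute values written as plain social_polite and +.

```python
import xml.etree.ElementTree as ET
from collections import Counter

SAMPLE = """
<paragraph id="40"><sentence id="78">
<w pos="noun">Mr.</w><w pos="noun">Darcy</w><w pos="link_verb">is</w>
<w pos="adjective">all</w>
<w pos="noun" semantics="social_polite" evaluation="+"
   speaker="Elizabeth" qualified="Darcy">politeness</w>
<w pos="verb">said</w><w pos="noun">Elizabeth</w>
<w pos="participle">smiling</w>
</sentence></paragraph>
""".strip()


def annotated_words(root, **required):
    """Return <w> elements whose attributes equal all the required values."""
    return [w for w in root.iter("w")
            if all(w.get(name) == value for name, value in required.items())]


root = ET.fromstring(SAMPLE)

# Which annotated words does Elizabeth use about Darcy?
hits = annotated_words(root, speaker="Elizabeth", qualified="Darcy")
for w in hits:
    print(w.text, w.get("semantics"), w.get("evaluation"))

# Simple statistics: how often each semantic domain occurs among the hits.
by_domain = Counter(w.get("semantics") for w in hits)
print(by_domain)
```

Attribute filters like this are easy with general-purpose tools; the harder part, which motivates a dedicated query language, is constraining the order of words and the distance between them.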
The suggested routines are implemented in a program called XCorp, currently released at http://sourceforge.net/projects/xcorp/. XCorp runs under the Microsoft .NET framework and can be used on any operating system supporting it.

The first thing to be determined is the types of corpora to be supported. XML-based linguistic corpora generally store text as a hierarchy of structures such as chapter, paragraph and sentence, and on the bottom level as a sequence of elements representing words with attributes for different linguistic categories, such as part of speech, word lemma, semantic class, etc. This is the output model supported by the majority of tokenizers, lemmatizers, part-of-speech taggers and other corpus utilities. Since this is the most frequently used type of annotation, XCorp was developed primarily with it in view. (More complex XML schemas with a data model different from the aforementioned one are generally developed for specialized corpora like TigerCorpus, which usually develop a specialized querying tool for their data.) The level of nesting and the names of specific nodes may differ between corpora and thus need to be specified in the search request. Corpus texts are to be stored in simple XML files; no indexing is required. The current version of XCorp has a command-line interface, with the program file being executed on the request file; the development of a GUI with a graphical query constructor is scheduled.

As a search request can contain many parameters and can be rather complicated, we chose to represent it in XML format as well. The root element of the query configuration file is <config>, which contains three sections. The first section of the request (<search_scope>) specifies the structure of the corpus files and the search scope within them, i.e. how elements are nested and which elements contain the target information. Target elements can be specified with XPath notation.
The second section (<search_request>) specifies the search criteria. As text is presented as a sequence of elements with words and certain attributes, XCorp is designed to retrieve subsequences of those elements that match certain criteria. The user can specify a substring or regular expression for each element in the chain as well as for each attribute. The maximum distance between elements can also be set. The last section of the search request (<search_target>) describes what kind of output is expected and how it is to be presented. Searching an XML-annotated text file is thus reduced to filling in a template form, which should make the task considerably easier for linguists with no prior training in programming.

Currently XCorp enables the user to obtain information of four kinds:

1) Basic statistics for retrieved items. XCorp computes the number of hits of the target pattern for all the levels in which they are nested (e.g. sentences or paragraphs containing the target item). This feature may be useful for checking the "density" of the target sequence, for example, in texts of different genres or in different sections of the same text. It simplifies searching for stylistic phenomena based on repetition, such as anaphora or epiphora.

2) KWIC (keyword-in-context) lists containing all the occurrences of the target item in the context in which they occur. The context may be specified as a certain number of characters to the right and to the left of the target pattern, which is the traditional way for concordancer software, or it may be understood as the element within which the target pattern is found (e.g. the sentences, paragraphs or syntactic groups within which the target pattern occurs).

3) Wordlists, or rather, lists of occurrences of the target pattern in every file constituting the corpus, and a general wordlist for the whole corpus. By default the occurrences are listed in the order in which they occur in the file, which may be useful for research involving linguistic analysis of fiction or newspaper discourse. The wordlist can also be sorted alphabetically. There is an option of generating a frequency list, in which all identical occurrences of the target pattern are merged and general statistics are given.

4) Other information characterizing the target pattern and stored in XML format. This feature makes XCorp useful not only for hypothesis-driven research, where one needs only to check for the presence of predefined patterns, but also for discovering "clusters" of linguistic information that the user may not be aware of at the time of the request. For example, if the corpus has morphological and semantic annotation, this feature may help the researcher to discover semantic patterns that correspond to the target syntactic pattern.

Let us consider a query designed for the above example from "Pride and Prejudice". To describe the way the author construes the relationship between Elizabeth and Darcy, we need to know what the two characters think of each other.
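The subsequence search that a request describes (a chain of items, each with an optional regular-expression mask, attribute constraints and a maximum distance) can be pictured with the following Python sketch. It is an illustration under stated assumptions, not XCorp's actual implementation: the Item class and the find_sequences function are hypothetical names, and in particular the reading of distance as "the maximum number of words allowed between this item and the next one, with an empty value meaning no limit" is a guess, since the text does not define it precisely.

```python
import re
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class Item:
    mask: str = ""                             # regex for the word text; "" matches any word
    attrs: dict = field(default_factory=dict)  # attribute values the word must carry
    distance: Optional[int] = None             # assumed: max words before the NEXT item


def item_matches(item, word):
    """Check one (text, attributes) pair against one item of the pattern."""
    text, attrs = word
    if item.mask and not re.search(item.mask, text):
        return False
    return all(attrs.get(name) == value for name, value in item.attrs.items())


def find_sequences(words, items):
    """Return one list of word indices per match of the whole item chain."""
    if not items:
        return []
    results = []

    def extend(chain, item_no):
        if item_no == len(items):
            results.append(chain)
            return
        gap = items[item_no - 1].distance
        first = chain[-1] + 1
        stop = len(words) if gap is None else min(len(words), first + gap + 1)
        for i in range(first, stop):
            if item_matches(items[item_no], words[i]):
                extend(chain + [i], item_no + 1)

    for start, word in enumerate(words):
        if item_matches(items[0], word):
            extend([start], 1)
    return results


# The annotated sentence from Section 2 as (text, attributes) pairs.
words = [
    ("Mr.", {"pos": "noun"}), ("Darcy", {"pos": "noun"}),
    ("is", {"pos": "link_verb"}), ("all", {"pos": "adjective"}),
    ("politeness", {"pos": "noun", "speaker": "Elizabeth", "qualified": "Darcy"}),
    ("said", {"pos": "verb"}), ("Elizabeth", {"pos": "noun"}),
    ("smiling", {"pos": "participle"}),
]

# An adjective immediately followed by a -ness noun said by Elizabeth about Darcy.
query = [
    Item(attrs={"pos": "adjective"}, distance=0),
    Item(mask=r"\wness$", attrs={"speaker": "Elizabeth", "qualified": "Darcy"}),
]
print([[words[i][0] for i in hit] for hit in find_sequences(words, query)])
```

Keeping order and distance as first-class parts of the pattern is what distinguishes this kind of search from the database-style queries discussed in Section 1.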
To learn that, we can search for -ness nouns uttered by Elizabeth and concerning Darcy, together with their attributes. The request matching the above example will look like this:

<search_scope>
  <element name="//paragraph">
    <element name="sentence">
      <element name="w"></element>
    </element>
  </element>
</search_scope>
<search_request>
  <item mask="" distance="0">
    <attribute name="pos">adjective</attribute>
  </item>
  <item mask="\\wness" distance="">
    <attribute name="speaker">Elizabeth</attribute>
    <attribute name="qualified">Darcy</attribute>
  </item>
</search_request>
<search_target>
  <content sort="frequency" order="descending"/>
</search_target>

The adjective in the above example serves to increase the degree of Darcy's politeness so as to exaggerate it and let us feel the irony of Elizabeth, who in fact thinks him extremely rude. On the other hand, Mrs. Bennet talks of his "shocking rudeness", which is also an exaggeration, and this time it is the author who speaks ironically of her character. But as the novel progresses we witness Elizabeth starting to like Darcy and even acknowledging his "utmost politeness" in earnest.

Conclusion

The paper analyses the applicability of general-purpose XML querying tools in corpus linguistics. Two main problems have been identified: the standard querying tools do not currently support full-text search functionality, and the default querying language is too difficult for experts in the humanities with no programming experience. Therefore the proposed query language is pattern-based. It is implemented in the XCorp software and can be applied to querying XML corpora with various kinds of annotation. The proposed solution is universal enough to work with different kinds of linguistic data, and at the same time it is as simple as possible. XCorp has been successfully applied to some practical tasks in corpus linguistics.
Further work includes the development of a graphical user interface and inviting the linguistic community to contribute more requirements, so as to make XCorp a more universal solution and the suggested query language more expressive.

References

1. Gladkova G.P. Peculiarities of the functioning of abstract nouns with the suffix -ness in the text of Jane Austen's novel "Pride and Prejudice" (in Ukrainian) / G.P. Gladkova // Мовні і концептуальні картини світу. 2008. Issue 24, Part 1. Kyiv: Taras Shevchenko KNU, 2008. pp. 180-186.
2. Anthony, L. AntConc: design and development of a freeware corpus analysis toolkit for the technical writing classroom / Laurence Anthony // Professional Communication Conference, 2005. IPCC 2005. Proceedings. pp. 729-737. [Electronic resource]: http://www.antlab.sci.waseda.ac.jp/abstracts/ipcc05_pres_20050713/IPCC_05_Anthony_fin_handouts.pdf
3. Aston G. Introducing XAIRA: an XML-aware concordance program / Guy Aston, Lou Burnard. Presentation at workshop held at TALC 2006. [Electronic resource]: http://www.oucs.ox.ac.uk/rts/xaira/Talks/xaira-wkshop.odp.
4. A Query Language for XML / Alin Deutsch, Mary Fernandez, Daniela Florescu et al. [Electronic resource]: http://www8.org/w8-papers/1c-xml/query/query.html.
5. Buxton S. Querying XML: XQuery, XPath, and SQL/XML in context / Jim Melton, Stephen Buxton. San Francisco: Morgan Kaufmann, 2006. 845 p. (The Morgan Kaufmann Series in Data Management Systems).
6. Baker P. A Glossary of Corpus Linguistics / Paul Baker, Andrew Hardie, Tony McEnery. Edinburgh: Edinburgh University Press, 2006. 187 p.
7. Duchet, J.-L. Alinea: a language independant tool for bi-text processing / Jean-Louis Duchet, Olivier Kraif // JRC EU-Enlargement Workshop: Exploiting parallel corpora in up to 20 languages. JRC-Ispra, Italy, 26-27.09.2005. [Electronic resource]: http://langtech.jrc.it/0509_EU-Enlargement-Workshop.html.
8. Kennedy, Graeme D. An Introduction to Corpus Linguistics / Graeme Kennedy. London: Longman, 1998.
9. O'Donnell, M. The UAM CorpusTool: Software for corpus annotation and exploration / Michael O'Donnell // Proceedings of the XXVI Congreso de AESLA, Almeria, Spain, 3-5 April 2008. [Electronic resource]: http://www.wagsoft.com/Papers/AESLA08.pdf.
10. Stubbs M. Text and corpus analysis: computer-assisted studies of language and culture / Michael Stubbs. Malden: Blackwell Publishers, 1996. 267 p. (Volume 23 of Language in Society Series).
11. XQuery 1.0: An XML Query Language / Scott Boag, Don Chamberlin, Mary F. Fernandez et al. [Electronic resource]: http://www.w3.org/TR/xquery/.
12. XQuery and XPath Full Text 1.0 / Sihem Amer-Yahia, Chavdar Botev, Stephen Buxton et al. [Electronic resource]: http://www.w3.org/TR/xpath-full-text-10/.

The article was received by the editorial office on 22.09.2009.

«Таврійський вісник інформатики та математики», No. 2, 2009
id nasplib_isofts_kiev_ua-123456789-18232
institution Digital Library of Periodicals of National Academy of Sciences of Ukraine
issn 1729-3901
language English
last_indexed 2025-12-07T16:47:21Z
publishDate 2009
publisher Кримський науковий центр НАН України і МОН України
record_format dspace