Methods and software for significant indicators determination of the natural language texts author profile

Methods for the formation and optimization of author profiles are presented. The author profile is an image, a vector in a multidimensional space whose components are measurements of the author’s texts obtained by a number of methods based on 4-grams, stemming, recurrence analysis and formal stochastic grammar. T...

Bibliographic details
Date:	2023
Authors:	Shynkarenko, V.I., Demydovych, I.M.
Format:	Article
Language:	English
Published:	PROBLEMS IN PROGRAMMING 2023
Subjects:	natural language texts; authorship determination; genetic algorithm; recurrent analysis; statistical analysis; text classification; pattern recognition; formal grammars; UDK 004.91
Online access:	https://pp.isofts.kiev.ua/index.php/ojs1/article/view/577
Journal title:	Problems in programming
Repositories

Problems in programming

_version_	1865434371548250112
author	Shynkarenko, V.I. Demydovych, I.M.
author_facet	Shynkarenko, V.I. Demydovych, I.M.
author_institution_txt_mv	[ { "author": "V.I. Shynkarenko", "institution": "Ukrainian State University of Science and Technologies" }, { "author": "I.M. Demydovych", "institution": "Ukrainian State University of Science and Technologies" } ]
author_sort	Shynkarenko, V.I.
baseUrl_str	https://pp.isofts.kiev.ua/index.php/ojs1/oai
collection	OJS
datestamp_date	2024-04-28T11:55:00Z
description	Methods for the formation and optimization of author profiles are presented. The author profile is an image, a vector in a multidimensional space whose components are measurements of the author’s texts obtained by a number of methods based on 4-grams, stemming, recurrence analysis and formal stochastic grammar. The author’s profile is a model of the author’s language, including vocabulary and sentence syntax features. A comparative analysis of the effectiveness of each method is carried out. By means of a genetic algorithm, a reduced profile of the author is formed. Insignificant indicators are excluded, which reduces their number by 20%. The reduced author’s profile contains the attributes that are significant for this author and serves as an effective attribution of a particular author. Problems in programming 2023; 3: 22-29
doi_str_mv	10.15407/pp2023.03.022
first_indexed	2025-07-17T09:57:22Z
format	Article
fulltext	Applied software. UDK 004.91 http://doi.org/10.15407/pp2023.03.22
V.I. Shynkarenko, I.M. Demydovych
METHODS AND SOFTWARE FOR SIGNIFICANT INDICATORS DETERMINATION OF THE NATURAL LANGUAGE TEXTS AUTHOR PROFILE
Methods for the formation and optimization of author profiles are presented. The author profile is an image, a vector in a multidimensional space whose components are measurements of the author’s texts obtained by a number of methods based on 4-grams, stemming, recurrence analysis and formal stochastic grammar. The author’s profile is a model of the author’s language, including vocabulary and sentence syntax features. A comparative analysis of the effectiveness of each method is carried out. By means of a genetic algorithm, a reduced profile of the author is formed. Insignificant indicators are excluded, which reduces their number by 20%. The reduced author’s profile contains the attributes that are significant for this author and serves as an effective attribution of a particular author.
Keywords: natural language texts, authorship determination, genetic algorithm, recurrent analysis, statistical analysis, text classification, pattern recognition, formal grammars
Introduction
Attribution of authorship is the problem of identifying the author of an anonymous text or of a text whose authorship is in doubt [1]. The literature of many countries offers examples in which the authorship of a work was questioned and never reliably established. To resolve such disputes, the works of other authors are analyzed in order to determine the significant characteristics of each text and of each author’s style as a whole. The text is then attributed to the author whose writing style is closest to that of the text under study. In most cases, determining the authorship of a text is therefore a classification task.
Text classification includes various subtasks, which can be divided into thematic and non-thematic. The traditional classification of texts is based on their subject matter. However, over the past 20 years, non-thematic classification has also been actively used, for example, in such subtasks as genre classification [2,5], sentiment classification, spam identification, language identification, authorship identification, and plagiarism detection [3].
Many algorithms have been developed to evaluate text authorship. These algorithms rely on the fact that authors are characterized by the linguistic features of their own language at all levels (semantic, syntactic, lexicographic, spelling and morphological) [4], which manifest themselves in the writing of texts.
As a rule, these features appear unconsciously in an author’s works and thus provide a useful basis for determining authorship. The most common approach to determining authorship is stylistic analysis, which takes place in two stages: first, certain style markers are extracted; then a classification procedure is applied to the resulting model. These methods are usually based on the calculation of lexical measures representing the richness of the author’s vocabulary and the frequency with which commonly used words appear [5].
The extraction of style markers is usually done using some form of NLP analysis, such as tagging, parsing, and morphological analysis. However, this standard approach has several drawbacks.
First, the methods used to extract style markers are language specific. For example, an English parser is not applicable to texts in German, Ukrainian, or Chinese.
Second, feature selection is not a trivial process and usually involves setting thresholds to exclude non-informative features [6]. These decisions can be extremely subtle because, although rare features contribute less signal than common features, they can still have an important cumulative effect [7].
Third, modern authorship attribution systems, which determine the author of a text, invariably analyze texts word by word. However, although word-level analysis seems intuitive, it ignores the fact that morphological features can also play an important role; in addition, many Asian languages such as Chinese and Japanese do not have well-defined word boundaries in text.
When working with a small number of authors and their works, the number of measures for comparison is also small. However, if the number of authors or classes is much larger, it is necessary to limit the amount of information about each author, i.e. to create an author profile that includes only the most informative indicators from a large list.
At present, approaches based on the theory of pattern recognition, mathematical statistics and probability theory, neural network algorithms, cluster analysis, and many others are used for text attribution.
This article addresses the effectiveness of various attributes for determining text authorship: from the set of text attributes obtained by different methods, a subset is selected that is sufficient to identify a specific author of the text. We will consider such a subset an effective attribution of a particular author.
The work is carried out on Ukrainian literary texts and explores the features of speech constructions and sentence construction that are specific to the Ukrainian language.
The effective attribution of an author is determined on the basis of experiments with texts of different Ukrainian authors by means of a genetic algorithm.
© V.I. Shynkarenko, I.M. Demydovych, 2023. ISSN 1727-4907. Проблеми програмування. 2023. №3
Methods
Several methods are used to analyze the texts of different authors, form their profiles, and highlight the most significant indicators; the data of each profile are then reduced to cut the time and computational resources required during the experiment.
The general scheme for determining the effective attribution of authors is shown in Fig. 1.
Figure 1 – General experiment scheme
The weights of the indicators are selected by a genetic algorithm as follows: the initial weight vector Wk of the first generation is formed randomly, the fitness function is evaluated, and the best vectors are selected and subjected to crossover and mutation to form a new generation Wk.
The fitness function (the formula is not preserved in the extracted text) is defined in terms of the profile of the author of the k-th work, the measurement weights corresponding to this author, and ρ, a function that experimentally determines whether the authorship of the k-th work is established correctly.
The last two steps are repeated until the fitness function stops improving, after which the process is considered complete and the weights are determined.
The last step is to reduce the number of indicators. Indicators xj and their weights wj are eliminated successively (the elimination condition is not preserved in the extracted text). If the result remains the same or deteriorates only slightly, the profile reduction continues. As soon as the result begins to deteriorate significantly, the reduction stops and is considered complete.
Frequency analysis in creating an author profile
Frequency analysis is one of the most common text analysis methods. For many languages and a large number of authors, linguists have compiled frequency dictionaries of an author’s language or of individual texts by an author [8, 9]. Such text processing is based on calculating the frequency of occurrence of each single character in a particular text. From the data obtained, it can be concluded that each text is characterized by its own individual frequency structure.
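The character-frequency calculation described above can be illustrated with a minimal sketch. It is not the authors' software: the function name ngram_frequencies, the choice to lowercase the text and drop everything that is not a letter, and the window length parameter n (n = 1 gives single-character frequencies) are assumptions made for illustration.

```python
from collections import Counter


def ngram_frequencies(text: str, n: int = 1) -> dict[str, float]:
    """Relative frequency of every run of n consecutive letters in `text`.

    Assumption of this sketch: case is ignored and spaces, digits and
    punctuation are removed before counting, so runs may cross word borders.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    # Keep letters only, lowercased.
    letters = "".join(ch for ch in text.lower() if ch.isalpha())
    # Slide a window of length n over the letter stream.
    grams = [letters[i:i + n] for i in range(len(letters) - n + 1)]
    total = len(grams)
    if total == 0:
        return {}
    counts = Counter(grams)
    # Normalise counts so that the frequencies sum to 1.
    return {gram: count / total for gram, count in counts.items()}
```

For instance, ngram_frequencies("Abba") assigns the frequency 0.5 to each of the characters "a" and "b". Two texts can then be compared through the distance between their frequency dictionaries, which is one way to read the "individual frequency structure" mentioned above.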
This method is based on the fact that the statistical distribution of characters within a text is non-standard. The approach has many practical applications, and a large number of works have been devoted to the problem. Frequency analysis is also used in decoding, in selecting a required set of data from large arrays, in the analysis of texts written in ancient languages, and in categorization. Frequency analysis can be implemented in expert systems. In this setting, the frequency component underlies the measure of proximity between texts.
Text analysis using N-grams is a relatively new method and in most cases is used to search for plagiarism in various text sources [10, 12]. This method also shows the best results in determining the authorship of texts using frequency analysis [12, 13]. In the current work, 4-grams are used because they proved the most efficient in determining authorship in previous works [12, 13].
Based on the obtained frequencies of 4-grams, a recurrent analysis adapted for working with texts is carried out: a time series is built from the frequency of occurrence of each 4-gram in order (advancing to the next element is taken as one unit of time), and a recurrence diagram is formed from this series. The following indicators are calculated from the resulting diagram for repeating statistically similar symbols: DIV, the reciprocal of the maximum length of the diagonal structures; ENT, which indicates the frequency distribution of the repetition of statistically similar characters; LAM, which indicates the repetition of statistically similar characters; and TT, which indicates the average frequency of repetition of statistically similar characters [12, 13].
An example of 4-grams from the work “Доля” by T. 
Shevchenko:
Ти не лукавила зо мною, Ти другом, братом і сестрою…
Obtained 4-grams: тине, инел, нелу, елук, лука, укав, кави, авил, вила, илаз, лазо, азом, зомн, омно, мною, ноют, оюти, ютид,…
Using stems to form an author profile
Stemming is the process of shortening a word to its base by cutting off parts such as an ending or a suffix. The basic idea of stemming is that words with the same stem or root refer to the same concept.
The results of stemming are sometimes very similar to determining the root of a word, but its algorithms are based on other principles. Therefore, a word processed by a stemming algorithm may differ from the morphological root of the word. Stemming is used in linguistic morphology and information retrieval [16]. Many search systems use stemming to treat words as synonymous if they have the same form after stemming.
Martin Porter’s stemming algorithm has become widespread and is the de facto standard stemming algorithm for the English language.
In this work, Porter’s stemmer adapted to the Ukrainian language is used, and its effectiveness for determining authorship is studied [14, 15]. It is applied directly to the texts of various authors and is also used to build a frequency profile of stems specific to each author.
An example of the same passage from the work “Доля” by T. Shevchenko after stemming: т, лукав, мн, друг, брат, сестр.
Using dictionaries to create author profile
In this experiment, we also studied the effectiveness of using a dictionary. The dictionary was developed on the basis of two sources. The first was the public Large Electronic Dictionary of Ukrainian (VESUM) [17]. The second was formed from a bank of Ukrainian texts, including literary texts, messages, posts, etc. 
Based on it, a complex dictionary was built containing unique word stems together with their endings and prefixes. To reduce its size, lists of unique endings were selected in advance, and only the index of the corresponding list was assigned to each word stem. A list of vowel alternations in words is also maintained.
To create lists of prefixes for the stems, the dictionary was searched by simple enumeration for stems that differ only in the presence of a prefix. As a result, the original dictionary of stems shrank: every key stem was assigned the corresponding index from the list of prefixes, and the extra stems with prefixes were removed.
The advantage of the resulting dictionary is that it accounts for all word forms of a stem, each of which is assigned a unique index. Thus all cases and other word forms, as well as words obtained by adding a prefix, lead unambiguously to a single stem. The process of dictionary formation and the form of the dictionary are described in more detail in the authors’ previous work [18].
Using formal stochastic grammar to model sentence structure
Stochastic grammar is used to create rules that describe the structure of sentences in a text. For each rule, the probability of its application in a particular work is determined. The probability of deriving an entire sentence is defined as the product of the probabilities of the part-of-speech sequences used in it. The resulting rules generate a language characteristic of the processed works of a certain author and of works structurally similar to them [19].
To describe the structure of the text under study, the part of speech is used as the characteristic of each word. Thus, each word in the sentence is replaced by its part of speech.
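The two ideas above (replacing words by parts of speech, and scoring a sentence as a product of rule probabilities) can be sketched as follows. This is only an illustration under stated assumptions, not the grammar of [19]: the rule form "previous tag → next tag", the START and END markers, the tag names, the tiny TOY_LEXICON and all function names are hypothetical; a real system would obtain parts of speech from a morphological dictionary.

```python
from collections import Counter, defaultdict
from math import prod

# Hypothetical toy lexicon (assumption): word -> part-of-speech tag.
TOY_LEXICON = {
    "ти": "PRON", "не": "PART", "лукавила": "VERB",
    "зо": "PREP", "мною": "PRON",
}


def to_pos_sequence(sentence, lexicon):
    """Replace every word of the sentence by its part of speech."""
    words = (w.strip(",.!?…:;") for w in sentence.lower().split())
    return [lexicon.get(w, "UNK") for w in words if w]


def estimate_rules(tagged_sentences):
    """Estimate the probability of each rule prev_tag -> next_tag
    as its relative frequency among all rules starting with prev_tag."""
    counts = defaultdict(Counter)
    for tags in tagged_sentences:
        path = ["START", *tags, "END"]
        for prev_tag, next_tag in zip(path, path[1:]):
            counts[prev_tag][next_tag] += 1
    return {
        prev_tag: {t: c / sum(ctr.values()) for t, c in ctr.items()}
        for prev_tag, ctr in counts.items()
    }


def sentence_probability(tags, rules):
    """Probability of a tag sequence: product of the probabilities of the
    rules used to derive it (0.0 if some rule was never observed)."""
    path = ["START", *tags, "END"]
    return prod(rules.get(p, {}).get(n, 0.0) for p, n in zip(path, path[1:]))
```

With rules estimated from the works of one author, sentence_probability gives higher values to sentence structures that the author uses often and zero to structures never observed, which is the property a profile built on sentence structure relies on.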
To obtain more information about the structure of sentences and the rules of their construction that are characteristic of a particular author, not only the part of speech but also the form, number, gender, etc. of each word are read [19].
For each part of speech, the probability of its occurrence at a certain place in the sentence in the given text is calculated. The probability of a certain part of speech appearing in the studied sequence captures more accurately the individual writing style of each of the authors under study. After the text has been represented as a set of part-of-speech sequences in sentences, with the probability of their occurrence at a particular place, rules are formed. The process is described in more detail in the authors’ previous work [19].
An example of the same passage from the work “Доля” by T. Shevchenko in terms of rules (the rules themselves are not preserved in the extracted text), where σ is the initial nonterminal; the remaining symbols denote the nonterminal with a given index in a rule of a given level and the probability of applying the corresponding rule when parsing this work. More details are given in the authors’ work [12].
Forming an author profile
To obtain the profile of a specific author, each of the studied groups of indicators is calculated for all the works of the author in the training sample. They are then all collected into one vector X, the profile of the author.
For example, when working with 4-grams, a vector is formed from the obtained indicators that contains the frequency of occurrence of each 4-gram in the text. To compile the author’s profile, such vectors are computed for each text in the training sample, and the average value of each component is found. A similar procedure is repeated to form vectors based on the remaining groups of indicators.
An example of a vector image of T. 
Shevchenko based on 4-grams: [АБАЗ, АБАЙ, АБАР, АБАС, АБАТ, АБАУ, АБЕР, АБІК, АБІЛ, АБЛА, АБОГ, АБОТ, АБОЮ, АБОЯ, АБУД, АБУД, АБУЛ, АБУС, АБУТ, АВАБ, АВАВ, АВАЛ, …]. In total, the vector contains 8748 4-grams used in the text. Their frequencies: [0.0001249, 0.0001565, 0.0001249, 0.0001565, 0.0001249, 0.0001249, 0.0001565, 0.0001565, 0.0001249, 0.0001249, 0.0004998, 0.0001249, 0.0001249, 0.0001249, 0.0004381, 0.0001249, 0.0001249, 0.0001249, 0.0001565, 0.0002499, 0.0001565, 0.0004696, …].
As can be seen, the number of 4-grams and frequencies obtained is large, which makes them time-consuming and computationally expensive to work with. However, since each author has an individual style of writing, different 4-grams may be most informative for different authors. In addition, the least common letter combinations can often be of the greatest importance, as they are a characteristic feature of the author’s language. Thus, the list of obtained frequencies requires additional analysis of their informativeness and subsequent data reduction, so that only the most significant indicators are used.
To optimize performance and obtain the best result when working with the different indicators in the vectors, a genetic algorithm was applied to determine the weight of each indicator in each group.
In this work, the profiles of the authors were compiled on the basis of all the above indicators and their subsequently determined weights. In total, the author’s profile included four main groups corresponding to the methods studied. Each group includes a list of indicators, each with an individual weight. Thus, for each author, a list of indicators was determined that most accurately reflect the author’s style and make it possible to identify similar elements in the texts of the control sample.
An example of a T. 
Shevchenko profile vector based on stems, created on the basis of the Large Electronic Dictionary of Ukrainian (VESUM): […а, аа, аб, абатів, абатівськ, абатств, абатськ, абет, абетк, аби, аби-аби, абиде, абиколи, абикуди, аби-но, абискільки, абись, аби-то, абич, …]. In total, the vector contains 7239 stems used in the text. As can be seen from the data obtained, the number of items for analysis is as large as in the previous case, so it also requires subsequent reduction and selection of the most informative ones. Their weights for the profile of T. Shevchenko: […0.91, 0.12, 0.55, 0.08, 0.18, 0.82, 0.9, 0.85, 0.99, 0.89, 0.17, 0.86, 0.38, 0.99, 0.42, 0.58, 0.98, 0.62, 0.43, 0.34, …].
When working with the rules, all the rules obtained while analyzing the texts in the training sample were collected in a single database, and a weight was also found for each of them. The total number of rules was 6946; the following is an example of a vector of their weights: […0.35, 0.88, 0.25, 0.44, 0.21, 0.6, 0.41, 1, 0.08, 0.2, 0.72, 0.21, 0.86, 0.49, 0.62, 0.12, 0.54, 0.14, 0.12, 0.24, …].
The number of rules is somewhat smaller but still requires the selection of the most important and informative ones in order to determine authorship correctly with the least expenditure of resources.
For the repeat experiment, the profile of each author was reduced for each group of indicators. The indicators with the smallest weights in each group were discarded in order to reduce computation time and computing power.
During the experiment, the authorship of natural language texts was determined using two samples. The samples consisted of works of fiction, because such works exhibit the characteristic style of the author and their authorship is confirmed and not subject to doubt.
For the first experiment, 40 texts of fiction by 10 Ukrainian authors were selected for the training sample. 
The control sample consisted of 60 texts by the same authors.
The works of the following authors are presented: IB – I. Bahrianyi, AV – A. Vyshnia, MV – M. Vovchok, AD – A. Dovzhenko, HK – H. Kvitka-Osnovianenko, PM – P. Myrnyi, VN – V. Nestaiko, VP – V. Pidmohylnyi, IF – I. Franko, MK – M. Khvylovyi.
Attribution results
When determining the authorship of texts in the control sample on the basis of the author’s profile, the following results were obtained.
With the full author profiles, authorship was correctly identified for 54 of the 60 works in the control sample. The method under study made it possible to determine the authorship of most texts correctly, with some exceptions. For the profiles of Bahrianyi, Vovchok, Kvitka-Osnovianenko, Franko and Khvylovyi, one of the works was not correctly identified and showed a great similarity with the profile of another author in the sample.
Analysis of the results showed some similarity of style in two works by Bahrianyi and Franko; it can also be argued that Khvylovyi’s style most often echoes the styles of other authors: in 3 cases out of 6.
Table 1 – Authorship establishing result with the full profiles
real  defined  real  defined
IB    IB       MV    MV
IB    IB       MV    MK
IB    IB       MV    MV
IB    IB       AD    AD
IB    IB       AD    AD
IB    IF       AD    AD
AV    AV       AD    AD
AV    AV       AD    AD
AV    AV       AD    AD
AV    AV       HK    HK
AV    AV       HK    HK
AV    AV       HK    MK
MV    MV       HK    HK
MV    MV       HK    HK
MV    MV       HK    MK
PM    PM       VP    VP
PM    PM       VP    VP
PM    PM       VP    VP
PM    PM       IF    IF
PM    PM       IF    IF
PM    PM       IF    IF
VN    VN       IF    IB
VN    VN       IF    IF
VN    VN       IF    IF
VN    VN       MK    MK
VN    VN       MK    MK
VN    VN       MK    MK
VP    VP       MK    MK
VP    VP       MK    MK
VP    VP       MK    IF
The least significant indicators for each author were then excluded from the list. As a result, the number of 4-grams in the profile decreased by 1750, stems by 1448 and rules by 1390, which amounted to 20% in each of the classes. 
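The reduction step can be sketched as follows. The function name reduce_profile and the toy indicator names are hypothetical, and rounding the number of discarded indicators up is an assumption; it happens to reproduce the reported counts, since 20% of 8748, 7239 and 6946 rounded up gives 1750, 1448 and 1390.

```python
import math


def reduce_profile(indicators, weights, percent=20):
    """Drop the `percent` % of indicators with the smallest weights.

    Returns the kept (indicator, weight) pairs in their original order
    and the number of discarded indicators. Rounding up is an assumption
    of this sketch.
    """
    if len(indicators) != len(weights):
        raise ValueError("indicators and weights must have equal length")
    n_drop = math.ceil(len(indicators) * percent / 100)
    # Indices ordered from the smallest weight to the largest.
    by_weight = sorted(range(len(weights)), key=lambda i: weights[i])
    dropped = set(by_weight[:n_drop])
    kept = [
        (indicator, weight)
        for i, (indicator, weight) in enumerate(zip(indicators, weights))
        if i not in dropped
    ]
    return kept, n_drop
```

In the procedure described in the Methods section the reduction is iterative: indicators are removed while attribution quality stays the same or deteriorates only slightly, so a fixed percentage such as the one used here is the outcome of that process rather than its input.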
When working with the optimized vectors, the following results were obtained.
In the experiment with the reduced author profiles, authorship was correctly established for 53 works out of 60.
Table 2 – Authorship establishing result with the reduced profiles
real  defined  real  defined
IB    IB       MV    MV
IB    IB       MV    MK
IB    IB       MV    MV
IB    IB       AD    AD
IB    IB       AD    AD
IB    IF       AD    AD
AV    AV       AD    AD
AV    AV       AD    AD
AV    AV       AD    AD
AV    AV       HK    HK
AV    AV       HK    HK
AV    AV       HK    MK
MV    MV       HK    HK
MV    MV       HK    HK
MV    MV       HK    MK
PM    PM       VP    VP
PM    PM       VP    VP
PM    PM       VP    VP
PM    PM       IF    IF
PM    PM       IF    IF
PM    PM       IF    IF
VN    VN       IF    IB
VN    VN       IF    IF
VN    VN       IF    IF
VN    VN       MK    MK
VN    VN       MK    MK
VN    VN       MK    MK
VP    VP       MK    MK
VP    VP       MK    MK
VP    PM       MK    IF
Results and discussion
In the experiment using a genetic algorithm and obtaining the best solution, the following results were obtained: out of 60 texts in the control sample, the authorship of 54 works was established correctly, which amounted to 90% in total.
For comparison, in previous works, where these methods were applied separately, the following results were obtained. The best indicator, 91% of texts with correctly matched authorship, was obtained when working with 4-grams. Working with word stems using dictionaries and stemming gave a result of 88%.
As can be seen, the combination of different approaches and methods did not significantly improve the result; however, it made it possible to take into account additional features of the text by working with grammars.
Based on the data obtained, the most successful methods of working with text are 4-grams, which are average in resource and time cost relative to the other methods and give the best result, and stochastic grammars, which reflect how the author constructs phrases and sentences but require significant computational and time resources. 
The results of working with stems and dictionaries show that they are less informative. Given the high computational and time cost of these methods, they are the most expensive and the least informative of all those used.
After the least significant indicators were excluded and their number thereby reduced, authorship was correctly established for 52 works, which is a good result: an accuracy of 87%. This approach made it possible to significantly reduce the complexity and time of calculation, while the result did not decrease significantly.
Conclusions
Various approaches to forming the general author profile were explored in this work: 4-grams, stems, recurrent analysis, and sentence structure formalized by means of a formal stochastic grammar.
This approach made it possible to obtain an effective profile of the author that takes into account various features of the author’s personal language, from the use of individual words to the peculiarities of sentence construction.
The results obtained demonstrate the effectiveness of an integrated approach that provides better results compared to approaches that take into account individual aspects of the author’s style.
References
1. H. Love. 2002. Attributing Authorship: An Introduction. Cambridge University Press.
2. Aidan Finn and Nicholas Kushmerick. 2003. Learning to classify documents according to genre. In IJCAI-03 Workshop on Computational Approaches to Style Analysis and Synthesis.
3. D. Khmelev and W. Teahan. 2003. A repetition based measure for verification of text collections and for text categorization. In SIGIR’2003, Toronto, Canada.
4. M. Ephratt. 1997. Authorship attribution – the case of lexical innovations. In Proc. ACH-ALLC-97.
5. E. Stamatatos, N. Fakotakis, and G. Kokkinakis. 2001. Computer-based authorship attribution without lexical measures. Computers and the Humanities, 35:193–214.
6. 
S. Scott and S. Matwin. 1999. Feature engineering for text classification. In Proceedings ICML-99.
7. A. Aizawa. 2001. Linguistic techniques to improve the performance of automatic text categorization. In Proceedings 6th NLP Pac. Rim Symp. NLPRS-01.
8. Darchuk N. 2023. Automatic frequency dictionary of connectivity by Lina Kostenko and Mykola Vingranovskyi. Linguistic and conceptual pictures of the world, 73 (1), 10.17721/2520-6397.2023.1.01.
9. Danyliuk, I., Zagnitko, A. and Sytar, G., 2019. Text corpus of Yury Shevelyov: structure, functions, navigation. APPLIED LINGUISTICS. LINGUISTICS. 10.18523/1p.2522-9281.2019.5.158-169.
10. Kuzma, K.T., 2020. Information technology for estimating the level of similarity of strings based on the N-gram method. Academic notes of TNU named after V.I. Vernadskyi. Series: technical sciences. 31 (7), p. 96-98. 10.32838/TNU-2663-5941/2020.6-1/16.
11. H. Gómez-Adorno, JP. Posadas-Durán, G. Sidorov, Document embeddings learned on various types of n-grams for cross-topic authorship attribution. Computing 100 (2018) 741–756. doi: 10.1007/s00607-018-0587-8.
12. V.I. Shynkarenko, I.M. Demidovich, Determination of the attributes of authorship of natural texts. Artificial Intelligence 3 (2018) 27-35.
13. V.I. Shynkarenko, I.M. Demidovich, Authorship Determination of Natural Language Texts by Several Classes of Indicators with Customizable Weights, in: Proceedings of the 5th International Conference on Computational Linguistics and Intelligent Systems (COLINS 2021). Volume I: Main Conference. Lviv, Ukraine, April 22-23, 2021, pp. 832-844.
14. T. V. Golub, M. Yu. Tyagunova, Method of stemming Ukrainian-language texts for classification of documents based on Porter’s algorithm. Scientific works of Donetsk National Technical University. Series: Informatics, cybernetics and computer engineering No. 1(24) (2017) 59–63.
15. Dukhnovska KK, Strashok YaA, Shilo PV. 
Information technology for performing lemmatization and stemming in Ukrainian-language texts. Applied systems and technologies in the information society, pp. 119-127.
16. S. Memon, K. Memon, F. Dehraj and others. 2020. Comparative Study of Truncating and Statistical Stemming Algorithms. International Journal of Advanced Computer Science and Applications.
17. Great electronic dictionary of the Ukrainian language (VESUM). URL: https://github.com/brown-uk/dict_uk.
18. I. Demidovich, V. Shynkarenko, O. Kuropiatnyk, O. Kirichenko, Processing Words Effectiveness Analysis in Solving the Natural Language Texts Authorship Determination Task, XVI International Scientific and Technical Conference (CSIT’2021). September 22-25, 2021, Lviv, Ukraine.
19. V. I. Shynkarenko, I. M. Demidovich, Natural Language Texts Authorship Establishing Based on the Sentences Structure, in: Proceedings of the 6th International Conference on Computational Linguistics and Intelligent Systems (COLINS 2022), Volume I: Main Conference, Gliwice, Poland, May 22-23, 2022, pp. 328-337.
Received: 07.09.2023
About the authors:
Viktor Shynkarenko, Doctor of Science, Professor. Number of scientific publications in Ukrainian publications: more than 200. Number of scientific publications in foreign publications: more than 30. h-index: 6. https://orcid.org/0000-0001-8738-7225 Scopus Author ID: 26635896100
Inna Demydovych, PhD student. Number of scientific publications in Ukrainian publications: 4. Number of scientific publications in foreign publications: 1. h-index: 2. https://orcid.org/0000-0002-3644-184X Scopus Author ID: 57224201949
Place of work: Ukrainian State University of Science and Technologies, 49010, Ukraine, Dnipro, str. Lazaryana, 2. E-mail: office@ust.edu.ua
id	pp_isofts_kiev_ua-article-577
institution	Problems in programming
keywords_txt_mv	keywords
language	English
last_indexed	2025-07-17T09:57:22Z
publishDate	2023
publisher	PROBLEMS IN PROGRAMMING
record_format	ojs
resource_txt_mv	ppisoftskievua/d0/c37449a18097e61ddedeb3023e44d6d0.pdf
spelling	pp_isofts_kiev_ua-article-577 2024-04-28T11:55:00Z Methods and software for significant indicators determination of the natural language texts author profile Методи та засоби визначення значимих показників профілю автора природно-мовних текстів Shynkarenko, V.I. Demydovych, I.M. natural language texts; authorship determination; genetic algorithm; recurrent analysis; statistical analysis; text classification; pattern recognition; formal grammars UDK 004.91 Methods for the formation and optimization of author profiles are presented. The author profile is an image, a vector in a multidimensional space whose components are measurements of the author’s texts obtained by a number of methods based on 4-grams, stemming, recurrence analysis and formal stochastic grammar. The author’s profile is a model of the author’s language, including vocabulary and sentence syntax features. A comparative analysis of the effectiveness of each method is carried out. By means of a genetic algorithm, a reduced profile of the author is formed. Insignificant indicators are excluded, which reduces their number by 20%. The reduced author’s profile contains the attributes that are significant for this author and serves as an effective attribution of a particular author. The research was carried out on Ukrainian-language texts (a low-resource language). The results of experiments performed with the developed software are presented. Problems in programming 2023; 3: 22-29 PROBLEMS IN PROGRAMMING ПРОБЛЕМЫ ПРОГРАММИРОВАНИЯ ПРОБЛЕМИ ПРОГРАМУВАННЯ 2023-10-06 Article application/pdf https://pp.isofts.kiev.ua/index.php/ojs1/article/view/577 10.15407/pp2023.03.022 PROBLEMS IN PROGRAMMING; No 3 (2023); 22-29 1727-4907 10.15407/pp2023.03 en https://pp.isofts.kiev.ua/index.php/ojs1/article/view/577/627 Copyright (c) 2023 PROBLEMS IN PROGRAMMING
spellingShingle	natural language texts; authorship determination; genetic algorithm; recurrent analysis; statistical analysis; text classification; pattern recognition; formal grammars; UDK 004.91 Shynkarenko, V.I. Demydovych, I.M. Methods and software for significant indicators determination of the natural language texts author profile
title	Methods and software for significant indicators determination of the natural language texts author profile
title_alt	Методи та засоби визначення значимих показників профілю автора природно-мовних текстів
topic	natural language texts; authorship determination; genetic algorithm; recurrent analysis; statistical analysis; text classification; pattern recognition; formal grammars; UDK 004.91
topic_facet	natural language texts; authorship determination; genetic algorithm; recurrent analysis; statistical analysis; text classification; pattern recognition; formal grammars; UDK 004.91
url	https://pp.isofts.kiev.ua/index.php/ojs1/article/view/577
work_keys_str_mv	AT shynkarenkovi methodsandsoftwareforsignificantindicatorsdeterminationofthenaturallanguagetextsauthorprofile AT demydovychim methodsandsoftwareforsignificantindicatorsdeterminationofthenaturallanguagetextsauthorprofile AT shynkarenkovi metoditazasobiviznačennâznačimihpokaznikívprofílûavtoraprirodnomovnihtekstív AT demydovychim metoditazasobiviznačennâznačimihpokaznikívprofílûavtoraprirodnomovnihtekstív
