Methodology of the countries' economic development data analysis

The paper examines the issue of improving the methods of identification of economic objects and their analysis using algorithms of intelligent data processing. The use of the developed methodology in the economic analysis allows for improvement in the quality of management. It can be the basis for c...

Bibliographic Details
Date: 2023
Main Authors: Donets, Volodymyr, Strilets, Viktoriia, Ugryumov, Mykhaylo, Shevchenko, Dmytro, Prokopovych, Svitlana, Chagovets, Liubov
Format: Article
Language: English
Published: The National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute" 2023
Online Access: https://journal.iasa.kpi.ua/article/view/297208
Journal Title: System research and information technologies

Institution

System research and information technologies
_version_ 1867334441111650304
author Donets, Volodymyr
Strilets, Viktoriia
Ugryumov, Mykhaylo
Shevchenko, Dmytro
Prokopovych, Svitlana
Chagovets, Liubov
author_facet Donets, Volodymyr
Strilets, Viktoriia
Ugryumov, Mykhaylo
Shevchenko, Dmytro
Prokopovych, Svitlana
Chagovets, Liubov
author_institution_txt_mv [ { "author": "Volodymyr Donets", "institution": "V. N. Karazin Kharkiv National University, Kharkiv" }, { "author": "Viktoriia Strilets", "institution": "V. N. Karazin Kharkiv National University, Kharkiv" }, { "author": "Mykhaylo Ugryumov", "institution": "V. N. Karazin Kharkiv National University, Kharkiv" }, { "author": "Dmytro Shevchenko", "institution": "V. N. Karazin Kharkiv National University, Kharkiv" }, { "author": "Svitlana Prokopovych", "institution": "Simon Kuznets Kharkiv National University of Economics, Kharkiv" }, { "author": "Liubov Chagovets", "institution": "Simon Kuznets Kharkiv National University of Economics, Kharkiv" } ]
author_sort Donets, Volodymyr
baseUrl_str http://journal.iasa.kpi.ua/oai
collection OJS
datestamp_date 2024-02-01T21:03:07Z
description The paper examines the issue of improving the methods of identification of economic objects and their analysis using algorithms of intelligent data processing. The use of the developed methodology in the economic analysis allows for improvement in the quality of management. It can be the basis for creating decision support systems to prevent potentially dangerous changes in the economic status of the research object. In this work, an improved method of c-means data clustering with agent-oriented modification is proposed, and a radial-basis neural network and its extension are proposed to determine whether the obtained clusters are relevant and to analyze the informativeness of state variables and obtain a subset of informative variables. The effect of applying data compression using an autoencoder on the accuracy of the methods is also considered. According to the results of testing of the developed methodology, it was proved that the probability of incorrect determination of the state was reduced when identifying the states of economic systems, and a reduced value of the error of the third kind was obtained when classifying the states of objects.
doi_str_mv 10.20535/SRIT.2308-8893.2023.4.02
first_indexed 2025-07-17T10:28:25Z
format Article
fulltext © V.V. Donets, V.Y. Strilets, M.L. Ugryumov, D.O. Shevchenko, S.V. Prokopovych, L.O. Chagovets, 2023
System Research and Information Technologies, 2023, No. 4, ISSN 1681–6048
UDC 519.254:330.47
DOI: 10.20535/SRIT.2308-8893.2023.4.02

METHODOLOGY OF THE COUNTRIES’ ECONOMIC DEVELOPMENT DATA ANALYSIS

V.V. DONETS, V.Y. STRILETS, M.L. UGRYUMOV, D.O. SHEVCHENKO, S.V. PROKOPOVYCH, L.O. CHAGOVETS

Abstract. The paper examines how to improve methods for identifying economic objects and analyzing them with algorithms of intelligent data processing. Using the developed methodology in economic analysis improves the quality of management, and the methodology can serve as the basis for decision support systems that prevent potentially dangerous changes in the economic status of the research object. The work proposes an improved c-means data clustering method with an agent-oriented modification, together with a radial basis neural network and its extension, which are used to determine whether the obtained clusters are relevant, to analyze the informativeness of state variables, and to obtain a subset of informative variables. The effect of data compression with an autoencoder on the accuracy of the methods is also considered. Testing of the developed methodology showed that the probability of incorrect determination of the state was reduced when identifying the states of economic systems, and a reduced value of the error of the third kind was obtained when classifying the states of objects.

Keywords: machine learning, digital development, fuzzy clustering, radial basis neural networks, logistic regression, analysis of variables informativeness.

INTRODUCTION

Analysis of the state of economic systems requires taking into account a large number of factors that develop stochastically and are highly dynamic.
Continuous monitoring makes it possible to account for the influence of these factors and to maintain the stable functioning of economic systems under constant global fluctuations [1]. Machine learning methods make it possible to evaluate these factors and their possible and actual impact on macroeconomic processes. Machine learning algorithms allow early consideration of the effects of factors that may threaten the stability of economic systems [1].

The use of intelligent methods for analyzing collected economic data makes it possible to automate the solution of many problems in the management of economic processes [1], which significantly increases the quality and efficiency of that management. Automated systems of economic analysis are used as decision support systems to prevent potentially dangerous changes in the state of economic systems [1; 2]. Existing information systems of economic analysis include modules for clustering, classification, or forecasting of incoming data based on machine learning methods, which improve the accuracy of the resulting decisions.

The aim of the study is to improve the quality of data stratification in the information analysis of economic systems by developing a methodology that includes methods of clustering, classification, and analysis of the informativeness of economic data. The scientific objective of the study is to improve the existing methods of economic data analysis by introducing an agent-oriented modification of the clustering method and radial basis neural networks for analyzing the informativeness of state variables.
The proposed methods are expected to reduce the probability of erroneous determination of the state in the analysis of the economic system and, consequently, the value of the error of the third kind in the classification of its state.

STATEMENT OF THE RESEARCH PROBLEM

The data obtained from the study of an economic system can be presented in the form $X = \{x_{im}\}$, where $i = \overline{1, N}$, $m = \overline{1, M}$; $X$ is the matrix representing the data sample for analysis; $N$ is the number of objects; $M$ is the dimension of the space.

The problem of analyzing data that characterize the state of the economic system consists of solving a sequence of problems:
– division of a set of data into sets that are similar according to certain characteristics (the task of clustering);
– determination of the current state of the economic system based on a set of characteristics (the task of classification);
– determination of a set of features that best describe the state of the economic system (the task of selecting informative features, that is, reduction of the space of features).

Let us consider the methods of solving each of these problems.

FUZZY DATA CLUSTERING METHOD

For some known set of valid clusters $Y$, it becomes necessary to split the input data $X$ into $|Y|$ subsets (clusters, classes) so that each cluster consists of objects that are close by some metric, or distant by another. Thus, each object is assigned to the $y$-th cluster.

The result of the clustering algorithm [3] is the application of the function $cluster: X \to Y$, which maps each object $x \in X$ of the input set to an object from the set of clusters. Usually, the set of clusters $y \in Y$ is known in advance for a non-hierarchical approach, or is determined in the process for a hierarchical approach. Therefore, the question of determining the optimal number of clusters, as one of the parameters determining the final quality of clustering, often arises.

Let us define the distance between cluster objects as the metric for cluster analysis.
Then we define the degree of similarity of objects as the reciprocal of the inter-element distance. The literature on cluster analysis offers a large number of possible metrics for determining the inter-element distance or degree of similarity. The most widespread metric is based on the Euclidean distance, which is a special case of the Minkowski distance [4] with the parameter value $\lambda = 2$. The generalized Minkowski metric is:

$$d(x_i, x_j) = \left( \sum_{m=1}^{M} |x_{im} - x_{jm}|^{\lambda} \right)^{1/\lambda}.$$

The c-means fuzzy clustering method allows fuzzy distribution of objects into clusters or classes. In the c-means method, an object belongs to all clusters, but with a certain value of cluster membership [5]. In the method of fuzzy clustering [6], the matrix of membership of elements in a cluster is calculated under the assumption of a normal distribution of the data according to the formula:

$$w_{ij} = \frac{N(d(x_i, c_j) \mid 0, \sigma_j)}{\sum_{i=1}^{P_j} N(d(x_i, c_j) \mid 0, \sigma_j)},$$

where $x_i$ is the $i$-th element of the set, $i \in (1; P_j)$; $c_j$ is the $j$-th cluster center; $d(x_i, c_j)$ is the distance between points $x_i$ and $c_j$; $N(d(x_i, c_j) \mid 0, \sigma_j)$ is the probability density of a normal distribution at the point $d(x_i, c_j)$.

The cluster centers are adjusted according to the formula

$$c_j = \frac{\sum_{i=1}^{P_j} w_{ij} x_i}{\sum_{i=1}^{P_j} w_{ij}}. \quad (1)$$

The center adjustment process continues until the loss function is minimized:

$$loss = \sum_{j=1}^{K} \sum_{i=1}^{P_j} d^2(x_i, c_j)\, w_{ij} \to \min, \quad (2)$$

or until a limit on the number of iterations or the required classification quality is reached.

Important disadvantages of the c-means method are its inability to divide a space in which the target clusters have a complex shape that goes beyond simple M-dimensional spheres, and its insufficient robustness to noise [5; 7].
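As an illustration only, one iteration of the scheme described above (membership from a normal density of the distance, the center update of expression (1), and the loss of expression (2)) might be sketched in NumPy as follows. The function names, the single shared `sigma`, the Euclidean distance, and the simplification that every cluster is normalized over all elements rather than over its own $P_j$ elements are assumptions of this sketch, not details taken from the paper.

```python
import numpy as np

def normal_pdf(d, sigma):
    """Density of a zero-mean normal distribution evaluated at distance d."""
    return np.exp(-0.5 * (d / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))

def fuzzy_step(X, centers, sigma=1.0):
    """One membership and center update (hypothetical helper).

    X: (N, M) data, centers: (K, M). Returns (W, new_centers, loss).
    """
    # Euclidean distances d(x_i, c_j), shape (N, K)
    d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    dens = normal_pdf(d, sigma)
    # Membership: density normalized over the elements of each cluster column
    W = dens / dens.sum(axis=0, keepdims=True)
    # Expression (1): membership-weighted mean of the elements
    new_centers = (W.T @ X) / W.sum(axis=0)[:, None]
    # Expression (2): membership-weighted sum of squared distances
    loss = float((W * d ** 2).sum())
    return W, new_centers, loss

# Two well-separated groups of points and two rough initial centers
X = np.array([[0.0, 0.0], [0.2, 0.0], [5.0, 5.0], [5.2, 5.0]])
centers = np.array([[0.5, 0.5], [4.5, 4.5]])
for _ in range(20):
    W, centers, loss = fuzzy_step(X, centers, sigma=1.0)
```

In practice the loop would stop on a change-in-loss threshold or an iteration limit, matching the stopping conditions listed in the text.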
Data from real problems typically have both a complex distribution of object parameters and a high dimensionality of the input data, which in turn determines the complex form of the M-dimensional target clusters. Therefore, the usual method of fuzzy clustering and many of its modifications cannot cluster such data with high accuracy. A modification of the distance metric (together with the membership metric) is proposed in [8]. An interesting approach is the assumption of the Cauchy distribution and the use of the Mahalanobis distance, which were proposed in [9; 10]. The Mahalanobis distance was used with an improved calculation algorithm that prevents degeneracy of the inverse matrix [11]:

$$MD(x_i, c_j) = \sqrt{(x_i - c_j)^T\, \hat{\Sigma}_j^{-1}\, (x_i - c_j)},$$

where $\hat{\Sigma}$ is the regularized covariance matrix, formed from the covariance matrix $\Sigma$ using a constant greater than zero.

Taking into account the assumption of the Cauchy distribution in the data, the expression for calculating the value of membership in a certain cluster [5] has the form:

$$w_{ij} = \frac{\varphi(x_i, c_j)}{\sum_{i=1}^{P_j} \varphi(x_i, c_j)}, \qquad \varphi(x_i, c_j) = \left( 1 + \frac{MD^2(x_i, c_j)}{\gamma^2} \right)^{-1}. \quad (3)$$

Solving the clustering problem for clusters of complex M-dimensional form was considered using Gaussian mixture models in [12; 13], using the derivative in [14], and using the Mahalanobis distance in [5; 9]. According to the obtained results, clustering accuracy improves, but the problem of spatial separation and excessive dependence on the input data arises.
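The Cauchy-based membership of expression (3) can be illustrated with a short NumPy sketch. It assumes the standard Cauchy kernel $1 / (1 + d^2 / \gamma^2)$ with $d$ the Mahalanobis distance, and it regularizes the covariance as `cov + eps * I`; both the regularization form and the names `mahalanobis_sq`, `cauchy_membership`, `eps`, and `gamma` are assumptions made for the example rather than the authors' exact choices.

```python
import numpy as np

def mahalanobis_sq(X, center, cov, eps=1e-6):
    """Squared Mahalanobis distance of each row of X to center.

    The covariance is regularized as cov + eps * I (an assumed form)
    so that the linear system stays solvable for degenerate covariances.
    """
    cov_reg = cov + eps * np.eye(cov.shape[0])
    diff = X - center
    # Solve the linear system instead of forming the inverse explicitly
    solved = np.linalg.solve(cov_reg, diff.T).T
    return np.einsum("ij,ij->i", diff, solved)

def cauchy_membership(X, center, cov, gamma=1.0):
    """Membership of each element in one cluster, normalized over elements."""
    kernel = 1.0 / (1.0 + mahalanobis_sq(X, center, cov) / gamma ** 2)
    return kernel / kernel.sum()

# Three elements at increasing distance from the cluster center
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])
w = cauchy_membership(X, center=np.zeros(2), cov=np.eye(2), gamma=1.0)
```

Because the Cauchy kernel decays polynomially rather than exponentially, distant elements keep a noticeable membership, which is one reason heavy-tailed kernels are considered more tolerant of noise than the normal density.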
In works [5; 15], the possibility of taking into account the relative entropy of the data distribution was considered for the c-means method, but the Euclidean distance was chosen as the metric of the distance between the objects of the sample. This reduced the computational load but did not take into account the entropy of the data.

The basic method of fuzzy clustering and its modifications based on mixture and Gaussian mixture models are difficult to apply to data with a complex shape of M-dimensional target clusters [12]. To overcome these difficulties, it was proposed to improve the clustering method by taking into account the entropy of clusters [15] and the Kullback–Leibler distance [16].

The Kullback–Leibler distance is an asymmetric measure of the informational difference between two probability distributions. This measure has proven itself well in methods of information processing in physical systems and statistics [16].

According to the previous definition, $x_{im} \in X$ is the $m$-th state variable of the $i$-th vector of the input data sample, where $m \in [1, M]$ and $M$ is the dimension of the state vector. Let us define $f_s \in F$ as the $s$-th object function from the vector of object functions, $s \in [1, S]$, where $S$ is the dimension of the object functions vector. Then $M(f_s)$ and $M(x_{im})$ are the mathematical expectations of $f_s$ and $x_{im}$, respectively. Accordingly, $D(f_s)$ and $D(x_{im})$ are the variances of the relevant variables, and $\sigma(f_s)$ and $\sigma(x_{im})$ are their standard deviations.

The variance and standard deviation of the conditional dependence of $f_s$ on $x_{im}$ can be determined by the formulas:

$$D(f_s \mid x_{im}) = \operatorname{var}\bigl(M(f_s(x_{im}))\bigr), \quad x_{in} = const \ \ \forall n \neq m; \quad (4)$$

$$\sigma(f_s \mid x_{im}) = \sqrt{D(f_s \mid x_{im})}. \quad (5)$$

Using expression (4), we obtain the estimate of informativeness of a state variable as the ratio $D(f_s \mid x_{im}) / E(f_s)$, where $E(f_s)$ is the signal energy.

From (5), we get the influence coefficient (signal-to-noise ratio):

$$SNR_{sm}(f_s \mid x_{im}) = \frac{\sigma(f_s \mid x_{im})}{\sigma(x_{im})}.$$

In [16], the Kullback–Leibler entropy is defined as follows:

$$D_{KL}(f_s, x_i) = \sum_{m=1}^{M} \rho(x_{im} \mid f_s) \log_2 \frac{\rho(x_{im} \mid f_s)}{\rho(x_{im})},$$

where $\rho(\cdot)$ denotes the distribution of the corresponding variable.

Mutual informative dependence is then determined by the formula:

$$H = \frac{1}{2} \log_2 SNR_{sm}^2(f_s \mid x_{im}) = \frac{1}{2} \log_2 \left( \frac{D(f_s \mid x_{im})}{E(f_s)} \cdot \frac{E(f_s)}{D(x_{im})} \right).$$

In the proposed method, the loss function (2) is replaced. Mutual informative dependence is used instead as the function of clustering quality assessment, that is, as the loss function of the developed method of fuzzy clustering: $H(X, Y) \to \min$, where $H(X, Y)$ is accumulated over the clusters $j = \overline{1, k}$ from the Kullback–Leibler distances $D_{KL}(x_i, Y_j^{t})$ of the elements $i = \overline{1, P_j}$ of each cluster, and $Y_j$ denotes the state variables belonging to the $j$-th cluster.

AGENT-ORIENTED MODIFICATION OF THE CLUSTERIZATION METHOD

To overcome the problem of the lack of a priori information, an agent-oriented modification was developed for the classical method of fuzzy clustering that takes the M-dimensional spatial shape into account [3; 5]. It is considered below.

Let us introduce special notation for the developed method of fuzzy clustering: $X$ denotes agents, the elements of the input sample; $C$ denotes the centers of clusters; $X_i$ denotes agents that are cluster elements; $Z$ denotes cluster agents.

According to the agent-oriented approach, the element vectors of the input sample and the clusters are agents. The element agents choose the cluster agents closest to them according to a pre-specified metric and join them, thus forming cluster agents. The number of cluster agents is determined by minimizing the loss function.

According to the previous definition, the input sample partitioned into clusters is $X = \{P_j\}$, where $j \in (1; K)$, $K > 1$; $N = \sum_{j=1}^{K} P_j$ is the number of elements in the input sample; $P_j$ is the set of elements belonging to the $j$-th cluster; $K$ is the number of clusters. Then $x_{ij} \in P_j$ is the $i$-th element of the $j$-th cluster.
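Since the Kullback–Leibler distance enters both the loss function above and the metrics compared in this section, a small numerical illustration may help. The sketch below computes the textbook discrete Kullback–Leibler divergence and cross entropy in bits; the helper names and the example distributions are hypothetical, and the paper's own estimate of the distributions is not reproduced here.

```python
import numpy as np

def kl_divergence(p, q):
    """D_KL(p || q) in bits for two discrete distributions."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0  # terms with p = 0 contribute nothing
    return float(np.sum(p[mask] * np.log2(p[mask] / q[mask])))

def cross_entropy(p, q):
    """Cross entropy -sum p * log2(q) in bits."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    mask = p > 0
    return float(-np.sum(p[mask] * np.log2(q[mask])))

p = [0.5, 0.5]
q = [0.25, 0.75]
forward = kl_divergence(p, q)   # divergence of q from p
backward = kl_divergence(q, p)  # differs from forward: the measure is asymmetric
```

The two directions give different values, which is the asymmetry mentioned in the text, and the cross entropy equals the entropy of `p` plus the forward divergence.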
Four metrics were chosen to compare the possibilities of spatial separation of clusters and computational efficiency:

$$d(x_{ij}, c_j) = \begin{cases} d_1(x_{ij}, c_j), \\ \dfrac{1}{w_{ij}}\, d_1(x_{ij}, c_j), \\ D_{KL}(x_{ij}, c_j), \\ p(x_{ij}, c_j^{t-1}) * \log_2 p(x_{ij}, c_j^{t}), \end{cases} \quad (6)$$

where $d_1(x_{ij}, c_j)$ is the Manhattan distance; $\frac{1}{w_{ij}} d_1(x_{ij}, c_j)$ is the Mahalanobis distance with the inverse of the membership function; $D_{KL}(x_{ij}, c_j)$ is the Kullback–Leibler divergence; $p(x_{ij}, c_j^{t-1}) * \log_2 p(x_{ij}, c_j^{t})$ is the cross entropy.

Given a measure of the inter-element distance, we obtain an expression for the cost function of each cluster, that is, the average measure of the intraclass distance:

$$cl\_loss(P_j) = \frac{1}{|P_j|} \sum_{i=1}^{P_j} d(x_{ij}, c_j). \quad (7)$$

Then, using expression (7), we obtain the general cost function for evaluating the current quality of clustering:

$$loss(X^t) = \frac{1}{K^t} \sum_{j=1}^{K^t} cl\_loss(P_j). \quad (8)$$

Combining the classical method of fuzzy clustering with the agent-oriented approach described above gives the statement of the research problem: determine the number of clusters and a distribution of elements over clusters such that the value of the cost function is minimal:

$$A = [K^t, X^t], \qquad \hat{A} = \arg\min\bigl(loss(X^t)\bigr).$$

According to the classical clustering method, cluster centers are optimized according to expression (1), and the membership matrix used for the adjustment is calculated according to expression (3), taking into account the Cauchy distribution assumption.

We formulate the clustering algorithm, defined according to the agent-oriented approach, as follows:

1. Determine an initial number of cluster agents $K^t > K$, that is, more than the target number of clusters; set a limit on the number of elements in each cluster, $|P_j^t| = N / K^t$; and randomly choose $K^t$ cluster centers $\{c_j\}$.
2. Using one of the inter-element distances (6), select the $|P_j^t|$ elements closest to each cluster, that is, form the cluster agents $P_j^t$.
3. For each cluster, calculate the parameters of the distribution of $x_{ij}$ given $P_j^t$ and the values of the membership matrix according to expression (3), and adjust the cluster centers according to expression (1).
4. For each cluster center, choose $|P_j^t|$ new element agents according to the selected measure $d(x_{ij}, c_j)$.
5. For each cluster agent, determine the value of the cost function (the average inter-element distance) $cl\_loss(P_j^t)$ according to expression (7).
6. Estimate the current quality of clustering by the loss function according to expression (8). If the algorithm is operating in the mode of automatic search for the optimal number of clusters and the value of the cost function has increased, stop the algorithm.
7. Select the cluster agents and discard the cluster agent with the highest value of $cl\_loss(P_j^t)$.
8. Determine the new number of clusters $K^{t+1} = K^t - 1$ and the new number of cluster elements $|P_j^{t+1}| = N / K^{t+1}$.
9. Return to step 2 if $K^t > K$.

CLASSIFICATION METHOD BASED ON MULTIPLE LOGISTIC REGRESSION

To solve the problem of multiclass classification in the case of spatially separated data, it is proposed to use a radial basis neural network (RBFN) with multiple logistic regression. Applying the RBFN model to multiclass classification makes it possible to check the assumptions about the correctness of the cluster definition and to test the model's ability to generalize.
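To make the idea of an RBFN with a multiple logistic regression output more concrete, here is a minimal NumPy sketch. It assumes Gaussian basis functions with per-dimension widths, a softmax output, and output weights fitted in one step by least squares through the pseudo-inverse; the helper names, the bias column, and the toy data are illustrative assumptions, and the optimization of centers and widths is left out.

```python
import numpy as np

def hidden_layer(Y0, centers, widths):
    """Gaussian radial basis activations, shape (P, H1)."""
    Z = (Y0[:, None, :] - centers[None, :, :]) / widths[None, :, :]
    return np.exp(-0.5 * (Z ** 2).sum(axis=2))

def add_bias(Phi):
    """Prepend a column of ones so the output layer has a bias term."""
    return np.hstack([np.ones((Phi.shape[0], 1)), Phi])

def softmax(G):
    """Row-wise softmax (multiple logistic regression output)."""
    G = G - G.max(axis=1, keepdims=True)  # shift for numerical stability
    E = np.exp(G)
    return E / E.sum(axis=1, keepdims=True)

def fit_output_weights(Phi, D):
    """Least-squares output weights via the pseudo-inverse."""
    return np.linalg.pinv(add_bias(Phi)) @ D

def predict(Y0, centers, widths, W):
    """Class probabilities for each input row."""
    return softmax(add_bias(hidden_layer(Y0, centers, widths)) @ W)

# Toy data: two classes, one basis function centered on each class
Y0 = np.array([[0.0, 0.0], [0.1, 0.0], [3.0, 3.0], [3.1, 3.0]])
D = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])  # one-hot targets
centers = np.array([[0.0, 0.0], [3.0, 3.0]])
widths = np.ones((2, 2))
W = fit_output_weights(hidden_layer(Y0, centers, widths), D)
probs = predict(Y0, centers, widths, W)
```

Fitting the linear weights against one-hot targets is a simplification: it treats the output layer as linear during the fit and applies the softmax only at prediction time.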
RBFN structure: $H_0$ inputs, one for each of the parameters, $H_1$ neurons of the first layer, and $H_2$ output neurons.

We define the vector of input data for the $k$-th layer of the neural network (or the vector of output data for layer $k-1$) as $Y^{(k)} = [Y_1^{(k)}, \dots, Y_{H_{k-1}}^{(k)}]^T$. We define the vector of coordinates of the centers of the activation function for the hidden layer as $c_j = [c_{j1}, c_{j2}, \dots, c_{jH_0}]^T$, where $j = 1..H_1$, and the vector specifying the window width of the activation function of the $j$-th neuron of the hidden layer as $\sigma_j = [\sigma_{j1}, \sigma_{j2}, \dots, \sigma_{jH_0}]^T$. Then the activation function for the neurons of the hidden layer has the form:

$$\varphi_{pj}(Y_p^{(0)}, c_j, \sigma_j) = \exp\left( -\frac{1}{2} \sum_{h=1}^{H_0} Z_{pjh}^2 \right),$$

where $Z_{pjh} = \dfrac{Y_{ph}^{(0)} - c_{jh}}{\sigma_{jh}}$; $w_{ij}$ is the weighted connection between the $i$-th neuron of the output layer and the $j$-th neuron of the input layer.

Multiple logistic regression [17] is used as the activation function of the output layer, the outputs of which are defined as:

$$\frac{\exp(\gamma_j)}{\sum_{k=1}^{H_2} \exp(\gamma_k)}, \quad \text{where } \gamma_j = \sum_{i=1}^{H_1} w_{ji} \varphi_i.$$

A hybrid algorithm was used for training the RBFN. It includes two steps, the repetition of which usually leads to fast training of the network, especially if the parameters are successfully generated [18]:
1) selection of the linear network parameters (weights) using the pseudo-inversion method;
2) optimization of the nonlinear parameters of the activation functions (window centers and widths).

Given $P$ training pairs $(Y_p^{(0)}, d_p)$, $p = 1..P$, and fixed values of the centers and window widths of the activation functions, we get a system of equations:

$$\Phi w_i = d_i, \quad i = 1..H_2,$$

where $\Phi = [\varphi_{pj}]$, $p = 1..P$, $j = 0..H_1$, $\varphi_{p0} = 1$, $w_i = [w_{i0}, w_{i1}, \dots, w_{iH_1}]^T$, $d_i = [d_{i0}, d_{i1}, \dots, d_{ip}]^T$.

The vector $w_i$ can be determined in one step using the pseudo-inverse $\Phi^{+}$ of the matrix: $w_i = \Phi^{+} d_i$, which in practice is calculated using eigenvalue decomposition.

At the second stage of the algorithm, with the weights fixed, the excitation signal passes through the network to the output level, which makes it possible to calculate the error value for the sequence of vectors $\{Y_p^{(0)}\}$. After that, there is a return to the hidden layer. The gradient vector of the objective function with respect to the centers and window widths is determined from the error value $\|Y - d\|_{L_2}^2$.

To determine the values of the window widths, the algorithm for forming the “coverage zone” by radial basis functions of the $k$ nearest neighbors was used:

$$\sigma_j^2 = \frac{1}{K} \sum_{k=1}^{K} \sum_{h=1}^{H_0} (c_{jh} - c_{kh})^2, \quad k = 1..K, \quad K \in [3, 5],$$

which helped reduce the training time of the RBFN.

CHARACTERISTICS INFORMATIVENESS ANALYSIS METHOD

Since it is proposed to use the RBFN to solve the classification problem, this model can also be used to find the minimum possible subset of informative variables. The input data set can be represented as a Taylor series, keeping only the terms of the first infinitesimal order. For the variance of an arbitrary linear function of several random variables, the following estimate is valid:

$$D_{Y_i} = (\operatorname{grad} Y_i)^T\, \Sigma_S\, \operatorname{grad} Y_i = \sum_{j=1}^{J} \left( \frac{\partial Y_i}{\partial s_j} \right)^2 \sigma_{S_j}^2 + \sum_{j=1}^{J} \sum_{l=1,\, l \neq j}^{J} r_{jl}\, \frac{\partial Y_i}{\partial s_j} \frac{\partial Y_i}{\partial s_l}\, \sigma_{S_j} \sigma_{S_l},$$

where $\Sigma_S$ is the covariance matrix of the variables $S$; $\sigma_{S_j}$, $\sigma_{S_l}$ are standard deviations; $r_{jl}$ is the correlation coefficient between the variables $S_j$ and $S_l$.

Then the standard deviation and variance of the RBFN output can be estimated according to the architecture chosen for it, and the energy of the signals can be determined from them by the expression [18]:

$$E_i = \sum_{h=1}^{H_0} D_{Y_i^{(2)} \mid Y_h^{(0)}},$$

where

$$D_{Y_i^{(2)} \mid Y_h^{(0)}} = \left( \frac{\partial Y_i^{(2)}}{\partial Y_h^{(0)}} \right)^2 \sigma_{Y_h^{(0)}}^2 + \sum_{n=1,\, n \neq h}^{H_0} r_{hn}\, \frac{\partial Y_i^{(2)}}{\partial Y_h^{(0)}} \frac{\partial Y_i^{(2)}}{\partial Y_n^{(0)}}\, \sigma_{Y_h^{(0)}} \sigma_{Y_n^{(0)}}.$$
Then the coefficient of informativeness of the variables (the weight of the contribution of $Y_h^{(0)}$ to $Y_i^{(2)}$) is defined by the ratio $D_{Y_i^{(2)} \mid Y_h^{(0)}} / E_i$.

DATA PRE-PROCESSING METHODS

In machine learning problems, it has become common practice to use data pre-processing methods (normalization, cleaning from anomalies, and dimensionality reduction) to improve the quality of problem solving [19].

Three methods from the scikit-learn Python library were used for data normalization:
– RobustScaler. Scales parameters in a way that is robust to statistical outliers.
– StandardScaler (Z-score normalization). Removes the mean and scales to unit variance.
– MinMaxScaler (min-max normalization). Each parameter is scaled and translated individually by the estimator so that it falls within a given range, for example [0, 1].

Anomaly detection is the detection of unusual elements, events, or observations that differ significantly from the main body of data and do not correspond to a well-defined notion of normal behavior [20]. Data cleaning techniques remove the values that anomaly detection has identified as outliers. Two outlier detection methods from the scikit-learn library were used:
– Interquartile Range (IQR). The data set is divided into quartiles, and the range between them is used to measure variability.
– Isolation forest. The method uses isolation (how far a data point is from the rest of the data) to find anomalies [21; 22].

The dimensionality reduction process aims to provide a lower-dimensional representation of the original data set while preserving its important characteristics. The scikit-learn and PyTorch libraries were used for dimensionality reduction.
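Before turning to the individual reduction methods, the normalization and IQR cleaning steps listed above can be sketched in plain NumPy. The functions below are hand-written stand-ins that mirror what `RobustScaler`, `StandardScaler`, and `MinMaxScaler` compute with their default settings; the function names, the multiplier `k = 1.5`, and the synthetic data are assumptions of the example, and the isolation forest is not reproduced.

```python
import numpy as np

def robust_scale(X):
    """Center by the median and scale by the interquartile range."""
    q1, med, q3 = np.percentile(X, [25, 50, 75], axis=0)
    return (X - med) / (q3 - q1)

def standard_scale(X):
    """Z-score normalization: zero mean and unit variance per column."""
    return (X - X.mean(axis=0)) / X.std(axis=0)

def minmax_scale(X):
    """Rescale each column to the range [0, 1]."""
    lo, hi = X.min(axis=0), X.max(axis=0)
    return (X - lo) / (hi - lo)

def iqr_mask(X, k=1.5):
    """True for rows whose every feature lies in [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = np.percentile(X, [25, 75], axis=0)
    iqr = q3 - q1
    inside = (X >= q1 - k * iqr) & (X <= q3 + k * iqr)
    return inside.all(axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))   # synthetic stand-in for the real indicators
X[0] = 40.0                    # one obvious anomaly
X_clean = X[iqr_mask(X)]       # drop rows flagged by the IQR rule
X_scaled = standard_scale(X_clean)
```

Cleaning before scaling matters for the mean-based and range-based scalers, since a single extreme value would otherwise dominate the estimated mean, variance, or range.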
Three dimensionality reduction methods were used:
– T-distributed Stochastic Neighbor Embedding (t-SNE) [23].
– Principal Component Analysis (PCA). The method is based on SVD and reduces the dimensionality of the data well [24].
– Autoencoder. This is a type of feed-forward neural network in which the input matches the output. It compresses the input data into a bottleneck (lower-dimensional data) and then reconstructs the output data from that representation. The bottleneck is the target compact summary, or dimensionality reduction, of the input data, also called the latent space representation.

APPLICATION OF METHODOLOGY FOR COUNTRIES DIGITAL DEVELOPMENT DATA ANALYSIS

The developed methodology was tested by identifying the state of digital development of the countries of the world. For the classification (positioning of countries) by the level of their digital development, the hypothesis of the existence of homogeneous groups of countries (objects) according to specialized indices was tested. Indices that fully reflect the state of digital development were selected:
– $EGI_{it}$: Global E-Government Development Index;
– $NRI_{it}$: network readiness index;
– $ICT_{it}$: information and communication technologies development index.

By forecasting the independent factors (the indicators of digital development) with the model, it is possible to estimate the forecast level of social progress of a specific country. The Social Progress Index (SPI) is a combined indicator of the international research project The Social Progress Imperative [25; 26], which measures the achievements of the countries of the world in terms of social well-being and social progress. The authors of the study [25; 26] believe that indicators of social development are often considered as an alternative to indicators of economic development.
The global e-government development index [26] is an integral indicator that assesses the readiness and capabilities of national government structures in using information and communication technologies (ICT) to provide public services to citizens. The index of network readiness [26] characterizes the level of development of information and communication technologies and the network economy in the countries of the world. Currently, the index is considered one of the most important indicators of the innovative and technological potential of the countries of the world and their development opportunities in the field of high technology and the digital economy. The ICT Development Index is a composite index that combines 11 indicators and is used to monitor and compare the development of information and communication technologies (ICT) between countries.

To implement the model, a sample of 115 precedents (observations by country) was collected, with 32 variables of the state of social development for each precedent and a 33rd field for the predictive value of the state. The ratio of the values of the social progress index $SPI_t$ (Social Progress Index) and the average level of income was used to label the training sample. All precedents of the sample were distributed according to the respective states:
– “High income”: 45 precedents (I);
– “Upper middle income”: 11 precedents (II);
– “Lower middle income”: 25 precedents (III);
– “Lower income”: 34 precedents (IV).

For this sample, the data were first pre-processed: normalization and detection of anomalous values. Clustering was performed for the considered economic data, and classification was performed to verify its results. It was decided to use the method based on the Kullback–Leibler distance. As a result of its application, an accuracy of 84.3% was achieved, and the value of the cost function was 0.0117.
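The reported accuracy can be cross-checked against the confusion matrix given in Table 1. The short sketch below takes the rows as actual classes I to IV and the columns as predicted classes, an orientation inferred from the fact that the row sums reproduce the class sizes 45, 11, 25, and 34; the variable names are illustrative.

```python
import numpy as np

# Rows: actual class I-IV, columns: predicted class I-IV (values from Table 1)
table1 = np.array([
    [37, 1, 0, 7],
    [1, 8, 1, 1],
    [2, 0, 21, 2],
    [0, 1, 2, 31],
])

# Overall accuracy: correctly classified precedents over all precedents
accuracy = np.trace(table1) / table1.sum()

# Per-class recall: share of each actual class that was recognized
recall = np.diag(table1) / table1.sum(axis=1)
```

Here 97 of the 115 precedents lie on the diagonal, which gives an accuracy of about 0.843, in line with the value stated in the text.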
A confusion matrix (the "matrix of inconsistencies", Table 1) was also constructed to assess the accuracy of the method, together with a graph of the cost function values (Fig. 1) and ROC curves for each of the classes (Fig. 2).

Table 1. Confusion matrix for the classification of the digital development indicators of the countries of the world

Actual class    Predicted class
                I     II    III   IV
I               37    1     0     7
II              1     8     1     1
III             2     0     21    2
IV              0     1     2     31

Fig. 1. Relationship between the number of clusters and the value of the cost function for the economic indicators of the countries of the world

Fig. 2. ROC curves for each of the classes (curves 1 to 4) for the economic indicators of the countries of the world

After a series of experiments, an autoencoder was applied to reduce the dimensionality of the data while retaining 98% of the information, which reduced each case from 32 to 11 state variables. With the compressed data, an accuracy of 86.9% was achieved, and the value of the cost function became -0.04827. A confusion matrix (Table 2), a graph of the cost function values (Fig. 3), and ROC curves for each of the classes (Fig. 4) were again constructed to assess the accuracy of the method.

Table 2. Confusion matrix for the classification of the compressed digital development indicators of the countries of the world

Actual class    Predicted class
                I     II    III   IV
I               38    0     0     7
II              0     10    0     1
III             1     0     21    3
IV              0     1     2     31

Fig. 3. Relationship between the number of clusters and the value of the cost function for the compressed economic indicators of the countries of the world

Fig. 4. ROC curves for each of the classes (curves 1 to 4) for the compressed economic indicators of the countries of the world

For multiclass classification with the RBFN, the reduced-dimension digital development data produced by the autoencoder were used. To test the model's ability to generalize, the data were first normalized and then divided into test and training samples in the ratio of 20% (22 precedents) to 80% (93 precedents), respectively.

The RBFN receives 7 state variables without a defined value at the input and produces estimates of the state variable values (4 states) at the output. The proposed RBFN has $H_0 = 7$ inputs (one for each of the parameters), $H_1 = 90$ neurons in the first layer, and $H_2 = 4$ output neurons. Training gave an accuracy of 83.87% on the training sample and 68.18% on the test sample. To display the test results, a confusion matrix was constructed for the training sample (Table 3) and ROC curves were plotted (Fig. 5). The curves cover a smaller area (i.e., indicate worse classification ability), because only part of the data was used for training, which reduced the RBFN's ability to generalize.

Table 3. Misclassification matrix for the compressed digital development indicators of the countries of the world

Actual class    Predicted class
                I     II    III   IV
I               8     0     0     1
II              0     2     0     1
III             0     4     1     0
IV              2     0     0     4

Fig. 5. ROC curves for each of the classes (curves 1 to 4) for the PCA test sample of the compressed digital development indicators of the countries of the world

A sensitivity analysis of the objective function was also carried out, i.e., the most informative indicators were determined. The results are shown in Table 4. They show that a different set of variables is informative for each cluster.

Table 4. Sensitivity analysis of the objective functions of the clusters

Cluster   Number of precedents   Sensitive cluster variables   Mathematical expectation of the objective function
0         45                     TII, ICT, HCI                 85.33
1         11                     TII, ICT, EGI                 52.87
2         25                     EPI, HCI, OSI                 63.47
3         34                     HCI, EPI, EGI                 73.60

All numerical studies were carried out using the computer program “Nonlinear estimation methods in the multicriterion problems of system’s robust optimal designing and diagnosing under parametric apriority uncertainty (methodology, methods and computer decision support and making system)” (ROD&IDS), developed by the authors [27].

CONCLUSIONS

Methods of intelligent data flow processing are widely used to identify the states of economic objects. New methods supplement the available tools for solving current data processing problems, make the methods more robust to the nature of the data, and improve the use of computing resources.

The presented study examines the problem of improving the methods for classifying and clustering countries according to their state of social and digital development. A multiclass classification method based on radial basis neural networks and a data clustering method based on an agent-oriented modification of the c-means method are proposed.

The proposed RBFN uses multiple logistic regression as the last layer for multiclass classification and the training results of an agent-oriented clustering model as input parameters.
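The RBFN structure described here (a radial basis layer followed by multiple logistic regression) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' ROD&IDS implementation: it assumes Gaussian basis functions with one shared width, batch gradient descent for the last layer, synthetic data with 7 inputs and 4 classes, and class means standing in for the centres that the agent-oriented clustering step would supply. The names `rbf_features`, `fit_softmax_layer`, and `predict`, and all hyperparameters, are hypothetical.

```python
import numpy as np

def rbf_features(X, centers, width):
    """Gaussian radial basis activations for each (sample, centre) pair."""
    # Squared Euclidean distance from every sample to every centre.
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    return np.exp(-d2 / (2.0 * width ** 2))

def softmax(Z):
    """Row-wise softmax, shifted for numerical stability."""
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

def fit_softmax_layer(Phi, y, n_classes, lr=0.5, epochs=500):
    """Multinomial logistic regression trained by batch gradient descent."""
    n, h = Phi.shape
    W = np.zeros((h, n_classes))
    b = np.zeros(n_classes)
    Y = np.eye(n_classes)[y]          # one-hot targets
    for _ in range(epochs):
        P = softmax(Phi @ W + b)
        G = (P - Y) / n               # gradient of mean cross-entropy w.r.t. logits
        W -= lr * (Phi.T @ G)
        b -= lr * G.sum(axis=0)
    return W, b

def predict(X, centers, width, W, b):
    """Most probable class for each sample."""
    return softmax(rbf_features(X, centers, width) @ W + b).argmax(axis=1)

# Synthetic stand-in data (assumption): 4 separated groups in 7 dimensions.
rng = np.random.default_rng(0)
true_centers = rng.normal(scale=3.0, size=(4, 7))
y = np.repeat(np.arange(4), 30)
X = true_centers[y] + rng.normal(scale=0.3, size=(120, 7))

# Assumption: class means replace the centres from the clustering step.
centers = np.stack([X[y == k].mean(axis=0) for k in range(4)])
W, b = fit_softmax_layer(rbf_features(X, centers, width=2.0), y, n_classes=4)
train_accuracy = (predict(X, centers, 2.0, W, b) == y).mean()
print(f"training accuracy on synthetic data: {train_accuracy:.2f}")
```

Taking the basis centres from a prior clustering step keeps the trainable part of the network linear in its parameters, so the last layer reduces to an ordinary softmax regression on the basis activations.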
The distinguishing feature of the modification of the c-means method is the introduction of elite selection of clusters.

Based on the results of the research, the methodology is proposed for the analysis of economic systems to improve the quality of decision-making. It should be noted, however, that the method requires a carefully prepared sample that covers the largest possible space of input parameters for the target classes.

REFERENCES

1. Mei Yang, Ming K. Lim, Yingchi Qu, Du Ni, and Zhi Xiao, “Supply chain risk management with machine learning technology: A literature review and future research directions,” Computers & Industrial Engineering, vol. 175, 108859, January 2023. Available: https://doi.org/10.1016/j.cie.2022.108859
2. Benjamin Decardi-Nelson and Jinfeng Liu, “Robust Economic Model Predictive Control with Zone Control,” IFAC-PapersOnLine, vol. 54, issue 3, pp. 237–242, 2021. Available: https://doi.org/10.1016/j.ifacol.2021.08.248
3. M. Schlesinger and V. Hlavac, Ten lectures on statistical and structural pattern recognition. Springer, Dordrecht, 2002. doi: 10.1007/978-94-017-3217-8.
4. Data clustering: algorithms and applications, Charu C. Aggarwal and Chandan K. Reddy (eds.). CRC Press, Taylor & Francis Group, 2014.
5. N. Bakumenko, V. Strilets, and M. Ugryumov, “Application of the C-Means Fuzzy Clustering Method for the Patient’s State Recognition Problems in the Medicine Monitoring Systems,” CEUR Workshop Proceedings of 3rd International Conference on Computational Linguistics and Intelligent Systems, COLINS 2019, vol. I, pp. 218–227, 2019. Available: https://www.researchgate.net/publication/338819685
6. R. Winkler, F. Klawonn, and R. Kruse, “Problems of Fuzzy c-Means Clustering and Similar Algorithms with High Dimensional Data Sets,” Challenges at the Interface of Data Analysis, Computer Science and Optimization, pp. 79–87, 2012. doi: 10.1007/978-3-642-24466-7_9.
7. Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, Introduction to information retrieval. Cambridge University Press, 2008.
8. S. Askari, “Fuzzy C-Means clustering algorithm for data with unequal cluster sizes and contaminated with noise and outliers: Review and development,” Expert Systems with Applications, vol. 165, article no. 113856, 2020. doi: 10.1016/j.eswa.2020.113856.
9. Xuemei Zhao, Yu Li, and Quanhua Zhao, “Mahalanobis distance based on fuzzy clustering algorithm for image segmentation,” Digital Signal Processing, vol. 43, pp. 8–16, Aug 2015. Available: https://doi.org/10.1016/j.dsp.2015.04.009
10. M. Zarinbal, M.H. Fazel Zarandi, and I.B. Turksen, “Relative entropy fuzzy c-means clustering,” Information Sciences, vol. 260, pp. 74–97, 2014. doi: 10.1016/j.ins.2013.11.004.
11. V. Strilets, V. Donets, M. Ugryumov, R. Zelenskyi, and T. Goncharova, “Agent-Oriented data clustering for medical monitoring,” Radioelectronic and Computer Systems, no. 1, pp. 103–114, 2022. Available: https://doi.org/10.32620/reks.2022.1.08
12. Meng Xing, Yanbo Zhang, Hongmei Yu, Zhenhuan Yang, and Xueling Li, “Predict DLBCL patients’ recurrence within two years with Gaussian mixture model cluster oversampling and multi-kernel learning,” Computer Methods and Programs in Biomedicine, vol. 226, 107103, 2022. Available: https://doi.org/10.1016/j.cmpb.2022.107103
13. Lynne A. Kvapil, Mark W. Kimpel, Rasitha R. Jayasekare, and Kim Shelton, “Using Gaussian mixture model clustering to explore morphology and standardized production of ceramic vessels: A case study of pottery from Late Bronze Age Greece,” Journal of Archaeological Science: Reports, vol. 45, 103543, 2022. Available: https://doi.org/10.1016/j.jasrep.2022.103543
14. Meng Yinfeng, Jiye Liang, Fuyuan Cao, and Yijun He, “A new distance with derivative information for functional k-means clustering algorithm,” Information Sciences, vol. 463–464, pp. 166–185, 2018. Available: https://doi.org/10.1016/j.ins.2018.06.035
15. Xinmin Tao, Ruotong Wang, Rui Chang, and Chenxi Li, “Density-sensitive fuzzy kernel maximum entropy clustering algorithm,” Knowledge-Based Systems, vol. 166, pp. 42–57, 2019. Available: https://doi.org/10.1016/j.knosys.2018.12.007
16. K. Møllersen, S. Dhar, and F. Godtliebsen, “On Data-Independent Properties for Density-Based Dissimilarity Measures in Hybrid Clustering,” Applied Mathematics, vol. 7, no. 15, pp. 1674–1706, 2016. doi: 10.4236/am.2016.715143.
17. Ian Goodfellow, Yoshua Bengio, and Aaron Courville, “Softmax Units for Multinoulli Output Distributions,” in Deep Learning. MIT Press, 2016.
18. V.E. Strilets et al., Methods of machine learning in the problems of system analysis and decision making: monograph. Karazin Kharkiv National University, 2020, 195 p.
19. Farbod Farhangi, “Investigating the role of data preprocessing, hyperparameters tuning, and type of machine learning algorithm in the improvement of drowsy EEG signal modeling,” Intelligent Systems with Applications, vol. 15, 200100, September 2022. Available: https://doi.org/10.1016/j.iswa.2022.200100
20. Arthur Zimek and Peter Filzmoser, “There and back again: Outlier detection between statistical reasoning and data mining algorithms,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(6), 2018. doi: 10.1002/widm.1280.
21. Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou, “Isolation-Based Anomaly Detection,” ACM Transactions on Knowledge Discovery from Data, 6(1), pp. 1–39, 2012. doi: 10.1145/2133360.2133363.
22. O.Yu. Lykhach, M.L. Ugryumov, D.O. Shevchenko, and S.I. Shmatkov, “Methods of detecting emissions in test samples during process control in state-based systems,” Bulletin of Karazin Kharkiv National University, ser. “Mathematical modeling. Information Technology. Automated control systems”, no. 53, pp. 21–40, 2022.
23. L.J.P. van der Maaten and G.E. Hinton, “Visualizing Data Using t-SNE,” Journal of Machine Learning Research, 9, pp. 2579–2605, 2008.
24. Ian T. Jolliffe and Jorge Cadima, “Principal component analysis: a review and recent developments,” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374(2065), 20150202, 2016. doi: 10.1098/rsta.2015.0202.
25. L. Chagovets, N. Chernova, T. Klebanova, O. Dorokhov, and A. Didenko, “Selective Adaptive Model for Forecasting of Regional Development Unevenness Indexes,” Proceedings of the Workshop on the XII International Scientific Practical Conference Modern problems of social and economic systems modelling (MPSESM-W 2020), Kharkiv, Ukraine, June 25, 2020, pp. 58–76.
26. L.O. Chagovets, S.V. Prokopovych, S.M. Vozniuk, and V.V. Chahovets, “Conceptual basis of modeling telecommunication development of regions by methods of system analysis,” Municipal economy of cities, vol. 1, no. 161, pp. 230–240, 2021.
27. Computer program “Nonlinear estimation methods in the multicriterion problems of system’s robust optimal designing and diagnosing under parametric apriority uncertainty (methodology, methods and computer decision support and making system)” (“ROD&IDS”): Copyright registration certificate no. 82875 / M.L. Ugryumov, Y.S. Meniaylov, S.V. Chernysh, K.M. Ugryumova (Ukraine). Copyright and related rights. Official bulletin. Ministry of Economic Development and Trade of Ukraine. 2018, no. 51, p. 403.

Received 30.06.2023

INFORMATION ON THE ARTICLE

Volodymyr V. Donets, ORCID: 0000-0002-5963-9998, V.N. Karazin Kharkiv National University, Ukraine, e-mail: v.donets@karazin.ua
Viktoriia Y. Strilets, ORCID: 0000-0002-2475-1496, V.N. Karazin Kharkiv National University, Ukraine, e-mail: viktoria.strilets@karazin.ua
Mykhaylo L. Ugryumov, ORCID: 0000-0003-0902-2735, V.N. Karazin Kharkiv National University, Ukraine, e-mail: m.ugryumov@karazin.ua
Dmytro O. Shevchenko, ORCID: 0000-0002-7897-250X, V.N. Karazin Kharkiv National University, Ukraine, e-mail: dimyich24@gmail.com
Svitlana V. Prokopovych, ORCID: 0000-0002-6333-2139, Simon Kuznets Kharkiv National University of Economics, Ukraine, e-mail: prokopovichsv@gmail.com
Liubov O. Chagovets, ORCID: 0000-0003-4064-9712, Simon Kuznets Kharkiv National University of Economics, Ukraine, e-mail: liubov.chahovets@hneu.net

METHODOLOGY OF THE COUNTRIES’ ECONOMIC DEVELOPMENT DATA ANALYSIS / V.V. Donets, V.Y. Strilets, M.L. Ugryumov, D.O. Shevchenko, S.V. Prokopovych, L.O. Chagovets

Abstract. The paper examines the issue of improving the methods of identification of economic objects and their analysis using algorithms of intelligent data processing. The use of the developed methodology in economic analysis improves the quality of management and can be the basis for creating decision support systems to prevent potentially dangerous changes in the economic state of the research object. An improved c-means data clustering method with an agent-oriented modification is proposed. A radial basis neural network is proposed to determine whether the obtained clusters correspond to the actual ones, and its extension is proposed to analyze the informativeness of state variables and obtain a subset of informative variables. The effect of data compression using an autoencoder on the accuracy of the methods is considered. Testing of the developed methodology proved a reduction in the probability of incorrect state determination when identifying the states of economic systems, and a reduced value of the error of the third kind was obtained when classifying the states of objects.
Keywords: machine learning, digital development, fuzzy clustering, radial basis neural networks, logistic regression, analysis of variables informativeness.
id journaliasakpiua-article-297208
institution System research and information technologies
keywords_txt_mv keywords
language English
last_indexed 2025-07-17T10:28:25Z
publishDate 2023
publisher The National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute"
record_format ojs
resource_txt_mv journaliasakpiua/a1/429c42cc1f71ad95bfe3aee2496b35a1.pdf
spelling journaliasakpiua-article-2972082024-02-01T21:03:07Z Methodology of the countries’ economic development data analysis Методологія аналізу даних економічного розвитку країн Donets, Volodymyr Strilets, Viktoriia Ugryumov, Mykhaylo Shevchenko, Dmytro Prokopovych, Svitlana Chagovets, Liubov машинне навчання цифровий розвиток нечітка кластеризація радіально базисні нейромережі логістична регресія аналіз інформативності змінних machine learning digital development fuzzy clustering radial basis neural networks logistic regression analysis of variables informativeness The paper examines the issue of improving the methods of identification of economic objects and their analysis using algorithms of intelligent data processing. The use of the developed methodology in the economic analysis allows for improvement in the quality of management. It can be the basis for creating decision support systems to prevent potentially dangerous changes in the economic status of the research object. In this work, an improved method of c-means data clustering with agent-oriented modification is proposed, and a radial-basis neural network and its extension are proposed to determine whether the obtained clusters are relevant and to analyze the informativeness of state variables and obtain a subset of informative variables. The effect of applying data compression using an autoencoder on the accuracy of the methods is also considered. According to the results of testing of the developed methodology, it was proved that the probability of incorrect determination of the state was reduced when identifying the states of economic systems, and a reduced value of the error of the third kind was obtained when classifying the states of objects. Досліджено питання удосконалення методів ідентифікації економічних об’єктів та їх аналізу з використанням алгоритмів інтелектуального оброблення даних. 
Використання розробленої методології в економічному аналізі дозволяє підвищити якість управління та може бути основою для створення систем підтримання прийняття рішень для попередження потенційно небезпечних змін економічного стану об’єкта дослідження. Запропоновано удосконалений метод кластеризації даних c-середніх з агентно-орієнтованою модифікацією, для визначення відповідності отриманих кластерів актуальним пропонується радіально-базисна нейромережа та її розширення – для аналізу інформативності змінних стану й отримання підмножини інформативних змінних. Розглянуто вплив застосування стиснення даних за допомогою автокодувальника на точність застосування методів. За результатами тестування розробленої методології було доведено зменшення ймовірності неправильного визначення стану під час ідентифікації станів економічних систем та отримано зменшене значення помилки третього роду під час класифікації станів об’єктів. The National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute" 2023-12-26 Article Article application/pdf https://journal.iasa.kpi.ua/article/view/297208 10.20535/SRIT.2308-8893.2023.4.02 System research and information technologies; No. 4 (2023); 21-36 Системные исследования и информационные технологии; № 4 (2023); 21-36 Системні дослідження та інформаційні технології; № 4 (2023); 21-36 2308-8893 1681-6048 en https://journal.iasa.kpi.ua/article/view/297208/290123
spellingShingle машинне навчання
цифровий розвиток
нечітка кластеризація
радіально базисні нейромережі
логістична регресія
аналіз інформативності змінних
Donets, Volodymyr
Strilets, Viktoriia
Ugryumov, Mykhaylo
Shevchenko, Dmytro
Prokopovych, Svitlana
Chagovets, Liubov
Методологія аналізу даних економічного розвитку країн
title Методологія аналізу даних економічного розвитку країн
title_alt Methodology of the countries’ economic development data analysis
title_full Методологія аналізу даних економічного розвитку країн
title_fullStr Методологія аналізу даних економічного розвитку країн
title_full_unstemmed Методологія аналізу даних економічного розвитку країн
title_short Методологія аналізу даних економічного розвитку країн
title_sort методологія аналізу даних економічного розвитку країн
topic машинне навчання
цифровий розвиток
нечітка кластеризація
радіально базисні нейромережі
логістична регресія
аналіз інформативності змінних
topic_facet машинне навчання
цифровий розвиток
нечітка кластеризація
радіально базисні нейромережі
логістична регресія
аналіз інформативності змінних
machine learning
digital development
fuzzy clustering
radial basis neural networks
logistic regression
analysis of variables informativeness
url https://journal.iasa.kpi.ua/article/view/297208
work_keys_str_mv AT donetsvolodymyr methodologyofthecountrieseconomicdevelopmentdataanalysis
AT striletsviktoriia methodologyofthecountrieseconomicdevelopmentdataanalysis
AT ugryumovmykhaylo methodologyofthecountrieseconomicdevelopmentdataanalysis
AT shevchenkodmytro methodologyofthecountrieseconomicdevelopmentdataanalysis
AT prokopovychsvitlana methodologyofthecountrieseconomicdevelopmentdataanalysis
AT chagovetsliubov methodologyofthecountrieseconomicdevelopmentdataanalysis
AT donetsvolodymyr metodologíâanalízudanihekonomíčnogorozvitkukraín
AT striletsviktoriia metodologíâanalízudanihekonomíčnogorozvitkukraín
AT ugryumovmykhaylo metodologíâanalízudanihekonomíčnogorozvitkukraín
AT shevchenkodmytro metodologíâanalízudanihekonomíčnogorozvitkukraín
AT prokopovychsvitlana metodologíâanalízudanihekonomíčnogorozvitkukraín
AT chagovetsliubov metodologíâanalízudanihekonomíčnogorozvitkukraín