Methodology of the countries' economic development data analysis
The paper examines the issue of improving the methods of identification of economic objects and their analysis using algorithms of intelligent data processing. The use of the developed methodology in the economic analysis allows for improvement in the quality of management. It can be the basis for c...
| Date: | 2023 |
|---|---|
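| Main Authors: | Donets, Volodymyr; Strilets, Viktoriia; Ugryumov, Mykhaylo; Shevchenko, Dmytro; Prokopovych, Svitlana; Chagovets, Liubov |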
| Format: | Article |
| Language: | English |
| Published: | The National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute", 2023 |
| Subjects: | |
| Online Access: | https://journal.iasa.kpi.ua/article/view/297208 |
| Journal Title: | System research and information technologies |
Institution: System research and information technologies

| _version_ | 1867334441111650304 |
|---|---|
| author | Donets, Volodymyr Strilets, Viktoriia Ugryumov, Mykhaylo Shevchenko, Dmytro Prokopovych, Svitlana Chagovets, Liubov |
| author_facet | Donets, Volodymyr Strilets, Viktoriia Ugryumov, Mykhaylo Shevchenko, Dmytro Prokopovych, Svitlana Chagovets, Liubov |
| author_institution_txt_mv | [
{
"author": "Volodymyr Donets",
"institution": "V. N. Karazin Kharkiv National University, Kharkiv"
},
{
"author": "Viktoriia Strilets",
"institution": "V. N. Karazin Kharkiv National University, Kharkiv"
},
{
"author": "Mykhaylo Ugryumov",
"institution": "V. N. Karazin Kharkiv National University, Kharkiv"
},
{
"author": "Dmytro Shevchenko",
"institution": "V. N. Karazin Kharkiv National University, Kharkiv"
},
{
"author": "Svitlana Prokopovych",
"institution": "Simon Kuznets Kharkiv National University of Economics, Kharkiv"
},
{
"author": "Liubov Chagovets",
"institution": "Simon Kuznets Kharkiv National University of Economics, Kharkiv"
}
] |
| author_sort | Donets, Volodymyr |
| baseUrl_str | http://journal.iasa.kpi.ua/oai |
| collection | OJS |
| datestamp_date | 2024-02-01T21:03:07Z |
| description | The paper examines the issue of improving the methods of identification of economic objects and their analysis using algorithms of intelligent data processing. The use of the developed methodology in the economic analysis allows for improvement in the quality of management. It can be the basis for creating decision support systems to prevent potentially dangerous changes in the economic status of the research object. In this work, an improved method of c-means data clustering with agent-oriented modification is proposed, and a radial-basis neural network and its extension are proposed to determine whether the obtained clusters are relevant and to analyze the informativeness of state variables and obtain a subset of informative variables. The effect of applying data compression using an autoencoder on the accuracy of the methods is also considered. According to the results of testing of the developed methodology, it was proved that the probability of incorrect determination of the state was reduced when identifying the states of economic systems, and a reduced value of the error of the third kind was obtained when classifying the states of objects. |
| doi_str_mv | 10.20535/SRIT.2308-8893.2023.4.02 |
| first_indexed | 2025-07-17T10:28:25Z |
| format | Article |
| fulltext |
V.V. Donets, V.Y. Strilets, M.L. Ugryumov, D.O. Shevchenko, S.V. Prokopovych, L.O. Chagovets, 2023
System Research and Information Technologies, 2023, № 4
UDC 519.254:330.47
DOI: 10.20535/SRIT.2308-8893.2023.4.02
METHODOLOGY OF THE COUNTRIES’ ECONOMIC
DEVELOPMENT DATA ANALYSIS
V.V. DONETS, V.Y. STRILETS, M.L. UGRYUMOV, D.O. SHEVCHENKO,
S.V. PROKOPOVYCH, L.O. CHAGOVETS
Abstract. The paper examines how to improve methods for identifying and analyzing economic objects using intelligent data processing algorithms. Applying the developed methodology in economic analysis improves the quality of management, and it can serve as the basis for decision support systems that prevent potentially dangerous changes in the economic status of the research object. The work proposes an improved c-means data clustering method with an agent-oriented modification, together with a radial basis neural network and its extension, which are used to determine whether the obtained clusters are relevant, to analyze the informativeness of state variables, and to obtain a subset of informative variables. The effect of compressing the data with an autoencoder on the accuracy of the methods is also considered. Testing of the developed methodology showed that the probability of incorrectly determining the state was reduced when identifying the states of economic systems, and a reduced value of the error of the third kind was obtained when classifying the states of objects.
Keywords: machine learning, digital development, fuzzy clustering, radial basis
neural networks, logistic regression, analysis of variables informativeness.
INTRODUCTION
Analysis of the state of economic systems requires taking into account a large number of factors that develop stochastically and are highly dynamic. Continuous monitoring makes it possible to account for the influence of these factors and to maintain the stable functioning of economic systems under constant global fluctuations [1]. Machine learning methods make it possible to evaluate these factors and their possible and real impact on macroeconomic processes. Machine learning algorithms also allow the effects of factors that may threaten the stability of economic systems to be considered early [1].
Using intelligent methods to analyze collected economic data makes it possible to automate the solution of many problems in the management of economic processes [1], which significantly increases the quality and efficiency of that management. Automated systems of economic analysis are used as decision support systems to prevent potentially dangerous changes in the state of economic systems [1; 2]. Existing information systems of economic analysis have modules for clustering, classification or forecasting of the received data, based on machine learning methods, which improve the accuracy of the resulting decisions.
The aim of the study is to improve the quality of data stratification in the information analysis of economic systems by developing a methodology that includes methods of clustering, classification and analysis of the informativeness of economic data. The scientific objective of the study is to improve the existing methods of economic data analysis by introducing an agent-oriented modification of the clustering method and radial basis neural networks for analyzing the informativeness of state variables. The proposed methods are expected to reduce the probability of erroneous determination of the state in the analysis of the economic system, and thus to reduce the value of the error of the third kind in the classification of its state.
ISSN 1681–6048 System Research & Information Technologies, 2023, № 4
STATEMENT OF THE RESEARCH PROBLEM
The data obtained as a result of the study of the economic system can be presented in the form

$$X = \{x_{im}\},$$

where $i = \overline{1, N}$, $m = \overline{1, M}$; $X$ is the matrix representing the data sample for analysis; $N$ is the number of objects; $M$ is the dimension of the space.
The problem of analyzing the data that characterize the state of the economic system consists of solving a sequence of problems:
– division of a set of data into subsets that are similar according to certain characteristics (the task of clustering);
– determination of the current state of the economic system based on a set of characteristics (the task of classification);
– determination of the set of features that best describe the state of the economic system (the task of selecting informative features, that is, reduction of the feature space).
Let us consider the methods of solving each of these problems.
FUZZY DATA CLUSTERING METHOD
For some known set of valid clusters $Y$, the input data $X$ must be split into subsets (clusters, classes) indexed by $Y$, so that each cluster consists of objects that are close by some metric, or distant by another. Thus, each object will be assigned to the $y$-th cluster.
The result of the clustering algorithm [3] is a function $cluster: X \to Y$ that maps each object $x \in X$ of the input set to an element of the set of clusters. Usually, the set $Y$ is known in advance for a non-hierarchical approach, or is determined in the process for a hierarchical approach. Therefore, the question of determining the optimal number of clusters, as one of the parameters determining the final quality of clustering, often arises.
Let us define the distance between cluster objects as the metric for cluster analysis, and the degree of similarity of objects as the reciprocal of the inter-element distance. The literature on cluster analysis offers a large number of possible metrics for the inter-element distance or degree of similarity. The most widespread metric is the Euclidean distance, which is a special case of the Minkowski distance [4] with the parameter $p = 2$. The generalized Minkowski metric is

$$d(x_i, x_j) = \left(\sum_{m=1}^{M} |x_{im} - x_{jm}|^{p}\right)^{1/p}.$$
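As a small illustration of the Minkowski metric, the sketch below computes it for two vectors. The function name `minkowski` and the sample vectors are illustrative assumptions, not taken from the paper; the order parameter is written `p` here.

```python
def minkowski(x, y, p=2.0):
    """Minkowski distance of order p between two equal-length sequences."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(abs(a - b) ** p for a, b in zip(x, y)) ** (1.0 / p)


# Hypothetical sample vectors, chosen so the results are easy to check by hand.
x_i = [1.0, 2.0, 3.0]
x_j = [4.0, 6.0, 3.0]

manhattan = minkowski(x_i, x_j, p=1)  # order 1: sum of absolute differences
euclidean = minkowski(x_i, x_j, p=2)  # order 2: the Euclidean special case
```

With order 2 the function gives the Euclidean distance mentioned above, and with order 1 it gives the Manhattan distance.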
The c-means fuzzy clustering method allows fuzzy distribution of objects
into clusters or classes. In the c-means method, the object belongs to all clusters,
but with a certain value of cluster membership [5].
In the method of fuzzy clustering [6], the matrix of memberships of elements in clusters is calculated under the assumption of a normal distribution of the data, according to the formula

$$w_{ij} = \frac{N\big(d(x_i, c_j) \mid 0, \sigma_j\big)}{\sum_{i=1}^{P_j} N\big(d(x_i, c_j) \mid 0, \sigma_j\big)},$$

where $x_i$ is the $i$-th element of the set, $i \in (1; P_j)$; $c_j$ is the $j$-th cluster center; $d(x_i, c_j)$ is the distance between the points $x_i$ and $c_j$; $N(d(x_i, c_j) \mid 0, \sigma_j)$ is the probability density of a normal distribution at the point $d(x_i, c_j)$.
The cluster centers are adjusted according to the formula

$$c_j = \frac{\sum_{i=1}^{P_j} w_{ij}\, x_i}{\sum_{i=1}^{P_j} w_{ij}}. \qquad (1)$$
The center adjustment process continues until the loss function is minimized:

$$loss = \sum_{j=1}^{K}\sum_{i=1}^{P_j} d^2(x_i, c_j)\, w_{ij} \to \min, \qquad (2)$$

or until a limit on the number of iterations or the required classification quality is reached.
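The update cycle described above (memberships from a normal density of the distance, centers as membership-weighted means, then the loss) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the Euclidean distance, the fixed spread `sigma`, the synthetic two-blob data and the function names are all assumptions, and for simplicity every element contributes to every cluster rather than only the elements assigned to it.

```python
import numpy as np


def gaussian_membership(dist, sigma):
    """Normal density of each distance, normalised over the elements of each cluster."""
    dens = np.exp(-0.5 * (dist / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    return dens / dens.sum(axis=0, keepdims=True)


def fuzzy_step(X, centers, sigma=1.0):
    """One pass: memberships, center update as in (1), loss as in (2)."""
    # dist[i, j] is the Euclidean distance between element i and center j
    dist = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
    w = gaussian_membership(dist, sigma)
    # membership-weighted mean of the elements for every cluster
    new_centers = (w.T @ X) / w.sum(axis=0)[:, None]
    new_dist = np.linalg.norm(X[:, None, :] - new_centers[None, :, :], axis=2)
    loss = float(np.sum(w * new_dist ** 2))
    return w, new_centers, loss


# Synthetic data (assumption): two well-separated blobs of 20 points each.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0.0, 0.3, (20, 2)), rng.normal(5.0, 0.3, (20, 2))])
centers = np.array([[1.0, 1.0], [4.0, 4.0]])
for _ in range(10):  # fixed iteration limit as the stopping rule
    w, centers, loss = fuzzy_step(X, centers)
```

On such well-separated data the centers settle near the blob means after a few passes; on clusters of complex shape this basic scheme is exactly what the modifications discussed next try to improve.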
Important disadvantages of the c-means method are its inability to partition a space in which the target clusters have a complex shape that goes beyond simple M-dimensional spheres, and its insufficient robustness to noise [5; 7].
Data from real problems typically have both a complex distribution of object parameters and a high dimensionality of the input data, which in turn determines the complex form of the M-dimensional target clusters. Therefore, the usual method of fuzzy clustering and many of its modifications cannot cluster such data with high accuracy. A modification of the distance metric (together with the membership metric) is proposed in [8]. An interesting approach, proposed in [9; 10], assumes the Cauchy distribution and uses the Mahalanobis distance. The Mahalanobis distance was computed with an improved algorithm that prevents degeneracy of the inverse matrix [11]:

$$MD(x_i, c_j) = (x_i - c_j)^T\, \hat{\Sigma}^{-1} (x_i - c_j),$$

where $\hat{\Sigma}$ is the regularized covariance matrix, obtained from the covariance matrix $\Sigma$ with a constant greater than zero.
Taking into account the assumption of the Cauchy distribution in the data, the expression for calculating the value of belonging to a certain cluster [5] has the form

$$w_{ij} = \frac{\rho(x_i, c_j)}{\sum_{i=1}^{P_j} \rho(x_i, c_j)}, \qquad \rho(x_i, c_j) = \left[1 + MD^2(x_i, c_j)\right]^{-\frac{1}{2}}. \qquad (3)$$
The clustering problem for clusters of complex M-dimensional form was solved using Gaussian mixture models in [12; 13], using the derivative in [14], and using the Mahalanobis distance in [5; 9]. The reported results show an improvement in clustering accuracy, but problems of spatial separation and of overuse of dependence in the input data arise.
In [5; 15], the possibility of taking into account the relative entropy of the data distribution within the c-means method was considered, but the Euclidean distance was chosen as the distance between the objects of the sample, which reduced the computational load but did not take the entropy of the data into account.
To overcome the difficulties that the basic method of fuzzy clustering and its modifications based on mixture and Gaussian mixture models face on data with a complex shape of M-dimensional target clusters [12], an improved clustering method was proposed. It is based on an attempt to take into account the entropy of clusters [15] and the Kullback–Leibler distance [16].
The Kullback–Leibler distance is an asymmetric measure of the informational difference between two probability distributions. This measure has proven itself well in methods of information processing in physical systems and in statistics [16].
According to the previous definition, $x_{im} \in X$ is the $m$-th state variable of the $i$-th vector of the input data sample, where $m \in [1, M]$ and $M$ is the dimension of the state vector. Let us define $f_s \in F$ as the $s$-th object function from the vector of object functions, $s \in [1, S]$, where $S$ is the dimension of the object functions vector. Then $M(f_s)$ and $M(x_{im})$ are the mathematical expectations of $f_s$ and $x_{im}$, respectively. Likewise, $D(f_s)$ and $D(x_{im})$ are the variances of the corresponding variables, and $\sigma(f_s)$ and $\sigma(x_{im})$ are their standard deviations. The variance and standard deviation of the conditional dependence of $f_s$ on $x_{im}$ can be determined by the formulas

$$D(f_s \mid x_{im}) = \operatorname{var}\big(M(f_s(x_{in}))\big), \quad n \neq m, \quad x_{in} = const; \qquad (4)$$

$$\sigma(f_s \mid x_{im}) = \sqrt{D(f_s \mid x_{im})}. \qquad (5)$$

Using expression (4), we obtain estimates of the informativeness of the state variables:

$$\eta(f_s) = \frac{D(f_s \mid x_{im})}{E(f_s)},$$

where $E(f_s)$ is the signal energy.
From (5), we get the influence coefficient (signal-to-noise ratio):

$$SNR_{sm}(f_s \mid x_{im}) = \frac{\sigma(f_s \mid x_{im})}{\sigma(x_{im})}.$$
In [16], the Kullback–Leibler entropy is defined as follows:

$$D_{KL}(f_s, x_i) = \sum_{m=1}^{M} \rho(x_{im} \mid f_s)\, \log_2 \frac{\rho(x_{im} \mid f_s)}{\rho(x_{im})}.$$
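To make the Kullback–Leibler measure and its asymmetry concrete, here is a minimal sketch for two discrete distributions. The two probability vectors are hypothetical stand-ins for a conditional and an unconditional distribution of a state variable; the function name is likewise an assumption.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence of p from q, in bits, for discrete distributions."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    total = 0.0
    for p_m, q_m in zip(p, q):
        if p_m > 0.0:  # terms with zero probability contribute nothing
            total += p_m * math.log2(p_m / q_m)
    return total


# Hypothetical distributions over two outcomes.
conditional = [0.5, 0.5]
marginal = [0.25, 0.75]

d_forward = kl_divergence(conditional, marginal)
d_backward = kl_divergence(marginal, conditional)
```

The two directions give different values, which is the asymmetry noted in the text, and the divergence of a distribution from itself is zero.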
The mutual informative dependence is then determined by the formula

$$H_{sm} = \frac{1}{2}\log_2 SNR^2(f_s \mid x_{im}) = \frac{1}{2}\log_2\left(\eta(f_s)\,\frac{E(f_s)}{D(x_{im})}\right).$$
In the proposed method, the loss function (2) is replaced by the mutual informative dependence, which serves as the clustering quality measure, that is, as the loss function of the developed method of fuzzy clustering:

$$H(X, Y) = \frac{1}{P}\sum_{j=1}^{k} P\big(Y_j^{(t-1)}\big)\sum_{i=1}^{P_j} D_{KL}\big(x_i, Y_j^{(t-1)}\big) \to \min,$$

where $Y_j$ denotes the state variables belonging to the $j$-th cluster.
AGENT-ORIENTED MODIFICATION OF THE CLUSTERING METHOD
To overcome the non-priority problem, an agent-oriented modification was
developed for the classical method of fuzzy clustering considering the
M-dimensional spatial shape [3; 5], which is considered below.
Let us introduce special notation for the developed method of fuzzy clustering: $X$ denotes the agents that are elements of the input sample, $C$ the centers of clusters, $X_i$ the agents that are cluster elements, and $Z$ the cluster agents. In the agent-oriented approach, both the element vectors of the input sample and the clusters are agents: the element agents choose the cluster agents closest to them under a pre-specified metric and join them, thus forming cluster agents. The number of cluster agents is determined by minimizing the loss function. According to the previous definition, the input sample partitioned into clusters is $X = \{P_j\}$, where $j \in (1, K)$, $K \geq 1$; $N = \sum_{j=1}^{K} |P_j|$ is the number of elements in the input sample; $P_j$ is the set of elements belonging to the $j$-th cluster; and $K$ is the number of clusters. Then $x_{ij} \in P_j$ is the $i$-th element of the $j$-th cluster.
Four metrics were chosen to compare the possibilities of spatial separation of clusters and computational efficiency:

$$d(x_{ij}, c_j) = \begin{cases} d_1(x_{ij}, c_j), \\ w_{ij}^{-1}\, d_1(x_{ij}, c_j), \\ D_{KL}(x_{ij}, c_j), \\ p(x_{ij}, c_j^{t}) \cdot \log_2 p(x_{ij}, c_j^{t-1}), \end{cases} \qquad (6)$$

where $d_1(x_{ij}, c_j)$ is the Manhattan distance; $w_{ij}^{-1}\, d_1(x_{ij}, c_j)$ is the Mahalanobis distance with the inverse of the membership function; $D_{KL}(x_{ij}, c_j)$ is the Kullback–Leibler divergence; and $p(x_{ij}, c_j^{t}) \cdot \log_2 p(x_{ij}, c_j^{t-1})$ is the cross entropy.
Given the inter-element distance, we obtain an expression for the cost function of each cluster, that is, the average measure of the intraclass distance:

$$cl\_loss(P_j) = \frac{1}{|P_j|}\sum_{i=1}^{|P_j|} d(x_{ij}, c_j). \qquad (7)$$
Then, using expression (7), we obtain the general cost function for evaluating the current quality of clustering:

$$loss(X^t) = \frac{1}{K^t}\sum_{j=1}^{K^t} cl\_loss(P_j^t). \qquad (8)$$
By combining the classical method of fuzzy clustering with the agent-oriented approach described above, we obtain a statement of the research problem: it is necessary to determine the number of clusters and a distribution of elements over clusters such that the value of the cost function is minimal:

$$A = [K^t, X^t], \qquad \hat{A} = \arg\min_{A}\big(loss(X^t)\big).$$
According to the classical clustering method, cluster centers are optimized by expression (1), and the membership matrix used for the adjustment is calculated by expression (3), which takes the Cauchy distribution assumption into account. We formulate the clustering algorithm of the agent-oriented approach as follows:
1. Set an initial number of cluster agents $K^t > K$, that is, more than the target number of clusters; set the limit on the number of elements in each cluster, $|P_j^t| = N / K^t$; and randomly choose $K^t$ cluster centers $\{c_j\}$.
2. Using one of the inter-element distances (6), select the $|P_j^t|$ elements closest to each cluster, that is, form the cluster agents $P_j^t$.
3. For each cluster, calculate the parameters of the distribution $\rho(x_{ij} \mid P_j^t)$ and the values of the membership matrix by expression (3), and adjust the cluster centers by expression (1).
4. For each cluster center, choose $|P_j^t|$ new element agents according to the selected measure $d(x_{ij}, c_j)$.
5. For each cluster agent, determine the value of the cost function (the average inter-element distance) $cl\_loss(P_j^t)$ by expression (7).
6. Estimate the current quality of clustering with the loss function (8). If the algorithm runs in the mode of automatic search for the optimal number of clusters and the value of the cost function has increased, stop the algorithm.
7. Rank the cluster agents and discard the one with the highest value of $cl\_loss(P_j^t)$.
8. Determine the new number of clusters $K^{t+1} = K^t - 1$ and the new number of cluster elements $|P_j^{t+1}| = N / K^{t+1}$.
9. Return to stage 2 if $K^t > K$.
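The nine steps above can be sketched in code as follows. This is a simplified illustration under stated assumptions, not the authors' implementation: it uses only the Manhattan distance from the list of metrics, replaces the membership-weighted center update with an unweighted mean, omits the early stop of step 6, and runs on synthetic data. The names `agent_clustering`, `k_start` and `k_target` are hypothetical.

```python
import numpy as np


def manhattan(X, c):
    """Manhattan distance from every row of X to the point c."""
    return np.abs(X - c).sum(axis=1)


def agent_clustering(X, k_start, k_target, n_iter=10, seed=0):
    rng = np.random.default_rng(seed)
    k = k_start
    centers = X[rng.choice(len(X), size=k, replace=False)]  # step 1
    history = []
    while True:
        size = len(X) // k  # limit on the number of elements per cluster
        for _ in range(n_iter):
            # steps 2 and 4: each cluster agent takes its `size` closest elements
            members = [np.argsort(manhattan(X, c))[:size] for c in centers]
            # step 3, simplified: unweighted mean instead of the weighted update
            centers = np.array([X[m].mean(axis=0) for m in members])
        # step 5: average intra-cluster distance of every cluster agent
        cl_loss = np.array([manhattan(X[m], c).mean()
                            for m, c in zip(members, centers)])
        history.append((k, float(cl_loss.mean())))  # step 6: overall cost
        if k <= k_target:  # step 9: target number of clusters reached
            return centers, members, history
        # steps 7 and 8: discard the worst cluster agent, so k drops by one
        centers = np.delete(centers, int(cl_loss.argmax()), axis=0)
        k -= 1


# Synthetic data (assumption): two blobs of 30 points each.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0.0, 0.5, (30, 2)), rng.normal(4.0, 0.5, (30, 2))])
centers, members, history = agent_clustering(X, k_start=4, k_target=2)
```

The `history` list records the overall cost for each number of clusters, which is the kind of curve one would inspect when searching for the optimal number of clusters.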
CLASSIFICATION METHOD BASED ON MULTIPLE LOGISTIC
REGRESSION
To solve the problem of multiclass classification in the case of spatially separated data, it is proposed to use a radial basis neural network (RBFN) with multiple logistic regression. Applying the RBFN model to multiclass classification makes it possible to check the assumptions about the correctness of the cluster definition and to test the model's ability to generalize.
RBFN structure: $H_0$ inputs, one for each of the parameters, $H_1$ neurons of the first layer and $H_2$ output neurons. We define the vector of input data for the $k$-th layer of the neural network (or the vector of output data for layer $k-1$) as $Y^{(k)} = [Y_1^{(k)}, \ldots, Y_{H_k}^{(k)}]^T$, the vector of coordinates of the centers of the activation function for the hidden layer as $c_j = [c_{j1}, c_{j2}, \ldots, c_{jH_0}]^T$, where $j = 1..H_1$, and the vector specifying the window width of the activation function of the $j$-th neuron of the hidden layer as $\sigma_j = [\sigma_{j1}, \sigma_{j2}, \ldots, \sigma_{jH_0}]^T$. Then the activation function for the neurons of the hidden layer is

$$\varphi_{pj}\big(Y_p^{(0)}, c_j, \sigma_j\big) = \exp\left(-\frac{1}{2}\sum_{h=1}^{H_0} Z_{pjh}^2\right),$$

where $Z_{pjh} = \dfrac{Y_{ph}^{(0)} - c_{jh}}{\sigma_{jh}}$; $w_{ij}$ is the weighted connection between the $i$-th neuron of the output layer and the $j$-th neuron of the input layer.
Multiple logistic regression [17] is used as the activation function of the output layer, whose outputs are defined as

$$Y_j^{(2)} = \frac{\exp(\gamma_j)}{\sum_{k=1}^{H_2}\exp(\gamma_k)}, \quad \text{where } \gamma_j = \sum_{i=1}^{H_1} w_{ij}\,\varphi_i.$$
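A forward pass through such a network (Gaussian radial basis hidden layer followed by a multiple logistic regression output) might look like the sketch below. The array shapes, the random weights and the function names are illustrative assumptions; the weight matrix is stored with one row per hidden neuron and one column per output neuron.

```python
import numpy as np


def rbf_hidden(Y0, centers, widths):
    """Hidden-layer outputs: Gaussian of the width-scaled distance to each center."""
    # Z[p, j, h] = (Y0[p, h] - centers[j, h]) / widths[j, h]
    Z = (Y0[:, None, :] - centers[None, :, :]) / widths[None, :, :]
    return np.exp(-0.5 * np.sum(Z ** 2, axis=2))


def softmax_output(phi, W):
    """Output layer: weighted sums of hidden outputs passed through softmax."""
    gamma = phi @ W
    gamma = gamma - gamma.max(axis=1, keepdims=True)  # stabilise the exponent
    e = np.exp(gamma)
    return e / e.sum(axis=1, keepdims=True)


# Hypothetical sizes: 2 inputs, 3 hidden neurons, 4 output classes.
Y0 = np.array([[0.0, 0.0], [1.0, 1.0]])
centers = np.array([[0.0, 0.0], [1.0, 1.0], [2.0, 2.0]])
widths = np.ones((3, 2))
W = np.random.default_rng(0).normal(size=(3, 4))

phi = rbf_hidden(Y0, centers, widths)
probs = softmax_output(phi, W)
```

Each row of `probs` sums to one and can be read as class membership estimates; the predicted class is the column with the largest value.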
A hybrid algorithm was used for training the RBFN, which includes 2 steps,
the repetition of which usually leads to fast training of the network, especially if
the parameters are successfully generated [18]:
1) selection of linear network parameters (weights) using the pseudo inver-
sion method;
2) optimization of nonlinear parameters of activation functions (window cen-
ters and widths).
Given $P$ training pairs $(Y_p^{(0)}, d_p)$, $p = 1..P$, and fixed values of the centers and window widths of the activation functions, we get the system of equations

$$\Phi w_i = d_i, \quad i = 1..H_2,$$

where $\Phi = [\varphi_{pj}]$, $p = 1..P$, $j = 0..H_1$, $\varphi_{p0} = 1$; $w_i = [w_{i0}, w_{i1}, \ldots, w_{iH_1}]^T$; $d_i = [d_{0i}, d_{1i}, \ldots, d_{pi}]^T$.
The vector $w_i$ can be determined in one step using the pseudo-inverse of the matrix $\Phi$: $w_i = \Phi^{+} d_i$, which in practice is calculated using the decomposition of eigenvalues.
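The first step of the hybrid algorithm, solving for the linear weights with a pseudo-inverse, can be sketched with NumPy as below. The hidden-layer outputs and one-hot targets are random stand-ins (assumptions), and all output neurons are solved at once by stacking the target vectors as columns. NumPy's `np.linalg.pinv` computes the pseudo-inverse through a singular value decomposition.

```python
import numpy as np

rng = np.random.default_rng(0)
P, H1, H2 = 20, 5, 3  # hypothetical sizes: training pairs, hidden and output neurons

phi = rng.random((P, H1))                    # stand-in hidden-layer outputs
Phi = np.hstack([np.ones((P, 1)), phi])      # column j = 0 holds the constant 1
D = np.eye(H2)[rng.integers(0, H2, size=P)]  # one-hot targets, one column per output

# Least-squares weights for all outputs at once, shape (H1 + 1, H2)
W = np.linalg.pinv(Phi) @ D

# Cross-check against NumPy's least-squares solver
W_lstsq = np.linalg.lstsq(Phi, D, rcond=None)[0]
```

Both routes give the same least-squares solution when the design matrix has full column rank, which is the case for this random example.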
At the second stage of the algorithm, with the weights fixed, the excitation signal passes through the network to the output layer, which makes it possible to calculate the error value for the sequence of vectors $\{Y_p^{(0)}\}$. After that, the error is returned to the hidden layer. The gradient vector of the objective function with respect to the centers and window widths is determined from the error value $\|Y - d\|_{L_2}^{2}$.
The algorithm for forming the "coverage zone" by radial basis functions of $K$ neighbors,

$$\sigma_{jh} = \sigma_j = \sqrt{\frac{1}{K}\sum_{k=1}^{K}\sum_{h=1}^{H_0}(c_{jh} - c_{kh})^2}, \quad k = 1..K, \quad K \in [3, 5],$$

was used to determine the values of the window widths, which helped reduce the training time of the RBFN.
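One common way to realise such a nearest-neighbor width rule is to take, for each center, the root mean square distance to its nearest neighboring centers. The sketch below follows that reading; it is an assumption about the rule rather than the authors' code, and the function name and sample centers are hypothetical.

```python
import numpy as np


def knn_widths(centers, K=3):
    """Width of each radial basis function from its K nearest neighboring centers."""
    diff = centers[:, None, :] - centers[None, :, :]
    sq = np.sum(diff ** 2, axis=2)               # squared distances between centers
    nearest = np.sort(sq, axis=1)[:, 1:K + 1]    # column 0 is the center itself
    return np.sqrt(nearest.mean(axis=1))


# Hypothetical one-dimensional centers, easy to verify by hand.
centers = np.array([[0.0], [1.0], [2.0], [3.0]])
widths = knn_widths(centers, K=2)
```

Centers in dense regions get narrow windows and isolated centers get wide ones, so the basis functions overlap enough to cover the input space without a separate width optimization.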
CHARACTERISTICS INFORMATIVENESS ANALYSIS METHOD
Since it is proposed to use the RBFN to solve the classification problem, this model can also be used to find the minimum possible subset of informative variables. The input data set can be represented as a Taylor series in which only the first-order terms are kept. For the variance of an arbitrary linear function of several random variables, the following estimate is valid:

$$D_{Y_i} \approx (\operatorname{grad} Y_i)^T\, \Sigma_S\, \operatorname{grad} Y_i = \sum_{j=1}^{J}\left(\frac{\partial Y_i}{\partial s_j}\right)^2 \sigma_{S_j}^2 + \sum_{j=1}^{J}\sum_{l=1,\, l \neq j}^{J} r_{jl}\,\frac{\partial Y_i}{\partial s_j}\,\frac{\partial Y_i}{\partial s_l}\,\sigma_{S_j}\sigma_{S_l},$$

where $\Sigma_S$ is the covariance matrix of the variables $S_j$; $\sigma_{S_j}$ is the standard deviation of $S_j$; $r_{jl}$ is the correlation coefficient between the variables $S_j$ and $S_l$.
Then the standard deviation and variance of the RBFN output can be estimated according to the architecture chosen for it, and from them the energy of the signals is determined by the expression [18]:

$$E_i = \sum_{h=1}^{H_0} D_{Y_i^{(2)} \mid Y_h^{(0)}},$$

where

$$D_{Y_i^{(2)} \mid Y_h^{(0)}} = \left(\frac{\partial Y_i^{(2)}}{\partial Y_h^{(0)}}\right)^2 \sigma_{Y_h^{(0)}}^2 + \sum_{n=1,\, n \neq h}^{H_0} r_{hn}\,\frac{\partial Y_i^{(2)}}{\partial Y_h^{(0)}}\,\frac{\partial Y_i^{(2)}}{\partial Y_n^{(0)}}\,\sigma_{Y_h^{(0)}}\sigma_{Y_n^{(0)}}.$$

Then the coefficient of informativeness of the variables (the weight of the contribution of $Y_h^{(0)}$ to $Y_i^{(2)}$) is defined by the expression

$$\eta_{ih} = \frac{D_{Y_i^{(2)} \mid Y_h^{(0)}}}{E_i}.$$
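The informativeness coefficients described above can be estimated numerically for any differentiable model output. The sketch below is an illustration under assumptions: the gradient is approximated by central finite differences, the model `f` is a hypothetical linear function standing in for one RBFN output, and the inputs are taken as uncorrelated with unit variance.

```python
import numpy as np


def informativeness(f, y0, cov, eps=1e-5):
    """Share of the first-order output variance attributed to each input variable."""
    y0 = np.asarray(y0, dtype=float)
    grad = np.empty_like(y0)
    for h in range(y0.size):  # central finite differences, one input at a time
        step = np.zeros_like(y0)
        step[h] = eps
        grad[h] = (f(y0 + step) - f(y0 - step)) / (2.0 * eps)
    # Contribution of input h: its own variance term plus its correlated terms
    D = grad * (cov @ grad)
    E = D.sum()  # total first-order variance, the "signal energy"
    return D / E


def f(y):
    """Hypothetical model output: the third input has no influence."""
    return 3.0 * y[0] + 1.0 * y[1] + 0.0 * y[2]


eta = informativeness(f, y0=[0.5, 0.5, 0.5], cov=np.eye(3))
```

For this example the first variable carries nine tenths of the output variance and the third none, so a subset of informative variables would keep the first two.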
DATA PRE-PROCESSING METHODS
In machine learning problems, it has become common practice to use data pre-processing methods (normalization, cleaning from anomalies, and dimensionality reduction) to improve the quality of problem solving [19]. Three methods of the scikit-learn Python library were used for data normalization:
– RobustScaler scales parameters using statistics that are robust to outliers;
– StandardScaler (Z-score normalization) removes the mean and scales to unit variance;
– MinMaxScaler (min-max normalization) scales and translates each parameter individually so that it falls within a given range, for example [0, 1].
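A minimal sketch of the three scalers on a small array with one outlier is shown below. The data are made up for illustration; the classes and the `fit_transform` method are part of scikit-learn's `sklearn.preprocessing` module.

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, RobustScaler, StandardScaler

# Hypothetical data: the first column contains one extreme value.
X = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0], [4.0, 40.0], [100.0, 50.0]])

scaled = {
    "robust": RobustScaler().fit_transform(X),      # centers on the median, scales by IQR
    "standard": StandardScaler().fit_transform(X),  # zero mean, unit variance
    "minmax": MinMaxScaler(feature_range=(0, 1)).fit_transform(X),  # maps to [0, 1]
}
```

Comparing the first column across the three results shows why the robust variant is useful when outliers have not yet been removed: the ordinary points keep a usable spread instead of being squeezed together by the extreme value.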
Anomaly detection is the process of finding unusual elements, events, or observations that differ significantly from the main body of data and do not correspond to a well-defined notion of normal behavior [20]. Data cleaning techniques then remove the values that anomaly detection has identified as outliers.
Two outlier detection methods from the scikit-learn library were used:
– Interquartile Range (IQR). The data set is divided into quartiles, and the range between them is used to measure variability;
– Isolation forest. The method uses isolation (how far a data point is from the rest of the data) to find anomalies [21; 22].
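Both detectors can be sketched on a one-dimensional sample with a single planted outlier. The interquartile rule is implemented here directly with NumPy using the conventional 1.5 multiplier, which is an assumption; `IsolationForest` is scikit-learn's estimator, and the contamination value and the data are illustrative.

```python
import numpy as np
from sklearn.ensemble import IsolationForest


def iqr_outliers(x, k=1.5):
    """Boolean mask of values lying more than k interquartile ranges outside the quartiles."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return (x < q1 - k * iqr) | (x > q3 + k * iqr)


# Synthetic sample (assumption): 200 standard normal values plus one extreme point.
rng = np.random.default_rng(0)
x = np.concatenate([rng.normal(0.0, 1.0, 200), [15.0]])

iqr_mask = iqr_outliers(x)

forest = IsolationForest(n_estimators=100, contamination=0.01, random_state=0)
forest_labels = forest.fit_predict(x.reshape(-1, 1))  # -1 marks anomalies, 1 normal points
```

Rows flagged by either detector can then be dropped before clustering, for example with `x[~iqr_mask]`.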
The dimensionality reduction process aims to provide a lower-dimensional representation of the original data set while preserving its important characteristics. The scikit-learn and PyTorch libraries were used for dimensionality reduction. Three methods were used:
– T-distributed Stochastic Neighbor Embedding (t-SNE) [23];
– Principal Component Analysis (PCA). The method is based on SVD and reduces the dimensionality of the data well [24];
– Autoencoder. This is a type of feed-forward neural network in which the input matches the output. It compresses the input data into a bottleneck (lower-dimensional data) and then reconstructs the output data from that representation. The bottleneck is the target compact summary, or dimensionality reduction, of the input data, also called the latent space representation.
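As an illustration of dimensionality reduction with a retained-information threshold, the sketch below uses scikit-learn's PCA, one of the three listed methods, and asks it to keep enough components to explain 98% of the variance. The data are synthetic (a low-rank signal plus small noise), and the sample shape is chosen only for illustration; this is not the autoencoder configuration.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic data (assumption): 32 observed variables driven by 3 latent factors.
rng = np.random.default_rng(0)
latent = rng.normal(size=(115, 3))
mixing = rng.normal(size=(3, 32))
X = latent @ mixing + 0.01 * rng.normal(size=(115, 32))

# A float n_components in (0, 1) keeps the fewest components reaching that variance share.
pca = PCA(n_components=0.98, svd_solver="full")
X_reduced = pca.fit_transform(X)
retained = pca.explained_variance_ratio_.sum()
```

The number of columns in `X_reduced` is chosen by the threshold rather than fixed in advance, which mirrors the idea of compressing the data while retaining a stated share of the information.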
APPLICATION OF METHODOLOGY FOR COUNTRIES DIGITAL
DEVELOPMENT DATA ANALYSIS
The developed methodology was tested on identifying the state of digital development of the countries of the world. To classify (position) countries by their level of digital development, the hypothesis that homogeneous groups of countries (objects) exist according to specialized indices was tested. Indices that fully reflect the state of digital development were selected:
– $EGI_{it}$, the Global E-Government Development Index;
– $NRI_{it}$, the Network Readiness Index;
– $ICT_{it}$, the Information and Communication Technologies Development Index.
By forecasting the independent factors (indicators of digital development) with the model, it is possible to estimate the forecast level of social progress of a specific country. The Social Progress Index (SPI) is a combined indicator of the international research project The Social Progress Imperative [25; 26], which measures the achievements of the countries of the world in terms of social well-being and social progress. The authors of the study [25; 26] believe that indicators of social development are often considered as an alternative to indicators of economic development. The Global E-Government Development Index [26] is an integral indicator that assesses the readiness and capabilities of national government structures to use information and communication technologies (ICT) to provide public services to citizens. The Network Readiness Index [26] characterizes the level of development of ICT and the network economy in the countries of the world. Currently, this index is considered one of the most important indicators of the innovative and technological potential of the countries of the world and of their development opportunities in the field of high technology and the digital economy. The ICT Development Index is a composite index that combines 11 indicators and is used to monitor and compare the development of ICT between countries.
To implement the model, a sample of 115 precedents (observations by country) was collected, with 32 variables of the state of social development for each precedent and a 33rd field for the predicted value of the state. The ratio of the values of the Social Progress Index $SPI_t$ and the average level of income was used to label the training sample. All precedents of the sample were distributed according to the respective states:
– "High income": 45 precedents (I);
– "Upper middle income": 11 precedents (II);
– "Lower middle income": 25 precedents (III);
– "Lower income": 34 precedents (IV).
For this sample, the data were first pre-processed: normalized and checked for anomalous values. Clustering was performed on the considered economic data, and classification was performed to verify its results. It was decided to use the Kullback–Leibler distance classification method. As a result, an accuracy of 84.3% was achieved, and the value of the cost function was 0.0117. A matrix of inconsistencies (Table 1) was also constructed to assess the accuracy of the method, as well as graphs of the cost function values (Fig. 1) and ROC curves for each of the classes (Fig. 2).

Table 1. The matrix of inconsistencies in the classification of data indicators of the digital development of the countries of the world

| Actual class \ Predicted class | I | II | III | IV |
|---|---|---|---|---|
| I | 37 | 1 | 0 | 7 |
| II | 1 | 8 | 1 | 1 |
| III | 2 | 0 | 21 | 2 |
| IV | 0 | 1 | 2 | 31 |
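The reported accuracy can be checked directly from the matrix of inconsistencies in Table 1: the correctly classified precedents lie on the diagonal. The short sketch below does this arithmetic; the per-class recall helper is an added illustration, not a figure reported in the paper.

```python
# Rows are actual classes I-IV, columns are predicted classes I-IV (values from Table 1).
table_1 = [
    [37, 1, 0, 7],
    [1, 8, 1, 1],
    [2, 0, 21, 2],
    [0, 1, 2, 31],
]


def accuracy(matrix):
    """Share of precedents on the diagonal of a square confusion matrix."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def recall_per_class(matrix):
    """Share of each actual class that was predicted correctly."""
    return [row[i] / sum(row) for i, row in enumerate(matrix)]


acc = accuracy(table_1)               # 97 correct out of 115 precedents
recalls = recall_per_class(table_1)   # one value per class I-IV
```

The diagonal sums to 97 of 115 precedents, which agrees with the stated accuracy of 84.3%, and the row totals reproduce the class sizes 45, 11, 25 and 34.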
After a series of experiments, it was decided to apply the autoencoder method to reduce the dimensionality of the data with 98% information retention, which made it possible to reduce the dimensionality from 32 to 11 state variables for each case. After this, an accuracy of 86.9% was achieved, and the value of the cost function became -0.04827. A matrix of inconsistencies (Table 2) was also constructed to assess the accuracy of the method, as well as a graph of the values of the cost function (Fig. 3) and ROC curves for each of the classes (Fig. 4).

Table 2. The matrix of inconsistencies in the classification of compressed data indicators of the digital development of the countries of the world

| Actual class \ Predicted class | I | II | III | IV |
|---|---|---|---|---|
| I | 38 | 0 | 0 | 7 |
| II | 0 | 10 | 0 | 1 |
| III | 1 | 0 | 21 | 3 |
| IV | 0 | 1 | 2 | 31 |

To carry out multi-class classification with the RBFN, the data on the digital development of countries with reduced dimension, processed by the autoencoder method, were used. To test the ability of the model to generalize, the data were divided into test and training samples in the ratio of 20% (22 precedents) and 80% (93 precedents), respectively. The data sample was normalized beforehand.
Fig. 1. The ratio of the number of clusters to the value of the cost function for the economic indicators of the countries of the world

Fig. 2. ROC curves for each of the classes (curves 1 to 4) for the economic indicators of the countries of the world
The RBFN receives at its input 7 state variables that do not have a defined value, and at its output it gives estimates of the state variable values, that is, one of 4 states. The structure of the proposed RBFN has $H_0 = 7$ inputs, one for each of the parameters, $H_1 = 90$ neurons of the first layer and $H_2 = 4$ output neurons.
Training yielded an accuracy of 83.87% on the training sample and 68.18% on the test sample. To display the test results, a confusion matrix was constructed for the test sample (Table 3) and ROC curves were plotted (Fig. 5). The ROC curves cover a smaller area (i.e., indicate worse classification ability) because part of the data was used for training, which reduced the ability of the RBFN to generalize.
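The area under a per-class ROC curve can be computed in a one-vs-rest fashion, as sketched below. This is a generic illustration with hypothetical function names and a tiny made-up example, not the procedure used by the authors' software.

```python
def roc_points(scores, labels):
    """ROC points (FPR, TPR) for binary labels, sweeping the threshold over
    the distinct scores from high to low. Assumes both classes are present."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = [(0.0, 0.0)]
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


def one_vs_rest_auc(prob_rows, true_classes, n_classes):
    """AUC of each class treated as positive against all other classes."""
    return [auc(roc_points([p[c] for p in prob_rows],
                           [1 if y == c else 0 for y in true_classes]))
            for c in range(n_classes)]


# Tiny hypothetical example: 4 samples, 2 classes.
probs = [[0.9, 0.1], [0.6, 0.4], [0.3, 0.7], [0.2, 0.8]]
truth = [0, 0, 1, 1]
areas = one_vs_rest_auc(probs, truth, 2)
print(areas)
```

A smaller area for a class means that its predicted probabilities separate it less cleanly from the remaining classes.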
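Table 3. Confusion matrix for the compressed data of the indicators of digital development of the countries of the world (rows: actual class; columns: predicted class)

| Actual class | I | II | III | IV |
|---|---|---|---|---|
| I | 8 | 0 | 0 | 1 |
| II | 0 | 2 | 0 | 1 |
| III | 0 | 4 | 1 | 0 |
| IV | 2 | 0 | 0 | 4 |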
Fig. 3. The ratio of the number of clusters to the value of the cost function for the compressed data of the economic indicators of the countries of the world
Fig. 4. ROC curves for each of the classes (curves 1 to 4) for the compressed data of the economic indicators of the countries of the world
A sensitivity analysis of the objective function was also carried out, i.e., the most informative indicators were determined. The results are shown in Table 4. They indicate that a different set of variables is informative for each cluster.
Table 4. Sensitivity analysis of the objective functions of the clusters

| Cluster | Number of precedents | Sensitive cluster variables | Mathematical expectation of the objective function |
|---|---|---|---|
| 0 | 45 | TII, ICT, HCI | 85.33 |
| 1 | 11 | TII, ICT, EGI | 52.87 |
| 2 | 25 | EPI, HCI, OSI | 63.47 |
| 3 | 34 | HCI, EPI, EGI | 73.60 |
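One generic way to rank variables by their influence on an objective function is a finite-difference sensitivity estimate, sketched below. This is not the authors' procedure: the objective `toy_objective` and its weights are purely hypothetical, and only the six variable names are borrowed from Table 4.

```python
def sensitivities(objective, point, step=1e-4):
    """Central finite-difference estimate of |d objective / d x_j| at `point`."""
    result = []
    for j in range(len(point)):
        up = point[:]
        down = point[:]
        up[j] += step
        down[j] -= step
        result.append(abs(objective(up) - objective(down)) / (2 * step))
    return result


def most_sensitive(names, values, top=3):
    """Names of the `top` variables with the largest sensitivity."""
    ranked = sorted(zip(names, values), key=lambda item: item[1], reverse=True)
    return [name for name, _ in ranked[:top]]


names = ["TII", "ICT", "HCI", "EPI", "EGI", "OSI"]


def toy_objective(x):
    """Hypothetical objective, NOT the paper's: a weighted sum for illustration."""
    weights = [5.0, 4.0, 3.0, 0.5, 0.2, 0.1]
    return sum(w * v for w, v in zip(weights, x))


top3 = most_sensitive(names, sensitivities(toy_objective, [0.5] * 6))
print(top3)
```

Repeating such a ranking separately for the objective function of each cluster would give a per-cluster list of informative variables of the kind shown in the table.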
All numerical studies were carried out using the computer program “Nonlinear estimation methods in the multicriterion problems of system’s robust optimal designing and diagnosing under parametric apriority uncertainty (methodology, methods and computer decision support and making system)” (ROD&IDS), developed by the authors [27].
CONCLUSIONS
Methods of intelligent data flow processing are widely used to identify the states of economic objects. New methods supplement the available tools for solving current data processing problems, make the methods more robust to the nature of the data, and allow computing resources to be used more efficiently.
The presented study examines the problem of improving the methods of classification and clustering of countries according to the state of their social and digital development. A multiclass classification method based on radial basis neural networks and a data clustering method based on an agent-oriented modification of the c-means method are proposed.
Fig. 5. ROC curves for each of the classes (curves 1 to 4) for the PCA test sample of compressed data of indicators of digital development of the countries of the world
The proposed RBFN uses multinomial logistic regression as the last layer for multiclass classification and the training results of an agent-oriented clustering model as input parameters. The distinctive feature of the modification of the c-means method is the introduction of elite selection of clusters.
Based on the results of the research, the methodology is recommended for the analysis of economic systems to improve the quality of decision-making. It should be noted, however, that the method requires a well-prepared sample that covers the largest possible space of input parameters for the target classes.
REFERENCES
1. Mei Yang, Ming K. Lim, Yingchi Qu, Du Ni, and Zhi Xiao, “Supply chain risk management with machine learning technology: A literature review and future research directions,” Computers & Industrial Engineering, vol. 175, January 2023, 108859. Available: https://doi.org/10.1016/j.cie.2022.108859
2. Benjamin Decardi-Nelson and Jinfeng Liu, “Robust Economic Model Predictive Control with Zone Control,” IFAC-PapersOnLine, vol. 54, issue 3, pp. 237–242, 2021. Available: https://doi.org/10.1016/j.ifacol.2021.08.248
3. M. Schlesinger and V. Hlavac, Ten lectures on statistical and structural pattern recognition. Springer, Dordrecht, 2002. doi: 10.1007/978-94-017-3217-8.
4. Data clustering: algorithms and applications, Charu C. Aggarwal and Chandan K. Reddy (eds.). CRC Press, Taylor & Francis Group, 2014.
5. N. Bakumenko, V. Strilets, and M. Ugryumov, “Application of the C-Means Fuzzy Clustering Method for the Patient’s State Recognition Problems in the Medicine Monitoring Systems,” CEUR Workshop Proceedings of 3rd International Conference on Computational Linguistics and Intelligent Systems, COLINS 2019, vol. I, pp. 218–227, 2019. Available: https://www.researchgate.net/publication/338819685
6. R. Winkler, F. Klawonn, and R. Kruse, “Problems of Fuzzy c-Means Clustering and Similar Algorithms with High Dimensional Data Sets,” Challenges at the Interface of Data Analysis, Computer Science and Optimization, pp. 79–87, 2012. doi: 10.1007/978-3-642-24466-7_9.
7. Christopher D. Manning, Prabhakar Raghavan, and Hinrich Schütze, Introduction to information retrieval. Cambridge University Press, 2008.
8. S. Askari, “Fuzzy C-Means clustering algorithm for data with unequal cluster sizes and contaminated with noise and outliers: Review and development,” Expert Systems with Applications, vol. 165, article no. 113856, 2020. doi: 10.1016/j.eswa.2020.113856.
9. Xuemei Zhao, Yu Li, and Quanhua Zhao, “Mahalanobis distance based on fuzzy clustering algorithm for image segmentation,” Digital Signal Processing, vol. 43, pp. 8–16, Aug 2015. Available: https://doi.org/10.1016/j.dsp.2015.04.009
10. M. Zarinbal, M.H. Fazel Zarandi, and I.B. Turksen, “Relative entropy fuzzy c-means clustering,” Information Sciences, vol. 260, pp. 74–97, 2014. doi: 10.1016/j.ins.2013.11.004.
11. V. Strilets, V. Donets, M. Ugryumov, R. Zelenskyi, and T. Goncharova, “Agent-Oriented data clustering for medical monitoring,” Radioelectronic and Computer Systems, no. 1, pp. 103–114, 2022. Available: https://doi.org/10.32620/reks.2022.1.08
12. Meng Xing, Yanbo Zhang, Hongmei Yu, Zhenhuan Yang, and Xueling Li, “Predict DLBCL patients’ recurrence within two years with Gaussian mixture model cluster oversampling and multi-kernel learning,” Computer Methods and Programs in Biomedicine, vol. 226, 107103, 2022. Available: https://doi.org/10.1016/j.cmpb.2022.107103
13. Lynne A. Kvapil, Mark W. Kimpel, Rasitha R. Jayasekare, and Kim Shelton, “Using Gaussian mixture model clustering to explore morphology and standardized production of ceramic vessels: A case study of pottery from Late Bronze Age Greece,” Journal of Archaeological Science: Reports, vol. 45, 103543, 2022. Available: https://doi.org/10.1016/j.jasrep.2022.103543
14. Yinfeng Meng, Jiye Liang, Fuyuan Cao, and Yijun He, “A new distance with derivative information for functional k-means clustering algorithm,” Information Sciences, vol. 463–464, pp. 166–185, 2018. Available: https://doi.org/10.1016/j.ins.2018.06.035
15. Xinmin Tao, Ruotong Wang, Rui Chang, and Chenxi Li, “Density-sensitive fuzzy kernel maximum entropy clustering algorithm,” Knowledge-Based Systems, vol. 166, pp. 42–57, 2019. Available: https://doi.org/10.1016/j.knosys.2018.12.007
16. K. Møllersen, S. Dhar, and F. Godtliebsen, “On Data-Independent Properties for Density-Based Dissimilarity Measures in Hybrid Clustering,” Applied Mathematics, vol. 7, no. 15, pp. 1674–1706, 2016. doi: 10.4236/am.2016.715143.
17. Ian Goodfellow, Yoshua Bengio, and Aaron Courville, Softmax Units for Multinoulli Output Distributions. Deep Learning. MIT Press, 2016.
18. V.E. Strilets et al., Methods of machine learning in the problems of system analysis and decision making: monograph. Karazin Kharkiv National University, 2020, 195 p.
19. Farbod Farhangi, “Investigating the role of data preprocessing, hyperparameters tuning, and type of machine learning algorithm in the improvement of drowsy EEG signal modeling,” Intelligent Systems with Applications, vol. 15, 200100, September 2022. Available: https://doi.org/10.1016/j.iswa.2022.200100
20. Arthur Zimek and Peter Filzmoser, “There and back again: Outlier detection between statistical reasoning and data mining algorithms,” Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 8(6), 2018. doi: 10.1002/widm.1280.
21. Fei Tony Liu, Kai Ming Ting, and Zhi-Hua Zhou, “Isolation-Based Anomaly Detection,” ACM Transactions on Knowledge Discovery from Data, 6(1), pp. 1–39, 2012. doi: 10.1145/2133360.2133363.
22. O.Yu. Lykhach, M.L. Ugryumov, D.O. Shevchenko, and S.I. Shmatkov, “Methods of detecting emissions in test samples during process control in state-based systems,” Bulletin of Karazin Kharkiv National University, ser. “Mathematical modeling. Information Technology. Automated control systems”, no. 53, pp. 21–40, 2022.
23. L.J.P. van der Maaten and G.E. Hinton, “Visualizing Data Using t-SNE,” Journal of Machine Learning Research, 9, pp. 2579–2605, 2008.
24. Ian T. Jolliffe and Jorge Cadima, “Principal component analysis: a review and recent developments,” Philosophical Transactions of the Royal Society A: Mathematical, Physical and Engineering Sciences, 374(2065), 20150202, 2016. doi: 10.1098/rsta.2015.0202.
25. L. Chagovets, N. Chernova, T. Klebanova, O. Dorokhov, and A. Didenko, “Selective Adaptive Model for Forecasting of Regional Development Unevenness Indexes,” Proceedings of the Workshop on the XII International Scientific Practical Conference Modern problems of social and economic systems modelling (MPSESM-W 2020), Kharkiv, Ukraine, June 25, 2020, pp. 58–76.
26. L.O. Chagovets, S.V. Prokopovych, S.M. Vozniuk, and V.V. Chahovets, “Conceptual basis of modeling telecommunication development of regions by methods of system analysis,” Municipal economy of cities, vol. 1, no. 161, pp. 230–240, 2021.
27. Computer program “Nonlinear estimation methods in the multicriterion problems of system’s robust optimal designing and diagnosing under parametric apriority uncertainty (methodology, methods and computer decision support and making system)” (“ROD&IDS”): Copyright registration certificate no. 82875 / M.L. Ugryumov, Y.S. Meniaylov, S.V. Chernysh, K.M. Ugryumova (Ukraine). Copyright and related rights. Official bulletin. Ministry of Economic Development and Trade of Ukraine. 2018, no. 51, p. 403.
Received 30.06.2023
INFORMATION ON THE ARTICLE
Volodymyr V. Donets, ORCID: 0000-0002-5963-9998, V.N. Karazin Kharkiv National University, Ukraine, e-mail: v.donets@karazin.ua
Viktoriia Y. Strilets, ORCID: 0000-0002-2475-1496, V.N. Karazin Kharkiv National University, Ukraine, e-mail: viktoria.strilets@karazin.ua
Mykhaylo L. Ugryumov, ORCID: 0000-0003-0902-2735, V.N. Karazin Kharkiv National University, Ukraine, e-mail: m.ugryumov@karazin.ua
Dmytro O. Shevchenko, ORCID: 0000-0002-7897-250X, V.N. Karazin Kharkiv National University, Ukraine, e-mail: dimyich24@gmail.com
Svitlana V. Prokopovych, ORCID: 0000-0002-6333-2139, Simon Kuznets Kharkiv National University of Economics, Ukraine, e-mail: prokopovichsv@gmail.com
Liubov O. Chagovets, ORCID: 0000-0003-4064-9712, Simon Kuznets Kharkiv National University of Economics, Ukraine, e-mail: liubov.chahovets@hneu.net
METHODOLOGY OF THE COUNTRIES’ ECONOMIC DEVELOPMENT DATA ANALYSIS / V.V. Donets, V.Y. Strilets, M.L. Ugryumov, D.O. Shevchenko, S.V. Prokopovych, L.O. Chagovets
Abstract. The paper examines the issue of improving the methods of identification of economic objects and their analysis using algorithms of intelligent data processing. The use of the developed methodology in economic analysis improves the quality of management and can be the basis for creating decision support systems to prevent potentially dangerous changes in the economic state of the research object. An improved c-means data clustering method with an agent-oriented modification is proposed. A radial basis neural network is proposed to determine whether the obtained clusters are relevant, and its extension is proposed to analyze the informativeness of state variables and obtain a subset of informative variables. The effect of data compression using an autoencoder on the accuracy of the methods is considered. Testing of the developed methodology showed a reduced probability of incorrect state determination when identifying the states of economic systems and a reduced value of the error of the third kind when classifying the states of objects.
Keywords: machine learning, digital development, fuzzy clustering, radial basis neural networks, logistic regression, analysis of variables informativeness.
|
| id | journaliasakpiua-article-297208 |
| institution | System research and information technologies |
| keywords_txt_mv | keywords |
| language | English |
| last_indexed | 2025-07-17T10:28:25Z |
| publishDate | 2023 |
| publisher | The National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute" |
| record_format | ojs |
| resource_txt_mv | journaliasakpiua/a1/429c42cc1f71ad95bfe3aee2496b35a1.pdf |
| spelling | journaliasakpiua-article-2972082024-02-01T21:03:07Z Methodology of the countries’ economic development data analysis Методологія аналізу даних економічного розвитку країн Donets, Volodymyr Strilets, Viktoriia Ugryumov, Mykhaylo Shevchenko, Dmytro Prokopovych, Svitlana Chagovets, Liubov машинне навчання цифровий розвиток нечітка кластеризація радіально базисні нейромережі логістична регресія аналіз інформативності змінних machine learning digital development fuzzy clustering radial basis neural networks logistic regression analysis of variables informativeness The paper examines the issue of improving the methods of identification of economic objects and their analysis using algorithms of intelligent data processing. The use of the developed methodology in the economic analysis allows for improvement in the quality of management. It can be the basis for creating decision support systems to prevent potentially dangerous changes in the economic status of the research object. In this work, an improved method of c-means data clustering with agent-oriented modification is proposed, and a radial-basis neural network and its extension are proposed to determine whether the obtained clusters are relevant and to analyze the informativeness of state variables and obtain a subset of informative variables. The effect of applying data compression using an autoencoder on the accuracy of the methods is also considered. According to the results of testing of the developed methodology, it was proved that the probability of incorrect determination of the state was reduced when identifying the states of economic systems, and a reduced value of the error of the third kind was obtained when classifying the states of objects. Досліджено питання удосконалення методів ідентифікації економічних об’єктів та їх аналізу з використанням алгоритмів інтелектуального оброблення даних. 
Використання розробленої методології в економічному аналізі дозволяє підвищити якість управління та може бути основою для створення систем підтримання прийняття рішень для попередження потенційно небезпечних змін економічного стану об’єкта дослідження. Запропоновано удосконалений метод кластеризації даних c-середніх з агентно-орієнтованою модифікацією, для визначення відповідності отриманих кластерів актуальним пропонується радіально-базисна нейромережа та її розширення – для аналізу інформативності змінних стану й отримання підмножини інформативних змінних. Розглянуто вплив застосування стиснення даних за допомогою автокодувальника на точність застосування методів. За результатами тестування розробленої методології було доведено зменшення ймовірності неправильного визначення стану під час ідентифікації станів економічних систем та отримано зменшене значення помилки третього роду під час класифікації станів об’єктів. The National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute" 2023-12-26 Article Article application/pdf https://journal.iasa.kpi.ua/article/view/297208 10.20535/SRIT.2308-8893.2023.4.02 System research and information technologies; No. 4 (2023); 21-36 Системные исследования и информационные технологии; № 4 (2023); 21-36 Системні дослідження та інформаційні технології; № 4 (2023); 21-36 2308-8893 1681-6048 en https://journal.iasa.kpi.ua/article/view/297208/290123 |
| spellingShingle | машинне навчання цифровий розвиток нечітка кластеризація радіально базисні нейромережі логістична регресія аналіз інформативності змінних Donets, Volodymyr Strilets, Viktoriia Ugryumov, Mykhaylo Shevchenko, Dmytro Prokopovych, Svitlana Chagovets, Liubov Методологія аналізу даних економічного розвитку країн |
| title | Методологія аналізу даних економічного розвитку країн |
| title_alt | Methodology of the countries’ economic development data analysis |
| title_full | Методологія аналізу даних економічного розвитку країн |
| title_fullStr | Методологія аналізу даних економічного розвитку країн |
| title_full_unstemmed | Методологія аналізу даних економічного розвитку країн |
| title_short | Методологія аналізу даних економічного розвитку країн |
| title_sort | методологія аналізу даних економічного розвитку країн |
| topic | машинне навчання цифровий розвиток нечітка кластеризація радіально базисні нейромережі логістична регресія аналіз інформативності змінних |
| topic_facet | машинне навчання цифровий розвиток нечітка кластеризація радіально базисні нейромережі логістична регресія аналіз інформативності змінних machine learning digital development fuzzy clustering radial basis neural networks logistic regression analysis of variables informativeness |
| url | https://journal.iasa.kpi.ua/article/view/297208 |
| work_keys_str_mv | AT donetsvolodymyr methodologyofthecountrieseconomicdevelopmentdataanalysis AT striletsviktoriia methodologyofthecountrieseconomicdevelopmentdataanalysis AT ugryumovmykhaylo methodologyofthecountrieseconomicdevelopmentdataanalysis AT shevchenkodmytro methodologyofthecountrieseconomicdevelopmentdataanalysis AT prokopovychsvitlana methodologyofthecountrieseconomicdevelopmentdataanalysis AT chagovetsliubov methodologyofthecountrieseconomicdevelopmentdataanalysis AT donetsvolodymyr metodologíâanalízudanihekonomíčnogorozvitkukraín AT striletsviktoriia metodologíâanalízudanihekonomíčnogorozvitkukraín AT ugryumovmykhaylo metodologíâanalízudanihekonomíčnogorozvitkukraín AT shevchenkodmytro metodologíâanalízudanihekonomíčnogorozvitkukraín AT prokopovychsvitlana metodologíâanalízudanihekonomíčnogorozvitkukraín AT chagovetsliubov metodologíâanalízudanihekonomíčnogorozvitkukraín |