Decision-tree and ensemble-based mortality risk models for hospitalized patients with COVID-19
| Date: | 2023 |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | The National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute", 2023 |
| Online access: | https://journal.iasa.kpi.ua/article/view/279747 |
| Journal title: | System research and information technologies |
Ya. Vyklyuk, S. Levytska, D. Nevinskyi, K. Hazdiuk, M. Škoda, S. Andrushko, M. Palii, 2023
System Research and Information Technologies, 2023, № 1, p. 23
UDC 004.02, 004.67, 004.891.3
DOI: 10.20535/SRIT.2308-8893.2023.1.02
DECISION-TREE AND ENSEMBLE-BASED MORTALITY RISK
MODELS FOR HOSPITALIZED PATIENTS WITH COVID-19
Ya. VYKLYUK, S. LEVYTSKA, D. NEVINSKYI, K. HAZDIUK,
M. ŠKODA, S. ANDRUSHKO, M. PALII
Abstract. The work is devoted to studying SARS-CoV-2-associated pneumonia and
investigating the main indicators that lead to patient mortality. Using well-known
parameters that are routinely used in clinical practice, we obtained new functional
dependencies based on an accessible and understandable decision tree and on an ML
ensemble of classifier models. These allow the physician to determine the prognosis
in a few minutes and, accordingly, to understand the need for treatment adjustment
or transfer of the patient to the emergency department. The performance of the
resulting ensemble of models, fitted on actual hospital patient data, was in the
range of 0.88–0.91 for different metrics. Creating a data collection system with
further training of classifiers will dynamically increase the forecast’s accuracy and
automate the doctor’s decision-making process.
Keywords: COVID-19, decision-making system, decision tree, ML-ensemble,
ensemble of classification models.
BACKGROUND
The pandemic of SARS-CoV-2 infection, which started in December 2019, rapidly
spread across the globe and affected all countries within two years. As of November
2021, the number of cases worldwide exceeded 262 million and more than 5 million
people had died, including more than 85 thousand deaths in Ukraine [1]. The
spread of coronavirus infection in Ukraine began in Chernivtsi, and for a year and
a half this city held the sad first place in the level of SARS-CoV-2 morbidity.
The emergency situation in medicine obliged physicians of various specialties to
help patients with SARS-CoV-2-associated pneumonia and to study the peculiarities
of SARS-CoV-2 infection in their own practical experience.
Despite the huge amount of accumulated clinical and laboratory material and the
extraordinary attention of the medical community to the treatment of patients with
SARS-CoV-2-associated pneumonia, it is still not clear why the disease became fatal
for some patients [2].
In recent years, decision-making and expert systems based on artificial intelligence
have become widespread in medicine. Classification is one of the most urgent and
necessary tasks in medicine. Classification shapes medicine and guides its practice.
An understanding of classification should be part of the search for a better
understanding of the social context and consequences of diagnosis. Classification is
the part of human activity that provides the basis for recognizing and studying a
disease. This means deciding how to extract significant parts from the vast expanse
of nature, stabilizing and structuring disordered things [3], [4]. One of the most
popular applications of classification is X-ray diagnostics.
ISSN 1681–6048 System Research & Information Technologies, 2023, № 1 24
Different types of convolutional neural networks, or classical classifiers based on
image features, are used as a classification model [5–7].
There are also studies that determine the mortality rate of patients depending on
medical indicators. In particular, in paper [8], lactate dehydrogenase,
neutrophils (%), lymphocytes (%), high-sensitivity C-reactive protein, and age
(LNLCA), all determined on hospital admission, were identified as key predictors
of death by a multi-tree XGBoost model. An integrated score (LNLCA) was calculated
together with the corresponding probability of death. COVID-19 patients were
divided into three subgroups (low-, middle-, and high-risk) using LNLCA cutoff
values of 10.4 and 12.65. The probability of death in these groups is less than 5%,
5–50%, and above 50%, respectively. The prognostic model, nomogram, and LNLCA
score can help identify patients with COVID-19 at high risk of mortality early,
which will help physicians improve the management of patient stratification.
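The three-group stratification reported in [8] can be illustrated with a minimal Python sketch. Only the two cutoff values (10.4 and 12.65) come from the text; the function name `lnlca_risk_group` and the treatment of scores that fall exactly on a cutoff are assumptions made for illustration, and the computation of the LNLCA score itself is not reproduced here.

```python
# Cutoff values quoted from [8]; everything else is an illustrative assumption.
LOW_CUTOFF = 10.4
HIGH_CUTOFF = 12.65


def lnlca_risk_group(score: float) -> str:
    """Map an already computed LNLCA score to a risk group.

    Assumption: a score exactly equal to a cutoff is placed in the middle group.
    """
    if score < LOW_CUTOFF:
        return "low"
    if score <= HIGH_CUTOFF:
        return "middle"
    return "high"


if __name__ == "__main__":
    for value in (9.0, 11.0, 13.0):
        print(value, lnlca_risk_group(value))
```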
In paper [9], the severity and outcome of COVID-19 cases were associated with the
percentage of circulating lymphocytes (LYM%) and the levels of C-reactive protein
(CRP), interleukin-6 (IL-6), procalcitonin (PCT), lactic acid (LA), and viral load
(ORF1ab Ct). However, the predictive power of each of these indicators in disease
classification and prognosis remains largely unclear.
Similar results in [10] indicate that the risk period for patients is 12–14 days,
after which the probability of patient survival may increase. In addition, the
probability of death in COVID-19 cases increases with age and is higher in men
than in women. SVM with grid search showed the highest accuracy of about 95%,
followed by the decision tree algorithm with an accuracy of about 94%.
The retrospective cohort study [11] included patients with COVID-19 who were
admitted at three designated locations of Wuhan Union Hospital (Wuhan, China).
Dynamic hematological and coagulation parameters were investigated with a linear
mixed model, and coagulopathy screening with the sepsis-induced coagulopathy and
International Society of Thrombosis and Hemostasis overt disseminated
intravascular coagulation scoring systems was applied.
The authors of paper [12] used the available information on pre-existing health
conditions identified for deceased patients positive for severe acute respiratory
syndrome coronavirus 2 (SARS-CoV-2) in Italy. They estimated the total number of
deaths for different categories of pre-existing health conditions and calculated a
conditional case fatality rate (CFR) based on the number of comorbidities before
SARS-CoV-2 infection.
Paper [13] showed that high IL-6 level, C-reactive protein level, lactate
dehydrogenase (LDH) level, ferritin level, d-dimer level, neutrophil count, and
neutrophil-to-lymphocyte ratio were all predictors of mortality (area under the
curve 0.70), as were low albumin level, lymphocyte count, monocyte count, and
ratio of peripheral blood oxygen saturation to fraction of inspired oxygen
(SpO2/FiO2). A multivariable mortality risk model including the SpO2/FiO2 ratio,
neutrophil-to-lymphocyte ratio, LDH level, IL-6 level, and age was developed and
showed high accuracy for the prediction of fatal outcome (area under the curve
0.94). The optimal cutoff reliably classified patients (including patients without
initial respiratory distress) as survivors and nonsurvivors with a sensitivity of
0.88 and a specificity of 0.89.
As can be seen, the factors that affect the mortality rate are not clearly defined,
and there are no strict rules or decision trees for predicting patients’ death.
Therefore, there is a great need for research that will help doctors predict the
severity of the disease and its mortality.
The reviewed studies open the way towards attribute correlation, estimation of
survival days, and prediction of the probability of death. The findings of the
present review clearly indicate that machine learning algorithms have strong
prediction and classification capabilities in relation to COVID-19 as well.
The aim of the study is to determine the prognostic factors of fatal
SARS-CoV-2-associated pneumonia and to establish a functional relationship
between them and the mortality of the patient.
The main contributions of this article can be summarized as follows:
- based on the medical data of real patients admitted to the hospital with
  COVID-19, a heterogeneous data set was created, which became the basis for
  finding the relationship between medical factors and the mortality of the patient;
- a method of validation, transformation and cleaning of the medical data set in
  preparation for the analysis was developed;
- an analysis of the impact of medical factors on mortality was conducted and a
  final data set for constructing classification models was formed;
- the train dataset for experimental modeling was created;
- the effectiveness of ten existing machine learning algorithms for determining the
  level of patient mortality was evaluated and a decision tree was constructed;
- a stacking model for predicting mortality that prevents overfitting was developed,
  and a significant increase in its accuracy in comparison with some existing
  machine learning algorithms was shown.
The resulting functional dependence can be implemented in expert systems that will
allow the average physician to predict the patient's risk of death and therefore
apply the necessary intensive care tools to save human lives.
METHODS
Data Collection
A retrospective analysis of the treatment results of 121 patients with
SARS-CoV-2-associated pneumonia who stayed in Chernivtsi City Hospital №1 (since
March 2020, the Chernivtsi Central COVID Hospital) was performed. The inclusion
criterion was moderate or severe SARS-CoV-2-associated pneumonia; the exclusion
criterion was death before the fifth day of the hospital stay. According to the
outcomes, two groups were formed: the first group of 60 patients with
SARS-CoV-2-associated pneumonia with a fatal outcome, and the second group of 61
patients with a favorable course of SARS-CoV-2-associated pneumonia.
Every patient could be described by a huge number of parameters. As potential
prognostic factors, we analyzed 77 parameters divided into 9 parts according to the
working hypothesis. This task is a machine learning classification problem, in
which each patient must be assigned to one of two classes (will die or will live)
based on many different factors. The stages of machine learning in this case
include preliminary data preparation, model selection, training, and analysis of
results.
Preliminary data preparation
Several steps at this stage are dictated by the peculiarities of how the data were
obtained and stored. A Python script was written to implement each step.
Removing personalized data. Fields that contain personal information and those
that do not clearly affect the diagnosis are removed from the analysis, in
particular: patient ID number, name of the patient, phone, diagnosis,
complications, CT scans, etc.
Verification of human mistakes. All available data were entered by people, which
leads to technical mistakes. So, the first procedure is to verify the data and
correct them automatically and manually. To do this, a script that identified and,
where possible, corrected human errors was created.
Transformation and change of field values. A significant number of fields contain
information in text format that is not suitable for numerical analysis. A parse
function was created that transformed all data into a form suitable for data
analysis.
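The authors' parse function is not shown in the article. As a rough illustration of what such a transformation step might look like, the sketch below converts free-text cells such as `34,9 G/l` (decimal comma, trailing unit) into floats. The function name `parse_value` and its rules are hypothetical and are not taken from the original script.

```python
import re

# First number in the cell, with either a decimal point or a decimal comma.
_NUMBER = re.compile(r"-?\d+(?:[.,]\d+)?")


def parse_value(raw):
    """Return the first number found in a raw text cell as a float.

    Hypothetical helper: returns None when the cell is empty or holds no number.
    """
    if raw is None:
        return None
    match = _NUMBER.search(str(raw))
    if match is None:
        return None
    return float(match.group().replace(",", "."))


if __name__ == "__main__":
    for cell in ("34,9 G/l", "91%", "n/a"):
        print(repr(cell), "->", parse_value(cell))
```

Cells for which the helper returns `None` would then be treated as missing values in the next step.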
Handling features with missing data
The next step is handling records that contain many missing values. With a large
number of features, removing every record that contains at least one missing value
can lead to a significant reduction of the DataSet and make the use of
classification methods impossible. To resolve this problem, empty values were
filled with default values (where possible). Next, the features with the most
missing data were identified.
It was decided to eliminate the features with more than 40% missing data from
further calculation, as their presence would make further analysis impossible. The
subsequent deletion of records with missing data reduced the DataSet by 19% (from
121 to 99 records). The resulting DataSet had 53 input fields (49 numeric, 3
categorical and one logical) and one output field.
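The two-stage filtering described above can be sketched in plain Python as follows. The 40% threshold comes from the text; the record layout (a list of dictionaries with `None` for a missing value), the function names and the toy feature names are assumptions for illustration only.

```python
def drop_sparse_features(records, max_missing=0.40):
    """Drop every feature whose share of missing (None) values exceeds max_missing."""
    n = len(records)
    keep = [
        name
        for name in records[0]
        if sum(r[name] is None for r in records) / n <= max_missing
    ]
    return [{name: r[name] for name in keep} for r in records]


def drop_incomplete_records(records):
    """Drop every record that still contains at least one missing value."""
    return [r for r in records if all(v is not None for v in r.values())]


if __name__ == "__main__":
    # Toy data, not taken from the hospital data set.
    toy = [
        {"age": 60, "ferritin": None, "saturation": 85},
        {"age": 71, "ferritin": None, "saturation": None},
        {"age": 55, "ferritin": 300, "saturation": 91},
        {"age": 48, "ferritin": None, "saturation": 93},
    ]
    cleaned = drop_incomplete_records(drop_sparse_features(toy))
    print(cleaned)
```

Dropping sparse features first keeps more records than deleting incomplete records directly, which is the motivation given in the text.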
Identification of factor importances
Pearson’s chi-squared criterion (χ²) and mutual information (MI) were used as
ranking methods to determine the importance of factors for the classification of
patients [14]–[16]. The magnitude of these criteria determines the significance of
a field in the classification. The results are presented in Table 1.
Table 1. The top 10 most important features for classification
| Features | χ² | Features | MI |
|---|---|---|---|
| Leukocytes 2 | 434 | Lymphocytes 2 | 0.3 |
| Band-neutrophils 2 | 352 | Leukocytes 2 | 0.28 |
| Lymphocytes 2 | 352 | Band-neutrophils 2 | 0.25 |
| Hematocrit 2 | 250 | Saturation without oxygen supply | 0.23 |
| Creatinine 2 | 226 | The duration of the hospitalization | 0.20 |
| Saturation without oxygen supply | 22 | Hematocrit 2 | 0.19 |
| The duration of the hospitalization | 183 | Creatinine 2 | 0.17 |
| C-reactive protein 2 | 146 | Hemoglobin 1 | 0.15 |
| The pulmonary insufficiency | 68 | Age | 0.13 |
| Gender | 50 | The course of the disease | 0.13 |
As can be seen from Table 1, the first seven factors in the two methods coincide.
The only difference is their importance. Therefore, the DataSet was reduced to
these seven features.
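To make the ranking step concrete, the sketch below computes a chi-squared score for one non-negative feature against the class labels and sorts features by it. It follows the common formulation in which the observed value is the sum of the feature over each class and the expected value is the class share of the feature total; whether the authors used exactly this formulation or a library routine is not stated, so this is an assumption. The function names and toy data are illustrative.

```python
def chi2_score(values, labels):
    """Chi-squared statistic between one non-negative feature and class labels.

    Assumes the feature total is positive, so every expected value is non-zero.
    """
    n = len(values)
    total = sum(values)
    score = 0.0
    for cls in set(labels):
        observed = sum(v for v, y in zip(values, labels) if y == cls)
        expected = total * labels.count(cls) / n
        score += (observed - expected) ** 2 / expected
    return score


def rank_features(table, labels):
    """Return feature names sorted from the highest to the lowest score."""
    scores = {name: chi2_score(column, labels) for name, column in table.items()}
    return sorted(scores, key=scores.get, reverse=True)


if __name__ == "__main__":
    # Toy data: the first feature separates the classes, the second does not.
    labels = [0, 0, 1, 1]
    table = {"informative": [1, 1, 9, 9], "constant": [5, 5, 5, 5]}
    print(rank_features(table, labels))
```

A mutual information ranking can be produced in the same way by swapping the scoring function.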
The next step was to check for correlation between the features. The result of the
correlation analysis is presented in Fig. 1.
As can be seen from Fig. 1, there is no correlation between the input factors. This
means that there is no need to perform factor analysis or to remove or transform
factors.
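A correlation check of this kind can be sketched with the Pearson coefficient computed for every pair of features. The article does not say which correlation coefficient underlies Fig. 1, so the use of Pearson's r is an assumption, as are the function names and the toy columns.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    spread_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    spread_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (spread_x * spread_y)


def correlation_matrix(table):
    """Nested dict holding the pairwise correlation for all features in table."""
    names = list(table)
    return {a: {b: pearson(table[a], table[b]) for b in names} for a in names}


if __name__ == "__main__":
    # Toy columns, not taken from the hospital data set.
    toy = {"a": [1, 2, 3], "b": [2, 4, 6], "c": [3, 2, 1]}
    for name, row in correlation_matrix(toy).items():
        print(name, {k: round(v, 2) for k, v in row.items()})
```

Off-diagonal values close to zero would support keeping all seven features without further transformation.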
Proposed models
This paper aims to build a forecast model that, on the one hand, provides the
highest accuracy in solving the problem and, on the other hand, allows the result
to be visualized in the form of a decision tree. It is impossible to achieve both
at the same time. An ensemble provides the highest accuracy: it is based on a set
of base models whose results are combined by a meta-model, which increases the
accuracy compared to the single models that form the ensemble. However, the
decision of such a model cannot be visualized in the form of a decision tree.
Therefore, we considered two approaches to prognosis: one based on a decision
tree, the other on an ensemble.
Decision tree model. The decision tree method was used to determine the
classification rules and visualize the results [17]. The main advantage of choosing
this method is the ability to visualize the result of classification analysis in the
form of a decision tree. However, the accuracy of this method is not the best.
Fig. 1. Correlation matrix of input features
The Gini coefficient [18] was chosen as the criterion for measuring the quality of
a split threshold. It is an indicator of the inequality of the distribution of some
set of values, which takes values between 0 and 1, where 0 means absolute equality
(the value takes only one value) and 1 denotes complete inequality. The strategy
used to select the split in each node is to find the best split.
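In decision-tree software the Gini criterion is usually computed as the Gini impurity of a node, one minus the sum of squared class shares; for two classes it ranges from 0 (pure node) to 0.5 (classes evenly mixed). Assuming this is the form used here, the sketch below computes the impurity of a node and the weighted impurity of a candidate split. The function names and toy values are illustrative.

```python
from collections import Counter


def gini_impurity(labels):
    """Gini impurity of a node: 1 minus the sum of squared class shares."""
    n = len(labels)
    return 1.0 - sum((count / n) ** 2 for count in Counter(labels).values())


def split_impurity(values, labels, threshold):
    """Weighted Gini impurity of the two children produced by value <= threshold."""
    left = [y for v, y in zip(values, labels) if v <= threshold]
    right = [y for v, y in zip(values, labels) if v > threshold]
    n = len(labels)
    return sum(len(part) / n * gini_impurity(part) for part in (left, right) if part)


if __name__ == "__main__":
    # Toy lymphocyte values; False = fatal outcome, True = survival.
    values = [3, 4, 12, 15]
    labels = [False, False, True, True]
    print(gini_impurity(labels))
    print(split_impurity(values, labels, threshold=9.5))
```

Choosing the best split then amounts to trying candidate thresholds for each feature and keeping the one with the lowest weighted impurity.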
Ensemble of classification models. The literature considers three main approaches
to constructing ensemble models: boosting, bagging, and stacking.
In this work, we build a prediction model based on the stacking approach. The
model assumes N base algorithms that form a stacking ensemble. A meta-algorithm
weighs the results of their work, and the quality of the meta-algorithm determines
how well the stated task is solved.
The data set we collected to solve the problem of predicting the level of mortality
contains many independent attributes. In addition, there are complex, nonlinear,
non-obvious and unexplored relationships between different features. Evidently,
many linear machine learning methods, in particular, will not provide sufficient
accuracy. If such algorithms are included in the general ensemble model, they will
reduce its accuracy. That is why we propose a preliminary selection of the base
algorithms that will form the stacking ensemble. It is based on initial modeling
with the machine learning algorithms and evaluation of their efficiency using the
following four performance metrics: Accuracy, Precision, Recall and F1 Score.
Accuracy is the fraction of samples whose predicted label exactly matches the
corresponding label in the target.
Precision is the ratio:
precision = tp / (tp + fp),
where tp is the number of true positives and fp is the number of false positives.
Intuitively, precision is the ability of the classifier not to label as positive a
sample that is negative.
Recall is the ratio:
recall = tp / (tp + fn),
where fn is the number of false negatives. Intuitively, recall is the ability of
the classifier to find all the positive samples.
The F1 Score is the harmonic mean of precision and recall; it reaches its best
value at 1 and its worst at 0. The relative contributions of precision and recall
to the F1 score are equal. The formula for the F1 score is:
F1 = 2 * (precision * recall) / (precision + recall).
For example, the Precision of a classifier is the fraction of the samples it
labeled as death that really are deaths. Its Recall is the fraction of all death
samples in the DataSet that it correctly labeled as death.
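The four metrics defined above can be computed directly from the confusion counts. The following sketch is a plain-Python illustration; the function name, the choice to return 0.0 when a denominator is zero, and the toy labels are assumptions rather than part of the original study.

```python
def classification_metrics(y_true, y_pred, positive=True):
    """Accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    correct = sum(t == p for t, p in pairs)

    accuracy = correct / len(pairs)
    # Assumption: a metric with a zero denominator is reported as 0.0.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"accuracy": accuracy, "precision": precision, "recall": recall, "f1": f1}


if __name__ == "__main__":
    # Toy labels: 1 = death, 0 = survival.
    y_true = [1, 1, 1, 1, 0, 0, 0, 0]
    y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
    print(classification_metrics(y_true, y_pred, positive=1))
```

In the toy example three of the four deaths are found (recall 0.75) and three of the four predicted deaths are real (precision 0.75).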
RESULTS AND DISCUSSIONS
Performance evaluation of the investigated decision tree model
The DataSet was split into train and test sets in the proportion of 70/30 to fit
the algorithm and determine its accuracy. The resulting decision tree is presented
in Fig. 2.
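A 70/30 split of this kind can be sketched as a shuffled partition of the records. Only the proportion is taken from the text; the shuffling, the fixed seed and the rounding of the test size are assumptions, and the article does not say whether the split was stratified by outcome.

```python
import random


def split_70_30(records, test_share=0.30, seed=0):
    """Shuffle the records and return (train, test) in roughly 70/30 proportion."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed only for reproducibility
    n_test = round(len(shuffled) * test_share)
    return shuffled[n_test:], shuffled[:n_test]


if __name__ == "__main__":
    # 99 record indices stand in for the 99 patients left after cleaning.
    train, test = split_70_30(range(99))
    print(len(train), len(test))
```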
Fig. 2. The decision tree determines whether the patient will die (False) or remain
alive (True). Gini - distribution inequality, samples - the number of cases,
value - the value of the classifier function, class - belonging to the class
Performance metrics on the train DataSet were: accuracy = 0.90, precision = 0.89,
recall = 0.88 and f1 = 0.89. For the test DataSet we obtained accuracy = 0.88,
precision = 0.86, recall = 0.86 and f1 = 0.86. The small difference between the
test and training results indicates a good fit of this method: the model predicts
unknown (new) data with the same level of accuracy as known data. The high values
of all metrics indicate the accuracy and adequacy of the model. The tree allows the
doctor to personally guide the patient through it and quickly determine the class
to which the patient belongs. Creating an automated decision-making program based
on trees is not a problem. The construction of the decision tree made it possible
to establish the importance of features for this classifier (Table 2).
Table 2. The importance of the decision tree features
| Feature | Importance |
|---|---|
| Lymphocytes 2 | 0.62 |
| Band-neutrophils 2 | 0.13 |
| Saturation without oxygen supply | 0.12 |
| Creatinine 2 | 0.04 |
| The duration of the hospitalization | 0.04 |
| Leukocytes 2 | 0.03 |
| Hematocrit 2 | 0.02 |
As can be seen from Table 2, the most important factor in the decision tree is the
number of lymphocytes a week after hospitalization (lymphocytes 2). A decreased
level of lymphocytes as a marker of severe SARS-CoV-2 infection was described in
[19], [20]. Our study, in turn, shows the importance of this parameter as a risk
marker of a fatal outcome. A further decrease of the lymphocytes a week after the
beginning of intensive treatment of a SARS-CoV-2 patient points to the exhaustion
of the immune defense and increases the probability of a fatal outcome.
The next important factor is a well-known indicator of the activity of the
inflammatory process: the number of band neutrophils [21] measured on the 7th day
after the beginning of intensive care of the SARS-CoV-2 patient. An unfavorable
prognostic marker was the combination of an increasing number of band neutrophils
and a decreasing number of lymphocytes. The chances of survival of a patient with
SARS-CoV-2 pneumonia are reduced in the case of severe activation of the
inflammatory process with depression of the specific immune response.
The third important factor in the decision tree is the blood saturation without
oxygen supply at the moment of hospitalization. A low level of blood saturation
indirectly reflects the severity of the patient’s condition and of the lung damage,
and points to the exhaustion of the defensive and compensatory capacities of the
organism, cardio-circulatory decompensation, and severe tissue hypoxia [22]. The
value of this indicator as a predictor of an unfavorable prognosis of the disease
turned out to be quite logical.
Here are some examples of using the decision tree. Patient 1 was admitted to the
hospital with a blood saturation of 85%; a week later, his blood analysis revealed
leukocytes 34.9 G/l, band neutrophils 24, lymphocytes 3, hematocrit 43.1, and
creatinine 142. Let’s take the patient through the decision tree:
lymphocytes ≤ 9.5 (yes) → saturation without oxygen supply ≤ 92.5 (yes) →
leukocytes ≤ 7.05 (no) → lymphocytes ≤ 7.75 (yes) → band neutrophils ≤ 7.5 (no) →
Class False, which means the prognosis is unfavorable. Indeed, on the 12th day
after admission, the patient’s death was recorded.
Patient 12 was admitted to the hospital with a blood saturation of 91%; a week
later, his blood analysis revealed leukocytes 6.1 G/l, band neutrophils 2,
lymphocytes 9, hematocrit 46, and creatinine 117. Let’s take the patient through
the decision tree: lymphocytes ≤ 9.5 (yes) → saturation without oxygen supply ≤
92.5 (yes) → leukocytes ≤ 7.05 (yes) → saturation without oxygen supply ≤ 79.5
(no) → Class True, which indicates a favorable prognosis. Indeed, this patient was
discharged from the hospital on the 13th day of treatment.
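The two walkthroughs above can be replayed in code. The sketch below encodes only the two paths quoted in the text, not the full tree of Fig. 2, and assumes that every node tests whether the feature is less than or equal to the threshold, which is consistent with the yes/no answers listed for both patients. The dictionary keys and the function name are illustrative.

```python
# Values quoted in the text for the two example patients.
PATIENT_1 = {"saturation": 85, "leukocytes": 34.9, "band_neutrophils": 24, "lymphocytes": 3}
PATIENT_12 = {"saturation": 91, "leukocytes": 6.1, "band_neutrophils": 2, "lymphocytes": 9}

# (feature, threshold) pairs in the order in which each patient meets them.
PATH_1 = [
    ("lymphocytes", 9.5),
    ("saturation", 92.5),
    ("leukocytes", 7.05),
    ("lymphocytes", 7.75),
    ("band_neutrophils", 7.5),
]
PATH_12 = [
    ("lymphocytes", 9.5),
    ("saturation", 92.5),
    ("leukocytes", 7.05),
    ("saturation", 79.5),
]


def trace(patient, path):
    """Answer each node test 'feature <= threshold' along a given path."""
    return [patient[feature] <= threshold for feature, threshold in path]


if __name__ == "__main__":
    print(trace(PATIENT_1, PATH_1))    # ends in the leaf labeled Class False
    print(trace(PATIENT_12, PATH_12))  # ends in the leaf labeled Class True
```

An automated version of the tree would store the next node for each yes/no answer instead of a fixed path.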
Performance evaluation of the investigated ML-ensemble
It was decided to enlarge the train DataSet and use an ensemble of classification
models to improve the quality of fitting and eliminate overfitting. For this
purpose, all available records were used as the training set. An additional 83
patients were studied to obtain a test DataSet. The new data were obtained in the
same hospital department, which is why the distribution of the test DataSet was the
same.
The choice of classifiers for the ensemble was based on the analysis of the
accuracy of each of them. The presence of overfitting on the train DataSet was also
assessed. An experimental comparison of the efficiency of ten existing machine
learning methods using the four performance metrics on the train and test DataSets
was carried out (Tables 3 and 4).
Table 3. The results of prediction based on performance criteria using all the
studied machine learning algorithms (Train Data Set)
| Machine learning method | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| Logistic regression (LR) | 0.89 | 0.90 | 0.84 | 0.87 |
| Decision tree (DT) | 0.89 | 0.87 | 0.87 | 0.87 |
| Quadratic discriminant analysis (QDA) | 0.84 | 0.94 | 0.68 | 0.79 |
| Naive Bayesian classifier (NB) | 0.84 | 0.91 | 0.70 | 0.79 |
| Random forest classifier (RF) | 0.95 | 0.93 | 0.95 | 0.94 |
| Adaptive Boosting classifier (AB) | 1.00 | 1.00 | 1.00 | 1.00 |
| Support Vector Classification (SVC) | 0.89 | 0.95 | 0.80 | 0.86 |
| Stochastic Gradient Descent (SGD) | 0.75 | 0.64 | 0.98 | 0.77 |
| Neural Network (NN) | 0.98 | 0.97 | 0.97 | 0.97 |
| Gradient Boosting (GB) | 1.00 | 1.00 | 1.00 | 1.00 |
Table 4. The results of prediction based on performance criteria using all the
studied machine learning algorithms (Test Data Set)
| Machine learning method | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| Logistic regression (LR) [23] | 0.78 | 0.74 | 0.74 | 0.74 |
| Decision tree (DT) | 0.86 | 0.85 | 0.84 | 0.85 |
| Quadratic discriminant analysis (QDA) [24] | 0.75 | 0.77 | 0.57 | 0.66 |
| Naive Bayesian classifier (NB) [25] | 0.72 | 0.73 | 0.54 | 0.62 |
| Random forest classifier (RF) [26] | 0.62 | 0.56 | 0.57 | 0.56 |
| Adaptive Boosting classifier (ABC) [27] | 0.66 | 0.59 | 0.69 | 0.63 |
| Support Vector Classification (SVC) [28] | 0.77 | 0.79 | 0.63 | 0.70 |
| Stochastic Gradient Descent (SGD) [29] | 0.48 | 0.44 | 0.91 | 0.60 |
| Neural Network (NN) [30–32] | 0.77 | 0.70 | 0.80 | 0.75 |
| Gradient Boosting (GB) [33, 34] | 0.63 | 0.55 | 0.68 | 0.61 |
As you can see in Table 3, Decision Tree, AdaBoost and Gradient Boosting had
problems with overfitting: they have 100% accuracy on the train DataSet and very
low accuracy on the test DataSet. Therefore, we excluded them from further
analysis. The other seven classifiers had similar accuracy. Therefore, to improve
accuracy, we combined them into an ensemble. A joint decision of these methods was
obtained with the Voting Classifier [35]. The basic idea of a voting classifier is
to combine conceptually different machine learning classifiers and use the majority
of votes (hard voting) or the average predicted probabilities (soft voting) to
predict class labels. In our case, hard voting was used, i.e., the class was
determined by the majority of the classifiers’ votes. The performance of this
ensemble is presented in Table 5.
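Hard voting itself is simple enough to sketch without any library: each base classifier casts one vote per patient and the most frequent label wins. The function names and the toy votes below are illustrative; with seven binary classifiers, as in the text, ties cannot occur.

```python
from collections import Counter


def hard_vote(votes):
    """Return the label chosen by the majority of classifiers for one patient."""
    return Counter(votes).most_common(1)[0][0]


def ensemble_predict(per_classifier_predictions):
    """Combine per-classifier prediction lists into one prediction per patient."""
    return [hard_vote(votes) for votes in zip(*per_classifier_predictions)]


if __name__ == "__main__":
    # Toy votes of seven classifiers for one patient (True = survives).
    print(hard_vote([True, True, False, True, False, False, True]))
    # Three toy classifiers, two patients each.
    print(ensemble_predict([[1, 0], [1, 1], [0, 0]]))
```

Soft voting would instead average the predicted class probabilities and pick the class with the highest mean.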
Table 5. The results of prediction based on performance criteria using the
ensemble of machine learning algorithms
| Voting Classifier | Accuracy | Precision | Recall | F1 Score |
|---|---|---|---|---|
| Train Data Set | 0.94 | 0.95 | 0.91 | 0.93 |
| Test Data Set | 0.91 | 0.88 | 0.88 | 0.88 |
For comparison, we present the results on one plot (Fig. 3).
As can be seen from the plot, the ensemble has the best performance. Recall for SGD
is higher than for the ensemble, but its other performance metrics are lower. The
ensemble is stable in its joint decision because Precision and Recall have the same
high value. Thus, using an ensemble of ML models made it possible to avoid
overfitting and to increase the accuracy and stability of the forecast. The
forecast error (bias) on the train DataSet is 6%, and the variance (the gap between
the train and test accuracy) is 3%. So, we can conclude that to reduce the variance
(reduce the error on the test DataSet) it is enough to simply increase the train
DataSet. This will lead to a slight decrease in the accuracy on the train DataSet
and an increase in the accuracy on the test DataSet.
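The 6% and 3% figures follow from the accuracies in Table 5, assuming that bias is taken as one minus the train accuracy and variance as the gap between train and test accuracy. A minimal check of this arithmetic, with an illustrative function name:

```python
def bias_and_variance(train_accuracy, test_accuracy):
    """Bias as the train error, variance as the train/test accuracy gap."""
    bias = 1.0 - train_accuracy
    variance = train_accuracy - test_accuracy
    return bias, variance


if __name__ == "__main__":
    # Ensemble accuracies from Table 5.
    bias, variance = bias_and_variance(0.94, 0.91)
    print(f"bias = {bias:.0%}, variance = {variance:.0%}")
```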
A further increase in both indicators is possible provided that the train DataSet
grows and, at the same time, new factors are included in the calculation or the
classification models are made more complex, for example by adding classifiers
based on neural networks to the ensemble.
Fig. 3. Comparison of performance metrics of investigated classifiers and their
ensemble: 1 - Accuracy, 2 - Precision, 3 - Recall, 4 - F1 Score
CONCLUSIONS
The only marker of an unfavorable outcome of SARS-CoV-2-associated pneumonia
present on the day of admission was a blood saturation of less than 92.5%. This is
the first and basic indicator checked in every patient, and doctors determine the
necessity of oxygen supply based on this parameter. In contrast to the severity of
the general condition, diabetes mellitus, the duration of the disease does not
increase the probability of a lethal outcome. The severity of the lung damage based
on the results of CT or ultrasound examination does not influence the chances of
dying from SARS-CoV-2 pneumonia.
But after a week of intensive treatment, we could reveal informative markers of the
lethal outcome: the numbers of lymphocytes and band neutrophils in peripheral
blood. Increasing activity of the inflammatory process, reflected in an increased
number of band neutrophils and leukocytes together with a decrease in lymphocytes,
points to the exhaustion of the specific immune response, the loss of immunological
control of the inflammation, and a high probability of the lethal outcome.
Using well-known parameters that are routinely used in clinical practice, an
accessible and understandable decision tree will allow the physician to determine
the prognosis in a few minutes and, accordingly, to understand the need for
treatment adjustment or transfer of the patient to the emergency department.
Creating a data collection system with further training of classifiers will
dynamically increase the accuracy of the forecast and automate the doctor's
decision-making process.
DECLARATIONS
Ethics approval and consent to participate
Not applicable.
Consent for publication
Not applicable.
Availability of data and materials
Data were obtained from the medical histories of patients who were hospitalized
at the Central Hospital in Chernivtsi, Ukraine. The data are available at:
https://github.com/vyklyuk/COVID_Chernivtsi
Competing interests
The authors declare that they have no competing interests.
Funding
This research received no external funding.
Authors’ contributions
Conceptualization, software, investigation, writing (original draft
preparation): Yaroslav Vyklyuk and Denys Nevinskyi; methodology: Svitlana
Levytska; software, validation, writing (review and editing): Kateryna Hazdiuk;
formal analysis, funding acquisition: Miroslav Škoda; resources, data curation:
Stanislav Andrushko and Maryna Palii. All authors have read and agreed to the
published version of the manuscript.
Ya. Vyklyuk, S. Levytska, D. Nevinskyi, K. Hazdiuk, M. Škoda, S. Andrushko, M. Palii
ISSN 1681–6048 System Research & Information Technologies, 2023, № 1 34
REFERENCES
1. Worldometer COVID-19 Coronavirus Pandemic. 2020. Accessed on: November 28,
2021. [Online]. Available: https://www.worldometers.info/coronavirus/
2. S. Priya, M. Selva Meena, J. Sangumani, P. Rathinam, C. Brinda Priyadharshini, and
V. Vijay Anand, “Factors influencing the outcome of COVID-19 patients admitted
in a tertiary care hospital, Madurai. -a cross-sectional study,” Clin Epidemiol Glob
Health, 2021. doi: 10.1016/j.cegh.2021.100705.
3. Annemarie Jutel, “Classification, Disease, and Diagnosis,” Perspectives in Biology
and Medicine, Project MUSE, vol. 54 no. 2, pp. 189–205, 2011. doi:
10.1353/pbm.2011.0015.
4. Aiping Lu, Miao Jiang, Chi Zhang, and Kelvin Chan, “An integrative approach of
linking tradi-tional Chinese medicine pattern classification and biomedicine diagno-
sis,” Journal of Ethnopharmacology, vol. 141, issue 2, pp. 549–556, 2012.
Available: https://doi.org/10.1016/j.jep.2011.08.045
5. O.S. Albahri et al., “Systematic review of artificial intelligence techniques in the de-
tection and classification of COVID-19 medical images in terms of evaluation and
benchmarking: Taxonomy analysis, challenges, future solutions and methodological
aspects,” Journal of Infection and Public Health, vol. 13, issue 10, pp. 1381–1396,
2020. Available: https://doi.org/10.1016/j.jiph.2020.06.028
6. Gonçalo Marques, Deevyankar Agarwal, and Isabel de la Torre Díez, “Automated
medical diagnosis of COVID-19 through EfficientNet convolutional neural
network,” Applied Soft Computing, 2020, vol. 96. Available:
https://doi.org/10.1016/j.asoc.2020.106691
7. X. Wang et al., “A Weakly-Supervised Framework for COVID-19 Classification and
Lesion Localization from Chest CT,” IEEE Transactions on Medical Imaging,
vol. 39, no. 8, pp. 2615–2625, 2020. doi: 10.1109/TMI.2020.2995965.
8. M.E.H. Chowdhury et al., “An Early Warning Tool for Predicting Mortality Risk of
COVID-19 Patients Using Machine Learning,” Cogn. Comput., 2021. Available:
https://doi.org/10.1007/s12559-020-09812-7
9. Li Tan et al., “Validation of Predictors of Disease Severity and Outcomes in
COVID-19 Patients: A Descriptive and Retrospective Study,” Med, vol. 1, issue 1,
pp. 128–138, 2020. Available: https://doi.org/10.1016/j.medj.2020.05.002
10. Ashutosh Kumar Dubey, Sushil Narang, Abhishek Kumar, Sasubilli Satya Murthy,
and Vicente García-Díaz, “Performance Estimation of Machine Learning Algorithms
in the Factor Analysis of COVID-19 Dataset,” Computers, Materials, & Continua,
66(2), pp. 1921–1936, 2021.
11. Danying Liao et al., “Haematological characteristics and risk factors in the classifi-
cation and prognosis evaluation of COVID-19: a retrospective cohort study,” The
Lancet Haematology, vol. 7, issue 9, pp. e671–e678, 2020. Available:
https://doi.org/10.1016/S2352-3026(20)30217-9
12. M. Aguiar and N. Stollenwerk, “Condition-specific mortality risk can explain differ-
ences in COVID-19 case fatality ratios around the globe,” Public Health, vol. 188,
pp. 18–20, 2020.
13. Rocio Laguna-Goya et al., “IL-6–based mortality risk model for hospitalized patients
with COVID-19,” Journal of Allergy and Clinical Immunology, vol. 146, issue 4,
pp. 799–807, 2020. Available: https://doi.org/10.1016/j.jaci.2020.07.009
14. R. Rana and R. Singhal, “Chi-square test and its application in hypothesis testing,”
J. Pract. Cardiovasc. Sci., 1, pp. 69–71, 2015. doi: 10.4103/2395-5414.157577.
15. B.C. Ross, “Mutual Information between Discrete and Continuous Data Sets,” PLoS
ONE, 9(2), 2014. Available: https://doi.org/10.1371/journal.pone.0087357
16. E. Archer, I.M. Park, and J. Pillow, “Bayesian and Quasi-Bayesian Estimators for
Mutual Information from Discrete Data,” Entropy, 15 (12), pp. 1738–1755, 2013.
doi: 10.3390/e15051738.
Decision-tree and ensemble-based mortality risk models for hospitalized patients with COVID-19
Системні дослідження та інформаційні технології, 2023, № 1 35
17. S.R. Safavian and D. Landgrebe, “A survey of decision tree classifier methodology”,
Systems Man and Cybernetics IEEE Transactions, vol. 21, no. 3, pp. 660–674, 1991.
18. Laura Elena Raileanu and Kilian Stoffel, “Theoretical Comparison between the Gini
Index and Information Gain Criteria,” Annals of Mathematics and Artificial Intelli-
gence, vol. 41, pp. 77–93, 2004. doi: 10.1023/B:AMAI.0000018580.96245.c6.
19. J. Wagner, A. DuPont, S. Larson, B. Cash, and A. Farooq, “Absolute lymphocyte
count is a prognostic marker in Covid-19: A retrospective cohort review,” Int.
J. Lab. Hematol., vol. 42(6), pp. 761–765, 2020. doi: 10.1111/ijlh.13288.
20. A. Mazzoni, L. Salvati, L. Maggi, F. Annunziato, and L. Cosmi, “Hallmarks of im-
mune response in COVID-19: Exploring dysregulation and exhaustion,” Semin. Im-
munol., 2021. doi: 10.1016/j.smim.2021.101508.
21. J. Wang, M. Jiang, X. Chen, and L.J. Montaner, “Cytokine storm and leukocyte
changes in mild versus severe SARS-CoV-2 infection: Review of 3939 COVID-19
patients in China and emerging pathogenesis and therapy concepts,” J. Leukoc. Biol.,
108(1), pp. 17–41, 2020. doi: 10.1002/JLB.3COVR0520-272R.
22. D. Böning, W.M. Kuebler, and W. Bloch, “The oxygen dissociation curve of blood
in COVID-19,” Am. J. Physiol. Lung. Cell. Mol. Physiol., vol. 321(2), L349–L357,
2021. doi: 10.1152/ajplung.00079.2021.
23. J. Tolles and W.J. Meurer, “Logistic Regression: Relating Patient Characteristics to
Outcomes,” JAMA, vol. 316(5), pp. 533–534, 2016. doi: 10.1001/jama.2016.7653.
24. Alaa Tharwat, “Linear vs. quadratic discriminant analysis classifier: a tutorial,” In-
ternational Journal of Applied Pattern Recognition, vol. 3.2, pp. 145–180, 2016.
25. P. Domingos and M. Pazzani, “On the optimality of the simple Bayes-ian classifier
under zero-one loss,” Machine Learning, vol. 29, pp. 103–137, 1997.
26. Leo Breiman, “Random Forests,” Machine Learning, 45 (1), pp. 5–32, 2001. doi:
10.1023/A:1010933404324.
27. Zhao Yan, Xing Chen, and Jun Yin, “Adaptive boosting-based computational model
for predicting potential miRNA-disease associations,” Bioinformatics, vol. 35.22,
pp. 4730–4738, 2019.
28. Chih-Wei Hsu, Chih-Chung Chang, and Chih-Jen Lin, A Practical Guide to Support
Vector Classification. Department of Computer Science, National Taiwan University
(Hrsg.), 2003.
29. L. Bottou, “Large-scale machine learning with stochastic gradient descent,” Pro-
ceedings of COMPSTAT, Physica-Verlag HD, 2010, pp. 177–186.
30. Shaohua Wan et al., “Deep multi-layer perceptron classifier for behavior analysis to
estimate parkinson’s disease severity using smartphones,” IEEE Access, 6,
pp. 36825–36833, 2018.
31. D.C. Liu and J. Nocedal, “On the limited memory BFGS method for large scale op-
timization,” Mathematical Programming, vol. 45, pp. 503–528, 1989. Available:
https://doi.org/10.1007/BF01589116
32. Ph. Moritz, N. Robert, and M. Jordan, “A linearly-convergent stochastic L-BFGS
algorithm,” Proceedings of the 19th International Conference on Artificial Intelli-
gence and Statistics, PMLR, 51, pp. 249–258, 2016.
33. S. Madeh Piryonesi and Tamer El-Diraby, “Data Analytics in Asset Management:
Cost-Effective Prediction of the Pavement Condition Index,” Journal of Infrastruc-
ture Systems, 26 (1): 04019036, 2020. doi: 10.1061/(ASCE)IS.1943-555X.0000512.
34. T. Hastie, R. Tibshirani, and J.H. Friedman, “Boosting and Additive Trees,” The
Elements of Statistical Learning (2nd ed.). New York: Springer, 2009, pp. 337–384.
35. Onan Aytuğ, Serdar Korukoğlu, and Hasan Bulut, “A multiobjective weighted vot-
ing ensemble classifier based on differential evolution algorithm for text sentiment
classification,” Expert Systems with Applications, vol. 62, pp. 1–16, 2016.
Received 10.06.2022
INFORMATION ON THE ARTICLE
Yaroslav I. Vyklyuk, ORCID: 0000-0003-4766-4659, Lviv Polytechnic National University, Ukraine, e-mail: vyklyuk@ukr.net
Svitlana A. Levytska, ORCID: 0000-0001-6616-3572, Bukovinian State Medical University, Ukraine, e-mail: levitska.svitlana@bsmu.edu.ua
Denys V. Nevinskyi, ORCID: 0000-0002-0962-072X, Lviv Polytechnic National University, Ukraine, e-mail: nevinskiy90@gmail.com
Kateryna P. Hazdiuk, ORCID: 0000-0002-7568-4422, Yuriy Fedkovych Chernivtsi National University, Ukraine, e-mail: kateryna.gazdyik@gmail.com
Miroslav Škoda, ORCID: 0000-0001-6658-2742, DTI University, Slovakia, e-mail: skoda@dti.sk
Stanislav D. Andrushko, Chernivtsi central hospital, Ukraine, e-mail: stanislav.andrushko14@gmail.com
Maryna A. Palii, Chernivtsi central hospital, Ukraine, e-mail: marinapaljj90@gmail.com
DECISION-TREE AND ENSEMBLE-BASED MORTALITY RISK MODELS FOR HOSPITALIZED
PATIENTS WITH COVID-19 /
Ya.I. Vyklyuk, S.A. Levytska, D.V. Nevinskyi, K.P. Hazdiuk, M. Škoda,
S.D. Andrushko, M.A. Palii
Abstract. The work is devoted to the study of SARS-CoV-2-associated pneumonia
and of the main indicators that lead to patient mortality. Using well-known
parameters that are routinely applied in clinical practice, entirely new
functional dependencies were obtained on the basis of an accessible and
understandable decision tree and ML classifier models, which will allow the
physician to determine the prognosis within a few minutes and, accordingly, to
recognize the need to adjust treatment or to transfer the patient to the
emergency department. The accuracy of the resulting ensemble of models, fitted
on real hospital patient data, was 0.88–0.91 for different metrics. Creating a
data collection system with further training of the classifiers will make it
possible to increase forecast accuracy dynamically and to automate the
physician's decision-making process.
Keywords: COVID-19, decision-making system, decision tree, ML ensemble,
ensemble of classification models.
|
| id | journaliasakpiua-article-279747 |
| institution | System research and information technologies |
| keywords_txt_mv | keywords |
| language | English |
| last_indexed | 2025-07-17T10:28:07Z |
| publishDate | 2023 |
| publisher | The National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute" |
| record_format | ojs |
| resource_txt_mv | journaliasakpiua/80/6088814ed26f6454ea5b2c418f7f8680.pdf |
| spelling | journaliasakpiua-article-2797472023-05-24T21:28:17Z Decision-tree and ensemble-based mortality risk models for hospitalized patients with COVID-19 Моделі ризику смертності на основі дерева рішень і ансаблю для госпіталізованих пацієнтів із COVID-19 Vyklyuk, Yaroslav Levytska, Svitlana Nevinskyi, Denys Hazdiuk, Kateryna Škoda, Miroslav Andrushko, Stanislav Palii, Maryna COVID-19 система прийняття рішень дерево рішень ML-ансамбль ансамбль класифікаційних моделей COVID-19 decision-making system decision tree ML-ensemble ensemble of classification models The work is devoted to studying SARS-CoV-2-associated pneumonia and the investigating of the main indicators that lead to the patients’ mortality. Using the good-known parameters that are routinely embraced in clinical practice, we obtained new functional dependencies based on an accessible and understandable decision tree and ML ensemble of classifiers models that would allow the physician to determine the prognosis in a few minutes and, accordingly, to understand the need for treatment adjustment, transfer of the patient to the emergency department. The accuracy of the resulting ensemble of models fitted on actual hospital patient data was in the range of 0.88–0.91 for different metrics. Creating a data collection system with further training of classifiers will dynamically increase the forecast’s accuracy and automate the doctor’s decision-making process. Присвячено вивченню пневмонії, асоційованої із SARS-CoV-2 та дослідженню основних показників, що призводять до смертності хворих. Використовуючи добре відомі параметри, які регулярно застосовуються в клінічній практиці, отримано абсолютно нові функціональні залежності на основі доступного та зрозумілого дерева рішень і моделей класифікаторів ML, що дозволить лікарю визначити прогноз за кілька хвилин і, відповідно, зрозуміти необхідність коригування лікування, переведення хворого до відділення невідкладної допомоги. 
Точність отриманого ансамблю моделей, підібраних за реальними даними пацієнтів лікарні, становила 0,88–0,91 для різних показників. Створення системи збирання даних з подальшим навчанням класифікаторів дасть змогу динамічно підвищити точність прогнозу та автоматизувати процес прийняття рішення лікарем. The National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute" 2023-03-30 Article Article application/pdf https://journal.iasa.kpi.ua/article/view/279747 10.20535/SRIT.2308-8893.2023.1.02 System research and information technologies; No. 1 (2023); 23-36 Системные исследования и информационные технологии; № 1 (2023); 23-36 Системні дослідження та інформаційні технології; № 1 (2023); 23-36 2308-8893 1681-6048 en https://journal.iasa.kpi.ua/article/view/279747/274346 |
| title | Моделі ризику смертності на основі дерева рішень і ансаблю для госпіталізованих пацієнтів із COVID-19 |
| title_alt | Decision-tree and ensemble-based mortality risk models for hospitalized patients with COVID-19 |
| topic | COVID-19 система прийняття рішень дерево рішень ML-ансамбль ансамбль класифікаційних моделей |
| topic_facet | COVID-19 система прийняття рішень дерево рішень ML-ансамбль ансамбль класифікаційних моделей COVID-19 decision-making system decision tree ML-ensemble ensemble of classification models |
| url | https://journal.iasa.kpi.ua/article/view/279747 |