Порівняння ефективності методів заповнення пропущених даних під час розроблення моделей прогнозування
Missing data is a common issue in data analysis and machine learning. This article analyzes the impact of missing data imputation methods during the data preprocessing stage on the quality of forecasting models. Selected methods are listwise deletion, mean imputation, and two implementations of the...
Збережено в:
| Дата: | 2025 |
|---|---|
| Автор: | |
| Формат: | Стаття |
| Мова: | Англійська |
| Опубліковано: |
The National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute"
2025
|
| Теми: | |
| Онлайн доступ: | https://journal.iasa.kpi.ua/article/view/301918 |
| Теги: |
Додати тег
Немає тегів, Будьте першим, хто поставить тег для цього запису!
|
| Назва журналу: | System research and information technologies |
| Завантажити файл: | |
Репозитарії
System research and information technologies| _version_ | 1867334443253891072 |
|---|---|
| author | Popov, Andrii |
| author_facet | Popov, Andrii |
| author_institution_txt_mv | [
{
"author": "Andrii Popov",
"institution": "Навчально-науковий Інститут Прикладного Системного Аналізу Національного Технічного Університету України \"Київський Політехнічний Інститут імені Ігоря Сікорського\", Київ"
}
] |
| author_sort | Popov, Andrii |
| baseUrl_str | http://journal.iasa.kpi.ua/oai |
| collection | OJS |
| datestamp_date | 2025-05-20T17:56:07Z |
| description | Missing data is a common issue in data analysis and machine learning. This article analyzes the impact of missing data imputation methods during the data preprocessing stage on the quality of forecasting models. Selected methods are listwise deletion, mean imputation, and two implementations of the multiple imputation method in Python and R languages. Selected classifiers are Logistic Regression, Random Forest, Support Vector Machine, and Light Gradient Boosting Machine. The performance quality of forecasting models is estimated using accuracy, precision, and recall metrics. Two datasets were used as binary classification problems with different target metrics. The highest performance was achieved when the R implementation of the multiple imputation method was combined with RF and LGBM classifiers. |
| doi_str_mv | 10.20535/SRIT.2308-8893.2025.1.03 |
| first_indexed | 2025-07-17T10:28:28Z |
| format | Article |
| fulltext |
Publisher IASA at the Igor Sikorsky Kyiv Polytechnic Institute, 2025
32 ISSN 1681–6048 System Research & Information Technologies, 2025, № 1
TIДC
МАТЕМАТИЧНІ МЕТОДИ, МОДЕЛІ,
ПРОБЛЕМИ І ТЕХНОЛОГІЇ ДОСЛІДЖЕННЯ
СКЛАДНИХ СИСТЕМ
UDC 519.245+004.896
DOI: 10.20535/SRIT.2308-8893.2025.1.03
EFFICIENCY COMPARISON OF MISSING DATA IMPUTATION
METHODS IN PREDICTIVE MODEL CREATION
A. POPOV
Abstract. Missing data is a common issue in data analysis and machine learning.
This article analyzes the impact of missing data imputation methods during the data
preprocessing stage on the quality of forecasting models. Selected methods are list-
wise deletion, mean imputation, and two implementations of the multiple imputation
method in Python and R languages. Selected classifiers are Logistic Regression,
Random Forest, Support Vector Machine, and Light Gradient Boosting Machine.
The performance quality of forecasting models is estimated using accuracy, preci-
sion, and recall metrics. Two datasets were used as binary classification problems
with different target metrics. The highest performance was achieved when the R im-
plementation of the multiple imputation method was combined with RF and LGBM
classifiers.
Keywords: missing data, imputation methods, forecasting models, machine learning.
INTRODUCTION
Today, every forecasting task involves processing large amounts of information.
One of the key aspects of preparing data for creating predictive models is han-
dling missing values, as machines learning algorithms mostly require complete
data. In real-world datasets, it is common to find gaps that can occur for a variety
of reasons, such as technical issues, human errors, the specifics of the research in
which the data was collected, and other factors. Missing information in a dataset
can distort statistical parameters, which can have a serious impact on the quality
and reliability of the model and lead to incorrect conclusions. With proper han-
dling of missing data prior to model training, the probability of successful training
of a predictive model can be increased, which will positively affect its quality.
MISSING DATA MECHANISMS
To describe the logic behind the occurrence of missing data, the concept of a
missing data mechanism was created. A mechanism is a term that is meant to de-
scribe in a general way the relationship between missing and observed data. Ac-
cording to the most common classification, there are three types of mechanisms
based on what determines the probability of missing a particular variable in the
observation: missing completely at random (MCAR), missing at random (MAR),
and missing not at random (MNAR; Rubin 1976 [1]). MCAR is the case when the
Efficiency comparison of missing data imputation methods in predictive model creation
Системні дослідження та інформаційні технології, 2025, № 1 33
missingness is completely random, i.e., independent of the values of the variables
in the set. This mechanism usually poses the least problems for imputation, since
the statistical parameters of the data are by definition not biased. MAR — the ab-
sence of data on a particular variable depends on the values of other variables in
the set, but does not depend on the value of the variable itself. This mechanism
inherently contains bias and requires more careful handling than MCAR.
MNAR — the absence of a variable depends on the value of the variable itself.
This is the most complicated mechanism that does not have a clearly described
solution, and in this case, the processing of missing data requires a specialised
approach. Figure shows a simplified diagram of the mechanisms, where Y is the
variable in question, M is an indicator of missing data for Y, X is other variables,
a solid arrow is an existing dependency, and a dashed arrow is a possible dependency.
Simplified diagram of missing data mechanisms
KEY STAGES OF IMPUTATION
Data analysis. Determining statistical parameters of the available data, analysing
relations between variables, correlations, identifying missing data, analysing them
if possible, and determining the mechanism of their formation. The purpose of
this stage is to gain an understanding of the available data and, ideally, the miss-
ing data, which will greatly facilitate the process of filling in the data.
Method selection for processing missing data. The large variety of available
methods allows choosing the most suitable option for a particular task. The choice
may depend on the amount of missing data, the mechanism behind it and com-
plexity. Simple single-imputation methods (such as mean, mode, interpolation)
are very popular and generally accepted, they are easier to understand and imple-
ment, but have disadvantages that limit their use. Sophisticated methods usually
provide better imputation because they are able to take into account the relations
between the data and do not skew the statistical parameters as a result, thus there
are fewer limitations to their use. [2] In many cases, it makes sense to choose sev-
eral methods and compare the results to choose the most appropriate one for the
task at hand.
Performing the imputation. Application of the selected methods to fill in the
missing values based on the observed data. This stage results in a complete data-
set. The statistical quality of the imputation may depend on the nature of the gaps,
the number of gaps, and the selected method. The wrong choice of method can
lead to significant distortion of the results.
RELATED WORK
The topic of missing data processing is addressed in a large number of different
studies, since it appears in any field and can be solved in a variety of ways with
different levels of efficiency. With the accelerating development of artificial intel-
A. Popov
ISSN 1681–6048 System Research & Information Technologies, 2025, № 1 34
ligence and machine learning technologies, the topic of missing data processing
has become even more discussed — high-quality process modelling in any field
requires high-quality data, which creates the necessity of efficient processing of
missing values. Research on methods is diverse, and depends on the goal of the
researchers: some papers address general issues, review methods, and propose
solutions [3–7]. In other works, there is a specific problem and methods for solv-
ing it are considered.
In particular, in the paper “The impact of imputation quality on machine
learning classifiers for datasets with missing values” [8], the authors study the
impact of imputation methods on the predictive ability of models. The methods
studied are mean imputation, multiple imputation by chained equations (MICE),
MissForest, generative adversarial imputation networks (GAIN), and missing data
importance-weighted autoencoder (MIWAE); the selected datasets include both
complete datasets with MCAR gaps of 25–50% and datasets with intrinsic MNAR
gaps. The models under study are logistic regression, random forest, XGBoost,
and artificial neural network. The selected datasets are used to compare the results
between the trained models. The paper uses a multivariate ANOVA model to
evaluate the impact of the imputation on the quality of the models. The results
show that the quality of predictive models depends on the amount of missing data
and training on imputed datasets usually produces lower quality results compared
to training on complete datasets. At the same time, for the same dataset, the qual-
ity ranking of the models usually does not change for different amounts of miss-
ing data, i.e. a model that performs better on 25% of missing data will also per-
form better on 50% of the missing data. Different methods perform better
depending on the dataset, but some imputation methods have less variation in
quality across datasets, with MIWAE consistently performing well across the
study. In some cases, logistic regression, which typically has the worst quality
metrics, was also able to achieve high quality metrics.
Another paper by Jale Bektas, Turgay Ibrikci, and Ismail Turkay Ozcan [9]
investigates the impact of imputation methods on the quality of classifiers in the
task of diagnosing coronary artery disease. In this paper, three imputation meth-
ods based on machine learning techniques (K-means, multilayer perceptron, and
self-organising maps) are presented and their performance is compared with the
conventional mean imputation method and listwise deletion. The selected classifi-
cation methods were Logistic Model Trees (LMT), multilayer perceptron, random
forest method, and support vector machine. The developed imputation methods
showed significantly better results than the mean imputation method, which was
ranked fourth in terms of model quality, surpassed only by the listwise deletion.
The best results were achieved when using self-organising maps (88.23% accuracy),
and the most stable results were obtained when using a multilayer perceptron.
The papers “Do we really need imputation in AutoML predictive model-
ling?” [10] and “Does imputation matter? Benchmark for predictive models” [11]
investigate the necessity of using complex imputation methods in machine learn-
ing processes. In the first study, 6 imputation methods were used to process data
in 25 datasets with natural missing data and 10 datasets with artificial missing
data. In the second one, 7 imputation methods were used on 13 classification
tasks. The conclusions of both papers are that simple methods usually perform
slightly worse than more complex methods, while gaining considerably in compu-
tational power. The first paper found that using a binary indicator with simple
mean/mode imputation (for continuous and categorical data, respectively) per-
Efficiency comparison of missing data imputation methods in predictive model creation
Системні дослідження та інформаційні технології, 2025, № 1 35
formed well and was significantly more efficient than more complex methods. In
the second paper, simple methods also achieved good results, although even with
similar predictive quality of the models, more complex methods produced more
statistically accurate imputations.
In summary, the use of imputation methods at the stage of data preprocess-
ing is a common subject in machine learning, with wide application regardless of
the specific field of study.
STATEMENT OF THE RESEARCH PROBLEM
The purpose of this paper is to investigate the impact of missing data processing
method on the quality of predictive machine learning models. In the process, we
take complete datasets and using them as basis we artificially create datasets with
different missing data configurations to study the effect of imputations on the
predictive ability of models. All datasets are taken from the public domain.
The research algorithm consists of the following general steps: selection of
imputation methods for the study, selection of prediction methods, search and re-
search of datasets, creation of datasets with missing data, processing missing data,
training models on the obtained datasets, and analysis of the results.
Four imputation methods were selected for the study:
1. Listwise deletion.
2. Mean imputation.
3. Multiple imputation using Python library scikit-learn (Iterative Imputer).
4. Multiple imputation using R library MICE.
The following 4 algorithms were chosen as forecasting algorithms:
1. Logistic regression.
2. Support vector machine.
3. Random Forest.
4. LGBM (Light Gradient Boosting Machine).
The first selected dataset is the Churn dataset of bank customers, the task of
classification is to determine customer churn, i.e. to identify customers who are
likely to cancel their bank services based on the available data. The selected data-
set consists of 10.000 records and 10 variables, including 4 continuous and 6
categorical variables. The continuous variables are: CreditScore — customer’s
credit score, EstimatedSalary — customer’s estimated salary, Age — customer’s
age, and Balance — customer’s balance. Categorical variables include: Geography
— country of origin of the customer, Gender — gender of the customer, Tenure —
number of years the customer has been with the bank, NumOfProducts — number of
bank products used by the customer, HasCrCard — indicator of whether the
customer has a bank credit card, isActiveMember — indicator of customer activity,
and Exited — target variable reflecting the churn/retention status of the customer.
To handle missing data, we use only continuous variables. In total, 12 data-
sets with different types and numbers of gaps were created, including 4 datasets
with only MCAR gaps, 2 datasets with only MAR gaps, 1 dataset with only
MNAR gaps, 2 datasets with mixed MCAR and MAR gaps (such datasets are
considered in the MAR category), and 2 datasets with mixed gaps using MNAR
gaps. As a result, 48 datasets were obtained after completion of the imputation.
[12] Datasets with mixed gaps are considered in the category of a less strong as-
sumption — for example, for mixed MAR and MCAR gaps, the dataset is consid-
ered in the MAR category.
A. Popov
ISSN 1681–6048 System Research & Information Technologies, 2025, № 1 36
The second selected dataset is a set of characteristic parameters of wine for
the purpose of wine quality classification. The selected dataset consists of 1599
records and 12 variables, of which 11 are continuous and 1 is a categorical vari-
able. The categorical variable is the target variable quality. The continuous vari-
ables are: fixed acidity — fixed (nonvolatile) acids, volatile acidity — the amount
of volatile acids, citric acid — the amount of citric acid, residual sugar — the
amount of residual sugar after the fermentation process is stopped, chlorides —
the amount of salt, free sulfur dioxide — the free form of SO2 that exists in equi-
librium between molecular SO2 (as a dissolved gas) and bisulfite ion, total sulfur
dioxide — the amount of free and bound SO2, density — the density (the density
of wine is almost the same as that of water, depending on the alcohol and sugar
content), pH — an indicator of the acidity/alkalinity of wine from 0 to 14 (most
wines are between 3–4 on this scale) and sulphates — additives to wine that can
contribute to the level of SO2.
In total, 9 datasets with different types and numbers of gaps were created,
including 3 datasets with exclusively MCAR gaps, 2 datasets with exclusively
MAR gaps, 1 dataset with exclusively MNAR gaps, 1 dataset with mixed MCAR
and MAR gaps, and 2 datasets with mixed gaps using MNAR gaps. As a result,
36 datasets were obtained after the completion of the imputation.
PERFORMANCE METRICS
To evaluate the quality of the obtained predictive models, we used the accuracy,
precision and recall metrics based on the confusion matrix.
Accuracy is a metric of the overall classification accuracy of the model,
calculated as the ratio of correct predictions to all predictions.
Precision is a metric that shows how many positive predictions were correct.
Recall is a metric that shows how many elements of a positive class were
detected by the model.
For each dataset, one of the metrics is the target metric, i.e. the main quality
criterion in the context of a particular task. The quality comparison was performed for
the values of the target metrics for the models trained on the imputed datasets.
In the Churn dataset, the target metric is recall, since the most important
ability of the model should be the ability to correctly identify customers who will
leave. In the Wine dataset, the target metric is accuracy, since the accuracy of
classification is equally important for both classes.
CREATION OF ARTIFICIAL MISSING DATA
Missing data was created with different combinations of mechanisms and quanti-
ties. For the purpose of more accurate comparison, the gaps were created exclu-
sively in the training dataset — this was done in order to compare the classifica-
tion quality of different models on the same test set. Before creating missing data,
the full datasets were split into training and test samples in the ratio of 80 to 20.
The number of MCAR missing data for each selected variable ranges from
5% to 20%, and the total number of observations with gaps in the MCAR datasets
ranges from 9.72% to 47.54%. To create MAR missing data two variables were
selected and the values of the first variable were removed for records that had
values for the second variable below or above the selected percentile. The se-
lected percentiles ranged from 5% to 20% for values below them and 90–95% for
values above them. The total number of observations with missing data ranged
Efficiency comparison of missing data imputation methods in predictive model creation
Системні дослідження та інформаційні технології, 2025, № 1 37
from 21.59% to 48.83%. To create the MNAR type of missing data, a variable
was selected and those values below or above the selected percentile were re-
moved, which ranged from 7 to 13% and from 90% to 93%, respectively. The
total number of observations with missing data ranged from 23.77% to 45.97%.
The total number of records with missing data for the datasets derived from
the first dataset ranged from 9.72% to 48.82%. The average number was 30.7%.
The size of the full training dataset was 8000 records and the test dataset was
2000 records.
The total number of records with missing data for the datasets derived from
the second dataset ranged from 18.14% to 45.97%, with an average of 33%. The
size of the full training set was 1279 records, and the test set was 320 records.
MODEL TRAINING RESULTS
The performance metrics of models trained on complete datasets are presented in
Table 1. The highest predictive quality was achieved with LGBM model for the
Churn dataset and RF model for Wine dataset.
T a b l e 1 . Results of training on complete datasets
Churn Wine
Model Accuracy Precision Recall Model Accuracy Precision Recall
LR 0.725000 0.386397 0.679389 LR 0.784375 0.392157 0.851064
SVC 0.799500 0.492982 0.715013 SVC 0.859375 0.513514 0.808511
RF 0.808500 0.508711 0.743003 RF 0.8625 0.518072 0.914894
LGBM 0.820500 0.530357 0.755725 LGBM 0.840625 0.476744 0.87234
Tables 2, 3, 4 present the performance metrics of models trained on imputed
Churn datasets with MCAR, MAR and MNAR missing data respectively.
T a b l e 2 . Results of training on imputed Churn datasets with MCAR missing data
MCAR Churn
9.72% 18.46% 22.03% 47.54%
Model
Accuracy Precision Recall Accuracy Precision Recall Accuracy Precision Recall Accuracy Precision Recall
LR 0.724 0.383602 0.666667 0.678 0.348247 0.732824 0.678 0.348247 0.732824 0.7285 0.387048 0.653944
SVC 0.7945 0.484211 0.70229 0.805 0.502712 0.707379 0.79 0.477686 0.735369 0.805 0.502879 0.666667
RF 0.8085 0.508865 0.73028 0.8015 0.496587 0.740458 0.8005 0.494845 0.732824 0.8135 0.518587 0.709924
LGBM 0.812 0.514834 0.750636 0.804 0.500846 0.753181 0.8095 0.510345 0.753181 0.812 0.515371 0.725191
L
is
tw
is
e
Avg 0.78475 0.472878 0.712468 0.772125 0.462098 0.733461 0.7695 0.457781 0.73855 0.78975 0.480971 0.688932
LR 0.7285 0.390988 0.684478 0.7245 0.386494 0.684478 0.7275 0.389535 0.681934 0.721 0.382646 0.684478
SVC 0.795 0.48532 0.715013 0.799 0.492091 0.712468 0.8025 0.498246 0.722646 0.812 0.515315 0.727735
RF 0.802 0.497453 0.745547 0.7925 0.481544 0.73028 0.803 0.499145 0.743003 0.7995 0.493151 0.732824
LGBM 0.817 0.52356 0.763359 0.8145 0.518966 0.765903 0.812 0.51463 0.760814 0.805 0.502555 0.750636
M
ea
n
Avg 0.785625 0.47433 0.727099 0.782625 0.469774 0.723282 0.78625 0.475389 0.727099 0.784375 0.473417 0.723918
LR 0.6745 0.345694 0.735369 0.6745 0.345694 0.735369 0.678 0.348247 0.732824 0.68 0.349206 0.727735
SVC 0.809 0.509874 0.722646 0.8115 0.514235 0.735369 0.8115 0.514337 0.73028 0.818 0.527938 0.697201
RF 0.8385 0.573529 0.694656 0.8355 0.568085 0.679389 0.8415 0.581197 0.692112 0.8245 0.542339 0.684478
LGBM 0.841 0.585421 0.653944 0.8415 0.584071 0.671756 0.8365 0.571121 0.6743 0.818 0.527619 0.704835 It
er
at
iv
e
Avg 0.79075 0.50363 0.701654 0.79075 0.503021 0.705471 0.791875 0.503726 0.707379 0.785125 0.486776 0.703562
LR 0.728 0.390421 0.684478 0.726 0.387844 0.681934 0.7265 0.387755 0.676845 0.7005 0.36658 0.720102
SVC 0.795 0.485062 0.70229 0.798 0.490435 0.717557 0.8005 0.4947 0.712468 0.813 0.517625 0.709924
RF 0.801 0.495798 0.750636 0.8085 0.508961 0.722646 0.8005 0.495 0.755725 0.814 0.519626 0.707379
LGBM 0.8175 0.524735 0.755725 0.8125 0.515789 0.748092 0.8155 0.521053 0.755725 0.824 0.538752 0.725191
M
IC
E
Avg 0.785375 0.474004 0.723282 0.78625 0.475757 0.717557 0.78575 0.474627 0.725191 0.787875 0.485646 0.715649
A. Popov
ISSN 1681–6048 System Research & Information Technologies, 2025, № 1 38
T a b l e 3 . Results of training on imputed Churn datasets with MAR missing data
MAR Churn
21.59% 26.16% (+ MCAR) 39.99% 48.73% (+ MCAR)
Model
Accuracy Precision Recall Accuracy Precision Recall Accuracy Precision Recall Accuracy Precision Recall
LR 0.7235 0.385714 0.687023 0.6605 0.336758 0.750636 0.7315 0.395349 0.692112 0.737 0.399698 0.6743
SVC 0.791 0.479132 0.73028 0.7915 0.480198 0.740458 0.7755 0.454984 0.720102 0.7725 0.449838 0.707379
RF 0.787 0.473083 0.737913 0.798 0.490566 0.727735 0.8075 0.507143 0.722646 0.7835 0.466667 0.712468
L
is
tw
is
e
LGBM 0.803 0.499139 0.737913 0.8085 0.509158 0.707379 0.8105 0.512411 0.735369 0.805 0.502636 0.727735
Avg 0.776125 0.459267 0.723282 0.764625 0.45417 0.731552 0.78125 0.467472 0.717557 0.7745 0.45471 0.705471
LR 0.7235 0.384058 0.6743 0.716 0.376934 0.681934 0.7215 0.381159 0.669211 0.7225 0.383285 0.676845
SVC 0.7965 0.487931 0.720102 0.792 0.480475 0.720102 0.797 0.488927 0.73028 0.795 0.485114 0.704835
RF 0.802 0.497427 0.737913 0.814 0.519409 0.715013 0.7855 0.471061 0.745547 0.795 0.485904 0.745547 M
ea
n
LGBM 0.8135 0.517241 0.763359 0.8175 0.525926 0.722646 0.81 0.511149 0.75827 0.801 0.495881 0.765903
Avg 0.783875 0.471664 0.723919 0.784875 0.475686 0.709924 0.7785 0.463074 0.725827 0.778375 0.462546 0.723283
LR 0.68 0.349206 0.727735 0.6765 0.346618 0.73028 0.68 0.349206 0.727735 0.689 0.356336 0.722646
SVC 0.811 0.513711 0.715013 0.814 0.518717 0.740458 0.8125 0.515845 0.745547 0.8095 0.511111 0.70229
RF 0.8435 0.58658 0.689567 0.837 0.572043 0.676845 0.835 0.567452 0.6743 0.834 0.564482 0.679389 It
er
at
iv
e
LGBM 0.845 0.595402 0.659033 0.8355 0.568966 0.671756 0.837 0.574944 0.653944 0.818 0.527619 0.704835
Avg 0.794875 0.511225 0.697837 0.79075 0.501586 0.704835 0.791125 0.501862 0.700382 0.787625 0.489887 0.70229
LR 0.7235 0.384058 0.6743 0.7265 0.388081 0.679389 0.722 0.382055 0.671756 0.7265 0.387755 0.676845
SVC 0.791 0.478336 0.70229 0.8025 0.498258 0.727735 0.7895 0.476271 0.715013 0.8075 0.50738 0.699746
RF 0.8075 0.506849 0.753181 0.813 0.516522 0.755725 0.7965 0.488294 0.743003 0.8155 0.522556 0.707379 M
IC
E
LGBM 0.8125 0.515625 0.755725 0.813 0.516579 0.753181 0.812 0.51468 0.75827 0.822 0.533821 0.743003
Avg 0.783625 0.471217 0.721374 0.78875 0.47986 0.729008 0.78 0.465325 0.722011 0.792875 0.487878 0.706743
T a b l e 4 . Results of training on imputed Churn datasets with MNAR missing data
MNAR Churn
24.01% 31.74% (+ MAR) 35.81% (+ MCAR) 42.43% (+ MCAR/MAR)
Model
Accuracy Precision Recall Accuracy Precision Recall Accuracy Precision Recall Accuracy Precision Recall
LR 0.7205 0.382436 0.687023 0.6605 0.336758 0.750636 0.65 0.331873 0.770992 0.65 0.331873 0.770992
SVC 0.779 0.460292 0.722646 0.79 0.47644 0.694656 0.7925 0.480903 0.704835 0.779 0.459098 0.699746
RF 0.8015 0.496503 0.722646 0.812 0.515654 0.712468 0.817 0.525424 0.709924 0.813 0.517691 0.707379
L
is
tw
is
e
LGBM 0.807 0.506087 0.740458 0.815 0.521024 0.725191 0.794 0.484034 0.732824 0.777 0.459168 0.75827
Avg 0.777 0.46133 0.718193 0.769375 0.462469 0.720738 0.763375 0.455559 0.729644 0.75475 0.441958 0.734097
LR 0.7185 0.381616 0.697201 0.715 0.377931 0.697201 0.7125 0.375683 0.699746 0.7135 0.377049 0.70229
SVC 0.769 0.445498 0.717557 0.7665 0.440895 0.70229 0.774 0.453249 0.727735 0.7755 0.454397 0.709924
RF 0.771 0.450077 0.745547 0.781 0.464115 0.740458 0.7935 0.482759 0.712468 0.797 0.488774 0.720102 M
ea
n
LGBM 0.799 0.492487 0.750636 0.806 0.504488 0.715013 0.803 0.499086 0.694656 0.8055 0.503663 0.699746
Avg 0.764375 0.44242 0.727735 0.767125 0.446857 0.713741 0.77075 0.452694 0.708651 0.772875 0.455971 0.708016
LR 0.671 0.342823 0.735369 0.671 0.342823 0.735369 0.671 0.342823 0.735369 0.671 0.342823 0.735369
SVC 0.8005 0.494505 0.687023 0.801 0.495379 0.681934 0.797 0.48816 0.681934 0.799 0.491682 0.676845
RF 0.8265 0.545276 0.704835 0.822 0.535783 0.704835 0.83 0.555324 0.676845 0.8215 0.535714 0.687023
It
er
at
iv
e
LGBM 0.8265 0.546939 0.681934 0.826 0.546012 0.679389 0.8255 0.546025 0.664122 0.8255 0.545267 0.6743
Avg 0.781125 0.482386 0.702290 0.78 0.479999 0.700382 0.780875 0.483083 0.689568 0.77925 0.478872 0.693384
LR 0.7225 0.385593 0.694656 0.7245 0.387464 0.692112 0.719 0.380481 0.684478 0.72 0.383543 0.699746
SVC 0.792 0.480207 0.709924 0.7885 0.474832 0.720102 0.785 0.468908 0.709924 0.793 0.48199 0.715013
RF 0.7925 0.482315 0.763359 0.79 0.478049 0.748092 0.7995 0.493197 0.737913 0.8035 0.5 0.709924
LGBM 0.806 0.504303 0.745547 0.804 0.500855 0.745547 0.8185 0.527372 0.735369 0.798 0.490787 0.745547 M
IC
E
Avg 0.77825 0.463105 0.728372 0.77675 0.4603 0.726463 0.7805 0.46749 0.716921 0.778625 0.46408 0.717558
Efficiency comparison of missing data imputation methods in predictive model creation
Системні дослідження та інформаційні технології, 2025, № 1 39
Tables 5, 6, 7 present the performance metrics of models trained on imputed
Wine datasets with MCAR, MAR and MNAR missing data respectively.
T a b l e . 5 . Results of training on imputed Wine datasets with MCAR missing data
MCAR Wine
18.14% 30.73% 45.35%
Model
Accuracy Precision Recall Accuracy Precision Recall Accuracy Precision Recall
LR 0.796875 0.408163 0.851064 0.81875 0.438202 0.829787 0.809375 0.425532 0.851064
SVC 0.85625 0.506667 0.808511 0.865625 0.529412 0.765957 0.86875 0.537313 0.765957
RF 0.86875 0.533333 0.851064 0.84375 0.481013 0.808511 0.865625 0.532258 0.702128
LGBM 0.853125 0.5 0.87234 0.8625 0.519481 0.851064 0.84375 0.478873 0.723404
L
is
tw
is
e
Avg 0.84375 0.487041 0.845745 0.847656 0.492027 0.81383 0.846875 0.493494 0.760638
LR 0.803125 0.418367 0.87234 0.79375 0.405941 0.87234 0.796875 0.41 0.87234
SVC 0.859375 0.513514 0.808511 0.8375 0.46988 0.829787 0.85 0.493506 0.808511
RF 0.84375 0.483146 0.914894 0.85 0.494253 0.914894 0.85 0.494253 0.914894
LGBM 0.8375 0.47191 0.893617 0.8375 0.471264 0.87234 0.834375 0.465909 0.87234
M
ea
n
Avg 0.835938 0.471734 0.872341 0.829688 0.460335 0.87234 0.832813 0.465917 0.867021
LR 0.784375 0.392157 0.851064 0.7875 0.39604 0.851064 0.809375 0.427083 0.87234
SVC 0.86875 0.534247 0.829787 0.8625 0.519481 0.851064 0.875 0.547945 0.851064
RF 0.859375 0.512195 0.893617 0.86875 0.530864 0.914894 0.865625 0.526316 0.851064
LGBM 0.840625 0.476744 0.87234 0.834375 0.464286 0.829787 0.840625 0.475 0.808511 It
er
at
iv
e
Avg 0.838281 0.478836 0.861702 0.838281 0.477668 0.861702 0.847656 0.494086 0.845745
LR 0.7875 0.398058 0.87234 0.784375 0.392157 0.851064 0.809375 0.425532 0.851064
SVC 0.86875 0.534247 0.829787 0.859375 0.512821 0.851064 0.875 0.547945 0.851064
RF 0.85 0.494253 0.914894 0.865625 0.52439 0.914894 0.86875 0.534247 0.829787
LGBM 0.83125 0.460674 0.87234 0.853125 0.5 0.87234 0.8625 0.518519 0.893617 M
IC
E
Avg 0.834375 0.471808 0.87234 0.840625 0.482342 0.872341 0.853906 0.506561 0.856383
T a b l e . 6 . Results of training on imputed Wine datasets with MAR missing data
MAR Wine
23.53% 34.48% (+ MCAR) 42.30%
Model
Accuracy Precision Recall Accuracy Precision Recall Accuracy Precision Recall
LR 0.828125 0.45122 0.787234 0.815625 0.428571 0.765957 0.809375 0.420455 0.787234
SVC 0.853125 0.5 0.659574 0.821875 0.434211 0.702128 0.80625 0.363636 0.425532
RF 0.8625 0.525424 0.659574 0.84375 0.477612 0.680851 0.846875 0.48 0.510638
LGBM 0.86875 0.538462 0.744681 0.8375 0.467532 0.765957 0.85625 0.507692 0.702128
L
is
tw
is
e
Avg 0.853125 0.503777 0.712766 0.829688 0.451982 0.728723 0.829688 0.442946 0.606383
LR 0.790625 0.397959 0.829787 0.8 0.414141 0.87234 0.8 0.412371 0.851064
SVC 0.85625 0.507042 0.765957 0.8375 0.469136 0.808511 0.8375 0.467532 0.765957
RF 0.853125 0.5 0.87234 0.84375 0.481481 0.829787 0.8625 0.519481 0.851064
LGBM 0.8375 0.47191 0.893617 0.825 0.448276 0.829787 0.828125 0.455556 0.87234
M
ea
n
Avg 0.834375 0.469228 0.840425 0.826563 0.453259 0.835106 0.832031 0.463735 0.835106
LR 0.821875 0.445652 0.87234 0.8125 0.43299 0.893617 0.815625 0.434783 0.851064
SVC 0.865625 0.527027 0.829787 0.83125 0.453333 0.723404 0.85 0.493151 0.765957
RF 0.85625 0.506329 0.851064 0.859375 0.512821 0.851064 0.86875 0.535211 0.808511
LGBM 0.853125 0.5 0.893617 0.8375 0.469136 0.808511 0.859375 0.512821 0.851064 It
er
at
iv
e
Avg 0.849219 0.494752 0.861702 0.835156 0.46707 0.819149 0.848438 0.493992 0.819149
LR 0.8125 0.430108 0.851064 0.8125 0.43299 0.893617 0.81875 0.43956 0.851064
SVC 0.8625 0.52 0.829787 0.85625 0.507042 0.765957 0.84375 0.479452 0.744681
RF 0.85625 0.506173 0.87234 0.859375 0.513158 0.829787 0.859375 0.513514 0.808511
LGBM 0.834375 0.464286 0.829787 0.84375 0.481928 0.851064 0.846875 0.488095 0.87234 M
IC
E
Avg 0.841406 0.480142 0.845745 0.842969 0.48378 0.835106 0.842188 0.480155 0.819149
A. Popov
ISSN 1681–6048 System Research & Information Technologies, 2025, № 1 40
T a b l e 7 . Results of training on imputed Wine datasets with MNAR missing data
MNAR Wine
23.77% 32.60% (+ MCAR) 45.97% (+ MCAR/MAR)
Model
Accuracy Precision Recall Accuracy Precision Recall Accuracy Precision Recall
LR 0.796875 0.404255 0.808511 0.796875 0.404255 0.808511 0.828125 0.452381 0.808511
SVC 0.8375 0.463768 0.680851 0.809375 0.397059 0.574468 0.859375 0.52 0.553191
RF 0.853125 0.5 0.702128 0.825 0.44 0.702128 0.853125 0.5 0.553191
LGBM 0.834375 0.461538 0.765957 0.83125 0.454545 0.744681 0.8375 0.454545 0.531915
L
is
tw
is
e
Avg 0.830469 0.45739 0.739362 0.815625 0.423965 0.707447 0.844531 0.481732 0.611702
LR 0.79375 0.402062 0.829787 0.790625 0.395833 0.808511 0.809375 0.423913 0.829787
SVC 0.83125 0.452055 0.702128 0.834375 0.460526 0.744681 0.8625 0.52381 0.702128
RF 0.83125 0.455696 0.765957 0.8375 0.469136 0.808511 0.85 0.493151 0.765957
LGBM 0.825 0.445783 0.787234 0.825 0.448276 0.829787 0.834375 0.464286 0.829787 M
ea
n
Avg 0.820313 0.438899 0.771277 0.821875 0.443443 0.797873 0.839063 0.47629 0.781915
LR 0.803125 0.418367 0.87234 0.80625 0.421053 0.851064 0.809375 0.423913 0.829787
SVC 0.853125 0.5 0.787234 0.8625 0.521739 0.765957 0.8625 0.52381 0.702128
RF 0.846875 0.486842 0.787234 0.84375 0.481013 0.808511 0.85 0.493151 0.765957
LGBM 0.840625 0.47561 0.829787 0.83125 0.455696 0.765957 0.834375 0.464286 0.829787 It
er
at
iv
e
Avg 0.835938 0.470205 0.819149 0.835938 0.469875 0.797872 0.839063 0.47629 0.781915
LR 0.796875 0.40625 0.829787 0.80625 0.419355 0.829787 0.803125 0.416667 0.851064
SVC 0.840625 0.474359 0.787234 0.85 0.493333 0.787234 0.878125 0.560606 0.787234
RF 0.88125 0.561644 0.87234 0.86875 0.534247 0.829787 0.859375 0.513889 0.787234
LGBM 0.8375 0.46988 0.829787 0.83125 0.45679 0.787234 0.834375 0.460526 0.744681 M
IC
E
Avg 0.839063 0.478033 0.829787 0.839063 0.475931 0.808511 0.84375 0.487922 0.792553
DISCUSSION
Listwise deletion, which is a highly problematic method from the statistical point
of view and is rarely recommended, has proven in some cases to be able to pro-
duce datasets that yield prediction quality that is as good as when sophisticated
imputation methods are used. The method performs better with the Wine dataset,
where the target metric is accuracy, and performs worse with the Churn dataset
when the target metric is recall. In particular, for Churn on datasets with MCAR
missing data, the method showed mixed and unpredictable results. In some cases,
the obtained value of the target metric was not worse than the results obtained
using other methods, but the same model could have very different metric values
on different datasets, which made it difficult to predict the result. On datasets with
MAR missing data, the recall value was as good as the other methods, but it also
often increased at the cost of the accuracy value, so the overall quality of the
models was lower. Similar results can be seen on the datasets with MNAR miss-
ing data. In addition, on these datasets, the highest recall score was achieved us-
ing logistic regression, which usually shows the worst results. Thus, for the recall
as target metric, good results using this method are not uncommon, but the best
quality models are obtained on datasets that have exclusively MCAR mecha-
nism — otherwise, the overall quality of the model decreases.
For the Wine dataset with the accuracy target metric, the method’s perform-
ance was significantly higher. Despite the loss of a large amount of information,
the predictive quality of the obtained models was not inferior to other methods. It
is worth noting that the value of the recall metric was significantly lower than that
of the other methods, especially as the number of missing data increased, which
resulted in lower overall model quality. The most balanced models were obtained
on datasets containing only MCAR missing data: accuracy ranged from 79.7% to
Efficiency comparison of missing data imputation methods in predictive model creation
Системні дослідження та інформаційні технології, 2025, № 1 41
86.9%, recall from 76.7% to 87%, which matches the quality of models obtained
using more complex methods. On the datasets with MAR and MNAR missing
data, the trained models showed good results in terms of accuracy (79.7–86.9%),
but the values of the recall metric were significantly lower (51–80.1%).
In summary, in both problems, it was observed that the method is not a reli-
able choice for the recall metric, as satisfactory and predictable results were ob-
tained only on MCAR missing data. At the same time, the method is able to show
very good results when working with the accuracy metric, but is still limited by
the MCAR missing data mechanism and the percentage of missing data to obtain
balanced models for the metrics. Due to the general unreliability and unpredict-
ability of the method, it can be concluded that it is not the best choice for solving
such problems, but its use does not necessarily mean obtaining unsatisfactory results,
because predictive models can often learn to correctly identify the features of the
target classes of the problem even using datasets with biased statistical parameters.
Mean imputation generally showed more reliable results on most datasets
than listwise deletion, as the quality of trained models fluctuated less regardless of
the type and number of missing data. For this method, the best results were
achieved with SVC, RF or LGBM classifiers, while the method performed worse
with logistic regression. From a statistical point of view, a significant problem
with this method is the reduction of data variability and weakening of correlations
between variables (which was observed for the datasets imputed with this meth-
od), but, as in the case of listwise deletion, machine learning models are able to
learn to identify features of the target class even with statistically skewed data,
and they do so with greater success for the mean imputation method. Using this
method, satisfactory results were obtained on MCAR and MAR missing data for
both datasets for three out of four methods: for Churn, the accuracy was in the
range of 71–81% in almost all cases, recall was 71–76%, and only when using
logistic regression were the results worse (recall 67–68%); for the Wine dataset,
the accuracy was 79–86.2%, recall was 80–91.5% (exceptions are two cases of
SVC method on MAR, where recall was 76.6%). The training results on MNAR
missing data were of lower quality, with a noticeable decrease in recall compared
to other methods: for Churn, the metric had results of 69.4–75%, and for Wine —
70–83%. The best results in these cases were achieved using LGBM (Churn) and
RF or SVC (Wine). In summary, using mean imputation method is a relatively
good choice, as the models trained on these datasets were of high quality more
often and had more predictable results than those using listwise deletion. In addi-
tion, the method also performs better because the range of values obtained for the
metrics, even for the worst outcomes, is smaller, making the results more predictable.
Multiple imputation in the Python implementation of IterativeImputer from
the scikit-learn library showed unsatisfactory results for the Churn dataset. The
models often did not meet the minimum required classification quality. Across all
the missing data mechanisms, it was observed that this method worked best with
logistic regression and support vector machine — in particular, logistic regression
repeatedly showed significantly better results on the recall metric when using this
method compared to the complete data (up to 73.5%) — but fell short on other
metrics (67–68% accuracy). The SVC models performed relatively well (accuracy
80–81%, recall 70–74.5%), except for datasets with MNAR-type missing data
(accuracy 80%, recall 68%). The RF and LGBM models consistently had low re-
call values in combination with this method (67–70%), regardless of the mecha-
nism and number of gaps.
On the Wine dataset, by contrast, the method performed quite well on
MCAR and MAR missing data, especially when SVC and RF models were used
A. Popov
ISSN 1681–6048 System Research & Information Technologies, 2025, № 1 42
(accuracy 83–87.5%, recall 82.9–91.5% with two exceptions on MAR data for
SVC). Logistic regression generally performed worse than the other models, but
often had the highest recall, which was also observed on the Churn dataset. On
the MNAR missing data, the metrics were also quite high and did not fall short of
other imputation methods.
In general, this method proved to be quite unpredictable and data-dependent,
as there was a significant difference in quality between models trained on differ-
ent groups of datasets. In addition, specifically for the case of maximising recall,
the method showed unsatisfactory results, although it was able to create powerful
models for the Wine task with the target accuracy metric.
Multiple imputation in the implementation of the R MICE library proved to
be the best, providing the most consistently high results for all metrics, which
were closest to the performance after training on the complete datasets. The
method performed well on all datasets regardless of the type, combination and
number of gaps. For the Churn dataset, it worked best when combined with the
RF and LGBM algorithms, with the RF algorithm even performing better in some
cases using this imputation method than after training on the full dataset (MAR
dataset). On the Wine models, the method also showed excellent results, deliver-
ing high scores on both the target and recall metrics. The method worked best
with RF and SVC models. In general, the method had the highest level of reliabil-
ity and predictability of results, and the models trained on the datasets with this
imputation had consistently high prediction quality with the least fluctuations.
Overall, this particular implementation of the multiple imputation method proved
to be the most successful choice among studied methods for solving the problem
of processing missing data.
CONCLUSIONS
The widespread problem of missing data is becoming especially relevant today
due to the rapid development of artificial intelligence and machine learning tech-
nologies, which create a growing need for large amounts of high-quality data, as
most algorithms require complete datasets. A large number of different methods
for processing missing data have been created to solve the problem of missing
data, while preserving the statistical parameters of the data for the success of fur-
ther modelling. An important issue is the compatibility of imputation methods and
predictive models, as different methods have different levels of quality and pre-
dictability of modelling results.
In this paper, an impact of the selected imputation methods on the quality of
forecasting models is analysed. The best results were obtained using the multiple
imputation method in the implementation from the R MICE library. Training on
data using this method most reliably produced results that had high scores on
quality metrics and were characterised by smaller quality fluctuations compared
to other methods. The Python implementation of the multiple imputation method
was less reliable, as its effectiveness strongly depended on the target metric and
the specifics of the available data.
It has also been observed that statistically unreliable imputation methods,
such as mean imputation or listwise deletion, do not necessarily lead to poor pre-
diction results, as quite often predictive models are able to learn to recognise the
target class even in the case of biased parameters. Therefore, their use, although
riskier and more dependent on the characteristics of the available data, can also
produce satisfactory results, which may not be inferior in quality to training using
more complex methods.
Efficiency comparison of missing data imputation methods in predictive model creation
Системні дослідження та інформаційні технології, 2025, № 1 43
REFERENCES
1. Donald B. Rubin, “Inference and Missing Data,” Biometrika, vol. 63, no. 3, pp. 581–592, 1976.
2. Craig K. Enders, Applied Missing Data Analysis; 1 ed. The Guilford Press, 2010, 377 p.
3. Therese D. Pigott, “A review of methods for missing data,” Educational Research and
Evaluation, vol. 7, no. 4, pp. 353–383, 2001.
4. Luke Oluwaseye Joel, Wesley Doorsamy, and Babu Sena Paul, “A Review of Missing
Data Handling Techniques for Machine Learning,” International Journal of Innovative
Technology and Interdisciplinary Sciences (IJITIS), vol. 5, no. 3, pp. 971–1005, 2022.
doi: https://doi.org/10.15157/IJITIS.2022.5.3.971-1005
5. Helen Bridge, Thomas Schindler, “The perils of the unknown: Missing data in clinical
studies,” Medical Writing, 27(1), pp. 56–59, 2018.
6. Tlamelo Emmanuel, Thabiso Maupong, Dimane Mpoeleng, Thabo Semong, Banyatsang
Mphago, and Oteng Tabona, “A survey on missing data in machine learning,” Journal of
Big Data, 8(1), article no. 140, 2021. doi: 10.1186/s40537-021-00516-9
7. Hyun Kang, “The prevention and handling of the missing data,” Korean Journal of Anes-
thesiology, 64(5), pp. 402–406, 2013. doi: 10.4097/kjae.2013.64.5.402
8. Tolou Shadbahr et al., “The impact of imputation quality on machine learning classifiers
for datasets with missing values,” Communications medicine, vol. 3, article no. 139,
2023. doi: 10.1038/s43856-023-00356-z
9. Jale Bektas, Turgay Ibrikci, and Ismail Ozcan, “The impact of imputation procedures
with machine learning methods on the performance of classifiers: An application to coro-
nary artery disease data including missing values,” Biomedical Research, 29(13),
pp. 2780–2785, 2018. doi: 10.4066/biomedicalresearch.29-18-199
10. George Paterakis, Stefanos Fafalios, Paulos Charonyktakis, Vassilis Christophides, and
Ioannis Tsamardinos, “Do we really need imputation in AutoML predictive modeling?” ACM
Transactions on Knowledge Discovery from Data, 18(6), 2024. doi: 10.1145/3643643
11. Katarzyna Woźnica, Przemyslaw Biecek, Does imputation matter? Benchmark for pre-
dictive models, 2020. doi: 10.48550/arXiv.2007.02837
12. A. Popov, O. Makarenko, and P. Bidyuk, “Rozv’iazannia zadachi zapovnennia propuskiv
danykh alternatyvnymy metodamy pry stvorenni prohnoznykh modelei [Solving missing
data imputation problem using alternative methods in predictive model creation],” Pro-
ceedings of the II All-Ukrainian Scientific and Practical Conference "System Sciences and
Informatics", December 4–8, 2023, Kyiv: National Technical University of Ukraine “Igor
Sikorsky Kyiv Polytechnic Institute”, pp. 201–206.
Received 10.04.2024
INFORMATION ON THE ARTICLE
Andrii Yu. Popov, ORCID: 0009-0001-4783-7401, Educational and Research Institute
for Applied System Analysis of the National Technical University of Ukraine “Igor Sikor-
sky Kyiv Polytechnic Institute”, Ukraine, e-mail: popovandrii1403@gmail.com
ПОРІВНЯННЯ ЕФЕКТИВНОСТІ МЕТОДІВ ЗАПОВНЕННЯ ПРОПУЩЕНИХ
ДАНИХ ПІД ЧАС РОЗРОБЛЕННЯ МОДЕЛЕЙ ПРОГНОЗУВАННЯ / А.Ю. Попов
Анотація. Наявність пропущених даних є поширеною проблемою в аналізі даних
та машинному навчанні. У роботі проаналізовано залежності якості прогнозування
моделей машинного навчання від використаних методів оброблення пропущених
даних на етапі підготовки даних до навчання моделей. Досліджуваними методами
є аналіз повних спостережень, заповнення середнім та дві реалізації методу мно-
жинного заповнення — мовами Python та R. Обраними класифікаторами є логістич-
на регресія, метод випадкового лісу, метод опорних векторів та Light Gradient
Boosting Machine (LGBM). Якість прогнозних моделей оцінюється за метриками
accuracy, precision та recall. Розглянуто два набори даних із задачами класифікації,
що мають різні цільові метрики. Найкращі результати досягнуто з використанням
алгоритму множинного заповнення у реалізації мовою R у поєднанні з класифіка-
торами випадкового лісу та LGBM.
Ключові слова: пропущені дані, методи заповнення, прогнозні моделі, ма-
шинне навчання.
|
| id | journaliasakpiua-article-301918 |
| institution | System research and information technologies |
| keywords_txt_mv | keywords |
| language | English |
| last_indexed | 2025-09-17T09:26:01Z |
| publishDate | 2025 |
| publisher | The National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute" |
| record_format | ojs |
| resource_txt_mv | journaliasakpiua/70/33a29e3cdfeec3003bb6482e409ff770.pdf |
| spelling | journaliasakpiua-article-3019182025-05-20T17:56:07Z Efficiency comparison of missing data imputation methods in predictive model creation Порівняння ефективності методів заповнення пропущених даних під час розроблення моделей прогнозування Popov, Andrii missing data imputation methods forecasting models machine learning пропущені дані методи заповнення прогнозні моделі машинне навчання Missing data is a common issue in data analysis and machine learning. This article analyzes the impact of missing data imputation methods during the data preprocessing stage on the quality of forecasting models. Selected methods are listwise deletion, mean imputation, and two implementations of the multiple imputation method in Python and R languages. Selected classifiers are Logistic Regression, Random Forest, Support Vector Machine, and Light Gradient Boosting Machine. The performance quality of forecasting models is estimated using accuracy, precision, and recall metrics. Two datasets were used as binary classification problems with different target metrics. The highest performance was achieved when the R implementation of the multiple imputation method was combined with RF and LGBM classifiers. Наявність пропущених даних є поширеною проблемою в аналізі даних та машинному навчанні. У роботі проаналізовано залежності якості прогнозування моделей машинного навчання від використаних методів оброблення пропущених даних на етапі підготовки даних до навчання моделей. Досліджуваними методами є аналіз повних спостережень, заповнення середнім та дві реалізації методу множинного заповнення — мовами Python та R. Обраними класифікаторами є логістична регресія, метод випадкового лісу, метод опорних векторів та Light Gradient Boosting Machine (LGBM). Якість прогнозних моделей оцінюється за метриками accuracy, precision та recall. Розглянуто два набори даних із задачами класифікації, що мають різні цільові метрики. Найкращі результати досягнуто з використанням алгоритму множинного заповнення у реалізації мовою R у поєднанні з класифікаторами випадкового лісу та LGBM. The National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute" 2025-03-28 Article Article application/pdf https://journal.iasa.kpi.ua/article/view/301918 10.20535/SRIT.2308-8893.2025.1.03 System research and information technologies; No. 1 (2025); 32-43 Системные исследования и информационные технологии; № 1 (2025); 32-43 Системні дослідження та інформаційні технології; № 1 (2025); 32-43 2308-8893 1681-6048 en https://journal.iasa.kpi.ua/article/view/301918/318901 |
| spellingShingle | пропущені дані методи заповнення прогнозні моделі машинне навчання Popov, Andrii Порівняння ефективності методів заповнення пропущених даних під час розроблення моделей прогнозування |
| title | Порівняння ефективності методів заповнення пропущених даних під час розроблення моделей прогнозування |
| title_alt | Efficiency comparison of missing data imputation methods in predictive model creation |
| title_full | Порівняння ефективності методів заповнення пропущених даних під час розроблення моделей прогнозування |
| title_fullStr | Порівняння ефективності методів заповнення пропущених даних під час розроблення моделей прогнозування |
| title_full_unstemmed | Порівняння ефективності методів заповнення пропущених даних під час розроблення моделей прогнозування |
| title_short | Порівняння ефективності методів заповнення пропущених даних під час розроблення моделей прогнозування |
| title_sort | порівняння ефективності методів заповнення пропущених даних під час розроблення моделей прогнозування |
| topic | пропущені дані методи заповнення прогнозні моделі машинне навчання |
| topic_facet | missing data imputation methods forecasting models machine learning пропущені дані методи заповнення прогнозні моделі машинне навчання |
| url | https://journal.iasa.kpi.ua/article/view/301918 |
| work_keys_str_mv | AT popovandrii efficiencycomparisonofmissingdataimputationmethodsinpredictivemodelcreation AT popovandrii porívnânnâefektivnostímetodívzapovnennâpropuŝenihdanihpídčasrozroblennâmodelejprognozuvannâ |