© A.A. Zarichkovyi, I.V. Stetsenko, O.P. Stelmakh, A.Yu. Dyfuchyn, Ya.I. Kornaga, 2026
UDC 004.42 + 004.8
DOI: 10.20535/SRIT.2308-8893.2026.1.10
EFFICIENT EVALUATION OF MACHINE LEARNING MODELS: A UNIFIED METRIC BALANCING PERFORMANCE AND COST

A.A. ZARICHKOVYI, I.V. STETSENKO, O.P. STELMAKH, A.YU. DYFUCHYN, YA.I. KORNAGA
Abstract. This paper introduces a novel, unified metric for evaluating the efficiency of machine learning, deep learning, and artificial intelligence models by balancing predictive performance and execution cost. Existing metrics typically isolate performance or execution measures (e.g., FLOPs, latency, energy), failing to capture the inherent trade-off between resource constraints and predictive capability in a single formula. The proposed formula incorporates a tunable trade-off factor and hard constraints on performance and cost, allowing principled comparison across models and deployment settings. Our formulation generalizes prior heuristics and demonstrates clear interpretability, scalability, and hardware awareness.
Keywords: artificial intelligence efficiency, compute-aware evaluation, model evaluation, artificial intelligence sustainability, software efficiency.
INTRODUCTION
The dramatic rise in the deployment of machine learning (ML), deep learning, and artificial intelligence (AI) models in practical settings has made the question of model efficiency increasingly critical [1–3]. Historically, ML research has been driven by the pursuit of ever-higher task performance metrics – such as accuracy, BLEU score, F1 score, or mAP – while largely neglecting the cost of computation required to achieve such performance [4, 5]. Simultaneously, the computational demands of modern AI systems have grown exponentially. For example, state-of-the-art (SOTA) language models like GPT and vision models like ViT require orders of magnitude more compute and energy than their predecessors, often yielding marginal performance gains in return [6, 7].
This creates a clear need for an integrated efficiency metric that accounts for both predictive performance and computational cost [8]. Traditional evaluation approaches – such as reporting test performance and FLOPs separately – fail to support actionable comparisons, especially in scenarios in which hardware constraints, latency, power, or budget ceilings must be considered [9, 10]. Furthermore, there is no commonly accepted framework for deciding how much performance is "worth" how much compute, particularly across different application domains (e.g., medical imaging, mobile NLP, etc.).
Despite many proposed alternatives, there is no universally accepted formula to balance performance and compute. For example:
– performance vs. model size (Params) does not account for inference time or energy [11];
– performance vs. number of operations (MAdds) provides a coarse signal and often differs from what is observed on real hardware [2].
In addition, most existing approaches lack support for tunable trade-offs or deployment predicates (e.g., maximum tolerable compute budget, minimum required performance). Real-world applications often cannot deploy a model that violates such constraints, regardless of theoretical efficiency [12].
The aim of this research is to introduce a general-purpose, interpretable efficiency metric grounded in first principles. It extends the classic performance-vs-cost formulation through: (a) using a tunable parameter β² controlling the trade-off slope; (b) considering constraints to enforce application-specific performance minima and resource ceilings; (c) demonstrating clear interpretability, enabling practical comparison of SOTA models for resource-constrained deployment; (d) being agnostic to task type or compute unit.
Use cases motivating this work include:
– choosing a vision model for on-device inference on mobile hardware, where latency and energy are limiting factors;
– selecting a large language model variant for real-time chatbot deployment, where response time and server cost dominate;
– comparing classical ML and DL models for tabular financial forecasting, where marginal performance gains must be weighed against long training and inference pipelines.
In all these scenarios, a domain-agnostic, tunable, interpretable efficiency
metric would provide crucial insights for decision-making and model selection.
In what follows, we provide a comprehensive review of related efforts to formalize ML efficiency (Section 2), then introduce our proposed metric (Section 3), validate it through an ablation study and comparisons (Section 4), and conclude with practical implications and directions for future work (Section 5).
RELATED WORK
The challenge of balancing model performance with computational efficiency has become increasingly central in contemporary machine learning research [13]. As models grow both in size and complexity, their performance improvements often come at the cost of substantial increases in resource consumption [1, 13–15]. Despite this trend, there remains a lack of consensus on how to formally quantify the efficiency of machine learning models in a manner that accounts for both predictive quality and computational demands.
Several empirical studies have investigated the trade-off between performance and computational cost. For instance, the development of EfficientNet [1, 16, 17] demonstrated that compound scaling strategies can yield better trade-offs when simultaneously increasing depth, width, and resolution. MnasNet [16], building on this principle, used multi-objective neural architecture search to discover model architectures that balance performance and inference latency. Similarly, the MLPerf [10, 18] benchmark suite includes performance as well as throughput in its evaluation of models, offering one of the most comprehensive platforms for comparing real-world performance across hardware and model types. However, while such studies visualize or report the trade-offs involved, they generally stop short of formalizing these trade-offs into a unified scalar metric that can guide model selection or optimization in a principled way [19, 20].
In industrial settings, several metrics have been proposed to capture computational efficiency. Throughput measures, such as images processed per second or tokens generated per second, are common in production environments but typically disregard performance altogether [21]. On the hardware side, metrics such as the energy-delay product (EDP) [22] or its squared variant, ED²P, attempt to quantify energy efficiency in embedded or edge systems. Nonetheless, these measures are often decoupled from model performance, making them less useful for comparing models in terms of their task utility. Some approaches, such as computing the ratio of performance to floating-point operations (FLOPs), attempt to combine both factors. However, these ratios can be easily manipulated. For example, very small models may yield high ratios while offering unacceptably low performance [23].
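To make that failure mode concrete, consider the minimal sketch below; the model names and numbers are purely illustrative, not figures from any cited benchmark:

```python
# Illustrative only: a performance/FLOPs ratio rewards very small models
# regardless of whether their accuracy is usable in practice.
models = {
    "tiny-net": {"accuracy": 0.42, "gflops": 0.05},    # too weak to deploy
    "strong-net": {"accuracy": 0.81, "gflops": 4.0},   # the only usable model
}

for name, m in models.items():
    ratio = m["accuracy"] / m["gflops"]  # "performance per GFLOP"
    print(f"{name}: accuracy={m['accuracy']:.2f}, ratio={ratio:.2f}")

# tiny-net "wins" on the ratio (8.40 vs 0.20) despite unacceptable accuracy.
```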
Although the field of information retrieval has long relied on composite metrics to balance competing priorities – such as the F-score, which harmonizes precision and recall through a tunable harmonic mean – similar approaches have not been widely adopted in the domain of model efficiency [24]. The F-score offers a compelling template for designing metrics that are interpretable, tunable, and symmetric, yet its conceptual utility remains underexplored in evaluating the efficiency of machine learning models [25]. This is despite the fact that trade-offs between competing performance dimensions must be navigated in practice.
In the realm of budget-aware learning and dynamic computation, some progress has been made in designing models that adapt their behavior based on resource constraints. Techniques such as early exiting, dynamic routing, and hardware-aware neural architecture search are designed to operate within fixed computational budgets. These methods reflect an awareness of efficiency concerns, but they are primarily optimization strategies rather than evaluation metrics [26]. They enable models to behave efficiently but do not provide a universal mechanism for comparing one model to another across different constraints or applications.
Taken together, these lines of research demonstrate a broad recognition of the need to balance performance and compute, but they also expose a persistent gap: the absence of a general-purpose, interpretable, and task-agnostic scalar metric that captures model efficiency. Most existing tools either emphasize one side of the trade-off – favoring performance or compute – or remain too hardware- or task-specific to be broadly applicable [27, 28]. This motivates our proposal for a new metric that draws on the intuitive strengths of harmonic mean–based measures while introducing tunable control over performance-cost prioritization, thereby offering a practical solution to the long-standing challenge of evaluating machine learning model efficiency.
PROPOSAL OF A FORMULA FOR EVALUATING MODEL EFFICIENCY
To address the limitations of existing approaches in quantifying machine learning efficiency, we propose a formal metric that integrates both performance and computational cost into a unified scalar value. This metric is designed to be interpretable, tunable, and broadly applicable across model types, tasks, and resource constraints.
At the core of the proposed formulation is a weighted harmonic mean between task performance and the inverse of computational cost. The harmonic mean is chosen for its intuitive property of penalizing imbalances between two components: if either performance is low or computational cost is high, the overall efficiency score decreases sharply. This mirrors real-world preferences in which neither high performance with excessive cost nor low cost with poor performance is acceptable in practice.
Let A denote the task-specific performance of a model (e.g., accuracy, F1-score, mAP, etc.), normalized by the best possible performance on the task to lie within the interval [0,1]. Let C denote the task-related compute cost of the model (e.g., latency, GWh, $/token, etc.), also scaled to [0,1] by the largest acceptable cost. Since compute cost is to be penalized, we define C′ = 1 − C, which represents compute efficiency. This yields a formulation similar to the Fβ-score used in [29, 30] for information retrieval:
$$E_\beta = \begin{cases} \dfrac{(1+\beta^2)\,A\,C'}{\beta^2 C' + A}, & A \ge A_{required} \text{ and } C \le C_{required}, \\ 0, & \text{otherwise}. \end{cases} \qquad (1)$$
Here, β² ∈ (0, ∞) is a user-defined parameter that governs the trade-off between performance and compute cost. When β² = 1, the formula reduces to the balanced harmonic mean, assigning equal weight to performance and compute. As β² → 0, the metric increasingly favors compute efficiency, and as β² → ∞, it increasingly favors performance.
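As a minimal sketch, the ungated body of Eq. (1) fits in a few lines of Python; the guard against a degenerate denominator is an added assumption, and the prints check the limiting behavior described above:

```python
def e_beta(a: float, c: float, beta2: float = 1.0) -> float:
    """Ungated body of Eq. (1): weighted harmonic mean of normalized
    performance `a` and compute efficiency c' = 1 - c, both in [0, 1]."""
    c_eff = 1.0 - c
    denom = beta2 * c_eff + a
    return (1.0 + beta2) * a * c_eff / denom if denom > 0 else 0.0

a, c = 0.9, 0.4
print(e_beta(a, c, beta2=1e-6))  # ~0.60: approaches c' = 1 - c (compute side)
print(e_beta(a, c, beta2=1.0))   #  0.72: balanced harmonic mean
print(e_beta(a, c, beta2=1e6))   # ~0.90: approaches a (performance side)
```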
This design satisfies several desirable properties. Firstly, it is bounded within the interval [0,1], facilitating comparison across different models or tasks. Secondly, it is symmetric when β² = 1, meaning that any imbalance between performance and compute leads to penalization. Thirdly, the parameter β² enables the user to reflect context-specific priorities – such as real-time constraints or resource scarcity – within the metric itself, without changing the fundamental structure of the formula.
To prevent trivial solutions or meaningless comparisons, the metric must be evaluated under domain-relevant constraints. We define a minimum required performance A_required and a maximum acceptable compute budget C_required. Any model that fails to satisfy A ≥ A_required or C ≤ C_required is considered infeasible and receives an efficiency score of zero. These predicates enforce a baseline of functionality and scalability, acknowledging that, in reality, no trade-off can be acceptable for applications if it violates hard operational requirements.
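The predicates compose naturally with the core score. A sketch extending the `e_beta` helper defined above, with illustrative threshold values:

```python
def e_beta_constrained(a: float, c: float, beta2: float = 1.0,
                       a_required: float = 0.0, c_required: float = 1.0) -> float:
    """Eq. (1) with its feasibility predicates: infeasible models score zero."""
    if a < a_required or c > c_required:
        return 0.0
    return e_beta(a, c, beta2)

# A model over the compute budget is rejected outright, however accurate:
print(e_beta_constrained(a=0.99, c=0.95, c_required=0.80))  # -> 0.0
```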
The normalization of performance and compute cost values must be handled with care. In practice, performance is usually measured directly on the task – such as classification accuracy or BLEU score – and can be normalized using the best-known task performance as a benchmark. Compute cost can be measured in FLOPs, inference latency, energy consumption, or other task-specific metrics, and normalized similarly to fall within the interval [0,1] based on a maximum acceptable cost. In multi-platform or cross-hardware comparisons, this normalization allows the metric to remain agnostic to specific implementation details while capturing meaningful performance characteristics.
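A small helper following these conventions might look as below; clipping out-of-range inputs to [0, 1] is our added assumption, keeping the score within the metric's domain:

```python
def normalize(a_raw: float, best_a: float, c_raw: float, max_c: float):
    """Map raw performance/cost onto [0, 1] using the best-known performance
    and the maximum acceptable cost as reference points."""
    a = min(a_raw / best_a, 1.0)  # fraction of best-known task performance
    c = min(c_raw / max_c, 1.0)   # fraction of the maximum acceptable cost
    return a, c
```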
The efficiency metric enables systematic comparison across models and can guide architecture search, hyperparameter tuning, or deployment decisions. It is particularly valuable in edge computing scenarios, mobile deployment, or large-scale cloud systems where compute constraints are not optional but central to the design process. By introducing the β² parameter, we empower practitioners to shift the prioritization curve in favor of performance or compute as dictated by application requirements, regulatory frameworks, or hardware limitations.
Ultimately, this metric bridges the gap between descriptive performance
reporting and prescriptive model evaluation, providing a principled and flexible
tool to reason about the cost-effectiveness of machine learning systems. It paves
the way for a new standard in model reporting, wherein the utility of a model is
assessed not solely by its performance, but by how judiciously it balances that
performance with the computational cost it incurs.
ABLATION STUDY
To validate the theoretical properties and practical relevance of the proposed efficiency formula E (1), we conduct an in-depth ablation study. This section explores the behavior of the metric under different parameter settings, demonstrates its robustness across tasks, and evaluates its advantages over alternative formulations such as raw performance, performance/FLOPs, and normalized compute efficiency metrics. Our goal is to establish the sensitivity, interpretability, and practical deployment readiness of E under a wide spectrum of ML workloads.
We begin by considering the boundary conditions defined by the predicate constraints A ≥ A_required and C ≤ C_required. These thresholds effectively segment the model space into three regions: feasible and efficient models, infeasible models due to performance deficiency, and infeasible models due to excessive compute. In real-world deployment scenarios, such segmentation is crucial. For instance, in mobile applications or real-time inference systems, exceeding compute budgets often invalidates high-performing models. Similarly, performance levels below an acceptable minimum (e.g., below 90% Top-1 on ImageNet or under 0.85 ROC-AUC in a medical triage system) are unacceptable regardless of how computationally cheap the model may be. The predicate-based gating structure in E is therefore not just a mathematical formality but a reflection of hard constraints faced in software design.
Next, we analyze the core trade-off behavior of the main formula body. Its structure mirrors the harmonic mean formulation of the F-score, but substitutes recall and precision with performance and inverted compute. The substitution C′ = 1 − C ensures that high compute costs penalize the metric disproportionately when β² < 1, favoring compute-efficient models. Conversely, when β² > 1, the structure prioritizes performance, tolerating higher compute in return for higher prediction quality.
To visualize this trade-off, we collected results of 11 models on the Kinetics-400 dataset [31], with accuracy ranging from 72% to 83.1% and compute budgets ranging from 75 GFLOPs to 4.2 TFLOPs per inference. For each model, we computed raw accuracy, the accuracy-to-compute ratio, and E with β² = 1. All data are gathered in Table 1.
Table 1. Comparison of SOTA algorithms on Kinetics-400. For normalization we used 83.1% for accuracy and 4218 GFLOPs for compute

| Method | Top-1 accuracy, % | GFLOPs | Accuracy to compute ratio | Normalized accuracy to normalized compute ratio | E_β, β² = 1 |
|---|---|---|---|---|---|
| R(2+1)D [32] | 72.0 | 75 | 0.96 | 48.73 | 0.92 |
| I3D [33] | 72.1 | 108 | 0.67 | 33.89 | 0.92 |
| NL I3D-101 [34] | 77.7 | 359 | 0.22 | 10.99 | 0.92 |
| SlowFast R101 + NL [35] | 79.8 | 234 | 0.34 | 17.31 | 0.95 |
| X3D-XXL [36] | 80.4 | 144 | 0.56 | 28.34 | 0.97 |
| MViT-B, 64x3 [37] | 81.2 | 455 | 0.18 | 9.06 | 0.93 |
| TimeSformer-L [38] | 80.7 | 2380 | 0.03 | 1.72 | 0.60 |
| ViT-B-VTN [39] | 78.6 | 4218 | 0.02 | 0.95 | 0.00 |
| ViViT-L/16x2 320 [40] | 81.3 | 3992 | 0.02 | 1.03 | 0.10 |
| Swin-B [41] | 82.7 | 282 | 0.29 | 14.89 | 0.96 |
| Swin-L [41] | 83.1 | 604 | 0.14 | 6.98 | 0.92 |
The results demonstrate that both the Quality and Quality-to-Compute metrics exhibit biased preferences: the former ranks all high-accuracy models at the top regardless of cost, while the latter excessively rewards cheap, low-performing models. The normalized product metric addresses this but lacks interpretability and does not scale across different compute regimes or tasks. In contrast, E adapts fluidly: for small β², it closely tracks energy-aware efficiency frontiers; for large β², it aligns with traditional leaderboard-like ranking schemes.
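As a sanity check, the E_β column of Table 1 can be reproduced from Eq. (1) and the normalizers in the caption; a sketch reusing the helpers defined earlier:

```python
# Reproduce the TimeSformer-L row of Table 1
# (normalizers from the caption: 83.1% accuracy, 4218 GFLOPs).
a, c = normalize(80.7, 83.1, 2380.0, 4218.0)
print(round(e_beta(a, c, beta2=1.0), 2))  # -> 0.6, matching the table
```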
Additionally, in practical case studies involving BERT, MobileBERT, DistilBERT, and TinyBERT on GLUE, we observed that E correctly reflects realistic deployment preference orderings (Table 2). For β² = 1, TinyBERT, despite having slightly lower accuracy, outperforms BERT under our efficiency score due to its substantially lower inference latency. For β² = 100, however, BERT's superior accuracy regains dominance. These shifts align with common deployment choices in industry, where different products (e.g., cloud vs. mobile NLP) weigh accuracy and compute differently.
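The same helpers reproduce Table 2's β² sweep; this is a verification sketch using the caption's normalizers, not code from the original experiments:

```python
# Table 2 rows under three beta^2 settings (normalizers: 78.3%, 25 GFLOPs).
nlp_models = {"BERT-base": (78.3, 22.5), "TinyBERT": (75.4, 1.2)}
for name, (acc, gflops) in nlp_models.items():
    a, c = normalize(acc, 78.3, gflops, 25.0)
    scores = [round(e_beta(a, c, b2), 3) for b2 in (0.5, 1.0, 100.0)]
    print(name, scores)
# BERT-base [0.143, 0.182, 0.918]; TinyBERT [0.956, 0.957, 0.963]
```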
Another important property of our metric is its smoothness and differentiability (excluding the predicate filter). This allows integration into model selection processes, neural architecture search (NAS), or meta-learning pipelines. Because E is differentiable almost everywhere, it can even be used as an objective function or reward signal in reinforcement learning-based NAS [1, 13, 16].
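To illustrate, the ungated score's gradient with respect to A has a closed form, so it can drive gradient-based search directly; the derivative below follows from differentiating Eq. (1) and is our own sketch:

```python
def de_da(a: float, c: float, beta2: float = 1.0) -> float:
    """Analytic dE/dA of the ungated Eq. (1); positive everywhere on (0,1)^2,
    so accuracy gains always help, with diminishing returns as A grows."""
    c_eff = 1.0 - c
    return (1.0 + beta2) * beta2 * c_eff ** 2 / (beta2 * c_eff + a) ** 2
```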
Table 2. Efficiency evaluation of NLP models on GLUE [42]. For normalization we used 78.3% for accuracy and 25 GFLOPs for compute

| Model name | Accuracy, % | Compute, GFLOPs | Accuracy to compute ratio | E_β, β² = 0.5 | E_β, β² = 1 | E_β, β² = 100 |
|---|---|---|---|---|---|---|
| BERT-base [43] | 78.3 | 22.5 | 3.48 | 0.143 | 0.182 | 0.918 |
| MobileBERT [44] | 77.0 | 5.7 | 13.5 | 0.832 | 0.865 | 0.981 |
| DistilBERT [45] | 70.3 | 11.3 | 6.22 | 0.630 | 0.681 | 0.892 |
| TinyBERT [46] | 75.4 | 1.2 | 62.83 | 0.956 | 0.957 | 0.963 |
These findings establish E as not only theoretically sound but also practically aligned with how practitioners would reason about deployment under constraints. Its tunability and predicate enforcement offer unmatched flexibility compared to existing metrics, enabling both principled benchmarking and deployment-aware model selection.
CONCLUSIONS
In this work, we proposed a principled and flexible metric for evaluating the efficiency of machine learning models by unifying task performance and compute requirements into a single F-score–inspired metric. Our metric introduces a tunable β² parameter that allows practitioners to weight the importance of task performance relative to computational efficiency, enabling adaptable prioritization across research and production settings.
Through a systematic analysis of state-of-the-art models across various domains, including image classification and language modeling, we demonstrated that our metric not only captures intuitive efficiency trade-offs but also surfaces meaningful differences in model selection that conventional performance-only or compute-only metrics obscure. We further validated the advantages of this formula through a structured ablation study and comparative analysis against normalized performance, energy-based benchmarks, and classical Pareto front visualizations.
Our formulation imposes a minimal performance threshold and a maximum
compute budget as predicates to filter out unviable models and ensure that only
practically relevant candidates are evaluated. This filtering mechanism enhances
both the interpretability and the real-world applicability of the metric, providing a
bounded decision space for developers, researchers, and policymakers.
Notably, our approach extends naturally to a range of contexts, from low-power edge deployments to large-scale foundation model benchmarking, by adjusting β² and predicate constraints. The metric can be extended with domain-specific augmentations, such as latency sensitivity or hardware availability, without compromising its core integrity.
Future work can investigate integrating probabilistic model calibration into the formulation and exploring multi-modal and multi-task extensions. Additionally, formalizing the relation of our metric to economic efficiency measures – such as total cost of ownership (TCO) – could bridge academic and industrial evaluation paradigms.
In summary, our proposed efficiency score provides a powerful, tunable, and interpretable tool to unify performance and cost in machine learning evaluation. As ML models grow ever more complex and deployment environments more varied, such a metric will be essential in driving responsible and impactful innovation.
REFERENCES
1. M. Tan, Q. Le, "EfficientNet: Rethinking Model Scaling for Convolutional Neural Networks," ICML, 2019. doi: https://doi.org/10.48550/arXiv.1905.11946
2. A. Howard et al., "Searching for MobileNetV3," 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea (South), 2019, pp. 1314–1324. doi: https://doi.org/10.1109/ICCV.2019.00140
3. S. Han, H. Mao, W. Dally, "Deep Compression: Compressing DNNs with Pruning, Trained Quantization and Huffman Coding," ICLR, 2016. doi: https://doi.org/10.48550/arXiv.1510.00149
4. T. Wolf et al., "Transformers: State-of-the-Art Natural Language Processing," EMNLP, pp. 38–45, 2020. doi: https://doi.org/10.18653/v1/2020.emnlp-demos.6
5. T.B. Brown et al., "Language Models are Few-Shot Learners," NeurIPS, 2020. doi: https://doi.org/10.48550/arXiv.2005.14165
6. A. Dosovitskiy et al., "An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale," ICLR, 2021. doi: https://doi.org/10.48550/arXiv.2010.11929
7. Sukhpal Singh Gill, Rupinder Kaur, ChatGPT: Vision and Challenges. 2023. doi: https://doi.org/10.48550/arXiv.2305.15323
8. Y. Cheng, D. Wang, P. Zhou, T. Zhang, "Model Compression and Acceleration for Deep Neural Networks: The Principles, Progress, and Challenges," IEEE Signal Processing Magazine, vol. 35, no. 1, pp. 126–136, Jan. 2018. doi: https://doi.org/10.1109/MSP.2017.2765695
9. J. Deng, W. Dong, R. Socher, L.-J. Li, Kai Li, Li Fei-Fei, "ImageNet: A large-scale hierarchical image database," 2009 IEEE Conference on Computer Vision and Pattern Recognition, Miami, FL, USA, 2009, pp. 248–255. doi: https://doi.org/10.1109/CVPR.2009.5206848
10. "MLPerf Training Benchmark," MLPerf Consortium, 2022. Available: https://mlcommons.org
11. M. Sandler, A. Howard, M. Zhu, A. Zhmoginov, L.-C. Chen, "MobileNetV2: Inverted Residuals and Linear Bottlenecks," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 2018, pp. 4510–4520. doi: https://doi.org/10.1109/CVPR.2018.00474
12. J. Frankle, M. Carbin, "The Lottery Ticket Hypothesis," ICLR, 2019. doi: https://doi.org/10.48550/arXiv.1803.03635
13. H. Cai, T. Chen, W. Zhang, Y. Yu, J. Wang, "Efficient Architecture Search by Network Transformation," AAAI Conference on Artificial Intelligence, vol. 32, no. 1, 2018. doi: https://doi.org/10.1609/aaai.v32i1.11709
14. R. Collobert, J. Weston, L. Bottou, M. Karlen, K. Kavukcuoglu, P. Kuksa, "Natural Language Processing (Almost) from Scratch," JMLR, vol. 12, pp. 2493–2537, 2011. doi: https://doi.org/10.5555/1953048.2078186
15. Haozhi Qi, Xiaolong Wang, Deepak Pathak, Yi Ma, Jitendra Malik, "Learning Long-Term Visual Dynamics with Region Proposal Interaction Networks," CoRR, 2020. doi: https://doi.org/10.48550/arXiv.2008.02265
16. M. Tan et al., "MnasNet: Platform-Aware Neural Architecture Search for Mobile," 2019 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Long Beach, CA, USA, 2019, pp. 2815–2823. doi: https://doi.org/10.1109/CVPR.2019.00293
17. Barret Zoph, Quoc V. Le, "Neural Architecture Search with Reinforcement Learning," ICLR, 2017. doi: https://doi.org/10.48550/arXiv.1611.01578
18. "MLPerf Inference Benchmark v2.1," MLCommons, 2022. Available: https://mlcommons.org/
19. Xuanyi Dong, Yi Yang, "NAS-Bench-201: Extending the Scope of Reproducible Neural Architecture Search," ICLR, 2020. doi: https://doi.org/10.48550/arXiv.2001.00326
20. H. Benmeziane, K. El Maghraoui, H. Ouarnoughi, S. Niar, M. Wistuba, N. Wang, A Comprehensive Survey on Hardware-Aware Neural Architecture Search, 2021. doi: https://doi.org/10.48550/arXiv.2101.09336
21. D. Brooks et al., "Power-Aware Microarchitecture: Design and Modeling Challenges for Next-Generation Microprocessors," IEEE Micro, vol. 20, issue 6, pp. 26–44, 2000. doi: https://doi.org/10.1109/40.888701
22. James H. Laros, "Energy Delay Product," Energy-Efficient High Performance Computing, SpringerBriefs in Computer Science. Springer, London, 2013. doi: https://doi.org/10.1007/978-1-4471-4492-2_8
23. S. Han et al., "EIE: Efficient Inference Engine on Compressed Deep Neural Network," 2016 ACM/IEEE 43rd Annual International Symposium on Computer Architecture (ISCA), Seoul, Korea (South), 2016, pp. 243–254. doi: https://doi.org/10.1109/ISCA.2016.30
24. C.D. Manning, P. Raghavan, H. Schütze, Introduction to Information Retrieval. Cambridge University Press, 2008. doi: https://doi.org/10.1017/CBO9780511809071
25. Y. LeCun, Y. Bengio, G. Hinton, "Deep Learning," Nature, 521, pp. 436–444, 2015. doi: https://doi.org/10.1038/nature14539
26. A. Veit, S. Belongie, "Convolutional Networks with Adaptive Inference Graphs," IJCV, 2019. doi: https://doi.org/10.48550/arXiv.1711.11503
27. Álvaro Domingo Reguero, Silverio Martínez-Fernández, Roberto Verdecchia, "Energy-efficient neural network training through runtime layer freezing, model quantization, and early stopping," Computer Standards & Interfaces, vol. 92, 103906, 2024. doi: https://doi.org/10.1016/j.csi.2024.103906
28. Yu Emma Wang, Gu-Yeon Wei, David Brooks, Benchmarking TPU, GPU, and CPU Platforms for Deep Learning, 2019. doi: https://doi.org/10.48550/arXiv.1907.10701
29. D.M.W. Powers, Evaluation: from precision, recall and F-measure to ROC, informedness, markedness and correlation, 2010. doi: https://doi.org/10.48550/arXiv.2010.16061
30. J.R. Hershey, Z. Chen, J. Le Roux, S. Watanabe, "Deep clustering: Discriminative embeddings for segmentation and separation," 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Shanghai, China, 2016, pp. 31–35. doi: https://doi.org/10.1109/ICASSP.2016.7471631
31. W. Kay et al., "The kinetics human action video dataset," CoRR, 2017. doi: https://doi.org/10.48550/arXiv.1705.06950
32. D. Tran, H. Wang, L. Torresani, J. Ray, Y. LeCun, M. Paluri, "A Closer Look at Spatiotemporal Convolutions for Action Recognition," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 2018, pp. 6450–6459. doi: https://doi.org/10.1109/CVPR.2018.00675
33. J. Carreira, A. Zisserman, "Quo Vadis, Action Recognition? A new model and the kinetics dataset," Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6299–6308, 2017. doi: https://doi.org/10.48550/arXiv.1705.07750
34. X. Wang, R. Girshick, A. Gupta, K. He, "Non-local Neural Networks," 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, Salt Lake City, UT, USA, 2018, pp. 7794–7803. doi: https://doi.org/10.1109/CVPR.2018.00813
35. C. Feichtenhofer, H. Fan, J. Malik, K. He, "SlowFast Networks for Video Recognition," 2019 IEEE/CVF International Conference on Computer Vision (ICCV), Seoul, Korea (South), 2019, pp. 6201–6210. doi: https://doi.org/10.1109/ICCV.2019.00630
36. C. Feichtenhofer, "X3D: Expanding Architectures for Efficient Video Recognition," 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), Seattle, WA, USA, 2020, pp. 200–210. doi: https://doi.org/10.1109/CVPR42600.2020.00028
37. H. Fan et al., "Multiscale Vision Transformers," 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 2021, pp. 6804–6815. doi: https://doi.org/10.1109/ICCV48922.2021.00675
38. G. Bertasius, H. Wang, L. Torresani, "Is space-time attention all you need for video understanding?" CoRR, 2021. doi: https://doi.org/10.48550/arXiv.2102.05095
39. D. Neimark, O. Bar, M. Zohar, D. Asselmann, "Video transformer network," CoRR, 2021. doi: https://doi.org/10.48550/arXiv.2102.00719
40. A. Arnab, M. Dehghani, G. Heigold, C. Sun, M. Lučić, C. Schmid, "ViViT: A Video Vision Transformer," 2021 IEEE/CVF International Conference on Computer Vision (ICCV), Montreal, QC, Canada, 2021, pp. 6816–6826. doi: https://doi.org/10.1109/ICCV48922.2021.00676
41. Z. Liu et al., "Video Swin Transformer," 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), New Orleans, LA, USA, 2022, pp. 3192–3201. doi: https://doi.org/10.1109/CVPR52688.2022.00320
42. Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, Samuel R. Bowman, "GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding," CoRR, 2018. doi: https://doi.org/10.48550/arXiv.1804.07461
43. Jacob Devlin, Ming-Wei Chang, Kenton Lee, Kristina Toutanova, BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding. 2018. doi: https://doi.org/10.48550/arXiv.1810.04805
44. Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, Denny Zhou, MobileBERT: a Compact Task-Agnostic BERT for Resource-Limited Devices. 2020. doi: https://doi.org/10.48550/arXiv.2004.02984
45. Sahana Viswanath et al., "The DistilBERT Model: A Promising Approach to Improve Machine Reading Comprehension Models," International Journal on Recent and Innovation Trends in Computing and Communication, vol. 11, no. 8, pp. 293–309, 2023. doi: https://doi.org/10.17762/ijritcc.v11i8.7957
46. Xiaoqi Jiao et al., TinyBERT: Distilling BERT for Natural Language Understanding. 2019. doi: https://doi.org/10.48550/arXiv.1909.10351
Received 27.12.2024
INFORMATION ON THE ARTICLE
Alexander A. Zarichkovyi, ORCID: 0000-0002-4132-6424, National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute", Ukraine, e-mail: alexander.zarichkovyi@gmail.com
Inna V. Stetsenko, ORCID: 0000-0002-4601-0058, National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute", Ukraine, e-mail: stiv.inna@gmail.com
Oleksandr P. Stelmakh, ORCID: 0000-0003-3147-579X, National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute", Ukraine, e-mail: stelmahwork@gmail.com
Anton Yu. Dyfuchyn, ORCID: 0000-0002-1722-8840, National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute", Ukraine, e-mail: difuchin@gmail.com
Yaroslav I. Kornaga, ORCID: 0000-0001-9768-2615, National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute", Ukraine, e-mail: slovyan_k@ukr.net
EVALUATING THE EFFICIENCY OF MACHINE LEARNING MODELS: A UNIFIED METRIC FOR BALANCING PERFORMANCE AND COST / O.A. Zarichkovyi, I.V. Stetsenko, O.P. Stelmakh, A.Yu. Dyfuchyn, Ya.I. Kornaga

Abstract. A new unified metric is presented for evaluating the efficiency of machine learning, deep learning, and artificial intelligence models by balancing performance and execution cost. Existing metrics usually account in isolation for only performance or only computational characteristics (e.g., FLOPs, latency, energy consumption), without reflecting, in a single formula, the inherent trade-off between limited resources and predictive capability. The proposed formula contains a tunable trade-off factor and hard constraints on performance and cost, enabling principled comparison across models and deployment environments. The formalization generalizes prior heuristics and demonstrates clear interpretability, scalability, and hardware awareness.

Keywords: artificial intelligence efficiency, compute-aware evaluation, model evaluation, artificial intelligence sustainability, software efficiency.