Multimodal system for skin cancer detection
Authors: Volodymyr Sydorskyi, Igor Krashenyi, Oleksii Yakubenko
Journal: System Research and Information Technologies, 2026, No. 1
Publisher: The National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute"
DOI: 10.20535/SRIT.2308-8893.2026.1.03
Online access: https://journal.iasa.kpi.ua/article/view/358061
© V. Sydorskyi, I. Krashenyi, O. Yakubenko, 2026
Системні дослідження та інформаційні технології (System Research and Information Technologies), 2026, № 1
THEORETICAL AND APPLIED PROBLEMS OF INTELLIGENT DECISION SUPPORT SYSTEMS
UDC 004.932.2
DOI: 10.20535/SRIT.2308-8893.2026.1.03
MULTIMODAL SYSTEM FOR SKIN CANCER DETECTION
V. SYDORSKYI, I. KRASHENYI, O. YAKUBENKO
Abstract. Melanoma detection is vital for early diagnosis and effective treatment. While deep learning models on dermoscopic images have shown promise, they require specialized equipment, limiting their use in broader clinical settings. This study introduces a multi-modal melanoma detection system using conventional photo images, making it more accessible and versatile. Our system integrates image data with tabular metadata, such as patient demographics and lesion characteristics, to improve detection accuracy. It employs a multi-modal neural network combining image and metadata processing and supports a two-step model for cases with or without metadata. A three-stage pipeline further refines predictions by boosting algorithms and enhancing performance. To address the challenges of a highly imbalanced dataset, specific techniques were implemented to ensure robust training. An ablation study evaluated recent vision architectures, boosting algorithms, and loss functions, achieving a peak Partial ROC AUC of 0.18068 (0.2 maximum) and top-15 retrieval sensitivity of 0.78371. Results demonstrate that integrating photo images with metadata in a structured, multi-stage pipeline yields significant performance improvements. This system advances melanoma detection by providing a scalable, equipment-independent solution suitable for diverse healthcare environments, bridging the gap between specialized and general clinical practices.
Keywords: medical image classification, computer vision, gradient boosting, deep neural networks, clinical decision support systems.
INTRODUCTION
Skin cancer is one of the most commonly diagnosed types of cancer, posing a significant public health concern due to its high incidence rates and the risk of severe complications if not detected early [1]. The most effective approach to managing skin cancer is through early detection and prevention [2]. Despite substantial progress in medical imaging and diagnostic technologies, reliably and efficiently detecting melanoma remains challenging. Traditional diagnostic methods depend heavily on dermatologists’ expertise, which can be subjective and vary between practitioners. As a result, there is increasing interest in utilizing deep learning techniques to automate and improve the accuracy of skin cancer detection [3]. Progress in dataset curation and related classification challenges has demonstrated potential for fast and accurate skin cancer detection [4]. Deep learning methods have recently gained popularity and have been shown to improve skin cancer detection performance [5–10]. The effectiveness of these models relies heavily on the quality of their training datasets and is constrained by the inherent limitations of deep learning methods. Additionally, early detection may not always be feasible due to lengthy manual diagnostic procedures [11], and many low-income individuals cannot afford these options. This highlights the need to develop approaches that surpass human diagnostics, providing faster and more accurate results.
Deep learning has become a gold standard for skin cancer classification. Classical approaches using CNN architectures such as ResNets [12], DenseNets [13], and convolutional networks enhanced with attention mechanisms [5] have been widely adopted. More complex pipelines combine segmentation, feature extraction, and attention-based classification neural networks [3]. Synthetic data generated by GANs has proven effective in enhancing model performance [14]. Multi-modal approaches incorporating diverse data modalities further improve classification [15, 16].
Several studies employ hybrid approaches, blending deep learning with traditional machine learning algorithms. For instance, [17] uses VGG16 for feature extraction, followed by XGBoost for final image classification. This study also leverages synthetic data as part of its data augmentation strategy. Another noteworthy approach involves skin cancer detection using genetic data. In [18], various machine learning algorithms, including KNN, SVM, and XGBoost, are applied to classify melanoma. Similarly, [19] explores the use of NIR spectroscopy as input data, utilizing XGBoost, LightGBM, 1D-CNNs, and other machine and deep learning methods for classification.
Despite progress, several research gaps remain:
− Development of complex multi-modal systems integrating different modalities (e.g., image and tabular data) in parallel or sequential architectures.
− Optimization of multi-modal neural networks for datasets with and without meta-features.
− Adaptation to challenging imaging conditions, such as images captured using mobile devices, addressing data imbalance and quality variations.
In this study, we propose a novel framework for melanoma detection that integrates visual, structural, and lesion metadata with patient information such as age and sex. Our solution combines multi-modal neural networks for processing visual and metadata inputs with a gradient-boosting model for metadata analysis, unified within a three-stage pipeline. Additionally, we introduce a two-step training methodology to accommodate datasets with varying metadata availability. Advanced training techniques and two stages of feature engineering are applied to address class imbalance, ensuring robust and efficient melanoma detection.
MATERIALS AND METHODS
Data. This research utilizes several data sources:
− Data from ISIC 2024 Kaggle Challenge [20] - Main Data.
− Data from ISIC Archive [21] - ISIC Archive Data.
− Artificially generated image data [22] - Generated Data.
The primary dataset from the ISIC 2024 Kaggle Challenge is the foundation for most of the training and validation processes. Additional data sources play a crucial role in enhancing the proposed system, particularly in improving neural network performance. At the same time, diverse datasets introduce domain shifts and varying feature subsets, creating challenges in harmonizing and effectively integrating the information. The proper fusion of these heterogeneous data sources represents a key contribution of our work, enabling robust performance across different data modalities and domains.
Competition Data. The ISIC 2024 Kaggle Challenge dataset [20] includes images and metadata from single-lesion crops extracted from 3D total body photos (TBP) [23]. This dataset presents several challenges:
− Lower data quality compared to dermatoscopy images. The images resemble close-up smartphone photos, making them highly relevant for telehealth applications, where patients often submit similar-quality images (Fig. 1).
− Different labeling confidence. The dataset includes two categories of labels: “strongly-labeled tiles,” verified through histopathology, and “weakly-labeled tiles,” which were not biopsied and were considered benign by a doctor’s assessment.
− Severe class imbalance. The dataset contains 401,059 tiles, of which 400,666 (99.902%) are benign and 393 (0.098%) are malignant.
Each image represents a 15×15 mm area of skin but may come with a slightly varying resolution centered around 133 pixels (Fig. 2). Besides image data, the dataset includes metadata about patients, tile location, and image characteristics, as well as extracted features derived from [24] and [25].
Fig. 1. 1st row – benign; 2nd row – malignant images from ISIC 2024 Kaggle Challenge
Fig. 2. Distribution of image shapes in Data from ISIC 2024 Kaggle Challenge
ISIC Archive Data. The ISIC Archive dataset [21] contains 81,722 images accompanied by metadata. However, the dataset is highly unstructured due to its compilation from various data sources and competitions. For this study, most of the available meta-features are disregarded, and only patient information, target labels, and images are utilized for system development.
To avoid potential data leakage, all patients included in the ISIC 2024 Kaggle Challenge dataset [20] are excluded from the ISIC Archive data. Additionally, images lacking explicit benign/malignant labels are removed. After this filtering, the ISIC Archive contains 71,080 images, of which 61,910 (87.099%) are benign and 9,170 (12.901%) are malignant. This dataset has approximately 5.6 times fewer total images than [20], but it contains 23.3 times more malignant images.
The primary challenge with the ISIC Archive dataset lies in its substantial data diversity. It aggregates skin lesion tiles from various sources, including dermatoscopy and standard photo images. The images differ significantly in size, aspect ratio, scale, and padding, influenced by the medical devices used to capture them (Fig. 3). Most of the images are dermatoscopic and come in higher resolution: the median height is 3024 pixels and the median width is 2016 pixels (Fig. 4). This diversity introduces a significant domain shift compared to the ISIC 2024 Kaggle Challenge dataset [20]. Despite these challenges, the ISIC Archive dataset is a valuable source of malignant images, addressing their severe undersampling in the primary dataset. Its inclusion enriches the data diversity, improving model robustness and generalization.
Fig. 3. 1st row – benign; 2nd row – malignant images from ISIC Archive
Fig. 4. Distribution of image shapes in ISIC Archive
Generated Data. The generated image dataset from [22] was created using the Stable Diffusion 2 model [26]. It consists of 6,012 images, with a nearly equal distribution of classes: 3,012 malignant and 3,000 benign images. All images are standardized to a resolution of 512×512 pixels. While the generated images can often be distinguished by artifacts and smooth textures characteristic of generative models (Fig. 5), the model performs well in preserving malignant lesions’ shapes and visual characteristics. This fidelity provides valuable information that can enhance the training of deep learning models by supplementing the limited data of malignant cases in real-world datasets.
Fig. 5. 1st row – benign; 2nd row – malignant images from Generated Dataset
METHODS
This section describes the proposed system, including its components, models, optimization processes, evaluation metrics, and validation procedures.
Metrics. To evaluate the proposed system and models, the following metrics are utilized:
− ROC AUC. A standard metric measuring the overall performance of a binary classification model.
− Partial ROC AUC [27]. This metric calculates the area under the ROC curve only for True Positive Rates (TPR) above 80%. The score ranges from 0.0 to 0.2, emphasizing performance in the high-sensitivity region critical for clinical applications.
− Top-15 retrieval sensitivity [28]. This metric is the most relevant to real clinical scenarios, where a dermatologist has limited time per patient and should focus on the most suspicious lesions [29].
Both out-of-fold (OOF) and mean-fold metrics are reported.
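As a sketch, the Partial ROC AUC above an 80% TPR described above can be computed as the area between the ROC curve and the line TPR = 0.8 (the exact computation used in the benchmark may differ; the function name and implementation here are ours):

```python
import numpy as np
from sklearn.metrics import roc_curve

def partial_roc_auc(y_true, y_score, min_tpr=0.80):
    """Area between the ROC curve and the line TPR = min_tpr.

    Ranges from 0.0 to (1 - min_tpr), i.e. 0.2 for a perfect classifier
    when min_tpr = 0.8, matching the score range reported in this paper.
    """
    fpr, tpr, _ = roc_curve(y_true, y_score)
    # FPR at which the ROC curve first reaches min_tpr (linear interpolation)
    fpr_at_min = np.interp(min_tpr, tpr, fpr)
    mask = tpr >= min_tpr
    tpr_seg = np.concatenate(([min_tpr], tpr[mask]))
    fpr_seg = np.concatenate(([fpr_at_min], fpr[mask]))
    # trapezoidal integration of (TPR - min_tpr) over FPR
    heights = tpr_seg - min_tpr
    return float(np.sum(np.diff(fpr_seg) * (heights[:-1] + heights[1:]) / 2.0))
```

For a perfect ranking this returns 0.2, and for a purely random one it approaches 0.02.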
Validation. To evaluate the models, a classical 5-fold cross-validation approach is used. Folds are stratified based on the target label (benign/malignant), and it is ensured that no patient overlaps across folds.
In cases where a two-stage training approach is employed (first stage: ISIC Archive + Main + Generated Data; second stage: only Main Data), the datasets are split separately, and the respective folds are merged afterward.
For models using only tabular data, validation is repeated five times with different random seeds, and average scores are reported. Hyperparameter tuning for tabular models is performed using the Optuna algorithm [30], with the tuning strategy discussed in the Tabular Approach section.
Fig. 6. Model pipelines and validation schemes
The two-stage system, which integrates Vision and Tabular models, can follow several aggregation approaches (Fig. 6):
1. Vision-Only Pretraining:
− Train the Vision model on ISIC Archive and Generated data.
− Generate predictions for the Main dataset.
− Train the Tabular model using Vision model predictions and tabular features.
2. Vision Model Pretraining and Fine-Tuning:
− Train the Vision model on the ISIC Archive, Generated, and Main datasets.
− Fine-tune the Vision model on the Main dataset only.
− Generate out-of-fold predictions on the Main dataset.
− Train the Tabular model using these predictions and tabular features.
3. Multi-Modal Pretraining and Fine-Tuning: Same as Approach 2, but tabular data is also incorporated during the Vision model fine-tuning.
In the final, third stage of the system (Fig. 7), the Optuna algorithm is used for coefficient optimization. However, it is crucial to recognize that the validation metrics obtained during the second stage, particularly for the second and third approaches, may be unreliable and could lead to overly optimistic outcomes. Similarly, the final stage lacks validation, which can further contribute to unreliable results. To address these limitations and ensure a robust evaluation of the final system’s performance, the Public and Private Leaderboards from the Kaggle Competition [31] are used as benchmarks. The Public test set contains approximately 140,000 tiles, while the Private test set includes around 360,000 tiles. These external benchmarks provide a more realistic and unbiased assessment of the system’s capabilities.
Fig. 7. Scheme of Multi-model fusion system
Feature Engineering. Basic preprocessing steps are applied to the tabular data, including handling missing values and removing redundant features. Specifically, missing numerical values are filled with the median, accompanied by an added missing-indicator feature, while missing categorical values are replaced with a new “nan” category. Columns that are static or present only in the training data are dropped to ensure consistency across datasets.
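The missing-value handling described above might look like the following pandas sketch (column names and the function itself are hypothetical illustrations):

```python
import pandas as pd

def preprocess_tabular(df, num_cols, cat_cols):
    """Median-impute numeric columns (with indicators); 'nan'-fill categoricals."""
    df = df.copy()
    for c in num_cols:
        # add the indicator before imputing, so missingness is preserved
        df[c + "_missing"] = df[c].isna().astype(int)
        df[c] = df[c].fillna(df[c].median())
    for c in cat_cols:
        # missing categories become their own "nan" category
        df[c] = df[c].fillna("nan")
    return df
```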
In addition to basic tabular features, several advanced features are manually engineered to capture spatial, color, and physical relationships inherent in the data. Initial proposals for these features are generated using ChatGPT [32] and refined through pruning. Key engineered features include:
− Lesion size ratio: The ratio of the minimum to the maximum diameter of the lesion.
− Hue contrast: The difference in hue between the lesion’s center and periphery.
− Perimeter-to-area ratio: The ratio of the lesion’s perimeter to its area.
A critical aspect involves the comparison of lesion characteristics within the same patient or body region, motivated by [33]. To address this, aggregation features are introduced:
1. Deviation within patients: StandardScaler [34] is applied within each patient-id group to capture deviations relative to other lesions of the same patient.
2. Deviation within body regions: StandardScaler is applied within combined patient-id and anatomic-site-general groups, reflecting deviations in specific body regions (e.g., arm, leg).
3. Extremes within patients: Maximum and minimum feature values per patient are calculated. Given the limited patient sample size, these features are discretized using QuantileTransformer [35] to mitigate overfitting risks.
Additionally, incorporating skin type as a feature, inspired by [36], improved performance.
Finally, categorical features are one-hot encoded to prepare the data for model training.
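The within-patient deviation features (item 1 above) amount to a grouped z-score; the paper applies StandardScaler per group, which a pandas transform reproduces up to the ddof convention (column and function names here are our assumptions):

```python
import pandas as pd

def add_patient_deviation(df, cols, group_col="patient_id"):
    """Z-score each feature within its patient group, as a deviation feature."""
    df = df.copy()
    for c in cols:
        g = df.groupby(group_col)[c]
        mean = g.transform("mean")
        std = g.transform("std", ddof=0)  # ddof=0 matches StandardScaler
        # lesions of a single-lesion patient get deviation 0 instead of NaN
        df[c + "_pat_dev"] = ((df[c] - mean) / std.replace(0, 1)).fillna(0.0)
    return df
```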
Multi-Modal Neural Net: Image + Tabular Data. The multi-modal Vision + Tabular model is trained in two stages:
1. A CNN Encoder combined with a Multilayer Classifier is trained on the ISIC Archive, Generated, and Main datasets.
2. The pre-trained CNN Encoder is combined with a randomly initialized Feed-Forward Tabular Neural Net in a new Multilayer Classifier, and this combined model is fine-tuned only on the Main dataset.
All images are resized to a resolution of 128×128. Images from the ISIC Archive are also center-cropped before resizing. Continuous tabular features are normalized using the StandardScaler, while categorical features are one-hot encoded.
ConvNeXt V2 Pico [37], EdgeNeXt Base [38], and EfficientNetV2 B0 [39] CNN architectures are used. Pre-trained models from the timm repository [40] serve as starting points for first-stage training: convnextv2_pico.fcmae_ft_in1k, edgenext_base.in21k_ft_in1k, tf_efficientnetv2_b0.in1k. EdgeNeXt Base is used in one of the multi-modal architectures to leverage its attention mechanisms, prioritizing robustness over inference speed. EfficientNetV2 B0 is employed as a first-level model to generate predictions for the second-level pipeline, benefiting from its high inference speed. ConvNeXt V2 Pico balances inference speed and accuracy, making it suitable for both prediction generation and multi-modal architectures.
Heavy augmentations are applied to enhance model robustness and mitigate overfitting in undersampled malignant cases. These include various spatial, color, blurring, distortion, and dropout augmentations, introducing variability and improving generalization (see Section Detailed Neural Net Architecture and Training Setup).
To address class imbalance during training:
− A balanced sampling strategy is employed in the first stage, ensuring equal representation of positive and negative classes.
− A “square” balancing strategy is applied in the second stage to refine the class distribution further.
For generating predictions, the last checkpoint and the best checkpoint (based on validation Partial ROC AUC) are used. Predictions are averaged across test-time augmentations (TTA), incorporating four flips to increase accuracy and robustness.
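The four-flip TTA can be sketched as follows (`predict` stands in for the trained model's forward pass; this is our illustration, not the authors' code):

```python
import numpy as np

def predict_with_tta(predict, image):
    """Average a model's score over the four flip views of an H x W x C image."""
    views = [
        image,
        image[:, ::-1],     # horizontal flip
        image[::-1, :],     # vertical flip
        image[::-1, ::-1],  # both flips
    ]
    return float(np.mean([predict(v) for v in views]))
```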
Detailed Neural Net Architecture and Training Setup. Images are first normalized to the [0, 1] range and then normalized to ImageNet statistics [41]. For image resizing, methods from the OpenCV library [42] are used, specifically INTER_AREA and INTER_LANCZOS4. The final choice is INTER_LANCZOS4, though the overall difference between methods is marginal.
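The two normalization steps amount to the following (the ImageNet mean/std are the standard published values [41]; the function name is ours):

```python
import numpy as np

IMAGENET_MEAN = np.array([0.485, 0.456, 0.406], dtype=np.float32)
IMAGENET_STD = np.array([0.229, 0.224, 0.225], dtype=np.float32)

def normalize_image(img_uint8):
    """uint8 H x W x 3 RGB image -> float32 array normalized to ImageNet stats."""
    x = img_uint8.astype(np.float32) / 255.0  # scale to [0, 1]
    return (x - IMAGENET_MEAN) / IMAGENET_STD
```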
Both first and second-stage models are trained with the following series of augmentations:
− Transpose with a probability of 0.5.
− Vertical Flip with a probability of 0.5.
− Horizontal Flip with a probability of 0.5.
− Random Brightness and Contrast adjustment, with changes in brightness and contrast within the range [–0.2, 0.2] and a probability of 0.75.
− One of the following blurs: Motion, Median, Gaussian, or Gaussian Noise, with variation in the range [5, 30] and a blur kernel size limit of up to 5, applied with a probability of 0.7.
− One of the following distortions: Optical (limit up to 1.0), Grid (5 steps with a limit of up to 1.0), or Elastic Transform (alpha up to 3), applied with a probability of 0.7.
− CLAHE with a clip limit of up to 4, applied with a probability of 0.7.
− Random adjustment of hue, saturation, and value: hue changes within [–10, 10], saturation within [–20, 20], and value within [–10, 10], with a probability of 0.5.
− Shifting, scaling, and rotation of the image: shift within [–0.1, 0.1], scale within [–0.1, 0.1], and rotation within [–15, 15] degrees, applied with a probability of 0.85.
− Coarse Dropout with one hole of 48 pixels in width and height, applied with a probability of 0.7.
Models for both stages are trained with a batch size of 64 and binary cross-entropy loss. First-stage models are trained for 10 epochs, while second-stage models are trained for one epoch. The limited number of epochs is due to the large dataset size combined with a small number of positive samples: balanced sampling is used, and overfitting tends to occur early.
For optimization, the Adam optimizer [43] is used in the first stage, while RAdam [44] is used in the second stage. The learning rate settings are as follows:
− For the first stage, a fixed starting learning rate is used; for the EdgeNeXt model, weight decay is additionally applied.
− For the second stage:
− The CNN encoder (from the first stage) starts with a learning rate of 1e-4.
− The Feed-forward Tabular Neural Net and Multilayer Classifier start with a learning rate of 1e-3.
This choice of different starting learning rates is crucial for fitting the entire model within one epoch while avoiding overfitting to the training data or to one of the modalities (image or tabular). Learning rates are reduced using cosine scheduling to 10 times smaller than their initial value. For second-stage training, the last checkpoint from the first stage is used. The “square” balancing weights for second-stage training are given in the equation below. For EfficientNetV2, a separate starting learning rate is used and reduced over training; this model is trained on all ISIC Archive and Generated data, resulting in good convergence.
class_weight_c = sqrt( Σ_c' n_c' / n_c ), where n_c is the number of training samples in class c.
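The cosine schedule that decays each learning rate to one tenth of its starting value can be written as follows (a sketch; the step-counting convention and function name are our assumptions):

```python
import math

def cosine_lr(step, total_steps, lr_start):
    """Cosine decay from lr_start at step 0 to lr_start / 10 at the final step."""
    lr_end = lr_start / 10.0
    cosine = 0.5 * (1.0 + math.cos(math.pi * step / total_steps))
    return lr_end + (lr_start - lr_end) * cosine
```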
Fig. 8. Vision and Tabular general model architecture
The overall Image and Tabular model architecture is inspired by [16]. The resulting architecture is shown in Fig. 8. Detailed values for the number of hidden channels and other hyperparameters can be found in Tables 1 and 2.
T a b l e 1 . CNN Encoders

| CNN Encoder | Embedding Shape | Vision and Tabular Embedding Shape |
|---|---|---|
| ConvNeXt | 512 | 576 |
| EdgeNeXt | 584 | 648 |
| EfficientNetV2 | 192 | – |
T a b l e 2 . Feed Forward Net for Meta Features

| Layer | In Channels | Out Channels | Dropout Probability |
|---|---|---|---|
| 1 | 200 | 256 | 0.3 |
| 2 | 256 | 512 | 0.3 |
| 3 | 512 | 128 | 0.3 |
| Final | 128 | 64 | – |
An important feature of the multi-modal model is the substantial bottleneck in the final Feed-forward Net layer. This bottleneck is critical for preventing overfitting to the tabular branch: using a larger number of channels results in faster overfitting and poorer final results.
Tabular Approach. To address the class imbalance and enhance the efficiency of training and hyperparameter selection, RandomUnderSampler [45] is used as the initial step for all tabular models. The underlying model is a boosting algorithm (LightGBM [46] or XGBoost [47]), with or without early stopping.
− For boosting without early stopping, a single model is trained on the entire training dataset, and the number of epochs is selected as one of the hyperparameters.
− For boosting with early stopping, an ensemble of five models is trained. Each model is fitted on 4/5 of the training data, with early stopping performed on the remaining 1/5 (the data split follows the same principle described in the Validation section) (Fig. 9).
Fig. 9. Validation strategy for Tabular Models
Hyperparameter tuning, including the under-sampling ratio and the number of epochs for models without early stopping, is conducted using the Optuna optimization framework in the following steps:
1. Initial Optimization: 300 Optuna trials are performed without predefined starting parameters.
2. Parameter Refinement: The best parameters from the top five trials are combined to reduce overfitting and ensure a more robust solution. For numerical parameters, medians are calculated across the top-performing trials. For categorical parameters, the most frequent value, or the value from the model with the highest Partial ROC AUC score, is selected.
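Step 2 above can be sketched in plain Python (the `(score, params)` trial representation is a simplifying assumption; Optuna's own trial objects would be unpacked the same way):

```python
import statistics

def refine_params(trials, top_k=5):
    """trials: list of (score, params_dict); combine the top-k into one config."""
    top = sorted(trials, key=lambda t: t[0], reverse=True)[:top_k]
    best_params = top[0][1]  # params of the highest-scoring trial
    refined = {}
    for key, best_val in best_params.items():
        values = [params[key] for _, params in top]
        if isinstance(best_val, bool) or not isinstance(best_val, (int, float)):
            # categorical: most frequent value, ties broken toward the best trial
            refined[key] = max(set(values),
                               key=lambda v: (values.count(v), v == best_val))
        else:
            # numerical: median across the top trials
            refined[key] = statistics.median(values)
    return refined
```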
The primary fusion scheme is depicted in Fig. 7.
1. In the first stage, tabular and multi-modal tabular/image neural models are trained. An image-only neural model is also trained for use in the second stage.
2. In the second stage, a tabular model is trained using tabular features and the outputs of the image-only neural model.
3. In the third stage, the outputs of all models from the first stage (except the image-only neural model) and the outputs of the second-stage models are ensembled using Optuna coefficient optimization. The optimization is performed on Partial ROC AUC directly on the validation folds.
Predictions are generated from each fold model in the final system, increasing inference time but significantly improving robustness. This tradeoff is crucial for real-world applications such as skin cancer detection.
Due to the different nature of the models and variations in the number of positive samples, the final probability distributions of each model can vary (Fig. 10). To address this, the distributions are standardized using the rank method, where probabilities are converted to ranks before ensembling.
Fig. 10. Probability distributions of different models, trimmed at 0.1. Trimming is needed because most of the probabilities are lower than 0.1, while the most interesting part lies above this point
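Rank standardization followed by a weighted blend can be sketched as follows (our illustration; tie handling in real ranking implementations is typically more careful):

```python
import numpy as np

def to_ranks(scores):
    """Map raw probabilities to evenly spaced ranks in [0, 1]."""
    order = scores.argsort().argsort().astype(np.float64)
    return order / (len(scores) - 1)

def blend(predictions, weights):
    """Weighted average of rank-transformed model outputs."""
    ranked = np.stack([to_ranks(p) for p in predictions])
    w = np.asarray(weights, dtype=np.float64)
    return (w[:, None] * ranked).sum(axis=0) / w.sum()
```

This makes models with very different score scales directly comparable before the coefficient search.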
For the final Optuna optimization stage, overfitting to the training (validation) set is a notable risk. To mitigate this, the top 10 results from 5 optimization runs are collected and averaged. The results of the coefficient search are shown in Fig. 11. Final ensemble weights are adjusted manually to refine the system further.
Fig. 11. Coefficients of 3rd Stage (Three-stage v2) obtained from Optuna across several runs
RESULTS
In the results tables (Tables 3, 4, and 5) and the coefficient plot (Fig. 11), the different model versions refer to the following:
− Different versions of two-stage models correspond to the different approaches illustrated in Fig. 6.
− Different versions of XGBoost and LightGBM models reflect minor adjustments in the Optuna configurations or feature setups. For instance, version 2 incorporates the skin tone feature.
We evaluate the proposed methods using the validation, private, and public datasets. The primary metrics include Partial ROC AUC, ROC AUC, and Top-15 Retrieval Sensitivity. For the validation dataset, we report both Out-of-Fold (OOF) and Mean metrics, with their relevance discussed in Section Metrics. Due to limitations, we report Top-15 Retrieval Sensitivity only for the validation dataset.
Metrics of One-stage and Two-stage Models
The proposed Two-stage v2 model demonstrates superior performance across all datasets in partial and full ROC AUC metrics, as detailed in Tables 3 and 4. The Multi-Modal ConvNext model achieves the highest Top-15 Retrieval Sensitivity (Table 5). While validation results for two-stage models may appear over-optimistic (see Section Validation), their consistent outperformance on the public and private datasets supports these conclusions.
T a b l e 3 . Partial ROC AUC of One-Stage and Two-Stage Models

| Model | OOF | Mean | Private | Public |
|---|---|---|---|---|
| Two-stage v2 | 0.17666 | 0.17862 | 0.16941 | 0.18608 |
| Multi-Modal ConvNext | 0.17497 | 0.17698 | 0.16090 | 0.17714 |
| XGBoost v2 | 0.17348 | 0.17460 | – | – |
| XGBoost v1 | 0.17252 | 0.17351 | – | – |
| EarlyStop LightGBM v4 | 0.17225 | 0.17305 | – | – |
| LightGBM v3 | 0.17173 | 0.17266 | 0.16107 | 0.18400 |
| LightGBM v4 | 0.17105 | 0.17210 | – | – |
| EarlyStop LightGBM v1 | 0.17029 | 0.17183 | – | – |
| EarlyStop LightGBM v2 | 0.17024 | 0.16187 | 0.16187 | 0.18336 |
| Two-stage v1 | 0.17010 | 0.17116 | 0.16173 | 0.18283 |
| LightGBM v1 | 0.17005 | 0.17165 | – | – |
| Multi-Modal EdgeNext | 0.15892 | 0.17410 | 0.16082 | 0.17481 |
T a b l e 4 . ROC AUC of One-Stage and Two-Stage Models

| Model | OOF | Mean |
|---|---|---|
| Two-stage v2 | 0.97234 | 0.97395 |
| Multi-Modal ConvNext | 0.97082 | 0.97244 |
| XGBoost v2 | 0.96826 | 0.96916 |
| XGBoost v1 | 0.96741 | 0.96817 |
| EarlyStop LightGBM v4 | 0.96719 | 0.96765 |
| LightGBM v3 | 0.96599 | 0.96682 |
| LightGBM v4 | 0.96586 | 0.96666 |
| Two-stage v1 | 0.96518 | 0.96600 |
| EarlyStop LightGBM v2 | 0.96502 | 0.96605 |
| EarlyStop LightGBM v1 | 0.96491 | 0.96629 |
| LightGBM v1 | 0.96485 | 0.96632 |
| Multi-Modal EdgeNext | 0.95110 | 0.96701 |
Table 5. Top 15 Retrieval Sensitivity of One-Stage and Two-Stage Models

| Model | OOF | Mean |
|---|---|---|
| Multi-Modal ConvNext | 0.76081 | 0.75995 |
| Two-stage v2 | 0.74809 | 0.75375 |
| XGBoost v2 | 0.73791 | 0.73919 |
| LightGBM v1 | 0.73537 | 0.73769 |
| EarlyStop LightGBM v2 | 0.72774 | 0.73095 |
| LightGBM v4 | 0.72774 | 0.72939 |
| XGBoost v1 | 0.72519 | 0.72621 |
| Two-stage v1 | 0.72265 | 0.72304 |
| EarlyStop LightGBM v4 | 0.72265 | 0.72658 |
| EarlyStop LightGBM v1 | 0.72010 | 0.72296 |
| LightGBM v3 | 0.71501 | 0.71614 |
| Multi-Modal EdgeNext | 0.71247 | 0.71542 |
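Top 15 Retrieval Sensitivity rewards systems that surface malignant lesions among a patient's highest-ranked lesions. As a rough sketch of our reading of the secondary-prize metric [29] (not its official implementation; column names are illustrative): for each patient, keep the 15 highest-scored lesions and measure the fraction of all malignant lesions captured in those shortlists.

```python
import numpy as np
import pandas as pd

def top_k_retrieval_sensitivity(df, k=15):
    """Fraction of malignant lesions that fall within each patient's
    top-k highest-scored lesions. Column names are illustrative."""
    rank = df.groupby("patient_id")["score"].rank(method="first", ascending=False)
    retrieved = df[(rank <= k) & (df["malignant"] == 1)]
    total_malignant = int(df["malignant"].sum())
    return len(retrieved) / total_malignant if total_malignant else float("nan")

# Toy example: one patient, 20 lesions, malignant ones scored highest
df = pd.DataFrame({
    "patient_id": ["p1"] * 20,
    "score": np.linspace(0, 1, 20),
    "malignant": [0] * 18 + [1, 1],
})
```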
Metrics of Three-stage System
Three-stage systems outperform both standalone and two-stage models on the Private dataset (Table 6). For this comparison, validation results may not be fully reliable; however, performance can be evaluated on the Private and Public datasets. Finally, compared with the top solutions from the competition (Table 7), our proposed solutions underperform by approximately 2%. Given the small number of malignant cases, this difference can be considered marginal.
Compared to prior works on melanoma detection (Table 8), our solutions outperform all previous approaches. While this comparison is not entirely equitable due to differences in training and validation datasets, most prior studies rely on higher-quality dermoscopic images. Given the lower quality of the photo images in our dataset, poorer results might have been expected. Instead, the findings strongly indicate that our proposed solution performs comparably, if not better, on lower-quality photo images, demonstrating its robustness and applicability in real-world scenarios.
Table 6. Metrics of Three-Stage Systems

| System | Partial ROC AUC | Private | Public | Sensitivity |
|---|---|---|---|---|
| Two-stage v2 | 0.17666 | 0.16941 | 0.18608 | 0.74809 |
| Two-stage v1 | 0.17010 | 0.16173 | 0.18283 | 0.72265 |
| Three-stage v2 | 0.18068 | 0.17042 | 0.18528 | 0.78371 |
| Three-stage v1 | 0.18014 | 0.16982 | 0.18449 | 0.77608 |
| Three-stage v1 mc | 0.17939 | 0.17039 | 0.18527 | 0.78117 |
Table 7. Comparison with Best Competition Results

| Solution | Private | Public |
|---|---|---|
| 1st Private Place | 0.17264 | 0.18611 |
| 1st Public Place | 0.17051 | 0.188 |
| Three-stage v1 mc | 0.17039 | 0.18527 |
| Three-stage v2 | 0.17042 | 0.18528 |
Table 8. Comparison with Other State-of-the-Art Research

| Experiment | Dataset | ROC AUC |
|---|---|---|
| 2020 Best Solution [16] | 2020 ISIC Competition [48] | 0.9490 |
| Saranya N et al. [14] | PH2 [49] | 0.87 |
| Saranya N et al. [14] | Derm7pt [15] | 0.76 |
| Jojoa Acosta et al. [50] | ISIC 2017 Challenge [21] | 0.91 |
| Ours (Multi-Modal ConvNext) | ISIC 2024 Kaggle Challenge | 0.97244 |
| Ours (XGBoost v2) | ISIC 2024 Kaggle Challenge | 0.96916 |
| Ours (Two-stage v2) | ISIC 2024 Kaggle Challenge | 0.97395 |
Image and Multi-Modal Models Ablation Study
EfficientNet B1 [51] is used as the Baseline model, with an image resolution of 128, strong data augmentations, and a balanced data sampler (described in Section Detailed Neural Net Architecture and Training Setup). As shown in Table 9, the model benefits from adding ISIC Archive data, even considering the modality shift. Increasing the image resolution yields reasonable score improvements on the Validation set but inconsistent results on the Public and Private sets; since higher resolution also noticeably slows down training and inference, we dropped this change. Among the loss functions, both Focal [52] and ASL [53] improve all scores, which is expected given the high label imbalance, whereas Balanced MixUp [54] does not prove effective for this task. In the backbone search across EfficientNet B1, EfficientNet V2 B0, EdgeNeXt Base, and ConvNext V2 Pico, Table 9 shows that the EdgeNeXt family performs the worst, EfficientNet V2 achieves the best results, and the ConvNext family sits in the middle, marginally underperforming EfficientNet V2.
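Of the loss functions above, focal loss is the simplest to illustrate: it down-weights well-classified examples so that the rare malignant class dominates the gradient. A minimal numpy sketch of the formula from [52] (our own illustration, not the training code, which applies the same loss in PyTorch):

```python
import numpy as np

def focal_loss(y_true, p, gamma=2.0, alpha=0.25, eps=1e-7):
    """Binary focal loss, averaged over samples.

    With gamma=0 and alpha=0.5 this is exactly 0.5 x binary cross-entropy;
    gamma > 0 shrinks the contribution of already well-classified examples.
    """
    p = np.clip(p, eps, 1 - eps)
    p_t = np.where(y_true == 1, p, 1 - p)          # probability of the true class
    alpha_t = np.where(y_true == 1, alpha, 1 - alpha)
    return float(np.mean(-alpha_t * (1 - p_t) ** gamma * np.log(p_t)))
```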
Table 9. Ablation Study of Image-Only Model

| Setup | Mean Partial ROC AUC | Public | Private |
|---|---|---|---|
| Baseline | 0.15252 | 0.14769 | 0.13588 |
| Add ISIC Archive data | 0.15478 | 0.15500 | 0.14390 |
| Add ISIC Archive data + Resolution 256 | 0.15761 | 0.15208 | 0.14633 |
| Add ISIC Archive data + Focal loss | 0.15889 | 0.15361 | 0.14645 |
| EfficientNet V2 B2 + Add ISIC Archive data | 0.162993 | 0.15020 | 0.13812 |
| ConvNext V2 Pico + Add ISIC Archive and Generated data | 0.15723 | 0.15490 | 0.14388 |
| ConvNext V2 Pico + Add ISIC Archive and Generated data + Balanced MixUp | 0.154894 | 0.14902 | 0.14315 |
| ConvNext V2 Pico + Add ISIC Archive and Generated data + ASL loss | 0.159298 | 0.15581 | 0.14474 |
| EdgeNeXt Base + Add ISIC Archive and Generated data | 0.160584 | 0.14729 | 0.13514 |
| EfficientNet V2 B2 + 2-stage with tune on main data | 0.159025 | 0.15917 | 0.14224 |
All multi-modal setups are first trained in image-only mode on the Main, ISIC Archive, and Generated data and then tuned on the Main data with tabular features (Sections Multi-Modal Neural Net: Image + Tabular data and Detailed Neural Net Architecture and Training Setup). As shown in Table 10, ConvNext outperforms EfficientNet V2, and ASL loss gives the best performance on the Validation and Private datasets. Finally, all multi-modal models outperform their image-only counterparts by a significant margin. This is expected, since their input is enriched with tabular features, which are much less noisy than images.
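Conceptually, the fusion step is small: the image backbone produces an embedding, the tabular branch encodes the metadata, and a shared head classifies their concatenation. A dependency-free numpy sketch of that forward pass (all dimensions and weights here are illustrative; the actual models use ConvNext/EfficientNet backbones trained in PyTorch):

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(x, 0.0)

def fuse_forward(img_emb, tab_feats, W_tab, W_head, b_head):
    """Late fusion: project tabular features, concatenate with the image
    embedding, then apply a linear classification head."""
    tab_emb = relu(tab_feats @ W_tab)           # tabular branch
    joint = np.concatenate([img_emb, tab_emb])  # fusion by concatenation
    logit = joint @ W_head + b_head             # shared head
    return 1.0 / (1.0 + np.exp(-logit))         # malignancy probability

img_emb = rng.normal(size=512)              # e.g. pooled CNN features
tab_feats = rng.normal(size=40)             # standardized metadata features
W_tab = rng.normal(size=(40, 64)) * 0.1
W_head = rng.normal(size=512 + 64) * 0.01
prob = fuse_forward(img_emb, tab_feats, W_tab, W_head, 0.0)
```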
Table 10. Ablation Study of Multi-Modal Image/Tabular Models

| Setup | Mean Partial ROC AUC | Public | Private |
|---|---|---|---|
| ConvNext V2 Pico | 0.17698 | 0.17740 | 0.16409 |
| EfficientNet V2 B0 | 0.16836 | 0.16698 | 0.15547 |
| EfficientNet V2 B2 | 0.16600 | 0.16322 | 0.15908 |
| EdgeNeXt Base | 0.17410 | 0.17481 | 0.16082 |
| ConvNext V2 Pico + Balanced MixUp | 0.17582 | 0.16851 | 0.15337 |
| ConvNext V2 Pico + ASL loss | 0.17740 | 0.17403 | 0.16458 |
Tabular Models Ablation Study
As a baseline model, LightGBM is utilized, incorporating 85 initial features. As shown in Table 11, the model is further enhanced by adding 42 features based on lesions' spatial, color, and physical properties. Performance improves significantly once 193 features that aggregate and compare lesion characteristics within the same patient or body region are included.
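These aggregated features follow the "ugly duckling" intuition [33]: a lesion is suspicious when it deviates from the patient's other lesions. A minimal pandas sketch of such context features (column names are hypothetical; the full pipeline produces 193 such features, also grouped by body region):

```python
import pandas as pd

def add_patient_context(df, cols, group="patient_id"):
    """For each raw lesion feature, add the patient-level mean/std and the
    lesion's z-score deviation from that mean ("ugly duckling" signal)."""
    out = df.copy()
    for c in cols:
        grp = out.groupby(group)[c]
        out[f"{c}_mean"] = grp.transform("mean")
        out[f"{c}_std"] = grp.transform("std").fillna(0.0)
        out[f"{c}_z"] = (out[c] - out[f"{c}_mean"]) / (out[f"{c}_std"] + 1e-6)
    return out

# Toy example: the 6 mm lesion stands out among patient p1's lesions
df = pd.DataFrame({
    "patient_id": ["p1", "p1", "p1", "p2", "p2"],
    "diameter_mm": [2.0, 2.2, 6.0, 3.0, 3.0],
})
feats = add_patient_context(df, ["diameter_mm"])
```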
Table 11. Ablation Study of Tabular Model

| Setup | Mean Partial ROC AUC | Public | Private |
|---|---|---|---|
| LightGBM with basic features | 0.1586 | 0.1693 | 0.1494 |
| LightGBM with additional features | 0.1604 | 0.1707 | 0.1518 |
| LightGBM with aggregated features | 0.1728 | 0.1837 | 0.1643 |
DISCUSSION
Our proposed method fuses image and metadata information in several ways, resulting in a three-stage system composed of different models.
Three-stage systems effectively leverage the strengths of multi-modal data integration, allowing them to correct errors from earlier stages. All three-stage systems outperform their two-stage counterparts, even when the first version of the two-stage system (Two-stage v1) is included as part of the three-stage system. This improvement likely arises from the ability of the other models in the system to compensate for and correct errors introduced by Two-stage v1.
An important observation is that the first version of the first-level model is not trained on the Main data. As a result, its predictions for the Main data are affected by domain shift, potentially introducing errors and even confusing the second-stage model. We hypothesize that increasing the number of positive samples available to the second-stage model could mitigate this issue, enabling it to better correct and utilize the first-level model's predictions.
Table 12 provides the coefficients for the individual models and systems used to construct the three-stage system. The automated coefficient search using Optuna assigns higher coefficients to models demonstrating better performance on the validation dataset. There are exceptions, however; for instance, the Multi-Modal Vision Transformer (Multi-Modal EdgeNext) receives a notable coefficient despite not being the top performer. This is likely due to its contribution to system diversity, as it relies on an attention mechanism for decision-making.
Table 12. Coefficients for 3-Stage System

| Model | 3-Stage v1 | 3-Stage v1 mc | 3-Stage v2 |
|---|---|---|---|
| Multi-Modal ConvNext | 0.30939 | 0.2 | 0.23926 |
| Two-stage v2 | – | – | 0.25295 |
| XGBoost v2 | 0.18841 | 0.15 | 0.12442 |
| LightGBM v1 | 0.01507 | – | 0.01548 |
| EarlyStop LightGBM v2 | 0.01673 | – | 0.01518 |
| LightGBM v4 | 0.02799 | – | 0.00984 |
| XGBoost v1 | 0.03195 | 0.05 | 0.03067 |
| Two-stage v1 | 0.03730 | 0.25 | – |
| EarlyStop LightGBM v4 | 0.10580 | – | 0.06188 |
| LightGBM v3 | 0.09203 | 0.15 | 0.11032 |
| Multi-Modal EdgeNext | 0.17531 | 0.15 | 0.14000 |
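The coefficients in Table 12 are obtained by maximizing the validation metric over blend weights with Optuna [30]. The sketch below keeps the same objective but substitutes plain random search for Optuna's sampler to stay dependency-light (the toy data and weight ranges are illustrative):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def blend(weights, preds):
    """Weighted average of model predictions (weights normalized to sum to 1)."""
    w = np.asarray(weights, dtype=float)
    return (w / w.sum()) @ preds

def random_search_weights(preds, y, n_trials=200):
    """Random-search stand-in for the Optuna study: sample weight vectors,
    keep the one with the best validation ROC AUC."""
    best_w, best_score = None, -np.inf
    for _ in range(n_trials):
        w = rng.uniform(0.0, 1.0, size=len(preds))
        score = roc_auc_score(y, blend(w, preds))
        if score > best_score:
            best_w, best_score = w / w.sum(), score
    return best_w, best_score

# Toy setup: model 0 is informative, model 1 is pure noise
y = rng.integers(0, 2, size=400)
good = y + rng.normal(0, 0.5, size=400)
noise = rng.normal(size=400)
w, score = random_search_weights(np.stack([good, noise]), y)
```

As expected, the search concentrates the weight on the informative model, mirroring how Optuna favors the stronger validation performers in Table 12.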
Another noteworthy observation is the comparable performance of Three-stage v1 and Three-stage v1 mc. This occurs because the Two-stage v1 model is re-weighted in the second three-stage system. The re-weighting validates the hypothesis that the other models within the system effectively correct the errors introduced by Two-stage v1, resulting in comparable or even slightly better performance for Three-stage v1 mc. This highlights the value of leveraging diverse models in multi-stage systems to enhance overall robustness and accuracy.
Regarding the tabular models and their ablation study, the proposed two-stage feature engineering process shows a clear performance improvement. Another notable observation is that XGBoost outperforms LightGBM, supporting the hypothesis that while LightGBM is more suitable for rapid prototyping, other boosting approaches, such as XGBoost, should be utilized to achieve the best performance.
Regarding the comparison of Tabular and Multi-Modal Vision approaches, interesting trends emerge:
− Multi-Modal ConvNext outperforms Tabular approaches on the validation dataset.
− Tabular approaches outperform Multi-Modal ConvNext on the Public dataset.
− Both approaches exhibit comparable performance on the Private dataset, with a slight preference for Tabular approaches.
These differences may stem from two key factors:
− The small number of malignant tiles in the evaluation datasets introduces noise into the evaluation procedure.
− Data quality varies between the Tabular and Image data, which originate from different clinics and institutions. For instance, validation folds may contain higher-quality images than the Public dataset, and the automated feature extraction algorithm likely performs differently depending on the quality of the input images.
Fig. 12. Plot of scores on the validation set and Public set across 39 different Image-only and Multi-Modal experiments
The multi-modal design demonstrates robustness, but the scarcity of datasets combining images and metadata remains a significant limitation. Additionally, variations in data quality across datasets introduce evaluation noise, as reflected in discrepancies between validation and public metrics. In Fig. 12, we explore the correlation between scores obtained from the two evaluation datasets, using the OOF and Mean scores for the validation dataset. For both computation approaches (Mean and OOF), the difference between the Public and validation scores is reasonable; however, the Mean score correlates better.
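The two aggregation schemes compared in Fig. 12 differ only in when the metric is computed: the OOF score pools all out-of-fold predictions and computes a single metric, while the Mean score averages the per-fold metrics. A small sketch of both on synthetic data (full ROC AUC is used here for brevity):

```python
import numpy as np
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import StratifiedKFold

rng = np.random.default_rng(0)
y = rng.integers(0, 2, size=500)
scores = y + rng.normal(0, 1.0, size=500)  # stand-in for model predictions

fold_scores, oof = [], np.zeros_like(scores)
splitter = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
for _, val_idx in splitter.split(scores.reshape(-1, 1), y):
    oof[val_idx] = scores[val_idx]  # in real use: the fold model's predictions
    fold_scores.append(roc_auc_score(y[val_idx], scores[val_idx]))

mean_score = float(np.mean(fold_scores))  # "Mean": average of per-fold metrics
oof_score = roc_auc_score(y, oof)         # "OOF": one metric on pooled predictions
```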
For future work, we identify the following key directions:
− Conduct additional benchmarks on new skin cancer datasets that include metadata to evaluate system generalizability.
− Develop hybrid models capable of aggregating information across multiple nearby lesions to improve classification accuracy.
− Address the domain shift problem between dermoscopic and photo images to bridge the gap between clinical and real-world applications.
CONCLUSIONS
This paper proposes a three-stage system that leverages multi-modal data, including images and metadata, to classify skin cancer. Unlike previous works, our approach incorporates metadata directly related to the characteristics of individual lesions. We achieve enhanced system performance by employing multiple datasets (both with and without metadata), implementing a multi-step feature engineering pipeline, and using advanced techniques for optimizing performance on highly imbalanced datasets. The experiments are conducted on a novel skin cancer classification dataset composed of photo images, demonstrating the potential applicability of our approach in real-world scenarios, benefiting many patients.
ACKNOWLEDGEMENTS
First and foremost, we express our deepest gratitude to the Armed Forces of Ukraine, the
Security Service of Ukraine, the Defence Intelligence of Ukraine, and the State Emer-
gency Service of Ukraine for ensuring the safety and security that made it possible to
complete this work. We also sincerely thank the Kaggle team, Canfield Scientific, The
Shore Family Foundation, and all contributing institutions for providing the essential data
and materials that enabled us to build models, test hypotheses, and complete this research.
The authors acknowledge the use of OpenAI’s ChatGPT for text refinement during the
preparation of this manuscript. This tool enhanced the text’s clarity and flow while en-
suring the technical content’s accuracy remained intact.
REFERENCES
1. P. Gruber, P.M. Zito, Skin cancer. Treasure Island (FL): StatPearls Publishing, 2024.
2. Andrew J. Wagner, Nancy Berliner, Edward J. Benz Jr., “Anatomy and physiology of
the gene,” in Hematology, pp. 3–16. Elsevier, 2018.
3. M. Mateen, S. Hayat, F. Arshad, Y.-H. Gu, M.A. Al-antari, “Hybrid Deep Learning Framework for Melanoma Diagnosis Using Dermoscopic Medical Images,” Diagnostics, 14(19), 2242, 2024. doi: https://doi.org/10.3390/diagnostics14192242
4. V. Rotemberg et al., “A patient-centric dataset of images and metadata for identifying melanomas using clinical context,” Sci. Data, 8(1):34, 2021. doi: 10.1038/s41597-021-00815-z
5. A.A. Adegun, S. Viriri, “Deep learning techniques for skin lesion analysis and melanoma cancer detection: a survey of state-of-the-art,” Artificial Intelligence Review, vol. 54, pp. 811–841, 2020. doi: 10.1007/s10462-020-09865-y
6. M. Naqvi, S.Q. Gilani, T. Syed, O. Marques, H.C. Kim, “Skin Cancer Detection Using Deep Learning—A Review,” Diagnostics, 13(11), 1911, 2023. doi: https://doi.org/10.3390/diagnostics13111911
7. W. Gouda, N.U. Sama, G. Al-Waakid, M. Humayun, N.Z. Jhanjhi, “Detection of Skin Cancer Based on Skin Lesion Images Using Deep Learning,” Healthcare, 10(7), 1183, 2022. doi: https://doi.org/10.3390/healthcare10071183
8. J.R.H. Lee, M. Pavlova, M. Famouri, A. Wong, “Cancer-Net SCa: tailored deep neural network designs for detection of skin cancer from dermoscopy images,” BMC Medical Imaging, vol. 22, article no. 143, 2022. doi: https://doi.org/10.1186/s12880-022-00871-w
9. B. Cassidy, C. Kendrick, A. Brodzicki, J. Jaworek-Korjakowska, M.H. Yap, “Analysis of the ISIC image datasets: Usage, benchmarks and recommendations,” Medical Image Analysis, vol. 75, 102305, 2022. doi: https://doi.org/10.1016/j.media.2021.102305
10. D. Wen, A. Soltan, E. Trucco, R.N. Matin, “From data to diagnosis: skin cancer image
datasets for artificial intelligence,” Clinical and Experimental Dermatology, vol. 49,
issue 7, pp. 675–685, 2024. doi: https://doi.org/10.1093/ced/llae112
11. K.M. Hosny, M.A. Kassem, M.M. Foaud, “Classification of skin lesions using transfer learning and augmentation with Alex-net,” PLOS One, 14(5), e0217293, 2019. doi: https://doi.org/10.1371/journal.pone.0217293
12. K. He, X. Zhang, S. Ren, J. Sun, Deep Residual Learning for Image Recognition,
2015. doi: https://doi.org/10.48550/arXiv.1512.03385
13. G. Huang, Z. Liu, L. van der Maaten, K.Q. Weinberger, Densely Connected Convolutional Networks, 2018. doi: https://doi.org/10.48550/arXiv.1608.06993
14. N. Saranya, Alfred C. Jowin, R.R. Rishikesh, Idayan I. Gilbert, “Analysis of GAN for Melanoma Skin Cancer Classification with Dermatologist Recommendation,” in Proceedings of the 2024 International Conference on Recent Advances in Electrical, Electronics, Ubiquitous Communication, and Computational Intelligence (RAEEUCCI), Chennai, India, 2024, pp. 1–8. doi: https://doi.org/10.1109/RAEEUCCI61380.2024.10547727
15. J. Kawahara, S. Daneshvar, G. Argenziano, G. Hamarneh, “Seven-point checklist and
skin lesion classification using multitask multimodal neural nets,” IEEE Journal of
Biomedical and Health Informatics, vol. 23, no. 2, pp. 538–546, 2019. doi:
https://doi.org/10.1109/JBHI.2018.2824327
16. boliu61, SIIM-ISIC Melanoma Classification - Discussion on Kaggle, 2020. Accessed on: October 27, 2024. Available: https://www.kaggle.com/c/siim-isic-melanoma-classification
17. M.R. Thanka et al., “A hybrid approach for melanoma classification using ensemble machine learning techniques with deep transfer learning,” Computer Methods and Programs in Biomedicine Update, 3(11):100103, 2023. doi: 10.1016/j.cmpbup.2023.100103
18. A. Ju, J. Tang, S. Chen, Y. Fu, Y. Luo, “Pyroptosis-related gene signatures can
robustly diagnose skin cutaneous melanoma and predict the prognosis,” Frontiers
in Oncology, vol. 11, 709077, 2021. doi:
https://doi.org/10.3389/fonc.2021.709077
19. F.P. Loss et al., Skin cancer diagnosis using NIR spectroscopy data of skin lesions in vivo using machine learning algorithms, 2024. doi: https://doi.org/10.48550/arXiv.2401.01200
20. International Skin Imaging Collaboration. SLICE-3D 2024 Challenge Dataset, 2024.
Creative Commons Attribution-Non Commercial 4.0 International License. doi:
https://doi.org/10.34970/2024-slice-3d
21. N. Kurtansky et al., “The SLICE-3D dataset: 400,000 skin lesion image crops extracted from 3D TBP for skin cancer detection,” Scientific Data, 11, article no. 884, 2024. doi: https://doi.org/10.1038/s41597-024-03743-w
22. MAli-Farooq.Derm-T2IM-Dataset, 2024. Accessed on: October 20, 2024. Available:
https://huggingface.co/datasets/MAli-Farooq/Derm-T2IM-Dataset
23. Canfield Scientific, I. VECTRA® WB360 Imaging System, 2024. Accessed on:
October 21, 2024.
24. B. D’Alessandro, “Methods and Apparatus for Identifying Skin Features of Interest,”
US11164670B2, Nov. 2021. Filed: March 18, 2016; Issued: November 2, 2021.
25. B. Betz-Stablein et al., “Reproducible Naevus Counts Using 3D Total Body Photography and Convolutional Neural Networks,” Dermatology, 238(1), pp. 4–11, 2022. doi: https://doi.org/10.1159/000517218
26. R. Rombach, A. Blattmann, D. Lorenz, P. Esser, B. Ommer, High-Resolution Image Synthesis with Latent Diffusion Models, 2021. doi: https://doi.org/10.48550/arXiv.2112.10752
27. “Partial Area Under the ROC Curve,” Wikipedia, 2023. Accessed on: October 26, 2024. Available: https://en.wikipedia.org/wiki/Partial_Area_Under_the_ROC_Curve
28. ISIC Research. Challenge 2024 Metrics, 2024. Accessed on: October 26, 2024.
Available: https://github.com/ISIC-Research/Challenge-2024-Metrics/tree/main
29. Kaggle. ISIC 2024 Challenge - Secondary Prize Metrics, 2024. Accessed on: October 26, 2024. Available: https://www.kaggle.com/competitions/isic-2024-challenge/overview/secondary-prize-metrics
30. T. Akiba, S. Sano, T. Yanase, T. Ohta, M. Koyama, “Optuna: A Next-generation Hyperparameter Optimization Framework,” in KDD '19: Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining, pp. 2623–2631, 2019. doi: https://doi.org/10.1145/3292500.3330701
31. N. Kurtansky, V. Rotemberg, M. Gillis, K. Kose, W. Reade, A. Chow, “ISIC 2024 -
Skin Cancer Detection with 3D-TBP,” Kaggle, 2024. Available: https://kaggle.com/
competitions/isic-2024-challenge
32. OpenAI. ChatGPT. Accessed on: November 22, 2024. Available: https://
chatgpt.com/
33. A. Scope et al., “The “ugly duckling” sign: agreement between observers,” Archives
of dermatology, 144(1), pp. 58–64, 2008. doi: 10.1001/archdermatol.2007.15
34. “StandardScaler,” scikit-learn, 2024. Accessed on: November 22, 2024. Available: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.StandardScaler.html
35. “QuantileTransformer,” scikit-learn, 2024. Accessed on: November 22, 2024. Available: https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.QuantileTransformer.html
36. S. Du, B. Hers, N. Bayasi, G. Hamarneh, R. Garbi, “FairDisCo: Fairer AI in dermatology via disentanglement contrastive learning,” in Proceedings of the Computer Vision–ECCV 2022 Workshops: Tel Aviv, Israel, October 23–27, 2022, Proceedings, Part IV. Springer, 2023, pp. 185–202.
37. S. Woo et al., “ConvNeXt V2: Co-designing and scaling ConvNets with masked autoencoders,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 16133–16142. doi: 10.1109/CVPR52729.2023.01548
38. M. Maaz et al., “EdgeNeXt: Efficiently Amalgamated CNN-Transformer Architecture for Mobile Vision Applications,” in Proceedings of the European Conference on Computer Vision. Springer, 2022, pp. 3–20.
39. M. Tan, Q. Le, “Efficientnetv2: Smaller models and faster training,” in Proceedings
of the International Conference on Machine Learning, PMLR, 2021, pp. 10096–
10106.
40. R. Wightman et al., PyTorch Image Models, 2019. doi: https://doi.org/10.5281/zenodo.4414861
41. J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proceedings of the 2009 IEEE Conference on Computer Vision and Pattern Recognition, 2009, pp. 248–255. doi: https://doi.org/10.1109/CVPR.2009.5206848
42. Itseez. Open Source Computer Vision Library, 2015. Available: https://
github.com/itseez/opencv
43. D.P. Kingma, J. Ba, Adam: A Method for Stochastic Optimization, 2017. doi:
https://doi.org/10.48550/arXiv.1412.6980
44. L. Liu et al., On the Variance of the Adaptive Learning Rate and Beyond, 2021. doi:
https://doi.org/10.48550/arXiv.1908.03265
45. G. Lemaître, F. Nogueira, C.K. Aridas, “Imbalanced-learn: A Python Toolbox to
Tackle the Curse of Imbalanced Datasets in Machine Learning,” Journal of Machine
Learning Research, 18, pp. 1–5, 2017.
46. G. Ke et al., “LightGBM: A highly efficient gradient boosting decision tree,” Advances in Neural Information Processing Systems, 30, 2017.
47. T. Chen, C. Guestrin, “XGBoost: A Scalable Tree Boosting System,” in KDD '16: Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp. 785–794. doi: https://doi.org/10.1145/2939672.2939785
48. A. Zawacki et al., “SIIM-ISIC Melanoma Classification,” Kaggle, 2020. Available:
https://kaggle.com/competitions/siim-isic-melanoma-classification
49. T. Mendonça, P.M. Ferreira, J.S. Marques, A.R.S. Marcal, J. Rozeira, “PH2 - A dermoscopic image database for research and benchmarking,” in 2013 35th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), Osaka, Japan, 2013, pp. 5437–5440. doi: https://doi.org/10.1109/EMBC.2013.6610779
50. M.F.J. Acosta, L.Y.C. Tovar, M.B. Garcia-Zapirain, W.S. Percybrooks, “Melanoma diagnosis using deep learning techniques on dermatoscopic images,” BMC Medical Imaging, vol. 21, article no. 6, 2021. doi: https://doi.org/10.1186/s12880-020-00534-8
51. M. Tan, Q.V. Le, EfficientNet: Rethinking Model Scaling for Convolutional Neural
Networks, 2020. doi: https://doi.org/10.48550/arXiv.1905.11946
52. T.Y. Lin, P. Goyal, R. Girshick, K. He, P. Dollár, Focal Loss for Dense Object Detection, 2018. doi: https://doi.org/10.48550/arXiv.1708.02002
53. E. Ben-Baruch et al., Asymmetric Loss for Multi-Label Classification, 2021. doi:
https://doi.org/10.48550/arXiv.2009.14119
54. A. Galdran, G. Carneiro, M.A. González Ballester, “Balanced-MixUp for Highly Imbalanced Medical Image Classification,” in Medical Image Computing and Computer Assisted Intervention – MICCAI 2021, vol. 12905, pp. 323–333. Springer International Publishing, 2021. doi: https://doi.org/10.1007/978-3-030-87240-3_31
Received 16.01.2025
INFORMATION ON THE ARTICLE
Volodymyr S. Sydorskyi, ORCID: 0000-0001-9697-7403, National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, Ukraine, e-mail: volodymyr.sydorskyi@gmail.com
Igor E. Krashenyi, ORCID: 0000-0003-0424-147X, Ukrainian Catholic University, Ukraine, e-mail: igor.krashenyi@ucu.edu.ua
Oleksii P. Yakubenko, ORCID: 0009-0009-5752-4546, “Pleso Therapy”, Ukraine, e-mail: yakubenko.oleksii@gmail.com
MULTIMODAL SYSTEM FOR SKIN MELANOMA DETECTION / V.S. Sydorskyi, I.E. Krashenyi, O.P. Yakubenko

Abstract. Melanoma detection is extremely important for early diagnosis and effective treatment. Although deep neural networks based on dermoscopic images have shown promising results, they require specialized equipment, which limits their use in broader clinical settings. We present a multimodal melanoma detection system that uses conventional photo images, making it more accessible and versatile. The system integrates images with tabular metadata, such as patient demographics and lesion characteristics, to improve detection accuracy. It employs a multi-modal neural network that combines image and metadata processing and supports a two-stage model for cases with or without metadata. A three-level system further refines the predictions with gradient boosting algorithms, improving the results. To address the challenges posed by a highly imbalanced dataset, dedicated techniques were implemented to ensure robust training. An ablation study evaluated modern computer vision architectures, boosting algorithms, and loss functions, achieving a peak Partial ROC AUC of 0.18068 (0.2 maximum) and a top-15 retrieval sensitivity of 0.78371. The results demonstrate that integrating photo images with metadata in a multi-level system provides a substantial performance improvement. The system advances melanoma detection by offering a scalable solution that does not depend on specialized equipment, suits diverse healthcare settings, and bridges specialized and general clinical practice.

Keywords: medical image classification, computer vision, gradient boosting, deep neural networks, clinical decision support systems.