OPTIMIZATION OF QSAR MODELS FOR PREDICTION OF BIOLOGICAL ACTIVITY MOLECULES USING MACHINE LEARNING METHODS

Bibliographic details
Date: 2026
Authors: Maslov, Danilo; Golub, Oleksandr
Format: Article
Language: English
Published: V.I. Vernadsky Institute of General and Inorganic Chemistry, 2026
Online access: https://ucj.org.ua/index.php/journal/article/view/772
Journal title: Ukrainian Chemistry Journal

Repositories

Ukrainian Chemistry Journal
Description
Abstract: Molecular modeling plays a central role in modern computational chemistry, particularly in the early stages of drug discovery, where researchers must rapidly and reliably predict the biological activity of large sets of potential candidates. Quantitative Structure–Activity Relationship (QSAR) models are widely used for this purpose; however, their true predictive performance is often overestimated due to improper data splitting strategies. A key challenge arises when test sets contain molecular scaffolds absent from the training data, resulting in models that appear accurate under random splits but fail to generalize to unseen chemical space. This study investigates optimization strategies for QSAR modeling while explicitly accounting for molecular diversity. A dataset of 3,782 molecules with 3,291 computed descriptors and pChEMBL anesthetic activity values (5.01–8.52) for the TRPV1 receptor was analyzed. The dataset contained 733 unique scaffolds, of which 72 occurred exclusively in the test set under random 80/20 splitting, revealing substantial information leakage. Three splitting strategies were compared: standard K-Fold (R² = 0.54), scaffold-based Group K-Fold (R² = 0.31), and stratified scaffold-aware splitting (R² = 0.646–0.7201), the last of which demonstrated the most realistic and stable performance. Multiple machine-learning approaches were evaluated, with Gradient Boosting achieving the best baseline accuracy. Optimization techniques included descriptor-level data augmentation (σ = 0.02), descriptor weighting by duplicating the most important features, and combined methods. The best model (R² = 0.7201, MAE = 0.41) was obtained by integrating augmentation with triple duplication of top-ranking descriptors. Several commonly used approaches (Morgan fingerprints, deep neural networks, PCA) yielded significantly weaker performance, highlighting the superior informativeness of physicochemical descriptors for this dataset.
The resulting model demonstrates practical utility for early-stage virtual screening and prioritization of candidate molecules, providing a reliable tool for guiding medicinal chemistry decisions.
DOI: 10.33609/2708-129X.92.3.2026.27-32
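
Illustrative sketch: the abstract names three techniques without giving implementation details: scaffold-grouped cross-validation, descriptor-level augmentation with σ = 0.02, and weighting by duplicating top-ranking descriptors. The standard-library Python below is one possible reading of those ideas, not the authors' code. All function names are hypothetical. It assumes that scaffold identifiers have already been computed for each molecule (for example as Bemis-Murcko scaffold SMILES), that descriptors are standardized so that additive Gaussian noise with standard deviation 0.02 is meaningful, and that "triple duplication" means each selected column appears three times in total.

```python
import random
from collections import defaultdict


def scaffold_group_folds(scaffolds, n_splits=5):
    """Assign molecule indices to folds so that no scaffold spans two folds.

    Greedy balancing (an assumption, not the paper's procedure): the largest
    scaffold groups are placed first, each into the currently smallest fold.
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)

    folds = [[] for _ in range(n_splits)]
    for members in sorted(groups.values(), key=len, reverse=True):
        min(folds, key=len).extend(members)
    return folds


def augment_rows(X, y, sigma=0.02, copies=1, seed=0):
    """Append noisy copies of every descriptor row (additive Gaussian noise).

    Assumes standardized descriptors; the original rows are kept unchanged.
    """
    rng = random.Random(seed)
    X_aug, y_aug = [list(row) for row in X], list(y)
    for _ in range(copies):
        for row, target in zip(X, y):
            X_aug.append([v + rng.gauss(0.0, sigma) for v in row])
            y_aug.append(target)
    return X_aug, y_aug


def duplicate_columns(X, top_columns, times=3):
    """Return rows in which each column in top_columns occurs `times` times."""
    return [list(row) + [row[j] for j in top_columns] * (times - 1) for row in X]


# Toy usage with made-up scaffold labels and descriptor values.
scaffold_ids = ["A", "A", "B", "C", "C", "C", "D", "E"]
folds = scaffold_group_folds(scaffold_ids, n_splits=3)

X_train = [[0.1, 0.2, 0.3], [1.0, 2.0, 3.0]]
y_train = [5.5, 7.0]
X_noisy, y_noisy = augment_rows(X_train, y_train, sigma=0.02, copies=2)
X_weighted = duplicate_columns(X_noisy, top_columns=[2], times=3)
```

In a sketch like this, augmentation and column duplication would be applied to the training folds only. Adding noisy copies of held-out molecules to the training data would reintroduce the kind of leakage the abstract warns about.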