OPTIMIZATION OF QSAR MODELS FOR PREDICTION OF BIOLOGICAL ACTIVITY MOLECULES USING MACHINE LEARNING METHODS

Molecular modeling plays a central role in modern computational chemistry, particularly in the early stages of drug discovery, where researchers must rapidly and reliably predict the biological activity of large sets of potential candidates. Quantitative Structure–Activity Relationship (QSAR) models...

Bibliographic details
Date: 2026
Authors: Maslov, Danilo; Golub, Oleksandr
Format: Article
Language: English
Published: V.I.Vernadsky Institute of General and Inorganic Chemistry 2026
Online access: https://ucj.org.ua/index.php/journal/article/view/772
Journal title: Ukrainian Chemistry Journal

Repositories

Ukrainian Chemistry Journal
_version_ 1864036377017974784
author Maslov, Danilo
Golub, Oleksandr
author_facet Maslov, Danilo
Golub, Oleksandr
author_sort Maslov, Danilo
baseUrl_str https://ucj.org.ua/index.php/journal/oai
collection OJS
datestamp_date 2026-05-01T10:54:38Z
description Molecular modeling plays a central role in modern computational chemistry, particularly in the early stages of drug discovery, where researchers must rapidly and reliably predict the biological activity of large sets of potential candidates. Quantitative Structure–Activity Relationship (QSAR) models are widely used for this purpose; however, their true predictive performance is often overestimated due to improper data splitting strategies. A key challenge arises when test sets contain molecular scaffolds absent from the training data, resulting in models that appear accurate under random splits but fail to generalize to unseen chemical space. This study investigates optimization strategies for QSAR modeling while explicitly accounting for molecular diversity. A dataset of 3,782 molecules with 3,291 computed descriptors and pChEMBL anesthetic activity values (5.01–8.52) for the TRPV1 receptor was analyzed. The dataset contained 733 unique scaffolds, and 72 occurred exclusively in the test set under random 80/20 splitting, revealing substantial information leakage. Three splitting strategies were compared: standard K-Fold (R² = 0.54), scaffold-based Group K-Fold (R² = 0.31), and stratified scaffold-aware splitting (R² = 0.646–0.7201), the latter demonstrating the most realistic and stable performance. Multiple machine-learning approaches were evaluated, with Gradient Boosting achieving the best baseline accuracy. Optimization techniques included descriptor-level data augmentation (σ = 0.02), descriptor weighting by duplicating the most important features, and combined methods. The best model (R² = 0.7201, MAE = 0.41) was obtained by integrating augmentation with triple duplication of top-ranking descriptors. Several commonly used approaches (Morgan fingerprints, deep neural networks, PCA) yielded significantly weaker performance, highlighting the superior informativeness of physicochemical descriptors for this dataset. The resulting model demonstrates practical utility for early-stage virtual screening and prioritization of candidate molecules, providing a reliable tool for guiding medicinal chemistry decisions.
doi_str_mv 10.33609/2708-129X.92.3.2026.27-32
first_indexed 2026-05-02T01:00:17Z
format Article
id oai:ojs2.1444248.nisspano.web.hosting-test.net:article-772
institution Ukrainian Chemistry Journal
keywords_txt_mv keywords
language English
last_indexed 2026-05-02T01:00:17Z
publishDate 2026
publisher V.I.Vernadsky Institute of General and Inorganic Chemistry
record_format ojs
spelling oai:ojs2.1444248.nisspano.web.hosting-test.net:article-772 2026-05-01T10:54:38Z OPTIMIZATION OF QSAR MODELS FOR PREDICTION OF BIOLOGICAL ACTIVITY MOLECULES USING MACHINE LEARNING METHODS Maslov, Danilo Golub, Oleksandr QSAR modeling; machine learning; TRPV1; molecular descriptors. V.I.Vernadsky Institute of General and Inorganic Chemistry 2026-04-30 Article Physical chemistry application/pdf https://ucj.org.ua/index.php/journal/article/view/772 10.33609/2708-129X.92.3.2026.27-32 Ukrainian Chemistry Journal; Vol. 92 No. 3 (2026): Ukrainian Chemistry Journal; 27-32 2708-129X 2708-1281 en https://ucj.org.ua/index.php/journal/article/view/772/405
spellingShingle Maslov, Danilo
Golub, Oleksandr
OPTIMIZATION OF QSAR MODELS FOR PREDICTION OF BIOLOGICAL ACTIVITY MOLECULES USING MACHINE LEARNING METHODS
title OPTIMIZATION OF QSAR MODELS FOR PREDICTION OF BIOLOGICAL ACTIVITY MOLECULES USING MACHINE LEARNING METHODS
title_full OPTIMIZATION OF QSAR MODELS FOR PREDICTION OF BIOLOGICAL ACTIVITY MOLECULES USING MACHINE LEARNING METHODS
title_fullStr OPTIMIZATION OF QSAR MODELS FOR PREDICTION OF BIOLOGICAL ACTIVITY MOLECULES USING MACHINE LEARNING METHODS
title_full_unstemmed OPTIMIZATION OF QSAR MODELS FOR PREDICTION OF BIOLOGICAL ACTIVITY MOLECULES USING MACHINE LEARNING METHODS
title_short OPTIMIZATION OF QSAR MODELS FOR PREDICTION OF BIOLOGICAL ACTIVITY MOLECULES USING MACHINE LEARNING METHODS
title_sort optimization of qsar models for prediction of biological activity molecules using machine learning methods
topic_facet QSAR modeling
machine learning
TRPV1
molecular descriptors
url https://ucj.org.ua/index.php/journal/article/view/772
work_keys_str_mv AT maslovdanilo optimizationofqsarmodelsforpredictionofbiologicalactivitymoleculesusingmachinelearningmethods
AT goluboleksandr optimizationofqsarmodelsforpredictionofbiologicalactivitymoleculesusingmachinelearningmethods