OPTIMIZATION OF QSAR MODELS FOR PREDICTION OF BIOLOGICAL ACTIVITY MOLECULES USING MACHINE LEARNING METHODS

Molecular modeling plays a central role in modern computational chemistry, particularly in the early stages of drug discovery, where researchers must rapidly and reliably predict the biological activity of large sets of potential candidates. Quantitative Structure–Activity Relationship (QSAR) models...

Bibliographic Details
Date:	2026
Authors:	Maslov, Danilo; Golub, Oleksandr
Format:	Article
Language:	English
Published:	V.I.Vernadsky Institute of General and Inorganic Chemistry 2026
Online access:	https://ucj.org.ua/index.php/journal/article/view/772
Journal title:	Ukrainian Chemistry Journal

Repositories

Ukrainian Chemistry Journal

_version_	1864398770127503360
author	Maslov, Danilo; Golub, Oleksandr
author_facet	Maslov, Danilo; Golub, Oleksandr
author_sort	Maslov, Danilo
baseUrl_str	https://ucj.org.ua/index.php/journal/oai
collection	OJS
datestamp_date	2026-05-05T12:56:53Z
description	Molecular modeling plays a central role in modern computational chemistry, particularly in the early stages of drug discovery, where researchers must rapidly and reliably predict the biological activity of large sets of potential candidates. Quantitative Structure–Activity Relationship (QSAR) models are widely used for this purpose; however, their true predictive performance is often overestimated due to improper data splitting strategies. A key challenge arises when test sets contain molecular scaffolds absent from the training data, resulting in models that appear accurate under random splits but fail to generalize to unseen chemical space. This study investigates optimization strategies for QSAR modeling while explicitly accounting for molecular diversity. A dataset of 3,782 molecules with 3,291 computed descriptors and pChEMBL anesthetic activity values (5.01–8.52) for the TRPV1 receptor was analyzed. The dataset contained 733 unique scaffolds, and 72 occurred exclusively in the test set under random 80/20 splitting, revealing substantial information leakage. Three splitting strategies were compared: standard K-Fold (R² = 0.54), scaffold-based Group K-Fold (R² = 0.31), and stratified scaffold-aware splitting (R² = 0.646–0.7201), the latter demonstrating the most realistic and stable performance. Multiple machine-learning approaches were evaluated, with Gradient Boosting achieving the best baseline accuracy. Optimization techniques included descriptor-level data augmentation (σ = 0.02), descriptor weighting by duplicating the most important features, and combined methods. The best model (R² = 0.7201, MAE = 0.41) was obtained by integrating augmentation with triple duplication of top-ranking descriptors. Several commonly used approaches (Morgan fingerprints, deep neural networks, PCA) yielded significantly weaker performance, highlighting the superior informativeness of physicochemical descriptors for this dataset.
The resulting model demonstrates practical utility for early-stage virtual screening and prioritization of candidate molecules, providing a reliable tool for guiding medicinal chemistry decisions.
doi_str_mv	10.33609/2708-129X.92.3.2026.27-32
first_indexed	2026-05-02T01:00:17Z
format	Article
id	oai:ojs2.1444248.nisspano.web.hosting-test.net:article-772
institution	Ukrainian Chemistry Journal
keywords_txt_mv	keywords
language	English
last_indexed	2026-05-06T01:00:22Z
publishDate	2026
publisher	V.I.Vernadsky Institute of General and Inorganic Chemistry
record_format	ojs
spelling	oai:ojs2.1444248.nisspano.web.hosting-test.net:article-772 2026-05-05T12:56:53Z OPTIMIZATION OF QSAR MODELS FOR PREDICTION OF BIOLOGICAL ACTIVITY MOLECULES USING MACHINE LEARNING METHODS Maslov, Danilo; Golub, Oleksandr QSAR modeling; machine learning; TRPV1; molecular descriptors. Molecular modeling plays a central role in modern computational chemistry, particularly in the early stages of drug discovery, where researchers must rapidly and reliably predict the biological activity of large sets of potential candidates. Quantitative Structure–Activity Relationship (QSAR) models are widely used for this purpose; however, their true predictive performance is often overestimated due to improper data splitting strategies. A key challenge arises when test sets contain molecular scaffolds absent from the training data, resulting in models that appear accurate under random splits but fail to generalize to unseen chemical space. This study investigates optimization strategies for QSAR modeling while explicitly accounting for molecular diversity. A dataset of 3,782 molecules with 3,291 computed descriptors and pChEMBL anesthetic activity values (5.01–8.52) for the TRPV1 receptor was analyzed. The dataset contained 733 unique scaffolds, and 72 occurred exclusively in the test set under random 80/20 splitting, revealing substantial information leakage. Three splitting strategies were compared: standard K-Fold (R² = 0.54), scaffold-based Group K-Fold (R² = 0.31), and stratified scaffold-aware splitting (R² = 0.646–0.7201), the latter demonstrating the most realistic and stable performance. Multiple machine-learning approaches were evaluated, with Gradient Boosting achieving the best baseline accuracy. Optimization techniques included descriptor-level data augmentation (σ = 0.02), descriptor weighting by duplicating the most important features, and combined methods.
The best model (R² = 0.7201, MAE = 0.41) was obtained by integrating augmentation with triple duplication of top-ranking descriptors. Several commonly used approaches (Morgan fingerprints, deep neural networks, PCA) yielded significantly weaker performance, highlighting the superior informativeness of physicochemical descriptors for this dataset. The resulting model demonstrates practical utility for early-stage virtual screening and prioritization of candidate molecules, providing a reliable tool for guiding medicinal chemistry decisions. V.I.Vernadsky Institute of General and Inorganic Chemistry 2026-04-30 Article Article Physical chemistry Физическая химия Фізична хімія application/pdf https://ucj.org.ua/index.php/journal/article/view/772 10.33609/2708-129X.92.3.2026.27-32 Ukrainian Chemistry Journal; Vol. 92 No. 3 (2026): Ukrainian Chemistry Journal; 27-32 Украинский химический журнал; Том 92 № 3 (2026): Ukrainian Chemistry Journal; 27-32 Український хімічний журнал; Том 92 № 3 (2026): Ukrainian Chemistry Journal; 27-32 2708-129X 2708-1281 en https://ucj.org.ua/index.php/journal/article/view/772/405
spellingShingle	Maslov, Danilo Golub, Oleksandr OPTIMIZATION OF QSAR MODELS FOR PREDICTION OF BIOLOGICAL ACTIVITY MOLECULES USING MACHINE LEARNING METHODS
title	OPTIMIZATION OF QSAR MODELS FOR PREDICTION OF BIOLOGICAL ACTIVITY MOLECULES USING MACHINE LEARNING METHODS
title_full	OPTIMIZATION OF QSAR MODELS FOR PREDICTION OF BIOLOGICAL ACTIVITY MOLECULES USING MACHINE LEARNING METHODS
title_fullStr	OPTIMIZATION OF QSAR MODELS FOR PREDICTION OF BIOLOGICAL ACTIVITY MOLECULES USING MACHINE LEARNING METHODS
title_full_unstemmed	OPTIMIZATION OF QSAR MODELS FOR PREDICTION OF BIOLOGICAL ACTIVITY MOLECULES USING MACHINE LEARNING METHODS
title_short	OPTIMIZATION OF QSAR MODELS FOR PREDICTION OF BIOLOGICAL ACTIVITY MOLECULES USING MACHINE LEARNING METHODS
title_sort	optimization of qsar models for prediction of biological activity molecules using machine learning methods
topic_facet	QSAR modeling; machine learning; TRPV1; molecular descriptors
url	https://ucj.org.ua/index.php/journal/article/view/772
work_keys_str_mv	AT maslovdanilo optimizationofqsarmodelsforpredictionofbiologicalactivitymoleculesusingmachinelearningmethods AT goluboleksandr optimizationofqsarmodelsforpredictionofbiologicalactivitymoleculesusingmachinelearningmethods
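
The abstract in this record names three techniques: scaffold-based Group K-Fold splitting, descriptor-level augmentation with σ = 0.02, and weighting by duplicating top-ranking descriptors. The sketch below illustrates the general idea of each in plain Python. It is not the authors' code, and it rests on several assumptions that the record does not confirm: the helper names (`group_k_fold`, `augment`, `duplicate_columns`) are hypothetical, scaffolds are taken as precomputed string labels, the noise is read as additive Gaussian noise on already standardized descriptors, and "triple duplication" is read as three copies of a column in total. scikit-learn's `GroupKFold` offers the same no-shared-group guarantee for real work.

```python
import random
from collections import defaultdict


def group_k_fold(groups, n_splits):
    """Yield (train_idx, test_idx) pairs in which no group label
    (here: a scaffold) appears on both sides of a split."""
    members = defaultdict(list)
    for idx, label in enumerate(groups):
        members[label].append(idx)
    # Greedy balancing: place the largest groups first, each into the
    # fold that currently holds the fewest samples.
    folds = [[] for _ in range(n_splits)]
    for label in sorted(members, key=lambda g: (-len(members[g]), str(g))):
        min(folds, key=len).extend(members[label])
    for k in range(n_splits):
        test = sorted(folds[k])
        train = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train, test


def augment(rows, targets, sigma=0.02, copies=1, seed=0):
    """Return the original rows plus `copies` noisy replicas of each row.
    Assumption: additive Gaussian noise; targets are repeated unchanged."""
    rng = random.Random(seed)
    out_rows = [list(r) for r in rows]
    out_targets = list(targets)
    for _ in range(copies):
        for row, target in zip(rows, targets):
            out_rows.append([x + rng.gauss(0.0, sigma) for x in row])
            out_targets.append(target)
    return out_rows, out_targets


def duplicate_columns(rows, top_columns, times=3):
    """Append extra copies of the chosen columns so that each of them
    occurs `times` times in total (assumed reading of 'triple duplication')."""
    extra = times - 1
    return [list(r) + [r[c] for c in top_columns] * extra for r in rows]


# Toy data: 8 molecules, 2 descriptors, 4 hypothetical scaffold labels.
scaffolds = ["A", "A", "B", "B", "B", "C", "D", "D"]
X = [[float(i), 0.5 * i] for i in range(8)]
y = [5.0 + 0.1 * i for i in range(8)]

for train_idx, test_idx in group_k_fold(scaffolds, n_splits=2):
    X_train = [X[i] for i in train_idx]
    y_train = [y[i] for i in train_idx]
    # Augment and re-weight the training fold only; the test fold is untouched.
    X_aug, y_aug = augment(X_train, y_train, sigma=0.02)
    X_weighted = duplicate_columns(X_aug, top_columns=[0], times=3)
    print(len(train_idx), len(test_idx), len(X_weighted), len(X_weighted[0]))
```

Applying augmentation after the split, and only to the training fold, is a design choice made in this sketch so that noisy replicas of a test molecule can never reach the training data; the record does not say where in the pipeline the authors applied it.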
