OPTIMIZATION OF QSAR MODELS FOR PREDICTION OF BIOLOGICAL ACTIVITY MOLECULES USING MACHINE LEARNING METHODS
Molecular modeling plays a central role in modern computational chemistry, particularly in the early stages of drug discovery, where researchers must rapidly and reliably predict the biological activity of large sets of potential candidates. Quantitative Structure–Activity Relationship (QSAR) models...
| Date: | 2026 |
|---|---|
| Authors: | Maslov, Danilo; Golub, Oleksandr |
| Format: | Article |
| Language: | English |
| Published: | V.I.Vernadsky Institute of General and Inorganic Chemistry, 2026 |
| Online access: | https://ucj.org.ua/index.php/journal/article/view/772 |
| Journal title: | Ukrainian Chemistry Journal |
Repositories
Ukrainian Chemistry Journal

| _version_ | 1864036377017974784 |
|---|---|
| author | Maslov, Danilo; Golub, Oleksandr |
| author_facet | Maslov, Danilo; Golub, Oleksandr |
| author_sort | Maslov, Danilo |
| baseUrl_str | https://ucj.org.ua/index.php/journal/oai |
| collection | OJS |
| datestamp_date | 2026-05-01T10:54:38Z |
| description | Molecular modeling plays a central role in modern computational chemistry, particularly in the early stages of drug discovery, where researchers must rapidly and reliably predict the biological activity of large sets of potential candidates. Quantitative Structure–Activity Relationship (QSAR) models are widely used for this purpose; however, their true predictive performance is often overestimated due to improper data splitting strategies. A key challenge arises when test sets contain molecular scaffolds absent from the training data, resulting in models that appear accurate under random splits but fail to generalize to unseen chemical space.
This study investigates optimization strategies for QSAR modeling while explicitly accounting for molecular diversity. A dataset of 3,782 molecules with 3,291 computed descriptors and pChEMBL anesthetic activity values (5.01–8.52) for the TRPV1 receptor was analyzed. The dataset contained 733 unique scaffolds, of which 72 occurred exclusively in the test set under random 80/20 splitting, revealing substantial information leakage. Three splitting strategies were compared: standard K-Fold (R² = 0.54), scaffold-based Group K-Fold (R² = 0.31), and stratified scaffold-aware splitting (R² = 0.646–0.7201), the last demonstrating the most realistic and stable performance.
Multiple machine-learning approaches were evaluated, with Gradient Boosting achieving the best baseline accuracy. Optimization techniques included descriptor-level data augmentation (σ = 0.02), descriptor weighting by duplicating the most important features, and combined methods. The best model (R² = 0.7201, MAE = 0.41) was obtained by integrating augmentation with triple duplication of top-ranking descriptors. Several commonly used approaches (Morgan fingerprints, deep neural networks, PCA) yielded significantly weaker performance, highlighting the superior informativeness of physicochemical descriptors for this dataset.
The resulting model demonstrates practical utility for early-stage virtual screening and prioritization of candidate molecules, providing a reliable tool for guiding medicinal chemistry decisions. |
| doi_str_mv | 10.33609/2708-129X.92.3.2026.27-32 |
| first_indexed | 2026-05-02T01:00:17Z |
| format | Article |
| id | oai:ojs2.1444248.nisspano.web.hosting-test.net:article-772 |
| institution | Ukrainian Chemistry Journal |
| keywords_txt_mv | keywords |
| language | English |
| last_indexed | 2026-05-02T01:00:17Z |
| publishDate | 2026 |
| publisher | V.I.Vernadsky Institute of General and Inorganic Chemistry |
| record_format | ojs |
| spelling | oai:ojs2.1444248.nisspano.web.hosting-test.net:article-772 2026-05-01T10:54:38Z OPTIMIZATION OF QSAR MODELS FOR PREDICTION OF BIOLOGICAL ACTIVITY MOLECULES USING MACHINE LEARNING METHODS Maslov, Danilo Golub, Oleksandr QSAR modeling; machine learning; TRPV1; molecular descriptors. V.I.Vernadsky Institute of General and Inorganic Chemistry 2026-04-30 Article Physical chemistry application/pdf https://ucj.org.ua/index.php/journal/article/view/772 10.33609/2708-129X.92.3.2026.27-32 Ukrainian Chemistry Journal; Vol. 92 No. 3 (2026): Ukrainian Chemistry Journal; 27-32 2708-129X 2708-1281 en https://ucj.org.ua/index.php/journal/article/view/772/405 |
| spellingShingle | Maslov, Danilo Golub, Oleksandr OPTIMIZATION OF QSAR MODELS FOR PREDICTION OF BIOLOGICAL ACTIVITY MOLECULES USING MACHINE LEARNING METHODS |
| title | OPTIMIZATION OF QSAR MODELS FOR PREDICTION OF BIOLOGICAL ACTIVITY MOLECULES USING MACHINE LEARNING METHODS |
| title_full | OPTIMIZATION OF QSAR MODELS FOR PREDICTION OF BIOLOGICAL ACTIVITY MOLECULES USING MACHINE LEARNING METHODS |
| title_fullStr | OPTIMIZATION OF QSAR MODELS FOR PREDICTION OF BIOLOGICAL ACTIVITY MOLECULES USING MACHINE LEARNING METHODS |
| title_full_unstemmed | OPTIMIZATION OF QSAR MODELS FOR PREDICTION OF BIOLOGICAL ACTIVITY MOLECULES USING MACHINE LEARNING METHODS |
| title_short | OPTIMIZATION OF QSAR MODELS FOR PREDICTION OF BIOLOGICAL ACTIVITY MOLECULES USING MACHINE LEARNING METHODS |
| title_sort | optimization of qsar models for prediction of biological activity molecules using machine learning methods |
| topic_facet | QSAR modeling; machine learning; TRPV1; molecular descriptors |
| url | https://ucj.org.ua/index.php/journal/article/view/772 |
| work_keys_str_mv | AT maslovdanilo optimizationofqsarmodelsforpredictionofbiologicalactivitymoleculesusingmachinelearningmethods AT goluboleksandr optimizationofqsarmodelsforpredictionofbiologicalactivitymoleculesusingmachinelearningmethods |
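
The abstract names two techniques that are easy to misread without an example: scaffold-based group splitting, and descriptor-level augmentation combined with duplication of top-ranking descriptors. The sketch below illustrates both in plain Python. It is not the authors' implementation. The function names `scaffold_group_folds` and `augment_and_weight` are hypothetical, scaffold identifiers are assumed to be precomputed strings (for example Murcko scaffold SMILES produced by a cheminformatics toolkit), the noise is assumed to be additive Gaussian noise on standardized descriptors, and "triple duplication" is read here as three total copies of each top descriptor.

```python
import random
from collections import defaultdict


def scaffold_group_folds(scaffolds, n_folds=5):
    """Assign molecule indices to folds so that no scaffold spans two folds.

    `scaffolds` holds one precomputed scaffold identifier per molecule
    (assumption: computed elsewhere, e.g. as Murcko scaffold SMILES).
    """
    groups = defaultdict(list)
    for idx, scaffold in enumerate(scaffolds):
        groups[scaffold].append(idx)
    folds = [[] for _ in range(n_folds)]
    # Greedy balancing: largest scaffold groups first, each placed whole
    # into whichever fold is currently smallest.
    for members in sorted(groups.values(), key=len, reverse=True):
        min(folds, key=len).extend(members)
    return folds


def augment_and_weight(X, y, top_features, sigma=0.02, copies=3, seed=0):
    """Append one noisy copy of every row, then repeat top descriptor columns.

    Assumptions: descriptors are already standardized, noise is additive
    Gaussian with standard deviation `sigma`, and `copies` is the total
    number of times each top descriptor appears in the output.
    """
    rng = random.Random(seed)
    noisy = [[v + rng.gauss(0.0, sigma) for v in row] for row in X]
    X_aug = [list(row) for row in X] + noisy
    y_aug = list(y) + list(y)  # noisy rows keep the original activity value
    extra = copies - 1
    X_out = [row + [row[j] for j in top_features] * extra for row in X_aug]
    return X_out, y_aug


if __name__ == "__main__":
    # Toy data: six molecules sharing three scaffolds.
    scaffolds = ["c1ccccc1", "c1ccccc1", "c1ccncc1",
                 "C1CCCCC1", "c1ccncc1", "c1ccccc1"]
    for k, test_idx in enumerate(scaffold_group_folds(scaffolds, n_folds=2)):
        print("fold", k, "test indices:", sorted(test_idx))

    X = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
    y = [6.0, 7.0]
    X_out, y_out = augment_and_weight(X, y, top_features=[0, 2])
    print(len(X_out), "rows,", len(X_out[0]), "columns")
```

In a real evaluation, augmentation would be applied only to the training portion of each fold, so that noisy copies of a molecule never appear on both sides of a split.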