On Biomedical Computations in Cluster and Cloud Environment

The experience of the use of applied containerized biomedical software tools in cloud environment is summarized. The reproducibility of scientific computing in relation to modern technologies of scientific calculations is discussed. The main approaches to biomedical data preprocessing and integratio...

Bibliographic details
Date:	2021
Authors:	Bardadym, T., Gorbachuk, V., Novoselova, N., Osypenko, S., Skobtsov, V., Tom, I.
Format:	Article
Language:	English
Published:	V.M. Glushkov Institute of Cybernetics of the NAS of Ukraine 2021
Publication title:	Кібернетика та комп’ютерні технології (Cybernetics and Computer Technologies)
Topics:	Information technology: theory and tools
Online access:	http://dspace.nbuv.gov.ua/handle/123456789/181001
Journal title:	Digital Library of Periodicals of National Academy of Sciences of Ukraine
Cite:	On Biomedical Computations in Cluster and Cloud Environment / T. Bardadym, V. Gorbachuk, N. Novoselova, S. Osypenko, V. Skobtsov, I. Tom // Кібернетика та комп’ютерні технології: collection of scientific papers. 2021. No. 2. P. 76-84. Bibliography: 32 titles. In English.

Repositories

Digital Library of Periodicals of National Academy of Sciences of Ukraine

id	irk-123456789-181001
record_format	dspace
spelling	irk-123456789-181001 2021-10-27T01:26:32Z On Biomedical Computations in Cluster and Cloud Environment Bardadym, T. Gorbachuk, V. Novoselova, N. Osypenko, S. Skobtsov, V. Tom, I. Information technology: theory and tools The experience of the use of applied containerized biomedical software tools in cloud environment is summarized. The reproducibility of scientific computing in relation to modern technologies of scientific calculations is discussed. The main approaches to biomedical data preprocessing and integration in the framework of the intelligent analytical system are described. Purpose of the work. To describe modern technologies that ensure the reproducibility of numerical experiments in this field, and tools aimed at integrating several sources of biomedical information in order to improve the diagnosis and prognosis of complex diseases. Special attention is paid to methods for processing data obtained from different sources of biomedical information and included in the intelligent analytical system. Results. The experience of using applied containerized biomedical software tools in a cloud environment is summarized. The reproducibility of scientific computing and the capabilities of modern scientific computing technologies are discussed. The main approaches to preprocessing and integrating biomedical data within the intelligent analytical system are described. The developed hybrid classification model is the basis of the intelligent analytical system and is aimed at integrating several sources of biomedical information. 2021 Article On Biomedical Computations in Cluster and Cloud Environment / T. Bardadym, V. Gorbachuk, N. Novoselova, S. Osypenko, V. Skobtsov, I. Tom // Кібернетика та комп’ютерні технології: collection of scientific papers. 2021. No. 2. P. 76-84. Bibliography: 32 titles. In English. 2707-4501 DOI:10.34229/2707-451X.21.2.8 http://dspace.nbuv.gov.ua/handle/123456789/181001 004.89 en Кібернетика та комп’ютерні технології V.M. Glushkov Institute of Cybernetics of the NAS of Ukraine
institution	Digital Library of Periodicals of National Academy of Sciences of Ukraine
collection	DSpace DC
language	English
topic	Information technology: theory and tools
spellingShingle	Information technology: theory and tools Bardadym, T. Gorbachuk, V. Novoselova, N. Osypenko, S. Skobtsov, V. Tom, I. On Biomedical Computations in Cluster and Cloud Environment Кібернетика та комп’ютерні технології
description	The experience of the use of applied containerized biomedical software tools in cloud environment is summarized. The reproducibility of scientific computing in relation to modern technologies of scientific calculations is discussed. The main approaches to biomedical data preprocessing and integration in the framework of the intelligent analytical system are described.
format	Article
author	Bardadym, T. Gorbachuk, V. Novoselova, N. Osypenko, S. Skobtsov, V. Tom, I.
author_facet	Bardadym, T. Gorbachuk, V. Novoselova, N. Osypenko, S. Skobtsov, V. Tom, I.
author_sort	Bardadym, T.
title	On Biomedical Computations in Cluster and Cloud Environment
title_short	On Biomedical Computations in Cluster and Cloud Environment
title_full	On Biomedical Computations in Cluster and Cloud Environment
title_fullStr	On Biomedical Computations in Cluster and Cloud Environment
title_full_unstemmed	On Biomedical Computations in Cluster and Cloud Environment
title_sort	on biomedical computations in cluster and cloud environment
publisher	V.M. Glushkov Institute of Cybernetics of the NAS of Ukraine
publishDate	2021
topic_facet	Information technology: theory and tools
url	http://dspace.nbuv.gov.ua/handle/123456789/181001
citation_txt	On Biomedical Computations in Cluster and Cloud Environment / T. Bardadym, V. Gorbachuk, N. Novoselova, S. Osypenko, V. Skobtsov, I. Tom // Кібернетика та комп’ютерні технології: collection of scientific papers. 2021. No. 2. P. 76-84. Bibliography: 32 titles. In English.
series	Кібернетика та комп’ютерні технології
work_keys_str_mv	AT bardadymt onbiomedicalcomputationsinclusterandcloudenvironment AT gorbachukv onbiomedicalcomputationsinclusterandcloudenvironment AT novoselovan onbiomedicalcomputationsinclusterandcloudenvironment AT osypenkos onbiomedicalcomputationsinclusterandcloudenvironment AT skobtsovv onbiomedicalcomputationsinclusterandcloudenvironment AT tomi onbiomedicalcomputationsinclusterandcloudenvironment
first_indexed	2025-07-15T21:30:12Z
last_indexed	2025-07-15T21:30:12Z
_version_	1837750041145507840
fulltext	INFORMATION TECHNOLOGY: THEORY AND TOOLS. CYBERNETICS and COMPUTER TECHNOLOGIES. ISSN 2707-4501. Кібернетика та комп'ютерні технології. 2021, № 2. P. 76.
The experience of using applied containerized biomedical software tools in a cloud environment is summarized. The reproducibility of scientific computing in relation to modern technologies of scientific calculations is discussed. The main approaches to biomedical data preprocessing and integration within the intelligent analytical system are described.
Keywords: classifier, cloud service, containerized application.
© T. Bardadym, V. Gorbachuk, N. Novoselova, S. Osypenko, V. Skobtsov, I. Tom, 2021
UDC 004.89. DOI:10.34229/2707-451X.21.2.8
T. BARDADYM, V. GORBACHUK, N. NOVOSELOVA, S. OSYPENKO, V. SKOBTSOV, I. TOM
ON BIOMEDICAL COMPUTATIONS IN CLUSTER AND CLOUD ENVIRONMENT*
I. INTRODUCTION
This publication summarizes the experience of using applied containerized software tools in a cloud environment, which the authors gained during the project “Development of methods, algorithms and intellectual analytical system for processing and analysis of heterogeneous clinical and biomedical data in order to improve the diagnosis of complex diseases”, carried out by a team from the United Institute of Informatics Problems of the NAS of Belarus and the V.M. Glushkov Institute of Cybernetics of the NAS of Ukraine. The goal of the project is to develop effective methods and software for constructing classifiers and selecting informative features, and to create a prototype of an intelligent analytical system that implements all stages of data processing and analysis in software and is intended for research in clinical medicine.
This system will implement the functions of integrating clinical and molecular patient data, determining diagnostic biomarkers and their combinations, building classifiers of complex diseases (oncological diseases) based on integrated data, and identifying new disease subtypes in order to improve treatment methods and increase their efficiency.
The large amount of research devoted to mathematical methods of data handling, particularly classification models, is explained, on the one hand, by the wide range of possible applications and, on the other hand, by the complexity of these problems, which requires the development and improvement of means to solve them (see, for example, [1 – 5]). In addition to the general requirements for the efficiency of the created software, attention must be paid to the availability of large and heterogeneous data sets, to the portability of programs from one hardware platform to another, and to their performance in cloud computing.
* Supported by the National Academy of Sciences of Ukraine (project ВФ.115.41), the Ministry of Education and Science of Ukraine (projects М/99-2019, M/37-2020) and the Belarusian Republican Foundation for Fundamental Research (project № Ф19УКРГ-005).
Moreover, one of the most important requirements is the reproducibility of numerical experiments. Reproducibility of research is one of the basic scientific principles. However, a "reproducibility crisis" has been recognized in science [6, 7]. This crisis has affected almost all branches of science and, to a large extent, biology and medicine. Much effort has been made recently to overcome it, including the development of software and software platforms that ensure the reproducibility of scientific computing.
Computing in biology and medicine involves high-performance computing technologies (including clusters and grid technologies). However, the introduction of modern technologies that ensure the reproducibility of calculations in this area is quite slow [8, p. 731]. As a result, on clusters that do not have the appropriate software installed, there is a contradiction between modern requirements for the reproducibility of scientific calculations and the ability to meet them by old means.
As it happened, creating a containerized application was not a planned stage of our study. It was prompted primarily by the way the real data used to test the software could be accessed. Only later did the authors realize that they had gained other advantages, the most important of which is the reproducibility of numerical experiments. The first purpose of this publication is to share this experience.
The second purpose is to describe briefly our efforts to develop specialized computer methods and models for solving vital tasks in biomedicine. An enormous amount of biomedical and clinical data is now collected in public and private repositories. These data can be freely accessed and offer a wide field for experiments with newly developed scientific approaches and for their comparison. The integration of heterogeneous information sources is one of the urgent applied problems that we have tried to solve in our project. The hybrid classification model is the basis of the intelligent analytical system and aims to integrate several sources of biomedical information in order to improve the diagnosis and prognosis of complex diseases.
II. NEW LINEAR CLASSIFIER AND ITS PROGRAM REALISATION
Based on the approaches presented in [9, 10], optimization models and methods for constructing linear classifiers have been developed.
In particular, the problem of constructing classifiers for linearly inseparable sets was formulated as the problem of minimizing the band of incorrect classification of training sample points. This model belongs to the class of non-convex optimization problems and is multi-extremal. Various formulations of this problem are proposed, and approaches to constructing approximate solutions and to calculating estimates of optimal values are considered. An interesting geometric interpretation of the problems of constructing linear classifiers can be found in [11].
To solve these optimization problems, methods of non-smooth optimization were used, namely the r-algorithms of N.Z. Shor [12 – 14] and exact penalty functions [15, 16]. When creating the corresponding software, modern linear algebra libraries such as [17 – 19] should be used to speed up arithmetic operations. It is this combination of algorithms based on non-smooth optimization methods with modern linear algebra libraries that was implemented in the developed software module NonSmoothSVC.
To test the capabilities of the new classifier NonSmoothSVC, it was compared with existing tools. The methods chosen from the scikit-learn library [8, 20] were LinearSVC, NuSVC and AdaBoost. The last two are non-linear classifiers; they were chosen to obtain additional information about the advantages of different methods on different problems.
The computational experiments aimed to estimate the speed and predictive properties of the new software compared to existing ones. The first numerical experiments were made on specially generated artificial data; both artificially created data and real medical data were used in the test problems. Training and control samples of the randomly generated problems were formed as identically distributed data points in the unit cube of the feature space R^n.
Then the points of the first class were shifted along the first coordinate by δ, and the points of the second class were shifted along the first coordinate by (-1-δ). When δ > 0, the training and control samples are linearly separable, and when δ < 0, they are linearly inseparable. Next, a rotation (linear transformation) of the space was performed so that the separating hyperplane depended on many coordinates.
The need to test the new software on real data forced us to package the software module NonSmoothSVC as a containerized application (using Docker technology [21]) for use on a personal computer as well as in cluster, grid, and cloud environments. This gave access to real data on the Cancer Genomics Cloud [22], a specialized cloud platform that provides free access to genetic and medical databases, in particular The Cancer Genome Atlas (TCGA) [23], and to more than 450 public applications designed to analyze such data. Users can extend this list with their own applications, data sets and research results (currently there are more than one million on this service) and involve other researchers in their projects.
Computational experiments have demonstrated that on some data sets NonSmoothSVC has qualitative advantages over the other methods in the comparison but is inferior in speed. In particular, on linearly separable samples NonSmoothSVC outperformed LinearSVC in the number of cases with better classification accuracy. On the unbalanced samples, the NonSmoothSVC software slightly outperformed the LinearSVC software in the number of cases with better classification accuracy on average, but demonstrated an advantage in some parts of the classification accuracy scale.
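The artificial test problems described earlier in this section (uniform points in the unit cube, a shift of each class along the first coordinate, then a rotation) might be generated roughly as follows. This is only a minimal sketch assuming NumPy; the names make_shifted_cube_data and first_coordinate_gap, the alternating 0/1 labels and the use of a random orthogonal matrix from a QR decomposition as the "rotation" are illustrative assumptions, not the generator used by the authors.

```python
import numpy as np


def make_shifted_cube_data(n_samples, n_features, delta, seed=0):
    """Generate a two-class sample in a shifted, rotated unit cube.

    Returns the rotated points, the labels (0 = first class,
    1 = second class) and the orthogonal matrix Q used as the rotation.
    """
    rng = np.random.default_rng(seed)
    # Identically distributed points in the unit cube [0, 1]^n.
    X = rng.uniform(0.0, 1.0, size=(n_samples, n_features))
    # Alternate the labels so that both classes are always present.
    y = np.arange(n_samples) % 2
    # Shift along the first coordinate: +delta for the first class,
    # (-1 - delta) for the second class.
    X[y == 0, 0] += delta
    X[y == 1, 0] += -1.0 - delta
    # Assumption: any random orthogonal transformation serves as the
    # "rotation"; it is taken from a QR decomposition.
    Q, _ = np.linalg.qr(rng.normal(size=(n_features, n_features)))
    return X @ Q, y, Q


def first_coordinate_gap(X_rot, y, Q):
    """Gap between the classes along the original first coordinate.

    Row 0 of Q is the normal of the separating hyperplane after the
    rotation; a positive gap means the sample is linearly separable.
    """
    proj = X_rot @ Q[0]
    return proj[y == 0].min() - proj[y == 1].max()
```

With δ > 0 the gap returned by first_coordinate_gap is at least 2δ, while for δ < 0 the two classes overlap along that direction. The arrays can then be passed to any classifier that accepts a feature matrix and a label vector.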
A full description of the numerical experiments and the test results can be found in the reports (in Ukrainian) at http://moderninform.icybcluster.org.ua/ais/. Thanks to its containerized form, the developed software can become a publicly available tool and an application on this and other services for constructing optimized linear classifiers using modern linear algebra libraries. Where technically possible, parallelization on microprocessor networks looks promising. This approach is especially recommended for large data samples, when the dimension of the feature space is in the tens of thousands. It is also necessary to take into account the features of the optimization problems in specific cases. In particular, additional requirements formulated by specialists may reduce the number of informative features.
III. SPECIFIC FEATURES OF BIOMEDICAL DATA
The processing and study of biomedical data have some peculiarities. These include possible large errors that arise in the processing of medical information; a huge number of features that need to be taken into account, which increases the dimensionality of the corresponding optimization problems; and missing measurements, which require specialized methods for their processing and analysis.
In order to improve the diagnosis and treatment of complex diseases, much attention is paid to the comprehensive analysis of various biomedical and clinical data in order to understand the processes occurring in the body at the cellular level and the changes caused by the development of the disease. It is known that the cause of complex diseases, along with external factors, is a combination of genetic failures, so a single genetic mutation cannot be fixed as a biomarker. The difficulty also lies in the fact that individual genetic factors can differ, and individual cases of the same disease (phenotype) can be caused by different genetic changes.
In addition, when several mutations act together, the individual effect of each of them can be rather insignificant and therefore difficult to detect. It is also necessary to take into account the high heterogeneity of a complex disease, i.e. the heterogeneity of its observed manifestations (phenotypes).
Recently, the methods of systems biology have become widely used to study complex diseases, namely knowledge about the interactions between genes, their products and small molecules that form a complex network of interactions. This approach makes it possible to explain the appearance of similar phenotypes despite different genetic causes, namely by their interconnection and influence (dysregulation) on the same component of the cellular system. Thus, the use of the interactome in conjunction with other data from biogenetic studies can contribute to understanding the processes occurring at the molecular level in complex diseases. The use of combinations of heterogeneous data makes it possible to determine dysregulated cellular pathways, to reveal the relationship between genotype and phenotype, and to explain the heterogeneity of a complex disease.
A natural approach here is to increase the efficiency of tools and methods for selecting informative features. In [26 – 30], attention is paid to the preliminary preparation of available medical data in order to select informative features. In the course of the project, algorithms for preprocessing biomedical data and extracting biomarkers were developed, including an algorithm for ranking features by information content for classification [26] and an algorithm for identifying combinations of biomarkers that takes into account the correlation of features and makes it possible to exclude their influence.
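To illustrate what ranking features by information content can look like in practice, here is a minimal sketch assuming NumPy and two classes labelled 0 and 1. It uses a generic Welch t-statistic as the score, which is a stand-in chosen for illustration and is not the algorithm of [26]; the function names rank_features and select_top_features are hypothetical.

```python
import numpy as np


def rank_features(X, y):
    """Rank features by the absolute Welch t-statistic between two classes.

    X is an (observations x features) matrix, y holds labels 0 and 1.
    Returns the feature indices ordered from most to least informative
    and the score of every feature.
    """
    a, b = X[y == 0], X[y == 1]
    # Difference of the class means for every feature.
    diff = np.abs(a.mean(axis=0) - b.mean(axis=0))
    # Standard error of that difference (unequal variances allowed).
    err = np.sqrt(a.var(axis=0, ddof=1) / len(a) + b.var(axis=0, ddof=1) / len(b))
    score = diff / np.maximum(err, 1e-12)  # guard against constant features
    order = np.argsort(score)[::-1]
    return order, score


def select_top_features(X, y, k):
    """Keep only the k highest-ranked columns of X."""
    order, _ = rank_features(X, y)
    return X[:, order[:k]]
```

A univariate score of this kind ignores correlations between features, which is exactly the limitation that the second algorithm mentioned above is meant to address.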
Moreover, several approaches were analyzed for identifying a subset of informative features from several data sources, namely gene expression data and data on functional and physical interactions of genes and their products, presented in the form of networks. Based on the analysis of existing approaches, an algorithm for identifying a subset of features has been developed, which integrates interactomic and transcriptomic data to determine functional subsets associated with the disease. Preprocessing of biomedical data made it possible to reduce the feature space and thereby increase the accuracy of classification models. A detailed description of the algorithms and related information can be found in the report (in Russian) at http://moderninform.icybcluster.org.ua/ais/.
Figure. Simplified diagram of combining two data sources into an ensemble
In one of the numerical experiments, the real data contained information on the gene expression of cancer patients (143 observations of 60,483 features) obtained from The Cancer Genome Atlas (TCGA). From these data, the simplified feature ranking method proposed by Novoselova [30] identified the 23 most informative features for predicting the vital status of patients diagnosed with glioblastoma. This approach substantially reduces the numerical difficulties of subsequent data processing.
IV. THE CORE OF THE INTELLIGENT ANALYTICAL SYSTEM FOR BIOMEDICAL DATA ANALYSIS
Since various sources of biological information characterize various changes occurring in the body at the cellular level during the development of a complex disease, it is assumed that their combination will improve the accuracy of diagnosing the disease subtype and the reliability of predicting the disease course and the response to therapy [31, 32]. In addition, combining heterogeneous data will allow one to discover the relationships between various biomedical entities (genes, proteins, metabolites, etc.)
directly related to the development of the disease, to compensate for noise and errors in individual data sources, and thereby to obtain more reliable results. A common difficulty in this task is how to combine information from different data sources. The Figure shows an example of a simplified scheme for combining two data sources to build a classifier.
Of interest in our study are methods for constructing classifiers based on various sources of multidimensional data, which, as a rule, have a heterogeneous representation. Consequently, the task is to unify this representation, determine the base classifier, build classification models on each data source, and select ways to combine the predicted values obtained from the constructed models.
The core of the intelligent analytical system being developed is a hybrid classification model that combines several sources of biological information about patients in order to build a classification model for diagnosing subtypes of complex diseases characterized by genetic disorders.
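The general scheme outlined above (a unified representation per data source, one base classifier per source, and a rule for combining the predicted values) can be sketched as follows. This is a simplified illustration assuming NumPy, labels in {-1, +1}, a Gaussian (RBF) kernel as the unified representation, kernel ridge regression as the base classifier and weights proportional to training accuracy; all of these choices and all names (rbf_kernel, KernelRidgeBase, fit_hybrid, predict_hybrid) are assumptions made for the example and do not reproduce the authors' hybrid model.

```python
import numpy as np


def rbf_kernel(A, B, gamma):
    """Gaussian kernel matrix between the rows of A and the rows of B."""
    sq = (A ** 2).sum(axis=1)[:, None] + (B ** 2).sum(axis=1)[None, :] - 2.0 * A @ B.T
    return np.exp(-gamma * np.maximum(sq, 0.0))


class KernelRidgeBase:
    """Base classifier for one data source: kernel ridge on +/-1 labels."""

    def __init__(self, gamma=None, lam=1.0):
        self.gamma = gamma
        self.lam = lam

    def fit(self, X, y):
        self.X_ = X
        # Default kernel width: one over the number of features.
        self.gamma_ = self.gamma if self.gamma is not None else 1.0 / X.shape[1]
        K = rbf_kernel(X, X, self.gamma_)
        self.alpha_ = np.linalg.solve(K + self.lam * np.eye(len(X)), y.astype(float))
        return self

    def decision(self, X):
        """Real-valued score; its sign is the predicted class."""
        return rbf_kernel(X, self.X_, self.gamma_) @ self.alpha_


def fit_hybrid(sources, y):
    """Fit one base classifier per data source and weight it by accuracy."""
    models = [KernelRidgeBase().fit(X, y) for X in sources]
    acc = np.array([np.mean(np.sign(m.decision(X)) == y)
                    for m, X in zip(models, sources)])
    weights = acc / acc.sum()
    return models, weights


def predict_hybrid(models, weights, sources):
    """Combine the per-source scores by a weighted sum and take the sign."""
    total = sum(w * m.decision(X) for w, m, X in zip(weights, models, sources))
    return np.where(total >= 0, 1, -1)
```

Each element of sources is a matrix for the same patients in the same order, so heterogeneous data types only need to agree on the rows. The weights give a rough indication of how informative each source is, although weights based on training accuracy are optimistic and a held-out estimate would be preferable in practice.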
The proposed hybrid model is a classification ensemble with the following distinctive features:
1) Uniform presentation of information from various data sources by constructing a matrix of object-object distances using various kernel functions (density functions), including the Gaussian function, the polynomial function, the scalar product of vectors, etc.;
2) Implementation of the procedure for selecting classification characteristics for each individual data source;
3) Construction of a basic or individual classifier of the hybrid model, which can be either a single classifier or an ensemble of classifiers built on a single data source;
4) Implementation of several ways of integrating the individual classifiers of the model;
5) Analysis of the information content of individual classifiers using the assessment of their weight coefficients.
The method for constructing the hybrid model is based on a combination of the bagging procedure and the aggregation of ranked lists to build base classifiers, and on a pruning procedure to determine the final structure of the model, which makes it possible to adapt the ensemble to the type of data being classified. Preliminary experiments on the TCGA data [23] showed that ensembles built on heterogeneous data sources can substantially increase the accuracy of classification and prediction of subtypes of complex diseases, since each data source describes the organism under study from a different angle: gene expression data, ribonucleic acid (RNA) sequencing data, metabolic data, gene copy number data, etc.
V. SPECIFIC FEATURES OF BIOMEDICAL COMPUTATIONS
Ensuring the reproducibility of calculations is a prerequisite for the reproducibility of scientific research as a whole. The conditions for computational reproducibility are the availability of source data, the ability to reproduce an identical computing environment (or an environment that does not lead to different calculation results), and the availability of the results of computations.
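One lightweight way to document two of these conditions, the source data and the computing environment, is to store a small manifest next to the results. The sketch below uses only the Python standard library; the function names and the structure of the manifest are assumptions made for illustration, not a format prescribed by any of the tools discussed here.

```python
import hashlib
import json
import platform
from importlib import metadata


def sha256_of_file(path, chunk_size=1 << 20):
    """Checksum of an input file, read in chunks to handle large data."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def package_version(name):
    """Installed version of a package, or None if it is not installed."""
    try:
        return metadata.version(name)
    except metadata.PackageNotFoundError:
        return None


def environment_manifest(packages, data_files):
    """Collect interpreter, platform, library versions and data checksums."""
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": {name: package_version(name) for name in packages},
        "data_sha256": {str(path): sha256_of_file(path) for path in data_files},
    }


def manifest_to_json(manifest):
    """Serialize the manifest in a stable, diff-friendly form."""
    return json.dumps(manifest, indent=2, sort_keys=True)
```

For example, environment_manifest(["numpy", "scikit-learn"], ["expression.tsv"]) would record the versions of the two libraries and the checksum of a hypothetical input file expression.tsv. Such a record does not recreate the environment by itself, but it makes differences between two runs visible.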
Biomedical calculations have specific features that should be taken into account when planning them. Let us mention some of them.
Modern biomedical calculations, especially those based on genome data, are very large and cumbersome. "Classic" biomedical applications (PAML, Muscle, MAFFT, MrBayes, BLAST, etc.) and large libraries implementing biomedical algorithms in different programming languages (C/C++, Java, R, Go, Scala, Haskell, Perl, Python, Ruby, Erlang, Julia, etc. [24]) are quite often used simultaneously in one study. Moreover, biomedical calculations often involve methods of artificial intelligence, such as machine learning and pattern recognition, and the corresponding libraries (e.g., scikit-learn [8, 20]).
Such a variety of software requires careful configuration of the computing environment with control over the versions of the libraries used (dozens or even hundreds of libraries may be involved). Otherwise, the results of calculations may not be reproducible. On clusters, creating such environments (separate for each user) and keeping them free of conflicts is quite burdensome unless special software configuration tools, such as Conda or Bioconda, or application containerization, for example with Singularity, are used.
Most of the libraries and applications used in biomedical computing do not make efficient use of parallel multithreaded computing on multi-core processors. At the same time, many of them fit an "embarrassingly parallel" model, in which individual pieces of data are processed in parallel by identical instances of computational processes without passing messages between them (for example, using Apache Hadoop technology) [8].
VI. TECHNOLOGIES THAT ENSURE THE REPRODUCIBILITY OF SCIENTIFIC CALCULATIONS
Taking into account the peculiarities of biomedical computing, reproducibility and horizontal scaling (the ability to increase the number of identical computing units working on one problem) can be achieved through containerized applications, software pipelining of calculations, and parameterization of the software environment.
Technologies of containerization of software applications. Containerization of biomedical applications (Docker, Singularity) provides reproducibility of the conditions in which the calculations took place (an unchanging set of software and libraries) and the possibility of horizontal scaling, provided that the "embarrassingly parallel" model is used, in cluster (Singularity) and cloud (Docker) calculations.
Technologies of software pipelining of calculations. A software pipeline makes it possible to organize flow calculations (calculations in which the inputs and outputs of processes are interconnected). With tools for automating flow calculations (workflow engines) such as CWL (Common Workflow Language), GWL (Guix Workflow Language), Snakemake and Nextflow, a specific calculation can be presented as a task (a text file, usually in YAML or JSON format) whose results can be reproduced [3]. In addition, there are tools for creating and displaying such tasks as a graph of processes and data flows. An example is RABIX (Reproducible Analyses for Bioinformatics), a graphical editor for CWL. Some pipeline tools also use containerization (for example, CWL), so such tasks can be performed both on a personal computer and in a cloud environment. An important feature of workflow automation tools is that the task description syntax makes it possible to specify the scale of the calculations by indicating the amount of resources required.
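The idea of a flow calculation, in which each process consumes the outputs of other processes and the whole task forms a graph, can be illustrated with a toy runner. The sketch below relies on graphlib from the Python standard library (Python 3.9 or later); the function run_workflow and the dictionary format of the steps are hypothetical and do not correspond to the syntax of CWL, Snakemake or any other engine named above.

```python
from graphlib import TopologicalSorter


def run_workflow(steps, inputs):
    """Run interconnected steps in dependency order.

    steps maps a step name to (function, list of names it consumes);
    a consumed name is either another step or a key of inputs.
    Returns a dictionary with the inputs and the result of every step.
    """
    # Only dependencies on other steps constrain the execution order.
    graph = {name: {dep for dep in deps if dep in steps}
             for name, (_, deps) in steps.items()}
    values = dict(inputs)
    for name in TopologicalSorter(graph).static_order():
        func, deps = steps[name]
        values[name] = func(*(values[dep] for dep in deps))
    return values


# A three-step toy pipeline: normalize, filter, count.
example_steps = {
    "count": (len, ["select"]),
    "select": (lambda values: [v for v in values if v > 0.5], ["normalize"]),
    "normalize": (lambda raw: [v / max(raw) for v in raw], ["raw"]),
}
example_result = run_workflow(example_steps, {"raw": [1, 2, 3, 4]})
```

Because the order is derived from the declared dependencies rather than from the order in which the steps are written, the same description yields the same sequence of calculations on any machine. Real workflow engines add what this sketch leaves out, such as containers, resource requests and caching of intermediate results.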
Seven Bridges' product, Cancer Genomics Cloud (CGC, see http://www.cancergenomicscloud.org/), is an example of a cloud software platform for performing reproducible biomedical computations using containerization and pipelining. It is the use of containerization in the creation of an application for the construction of a linear classifier at the V.M. Glushkov Institute of Cybernetics of the National Academy of Sciences of Ukraine made it possible to conduct testing on real very voluminous medical data located at the CGC. Technologies for parameterization of software environment. Parameterization of the software environment allows you to reproduce, if necessary, an identical computing environment. GNU Guix, Conda, Bioconda are examples of tools that allow you to create an isolated software environment for individual users in a cluster [8]. VII. CONCLUSIONS AND PROSPECTS OF FURHER RESEARCH At present, there exists a range of technologies to ensure the reproducibility of scientific calculations in cloud and cluster environments. This makes it possible to create biomedical applications adapted to these environments. As a results of our efforts we get computational basis that satisfies modern requirements for computational reproducibility. The experience of using the developed linear classifier, gained during its testing on artificial and real data, allows us to conclude about several advantages provided by the containerized form of the created application. Namely: • it permits to provide access to real data located in cloud environment, T. BARDADYM, V. GORBACHUK, N. NOVOSELOVA, S. OSYPENKO, V. SKOBTSOV, I. TOM 82 ISSN 2707-4501. Кібернетика та комп'ютерні технології. 2021, № 2 • it is possible to perform calculations to solve research problems on cloud resources both with the help of developed tools and with the help of cloud services, • such a form of research organization makes numerical experiments reproducible, i.e. 
any other researcher can compare the results of their developments on specific data that have already been studied by others, in order to verify the conclusions and technical feasibility of new results, • there exists a universal opportunity to use the developed tools on technical devices of various classes from a personal computer to powerful cluster. The next steps of the project include development of the common software interface of the experimental prototype of the intelligent analytical system in order to integrate the developed methods and software modules of biomedical data preprocessing, data clustering and classification. It will allow performing all the steps of data analysis from the single framework and conducting research in the field of biomedicine. The hybrid classification model as a core of the intelligent system will make it possible to integrate multidimensional, heterogeneous biomedical data with the aim to better understand the molecular courses of disease origin and development, to improve the identification of disease subtypes and disease prognosis. Much attention will be paid to the experimentation with different computation approaches on real datasets taking into account the reproducibility of results. References 1. Vorontsov K.V. Mathematical methods of learning by precedents (Machine Learning Theory) (in Russian) http://www.machinelearning.ru/wiki/images/6/6d/Voron-ML-1.pdf 2. Gupal A.M., Sergienko I.V. Symmetry in DNA. Methods for Discrete Sequences Recognition. Kyiv. Naukova Dumka, 2016. 227 p. (in Russian). 3. Baldi P., Wesley Hatfield G. DNA Microarrays and Gene Expression. From Experiments to Data Analysis and Modeling. Cambridge University Press, 2011. 4. Kuhn M., Johnson K. Applied predictive modeling. New York: Springer, 2013. https://doi.org/10.1007/978-1-4614-6849-3 5. Heath L.S., Ramakrishnan N. (Eds.). Problem solving handbook in computational biology and bioinformatics. NY: Springer Science & Business Media, 2010. 
https://doi.org/10.1007/978-0-387-09760-2
6. Ioannidis J. Why Most Published Research Findings Are False. PLoS Medicine. 2005. 2 (8). P. e124. https://doi.org/10.1371/journal.pmed.0020124
7. Baker M. Reproducibility crisis? Nature. 2016. 26 (533). P. 353–66.
8. Strozzi F., Janssen R., Wurmus R., Crusoe M.R. et al. Scalable workflows and reproducible data analysis for genomics. In: Evolutionary Genomics, 2nd ed. New York, NY: Humana Press, 2019. P. 723–745. https://doi.org/10.1007/978-1-4939-9074-0_24
9. Zhuravlev Y., Laptin Y., Vinogradov A., Zhurbenko N., Lykhovyd O., Berezovskyi O. Linear classifiers and selection of informative features. Pattern Recogn. and Image Anal. 2017. 27 (3). P. 426–432. https://doi.org/10.1134/S1054661817030336
10. Laptin Y., Zhuravlev Y., Vinogradov A. Comparison of Some Approaches to Classification Problems, and Possibilities to Construct Optimal Solutions Efficiently. Pattern Recogn. and Image Anal. 2014. 24 (2). P. 189–195. https://doi.org/10.1134/S1054661814020175
11. Zhurbenko N.G. Linear classifier and projection on polytope. Cybernetics and Systems Analysis. 2020. 56 (3). P. 1–8. https://doi.org/10.1007/s10559-020-00264-3
12. Shor N.Z., Zhurbenko N.G. A minimization method using the operation of extension of the space in the direction of the difference of two successive gradients. Cybernetics. 1971. 7 (3). P. 450–459. https://doi.org/10.1007/BF01070454
13. Shor N.Z. Minimization Methods for Non-Differentiable Functions. Springer, 1985. https://doi.org/10.1007/978-3-642-82118-9
14. Shor N.Z. Nondifferentiable Optimization and Polynomial Problems. London: Kluwer Acad. Publ., 1998. https://doi.org/10.1007/978-1-4757-6015-6
15. Laptin Y.P. Exact penalty functions and convex extensions of functions in decomposition schemes in variables. Cybernetics and Systems Analysis. 2016. 52 (1). P. 85–95. https://doi.org/10.1007/s10559-016-9803-8
16. Laptin Y.P., Bardadym T.A. Problems related to estimating the coefficients of exact penalty functions.
Cybernetics and Systems Analysis. 2019. 55 (3). P. 400–412. https://doi.org/10.1007/s10559-019-00147-2
17. Chang C.-C., Lin C.-J. LIBSVM - A Library for Support Vector Machines. https://www.csie.ntu.edu.tw/~cjlin/libsvm/
18. BLAS (Basic Linear Algebra Subprograms). http://www.netlib.org/blas/
19. LAPACK – Linear Algebra PACKage. http://www.netlib.org/lapack/
20. Free software machine learning library for the Python programming language. https://scikit-learn.org/stable/index.html
21. Tools for creation of isolated Linux-containers. https://www.docker.com/
22. The Cancer Genomics Cloud. http://www.cancergenomicscloud.org/
23. The Cancer Genome Atlas (TCGA). https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga
24. Bonnal R., Yates A., Goto N., Gautier L. et al. Sharing Programming Resources Between Bio* Projects. In: Evolutionary Genomics, 2nd ed. New York, NY: Humana Press, 2019. P. 747–766. https://doi.org/10.1007/978-1-4939-9074-0_25
25. Novoselova N.A., Tom I.E. Integrated network approach to protein function prediction. The Scientific Journal of Riga Technical University. Information Technology and Management Science. 2018. 21. P. 98–103. https://doi.org/10.7250/itms-2018-0016
26. Tom I.E. Information technologies in the analysis of medical data.
Science and innovations. 2016. 3. P. 28–31.
27. Novoselova N.A., Tom I.E. Semi-supervised clustering with active constraint selection. Proc. XIII International Conference "Pattern Recognition and Information Processing" (PRIP-2016), BSU, October 3–5, 2016. Minsk. P. 69–72.
28. Novoselova N.A., Tom I.E. Methods of construction of genetic data clusters. Informatics. 2016. 1 (49). P. 64–74.
29. Novoselova N.A., Tom I.E. Algorithm for ranking features for detecting biomarkers in gene expression data. Artificial Intelligence. 2013. 3. P. 58–68.
30. Novoselova N.A., Tom I.E., Borisov A., Polaka I. Feature ranking by classification accuracy estimation of multiple data sample. Information Technology and Management Science. 2013. 16. P. 95–100. https://doi.org/10.2478/itms-2013-0015
31. Kuncheva L.I. Combining Pattern Classifiers. Methods and Algorithms. Wiley, 2004. https://doi.org/10.1002/0471660264
32. Novoselova N.A., Tom I.E., Ablameyko S.V. Evolutionary design of the classifier ensemble. Artificial Intelligence. 2011. 3. P. 429–48.

Received 16.04.2021

Tamara Bardadym, Cand. Sci. (Phys. & Math.), Senior Researcher, V.M. Glushkov Institute of Cybernetics, National Academy of Sciences of Ukraine, Kyiv, Tamara.Bardadym@gmail.com
Vasyl Gorbachuk, Dr. Sci. (Phys. & Math.), Head of Dept., V.M. Glushkov Institute of Cybernetics, National Academy of Sciences of Ukraine, Kyiv, Gorbachukvasyl@netscape.net
Natalia Novoselova, Cand. Sci. (Engineering), Senior Researcher, United Institute of Informatics Problems, National Academy of Sciences of Belarus, Minsk, novosel@newman.bas-net.by
Sergiy Osypenko, Software Engineer, V.M. Glushkov Institute of Cybernetics, National Academy of Sciences of Ukraine, Kyiv, baston888@gmail.com
Vadim Skobtsov, Cand. Sci. (Engineering), Leading Researcher, United Institute of Informatics Problems, National Academy of Sciences of Belarus, Minsk, vasko_vasko@mail.ru
Igor Tom, Cand. Sci.
(Engineering), Head of Lab., United Institute of Informatics Problems, National Academy of Sciences of Belarus, Minsk, tom@newman.bas-net.by

UDC 004.89
T.O. Bardadym 1, V.M. Gorbachuk 1, N.A. Novoselova 2, S.P. Osypenko 1, V.Yu. Skobtsov 2, I.E. Tom 2
On Biomedical Computations in Cluster and Cloud Environment
1 V.M. Glushkov Institute of Cybernetics of the NAS of Ukraine, Kyiv
2 United Institute of Informatics Problems of the NAS of Belarus, Minsk
Correspondence: Tamara.Bardadym@gmail.com

Introduction. The publication summarizes the experience of using applied containerized software tools in a cloud environment, gained by the authors during the project "Development of methods, algorithms and an intelligent analytical system for processing and analysis of heterogeneous clinical and biomedical data to improve the diagnosis of complex diseases", carried out by a team from the United Institute of Informatics Problems of the NAS of Belarus and the V.M. Glushkov Institute of Cybernetics of the NAS of Ukraine. In parallel, the paper describes the features of biomedical data and the main approaches to their processing and classification implemented within the intelligent analytical system, as well as the possibilities of implementing them as part of a containerized application.

The purpose of the paper. To describe modern technologies that ensure the reproducibility of numerical experiments in this field, and tools aimed at integrating several sources of biomedical information in order to improve the diagnosis and prognosis of complex diseases. Particular attention is paid to methods for processing data obtained from different sources of biomedical information and included in the intelligent analytical system.

Results. The experience of using applied containerized biomedical software tools in a cloud environment is summarized. The reproducibility of scientific computing and the capabilities of modern technologies for scientific calculations are discussed. The main approaches to biomedical data preprocessing and integration within the intelligent analytical system are described. The developed hybrid classification model is the basis of the intelligent analytical system and is aimed at integrating several sources of biomedical information.

Conclusions. The experience of using the developed classification module NonSmoothSVC, which is part of the developed intelligent analytical system, gained while testing it on artificial and real data, points to several advantages of the containerized form of the application. Namely:
• it provides access to real data located in a cloud environment;
• it makes it possible to perform calculations for research problems on cloud resources both with the developed tools and with cloud services;
• this form of research organization makes numerical experiments reproducible, i.e. any other researcher can compare the results of their developments on specific data that have already been studied by others, in order to verify the conclusions and the technical feasibility of new results;
• the developed tools can be used on hardware of various classes, from a personal computer to a powerful cluster.

The hybrid classification model, as the core of the intelligent system, makes it possible to integrate multidimensional, heterogeneous biomedical data in order to better understand the molecular pathways of disease origin and development and to improve the identification of disease subtypes and disease prognosis.

Keywords: classifier, cloud service, containerized application, heterogeneous biomedical data.
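
As a small illustration of the reproducibility point in the conclusions, the sketch below records which interpreter and package versions a computation ran with, and reduces that record to a digest that two researchers could compare. It is not the authors' tooling and it does not replace a container image or a Conda/Guix environment; the function names `environment_fingerprint` and `fingerprint_digest` and the package list passed to them are hypothetical choices made for this example.

```python
import hashlib
import json
import platform
from importlib import metadata


def environment_fingerprint(packages):
    """Describe the interpreter, the platform, and the versions of the given packages."""
    versions = {}
    for name in sorted(packages):
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            # The package is not installed in this environment.
            versions[name] = None
    return {
        "python": platform.python_version(),
        "implementation": platform.python_implementation(),
        "system": platform.system(),
        "machine": platform.machine(),
        "packages": versions,
    }


def fingerprint_digest(fingerprint):
    """Hash a canonical JSON encoding so that equal records give equal digests."""
    canonical = json.dumps(fingerprint, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # The package names are illustrative; list whatever the experiment imports.
    record = environment_fingerprint(["numpy", "scikit-learn"])
    print(json.dumps(record, indent=2, sort_keys=True))
    print("digest:", fingerprint_digest(record))
```

Storing such a record next to the results of a numerical experiment gives a quick check that a rerun used the same software versions. Matching digests only show that the listed components agree, so the container or environment specification remains the authoritative description.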
