On Biomedical Computations in Cluster and Cloud Environment
The experience of the use of applied containerized biomedical software tools in cloud environment is summarized. The reproducibility of scientific computing in relation to modern technologies of scientific calculations is discussed. The main approaches to biomedical data preprocessing and integratio...
Date: | 2021 |
---|---|
Authors: | T. Bardadym, V. Gorbachuk, N. Novoselova, S. Osypenko, V. Skobtsov, I. Tom |
Format: | Article |
Language: | English |
Published: | V.M. Glushkov Institute of Cybernetics of the NAS of Ukraine, 2021 |
Journal: | Кібернетика та комп’ютерні технології (Cybernetics and Computer Technologies) |
Online access: | http://dspace.nbuv.gov.ua/handle/123456789/181001 |
Repository: | Digital Library of Periodicals of National Academy of Sciences of Ukraine |
Cite as: | On Biomedical Computations in Cluster and Cloud Environment / T. Bardadym, V. Gorbachuk, N. Novoselova, S. Osypenko, V. Skobtsov, I. Tom // Кібернетика та комп’ютерні технології: collection of scientific papers. 2021. No. 2. P. 76-84. Bibliography: 32 titles. In English. |
INFORMATION TECHNOLOGY: THEORY AND TOOLS
76 ISSN 2707-4501. Кібернетика та комп'ютерні технології. 2021, № 2
CYBERNETICS
and COMPUTER
TECHNOLOGIES
The experience of the use of applied
containerized biomedical software tools in cloud
environment is summarized. The reproducibility
of scientific computing in relation to modern
technologies of scientific calculations is
discussed. The main approaches to biomedical
data preprocessing and integration in the
framework of the intelligent analytical system
are described.
Keywords: classifier, cloud service, containerized application
T. Bardadym, V. Gorbachuk, N. Novoselova,
S. Osypenko, V. Skobtsov, I. Tom, 2021
UDC 004.89 DOI:10.34229/2707-451X.21.2.8
T. BARDADYM, V. GORBACHUK, N. NOVOSELOVA, S. OSYPENKO,
V. SKOBTSOV, I. TOM
ON BIOMEDICAL COMPUTATIONS IN CLUSTER
AND CLOUD ENVIRONMENT*
I. INTRODUCTION
This publication summarizes the authors' experience of using
applied containerized software tools in a cloud environment,
gained during the project “Development
of methods, algorithms and intellectual analytical system
for processing and analysis of heterogeneous clinical and
biomedical data in order to improve the diagnosis of
complex diseases”, carried out by a team from the
United Institute of Informatics Problems of the NAS of
Belarus and the V.M. Glushkov Institute of Cybernetics of the
NAS of Ukraine.
The goal of the project is to develop effective methods
and software for constructing classifiers and selecting
informative features, and to create a prototype of an
intelligent analytical system that implements in software
all stages of data processing and analysis and is aimed at
research in the field of clinical medicine. This system will
integrate clinical and molecular patient data, determine
diagnostic biomarkers and their combinations, build
classifiers of complex diseases (oncological diseases) based
on integrated data, and identify new disease subtypes in
order to improve treatment methods and increase their
efficiency.
The large amount of research devoted to the
development of mathematical methods of data handling,
particularly classification models, is due, on the one hand,
to the wide range of possible applications and, on the other
hand, to the complexity of these problems, which requires
the development and improvement of means to solve them
(see, for example, [1 – 5]). In addition to general
requirements for the efficiency of the created software,
attention must be paid to the availability of large and
heterogeneous data sets, to the portability of programs from
one hardware unit to another, and to their performance in
cloud computing.
* Supported by the National Academy of Sciences of Ukraine
(project ВФ.115.41), the Ministry of Education and Science of
Ukraine (projects М/99-2019, M/37-2020) and the Belarusian
Republican Foundation for Fundamental Research (project
№ Ф19УКРГ-005).
Moreover, one of the most important requirements is the reproducibility of numerical experiments.
Reproducibility of research is one of the basic scientific principles. However, science has come to
recognize a so-called "reproducibility crisis" [6, 7].
This crisis has affected almost all branches of science and, to a large extent, biology and
medicine. Much effort has been made recently to overcome it, including the development of
software and software platforms that ensure the reproducibility of scientific computing. Computing in
biology and medicine involves the use of high-performance computing technologies (including clusters and
grid technologies). However, the introduction of modern technologies for ensuring the reproducibility of
calculations in this area is quite slow [8, p. 731]. As a result, on clusters that do not have the
appropriate software installed, there is a contradiction between modern requirements for the
reproducibility of scientific calculations and the ability to meet them by old means.
Creating a containerized application was not a planned stage of our study.
The need arose primarily from the way the real data on which the software was tested had to be
accessed. Only later did the authors realize that they had gained other advantages, the most
important of which is the reproducibility of numerical experiments. The first purpose of this
publication is to share this experience.
The second purpose is to describe briefly our efforts towards the development of specialized
computer methods and models for solving vital tasks in the field of biomedicine. Nowadays an
enormous amount of biomedical and clinical data is collected in public and private repositories.
Much of it can be freely accessed and offers a wide field for experiments with newly developed
scientific approaches and for their comparison. The integration of heterogeneous information sources
is one of the urgent applied problems that we have tried to solve in our project. The hybrid
classification model forms the basis of the intelligent analytical system and aims to integrate
several sources of biomedical information in order to improve the diagnostics and prognosis of
complex diseases.
II. NEW LINEAR CLASSIFIER AND ITS PROGRAM REALISATION
Based on the approaches presented in [9, 10], optimization models and methods for constructing
linear classifiers have been developed. In particular, the problem of constructing classifiers
for linearly inseparable sets was formulated as the problem of minimizing the band of incorrect
classification of training sample points. This model belongs to the class of non-convex
programming problems and is multiextremal. Various formulations of this problem are proposed, and
approaches to constructing approximate solutions and computing estimates of the optimal values are
considered. An interesting geometric interpretation of the problems of constructing linear
classifiers can be found in [11].
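As an illustration of what a "band of incorrect classification" may mean, the sketch below evaluates one candidate hyperplane. It assumes one possible reading of the model, in which the band half-width is the largest distance by which a training point lies on the wrong side of the hyperplane; the exact formulations are those of [9, 10] and are not reproduced here. The function name `misclassification_band` and the labels +1/-1 are hypothetical.

```python
import math


def misclassification_band(points, labels, w, b):
    """Half-width of the band of misclassified points for the hyperplane w.x + b = 0.

    Assumption (not taken from the paper): labels are +1 / -1, and the band is
    measured by the worst signed distance of a point to the hyperplane.
    Returns 0.0 when the hyperplane separates the sample without errors.
    """
    norm = math.sqrt(sum(wi * wi for wi in w))
    if norm == 0.0:
        raise ValueError("w must be a non-zero vector")
    # Signed distance of every point, positive when the point is on its own side.
    margins = [
        y * (sum(wi * xi for wi, xi in zip(w, x)) + b) / norm
        for x, y in zip(points, labels)
    ]
    return max(0.0, -min(margins))
```

Minimizing this quantity over all directions `w` is non-convex because of the normalization by the length of `w`, which agrees with the multiextremal character noted above.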
To solve these optimization problems, methods of non-smooth optimization were used, namely the
r-algorithms of N.Z. Shor [12 – 14] and exact penalty functions [15, 16]. To speed up arithmetic
operations, such software should rely on modern linear algebra libraries similar to [17 – 19].
The developed software module NonSmoothSVC implements exactly this combination of algorithms based
on non-smooth optimization methods with modern linear algebra libraries.
To test the capabilities of the new classifier NonSmoothSVC, it was compared with existing tools.
Methods from the scikit-learn library [8, 20] were chosen, namely LinearSVC, NuSVC and AdaBoost.
The last two are non-linear classifiers; they were included to obtain additional information
on the advantages of different methods for different problems. The first numerical experiments were
carried out on specially generated artificial data.
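A comparison of this kind can be set up with scikit-learn alone. The sketch below is only an assumed harness for the three reference methods: NonSmoothSVC is not included, the data come from scikit-learn's generic `make_classification` rather than from the authors' generator, and all parameter values are illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.model_selection import train_test_split
from sklearn.svm import LinearSVC, NuSVC


def compare_baselines(seed=0):
    """Fit the three reference classifiers and return their test accuracies."""
    # Stand-in data set; sizes are illustrative assumptions.
    X, y = make_classification(
        n_samples=400, n_features=20, n_informative=5, random_state=seed
    )
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, random_state=seed
    )
    models = {
        "LinearSVC": LinearSVC(max_iter=10000, random_state=seed),
        "NuSVC": NuSVC(nu=0.5),
        "AdaBoost": AdaBoostClassifier(n_estimators=50, random_state=seed),
    }
    scores = {}
    for name, model in models.items():
        model.fit(X_train, y_train)
        scores[name] = model.score(X_test, y_test)  # mean accuracy on held-out data
    return scores
```

Any classifier that exposes `fit` and `score` in the scikit-learn style could be added to the `models` dictionary in the same way.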
The computational experiments aimed to estimate the speed and predictive properties of the new
software compared to existing ones. Both artificially created data and real medical data were used
in the test problems. Training and control samples of randomly generated problems were formed as
identically distributed data points in the unit cube of the feature space $\mathbb{R}^n$. Then the
points of the first class were shifted along the first coordinate by δ, and the points of the second
class were shifted along the first coordinate by (-1-δ). When δ > 0, the training and control samples
are linearly separable, and when δ < 0, they are linearly inseparable. Finally, a rotation (linear
transformation) of the space was performed so that the separating hyperplane depended on many
coordinates.
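The generation procedure described above can be sketched as follows. This is a minimal reading of the text, not the authors' generator: the uniform distribution in the unit cube, the random class assignment, and the way the rotation is drawn (QR factorization of a Gaussian matrix) are assumptions, and the name `make_shifted_cube_data` is hypothetical.

```python
import numpy as np


def make_shifted_cube_data(n_samples, n_features, delta, seed=0):
    """Two classes in the unit cube, shifted along the first axis, then rotated.

    Returns the rotated points, labels (0 = first class, 1 = second class)
    and the orthogonal matrix Q that was applied.
    """
    rng = np.random.default_rng(seed)
    X = rng.uniform(0.0, 1.0, size=(n_samples, n_features))
    y = rng.integers(0, 2, size=n_samples)
    X[y == 0, 0] += delta          # first class: shift by delta
    X[y == 1, 0] += -1.0 - delta   # second class: shift by (-1 - delta)
    # Random orthogonal matrix; flip one column if needed so that det(Q) = +1.
    Q, _ = np.linalg.qr(rng.standard_normal((n_features, n_features)))
    if np.linalg.det(Q) < 0:
        Q[:, -1] *= -1.0
    return X @ Q.T, y, Q
```

Before the rotation the classes are split by the hyperplane "first coordinate = 0"; after it, the normal of that hyperplane is the first column of `Q`, so it generally involves all coordinates.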
The need to test the new software on real data forced us to package the software module NonSmoothSVC
as a containerized application (using Docker technology [21]) for use on a personal computer as well
as in cluster, grid, and cloud environments. This gave us access to real data on the Cancer Genomics
Cloud [22], a specialized cloud platform that provides free access to genetic and medical databases,
in particular The Cancer Genome Atlas (TCGA) [23], and to more than 450 public applications designed
to analyze such data. Users can extend this list with their own applications, data sets and research
results (currently there are more than one million on this service) and involve other researchers in
their projects.
Computational experiments have demonstrated that on some data sets the NonSmoothSVC has
qualitative advantages over other methods involved in the comparison, but is inferior in speed. Particularly,
on linearly separable samples the NonSmoothSVC gained an advantage over the LinearSVC in the number
of cases with better classification accuracy. On the unbalanced samples, the NonSmoothSVC software
slightly outperformed the LinearSVC software in the number of cases with better classification accuracy on
average, but demonstrated an advantage in some parts of the classification accuracy scale.
Full description of numerical experiments and the results of testing can be found in the reports (in
Ukrainian) at http://moderninform.icybcluster.org.ua/ais/.
Thanks to its containerized form, the developed software can become a publicly available tool and
an application on this and other services for problems of constructing optimized linear classifiers
using modern linear algebra libraries.
Where technically feasible, parallelization on microprocessor networks looks promising.
This approach is especially recommended for large data samples, when the dimension of the
feature space is in the tens of thousands. The features of the optimization problems in specific
cases must also be taken into account. In particular, additional requirements formulated by
specialists may reduce the number of informative features.
III. SPECIFIC FEATURES OF BIOMEDICAL DATA
Processing and studying biomedical data have some peculiarities. These include, in particular,
possible large errors that arise in the processing of medical information; a huge number of features
that need to be taken into account, which increases the dimensionality of the corresponding
optimization problems; and missing measurements, which require specialized methods for their
processing and analysis.
In order to improve the diagnosis and treatment of complex diseases, much attention is paid to the
comprehensive analysis of various biomedical and clinical data to understand the processes occurring in the
body at the cellular level and changes caused by the development of the disease.
It is known that the cause of complex diseases, along with external factors, is a combination of
genetic failures, so a single genetic mutation cannot be fixed as a biomarker. The difficulty also
lies in the fact that individual genetic factors can differ, and individual cases of the same disease
(phenotype) can be caused by different genetic changes.
In addition, when several mutations act together, the individual effect of each of
them can be rather insignificant and, therefore, difficult to detect. It is also necessary to take
into account the high heterogeneity of a complex disease, i.e. the heterogeneity of its observed
manifestations (phenotypes).
Recently, the methods of systems biology have become widely used to study complex diseases,
namely, knowledge about the interactions between genes, their products and small molecules that form a
complex network of interactions. This approach makes it possible to explain the appearance of similar
phenotypes despite different genetic causes, namely, their interconnection and influence (dysregulation) on
the same component of the cellular system. Thus, the use of interactome in conjunction with other data
from biogenetic studies can contribute to understanding the processes occurring at the molecular level in
complex diseases. The use of combinations of heterogeneous data makes it possible to determine
dysregulated cellular pathways, to reveal the relationship between genotype and phenotype, and to explain
the heterogeneity of a complex disease.
A natural approach here is to increase the efficiency of tools and methods for selecting informative
features. The works [26 – 30] pay attention to the preliminary preparation of available medical data
in order to select informative features.
In the course of the project, algorithms for preprocessing and extracting biomarkers from biomedical
data were developed, including an algorithm for ranking features by information content for
classification [26] and an algorithm for identifying combinations of biomarkers that takes the
correlation of features into account and makes it possible to exclude its influence.
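To make the idea of ranking features by information content concrete, the following sketch ranks features with a simple Fisher-type score (squared difference of class means divided by the sum of class variances). This is a generic textbook criterion used here as an assumption; it is not the algorithm of [26], and the function names are hypothetical.

```python
from statistics import mean, pvariance


def fisher_scores(rows, labels):
    """Score each feature by how well it separates class 0 from class 1."""
    scores = []
    for j in range(len(rows[0])):
        a = [row[j] for row, y in zip(rows, labels) if y == 0]
        b = [row[j] for row, y in zip(rows, labels) if y == 1]
        gap = (mean(a) - mean(b)) ** 2
        spread = pvariance(a) + pvariance(b)
        if spread > 0:
            scores.append(gap / spread)
        else:
            # Constant within both classes: perfectly informative or useless.
            scores.append(float("inf") if gap > 0 else 0.0)
    return scores


def top_features(rows, labels, k):
    """Indices of the k highest-scoring features, best first."""
    scores = fisher_scores(rows, labels)
    return sorted(range(len(scores)), key=lambda j: scores[j], reverse=True)[:k]
```

A univariate score of this kind ignores correlations between features, which is exactly the limitation that the correlation-aware algorithm mentioned above is meant to address.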
Moreover, several approaches were analyzed for identifying a subset of informative features, taking
into account several data sources, namely, gene expression data and data on functional and physical
interactions of genes and their products, presented in the form of networks. Based on the analysis of
existing approaches, an algorithm for identifying a subset of features has been developed, which allows
integrating interactomic and transcriptomic data to determine functional subsets associated with the disease.
Pre-processing of biomedical data made it possible to reduce the feature space and thereby increase the
accuracy of classification models.
Detailed description of algorithms and related information can be found in the report (in Russian) at
http://moderninform.icybcluster.org.ua/ais/.
Figure. Simplified diagram of combining two data sources into an ensemble
In one of the numerical experiments, the real data contained gene expression information for
cancer patients (143 observations of 60,483 features) obtained from The Cancer Genome Atlas (TCGA).
Using the simplified feature ranking method proposed by Novoselova [30], the 23 features most
informative for predicting the vital status of patients diagnosed with glioblastoma were identified
from these data. This approach substantially reduces the numerical difficulties of subsequent data
processing.
IV. THE CORE OF THE INTELLIGENT ANALYTICAL SYSTEM FOR BIOMEDICAL DATA ANALYSIS
Due to the fact that various sources of biological information characterize various changes occurring in
the body at the cellular level during the development of a complex disease, it is assumed that their
combination will improve the accuracy of diagnosis of the subtype of the disease, the reliability of the
disease prognosis and response to therapy [31, 32]. In addition, combining heterogeneous data will allow
one to discover the relationships between various biomedical entities (genes, proteins, metabolites, etc.)
directly related to the development of the disease, compensate for noise and errors in individual data
sources and thereby obtain more reliable results. A common difficulty in this task is how to
combine information from different data sources. The Figure shows an example of a simplified scheme
for combining two data sources to build a classifier.
Our study focuses on methods for constructing classifiers based on various sources of
multidimensional data, which, as a rule, have heterogeneous representations. Consequently, the task
is to unify these representations, determine the base classifier, build classification models on each
data source, and select ways to combine the predicted values obtained using the constructed models.
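One of the simplest ways to combine predicted values from several sources is a weighted average of the class probabilities returned by the per-source models. The sketch below illustrates only this step; the weighting scheme and the name `combine_weighted` are assumptions and do not describe the integration methods actually implemented in the system.

```python
def combine_weighted(probabilities, weights):
    """Weighted average of per-source probabilities of the positive class.

    probabilities: one list per data source, each with one value per patient.
    weights: one non-negative weight per data source (hypothetical, e.g.
             proportional to the validation accuracy of that source's model).
    """
    if len(probabilities) != len(weights):
        raise ValueError("one weight per data source is required")
    total = sum(weights)
    if total <= 0:
        raise ValueError("weights must have a positive sum")
    n_patients = len(probabilities[0])
    return [
        sum(w * source[i] for w, source in zip(weights, probabilities)) / total
        for i in range(n_patients)
    ]
```

For example, with two sources predicting `[0.9, 0.2]` and `[0.5, 0.4]` and weights 3 and 1, the combined values are 0.8 and 0.25, so the more trusted source dominates the decision.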
The core of the intelligent analytical system being developed is a hybrid classification model that
combines several sources of biological information about patients in order to build a classifier
for diagnosing subtypes of complex diseases characterized by genetic disorders. The
proposed hybrid model is a classification ensemble with the following distinctive features:
1) Uniform presentation of information from various data sources by constructing a matrix of
object-object distances using various kernel functions (density functions), including the Gaussian
function, the polynomial function, the scalar product of vectors, etc.;
2) Implementation of the procedure for selecting classification characteristics for each individual data
source;
3) Construction of a basic or individual classifier of a hybrid model, which can be either a single
classifier or an ensemble of classifiers built on a single data source;
4) Implementation of several ways of integrating individual classifiers of the model;
5) Analysis of the information content of individual classifiers using the assessment of their weight
coefficients.
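The first feature of the list, a uniform object-object representation, can be illustrated as follows. Whatever the number of features in a data source, a kernel function turns it into a square matrix with one row and one column per patient, so matrices from different sources become directly comparable. The kernel parameters below are illustrative assumptions.

```python
import numpy as np


def gaussian_kernel(X, gamma=1.0):
    """Gaussian (RBF) kernel matrix: exp(-gamma * squared Euclidean distance)."""
    sq = np.sum(X ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * (X @ X.T)
    return np.exp(-gamma * np.maximum(d2, 0.0))  # clip tiny negative round-off


def polynomial_kernel(X, degree=2, coef0=1.0):
    """Polynomial kernel matrix: (scalar product + coef0) ** degree."""
    return (X @ X.T + coef0) ** degree


def linear_kernel(X):
    """Scalar-product kernel matrix."""
    return X @ X.T
```

For instance, an expression matrix of shape (50, 1000) and a copy-number matrix of shape (50, 30) both yield 50 x 50 kernel matrices, which can then be passed to per-source classifiers or combined.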
The method for constructing a hybrid model is based on a combination of the bagging procedure and
the aggregation of ranked lists to build basic classifiers and a pruning procedure to determine the final
structure of the model, which allows adaptively adjusting the ensemble taking into account the type of
classified data.
Preliminary experiments on the TCGA data [23] showed that ensembles built
on heterogeneous data sources can substantially increase the accuracy of classification and
prediction of subtypes of complex diseases, since each data source describes the organism under study
from a different angle: gene expression data, ribonucleic acid (RNA) sequencing data, metabolic data,
gene copy number data, etc.
V. SPECIFIC FEATURES OF BIOMEDICAL COMPUTATIONS
Ensuring the reproducibility of calculations is a prerequisite for the reproducibility of scientific
research as a whole. The conditions for computational reproducibility are the availability of source
data, the ability to reproduce an identical computing environment (or an environment that does not
lead to different calculation results), and the availability of the results of computations.
Biomedical calculations have specific features that should be taken into account when planning them.
Let us mention some of them.
Modern biomedical calculations, especially those based on genome data, are very large and cumbersome.
"Classic" biomedical applications (PAML, Muscle, MAFFT, MrBayes, BLAST, etc.) and large
libraries with implementations of biomedical algorithms written in different programming languages
(C/C++, Java, R, Go, Scala, Haskell, Perl, Python, Ruby, Erlang, Julia, etc. [24]) are quite often
used simultaneously in one study. Moreover, biomedical calculations often involve methods of
artificial intelligence (machine learning, pattern recognition) and the corresponding libraries
(e.g., scikit-learn [8, 20]). Such a variety of software requires careful configuration of the
computing environment with control of the versions of the libraries used (dozens or even hundreds of
libraries may be involved).
Otherwise, the results of calculations may not be reproducible. On clusters, creating such
environments (separate for each user) and keeping them conflict-free is quite burdensome, unless
special software configuration tools such as Conda or Bioconda, or application containerization with,
for example, Singularity, are used. Most of the libraries and applications used in biomedical
computing do not make efficient use of parallel multithreaded computing on multi-core processors. At
the same time, many of them fit an "embarrassingly parallel" model, in which individual pieces of
data are processed in parallel by identical instances of computational processes without passing
messages between them (for example, using Apache Hadoop technology) [8].
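The "embarrassingly parallel" pattern can be shown on a toy task. In the sketch below each chunk of sequences is handled by the same function, and the workers never exchange data. The GC-content computation is only a placeholder workload, and a thread pool is an assumption made to keep the example portable; on a cluster each chunk would typically be a separate process or job.

```python
from concurrent.futures import ThreadPoolExecutor


def gc_content(sequence):
    """Fraction of G and C letters in a nucleotide sequence (0.0 if empty)."""
    if not sequence:
        return 0.0
    s = sequence.upper()
    return (s.count("G") + s.count("C")) / len(s)


def process_chunk(chunk):
    """Identical work applied to one independent piece of the data."""
    return [gc_content(seq) for seq in chunk]


def split(items, n_chunks):
    """Cut the input into at most n_chunks consecutive pieces."""
    size = max(1, -(-len(items) // n_chunks))  # ceiling division
    return [items[i:i + size] for i in range(0, len(items), size)]


def run_parallel(sequences, n_workers=4):
    """Process all chunks concurrently; results keep the input order."""
    chunks = split(sequences, n_workers)
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        parts = list(pool.map(process_chunk, chunks))
    return [value for part in parts for value in part]
```

Because the chunks are independent, the same `process_chunk` function could be run by any number of identical workers, which is what makes horizontal scaling straightforward for such tasks.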
VI. TECHNOLOGIES THAT ENSURE THE REPRODUCIBILITY OF SCIENTIFIC CALCULATIONS
Taking into account the peculiarities of biomedical computing, reproducibility and their horizontal
scaling (the ability to increase the number of identical computing units to solve one problem) can be
achieved through the use of containerized applications, software pipeline computing and parameterization
of software environment.
Technologies of containerization of software applications. Containerization of biomedical
applications (Docker and Singularity containerization technologies) provides reproducibility of the
conditions in which the calculations took place (an unchanged software stack, including applications
and libraries) and the possibility of horizontal scaling, provided that the "embarrassingly parallel"
model is used, in cluster (Singularity) and cloud (Docker) calculations.
Technologies of software pipelining of calculations. A software pipeline makes it possible to
organize workflow calculations (calculations in which the inputs and outputs of processes are
interconnected). With tools for the automation of workflow calculations (workflow engines) such as
CWL (Common Workflow Language), GWL (Guix Workflow Language), Snakemake and Nextflow, a specific
calculation can be presented as a task (a text file, usually in YAML or JSON format) whose results
can be reproduced [3]. In addition, there are tools for creating and displaying such tasks as a
graph of processes and data flows. An example of such a tool is RABIX (Reproducible Analyses for
Bioinformatics), a graphical editor for CWL. Some pipeline tools also use containerization (for
example, CWL); such tasks can be performed both on a personal computer and in a cloud environment. An
important feature of workflow automation tools is that the task description syntax makes it possible
to specify the scale of the calculations by indicating the amount of resources required. Seven
Bridges' product, Cancer Genomics Cloud (CGC, see http://www.cancergenomicscloud.org/), is an example
of a cloud software platform for performing reproducible biomedical computations using
containerization and pipelining. It was the use of containerization in creating the application for
constructing a linear classifier at the V.M. Glushkov Institute of Cybernetics of the National
Academy of Sciences of Ukraine that made it possible to test it on real, very voluminous medical data
located at the CGC.
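The core idea of a workflow engine, running interconnected steps in dependency order, can be sketched in a few lines. The example below is a toy model built on the standard-library `graphlib` module (Python 3.9 or later); it does not reproduce the syntax or behaviour of CWL, Snakemake or Nextflow, and the step names used with it are hypothetical.

```python
from graphlib import TopologicalSorter


def run_workflow(steps, inputs):
    """Execute steps in dependency order.

    steps:  mapping name -> (function, list of names whose results it consumes)
    inputs: mapping name -> externally supplied value
    Returns all results and the order in which the steps were executed.
    """
    graph = {name: set(deps) for name, (_, deps) in steps.items()}
    results = dict(inputs)
    executed = []
    for name in TopologicalSorter(graph).static_order():
        if name in results:  # external input, nothing to compute
            continue
        func, deps = steps[name]
        results[name] = func(*(results[d] for d in deps))
        executed.append(name)
    return results, executed
```

A three-step chain (normalize, select, count) described as such a mapping is executed in the same order every time, which is the property that makes a task description reproducible; real engines add containers, resource requests and caching on top of this idea.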
Technologies for parameterization of software environment. Parameterization of the software
environment allows you to reproduce, if necessary, an identical computing environment. GNU Guix,
Conda, Bioconda are examples of tools that allow you to create an isolated software environment for
individual users in a cluster [8].
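Tools such as GNU Guix or Conda both record and recreate an environment. The recording half can be approximated with the Python standard library alone, as in the sketch below, which is an assumed illustration and not part of the described software. It captures the interpreter version, the platform and the versions of installed Python packages in a form that can be stored next to the results.

```python
import platform
from importlib import metadata


def snapshot_environment(packages=None):
    """Describe the current Python environment as a JSON-serializable dict.

    packages: optional list of distribution names to record; when omitted,
              every installed distribution is listed. Missing packages are
              recorded with the value None.
    """
    versions = {}
    if packages is None:
        for dist in metadata.distributions():
            name = dist.metadata["Name"]
            if name:
                versions[name] = dist.version
    else:
        for name in packages:
            try:
                versions[name] = metadata.version(name)
            except metadata.PackageNotFoundError:
                versions[name] = None
    return {
        "python": platform.python_version(),
        "platform": platform.platform(),
        "packages": dict(sorted(versions.items())),
    }
```

Saving `json.dumps(snapshot_environment())` together with each set of results gives a later reader the list of versions needed to rebuild a comparable environment.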
VII. CONCLUSIONS AND PROSPECTS OF FURTHER RESEARCH
At present, there is a range of technologies for ensuring the reproducibility of scientific
calculations in cloud and cluster environments. This makes it possible to create biomedical
applications adapted to these environments. As a result of our efforts, we have obtained a
computational basis that satisfies modern requirements for computational reproducibility.
The experience of using the developed linear classifier, gained during its testing on artificial and
real data, points to several advantages of the containerized form of the created
application. Namely:
• it provides access to real data located in a cloud environment,
• calculations for research problems can be performed on cloud resources both with the developed
tools and with cloud services,
• such a form of research organization makes numerical experiments reproducible, i.e. any other
researcher can compare the results of their own developments on specific data that have already been
studied by others, in order to verify the conclusions and the technical feasibility of new results,
• the developed tools can be used on technical devices of various classes,
from a personal computer to a powerful cluster.
The next steps of the project include the development of a common software interface for the
experimental prototype of the intelligent analytical system in order to integrate the developed
methods and software modules for biomedical data preprocessing, data clustering and classification.
It will make it possible to perform all the steps of data analysis within a single framework and to
conduct research in the field of biomedicine. The hybrid classification model as a core of the
intelligent system will make it possible to
integrate multidimensional, heterogeneous biomedical data with the aim to better understand the molecular
courses of disease origin and development, to improve the identification of disease subtypes and disease
prognosis. Much attention will be paid to the experimentation with different computation approaches on
real datasets taking into account the reproducibility of results.
References
1. Vorontsov K.V. Mathematical methods of learning by precedents (Machine Learning Theory) (in Russian)
http://www.machinelearning.ru/wiki/images/6/6d/Voron-ML-1.pdf
2. Gupal A.M., Sergienko I.V. Symmetry in DNA. Methods for Discrete Sequences Recognition. Kyiv. Naukova
Dumka, 2016. 227 p. (in Russian).
3. Baldi P., Wesley Hatfield G. DNA Microarrays and Gene Expression. From Experiments to Data Analysis and
Modeling. Cambridge University Press, 2011.
4. Kuhn M., Johnson K. Applied predictive modeling. New York: Springer, 2013.
https://doi.org/10.1007/978-1-4614-6849-3
5. Heath L.S., Ramakrishnan N. (Eds.). Problem solving handbook in computational biology and bioinformatics. NY:
Springer Science & Business Media, 2010. https://doi.org/10.1007/978-0-387-09760-2
6. Ioannidis J. Why Most Published Research Findings Are False. PLoS Medicine. 2005. 2 (8). P. e124
https://doi.org/10.1371/journal.pmed.0020124
7. Baker M. Reproducibility crisis? Nature. 2016. 26 (533). P. 353-66.
8. Strozzi F., Janssen R., Wurmus R., Crusoe M.R. et al. Scalable workflows and reproducible data analysis for
genomics. In: Evolutionary Genomics, 2nd ed. New York, NY: Humana Press, 2019. P. 723–745.
https://doi.org/10.1007/978-1-4939-9074-0_24
9. Zhuravlev Y., Laptin Y., Vinogradov A., Zhurbenko N., Lykhovyd O., Berezovskyi O. Linear classifiers and selection
of informative features. Pattern Recogn. and Image Anal. 2017. 27 (3). P. 426–432.
https://doi.org/10.1134/S1054661817030336
10. Laptin Y., Zhuravlev Y., Vinogradov A. Comparison of Some Approaches to Classification Problems, and
Possibilities to Construct Optimal Solutions Efficiently. Pattern Recogn. and Image Anal. 2014. 24 (2). P. 189–195.
https://doi.org/10.1134/S1054661814020175
11. Zhurbenko N.G. Linear classifier and projection on polytope. Cybern. Syst. Anal. 2020. 56 (3). P. 1–8.
https://doi.org/10.1007/s10559-020-00264-3
12. Shor N.Z., Zhurbenko N.G. A minimization method using the operation of extension of the space in the direction of
the difference of two successive gradients. Cybernetics. 1971. 7 (3). P. 450–459. https://doi.org/10.1007/BF01070454.
13. Shor N.Z. Minimization Methods for Non-Differentiable Functions. Springer, 1985.
https://doi.org/10.1007/978-3-642-82118-9
14. Shor N.Z. Nondifferentiable Optimization and Polynomial Problems. London: Kluwer Acad. Publ, 1998.
https://doi.org/10.1007/978-1-4757-6015-6
15. Laptin Y.P. Exact penalty functions and convex extensions of functions in decomposition schemes in variables.
Cybernetics and Systems Analysis. 2016. 52 (1). P. 85–95. https://doi.org/10.1007/s10559-016-9803-8
16. Laptin Y.P., Bardadym T.A. Problems related to estimating the coefficients of exact penalty functions. Cybernetics
and Systems Analysis. 2019. 55 (3). P. 400-412. https://doi.org/10.1007/s10559-019-00147-2
17. Chang C.-C., Lin C.-J. LIBSVM - A Library for Support Vector Machines. https://www.csie.ntu.edu.tw/~cjlin/libsvm/
18. BLAS (Basic Linear Algebra Subprograms). http://www.netlib.org/blas/
19. LAPACK – Linear Algebra PACKage. http://www.netlib.org/lapack/
20. Free software machine learning library for the Python programming language. https://scikit-learn.org/stable/index.html
21. Tools for creation of isolated Linux-containers. https://www.docker.com/
22. The Cancer Genomics Cloud. http://www.cancergenomicscloud.org/
23. The Cancer Genome Atlas (TCGA).
https://www.cancer.gov/about-nci/organization/ccg/research/structural-genomics/tcga
24. Bonnal R., Yates A., Goto N., Gautier L. et al. Sharing Programming Resources Between Bio* Projects.
In: Evolutionary Genomics, 2nd ed., New York, NY: Humana Press, 2019. P. 747–766.
https://doi.org/10.1007/978-1-4939-9074-0_25
25. Novoselova N.A., Tom I.E. Integrated network approach to protein function prediction. The Scientific Journal of Riga
Technical University. Information Technology and Management Science. 2018. 21. P. 98–103.
https://doi.org/10.7250/itms-2018-0016.
26. Tom I.E. Information technologies in the analysis of medical data. Science and innovations. 2016. 3. P. 28–31.
27. Novoselova N.A., Tom I.E. Semi-supervised clustering with active constraint selection. Proc. XIII International
Conference "Pattern Recognition and Information Processing"- PRIP-2016, BSU, October 3–5, 2016. Minsk.
P. 69–72.
28. Novoselova N.A., Tom I.E. Methods of construction of genetic data clusters. Informatics. 2016. 1 (49). P. 64–74.
29. Novoselova N.A., Tom I.E. Algorithm for ranking features for detecting biomarkers in gene expression data. Artificial
Intelligence. 2013. 3. P. 58–68.
30. Novoselova N.A., Tom I.E., Borisov A., Polaka I. Feature ranking by classification accuracy estimation of multiple
data sample. Information Technology and Management Science. 2013. 16. P. 95–100.
https://doi.org/10.2478/itms-2013-0015
31. Kuncheva L.I. Combining Pattern Classifiers. Methods and Algorithms. Wiley. 2004.
https://doi.org/10.1002/0471660264
32. Novoselova N.A., Tom I.E., Ablameyko S.V. Evolutionary design of the classifier ensemble. Artificial Intelligence.
2011. 3. P. 429–48.
Received 16.04.2021
Tamara Bardadym,
Cand. Sci. (Phys. & Math.), Senior Researcher, V.M. Glushkov Institute of Cybernetics,
National Academy of Sciences of Ukraine, Kyiv,
Tamara.Bardadym@gmail.com
Vasyl Gorbachuk,
Dr. Sci. (Phys. & Math.), Head of Dept., V.M. Glushkov Institute of Cybernetics, National
Academy of Sciences of Ukraine, Kyiv,
Gorbachukvasyl@netscape.net
Natalia Novoselova,
Cand. Sci. (Engineering), Senior Researcher, United Institute of Informatics Problems,
National Academy of Sciences of Belarus, Minsk,
novosel@newman.bas-net.by
Sergiy Osypenko,
Software Engineer, V.M. Glushkov Institute of Cybernetics,
National Academy of Sciences of Ukraine, Kyiv,
baston888@gmail.com
Vadim Skobtsov,
Cand. Sci. (Engineering), Leading Researcher, United Institute of Informatics Problems
of the National Academy of Sciences of Belarus, Minsk,
vasko_vasko@mail.ru
Igor Tom,
Cand. Sci. (Engineering), Head of Lab., United Institute of Informatics Problems,
National Academy of Sciences of Belarus, Minsk,
tom@newman.bas-net.by
UDC 004.89
T.O. Bardadym 1 *, V.M. Gorbachuk 1, N.A. Novoselova 2, S.P. Osypenko 1, V.Yu. Skobtsov 2, I.E. Tom 2
On Biomedical Computations in Cluster and Cloud Environment
1 V.M. Glushkov Institute of Cybernetics of the NAS of Ukraine, Kyiv
2 United Institute of Informatics Problems of the NAS of Belarus, Minsk
* Correspondence: Tamara.Bardadym@gmail.com
Introduction. The publication summarizes the experience of using applied containerized software
tools in a cloud environment, gained by the authors during the project "Development of methods, algorithms
and an intelligent analytical system for processing and analysis of heterogeneous clinical and biomedical
data to improve the diagnosis of complex diseases", carried out by a team from the United
Institute of Informatics Problems of the NAS of Belarus and the V.M. Glushkov Institute of Cybernetics of
the NAS of Ukraine. In parallel, the features of biomedical data and the main approaches to their processing
and classification, implemented within the intelligent analytical system, are described, along with the
possibilities of implementing them as part of a containerized application.
The purpose of the paper. To describe modern technologies that ensure the reproducibility of numerical
experiments in this field, and tools aimed at integrating several sources of biomedical
information in order to improve the diagnosis and prognosis of complex diseases. Particular attention
is paid to the methods for processing data obtained from different sources of biomedical information that
are included in the intelligent analytical system.
Results. The experience of using applied containerized biomedical
software tools in a cloud environment is summarized. The reproducibility of scientific computing
and the capabilities of modern technologies for scientific calculations are discussed. The main approaches to
preprocessing and integration of biomedical data within the intelligent analytical system are described. The
developed hybrid classification model is the basis of the intelligent analytical system and is aimed at
integrating several sources of biomedical information.
Conclusions. The experience of using the developed classification module NonSmoothSVC, which is
part of the developed intelligent analytical system, gained during its testing on artificial
and real data, leads to the conclusion that the containerized form of the created application
offers several advantages. Namely:
• it allows access to real data located in the cloud environment;
• it makes it possible to perform calculations for research problems on cloud
resources using both the developed tools and cloud services;
• this form of research organization makes numerical experiments reproducible, i.e.
any other researcher can compare the results of their own developments on specific data that
have already been studied by others, in order to verify the conclusions and the technical feasibility of new results;
• the developed tools can be used universally on devices
of various classes, from a personal computer to a powerful cluster.
The hybrid classification model, as the core of the intelligent system, makes it possible to integrate
multidimensional, heterogeneous biomedical data in order to better understand the molecular pathways
of disease origin and development and to improve the identification of disease subtypes and disease prognosis.
Keywords: classifier, cloud service, containerized application, heterogeneous biomedical
data.