Parallel Relaxational Iterative Algorithm of GMDH
The paper suggests a parallel version of the generalized relaxational iterative algorithm with recurrent computations. Parallelization using TBB library allow getting 2.28х speedup using 2 cores and 4х speedup using 4 cores of Intel processor with hyper-threading technology. У роботі запропонована р...
Saved in:
| Published in: | Індуктивне моделювання складних систем |
|---|---|
| Date: | 2014 |
| Main Author: | |
| Format: | Article |
| Language: | English |
| Published: |
Міжнародний науково-навчальний центр інформаційних технологій і систем НАН та МОН України
2014
|
| Online Access: | https://nasplib.isofts.kiev.ua/handle/123456789/83991 |
| Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
| Journal Title: | Digital Library of Periodicals of National Academy of Sciences of Ukraine |
| Cite this: | Parallel Relaxational Iterative Algorithm of GMDH / А. Pavlov // Індуктивне моделювання складних систем: Зб. наук. пр. — К.: МННЦ ІТС НАН та МОН України, 2014. — Вип. 6. — С. 33-40. — Бібліогр.: 7 назв. — англ. |
Institution
Digital Library of Periodicals of National Academy of Sciences of Ukraine| _version_ | 1860259056744136704 |
|---|---|
| author | Pavlov, A. |
| author_facet | Pavlov, A. |
| citation_txt | Parallel Relaxational Iterative Algorithm of GMDH / А. Pavlov // Індуктивне моделювання складних систем: Зб. наук. пр. — К.: МННЦ ІТС НАН та МОН України, 2014. — Вип. 6. — С. 33-40. — Бібліогр.: 7 назв. — англ. |
| collection | DSpace DC |
| container_title | Індуктивне моделювання складних систем |
| description | The paper suggests a parallel version of the generalized relaxational iterative algorithm with recurrent computations. Parallelization using TBB library allow getting 2.28х speedup using 2 cores and 4х speedup using 4 cores of Intel processor with hyper-threading technology.
У роботі запропонована розпаралеленаверсія узагальненого релаксаційного ітераційного алгоритму з рекурентними обчисленнями. Розпаралелення за допомогою бібліотеки TreadingBuildingBlocks дозволило прискорити алгоритм в 2.28х на 2 ядрах і в 4х на 4 ядрах процесора Intel, що використовує технологію Hyper-Treading.
В работе предложена распараллеленная версия релаксационного итерационного алгоритма с рекуррентными вычислениями. Распараллеливание с помощью библиотеки Treading-BuildingBlocks позволило получить ускорение в 2.28х на2 ядрах и в4х на 4 ядрах процессора Intel, использующего технологиюHyper-Treading.
|
| first_indexed | 2025-12-07T18:53:36Z |
| format | Article |
| fulltext |
A.Pavlov
Індуктивне моделювання складних систем, випуск 6, 2014
33
УДК 681.513.8
PARALLEL RELAXATIONAL ITERATIVE ALGORITHM OF GMDH
Andrey Pavlov
International Research and Training Center of Information Technologies and Systems of National
Academy of Sciences of Ukraine, 03680, Kiev, av. Glushkova, 40, Ukraine
andriypavlove@gmail.com
У роботі запропонована розпаралеленаверсія узагальненого релаксаційного ітераційного
алгоритму з рекурентними обчисленнями. Розпаралелення за допомогою бібліотеки Tread-
ingBuildingBlocks дозволило прискорити алгоритм в 2.28х на 2 ядрах і в 4х на 4 ядрах проце-
сора Intel, що використовує технологію Hyper-Treading.
Ключові слова: TBB, OpenMP, багатопотокове розпаралелювання, узагальнений релакса-
ційний ітераційний алгоритм, метод групового урахування аргументів.
The paper suggests a parallel version of the generalized relaxational iterative algorithm with re-
current computations. Parallelization using TBB library allow getting 2.28х speedup using 2 cores
and 4х speedup using 4 cores of Intel processor with hyper-threading technology.
Key words: TBB, OpenMP, multithreading parallelization, generalized relaxational iterative al-
gorithm, group method of data handling.
В работе предложена распараллеленная версия релаксационного итерационного алгорит-
ма с рекуррентными вычислениями. Распараллеливание с помощью библиотеки Treading-
BuildingBlocks позволило получить ускорение в 2.28х на2 ядрах и в4х на 4 ядрах процессора
Intel, использующего технологиюHyper-Treading.
Ключевые слова:TBB, OpenMP, многопотоковое распараллеливание, обобщённый релак-
сационный итерационный алгоритм, метод группового учёта аргументов.
Introduction
Multicore processors aren’t new today.Probably every modern computer is
equipped with such a central processing unit today. This technical progress make en-
gineers, scientists and programmersreorganize their way of thinking anddeveloping
algorithms and methods according to principles of parallel programming.The algo-
rithms of the group method of data handling (GMDH) are highlytime-consuming but
can be easily parallelized.
An analysis of approaches to a parallelization of known iterative GMDH algo-
rithms was carried out in [1]. The [1] suggests new parallelization principles for one
of the fastest iterativeGMDH algorithm – generalized relaxational iterative algorithm
(GRIA) [2] and locates the most resource-intensive algorithm sections that are re-
quired to be parallelized.
This paper describes the problemsthat authorencountered when implementinga
parallel version of the algorithm and shows the results of the GRIA scalability.
The algorithm is written in С++, therefore, let us first, present a survey of to-
day’s C++ libraries for multithread parallelization.
Parallel relaxational iterative GMDH algorithm
Індуктивне моделювання складних систем, випуск 6, 2014 34
1. Overview of libraries for parallelization
The most well-known libraries are: Message Passing Interface (MPI) [3], Open
Multi-Processing(OpenMP) [4] andThreading Building Blocks (TBB) [5].
MPI is an API developed for message transferring thatallow exchange messages
between processes which perform one task. Primarily the interface designed for par-
allelizing of tasks between CPUs but not threads, therefore creating, deleting, syn-
chronizing and message transfer between threads shoulder a programmer. That is the
reason the interface is a quite low-level and complex, and isn’t easy to master in a
short term.
OpenMPis an open standard for program parallelization describing a set of
compiler directives, library procedures and environment variables which purposed for
creating multithread applications based on shared memory multiprocessor systems
[6]. The standard is implemented in the majority of well-known compilers (GCC,
Microsoft Visual Studio, Intel, IBM, Oracle) and allows writing parallel applications
easily just by parallelizing local code segments adding minimum changes to the
original code.
The library automatically manages threads and distribute tasks over the treads. A
programmer should only specify how exactly the tasks must be split between threads.
The standard has several parameterized ways for loop parallelization which can be
successfully used in a certain situation. The easiness of application of this library,
clarity of the final parallel code and simplicity of making changes arethe significant
advantages over the MPI.
TBBis a C++ template library for parallelism undertaking thread management
that allow (as OpenMP) directly specify the section of code which should be parallel-
ized. A programmer should think in terms of tasks, not threads when applying the li-
brary.
Unlike OpenMP, TBB automatically form packages of tasks which will beproc-
essedin corresponding threads. TBBbalances a workload of all available processor
cores as evenly as it is possible. This feature is a great advantage of the library over
other libraries. The mechanism consists in splitting overall set of the tasks in subsets
or packages (and assigningthem to the threads) until the workload of the cores will be
even. If a thread finished processing its tasks and there is a thread that still have un-
processed tasks, TBB balances total workload using its internal task stealing mecha-
nism. As OpenMP, the library has all necessary facilities for loop parallelization. The
advantages of TBB over OpenMP are:
• any data type support;
• ready to useclass templates, allowing memory access from several threads si-
multaneously;
• automatic detection and creation the optimal number of working threads in
runtime.
The library is ultimately easy to use and can be modified easily to any specific
tasks due to open source. Let us compare the performance of OpenMP and TBB in
taskof parallel computation of normal matrix – one of the most time-taking stage in
the GRIA.
A.Pavlov
Індуктивне моделювання складних систем, випуск 6, 2014
35
2. Performance comparison of OpenMPand TBB
Implementation of the normal matrix calculation in [1] consists in the execution
of nested loops where the limits of the inner loop depend on the limits of the outer.
This dependence prevents from parallelizing the body of the inner loop. Both libraries
require independence of iterations in a loop. Therefore the nested loop was replaced
by an equivalent single loop that allow to speed up the procedure in 3.5x, even in the
serial version: the normal matrixXTX (dim X is 5000 observations and 1000 vari-
ables) was calculated using the nested loop in 1 min 51 Sec and using the single loop
in 31 Sec.
Performance comparison was carried out when calculating a normal matrixW =
=XTX, dim X = 5000 × 1000. The Intel Core i3 M 350 2.27GHz processor having 2
physicaland 2 logical (due to hyper-threading technology)cores was used. The calcu-
lation was repeated 20 times. Figure 1 shows the average time of the matrix calcula-
tion.
Figure 1. Normal matrix calculation time using paralallization
As one can see, the calculation time when using the TBB is slightly less than in
case of using OpenMP.Should be stressed that the usage of two additional logical
cores hardly speed up the procedure, indicating that parallelization such a low-level
operations as multiplying and summing use the resources of the two physical core-
salmost completely.
The decision about using the TBB library for parallelization the GRIA was
taken, rooting in the results of the experiment and taking into account the advantages
of the TBB mentioned above.
3. Parallel relaxational iterative GMDH algorithm
The generalized relaxational iterative algorithmhastwo model structure genera-
tors: exhaustive search generator and directed search generator. Each of the generator
Parallel relaxational iterative GMDH algorithm
Індуктивне моделювання складних систем, випуск 6, 2014 36
defines a separate relaxational algorithm. This work as the [1] focuses on the paral-
lelization the algorithm based on the generator of exhaustive search.
According to [1], there can be allocated two the most time-consuming stages in
the algorithm: normal matrix calculation stage (is a common stage for every algo-
rithm in GRIA) and stage of iterations where models are built.
After the models have been built using the GRIA, they must be evaluated: mod-
els’ response anddifferent kinds of errors should be calculated on training, validation
and holdout sets. This process, called Errors calculation, turned to be quite time-
taking if the length of the sets is long and the Fchoice freedom parameter is great. For
example, if F is 400, the iterations number is 50 and the data length is 5000 observa-
tions, then the calculation oferror measures for 20,000 models takes near 15 sec. And
the longer is the data length, the longer is the time to perform this stage.
3.1 Parallelization of the normal matrix calculation stage
A scheme for generation and distribution the tasks over available computing
units (threads) on the normal matrix calculation stage was suggested in [1]. The
scheme was developed to balance the workload of the threads in case of using the
nested loop. The previous section of this paper says about the requirements in paral-
lelizing loops via TBB, especially of the independence of loop iterations. To satisfy
this condition, a new single loop was suggested that is equivalent to the nested
one.This single loop is the subject of parallelization here.
Due to automatic detection of optimal volume of tasks packages (number of
loop iterations) which will be processed in parallel and distribution/balancing of the
work-load between threads, a programmer shouldn’t carry about that.
There were parallelized the loops for calculation of the , matrices
and , vectors in this stage.Scalars , are left calculated seri-
ally.
3.2 Parallelization of the stage of iterations
The loop of building the models in serial version of the algorithm looks like:
while ( Model *model = generator->next() )
{
estimator->estimate( model );
criterion->calculate( model );
if ( !selector->select( model ) )
{
delete model;
}
}
TBB has tools for parallelization of while loop, but the bottleneck is that only
one thread at a time can access the method of the model generation generator-
>next (). This fact prevents gaining a good scalability of the stage. Therefor we de-
cide to reimplement the loop in a way that the bottom and upper bounds of the loop
are known.
A.Pavlov
Індуктивне моделювання складних систем, випуск 6, 2014
37
The idea is to create a separate model structure generator for every pack-
age of tasks and make it generatea particular set of model structure such that the sets
of any two generators will be nonoverlapping and the union of all the sets give the
original total set. To avoid simultaneous access to the methods estimate and cal-
culate of the objects estimator (estimation of model parameters) and criterion
(calculation of the criterion),an independent copy of these objects should be created
for every package of tasks.
The second task of parallelization of the loop is the method of model selection
select. The object selector contains a map of models ranked by minimizationof
criterion value. Once a model haslessercriterion value than the last in the map has
been added, the last model is removed from the map. This situation causesaccess vio-
lation errors if one thread has already deleted the last model from the map and an-
other one requests information about this model because the size of the map is not
updated yet up to this moment from the first thread.
There are two ways to resolve this problem: 1) synchronization of the threads
when accessing the select method of the selector object; 2) creation of independ-
ent copies of the selector object for every tasks package. The extraordinarily small
time of model building causes a high access frequency to the select method, which
will be a bottleneck as in case with parallelizing the while instruction.Thus, let us
concentrate on the second way.
Despite that the second way allows getting completely independent loop itera-
tions, it has a disadvantage however. If the copies of the objects generator, esti-
mator and criterion require a little volume of RAM, the selector stores F se-
lected models. Each copy of the selector object must store map of F models to
guarantee that the result of the serial version of the algorithm will not be lost. This
demand significantly increases consumption of RAM if the number of the packages is
large. Actually, expenditure of RAM increase as many times as number of packages
were generated. It is suggested to directly set the (minimal) number of packages
equal to the number of working threads, to minimize RAM consumption. This as-
signment (restriction) prevent sometimes to get even work-load of the cores and is a
drawback of the suggestion. Achievement the balance of cores work-load by increas-
ing the number of the packages up to the optimal level is one of the basic tasks of
TBB.
3.3 Parallelization of the errors calculation stage
The procedure of model errors calculation consists of a loop with a known
range. This loop uses the errorsCalculator object that calculates errors on training,
validation and holdout sets. The parallelization is realized similarly to the normal ma-
trix calculation stage, by creating copies of errorsCalculator object for every pack-
age of tasks.
4 Scalability test of the parallel version
Let us compare time to run serial version and parallel under different number of
available cores. A test was conducted using two processors:
Parallel relaxational iterative GMDH algorithm
Індуктивне моделювання складних систем, випуск 6, 2014 38
• Intel Core i3 M 350 2.27GHz processor with 4cores
(2physicaland2logical);
• Intel Core i7 4700 HQ 2.4GHz processor with 8cores (4 physicaland 4
logical).
The input matrix X contained 4000 observations, 10 true variables and 990 false
variables. It was generated using pseudo random generator Mersennetwister [7]. Val-
ues were generated in the [0; 1] interval by uniform distribution law. The set of ob-
servations was divided into holdout set of 500 observations, training set of 3000 ob-
servations and validation set of 500 instances. Parameters of the algorithm such as
choice freedom F and the number of iterations R were set as 400 and 50 correspon-
dently. The algorithm should findthe model
y = 6.29447 + 9.37736x1 + 8.26752x2– 8.04919x3 + 8.11584x4– 7.46026x5 –
-7.29046x6 + 6.70017x7–5.57932x8– 3.83666x9 +2.64719x10.
The algorithm has found the model
= 6.29447 + 9.37736x1 + 8.26752x2– 8.04919x3 + 8.11584x4– 7.46026x5 –
–7.29046x6 + 6.70017x7–5.57932x8– 3.83666x9 +2.64719x10 + 3.5·10-8x145 +
+1.4·10-9x258.
Figures 2 and 3 show the results of algorithm scalability bythe three algorithm
stages mentioned above.
Figure 2. Scalability of the algorithm stages using Intel Core i3
A.Pavlov
Індуктивне моделювання складних систем, випуск 6, 2014
39
Figure 3. Scalability of the algorithm stages using Intel Core i7
As you can see from the figures, TBB allows obtaining almost equal time of the
serial and the parallel version using one core.Note that, usage of physical cores al-
lows speeding up every stage in 2x using 2 cores of Core i3 and in 1.86x using 2
cores of Core i7. Scalability of the stages differs when using 4 cores of Core i7: the
iterations stage has been speeded up in 2.9x, the normal matrix calculation stage – in
3.15x and the errors calculation stage – in 3.55x.
The scalability is week when adding logical cores what isin concordance with
the results of libraries performance tests: usage of 4 cores of Core i3 speedups the al-
gorithm only in 2.3x, and usage of 8 cores of Core i7 – 3.8x for the iterations stage,
4.39х for the normal matrix calculation stage and 4.26х for the model errors calcula-
tion stage.
Let’s look at the scalability of the whole algorithm (fig. 4).
As you see, TBB speedups the algorithm in:
• 2хusing 2 cores and 2.28х using 4 cores of Core i3;
• 1.86хusing 2 cores, 3хusing 4 coresand 4хusing 8 cores of Core i7.
Parallel relaxational iterative GMDH algorithm
Індуктивне моделювання складних систем, випуск 6, 2014 40
Figure 4. Algorithm scalability
5 Conclusion
Speeding up the algorithm in 2.28х and 4х is even more than ideal result, taking
into account that processors Core i3 and Core i7 have only two and fourphysical
corescorrespondently.
References
1. Pavlov А.V. Principles of parallel computations of relaxational iterative
GMDH algorithm /«Inductive modeling of complex systems», collection of sc.works,
№ 5 – K.: IRTCITS, 2013. – P.220-225. (in Russian)
2. PavlovА.V. GeneralizedrelaxationaliterativeGMDHalgorithm // Inductive-
modelingofcomplexsystems. Col. sc. works, iss. 2. – K.: IRTCITSNASU, 2011. – P.
95-108. (in Russian)
3. Internetresource http://en.wikipedia.org/wiki/Message_Passing_Interface
4. Internetresource http://openmp.org/wp/
5. Internetresource https://www.threadingbuildingblocks.org/
6. Internetresource http://en.wikipedia.org/wiki/Symmetric_multiprocessing
7. Internetresource http://en.wikipedia.org/wiki/Mersenne_twister
|
| id | nasplib_isofts_kiev_ua-123456789-83991 |
| institution | Digital Library of Periodicals of National Academy of Sciences of Ukraine |
| issn | XXXX-0044 |
| language | English |
| last_indexed | 2025-12-07T18:53:36Z |
| publishDate | 2014 |
| publisher | Міжнародний науково-навчальний центр інформаційних технологій і систем НАН та МОН України |
| record_format | dspace |
| spelling | Pavlov, A. 2015-07-02T06:54:47Z 2015-07-02T06:54:47Z 2014 Parallel Relaxational Iterative Algorithm of GMDH / А. Pavlov // Індуктивне моделювання складних систем: Зб. наук. пр. — К.: МННЦ ІТС НАН та МОН України, 2014. — Вип. 6. — С. 33-40. — Бібліогр.: 7 назв. — англ. XXXX-0044 https://nasplib.isofts.kiev.ua/handle/123456789/83991 681.513.8 The paper suggests a parallel version of the generalized relaxational iterative algorithm with recurrent computations. Parallelization using TBB library allow getting 2.28х speedup using 2 cores and 4х speedup using 4 cores of Intel processor with hyper-threading technology. У роботі запропонована розпаралеленаверсія узагальненого релаксаційного ітераційного алгоритму з рекурентними обчисленнями. Розпаралелення за допомогою бібліотеки TreadingBuildingBlocks дозволило прискорити алгоритм в 2.28х на 2 ядрах і в 4х на 4 ядрах процесора Intel, що використовує технологію Hyper-Treading. В работе предложена распараллеленная версия релаксационного итерационного алгоритма с рекуррентными вычислениями. Распараллеливание с помощью библиотеки Treading-BuildingBlocks позволило получить ускорение в 2.28х на2 ядрах и в4х на 4 ядрах процессора Intel, использующего технологиюHyper-Treading. en Міжнародний науково-навчальний центр інформаційних технологій і систем НАН та МОН України Індуктивне моделювання складних систем Parallel Relaxational Iterative Algorithm of GMDH Article published earlier |
| spellingShingle | Parallel Relaxational Iterative Algorithm of GMDH Pavlov, A. |
| title | Parallel Relaxational Iterative Algorithm of GMDH |
| title_full | Parallel Relaxational Iterative Algorithm of GMDH |
| title_fullStr | Parallel Relaxational Iterative Algorithm of GMDH |
| title_full_unstemmed | Parallel Relaxational Iterative Algorithm of GMDH |
| title_short | Parallel Relaxational Iterative Algorithm of GMDH |
| title_sort | parallel relaxational iterative algorithm of gmdh |
| url | https://nasplib.isofts.kiev.ua/handle/123456789/83991 |
| work_keys_str_mv | AT pavlova parallelrelaxationaliterativealgorithmofgmdh |