Порівняльний аналіз ефективності використання дрібнозернистого та вкладеного паралелізму для збільшення пришвидшення паралельних обчислень у багатоядерних комп’ютерних системах

The article presents a comparative analysis of the effectiveness of using parallelism of varying granularity degrees in modern multicore computer systems using the most popular programming languages and libraries (such as C#, Java, C++, and OpenMP). Based on the performed comparison, the possibiliti...

Повний опис

Збережено в:
Бібліографічні деталі
Дата:2022
Автори: Martell, Valerii, Korochkin, Aleksandr, Rusanova, Olga
Формат: Стаття
Мова:Англійська
Опубліковано: The National Technical University of Ukraine "Igor Sikorsky Kyiv Polytechnic Institute" 2022
Теми:
Онлайн доступ:https://journal.iasa.kpi.ua/article/view/251975
Теги: Додати тег
Немає тегів, Будьте першим, хто поставить тег для цього запису!
Назва журналу:System research and information technologies
Завантажити файл: Pdf

Репозитарії

System research and information technologies
_version_ 1867334422185902080
author Martell, Valerii
Korochkin, Aleksandr
Rusanova, Olga
author_facet Martell, Valerii
Korochkin, Aleksandr
Rusanova, Olga
author_institution_txt_mv [ { "author": "Valerii Martell", "institution": "National Technical University of Ukraine \"Igor Sikorsky Kyiv Polytechnic Institute\", Kyiv" }, { "author": "Aleksandr Korochkin", "institution": "National Technical University of Ukraine \"Igor Sikorsky Kyiv Polytechnic Institute\", Kyiv" }, { "author": "Olga Rusanova", "institution": "National Technical University of Ukraine \"Igor Sikorsky Kyiv Polytechnic Institute\", Kyiv" } ]
author_sort Martell, Valerii
baseUrl_str http://journal.iasa.kpi.ua/oai
collection OJS
datestamp_date 2022-10-17T22:12:39Z
description The article presents a comparative analysis of the effectiveness of using parallelism of varying granularity degrees in modern multicore computer systems using the most popular programming languages and libraries (such as C#, Java, C++, and OpenMP). Based on the performed comparison, the possibilities of increasing the efficiency of computations in multicore computer systems by using combinations of medium- and fine-grained parallelism were also investigated. The results demonstrate the high potential efficiency of fine-grained parallelism when organizing intensive parallel computations. Based on these results, it can be argued that, in comparison with more traditional parallelization methods that use medium-grain parallelism, the use of separately fine-grained parallelism can reduce the computation time of a large mathematical problem by an average of 4%. The use of combined parallelism can reduce the computation time of such a problem to 5,5%. This reduction in execution time can be significant when performing very large computations.
doi_str_mv 10.20535/SRIT.2308-8893.2022.2.03
first_indexed 2025-07-17T10:27:42Z
format Article
fulltext  V. Martell, A. Korochkin, O. Rusanova, 2022 Системні дослідження та інформаційні технології, 2022, № 2 45 UDC 004.438:519.685.8 DOI: 10.20535/SRIT.2308-8893.2022.2.03 COMPARATIVE ANALYSIS OF THE EFFECTIVENESS OF USING FINE-GRAINED AND NESTED PARALLELISM TO INCREASE THE SPEEDUP OF PARALLEL COMPUTING IN MULTICORE COMPUTER SYSTEMS V. MARTELL, A. KOROCHKIN, O. RUSANOVA Abstract. The article presents a comparative analysis of the effectiveness of using parallelism of varying granularity degrees in modern multicore computer systems using the most popular programming languages and libraries (such as C#, Java, C++, and OpenMP). Based on the performed comparison, the possibilities of in- creasing the efficiency of computations in multicore computer systems by using combinations of medium- and fine-grained parallelism were also investigated. The results demonstrate the high potential efficiency of fine-grained parallelism when organizing intensive parallel computations. Based on these results, it can be argued that, in comparison with more traditional parallelization methods that use medium- grain parallelism, the use of separately fine-grained parallelism can reduce the com- putation time of a large mathematical problem by an average of 4%. The use of combined parallelism can reduce the computation time of such a problem to 5,5%. This reduction in execution time can be significant when performing very large computations. Keywords: multicore computer system, core, thread, tasks, parallelism, granularity, fork-join, speedup coefficient, fine-grained parallelism, nested parallelism, com- bined parallelism. INTRODUCTION Classical von Neumann architecture is not designed for parallel computing; all commands execute in one sequence strictly one after another. Each such individ- ual sequence that operates on the machine at the current time is a process. ISO 9000:2000 [1] defines a process as a set of interconnected and interacting actions that convert input data into output. A computer program in itself is only a passive sequence of instructions, while a process is the direct execution of those instruc- tions. The concept of process is inextricably linked with the concept of thread. The execution thread is the smallest unit of processing, the execution of which can be assigned by the operating system kernel [2]. Implementation of execution threads and processes in different operating systems differs from each other, but in most cases, the execution thread is within the process. Multiple threads can exist within the same process and share resources such as memory, while processes do not share these resources. In particular, the execution thread hare process instructions (its code) and its context (the values of the variables they have at any given time). Thus, in the framework of von Neumann or single-core architecture, paral- lelism is usually realized by time multiplexing: the processor switches between different threads. This context switching is called pseudo-parallelism and usually V. Martell, A. Korochkin, O. Rusanova ISSN 1681–6048 System Research & Information Technologies, 2022, № 2 46 occurs often for the user to perceive the execution of threads or tasks as simulta- neous [3]. In such systems, there are effective time schedulers in standby mode (blocked) [4]. Scheduling is based on the principle of priorities, for example, in most cases, some user input has a higher priority than some computations, so if the central processor unit (CPU) receives an input signal, a priority interrupt oc- curs and the CPU processes its, which allows the user to control most of process, such as urgently terminate some of them. However, in the present stage, this approach no longer ensures compliance with the requirements for computer systems. And with the emergence of multi- core and multiprocessor architecture, the question arose of the organization of real parallelism, when at the point of time each individual core or processor performs its own thread. In multiprocessor and multi-core systems, threads or tasks can run simultaneously, with each processor or core processing a separate thread at the same time. The operating systems of such computers are much more complex and voluminous because, for such architectures to function effectively, their compo- nents must also exchange data (communicate and synchronize), and do so in a timely, fast, and with minimal computational downtime. Planners of such operat- ing systems plan the distribution of captures by the thread of cores both in time and space [4]. So, even parallel threads are not all the same. In addition to the level of their priority, you can enter another measure for them – quantitative, one that is based on the volume of the basic unit in the program. PROBLEM ANALYSIS AND TASK STATEMENT Each individual elementary parallel computation in modern multi-core computer systems (MCS) can be represented as a granule. Depending on how many such elementary calculations (parallel within the system, but consecutive within one thread) each thread contains, and how many communications (i.e. interruptions of parallelism) were conducted between such threads, we can introduce the concept of “granularity”. Granularity is a measure of the ratio of the number of calculations performed in a parallel problem to the number of communications [5]. The degree of granu- larity varies from fine-grained to coarse-grained. It should be noted that the granularity classifications available in different sources differ slightly, especially for coarse- and medium-grained parallelism, so the following classification will be taken as a basis in this article: 1. Coarse-grained parallelism: each parallel calculation is quite independent of the others, and requires a relatively rare exchange of information with other calculations. The units of parallelism are large and independent programs that include thousands of commands [6]. 2. Medium-grained parallelism: units of parallelization are individual proce- dures that are called in separate threads and include hundreds of commands. It is usually organized by both the programmer and the optimizing compiler. Most general-purpose parallel computers are primarily focused on this category of par- allelism [7]. 3. Fine-grained parallelism: each parallel calculation is quite small and ele- mentary, consists of dozens of commands. Usually, the units that are parallelized are elements of expressions or individual iterations of a loop. They usually have Comparative analysis of the effectiveness of using fine-grained and nested parallelism … Системні дослідження та інформаційні технології, 2022, № 2 47 little or no relationship between the data. The amount of work associated with a parallel task is low and the work is evenly distributed among the processors. Hence, fine-grained parallelism facilitates load balancing [8]. The very term “fine-grained parallelism” refers to the simplicity and speed of any computational action. A characteristic feature of fine-grained parallelism is the approximate equality of computational intensity and data exchange. This level of parallelism is often used by the parallelizing (vectorizing) compiler [9], as well as recently in asynchronous programming. This work focuses on medium- and fine-grained parallelism. Fine-grained parallelism has a long history: it is the most “ancient” kind of parallelism. The development of his theory took place simultaneously with the development of the theory of successive calculations and is associated with the name of the already mentioned John von Neumann. His theoretical model of a calculator with fine-grained parallelism is widely known – “cellular automaton” [10]. But, when in the process of development of computer technology there were opportunities and capacities for the organization of full-fledged threads of me- dium-grained and coarse-grained parallelism, there was some decline in interest in fine-grained parallelism. However, the decline in interest changed when techno- logical advances, on the one hand, led to the fact that medium- and coarse-grained parallels somewhat exhausted their significant development, and on the other hand allowed to create a unified architecture within which at different stages of program implementation could move on parallelism of various degree of granu- larity, i.e. on some parts of programs to use the classical mechanism of threads, and on others fine-grained parallelisms, like tasks or parallel loops. Against the background of renewed interest in fine-grained parallelism in modern programming languages and libraries, along with the tools for organizing parallel computations by creating a set of classic full-fledged medium-grain threads, there are tools for computing within fine-grained parallelism. Most often, these are tools for parallelizing loops or small sets of similar commands. However, it also led to conflicts in the design stages of modern parallel software. Currently, there are two opposing approaches to creating the architec- ture of such software: on the one hand, you can represent the parallel part of the program in the format of medium-grained structures (classical threads), on the other hand, all calculations can be represented in the form of small tasks. At the same time, if we can say that the classical concept of threads is not very suitable for modern back-end problems, in contrast to the concept of tasks, then in the field of high-load and intensive scientific calculations, solving classical problems can be presented using both threads and tasks. At the same time, given the above-described nature of parallelism in modern MCS, as well as modern flexible tools for organizing different levels of parallelism in languages and libraries of parallel programming, we can assume that there is space and opportunities for effective solving of computational problems with simultaneous use of both mechanisms. This combination is called “Nested parallelism”. Therefore, the purpose of this work is to research the possibilities of improv- ing the execution time of parallel programs in MCS through the use of fine- grained parallelism, the optimal combination of medium and fine-grained paral- lelism (nested parallelism), as well as the use of modern software instruments of their implementation. V. Martell, A. Korochkin, O. Rusanova ISSN 1681–6048 System Research & Information Technologies, 2022, № 2 48 FINE-GRAINED PARALLELISM IMPLEMENTATIONS OVERVIEW Among the most popular languages and libraries currently used in high-load com- puting, fine-grained parallelism tools are implemented in the OpenMP library (which is usually used in combination with C++), Java and C# languages [11–14]. OpenMP Fine-grained parallelism is represented by a special preprocessor directive #pragma omp parallel for, which refers to the work-sharing directives. Such directives are not used only for parallel code execution; they are used for the logi- cal distribution of a group of threads to implement these control logic constructs. The #pragma omp for directive informs that when running a loop in parallel mode, loop iterations must be distributed between a group of threads. Execution of the following program code: 1. int size=100; //the number of calculation iterations 2. #pragma omp parallel for 3. for(int i = 0; i < size; i++) 4. Calculations(); 5. ShowResults(); in the four-processor system would happen as shown in Fig. 1. This distribution is used by default and is called static scheduling. Main thread size = 100 om p th re ad 1 It er at io ns 0 -2 4 C al cu la tio ns () om p th re ad 2 It er at io ns 2 5- 49 C al cu la tio ns () om p th re ad 3 It er at io ns 5 0- 74 C al cu la tio ns () om p th re ad 4 It er at io ns 7 5- 99 C al cu la tio ns () Main thread ShowResults() Fig. 1. The scheme of distribution loop iterations in parallel threads using OpenMP OpenMP also provides another, more flexible types of scheduling, such as dynamic scheduling, runtime scheduling, and guided scheduling. The special sec- tion is used to set up one of these modes. The following code scheme shows the format of this section: schedule (algorithm name) num_threads (number of iterations). This mechanism is used when different iterations perform different amounts of work and determine whether a thread that has already completed its iteration will be able to take over part of the work of another thread. Comparative analysis of the effectiveness of using fine-grained and nested parallelism … Системні дослідження та інформаційні технології, 2022, № 2 49 When writing more complex parallel programs, one of the key tasks to solve is the problem of thread synchronization. In OpenMP, implicit barrier synchroni- zation is at the end of each #pragma omp parallel and parallel for block. There are also manual synchronization tools, such as barriers that can be created using the #pragma omp barrier directive. With writing the parallel program, the second main task that needs to be figured out is the mutual exclusion problem, which avoids the situation so-called “race condition”, when a large number of threads will behave unpredictable or be blocked due to access to the same area of memory, which, for example, contains some variable (shared resource), which used in all threads. With the large number of threads created with the application of fine-grained parallelism, this problem is particularly acute. OpenMP provides the ability to use special nonblocking tools to solve the problem of mutual exclusion, such as the directive #pragma omp atomic, as well as modifiers of the directive #pragma omp parallel for: shared, private, firstprivate, lastprivate or reduction. Manual means of solving the problem of mutual exclusion are represented by such constructions as locks and critical sections. Java Developers are offered extremely flexible and powerful tools for implementing fine-grained parallelism based on the Fork-Join model, realized in built-in package java.util.concurent. A recursive algorithm is used to implement this model in Java, which described in the following paragraph: 1. The check on the possibility of dividing the actions of this thread into two smaller tasks. 2. If the check is successful, the distribution is performed (Fork), by creating new threads for each new task. In each new thread, the algorithm begins anew. The thread that performed the distribution is blocked until both created threads have finished their work, and then it performs the final collection of the result. 3. If the check is not successful (the limit of the so-called “grain of parallel- ism” is reached), then the calculations are performed in this thread, after which the connection with the generated thread occurs (Join). This implementation contains two classes: RecursiveAction and Recur- siveTask. When it is necessary to calculate a specific numeric value of a large function (such as the sum of vector elements), it is better to use Recursive Task. In the case of general operations, the results of which are not a specific number, RecursiveAction works better. By inheriting these classes and describing the computer’s own overload, the developer can customize the work to solve a spe- cific problem, which is extremely effective. An example of using the RecursiveAction class to implement fine-grained parallelism in Java organizing a parallel loop as an example is shown in the listing below: 1 class ParallelFor extends RecursiveAction {  2      private int from, to;  3      volatile final int GRAIN = 25;  4      public ParallelFor(int from, int to) {  5           this.from = from;  6           this.to = to;  7      }  8      protected void compute() {  V. Martell, A. Korochkin, O. Rusanova ISSN 1681–6048 System Research & Information Technologies, 2022, № 2 50 9           int len = to ‐ from;  10           // Stop condition of main recursion   11           if (len < GRAIN)   12                work(from, to);   13           else {  14                int mid = (from + to) >>> 1;  15                ForkJoinTask<Void> parallelFor1 =   16                     = new ParallelFor(from, mid).fork();  17                ForkJoinTask<Void> parallelFor2 =   18                     = new ParallelFor(mid, to).fork();  19                parallelFor1.join();  20                parallelFor2.join();  21           }  22     }  23 }  24 // Parallel loop startup is performed using the invoke()   25 // method called on the instance of the ForkJoinPool class   26 ForkJoinPool pool = new ForkJoinPool();   27 pool.invoke(new ParallelFor(from, to));  The variable GRAIN determines the depth of the partition, in other words, the amount of work, after achieving which, the thread will start to perform it, rather than making further parallelization. The work() method may contain certain basic parameterized calculations, which performance is expected from each Fork- Join task after reaching the maximum depth of parallelization. For example, after setting the initial values of the variables GRAIN = 25, from = 0, to = 100, we ob- tain a parallel execution of the loop of 100 iterations discussed in the previous subsection, in which each of the Fork-Join tasks will receive 25 iterations for processing. However, unlike OpenMP, the structure of the generated threads and their control hierarchy will be treelike, as shown in Fig. 2. Main thread Iterations 0-99 F or k- Jo in t hr ea d 3 It er at io ns 0 -2 4 C al cu la tio ns () Fork-Join thread 1 Iterations 0-49 Fork-Join thread 2 Iterations 50-99 F or k- Jo in t hr ea d 4 It er at io ns 2 5- 49 C al cu la tio ns () F or k- Jo in t hr ea d 5 It er at io ns 5 0- 74 C al cu la tio ns () F or k- Jo in th re ad 6 It er at io ns 7 5- 99 C al cu la tio ns () Fork() Join()Join() Fork() Fork() Join() Join()Join()Join() Fig. 2. The scheme of distribution loop iterations in Fork-Join parallelization using Java Comparative analysis of the effectiveness of using fine-grained and nested parallelism … Системні дослідження та інформаційні технології, 2022, № 2 51 The resulting structure is more complex than the one in OpenMP. Neverthe- less, according to the results provided in the next chapter, such an approach to fine-grained parallelism proved to be no less effective. C# C# also provides the ability to implement fine-grained parallelism. All the neces- sary functionality for this is contained by a static class Sys- tem.Threading.Tasks.Parallel, namely by its three main methods Paral- lel.For(), Parallel.ForEach(), Parallel.Invoke() and their various overloads. Parallel.For and Parallel.ForEach provide parallel execution of for and foreach loops, respectively. The override methods presented in this class are aimed to maximize the parameterization of parallelism, depending on the specific task be- ing implemented. Each method described above is based on a mechanism similar to the one used in Java. Nonetheless, it has also been simplified with a mechanism for dele- gating and anonymous methods. Developers are not required to manually write the entire recursive part and the regulation of grain parallelism, this is the respon- sibility of the execution environment. Therefore, in practice, the implementation of fine-grained parallelism becomes extremely simple and has almost no different from the classical single-threaded approach. The following piece of code provides a similar parallelization to the previous loop of a hundred iterations: 1 Parallel.For(0, 100, i => {  2      Calculation();        3 });  Although parallelization in C# occurs by the very same mechanism as in OpenMP, through the simplifying, that assures the virtual environment of the CLR execution, the resulting threads’ structure and their hierarchy of manage- ment will be similar to that in the OpenMP library. Also in C# are realized the tools for the organization of fine-grained parallel- ism manually, similar to corresponding tools in Java. The Task class (Sys- tem.Threading.Tasks) is responsible for this. The formed tasks can be performed in one or more threads. The official documentation of this programming language recommends using them primarily for asynchronous programming. In fact, the tools for organization parallel loops in C# discussed above are high-level abstrac- tions constructed using Tasks. NESTED (COMBINED) PARALLELISM This approach is based on the use of two types of parallelism in the parallel pro- gram: medium-grained and fine-grained. The program includes a set of traditional threads for the number of MCS cores. Each of these threads additionally imple- ments internal (fine-grained) parallelism by creating subthreads using appropriate Fork-Join tools. The initial number of traditional threads can be reduced in order to provide fine-grained parallelism with free processor resources. A parallel thread interaction scheme for such a program for the test system and the task (dis- cussed in the next section) implemented by OpenMP tools as an example, can be represented as shown in Fig. 3. V. Martell, A. Korochkin, O. Rusanova ISSN 1681–6048 System Research & Information Technologies, 2022, № 2 52 In pu t М A , М B #p ra gm a om p b ar ri er #p ra gm a om p b ar ri er . . . om p_ se t_ lo ck (& co py L o ck ); M C 1 = M C M C P = M C om p_ un se t_ lo ck (& co py L o ck ); L oc k C op yi ng s ha re d re so ur ce s O ut pu t М C . . . D at a in pu t D at a in pu t s yn ch ro n iz at io n C al cu la ti on s en di n g sy n ch ro ni za tio n R es ul t ou tp ut T hr ea d 1 T hr ea d i T hr ea d P Im pl ic it b ar ri er #p ra gm a om p p ar al le l fo r M A H /G = M B H /G * M C 1 M A H /G = M B H /G * M C 1 . . . N es te d p ar al le l se ct io ns P ar al le li za ti on o f th e co m p ut at io na l l oo p in e ac h in d iv id u al t hr ea d #p ra gm a om p p ar al le l fo r M A H /G = M B H /G * M C P M A H /G = M B H /G * M C P . . . Im pl ic it b ar ri er M ai n th re ad . . . . . . F ig . 3. T he s ch em e of t he i nt er ac tio n of t hr ea ds w ith in t he m od el o f ne st ed p ar al le lis m i m pl em en te d by O pe nM P t oo ls u si ng t he p ro bl em of tw o sq ua re ( N *N ) m at ri ce s m ul ti pl ic at io n as a n ex am pl e Comparative analysis of the effectiveness of using fine-grained and nested parallelism … Системні дослідження та інформаційні технології, 2022, № 2 53 Support for nested parallelism in OpenMP is enabled manually. There are two ways to do this: 1. By calling omp_nested true command before compiling the program.  2. By calling omp_set_nested(1) procedure in the program code.  Both methods set the environment variable omp_nested to the value of true. However, the second method is more preferable because it provides better port- ability of the code, because there is no guarantee that on computers where the code will run in the future, will be manually set the appropriate value of the vari- able omp_nested. Unlike OpenMP, Java and C# do not require any additional operations to ac- tivate compiler support for nested parallelism. EXPERIMENTAL TESTING Selection a problem for testing For the most transparent comparative testing, the problem of multiplying two square matrices of large dimension ( NN * elements of 64-bit type double) was chosen as an example of a typical problem from the field of high-load computing: MCMBMA  . (1) Due to the large number of elementary mathematical operations, which can be differently and in different quantities distributed between threads or tasks, the program to solve this problem can be properly implemented using all types of parallelism, while always maintaining high-intensity computations, with minimal downtime for synchronization and data exchange, which is important for more transparent comparative performance testing. In addition, the choice of this problem for comparative testing of different types of parallelism is also ideal in terms of its coverage of all models of data ex- change and interaction between threads or tasks. In medium-grained parallelism, it contains the required fragment of the copy of the shared resource (matrix MC) between the threads. In fine-grained parallelism, there is an interaction of tasks without copying shared resources (a direct reference to matrices elements), which is just more typical for applied implementations of this type of parallelism. And in nested parallelism, these two approaches of parallel interaction with shared re- sources are combined. Therefore, it can be argued that this problem contains all the characteristic cases that occur in solving other common mathematical prob- lems in parallel programs. Description of the mathematical model of the test problem The mathematical algorithm for performing this problem (1) is reduced to N-fold repetition of the calculation:    N k jkkiij cba 1 ,, , (2) where ijijij cba ,, are the corresponding elements of the ith row and jth column of the matrices MA, MB, MC; i and j lie in the range from 1 to N . Therefore, the total number of elementary operations that must be performed to obtain solution (1) is equal to 3N . V. Martell, A. Korochkin, O. Rusanova ISSN 1681–6048 System Research & Information Technologies, 2022, № 2 54 With this in mind, the following parallel mathematical algorithm was chosen [15]: MCMBMA HH  , (3) where H is a rounded up number equal to the quotient of the division PN / and MAH, HMB are the corresponding rectangular matrices of dimensions H by N elements of the matrices MA and MB [15]. It is worth noting that in this case, the variable P can be either equal to the number of cores available in the system (medium-grained parallelism), or be less than this number (no parallelism, some variants of nested parallelism), or signifi- cantly larger than it (fine-grained parallelism). Each thread implements H repetitions of the algorithm (2), while calculating the Hth part (3) of the total program’s result. Since this algorithm provides for frequent access to all elements of the MC matrix from each thread, an important point is that when creating threads, each of them gets its own instance of this matrix, which eliminates conflicts between threads for capturing and owning shared resources. However, such copying is valid only for medium-grained parallelism; for fine-grained and nested parallelism, such copying does not occur. Description of the test software and hardware complex Testing of programs was carried out in two identical MCSs. Their main character- istics are shown in Table 1. T a b l e 1 . The main characteristics of the test MCSs Hardware Processor AMD Phenom II Processor architecture K10 Number of cores 6 RAM capacity 8 Gb RAM type DDR3 Sofrware Operation system Windows 7 OpenMP version 3.1 C++ compiler MC++ (MSVC) JVM version 1.8 .NET Framework version 4.7 The software structure is aligned with the hardware structure (primarily for medium-grained part of comput- ing), and it includes both manual and automatic scalability. The schemati- cally formed structure of the hardware and software test complex is shown in Fig. 4. The following symbols are in- troduced into this diagram: 1. )(PE i — the processing ele- Fig. 4. The test hardware and software complex schematic structure Comparative analysis of the effectiveness of using fine-grained and nested parallelism … Системні дослідження та інформаційні технології, 2022, № 2 55 ment (processor or core). Its index corresponds to the number of this element, and lies in the range from 1 to P inclusive, where P is the number of all processing elements in the test system. At the software level, each physical processor ele- ment corresponds to a software-generated thread. 2. CM — the common (shared) memory (RAM for example), to which each processor element is connected, and with the help of which they communicate with each other. 3. I/O - an I/O device that is connected to one of the processor elements (or to one of the cores, since in fact the processing of I/O signals will ultimately be processed by one of the physical cores of the system), and which provides input of initial data and output of results. 4. MA, MB, and MC are matrices that make up the multiplication operation (1). Moreover, the MA and MB matrices are multipliers and are entered from the I/O device (for large dimensions, random input is used, the execution time of which is not taken into account in the overall test result), and the MC matrix is the result that is output by this device at the end of all calculations (the execution time measurement is stopped just before the output). Tests results Table 2 shows the results of testing parallel programs for the operation of matrix multiplication for different values of N (dimension of matrices). Presented the execution time of programs that were built with:  using medium-grained parallelism through the thread mechanism;  using fine-grained parallelism through the parallel loops and/or fork-join mechanisms described in the previous section;  using nested (combined) parallelism, when each thread additionally uses parallel processing through the parallel loops and/or fork-join constructs. Based on the actual time indicators obtained, the acceleration coefficients (speedup coefficients) of the considered programs are calculated. The speedup coefficient of a parallel program is the ratio of the execution time of a program without parallelism on one computing core to the execution time of a similar pro- gram with parallelism on P computing cores and shows how much the program execution time is reduced in a parallel system [15]. Calculated using the formula pT T SC 1 , where 1T is the actual running time of the program without applying parallelism, and pT is the actual running time of a similar program using parallelism in P computing cores. Ideally, this coefficient is equal to the number of cores, but in practice, this coefficient is always several tenths less than the number of cores. This is due to the presence of other background processes in the system, which also occupy a certain place in the kernel operation plan. Below are graphs (Fig. 5) showing the dependence of the speedup coeffi- cients for all three types of parallelism on the values of N. V. Martell, A. Korochkin, O. Rusanova ISSN 1681–6048 System Research & Information Technologies, 2022, № 2 56 Fig. 5. The dependence of the average speedup coefficients of test parallel programs on the type of parallelism used in them, and on the software tools for its implementation: OpenMP (via C ++) (a), Java (b) and C# (c) in solving the problem of two square matrices multiplication Additionally, more detailed testing of fine-grained parallelism was con- ducted in order to identify ways to increase its efficiency. At this stage, those test programs from the developed package were tested, which were written using only fine-grained parallelism. Multiple measurements of the time of the multiplication operation of two matrices with a dimension of 1500*1500 elements were per- formed. The dependence of execution time on the number of software imple- mented threads was checked. Fig. 6 demonstrate the detected time dependence on the number of created tasks. More detailed results are provided in Table 2 and 3. Fig. 6. The dependence of fine-grained parallelism execution time on the number of created threads Comparative analysis of the effectiveness of using fine-grained and nested parallelism … Системні дослідження та інформаційні технології, 2022, № 2 57 T a b l e 2 . The results of testing the performance of test parallel programs de- veloped using different types of parallelism (or without using parallelism – as a control sample) and various software tools for its implementation in solving the problem of two square matrices multiplication Computing time (sec) Medium-grained Fine-grained N OpenMP Java C# OpenMP Java C# 516 1,7 0,2 1,2 1,6 0,2 1,1 1032 13,3 3,3 9,8 12,8 2,9 9,8 1548 46,7 14,0 35,6 45,9 12,1 34,1 2064 111,7 38,5 79,8 110,8 33,5 78,1 Computing time (sec) Nested Non-parallel N OpenMP Java C# OpenMP Java C# 516 1,5 0,2 1,1 7,7 8,0 5,2 1032 12,4 3,1 9,5 66,1 66,9 45,5 1548 44,9 13,0 33,8 240,6 250,2 171,4 2064 108,1 35,5 74,8 586,5 640,1 387,1 T a b l e 3 . The results of fine-grained parallelism testing in solving the problem of two square matrices multiplication using various software tools for its im- plementation Computing time (sec) Number of threads C# Java OpenMP 2 71,309 133,944 190,618 3 55,1 101,41 125,12 4 42,099 77,393 79,894 5 37,202 60,942 68,167 6 34,222 47,532 46,157 10 34,113 45,551 47,202 12 32,841 45,117 46,011 15 31,405 44,37 45,7 20 30,404 44,33 47,196 25 30,511 43,101 46,21 30 28,329 41,9 43,15 50 27,683 40,88 43,01 CONCLUSIONS AND FUTURE WORK The results obtained during testing showed the effectiveness of MCS in the im- plementation of the considered mathematical problem solution by using Java and C# languages and OpenMP library. Additionally, reduction of the programs exe- cution time with the application of the parallelism of any degree of grain is possi- ble (speedup coefficient values are in the range of 4,0–5,5). The best result in terms of program execution time was obtained for the C#. V. Martell, A. Korochkin, O. Rusanova ISSN 1681–6048 System Research & Information Technologies, 2022, № 2 58 Medium-grained parallelism showed sufficient efficiency, but had the worst result. At the same time, the speedup achieved by OpenMP tools is constantly increasing with the amount of data processed. While using the C# and the Java tools, speedup remains at approximately the same level. Using fine-grained parallelism was also effective, in which case the speedup coefficient increases steadily with increasing data volume. This type of parallel- ism was most effective in C#. Furthermore, additional testing of fine-grained parallelism revealed a declin- ing exponential dependence between the number of threads, which is allowed to create by the program, and the time of its operation. As mentioned in Section 3, the main reason for this is a larger number of fine-grained tasks (threads) that ac- cess physical processing elements (cores) and shared resources, which leads to significant downtime due to the problem of mutual exclusion. The optimal num- ber of fine-grained tasks is in the range of 5 to 20, regardless of how it was organized. Nested (combined) parallelism showed its effectiveness and allowed to in- crease speedup coefficient when it is used in C# language and OpenMP library. Moreover, there is an increase of speedup coefficient with increasing amount of processed data, which is one of the most important arguments for the feasibility of this approach in the MCS. It could be assumed that the efficiency of using nested parallelism will in- crease with an increasing number of cores in the MCS, where: 1. There will be additional processor resources for its implementation. 2. It is possible to reduce the size of grains. 3. There will be an optimal ratio between the number of streams and sub- threads. Besides, the efficiency of nested parallelism implemented in OpenMP can be improved by more efficient implementation of the powerful sub-threads manage- ment system embedded in the library, similar to how it was done in the fork-join model. Since the identified patterns of change in program execution time and speedup coefficients, in general, are preserved for all considered means of orga- nizing parallelism, it can be argued that this approach will be effective regardless of the language or library by which it will be organized. Therefore, it can be argued that the use of fine-grained and/or combined (nested) parallelism in most cases is an effective approach to the implementation of parallel computing in multi-core computer systems. Possible directions of work continuation: 1. Testing formed hypotheses in larger and more powerful multi-core com- puter systems. 2. Testing formed hypotheses on a number of more applied problems. 3. Check the statement about the effectiveness of fine-grained and combined parallelism, regardless of the instruments of its organization, compared with me- dium-grained parallelism. From the point of view of the development of high-level instruments of fine- grained parallelism realization in modern parallel programming languages and libraries: 1. Adaptation of existing computational planning methods to the realities of fine-grained and nested parallelism. 2. Adaptation of existing effective processing queue management policies to the realities of fine-grained and nested parallelism. Comparative analysis of the effectiveness of using fine-grained and nested parallelism … Системні дослідження та інформаційні технології, 2022, № 2 59 REFERENCES 1. ISO 9000:2000 Quality management systems – Fundamentals and vocabulary. Available: https://www.iso.org/standard/29280.html 2. L. Lamport, “How to Make a Multiprocessor Computer That Correctly Executes Multiprocess Programs”, IEEE Transactions on Computers, vol. C-28, no. 9, pp. 690–691, 1979. doi: 10.1109/TC.1979.1675439. 3. A.C. Wayne, S.J. Procter, and T.E. Anderson, “The nachos instructional operating system”, in USENIX Winter 1993 Con-ference (USENIX Winter 1993 Conference). San Diego, CA: USENIX Association, Jan. 1993. doi: 10.1.1.181.878. 4. B.S. Tanenbaum and H. Bos, Modern Operating Systems, 4th ed. USA: Prentice Hall Press, 2014. doi: 10.5555/2655363. 5. J.E. Moreira, D. Schouten, and C.D. Polychronopoulos, “The performance impact of granularity control and functional parallelism”, in Proceedings of the 8th Interna- tional Workshop on Languages and Compilers for Parallel Computing, ser. LCPC’95, pp. 581–597. Berlin, Heidelberg: Springer-Verlag, 1995. doi: 10.5555/645673.665710. 6. K. Hwang, Advanced Computer Architecture: Parallelism, Scalability, Programma- bility, 1st ed. McGraw-HillHigher Education, 1992. doi: 10.5555/541880. 7. R. Miller and Q.F. Stout, Parallel Algorithms for Regular Architectures: Meshes and Pyramids. Cambridge, Mass: MITPress, 1996. doi: 10.5555/249608. 8. B. Blaise, Introduction to Parallel Computing. [Online]. Available: https://hpc.llnl. gov/training/tutorials/introduction-parallel-computing-tutorial/ 9. J.L. Hennessy and D.A. Patterson, Computer Architecture, Fifth Edition: A Quanti- tative Approach. Morgan Kaufmann Publishers Inc., 2011. doi: 10.5555/1999263. 10. O. Bandman, “Composing fine-grained parallel algorithms forspatial dynamics simu- lation”, in Proceedings of the 8th International Conference on Parallel Computing Technologies, ser. PaCT’05, pp. 99–113. Berlin, Heidelberg: Springer-Verlag, 2005. doi: 10.1007/11535294_9. 11. D. Lea, “A java fork/join framework”, in Proceedings of the ACM2000 Conference on Java Grande, ser. JAVA ’00, pp. 36–43. New York, NY, USA: Association for Computing Machinery, 2000. doi: 10.1145/337449.337465. 12. P.E. Hadjidoukas, G.C. Philos, and V.V. Dimakopoulos, “Exploiting fine-grain thread parallelism on multicore architectures”, Sci. Program., vol. 17, no. 4, pp. 309–323, Dec. 2009. doi: 10.1155/2009/249651. 13. P. Czarnul, “Assessment of OpenMP master-slave implementations for selected ir- regular parallel applications”, Electronics, vol. 10, no. 10, 2021. doi: 10.3390/electronics10101188. 14. J. Ponge, Fork and Join: Java Can Excel at Painless Parallel Programming Too! [Online]. Available: https://www.oracle.com/technical-resources/articles/java/fork- join.html 15. O. Rusanova and A. Korochkin, “Scheduling problems for parallel and distributed systems”, Ada Lett., vol. XIX, no. 3, pp. 195–201, Sep. 1999. doi: 10.1145/319295.319323. Received 15.06.2022 INFORMATION ON THE ARTICLE Valerii V. Martell, ORCID: 0000-0002-1749-5818, National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, Ukraine, e-mail: va- lerii.martell@gmail.com Aleksandr V. Korochkin, ORCID: 0000-0003-4650-2316, National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, Ukraine, e-mail: avcora@gmail.com V. Martell, A. Korochkin, O. Rusanova ISSN 1681–6048 System Research & Information Technologies, 2022, № 2 60 Olga V. Rusanova, ORCID: 0000-0003-0145-3012, National Technical University of Ukraine “Igor Sikorsky Kyiv Polytechnic Institute”, Ukraine, e-mail: olga.rusanova.v@gmail.com ПОРІВНЯЛЬНИЙ АНАЛІЗ ЕФЕКТИВНОСТІ ВИКОРИСТАННЯ ДРІБНОЗЕРНИСТОГО ТА ВКЛАДЕНОГО ПАРАЛЕЛІЗМУ ДЛЯ ЗБІЛЬШЕННЯ ПРИШВИДШЕННЯ ПАРАЛЕЛЬНИХ ОБЧИСЛЕНЬ У БАГАТОЯДЕРНИХ КОМП’ЮТЕРНИХ СИСТЕМАХ / В.В. Мартелл, О.В. Корочкін, О.В. Русанова Анотація. Подано результати порівняльного аналізу ефективності викорис- тання паралелізму різного ступеня зернистості в сучасних багатоядерних комп’ютерних системах з використанням найпопулярніших натепер мов про- грамування та бібліотек (таких як C#, Java, C++ та OpenMP). Досліджено мож- ливості підвищення ефективності обчислень у багатоядерних комп’ютерних системах за допомогою комбінацій середньо- та дрібнозернистого паралеліз- му. Отримані результати демонструють високу потенційну ефективність вико- ристання дрібнозернистого паралелізму для організації інтенсивних паралель- них обчислень. На підставі цих результатів можна стверджувати, що порівняно з більш традиційними методами розпаралелювання, які використо- вують паралелізм із середньою зернистістю, використання окремо дрібнозер- нистого паралелізму може скоротити час обчислення великої тестової матема- тичної задачі в середньому на 4% , а використання комбінованого паралелізму — до 5,5%. Це скорочення часу виконання доцільне в разі виконання надвели- ких обчислень. Ключові слова: багатоядерна комп’ютерна система, ядро, потік, завдання, па- ралелізм, зернистість, fork-join, коефіцієнт пришвидшення, дрібнозернистий паралелізм, вкладений паралелізм, комбінований паралелізм.
id journaliasakpiua-article-251975
institution System research and information technologies
keywords_txt_mv keywords
language English
last_indexed 2025-07-17T10:27:42Z
publishDate 2022
publisher The National Technical University of Ukraine &quot;Igor Sikorsky Kyiv Polytechnic Institute&quot;
record_format ojs
resource_txt_mv journaliasakpiua/4f/c35689d5ed0327beb27a99ba82df4f4f.pdf
spelling journaliasakpiua-article-2519752022-10-17T22:12:39Z Comparative analysis of the effectiveness of using fine-grained and nested parallelism to increase the speedup of parallel computing in multicore computer systems СРАВНИТЕЛЬНЫЙ АНАЛИЗ ЭФФЕКТИВНОСТИ ИСПОЛЬЗОВАНИЯ МЕЛКОЗЕРНИСТОГО И ВЛОЖЕННОГО ПАРАЛЛЕЛИЗМА ДЛЯ УВЕЛИЧЕНИЯ УСКОРЕНИЯ ПАРАЛЕЛЬНЫХ ВЫЧИСЛЕНИЙ В МНОГОЯДЕРНЫХ КОМПЬЮТЕРНЫХ СИСТЕМАХ Порівняльний аналіз ефективності використання дрібнозернистого та вкладеного паралелізму для збільшення пришвидшення паралельних обчислень у багатоядерних комп’ютерних системах Martell, Valerii Korochkin, Aleksandr Rusanova, Olga multicore computer system core thread tasks parallelism granularity fork-join speedup coefficient fine-grained parallelism nested parallelism combined parallelism багатоядерна комп’ютерна система ядро потік завдання паралелізм зернистість fork-join коефіцієнт пришвидшення дрібнозернистий паралелізм вкладений паралелізм комбінований паралелізм The article presents a comparative analysis of the effectiveness of using parallelism of varying granularity degrees in modern multicore computer systems using the most popular programming languages and libraries (such as C#, Java, C++, and OpenMP). Based on the performed comparison, the possibilities of increasing the efficiency of computations in multicore computer systems by using combinations of medium- and fine-grained parallelism were also investigated. The results demonstrate the high potential efficiency of fine-grained parallelism when organizing intensive parallel computations. Based on these results, it can be argued that, in comparison with more traditional parallelization methods that use medium-grain parallelism, the use of separately fine-grained parallelism can reduce the computation time of a large mathematical problem by an average of 4%. The use of combined parallelism can reduce the computation time of such a problem to 5,5%. This reduction in execution time can be significant when performing very large computations. В статье представлены результаты сравнительного анализа эффективности использования параллелизма различной степени зернистости в современных многоядерных компьютерных системах с использованием самых популярных на сегодняшний день языков программирования и библиотек (таких как C#, Java, C++ и OpenMP). На основе проведенного сравнения исследованы возможности повышения эффективности вычислений в многоядерных компьютерных системах с помощью комбинаций средне- и мелкозернистого параллелизма. Полученные результаты показывают высокую потенциальную эффективность использования мелкозернистого параллелизма при организации интенсивных параллельных вычислений. На основе этих результатов можно утверждать, что по сравнению с более традиционными методами распараллелирования, использующими параллелизм со средней зернистостью, использование отдельно мелкозернистого параллелизма может сократить время вычисления большой тестовой математической задачи в среднем на 4%, а использование комбинированного параллелизма позволяет сократить время вычисления такой. задачи до 5,5%. Это сокращение времени выполнения может быть значительным при выполнении очень больших вычислений. Подано результати порівняльного аналізу ефективності використання паралелізму різного ступеня зернистості в сучасних багатоядерних комп’ютерних системах з використанням найпопулярніших натепер мов програмування та бібліотек (таких як C#, Java, C++ та OpenMP). Досліджено можливості підвищення ефективності обчислень у багатоядерних комп’ютерних системах за допомогою комбінацій середньо- та дрібнозернистого паралелізму. Отримані результати демонструють високу потенційну ефективність використання дрібнозернистого паралелізму для організації інтенсивних паралельних обчислень. На підставі цих результатів можна стверджувати, що порівняно з більш традиційними методами розпаралелювання, які використовують паралелізм із середньою зернистістю, використання окремо дрібнозернистого паралелізму може скоротити час обчислення великої тестової математичної задачі в середньому на 4% , а використання комбінованого паралелізму — до 5,5%. Це скорочення часу виконання доцільне в разі виконання надвеликих обчислень. The National Technical University of Ukraine &quot;Igor Sikorsky Kyiv Polytechnic Institute&quot; 2022-08-30 Article Article application/pdf https://journal.iasa.kpi.ua/article/view/251975 10.20535/SRIT.2308-8893.2022.2.03 System research and information technologies; No. 2 (2022); 45-60 Системные исследования и информационные технологии; № 2 (2022); 45-60 Системні дослідження та інформаційні технології; № 2 (2022); 45-60 2308-8893 1681-6048 en https://journal.iasa.kpi.ua/article/view/251975/261590
spellingShingle багатоядерна комп’ютерна система
ядро
потік
завдання
паралелізм
зернистість
fork-join
коефіцієнт пришвидшення
дрібнозернистий паралелізм
вкладений паралелізм
комбінований паралелізм
Martell, Valerii
Korochkin, Aleksandr
Rusanova, Olga
Порівняльний аналіз ефективності використання дрібнозернистого та вкладеного паралелізму для збільшення пришвидшення паралельних обчислень у багатоядерних комп’ютерних системах
title Порівняльний аналіз ефективності використання дрібнозернистого та вкладеного паралелізму для збільшення пришвидшення паралельних обчислень у багатоядерних комп’ютерних системах
title_alt Comparative analysis of the effectiveness of using fine-grained and nested parallelism to increase the speedup of parallel computing in multicore computer systems
СРАВНИТЕЛЬНЫЙ АНАЛИЗ ЭФФЕКТИВНОСТИ ИСПОЛЬЗОВАНИЯ МЕЛКОЗЕРНИСТОГО И ВЛОЖЕННОГО ПАРАЛЛЕЛИЗМА ДЛЯ УВЕЛИЧЕНИЯ УСКОРЕНИЯ ПАРАЛЕЛЬНЫХ ВЫЧИСЛЕНИЙ В МНОГОЯДЕРНЫХ КОМПЬЮТЕРНЫХ СИСТЕМАХ
title_full Порівняльний аналіз ефективності використання дрібнозернистого та вкладеного паралелізму для збільшення пришвидшення паралельних обчислень у багатоядерних комп’ютерних системах
title_fullStr Порівняльний аналіз ефективності використання дрібнозернистого та вкладеного паралелізму для збільшення пришвидшення паралельних обчислень у багатоядерних комп’ютерних системах
title_full_unstemmed Порівняльний аналіз ефективності використання дрібнозернистого та вкладеного паралелізму для збільшення пришвидшення паралельних обчислень у багатоядерних комп’ютерних системах
title_short Порівняльний аналіз ефективності використання дрібнозернистого та вкладеного паралелізму для збільшення пришвидшення паралельних обчислень у багатоядерних комп’ютерних системах
title_sort порівняльний аналіз ефективності використання дрібнозернистого та вкладеного паралелізму для збільшення пришвидшення паралельних обчислень у багатоядерних комп’ютерних системах
topic багатоядерна комп’ютерна система
ядро
потік
завдання
паралелізм
зернистість
fork-join
коефіцієнт пришвидшення
дрібнозернистий паралелізм
вкладений паралелізм
комбінований паралелізм
topic_facet multicore computer system
core
thread
tasks
parallelism
granularity
fork-join
speedup coefficient
fine-grained parallelism
nested parallelism
combined parallelism
багатоядерна комп’ютерна система
ядро
потік
завдання
паралелізм
зернистість
fork-join
коефіцієнт пришвидшення
дрібнозернистий паралелізм
вкладений паралелізм
комбінований паралелізм
url https://journal.iasa.kpi.ua/article/view/251975
work_keys_str_mv AT martellvalerii comparativeanalysisoftheeffectivenessofusingfinegrainedandnestedparallelismtoincreasethespeedupofparallelcomputinginmulticorecomputersystems
AT korochkinaleksandr comparativeanalysisoftheeffectivenessofusingfinegrainedandnestedparallelismtoincreasethespeedupofparallelcomputinginmulticorecomputersystems
AT rusanovaolga comparativeanalysisoftheeffectivenessofusingfinegrainedandnestedparallelismtoincreasethespeedupofparallelcomputinginmulticorecomputersystems
AT martellvalerii sravnitelʹnyjanalizéffektivnostiispolʹzovaniâmelkozernistogoivložennogoparallelizmadlâuveličeniâuskoreniâparalelʹnyhvyčislenijvmnogoâdernyhkompʹûternyhsistemah
AT korochkinaleksandr sravnitelʹnyjanalizéffektivnostiispolʹzovaniâmelkozernistogoivložennogoparallelizmadlâuveličeniâuskoreniâparalelʹnyhvyčislenijvmnogoâdernyhkompʹûternyhsistemah
AT rusanovaolga sravnitelʹnyjanalizéffektivnostiispolʹzovaniâmelkozernistogoivložennogoparallelizmadlâuveličeniâuskoreniâparalelʹnyhvyčislenijvmnogoâdernyhkompʹûternyhsistemah
AT martellvalerii porívnâlʹnijanalízefektivnostívikoristannâdríbnozernistogotavkladenogoparalelízmudlâzbílʹšennâprišvidšennâparalelʹnihobčislenʹubagatoâdernihkompûternihsistemah
AT korochkinaleksandr porívnâlʹnijanalízefektivnostívikoristannâdríbnozernistogotavkladenogoparalelízmudlâzbílʹšennâprišvidšennâparalelʹnihobčislenʹubagatoâdernihkompûternihsistemah
AT rusanovaolga porívnâlʹnijanalízefektivnostívikoristannâdríbnozernistogotavkladenogoparalelízmudlâzbílʹšennâprišvidšennâparalelʹnihobčislenʹubagatoâdernihkompûternihsistemah