Parallel computing on heterogeneous Networks: Challenges and Responses

In the paper, we analyse challenges associated with parallel programming for common networks of computers (NoCs) that are, unlike dedicated parallel computer systems, inherently heterogeneous and unreliable. This analysis results in description of main features of an ideal parallel program for No...

Повний опис

Збережено в:
Бібліографічні деталі
Дата:2004
Автор: Lastovetsky, Al.
Формат: Стаття
Мова:Англійська
Опубліковано: Інститут програмних систем НАН України 2004
Теми:
Онлайн доступ:https://nasplib.isofts.kiev.ua/handle/123456789/2298
Теги: Додати тег
Немає тегів, Будьте першим, хто поставить тег для цього запису!
Назва журналу:Digital Library of Periodicals of National Academy of Sciences of Ukraine
Цитувати:Parallel computing on heterogeneous Networks: Challenges and Responses /Al.Lastovetsky // Проблеми програмування. — 2004. — N 2,3. — С. 251-260. — Бібліогр.: 11 назв. — англ.

Репозитарії

Digital Library of Periodicals of National Academy of Sciences of Ukraine
_version_ 1859468268769640448
author Lastovetsky, Al.
author_facet Lastovetsky, Al.
citation_txt Parallel computing on heterogeneous Networks: Challenges and Responses /Al.Lastovetsky // Проблеми програмування. — 2004. — N 2,3. — С. 251-260. — Бібліогр.: 11 назв. — англ.
collection DSpace DC
description In the paper, we analyse challenges associated with parallel programming for common networks of computers (NoCs) that are, unlike dedicated parallel computer systems, inherently heterogeneous and unreliable. This analysis results in description of main features of an ideal parallel program for NoCs. We also outline some recent parallel programming tools, which try and respond to some of the challenges.
first_indexed 2025-11-24T07:17:33Z
format Article
fulltext Parallel Computing on Heterogeneous Networks: Challenges and Responses Alexey Lastovetsky Department of Computer Science, University College Dublin, Belfield, Dublin 4, Ireland E-mail: Alexey.Lastovetsky@ucd.ie Abstract In the paper, we analyse challenges associated with parallel programming for common networks of computers (NoCs) that are, unlike dedicated parallel computer systems, inherently heterogeneous and unreliable. This analysis results in description of main features of an ideal parallel program for NoCs. We also outline some recent parallel programming tools, which try and respond to some of the challenges. 1. Introduction Local networks of computers (NoCs) are the most common and available parallel architecture. Nowadays not only big businesses and organisations but also practically any medium or small one has several computers interconnected in a local network. In the most general case, a local network of computers consists of PCs, workstations, shared memory multiprocessor (SMP) servers, and even distributed memory multiprocessor supercomputers and clusters interconnected via mixed network equipment. At a first glance, this architecture is very similar to the distributed memory multiprocessor (also know as MPP) architecture. Like the latter, it provides a number of processors not sharing global main memory and interconnected via a communication network. Therefore, the most natural model of program for NoCs is also a set of parallel processes, each running on a separate processor and using message passing to communicate with the others. That is, message passing is the basic programming model for this architecture. Due to the similarity of MPPs and NoCs, it might be expected that NoCs be as widely used for high performance parallel computing as MPPs. In reality, NoCs are practically not used for parallel computing. The main reason, why the huge performance potential of millions NoCs around the world is so poorly utilised, is that parallel programming for NoCs is much more difficult than parallel programming for MPPs. The point is that unlike MPPs, which are designed and manufactured specifically for high performance parallel computing, a typical NoC is a naturally developed computer system. A NoC is a general-purpose computer system, which is developed incrementally, for a relatively long time. As a result, NoCs are not as nicely regular or balanced for high performance computing as MPPs. On the contrary, irregularity, heterogeneity, and instability are their inherent features differentiating the architecture from the MPP architecture. The very features make parallel programming for NoCs so difficult and challenging. There are three main sources of the difficulties. The first one is the heterogeneity of processors. Generally speaking, in a NoC, different processors are of the different architecture. The second source is the communication network itself, which is typically not designed for high performance parallel computing. The third source is the multi-user nature of NoCs. A NoC is not a strongly centralized computer system. It consists of relatively autonomous computers, each of which may be used and administered independently by its users. In NoCs, different components are not as strongly integrated and controlled as in MPPs. In the paper we discuss the sources of difficulties and analyse programming challenges coming from each of the sources. This analysis results in description of main features of an ideal parallel program for NoCs. We also outline some recent parallel programming tools, which try and respond to some of the challenges. 2. Heterogeneity of processors 2.1. Different processor speeds An immediate implication from the fact that a NoC uses processors of different architectures is that the processors run at different speeds. Let us see what happens when a parallel application, which provides a good performance while running on homogeneous MPPs, runs on the cluster of heterogeneous processors. A good parallel application for MPPs tries to evenly distribute computations over available processors. This very distribution ensures the maximal speedup on MPPs, which consist of identical processors. On the cluster of processors running at different speeds, faster processors will quickly perform their part of computations and wait for slower ones at points of synchronisation. Therefore, the total time of computations will be determined by the time elapsed on the slowest processor. In other words, when executing parallel applications, which evenly distribute computations among available processors, the heterogeneous cluster is equivalent to a homogeneous cluster that is composed of the same number but the slowest processors. The following simple experiment, which has been really carried out, corroborates the statement. Two subnetworks of the same local network were used, each consisting of four Sun workstations. The first subnetwork included identical workstations of the same model, and was thus homogeneous. The second one included workstations of three different models. Their relative speeds demonstrated while executing a LAPACK [1] Cholesky factorisation routine were 1.9, 2.8, 2.8, and 7.1. As the slowest workstation (relative performance 1.9) was shared by both clusters, the total power of the heterogeneous cluster was almost twice that of the homogeneous one. It might be expected that a parallel ScaLAPACK [2] Cholesky solver be executed on the more powerful cluster almost twice as fast as on the weaker one. But in reality, it ran practically at the same speed (~2% speedup for a 18001800× dense matrix). Thus, a good parallel application for a NoC must distribute computations unevenly taking into account the difference in processor speed. The faster processor is, the more computations it must perform. Ideally, the volume of computation performed by a processor should be proportional to its speed. X = A B C Figure 1. Matrix-matrix multiplication with matrices A, B, and C unevenly partitioned in one dimension. The area of the slice mapped to each processor is proportional to its speed. The slices mapped onto a single processor are shaded black. During execution, this processor requires all of matrix A (shown shaded grey). For example, a simple parallel algorithm implementing matrix operation BAC ×= on a p-processor heterogeneous cluster, where A, B are dense square nn × matrices, can be summarized as follows: • Each element ijc in C is computed as ∑ − = ×= 1 0 n k kjikij bac . • The A, B, and C matrices are identically partitioned into p vertical slices. There is one-to-one mapping between these slices and the processors. Each processor is responsible for computing its C slice. • Because all C elements require the same amount of arithmetic operations, each processor executes an amount of work proportional to the number of elements that are allocated to it, hence, proportional to the area of its slice. Therefore, to balance the load of the processors, the area of the slice mapped to each processor is proportional to its speed (see Fig. 1). • In order to compute elements of its C slice each processor requires all elements of the A matrix. Therefore, during the execution of the algorithm, each processor receives from p-1 other processors all elements of their slices (shown grey in Fig. 1). This heterogeneous parallel algorithm cannot be implemented in HPF 1.1 [3], since the latter provides no way to specify a heterogeneous distribution of arrays across abstract processors. But HPF 2.0 [4] addresses the problem by extending BLOCK distribution with the ability to explicitly specify the size of each individual block (GEN_BLOCK distribution). For example, the following HPF program implements the above parallel algorithm to multiply two dense square 10001000× matrices on a 4-processor heterogeneous cluster, processors of which have relative speeds 2, 3, 5, and 10: PROGRAM HETEROGENEOUS INTEGER, DIMENSION(4), PARAMETER:: M=(/100, 150, 250, 500/) REAL, DIMENSION(1000,1000):: A, B, C !HPF$ PROCESSORS p(4) !HPF$ DISTRIBUTE (*, GEN_BLOCK(M)) ONTO p:: A, B, C !HPF$ INDEPENDENT DO J=1,1000 !HPF$ INDEPENDENT DO I=1,1000 A(I,J)=1.0 B(I,J)=2.0 END DO END DO !HPF$ INDEPENDENT DO J=1,1000 !HPF$ INDEPENDENT DO I=1,1000 C(I,J)=0.0 DO K=1,1000 C(I,J)=C(I,J)+A(I,K)*B(K,J) END DO END DO END DO END In this program, the “generalized” block distribution, GEN_BLOCK, is used to map contiguous segments of arrays A, B, and C of unequal sizes onto processors. The sizes of the segments are specified by values of the user-defined integer mapping array M, one value per target processor of the mapping. That is, the i-th element of the mapping array specifies the size of the block to be stored on the i-th processor of the target processor arrangement p. The ‘*’ in the DISTRIBUTE directive specifies that array A, B, and C are not to be distributed along the first axis; thus an entire column is to be distributed as one object. So, array elements A(:,1:100), B(:,1:100), and C(:,1:100) are mapped on p(1), A(:,101:250), B(:,101:250), and C(:,101:250) are mapped on p(2), A(:,251:500), B(:,251:500), and C(:,251:500) are mapped on p(3), and A(:,501:1000), B(:,501:1000), and C(:,501:1000) are mapped on p(4). That distribution of matrices A, B, and C across processors ensures that the area of the vertical slice mapped to each processor is proportional to the speed of the processor. Note that this is responsibility of the programmer to explicitly specify the exact distribution of the arrays across processors. The specification is based on the knowledge of both the parallel algorithm and the executing heterogeneous cluster. HPF 2.0 also allows the programmer to distribute the arrays with the REDISTRIBUTE directive, based on a mapping array whose values are computed at runtime. This allows writing a more portable application. But again, either the programmer or a user of the application must explicitly specify the data distribution, which ensures the best performance of this particular parallel algorithm on each particular heterogeneous cluster. Apparently, the above algorithm can be implemented in MPI [5] as well. The corresponding MPI program will be not as simple as the HPF one because of much lower level of the MPI’s programming model. Actually, MPI is a programming tool of the assembler level for message passing programming. Therefore, practically all message passing algorithms can be implemented in MPI. Whatever programming tool is used to implement the above parallel algorithm, one can see that the efficiency of the corresponding application strongly depends on the accuracy of estimation of the relative speed of processors of the executing heterogeneous cluster. Distribution of arrays and, hence, distribution of computations across the processors are fully determined by the estimation of their relative speed. If this estimation is not accurate enough, the load of processors will be unbalanced, resulting in poorer execution performance. The problem of accurate estimation of the relative speed of processors is not as easy as it may look. Of course, if you consider two processors, which only differ in clock rate, it is not a problem to accurately estimate their relative speed. The relative speed will be the same for any application. But if you consider processors of different architectures, the situation changes drastically. Everything in the processors may be different: set of instructions, number of instruction execution units, number of registers, structure of memory hierarchy, size of each memory level, and so on, and so on. Therefore, the processors may demonstrate different relative speeds for different applications. Moreover, processors of the same architecture but different models or configurations may also demonstrate different relative speeds on different applications. Even different applications of the same narrow class may be executed by two different processors at significantly different relative speeds. To avoid speculation, consider the following experiment that has been really carried out. Three slightly different implementations of Cholesky factorisation of a 500500× matrix were used to estimate the relative speed of a SPARCstation-5 and a SPARCstation-20. Code for(k=0; k<500; k++) { for(i=k, lkk=sqrt(a[k][k]); i<500; i++) a[i][k] /= lkk; for(j=k+1; j<500; j++) for(i=j; i<500; i++) a[i][j] -= a[i][k]*a[j][k]; } estimated their relative speed as 10:9, meanwhile code for(k=0; k<500; k++) { for(i=k, lkk=sqrt(a[k][k]); i<500; i++) a[i][k] /= lkk; for(i=k+1; i<500; i++) for(j=i; j<500; j++) a[i][j] -= a[k][j]*a[k][i]; } as 10:14. Routine dptof2 from the LAPACK package, solving the same problem, estimated their relative speed as 10:10. 2.2. Heterogeneity of machine arithmetic As processors of a NoC may do floating-point arithmetic differently, there are special challenges associated with writing numerical software on NoCs. Specifically, there are two main issues potentially affecting the behaviour of a numerical parallel application running on a heterogeneous NoC. Firstly, different processors do not guarantee the same storage representation and the same results for operations on floating point numbers. Secondly, if a floating-point number is communicated between processors, the communication layer does not guarantee the exact transmittal of the floating-point value. Normally, transferring a floating point number in a heterogeneous environment includes two conversions of its binary representation: the representation of the number on the sender site is first converted into a machine independent representation, which is then converted into the representation for floating point numbers on the receiver site. The two successive conversions may change the original value, that is, the value received by the receiver may differ from the value sent by the sender. To illustrate the potential problems, consider the iterative solution of a system of linear equations where the stopping criterion depends upon the value of some function, f, of the relative machine precision, ε . A common definition of the relative machine precision, or unit roundoff, is the smallest positive floating point value, ε , such that 1)1( >+ εfl , where fl(x) is the floating point representation of x. The test for convergence might well include a test of the form: if( )( 2 2 εf x e r r < ) goto converged; In a heterogeneous setting the value of f may be different on different processors and er and xr may depend upon data of different accuracies, and thus one or more processes may converge in a fewer number of iterations. Indeed, the stopping criterion used by the most accurate processor may never be satisfied if it depends on data computed less accurately by other processors. If the code contains communication between processors within an iteration, it may not complete if one processor converges before the others. In a heterogeneous environment, the only way to guarantee termination is to have one processor make the convergence decision and broadcast that decision. Another problem is that overflow and underflow exceptions may occur during floating-point representation conversions, resulting in a failure of the communication. 3. Ad hoc communication network One can imagine a local network of heterogeneous computers, whose communication layer is almost as good as the communication layer of the MPP architecture. Parallel programming for such networks called in this book heterogeneous clusters faces no specific communication-related challenges. Heterogeneous clusters are normally designed specifically for high performance distributed computing. At the same time, the topology and structure of the communication network in a typical common local network of computers is determined by many different factors, among which high performance computing is far away from being a primary one if considered at all. The primary factors include the structure of the organisation, the tasks that are solved on computers of the NoC, the security requirements, the construction restrictions, the budget limitations, the qualification of technical personnel, etc. An additional important factor is that the communication network is constantly developing rather than fixed once and forever. The development is normally occasional and incremental. Therefore, the structure of the communication network reflects the evolution of the organization rather than its current snapshot. All the factors make the common communication network far away from the ideal MPP communication network, which is homogeneous with communication speedup and bandwidth being balanced with the number and speed of processors. First of all, the common communication network is heterogeneous. The speed and bandwidth of communication links between different pairs of processors may differ significantly. Secondly, some of the communication links may be of low speed and/or narrow bandwidth. This makes the problem of optimal distribution of computations and communications across a NoC much more difficult than across a cluster of heterogeneous processors interconnected with a homogeneous high-performance communication network. The additional difficulty comes from the larger size of the problem, which is now )( 2nO , where n is the total number of processors (respectively, 2n is the total number of inter- processor communication links). Apart from that, due to low performance of some communication links, the optimal distribution of computations and communications may be across some subnetwork of the NoC, not across the entire NoC. This substantially extends the space of possible solutions and increases the complexity of the distribution problem even further. 4. Multi-user decentralised computer system Unlike MPPs, NoCs are not strongly centralized computer systems. A typical NoC consists of relatively autonomous computers, each of which may be used and administered independently by its users. 4.1. Unstable performance characteristics The first implication from the multi-user decentralised nature of NoCs is that computers, executing a parallel program, may be also used for other computations and involved in other communications. In that case, the real performance of processors and communication links can dynamically change depending on the external computations and communications. Therefore, a good parallel program for a NoC must be sensitive to such dynamic variations of its workload. In such a program, computations and communications are distributed across the NoC in accordance to the actual performance at the moment of execution of the program. 4.2. Higher probability of resource failures Fault tolerance is not a primary problem for parallel applications running on MPPs. The probability of unexpected resource failures in a centralised dedicated parallel computer system is quite small. But this probability reaches much higher figures for NoCs. Firstly, any single computer in a NoC may be switched off or rebooted unexpectedly for other users in the NoC. The same may happen with any other resource in the NoC. Secondly, not all building elements of the common NoC as well as interaction between different elements are equally reliable. These make fault tolerance a desirable feature for parallel applications that run on NoCs; and the longer the execution time of the application is, the more important the feature becomes. The basic programming tool for distributed-memory parallel architectures, MPI, does not address the problem. The point is that a fault-tolerant parallel program assumes a dynamic process model. Failure of one or other process of the program should not necessarily lead to failure of the entire program. The program may continue running even after its set of processes has changed. The MPI 1.1 process model is fully static. MPI 2.0 does include some support for dynamic process control, although this is limited to the creation of new MPI process groups with separate communicators. These new processes cannot be merged with previously existing communicators to form intracommunicators needed for a seamless single application model and are limited to a special set of extended collective communications. To date, there is no industrial fault-tolerant implementation of MPI. At the same time, there are a few research versions of MPI suggesting different approaches to the problem of fault-tolerant parallel programming. The first approach to making MPI applications fault tolerant is through the use of check pointing and roll back. This approach is that all processes of the MPI program will flush their message queues to avoid in flight messages getting lost, and then they will all synchronously checkpoint. At some later stage if any error occurs, the entire MPI program will be rolled back to the last complete checkpoint and be re-started. This approach needs the entire application to checkpoint synchronously, which, depending on the application and its size, may become expensive in terms of time (with potential scaling problems). The second approach is to use “spare” processes that are utilized when there is a failure. For example, MPI-FT [6] supports several master-slave models where all communicators are built from grids that contain “spare” processes. To avoid loss of message data between the master and slaves, all messages are copied to an observer process, which can reproduce lost messages in the event of any failures. This system has a high overhead for every message and considerable memory needs for the observer process for long running applications. This system is not a full checkpoint system in that it assumes any data (or state) can be rebuilt using just the knowledge of any passed messages, which might not be the case for non-deterministic unstable solvers. MPI-FT is an example of an implicit fault tolerant MPI. Such implementations of MPI do not extend MPI interface itself. No specific design is needed for application using an implicit fault tolerant MPI. The system takes full responsibility over fault tolerant features of application. The drawback of that approach is that the programmer cannot control fault tolerant features of the application and fine tune for better balance between fault tolerance and performance as system and application conditions may dictate. Unlike MPI-FT, FT-MPI [7] is an explicit fault tolerance MPI, which extends standard MPI’s interface and semantics. An application using FT-MPI has to be specifically designed to take advantage of its fault tolerant features. 5. Summary of programming challenges In summarizing challenges associated with parallel programming for NoCs, let us describe main features of an ideal parallel program running on a NoC. Such a program distributes computations and communications unevenly across processors and communications links, taking into account their actual performance demonstrated during the execution of the code of the program. The distribution is not static and may be different not only for different NoCs but also for different executions of the program on the same NoC, depending on the workload of its elements. The program may find profitable to involve in computations not all available computers. In other words, the program must be efficiently portable. The program keeps running even if some resources in the executing network fail. In the case of a resource failure, it is able to reconfigure itself and resume computations from some point in the past. The program takes into account differences in machine arithmetic on different computers and avoids erroneous behaviour of the program that might be caused by the differences. 6. Any response to the challenges? Let us see how the challenges are responded. First, we outline how standard parallel programming tools such as HPF and MPI address the highlighted challenges. Then, we briefly introduce mpC, a dedicated programming language designed specifically for parallel computing on heterogeneous networks of computers. 6.1. High Performance Fortran As we have demonstrated in Section 2.1, HPF provides some basic support for programming heterogeneous algorithms. It allows the programmer to specify uneven distribution of data across abstract HPF processors. At the same time, it is full responsibility of the programmer to provide a code, which analyses the implemented parallel algorithm and the executing NoC, and calculates the best distribution. Another problem is that the HPF programmer cannot influence the mapping of abstract HPF processors to computers of the NoC. HPF provides no language constructs allowing the programmer to control better mapping of the heterogeneous algorithms to heterogeneous clusters. The HPF programmer should rely on some default mapping provided by the HPF compiler. The mapping cannot be sensitive to peculiarities of each individual algorithm just because the HPF compiler has no information about the peculiarities. Therefore, to control the mapping and take into account both the peculiarities of the implemented parallel algorithm and the peculiarities of the executing heterogeneous environment, the HPF programmer needs to additionally write a good peace of quite complex code. HPF does not address the problem of fault tolerance at all. Actually the lack of means for advising the compiler about the features of implemented parallel algorithm that have a major impact on its execution time is the general drawback of HPF, which makes the language difficult for compiling not only for heterogeneous platforms but for MPPs as well. To illustrate the associated difficulties, consider the following simple HPF program: PROGRAM SIMPLE REAL, DIMENSION(1000,1000):: A, B, C !HPF$ PROCESSORS p(4,4) !HPF$ DISTRIBUTE (BLOCK,BLOCK) ONTO p:: A, B, C !HPF$ INDEPENDENT DO J=1,1000 !HPF$ INDEPENDENT DO I=1,1000 A(I,J)=1.0 B(I,J)=2.0 END DO END DO !HPF$ INDEPENDENT DO J=1,1000 !HPF$ INDEPENDENT DO I=1,1000 C(I,J)=0.0 DO K=1,1000 C(I,J)=C(I,J)+A(I,K)*B(K,J) END DO END DO END DO END The program implements matrix operation BAC ×= on a 16-processor MPP, where A, B are dense square 10001000× matrices. Figure 2 illustrates the implemented parallel algorithm. The PROCESSORS directive specifies a logical 44× grid of abstract processors, p. The DISTRIBUTE directive recommends the compiler to partition each of the arrays A, B, and C into equal-sized blocks along each of its dimension. This will result in a 44× configuration of blocks each containing 250250× elements, one block per processor. The corresponding blocks of arrays A, B, and C will be mapped to the same abstract processor and, hence, to the same physical processor. Each of the four INDEPENDENT directives in the program is applied to a DO loop and advises the compiler that the loop does not carry any dependences and therefore its different iterations may be executed in parallel. X = A B C Figure 4.3. Matrix-matrix multiplication with matrices A, B, and C evenly partitioned in two dimensions. The blocks mapped onto a single processor are shaded black. During execution, this processor requires corresponding rows of matrix A and columns of matrix B (shown shaded grey). Altogether the directives give the compiler enough information in order to generate a target message-passing program. Additional information is given by a general HPF rule saying that evaluation of an expression should be performed on the processor, in the memory of which its result will be stored. Thus, a clever HPF compiler would be able to generate SPMD message-passing code like that: PROGRAM SIMPLE REAL, DIMENSION(250,250):: A, B, C REAL, DIMENSION(250,1000):: Arows, Bcols INTEGER colcom, rowcom, col, row INTEGER rank, colrank, rowrank INTEGER err CALL MPI_INIT(ierr) CALL MPI_COMM_RANK(MPI_COMM_WORLD, rank); row = rank/4 col = rank-row*4 DO J=1,250 DO I=1,250 A(I,J)=1.0 B(I,J)=2.0 END DO END DO CALL MPI_COMM_SPLIT(MPI_COMM_WORLD, row, rank, rowcom, err) CALL MPI_COMM_SPLIT(MPI_COMM_WORLD, col, rank, colcom, err) CALL MPI_ALLGATHER(A, 40000, MPI_REAL, Arows, 62500, &MPI_REAL, rowcom, err) CALL MPI_ALLGATHER(B, 40000, MPI_REAL, Bcols, 62500, &MPI_REAL, colcom, err) DO J=1,250 DO I=1,250 C(I,J)=0.0 ind1=1 ind2=J DO K=1,1000 C(I,J)=C(I,J)+Arows(I,K)*Bcols(ind1,ind2) IF(ind1.LT.250) THEN ind1=ind1+1 ELSE ind1=1 ind2=ind2+250 END IF END DO END DO END DO CALL MPI_COMM_FREE(rowcom, err) CALL MPI_COMM_FREE(colcom, err) CALL MPI_FINALIZE(err) END This code is in Fortran 77 with calls to MPI routines. It is supposed to be executed by all 16 processes making up the parallel program. Each process locally contains one 250250× block of global arrays A, B, and C of the source HPF program. A logical 44× process grid is formed from the 16 participating processes, and each process gets its coordinates row and col in the grid. In order to compute its block of the resulting matrix C, the process needs blocks of matrix A from its horizontal neighbours in the 44× process grid, and blocks of matrix B from its vertical neighbours (see Figure 2). The necessary communication is achieved by calls to the MPI_COMM_SPLIT and MPI_ALLGATHER routines. The main specific optimisation performed by an HPF compiler is the minimization of the cost of the inter-processor communication. This is not a trivial problem. It needs profound analysis of both the source code and the executing MPP. HPF provides no specific constructs or directives helping the compiler to solve the problem. This is one of the reasons why HPF is considered a difficult language to compile. For example, many real HPF compilers (i.e., the ADAPTOR HPF compiler from GMD) will translate the above HPF program into a message-passing program, each process of which sends its blocks of matrices A and B to all other processes. That straightforward communication scheme guarantees that each process receives all the elements of global arrays A and B, it needs to compute its elements of global array C. At the same time, in many particular cases, including ours, this universal scheme involves a good deal of redundant communications, sending and receiving data that are never used in computation. The better a compiler is, the more accurate communication patterns it generates to avoid redundant communications as much as possible. The above message- passing program, generated by an imaginary clever HPF compiler, performs no redundant communication. Each process of the program sends its blocks of matrices A and B only to 3 other processes, not to 15 as each process of the straightforward program does. HPF does not address the problem of fault tolerance at all. 6.2. Message Passing Interface As a general-purpose message-passing tool of assembler level, MPI allows the programmer to write efficiently portable programs for NoCs. At the same time, it provides no specific support to facilitate such programming. It is responsibility of the programmer to write all the code making the application efficiently portable among NoCs. In other words, every time, when programming for NoCs, a programmer must solve the extremely difficult problem of portable efficiency from scratch. Standard MPI also does not address the problem of fault tolerance. 6.3. mpC and HMPI An original approach to parallel computing on heterogeneous networks that has been proposed and implemented in the framework of the mpC language [8-9] and the HMPI library [10] and their programming systems. In brief, this approach can be summarised as follows: • The programmer provides the programming system with comprehensive information about the features of the implemented parallel algorithm that have a major impact on the execution time of this algorithm. In other words, the programmer provides a detailed description of the performance model of this algorithm. • The programming system uses the provided information to optimally map at runtime this algorithm to the computers of the executing network. The quality of this mapping strongly depends on the accuracy of the estimation of the actual performance of the processors and communication links demonstrated at runtime on the execution of this application. Therefore, the mpC programming system employs an advanced performance model of a heterogeneous network of computers, and the mpC language provides constructs that allow the programmer to update the parameters of this model at runtime by tuning them to the code of this particular application. This approach to parallel computing on heterogeneous networks has proved its efficiency. Many mpC and HMPI applications have been developed that efficiently solve real-life problems on common heterogeneous networks of computers. The mpC language in its current form addresses all the challenges associated with writing efficiently portable programs for NoCs except for the fault tolerance. The mpC parallel language allows the programmer to define all main features of the implemented parallel algorithm that can have an impact on the performance of execution of the algorithm on a heterogeneous NoC. The features include the total number of participating parallel processes, the total volume of computations to be performed on each of the processes, the total volume of data to be transferred between each pair of the processes, and how exactly the processes interact during the execution of the algorithm. The mpC programming system uses that performance model of the parallel algorithm together with the model of the executing heterogeneous network to map the processes of the parallel program to this network so as to ensure better execution time. The mapping is executed at runtime; therefore its efficiency is crucial for the total execution performance of mpC applications. The model of a heterogeneous network and the mapping algorithm are developed to keep balance between the accuracy and efficiency. To briefly introduce the mpC language, consider an mpC application simulating the evolution of groups of bodies under the influence of Newtonian gravitational attraction. Since the magnitude of interaction between bodies falls off rapidly with distance, a single equivalent body may approximate the effect of a large group of bodies. This allows us to solve the problem in parallel. The parallel application will use a few parallel processes, each of which will update data characterizing a single group of bodies. Each process holds attributes of all the bodies constituting the corresponding group as well as masses and centres of gravity of other groups. The attributes characterizing a body include its position, velocity and mass. The application will implement the following parallel algorithm: Initialisation of galaxy on host-process Scattering groups of bodies over processes Parallel computing masses of groups Interchanging the masses among processes while(1) { Visualization of galaxy by host-process Parallel computing centers of gravity Interchanging the centers among processes Parallel updating groups Gathering groups on host-process } It is assumed that at each iteration of the main loop, new coordinates of all bodies in some fixed interval of time are calculated. The core of the mpC application, implementing the above algorithm, is the following description of the performance model of this algorithm: nettype Galaxy(m, k, n[m]) { coord I=m; node { I>=0: bench*((n[I]/k)*(n[I]/k)); }; link { I>0 : length(Body)*n[I] [I]->[0]; }; parent [0]; scheme { int i; par (i=0; i<m; i++) 100%%[i]; par (i=1; i<m; i++) 100%%[i]->[0]; }; }; Informally, it looks like a description of an abstract network of processors, which executes the algorithm, complemented by the description of the workload of its processors and communication links, and the description of the scenario of interaction between the abstract processors during the algorithm execution. The first line of the above definition introduces the name Galaxy of the type of the abstract mpC network and a list of parameters – integer scalar parameters m and k and vector parameter n of m integers. Next line declares the coordinate system to which abstract processors will be related. It introduces coordinate variable I ranging from 0 to m-1. Next line associates abstract processors with this coordinate system and describes the volumes of computation to be performed by each of the processors. As a unit of measurement, the volume of computation performed by some benchmark code is used. In this particular case, it is assumed that the benchmark code computes a single group of k bodies. It is also assumed that i-th element of the vector parameter n is equal to the number of bodies in the group computed by the i-th abstract processor. The number of operations to compute one group is proportional to the number of bodies in the group squared. Therefore, the volume of computation to be performed by the I-th virtual processor is (n[I]/k)2 times bigger than the volume of computation performed by the benchmark code. This line just says it. Next line specifies volumes of data in Bodys to be transferred between the virtual processors during execution of the algorithm. It simply says that i-th virtual processor (i=1,…) will send attributes of all its bodies to the host-processor where they should be visualized. Note, that this definition describes one iteration of the main loop of the algorithm, which is a quite good approximation because practically all computations and communications concentrate in this loop. Therefore, the total time of the execution of this algorithm is approximately equal to the running time of a single iteration, multiplied by the total number of iterations. Finally, the scheme block describes how exactly virtual processors interact during execution of the algorithm. It says that first all the virtual processors perform in parallel 100 per cent of computations that should be performed, and then all the processors, except the host processor, send in parallel 100 per cent of data that should be sent to the host-processor. The most principal fragments of the rest code of this mpC application are: void [*] main(int [host]argc, char **[host]argv) { ... TestGroup[]=(*AllGroups[0])[]; recon Update_group(TestGroup, TestGroupSize) ; { net GalaxyNet(NofG, TestGroupSize, NofB) g; … } } The recon statement uses a call of the function Update_Group with actual parameters TestGroup and TestGroupSize to update the estimation of the performance of the physical processors executing the application. The main part of the total volume of computations performed by each virtual processor just falls into execution of calls to the function Update_Group. Therefore, the obtained estimation of performances of the real processors will be very close to their actual performances shown while executing this program. Next line defines the abstract network g of type GalaxyNet with the actual parameters NofG – the actual number of groups of bodies, TestGroupSize – the size of the test group of bodies used in the benchmark code, and NofB – an array of NofG elements containing actual numbers of bodies in the groups. The rest computations and communication will be performed on this abstract network. The mpC programming system maps virtual processors of the abstract network g to real parallel processes constituting the running parallel program. While performing the mapping, the programming system uses, on the one hand, the information about configuration and performance of physical processors and communication links of the network of computers executing the program, and on the other hand, the specified performance model of the parallel algorithm. The programming system does the mapping at runtime and tries to minimise the total running time of the parallel program. 7. Conclusion In this paper, we have analysed challenges associated with parallel programming for common heterogeneous networks of computers. This analysis has resulted in the description of the main features of an ideal parallel program for NoCs. We have taken a look at how standard parallel programming tools, such as HPF and MPI, addresses the programming challenges. We have also introduced the mpC language, which is the first language specifically designed for parallel programming for heterogeneous networks of computers. Detailed introduction to parallel computing on heterogeneous networks can be found in [11]. 8. References [1] E. Anderson, Z. Bai, C. Bischof, L. S. Blackford, J. Demmel, J. Dongarra, J. Du Croz, A. Greenbaum, S. Hammarling, A. McKenney, D. Sorensen, LAPACK User's Guide, SIAM, Philadelphia, third edition, 1999. [2] L.S. Blackford, J. Choi, A. Cleary, E. D'Azevedo, J. Demmel, I. Dhillon, J. Dongarra, S. Hammarling, G. Henry, A. Petitet, K. Stanley, D. Walker, R. C. Whaley, ScaLAPACK User's Guide, SIAM, Philadelphia, 1997. [3] High Performance Fortran Language Specification. Version 1.1. High Performance Standard Forum, Rice University, Houston, Texas, November 10, 1994. [4] High Performance Fortran Language Specification. Version 2.0. High Performance Standard Forum, Rice University, Houston, Texas, January 31, 1997. [5] MPI: A Message-Passing Interface Standard, Message Passing Interface Forum, June 12, 1995. [6] S.Louca, N.Neophytou, A.Lachanas, P.Evripidou, MPI-FT: Portable Fault Tolerance Scheme for MPI, Parallel Processing Letters, 10(4), pp.371-382, 2000. [7] G.Fagg, A.Bukovsky, J.Dongarra, HARNESS and fault tolerant MPI, Parallel Computing 27(11), pp.1479-1496, 2001. [8] A.Lastovetsky, D.Arapov, A.Kalinov, I.Ledovskih, A Parallel Language and Its Programming System for Heterogeneous Networks, Concurrency: Practice and Experience, 12(13), pp.1317-1343, 2000. [9] A.Lastovetsky, Adaptive Parallel Computing on Heterogeneous Networks with mpC, Parallel Computing, 28(10) , pp.1369-1407, 2002. [10] A.Lastovetsky, R.Reddy, HMPI: Towards a Message-Passing Library for Heterogeneous Networks of Computers, Proceedings of the 17th International Parallel and Distributed Processing Symposium (IPDPS 2003), 22-26 April 2003, Nice, France, CD-ROM/Abstracts Proceedings, IEEE Computer Society 2003. [11] A.Lastovetsky, Parallel Computing on Heterogeneous Networks, John Wiley & Sons, 423 pp, June 2003, ISBN: 0-471-22982-2.
id nasplib_isofts_kiev_ua-123456789-2298
institution Digital Library of Periodicals of National Academy of Sciences of Ukraine
issn 1727-4907
language English
last_indexed 2025-11-24T07:17:33Z
publishDate 2004
publisher Інститут програмних систем НАН України
record_format dspace
spelling Lastovetsky, Al.
2008-09-17T12:30:57Z
2008-09-17T12:30:57Z
2004
Parallel computing on heterogeneous Networks: Challenges and Responses /Al.Lastovetsky // Проблеми програмування. — 2004. — N 2,3. — С. 251-260. — Бібліогр.: 11 назв. — англ.
1727-4907
https://nasplib.isofts.kiev.ua/handle/123456789/2298
681.3
In the paper, we analyse challenges associated with parallel programming for common networks of computers (NoCs) that are, unlike dedicated parallel computer systems, inherently heterogeneous and unreliable. This analysis results in description of main features of an ideal parallel program for NoCs. We also outline some recent parallel programming tools, which try and respond to some of the challenges.
en
Інститут програмних систем НАН України
Параллельное программирование
Распределенные системы и сети
Parallel computing on heterogeneous Networks: Challenges and Responses
Article
published earlier
spellingShingle Parallel computing on heterogeneous Networks: Challenges and Responses
Lastovetsky, Al.
Параллельное программирование
Распределенные системы и сети
title Parallel computing on heterogeneous Networks: Challenges and Responses
title_full Parallel computing on heterogeneous Networks: Challenges and Responses
title_fullStr Parallel computing on heterogeneous Networks: Challenges and Responses
title_full_unstemmed Parallel computing on heterogeneous Networks: Challenges and Responses
title_short Parallel computing on heterogeneous Networks: Challenges and Responses
title_sort parallel computing on heterogeneous networks: challenges and responses
topic Параллельное программирование
Распределенные системы и сети
topic_facet Параллельное программирование
Распределенные системы и сети
url https://nasplib.isofts.kiev.ua/handle/123456789/2298
work_keys_str_mv AT lastovetskyal parallelcomputingonheterogeneousnetworkschallengesandresponses