UDC 681.3 https://doi.org/10.15407/pp2022.03-04.051
PERFORMANCE ANALYSIS OF MASSIVELY PARALLEL PROGRAMS FOR GRAPHICS PROCESSING UNITS
Dmytro Rahozin
Any modern graphics processing unit (GPU, or graphics card) is a good platform for running massively parallel programs. Still, we lack tools to observe and measure the performance characteristics of GPU-based software. We argue that, due to the complex memory hierarchy and thousands of execution threads, practically all performance issues come down to the efficient use of the graphics card memory hierarchy. We propose to use the GPGPU-Sim simulator, previously used mostly for graphics card architecture validation, for performance analysis of CUDA-based programs. We provide examples that show how to use the simulation for performance analysis of massively parallel programs.
Keywords: graphics processing unit, software performance, massive parallelism, simulation, software performance model.
Introduction
Modern graphics processing units (GPUs), or video cards, have revolutionized high-performance computing, introducing massive parallelism at a moderate price and allowing everybody to have a supercomputer at home. Still, the effective use of a GPU is limited, as creating efficient GPU software usually requires a full redesign of the original algorithm: a GPU provides quite a different model of exploiting massive parallelism.
First, we discuss the existing GPU performance models, which are based on software simulators of GPU hardware. Such a simulator models the computational units, the GPU memory hierarchy and the interconnection network, and is able to execute ordinary programs compiled for the GPU, for example the PTX code produced by the CUDA toolkit for Nvidia GPUs. Most simulators employ a reverse-engineered GPU architecture [1], recovered with the help of various benchmarks [2], which also makes the end user aware of better software optimization techniques for video cards.
Second, there is the problem of inefficient utilization of GPU massive parallelism: for complex software it is hard to provide a good higher-level model of massive parallelism that can be effectively projected onto the GPU hardware [3]. It is hard to account for the relations and dependencies between the basic parameters of a massively parallel software model (for example, the number of threads, the memory block size processed by one thread, and the scheduling technique) and thus to predict the final real performance. The GPU computing model has truly revolutionized computing, and we argue that GPU performance is fully defined by the performance of the memory hierarchy. As the GPU memory hierarchy now employs five or six memory levels (compared to the older L1-L2-L3 cache scheme), the analysis of bottlenecks caused by the memory hierarchy and interconnect is extremely complex. The software developer clearly needs good tools that help to analyze what is wrong with software performance, and needs a simpler model of software performance.
Third, we propose using a simulator to discover potential problems in massively parallel software, analyze its bottlenecks and recover memory utilization models. We reformulate the performance analysis problem as a problem of resource utilization at the different memory hierarchy stages, which is very close to commonly used queueing models. The role of the simulator changes: originally it was used to recover the peculiarities of a GPU architecture, but now we extract model data from the memory hierarchy simulation. The simplified model is applied back to the original task and allows estimating the memory and time resources necessary for a software run.
Fourth, we provide a motivating example of CUDA application analysis on top of the simulation and illustrate how the simulation data can be used to recover a less complex performance model that is then re-applied to application optimization. We also provide suggestions for enhancing the simulator and enabling better performance analysis, leading to a clear performance estimate for the developer's massively parallel software.
1. GPU performance models
Currently we are observing several shifts in microprocessor production, and these shifts greatly impact the existing approaches to application development. First, the term "parallel application development" has become more "academic", as the process of exploring different forms of parallelism in an application has become ever more complex and time-consuming and requires a lot of knowledge in narrow areas. Second, the modern "start-up" culture pushes engineers to decrease time-to-market and development costs for their products or prototypes, so the adoption of new architectural capabilities becomes slower and slower and often comes down to moving to a newer GPU or running several copies of a process on a more powerful GPU. In practice, all a "median" programmer does about parallelism is employ the threading concept, without a good understanding of the resulting application performance. This gives chip makers a chance to increase the number of processor cores on a chip, as the cores can easily be occupied by applications or the operating system simply running the usual software services. So, basically, applications are not optimized for multi-core CPUs. Further performance analysis and optimization are left open to industrial rivals, who can enter the market with a substitute application that is faster and smarter. Scaling the number of processor cores therefore looks like a good choice for a typical smartphone platform, which runs a bunch of applications that are loosely coupled in terms of using shared data simultaneously.
However, if we look at software that can be defined as "data processing infrastructure" (communication, video processing, artificial intelligence (AI) [4]), we see the opposite picture: each year the volume of processed data increases exponentially. For example, a century ago a short telegram was enough for communication, 20 years ago we entered the era of voice conferences with sophisticated sound processing, and now we demand video calls and conferences because we need the visual presence of our peers. Nobody can predict the next shift (most likely a move towards virtual reality) where the computation demand will increase by another 10x.
This increase in computing demand targets neither the computation capability nor the memory volume, as computational cores and memory cells scale perfectly, but the interconnect that joins storage and computation (fig. 1). It is practically impossible to scale the interconnect infrastructure due to physical limitations, yet fig. 1 illustrates the structure of a typical computational system: it applies equally to microcontrollers, common multicore systems and GPUs. This means that every computer system has the same problem of a non-scaling interconnect network, but today the interconnect scaling limitation has become severe. Note that GPU hardware currently has the best solutions for optimizing the interconnect, as it has to process huge amounts of data (triangles, textures and points) in complex ways (rendering, ray tracing, general computing) in real time (60 frames per second).
Fig. 1. Computational system structure.
Pure computation capability is practically unlimited now and the number of memory cells is huge, but the interconnect between them, which routes data to the computational units, is very limited. An even more limiting factor is the complexity of programming the interconnect. There is a good (and old) example of this: older DSPs had several independent memory buses, as only this solution doubled or tripled the data throughput. The technology required special programming, as program data had to be spread among two or more memory spaces manually. Modern chips are more complicated and have multilevel cache memory, but the major problem is still the same: memory is quite slow compared to computation, so the channel capacity between logic and memory must be increased, and here it appears that the paradigm has to change. Modern GPUs (as of 2019) have 12 (twelve) or more high-speed graphics DRAM (GDDR5/6) controllers, which should, at least in theory, balance the memory load and throughput.
Long ago, a simple but practical classification of software performance limitations was introduced: memory-bound software and compute-bound software. The latter case was usually resolved quickly in two ways: 1) computations are scaled via SIMD execution, simply providing extra number-crunching power (but the memory buses then need to scale as well); 2) common pipelines are fixed in hardware and used as hardware accelerators (e.g. Nvidia video stream encoders and decoders). The memory-bound case is much harder: the memory bus has become wide, up to 2048 bits, to feed the SIMD units or hardware accelerators, but the bus speed can never outperform arithmetic speed; even worse, multi-core systems simply multiply the memory load. Basically, if we leave aside the small number of truly compute-bound algorithms, the most important characteristic of a program or algorithm is its degree of "memory-boundness". Overcoming memory bandwidth limitations requires improvements in both hardware and software, as the interconnect infrastructure should be loaded optimally.
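As a rough illustration of this classification, the sketch below compares the arithmetic intensity of a kernel with the machine balance of a card. All figures are illustrative assumptions, not values taken from the paper or from any particular GPU data sheet.

```cuda
// Back-of-envelope check of memory-boundedness (illustrative numbers only).
// A kernel is memory-bound when its arithmetic intensity (FLOP per byte moved)
// is below the machine balance (peak FLOP/s divided by peak DRAM bandwidth).
#include <cstdio>

int main() {
    // Hypothetical GPU figures; substitute the data-sheet values of the target card.
    const double peak_gflops     = 10000.0;  // 10 TFLOP/s single precision (assumed)
    const double peak_gbytes     = 500.0;    // 500 GB/s DRAM bandwidth (assumed)
    const double machine_balance = peak_gflops / peak_gbytes;   // FLOP per byte

    // Naive (untiled) matrix multiply: every multiply-add reads two 4-byte operands.
    const double flops_per_elem = 2.0;       // one multiply + one add
    const double bytes_per_elem = 8.0;       // two float loads, no reuse
    const double intensity      = flops_per_elem / bytes_per_elem;

    printf("machine balance: %.1f FLOP/byte, kernel intensity: %.2f FLOP/byte\n",
           machine_balance, intensity);
    if (intensity < machine_balance)
        printf("kernel is memory-bound: performance is capped by DRAM traffic\n");
    return 0;
}
```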
The interconnect is not just a bunch of wires [3]. Computational units produce requests to read and write memory cells, while memory is old-fashioned DRAM that needs to open pages, refresh its contents and serve operations. Pseudo-random memory accesses greatly decrease DRAM throughput, so the interconnect should effectively decrease the number of memory accesses, exploiting the principle that the traffic for the most frequently used memory cells should be routed somewhere inside the interconnect. Simply speaking, the interconnect should integrate a specific multilevel cache memory able to serve multithreaded execution, which multiplies memory traffic.
As a good example of the change in the threading paradigm, consider the execution structure of GPU hardware threads, where massive thread execution is driven by the readiness of data fetched from the interconnect internals. This approach requires simultaneous execution of hundreds of threads: some of them are processing data, while others are waiting for their data to become ready. This raises at least two common problems: 1) an algorithm should have a sufficient degree of parallelism to allow the execution of hundreds of threads (for example, triangle and pixel rendering); 2) an algorithm should be designed from the start to exploit the benefits of the memory hierarchy (interconnect). The first problem is unavoidable, as somebody has to convert the initial task into a set of interoperating threads. The second problem is more technical, but it requires defining the interconnect and memory hierarchy structure of the computing system. For practical purposes we state that the interconnect and memory hierarchy need to be architected to support hundreds of threads; conceptually this is very different from the simple interconnects of multicore processors.
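A quick way to see whether a kernel actually exposes enough resident threads for latency hiding is to ask the CUDA runtime for its occupancy. The sketch below is minimal and uses a placeholder kernel (dummyKernel); it is not part of the paper's methodology.

```cuda
// A minimal sketch: how many threads of a given kernel can be resident at once,
// i.e. how much work the hardware can overlap with outstanding memory requests.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = i * 2.0f;   // placeholder work; stands in for the kernel under study
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blockSize = 256, blocksPerSM = 0;
    // Ask the runtime how many blocks of this kernel fit on one SM at this block size.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, dummyKernel,
                                                  blockSize, /*dynamicSmem=*/0);

    int residentThreads = blocksPerSM * blockSize * prop.multiProcessorCount;
    printf("resident threads: %d of %d possible on this device\n",
           residentThreads,
           prop.maxThreadsPerMultiProcessor * prop.multiProcessorCount);
    return 0;
}
```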
The rise of cheap mobile systems-on-chip (SoCs) brings new challenges in massively parallel computing (in fact, supercomputing), providing an excessive level of parallelism but making code optimization time-consuming and costly. A mobile GPU with thousands of execution threads is a flagship of modern computing, and the next possible level up looks to be some kind of quantum computer, so the software development industry has to invent ways to utilize this computing power efficiently.
Fig. 2. Memory hierarchies for multicore CPU and GPU [1].
Going back to traditional multi-core CPUs, consider the scheme (fig. 2, left) of multilevel cache memory (or interconnect), which usually includes fast but very small L0 memory (read/write queues to the L1 cache), fast L1 cache memory, and L2 and L3 caches, which are quite slow but large. For streaming applications we should also account for DRAM performance, so the current multicore CPU memory hierarchy includes up to five levels, and 95% of memory bandwidth optimizations revolve around better use of the L1/L2 caches. Further memory hierarchy optimizations lead to exponential growth of optimization costs. Optimizations for the L1/L2 caches improve the spatial locality of data, but the rest of the memory hierarchy is not taken into account because it costs too much and the number of hardware threads is still small. Fig. 2, right, shows the much more complicated interconnect of just one core of a modern GPU. Fig. 2 does not show the several hundred in-flight threads generating memory requests. These threads rely on the following hardware (fig. 2, right): 1) crossbar interconnects and coalescers that gather and scatter data and connect cores to different L2 partitions; 2) a "non-blocking" streaming L1 cache, which eliminates stalls during memory accesses; 3) specific schedulers for multichannel DRAM access. The huge number of executing threads requires specific programming methods to utilize this hardware efficiently. Let us discuss a motivating case.
2. Software performance modeling
Despite the wide use of matrix computation examples in performance-related studies, BLAS/LAPACK-based software (or the BLAS/LAPACK style of expressing matrix operations) is still used to solve many problems, e.g. in signal and sound processing. Convolutional neural network software uses specially optimized routines for matrix processing, but general matrix-based software still uses LAPACK as an old and well-known standard. Nvidia provides the cuBLAS library of GPU-optimized BLAS routines.
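For reference, the sketch below shows how such a multiplication is typically delegated to cuBLAS. It is a minimal illustration (square matrices, no error checking, link with -lcublas), not code from the paper.

```cuda
// A minimal sketch of delegating C = A*B to cuBLAS instead of hand-written kernels;
// matrix sizes are illustrative and error checking is omitted for brevity.
#include <cublas_v2.h>
#include <cuda_runtime.h>

void gemm_example(int n) {
    float *dA, *dB, *dC;
    cudaMalloc(&dA, n * n * sizeof(float));
    cudaMalloc(&dB, n * n * sizeof(float));
    cudaMalloc(&dC, n * n * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    // Column-major GEMM: C = alpha*A*B + beta*C. All tiling and memory-hierarchy
    // tradeoffs discussed in the text are hidden inside the library kernels.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}
```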
Fig. 3. Illustrative block matrix computation scheme.
The common matrix-by-matrix multiplication scheme is shown in fig. 3. In order to compute a block of matrix C, a block row of A should be multiplied by a block column of B. As mentioned above, common software workloads are memory bandwidth bound, and this example is classical: for each multiplication, performed in one clock cycle, we need two memory accesses. The energy consumption of a multiplication (a small functional unit) and of a memory access (memory, buses, interconnect, crossbar switches) differs by at least 50x. So, for both speed and energy consumption, all blocks A2i and Bi2 should be loaded together into the L1D cache/shared memory (fig. 2, right) and, if possible, reused for computing the C2i blocks. Standard approaches to optimizing cuBLAS matrix multiplication are described in [5]; advanced autotuning results are presented in [6]. The autotuning approach [7] has many pros but also a con: the tuned library is fast "in the common sense", as the target parameter is time, while sometimes additional resource restrictions apply (e.g. less time at the cost of more memory, or software that depends on a "hot" cache) and a library may be used with a very specific dataset. This is common for most computational tasks: Fourier transforms use only one or two dimensions, and matrices come in only a few fixed dimensions. An important point is that several copies of a computational process may be run on the same GPU (e.g. neural network computations) to better utilize otherwise idle computational resources.
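The kernel below is a simplified version of the shared-memory tiling idea described above, in the spirit of the CUDA matrixMul sample; it assumes all dimensions are multiples of the tile size and is meant only as an illustration of how data reuse in on-chip memory cuts DRAM traffic.

```cuda
// Simplified shared-memory tiled multiplication: each block loads a TILE x TILE
// sub-block of A and B into on-chip shared memory once and reuses it TILE times.
// A is MxK, B is KxN, C is MxN, all row-major; M, N, K assumed multiples of TILE.
#define TILE 16

__global__ void matMulTiled(const float* A, const float* B, float* C,
                            int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < K / TILE; ++t) {
        // One coalesced global load per thread per tile; each value is then read
        // TILE times from shared memory instead of DRAM.
        As[threadIdx.y][threadIdx.x] = A[row * K + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}
```

Launched with a TILE x TILE block and an (N/TILE, M/TILE) grid, this kernel trades shared memory for DRAM bandwidth, which is exactly the tradeoff discussed in the following paragraphs.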
In any case, matrix multiplication is scheduled as an execution graph of smaller partial matrix-multiplication procedures of different sizes. There are various tradeoffs here, as compiled kernels have a pre-defined memory allocation layout, and their parametrization (changing loop bounds) sometimes requires recompilation and re-scheduling; this is common for GPUs. The compiled multiplication kernels reside on an SM (streaming multiprocessor, fig. 2, right), which has a limited register file (local memory, usually 64K cells) holding internal kernel variables, plus the shared memory cells needed for information exchange between kernels. The compiler also has to divide the memory between thread-local storage and shared communication memory, which directly influences how kernels can be allocated on the SM: the more memory a kernel uses for communication, the smaller the number of threads executed for that kernel. This is another tradeoff, highly dependent on the currently executed task set.
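The register and shared-memory footprint that drives this tradeoff can be inspected with the CUDA runtime. The sketch below uses a small stand-in kernel (tileKernel); in practice one would query the real multiplication kernel.

```cuda
// A small sketch of inspecting the per-thread register and per-block shared-memory
// footprint that limits how many threads of a compiled kernel fit on one SM.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void tileKernel(float* out) {
    __shared__ float tile[16][16];                 // static shared memory allocation
    tile[threadIdx.y][threadIdx.x] = threadIdx.x;  // placeholder work
    __syncthreads();
    out[threadIdx.y * 16 + threadIdx.x] = tile[threadIdx.x][threadIdx.y];
}

int main() {
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, tileKernel);
    printf("registers per thread:           %d\n", attr.numRegs);
    printf("static shared memory per block: %zu bytes\n", attr.sharedSizeBytes);
    printf("max threads per block (kernel): %d\n", attr.maxThreadsPerBlock);
    return 0;
}
```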
Initially, the "big" multiplication algorithm includes an optimized scheme for data transfers between kernels to save memory bandwidth. Of course, kernels should be parametrized by block sub-sizes and data type. Reaching a successful tradeoff involves finding the optimal parameters of the general block matrix-multiply kernel by 1) adjusting the amount of memory holding matrix values and 2) adjusting the amount of memory shared between kernels for data exchange. The optimal piece of a matrix processed by a kernel is a tricky question; in the general case, common heuristics (e.g. the register file size) drive the basic matrix block sizes, as we need 1) to compute, 2) to exchange data and 3) to run as many threads as possible. For kernel tuning there is initially a pre-defined knowledge base storing the basic register file parameters, so the tuning algorithm has well-defined options for trying different matrix block sizes and limits on matrix dimensions. The scheduler program defines a set of semi-optimal parameters of the "common" matrix multiplication procedure for the particular GPU. Today this problem is solved with a benchmarking system that runs the multiplication with several matrix block sizes and makes the GPU driver measure the execution time. After that the final library is hard-compiled with the optimal parameters and used further for number crunching. If the code is moved to other hardware, the library has to be re-evaluated and recompiled. Basically, this is the way commonly used in industry, for example for BLAS/LAPACK [5]. Still, this type of software tuning looks like a "black box" method: input task parameters produce results, but there is no evidence of how to line up the parameters to obtain semi-optimal execution for the selected datasets. No benchmarking system can build a model that finds relations between kernel parameters and execution time, as the benchmark uses only execution time as the optimization criterion.
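The benchmarking loop described above can be sketched as follows. launchCandidate() is a hypothetical wrapper around one parametrized kernel variant (here a no-op placeholder); the tile sizes are illustrative.

```cuda
// A minimal sketch of the time-only tuning loop: measure each candidate
// configuration with CUDA events and keep the fastest one.
#include <cfloat>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical wrapper: in a real tuner this would launch the kernel variant
// compiled for the given tile size (e.g. matMulTiled with TILE = tileSize).
static void launchCandidate(int tileSize) { (void)tileSize; }

int main() {
    const int candidates[] = {8, 16, 32};
    float best = FLT_MAX; int bestTile = 0;

    for (int tile : candidates) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start); cudaEventCreate(&stop);

        cudaEventRecord(start);
        launchCandidate(tile);              // run one variant
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        if (ms < best) { best = ms; bestTile = tile; }

        cudaEventDestroy(start); cudaEventDestroy(stop);
    }
    printf("selected tile size %d (%.3f ms)\n", bestTile, best);
    return 0;
}
```

Note that the selection is driven purely by elapsed time, which is exactly the "black box" limitation discussed above: the loop records no information about why one configuration is faster than another.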
The underlying problem is that block-based matrix multiplication is a comparatively simple task whose available parallelism has been investigated for more than 50 years. The public BLAS/LAPACK codes show that even this basic task is complex: it is not just a compiled library but a software system that generates semi-optimal source code after benchmarking the target system. If somebody has to produce brand-new code, it is hard even to imagine how much effort must be invested in creating semi-optimal parallel code that can exploit the parallelism of a massively parallel processor. Even if the task parallelizes well, we need to implement a parametrized kernel, identify the factors that influence software performance and use them when resolving the later tradeoffs.
3. Performance modeling for memory hierarchies
In this section we consider a model of the internal architecture and memory hierarchy (interconnect) of off-the-shelf video cards. Only the commonly used Nvidia video cards are considered, because a model is available for them: since the appearance of the pioneering programmable Nvidia 8800 video card, this platform has been actively used for different kinds of computations, including very specific tasks that use imprecise computations [8] to employ thousands of hardware execution threads for faster computing. Another point is that a video card is a general number-crunching machine with a very good tradeoff between peak computational power and memory bandwidth. Finally, it is really important that a code base of more or less optimized algorithms exists that can be analysed for the efficiency of memory bandwidth use.
For our investigation we use the GPGPU-Sim simulator, version 4.0.0 [9], which reflects the contemporary, complex architecture of the interconnect between the GPU SMs (fig. 2) and the memory system. Originally the simulator was developed to recover the architecture of Nvidia GPUs, including the SMs, the energy model and the full interconnect from the SM registers down to the DRAM [1]: practically all hardware solutions used for computation and memory traffic routing have been discussed in research papers, e.g. [3], but the particular combinations of architectural solutions are still a trade secret of Nvidia. The precision of GPGPU-Sim is verified by running benchmarks (e.g. the Rodinia benchmark [2]) both on the simulator and on a real GPU and comparing the numbers; the results of this comparison can be found in [2]. For our purposes we consider the difference between GPGPU-Sim simulation and real execution on a GPU to be insignificant for this research.
GPGPU-Sim uses a "fake driver" concept to run applications. The simulator exploits the fact that all CUDA-based code is compiled into so-called PTX code, which is very similar to the real processor instructions of an Nvidia GPU. Internally, the GPU driver that runs a CUDA program on a GPU translates the PTX code into real GPU instructions, places it on the SMs and executes it according to the N-dimensional execution model used by CUDA. This execution model effectively hides the GPU internals from the user but promises that well-parallelized software will scale well to next-generation GPUs supporting even more threads.
GPGPU-Sim acts as the driver and intercepts PTX code execution, so it does the same work as the CUDA translator and scheduler, but on the CPU and in a single thread. GPGPU-Sim therefore runs even on a computer without a real GPU. The simulation process is highly time-consuming: the standard matrix multiplication application with default parameters from the CUDA samples package runs for two days under the simulator. This is not a real problem, as an older computer fleet able to run Ubuntu 20/22 and the GCC 7 toolchain can be used to simulate CUDA application runs. GPGPU-Sim provides different types of simulation: 1) purely functional; 2) simulation with all instruction timings but zero memory timings; 3) simulation of the whole SM-to-memory interconnect, including a detailed graphics DRAM (GDDR) timing model. There is also a special mode for energy consumption simulation. It should be noted that the simulator code is hosted on GitHub, so it can be freely downloaded and modified. There are several simulator extensions, e.g. Accel-Sim [10], which are used for precise simulation validation and analysis.
GPGPU-Sim is highly parametrized: more than a hundred options control the simulated hardware blocks, from the SMs to the DRAM. As the GPGPU-Sim authors strive for a simulation accuracy within 5% of real hardware, all hardware components are simulated precisely. As a simple example, practically all the timing values from the graphics DRAM chip manuals describing the memory bus are introduced into the memory model. All the specific concepts that enable the huge level of parallelism, for example streaming L1 caches [3], are simulated, so the application developer can observe the real effects of the newer hardware concepts on the application. The simulated interconnect is complex enough to account for even a basic set of parallel-code optimizations. A developer who wants to optimize the code should sit down with paper and pencil and plan the memory layouts and the scheduling of the parallel application.
For our purposes we need to analyze a set of performance counters generated by GPGPU-Sim during program execution. At first glance these counters can be divided into two sets: 1) counters at the borders between interconnect levels; 2) counters for events internal to the interconnect levels. Let us consider fig. 4.
Fig. 4. Memory hierarchy/interconnect levels.
Fig. 4 shows the large blocks of the SM-to-DRAM interconnect hierarchy [1]. The arrows are located at the real hierarchy borders, and for each arrow we can measure the traffic from the upper hierarchy level to the lower one (e.g. from the L2 cache to the memory controller). The efficiency of a hierarchy level for a particular task (except the interconnect itself, which just routes data between L1 cache banks and L2 cache banks) is the difference between the amount of data routed into the level from the higher level (left in fig. 4) and the data routed on to the lower levels of the memory hierarchy (right in fig. 4): the bigger the difference, the better. Researchers should also be interested in the dataset sizes that are friendly to the hardware, because, despite the growing DRAM size on the board, the internal cache and shared memory sizes of the SMs stay practically constant across GPU generations. The internal event counters are used for fine-tuning algorithms, as they show how the improvements at different memory levels are exploited by the parallelized version of an algorithm; different parallelization methods can use these internal helpers in different ways. Here we discuss the "level border" counters; the internal level counters are the subject of our next research.
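The border-counter metric described above can be stated compactly as a filtering ratio. The notation below is ours, introduced only to make the definition explicit.

```latex
% Filtering efficiency of one memory-hierarchy level, expressed through the
% border counters: B_in is the traffic arriving from the upper level,
% B_out the traffic forwarded to the lower level.
F_{\text{level}} = \frac{B_{\text{in}} - B_{\text{out}}}{B_{\text{in}}}
                 = 1 - \frac{B_{\text{out}}}{B_{\text{in}}}
```

A value close to 1 means the level absorbs almost all of the incoming traffic (for a cache, a low miss rate); a value close to 0 means the level merely forwards traffic downwards.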
4. Motivating examples
We now propose a methodology that starts from the very first analysis of memory traffic. There are different levels of hardware observation, and we currently start from high-level metrics. Let us discuss what is valuable here and what the challenges are for research on the simulator side.
Every computationally demanding application starts from a computational model, which usually runs at an extremely low speed. For example, a LIDAR model application runs at 1 frame per second (fps), while the final requirement is 100 fps. So the initial task of the simulation is to evaluate the peak memory performance of the model application. Note that the "computational demand" here is transformed into "memory performance", as memory throughput is more limiting than computational throughput. L2 cache misses per computed frame describe the basic need for memory traffic well and (without optimizations) help to project the upper memory bandwidth limit. Local optimizations of CUDA procedures may improve effective memory bandwidth by 10x-20x, but the detailed information has to come from a simulation run. In any case, a performance model built for a multicore CPU does not project real scaling to hundreds of threads and does not reflect the optimization effects possible on a GPU. That is why massive parallelism cannot be simulated and evaluated on a multi-core CPU: the difference in the number of threads is up to 50-100x, so none of the real parallelization issues can be observed with a low number of threads. At a high level of hardware observation a developer can see the upper limits of the memory traffic needed to decide whether an optimized program run on the GPU is feasible; at a lower level of observation it is even possible to check the fine effects of using the different cache levels.
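The projection from L2 misses per frame to a bandwidth requirement is simple arithmetic; the sketch below shows it with assumed numbers (the miss count and the 32-byte sector fill are illustrative assumptions, not measured values from the paper).

```cuda
// Back-of-envelope projection of the DRAM bandwidth needed to reach a target
// frame rate, given L2 misses per frame from a simulation run.
#include <cstdio>

int main() {
    const double l2_misses_per_frame = 5.0e6;  // hypothetical value from a simulation log
    const double bytes_per_miss      = 32.0;   // assumed L2 sector fill size
    const double target_fps          = 100.0;  // application requirement

    const double required_gbps =
        l2_misses_per_frame * bytes_per_miss * target_fps / 1.0e9;

    printf("required DRAM bandwidth: %.1f GB/s\n", required_gbps);
    // If this exceeds the card's DRAM bandwidth, no compute-side optimization
    // will reach the target frame rate without reducing the memory traffic itself.
    return 0;
}
```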
As an entry-level example, the matrixMul CUDA sample application was used. To check the cache utilization effects, the application was run (simulated) with different matrix size parameters, starting from a 64x64 matrix and finishing with a 256x896 one. The default multiplication size of 320x320 by 640x320 requires two days of simulation on an Intel Core i5-10400 using one thread, so the dataset configuration was adjusted to decrease the simulation time while still reflecting real problem sizes.
By default, GPGPU-Sim generates a text file in which it periodically updates the internal simulation counters and model state. As the simulator operates on the full GPU state, the simulation results are functionally correct and the programs/shaders execute correctly. The resulting log file can be parsed with any tool, from basic grep/findstr utilities up to complex Python scripts. To parse these log files we used simple findstr-based scripts. The simulation was run under Ubuntu 22 on an Intel 10th Gen machine equipped with 32 GB of RAM, retargeted to the GCC 7 toolchain, as other toolchains run into compilation errors.
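A minimal extraction pass over such a log can look as follows. The counter names in the sketch are illustrative and may differ between GPGPU-Sim versions; the intent is only to show the pattern of filtering the aggregate statistics we discuss below.

```cuda
// A minimal sketch of pulling a few aggregate counters out of a GPGPU-Sim log file;
// counter names are illustrative and may differ between simulator versions.
#include <fstream>
#include <iostream>
#include <string>

int main(int argc, char** argv) {
    if (argc < 2) { std::cerr << "usage: parse <gpgpusim.log>\n"; return 1; }

    const std::string keys[] = {
        "gpu_tot_sim_cycle",         // total simulated cycles
        "L2_total_cache_accesses",   // traffic arriving at L2 from L1
        "L2_total_cache_misses"      // traffic forwarded from L2 towards DRAM
    };

    std::ifstream log(argv[1]);
    std::string line;
    // Print every occurrence; the simulator re-prints counters periodically,
    // so the last printed line holds the final cumulative value.
    while (std::getline(log, line))
        for (const auto& key : keys)
            if (line.find(key) != std::string::npos)
                std::cout << line << '\n';
    return 0;
}
```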
The simulation results are gathered into the tables in fig. 5 and fig. 6; we show only the most important results. For each matrixMul run we collect the matrix dimensions (noted for reference, to know what amount of data is processed during the run), the L2 cache accesses, the miss rate and the parallel utilization of the L2 cache. L2 accesses represent the total data traffic coming from the L1 cache, while L2 misses represent the data traffic going to DRAM. The most interesting figure is the L2 cache parallel utilization: the higher this value, the better the sectored L2 cache is utilized, as the L2 structure allows accesses in different L2 sectors to be handled simultaneously. We see some imbalance in the L2 parallel utilization (fig. 5) for the 64x384x64, 128x768x128 and 256x768x256 multiplications; our hypothesis is that the memory layout in these cases is imbalanced and some L2 sectors are loaded much more heavily than others.
Fig. 5. Matrix multiplication simulation results.
Fig. 6. L1/L2 cache load balancing
Fig. 6 shows the thread-running balance in the sense of the balance of memory accesses across the sectored cache memory. The picture in fig. 6 is practically the same for every reviewed matrix size, so we see an almost ideal balance of L2 accesses and misses; this means that the matrices are distributed equally over the running processes. Note that this balance also holds for the L1 cache, even though it has more sectors (30) than the 24 sectors of L2.
Let us review what we can conclude from the "big picture" of the matrix multiplication example. First, we see that the problem fits the cache almost ideally: the number of cache misses is below 0.1%, which is a really small value. This should not surprise us, as matrix multiplication has been studied for over 50 years, but this kind of analysis is usually performed for tasks that are unfamiliar to researchers, and such results are extremely valuable when we have no prior knowledge about the problem. Additional results are not included in the paper, but the same balancing analysis as in fig. 6 was also computed for DRAM. As the DRAM is multichannel (12 channels), in the case of higher L2 miss rates (more than 1-2%) the problem of properly parallelizing DRAM accesses arises, since 12 DRAM channels should effectively multiply DRAM throughput only when the accesses are parallelized correctly. In any case, the data in fig. 6 allow us to understand the load balancing at the cache memory level, and this metric directly highlights the quality of parallelization, which is ideal in our case.
Conclusions
We use GPGPU-Sim in the opposite way to the original authors' intention: instead of reverse-engineering the GPU architecture, we evaluate CUDA-based software, since the simulation timing error does not exceed 10%. This error is not even critical for our tasks, as we need information about the volume and patterns of memory accesses. For our classical task, matrix multiplication, we obtained the cache utilization efficiency, the memory bandwidth and a parallelization efficiency metric quite easily. When projected onto real tasks, where the matrix multiplication sizes are fixed, we can easily predict the memory footprint and the time necessary to complete the operation, and so estimate the real performance (within ±10%) on a GPU platform.
This paper is just the beginning of the road into GPU performance analysis, as we currently analyze only basic figures. We have not evaluated all simulation modes and have not examined all the parameters available in the resulting log file; these are tasks for further simulator evaluation. We also need a more accurate "slicing" of the simulation state in time, to gather more precise information about algorithm behaviour over time. Additional analysis of the cache memory utilization imbalance is also required. Finally, we need to check the performance of algorithms with an L2 miss rate above 2% in order to see what can be gained from analyzing the DRAM channel load.
References
1. Khairy M., Jain A., Aamodt T.M., Rogers T.G. A Detailed Model for Contemporary GPU Memory Systems. 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2019, pp. 141-142.
2. Bakhoda A., Yuan G.L., Fung W.W.L., Wong H., Aamodt T.M. Analyzing CUDA workloads using a detailed GPU simulator. 2009 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2009, pp. 163-174. doi: 10.1109/ISPASS.2009.4919648.
3. Jog A., Kayiran O., Kesten T., Pattnaik A., Bolotin E., Chatterjee N., Keckler S.W., Kandemir M.T., Das C.R. Anatomy of GPU Memory System for Multi-Application Execution. Proceedings of the 2015 International Symposium on Memory Systems (MEMSYS '15), ACM, New York, 2015, pp. 223-234. doi: 10.1145/2818950.2818979.
4. Raihan M.A., Goli N., Aamodt T.M. Modeling Deep Learning Accelerator Enabled GPUs. 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2019, pp. 79-92. doi: 10.1109/ISPASS.2019.00016.
5. Barrachina S., Castillo M., Igual F.D., Mayo R., Quintana-Orti E.S. Evaluation and tuning of the Level 3 CUBLAS for graphics processors. 2008 IEEE International Symposium on Parallel and Distributed Processing, 2008, pp. 1-8. doi: 10.1109/IPDPS.2008.4536485.
6. Kurzak J., Tomov S., Dongarra J. Autotuning GEMM Kernels for the Fermi GPU. IEEE Transactions on Parallel and Distributed Systems, vol. 23, no. 11, 2012, pp. 2045-2057. doi: 10.1109/TPDS.2011.311.
7. Ivanenko P.A., Doroshenko A.Y., Zhereb K.A. TuningGenie: Auto-Tuning Framework Based on Rewriting Rules. In: ICTERI 2014, Kherson, Ukraine, June 9-12, 2014, Revised Selected Papers, Communications in Computer and Information Science, vol. 469, Springer, 2014, pp. 139-160. doi: 10.1007/978-3-319-13206-8_7.
8. Wu K., Truong N., Yuksel C., Hoetzlein R. Fast Fluid Simulations with Sparse Volumes on the GPU. Computer Graphics Forum (Eurographics), vol. 37, May 2018, pp. 157-167. doi: 10.1111/cgf.13350.
9. Jain A., Rogers T. A Quantitative Evaluation of Contemporary GPU Simulation Methodology. ACM SIGMETRICS Performance Evaluation Review, vol. 46, June 2018, pp. 103-105. doi: 10.1145/3292040.3219658.
10. Khairy M., Shen Z., Aamodt T.M., Rogers T.G. Accel-Sim: An Extensible Simulation Framework for Validated GPU Modeling. 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), 2020.
Received 17.08.2022
About the author:
Dmytro Vasyliovych Rahozin,
PhD in Engineering,
Senior Researcher.
Publications in Ukrainian journals: more than 10.
Publications abroad: 2.
http://orcid.org/0000-0002-8445-9921

Affiliation:
Institute of Software Systems of the NAS of Ukraine,
40 Akademika Hlushkova Avenue, Kyiv-187, 03187, Ukraine.
Phone: (38)(044) 526-21-48
E-mail: Dmytro.Rahozin@gmail.com