UDC 681.3 https://doi.org/10.15407/pp2022.03-04.051
PERFORMANCE ANALYSIS OF MASSIVELY PARALLEL PROGRAMS FOR GRAPHICS PROCESSING UNITS
Dmytro Rahozin
Any modern graphics processing unit (GPU, or graphics card) is a good platform for running massively parallel programs. Still, we lack tools to observe and measure the performance characteristics of GPU-based software. We argue that, due to the complex memory hierarchy and thousands of execution threads, practically all performance issues come down to the efficient use of the graphics card memory hierarchy. We propose to use the GPGPU-Sim simulator, previously used mostly for graphics card architecture validation, for performance analysis of CUDA-based programs. We provide examples that show how to use the simulation for performance analysis of massively parallel programs.
Keywords: graphics processing unit, software performance, massive parallelism, simulation, software performance model.
Introduction
Modern graphics processing units (GPUs), or video cards, have revolutionized high-performance computing, introducing massive parallelism at a moderate price and allowing everybody to have a supercomputer at home. Still, the effective use of a GPU is limited, as creating efficient GPU software usually requires a full redesign of the original algorithm: a GPU provides quite a different model of exploiting massive parallelism.
First, we discuss the existing GPU performance models, which are based on software simulators of GPU hardware. Such a simulator models the computational units, the GPU memory hierarchy and the interconnection network, and is able to execute ordinary programs compiled for the GPU, for example the PTX code produced by the CUDA toolkit for Nvidia GPUs. Most simulators employ a reverse-engineered GPU architecture [1], recovered with the help of various benchmarks [2], which also makes the end user aware of better software optimization techniques for video cards.
Second, there is the problem of inefficient utilization of GPU massive parallelism: for complex software it is hard to provide a good higher-level model of massive parallelism that can be effectively projected onto the GPU hardware [3]. It is hard to account for the relations and dependencies between the basic parameters of a massively parallel software model (for example, the number of threads, the memory block size processed by one thread, and the scheduling technique) and thus to predict the final real performance. The GPU computing model has truly revolutionized computing, and we argue that GPU performance is fully defined by the performance of the memory hierarchy. As the GPU memory hierarchy now employs five or six memory levels (compared to the older L1-L2-L3 cache scheme), the analysis of bottlenecks caused by the memory hierarchy and interconnect is extremely complex. The software developer clearly needs good tools that help to analyze what is wrong with software performance, and needs a simpler model of software performance.
Third, we propose using a simulator to discover potential problems in massively parallel software, analyze its bottlenecks and recover memory utilization models. We reformulate the performance analysis problem as a problem of resource utilization at the different memory hierarchy stages, which is very close to commonly used queueing models. The role of the simulator changes: originally it was used to recover the peculiarities of a GPU architecture, but now we extract model data from the memory hierarchy simulation. The simplified model is applied back to the original task and allows estimating the memory and time resources necessary for a software run.
Fourth, we provide a motivating example of CUDA application analysis on top of the simulation and illustrate how the simulation data can be used to recover a less complex performance model that is then re-applied to application optimization. We also provide suggestions for enhancing the simulator and enabling better performance analysis, leading to a clear performance estimate for the developer's massively parallel software.
1. GPU performance models
Currently we are observing several shifts in microprocessor production, and these shifts greatly impact the existing approaches to application development. First, the term "parallel application development" has become more "academic", as the process of exploring different forms of parallelism in an application has become ever more complex and time-consuming and requires a lot of knowledge in narrow areas. Second, the modern "start-up" culture pushes engineers to decrease time-to-market and development costs for their products or prototypes, so the adoption of new architectural capabilities becomes slower and slower and often comes down to moving to a newer GPU or running several copies of a process on a more powerful GPU. In practice, all a "median" programmer does about parallelism is employ the threading concept, without a good understanding of the resulting application performance. This gives chip makers a chance to increase the number of processor cores on a chip, as the cores can easily be occupied by applications or the operating system simply running the usual software services. So, basically, applications are not optimized for multi-core CPUs. Further performance analysis and optimization are left open to industrial rivals, who can enter the market with a substitute application that is faster and smarter. Scaling the number of processor cores therefore looks like a good choice for a typical smartphone platform, which runs a bunch of applications that are loosely coupled in terms of using shared data simultaneously.
However, if we look at software that can be defined as "data processing infrastructure" (communication, video processing, artificial intelligence (AI) [4]), we see the opposite picture: each year the volume of processed data increases exponentially. For example, a century ago a short telegram was enough for communication, 20 years ago we entered the era of voice conferences with sophisticated sound processing, and now we demand video calls and conferences because we need the visual presence of our peers. Nobody can predict the next shift (most likely a move towards virtual reality) where the computation demand will increase by another 10x.
This increase in computing demand targets neither the computation capability nor the memory volume, as computational cores and memory cells scale perfectly, but the interconnect that joins storage and computation (fig. 1). It is practically impossible to scale the interconnect infrastructure due to physical limitations, yet fig. 1 illustrates the structure of a typical computational system: it applies equally to microcontrollers, common multicore systems and GPUs. This means that every computer system has the same problem of a non-scaling interconnect network, but today the interconnect scaling limitation has become severe. Note that GPU hardware currently has the best solutions for optimizing the interconnect, as it has to process huge amounts of data (triangles, textures and points) in complex ways (rendering, ray tracing, general computing) in real time (60 frames per second).
Fig. 1. Computational system structure.
Pure computation capability is practically unlimited now and the number of memory cells is huge, but the interconnect between them, which routes data to the computational units, is very limited. An even more limiting factor is the complexity of programming the interconnect. There is a good (and old) example of this: older DSPs had several independent memory buses, as only this solution doubled or tripled the data throughput. The technology required special programming, as program data had to be spread among two or more memory spaces manually. Modern chips are more complicated and have multilevel cache memory, but the major problem is still the same: memory is quite slow compared to computation, so the channel capacity between logic and memory must be increased, and here it appears that the paradigm has to change. Modern GPUs (as of 2019) have 12 (twelve) or more high-speed graphics DRAM (GDDR5/6) controllers, which should, at least in theory, balance the memory load and throughput.
Long ago, a simple but practical classification of software performance limitations was introduced: memory-bound software and compute-bound software. The latter case was usually resolved quickly in two ways: 1) computations are scaled via SIMD execution, simply providing extra number-crunching power (but the memory buses then need to scale as well); 2) common pipelines are fixed in hardware and used as hardware accelerators (e.g. Nvidia video stream encoders and decoders). The memory-bound case is much harder: the memory bus has become wide, up to 2048 bits, to feed the SIMD units or hardware accelerators, but the bus speed can never outperform arithmetic speed; even worse, multi-core systems simply multiply the memory load. Basically, if we leave aside the small number of truly compute-bound algorithms, the most important characteristic of a program or algorithm is its degree of "memory-boundness". Overcoming memory bandwidth limitations requires improvements in both hardware and software, as the interconnect infrastructure should be loaded optimally.
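As a rough illustration of this classification, the sketch below compares the arithmetic intensity of a kernel with the machine balance of a card. All figures are illustrative assumptions, not values taken from the paper or from any particular GPU data sheet.

```cuda
// Back-of-envelope check of memory-boundedness (illustrative numbers only).
// A kernel is memory-bound when its arithmetic intensity (FLOP per byte moved)
// is below the machine balance (peak FLOP/s divided by peak DRAM bandwidth).
#include <cstdio>

int main() {
    // Hypothetical GPU figures; substitute the data-sheet values of the target card.
    const double peak_gflops     = 10000.0;  // 10 TFLOP/s single precision (assumed)
    const double peak_gbytes     = 500.0;    // 500 GB/s DRAM bandwidth (assumed)
    const double machine_balance = peak_gflops / peak_gbytes;   // FLOP per byte

    // Naive (untiled) matrix multiply: every multiply-add reads two 4-byte operands.
    const double flops_per_elem = 2.0;       // one multiply + one add
    const double bytes_per_elem = 8.0;       // two float loads, no reuse
    const double intensity      = flops_per_elem / bytes_per_elem;

    printf("machine balance: %.1f FLOP/byte, kernel intensity: %.2f FLOP/byte\n",
           machine_balance, intensity);
    if (intensity < machine_balance)
        printf("kernel is memory-bound: performance is capped by DRAM traffic\n");
    return 0;
}
```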
The interconnect is not just a bunch of wires [3]. Computational units produce requests to read and write memory cells, while memory is old-fashioned DRAM that needs to open pages, refresh its contents and serve operations. Pseudo-random memory accesses greatly decrease DRAM throughput, so the interconnect should effectively decrease the number of memory accesses, exploiting the principle that the traffic for the most frequently used memory cells should be routed somewhere inside the interconnect. Simply speaking, the interconnect should integrate a specific multilevel cache memory able to serve multithreaded execution, which multiplies memory traffic.
As a good example of the change in the threading paradigm, consider the execution structure of GPU hardware threads, where massive thread execution is driven by the readiness of data fetched from the interconnect internals. This approach requires simultaneous execution of hundreds of threads: some of them are processing data, while others are waiting for their data to become ready. This raises at least two common problems: 1) an algorithm should have a sufficient degree of parallelism to allow the execution of hundreds of threads (for example, triangle and pixel rendering); 2) an algorithm should be designed from the start to exploit the benefits of the memory hierarchy (interconnect). The first problem is unavoidable, as somebody has to convert the initial task into a set of interoperating threads. The second problem is more technical, but it requires defining the interconnect and memory hierarchy structure of the computing system. For practical purposes we state that the interconnect and memory hierarchy need to be architected to support hundreds of threads; conceptually this is very different from the simple interconnects of multicore processors.
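A quick way to see whether a kernel actually exposes enough resident threads for latency hiding is to ask the CUDA runtime for its occupancy. The sketch below is minimal and uses a placeholder kernel (dummyKernel); it is not part of the paper's methodology.

```cuda
// A minimal sketch: how many threads of a given kernel can be resident at once,
// i.e. how much work the hardware can overlap with outstanding memory requests.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void dummyKernel(float* out) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    out[i] = i * 2.0f;   // placeholder work; stands in for the kernel under study
}

int main() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    int blockSize = 256, blocksPerSM = 0;
    // Ask the runtime how many blocks of this kernel fit on one SM at this block size.
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSM, dummyKernel,
                                                  blockSize, /*dynamicSmem=*/0);

    int residentThreads = blocksPerSM * blockSize * prop.multiProcessorCount;
    printf("resident threads: %d of %d possible on this device\n",
           residentThreads,
           prop.maxThreadsPerMultiProcessor * prop.multiProcessorCount);
    return 0;
}
```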
The rise of cheap mobile systems-on-chip (SoCs) brings new challenges in massively parallel computing (in fact, supercomputing), providing an excessive level of parallelism but making code optimization time-consuming and costly. A mobile GPU with thousands of execution threads is a flagship of modern computing, and the next possible level up looks to be some kind of quantum computer, so the software development industry has to invent ways to utilize this computing power efficiently.
Fig. 2. Memory hierarchies for multicore CPU and GPU [1].
Going back to traditional multi-core CPUs, consider the scheme (fig. 2, left) of multilevel cache memory (or interconnect), which usually includes fast but very small L0 memory (read/write queues to the L1 cache), fast L1 cache memory, and L2 and L3 caches, which are quite slow but large. For streaming applications we should also account for DRAM performance, so the current multicore CPU memory hierarchy includes up to five levels, and 95% of memory bandwidth optimizations revolve around better use of the L1/L2 caches. Further memory hierarchy optimizations lead to exponential growth of optimization costs. Optimizations for the L1/L2 caches improve the spatial locality of data, but the rest of the memory hierarchy is not taken into account because it costs too much and the number of hardware threads is still small. Fig. 2, right, shows the much more complicated interconnect of just one core of a modern GPU. Fig. 2 does not show the several hundred in-flight threads generating memory requests. These threads rely on the following hardware (fig. 2, right): 1) crossbar interconnects and coalescers that gather and scatter data and connect cores to different L2 partitions; 2) a "non-blocking" streaming L1 cache, which eliminates stalls during memory accesses; 3) specific schedulers for multichannel DRAM access. The huge number of executing threads requires specific programming methods to utilize this hardware efficiently. Let us discuss a motivating case.
2. Software performance modeling
Despite the wide use of matrix computation examples in performance-related studies, BLAS/LAPACK-based software (or the BLAS/LAPACK style of expressing matrix operations) is still used to solve many problems, e.g. in signal and sound processing. Convolutional neural network software uses specially optimized routines for matrix processing, but general matrix-based software still uses LAPACK as an old and well-known standard. Nvidia provides the cuBLAS library of GPU-optimized BLAS routines.
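For reference, the sketch below shows how such a multiplication is typically delegated to cuBLAS. It is a minimal illustration (square matrices, no error checking, link with -lcublas), not code from the paper.

```cuda
// A minimal sketch of delegating C = A*B to cuBLAS instead of hand-written kernels;
// matrix sizes are illustrative and error checking is omitted for brevity.
#include <cublas_v2.h>
#include <cuda_runtime.h>

void gemm_example(int n) {
    float *dA, *dB, *dC;
    cudaMalloc(&dA, n * n * sizeof(float));
    cudaMalloc(&dB, n * n * sizeof(float));
    cudaMalloc(&dC, n * n * sizeof(float));

    cublasHandle_t handle;
    cublasCreate(&handle);

    const float alpha = 1.0f, beta = 0.0f;
    // Column-major GEMM: C = alpha*A*B + beta*C. All tiling and memory-hierarchy
    // tradeoffs discussed in the text are hidden inside the library kernels.
    cublasSgemm(handle, CUBLAS_OP_N, CUBLAS_OP_N, n, n, n,
                &alpha, dA, n, dB, n, &beta, dC, n);

    cublasDestroy(handle);
    cudaFree(dA); cudaFree(dB); cudaFree(dC);
}
```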
Fig. 3. Illustrative block matrix computation scheme.
The common matrix-by-matrix multiplication scheme is shown in fig. 3. In order to compute a block of matrix C, a block row of A should be multiplied by a block column of B. As mentioned above, common software workloads are memory bandwidth bound, and this example is classical: for each multiplication, performed in one clock cycle, we need two memory accesses. The energy consumption of a multiplication (a small functional unit) and of a memory access (memory, buses, interconnect, crossbar switches) differs by at least 50x. So, for both speed and energy consumption, all blocks A2i and Bi2 should be loaded together into the L1D cache/shared memory (fig. 2, right) and, if possible, reused for computing the C2i blocks. Standard approaches to optimizing cuBLAS matrix multiplication are described in [5]; advanced autotuning results are presented in [6]. The autotuning approach [7] has many pros but also a con: the tuned library is fast "in the common sense", as the target parameter is time, while sometimes additional resource restrictions apply (e.g. less time at the cost of more memory, or software that depends on a "hot" cache) and a library may be used with a very specific dataset. This is common for most computational tasks: Fourier transforms use only one or two dimensions, and matrices come in only a few fixed dimensions. An important point is that several copies of a computational process may be run on the same GPU (e.g. neural network computations) to better utilize otherwise idle computational resources.
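The kernel below is a simplified version of the shared-memory tiling idea described above, in the spirit of the CUDA matrixMul sample; it assumes all dimensions are multiples of the tile size and is meant only as an illustration of how data reuse in on-chip memory cuts DRAM traffic.

```cuda
// Simplified shared-memory tiled multiplication: each block loads a TILE x TILE
// sub-block of A and B into on-chip shared memory once and reuses it TILE times.
// A is MxK, B is KxN, C is MxN, all row-major; M, N, K assumed multiples of TILE.
#define TILE 16

__global__ void matMulTiled(const float* A, const float* B, float* C,
                            int M, int N, int K) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float acc = 0.0f;

    for (int t = 0; t < K / TILE; ++t) {
        // One coalesced global load per thread per tile; each value is then read
        // TILE times from shared memory instead of DRAM.
        As[threadIdx.y][threadIdx.x] = A[row * K + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * N + col];
        __syncthreads();

        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }
    C[row * N + col] = acc;
}
```

Launched with a TILE x TILE block and an (N/TILE, M/TILE) grid, this kernel trades shared memory for DRAM bandwidth, which is exactly the tradeoff discussed in the following paragraphs.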
In any case, matrix multiplication is scheduled as an execution graph of smaller partial matrix-multiplication procedures of different sizes. There are various tradeoffs here, as compiled kernels have a pre-defined memory allocation layout, and their parametrization (changing loop bounds) sometimes requires recompilation and re-scheduling; this is common for GPUs. The compiled multiplication kernels reside on an SM (streaming multiprocessor, fig. 2, right), which has a limited register file (local memory, usually 64K cells) holding internal kernel variables, plus the shared memory cells needed for information exchange between kernels. The compiler also has to divide the memory between thread-local storage and shared communication memory, which directly influences how kernels can be allocated on the SM: the more memory a kernel uses for communication, the smaller the number of threads executed for that kernel. This is another tradeoff, highly dependent on the currently executed task set.
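The register and shared-memory footprint that drives this tradeoff can be inspected with the CUDA runtime. The sketch below uses a small stand-in kernel (tileKernel); in practice one would query the real multiplication kernel.

```cuda
// A small sketch of inspecting the per-thread register and per-block shared-memory
// footprint that limits how many threads of a compiled kernel fit on one SM.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void tileKernel(float* out) {
    __shared__ float tile[16][16];                 // static shared memory allocation
    tile[threadIdx.y][threadIdx.x] = threadIdx.x;  // placeholder work
    __syncthreads();
    out[threadIdx.y * 16 + threadIdx.x] = tile[threadIdx.x][threadIdx.y];
}

int main() {
    cudaFuncAttributes attr;
    cudaFuncGetAttributes(&attr, tileKernel);
    printf("registers per thread:           %d\n", attr.numRegs);
    printf("static shared memory per block: %zu bytes\n", attr.sharedSizeBytes);
    printf("max threads per block (kernel): %d\n", attr.maxThreadsPerBlock);
    return 0;
}
```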
Initially, the "big" multiplication algorithm includes an optimized scheme for data transfers between kernels to save memory bandwidth. Of course, kernels should be parametrized by block sub-sizes and data type. Reaching a successful tradeoff involves finding the optimal parameters of the general block matrix-multiply kernel by 1) adjusting the amount of memory holding matrix values and 2) adjusting the amount of memory shared between kernels for data exchange. The optimal piece of a matrix processed by a kernel is a tricky question; in the general case, common heuristics (e.g. the register file size) drive the basic matrix block sizes, as we need 1) to compute, 2) to exchange data and 3) to run as many threads as possible. For kernel tuning there is initially a pre-defined knowledge base storing the basic register file parameters, so the tuning algorithm has well-defined options for trying different matrix block sizes and limits on matrix dimensions. The scheduler program defines a set of semi-optimal parameters of the "common" matrix multiplication procedure for the particular GPU. Today this problem is solved with a benchmarking system that runs the multiplication with several matrix block sizes and makes the GPU driver measure the execution time. After that the final library is hard-compiled with the optimal parameters and used further for number crunching. If the code is moved to other hardware, the library has to be re-evaluated and recompiled. Basically, this is the way commonly used in industry, for example for BLAS/LAPACK [5]. Still, this type of software tuning looks like a "black box" method: input task parameters produce results, but there is no evidence of how to line up the parameters to obtain semi-optimal execution for the selected datasets. No benchmarking system can build a model that finds relations between kernel parameters and execution time, as the benchmark uses only execution time as the optimization criterion.
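The benchmarking loop described above can be sketched as follows. launchCandidate() is a hypothetical wrapper around one parametrized kernel variant (here a no-op placeholder); the tile sizes are illustrative.

```cuda
// A minimal sketch of the time-only tuning loop: measure each candidate
// configuration with CUDA events and keep the fastest one.
#include <cfloat>
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical wrapper: in a real tuner this would launch the kernel variant
// compiled for the given tile size (e.g. matMulTiled with TILE = tileSize).
static void launchCandidate(int tileSize) { (void)tileSize; }

int main() {
    const int candidates[] = {8, 16, 32};
    float best = FLT_MAX; int bestTile = 0;

    for (int tile : candidates) {
        cudaEvent_t start, stop;
        cudaEventCreate(&start); cudaEventCreate(&stop);

        cudaEventRecord(start);
        launchCandidate(tile);              // run one variant
        cudaEventRecord(stop);
        cudaEventSynchronize(stop);

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        if (ms < best) { best = ms; bestTile = tile; }

        cudaEventDestroy(start); cudaEventDestroy(stop);
    }
    printf("selected tile size %d (%.3f ms)\n", bestTile, best);
    return 0;
}
```

Note that the selection is driven purely by elapsed time, which is exactly the "black box" limitation discussed above: the loop records no information about why one configuration is faster than another.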
The underlying problem is that block-based matrix multiplication is a comparatively simple task whose available parallelism has been investigated for more than 50 years. The public BLAS/LAPACK codes show that even this basic task is complex: it is not just a compiled library but a software system that generates semi-optimal source code after benchmarking the target system. If somebody has to produce brand-new code, it is hard even to imagine how much effort must be invested in creating semi-optimal parallel code that can exploit the parallelism of a massively parallel processor. Even if the task parallelizes well, we need to implement a parametrized kernel, identify the factors that influence software performance and use them when resolving the later tradeoffs.
3. Performance modeling for memory hierarchies
In this section we consider a model of the internal architecture and memory hierarchy (interconnect) of off-the-shelf video cards. Only the commonly used Nvidia video cards are considered, because a model is available for them: since the appearance of the pioneering programmable Nvidia 8800 video card, this platform has been actively used for different kinds of computations, including very specific tasks that use imprecise computations [8] to employ thousands of hardware execution threads for faster computing. Another point is that a video card is a general number-crunching machine with a very good tradeoff between peak computational power and memory bandwidth. Finally, it is really important that a code base of more or less optimized algorithms exists that can be analysed for the efficiency of memory bandwidth use.
For our investigation we use the GPGPU-Sim simulator, version 4.0.0 [9], which reflects the contemporary, complex architecture of the interconnect between the GPU SMs (fig. 2) and the memory system. Originally the simulator was developed to recover the architecture of Nvidia GPUs, including the SMs, the energy model and the full interconnect from the SM registers down to the DRAM [1]: practically all hardware solutions used for computation and memory traffic routing have been discussed in research papers, e.g. [3], but the particular combinations of architectural solutions are still a trade secret of Nvidia. The precision of GPGPU-Sim is verified by running benchmarks (e.g. the Rodinia benchmark [2]) both on the simulator and on a real GPU and comparing the numbers; the results of this comparison can be found in [2]. For our purposes we consider the difference between GPGPU-Sim simulation and real execution on a GPU to be insignificant for this research.
GPGPU-Sim uses a "fake driver" concept to run applications. The simulator exploits the fact that all CUDA-based code is compiled into so-called PTX code, which is very similar to the real processor instructions of an Nvidia GPU. Internally, the GPU driver that runs a CUDA program on a GPU translates the PTX code into real GPU instructions, places it on the SMs and executes it according to the N-dimensional execution model used by CUDA. This execution model effectively hides the GPU internals from the user but promises that well-parallelized software will scale well to next-generation GPUs supporting even more threads.
GPGPU-Sim acts as the driver and intercepts PTX code execution, so it does the same work as the CUDA translator and scheduler, but on the CPU and in a single thread. GPGPU-Sim therefore runs even on a computer without a real GPU. The simulation process is highly time-consuming: the standard matrix multiplication application with default parameters from the CUDA samples package runs for two days under the simulator. This is not a real problem, as an older computer fleet able to run Ubuntu 20/22 and the GCC 7 toolchain can be used to simulate CUDA application runs. GPGPU-Sim provides different types of simulation: 1) purely functional; 2) simulation with all instruction timings but zero memory timings; 3) simulation of the whole SM-to-memory interconnect, including a detailed graphics DRAM (GDDR) timing model. There is also a special mode for energy consumption simulation. It should be noted that the simulator code is hosted on GitHub, so it can be freely downloaded and modified. There are several simulator extensions, e.g. Accel-Sim [10], which are used for precise simulation validation and analysis.
GPGPU-Sim is highly parametrized: more than a hundred options control the simulated hardware blocks, from the SMs to the DRAM. As the GPGPU-Sim authors strive for a simulation accuracy within 5% of real hardware, all hardware components are simulated precisely. As a simple example, practically all the timing values from the graphics DRAM chip manuals describing the memory bus are introduced into the memory model. All the specific concepts that enable the huge level of parallelism, for example streaming L1 caches [3], are simulated, so the application developer can observe the real effects of the newer hardware concepts on the application. The simulated interconnect is complex enough to account for even a basic set of parallel-code optimizations. A developer who wants to optimize the code should sit down with paper and pencil and plan the memory layouts and the scheduling of the parallel application.
For our purposes we need to analyze a set of performance counters generated by GPGPU-Sim during program execution. At first glance these counters can be divided into two sets: 1) counters at the borders between interconnect levels; 2) counters for events internal to the interconnect levels. Let us consider fig. 4.
Fig. 4. Memory hierarchy/interconnect levels.
Fig. 4 shows the large blocks of the SM-to-DRAM interconnect hierarchy [1]. The arrows are located at the real hierarchy borders, and for each arrow we can measure the traffic from the upper hierarchy level to the lower one (e.g. from the L2 cache to the memory controller). The efficiency of a hierarchy level for a particular task (except the interconnect itself, which just routes data between L1 cache banks and L2 cache banks) is the difference between the amount of data routed into the level from the higher level (left in fig. 4) and the data routed on to the lower levels of the memory hierarchy (right in fig. 4): the bigger the difference, the better. Researchers should also be interested in the dataset sizes that are friendly to the hardware, because, despite the growing DRAM size on the board, the internal cache and shared memory sizes of the SMs stay practically constant across GPU generations. The internal event counters are used for fine-tuning algorithms, as they show how the improvements at different memory levels are exploited by the parallelized version of an algorithm; different parallelization methods can use these internal helpers in different ways. Here we discuss the "level border" counters; the internal level counters are the subject of our next research.
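The border-counter metric described above can be stated compactly as a filtering ratio. The notation below is ours, introduced only to make the definition explicit.

```latex
% Filtering efficiency of one memory-hierarchy level, expressed through the
% border counters: B_in is the traffic arriving from the upper level,
% B_out the traffic forwarded to the lower level.
F_{\text{level}} = \frac{B_{\text{in}} - B_{\text{out}}}{B_{\text{in}}}
                 = 1 - \frac{B_{\text{out}}}{B_{\text{in}}}
```

A value close to 1 means the level absorbs almost all of the incoming traffic (for a cache, a low miss rate); a value close to 0 means the level merely forwards traffic downwards.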
4. Motivating examples
We now propose a methodology that starts from the very first analysis of memory traffic. There are different levels of hardware observation, and we currently start from high-level metrics. Let us discuss what is valuable here and what the challenges are for research on the simulator side.
Every computationally demanding application starts from a computational model, which usually runs at an extremely low speed. For example, a LIDAR model application runs at 1 frame per second (fps), while the final requirement is 100 fps. So the initial task of the simulation is to evaluate the peak memory performance of the model application. Note that the "computational demand" here is transformed into "memory performance", as memory throughput is more limiting than computational throughput. L2 cache misses per computed frame describe the basic need for memory traffic well and (without optimizations) help to project the upper memory bandwidth limit. Local optimizations of CUDA procedures may improve effective memory bandwidth by 10x-20x, but the detailed information has to come from a simulation run. In any case, a performance model built for a multicore CPU does not project real scaling to hundreds of threads and does not reflect the optimization effects possible on a GPU. That is why massive parallelism cannot be simulated and evaluated on a multi-core CPU: the difference in the number of threads is up to 50-100x, so none of the real parallelization issues can be observed with a low number of threads. At a high level of hardware observation a developer can see the upper limits of the memory traffic needed to decide whether an optimized program run on the GPU is feasible; at a lower level of observation it is even possible to check the fine effects of using the different cache levels.
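The projection from L2 misses per frame to a bandwidth requirement is simple arithmetic; the sketch below shows it with assumed numbers (the miss count and the 32-byte sector fill are illustrative assumptions, not measured values from the paper).

```cuda
// Back-of-envelope projection of the DRAM bandwidth needed to reach a target
// frame rate, given L2 misses per frame from a simulation run.
#include <cstdio>

int main() {
    const double l2_misses_per_frame = 5.0e6;  // hypothetical value from a simulation log
    const double bytes_per_miss      = 32.0;   // assumed L2 sector fill size
    const double target_fps          = 100.0;  // application requirement

    const double required_gbps =
        l2_misses_per_frame * bytes_per_miss * target_fps / 1.0e9;

    printf("required DRAM bandwidth: %.1f GB/s\n", required_gbps);
    // If this exceeds the card's DRAM bandwidth, no compute-side optimization
    // will reach the target frame rate without reducing the memory traffic itself.
    return 0;
}
```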
As an entry-level example, the matrixMul CUDA sample application was used. To check the cache utilization effects, the application was run (simulated) with different matrix size parameters, starting from a 64x64 matrix and finishing with a 256x896 one. The default multiplication size of 320x320 by 640x320 requires two days of simulation on an Intel Core i5-10400 using one thread, so the dataset configuration was adjusted to decrease the simulation time while still reflecting real problem sizes.
By default, GPGPU-Sim generates a text file in which it periodically updates the internal simulation counters and model state. As the simulator operates on the full GPU state, the simulation results are functionally correct and the programs/shaders execute correctly. The resulting log file can be parsed with any tool, from basic grep/findstr utilities up to complex Python scripts. To parse these log files we used simple findstr-based scripts. The simulation was run under Ubuntu 22 on an Intel 10th Gen machine equipped with 32 GB of RAM, retargeted to the GCC 7 toolchain, as other toolchains run into compilation errors.
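A minimal extraction pass over such a log can look as follows. The counter names in the sketch are illustrative and may differ between GPGPU-Sim versions; the intent is only to show the pattern of filtering the aggregate statistics we discuss below.

```cuda
// A minimal sketch of pulling a few aggregate counters out of a GPGPU-Sim log file;
// counter names are illustrative and may differ between simulator versions.
#include <fstream>
#include <iostream>
#include <string>

int main(int argc, char** argv) {
    if (argc < 2) { std::cerr << "usage: parse <gpgpusim.log>\n"; return 1; }

    const std::string keys[] = {
        "gpu_tot_sim_cycle",         // total simulated cycles
        "L2_total_cache_accesses",   // traffic arriving at L2 from L1
        "L2_total_cache_misses"      // traffic forwarded from L2 towards DRAM
    };

    std::ifstream log(argv[1]);
    std::string line;
    // Print every occurrence; the simulator re-prints counters periodically,
    // so the last printed line holds the final cumulative value.
    while (std::getline(log, line))
        for (const auto& key : keys)
            if (line.find(key) != std::string::npos)
                std::cout << line << '\n';
    return 0;
}
```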
The simulation results are gathered into the tables in fig. 5 and fig. 6; we show only the most important results. For each matrixMul run we collect the matrix dimensions (noted for reference, to know what amount of data is processed during the run), the L2 cache accesses, the miss rate and the parallel utilization of the L2 cache. L2 accesses represent the total data traffic coming from the L1 cache, while L2 misses represent the data traffic going to DRAM. The most interesting figure is the L2 cache parallel utilization: the higher this value, the better the sectored L2 cache is utilized, as the L2 structure allows accesses in different L2 sectors to be handled simultaneously. We see some imbalance in the L2 parallel utilization (fig. 5) for the 64x384x64, 128x768x128 and 256x768x256 multiplications; our hypothesis is that the memory layout in these cases is imbalanced and some L2 sectors are loaded much more heavily than others.
Fig. 5. Matrix multiplication simulation results.
Fig. 6. L1/L2 cache load balancing
Fig. 6 shows the thread-running balance in the sense of the balance of memory accesses across the sectored cache memory. The picture in fig. 6 is practically the same for every reviewed matrix size, so we see an almost ideal balance of L2 accesses and misses; this means that the matrices are distributed equally over the running processes. Note that this balance also holds for the L1 cache, even though it has more sectors (30) than the 24 sectors of L2.
Let us review what we can conclude from the "big picture" of the matrix multiplication example. First, we see that the problem fits the cache almost ideally: the number of cache misses is below 0.1%, which is a really small value. This should not surprise us, as matrix multiplication has been studied for over 50 years, but this kind of analysis is usually performed for tasks that are unfamiliar to researchers, and such results are extremely valuable when we have no prior knowledge about the problem. Additional results are not included in the paper, but the same balancing analysis as in fig. 6 was also computed for DRAM. As the DRAM is multichannel (12 channels), in the case of higher L2 miss rates (more than 1-2%) the problem of properly parallelizing DRAM accesses arises, since 12 DRAM channels should effectively multiply DRAM throughput only when the accesses are parallelized correctly. In any case, the data in fig. 6 allow us to understand the load balancing at the cache memory level, and this metric directly highlights the quality of parallelization, which is ideal in our case.
Conclusions
We use GPGPU-Sim in the opposite way to the original authors' intention: instead of reverse-engineering the GPU architecture, we evaluate CUDA-based software, since the simulation timing error does not exceed 10%. This error is not even critical for our tasks, as we need information about the volume and patterns of memory accesses. For our classical task, matrix multiplication, we obtained the cache utilization efficiency, the memory bandwidth and a parallelization efficiency metric quite easily. When projected onto real tasks, where the matrix multiplication sizes are fixed, we can easily predict the memory footprint and the time necessary to complete the operation, and so estimate the real performance (within ±10%) on a GPU platform.
This paper is just the beginning of the road into GPU performance analysis, as we currently analyze only basic figures. We have not evaluated all simulation modes and have not examined all the parameters available in the resulting log file; these are tasks for further simulator evaluation. We also need a more accurate "slicing" of the simulation state in time, to gather more precise information about algorithm behaviour over time. Additional analysis of the cache memory utilization imbalance is also required. Finally, we need to check the performance of algorithms with an L2 miss rate above 2% in order to see what can be gained from analyzing the DRAM channel load.
References
1. Khairy M., Jain A., Aamodt T.M., Rogers T.G. A Detailed Model for Contemporary GPU Memory Systems. 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2019, pp. 141-142.
2. Bakhoda A., Yuan G.L., Fung W.W.L., Wong H., Aamodt T.M. Analyzing CUDA workloads using a detailed GPU simulator. 2009 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2009, pp. 163-174. doi: 10.1109/ISPASS.2009.4919648.
3. Jog A., Kayiran O., Kesten T., Pattnaik A., Bolotin E., Chatterjee N., Keckler S.W., Kandemir M.T., Das C.R. Anatomy of GPU Memory System for Multi-Application Execution. Proceedings of the 2015 International Symposium on Memory Systems (MEMSYS '15), ACM, New York, 2015, pp. 223-234. doi: 10.1145/2818950.2818979.
4. Raihan M.A., Goli N., Aamodt T.M. Modeling Deep Learning Accelerator Enabled GPUs. 2019 IEEE International Symposium on Performance Analysis of Systems and Software (ISPASS), 2019, pp. 79-92. doi: 10.1109/ISPASS.2019.00016.
5. Barrachina S., Castillo M., Igual F.D., Mayo R., Quintana-Orti E.S. Evaluation and tuning of the Level 3 CUBLAS for graphics processors. 2008 IEEE International Symposium on Parallel and Distributed Processing, 2008, pp. 1-8. doi: 10.1109/IPDPS.2008.4536485.
6. Kurzak J., Tomov S., Dongarra J. Autotuning GEMM Kernels for the Fermi GPU. IEEE Transactions on Parallel and Distributed Systems, vol. 23, no. 11, 2012, pp. 2045-2057. doi: 10.1109/TPDS.2011.311.
7. Ivanenko P.A., Doroshenko A.Y., Zhereb K.A. TuningGenie: Auto-Tuning Framework Based on Rewriting Rules. In: ICTERI 2014, Kherson, Ukraine, June 9-12, 2014, Revised Selected Papers, Communications in Computer and Information Science, vol. 469, Springer, 2014, pp. 139-160. doi: 10.1007/978-3-319-13206-8_7.
8. Wu K., Truong N., Yuksel C., Hoetzlein R. Fast Fluid Simulations with Sparse Volumes on the GPU. Computer Graphics Forum (Eurographics), vol. 37, May 2018, pp. 157-167. doi: 10.1111/cgf.13350.
9. Jain A., Rogers T. A Quantitative Evaluation of Contemporary GPU Simulation Methodology. ACM SIGMETRICS Performance Evaluation Review, vol. 46, June 2018, pp. 103-105. doi: 10.1145/3292040.3219658.
10. Khairy M., Shen Z., Aamodt T.M., Rogers T.G. Accel-Sim: An Extensible Simulation Framework for Validated GPU Modeling. 2020 ACM/IEEE 47th Annual International Symposium on Computer Architecture (ISCA), 2020.
Received 17.08.2022
About the author:
Dmytro Vasyliovych Rahozin,
PhD in Engineering,
Senior Researcher.
Publications in Ukrainian journals: more than 10.
Publications abroad: 2.
http://orcid.org/0000-0002-8445-9921

Affiliation:
Institute of Software Systems of the NAS of Ukraine,
40 Akademika Hlushkova Avenue, Kyiv-187, 03187, Ukraine.
Phone: (38)(044) 526-21-48
E-mail: Dmytro.Rahozin@gmail.com