Application of the opportunities of tool system ”CUDA” for graphic processors programming in scientific and technical calculation tasks

The opportunities of technology CUDA (Compute Unified Device Architecture - the unified hardware-software decision for parallel calculations on GPU)of the company NVIDIA were described. The basic differences of the programming language ”C” for GPU from ”usual” language ”C” were selected. The examp...

Full description

Saved in:
Bibliographic Details
Published in:Вопросы атомной науки и техники
Date:2009
Main Authors: Dudnik, V.A., Kudryavtsev, V.I., Sereda, T.M., Us, S.A., Shestakov, M.V.
Format: Article
Language:English
Published: Національний науковий центр «Харківський фізико-технічний інститут» НАН України 2009
Subjects:
Online Access:https://nasplib.isofts.kiev.ua/handle/123456789/96652
Tags: Add Tag
No Tags, Be the first to tag this record!
Journal Title:Digital Library of Periodicals of National Academy of Sciences of Ukraine
Cite this:Application of the opportunities of tool system ”CUDA” for graphic processors programming in scientific and technical calculation tasks / V.A. Dudnik, V.I. Kudryavtsev, T.M. Sereda, S.A. Us, M.V. Shestakov // Вопросы атомной науки и техники. — 2009. — № 5. — С. 159-165. — Бібліогр.: 9 назв. — англ.

Institution

Digital Library of Periodicals of National Academy of Sciences of Ukraine
id nasplib_isofts_kiev_ua-123456789-96652
record_format dspace
spelling Dudnik, V.A.
Kudryavtsev, V.I.
Sereda, T.M.
Us, S.A.
Shestakov, M.V.
2016-03-18T21:19:30Z
2016-03-18T21:19:30Z
2009
Application of the opportunities of tool system ”CUDA” for graphic processors programming in scientific and technical calculation tasks / V.A. Dudnik, V.I. Kudryavtsev, T.M. Sereda, S.A. Us, M.V. Shestakov // Вопросы атомной науки и техники. — 2009. — № 5. — С. 159-165. — Бібліогр.: 9 назв. — англ.
1562-6016
PACS: 89.80.+h, 89.70.+c, 01.10.Hx
https://nasplib.isofts.kiev.ua/handle/123456789/96652
The opportunities of technology CUDA (Compute Unified Device Architecture - the unified hardware-software decision for parallel calculations on GPU)of the company NVIDIA were described. The basic differences of the programming language ”C” for GPU from ”usual” language ”C” were selected. The examples of CUDA usage for acceleration of development of applications and realization of algorithms of scientific and technical calculations were given which are carried out by the means of graphic processors (GPGPU) of accelerators GeForce of the eighth generation. The recommendations on optimization of the programs using GPU were resulted.
Описано можливостi технологiї "CUDA"(Compute Unified Device Architecture – унiфiкованого програмно- апаратного рiшення для паралельних обчислень на GPU) компанiї NVIDIA. Видiлено основнi вiдмiнностi мови програмування C для GPU вiд "звичайного"С. Дани приклади використання CUDA для прискорення розробки i реалiзацiї алгоритмiв науково-технiчних розрахункiв, виконуваних засобами графiчних процесорiв (GPGPU) прискорювачiв GeForce восьмого поколiння. Приведено рекомендацiї по оптимiзацiї програм, використовуючих GPU.
Описаны возможности технологии "CUDA"(Compute Unified Device Architecture – унифицированного программно-аппаратного решения для параллельных вычислений на GPU) компании NVIDIA. Выделены основные отличия языка программирования C для GPU от "обычного"С. Даны примеры использования CUDA для ускорения разработки приложений и реализации алгоритмов научно- технических расчётов, выполняемых средствами графических процессоров (GPGPU) ускорителей GeForce восьмого поколения. Приведены рекомендации по оптимизации программ, использующих GPU.
en
Національний науковий центр «Харківський фізико-технічний інститут» НАН України
Вопросы атомной науки и техники
Вычислительные и модельные системы
Application of the opportunities of tool system ”CUDA” for graphic processors programming in scientific and technical calculation tasks
Застосування можливостей iнструментальної системи "CUDA"для програмування графiчних процесорiв в науково-технiчних розрахункiв
Применение возможностей инструментальной системы "CUDA"для программирования графических процессоров в научно-технических расчётах
Article
published earlier
institution Digital Library of Periodicals of National Academy of Sciences of Ukraine
collection DSpace DC
title Application of the opportunities of tool system ”CUDA” for graphic processors programming in scientific and technical calculation tasks
spellingShingle Application of the opportunities of tool system ”CUDA” for graphic processors programming in scientific and technical calculation tasks
Dudnik, V.A.
Kudryavtsev, V.I.
Sereda, T.M.
Us, S.A.
Shestakov, M.V.
Вычислительные и модельные системы
title_short Application of the opportunities of tool system ”CUDA” for graphic processors programming in scientific and technical calculation tasks
title_full Application of the opportunities of tool system ”CUDA” for graphic processors programming in scientific and technical calculation tasks
title_fullStr Application of the opportunities of tool system ”CUDA” for graphic processors programming in scientific and technical calculation tasks
title_full_unstemmed Application of the opportunities of tool system ”CUDA” for graphic processors programming in scientific and technical calculation tasks
title_sort application of the opportunities of tool system ”cuda” for graphic processors programming in scientific and technical calculation tasks
author Dudnik, V.A.
Kudryavtsev, V.I.
Sereda, T.M.
Us, S.A.
Shestakov, M.V.
author_facet Dudnik, V.A.
Kudryavtsev, V.I.
Sereda, T.M.
Us, S.A.
Shestakov, M.V.
topic Вычислительные и модельные системы
topic_facet Вычислительные и модельные системы
publishDate 2009
language English
container_title Вопросы атомной науки и техники
publisher Національний науковий центр «Харківський фізико-технічний інститут» НАН України
format Article
title_alt Застосування можливостей iнструментальної системи "CUDA"для програмування графiчних процесорiв в науково-технiчних розрахункiв
Применение возможностей инструментальной системы "CUDA"для программирования графических процессоров в научно-технических расчётах
description The opportunities of technology CUDA (Compute Unified Device Architecture - the unified hardware-software decision for parallel calculations on GPU)of the company NVIDIA were described. The basic differences of the programming language ”C” for GPU from ”usual” language ”C” were selected. The examples of CUDA usage for acceleration of development of applications and realization of algorithms of scientific and technical calculations were given which are carried out by the means of graphic processors (GPGPU) of accelerators GeForce of the eighth generation. The recommendations on optimization of the programs using GPU were resulted. Описано можливостi технологiї "CUDA"(Compute Unified Device Architecture – унiфiкованого програмно- апаратного рiшення для паралельних обчислень на GPU) компанiї NVIDIA. Видiлено основнi вiдмiнностi мови програмування C для GPU вiд "звичайного"С. Дани приклади використання CUDA для прискорення розробки i реалiзацiї алгоритмiв науково-технiчних розрахункiв, виконуваних засобами графiчних процесорiв (GPGPU) прискорювачiв GeForce восьмого поколiння. Приведено рекомендацiї по оптимiзацiї програм, використовуючих GPU. Описаны возможности технологии "CUDA"(Compute Unified Device Architecture – унифицированного программно-аппаратного решения для параллельных вычислений на GPU) компании NVIDIA. Выделены основные отличия языка программирования C для GPU от "обычного"С. Даны примеры использования CUDA для ускорения разработки приложений и реализации алгоритмов научно- технических расчётов, выполняемых средствами графических процессоров (GPGPU) ускорителей GeForce восьмого поколения. Приведены рекомендации по оптимизации программ, использующих GPU.
issn 1562-6016
url https://nasplib.isofts.kiev.ua/handle/123456789/96652
citation_txt Application of the opportunities of tool system ”CUDA” for graphic processors programming in scientific and technical calculation tasks / V.A. Dudnik, V.I. Kudryavtsev, T.M. Sereda, S.A. Us, M.V. Shestakov // Вопросы атомной науки и техники. — 2009. — № 5. — С. 159-165. — Бібліогр.: 9 назв. — англ.
work_keys_str_mv AT dudnikva applicationoftheopportunitiesoftoolsystemcudaforgraphicprocessorsprogramminginscientificandtechnicalcalculationtasks
AT kudryavtsevvi applicationoftheopportunitiesoftoolsystemcudaforgraphicprocessorsprogramminginscientificandtechnicalcalculationtasks
AT seredatm applicationoftheopportunitiesoftoolsystemcudaforgraphicprocessorsprogramminginscientificandtechnicalcalculationtasks
AT ussa applicationoftheopportunitiesoftoolsystemcudaforgraphicprocessorsprogramminginscientificandtechnicalcalculationtasks
AT shestakovmv applicationoftheopportunitiesoftoolsystemcudaforgraphicprocessorsprogramminginscientificandtechnicalcalculationtasks
AT dudnikva zastosuvannâmožlivosteiinstrumentalʹnoísistemicudadlâprogramuvannâgrafičnihprocesorivvnaukovotehničnihrozrahunkiv
AT kudryavtsevvi zastosuvannâmožlivosteiinstrumentalʹnoísistemicudadlâprogramuvannâgrafičnihprocesorivvnaukovotehničnihrozrahunkiv
AT seredatm zastosuvannâmožlivosteiinstrumentalʹnoísistemicudadlâprogramuvannâgrafičnihprocesorivvnaukovotehničnihrozrahunkiv
AT ussa zastosuvannâmožlivosteiinstrumentalʹnoísistemicudadlâprogramuvannâgrafičnihprocesorivvnaukovotehničnihrozrahunkiv
AT shestakovmv zastosuvannâmožlivosteiinstrumentalʹnoísistemicudadlâprogramuvannâgrafičnihprocesorivvnaukovotehničnihrozrahunkiv
AT dudnikva primenenievozmožnosteiinstrumentalʹnoisistemycudadlâprogrammirovaniâgrafičeskihprocessorovvnaučnotehničeskihrasčetah
AT kudryavtsevvi primenenievozmožnosteiinstrumentalʹnoisistemycudadlâprogrammirovaniâgrafičeskihprocessorovvnaučnotehničeskihrasčetah
AT seredatm primenenievozmožnosteiinstrumentalʹnoisistemycudadlâprogrammirovaniâgrafičeskihprocessorovvnaučnotehničeskihrasčetah
AT ussa primenenievozmožnosteiinstrumentalʹnoisistemycudadlâprogrammirovaniâgrafičeskihprocessorovvnaučnotehničeskihrasčetah
AT shestakovmv primenenievozmožnosteiinstrumentalʹnoisistemycudadlâprogrammirovaniâgrafičeskihprocessorovvnaučnotehničeskihrasčetah
first_indexed 2025-11-27T06:51:42Z
last_indexed 2025-11-27T06:51:42Z
_version_ 1850805843826573312
fulltext COMPUTING AND MODELLING SYSTEMS APPLICATION OF THE OPPORTUNITIES OF TOOL SYSTEM ”CUDA” FOR GRAPHIC PROCESSORS PROGRAMMING IN SCIENTIFIC AND TECHNICAL CALCULATION TASKS V.A. Dudnik, ∗V.I. Kudryavtsev, T.M. Sereda, S.A. Us, M.V. Shestakov National Science Center ”Kharkov Institute of Physics and Technology”, 61108, Kharkov, Ukraine (Received May 27, 2009) The opportunities of technology CUDA (Compute Unified Device Architecture - the unified hardware-software deci- sion for parallel calculations on GPU)of the company NVIDIA were described. The basic differences of the program- ming language ”C” for GPU from ”usual” language ”C” were selected. The examples of CUDA usage for acceleration of development of applications and realization of algorithms of scientific and technical calculations were given which are carried out by the means of graphic processors (GPGPU) of accelerators GeForce of the eighth generation. The recommendations on optimization of the programs using GPU were resulted. PACS: 89.80.+h, 89.70.+c, 01.10.Hx 1. INTRODUCTION Programming of the decisions of intensive mathemat- ical tasks, which is given an opportunity of usage of existing graphic processors for acceleration of their decision is not only new and perspective, but also roughly developing tendency of development of tool means for the software. The research of opportunities of such one technical decision - technologies CUDA (Compute Unified Device Architecture - the unified hardware-software decision for parallel calculations on GPU), offered by the company NVIDIA, is very actual in this connection. This technology represents C-like programming language with the compiler and libraries for calculations on GPU for development of appendix and gives the to programmer greater control over hardware opportunities GPU. [1],[2] It is important, that support of NVIDIA CUDA is at chips G8x, G9x and GT2xx, series 8 used in the video adapters GeForce, 9 and 200 which are widely distributed. Besides it is necessary to pay attention, that these software of the development and their de- scription are given completely free-of-charge (SDK for all basic platforms is freely downloaded with de- veloper.nvidia.com). It does their usage especially attractive for scientific researches. Once more key moment of architecture CUDA is easy scalability. Once on as written code will be started on all de- vices supporting CUDA[9]. For the development and debugging of a code for start on GPU it is possi- ble to use usual video adapters. When the product is ready it is possible to start it on more powerful GPU. It is necessary to note that parallel processing on GPU some differs from the work with CPU. It is said about parallelism of tasks within the devel- opment of applications for traditional CPU i.e. one program module is executed on the first processor (or a nucleus), another is executed on the second, etc. Parallel processing by the means use CUDA assumes the parallelism of the data, i.e. presence of a plenty (much more, than physical processors GPU) elements or groups of the data which admitting the independent parallel processing. The purpose of given work was the research of opportunities of the technology CUDA for applications’ development and algorithms’realizations of scientific and technical cal- culations in NSC the KIPT. 2. THE GRAPHIC PROCESSOR AS THE SIMD SET OF MULTIPROCESSORS Development of the functionalities of graphic proces- sors (particularly blocks for calculations of pixel shaders) has made possible the usage of video adapters for scientific and technical calculations - as a powerful SIMD (Single Instruction Multiple Data) processors. Occurrence of the technologies of not-graphic general purpose calculations GPGPU (General-Purpose computation on GPU) pro- moted it. With the help of these technologies there was possible to use hundred mathematical execu- tive blocks of modern video chips GPU as general purpose processors for the significant acceleration computational intensive applications. For under- standing of features of programming GPGPU it is necessary to take into consideration the features of structure GPU (Fig.1) and its work in the basic ∗Corresponding author E-mail address: vladimir−1953@mail.ru PROBLEMS OF ATOMIC SCIENCE AND TECHNOLOGY, 2009, N5. Series: Nuclear Physics Investigations (52), p.159-165. 159 mode (it is real this graphic device, and its ba- sic purpose is formation of the image). As you can see on Fig.1, the part of the graphic proces- sor NVIDIA[4], used for GPGPU - the nucleus of shaders - consists of several clusters of textural processors (Texture Processor Cluster, TPC). Fig.1. Structure of a nucleus of shaders Quantity of TPC in the device depends on the model GPU (and its price, accordingly). The video chip 8800 GTX, for example, contains eight clusters, six 8800 GTS, etc. Each cluster consists of the tex- tural block and two or three streaming multiproces- sors (streaming multiprocessor, SM). The stream- ing multiprocessors include 16 kB shared memory (Shared Memory, it is not cache-memory: the pro- grammer can use it at own discretion), eight com- puters (Streaming Processors, SP)) and two super functional devices SFU (Super Function Unit) where instructions are carried out by the principle SIMD, i.e. one instruction is applied to all elements of the data. NVIDIA shows such way of performance SIMT (single instruction multiple threads - one instruction, it is a lot of streams). Shared memory gives the opportunity of information in- terchange between streams in one block. In the graphic mode the blocks of pixel shaders work as follows (see Fig.2): the block of geometry generates triangles, then the block of rasterization generates quads - squares of pixels 2x2 where each pixel is set by a vector with four values from a floating point of unary accuracy (R, G, B, A)or (X, Y, Z, W ) - most often used format in 3D-calculations. The quads then act in streaming processors (SM)(see Fig.3) (which work in 16-channel mode SIMD, i.e. the identical instruction is applied to all 16 num- bers from a floating point. When 8 quads (32 pix- els,”warp” on terminology CUDA) are collected in the buffer, they are carried out by the multiproces- sor in mode SIMD, and then entrance data from structures for each pixel are read out, are devel- oped and are entered the name in the target buffer. Fig.2. Circuit of work of pixel shaders Fig.3. Structure of the streaming processor The usage of GPU for calculations with the help of such graphic program interface was quite possible, but the special approach was necessary. Even item addition of two vectors demanded a drawing in the out-screen buffer. Besides there were much of other restrictions, in fact, the pixel shader is only the for- mula of dependence of final color of a pixel from its coordinate, and language of pixel shaders - language of recording of these formulas with C-like syntax. It 160 is possible to say that early methods GPGPU were the artful tricks allowing somehow to use the capacity of GPU for the general calculations. The data have been submitted there by pseudo-images (structures), and algorithm-process of a rasterization. Besides it is necessary to note the rather specific model of usage of memory and execution of the program. 3. ADVANTAGES OF CUDA BEFORE THE TRADITIONAL APPROACH TO GPGPU CALCULATIONS Occurrence of CUDA (and also GPU G80) has com- pletely removed all these restrictions, offering for GPGPU the simple and convenient model. In this model GPU it is examined the specialized computer (named device), which: • is the coprocessor to CPU (named host); • possesses own memory (DRAM); • possesses an opportunity of parallel perfor- mance of huge quantity of separate computing processes (threads); • more effective data transfer between system and video memory is provided; • is absent necessities of graphic API with redun- dancy and overhead charge; • linear addressing of memory and an opportu- nity of recording to any addresses are used; • there is a hardware support of integer and bit operations. 4. HIERARCHY OF COMPUTING PROCESSES (THREADS) IN CUDA The stream represents a stream of elements of one type which are required to be processed. Thread - the process of processing of stream‘s element. All threads are grouped in hierarchy - grid/block/warp/thread (see Fig.2). The warp rep- resents a group of 32 streams and is a minimal volume of the data processable in the SIMD-way in multi- processors CUDA. Threads actually carry out the same commands, but everyone with the data. The block in CUDA, it is possible to work with the blocks containing from 64 up to 512 streams instead of work with warps directly. All streams of the block are carried out on one multiprocessor Grid - blocks are gathered in grids. The advantage of a similar grouping consists of the idea that the number of the blocks simultaneously processable GPU are closely connected to hardware resources. The grouping of blocks in grids allows to abstract from this restriction and apply a nucleus / kernel to the greater number of streams for one call, and you can not think about the fixed resources. Kernel - the function which will be applied independently to each element of a stream; it is an equivalent of a pixel shader. In clas- sical programming it is possible to result analogy of a cycle - it is applied to the big number of the elements. Fig.4. Hierarchy of the threads in CUDA The model of programming in CUDA assumes group- ing of the streams. The streams are united in the blocks of streams (thread block) - one-dimensional or bidimentional grids of the streams cooperating among themselves by the means of shared memory and points of synchronization. The program (a nu- cleus, kernel) is executed above a grid (grid) blocks of streams (thread blocks), see a figure below. One grid is simultaneously executed. Each block can be one, two or three-dimensional under the form, and can consist of 512 streams on the current hardware maintenance. The blocks of streams are carried out as the small groups, named a warp, which size is 32 streams. It is the minimal volume of the data which can be processed in the multiprocessors. And as it is not always convenient, CUDA allows to work with the blocks containing from 64 up to 512 streams. The grouping of blocks in grids allows to avoid the restric- tions and apply a nucleus to the greater number of streams for one call. It helps in scaling. If GPU has not enough resources, it will carry out blocks consis- tently. In a return case, the blocks can be carried out in parallel, that is important for optimum distri- bution of a work to video chips of a different level, beginning from mobile and integrated. 5. MODEL OF MEMORY CUDA The model of memory in CUDA differs by the oppor- tunity of byte addressing, supporting both a gather, and a scatter. A plenty of registers on each stream 161 processor, up to 1024 pieces are accessible enough. The access to them is very fast and you can store in them 32-bit whole or numbers with a floating point [6],[7],[8]. Each stream has access to the following types of memory: Global memory - makes the great volume of mem- ory accessible to all multiprocessors on the video chip, the size from 256 megabytes up to 1.5 gigabytes on the current decisions (and up to 4 Gbytes on Tesla). It possesses a high bandwidth, more than 100 giga- byte / second for top decisions NVIDIA, but has very big delays in some hundreds steps. It is not cached, supports the generalized instructions load and store, and usual indexes for memory. Local memory is a small memory size to which only one stream processor has access. It is rather slow - similar to global memory. Shared memory is 16-kbite (in video chips of present architecture) block of memory with the com- mon access for the all streaming processors in the multiprocessor. This memory rather fast, like, reg- isters. It provides interaction of streams, copes the developer directly and has low delays. The advan- tages of shared memory are the usage as a cache of the first level controlled by the programmer, reduc- tion in delays at access of executive blocks (ALU) to the data, reduction of quantity of manipulations to global memory. Memory of constants - the area of memory in vol- ume of 64 kilobytes (the same - for present GPU), accessible only for reading by all multiprocessors. It is cached on 8 kilobyte on each multiprocessor. It is rather slow: a delay in some hundreds steps at ab- sence of the necessary data in a cache. Textural memory - the block of memory accessible to reading by all multiprocessors. Selection of the data is carried out by the means of textural blocks of the video chip: the opportunities of linear inter- polation of the data without additional expenses are given. It is cached on 8 kilobyte on each multiproces- sor. It is slow like a global memory - hundreds steps of a delay at absence of the data in a cache. Naturally, the global, local, textural memories and memory of constants are physically the same memory known as a local video memory of the video adapter. Their differences are in various algorithms of caching and models of access. The central processor can update and requests only external memory: global, constant and textural. 6. EXPANSIONS OF LANGUAGE C The peculiarities of usage of architecture GPU for usual calculations have found the reflection in rather small expansions of language: Specifiers of the functions were entered showing where function will be carried out and from where it can be called (_device_, _global_, _host_). The specifiers of variables are supported by spec- ifying the type of memory and using for given vari- ables (_device_, _constant_ and _shared_). The built variables are added in language contain- ing runtime the information on the current thread: • gridDim - the size of grid (has type dim3); • blockDim - the size of the block (has type dim3); • blockIdx - an index of the current block in grid (has type uint3); • threadIdx - an index of the current thread in the block (has type uint3); • warpSize - the size of warp (has type int). There are additional types of the data: are added 1/2/3/4-dimensional a vector from base types - char1, char2, char3, char4, uchar1, uchar2, uchar3, uchar4, short1, short2, short3, short4, ushort1, ushort2, ushort3, ushort4, int1, int2, int3, int4, uint1, uint2, uint3, uint4, long1, long2, long3, long4, ulong1, ulong2, ulong3, ulong4, float1, float2, float3, float2, and double2. The directive of a call of a nucleus. For start of a nucleus on GPU is used the following design con- struction: kernelName <<<Dg, Db, Ns, S>>> (args) Synchronization of all threads of the block. For this purpose function _syncthreads was added in language ”C”, management from it will be return only when all threads of the given block will call this function. This function is very convenient for the or- ganization of frictionless work with shared-memory. The most simple example of the usage CUDA will be a simple increase in each element of one-dimensional file at unit (the program ”incr.cu”), showing the basic working methods with it. *include <stdio.h> __ global __ void incKernel(float *data) { int idx=blockIdx.x * blockDim.x + threadIdx.x; data [idx]=data[idx]+1.0f; } int main (int argc,char *argv[]) { int n = 16*1024*1024; int numBytes = n*sizeof (float); // allocate host memory float *a = new float[n]; for(int i=0;i<n;i++) a[i]=0.0f; // allocate device memory float *dev = NULL; cudaMalloc((void **)*dev,numBytes); // set kernel launch configuration 162 dim3 threads = dim3(512,1); dim3 blocks = dim3(n/threads.x,1); // create cuda event handles cudaEvent_t start,stop; float gpuTime = 0.0f; cudaEventCreate (*start); cudaEventCreate (*stop); // asynchronously issue work to the GPU (all to stream 0) cudaEventRecord (start,0); cudaMemcpy (dev,a,numBytes, cudaMemcpyHostToDevice); incKernel <<<blocks,threads>>> (dev); cudaMemcpy(a,dev,numBytes, cudaMemcpyDeviceToHost); cudaEventRecord (stop, 0); cudaEventSynchronize (stop); cudaEventElapsedTime (*gpuTime,start, stop); // print the cpu and gpu times printf (" time spent executing by the GPU: % .2f millseconds\n ", gpuTime); // check the output for correctness printf ("-----------------------\n"); for (int i=0;i <n;i++) if(a[i]!=1.0f) { printf("Error in pos %d, %f\n ",i, a[i]); break; } // release resources cudaEventDestroy(start); cudaEventDestroy(stop); cudaFree(dev); delete a; return 0; }// End The nucleus is easy arranged - to each thread corresponds one thread, blocks and grid are one- dimensional. A nucleus (function incKernel) on in- put receives only the index on a file with the data in global memory. The task of a kernel - on threadIdx and blockIdx is to determine which element corre- sponds to the given thread and increase it. As both blocks and grid are one-dimensional, the number of the thread will be determined as a number of the block multiplied (increased) on quantity of threads in the block, plus a number of a thread inside the block, i.e. "blockIdx.x * blockDim.x + threadIdx.x ”. The function ”main” is a little bit more complex.It should prepare a file with the data in the memory CPU. Af- ter that it is necessary to allocate the memory with the help of cudaMalloc under a copy of a file with the data in global memory (DRAM GPU). Then the data are copied by the function cudaMemcpy from the memory CPU to the global memory GPU. After ending of the copy of the data in global memory it is possible to start a nucleus for data processing and after its call to copy results of calculations back from global memory GPU in memory CPU. Also in this example gauging spent on copying and calculations of time is made and the correctness of the received result is checked. After all selected memory is re- leased. 7. RECEIVING OF THE INFORMATION ABOUT AVAILABLE GPU AND THEIR OPPORTUNITIES The initial text of one simple program is given below which listing all accessible GPU and their basic op- portunities. *include <stdio.h> int main (int argc, char * argv []) { int deviceCount; cudaDeviceProp devProp; cudaGetDeviceCount (*deviceCount); printf (" Found %d devices\n ", deviceCount); for (int device = 0; device <deviceCount; device ++) { cudaGetDeviceProperties (*devProp, device); printf (" Device %d\n ", device); printf (" Compute capability: %d. % d\n ", devProp.major, devProp.minor); printf (" Name: %s\n ", devProp.name); printf (" Total Global Memory: %d\n ", devProp.totalGlobalMem); printf (" Shared memory per block: %d\n ", devProp.sharedMemPerBlock); 163 printf (" Registers per block: %d\n ", devProp.regsPerBlock); printf (" Warp size: %d\n ", devProp.warpSize); printf (" Max threads per block: %d\n ", devProp.maxThreadsPerBlock); printf (" Total constant memory: %d\n ", devProp.totalConstMem); } return 0; } From the resulted examples it is visible that pro- grams are written on ”expanded” C, thus their ”par- allel part” (kernel) is carried out on GPU and the usual part is carried on CPU. CUDA automatically carries out the division of parts and management of their start. 8. OPTIMIZATION OF PROGRAMS ON CUDA Naturally, within the framework of this article it is impossible to consider serious questions of optimiza- tion in CUDA programming, therefore in brief we shall mention about base things. For the effective usage of the opportunities of CUDA it is necessary to forget about usual methods of spelling of programs for CPU and use those algorithms for which the mul- tisequencing on thousand streams is well carried out. Also it is important to find an optimum place for a data storage (the registers, shared memory, etc.) to minimize data transfer between CPU and GPU and use buffering[7]. Within optimization of program of CUDA it is necessary to try achieving optimum balance between the size and quantity of blocks. A lot of streams in the block will reduce the influence of delays of memory, but also will bring down the accessible number of registers. Besides, the block of 512 streams is inefficient; NVIDIA recommends to use the blocks on 128 or 256 streams as a compro- mise value for achievement of optimum delays and quantities of registers. The basic moments of the op- timization of programs CUDA are more as possible to use an active shared memory because it is faster than global video memory of the video adapter; oper- ations of reading and recording from global memory should be incorporated (coalesced) as far as possible. For this purpose it is necessary to use special types of the data for reading and recordings, at once on 32/64/128 bat given by one operation. If the opera- tions of reading are difficult to unite it is possible to try using textural samples.[8] It is necessary to take into consideration one more useful feature of CUDA 2.0 which, however, has not the attitude to GPU - the compiler now provides compilation code CUDA in a highly effective multiline SSE code for fast execu- tion on the central processor. Now this opportunity suits not only debugging, but also real usage on sys- tems without the video adapter NVIDIA. In fact the usage of CUDA in the usual code restrains by the fact that the video adapters NVIDIA are though the most popular, but are not available in all systems. And up to the version 2.0 in such cases it should do two different codes: for CUDA and separately for CPU. And now it is possible to carry out any CUDA program on CPU with the high efficiency, however with smaller speed than on GPU. 9. SUMMARY The usage of CUDA in the server center of NSC KIPT has shown that though labour input of pro- gramming GPU with help of CUDA is rather big, it is much lower than with early GPGPU decisions. The software of SDK CUDA is established and enough steadily works without special problems. However CUDA programming for each multiprocessor is simi- lar of OpenMP programming, It demands a good un- derstanding of the organization of memory. But, cer- tainly, the complexity of development and carring on CUDA strongly depends on the application. It is nec- essary to get used to other paradigm of programming inherent in parallel calculations. Such programs de- mand splitting the application between several mul- tiprocessors like MPI programming but without di- vision of the data which are stored in the common video memory. CUDA enables to the developer an opportunity to organize at own discretion access to a set of instructions of the graphic accelerator and op- erate its memory, organize on it the complex parallel calculations. The graphic accelerator with support- ing CUDA becomes a powerful programmed open ar- chitecture like today’s central processors. All these gives the high level, controlled and high-speed access to the equipment at disposal of the developer, do- ing CUDA an effective basis for construction of the serious applications. References 1. A.Zubinsky. NVIDIA CUDA: graphics and calcu- lations unification, (http://itc.ua/node/27969). 2. D. Luebke. Graphics CPU-not only for graphics, (http://www.osp.ru/os/2007/02/4106864/). 3. D. Luebke, G. Humphreys. How GPUs Work// IEEE Computer, February 2007. IEEE Com- puter Society. 4. A.V. Boreskoff.Bases CUDA, (http://www.steps3d.narod.ru/tutorials/cuda- tutorial.html/). 5. A.V. Boreskoff. Bases of programming GPU, stream model of calculations, real- ization of conditional operators on modern 164 GPU, (http://steps3d.narod.ru/tutorials/gpu- programming-tutorial.html). 6. D. Chekanov. NVIDIA CUDA: calcula- tions on the videoadapter or death CPU? (http://www.thg.ru/graphic/nvidia_cuda/one page.html). 7. A. Berillo. NVIDIA CUDA - not graphic calculations on graphic processors, (http://www.ixbt.com/video3/cuda-1.shtml). 8. D. Chekanov. NVIDIA GeForce GTX 260 and 280: new generation of videoad- apters, (http://www.thg.ru/graphic/ geforce_gtx_260_280/geforce_gtx_260_280- 02.html). 9. I. Oskolkov. NVIDIA CUDA - the accessi- ble ticket in the world of the big calculations, (http://www.computerra.ru/interactive/423392/). ПРИМЕНЕНИЕ ВОЗМОЖНОСТЕЙ ИНСТРУМЕНТАЛЬНОЙ СИСТЕМЫ "CUDA"ДЛЯ ПРОГРАММИРОВАНИЯ ГРАФИЧЕСКИХ ПРОЦЕССОРОВ В НАУЧНО-ТЕХНИЧЕСКИХ РАСЧЁТАХ В.А. Дудник, В.И. Кудрявцев, Т.М. Середа, С.А. Ус, М.В. Шестаков Описаны возможности технологии "CUDA"(Compute Unified Device Architecture – унифицирован- ного программно-аппаратного решения для параллельных вычислений на GPU) компании NVIDIA. Выделены основные отличия языка программирования C для GPU от "обычного"С. Даны приме- ры использования CUDA для ускорения разработки приложений и реализации алгоритмов научно- технических расчётов, выполняемых средствами графических процессоров (GPGPU) ускорителей GeForce восьмого поколения. Приведены рекомендации по оптимизации программ, использующих GPU. ЗАСТОСУВАННЯ МОЖЛИВОСТЕЙ IНСТРУМЕНТАЛЬНОЇ СИСТЕМИ "CUDA"ДЛЯ ПРОГРАМУВАННЯ ГРАФIЧНИХ ПРОЦЕСОРIВ В НАУКОВО-ТЕХНIЧНИХ РОЗРАХУНКIВ В.О. Дудник, В.I. Кудрявцев, Т.М. Середа, С.О. Ус, М.В. Шестаков Описано можливостi технологiї "CUDA"(Compute Unified Device Architecture – унiфiкованого програмно- апаратного рiшення для паралельних обчислень на GPU) компанiї NVIDIA. Видiлено основнi вiдмiн- ностi мови програмування C для GPU вiд "звичайного"С. Дани приклади використання CUDA для прискорення розробки i реалiзацiї алгоритмiв науково-технiчних розрахункiв, виконуваних засобами графiчних процесорiв (GPGPU) прискорювачiв GeForce восьмого поколiння. Приведено рекомендацiї по оптимiзацiї програм, використовуючих GPU. 165