cuda - Amdahl's law and GPU -


I have a couple of doubts regarding the application of Amdahl's law with respect to GPUs. For instance, in my kernel code I have launched a number of threads, say n. So, in Amdahl's law, is the number of processors n? Also, for CUDA programming using a large number of threads, is it safe for me to assume that Amdahl's law reduces to 1/(1-p), where p stands for the parallel fraction of the code? Thanks

For instance, in my kernel code I have launched a number of threads, say n. So, in Amdahl's law, is the number of processors n?

Not exactly. The GPU does not have as many physical cores (k) as the number of threads you can launch (n) (usually, k is around 10^3 while n is in the range 10^4 to 10^6). However, a significant portion of kernel time is (usually) spent just waiting for data to be read from or written to global memory, so one core can seamlessly handle several threads. This way the device can handle up to N0 threads without them interfering with each other, where N0 is usually several times bigger than k, but it mostly depends upon your kernel function.

In my opinion, the best way to determine this N0 is to experimentally measure the performance of your application and then use that data to fit the parameters of Amdahl's law :)
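As a sketch of that fitting approach: measure speedups at several thread counts, then grid-search the parallel fraction p that best matches Amdahl's law. The helper names and the synthetic "measurements" below are illustrative, not from the answer:

```python
def amdahl_speedup(p, n):
    """Classic Amdahl's law: speedup with parallel fraction p on n processors."""
    return 1.0 / ((1.0 - p) + p / n)

def fit_parallel_fraction(measurements):
    """Grid-search the parallel fraction p that minimizes squared error
    against measured (threads, speedup) pairs."""
    best_p, best_err = 0.0, float("inf")
    for i in range(1, 1000):
        p = i / 1000.0
        err = sum((amdahl_speedup(p, n) - s) ** 2 for n, s in measurements)
        if err < best_err:
            best_p, best_err = p, err
    return best_p

# Synthetic "measurements" generated with p = 0.9 stand in for real timings:
data = [(n, amdahl_speedup(0.9, n)) for n in (2, 4, 8, 16, 32, 64)]
print(fit_parallel_fraction(data))  # recovers p = 0.9
```

With real timings the fit would of course be noisy, but the recovered p (and the deviation of the data from the curve) tells you how far the simple model can be trusted.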

Also, for CUDA programming using a large number of threads, is it safe for me to assume that Amdahl's law reduces to 1/(1-p), where p stands for the parallel fraction of the code?

This assumption basically means that you neglect the time spent in the parallel part of your code (it is assumed to execute infinitely fast) and consider only the time spent in the serial part.

E.g. if you compute the sum of two 100-element vectors on the GPU, then initializing the device, copying the data, the kernel launch overhead etc. (the serial part) all take much more time than the kernel execution itself (the parallel part). However, this is not true in general.

Also, an individual GPU core does not have the same performance as a CPU core, so some scaling should be done, making Amdahl's law 1 / [(1-p) + k*p/n] (at its simplest, k = frequency(CPU) / frequency(GPU); sometimes k is increased even more to take into account architectural differences, like a CPU core having a SIMD block).
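The scaled formula can be sketched as follows; the function name and the sample numbers (parallel fraction, core count, performance ratio) are made up for illustration:

```python
def scaled_amdahl(p, n, k):
    """Amdahl's law with per-core scaling: 1 / ((1-p) + k*p/n), where
    k = perf(CPU core) / perf(GPU core); k > 1 means one GPU core is
    slower than the CPU core used as the serial baseline."""
    return 1.0 / ((1.0 - p) + k * p / n)

# p = 0.99 parallel fraction, 2048 GPU cores, GPU core 8x slower per core:
print(round(scaled_amdahl(0.99, 2048, 8.0), 1))
```

Note that k only changes how fast the curve rises: the asymptote for n → ∞ is still 1/(1-p), i.e. 100 for p = 0.99.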


I would argue against literally applying Amdahl's law to real systems. Sure, it shows the general trend, but it does not grasp some non-trivial processes.

First, Amdahl's law assumes that, given an infinite number of cores, the parallel part is executed instantly. This assumption is not true (though sometimes it might be pretty accurate). Even if you calculate the sum of two vectors, you can't compute it faster than it takes to add two bytes. One can neglect this "quantum", or include it in the serial portion of the algorithm, but this somewhat "breaks" the idea.

Second, how to correctly account in Amdahl's law for the effects of barrier synchronization, critical sections, atomic operations etc. is, to the best of my knowledge, an unresolved mystery. Such operations belong to the parallel part, but the walltime of their execution is at best independent of the number of threads and, at worst, positively dependent on it.

Simple example: the broadcasting time between computational nodes in a CPU cluster scales as O(log N), while some initialization steps can take O(N) time.

In simple cases one can estimate the benefit of parallelising an algorithm, but (as is often the case with CUDA) the static overhead of using parallel processing can take more time than the parallel processing itself saves.
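That break-even point can be made concrete with a toy cost model: a fixed setup overhead plus fast per-element work only beats overhead-free but slower serial work above some problem size. All timing constants below are invented for illustration:

```python
def gpu_time(n_elems, overhead_s=1e-3, per_elem_s=1e-9):
    """Hypothetical model: fixed init/copy/launch overhead, fast per-element work."""
    return overhead_s + n_elems * per_elem_s

def cpu_time(n_elems, per_elem_s=1e-8):
    """Hypothetical model: no overhead, but 10x slower per-element work."""
    return n_elems * per_elem_s

# Double the problem size until the offloaded version wins:
n = 1
while gpu_time(n) >= cpu_time(n):
    n *= 2
print(n)  # first power of two past the break-even point
```

Below that size the 1 ms of overhead dominates, exactly the situation described for summing two 100-element vectors.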

So, in my opinion, it is usually simpler to write the application, measure its performance and use that to plot the Amdahl's-law curve than to try to correctly estimate a priori all the nuances of the algorithm and hardware. In cases where such estimations are easily made, they are usually obvious without any "laws".

