parallel processing - How to automatically calculate the block and grid size of a 2D image in CUDA?


I know the basic ideas of blocks and grids in CUDA, and I'm wondering whether there is a helper function that can help me determine the best block and grid size for a given 2D image.

For example, for a 512x512 image, as mentioned in this thread, the grid could be 64x64 and the block 8x8.

However, the input image may not have power-of-2 dimensions; it may be 317x217 or something like that. In that case, maybe the grid should be 317x1 and the block should be 1x217.

So if I have an application that accepts an image from a user and uses CUDA to process it, how can it automatically determine the size and dimensions of the block and grid, given that the user can input an image of any size?

Is there an existing helper function or class that handles this problem?

Usually you want to choose the size of your blocks based on your GPU architecture, with the goal of maintaining 100% occupancy on the streaming multiprocessor (SM). For example, the GPUs at my school can run 1536 threads per SM and up to 8 blocks per SM, but each block can have at most 1024 threads in total. So if I launch a 1D kernel on the GPU and max out a block with 1024 threads, only 1 block fits on the SM (66% occupancy). If I instead choose a smaller number, like 192 or 256 threads per block, I get 100% occupancy with 8 and 6 blocks respectively on the SM.
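The arithmetic above can be sketched as a small host-side function. This is only a rough model under stated assumptions: it looks at the thread and block limits alone and ignores registers and shared memory, which also limit occupancy on real hardware. The names `SmLimits`, `Occupancy` and `estimate_occupancy` are made up for illustration, and the limits have to be filled in for your own GPU.

```cpp
#include <algorithm>
#include <cassert>

// Hypothetical container for the per-SM limits quoted above
// (e.g. 1536 threads per SM, 8 blocks per SM, 1024 threads per block).
struct SmLimits {
    int max_threads_per_sm;
    int max_blocks_per_sm;
    int max_threads_per_block;
};

struct Occupancy {
    int resident_blocks;  // blocks that fit on one SM at the same time
    double fraction;      // resident threads / max threads per SM
};

// Estimate occupancy from thread and block limits only.
// Assumption: registers and shared memory are not the limiting factor.
Occupancy estimate_occupancy(int threads_per_block, const SmLimits& lim) {
    assert(threads_per_block > 0 &&
           threads_per_block <= lim.max_threads_per_block);
    // Whole blocks that fit within the SM's thread budget.
    const int by_threads = lim.max_threads_per_sm / threads_per_block;
    // The SM also caps the number of resident blocks.
    const int blocks = std::min(by_threads, lim.max_blocks_per_sm);
    const double fraction =
        double(blocks) * threads_per_block / lim.max_threads_per_sm;
    return Occupancy{blocks, fraction};
}
```

With the limits `{1536, 8, 1024}`, this gives 1 block at about 67% for 1024 threads, 8 blocks at 100% for 192 threads, and 6 blocks at 100% for 256 threads. Blocks that are too small hurt as well: 128 threads per block hits the 8-block cap and ends up at about 67% again. Newer CUDA toolkits also ship occupancy helpers in the runtime API, such as `cudaOccupancyMaxPotentialBlockSize`, which take a specific kernel's resource usage into account.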

Another thing to consider is the amount of memory that must be accessed vs. the amount of computation to be done. In many imaging applications, you don't just need the value at a single pixel; you need the surrounding pixels as well. CUDA groups its threads into warps, which step through every instruction simultaneously (currently there are 32 threads to a warp, though that may change). Making your blocks square generally minimizes the amount of memory that needs to be loaded relative to the amount of computation that can be done, making the GPU more efficient. Likewise, blocks whose dimensions are powers of 2 load memory more efficiently (if properly aligned with memory addresses), since the GPU loads memory a line at a time instead of by single values.
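To put a rough number on the "square blocks" argument, here is a minimal sketch of a simplified model. It assumes each block loads its own tile plus a border of `radius` neighbouring pixels on every side (for example into shared memory) and computes one output pixel per thread. The function name `halo_load_ratio` and the model itself are illustrative assumptions, not a measurement of real hardware.

```cpp
#include <cassert>

// Pixels loaded per pixel computed for a block_w x block_h block whose
// threads each need neighbours up to `radius` pixels away.
// Assumption: the block loads its tile plus the surrounding border once.
double halo_load_ratio(int block_w, int block_h, int radius) {
    assert(block_w > 0 && block_h > 0 && radius >= 0);
    const double loaded =
        double(block_w + 2 * radius) * double(block_h + 2 * radius);
    const double computed = double(block_w) * double(block_h);
    return loaded / computed;
}
```

For 256 threads and a 3x3 neighbourhood (`radius` 1), a 16x16 block loads 18 * 18 = 324 pixels, a ratio of about 1.27, while a 256x1 block loads 258 * 3 = 774 pixels, a ratio of about 3.02.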

So for your example, even though it might seem more effective to have a grid that is 317x1 and blocks that are 1x217, your code will likely be more efficient if you launch 16x16 blocks on a 20x14 grid, as this leads to a better computation/memory ratio and better SM occupancy. It does mean, though, that you have to check within the kernel that the thread is not outside the picture before trying to access memory, something like

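const int thread_id_x = blockIdx.x * blockDim.x + threadIdx.x;
const int thread_id_y = blockIdx.y * blockDim.y + threadIdx.y;
// Threads in the partial blocks along the right and bottom edges fall outside the image.
if (thread_id_x < pic_width && thread_id_y < pic_height)
{
    // do stuff
}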

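Lastly, you can determine the lowest number of blocks you need in each grid dimension to completely cover your image with (N+M-1)/M (using integer division), where N is the total number of threads needed in that dimension and M is the number of threads per block in that dimension.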
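Putting that formula into a helper might look like the sketch below. It is plain host-side C++, so `Dim2` is a hypothetical stand-in for CUDA's `dim3`, and `div_up` and `grid_for_image` are names made up for this example.

```cpp
#include <cassert>

// Hypothetical stand-in for CUDA's dim3 so the sketch compiles as plain C++.
struct Dim2 {
    unsigned x;
    unsigned y;
};

// Integer ceiling division: the smallest count of size-m chunks covering n.
unsigned div_up(unsigned n, unsigned m) {
    assert(m > 0);
    return (n + m - 1) / m;
}

// Smallest grid of `block`-sized blocks that covers a width x height image.
Dim2 grid_for_image(unsigned width, unsigned height, Dim2 block) {
    return Dim2{div_up(width, block.x), div_up(height, block.y)};
}
```

For the 317x217 image with 16x16 blocks this gives a 20x14 grid, and for 512x512 with 8x8 blocks it gives 64x64. In real CUDA code you would copy the result into a `dim3` and pass it as the grid argument of the kernel launch, and you still need the bounds check inside the kernel, because the last row and column of blocks extend past the image.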

