OpenCL - Are work-group axes exchangeable?


I was trying to find the best work-group size for my problem, and I figured out something I couldn't justify to myself.

These are my results:

  • globalWorkSize {6400, 6400, 1}, workGroupSize {64, 4, 1}, time (milliseconds) = 44.18
  • globalWorkSize {6400, 6400, 1}, workGroupSize {4, 64, 1}, time (milliseconds) = 24.39

Swapping the axes made the execution twice as fast. Why?!

By the way, I am using an AMD GPU.

Thanks :-)

Edit: this is the kernel (a simple matrix transposition):

    __kernel void transpose(__global float *input, __global float *output, const int size)
    {
        int i = get_global_id(0);
        int j = get_global_id(1);
        output[i * size + j] = input[j * size + i];
    }
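
For completeness, a minimal host-side sketch of how the two configurations are launched (my own illustration; platform/context/buffer setup, clSetKernelArg calls and error checking are omitted, and queue and kernel are assumed to be valid). The two measurements differ only in the local_work_size argument passed to clEnqueueNDRangeKernel:

    /* Sketch only: setup and error checking omitted. */
    #include <CL/cl.h>

    void run_both_shapes(cl_command_queue queue, cl_kernel kernel)
    {
        const size_t global[2]  = {6400, 6400};  /* globalWorkSize {6400, 6400, 1} */
        const size_t local_a[2] = {64, 4};       /* workGroupSize {64, 4, 1} -> 44.18 ms */
        const size_t local_b[2] = {4, 64};       /* workGroupSize {4, 64, 1} -> 24.39 ms */

        /* Same kernel, same global size: only the work-group shape changes. */
        clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local_a, 0, NULL, NULL);
        clFinish(queue);   /* wait so the two timings are comparable */
        clEnqueueNDRangeKernel(queue, kernel, 2, NULL, global, local_b, 0, NULL, NULL);
        clFinish(queue);
    }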

I agree with @Thomas: it depends on your kernel. Most probably, in the second case you access memory in a coalescent way and/or make full use of the memory transactions.

Coalescence: when threads need to access elements in memory, the hardware tries to access these elements in as few transactions as possible, i.e. if thread 0 and thread 1 have to access contiguous elements, there will be only one transaction.

Full use of a memory transaction: let's say a GPU fetches 32 bytes in one transaction. Therefore, if you have 4 threads that each need to fetch one int, you are using only half of the data fetched by the transaction; you waste the rest (assuming an int is 4 bytes).
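
As a toy illustration (my own example, not your kernel), assume the hypothetical GPU above with 32-byte transactions and 4-byte ints:

    // Illustration only: assumes 32-byte memory transactions and 4-byte ints.
    __kernel void copy_ints(__global const int *in, __global int *out)
    {
        int gid = get_global_id(0);
        // Work-items 0..7 read in[0]..in[7]: 8 * 4 = 32 contiguous bytes,
        // which the hardware can coalesce into a single, fully used transaction.
        // With only 4 work-items, the same 32-byte transaction is issued but
        // only 16 of its bytes are used: half the fetched data is wasted.
        out[gid] = in[gid];
    }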

To illustrate this, let's say you have an n x n matrix to access. The matrix is stored in row-major order, and you use n threads organized in one dimension. You have two possibilities:

  1. Each work-item takes care of one column, looping through the elements of that column one at a time.
  2. Each work-item takes care of one row, looping through the elements of that row one at a time.

It might be counter-intuitive, but the first solution will be able to make coalescent accesses while the second won't. The reason is that when the first work-item needs to access the first element of the first column, the second work-item will access the first element of the second column, and so on. These elements are contiguous in memory. That is not the case with the second solution.
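
Here is a rough sketch of the two options (hypothetical kernels of my own, summing elements just to have something to loop over), assuming a row-major n x n matrix and a 1D NDRange of n work-items:

    // Solution 1: one work-item per column. At iteration k, work-item 0 reads
    // m[k*n + 0], work-item 1 reads m[k*n + 1], and so on: neighbouring work-items
    // touch contiguous addresses, so their reads coalesce into few transactions.
    __kernel void per_column(__global const float *m, __global float *col_sums, const int n)
    {
        int col = get_global_id(0);
        float acc = 0.0f;
        for (int k = 0; k < n; ++k)
            acc += m[k * n + col];
        col_sums[col] = acc;
    }

    // Solution 2: one work-item per row. At iteration k, work-item 0 reads
    // m[0*n + k], work-item 1 reads m[1*n + k], and so on: addresses n floats apart,
    // so each work-item needs its own transaction (no coalescing).
    __kernel void per_row(__global const float *m, __global float *row_sums, const int n)
    {
        int row = get_global_id(0);
        float acc = 0.0f;
        for (int k = 0; k < n; ++k)
            acc += m[row * n + k];
        row_sums[row] = acc;
    }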

Now, if you take the same example and apply solution 1, but this time with 4 work-items instead of n (and the same GPU I spoke about before), you'll likely increase the time by a factor of 2, since you waste half of your memory transactions.

EDIT: Now that you have posted your kernel, I see that I forgot to mention something else.

With your kernel, it seems that choosing a local size of (1, 256) or (256, 1) is always a bad choice. In the first case, 256 transactions will be necessary to read a column from the input (each fetching 32 bytes, out of which only 4 are used, keeping in mind the same GPU as in the previous examples), while 32 transactions will be necessary to write to the output: you can write 8 floats in one transaction, hence 32 transactions to write 256 elements.

You have the same problem with a work-group size of (256, 1), but this time it takes 32 transactions to read and 256 to write.
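
Under the same assumptions (32-byte transactions, 4-byte floats, so 8 floats per fully used transaction), a quick back-of-the-envelope check of those per-work-group counts:

    /* Back-of-the-envelope check; assumes 32-byte transactions and 4-byte floats. */
    #include <stdio.h>

    int main(void)
    {
        const int elems_per_group = 256;              /* a (1, 256) or (256, 1) work-group */
        const int floats_per_txn  = 32 / 4;           /* 8 floats per 32-byte transaction  */

        /* Strided direction: each of the 256 elements lies in a different
           32-byte line, so it costs one transaction per element.            */
        int strided_txns = elems_per_group;                     /* 256 */

        /* Contiguous direction: 256 consecutive floats pack 8 per transaction. */
        int contiguous_txns = elems_per_group / floats_per_txn; /* 32  */

        printf("(1, 256): read %d, write %d\n", strided_txns, contiguous_txns);
        printf("(256, 1): read %d, write %d\n", contiguous_txns, strided_txns);
        return 0;
    }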

So why does the first size, (1, 256), work better? It's because there is a cache system that can mitigate the bad accesses on the read part. Therefore the size (1, 256) is good for the write part, and the cache system handles the not-so-good read part, decreasing the number of read transactions actually needed.

Note that the number of transactions decreases overall (taking into consideration all the work-groups within the NDRange). For example, the first work-group issues 256 transactions to read the 256 first elements of the first column. The second work-group might just go to the cache to retrieve the elements of the second column, because they were already fetched by the (32-byte) transactions issued by the first work-group.

Now, I'm sure you can do better than (1, 256); try (8, 32).
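
My own back-of-the-envelope reasoning (same simplified 32-byte-transaction model) for why a squarer shape such as (8, 32) should help: within one work-group, both the reads and the writes now touch runs of contiguous floats that fill whole transactions.

    /* Transaction count for one (8, 32) work-group of the transpose kernel,
       assuming 32-byte transactions and 4-byte floats (8 floats per transaction). */
    #include <stdio.h>

    int main(void)
    {
        const int floats_per_txn = 32 / 4;

        /* Reads of input[j*size + i]: for each of the 32 values of j, the 8
           work-items varying in i read 8 consecutive floats = 1 full transaction. */
        int read_txns = 32 * (8 / floats_per_txn);              /* 32 */

        /* Writes to output[i*size + j]: for each of the 8 values of i, the 32
           work-items varying in j write 32 consecutive floats = 4 full transactions. */
        int write_txns = 8 * (32 / floats_per_txn);             /* 32 */

        printf("(8, 32): read %d, write %d, no bytes wasted\n", read_txns, write_txns);
        return 0;
    }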

