OpenCL - Are work-group axes exchangeable?
I am trying to find the best work-group size for my problem, and I ran into a result that I cannot explain. These are the timings:

globalworksize {6400 6400 1}, workgroupsize {64 4 1}, time (milliseconds) = 44.18
globalworksize {6400 6400 1}, workgroupsize {4 64 1}, time (milliseconds) = 24.39

Swapping the axes made execution almost twice as fast. Why? By the way, I am using an AMD GPU. Thanks :-)

Edit: here is the kernel (a simple matrix transposition):

    __kernel void transpose(__global float *input, __global float *output, const int size){
        int i = get_global_id(0);
        int j = get_global_id(1);
        output[i*size + j] = input[j*size + i];
    }

I agree with @thomas, it depends on the kernel. Probably, in the second case you access memory in a coalesced way and/or make full use of each memory transaction.

Coalescence: when threads need to access elements in memory, the hardware tries to fetch these elements in as few transactions as possible, i.e. if thread 0 and thread 1 have to access contiguous elements, there will be only one transaction.

Full use of a memory transaction: ...