Let’s consider this portion of your kernel where “most” of the work is being done:
for (int stride = 1; stride < 2 * blockDim.x; stride *= 2) {
    if (threadIdx.x % stride == 0) {       // only every stride-th thread does work this pass
        if (i + stride < n) {
            in[i] += in[i + stride];       // sweep-style pairwise accumulation in global memory
        }
    }
    __syncthreads();
}
At each pass through the for-loop, let’s first ask the question: how many “active” threads are there? This is given by the first if-test, threadIdx.x % stride == 0. stride starts out as 1, then 2, then 4, and so on. So the first pass has the full complement of threads, the second pass selects every 2nd thread, the 3rd pass selects every 4th thread, and so on.
Now let’s ask the question: for a given threadblock size, what is the average number of active threads across all iterations of this for-loop?
Finally, let’s build a table for the above questions, showing the average active threads, for each threadblock configuration, and also considering the multiplicative factor of having multiple threadblocks resident. The first average in that table would be computed like this:
(1024 + 512 + 256 + 128 + 64 + 32 + 16 + 8 + 4 + 2 + 1) / 11 = 2047 / 11 ≈ 186 active threads (average)
And for this case there can be only 1 threadblock resident, because Turing has a maximum complement of 1024 threads per SM. What would that table look like for the other cases?
threads/block    average per block    number of blocks resident    average per SM
1024             186                  1                            186
512              102                  2                            204
256              56.8                 4                            227
128              31.9                 8                            255
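If you want to reproduce those numbers yourself, here is a small host-side sketch (ordinary C++ of my own, not part of your kernel) that recomputes the per-block averages and the per-SM totals, assuming the Turing limit of 1024 resident threads per SM mentioned above:

    #include <cstdio>

    int main() {
        const int max_threads_per_sm = 1024;               // Turing occupancy limit assumed above
        for (int block = 1024; block >= 128; block /= 2) {
            int sum = 0, passes = 0;
            // active threads per pass: block, block/2, ..., 1
            for (int active = block; active >= 1; active /= 2) {
                sum += active;
                ++passes;
            }
            double avg_per_block = (double)sum / passes;
            int blocks_resident  = max_threads_per_sm / block;
            printf("%4d threads/block: avg %6.1f per block, %d blocks resident, avg %6.1f per SM\n",
                   block, avg_per_block, blocks_resident, avg_per_block * blocks_resident);
        }
        return 0;
    }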
So you can see that using smaller threadblocks, but more of them, at least for this range and this portion of the analysis, results in more average active threads per SM. Effectively, the SM sustains higher memory traffic on average, and therefore gets this memory-bound workload done more quickly.
So what should we do?
My recommendation would be to precede your reduction operation with a grid-stride loop. You can get an idea of this by reviewing the canonical parallel reduction material; in that material, note the introduction of the grid-stride loop at reduction #7. The idea is to choose a number of threadblocks that just fills your SMs, and to use a grid-stride loop to efficiently load “all” of your data. This moves the data loading, and a portion of the reduction, into a very efficient code configuration, while reserving the less-efficient sweep-style reduction for a hopefully much smaller portion of the overall work. That minimizes the kernel time spent in the sweep phase, and with it the effect of threadblock sizing on performance variability.
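To illustrate the shape of that suggestion, here is a minimal sketch (names such as gridStrideReduce, out, and the fixed 256-thread block size are my own assumptions, not your code): each thread accumulates a private sum over a grid-stride loop, the block then reduces those partial sums in shared memory, and one thread per block combines the block result into the output.

    // Minimal sketch of a grid-stride reduction (hypothetical names; assumes a 256-thread block).
    __global__ void gridStrideReduce(const float *in, float *out, int n) {
        __shared__ float sdata[256];
        float sum = 0.0f;
        // Grid-stride loop: every thread stays active and the loads are fully coalesced.
        for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n; i += gridDim.x * blockDim.x)
            sum += in[i];
        sdata[threadIdx.x] = sum;
        __syncthreads();
        // Shared-memory sweep over the per-thread partial sums (the only phase where threads idle).
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s)
                sdata[threadIdx.x] += sdata[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0)
            atomicAdd(out, sdata[0]);   // combine per-block results; *out must be zeroed beforehand
    }

You would then launch with a grid sized to just fill the machine rather than to cover n, for example gridStrideReduce<<<numSMs * blocksPerSM, 256>>>(d_in, d_out, n), where numSMs and blocksPerSM are whatever your occupancy analysis gives you; the block size of 256 here is only an example.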
Of course, even with the above suggestion, if you are going for the last ounce of performance, you may wish to evaluate your code at various threadblock configurations for best performance. This kind of “ninja-tuning” suggestion is applicable to nearly any GPU code.
As an aside, your reduction methodology here, using only global memory, is not what I would recommend (refer to the linked canonical material for a better approach), but that doesn’t seem to be the thrust of your question.