In the last part of the series, we implemented the compression loop part of the SHA-256 algorithm. In this post, we’ll look at the other half of the implementation – the schedule vector generation.
Step 4: Finish GPU Implementation
Here’s the C code again:
for (int i=0; i < 16; i++) W[i] = data[k*stride+i]; for (int i=16; i < 64; i++) W[i] = smsigma1(W[i-2]) + W[i-7] + smsigma0(W[i-15]) + W[i-16];
As we noted before, we only need to keep 16 vectors at a time in order to generate the next 16. The access pattern is the same but the stride is not fixed.
The first idea comes by recognizing that the VPM functions as a queue both for input and output. When we read the W array in the compression loop, we access each element in order, so it would be convenient to simply set the VPM read base address, say we’re going to read 16 vectors and the compression loop stays exactly the same. After each block of 16, we’ll generate and write the next 16 vectors into the same place in the VPM (overwriting the 16 we just used) and repeat 4 times (for the 64 iterations we need to do).
In order to generate the next vector, we need to read 4 elements of W. We could read each element of W by setting the VPM base address (e.g. for i-16), reading one vector, setting the base address (e.g. for i-7), reading another vector and that would work but this is not really the best way to use the VPM. We could also set the base address once and then read 16 vectors in a streaming fashion, discarding the ones we don’t want. If the element indices are close together (e.g. in this case, we can read W[i-16] and W[i-15] without resetting the base address and without discarding), this makes some sense.
Instead, the solution we’ll go with for now is a bit of a hybrid. We’ll “mirror” or “alias” the 16 W elements in 16 registers (ra4..ra19). So, ra4 always holds W, ra5 always holds W, et c … In addition to keeping W in registers, we will also write each element into the VPM (again as a queue) as we generate it. In this way, the compression loop can simply read the VPM as a queue and no changes are needed.
As we described, since we only need the previous 16 elements in order to generate the next one, we can think of it as a circular queue. For example, we place what would be W into W. After 16 times, we have a new set of W’s and we can run the next 16 loops of compression.
The code to do this unrolling makes use of a couple m4 macros:
define(`GENSCHEDULE', ` add rb32, $1, 0; nop # r0 = W_i-16 ror rb33, $2, 7; nop # r1 = RotR(x, 7) ror rb34, $2, rb6; nop # r2 = RotR(x, 18) shr rb35, $2, 3; nop # r3 = x >> 3; xor rb33, r1, r2; nop # r1 = r1 ^ r2 xor rb35, r1, r3; nop # r3 = r1 ^ r3 add rb32, r0, r3; nop # r0 += r3 (W_i-16 + smsigma0(W_i-15)) add rb32, r0, $3; nop # r0 += W_i-7 ror rb33, $4, rb8; nop # r1 = RotR(x, 17) ror rb34, $4, rb7; nop # r2 = RotR(x, 19) xor rb33, r1, r2; nop # r1 = r1 ^ r2 shr rb34, $4, 10; nop # r2 = x >> 10 xor rb33, r1, r2; nop # r1 = r1 ^ r2 add $1, r0, r1; nop # r0 += smsigma1(W_i-2) add rb48, r0, r1; nop ## $2 ignored, $3 ignored, $4 ignored, $1 ignored (suppress warnings)') define(`GENSCHEDULE_ALL', ` GENSCHEDULE(`ra4', `ra5', `ra13', `ra18') GENSCHEDULE(`ra5', `ra6', `ra14', `ra19') GENSCHEDULE(`ra6', `ra7', `ra15', `ra4') GENSCHEDULE(`ra7', `ra8', `ra16', `ra5') GENSCHEDULE(`ra8', `ra9', `ra17', `ra6') GENSCHEDULE(`ra9', `ra10', `ra18', `ra7') GENSCHEDULE(`ra10', `ra11', `ra19', `ra8') GENSCHEDULE(`ra11', `ra12', `ra4', `ra9') GENSCHEDULE(`ra12', `ra13', `ra5', `ra10') GENSCHEDULE(`ra13', `ra14', `ra6', `ra11') GENSCHEDULE(`ra14', `ra15', `ra7', `ra12') GENSCHEDULE(`ra15', `ra16', `ra8', `ra13') GENSCHEDULE(`ra16', `ra17', `ra9', `ra14') GENSCHEDULE(`ra17', `ra18', `ra10', `ra15') GENSCHEDULE(`ra18', `ra19', `ra11', `ra16') GENSCHEDULE(`ra19', `ra4', `ra12', `ra17')')
Macros are good for precisely this sort of unrolling. m4 is pretty powerful and you could certainly make a macro that unrolls the loop for you but to keep it simple and to make it clear, we unrolled the loop manually. If we take just the first GENSCHEDULE expansion, it says use ra4, ra5, ra13, and ra18 and put the result back in ra4 as well as write to the VPM queue. (That’s the ‘add rb48, ‘ line). The C code again (see above) is:
W[i] = smsigma1(W[i-2]) + W[i-7] + smsigma0(W[i-15]) + W[i-16];
Let i = 16 (the first iteration of the loop). Then W (i-16) is ra4, W (i-15) is ra5, W (i-7) is ra13 and W is ra18. And because it’s a circular buffer, we put the result back in ra4 as well as write the first column of the VPM. Hopefully that makes sense.
This concept of using registers and unrolling loops is an important technique for GPU programming. Instead of thinking of the registers as temporary locations for intermediate results or constants, try thinking of them as a general random-access memory and consider unrolling to create that effect. Consider that the VPM address space is only 4 KB (64 rows and 16 columns of 4-byte words) and that’s shared by all the QPUs. Now note that the register address space is the same size (64 registers – 32 in the A file and 32 in the B file) and it is private for each QPU. Obviously there are times when this won’t work. If you have a truly random access pattern, unrolling won’t be able to help but you’ll find in many cases, it is applicable, and when it is, it is often the most efficient way to operate.
We will see the exception and how this breaks down later in the series and reconsider this but part of the point of showing this intermediate step is to illustrate using the registers and loop unrolling.