In the last part of the series, we wrote a reference implementation of the parallel SHA-256 algorithm. In this installment, we’ll switch back to GPU code and write a partial implementation of just the compression loop (leaving schedule generation for the next part).
Step Three: Begin GPU Implementation
Fetch and build the code to follow along:
cd QPU/SHA-256/partial
make
To run the code, you will need to create the char_dev device node as described in the first part of this series (or run the program from a directory that already contains such a file).
Let’s look at the host-side changes first. sha256.cpp should look familiar from the last installment. We’ve added an optional command-line parameter ‘-qpu’ that tells the program to run the GPU implementation, along with some calls into qpufuncs.cpp to interact with the QPUs. qpufuncs.cpp defines a handful of functions – SHA256SetupQPU, SHA256CleanupQPU, SHA256ExecuteQPU, SHA256FetchResult and loadQPUCode – which should make sense if you’ve read the first article in this series. The only significant change from the reference code is that we run the main loop only 16 times, so we work with the initial message data alone and don’t have to worry about schedule generation (we still compute the schedule vectors on the CPU, but they go unused). There is a comment to this effect in the code.
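The reason 16 rounds sidesteps schedule generation is that the first 16 words of the SHA-256 message schedule are simply the message block itself; expansion only begins at round 16. A minimal sketch of the standard schedule construction (the function and variable names here are illustrative, not taken from sha256.cpp):

```cpp
#include <cstdint>

// Build the full 64-word SHA-256 message schedule from one 512-bit block.
void build_schedule(const uint32_t m[16], uint32_t w[64]) {
    for (int t = 0; t < 16; ++t)
        w[t] = m[t];                       // rounds 0..15: schedule == raw message words
    for (int t = 16; t < 64; ++t) {        // only rounds 16..63 need expansion
        auto rotr = [](uint32_t x, int n) { return (x >> n) | (x << (32 - n)); };
        uint32_t s0 = rotr(w[t - 15], 7) ^ rotr(w[t - 15], 18) ^ (w[t - 15] >> 3);
        uint32_t s1 = rotr(w[t - 2], 17) ^ rotr(w[t - 2], 19) ^ (w[t - 2] >> 10);
        w[t] = w[t - 16] + s0 + w[t - 7] + s1;
    }
}
```

A 16-round run only ever reads w[0..15], i.e. the input words themselves, which is why the partial QPU implementation can skip the expansion entirely.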
So now let’s look at the QPU code in partial.asm. The first thing to notice is on the very first line:
define(`NOP', `nop ra39, ra39, ra39; nop rb39, rb39, rb39')
This is an m4 macro which we’ll use to make our little assembler more useful and save us a lot of typing. m4 comes pre-installed on nearly all Linux systems (or rather GNU/Linux systems, as m4 is a GNU utility). You can use the info pages (‘info m4’) or the internet to learn the syntax, but this line is just a simple substitution, and most of what we’ll use m4 for is no more than simple substitution with arguments.
From now on, we’ll assemble the code after running it through m4:
m4 partial.asm | qpu-assembler -o sha256.bin
The general pattern of reading uniforms, using the VCD to move data into VPM memory, reading the data, writing it back into the VPM and then using the VCD to transfer the output to the host should be familiar from the first part.
We’re hard-coding the VPM setup registers to use the first 16 rows of the VPM address space for our temporary storage. We’ll have to change that when we scale the program beyond a single QPU. We transfer the H vectors first, read them into registers, then transfer the data vectors into this same section.
## Read the H vectors into registers ra20..ra27 (these are the a..h)
## Also copy them into rb20..rb27 (we need the original values to write back)
or ra20, ra48, 0; v8max rb20, ra48, ra48;
Notice how we use both the add and the multiply pipeline to write into two registers in one instruction (this only works if the destinations are in different register files – a and b). The assembler is fairly low-level, so we’re seeing the actual opcodes that will be generated. v8max takes the per-byte maximum of two values, and since the max of two identical values is exactly that value, it works as a mov. One could also use v8min or ‘and’ for mov operations. Note that even though ra48 is referenced three times, it is only read once and the value is used in all three places. Note also that you can only read one a-file and one b-file register per instruction. This places significant restrictions on what operations can be combined into a single instruction, and it is why the accumulator registers (r0..r3 for general-purpose use) are recommended whenever possible.
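To see why v8max works as a mov, here is the per-byte operation modelled in plain C++ (a sketch of the semantics, not the actual hardware path):

```cpp
#include <cstdint>

// Model the QPU's v8max: per-byte unsigned maximum of two 32-bit words.
uint32_t v8max(uint32_t a, uint32_t b) {
    uint32_t r = 0;
    for (int i = 0; i < 4; ++i) {
        uint32_t ba = (a >> (8 * i)) & 0xff;   // byte i of a
        uint32_t bb = (b >> (8 * i)) & 0xff;   // byte i of b
        r |= (ba > bb ? ba : bb) << (8 * i);   // keep the larger byte
    }
    return r;
}
```

Since max(x, x) == x for every byte, v8max(x, x) returns x unchanged – exactly a mov. The same argument applies to v8min(x, x) and to x & x.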
or rb56, ra31, 0; nop
nop.tmu ra39, ra39, ra39; nop
add ra31, ra31, 4; nop
add rb32, r4, ra27; nop
Here we see how to use the texture fetch units to read a single word from a texture: we write the address into the TMU0_S register (rb56), set the .tmu flag on a subsequent instruction, then read r4 to get the result. The nop in this code serves no purpose other than to carry the .tmu flag. Note that we can’t put the .tmu flag on the following add (which increments the address for the next trip through the loop) because a small immediate (the 4 in that instruction) uses the same signaling bits. Clearly this is not an optimized piece of code, but premature optimization is the root of all evil, and we want to wait until we can measure the performance improvement before making optimizations.
The rest of the code is a somewhat tedious translation of the SHA-256 compression loop into the fundamental operations. Again, the code is not fully optimized, and we see a couple of movs that could (and should) be combined with other arithmetic instructions.
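For reference, the fundamental operations in question are the standard SHA-256 round functions. Shown here in plain C++ straight from the FIPS 180-4 definitions (this is the reference math the QPU code translates, not a copy of partial.asm):

```cpp
#include <cstdint>

static uint32_t rotr(uint32_t x, int n) { return (x >> n) | (x << (32 - n)); }

// SHA-256 round functions (FIPS 180-4); each becomes a handful of
// and/xor/shift instructions in the QPU translation.
static uint32_t Ch(uint32_t x, uint32_t y, uint32_t z)  { return (x & y) ^ (~x & z); }
static uint32_t Maj(uint32_t x, uint32_t y, uint32_t z) { return (x & y) ^ (x & z) ^ (y & z); }
static uint32_t Sigma0(uint32_t x) { return rotr(x, 2) ^ rotr(x, 13) ^ rotr(x, 22); }
static uint32_t Sigma1(uint32_t x) { return rotr(x, 6) ^ rotr(x, 11) ^ rotr(x, 25); }
```

Each round computes two temporaries from these functions, the round constant and the schedule word, then rotates the a..h working variables – which maps directly onto the ra20..ra27 registers we loaded earlier.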
Run the code with and without the ‘-qpu’ flag and see that you get the same answer from the CPU version and the GPU version. In the next installment, we’ll look at the schedule generation where things are a bit more interesting.