In part 1, we wrote a simple “Hello World” QPU program and showed how to run it. In this installment, we’ll look at the algorithm we’re going to implement on the QPUs.
Step Two: Write a Reference Implementation
The NIST description of SHA-256 provides a very good and readable description of the algorithm. Additionally, the Wikipedia article on the SHA-2 family of algorithms is another good resource and includes pseudo code for calculating the hash.
You will find the reference implementation under the directory QPU/SHA-256/reference:
cd QPU/SHA-256/reference
make
Run the program with the test data:
Each line of the file passed in is treated as a separate block to be hashed, and the code computes 16 hashes, one per line. Verify that the reference implementation is correct by comparing against the system’s sha256sum program:
$ head -1 test-data.bin | sha256sum
daa9917f579255777c333304246501e9228bfb29e6be2ec9caa935ab986b5ddb  -
$ tail -1 test-data.bin | sha256sum
803ca2499221865999d6ac101cf000a2cb5eca394cef0747b27ad02984d33e44  -
This should match the first and last lines of the output of the reference implementation. In this reference implementation we are only concerned with making sure the algorithm is correct, so we only handle one block (512 bits = 64 bytes). It would be straightforward to extend the implementation to handle multiple blocks using the stride parameter, but we’ll only concern ourselves with a single block for the rest of this series.
Step Three: Make Observations and Map the Algorithm
The code is a pretty straightforward implementation, but we can make a few observations. As explained in the NIST description, the SHA-256 algorithm basically consists of two parts: message schedule generation and message compression.

Message compression consists of simple bitwise operations like and, xor, shifts and rotates. Conveniently, the QPUs have a ‘ror’ (rotate right) instruction, so we don’t have to do two shifts and an or as we do in the reference implementation. The access pattern in the compression phase is also just a simple stream: each iteration of compression reads the next K word as well as the next schedule word and combines them with various bitwise and arithmetic operations. So once we have the schedule vectors, compression should translate pretty easily to the QPU.

Finally, as we’ve noted before, all the operations in SHA-256 are bitwise or addition operations, which are handled in the QPU by the add pipeline; the multiply pipeline will be mostly idle. Effectively, the absolute theoretical maximum QPU performance we can hope to achieve is 12 GFLOPs (half the 24 GFLOPs total). SHA-256 is not a great algorithm for showing off the GPU, but it is interesting and illustrative for optimization purposes.
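In C there is no rotate operator, so the reference implementation presumably builds rotate-right out of two shifts and an or, something like this sketch (the name `rotr` is ours):

```c
#include <stdint.h>

/* Rotate right: composed from two shifts and an or, since C has
 * no rotate operator.  Valid for n in 1..31.  On the QPU this
 * whole expression collapses into a single 'ror' instruction. */
static inline uint32_t rotr(uint32_t x, unsigned n)
{
    return (x >> n) | (x << (32 - n));
}
```

This is the operation the QPU’s ‘ror’ saves us: one instruction instead of three.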
So what about message schedule generation? At first glance, this one looks a bit trickier. The way the pseudo code is written, it looks like we take the 16 words (64-bytes = 512-bit block) of input data and expand them into 64 words (the first 16 are the same) and then go through the compression loop 64 times. The main point to observe here is that the schedule generation and compression are fairly independent so long as we generate the schedule vector before we need to use it in the compression loop. The second point to note is that we only need to store the last 16 words in order to generate the next word of the schedule. Essentially, we can think of it as a queue and if we make it a circular queue new entries can overwrite old entries without requiring storage for 64 words. The access pattern here is not random but neither is it a simple stream. We’ll try a couple ways to handle this and see which is fastest. (Optimization is often like this).
Finally, as hinted by the fact that we wrote the reference implementation to compute 16 hashes at a time, our QPU program will take advantage of the fact that the QPUs natively operate on 16-wide vectors and compute 16 hashes in parallel.
Let’s look at what data we’ll need and how we’ll move it around. The input data for each QPU will be a block of 16×16 words (1,024 bytes) for the 16 hashes of 64 bytes each. This maps nicely to the VPM, where we can fit 4 such blocks in the VPM space. We will want to read these in 16-wide vectors, which is exactly how the VPM works, so there’s no question we’ll use the VPM for this.
We also need the 64 words of data for the K array of constants. We will access this data sequentially in a streaming fashion. Also notice that we want to “broadcast” each word across the vector when we read it. (That is, we want to read one word but have it replicated in the 16 elements of a vector). This is exactly how the texture fetch units work so we’ll use a texture for the K array. All we need to pass in for that is the base address.
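In scalar terms, the broadcast the texture fetch performs is equivalent to replicating one word of the K table into every lane of a 16-wide vector. A sketch of that behavior (showing only the first 8 of the 64 round constants; the function is a model of ours, not a real API):

```c
#include <stdint.h>

/* First 8 of the 64 SHA-256 round constants K, from the NIST
 * specification (the full table has 64 entries). */
static const uint32_t K[8] = {
    0x428a2f98, 0x71374491, 0xb5c0fbcf, 0xe9b5dba5,
    0x3956c25b, 0x59f111f1, 0x923f82a4, 0xab1c5ed4,
};

/* Scalar model of the texture-unit broadcast: one word fetched
 * from the K table is replicated into all 16 vector lanes, so
 * every one of the 16 parallel hashes sees the same constant. */
static void broadcast_k(uint32_t vec[16], unsigned t)
{
    for (int lane = 0; lane < 16; lane++)
        vec[lane] = K[t];
}
```

On the QPU the texture unit does this replication for us; all the setup required is the base address of the K array.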
Next we have the H vectors (256 bits per hash, 512 bytes for all 16 hashes), which hold the intermediate, accumulating result between compression loops. They start with a constant value as described in the specification of the algorithm, and at the end of the compression loop they contain the final hash that we want. We’ll transfer this as an input/output parameter. That is to say, we’ll write the final result back into the same memory location. We’ll use the VPM for this as well, of course. The CPU will initialize the H vectors to the proper constant value before calling the QPU program. The CPU will also be responsible for all the padding. Other than that, the whole algorithm will run on the QPUs.
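The CPU-side initialization might look like the following sketch. The initial values are the standard constants from the specification; the 8-rows-of-16-words layout (one lane per hash, so each 16-wide vector read fetches the same H word for all 16 hashes) is an assumption of ours, not necessarily what the final program will use.

```c
#include <stdint.h>

/* SHA-256 initial hash values from the NIST specification
 * (first 32 bits of the fractional parts of the square roots
 * of the first eight primes). */
static const uint32_t H_INIT[8] = {
    0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
    0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19,
};

/* Hypothetical layout: 8 rows of 16 words (512 bytes), one lane
 * per hash.  The CPU fills this before invoking the QPU program;
 * the QPU writes the 16 final digests back into the same memory. */
static void init_h_vectors(uint32_t h[8][16])
{
    for (int row = 0; row < 8; row++)
        for (int lane = 0; lane < 16; lane++)
            h[row][lane] = H_INIT[row];
}
```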
Take some time to familiarize yourself with the reference implementation (perhaps try optimizing the CPU version) and start thinking about different ways to go about implementing it in the QPUs. We’ll continue the series next time with a partial implementation.