
Hacking The GPU For Fun And Profit (Pt. 4)

In the last part of the series, we implemented the compression loop part of the SHA-256 algorithm.  In this post, we’ll look at the other half of the implementation – the schedule vector generation.

Step Four: Finish GPU Implementation

Here’s the C code again:

for (int i=0; i < 16; i++)
  W[i] = data[k*stride+i];              /* first 16 words come straight from the message block */
for (int i=16; i < 64; i++)
  W[i] = smsigma1(W[i-2]) + W[i-7] + smsigma0(W[i-15]) + W[i-16];   /* the schedule recurrence */
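
For reference, here is what the two helper functions look like in C.  This is a sketch straight from the FIPS 180-4 definitions (the reference source may differ cosmetically); the rotate is built from two shifts and an or since C has no rotate operator, and the rotate/shift amounts (7, 18, 3 and 17, 19, 10) are the same ones you’ll see in the QPU macro below.

#include <stdint.h>

static inline uint32_t rotr(uint32_t x, int n) {
    return (x >> n) | (x << (32 - n));   /* rotate right: two shifts and an or */
}
static inline uint32_t smsigma0(uint32_t x) {
    return rotr(x, 7) ^ rotr(x, 18) ^ (x >> 3);
}
static inline uint32_t smsigma1(uint32_t x) {
    return rotr(x, 17) ^ rotr(x, 19) ^ (x >> 10);
}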

As we noted before, we only need to keep 16 vectors at a time in order to generate the next 16.  Each new element reads from the same relative offsets (i-2, i-7, i-15 and i-16), but unlike the compression loop, the accesses do not form a simple fixed-stride stream.

The first idea comes from recognizing that the VPM functions as a queue both for input and output.  When we read the W array in the compression loop, we access each element in order, so it would be convenient to simply set the VPM read base address, say we’re going to read 16 vectors, and leave the compression loop exactly the same.  After each block of 16, we’ll generate and write the next 16 vectors into the same place in the VPM (overwriting the 16 we just used) and repeat 4 times (for the 64 iterations we need to do).

In order to generate the next vector, we need to read 4 elements of W.  We could read each element of W by setting the VPM base address (e.g. for i-16), reading one vector, setting the base address (e.g. for i-7), reading another vector, and so on.  That would work, but it is not really the best way to use the VPM.  We could also set the base address once and then read 16 vectors in a streaming fashion, discarding the ones we don’t want.  If the element indices are close together (e.g. in this case, we can read W[i-16] and W[i-15] without resetting the base address and without discarding), this makes some sense.

Instead, the solution we’ll go with for now is a bit of a hybrid.  We’ll “mirror” or “alias” the 16 W elements in 16 registers (ra4..ra19).  So, ra4 always holds W[0], ra5 always holds W[1], etc.  In addition to keeping W in registers, we will also write each element into the VPM (again as a queue) as we generate it.  In this way, the compression loop can simply read the VPM as a queue and no changes are needed.

As we described, since we only need the previous 16 elements in order to generate the next one, we can think of it as a circular queue.  For example, we place what would be W[16] into W[0].  After 16 of these, we have a new set of W’s and we can run the next 16 iterations of the compression loop.
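
In C terms, the circular variant of the schedule loop would look something like this sketch (not the actual reference code; smsigma0 and smsigma1 as defined above).  Since (i-16) & 15 == i & 15, each new word lands exactly in the slot it just consumed as W[i-16].  The QPU code below gets the same effect by renaming registers in each expansion instead of masking an index.

#include <stdint.h>

/* a sketch of the circular schedule queue: only 16 words are live;
   smsigma0/smsigma1 as sketched above */
void genschedule_circular(uint32_t W[16])
{
    /* W[0..15] hold the message block on entry */
    for (int i = 16; i < 64; i++)
        W[i & 15] = smsigma1(W[(i - 2) & 15]) + W[(i - 7) & 15]
                  + smsigma0(W[(i - 15) & 15]) + W[(i - 16) & 15];
}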

The code to do this unrolling makes use of a couple m4 macros:

define(`GENSCHEDULE',
`
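    ## Assumed: rb6, rb7 and rb8 were loaded earlier with 18, 19 and 17
    ## (rotate counts above 15 will not fit in a small immediate)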
    add rb32, $1, 0;                nop         # r0 = W_i-16
    ror rb33, $2, 7;                nop         # r1 = RotR(x, 7)
    ror rb34, $2, rb6;              nop         # r2 = RotR(x, 18)
    shr rb35, $2, 3;                nop         # r3 = x >> 3;
    xor rb33, r1, r2;               nop         # r1 = r1 ^ r2
    xor rb35, r1, r3;               nop         # r3 = r1 ^ r3
    add rb32, r0, r3;               nop         # r0 += r3          (W_i-16 + smsigma0(W_i-15))
    add rb32, r0, $3;               nop         # r0 += W_i-7
    ror rb33, $4, rb8;              nop         # r1 = RotR(x, 17)
    ror rb34, $4, rb7;              nop         # r2 = RotR(x, 19)
    xor rb33, r1, r2;               nop         # r1 = r1 ^ r2
    shr rb34, $4, 10;               nop         # r2 = x >> 10
    xor rb33, r1, r2;               nop         # r1 = r1 ^ r2
    add $1, r0, r1;                 nop         # $1 = r0 + smsigma1(W_i-2)  (the new W word)
    add rb48, r0, r1;               nop         # write the same value into the VPM queue
    ## $2 ignored, $3 ignored, $4 ignored, $1 ignored (suppress warnings)')
define(`GENSCHEDULE_ALL',
`
GENSCHEDULE(`ra4', `ra5', `ra13', `ra18')
GENSCHEDULE(`ra5', `ra6', `ra14', `ra19')
GENSCHEDULE(`ra6', `ra7', `ra15', `ra4')
GENSCHEDULE(`ra7', `ra8', `ra16', `ra5')
GENSCHEDULE(`ra8', `ra9', `ra17', `ra6')
GENSCHEDULE(`ra9', `ra10', `ra18', `ra7')
GENSCHEDULE(`ra10', `ra11', `ra19', `ra8')
GENSCHEDULE(`ra11', `ra12', `ra4', `ra9')
GENSCHEDULE(`ra12', `ra13', `ra5', `ra10')
GENSCHEDULE(`ra13', `ra14', `ra6', `ra11')
GENSCHEDULE(`ra14', `ra15', `ra7', `ra12')
GENSCHEDULE(`ra15', `ra16', `ra8', `ra13')
GENSCHEDULE(`ra16', `ra17', `ra9', `ra14')
GENSCHEDULE(`ra17', `ra18', `ra10', `ra15')
GENSCHEDULE(`ra18', `ra19', `ra11', `ra16')
GENSCHEDULE(`ra19', `ra4', `ra12', `ra17')')

Macros are good for precisely this sort of unrolling.  m4 is pretty powerful and you could certainly write a macro that unrolls the loop for you, but to keep it simple and clear, we unrolled the loop manually.  If we take just the first GENSCHEDULE expansion, it says: use ra4, ra5, ra13 and ra18, and put the result back in ra4 as well as writing it to the VPM queue.  (That’s the ‘add rb48,’ line).  The C code again (see above) is:

  W[i] = smsigma1(W[i-2]) + W[i-7] + smsigma0(W[i-15]) + W[i-16];

Let i = 16 (the first iteration of the loop).  Then W[0] (i-16) is ra4, W[1] (i-15) is ra5, W[9] (i-7) is ra13 and W[14] (i-2) is ra18.  And because it’s a circular buffer, we put the result back in ra4 as well as write the first column of the VPM.  Hopefully that makes sense.

This concept of using registers and unrolling loops is an important technique for GPU programming.  Instead of thinking of the registers as temporary locations for intermediate results or constants, try thinking of them as a general random-access memory and consider unrolling to create that effect.  Consider that the VPM address space is only 4 KB (64 rows and 16 columns of 4-byte words) and that’s shared by all the QPUs.  Now note that the register address space is the same size (64 registers of 16 words each – 32 in the A file and 32 in the B file – also 4 KB) and it is private to each QPU.  Obviously there are times when this won’t work.  If you have a truly random access pattern, unrolling won’t be able to help, but you’ll find in many cases it is applicable, and when it is, it is often the most efficient way to operate.

We will see later in the series where this approach breaks down and will reconsider it then, but part of the point of showing this intermediate step is to illustrate the use of registers and loop unrolling.

Hacking The GPU For Fun And Profit (Pt. 3)

In the last part of the series, we wrote a reference implementation of the parallel SHA-256 algorithm.  In this installment, we’ll switch back to GPU code and write a partial implementation of just the compression loop (leaving schedule generation for the next part).

Step Three: Begin GPU Implementation

Fetch and build the code to follow along:

cd QPU/SHA-256/partial
make

To run the code, you will need to create the char_dev device node as described in the first part.  (Or run the code from a directory with such a file).

Let’s look at the host-side changes first.  sha256.cpp should look pretty familiar from the last section.  We’ve added an optional command-line parameter ‘-qpu’ to instruct the program to run the GPU implementation and some calls to functions in qpufuncs.cpp to interact with the QPUs.  qpufuncs.cpp defines a few functions – SHA256SetupQPU, SHA256CleanupQPU, SHA256ExecuteQPU, SHA256FetchResult and loadQPUCode.  If you’ve read the first article in this series, it should make sense.  The only significant change from the reference code is that we only run the main loop 16 times so that we only work with the initial data and we don’t have to worry about schedule generation (we do compute the schedule vectors on the CPU but they are not used).  There is a comment to this effect in the code.

So now let’s look at the QPU code in partial.asm.  The first thing to notice is on the very first line:

define(`NOP', `nop ra39, ra39, ra39;  nop rb39, rb39, rb39')

This is an m4 macro, which we’ll use to make our little assembler more useful and save us lots of typing.  m4 comes pre-installed on nearly all Linux systems (strictly, what you’ll find there is GNU m4 – the m4 language itself is an old Unix tool).  You can use the info pages (‘info m4’) or the internet to learn the syntax, but this line is just a simple substitution, and most of what we’ll use it for is no more than simple substitution with arguments.

From now on, we’ll assemble the code after running it through m4:

m4 partial.asm | qpu-assembler -o sha256.bin

The general pattern of reading uniforms, using the VCD to move data into VPM memory, reading the data, writing it back into the VPM and then using the VCD to transfer the output to the host should be familiar from the first part.

We’re hard-coding the VPM setup registers to use the first 16 rows of the VPM address space for our temporary storage.  We’ll have to change that when we scale the program beyond 1 QPU.  We transfer the H vectors first, read them into registers, then transfer the data vectors into this section.

## Read the H vectors into registers ra20..ra27 (these are the a..h)
## Also copy them into rb20..rb27 (we need the original values to write back)
or ra20, ra48, 0;           v8max rb20, ra48, ra48;

Notice how we use both the multiply and the add pipeline to write into two registers in one instruction (this only works if they are in different register files – A and B).  The assembler is somewhat low-level so we’re seeing the actual opcodes that will be generated.  v8max takes the per-byte maximum of the 8-bit values, and since the max of two identical values is exactly that value, this works as a mov.  One could also use v8min and ‘and’ for mov operations.  Note that even though ra48 is referenced 3 times, it is only read once and the value is used in 3 places.  Note also that you can only read the A or B register file once in an instruction.  This places significant restrictions on what operations can be combined into a single instruction.  It is also why it is recommended to use the accumulator registers (r0..r3 for general-purpose operation) whenever possible.

    or rb56, ra31, 0;       nop
    nop.tmu ra39, ra39, ra39;   nop
    add ra31, ra31, 4;      nop
    add rb32, r4, ra27;     nop

Here we see how to use the texture fetch units to read a single word from a texture.  We write the address into the TMU0_S register (rb56), set the .tmu flag on an instruction and read r4 to get the result.  The nop in this code serves no purpose other than to carry the .tmu flag.  Note that we can’t put the .tmu flag on the following add (which increments the address for the next time through the loop) because a small immediate (the 4 in this instruction) uses the signaling bits.  Clearly this is not an optimized piece of code, but premature optimization is the root of all evil and we want to wait until we can measure the performance improvement before making optimizations.

The rest of the code is a somewhat boring translation of the SHA-256 compression loop into the fundamental operations.  Again, the code is not completely optimized and we see a couple of mov’s that could/should be combined with other arithmetic instructions.

Run the code with and without the ‘-qpu’ flag and see that you get the same answer from the CPU version and the GPU version.  In the next installment, we’ll look at the schedule generation where things are a bit more interesting.

Hacking The GPU For Fun And Profit (Pt. 2)

In part 1, we wrote a simple “Hello World” QPU program and showed how to run it.  In this installment, we’ll look at the algorithm we’re going to implement on the GPU.

Step Two: Write a Reference Implementation

The NIST description of SHA-256 provides a very good and readable description of the algorithm.  Additionally, the Wikipedia article on the SHA-2 family of algorithms is another good resource and includes pseudo code for calculating the hash.

You will find the reference implementation under the directory QPU/SHA-256/reference:

cd QPU/SHA-256/reference
make

Run the program with the test data:

./sha256 test-data.bin

Each line of the file passed in is treated as a separate block to be hashed and the code computes 16 hashes, one per line.  Verify that the reference implementation is more or less correct by comparing to the OS sha256sum program:

$ head -1 test-data.bin | sha256sum
daa9917f579255777c333304246501e9228bfb29e6be2ec9caa935ab986b5ddb  -
$ tail -1 test-data.bin | sha256sum
803ca2499221865999d6ac101cf000a2cb5eca394cef0747b27ad02984d33e44  -

This should match the first and last lines of the output of the reference implementation.  In this reference implementation, we are only concerned with making sure the algorithm is correct so we only handle one block (512 bits = 64 bytes).  It would be straightforward to extend the implementation to handle multiple blocks using the stride parameter but we’ll only concern ourselves with a single block for the rest of this series.

Step Three: Make Observations and Map the Algorithm

The code is a pretty straightforward implementation but we can make a few observations and comments.  As explained in the NIST description, the SHA-256 algorithm basically consists of two parts – message schedule generation and message compression.  Message compression consists of simple bitwise operations like and, xor, shifts and rotates.  Conveniently, the QPUs have a ‘ror’ (rotate right) instruction so we don’t have to do the two shifts and an or like we do in the reference implementation.  The access pattern in the compression phase is also just a simple stream.  Each iteration of compression reads the next K word as well as the schedule word and combines them with various bit/arithmetic operations.  So once we have the schedule vectors, compression should translate pretty easily to the QPU.  Finally, as we’ve noted before, all the operations in SHA-256 are bitwise or addition operations which are done in the QPU by the add pipeline.  The multiply pipeline will be mostly idle.  Effectively, the absolute, theoretical maximum QPU performance we can hope to achieve is 12 GFLOPs (half the 24 GFLOPs total).  SHA-256 is not a great algorithm for showing off the GPU but it is interesting and illustrative for optimization purposes.
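
To make that concrete, here is roughly what one iteration of the compression loop does in C.  This is a sketch of the standard FIPS 180-4 round, not a copy of the reference source; notice that every operation in it is a rotate, a shift, an xor, an and, or a 32-bit add – exactly the add pipeline’s repertoire.

#include <stdint.h>

static inline uint32_t rotr(uint32_t x, int n) {
    return (x >> n) | (x << (32 - n));
}

/* one compression round; s[0..7] is the working state a..h,
   k the round constant, w the schedule word */
static void sha256_round(uint32_t s[8], uint32_t k, uint32_t w)
{
    uint32_t a=s[0], b=s[1], c=s[2], d=s[3], e=s[4], f=s[5], g=s[6], h=s[7];
    uint32_t t1 = h + (rotr(e, 6) ^ rotr(e, 11) ^ rotr(e, 25))   /* Sigma1 */
                    + ((e & f) ^ (~e & g))                       /* Ch     */
                    + k + w;
    uint32_t t2 = (rotr(a, 2) ^ rotr(a, 13) ^ rotr(a, 22))       /* Sigma0 */
                    + ((a & b) ^ (a & c) ^ (b & c));             /* Maj    */
    s[7]=g; s[6]=f; s[5]=e; s[4]=d+t1; s[3]=c; s[2]=b; s[1]=a; s[0]=t1+t2;
}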

So what about message schedule generation?  At first glance, this one looks a bit trickier.  The way the pseudo code is written, it looks like we take the 16 words (64-bytes = 512-bit block) of input data and expand them into 64 words (the first 16 are the same) and then go through the compression loop 64 times.  The main point to observe here is that the schedule generation and compression are fairly independent so long as we generate the schedule vector before we need to use it in the compression loop.  The second point to note is that we only need to store the last 16 words in order to generate the next word of the schedule.  Essentially, we can think of it as a queue and if we make it a circular queue new entries can overwrite old entries without requiring storage for 64 words.  The access pattern here is not random but neither is it a simple stream.  We’ll try a couple ways to handle this and see which is fastest.  (Optimization is often like this).

Finally, as hinted by the fact that we wrote the reference implementation to compute 16 hashes at a time, our QPU program will take advantage of the fact that the QPUs natively operate on 16-wide vectors and compute 16 hashes in parallel.

Let’s look at what data we’ll need and how we’ll move it around.  The input data for each QPU will be a block of 16×16 words (1 KB) for the 16 hashes of 64 bytes each.  This maps nicely to the VPM, where we can fit 4 such blocks in the VPM space.  We will want to read these in 16-wide vectors, which is exactly how the VPM works, so there’s no question we’ll use the VPM for this.
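
As a sketch of what that layout means (the exact orientation is a host-side choice in this code, so treat it as an assumption): row i of the block holds word i of all 16 messages, so a single 16-wide VPM read yields W[i] for every hash at once.

#include <stdint.h>

/* hypothetical packing sketch; data/stride follow the reference
   code's data[k*stride + i] indexing, k = message, i = word */
void pack_block(uint32_t vpm_image[16][16],
                const uint32_t *data, unsigned stride)
{
    for (int i = 0; i < 16; i++)        /* word index within the block */
        for (int k = 0; k < 16; k++)    /* which of the 16 messages    */
            vpm_image[i][k] = data[k * stride + i];
}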

We also need the 64 words of data for the K array of constants.  We will access this data sequentially in a streaming fashion.  Also notice that we want to “broadcast” each word across the vector when we read it.  (That is, we want to read one word but have it replicated in the 16 elements of a vector).  This is exactly how the texture fetch units work so we’ll use a texture for the K array.  All we need to pass in for that is the base address.

Next we have the H vectors (256 bits per hash, 512 bytes total) which hold the intermediate, accumulating result between compression loops.  They start with a constant value as described in the specification of the algorithm and at the end of the compression loop, they contain the final hash that we want.  We’ll transfer this as an input/output parameter.  That is to say, we’ll write the final result back into the same memory location.  We’ll use the VPM for this as well, of course.  The CPU will initialize the H vectors to the proper constant value before calling the QPU program.  The CPU will also be responsible for all the padding.  Other than that, the whole algorithm will run on the QPUs.
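
Those initial values are the standard SHA-256 constants from the specification.  A host-side initialization might look like the sketch below; the 16-wide replication (one lane per hash, 8 × 16 × 4 = 512 bytes) and the Hvec name are illustrative assumptions, not the actual driver code.

#include <stdint.h>

/* SHA-256 initial hash values (FIPS 180-4) */
static const uint32_t H_INIT[8] = {
    0x6a09e667, 0xbb67ae85, 0x3c6ef372, 0xa54ff53a,
    0x510e527f, 0x9b05688c, 0x1f83d9ab, 0x5be0cd19
};

/* replicate each H word across the 16 lanes, one lane per hash */
void init_h(uint32_t Hvec[8][16])
{
    for (int i = 0; i < 8; i++)
        for (int lane = 0; lane < 16; lane++)
            Hvec[i][lane] = H_INIT[i];
}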

Take some time to familiarize yourself with the reference implementation (perhaps try optimizing the CPU version) and start thinking about different ways to go about implementing it in the QPUs.  We’ll continue the series next time with a partial implementation.

Hacking The GPU For Fun And Profit (Pt. 1)

Introduction

Recently (relatively), Broadcom, the manufacturer of the SoC used in the Raspberry Pi, decided to release documentation for the Raspberry Pi GPU.  Having some background in GPGPU programming (I was doing GPGPU programming before OpenCL and CUDA), this announcement piqued my interest and I began to take a look at what was possible and how difficult it would be to take advantage of the GPU processing power for other purposes.

While there are certainly limitations (some significant) and it can’t compete against high-end smartphone and tablet GPUs, it can definitely hold its own in the mid-range.  But, most importantly for hobbyists like myself, the price and power requirements are much more attractive than some other solutions.  The Raspberry Pi GPU has a theoretical maximum processing power of 24 GFLOPs.  (It’s important to note that you can never get to the theoretical max but as all vendors always quote their theoretical max, it’s not unreasonable to use it for very, very rough comparisons).  At $25 for a Model A drawing 2.5W, that’s about $1/GFLOP or 10 GFLOPs/W, which is quite respectable.  (To put it in perspective, you can buy a 192-core, 365 GFLOPs Tegra K1 board for $192 or a little over $0.5/GFLOP.  On the other hand, you can also buy the 16-core, 26 GFLOPs Parallella board for $99 or about $3.8/GFLOP).

Background

This series of posts will walk through the GPU architecture by designing and optimizing a non-trivial algorithm (parallel SHA-256) to run on the GPU.  Because this is a relatively advanced topic, some background knowledge must be assumed.  Perhaps later I will write a more general GPGPU tutorial but this article will turn into a book if I don’t focus it.  You should have some understanding of GPGPU concepts (Google can help) and you should read the Broadcom BCM2835 GPU Documentation.  I’ll describe most Raspberry Pi GPU specific concepts but in case I miss some, the documentation is your friend.

Required Hardware and Software

You will need:

  1. Raspberry Pi (Model A or B).  I have only tested on a Model B but I have no reason to believe it won’t work on a Model A.
  2. The code to follow along can be cloned from github:
    git clone https://github.com/elorimer/rpi-playground

Step Zero: Build An Assembler

Herman Hermitage has done some excellent reverse engineering work (before the documentation was released) and has written a QPU (the name of a “core” in the GPU) assembler that you can get from github:

git clone https://github.com/hermanhermitage/videocoreiv-qpu

Unfortunately, I was unaware of this when I began, so I wrote my own and, for better or for worse, that’s what the code is written in.  The assembler is pretty rough, has lots of quirks and bugs and supports only what I needed to implement this algorithm, but it has the advantage (only to me, I suppose) that I know exactly how it works and what code it will produce.  If you’d prefer to use Herman’s assembler which is probably more sane and friendly, you can assemble the code with mine, then disassemble it with Herman’s disassembler which (should?) allow you to reassemble it with his assembler.

In addition, when we get to the loop unrolling and register substitutions, we’ll start using m4 macros in the assembly source.  m4 is pretty simple while being quite powerful, it is easy to read and write, and it comes installed on pretty much all Linux systems.  Instead of trying to introduce all the syntax for the assembler and the macros right here (and risk losing all my readers), I’ll try to introduce the syntax as we need it.

Writing an assembler is a fairly boring affair so you can skip that and take one that’s already written.  If you’re of a more masochistic bent and feel like two assemblers are not nearly enough, feel free to write your own.  It’s not a bad learning exercise at all when you have to understand what bits actually go into each instruction.

Clone the rpi-playground repository above and then build the assembler:

cd QPU/assembler
make

You should find a qpu-assembler binary in that directory.  This is the assembler we’ll be using throughout the rest of this series.

Step One: Hello, World

cd QPU/helloworld

The first thing to do when trying out a new language or architecture is to write the obligatory “Hello World”  program that does the bare minimum to verify that something is working end-to-end.  For languages on the host, this is usually a trivial, 30-second exercise that is not worth mentioning.

In the case of GPGPU programming (especially without a host library like OpenCL), this can be much more involved as there may be a chunk of host-side code as well that needs to be written to initialize the GPU, map memory, configure the parameters, etc …  In the case of our QPUs, it’s even worse because to actually see anything come back from the QPU, we have to dive into the VPM and VCD and DMA-ing things back to the host.  Oh well, such is life.  Let’s get started.

First things first, let’s define what “Hello World” means in our context.  How about a QPU program that takes a single input value, adds some hard-coded constant to it, and returns it back to us?  We will start with only 1 QPU so we don’t need to worry about synchronization but this is enough to verify that we have two-way communication and we’re really running our code on the QPU.  There are two sides to the problem.

GPU Side

You did read the documentation mentioned above, didn’t you?  (This is the part of the article where you really want to do that).  There are two halves to the problem:

Input

There are a few ways to transfer data to the QPUs and if you are familiar with GPGPU programming, these should be familiar.  If not, a general reference on GPGPU programming might be a good idea.  The first is in uniforms.  These are analogous to function call arguments.  Alternatively, you might think of them as constants from the QPU point of view, only changing between program invocations (initiated by the host).  The QPU gets them in a queue (i.e. in order) one at a time.  They are single word (32-bit) values and there is a limit to how many you can pass in, but otherwise, they are the most convenient for this sort of data that is constant throughout one run.

The second is in textures.  These are also single word values but they are not limited in how many you can use or in the access pattern (i.e. they do not have to be accessed in order).

The third is using the VPM.  This is a block of memory that is shared by all the QPUs and transfers occur explicitly in blocks of vectors.  We will talk much more about that later.

For now, we will use a uniform to transfer our value from the host to the QPU.

Output

For output, there’s really only one way and that is to initiate a DMA transfer from the QPU VPM space to an address in the host’s memory.  This implies two things.  First, we’ll need another uniform to pass in the address to the block of memory where we want to write the result value back and second, we’ll need to write the value to the VPM first before we can DMA from the VPM to the host.  So even for “Hello World” we need to understand the VPM and VCD.

Code

All right.  Enough talk.  Let’s see some code (helloworld.asm):

# Load the value we want to add to the input into a register
ldi ra1, 0x1234

# Configure the VPM for writing
ldi rb49, 0xa00

You’re following along in your GPU documentation, right?  You can find the QPU register address map on pages 37-38 where it shows writing to rb49 is “VPMVCD_WR_SETUP”.  The VPM interface is documented in section 7 and table 32 describes the format for this register.  0xa00 means to write with no stride, starting at VPM location (0, 0) with horizontal format and 32-bit elements.
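
If the hex constant bothers you, it decomposes into fields like this.  This is a sketch based on my reading of Table 32 (the generic block write setup), so double-check the bit positions against your copy of the documentation:

#include <stdint.h>

/* pack a VPM generic block write setup word (per Table 32);
   size: 0 = 8-bit, 1 = 16-bit, 2 = 32-bit elements */
static uint32_t vpm_wr_setup(uint32_t stride, int horiz,
                             uint32_t size, uint32_t addr)
{
    return (stride << 12) | ((uint32_t)horiz << 11)
         | (size << 8) | addr;
}

/* vpm_wr_setup(0, 1, 2, 0) == 0xa00: no stride, horizontal,
   32-bit elements, starting at VPM location (0, 0) */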

# Add the input value (from the first uniform) and the
# hard-coded constant into the VPM we just set up
add rb48, ra1, rb32;      nop

From the GPU documentation, we discover rb48 is the location we write to in order to write into the VPM the way we just configured it.  Again, from the GPU documentation, we find that rb32/ra32 is the address we read from to fetch the uniform values in order.

As you can tell, the assembler is fairly low-level and leaves things like register names untranslated.  A friendlier assembler would have aliases for these so we don’t have to remember them or look them up all the time.  That will probably come in a later version of the assembler but for now, think of it as good practice for understanding the QPUs at a lower level.

We will also note that we’re encoding two operations per line.  If you’ve been following in the GPU documentation (OK, last time, I promise – from here on, I’m assuming you’ve read it and I won’t explain what each register does), you know that this is because the QPU is a dual-issue architecture with two separate pipelines – the add and the multiply pipeline.  We have no useful work for the multiply pipeline to do, but we have to put something there so it gets a no-op.  The SHA-256 algorithm we’ll be developing hardly uses the multiply pipeline so we’ll soon get used to just ignoring these no-ops.

Also note the intervening instruction between the load into ra1 and the read from ra1: a value written to a register file cannot be read back in the very next instruction, so something has to sit between the write and the read.

# configure VCD to transfer 16 words (1 vector) to the host
ldi rb49, 0x88010000

# initiate a DMA transfer
or rb50, ra32, 0;         nop

# wait for the DMA transfer to finish
or rb39, rb50, ra39;      nop

# signal the host that we're done
or rb38, ra39, ra39;      nop

# The end
nop.tend ra39, ra39, ra39;  nop rb39, rb39, rb39
nop ra39, ra39, ra39;       nop rb39, rb39, rb39
nop ra39, ra39, ra39;       nop rb39, rb39, rb39

Notice that there is no ‘mov’ instruction as is usual in other assembly languages.  Instead, register-register moves are done with low-level instructions like ‘and’ or ‘or’: OR-ing a register with zero yields the same value.  The zero in this case is called an immediate, which means it occurs in the instruction stream along with the instruction itself.  Immediates take up the B register slot in the instruction, so if you need to move something from a B register you can use ‘and <dest>, rbX, rbX’.  There are other “tricks” to use the multiply pipe to perform a mov as well.

As another aside, the rX39 register is essentially a no-op when we don’t care what the value is for reading or writing.  Think of it like /dev/null.  Perhaps it is obvious but note that reading rX39 does not result in a 0 but is rather undefined and you’ll get garbage reading from it.  Never use rX39 as a read register unless you really don’t want to do anything with the result (e.g. it’s a nop operation or we just want the side effect as in the case of reading the VCD busy register).

Assemble it.  Assuming the qpu-assembler binary is in your path and the assembly file above is in a file named helloworld.asm:

qpu-assembler -o helloworld.bin < helloworld.asm

Host Side

On the host side, we also have some work to do.  Crack open the driver source code, driver.c and let’s take a look.

#include "mailbox.h"

For this example we are going to borrow the code from the GPU FFT example under /opt/vc/src/hello_pi/hello_fft which is hard-coded in the Makefile.  This includes a few functions to enable the QPU, allocate and map memory, and execute the QPU programs.

struct memory_map {
    unsigned int code[MAX_CODE_SIZE];
    unsigned int uniforms[NUM_QPUS][2];
    unsigned int msg[NUM_QPUS][2];
    unsigned int results[NUM_QPUS][16];
};

This defines the memory layout we’ll use and share between the host and GPU.

Initialize the mailbox interface and send a message to enable the QPU.  (This will also make the address space visible to the host.  You can then mmap /dev/mem and control the QPUs using the MMIO registers described in the GPU documentation):

    int mb = mbox_open();
    if (qpu_enable(mb, 1)) {
        fprintf(stderr, "QPU enable failed.\n");
        return -1;
    }

Allocate, lock and map some GPU memory using functions from mailbox.c.

   unsigned handle = mem_alloc(mb, size, 4096, GPU_MEM_FLG);
...
   unsigned ptr = mem_lock(mb, handle);
...
   void *arm_ptr = mapmem(ptr + GPU_MEM_MAP, size);

Now we have two addresses to refer to the memory.  ‘ptr’ is the GPU (VC) address that the GPU understands.  When passing pointers (such as the return address for the buffer to write the results into or the address of the code and uniforms for the QPU), we need to use VC addresses.  When accessing the memory from the host (for initializing the values in the first place and reading the result buffer), we have to use the mapped address which is a valid host address.

Next we do some pointer arithmetic to set the structure fields to point to the proper VC addresses.  To execute a QPU program through the mailbox interface, we pass an array of message structures that contain a pointer to the uniforms to bind to the QPU program and then a pointer to the address of the QPU code to execute:

   unsigned ret = execute_qpu(mb, NUM_QPUS, vc_msg, 1, 10000);
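
Concretely, the pointer arithmetic mentioned above might look something like the sketch below.  The field names come from the memory_map struct, but the uniform order and offsets here are illustrative assumptions, not a copy of driver.c.

#include <stddef.h>

/* sketch only: 'ptr' is the VC base address of the block, 'map' the
   host-side mapping of the same memory, 'value' the command-line input */
static void setup_messages(struct memory_map *map, unsigned ptr,
                           unsigned value)
{
    for (int q = 0; q < NUM_QPUS; q++) {
        /* uniform 0: the input value; uniform 1: VC address of the results */
        map->uniforms[q][0] = value;
        map->uniforms[q][1] = ptr + offsetof(struct memory_map, results)
                                  + q * 16 * sizeof(unsigned);
        /* per-QPU message: VC address of its uniforms, then of the code */
        map->msg[q][0] = ptr + offsetof(struct memory_map, uniforms)
                             + q * 2 * sizeof(unsigned);
        map->msg[q][1] = ptr + offsetof(struct memory_map, code);
    }
    /* the vc_msg passed to execute_qpu is then
       ptr + offsetof(struct memory_map, msg) */
}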

The rest of the code displays the results and releases the resources we used.

Build and Run It

First, we need to create a character device for communicating through the mailbox with the GPU:

sudo mknod char_dev c 100 0

Build and run it like so:

make
sudo ./helloworld helloworld.bin 100

This will use the GPU with the program above to add the value 100 (decimal) passed in on the command-line to the hard-coded constant 0x1234 (4660 decimal) in our QPU program above and should return the result 0x1298 (4760 decimal):

Loaded 80 bytes of code from helloworld.bin ...
QPU enabled.
Uniform value = 100
QPU 0, word 0: 0x00001298
QPU 0, word 1: 0x00001298
QPU 0, word 2: 0x00001298
QPU 0, word 3: 0x00001298
QPU 0, word 4: 0x00001298
QPU 0, word 5: 0x00001298
QPU 0, word 6: 0x00001298
QPU 0, word 7: 0x00001298
QPU 0, word 8: 0x00001298
QPU 0, word 9: 0x00001298
QPU 0, word 10: 0x00001298
QPU 0, word 11: 0x00001298
QPU 0, word 12: 0x00001298
QPU 0, word 13: 0x00001298
QPU 0, word 14: 0x00001298
QPU 0, word 15: 0x00001298
Cleaning up.
Done.

Success!  The value is replicated 16 times because the QPUs natively operate on 16-word vectors (4-wide multiplexed 4 times).  When we load an immediate, we actually load a 16-element register with the same value in all 16 elements.  Same with the uniform.  Then the add instruction adds two 16-word vectors to produce another 16-word vector, which gets stored in the VPM and then transferred back to the host.  We didn’t have to transfer the whole vector to the host; the DMA engine would allow us to transfer only part of it, but this shows how the QPU operates on data in parallel.

Congratulations!  You’ve written and run your first GPU program on the Raspberry Pi.  Admittedly, this one isn’t too interesting but it’s an important step to building something useful and efficient.  Next time we’ll start looking at the SHA-256 algorithm.

NOTE: Occasionally, the GPU can get into weird states where the programs do not return the expected values.  I have not determined if it is a bug in the host-side code for managing the GPU or if there is something else going on.  In any case, if you see things you don’t expect, try rebooting the machine.