Hacking The GPU For Fun And Profit (Pt. 1)

Introduction

Relatively recently, Broadcom, the manufacturer of the SoC used in the Raspberry Pi, released documentation for the Raspberry Pi GPU.  I have some background in GPGPU programming (I was doing it before OpenCL and CUDA), so this announcement piqued my interest and I began to look at what was possible and how difficult it would be to take advantage of the GPU processing power for other purposes.

While there are certainly limitations (some significant) and it can’t compete against high-end smartphone and tablet GPUs, it can definitely hold its own in the mid-range.  But, most importantly for hobbyists like myself, the price and power requirements are much more attractive than some other solutions.  The Raspberry Pi GPU has a theoretical maximum processing power of 24 GFLOPs.  (It’s important to note that you can never reach the theoretical max, but since vendors always quote theirs, it’s not unreasonable to use it for very, very rough comparisons.)  At $25 for a Model A at 2.5W, that’s about $1/GFLOP or 10 GFLOPs/W, which is quite respectable.  (To put it in perspective, you can buy a 192-core, 365 GFLOPs Tegra K1 board for $192 or a little over $0.5/GFLOP.  On the other hand, you can also buy the 16-core, 26 GFLOPs Parallella board for $99 or about $3.8/GFLOP.)
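For anyone who wants to redo the back-of-the-envelope numbers, here is a minimal C sketch of the arithmetic.  The prices, wattage and GFLOPs figures are the ones quoted above; the function names are made up for illustration.

```c
#include <stdio.h>

/* Dollars per theoretical GFLOP for a board. */
static double dollars_per_gflop(double price_usd, double peak_gflops)
{
    return price_usd / peak_gflops;
}

/* Theoretical GFLOPs per watt. */
static double gflops_per_watt(double peak_gflops, double watts)
{
    return peak_gflops / watts;
}

/* Print the rough comparison using the figures quoted in the text. */
static void print_comparison(void)
{
    printf("Raspberry Pi Model A: $%.2f/GFLOP, %.1f GFLOPs/W\n",
           dollars_per_gflop(25.0, 24.0), gflops_per_watt(24.0, 2.5));
    printf("Tegra K1 board:       $%.2f/GFLOP\n",
           dollars_per_gflop(192.0, 365.0));
    printf("Parallella board:     $%.2f/GFLOP\n",
           dollars_per_gflop(99.0, 26.0));
}
```

Calling print_comparison() from a main() prints the three ratios.  Remember these are theoretical peaks, so treat the output as a very rough ranking and nothing more.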

Background

This series of posts will walk through the GPU architecture by designing and optimizing a non-trivial algorithm (parallel SHA-256) to run on the GPU.  Because this is a relatively advanced topic, some background knowledge must be assumed.  Perhaps later I will write a more general GPGPU tutorial but this article will turn into a book if I don’t focus it.  You should have some understanding of GPGPU concepts (Google can help) and you should read the Broadcom BCM2835 GPU Documentation.  I’ll describe most Raspberry Pi GPU specific concepts but in case I miss some, the documentation is your friend.

Required Hardware and Software

You will need:

  1. Raspberry Pi (Model A or B).  I have only tested on a Model B but I have no reason to believe it shouldn’t work on a Model A.
  2. The code to follow along can be cloned from github:
    git clone https://github.com/elorimer/rpi-playground

Step Zero: Build An Assembler

Herman Hermitage has done some excellent reverse engineering work (before the documentation was released) and has written a QPU (the name of a “core” in the GPU) assembler that you can get from github:

git clone https://github.com/hermanhermitage/videocoreiv-qpu

Unfortunately, I was unaware of this when I began, so I wrote my own and, for better or for worse, that’s what the code is written in.  The assembler is pretty rough, has lots of quirks and bugs and supports only what I needed to implement this algorithm, but it has the advantage (only to me, I suppose) that I know exactly how it works and what code it will produce.  If you’d prefer to use Herman’s assembler which is probably more sane and friendly, you can assemble the code with mine, then disassemble it with Herman’s disassembler which (should?) allow you to reassemble it with his assembler.

In addition, when we get to the loop unrolling and register substitutions, we’ll start using m4 macros in the assembly source.  M4 is pretty simple while being quite powerful, easy to read and write and it comes installed on pretty much all Linux systems.  Instead of trying to introduce all the syntax for the assembler and the macros right here (and risk losing all my readers), I’ll try to introduce the syntax as we need it.

Writing an assembler is a fairly boring affair so you can skip that and take one that’s already written.  If you’re of a more masochistic bent and feel like two assemblers are not nearly enough, feel free to write your own.  It’s not a bad learning exercise at all when you have to understand what bits actually go into each instruction.

Clone the rpi-playground repository above and then build the assembler:

cd QPU/assembler
make

You should find a qpu-assembler binary in that directory.  This is the assembler we’ll be using throughout the rest of this series.

Step One: Hello, World

cd QPU/helloworld

The first thing to do when trying out a new language or architecture is to write the obligatory “Hello World” program that does the bare minimum to verify that something is working end-to-end.  For languages on the host, this is usually a trivial, 30-second exercise that is not worth mentioning.

In the case of GPGPU programming (especially without a host library like OpenCL), this can be much more involved, as there may be a chunk of host-side code that also needs to be written to initialize the GPU, map memory, configure the parameters, etc.  In the case of our QPUs, it’s even worse because to actually see anything come back from the QPU, we have to dive into the VPM, the VCD and DMA-ing things back to the host.  Oh well, such is life.  Let’s get started.

First things first, let’s define what “Hello World” means in our context.  How about a QPU program that takes a single input value, adds some hard-coded constant to it, and returns it back to us?  We will start with only 1 QPU so we don’t need to worry about synchronization but this is enough to verify that we have two-way communication and we’re really running our code on the QPU.  There are two sides to the problem.

GPU Side

You did read the documentation mentioned above, didn’t you?  (This is the part of the article where you really want to do that).  There are two halves to the problem:

Input

There are a few ways to transfer data to the QPUs, and if you have done GPGPU programming before they should look familiar.  If not, a general reference on GPGPU programming might be a good idea.  The first is in uniforms.  These are analogous to function call arguments.  Alternatively, you might think of them as constants from the QPU point of view, only changing between program invocations (initiated by the host).  The QPU gets them in a queue (i.e. in order) one at a time.  They are single word (32-bit) values and there is a limit to how many you can pass in, but otherwise they are the most convenient option for data that is constant throughout one run.

The second is in textures.  These are also single word values but they are not limited in how many you can use or the access pattern (i.e. they do not have to be accessed in order).

The third is using the VPM.  This is a block of memory that is shared by all the QPUs and transfers occur explicitly in blocks of vectors.  We will talk much more about that later.

For now, we will use a uniform to transfer our value from the host to the QPU.
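If the “queue” behaviour of uniforms is hard to picture, the following host-side C model may help.  It is only a sketch of the in-order semantics described above, not of how the hardware implements them, and the struct and function names are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* Host-side model of a uniform stream: a flat array of 32-bit words
 * plus a cursor.  This only imitates the in-order behaviour; it says
 * nothing about how the hardware implements it. */
struct uniform_stream {
    const uint32_t *words;  /* values the host queued up */
    size_t count;           /* how many words were queued */
    size_t next;            /* index of the next unread word */
};

/* Each read consumes exactly one word, the way each read of the
 * uniform register does.  Returns 0 on success, or -1 once the model
 * has run out of queued words. */
static int uniform_read(struct uniform_stream *s, uint32_t *out)
{
    if (s->next >= s->count)
        return -1;
    *out = s->words[s->next++];
    return 0;
}
```

The point to take away is that there is no index: the first read yields the first word the host supplied, the second read yields the second, and so on.  A program that needs a uniform twice has to copy it into a register on the first read.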

Output

For output, there’s really only one way and that is to initiate a DMA transfer from the QPU VPM space to an address in the host’s memory.  This implies two things.  First, we’ll need another uniform to pass in the address to the block of memory where we want to write the result value back and second, we’ll need to write the value to the VPM first before we can DMA from the VPM to the host.  So even for “Hello World” we need to understand the VPM and VCD.

Code

All right.  Enough talk.  Let’s see some code (helloworld.asm):

# Load the value we want to add to the input into a register
ldi ra1, 0x1234

# Configure the VPM for writing
ldi rb49, 0xa00

You’re following along in your GPU documentation, right?  You can find the QPU register address map on pages 37-38 where it shows writing to rb49 is “VPMVCD_WR_SETUP”.  The VPM interface is documented in section 7 and table 32 describes the format for this register.  0xa00 means to write with no stride, starting at VPM location (0, 0) with horizontal format and 32-bit elements.
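To make 0xa00 a little less magic, here is a small C sketch that pulls the fields back out of a VPM write setup word.  The bit positions are one reading of table 32 and should be treated as an assumption to check against the documentation; the struct and function names are made up for illustration.

```c
#include <stdint.h>

/* Fields of the VPM write setup word.  The bit positions below are an
 * assumption based on table 32; verify them against the manual. */
struct vpm_write_setup {
    unsigned stride;      /* bits 17:12, stride field */
    unsigned horizontal;  /* bit 11, 1 = horizontal, 0 = vertical */
    unsigned laned;       /* bit 10, 1 = laned, 0 = packed */
    unsigned size;        /* bits 9:8, 0 = 8-bit, 1 = 16-bit, 2 = 32-bit */
    unsigned addr;        /* bits 7:0, starting VPM location */
};

/* Split a setup word into its fields. */
static struct vpm_write_setup decode_vpm_write_setup(uint32_t word)
{
    struct vpm_write_setup s;
    s.stride     = (word >> 12) & 0x3f;
    s.horizontal = (word >> 11) & 0x1;
    s.laned      = (word >> 10) & 0x1;
    s.size       = (word >> 8) & 0x3;
    s.addr       = word & 0xff;
    return s;
}
```

Under that layout, decode_vpm_write_setup(0xa00) gives horizontal = 1, size = 2 (32-bit elements), addr = 0 and a stride field of 0, which matches the description above.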

# Add the input value (from the first uniform) and the
# hard-coded constant into the VPM we just set up
add rb48, ra1, rb32;      nop

From the GPU documentation, we discover rb48 is the location to write to in order to write to the VPM the way we just configured it.  Again, from the GPU documentation, we find that rb32/ra32 is the address we read from to fetch the uniform values in order.

As you can tell, the assembler is fairly low-level and leaves things like register names untranslated.  A friendlier assembler would have aliases for these so we don’t have to remember them or look them up all the time.  That will probably come in a later version of the assembler but for now, think of it as good practice for understanding the QPUs at a lower level.

We will also note that we’re encoding two instructions per line. If you’ve been following in the GPU documentation (OK, last time, I promise – from here on, I’m assuming you’ve read it and I won’t explain what each register does), you know that this is because the QPU is a dual-issue architecture with two separate pipelines – the add and the multiply pipeline.  We have no useful work for the multiply pipeline to do, but we have to put something there so it gets a no-op.  The SHA-256 algorithm we’ll be developing hardly uses the multiply pipeline so we’ll soon get used to just ignoring these no-ops.

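Also note the intervening instruction between the load into ra1 and the read from ra1.  A register-file location written by one instruction cannot be read by the very next instruction, so something has to separate the two.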

# configure VCD to transfer 16 words (1 vector) to the host
ldi rb49, 0x88010000

# initiate a DMA transfer
or rb50, ra32, 0;         nop

# wait for the DMA transfer to finish
or rb39, rb50, ra39;      nop

# signal the host that we're done
or rb38, ra39, ra39;      nop

# The end
nop.tend ra39, ra39, ra39;  nop rb39, rb39, rb39
nop ra39, ra39, ra39;       nop rb39, rb39, rb39
nop ra39, ra39, ra39;       nop rb39, rb39, rb39

Notice that there is no ‘mov’ instruction as you might expect.  Instead, register-register moves are done with low-level instructions like ‘and’ or ‘or’.  OR-ing a register with zero yields the same value.  The zero in this case is called an immediate, which means it occurs in the instruction stream along with the instruction itself.  Immediates take up the B register slot in the instruction, so if you need to move something from a B register you can use ‘and <dest>, rbX, rbX’.  There are other “tricks” to use the multiply pipe to perform a mov as well.

As another aside, the rX39 register is essentially a no-op when we don’t care what the value is for reading or writing.  Think of it like /dev/null.  Perhaps it is obvious but note that reading rX39 does not result in a 0 but is rather undefined and you’ll get garbage reading from it.  Never use rX39 as a read register unless you really don’t want to do anything with the result (e.g. it’s a nop operation or we just want the side effect as in the case of reading the VCD busy register).

Assemble it.  Assuming the qpu-assembler binary is in your path and the assembly file above is in a file named helloworld.asm:

qpu-assembler -o helloworld.bin < helloworld.asm

Host Side

On the host side, we also have some work to do.  Crack open the driver source code, driver.c and let’s take a look.

#include "mailbox.h"

For this example we are going to borrow the code from the GPU FFT example under /opt/vc/src/hello_pi/hello_fft (this path is hard-coded in the Makefile).  This includes a few functions to enable the QPU, allocate and map memory, and execute the QPU programs.

struct memory_map {
    unsigned int code[MAX_CODE_SIZE];
    unsigned int uniforms[NUM_QPUS][2];
    unsigned int msg[NUM_QPUS][2];
    unsigned int results[NUM_QPUS][16];
};

This defines the memory layout we’ll use and share between the host and GPU.

Initialize the mailbox interface and send a message to enable the QPU.  (This will also make the address space visible to the host.  You can then mmap /dev/mem and control the QPUs using the MMIO registers described in the GPU documentation):

    int mb = mbox_open();
    if (qpu_enable(mb, 1)) {
        fprintf(stderr, "QPU enable failed.\n");
        return -1;
    }

Allocate, lock and map some GPU memory using functions from mailbox.c.

   unsigned handle = mem_alloc(mb, size, 4096, GPU_MEM_FLG);
...
   unsigned ptr = mem_lock(mb, handle);
...
   void *arm_ptr = mapmem(ptr + GPU_MEM_MAP, size);

Now we have two addresses to refer to the memory.  ‘ptr’ is the GPU (VC) address that the GPU understands.  When passing pointers (such as the return address for the buffer to write the results into or the address of the code and uniforms for the QPU), we need to use VC addresses.  When accessing the memory from the host (for initializing the values in the first place and reading the result buffer), we have to use the mapped address which is a valid host address.

Next we do some pointer arithmetic to set the structure fields to point to the proper VC addresses.  To execute a QPU program through the mailbox interface, we pass an array of message structures that contain a pointer to the uniforms to bind to the QPU program and then a pointer to the address of the QPU code to execute:

   unsigned ret = execute_qpu(mb, NUM_QPUS, vc_msg, 1, 10000);
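
The pointer arithmetic mentioned above might look roughly like the following C sketch.  It is not the code from driver.c: the values of NUM_QPUS and MAX_CODE_SIZE and the function name setup_messages are assumptions for illustration.  The struct layout, the order of the two uniforms and the order of the two message words follow the descriptions in this article.

```c
#include <stddef.h>
#include <stdint.h>

#define NUM_QPUS      1     /* assumption: one QPU for hello world */
#define MAX_CODE_SIZE 1024  /* assumption: the real value is in driver.c */

struct memory_map {
    unsigned int code[MAX_CODE_SIZE];
    unsigned int uniforms[NUM_QPUS][2];
    unsigned int msg[NUM_QPUS][2];
    unsigned int results[NUM_QPUS][16];
};

/* Fill in the uniforms and messages for every QPU.
 *   map     : host (mapped) view of the shared block
 *   vc_base : GPU (VC) address of the same block, e.g. from mem_lock()
 *   input   : value handed to each QPU as its first uniform
 * Returns the VC address of the message array, which is what gets
 * passed to execute_qpu() as vc_msg. */
static uint32_t setup_messages(struct memory_map *map, uint32_t vc_base,
                               uint32_t input)
{
    uint32_t vc_code     = vc_base + (uint32_t)offsetof(struct memory_map, code);
    uint32_t vc_uniforms = vc_base + (uint32_t)offsetof(struct memory_map, uniforms);
    uint32_t vc_msg      = vc_base + (uint32_t)offsetof(struct memory_map, msg);
    uint32_t vc_results  = vc_base + (uint32_t)offsetof(struct memory_map, results);

    for (uint32_t i = 0; i < NUM_QPUS; i++) {
        /* Uniform 0: the input value.  Uniform 1: where to DMA the result. */
        map->uniforms[i][0] = input;
        map->uniforms[i][1] = vc_results + i * (uint32_t)sizeof(map->results[0]);

        /* Message: pointer to this QPU's uniforms, then pointer to the code. */
        map->msg[i][0] = vc_uniforms + i * (uint32_t)sizeof(map->uniforms[0]);
        map->msg[i][1] = vc_code;
    }
    return vc_msg;
}
```

Note the split between the two views of the memory: every store goes through the host pointer (map), while every value stored that the GPU will follow as a pointer is computed from the VC base address.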

The rest of the code displays the results and releases the resources we used.

Build and Run It

First, we need to create a character device for communicating through the mailbox with the GPU:

sudo mknod char_dev c 100 0

Build and run it like so:

make
sudo ./helloworld helloworld.bin 100

This runs the QPU program above, which adds the value 100 (decimal) passed in on the command line to the hard-coded constant 0x1234 (4660 decimal), and should return the result 0x1298 (4760 decimal):

Loaded 80 bytes of code from helloworld.bin ...
QPU enabled.
Uniform value = 100
QPU 0, word 0: 0x00001298
QPU 0, word 1: 0x00001298
QPU 0, word 2: 0x00001298
QPU 0, word 3: 0x00001298
QPU 0, word 4: 0x00001298
QPU 0, word 5: 0x00001298
QPU 0, word 6: 0x00001298
QPU 0, word 7: 0x00001298
QPU 0, word 8: 0x00001298
QPU 0, word 9: 0x00001298
QPU 0, word 10: 0x00001298
QPU 0, word 11: 0x00001298
QPU 0, word 12: 0x00001298
QPU 0, word 13: 0x00001298
QPU 0, word 14: 0x00001298
QPU 0, word 15: 0x00001298
Cleaning up.
Done.

Success!  The value is replicated 16 times because the QPUs natively operate on 16-word vectors (4-wide multiplexed 4 times).  When we load an immediate, we actually load a 16-element register with the same value in all 16 elements.  Same with the uniform.  Then the add instruction adds two 16-word vectors to produce another 16-word vector which gets stored in the VPM and then transferred back to the host.  We didn’t have to transfer the whole vector to the host.  The DMA engine would allow us to transfer only part of it, but transferring all 16 words shows how the QPU operates on data in parallel.
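
As a mental model of what just happened, here is a plain C sketch of the same computation done lane by lane on the host.  It only mimics the semantics described above (broadcast, then element-wise add); the function names are made up, and it says nothing about how the hardware schedules the 16 lanes.

```c
#include <stdint.h>

#define QPU_LANES 16  /* a QPU register holds 16 32-bit elements */

/* Copy one scalar into every lane, as a load-immediate or a uniform
 * read does. */
static void broadcast(uint32_t reg[QPU_LANES], uint32_t value)
{
    for (int lane = 0; lane < QPU_LANES; lane++)
        reg[lane] = value;
}

/* Element-wise add: dst[i] = a[i] + b[i] for each of the 16 lanes. */
static void vector_add(uint32_t dst[QPU_LANES],
                       const uint32_t a[QPU_LANES],
                       const uint32_t b[QPU_LANES])
{
    for (int lane = 0; lane < QPU_LANES; lane++)
        dst[lane] = a[lane] + b[lane];
}
```

Broadcasting 0x1234 into one register and 100 into another, then calling vector_add, leaves 0x1298 in all 16 lanes of the destination, which is exactly the pattern in the output above.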

Congratulations!  You’ve written and run your first GPU program on the Raspberry Pi.  Admittedly, this one isn’t too interesting but it’s an important step to building something useful and efficient.  Next time we’ll start looking at the SHA-256 algorithm.

NOTE: Occasionally, the GPU can get into weird states where the programs do not return the expected values.  I have not determined if it is a bug in the host-side code for managing the GPU or if there is something else going on.  In any case, if you see things you don’t expect, try rebooting the machine.