by Norman Matloff
You've heard that graphics processing units (GPUs) can bring big increases in computational speed. While GPUs cannot speed up work in every application, in many cases they can indeed provide very rapid computation. In this tutorial, we'll see how this is done, both in passive ways (you write only R) and in more direct ways, where you write C/C++ code and interface it to R.
What Are GPUs?
A GPU is basically just a graphics card. Any PC or laptop has one (or at least a GPU chipset), but the more expensive ones, favored by gamers, are quite fast. Because the graphical operations performed by a GPU involve vectors and such, at some point physicists started using GPUs for their own non-graphics computations. To do so, they had to "trick" the GPUs by writing what appeared to be graphics code. Such programming was, to say the least, unwieldy, so NVIDIA took the risky, "build it and they will come" step of designing its GPUs to be generally programmable. The goal was to have a structure that is fast for game programming but that can also be programmed in "normal" C/C++ code.
NVIDIA's bet paid off, and today the most commonly used GPUs for general-purpose work are made by that firm. They are programmed in a language called CUDA, which is an extension of C/C++. Check this list to see if you have a CUDA-compatible graphics card. We will focus on NVIDIA here, and the term GPU will mean those devices.
GPUs are shared-memory, threaded devices. (If you haven't seen these terms before, please read my recent OpenMP tutorial blog post first, at least the beginning portions.) Note by the way that the GPU card has its own memory, separate from that accessed by your CPU.
A GPU might be called a "multi-multicore" machine. It consists of a number of streaming multiprocessors (SMs), each one of which is a multicore engine. Thus the cores share memory, and many threads run simultaneously. A typical GPU will contain many more cores than the 2, 4 or even 8 cores one commonly sees on PCs and laptops.
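To see how many SMs your own card has, you can ask the CUDA runtime. The following is a minimal sketch, assuming the CUDA toolkit is installed and the file is compiled with nvcc. The helper name sm_count is made up for illustration and is not part of CUDA. Note that the runtime reports the number of SMs but not the number of cores per SM, which depends on the hardware generation.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper (not a CUDA API): number of SMs on a device, or -1 on failure.
int sm_count(int device) {
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1;
    return prop.multiProcessorCount;
}

int main() {
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        std::printf("No CUDA-capable device found\n");
        return 1;
    }
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp prop;
        if (cudaGetDeviceProperties(&prop, d) != cudaSuccess) continue;
        std::printf("Device %d: %s\n", d, prop.name);
        std::printf("  SMs:                   %d\n", sm_count(d));
        std::printf("  max threads per block: %d\n", prop.maxThreadsPerBlock);
        std::printf("  device memory (MiB):   %zu\n", prop.totalGlobalMem >> 20);
    }
    return 0;
}
```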
There are some differences from standard multicore, though, notably that threads run in groups called blocks. Within a block, the threads run in lockstep (strictly speaking, the hardware does this in subgroups of 32 threads called warps), all running the same machine language instruction (or none at all) at the same time. This uniformity makes for very fast hardware if one's application can exploit it.
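To make the block structure concrete, here is a minimal sketch of a CUDA kernel in which each thread adds one pair of vector elements. The names add_vectors and gpu_add, and the choice of 256 threads per block, are assumptions for illustration only. Error checking is omitted to keep the sketch short.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Every thread runs this same code; each works out which element is "its own"
// from its block number and its position within the block.
__global__ void add_vectors(const float *x, const float *y, float *z, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) z[i] = x[i] + y[i];  // the last block may have surplus threads
}

// Hypothetical host wrapper: copy inputs to the device, launch, copy the result back.
std::vector<float> gpu_add(const std::vector<float> &x, const std::vector<float> &y) {
    int n = static_cast<int>(x.size());
    if (n == 0) return {};
    size_t bytes = n * sizeof(float);
    float *dx, *dy, *dz;
    cudaMalloc(&dx, bytes);
    cudaMalloc(&dy, bytes);
    cudaMalloc(&dz, bytes);
    cudaMemcpy(dx, x.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dy, y.data(), bytes, cudaMemcpyHostToDevice);

    int threads = 256;                         // threads per block (assumed value)
    int blocks = (n + threads - 1) / threads;  // enough blocks to cover n elements
    add_vectors<<<blocks, threads>>>(dx, dy, dz, n);

    std::vector<float> z(n);
    cudaMemcpy(z.data(), dz, bytes, cudaMemcpyDeviceToHost);  // waits for the kernel
    cudaFree(dx);
    cudaFree(dy);
    cudaFree(dz);
    return z;
}

int main() {
    std::vector<float> x(1000), y(1000);
    for (int i = 0; i < 1000; ++i) { x[i] = i; y[i] = 2.0f * i; }
    std::vector<float> z = gpu_add(x, y);
    bool ok = true;
    for (int i = 0; i < 1000; ++i) ok = ok && (z[i] == 3.0f * i);
    std::printf("%s\n", ok ? "all sums correct" : "mismatch");
    return ok ? 0 : 1;
}
```

With 1000 elements and 256 threads per block, the launch uses 4 blocks, so 24 threads in the last block have nothing to do; the bounds test in the kernel keeps them from touching memory they should not.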
Not all applications will fare well with GPUs. Consider what happens with an if statement in C/C++. The threads that satisfy the "if" condition will all execute the same machine instructions, but the remaining threads will sit idle until those instructions finish, a situation termed thread divergence. We thus lose parallelism, and speed is reduced. So, applications that have a regular pattern, such as matrix operations, can really take advantage of GPUs, while other apps that need a lot of if/else decision making may not be speedy.
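One way to see thread divergence for yourself is to time the same kernel under two branching patterns. The sketch below is an illustration under stated assumptions, not a benchmark: the names path_a, path_b, branchy and run_ms are invented here, the work in each branch is arbitrary, and the group size of 32 reflects the lockstep group (warp) size on NVIDIA hardware to date. With stride 1, neighboring threads alternate between the two branches, so every lockstep group has to run both; with stride 32, each group takes a single branch.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int kWarp = 32;     // lockstep group size on NVIDIA GPUs to date
constexpr int kIters = 1000;  // arbitrary amount of work per branch

// Two deliberately different chunks of work (unsigned wraparound is well defined).
__host__ __device__ unsigned path_a(unsigned v) {
    for (int k = 0; k < kIters; ++k) v = v * 3u + 1u;
    return v;
}
__host__ __device__ unsigned path_b(unsigned v) {
    for (int k = 0; k < kIters; ++k) v = v * 5u + 7u;
    return v;
}

// Thread i takes path_a if (i / stride) is even, path_b otherwise.
__global__ void branchy(unsigned *a, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if ((i / stride) % 2 == 0) a[i] = path_a(a[i]);
    else                       a[i] = path_b(a[i]);
}

// Hypothetical helper: runs the kernel in place on a, returns kernel time in ms.
float run_ms(std::vector<unsigned> &a, int stride) {
    int n = static_cast<int>(a.size());
    size_t bytes = n * sizeof(unsigned);
    unsigned *d;
    cudaMalloc(&d, bytes);
    cudaMemcpy(d, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventRecord(t0);
    branchy<<<(n + 255) / 256, 256>>>(d, n, stride);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    cudaMemcpy(a.data(), d, bytes, cudaMemcpyDeviceToHost);
    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    cudaFree(d);
    return ms;
}

int main() {
    const int n = 1 << 20;
    std::vector<unsigned> a(n), b(n), warm(n);
    for (int i = 0; i < n; ++i) a[i] = b[i] = warm[i] = i;
    run_ms(warm, 1);  // warm-up launch so startup costs are not timed
    std::printf("neighbors alternate branches: %.3f ms\n", run_ms(a, 1));
    std::printf("branch uniform per group:     %.3f ms\n", run_ms(b, kWarp));
    return 0;
}
```

Both runs do the same total amount of arithmetic, so any gap between the two timings is attributable to the branching pattern. The size of the gap will vary with the card and the compiler, so treat the numbers as something to measure rather than predict.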
Another major way in which GPUs differ from ordinary multicore systems is that synchronization between threads is less flexible. There is nothing in GPUs quite the same as what the OpenMP barrier construct gives us. Threads within a block can be synchronized cheaply, but there is no comparably cheap mechanism between blocks. If we need a "wait here until all threads get here" type of operation across the whole GPU, we must resort to heavy-handed measures, such as returning control to the CPU of the host machine, which is very slow. So, the applications that are most suitable for GPU use are those not needing much in terms of barriers and the like.
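The contrast between the two levels of synchronization can be sketched with a sum reduction. Inside a block, CUDA's __syncthreads() serves as the barrier; across blocks, the sketch simply ends the kernel and has the host launch another one, which is the "return control to the host" approach described above. The names block_sums and gpu_sum and the block size of 256 are assumptions for illustration, and error checking is omitted.

```cuda
#include <cstdio>
#include <utility>
#include <vector>
#include <cuda_runtime.h>

constexpr int kThreads = 256;  // threads per block; must be a power of 2 here

// Each block adds up its own slice of the input and writes one partial sum.
__global__ void block_sums(const int *in, int *partial, int n) {
    __shared__ int buf[kThreads];
    int t = threadIdx.x;
    int i = blockIdx.x * blockDim.x + t;
    buf[t] = (i < n) ? in[i] : 0;
    __syncthreads();  // barrier for the threads of THIS block only
    for (int half = blockDim.x / 2; half > 0; half /= 2) {
        if (t < half) buf[t] += buf[t + half];
        __syncthreads();  // every thread in the block reaches this call
    }
    if (t == 0) partial[blockIdx.x] = buf[0];
}

// Hypothetical host driver: each relaunch acts as the barrier between blocks.
int gpu_sum(const std::vector<int> &x) {
    int n = static_cast<int>(x.size());
    if (n == 0) return 0;
    int blocks = (n + kThreads - 1) / kThreads;
    int *d_in, *d_out;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, blocks * sizeof(int));
    cudaMemcpy(d_in, x.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    while (true) {
        blocks = (n + kThreads - 1) / kThreads;
        block_sums<<<blocks, kThreads>>>(d_in, d_out, n);
        if (blocks == 1) break;   // a single partial sum is the final answer
        std::swap(d_in, d_out);   // the partial sums become the next input
        n = blocks;
    }
    int result = 0;
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}

int main() {
    std::vector<int> x(1000);
    for (int i = 0; i < 1000; ++i) x[i] = i + 1;  // 1 + 2 + ... + 1000
    std::printf("sum = %d\n", gpu_sum(x));
    return 0;
}
```

Kernels launched one after another in this way run in order, so the second launch cannot start reading the partial sums until every block of the first launch has finished writing them.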
In any shared-memory system, a major performance issue is memory access time. With GPUs, the problem is writ large, since there is memory on two levels, host and device, each of which has long latency times. On the other hand, this is ameliorated for device memory, since there is an efficient embedded-in-hardware operating system that engages in latency hiding: When a thread block encounters an instruction needing access to device memory, the OS suspends that block and tries to find another to run that doesn't access memory, giving the first block time to do the memory access while useful work is done by the second block. Accordingly, it is good practice to set up a large number of threads, each with a small amount of work to do (very different from the multicore case). The memory access itself is most efficient if the threads in a block are requesting consecutive memory locations; good CUDA coders strive to make this happen.
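The effect of consecutive versus scattered memory requests can be explored with a small experiment. In the sketch below, which rests on assumptions of my choosing (the names gather and gather_ms, the array size, and the stride of 33 are all illustrative), each of a large number of threads copies a single element. With stride 1, neighboring threads read neighboring locations; with stride 33, their reads are spread out across memory.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Thread i copies in[(i * stride) % n] to out[i]. One element per thread:
// many threads, each with very little work.
__global__ void gather(const float *in, float *out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        int src = static_cast<int>((static_cast<long long>(i) * stride) % n);
        out[i] = in[src];
    }
}

// Hypothetical helper: runs the kernel, fills out, returns kernel time in ms.
float gather_ms(const std::vector<float> &in, std::vector<float> &out, int stride) {
    int n = static_cast<int>(in.size());
    size_t bytes = n * sizeof(float);
    out.resize(n);
    float *d_in, *d_out;
    cudaMalloc(&d_in, bytes);
    cudaMalloc(&d_out, bytes);
    cudaMemcpy(d_in, in.data(), bytes, cudaMemcpyHostToDevice);  // host to device
    cudaEvent_t t0, t1;
    cudaEventCreate(&t0);
    cudaEventCreate(&t1);
    cudaEventRecord(t0);
    gather<<<(n + 255) / 256, 256>>>(d_in, d_out, n, stride);
    cudaEventRecord(t1);
    cudaEventSynchronize(t1);
    float ms = 0.0f;
    cudaEventElapsedTime(&ms, t0, t1);
    cudaMemcpy(out.data(), d_out, bytes, cudaMemcpyDeviceToHost);  // device to host
    cudaEventDestroy(t0);
    cudaEventDestroy(t1);
    cudaFree(d_in);
    cudaFree(d_out);
    return ms;
}

int main() {
    const int n = 1 << 22;
    std::vector<float> in(n), out;
    for (int i = 0; i < n; ++i) in[i] = static_cast<float>(i);
    gather_ms(in, out, 1);  // warm-up launch so startup costs are not timed
    std::printf("consecutive reads (stride 1):  %.3f ms\n", gather_ms(in, out, 1));
    std::printf("scattered reads   (stride 33): %.3f ms\n", gather_ms(in, out, 33));
    return 0;
}
```

Both launches move the same number of bytes, so a difference in the timings reflects the access pattern rather than the amount of data. Note also that the timings cover only the kernel; the two cudaMemcpy calls between host and device memory carry their own, separate cost.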