The CPU... and new friends

Summary

This brand-new talk covers the state of the art for using C++ (not just C) for general-purpose computation on graphics processing units (GPGPU). The first half of the talk discusses the most important issues and techniques to consider when using GPUs for high-performance computation, especially where we have to change our traditional advice for doing the same computation on the CPU. The second half focuses on upcoming C++ language and library extensions that bring key abstractions for GPGPU — and in time considerably more — directly into C++.

Description

The mainstream hardware platform, from grandma’s PC or tablet on up, now typically contains a modern GPU that can offer 10x to 100x speedups for certain interesting computation workloads. If you care about performance on commodity hardware, you won’t willingly leave that kind of performance on the table. It’s time to think of the GPU as another available coprocessor, or compute accelerator, that’s not just for high-end custom GPGPU servers any more.

There’s one small problem: Standard C++ cannot access that juicy performance directly, because it doesn’t (yet) have the language abstractions needed for running parts of the same program on heterogeneous processors. That’s unfortunate, because it’s C++’s job as a systems programming language to give full access to the hardware. While C++ evolves in that direction, environments like CUDA and OpenCL are filling the gap in the interim with a combination of libraries and extensions to C.
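
For a concrete taste of what such an extension looks like, here is a minimal sketch of offloading an element-wise vector addition to the GPU, written against Microsoft’s C++ AMP (one example of this kind of C++-level extension, used here purely for illustration and not necessarily the one this talk covers):

    // Element-wise vector addition, offloaded to the GPU via C++ AMP.
    #include <amp.h>
    #include <vector>
    using namespace concurrency;

    void vector_add(const std::vector<float>& a,
                    const std::vector<float>& b,
                    std::vector<float>& sum)
    {
        array_view<const float, 1> av(static_cast<int>(a.size()), a);
        array_view<const float, 1> bv(static_cast<int>(b.size()), b);
        array_view<float, 1>       sv(static_cast<int>(sum.size()), sum);
        sv.discard_data();  // don't copy sum's stale contents to the GPU

        // The lambda body runs on the accelerator; restrict(amp) asks the
        // compiler to check it against the GPU-capable language subset.
        parallel_for_each(sv.extent, [=](index<1> i) restrict(amp) {
            sv[i] = av[i] + bv[i];
        });

        sv.synchronize();   // copy the results back to host memory
    }

Note how the host code and the device code live in the same C++ source file and share the same types; the array_view wrappers and the restrict(amp) annotation do all the heterogeneity bookkeeping.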

This talk covers the following issues:

  • Language subsets due to hardware heterogeneity. The mainstream computer is now fundamentally heterogeneous. Different compute resources (traditional CPU, Cell-style SPU, GPU) support only subsets of the C and C++ languages; for example, many lack support for pointers to pointers, pointers to functions, new, malloc, and even fundamental types like short int (see the subset sketch after this list).
  • Performance differences and pitfalls. Code executing on GPUs can have very different performance characteristics from the same code running on CPUs. Even C++’s most basic language features can bite: merely adding an “if” statement can silently lose an order of magnitude or more of performance (see the divergence sketch after this list).
  • Non-uniform and fragmented memory. Most current GPUs do not share memory with the CPU, so data must be transferred (usually explicitly) before and after the computation. Further, GPU memory itself includes a notion of cache-like memory shared by subgroups of threads, but that “cache” is not automatic; the programmer must manage it explicitly (see the tile_static sketch after this list).
  • Rapidly changing hardware. GPGPU hardware designs in particular are still in great flux. Which programming techniques work well on today’s hardware while also yielding code that will be friendly to tomorrow’s different hardware?
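
To make the subset issue concrete, here is a sketch using C++ AMP’s restrict(amp) annotation (one real instance of such a subset; other environments draw slightly different lines). Inside amp-restricted code, several everyday C++ features simply do not compile:

    #include <amp.h>
    using namespace concurrency;

    int square(int x) restrict(amp) { return x * x; }  // OK on the GPU

    void subset_demo(array_view<int, 1> data)
    {
        parallel_for_each(data.extent, [=](index<1> i) restrict(amp) {
            data[i] = square(data[i]);     // OK: calls another amp function
            // short s = 0;                // error: short int not supported
            // int (*fp)(int) = &square;   // error: no pointers to functions
            // int** pp = nullptr;         // error: no pointers to pointers
            // int* p = new int(42);       // error: no dynamic allocation
        });
    }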
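
The divergence pitfall, sketched in the same style: GPU threads execute in lockstep groups (warps or wavefronts), and when an “if” splits such a group, the hardware typically runs both paths with lanes masked on and off, so the costs add rather than average:

    #include <amp.h>
    using namespace concurrency;

    float even_path(float x) restrict(amp) { return x * x; }     // stand-ins for
    float odd_path (float x) restrict(amp) { return x + 1.0f; }  // heavier work

    void divergence_demo(array_view<float, 1> data)
    {
        parallel_for_each(data.extent, [=](index<1> i) restrict(amp) {
            // Adjacent threads take different branches, so each lockstep
            // group executes BOTH paths serially: roughly the sum of the
            // two costs, not the cost of the one branch taken.
            if (i[0] % 2 == 0)
                data[i] = even_path(data[i]);
            else
                data[i] = odd_path(data[i]);
        });
    }

The same “if” is harmless when its condition is uniform across a lockstep group; the penalty comes from neighboring threads disagreeing.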
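
And the programmer-managed “cache”, again sketched with C++ AMP, where the per-group scratchpad is spelled tile_static (CUDA spells it __shared__, OpenCL __local): nothing lands in it unless your code explicitly puts it there:

    #include <amp.h>
    using namespace concurrency;

    static const int TILE = 256;  // tile size chosen for illustration;
                                  // assumes the extent is a multiple of it

    void neighbor_demo(array_view<const float, 1> in, array_view<float, 1> out)
    {
        parallel_for_each(in.extent.tile<TILE>(),
                          [=](tiled_index<TILE> t) restrict(amp) {
            tile_static float buf[TILE];     // visible to this tile only
            buf[t.local[0]] = in[t.global];  // explicit load into the "cache"
            t.barrier.wait();                // wait for the whole tile to load

            // Each thread now cheaply reads data its neighbor loaded.
            int next = (t.local[0] + 1) % TILE;
            out[t.global] = buf[next];
        });
        out.synchronize();  // explicit copy back across the CPU/GPU divide
    }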

Finally, we’ll also consider how to address the above issues (some of which are temporary) in a way that treats GPGPU as just an interesting current midpoint on the road to mainstream heterogeneous computation — spreading a computational workload across available parallel processing assets, from vector units and multicore, to GPGPU and APU, to elastic cloud computing. And, unlike CUDA and OpenCL, our goal is to find solutions, not for C, but for C++ — leveraging C++’s strength of strong abstractions and STL techniques while still flying close to today’s morphing metal.
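
As a closing illustration of that direction, here is a hypothetical sketch (the name gpu_transform and its shape are invented for this description, not taken from any shipping library) of what STL-style abstraction over an accelerator can look like: the caller writes an ordinary algorithm-plus-lambda call, and the offload machinery hides behind it:

    #include <amp.h>
    #include <vector>

    template <typename Op>
    void gpu_transform(std::vector<float>& data, Op op)
    {
        concurrency::array_view<float, 1> view(static_cast<int>(data.size()), data);
        concurrency::parallel_for_each(view.extent,
            [=](concurrency::index<1> i) restrict(amp) {
                view[i] = op(view[i]);
            });
        view.synchronize();
    }

    // Usage reads like std::transform, but runs on the accelerator:
    //   gpu_transform(samples, [](float x) restrict(amp) { return x * 0.5f; });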