April 2011

One of the features we trumpeted for last year’s C&B was a hard enrollment limit of 60.  We sold out quickly, and, although we later scheduled an encore presentation, the short notice of the encore and its proximity to the December holidays meant that some people who would have liked to attend were unable to.

We wanted to avoid that problem this year, so we selected a venue with more space, and we didn’t impose an enrollment limit.  Irony having a way of being ironic, this year we find ourselves hearing from people who are concerned that C&B 2011 will devolve into a teeming mass, thus giving up the intimacy of a small-group event.  None of us wants that.

So we’ve adopted a middle ground.  C&B 2011 will have a hard enrollment limit of 100 people.  Once the 100 spots are gone, they’re gone, and we won’t add any more.  This means that each participant is assured of ample time to talk with me, Andrei, and Herb, as well as the other C&B attendees.  Each will have the opportunity to ask questions and make comments during technical sessions.  Nobody will feel lost in a crowd, because there won’t be any crowds.  If you’ve been worrying that this year’s C&B won’t have the small-group feel of last year’s events, you can stop worrying.

The flip side is that there’s no possibility of an encore this year.  Last year, we were lucky to be able to find a time slot where Herb, Andrei, I, and the hotel all had availability, but we already know that that’s not going to be possible this year.  If you’re interested in being part of C&B 2011, you’ll want to sign up for the one and only occurrence: August 7-10 in Banff.

Between presentations on efficiently processing big data sets, taking advantage of the computational capabilities of GPUs, understanding the C++0x memory model, and more (all created just for C&B), plus the chance to exchange experiences and ideas with other top-flight C++ developers (not to mention Herb, Andrei, and me) in a setting with no more than 100 people, we think this will be the premier C++-and-performance-related event of 2011.  We hope you’ll be part of it.


Perfect forwarding is an important C++0x technique built atop rvalue references.  It allows move semantics to be automatically applied, even when the source and the destination of a move are separated by intervening function calls.  Common examples include constructors and setter functions that forward the arguments they receive to the data members of the class they are initializing or setting, as well as standard library functions like make_shared, which “perfect-forwards” its arguments to the constructor of whatever object the to-be-created shared_ptr is to point to.  At last year’s C&B, I discussed perfect forwarding as the final part of my talk on move semantics and rvalue references.

The thing about perfect forwarding is that there’s both less and more to it than the name suggests.  For one thing, perfect forwarding isn’t perfect:  there are types that cannot be perfect-forwarded.  Examples include 0 and NULL as null pointer constants, as well as braced initializer lists.  Such imperfections have implications both for those who specify interfaces and for those who use them, i.e., for everybody.  It’s important to be familiar with the trade-offs of the various design alternatives.
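
To make the imperfection concrete, here’s a minimal sketch (isNull and forwardToIsNull are names I’ve invented for illustration): a call that compiles fine when made directly fails when routed through a forwarding template, because 0 is deduced as an int rather than recognized as a null pointer constant.

```cpp
#include <utility>

// Illustrative names, not from any real library:
bool isNull(int* p) { return p == nullptr; }

template<typename T>
bool forwardToIsNull(T&& arg) {
    return isNull(std::forward<T>(arg));   // hand the argument through
}

// isNull(0) compiles: in a direct call, 0 is a null pointer constant.
// forwardToIsNull(0) does NOT compile: T is deduced as int, and the
//   forwarded int (no longer a literal 0) won't convert to int*.
// forwardToIsNull(nullptr) is fine: std::nullptr_t converts to int*.
```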

Perfect forwarding is implemented via function templates with parameters declared to be of type T&&.  Such parameters are treated specially during type deduction, which is fine, unless you want to specialize the templates.  Purists will rightly point out that you can’t specialize function templates, you can only overload them, but the catch with perfect-forwarding templates is that the only parameter type that gets the special type deduction treatment is T&&, so you can’t overload on, say, something like T*&& in an attempt to “partially specialize” for pointer types.  So what do you do if you’re dealing with a perfect-forwarding template and you want the behavior you’d normally get via template overloading?
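
One common workaround, sketched below with invented names, is to keep a single T&& entry point and tag-dispatch on a type trait, recovering the pointer-specific behavior you might otherwise have tried to get from a T*&& overload:

```cpp
#include <string>
#include <type_traits>
#include <utility>

// Overloads selected by tag dispatch (true_type: pointer, false_type: not):
template<typename T>
std::string handle(T&&, std::true_type)  { return "pointer"; }

template<typename T>
std::string handle(T&&, std::false_type) { return "non-pointer"; }

// The single perfect-forwarding entry point:
template<typename T>
std::string process(T&& arg) {
    // Strip the reference first: for lvalue arguments, T deduces to U&,
    // and U& is never a pointer type.
    return handle(std::forward<T>(arg),
                  typename std::is_pointer<
                      typename std::remove_reference<T>::type>::type());
}
```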

Even something as simple as writing a perfect-forwarding constructor or setter gets tricky if you want to combine it with something as equally simple as use of the pImpl idiom, because the template instantiation mechanism typically wants all the template code in the header file, yet the use of pImpl is motivated by the desire to avoid having to do that.  How do you resolve that problem, especially when what’s supposed to work runs headlong into, um, compiler implementations that are less than what they might be?
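
One resolution, sketched here in a single file with the usual header/source split marked in comments (Widget and its members are invented for illustration, not the talk’s actual example), is to keep only the thin forwarding template in the header and have it hand off to non-template overloads whose bodies can live in the source file next to the Impl struct:

```cpp
#include <memory>
#include <string>
#include <utility>

// --- widget.h ---
class Widget {
public:
    Widget();
    ~Widget();
    // The template must stay in the header, but it's only a thin shim:
    template<typename T>
    void setName(T&& n) { setNameImpl(std::forward<T>(n)); }
    std::string name() const;
private:
    void setNameImpl(const std::string& n);  // copies; body in widget.cpp
    void setNameImpl(std::string&& n);       // moves; body in widget.cpp
    struct Impl;
    std::unique_ptr<Impl> pImpl;
};

// --- widget.cpp ---
struct Widget::Impl { std::string name; };

Widget::Widget() : pImpl(new Impl) {}
Widget::~Widget() = default;
std::string Widget::name() const { return pImpl->name; }
void Widget::setNameImpl(const std::string& n) { pImpl->name = n; }
void Widget::setNameImpl(std::string&& n) { pImpl->name = std::move(n); }

// Tiny usage demonstration:
std::string demo() {
    Widget w;
    w.setName("C&B");       // rvalue: routed to the moving overload
    std::string s("Banff");
    w.setName(s);           // lvalue: routed to the copying overload
    return w.name();
}
```

The trade-off is that the forwarded argument is funneled through a fixed set of overloads, so the setter is no longer forwarding to an arbitrary constructor, but the pImpl firewall stays intact.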

In this session, I’ll give a brief review of perfect forwarding (i.e., I won’t assume that everybody saw (and remembers…) my talk at last year’s C&B), then I’ll launch into a discussion of the practical issues such as those above that arise when you try to put perfect forwarding into day-to-day use.

If there are issues related to this topic you’d like to see me address, please let me know via comment on this blog post or via email.


Everybody at C&B is committed to seeing that our delegates are treated fairly and that the information we have listed on our web site is neither inaccurate nor misleading.

When we listed our special room rate of $189 CDN for attendees, we failed to notice that this room type can only hold two adults. As indicated by the comments on the hotel information page, this caused confusion, because some people expected to fit more than two people in a room at this rate, paying only an extra $30 for each additional person.

The Banff Springs Hotel has been very generous and flexible in handling this situation with us. They’ve extended the $189/night rate for larger rooms to those people who’ve already registered and who expected to be able to fit more than two people in a room at that rate.  If you’re one of those people, you should already have been contacted about the revised arrangement.  (If not, let me know.)

Once the rate ambiguity became clear, we revised the information at the hotel information page, so people who register from now on should do so with the understanding that if they want to put more than two people in a room, they will need to book a larger (and more expensive) room than the $189/night covers.

Thanks for bringing this matter to our attention, and thanks for your patience as we resolved it. We truly apologize for the confusion.


[Update: My other sessions have grown longer because there’s lots of detail to cover on the C++0x and GPU topics, and so this talk no longer fits and it will be deferred to a future event. However, much of what this was going to cover will still be covered in the context of “How To Teach C++” which targets C++0x — the main difference will be that the examples won’t be drawn directly from the Exceptional C++ books this time. -Herb]

If you’ve been doing C++ for more than a year or two, I’ll bet you grok how to think in C++… especially if we’re talking about the first C++ standard, C++98.

Now that the second ISO C++ standard is technically complete and expected to be published later this year, and many of its features are available in popular C++ compilers, it’s time to ask: How will the new features in C++0x affect the way I solve problems and write code in C++ — indeed, the way I think in the language?

As Bjarne Stroustrup put it:

Surprisingly, C++0x feels like a new language: The pieces just fit together better than they used to and I find a higher-level style of programming more natural than before and as efficient as ever.

Using a series of examples drawn from my Exceptional C++ (XC++) books, this brand-new talk illustrates how and why the new features in the just-finalized C++0x standard (aka C++11) change the way we solve problems as well as our C++ coding style. I’m going through the three XC++ books to bring together many of the examples whose solutions are affected the most, and that best highlight how C++0x, while retaining much that is familiar, lets us think about our code in new ways.

You’ve probably seen these popular questions before, and their solutions in C++98. (If not, you can freely browse them here in their original “Guru of the Week” or “GotW” form.) Now you’ll see them in a whole new light… and see why C++0x feels like a fresh new language that leads to different approaches and even better solutions.

Note: This talk isn’t about a language you’ll eventually get to use someday; it’s about “now” — all the code I’ll show works on today’s most popular shipping compilers, which already implement many C++0x language and library features.

As we get closer to the event, I’ll post lists of particular XC++ Items (and corresponding GotW issues) that I plan to cover so that you can refamiliarize yourself with their C++98 solutions in advance. During the course we will spend most of our time on the C++0x solutions, referring to the C++98 solutions just enough to serve as a refresher and to allow side-by-side comparison.

In 2004, Andrei and I coauthored an article about why it was not possible to write a safe implementation of double-checked locking in standard C++.  Between compiler optimizers performing code motion and hardware MMUs reordering loads and stores, there was no way to prevent transformations that could cause safe-looking source code to yield incorrect runtime behavior.

But that was for C++98, which, because it had no concept of threads, had no memory model, so it offered no guarantees regarding the visibility of reads and writes across threads.  C++0x does have a memory model, and the changes to the language and library are significant.  Sequence points no longer exist, having been replaced with the “sequenced before” and “happens before” relationships.  Atomic types offer memory visibility guarantees not offered by non-atomic types, and compilers must generate code that respects these guarantees.  These guarantees make it possible to prevent the kind of unwanted code motion and memory access reorderings that bedevil double-checked locking in C++98, although they’re not enough:  the core language must also constrain the relationship between memory allocation and object construction in a new expression.  C++0x provides such constraints.
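
As a sketch of what these guarantees buy (Widget here is a stand-in singleton class, not code from the article), double-checked locking can be written in C++0x with an acquire load paired with a release store, so any thread that sees a non-null pointer also sees the fully constructed object:

```cpp
#include <atomic>
#include <mutex>

class Widget {
public:
    static Widget* instance() {
        Widget* p = pInstance.load(std::memory_order_acquire);
        if (!p) {                                    // first check, no lock
            std::lock_guard<std::mutex> guard(mtx);
            p = pInstance.load(std::memory_order_relaxed);
            if (!p) {                                // second check, locked
                p = new Widget;
                // release: construction happens-before any acquire load
                // that sees this non-null value
                pInstance.store(p, std::memory_order_release);
            }
        }
        return p;
    }
private:
    Widget() {}
    static std::atomic<Widget*> pInstance;
    static std::mutex mtx;
};

std::atomic<Widget*> Widget::pInstance{nullptr};
std::mutex Widget::mtx;
```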

Like most languages that recognize the existence of multiple threads, C++0x offers programmers a mechanism for ensuring sequential consistency in concurrent code.  In contrast to other languages (and in recognition of C++’s role as a systems programming language), C++0x offers fine-grained control over the constraints imposed on compilers and hardware when optimizing, making it possible to implement consistency models weaker — and hence potentially faster — than sequential consistency.
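
For example, operations on std::atomic are sequentially consistent by default, but each operation can be given an explicit, weaker ordering; the relaxed counter below (a toy example of mine, not from the talk) keeps atomicity while dropping the ordering constraints:

```cpp
#include <atomic>

std::atomic<int> hits(0);

// Default: every operation is memory_order_seq_cst, i.e., part of a
// single total order that all threads agree on.
int bumpSeqCst() { return ++hits; }

// Explicit relaxed ordering: still an atomic read-modify-write, but it
// imposes no ordering on surrounding loads and stores, so it can be
// cheaper on weakly ordered hardware.
int bumpRelaxed() { return hits.fetch_add(1, std::memory_order_relaxed) + 1; }
```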

This talk will address the topics mentioned above:  double-checked locking (briefly — I’ll assume you’ve read Andrei’s and my above-mentioned article in advance), compiler optimizations, memory access reorderings, “sequenced before” and “happens before” relations, atomic types, memory consistency models, and how they relate to both correctness and performance.  Also, because it’s pretty much obligatory in this kind of talk, I’ll address the potentially confusing relationship between atomic and volatile types.

I welcome your feedback on this talk topic.


The CPU... and new friends


This brand-new talk covers the state of the art for using C++ (not just C) for general-purpose computation on graphics processing units (GPGPU). The first half of the talk discusses the most important issues and techniques to consider when using GPUs for high-performance computation, especially where we have to change our traditional advice for doing the same computation on the CPU. The second half focuses on upcoming C++ language and library extensions that bring key abstractions for GPGPU — and in time considerably more — directly into C++.


The mainstream hardware platform, from grandma’s PC or tablet on up, now typically contains a modern GPU that can offer 10x to 100x speedups for certain interesting computation workloads. If you care about performance on commodity hardware, you won’t willingly leave that kind of performance on the table. It’s time to think of the GPU as another available coprocessor, or compute accelerator, that’s not just for high-end custom GPGPU servers any more.

There’s one small problem: Standard C++ cannot access that juicy performance directly because it doesn’t (yet) have the language abstractions needed for running parts of the same program on heterogeneous processors. That’s unfortunate, because it’s C++’s job as a systems programming language to give full access to the hardware. While C++ evolves in this direction, in the interim environments like CUDA and OpenCL are filling this gap using a combination of libraries and extensions to C.

This talk covers the following issues:

  • Language subsets due to hardware heterogeneity. The mainstream computer is now fundamentally heterogeneous. Different compute resources (traditional CPU, Cell-style SPU, GPU) support only subsets of the C and C++ languages; for example, many lack support for pointers to pointers, pointers to functions, new, malloc, and even fundamental types like short int.
  • Performance differences and pitfalls. Code executing on GPUs can have very different performance characteristics from the same code running on CPUs. Even one of C++’s most basic language features, a simple “if” statement, can silently cost an order of magnitude or more of performance.
  • Non-uniform and fragmented memory. Most current GPUs do not share memory with the CPU, and so data must be (usually explicitly) transferred before and after the computation. Further, the GPU memory itself has a notion of cache-like memory shared by subgroups of threads, but that “cache” is not automatic; the programmer must manage it explicitly.
  • Rapidly changing hardware. GPGPU hardware designs in particular are still in great flux. What programming techniques work well on today’s hardware, and result in writing code that is also friendly to tomorrow’s different hardware?

Finally, we’ll also consider how to address the above issues (some of which are temporary) in a way that treats GPGPU as just an interesting current midpoint on the road to mainstream heterogeneous computation — spreading a computational workload across available parallel processing assets, from vector units and multicore, to GPGPU and APU, to elastic cloud computing. And, unlike CUDA and OpenCL, our goal is to find solutions, not for C, but for C++ — leveraging C++’s strength of strong abstractions and STL techniques while still flying close to today’s morphing metal.