Friday Q&A 2008-12-19: Multithreaded Programming and Parallelism Overview

Great response last week. This week I'm going to merge Sam McDonald's question about how I got into doing multithreaded programming and Phil Holland's idea of talking about the different sorts of parallelism available.

Like a lot of computer programmers, I was always interested in making code run fast. This led to better languages (I started in BASIC!), micro-optimization, and algorithms, but ultimate performance means multiprocessing. The distributed.net and SETI@Home projects showed the power of distributed computation.

Multithreading also became interesting when I came to OS X from the old Mac OS, where it was far more limited and difficult. At the time it wasn't about performance, since most machines had only one CPU. But multithreading has lots of other benefits for organization, design, and interactive GUIs, so it was still highly useful.

Then the ongoing multicore revolution kicked off and made it clear that multithreading was the way to go.

That's the why. The how is pretty boring: just lots of work, reading, and experimentation with all sorts of multiprocessing, not just threading. They're very different, but many concepts are the same, and ideas from one can often help with the others. As with most things, practice and experience make a big difference.

So then we have the different forms of parallel processing available. There are actually a lot of these, and I'm probably doomed to miss some, but:

  1. Distributed computing. Probably the most visible example of this one for Mac developers is distributed builds in Xcode. This is generally the most difficult to build, the most expensive for the user to take advantage of, and therefore the least useful. Bandwidth and latency between computers are horrendous compared to what you get within a single machine, so it's hard to write something that goes fast. Xcode can get away with it because it (usually) does a lot of computation for each bit of data it has to move around. Beyond the difficulty of achieving speed, you also have to deal with a much more error-prone environment. You want to recover gracefully if the user wanders out of wifi range, not lose a bunch of data. Most of the time this is not worth it, especially if you're going to be shipping consumer-level software.
  2. GPGPU. Basically using the video card for computation. This is what GPULife does. It's capable of immense power. A top-of-the-line video card can easily outperform a top-of-the-line CPU by a factor of 50 with the right program. It's also really hard. GPUs are extremely parallel and have a considerably different architecture from CPUs, so coding for them is hard and making them go fast is harder. (Although even slow GPU code can run really fast due to the amount of power available.) Technologies like CUDA and OpenCL promise to make this sort of thing a lot better, although you're always going to be dealing with the fact that it's a massively parallel system with really different performance characteristics. My recommendation here is to wait for Snow Leopard and then hope OpenCL delivers on its promise.
  3. Multiple processes. Again Xcode is a prominent example of this approach, where you can see it spawn multiple instances of gcc when compiling. This is often talked about as being an easier, safer way to go than multithreading because the OS protects processes from each other and forces a better separation of concerns. I don't buy it, personally. For just about any non-trivial program, a dead subprocess is going to mean that the whole thing comes crashing down, and all you've done by splitting it into multiple processes is make it harder to debug. What's worse, OS limits on the number of processes tend to be frighteningly low, so your program would need to gracefully handle being unable to spawn as many subprocesses as it likes. (And all the other apps on the system would need to as well!)
  4. Multithreading. The standard technique. Often very difficult to get right, and very difficult to debug, but potentially very rewarding in terms of performance. Threads can also help to better organize a program. It's often much cleaner to put long-running processing or blocking operations into a separate thread than to try to multiplex them together.

Multithreading is the one we're most familiar with and the one that's the most generally useful. Because it's so general-purpose, there are various ways to use multithreading to actually get things done, each of which is sketched in code after the list:

  1. Locks. "Standard" multithreading. You have shared data protected by locks. Acquire the locks before you fiddle with the data. Often used to build more sophisticated machinery. This level can be tricky to get right, so I recommend avoiding it where you can, and using it sparingly to build better abstractions where you must.
  2. Message passing. With message passing, you avoid shared data and have threads communicate using message queues instead. (The message queues generally have shared data inside them, but that's an implementation detail.) Cocoa has some nice facilities for this with the -performSelectorOnMainThread:... and -performSelector:onThread:... calls. The concurrency-oriented language Erlang uses this model extensively, and it's the main source of Erlang's multithreading power.
  3. Operation units. This is kind of like message passing, except the operations just fly off and get executed on a queue that manages its own threads behind the scenes. When set to execute only one operation at a time, a queue can act as a synchronization point, replacing locks in a way that's often easier to work with. NSOperationQueue provides this, and Grand Central Dispatch in Snow Leopard is rumored to provide similar facilities.
  4. Atomic/transactional objects. Rather than using mutual exclusion (locks, queues) to avoid destroying shared data, you can build objects that operate using transactions. Grab a snapshot, make changes, commit them. (Often this is implemented as a loop: snapshot, change, try to commit, and if something changed in the middle, start over with a fresh snapshot.) TransactionKit is a great example of this sort of thing in a Cocoa context.
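
To make those concrete, here are minimal sketches of each approach, in order. First, locks. The counter class here is hypothetical, just a stand-in for real shared data; the pattern is simply to take the lock around every access, reads included:

    #import <Foundation/Foundation.h>

    @interface Counter : NSObject
    {
        NSLock *_lock;
        NSUInteger _count;
    }
    - (void)increment;
    - (NSUInteger)count;
    @end

    @implementation Counter

    - (id)init
    {
        if((self = [super init]))
            _lock = [[NSLock alloc] init];
        return self;
    }

    - (void)dealloc
    {
        [_lock release];
        [super dealloc];
    }

    - (void)increment
    {
        // take the lock around every access to the shared data
        [_lock lock];
        _count++;
        [_lock unlock];
    }

    - (NSUInteger)count
    {
        // reads need the lock too
        [_lock lock];
        NSUInteger count = _count;
        [_lock unlock];
        return count;
    }

    @end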
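
Next, message passing. These hypothetical methods would live in some controller class, with -processData: standing in for the real computation. The worker thread keeps its data to itself and hands the result back to the main thread as a message:

    - (void)startWorkWithData: (id)data
    {
        [NSThread detachNewThreadSelector: @selector(workThread:)
                                 toTarget: self
                               withObject: data];
    }

    - (void)workThread: (id)data
    {
        // each thread needs its own autorelease pool
        NSAutoreleasePool *pool = [[NSAutoreleasePool alloc] init];

        id result = [self processData: data]; // hypothetical long computation

        // no shared data: the result goes back to the main thread as a message
        [self performSelectorOnMainThread: @selector(workFinished:)
                               withObject: result
                            waitUntilDone: NO];
        [pool release];
    }

    - (void)workFinished: (id)result
    {
        // runs on the main thread, so it's safe to update the UI here
    }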
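
Third, operation units with NSOperationQueue (new in 10.5). Again, data and -processData: are stand-ins for real work. Note the width-one queue trick mentioned above, with the queue acting as a synchronization point in place of a lock:

    NSOperationQueue *queue = [[NSOperationQueue alloc] init];

    // a width-one queue executes operations one at a time, so it can
    // replace a lock as the serialization point for some piece of work
    [queue setMaxConcurrentOperationCount: 1];

    NSInvocationOperation *op =
        [[NSInvocationOperation alloc] initWithTarget: self
                                             selector: @selector(processData:)
                                               object: data];
    [queue addOperation: op]; // flies off and runs on the queue's thread
    [op release];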
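
Finally, the transactional snapshot/change/commit loop, built here on OSAtomicCompareAndSwapPtrBarrier() from libkern/OSAtomic.h publishing an immutable array. This sketch deliberately glosses over the hard part, which is safely reclaiming old snapshots that other threads may still be reading; that's the kind of problem TransactionKit exists to solve. But it shows the shape of the loop:

    #import <Foundation/Foundation.h>
    #import <libkern/OSAtomic.h>

    static NSArray * volatile gSharedArray; // assume initialized elsewhere

    void AddObject(id obj)
    {
        BOOL committed = NO;
        while(!committed)
        {
            NSArray *snapshot = gSharedArray;                 // snapshot
            NSMutableArray *newArray = [snapshot mutableCopy];
            [newArray addObject: obj];                        // change

            // commit: succeeds only if nobody else committed in the
            // meantime; otherwise loop and retry with a fresh snapshot
            committed = OSAtomicCompareAndSwapPtrBarrier(
                snapshot, newArray, (void * volatile *)&gSharedArray);

            if(!committed)
                [newArray release];
            // NOTE: releasing the old snapshot safely is deliberately
            // omitted here, since readers may still be using it
        }
    }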

As for what to use, here are my thoughts. Avoid distributed computing unless your code is going to be run by a single client with a lot of available hardware. Being able to snarf up CPU cycles from idle hardware sitting around in the user's house sounds cool but just doesn't pay off most of the time. Avoid GPGPU on the Mac until Snow Leopard ships unless you have a really good application for it. OpenCL will make GPGPU a lot more practical and flexible, so trying to shoehorn your computationally expensive code into GLSL or CoreImage today just doesn't seem worth it.

Using multiple processes is a good idea if the subprograms are already written. If you're invoking gcc as a subprocess, invoking it simultaneously on four files instead of one at a time is pretty easy, as the sketch below shows. If you're writing your code from scratch, I don't recommend this approach unless you have some other good reason to use multiple processes, as it's difficult and the reward just isn't there.
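
Here's roughly what that looks like with NSTask. The file names are made up, but the pattern is just: launch them all, then wait for them all.

    NSArray *files = [NSArray arrayWithObjects:
                         @"a.c", @"b.c", @"c.c", @"d.c", nil];
    NSMutableArray *tasks = [NSMutableArray array];

    for(NSString *file in files)
    {
        NSTask *task = [[NSTask alloc] init];
        [task setLaunchPath: @"/usr/bin/gcc"];
        [task setArguments: [NSArray arrayWithObjects: @"-c", file, nil]];
        [task launch]; // returns immediately; all four compiles run at once
        [tasks addObject: task];
        [task release];
    }

    for(NSTask *task in tasks)
        [task waitUntilExit];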

For multithreading, concentrate on message passing and operations. Multithreading is never easy, but these go a long way toward making it simpler and less error-prone. Good OO design will also help a lot here. It's vastly easier to multithread an app that has already been decomposed into simple objects with well-defined interfaces and loose coupling between them.