mikeash.com: Friday Q&A 2009-09-04: Intro to Grand Central Dispatch, Part II: Multi-Core Performance

Posted at 2009-09-04 01:51 | RSS feed (Full text feed) | Blog Index
Next article: Friday Q&A 2009-09-11: Intro to Grand Central Dispatch, Part III: Dispatch Sources
Previous article: Objective-C Runtime Podcast Episode at Mobile Orchard
Tags: fridayqna gcd performance

Friday Q&A 2009-09-04: Intro to Grand Central Dispatch, Part II: Multi-Core Performance

by Mike Ash

Welcome back to Friday Q&A. Last week I discussed the basics of Grand Central Dispatch, an exciting new technology in Snow Leopard. This week I'm going to dive deeper into GCD and look at how you can use GCD to take advantage of multi-core processors to speed up computation. This post assumes that you've read last week's edition, so be sure to do that if you haven't already.

Concepts
In order to take advantage of multiple CPU cores within a single process, it's necessary to use multiple threads. (I'm ignoring multi-process concurrency, because it's unrelated to GCD.) This is just as true in the GCD world as it is in the purely threaded world. At the low level, GCD global dispatch queues are just abstractions around a pool of worker threads. Blocks on those queues get dispatched onto the worker threads as they become available. Blocks submitted to custom queues end up going through global queues and into that same pool of worker threads. (Unless your custom queue is targeted at the main thread, but you would never do that for speed purposes!)

There are essentially two ways to extract multi-core performance out of GCD: by parallelizing a single task or a group of related tasks onto one of the global queues, and by parallelizing multiple unrelated or loosely related tasks onto multiple custom queues.

Global Queues
Imagine the following loop:

    for(id obj in array)
        [self doSomethingIntensiveWith:obj];

Assume that -doSomethingIntensiveWith: is thread safe and can be run in parallel with other uses of that same method. If this is true, and array regularly contains more than one object, it's easy to use GCD to parallelize this code:

    dispatch_queue_t queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
    for(id obj in array)
        dispatch_async(queue, ^{
            [self doSomethingIntensiveWith:obj];
        });

Easy as that, you're running on multiple cores.

Of course code isn't always this nice. Sometimes you have code which manipulates an array like this, but then has to perform some work with the result:

    for(id obj in array)
        [self doSomethingIntensiveWith:obj];
    [self doSomethingWith:array];

The use of dispatch_async in the GCD example means that this doesn't work. And you can't solve it just by switching to dispatch_sync, because that will cause each individual iteration to block, destroying all parallelism.

One way to solve this problem is by using dispatch groups. A dispatch group is a way to group together multiple blocks, and either wait for them to complete or be notified once they complete. They are created using dispatch_group_create, and the dispatch_group_async function allows submitting a block to a dispatch queue and also adding it to the group. We could then rewrite this code to use GCD like so:

    dispatch_queue_t queue = dispatch_get_global_qeueue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
    dispatch_group_t group = dispatch_group_create();
    for(id obj in array)
        dispatch_group_async(group, queue, ^{
            [self doSomethingIntensiveWith:obj];
        });
    dispatch_group_wait(group, DISPATCH_TIME_FOREVER);
    dispatch_release(group);
    
    [self doSomethingWith:array];

If this work can be performed asynchronously relative to the calling code, then we can get even fancier than this, and run -doSomethingWith: in the background instead of waiting. To do this, we'll use dispatch_group_async to set a block to run when the group completes:

    dispatch_queue_t queue = dispatch_get_global_qeueue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
    dispatch_group_t group = dispatch_group_create();
    for(id obj in array)
        dispatch_group_async(group, queue, ^{
            [self doSomethingIntensiveWith:obj];
        });
    dispatch_group_notify(group, queue, ^{
        [self doSomethingWith:array];
    });
    dispatch_release(group);

Now not only will all of the work on the array objects run in parallel, but the final work will also run asynchronously with respect to the rest of the application, giving even more parallelism. Note that if -doSomethingWith: needed to run on the main thread, for example to manipulate the GUI, all you need to do is pass the main queue to dispatch_group_notify instead of a global queue.

For the synchronous case, GCD provides a nice shortcut with the dispatch_apply function. This function calls a single block multiple times in parallel and waits for it to complete, just like what we wanted:

    dispatch_queue_t queue = dispatch_get_global_qeueue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
    dispatch_apply([array count], queue, ^(size_t index){
        [self doSomethingIntensiveWith:[array objectAtIndex:index]];
    });
    [self doSomethingWith:array];

This is nice, but what about the asynchronous case? There's no asynchronous version of dispatch_apply that we can use. But we're using an API built around asynchronous invocation! We can just use dispatch_async to push the whole thing into the background:

    dispatch_queue_t queue = dispatch_get_global_qeueue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);
    dispatch_async(queue, ^{
        dispatch_apply([array count], queue, ^(size_t index){
            [self doSomethingIntensiveWith:[array objectAtIndex:index]];
        });
        [self doSomethingWith:array];
    });

Easy!

The key to this approach is identifying code which is performing identical work on many different pieces of data at once. If you ensure that the work performed is done in a thread safe manner (beyond the scope of this post) then you can replace your loops with calls to GCD in order to achieve parallelism.

In order to see a performance gain, you need to be performing a fairly substantial amount of work. GCD is lightweight and low-overhead compared to threads, but it's still somewhat costly to submit a block to a queue. The block has to be copied and enqueued, and the appropriate worker thread somehow notified. Submitting a block for every pixel in an image is probably not going to be a win. On the other hand, submitting a block for each image when converting a collection of images is probably going to be a win. The point where GCD ceases to be profitable falls somewhere in the middle. When in doubt, experiment. Parallelizing applications is an optimization, and as such you should always measure before and after to make sure that your changes helped. (And to make sure that you're making the changes in the right place!)

Subsystem Parallelism
The previous section talked about taking advantage of multiple cores in a single subsystem of your application. It can also be useful to do this across multiple subsystems.

For example, imagine an application which opens a document containing metadata. The document data itself must be parsed and converted into model objects for display, as must the metadata. However, the document data and the metadata don't interact. You could create a dispatch queue for each one, then run both in parallel. The code for each piece of data parsing would be entirely serial within itself, and thread safety is not a concern (as long as you don't have shared data between them), but they will still run in parallel.

Once the document is open, the program needs to perform tasks in response to user actions. For example, it may need to perform spell checking, syntax highlighting, word counting, autosave, and other such things. If each one of these tasks is implemented using a separate dispatch queue, they will all run in parallel with respect to each other without many of the difficulties of multithreaded programming.

By using dispatch sources, something I'll cover next week, you can have GCD deliver events directly to a custom dispatch queue. A part of your program that monitors a network socket, for example, could be given its own dispatch queue which will then allow it to run in parallel with respect to the rest of the application. And again, by using a custom queue, this module will run serially with respect to itself, simplifying programming.

Conclusion
This week we saw how to use GCD to increase the performance of your applications and take advantage of modern multi-core systems. Although care must still be taken when writing parallel applications, GCD makes it easier than ever to take advantage of all available computing power.

That wraps up this week's Friday Q&A. Come back next week for the next part in the continuing series on GCD, when I will talk about dispatch sources, GCD's mechanism for monitoring internal and external events. As always, if you have a suggestion for a topic to cover, please post it in the comments or e-mail it directly to me.

Did you enjoy this article? I'm selling whole books full of them! Volumes II and III are now out! They're available as ePub, PDF, print, and on iBooks and Kindle. Click here for more information.

Comments:

Nick at 2009-09-05 16:40:28:

Excellent, much more helpful than the apple dev website(link to GCD is broken atm)

Nat at 2009-09-08 18:51:22:

This series is shaping up great; it's highly evocative. Thanks.

Going back to the first part, can you elaborate on use of GCD in lieu of locks? A setter which returns immediately and does its work asynchronously seems like a race condition waiting to happen.

I presume I'm missing something -- all threaded programming sort of seems like a race condition waiting to happen -- but I was surprised to see that option presented matter-of-factly.

mikeash at 2009-09-08 21:03:04:

A good question. Maybe I'll expand on it in a future edition of Friday Q&A if you want to call this a topic suggestion, but here's the short version.

A thread safe API needs to follow three rules:

1) Synchronize all access to shared data. In other words, always take a lock (or something equivalent, like using a serial dispatch queue) before touching any shared state.

2) Ensure that the visible state of the object is self-consistent at all times. E.g. you have separate firstName and lastName variables, you should always update them both within a single critical region so that a change from "Charlie Chang" to "Mohammed Ali" doesn't result in some poor caller getting a full name of "Mohammed Chang".

3) Ensure that the visible state of the object is temporally consistent from any given thread. In other words, if you set, and then get (and no other thread jumped in the middle) the get should reflect the set.

(I may have forgotten some rules, since this is off the cuff, but consider these to be at least a minimum.)

Using locks around shared data gives you 1 and 3 automatically. #2 you have to work at by making sure that all of your updates are done atomically. (Incidentally this is why the atomic attribute on ObjC @property declarations is such a bad idea; it fools you into thinking that you're safe but does nothing to achieve #2.)

So what do dispatch queues give us?

#1 is taken care of because the dispatch queue acts just like a lock if you ensure that all shared data access occurs on the queue.

#2 is up to you, just like it is with locks. Ensure that your object's internal state is consistent every time you exit a block on your dispatch queue.

#3 is probably what you're worried about. However, the use of a serial queue, and the fact that GCD guarantees that queues execute in FIFO order, ensures that this is always taken care of. Your setter executes asynchronously, this is true. However, it is guaranteed to execute before any subsequent operation. The danger would be in having a getter, enqueued afterwards, run before the setter has a chance. But this can't happen; if the setter hasn't run yet, the getter will be blocked until it has, at which point the getter will see a temporally consistent value.

mikeash at 2009-09-08 21:26:30:

I should also note that having the setter run asynchronously means you need to ensure that you're working with thread safe data in the setter.

For example, if you have an NSString setter that makes a copy of the string (to avoid sharing an NSMutableString) then you should make the copy outside of the setter block, so that it happens synchronously with the caller. Once you have the copy, you can then enqueue a block to actually set your instance variable to point to that copy.

Put it another way: using locks means that setters happen synchronously with respect to both the caller and the callee. Using an asynchronous dispatch means that part of the setter happens synchronously only with respect to the caller, and part of it happens synchronously only with respect to the callee. For many things this makes no difference, but for some things you will have to split code between the two regions. This is similar to how the synchronous getter had to call autorelease outside the dispatch block to ensure that it happened on the correct thread.

jamiehardt at 2009-09-10 07:05:24:

Just to add to the general thread-safety convo -- the basic idea is that your code should be "reentrant," which means if process 1 enters your block of code, and then process 2 enters your block of code with a different set of arguments, they should both be able to play back all of the instructions in that block without messing with each other's business.

Think of your code block (in a selector implementation, or a function, or whatever) as a recipe and the scope it runs in is a kitchen. If you were to give this recipe to two blind cooks at the same time, does the recipe give them enough information so they don't accidently set the oven to 900 degrees, or one adds flour to a cake that's already iced, because you store a finished cake and the mixing bowl on the same counter? A "lock" in this instance is like a step in the recipe that says "feel the counter, if it is occupied, wait until it isn't, otherwise proceed." This way, two cooks can work in the kitchen simultaneously and share the counter -- the first will get it first, but the second only has to wait until the cook is done with the counter, and then the first cook will "relinquish" it and take over the oven, meaning the second can get to work with the counter and then wait for the oven to be available.

Also on a more concrete note, there's a bit of reading you should do about the reentrancy of the various Apple APIs, because not all of them are, and putting many Cocoa API calls inside an async block will fail horribly -- mikeash has already gone into this a little. AppKit is basically not reentrant: bits of it are, but anything inheiriting from NSResponder or NSView isn't, and that's most of AppKit. The Foundation API is, except for things like NSArchiver, the mutable collection classes, and host of other things. So keep in mind that if you are trying to write thread-safe code, not only does it have to be internally thread-safe, but anything you call must be thread-safe as well.

http://developer.apple.com/mac/library/documentation/Cocoa/Conceptual/Multithreading/Introduction/Introduction.html

jamiehardt at 2009-09-10 07:12:05:

Hmm, maybe I'm confused... Need to do more reading.

mikeash at 2009-09-10 16:41:31:

Do not confuse thread safe with reentrant. They are somewhat similar but markedly different concepts.

What you're describing is actually thread safety. Multiple independent threads of control, executing simultaneously (possibly time-sharing CPUs, but conceptually simultaneously) in the same block of code. If that block of code is thread safe, then operations complete as expected in this circumstance. If it is not thread safe, then the threads interfere with each other and Bad Things Happen. The common way to make a block of code thread safe is to wrap it in a lock, which allows only one thread into unsafe areas at a time.

Reentrancy is very different. Click your click, then in the table of contents click "Appendix A: Thread Safety Summary". Notice how there are separate sections listing thread-safe classes and reentrant classes.

Reentrancy is allowing the same thread of control to call a block of code while that block of code is already being executed farther up the stack.

As an example, take one of the reentrant classes listed on that page: NSNotificationCenter. Imagine that some code posts a notification, which is then received by your code. One stack frame up from your code, NSNotificationCenter is sitting in the middle of some of its code. From within your notification handler, you decide to add a new notification observer, calling back into NSNotificationCenter even though it's still in the middle of executing.

Reentrant means that NSNotificationCenter can handle this situation.

Imagine if NSNotificationCenter took an NSLock at the beginning of every method, and relinquished it at the end. It would be thread safe. But the scenario I outlined above would deadlock. It would not be reentrant.

Another, less common scenario where reentrancy is important is in signal handlers. A signal handler is executed by essentially hijacking a thread in your application and making it run the signal handler. Whatever that thread was doing at the time is just paused for the time being. Lots and lots of thread-safe APIs are completely unsafe to call from a signal handler, because those APIs achieve thread safety through locks, and the thread that was hijacked to run the signal handler may well be holding onto one of those locks, leading to a deadlock.

To put it simply, reentrancy is basically about being thread-safe when all but one of the threads are blocked for the duration, whereas normal thread safety is about being safe when all of your threads are actively executing.

Most of what you say is correct, as long as you think "thread safe" every time you read "reentrant".

jamiehardt at 2009-09-10 20:22:53:

Thank you for posting on GCD, there's still very little practical lit on it on the webs and these demonstrations are much more edifying than Apple's.

Stuart Carnie at 2009-10-01 18:32:38:

Great post Mike - thanks for the information!

Karty at 2013-11-20 06:59:09:

I don't understand why the below code is blocking. Here I'm downloading a list of images and once all the images are downloaded the tableview should be reloaded. As it needs to update the UI I dispatch this async in the main queue. But this code blocks the UI.



dispatch_queue_t queue = dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0);

    dispatch_async(dispatch_get_main_queue(), ^{

        dispatch_apply([self imageList].count, queue, ^(size_t index) {

            [self downloadImageFromRemotePath:[[_imageList objectAtIndex:index]];

        });

            [[self tableView] reloadData];

    })

cristik at 2014-05-30 20:54:45:

Karty, is downloadImageFromRemotePath: using NSURLConnection or NSURLDownload? If yes, are you using the sync or async calls? If async, what kind of work are you doing within the delegate methods?

Andrew Rahn at 2014-09-23 16:11:17:

Mike -- you have a typo above. Search for dispatch_get_global_qeueue (extra e in queue).

Comments RSS feed for this page

Add your thoughts, post a comment:

Spam and off-topic posts will be deleted without notice. Culprits may be publicly humiliated at my sole discretion.

Code syntax highlighting thanks to Pygments.

Name:
The Answer to the Ultimate Question of Life, the Universe, and Everything?
Comment:
	Formatting: `<i> <b> <blockquote> <code>`.
	NOTE: Due to an increase in spam, URLs are forbidden! Please provide search terms or fragment your URLs so they don't look like URLs.