Friday Q&A 2009-08-28: Intro to Grand Central Dispatch, Part I: Basics and Dispatch Queues
by Mike Ash  

Welcome back to Friday Q&A. This week's edition lines up with Apple's release of Snow Leopard, so I'm going to take this opportunity to open up the discussion on previously NDA'd technologies and talk about some of the cool stuff now available in Snow Leopard. For this week I'm going to start what I plan to be an ongoing series on Grand Central Dispatch, a topic suggested by Chris Liscio.

What Is It?
Grand Central Dispatch, or GCD for short, is a low level API which introduces a new way to perform concurrent programming. For basic functionality it's a bit like NSOperationQueue, in that it allows a program's work to be divided up into individual tasks which are then submitted to work queues to run concurrently or serially. It's lower level and higher performance than NSOperationQueue, and is not part of the Cocoa frameworks.

In addition to the facilities for parallel execution of code, GCD also provides a fully integrated event handling system. Handlers can be set up to respond to events on file descriptors, mach ports, and processes, to timers and signals, and to user-generated events. These handlers are executed through the GCD facilities for concurrent execution.
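I'll cover the event system in detail later in the series, but as a rough taste, a repeating one-second timer source delivering its handler to the main queue might look something like this (a minimal sketch; the interval and leeway values are just examples):

    dispatch_source_t timer = dispatch_source_create(DISPATCH_SOURCE_TYPE_TIMER, 0, 0,
                                                      dispatch_get_main_queue());
    dispatch_source_set_timer(timer, DISPATCH_TIME_NOW, 1 * NSEC_PER_SEC, 0);
    dispatch_source_set_event_handler(timer, ^{
        NSLog(@"tick");
    });
    // sources are created suspended, so they must be resumed before they deliver events
    dispatch_resume(timer);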

GCD's API is heavily based around blocks, which I talked about in previous Friday Q&A's, first to introduce the basics of blocks and then to discuss the practical aspects of using blocks in real-world code. While GCD can be used without blocks, by using the traditional C mechanism of providing a function pointer and a context pointer, it's vastly easier to use and ultimately more capable, from a practical standpoint, when used with blocks.
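To make the difference concrete, here's a minimal sketch of the same job submitted both ways; dispatch_async_f is the function-pointer counterpart of dispatch_async, and queue stands in for whatever dispatch queue you're using:

    // Plain C mechanism: an explicit context pointer carried alongside a function pointer.
    static void WorkFunction(void *context)
    {
        printf("%s\n", (char *)context);
    }
    
    dispatch_async_f(queue, "hello from a function pointer", WorkFunction);
    
    // Blocks mechanism: the "context" is simply whatever the block captures.
    dispatch_async(queue, ^{
        printf("hello from a block\n");
    });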

For documentation on GCD, start with man dispatch on a Snow Leopard machine.

Why Use It?
GCD offers many advantages over traditional multi-threaded programming:

  1. Ease of use: GCD is much easier to work with than threads. Because it's based around work units rather than threads of computation, it can take care of common tasks such as waiting for work to finish, monitoring file descriptors, executing code periodically, and suspending work. The blocks-based APIs make it extremely easy to pass context between different sections of code.
  2. Efficiency: GCD is implemented in a lightweight manner which makes it practical and fast to use GCD in many places where creating dedicated threads is too costly. This ties into ease of use: part of what makes GCD so easy to use is that for the most part you can just use it, and not worry too much about using it efficiently.
  3. Performance: GCD automatically scales its use of threads according to system load, which in turn leads to fewer context switches and more computational efficiency.
Dispatch Objects
Although pure C, GCD is built in an object-oriented style. GCD objects are called dispatch objects. Dispatch objects are reference counted, much like Cocoa objects. The dispatch_retain and dispatch_release functions can be used to manipulate the reference count of dispatch objects for the purposes of memory management. Note that unlike Cocoa objects, dispatch objects do not participate in garbage collection, so you will have to manage GCD objects manually even if you have GC enabled.
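The rules mirror Cocoa's: anything you create or retain, you must release. A minimal sketch (the queue label here is just an example):

    dispatch_queue_t queue = dispatch_queue_create("com.example.worker", NULL);
    dispatch_retain(queue);  // take an extra reference before handing the queue off
    // ...
    dispatch_release(queue); // balances the retain
    dispatch_release(queue); // balances the create; the queue can now be destroyed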

Dispatch queues and dispatch sources (more on what these are later) can be suspended and resumed, can have an arbitrary context pointer associated with them, and can have a finalizer function associated with them. For more information on these facilities, see man dispatch_object.
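As a rough sketch of those facilities together (the label, context, and cleanup function are invented for illustration):

    static void FreeQueueContext(void *context)
    {
        free(context); // invoked with the context once the queue's last reference goes away
    }
    
    dispatch_queue_t queue = dispatch_queue_create("com.example.suspendable", NULL);
    dispatch_set_context(queue, strdup("some per-queue state"));
    dispatch_set_finalizer_f(queue, FreeQueueContext);
    
    dispatch_suspend(queue); // newly submitted blocks wait; an already-running block finishes
    // ...
    dispatch_resume(queue);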

Dispatch Queues
A fundamental concept in GCD is that of the dispatch queue. A dispatch queue is an object which accepts jobs and which executes them in the order in which they arrive. A dispatch queue can either be concurrent or serial. A concurrent queue will execute many jobs simultaneously, as appropriate for system load, much like NSOperationQueue. A serial queue will only execute a single job at a time.

There are three main types of queues in GCD:

  1. The main queue: Analogous to the main thread. In fact, jobs submitted to the main queue execute on the main thread of the process. The main queue can be obtained by calling dispatch_get_main_queue(). Since the main queue is inherently tied to the main thread, it is a serial queue.
  2. Global queues: Global queues are concurrent queues shared through the entire process. Three global queues exist: a high, a default, and a low priority queue. Global queues can be accessed by calling dispatch_get_global_queue and telling it which priority you want.
  3. Custom queues: Custom queues (GCD has no specific name for these, so "custom" is my term) are queues created with the dispatch_queue_create function. These are serial queues which only execute one job at a time. Because of this, they can be used as a synchronization mechanism, much like a mutex in a traditional threaded program.
Creating Queues
If you want to use a custom queue, you'll have to create one. To do this, just call dispatch_queue_create. The first parameter is a label, which is purely for debugging purposes. Apple recommends using reverse-DNS naming to give the queue a unique name, like "com.yourcompany.subsystem.task". These names show up in crash logs, can be queried from the debugger, and will help a lot when trying to see where things went wrong. The second parameter is an attribute argument which is currently unsupported, so pass NULL.
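For example (the label here is just a placeholder; dispatch_release is covered above under Dispatch Objects):

    dispatch_queue_t queue = dispatch_queue_create("com.example.myapp.imagecache", NULL);
    // ...submit work to the queue...
    dispatch_release(queue); // release it when you're done with it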

Submitting Jobs
Submitting a job to a queue is easy: call the dispatch_async function, and pass it a queue and a block. The queue will then execute that block when it's that block's turn to execute. Here is an example of executing some long-running job in the background using a global queue:

    dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
        [self goDoSomethingLongAndInvolved];
        NSLog(@"Done doing something long and involved");
    });
dispatch_async returns immediately, and then the block will execute asynchronously in the background.

Of course, it's not really very useful to perform an NSLog when the work is done. In a typical Cocoa application, you probably want to update a part of your GUI, and that in turn means running code on the main thread. You can easily accomplish this by using nested dispatches, with the outer one performing the background work, and then from within the background block dispatching onto the main queue, like this:

    dispatch_async(dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_DEFAULT, 0), ^{
        [self goDoSomethingLongAndInvolved];
        dispatch_async(dispatch_get_main_queue(), ^{
            [textField setStringValue:@"Done doing something long and involved"];
        });
    });
There is also a dispatch_sync function, which does the same thing but which waits for the block to complete before returning. In conjunction with the __block type qualifier, this can be used to get a value back from the executing block. For example, you may have some code running on a background thread (or better yet, a non-main dispatch queue) which needs to get a value from a GUI control. You can do this easily by using dispatch_sync and dispatch_get_main_queue:
    __block NSString *stringValue;
    dispatch_sync(dispatch_get_main_queue(), ^{
        // __block variables aren't automatically retained
        // so we'd better make sure we have a reference we can keep
        stringValue = [[textField stringValue] copy];
    });
    [stringValue autorelease];
    // use stringValue in the background now
It can be better to use a more asynchronous programming style, however. Rather than block background processing to fetch the GUI value, you can use nested blocks to terminate background processing, execute your fetch on the main thread, and then simply submit further processing back in the background. You can write code like this instead:
    dispatch_queue_t bgQueue = myQueue;
    dispatch_async(dispatch_get_main_queue(), ^{
        NSString *stringValue = [[[textField stringValue] copy] autorelease];
        dispatch_async(bgQueue, ^{
            // use stringValue in the background now
        });
    });
Depending on your needs, myQueue could be a custom queue or it could just be one of the global queues.

Replacing Locks
Custom queues can be used as a synchronization mechanism in place of locks. In traditional multi-threaded programming, you might have an object which is designed to be usable from multiple threads. In order to accomplish this, it protects all accesses to shared data using a lock, which you might find in an instance variable:

    NSLock *lock;
Then accesses look like this:
    - (id)something
    {
        id localSomething;
        [lock lock];
        localSomething = [[something retain] autorelease];
        [lock unlock];
        return localSomething;
    }
    
    - (void)setSomething:(id)newSomething
    {
        [lock lock];
        if(newSomething != something)
        {
            [something release];
            something = [newSomething retain];
            [self updateSomethingCaches];
        }
        [lock unlock];
    }
Using GCD, you can replace the instance variable with a queue:
    dispatch_queue_t queue;
In order to be used as a synchronization mechanism, the queue must be a custom queue, not a global queue, so you would initialize it using dispatch_queue_create. You would then wrap all code accessing shared data in dispatch_async or dispatch_sync:
    - (id)something
    {
        __block id localSomething;
        dispatch_sync(queue, ^{
            localSomething = [something retain];
        });
        return [localSomething autorelease];
    }
    
    - (void)setSomething:(id)newSomething
    {
        dispatch_async(queue, ^{
            if(newSomething != something)
            {
                [something release];
                something = [newSomething retain];
                [self updateSomethingCaches];
            }
        });
    }
Note that dispatch queues are extremely lightweight so it's entirely reasonable to use them just as often as you would use a lock.

At this point you may be asking, this is all well and good, but what's the point? I just switched code from one mechanism to another mechanism that looks pretty much the same. Why would you do this?

There are actually several advantages to the GCD approach:

  1. Parallelism: Notice how -setSomething: uses dispatch_async in the second version of the code. This means that the call to -setSomething: will return right away, and then the bulk of the work will happen in the background. This could be a significant win if updateSomethingCaches is a costly operation and the caller will be doing something processor intensive as well.
  2. Safety: It's impossible to accidentally write a code path that doesn't unlock the lock using GCD. In normal locked code it's not unusual to inadvertently put a return statement in the middle of the lock, or conditionalize the exit, or something equally unfortunate. With GCD, the queue always continues to run and you can't help but return control to it normally.
  3. Control: It's possible to suspend and resume dispatch queues at will, which cannot easily be done with a locks-based approach. It's also possible to point a custom queue at another dispatch queue, making it inherit the attributes of that other queue. Using this, the queue's priority can be adjusted by pointing it at a different global queue, and the queue can even be made to execute its code on the main thread if that were required for some reason (see the sketch just after this list).
  4. Integration: The GCD event system integrates with dispatch queues. Any events or timers that the object needs to use can be pointed at the object's queue, causing the handlers to automatically run on that queue, making them automatically synchronized with the object.
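To illustrate the control point above, here's a minimal sketch (the label is made up); dispatch_set_target_queue is the call that points one queue at another:

    dispatch_queue_t queue = dispatch_queue_create("com.example.throttled", NULL);
    
    // Run this queue's blocks at low priority by targeting the low-priority global queue.
    dispatch_set_target_queue(queue, dispatch_get_global_queue(DISPATCH_QUEUE_PRIORITY_LOW, 0));
    
    // Temporarily hold back further blocks, then let them run again.
    dispatch_suspend(queue);
    // ...
    dispatch_resume(queue);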

Conclusion
Now you know the basics of Grand Central Dispatch, how to create dispatch queues, how to submit jobs to dispatch queues, and how to use queues as a substitute for locks in multithreaded programs. Next week I'll show you techniques for using GCD to write code which performs parallel processing to extract more performance out of multi-core systems. And in the coming weeks, I'll discuss more of GCD in depth, including the event system and queue targeting.

That wraps up this week's Friday Q&A. Come back next week for more GCD goodness. And just because I have a subject mapped out for a few weeks doesn't mean I don't need your suggestions. Quite the contrary: Friday Q&A is driven by your suggestions, and the more I have, the better posts I'll be able to make when I get to the end of this series. If you have a suggestion for a topic to cover, please post it in the comments or e-mail it directly to me.

Comments:

Mike -

Great article, thank you. You say in your article that GCD is lower-level and higher-performance than NSOperationQueue. I was under the impression that NSOperationQueue on Snow Leopard actually uses GCD and the overhead it adds on top of GCD is relatively trivial.

Is my understanding incorrect? Is there a significant performance benefit to using GCD directly in Cocoa applications rather than using NSOperationQueue?

Thanks!
Jeff
You're correct, Jeff. NSOperationQueue on SL uses GCD as an implementation detail, and the overhead is pretty small.
I don't know just how much overhead it is. I'm sure Chris knows what he's talking about. You still have the overhead of creating and dealing with ObjC objects, though, even if it's fairly directly implemented in terms of GCD. Also things that NSOperationQueue offers that GCD doesn't, like dependency tracking, can't come for free. That said, I would assume that if you have existing code that already uses NSOperationQueue and works well, there is very little reason to change it.
Do we get queued accessor methods automatically when we @synthesize atomic properties, or do we need to write them by hand?

If it's the former, that would be a pretty nice automatic perceptual speedup just for recompiling in Snow Leopard.

If the latter, well I guess we'll see a new version of Accessorizer pretty soon :)
I'm sure that @synthesized atomic accessors still use normal locks to control access. There would be no reason to use a queue instead. The benefit from using a queue instead of a lock is not speed, but rather the ability to write better code. Since you're not writing the code either way, there's little point.

Note that despite my example, accessors are still almost always the wrong place to enable thread safety. They are far too granular in most cases for this to be a good place to do that. It really is beyond me why Apple made properties atomic by default, given this. You would generally protect an entire subsystem's interface with a lock or queue, and then have free access to all of that subsystem's objects' properties without locking or other synchronization within that.
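To sketch what I mean (all of the names here are invented): a subsystem might funnel every public operation through one private serial queue, so the work inside never needs per-property locking:

    - (void)addItem:(Item *)item
    {
        dispatch_async(subsystemQueue, ^{
            // Everything the subsystem does happens on this one serial queue,
            // so touching its objects' properties in here needs no further locking.
            [items addObject:item];
            [self recomputeTotals];
        });
    }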
The issue with @synchronized() is not one of speed vs. locks, but one of having to set up exception handling so that the lock can be unlocked in the face of an exception.

With GCD & queues, the behavior with an exception is undefined since the execution of the block could happen on any thread.

So, you'd still need an exception handler if you wanted to generically replace @synchronized()'s lock with a queue, and that is the main source of the performance hit.

BTW: If anyone reading this has ideas for improving & extending Blocks, GCD, or anything else, please file a bug via http://bugreport.apple.com/
Hey Mike,

Thanks for your (as usual) well written post -- I've come to really enjoy them.

This one comes in particularly handy, as I decided to convert something I'm working on to SL last night and started digging through my notes & WWDC videos to refresh myself on how GCD worked -- while the information is there to be had, this is a nice concise summary.

One thing that I've never seen data on is the actual overhead of GCD -- at WWDC it was listed as 'small', but what is the break-even point? Is there a figure of merit for how much work you need to perform before throwing it on a background queue is a win?

-John
However, it's worth noting that it's also very easy to hang yourself when using dispatch queues as locks. As an example:


dispatch_queue_t q1 = dispatch_queue_create("com.example.q1", NULL);
dispatch_queue_t q2 = dispatch_queue_create("com.example.q2", NULL);

dispatch_sync(q1, ^{
    dispatch_sync(q2, ^{
        dispatch_sync(q1, ^{
            printf("Help, I've deadlocked and I can't get up!\n");
        });
    });
});

dispatch_release(q2);
dispatch_release(q1);


Obviously this structure is contrived, but it's extraordinarily easy to create if you start using queues around setters/accessors for interconnected objects. And just to make people aware, trying to check dispatch_get_current_queue() is not safe against this problem. If you take out q2, it is, but that's a dangerous assumption.
John McLaughlin: I have this vague memory from WWDC of the presenter saying that starting a new pthread was only a win when the work to be done was equivalent to the overhead of at least 10,000 C function calls, and that the break-even point with a GCD block was only 100. However, I can't even remember which WWDC this was, or how current this is, or even if the numbers are right.

Paul: This is a fine point. For this reason you should use the _sync variant as little as possible, and do as little as possible within the block when you do use it.
Indeed, using _sync() everywhere can get you into trouble; I was just commenting for anyone who sees dispatch_sync() as a tempting replacement for @synchronized(), which will inevitably lead to pain, even if you always use dispatch_sync() for getters and dispatch_async() for setters.

I definitely agree: anything in a dispatch block should be as small as possible, or else you'll probably wind up in trouble pretty quick.
I spent a little time tonight and wrote up my results

http://loghound.com/about/blog2/index.php?id=7996637473678888874

Basically, in my simple experiment it took thousands (~2500) of method calls to equal the overhead of any sort of backgrounding approach.

GCD appeared to be slightly faster (150us vs 200us) but they were all pretty long compared to just doing the work directly.

Your GCD test is kind of screwed up. Your other tests are "run some stuff in the background and then finish up on the main thread". Your GCD test is "run some stuff in the background, and then push back to the main thread a block which finishes up in the background". It's redundant and doing more work than the others.

You're also testing two separate mechanisms simultaneously: backgrounding a task, and then bringing it back to the main thread. This is a common pattern and thus a useful measurement, but it would also be interesting to see what it costs to only do the backgrounding.

Finally, you're measuring latency but not throughput, which is commonly going to be more important for real-world tasks.
I was thinking the same thing as Paul, as it's reminiscent of a similar problem in Erlang. Erlang uses independent asynchronous processes in about the same way as one might use objects in an OO language; this would be similar to using a single queue for handling all methods of an ObjC object. Say process/object/queue A asks proc/obj/q B to do something and return an answer (that is, synchronously), but B needs to query A for something (which could be as simple as calling a getter, which must be a synchronous call): we deadlock! A is waiting for B to finish, and B asks A something, but A is busy waiting for B.

Your examples get me really excited! GCD seems so simple yet it's so powerful! Thanks for another great Q&A.
Mike, any chance you could turn off the truncation for RSS feeds? mikeash.com is blocked at my place of employment.
Joachim: Do you know how Erlang solves that problem? Is it just a matter of "don't do it"?

Jim: Try this URL:

http://www.mikeash.com/pyblog/rss.py?mode=fulltext

I made it after a similar request but the fellow never told me if it worked well for him so I never got around to making it public.
@mikeash

You're right about the GCD example; I had the order of execution reversed (first the main thread and then the background thread). That was a mistake, and not surprisingly, fixing it didn't change the result (150us)

I updated the post to show the costs to do three cases (see table in the middle)

1) The case I examined (push work off to the background and then have it update in the main thread)
2) The time to just launch case #1 above and return
3) The time to push work off to the background and then never call back the main thread with the result.

What surprised me was that case #3, while being 1/2 the time of case #1 for the performSelector/NSThread approach, was significantly more efficient for the blocks/GCD case (a factor of 7 times faster, not the expected 2)

Finally on latency/throughput -- I was trying to answer a very specific question -- how much 'work' would you have to incur before considering pushing the work off to a background task.

My use case was designed around the idea of being called from the event loop and needing to eventually update the UI on the main thread.

Anyway thanks again for the good post.
 
The fact that GCD is so much more efficient for background-and-forget than for background-and-main-thread should not be surprising, and is exactly what I expected to see. The reason for this is that, in a Cocoa application (or anything using CFRunLoop), the main thread is not under the control of GCD. There is obviously some mechanism which allows CFRunLoop to cooperate with GCD in order to make the main loop function, but ultimately you're still going through CFRunLoop. This means that the backgrounding part is all done in efficient GCD-land, but coming back to the main thread ultimately mirrors the other, slower techniques even though you're only using the GCD API.

As for latency/throughput, I submit that throughput is still the more important variable. You would never background a single-threaded task that needed to update the GUI in a swift manner, because there's no advantage over simply performing that task directly in the main thread. The reason to use this pattern is to allow the main thread to do other work in the mean time, which means that it's throughput that counts. As an example, imagine that you have a task which takes 50us to execute directly, 75us to roundtrip through GCD, and GCD can process through a group of these tasks at an average of 25us apiece. It makes sense to shove this stuff through GCD in this case, even if the roundtrip time is worse.

The one case where latency would really matter is if you have an extremely common task which normally takes an extremely short amount of time but occasionally takes a very long time, such that you want to background it so it doesn't lock up the GUI. In this case, pushing all the short ones onto GCD could lead to a significant slowdown. However, for something like this, you could play it highly conservative and only background it if you anticipate it taking more than, say, 50ms. Of course, with your experimental results we know that 50ms is a highly conservative number, so that's certainly useful.
@mikeash: As far as I understand it, it's "don't do it", yes. I still haven't managed to wrap my brain around Erlang enough to write good apps in it, but I think the recommendation is to just make sure you pass all the relevant state as arguments, and/or keep the state in two places (which I just can't agree with).
I think in the Erlang case that Joachim brings up, you can't really phrase a synchronous getter in Erlang the way you would in C because "message sends" are always asynchronous. Erlang has no way of saying "a = [otherObject processData]" in such a way that the runtime will wait on the assignment.

In Erlang, where we would write this as something like "A = OtherObject ! {processData}", A will be immediately assigned to "processData" (which is just an interned string, like a Ruby Symbol). The calling object is supposed to obtain the result of the processData message by waiting for a response from OtherObject itself, which may come at any time, in any order with other calls -- the waiting itself is also concurrent with any other work the Process may be undertaking. Erlang has primitives for setting up send/receive patterns like this, and setting timeouts for when a Process doesn't get a response.
Interesting stuff. Thanks for the explanation.

If I've understood it correctly, a rough (I said "rough"!) equivalent of the Erlang convention would be to write "getters" that take a block as a parameter and invoke it asynchronously when it's done, kind of like this:

- (void)getProcessData: (dispatch_queue_t)callbackQueue: (void (^)(NSData *))callbackBlock {
    dispatch_async(myQueue, ^{
        NSData *localData = [self processData]; // synchronous, non-thread-aware getter
        dispatch_async(callbackQueue, ^{
            callbackBlock(localData); // hand the value to the caller's block on its queue
        });
    });
}


Then call it like so:

...code...
[otherObject getProcessData: myQueue : ^(NSData *processData){
    ...use processData...
    ...remainder of method here...
}];
...careful, code here runs before the processData code...


I don't think that this is practical for everyday accessors, but could be an interesting pattern to follow for certain cases.
Sort of. The "callback block" would be one of the pattern conditions inside another process's "receive" block, which works like a big "socket received data" event handler with a case statement for exactly what came down the wire.

Your "careful" note wouldn't be necessary, because message sends can't change the environment they originate from. It's simply impossible for a block of Erlang code to "request" data from another process in a blocking manner. All it can do is send bytes down the pipeline, and do X when it receives bytes into its mailbox. In Erlang programs, though I warn you I'm pretty inexperienced in this as well, the pattern of code is more likely to start with an "input" process (or processes -- because they share nothing, any one process can be made into hundreds to improve multiprocessing), which hands execution off to various workers, and then off to various presenters. You don't see the same "Russian-doll" style of calling as in a procedural language, where a runloop frames a program event and the case of the event jumps into a function, which jumps into another, on and on, and then returns out and out and out back to the original scope before moving on. Interactions between processes in Erlang, besides the program arguments, are essentially stateless.
s/program arguments/function arguments/
Note that my ObjC blocks code was meant to illustrate how you could somewhat copy Erlang's style for coping with the deadlock problem, rather than to illustrate how Erlang itself works. Sorry if that was unclear. In any case I appreciate the additional information on all of it.
Seems worth pointing out that the named High/Normal/Low Priority global queues are NOT serial, and can't be used for synchronization as in your last example. Only application created named queues and the main queue can be so used.

It's also possible that a future version of GCD may add the ability for applications to create their own named concurrent dispatch queues.

Your example confused the heck out of me at first, because I assumed that all queues were concurrent (or else, what's the point!), and had to go to the docs to clear that up!
Well, the very first time I mention the concept of global queues, I say, "Global queues are concurrent queues shared through the entire process," and at the first mention of custom queues I say, "These are serial queues which only execute one job at a time."
Great article
