mikeash.com: just this guy, you know?

Posted at 2008-11-30 23:36 | RSS feed (Full text feed) | Blog Index
Next article: RAOperationQueue, an open-source replacement for NSOperationQueue
Previous article: Key-Value Observing Done Right
Tags: nsoperationqueue osbug
Don't Use NSOperationQueue
by Mike Ash  

I've been discussing this on cocoa-dev but wanted to make a separate post about it to get the word out further. NSOperationQueue is broken on Leopard and will cause crashes. Don't use it!

The Bug

The bug can be demonstrated very easily with this code:

    #import <Foundation/Foundation.h>
    
    @interface Tester : NSObject
    {
        NSOperationQueue *_queue;
    }
    
    - (void)test;
    
    @end
    
    @implementation Tester
    
    - (id)init
    {
        if((self = [super init]))
        {
            _queue = [[NSOperationQueue alloc] init];
            [_queue setMaxConcurrentOperationCount:1];
        }
        return self;
    }
    
    - (void)test
    {
        NSInvocationOperation *op = [[NSInvocationOperation alloc]
    initWithTarget:self selector:_cmd object:nil];
        [_queue addOperation:op];
        [op release];
    }
    
    @end
    
    int main(int argc, char **argv)
    {
        [NSAutoreleasePool new];
    
        NSMutableArray *testers = [NSMutableArray array];
        int i;
        for(i = 0; i < 10; i++)
            [testers addObject:[[[Tester alloc] init] autorelease]];
    
        for(Tester *tester in testers)
            [tester test];
    
        while(1) sleep(1000);
    }

On my machine this will die within 10 seconds, throwing this exception:

    *** -[NSInvocationOperation start]: receiver has already started or finished

Which is thrown with this backtrace:

    #0  0x96480ff4 in ___TERMINATING_DUE_TO_UNCAUGHT_EXCEPTION___ ()
    #1  0x9207ee3b in objc_exception_throw ()
    #2  0x92db74de in -[NSOperation start] ()
    #3  0x92db7112 in __runop ()
    #4  0x902ae1f7 in _pthread_wqthread ()
    #5  0x902ae0aa in start_wqthread ()

Note that the bug appears to be hardware-dependent. A bunch of people have tried this code and nobody has ever made it crash on a PowerPC machine yet, so it seems that it's Intel-only.

The bug is also very rare. I discovered it in some audio processing code that was enqueueing roughly 500 NSOperations per second. It would take about an hour to crash. This was, as you can imagine, a whole lot of fun to debug.

The crash will still happen even if the above code is modified to enqueue NSOperations separately rather than doing that unusual trick of having one NSOperation enqueue a new one. Leaving off the call to setMaxConcurrentOperationCount: also doesn't help. Indeed, just about the only thing that seems to help is to only have a single NSOperationQueue running at a time. Since you can't prevent framework code from running its own NSOperationQueue, this is an untenable solution.

The Cause

NSOperationQueue works by managing a thread pool. The pool is global, which is to say that it's shared among all NSOperationQueues in a process. NSOperationQueue uses a private API to examine system load and spawn more worker threads or kill excess ones as needed to keep the system fully utilized without too much overhead.

From what I was able to determine from examining a crashed program, it appears that there's a race condition in the worker thread pool. They dequeue NSOperations from a set of internal queues, and it appears that when the timing is just wrong, the same NSOperation can get dequeued by two worker threads simultaneously. Since this is never supposed to happen, havoc ensues.

What You Can Do About It

Unfortunately the answer is, "not much". The bug is firmly in Apple's code and at this point it's doubtful that they'll fix it in Leopard. Your options are:

  1. File a bug. Apple probably won't fix it for Leopard, but if enough people complain maybe they'll change their mind. If you do file a bug, you can reference my bug, which is bug ID 6332143.
  2. Wait for 10.6.. It appears likely that 10.6 will not share this bug. If you can wait for 10.6 to ship, writing your software to require 10.6 is an easy "fix" for this bug.
  3. Go back to older concurrency models. Switch to raw threads and you'll (mostly) only have your own bugs to contend with, not Apple's. Depending on your needs, it may not be difficult to implement an NSOperationQueue workalike which supports the subset of NSOperationQueue's capabilities that you need. For the software which led me to find this bug, I wrote my own queue subclass which offered prioritized access to a single worker thread in about a day, and made it lockless to boot.
  4. Eliminate concurrency. Sometimes NSOperation is used just as an optimization to take advantage of multiple cores, or to avoid blocking a thread. In those cases it may be reasonable to simply drop back to a slower serialized method or to simply block until the work is complete.
This bug is really unfortunate, as NSOperation/NSOperationQueue present a fairly nice API (although far too dependent on KVO) and there's no decent workaround and a low probability of a fix before Snow Leopard ships. But at least now you know about it!

Did you enjoy this article? I'm selling whole books full of them! Volumes II and III are now out! They're available as ePub, PDF, print, and on iBooks and Kindle. Click here for more information.

Comments:

I'm curious if this is also an issue on the iPhone OS since it also support NSOperationQueue. I ran the test both on the Simulator and the actual device and I wasn't able to reproduce this crash. Who knows if it's because the bug is rare or if it's Intel only.

This is also worrisome because there are a lot of "New to Cocoa" iPhone developers and I've seen a few use NSOperationQueue because of it's simplicity.
I suspect that the iPhone may be immune, either because of ARM or because it only has one CPU core.

I'm surprised that it didn't crash in the simulator, though. That uses your regular CPU and, as far as I understand it, the regular Mac OS X Foundation libraries. Are you able to produce the crash in a regular Mac OS X program? Not everybody has been able to, even on Intel machines (although more than enough have been able to that it's obvious the bug is real and pretty widespread), so maybe your setup is just somehow immune.
Hi Mike.

I haven't used NSOperationQueue although I've been thinking about it which was why I (a) read your post, and (b) became somewhat depressed at your conclusions.

Anyway I've nothing to add except to ask if you think that also filing the bug on OpenRadar (http://openradar.appspot.com/) might help raise it's priority? My apologies if it's already there, I couldn't see it.

Regards,

Matt

After some more testing, I was able to get it to crash the iPhone simulator. It takes about 40 seconds for me and it doesn't happen every time.

Still haven't been able to crash the actual iPhone. Hopefully it doesn't affect ARM.
I posted it to OpenRadar (http://openradar.appspot.com/6332143). While I think it may help make more people aware of the problem, I doubt that Apple will care at all.

As for iPhones, I suspect that the bug either doesn't happen on ARM or doesn't happen on single-core machines (or both!) so it's probably safe there.
Mike Ash is right. In my tests:

PowerPC dual core is OK.

Using a single NSOprationQueue is OK, but you can't guarantee that there is exactly one, since who knows what Apple's libs do.

Intel multi-chip (more than 2 cores) is easy to crash, as Mike says.

Using addDependency: to specify that Op B should not run until after Op A completes does not help.

Using separate classes for the NSOperations does not help.

In case anyone is coming here later on and wondering about alternatives, the replacement I wrote for Rogue Amoeba has now been open sourced:

http://www.rogueamoeba.com/utm/2008/12/01/raoperationqueue-an-open-source-replacement-for-nsoperationqueue/

It's not a 100% replacement, but it can be very useful in some circumstances.
Too bad this didn't affect the iPhone, as that would certainly get it fixed sooner.
I have a whole bunch of soon to be released code that relies heavily on NSOperationQueue. I started cursing when I saw this post and then replaced the operation in the above code with one of the operations I use very heavily - (Its a custom operation inheriting from NSOperation). It's been running without a crash for the last 10 minutes(the original code crashed).

So, is it an NSOperationQueue bug or an NSInvocationOperation bug?
My investigation indicates that the bug is that a single NSOperation gets dequeued and executed twice, so definitely an NSOperationQueue bug. As to why you're not seeing it....

Your operation may not detect that it's being run twice. I'd recommend adding a counter ivar:

int32_t runCount

And then doing an atomic increment plus assert at the top of your operation code:

assert(OSAtomicIncrement32(&runCount) == 1)

This will either blow up when it runs twice, or prove that it's not happening.

Also, it may be much more rare since you're doing actual work. How long does your operation take to run? I believe that each operation basically has a small chance of triggering the bug. My testing indicates that this chance is something on the order of 0.000005 per operation. (This is a very rough estimate.) If your operations take enough time to execute such that running a couple hundred thousand of them takes longer than you've let it run, that could explain it as well.

It could also be that your code is somehow causing NSOperationQueue to run slightly differently and avoid the crash altogether.

And lastly it's certainly possible that my analysis of the bug is wrong.
One question: how is it that Apple won't fix this in 10.5? I mean, some of the discussion here seems to imply that NSOperationQueue is actually used by some Apple frameworks (or, at least, that it might be), which would mean that those frameworks will crash so long as this bug remains unfixed. Apple releases all kinds of other bugfixes for older versions of OS X (at least, they're still releasing updated for 10.4 almost a year after 10.5 shipped, if Software Update is to be believed) so I don't see any reason that they couldn't release a fix for NSOperationQueue to 10.5, even after 10.6 ships.

What am I missing?
Just to be clear, the claim that Apple won't fix it for Leopard is entirely my own personal opinion. But here's what leads me to say this:

- My guess is that this is not something which will be fixed by adding an ampersand or something. It seems to be a somewhat more fundamental infrastructure problem which will require a good deal of work to fix, which reduces the probability of getting it fixed. This is pure speculation.

- It's extremely rare. The program in which I discovered this spawns off several hundred NSOperations per second. (I should say spawned, past tense, as it now spawns several hundred RAOperations per second.) Even with this it took an hour to crash on average. A more typical use of NSOperation, from what I've seen, is to do tasks which take multiple seconds each, which enormously increases the probable time-to-crash.

- I discovered this bug about a month ago and as far as I've been able to tell, I am the first. (Nobody I've talked to has said they were familiar with it, and bug report with Apple is still open, although admittedly Apple sometimes leaves bug reports open for years for no reason at all.) This means that it's not being hit often enough for people to notice. That I discovered it doesn't really change that. If this were some sort of catastrophic data loss bug then it might get prioritized, but if Leopard has survived without a fix until now, it'll keep on surviving.

- That it hasn't been found leads me to believe that Apple's use of this API in frameworks is pretty limited.

- Apple has not displayed any particular hurry to fix bad bugs in the past. For one amusing example, try using NSKeyedArchiver to archive the string @"$null" and watch what you get back out. I haven't really tracked that bug but it has probably been there since 10.3. A run through the release notes will turn up a bunch of bugs that never got fixed on 10.X but were fixed for 10.X+1. A big difference is that those bugs tend to be reliable and easy to reproduce, whereas NSOperationQueue breaks only rarely. I don't know if that makes it more or less likely that Apple will fix it.

Anyway, it's just my opinion from watching Apple work that they probably won't fix this until 10.6, but I surely could be wrong.
If you have are select or premier ADC member maybe fill a DTS incident!

I think this would be the best solution to get apples attention on the bug.

I did contact DTS and they pretty much blew me off. Very disappointing to be sure.
CocoaDev.com has no pages yet on NSOperation or NSOperationQueue. But this should really all go on a new wiki there for better exposure.

And perhaps a gzipped xcode project would make it easier for others. Then we could just download and run the test on our respective hardware configuration(s) - :)

Its a bit strange that the selector being called is itself in the following line:
NSInvocationOperation *op = [[NSInvocationOperation alloc]
    initWithTarget:self selector:_cmd object:nil];

It does not make a difference if using a different selector?

I added the method:
- (void)test2
{
    NSLog(@"test2");
}

and changed the NSInvocationOperation to:
NSInvocationOperation *op = [[NSInvocationOperation alloc]
    initWithTarget:self selector:@selector(test2) object:nil];

and no longer get the crash. I did get the crash in less than 10 seconds if I left the selector as _cmd
Forgive me if I am missing something but after further looking at this you can see that the example code provided to reproduce this problem results in an infinite loop because of the selector being used is _cmd. That is why this example crashes and will depend on the memory/speed and some other factor of your computer as to how soon it will crash.

Having the selector call itself is intentional. It's to run the operations indefinitely. Changing this to a stub method, and having another method "pump" operations into the queue does not fix the problem.

The reason your altered test case does not crash is because it does not do anything after the initial 10 operations have executed. Run Activity Monitor and watch your test program's CPU usage sit at zero. Put a loop around the initial invocation pump and watch it crash.

The example does not crash due to an infinite loop. It is very clear that it is crashing due to an uncaught exception thrown by NSOperation. It is much more difficult to discover, but just as clear, that this is due to a race condition in NSOperationQueue's background worker threads.

If you have any evidence that the crash is due to running out of memory or such then please post it, but this sort of baseless guessing game is worse than useless.
Mike thanks for the posting.

I run a Mac Pro with 2 x 8 core processors. As a newcomer to NSOperationQueue and Cocoa in general I would have spent days looking for the cause of this problem.

Naturally I had to reproduce it my own way.
Significantly I used just one queue from main and found that this crashed also!

My tasks did nothing other than run and return. I stuck a counter on it and caused the queue to clear after it had been loaded by n tasks and waited periodically for the queue to clear, all to no avail. For queue lengths of 1000 sometimes it would crash after 20 thousand iterations sometimes 250 thousand exactly as you said it would.

I then decided to give the tasks something to do, just sum the integers from 1 to 1000 and try some different queue lengths. For queue lengths of 10 it would go to 2 million iterations and then hang. For queues of 100 it would go to 3 million and crash with a malloc "unable to allocate" message. Replacing the summation with a sleep(0.01) it got to almost 3 million iterations before collapsing again with the malloc message.

My feeling is that although these latter cases are collapsing for ostensibly different reasons they are nevertheless related. Is is possible that task memory is not being cleared?

Thanks for the alternative code.
Best wishes
Julius
Your malloc errors are almost certainly due to a memory management problem. To double check, run Activity Monitor and watch the process's RPRVT number. Does it grow endlessly? If so then that is your problem. Keep in mind that you need an inner autorelease pool to destroy temporary objects if you're looping on the main thread, and also double-check all of your retains and releases to make sure you aren't outright leaking anything.
Yes I think it must be something to do with memory management. I am running with GC required. Is there anything else I need to do? The test object is stored temporarily in an NSMutable Array which is replaced every time I allow the queue to clear.
Viz my main looks like this

int main(int argc, char **argv)
{
   NSMutableArray * myArray = [[NSMutableArray alloc]init];
   NSOperationQueue * _queue = [[NSOperationQueue alloc] init];
   [_queue setMaxConcurrentOperationCount:NSOperationQueueDefaultMaxConcurrentOperationCount];
   
   NSInteger iterationsPerStep = 100;
   NSInteger interimIterations = 1000;
   NSInteger i;
   for(i=0; i < 10000;i++) {
      NSInteger j;
      for(j = 0; j < interimIterations; j++) {
         NSInteger k;
         for(k = 0; k < iterationsPerStep; k++) {
            Tester * myTesterObj = [[Tester alloc]init];
             [myArray addObject:myTesterObj];
             NSInvocationOperation *op = [[NSInvocationOperation alloc]initWithTarget:myTesterObj
                                                         selector:@selector(test)
                                                           object:nil];
             [_queue addOperation:op];
            
         } // end for k
         [_queue waitUntilAllOperationsAreFinished];
         myArray = [[NSMutableArray alloc]init];
      } // end for j
      NSLog(@"iterations so far = %d",(i+1) * iterationsPerStep * interimIterations);
   } // end for i
   
}


The extra iteration loop is just a convenience to facilitate printing the numbers.
My understanding is that the GC should remove all those now free floating no longer used arrays?
Julius
Oh and to answer your question, yes the memory does keep growing.
Sorry Mike, it took a while for the penny to drop - the GC will only collect after I've exited from the loop so ....
I'll recast the problem and take a look at that asap.
Julius
Yeah, the Cocoa GC is not good with tight loops spilling out all kinds of temporary objects. You'll want to add a call to the collector every so often, or switch to manual memory management.

Also note that your code isn't going to crash according to what's been discovered about the problem so far. The bug only shows up when there are at least two queues in the program. Of course it would be interesting to try it with just one and see if it still crashes, but if you get your memory usage under control and you still don't crash, try adding a second queue.
I've done two things, well three, four actually. First I moved the whole main procedure into an object that runs under NSApplicationMain so I could be sure GC applied to it. The first time I ran it it did not crash even after 36 million iterations. The second time it crashed almost straight away with "[NSInvocationOperation start]: receiver has already started or finished"

Next I removed the array of Task objects and replaced it with just a single task object and thereafter all of the NSInvocationOperation objects were created using a call to that object. As in the previous test I cause a wait for the NSOperationQueue to clear every 1000 NSInvocationOperation objects. It crashes with the "receiver has already started or finished" message after 200,000 iterations.

I changed the queue length to 10 and it ran ok until I stopped it at 18 million iterations. Ditto for a queue length of 100. But for a queue length of 500 I did not get past 2 million iterations (4,000 clearings of the queue) before the dreaded "receiver has already started or finished" message.

Unless due to another error it seems then that the problem can also occur with a single queue. The time before errors is erratic, and sometimes if one is running other programs in the foreground these will hang after a few millions of iterations

Finally I ran the same program but with two queues which I waited to clear after putting 10 operations into each of them. Collapse occurred after 640000 iterations with the "receiver has already started or finished" diagnostic.

To make it possible for others to check these results the sources of these XCode 3.1.2 programs running on Mac Pro Leopard 10.5.6 with Garbage Collection Required can be found at:

http://animatedpaint.co.uk/cocoa-examples-zip/DontUseArrayOfTestObj.zip

http://animatedpaint.co.uk/cocoa-examples-zip/NewDontUseOneQueue.zip

http://animatedpaint.co.uk/cocoa-examples-zip/NewDontUseTwoQueues.zip

Best wishes
Julius
Great work. It seems that you have conclusively demolished the idea that NSOperationQueue is safe as long as there is only one queue in the application, or indeed in any circumstances whatsoever.

I distilled your example code to its essence, removing the GUI and the dependence on garbage collection, and came up with this:

#import <Foundation/Foundation.h>

@interface MyOperation : NSOperation {} @end
@implementation MyOperation
- (void)main
{
    usleep(10000);
}
@end

int main(int argc, char **argv)
{
    NSAutoreleasePool *outerPool = [[NSAutoreleasePool alloc] init];
    
    NSOperationQueue *queue = [[NSOperationQueue alloc] init];
    unsigned count = 0;
    while(1)
    {
        NSAutoreleasePool *innerPool = [[NSAutoreleasePool alloc] init];
        
        MyOperation *op = [[MyOperation alloc] init];
        [queue addOperation:op];
        [op release];
        
        if(++count % 50000 == 0)
            NSLog(@"%u operations", count);
        
        [innerPool release];
    }
    
    [outerPool release];
    return 0;
}


This crashes reliably on my Mac Pro, often after just a few seconds. Sometimes it takes much longer. But it's not doing any of the "funny" stuff that other people have previously complained about or suggested might be behind the problem. I'm not sure how I missed it before, perhaps because I was stuck with the idea of testing self-repeating operations instead of simply enqueueing them separately, but the result here seems pretty solid.
@mikeash at 2009-02-01 21:09:55:

If you want to test the command-line version with garbage collection enabled:
  * invoke objc_startCollectorThread() at the beginning of the main function,
  * invoke -[NSAutoreleasePool drain] instead of -[NSAutoreleasePool release],
  * optionally, invoke objc_clear_stack() inside the while loop.

"Garbage Collection Programming Guide > Garbage Collection for Cocoa Essentials > Foundation Tools"
http://developer.apple.com/documentation/Cocoa/Conceptual/GarbageCollection/Articles/gcEssentials.html

I ran Mike's latest code, in the message dated 2009-02-01 21:09:55 (one operation queue, no garbage collection), for over 24 hours, minus a few sleeping hours since I forgot it was running and got no crashes. It logged over 210 million operations before I gave up. I've got an early 2006 Intel Core 2 Duo Mac Mini, 1.66 GHz, 2 GB RAM.

[Session started at 2009-02-02 15:27:01 -0800.]
2009-02-02 15:27:16.516 NSOperationCrash[24649:10b] 50000 operations
2009-02-02 15:27:29.030 NSOperationCrash[24649:10b] 100000 operations
2009-02-02 15:27:40.879 NSOperationCrash[24649:10b] 150000 operations
...
...the next day...
...
2009-02-03 14:52:08.011 NSOperationCrash[24649:10b] 209300000 operations
2009-02-03 14:52:26.563 NSOperationCrash[24649:10b] 209350000 operations
2009-02-03 14:52:44.770 NSOperationCrash[24649:10b] 209400000 operations

Then I ran Mike's original code, from the top of this page, which creates the multiple operation queues, and it crashed after 1:15 min:sec.

If anyone knows how I might make my Mac Mini more "sensitive", I'll give it a try and post the results.
Mike, your code from 2009-02-01 21:09:55 doesn't crash on my MacBook Pro. The initial code crashes *some of the time* (which is worse than crashing reliably) when compiled for i386, doesn't seem to crash when compiled for x86_64 or PPC.

Model Name: MacBook Pro 15"
Model Identifier: MacBookPro2,2
Processor Name: Intel Core 2 Duo
Processor Speed: 2.33 GHz
Number Of Processors: 1
Total Number Of Cores: 2
L2 Cache: 4 MB
Memory: 4 GB
Bus Speed: 667 MHz
Boot ROM Version: MBP22.00A5.B07
System Version: Mac OS X 10.5.6 (9G55)
Kernel Version: Darwin 9.6.0

       /System/Library/Frameworks/Foundation.framework/Versions/C/Foundation (compatibility version 300.0.0, current version 677.22.0)
MD5 (Foundation) = 9402bcd843fa783b44879ad65f6f09be
I ran into this discussion by chance, and i tested the codes that Mike posted at 2009-02-01 21:09:55, then boom! these are all i got:

2009-03-01 12:50:08.726 a.out[755:10b] 50000 operations
2009-03-01 12:50:21.473 a.out[755:10b] 100000 operations
2009-03-01 12:50:31.658 a.out[755:2803] *** Terminating app due to uncaught exception 'NSInvalidArgumentException', reason: '*** -[MyOperation start]: receiver has already started or finished'
2009-03-01 12:50:31.659 a.out[755:2803] Stack: (
    2520822027,
    2483650107,
    2520821483,
    2520821546,
    2490044589,
    2490043730,
    2500748303,
    2500747970
)
Trace/BPT trap

I tested the codes with my imac with 2GHz Intel Core 2 Duo and Mac OS X 10.5.6. I wish my post will be helpful.
Looks like this will be fixed in 10.5.7.

"Fixed problem with NSOperationQueue and PThreads"
<< http://news.worldofapple.com/archives/2009/03/21/apple-seeds-new-build-of-mac-os-x-1057-seed-notes/
Yeah, I'm hopeful, but I'm not going to call it won until somebody actually tries and verifies the fix. After all, the first version wasn't supposed to be buggy either.
Hey Mike,

10.5.7 is out now. I'd love to hear if this issue was resolved in that fix.

Thanks very much for your time and this great article!

      -- Cyrus
I just updated, and am testing. Post to follow with my results, whatever they are, when I have them.
Awesome! I look forward to the results!
Hey Mike, any update on this?

Thanks!

        -- Cyrus
I'm getting this on an iPad.
getting this issue on ipad 1 but not on ipad2 :(
Mike, Thanks very much for this analysis. I have been pulling my hair out for 2 days and did finally realize I am hitting this bug on an iPad IOS 9.3.2 and on the simulator in El Capitan.

PS, I do not get the bug when I add to the operation queue with waitUntilFinished:NO. Maybe just timing.

Comments RSS feed for this page

Add your thoughts, post a comment:

Spam and off-topic posts will be deleted without notice. Culprits may be publicly humiliated at my sole discretion.

Name:
The Answer to the Ultimate Question of Life, the Universe, and Everything?
Comment:
Formatting: <i> <b> <blockquote> <code>.
NOTE: Due to an increase in spam, URLs are forbidden! Please provide search terms or fragment your URLs so they don't look like URLs.
Code syntax highlighting thanks to Pygments.
Hosted at DigitalOcean.