mikeash.com: just this guy, you know?

Posted at 2008-01-12 21:01
Tags: cocoa leopard objectivec performance
Performance Comparisons of Common Operations, Leopard Edition
by Mike Ash  

By popular demand, I have re-run my Performance Comparisons of Common Operations on the same hardware but running Leopard.

You can see the original article here. I used the exact same program as before, which you can get here. There is one change to the hardware, in that the computer now has 7GB of RAM instead of 3GB. I do not expect this to influence much. It still has the 2.66GHz CPUs and the stock 250GB hard drive. Here's the new chart:

| Name | Iterations | Total time (sec) | Time per (ns) |
|---|---:|---:|---:|
| IMP-cached message send | 1000000000 | 0.7 | 0.7 |
| C++ virtual method call | 1000000000 | 1.1 | 1.1 |
| Integer division | 1000000000 | 2.4 | 2.4 |
| Objective-C message send | 1000000000 | 4.9 | 4.9 |
| Float division with int conversion | 100000000 | 0.9 | 9.0 |
| Floating-point division | 100000000 | 0.9 | 9.2 |
| 16 byte memcpy | 100000000 | 2.9 | 28.9 |
| 16 byte malloc/free | 100000000 | 5.6 | 56.0 |
| NSInvocation message send | 10000000 | 0.8 | 77.3 |
| NSObject alloc/init/release | 10000000 | 2.9 | 290.5 |
| NSAutoreleasePool alloc/init/release | 10000000 | 3.6 | 357.7 |
| 16MB malloc/free | 100000 | 0.4 | 4485.2 |
| NSButtonCell creation | 1000000 | 6.6 | 6640.5 |
| Read 16-byte file | 100000 | 2.1 | 21219.3 |
| Zero-second delayed perform | 100000 | 4.2 | 42211.8 |
| pthread create/join | 10000 | 0.6 | 56633.2 |
| NSButtonCell draw | 100000 | 6.9 | 69400.5 |
| 1MB memcpy | 10000 | 1.2 | 123001.8 |
| Write 16-byte file | 10000 | 4.9 | 492040.5 |
| Write 16-byte file (atomic) | 10000 | 8.7 | 867380.7 |
| NSTask process spawn | 1000 | 6.1 | 6096478.5 |
| Read 16MB file | 100 | 2.9 | 28619582.6 |
| Write 16MB file (atomic) | 30 | 10.7 | 356168718.8 |
| Write 16MB file | 30 | 10.9 | 361767086.5 |
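
For readers who want to see how the "Time per (ns)" column relates to the other two, here is a minimal C++ sketch of a timing loop. It is not the benchmark program used for these numbers; `BenchResult`, `run_benchmark`, and `ops_per_second` are hypothetical names made up for illustration, and the loop overhead is included in the measurement.

```cpp
#include <cassert>
#include <chrono>
#include <cstdint>

// One benchmark run, mirroring the columns of the table above.
struct BenchResult {
    std::uint64_t iterations;  // "Iterations"
    double total_sec;          // "Total time (sec)"
    double per_ns;             // "Time per (ns)"
};

// Hypothetical harness (an assumption, not the original program):
// runs `op` the requested number of times and derives the time per operation.
template <typename Op>
BenchResult run_benchmark(std::uint64_t iterations, Op op) {
    assert(iterations > 0);
    auto start = std::chrono::steady_clock::now();
    for (std::uint64_t i = 0; i < iterations; ++i) {
        op();
    }
    auto end = std::chrono::steady_clock::now();
    double total_sec = std::chrono::duration<double>(end - start).count();
    // Time per operation = total time / iterations, converted to nanoseconds.
    double per_ns = total_sec * 1e9 / static_cast<double>(iterations);
    return BenchResult{iterations, total_sec, per_ns};
}

// Converts a per-operation time in nanoseconds into operations per second.
double ops_per_second(double per_ns) {
    assert(per_ns > 0.0);
    return 1e9 / per_ns;
}
```

Something like `run_benchmark(1000000, [] { /* operation under test */ })` would fill in one row. As a sanity check on the units, `ops_per_second(56633.2)` for the pthread create/join row comes out to roughly 17,700 per second.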

As you would expect, the low-level stuff that doesn't touch the OS is pretty much unaffected, and any changes are well within the margin of error. Things like Objective-C message sends really can't get much faster than they already are, and are unchanged. However, there are some interesting changes from the Tiger numbers.

Sending a message with NSInvocation on Tiger took about 160ns per message, but on Leopard it took only 77ns. That's over a factor of two faster. It's still over ten times slower than a straight message send, but it's much better.

Allocating and destroying Objective-C objects has apparently become significantly slower. [[[NSObject alloc] init] release] on Tiger took under 190ns, but on Leopard it took 290ns. This is still well into ignorable territory in most situations, but taking 50% more time is not good. I'm not sure what changes would have been made to make this slower. The small malloc/free test was pretty much unaffected, so it's apparently something specific to Objective-C.

The 16MB malloc/free test ran in less than half the time on Leopard. At this size, malloc/free hit the kernel directly, so presumably there is some syscall or kernel memory management optimization at work.
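
To make the small-versus-large comparison concrete, here is a rough C++ sketch that times malloc/free pairs of a given size. It is an assumption-laden illustration, not the original test: `time_malloc_free_ns` is a hypothetical helper, and whether a 16MB request actually goes straight to the kernel depends on the allocator in use.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <cstdlib>

// Hypothetical helper: times `iterations` malloc/free pairs of `size` bytes
// and returns the average cost of one pair in nanoseconds.
double time_malloc_free_ns(std::size_t size, int iterations) {
    assert(size > 0 && iterations > 0);
    auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < iterations; ++i) {
        // The volatile pointer keeps an optimizing compiler from eliding the pair.
        // The memory itself is never touched, so no pages are faulted in.
        void* volatile p = std::malloc(size);
        assert(p != nullptr);
        std::free(p);
    }
    auto end = std::chrono::steady_clock::now();
    double total_ns = std::chrono::duration<double, std::nano>(end - start).count();
    return total_ns / iterations;
}
```

Calling it once with `16` and once with `16u * 1024u * 1024u` gives the two cases from the chart. The absolute numbers will differ from machine to machine and allocator to allocator.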

Delayed performs are somewhat worse on Leopard, going from 30µs each to 42µs.

NSButtonCell drawing got a bit faster, although not significantly. Leopard probably has various drawing optimizations, since that's something that the OS does quite a lot of.

Pthread creation is about twice as fast on Leopard. It's still annoyingly slow, but now it's something you can do about 20,000 times a second instead of 10,000 times a second.

And lastly, the atomic 16-byte file write is a bit over 10% faster. Perhaps there are some filesystem optimizations affecting the atomic swap process that this uses.
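
The Tiger-to-Leopard changes discussed above can be worked out mechanically. The sketch below uses only figures quoted in this post: the Tiger values are the rounded ones from the prose (so "under 190ns" is treated as 190), and the Leopard values come from the 32-bit chart. `Comparison`, `ratio`, `percent_change`, and `quoted_comparisons` are hypothetical names for illustration.

```cpp
#include <cassert>
#include <string>
#include <vector>

struct Comparison {
    std::string name;
    double tiger_ns;    // approximate, as quoted in the prose above
    double leopard_ns;  // from the 32-bit Leopard chart
};

// Leopard time as a fraction of Tiger time; below 1.0 means Leopard is faster.
double ratio(const Comparison& c) {
    assert(c.tiger_ns > 0.0);
    return c.leopard_ns / c.tiger_ns;
}

// Percent change from Tiger to Leopard; negative means Leopard is faster.
double percent_change(const Comparison& c) {
    return (ratio(c) - 1.0) * 100.0;
}

// The three operations for which the prose gives a Tiger figure.
std::vector<Comparison> quoted_comparisons() {
    return {
        {"NSInvocation message send", 160.0, 77.3},
        {"NSObject alloc/init/release", 190.0, 290.5},
        {"Zero-second delayed perform", 30000.0, 42211.8},
    };
}
```

With these inputs, NSInvocation comes out about 52% faster, object allocation about 53% slower, and delayed performs about 41% slower. Because the Tiger inputs are rounded, treat the percentages as approximate.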

Update: The above was compiled as 32-bit, and it was pointed out that it might be good to show 64-bit numbers as well:

| Name | Iterations | Total time (sec) | Time per (ns) |
|---|---:|---:|---:|
| IMP-cached message send | 1000000000 | 0.8 | 0.8 |
| C++ virtual method call | 1000000000 | 1.1 | 1.1 |
| Integer division | 1000000000 | 2.4 | 2.4 |
| Objective-C message send | 1000000000 | 8.6 | 8.6 |
| Float division with int conversion | 100000000 | 0.9 | 9.0 |
| Floating-point division | 100000000 | 0.9 | 9.0 |
| 16 byte memcpy | 100000000 | 2.9 | 29.3 |
| 16 byte malloc/free | 100000000 | 5.3 | 52.7 |
| NSInvocation message send | 10000000 | 0.8 | 81.6 |
| NSAutoreleasePool alloc/init/release | 10000000 | 1.7 | 169.5 |
| NSObject alloc/init/release | 10000000 | 1.9 | 192.6 |
| 16MB malloc/free | 100000 | 0.3 | 2924.6 |
| NSButtonCell creation | 1000000 | 6.1 | 6069.9 |
| Read 16-byte file | 100000 | 1.7 | 17382.1 |
| Zero-second delayed perform | 100000 | 3.4 | 33858.4 |
| pthread create/join | 10000 | 0.6 | 56279.8 |
| NSButtonCell draw | 100000 | 6.5 | 64812.5 |
| 1MB memcpy | 10000 | 1.2 | 122725.0 |
| Write 16-byte file | 10000 | 4.9 | 486454.7 |
| Write 16-byte file (atomic) | 10000 | 8.6 | 859274.8 |
| NSTask process spawn | 1000 | 5.6 | 5551589.2 |
| Read 16MB file | 100 | 2.8 | 27785650.2 |
| Write 16MB file | 30 | 9.2 | 305342338.2 |
| Write 16MB file (atomic) | 30 | 9.2 | 306369990.9 |

There are definitely some interesting differences here. Objective-C object allocation is back down to the Tiger numbers. I have even less of an idea why it would only be slower on Leopard 32-bit but not 64-bit. The 16MB malloc/free is even faster under 64-bit, and delayed performs are significantly faster as well. Everything else seems to be pretty much the same, with some margin of error.


Comments:

Just out of interest, which of Leopard's Objective-C runtimes were you using? I expect the 64-bit one would be faster (but still would hold off optimising for it until I actually had a speed problem)
That's a good point. The original numbers were done with 32-bit code. I've updated the post to show 64-bit numbers as well.
Woo, my demands are popular! Perhaps they’re marketable?

The link in the first paragraph is broken in the RSS feed. Probably this is some not-very-interesting issue in your software regarding URLs with queries in them.
Thanks for pointing that out. It's actually an issue with my RSS feed not knowing how to rewrite relative links in the post. I have fixed this by the simple expedient of removing the link from the first paragraph.
"There is one change to the hardware, in that the computer now has 7GB of RAM instead of 3GB. I do not expect this to influence much."


I do not share your sanguineness, I'm afraid. This could easily skew results. The kernel used to allocate certain caches as a percentage of available RAM; I don't know if it still does that, but there is still kernel overhead in managing the extra RAM. Can you rerun these results using the kernel "maxmem" (or whatever it is) boot param to temporarily lock the usable memory to 3GB?
I'll be happy to run a new test locking the memory to 3GB if it's not too involved and you can give me more specific instructions. However it will probably have to wait a week or so until after Macworld.

"I have even less of an idea why it would only be slower on Leopard 32-bit but not 64-bit"


It is possible that the 32-bit and 64-bit compilers each have their own fork of the optimizer, memory management code, etc., and that the 64-bit compiler is heavily optimized while the 32-bit one is not...
Garbage collection?

Even if you had GC off for the test, the 10.5 frameworks might still generate a few little stubs that eat up that bit of extra time. It's just a guess but it would explain why things got slower. Still doesn't explain the 32/64 question but Voolek made a good point about optimization.
Can you make a chart that has Tiger and Leopard numbers with a delta? That would make the differences more clear.
Which flags did you pass to the compiler? E.g., -Os, -O3, etc.

I only passed flags needed to compile, so things like -framework Cocoa. It was compiled with no optimization.
I recently came across this page while trying to learn Obj-C, and these metrics and your explanation of how messages are sent were very helpful. After seeing the original test, I thought the cached IMP test seemed similar to a C++ pointer to member function. I tried adding an additional test using one, and it turned out slightly faster than a virtual function call. I tried a similar test using Visual Studio, and the PTMF was about 60% SLOWER than the virtual function call (which led me to lots of interesting reading on the topic and different implementations). Since gcc seems to have one of the more sane implementations, it might be interesting to include if you choose to do any similar tests like this in the future:


    class StubClass *obj = new StubClass;
    
    void (StubClass::*func)()=&StubClass::stub;
    BEGIN( 1000000000 )
    (obj->*func)();
    END()
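
The snippet above depends on the benchmark program's `StubClass` and its `BEGIN`/`END` macros. For anyone who wants to try the idea in isolation, here is a self-contained sketch under some assumptions: this `StubClass` is a hypothetical stand-in, `stub` is assumed to be virtual (the chart has a "C++ virtual method call" row), and it bumps a counter only so the calls are observable. Plain loops replace the macros, so no timing is done here.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for the benchmark's StubClass (an assumption).
class StubClass {
public:
    virtual ~StubClass() = default;
    // Counts calls so the result can be checked; the real stub is presumably empty.
    virtual void stub() { ++calls; }
    std::uint64_t calls = 0;
};

// Calls stub() through a pointer to member function, as in the snippet above.
std::uint64_t call_via_member_pointer(StubClass& obj, std::uint64_t iterations) {
    assert(iterations > 0);
    void (StubClass::*func)() = &StubClass::stub;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        (obj.*func)();
    }
    return obj.calls;
}

// Calls stub() as an ordinary virtual call through a base pointer, for comparison.
std::uint64_t call_via_virtual(StubClass& obj, std::uint64_t iterations) {
    assert(iterations > 0);
    StubClass* p = &obj;
    for (std::uint64_t i = 0; i < iterations; ++i) {
        p->stub();
    }
    return obj.calls;
}
```

Wrapping each loop in a timer of your choice would reproduce the comparison. How the two fare relative to each other depends on the compiler's pointer-to-member implementation, which is the difference described above.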
May I suggest a new benchmark on Snow Leopard? :)
Whoops, clicked "post" too soon.

It'd be interesting to see what differences there are between GCC and Clang too.
