mikeash.com: just this guy, you know?

Posted at 2008-03-19 21:40
Tags: cocoa iphone objectivec performance
Performance Comparisons of Common Operations, iPhone Edition
by Mike Ash  

I finally got a chance to run my performance comparison code on an iPhone, so we can see just how much horsepower this little device has. I am still not able to load my own code onto the device, so I want to thank an anonymous benefactor for adapting my code to the new environment and gathering the results for me.

For comparison, you may wish to see the original Performance Comparisons of Common Operations and its followup, Performance Comparisons of Common Operations, Leopard Edition. The source code used in this test can be obtained here.

Here are the times:

Name                                  Iterations  Total time (sec)  Time per (ns)
C++ virtual method call               1000000000              80.8           80.8
IMP-cached message send               1000000000              85.4           85.4
Floating-point division                100000000              13.4          134.4
Integer division                      1000000000             139.5          139.5
16 byte memcpy                         100000000              17.6          175.7
Objective-C message send              1000000000             192.9          192.9
Float division with int conversion     100000000              19.3          193.0
NSInvocation message send               10000000              19.0         1899.0
16 byte malloc/free                    100000000             198.8         1988.4
NSObject alloc/init/release             10000000             118.8        11883.6
NSAutoreleasePool alloc/init/release    10000000             172.7        17272.9
16MB malloc/free                          100000               3.1        30754.5
Read 16-byte file                         100000              51.1       511041.3
Zero-second delayed perform               100000              67.5       674994.5
pthread create/join                        10000               8.0       802160.2
Write 16-byte file (atomic)                10000              51.5      5153943.7
Write 16-byte file                         10000              80.9      8089726.2
1MB memcpy                                 10000              81.3      8130009.1
Read 16MB file                               100             137.6   1376092573.3
Write 16MB file (atomic)                      30             143.8   4793527088.9
Write 16MB file                               30             151.2   5038515361.1

Note that this test suite is somewhat reduced compared to the original. NSTask and NSButtonCell don't exist on the iPhone, so those tests were removed. Conceivably they could be replaced with substitutes, but I didn't bother.

The first thing that stands out is the large speed difference for low-level operations compared to the Mac Pro used in the original tests. Of course I wouldn't expect a handheld device to compete against a modern desktop machine, but the contrast is still striking. The worst is the IMP-cached message send, which is over one hundred times slower on the iPhone.

It's also interesting to note that C++ virtual method calls have a better time than IMP-cached message sends. I'll assume that the difference is within the margin of error and that they are both actually the same speed. This is still an interesting result, since the C++ virtual method call involves more indirection than calling an IMP. I would guess that the ARM architecture includes an instruction which natively handles this indirection; anyone familiar with ARM care to comment?
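
To make the two call paths concrete, here is a rough C++ sketch of what is being compared. `Object`, `Selector`, and `Imp` are hypothetical stand-ins for the Objective-C types, not the code from the test suite. The virtual call has to fetch the vtable pointer and then the method address before branching, while the IMP-style call branches through a function pointer that was looked up ahead of time.

```cpp
// Stand-ins for the Objective-C types; hypothetical, for illustration only.
struct Object;
using Selector = const char*;
using Imp = int (*)(Object* self, Selector cmd);

struct Object {
    int value;
    virtual ~Object() = default;
    // Virtual call: load vtable pointer, load method address, then branch.
    virtual int get() { return value; }
};

// IMP-style function: the implicit parameters are passed by hand.
int get_imp(Object* self, Selector /*cmd*/) { return self->value; }

int call_virtual(Object* obj) { return obj->get(); }

int call_cached(Object* obj, Imp imp, Selector cmd) {
    // One indirect branch through a pointer fetched ahead of time.
    return imp(obj, cmd);
}
```

Note that a compiler which can see the dynamic type of the object may turn the virtual call into a direct call, which would hide the extra loads in a benchmark like this one.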

Another interesting pairing is integer and floating-point division. Again these appear to be the same speed on the iPhone, but floating-point division is roughly 3.5 times slower on the Mac Pro. This makes floating-point division on the iPhone merely 15 times slower.

The results also show the atomic file writes to be faster than the non-atomic ones. I have no explanation for this other than testing error, but the difference in timing for the 16-byte file is pretty huge. The 5-8ms it takes to write the 16-byte file is surprisingly large. At that size seek time should completely dominate, and flash memory has effectively no seek time, so I don't understand why this number would be so large. Perhaps the slow CPU is what makes this one so costly. The 16MB test shows a sustained write speed of about 3MB/s, which is not too bad.

The 1MB memcpy test reveals roughly 120MB/s of available memory bandwidth. I'm a bit surprised that it's this low, given the on-die RAM, but it is roughly in line with the performance of the rest of the system.

Overall, this little machine isn't going to be substituting for a Mac Pro anytime soon, but it's not bad for a pocket-sized computer with such constraints on cost, battery usage, and heat. Now if only Apple would let me put software on it.


Comments:

I find it pretty awkward to compare the results without copying them into something since they're in a different order. Could you make it consistent?
I have them sorted from fastest to slowest, which is how I prefer to present the data. If you want a matching order to compare with the numbers in the previous posts, you're one UNIX command away from alphabetic bliss.
I'm not sure how exactly an IMP-cached function call would look, but this is what a virtual function call looks like on ARM when compiled for an ARM 1176.

Assuming you have an object pointer in register r0:

; load the v-table from the object
LDR r0, [r0, #0]

; load the address of the method from the v-table
LDR r1, [r0, #8]

; call the method
BLX r1

LDR loads a 32-bit value from memory using a base-pointer-plus-offset addressing scheme.

BLX saves the return address in the link register (r14) and branches to the address contained in the register parameter. It can also change states from ARM to Thumb or Thumb to ARM. BLX will also flush the pipeline (the branch predictor probably won't be much help here).

For the function return, you copy the link register into the program counter.

You can see that it must perform two loads from memory to get the address of the method, which probably accounts for the biggest part of the overhead.

I'd be interested to see what the code looks like to call an IMP-cached method.
Thanks for such an informative post. Calling an IMP-cached method is just calling a function pointer. The only thing that's special about it is where you got the pointer and the fact that you have to manually pass the two implicit parameters, self and _cmd. The code is available from a link just above the table if you're interested in seeing it, but it sure sounds like it ought to go faster. I wonder if the compiler is smart enough to optimize the C++ virtual function call into a straight function pointer invocation in this case, and that's why it ends up being the same speed.
The extra 5ns to call the IMP-cached method is probably related to setting up the SEL parameter in the function call. Given that the period of the ARM core is ~1.5ns (I've seen reports of the core running from 620MHz - 667MHz) this is about three more cycles (up to three instructions), which seems reasonable depending on how the @selector directive is implemented.

Since the C++ method call takes ~50 cycles to execute two loads and a branch you get a sense of just how slow the memory system is.

As an aside, the ARM procedure call standard states that the first four parameters to a function are passed in registers r0-r3. If your function takes more than four parameters the additional parameters are passed on the stack (which is sloooow).

Since a C++ method has an implicit this pointer as the first parameter, you should keep C++ methods to three or fewer parameters.

Since an Objective-C method has implicit parameters for self and the selector, you should keep methods to two or fewer parameters (when speed is of the utmost concern).
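
The parameter budget described above can be seen by writing out the implicit arguments explicitly. This is a sketch with hypothetical names (`Accumulator`, `Selector`), and it assumes every argument is a single word. Under the ARM calling convention a 64-bit argument occupies two registers, so the budget shrinks faster with wider types.

```cpp
struct Accumulator {
    int base;
    // Three explicit parameters plus the implicit `this`: four word-sized
    // arguments, which is what r0-r3 can hold under the rule quoted above.
    int add3(int a, int b, int c) const { return base + a + b + c; }
};

// Roughly what the compiler sees: `this` becomes an ordinary first argument.
int accumulator_add3(const Accumulator* self, int a, int b, int c) {
    return self->base + a + b + c;
}

// An Objective-C method carries two implicit arguments (self and _cmd), so
// only two explicit word-sized parameters remain before spilling to the stack.
// `Selector` is a hypothetical stand-in for SEL.
using Selector = const char*;
int accumulator_add2_objc_style(const Accumulator* self, Selector /*_cmd*/,
                                int a, int b) {
    return self->base + a + b;
}
```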
That's a good point about the @selector directive. On the platforms I'm familiar with, I believe it ends up being treated much like a literal string constant, with some linker magic added in to fix up newly loaded modules so that common selectors get identical values. This would make the use of the directive essentially the same as using a local variable, but it could very well be using a different and more costly technique on ARM.
Regarding write speeds for 16 bytes, this might be related to the block storage nature of flash memory devices. Details can be found on Wikipedia:
http://en.wikipedia.org/wiki/Flash_memory

NAND write operations occur at a block granularity. So a 16 byte write must write the size of an entire block. I don't know what block size is used on iPhone, but the article suggests 16, 128 or 256 kilobytes.
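
If that premise holds, the cost of a tiny write is easy to estimate. The sketch below uses a hypothetical helper, `write_amplification`, and assumes every logical write is rounded up to whole blocks. It says nothing about what the iPhone's flash controller or filesystem actually does.

```cpp
#include <cstdint>

// Bytes physically written when every logical write is rounded up to whole
// blocks, divided by the bytes the caller asked to write.
double write_amplification(std::uint64_t logical_bytes, std::uint64_t block_bytes) {
    const std::uint64_t blocks = (logical_bytes + block_bytes - 1) / block_bytes;
    return static_cast<double>(blocks * block_bytes) /
           static_cast<double>(logical_bytes);
}
```

With a 128 kilobyte block, a 16-byte write would touch 8192 times as much data as requested, while a 16MB write that is a whole multiple of the block size has no such overhead.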
Let it be known that Facebooklicious (.com) is a bunch of jerkwad spammers. Here's a hint for the future, guys: I'm unlikely to delete your comment and then ridicule you if you take thirty seconds to write some original text instead of copy/pasting my own comment. I can recognize my own writing, you tremendous morons.
Trying it out on OS X 10.9.2 with a 2.6 GHz Intel Core i7, the results are insane. I don't have much experience with assembly, but it looks like clang will devirtualize that method call even at -O0. I still see calls to _objc_msgSend for the IMP-cached messaging, so I'm not sure how that is so fast.

Here are a select few (time per, in ns):

IMP-cached message send       0.1
C++ virtual method call       0.2
Objective-C message send      1.3
16 byte malloc/free          41.2
NSObject alloc/init/release 113.9
16MB malloc/free            244.2
I decided to run on an iPhone 5s running iOS 7.1 and got these timings:


C++ virtual method call     0.3
IMP-cached message send     0.4
Objective-C message send    0.8
16 byte malloc/free         134.1
NSObject alloc/init/release 191.8
16MB malloc/free            3725.5


So in some aspects, very competitive with the much more powerful desktop; in others (namely the 16MB malloc), still much slower.
