mikeash.com pyblog/performance-comparisons-of-common-operations-iphone-edition.html comments

George - 2014-05-06 00:54:51

Tue, 06 May 2014 00:54:51 GMT

I decided to run it on an iPhone 5s running iOS 7.1 and got these timings:

C++ virtual method call     0.3
IMP-cached message send     0.4
Objective-C message send    0.8
16 byte malloc/free         134.1
NSObject alloc/init/release 191.8
16MB malloc/free            3725.5

So in some respects it is very competitive with the much more powerful desktop; in others (namely the 16MB malloc), it is still much slower.

George - 2014-05-05 18:12:44

Mon, 05 May 2014 18:12:44 GMT

Trying it out on OS X 10.9.2 with a 2.6 GHz Intel Core i7, the results are insane. I don't have much experience with assembly, but it looks like clang devirtualizes that method call even at -O0. I still see calls to _objc_msgSend for the IMP-cached messaging, so I'm not sure how that is so fast.

Here are a select few (time per operation, in ns):

IMP-cached message send     0.1
C++ virtual method call     0.2
Objective-C message send    1.3
16 byte malloc/free         41.2
NSObject alloc/init/release 113.9
16MB malloc/free            244.2

mikeash - 2009-09-07 16:41:27

Mon, 07 Sep 2009 16:41:27 GMT

Let it be known that Facebooklicious (.com) is a bunch of jerkwad spammers. Here's a hint for the future, guys: I'm unlikely to delete your comment and then ridicule you if you take thirty seconds to write some original text instead of copy/pasting my own comment. I can recognize my own writing, you tremendous morons.

Michael Rondinelli - 2008-05-11 06:20:55

Sun, 11 May 2008 06:20:55 GMT

Regarding write speeds for 16 bytes, this might be related to the block storage nature of flash memory devices. Details can be found on Wikipedia:
http://en.wikipedia.org/wiki/Flash_memory

NAND write operations occur at a block granularity. So a 16 byte write must write the size of an entire block. I don't know what block size is used on iPhone, but the article suggests 16, 128 or 256 kilobytes.
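To put rough numbers on that, here is a small C++ sketch that takes the block-granularity premise above at face value. The helper names are hypothetical, and it assumes the write is aligned and that nothing (controller, filesystem cache) coalesces writes, which a real device may well do.

```cpp
#include <cstddef>

// Hypothetical helper: bytes the flash must physically write when a logical
// write is rounded up to whole blocks (assumes an aligned, uncached write).
constexpr std::size_t bytes_physically_written(std::size_t write_size,
                                               std::size_t block_size) {
    return ((write_size + block_size - 1) / block_size) * block_size;
}

// Ratio of physical bytes written to logical bytes requested.
constexpr std::size_t write_amplification(std::size_t write_size,
                                          std::size_t block_size) {
    return bytes_physically_written(write_size, block_size) / write_size;
}

// A 16 byte write against the three block sizes mentioned above.
static_assert(write_amplification(16, 16 * 1024) == 1024, "16 KB blocks");
static_assert(write_amplification(16, 128 * 1024) == 8192, "128 KB blocks");
static_assert(write_amplification(16, 256 * 1024) == 16384, "256 KB blocks");
```

Under those assumptions a 16 byte write costs somewhere between a thousand and sixteen thousand times its own size in physical writes, depending on which block size applies.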

mikeash - 2008-03-22 07:50:23

Sat, 22 Mar 2008 07:50:23 GMT

That's a good point about the @selector directive. On the platforms I'm familiar with, I believe it ends up being treated much like a literal string constant, with some linker magic added in to fix up newly loaded modules so that common selectors get identical values. This would make the use of the directive essentially the same as using a local variable, but it could very well be using a different and more costly technique on ARM.

Jeff Gilbert - 2008-03-22 02:32:52

Sat, 22 Mar 2008 02:32:52 GMT

The extra 5ns to call the IMP-cached method is probably related to setting up the SEL parameter in the function call. Given that the period of the ARM core is ~1.5ns (I've seen reports of the core running from 620MHz - 667MHz) this is about three more cycles (up to three instructions), which seems reasonable depending on how the @selector directive is implemented.

Since the C++ method call takes ~50 cycles to execute two loads and a branch you get a sense of just how slow the memory system is.
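The conversions behind those estimates can be sketched in C++ as follows. The function names are made up for illustration, and the only inputs are the 620MHz - 667MHz clock range and the 5ns and ~50 cycle figures quoted above.

```cpp
// Convert a duration in nanoseconds to core cycles at a given clock rate.
// MHz is cycles per microsecond, so dividing by 1000 gives cycles per ns.
constexpr double cycles_for(double nanoseconds, double clock_mhz) {
    return nanoseconds * clock_mhz / 1000.0;
}

// Convert a cycle count back to nanoseconds at a given clock rate.
constexpr double nanoseconds_for(double cycles, double clock_mhz) {
    return cycles * 1000.0 / clock_mhz;
}

// cycles_for(5.0, 620.0)       is about 3.1 cycles
// cycles_for(5.0, 667.0)       is about 3.3 cycles
// nanoseconds_for(50.0, 620.0) is about 80.6 ns
// nanoseconds_for(50.0, 667.0) is about 75.0 ns
```

So the extra 5ns is a little over three cycles at either end of the range, and ~50 cycles works out to roughly 75 to 81 ns.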

As an aside, the ARM procedure call standard states that the first four parameters to a function are passed in registers r0-r3. If your function takes more than four parameters the additional parameters are passed on the stack (which is sloooow).

Since a C++ method has an implicit this pointer as the first parameter, you should keep C++ methods to three or fewer parameters.

Since an Objective-C method has implicit parameters for self and the selector, you should keep methods to two or fewer parameters (when speed is of the utmost concern).

mikeash - 2008-03-21 07:33:20

Fri, 21 Mar 2008 07:33:20 GMT

Thanks for such an informative post. Calling an IMP-cached method is just calling a function pointer. The only thing that's special about it is where you got the pointer and the fact that you have to manually pass the two implicit parameters, self and _cmd. The code is available from a link just above the table if you're interested in seeing it, but it sure sounds like it ought to go faster. I wonder if the compiler is smart enough to optimize the C++ virtual function call into a straight function pointer invocation in this case, and that's why it ends up being the same speed.
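For readers who don't want to dig through the benchmark source, here is a rough analogy in plain C++. It is not the Objective-C runtime API: Object, Selector, Imp, lookup and the toy method table are all hypothetical stand-ins, meant only to show the shape of "look the pointer up once, then call it with self and _cmd passed by hand".

```cpp
#include <string>
#include <unordered_map>

struct Object;
using Selector = const char*;                              // stands in for SEL
using Imp = int (*)(Object* self, Selector cmd, int arg);  // stands in for IMP

struct Object {
    std::unordered_map<std::string, Imp> methods;  // toy method table
    int total = 0;
};

// A "method": self and _cmd arrive as explicit leading parameters.
int add_method(Object* self, Selector /*cmd*/, int value) {
    self->total += value;
    return self->total;
}

// Slow path: resolve the selector to a function pointer.
Imp lookup(Object* obj, Selector sel) {
    auto it = obj->methods.find(sel);
    return it == obj->methods.end() ? nullptr : it->second;
}

// An ordinary "message send": one lookup per call.
int send(Object* obj, Selector sel, int arg) {
    Imp imp = lookup(obj, sel);
    return imp ? imp(obj, sel, arg) : 0;
}

// IMP-cached style: one lookup, then plain function pointer calls.
int sum_with_cached_imp(Object* obj, int n) {
    Selector sel = "add:";
    Imp imp = lookup(obj, sel);
    if (!imp) return 0;
    for (int i = 1; i <= n; ++i) {
        imp(obj, sel, i);  // self and _cmd passed manually
    }
    return obj->total;
}
```

After registering the method with `obj.methods["add:"] = add_method;`, the loop in `sum_with_cached_imp` pays for the lookup once, while `send` pays for it on every call.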

Jeff Gilbert - 2008-03-21 06:17:58

Fri, 21 Mar 2008 06:17:58 GMT

I'm not sure how exactly an IMP-cached function call would look, but this is what a virtual function call looks like on ARM when compiled for an ARM 1176.

Assuming you have an object pointer in register r0:

; load the v-table pointer from the object
; (r0 keeps the object pointer so it is passed as this)
LDR r1, [r0, #0]

; load the address of the method from the v-table
LDR r1, [r1, #8]

; call the method
BLX r1

LDR loads a 32-bit value from memory using base-plus-offset addressing.

BLX saves the return address in the link register (r14) and branches to the address contained in the register parameter. It can also change states from ARM to Thumb or Thumb to ARM. BLX will also flush the pipeline (the branch predictor probably won't be much help here).

For the function return, you copy the link register into the program counter.

You can see that it must perform two loads from memory to get the address of the method which probably accounts for the biggest part of the overhead.
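The same sequence can be written out by hand in C++ to make the two dependent loads visible. This is only a sketch with hypothetical names (Shape, VTable, call_slot2); a real compiler lays out its own v-table, and byte offset 8 corresponds to slot 2 only when pointers are 4 bytes wide, as on 32-bit ARM.

```cpp
struct Shape;
using Method = int (*)(const Shape* self);

// Hand-rolled v-table: an array of function pointers.
struct VTable {
    Method slots[3];
};

struct Shape {
    const VTable* vtable;  // first word of the object
    int side;
};

int shape_id(const Shape*) { return 1; }
int shape_sides(const Shape*) { return 4; }
int shape_area(const Shape* s) { return s->side * s->side; }

const VTable square_vtable = {{shape_id, shape_sides, shape_area}};

// Equivalent of a virtual call to the method in slot 2.
int call_slot2(const Shape* obj) {
    const VTable* vt = obj->vtable;  // first load: v-table pointer from the object
    Method m = vt->slots[2];         // second load: method address from the v-table
    return m(obj);                   // indirect branch, object passed as "this"
}
```

With `Shape s{&square_vtable, 5};`, `call_slot2(&s)` ends up in `shape_area`. The second load cannot start until the first has returned the v-table pointer, which is why memory latency shows up so directly in the call overhead.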

I'd be interested to see what the code looks like to call an IMP-cached method.

mikeash - 2008-03-20 09:28:58

Thu, 20 Mar 2008 09:28:58 GMT

I have them sorted from fastest to slowest, which is how I prefer to present the data. If you want a matching order to compare with the numbers in the previous posts, you're one UNIX command away from alphabetic bliss.

Steven Fisher - 2008-03-20 06:09:10

Thu, 20 Mar 2008 06:09:10 GMT

I find it pretty awkward to compare the results without copying them into something since they're in a different order. Could you make it consistent?