mikeash.com: just this guy, you know?

Posted at 2014-07-04 13:40
Tags: fridayqna optimization swift
Friday Q&A 2014-07-04: Secrets of Swift's Speed
by Mike Ash  

Since the introduction of Swift, speed has been a key selling point. Indeed, it's right in the name of the language. It's said to be faster than dynamic languages like Python or JavaScript, potentially faster than Objective-C, and even claimed to be faster than C for certain cases. But how exactly does it do it?

While the language should allow for great performance, the current compiler is still a bit rough, and I've had a hard time getting Swift to come out ahead in any performance tests. Most of the slowness looks to come down to a lot of redundant retain/release activity being emitted by the compiler. I expect this to be fixed before too long, but in the meantime, it means that this article is going to be more about how Swift could potentially be faster than Objective-C than about how it's actually faster right now.

Faster Method Dispatch
As we all know by now, every time we write an Objective-C method call, it gets translated into a call to objc_msgSend which then handles the work of looking up the target method at runtime and invoking it. objc_msgSend is responsible for taking the selector you're sending and the object you're sending it to, and looking that up in the class method tables to figure out exactly which piece of code is supposed to handle it. It's extremely fast for what it does, but it does a lot more than is usually necessary.

Message sending is nifty because it makes no assumption about the runtime type of the target object. It could be an instance of the expression's static type; if you're sending a message to a variable declared NSView *, the object could very well be a straight NSView instance. It could also be a subclass. It could even be an instance of a completely unrelated class. You can lie to the compiler about an object's type and everything still works fine.

However, in 99.999% of Objective-C code, you don't lie to the compiler. When an object is declared NSView *, it really is either an NSView or a subclass. We need some dynamic dispatch, but the vast majority of code doesn't need the full message send dispatch. However, the nature of Objective-C is such that every call has to pay the full cost.

Consider this Swift code:

    class Class {
        func testmethod1() { print("testmethod1") }
        @final func testmethod2() { print("testmethod2") }
    }

    class Subclass: Class {
        override func testmethod1() { print("overridden testmethod1") }
    }

    func TestFunc(obj: Class) {
        obj.testmethod1()
        obj.testmethod2()
    }

In the equivalent Objective-C code, both method calls would be compiled to objc_msgSend calls and that would be the end of the story.

In Swift, the compiler is able to take advantage of the tighter guarantees provided by the language. In Objective-C, we can lie to the compiler, and the object may not be anything like the type we say it is. In Swift, we can't lie to the compiler. If we say that obj is an instance of Class, then it's either an instance of that class or of a subclass.

Instead of objc_msgSend, the Swift compiler emits code that makes a call using a vtable. A vtable is just an array of function pointers, accessed by index. That array exists as part of the class. The code emitted by the compiler for the first method call is roughly equivalent to:

    methodImplementation = object->class.vtable[indexOfMethod1]
    methodImplementation(object)

A simple array index is quite a bit faster than objc_msgSend, even with its fancy caching and finely tuned assembly code, so this is a nice performance win.

The call to testmethod2 is even nicer. Because it's declared @final, the compiler knows that nothing can override it. No matter what happens, the call will always go to the implementation in Class. Thus the compiler doesn't even need to emit a vtable lookup. It can just make a normal function call that goes straight to __TFC9speedtest5Class11testmethod2fS0_FT_T_, which is the mangled name of the testmethod2 implementation in my test case.

This isn't a huge win. objc_msgSend really is remarkably fast, and most programs don't spend that much time in it. In addition, Swift will still use objc_msgSend to send messages to Objective-C objects. Still, it should be good for a few percentage points.

More Intelligent Method Calls
The different method call semantics allow speed improvements beyond simply using faster dispatch techniques. Because the compiler can better understand where control goes, it can optimize the code better. It can potentially inline method calls or even eliminate them entirely.

In the above example, I deleted the body of testmethod2:

    @final func testmethod2() {}

The compiler is smart enough to see that the call to this method now does nothing. With optimizations turned on, it doesn't even emit the call. It calls testmethod1 and then it's done.

It can do this even with methods that aren't marked @final if it has enough information locally. For example, consider a slight variation on the above code:

    let obj = Class()
    obj.testmethod1()
    obj.testmethod2()

Since the compiler can see the creation of obj, it knows that it's an instance of Class and not a subclass. This allows it to completely eliminate the dynamic dispatch even for testmethod1, making a direct function call to the implementation.

To take an extreme case, consider code like this:

    for i in 0..1000000 {
        obj.testmethod2()
    }

In Objective-C, this code will send a million messages and take a noticeable amount of time to run. In Swift, if the compiler can see that testmethod2 does nothing, it could potentially eliminate the entire loop, meaning that this code runs in zero time.

Less Memory Allocation
If enough information is available to the compiler, it could potentially eliminate some memory allocations altogether. In code like the above, if the compiler can see the object creation and all uses and determine that the object never lives beyond the local scope, the compiler could skip heap allocation for the object and allocate it on the stack instead, which is much faster. In an extreme case, where the methods invoked on the object never actually use the object, it could skip the allocation altogether. Consider some ridiculous Objective-C code like this:

    for(int i = 0; i < 1000000; i++)
        [[[NSObject alloc] init] self];

This will send three million messages and allocate and destroy one million objects. The equivalent Swift code could, with a smart compiler, compile to nothing at all if it's able to determine that the self method doesn't do anything useful and doesn't cause anything to refer to the object beyond its local lifetime.

More Efficient Register Usage
Every Objective-C method takes two implicit parameters: self and _cmd, with the explicit parameters passed after those two. On most architectures (including x86-64, ARM, and ARM64), the first few parameters to a function are passed in registers, with the rest passed on the stack. Registers are much faster than the stack, so passing parameters in registers can be a big performance win.

The implicit _cmd parameter is rarely needed. It's really only useful when taking advantage of fully dynamic message sending, which 99.999% of Objective-C code doesn't do. Yet it occupies a scarce parameter register in all calls, leaving one fewer for real parameters. There aren't many available: ARM only uses four registers for passing parameters, x86-64 uses six, and ARM64 uses eight.

Swift omits the _cmd parameter, freeing up an additional argument register for more useful purposes. For methods that take several explicit parameters (three or more for ARM, five or more for x86-64, and seven or more for ARM64), this means a small performance win on every call due to better use of argument registers.

These are all reasons why Swift can be faster than Objective-C, but how about plain C? Aliasing is something that could allow Swift to surpass C.

Aliasing is when you have more than one pointer to the same chunk of memory. For example:

    int *ptrA = malloc(100 * sizeof(*ptrA));
    int *ptrB = ptrA;

This is a tricky situation, because writing to ptrA will affect reads from ptrB, and vice versa. This can hurt the compiler's ability to optimize.

Consider a simple implementation of the standard library memcpy function:

    void *mikememcpy(void *dst, const void *src, size_t n) {
        char *dstBytes = dst;
        const char *srcBytes = src;

        for(size_t i = 0; i < n; i++)
            dstBytes[i] = srcBytes[i];

        return dst;
    }

Copying one byte at a time is pretty inefficient. We'd generally want to copy larger chunks. SIMD instructions may allow copying 16 or even 32 bytes at a time, which would make this go much faster. In theory, the compiler should be able to analyze this loop and do that for us. However, aliasing stops that from happening. To understand why, consider this code:

    char *src = strdup("hello, world");
    char *dst = src + 1;
    mikememcpy(dst, src, strlen(dst));

With memcpy, this wouldn't be legal: you're not allowed to pass overlapping pointers like this. mikememcpy, however, will happily accept overlapping pointers; it just behaves strangely when given them.

In the first iteration of the copy loop, srcBytes[i] refers to the 'h' and dstBytes[i] refers to the 'e'. After copying this byte, the full string then contains "hhllo, world". In the second iteration, srcBytes[i] refers to the second 'h', and dstBytes[i] refers to the first 'l'. The full string after this iteration is "hhhlo, world". It continues like this, with the 'h' being copied forward byte by byte, until the full string is simply "hhhhhhhhhhhh". Not really what we wanted.

This sort of thing is why memcpy doesn't allow overlapping pointers. The memmove function is smart enough to handle overlapping copies like this, at the cost of some performance. By ignoring potential overlap, memcpy can go faster.

However, the compiler doesn't understand all of this context. It doesn't know that mikememcpy is supposed to assume that the arguments don't overlap. We'd be happy with any optimization that changed the semantics of the case where the arguments overlap, as long as it kept the non-overlapping case intact. For all the compiler knows, we want "hhhhhhhhhhhh". We need it. The code we wrote demands it. Any optimization has to keep these semantics intact, even though we don't care. This makes it tough to make this function more efficient by copying larger chunks of memory.

clang actually does an impressive job with this function. It generates code which checks for overlap and performs optimized copies when the two pointers don't overlap. The performance penalty from the compiler's lack of context ends up being small, but not zero. Those extra checks aren't entirely free.

This is a pervasive problem in C, because the language allows any two pointers of the same type (and sometimes of different types!) to alias. Most code is written with the assumption that pointers don't alias, but the compiler generally has to assume that they might. This makes C much harder to optimize, and can make C programs slower than they need to be.

It's such a common problem that the C99 standard introduced a new keyword just to deal with it. The restrict keyword tells the compiler that a given pointer won't alias, allowing for better optimizations. Adding it to mikememcpy's parameters produces better code:

    void *mikememcpy(void * restrict dst, const void * restrict src, size_t n) {
        char *dstBytes = dst;
        const char *srcBytes = src;

        for(size_t i = 0; i < n; i++)
            dstBytes[i] = srcBytes[i];

        return dst;
    }

Problem solved, right? Well, how often have you actually added the restrict keyword to your code? I'm going to guess "never" for most of you. In my case, the first time I ever actually used restrict was when I wrote the modified mikememcpy above. It gets used for extremely performance-sensitive code, but otherwise we just take the minor hit and move on.

It can show up in a lot of places you might not expect, too. For example:

    - (int)zero {
        _count++;
        memset(_ptr, 0, _size);
        return _count;
    }

As far as the compiler knows, _ptr could point to _count. Thus, it generates code which loads _count, increments it, stores it, calls memset, then loads _count again to return it. If we know that _ptr doesn't point to _count then this is wasted effort, but the compiler has to do it just in case. Compare with:

    - (int)zero {
        memset(_ptr, 0, _size);
        _count++;
        return _count;
    }

Putting the memset first eliminates the redundant load of _count. It's a small win, but it's there.

Even something as innocent as an NSError ** parameter can hit this case. Imagine a counting method that never returns an error, but the API is designed to take one just in case the underlying implementation gets more complex:

    - (int)increment: (NSError **)outError {
        _count++;
        *outError = nil;
        return _count;
    }

Once again, the compiler generates a redundant load of _count just in case outError points to it. This is an odd case, because C's aliasing rules don't allow aliasing pointers of different types, and so it should actually be safe to omit the redundant load. I assume that something about how the Objective-C type rules are implemented on top of C is defeating that. We could add restrict:

    - (int)increment: (NSError ** restrict)outError {
        _count++;
        *outError = nil;
        return _count;
    }

This generates better code. But how often do you really do that? For me, this is now the second time I have ever used this keyword.

Swift code encounters these situations much less frequently. In general, you don't take references to arbitrary instance variables, and array semantics make it impossible for them to overlap. This should allow the Swift compiler to generate better code in general, with fewer redundant loads and stores to handle the case where you might alias something.

Swift has a couple of tricks that can potentially make for faster programs than Objective-C. Less dynamic dispatch allows for faster method calls. It also allows for better optimization of the code that makes those calls, such as inlining or removing calls altogether. The way method dispatch works also frees up an extra argument register, making it faster to pass parameters into methods that take a large number of parameters. Finally, the rarity of pointers in Swift means that aliasing is much less of a problem, allowing the compiler to perform better optimizations and fewer redundant loads and stores. It remains to be seen exactly how much of a difference all of this will make in practice, but it looks promising for modest speed gains.

That's all for today. Come back next time for more Swifty goodness. In the meantime, if there's a topic you'd like to see covered (Swift or not), please send it in!



Regarding the mikememcpy method you wrote: can you suggest some implementations for multi-threaded CPUs, i.e. how we could efficiently split the data across a multi-core environment?

As you probably know, there would be a big problem applying the same trick there, because the processor caching would become inefficient.
Monomorphization is a huge win not just because it means we can skip objc_msgSend: it means that method bodies can now be inlined!
@Andy would that ostensibly increase the size of the resulting binary in exchange for speed?

>The memmove function is smart enough to handle overlapping copies like this, at the cost of some performance. By ignoring potential overlap, memcpy can go faster.

The only "cost" memmove requires is a single subtract, compare, and branch to decide whether to copy front-to-back or back-to-front. In fact, the cost is so low that memcpy is an alias of memmove on iOS and OS X and many other systems.
@John: You're definitely right. C compilers have been working through the trade-offs around that with function inlining for a very long time—this is part of why -Os is different from -O3.
objc_msgSend has changed its implementation in almost every major version of OS X so far, even though it was only about 100 bytes (less than half that now).

I fully expect that we'll be seeing little performance improvements to Swift and the OS X runtime for many years to come. OS X 10.0 wasn't exactly a speed demon, either.

That said, as someone who actually uses the dynamic nature of Objective-C fairly extensively (as opposed to simply trying to write C++ with funny syntax, which is how most people seem to be using Objective-C), it's actually kind of nice to know that I'm not leaving any performance on the table today by sticking with Objective-C for a little while longer!
The first thing I learned in optimization class was not to optimize until I learned what was expensive.

It's sad to think of trading something fun and full of runtime goodness like Objective-C for a bland language whose performance claim to fame is that it's slightly faster at calling methods.
Swift can theoretically get even better with fully-dynamic classes (e.g. no @final), once it has access control. If a class is not exported from a module, then Swift knows that all subclasses must be defined in that module, and therefore it has the information to determine if a given method ever is overridden. If it's not, it can then optimize that method as if it were @final.

Similarly, it might be able to optimize away stored property setters, if it knows there can't be any subclasses that define willSet/didSet for the property.

Along these lines, protocols might be able to benefit as well. If Swift can determine that a non-exported protocol only has one implementation, it could optimize with the assumption that every instance of the protocol is that one implementation. Granted, this is a bit of a more esoteric optimization and less likely to actually be useful (the only single-use protocols I can think of would also be @objc, and that won't benefit). But it's another example of the sort of optimization Swift can do once it can be sure it knows the complete usage of a given (non-exported) type.
Alex: I wouldn't be surprised if a single CPU core running an optimal memcpy could saturate the system's memory write bandwidth by itself, which would mean that parallelization wouldn't help at all. Even if not true, the overhead of inter-core communication would mean it wouldn't be worthwhile unless you're copying a huge amount of data, and at that point you should probably reconsider why you need to do that anyway. Virtual memory tricks where you remap pages to merely make it *look* like the data is copied are probably a better choice if you really need it. Interesting side note, OS X's realloc implementation does exactly that when you realloc a sufficiently large allocation, rather than using memcpy.

Stephen Canon: Good point, it's not a big cost to check and switch implementations as needed.

James: Personally, I really like Swift's new features and am looking forward to using it for real work, even ignoring potential speed increases. I think that we'll see more dynamic hackery become available as the language progresses. For example, the reflect function and Mirror protocol indicate that some sort of introspection is intended to be made available, and it seems likely to me that enough will be provided to replicate what Objective-C does for the cases when you really need it.
Will CoreData implementation always stay Objective-C-dependent framework because of Swift's inability (for now, I guess) to perform runtime shenanigans?
Two typos spotted: 'shuold' and 'exmple'. Thanks for another interesting read (as usual).
Kevin: Dynamic means a lot more than just "non-@final". Has anyone figured out how to do the equivalent of NSClassFromString with arbitrary Swift classes, for example? Last I heard, that was believed to be impossible.

EOF, CoreData's big brother, was rewritten in Java around 2000 as part of WebObjects. The only thing we had to do differently was an extra subclass in EnterpriseObjects because of a 1-to-many relationship. It took me a day to figure out why it wasn't working, but after that everything was the same.
If replacing C/C++ is the goal, one thing that is not talked about is what the Swift standard library should look like and encompass. Should Apple just port CoreFoundation to Swift using only structs and enums, and then have source code converters to the C library and the C++ STL? Should there be any class objects in it at all? All kinds of basic datatypes like Matrix, Complex, and Vector should be there, even though Apple is reluctant to add them.

Apple wants to open source it yet retain control of it, which is a very hard thing to do, especially when they don't want to share higher-level functionality and compatibility with other platforms. I believe this should also be set in stone by 1.0.
One more typo: ‘If we say that objc is an instance of…’ should be ‘obj’.
Douglas Hill: Thanks, clearly a case of muscle memory working against me there.
Great article. One quick question, based on your ideas around potential optimisation, should we use 'final' as much as possible to improve performance (where speed is necessary)?
I might have missed something but in this example:

for(int i = 0; i < 1000000; i++)
        [[[NSObject alloc] init] self];

The equivalent code in swift might have a global side effect from the init method, e.g. a print statement.

class A {
    init() {
        print("a init")
    }
}

for _ in 1...1000000 {
    let a = A()
}
The compiler would have to call let a = A() since there might be side effects right?
NSClassFromString can be accomplished via objc_getClass
Swift method dispatch is actually slower for classes that inherit from NSObject, as most classes will for the foreseeable future. In this case, even Swift calling Swift seems to message using objc_msgSend, through an extra stub that converts argument calling conventions in case the class is subclassed from Objective-C.
A highly overlooked path of optimization is templates versus dynamics. Many of the problems dynamics solve can also be addressed with templates, with huge optimization opportunities.

When I see a benchmark comparison that doesn't use a language's features, the comparison is invalid. It's like those many "Java is faster than C++" samples. Templates are a tool that non-C++ programmers will learn to love more than messaging.
Very nice article, I quoted it in my blog, thank you
Your discussion of aliasing in C is not quite right. The strict aliasing rule in the C99 (and C11) standard makes it clear that the compiler can assume that pointers to different underlying types will not alias. That is: the C compiler is allowed to assume that NSError** and int* do not alias. Similarly, _ptr may only alias with &count if _ptr[0] is the same type as count. This is one reason why you rarely see restrict: it's generally only useful when passing multiple pointers of the same type. This is most often the case when passing multiple arrays of numbers (for numerical computations) or blocks of untyped memory (void* or char*). The other reason you almost never see it: for most code, this kind of optimization just really doesn't matter. Most code isn't performance critical, and there are often more pressing problems (algorithm, cache usage, etc.) in code that is.

Note that C compilers may choose to be safer and treat everything as if it will alias, just as they may choose to ignore the restrict keyword, or may choose to emit less performant assembly. When strict aliasing optimizations are enabled (IIRC, -O2 will enable them on recent versions of both gcc and clang), both gcc and clang will emit better code for your examples.

See this article, by Mike Acton, for more detail about strict aliasing:

What Swift does is to enable strict-aliasing optimization more safely, which is a wonderful thing. There's a fair amount of ire about the strict-aliasing optimizations in C. See:
Nice article. There is a typo here : Still, it shuold be good for a few percentage points.
meh: I explicitly pointed out that the NSError** example went beyond the language aliasing rules. "This is an odd case, because C's aliasing rules don't allow aliasing pointers of different types...." So, thanks for repeating it, I guess.... As for _ptr, there's nothing saying it can't be the same type. You're also allowed to alias char * to any other type, and char * values are pretty common. I tested my examples with clang at -O3 and inspected the resulting assembly to ensure the redundant loads I discussed were actually happening, so no, it won't emit better code for my examples.

dee: Much appreciated, fixed.
