Ever since the iPhone 5S was announced a couple of weeks ago, the world of tech journalism has been filled with massive quantities of misinformation. Unfortunately, good information takes time, and the world of tech journalism is more about speed than accuracy. Today, as suggested by a variety of readers, I'm going to give the rundown of just what 64-bit ARM in the iPhone 5S means for you, in terms of performance, capabilities, and development.
Let's start by talking about the general term "64-bit" and what it means. There's a lot of confusion around this term, and a lot of that is because there's no single agreed-upon definition of it. However, there is generally some consensus about it, even if it's not universal.
There are two parts of the CPU that "X-bit" usually refers to: the width of the integer registers, and the width of pointers. Thankfully, in most modern CPUs, these widths are the same. "64-bit" then typically means that the CPU has 64-bit integer registers and 64-bit pointers.
It's also important to point out the things that "64-bit" does not refer to, as there's a lot of confusion in this area as well. In particular, "64-bit" does not include:
- Physical RAM address size. The number of bits used to actually talk to RAM (and therefore the amount of RAM the hardware can support) is decoupled from the question of CPU bitness. ARM CPUs have ranged from 26 bits to 40 bits, and this can be changed independently from the rest.
- Data bus size. The amount of data fetched from RAM or cache is likewise decoupled. Individual CPU instructions may request a certain amount of data, but the amount of data actually fetched can be independent, either by splitting the fetch into smaller parts, or fetching more than is necessary. The iPhone 5 already fetches data from memory in 64-bit chunks, and chunk sizes of up to 192 bits exist in the PC world.
- Anything related to floating-point. FPU register size and internal design is independent, and ARM CPUs have had 64-bit FPU registers since well before ARM64.
Generic Advantages and Disadvantages
If we compare otherwise-identical 32-bit and 64-bit CPUs, there isn't a whole lot of difference, which is a big part of the confusion around the significance of Apple's move to 64-bit ARM. The move is important, but largely because of specifics of the ARM processor and Apple's use of it.
Still, there are some differences. Perhaps the most obvious is that 64-bit integer registers make it more efficient to work with 64-bit integers. You can still work with 64-bit integers on a 32-bit processor, but that typically entails handling each value in two 32-bit pieces, which means that arithmetic can take substantially longer. 64-bit CPUs can typically perform arithmetic on 64-bit quantities just as fast as on 32-bit quantities, so code that does heavy manipulation of 64-bit integers will run much faster.
Although 64-bit has no bearing on the amount of RAM that can be used by the CPU itself, it can make it much easier to use large amounts of RAM within a single program. A single program running on a 32-bit CPU only has 4GB of address space. Chunks of that address space are taken up by the operating system and standard libraries and such, typically leaving anywhere from 1-3GB available for use. If a 32-bit system has more than 4GB of RAM, taking advantage of all of it from a single program is tough. You have to resort to shenanigans like asking the operating system to map chunks of memory in and out of your process as you need them, or splitting your program into multiple processes.
This takes a lot of extra programming effort and can slow things down, so few programs actually do it. In practice, a 32-bit CPU limits individual programs to using 1-3GB of RAM each, and the advantage of having more RAM is the ability to run multiple such programs simultaneously, and the ability to cache more data from disk. This is still useful, but there are cases where the ability of a single program to use more RAM is needed.
The increased address space is also useful even on a system without that much RAM. Memory-mapped files are a handy construct, where the contents of a file are logically mapped into a process's memory space, even though physical RAM is not necessarily allocated for the entire file. On a 32-bit system, a program can't memory map large files (over, say, a few hundred megabytes) reliably. On a 64-bit system, the available address space is much larger, so there's no concern with running out.
The increased pointer size comes with a substantial downside: otherwise-identical programs will use more memory, perhaps a lot more, when running on a 64-bit CPU. Pointers have to be stored in memory as well, and each pointer takes twice the amount of memory. Pointers are really common in most programs, so that can make a substantial difference. Increased memory usage can put more pressure on caches, causing reduced performance.
In short: 64-bit can increase performance for certain types of code, and makes certain programming techniques, like memory mapped files, more viable. However, it can also decrease performance due to increased memory usage.
ARM64
The iPhone 5S's 64-bit CPU is not merely a regular ARM processor with wider registers. The 64-bit ARM architecture includes substantial changes from the 32-bit version.
First, a note on the name: the official name from ARM is "AArch64", but this is a silly name that pains me to type. Apple calls it ARM64, and that's what I will call it too.
ARM64 doubles the number of integer registers over 32-bit ARM. 32-bit ARM provides 16 integer registers, of which one is a dedicated program counter, two more are given over to a stack pointer and link register, and the other 13 are available for general use. With ARM64, there are 32 integer registers, with a dedicated zero register, link register, and frame pointer register. One further register is reserved for the platform, leaving 28 general purpose integer registers.
ARM64 also increases the number of floating-point registers available. The floating-point registers on 32-bit ARM are a bit odd, so it's tough to compare. It has 32 32-bit floating-point registers which can also be viewed as 16 overlapped 64-bit registers, and there are 16 additional independent 64-bit registers. The 32 total 64-bit registers can also be viewed as 16 overlapped 128-bit registers. ARM64 simplifies this to 32 128-bit registers, which can also be used for smaller data types, with no overlapping.
The register count can strongly influence performance. Memory is extremely slow compared to CPUs, and reading from and writing to memory takes a long time compared to how long it takes the CPU to process an instruction. CPUs try to hide this with layers of caches, but even the fastest layer of cache is slow compared to internal CPU registers. More registers means more data can be kept purely CPU-internal, reducing memory accesses and increasing performance.
Just how much of a difference this makes will depend on the specific code in question, as well as how good the compiler is at optimizing it to make the best use of available registers. When the Intel architecture moved from 32-bit to 64-bit, the number of registers was doubled from 8 to 16, and this made for a substantial performance improvement. ARM already had substantially more registers than the 32-bit Intel architecture, so the impact of additional registers is smaller, but it's still a helpful change.
ARM64 also brings some significant changes to the instruction set beyond the increased number of registers.
Most 32-bit ARM instructions can be executed conditionally, based on the state of a condition register at the time of execution. This allows compiling if statements and similar constructs without requiring a branch. Intended to increase performance, it must have been causing more trouble than it was worth, as ARM64 eliminates conditional execution.
ARM64's NEON SIMD unit provides full double-precision IEEE754 compliance, whereas the 32-bit version of NEON only supports single-precision, and leaves out some of the harder, more obscure bits of IEEE754.
ARM64 adds specialized instructions for AES encryption and SHA-1 and SHA-256 cryptographic hashes. Not important in general, but potentially a big win if you happen to be doing those things.
Overall, by far the most important changes are the greatly increased number of general-purpose registers, and support for full IEEE754-compliant double-precision arithmetic in NEON. These changes could allow for considerable performance increases in a lot of code.
It's important to note that the A7 includes a full 32-bit compatibility mode that allows running normal 32-bit ARM code without any changes and without emulation. The iPhone 5S thus runs old iPhone apps with no problem and no performance penalty relative to other hardware, although 32-bit code gets none of the advantages of ARM64 and so won't run as fast as it could.
Apple Runtime Changes
Apple takes advantage of architecture changes like this to make changes in their own libraries. Since they don't need to worry about maintaining binary compatibility across such a change, it's a good time to make changes that would otherwise break existing apps.
In Mac OS X 10.7, Apple introduced tagged pointers. Tagged pointers allow certain classes with small amounts of per-instance data to be stored entirely within the pointer. This can eliminate the need for memory allocations for many uses of classes like NSNumber, and can make for a good performance boost. Tagged pointers were only supported on 64-bit, partly due to binary compatibility concerns, but partly because 32-bit pointers don't leave a lot of room left over for actual data once the tag bits are accounted for. Presumably because of that, iOS never got tagged pointers. However, on ARM64, the Objective-C runtime includes tagged pointers, with all of the same benefits they've brought to the Mac.
Although pointers are 64 bits, not all of those bits are really used. Mac OS X on x86-64, for example, only uses 47 bits of a pointer. iOS on ARM64 uses even fewer, with only 33 bits of a pointer currently in use. As long as the extra bits are masked off before the pointer is used, they can be used to store other data. This leads to one of the most significant internal changes in the Objective-C runtime in the language's history.
Much of the information for this section comes from Greg Parker's article on the relevant changes. Check that out for information straight from the source.
First, a quick refresher: Objective-C objects are contiguous chunks of memory. The first pointer-sized piece of that memory is the isa. Traditionally, the isa is a pointer to the object's class. For more information on how objects are laid out in memory, see my article on the Objective-C runtime.
Using an entire pointer-sized piece of memory for the isa pointer is a bit wasteful, especially on 64-bit CPUs which don't use all 64 bits of a pointer. ARM64 running iOS currently uses only 33 bits of a pointer, leaving 31 bits for other purposes. Class pointers are also aligned, meaning that a class pointer is guaranteed to be divisible by 8, which frees up another three bits, leaving 34 bits of the isa available for other uses. Apple's ARM64 runtime takes advantage of this for some great performance improvements.
Probably the most important performance improvement is an inline reference count. Nearly all Objective-C objects are reference counted (the exceptions being constant objects like NSString literals), and the retain and release operations that modify the reference count happen extremely frequently. This is especially true with ARC, which emits even more retain and release calls than a typical human programmer would. As such, high performance for retain and release is critical.
Traditionally, the reference count is not stored in the object itself. If the isa is the only field every object shares, then there's simply no room for any additional data. It would be possible to make it so that every object also contains a reference count field, but this would use up a great deal more memory. This is less important today, but it was a pretty big deal in the earlier days of Objective-C. Because of this, the retain count is stored in an external table.
Any time an object is retained, the runtime goes through this procedure:
- Fetch a global retain count hash table.
- Lock the table to make the operation thread safe.
- Look up the retain count of the object in the table.
- Increment the count and store the new value back in the table.
- Release the table lock.
This is a bit slow! The hash table implementation used for tracking retain counts is fast, for a hash table, but even the best hash tables are slow compared to direct memory access.
On ARM64, 19 bits of the isa field go to holding the object's reference count inline. That means that the procedure for retaining an object simplifies to:
- Perform an atomic increment of the relevant portion of the isa field.
And that's it! This should be much, much faster.
There is a bit more to it than just that, because of some corner cases that need to be handled. The real code looks more like this:
- The bottom bit of the isa indicates whether all this extra data is active for this class. If it's not active, then fall back to the old hash table approach. This allows for a compatibility mode for classes that fall outside the representable range, or programs that incorrectly assume the isa is a pure class pointer.
- If the object is currently deallocating, do nothing.
- Increment the retain count, but don't store the new value back into the isa just yet.
- If it overflowed (an unusual but real possibility with only 19 bits available) then fall back to a hash table.
- Perform an atomic store of the new isa value.
Most of this was necessary with the old approach as well, and it doesn't add too much overhead. The new approach should still be much, much faster.
There are several other performance improvements stuffed into the remaining free bits that make deallocating objects faster. There's potentially a lot of cleanup that needs to be done when an Objective-C object deallocates, and being able to skip unnecessary cleanup can increase performance. These are:
- Whether the object ever had any associated objects, set with objc_setAssociatedObject. If not, then associated objects don't need to be cleaned up.
- Whether the object has a C++ destructor method, which is also used as the ARC automatic dealloc method. If not, then it doesn't need to be called.
- Whether the object has ever been referenced by a __weak variable. If it has, then any remaining __weak references need to be zeroed. If not, then this step can be skipped.
Previously, all of these flags were tracked per-class. If any instance of a class ever had an associated object set on it, for example, then every instance of that class would perform associated object cleanup when deallocating from that point on. Tracking them for each instance independently helps ensure that only the instances that really need it take the performance hit.
Adding it all together, it's a pretty big win. My casual benchmarking indicates that basic object creation and destruction takes about 380ns on a 5S running in 32-bit mode, while it's only about 200ns when running in 64-bit mode. If any instance of the class has ever had a weak reference and an associated object set, the 32-bit time rises to about 480ns, while the 64-bit time remains around 200ns for any instances that were not themselves the target.
In short, the improvements to Apple's runtime make it so that object allocation in 64-bit mode costs only 40-50% of what it does in 32-bit mode. If your app creates and destroys a lot of objects, that's a big deal.
Conclusion
The "64-bit" A7 is not just a marketing gimmick, but neither is it an amazing breakthrough that enables a new class of applications. The truth, as so often happens, lies somewhere in between.
The simple fact of moving to 64-bit does little. It makes for slightly faster computations in some cases, somewhat higher memory usage for most programs, and makes certain programming techniques more viable. Overall, it's not hugely significant.
The ARM architecture changed a bunch of other things in its transition to 64-bit. An increased number of registers and a revised, streamlined instruction set make for a nice performance gain over 32-bit ARM.
Apple took advantage of the transition to make some changes of their own. The biggest change is an inline retain count, which eliminates the need to perform a costly hash table lookup for retain and release operations in the common case. Since those operations are so common in most Objective-C code, this is a big win. Per-object resource cleanup flags make object deallocation quite a bit faster in certain cases. All in all, the cost of creating and destroying an object is roughly cut in half. Tagged pointers also make for a nice performance win as well as reduced memory use.
ARM64 is a welcome addition to Apple's hardware. We all knew it would happen eventually, but few expected it this soon. It's here now, and it's great.
That's it for today. Check back next time for more adventures in the land of hardware and software. Friday Q&A is driven by reader suggestions, so if an idea pops into your head between now and then for a topic you'd like to see covered here, please send it in!