mikeash.com pyblog/friday-qa-2013-09-27-arm64-and-you.html commentshttp://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#commentsmikeash.com Recent CommentsFri, 29 Mar 2024 08:48:51 GMTPyRSS2Gen-1.0.0http://blogs.law.harvard.edu/tech/rssmikeash - 2014-11-26 02:43:10http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#comments<b>Tom:</b> Last I checked OS X did not do this stuff, but it may have been enabled in 10.10, as I haven't investigated there yet. I think it would be well worth it to at least put the refcount info in there. Even just a few bits would be enough to handle most objects. I imagine it's rare for an object's reference count to go above a pretty small number, so as long as there's a usable fallback you'll get most of the speed advantage from a small number of bits. <br /> <br /><b>Greg:</b> The title doesn't make it obvious, but I think this article is probably what you're after: <a href="https://mikeash.com/pyblog/friday-qa-2009-03-13-intro-to-the-objective-c-runtime.html">https://mikeash.com/pyblog/friday-qa-2009-03-13-intro-to-the-objective-c-runtime.html</a>8dbf660c076833b8d9a5e32fa59386d4Wed, 26 Nov 2014 02:43:10 GMTGreg - 2014-11-13 17:35:54http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#commentsCan you write about Objective C object layout like you wrote about Swift object layout? I'm trying to figure out how iOS lays out objects' ivars in memory and whether they have the same ordering and padding ramifications that structs members have. I understand this can change between versions of iOS, but it would still be useful to know for memory use optimization.bb23f6571e095f7bd50e03b60356a697Thu, 13 Nov 2014 17:35:54 GMTTom - 2014-09-03 20:27:28http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#commentsDoes OS X on x64 use bits of the isa field for the refcount, too? I can't find anything that says it does or doesn't. 
<br /> <br />The bigger usable address space of x64 (versus ARM64) means less space left for supplementary data on pointers, but even 17 bits is more than enough for the refcount for a lot of objects. Then again, if the refcounts are already stored off-object in a system-managed hash table, maybe x64's bigger caches already keep the refcounts in a fast enough place to make up for it. <br />34afb92f8ed868a4814f4874e2dd5c47Wed, 03 Sep 2014 20:27:28 GMTMark Granger - 2014-04-01 02:14:58http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#commentsSomewhere lost in the noise is the fact that the 64 bit A7 ARM processor is artificially limited to just 31 bits of address space (2 GB) by iOS 7. That is causing huge problems for developers that make use of memory mapped files. It means I cannot just open each of my large files as a single memory map. I have to map them in sections which can be very inefficient. <br /> <br />"ARM64 running iOS currently uses only 33 bits of a pointer, leaving 31 bits for other purposes." That is true but this is just an arbitrary decision made by Apple. They did not think that developers would need to address more than 2 GB of RAM on a device which is limited to 1 GB of physical memory. They can change this at any time and I predict that they will change it in iOS 8. I just don't know by how much. 
I am hoping for at least 42 bits of address space which would unlock the full potential of the A7's desktop class processing features (not to mention that of the A8).5cb9868ecb749fda73f66fcd08bd48eaTue, 01 Apr 2014 02:14:58 GMTIaroslav Pavlov - 2013-11-28 23:41:13http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#commentsCould anyone explain why: <br /> <br />2) If the object is currently deallocating, do nothing.9956a812e109d5c65ec6637366f77af3Thu, 28 Nov 2013 23:41:13 GMTSteven - 2013-11-21 09:36:20http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#commentsFor reference, <code>getpagesize</code> returns 16384 on 5S.e2db33f7f9dd71ba1e19f77b3c6e1b6eThu, 21 Nov 2013 09:36:20 GMTJonathan - 2013-11-17 21:26:47http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#comments<div class="blogcommentquote"><div class="blogcommentquoteinner">Considering Apple has introduced completely new processor architectures 4 times in the past 30 years, that seems like a relatively minor issue to deal with. In 2043, ARM64 will be as old as the original (16-bit) 8086 is today, so it will be nothing short of a miracle if the ARM64 architecture lasts long enough that we need to trim some tag bits to address all of our RAM. :-)</div></div> <br /> <br />Tim, interestingly the 32 bit ARM was started exactly 30 years ago (Oct 1983) and was released 1985. I had the great pleasure to write one of the first apps for it. So ARM64 may be far from obsolete in 2043, bizarre as it may sound. b58e9fd6d3d09aefd46bde0d0cb25d31Sun, 17 Nov 2013 21:26:47 GMTken - 2013-10-09 16:28:33http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#comments@Maynard Handley <br /> <br />By "memory model" I meant the sort of stuff described in <a href="http://m.linuxjournal.com/article/8211?page=0">http://m.linuxjournal.com/article/8211?page=0</a>,1. 
Thanks for the link!3cfc6d34e4724089051452f650ad3b60Wed, 09 Oct 2013 16:28:33 GMTNashy_ - 2013-10-08 19:03:07http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#commentsThanks for taking the time to put this down in ink for us. It's a great summary and it's very enlightening.c9872fd3cc697e2ad9132880ea1a03caTue, 08 Oct 2013 19:03:07 GMTvvid - 2013-10-06 11:11:14http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#commentsMaynard Handley: <br />FYI the full ARMv8 architecture manual is available: <br /> <br /><a href="http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0487a/index.html">http://infocenter.arm.com/help/index.jsp?topic=/com.arm.doc.ddi0487a/index.html</a>69e7ca2c56a59c823ac9a1e7c2099876Sun, 06 Oct 2013 11:11:14 GMTmikeash - 2013-10-03 14:36:57http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#comments<b>Stephen:</b> I believe the problem there is that the external refcount table has been used for so long that there is now a vast quantity of code out there that assumes <code>NSObject</code> has a single pointer-sized ivar and nothing more. You could easily add an inline refcount as a separate field in an ideal world, but in practice, I think it would break a ton of code.46f8e4512b20303c748270520f70abdbThu, 03 Oct 2013 14:36:57 GMTRobert Quattlebaum - 2013-10-03 00:25:59http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#comments@blah: Tagged pointers are very common and actually quite safe if you do it right. (And usually TOTALLY worth it) <br /> <br />For example, if your architecture requires aligned pointers, then you can use the least-significant bit as the "tagged pointer" flag. If it's clear, it's a pointer. If it's set, it's something else. 
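The low-bit scheme described here can be sketched in a few lines. This is only an illustration, not how any particular runtime does it: plain integers stand in for machine words, and the names <code>TAG_BIT</code>, <code>ALIGNMENT</code>, <code>make_tagged_int</code>, <code>is_tagged</code> and <code>decode</code> are hypothetical, as is the assumed 8-byte alignment.

```python
# Hypothetical sketch: Python integers stand in for pointer-sized machine words.
TAG_BIT = 0x1    # least-significant bit; always 0 in an aligned pointer
ALIGNMENT = 8    # assumed pointer alignment in bytes (an assumption, not from the thread)


def make_tagged_int(value: int) -> int:
    """Pack a small non-negative integer into a word with the tag bit set."""
    return (value << 1) | TAG_BIT


def is_tagged(word: int) -> bool:
    """A set low bit means the word holds an immediate value, not an address."""
    return (word & TAG_BIT) != 0


def decode(word: int):
    """Return ("immediate", value) for tagged words, ("pointer", address) otherwise."""
    if is_tagged(word):
        return ("immediate", word >> 1)
    # Untagged words must be real, aligned addresses.
    if word % ALIGNMENT != 0:
        raise ValueError("untagged word is not an aligned pointer")
    return ("pointer", word)
```

The key property is that every place that might dereference the word goes through something like <code>decode</code> first, so a tagged value is never treated as an address.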
<br /> <br />Since you are only assigning meaning to bits that would otherwise never be anything other than zero (no matter how much memory you add), the change is quite safe as long as you are in control of every location where the pointer might be dereferenced (which is the case for these specific types).2381c37bca375f32da5e1f9929966d64Thu, 03 Oct 2013 00:25:59 GMTAdrian - 2013-10-02 21:43:08http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#commentsIs it possible now that the VAS available to processes is so large, it may become reasonable to consistently use SQLite's memory-mapped I/O? <br /> <br />My understanding is Apple uses SQLite db's for a lot of their storage, this could provide a bit of a speed boost as well?70c38d57dc857652925c41d7d1ea12adWed, 02 Oct 2013 21:43:08 GMTStephen - 2013-10-02 20:54:21http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#commentsYou could have done the inline-object reference count optimization without 64-bit, no? Just have another uint to store the count and other stuff along with the isa pointer in the NSObject struct. With ARM32, the class pointer would take 32-bits plus another 32 for the uint and would be the same size as the ARM64 version with 1 64-bit pointer.d72bf9cdbdbd0e2f633bfe0cce2b0cc7Wed, 02 Oct 2013 20:54:21 GMTSeth - 2013-10-02 17:31:38http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#commentsFantastic article clearly written by someone with experience. How refreshing.0ae1f40874367fe81325244fc5e412f6Wed, 02 Oct 2013 17:31:38 GMTincorrector - 2013-10-02 00:05:24http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#comments@Maynard Handley <br /> <br />No. Full-size shifts, both directions *OR* limited left shift for extended reg. Idk what manual you're talking about. 
Mine is pretty clear on the subject.2ca44f8ffc344b9468342c19d7f27012Wed, 02 Oct 2013 00:05:24 GMTJ Osborne - 2013-10-01 23:45:59http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#comments@Maynard: <br /><div class="blogcommentquote"><div class="blogcommentquoteinner">"You've left out a bunch of interesting stuff that will improve performance in the future, when Apple ditches 32-bit compatibility. "</div></div> <br /> <br />That doesn't require dropping the 32 bit mode, it could also be done by deciding any of those 32bit only operations take two (or more) cycles and introduce a pipeline bubble. At that point they only cost die area, not maximum frequency. <br />953aa8a695a04dfe0d29f0c4b2fdbffbTue, 01 Oct 2013 23:45:59 GMTMaynard Handley - 2013-10-01 23:32:09http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#commentsOh, one last thing. The Apple documentation on "getting your code ready for 64-bit ARM" specifically says "we make no guarantees that the page size is 4kB. We reserve the right to change it whenever we want, across devices and across OS versions. Use this call to get the pagesize if you need it..." <br />which strongly suggests that either now (for iOS7 on ARM64) or very soon they plan to change the page size (presumably to 64kB). <br />061985b774028f65ab79a7c2280a3724Tue, 01 Oct 2013 23:32:09 GMTMaynard Handley - 2013-10-01 23:26:57http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#comments@ken <br />What do you mean by "memory model"? <br /> <br />The VM model is, of course, more or less standard BSD on Mach. (Not exactly textbook. They've obviously improved things over the years. For example prior to 10.8 they were not especially aggressive about cleaning pages that had the potential to be re-used soon, which meant that if something suddenly demanded a lot of pages, there was a long delay while those pages were written out. 
This was fixed in 10.8) <br /> <br />Above the VM model the app view is pretty standard C/UNIX. I've no idea what the exact layout is, but who cares these days? There'll be an area for the instructions, an area for globals, a heap, an area for shared libs (address randomized), a stack. The stack must grow downwards --- that's an ARM hardware thing. The user space lives in all 0s up to 2^47, the OS space lives in all 1s down to 2^64-2^47. All vanilla. <br /> <br />Below the VM is the ARM64 memory model specifying atomicity and ordering. It is described in the document <br />ARMv8_ISA_Overview_PRD03-GENC-010197-15-0.pdf <br /> <br />This document WAS available on the web as of about two weeks ago. It has since disappeared as has every in-depth reference to the ARMv8 ISA. A search in the relevant places of the ARM web site says things like "place holder document". <br /> <br />Make of that what you will --- I assume what it boils down to is that there's some error that was discovered in the document which needs to be corrected, and it will be back soon. <br />What it says (at least the parts I care about) is kinda obvious: <br />- ALIGNED loads and stores are atomic, non-aligned loads and stores are not. <br />- there are some of the obvious cache-control instructions (prefetch into some level of cache, either marking the result LRU or MRU) <br />[I didn't see any of the IBM POWER style instructions to pre-load into the I-cache. I THOUGHT I saw instructions to zero out cache blocks, but I don't see them now, so maybe I misremember --- those instructions are always problematic when the cache block size changes.] <br />- there are load exclusive/store exclusive instructions which, as far as I can tell, are what I'd call Load Linked and Store Conditional --- basically the usual RISC primitives for atomics. <br />- there are also load/store primitives that ensure the memory ordering seen by others. 
To quote: <br />" <br />A load-acquire is a load where it is guaranteed that all loads and stores appearing in program order after the load-acquire will be observed by each observer after that observer observes the load-acquire, but says nothing about loads and stores appearing before the load-acquire. <br /> <br />A store-release will be observed by each observer after that observer observes any loads or stores that appear in program order before the store-release, but says nothing about loads and stores appearing after the store-release. <br /> <br />In addition, a store-release followed by a load-acquire will be observed by each observer in program order. <br /> <br />A further consideration is that all store-release operations must be multi-copy atomic: that is, if one agent has seen a store-release, then all agents have seen the store-release. There are no requirements for ordinary stores to be multi-copy atomic. <br />" <br /> <br />There's an additional set of barrier primitives for data and instruction synchronization (I guess flushing the pipeline after you've dynamically generated some code, for example). <br />These come in all sorts of variants specifying exactly what does and doesn't get ordered (for example "Full System" vs "Inner shareable" vs "Outer shareable". I assume this means something like "I want the other CPU cores to see the changes I've made, but I don't care about whether the GPU or IO sees them". <br />5da1d7ac37ac181c1b65555dfe146bdeTue, 01 Oct 2013 23:26:57 GMTken - 2013-10-01 21:31:22http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#comments@Maynard Handley <br /> <br />Is there somewhere one can read about the memory model for ARM64, or for Apple's chips, or whatever defines it for the iPhones? 
I've never been able to find it for previous iPhones.c9a10d535a4b629e3af7fc3b9101b24aTue, 01 Oct 2013 21:31:22 GMTMaynard Handley - 2013-10-01 21:00:45http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#comments@ incorrector <br /> <br />You're right --- there are still vestiges of shift available in some instructions. I didn't notice them first time through. <br /> <br />But I'd say my larger point is right as well. The shifts that are available (as far as I can tell -- the draft manual is still pretty awful and incomplete, IMHO) have been stripped of everything that made them tough and slow to implement. <br />You mentioned that they are immediate shifts, but beyond that they are all, as far as I can tell, left shifts by a small (1..4) constant. The lack of rotates and right shifts is what makes them easier to implement with simpler bit steering.19f854cd81849c84357daaaea66ad963Tue, 01 Oct 2013 21:00:45 GMTmikeash - 2013-10-01 20:09:45http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#comments<b>CJ:</b> Remember that what you store in the word does not have to be what the code actually asks the CPU to load. If you're storing extra data in the high 16 bits of a pointer, all you have to do is mask those bits off before loading the pointer. <br /> <br />ARM64 actually has some tagged pointer support wherein you can ask for the CPU to optionally ignore the top few bits of a pointer when loading it, so you don't even have to do the masking, and can just use the value raw. But that is really just a bit of assistance, not something required.fda4f530e13aaeaa1c591c96943262baTue, 01 Oct 2013 20:09:45 GMTCJ - 2013-10-01 19:39:38http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#comments@Tim 2013-09-28 22:14:20 re blah 2013-09-27 22:33:57 <br /><div class="blogcommentquote"><div class="blogcommentquoteinner">What exactly has AMD done to "make it tricker" to use tagged pointers on x86-64? 
I've worked on dynamic compilers (which target x86-64, among many other architectures) that use tagged pointers, and I've never heard of any trouble with this.</div></div> <br /> <br />I'm no expert on Objective C, but I'm struggling to see how your tagged pointers fit nicely with AMD64 "Canonical Address Form". Maybe it's a non-issue in Apple's world where they control hardware, OS, runtime, and such? But AMD, and therefore by induction also Intel, don't seem to like the idea? <br /> <br />Enlightenment welcome. <br /> <br />References include: AMD64 Architecture Programmers Manual, Volume 2: System Programming, page 5: (available online, just search) <br /> <br />"Although some processor implementations do not use all 64 bits of the virtual address, they all check bits 63 through the most-significant implemented bit to see <b>if those bits are all zeros or all ones</b>. An address that complies with this property is in canonical address form. In most cases, a virtual-memory reference that is not in canonical form causes a general-protection exception (#GP) to occur. However, implied stack references where the stack address is not in canonical form causes a stack exception (#SS) to occur. Implied stack references include all push and pop instructions, and any instruction using RSP or RBP as a base register. <br /> <br />By checking canonical-address form, <b>the AMD64 architecture prevents software from exploiting unused high bits of pointers for other purposes.</b> Software complying with canonical-address <br />form on a specific processor implementation can run unchanged on long-mode implementations supporting larger virtual-address spaces." [my bold] <br />20f56613cc906c5c6e1a0b205103d9a7Tue, 01 Oct 2013 19:39:38 GMTmikeash - 2013-10-01 13:36:41http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#comments<b>bob:</b> It's simply an unfortunate OS bug. 
And, of course, since it's a bug in older, unsupported OSes, the chances of seeing a fix are low. It should have worked, but now we're stuck.1f943240b4952bc453d8e69910f9150dTue, 01 Oct 2013 13:36:41 GMTincorrector - 2013-10-01 09:07:56http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#commentsMaynard Handley <br />As far as I can tell, the ONLY place it exists now is in some addressing modes (just like other ISAs) <br /> <br />No, you're wrong, check ARM ARM. You can still do things like 'ANDS Reg, Reg, shifted Reg', except for the variable shifts by register (separate now), and rotations in arithmetic instructions (now replaced with register extensions)042d3a75f6275b67ebf3b4b15b588973Tue, 01 Oct 2013 09:07:56 GMTbob - 2013-09-30 21:36:41http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#commentsWe would like to use ARM64, but cannot, as we would have to drop support for iOS 4 and 5. Do you have any clue why there would be such a restriction? I mean, why can't the 32-bit and 64-bit code slices coexist regardless of OS version, similar to how armv7 and armv7s (and previously armv6) slices currently coexist? <br /> <br />From Apple's documentation for 64-bit transition on iOS (<a href="https://developer.apple.com/library/ios/documentation/General/Conceptual/CocoaTouch64BitGuide/Introduction/Introduction.html">https://developer.apple.com/library/ios/documentation/General/Conceptual/CocoaTouch64BitGuide/Introduction/Introduction.html</a>): <br /> <br />Xcode can build your app with both 32-bit and 64-bit binaries included. This combined binary requires a minimum deployment target of iOS 7 or later. <br /> <br />... 
<br /> <br />Note: A future version of Xcode will let you create a single app that supports the 32-bit runtime on iOS 6 and later, and that supports the 64-bit runtime on iOS 7.7436311d02c5a55738d1baefa03b0d34Mon, 30 Sep 2013 21:36:41 GMTBruce - 2013-09-30 21:21:11http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#commentsDid the Apple ARM32 code use the ARM encoding instead of the Thumb-2 encoding? Because in Thumb-2, you didn't have predicated instructions. <br /> <br />How is the ARM64 instruction density (code size) compared to ARM32 (especially with Thumb-2 encodings, if applicable)? Seems like its going to be a lot bigger, especially without load/store multiple for prolog/epilog, and no shifters / fewer address modes. <br /> <br />Another (minor?) downside of more registers is more context switch overhead. <br />ebfc90fe84cfb71283e00564fbd91011Mon, 30 Sep 2013 21:21:11 GMTMaynard Handley - 2013-09-30 18:22:45http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#comments@ incorrector <br />As far as I can tell, the ONLY place it exists now is in some addressing modes (just like other ISAs) --- and only those addressing modes where it is considered that it won't slow down a critical path. <br /> <br />In particular we don't have it attached to an add via portmanteau instructions like "extend signed, shift, then add" which require basically extra gates slowing down the whole ALU. 
<br />d87f77d4f93ff4f671dd6fbe620a08f5Mon, 30 Sep 2013 18:22:45 GMTincorrector - 2013-09-30 18:04:40http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#commentsMaynard Handley <br />- the ability to shift values in every instruction is gone (one more thing that was limiting frequency) <br /> <br />no, it wasn't the same in all instructions, and it isn't gone, just more limited786c7dfde433304b7aa9e731551f8c0dMon, 30 Sep 2013 18:04:40 GMTT-Rex - 2013-09-30 14:35:20http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#comments@ Paul Christian <br />Rob is just being very nebulous - network latency will always be muuuuch higher than say a data mining trip to an '90s era hard disk - and there is NOOO way it will ever be used for an internal data structure like virtual memory. Fetch data? Well sure, iOS apps have been doing this novel thing since the iPhone 3g :D <br /> <br />Mike, you got a pingback from DF J Gruber :D, but he seems to be redecorating your conclusion.2ea889cc3084fd49a8f72f2a610fac3eMon, 30 Sep 2013 14:35:20 GMTMaynard Handley - 2013-09-29 17:59:30http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#comments"Were there major changes to the ABI as well? <br /> <br />For example, in ARM32, the first 4 items in a function were passed using registers, while the rest of the function data was passed on the stack. (IIRC)" <br /> <br />Yes. The Apple ARM64 ABI feels a lot like the PPC ABI. <br />It <br />- allows for more parameters to move via registers (rather than the stack) <br />- lays out explicit rules for volatile and non-volatile registers (volatile across procedure calls) <br />- provides for two registers that can be used for trampolines, thunks and similar such code glue between functions. <br /> <br />I say Apple ARM64 ABI, because there are some minor differences between what Apple did and what ARM describes as the ABI, primarily in the handling of varargs. 
I have no idea why Apple felt it necessary to differ from ARM in this respect, or whose solution is "better".38f9827ab57fd0c9efedad5c505a986eSun, 29 Sep 2013 17:59:30 GMTMaynard Handley - 2013-09-29 17:41:08http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#commentsYou've left out a bunch of interesting stuff that will improve performance in the future, when Apple ditches 32-bit compatibility. This includes <br /> <br />- although predicated execution is gone, it is replaced by something that gives most of the performance benefit (in reducing branches) while being simpler to implement at high frequencies, namely conditional moves. The ARM64 conditional moves are not just the standard CMOV of some other architectures, they allow for a few simple manipulations of the moved value which cover most of the common cases. So one gets a single-cycle no-branching instruction which can calculate things like absolute value, or max/min. <br /> <br />- the addressing modes are simplified to reduce chains of adds. Again, this allows for higher frequency. (Again, we don't easily see the win in the higher frequency until Apple can ditch 32-bit support which, of course, has to support those older addressing modes) <br /> <br />- the ability to shift values in every instruction is gone (one more thing that was limiting frequency) <br /> <br />- the load/store multiple instructions are gone. But they are replaced with a very clever idea, the load/store pair instructions which require a single address generation, but can load or store (duh) a pair of consecutive values. Because of the 128-bit wide bus to the L1 this just piggybacks on existing hardware, and because no complicated guarantees are made regarding atomicity of the loads, the instruction is easy and fast to implement. 
But it gives you, in a double ported cache, most of the advantages of a triple ported cache (for integer code, it won't help vector code) without having to pay the complexity or power of a triple ported cache <br /> <br />- the OS model (stuff like protection levels and interrupts) has been substantially simplified <br /> <br />- the memory model (guarantees about what loads/stores are atomic, and their ordered visibility to other CPUs) has again been simplified (and more granular instructions for controlling this have been provided, so that code can require only a simple and partial halt from the CPU when it has weak consistency requirements, rather than requiring a full halt for full consistency) <br /> <br />A common theme here is that most of these changes actually affect the design of HW as much as or more than the design of SW. A CPU that is required to also execute ARM32 code cannot exploit these changes because it still has to do the old things (eg the shifts every cycle, the address modes, the heavier OS model, the older memory ordering model, the decoder for THUMB instructions, ...). <br />My belief is that Apple will (for various reasons) push very aggressively to move the entire iOS development world to simultaneous 32/64-bit development (for example not accepting into the app store anything that doesn't have a 64-bit binary). This will allow them, remarkably soon, to ditch the ARM32 support, at which point, for the reasons I've stated, they'll be able to see a substantial frequency boost even without a process change. <br /> <br />Oh, one more thing; there are reasons to believe that if 64-bit iOS isn't already using 64K pages, it will be very soon. This is just one more minor speed boost --- TLB coverage becomes a lot larger, and lookup from page tables requires only two trips to memory, not three. 
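The page-size point at the end of this comment comes down to simple arithmetic, sketched below. Everything here is an assumption for illustration: 8-byte page-table entries, a hypothetical 512-entry TLB, and virtual address widths of 39 bits (with 4 KB pages) and 42 bits (with 64 KB pages). The function names <code>tlb_reach</code> and <code>table_levels</code> are made up for this sketch.

```python
PTE_SIZE = 8  # assumed bytes per page-table entry on a 64-bit system


def tlb_reach(entries: int, page_size: int) -> int:
    """Bytes of address space covered when every TLB entry maps one page."""
    return entries * page_size


def table_levels(va_bits: int, page_size: int) -> int:
    """Page-table levels needed if each table fills exactly one page."""
    offset_bits = page_size.bit_length() - 1               # bits selecting a byte within a page
    index_bits = (page_size // PTE_SIZE).bit_length() - 1  # bits resolved per table level
    remaining = va_bits - offset_bits
    return -(-remaining // index_bits)                     # ceiling division


# Hypothetical 512-entry TLB: coverage grows 16x going from 4 KB to 64 KB pages.
small_reach = tlb_reach(512, 4 * 1024)    # 2 MiB
large_reach = tlb_reach(512, 64 * 1024)   # 32 MiB

# Assumed address widths: 39 bits with 4 KB pages, 42 bits with 64 KB pages.
levels_4k = table_levels(39, 4 * 1024)    # 3 levels
levels_64k = table_levels(42, 64 * 1024)  # 2 levels
```

Under these assumptions a page-table walk takes three memory lookups with 4 KB pages and two with 64 KB pages, which matches the "two trips to memory, not three" remark. Other address widths or page sizes would give different counts.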
00e2ed4768587e56967a2daed61dbef5Sun, 29 Sep 2013 17:41:08 GMTMotti Shneor - 2013-09-29 15:01:19http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#commentsGreat article. <br /> <br />I especially like this analysis because I can use it when designing a new application, knowing the strengths and weak spots of the new hardware/software combinations. <br /> <br />My lessons here: <br />1. Make more use of memory-mapped files (where appropriate) <br />2. Be less afraid about creating and disposing of objects <br />3. Design your calculation code to rely on the abundance of 64-bit registers. <br /> <br />And to over-pessimistic Blah, Apple is using the hardware-software-seamless-integration point in its marketing all the time, and for a reason. Until now, in all its computer (ahem, sorry, phone) products, this integration was very careful. <br /> <br />If ever this breaks, Apple will no longer be "Apple" and we won't be using iPhones either. Hardware/OS tight integration is usually to the benefit of the customers. <br /> <br />I agree with all the others --- (and yes, I worked with AmigaBasic) the chances we application writers will stumble on the tagged pointers and isa packed with retain counts are VERY low. Apple will do a lot to prevent this from happening. <br />9a983aefc9649a07a224bc91ae9e24a5Sun, 29 Sep 2013 15:01:19 GMTbbh - 2013-09-29 13:25:17http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#commentsblah, your disgruntled pessimism annoys me. <br />What circumstance (beyond some alien invasion) can you specify where this breaks applications? (Not some inane rambling of old MS Amiga crap - get off my lawn yada yada - you young whippersnappers get off my lawn). If Apple goes out of business, it won't be introducing new iPhones with new hardware, right? So somehow it's going to update hardware but not be able to update the software? 
<br />8bc7e0f1a0f54320870d61df5a8aa838Sun, 29 Sep 2013 13:25:17 GMTTim - 2013-09-28 22:14:20http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#commentsblah: What exactly has AMD done to "make it tricker" to use tagged pointers on x86-64? I've worked on dynamic compilers (which target x86-64, among many other architectures) that use tagged pointers, and I've never heard of any trouble with this. <br /> <br />In fact, this technique has been used in lots of runtimes for at least a couple decades, and (prior to your Microsoft AmigaBASIC example) I've never heard of it causing any trouble for anyone ever. <br /> <br />Yes, there may come a day when we regret only being able to address 256 terabytes of RAM per process in our mobile phones. If we can somehow manage to increase memory density at the same rate as we have since 1980, that will probably happen about 30 years from now. Assuming we're still on ARM64, Apple will have to recompile the Objective-C runtime some time before 2043 to use fewer type bits. There's ample opportunity: at the current rate, we'll be on iOS major version 37 (which is extremely unlikely to support a 2013 iPhone, anyway ... it'd be like wanting to run Mac OS X 10.9 on an original "128K" Mac). <br /> <br />Considering Apple has introduced completely new processor architectures 4 times in the past 30 years, that seems like a relatively minor issue to deal with. In 2043, ARM64 will be as old as the original (16-bit) 8086 is today, so it will be nothing short of a miracle if the ARM64 architecture lasts long enough that we need to trim some tag bits to address all of our RAM. 
:-) <br />eb204b11977f1641f7870b655ca33695Sat, 28 Sep 2013 22:14:20 GMTTim - 2013-09-28 21:41:41http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#comments<b>Randy Lea</b>: The bit width of internal busses is completely decoupled from the width of pointers and the width of CPU internal registers, and it's a good guess that it's actually wider than 64 bits. <br /> <br />I doubt anybody outside Apple knows what bus they're using in their A6 and A7 System-on-Chip (SoC) designs. AXI is a strong candidate; it's the standard bus used by high performance ARM CPU core designs. Apple has definitely used it before. It's the bus interface on ARM's Cortex-A9, which was the CPU core used in Apple's A5 (found in iPhone 4S, iPad 2, and others). Since they're now designing their own CPUs Apple may have switched to something else, but I'll discuss AXI since they're not saying and it's what I know. <br /> <br />So, the first thing to note is that AXI is a flexible standard and supports data widths anywhere from 32 to (if I recall correctly) 512 bits. Also, the Cortex-A9 already had a 64-bit AXI bus even though it's a 32-bit CPU. <br /> <br />AXI is a point-to-point link, so most ARM SoCs are constructed as a star or star-of-stars topology. In AXI parlance the hubs of the stars are known as "fabrics", and each AXI fabric routes traffic between its ports based on address decoding. There's no rule saying all ports on one fabric have to be the same width, or even the same clock speed, so real systems often contain a mix. <br /> <br />I've personally worked on a Cortex-A9 based SoC where a significant chunk of the AXI infrastructure was 128 bits wide, even though the CPU is only a 64-bit AXI master. This was because we had a 32-bit wide DDR memory controller. DDR memory usually runs at much higher clock rates than the AXI structure. 
In our case, the clock ratio chosen was 4X, so the 32-bit DDR could transfer 4 32-bit words over the duration of one AXI bus clock, meaning that AXI had to be 128 bits wide to keep up. <br /> <br />That's not the only possible choice. We could've selected 32-bit AXI at the same frequency as the DDR. However, DDR memory clock rates are relatively high and there are downsides to that. Another possible choice would've been 1/2 frequency at 64 bits. Wide/slow has different tradeoffs compared to narrow/fast, and for <i>reasons</i> we chose wide/slow. <br /> <br />Getting back to Apple, my money would be on 128 bits. They have a 64-bit DDR interface, but as I alluded to 1:1 frequency and width is not common. 256 has its own issues; it starts to get very wasteful of chip area. 128 seems like the right choice for a chip like the A7. But that's not anything I've done formal analysis on, or have any evidence to support, so take it with a grain of salt. <br />9d4eb51e4099df70f3f4fa61592b515bSat, 28 Sep 2013 21:41:41 GMTAdam Nohejl - 2013-09-28 21:39:52http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#comments@Mike: Aah, you're right, I've missed that. If the runtime handles overflow, everything seems OK. Should be reading more carefully!622a9cda228280a6e79c2a09fb9204d8Sat, 28 Sep 2013 21:39:52 GMTPaul Christian - 2013-09-28 21:01:47http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#commentsGreat article Mike! And great point Rob Lewis,... I've never thought about that possibility... Though wouldn't that be further in the future? I mean, broadband is commonplace here, but certainly not everywhere... And it might take a while... although... 
this is Apple "Most forward thinking phone yet"70f4196d63ac4cbb5a54bd35366cfda6Sat, 28 Sep 2013 21:01:47 GMTRob Lewis - 2013-09-28 20:31:43http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#commentsMy first thought on hearing about 64-bit iOS was <i>Virtual Memory in the Cloud.</i> Apps could function as though they had essentially unlimited RAM (including memory-mapped files), and page faults could cause the missing code or data to be fetched from iCloud. <br /> <br />Beyond this, a big shared data space for iDevices could be provided for commonly-needed data like maps, weather, sports scores, code segments, and on and on. Want the current temperature in Seattle? It will always be found at address XXXXXXXXXXXXXXXX. 47787e8d9226be528b371df2bd793a9bSat, 28 Sep 2013 20:31:43 GMTmikeash - 2013-09-28 20:15:37http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#commentsYes, iOS 7 supports 64-bit apps. You can build fat binaries, so that the resulting app will run as a 64-bit program on 64-bit CPUs, and as a 32-bit program otherwise. The same has been done on Macs for years.73c795b16f27415695c1d82eada2dfd3Sat, 28 Sep 2013 20:15:37 GMTSander Bos - 2013-09-28 20:00:58http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#commentsSince you could do some casual benchmarking with 32-bit and 64-bit mode on your iPhone, does this mean you can compile, deploy, and run 64-bit apps on iOS 7? <br />Could you also release these 64-bit apps on the app store? I am thinking about my 32-bit iPad and all the wonderful 64-bit apps it won't run... <br />(I am not an iOS developer, so am unfamiliar with Xcode etc.) 
<br />d97f17ceb695ed8a76903cc956811d02Sat, 28 Sep 2013 20:00:58 GMTmikeash - 2013-09-28 19:21:01http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#comments<b>blah:</b> It doesn't take "faith" to notice that there is a bit whose sole purpose is to disable the whole isa-packing thing, and that tagged pointers are enabled or disabled with a #define in the ObjC runtime (which is, by the way, open source). Now, either elaborate on exactly <i>how</i> something could go wrong, or scoot. Vague handwavery is not welcome here.881358f1db7aa58d2ff1783c3cbde39bSat, 28 Sep 2013 19:21:01 GMTblah - 2013-09-28 18:08:54http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#commentsYour naive faith amuses me, youngster.2ad30b6a8c6f5d274ecd384f7a222304Sat, 28 Sep 2013 18:08:54 GMTSteve Weller - 2013-09-28 17:25:22http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#commentsIf you want to explore at home, do the following in a terminal: <br /> <br />cd /Applications/Xcode.app/Contents/Developer/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS7.0.sdk/usr/lib <br /> <br />otool -j -t -v -V libobjc.A.dylib &gt;output.txt <br /> <br />Then expand the tabs in output.txt to at least 8 and figure out the disassembly. <br /> <br />In Xcode you can see what ARM64 instructions are generated for your own code by setting a project configuration to the arm64 architecture and selecting 'Assembler' from the small button top left in Xcode's editor pane. <br /> <br />f3828a403e876f1f256acb531a4785a8Sat, 28 Sep 2013 17:25:22 GMTmikeash - 2013-09-28 15:17:01http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#comments<b>Adam Nohejl:</b> Not sure if you missed it or I'm misunderstanding you, but the 19-bit retain count is simply for the fast case, not a hard limit. Once you hit half a million or so, the implementation transparently switches over to a side table again. 
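The fast-path/fallback behavior described here can be sketched in C. This is only a toy model, not the Objective-C runtime's code: the names `toy_object`, `toy_retain`, `toy_release`, and `toy_retain_count` are made up for illustration, the count is assumed to sit in the top 19 bits of a 64-bit word, and a plain struct field stands in for the real side table.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Assumption: the inline count occupies the top 19 bits of a 64-bit word. */
#define RC_BITS       19
#define RC_SHIFT      (64 - RC_BITS)
#define RC_ONE        ((uint64_t)1 << RC_SHIFT)
#define RC_MASK       (~(uint64_t)0 << RC_SHIFT)
#define RC_INLINE_MAX (((uint64_t)1 << RC_BITS) - 1)   /* 524287 */

/* Hypothetical object header; side_count stands in for a side table entry. */
typedef struct {
    uint64_t isa_bits;    /* low 45 bits: other data, top 19 bits: inline count */
    uint64_t side_count;  /* used only once the inline field has overflowed */
    bool     overflowed;  /* true after the count has moved out of isa_bits */
} toy_object;

static uint64_t toy_retain_count(const toy_object *obj) {
    return obj->overflowed ? obj->side_count : (obj->isa_bits >> RC_SHIFT);
}

static void toy_retain(toy_object *obj) {
    if (obj->overflowed) {            /* slow path: count lives in the side table */
        obj->side_count++;
        return;
    }
    if ((obj->isa_bits & RC_MASK) == RC_MASK) {
        /* Inline field is full: move the whole count out and clear the field. */
        obj->side_count = RC_INLINE_MAX + 1;
        obj->isa_bits &= ~RC_MASK;
        obj->overflowed = true;
        return;
    }
    obj->isa_bits += RC_ONE;          /* fast path: bump the inline field */
}

static void toy_release(toy_object *obj) {
    assert(toy_retain_count(obj) > 0);
    if (obj->overflowed) {
        obj->side_count--;            /* toy model never moves the count back inline */
    } else {
        obj->isa_bits -= RC_ONE;
    }
}
```

Callers only ever go through `toy_retain` and `toy_release`, so the switch from the inline field to the fallback storage is invisible to them. The real runtime's bookkeeping differs in its details.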
<br /> <br />This is an advantage to having the inline retain count in a single place, in the runtime, rather than implemented scattershot in libraries where people think it's needed. The runtime guys can get it right once, and everybody else gets it for free.71c09d54d0b4e81f1043cee08b8cdcb9Sat, 28 Sep 2013 15:17:01 GMTmikeash - 2013-09-28 14:52:07http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#comments<b>blah:</b> You're just making up fantasies. If a future ARM CPU makes these isa tricks impossible, all Apple has to do is flip a switch and recompile the runtime, and all this stuff goes away. There is no way for it to break binary compatibility.805291e0c1ca750c3226ef99889f50ccSat, 28 Sep 2013 14:52:07 GMTPetr Machata - 2013-09-28 12:18:12http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#comments<b>Dan Smith</b>: AArch64 can pass up to 8 integral arguments in general-purpose registers, and up to 8 additional floating-point or vector arguments in vector registers. It can also pass small structures of up to 16 bytes in general-purpose registers, and structures recursively composed of up to 8 floating-point or vector types (called homogeneous floating-point aggregates) in vector registers. This is described in much more detail in "procedure call standard".96ccc818d872a43ed882a351c35261d8Sat, 28 Sep 2013 12:18:12 GMTJens Ayton - 2013-09-28 11:51:03http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#commentsDan Smith: yes, there were major ABI changes, to the point that objc_msgSend() comes in only one variant in ARM64 mode. Yay!c14ab3209078b8111ac56eb3c88daf22Sat, 28 Sep 2013 11:51:03 GMTasdf - 2013-09-28 08:34:49http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#commentsFull predication has the potential to make every instruction depend on every preceding instruction, which is no good if you're trying to execute instructions in parallel. 
Replacing that feature with a set of special conditional instructions both simplifies the pipeline and offers more opportunities for high-performance implementations. <br /> <br />The condition bits also eat a lot of precious opcode space for something that in the end isn't used all that often.c913c98fba081446174f628ac77fb33aSat, 28 Sep 2013 08:34:49 GMTAdam Nohejl - 2013-09-28 07:21:46http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#commentsGreat article! <br /> <br />As for 19 bits for the retainCount I fear the worst. If you ever thought that retain counts are just small numbers, this is a must read: <a href="http://www.cocoabuilder.com/archive/cocoa/326859-triggering-an-nstextstorage-bug.html">http://www.cocoabuilder.com/archive/cocoa/326859-triggering-an-nstextstorage-bug.html</a> <br /> <br />@blah: It's just an implementation detail of the Obj-C runtime. Your class pointers (i.e. what you get by calling [... class]) will be clean regardless of how isa is stored.2a9ae54644907a827ebc0bd851ce3039Sat, 28 Sep 2013 07:21:46 GMTblah - 2013-09-28 01:28:28http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#commentsUntil someone goes out of business or loses interest in doing updates. "updating the operating system" is not itself a cost-free operation by any means. AmigaBasic's runtime was a "part of the operating system" licensed from Microsoft, until it died. <br /> <br />Apple are just a company, not a God. "we control the horizontal and the vertical for all eternity forever and ever amen"??? Nah, not so much, it'll bite them in the ass and/or pass the cost on to you the developer some day by breaking binary compatibility, mark my words.dddb380dd07edb399b75871c88078889Sat, 28 Sep 2013 01:28:28 GMTmikeash - 2013-09-28 00:37:18http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#comments<b>blah:</b> I've seen this misconception over and over and over again. 
There's a fundamental difference between <i>application code</i>, which needs to be written to the overall spec and not make assumptions, and <i>operating system code</i>, which can assume anything it feels like about the hardware, because it gets adjusted whenever the hardware changes. <br /> <br />Bit smashing tricks like this are bad news for application code in general. But they're perfectly fine for the Objective-C runtime, because if any of the assumptions the runtime relies on are invalidated, Apple can update the runtime in lockstep. The iPhone 6 isn't going to break tagged pointers, because any changes they make will be accounted for in iOS 8's Objective-C runtime. <br /> <br />People look at historical problems like this and learn the wrong lesson from it. The lesson is not to avoid bit packing tricks, the lesson is not to make assumptions about factors that can change and that you don't control. Apple controls all the relevant factors.5439dc3ab8241f802e20bf9b2db20d3cSat, 28 Sep 2013 00:37:18 GMTKyle S. - 2013-09-28 00:06:53http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#commentsblah, the difference is that Apple's runtime is very careful not to pass off isa pointers to the raw hardware, whereas Amiga BASIC (and "32-bit unclean" code for Classic Mac OS) was not.0c0ca8e11ea411c65bf6d84dddf82f5bSat, 28 Sep 2013 00:06:53 GMTblah - 2013-09-27 22:33:57http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#commentsUgh. Not a fan of pointer abuse for tagging. Back in the day, microsoft amigabasic pulled a similar trick using seemingly spare bits in 32-bit pointers in an era you were lucky to have over a meg of ram, and after all the hardware only supported 24 bits out of the 32.. But later amigas with a few bits more real memory space appeared, and amigabasic never worked again. The end. 
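For readers who have not seen the trick under discussion, here is a rough C sketch of packing extra data into the unused high bits of a 64-bit pointer-sized value. It is not code from any real runtime: `pack`, `addr_of`, and `tag_of` are hypothetical names, and the 33-bit address width is the iOS figure quoted elsewhere in this thread, which is exactly the sort of assumption that can change.

```c
#include <assert.h>
#include <stdint.h>

/* Assumption: only the low 33 bits of a 64-bit value carry the address. */
#define ADDR_BITS 33
#define ADDR_MASK (((uint64_t)1 << ADDR_BITS) - 1)
#define TAG_MAX   (((uint64_t)1 << (64 - ADDR_BITS)) - 1)   /* 31 spare bits */

/* Combine an address and a small tag into one 64-bit word. */
static uint64_t pack(uint64_t addr, uint64_t tag) {
    assert(addr <= ADDR_MASK);   /* breaks the day addresses grow past 33 bits */
    assert(tag <= TAG_MAX);
    return addr | (tag << ADDR_BITS);
}

/* Strip the tag before the value is ever used as an address. */
static uint64_t addr_of(uint64_t packed) {
    return packed & ADDR_MASK;
}

/* Recover the tag from the spare high bits. */
static uint64_t tag_of(uint64_t packed) {
    return packed >> ADDR_BITS;
}
```

The whole scheme depends on every consumer going through `addr_of` before touching memory, and on `ADDR_BITS` being updated together with the platform. Whether both can be guaranteed is what this exchange is arguing about.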
<br /> <br />AMD had sensibly kinda gone out of their way to make it trickier for shortsighted software guys to abuse x86-64 64-bit pointers that way, even though current chips only address a smaller space right now, it's very bad to assume they'll always be spare. Idiot-proof something and the world builds a better idiot.afcc453470fb9c5e590ab2e4368ad64eFri, 27 Sep 2013 22:33:57 GMTagumonkey - 2013-09-27 20:40:24http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#commentsLove the content, love the form. <br />Thanks310090cab14f84156fa28cad6d36337cFri, 27 Sep 2013 20:40:24 GMTmikeash - 2013-09-27 19:56:33http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#comments<b>NetMage:</b> My guess is simply that they had 19 bits left over after cramming everything else they could think of into the field. It will likely shrink over time. E.g. if and when the iOS 64-bit address space expands, the actual class pointer will need to expand, meaning the reference count part will have to shrink.c12230256ddaba9ab5f5c63654f793a9Fri, 27 Sep 2013 19:56:33 GMTJonathan - 2013-09-27 19:49:53http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#commentsWith regards to the processor "official" name. The name of the processor could be whatever label the folks at Apple wanted to slap on it (as is the case with all ARM based chips manufactured by licensees). In the case of Apple's newest processor, apparently it's ARM64. Regardless of what they call it, this processor is based on the <i>ARMv8</i> architecture while <i>AArch64</i> refers to one of two main execution states supported within the ARMv8 architecture, the other being AArch32. 4f162b24d1be9b3b698c37c957cf0b14Fri, 27 Sep 2013 19:49:53 GMTNetMage - 2013-09-27 18:38:41http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#commentsAny idea of the common range for reference counts? 19 bits seems like a lot to devote to this. 
I would have thought they could have moved some more information into the field instead. 8af94c7eee2057e41f811f7af9ff89a5Fri, 27 Sep 2013 18:38:41 GMTRandy Lea - 2013-09-27 18:06:42http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#commentsThanks for the interesting article. <br /> <br />I would add that I haven't seen any details on the chip's internal data bus. I would assume that data transfers to both the floating point unit and GPU are 64-bits, which would offer a nice performance benefit for graphics and number crunching. <br /> <br />Also, the CPU is 64-bits, I would imagine the 1st level cache is 64-bits, and after that I don't know. My guess is that a 64-bit external data bus might burn a bit too much power for the performance. I would like to find out about the internal 2nd level (3rd?) cache configuration and the external system memory. <br /> <br />031894a5b980779ca85795c5c355692eFri, 27 Sep 2013 18:06:42 GMTDan Smith - 2013-09-27 18:01:24http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#commentsWere there major changes to the ABI as well? <br /> <br />For example, in ARM32, the first 4 items in a function were passed using registers, while the rest of the function data was passed on the stack. (IIRC) <br /> <br />With the increased number of registers, it would be possible to increase the number of function variables passed using registers. a5d906e7e6587ff83c7077b8535ebfb4Fri, 27 Sep 2013 18:01:24 GMTBerkus - 2013-09-27 16:13:50http://www.mikeash.com/?page=pyblog/friday-qa-2013-09-27-arm64-and-you.html#commentsGreat article! <br /> <br />Apple surprised me being a first company ever to release consumer implementation of AArch64, so hail to them!816312ab07bb353363578e73c5456249Fri, 27 Sep 2013 16:13:50 GMT