Gwynne finishes off her series on analyzing assembly code with a look at ARM assembly, for all of your iOS needs. Gwynne will be contributing the occasional article in the future as well as a guest author, without my introductions. Watch the Author field at the top of the post to see who's writing what. Without further ado, let's take a look at ARM.
Since I wrote part 1 of this series on reading x86_64 assembly language, I've gotten several requests for a version which talks about reading ARM assembly language, as ARM is the architecture used by devices running iOS. Unfortunately, at the time of those requests, I didn't know much of anything about ARM! So rather than disappoint, I embarked on a crash course in the instruction set.
Fortunately, it turned out that ARM isn't all that complicated; it's more like learning a new dialect of a language you already know than learning an entirely new language. For sanity's sake, in this article I will assume you've already read both part 1 and part 2, and explain only the differences that appear in ARM rather than reiterating all the basic concepts.
I have most likely made at least one mistake in my understanding of ARM, and I gladly invite any explanations and corrections that people may find.
CPU and ABI
Unfortunately, the ARM specifications are a little harder to get your hands on than x86_64's; you have to go to their website and register for an account before you can download the PDFs. On the bright side, it's not a paywall and the registration process is fairly simple. If you go to the trouble, look for the documents "ARMv7-AR Architecture Reference Manual (Issue C)" under "ARM Architecture", and "Base Platform ABI for the ARM Architecture" and "Procedure Call Standard for the ARM Architecture" under "ARM software development tools". Apple's documentation on the ARM ABI used for iOS is public, and gives nearly all information one would typically need regarding the ABI.
I found it a little bemusing that the primary documentation for ARM is split across several dozen documents, while that for x86_64 is in six PDFs total. There are an equal number of documents under the x86_64 umbrella as a whole, but you need only two (maximum three if you do SIMD work) of them to do applications programming, and those are very clearly labeled. At least three documents are needed just to get the same amount of data for ARM, and they're considerably harder to find unless you already know a number of details about the platform and high-level language you'll be working with.
The many flavors of ARM
Before I can talk about the particulars of how ARM does what it does, I have to explain something about the architecture: There are at least a dozen flavors of it. Unlike x86, ARM was designed from its very beginning for use in a variety of environments, all with different requirements for speed, power consumption, and efficiency. It has also gone through a number of major revisions, and has a long list of optional features. In this article I will focus on the particular flavor of ARM used by the modern iPhone, iPad, and iPod Touch: The ARMv7 architecture, Application profile, with Thumb 2 and NEON SIMD.
What does this mean? It means I'll be working from version 7 of the ARM instruction set (there are currently 8 revisions, but ARMv8 is not yet implemented in any shipping processors). I'll focus on the Application profile, which is intended for typical operating systems, rather than the Real-time profile (intended for small - you guessed it! - realtime systems) or the Microcontroller profile (intended for embedded systems). ARMv6, used in first- and second-generation iDevices save the iPad, is largely similar, but not identical. It also means I'll use the Thumb instruction set, as it's strongly recommended by Apple for use on iOS.
Thumb
The ARM architecture specifies a secondary instruction set called Thumb. The purpose of Thumb is fairly simple: Do all the common tasks with smaller instructions. While ARM instructions have a fixed size of 32 bits each, Thumb instructions (with a very few exceptions) are 16 bits each, making for much smaller code. Apple has recommended building iOS code with Thumb for most of iOS' lifetime for exactly this reason, though their toolchain is famous for creating crashing code when building ARMv6 Thumb with floating-point support.
In ARMv6, reading Thumb versus ARM assembler was annoying at best, as they didn't use the same language. With ARMv7, however, ARM developed a "Unified Assembly Language" (UAL) which covers all the operations of both ARM and Thumb instructions in a single (you guessed it!) unified set of mnemonics. This is another reason I'm sticking to ARMv7.
Note: It was also not until ARMv7 that Thumb (more specifically, Thumb 2) got proper support for floating-point extensions, which is part of the reason Apple's compilers had so much trouble earlier on.
A particular quirk of Thumb is how you tell the CPU which instruction set you're running at any given moment. This is solved by a set of "interworking" branch instructions; when a branch is taken by one of these instructions to an address stored in a register, it interprets the least significant (last) bit of the address as a flag telling the CPU which mode to switch to. When a branch is taken to a hardcoded address, however, the instruction set flag is simply flipped. Therefore, a blx some_label instruction (the immediate form exists only for blx; bx always takes a register) will branch to Thumb if the current ISA is ARM, and ARM if the current ISA is Thumb, whereas a bx r5 instruction will set the current ISA based on the low bit of r5 (a 1 bit branches to Thumb, a 0 bit to ARM).
Keep in mind, however, that all this fancy interworking stuff only happens if the program asks it to. The normal b and bl instructions don't do any of this. Whether or not you change instruction sets during any branch is up to you!
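The interworking rules above can be modeled in a few lines of Python. This is only a sketch of the state change, not of real instruction decoding; the function names and the example addresses are made up for illustration.

```python
def branch_exchange_register(target):
    """Interworking branch to a register value (bx rN / blx rN).

    Bit 0 of the register selects the instruction set and is not part of
    the destination address.
    """
    isa = "thumb" if target & 1 else "arm"
    return target & ~1, isa


def branch_exchange_immediate(current_isa, target):
    """Interworking branch to a hardcoded address: the ISA is simply flipped."""
    new_isa = "arm" if current_isa == "thumb" else "thumb"
    return target, new_isa


def plain_branch(current_isa, target):
    """b / bl: no interworking, so the current ISA is kept."""
    return target, current_isa


# Hypothetical register value with the low bit set: lands in Thumb code.
address, isa = branch_exchange_register(0x2F01)
print(hex(address), isa)
```

The register form is the only one that needs to look at the address at all, which is why function pointers to Thumb code carry a 1 in their low bit.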
Don't worry if this doesn't make too much sense; thanks to UAL, the meaning of the instructions is the same between Thumb and ARM in almost all cases, so knowing how to read the assembly language is most of what you need. It is important to remember only that most Thumb instructions can only access a limited set of registers, and take much smaller ranges of immediate values.
Registers
ARM specifies sixteen general-purpose registers and one status register for application use. NEON further defines sixteen 128-bit vector registers which overlap with the set of thirty-two double precision floating-point registers used by the earlier VFP specification (these in turn overlap with the set of thirty-two single precision registers). Registers with multiple names can be referred to by either name with the same result, though by convention it is always preferred to use a name (such as lr) over a number (such as r14) except in specific situations. Of the sixteen general-purpose registers, several are reserved for specific purposes.
- r0-r3 - The first four GPRs are used for argument passing and return values, but are freely available for use as scratch storage within a function.
- r4-r6, r8, r10, r11 - The next three GPRs, as well as r8, r10, and r11, are documented as being preserved across function calls, but are also otherwise available for use.
- r7 - On iOS, r7 is the frame pointer, much as rbp is in x86_64. The ARM architecture in general does not specify a use for r7; this is specific to iOS. To be exact, ARM specifies r11 as fp, but Apple chose not to use that register on iOS, and avoided the fp name so as not to make their assembler incompatible with other ARM implementations.
- r9 - On iOS 2.x, r9 is a special register used by the OS for unspecified purposes and must not be modified. On iOS 3.0 and above, r9 is free for use and does not need to be preserved across function calls.
- r12, ip - r12 is the "intra-procedure scratch register". Between calls, it has the same semantics as r9 and does not have to be preserved. It is also called ip (not to be confused with x86_64's rip register - they are not similar), and is used as such for computing destination addresses for long branches.
- r13, sp - r13, also called sp, is the stack pointer. This serves the same purpose as rsp on x86_64 and works in much the same way.
- r14, lr - The link register is loaded by the bl and blx instructions, which make subroutine calls (much as call does on x86_64). ARM dedicates a register to storing the return address for a call rather than relying on the stack to hold it. This means that a subroutine can, at least in theory, operate without ever touching the stack at all, an important win in real-time and embedded systems.
- r15, pc - The program counter holds the address of the next instruction to be executed. It is the counterpart to x86_64's rip register. Some non-branch instructions can access pc directly, but it is very strongly discouraged except in very specific situations (for example, as we'll see below, it is a common technique to push lr to the stack in a prologue and pop the value into pc in an epilogue, which results in a return from subroutine).
- q0-q15 - A set of sixteen 128-bit vector registers which are used for parameter passing, result values, and scratch computation. Of these, q4-q7 are preserved across function calls.
- d0-d31 - A set of thirty-two 64-bit double-precision floating-point registers. These are mapped onto the corresponding vector registers such that d0 and d1 are the low and high 64 bits of q0, d2 and d3 are the low and high 64 bits of q1, and so on.
- s0-s31 - A set of thirty-two 32-bit single-precision floating-point registers. These are mapped onto the corresponding double-precision registers such that s0 and s1 are the low and high 32 bits of d0, and so forth. This also implies that s0-s3 are the four 32-bit components of q0.
- cpsr - The CPU's status register, partially equivalent to rflags in x86_64. It has the following bits, most of which are not preserved across function calls:
  - N - Negative flag, i.e. the sign bit of the result of a computation.
  - Z - Zero flag, i.e. whether a result is equal to zero.
  - C - Carry flag, i.e. whether an addition carried; after a subtraction it is set when no borrow occurred.
  - V - Overflow flag, i.e. whether a computation result overflowed its destination.
  - Q - Saturation flag, i.e. whether a computation resulted in saturation; this is used by instructions that saturate on overflow (i.e. clamp to the largest or smallest representable value rather than wrapping around the integer range).
  - GE - The Greater than or Equal flags, used by parallel arithmetic instructions to represent the results from several additions or subtractions at once.
  - E - Endianness. The ARM architecture supports switching the endianness mode of the CPU at runtime, but iOS specifies that this bit must always remain zero for little-endian mode.
  - T - Thumb. This is the status flag which determines whether Thumb or ARM code is being executed. It can not be modified directly and the preservation of its value is context-dependent. (There is also a J flag for Jazelle mode, but iOS does not implement or use it.)
  - All other bits are system-level and can only be accessed by privileged code.
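To make the N, Z, C, V and Q descriptions concrete, here is a small Python model of a flag-setting 32-bit addition and of a signed saturating addition. It is a sketch under simple assumptions: the helper names are invented for this example, and only addition is modeled.

```python
MASK32 = 0xFFFFFFFF


def to_signed(value):
    """Reinterpret a 32-bit pattern as a signed integer."""
    value &= MASK32
    return value - (1 << 32) if value & 0x80000000 else value


def add_with_flags(a, b):
    """Model a flag-setting 32-bit addition and return (result, flags)."""
    a &= MASK32
    b &= MASK32
    full = a + b
    result = full & MASK32
    flags = {
        "N": result >> 31,                          # sign bit of the result
        "Z": int(result == 0),                      # result is zero
        "C": int(full > MASK32),                    # unsigned carry out of bit 31
        "V": ((a ^ result) & (b ^ result)) >> 31,   # signed overflow
    }
    return result, flags


def saturating_add(a, b):
    """Model a signed saturating addition and return (result, Q)."""
    total = to_signed(a) + to_signed(b)
    clamped = max(-(1 << 31), min((1 << 31) - 1, total))
    return clamped & MASK32, int(clamped != total)


# The largest positive value plus one: wraps with V set, or clamps with Q set.
print(add_with_flags(0x7FFFFFFF, 1))
print(saturating_add(0x7FFFFFFF, 1))
```

The same inputs show the difference between the two behaviors: the plain addition wraps around to a negative number and reports it through V, while the saturating one stays at the maximum and reports it through Q.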
Calling conventions
Actually, I don't have to say much here; it's all in the registers commentary. r0-r3 are used for integer argument passing and returns, s/d/q0-qN are used for floating-point argument passing and returns, and anything that doesn't fit in the registers is spilled to the stack. As usual, this is a bit simplified, but it works!
Oh, and by the way, the memory model for ARM is, for all intents and purposes, the same as that for x86_64, save for the smaller memory space.
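As a rough illustration of the integer half of that rule, the following Python sketch assigns word-sized arguments to r0-r3 and spills the rest. It deliberately ignores floating-point and 64-bit arguments, and the function name and the placeholder argument values are assumptions made for this example.

```python
def assign_word_arguments(args):
    """Put the first four word-sized arguments in r0-r3, spill the rest."""
    registers = {"r%d" % i: value for i, value in enumerate(args[:4])}
    stack = list(args[4:])
    return registers, stack


# An Objective-C message send passes the receiver and selector first,
# followed by the method's own arguments.
registers, stack = assign_word_arguments(["obj", "initWithName:number:", "@name", 42])
print(registers, stack)
```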
Disassembling, at last
Enough architecture talk, let's look at some assembler. I'm using the same code again as in parts 1 and 2, and the following command for compilation:
/Developer/Platforms/iPhoneOS.platform/Developer/usr/bin/clang -S test.m -fobjc-arc -arch armv7 -miphoneos-version-min=5.0 -isysroot /Developer/Platforms/iPhoneOS.platform/Developer/SDKs/iPhoneOS5.0.sdk -mthumb -Os
That is: use the iOS SDK's Clang, show assembler output, compile with ARC, compile for ARMv7, require iOS 5 (required to link properly), build against the iOS SDK (required for ARM compilation), use Thumb, and optimize for size. And here's the result for the main function:
.thumb_func _main
_main:
push {r4, r5, r6, r7, lr}
add r7, sp, #12
str r8, [sp, #-4]!
blx _objc_autoreleasePoolPush
movw r1, :lower16:(L_OBJC_SELECTOR_REFERENCES_25-(LPC8_0+4))
movt r1, :upper16:(L_OBJC_SELECTOR_REFERENCES_25-(LPC8_0+4))
mov r8, r0
movw r0, :lower16:(L_OBJC_CLASSLIST_REFERENCES_$_-(LPC8_1+4))
LPC8_0:
add r1, pc
movt r0, :upper16:(L_OBJC_CLASSLIST_REFERENCES_$_-(LPC8_1+4))
LPC8_1:
add r0, pc
ldr r1, [r1]
ldr r0, [r0]
blx _objc_msgSend
movw r1, :lower16:(L_OBJC_SELECTOR_REFERENCES_28-(LPC8_2+4))
movs r3, #42
movt r1, :upper16:(L_OBJC_SELECTOR_REFERENCES_28-(LPC8_2+4))
LPC8_2:
add r1, pc
ldr r1, [r1]
movw r2, :lower16:(L__unnamed_cfstring_27-(LPC8_3+4))
movt r2, :upper16:(L__unnamed_cfstring_27-(LPC8_3+4))
LPC8_3:
add r2, pc
blx _objc_msgSend
mov r5, r0
movw r0, :lower16:(L_OBJC_SELECTOR_REFERENCES_29-(LPC8_4+4))
movt r0, :upper16:(L_OBJC_SELECTOR_REFERENCES_29-(LPC8_4+4))
LPC8_4:
add r0, pc
ldr r1, [r0]
mov r0, r5
blx _objc_msgSend
mov r7, r7 @ marker for objc_retainAutoreleaseReturnValue
blx _objc_retainAutoreleasedReturnValue
mov r4, r0
mov r0, r4
bl _MyFunction
mov r7, r7 @ marker for objc_retainAutoreleaseReturnValue
blx _objc_retainAutoreleasedReturnValue
mov r6, r0
mov r0, r4
blx _objc_release
movw r0, :lower16:(L__unnamed_cfstring_23-(LPC8_5+4))
mov r1, r6
movt r0, :upper16:(L__unnamed_cfstring_23-(LPC8_5+4))
LPC8_5:
add r0, pc
blx _NSLog
mov.w r0, #1065353216
bl _MyFPFunction
vmov s0, r0
movw r0, :lower16:(L__unnamed_cfstring_31-(LPC8_6+4))
movt r0, :upper16:(L__unnamed_cfstring_31-(LPC8_6+4))
vcvt.f64.f32 d16, s0
LPC8_6:
add r0, pc
vmov r1, r2, d16
blx _NSLog
mov r0, r6
blx _objc_release
mov r0, r5
blx _objc_release
mov r0, r8
blx _objc_autoreleasePoolPop
movs r0, #0
ldr r8, [sp], #4
pop {r4, r5, r6, r7, pc}
Whew! For a function built with an instruction set specifically designed to be small, this is pretty long. Seventy-two lines. In truth, the compiled machine code is quite small; only the assembler that produces that code is verbose, which is typical of RISC architectures.
Some quick notes about ARM assembler syntax:
- The order of operands in an ARM instruction is reversed from the "GAS" (GNU ASsembler) syntax used by the x86_64 assembler. A typical instruction is "mnemonic destination, operand1, operand2". This is ironically closer to the original Intel assembler syntax.
- Immediate operands are delimited with # rather than $.
- Register names are not prefixed by %.
Let's see what we can make of main:
.thumb_func _main
_main:
push {r4, r5, r6, r7, lr}
add r7, sp, #12
str r8, [sp, #-4]!
.thumb_func is an assembler directive that tells clang to use the Thumb ISA for the function named. Then we have the symbol for the function. ARM's push instruction can take an entire list of registers at once to save to the stack, so main is saving r4 through r6 so it can use them as scratch, r7 (the old frame pointer), and lr (the return address of the calling function). lr must be saved because any subroutine calls main makes will overwrite the value - it is not preserved across function calls. Not only that, but iOS (and any other implementation with a frame pointer) requires the return address to be on the stack as part of the stack frame for the function. The iOS ABI says that failure to set up a stack frame "can prevent debugging and performance tools from generating valid backtraces." Next, we add 12 to the stack pointer and store the result in r7 to create the frame pointer. Notice that this does not modify sp itself! The last instruction is rather tricky: It subtracts 4 from sp, saves the new value of sp, and writes the value in r8 to memory at sp. The net result of this is to push r8 to the stack. The compiler doesn't add r8 to the first push, most likely for two reasons: the 16-bit Thumb encoding of push can only name r0 through r7 and lr, and keeping r8 out of that block leaves the saved r7 and lr next to each other on the stack. (The str used here is itself a 32-bit Thumb instruction.)
blx _objc_autoreleasePoolPush
mov r8, r0
Branch and link with interworking to objc_autoreleasePoolPush. This is just a subroutine call, with the option to switch to the ARM ISA. The function takes no parameters, and will return its result in r0. The result is saved in r8. Yes, I know I've read the instructions out of order here, but the compiler separated them for some reason and it makes no difference to the code flow to show them in conceptual order rather than assembler order.
movw r1, :lower16:(L_OBJC_SELECTOR_REFERENCES_25-(LPC8_0+4))
movt r1, :upper16:(L_OBJC_SELECTOR_REFERENCES_25-(LPC8_0+4))
movw r0, :lower16:(L_OBJC_CLASSLIST_REFERENCES_$_-(LPC8_1+4))
LPC8_0:
add r1, pc
movt r0, :upper16:(L_OBJC_CLASSLIST_REFERENCES_$_-(LPC8_1+4))
LPC8_1:
add r0, pc
ldr r1, [r1]
ldr r0, [r0]
blx _objc_msgSend
Welcome to calling Objective-C selectors on ARM! This reads: Take the address of label L_OBJC_SELECTOR_REFERENCES_25, subtract the address of label LPC8_0+4, and select the low sixteen bits of the result, writing them into the bottom 16 bits of r1. Then take the upper 16 bits of the same result and write them into the upper 16 bits of r1. This is done in two steps because there is no Thumb encoding for a 32-bit immediate register load. pc is then added to r1, forming a pc-relative address. This idiom should look familiar from x86_64's rip-relative addressing. The same is done again for L_OBJC_CLASSLIST_REFERENCES_$_, loading it into r0. The two ldr instructions read values from the addresses in the two registers into the registers themselves, equivalent to the C code a = *a;. Finally, the code does a branch and link with interworking instruction to objc_msgSend, completing the method call. Congratulations, you've just learned how ARM calls [MyClass alloc]! The vtable tricks done in x86_64 don't exist for ARM, as ARM's addressing and branching modes don't make direct dispatch more efficient than a subroutine call.
movw r1, :lower16:(L_OBJC_SELECTOR_REFERENCES_28-(LPC8_2+4))
movt r1, :upper16:(L_OBJC_SELECTOR_REFERENCES_28-(LPC8_2+4))
LPC8_2:
add r1, pc
ldr r1, [r1]
movw r2, :lower16:(L__unnamed_cfstring_27-(LPC8_3+4))
movt r2, :upper16:(L__unnamed_cfstring_27-(LPC8_3+4))
LPC8_3:
add r2, pc
movs r3, #42
blx _objc_msgSend
mov r5, r0
Once again I've slightly rearranged the instruction stream to make more sense. This call is really no different from the first. The result from alloc is already in r0. The -initWithName:number: selector is loaded into r1, the string @"name" into r2, and the immediate value 42 into r3. The result of the method call is saved in r5.
movw r0, :lower16:(L_OBJC_SELECTOR_REFERENCES_29-(LPC8_4+4))
movt r0, :upper16:(L_OBJC_SELECTOR_REFERENCES_29-(LPC8_4+4))
LPC8_4:
add r0, pc
ldr r1, [r0]
mov r0, r5
blx _objc_msgSend
This is the [obj name] method call. Nothing special here.
mov r7, r7 @ marker for objc_retainAutoreleaseReturnValue
blx _objc_retainAutoreleasedReturnValue
mov r4, r0
The ARC-aware compiler inserts an effective nop (move a register to itself) into the instruction stream so certain code in the Objective-C runtime can detect the call to objc_retainAutoreleasedReturnValue, then actually makes the call. No registers need to be set up, as the return value in r0 matches the intended first parameter also in r0. The result of the function, i.e. the retained [obj name], is saved in r4.
mov r0, r4
bl _MyFunction
mov r7, r7 @ marker for objc_retainAutoreleaseReturnValue
blx _objc_retainAutoreleasedReturnValue
mov r6, r0
The value from r4 is reloaded into r0, unnecessarily! The compiler seems to have stumbled in optimizing here, and I'm not sure why. In any case, it then calls MyFunction, and another objc_retainAutoreleasedReturnValue stanza, saving the final result (i.e. string) in r6.
mov r0, r4
blx _objc_release
[obj name] is no longer needed, so release it.
movw r0, :lower16:(L__unnamed_cfstring_23-(LPC8_5+4))
movt r0, :upper16:(L__unnamed_cfstring_23-(LPC8_5+4))
LPC8_5:
add r0, pc
mov r1, r6
blx _NSLog
Again having rearranged the instructions slightly, this is a load of @"%@" into r0 and string into r1, and a call to NSLog. Variadic functions appear to have no special semantics on ARM.
mov.w r0, #1065353216
bl _MyFPFunction
vmov s0, r0
The bit pattern of the single-precision value 1.0 (0x3F800000, written here as a decimal integer) is loaded into r0, an integer register. Why? Because it fits and integer registers are more efficient than vector or floating-point registers when possible. MyFPFunction is called, and the result (again in r0!) loaded into the s0 floating-point register. Note: The .w suffix on the mov instruction means to use the 32-bit Thumb encoding for the instruction even if the 16-bit encoding would otherwise be selected. For this particular instruction, it is probably unnecessary, but it's considered good form to use the annotation even if the 32-bit encoding would be chosen automatically.
movw r0, :lower16:(L__unnamed_cfstring_31-(LPC8_6+4))
movt r0, :upper16:(L__unnamed_cfstring_31-(LPC8_6+4))
LPC8_6:
add r0, pc
vcvt.f64.f32 d16, s0
vmov r1, r2, d16
blx _NSLog
Load @"%f" into r0. Convert the single-precision floating-point value in s0 to a double-precision floating-point value in d16, per the C type promotion rules for variadic function parameter lists (float is converted to double). Load the double-precision value in d16 into the two integer registers r1 and r2, and call NSLog. It is always preferred to pass floating-point arguments in integer registers if they are available.
mov r0, r6
blx _objc_release
mov r0, r5
blx _objc_release
mov r0, r8
blx _objc_autoreleasePoolPop
Release string, release obj, pop the autorelease pool.
movs r0, #0
ldr r8, [sp], #4
pop {r4, r5, r6, r7, pc}
Load 0 as the return value of main, pop r8 off the stack (remember, r8 can not be efficiently addressed by a pop instruction in Thumb), and restore all the saved registers. Notice that while we saved lr at the beginning of main, we now pop that value into pc. This effectively executes an interworking return from subroutine. (pop into pc is explicitly documented as an interworking branch.)
And that's main.
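The movw / movt / add pc idiom that main uses for every selector, class and string reference can also be checked numerically. The following Python sketch assumes hypothetical addresses for the label and the target, and relies on pc reading as the address of the current instruction plus 4 in Thumb state, which is where the +4 in the assembler expressions comes from.

```python
MASK32 = 0xFFFFFFFF


def split_offset(target, label):
    """Return the :lower16: and :upper16: halves of target - (label + 4)."""
    offset = (target - (label + 4)) & MASK32
    return offset & 0xFFFF, offset >> 16


def movw(imm16):
    """movw rd, #imm16: write the low half of rd and clear the high half."""
    return imm16 & 0xFFFF


def movt(rd, imm16):
    """movt rd, #imm16: write the high half of rd and keep the low half."""
    return (rd & 0xFFFF) | ((imm16 & 0xFFFF) << 16)


def add_pc(rd, instruction_address):
    """add rd, pc in Thumb state: pc reads as the instruction's address + 4."""
    return (rd + instruction_address + 4) & MASK32


def load_address(target, label):
    """Run the movw / movt / add pc sequence for an add located at `label`."""
    low, high = split_offset(target, label)
    rd = movw(low)
    rd = movt(rd, high)
    return add_pc(rd, label)


# Hypothetical layout: the add sits at 0x2F3A, the referenced data at 0x130A4.
print(hex(load_address(0x130A4, 0x2F3A)))
```

Because only the difference between the two labels is encoded, the sequence produces the right address wherever the image is loaded, as long as code and data move together.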
MyFPFunction
Let's take a quick look at the floating-point operations:
.thumb_func _MyFPFunction
_MyFPFunction:
vmov.f32 s2, #5.000000e-01
vldr.32 s0, LCPI7_0
vmov s4, r0
vadd.f32 d16, d2, d1
vadd.f32 d0, d16, d0
vmov r0, s0
bx lr
LCPI7_0:
.long 3197737370
Declare a Thumb function MyFPFunction.
vmov.f32 s2, #5.000000e-01
vldr.32 s0, LCPI7_0
vmov s4, r0
Load the immediate single-precision floating-point value 0.5 into s2. Load the single-precision floating-point value at LCPI7_0 (-0.3) into s0. Load the single-precision floating-point value stored in r0 into s4.
vadd.f32 d16, d2, d1
vadd.f32 d0, d16, d0
vmov r0, s0
Using single-precision math, first add d1 and d2 into d16 (remember that s0 corresponds to d0, s2 corresponds to d1, and s4 corresponds to d2). Then add d0 and d16 into d0. Finally, store the single-precision result back into r0.
bx lr
Branch, with interworking, to the link register. This function is so small and simple that it has no prologue, no epilogue, and no stack frame, when built in optimizing mode. This is a noticeable win for ARM code, where it wouldn't have been for x86_64. Since lr contains the return address in main, this returns us there.
That's the whole function! Wasn't that simple?
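To see the register overlap at work, here is a Python sketch that replays MyFPFunction instruction by instruction. It assumes a single byte array as the register bank, so that s, d and q names are just different views of the same bytes; the helper names are invented, and the unused upper lanes simply hold zero here where real hardware would hold leftovers.

```python
import struct

# One 256-byte bank stands in for the overlapping s, d and q register views:
# sN lives at byte offset 4*N, dN at 8*N, qN at 16*N.
BANK = bytearray(256)


def bits_to_f32(bits):
    """Reinterpret a 32-bit integer as a single-precision float."""
    return struct.unpack("<f", struct.pack("<I", bits))[0]


def f32_to_bits(value):
    """Reinterpret a single-precision float as a 32-bit integer."""
    return struct.unpack("<I", struct.pack("<f", value))[0]


def write_s(n, value):
    struct.pack_into("<f", BANK, 4 * n, value)


def read_s(n):
    return struct.unpack_from("<f", BANK, 4 * n)[0]


def vadd_f32(dd, dn, dm):
    """vadd.f32 dD, dN, dM: add both 32-bit lanes of two d registers."""
    a = struct.unpack_from("<2f", BANK, 8 * dn)
    b = struct.unpack_from("<2f", BANK, 8 * dm)
    struct.pack_into("<2f", BANK, 8 * dd, a[0] + b[0], a[1] + b[1])


def my_fp_function(r0):
    write_s(2, 0.5)                       # vmov.f32 s2, #5.000000e-01
    write_s(0, bits_to_f32(3197737370))   # vldr.32 s0, LCPI7_0
    write_s(4, bits_to_f32(r0))           # vmov s4, r0
    vadd_f32(16, 2, 1)                    # vadd.f32 d16, d2, d1
    vadd_f32(0, 16, 0)                    # vadd.f32 d0, d16, d0
    return f32_to_bits(read_s(0))         # vmov r0, s0


# main passes the bit pattern of 1.0 in r0.
result = my_fp_function(1065353216)
print(result, bits_to_f32(result))
```

Feeding it the 1065353216 that main passes in r0 gives back the bit pattern of a value close to 1.2, i.e. (1.0 + 0.5) + -0.3 computed in single precision.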
Conclusion
This concludes the whirlwind tour of ARM assembly, and this series of articles on assembly in general. There is quite a bit I haven't covered in both architectures, such as conditional instructions, looping, and optional flag updates on instructions, but this was meant only as an introduction after all. I hope you've enjoyed it. Thanks for reading!
Comments:
In any case, calls to NSLog and other variadic functions will always use the base version of AAPCS, as stated in the specification.
Also: In my experience (I've done kernel level work on ARM), the ARMv7 Architecture reference manual (Or, as I like to say, the ARM ARM) contains pretty much everything that the 6 x86 PDFs do. The SystemV x86_64 ABI (as used by everything but Windows and EFI) is specified in a separate document found on x86-64.org, so is really no different, and pretty much all the ARM floating point and vector operations are specified inside the ARM ARM.
test.m:3:13: fatal error: 'Cocoa/Cocoa.h' file not found. Which seems to make sense since iOS programs don't use Cocoa.h. Are you sure you used the identical test.m?
#import <Foundation/Foundation.h> to build for armv7.
mov r4, r0
mov r0, r4
I stumbled across this also whilst trying to work out why the compiler was doing something odd - http://stackoverflow.com/questions/9151028/why-do-these-simple-methods-compile-differently/
The optimiser clearly is doing strange things and quite odd that it didn't manage to clean up the redundant mov. I guess these things happen though.
You should file a bug report for it as well.
mov r7, r7 as a marker. Can r7 be replaced by other register?