In the previous article on ARM64, I mentioned that one advantage of the new architecture is that it has twice as many registers, allowing code to load data from RAM, which is much slower, less often. Reader Daniel Hooper asks the natural question: just why is RAM so much slower than registers?
Let's start with distance. It's not necessarily a big factor, but it's the most fun to analyze. RAM is farther away from the CPU than registers are, which can make it take longer to fetch data from it.
Take a 3GHz processor as an extreme example. The speed of light is roughly one foot per nanosecond, or about 30cm per nanosecond for you metric folk. Light can only travel about four inches in the time of a single clock cycle of this processor. That means a roundtrip signal can only get to a component that's two inches away or less, and that assumes that the hardware is perfect and able to transmit information at the speed of light in vacuum. For a desktop PC, that's pretty significant. However, it's much less important for an iPhone, where the clock speed is much lower (the 5S runs at 1.3GHz) and the RAM is right next to the CPU.
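As a rough check on those numbers, the distance light covers in one clock cycle is just the speed of light divided by the clock frequency. The sketch below is purely illustrative: the function name is made up for this example, and real signals in wires travel noticeably slower than light in vacuum.

```python
# Distance light travels in vacuum during one clock cycle.
SPEED_OF_LIGHT_M_PER_S = 299_792_458  # exact, by definition of the meter

def light_distance_per_cycle_cm(clock_hz):
    """Return how far light travels (in cm) during one clock cycle."""
    return SPEED_OF_LIGHT_M_PER_S / clock_hz * 100

for name, hz in [("3 GHz desktop CPU", 3.0e9), ("1.3 GHz iPhone 5S", 1.3e9)]:
    one_way = light_distance_per_cycle_cm(hz)
    print(f"{name}: {one_way:.1f} cm per cycle, {one_way / 2:.1f} cm round trip")
```

At 3GHz this comes out to roughly 10cm (about four inches) per cycle, and at 1.3GHz to roughly 23cm.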
Much as we might wish it wasn't, cost is always a factor. In software, when trying to make a program run fast, we don't go through the entire program and give it equal attention. Instead, we identify the hotspots that are most critical to performance, and give them the most attention. This makes the best use of our limited resources. Hardware is similar. Faster hardware is more expensive, and that expense is best spent where it'll make the most difference.
Registers get used extremely frequently, and there aren't a lot of them. There are only about 6,000 bits of register data in an A7 (32 64-bit general-purpose registers plus 32 128-bit floating-point registers, and some miscellaneous ones). There are about 8 billion bits (1GB) of RAM in an iPhone 5S. It's worthwhile to spend a bunch of money making each register bit faster. There are literally a million times more RAM bits, and those eight billion bits pretty much have to be as cheap as possible if you want a $650 phone instead of a $6,500 phone.
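The arithmetic behind "about 6,000 bits" and "a million times more" can be spelled out in a few lines. This sketch uses only the register counts given above, ignores the miscellaneous registers, and assumes 1GB means 2**30 bytes.

```python
# Register bits in an A7 core, using the counts given in the text.
general_purpose_bits = 32 * 64    # 32 general-purpose registers, 64 bits each
floating_point_bits = 32 * 128    # 32 floating-point registers, 128 bits each
register_bits = general_purpose_bits + floating_point_bits

# 1GB of RAM, taken here as 2**30 bytes of 8 bits each (an assumption).
ram_bits = 2**30 * 8

print(register_bits)              # 6144
print(ram_bits)                   # 8589934592
print(ram_bits // register_bits)  # 1398101
```

So there are a bit over a million RAM bits for every register bit.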
Registers use an expensive design that can be read quickly. Reading a register bit is a matter of activating the right transistor and then waiting a short time for the register hardware to push the read line to the appropriate state.
Reading a RAM bit, on the other hand, is more involved. A bit in the DRAM found in any smartphone or PC consists of a single capacitor and a single transistor. The capacitors are extremely small, as you'd expect given that you can fit eight billion of them in your pocket. This means they carry a very small amount of charge, which makes it hard to measure. We like to think of digital circuits as dealing in ones and zeroes, but the analog world comes into play here. The read line is pre-charged to a level that's halfway between a one and a zero. Then the capacitor is connected to it, which either adds or drains a tiny amount of charge. An amplifier is used to push the charge towards zero or one. Once the charge in the line is sufficiently amplified, the result can be returned.
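To get a feel for how small the signal is, here is a minimal sketch of the charge-sharing step. The formula is plain conservation of charge between two capacitors. The supply voltage and both capacitance values are assumptions picked for illustration, not figures for any particular DRAM part.

```python
def bitline_voltage_after_sharing(cell_v, precharge_v, cell_c, line_c):
    """Voltage on the read line after the cell capacitor is connected to it.

    Charge is conserved, so the final voltage is the capacitance-weighted
    average of the two starting voltages.
    """
    return (cell_c * cell_v + line_c * precharge_v) / (cell_c + line_c)

VDD = 1.0            # assumed supply voltage (volts)
PRECHARGE = VDD / 2  # read line starts halfway between a zero and a one
CELL_C = 25e-15      # assumed cell capacitance (farads)
LINE_C = 250e-15     # assumed read-line capacitance, ten times the cell

for stored_bit, cell_v in [(1, VDD), (0, 0.0)]:
    v = bitline_voltage_after_sharing(cell_v, PRECHARGE, CELL_C, LINE_C)
    swing_mv = (v - PRECHARGE) * 1000
    print(f"stored {stored_bit}: line moves by {swing_mv:+.1f} mV")
```

With these assumed values the line moves by only about 45 mV in either direction, which is why an amplifier is needed before the result can be treated as a clean one or zero.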
The fact that a RAM bit is only one transistor and one tiny capacitor makes it extremely cheap to manufacture. Register bits contain more parts and thereby cost much more.
There's also a lot more complexity involved just in figuring out what hardware to talk to with RAM because there's so much more of it. Reading from a register looks like:
- Extract the relevant bits from the instruction.
- Put those bits onto the register file's read lines.
- Read the result.
Reading from RAM looks like:
- Get the pointer to the data being loaded. (Said pointer is probably in a register. This already encompasses all of the work done above!)
- Send that pointer off to the MMU.
- The MMU translates the virtual address in the pointer to a physical address.
- Send the physical address to the memory controller.
- Memory controller figures out what bank of RAM the data is in and asks the RAM.
- The RAM figures out which particular chunk the data is in, and asks that chunk.
- The previous step may repeat a couple more times before narrowing it down to a single array of cells.
- Load the data from the array.
- Send it back to the memory controller.
- Send it back to the CPU.
- Use it!
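The address-handling steps in that list can be sketched as a toy model. Everything specific here is an assumption made for illustration: the 4KB page size, the contents of the hypothetical `page_table`, and the widths and order of the bank, row, and column fields. Real MMUs and memory controllers are far more elaborate.

```python
PAGE_SIZE = 4096  # assumed page size in bytes

# Hypothetical page table: virtual page number -> physical page number.
page_table = {0x10: 0x7A, 0x11: 0x03}

def translate(virtual_address):
    """MMU step: swap the virtual page number for a physical one."""
    page, offset = divmod(virtual_address, PAGE_SIZE)
    return page_table[page] * PAGE_SIZE + offset

def decode(physical_address, column_bits=10, row_bits=14, bank_bits=3):
    """Memory controller and RAM steps: split the address into fields.

    The field widths and their order are made up for this sketch.
    """
    column = physical_address & ((1 << column_bits) - 1)
    row = (physical_address >> column_bits) & ((1 << row_bits) - 1)
    bank = (physical_address >> (column_bits + row_bits)) & ((1 << bank_bits) - 1)
    return bank, row, column

physical = translate(0x10ABC)
print(hex(physical))     # 0x7aabc
print(decode(physical))  # (0, 490, 700)
```

Each of these lookups and decodes takes time in hardware, and none of them is needed when the operand is simply named by a few bits in the instruction.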
Dealing With Slow RAM
That sums up why RAM is so much slower. But how does the CPU deal with such slowness? A RAM load is a single CPU instruction, but it can potentially take hundreds of CPU cycles to complete.
First, just how long does a CPU take to execute a single instruction? It can be tempting to just assume that a single instruction executes in a single cycle, but reality is, of course, much more complicated.
Back in the good old days, when men wore their sheep proudly and the nation was undefeated in war, this was not a difficult question to answer. It wasn't one-instruction-one-cycle, but there was at least some clear correspondence. The Intel 4004, for example, took either 8 or 16 clock cycles to execute one instruction, depending on what that instruction was. Nice and understandable. Things gradually got more complex, with a wide variety of timings for different instructions. Older CPU manuals will give a list of how long each instruction takes to execute.
Now? Not so simple.
Along with increasing clock rates, there's also been a long drive to increase the number of instructions that can be executed per clock cycle. Back in the day, that number was something like 0.1 of an instruction per clock cycle. These days, it's up around 3-4 on a good day. How does it perform this wizardry? When you have a billion or more transistors per chip, you can add in a lot of smarts. Although the CPU might be executing 3-4 instructions per clock cycle, that doesn't mean each instruction takes 1/4th of a clock cycle to execute. They still take at least one cycle, often more. What happens is that the CPU is able to maintain multiple instructions in flight at any given time. Each instruction can be broken up into pieces: load the instruction, decode it to see what it means, gather the input data, perform the computation, store the output data. Those can all happen on separate cycles.
On any given CPU cycle, the CPU is doing a bunch of stuff simultaneously:
- Fetching potentially several instructions at once.
- Decoding potentially a completely different set of instructions.
- Fetching the data for potentially yet another different set of instructions.
- Performing computations for yet more instructions.
- Storing data for yet more instructions.
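The overlap described above can be sketched with an idealized pipeline. This toy model assumes a single-issue pipeline with the five pieces named earlier, one new instruction entering per cycle, and no stalls. The stage names and the `pipeline_schedule` helper are invented for this example.

```python
STAGES = ["fetch", "decode", "read inputs", "compute", "store"]

def pipeline_schedule(instruction_count):
    """Return, for each cycle, which instruction index occupies each stage.

    Assumes an ideal single-issue pipeline: one new instruction enters per
    cycle and nothing ever stalls.
    """
    total_cycles = len(STAGES) + instruction_count - 1
    schedule = []
    for cycle in range(total_cycles):
        row = {}
        for stage_index, stage in enumerate(STAGES):
            instruction = cycle - stage_index
            if 0 <= instruction < instruction_count:
                row[stage] = instruction
        schedule.append(row)
    return schedule

cycles = pipeline_schedule(8)
print(len(cycles))  # 12 cycles for 8 instructions, rather than 8 * 5 = 40
print(cycles[4])    # every stage is busy with a different instruction
```

Each instruction still takes five cycles from start to finish, but once the pipeline is full, one instruction completes every cycle. A real CPU widens each stage to handle several instructions at once.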
But, you say, how could this possibly work? For example:
    add x1, x1, x2
    add x1, x1, x3
These can't possibly execute in parallel like that! You need to be finished with the first instruction before you start the second!
It's true, that can't possibly work. That's where the smarts come in. The CPU is able to analyze the instruction stream and figure out which instructions depend on other instructions and shuffle things around. For example, if an instruction after those two adds doesn't depend on them, the CPU could end up executing that instruction before the second add, even though it comes later in the instruction stream. The ideal of 3-4 instructions per clock cycle can only be achieved in code that has a lot of independent instructions.
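Here is a toy version of that dependency analysis. It is a sketch under heavy assumptions: every instruction takes exactly one cycle, at most `width` instructions execute per cycle, and only true read-after-write dependencies matter. The `issue_cycles` function and the third instruction in the example are hypothetical, and real hardware does this with far more machinery.

```python
def issue_cycles(instructions, width=2):
    """Assign each instruction the earliest cycle in which it can execute.

    instructions: list of (destination, sources) tuples in program order.
    An instruction must wait until one cycle after the most recent earlier
    instruction that wrote any of its source registers.
    """
    last_write = {}  # register -> cycle in which its latest value is produced
    busy = {}        # cycle -> number of instructions already placed there
    result = []
    for dest, sources in instructions:
        cycle = max((last_write[r] + 1 for r in sources if r in last_write),
                    default=0)
        while busy.get(cycle, 0) >= width:  # execution slots are limited
            cycle += 1
        busy[cycle] = busy.get(cycle, 0) + 1
        last_write[dest] = cycle
        result.append(cycle)
    return result

program = [
    ("x1", ["x1", "x2"]),  # add x1, x1, x2
    ("x1", ["x1", "x3"]),  # add x1, x1, x3  (needs the first add's result)
    ("x4", ["x5", "x6"]),  # add x4, x5, x6  (independent, added for illustration)
]
print(issue_cycles(program))  # [0, 1, 0]
```

The second add has to wait for the first, but the independent third instruction runs in cycle 0 alongside the first add, ahead of an instruction that precedes it in the program.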
What happens when you hit a memory load instruction? First of all, it is definitely going to take forever, relatively speaking. If you're really lucky and the value is in L1 cache, it'll only take a few cycles. If you're unlucky and it has to go all the way out to main RAM to find the data, it could take literally hundreds of cycles. There may be a lot of thumb-twiddling to be done.
The CPU will try not to twiddle its thumbs, because that's inefficient. First, it will try to anticipate. It may be able to spot that load instruction in advance, figure out what it's going to load, and initiate the load before it really starts executing the instruction. Second, it will keep executing other instructions while it waits, as long as it can. If there are instructions after the load instruction that don't depend on the data being loaded, they can still be executed. Finally, once it's executed everything it can and it absolutely cannot proceed any further without that data it's waiting on, it has little choice but to stall and wait for the data to come back from RAM.
- RAM is slow because there's a ton of it.
- That means you have to use designs that are cheaper, and cheaper means slower.
- Modern CPUs do crazy things internally and will happily execute your instruction stream in an order that's wildly different from how it appears in the code.
- That means that the first thing a CPU does while waiting for a RAM load is run other code.
- If all else fails, it'll just stop and wait, and wait, and wait, and wait.
That wraps things up for today. As always, Friday Q&A is driven by reader suggestions, so if you have a topic you'd like to see covered or a question you'd like to see answered, send it in!
The iPhone doesn't have paging (last I checked, at least), and will terminate applications that go above their memory quota (and send a low-memory warning to them before doing so). General purpose computers don't have this flexibility and have to deal with applications requesting more memory than is available in RAM.
Anyway, paging isn't necessarily germane to an article about the iPhone's memory layout, but it has a pretty big effect on workstation performance.
Indeed, registers are right inside the CPU so their effective distance is very small.
More properly, each processor core usually has its own register file, which is usually at least double the visible register count in order to accommodate out-of-order instruction execution, so you can double your estimate of 6,000 bits.
The stack pointer and program counter are also implemented as registers, so a CPU core profits a lot from having very fast registers.
Another few nits:
1) There are more registers than you give the CPU credit for. Even within a single core, and even discounting an architecture like SPARC, which has a large windowed register file, all modern CPUs have shadow registers that get mapped to the logical registers the code actually sees, thanks to register renaming. This automatically increases the number of actual registers several times over.
2) The power usage of CMOS SRAM is not actually that high. Like all other static CMOS circuitry, its power dissipation in steady state (when there are no transitions) is very low. In addition, the on-chip cache in modern processors, which can run to multiple tens of MB, is of the same design. In fact, considering the subthreshold leakage of deep-submicron processes, the power consumption of DRAM may be similar to or higher than that of SRAM, particularly if the memory contents are not being changed (as DRAM must of course be constantly refreshed). In addition, in modern designs, aggressive clock gating of logic and cache blocks, as well as dynamic voltage and frequency scaling, means that one cannot categorically say that, per unit of storage, DRAM uses more power than SRAM (registers included). Your statement, "The entire circuit sits idle most of the time, only requiring power when being read, or for occasional refreshes needed to recharge the capacitors," is actually more true of SRAM than of DRAM, as even idle or unused DRAM must still be refreshed, even when the CPU's clock is not running (as in sleep mode).
The effective DRAM access time is a complex function of L1/L2/L3 cache hit/miss/snoop latencies combined with DRAM bank and page access patterns. The effective DRAM latency is much less than the worst-case DRAM miss time.
A trick often used on the register-starved 32-bit Intel x86 architecture is to allocate a 64-byte chunk of memory, sized and aligned to a cache line, and use it for frequently used variables. A high rate of accesses will ensure that it stays in L1 most of the time. L1 access time is ~4 cycles, which is pretty good and certainly much, much better than the worst-case DRAM cache miss time, which runs to three or four decimal orders of magnitude.
Core i7 Xeon 5500 Series Data Source Latency (approximate)
L1 CACHE hit, ~4 cycles
L2 CACHE hit, ~10 cycles
L3 CACHE hit, line unshared ~40 cycles
L3 CACHE hit, shared line in another core ~65 cycles
L3 CACHE hit, modified in another core ~75 cycles
Remote L3 CACHE ~100-300 cycles
Local DRAM ~60 ns
Remote DRAM ~100 ns
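The list above mixes cycles and nanoseconds. To put the DRAM figures on the same footing as the cache figures, multiply by the clock rate. The 3GHz clock below is an assumption chosen only to keep the arithmetic simple, not a figure from the list, and `ns_to_cycles` is a helper invented for this sketch.

```python
def ns_to_cycles(nanoseconds, clock_ghz):
    """Convert a latency in nanoseconds to CPU cycles at a given clock rate."""
    return nanoseconds * clock_ghz  # 1 GHz is one cycle per nanosecond

ASSUMED_CLOCK_GHZ = 3.0  # illustrative assumption; not taken from the list

for label, ns in [("Local DRAM", 60), ("Remote DRAM", 100)]:
    print(f"{label}: ~{ns_to_cycles(ns, ASSUMED_CLOCK_GHZ):.0f} cycles")
```

At that assumed clock, local DRAM costs on the order of 180 cycles and remote DRAM about 300, compared with ~4 cycles for an L1 hit.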
It's not just the distance (compare register file to L1 cache), it's the size and design.
But yes, fundamentally registers are fast because it is their job to be fast and because it is impossible to have everything be that fast.
The C++ OOP object model that focuses on code over data for example.
At least performance wise IMHO we should be thinking "data orientation".
That big white data elephant in the room gets ignored all too often.
Where the excuse is usually: "Your hardware should handle it, not my problem", "You mean you want to actually run something else at the same time?", or "That's too much work!".
I'll take low resource usage + quick and snappy style over big + bloated + slow (to load and/or run).
This is the first time I've read one of your posts, Mike, and it's really comforting that there is still one person out there who actually cares about software performance, down to the assembly level.
My opinion is that many applications right now would run very well on a machine with just 64 MB of RAM, but the messiness of the code, as well as complete obliviousness to optimization, has led us to the point that we're at right now.
By the way, I had no idea that the signal travels at the speed of light - does that mean we have reached the limits when it comes to processing power?
1. You covered the business of getting the data value out of the DRAM memory cell pretty well, but you left out something pretty important: the delay where you wait for the charge in the memory cell to trickle out on the sense line is an RC delay with a pretty hard lower limit. This is why, while we have improved the speed of the memory bus quite a lot, we still have to wait ~30ns before we get the first bits out of any memory access.
2. Mapping the memory address in RAM, once the address has been sent to the DRAM banks, also has some hard lower limit timings. In the bad old days the memory address was sent in two parts, and there were two lookup phases, first to find the memory page, second to find the row in the page. Part of that was done to conserve lines on the memory bus, but it was also done because the decode logic gets slower and slower with each extra bit you need to decode.
This is also true for decoding addresses at every level of the memory hierarchy. It's not just that registers and cache are expensive, but you can only make them so large before their sheer size starts slowing them down, and you lose the speed advantage. If the decode logic for register numbers gets too deep, you have to lengthen the processor cycle (slow down the clock) in order to account for the delay.
It's not quite as ideal as you say, as sadly learned by Intel with the P4.
The problem with hyper threading is that, yes, you have a second thread to step in when the first thread blocks on RAM. BUT that second thread also uses up half the space of your caches, and so increases the need to load from RAM...
Which effect dominates? Obviously it depends on the code, but experience seems to suggest that, across a wide range of code, you need a fast (less than 20 cycles) path to L2 (to compensate for the halved size of L1) and at least 1 MB per thread of L3. Even with those, hyper threading is only worth about a quarter of a CPU.
(On the P4, with 8K of L1, slowish and smallish L2, and no L3, it was game over --- no surprise that hyper threading was basically useless. Valuable to devs who wanted to test their threaded code, but not much to anyone else.)
Compare now with an A7. The large L1 and fastish L2 are good, but there is 1MB of L2 per core, and you'd really want that to be at least 2MB/core before you added hyper threading.
(A7 does appear to have a 4MB L3, shared between both CPUs and the GPU, but that's far from the CPUs and slow by the standards of an Intel L3. If you wanted to use that as an effective backing for hyper threading, you'd really need to do this Intel style, moving it closer to the CPUs, perhaps segmenting it into pieces that "live" with each CPU, and all round getting it a lot faster.
All of these are things one would like to do anyway. It makes sense to move to a low latency high bandwidth smaller L2, and a much faster L3 tightly coupled to both GPU and CPU. Presumably we will get that when the mythical Apple designed GPU gets added in to the mix, maybe with the A8.
At that point hyper threading would be a reasonable design choice, though it could be argued that it's just not worth the cost in engineer time --- throw in a third core on the die, they're just not that large, and you'll have better performance with less time to market.)