Friday Q&A 2013-10-11: Why Registers Are Fast and RAM Is Slow
by Mike Ash  

In the previous article on ARM64, I mentioned that one advantage of the new architecture is the fact that it has twice as many registers, allowing code to load data from the much slower RAM less often. Reader Daniel Hooper asks the natural question: just why is RAM so much slower than registers?

Distance
Let's start with distance. It's not necessarily a big factor, but it's the most fun to analyze. RAM is farther away from the CPU than registers are, which can make it take longer to fetch data from it.

Take a 3GHz processor as an extreme example. The speed of light is roughly one foot per nanosecond, or about 30cm per nanosecond for you metric folk. Light can only travel about four inches in the time of a single clock cycle of this processor. That means a round-trip signal can only reach a component that's two inches away or less, and that assumes the hardware is perfect and able to transmit information at the speed of light in a vacuum. For a desktop PC, that's pretty significant. However, it's much less important for an iPhone, where the clock speed is much lower (the 5S runs at 1.3GHz) and the RAM is right next to the CPU.
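If you want to check that arithmetic, here's a minimal sketch in C using the same round numbers as above:

    #include <stdio.h>

    int main(void) {
        double c = 299792458.0;   // speed of light, meters per second
        double cycle = 1.0 / 3e9; // one clock cycle of a 3GHz CPU, in seconds

        double one_way = c * cycle; // how far light gets in a single cycle
        printf("one-way distance per cycle: %.1f cm (~4 inches)\n", one_way * 100);
        printf("max round-trip distance:    %.1f cm (~2 inches)\n", one_way * 100 / 2);
        return 0;
    }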

Cost
Much as we might wish it wasn't, cost is always a factor. In software, when trying to make a program run fast, we don't go through the entire program and give it equal attention. Instead, we identify the hotspots that are most critical to performance, and give them the most attention. This makes the best use of our limited resources. Hardware is similar. Faster hardware is more expensive, and that expense is best spent where it'll make the most difference.

Registers get used extremely frequently, and there aren't a lot of them. There are only about 6,000 bits of register data in an A7 (32 64-bit general-purpose registers plus 32 128-bit floating-point registers, and some miscellaneous ones). There are about 8 billion bits (1GB) of RAM in an iPhone 5S. It's worthwhile to spend a bunch of money making each register bit faster. There are literally a million times more RAM bits, and those eight billion bits pretty much have to be as cheap as possible if you want a $650 phone instead of a $6,500 phone.
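If you want to see where those figures come from, the arithmetic is simple enough to sketch in C (using the rough counts above):

    #include <stdio.h>

    int main(void) {
        long long gp_bits  = 32 * 64;   // 32 general-purpose registers, 64 bits each
        long long fp_bits  = 32 * 128;  // 32 floating-point registers, 128 bits each
        long long reg_bits = gp_bits + fp_bits;            // 6,144 bits, plus a few misc
        long long ram_bits = 1LL * 1024 * 1024 * 1024 * 8; // 1GB of RAM, in bits

        printf("register bits: %lld\n", reg_bits);
        printf("RAM bits:      %lld\n", ram_bits);
        printf("ratio: roughly %lld RAM bits per register bit\n", ram_bits / reg_bits);
        return 0;
    }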

Registers use an expensive design that can be read quickly. Reading a register bit is a matter of activating the right transistor and then waiting a short time for the register hardware to push the read line to the appropriate state.

Reading a RAM bit, on the other hand, is more involved. A bit in the DRAM found in any smartphone or PC consists of a single capacitor and a single transistor. The capacitors are extremely small, as you'd expect given that you can fit eight billion of them in your pocket. This means they carry a very small amount of charge, which makes it hard to measure. We like to think of digital circuits as dealing in ones and zeroes, but the analog world comes into play here. The read line is pre-charged to a level that's halfway between a one and a zero. Then the capacitor is connected to it, which either adds or drains a tiny amount of charge. A sense amplifier pushes the charge toward zero or one. Once the charge in the line is sufficiently amplified, the result can be returned.

The fact that a RAM bit is only one transistor and one tiny capacitor makes it extremely cheap to manufacture. Register bits contain more parts and therefore cost much more.

There's also a lot more complexity involved just in figuring out what hardware to talk to with RAM because there's so much more of it. Reading from a register looks like:

  1. Extract the relevant bits from the instruction.
  2. Put those bits onto the register file's read lines.
  3. Read the result.

Reading from RAM looks like:

  1. Get the pointer to the data being loaded. (Said pointer is probably in a register. This already encompasses all of the work done above!)
  2. Send that pointer off to the MMU.
  3. The MMU translates the virtual address in the pointer to a physical address.
  4. Send the physical address to the memory controller.
  5. The memory controller figures out what bank of RAM the data is in and asks the RAM.
  6. The RAM figures out which particular chunk the data is in, and asks that chunk.
  7. Step 6 may repeat a couple more times before narrowing it down to a single array of cells.
  8. Load the data from the array.
  9. Send it back to the memory controller.
  10. Send it back to the CPU.
  11. Use it!

Whew.
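You can watch the cost of that whole chain from ordinary C. The sketch below (my own illustration, assuming a POSIX system for clock_gettime; compile with optimizations on) times a chain of dependent pointer loads, first through a buffer small enough to live in L1 cache, then through one too big for any cache, so most hops go all the way out to RAM:

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>

    // Follow a cycle of pointers 'steps' times, returning nanoseconds per load.
    // Each load depends on the previous one, so the CPU can't overlap them.
    static double chase(size_t n, long steps) {
        void **buf = malloc(n * sizeof(void *));
        if (!buf) { perror("malloc"); exit(1); }

        // Link the buffer into one big cycle. n is a power of two and the
        // stride is odd, so the walk is guaranteed to visit every slot.
        size_t stride = 4099, idx = 0;
        for (size_t i = 0; i < n; i++) {
            size_t next = (idx + stride) % n;
            buf[idx] = &buf[next];
            idx = next;
        }

        void **p = &buf[0];
        struct timespec t0, t1;
        clock_gettime(CLOCK_MONOTONIC, &t0);
        for (long i = 0; i < steps; i++)
            p = (void **)*p;
        clock_gettime(CLOCK_MONOTONIC, &t1);

        if (p == NULL) puts("impossible"); // keep 'p' live so the loop survives the optimizer
        free(buf);
        return ((t1.tv_sec - t0.tv_sec) * 1e9 + (t1.tv_nsec - t0.tv_nsec)) / steps;
    }

    int main(void) {
        printf("4KB buffer (fits in L1):     %.1f ns per load\n", chase(512, 10000000));
        printf("256MB buffer (fits nowhere): %.1f ns per load\n", chase(32*1024*1024, 10000000));
        return 0;
    }

On typical hardware the second number comes out one to two orders of magnitude larger than the first.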

Dealing With Slow RAM
That sums up why RAM is so much slower. But how does the CPU deal with such slowness? A RAM load is a single CPU instruction, yet it can potentially take hundreds of CPU cycles to complete.

First, just how long does a CPU take to execute a single instruction? It can be tempting to just assume that a single instruction executes in a single cycle, but reality is, of course, much more complicated.

Back in the good old days, when men wore their sheep proudly and the nation was undefeated in war, this was not a difficult question to answer. It wasn't one-instruction-one-cycle, but there was at least some clear correspondence. The Intel 4004, for example, took either 8 or 16 clock cycles to execute one instruction, depending on what that instruction was. Nice and understandable. Things gradually got more complex, with a wide variety of timings for different instructions. Older CPU manuals will give a list of how long each instruction takes to execute.

Now? Not so simple.

Along with increasing clock rates, there's also been a long drive to increase the number of instructions that can be executed per clock cycle. Back in the day, that number was something like 0.1 of an instruction per clock cycle. These days, it's up around 3-4 on a good day. How does it perform this wizardry? When you have a billion or more transistors per chip, you can add in a lot of smarts. Although the CPU might be executing 3-4 instructions per clock cycle, that doesn't mean each instruction takes 1/4th of a clock cycle to execute. They still take at least one cycle, often more. What happens is that the CPU is able to maintain multiple instructions in flight at any given time. Each instruction can be broken up into pieces: load the instruction, decode it to see what it means, gather the input data, perform the computation, store the output data. Those can all happen on separate cycles.

On any given CPU cycle, the CPU is doing a bunch of stuff simultaneously:

  1. Fetching potentially several instructions at once.
  2. Decoding potentially a completely different set of instructions.
  3. Fetching the data for potentially yet another different set of instructions.
  4. Performing computations for yet more instructions.
  5. Storing data for yet more instructions.

But, you say, how could this possibly work? For example:

    add x1, x1, x2
    add x1, x1, x3

These can't possibly execute in parallel like that! You need to be finished with the first instruction before you start the second!

It's true, that can't possibly work. That's where the smarts come in. The CPU is able to analyze the instruction stream and figure out which instructions depend on other instructions and shuffle things around. For example, if an instruction after those two adds doesn't depend on them, the CPU could end up executing that instruction before the second add, even though it comes later in the instruction stream. The ideal of 3-4 instructions per clock cycle can only be achieved in code that has a lot of independent instructions.
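You can see the effect of dependencies from plain C as well. In this sketch (an illustration of the idea, not a rigorous benchmark; the function names are mine), both functions do the same additions, but the first is one long dependency chain, while the second splits the work across four independent accumulators that the CPU can keep in flight simultaneously:

    // One long dependency chain: every add must wait for the one before it.
    long long sum_serial(const int *a, long n) {
        long long s = 0;
        for (long i = 0; i < n; i++)
            s += a[i];
        return s;
    }

    // Four independent chains: the CPU can overlap these additions.
    long long sum_unrolled(const int *a, long n) {
        long long s0 = 0, s1 = 0, s2 = 0, s3 = 0;
        long i;
        for (i = 0; i + 3 < n; i += 4) {
            s0 += a[i];
            s1 += a[i + 1];
            s2 += a[i + 2];
            s3 += a[i + 3];
        }
        for (; i < n; i++) // leftovers when n isn't a multiple of 4
            s0 += a[i];
        return s0 + s1 + s2 + s3;
    }

The second version tends to run measurably faster on out-of-order hardware, though an optimizing compiler may well make this transformation on its own.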

What happens when you hit a memory load instruction? First of all, it is definitely going to take forever, relatively speaking. If you're really lucky and the value is in L1 cache, it'll only take a few cycles. If you're unlucky and it has to go all the way out to main RAM to find the data, it could take literally hundreds of cycles. There may be a lot of thumb-twiddling to be done.

The CPU will try not to twiddle its thumbs, because that's inefficient. First, it will try to anticipate. It may be able to spot that load instruction in advance, figure out what it's going to load, and initiate the load before it really starts executing the instruction. Second, it will keep executing other instructions while it waits, as long as it can. If there are instructions after the load instruction that don't depend on the data being loaded, they can still be executed. Finally, once it's executed everything it can and it absolutely cannot proceed any further without the data it's waiting on, it has little choice but to stall and wait for the data to come back from RAM.
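Programmers can lend the CPU a hand with that anticipation step. GCC and Clang expose a prefetch hint as __builtin_prefetch, which asks the CPU to start pulling in a cache line before the code actually needs it. A minimal sketch (the look-ahead distance of 16 elements is a made-up tuning value, not anything canonical):

    // Sum an array while hinting at data we'll want a few iterations from now.
    // Requires GCC or Clang for __builtin_prefetch.
    long long sum_with_prefetch(const int *a, long n) {
        long long s = 0;
        for (long i = 0; i < n; i++) {
            if (i + 16 < n)
                __builtin_prefetch(&a[i + 16]); // start loading this line now
            s += a[i];
        }
        return s;
    }

For a simple sequential scan like this, the hardware prefetcher usually spots the pattern on its own; explicit prefetching tends to pay off more for irregular access patterns like pointer chasing.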

Conclusion

  1. RAM is slow because there's a ton of it.
  2. That means you have to use designs that are cheaper, and cheaper means slower.
  3. Modern CPUs do crazy things internally and will happily execute your instruction stream in an order that's wildly different from how it appears in the code.
  4. That means that the first thing a CPU does while waiting for a RAM load is run other code.
  5. If all else fails, it'll just stop and wait, and wait, and wait, and wait.

That wraps things up for today. As always, Friday Q&A is driven by reader suggestions, so if you have a topic you'd like to see covered or a question you'd like to see answered, send it in!

Did you enjoy this article? I'm selling whole books full of them! Volumes II and III are now out! They're available as ePub, PDF, print, and on iBooks and Kindle. Click here for more information.

Comments:

One other wrinkle is that, in the event RAM is full, the OS starts paging, placing RAM pages on disk (and having to go to disk to read that memory again).

The iPhone doesn't have paging (last I checked, at least), and will terminate applications that go above their memory quota (and send a low-memory warning to them before doing so). General purpose computers don't have this flexibility and have to deal with applications requesting more memory than is available in RAM.

Anyway, paging isn't necessarily germane to an article about the iPhone's memory layout, but it has a pretty big effect on workstation performance.
"RAM is farther away from the CPU than registers are, which can make it take longer to fetch data from it."

Indeed, registers are right inside the CPU so their effective distance is very small.
More properly, each processor core usually has its own register file, which is usually at least double the architecturally visible register count, to accommodate out-of-order instruction execution - so you can double your estimate of 6,000 bits.
The stack pointer and program counter are also implemented as registers, so a CPU core profits a lot from having very fast registers.
It's worth noting that on a hyper-threading enabled CPU, two threads can share the same execution resources. So you get two streams of likely non-dependent instructions coming from the decoder, and the CPU is free to interleave the instructions according to its rules. In software with a lot of memory read stalls (e.g. in-memory databases) that gives close to the theoretical maximum 2x speedup.
You missed explaining the most important detail, and the real culprit: it's not about distance or the speed of light/electricity, it's about THE DELAYS IN SIGNAL PROCESSING that occur in the memory controller etc.
Even into the early 80s, many small computers used SRAM as main memory in order not to have to deal with refresh (unlike modern memory modules, refresh circuitry was not built into the memory or the controller, and had to be handled either by roll-your-own circuitry or by the CPU itself). The lack of intermediate interface circuitry between the CPU and the main memory itself is also a cause of a lot of the latency between requests to main memory and when the data actually arrives (even discounting CPU cache). Even reading a single byte from main memory requires a multi-stage process of sending an address to the right memory module: to allow for faster bus speeds and limit power consumption, even in the case of parallel busses, 1) the address and data lines are multiplexed, and 2) the bus width is not big enough to send the entire memory address at one time. So even sending the request to the control circuitry on the main memory chip takes potentially hundreds of CPU clock cycles. This is orthogonal to the physical distance between the CPU and the memory. This is just step #4 in your list of accessing data from main memory.

Another few nits:
1) There are more registers than you give the CPU credit for. Even within a single core, and even discounting an architecture like SPARC which has a large windowing register file, all modern CPUs have shadow registers which get mapped to the logical registers the code actually sees, due to register renaming. This automatically increases the number of actual registers by several times.

2) The power usage of CMOS SRAM is not actually that high. Like all other static CMOS circuitry, the actual power dissipation during steady state (when there are no transitions) is very low. In addition, up to multiple tens of MB of on-chip cache in modern processors is of the same design. In fact, when considering the subthreshold leakage of deep submicron processes, the power consumption of DRAM may be similar to or higher than SRAM, particularly if the memory contents are not being changed (as DRAM must of course be constantly refreshed). In addition, in modern designs, aggressive clock gating of logic and cache blocks, as well as dynamic voltage and frequency scaling, means that one cannot categorically say that per unit of storage, DRAM uses more power than SRAM (registers included). Your statement "The entire circuit sits idle most of the time, only requiring power when being read, or for occasional refreshes needed to recharge the capacitors" is actually more true of SRAM than DRAM, as even idle/unused DRAM must still be refreshed, even when the CPU's clock is not running (a la sleep mode).

The article is true as far as it goes, but unfortunately it ignores caching. All modern CPUs depend on caches to hide DRAM latency. Most modern CPUs have at least L1 and L2 caches and possibly a large L3 cache as well. These caches are often able to hide the large latencies to DRAM.

The effective DRAM access time is a complex function of L1/L2/L3 cache hit/miss/snoop latencies combined with DRAM bank and page access patterns. The effective DRAM latency is much less than the worst-case DRAM miss time.

A trick often used on the register-starved Intel x86 (32-bit) architecture is to allocate a 64-byte, cache-line-sized and cache-line-aligned chunk of memory and use it for often-used variables. A high rate of accesses will ensure it stays in L1 most of the time. L1 access time is ~4 cycles, which is pretty good, and certainly much, much better than the worst-case DRAM miss time, which is three to four decimal orders of magnitude larger. (A sketch of this trick in C follows the latency table below.)

from here:
http://www.mikeash.com/pyblog/friday-qa-2013-10-11-why-registers-are-fast-and-ram-is-slow.html

Core i7 Xeon 5500 Series Data Source Latency (approximate)
L1 CACHE hit: ~4 cycles
L2 CACHE hit: ~10 cycles
L3 CACHE hit, line unshared: ~40 cycles
L3 CACHE hit, shared line in another core: ~65 cycles
L3 CACHE hit, modified in another core: ~75 cycles
remote L3 CACHE: ~100-300 cycles
Local DRAM: ~60 ns
Remote DRAM: ~100 ns
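That allocation trick might look something like the sketch below (my illustration, assuming a POSIX system for posix_memalign; the struct fields are just placeholders):

    #include <stdlib.h>

    // A cache-line-sized, cache-line-aligned home for hot variables.
    // 64 bytes is the line size on x86 and many other CPUs.
    struct hot_vars {
        long counter;
        long limit;
        void *current;
        char pad[64 - 2 * sizeof(long) - sizeof(void *)]; // fill out the line
    };

    struct hot_vars *alloc_hot_vars(void) {
        void *p = NULL;
        // posix_memalign returns memory whose address is a multiple of 64,
        // so the whole struct lands in a single cache line.
        if (posix_memalign(&p, 64, sizeof(struct hot_vars)) != 0)
            return NULL;
        return p;
    }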
You forgot about the number of specifiers per instruction. An instruction typically specifies two, three, four, or even five registers. A memory reference instruction can typically specify one memory location. This is possible because registers are implemented in a redundant, multi-ported structure right inside the pipeline, usually in a dedicated pipe stage. The L1 cache can also be thought of as occupying its own pipe stage, but a memory reference instruction can specify only one location in that cache (modulo things like misaligned access, string operations, etc.).

It's not just the distance (compare register file to L1 cache), it's the size and design.
Another reason that having more registers is important is that they can be easily specified in an instruction. Specifying a particular register typically takes three to five bits (eight to thirty-two registers), so an instruction can specify multiple registers, allowing for r1 += r2, r1 = r2 + r3, or perhaps even r1 = r2 * r3 + r4. Doing the same thing with memory locations would be more challenging, because the interesting addressing modes require specifying a lot more than three to five bits.
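As a rough illustration of how cheap register specifiers are, here's a sketch that pulls three 5-bit register fields out of a 32-bit instruction word. The field positions are made up for the example (loosely in the spirit of ARM64's register-register instructions), not the real encoding of any particular chip:

    #include <stdint.h>
    #include <stdio.h>

    // Hypothetical encoding: bits 0-4 name the destination register,
    // bits 5-9 and 16-20 name the two sources. Three registers cost
    // only 15 of the instruction word's 32 bits.
    static void decode(uint32_t insn) {
        unsigned rd = insn & 0x1f;
        unsigned rn = (insn >> 5) & 0x1f;
        unsigned rm = (insn >> 16) & 0x1f;
        printf("add x%u, x%u, x%u\n", rd, rn, rm);
    }

    int main(void) {
        decode((2u << 16) | (1u << 5) | 1u); // prints "add x1, x1, x2"
        return 0;
    }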

But yes, fundamentally registers are fast because it is their job to be fast and because it is impossible to have everything be that fast.
Also, a good question to ask on the same topic is why most people, in particular the establishment (i.e. CS education), ignore this fact and stick to the same old models from 1977.
The C++ OOP object model, which focuses on code over data, is one example.

At least performance-wise, IMHO, we should be thinking "data orientation".

That big white data elephant in the room gets ignored all too often.
Where the excuse is usually: "Your hardware should handle it, not my problem", "You mean you want to actually run something else at the same time?", or "That's too much work!".

I'll take low resource usage + quick and snappy style, over the big + bloated + slow (to load and/or run).
Actually, the reading-from-RAM part is much longer; you'll have to use the registers frequently in that step.

This is the first time I've read one of your posts, Mike, and it's really comforting that there is still one person out there who actually cares about software performance, down to the Assembly level.

My opinion is that many applications right now would run very well on a machine with just 64 MB of RAM, but the messiness of the code, as well as complete obliviousness to optimization, has led us to the point that we're at right now.

By the way, I had no idea that the signal travels at the speed of light - does that mean we have reached the limits when it comes to processing power?
Fadi, the signal does NOT travel at the speed of light. Mike was giving sort of a theoretical limit.
A couple other factors on the slowness of RAM vs. registers:

1. You covered the business of getting the data value out of the DRAM memory cell pretty well, but you left out something pretty important: the delay where you wait for the charge in the memory cell to trickle out on the sense line is an RC delay with a pretty hard lower limit. This is why, while we have improved the speed of the memory bus quite a lot, we still have to wait ~30ns before we get the first bits out of any memory access.

2. Mapping the memory address in RAM, once the address has been sent to the DRAM banks, also has some hard lower limit timings. In the bad old days the memory address was sent in two parts, and there were two lookup phases, first to find the memory page, second to find the row in the page. Part of that was done to conserve lines on the memory bus, but it was also done because the decode logic gets slower and slower with each extra bit you need to decode.

This is also true for decoding addresses at every level of the memory hierarchy. It's not just that registers and cache are expensive, but you can only make them so large before their sheer size starts slowing them down, and you lose the speed advantage. If the decode logic for register numbers gets too deep, you have to increase the processor cycle (slow down the clock) in order to account for the delay.
@Eloff

It's not quite as ideal as you say, as sadly learned by Intel with the P4.

The problem with hyper threading is that, yes, you have a second thread to step in when the first thread blocks on RAM. BUT that second thread also uses up half the space of your caches, and so increases the need to load from RAM...

Which effect dominates? Obviously it depends on the code, but experience seems to suggest that, across a wide range of code, you need a fast (less than 20 cycles) path to L2 (to compensate for the halved size of L1) and at least 1 MB per thread of L3. Even with those, hyper threading is only worth about a quarter of a CPU.
(On the P4, with 8K of L1, slowish and smallish L2, and no L3, it was game over --- no surprise that hyper threading was basically useless. Valuable to devs who wanted to test their threaded code, but not much to anyone else.)

Compare now with an A7. The large L1 and fastish L2 are good, but there is 1MB of L2 per core, and you'd really want that to be at least 2MB/core before you added hyper threading.
(A7 does appear to have a 4MB L3, shared between both CPUs and the GPU, but that's far from the CPUs and slow by the standards of an Intel L3. If you wanted to use that as an effective backing for hyper threading, you'd really need to do this Intel style, moving it closer to the CPUs, perhaps segmenting it into pieces that "live" with each CPU, and all round getting it a lot faster.)

All of these are things one would like to do anyway. It makes sense to move to a low latency high bandwidth smaller L2, and a much faster L3 tightly coupled to both GPU and CPU. Presumably we will get that when the mythical Apple designed GPU gets added in to the mix, maybe with the A8.

At that point hyper threading would be a reasonable design choice, though it could be argued that it's just not worth the cost in engineer time --- throw in a third core on the die, they're just not that large, and you'll have better performance with less time to market.
