Not all memory is equal.
It’s a beginner’s mistake to think that every memory access takes the same amount of time. Many managed languages like C# abstract the details of memory access away so that the programmer can focus on the functionality of their program, but in doing so they hide potentially huge performance hits.
The caches, naturally, cache memory accesses. When a memory location is addressed, each cache is checked in turn to see whether it holds a copy of that location. The further from the processor the hit takes place, the longer the program stalls - these are rough numbers to work by:
Memory type    Stall time            Average PC    iPhone 8
L1 cache hit   2ns                   16KB          128KB
L2 cache hit   5ns                   64KB          8MB
L3 cache hit   ~20-40ns              4MB           -
Main memory    100ns+                16GB          2GB
Disk seek      10ms (10,000,000ns)
Those numbers will change over time but stay roughly relative to one another. Typically the L1 cache is split into separate data and instruction caches - the iPhone 8, for instance, splits its L1 into two 64KB caches, one for data and one for instructions.
So what’s happening?
When the requested memory location is found, the line of memory around it - typically 64 bytes - is copied into every cache above it. So if it’s found in the L3 cache, the 64-byte line containing that location is copied into L2 and L1. If your code then accesses the next byte, it will be served from the L1 cache, because the entire 64-byte line was already fetched. Keep accessing memory and you eventually exhaust the cache line, and further cache lines are moved into the L-caches. When a cache is full, the oldest cache line is overwritten with the latest cache line request.
Given how long it takes to retrieve a cache line from the slower memory types, it makes sense to minimise random jumps around main memory. Typically this means batching calls to code, and accessing data while it’s still in the closest cache possible.
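One way to see this effect directly is to sum the same array twice: once sequentially, and once with a 64-byte stride so that almost every access starts a new cache line. This is an illustrative sketch, not code from the article - the array size and timings are assumptions, and absolute numbers will vary by machine:

```csharp
using System;
using System.Diagnostics;

class CacheDemo
{
    public static void Main()
    {
        const int N = 1 << 24;           // 16M ints = 64MB, larger than any cache
        int[] data = new int[N];
        long sum = 0;

        var sw = Stopwatch.StartNew();
        for (int i = 0; i < N; i++)      // sequential: each cache miss brings in
            sum += data[i];              // the next 15 ints for free
        sw.Stop();
        Console.WriteLine($"sequential: {sw.ElapsedMilliseconds}ms");

        sw.Restart();
        for (int j = 0; j < 16; j++)           // stride of 16 ints = 64 bytes,
            for (int i = j; i < N; i += 16)    // so one access per cache line
                sum += data[i];
        sw.Stop();
        Console.WriteLine($"strided:    {sw.ElapsedMilliseconds}ms");

        // The array is zero-initialised, so sum is 0; printing it just stops
        // the compiler from eliminating the loops as dead code.
        Console.WriteLine(sum);
    }
}
```

Both loops touch exactly the same bytes the same number of times; only the access pattern differs, and the strided version is typically several times slower.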
So in this series I’m going to outline several simple ways in which a programmer can optimise to avoid processor stalls - all simple techniques I’ve used in published Unity games written in C#.
Episode 1: Hot and Cold data
The L1 and L2 caches are small, and it’s so very easy to fill them up with data that you’re not actually using. Take the following classes:
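The original code listing appears to be missing here. The following is a hypothetical reconstruction based on the numbers discussed below (16 bytes per instance, a 4-byte hot field, 100,000 objects); the field names are illustrative. A struct array is used so the instances actually sit contiguously in memory, which is what the arithmetic below assumes:

```csharp
// Hypothetical reconstruction of the article's example - names and sizes
// are inferred from the discussion that follows.
struct Lukewarm   // 16 bytes per instance
{
    public float myHotData;    //  4 bytes, read every frame
    public float myColdData1;  // 12 bytes of rarely touched state
    public float myColdData2;
    public float myColdData3;
}

class MyManager
{
    Lukewarm[] objects = new Lukewarm[100000];
    float total;

    public void Update()   // called 60 times per second
    {
        for (int i = 0; i < objects.Length; i++)
            total += objects[i].myHotData;   // drags 12 cold bytes into
    }                                        // the cache with every read
}
```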
Imagine that each myColdData is hardly ever accessed, but that MyManager.Update() is called 60 times per second, and therefore each myHotData is read 60 times per second. The result is a huge number of processor stalls that could easily be optimised away. Each instance of Lukewarm is 16 bytes long, so a 64-byte cache line holds 4 instances. That means 3 cache hits for every 1 cache miss - and for the 100,000 objects in the example code above, that’s 25,000 cache misses at ~100ns+ each. Repeat that 60 times per second (1 second / 60fps = 16.67ms per frame) and congratulations, you’ve just used 2.5ms of your 16.67ms stalling the processor.
Instead, consider splitting the data into hot and cold data. Let’s take the same example but rearrange it:
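The rearranged listing also appears to be missing; a minimal sketch of the split, using the same illustrative names and sizes as above:

```csharp
// Hypothetical hot/cold split of the Lukewarm struct above.
struct HotData    // 4 bytes: 16 instances fit in one 64-byte cache line
{
    public float myHotData;
}

struct ColdData   // 12 bytes, stored in its own array
{
    public float myColdData1;
    public float myColdData2;
    public float myColdData3;
}

class MyManager
{
    HotData[]  hotData  = new HotData[100000];   // 400KB in total
    ColdData[] coldData = new ColdData[100000];  // touched only rarely
    float total;

    public void Update()   // called 60 times per second
    {
        for (int i = 0; i < hotData.Length; i++)
            total += hotData[i].myHotData;   // cache lines now contain
    }                                        // nothing but hot data
}
```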
All we’ve done is separate the data into frequently accessed hot data and rarely accessed cold data. Now when we access the hot data, each cache line is filled only with hot data - 16 instances per cache line rather than just 4. That means 1 cache miss followed by 15 cache hits, or 6,250 cache misses in total for 100,000 objects. At 60 times per second, that’s 0.625ms of your 16.67ms frame time spent stalled, down from 2.5ms.
But because the hot data consumes only 400KB in total, there’s also a much higher chance that next frame it will still be hanging around in the L3 cache, massively reducing the round-trip time and potentially halving the stall time again.
When I applied this optimisation pattern in MineSprint (iOS), each instance had far more cold data than hot data, so the performance gain was much larger than in the example above.
Next time: Optimising nested loops for memory access