20 Feb 2020

How to Optimise C# (S1E1): Memory Access: Hot & Cold Data

Not all memory is equal.

It’s a beginner’s mistake to think that all memory accesses take an equal amount of time. Many managed languages like to abstract the details of memory access away so that the programmer can focus on the functionality of their program, but in doing so managed languages like C# hide potentially huge performance hits.

Between the processor and main memory sit several levels of cache, which hold copies of recently used memory. When a memory location is addressed, each cache is checked in turn to see if it holds a copy of that location. The further from the processor the hit takes place, the longer the program stalls - these are rough numbers to work by:

Memory type     Stall time              Size (average PC)   Size (iPhone 8)
L1 cache hit    ~2 ns                   16 KB               128 KB
L2 cache hit    ~5 ns                   64 KB               8 MB
L3 cache hit    ~20-40 ns               4 MB                -
Main memory     100 ns+                 16 GB               2 GB
Disk seek       ~10 ms (10,000,000 ns)

Those numbers will change over time but stay roughly relative to one another. The L1 cache is usually split into a data cache and an instruction cache; the iPhone 8, for instance, splits its 128 KB L1 cache into two 64 KB caches, one for data and one for instructions.

So what’s happening?

When the requested memory location is hit, the surrounding line of memory, typically 64 bytes, is copied into every cache above it - so if it’s found in the L3 cache, the 64-byte line around that location is copied into L2 and L1. If your code then accesses the next byte, it will be served from the L1 cache, because the entire 64-byte line is already there. Eventually, if you keep accessing memory, you exhaust the cache line and further cache lines are pulled in. When a cache is full, the oldest cache line is overwritten by the latest one requested.

Given how long it takes to retrieve a cache line from the slower memory tiers, it makes sense to minimise random jumps around main memory. Typically this means batching calls to code and accessing data while it’s still in the closest cache possible.
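To see the effect for yourself, here’s a minimal sketch (my own illustration, not code from this article) that sums the same array once in order and once in a shuffled order. The shuffled pass touches exactly the same elements, but because it hops randomly across the array it wastes most of each cache line it loads, and on most machines it runs several times slower.

using System;
using System.Diagnostics;

static class CacheLineDemo {
    static void Main() {
        const int count = 16_000_000;          // ~64 MB of ints, far larger than any cache
        var data = new int[count];

        // Build an in-order index array and a shuffled copy of it.
        var inOrder = new int[count];
        var shuffled = new int[count];
        for (int i = 0; i < count; i++) { inOrder[i] = i; shuffled[i] = i; }
        var rng = new Random(12345);
        for (int i = count - 1; i > 0; i--) {  // Fisher-Yates shuffle
            int j = rng.Next(i + 1);
            (shuffled[i], shuffled[j]) = (shuffled[j], shuffled[i]);
        }

        long sum = 0;
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < count; i++) sum += data[inOrder[i]];   // sequential: every byte of each cache line is used
        sw.Stop();
        Console.WriteLine($"Sequential: {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        for (int i = 0; i < count; i++) sum += data[shuffled[i]];  // random: mostly one int per cache line
        sw.Stop();
        Console.WriteLine($"Shuffled:   {sw.ElapsedMilliseconds} ms (checksum {sum})");
    }
}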

So in this series I’m going to outline several simple ways in which a programmer can optimise to avoid processor stalls - all techniques I’ve used in published Unity games written in C#.

Episode 1: Hot and Cold data

The L1 and L2 caches are small, and it’s so very easy to fill them up with data that you’re not actually using. Take the following classes:

class Lukewarm {
    public int myHotData;    // accessed every frame
    public int myColdData1;  // rarely accessed
    public int myColdData2;
    public int myColdData3;
}

class MyManager {
    Lukewarm[] lukewarm;

    public MyManager() {
        lukewarm = new Lukewarm[100000];
        for (int i = 0; i < lukewarm.Length; i++) {
            lukewarm[i] = new Lukewarm(); // Initialise lukewarm data here...
        }
    }

    public void Update() {
        for (int i = 0; i < 100000; i++) {
            lukewarm[i].myHotData++;
        }
    }
}

Imagine that the myColdData fields are hardly ever accessed, but that MyManager.Update() is called 60 times per second, so each myHotData is accessed 60 times per second. The result is a huge number of processor stalls that could easily be optimised away. Each instance of Lukewarm is 16 bytes long, so a 64-byte cache line can hold 4 instances. That means 3 cache hits for every 1 cache miss, and in the example code above that’s 25,000 cache misses at ~100 ns each. Now if you’re repeating this code 60 times per second (1 second / 60 fps = 16.67 ms per frame) then congratulations, you’ve just used 2.5 ms of your 16.67 ms stalling the processor.
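As a back-of-the-envelope check, the arithmetic looks like this (a sketch using the figures quoted above; object headers and GC layout are ignored, as in the article’s own numbers):

using System;

static class StallEstimate {
    static void Main() {
        const int objects        = 100000;
        const int bytesPerObject = 16;     // 4 ints per Lukewarm instance
        const int cacheLineBytes = 64;
        const int missLatencyNs  = 100;    // main-memory round trip from the table above

        int objectsPerLine = cacheLineBytes / bytesPerObject;            // 4
        int cacheMisses    = objects / objectsPerLine;                   // 25,000
        double stallMs     = cacheMisses * missLatencyNs / 1_000_000.0;  // ~2.5 ms per frame

        Console.WriteLine($"{cacheMisses} misses ~= {stallMs} ms stalled per frame");
    }
}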

Instead, consider splitting the data into hot and cold data. Let’s take the same example but rearrange it:

class HotData {
    public int myHotData;    // accessed every frame
}

class ColdData {
    public int myColdData1;  // rarely accessed
    public int myColdData2;
    public int myColdData3;
}

class MyManager {
    HotData[] hotData;
    ColdData[] coldData;

    public MyManager() {
        hotData = new HotData[100000];
        coldData = new ColdData[100000];
        for (int i = 0; i < hotData.Length; i++) {
            hotData[i] = new HotData();   // Initialise hotData and coldData here...
            coldData[i] = new ColdData();
        }
    }

    public void Update() {
        for (int i = 0; i < 100000; i++) {
            hotData[i].myHotData++;
        }
    }
}

All we’ve done is separate the data into frequently accessed hot data and rarely accessed cold data. Now when we access hotData, each cache line is filled only with hot data: 16 instances per cache line rather than just 4, which means 1 cache miss followed by 15 cache hits, or 6,250 cache misses in total for 100,000 objects. At 60 frames per second, that’s 0.625 ms of your 16.67 ms frame time spent stalled on data.

And because hotData consumes only 400 KB in total, there’s a much better chance that this hot data will still be sitting in the L3 cache on the next frame, massively reducing the round-trip time and potentially halving the stall time.
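If you want to sanity-check the gain on your own machine, a rough harness along these lines (my own sketch; the class HotColdBenchmark and its constants are hypothetical) times both layouts back to back, reusing the Lukewarm and HotData classes from above. Absolute numbers will vary, and object headers narrow the gap for classes this small, but the split layout should still come out ahead.

using System;
using System.Diagnostics;

static class HotColdBenchmark {
    const int Count = 100000;
    const int Frames = 1000;

    static void Main() {
        // Allocate each group in its own tight loop so the objects of a group
        // end up roughly contiguous on the heap, as the article assumes.
        var lukewarm = new Lukewarm[Count];
        for (int i = 0; i < Count; i++) lukewarm[i] = new Lukewarm();

        var hot = new HotData[Count];
        for (int i = 0; i < Count; i++) hot[i] = new HotData();

        var sw = Stopwatch.StartNew();
        for (int f = 0; f < Frames; f++)
            for (int i = 0; i < Count; i++) lukewarm[i].myHotData++;
        sw.Stop();
        Console.WriteLine($"Mixed layout:    {sw.ElapsedMilliseconds} ms for {Frames} frames");

        sw.Restart();
        for (int f = 0; f < Frames; f++)
            for (int i = 0; i < Count; i++) hot[i].myHotData++;
        sw.Stop();
        Console.WriteLine($"Hot/cold layout: {sw.ElapsedMilliseconds} ms for {Frames} frames");
    }
}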

In my application of this optimisation pattern in MineSprint (iOS) I had a lot more cold data in each instance than hot data and so the performance gain was much larger than the example above.

Next time: Optimising nested loops for memory access
