21 Feb 2020

How to Optimise C# (S1E2): Memory Access: Nested Loops

Continuing on from S1E1, which highlighted the cost of memory access stalls and the optimisation of hot and cold data, the same problem can appear in nested loops: when we traverse a data structure larger than the L1 cache, we stall unless we ensure that we access each cache line exactly once.

Episode 2: Nested Loops

Take the following pseudocode, which holds height data for a 2D map:

class MyMap {
    double[,] altitude;

    MyMap() {
        altitude = new double[8000,8000];
        //Initialise altitude data here...
    }

    void Update() {
        // Increment every cell of the height map, iterating x in the inner loop.
        for (int z = 0; z < 8000; z++) {
            for (int x = 0; x < 8000; x++) {
                altitude[x,z]++;
            }
        }
    }
}

The data structure itself is large: 8,000 units x 8,000 units x 8 bytes (the space a double requires) = 512MB, which is far larger than most L1 and L2 caches at present. As mentioned in the previous episode, the further away from the processor the memory is found, the longer it takes to arrive, with main memory often taking 100ns+ to retrieve the 64-byte cache line that contains our variable. In this example, 8 doubles fit into a single cache line.
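To make that arithmetic concrete, here's a minimal sketch (assuming the 64-byte cache line mentioned above) that computes the array's footprint and how many doubles share a cache line:

using System;

class CacheMath {
    static void Main() {
        const long Rows = 8000, Cols = 8000;
        const long BytesPerDouble = sizeof(double);  // 8 bytes
        const long CacheLineBytes = 64;              // assumed line size, as above

        long totalBytes = Rows * Cols * BytesPerDouble;
        long doublesPerLine = CacheLineBytes / BytesPerDouble;

        Console.WriteLine($"Array footprint: {totalBytes / 1_000_000}MB"); // 512MB
        Console.WriteLine($"Doubles per cache line: {doublesPerLine}");    // 8
    }
}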

So what will happen? Memory in C# arrays is laid out in row-major order, where the right-most index is contiguous in memory, but the index being incremented in the inner loop above is the left-most one, so on every iteration we access a variable that is 8,000 doubles away from the last one we accessed. And since we cannot fit even a single row of 8,000 units (64KB) into the L1 or L2 cache (usually 16KB for L1, 64KB for L2, only half of which is available for data), by the time we return for the 7 other doubles in that cache line, its cache entry has long been overwritten, and each of them stalls on a retrieval from the L3 cache - we should be accessing them while they're still in L1 instead.
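One way to picture the row-major layout is to flatten the 2D array by hand. The sketch below (the FlatIndex helper is purely illustrative, not part of the example class) shows that stepping the right-most index moves one element, while stepping the left-most index jumps a whole 8,000-double row:

using System;

class RowMajorDemo {
    const int Width = 8000; // number of columns (the right-most index)

    // A double[8000,8000] is stored like one flat row-major array,
    // so element [x,z] lives at flat offset x * Width + z.
    static long FlatIndex(int x, int z) => (long)x * Width + z;

    static void Main() {
        // Stepping the right-most index moves 1 element (8 bytes)...
        Console.WriteLine(FlatIndex(0, 1) - FlatIndex(0, 0)); // 1
        // ...stepping the left-most index jumps 8,000 elements (64KB).
        Console.WriteLine(FlatIndex(1, 0) - FlatIndex(0, 0)); // 8000
    }
}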

In the example above, every cache line of 8 doubles costs 1 access to main memory (100ns) plus 7 accesses to the L3 cache (20-40ns) = ~300ns per 8-double cache line. There are 64M doubles in our data structure, so that's 8M cache lines to traverse at ~300ns each = 2.4s in stalls alone.
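If you'd like that back-of-the-envelope estimate in runnable form, this sketch reproduces it using the latencies assumed above (illustrative figures, not measurements):

using System;

class StallEstimate {
    static void Main() {
        const double MainMemoryNs = 100, L3Ns = 30; // assumed; L3 is ~20-40ns
        const long Doubles = 8000L * 8000L;         // 64M doubles
        const long CacheLines = Doubles / 8;        // 8M cache lines

        double nsPerLine = MainMemoryNs + 7 * L3Ns;       // ~300ns per line
        double totalSeconds = CacheLines * nsPerLine / 1e9;
        Console.WriteLine($"~{totalSeconds:F1}s in stalls"); // ~2.4-2.5s
    }
}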

Now if each row had a capacity of 4,000 units instead of 8,000, it would fit inside the L2 cache, and we'd see 1 access from main memory (still 100ns) followed by 7 accesses from L2 (5ns) rather than L3 (20-40ns), so each 8-double cache line would cost around ~135ns. Shrinking the data structure is therefore one optimisation if you're able to program for specific hardware, and if you can make your entire structure fit inside L1 (16KB, 2ns access time), you'll have minimised stalls even further.

But of course, reducing the size of your data structure isn't always possible. We can, however, make sure that each cache line is moved to the L1 cache once (from main memory) rather than 8 times (once from main memory, and 7 more times from the L2/L3 caches), simply by transposing the order of the nested loops, as in the following pseudocode:

class MyMap {
    double[,] altitude;

    MyMap() {
        altitude = new double[8000,8000];
        //Initialise altitude data here...
    }

    void Update() {
        // Same work as before, but z (the right-most, contiguous index) is now
        // the inner loop, so consecutive accesses share cache lines.
        for (int x = 0; x < 8000; x++) {
            for (int z = 0; z < 8000; z++) {
                altitude[x,z]++;
            }
        }
    }
}

The only thing I've changed here is that I've swapped the two for-loops around so that the z-index is the inner loop rather than the x-index. That's it. The inner loop now increments the right-most index, and because of how memory is laid out in arrays, we're now accessing the doubles sequentially rather than every 8,000th one (repeated 8,000 times). Even if the processor does nothing clever, such as prefetching the next cache line ahead of time, the worst-case cost is ~100ns to access main memory plus ~2ns for each of the next 7 items in the L1 cache = ~114ns per cache line, cutting the stall time from ~300ns per cache line simply by making sure our for-loops were the right way around.
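If you want to verify the difference on your own hardware, something like the following benchmark sketch will do (timings vary by machine, the ~512MB array needs a 64-bit process, and Stopwatch comes from System.Diagnostics):

using System;
using System.Diagnostics;

class LoopOrderBenchmark {
    const int N = 8000;

    static void Main() {
        var altitude = new double[N, N]; // ~512MB; requires a 64-bit process

        // Touch the memory once so both timed runs start from a similar state.
        XInnerPass(altitude);
        ZInnerPass(altitude);

        var sw = Stopwatch.StartNew();
        XInnerPass(altitude);            // inner loop strides by a whole row
        Console.WriteLine($"x-inner (strided):    {sw.ElapsedMilliseconds}ms");

        sw.Restart();
        ZInnerPass(altitude);            // inner loop walks contiguous memory
        Console.WriteLine($"z-inner (sequential): {sw.ElapsedMilliseconds}ms");
    }

    static void XInnerPass(double[,] a) {
        for (int z = 0; z < N; z++)
            for (int x = 0; x < N; x++)
                a[x, z]++;
    }

    static void ZInnerPass(double[,] a) {
        for (int x = 0; x < N; x++)
            for (int z = 0; z < N; z++)
                a[x, z]++;
    }
}

On most machines the z-inner pass should finish noticeably faster, though the exact ratio depends on the cache hierarchy and the prefetcher.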
