Introduction
```cpp
#include <benchmark/benchmark.h>

#define ROWS 256
#define COLUMNS 256

long long arr[ROWS][COLUMNS] = {0};

static void columnMajor(benchmark::State& state) {
    for (auto _ : state) {
        long long rval = 0;
        for (int i = 0; i < COLUMNS; i++) {
            for (int j = 0; j < ROWS; j++) {
                rval += arr[j][i];
            }
        }
        benchmark::DoNotOptimize(rval); // keep the sum from being optimized away
    }
}
BENCHMARK(columnMajor);

static void rowMajor(benchmark::State& state) {
    for (auto _ : state) {
        long long rval = 0;
        for (int i = 0; i < ROWS; i++) {
            for (int j = 0; j < COLUMNS; j++) {
                rval += arr[i][j];
            }
        }
        benchmark::DoNotOptimize(rval);
    }
}
BENCHMARK(rowMajor);
```
Take a look at the above code snippet. We have two functions that simply sum all the elements of a 2D long long array. At a glance they do the exact same thing: same time complexity, same iteration bounds.
But there’s one minor difference that massively changes the performance between the two. This difference takes advantage of the cache and of how C++ lays out arrays in memory.
Analysis
We start by noting we create a 2D array of long longs with 256 rows and 256 columns. Recall that when we access a 2D array with known dimensions y rows and x columns, declared as arr[y][x], some pointer arithmetic is done so that accessing arr[m][n] is the same as accessing the element at flat offset x * m + n. In C++ (among other languages), 2D arrays are stored in contiguous memory blocks, where each segment of x elements represents a “row” and there are y “rows”.
Following the array initialization, we define two functions that seemingly do the same thing. The key difference is that one iterates over the row index and then the column index, while the other iterates over the column index and then the row index. Functionally, both just add all the elements in the array together; there’s the same number of array lookup operations and addition operations, which means both have a time complexity of O(ROWS * COLUMNS). They should have the same runtime.
Using Google Benchmark to verify, we actually see that the rowMajor function is nearly three times as fast as its columnMajor counterpart.
```
------------------------------------------------------
Benchmark            Time             CPU   Iterations
------------------------------------------------------
columnMajor     112737 ns       112736 ns         6349
rowMajor         38386 ns        38377 ns        17709
```
But why??
Cache Line Utilization
Whenever an array element is accessed, the cache line containing it is loaded into the CPU’s cache. This is normally the aligned 64-byte block of memory that the element falls in. The size of the cache line is completely controlled by hardware, and does occasionally vary (Apple Silicon chips, for example, use 128-byte lines in some of their caches).
Looking back at array memory layouts, we can see that the 64 bytes loaded around each access correspond nicely to elements (mostly) on the same “row”. This is why rowMajor dominates columnMajor in runtime: the array data for many future additions is already in the cache. We can see this effect diminish as we make the element type smaller, where things start to even out, because more elements fit per cache line, so fewer cache lines need to be loaded to sum the entire 2D array.
Benchmarks
For all benchmarks:
- x axis represents the value of COLUMNS
- y axis represents the value of ROWS
- the value in a cell represents the number of iterations rowMajor completed divided by the number of iterations columnMajor completed
- THIS IS NOT COMPLETELY RIGOROUS and we should allow a 3% margin of error at a 95% confidence interval
```
Run on (10 X 24 MHz CPU s)
CPU Caches:
  L1 Data 64 KiB
  L1 Instruction 128 KiB
  L2 Unified 4096 KiB (x10)
```
For the long long type (8 bytes):
| ROWS \ COLUMNS | 128 | 512 | 2048 | 8192 |
|---|---|---|---|---|
| 128 | 1.08 | 2.21 | 10.67 | 10.10 |
| 512 | 1.38 | 2.64 | 11.10 | 11.21 |
| 2048 | 1.49 | 2.82 | 10.08 | 13.17 |
| 8192 | 1.22 | 5.10 | 15.8 | 20 |
For the int type (4 bytes):
| ROWS \ COLUMNS | 128 | 512 | 2048 | 8192 |
|---|---|---|---|---|
| 128 | 1.02 | 1.56 | 6.83 | 10.30 |
| 512 | 1.32 | 1.58 | 7.30 | 10.89 |
| 2048 | 1.31 | 1.56 | 7.60 | 12.83 |
| 8192 | 1.30 | 2.68 | 9.88 | 10 |
For the short type (2 bytes):
| ROWS \ COLUMNS | 128 | 512 | 2048 | 8192 |
|---|---|---|---|---|
| 128 | 0.97 | 1.01 | 1.03 | 1.04 |
| 512 | 0.98 | 1.02 | 1.02 | 1.05 |
| 2048 | 1.01 | 1.00 | 1.03 | 1.32 |
| 8192 | 1.00 | 1.00 | 1.39 | 2 |
Interpretation
We can see a clear trend: as the size of the array increases, so do the gains from rowMajor access, as previously explained. We also observe that increasing COLUMNS generally widens the gap in rowMajor’s favor, which aligns with our theory, since it lowers the cache hit rate for columnMajor while rowMajor retains close to a 100% hit rate. However, this doesn’t explain the case for shorts, where columnMajor access is sometimes actually faster than rowMajor access. This is an interesting finding that I will try to get to the bottom of some other time (with a post of course!).
Implication
When doing data-intensive operations, one should be wary of row- versus column-major access, depending on the programming language, the implementation of data structures, and even the machine architecture. I suspect that columnMajor sometimes beats rowMajor because of some architectural feature: the cache, SIMD, array stride, or any number of other things that I don’t yet know enough about.
The process of benchmarking was also quite interesting to me, being able to quantify performance differences empirically instead of theoretically, and it’s definitely something I hope to do more of in future projects.
As always, my contact info is listed at chenrz.com/about/.