ROBBIE LYMAN
Just write the function!
This is a post inspired by my approach to this year’s Advent of Code, but it does not contain spoilers. Instead I want to talk about how being more “data-oriented” has helped my thinking about coding problems.
- Maps
- The “obvious” approach
- Just write the function!
- This is a testable assertion!
- Appendix: even being clever, you can still be dumb
Maps
A pretty frequent Advent of Code metaphor involves having a little rectangular map that you are navigating or manipulating. The map is usually represented in the puzzle input as a rectangular grid of ASCII characters, with newlines ('\n') delimiting each row of the grid.
The “obvious” approach
In my previous runs of Advent of Code, I dutifully parsed the map into some kind of 2D array structure, so that if I wanted to access the value at (x,y), with (0,0) as the upper left-hand corner, I could write const value = map[y][x];
There’s nothing wrong with this pattern; it’s just slow. Many people have made variants of this point in my hearing, but I think the two that stuck best were Bjarne Stroustrup and Andrew Kelley. (The Stroustrup clip is shorter, and the Kelley talk really shows you how he was able to lean into this lesson.) The short of it is that faithfully replicating in memory a multidimensional or nested structure that some data is trying to convey is often a mistake, because it introduces indirection.
My understanding (which I want to underline is not expertise) is that your machine has some number of “registers”, each of which holds a small amount of data that can be accessed and manipulated at the speed of single assembly instructions, which is to say very quickly. After that, computers can hold a good amount of contiguous data in various caches at a time. I’m told that there are several sizes of cache, each larger than the last and correspondingly slower to access. Slower than that is memory that is neither currently in a register nor in a cache but somewhere in RAM, and slowest of all (while still on your physical machine) is memory on disk. When you access a piece of data, for example by executing the line const value = map[y][x]; above, then assuming the data is not already in a register, the computer will attempt to “load” it from cache or, if that operation fails (you have a “cache miss”), from RAM.
The point is that in the worst case, map[y][x] has to first load map in order to index into it as map[y], and then load map[y] in order to index into that.
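To make this concrete, here is a minimal sketch of the two access patterns in Zig (the function names are mine, not from the post, and line_length counts the newline delimiter):

// segmented: each row is its own slice, so reaching a byte
// takes two dependent loads (the row slice first, then the byte)
fn segmentedGet(map: []const []const u8, x: usize, y: usize) u8 {
    return map[y][x];
}

// contiguous: one flat buffer, so reaching a byte takes a little
// arithmetic (on values likely already in registers) and one load
fn contiguousGet(map: []const u8, line_length: usize, x: usize, y: usize) u8 {
    return map[y * line_length + x];
}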
Just write the function!
But what if I do need to operate on the (x,y) coordinate of a point on the map? Well, why not determine those coordinates yourself? After all, they’re not hard to compute: if you store the map as a flat array of bytes ([]const u8 in Zig, since each ASCII character takes up one byte), the y-coordinate is the number of newline characters before the byte-offset of the point you care about, and the x-coordinate is the byte-offset itself if y is zero, or otherwise the byte-offset minus one more than the byte-offset of the previous newline character. (For example, in the grid "ab\ncd\n", the byte at offset 3 is 'c': one newline comes before it, so y is 1, and the previous newline sits at offset 2, so x is 3 - (2 + 1) = 0.)
If we rely on the assumption that the grid is rectangular, though, we don’t even need to actually count! In code:
const std = @import("std");

/// for a buffer of length len
/// representing a rectangular grid
/// with identically-spaced delimiters at line endings
/// (so line_length counts the delimiter),
/// returns the (x,y) coordinate corresponding to a given offset
fn indexToCoordinates(
    offset: usize,
    len: usize,
    line_length: usize,
) error{ Overflow, Delimiter }!struct { usize, usize } {
    if (offset >= len) return error.Overflow;
    // in 1-indexing, the first delimiter
    // (which isn't actually there) is at index 0,
    // the next is at line_length, then 2 * line_length, and so on...
    // so we add 1 to the offset to compute the line.
    const one_indexed = offset + 1;
    // the one-indexed x-coordinate is how far past
    // the most recent multiple of line_length we are
    const one_indexed_x = one_indexed % line_length;
    if (one_indexed_x == 0) return error.Delimiter;
    // this number is already correctly zero-indexed
    const y = @divFloor(one_indexed, line_length);
    return .{ one_indexed_x - 1, y };
}
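To sanity-check the arithmetic, here is a small test (my own addition, not part of the post) against the 2-by-2 grid "ab\ncd\n", whose line_length is 3:

test "indexToCoordinates on a 2x2 grid" {
    const input = "ab\ncd\n";
    // offset 3 is 'c': column 0 of row 1
    const coords = try indexToCoordinates(3, input.len, 3);
    try std.testing.expectEqual(@as(usize, 0), coords[0]);
    try std.testing.expectEqual(@as(usize, 1), coords[1]);
    // offset 2 is the first newline delimiter
    try std.testing.expectError(error.Delimiter, indexToCoordinates(2, input.len, 3));
    // offsets at or past len are out of bounds
    try std.testing.expectError(error.Overflow, indexToCoordinates(6, input.len, 3));
}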
This is a testable assertion!
Of course, no need to take my word for it! Here’s a gist:
The functions “segmented” and “contiguous” perform exactly the same computation: we randomly generate many valid indices into the input, which in the case of “segmented” is represented in memory as a 2D array and in the case of “contiguous” as a flat array. Then we check whether the value at each index is the ASCII character '.', and if so add the x-coordinate multiplied by the y-coordinate to a running total. In “segmented” the coordinates are given to us, while in “contiguous” we must compute them.
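The contiguous hot loop presumably looks something like this sketch (my reconstruction, not the gist verbatim; indices holds the pre-generated random offsets):

fn contiguousSum(input: []const u8, line_length: usize, indices: []const usize) usize {
    var total: usize = 0;
    for (indices) |offset| {
        if (input[offset] != '.') continue;
        // the coordinates are computed on the fly, not stored
        const coords = indexToCoordinates(offset, input.len, line_length) catch continue;
        total += coords[0] * coords[1];
    }
    return total;
}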
The binary decides which version to run based on the name it was invoked with: a name containing the word “contiguous” selects the flat-array version, and any other name selects the 2D version. Calling it as, e.g., ./contiguous 08.txt 3 will use the input file 08.txt as the map and run for 10 ^ 3 = 1000 iterations.
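If you haven’t seen a program branch on its own name before, here is a minimal sketch of the idea (mine, not the gist’s code): the first command-line argument is the name the binary was invoked with.

const std = @import("std");

pub fn main() !void {
    var args = std.process.args();
    const name = args.next() orelse return error.NoProgramName;
    const path = args.next() orelse return error.NoInputFile;
    const exponent = try std.fmt.parseInt(u6, args.next() orelse return error.NoExponent, 10);
    var iterations: u64 = 1;
    for (0..exponent) |_| iterations *= 10;
    // dispatch on argv[0]: a name like ./contiguous picks the flat version
    const flat = std.mem.indexOf(u8, name, "contiguous") != null;
    std.debug.print("{s}: {s} for {d} iterations\n", .{
        if (flat) "contiguous" else "segmented", path, iterations,
    });
    // ...load the file and run the chosen benchmark here...
}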
Using hyperfine, and compiling the Zig file in ReleaseFast mode, I benchmarked the performance of these two functions for values of the iterations argument between 1 and 9, i.e. between 10 ^ 1 and 10 ^ 9 iterations. (Initially I intended to also run it for 10, but that seemed like it would take a long time.)
Here is what I found:
hyperfine --parameter-scan repeats 1 9 "./contiguous 08.txt {repeats}" "./segmented 08.txt {repeats}" -N --warmup 10
Benchmark 1: ./contiguous 08.txt 1
Time (mean ± σ): 2.5 ms ± 0.2 ms [User: 0.9 ms, System: 1.4 ms]
Range (min … max): 2.3 ms … 3.5 ms 1105 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Benchmark 2: ./segmented 08.txt 1
Time (mean ± σ): 3.0 ms ± 0.2 ms [User: 1.0 ms, System: 1.8 ms]
Range (min … max): 2.7 ms … 4.2 ms 1000 runs
Benchmark 3: ./contiguous 08.txt 2
Time (mean ± σ): 2.5 ms ± 0.2 ms [User: 0.9 ms, System: 1.4 ms]
Range (min … max): 2.3 ms … 3.7 ms 1222 runs
Benchmark 4: ./segmented 08.txt 2
Time (mean ± σ): 3.0 ms ± 0.2 ms [User: 1.0 ms, System: 1.8 ms]
Range (min … max): 2.7 ms … 4.2 ms 967 runs
Benchmark 5: ./contiguous 08.txt 3
Time (mean ± σ): 2.5 ms ± 0.2 ms [User: 0.9 ms, System: 1.4 ms]
Range (min … max): 2.3 ms … 4.0 ms 1217 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Benchmark 6: ./segmented 08.txt 3
Time (mean ± σ): 3.0 ms ± 0.2 ms [User: 1.0 ms, System: 1.8 ms]
Range (min … max): 2.7 ms … 4.0 ms 992 runs
Benchmark 7: ./contiguous 08.txt 4
Time (mean ± σ): 2.5 ms ± 0.2 ms [User: 0.9 ms, System: 1.4 ms]
Range (min … max): 2.3 ms … 3.6 ms 1080 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Benchmark 8: ./segmented 08.txt 4
Time (mean ± σ): 3.1 ms ± 0.2 ms [User: 1.0 ms, System: 1.8 ms]
Range (min … max): 2.8 ms … 4.1 ms 1039 runs
Benchmark 9: ./contiguous 08.txt 5
Time (mean ± σ): 2.8 ms ± 0.2 ms [User: 1.2 ms, System: 1.4 ms]
Range (min … max): 2.6 ms … 3.8 ms 1038 runs
Benchmark 10: ./segmented 08.txt 5
Time (mean ± σ): 3.3 ms ± 0.2 ms [User: 1.3 ms, System: 1.7 ms]
Range (min … max): 3.1 ms … 4.5 ms 862 runs
Benchmark 11: ./contiguous 08.txt 6
Time (mean ± σ): 5.2 ms ± 0.2 ms [User: 3.5 ms, System: 1.4 ms]
Range (min … max): 4.9 ms … 6.4 ms 585 runs
Benchmark 12: ./segmented 08.txt 6
Time (mean ± σ): 6.4 ms ± 0.3 ms [User: 4.3 ms, System: 1.8 ms]
Range (min … max): 6.1 ms … 7.5 ms 420 runs
Benchmark 13: ./contiguous 08.txt 7
Time (mean ± σ): 28.6 ms ± 0.3 ms [User: 26.3 ms, System: 1.9 ms]
Range (min … max): 27.8 ms … 29.7 ms 106 runs
Benchmark 14: ./segmented 08.txt 7
Time (mean ± σ): 36.5 ms ± 0.4 ms [User: 33.9 ms, System: 2.2 ms]
Range (min … max): 35.8 ms … 37.6 ms 80 runs
Benchmark 15: ./contiguous 08.txt 8
Time (mean ± σ): 259.2 ms ± 0.6 ms [User: 254.6 ms, System: 4.0 ms]
Range (min … max): 258.4 ms … 260.0 ms 11 runs
Benchmark 16: ./segmented 08.txt 8
Time (mean ± σ): 335.7 ms ± 0.8 ms [User: 329.9 ms, System: 5.1 ms]
Range (min … max): 334.6 ms … 337.3 ms 10 runs
Benchmark 17: ./contiguous 08.txt 9
Time (mean ± σ): 2.556 s ± 0.003 s [User: 2.538 s, System: 0.016 s]
Range (min … max): 2.552 s … 2.562 s 10 runs
Benchmark 18: ./segmented 08.txt 9
Time (mean ± σ): 3.314 s ± 0.002 s [User: 3.290 s, System: 0.021 s]
Range (min … max): 3.312 s … 3.320 s 10 runs
As you can see, in all cases ./contiguous runs faster: not by a ton, but measurably. For example, on the final benchmark, a run of ./segmented took 1.296 times as long to complete, on average, as a run of ./contiguous. At 1,000,000,000 iterations, and thanks to the --warmup flag, you can be fairly sure that this difference is coming from the hot loop rather than from the different allocation patterns or the costs of interacting with the file.
Appendix: even being clever, you can still be dumb
Initially (and in my solution for today’s Advent of Code), I had an indexToCoordinates function which actually counted the newline characters on each call. Of course, this turns an O(1) lookup into an O(n) scan! Now, n is fixed and not super large, but each count has to index into the buffer and do a comparison, so there’s no reason to expect that version of indexToCoordinates to be speedy. It took actually running hyperfine for me to realize that I could get a speedup by trusting the structure of the data.
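For the record, that first version looked roughly like this (a sketch of its shape, not the exact original):

/// the slower version: a linear scan that counts newlines up to offset
fn indexToCoordinatesCounting(
    buffer: []const u8,
    offset: usize,
) error{ Overflow, Delimiter }!struct { usize, usize } {
    if (offset >= buffer.len) return error.Overflow;
    if (buffer[offset] == '\n') return error.Delimiter;
    var x: usize = 0;
    var y: usize = 0;
    // walk the buffer, bumping y and resetting x at each newline
    for (buffer[0..offset]) |byte| {
        if (byte == '\n') {
            y += 1;
            x = 0;
        } else {
            x += 1;
        }
    }
    return .{ x, y };
}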