ROBBIE LYMAN
Just write the function!
This is a post inspired by my approach to this year’s Advent of Code, but it does not contain spoilers. Instead I want to talk about how being more “data-oriented” has helped my thinking about coding problems.
- Maps
- The “obvious” approach
- Just write the function!
- This is a testable assertion!
- Appendix: even being clever, you can still be dumb
Maps
A pretty frequent Advent of Code metaphor involves having a little rectangular map that you are navigating or manipulating. The map is usually represented in the puzzle input as a rectangular grid of ASCII characters, with newlines ('\n') delimiting each row of the grid.
The “obvious” approach
In my previous runs of Advent of Code, I dutifully parsed the map into some kind of 2D array structure, so that if I wanted to access the value at (x,y), with (0,0) as the upper left-hand corner, I could write const value = map[y][x];
There’s nothing wrong with this pattern; it’s just slow. Many people have made variants of this point in my hearing, but I think the two that stuck best were Bjarne Stroustrup and Andrew Kelley. (The Stroustrup clip is shorter, and the Kelley talk really shows you how he was able to lean into this lesson.) The short of it is that faithfully replicating in memory a multidimensional or nested structure that some data is trying to convey is often a mistake, because it introduces indirection.
My understanding (which I want to underline is not expertise) is that your machine has some number of “registers”, each of which holds a small amount of data that can be accessed and manipulated at the speed of single assembly instructions, which is to say very quickly. After that, computers can hold a good amount of contiguous data in various caches at a time. I’m told that there are several sizes of cache, each larger than the last and correspondingly slower to access. Slower than that is memory that is neither currently in a register nor in a cache but somewhere in RAM, and slowest of all (while still on your physical machine) is memory on disk. When you access a piece of data, for example by executing the line const value = map[y][x]; above, then assuming the data is not already in a register, the computer will attempt to “load” it from cache or, if that operation fails (you have a “cache miss”), from RAM.
The point is that in the worst case, map[y][x] has to first load map in order to index into it as map[y], and then load map[y] in order to index into that.
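To make this concrete, here is a minimal sketch of the two access patterns in Zig (the function names are mine, not from the post, and line_length counts the newline delimiter):

// segmented: each row is its own slice, so reaching a byte
// takes two dependent loads (the row slice first, then the byte)
fn segmentedGet(map: []const []const u8, x: usize, y: usize) u8 {
    return map[y][x];
}

// contiguous: one flat buffer, so reaching a byte takes a little
// arithmetic (on values likely already in registers) and one load
fn contiguousGet(map: []const u8, line_length: usize, x: usize, y: usize) u8 {
    return map[y * line_length + x];
}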
Just write the function!
But what if I do need to operate on the (x,y) coordinate of a point on the map? Well, why not determine those coordinates yourself? After all, they’re not hard to compute: if you store the map as a flat array of bytes ([]const u8 in Zig, since each ASCII character takes up one byte), the y-coordinate is the number of newline characters before the byte-offset of the point you care about, and the x-coordinate is the byte-offset itself if y is zero, or otherwise the byte-offset minus one more than the byte-offset of the previous newline character. (For example, in the grid "ab\ncd\n", the byte at offset 3 is 'c': one newline comes before it, so y is 1, and the previous newline sits at offset 2, so x is 3 - (2 + 1) = 0.)
If we rely on the assumption that the grid is rectangular, though, we don’t even need to actually count! In code:
const std = @import("std");

/// for a buffer of length len
/// representing a rectangular grid
/// with identically-spaced delimiters at line endings
/// (so line_length counts the delimiter),
/// returns the (x,y) coordinate corresponding to a given offset
fn indexToCoordinates(
    offset: usize,
    len: usize,
    line_length: usize,
) error{ Overflow, Delimiter }!struct { usize, usize } {
    if (offset >= len) return error.Overflow;
    // in 1-indexing, the first delimiter
    // (which isn't actually there) is at index 0,
    // the next is at line_length, then 2 * line_length, and so on...
    // so we add 1 to the offset to compute the line.
    const one_indexed = offset + 1;
    // the one-indexed x-coordinate is how far past
    // the most recent multiple of line_length we are
    const one_indexed_x = one_indexed % line_length;
    if (one_indexed_x == 0) return error.Delimiter;
    // this number is already correctly zero-indexed
    const y = @divFloor(one_indexed, line_length);
    return .{ one_indexed_x - 1, y };
}
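To sanity-check the arithmetic, here is a small test (my own addition, not part of the post) against the 2-by-2 grid "ab\ncd\n", whose line_length is 3:

test "indexToCoordinates on a 2x2 grid" {
    const input = "ab\ncd\n";
    // offset 3 is 'c': column 0 of row 1
    const coords = try indexToCoordinates(3, input.len, 3);
    try std.testing.expectEqual(@as(usize, 0), coords[0]);
    try std.testing.expectEqual(@as(usize, 1), coords[1]);
    // offset 2 is the first newline delimiter
    try std.testing.expectError(error.Delimiter, indexToCoordinates(2, input.len, 3));
    // offsets at or past len are out of bounds
    try std.testing.expectError(error.Overflow, indexToCoordinates(6, input.len, 3));
}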
This is a testable assertion!
Of course, no need to take my word for it! Here’s a gist:
The functions “segmented” and “contiguous” perform exactly the same computation: we randomly generate many valid indices into the input, which in the case of “segmented” is represented in memory as a 2D array and in the case of “contiguous” as a flat array. Then we check whether the value at each index is the ASCII character '.', and if so add the x-coordinate multiplied by the y-coordinate to a running total. In “segmented” the coordinates are given to us, while in “contiguous” we must compute them.
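The contiguous hot loop presumably looks something like this sketch (my reconstruction, not the gist verbatim; indices holds the pre-generated random offsets):

fn contiguousSum(input: []const u8, line_length: usize, indices: []const usize) usize {
    var total: usize = 0;
    for (indices) |offset| {
        if (input[offset] != '.') continue;
        // the coordinates are computed on the fly, not stored
        const coords = indexToCoordinates(offset, input.len, line_length) catch continue;
        total += coords[0] * coords[1];
    }
    return total;
}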
The binary decides which version to run based on the name it was invoked with: a name containing the word “contiguous” selects the flat-array version, and any other name selects the 2D version. Calling it as, e.g., ./contiguous 08.txt 3 will use the input file 08.txt as the map and run for 10 ^ 3 = 1000 iterations.
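If you haven’t seen a program branch on its own name before, here is a minimal sketch of the idea (mine, not the gist’s code): the first command-line argument is the name the binary was invoked with.

const std = @import("std");

pub fn main() !void {
    var args = std.process.args();
    const name = args.next() orelse return error.NoProgramName;
    const path = args.next() orelse return error.NoInputFile;
    const exponent = try std.fmt.parseInt(u6, args.next() orelse return error.NoExponent, 10);
    var iterations: u64 = 1;
    for (0..exponent) |_| iterations *= 10;
    // dispatch on argv[0]: a name like ./contiguous picks the flat version
    const flat = std.mem.indexOf(u8, name, "contiguous") != null;
    std.debug.print("{s}: {s} for {d} iterations\n", .{
        if (flat) "contiguous" else "segmented", path, iterations,
    });
    // ...load the file and run the chosen benchmark here...
}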
Using hyperfine, and compiling the Zig file in ReleaseFast mode, I benchmarked the performance of these two functions for values of the iterations argument between 1 and 9, i.e. between 10 ^ 1 and 10 ^ 9 iterations. (Initially I intended to also run it for 10, but that seemed like it would take a long time.)
Here is what I found:
hyperfine --parameter-scan repeats 1 9 "./contiguous 08.txt {repeats}" "./segmented 08.txt {repeats}" -N --warmup 10
Benchmark 1: ./contiguous 08.txt 1
Time (mean ± σ): 2.5 ms ± 0.2 ms [User: 0.9 ms, System: 1.4 ms]
Range (min … max): 2.3 ms … 3.5 ms 1105 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Benchmark 2: ./segmented 08.txt 1
Time (mean ± σ): 3.0 ms ± 0.2 ms [User: 1.0 ms, System: 1.8 ms]
Range (min … max): 2.7 ms … 4.2 ms 1000 runs
Benchmark 3: ./contiguous 08.txt 2
Time (mean ± σ): 2.5 ms ± 0.2 ms [User: 0.9 ms, System: 1.4 ms]
Range (min … max): 2.3 ms … 3.7 ms 1222 runs
Benchmark 4: ./segmented 08.txt 2
Time (mean ± σ): 3.0 ms ± 0.2 ms [User: 1.0 ms, System: 1.8 ms]
Range (min … max): 2.7 ms … 4.2 ms 967 runs
Benchmark 5: ./contiguous 08.txt 3
Time (mean ± σ): 2.5 ms ± 0.2 ms [User: 0.9 ms, System: 1.4 ms]
Range (min … max): 2.3 ms … 4.0 ms 1217 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Benchmark 6: ./segmented 08.txt 3
Time (mean ± σ): 3.0 ms ± 0.2 ms [User: 1.0 ms, System: 1.8 ms]
Range (min … max): 2.7 ms … 4.0 ms 992 runs
Benchmark 7: ./contiguous 08.txt 4
Time (mean ± σ): 2.5 ms ± 0.2 ms [User: 0.9 ms, System: 1.4 ms]
Range (min … max): 2.3 ms … 3.6 ms 1080 runs
Warning: Statistical outliers were detected. Consider re-running this benchmark on a quiet system without any interferences from other programs. It might help to use the '--warmup' or '--prepare' options.
Benchmark 8: ./segmented 08.txt 4
Time (mean ± σ): 3.1 ms ± 0.2 ms [User: 1.0 ms, System: 1.8 ms]
Range (min … max): 2.8 ms … 4.1 ms 1039 runs
Benchmark 9: ./contiguous 08.txt 5
Time (mean ± σ): 2.8 ms ± 0.2 ms [User: 1.2 ms, System: 1.4 ms]
Range (min … max): 2.6 ms … 3.8 ms 1038 runs
Benchmark 10: ./segmented 08.txt 5
Time (mean ± σ): 3.3 ms ± 0.2 ms [User: 1.3 ms, System: 1.7 ms]
Range (min … max): 3.1 ms … 4.5 ms 862 runs
Benchmark 11: ./contiguous 08.txt 6
Time (mean ± σ): 5.2 ms ± 0.2 ms [User: 3.5 ms, System: 1.4 ms]
Range (min … max): 4.9 ms … 6.4 ms 585 runs
Benchmark 12: ./segmented 08.txt 6
Time (mean ± σ): 6.4 ms ± 0.3 ms [User: 4.3 ms, System: 1.8 ms]
Range (min … max): 6.1 ms … 7.5 ms 420 runs
Benchmark 13: ./contiguous 08.txt 7
Time (mean ± σ): 28.6 ms ± 0.3 ms [User: 26.3 ms, System: 1.9 ms]
Range (min … max): 27.8 ms … 29.7 ms 106 runs
Benchmark 14: ./segmented 08.txt 7
Time (mean ± σ): 36.5 ms ± 0.4 ms [User: 33.9 ms, System: 2.2 ms]
Range (min … max): 35.8 ms … 37.6 ms 80 runs
Benchmark 15: ./contiguous 08.txt 8
Time (mean ± σ): 259.2 ms ± 0.6 ms [User: 254.6 ms, System: 4.0 ms]
Range (min … max): 258.4 ms … 260.0 ms 11 runs
Benchmark 16: ./segmented 08.txt 8
Time (mean ± σ): 335.7 ms ± 0.8 ms [User: 329.9 ms, System: 5.1 ms]
Range (min … max): 334.6 ms … 337.3 ms 10 runs
Benchmark 17: ./contiguous 08.txt 9
Time (mean ± σ): 2.556 s ± 0.003 s [User: 2.538 s, System: 0.016 s]
Range (min … max): 2.552 s … 2.562 s 10 runs
Benchmark 18: ./segmented 08.txt 9
Time (mean ± σ): 3.314 s ± 0.002 s [User: 3.290 s, System: 0.021 s]
Range (min … max): 3.312 s … 3.320 s 10 runs
As you can see, in all cases ./contiguous runs faster: not by a ton, but measurably. For example, on the final benchmark, a run of ./segmented took 1.296 times as long to complete, on average, as a run of ./contiguous. At 1,000,000,000 iterations, and thanks to the --warmup flag, you can be fairly sure that this difference is coming from the hot loop rather than from the different allocation patterns or the costs of interacting with the file.
Appendix: even being clever, you can still be dumb
Initially (and in my solution for today’s Advent of Code), I had an indexToCoordinates function which actually counted the newline characters on each call. Of course, this turns an O(1) lookup into an O(n) scan! Now, n is fixed and not super large, but each count has to index into the buffer and do a comparison, so there’s no reason to expect that version of indexToCoordinates to be speedy. It took actually running hyperfine for me to realize that I could get a speedup by trusting the structure of the data.
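For the record, that first version looked roughly like this (a sketch of its shape, not the exact original):

/// the slower version: a linear scan that counts newlines up to offset
fn indexToCoordinatesCounting(
    buffer: []const u8,
    offset: usize,
) error{ Overflow, Delimiter }!struct { usize, usize } {
    if (offset >= buffer.len) return error.Overflow;
    if (buffer[offset] == '\n') return error.Delimiter;
    var x: usize = 0;
    var y: usize = 0;
    // walk the buffer, bumping y and resetting x at each newline
    for (buffer[0..offset]) |byte| {
        if (byte == '\n') {
            y += 1;
            x = 0;
        } else {
            x += 1;
        }
    }
    return .{ x, y };
}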