Faster Go code by being mindful of memory
The Advent of Code is an annual set of coding puzzles. I find today’s puzzle interesting because it’s a good example of how sometimes better memory management can make a real difference in performance.
A friend and I are both doing the Advent of Code in Go this year, and we sometimes compare solutions. We tend to favor simplicity and readability over performance, but if we can make our code run much faster without adding a lot of complexity, we usually like to.
The puzzle isn’t very complicated. You are given a stream of bytes. Find the first consecutive block of 14 bytes where all 14 of these bytes are different. Return the position of that block in the stream.
Writing a benchmark#
Every participant’s puzzle input is different, so I’ll be using mine for this
article. Here is a link to it if you want to download
it. The answer for this input is 3613 (the block is fgspbvtdjrwznc
).
Let’s write a minimalist solution:
// solution.go
package d06
import (
"fmt"
"io"
)
func Solve(r io.Reader, w io.Writer) error {
_, err := fmt.Fprint(w, "3613")
return err
}
Obviously this is cheating: I already know the answer. But this is just to setup a benchmark and measure performance. Let’s add a unit test and a benchmark:
// solution_test.go
package d06
import (
"bytes"
"io"
"log"
"os"
"testing"
)
func Example() {
file, err := os.Open("testdata/input.txt")
if err != nil {
log.Fatalf("could not open input file: %v", err)
}
defer file.Close()
if err := Solve(file, os.Stdout); err != nil {
log.Fatalf("could not solve: %v", err)
}
// Output: 3613
}
func Benchmark(b *testing.B) {
input, err := os.ReadFile("testdata/input.txt")
if err != nil {
b.Fatalf("could not read input file: %v", err)
}
r := bytes.NewReader(input)
w := io.Discard
for n := 0; n < b.N; n++ {
r.Reset(input)
_ = Solve(r, w)
}
}
Now we can confirm that our solution provides the correct answer and measure its performance:
$ go test -bench=. -cpu=1,2,4,8 -benchmem
goos: darwin
goarch: arm64
pkg: github.com/busser/adventofcode/y2022/d06
Benchmark 53504293 22.37 ns/op 0 B/op 0 allocs/op
Benchmark-2 53038576 22.36 ns/op 0 B/op 0 allocs/op
Benchmark-4 52894310 22.38 ns/op 0 B/op 0 allocs/op
Benchmark-8 52344886 22.37 ns/op 0 B/op 0 allocs/op
PASS
ok github.com/busser/adventofcode/y2022/d06 5.920s
Here’s how to interpret the benchmark results above:
- We ran 4 benchmarks:
Benchmark
used a single CPU coreBenchmark-2
used 2 CPU coresBenchmark-4
used 4 CPU coresBenchmark-8
used 8 CPU cores
- Each benchmark called the
Solve
function about 50 million times in order to measure its performance. - The
Solve
function runs in about 22 nanoseconds on average. - The
Solve
function uses 0 bytes of heap memory on average. - The
Solve
function makes 0 heap memory allocations on average.
Simple, intuitive solution#
Now let’s write an actual solution. The steps our solution takes are:
- Read the stream of bytes and store it in a slice.
- Move along the datastream, checking each block of 14 bytes.
- Check if a block has duplicates by keeping track of past bytes with a map.
Our solution now looks like this:
package d06
import (
"errors"
"fmt"
"io"
)
func Solve(r io.Reader, w io.Writer) error {
datastream, err := io.ReadAll(r)
if err != nil {
return fmt.Errorf("could not read input: %w", err)
}
position, err := blockPosition(datastream, 14)
if err != nil {
return err
}
_, err = fmt.Fprintf(w, "%d", position)
return err
}
func blockPosition(datastream []byte, size int) (int, error) {
for i := 0; i < len(datastream)-size; i++ {
if !hasDuplicates(datastream[i : i+size]) {
return i + size, nil
}
}
return 0, errors.New("not found")
}
func hasDuplicates(block []byte) bool {
seen := make(map[byte]bool)
for _, b := range block {
if seen[b] {
return true
}
seen[b] = true
}
return false
}
This solution works and solved the puzzle for me. But then I ran a benchmark to see if I could do any better.
Performance analysis#
Running our benchmark gives us a pretty good idea of how fast the solution is:
$ go test -bench=. -cpu=1,2,4,8 -benchmem
goos: darwin
goarch: arm64
pkg: github.com/busser/adventofcode/y2022/d06
Benchmark 2820 361775 ns/op 27813 B/op 179 allocs/op
Benchmark-2 3300 360844 ns/op 27811 B/op 179 allocs/op
Benchmark-4 3280 363004 ns/op 27814 B/op 179 allocs/op
Benchmark-8 3306 362023 ns/op 27819 B/op 179 allocs/op
PASS
ok github.com/busser/adventofcode/y2022/d06 4.847s
Here’s what we see:
- The
Solve
function runs in about 0.36 milliseconds on average. - The
Solve
function uses about 28 kilobytes of memory on average. - The
Solve
function makes 179 allocations on the heap on average.
If there is a simple way to make my code faster, I want to know about it.
Go has built-in profiling tools to help me with this. I can use the
-cpuprofile
flag to measure my program’s CPU usage in detail:
$ go test -bench=. -cpu=1,2,4,8 -benchmem -cpuprofile=cpu.out
And use go tool pprof
to produce a graphical visualisation of the profile:
$ go tool pprof -focus=Solve -call_tree -relative_percentages -png -output=cpu.png cpu.out
Generating report in cpu.png
We can instantly see that our solution spends most of its time reading writing
to a map inside the hasDuplicates
function. Let’s try to find a faster
alternative to a map.
Replacing the map with a slice#
The hasDuplicates
function needs to keep track of bytes it has seen before in
the block it was passed. Byte values go from 0 to 255, so we can replace a map
of bytes to booleans with a slice of 256 booleans, like this:
func hasDuplicates(block []byte) bool {
// seen := make(map[byte]bool)
seen := make([]bool, 256)
for _, b := range block {
if seen[b] {
return true
}
seen[b] = true
}
return false
}
This is only a one-line change and it doesn’t make the code more complicated. It only works because we know that byte values have a small range, but we might as well take advantage of that.
Performance analysis#
Running our benchmark again yields great results:
$ go test -bench=. -cpu=1,2,4,8 -benchmem
goos: darwin
goarch: arm64
pkg: github.com/busser/adventofcode/y2022/d06
Benchmark 28832 39994 ns/op 17416 B/op 8 allocs/op
Benchmark-2 30354 39644 ns/op 17417 B/op 8 allocs/op
Benchmark-4 30141 39720 ns/op 17419 B/op 8 allocs/op
Benchmark-8 30249 39870 ns/op 17421 B/op 8 allocs/op
PASS
ok github.com/busser/adventofcode/y2022/d06 6.581s
- The
Solve
function runs in 0.04 milliseconds on average, a major improvement over our previous 0.36 milliseconds. - The
Solve
function uses about 17 kilobytes of memory on average, down from 28 kilobytes. - The
Solve
function makes 8 allocations on the heap on average, down from 179.
Let’s have a look at where our program spends its time now:
$ go test -bench=. -cpu=1,2,4,8 -benchmem -cpuprofile=cpu.out
... Benchmark results ...
$ go tool pprof -focus=Solve -call_tree -relative_percentages -png -output=cpu.png cpu.out
Generating report in cpu.png
Like before, most of the work is done in the hasDuplicates
function. I’d like
to know which steps in the function take time:
$ go tool pprof -focus=Solve -relative_percentages -list=hasDuplicates cpu.out
Total: 980ms
ROUTINE ======================== github.com/busser/adventofcode/y2022/d06.hasDuplicates in /Users/arthur/workspace/personal/adventofcode/y2022/d06/solution.go
840ms 860ms (flat, cum) 87.76% of Total
. . 31: return 0, errors.New("not found")
. . 32:}
. . 33:
. . 34:func hasDuplicates(block []byte) bool {
. . 35: // seen := make(map[byte]bool)
340ms 350ms 36: seen := make([]bool, 256)
. . 37:
30ms 30ms 38: for _, b := range block {
270ms 280ms 39: if seen[b] {
. . 40: return true
. . 41: }
200ms 200ms 42: seen[b] = true
. . 43: }
. . 44:
. . 45: return false
. . 46:}
We can see that the function spends time mostly on three things:
- Allocating the slice of bytes we use to keep track of past values
- Reading from the slice
- Writing to the slice
I don’t know how to make slice read/write operations faster. However I know how to make the slice smaller.
Making a smaller data structure#
In Go, a bool
takes up an entire byte in memory. If we replace our slice of
bool
with a slice of uint8
, our code uses the same amount of memory:
func hasDuplicates(block []byte) bool {
// seen := make(map[byte]bool)
// seen := make([]bool, 256)
seen := make([]uint8, 256)
for _, b := range block {
if seen[b] == 1 {
return true
}
seen[b] = 1
}
return false
}
$ go test -bench=. -cpu=1,2,4,8 -benchmem
goos: darwin
goarch: arm64
pkg: github.com/busser/adventofcode/y2022/d06
Benchmark 28718 40367 ns/op 17417 B/op 8 allocs/op
Benchmark-2 29954 39947 ns/op 17419 B/op 8 allocs/op
Benchmark-4 29869 40160 ns/op 17421 B/op 8 allocs/op
Benchmark-8 29871 40251 ns/op 17424 B/op 8 allocs/op
PASS
ok github.com/busser/adventofcode/y2022/d06 6.762s
So a lot of memory is actually going unused. If a bool
takes up 8 bits of
space and we only use 1, maybe we can get rid of the 7 we don’t use.
But first, would that actually make a performance difference? Let’s replace our
slice of uint8
with a slice of uint16
. This will show what happens if we
waste even more bits.
func hasDuplicates(block []byte) bool {
// seen := make(map[byte]bool)
// seen := make([]bool, 256)
// seen := make([]uint8, 256)
seen := make([]uint16, 256)
for _, b := range block {
if seen[b] == 1 {
return true
}
seen[b] = 1
}
return false
}
$ go test -bench=. -cpu=1,2,4,8 -benchmem
goos: darwin
goarch: arm64
pkg: github.com/busser/adventofcode/y2022/d06
Benchmark 19788 59449 ns/op 17417 B/op 8 allocs/op
Benchmark-2 20364 58948 ns/op 17419 B/op 8 allocs/op
Benchmark-4 20308 59158 ns/op 17421 B/op 8 allocs/op
Benchmark-8 20319 59111 ns/op 17424 B/op 8 allocs/op
PASS
ok github.com/busser/adventofcode/y2022/d06 7.467s
Here is what we see:
- The
Solve
function takes 0.06 milliseconds, up from 0.04. - The
Solve
function does not allocate more memory than before on the heap.
This makes sense. Our slice is allocated, used, and discarded in the same function, so the compiler logically allocated it onto the stack. And allocating more memory on the stack takes more time. So shaving off those unused bits may be worth it.
With need a way to keep track of whether we have seen a value before, for 256 different values. We only need a single bit per value to remember if we’ve seen it before. So in theory we could do this with only 256 bits.
256 bits is 32 bytes:
bits := make([]byte, 32)
Now we need to implement a bitset, with “get” and “set” operations. To set a specific bit, we need to compute which byte in our slice to inspect, and which bit in that byte to edit:
// This sets bit n°123 to 1.
b := 123
i, mask := b/8, byte(1<<(b%8))
bits[i] |= mask
Same thing for reading a bit:
// This prints bit n°123.
b := 123
i, mask := b/8, byte(1<<(b%8))
fmt.Println(bits[i] & mask)
This way, we can rewrite the hasDuplicates
function:
func hasDuplicates(block []byte) bool {
bits := make([]byte, 32)
for _, b := range block {
i, mask := b/8, byte(1<<(b%8))
if bits[i]&mask != 0 {
return true
}
bits[i] |= mask
}
return false
}
Performance analysis#
Running our benchmarks one last time yields good results:
$ go test -bench=. -cpu=1,2,4,8 -benchmem
goos: darwin
goarch: arm64
pkg: github.com/busser/adventofcode/y2022/d06
Benchmark 52460 21037 ns/op 17416 B/op 8 allocs/op
Benchmark-2 57086 20588 ns/op 17417 B/op 8 allocs/op
Benchmark-4 58436 20288 ns/op 17419 B/op 8 allocs/op
Benchmark-8 59562 20516 ns/op 17421 B/op 8 allocs/op
PASS
ok github.com/busser/adventofcode/y2022/d06 5.690s
- The
Solve
function takes 0.02 milliseconds on average.- This is 2x faster than when we used a slice.
- This is 18x faster than when we used a map!
Conclusion#
I will admit that implementing a bitset is not as simple as using a map or a slice. It was fun though! And it demonstrates an important point: be mindful of memory usage when performance matters.
Some very simple optimisations — like using a slice instead of a map — can bring great performance gains without making the code harder to grasp. They aren’t always possible, but keep an eye out for them nonetheless.