CodeTravelAbout Me

Solving Advent of Code in 1 microsecond using handwritten x86 assembly

It's a luxury to be able to write x += y
December 31, 2022

Advent of Code

Advent of Code (AOC) is an Advent calendar of small programming puzzles. Starting on December 1st, a new programming puzzle is unlocked every day until Christmas. Each day's puzzle has two parts. The difficulty of the puzzles increases as the days progress. Lots of people complete AOC puzzles for many reasons. Some enjoy the thrill of competing for the top leaderboard spot right as the puzzle unlocks at midnight. Most people (myself included) use the puzzles as a way to practice a new programming language.

The first few days are typically quite trivial which gives people a lot of creative liberty in how they choose to solve it. Many people code golf, use esoteric languages (like the Shakespeare Programming Language, or even Microsoft Excel!), or find some bizarre but entertaining way to challenge themselves to solve the problem.

This year I decided to learn the basics of x86 assembly and solve the first problem at the lowest level possible, resulting in all kinds of headaches and ultimately a solution which was too fast to even benchmark. Join me on a journey into the ancient and dangerous world of assembly language.

Day 1: The Problem

The day 1 puzzle requires you to count the calories being carried by the elves as they begin their expedition into the jungle. Each line in the puzzle input represents calories from a meal. Each group of calories represents an elf's total meals. Elves are separated by a blank line. In the example input below, the first elf is carrying 6,000 calories, the second 4,000, and so on:

1000
2000
3000

4000

5000
6000

7000
8000
9000

10000

In order to solve Part 1, you must find the elf carrying the most calories, and find the number of total calories he has.

The problem is simple, and the solution is straightforward. We need to loop over every group of elves, sum their calories, and keep track of the largest sum, overwriting it when we find a bigger number.

Parsing the Input

Our first challenge in assembly is to parse our input. We are at the lowest level of programming that there is. Nothing is available to us. Unless you link your program with something like the C standard library, you're completely on your own in every regard. In order to read a file you need to dive into system calls and write code for string parsing. I chose a different option however: format the input to work directly in our program.

Our input is just a series of numbers separated by a newline. In fact, if you replace the newlines with -1, our input suddenly becomes a contiguous array of numbers. All you have to do is throw it in a data section, give it a label, and append .long in front of each number, and you've got yourself a bona fide array of signed integers immediately ready to be used:

.data
.global nums
nums:
.long 1000
.long 2000
.long 3000
.long -1
.long 4000
.long -1
.long 5000
.long 6000
.long -1
.long 7000
.long 8000
.long 9000
.long -1
.long 10000
; and so on...

This can be linked directly into our program, and we don't have to spend any time at all parsing a file.

Solving Part 1

The first real problem we run into is that it's kind of difficult to even know what your assembly program is doing, or if it's even working at all.

Let's write the most minimal program in assembly possible, and then we can set up a small C driver to call our function and print the result. When we build and link these two programs together, we have a system where we can write our solution in assembly, but also see the result of our code.

.text

.global part1
part1:
    ; the return value is expected in the EAX register in x86 calling convention
    ; this is equal to "return 42;" in C
    movl $42, %eax
    ret
#include "stdio.h"

extern int part1();

int main() {

  int p1 = part1();
  printf("%d\n", p1);

  return 0;
}

After a lot of trial and error + painful debugging, I wrote enough code to complete the first part of the problem and be awarded with the first star:

.global part1
part1:
    movl $0, %ebx ; highestSoFar
    movl $0, %ecx ; counter

.ResetAndAdd:
    movl $0, %edx ; accumulator

.AddNext:
    leaq nums(%rip), %rax ; get start address of our number array
    addq %rcx,  %rax ; add our current offset
    movl (%rax), %eax ; move the value of that address into eax
    addl $4, %ecx ; add 4 (bytes) to our counter, effectively i++

    cmpl $-1, %eax ; test to see if the number is a newline (-1)
    je .NextElf ; if -1, run our code for the next elf.

    addl %eax, %edx ; otherwise, add the number to our accumulator
    jmp .AddNext

.NextElf:
    ; if we have looked through all the numbers, return
    cmpl $8948, %ecx ; 8948 = len(nums) * 4 bytes per number
    jge .Finish ; if we've looked at all the input, return

    ; check if new sum is > ebx
    cmpl %ebx, %edx
    jle .ResetAndAdd

    ; it is greater than
    movl %edx, %ebx ; move new value into highestSoFar
    jmp .ResetAndAdd

.Finish:
    movl %ebx, %eax ; return highestSoFar
    ret

This assembly function written in C would look like this:

extern int nums[];

int part1() {
  int highestSoFar = 0;
  int accumulator = 0;

  for (int i = 0; i < 2237; i++) {

    int number = nums[i];

    if (number == -1) {
      if (accumulator > highestSoFar) {
        highestSoFar = accumulator;
      }
      accumulator = 0;
      continue;
    }

    accumulator += number;
  }

  return highestSoFar;
}

Even while writing this small assembly program, I shot myself in the foot several times. When you screw up in assembly, you have nobody to help you. There is no compiler or tooling to hold your hand. It is just you vs. pure logic, and the computer will do exactly what you tell it to do, even to a horrible fault.

Who? What? When? Where? Why?

There is also a surprising amount of context that you must learn if you're new to assembly. Which registers are you able to use safely? You really only have four registers available to you: eax, ebx, ecx, and edx. These are the four general registers and how the programmer uses them defines their meaning. Before I realized that you can really only work with these, I tried using esp for example and my program would not work. Why? I was modifying the stack pointer unknowingly to use the register in my own calculations and that is a big no-no. Even though these registers are special (and should be off-limits), nothing is there to stop you from using them. Realizing that I only had four registers available felt very limiting. That's when you realize you have to start using the stack or creating variables in data sections to start holding your calculations and move those values in and out of registers.

This means that you have to juggle so much more context in your head about your program. If you have a variable that has some meaning, along with remembering what it is, you also have to remember where it is. This realization shows you just how much we take for granted in our high level languages.

For example, in C we can define a variable, do other stuff, and then add to that variable later. We never have to care where that variable is in memory. Is it already in a register? Is it on the stack, or in a data section? Do I need to move it into a register before I can work with it? Which registers am I not using at this exact spot in my program? Who cares. We don't have to mentally keep track of it, The compiler abstracts that away for us:

int x = 0;

// ...

// What a luxury! We don't have to think about
// where 'x' is at all. We just use it.
x += 40;

In assembly, it is 100% up to you to determine how your program uses memory at every level, and where that memory is stored. Is it on the stack or in a register? Is that variable at offset 4 or 8 on the stack again? I can't remember. Now imagine that you have loaded a variable from an incorrect memory address. You've just broken your program in a non-obvious way, and the only way to realize it is through a painful amount of debugging.

One Missing Character to Ruin Your Day

While solving part 1, I made an assumption that bit me hard: I assumed the default jump instruction was jp. Spoiler alert: it isn't.

It took me tearing apart my whole program to realize my problem: My jump instruction isn't working how I expected it to. This led me to find out that the jump instruction I wanted is actually jmp. jp was a real instruction, but it was short for Jump short if parity, only branching if the parity flag was set. One innocent missing m caused me a lot of headache.

Control flow instructions are another thing that you just never have to care about in a high level language. The compiler will always create the correct one based on your code.

Solving Part 2

Typically, the second part of an AOC problem will ask for more, and expose faults in the speed of your code or the flexibility of your implementation. The second part for Day 1 was no exception:

Find the top three Elves carrying the most Calories. How many Calories are those Elves carrying in total?

And boy, assembly language is the least flexible language out there. Fortunately it wasn't too bad to adapt the program to handle the top three elves instead of one.

Instead of re-calculating everything in another function, I add a data section to hold the results, and then calculate the top three elves in part 1 while only returning the top one. In part two I then take the top three from the calculations and return their sum.

.data
ans:
.long 0
.long 0
.long 0

.text

.global part1
part1:
    movl $0, %ebx ; highestSoFar
    movl $0, %ecx ; counter

.ResetAndAdd:
    movl $0, %edx ; accumulator

.AddNext:
    leaq nums(%rip), %rax
    addq %rcx,  %rax
    movl (%rax), %eax
    addl $4, %ecx

    cmpl $-1, %eax
    je .NextElf

    addl %eax, %edx
    jmp .AddNext

.NextElf:
    ; iif we have looked through all the numbers, return
    cmpl $8948, %ecx
    jge .Finish

    ; check if new sum is > ans[0] through ans[2]
    movl ans(%rip), %ebx
    cmpl %ebx, %edx ; compare ans[0] to accumulator
    jle .Ans1
    movl %edx, ans(%rip)
    jmp .ResetAndAdd
    .Ans1:
    movl 4+ans(%rip), %ebx
    cmpl %ebx, %edx ; compare ans[1] to accumulator
    jle .Ans2
    movl %edx, 4+ans(%rip)
    jmp .ResetAndAdd
    .Ans2:
    movl 8+ans(%rip), %ebx
    cmpl %ebx, %edx ; compare ans[2] to accumulator
    jle .ResetAndAdd
    movl %edx, 8+ans(%rip)
    jmp .ResetAndAdd


.Finish:
    movl ans(%rip), %eax
    ret

.global part2
part2:
    movl $0, %eax
    addl ans(%rip), %eax
    addl 4+ans(%rip), %eax
    addl 8+ans(%rip), %eax
    ret

This approach in assembly when literally translated to C would look like this:

extern int nums[];

int ans[3] = {0, 0, 0};

int part1() {
  int accumulator = 0;

  for (int i = 0; i < 2237; i++) {

    int number = nums[i];

    if (number == -1) {
      if (accumulator > ans[0]) {
        ans[0] = accumulator;
      } else if (accumulator > ans[1]) {
        ans[1] = accumulator;
      } else if (accumulator > ans[2]) {
        ans[2] = accumulator;
      }
      accumulator = 0;
      continue;
    }

    accumulator += number;
  }

  return ans[0];
}

int part2() {
  return ans[0] + ans[1] + ans[2];
}

And while this C code is somewhat ugly, it accurately demonstrates the logic of how the assembly code I wrote works. I was quite happy to be able to solve both parts in x86.

Naturally I wanted to know how long it took to calculate the answer. After trying to time part1() with C's clock, it was completing so fast that clock returned the same number both before and after calling part1().

I read about methods to get around this, and one suggested running the function in a loop n number of times and then dividing the time spent by n:

int main() {

  clock_t start, end;
  double cpu_time_used;
  int p1 = 0;

  start = clock();
  for (int i = 0; i < 1000000; i++) {
    p1 = part1();
  }
  end = clock();
  cpu_time_used = ((double)(end - start)) / CLOCKS_PER_SEC / 1000000;

  printf("%d\n", p1);

  int p2 = part2();
  printf("%d\n", p2);
  printf("Done in %.50f seconds\n", cpu_time_used);

  return 0;
}

Which outputs: Done in 0.000001135999 seconds on my average 2.8Ghz Intel i7 laptop.

It's not clear to me how much time is added by the time it takes to loop and call the function, but even with that added overhead it seems to be completing in an average of 1 microsecond. That's pretty fast. An example Rust program that someone else wrote for part 1 was able to solve it in 27 microseconds, but that included parsing the input, whereas my version has no input parsing.

Takeaways

Programming in assembly language gives me a new-found respect for how much high-level languages give us by abstracting away all the tedium and potential room for error. Low level languages such as C or C++ have a bad reputation for allowing people to shoot themselves in the foot, but the room for error in assembly feels exponentially greater. High level languages protect us from a huge amount of potential problems that most of us probably didn't realize existed in the first place. This isn't even taking into consideration garbage collection and other niceties even higher-level languages offer.

Writing in C again after writing in x86 feels like going from crawling on the ground through broken glass to riding a bike down the road with a nice breeze on a summer day.