A Simple Implementation of a Brainfuck JIT Compiler

lyc8503

2022-10-10

This article is currently an experimental machine translation and may contain errors. If anything is unclear, please refer to the original Chinese version. I am continuously working to improve the translation.

This article was inspired by a C++ course OJ problem that required writing a Brainfuck interpreter.

One classmate started complaining in the group chat that the OJ test cases were too weak—code didn’t need any optimization to pass.

Coincidentally, just a few days earlier I had come across a tutorial implementation of a BF JIT on GitHub, so I decided to try implementing it myself in C++.

~~Originally, we were just supposed to write a simple BF interpreter in C++. Now this is clearly overkill—like using a chainsaw to kill a chicken.~~

Introduction to Brainfuck (From Wikipedia)

The language consists of eight commands, listed below. A brainfuck program is a sequence of these commands, possibly interspersed with other characters (which are ignored). The commands are executed sequentially, with some exceptions: an instruction pointer begins at the first command, and each command it points to is executed, after which it normally moves forward to the next command. The program terminates when the instruction pointer moves past the last command.

The brainfuck language uses a simple machine model consisting of the program and instruction pointer, as well as a one-dimensional array of at least 30,000 byte cells initialized to zero; a movable data pointer (initialized to point to the leftmost byte of the array); and two streams of bytes for input and output (most often connected to a keyboard and a monitor respectively, and using the ASCII character encoding).

The eight language commands each consist of a single character:

Character	Meaning
`>`	Increment the data pointer (to point to the next cell to the right).
`<`	Decrement the data pointer (to point to the next cell to the left).
`+`	Increment (increase by one) the byte at the data pointer.
`-`	Decrement (decrease by one) the byte at the data pointer.
`.`	Output the byte at the data pointer.
`,`	Accept one byte of input, storing its value in the byte at the data pointer.
`[`	If the byte at the data pointer is zero, then instead of moving the instruction pointer forward to the next command, jump it forward to the command after the matching `]` command.
`]`	If the byte at the data pointer is nonzero, then instead of moving the instruction pointer forward to the next command, jump it back to the command after the matching `[` command.

As the name suggests, Brainfuck programs tend to be difficult to comprehend. This is partly because any mildly complex task requires a long sequence of commands and partly because the program’s text gives no direct indications of the program’s state. These, as well as Brainfuck’s inefficiency and its limited input/output capabilities, are some of the reasons it is not used for serious programming. Nonetheless, like any Turing complete language, Brainfuck is theoretically capable of computing any computable function or simulating any other computational model, if given access to an unlimited amount of memory. A variety of Brainfuck programs have been written. Although Brainfuck programs, especially complicated ones, are difficult to write, it is quite trivial to write an interpreter for Brainfuck in a more typical language such as C due to its simplicity. There even exist Brainfuck interpreters written in the Brainfuck language itself.

Brainfuck is an example of a so-called Turing tarpit: It can be used to write any program, but it is not practical to do so, because Brainfuck provides so little abstraction that the programs get very long or complicated.

Translating the Instruction Sequence

A Just-in-time (JIT) compiler, as the name suggests, translates code into machine code at runtime and executes it immediately.

JIT compilation is extremely complex for high-level languages we commonly use (such as Hotspot or V8), but for esoteric languages like Brainfuck, it’s surprisingly simple.

Each Brainfuck instruction can be directly translated into equivalent C code—or even directly into machine code.

For example, > becomes data_pointer++;, and + becomes data[data_pointer]++;, and so on.

Now, in theory, we could just replace the input program string with raw machine code and call it a day.

But ~~hold your horses~~.

For instance, ten consecutive > commands would be naively translated into ten separate data_pointer++; statements, instead of the more efficient data_pointer += 10;.

So we can apply some simple optimizations to the BF program before translating it.

The implementation mainly combines consecutive +, -, >, and < instructions: https://github.com/lyc8503/BrainfuckJIT/blob/master/main.cpp#L64

Once we have our optimized, custom instruction sequence, we can begin translating it into assembly code.

We’ll use the rdx register to represent the data pointer mentioned earlier. The instructions can then be translated as follows:

> and <
1
2
add rdx, 1
sub rdx, 1
+ and -
1
2
add byte [rdx], 1
sub byte [rdx], 1

. and ,

We use Linux system calls read and write:

mov rax, 0      ; `read` syscall no 0
mov rdi, 0      ; 1st arg: int fd, value 0 (stdin)
mov rsi, rdx    ; 2nd arg: void* buf, addr rdx
push rdx        ; save rdx to stack
mov rdx, 1      ; 3rd arg: size_t count, 1 character
syscall         ; do the syscall
pop rdx         ; restore rdx

mov rax, 1      ; `write` syscall no 1
mov rdi, 1      ; 1st arg: int fd, value 1 (stdout)
mov rsi, rdx    ; 2nd arg: void* buf, addr rdx
push rdx        ; save rdx to stack
mov rdx, 1      ; 3rd arg: size_t count, 1 character
syscall         ; do the syscall
pop rdx         ; restore rdx

[ and ]
These are conditional jump instructions. Their assembly representations are:
1
2
cmp byte [rdx], 0
je 0x1
1
2
cmp byte [rdx], 0
jne 0x1
However, when we first encounter an opening bracket [, we don’t yet know where the matching closing bracket ] is, so we can’t fill in the jump address immediately.
This classic problem can be solved using a stack.
When we encounter an opening bracket, we push the current code position onto the stack. When we later encounter the corresponding closing bracket, we can compute the correct jump target.

Implementation details: https://github.com/lyc8503/BrainfuckJIT/blob/master/main.cpp#L108

Executing Machine Code

Before executing the generated code, we first need to initialize rdx to point to a pre-allocated memory region.

// initialize the rdx as the data pointer
// machine code generated by nasm `mov rdx, [the value of the pointer to data]`
code.push_back(0x48);
code.push_back(0xba);

// little-endian byte order
for (int i = 0; i < 8; i++) {
	code.push_back(((unsigned long long) data & ((unsigned long long) 0xff << (8 * i))) >> (8 * i));
}

In theory, we could now use inline assembly to set the rip register to the start of our generated code and begin execution.

However, due to modern operating system security protections, dynamically generated code cannot be executed directly—it would result in a segmentation fault (SIGSEGV).

We must first request a special memory region that is readable, writable, and executable. On Linux, this can be done using mmap. See the manual for detailed parameter meanings.

1 2	// allocate rwx memory virtualCodeAddress = mmap(nullptr, len, PROT_READ \| PROT_WRITE \| PROT_EXEC, MAP_ANONYMOUS \| MAP_PRIVATE, 0, 0);

After that, we copy our generated code into this memory region and execute it via a function pointer. Note that the C++ compiler will set up a stack frame for the function call, so don’t forget to append a ret instruction at the end of the machine code to return properly.

1	code.push_back(0xc3); // machine code `ret` to exit the func

Full implementation: https://github.com/lyc8503/BrainfuckJIT/blob/master/main.cpp#L21

Correctness Testing

After implementation, the JIT passes all original OJ test cases.

We can also try running a self-interpreter:

>>>+[[-]>>[-]++>+>+++++++[<++++>>++<-]++>>+>+>+++++[>++>++++++<<-]+>>>,<++[[>[->>]<[>>]<<-]<[<]<+>>[>]>[<+>-[[<+>-]>]<[[[-]<]++<-[<+++++++++>[<->-]>>]>>]]<<]<]<[[<]>[[>]>>[>>]+[<<]<[<]<+>>-]>[>]+[->>]<<<<[[<<]<[<]+<<[+>+<<-[>-->+<<-[>+<[>>+<<-]]]>[<+>-]<]++>>-->[>]>>[>>]]<<[>>+<[[<]<]>[[<<]<[<]+[-<+>>-[<<+>++>-[<->[<<+>>-]]]<[>+<-]>]>[>]>]>[>>]>>]<<[>>+>>+>>]<<[->>>>>>>>]<<[>.>>>>>>>]<<[>->>>>>]<<[>,>>>]<<[>+>]<<[+<<]<]

Performance Benchmark

We tested a double-layer self-interpreter running a “Hello World” program. The original interpreter took 2m18s, while the JIT version completed in just 2.5s.

1
2

>>>+[[-]>>[-]++>+>+++++++[<++++>>++<-]++>>+>+>+++++[>++>++++++<<-]+>>>,<++[[>[->>]<[>>]<<-]<[<]<+>>[>]>[<+>-[[<+>-]>]<[[[-]<]++<-[<+++++++++>[<->-]>>]>>]]<<]<]<[[<]>[[>]>>[>>]+[<<]<[<]<+>>-]>[>]+[->>]<<<<[[<<]<[<]+<<[+>+<<-[>-->+<<-[>+<[>>+<<-]]]>[<+>-]<]++>>-->[>]>>[>>]]<<[>>+<[[<]<]>[[<<]<[<]+[-<+>>-[<<+>++>-[<->[<<+>>-]]]<[>+<-]>]>[>]>]>[>>]>>]<<[>>+>>+>>]<<[->>>>>>>>]<<[>.>>>>>>>]<<[>->>>>>]<<[>,>>>]<<[>+>]<<[+<<]<]
>>>+[[-]>>[-]++>+>+++++++[<++++>>++<-]++>>+>+>+++++[>++>++++++<<-]+>>>,<++[[>[->>]<[>>]<<-]<[<]<+>>[>]>[<+>-[[<+>-]>]<[[[-]<]++<-[<+++++++++>[<->-]>>]>>]]<<]<]<[[<]>[[>]>>[>>]+[<<]<[<]<+>>-]>[>]+[->>]<<<<[[<<]<[<]+<<[+>+<<-[>-->+<<-[>+<[>>+<<-]]]>[<+>-]<]++>>-->[>]>>[>>]]<<[>>+<[[<]<]>[[<<]<[<]+[-<+>>-[<<+>++>-[<->[<<+>>-]]]<[>+<-]>]>[>]>]>[>>]>>]<<[>>+>>+>>]<<[->>>>>>>>]<<[>.>>>>>>>]<<[>->>>>>]<<[>,>>>]<<[>+>]<<[+<<]<]!>++++++++[<+++++++++>-]<.>++++[<+++++++>-]<+.+++++++..+++.>>++++++[<+++++++>-]<++.------------.>++++++[<+++++++++>-]<+.<.+++.------.--------.>>>++++[<++++++++>-]<+.!

# Original interpreter
real    2m18.136s
user    2m18.063s
sys     0m0.012s

# With JIT
real	0m2.489s
user	0m0.010s
sys	    0m0.000s

On a Mandelbrot set renderer, the original interpreter took 3m2s, while the JIT version completed in just 1s.

# Original
real    3m2.546s
user    3m2.396s
sys     0m0.014s

# With JIT
real	0m1.157s
user	0m0.022s
sys	    0m0.151s

Room for Optimization

Currently, input/output operations are implemented using direct Linux system calls, reading or writing only one character at a time. The overhead of system calls is significant, especially for programs with heavy I/O. This can be improved by buffering—reading input in bulk and buffering output before flushing.

This article is licensed under the CC BY-NC-SA 4.0 license.

Author: lyc8503, Article link: https://blog.lyc8503.net/en/post/bf-jit/
If this article was helpful or interesting to you, consider buy me a coffee¬_¬
Feel free to comment in English below o/