June 22, 2019

Writing a lisp compiler from scratch in JavaScript: 6. LLVM system calls

Previously in compiler basics:
1. lisp to assembly
2. user-defined functions and variables
3. LLVM
4. LLVM conditionals and compiling fibonacci
Next in compiler basics:
5. an x86 upgrade

In this post we'll extend the ulisp compiler's LLVM backend to support printing integers to stdout.

Exit code limitations

Until now we've validated program state by setting the exit code to the result of the program computation. But the exit code is an eight bit integer. What if we want to validate a computation that produces a result larger than 255?

To do this we need a way to print integers. This is challenging because printing normally deals with byte arrays. libc's printf, for example, takes a byte array as its first argument.

The shortest path forward is to add support for system calls so we can print one character at a time. Here's a version of a print form that hacks around not having arrays to send each integer in a number to stdout.

(def print-char (c)
     ; First argument is stdout
     ; Second argument is a pointer to a char array (of length one)
     ; Third argument is the length of the char array
     (syscall/sys_write 1 &c 1))

(def print (n)
     (if (> n 9)
         (print (/ n 10)))

     ; 48 is the ASCII code for '0'
     (print-char (+ 48 (% n 10))))

In order to support this we need to add the syscall/sys_write, >, %, and / builtin forms. We'll also need to add support for taking the address of a variable.

All code is available on Github as is the particular commit related to this post.

References

The sys_write syscall requires us to pass the memory address of the byte array to write. We don't support arrays, but we can treat an individual variable as an array of length one by passing the variable's address.

If we were compiling to C we could just pass the address of a local variable. But LLVM doesn't allow us to take the address of variables directly. We need to push the variable onto the LLVM stack to get an address.

Under the hood LLVM will likely optimize this into a local variable reference instead of first pushing to the stack.

Since LLVM IR is typed, the value representing the address of a local variable will be a pointer type. We'll need to refer to all uses of this value as a pointer. So we'll need to modify ulisp to track local types rather than hard-coding i64 everywhere.

Scope

To begin we'll modify the Scope class to track types. We only need to do this on registration. Afterward, we'll have to find all uses of local variables to make sure they use the local's value and type fields appropriately.

class Scope {

  ...

  register(local) {
    let copy = local.replace('-', '_');
    let n = 1;
    while (this.locals[copy]) {
      copy = local + n++;
    }

    this.locals[local] = {
      value: copy,
      type: 'i64',
    };
    return this.locals[local];
  }

  ...

}

We won't go through every use of a Scope variable in this post, but you can find it in the related commit to ulisp.

Reference

The long-term approach for handling a reference syntactically is probably to rewrite &x to (& x) in the parser. The lazy approach we'll take for now is to handle a reference as a special kind of identifier in compileExpression.

We'll use the LLVM alloca instruction to create space on the stack. This will return a pointer and will turn the destination variable into a pointer type. Then we'll use store to set the value at the address to the current value of the variable being referenced.

class Compiler {

  ...

  compileExpression(exp, destination, context) {

    ...

    // Is a reference, push onto the stack and return its address
    if (exp.startsWith('&')) {
      const symbol = exp.substring(1);
      const tmp = context.scope.symbol();
      this.compileExpression(symbol, tmp, context);
      this.emit(1, `%${destination.value} = alloca ${tmp.type}, align 4`);
      destination.type = tmp.type + '*';
      this.emit(1, `store ${tmp.type} %${tmp.value}, ${destination.type} %${destination.value}, align 4`);
      return;
    }

    ...

  }

  ...

}

And now we're set to take the address of any code.

System calls

LLVM IR provides no high-level means for making system calls. The only way is to use inline assembly. This syntax is based on GCC inline assembly and is confusing, with few explained examples, and unhelpful error messages.

Thankfully the assembly code needed for a syscall is only one line, one word: the syscall assembly instruction. We use inline assembly variable-to-register mapping functionality to line up all the parameters for the syscall. Here is an example:

%result = call i64 asm sideeffect "syscall", "=r,{rax},{rdi},{rsi},{rdx}" (i64 %raxArg, i64 %rdiArg, i64 %rsiArg, i64 %rdxArg)

This says to execute the inline assembly string, "syscall". The sideeffect flag means that this assembly should always be run even if the result isn't used. =r means the inline assembly returns a value, and the rest of the string is the list of registers that arguments should be mapped to. Finally we call the function with all the LLVM variables we want to be mapped.

Eventually we should also use the inline assembly syntax to list registers that are modified so that LLVM can know to save them before and after.

Code

We'll add a mapping for syscall/sys_write and a helper function for generating syscall code using the example above as a template. We'll suport 64-bit Darwin and Linux kernels.

const SYSCALL_TABLE = {
  darwin: {
    sys_write: 0x2000004,
    sys_exit: 0x2000001,
  },
  linux: {
    sys_write: 1,
    sys_exit: 60,
  },
}[process.platform];

class Compiler {
  constructor() {
    this.outBuffer = [];
    this.primitiveFunctions = {
      def: this.compileDefine.bind(this),
      begin: this.compileBegin.bind(this),
      'if': this.compileIf.bind(this),
      '+': this.compileOp('add'),
      '-': this.compileOp('sub'),
      '*': this.compileOp('mul'),
      '%': this.compileOp('urem'),
      '<': this.compileOp('icmp slt'),
      '=': this.compileOp('icmp eq'),
      'syscall/sys_write': this.compileSyscall(SYSCALL_TABLE.sys_write),
    };
  }

  ...

  compileSyscall(id) {
    return (args, destination, context) => {
      const argTmps = args.map((arg) => {
          const tmp = context.scope.symbol();
          this.compileExpression(arg, tmp, context);
          return tmp.type + ' %' + tmp.value;
      }).join(', ');
      const regs = ['rdi', 'rsi', 'rdx', 'r10', 'r8', 'r9'];
      const params = args.map((arg, i) => `{${regs[i]}}`).join(',');
      const idTmp = context.scope.symbol().value;
      this.emit(1, `%${idTmp} = add i64 ${id}, 0`)
      this.emit(1, `%${destination.value} = call ${destination.type} asm sideeffect "syscall", "=r,{rax},${params},~{dirflag},~{fpsr},~{flags}" (i64 %${idTmp}, ${argTmps})`);
    }
  }
}

>, /

Finally, we have a few new operations to add support for. But they'll be pretty simple using the compileOp helper function.

class Compiler {
  constructor() {
    this.outBuffer = [];
    this.primitiveFunctions = {
      def: this.compileDefine.bind(this),
      begin: this.compileBegin.bind(this),
      'if': this.compileIf.bind(this),
      '+': this.compileOp('add'),
      '-': this.compileOp('sub'),
      '*': this.compileOp('mul'),
      '/': this.compileOp('udiv'),
      '%': this.compileOp('urem'),
      '<': this.compileOp('icmp slt'),
      '>': this.compileOp('icmp sgt'),
      '=': this.compileOp('icmp eq'),
      'syscall/sys_write': this.compileSyscall(SYSCALL_TABLE.sys_write),
    };
  }

  ...

}

print

We're ready to give our print function a shot.

$ cat test.lisp
(def print-char (c)
     ; First argument is stdout
     ; Second argument is a pointer to a char array (of length one)
     ; Third argument is the length of the char array
     (syscall/sys_write 1 &c 1))

(def print (n)
     (if (> n 9)
         (print (/ n 10)))

     ; 48 is the ASCII code for '0'
     (print-char (+ 48 (% n 10))))

(def main ()
     (print 1234)
     0)
$ node ulisp.js test.lisp
$ ./build/a.out
1234

Looks good! In the next post we'll talk about tail call elimination.