January 20, 2019

Writing a lisp compiler from scratch in JavaScript: 2. user-defined functions and variables

Previously in compiler basics:
1. lisp to assembly

Next in compiler basics:
3. LLVM
4. LLVM conditionals and compiling fibonacci
5. LLVM system calls
6. an x86 upgrade

In this post we'll extend the compiler to support defining functions and variables. Additionally, we'll require the program's entrypoint to be within a main function.

The resulting code can be found here.

Function definition

The simplest function definition we need to support is for our main function. This will look like this:

$ cat basic.lisp
(def main ()
     (+ 1 2))

Where compiling and running it should produce a return code of 3:

$ node ulisp.js basic.lisp
$ ./build/a.out
$ echo $?
3

Parsing function definitions

The entire language is defined in S-expressions and we already parse S-expressions.

$ node
> const { parse } = require('./parser');
> JSON.stringify(parse('(def main () (+ 1 2))'));
'[[["def","main",[],["+",1,2]]],""]'

So we're done!

Code generation

There are two tricky parts to code generation once function definitions are introduced:

  • Functions definitions are not expressions (in assembly)
  • Function calling conventions for the callee
  • Variable scope

Function definitions

A function definition looks like a function call. So we'll need to keep a list of "primitive" functions that handle what looks like function calls differently.

function compile_define() {
  // TODO
}

const primitive_functions = {
  def: compile_define,
};

Then in our compile_call function we need to see if the function being "called" is in this list. If so, we allow the associated callback to handle compilation.

function compile_call(fun, args, destination) {
  if (primitive_functions[fun]) {
    primitive_functions[fun](args, destination);
    return;
  }

  // Save param registers
  args.map((_, i) => emit(1, `PUSH ${PARAM_REGISTERS[i]}`));

  // Compile registers and store as params
  args.map((arg, i) => compile_expression(arg, PARAM_REGISTERS[i], scope));

  emit(1, `CALL ${BUILTIN_FUNCTIONS[fun] || scope[fun]}`);

  // Restore param registers
  args.map((_, i) => emit(1, `POP ${PARAM_REGISTERS[args.length - i - 1]}`));

  if (destination && destination !== 'RAX') {
    emit(1, `MOV ${destination}, RAX`);
  }
}

Now we can begin thinking about compile_define. It takes args which will be a list of three elements containing the function's:

  • name
  • parameters
  • and body

It does not use destination because we're treating function definitions as statements for now and not as expressions. If we were treating it as an expression, we might store the address of the function in the destination register. We keep destination around to keep the primitive function signatures consistent.

Based on how we called functions before and how we defined the hard-coded add function, we know what a function definition in assembly generally looks like. And we know the arguments to the function when called will be in RDI, RSI, and RDX.

function compile_define([name, parameters, body]) {
  // Function name becomes a label we can CALL
  emit(0, `${name}:`);

  // Something to do with RDI, RSI, RDX and the parameters variable?

  // We renamed compile_argument to compile_expression to be more general
  compile_expression(body[0], 'RAX');

  // Maybe some cleanup to do with RDI, RSI, RDX?

  emit(1, 'RET\n');
}

Not a bad first sketch. But how do we match up RDI, RSI, RDX and the user-defined parameters variable names? For example in the following:

(def plus-two (a)
     (+ a 2))

It's clear to us that a must match up to RDI. In order to do this we need to track all variables in a scope dictionary mapping the variable name to the register where it's stored.

Additionally, keeping track of scope can help us fail quickly in the following scenario:

(def plus-two (a)
     (+ b 2))

The variable b is used but never defined. It has not been added to the scope dictionary. So our compiler can fail quickly saying there is an undefined variable being referenced.

Taking this a step further, what if we want to catch the following too:

(def plus-two (a)
     (plus a 2))

We're trying to call plus but it has not been defined. We should be able to fail quickly here too. But that means we're need to track the scope of function names in addition to variables. We'll choose to track function names and variable names in the same scope dictionary.

This is the distinction between a lisp-1 and a lisp-2. We are a lisp-1 like Scheme because we have a single scope. Common Lisp is a lisp-2 because it stores function name scope separately from variable name scope.

Implementing scope

We need to revise every compile function to accept a scope dictionary (specifically: compile, compile_expression, compile_call, and compile_define). If a variable is referenced, we need to look up it's location in the scope dictionary. If a variable is defined (e.g. a function name or a function parameter) we need to add a mapping to the scope dictionary.

Modifying compile_expression is easiest:

function compile_expression(arg, destination, scope) {
  // Is a nested function call, compile it
  if (Array.isArray(arg)) {
    compile_call(arg[0], arg.slice(1), destination, scope);
    return;
  }

  if (scope[arg] || Number.isInteger(arg)) {
    emit(1, `MOV ${destination}, ${scope[arg] || arg}`);
  } else {
    throw new Error('Attempt to reference undefined variable or unsupported literal: ' + arg);
  }
}

Next we modify compile_call:

function compile_call(fun, args, destination, scope) {
  if (primitive_functions[fun]) {
    primitive_functions[fun](args, destination, scope);
    return;
  }

  // Save param registers
  args.map((_, i) => emit(1, `PUSH ${PARAM_REGISTERS[i]}`));

  // Compile registers and store as params
  args.map((arg, i) => compile_expression(arg, PARAM_REGISTERS[i], scope));

  const validFunction = BUILTIN_FUNCTIONS[fun] || scope[fun];
  if (validFunction) {
    emit(1, `CALL ${validFunction}`);
  } else {
    throw new Error('Attempt to call undefined function: ' + fun);
  }

  // Restore param registers
  args.map((_, i) => emit(1, `POP ${PARAM_REGISTERS[args.length - i - 1]}`));

  if (destination && destination !== 'RAX') {
    emit(1, `MOV ${destination}, RAX`);
  }
}

And then compile_define where we modify scope for the first time:

function compile_define([name, params, ...body], destination, scope) {
  // Add this function to outer scope
  scope[name] = name.replace('-', '_');

  // Copy outer scope so parameter mappings aren't exposed in outer scope.
  const childScope = { ...scope };

  emit(0, `${scope[name]}:`);

  params.forEach(function (param, i) {
    const register = PARAM_REGISTERS[i];
    // Store parameter mapped to associated register
    childScope[param] = register;
  });

  // Pass childScope in for reference when body is compiled.
  compile_expression(body[0], 'RAX', childScope);

  emit(1, 'RET\n');
}

And finally we need to modify the entrypoint compile:

module.exports.compile = function (ast) {
  emit_prefix();
  // Pass in new, empty scope mapping
  compile_call(ast[0], ast.slice(1), 'RAX', {});
  emit_postfix();
}

And scope-wise we're pretty good!

Function calling convention: callee

We currently have a problem that we're using parameters registers to store local variables that messes up with how we are storing parameters for function calls within the function itself.

Ideally we could store function local variables (including the parameters when we get them) separately from how we store function call parameters within the function.

Thankfully according to the calling convention we've followed, we're given a set of registers that are callee-preserved. Of them we'll use RBX, RBP, and R12 in that order. This allows us to mess with so long as we store them and restore them within the function.

Applying the same storing/restoring strategy to local variables as we did for parameters, we get:

const LOCAL_REGISTERS = [
  'RBX',
  'RBP',
  'R12',
];

function compile_define([name, params, ...body], destination, scope) {
  // Add this function to outer scope
  scope[name] = name.replace('-', '_');

  // Copy outer scope so parameter mappings aren't exposed in outer scope.
  const childScope = { ...scope };

  emit(0, `${scope[name]}:`);

  params.forEach(function (param, i) {
    const register = PARAM_REGISTERS[i];
    const local = LOCAL_REGISTERS[i];
    emit(1, `PUSH ${local}`);
    emit(1, `MOV ${local}, ${register}`);
    // Store parameter mapped to associated local
    childScope[param] = local;
  });

  // Pass childScope in for reference when body is compiled.
  compile_expression(body[0], 'RAX', childScope);

  params.forEach(function (param, i) {
    // Backwards first
    const local = LOCAL_REGISTERS[params.length - i - 1];
    emit(1, `POP ${local}`);
  });

  emit(1, 'RET\n');
}

And we're set.

Cleanup

We've still got a few messes going on:

  • emit_prefix wraps out entire body in _main, we're requiring our own main now
  • emitting to stdout instead of to a file
  • multiple function definitions is treated as nonsense

Starting first, we rewrite emit_prefix and emit_postfix so that our _main just calls main.

function emit_prefix() {
  emit(1, '.global _main\n');

  emit(1, '.text\n');

  emit(0, 'plus:');
  emit(1, 'ADD RDI, RSI');
  emit(1, 'MOV RAX, RDI');
  emit(1, 'RET\n');
}

function emit_postfix() {
  emit(0, '_main:');
  emit(1, 'CALL main');
  emit(1, 'MOV RDI, RAX'); // Set exit arg
  emit(1, `MOV RAX, ${SYSCALL_MAP['exit']}`);
  emit(1, 'SYSCALL');
}

Next to deal with writing to a file instead of stdout, we need our emit function to write to a buffer. We'll let ulisp.js write that buffer to a file. Because we're incredibly lazy, we'll do this all globally.

let OUT = '';

function emit(depth, args) {
  const indent = new Array(depth + 1).join('  ');
  OUT += `${indent}${args}\n`;
}

...

module.exports.compile = function (ast) {
  OUT = '';

  emit_prefix();
  compile_call(ast[0], ast.slice(1), 'RAX', {});
  emit_postfix();

  return OUT;
}

And modify ulisp.js:

const cp = require('child_process');
const fs = require('fs');

const { parse } = require('./parser');
const { compile } = require('./compiler');

function main (args) {
  const input = fs.readFileSync(args[2]).toString();
  const [ast] = parse(input);
  const program = compile(ast);

  try {
    fs.mkdirSync('build');
  } catch (e) {}
  fs.writeFileSync('build/prog.s', program);
  cp.execSync('gcc -mstackrealign -masm=intel -o build/a.out build/prog.s');
}

main(process.argv);

And we're finally ready to run a simple program.

A program!

$ cat test.lisp
(def main () (+ 1 2))
$ node ulisp.js test.lisp
$ ./build/a.out
$ echo $?
3

Hurray! Now let's try defining and calling a second function to validate parameter behavior.

$ cat test.lisp
(def plus-two (a)
     (+ a 2))

(def main ()
     (plus-two 3))
$ node ulisp.js test.lisp
$ ./build/a.out
./compiler.js:106
    throw new Error('Attempt to call undefined function: ' + fun);
    ^

Error: Attempt to call undefined function: p2

...

We start getting some really weird errors. And the reason is because our compiler doesn't know how to deal with sibling S-expressions.

So we'll introduce a new primitive function called begin that calls all it's sibling functions and returns the value of the last call. Then we'll wrap the program in an implicit begin so we don't need to.

function compile_begin(body, destination, scope) {
  body.forEach((expression) => compile_expression(expression, 'RAX', scope));
  if (destination && destination !== 'RAX') {
    emit(1, `MOV ${destination}, RAX`);
  }
}

const primitive_functions = {
  def: compile_define,
  begin: compile_begin,
};

...

module.exports.compile = function (ast) {
  OUT = '';

  emit_prefix();
  compile_call('begin', ast, 'RAX', {});
  emit_postfix();

  return OUT;
}

And we try our test program again. :)

$ cat test.lisp
(def plus-two (a)
     (+ a 2))

(def main ()
     (plus-two 3))
$ node ulisp.js test.lisp
$ ./build/a.out
$ echo $?
5

And that's all there is to it! Stay tuned for the next post on conditionals and tail-call optimization.