Previously in compiler basics:
1. lisp to assembly
Next in compiler basics:
3. LLVM
4. LLVM conditionals and compiling fibonacci
5. LLVM system calls
6. an x86 upgrade
In this post we'll extend the compiler to support defining functions
and variables. Additionally, we'll require the program's entrypoint to
be within a main
function.
The resulting code can be found here.
Function definition
The simplest function definition we need to support is for our main
function. This will look like this:
$ cat basic.lisp
(def main ()
(+ 1 2))
Where compiling and running it should produce a return code of 3:
$ node ulisp.js basic.lisp
$ ./build/a.out
$ echo $?
3
Parsing function definitions
The entire language is defined in S-expressions and we already parse S-expressions.
$ node
> const { parse } = require('./parser');
> JSON.stringify(parse('(def main () (+ 1 2))'));
'[[["def","main",[],["+",1,2]]],""]'
So we're done!
Code generation
There are two tricky parts to code generation once function definitions are introduced:
- Functions definitions are not expressions (in assembly)
- Function calling conventions for the callee
- Variable scope
Function definitions
A function definition looks like a function call. So we'll need to keep a list of "primitive" functions that handle what looks like function calls differently.
function compile_define() {
// TODO
}
const primitive_functions = {
def: compile_define,
};
Then in our compile_call
function we need to see if the function
being "called" is in this list. If so, we allow the associated
callback to handle compilation.
function compile_call(fun, args, destination) {
if (primitive_functions[fun]) {
primitive_functions[fun](args, destination);
return;
}
// Save param registers
args.map((_, i) => emit(1, `PUSH ${PARAM_REGISTERS[i]}`));
// Compile registers and store as params
args.map((arg, i) => compile_expression(arg, PARAM_REGISTERS[i], scope));
emit(1, `CALL ${BUILTIN_FUNCTIONS[fun] || scope[fun]}`);
// Restore param registers
args.map((_, i) => emit(1, `POP ${PARAM_REGISTERS[args.length - i - 1]}`));
if (destination && destination !== 'RAX') {
emit(1, `MOV ${destination}, RAX`);
}
}
Now we can begin thinking about compile_define
. It takes args
which will be a list of three elements containing the function's:
- name
- parameters
- and body
It does not use destination because we're treating function definitions as statements for now and not as expressions. If we were treating it as an expression, we might store the address of the function in the destination register. We keep destination around to keep the primitive function signatures consistent.
Based on how we called functions before and how we defined the
hard-coded add
function, we know what a function definition in
assembly generally looks like. And we know the arguments to the
function when called will be in RDI, RSI, and RDX.
function compile_define([name, parameters, body]) {
// Function name becomes a label we can CALL
emit(0, `${name}:`);
// Something to do with RDI, RSI, RDX and the parameters variable?
// We renamed compile_argument to compile_expression to be more general
compile_expression(body[0], 'RAX');
// Maybe some cleanup to do with RDI, RSI, RDX?
emit(1, 'RET\n');
}
Not a bad first sketch. But how do we match up RDI
, RSI
, RDX
and
the user-defined parameters
variable names? For example in the
following:
(def plus-two (a)
(+ a 2))
It's clear to us that a
must match up to RDI
. In order to do this
we need to track all variables in a scope
dictionary mapping the
variable name to the register where it's stored.
Additionally, keeping track of scope can help us fail quickly in the following scenario:
(def plus-two (a)
(+ b 2))
The variable b
is used but never defined. It has not been added to
the scope dictionary. So our compiler can fail quickly saying there is
an undefined variable being referenced.
Taking this a step further, what if we want to catch the following too:
(def plus-two (a)
(plus a 2))
We're trying to call plus
but it has not been defined. We should be
able to fail quickly here too. But that means we're need to track the
scope of function names in addition to variables. We'll choose to
track function names and variable names in the same scope dictionary.
This is the distinction between a lisp-1 and a lisp-2. We are a lisp-1 like Scheme because we have a single scope. Common Lisp is a lisp-2 because it stores function name scope separately from variable name scope.
Implementing scope
We need to revise every compile function to accept a scope dictionary
(specifically: compile
, compile_expression
, compile_call
, and
compile_define
). If a variable is referenced, we need to look up
it's location in the scope dictionary. If a variable is defined
(e.g. a function name or a function parameter) we need to add a
mapping to the scope dictionary.
Modifying compile_expression
is easiest:
function compile_expression(arg, destination, scope) {
// Is a nested function call, compile it
if (Array.isArray(arg)) {
compile_call(arg[0], arg.slice(1), destination, scope);
return;
}
if (scope[arg] || Number.isInteger(arg)) {
emit(1, `MOV ${destination}, ${scope[arg] || arg}`);
} else {
throw new Error('Attempt to reference undefined variable or unsupported literal: ' + arg);
}
}
Next we modify compile_call
:
function compile_call(fun, args, destination, scope) {
if (primitive_functions[fun]) {
primitive_functions[fun](args, destination, scope);
return;
}
// Save param registers
args.map((_, i) => emit(1, `PUSH ${PARAM_REGISTERS[i]}`));
// Compile registers and store as params
args.map((arg, i) => compile_expression(arg, PARAM_REGISTERS[i], scope));
const validFunction = BUILTIN_FUNCTIONS[fun] || scope[fun];
if (validFunction) {
emit(1, `CALL ${validFunction}`);
} else {
throw new Error('Attempt to call undefined function: ' + fun);
}
// Restore param registers
args.map((_, i) => emit(1, `POP ${PARAM_REGISTERS[args.length - i - 1]}`));
if (destination && destination !== 'RAX') {
emit(1, `MOV ${destination}, RAX`);
}
}
And then compile_define
where we modify scope for the first time:
function compile_define([name, params, ...body], destination, scope) {
// Add this function to outer scope
scope[name] = name.replace('-', '_');
// Copy outer scope so parameter mappings aren't exposed in outer scope.
const childScope = { ...scope };
emit(0, `${scope[name]}:`);
params.forEach(function (param, i) {
const register = PARAM_REGISTERS[i];
// Store parameter mapped to associated register
childScope[param] = register;
});
// Pass childScope in for reference when body is compiled.
compile_expression(body[0], 'RAX', childScope);
emit(1, 'RET\n');
}
And finally we need to modify the entrypoint compile
:
module.exports.compile = function (ast) {
emit_prefix();
// Pass in new, empty scope mapping
compile_call(ast[0], ast.slice(1), 'RAX', {});
emit_postfix();
}
And scope-wise we're pretty good!
Function calling convention: callee
We currently have a problem that we're using parameters registers to store local variables that messes up with how we are storing parameters for function calls within the function itself.
Ideally we could store function local variables (including the parameters when we get them) separately from how we store function call parameters within the function.
Thankfully according to the calling convention we've followed, we're
given a set of registers that are callee-preserved. Of them we'll use
RBX
, RBP
, and R12
in that order. This allows us to mess with so
long as we store them and restore them within the function.
Applying the same storing/restoring strategy to local variables as we did for parameters, we get:
const LOCAL_REGISTERS = [
'RBX',
'RBP',
'R12',
];
function compile_define([name, params, ...body], destination, scope) {
// Add this function to outer scope
scope[name] = name.replace('-', '_');
// Copy outer scope so parameter mappings aren't exposed in outer scope.
const childScope = { ...scope };
emit(0, `${scope[name]}:`);
params.forEach(function (param, i) {
const register = PARAM_REGISTERS[i];
const local = LOCAL_REGISTERS[i];
emit(1, `PUSH ${local}`);
emit(1, `MOV ${local}, ${register}`);
// Store parameter mapped to associated local
childScope[param] = local;
});
// Pass childScope in for reference when body is compiled.
compile_expression(body[0], 'RAX', childScope);
params.forEach(function (param, i) {
// Backwards first
const local = LOCAL_REGISTERS[params.length - i - 1];
emit(1, `POP ${local}`);
});
emit(1, 'RET\n');
}
And we're set.
Cleanup
We've still got a few messes going on:
- emit_prefix wraps out entire body in
_main
, we're requiring our ownmain
now - emitting to stdout instead of to a file
- multiple function definitions is treated as nonsense
Starting first, we rewrite emit_prefix
and emit_postfix
so that
our _main
just calls main
.
function emit_prefix() {
emit(1, '.global _main\n');
emit(1, '.text\n');
emit(0, 'plus:');
emit(1, 'ADD RDI, RSI');
emit(1, 'MOV RAX, RDI');
emit(1, 'RET\n');
}
function emit_postfix() {
emit(0, '_main:');
emit(1, 'CALL main');
emit(1, 'MOV RDI, RAX'); // Set exit arg
emit(1, `MOV RAX, ${SYSCALL_MAP['exit']}`);
emit(1, 'SYSCALL');
}
Next to deal with writing to a file instead of stdout, we need our
emit
function to write to a buffer. We'll let ulisp.js
write that
buffer to a file. Because we're incredibly lazy, we'll do this all
globally.
let OUT = '';
function emit(depth, args) {
const indent = new Array(depth + 1).join(' ');
OUT += `${indent}${args}\n`;
}
...
module.exports.compile = function (ast) {
OUT = '';
emit_prefix();
compile_call(ast[0], ast.slice(1), 'RAX', {});
emit_postfix();
return OUT;
}
And modify ulisp.js
:
const cp = require('child_process');
const fs = require('fs');
const { parse } = require('./parser');
const { compile } = require('./compiler');
function main (args) {
const input = fs.readFileSync(args[2]).toString();
const [ast] = parse(input);
const program = compile(ast);
try {
fs.mkdirSync('build');
} catch (e) {}
fs.writeFileSync('build/prog.s', program);
cp.execSync('gcc -mstackrealign -masm=intel -o build/a.out build/prog.s');
}
main(process.argv);
And we're finally ready to run a simple program.
A program!
$ cat test.lisp
(def main () (+ 1 2))
$ node ulisp.js test.lisp
$ ./build/a.out
$ echo $?
3
Hurray! Now let's try defining and calling a second function to validate parameter behavior.
$ cat test.lisp
(def plus-two (a)
(+ a 2))
(def main ()
(plus-two 3))
$ node ulisp.js test.lisp
$ ./build/a.out
./compiler.js:106
throw new Error('Attempt to call undefined function: ' + fun);
^
Error: Attempt to call undefined function: p2
...
We start getting some really weird errors. And the reason is because our compiler doesn't know how to deal with sibling S-expressions.
So we'll introduce a new primitive function called begin
that calls
all it's sibling functions and returns the value of the last
call. Then we'll wrap the program in an implicit begin
so we don't
need to.
function compile_begin(body, destination, scope) {
body.forEach((expression) => compile_expression(expression, 'RAX', scope));
if (destination && destination !== 'RAX') {
emit(1, `MOV ${destination}, RAX`);
}
}
const primitive_functions = {
def: compile_define,
begin: compile_begin,
};
...
module.exports.compile = function (ast) {
OUT = '';
emit_prefix();
compile_call('begin', ast, 'RAX', {});
emit_postfix();
return OUT;
}
And we try our test program again. :)
$ cat test.lisp
(def plus-two (a)
(+ a 2))
(def main ()
(plus-two 3))
$ node ulisp.js test.lisp
$ ./build/a.out
$ echo $?
5
And that's all there is to it! Stay tuned for the next post on conditionals and tail-call optimization.
Part two on compiler basics using JavaScript: user-defined functions and variables https://t.co/XOam67HO8h
— Phil Eaton (@phil_eaton) January 20, 2019