Previously in compiler basics:
1. lisp to assembly
2. user-defined functions and variables
Next in compiler basics:
4. LLVM conditionals and compiling fibonacci
5. LLVM system calls
6. an x86 upgrade
In this post we'll extend the compiler to emit LLVM IR as an option instead of x86 assembly.
All source code is available on Github.
LLVM IR is a portable, human-readable, typed, assembly-like syntax that LLVM can apply optimizations on before generating assembly for the target architecture. Many language implementors choose to compile to LLVM IR specifically to avoid needing to implement sophisticated optimizations.
But the biggest reason I'm adding an LLVM backend is so that I can punt on implementing register allocation. This is the technique that allows you to generically use as many registers as possible before storing local variables on the stack. While register allocation algorithms are not that difficult, I got bored/lazy trying to implement this for ulisp. And LLVM IR provides "infinite" locals that get mapped as needed to registers and the stack -- implementing register allocation.
LLVM IR basics
In LLVM IR, all local variables must be prefixed
with %
. All global variables (including function names)
must be prefixed with @
. LLVM IR must be in
single-static
assignment
(SSA) form, which means that no variable is assigned
twice. Additionally, literals cannot be assigned to variables
directly. So we'll work around that by adding 0 to the
literal. Furthermore, we'll take advantage of
the add
, sub
, and mul
operations built into LLVM IR.
; x = 4
%x = add i32 4, 0
The type that the operation is operating on must be specified after
the operation name. In this case we are specifying
that add
is operating on and returning 32-bit integers.
While this might seem very inefficient, we'll see in the end that LLVM easily optimizes this away.
Function definition
Functions are defined at the top-level and are much simpler than x86 assembly since the details of calling conventions are handled by LLVM.
; (def plus (a b) (+ a b))
define i32 @plus (i32 a, i32 b) {
%res = add i32 a, b
ret i32 %res
}
In ulisp, all functions will return a result (and the only supported
type for now are 32-bit integers). So we annotate the definition with
this return type (i32
in define
i32
). Finally, we return inside the function with
the ret
instruction that must also specify the type
(again i32
).
Generating LLVM IR
We are going to generate LLVM IR as text. But any large project will benefit from generating LLVM IR via API.
Supporting multiple backends
The goal is to be able to switch at compile-time between generating x86 assembly or generating LLVM IR. So we'll need to reorganize ulisp a little bit.
We'll edit src/ulisp.js
to accept a second argument to
specify the backend (and from now on we'll default to LLVM).
const cp = require('child_process');
const fs = require('fs');
const { parse } = require('./parser');
const backends = require('./backend');
function main(args) {
const input = fs.readFileSync(args[2]).toString();
let backend;
switch (args[3]) {
case 'llvm':
case undefined:
backend = backends.llvm;
break;
case 'x86':
backend = backends.x86;
break;
default:
console.log('Unsupported backend ' + args[3]);
}
const [ast] = parse(input);
const program = backend.compile(ast);
try {
fs.mkdirSync('build');
} catch (e) {}
backend.build('build', program);
}
main(process.argv);
The LLVM backend
We'll add src/backend/llvm.js
and And expose
compile
and build
functions.
compile(ast)
This will work the same as it did for the x86 backend, creating a new
Compiler
helper object, creating a scope manager (which
we'll get into in more detail shortly), and generating code from the
AST wrapped in a begin
.
module.exports.compile = function(ast) {
const c = new Compiler();
const scope = new Scope();
c.compileBegin(ast, scope.symbol(), scope);
return c.getOutput();
};
build(buildDir, output)
The job of build
will be to clean up the build directory,
write any output as needed to the directory, and compile the written
output. Since we're dealing with LLVM IR, we first call
llc on the IR file to
get an assembly file. Then we can call gcc
on the
assembly to get a binary output.
const cp = require('child_process');
const fs = require('fs');
...
module.exports.build = function(buildDir, program) {
const prog = 'prog';
fs.writeFileSync(buildDir + `/${prog}.ll`, program);
cp.execSync(`llc -o ${buildDir}/${prog}.s ${buildDir}/${prog}.ll`);
cp.execSync(`gcc -o ${buildDir}/${prog} ${buildDir}/${prog}.s`);
};
Taking advantage of locals
Before we get too far into the specifics of LLVM IR code generation,
let's build out the infrastructure to take advantage of "infinite"
locals. In particular, we want a local-manager (Scope
)
with four functions:
register(local: name)
: for tracking user variables and mapping to safe namessymbol()
: for tracking internal temporary variablesget(local: name)
: for returning the safe name of a user variablecopy()
: for duplicating the local-tracker when we enter a new scope
It is important to track and map user variables into safe names so we don't accidentally conflict between variable names used by the user and names used by the compiler itself.
register(local)
When we register, we'll want to replace any unsafe characters that Lisp allows but LLVM likely won't. For now, we'll just replace any dashes in the name (since dashes are fine in variables in Lisp) with underscores. Then we'll add a number to the end of the local name until we have a safe name that doesn't exist already. Finally we return that safe name after storing the mapping.
class Scope {
constructor() {
this.locals = {};
}
register(local) {
let copy = local.replace('-', '_');
let n = 1;
while (this.locals[copy]) {
copy = local + n++;
}
this.locals[local] = copy;
return copy;
}
}
symbol()
This is a simple function that will return one new unused safe name that we can store things in.
class Scope {
...
symbol() {
const nth = Object.keys(this.locals).length + 1;
return this.register('sym' + nth);
}
...
}
We start off by making up a name based on the prefix sym
and a suffix of the current key length and pass that into the
register
function to make sure we get a safe name.
get(local)
This function is a very simple lookup to return the safe name for a user variable. It is up to the caller of this function to handle if the user variable does not exist in scope (and perhaps throw a compiler error back to the programmer).
class Scope {
...
get(local) {
return this.locals[local];
}
...
}
copy()
Finally, we want to expose a copy function so we can duplicate the local storage before entering a new scope. (A variable inside a function should not exist in scope outside the function.)
class Scope {
...
copy() {
const c = new Scope();
c.locals = { ...this.locals };
return c;
}
...
}
Back to codegen!
As mentioned in module.exports.compile
, we're going to
use a Compiler
that exposes a number of compiler helpers:
emit(depth, code)
: an internal helper for outputting indented lines of codecompileBegin(ast, destination, scope)
: compiles a begin blockcompileExpression(ast, destination, scope)
: compiles variable references, literals, and passes on function callscompileCall(functionName, ast, destination, scope)
: compiles a function callcompileDefine([functionName, parameters, ...body], destination, scope)
: compiles a function definitioncompileOp(op)
: helper function for generating code for primitive operations likeadd
getOutput()
: returns the code generated by the compiler
emit(depth, code)
Like we had in the x86 backend, this will indent the code two spaces
depth
times and write it to the buffer we track generated
code.
class Compiler {
constructor() {
this.outBuffer = [];
}
emit(depth, code) {
const indent = new Array(depth + 1).join(' ');
this.outBuffer.push(indent + code);
}
}
compileBegin(ast, destination, scope)
Our first compiler function actually does no code generation
itself. We'll call compileExpression
on each item within
the begin block. And we'll pass the destination
to the
last expression in the list so that the value of a begin block is set
to the value of its last expression. All other expressions will
receive a temporary variable to store results.
class Compiler {
...
compileBegin(body, destination, scope) {
body.forEach((expression, i) =>
this.compileExpression(
expression,
i === body.length - 1 ? destination : scope.symbol(),
scope,
),
);
}
...
}
Example:
(begin 1 2) ; returns 2
compileExpression(ast, destination, scope)
This is the most generic compile function. If the ast is a list
(representing a function call), it will pass compilation off to
compileCall
. Otherwise the only non-function call parts
of the language are variable references and numeric literals.
class Compiler {
...
compileExpression(exp, destination, scope) {
// Is a nested function call, compile it
if (Array.isArray(exp)) {
this.compileCall(exp[0], exp.slice(1), destination, scope);
return;
}
// If numeric literal, store to destination register by adding 0.
if (Number.isInteger(exp)) {
this.emit(1, `%${destination} = add i32 ${exp}, 0`);
return;
}
// If is local, store to destination register similarly.
const res = scope.get(exp);
if (res) {
this.emit(1, `%${destination} = add i32 %${res}, 0`);
} else {
throw new Error(
'Attempt to reference undefined variable or unsupported literal: ' +
exp,
);
}
...
}
Example:
1
...
a
...
(+ 1 a)
compileCall(functionName, arguments, destination, scope)
Most function calls will automatically compile arguments before
calling the function. However, certain control-flow primitives don't
do this (e.g. def
, if
, etc.). Macros in Lisp
allow you to add new control-flow primitives (even if you don't use it
to modify control-flow). But we will ignore user-defined primitives
for now.
We'll keep a list of control-flow primitives and pass off compilation to them if the function name matches a primitive. Otherwise, we'll look up the function name in scope (to find its safe name), compile the arguments, and call the function with the results of the arguments.
class Compiler {
constructor() {
this.outBuffer = [];
this.primitiveFunctions = {
def: this.compileDefine.bind(this),
begin: this.compileBegin.bind(this),
};
}
...
compileCall(fun, args, destination, scope) {
if (this.primitiveFunctions[fun]) {
this.primitiveFunctions[fun](args, destination, scope);
return;
}
const validFunction = scope.get(fun);
if (validFunction) {
const safeArgs = args
.map((a) => {
const res = scope.symbol();
this.compileExpression(a, res, scope);
return 'i32 %' + res;
})
.join(', ');
this.emit(1, `%${destination} = call i32 @${validFunction}(${safeArgs})`);
} else {
throw new Error('Attempt to call undefined function: ' + fun);
}
}
...
}
Yay LLVM for simplifying calls!
Example:
(foo 1)
...
(+ 1 2)
compileDefine([functionName, parameters, ...body], destination, scope)
This is the last undefined compile function we've used. The call signature may look funny but we write less code if we keep the primitive signatures the same. In any case, JavaScript's destructuring makes it pretty enough.
Aside from code generation, we also need to add the function itself to scope so we can look it up later in use. Additionally we need to create a copy of the current scope for the body of the function. And we'll add the parameter names themselves to the child scope.
class Compiler {
...
compileDefine([name, params, ...body], destination, scope) {
// Add this function to outer scope
const safeName = scope.register(name);
// Copy outer scope so parameter mappings aren't exposed in outer scope.
const childScope = scope.copy();
const safeParams = params.map((param) =>
// Store parameter mapped to associated local
childScope.register(param),
);
this.emit(
0,
`define i32 @${safeName}(${safeParams
.map((p) => `i32 %${p}`)
.join(', ')}) {`,
);
// Pass childScope in for reference when body is compiled.
const ret = childScope.symbol();
this.compileExpression(body[0], ret, childScope);
this.emit(1, `ret i32 %${ret}`);
this.emit(0, '}\n');
}
...
}
Example:
(def plus (a b) (+ a b))
compileOp(op)
The last function mentioned above will help us expose some useful primitives. This function will take a string builtin operation and return a function that can be used to generate code when the operation is called.
class Compiler {
...
compileOp(op) {
return ([a, b], destination, scope) => {
const arg1 = scope.symbol();
const arg2 = scope.symbol();
this.compileExpression(a, arg1, scope);
this.compileExpression(b, arg2, scope);
this.emit(1, `%${destination} = ${op} i32 %${arg1}, %${arg2}`);
};
}
...
}
This allows us to add some builtin ops as primitives (even though they aren't control-flow modifying).
class Compiler {
constructor() {
this.outBuffer = [];
this.primitiveFunctions = {
def: this.compileDefine.bind(this),
begin: this.compileBegin.bind(this),
'+': this.compileOp('add'),
'-': this.compileOp('sub'),
'*': this.compileOp('mul'),
};
}
...
}
Example:
(+ 1 2)
Hello world!
Putting it all together, we'll compile this Lisp program:
(def plus-two (a b)
(+ a (+ b 2)))
(def main ()
(plus-two 3 (plus-two 1 1)))
To get 9.
$ node src/ulisp.js tests/function_definition.lisp
$ ./build/prog
$ echo $?
9
Generated code
The generated LLVM can be found in ./build/prog.ll
:
define i32 @plus_two(i32 %a, i32 %b) {
%sym7 = add i32 %a, 0
%sym9 = add i32 %b, 0
%sym10 = add i32 2, 0
%sym8 = add i32 %sym9, %sym10
%sym6 = add i32 %sym7, %sym8
ret i32 %sym6
}
define i32 @main() {
%sym6 = add i32 3, 0
%sym8 = add i32 1, 0
%sym9 = add i32 1, 0
%sym7 = call i32 @plus_two(i32 %sym8, i32 %sym9)
%sym5 = call i32 @plus_two(i32 %sym6, i32 %sym7)
ret i32 %sym5
}
You can see all these unnecessary add, ... 0
instructions. But let's look at the x86 assembly that LLVM generates
in build/prog.s
:
...
_plus_two: ## @plus_two
.cfi_startproc
## %bb.0:
## kill: def $esi killed $esi def $rsi
## kill: def $edi killed $edi def $rdi
leal 2(%rdi,%rsi), %eax
retq
.cfi_endproc
## -- End function
...
And we see that LLVM easily optimized the inefficiencies away. :)
Next up
- Compiling conditionals
- Tail call optimization
- Lists and dynamic memory
- Strings?
- Foreign function calls?
- Self-hosting?
Adding an LLVM backend to ulisp (small Lisp compiler in JavaScript) https://t.co/VIddKW1r3N
— Phil Eaton (@phil_eaton) March 10, 2019