August 16, 2020

Writing a simple Python compiler: 1. hello, fibonacci

In this post we'll write a Python to C compiler in Python. This is especially easy to do since Python has a builtin parser library and because a number of CPython internals are exposed for extension writers.

By the end of this post, in a few hundred lines of Python, we'll be able to compile and run the following program:

$ cat tests/recursive_fib.py
def fib(n):
    if n == 0 or n == 1:
        return n

    return fib(n - 1) + fib(n - 2)


def main():
    print(fib(40))
$ python3 pyc tests/recursive_fib.py
$ ./bin/a.out
102334155

This post implements an extremely small subset of Python and completely gives up on even trying to manage memory because I cannot fathom manual reference counting. Maybe some day I'll find a way to swap in an easy GC like Boehm.

Source code for this project is available on Github.

Dependencies

We'll need Python3, GCC, libpython3, and clang-format.

On Fedora-based systems:

$ sudo dnf install gcc python3-devel clang-format python3

And on Debian-based systems:

$ sudo apt install gcc python3-dev clang-format python3

This program will likely work as well on Windows, Mac, FreeBSD, etc. but I haven't gone through the trouble of testing this (or providing custom compiler directives). Pull requests welcome!

A hand-written first-pass

Before we get into the compiler, let's write the fibonacci program by hand in C using libpython.

As described in the Python embedding guide we'll need to include libpython and initialize it in our main.c:

#define PY_SSIZE_T_CLEAN
#include <Python.h>

int main(int argc, char *argv[]) {
  Py_Initialize();

  return 0;
}

To compile against libpython, we'll use python3-config installed as part of python3-devel to tell us what should be linked at each step during compilation.

$ gcc -c -o main.o $(python3-config --cflags) main.c
$ gcc $(python3-config --ldflags) main.o
$ ./a.out; echo $?
0

Cool! Now as we think about translating the fibonacci implementation, we want to keep everything as Python objects for as long as possible. This means passing and receiving PyObject* to and from all functions, and converting all C integers to PyLong*, a "subtype" of PyObject*. You can imagine that everything in Python is an object until you operate on it.

For more information on objects in Python, check out the Data model page in Python docs.

To map a C integer to a PyLong* we use PyLong_FromLong. To map in reverse, we use PyLong_AsLong.

To compare two PyObject*s we can use PyObject_RichCompareBool which will handle the comparison regardless of the type of the two parameters. Without this helper we'd have to write complex checks to make sure that the two sides are the same and if they are, unwrap them into their underlying C value and compare the C value.

We can use PyNumber_Add and PyNumber_Subtract for basic arithmetic, and there are many similar helpers available to us for operations down the line.

Now we can write a translation:

#define PY_SSIZE_T_CLEAN
#include <Python.h>

PyObject* fib(PyObject* n) {
  PyObject* zero = PyLong_FromLong(0);
  PyObject* one = PyLong_FromLong(1);
  if (PyObject_RichCompareBool(n, zero, Py_EQ) || PyObject_RichCompareBool(n, one, Py_EQ)) {
    return n;
  }

  PyObject* left = fib(PyNumber_Subtract(n, one));

  PyObject* two = PyLong_FromLong(2);
  PyObject* right = fib(PyNumber_Subtract(n, two));

  return PyNumber_Add(left, right);
}

int main(int argc, char *argv[]) {
  Py_Initialize();

  PyObject* res = fib(PyLong_FromLong(7)); // Should be 13

  return PyLong_AsLong(res);
}

Compile and run it:

$ gcc -c -o main.o $(python3-config --cflags) main.c
$ gcc $(python3-config --ldflags) main.o
$ ./a.out; echo $?
13

That's great! But we cheated in one place. We assumed that the input to the fib function was an integer, and we propagated that assumption everywhere we wrote PyNumber_* operations. When we write the compiler, we'll need to check that both arguments are an integer before we call a numeric helper, otherwise we may need to call a string concatenation helper or something else entirely.

Compiler Architecture

We'll break the code into four major parts:

  1. libpyc.c: helper functions for generated code
  2. pyc/context.py: utilities for scope and writing code in memory
  3. pyc/codegen.py: for generating C code from a Python AST
  4. pyc/__main__.py: the entrypoint

When I'm writing a new compiler using an existing parser I almost always start with the entrypoint and code generator so I can explore the AST. However, it's easiest to explain the code if we start with the utilities first.

We'll also want an empty pyc/__init__.py.

libpyc.c

This C file will contain three helper functions for safely adding, subtracting, and printing. It will be concatenated to the top of the generated C file. We'll only support integers for now but this structure sets us up for supporting more types later on.

We'll use PyLong_Check before calling number-specific methods.

#define PY_SSIZE_T_CLEAN
#include <Python.h>

inline PyObject* PYC_Add(PyObject* l, PyObject* r) {
  // TODO: allow __add__ override

  // Includes ints and bools
  if (PyLong_Check(l) && PyLong_Check(r)) {
    return PyNumber_Add(l, r);
  }

  // TODO: handle str, etc.

  // TODO: throw exception
  return NULL;
}

inline PyObject* PYC_Sub(PyObject* l, PyObject* r) {
  // TODO: allow __add__ override

  // Includes ints and bools
  if (PyLong_Check(l) && PyLong_Check(r)) {
    return PyNumber_Subtract(l, r);
  }

  // TODO: handle str, etc.

  // TODO: throw exception
  return NULL;
}

inline PyObject* PYC_Print(PyObject* o) {
  PyObject_Print(o, stdout, Py_PRINT_RAW);
  printf("\n");
  return Py_None;
}

That's it! We could generate these as strings in Python but it gets hairy to do so. By using a dedicated C file, we can take advantage of syntax highlighting since this file is only C code. And since we've marked all functions as inline, there's no runtime cost to using not embedding these as strings in Python.

pyc/context.py

This file will contain a Context class for managing identifiers in scope and for proxying to a Writer class that contains helpers for writing lines of C code.

We'll have two instances of the Writer class in Context so that we can write to a body (or current/primary) region and an initialization region.

The initialization region is necessary in case there are any variables declared at the top-level. We can't initialize these variables in C outside of a function since every PyObject* must be created after calling Py_Initialize. This section will be written into our C main function before we enter a compiled Python main function.

import copy


class Writer():
    content = ""

    def write(self, exp: str, indent: int = 0):
        self.content += ("  " * indent) + exp

    def writeln(self, stmt: str, indent: int = 0):
        self.write(stmt + "\n", indent)

    def write_statement(self, stmt: str, indent: int = 0):
        self.writeln(stmt + ";", indent)


class Context():
    initializations = Writer()
    body = Writer()
    indentation = 0

    scope = 0
    ret = None
    namings = {}
    counter = -1

    def __getattr__(self, name: str) -> object:
        # Helpers to avoid passing in self.indentation every time
        outputs = [initializations", "body"]
        for output in outputs:
            if name.startswith(output):
                return lambda s, i=None: getattr(getattr(self, output), name[len(output)+1:])(s, i if i is not None else self.indentation)

        return object.__getattr__(self, name)

    def get_local(self, source_name: str) -> dict:
        return self.namings[source_name]

    def register_global(self, name: str, loc: str):
        self.namings[name] = {
            "name": loc,
            "scope": 0,
        }

    def register_local(self, local: str = "tmp") -> str:
        self.counter += 1
        self.namings[local] = {
            "name": f"{local}_{self.counter}",
            # naming dictionary is copied, so we need to capture scope
            # at declaration
            "scope": self.scope,
        }
        return self.namings[local]["name"]

    def copy(self):
        new = copy.copy(self)
        # For some reason copy.deepcopy doesn't do this
        new.namings = dict(new.namings)
        return new

    def at_toplevel(self):
        return self.scope == 0

This is all pretty boring boilerplate. Let's move on.

pyc/main.py

The entrypoint is responsible for reading source code, parsing it, calling the code generator, writing the source code to a C file, and compiling it.

First, we read and parse the source code:

import ast
import os
import subprocess
import shutil
import sys

from context import Context
from codegen import generate

BUILTINS = {
    "print": "PYC_Print",
}


def main():
    target = sys.argv[1]
    with open(target) as f:
        source = f.read()
    tree = ast.parse(source, target)

Then we write libpyc.c into the body, register builtins, and run code generation:

...

def main()
    ...

    ctx = Context()
    with open("libpyc.c") as f:
        ctx.body_write(f.read() + "\n")

    for builtin, fn in BUILTINS.items():
        ctx.register_global(builtin, fn)

    generate(ctx, tree)

Next, we create a clean output directory and write main.c with the generated code and a main function to initialization Python and any global variables:

...

def main():
   ...

    # Create and move to working directory
    outdir = "bin"
    shutil.rmtree(outdir, ignore_errors=True)
    os.mkdir(outdir)
    os.chdir(outdir)

    with open("main.c", "w") as f:
        f.write(ctx.body.content)

        main = ctx.namings.get("main")["name"]
        f.write(f"""int main(int argc, char *argv[]) {{
  Py_Initialize();

  // Initialize globals, if any.
{ctx.initializations.content}
  PyObject* r = {main}();
  return PyLong_AsLong(r);
}}""")

Finally, we run clang-format and gcc against the generated C code:

...

def main():
    ...

    subprocess.run(["clang-format", "-i", "main.c"])

    cflags_raw = subprocess.check_output(["python3-config", "--cflags"])
    cflags = [f.strip() for f in cflags_raw.decode().split(" ") if f.strip()]
    cmd = ["gcc", "-c", "-o", "main.o"] + cflags + ["main.c"]
    subprocess.run(cmd)

    ldflags_raw = subprocess.check_output(["python3-config", "--ldflags"])
    ldflags = [f.strip() for f in ldflags_raw.decode().split(" ") if f.strip()]
    cmd = ["gcc"] + ldflags + ["main.o"]
    subprocess.run(cmd)

All together:

import ast
import os
import subprocess
import shutil
import sys

from context import Context
from codegen import generate

BUILTINS = {
    "print": "PYC_Print",
}


def main():
    target = sys.argv[1]
    with open(target) as f:
        source = f.read()
    tree = ast.parse(source, target)

    ctx = Context()
    with open("libpyc.c") as f:
        ctx.body_write(f.read() + "\n")

    for builtin, fn in BUILTINS.items():
        ctx.register_global(builtin, fn)

    generate(ctx, tree)

    # Create and move to working directory
    outdir = "bin"
    shutil.rmtree(outdir, ignore_errors=True)
    os.mkdir(outdir)
    os.chdir(outdir)

    with open("main.c", "w") as f:
        f.write(ctx.body.content)

        main = ctx.namings.get("main")["name"]
        f.write(f"""int main(int argc, char *argv[]) {{
  Py_Initialize();

  // Initialize globals, if any.
{ctx.initializations.content}
  PyObject* r = {main}();
  return PyLong_AsLong(r);
}}""")

    subprocess.run(["clang-format", "-i", "main.c"])

    cflags_raw = subprocess.check_output(["python3-config", "--cflags"])
    cflags = [f.strip() for f in cflags_raw.decode().split(" ") if f.strip()]
    cmd = ["gcc", "-c", "-o", "main.o"] + cflags + ["main.c"]
    subprocess.run(cmd)

    ldflags_raw = subprocess.check_output(["python3-config", "--ldflags"])
    ldflags = [f.strip() for f in ldflags_raw.decode().split(" ") if f.strip()]
    cmd = ["gcc"] + ldflags + ["main.o"]
    subprocess.run(cmd)


main()

Done!

pyc/codegen.py

Lastly we write the translation layer from Python AST to C. We'll break this out into 10 helper functions. It is helpful to have the AST spec for reference.

1/10: generate

The entrypoint of the code generator is generate(ctx: Context, exp). It generates code for any object with a body attribute storing a list of statements. This function will generate code for objects like modules, function bodies, if bodies, etc.

The statements we'll support to begin are:

  • ast.Assign
  • ast.FunctionDef
  • ast.Return
  • ast.If
  • and ast.Expr

For each statement, we'll simply pass on generation to an associated helper function. In the case of expression generation though, we'll also add a noop operation on the result of the expression otherwise the compiler will complain about an unused variable.

def generate(ctx: Context, module):
    for stmt in module.body:
        if isinstance(stmt, ast.Assign):
            generate_assign(ctx, stmt)
        elif isinstance(stmt, ast.FunctionDef):
            generate_function_def(ctx, stmt)
        elif isinstance(stmt, ast.Return):
            generate_return(ctx, stmt)
        elif isinstance(stmt, ast.If):
            generate_if(ctx, stmt)
        elif isinstance(stmt, ast.Expr):
            r = generate_expression(ctx, stmt.value)
            ctx.body_writeln("// noop to hide unused warning")
            ctx.body_write_statement(f"{r} += 0")
        else:
            raise Exception(f"Unsupported statement type: {type(stmt)}")

Remember to throw exceptions aggressively otherwise you'll have a bad time debugging programs using new syntax.

Let's dig into these helpers.

2/10: generate_assign

To generate assignment code, we need to check if we're at the top-level or not. If we're at the top-level we can declare the variable but we can't initialize it yet. So we add the initialization code to the initialization section of the program.

If we're not at the top-level, we can declare and assign in one statement.

Before doing either though, we register the variable name so we can get a safe local name to use in generated code. Then we compile the right-hand side so we can assign it to the left-hand side.

import ast

from context import Context


def initialize_variable(ctx: Context, name: str, val: str):
    if ctx.at_toplevel():
        decl = f"PyObject* {name}"
        ctx.body_write_statement(decl, 0)

        init = f"{name} = {val}"
        ctx.initializations_write_statement(init)
    else:
        ctx.body_write_statement(f"PyObject* {name} = {val}")


def generate_assign(ctx: Context, stmt: ast.Assign):
    # TODO: support assigning to a tuple
    local = ctx.register_local(stmt.targets[0].id)
    val = generate_expression(ctx, stmt.value)
    initialize_variable(ctx, local, val)

We're going to need to implement generate_expression to make this work.

3/10: generate_expression

Just like for statements in generate, there are a few kinds of expressions we need to implement:

  • ast.Num
  • ast.BinOp
  • ast.BoolOp
  • ast.Name
  • ast.Compare
  • and ast.Call

For ast.Num, we just need to wrap the literal number as a PyLong*. And for ast.Name we just need to look up the local name in context. Otherwise we delegate to more helper functions.

def generate_expression(ctx: Context, exp) -> str:
    if isinstance(exp, ast.Num):
        # TODO: deal with non-integers
        tmp = ctx.register_local("num")
        initialize_variable(ctx, tmp, f"PyLong_FromLong({exp.n})")
        return tmp
    elif isinstance(exp, ast.BinOp):
        return generate_bin_op(ctx, exp)
    elif isinstance(exp, ast.BoolOp):
        return generate_bool_op(ctx, exp)
    elif isinstance(exp, ast.Name):
        return ctx.get_local(exp.id)["name"]
    elif isinstance(exp, ast.Compare):
        return generate_compare(ctx, exp)
    elif isinstance(exp, ast.Call):
        return generate_call(ctx, exp)

    raise Exception(f"Unsupported expression: {type(exp)}")

For every code generation helper that is an expression, we store the expression in a local variable and return the variable's name so that parent nodes in the AST can refer to the child. This can result in inefficient code generation (useless assignment) but that's not really a big deal for a project like this and will likely be optimized away by GCC anyway. The more annoying aspect is that useless assignment just makes the generated code harder to read.

4/10: generate_bin_op

For binary operators we need to support addition and subtraction. Other binary operators like equality or and/or are represented in ast.Compare and ast.BoolOp.

This is easy to write because we already prepared helpers in libpyc.c: PYC_Sub and PYC_Add.

def generate_bin_op(ctx: Context, binop: ast.BinOp) -> str:
    result = ctx.register_local("binop")

    l = generate_expression(ctx, binop.left)
    r = generate_expression(ctx, binop.right)

    if isinstance(binop.op, ast.Add):
        ctx.body_write_statement(f"PyObject* {result} = PYC_Add({l}, {r})")
    elif isinstance(binop.op, ast.Sub):
        ctx.body_write_statement(f"PyObject* {result} = PYC_Sub({l}, {r})")
    else:
        raise Exception(f"Unsupported binary operator: {type(binop.op)}")

    return result

Easy enough.

5/10: generate_bool_op

We only need to support or for the fibonacci program, but or in Python is more complicated than in C. In Python, the first value to be truthy short-circuits the expression and the result is its value, not True.

We'll use goto to short-circuit and we'll use PyObject_IsTrue to do the truthy check:

def generate_bool_op(ctx: Context, boolop: ast.BoolOp) -> str:
    result = ctx.register_local("boolop")
    ctx.body_write_statement(f"PyObject* {result}")

    if isinstance(boolop.op, ast.Or):
        done_or = ctx.register_local("done_or")

        for exp in boolop.values:
            v = generate_expression(ctx, exp)
            ctx.body_write_statement(f"{result} = {v}")
            ctx.body_writeln(f"if (PyObject_IsTrue({v})) {{")
            ctx.body_write_statement(f"goto {done_or}", ctx.indentation+1)
            ctx.body_writeln("}")

        ctx.body_writeln(f"{done_or}:\n", 0)

    return result

Now that I write this down I see we could probably move this function into libpyc.c if we used a loop. Maybe in the next iteration.

We move on.

6/10: generate_compare

This function handles equality and inequality checks. We'll adapt the PyObject_RichCompareBool helper we used in the hand-written translation.

The only additional thing to keep in mind is that the right-hand side is passed as an array. So we have to iterate through it and apply the equality/inequality check on all objects in the list.

def generate_compare(ctx: Context, exp: ast.Compare) -> str:
    result = ctx.register_local("compare")
    left = generate_expression(ctx, exp.left)
    ctx.body_write_statement(f"PyObject* {result} = {left}")

    for i, op in enumerate(exp.ops):
        v = generate_expression(ctx, exp.comparators[i])

        if isinstance(op, ast.Eq):
            ctx.body_write_statement(f"{result} = PyObject_RichCompare({result}, {v}, Py_EQ)")
        elif isinstance(op, ast.NotEq):
            ctx.body_write_statement(f"{result} = PyObject_RichCompare({result}, {v}, Py_NE)")
        else:
            raise Exception(f"Unsupported comparison: {type(op)}")

    return result

7/10: generate_call

The last expression is simple enough. We compile the call's arguments first, then the function itself, then we call the function with the arguments like any C function. Calling the C function directly will have ramifications for interacting with Python libraries (basically, we won't be able to interact with any) but it's the easiest way to get started.

def generate_call(ctx: Context, exp: ast.Call) -> str:
    args = ', '.join([generate_expression(ctx, a) for a in exp.args])
    fun = generate_expression(ctx, exp.func)
    res = ctx.register_local("call_result")

    # TODO: lambdas and closures need additional work
    ctx.body_write_statement(
        f"PyObject* {res} = {fun}({args})")
    return res

And that's it for expressions! Just a few more statement helpers to support.

8/10: generate_function_def

This is a fun one. First we register the function name in scope. Then we copy the context so variables within the function body are contained within the function body. We increment scope so we know we've left the top-level. Finally, we compile the body.

def generate_function_def(ctx: Context, fd: ast.FunctionDef):
    name = ctx.register_local(fd.name)

    childCtx = ctx.copy()
    args = ", ".join([f"PyObject* {childCtx.register_local(a.arg)}" for a in fd.args.args])
    ctx.body_writeln(f"PyObject* {name}({args}) {{", 0)

    childCtx.scope += 1
    childCtx.indentation += 1
    generate(childCtx, fd)

    if not childCtx.ret:
        childCtx.body_write_statement("return Py_None")

    ctx.body_writeln("}\n", 0)

The check for childCtx.ret isn't strictly necessary because we could just emit a return even if there already was one. Asking generate_return to set this attribute and having generate_function_def check it just makes the generate code a little prettier.

9/10: generate_return

Very straightforward, we just compile the value to be returned and then we emit a return statement.

We store the return value so that the function definition can know whether to add a return PyNone statement.

def generate_return(ctx: Context, r: ast.Return):
    ctx.ret = generate_expression(ctx, r.value)
    ctx.body_writeln("")
    ctx.body_write_statement(f"return {ctx.ret}")

And we've got one last statement to support!

10/10: generate_if

You know the deal: compile the test and if the test is truthy, enter the compiled body. We'll deal with the else body another time.

def generate_if(ctx: Context, exp: ast.If):
    test = generate_expression(ctx, exp.test)
    ctx.body_writeln(f"if (PyObject_IsTrue({test})) {{")
    ctx.indentation += 1
    generate(ctx, exp)
    # TODO: handle exp.orelse
    ctx.indentation -= 1
    ctx.body_writeln("}\n")

And we're done the compiler!

Trying it out

As promised:

$ cat tests/recursive_fib.py
def fib(n):
    if n == 0 or n == 1:
        return n

    return fib(n - 1) + fib(n - 2)


def main():
    print(fib(40))
$ python3 pyc tests/recursive_fib.py
$ ./bin/a.out
102334155

Microbenchmarking, or making compiler Twitter unhappy

Keep in mind this implementation does a small fraction of what CPython is doing.

If you time the generated code:

$ python3 pyc tests/recursive_fib.py
$ time ./bin/a.out
102334155
./bin/a.out  18.69s user 0.03s system 99% cpu 18.854 total

And CPython (with main() append to the source):

time python3 tests/recursive_fib.py
102334155
python3 tests/recursive_fib.py  76.24s user 0.11s system 99% cpu 1:16.81 total

The only reason I mention this is because when I did a similar compiler project for JavaScript targeting C++/libV8, the generated code was about the same or a little slower in speed.

I haven't gotten that much better at writing these compilers.