January 10, 2018

First few hurdles writing a Scheme interpreter

I started working on BSDScheme last October, inspired to get back into language implementation after my coworker built bshift, a compiler for a C-like language. BSDScheme is an interpreter for a (currently small subset of) Scheme written in D. It implements a few substantial primitive functions (in under 1000 LoC!). It uses the same test framework bshift uses, btest. I'm going to expand here on some notes I wrote in a post on Reddit on some issues I faced during these first few months developing BSDSCheme.

Before I get too far, here is a simple exponent function running in BSDScheme. It demonstates a few of the basic builtin primitives and also integers being upgraded to D's std.bigint when an integer operation produces an integer unable to fit in 64 bits. (See the times and plus guards for details; see the examples directory for other examples.)

$ cat examples/recursion.scm
(define (exp base pow)
  (if (= pow 0)
      (* base (exp base (- pow 1)))))

(display (exp 2 64))
$ ./bin/bsdscheme examples/exp.scm

The first big correction I made was to the way values are represented in memory. I originally implemented BSDScheme's value representation as a struct with a pointer to each possible value type. This design was simple to begin with but space-inefficient. I modelled a redesign after the Chicken Scheme data representation. It uses a struct with two fields, header and data. Both fields are word-size integers (currently hard-coded as 64 bits). The header stores type and length information and the data stores data.

In this representation, simple types (integers < 2^63, booleans, characters, etc.) take up only 128 bits. The integers, booleans, etc. are placed directly into the 64 bit data field. Other types (larger integers, strings, functions, etc) use the data field to store a pointer to memory allocated in the heap. Getting the conversion of these complex types right was the trickiest part of this data representation effort... lots of void-pointer conversions.

The next big fix I made was to simplify the way generic functions dealt with their arguments. Originally I passed each function its arguments un-evaluated and left it up to each function to evaluate its arguments before operating on them. While there was nothing intrinsically wrong with this method, it was overly complicated and bug-prone. I refactored the builtin functions into two groups: normal functions and special functions. Normal function arguments are evaluated before sending the arguments S-expression to the function. Special functions receive the arguments S-expression verbatim so they can decide what / when to evaluate.

The last issue I'll talk about in this post was dealing with the AST representation. When I started out, the easiest way to get things working was to have an AST representation completely separate from the representation of BSDScheme values. This won't get you far in Scheme. In order to (eventually) support macros (and in the meantime support eval), the AST representation would have to make use of the value representation. This was the most complicated and confusing issue so far in BSDScheme. With the switch to recursive data structures, it was hard to know if an error occurred because I parsed incorrectly, or recursed over what I parsed incorrectly, or even if I was printing out what I parsed incorrectly. After some embarrassing pain, I got all the pieces in place after a month and it set me up to easily support converting my original interpret function into a generic eval function that I could expose to the language like any other special function.

One frustrating side-effect of this AST conversion is that since the parsing stage builds out trees using the internal value representation, the parsing stage is tied to the interpreter. From what I can tell, this basically means I have to revert back to some intermediate AST representation or throw away the parser to support a compiler backend.

Next steps in BSDScheme include converting all the examples into tests, combining the needlessly split out lexing and parsing stage into a single read function that can be exposed into the language, fleshing out R7RS library support, and looking more into LLVM as a backend.