December 28, 2021

Writing a minimal Lua implementation with a virtual machine from scratch in Rust

By the end of this guide we'll have a minimal, working implementation of a small part of Lua from scratch. It will be able to run the following program (among others):

function fib(n)
   if n < 2 then
      return n;
   end

   local n1 = fib(n-1);
   local n2 = fib(n-2);
   return n1 + n2;
end

print(fib(30));

This is my second project in Rust and only the third time I've invented an instruction set, so don't take my style as gospel. However, I have found some Rust parsing tutorials overly complex, so I'm hoping you'll find this one simpler.

All source code is available on GitHub.

Entrypoint

Running cargo init will generate the necessary boilerplate. In src/main.rs we'll accept a file name from the command line, perform lexical analysis to retrieve all tokens from the file, perform grammar analysis on the tokens to retrieve a tree structure, compile the tree to a linear set of virtual machine instructions, and finally interpret the virtual machine instructions.

mod eval;
mod lex;
mod parse;

use std::env;
use std::fs;

fn main() {
    let args: Vec<String> = env::args().collect();
    let contents = fs::read_to_string(&args[1]).expect("Could not read file");

    let raw: Vec<char> = contents.chars().collect();

    let tokens = match lex::lex(&raw) {
        Ok(tokens) => tokens,
        Err(msg) => panic!("{}", msg),
    };

    let ast = match parse::parse(&raw, tokens) {
        Ok(ast) => ast,
        Err(msg) => panic!("{}", msg),
    };

    let pgrm = eval::compile(&raw, ast);

    eval::eval(pgrm);
}
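
Once the remaining modules exist, running a program looks like this (assuming the fib example above is saved to test/fib.lua):

$ cargo run -- test/fib.lua
832040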

Easy peasy. Now let's implement lex.

Lexical analysis

Lexical analysis drops whitespace (Lua is not whitespace sensitive) and chunks all source code characters into their smallest possible meaningful pieces like commas, numbers, identifiers, keywords, etc.

In order to have useful error messages, we'll keep track of our position in the file with a Location struct that implements increment and debug.

This goes in src/lex.rs.

#[derive(Copy, Clone, Debug)]
pub struct Location {
    col: i32,
    line: i32,
    index: usize,
}

The increment function will update line and column numbers as well as the current index in the file.

impl Location {
    fn increment(&self, newline: bool) -> Location {
        if newline {
            Location {
                index: self.index + 1,
                col: 0,
                line: self.line + 1,
            }
        } else {
            Location {
                index: self.index + 1,
                col: self.col + 1,
                line: self.line,
            }
        }
    }

And the debug function will dump the current line with a pointer in text to the current column along with a message.

    pub fn debug<S: Into<String>>(&self, raw: &[char], msg: S) -> String {
        let mut line = 0;
        let mut line_str = String::new();
        // Find the whole line of original source
        for c in raw {
            if *c == '\n' {
                line += 1;

                // Done discovering line in question
                if !line_str.is_empty() {
                    break;
                }

                continue;
            }

            if self.line == line {
                line_str.push_str(&c.to_string());
            }
        }

        let space = " ".repeat(self.col as usize);
        format!("{}\n\n{}\n{}^ Near here", msg.into(), line_str, space)
    }
}
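
For example, if the parser later fails on the line local n1 = fib(n-1); with the location pointing at the call, the output would look something like this:

Expected valid expression in local declaration:

local n1 = fib(n-1);
           ^ Near here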

The smallest individual unit after lexical analysis is a token, which is either a keyword, number, identifier, operator, or syntax. (This implementation is clearly skipping lots of real Lua syntax like strings.)

#[derive(Debug, PartialEq, Eq, Clone)]
pub enum TokenKind {
    Identifier,
    Syntax,
    Keyword,
    Number,
    Operator,
}

#[derive(Debug, Clone)]
pub struct Token {
    pub value: String,
    pub kind: TokenKind,
    pub loc: Location,
}

The top-level lex function will iterate over the file and call a lex helper for each kind of token, returning an array of all tokens on success. In between lexing it will "eat whitespace".

pub fn lex(s: &[char]) -> Result<Vec<Token>, String> {
    let mut loc = Location {
        col: 0,
        index: 0,
        line: 0,
    };
    let size = s.len();
    let mut tokens: Vec<Token> = vec![];

    let lexers = [
        lex_keyword,
        lex_identifier,
        lex_number,
        lex_syntax,
        lex_operator,
    ];
    'outer: while loc.index < size {
        loc = eat_whitespace(s, loc);
        if loc.index == size {
            break;
        }

        for lexer in lexers {
            let res = lexer(s, loc);
            if let Some((t, next_loc)) = res {
                loc = next_loc;
                tokens.push(t);
                continue 'outer;
            }
        }

        return Err(loc.debug(s, "Unrecognized character while lexing:"));
    }

    Ok(tokens)
}
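
Once all the individual lexers below are written, a quick unit test at the bottom of src/lex.rs might look something like this:

#[cfg(test)]
mod tests {
    use super::*;

    #[test]
    fn lexes_a_local_declaration() {
        let raw: Vec<char> = "local a = 1;\n".chars().collect();
        let tokens = lex(&raw).unwrap();
        let values: Vec<&str> = tokens.iter().map(|t| t.value.as_str()).collect();
        assert_eq!(values, vec!["local", "a", "=", "1", ";"]);
        assert_eq!(tokens[0].kind, TokenKind::Keyword);
        assert_eq!(tokens[1].kind, TokenKind::Identifier);
        assert_eq!(tokens[3].kind, TokenKind::Number);
    }
}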

Whitespace

Eating whitespace is just incrementing the location while we see a space, tab, newline, etc.

fn eat_whitespace(raw: &[char], initial_loc: Location) -> Location {
    let mut c = raw[initial_loc.index];
    let mut next_loc = initial_loc;
    while [' ', '\n', '\r', '\t'].contains(&c) {
        next_loc = next_loc.increment(c == '\n');
        if next_loc.index == raw.len() {
            break;
        }
        c = raw[next_loc.index];
    }

    next_loc
}

Numbers

Lexing a number iterates through the source from the current position until it stops seeing decimal digits (this implementation only supports integers).

fn lex_number(raw: &[char], initial_loc: Location) -> Option<(Token, Location)> {
    let mut ident = String::new();
    let mut next_loc = initial_loc;
    let mut c = raw[initial_loc.index];
    while c.is_digit(10) {
        ident.push_str(&c.to_string());
        next_loc = next_loc.increment(false);
        c = raw[next_loc.index];
    }

If there are no digits in the string then this is not a number.

    if !ident.is_empty() {
        Some((
            Token {
                value: ident,
                loc: initial_loc,
                kind: TokenKind::Number,
            },
            next_loc,
        ))
    } else {
        None
    }
}

Identifiers

Identifiers are any sequence of alphabetic characters, digits, and underscores.

fn lex_identifier(raw: &[char], initial_loc: Location) -> Option<(Token, Location)> {
    let mut ident = String::new();
    let mut next_loc = initial_loc;
    let mut c = raw[initial_loc.index];
    while c.is_alphanumeric() || c == '_' {
        ident.push_str(&c.to_string());
        next_loc = next_loc.increment(false);
        c = raw[next_loc.index];
    }

But they cannot start with a number.

    // First character must not be a digit
    if !ident.is_empty() && !ident.chars().next().unwrap().is_digit(10) {
        Some((
            Token {
                value: ident,
                loc: initial_loc,
                kind: TokenKind::Identifier,
            },
            next_loc,
        ))
    } else {
        None
    }
}

Keywords

Keywords look like identifiers, but they are reserved and cannot be reused as variable names.

fn lex_keyword(raw: &[char], initial_loc: Location) -> Option<(Token, Location)> {
    let syntax = ["function", "end", "if", "then", "local", "return"];

    let mut next_loc = initial_loc;
    let mut value = String::new();
    'outer: for possible_syntax in syntax {
        let mut c = raw[initial_loc.index];
        next_loc = initial_loc;
        while c.is_alphanumeric() || c == '_' {
            value.push_str(&c.to_string());
            next_loc = next_loc.increment(false);
            c = raw[next_loc.index];

            let n = next_loc.index - initial_loc.index;
            // Bail if the value has outgrown or diverged from this keyword,
            // otherwise the slice below would panic
            if n > possible_syntax.len() || value != possible_syntax[..n] {
                value = String::new();
                continue 'outer;
            }
        }

        // Not a complete match
        if value.len() < possible_syntax.len() {
            value = String::new();
            continue;
        }

        // If it got to this point it found a match, so exit early.
        // We don't need a longest match.
        break;
    }

    if value.is_empty() {
        return None;
    }

Aside from matching against a list of strings, we have to make sure there is a complete match. For example, function1 is not a keyword; it's a valid identifier. Whereas function 1 is a valid pair of tokens (the keyword function and the number 1), even if it isn't valid Lua grammar.

    // If the next character would be part of a valid identifier, then
    // this is not a keyword.
    if next_loc.index < raw.len() - 1 {
        let next_c = raw[next_loc.index];
        if next_c.is_alphanumeric() || next_c == '_' {
            return None;
        }
    }

    Some((
        Token {
            value,
            loc: initial_loc,
            kind: TokenKind::Keyword,
        },
        next_loc,
    ))
}

Syntax

Syntax (in this context) is just language junk that isn't operators: things like commas, parentheses, etc.

fn lex_syntax(raw: &[char], initial_loc: Location) -> Option<(Token, Location)> {
    let syntax = [";", "=", "(", ")", ","];

    for possible_syntax in syntax {
        let c = raw[initial_loc.index];
        let next_loc = initial_loc.increment(false);
        // TODO: this won't work with multiple-character syntax bits like >= or ==
        if possible_syntax == c.to_string() {
            return Some((
                Token {
                    value: possible_syntax.to_string(),
                    loc: initial_loc,
                    kind: TokenKind::Syntax,
                },
                next_loc,
            ));
        }
    }

    None
}

Operators

Operators are symbols like plus, minus, and less-than. Operators are syntax too, but it helps us later on to break these out into a separate type of token.

fn lex_operator(raw: &[char], initial_loc: Location) -> Option<(Token, Location)> {
    let operators = ["+", "-", "<"];

    for possible_syntax in operators {
        let c = raw[initial_loc.index];
        let next_loc = initial_loc.increment(false);
        // TODO: this won't work with multiple-character operators like >= or ==
        if possible_syntax == c.to_string() {
            return Some((
                Token {
                    value: possible_syntax.to_string(),
                    loc: initial_loc,
                    kind: TokenKind::Operator,
                },
                next_loc,
            ));
        }
    }

    None
}

And now we're all done lexing!

Grammar analysis

Parsing finds grammatical (tree) patterns in a flat list of tokens. The result is called a syntax tree or abstract syntax tree (AST).

The boring part is defining the tree. Generally speaking (and specifically for this project), the syntax tree is a list of statements. Statements can be function declarations, expression statements, if statements, return statements, or local declarations.

This goes in src/parse.rs.

#[derive(Debug)]
pub enum Statement {
    Expression(Expression),
    If(If),
    FunctionDeclaration(FunctionDeclaration),
    Return(Return),
    Local(Local),
}

pub type Ast = Vec<Statement>;

There's almost nothing special at all about the rest of the tree definitions.

#[derive(Debug)]
pub enum Literal {
    Identifier(Token),
    Number(Token),
}

#[derive(Debug)]
pub struct FunctionCall {
    pub name: Token,
    pub arguments: Vec<Expression>,
}

#[derive(Debug)]
pub struct BinaryOperation {
    pub operator: Token,
    pub left: Box<Expression>,
    pub right: Box<Expression>,
}

#[derive(Debug)]
pub enum Expression {
    FunctionCall(FunctionCall),
    BinaryOperation(BinaryOperation),
    Literal(Literal),
}

#[derive(Debug)]
pub struct FunctionDeclaration {
    pub name: Token,
    pub parameters: Vec<Token>,
    pub body: Vec<Statement>,
}

#[derive(Debug)]
pub struct If {
    pub test: Expression,
    pub body: Vec<Statement>,
}

#[derive(Debug)]
pub struct Local {
    pub name: Token,
    pub expression: Expression,
}

#[derive(Debug)]
pub struct Return {
    pub expression: Expression,
}
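
To make the shape concrete, the statement return n1 + n2; parses into roughly this tree (token fields elided):

Statement::Return(Return {
    expression: Expression::BinaryOperation(BinaryOperation {
        operator: /* the + token */,
        left: Box::new(Expression::Literal(Literal::Identifier(/* n1 */))),
        right: Box::new(Expression::Literal(Literal::Identifier(/* n2 */))),
    }),
})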

And that's it for the AST!

Some helpers

Lastly before the fun part, we'll define a few helpers for validating each kind of token.

fn expect_keyword(tokens: &[Token], index: usize, value: &str) -> bool {
    if index >= tokens.len() {
        return false;
    }

    let t = tokens[index].clone();
    t.kind == TokenKind::Keyword && t.value == value
}

fn expect_syntax(tokens: &[Token], index: usize, value: &str) -> bool {
    if index >= tokens.len() {
        return false;
    }

    let t = tokens[index].clone();
    t.kind == TokenKind::Syntax && t.value == value
}

fn expect_identifier(tokens: &[Token], index: usize) -> bool {
    if index >= tokens.len() {
        return false;
    }

    let t = tokens[index].clone();
    t.kind == TokenKind::Identifier
}

Now on to the fun part, actually detecting these trees!

Top-level parse

The top-level parse function and its major helper, parse_statement, dispatch very similarly to the top-level lex function. For each statement in the file we look for function declarations, if statements, return statements, local declarations, and expression statements.

fn parse_statement(raw: &[char], tokens: &[Token], index: usize) -> Option<(Statement, usize)> {
    let parsers = [
        parse_if,
        parse_expression_statement,
        parse_return,
        parse_function,
        parse_local,
    ];
    for parser in parsers {
        let res = parser(raw, tokens, index);
        if res.is_some() {
            return res;
        }
    }

    None
}

pub fn parse(raw: &[char], tokens: Vec<Token>) -> Result<Ast, String> {
    let mut ast = vec![];
    let mut index = 0;
    let ntokens = tokens.len();
    while index < ntokens {
        let res = parse_statement(raw, &tokens, index);
        if let Some((stmt, next_index)) = res {
            index = next_index;
            ast.push(stmt);
            continue;
        }

        return Err(tokens[index].loc.debug(raw, "Invalid token while parsing:"));
    }

    Ok(ast)
}

Expression statements

Expression statements are just a wrapper to fit expressions into the Statement type. They call parse_expression (which we'll define shortly), expect a semicolon afterward, and wrap the expression in a statement.

fn parse_expression_statement(
    raw: &[char],
    tokens: &[Token],
    index: usize,
) -> Option<(Statement, usize)> {
    let mut next_index = index;
    let res = parse_expression(raw, tokens, next_index)?;

    let (expr, next_next_index) = res;
    next_index = next_next_index;
    if !expect_syntax(tokens, next_index, ";") {
        println!(
            "{}",
            tokens[next_index]
                .loc
                .debug(raw, "Expected semicolon after expression:")
        );
        return None;
    }

    next_index += 1; // Skip past semicolon

    Some((Statement::Expression(expr), next_index))
}

Expressions

Expressions in this minimal Lua are only one of function calls, literals (numbers, identifiers), or binary operations. To keep things very simple, binary operations cannot be combined. So instead of 1 + 2 + 3 we'd need to do local tmp1 = 1 + 2; local tmp2 = tmp1 + 3; and so on.

fn parse_expression(raw: &[char], tokens: &[Token], index: usize) -> Option<(Expression, usize)> {
    if index >= tokens.len() {
        return None;
    }

    let t = tokens[index].clone();
    let left = match t.kind {
        TokenKind::Number => Expression::Literal(Literal::Number(t)),
        TokenKind::Identifier => Expression::Literal(Literal::Identifier(t)),
        _ => {
            return None;
        }
    };

If what follows the first literal is an open parenthesis then we try to parse a function call.

    let mut next_index = index + 1;
    if expect_syntax(tokens, next_index, "(") {
        next_index += 1; // Skip past open paren

        // Function call
        let mut arguments: Vec<Expression> = vec![];

We need to call parse_expression recursively for every possible argument passed to the function.

        while !expect_syntax(tokens, next_index, ")") {
            if !arguments.is_empty() {
                if !expect_syntax(tokens, next_index, ",") {
                    println!(
                        "{}",
                        tokens[next_index]
                            .loc
                            .debug(raw, "Expected comma between function call arguments:")
                    );
                    return None;
                }

                next_index += 1; // Skip past comma
            }

            let res = parse_expression(raw, tokens, next_index);
            if let Some((arg, next_next_index)) = res {
                next_index = next_next_index;
                arguments.push(arg);
            } else {
                println!(
                    "{}",
                    tokens[next_index]
                        .loc
                        .debug(raw, "Expected valid expression in function call arguments:")
                );
                return None;
            }
        }

        next_index += 1; // Skip past closing paren

        return Some((
            Expression::FunctionCall(FunctionCall {
                name: tokens[index].clone(),
                arguments,
            }),
            next_index,
        ));
    }

Otherwise if there isn't an opening parenthesis then we could be parsing either a literal expression or a binary operation. If the token that follows is an operator token then we know it's a binary operation.

    // Might be a literal expression
    if next_index >= tokens.len() || tokens[next_index].kind != TokenKind::Operator {
        return Some((left, next_index));
    }

    // Otherwise is a binary operation
    let op = tokens[next_index].clone();
    next_index += 1; // Skip past op

    if next_index >= tokens.len() {
        println!(
            "{}",
            op.loc
                .debug(raw, "Expected valid right hand side binary operand:")
        );
        return None;
    }

    let rtoken = tokens[next_index].clone();

It is at this point that we could (but won't) call parse_expression recursively. I don't want to deal with operator precedence right now so we'll just require that the right hand side is another literal.

    let right = match rtoken.kind {
        TokenKind::Number => Expression::Literal(Literal::Number(rtoken)),
        TokenKind::Identifier => Expression::Literal(Literal::Identifier(rtoken)),
        _ => {
            println!(
                "{}",
                rtoken
                    .loc
                    .debug(raw, "Expected valid right hand side binary operand:")
            );
            return None;
        }
    };
    next_index += 1; // Skip past right hand operand

    Some((
        Expression::BinaryOperation(BinaryOperation {
            left: Box::new(left),
            right: Box::new(right),
            operator: op,
        }),
        next_index,
    ))
}
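
As a concrete example, the tokens for fib(n-1) parse into a function call whose single argument is a binary operation:

Expression::FunctionCall(FunctionCall {
    name: /* the fib token */,
    arguments: vec![Expression::BinaryOperation(BinaryOperation {
        operator: /* the - token */,
        left: Box::new(Expression::Literal(Literal::Identifier(/* n */))),
        right: Box::new(Expression::Literal(Literal::Number(/* 1 */))),
    })],
})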

And now we're done parsing expressions!

Function declarations

Function declarations start with the function keyword, followed by an identifier token for the function's name.

fn parse_function(raw: &[char], tokens: &[Token], index: usize) -> Option<(Statement, usize)> {
    if !expect_keyword(tokens, index, "function") {
        return None;
    }

    let mut next_index = index + 1;
    if !expect_identifier(tokens, next_index) {
        println!(
            "{}",
            tokens[next_index]
                .loc
                .debug(raw, "Expected valid identifier for function name:")
        );
        return None;
    }
    let name = tokens[next_index].clone();

After the function name comes the parameter list, which can be empty or a comma-separated list of identifiers.

    next_index += 1; // Skip past name
    if !expect_syntax(tokens, next_index, "(") {
        println!(
            "{}",
            tokens[next_index]
                .loc
                .debug(raw, "Expected open parenthesis in function declaration:")
        );
        return None;
    }

    next_index += 1; // Skip past open paren
    let mut parameters: Vec<Token> = vec![];
    while !expect_syntax(tokens, next_index, ")") {
        if !parameters.is_empty() {
            if !expect_syntax(tokens, next_index, ",") {
                println!("{}", tokens[next_index].loc.debug(raw, "Expected comma or close parenthesis after parameter in function declaration:"));
                return None;
            }

            next_index += 1; // Skip past comma
        }

        parameters.push(tokens[next_index].clone());
        next_index += 1; // Skip past param
    }

    next_index += 1; // Skip past close paren

Next we parse all statements in the function body until we find the end keyword.

    let mut statements: Vec<Statement> = vec![];
    while !expect_keyword(tokens, next_index, "end") {
        let res = parse_statement(raw, tokens, next_index);
        if let Some((stmt, next_next_index)) = res {
            next_index = next_next_index;
            statements.push(stmt);
        } else {
            println!(
                "{}",
                tokens[next_index]
                    .loc
                    .debug(raw, "Expected valid statement in function declaration:")
            );
            return None;
        }
    }

    next_index += 1; // Skip past end

    Some((
        Statement::FunctionDeclaration(FunctionDeclaration {
            name,
            parameters,
            body: statements,
        }),
        next_index,
    ))
}

Phew! We're halfway through the parser.

Return statements

Return statements just check for the return keyword, an expression, and a semicolon.

fn parse_return(raw: &[char], tokens: &[Token], index: usize) -> Option<(Statement, usize)> {
    if !expect_keyword(tokens, index, "return") {
        return None;
    }

    let mut next_index = index + 1; // Skip past return
    let res = parse_expression(raw, tokens, next_index);
    if res.is_none() {
        println!(
            "{}",
            tokens[next_index]
                .loc
                .debug(raw, "Expected valid expression in return statement:")
        );
        return None;
    }

    let (expr, next_next_index) = res.unwrap();
    next_index = next_next_index;
    if !expect_syntax(tokens, next_index, ";") {
        println!(
            "{}",
            tokens[next_index]
                .loc
                .debug(raw, "Expected semicolon in return statement:")
        );
        return None;
    }

    next_index += 1; // Skip past semicolon

    Some((Statement::Return(Return { expression: expr }), next_index))
}

Local declarations

Local declarations start with the local keyword, then the local name, then an equal sign, then an expression, and then a semicolon.

fn parse_local(raw: &[char], tokens: &[Token], index: usize) -> Option<(Statement, usize)> {
    if !expect_keyword(tokens, index, "local") {
        return None;
    }

    let mut next_index = index + 1; // Skip past local

    if !expect_identifier(tokens, next_index) {
        println!(
            "{}",
            tokens[next_index]
                .loc
                .debug(raw, "Expected valid identifier for local name:")
        );
        return None;
    }

    let name = tokens[next_index].clone();
    next_index += 1; // Skip past name

    if !expect_syntax(tokens, next_index, "=") {
        println!(
            "{}",
            tokens[next_index]
                .loc
                .debug(raw, "Expected = syntax after local name:")
        );
        return None;
    }

    next_index += 1; // Skip past =

    let res = parse_expression(raw, tokens, next_index);
    if res.is_none() {
        println!(
            "{}",
            tokens[next_index]
                .loc
                .debug(raw, "Expected valid expression in local declaration:")
        );
        return None;
    }

    let (expr, next_next_index) = res.unwrap();
    next_index = next_next_index;

    if !expect_syntax(tokens, next_index, ";") {
        println!(
            "{}",
            tokens[next_index]
                .loc
                .debug(raw, "Expected semicolon in return statement:")
        );
        return None;
    }

    next_index += 1; // Skip past semicolon

    Some((
        Statement::Local(Local {
            name,
            expression: expr,
        }),
        next_index,
    ))
}

If statements

This implementation of Lua doesn't support elseif, so parsing if just checks for the if keyword followed by a test expression, then the then keyword, then the if body (a list of statements), and then the end keyword.

fn parse_if(raw: &[char], tokens: &[Token], index: usize) -> Option<(Statement, usize)> {
    if !expect_keyword(tokens, index, "if") {
        return None;
    }

    let mut next_index = index + 1; // Skip past if
    let res = parse_expression(raw, tokens, next_index);
    if res.is_none() {
        println!(
            "{}",
            tokens[next_index]
                .loc
                .debug(raw, "Expected valid expression for if test:")
        );
        return None;
    }

    let (test, next_next_index) = res.unwrap();
    next_index = next_next_index;

    if !expect_keyword(tokens, next_index, "then") {
        return None;
    }

    next_index += 1; // Skip past then

    let mut statements: Vec<Statement> = vec![];
    while !expect_keyword(tokens, next_index, "end") {
        let res = parse_statement(raw, tokens, next_index);
        if let Some((stmt, next_next_index)) = res {
            next_index = next_next_index;
            statements.push(stmt);
        } else {
            println!(
                "{}",
                tokens[next_index]
                    .loc
                    .debug(raw, "Expected valid statement in if body:")
            );
            return None;
        }
    }

    next_index += 1; // Skip past end

    Some((
        Statement::If(If {
            test,
            body: statements,
        }),
        next_index,
    ))
}

And goshdarnit we're done parsing.

Compiling to a made-up virtual machine

This virtual machine will be entirely stack-based, other than the frame pointer and program counter registers.

The calling convention is that the caller pushes the arguments onto the stack, followed by the current frame pointer, the return address (the program counter plus one), and then the number of arguments (for cleanup). Then it sets the new program counter and frame pointer and allocates space on the stack for all arguments and locals within the function.

For simplicity in addressing modes, once jumped to, the function declaration will copy the arguments from behind the frame pointer to in front of it (yes I know, I know, this is silly).
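
Concretely, once the Call instruction has run and the function's prologue has copied its arguments forward, the stack for a two-argument call looks something like this (top of the stack at the bottom of the listing; a sketch):

...caller's data...
argument 1
argument 2
saved frame pointer
return address (old program counter plus one)
number of arguments (2)
copy of argument 1    <- frame pointer points here (local slot 0)
copy of argument 2    (local slot 1)
...zeroed slots for any remaining locals...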

The virtual machine will support add, subtract, and less-than operations, as well as jump, jump-if-zero, return, and call. It will support a few more memory-specific instructions for loading literals, loading identifiers, and managing arguments.

I'll explain the non-obvious instructions as we implement them.

use crate::parse::*;
use std::collections::HashMap;

#[derive(Debug)]
enum Instruction {
    DupPlusFP(i32),          // Copy the value at this offset from the frame pointer to the top of the stack
    MoveMinusFP(usize, i32), // Copy an argument from behind the frame pointer into a local slot
    MovePlusFP(usize),       // Pop the top of the stack into this slot relative to the frame pointer
    Store(i32),              // Push a literal number onto the stack
    Return,
    JumpIfZero(String),      // Pop the top of the stack and jump to the label if it is zero
    Jump(String),
    Call(String, usize),     // Call a named function with this many arguments
    Add,
    Subtract,
    LessThan,
}

The result of compiling will be a Program instance. This instance will contain symbol information and the actual instructions to run.

#[derive(Debug)]
struct Symbol {
    location: i32,
    narguments: usize,
    nlocals: usize,
}

#[derive(Debug)]
pub struct Program {
    syms: HashMap<String, Symbol>,
    instructions: Vec<Instruction>,
}

Compiling, similar to parsing, just calls the helper compile_statement for each statement in the AST.

pub fn compile(raw: &[char], ast: Ast) -> Program {
    let mut locals: HashMap<String, i32> = HashMap::new();
    let mut pgrm = Program {
        syms: HashMap::new(),
        instructions: Vec::new(),
    };
    for stmt in ast {
        compile_statement(&mut pgrm, raw, &mut locals, stmt);
    }

    pgrm
}

And compile_statement dispatches to additional helpers based on the kind of statement.

fn compile_statement(
    pgrm: &mut Program,
    raw: &[char],
    locals: &mut HashMap<String, i32>,
    stmt: Statement,
) {
    match stmt {
        Statement::FunctionDeclaration(fd) => compile_declaration(pgrm, raw, locals, fd),
        Statement::Return(r) => compile_return(pgrm, raw, locals, r),
        Statement::If(if_) => compile_if(pgrm, raw, locals, if_),
        Statement::Local(loc) => compile_local(pgrm, raw, locals, loc),
        Statement::Expression(e) => compile_expression(pgrm, raw, locals, e),
    }
}

Function declarations

Let's do the hard one first. First off, each function declaration will be wrapped in an unconditional jump so that, when we evaluate from the 0th instruction at the top level, function bodies are skipped unless explicitly called.

fn compile_declaration(
    pgrm: &mut Program,
    raw: &[char],
    _: &mut HashMap<String, i32>,
    fd: FunctionDeclaration,
) {
    // Jump to end of function to guard top-level
    let done_label = format!("function_done_{}", pgrm.instructions.len());
    pgrm.instructions
        .push(Instruction::Jump(done_label.clone()));

Then we'll add another limitation/simplification that local variables are only accessible within the current function scope.

For each parameter, we'll copy the parameter on the stack before the frame pointer to a place in front of the frame pointer. This gets around addressing mode limitations in our virtual machine.

    let mut new_locals = HashMap::<String, i32>::new();

    let function_index = pgrm.instructions.len() as i32;
    let narguments = fd.parameters.len();
    for (i, param) in fd.parameters.iter().enumerate() {
        pgrm.instructions.push(Instruction::MoveMinusFP(
            i,
            narguments as i32 - (i as i32 + 1),
        ));
        new_locals.insert(param.value.clone(), i as i32);
    }

Then we compile the body.

    for stmt in fd.body {
        compile_statement(pgrm, raw, &mut new_locals, stmt);
    }

Once the body is compiled we know the total number of locals, so we can fill out the symbol table correctly. Importantly, the location was already recorded above: it is the index of the first instruction in the function.

    pgrm.syms.insert(
        fd.name.value,
        Symbol {
            location: function_index as i32,
            narguments,
            nlocals: new_locals.keys().len(),
        },
    );

Finally we add a symbol linking the done label for the function to the position of the end of the function. Again, this allows us to skip past the function declaration when evaluating instructions from 0 to N.

    pgrm.syms.insert(
        done_label,
        Symbol {
            location: pgrm.instructions.len() as i32,
            narguments: 0,
            nlocals: 0,
        },
    );
}
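
Putting this together, compiling function id(x) return x; end as the first statement in a program produces roughly this instruction stream:

0: Jump("function_done_0")  // guard so top-level execution skips the body
1: MoveMinusFP(0, 0)        // copy argument x into local slot 0
2: DupPlusFP(0)             // push x back onto the top of the stack
3: Return

along with symbol table entries mapping id to location 1 (narguments: 1, nlocals: 1) and function_done_0 to location 4.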

Ok that wasn't so bad. And the rest are simpler.

Local declarations

The local's name is mapped in a locals table to the current number of locals (including arguments), and then the expression is compiled. This allows the compiler to turn identifier token lookups into a simple offset from the frame pointer.

fn compile_local(
    pgrm: &mut Program,
    raw: &[char],
    locals: &mut HashMap<String, i32>,
    local: Local,
) {
    let index = locals.keys().len();
    locals.insert(local.name.value, index as i32);
    compile_expression(pgrm, raw, locals, local.expression);
    pgrm.instructions.push(Instruction::MovePlusFP(index));
}

And specifically, the instruction pattern is to evaluate the expression and then copy it back into a relative position in the stack.
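
For example, in a function with no parameters, local a = 3; compiles to something like:

Store(3)       // push 3 onto the stack
MovePlusFP(0)  // pop it into local slot 0, relative to the frame pointer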

Literals

Number literals use the store instruction for pushing a number onto the stack. Identifier literals are copied to the top of the stack from their position relative to the frame pointer.

fn compile_literal(
    pgrm: &mut Program,
    _: &[char],
    locals: &mut HashMap<String, i32>,
    lit: Literal,
) {
    match lit {
        Literal::Number(i) => {
            let n = i.value.parse::<i32>().unwrap();
            pgrm.instructions.push(Instruction::Store(n));
        }
        Literal::Identifier(ident) => {
            pgrm.instructions
                .push(Instruction::DupPlusFP(locals[&ident.value]));
        }
    }
}

Function calls

Pretty simple: just compile all the arguments and then issue a call instruction.

fn compile_function_call(
    pgrm: &mut Program,
    raw: &[char],
    locals: &mut HashMap<String, i32>,
    fc: FunctionCall,
) {
    let len = fc.arguments.len();
    for arg in fc.arguments {
        compile_expression(pgrm, raw, locals, arg);
    }

    pgrm.instructions
        .push(Instruction::Call(fc.name.value, len));
}

Binary operations

Binary operations compile the left, then the right, and then issue an instruction based on the operator. All the operators are builtin and act on the top two elements on the stack.

fn compile_binary_operation(
    pgrm: &mut Program,
    raw: &[char],
    locals: &mut HashMap<String, i32>,
    bop: BinaryOperation,
) {
    compile_expression(pgrm, raw, locals, *bop.left);
    compile_expression(pgrm, raw, locals, *bop.right);
    match bop.operator.value.as_str() {
        "+" => {
            pgrm.instructions.push(Instruction::Add);
        }
        "-" => {
            pgrm.instructions.push(Instruction::Subtract);
        }

        "<" => {
            pgrm.instructions.push(Instruction::LessThan);
        }
        _ => panic!(
            "{}",
            bop.operator
                .loc
                .debug(raw, "Unable to compile binary operation:")
        ),
    }
}
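
So with n1 in local slot 0 and n2 in slot 1, n1 + n2 compiles to:

DupPlusFP(0)  // push n1
DupPlusFP(1)  // push n2
Add           // pop both, push the sum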

Expressions

Compiling expressions just dispatches to a compile helper based on the type of expression. We've already written those three helpers.

fn compile_expression(
    pgrm: &mut Program,
    raw: &[char],
    locals: &mut HashMap<String, i32>,
    exp: Expression,
) {
    match exp {
        Expression::BinaryOperation(bop) => {
            compile_binary_operation(pgrm, raw, locals, bop);
        }
        Expression::FunctionCall(fc) => {
            compile_function_call(pgrm, raw, locals, fc);
        }
        Expression::Literal(lit) => {
            compile_literal(pgrm, raw, locals, lit);
        }
    }
}

If

First we compile the conditional test, then emit a jump that skips past the if body when the test result is zero.

fn compile_if(pgrm: &mut Program, raw: &[char], locals: &mut HashMap<String, i32>, if_: If) {
    compile_expression(pgrm, raw, locals, if_.test);
    let done_label = format!("if_else_{}", pgrm.instructions.len());
    pgrm.instructions
        .push(Instruction::JumpIfZero(done_label.clone()));

Then we compile the body.

    for stmt in if_.body {
        compile_statement(pgrm, raw, locals, stmt);
    }

And finally we insert the done symbol pointing just past the if body. We store the location minus one because, as we'll see, the conditional jump instruction increments the program counter again after jumping.

    pgrm.syms.insert(
        done_label,
        Symbol {
            location: pgrm.instructions.len() as i32 - 1,
            nlocals: 0,
            narguments: 0,
        },
    );
}
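
For example, inside fib with n in local slot 0, if n < 2 then return n; end compiles to roughly:

0: DupPlusFP(0)             // push n
1: Store(2)                 // push 2
2: LessThan                 // push 1 if n < 2, else 0
3: JumpIfZero("if_else_3")  // skip the body when the test failed
4: DupPlusFP(0)             // push n
5: Return

where if_else_3 maps to location 5: the jump sets the program counter to 5 and the instruction then increments it to 6, landing just past the body.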

Return

The final statement type is return. We simply compile the return expression and issue a return instruction.

fn compile_return(
    pgrm: &mut Program,
    raw: &[char],
    locals: &mut HashMap<String, i32>,
    ret: Return,
) {
    compile_expression(pgrm, raw, locals, ret.expression);
    pgrm.instructions.push(Instruction::Return);
}

That's it for the compiler! Now the trickiest part. I lost a few hours debugging and iterating on the next bit.

The virtual machine

Ok so the easy part is that there are only two registers, a program counter and a frame pointer. There's also a data stack. The frame pointer points to the location on the data stack where each function can start storing its locals.

Evaluation starts from 0 and goes until the last instruction.

pub fn eval(pgrm: Program) {
    let mut pc: i32 = 0;
    let mut fp: i32 = 0;
    let mut data: Vec<i32> = vec![];

    while pc < pgrm.instructions.len() as i32 {
        match &pgrm.instructions[pc as usize] {

Each instruction will be responsible for incrementing the program counter or having it jump around.

Addition, subtraction, less than

The easiest ones are the math operators. We just pop off the data stack, perform the operation, and store the result.

            Instruction::Add => {
                let right = data.pop().unwrap();
                let left = data.pop().unwrap();
                data.push(left + right);
                pc += 1;
            }
            Instruction::Subtract => {
                let right = data.pop().unwrap();
                let left = data.pop().unwrap();
                data.push(left - right);
                pc += 1;
            }
            Instruction::LessThan => {
                let right = data.pop().unwrap();
                let left = data.pop().unwrap();
                data.push(if left < right { 1 } else { 0 });
                pc += 1;
            }

The store instruction is another easy one. It just pushes a literal number onto the stack.

            Instruction::Store(n) => {
                data.push(*n);
                pc += 1;
            }

Jump variants

The jump variants are easy too. Just grab the location and change the program counter. If it's a conditional jump then test the condition first.

            Instruction::JumpIfZero(label) => {
                let top = data.pop().unwrap();
                if top == 0 {
                    pc = pgrm.syms[label].location;
                }
                // This increment is why compile_if stores the label
                // location minus one.
                pc += 1;
            }
            Instruction::Jump(label) => {
                pc = pgrm.syms[label].location;
            }

Storing locals

The MovePlusFP instruction pops the value on top of the stack and copies it into a slot at the given offset from the frame pointer. This is how compile_local initializes locals.

            Instruction::MovePlusFP(i) => {
                let val = data.pop().unwrap();
                let index = fp as usize + *i;
                // Accounts for top-level locals
                while index >= data.len() {
                    data.push(0);
                }
                data[index] = val;
                pc += 1;
            }

Loading from a variable

The DupPlusFP instruction copies a value from the stack (at an offset from the frame pointer) onto the top of the stack. It is emitted by compile_literal for references to arguments and locals.

            Instruction::DupPlusFP(i) => {
                data.push(data[(fp + i) as usize]);
                pc += 1;
            }

Duplicating arguments

The MoveMinusFP instruction is, again, a hack to work around limited addressing modes in this minimal virtual machine. It copies arguments from behind the frame pointer to in front of the frame pointer.

            Instruction::MoveMinusFP(local_offset, fp_offset) => {
                // The three values pushed by Call (saved frame pointer,
                // return address, argument count) sit between the arguments
                // and the frame pointer, so the last argument lives at fp - 4.
                data[fp as usize + local_offset] = data[(fp - (fp_offset + 4)) as usize];
                pc += 1;
            }

Now we're down to the last two instructions: call and return.

Call

Call has a special dispatch for builtin functions (the only one that exists being print).

            Instruction::Call(label, narguments) => {
                // Handle builtin functions
                if label == "print" {
                    for _ in 0..*narguments {
                        print!("{}", data.pop().unwrap());
                        print!(" ");
                    }
                    println!();
                    pc += 1;
                    continue;
                }

Otherwise it pushes the current frame pointer, then the return address (the program counter plus one), and finally the number of arguments (not locals) onto the stack for preservation. Then it sets up the new program counter and frame pointer and creates space for all locals and arguments after the new frame pointer.

                data.push(fp);
                data.push(pc + 1);
                data.push(pgrm.syms[label].narguments as i32);
                pc = pgrm.syms[label].location;
                fp = data.len() as i32;

                // Set up space for all arguments/locals
                let mut nlocals = pgrm.syms[label].nlocals;
                while nlocals > 0 {
                    data.push(0);
                    nlocals -= 1;
                }
            }

Return

The return instruction pops the return value from the stack. Then it pops off everything up to the frame pointer (all locals and argument copies). Then it restores the argument count, program counter, and frame pointer, and pops off the caller's arguments. Finally it pushes the return value back onto the stack.

            Instruction::Return => {
                let ret = data.pop().unwrap();

                // Clean up the local stack
                while fp < data.len() as i32 {
                    data.pop();
                }

                // Restore pc and fp
                let mut narguments = data.pop().unwrap();
                pc = data.pop().unwrap();
                fp = data.pop().unwrap();

                // Clean up arguments
                while narguments > 0 {
                    data.pop();
                    narguments -= 1;
                }

                // Add back return value
                data.push(ret);
            }
        }
    }
}

And yes, this implementation would be more efficient if instead of literally pushing and popping we just incremented/decremented a stack pointer.

And that's it! We're completely done with a basic parser, compiler, and virtual machine for a subset of Lua. Is it janky? Yeah. Is it simple? Kind of? Does it work? It seems to!

Summary

Ok, we've got under 1,200 lines of Rust: enough to run some decent Lua programs. If we run this fib program against this implementation and against Lua 5.4.3 (the reference implementation, not LuaJIT), what do we see?

$ cargo build --release
$ cat test/fib.lua
function fib(n)
   if n < 2 then
      return n;
   end

   local n1 = fib(n-1);
   local n2 = fib(n-2);
   return n1 + n2;
end

print(fib(30));
$ time ./target/release/lust test/fib.lua
832040
./target/release/lust test/fib.lua  0.29s user 0.00s system 99% cpu 0.293 total
$ time lua test/fib.lua
832040
lua test/fib.lua  0.06s user 0.00s system 99% cpu 0.063 total

This implementation is a good bit slower (around 5x here)! Time to do some profiling and maybe revisit some of those aforementioned inefficiencies.

Big thanks to Christian Scott on Twitter for pointing out I should not be benchmarking with debug builds!

And thanks to reddit123123123123 on Reddit for suggesting I use cargo clippy to clean up my code.

Thanks to GiffE on GitHub for pointing out some key inconsistencies between this implementation and Lua. I won't modify anything because a perfect Lua subset wasn't the goal, but I'm sharing because it was good analysis and criticism of this implementation.