By the end of this guide we'll have a minimal, working implementation of a small part of Lua from scratch. It will be able to run the following program (among others):
function fib(n)
    if n < 2 then
        return n;
    end

    local n1 = fib(n-1);
    local n2 = fib(n-2);
    return n1 + n2;
end

print(fib(30));
This is my second project in Rust and only the third time I've invented an instruction set, so don't take my style as gospel. However, I have found some Rust parsing tutorials overly complex, so I'm hoping you'll find this one simpler.
All source code is available on Github.
Entrypoint
Running cargo init will give us the necessary boilerplate. In src/main.rs we'll accept a file name from the command line, perform lexical analysis to retrieve all tokens from the file, perform grammar analysis on the tokens to retrieve a tree structure, compile the tree to a linear set of virtual machine instructions, and finally interpret the virtual machine instructions.
mod eval;
mod lex;
mod parse;

use std::env;
use std::fs;

fn main() {
    let args: Vec<String> = env::args().collect();
    let contents = fs::read_to_string(&args[1]).expect("Could not read file");
    let raw: Vec<char> = contents.chars().collect();

    let tokens = match lex::lex(&raw) {
        Ok(tokens) => tokens,
        Err(msg) => panic!("{}", msg),
    };

    let ast = match parse::parse(&raw, tokens) {
        Ok(ast) => ast,
        Err(msg) => panic!("{}", msg),
    };

    let pgrm = eval::compile(&raw, ast);
    eval::eval(pgrm);
}
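Once the lex, parse, and eval modules below exist, running the interpreter looks like this (assuming the fib program above is saved as test/fib.lua):

$ cargo run -- test/fib.lua
832040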
Easy peasy. Now let's implement lex.
Lexical analysis
Lexical analysis drops whitespace (Lua is not whitespace sensitive) and chunks all source code characters into their smallest possible meaningful pieces like commas, numbers, identifiers, keywords, etc.
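For example, given the line local n1 = fib(n-1); the lexer we're about to write will produce this stream of tokens (a sketch of value/kind pairs):

local   Keyword
n1      Identifier
=       Syntax
fib     Identifier
(       Syntax
n       Identifier
-       Operator
1       Number
)       Syntax
;       Syntax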
In order to have useful error messages, we'll keep track of state in the file with a Location struct that implements increment and debug. This goes in src/lex.rs.
#[derive(Copy, Clone, Debug)]
pub struct Location {
    col: i32,
    line: i32,
    index: usize,
}
The increment function will update line and column numbers as well as the current index in the file.
impl Location {
    fn increment(&self, newline: bool) -> Location {
        if newline {
            Location {
                index: self.index + 1,
                col: 0,
                line: self.line + 1,
            }
        } else {
            Location {
                index: self.index + 1,
                col: self.col + 1,
                line: self.line,
            }
        }
    }
And the debug function will dump the current line, with a pointer in text to the current column, along with a message.
    pub fn debug<S: Into<String>>(&self, raw: &[char], msg: S) -> String {
        let mut line = 0;
        let mut line_str = String::new();
        // Find the whole line of original source
        for c in raw {
            if *c == '\n' {
                line += 1;

                // Done discovering line in question
                if !line_str.is_empty() {
                    break;
                }

                continue;
            }

            if self.line == line {
                line_str.push_str(&c.to_string());
            }
        }

        let space = " ".repeat(self.col as usize);
        format!("{}\n\n{}\n{}^ Near here", msg.into(), line_str, space)
    }
}
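For example, if lexing failed on a stray character (a hypothetical bad input), the output might look like this:

Unrecognized character while lexing:

print(fib(30)!;
             ^ Near here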
The smallest individual unit after lexical analysis is a token which is either a keyword, number, identifier, operator, or syntax. (This implementation is clearly skipping lots of real Lua syntax like strings.)
#[derive(Debug, PartialEq, Eq, Clone)]
pub enum TokenKind {
    Identifier,
    Syntax,
    Keyword,
    Number,
    Operator,
}

#[derive(Debug, Clone)]
pub struct Token {
    pub value: String,
    pub kind: TokenKind,
    pub loc: Location,
}
The top-level lex function will iterate over the file and call a lex helper for each kind of token, returning an array of all tokens on success. In between lexing it will "eat whitespace".
pub fn lex(s: &[char]) -> Result<Vec<Token>, String> {
    let mut loc = Location {
        col: 0,
        index: 0,
        line: 0,
    };
    let size = s.len();
    let mut tokens: Vec<Token> = vec![];

    let lexers = [
        lex_keyword,
        lex_identifier,
        lex_number,
        lex_syntax,
        lex_operator,
    ];
    'outer: while loc.index < size {
        loc = eat_whitespace(s, loc);
        if loc.index == size {
            break;
        }

        for lexer in lexers {
            let res = lexer(s, loc);
            if let Some((t, next_loc)) = res {
                loc = next_loc;
                tokens.push(t);
                continue 'outer;
            }
        }

        return Err(loc.debug(s, "Unrecognized character while lexing:"));
    }

    Ok(tokens)
}
Whitespace
Eating whitespace is just incrementing the location while we see a space, tab, newline, etc.
fn eat_whitespace(raw: &[char], initial_loc: Location) -> Location {
    let mut c = raw[initial_loc.index];
    let mut next_loc = initial_loc;
    while [' ', '\n', '\r', '\t'].contains(&c) {
        next_loc = next_loc.increment(c == '\n');
        if next_loc.index == raw.len() {
            break;
        }
        c = raw[next_loc.index];
    }

    next_loc
}
Numbers
Lexing numbers iterates through the source starting at a position until it stops seeing decimal digits (this implementation only supports integers).
fn lex_number(raw: &[char], initial_loc: Location) -> Option<(Token, Location)> {
    let mut ident = String::new();
    let mut next_loc = initial_loc;
    let mut c = raw[initial_loc.index];
    while c.is_digit(10) {
        ident.push(c);
        next_loc = next_loc.increment(false);
        if next_loc.index == raw.len() {
            break;
        }
        c = raw[next_loc.index];
    }
If there are no digits in the string then this is not a number.
    if !ident.is_empty() {
        Some((
            Token {
                value: ident,
                loc: initial_loc,
                kind: TokenKind::Number,
            },
            next_loc,
        ))
    } else {
        None
    }
}
Identifiers
Identifiers are any collection of alphabet characters, numbers, and underscores.
fn lex_identifier(raw: &[char], initial_loc: Location) -> Option<(Token, Location)> {
    let mut ident = String::new();
    let mut next_loc = initial_loc;
    let mut c = raw[initial_loc.index];
    while c.is_alphanumeric() || c == '_' {
        ident.push(c);
        next_loc = next_loc.increment(false);
        if next_loc.index == raw.len() {
            break;
        }
        c = raw[next_loc.index];
    }
But they cannot start with a number.
    // First character must not be a digit
    if !ident.is_empty() && !ident.chars().next().unwrap().is_digit(10) {
        Some((
            Token {
                value: ident,
                loc: initial_loc,
                kind: TokenKind::Identifier,
            },
            next_loc,
        ))
    } else {
        None
    }
}
Keywords
Keywords are alphabetical like identifiers, but they cannot be reused as variable names by the user.
fn lex_keyword(raw: &[char], initial_loc: Location) -> Option<(Token, Location)> {
    let syntax = ["function", "end", "if", "then", "local", "return"];

    let mut next_loc = initial_loc;
    let mut value = String::new();
    'outer: for possible_syntax in syntax {
        let mut c = raw[initial_loc.index];
        next_loc = initial_loc;
        while c.is_alphanumeric() || c == '_' {
            value.push(c);
            next_loc = next_loc.increment(false);

            let n = next_loc.index - initial_loc.index;
            // Bail if we've grown past this keyword or stopped matching it
            if n > possible_syntax.len() || value != possible_syntax[..n] {
                value = String::new();
                continue 'outer;
            }

            if next_loc.index == raw.len() {
                break;
            }
            c = raw[next_loc.index];
        }

        // Not a complete match
        if value.len() < possible_syntax.len() {
            value = String::new();
            continue;
        }

        // If it got to this point it found a match, so exit early.
        // We don't need a longest match.
        break;
    }

    if value.is_empty() {
        return None;
    }
Aside from matching a list of strings we have to make sure there is a complete match. For example function1 is not a keyword, it's a valid identifier. Whereas function 1 is a valid set of tokens (the keyword function and the number 1), even if it's not valid Lua grammar.
    // If the next character would be part of a valid identifier, then
    // this is not a keyword.
    if next_loc.index < raw.len() {
        let next_c = raw[next_loc.index];
        if next_c.is_alphanumeric() || next_c == '_' {
            return None;
        }
    }

    Some((
        Token {
            value,
            loc: initial_loc,
            kind: TokenKind::Keyword,
        },
        next_loc,
    ))
}
Syntax
Syntax (in this context) is just language junk that isn't operators: things like commas, parentheses, and semicolons.
fn lex_syntax(raw: &[char], initial_loc: Location) -> Option<(Token, Location)> {
    let syntax = [";", "=", "(", ")", ","];

    for possible_syntax in syntax {
        let c = raw[initial_loc.index];
        let next_loc = initial_loc.increment(false);

        // TODO: this won't work with multiple-character syntax bits like >= or ==
        if possible_syntax == c.to_string() {
            return Some((
                Token {
                    value: possible_syntax.to_string(),
                    loc: initial_loc,
                    kind: TokenKind::Syntax,
                },
                next_loc,
            ));
        }
    }

    None
}
Operators
Operators are things like plus, minus, and less-than symbols. Operators are syntax too, but it helps us later on to break these out into a separate type of token.
fn lex_operator(raw: &[char], initial_loc: Location) -> Option<(Token, Location)> {
    let operators = ["+", "-", "<"];

    for possible_syntax in operators {
        let c = raw[initial_loc.index];
        let next_loc = initial_loc.increment(false);

        // TODO: this won't work with multiple-character operators like >= or ==
        if possible_syntax == c.to_string() {
            return Some((
                Token {
                    value: possible_syntax.to_string(),
                    loc: initial_loc,
                    kind: TokenKind::Operator,
                },
                next_loc,
            ));
        }
    }

    None
}
And now we're all done lexing!
Grammar analysis
Parsing finds grammatical (tree) patterns in a flat list of tokens. This is called a syntax tree or abstract syntax tree (AST).
The boring part is defining the tree. Generally speaking (and specifically for this project), the syntax tree is a list of statements. Statements can be function definitions or expression statements or if statements or return statements or local declarations.
This goes in src/parse.rs.
#[derive(Debug)]
pub enum Statement {
    Expression(Expression),
    If(If),
    FunctionDeclaration(FunctionDeclaration),
    Return(Return),
    Local(Local),
}

pub type Ast = Vec<Statement>;
There's almost nothing special at all about the rest of the tree definitions.
#[derive(Debug)]
pub enum Literal {
    Identifier(Token),
    Number(Token),
}

#[derive(Debug)]
pub struct FunctionCall {
    pub name: Token,
    pub arguments: Vec<Expression>,
}

#[derive(Debug)]
pub struct BinaryOperation {
    pub operator: Token,
    pub left: Box<Expression>,
    pub right: Box<Expression>,
}

#[derive(Debug)]
pub enum Expression {
    FunctionCall(FunctionCall),
    BinaryOperation(BinaryOperation),
    Literal(Literal),
}

#[derive(Debug)]
pub struct FunctionDeclaration {
    pub name: Token,
    pub parameters: Vec<Token>,
    pub body: Vec<Statement>,
}

#[derive(Debug)]
pub struct If {
    pub test: Expression,
    pub body: Vec<Statement>,
}

#[derive(Debug)]
pub struct Local {
    pub name: Token,
    pub expression: Expression,
}

#[derive(Debug)]
pub struct Return {
    pub expression: Expression,
}
And that's it for the AST!
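For example, the statement local n1 = fib(n-1); parses into a tree shaped roughly like this (a sketch, not literal Rust; Token fields are abbreviated):

Statement::Local(Local {
    name: Token { value: "n1", kind: Identifier, .. },
    expression: Expression::FunctionCall(FunctionCall {
        name: Token { value: "fib", .. },
        arguments: vec![Expression::BinaryOperation(BinaryOperation {
            operator: Token { value: "-", .. },
            left: Box::new(Expression::Literal(Literal::Identifier(Token { value: "n", .. }))),
            right: Box::new(Expression::Literal(Literal::Number(Token { value: "1", .. }))),
        })],
    }),
})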
Some helpers
Lastly before the fun part, we'll define a few helpers for validating each kind of token.
fn expect_keyword(tokens: &[Token], index: usize, value: &str) -> bool {
    if index >= tokens.len() {
        return false;
    }

    let t = tokens[index].clone();
    t.kind == TokenKind::Keyword && t.value == value
}

fn expect_syntax(tokens: &[Token], index: usize, value: &str) -> bool {
    if index >= tokens.len() {
        return false;
    }

    let t = tokens[index].clone();
    t.kind == TokenKind::Syntax && t.value == value
}

fn expect_identifier(tokens: &[Token], index: usize) -> bool {
    if index >= tokens.len() {
        return false;
    }

    let t = tokens[index].clone();
    t.kind == TokenKind::Identifier
}
Now on to the fun part, actually detecting these trees!
Top-level parse
The top-level parse function and its major helper, parse_statement, dispatch very similarly to the top-level lex function. For each statement in the file we look for function declarations, if statements, return statements, local declarations, and expression statements.
fn parse_statement(raw: &[char], tokens: &[Token], index: usize) -> Option<(Statement, usize)> {
    let parsers = [
        parse_if,
        parse_expression_statement,
        parse_return,
        parse_function,
        parse_local,
    ];
    for parser in parsers {
        let res = parser(raw, tokens, index);
        if res.is_some() {
            return res;
        }
    }

    None
}

pub fn parse(raw: &[char], tokens: Vec<Token>) -> Result<Ast, String> {
    let mut ast = vec![];
    let mut index = 0;
    let ntokens = tokens.len();
    while index < ntokens {
        let res = parse_statement(raw, &tokens, index);
        if let Some((stmt, next_index)) = res {
            index = next_index;
            ast.push(stmt);
            continue;
        }

        return Err(tokens[index].loc.debug(raw, "Invalid token while parsing:"));
    }

    Ok(ast)
}
Expression statements
Expression statements are just a wrapper for the Rust type system. They call parse_expression (which we'll define shortly), expect a semicolon afterward, and wrap the expression in a statement.
fn parse_expression_statement(
    raw: &[char],
    tokens: &[Token],
    index: usize,
) -> Option<(Statement, usize)> {
    let mut next_index = index;
    let res = parse_expression(raw, tokens, next_index)?;

    let (expr, next_next_index) = res;
    next_index = next_next_index;
    if !expect_syntax(tokens, next_index, ";") {
        println!(
            "{}",
            tokens[next_index]
                .loc
                .debug(raw, "Expected semicolon after expression:")
        );
        return None;
    }
    next_index += 1; // Skip past semicolon

    Some((Statement::Expression(expr), next_index))
}
Expressions
Expressions in this minimal Lua are only one of function calls, literals (numbers, identifiers), or binary operations. To keep things very simple, binary operations cannot be combined. So instead of 1 + 2 + 3 we'd need to do local tmp1 = 1 + 2; local tmp2 = tmp1 + 3; and so on.
fn parse_expression(raw: &[char], tokens: &[Token], index: usize) -> Option<(Expression, usize)> {
    if index >= tokens.len() {
        return None;
    }

    let t = tokens[index].clone();
    let left = match t.kind {
        TokenKind::Number => Expression::Literal(Literal::Number(t)),
        TokenKind::Identifier => Expression::Literal(Literal::Identifier(t)),
        _ => {
            return None;
        }
    };
If what follows the first literal is an open parenthesis then we try to parse a function call.
    let mut next_index = index + 1;
    if expect_syntax(tokens, next_index, "(") {
        next_index += 1; // Skip past open paren

        // Function call
        let mut arguments: Vec<Expression> = vec![];
We need to call parse_expression recursively for every possible argument passed to the function.
        while !expect_syntax(tokens, next_index, ")") {
            if !arguments.is_empty() {
                if !expect_syntax(tokens, next_index, ",") {
                    println!(
                        "{}",
                        tokens[next_index]
                            .loc
                            .debug(raw, "Expected comma between function call arguments:")
                    );
                    return None;
                }

                next_index += 1; // Skip past comma
            }

            let res = parse_expression(raw, tokens, next_index);
            if let Some((arg, next_next_index)) = res {
                next_index = next_next_index;
                arguments.push(arg);
            } else {
                println!(
                    "{}",
                    tokens[next_index]
                        .loc
                        .debug(raw, "Expected valid expression in function call arguments:")
                );
                return None;
            }
        }
        next_index += 1; // Skip past closing paren

        return Some((
            Expression::FunctionCall(FunctionCall {
                name: tokens[index].clone(),
                arguments,
            }),
            next_index,
        ));
    }
Otherwise if there isn't an opening parenthesis then we could be parsing either a literal expression or a binary operation. If the token that follows is an operator token then we know it's a binary operation.
    // Might be a literal expression
    if next_index >= tokens.len() || tokens[next_index].clone().kind != TokenKind::Operator {
        return Some((left, next_index));
    }

    // Otherwise is a binary operation
    let op = tokens[next_index].clone();
    next_index += 1; // Skip past op

    if next_index >= tokens.len() {
        println!(
            "{}",
            op.loc
                .debug(raw, "Expected valid right hand side binary operand:")
        );
        return None;
    }

    let rtoken = tokens[next_index].clone();
It is at this point that we could (but won't) call parse_expression recursively. I don't want to deal with operator precedence right now, so we'll just require that the right hand side is another literal.
    let right = match rtoken.kind {
        TokenKind::Number => Expression::Literal(Literal::Number(rtoken)),
        TokenKind::Identifier => Expression::Literal(Literal::Identifier(rtoken)),
        _ => {
            println!(
                "{}",
                rtoken
                    .loc
                    .debug(raw, "Expected valid right hand side binary operand:")
            );
            return None;
        }
    };
    next_index += 1; // Skip past right hand operand

    Some((
        Expression::BinaryOperation(BinaryOperation {
            left: Box::new(left),
            right: Box::new(right),
            operator: op,
        }),
        next_index,
    ))
}
And now we're done parsing expressions!
Function declarations
Functions start with the function keyword, and an identifier token follows.
fn parse_function(raw: &[char], tokens: &[Token], index: usize) -> Option<(Statement, usize)> {
    if !expect_keyword(tokens, index, "function") {
        return None;
    }

    let mut next_index = index + 1;
    if !expect_identifier(tokens, next_index) {
        println!(
            "{}",
            tokens[next_index]
                .loc
                .debug(raw, "Expected valid identifier for function name:")
        );
        return None;
    }
    let name = tokens[next_index].clone();
After the function name comes the argument list that can be empty or a comma separated list of identifiers.
    next_index += 1; // Skip past name
    if !expect_syntax(tokens, next_index, "(") {
        println!(
            "{}",
            tokens[next_index]
                .loc
                .debug(raw, "Expected open parenthesis in function declaration:")
        );
        return None;
    }
    next_index += 1; // Skip past open paren

    let mut parameters: Vec<Token> = vec![];
    while !expect_syntax(tokens, next_index, ")") {
        if !parameters.is_empty() {
            if !expect_syntax(tokens, next_index, ",") {
                println!("{}", tokens[next_index].loc.debug(raw, "Expected comma or close parenthesis after parameter in function declaration:"));
                return None;
            }
            next_index += 1; // Skip past comma
        }

        parameters.push(tokens[next_index].clone());
        next_index += 1; // Skip past param
    }
    next_index += 1; // Skip past close paren
Next we parse all statements in the function body until we find the end keyword.
    let mut statements: Vec<Statement> = vec![];
    while !expect_keyword(tokens, next_index, "end") {
        let res = parse_statement(raw, tokens, next_index);
        if let Some((stmt, next_next_index)) = res {
            next_index = next_next_index;
            statements.push(stmt);
        } else {
            println!(
                "{}",
                tokens[next_index]
                    .loc
                    .debug(raw, "Expected valid statement in function declaration:")
            );
            return None;
        }
    }
    next_index += 1; // Skip past end

    Some((
        Statement::FunctionDeclaration(FunctionDeclaration {
            name,
            parameters,
            body: statements,
        }),
        next_index,
    ))
}
Phew! We're halfway through the parser.
Return statements
Return statements just check for the return keyword, an expression, and a semicolon.
fn parse_return(raw: &[char], tokens: &[Token], index: usize) -> Option<(Statement, usize)> {
    if !expect_keyword(tokens, index, "return") {
        return None;
    }

    let mut next_index = index + 1; // Skip past return
    let res = parse_expression(raw, tokens, next_index);
    if res.is_none() {
        println!(
            "{}",
            tokens[next_index]
                .loc
                .debug(raw, "Expected valid expression in return statement:")
        );
        return None;
    }

    let (expr, next_next_index) = res.unwrap();
    next_index = next_next_index;
    if !expect_syntax(tokens, next_index, ";") {
        println!(
            "{}",
            tokens[next_index]
                .loc
                .debug(raw, "Expected semicolon in return statement:")
        );
        return None;
    }
    next_index += 1; // Skip past semicolon

    Some((Statement::Return(Return { expression: expr }), next_index))
}
Local declarations
Local declarations start with the local keyword, then the local name, then an equal sign, then an expression, and then a semicolon.
fn parse_local(raw: &[char], tokens: &[Token], index: usize) -> Option<(Statement, usize)> {
    if !expect_keyword(tokens, index, "local") {
        return None;
    }

    let mut next_index = index + 1; // Skip past local
    if !expect_identifier(tokens, next_index) {
        println!(
            "{}",
            tokens[next_index]
                .loc
                .debug(raw, "Expected valid identifier for local name:")
        );
        return None;
    }

    let name = tokens[next_index].clone();
    next_index += 1; // Skip past name

    if !expect_syntax(tokens, next_index, "=") {
        println!(
            "{}",
            tokens[next_index]
                .loc
                .debug(raw, "Expected = syntax after local name:")
        );
        return None;
    }
    next_index += 1; // Skip past =

    let res = parse_expression(raw, tokens, next_index);
    if res.is_none() {
        println!(
            "{}",
            tokens[next_index]
                .loc
                .debug(raw, "Expected valid expression in local declaration:")
        );
        return None;
    }

    let (expr, next_next_index) = res.unwrap();
    next_index = next_next_index;
    if !expect_syntax(tokens, next_index, ";") {
        println!(
            "{}",
            tokens[next_index]
                .loc
                .debug(raw, "Expected semicolon in local declaration:")
        );
        return None;
    }
    next_index += 1; // Skip past semicolon

    Some((
        Statement::Local(Local {
            name,
            expression: expr,
        }),
        next_index,
    ))
}
If statements
This implementation of Lua doesn't support elseif, so parsing if just checks for the if keyword followed by a test expression, then the then keyword, then the if body (a list of statements), and then the end keyword.
fn parse_if(raw: &[char], tokens: &[Token], index: usize) -> Option<(Statement, usize)> {
    if !expect_keyword(tokens, index, "if") {
        return None;
    }

    let mut next_index = index + 1; // Skip past if
    let res = parse_expression(raw, tokens, next_index);
    if res.is_none() {
        println!(
            "{}",
            tokens[next_index]
                .loc
                .debug(raw, "Expected valid expression for if test:")
        );
        return None;
    }

    let (test, next_next_index) = res.unwrap();
    next_index = next_next_index;

    if !expect_keyword(tokens, next_index, "then") {
        return None;
    }
    next_index += 1; // Skip past then

    let mut statements: Vec<Statement> = vec![];
    while !expect_keyword(tokens, next_index, "end") {
        let res = parse_statement(raw, tokens, next_index);
        if let Some((stmt, next_next_index)) = res {
            next_index = next_next_index;
            statements.push(stmt);
        } else {
            println!(
                "{}",
                tokens[next_index]
                    .loc
                    .debug(raw, "Expected valid statement in if body:")
            );
            return None;
        }
    }
    next_index += 1; // Skip past end

    Some((
        Statement::If(If {
            test,
            body: statements,
        }),
        next_index,
    ))
}
And goshdarnit we're done parsing.
Compiling to a made up virtual machine
This virtual machine will be entirely stack-based other than the stack pointer and program counter.
The calling convention is that the caller will put arguments on the stack followed by the frame pointer, the program counter, and then the number of arguments (for cleanup). Then it will alter the program counter and frame pointer. Then the caller will allocate space on the stack for all arguments and locals within the function.
For simplicity in addressing modes, the function declaration once jumped to will copy the arguments from before the frame pointer to in front of it (yes I know, I know, this is silly).
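Concretely, just after a call to a one-argument function like fib, the stack looks like this (a sketch; fp is the frame pointer):

index     contents
fp - 4    argument n
fp - 3    saved frame pointer
fp - 2    saved program counter (the return address)
fp - 1    number of arguments
fp + 0    copy of argument n (the silly duplication mentioned above)
fp + 1    locals n1, n2, ...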
The virtual machine will support add, subtract, and less-than operations as well as jump, jump-if-zero, return, and call. It will support a few more memory-specific instructions for loading literals, loading identifiers, and managing arguments.
I'll explain the non-obvious instructions as we implement them.
use crate::parse::*;
use std::collections::HashMap;

#[derive(Debug)]
enum Instruction {
    DupPlusFP(i32),
    MoveMinusFP(usize, i32),
    MovePlusFP(usize),
    Store(i32),
    Return,
    JumpIfZero(String),
    Jump(String),
    Call(String, usize),
    Add,
    Subtract,
    LessThan,
}
The result of compiling will be a Program instance. This instance will contain symbol information and the actual instructions to run.
#[derive(Debug)]
struct Symbol {
    location: i32,
    narguments: usize,
    nlocals: usize,
}

#[derive(Debug)]
pub struct Program {
    syms: HashMap<String, Symbol>,
    instructions: Vec<Instruction>,
}
Compiling, similar to parsing, just calls the helper compile_statement for each statement in the AST.
pub fn compile(raw: &[char], ast: Ast) -> Program {
    let mut locals: HashMap<String, i32> = HashMap::new();
    let mut pgrm = Program {
        syms: HashMap::new(),
        instructions: Vec::new(),
    };
    for stmt in ast {
        compile_statement(&mut pgrm, raw, &mut locals, stmt);
    }

    pgrm
}
And compile_statement dispatches to additional helpers based on the kind of statement.
fn compile_statement(
    pgrm: &mut Program,
    raw: &[char],
    locals: &mut HashMap<String, i32>,
    stmt: Statement,
) {
    match stmt {
        Statement::FunctionDeclaration(fd) => compile_declaration(pgrm, raw, locals, fd),
        Statement::Return(r) => compile_return(pgrm, raw, locals, r),
        Statement::If(if_) => compile_if(pgrm, raw, locals, if_),
        Statement::Local(loc) => compile_local(pgrm, raw, locals, loc),
        Statement::Expression(e) => compile_expression(pgrm, raw, locals, e),
    }
}
Function declarations
Let's do the hard one first. Function declarations are wrapped in an unconditional jump guard so that we can evaluate from the 0th instruction at the top level and have only non-function-declaration statements be evaluated.
fn compile_declaration(
    pgrm: &mut Program,
    raw: &[char],
    _: &mut HashMap<String, i32>,
    fd: FunctionDeclaration,
) {
    // Jump to end of function to guard top-level
    let done_label = format!("function_done_{}", pgrm.instructions.len());
    pgrm.instructions
        .push(Instruction::Jump(done_label.clone()));
Then we'll add another limitation/simplification that local variables are only accessible within the current function scope.
For each parameter, we'll copy the parameter on the stack before the frame pointer to a place in front of the frame pointer. This gets around addressing mode limitations in our virtual machine.
    let mut new_locals = HashMap::<String, i32>::new();

    let function_index = pgrm.instructions.len() as i32;
    let narguments = fd.parameters.len();
    for (i, param) in fd.parameters.iter().enumerate() {
        pgrm.instructions.push(Instruction::MoveMinusFP(
            i,
            narguments as i32 - (i as i32 + 1),
        ));
        new_locals.insert(param.value.clone(), i as i32);
    }
Then we compile the body.
    for stmt in fd.body {
        compile_statement(pgrm, raw, &mut new_locals, stmt);
    }
Once the body is compiled we know the total number of locals, so we can fill out the symbol table correctly. Importantly, the location of the function's first instruction was already captured before compiling the body.
    pgrm.syms.insert(
        fd.name.value,
        Symbol {
            location: function_index,
            narguments,
            nlocals: new_locals.keys().len(),
        },
    );
Finally we add a symbol linking the done label for the function to the position of the end of the function. Again, this allows us to skip past the function declaration when evaluating instructions from 0 to N.
    pgrm.syms.insert(
        done_label,
        Symbol {
            location: pgrm.instructions.len() as i32,
            narguments: 0,
            nlocals: 0,
        },
    );
}
Ok that wasn't so bad. And the rest are simpler.
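To make that concrete, a tiny declaration like function id(x) return x; end, compiled as the first statement in a program, produces (a sketch):

0: Jump("function_done_0")  // guard so top-level execution skips the body
1: MoveMinusFP(0, 0)        // copy argument x in front of the frame pointer
2: DupPlusFP(0)             // push x for the return expression
3: Return
// syms: "id" -> location 1, narguments 1, nlocals 1
//       "function_done_0" -> location 4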
Local declarations
The expression for the local is compiled and then the local name is
stored in a locals table mapped to the current number of locals
(including arguments). This allows the compiler to turn identifier
token lookups into simply an offset from the frame pointer.
fn compile_local(
    pgrm: &mut Program,
    raw: &[char],
    locals: &mut HashMap<String, i32>,
    local: Local,
) {
    let index = locals.keys().len();
    locals.insert(local.name.value, index as i32);
    compile_expression(pgrm, raw, locals, local.expression);
    pgrm.instructions.push(Instruction::MovePlusFP(index));
}
And specifically, the instruction pattern is to evaluate the expression and then copy it back into a relative position in the stack.
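For instance, local a = 1; as the first statement in a program compiles to (a sketch):

Store(1)        // evaluate the expression; the result lands on top of the stack
MovePlusFP(0)   // pop it into slot 0 relative to the frame pointer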
Literals
Number literals use the store instruction for pushing a number onto the stack. Identifier literals are copied to the top of the stack from their position relative to the frame pointer.
fn compile_literal(
    pgrm: &mut Program,
    _: &[char],
    locals: &mut HashMap<String, i32>,
    lit: Literal,
) {
    match lit {
        Literal::Number(i) => {
            let n = i.value.parse::<i32>().unwrap();
            pgrm.instructions.push(Instruction::Store(n));
        }
        Literal::Identifier(ident) => {
            pgrm.instructions
                .push(Instruction::DupPlusFP(locals[&ident.value]));
        }
    }
}
Function calls
Pretty simple: just compile all the arguments and then issue a call instruction.
fn compile_function_call(
    pgrm: &mut Program,
    raw: &[char],
    locals: &mut HashMap<String, i32>,
    fc: FunctionCall,
) {
    let len = fc.arguments.len();
    for arg in fc.arguments {
        compile_expression(pgrm, raw, locals, arg);
    }

    pgrm.instructions
        .push(Instruction::Call(fc.name.value, len));
}
Binary operations
Binary operations compile the left, then the right, and then issue an instruction based on the operator. All the operators are builtin and act on the top two elements on the stack.
fn compile_binary_operation(
    pgrm: &mut Program,
    raw: &[char],
    locals: &mut HashMap<String, i32>,
    bop: BinaryOperation,
) {
    compile_expression(pgrm, raw, locals, *bop.left);
    compile_expression(pgrm, raw, locals, *bop.right);
    match bop.operator.value.as_str() {
        "+" => {
            pgrm.instructions.push(Instruction::Add);
        }
        "-" => {
            pgrm.instructions.push(Instruction::Subtract);
        }
        "<" => {
            pgrm.instructions.push(Instruction::LessThan);
        }
        _ => panic!(
            "{}",
            bop.operator
                .loc
                .debug(raw, "Unable to compile binary operation:")
        ),
    }
}
Expressions
Compiling expressions just dispatches to a compile helper based on the type of expression. We've already written those three helpers.
fn compile_expression(
    pgrm: &mut Program,
    raw: &[char],
    locals: &mut HashMap<String, i32>,
    exp: Expression,
) {
    match exp {
        Expression::BinaryOperation(bop) => {
            compile_binary_operation(pgrm, raw, locals, bop);
        }
        Expression::FunctionCall(fc) => {
            compile_function_call(pgrm, raw, locals, fc);
        }
        Expression::Literal(lit) => {
            compile_literal(pgrm, raw, locals, lit);
        }
    }
}
If
First we compile the conditional test, then emit a jump past the if body that fires when the test result is zero (i.e. false).
fn compile_if(pgrm: &mut Program, raw: &[char], locals: &mut HashMap<String, i32>, if_: If) {
    compile_expression(pgrm, raw, locals, if_.test);
    let done_label = format!("if_else_{}", pgrm.instructions.len());
    pgrm.instructions
        .push(Instruction::JumpIfZero(done_label.clone()));
Then we compile the body.
    for stmt in if_.body {
        compile_statement(pgrm, raw, locals, stmt);
    }
And finally we make sure to insert the done symbol in the right place after the if body.
    pgrm.syms.insert(
        done_label,
        Symbol {
            // One instruction early, because the jump handler
            // increments the program counter after jumping
            location: pgrm.instructions.len() as i32 - 1,
            nlocals: 0,
            narguments: 0,
        },
    );
}
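For example, inside fib the statement if n < 2 then return n; end compiles to something like this (a sketch, assuming n lives in local slot 0 and the test begins at instruction 10):

10: DupPlusFP(0)              // push n
11: Store(2)                  // push 2
12: LessThan
13: JumpIfZero("if_else_13")  // skip the body when the test is false
14: DupPlusFP(0)              // push n
15: Return
// syms: "if_else_13" -> location 15 (one early; the jump handler adds 1)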
Return
The final statement type is return. We simply compile the return expression and issue a return instruction.
fn compile_return(
    pgrm: &mut Program,
    raw: &[char],
    locals: &mut HashMap<String, i32>,
    ret: Return,
) {
    compile_expression(pgrm, raw, locals, ret.expression);
    pgrm.instructions.push(Instruction::Return);
}
That's it for the compiler! Now the trickiest part. I lost a few hours debugging and iterating on the next bit.
The virtual machine
Ok so the easy part is that there are only two registers, a program counter and a frame pointer. There's also a data stack. The frame pointer points to the location on the data stack where each function can start storing its locals.
Evaluation starts from 0 and goes until the last instruction.
pub fn eval(pgrm: Program) {
    let mut pc: i32 = 0;
    let mut fp: i32 = 0;
    let mut data: Vec<i32> = vec![];

    while pc < pgrm.instructions.len() as i32 {
        match &pgrm.instructions[pc as usize] {
Each instruction will be responsible for incrementing the program counter or having it jump around.
Addition, subtraction, less than
The easiest ones are the math operators. We just pop off the data stack, perform the operation, and store the result.
            Instruction::Add => {
                let right = data.pop().unwrap();
                let left = data.pop().unwrap();
                data.push(left + right);
                pc += 1;
            }
            Instruction::Subtract => {
                let right = data.pop().unwrap();
                let left = data.pop().unwrap();
                data.push(left - right);
                pc += 1;
            }
            Instruction::LessThan => {
                let right = data.pop().unwrap();
                let left = data.pop().unwrap();
                data.push(if left < right { 1 } else { 0 });
                pc += 1;
            }
The store instruction is another easy one. It just pushes a literal number onto the stack.
            Instruction::Store(n) => {
                data.push(*n);
                pc += 1;
            }
Jump variants
The jump variants are easy too. Just grab the location and change the program counter. If it's a conditional jump then test the condition first.
            Instruction::JumpIfZero(label) => {
                let top = data.pop().unwrap();
                if top == 0 {
                    pc = pgrm.syms[label].location;
                }
                pc += 1;
            }
            Instruction::Jump(label) => {
                pc = pgrm.syms[label].location;
            }
Storing locals
The MovePlusFP instruction pops the top of the stack and copies the value into the stack at an offset from the frame pointer. This is how compile_local stores a local's value once its expression has been evaluated.
            Instruction::MovePlusFP(i) => {
                let val = data.pop().unwrap();
                let index = fp as usize + *i;
                // Accounts for top-level locals
                while index >= data.len() {
                    data.push(0);
                }

                data[index] = val;
                pc += 1;
            }
Loading from a variable
The DupPlusFP instruction copies a value from the stack (at an offset from the frame pointer) onto the top of the stack. This is how compile_literal loads references to arguments and locals.
            Instruction::DupPlusFP(i) => {
                data.push(data[(fp + i) as usize]);
                pc += 1;
            }
Duplicating arguments
The MoveMinusFP
instruction is, again, a hack to work around limited
addressing modes in this minimal virtual machine. It copies
arguments from behind the frame pointer to in front of the frame
pointer.
            Instruction::MoveMinusFP(local_offset, fp_offset) => {
                // Arguments sit below the three values saved by Call (fp, pc,
                // narguments), so the last argument is at fp - 4
                data[fp as usize + local_offset] = data[(fp - (fp_offset + 4)) as usize];
                pc += 1;
            }
Now we're down to the last two instructions: call and return.
Call
Call has a special dispatch for builtin functions (the only one that exists being print).
            Instruction::Call(label, narguments) => {
                // Handle builtin functions
                if label == "print" {
                    // Arguments come off the stack in reverse order
                    for _ in 0..*narguments {
                        print!("{}", data.pop().unwrap());
                        print!(" ");
                    }
                    println!();
                    pc += 1;
                    continue;
                }
Otherwise it pushes the current frame pointer, then the program counter, and finally the number of arguments (not locals) onto the stack for preservation. Then it sets up the new program counter and frame pointer and creates space for all locals and arguments after the new frame pointer.
                data.push(fp);
                data.push(pc + 1);
                data.push(pgrm.syms[label].narguments as i32);
                pc = pgrm.syms[label].location;
                fp = data.len() as i32;

                // Set up space for all arguments/locals
                let mut nlocals = pgrm.syms[label].nlocals;
                while nlocals > 0 {
                    data.push(0);
                    nlocals -= 1;
                }
            }
Return
The return instruction pops the return value from the stack. Then it pops off all locals. Then it restores the program counter and frame pointer and pops off the arguments that sit before the saved frame pointer. Finally it pushes the return value back onto the stack.
            Instruction::Return => {
                let ret = data.pop().unwrap();

                // Clean up the local stack
                while fp < data.len() as i32 {
                    data.pop();
                }

                // Restore pc and fp
                let mut narguments = data.pop().unwrap();
                pc = data.pop().unwrap();
                fp = data.pop().unwrap();

                // Clean up arguments
                while narguments > 0 {
                    data.pop();
                    narguments -= 1;
                }

                // Add back return value
                data.push(ret);
            }
        }
    }
}
And yes, this implementation would be more efficient if instead of literally pushing and popping we just incremented/decremented a stack pointer.
And that's it! We're completely done with a basic parser, compiler, and virtual machine for a subset of Lua. Is it janky? Yeah. Is it simple? Kind of? Does it work? It seems to!
Summary
Ok, we've got fewer than 1,200 lines of Rust, enough to run some decent Lua programs. If we run the fib program against this implementation and against Lua 5.4.3 (which isn't LuaJIT), what do we see?
$ cargo build --release
$ cat test/fib.lua
function fib(n)
    if n < 2 then
        return n;
    end

    local n1 = fib(n-1);
    local n2 = fib(n-2);
    return n1 + n2;
end

print(fib(30));
$ time ./target/release/lust test/fib.lua
832040
./target/release/lust test/fib.lua 0.29s user 0.00s system 99% cpu 0.293 total
$ time lua test/fib.lua
832040
lua test/fib.lua 0.06s user 0.00s system 99% cpu 0.063 total
This implementation is quite a bit slower, nearly 5x! Time to do some profiling and maybe revisit some of those aforementioned inefficiencies.
Big thanks to Christian Scott on Twitter for pointing out I should not be benchmarking with debug builds! And thanks to reddit123123123123 on Reddit for suggesting I use cargo clippy to clean up my code.

Thanks to GiffE on Github for pointing out some key inconsistencies between this implementation and Lua. I won't modify anything because a perfect Lua subset wasn't the goal, but I'm sharing because it was good analysis and criticism of this implementation.