Better yet, take a look at this post walking through emulating x86 ELF binaries in Go:
Emulating linux/AMD64 userland: interpreting an ELF binary
Next up in emulator basics:
2. system calls
In this post we'll create a small virtual machine in JavaScript and use it to run a simple C program compiled with GCC for an x86_64 (or AMD64) CPU running Linux.
All source code is available on Github.
Virtual machine data storage
Our virtual machine will have two means of storing data: registers and an integer stack. Each register can store a 64-bit integer. The stack is an array of 8-bit (or 1 byte) integers.
We'll make the following registers available for modification and use by the program(mer):
RDI, RSI, RSP, RBP, RAX, RBX, RCX, RDX, R8, R9, R10, R11, R12, R13, R14, R15
The RSP
register is used by the virtual machine for
tracking the location of the last entry in the stack. It will be
modified by the virtual machine when it encounters the
POP
, PUSH
, CALL
and
RET
instructions we'll support. We'll get into the
specifics shortly.
And we'll make the following registers available for use (but not modification) by the program(mer):
RIP, CS, DS, FS, SS, ES, GS, CF, ZF, PF, AF, SF, TF, IF, DF, OF
Each of these has a special meaning but we'll focus on
RIP
. The RIP
register contains the address
of the instruction currently being interpreted by our virtual
machine. After every instruction the virtual machine will increment
the value in this register -- except for a few special instructions
like CALL
and RET
.
Memory addresses
It will become useful to provide direct access to memory with a special syntax. We'll focus just on 64-bit addresses that will look like this:
MOV QWORD PTR [RBP - 8], 12
This asks for the value 12
to be written into the memory
address at RBP - 8
bytes. The QWORD PTR
part
clarifies that we want to write 8 bytes worth of the value. Since
12
is less than 8 bytes, the rest will be filled with
zeros.
ADD RAX, QWORD PTR [RBP - 8]
This asks for eight bytes starting from the memory address RBP -
8
to be added to the value in RAX
and stored back
in RAX
.
Virtual machine instruction set
In our virtual machine we'll define support for the following instructions:
MOV $REGISTER, $REGISTER or $MEMORY ADDRESS or $LITERAL NUMBER
- This instruction copies the second value into the first.
ADD $REGISTER, $REGISTER or $MEMORY ADDRESS
- This instruction adds the second value into the first and stores the result into the first.
PUSH $REGISTER
- This instruction will decrement the
RSP
register by 8 bytes and store the value at the bottom of the stack.
- This instruction will decrement the
POP $REGISTER
- This instruction will increment the
RSP
register by 8 bytes, remove the last element in the stack (at the bottom), and store it into the register.
- This instruction will increment the
CALL $LABEL
- This instruction will push the value in the
RIP
register (plus one) onto the stack and set theRIP
register to the line of code of the label. More on this later.
- This instruction will push the value in the
RET
- This instruction will remove the value at the bottom of the stack and store it in the
RIP
register.
- This instruction will remove the value at the bottom of the stack and store it in the
Now we have more than enough instructions to write some interesting programs for the virtual machine.
Virtual machine semantics
We'll make one last assumption before explaining further. In our
programs, there must be a main
label which must contain
a RET
instruction. Once we hit the terminal
RET
, we will exit the virtual machine and set the exit
code to the value stored in the RAX
register.
Let's look at a simple program:
main: ; the required main label
MOV RAX, 1 ; store 1 in RAX
MOV RDI, 2 ; store 2 in RDI
ADD RAX, RDI ; store the result of adding RAX and RDI in RAX
RET ; give control back to the virtual machine
When we run this program, first we initialize a stack (we'll give it
1000 elements) and set the RSP
register to 1000 (the top
of the stack). Then we look for the main
label and set
the RIP
register to 1, the line number after the label
appears (0). Then until the RIP
register is 1000 again,
we interpret the instruction at the line number stored in the
RIP
register. Once the RIP
register hits
1000, we exit the program setting RAX
as the exit code.
One more example
Now let's look at one more program:
plus:
ADD RDI, RSI
MOV RAX, RDI
RET
main:
MOV RDI, 1
MOV RSI, 2
CALL plus
RET
Our virtual machine will start at the line after the
main
label. Then it will store 1
into
RDI
and 2
into RSI
. Then it
will jump to the second line in the program to add RDI
and RSI
and store the result in RDI
. Then it
will copy RDI
into RAX
and return control to
the final line. This last RET
will in turn return control
to the virtual machine. Then the program will exit with exit code
3
.
Parsing
Now that we've finished up describing our virtual machine language and semantics, we need to parse the instructions into a format we can easily interpret.
To do this we'll iterate over the program skip any lines that start with a dot. These are virtual machine directives that are important for us to ignore for now. We'll also remove any characters including and following a semi-colon or hash-tag, until the end of the line. These are comments.
We'll store a dictionary of label names to line numbers (the line number of the label plus one) and without the colon.
And we'll store the instructions as an array of objects composed of an operation and optional operands.
Code
function parse(program) {
const labels = {};
const instructions = [];
const lines = program.split('\n');
for (let i = 0; i < lines.length; i++) {
let line = lines[i].trim(); // Remove any trailing, leading whitespace
// TODO: handle each line
}
return { labels, instructions };
}
First let's handle the directives we want to ignore:
for (let i = 0; i < lines.length; i++) {
let line = lines[i].trim(); // Remove any trailing, leading whitespace
if (line.startsWith('.')) {
continue;
}
}
And then comments:
for (let i = 0; i < lines.length; i++) {
let line = lines[i].trim(); // Remove any trailing, leading whitespace
if (line.startsWith('.') || line.startsWith(';') || line.startsWith('#')) {
continue;
}
if (line.includes(';')) {
line = line.split(';')[0];
}
if (line.includes('#')) {
line = line.split('#')[0];
}
if (!line) {
continue;
}
}
And then labels:
for (let i = 0; i < lines.length; i++) {
let line = lines[i].trim(); // Remove any trailing, leading whitespace
if (line.startsWith('.') || line.startsWith(';') || line.startsWith('#')) {
continue;
}
if (line.includes(';')) {
line = line.split(';')[0];
}
if (line.includes('#')) {
line = line.split('#')[0];
}
if (!line) {
continue;
}
if (line.includes(':')) {
const label = line.split(':')[0];
labels[label] = instructions.length;
continue;
}
}
And finally instruction parsing plus the rest:
function parse(program) {
const labels = {};
const instructions = [];
const lines = program.split('\n');
for (let i = 0; i < lines.length; i++) {
let line = lines[i].trim(); // Remove any trailing, leading whitespace
if (line.startsWith('.') || line.startsWith(';')) {
continue;
}
if (line.includes(';')) {
line = line.split(';')[0];
}
if (line.includes(':')) {
const label = line.split(':')[0];
labels[label] = instructions.length;
continue;
}
const operation = line.split(/\s/)[0].toLowerCase();
const operands = line.substring(operation.length).split(',').map(t => t.trim());
instructions.push({
operation,
operands,
});
}
return { labels, instructions };
}
Hurray! A brittle parser.
Interpreting
We've already described the semantics a few times. So let's get started with the foundation and initialization.
We'll use BigInts because JavaScript integers are 53-bits wide. This isn't incredibly important in our basic programs but it will quickly became painful without.
And we'll make process memory available as an array of 8-bit integers. In order to make this easy to use, we'll also provide helper function for writing to and reading from memory.
const fs = require('fs');
const REGISTERS = [
'RDI', 'RSI', 'RSP', 'RBP', 'RAX', 'RBX', 'RCX', 'RDX', 'RIP', 'R8',
'R9', 'R10', 'R11', 'R12', 'R13', 'R14', 'R15', 'CS', 'DS', 'FS',
'SS', 'ES', 'GS', 'CF', 'ZF', 'PF', 'AF', 'SF', 'TF', 'IF', 'DF', 'OF',
];
function writeMemoryBytes(process, address, value, size) {
for (let i = 0n; i < size; i++) {
value >>= i * 8n;
process.memory[address + i] = value & 0xFFn;
}
}
function readMemoryBytes(process, address, size) {
let value = 0n;
for (let i = 0n; i < size; i++) {
value |= (process.memory[address + i] || 0n) << (i * 8n);
}
return value;
}
function interpret(process) {
// TODO: interpret
}
function main(file) {
const memory = new Array(10000);
const code = fs.readFileSync(file).toString();
const { instructions, labels } = parse(code);
const registers = REGISTERS.reduce((rs, r) => ({ ...rs, [r]: 0n }), {});
registers.RIP = BigInt(labels.main === undefined ? labels._main : labels.main);
registers.RSP = BigInt(memory.length - 8);
const process = {
registers,
memory,
instructions,
labels,
};
writeMemoryBytes(process, registers.RSP, registers.RSP, 8);
interpret(process);
return Number(registers.RAX);
}
process.exit(main(process.argv[2]));
We'll accept _main
as an entry point as well as
main
to support our macOS users. If you know why our
macOS users use _main
I'd love to know.
To interpret, we grab the instruction pointed to in RIP
and switch on the operation.
function interpret(process) {
do {
const instruction = process.instructions[process.registers.RIP];
switch (instruction.operation.toLowerCase()) {
case 'mov':
break;
case 'add':
break;
case 'call':
break;
case 'ret':
break;
case 'push':
break;
case 'pop':
break;
}
} while (process.registers.RIP != BigInt(readMemoryBytes(process, memory.length - 8, 8)));
}
Interpreting MOV
Example:
MOV RAX, 1
MOV RAX, RDI
MOV QWORD PTR [RBP - 8], 8
This instruction will store a value into a register or address and
increment RIP
. If the left-hand side is a memory address
we will write to memory.
case 'mov': {
const lhs = interpretValue(process, instruction.operands[0], { lhs: true });
const rhs = interpretValue(process, instruction.operands[1]);
if (REGISTERS.includes(lhs)) {
process.registers[lhs] = rhs;
} else {
writeMemoryBytes(process, lhs.address, rhs, lhs.size);
}
process.registers.RIP++;
break;
}
We're delegating to a helper function to handle registers vs. memory addresses vs. literals appropriately. Without memory addresses it's a simple function:
function interpretValue(process, value, { lhs } = { lhs: false }) {
if (REGISTERS.includes(value)) {
if (lhs) {
return value;
} else {
return process.registers[value];
}
}
return BigInt.asIntN(64, value);
}
We need to do some hacking to support memory addresses:
function interpretValue(process, value, { lhs } = { lhs: false }) {
if (REGISTERS.includes(value)) {
if (lhs) {
return value;
} else {
return process.registers[value];
}
}
if (value.startsWith('QWORD PTR [')) {
const offsetString = value.substring('QWORD PTR ['.length, value.length - 1).trim();
if (offsetString.includes('-')) {
const [l, r] = offsetString.split('-').map(l => interpretValue(process, l.trim()));
const address = l - r;
const bytes = 8; // qword is 8 bytes
if (lhs) {
return { address, size: bytes };
} else {
return readMemoryBytes(process, address, bytes);
}
}
throw new Error('Unsupported offset calculation: ' + value);
}
return BigInt.asIntN(64, value);
}
Interpreting ADD
Example:
ADD RAX, RDI
This instruction will combine both registers and store the result in
the first, then increment the RIP
register.
case 'add': {
const lhs = interpretValue(process, instruction.operands[0], { lhs: true });
const rhs = interpretValue(process, instruction.operands[1]);
process.registers[lhs] += rhs;
process.registers.RIP++;
break;
}
Interpreting CALL
Example:
CALL plus
This instruction store RIP
(plus one, to continue after
the call instruction) on the stack and sets RIP
to the
location specified by the label.
case 'call': {
process.registers.RSP -= 8n;
writeMemoryBytes(process, process.registers.RSP, process.registers.RIP + 1n, 8);
const label = instruction.operands[0];
process.registers.RIP = process.labels[label];
break;
}
Interpreting RET
Example:
RET
This instruction removes the last element from the stack and stores it
in the RIP
register.
case 'ret': {
const value = readMemoryBytes(process, process.registers.RSP, 8);
process.registers.RSP += 8n;
process.registers.RIP = value;
break;
}
Interpreting PUSH
Example:
PUSH RAX
This instruction stores the value in the register on the stack and
increments RIP
.
case 'push': {
const value = interpretValue(process, instruction.operands[0]);
process.registers.RSP -= 8n;
writeMemoryBytes(process, process.registers.RSP, value, 8);
process.registers.RIP++;
break;
}
Interpreting POP
Example:
POP RAX
This instruction removes the last element from the stack and stores it
into the register specified. Then it increments RIP
.
case 'pop': {
const lhs = interpretValue(process, instruction.operands[0], { lhs: true });
const value = readMemoryBytes(process, process.registers.RSP, 8);
process.registers.RSP += 8n;
process.registers[lhs] = value;
process.registers.RIP++;
break;
}
All together
$ cat test1.asm
main: ; the required main label
MOV RAX, 1 ; store 1 in RAX
MOV RDI, 2 ; store 2 in RDI
ADD RAX, RDI ; store the result of adding RAX and RDI in RAX
RET ; give control back to the virtual machine
$ node emulator.js test1.asm
$ echo $?
3
And finally, let's see what we can do with a simple C program:
$ cat plus.c
long main() {
long a = 5;
long b = 6;
return a + b;
}
$ gcc -S -masm=intel -o plus.s plus.c
$ node emulator.js plus.s
$ echo $?
11
And we've got the start of a working x86_64/AMD64 emulator.
Next steps
We aren't setting flags appropriately to support conditionals, so that's low-hanging fruit. Additionally, syscalls open up a new world (that we'll end up needing since exit codes are limited to 8-bits of information). Additionally, our parsing is brittle. Dealing with ELF files may be a better direction to go and also enables more. We'll explore these aspects and others in follow-up posts.
Human interest
I originally intended to build a GameBoy emulator because the hardware is simple and uniform. But I found it easiest to start hacking together an AMD64 emulator because AMD64 is well-documented and gcc is easy enough to use. I'm still interested though unless/until I figure out how to emulate a graphics card for AMD64.
It's tricky! But not that tricky. I built a graphical debugger around this emulator to help out with the logic and off-by-one errors. But ultimately it's been surprising to me how easy it is to get started -- especially when I'm not concerned about absolute correctness (yet).
Here's my first post on a series on emulator basics. It's baby's first stack and register virtual machine and it turns out it runs x86 code. https://t.co/WiWmGedawt #linux #assembly https://t.co/xjiMkhgpdN
— Phil Eaton (@phil_eaton) May 24, 2019