Notes

Emulator basics: a stack and register machine

May 21, 2019

javascriptassemblyamd64linuxx86_64interpreters

Next up in emulator basics:
Emulator basics (2): system calls

In this post we'll create a small virtual machine in JavaScript and use it to run a simple C program compiled with GCC for an x86_64 (or AMD64) CPU running Linux.

All source code is available on Github.

The first version of this post incremented/decremented the stack pointer, RSP, by 1 instead of 8 bytes. This was a mistake because the first version of this project stored the stack as an array of 64-bit integers rather than 8-bit integers (a byte array).

Virtual machine data storage

Our virtual machine will have two means of storing data: registers and an integer stack. Each register can store a 64-bit integer. The stack is an array of 8-bit (or 1 byte) integers.

We'll make the following registers available for modification and use by the program(mer):

RDI, RSI, RSP, RBP, RAX, RBX, RCX, RDX, R8, R9, R10, R11, R12, R13, R14, R15

The RSP register is used by the virtual machine for tracking the location of the last entry in the stack. It will be modified by the virtual machine when it encounters the POP, PUSH, CALL and RET instructions we'll support. We'll get into the specifics shortly.

And we'll make the following registers available for use (but not modification) by the program(mer):

RIP, CS, DS, FS, SS, ES, GS, CF, ZF, PF, AF, SF, TF, IF, DF, OF

Each of these has a special meaning but we'll focus on RIP. The RIP register contains the address of the instruction currently being interpreted by our virtual machine. After every instruction the virtual machine will increment the value in this register -- except for a few special instructions like CALL and RET.

Memory addresses

It will become useful to provide direct access to memory with a special syntax. We'll focus just on 64-bit addresses that will look like this:

  MOV QWORD PTR [RBP - 8], 12

This asks for the value 12 to be written into the memory address at RBP - 8 bytes. The QWORD PTR part clarifies that we want to write 8 bytes worth of the value. Since 12 is less than 8 bytes, the rest will be filled with zeros.

  ADD RAX, QWORD PTR [RBP - 8]

This asks for eight bytes starting from the memory address RBP - 8 to be added to the value in RAX and stored back in RAX.

Virtual machine instruction set

In our virtual machine we'll define support for the following instructions:

Now we have more than enough instructions to write some interesting programs for the virtual machine.

Virtual machine semantics

We'll make one last assumption before explaining further. In our programs, there must be a main label which must contain a RET instruction. Once we hit the terminal RET, we will exit the virtual machine and set the exit code to the value stored in the RAX register.

Let's look at a simple program:

main: ; the required main label
  MOV RAX, 1 ; store 1 in RAX
  MOV RDI, 2 ; store 2 in RDI
  ADD RAX, RDI ; store the result of adding RAX and RDI in RAX
  RET ; give control back to the virtual machine

When we run this program, first we initialize a stack (we'll give it 1000 elements) and set the RSP register to 1000 (the top of the stack). Then we look for the main label and set the RIP register to 1, the line number after the label appears (0). Then until the RIP register is 1000 again, we interpret the instruction at the line number stored in the RIP register. Once the RIP register hits 1000, we exit the program setting RAX as the exit code.

One more example

Now let's look at one more program:

plus:
  ADD RDI, RSI
  MOV RAX, RDI
  RET

main:
  MOV RDI, 1
  MOV RSI, 2
  CALL plus
  RET

Our virtual machine will start at the line after the main label. Then it will store 1 into RDI and 2 into RSI. Then it will jump to the second line in the program to add RDI and RSI and store the result in RDI. Then it will copy RDI into RAX and return control to the final line. This last RET will in turn return control to the virtual machine. Then the program will exit with exit code 3.

Parsing

Now that we've finished up describing our virtual machine language and semantics, we need to parse the instructions into a format we can easily interpret.

To do this we'll iterate over the program skip any lines that start with a dot. These are virtual machine directives that are important for us to ignore for now. We'll also remove any characters including and following a semi-colon or hash-tag, until the end of the line. These are comments.

We'll store a dictionary of label names to line numbers (the line number of the label plus one) and without the colon.

And we'll store the instructions as an array of objects composed of an operation and optional operands.

Code

function parse(program) {
  const labels = {};
  const instructions = [];

  const lines = program.split('\n');

  for (let i = 0; i < lines.length; i++) {
    let line = lines[i].trim(); // Remove any trailing, leading whitespace
    // TODO: handle each line
  }

  return { labels, instructions };
}

First let's handle the directives we want to ignore:

  for (let i = 0; i < lines.length; i++) {
    let line = lines[i].trim(); // Remove any trailing, leading whitespace

    if (line.startsWith('.')) {
      continue;
    }
  }

And then comments:

  for (let i = 0; i < lines.length; i++) {
    let line = lines[i].trim(); // Remove any trailing, leading whitespace

    if (line.startsWith('.') || line.startsWith(';') || line.startsWith('#')) {
      continue;
    }

    if (line.includes(';')) {
      line = line.split(';')[0];
    }

    if (line.includes('#')) {
      line = line.split('#')[0];
    }

    if (!line) {
      continue;
    }
  }

And then labels:

  for (let i = 0; i < lines.length; i++) {
    let line = lines[i].trim(); // Remove any trailing, leading whitespace

    if (line.startsWith('.') || line.startsWith(';') || line.startsWith('#')) {
      continue;
    }

    if (line.includes(';')) {
      line = line.split(';')[0];
    }

    if (line.includes('#')) {
      line = line.split('#')[0];
    }

    if (!line) {
      continue;
    }

    if (line.includes(':')) {
      const label = line.split(':')[0];
      labels[label] = instructions.length;
      continue;
    }
  }

And finally instruction parsing plus the rest:

function parse(program) {
  const labels = {};
  const instructions = [];

  const lines = program.split('\n');

  for (let i = 0; i < lines.length; i++) {
    let line = lines[i].trim(); // Remove any trailing, leading whitespace

    if (line.startsWith('.') || line.startsWith(';')) {
      continue;
    }

    if (line.includes(';')) {
      line = line.split(';')[0];
    }

    if (line.includes(':')) {
      const label = line.split(':')[0];
      labels[label] = instructions.length;
      continue;
    }

    const operation = line.split(/\s/)[0].toLowerCase();
    const operands = line.substring(operation.length).split(',').map(t => t.trim());
    instructions.push({
      operation,
      operands,
    });
  }

  return { labels, instructions };
}

Hurray! A brittle parser.

Interpreting

We've already described the semantics a few times. So let's get started with the foundation and initialization.

We'll use BigInts because JavaScript integers are 53-bits wide. This isn't incredibly important in our basic programs but it will quickly became painful without.

And we'll make process memory available as an array of 8-bit integers. In order to make this easy to use, we'll also provide helper function for writing to and reading from memory.

const fs = require('fs');

const REGISTERS = [
  'RDI', 'RSI', 'RSP', 'RBP', 'RAX', 'RBX', 'RCX', 'RDX', 'RIP', 'R8',
  'R9', 'R10', 'R11', 'R12', 'R13', 'R14', 'R15', 'CS', 'DS', 'FS',
  'SS', 'ES', 'GS', 'CF', 'ZF', 'PF', 'AF', 'SF', 'TF', 'IF', 'DF', 'OF',
];

function writeMemoryBytes(process, address, value, size) {
  for (let i = 0n; i < size; i++) {
    value >>= i * 8n;
    process.memory[address + i] = value & 0xFFn;
  }
}

function readMemoryBytes(process, address, size) {
  let value = 0n;
  for (let i = 0n; i < size; i++) {
    value |= (process.memory[address + i] || 0n) << (i * 8n);
  }
  return value;
}

function interpret(process) {
  // TODO: interpret
}

function main(file) {
  const memory = new Array(10000);
  const code = fs.readFileSync(file).toString();
  const { instructions, labels } = parse(code);

  const registers = REGISTERS.reduce((rs, r) => ({ ...rs, [r]: 0n }), {});
  registers.RIP = BigInt(labels.main === undefined ? labels._main : labels.main);
  registers.RSP = BigInt(memory.length - 8);

  const process = {
    registers,
    memory,
    instructions,
    labels,
  };

  writeMemoryBytes(process, registers.RSP, registers.RSP, 8);

  interpret(process);
  return Number(registers.RAX);
}

process.exit(main(process.argv[2]));

We'll accept _main as an entry point as well as main to support our macOS users. If you know why our macOS users use _main I'd love to know.

To interpret, we grab the instruction pointed to in RIP and switch on the operation.

function interpret(process) {
  do {
    const instruction = process.instructions[process.registers.RIP];
    switch (instruction.operation.toLowerCase()) {
      case 'mov':
        break;
      case 'add':
        break;
      case 'call':
        break;
      case 'ret':
        break;
      case 'push':
        break;
      case 'pop':
        break;
    }
  } while (process.registers.RIP != BigInt(readMemoryBytes(process, memory.length - 8, 8)));
}

Interpreting MOV

Example:

  MOV RAX, 1
  MOV RAX, RDI
  MOV QWORD PTR [RBP - 8], 8

This instruction will store a value into a register or address and increment RIP. If the left-hand side is a memory address we will write to memory.

      case 'mov': {
        const lhs = interpretValue(process, instruction.operands[0], { lhs: true });
        const rhs = interpretValue(process, instruction.operands[1]);
        if (REGISTERS.includes(lhs)) {
          process.registers[lhs] = rhs;
        } else {
          writeMemoryBytes(process, lhs.address, rhs, lhs.size);
        }
        process.registers.RIP++;
        break;
      }

We're delegating to a helper function to handle registers vs. memory addresses vs. literals appropriately. Without memory addresses it's a simple function:

function interpretValue(process, value, { lhs } = { lhs: false }) {
  if (REGISTERS.includes(value)) {
    if (lhs) {
      return value;
    } else {
      return process.registers[value];
    }
  }

  return BigInt.asIntN(64, value);
}

We need to do some hacking to support memory addresses:

function interpretValue(process, value, { lhs } = { lhs: false }) {
  if (REGISTERS.includes(value)) {
    if (lhs) {
      return value;
    } else {
      return process.registers[value];
    }
  }

  if (value.startsWith('QWORD PTR [')) {
    const offsetString = value.substring('QWORD PTR ['.length, value.length - 1).trim();
    if (offsetString.includes('-')) {
      const [l, r] = offsetString.split('-').map(l => interpretValue(process, l.trim()));
      const address = l - r;
      const bytes = 8; // qword is 8 bytes
      if (lhs) {
        return { address, size: bytes };
      } else {
        return readMemoryBytes(process, address, bytes);
      }
    }

    throw new Error('Unsupported offset calculation: ' + value);
  }

  return BigInt.asIntN(64, value);
}

Interpreting ADD

Example:

  ADD RAX, RDI

This instruction will combine both registers and store the result in the first, then increment the RIP register.

      case 'add': {
        const lhs = interpretValue(process, instruction.operands[0], { lhs: true });
        const rhs = interpretValue(process, instruction.operands[1]);
        process.registers[lhs] += rhs;
        process.registers.RIP++;
        break;
      }

Interpreting CALL

Example:

  CALL plus

This instruction store RIP (plus one, to continue after the call instruction) on the stack and sets RIP to the location specified by the label.

      case 'call': {
        process.registers.RSP -= 8n;
        writeMemoryBytes(process, process.registers.RSP, process.registers.RIP + 1n, 8);
        const label = instruction.operands[0];
        process.registers.RIP = process.labels[label];
        break;
      }

Interpreting RET

Example:

  RET

This instruction removes the last element from the stack and stores it in the RIP register.

      case 'ret': {
        const value = readMemoryBytes(process, process.registers.RSP, 8);
        process.registers.RSP += 8n;
        process.registers.RIP = value;
        break;
      }

Interpreting PUSH

Example:

  PUSH RAX

This instruction stores the value in the register on the stack and increments RIP.

      case 'push': {
        const value = interpretValue(process, instruction.operands[0]);
        process.registers.RSP -= 8n;
        writeMemoryBytes(process, process.registers.RSP, value, 8);
        process.registers.RIP++;
        break;
      }

Interpreting POP

Example:

  POP RAX

This instruction removes the last element from the stack and stores it into the register specified. Then it increments RIP.

      case 'pop': {
        const lhs = interpretValue(process, instruction.operands[0], { lhs: true });
        const value = readMemoryBytes(process, process.registers.RSP, 8);
        process.registers.RSP += 8n;
        process.registers[lhs] = value;
        process.registers.RIP++;
        break;
      }

All together

$ cat test1.asm
main: ; the required main label
  MOV RAX, 1 ; store 1 in RAX
  MOV RDI, 2 ; store 2 in RDI
  ADD RAX, RDI ; store the result of adding RAX and RDI in RAX
  RET ; give control back to the virtual machine
$ node emulator.js test1.asm
$ echo $?
3

And finally, let's see what we can do with a simple C program:

$ cat plus.c
long main() {
  long a = 5;
  long b = 6;
  return a + b;
}
$ gcc -S -masm=intel -o plus.s plus.c
$ node emulator.js plus.s
$ echo $?
11

And we've got the start of a working x86_64/AMD64 emulator.

Next steps

We aren't setting flags appropriately to support conditionals, so that's low-hanging fruit. Additionally, syscalls open up a new world (that we'll end up needing since exit codes are limited to 8-bits of information). Additionally, our parsing is brittle. Dealing with ELF files may be a better direction to go and also enables more. We'll explore these aspects and others in follow-up posts.

Human interest

I originally intended to build a GameBoy emulator because the hardware is simple and uniform. But I found it easiest to start hacking together an AMD64 emulator because AMD64 is well-documented and gcc is easy enough to use. I'm still interested though unless/until I figure out how to emulate a graphics card for AMD64.

It's tricky! But not that tricky. I built a graphical debugger around this emulator to help out with the logic and off-by-one errors. But ultimately it's been surprising to me how easy it is to get started -- especially when I'm not concerned about absolute correctness (yet).

Share