August 26, 2021

Writing a simple JSON library from scratch: a tour through modern C++

Modern C++ has a lot of cool features. Move semantics means passing around structs in functions is cheap. std::shared_ptr means I don't have to manage any memory; no more new/delete! (But try as I might to understand std::unique_ptr, I'm just not there yet.)

The syntax has also gotten some treatment with auto and tuple destructuring.

In order to test out this modern C++ I wanted a small but meaningful project that operates on very dynamic data. The two that always come to mind are JSON parsers or Lisp interpreters.

This post walks through writing a basic JSON library from scratch using only the standard library. The source code for the resulting library is available on Github.

The biggest simplification we'll make is that rather than full JSON numbers, we'll only allow integers.

Big caveat! I couldn't be farther from a C++ expert! Email or tweet me as you see mistakes, madness, lies.

API

The two big parts of the API will be lexing (turning a string into an array of tokens) and parsing (turning an array of tokens into a JSON object-tree). A better implementation would have the lexer take a character stream rather than a string, but taking a string is simpler. So we'll stick with that.

Both of these functions can fail so we'll return a tuple in both cases with a string containing a possibly blank error message.
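To make that convention concrete, here's a toy function in the same shape (half is a made-up stand-in, not part of the library): the caller destructures the tuple and treats a non-empty string as failure.

```cpp
#include <cassert>
#include <string>
#include <tuple>

// Hypothetical stand-in for lex/parse: a result plus a possibly blank error.
std::tuple<int, std::string> half(int n) {
  if (n % 2 != 0) {
    return {0, "not divisible by 2"};
  }
  return {n / 2, ""};
}
```

Callers then write auto [result, error] = half(10); and check error.size() before using result.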

We will define the header in ./include/json.hpp.

#ifndef JSON_H
#define JSON_H

#include <tuple>
#include <vector>
#include <string>

namespace json {
std::tuple<std::vector<JSONToken>, std::string> lex(std::string);
std::tuple<JSONValue, int, std::string> parse(std::vector<JSONToken>, int index = 0);
} // namespace json

#endif

The token returned by lex will need to contain the token's string value, the location (offset) in the original source, a pointer to the full source (for debugging), and the token's type. The token type itself will be an enum of either string, number, syntax (colon, bracket, etc.), boolean, or null.

...
#include <string>
#include <memory>

namespace json {

enum class JSONTokenType { String, Number, Syntax, Boolean, Null };
struct JSONToken {
  std::string value;
  JSONTokenType type;
  int location;
  std::shared_ptr<std::string> full_source;
};

...

} // namespace json

...

This is the only place in the entire code we'll pass around a pointer. Using std::shared_ptr means we don't have to do any manual memory management either. No new or delete.
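To see what shared ownership buys us, here's a tiny standalone sketch (toy code, not part of the library): copying a shared_ptr copies only the pointer and bumps a reference count, and the string is freed automatically when the last owner goes away.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <utility>

// Returns the use_count while two "token" copies are alive, and after
// they're destroyed.
std::pair<long, long> shared_source_counts() {
  auto source = std::make_shared<std::string>("{\"a\": 1}");
  long during;
  {
    auto token_a_source = source; // what each token's full_source copy does
    auto token_b_source = source;
    during = source.use_count(); // one string, three owners
  }
  return {during, source.use_count()}; // back to one owner
}
```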

Next, JSONValue is a struct containing optional string, boolean, number, array, and object fields with a type enum to differentiate.


...
#include <map>
#include <optional>

namespace json {

enum class JSONValueType { String, Number, Object, Array, Boolean, Null };
struct JSONValue {
  std::optional<std::string> string;
  std::optional<double> number;
  std::optional<bool> boolean;
  std::optional<std::vector<JSONValue>> array;
  std::optional<std::map<std::string, JSONValue>> object;
  JSONValueType type;
};

enum class JSONTokenType { String, Number, Syntax, Boolean, Null };

...

Thanks to std::optional we can avoid using pointers to describe these fields. I did take a look at std::variant but it seemed like its API was overly complex.
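For the curious, a variant-based value type might look like the sketch below (hypothetical names, not the API this post builds). It works, but every access goes through std::get or std::visit, which is the complexity I wanted to avoid.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <variant>
#include <vector>

// A hypothetical recursive-variant alternative to the optional-field struct.
struct VariantValue;
using VariantArray = std::vector<VariantValue>;
using VariantObject = std::map<std::string, VariantValue>;

struct VariantValue {
  std::variant<std::nullptr_t, bool, double, std::string, VariantArray,
               VariantObject>
      value;
};
```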

Finally, we'll add two more functions: a high level parse function that combines the job of lexing and parsing, and a deparse function for printing a JSONValue as a JSON string.

...
std::tuple<JSONValue, int, std::string> parse(std::vector<JSONToken>, int index = 0);
std::tuple<JSONValue, std::string> parse(std::string);
std::string deparse(JSONValue, std::string whitespace = "");
} // namespace json
...

Now we're ready to start on the implementation.

Lexing

First up is lexing: turning a JSON string into an array of tokens, where each token is a number, string, null keyword, boolean keyword, or a piece of syntax like a comma or colon.

The main lex loop skips whitespace and calls helper functions for each kind of token. If a token is found, we accumulate it and move to the end of that token (some tokens like : are a single character, some tokens like "my great string" are multiple characters.)

Each token we find gets a pointer to the original JSON source for use in error messages if parsing fails. Again this will be the only time we explicitly pass around pointers in this implementation. We don't do any manual management because we're going to use std::shared_ptr.

#include "json.hpp"

namespace json {
std::tuple<std::vector<JSONToken>, std::string> lex(std::string raw_json) {
  std::vector<JSONToken> tokens;
  // All tokens will embed a pointer to the raw JSON for debugging purposes
  auto original_copy = std::make_shared<std::string>(raw_json);

  auto generic_lexers = {lex_syntax, lex_string, lex_number, lex_null, lex_true, lex_false};
  for (int i = 0; i < raw_json.length(); i++) {
    // Skip past whitespace
    if (auto new_index = lex_whitespace(raw_json, i); i != new_index) {
      i = new_index - 1;
      continue;
    }

    auto found = false;
    for (auto lexer : generic_lexers) {
      if (auto [token, new_index, error] = lexer(raw_json, i); i != new_index) {
        // Error while lexing, return early
        if (error.length()) {
          return {{}, error};
        }

        // Store reference to the original source
        token.full_source = original_copy;
        tokens.push_back(token);
        i = new_index - 1;
        found = true;
        break;
      }
    }

    if (found) {
      continue;
    }

    return {{}, format_error("Unable to lex", raw_json, i)};
  }

  return {tokens, ""};
}
} // namespace json

Two neat things you'll notice in there are tuple literal syntax ({tokens, ""}) and how easy it is to type a value containing an array of function pointers using auto (generic_lexers).
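Both features work outside the lexer too. A minimal standalone illustration (toy functions, nothing from the library):

```cpp
#include <cassert>
#include <string>
#include <tuple>

int increment(int x) { return x + 1; }
int square(int x) { return x * x; }

std::tuple<int, std::string> run_all(int x) {
  // auto deduces std::initializer_list<int (*)(int)> for us.
  auto transforms = {increment, square};
  for (auto f : transforms) {
    x = f(x);
  }
  return {x, ""}; // a tuple "literal" via brace initialization
}
```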

format_error

Since we referenced format_error, let's define it. This needs to accept a message prefix, the full JSON string, and the index offset where the error should point to.

Inside the function we'll iterate over the string until we find the entire line containing this index offset. We'll display that line and a pointer to the character that is causing/starting the error.

...

#include <sstream>

namespace json {
std::string format_error(std::string base, std::string source, int index) {
  std::ostringstream s;
  int counter = 0;
  int line = 1;
  int column = 0;
  std::string lastline = "";
  std::string whitespace = "";
  for (auto c : source) {
    if (counter == index) {
      break;
    }

    if (c == '\n') {
      line++;
      column = 0;
      lastline = "";
      whitespace = "";
    } else if (c == '\t') {
      column++;
      lastline += "  ";
      whitespace += "  ";
    } else {
      column++;
      lastline += c;
      whitespace += " ";
    }

    counter++;
  }

  // Continue accumulating the lastline for debugging
  while (counter < source.size()) {
    auto c = source[counter];
    if (c == '\n') {
      break;
    }
    lastline += c;
    counter++;
  }

  s << base << " at line " << line << ", column " << column << std::endl;
  s << lastline << std::endl;
  s << whitespace << "^";

  return s.str();
}

...

The printf API is annoying and Clang 12 (the latest Clang on the latest Fedora) doesn't seem to support std::format yet. So we just use std::ostringstream from <sstream> to do string "formatting".

But ok, back to lexing! Next up: whitespace.

lex_whitespace

This function's job is to skip past whitespace. Thankfully we've got std::isspace to help.

int lex_whitespace(std::string raw_json, int index) {
  // Check bounds before indexing so we never read past the end
  while (index < (int)raw_json.length() && std::isspace(raw_json[index])) {
    index++;
  }

  return index;
}

It's very simple!

lex_syntax

All of the generic lexers follow the same pattern. They return either a valid token and the index where the token ends, or they return an error string.

Since all the syntax elements in JSON (the comma, colon, braces, and brackets) are single characters, we don't need to write a "longest substring" helper function. We simply check if the current character is one of these and return a syntax token if so.

std::tuple<JSONToken, int, std::string> lex_syntax(std::string raw_json, int index) {
  JSONToken token{"", JSONTokenType::Syntax, index};
  std::string value = "";
  auto c = raw_json[index];
  if (c == '[' || c == ']' || c == '{' || c == '}' || c == ':' || c == ',') {
    token.value += c;
    index++;
  }

  return {token, index, ""};
}

lex_string

This one manages state so it's a little more complex. We need to check if the current character is a double quote, then iterate over characters until we find the ending quote.

It's possible to hit EOF here so we need to handle that case. And handling nested quotes is left as an exercise for the reader. :)

std::tuple<JSONToken, int, std::string> lex_string(std::string raw_json,
                                                   int original_index) {
  int index = original_index;
  JSONToken token{"", JSONTokenType::String, index};
  std::string value = "";
  auto c = raw_json[index];
  if (c != '"') {
    return {token, original_index, ""};
  }
  index++;

  // TODO: handle nested quotes
  while (c = raw_json[index], c != '"') {
    if (index == raw_json.length()) {
      return {token, index, format_error("Unexpected EOF while lexing string", raw_json, index)};
    }

    token.value += c;
    index++;
  }
  index++;

  return {token, index, ""};
}
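As for the nested-quote exercise: the core change is small. When the lexer sees a backslash, consume the next character verbatim instead of letting it terminate the string. A standalone sketch (hypothetical, and real JSON would also decode escapes like \n and \uXXXX rather than copy them through):

```cpp
#include <cassert>
#include <string>
#include <tuple>

// Sketch: returns (string contents, index past the closing quote, success).
std::tuple<std::string, int, bool> lex_quoted(const std::string &raw,
                                              int index) {
  std::string value;
  if (index >= (int)raw.size() || raw[index] != '"') {
    return {value, index, false};
  }
  index++;

  while (index < (int)raw.size() && raw[index] != '"') {
    if (raw[index] == '\\' && index + 1 < (int)raw.size()) {
      index++; // skip the backslash, keep the escaped character as-is
    }
    value += raw[index];
    index++;
  }

  if (index == (int)raw.size()) {
    return {value, index, false}; // unterminated string
  }
  return {value, index + 1, true};
}
```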

Nothing too special to discuss here. So on to lexing numbers.

lex_number

Since we're only supporting integers, this one has no internal state. We check characters until we stop seeing digits.

std::tuple<JSONToken, int, std::string> lex_number(std::string raw_json,
                                                   int original_index) {
  int index = original_index;
  JSONToken token = {"", JSONTokenType::Number, index};
  std::string value = "";
  // TODO: handle not just integers
  while (true) {
    if (index == raw_json.length()) {
      break;
    }

    auto c = raw_json[index];
    if (!(c >= '0' && c <= '9')) {
      break;
    }

    token.value += c;
    index++;
  }

  return {token, index, ""};
}
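To go beyond integers, that loop would grow to accept a sign, a fraction, and an exponent. Here's a hypothetical sketch of just the scanning part; it finds where a number ends but doesn't validate the full JSON grammar (it accepts a lone minus sign, for example):

```cpp
#include <cassert>
#include <cctype>
#include <string>

// Returns the index one past the end of a number starting at i.
int scan_number_end(const std::string &s, int i) {
  auto digits = [&](int j) {
    while (j < (int)s.size() && std::isdigit((unsigned char)s[j])) {
      j++;
    }
    return j;
  };

  if (i < (int)s.size() && (s[i] == '-' || s[i] == '+')) {
    i++; // sign
  }
  i = digits(i); // integer part
  if (i < (int)s.size() && s[i] == '.') {
    i = digits(i + 1); // fraction
  }
  if (i < (int)s.size() && (s[i] == 'e' || s[i] == 'E')) {
    int j = i + 1;
    if (j < (int)s.size() && (s[j] == '-' || s[j] == '+')) {
      j++;
    }
    int k = digits(j);
    if (k > j) {
      i = k; // only consume the exponent if it has digits
    }
  }
  return i;
}
```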

Done. On to keywords: null, false, true.

lex_keyword

This is a helper function that will check for a literal keyword.

std::tuple<JSONToken, int, std::string> lex_keyword(std::string raw_json,
                                                    std::string keyword,
                                                    JSONTokenType type,
                                                    int original_index) {
  int index = original_index;
  JSONToken token{"", type, index};
  while (keyword[index - original_index] == raw_json[index]) {
    if (index == raw_json.length()) {
      break;
    }

    index++;
  }

  if (index - original_index == keyword.length()) {
    token.value = keyword;
  }
  return {token, index, ""};
}

With this defined we can now implement lex_false, lex_true, and lex_null.

std::tuple<JSONToken, int, std::string> lex_null(std::string raw_json,
                                                 int index) {
  return lex_keyword(raw_json, "null", JSONTokenType::Null, index);
}

std::tuple<JSONToken, int, std::string> lex_true(std::string raw_json,
                                                 int index) {
  return lex_keyword(raw_json, "true", JSONTokenType::Boolean, index);
}

std::tuple<JSONToken, int, std::string> lex_false(std::string raw_json,
                                                  int index) {
  return lex_keyword(raw_json, "false", JSONTokenType::Boolean, index);
}

And that's it for lexing! And although we defined all of these top-down, you'll want to write them mostly in reverse order or put in forward declarations.

If you wanted to you could now write a simple main.cpp like:

#include "json.hpp"

#include <iostream>

int main(int argc, char *argv[]) {
  if (argc == 1) {
    std::cerr << "Expected JSON input argument to parse" << std::endl;
    return 1;
  }

  std::string in{argv[1]};

  auto [tokens, error] = json::lex(in);
  if (error.size()) {
    std::cerr << error << std::endl;
    return 1;
  }

  for (auto t : tokens) {
    std::cout << t.value << std::endl;
  }
}

Set up a Makefile:

main: *.cpp ./include/*.hpp
        clang++ -g -Wall -std=c++2a -I./include *.cpp -o $@

Build with make and run ./main '{"a": 1}' to see the list of tokens printed out.

Now let's move on to parsing from the array of tokens.

Parsing

This process takes the array of tokens and turns them into a tree structure. The tree develops children as we spot [ or { tokens. The tree child ends when we spot ] or } tokens.

std::tuple<JSONValue, int, std::string> parse(std::vector<JSONToken> tokens,
                                              int index) {
  auto token = tokens[index];
  switch (token.type) {
  case JSONTokenType::Number: {
    auto n = std::stod(token.value);
    return {JSONValue{.number = n, .type = JSONValueType::Number}, index + 1, ""};
  }
  case JSONTokenType::Boolean:
    return {JSONValue{.boolean = token.value == "true", .type = JSONValueType::Boolean}, index + 1, ""};
  case JSONTokenType::Null:
    return {JSONValue{.type = JSONValueType::Null}, index + 1, ""};
  case JSONTokenType::String:
    return {JSONValue{.string = token.value, .type = JSONValueType::String}, index + 1, ""};
  case JSONTokenType::Syntax:
    if (token.value == "[") {
      auto [array, new_index, error] = parse_array(tokens, index + 1);
      return {JSONValue{.array = array, .type = JSONValueType::Array}, new_index, error};
    }

    if (token.value == "{") {
      auto [object, new_index, error] = parse_object(tokens, index + 1);
      return {JSONValue{.object = std::optional(object), .type = JSONValueType::Object}, new_index, error};
    }
  }

  return {{}, index, format_parse_error("Failed to parse", token)};
}

This in turn references format_parse_error on failure, which is an error-string-maker similar to format_error. It actually calls format_error with more details specific to parsing.

std::string JSONTokenType_to_string(JSONTokenType jtt) {
  switch (jtt) {
  case JSONTokenType::String:
    return "String";
  case JSONTokenType::Number:
    return "Number";
  case JSONTokenType::Syntax:
    return "Syntax";
  case JSONTokenType::Boolean:
    return "Boolean";
  case JSONTokenType::Null:
    return "Null";
  }

  // Unreachable since every case returns, but keeps -Wall quiet
  return "";
}

std::string format_parse_error(std::string base, JSONToken token) {
  std::ostringstream s;
  s << "Unexpected token '" << token.value << "', type '"
    << JSONTokenType_to_string(token.type) << "', index " << token.location;
  s << std::endl << base;
  return format_error(s.str(), *token.full_source, token.location);
}

This function depended on a helper for turning the JSONTokenType enum into a string. As a user it's very annoying when languages don't give you stringifier methods for enums by default for debugging. I know there are some ways to do this with reflection in C++ but it seemed hairy. But I digress.
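One macro-based workaround I've seen (a sketch of the general trick, not something this library uses) is an X-macro: list the names once and generate both the enum and its stringifier from that list, so the two can't drift apart.

```cpp
#include <cassert>
#include <string>

// Define the token type names in exactly one place.
#define TOKEN_TYPES(X) X(String) X(Number) X(Syntax) X(Boolean) X(Null)

enum class TokType {
#define X(name) name,
  TOKEN_TYPES(X)
#undef X
};

std::string TokType_to_string(TokType t) {
  switch (t) {
#define X(name)                                                                \
  case TokType::name:                                                          \
    return #name;
    TOKEN_TYPES(X)
#undef X
  }
  return "";
}
```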

parse_array

This function is called by parse when we find an opening bracket. It needs to recursively call parse, then check for a comma and call parse again ... until it finds the closing bracket.

It will fail if it ever finds something other than a comma or closing bracket following a successful call to parse.

std::tuple<std::vector<JSONValue>, int, std::string>
parse_array(std::vector<JSONToken> tokens, int index) {
  std::vector<JSONValue> children = {};
  while (index < tokens.size()) {
    auto t = tokens[index];
    if (t.type == JSONTokenType::Syntax) {
      if (t.value == "]") {
        return {children, index + 1, ""};
      }

      if (t.value == ",") {
        index++;
        t = tokens[index];
      } else if (children.size() > 0) {
        return {{},
                index,
                format_parse_error("Expected comma after element in array", t)};
      }
    }

    auto [child, new_index, error] = parse(tokens, index);
    if (error.size()) {
      return {{}, index, error};
    }

    children.push_back(child);
    index = new_index;
  }

  // index is now one past the last token, so point the error at the final one
  return {
      {},
      index,
      format_parse_error("Unexpected EOF while parsing array",
                         tokens[index - 1])};
}

And finally we need to implement parse_object.

parse_object

This function is similar to parse_array but it needs to find $string COLON $parse() COMMA pattern pairs.

std::tuple<std::map<std::string, JSONValue>, int, std::string>
parse_object(std::vector<JSONToken> tokens, int index) {
  std::map<std::string, JSONValue> values = {};
  while (index < tokens.size()) {
    auto t = tokens[index];
    if (t.type == JSONTokenType::Syntax) {
      if (t.value == "}") {
        return {values, index + 1, ""};
      }

      if (t.value == ",") {
        index++;
        t = tokens[index];
      } else if (values.size() > 0) {
        return {
            {},
            index,
            format_parse_error("Expected comma after element in object", t)};
      } else {
        return {{},
                index,
                format_parse_error(
                    "Expected key-value pair or closing brace in object", t)};
      }
    }

    auto [key, new_index, error] = parse(tokens, index);
    if (error.size()) {
      return {{}, index, error};
    }

    if (key.type != JSONValueType::String) {
      return {
          {}, index, format_parse_error("Expected string key in object", t)};
    }
    index = new_index;
    t = tokens[index];

    if (!(t.type == JSONTokenType::Syntax && t.value == ":")) {
      return {{},
              index,
              format_parse_error("Expected colon after key in object", t)};
    }
    index++;
    t = tokens[index];

    auto [value, new_index1, error1] = parse(tokens, index);
    if (error1.size()) {
      return {{}, index, error1};
    }

    values[key.string.value()] = value;
    index = new_index1;
  }

  // Falling out of the loop means we never saw the closing brace
  return {{},
          index,
          format_parse_error("Unexpected EOF while parsing object",
                             tokens[index - 1])};
}

These parse functions are all slightly tedious but still very simple. And thankfully, we're done!

We can now implement the variation of parse that ties together lexing and parsing.

std::tuple<JSONValue, std::string> parse(std::string source) {
  auto [tokens, error] = json::lex(source);
  if (error.size()) {
    return {{}, error};
  }

  auto [ast, _, error1] = json::parse(tokens);
  return {ast, error1};
}

And we're completely done with the string-to-JSONValue code.

deparse

The very last piece of the implementation is to do the reverse of the past operations: generate a string from a JSONValue.

This is a recursive function and the only mildly tricky part is deciding how to do whitespace if we want a prettier output.


std::string deparse(JSONValue v, std::string whitespace) {
  switch (v.type) {
  case JSONValueType::String:
    return "\"" + v.string.value() + "\"";
  case JSONValueType::Boolean:
    return (v.boolean.value() ? "true" : "false");
  case JSONValueType::Number:
    return std::to_string(v.number.value());
  case JSONValueType::Null:
    return "null";
  case JSONValueType::Array: {
    std::string s = "[\n";
    auto a = v.array.value();
    for (int i = 0; i < a.size(); i++) {
      auto value = a[i];
      s += whitespace + "  " + deparse(value, whitespace + "  ");
      if (i < a.size() - 1) {
        s += ",";
      }

      s += "\n";
    }

    return s + whitespace + "]";
  }
  case JSONValueType::Object: {
    std::string s = "{\n";
    auto values = v.object.value();
    auto i = 0;
    for (auto const &[key, value] : values) {
      s += whitespace + "  " + "\"" + key +
           "\": " + deparse(value, whitespace + "  ");

      if (i < values.size() - 1) {
        s += ",";
      }

      s += "\n";
      i++;
    }

    return s + whitespace + "}";
  }
  }

  // Unreachable since every case returns, but keeps -Wall quiet
  return "";
}
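One quirk to be aware of: std::to_string on a double always formats like printf's %f, with six decimal places, which is why the integers we parse come back out as 1.000000 in the output later on.

```cpp
#include <cassert>
#include <string>

// std::to_string(double) behaves like sprintf("%f", ...), so integral
// values gain a trailing ".000000".
std::string deparse_number(double n) { return std::to_string(n); }
```

A real fix would format with std::ostringstream or check whether the value is integral, but for this toy library the fixed format is fine.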

Done. Done. Done.

main.cpp

This program will simply accept a JSON input, parse it, and pretty print it right back out. Kind of like a simplified jq.

#include "json.hpp"

#include <iostream>

int main(int argc, char *argv[]) {
  if (argc == 1) {
    std::cerr << "Expected JSON input argument to parse" << std::endl;
    return 1;
  }

  std::string in{argv[1]};

  auto [ast, error] = json::parse(in);
  if (error.size()) {
    std::cerr << error << std::endl;
    return 1;
  }

  std::cout << json::deparse(ast);
}

Build it with the Makefile we already defined, and run it against something big like this.

$ cd cpp-json
$ make
$ ./main "$(cat ./test/glossary.json)"
{
  "glossary": {
    "GlossDiv": {
      "GlossList": {
        "GlossEntry": {
          "Abbrev": "ISO 8879:1986",
          "Acronym": "SGML",
          "GlossDef": {
            "GlossSeeAlso": [
              "GML",
              "XML"
            ],
            "para": "A meta-markup language, used to create markup languages such as DocBook."
          },
          "GlossSee": "markup",
          "GlossTerm": "Standard Generalized Markup Language",
          "ID": "SGML",
          "SortAs": "SGML"
        }
      },
      "title": "S"
    },
    "title": "example glossary"
  }
}

Or something incorrect like:

./main '{"foo": [{ 1: 2 }]}'
Unexpected token '1', type 'Number', index 11
Expected string key in object at line 1, column 11
{"foo": [{ 1: 2 }]}
           ^

And give Valgrind the old try:

valgrind ./main '{"a": [1, 2, null, { "c": 129 }]}'
==153027== Memcheck, a memory error detector
==153027== Copyright (C) 2002-2017, and GNU GPL'd, by Julian Seward et al.
==153027== Using Valgrind-3.17.0 and LibVEX; rerun with -h for copyright info
==153027== Command: ./main {"a":\ [1,\ 2,\ null,\ {\ "c":\ 129\ }]}
==153027==
{
  "a": [
    1.000000,
    2.000000,
    null,
    {
      "c": 129.000000
    }
  ]
}==153027==
==153027== HEAP SUMMARY:
==153027==     in use at exit: 0 bytes in 0 blocks
==153027==   total heap usage: 128 allocs, 128 frees, 105,386 bytes allocated
==153027==
==153027== All heap blocks were freed -- no leaks are possible
==153027==
==153027== For lists of detected and suppressed errors, rerun with: -s
==153027== ERROR SUMMARY: 0 errors from 0 contexts (suppressed: 0 from 0)

Pretty sweet. Modern C++, I like it!