I spent a week looking at MySQL/MariaDB internals along with ~80 other devs. Although MySQL and MariaDB are mostly the same (more on that later), I focused on MariaDB specifically this week.
Before last week I had never built MySQL/MariaDB before. The first day of this hack week, I got MariaDB building locally and made a code tweak so that SELECT 23 returned 213, and another tweak so that SELECT 80 + 20 returned 60. The second day I got a basic UDF in C working so that SELECT mysum(20, 30) returned 50.
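For context, a MariaDB UDF is just a shared library exposing a few C functions that follow the UDF naming convention. Here is a minimal sketch of the general shape of a mysum (hedged from memory, not the exact code from my week):

#include <string.h>
#include <mysql.h>

// Called once before the query runs; validates and coerces arguments.
extern "C" my_bool mysum_init(UDF_INIT *initid, UDF_ARGS *args, char *message)
{
  if (args->arg_count != 2)
  {
    strcpy(message, "mysum() requires exactly two arguments");
    return 1;
  }
  args->arg_type[0]= INT_RESULT;
  args->arg_type[1]= INT_RESULT;
  return 0;
}

// Called once per row. NULL handling omitted for brevity.
extern "C" long long mysum(UDF_INIT *initid, UDF_ARGS *args, char *is_null,
                           char *error)
{
  return *((long long *) args->args[0]) + *((long long *) args->args[1]);
}

extern "C" void mysum_deinit(UDF_INIT *initid) {}

After compiling that into a shared library, you register it with CREATE FUNCTION mysum RETURNS INTEGER SONAME 'mysum.so'; and SELECT mysum(20, 30) works.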
The rest of the week I spent figuring out how to build a minimal in-memory storage engine, which I'll walk through in this post. 218 lines of C++.
It supports CREATE, DROP, INSERT, and SELECT for tables that only have INTEGER fields. It is explicitly not thread-safe because I didn't have time to understand MariaDB's lock primitives.
In this post I'll also talk about how the MariaDB custom storage API compares to the Postgres one, based on a previous hack week project I did.
All code for this post can be found in my fork on GitHub.
MySQL and MariaDB
Before we go further though, why do I keep saying MySQL/MariaDB?
MySQL is GPL licensed (let's completely ignore the commercial variations of MySQL that Oracle offers). The code is open-source. However, the development is done behind closed doors. There is a code dump every month or so.
MariaDB is a fork of MySQL by the creator of MySQL (who is no longer involved, as it happens). It is also GPL licensed (let's completely ignore the commercial variations of MariaDB that MariaDB Corporation offers). The code is open-source. The development is also open-source.
When you install "MySQL" in your Linux distro you are often actually installing MariaDB.
The two are mostly compatible. During this week, I stumbled onto the fact that they evolved support for SELECT .. FROM VALUES .. differently. Some differences are documented in the MariaDB KB. But the KB is painful to browse.
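To give one example of the SELECT .. FROM VALUES .. divergence (hedged from memory; check the docs for your versions): MySQL 8.0.19+ wants explicit ROW() constructors in a VALUES list, while MariaDB's table value constructors omit them.

-- MySQL
SELECT * FROM (VALUES ROW(1, 2), ROW(3, 4)) AS t;
-- MariaDB
SELECT * FROM (VALUES (1, 2), (3, 4)) AS t;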
Which leads me to my next point: the MySQL docs are excellent. They are easy to read and browse, and they are thorough. The MariaDB docs are a work in progress. I'm sorry I can't be stoic: in just a week I've come to really hate using this KB. Thankfully, in some twisted way, it doesn't seem to be very thorough either, which pushes you toward the MySQL docs. The KB isn't completely avoidable though, since there is no guarantee MySQL and MariaDB do the same thing.
Ultimately, I spent the week using MariaDB because I'm biased toward fully open projects. But I kept having to look at MySQL docs, hoping they were relevant.
Now that you understand the state of things, let's move on to fun stuff!
Storage engines
Mature databases often support swapping out the storage layer. Maybe you want an in-memory storage layer so that you can quickly run integration tests. Maybe you want to switch between B-Trees (read-optimized) and LSM Trees (write-optimized) and unordered heaps (write-optimized) depending on your workload. Or maybe you just want to try a third-party storage library (e.g. RocksDB or Sled or TiKV).
The benefit of swapping out only the storage engine is that, from a user's perspective, the semantics and features of the database stay mostly the same. But the database is magically faster for a workload.
You keep powerful user management, extension support, SQL support, and a well-known wire protocol. You modify only the method of storing the actual data.
Existing storage engines
MySQL/MariaDB is particularly well known for its custom storage engine support. The MySQL docs for alternate storage engines are great.
While the docs do warn that you should probably stick with the default storage engine, that warning didn't feel strong enough on its own, since nothing else indicated the state of the other engines.
Specifically, in the past I was always interested in the CSV storage engine. But when you look at the actual code for the CSV engine there is a pretty strong warning:
First off, this is a play thing for me, there are a number of things
wrong with it:
*) It was designed for csv and therefore its performance is highly
questionable.
*) Indexes have not been implemented. This is because the files can
be traded in and out of the table directory without having to worry
about rebuilding anything.
*) NULLs and "" are treated equally (like a spreadsheet).
*) There was in the beginning no point to anyone seeing this other
then me, so there is a good chance that I haven't quite documented
it well.
*) Less design, more "make it work"
Now there are a few cool things with it:
*) Errors can result in corrupted data files.
*) Data files can be read by spreadsheets directly.
TODO:
*) Move to a block system for larger files
*) Error recovery, its all there, just need to finish it
*) Document how the chains work.
-Brian
The gap between the seeming confidence of the docs and the actual confidence of the contributor made me chuckle.
For me, the benefit of these diverse storage engines is that they give examples of how to implement the storage engine API. The csv, blackhole, example, and heap storage engines were particularly helpful to read.
The heap engine is a complete in-memory storage engine. Complete means complex though. So there seemed to be room for a stripped-down version of an in-memory engine.
And that's what we'll cover in this post! First though, I want to talk a little bit about the limitations of custom storage engines.
Limitations
While being able to tailor a storage engine to a workload is powerful, there are limits to the benefits based on the design of the storage API.
Both Postgres and MySQL/MariaDB currently have a custom storage API built around individual rows.
Column-wise execution
I have previously written that custom storage engines allow you to switch between column- and row-oriented data storage. Two big reasons to do column-wise storage are 1) opportunity for compression, and 2) fast operations on a single column.
The opportunity for 1) compression would still exist even if you needed to deal with individual rows at the storage API layer, since the compression could happen on disk. However, any benefits of passing around compressed columns in memory disappear if you must convert to rows for the storage API.
You'd also lose the advantage of 2) fast operations on a single column if each column must be converted into rows at the storage API before being passed to higher levels that perform execution. The execution would happen row-wise, not column-wise.
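To make 2) concrete, here is a minimal illustration (mine, not code from either database) of why single-column operations are faster over a columnar layout:

#include <cstdint>
#include <numeric>
#include <vector>

// Row-oriented: summing column `a` strides over `b` too, wasting cache.
struct Row { int32_t a; int32_t b; };
int64_t sum_a_rowwise(const std::vector<Row> &rows)
{
  int64_t sum= 0;
  for (const Row &r : rows)
    sum+= r.a;
  return sum;
}

// Column-oriented: summing column `a` scans one dense, SIMD-friendly array.
int64_t sum_a_columnwise(const std::vector<int32_t> &col_a)
{
  return std::accumulate(col_a.begin(), col_a.end(), int64_t{0});
}

If the storage API forces columns back into rows before execution, the execution layer only ever sees the sum_a_rowwise shape.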
All of this is to say that while column-wise storage is possible, the benefit of doing so is not obvious with the current API design for both MySQL/MariaDB and Postgres.
Vectorization
An API built around individual rows also sets limits on the amount of vectorization you can do. A custom storage engine could still do some vectorization under the hood: always filling a buffer with N rows and returning a row from the buffer when the storage API requests a single row. But there is likely some degree of performance left on the table with an API that deals with individual rows.
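As a sketch of that buffering idea (hypothetical names, not MariaDB's actual API):

#include <cstddef>
#include <functional>
#include <vector>

typedef std::vector<unsigned char> Row;

struct BufferedScan
{
  // Supplied by the engine: fills the buffer with up to some batch size
  // of rows from disk, returning how many it read (0 at end of table).
  std::function<size_t(std::vector<Row> &)> fetch_batch;
  std::vector<Row> buffer;
  size_t next= 0;

  // Called once per row by the row-at-a-time storage API.
  bool next_row(Row *out)
  {
    if (next == buffer.size())
    {
      buffer.clear();
      if (fetch_batch(buffer) == 0)
        return false; // End of table.
      next= 0;
    }
    *out= buffer[next++];
    return true;
  }
};

The batch amortizes I/O, but each row still crosses the API boundary one call at a time.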
Remember though: even if you did batched reads and writes of rows in the custom storage layer, there isn't necessarily any vectorization happening at the execution layer. From a previous study I did, neither MySQL/MariaDB nor Postgres does vectorized query execution. This paragraph isn't a critique of the storage API; it's just something to keep in mind.
Storage versus execution
The general point I'm making here is that unless both the execution and storage APIs are designed in a certain way, you may attempt optimizations in the storage layer that are ineffective or even harmful because the execution layer doesn't or can't take advantage of them.
Nothing permanent
The current limitations of the storage API are not intrinsic aspects of MySQL/MariaDB or Postgres's design. Both projects started out with no pluggable storage at all. We can imagine a future patch to either project that adds support for batched row reads and writes, which together could make column-wise storage and vectorized execution more feasible. A thought-experiment sketch of what that might look like follows.
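Purely as a thought experiment (hypothetical signatures, not a real MariaDB interface), batched variants of today's per-row methods might look like:

#include <cstddef>

typedef unsigned char uchar;

struct batched_handler
{
  // Today, per row:
  virtual int write_row(const uchar *buf)= 0;
  virtual int rnd_next(uchar *buf)= 0;

  // Imagined batched variants: `bufs` holds rows laid out back-to-back
  // in the record format the engine already uses.
  virtual int write_rows(const uchar *bufs, size_t nrows)= 0;
  virtual int rnd_next_batch(uchar *bufs, size_t max_rows,
                             size_t *nrows_out)= 0;

  virtual ~batched_handler()= default;
};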
Even today there have been invasive attempts to fully support column-wise storage and execution in Postgres. And there have also been projects to bring vectorized execution to Postgres.
I'm not familiar enough with the MySQL landscape to comment on current efforts there.
Debug build of MariaDB running locally
Now that you've got some background, let's get a debug build of MariaDB!
$ git clone https://github.com/MariaDB/server mariadb
$ cd mariadb
$ mkdir build
$ cd build
$ cmake -DCMAKE_BUILD_TYPE=Debug ..
$ make -j8
This takes a while. When I was hacking on Postgres (a C project), it took 1 minute on my beefy Linux server to build. It took 20-30 minutes to build MySQL/MariaDB from scratch. That's C++ for you!
Thankfully, incremental builds after a tweak take roughly the same time in MySQL/MariaDB as they do in Postgres.
Once the build is done, create a database.
$ ./build/scripts/mariadb-install-db --srcdir=$(pwd) --datadir=$(pwd)/db
And create a config for the database.
$ echo "[client]
socket=$(pwd)/mariadb.sock
[mariadb]
socket=$(pwd)/mariadb.sock
basedir=$(pwd)
datadir=$(pwd)/db
pid-file=$(pwd)/db.pid" > my.cnf
Start up the server.
$ ./build/sql/mariadbd --defaults-extra-file=$(pwd)/my.cnf --debug:d:o,$(pwd)/db.debug
./build/sql/mariadbd: Can't create file '/var/log/mariadb/mariadb.log' (errno: 13 "Permission denied")
2024-01-03 17:10:15 0 [Note] Starting MariaDB 11.4.0-MariaDB-debug source revision 3fad2b115569864d8c1b7ea90ce92aa895cfef08 as process 185550
2024-01-03 17:10:15 0 [Note] InnoDB: !!!!!!!! UNIV_DEBUG switched on !!!!!!!!!
2024-01-03 17:10:15 0 [Note] InnoDB: Compressed tables use zlib 1.2.13
2024-01-03 17:10:15 0 [Note] InnoDB: Number of transaction pools: 1
2024-01-03 17:10:15 0 [Note] InnoDB: Using crc32 + pclmulqdq instructions
2024-01-03 17:10:15 0 [Note] InnoDB: Initializing buffer pool, total size = 128.000MiB, chunk size = 2.000MiB
2024-01-03 17:10:15 0 [Note] InnoDB: Completed initialization of buffer pool
2024-01-03 17:10:15 0 [Note] InnoDB: Buffered log writes (block size=512 bytes)
2024-01-03 17:10:15 0 [Note] InnoDB: End of log at LSN=57155
2024-01-03 17:10:15 0 [Note] InnoDB: Opened 3 undo tablespaces
2024-01-03 17:10:15 0 [Note] InnoDB: 128 rollback segments in 3 undo tablespaces are active.
2024-01-03 17:10:15 0 [Note] InnoDB: Setting file './ibtmp1' size to 12.000MiB. Physically writing the file full; Please wait ...
2024-01-03 17:10:15 0 [Note] InnoDB: File './ibtmp1' size is now 12.000MiB.
2024-01-03 17:10:15 0 [Note] InnoDB: log sequence number 57155; transaction id 16
2024-01-03 17:10:15 0 [Note] InnoDB: Loading buffer pool(s) from ./db/ib_buffer_pool
2024-01-03 17:10:15 0 [Note] Plugin 'FEEDBACK' is disabled.
2024-01-03 17:10:15 0 [Note] Plugin 'wsrep-provider' is disabled.
2024-01-03 17:10:15 0 [Note] InnoDB: Buffer pool(s) load completed at 240103 17:10:15
2024-01-03 17:10:15 0 [Note] Server socket created on IP: '0.0.0.0'.
2024-01-03 17:10:15 0 [Note] Server socket created on IP: '::'.
2024-01-03 17:10:15 0 [Note] mariadbd: Event Scheduler: Loaded 0 events
2024-01-03 17:10:15 0 [Note] ./build/sql/mariadbd: ready for connections.
Version: '11.4.0-MariaDB-debug' socket: './mariadb.sock' port: 3306 Source distribution
With that --debug flag, debug logs will show up in $(pwd)/db.debug. It's unclear why debug logs are treated separately from the console logs shown here. I'd rather they all be in one place.
In another terminal, run a client and make a request!
$ ./build/client/mariadb --defaults-extra-file=$(pwd)/my.cnf --database=test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 3
Server version: 11.4.0-MariaDB-debug Source distribution
Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
MariaDB [test]> SELECT 1;
+---+
| 1 |
+---+
| 1 |
+---+
1 row in set (0.001 sec)
Huzzah! Let's write a custom storage engine!
Where does the code go?
When writing an extension for some project, I usually expect to have the extension exist in its own repo. I was able to do this with the Postgres in-memory storage engine I wrote. And in general, Postgres extensions exist as their own repos.
I was able to create and build a UDF plugin outside the MariaDB source tree. But when it came to getting a storage engine to build and load successfully, I wasted almost an entire day (a large amount of time in a single hack week) getting nowhere.
Extensions for MySQL/MariaDB are most easily built via the CMake infrastructure within the repo. Surely there's some way to replicate that infrastructure from outside the repo but I wasn't able to figure it out within a day and didn't want to spend more time on it.
Apparently the normal thing to do in MySQL/MariaDB is to keep extensions within a fork of MySQL/MariaDB.
When I switched to this method I was able to very quickly get the storage engine building and loaded. So that's what we'll do.
Boilerplate
Within the MariaDB source tree, create a new folder in the storage subdirectory.
$ mkdir storage/memem
Within storage/memem/CMakeLists.txt, add the following.
# Copyright (c) 2006, 2010, Oracle and/or its affiliates. All rights reserved.
#
# This program is free software; you can redistribute it and/or modify
# it under the terms of the GNU General Public License as published by
# the Free Software Foundation; version 2 of the License.
#
# This program is distributed in the hope that it will be useful,
# but WITHOUT ANY WARRANTY; without even the implied warranty of
# MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
# GNU General Public License for more details.
#
# You should have received a copy of the GNU General Public License
# along with this program; if not, write to the Free Software
# Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1335 USA
SET(MEMEM_SOURCES ha_memem.cc ha_memem.h)
MYSQL_ADD_PLUGIN(memem ${MEMEM_SOURCES} STORAGE_ENGINE)
This hooks into the MySQL/MariaDB build infrastructure. So the next time you run make within the build directory we created above, it will also build this project.
The storage engine class
It would be nice to see a way to extend MySQL in C (for one, because it would then be easier to port to other languages). But all of the builtin storage engines use C++ classes. So we'll do that too.
The class we must implement is a subclass of handler. There is a single handler instance per thread, corresponding to a single running query. (Postgres gives each query its own process; MySQL gives each query its own thread.) However, handler instances are reused across different queries.
There are a number of virtual methods on handler we must implement in our subclass. For most of them we'll do nothing: simply returning immediately. These simple methods will be implemented in ha_memem.h. The methods with more complex logic will be implemented in ha_memem.cc.
Let's set up includes in ha_memem.h.
/* Copyright (c) 2005, 2010, Oracle and/or its affiliates. All rights reserved.
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; version 2 of the License.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1335 USA */
#ifdef USE_PRAGMA_INTERFACE
#pragma interface /* gcc class implementation */
#endif
#include "thr_lock.h"
#include "handler.h"
#include "table.h"
#include "sql_const.h"
#include <vector>
#include <memory>
Next we'll define structs for our in-memory storage.
typedef std::vector<uchar> MememRow;
struct MememTable
{
std::vector<std::shared_ptr<MememRow>> rows;
std::shared_ptr<std::string> name;
};
struct MememDatabase
{
std::vector<std::shared_ptr<MememTable>> tables;
};
Within ha_memem.cc we'll implement a global (not thread-safe) static MememDatabase* that all handler instances will query when requested. We need the definitions in the header file though because we'll store the table currently being queried in the handler subclass. This is so that every call to write_row to write a single row, or call to rnd_next to read a single row, does not need to look up the in-memory table object N times within the same query.
And finally we'll define the subclass of handler and implementations of trivial methods.
class ha_memem final : public handler
{
uint current_position= 0;
std::shared_ptr<MememTable> memem_table= 0;
public:
ha_memem(handlerton *hton, TABLE_SHARE *table_arg) : handler(hton, table_arg)
{
}
~ha_memem()= default;
const char *index_type(uint key_number) { return ""; }
ulonglong table_flags() const { return 0; }
ulong index_flags(uint inx, uint part, bool all_parts) const { return 0; }
/* The following defines can be increased if necessary */
#define MEMEM_MAX_KEY MAX_KEY /* Max allowed keys */
#define MEMEM_MAX_KEY_SEG 16 /* Max segments for key */
#define MEMEM_MAX_KEY_LENGTH 3500 /* Like in InnoDB */
uint max_supported_keys() const { return MEMEM_MAX_KEY; }
uint max_supported_key_length() const { return MEMEM_MAX_KEY_LENGTH; }
uint max_supported_key_part_length() const { return MEMEM_MAX_KEY_LENGTH; }
int open(const char *name, int mode, uint test_if_locked) { return 0; }
int close(void) { return 0; }
int truncate() { return 0; }
int rnd_init(bool scan);
int rnd_next(uchar *buf);
int rnd_pos(uchar *buf, uchar *pos) { return 0; }
int index_read_map(uchar *buf, const uchar *key, key_part_map keypart_map,
enum ha_rkey_function find_flag)
{
return HA_ERR_END_OF_FILE;
}
int index_read_idx_map(uchar *buf, uint idx, const uchar *key,
key_part_map keypart_map,
enum ha_rkey_function find_flag)
{
return HA_ERR_END_OF_FILE;
}
int index_read_last_map(uchar *buf, const uchar *key,
key_part_map keypart_map)
{
return HA_ERR_END_OF_FILE;
}
int index_next(uchar *buf) { return HA_ERR_END_OF_FILE; }
int index_prev(uchar *buf) { return HA_ERR_END_OF_FILE; }
int index_first(uchar *buf) { return HA_ERR_END_OF_FILE; }
int index_last(uchar *buf) { return HA_ERR_END_OF_FILE; }
void position(const uchar *record) { return; }
int info(uint flag) { return 0; }
int external_lock(THD *thd, int lock_type) { return 0; }
int create(const char *name, TABLE *table_arg, HA_CREATE_INFO *create_info);
THR_LOCK_DATA **store_lock(THD *thd, THR_LOCK_DATA **to,
enum thr_lock_type lock_type)
{
return to;
}
int delete_table(const char *name) { return 0; }
private:
void reset_memem_table();
virtual int write_row(const uchar *buf);
int update_row(const uchar *old_data, const uchar *new_data)
{
return HA_ERR_WRONG_COMMAND;
};
int delete_row(const uchar *buf) { return HA_ERR_WRONG_COMMAND; }
};
A complete storage engine might seriously implement all of these methods. But we'll only seriously implement 7 of them.
To finish up the boilerplate, we'll switch over to ha_memem.cc and set up the includes.
/* Copyright (c) 2005, 2012, Oracle and/or its affiliates. All rights reserved.
This program is free software; you can redistribute it and/or modify
it under the terms of the GNU General Public License as published by
the Free Software Foundation; version 2 of the License.
This program is distributed in the hope that it will be useful,
but WITHOUT ANY WARRANTY; without even the implied warranty of
MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE. See the
GNU General Public License for more details.
You should have received a copy of the GNU General Public License
along with this program; if not, write to the Free Software
Foundation, Inc., 51 Franklin St, Fifth Floor, Boston, MA 02110-1335 USA */
#ifdef USE_PRAGMA_IMPLEMENTATION
#pragma implementation // gcc: Class implementation
#endif
#define MYSQL_SERVER 1
#include <my_global.h>
#include "sql_priv.h"
#include "unireg.h"
#include "sql_class.h"
#include "ha_memem.h"
Ok! Let's dig into the implementation.
Implementation
The global database
First up, we need to declare a global MememDatabase*
instance. We'll
also implement a helper function for finding the index of a table by
name within the database.
// WARNING! All accesses of `database` in this code are thread
// unsafe. Since this was written during a hack week, I didn't have
// time to figure out MySQL/MariaDB's runtime well enough to do the
// thread-safe version of this.
static MememDatabase *database;
static int memem_table_index(const char *name)
{
int i;
assert(database->tables.size() < INT_MAX);
for (i= 0; i < (int) database->tables.size(); i++)
{
if (strcmp(database->tables[i]->name->c_str(), name) == 0)
{
return i;
}
}
return -1;
}
As I wrote this post I noticed that this code also assumes there's only a single database. That isn't how MySQL works. Every time you call USE ... in MySQL you are switching between databases. You can query tables across databases. A real in-memory backend would need to be aware of the different databases, not just different tables. But to keep the code succinct we won't implement that in this post.
Next we'll implement plugin initialization and cleanup.
Plugin lifecycle
Before we register the plugin with MariaDB, we need to set up initialization and cleanup methods for it.
The initialization method will take care of initializing the global MememDatabase* database object. It will set up a handler for creating new instances of our handler subclass. And it will set up a handler for deleting tables.
static handler *memem_create_handler(handlerton *hton, TABLE_SHARE *table,
MEM_ROOT *mem_root)
{
return new (mem_root) ha_memem(hton, table);
}
static int memem_init(void *p)
{
handlerton *memem_hton;
memem_hton= (handlerton *) p;
memem_hton->db_type= DB_TYPE_AUTOASSIGN;
memem_hton->create= memem_create_handler;
memem_hton->drop_table= [](handlerton *, const char *name) {
int index= memem_table_index(name);
if (index == -1)
{
return HA_ERR_NO_SUCH_TABLE;
}
database->tables.erase(database->tables.begin() + index);
DBUG_PRINT("info", ("[MEMEM] Deleted table '%s'.", name));
return 0;
};
memem_hton->flags= HTON_CAN_RECREATE;
// Initialize global in-memory database.
database= new MememDatabase;
return 0;
}
The DBUG_PRINT macro is a debug helper MySQL/MariaDB gives us. As noted above, the output is directed to the file specified by the --debug flag. Unfortunately I couldn't figure out how to flush the stream this macro writes to. Occasionally, when there was a segfault, logs I expected to see weren't there. And the file would often contain what looked like partially written logs. Anyway, as long as there wasn't a segfault, the debug file would eventually contain the DBUG_PRINT logs.
The only thing the plugin cleanup function must do is delete the global database.
static int memem_fini(void *p)
{
delete database;
return 0;
}
Now we can register the plugin!
Plugin registration
The maria_declare_plugin and maria_declare_plugin_end macros register the plugin's metadata (name, version, etc.) and callbacks.
struct st_mysql_storage_engine memem_storage_engine= {
MYSQL_HANDLERTON_INTERFACE_VERSION};
maria_declare_plugin(memem){
MYSQL_STORAGE_ENGINE_PLUGIN,
&memem_storage_engine,
"MEMEM",
"MySQL AB",
"In-memory database.",
PLUGIN_LICENSE_GPL,
memem_init, /* Plugin Init */
memem_fini, /* Plugin Deinit */
0x0100 /* 1.0 */,
NULL, /* status variables */
NULL, /* system variables */
"1.0", /* string version */
MariaDB_PLUGIN_MATURITY_STABLE /* maturity */
} maria_declare_plugin_end;
That's it! Now we need to implement methods for writing rows, reading rows, and creating a new table.
Create table
To create a table, we make sure one by this name doesn't already
exist, make sure it only has INTEGER
fields, allocate memory for the
table, and append it to the global database.
int ha_memem::create(const char *name, TABLE *table_arg,
HA_CREATE_INFO *create_info)
{
assert(memem_table_index(name) == -1);
// We only support INTEGER fields for now.
uint i = 0;
while (table_arg->field[i]) {
if (table_arg->field[i]->type() != MYSQL_TYPE_LONG)
{
DBUG_PRINT("info", ("Unsupported field type."));
return 1;
}
i++;
}
auto t= std::make_shared<MememTable>();
t->name= std::make_shared<std::string>(name);
database->tables.push_back(t);
DBUG_PRINT("info", ("[MEMEM] Created table '%s'.", name));
return 0;
}
Not very complicated. Let's handle INSERT-ing rows next.
Insert row
There is no method called when an INSERT starts. There is a table field on the handler parent class, though, that is updated when a SELECT or INSERT is running. So we can fetch the current table from that field.
Since we have a slot for a std::shared_ptr<MememTable> memem_table on the ha_memem class, we can check if it is NULL when we insert a row. If it is, we look up the current table and set this->memem_table to its MememTable.
But there's a bit more to it than just the table name. The const char* name passed to the create() method above seems to be a sort of fully qualified name for the table. By observation, when creating a table y in a database test, the const char* name value is ./test/y. The . prefix probably means that the database is local, but I'm not sure.
So we'll write a helper method that reconstructs the fully qualified table name and then looks it up in the global database.
void ha_memem::reset_memem_table()
{
// Reset table cursor.
current_position= 0;
std::string full_name= "./" + std::string(table->s->db.str) + "/" +
std::string(table->s->table_name.str);
DBUG_PRINT("info", ("[MEMEM] Resetting to '%s'.", full_name.c_str()));
assert(database->tables.size() > 0);
int index= memem_table_index(full_name.c_str());
assert(index >= 0);
assert(index < (int) database->tables.size());
memem_table= database->tables[index];
}
Then we can use this within write_row to figure out the current MememTable being queried.
But first, let's digress into how MySQL stores rows.
The MySQL row API
When you write a Postgres custom storage API, you are expected to basically read from or write to an array of Datum. Totally sensible.
In MySQL, you read from and write to an array of bytes. That's pretty weird to me. Of course you can build your own higher-level serialization/deserialization on top of it. But it's just strange to me that everyone has to know this basically opaque API.
Certainly it's documented.
The handler class is the interface for dynamically loadable
storage engines. Do not add ifdefs and take care when adding or
changing virtual functions to avoid vtable confusion
Functions in this class accept and return table columns data. Two data
representation formats are used:
1. TableRecordFormat - Used to pass [partial] table records to/from
storage engine
2. KeyTupleFormat - used to pass index search tuples (aka "keys") to
storage engine. See opt_range.cc for description of this format.
TableRecordFormat
=================
[Warning: this description is work in progress and may be incomplete]
The table record is stored in a fixed-size buffer:
record: null_bytes, column1_data, column2_data, ...
The offsets of the parts of the buffer are also fixed: every column has
an offset to its column{i}_data, and if it is nullable it also has its own
bit in null_bytes.
In our implementation, we'll skip the support for NULL values. We'll only support INTEGER fields. But we still need to be aware that the first byte will be taken up. We'll also assume there won't be more than one byte of a NULL bitmap.
It is this opaque byte array that we'll read from in write_row(const uchar* buf) and write to in rnd_next(uchar* buf).
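As a concrete sketch (my illustration; assumes a little-endian machine, two non-nullable INTEGER columns, and the single null-bitmap byte described above), the row (2, 1029) in a table like y(i int, j int) occupies 9 bytes:

#include <cstdint>
#include <cstdio>
#include <cstring>

int main()
{
  unsigned char buf[9]= {0}; // 1 null-bitmap byte + 2 * 4-byte INTEGER columns
  int32_t i= 2, j= 1029;
  memcpy(buf + 1, &i, sizeof(i));             // column1_data
  memcpy(buf + 1 + sizeof(i), &j, sizeof(j)); // column2_data

  // Reading it back out is the mirror image.
  int32_t out_i, out_j;
  memcpy(&out_i, buf + 1, sizeof(out_i));
  memcpy(&out_j, buf + 1 + sizeof(out_i), sizeof(out_j));
  printf("(%d, %d)\n", out_i, out_j); // prints (2, 1029)
  return 0;
}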
Insert row (take two)
To keep things simple we're going to store the row in MememTable the same way MySQL passes it around.
int ha_memem::write_row(const uchar *buf)
{
if (memem_table == NULL)
{
reset_memem_table();
}
// Assume there are no NULLs.
buf++;
uint field_count = 0;
while (table->field[field_count]) field_count++;
// Store the row in the same format MariaDB gives us.
auto row= std::make_shared<std::vector<uchar>>(
buf, buf + sizeof(int) * field_count);
memem_table->rows.push_back(row);
return 0;
}
Which makes reading the row quite simple too!
Read row
The only slight difference between reading and writing a row is that MySQL/MariaDB will tell us when the SELECT scan for a table starts. We'll use that opportunity to reset the current_position cursor and the memem_table field. This matters because, again, a handler instance serves one query at a time but is reused for later queries.
int ha_memem::rnd_init(bool scan)
{
reset_memem_table();
return 0;
}
int ha_memem::rnd_next(uchar *buf)
{
if (current_position == memem_table->rows.size())
{
// Reset the in-memory table to make logic errors more obvious.
memem_table= NULL;
return HA_ERR_END_OF_FILE;
}
assert(current_position < memem_table->rows.size());
uchar *ptr= buf;
*ptr= 0;
ptr++;
// Rows internally are stored in the same format that MariaDB
// wants. So we can just copy them over.
std::shared_ptr<std::vector<uchar>> row= memem_table->rows[current_position];
std::copy(row->begin(), row->end(), ptr);
current_position++;
return 0;
}
And we're done!
Build and test
Go back into the build directory we created within the source tree root and rerun make -j8.
Kill the server (you'll need to do something like killall mariadbd since the server doesn't respond to Ctrl-c). And restart it.
For some reason this plugin doesn't need to be explicitly loaded. We can run SHOW PLUGINS; in the MariaDB CLI and we'll see it.
$ ./build/client/mariadb --defaults-extra-file=/home/phil/vendor/mariadb/my.cnf --database=test
Reading table information for completion of table and column names
You can turn off this feature to get a quicker startup with -A
Welcome to the MariaDB monitor. Commands end with ; or \g.
Your MariaDB connection id is 5
Server version: 11.4.0-MariaDB-debug Source distribution
Copyright (c) 2000, 2018, Oracle, MariaDB Corporation Ab and others.
Type 'help;' or '\h' for help. Type '\c' to clear the current input statement.
MariaDB [test]> SHOW PLUGINS;
+-------------------------------+----------+--------------------+-----------------+---------+
| Name | Status | Type | Library | License |
+-------------------------------+----------+--------------------+-----------------+---------+
| binlog | ACTIVE | STORAGE ENGINE | NULL | GPL |
...
| MEMEM | ACTIVE | STORAGE ENGINE | NULL | GPL |
...
| BLACKHOLE | ACTIVE | STORAGE ENGINE | ha_blackhole.so | GPL |
+-------------------------------+----------+--------------------+-----------------+---------+
73 rows in set (0.012 sec)
There we go! To create a table with it we need to set ENGINE = MEMEM. For example, CREATE TABLE x (i INT) ENGINE = MEMEM.
Let's create a script to try out the memem engine, in storage/memem/test.sql.
drop table if exists y;
drop table if exists z;
create table y(i int, j int) engine = MEMEM;
insert into y values (2, 1029);
insert into y values (92, 8);
select * from y where i + 8 = 10;
create table z(a int) engine = MEMEM;
insert into z values (322);
insert into z values (8);
select * from z where a > 20;
And run it.
$ ./build/client/mariadb --defaults-extra-file=$(pwd)/my.cnf --database=test --table --verbose < storage/memem/test.sql
--------------
drop table if exists y
--------------
--------------
drop table if exists z
--------------
--------------
create table y(i int, j int) engine = MEMEM
--------------
--------------
insert into y values (2, 1029)
--------------
--------------
insert into y values (92, 8)
--------------
--------------
select * from y where i + 8 = 10
--------------
+------+------+
| i | j |
+------+------+
| 2 | 1029 |
+------+------+
--------------
create table z(a int) engine = MEMEM
--------------
--------------
insert into z values (322)
--------------
--------------
insert into z values (8)
--------------
--------------
select * from z where a > 20
--------------
+------+
| a |
+------+
| 322 |
+------+
What you see there is the power of storage engines! The full SQL language still works even though we implemented storage somewhere completely different from the default.
In-memory is boring
Certainly, I'm getting bored doing the same project over and over again on different databases. However, it's minimal projects like this that make it super easy to then go and port the storage engine to something else.
The goal here is to be minimal but meaningful. And I've accomplished that for myself at least!
On ChatGPT
As I've written before, this sort of exploration wouldn't be possible within the time frame I gave myself if it weren't for ChatGPT. Specifically, the paid-tier GPT-4.
Neither the MySQL nor the MariaDB docs were so helpful that I could immediately figure out things like how to get the current table name within a scan (the table member of the handler class).
With ChatGPT you can ask questions like: "In a MySQL C++ plugin, how do I get the name of the table from a handler class as a C string?" Sometimes it's right and sometimes it's not. But you can try out the code, and if it builds it is at least somewhat correct!