November 1, 2023

Writing a storage engine for Postgres: an in-memory Table Access Method

With Postgres 12, released in 2019, it became possible to swap out Postgres's storage engine.

This is a feature MySQL has supported for a long time. There are at least 8 different built-in engines you can pick from. MyRocks, which puts MySQL on RocksDB, is another popular third-party engine.

I assume there will be a renaissance of Postgres storage engines. To date, the efforts are nascent. OrioleDB and Citus Columnar are two promising third-party table access methods being actively developed.

Why alternative storage engines?

The ability to swap storage engines is useful because different workloads sometimes benefit from different storage approaches. Analytics workloads and columnar storage layouts go well together. Write-heavy workloads and LSM trees go well together. And some people like in-memory storage for running integration tests.

By swapping out only the storage engine, you get the benefit of the rest of the Postgres or MySQL infrastructure. The query language, the wire protocol, the ecosystem, etc.

Why not foreign data wrappers?

Very little has been written about the difference between foreign data wrappers (FDWs) and table access methods. The table access method API seems to be the lower-level layer, where presumably you get better performance and cleaner integration. But there is clearly overlap between these two extension options.

For example, there is an FDW for ClickHouse: when you create tables and insert and query rows, you are really creating and querying rows in a ClickHouse server. Similarly, there's an FDW for RocksDB. And Citus's columnar engine works either as a foreign data wrapper or as a table access method.

The Citus page draws the clearest distinction between FDWs and table access methods, but even that page is vague. Performance doesn't seem to be the main difference. Closer integration, and thus the ability to look more like vanilla Postgres from the outside, seems to be the gist.

In any case, I wanted to explore the table access method API.

Digging in

I haven't written Postgres extensions before and I've never written C professionally. If you're familiar with Postgres internals or C and notice something funky, please let me know!

It turns out that almost no one has written about how to implement the minimal set of table access method callbacks needed for basic storage engine operations. So after quite a bit of stumbling to get the basics of an in-memory storage engine working, I'm going to walk you through my approach.

This is prototype-quality code which hopefully will be a useful base for further exploration.

All code for this post is available on GitHub.

A debug Postgres build

First off, let's make a debug build of Postgres.

$ git clone https://github.com/postgres/postgres
$ cd postgres
$ # An arbitrary commit from `master` (after Postgres 16) that I am on
$ git checkout 849172ff4883d44168f96f39d3fde96d0aa34c99
$ ./configure --enable-cassert --enable-debug CFLAGS="-ggdb -Og -g3 -fno-omit-frame-pointer"
$ make -j8
$ sudo make install

This will install Postgres binaries (e.g. psql, pg_ctl, initdb, pg_config) into /usr/local/pgsql/bin.

I'm going to reference those absolute paths throughout this post because you might have a system (package manager) install of Postgres already.

Let's create a database and start up this debug build:

$ /usr/local/pgsql/bin/initdb test-db
$ /usr/local/pgsql/bin/pg_ctl -D test-db -l logfile start

Extension infrastructure

Since we installed Postgres from scratch, /usr/local/pgsql/bin/pg_config will supply all of the infrastructure we need.

The "infrastructure" is basically just PGXS: Postgres Makefile utilities.

It's convention-heavy. So in a new Makefile for this project we'll specify:

  1. MODULES: Any C sources to build, without the .c file extension
  2. EXTENSION: Extension metadata file, without the .control file extension
  3. DATA: A SQL script that is executed when the extension is created (i.e. on CREATE EXTENSION), this time with the .sql extension

MODULES = pgtam
EXTENSION = pgtam
DATA = pgtam--0.0.1.sql

PG_CONFIG = /usr/local/pgsql/bin/pg_config
PGXS := $(shell $(PG_CONFIG) --pgxs)
include $(PGXS)

The final three lines set up the PGXS Makefile library based on the particular installed Postgres build we want to build the extension against and install the extension to.

PGXS gives us a few important targets, like make, make install, and make distclean, that we'll use later on.

pgtam.c

A minimal C file that registers a function capable of serving as a table access method is:

#include "postgres.h"
#include "fmgr.h"
#include "access/tableam.h"

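// Required once in a loadable module; lets Postgres check version compatibility when loading it.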
PG_MODULE_MAGIC;

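// The table of callbacks Postgres will use for tables created with this access method (empty for now).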
const TableAmRoutine memam_methods = {
  .type = T_TableAmRoutine,
};

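// Declare mem_tableam_handler as a version-1 C function; it just hands back the callback table above.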
PG_FUNCTION_INFO_V1(mem_tableam_handler);
Datum mem_tableam_handler(PG_FUNCTION_ARGS) {
  PG_RETURN_POINTER(&memam_methods);
}

If you want to read about extension basics without the complexity of table access methods, you can find a complete, minimal Postgres extension I wrote to validate the infrastructure here. Or you can follow a larger tutorial.

The workflow for registering a table access method is to first run CREATE EXTENSION pgtam. This assumes pgtam is an extension exposing a function that returns a TableAmRoutine struct instance, a table of callbacks that implement the access method.

Then you must run CREATE ACCESS METHOD mem TYPE TABLE HANDLER mem_tableam_handler. And finally you can use the access method when creating a table with USING mem: CREATE TABLE x(a INT) USING mem.

pgtam.control

This file contains extension metadata: at a minimum, the extension's default version and the path of the shared library that implements it.

default_version = '0.0.1'
module_pathname = '$libdir/pgtam'

pgtam--0.0.1.sql

Finally, in pgtam--0.0.1.sql (which is executed when we call CREATE EXTENSION pgtam), we declare the handler as a C-language function and then register it as a table access method.

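-- The internal argument type means this function can only be invoked by Postgres itself, not directly from SQL.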
CREATE OR REPLACE FUNCTION mem_tableam_handler(internal)
RETURNS table_am_handler AS 'pgtam', 'mem_tableam_handler'
LANGUAGE C STRICT;

CREATE ACCESS METHOD mem TYPE TABLE HANDLER mem_tableam_handler;

Build

Now that we've got all the pieces in place, we can build and install the extension.

$ make
$ sudo make install

Let's add a test.sql script to exercise the extension:

DROP EXTENSION IF EXISTS pgtam CASCADE;
CREATE EXTENSION pgtam;
CREATE TABLE x(a INT) USING mem;

And run it:

$ /usr/local/pgsql/bin/psql postgres -f test.sql
DROP EXTENSION
CREATE EXTENSION
psql:test.sql:3: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
psql:test.sql:3: error: connection to server was lost

Ok, so the server crashed! Let's look at the server logs. When we started Postgres with pg_ctl we specified the log file as logfile in the directory where we ran pg_ctl.

If we look through it we'll spot an assertion failure:

$ grep Assert logfile
TRAP: failed Assert("routine->scan_begin != NULL"), File: "tableamapi.c", Line: 52, PID: 2906922

That's a great sign! This is Postgres's debug infrastructure helping to make sure the table access method is correctly implemented.

Table access method stubs

The next step is to add function stubs for all the non-optional methods of the TableAmRoutine struct.

I've done all the work for you already so you can just copy this over the existing pgtam.c. It's a big file, but don't worry. There's nothing to explain. Just a bunch of blank functions returning default values when required.

#include "postgres.h"
#include "fmgr.h"
#include "access/tableam.h"
#include "access/heapam.h"
#include "nodes/execnodes.h"
#include "catalog/index.h"
#include "commands/vacuum.h"
#include "utils/builtins.h"
#include "executor/tuptable.h"

PG_MODULE_MAGIC;

const TableAmRoutine memam_methods;

static const TupleTableSlotOps* memam_slot_callbacks(
  Relation relation
) {
  return NULL;
}

static TableScanDesc memam_beginscan(
  Relation relation,
  Snapshot snapshot,
  int nkeys,
  struct ScanKeyData *key,
  ParallelTableScanDesc parallel_scan,
  uint32 flags
) {
  return NULL;
}

static void memam_rescan(
  TableScanDesc sscan,
  struct ScanKeyData *key,
  bool set_params,
  bool allow_strat,
  bool allow_sync,
  bool allow_pagemode
) {
}

static void memam_endscan(TableScanDesc sscan) {
}

static bool memam_getnextslot(
  TableScanDesc sscan,
  ScanDirection direction,
  TupleTableSlot *slot
) {
  return false;
}

static IndexFetchTableData* memam_index_fetch_begin(Relation rel) {
  return NULL;
}

static void memam_index_fetch_reset(IndexFetchTableData *scan) {
}

static void memam_index_fetch_end(IndexFetchTableData *scan) {
}

static bool memam_index_fetch_tuple(
  struct IndexFetchTableData *scan,
  ItemPointer tid,
  Snapshot snapshot,
  TupleTableSlot *slot,
  bool *call_again,
  bool *all_dead
) {
  return false;
}

static void memam_tuple_insert(
  Relation relation,
  TupleTableSlot *slot,
  CommandId cid,
  int options,
  BulkInsertState bistate
) {
}

static void memam_tuple_insert_speculative(
  Relation relation,
  TupleTableSlot *slot,
  CommandId cid,
  int options,
  BulkInsertState bistate,
  uint32 specToken) {
}

static void memam_tuple_complete_speculative(
  Relation relation,
  TupleTableSlot *slot,
  uint32 specToken,
  bool succeeded) {
}

static void memam_multi_insert(
  Relation relation,
  TupleTableSlot **slots,
  int ntuples,
  CommandId cid,
  int options,
  BulkInsertState bistate
) {
}

static TM_Result memam_tuple_delete(
  Relation relation,
  ItemPointer tid,
  CommandId cid,
  Snapshot snapshot,
  Snapshot crosscheck,
  bool wait,
  TM_FailureData *tmfd,
  bool changingPart
) {
  TM_Result result = {};
  return result;
}

static TM_Result memam_tuple_update(
  Relation relation,
  ItemPointer otid,
  TupleTableSlot *slot,
  CommandId cid,
  Snapshot snapshot,
  Snapshot crosscheck,
  bool wait,
  TM_FailureData *tmfd,
  LockTupleMode *lockmode,
  TU_UpdateIndexes *update_indexes
) {
  TM_Result result = {};
  return result;
}

static TM_Result memam_tuple_lock(
  Relation relation,
  ItemPointer tid,
  Snapshot snapshot,
  TupleTableSlot *slot,
  CommandId cid,
  LockTupleMode mode,
  LockWaitPolicy wait_policy,
  uint8 flags,
  TM_FailureData *tmfd)
{
  TM_Result result = {};
  return result;
}

static bool memam_fetch_row_version(
  Relation relation,
  ItemPointer tid,
  Snapshot snapshot,
  TupleTableSlot *slot
) {
  return false;
}

static void memam_get_latest_tid(
  TableScanDesc sscan,
  ItemPointer tid
) {
}

static bool memam_tuple_tid_valid(TableScanDesc scan, ItemPointer tid) {
  return false;
}

static bool memam_tuple_satisfies_snapshot(
  Relation rel,
  TupleTableSlot *slot,
  Snapshot snapshot
) {
  return false;
}

static TransactionId memam_index_delete_tuples(
  Relation rel,
  TM_IndexDeleteOp *delstate
) {
  TransactionId id = {};
  return id;
}

static void memam_relation_set_new_filelocator(
  Relation rel,
  const RelFileLocator *newrlocator,
  char persistence,
  TransactionId *freezeXid,
  MultiXactId *minmulti
) {
}

static void memam_relation_nontransactional_truncate(
  Relation rel
) {
}

static void memam_relation_copy_data(
  Relation rel,
  const RelFileLocator *newrlocator
) {
}

static void memam_relation_copy_for_cluster(
  Relation OldHeap,
  Relation NewHeap,
  Relation OldIndex,
  bool use_sort,
  TransactionId OldestXmin,
  TransactionId *xid_cutoff,
  MultiXactId *multi_cutoff,
  double *num_tuples,
  double *tups_vacuumed,
  double *tups_recently_dead
) {
}

static void memam_vacuum_rel(
  Relation rel,
  VacuumParams *params,
  BufferAccessStrategy bstrategy
) {
}

static bool memam_scan_analyze_next_block(
  TableScanDesc scan,
  BlockNumber blockno,
  BufferAccessStrategy bstrategy
) {
  return false;
}

static bool memam_scan_analyze_next_tuple(
  TableScanDesc scan,
  TransactionId OldestXmin,
  double *liverows,
  double *deadrows,
  TupleTableSlot *slot
) {
  return false;
}

static double memam_index_build_range_scan(
  Relation heapRelation,
  Relation indexRelation,
  IndexInfo *indexInfo,
  bool allow_sync,
  bool anyvisible,
  bool progress,
  BlockNumber start_blockno,
  BlockNumber numblocks,
  IndexBuildCallback callback,
  void *callback_state,
  TableScanDesc scan
) {
  return 0;
}

static void memam_index_validate_scan(
  Relation heapRelation,
  Relation indexRelation,
  IndexInfo *indexInfo,
  Snapshot snapshot,
  ValidateIndexState *state
) {
}

static bool memam_relation_needs_toast_table(Relation rel) {
  return false;
}

static Oid memam_relation_toast_am(Relation rel) {
  Oid oid = {};
  return oid;
}

static void memam_fetch_toast_slice(
  Relation toastrel,
  Oid valueid,
  int32 attrsize,
  int32 sliceoffset,
  int32 slicelength,
  struct varlena *result
) {
}

static void memam_estimate_rel_size(
  Relation rel,
  int32 *attr_widths,
  BlockNumber *pages,
  double *tuples,
  double *allvisfrac
) {
}

static bool memam_scan_sample_next_block(
  TableScanDesc scan, SampleScanState *scanstate
) {
  return false;
}

static bool memam_scan_sample_next_tuple(
  TableScanDesc scan,
  SampleScanState *scanstate,
  TupleTableSlot *slot
) {
  return false;
}

const TableAmRoutine memam_methods = {
  .type = T_TableAmRoutine,

  .slot_callbacks = memam_slot_callbacks,

  .scan_begin = memam_beginscan,
  .scan_end = memam_endscan,
  .scan_rescan = memam_rescan,
  .scan_getnextslot = memam_getnextslot,

  .parallelscan_estimate = table_block_parallelscan_estimate,
  .parallelscan_initialize = table_block_parallelscan_initialize,
  .parallelscan_reinitialize = table_block_parallelscan_reinitialize,

  .index_fetch_begin = memam_index_fetch_begin,
  .index_fetch_reset = memam_index_fetch_reset,
  .index_fetch_end = memam_index_fetch_end,
  .index_fetch_tuple = memam_index_fetch_tuple,

  .tuple_insert = memam_tuple_insert,
  .tuple_insert_speculative = memam_tuple_insert_speculative,
  .tuple_complete_speculative = memam_tuple_complete_speculative,
  .multi_insert = memam_multi_insert,
  .tuple_delete = memam_tuple_delete,
  .tuple_update = memam_tuple_update,
  .tuple_lock = memam_tuple_lock,

  .tuple_fetch_row_version = memam_fetch_row_version,
  .tuple_get_latest_tid = memam_get_latest_tid,
  .tuple_tid_valid = memam_tuple_tid_valid,
  .tuple_satisfies_snapshot = memam_tuple_satisfies_snapshot,
  .index_delete_tuples = memam_index_delete_tuples,

  .relation_set_new_filelocator = memam_relation_set_new_filelocator,
  .relation_nontransactional_truncate = memam_relation_nontransactional_truncate,
  .relation_copy_data = memam_relation_copy_data,
  .relation_copy_for_cluster = memam_relation_copy_for_cluster,
  .relation_vacuum = memam_vacuum_rel,
  .scan_analyze_next_block = memam_scan_analyze_next_block,
  .scan_analyze_next_tuple = memam_scan_analyze_next_tuple,
  .index_build_range_scan = memam_index_build_range_scan,
  .index_validate_scan = memam_index_validate_scan,

  .relation_size = table_block_relation_size,
  .relation_needs_toast_table = memam_relation_needs_toast_table,
  .relation_toast_am = memam_relation_toast_am,
  .relation_fetch_toast_slice = memam_fetch_toast_slice,

  .relation_estimate_size = memam_estimate_rel_size,

  .scan_sample_next_block = memam_scan_sample_next_block,
  .scan_sample_next_tuple = memam_scan_sample_next_tuple
};

PG_FUNCTION_INFO_V1(mem_tableam_handler);

Datum mem_tableam_handler(PG_FUNCTION_ARGS) {
  PG_RETURN_POINTER(&memam_methods);
}

Let's build and test it!

$ make && sudo make install
$ /usr/local/pgsql/bin/psql postgres -f test.sql
psql:test.sql:1: NOTICE:  drop cascades to table x
DROP EXTENSION
CREATE EXTENSION
CREATE TABLE

Hey we're getting somewhere! It successfully created the table with our custom table access method.

Querying rows

Next, let's try querying the table by adding a SELECT a FROM x to test.sql and running it:

$ /usr/local/pgsql/bin/psql postgres -f test.sql
psql:test.sql:1: NOTICE:  drop cascades to table x
DROP EXTENSION
CREATE EXTENSION
CREATE TABLE
psql:test.sql:6: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
psql:test.sql:6: error: connection to server was lost

This time there's nothing in logfile that helps:

$ tail -n15 logfile
2023-11-01 18:43:32.449 UTC [2906199] LOG:  database system is ready to accept connections
2023-11-01 18:58:32.572 UTC [2907997] LOG:  checkpoint starting: time
2023-11-01 18:58:35.305 UTC [2907997] LOG:  checkpoint complete: wrote 28 buffers (0.2%); 0 WAL file(s) added, 0 removed, 0 recycled; write=2.712 s, sync=0.015 s, total=2.733 s; sync files=23, longest=0.004 s, average=0.001 s; distance=128 kB, estimate=150 kB; lsn=0/15F88E0, redo lsn=0/15F8888
2023-11-01 19:08:14.485 UTC [2906199] LOG:  server process (PID 2908242) was terminated by signal 11: Segmentation fault
2023-11-01 19:08:14.485 UTC [2906199] DETAIL:  Failed process was running: SELECT a FROM x;
2023-11-01 19:08:14.485 UTC [2906199] LOG:  terminating any other active server processes
2023-11-01 19:08:14.486 UTC [2906199] LOG:  all server processes terminated; reinitializing
2023-11-01 19:08:14.508 UTC [2908253] LOG:  database system was interrupted; last known up at 2023-11-01 18:58:35 UTC
2023-11-01 19:08:14.518 UTC [2908253] LOG:  database system was not properly shut down; automatic recovery in progress
2023-11-01 19:08:14.519 UTC [2908253] LOG:  redo starts at 0/15F8888
2023-11-01 19:08:14.520 UTC [2908253] LOG:  invalid record length at 0/161DE70: expected at least 24, got 0
2023-11-01 19:08:14.520 UTC [2908253] LOG:  redo done at 0/161DE38 system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 0.00 s
2023-11-01 19:08:14.521 UTC [2908254] LOG:  checkpoint starting: end-of-recovery immediate wait
2023-11-01 19:08:14.532 UTC [2908254] LOG:  checkpoint complete: wrote 35 buffers (0.2%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.010 s, total=0.012 s; sync files=27, longest=0.003 s, average=0.001 s; distance=149 kB, estimate=149 kB; lsn=0/161DE70, redo lsn=0/161DE70
2023-11-01 19:08:14.533 UTC [2906199] LOG:  database system is ready to accept connections

This was the first place I got stuck. How on earth do I figure out what methods to implement? I mean, it's clearly one or more of these methods from the struct. But there are so many methods.

I tried setting a breakpoint in gdb on the process returned by SELECT pg_backend_pid() for a psql session, but the breakpoint never seemed to be hit for any of my methods.

So I went with the low-tech solution: I opened a file, /tmp/pgtam.log, turned off buffering on it, and added a log line to every method on the TableAmRoutine struct:

@@ -12,9 +12,13 @@

 const TableAmRoutine memam_methods;

+FILE* fd;
+#define DEBUG_FUNC() fprintf(fd, "in %s\n", __func__);
+
 static const TupleTableSlotOps* memam_slot_callbacks(
   Relation relation
 ) {
+  DEBUG_FUNC();
   return NULL;
 }

@@ -26,6 +30,7 @@
   ParallelTableScanDesc parallel_scan,
   uint32 flags
 ) {
+  DEBUG_FUNC();
   return NULL;
 }

@@ -37,9 +42,11 @@
   bool allow_sync,
   bool allow_pagemode
 ) {
+  DEBUG_FUNC();
 }

 static void memam_endscan(TableScanDesc sscan) {
+  DEBUG_FUNC();
 }

 static bool memam_getnextslot(
@@ -47,17 +54,21 @@
   ScanDirection direction,
   TupleTableSlot *slot
 ) {
+  DEBUG_FUNC();
   return false;
 }

 static IndexFetchTableData* memam_index_fetch_begin(Relation rel) {
+  DEBUG_FUNC();
   return NULL;
 }

 static void memam_index_fetch_reset(IndexFetchTableData *scan) {
+  DEBUG_FUNC();
 }

 static void memam_index_fetch_end(IndexFetchTableData *scan) {
+  DEBUG_FUNC();
 }

 static bool memam_index_fetch_tuple(
@@ -68,6 +79,7 @@
   bool *call_again,
   bool *all_dead
 ) {
+  DEBUG_FUNC();
   return false;
 }

@@ -78,6 +90,7 @@
   int options,
   BulkInsertState bistate
 ) {
+  DEBUG_FUNC();
 }

 static void memam_tuple_insert_speculative(
@@ -87,6 +100,7 @@
   int options,
   BulkInsertState bistate,
   uint32 specToken) {
+  DEBUG_FUNC();
 }

 static void memam_tuple_complete_speculative(
@@ -94,6 +108,7 @@
   TupleTableSlot *slot,
   uint32 specToken,
   bool succeeded) {
+  DEBUG_FUNC();
 }

 static void memam_multi_insert(
@@ -104,6 +119,7 @@
   int options,
   BulkInsertState bistate
 ) {
+  DEBUG_FUNC();
 }

 static TM_Result memam_tuple_delete(
@@ -117,6 +133,7 @@
   bool changingPart
 ) {
   TM_Result result = {};
+  DEBUG_FUNC();
   return result;
 }

@@ -133,6 +150,7 @@
   TU_UpdateIndexes *update_indexes
 ) {
   TM_Result result = {};
+  DEBUG_FUNC();
   return result;
 }

@@ -148,6 +166,7 @@
   TM_FailureData *tmfd)
 {
   TM_Result result = {};
+  DEBUG_FUNC();
   return result;
 }

@@ -157,6 +176,7 @@
   Snapshot snapshot,
   TupleTableSlot *slot
 ) {
+  DEBUG_FUNC();
   return false;
 }

@@ -164,9 +184,11 @@
   TableScanDesc sscan,
   ItemPointer tid
 ) {
+  DEBUG_FUNC();
 }

 static bool memam_tuple_tid_valid(TableScanDesc scan, ItemPointer tid) {
+  DEBUG_FUNC();
   return false;
 }

@@ -175,6 +197,7 @@
   TupleTableSlot *slot,
   Snapshot snapshot
 ) {
+  DEBUG_FUNC();
   return false;
 }

@@ -183,6 +206,7 @@
   TM_IndexDeleteOp *delstate
 ) {
   TransactionId id = {};
+  DEBUG_FUNC();
   return id;
 }

@@ -193,17 +217,20 @@
   TransactionId *freezeXid,
   MultiXactId *minmulti
 ) {
+  DEBUG_FUNC();
 }

 static void memam_relation_nontransactional_truncate(
   Relation rel
 ) {
+  DEBUG_FUNC();
 }

 static void memam_relation_copy_data(
   Relation rel,
   const RelFileLocator *newrlocator
 ) {
+  DEBUG_FUNC();
 }

 static void memam_relation_copy_for_cluster(
@@ -218,6 +245,7 @@
   double *tups_vacuumed,
   double *tups_recently_dead
 ) {
+  DEBUG_FUNC();
 }

 static void memam_vacuum_rel(
@@ -225,6 +253,7 @@
   VacuumParams *params,
   BufferAccessStrategy bstrategy
 ) {
+  DEBUG_FUNC();
 }

 static bool memam_scan_analyze_next_block(
@@ -232,6 +261,7 @@
   BlockNumber blockno,
   BufferAccessStrategy bstrategy
 ) {
+  DEBUG_FUNC();
   return false;
 }

@@ -242,6 +272,7 @@
   double *deadrows,
   TupleTableSlot *slot
 ) {
+  DEBUG_FUNC();
   return false;
 }

@@ -258,6 +289,7 @@
   void *callback_state,
   TableScanDesc scan
 ) {
+  DEBUG_FUNC();
   return 0;
 }

@@ -268,14 +300,17 @@
   Snapshot snapshot,
   ValidateIndexState *state
 ) {
+  DEBUG_FUNC();
 }

 static bool memam_relation_needs_toast_table(Relation rel) {
+  DEBUG_FUNC();
   return false;
 }

 static Oid memam_relation_toast_am(Relation rel) {
   Oid oid = {};
+  DEBUG_FUNC();
   return oid;
 }

@@ -287,6 +322,7 @@
   int32 slicelength,
   struct varlena *result
 ) {
+  DEBUG_FUNC();
 }

 static void memam_estimate_rel_size(
@@ -296,11 +332,13 @@
   double *tuples,
   double *allvisfrac
 ) {
+  DEBUG_FUNC();
 }

 static bool memam_scan_sample_next_block(
   TableScanDesc scan, SampleScanState *scanstate
 ) {
+  DEBUG_FUNC();
   return false;
 }

@@ -309,6 +347,7 @@
   SampleScanState *scanstate,
   TupleTableSlot *slot
 ) {
+  DEBUG_FUNC();
   return false;
 }

And then in the entrypoint, initialize the file for logging.

@@ -369,5 +408,9 @@
 PG_FUNCTION_INFO_V1(mem_tableam_handler);

 Datum mem_tableam_handler(PG_FUNCTION_ARGS) {
+  fd = fopen("/tmp/pgtam.log", "a");
+  setvbuf(fd, NULL, _IONBF, 0); // Prevent buffering
+  fprintf(fd, "\n\nmem_tableam handler loaded\n");
+
   PG_RETURN_POINTER(&memam_methods);
 }

Let's give it a shot!

$ make && sudo make install
$ /usr/local/pgsql/bin/psql postgres -f test.sql
psql:test.sql:1: NOTICE:  drop cascades to table x
DROP EXTENSION
CREATE EXTENSION
CREATE TABLE
psql:test.sql:6: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
psql:test.sql:6: error: connection to server was lost

And let's check our log file:

$ cat /tmp/pgtam.log


mem_tableam handler loaded


mem_tableam handler loaded
in memam_relation_set_new_filelocator


mem_tableam handler loaded
in memam_relation_needs_toast_table


mem_tableam handler loaded
in memam_estimate_rel_size
in memam_slot_callbacks

Now we're getting somewhere!

I later realized elog() is how most people log within Postgres and within extensions. I didn't know that when I was getting started, though. This separate log file was a simple way to get the info out.
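
As a side note, a similar per-function macro built on elog() might look like the sketch below. This is just an illustration of that alternative, not what the extension uses; DEBUG_FUNC_ELOG is a name I made up, and LOG-level messages end up in the server log (our logfile) rather than a separate file.

#include "postgres.h"

// Sketch: log the current function name through Postgres's own logging.
// elog(LOG, ...) writes to the server log; DEBUG_FUNC_ELOG is hypothetical.
#define DEBUG_FUNC_ELOG() elog(LOG, "in %s", __func__)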

slot_callbacks

Since the request crashes and the last logged function is memam_slot_callbacks, it seems like that is where we should concentrate. The table access method docs suggest looking at the default heap access method for inspiration.

Its version of slot_callbacks returns &TTSOpsBufferHeapTuple:

static const TupleTableSlotOps *
heapam_slot_callbacks(Relation relation)
{
    return &TTSOpsBufferHeapTuple;
}

I have no idea what that means, but since it is defined in src/backend/executor/execTuples.c it doesn't seem to be tied to the heap access method implementation. Let's try it.

It works initially, but I noticed later on that TTSOpsBufferHeapTuple is not the right choice here; TTSOpsVirtual seems to be the right implementation.

@@ -19,7 +19,7 @@
   Relation relation
 ) {
   DEBUG_FUNC();
-  return NULL;
+  return &TTSOpsVirtual;
 }

 static TableScanDesc memam_beginscan(

Build and run:

$ make && sudo make install
$ /usr/local/pgsql/bin/psql postgres -f test.sql
psql:test.sql:1: NOTICE:  drop cascades to table x
DROP EXTENSION
CREATE EXTENSION
CREATE TABLE
psql:test.sql:6: server closed the connection unexpectedly
        This probably means the server terminated abnormally
        before or while processing the request.
psql:test.sql:6: error: connection to server was lost

It still crashes. But this time in /tmp/pgtam.log we made it into a new method!

$ cat /tmp/pgtam.log


mem_tableam handler loaded


mem_tableam handler loaded
in memam_relation_set_new_filelocator


mem_tableam handler loaded
in memam_relation_needs_toast_table


mem_tableam handler loaded
in memam_estimate_rel_size
in memam_slot_callbacks
in memam_beginscan

scan_begin

The function signature (shown here via the built-in heap_beginscan, which our memam_beginscan mirrors) is:

TableScanDesc heap_beginscan(
  Relation relation,
  Snapshot snapshot,
  int nkeys,
  ScanKey key,
  ParallelTableScanDesc parallel_scan,
  uint32 flags
);

We've been returning NULL from all of the stub methods we just implemented. And since we're crashing in this function, maybe we should try returning something that isn't NULL.

By looking at the definition of TableScanDesc, we can see it is a pointer to the TableScanDescData struct defined in src/include/access/relscan.h.

Let's malloc a TableScanDescData in beginscan and return it, and free it in endscan:

@@ -30,8 +30,12 @@
   ParallelTableScanDesc parallel_scan,
   uint32 flags
 ) {
+  TableScanDescData* scan = {};
   DEBUG_FUNC();
-  return NULL;
+
+  scan = (TableScanDescData*)malloc(sizeof(TableScanDescData));
+
+  return (TableScanDesc)scan;
 }

 static void memam_rescan(
@@ -87,6 +87,7 @@

 static void memam_endscan(TableScanDesc sscan) {
   DEBUG_FUNC();
+  free(sscan);
 }

Build and run (you can do it on your own). No difference.

I got stuck for a while here too. Clearly something must be filled out in this struct but it could be anything. Through trial and error I realized the one field that must be filled out is scan->rs_rd.

@@ -34,6 +34,7 @@
   DEBUG_FUNC();

   scan = (TableScanDescData*)malloc(sizeof(TableScanDescData));
+  scan->rs_rd = relation;

   return (TableScanDesc)scan;
 }

We build and run:

$ make && sudo make install
$ /usr/local/pgsql/bin/psql postgres -f test.sql
psql:test.sql:1: NOTICE:  drop cascades to table x
DROP EXTENSION
CREATE EXTENSION
CREATE TABLE
 a
---
(0 rows)

And it works! It doesn't return any rows, but that's correct: there's nothing to return.

So what if we actually want to return something? Let's check our logs in /tmp/pgtam.log.

$ cat /tmp/pgtam.log


mem_tableam handler loaded


mem_tableam handler loaded
in memam_relation_set_new_filelocator


mem_tableam handler loaded
in memam_relation_needs_toast_table


mem_tableam handler loaded
in memam_estimate_rel_size
in memam_slot_callbacks
in memam_beginscan
in memam_getnextslot
in memam_endscan

Ok, I'm getting the gist of the API. A full table scan (which this is, because there are no indexes at play) starts with slot_callbacks to initialize slot handling, then the scan begins, then getnextslot is called for each row, and then endscan is called to allow for cleanup.
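
To make that concrete, here's a rough sketch of how a sequential scan drives the callbacks we've stubbed out. This is not real executor code; run_scan is a made-up driver, and the slot would come from the ops returned by slot_callbacks.

// Illustration only: how a sequential scan exercises the table AM callbacks.
static void run_scan(Relation rel, Snapshot snapshot, TupleTableSlot *slot) {
  // Begin the scan: no scan keys, no parallel scan, no flags.
  TableScanDesc scan = memam_beginscan(rel, snapshot, 0, NULL, NULL, 0);

  // getnextslot returns true as long as it keeps filling the slot with rows.
  while (memam_getnextslot(scan, ForwardScanDirection, slot)) {
    // The executor would consume one row from the slot here.
  }

  // Clean up.
  memam_endscan(scan);
}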

So let's try returning a row in getnextslot.

getnextslot

The getnextslot signature is:

bool memam_getnextslot(
  TableScanDesc sscan,
  ScanDirection direction,
  TupleTableSlot *slot
);

So the sscan should be what we returned from beginscan and the interface docs say the current row gets stored in slot.

The return value seems to indicate whether or not we've reached the end of the scan. However, even if you return true, the scan will still end if the slot is not filled out correctly. And if the slot is filled out correctly but you unconditionally return true, you will crash the process.

Let's take a look at the definition of TupleTableSlot:

typedef struct TupleTableSlot
{
    NodeTag     type;
#define FIELDNO_TUPLETABLESLOT_FLAGS 1
    uint16      tts_flags;      /* Boolean states */
#define FIELDNO_TUPLETABLESLOT_NVALID 2
    AttrNumber  tts_nvalid;     /* # of valid values in tts_values */
    const TupleTableSlotOps *const tts_ops; /* implementation of slot */
#define FIELDNO_TUPLETABLESLOT_TUPLEDESCRIPTOR 4
    TupleDesc   tts_tupleDescriptor;    /* slot's tuple descriptor */
#define FIELDNO_TUPLETABLESLOT_VALUES 5
    Datum      *tts_values;     /* current per-attribute values */
#define FIELDNO_TUPLETABLESLOT_ISNULL 6
    bool       *tts_isnull;     /* current per-attribute isnull flags */
    MemoryContext tts_mcxt;     /* slot itself is in this context */
    ItemPointerData tts_tid;    /* stored tuple's tid */
    Oid         tts_tableOid;   /* table oid of tuple */
} TupleTableSlot;

tts_values is an array of Datum (which is a Postgres value). So that sounds like the actual values of the row. The tts_isnull field also looks important since that seems to be whether each value in the row is null or not. And tts_nvalid sounds important too since presumably it's the length of the tts_isnull and tts_values arrays.

The rest of it may or may not be important. Let's try filling out these three fields though and see what happens.

Datum

Back in the Postgres C extension documentation, we can see some simple examples of converting between C types and Postgres's Datum type.

For example:

Datum
add_one(PG_FUNCTION_ARGS)
{
    int32   arg = PG_GETARG_INT32(0);

    PG_RETURN_INT32(arg + 1);
}

If we look at the definition of PG_RETURN_INT32 in src/include/fmgr.h, we see:

#define PG_RETURN_INT32(x)   return Int32GetDatum(x)

So Int32GetDatum() is the function we'll use to set a Datum for a cell in a row.
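
As a quick aside, here's a tiny sketch of round-tripping a C integer through a Datum. Int32GetDatum and DatumGetInt32 are the real Postgres macros; the wrapping function is purely for illustration.

#include "postgres.h"

// Sketch: wrap a C int32 in a Datum and unwrap it again.
static int32 datum_round_trip(int32 value) {
  Datum d = Int32GetDatum(value);  // C value -> Datum
  return DatumGetInt32(d);         // Datum -> C value
}

With that in mind, here's the change to getnextslot: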

@@ -54,13 +54,26 @@
   DEBUG_FUNC();
 }

+static bool done = false;
 static bool memam_getnextslot(
   TableScanDesc sscan,
   ScanDirection direction,
   TupleTableSlot *slot
 ) {
   DEBUG_FUNC();
-  return false;
+
+  if (done) {
+    return false;
+  }
+
+  slot->tts_nvalid = 1;
+  slot->tts_values = (Datum*)malloc(sizeof(Datum) * slot->tts_nvalid);
+  slot->tts_values[0] = Int32GetDatum(314 /* Some unique-looking value */);
+  slot->tts_isnull = (bool*)malloc(sizeof(bool) * slot->tts_nvalid);
+  slot->tts_isnull[0] = false;
+  done = true;
+
+  return true;
 }

 static IndexFetchTableData* memam_index_fetch_begin(Relation rel) {

The goal is that we return a single row and then exit the scan. The row will have one 32-bit integer cell with the value 314 (remember we created the table with CREATE TABLE x(a INT); INT is shorthand for INT4, which is a 32-bit integer).

But if we build and run this, we get no rows.

$ make && sudo make install
$ /usr/local/pgsql/bin/psql postgres -f test.sql
psql:test.sql:1: NOTICE:  drop cascades to table x
DROP EXTENSION
CREATE EXTENSION
CREATE TABLE
 a
---
(0 rows)

I got stuck for a while here. Plugging my getnextslot code into ChatGPT helped. One thing it gave me to try was calling ExecStoreVirtualTuple on the slot. I noticed that the built-in heap access method also called a function like this in getnextslot.

And I realized that tts_nvalid is already set up and the memory for tts_values and tts_isnull is already allocated. So the code became a little simpler.

@@ -66,11 +66,9 @@
     return false;
   }

-  slot->tts_nvalid = 1;
-  slot->tts_values = (Datum*)malloc(sizeof(Datum) * slot->tts_nvalid);
   slot->tts_values[0] = Int32GetDatum(314 /* Some unique-looking value */);
-  slot->tts_isnull = (bool*)malloc(sizeof(bool) * slot->tts_nvalid);
   slot->tts_isnull[0] = false;
+  ExecStoreVirtualTuple(slot);
   done = true;

   return true;

Build and run:

$ make && sudo make install
$ /usr/local/pgsql/bin/psql postgres -f test.sql
psql:test.sql:1: NOTICE:  drop cascades to table x
DROP EXTENSION
CREATE EXTENSION
CREATE TABLE
  a
-----
 314
(1 row)

Fantastic!

Creating a table

Now that we've proven we can return random data, let's set up infrastructure for storing tables in memory.

@@ -15,6 +15,41 @@
 FILE* fd;
 #define DEBUG_FUNC() fprintf(fd, "in %s\n", __func__);

+
+struct Column {
+  int value;
+};
+
+struct Row {
+  struct Column* columns;
+  size_t ncolumns;
+};
+
+#define MAX_ROWS 100
+struct Table {
+  char* name;
+  struct Row* rows;
+  size_t nrows;
+};
+
+#define MAX_TABLES 100
+struct Database {
+  struct Table* tables;
+  size_t ntables;
+};
+
+struct Database* database;
+
+static void get_table(struct Table** table, Relation relation) {
+  char* this_name = NameStr(relation->rd_rel->relname);
+  for (size_t i = 0; i < database->ntables; i++) {
+    if (strcmp(database->tables[i].name, this_name) == 0) {
+      *table = &database->tables[i];
+      return;
+    }
+  }
+}
+
 static const TupleTableSlotOps* memam_slot_callbacks(
   Relation relation
 ) {

Based on what we logged in /tmp/pgtam.log it seems like memam_relation_set_new_filelocator is called when a new table is created. So let's handle adding a new table there.

@@ -233,7 +268,16 @@
   TransactionId *freezeXid,
   MultiXactId *minmulti
 ) {
+  struct Table table = {};
   DEBUG_FUNC();
+
+  table.name = strdup(NameStr(rel->rd_rel->relname));
+  fprintf(fd, "Created table: [%s]\n", table.name);
+  table.rows = (struct Row*)malloc(sizeof(struct Row) * MAX_ROWS);
+  table.nrows = 0;
+
+  database->tables[database->ntables] = table;
+  database->ntables++;
 }

 static void memam_relation_nontransactional_truncate(

Finally, we'll initialize the in-memory Database* when the handler is loaded.

@@ -428,5 +472,11 @@
   setvbuf(fd, NULL, _IONBF, 0); // Prevent buffering
   fprintf(fd, "\n\nmem_tableam handler loaded\n");

+  if (database == NULL) {
+    database = (struct Database*)malloc(sizeof(struct Database));
+    database->ntables = 0;
+    database->tables = (struct Table*)malloc(sizeof(struct Table) * MAX_TABLES);
+  }
+
   PG_RETURN_POINTER(&memam_methods);
 }

If we build and run, we won't notice anything new.

$ make && sudo make install
$ /usr/local/pgsql/bin/psql postgres -f test.sql
psql:test.sql:1: NOTICE:  drop cascades to table x
DROP EXTENSION
CREATE EXTENSION
CREATE TABLE
  a
-----
 314
(1 row)

But we should see a message in /tmp/pgtam.log about the new table being created.

$ cat /tmp/pgtam.log


mem_tableam handler loaded


mem_tableam handler loaded
in memam_relation_set_new_filelocator
Created table: [x]


mem_tableam handler loaded
in memam_relation_needs_toast_table


mem_tableam handler loaded
in memam_estimate_rel_size
in memam_slot_callbacks
in memam_beginscan
in memam_getnextslot
in memam_getnextslot
in memam_endscan

And there it is! Creation looks good.

Inserting rows

Let's add INSERT INTO x VALUES (23), (101); to test.sql and run the SQL script.

$ /usr/local/pgsql/bin/psql postgres -f test.sql
psql:test.sql:1: NOTICE:  drop cascades to table x
DROP EXTENSION
CREATE EXTENSION
CREATE TABLE
INSERT 0 2
  a
-----
 314
(1 row)

And let's check the log to see what method is called when we try to INSERT.

$ cat /tmp/pgtam.log


mem_tableam handler loaded


mem_tableam handler loaded
in memam_relation_set_new_filelocator
Created table: [x]


mem_tableam handler loaded
in memam_relation_needs_toast_table


mem_tableam handler loaded
in memam_slot_callbacks
in memam_tuple_insert
in memam_tuple_insert
in memam_estimate_rel_size
in memam_slot_callbacks
in memam_beginscan
in memam_getnextslot
in memam_getnextslot
in memam_endscan

tuple_insert seems to be the method! Looks like it gets called once for each row to insert. Perfect.

The signature for tuple_insert is:

void memam_tuple_insert(
  Relation relation,
  TupleTableSlot *slot,
  CommandId cid,
  int options,
  BulkInsertState bistate
);

We can get the table name from relation, and instead of writing to slot we can read from slot->tts_values.

@@ -141,7 +141,38 @@
   int options,
   BulkInsertState bistate
 ) {
+  TupleDesc desc = RelationGetDescr(relation);
+  struct Table* table = NULL;
+  struct Column column = {};
+  struct Row row = {};
+
   DEBUG_FUNC();
+
+  get_table(&table, relation);
+  if (table == NULL) {
+    elog(ERROR, "table not found");
+    return;
+  }
+
+  if (table->nrows == MAX_ROWS) {
+    elog(ERROR, "cannot insert more rows");
+    return;
+  }
+
+  row.ncolumns = desc->natts;
+  Assert(slot->tts_nvalid == row.ncolumns);
+  Assert(row.ncolumns > 0);
+
+  row.columns = (struct Column*)malloc(sizeof(struct Column) * row.ncolumns);
+  for (size_t i = 0; i < row.ncolumns; i++) {
+    Assert(desc->attrs[i].atttypid == INT4OID);
+    column.value = DatumGetInt32(slot->tts_values[i]);
+    row.columns[i] = column;
+    fprintf(fd, "Got value: %d\n", column.value);
+  }
+
+  table->rows[table->nrows] = row;
+  table->nrows++;
 }

 static void memam_tuple_insert_speculative(

Build and run, and again we won't notice anything new.

$ make && sudo make install
$ /usr/local/pgsql/bin/psql postgres -f test.sql
psql:test.sql:1: NOTICE:  drop cascades to table x
DROP EXTENSION
CREATE EXTENSION
CREATE TABLE
INSERT 0 2
  a
-----
 314
(1 row)

But if we check the logs, we should see the two column values we inserted, one for each row.

$ cat /tmp/pgtam.log


mem_tableam handler loaded


mem_tableam handler loaded
in memam_relation_set_new_filelocator
Created table: [x]


mem_tableam handler loaded
in memam_relation_needs_toast_table


mem_tableam handler loaded
in memam_slot_callbacks
in memam_tuple_insert
Got value: 23
in memam_tuple_insert
Got value: 101
in memam_estimate_rel_size
in memam_slot_callbacks
in memam_beginscan
in memam_getnextslot
in memam_getnextslot
in memam_endscan

Woohoo!

Un-hardcoding the scan

The final thing we need to do is drop the hardcoded 314 we returned from getnextslot and instead look up the current table and return its rows. This also means we need to keep track of which row we're on, so beginscan will also need to change slightly.

@@ -57,6 +56,14 @@
   return &TTSOpsVirtual;
 }

+
+struct MemScanDesc {
+  TableScanDescData rs_base; // Base class from access/relscan.h.
+
+  // Custom data.
+  uint32 cursor;
+};
+
 static TableScanDesc memam_beginscan(
   Relation relation,
   Snapshot snapshot,
@@ -65,11 +72,13 @@
   ParallelTableScanDesc parallel_scan,
   uint32 flags
 ) {
-  TableScanDescData* scan = {};
-  DEBUG_FUNC();
+  struct MemScanDesc* scan;

-  scan = (TableScanDescData*)malloc(sizeof(TableScanDescData));
-  scan->rs_rd = relation;
+  DEBUG_FUNC();
+
+  scan = (struct MemScanDesc*)malloc(sizeof(struct MemScanDesc));
+  scan->rs_base.rs_rd = relation;
+  scan->cursor = 0;

   return (TableScanDesc)scan;
 }
@@ -89,23 +97,26 @@
   DEBUG_FUNC();
 }

-static bool done = false;
 static bool memam_getnextslot(
   TableScanDesc sscan,
   ScanDirection direction,
   TupleTableSlot *slot
 ) {
+  struct MemScanDesc* mscan = (struct MemScanDesc*)sscan;
+  struct Table* table = NULL;
   DEBUG_FUNC();

-  if (done) {
+  ExecClearTuple(slot);
+
+  get_table(&table, mscan->rs_base.rs_rd);
+  if (table == NULL || mscan->cursor == table->nrows) {
     return false;
   }

-  slot->tts_values[0] = Int32GetDatum(314 /* Some unique-looking value */);
+  slot->tts_values[0] = Int32GetDatum(table->rows[mscan->cursor].columns[0].value);
   slot->tts_isnull[0] = false;
   ExecStoreVirtualTuple(slot);
-  done = true;
-
+  mscan->cursor++;
   return true;
 }

Let's try it out.

$ make && sudo make install
$ /usr/local/pgsql/bin/psql postgres -f test.sql
psql:test.sql:1: NOTICE:  drop cascades to table x
DROP EXTENSION
CREATE EXTENSION
CREATE TABLE
INSERT 0 2
  a
-----
  23
 101
(2 rows)

And there we have it. :)

Awesome SQL power

So far we've tried a single table and a SELECT with nothing else.

What happens if we use more of SQL? Let's create another table and try some more complex queries. Edit test.sql:

DROP EXTENSION IF EXISTS pgtam CASCADE;
CREATE EXTENSION pgtam;
CREATE TABLE x(a INT) USING mem;
CREATE TABLE y(b INT) USING mem;
INSERT INTO x VALUES (23), (101);
SELECT a FROM x;
SELECT a + 100 FROM x WHERE a = 23;
SELECT a, COUNT(1) FROM x GROUP BY a ORDER BY COUNT(1) DESC;
SELECT b FROM y;

Run it:

$ /usr/local/pgsql/bin/psql postgres -f test.sql
psql:test.sql:1: NOTICE:  drop cascades to 2 other objects
DETAIL:  drop cascades to table x
drop cascades to table y
DROP EXTENSION
CREATE EXTENSION
CREATE TABLE
CREATE TABLE
INSERT 0 2
  a
-----
  23
 101
(2 rows)

 ?column?
----------
      123
(1 row)

  a  | count
-----+-------
  23 |     1
 101 |     1
(2 rows)

 b
---
(0 rows)

Pretty sweet!

Next steps

It would be neat to build a storage engine that reads from and writes to a CSV file, a la MySQL's CSV storage engine. Or a storage engine that uses RocksDB.

It would also be good to figure out how indexes, deletes, and updates work, and how DDL beyond CREATE TABLE works.

And I should probably contribute some of this to the table access method docs which are pretty sparse at the moment.