January 23, 2021

Extending gosql to supporting LIMIT and OFFSET

It's been a few months since I picked up gosql and I wanted to use it to prototype a SQL interface for data stored in S3. But one missing critical feature in gosql is LIMIT and OFFSET support. This post walks through the few key changes to gosql to support LIMIT and OFFSET.

You can find this commit in full on Github.

This post builds on top of a series on building a SQL database from scratch in Golang.
1. SELECT, INSERT, CREATE and a REPL
2. binary expressions and WHERE filters
3. indexes
4. a database/sql driver

Lexing

The first step is to update the lexer to know about the LIMIT and OFFSET keywords. Since we already have a generalized method of lexing any keywords from an array (see lexer.go:lexKeyword), this is really easy. Just add a new Keyword:

@@ -37,6 +37,8 @@ const (
        OnKeyword         Keyword = "on"
        PrimarykeyKeyword Keyword = "primary key"
        NullKeyword       Keyword = "null"
+       LimitKeyword      Keyword = "limit"
+       OffsetKeyword     Keyword = "offset"
 )

And then add these two new enums to the list of Keywords to lex:

@@ -261,6 +263,8 @@ func lexKeyword(source string, ic cursor) (*Token, cursor, bool) {
                OnKeyword,
                PrimarykeyKeyword,
                NullKeyword,
+               LimitKeyword,
+               OffsetKeyword,
        }

        var options []string

That's it for the lexer.

Parsing

Before we can parse limit and offset into the AST, we have to modify our AST struct to support these two fields in ast.go:

@@ -54,9 +54,11 @@ type SelectItem struct {
 }

 type SelectStatement struct {
-       Item  *[]*SelectItem
-       From  *Token
-       Where *Expression
+       Item   *[]*SelectItem
+       From   *Token
+       Where  *Expression
+       Limit  *Expression
+       Offset *Expression
 }

And to be a good citizen, we'll fix up the GenerateCode helper function (for pretty-printing the AST) to show LIMIT and OFFSET.

@@ -73,17 +75,24 @@ func (ss SelectStatement) GenerateCode() string {
                item = append(item, s)
        }

-       from := ""
+       code := "SELECT\n" + strings.Join(item, ",\n")
        if ss.From != nil {
-               from = fmt.Sprintf("\nFROM\n\t\"%s\"", ss.From.Value)
+               code += fmt.Sprintf("\nFROM\n\t\"%s\"", ss.From.Value)
        }

-       where := ""
        if ss.Where != nil {
-               where = fmt.Sprintf("\nWHERE\n\t%s", ss.Where.GenerateCode())
+               code += "\nWHERE\n\t" + ss.Where.GenerateCode()
        }

-       return fmt.Sprintf("SELECT\n%s%s%s;", strings.Join(item, ",\n"), from, where)
+       if ss.Limit != nil {
+               code += "\nLIMIT\n\t" + ss.Limit.GenerateCode()
+       }
+
+       if ss.Offset != nil {
+               code += "\nOFFSET\n\t" + ss.Limit.GenerateCode()
+       }
+
+       return code + ";"
 }

 type ColumnDefinition struct {

That's it for modifying the AST itself. Now we can modify the select statement parser to look for these two new sections. It's pretty simple: for both LIMIT and OFFSET first check if they exist in the current statement and then try to parse the expression after them, in parser.go:

@@ -285,6 +288,30 @@ func (p Parser) parseSelectStatement(tokens []*Token, initialCursor uint, delimi
                cursor = newCursor
        }

+       _, cursor, ok = p.parseToken(tokens, cursor, limitToken)
+       if ok {
+               limit, newCursor, ok := p.parseExpression(tokens, cursor, []Token{offsetToken, delimiter}, 0)
+               if !ok {
+                       p.helpMessage(tokens, cursor, "Expected LIMIT value")
+                       return nil, initialCursor, false
+               }
+
+               slct.Limit = limit
+               cursor = newCursor
+       }
+
+       _, cursor, ok = p.parseToken(tokens, cursor, offsetToken)
+       if ok {
+               offset, newCursor, ok := p.parseExpression(tokens, cursor, []Token{delimiter}, 0)
+               if !ok {
+                       p.helpMessage(tokens, cursor, "Expected OFFSET value")
+                       return nil, initialCursor, false
+               }
+
+               slct.Offset = offset
+               cursor = newCursor
+       }
+
        return &slct, cursor, true
 }

And the last tricky bit is to make sure that previous optional parseExpression know that they can be delimited by OFFSET and LIMIT (this delimiter awareness is just how the parser works):

@@ -273,9 +273,12 @@ func (p Parser) parseSelectStatement(tokens []*Token, initialCursor uint, delimi
                cursor = newCursor
        }

+       limitToken := tokenFromKeyword(LimitKeyword)
+       offsetToken := tokenFromKeyword(OffsetKeyword)
+
        _, cursor, ok = p.parseToken(tokens, cursor, whereToken)
        if ok {
-               where, newCursor, ok := p.parseExpression(tokens, cursor, []Token{delimiter}, 0)
+               where, newCursor, ok := p.parseExpression(tokens, cursor, []Token{limitToken, offsetToken, delimiter}, 0)
                if !ok {
                        p.helpMessage(tokens, cursor, "Expected WHERE conditionals")
                        return nil, initialCursor, false

That's it for parsing!

Runtime

Gosql has just one storage backend currently: an in-memory store. To support LIMIT and OFFSET we need to evaluate both expressions if they exist. Then while we're iterating through table rows, after testing whether each row passes the WHERE filter, we'll check if the number of rows passing the WHERE filter falls within the range of OFFSET and LIMIT + OFFSET otherwise we'll skip the row, in memory.go:

@@ -587,6 +587,33 @@ func (mb *MemoryBackend) Select(slct *SelectStatement) (*Results, error) {
                }
        }

+       limit := len(t.rows)
+       if slct.Limit != nil {
+               v, _, _, err := t.evaluateCell(0, *slct.Limit)
+               if err != nil {
+                       return nil, err
+               }
+
+               limit = int(*v.AsInt())
+       }
+       if limit < 0 {
+               return nil, fmt.Errorf("Invalid, negative limit")
+       }
+
+       offset := 0
+       if slct.Offset != nil {
+               v, _, _, err := t.evaluateCell(0, *slct.Offset)
+               if err != nil {
+                       return nil, err
+               }
+
+               offset = int(*v.AsInt())
+       }
+       if offset < 0 {
+               return nil, fmt.Errorf("Invalid, negative limit")
+       }
+
+       rowIndex := -1
        for i := range t.rows {
                result := []Cell{}
                isFirstRow := len(results) == 0
@@ -602,6 +629,13 @@ func (mb *MemoryBackend) Select(slct *SelectStatement) (*Results, error) {
                        }
                }

+               rowIndex++
+               if rowIndex < offset {
+                       continue
+               } else if rowIndex > offset+limit-1 {
+                       break
+               }
+
                for _, col := range finalItems {
                        value, columnName, columnType, err := t.evaluateCell(uint(i), *col.Exp)
                        if err != nil {

Just to call out explicitly, with LIMIT and OFFSET we still have to check every single row in the table (at least until we've reached the offset). This should clearly illustrate why paginating based on LIMIT and OFFSET is not a great idea for big datasets compared to index-based pagination.

That's all!

Trying it out

$ go build cmd/main.go
$ ./main
Welcome to gosql.
# create table user (name text, age int);
ok
# insert into user values ('meg', 2);
ok
# insert into user values ('jerry', 2);
ok
# insert into user values ('phil', 1);
ok
# select * from user;
  name  | age
--------+------
  meg   |   2
  jerry |   2
  phil  |   1
(3 results)
ok
# select * from user limit 1;
  name | age
-------+------
  meg  |   2
(1 result)
ok
# select * from user where age=1 limit 1;
  name | age
-------+------
  phil |   1
(1 result)
ok
# select * from user where age=1 limit 4;
  name | age
-------+------
  phil |   1
(1 result)
ok
# select * from user where age=2 limit 1;
  name | age
-------+------
  meg  |   2
(1 result)
ok
# select * from user where age=2 limit 1 offset 1;
  name  | age
--------+------
  jerry |   2
(1 result)
ok

Not so hard to hack is it? Make sure to include some tests!