December 20, 2020

Static analysis with semgrep: practical examples using Docker

In this post we'll set up a basic semgrep environment in Docker and run some custom rules against our code.

Existing linters

Linters like pylint for Python or eslint for JavaScript are great for enforcing general, broad language standards. But what about common nits in code review, like using print statements instead of a logger, using a defer statement inside a for loop (Go specific), or having multiple nested loops?

Most developers don't have experience working with language parsing, so it's fairly uncommon to see custom linting rules in small- and medium-sized teams. And while no single linter or language is that much more complex than any other (it's all just AST operations), there is a small penalty to learning the AST and framework for each language's linter.
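To make "it's all just AST operations" concrete, here is a minimal sketch of a hand-written check for print calls using Python's standard ast module. The names PrintFinder and find_prints and the sample source are made up for illustration; a real pylint plugin would go through pylint's own checker framework, which is exactly the per-linter learning cost mentioned above.

```python
import ast

# Sample input, made up for this sketch.
SOURCE = '''
def main():
  print("DEBUG: here")
  print("DEBUG: ", "now here")
'''


class PrintFinder(ast.NodeVisitor):
    """Collect the line number of every call to the bare name print."""

    def __init__(self):
        self.lines = []

    def visit_Call(self, node):
        # A call like print(...) has an ast.Name node as its function.
        if isinstance(node.func, ast.Name) and node.func.id == "print":
            self.lines.append(node.lineno)
        # Keep walking so calls nested in the arguments are visited too.
        self.generic_visit(node)


def find_prints(source):
    finder = PrintFinder()
    finder.visit(ast.parse(source))
    return finder.lines


print(find_prints(SOURCE))  # prints [3, 4]
```

None of this carries over to Go or JavaScript: each language needs its own parser and its own node types.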

Semgrep

Semgrep is a generic tool for finding patterns in source code. Unlike traditional regex (and traditional grep), it can match nested, recursive patterns. This makes it an especially useful tool to learn for finding patterns in any language.
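A small illustration of why nesting trips up a plain regex. This is only a sketch of the problem, not a description of how semgrep works internally; the function names and the sample line are made up. Python's built-in re module has no way to count nested parentheses, so the naive pattern stops at the first closing paren, while a tiny depth-tracking scanner finds the whole call.

```python
import re

# Sample input, made up for this sketch.
LINE = 'print(len(x), max(a, b)) + 1'


def regex_match(text):
    """Naive approach: a non-greedy regex stops at the first closing paren."""
    m = re.search(r"print\(.*?\)", text)
    return m.group(0) if m else None


def balanced_match(text):
    """Find print( and scan forward, tracking parenthesis depth.

    Simplification: parentheses inside string literals are not handled.
    """
    start = text.find("print(")
    if start == -1:
        return None
    depth = 0
    for i in range(start + len("print"), len(text)):
        if text[i] == "(":
            depth += 1
        elif text[i] == ")":
            depth -= 1
            if depth == 0:
                return text[start:i + 1]
    return None  # unbalanced input


print(regex_match(LINE))     # prints print(len(x)
print(balanced_match(LINE))  # prints print(len(x), max(a, b))
```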

An advantage of semgrep rules is that once you learn the semgrep pattern matching syntax (which is surprisingly easy), you can write rules for any language you like.

And while the online rule tester is awesome, I had a hard time going from that to a working sample on my own laptop with Docker. We'll do just that.

Catching print statements in Python

Let's say we want a script to fail on any use of print statements in Python:

$ cat test/python/simple-print.py
def main():
  print("DEBUG: here")
  print("DEBUG: ", "now here")

The current default example shown in the online editor happens to be for just this. Click the Advanced tab and you'll see the following:

rules:
- id: fail-on-print
  pattern: |
    print("...")
  message: |
    Semgrep found a match
  severity: WARNING

Copy this into config.yml. Let's modify the pattern to warn on all print calls, not just print calls with a single string argument:

rules:
- id: fail-on-print
  pattern: |
    print(...)
  message: |
    Semgrep found a match
  severity: WARNING

The editor doesn't mention it (nor do any docs I can find) but we also need to include two keys in the individual rule object: mode and languages.

rules:
- id: fail-on-print
  pattern: |
    print(...)
  message: |
    Semgrep found a match
  severity: WARNING
  mode: search
  languages: ["generic"]

Semgrep fails really weirdly if you set mode to anything other than search, but it won't warn you that what you set is garbage. The languages setting is similarly fickle and doesn't give you much feedback if you set it incorrectly.
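Since semgrep gives so little feedback on these two keys, a small pre-flight check can catch typos before the Docker run. The sketch below assumes the rule has already been loaded into a Python dict (for example with a YAML library); check_rule is a hypothetical helper, and it only encodes what this post observed: mode should be search and languages should be a non-empty list of strings. It does not know which language names semgrep accepts.

```python
def check_rule(rule):
    """Return a list of problems found in one rule dict (empty if none)."""
    problems = []
    # This post found that only "search" behaves sensibly.
    if rule.get("mode") != "search":
        problems.append("mode should be 'search', got %r" % rule.get("mode"))
    languages = rule.get("languages")
    if not isinstance(languages, list) or not languages:
        problems.append("languages should be a non-empty list")
    elif not all(isinstance(lang, str) for lang in languages):
        problems.append("every entry in languages should be a string")
    return problems


# The rule from config.yml above, written out as a dict.
RULE = {
    "id": "fail-on-print",
    "pattern": "print(...)\n",
    "message": "Semgrep found a match\n",
    "severity": "WARNING",
    "mode": "search",
    "languages": ["generic"],
}

print(check_rule(RULE))                        # prints []
print(check_rule(dict(RULE, mode="serach")))   # one problem reported
```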

Also, I'm using the "generic" language here because I don't understand the difference between languages and as far as I'm concerned the syntax I'm using here is already pretty generic.

We run the semgrep Docker image:

$ docker run -v "${PWD}:/src" returntocorp/semgrep --config=config.yml test/python
A new version of Semgrep is available. Please see https://github.com/returntocorp/semgrep#upgrading for more information.
running 1 rules...
test/python/simple-print.py
severity:warning rule:fail-on-print: Semgrep found a match

2:print("DEBUG: here")
ran 1 rules on 1 files: 1 findings

And there we've got our warning!

It's not completely clear to me why we're getting warned about a new version when we've pulled latest, as the linked docs suggest. Maybe there's a newer version that hasn't made it into a Docker image yet.
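The goal at the start of this section was a script that fails on print statements, and so far we only have output to read. Here is a rough sketch of a wrapper that turns the summary line into an exit code. It assumes the summary keeps the exact "ran N rules on M files: K findings" shape shown above, which may differ between semgrep versions; count_findings, run_semgrep and main are hypothetical names. Semgrep may also offer a built-in way to do this, so check its help output first.

```python
import os
import re
import subprocess

# Matches the summary line as printed in the output above.
SUMMARY = re.compile(r"ran \d+ rules on \d+ files: (\d+) findings")


def count_findings(output):
    """Pull the findings count out of the summary line, or None if absent."""
    match = SUMMARY.search(output)
    return int(match.group(1)) if match else None


def run_semgrep(config, target):
    """Run the same docker command as above and return its combined output."""
    cmd = [
        "docker", "run", "-v", "%s:/src" % os.getcwd(),
        "returntocorp/semgrep", "--config=%s" % config, target,
    ]
    # Merge stderr into stdout since the summary could be printed on either.
    result = subprocess.run(cmd, stdout=subprocess.PIPE,
                            stderr=subprocess.STDOUT, text=True)
    return result.stdout


def main():
    """Return 0 only when semgrep explicitly reports zero findings."""
    findings = count_findings(run_semgrep("config.yml", "test/python"))
    # A missing summary line (findings is None) counts as a failure too.
    return 0 if findings == 0 else 1
```

A CI step could then end with sys.exit(main()) so the build fails whenever a print call sneaks in.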

Catching fmt.Print* statements in Go

Let's say we also want to fail on print statements in Go (because we should use a logger instead):

$ cat test/golang/simple-print.go
package main

import "fmt"

func main() {
  a := fmt.Sprintf("here")
  fmt.Println(a)
  fmt.Printf("%s\n", a)
  _ = fmt.Errorf("My crazy error") // assigned to _ so the unused error still compiles
}

We could try to look for any import "fmt" code in a file, but that would also flag uses of fmt.Sprintf or fmt.Errorf, which are fine. Instead we'll focus on uses of fmt.Printf or fmt.Println:

$ cat go-config.yml
rules:
- id: fail-on-print
  pattern-either:
    - pattern: fmt.Printf(...)
    - pattern: fmt.Println(...)
  message: |
    Semgrep found a match
  severity: WARNING
  mode: search
  languages: ["generic"]
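As a rough cross-check of what this rule should flag, the either-of-two-patterns idea can be approximated line by line with a regex alternation. This is a sketch for comparison only, not a replacement for the rule: flag_prints and the cut-down sample are made up, and unlike a real parser it would also match inside comments and string literals.

```python
import re

# One alternative per branch of pattern-either. Sprintf and Errorf do not
# match because the text after "fmt." must be exactly Printf( or Println(.
EITHER = re.compile(r"\bfmt\.(?:Printf|Println)\(")

# Cut-down sample with one call of each kind, made up for this sketch.
SAMPLE = (
    'a := fmt.Sprintf("here")\n'
    'fmt.Println(a)\n'
    'fmt.Printf("%s", a)\n'
    '_ = fmt.Errorf("oops")\n'
)


def flag_prints(source):
    """Return 1-based line numbers containing fmt.Printf( or fmt.Println(."""
    return [
        number
        for number, line in enumerate(source.splitlines(), start=1)
        if EITHER.search(line)
    ]


print(flag_prints(SAMPLE))  # prints [2, 3]
```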

Run the Go config against the Go files:

$ docker run -v "${PWD}:/src" returntocorp/semgrep --config=go-config.yml test/golang
A new version of Semgrep is available. Please see https://github.com/returntocorp/semgrep#upgrading for more information.
running 1 rules...
test/golang/simple-print.go
severity:warning rule:fail-on-print: Semgrep found a match

8:fmt.Printf("%s\n", a)
--------------------------------------------------------------------------------
7:fmt.Println(a)
ran 1 rules on 1 files: 2 findings

Cool! Making some sense. Now let's try a harder pattern.

Catching triple-nested for loops

Let's try to warn on the triple-nested loop in this code:

$ cat test/golang/loopy.go
package main

import "log"

func main() {
  doneFirst := false
  for i := 0; i < 10; i++ {
    log.Print(i)

    for j := 0; j < 100; j++ {
      c := i * j

      going := true
      k := 0
      for going {
        if k == c {
          break
        }

        k++
        log.Print(k)
      }
    }

    doneFirst = true
  }
  _ = doneFirst // reference the variable so the file compiles
}

If we want to catch the use of nested for loops here then we'll need to search for the loops surrounded by arbitrary syntax. Semgrep's ... syntax makes this easy.

$ cat go-config2.yml
rules:
- id: fail-on-3-loop
  pattern: |
    for ... {
      ...
      for ... {
        ...

        for ... {
          ...
        }
        ...
      }
      ...
    }
  message: |
    Semgrep found a match
  severity: WARNING
  mode: search
  languages: ["generic"]

And run semgrep:

$ docker run -v "${PWD}:/src" returntocorp/semgrep --config=go-config2.yml test/golang
A new version of Semgrep is available. Please see https://github.com/returntocorp/semgrep#upgrading for more information.
running 1 rules...
test/golang/loopy.go
severity:warning rule:fail-on-3-loop: Semgrep found a match

7:for i := 0; i < 10; i++ {
8:              log.Print(i)
9:
10:             for j := 0; j < 100; j++ {
11:                     c := i * j
12:
13:                     going := true
14:                     k := 0
15:                     for going {
16:                             if k == c {
-------- [hid 10 additional lines, adjust with --max-lines-per-finding] --------
ran 1 rules on 2 files: 1 findings

That's just swell.
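For comparison, here is roughly what the same check looks like when written by hand against a single language's AST. The sketch uses Python's standard ast module on a Python analogue of loopy.go; the sample source and max_loop_depth are made up for illustration, and async loops are ignored for brevity.

```python
import ast

# A Python analogue of loopy.go, made up for this sketch.
SOURCE = '''
def main():
    for i in range(10):
        for j in range(100):
            k = 0
            while k != i * j:
                k += 1
'''


def max_loop_depth(node, depth=0):
    """Return the deepest level of for/while nesting found under node."""
    if isinstance(node, (ast.For, ast.While)):
        depth += 1
    deepest = depth
    for child in ast.iter_child_nodes(node):
        deepest = max(deepest, max_loop_depth(child, depth))
    return deepest


print(max_loop_depth(ast.parse(SOURCE)))  # prints 3
```

A rule would then fail whenever the depth reaches 3. The semgrep pattern says the same thing without any of the tree-walking code, and it is not tied to one language's parser.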

Limits of static analysis

Now let's say we refactor the innermost loop into its own function.

$ cat test/golang/loopy.go
package main

import "log"

func inner(i, j int) {
  c := i * j

  going := true
  k := 0
  for going {
    if k == c {
      break
    }

    k++
    log.Print(k)
  }
}

func main() {
  doneFirst := false
  for i := 0; i < 10; i++ {
    log.Print(i)

    for j := 0; j < 100; j++ {
      inner(i, j)
    }

    doneFirst = true
  }
  _ = doneFirst // reference the variable so the file compiles
}

And run semgrep again:

$ docker run -v "${PWD}:/src" returntocorp/semgrep --config=go-config2.yml test/golang
A new version of Semgrep is available. Please see https://github.com/returntocorp/semgrep#upgrading for more information.
running 1 rules...
ran 1 rules on 2 files: 0 findings

Well, great. The triple-nested loop still exists, but we can no longer find it because it's no longer syntactically obvious.

At this point we'd need a tool that follows calls across functions, or linting based on runtime analysis. If you know of a tool that does this and lets you write rules like semgrep for it, please tell me!
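Following calls between functions can still be done statically, at least for simple cases. Here is a toy sketch in Python, assuming all functions live in one module and are called directly by name; effective_depth and the sample source are made up, and this is not something semgrep does in the examples above. It over-approximates (a call in a loop header counts as being inside the loop) and skips recursive calls to avoid looping forever.

```python
import ast

# A Python analogue of the refactored loopy.go, made up for this sketch.
SOURCE = '''
def inner(i, j):
    k = 0
    while k != i * j:
        k += 1

def main():
    for i in range(10):
        for j in range(100):
            inner(i, j)
'''


def effective_depth(tree, entry):
    """Deepest loop nesting reachable from function `entry`, following
    direct calls to functions defined at the top level of the same module."""
    funcs = {n.name: n for n in tree.body if isinstance(n, ast.FunctionDef)}

    def walk(node, depth, stack):
        if isinstance(node, (ast.For, ast.While)):
            depth += 1
        deepest = depth
        # Follow direct calls by name, skipping functions already on the stack.
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in funcs and node.func.id not in stack):
            callee = funcs[node.func.id]
            deepest = max(deepest,
                          walk_body(callee, depth, stack | {callee.name}))
        for child in ast.iter_child_nodes(node):
            deepest = max(deepest, walk(child, depth, stack))
        return deepest

    def walk_body(func, depth, stack):
        return max([walk(stmt, depth, stack) for stmt in func.body] or [depth])

    return walk_body(funcs[entry], 0, frozenset([entry]))


print(effective_depth(ast.parse(SOURCE), "main"))  # prints 3
```

Looking at main on its own only shows two levels of nesting; the third only appears once the call to inner is followed.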

In summary

In the end, though, it's still very useful to be able to learn a single language for writing syntax rules at a high level to enforce behavior in code. Furthermore, a generic syntax matcher helps you easily write rules for things that don't already have linters, like YAML or JSON configuration or Vagrantfiles.

It can be annoying to work around some missing docs in semgrep but overall it's a great tool for the kit.