In this post we'll get a basic semgrep environment set up in Docker running some custom rules against our code.
Existing linters
Linters like pylint for Python or eslint for JavaScript are great for general, broad language standards. But what about common nits in code review like using print statements instead of a logger, or using a defer statement inside a for loop (Go specific), or the existence of multiple nested loops.
Most developers don't have experience working with language parsing. So it's fairly uncommon in small- and medium-sized teams to see custom linting rules. And while no single linter or language is that much more complex than the other (it's all just AST operations), there is a small penalty to learning the AST and framework for each language linter.
Semgrep
Semgrep is a generic tool for finding patterns in source code. Unlike traditional regex (and traditional grep) it can find recursive patterns. This makes it especially useful as a tool to learn for finding patterns in any language.
An advantage of semgrep rules is that you can learn the semgrep pattern matching syntax (which is surprisingly easy) and then you can write rules for any language you'd like to write rules for.
And while the online rule tester is awesome, I had a hard time going from that to a working sample on my own laptop with Docker. We'll do just that.
Catching print statements in Python
Let's say we want a script to fail on any use of print statements in Python:
$ cat test/python/simple-print.py
def main():
print("DEBUG: here")
print("DEBUG: ", "now here")
The current default example shown in the online editor happens to be for just this. Click the Advanced tab and you'll see the following:
rules:
- id: fail-on-print
pattern: |
print("...")
message: |
Semgrep found a match
severity: WARNING
Copy this into config.yml
. Let's modify the pattern to
warn on all print calls, not just print calls with a single string
argument:
rules:
- id: fail-on-print
pattern: |
print(...)
message: |
Semgrep found a match
severity: WARNING
The editor doesn't mention it (nor do any docs I can find) but we also
need to include two keys in the individual rule
object: mode
and languages
.
rules:
- id: fail-on-print
pattern: |
print(...)
message: |
Semgrep found a match
severity: WARNING
mode: search
languages: ["generic"]
Semgrep fails really weirdly if you set mode
to
anything other than search
, but it won't warn you that
what you set is garbage. The languages
setting is
similarly fickle and doesn't give you much feedback if you set it
incorrectly.
Also, I'm using the "generic" language here because I don't understand the difference between languages and as far as I'm concerned the syntax I'm using here is already pretty generic.
We run the semgrep Docker image:
$ docker run -v "${PWD}:/src" returntocorp/semgrep --config=config.yml test/python
A new version of Semgrep is available. Please see https://github.com/returntocorp/semgrep#upgrading for more information.
running 1 rules...
test/python/simple-print.py
severity:warning rule:fail-on-print: Semgrep found a match
2:print("DEBUG: here")
ran 1 rules on 1 files: 1 findings"")
And there we've got our warning!
Not completely clear to me why we're getting warned about a new
version when we've pulled latest
as the linked docs
suggest. Maybe there's a newer version that hasn't made it into a
Docker image yet.
Catching fmt.Print* statements in Go
Let's say we also want to fail on print statements in Go (because we should use a logger instead):
$ cat test/golang/simple-print.go
package main
import "fmt"
func main() {
a := fmt.Sprintf("here")
fmt.Println(a)
fmt.Printf("%s\n", a)
e := fmt.Errorf("My crazy error")
}
We could try to look for any import "fmt"
code in a file
but that would fail on uses of fmt.Sprintf
or fmt.Errorf
which are fine. Instead we'll just focus on
uses of fmt.Printf
or fmt.Println
:
$ cat go-config.yml
rules:
- id: fail-on-print
pattern-either:
- pattern: fmt.Printf(...)
- pattern: fmt.Println(...)
message: |
Semgrep found a match
severity: WARNING
mode: search
languages: ["generic"]
Run the Go config against the Go files:
$ docker run -v "${PWD}:/src" returntocorp/semgrep --config=go-config.yml test/golang
A new version of Semgrep is available. Please see https://github.com/returntocorp/semgrep#upgrading for more information.
running 1 rules...
test/golang/simple-print.go
severity:warning rule:fail-on-print: Semgrep found a match
8:fmt.Printf("%s\n", a)
--------------------------------------------------------------------------------
7:fmt.Println(a)
ran 1 rules on 1 files: 2 findings
Cool! Making some sense. Now let's try a harder pattern.
Catching triple-nested for loops
Let's try to warn on the triple-nested loop in this code:
$ cat test/golang/loopy.go
package main
import "log"
func main() {
doneFirst := false
for i := 0; i < 10; i++ {
log.Print(i)
for j := 0; j < 100; j++ {
c := i * j
going := true
k := 0
for going {
if k == c {
break
}
k++
log.Print(k)
}
}
doneFirst = true
}
}
If we want to catch the use of nested for loops here then we'll need
to search for the loops surrounded by arbitrary
syntax. Semgrep's ...
syntax makes this easy.
$ cat go-config2.yml
rules:
- id: fail-on-3-loop
pattern: |
for ... {
...
for ... {
...
for ... {
...
}
...
}
...
}
message: |
Semgrep found a match
severity: WARNING
mode: search
languages: ["generic"]
And run semgrep:
$ docker run -v "${PWD}:/src" returntocorp/semgrep --config=go-config2.yml test/golang
A new version of Semgrep is available. Please see https://github.com/returntocorp/semgrep#upgrading for more information.
running 1 rules...
test/golang/loopy.go
severity:warning rule:fail-on-3-loop: Semgrep found a match
7:for i := 0; i < 10; i++ {
8: log.Print(i)
9:
10: for j := 0; j < 100; j++ {
11: c := i * j
12:
13: going := true
14: k := 0
15: for going {
16: if k == c {
-------- [hid 10 additional lines, adjust with --max-lines-per-finding] --------
ran 1 rules on 2 files: 1 findings
That's just swell.
Limits of static analysis
Now let's say we refactor one of the inner loops into its own function.
$ cat test/golang/loopy.go
package main
import "log"
func inner(i, j int) {
c := i * j
going := true
k := 0
for going {
if k == c {
break
}
k++
log.Print(k)
}
}
func main() {
doneFirst := false
for i := 0; i < 10; i++ {
log.Print(i)
for j := 0; j < 100; j++ {
inner(i, j)
}
doneFirst = true
}
}
And run semgrep again:
$ docker run -v "${PWD}:/src" returntocorp/semgrep --config=go-config2.yml test/golang
A new version of Semgrep is available. Please see https://github.com/returntocorp/semgrep#upgrading for more information.
running 1 rules...
ran 1 rules on 2 files: 0 findings
Well great. The 3-nested loop still exists but we can't find it anymore because it's not syntactically obvious anymore.
At this point we'd need to start getting into linting based on runtime analysis. If you know of a tool that does this and lets you write rules like semgrep for it, please tell me!
In summary
In the end though, it's still very useful to be able to learn a single language for writing syntax rules at a high level to enforce behavior in code. Furthermore, a generic syntax matcher helps you write easily write rules for things that don't already have linters like YAML or JSON configuration or Vagrantfiles.
It can be annoying to work around some missing docs in semgrep but overall it's a great tool for the kit.
#semgrep is a really neat tool for syntactic analysis. Here are a few simple examples (catch print statements, triple nested loops, etc.) using Docker. Includes some necessary info the docs don't get intohttps://t.co/UDHEH5JmOa
— Phil Eaton (@phil_eaton) December 20, 2020