Semgrep on Windows via WSL
As of today (November 2023), you cannot run Semgrep directly on Windows. Seriously, save your sanity and don't try. But you can run it in the Windows Subsystem for Linux.
Install WSL
I won't go into the details. Use the Install Linux on Windows with WSL guide from Microsoft.
Which Distro Should I Choose?
It's your call. My current daily WSL2 driver is Debian 11. I have used Semgrep in multiple versions of Ubuntu and Debian without issues.
Make sure you're using a recent distro that supports installing Python 3.7 or higher via its package manager (why install it manually when you can make your life easier?).
Note you can have multiple versions of the same distro.
WSL1 vs. WSL2
See Comparing WSL 1 and WSL 2.
Are most of your files on the Windows file system? Use WSL1.
WSL2 uses Hyper-V and so has good performance for files on its own file system (e.g., ~/...). I use WSL2.
Are you behind a corporate proxy or using VPN software (e.g., Cisco AnyConnect)? Use WSL1.
WSL2 uses Hyper-V. Hyper-V has issues connecting to VPN when you use certain VPN software. I have spent hundreds of hours trying to fix it. You might think you can fix it too, but just use WSL1 and save yourself some time.
WSL2 also does not get inotify events for files on the Windows file system.
Easily switch between WSL1 and WSL2
wsl -l -v to see all the distros.
wsl --set-version <distro name> 2 or wsl --set-version <distro name> 1.
It might take a few minutes to copy the files but it generally works.
Install Semgrep on WSL
python3 -m pip install semgrep or python -m pip install semgrep (depending on distro).
Semgrep Command not Found
- Look for Semgrep in ~/.local/bin.
- Add it to your path by adding the following line to ~/.bashrc or ~/.profile (my preference).
  export PATH=$PATH:~/.local/bin
- Run source ~/.profile or source ~/.bashrc to make the change.
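If the profile gets sourced more than once, the same directory can pile up in PATH. Here is a minimal sketch of a guard, assuming a POSIX shell. add_to_path is a hypothetical helper name, not something Semgrep or your distro ships.

```shell
# Hypothetical helper: append a directory to PATH only if it is not already there.
add_to_path() {
  case ":$PATH:" in
    *":$1:"*) ;;            # already present, leave PATH alone
    *) PATH="$PATH:$1" ;;   # not present, append it
  esac
}

add_to_path "$HOME/.local/bin"
export PATH
```

Putting this in ~/.profile instead of the plain export line should have the same effect the first time and do nothing on later runs.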
Configs
Download a Ruleset YAML File
This will download the YAML file with all the rules.
# p/{ruleset-name}
# this will download the default ruleset in a file named `default`
wget https://semgrep.dev/c/p/default
# note it's capital O (O as in Oscar, not zero)
wget https://semgrep.dev/c/p/default -O default.yaml
# you can also use curl or even your browser
curl https://semgrep.dev/c/p/default
Note: These URLs are for internal usage and are subject to change.
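The download can also be wrapped in a small function. This is only a sketch: ruleset_url and download_ruleset are hypothetical names, and the URL shape is copied from the wget examples above, so it may stop working if those URLs change.

```shell
# Hypothetical helper: print the download URL for a ruleset such as p/default.
ruleset_url() {
  printf 'https://semgrep.dev/c/%s\n' "$1"
}

# Hypothetical helper: save ruleset p/<name> as <name>.yaml (needs network access).
download_ruleset() {
  wget "$(ruleset_url "$1")" -O "${1#p/}.yaml"
}

# Usage:
#   download_ruleset p/default    # writes default.yaml
```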
Run ALL the Rules
- Throw the kitchen sink at your code: --config r/all.
- Run the manually created "catch them all" scan: --config p/default.
Note: Semgrep is intelligent and detects a file's language by extension so it will not run every rule on every file. See Language extensions and languages key values.
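The --config values above go on the semgrep command line. A small sketch of a wrapper, assuming semgrep was installed with pip as described earlier. scan_with is a hypothetical name, and the "not found" branch just points back at the PATH fix.

```shell
# Hypothetical wrapper: scan a directory (default: current one) with one config.
scan_with() {
  if ! command -v semgrep >/dev/null 2>&1; then
    echo "semgrep not found; check that ~/.local/bin is on PATH" >&2
    return 127
  fi
  semgrep --config "$1" "${2:-.}"
}

# Usage:
#   scan_with p/default        # the "catch them all" ruleset
#   scan_with r/all src/       # the kitchen sink, only on src/
```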
Sample Rules
Some fun things to do with Semgrep.
Double Matches with Different Semgrep Messages
I was printing the type of a metavariable. https://semgrep.dev/playground/r/L1UB80/parsiya.tips-double-match
rules:
  - id: tips-double-match
    pattern: $RETTYPE $METHOD() { ... }
    message: $RETTYPE
    severity: WARNING
    languages:
      - java
It was matched twice.
package pk;
import org.foo.bar.MyType;
public class MyClass {
    public MyType method() {
        // do something
        return new MyType("123");
    }
}
Once with the type and once with the complete import name.
Line 7: MyType
Line 7: org.foo.bar.MyType
Fix
One fix (credit: Lewis Ardern, Semgrep, source) is to add $RETTYPE to focus-metavariable. Note how we need to add patterns to be able to use focus-metavariable as a tag.
https://semgrep.dev/playground/r/8GUn82/parsiya.tips-double-match-fix
rules:
  - id: tips-double-match-fix
    patterns:
      - pattern: $RETTYPE $METHOD() { ... }
      - focus-metavariable: $RETTYPE
    message: $RETTYPE
    severity: WARNING
    languages:
      - java
Explanation
Credit: Iago Abal, Semgrep, source on Semgrep slack.
For Semgrep MyType is also equivalent to org.foo.bar.MyType, so when you ask Semgrep to match $RETTYPE against MyType it produces those two matches. And because $RETTYPE is part of the rule message, each match produces a different message, and Semgrep doesn't deduplicate two findings if each finding has a different message. I think focus-metavariable removes the duplicate because the "fake" org.foo.bar.MyType expression that we generate as equivalent to MyType uses tokens from the import and so the ranges of those tokens do not intersect with the method declaration... I see that more like a bug.
These double-matches you can observe them with other equivalences as in https://semgrep.dev/s/QDdD, because & is commutative and Semgrep does some AC-matching, $A may be both x and y, so you get two matches.
Skipping Java Annotations
Assume we have a file like this:
@Annotation1
public class ParentClass {
    @First
    @Second
    @Third
    public int meth1() {
        return 1;
    }
}
And I wanted to skip all annotations after @First. This is not a valid pattern:
pattern: |
  @First
  ...
  public $RETURNTYPE $METHOD(...) { ... }
https://semgrep.dev/playground/r/gxU0Q9/parsiya.tips-java-annotations
Fix
Credit: Cooper Pierce, Semgrep, source on Semgrep slack.
annotations beyond those specified are ignored when matching so something like [the following] would do what you describe
rules:
  - id: tips-java-annotations
    pattern: |
      @First
      public $RETURNTYPE $METHOD(...) { ... }
    message: |
      $CLASS
    severity: WARNING
    languages:
      - java
https://semgrep.dev/playground/r/3qUbAk/parsiya.tips-java-annotations-fix
pattern-inside AND & OR
This is AND. The match must satisfy both.
- pattern-inside: ...
- pattern-inside: ...
This is OR. The match must satisfy at least one.
- pattern-either:
    - pattern-inside: ...
    - pattern-inside: ...
if Statements in C/C++
Capture Conditions of if Statements in C/C++: if ($X).
Capture if conditions with one line blocks.
https://semgrep.dev/playground/r/5rUED4/parsiya.tips-detect-single-line-if-block
rules:
  - id: detect_if
    patterns:
      - pattern: if ($X) ...
      - pattern-not: if ($X) { $Y; ... }
    message: Found a one-line if block
    languages:
      - c
    severity: WARNING
Credit: Cooper Pierce, Semgrep, source on Semgrep slack.
Array Arguments in C/C++
$TYPE $VAR[...]; is not valid, use $TYPE $VAR[$SIZE];. This also matches multi-dimensional arrays like int nDim_init[10][10][10][10][10][10];.
In general: Use metavariables instead of ... in C/C++.
Explanation
... is usually reserved to match a sequence of things (e.g., foo(...)), or if something is optional (e.g., return ...;)
Credit: Padioleau Yoann, Semgrep, source: Semgrep slack.
Alert if a Specific File or Path Exists
Unconventional but we can write a rule like this:
rules:
  - id: detect-file
    patterns:
      - pattern-regex: .*
    message: Semgrep found the file
    languages:
      - generic
    severity: WARNING
    paths:
      include:
        - /path/to/badfile*
      exclude:
        - /paths/to/exclude/*
https://semgrep.dev/playground/r/eqUAk3/parsiya.detect-file
The path is relative to where you run Semgrep. The file doesn't need to have any content. paths > include/exclude should give us a lot of power to detect different paths.
Credit: Yours Truly, Parsia, source: Semgrep slack.
Alert if JavaScript Imports Exist and are Used
This was asked in the Semgrep slack and I came up with this answer that I liked. We want to get a match if there are a few specific JavaScript imports in a file and if they are all used. The order of imports and usage shouldn't matter. Here's an example:
var os = require("os");
var http = require("http");
var dns = require("dns");
os.exec("ls");
http.get("something");
dns.something("whatever");
let a = 1;
The rule combines six different pattern-inside clauses (three for imports and three for usages). Because sibling clauses under patterns are ANDed, all six must hold. It will match everything after all six patterns are met.
Semgrep playground link: https://semgrep.dev/playground/r/2ZUkgL/parsiya.three-imports-used.
rules:
  - id: three-imports-used
    patterns:
      - pattern-inside: |
          $OS = require('os')
          ...
      - pattern-inside: |
          $HTTP = require('http')
          ...
      - pattern-inside: |
          $DNS = require('dns')
          ...
      - pattern-inside: |
          $OS.$METHOD1(...)
          ...
      - pattern-inside: |
          $HTTP.$METHOD2(...)
          ...
      - pattern-inside: |
          $DNS.$METHOD3(...)
          ...
    message: Semgrep found a match
    languages:
      - js
    severity: WARNING
Using metavariable-pattern with Language Generic
This is a neat trick from my good friend Lewis Ardern. The question on the Semgrep Slack wanted to match text in a bash file like this:
openssl genpkey -algorithm RSA -out private_key.pem -pkeyopt rsa_keygen_bits:1024
The extracted info was supposed to be the 1024 number. Then the rule had to check if the number was less than 2048.
I faced a problem here. It's not possible to create a Semgrep pattern like this for bash.
pattern: openssl ... -pkeyopt rsa_keygen_bits:$BITS ...
I guess it's because of the way tree-sitter creates tokens in bash. But I could create a pattern like this and get rsa_keygen_bits:1024 completely:
pattern: openssl ... -pkeyopt $RSA ...
We can use a metavariable-pattern with the generic language to do text processing here which allows us to extract the number in a metavariable and also use metavariable-comparison.
patterns:
  - pattern: |
      openssl ... -pkeyopt $KEY ...
  - metavariable-pattern:
      metavariable: $KEY
      language: generic
      patterns:
        - pattern: rsa_keygen_bits:$BITS ...
        - metavariable-comparison:
            comparison: $BITS < 2048
  - focus-metavariable:
      - $BITS
Complete rule: https://semgrep.dev/playground/s/6YQ1
Rule Tests
Test File Names for Rules with the paths Tag
I had a rule that was looking for *-NAME.cpp files. E.g.,
rules:
  - id: some-rule
    languages:
      - cpp
    paths:
      include:
        - "*-NAME.cpp"
The test file should match one of the items in include. In this case, I needed to rename the test file to some-rule-NAME.cpp.
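A quick way to sanity-check a test file name before running the tests is to try the glob in the shell. This is a rough sketch: matches_include is a hypothetical helper, and shell globbing is only an approximation of how Semgrep matches paths, so treat a pass here as a hint and not a guarantee.

```shell
# Hypothetical helper: succeed if file name $1 matches glob $2.
matches_include() {
  case "$1" in
    $2) return 0 ;;   # unquoted on purpose so $2 is treated as a glob
    *) return 1 ;;
  esac
}

if matches_include "some-rule-NAME.cpp" "*-NAME.cpp"; then
  echo "test file name matches the include glob"
fi
```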
Memes
You Made This?
Edits to the famous comic: Yours Truly, Parsia.