Parsia-Clone

'Documentation is a love letter that you write to your future self.' - Damian Conway

Semgrep Tips and Tricks

Semgrep on Windows via WSL

As of today (November 2023), you cannot run Semgrep directly on Windows. Seriously, save your sanity and don't try. But you can run it in the Windows Subsystem for Linux.

Install WSL

I won't go into the details. Use Microsoft's "Install Linux on Windows with WSL" guide.

Which Distro Should I Choose?

It's your call. My current daily WSL2 driver is Debian 11. I have used Semgrep in multiple versions of Ubuntu and Debian without issues.

Make sure you're using a recent distro that supports installing Python 3.7 or higher via its package manager (why install it manually when you can make your life easier?).

Note that you can install multiple versions of the same distro side by side.

WSL1 vs. WSL2

See Comparing WSL 1 and WSL 2.

Are most of your files on the Windows file system? Use WSL1.

WSL2 runs in a Hyper-V virtual machine, so it is fast for files on its own file system (e.g., ~/...) but slower for files on the Windows file system (e.g., /mnt/c/...). I use WSL2.

Are you behind a corporate proxy or using VPN software (e.g., Cisco AnyConnect)? Use WSL1.

WSL2 uses Hyper-V, and Hyper-V has trouble connecting through certain VPN software. I have spent hundreds of hours trying to fix it. You might think you can fix it too, but just use WSL1 and save yourself the time.

WSL2 also does not get inotify events for files on the Windows file system.

Easily switch between WSL1 and WSL2

Run wsl -l -v to see all the distros and the WSL version each one uses.

Then run wsl --set-version <distro name> 2 or wsl --set-version <distro name> 1 to convert a distro.

It might take a few minutes to copy the files but it generally works.

Install Semgrep on WSL

python3 -m pip install semgrep or python -m pip install semgrep (depending on distro).

Semgrep Command not Found

  1. Look for Semgrep in ~/.local/bin.
  2. Add it to your path by adding the following line to ~/.bashrc or ~/.profile (my preference).
    • export PATH=$PATH:~/.local/bin
  3. Run source ~/.profile or source ~/.bashrc to apply the change.
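
Step 2 can be scripted so the line is only added once. This is a minimal sketch: the function name add_local_bin_to_path is made up for illustration, and it appends the same export line shown in the list.

```shell
# Hypothetical helper: append the PATH line to the profile file given as "$1",
# unless that exact line is already in the file.
add_local_bin_to_path() {
    profile="$1"
    line='export PATH=$PATH:~/.local/bin'

    # -q: quiet, -x: match the whole line, -F: treat the line as a fixed string
    if ! grep -qxF "$line" "$profile" 2>/dev/null; then
        printf '%s\n' "$line" >> "$profile"
    fi
}
```

Example: add_local_bin_to_path ~/.profile, then source ~/.profile. Running it a second time leaves the file unchanged.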

Configs

Download a Ruleset YAML File

This will download the YAML file with all the rules.

# p/{ruleset-name}

# this will download the default ruleset in a file named `default`
wget https://semgrep.dev/c/p/default

# note it's capital O (O as in Oscar, not zero)
wget https://semgrep.dev/c/p/default -O default.yaml

# you can also use curl or even your browser
curl https://semgrep.dev/c/p/default

Note: These URLs are for internal usage and are subject to change.

Run ALL the Rules

  • Throw the kitchen sink at your code: --config r/all.
  • Run the manually created "catch them all" scan: --config p/default.

Note: Semgrep is intelligent and detects a file's language by extension so it will not run every rule on every file. See "Language extensions and languages key values".
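
Here is a sketch of the two scans as full commands. The function names and the target directory are placeholders. --config is the flag from the bullets above, and it can be repeated to combine several rulesets in one run.

```shell
# Hypothetical wrappers; "$1" is the directory to scan.

# Every rule in the registry.
scan_kitchen_sink() {
    semgrep --config r/all "$1"
}

# The curated default ruleset.
scan_default() {
    semgrep --config p/default "$1"
}

# The default ruleset plus a local ruleset file passed as "$2".
scan_default_and_local() {
    semgrep --config p/default --config "$2" "$1"
}
```

Example: scan_default . scans the current directory, and scan_default_and_local . my-rules.yaml adds a local rule file (my-rules.yaml is a placeholder name).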

Sample Rules

Some fun things to do with Semgrep.

Double Matches with Different Semgrep Messages

I was printing the type of a metavariable. https://semgrep.dev/playground/r/L1UB80/parsiya.tips-double-match

rules:
  - id: tips-double-match
    pattern: $RETTYPE $METHOD() { ... }
    message: $RETTYPE
    severity: WARNING
    languages:
      - java

It was matched twice.

package pk;

import org.foo.bar.MyType;

public class MyClass {

    public MyType method() {
        // do something
        return new MyType("123");
    }
}

Once with the type and once with the complete import name.

Line 7
MyType

Line 7
org.foo.bar.MyType

Fix

One fix (credit: Lewis Ardern, Semgrep, source) is to put $RETTYPE in a focus-metavariable. Note that focus-metavariable is only valid under patterns, so the single pattern has to be wrapped in a patterns list.

https://semgrep.dev/playground/r/8GUn82/parsiya.tips-double-match-fix

rules:
  - id: tips-double-match-fix
    patterns:
      - pattern: $RETTYPE $METHOD() { ... }
      - focus-metavariable: $RETTYPE
    message: $RETTYPE
    severity: WARNING
    languages:
      - java

Explanation

Credit: Iago Abal, Semgrep, source on Semgrep slack.

For Semgrep MyType is also equivalent to org.foo.bar.MyType, so when you ask Semgrep to match $RETTYPE against MyType it produces those two matches. And because $RETTYPE is part of the rule message, each match produces a different message, and Semgrep doesn't deduplicate two findings if each finding has a different message. I think focus-metavariable removes the duplicate because the "fake" org.foo.bar.MyType expression that we generate as equivalent to MyType uses tokens from the import and so the ranges of those tokens do not intersect with the method declaration... I see that more like a bug.

These double-matches you can observe them with other equivalences as in https://semgrep.dev/s/QDdD, because & is commutative and Semgrep does some AC-matching, $A may be both x and y, so you get two matches.

Skipping Java Annotations

Assume we have a file like this:

@Annotation1
public class ParentClass {

    @First
    @Second
    @Third
    public int meth1() {
        return 1;
    }
}

I wanted to match the method while skipping all annotations after @First. The following is not a valid pattern:

pattern: |
  @First
  ...
  public $RETURNTYPE $METHOD(...) { ... }  

https://semgrep.dev/playground/r/gxU0Q9/parsiya.tips-java-annotations

Fix

Credit: Cooper Pierce, Semgrep, source on Semgrep slack.

annotations beyond those specified are ignored when matching so something like [the following] would do what you describe

rules:
  - id: tips-java-annotations
    pattern: |
      @First
      public $RETURNTYPE $METHOD(...) { ... }      
    message: |
      $METHOD
    severity: WARNING
    languages:
      - java

https://semgrep.dev/playground/r/3qUbAk/parsiya.tips-java-annotations-fix

pattern-inside AND & OR

This is AND. The match must satisfy both.

- pattern-inside: ...
- pattern-inside: ...

This is OR.

- pattern-either:
  - pattern-inside: ...
  - pattern-inside: ...
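
To make the difference concrete, here is a hypothetical rule that uses both forms at once, written to a file from the shell. The rule id, the Python patterns, and the function name are illustrative assumptions, not rules from the registry. The intent is to flag eval(...) only when the call is inside a function AND inside either a for loop OR a while loop.

```shell
# Hypothetical helper: write a sample rule to the file given as "$1".
write_and_or_rule() {
    cat > "$1" <<'EOF'
rules:
  - id: and-or-sketch
    patterns:
      - pattern: eval(...)
      # AND: every item directly under patterns must hold
      - pattern-inside: |
          def $FUNC(...):
              ...
      # OR: at least one item under pattern-either must hold
      - pattern-either:
          - pattern-inside: |
              for $X in $ITER:
                  ...
          - pattern-inside: |
              while $COND:
                  ...
    message: eval inside a loop in a function
    severity: WARNING
    languages:
      - python
EOF
}
```

Example: write_and_or_rule and-or.yaml, then semgrep --validate --config and-or.yaml to check that the rule parses.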

if Statements in C/C++

Use if ($X) to capture the condition of an if statement in C/C++.

The following rule finds if statements with one-line blocks.

https://semgrep.dev/playground/r/5rUED4/parsiya.tips-detect-single-line-if-block

rules:
  - id: detect_if
    patterns:
      - pattern: if ($X) ...
      - pattern-not: if ($X) { $Y; ... }
    message: Found a one-line if block
    languages:
      - c
    severity: WARNING

Credit: Cooper Pierce, Semgrep, source on Semgrep slack.

Array Arguments in C/C++

$TYPE $VAR[...]; is not a valid pattern; use $TYPE $VAR[$SIZE]; instead. It also matches multi-dimensional arrays like int nDim_init[10][10][10][10][10][10];.

In general: Use metavariables instead of ... in C/C++.

Explanation

... is usually reserved to match a sequence of things (e.g., foo(...)), or if something is optional (e.g., return ...;)

Credit: Padioleau Yoann, Semgrep, source: Semgrep slack.

Alert if a Specific File or Path Exists

It's unconventional, but we can write a rule like this:

rules:
- id: detect-file
  patterns:
    - pattern-regex: .*
  message: Semgrep found the file
  languages:
    - generic
  severity: WARNING
  paths:
    include:
      - /path/to/badfile*
    exclude:
      - /paths/to/exclude/*

https://semgrep.dev/playground/r/eqUAk3/parsiya.detect-file

The path is relative to where you run Semgrep. The file doesn't need to have any content. The include and exclude lists under paths give us a lot of power to detect different paths.

Credit: Yours Truly, Parsia, source: Semgrep slack.

Alert if JavaScript Imports Exist and are Used

This was asked in the Semgrep slack and I came up with this answer that I liked. We want to get a match if there are a few specific JavaScript imports in a file and if they are all used. The order of imports and usage shouldn't matter. Here's an example:

var os = require("os");
var http = require("http");
var dns = require("dns");
os.exec("ls");
http.get("something");
dns.something("whatever");
let a = 1;

The rule relies on the intersection of six different pattern-inside clauses (three for the imports and three for the usages): a match must be inside all six. It will match everything after all six patterns are met.

Semgrep playground link: https://semgrep.dev/playground/r/2ZUkgL/parsiya.three-imports-used.

rules:
- id: three-imports-used
  patterns:
    - pattern-inside: |
        $OS = require('os')
        ...        
    - pattern-inside: |
        $HTTP = require('http')
        ...        
    - pattern-inside: |
        $DNS = require('dns')
        ...        
    - pattern-inside: |
        $OS.$METHOD1(...)
        ...        
    - pattern-inside: |
        $HTTP.$METHOD2(...)
        ...        
    - pattern-inside: |
        $DNS.$METHOD3(...)
        ...        
  message: Semgrep found a match
  languages:
    - js
  severity: WARNING

Using metavariable-pattern with Language Generic

This is a neat trick from my good friend Lewis Ardern. The question on the Semgrep Slack wanted to match text in a bash file like this: openssl genpkey -algorithm RSA -out private_key.pem -pkeyopt rsa_keygen_bits:1024. The extracted info was supposed to be the 1024 number. Then the rule had to check if the number was less than 2048.

I faced a problem here. It's not possible to create a Semgrep pattern like this for bash.

pattern: openssl ... -pkeyopt rsa_keygen_bits:$BITS ...

I guess it's because of the way tree-sitter creates tokens in bash. But I could create a pattern like this and get rsa_keygen_bits:1024 completely:

pattern: openssl ... -pkeyopt $RSA ...

We can use metavariable-pattern with the generic language to process the captured text. This lets us extract the number into its own metavariable and then use metavariable-comparison on it.

patterns:
  - pattern: |
      openssl ... -pkeyopt $KEY ...      
  - metavariable-pattern:
      metavariable: $KEY
      language: generic
      patterns:
        - pattern: rsa_keygen_bits:$BITS ...
        - metavariable-comparison:
            comparison: $BITS < 2048
  - focus-metavariable:
      - $BITS

Complete rule: https://semgrep.dev/playground/s/6YQ1
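
The two-step idea (capture the whole rsa_keygen_bits:1024 token first, then take it apart as plain text and compare the number) can be sketched outside Semgrep in a few lines of shell. This only illustrates the logic and is not how Semgrep implements metavariable-pattern. weak_rsa_bits is a made-up name.

```shell
# Hypothetical helper: print the key size and succeed if "$1" is an
# rsa_keygen_bits option with fewer than 2048 bits; fail otherwise.
weak_rsa_bits() {
    case "$1" in
        rsa_keygen_bits:*) bits=${1#rsa_keygen_bits:} ;;  # strip the prefix, keep the number
        *) return 1 ;;                                    # some other token
    esac

    # A non-numeric value makes the test fail, so it is treated as "no match".
    if [ "$bits" -lt 2048 ] 2>/dev/null; then
        printf '%s\n' "$bits"
    else
        return 1
    fi
}
```

Example: weak_rsa_bits rsa_keygen_bits:1024 prints 1024, while weak_rsa_bits rsa_keygen_bits:4096 prints nothing and returns a non-zero status.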

Rule Tests

Test File Names for Rules with the paths Tag

I had a rule that was looking for *-NAME.cpp files. E.g.,

rules:
- id: some-rule
  languages:
    - cpp
  paths:
    include:
      - "*-NAME.cpp"

The test file name should match one of the items in include. In this case, I needed to rename the test file to some-rule-NAME.cpp.
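
A quick way to check a candidate test file name before running the tests is to try the same glob in the shell. This is a rough sketch: shell globbing only approximates Semgrep's path matching, and matches_include is a made-up helper.

```shell
# Hypothetical helper: succeed if the file name "$1" matches the glob "$2".
matches_include() {
    case "$1" in
        $2) return 0 ;;   # unquoted on purpose so "$2" is used as a glob pattern
        *)  return 1 ;;
    esac
}
```

Example: matches_include some-rule-NAME.cpp '*-NAME.cpp' succeeds, while matches_include some-rule.cpp '*-NAME.cpp' fails.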

Memes

You Made This?

Edits to the famous comic: Yours Truly, Parsia.