---
title: "Advanced usage"
slug: advanced
summary: "This section explores the advanced usage of Semgrep, including how to create new rules."
weight: 10
---

# Advanced usage

## Ignoring (parts of) code in your project with Semgrep

Semgrep identifies programming languages based on their file extensions rather than content analysis. Use the `--scan-unknown-extensions` flag and the `--lang` flag to specify the language you want Semgrep to use when scanning files with non-standard extensions. For example:

```sh
semgrep --config /path/to/your/config --lang python --scan-unknown-extensions /path/to/your/file.xyz
```

In this example, Semgrep will scan the `/path/to/your/file.xyz` file as a Python file, even though the `.xyz` extension is not a standard Python file extension.

See also the [Allow user to specify file extensions for languages #3090](https://github.com/semgrep/semgrep/issues/3090) GitHub issue to work around restrictions if you want to use Semgrep against your specific language, even if the file extension is not standard.

### Files/directories

- By default, Semgrep follows the default [.semgrepignore](https://github.com/semgrep/semgrep/blob/develop/src/targeting/default.semgrepignore) file.
- If present, Semgrep will look at the repository's `.gitignore` file.
- In case of a conflict between the two files, the `.semgrepignore` file takes precedence. This means that if the `.gitignore` file includes a file and the `.semgrepignore` file excludes it, Semgrep will not analyze the file.

Before starting a scan, it is recommended that you review the files and directories in your project directory. Note that certain paths may be excluded by default. If you want to change the default exclusion behavior, such as including third-party libraries or unit tests in the scan, you can create a custom `.semgrepignore` file.
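Exclusions can also be scoped to a single rule rather than to the whole scan. The following is a minimal sketch of that idea using the rule-level `paths` field, which appears in the rule syntax diagram later in this section. The rule id, the pattern, and the excluded paths are hypothetical and assume a Go project that keeps its tests in `*_test.go` files:

```yaml
rules:
  - id: example-skip-test-files   # hypothetical rule id
    languages: [go]
    message: Some message
    severity: WARNING
    pattern: test(...)
    paths:
      exclude:
        - "*_test.go"             # assumed test-file naming convention
        - "internal/testdata"     # assumed fixture directory
```

A rule-level filter like this only affects the rule that declares it, so other rules still scan the excluded files unless `.semgrepignore` or `.gitignore` says otherwise.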

### Excluding code sections

To prevent Semgrep from flagging a particular piece of code, add a comment either on the same line as the pattern match or on the line immediately preceding it (e.g., `// nosemgrep: rule-id`). It is crucial to have a space between `//` and `nosemgrep`.

As a best practice, remember to:

- Exclude only particular findings in your comments rather than disabling all rules with a generic `// nosemgrep` comment.
- Explain why you disabled a rule or justify your risk acceptance decision.
- Report false positives: if you ignore a Semgrep rule because of a false positive, provide feedback to either the Semgrep development team or your internal development team responsible for the specific rule. This will help improve the accuracy of the rule and reduce the chances of future false positives.

For more information on how to use `nosemgrep` to ignore code blocks for a particular rule, refer to the [Semgrep documentation on ignoring code](https://semgrep.dev/docs/ignoring-files-folders-code/#ignoring-code-through-nosemgrep).

## Writing custom rules

While Semgrep offers a library of pre-built rules, creating custom rules can significantly enhance your security testing by tailoring it to your specific codebase and requirements. However, creating effective Semgrep rules can be challenging without proper guidance and understanding.

This section will give you the essential knowledge and skills to create high-quality Semgrep rules. You will learn about the rule language's syntax and how to develop effective patterns, handle edge cases, and create powerful custom Semgrep rules. This will aid in detecting potential security vulnerabilities early on, ultimately improving your testing process.

### Example custom rule

As a starting point for creating a custom rule, use the following schema to create the `custom_rule.yaml` file.

```yaml {linenos=inline}
rules:
  - id: rule-id
    languages: [go]
    message: Some message
    severity: ERROR # INFO / WARNING / ERROR
    pattern: test(...)
```

### Running custom rules

- To run the above-mentioned rule as a single file, use the following command:

    ```shell
    semgrep --config custom_rule.yaml
    ```

- To run a set of rules in a directory:

    ```shell
    semgrep --config path/
    ```

### ABCs of writing custom rules

To start writing custom Semgrep rules, it is crucial to understand a few key concepts and tools:

1. **Familiarize yourself with Semgrep syntax**: Begin by exploring the official [Learn Semgrep Syntax](https://semgrep.dev/learn) page, which provides a comprehensive guide on the fundamentals of Semgrep rule writing.
1. **Refer to language-specific pattern examples**: Consult the [Semgrep Pattern Examples by Language](https://semgrep.dev/embed/cheatsheet) for examples tailored to specific programming languages.
1. **Use the Semgrep Playground**: The [Semgrep Playground](https://semgrep.dev/playground/new) is a convenient online tool for writing and testing rules. However, it is essential to consider the following points when using the Playground:

    {{< hint danger >}}**Be cautious of privacy concerns**: The Semgrep Playground allows users to experiment with code without downloading or installing software on their local machine. While this platform is helpful for testing and debugging rules, it may expose sensitive information such as passwords, API keys, or other secrets contained in the code you submit for scanning. Always use a local development environment with proper security and privacy controls for sensitive code.{{< /hint >}}

    - **Employ the `simple mode`**: The Semgrep Playground's simple mode makes it easy to combine rule patterns.
    - **Use the `Share` button**: Share your rule and test code with others using the Share button.
    - **Add tests to your test code**: Incorporate [tests](https://semgrep.dev/docs/writing-rules/testing-rules/) (e.g., `# ruleid:`) into your test code to evaluate your rule's effectiveness while working in the Semgrep Playground (see [example](https://semgrep.dev/s/ezxE)).
    - **Note the limitations with comments**: Be aware that the Semgrep Playground does not retain comments when sharing a link or "forking" a rule (Ctrl+S). Refer to this [GitHub issue](https://github.com/semgrep/semgrep/issues/7120) for more information.

### Building blocks

#### Ellipses (`...`)

{{< hint info >}}**Purpose**: The ellipsis (`...`) is used to match zero or more arguments, statements, parameters, and so on, allowing for greater flexibility in pattern matching. {{< /hint >}}

Here is an example rule for Python:

```yaml {linenos=inline}
rules:
  - id: rule-id
    languages: [python]
    message: Some message
    severity: INFO
    pattern: requests.get(..., verify=False, ...)
```

Here, the ellipsis before and after the `verify=False` argument allows the pattern to match any number of arguments before and after the `verify` parameter. This ensures that the pattern can match function calls with various argument combinations, as long as the `verify=False` argument is present.

This pattern matches the highlighted lines in the following code snippet:

```python {linenos=inline, hl_lines=[1,3,6]}
requests.get(verify=False, url=URL)
requests.post(verify=False, url=URL)
requests.get(URL, verify=False, timeout=3)
requests.head()
requests.get(URL)
requests.get(URL, verify=False)
```

In the second example, the ellipsis is used to create a pattern that matches an `if` statement followed by an unnecessary `else` block after a `return` statement within the `if` block. Below is the `unnecessary-if-else-pattern` rule for Python:

```yaml {linenos=inline}
rules:
  - id: unnecessary-if-else-pattern
    languages: [python]
    message: Unnecessary else after return
    severity: INFO
    pattern: |
      if ...:
        return ...
      else:
        ...
```

Now, let's break down the pattern components:

1. `if ...:`: This part of the pattern matches any `if` statement, regardless of the condition being tested. The ellipsis within the `if` statement is a wildcard that matches any expression or code structure used as the condition.
   This flexibility ensures that the pattern can detect a wide range of `if` statements with various conditions.
2. `return ...`: Within the matched `if` block, the `return` statement is followed by an ellipsis. This wildcard matches any expression or value being returned. This allows the pattern to detect `return` statements with different values or expressions, such as `return True`, `return False`, `return x`, or `return calculate_result()`.
3. `...` within the `else` block: The ellipsis in the `else` block is a wildcard that matches any number of statements.

This pattern matches the following code snippet:

```py {linenos=inline}
if a > b:
    return True
else:
    print("a is not greater than b")
```

By including the ellipsis (`...`) in your Semgrep rules, you can create more flexible and comprehensive patterns that account for variations in code structure.

#### Metavariables

{{< hint info >}}**Purpose**: Metavariables are used to match and track values across a specific code scope. They are denoted by a dollar sign followed by uppercase letters (e.g., `$X`, `$Y`, `$COND`).{{< /hint >}}

Here is an example pattern in Golang:

```yaml
pattern: $X.($TYPE)
```

The metavariable `$X` matches the highlighted lines:

```go {linenos=inline, hl_lines=[1,2]}
msg, ok := m.(*MsgDonate) // $X = m
p := val.(types.Pool) // $X = val
x := val
msg, ok = m
```

Metavariables can also be interpolated into the output message of a Semgrep rule. For instance, consider the following rule:

```yaml {linenos=inline}
rules:
  - id: metavariable-example-rule
    patterns:
      - pattern: func $X(...) { ... }
    message: Found $X function
    languages: [golang]
    severity: WARNING
```

For the following code:

```go {linenos=inline}
func test123(input string) {
    fmt.Println("test")
}
```

This returns the `Found test123 function` message in the Semgrep output, as follows:

```shell
$ semgrep -f rule.yml
# (...)
metavariable-example-rule
  Found test123 function

  1┆ func test123(input string) {
  2┆     fmt.Println("test")
  3┆ }
```

Metavariables help create more dynamic and versatile Semgrep rules by capturing values that can be used for further pattern matching or validation.

##### Leveraging metavariables

Metavariables can be used in a variety of ways to enhance Semgrep rules, making them more dynamic and adaptable when analyzing code. Some common use cases include:

1. **Matching variable names**: Metavariables can be used to match variable names in the code, allowing the rule to be flexible and applicable to various situations. For example:

   ```yaml
   pattern: $X := $Y
   ```

   This pattern would match assignments like `a := b` or `result := calculation()`.

2. **Capturing function calls**: Metavariables can be employed to capture function calls and their arguments. This can be useful for detecting potentially unsafe or deprecated functions. For example:

   ```yaml
   pattern: $FUNC($ARG)
   ```

   This pattern would match single-argument function calls like `dangerousFunc(input)`. To also match calls with any number of arguments, such as `deprecatedFunc(arg1, arg2)`, use `$FUNC(...)` instead.

3. **Matching control structures**: Metavariables can help identify specific control structures, such as loops or conditionals, with a particular focus on the expressions used within these structures. For example:

   ```yaml
   pattern: for $INDEX := $INIT; $COND; $UPDATE { ... }
   ```

   This pattern would match for-loops like `for i := 0; i < 10; i++ { ... }`.

4. **Comparing code patterns**: Metavariables can be used to compare different parts of the code to ensure consistency or prevent potential bugs. For example, you can detect cases where the same assignment is made in both branches of an `if-else` statement:

   ```yaml
   pattern: |
     if $COND {
       $X = $Y
     } else {
       $X = $Y
     }
   ```

   This pattern would match code like:

   ```go {linenos=inline}
   if someCondition {
       x = y
   } else {
       x = y
   }
   ```

5. **Identifying patterns across multiple lines**: Metavariables can be employed to match and track values across multiple lines of code, making it possible to detect patterns that span several statements. For example:

   ```yaml
   pattern: |
     $VAR1 := $EXPR1
     $VAR2 := $VAR1
   ```

   This pattern would match code like the following:

   ```go {linenos=inline}
   a := b + c
   d := a
   ```

In conclusion, metavariables offer a powerful way to create dynamic and adaptable Semgrep rules. They help capture and track values across code scopes, enabling the identification of complex patterns and providing informative output messages for developers and security professionals.

#### Nested metavariables

{{< hint info >}}**Purpose**: Nested metavariables allow you to match a pattern with a metavariable that also contains another metavariable meeting certain conditions.{{< /hint >}}

Here is an example rule:

```yaml {linenos=inline}
rules:
  - id: metavariable-pattern-nest
    languages: [python]
    message: subtraction in foo(bar(...))
    patterns:
      - pattern: foo($X, ...)
      # First metavariable-pattern
      - metavariable-pattern:
          metavariable: $X
          patterns:
            - pattern: bar($Y)
            # Nested metavariable pattern
            - metavariable-pattern:
                metavariable: $Y
                patterns:
                  - pattern: ... - ...
    severity: WARNING
```

This rule matches the following Python code:

```python {linenos=inline}
foo(bar(1-2))
foo(bar(bar(1-2)))
```

Nested metavariables allow for more complex and precise pattern matching in Semgrep rules by allowing you to define relationships between multiple metavariables.

#### Using `metavariable-pattern` for polyglot file scanning

{{< hint info >}}**Purpose**: To match patterns across different languages within a single file (e.g., JavaScript embedded in HTML).{{< /hint >}}

Example: Find all instances of JavaScript's [eval](https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/eval) function used in an HTML file ([example](https://semgrep.dev/s/W9by)).

```yaml {linenos=inline}
rules:
  - id: metavariable-pattern-nest
    languages: [html]
    message: eval in JS
    patterns:
      - pattern: <script ...>$Y</script>
      - metavariable-pattern:
          metavariable: $Y
          language: javascript
          patterns:
            - pattern: eval(...)
    severity: WARNING
```

This rule matches the following HTML code:

```html {linenos=inline}
<script>
  eval("1 + 1");
</script>
```

Using `metavariable-pattern` allows for cross-language pattern matching in polyglot files, enabling you to identify specific code patterns within mixed-language files.

#### Using `metavariable-pattern` + `pattern-either`

{{< hint info >}}**Purpose**: To specify multiple alternative patterns that can match a metavariable.{{< /hint >}}

Example: Flag instances where a variable declaration uses one of several specific types ([example](https://semgrep.dev/s/J0zk) / [trailofbits.go.string-to-int-signedness-cast.string-to-int-signedness-cast](https://semgrep.dev/playground/r/trailofbits.go.string-to-int-signedness-cast.string-to-int-signedness-cast?editorMode=advanced) rule).

```yaml {linenos=inline}
rules:
  - id: metavariable-pattern-multiple-or
    languages: [go]
    message: xyz
    patterns:
      - pattern: var $A $TYPE = ...
      - metavariable-pattern:
          metavariable: $TYPE
          pattern-either:
            - pattern: uint8
            - pattern: uint16
            - pattern: uint32
            - pattern: int8
            - pattern: int16
            - pattern: int32
    severity: WARNING
```

This rule matches the highlighted lines in the following Go code:

```go {linenos=inline, hl_lines=["1-6"]}
var a uint8 = 255
var b uint16 = 65535
var c uint32 = 4294967295
var d int8 = -128
var e int16 = -32768
var f int32 = -2147483648
var g string = "xyz"
```

Combining `metavariable-pattern` with `pattern-either` allows you to create Semgrep rules that match a metavariable if it meets any of the specified conditions.
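
When the alternatives share a textual shape, the same constraint can sometimes be written more compactly with `metavariable-regex`, which also appears in the rule syntax diagram later in this section. The following is a sketch rather than a drop-in replacement: the rule id is hypothetical, and the regular expression is an assumption about which type names should be flagged.

```yaml
rules:
  - id: metavariable-regex-int-types   # hypothetical rule id
    languages: [go]
    message: xyz
    patterns:
      - pattern: var $A $TYPE = ...
      - metavariable-regex:
          metavariable: $TYPE
          # Intended to cover the same six type names as the
          # pattern-either list: uint8, uint16, uint32, int8, int16, int32
          regex: '^u?int(8|16|32)$'
    severity: WARNING
```

The trade-off is that `metavariable-regex` compares against the source text bound to `$TYPE`, whereas `pattern-either` compares parsed patterns, so the explicit list is easier to audit while the regex is shorter.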

#### Metavariable-pattern + patterns

{{< hint info >}}**Purpose**: Use `metavariable-pattern` and `patterns` to flag instances where a metavariable `$X` meets _all_ conditions (`patterns`) ([example](https://semgrep.dev/s/BJqv) / [lxml-in-pandas rule](https://semgrep.dev/playground/r/trailofbits.python.lxml-in-pandas.lxml-in-pandas?editorMode=advanced)).{{< /hint >}}

Here is an example rule:

```yaml {linenos=inline}
rules:
  - id: metavariable-pattern-and-patterns
    languages:
      - go
    message: xyz1
    patterns:
      - pattern: var $A $TYPE = $Z
      - metavariable-pattern:
          metavariable: $Z
          patterns:
            - pattern-not: |
                -128
            - pattern-not: |
                -32768
    severity: WARNING
```

This rule matches the highlighted lines in the following Go code:

```go {linenos=inline, hl_lines=[1,3]}
var b uint16 = 65535
var d int8 = -128
var c uint32 = 4294967295
var e int16 = -32768
```

#### Constant propagation

Constant propagation lets Semgrep track the constant values assigned to variables, so a rule can match instances where a metavariable holds a specific value or satisfies a relation, even when that value reaches the matched code through a variable.

##### Matching instances where a metavariable holds a specific value

{{< hint info >}}**Purpose**: To match instances where a metavariable holds a specific value or relation, use the `metavariable-comparison` key.{{< /hint >}}

Example: Match cases where the variable `$X` is greater than `1337` ([example](https://semgrep.dev/s/LqeL)).

```yaml {linenos=inline}
rules:
  - id: metavariable-comparison
    languages: [python]
    message: $X is higher than 1337
    patterns:
      - pattern: function($X)
      - metavariable-comparison: # Match when $X > 1337
          metavariable: $X
          comparison: $X > 1337
    severity: WARNING
```

This rule matches the highlighted lines in the following Python code:

```python {linenos=inline, hl_lines=["2-3"]}
n = 1339
function(n) # Match (n > 1337)
function(1338) # Match (constant > 1337)
function(123)
```

##### Comparing specific metavariables

{{< hint info >}}**Purpose**: Compare specific metavariables.{{< /hint >}}

Example: Match function calls where the first argument is lower than the second one ([example](https://semgrep.dev/s/dYnd)).

```yaml {linenos=inline}
rules:
  - id: metavariable-comparison-rule
    patterns:
      - pattern: f($A, $B)
      - metavariable-comparison:
          comparison: int($A) < int($B)
          metavariable: $A
    message: $A < $B
    languages: [python]
    severity: WARNING
```

This rule matches the highlighted lines in the following Python code:

```python {linenos=inline,hl_lines=["1-2"]}
f(1,2)
f(2,3)
f(4,3)
f(12312,1)
```

#### Deep expression operator

{{< hint info >}}**Purpose**: To match deeply nested expressions in the code.{{< /hint >}}

The deep expression operator is useful when you want to identify specific patterns that are buried within complex structures like conditional statements, loops, or function calls. Using the deep expression operator, you can create rules that target specific code patterns regardless of how deep they are in the code structure.

The deep expression operator is written as `<... PATTERN ...>`. It matches any expression that contains `PATTERN` as a sub-expression, no matter how deeply it is nested. By using the deep expression operator, you can create Semgrep rules that match patterns at any level of nesting.

**Example**: Matching a function call nested within an `if` statement ([example](https://semgrep.dev/s/2Qv8)).
Suppose you want to match any instance of a specific function call (e.g., `user.is_admin()`) within the condition of an `if` statement, regardless of how deeply nested it is.

```yaml {linenos=inline}
rules:
  - id: deep-expression-example
    pattern: |
      if <... user.is_admin() ...>:
        print(...)
    message: if statement with is_admin() check
    languages: [python]
    severity: WARNING
```

This rule matches the following Python code:

```python {linenos=inline}
if user.authenticated() and user.is_admin() and user.has_group(gid):
    print("hello")
```

#### Understanding `pattern-inside` and `pattern-not-inside`

##### Using `pattern-inside`

By using `pattern-inside`, you can create rules that match patterns only when they appear **within** a certain code construct, like a function or class definition, a loop, or a conditional block.

Here's an example of how you might use `pattern-inside` to detect cases where a sensitive function is called within a loop:

```yaml {linenos=inline}
rules:
  - id: sensitive_function_in_loop
    languages:
      - python
    message: "Sensitive function called inside a loop"
    severity: WARNING
    patterns:
      - pattern-inside: |
          for ... in ...:
            ...
      - pattern: |
          sensitive_function(...)
```

In this example, the `pattern-inside` operator is used to match any `for` loop in Python, and the second pattern matches calls to `sensitive_function()`. The rule will trigger only if both patterns are matched, meaning that the `sensitive_function` is called **inside** a loop.

Here's an example of Python code that would trigger the `sensitive_function_in_loop` rule:

```python {linenos=inline, hl_lines=[10]}
def sensitive_function(data):
    # Process sensitive data
    pass

def main():
    data_list = ['data1', 'data2', 'data3']

    for data in data_list:
        # Call to sensitive_function is inside a loop
        sensitive_function(data)

def second(data):
    sensitive_function(data)
```

##### Using `pattern-not-inside`

`pattern-not-inside` is the opposite of `pattern-inside` and is used to match a pattern only when it **does not appear** within a specified context. This operator helps you to exclude certain parts of the code from your analysis, further refining your rules and reducing false positives.

For instance, you can use `pattern-not-inside` to detect calls to the `print_debug()` function when they occur outside an `if debug:` block:

```yaml {linenos=inline}
rules:
  - id: print_debug_outside_debug_block
    languages:
      - python
    message: "print_debug() should be called inside a 'if debug:' block"
    severity: WARNING
    patterns:
      - pattern-not-inside: |
          if debug:
            ...
      - pattern: |
          print_debug(...)
```

Here is a Python code example demonstrating the use of this rule. Only the highlighted call, which sits outside the `if debug:` block, is reported:

```python {linenos=inline, hl_lines=[11]}
debug = True

def print_debug(msg):
    print("DEBUG:", msg)

def correct_usage():
    if debug:
        print_debug("This is a debug message inside a 'if debug:' block")

def incorrect_usage():
    print_debug("This is a debug message outside a 'if debug:' block")

def main():
    correct_usage()
    incorrect_usage()
```

##### Combining `pattern-inside` and `pattern-not-inside`

In some cases, you might want to create rules that use both `pattern-inside` and `pattern-not-inside` operators to capture instances where a specific pattern is found within a particular context but not within another.

**Example**: Detecting `print()` calls in functions but not in `main()`.

Suppose you want to enforce a rule where `print()` calls are allowed only within the `main()` function and not in any other functions.
You can create a rule that combines `pattern-inside` and `pattern-not-inside` operators to achieve this.

```yaml {linenos=inline}
rules:
  - id: print_calls_outside_main
    languages:
      - python
    message: "print() calls should only be inside the main() function"
    severity: WARNING
    patterns:
      - pattern-inside: |
          def $X(...):
            ...
      - pattern-not-inside: |
          def main(...):
            ...
      - pattern: |
          print(...)
```

In this example, the `pattern-inside` operator matches any function definition, while the `pattern-not-inside` operator ensures that the `main()` function is excluded. The final pattern matches calls to the `print()` function. The rule will trigger only when a `print()` call is found inside a function other than `main()`.

Here's an example of Python code that triggers the `print_calls_outside_main` rule:

```python {linenos=inline, hl_lines=[3,11]}
def sample_function():
    # print() call inside a function other than main()
    print("This is a sample function")

def main():
    print("This is the main function")
    sample_function()

def other_function():
    some_function()
    print("XYZ")
```

#### Taint mode

Taint mode is a powerful feature in Semgrep that can track the flow of data from one location to another. By using taint mode, you can:

1) **Track data flow across multiple variables:** Taint mode enables you to trace how data moves across different variables, functions, and components, and lets you easily identify insecure flow paths (e.g., situations where a specific sanitizer is not used).
2) **Find injection vulnerabilities:** Taint mode is particularly useful for identifying injection vulnerabilities such as SQL injection, command injection, and XSS attacks.
3) **Write simple and resilient Semgrep rules:** Taint mode simplifies the process of writing Semgrep rules that are resilient to certain code patterns nested in `if` statements, loops, and other structures.

To use taint mode, you need to set `mode: taint` and specify the `pattern-sources` and `pattern-sinks` fields in your custom Semgrep rule. See this [example](https://semgrep.dev/s/el3X):

```yaml {linenos=inline}
rules:
  - id: taint-tracking-example1
    mode: taint
    pattern-sources:
      - pattern: getData()
    pattern-sinks:
      - pattern: printToUser(...)
    message: data flows from getData to printToUser
    languages: [python]
    severity: WARNING
```

Optionally, you can use additional fields in your Semgrep rule to further refine your taint analysis:

- `pattern-propagators`: This field allows you to specify functions or methods that propagate tainted data ([example](https://semgrep.dev/s/7Nrv)). For more information, see the [taint mode documentation](https://semgrep.dev/docs/writing-rules/data-flow/taint-mode/#propagators).
- `pattern-sanitizers`: This field allows you to specify functions or methods that sanitize tainted data. You can also refer to [sanitizers by side-effect](https://semgrep.dev/docs/writing-rules/data-flow/taint-mode/#sanitizers-by-side-effect) for more information.

### Combining patterns

When writing Semgrep rules, you may encounter situations where a single pattern (e.g., `pattern: evil_function(...)`) isn't sufficient to capture the behavior you want to detect. In these cases, you can use one of the following to combine patterns:

- `patterns`: This method combines multiple patterns with a logical AND (&&). In other words, all patterns must match for the rule to trigger. This is useful when you want to detect code snippets that satisfy multiple conditions simultaneously.
- `pattern-either`: This method combines multiple patterns with a logical OR (||). In other words, if any of the patterns match, the rule triggers. This is useful when you want to detect code snippets satisfying at least one specified condition.

    Suppose you want to detect calls to two insecure functions, `insecure_function_1()` and `insecure_function_2()`.
    You can use the `pattern-either` operator to achieve this.

    ```yaml {linenos=inline}
    rules:
      - id: insecure_function_calls
        languages:
          - python
        message: "Call to an insecure function detected"
        severity: WARNING
        patterns:
          - pattern-either:
              - pattern: |
                  insecure_function_1(...)
              - pattern: |
                  insecure_function_2(...)
    ```

    In this example, the `pattern-either` operator is used to match calls to either `insecure_function_1()` or `insecure_function_2()`. The rule will trigger if any of these patterns are matched.

    Here's an example of Python code that triggers the `insecure_function_calls` rule:

    ```python {linenos=inline, hl_lines=[9,12]}
    def insecure_function_1():
        print("Insecure function 1 called")

    def insecure_function_2():
        print("Insecure function 2 called")

    def main():
        # Call to insecure_function_1() triggers the rule
        insecure_function_1()

        # Call to insecure_function_2() also triggers the rule
        insecure_function_2()
    ```

- `pattern-regex`: This matches code with a [PCRE](https://www.pcre.org/original/doc/html/pcrepattern.html)-compatible regular expression in multiline mode.

#### Rule syntax diagram

The following diagram will help you understand the relationship between the relevant fields in the rule. While writing a rule, you can use the advanced mode in the [Semgrep Playground](https://semgrep.dev/playground/new) to test and refine it. The playground highlights any errors in your rules, providing immediate feedback.

{{< mermaid >}}
flowchart TB
    Fields{Rule Fields} ---->|Only one is allowed| Required{Required}
    click Fields "https://semgrep.dev/docs/writing-rules/rule-syntax/#optional:~:text=Rule%20syntax-,Rule%20syntax,-TIP"
    Required ==> id
    click id "https://semgrep.dev/docs/writing-rules/rule-syntax/#optional:~:text=Description-,id,-string"
    Required ==> message
    click message "https://semgrep.dev/docs/writing-rules/rule-syntax/#optional:~:text=no%2Dunused%2Dvariable-,message,-string"
    Required ==> severity
    click severity "https://semgrep.dev/docs/writing-rules/rule-syntax/#optional:~:text=Rule%20messages.-,severity,-string"
    Required ==> languages((languages))
    click languages "https://semgrep.dev/docs/writing-rules/rule-syntax/#language-extensions-and-tags"
    Required ===>|Only one is required| Pattern_Fields{Pattern Fields}
    click Pattern_Fields "https://semgrep.dev/docs/writing-rules/rule-syntax/#optional:~:text=pattern*,in%20multiline%20mode"
    click Required "https://semgrep.dev/docs/writing-rules/rule-syntax/#required"
    Pattern_Fields ==> pattern
    click pattern "https://semgrep.dev/docs/writing-rules/rule-syntax/#pattern"
    Pattern_Fields ==> pattern-regex[pattern-regex]
    click pattern-regex "https://semgrep.dev/docs/writing-rules/rule-syntax/#pattern-regex"
    Pattern_Fields ==> pattern-either((pattern-either))
    click pattern-either "https://semgrep.dev/docs/writing-rules/rule-syntax/#pattern-either"
    Pattern_Fields ==> patterns((patterns))
    click patterns "https://semgrep.dev/docs/writing-rules/rule-syntax/#patterns"
    pattern-either -.-> pattern-regex
    pattern-either -.-> pattern
    pattern-either -.-> pattern-inside
    click pattern-inside "https://semgrep.dev/docs/writing-rules/rule-syntax/#pattern-inside"
    pattern-either <-.-> patterns
    patterns -.-> pattern-inside
    patterns <-..-> metavariable-pattern{metavariable-pattern}
    click metavariable-pattern "https://semgrep.dev/docs/writing-rules/rule-syntax/#metavariable-pattern"
    metavariable-pattern --> metavariable2[metavariable]
    metavariable-pattern -.-> language
    metavariable-pattern -.-> pattern
    metavariable-pattern -.-> pattern-either
    metavariable-pattern -.-> pattern-regex
    patterns -.-> metavariable-regex{metavariable-regex}
    click metavariable-regex "https://semgrep.dev/docs/writing-rules/rule-syntax/#metavariable-regex"
    metavariable-regex --> metavariable
    metavariable-regex --> regex
    patterns -.-> metavariable-comparison{metavariable-comparison}
    click metavariable-comparison "https://semgrep.dev/docs/writing-rules/rule-syntax/#metavariable-comparison"
    metavariable-comparison --> metavariable3[metavariable]
    metavariable-comparison --> comparison
    metavariable-comparison -.-> base
    metavariable-comparison -.-> strip
    patterns -.-> pattern
    patterns -.-> pattern-not
    click pattern-not "https://semgrep.dev/docs/writing-rules/rule-syntax/#pattern-not"
    patterns -.-> pattern-not-inside
    click pattern-not-inside "https://semgrep.dev/docs/writing-rules/rule-syntax/#pattern-not-inside"
    patterns -.-> pattern-not-regex
    click pattern-not-regex "https://semgrep.dev/docs/writing-rules/rule-syntax/#pattern-not-regex"
    Fields -.-> Optional{Optional}
    Optional -.-> options(options)
    Optional -.-> fix(fix)
    Optional -.-> metadata(metadata)
    Optional -.-> paths(paths)
    click Optional "https://semgrep.dev/docs/writing-rules/rule-syntax/#optional"
    click options "https://semgrep.dev/docs/writing-rules/rule-syntax/#options"
    click fix "https://semgrep.dev/docs/writing-rules/rule-syntax/#fix"
    click metadata "https://semgrep.dev/docs/writing-rules/rule-syntax/#metadata"
    click paths "https://semgrep.dev/docs/writing-rules/rule-syntax/#paths"
{{< /mermaid >}}

**Example #1**: Looking at the chart, you can see that the `pattern-either` and `pattern-not` fields are not directly connected. However, you can combine them using the `patterns` field, which performs a logical AND operation on all the patterns included.

**Example #2**: For instance, if you want to use `pattern-either` to combine multiple patterns with a logical OR and exclude a specific pattern using `pattern-not`, you can do so by including both of them under the same `patterns` field. The resulting combination matches code that satisfies at least one of the patterns listed under `pattern-either`, unless it also matches the pattern specified in `pattern-not`. See the example [`exclude-when-using-secure-option`](https://semgrep.dev/s/vgob) rule.

### Generic pattern matching

It is possible to match generic patterns in unsupported languages/contexts. Use the `generic` language for configuration files, XML, etc., and combine it with the specific extension through the `paths` - `include` fields to reduce false positives.

For example, see the [`nsc-allows-plaintext-traffic` rule](https://semgrep.dev/playground/r/java.android.best-practice.network-security-config.nsc-allows-plaintext-traffic?editorMode=advanced), which scans the Android network security configuration XML file for potential misconfiguration:

```yaml {linenos=inline}
rules:
  - id: nsc-allows-plaintext-traffic
    languages: [generic]
    patterns:
      - pattern: |
      - pattern-not-inside: |
      - pattern-not-inside: >
          ... ... ... ... ... ... ... ... ... ...
    severity: INFO
    paths:
      include:
        - "*.xml"
```

### Metadata

Metadata fields are a feature in Semgrep that allow you to attach additional information to your rules. By including metadata fields in your rules, you can give developers more context and guidance on addressing potential issues. This information can include details such as the rule's severity level, recommended fixes, or the author's contact information.

Metadata makes your rules more informative and actionable for the developers who encounter them. This can help them prioritize and fix issues more efficiently, ultimately improving the overall security of your codebase.
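
As a minimal sketch of what such a rule might look like, the example below attaches a `metadata` block to a simple pattern. The rule id, the pattern, the chosen CWE, the `author` value, and every field value are illustrative assumptions and are not taken from a published rule. The `cwe`, `confidence`, `likelihood`, `impact`, and `subcategory` keys are the Semgrep Registry security-category fields covered in this section.

```yaml
rules:
  - id: example-rule-with-metadata   # hypothetical rule id
    languages: [python]
    message: Call to evil_function() detected
    severity: WARNING
    pattern: evil_function(...)
    metadata:
      category: security
      cwe:
        - "CWE-676: Use of Potentially Dangerous Function"
      confidence: MEDIUM
      likelihood: LOW
      impact: MEDIUM
      subcategory:
        - audit
      # Free-form custom field; any key can be added under metadata
      author: "Jane Doe <jane.doe@example.com>"
```

Semgrep passes the `metadata` block through to its output unchanged, so downstream tooling can filter or sort findings on these values.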

In addition to providing context and guidance to developers, there are several other reasons why an organization might want to use Semgrep metadata:

1. **Standardization.** Using metadata fields consistently across all of your organization's Semgrep rules ensures that developers see the same types of information and recommendations no matter which rules they encounter. This can help standardize the security review process and simplify prioritizing and addressing issues.
    - Example: [By including fields required by the security category in the Semgrep Registry](https://semgrep.dev/docs/contributing/contributing-to-semgrep-rules-repository/#including-fields-required-by-security-category), developers will prioritize findings with high `confidence` and high `impact` metadata.
2. **Collaboration.** Including author information in your Semgrep rules can make it easier for other organization members to collaborate on security issues.
    - Example: If someone has a question or needs more information about a particular rule, they can contact the `author` directly for clarification.
3. **Compliance.** If your organization needs to comply with specific security regulations or standards, you could include a `compliance` metadata field in your Semgrep rules, indicating which regulation or standard the rule relates to. This helps ensure that your codebase complies with all relevant requirements.

You can create any metadata field, as demonstrated in the [hooray-taint-mode](https://semgrep.dev/playground/s/4K3g) rule.

We recommend including the following metadata fields required by the security category in the Semgrep Registry:

1. `cwe`: A [Common Weakness Enumeration](https://cwe.mitre.org/index.html) identifier that classifies the security issue.
2. `confidence`: An assessment of the rule's accuracy, represented as high, medium, or low.
3.
`likelihood`: An estimation of the probability that the detected issue will be exploited, represented as high, medium, or low.
4. `impact`: A measure of the potential damage caused by exploiting the detected issue, represented as high, medium, or low.
5. `subcategory`: A more specific classification of the rule, falling under one of the following categories: [vuln, audit, or guardrail](https://semgrep.dev/docs/contributing/contributing-to-semgrep-rules-repository/#subcategory).

By including these metadata fields, you provide valuable context and help users better understand the security implications of the issues detected by your rule.

### Various tips

#### Matching an array with a non-string element

This Semgrep rule aims to detect JavaScript or TypeScript arrays that contain at least one non-string element. The patterns are quoted so that YAML passes them to Semgrep as plain strings: `'"..."'` matches any string literal, and `'[..., $A, ...]'` is not parsed as a YAML list. See this [array-with-a-non-string-element example](https://semgrep.dev/s/BJnb).

```yaml {linenos=inline}
rules:
  - id: array-with-a-non-string-element
    languages: [js]
    message: array with element that is not a string
    severity: WARNING
    patterns:
      - metavariable-pattern:
          metavariable: $A
          patterns:
            - pattern-not: '"..."'
      - pattern: '[..., $A, ...]'
```

#### "Removing" negative pattern from pattern-either

This Semgrep rule aims to detect Python code snippets where a function `a(...)`, `b(...)`, or `c(...)` is called, but it should not match the case where function `a()` is called with the argument `x`. See this [pattern-not-with-pattern-either example](https://semgrep.dev/s/5N96).

```yaml {linenos=inline}
rules:
  - id: pattern-not-in-pattern-either
    patterns:
      - pattern-either:
          - pattern: a(...)
          - pattern: b(...)
          - pattern: c(...)
      - pattern-not: a(x)
    message: pattern either with one negative pattern
    languages: [python]
    severity: WARNING
```

### Maintaining good quality of Semgrep rules

Before publishing a new rule or updating an existing one, it is crucial to ensure that it meets specific standards and is effective.
To help with this, we've created a [Development Practices checklist](https://github.com/trailofbits/semgrep-rules/blob/main/CONTRIBUTING.md#development-practices) in our _Contributing to Trail of Bits Semgrep Rules_ document that you can follow to make sure your custom rule is ready for publication.

### Help with writing custom rules

{{< hint danger >}}**Warning:** Be careful about asking for external assistance for writing rules or sharing rule output that may be specific to a sensitive and/or private codebase. Doing so could inadvertently disclose the identity of the code owner, portions of the code, or particular bugs.{{< /hint >}}

When running into issues while working on custom rules, several resources are available to help you. Two of the most valuable resources are the following:

- The [Semgrep Community Slack](https://go.semgrep.dev/slack) is a great place to ask for help with custom rule development. The channel is staffed by knowledgeable developers familiar with Semgrep's architecture and syntax. They are usually quick to respond to questions. They can guide you in structuring your rules and in debugging any issues that arise. Additionally, the Slack channel is a great place to connect with other developers working on similar projects, allowing you to learn from others' experiences and share your insights.
- Use [Semgrep GitHub issues](https://github.com/semgrep/semgrep/issues) to report bugs, suggest new features, and ask for help with specific issues.

## Thoroughly testing Semgrep rules for optimal performance

Creating comprehensive tests for your Semgrep rules is essential to ensure they perform as expected and cover a wide range of test cases. By thoroughly testing the rules against various code samples, you can confirm that they accurately identify intended vulnerabilities, potential errors, or coding standard violations. This ultimately leads to more reliable and effective security and code quality analysis.

### Designing comprehensive test cases

A well-rounded test suite for a custom Semgrep rule should cover multiple aspects of the rule's functionality. When designing test cases, consider the following:

1. **Create a file containing code samples**: Create a test file with the same base name as the rule file. For example, if your rule filename is `unsafe-exec.yml`, create a corresponding `unsafe-exec.py` file with sample code.
2. **Incorporate a diverse range of code samples**: Adhere to the following guidelines when adding code samples to the test file:
    - Include at least one true positive comment (e.g., `// ruleid: id-of-your-rule`, using the comment syntax of the target language).
    - Include at least one true negative comment (e.g., `// ok: id-of-your-rule`).
    - Start with simple, descriptive examples that are easy to understand.
    - Progress to more advanced, complex examples, such as those involving nested structures (e.g., inside an `if` statement) or deep expressions.
    - Include edge cases that may challenge the rule's accuracy or efficiency, such as large input values, complex code structures, or unusual data types.
    - Test the rule against different language features and constructs, including loops, conditionals, classes, and functions.
    - Intentionally create code samples that should not trigger the rule, and ensure that the rule does not produce false positives in these cases.
3. **Ensure all tests pass**: Run the `semgrep --test` command to verify that all test cases pass.
4. **Evaluate the rule against real-world code**: Test the rule against actual code from your projects, open-source repositories, or other codebases to assess its effectiveness in real-life scenarios.
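
The guidelines above can be sketched as a small test file. The following is a minimal, hypothetical `unsafe-exec.py`, assuming a rule with the ID `unsafe-exec` that flags `exec()` calls whose argument is not a string literal. The rule's behavior and every function name here are assumptions made for illustration; only the `ruleid:` and `ok:` annotations (written with Python's `#` comment syntax) come from Semgrep's test format.

```python
# unsafe-exec.py: hypothetical test file for an assumed rule with the ID
# `unsafe-exec`, which is expected to flag exec() on non-literal input.


def simple_case(user_input):
    # Simple true positive: the argument is not a string literal.
    # ruleid: unsafe-exec
    exec(user_input)


def nested_case(user_input, enabled):
    # True positive nested inside a conditional and a loop.
    if enabled:
        for _ in range(1):
            # ruleid: unsafe-exec
            exec(user_input)


def literal_case():
    # True negative: the executed code is a constant string.
    namespace = {}
    # ok: unsafe-exec
    exec("result = 1 + 1", namespace)
    return namespace["result"]


def lookalike_case(user_input):
    # True negative: the input is used, but exec() is never called.
    # ok: unsafe-exec
    return repr(user_input)
```

Each annotation sits on the line directly above the code it describes, so `semgrep --test` can compare the expected findings with the actual ones.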

## Testing custom rules in CI

### GitHub Actions

The following workflow can be used to test custom Semgrep rules in GitHub Actions:

```yml
name: Test Semgrep rules

on: [push, pull_request]

jobs:
  semgrep-test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v4
        with:
          python-version: "3.11"
          cache: "pip"
      - run: python -m pip install -r requirements.txt
      - run: semgrep --test --test-ignore-todo ./path/to/rules/
```

Make sure to include `semgrep` in your `requirements.txt` file (or its [`poetry` or `pipenv` equivalent](https://github.com/actions/setup-python/blob/main/docs/advanced-usage.md#caching-packages)) so that the dependency is cached, which speeds up workflow runs. Note that we include `--test-ignore-todo` here so we do not fail CI runs on [TODO tests](https://semgrep.dev/docs/writing-rules/testing-rules), which are a valuable form of documentation for future rule improvements.

## Autofix feature

The autofix feature can automatically correct identified vulnerabilities, potential errors, or coding standard violations. There are many benefits to using the autofix feature:

- Training every developer on all the best practices for large code bases is not feasible. Autofixes can help fill in the gaps and provide guidance as needed.
- Autofixes maintain developer focus by removing monotonous changes, allowing them to concentrate on more complex tasks.
- Adding autofixes allows developers to be educated and trained on new best practices as they are introduced into the codebase.
- Autofixes can provide on-demand fixes and are much more actionable and educational than simple lint warnings.
- Unless developers are made aware of a deprecation, they won't know not to use a deprecated component, and they won't know what to use instead. Autofixes can help make these transitions smoother.

### Creating a Semgrep rule with the autofix feature

Follow these steps to develop a rule with the autofix feature (see the [ioutil-readdir-deprecated](https://semgrep.dev/s/wPEX) rule with the autofix feature implemented):

1. Add the `fix` key to a rule, specifying the replacement pattern for the identified vulnerability.

    Here is an example rule with the autofix feature:

    ```yaml {linenos=inline}
    rules:
      - id: ioutil-readdir-deprecated
        languages: [golang]
        message: ioutil.ReadDir is deprecated. Use more efficient os.ReadDir.
        severity: WARNING
        pattern: ioutil.ReadDir($X)
        fix: os.ReadDir($X)
    ```

    For the following Golang code:

    ```go {linenos=inline}
    package main

    import (
    	"fmt"
    	"io/ioutil"
    	"log"
    	"os"
    )

    func main() {
    	// ruleid: ioutil-readdir-deprecated
    	files, err := ioutil.ReadDir(".")
    	if err != nil {
    		log.Fatal(err)
    	}

    	for _, file := range files {
    		fmt.Println(file.Name())
    	}
    }
    ```

2. Run the rule using the standard command to confirm that the rule is detecting the intended issue:

    ```bash
    $ semgrep -f rule.yaml
    # (...)
    Findings:

      readdir.go
        ioutil-readdir-deprecated
          ioutil.ReadDir is deprecated. Use more efficient os.ReadDir.

           ▶▶┆ Autofix ▶ os.ReadDir(".")
           11┆ files, err := ioutil.ReadDir(".")
    # (...)
    ```

3. Run the rule with the `--dryrun` and the `--autofix` options to preview the behavior of the autofix feature on the code without making any changes to the analyzed code:

    ```bash
    $ semgrep -f rule.yaml --dryrun --autofix
    # (...)
    Findings:

      readdir.go
        ioutil-readdir-deprecated
          ioutil.ReadDir is deprecated. Use more efficient os.ReadDir.

           ▶▶┆ Autofix ▶ os.ReadDir(".")
           11┆ files, err := os.ReadDir(".")
    # (...)
    ```

4. Create a new test file for the autofix by adding the `.fixed` suffix in front of the file extension (e.g., `readdir.go` -> `readdir.fixed.go`). This file should contain the expected output after the autofix is applied.

    Content of the `readdir.fixed.go` file:

    ```go {linenos=inline}
    package main

    import (
    	"fmt"
    	"io/ioutil"
    	"log"
    	"os"
    )

    func main() {
    	// ruleid: ioutil-readdir-deprecated
    	files, err := os.ReadDir(".")
    	if err != nil {
    		log.Fatal(err)
    	}

    	for _, file := range files {
    		fmt.Println(file.Name())
    	}
    }
    ```

5. Run the test to confirm that the autofix is working as expected:

    ```shell
    $ semgrep --test
    1/1: ✓ All tests passed
    1/1: ✓ All fix tests passed
    ```

6. Now you are ready to apply autofix to the analyzed file with the `--autofix` option.

    ```shell
    $ semgrep -f rule.yaml --autofix
    # (...)
    Findings:

      readdir.go
        ioutil-readdir-deprecated
          ioutil.ReadDir is deprecated. Use more efficient os.ReadDir.

           ▶▶┆ Autofix ▶ os.ReadDir(".")
           11┆ files, err := ioutil.ReadDir(".")
    # (...)
    ```

By following these steps, you can create a custom Semgrep rule with an effective autofix feature that identifies issues and provides a solution to fix them.

### Regular expression-based autofix

The `fix` field presented above allows you to specify a simple string replacement, while the `fix-regex` field enables more complex regular expression-based replacements. For more information, refer to the official documentation on [Autofix with regular expression replacement](https://semgrep.dev/docs/writing-rules/autofix/#autofix-with-regular-expression-replacement).

## Optimizing Semgrep rules

{{< hint info >}}Improve rule performance and minimize false positives through repeatable processes. {{< /hint >}}

Optimizing your Semgrep rules is crucial for maintaining high performance and minimizing false positives. This section explains how to create efficient and accurate Semgrep rules.

1. **Analyze time summary**: To include a time summary with the results, use the `--time` flag. This will provide the following information:
    - Total time / Config time / Core time
    - Semgrep-core time
    - Total CPU time
    - File parse time
    - Rule parse time
    - Matching time
    - Slowest five analyzed files
    - Slowest five rules to match
2.
**Narrow down findings to specific file paths**: Assess whether findings should be limited to specific file paths (e.g., Dockerfiles).
    - You can apply particular rules to certain paths using the `paths` keyword. For example, the [avoid-apt-get-upgrade](https://semgrep.dev/playground/r/generic.dockerfile.best-practice.avoid-apt-get-upgrade.avoid-apt-get-upgrade) rule targets only Dockerfiles:

    ```yaml {linenos=inline,linenostart=17}
    paths:
      include:
        - "*dockerfile*"
        - "*Dockerfile*"
    ```

3. **Use `pattern-inside` and `pattern-not-inside`**: The `pattern-inside` and `pattern-not-inside` clauses allow you to specify a context in which a pattern should or should not be matched, respectively.

    Consider a scenario where you want to identify calls to `insecure_function()` within a loop, followed by a specific statement, such as a call to `log_data()`, but only when the log level is set to `DEBUG`. Initially, you can achieve this by using one `pattern` statement:

    ```yaml {linenos=inline}
    rules:
      - id: insecure_function_in_loop_followed_by_debug_log
        languages: [python]
        message: |
          Insecure function called within a loop followed by log_data() with log level DEBUG
        severity: WARNING
        pattern: |
          for ... in ...:
              ...
              insecure_function(...)
              ...
              log_data("DEBUG", ...)
    ```

    Here's an example of Python code that triggers the `insecure_function_in_loop_followed_by_debug_log` rule:

    ```python {linenos=inline, hl_lines=["11-17"]}
    def insecure_function():
        print("Insecure function called")

    def log_data(log_level, msg):
        if log_level == "DEBUG":
            print("DEBUG:", msg)

    def main():
        data_list = ['data1', 'data2', 'data3']

        for data in data_list:
            # Call to insecure_function() within a loop,
            # followed by log_data() with log level DEBUG triggers the rule
            insecure_function()
            other_function()
            function1337()
            log_data("DEBUG", "Insecure function called with data: " + data)
    ```

    Running the `insecure_function_in_loop_followed_by_debug_log` rule may not provide the clearest output, as it displays the entire `for` loop:

    ```shell
    $ semgrep -f insecure_function_in_loop_followed_by_debug_log.yml
    # (...)
        insecure_function_in_loop_followed_by_debug_log
           Insecure function called within a loop followed by log_data() with log level DEBUG

            11┆ for data in data_list:
            12┆     # Call to insecure_function() within a loop,
            13┆     # followed by log_data() with log level DEBUG triggers the rule
            14┆     insecure_function()
            15┆     other_function()
            16┆     function1337()
            17┆     log_data("DEBUG", "Insecure function called with data: " + data)
    ```

    For such findings, only the calls to `insecure_function()` might be of critical importance. To improve the output, you can use the following clauses instead:

    1. `patterns`: This clause combines the following sub-clauses with a logical AND operator, meaning all of them must match:

        a. `pattern-inside`: This clause matches any `for` loop in the Python code, establishing the context for the subsequent patterns. It sets a condition that must be met for the rule to trigger, acting as the first part of a logical AND operation.

        b. `pattern`: This sub-pattern matches calls to any function followed by a call to `log_data("DEBUG", ...)`. The rule potentially triggers if this `pattern` and the previous `pattern-inside` match.

        c.
`focus-metavariable`: This operator focuses the finding on the line of code matched by `$FUNC`.

        d. `metavariable-pattern`: This sub-pattern restricts `$FUNC` to functions called `insecure_function`.

    Here is a fixed version of the `insecure_function_in_loop_followed_by_debug_log` rule:

    ```yaml {linenos=inline}
    rules:
      - id: insecure_function_in_loop_followed_by_debug_log_fixed
        languages: [python]
        message: |
          Insecure function called within a loop followed by log_data() with log level DEBUG
        severity: WARNING
        patterns:
          - pattern-inside: |
              for ... in ...:
                  ...
          - pattern: |
              $FUNC(...)
              ...
              log_data("DEBUG", ...)
          - focus-metavariable: $FUNC
          - metavariable-pattern:
              metavariable: $FUNC
              pattern: insecure_function
    ```

    Running the `insecure_function_in_loop_followed_by_debug_log_fixed` Semgrep rule will produce a more concise and focused output:

    ```shell
    $ semgrep -f insecure_function_in_loop_followed_by_debug_log_fixed.yml
    # (...)
        insecure_function_in_loop_followed_by_debug_log_fixed
           Insecure function called within a loop followed by log_data() with log level DEBUG

            14┆ insecure_function()
    ```

4. **Minimize the use of ellipses** `...`: While ellipses are a powerful tool for matching a wide range of code snippets, they can lead to performance issues and false positives when overused. Limit the use of ellipses to situations necessary for accurate pattern matching.

5. **Determine the necessity of metavariables**: Before using a metavariable in your rule, determine if it is truly necessary. Metavariables can be useful for capturing and comparing values, but if a metavariable is unnecessary for your rule to function correctly, consider removing it.

    For example, consider the following Semgrep rule that uses a metavariable `$X`:

    ```yaml {linenos=inline}
    rules:
      - id: unnecessary_metavariable_example
        languages: [python]
        message: The variable is assigned the value 123
        pattern: $X = 123
        severity: WARNING
    ```

    This rule matches any variable assignment with the value `123`.
    However, the metavariable `$X` might be unnecessary if you don't need to capture the variable name. In this case, you can use the `...` operator instead, which matches any expression:

    ```yaml {linenos=inline}
    rules:
      - id: without_metavariable_example
        languages: [python]
        message: A variable is assigned the value 123
        pattern: ... = 123
        severity: WARNING
    ```

    By replacing the `$X` metavariable with the `...` operator, you can reduce the complexity and improve the performance of your rule without losing the intended functionality. This approach should be used when the metavariable is not essential for the rule's purpose or subsequent comparisons or checks.

6. **Test your rules with real-world code**: To ensure the effectiveness of your rules, test them with real-world code samples. This lets you identify potential issues and false positives before deploying your rules in a production environment.
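
When reviewing a rule against real-world code, it can help to see how many findings each rule produces. The following is a minimal sketch, assuming the scan was run with Semgrep's `--json` output option and that each entry in the top-level `results` array carries a `check_id` field. The `summarize_findings` helper and the sample report are hypothetical and written by hand for illustration; they are not real scan data.

```python
import json
from collections import Counter


def summarize_findings(semgrep_json: str) -> Counter:
    """Count findings per rule ID in a report shaped like `semgrep --json` output."""
    report = json.loads(semgrep_json)
    # Each finding is assumed to carry the ID of the rule that produced it.
    return Counter(result["check_id"] for result in report.get("results", []))


# Hand-written sample that mimics the assumed report shape (not real output).
SAMPLE_REPORT = json.dumps(
    {
        "results": [
            {"check_id": "unnecessary_metavariable_example", "path": "app/settings.py"},
            {"check_id": "unnecessary_metavariable_example", "path": "app/views.py"},
            {"check_id": "without_metavariable_example", "path": "app/views.py"},
        ],
        "errors": [],
    }
)

if __name__ == "__main__":
    # List the noisiest rules first.
    for rule_id, count in summarize_findings(SAMPLE_REPORT).most_common():
        print(f"{rule_id}: {count}")
```

In practice, the input would be the saved output of a command such as `semgrep -f rule.yaml --json` run against a real project. A rule with an unexpectedly high count is a reasonable first candidate for a false-positive review.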