Semgrep (Semantic Grep) is a multi-language static analysis tool that matches code by its structure – not just its text – using a unified Abstract Syntax Tree and a rich pattern language. This document covers the full internal architecture from CLI entry-point to taint sink detection.
Table of Contents
- Introduction
- High-Level Architecture
- Component Breakdown
- The Full Analysis Pipeline
- Target Discovery & Filtering
- Rule Parsing & Optimization
- Parsing & the Universal AST
- The Matching Engine
- The Intermediate Language (IL) & CFG
- Taint Analysis (Dataflow)
- Output & Reporting Pipeline
- OSemgrep / RPC Architecture
- Key Data Structures
-1. Why?
I have been working as a DevSecOps engineer for nearly four years now. When I started, I had little exposure to tooling like SAST, SCA, or DAST. My mindset was firmly rooted in offensive security. Penetration testing was the goal, the dream.
Then I discovered SAST. My first reaction was genuine excitement: “This is incredible, automated code analysis at scale!” But over time, reality set in. In practice, most findings tend to be noisy. False positives are plentiful, and the truly critical vulnerabilities are rarely surfaced by off-the-shelf rules. To be clear: I am not saying these tools are useless, far from it. They are an essential layer of defense in any mature security program. But they do have limits.
That tension, between the promise and the reality of SAST, planted a question in my head that would not go away: why not build my own? The idea kept nagging at me until I finally decided to dig in. And what a rabbit hole it turned out to be.
The natural starting point was Semgrep. Understanding how it works under the hood, the parsing pipeline, the universal AST, the matching engine, the taint analysis, has been one of the most intellectually rewarding deep-dives I have done in a long time. This article is the result of that exploration.
There will be more articles on this topic. I have a lot to say about static analysis, and I am just getting started.
0. Introduction
0.1 What is SAST?
Static Application Security Testing (SAST) is a white-box security technique that analyzes source code, bytecode, or binary code without executing the program. A SAST tool reads the code the same way a compiler or interpreter would, builds an internal model of program structure and data flow, and then applies a library of security rules to that model to identify vulnerabilities.
SAST sits at the earliest possible stage of the Software Development Lifecycle (SDLC). This “shift-left” principle means that bugs are caught before they can ever reach a staging or production environment – dramatically reducing the cost and effort of fixing them.
Core properties of a SAST tool:
| Property | Description |
|---|---|
| No execution required | Analyzes source files directly; does not need a running server, database, or network |
| Language-aware | Understands syntax and semantics: knows that x = y is an assignment, not two identifiers |
| Finds bugs early | Runs in CI/CD pipelines or even pre-commit hooks, giving developers instant feedback |
| Scalable | Can scan millions of lines of code in seconds with parallel analysis |
| Deterministic | Same code always produces the same result (no flakiness from network or timing) |
SAST vs. Other Security Testing Approaches
The major security testing approaches differ along two axes: whether source code access is needed, and whether the application must be running.
| Approach | Code Access | App Running | Typical Integration Point |
|---|---|---|---|
| SAST | Yes | No | Commit / PR / CI build |
| DAST | No | Yes | Staging environment |
| IAST | Yes | Yes | QA environment with agent |
| SCA | Yes (manifests) | No | Dependency install / build |
| Manual Code Review | Yes | No | Security sprint / audit |
Modern security programs use all of these layers together. SAST is the first line of defense because it is the cheapest and fastest to run.
How SAST Works: The Core Technique
Most SAST tools follow the same fundamental pipeline:

Semgrep stands out among SAST tools because it exposes this pipeline directly to end users. Instead of shipping only a fixed rule library, Semgrep lets security engineers write custom rules in YAML using a pattern language that mirrors actual code syntax – no formal language theory required.
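To make the parse-then-match idea concrete, here is a minimal sketch in Python that contrasts a textual search with a structural one. It uses Python's standard `ast` module as a stand-in for a SAST parser; the `SOURCE` snippet and both function names are made up for this illustration and have nothing to do with Semgrep's internals.

```python
import ast
import re

# Hypothetical input: one real call to eval, one comment, one string literal.
SOURCE = '''
# never call eval(user_input) here
result = eval(user_input)
label = "eval(x) is dangerous"
'''

def text_matches(source):
    """Line numbers where the raw text 'eval(' appears, comments and strings included."""
    return [i for i, line in enumerate(source.splitlines(), start=1)
            if re.search(r"\beval\(", line)]

def structural_matches(source):
    """Line numbers of actual calls to eval, found by walking the syntax tree."""
    tree = ast.parse(source)
    return [node.lineno for node in ast.walk(tree)
            if isinstance(node, ast.Call)
            and isinstance(node.func, ast.Name)
            and node.func.id == "eval"]

print(text_matches(SOURCE))        # [2, 3, 4]
print(structural_matches(SOURCE))  # [3]
```

The textual search flags the comment and the string literal as well as the call; the structural search reports only the call, which is the kind of noise reduction a syntax-aware tool is after.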
0.2 History of Semgrep
Semgrep was not built from scratch. Its roots trace back to research done inside Facebook more than fifteen years ago.
The pfff Lineage (2009)
In 2009, Yoann Padioleau – then an engineer at Facebook – created pfff (“Parsing Framework for Fun”). pfff was an OCaml library and toolset that could parse multiple programming languages (PHP, C, OCaml, Java) into a common AST representation. The goal was to build code navigation, refactoring, and lightweight static analysis tools that worked uniformly across Facebook’s polyglot codebase.
pfff introduced two foundational ideas that Semgrep still relies on today:
- A universal AST (`AST_generic`) shared across all supported languages.
- sgrep – a prototype “semantic grep” that let engineers write structural code patterns (with wildcards) that matched against the pfff AST instead of raw text.
From sgrep to Semgrep (2013-2019)
The sgrep prototype demonstrated that structural matching was practical and useful for finding bugs at scale inside Facebook. The concept was refined internally over several years but not yet released publicly. The OCaml core continued to evolve alongside pfff.
Open Source Launch (2020)
In 2020, r2c (Return to Corporation, later renamed Semgrep Inc.) spun the technology out and released Semgrep as a free, open-source tool under the LGPL-2.1 license. The release included:
- A Python CLI wrapping the OCaml binary.
- The Semgrep Registry – a community-maintained library of security rules.
- Support for 10+ languages from day one.
This release was a breakthrough: for the first time, any developer or security engineer could write structural code patterns without needing a PhD in program analysis.
Expansion and Commercialization (2021-2022)
Semgrep Inc. began building commercial capabilities on top of the OSS core:
- Semgrep Pro / DeepSemgrep (2021): Added inter-procedural taint analysis and cross-file dataflow – tracking how tainted data flows not just within a function but across module boundaries and entire codebases.
- Semgrep Supply Chain / SCA (2022): Integrated dependency manifest scanning to detect known CVEs in third-party libraries, tying reachability analysis (“is this vulnerable function actually called?”) back to the SAST engine.
- Semgrep Secrets (2022-2023): Extended the rule language to detect hardcoded credentials and API keys, with entropy scoring and AI-powered online validation.
OSemgrep and Modern Semgrep (2023-present)
As the Python CLI became a performance and maintenance bottleneck, Semgrep Inc. began the OSemgrep project: a full reimplementation of the Python CLI in OCaml, merging the CLI and core engine into a single binary. This eliminates subprocess overhead, enables tighter integration with the IDE via RPC and LSP protocols, and simplifies deployment.
| Era | Key Events |
|---|---|
| 2009 | pfff created at Facebook; multi-language AST framework |
| 2013 | sgrep prototype: structural pattern matching on AST |
| 2020 | Semgrep OSS released by r2c; Registry launched |
| 2021 | DeepSemgrep: inter-procedural taint + cross-file analysis |
| 2022 | Supply Chain (SCA) and Secrets detection added |
| 2023 | OSemgrep rewrite begins; RPC/LSP IDE integrations |
| 2024 | Semgrep Secrets GA; MCP tool support; 30+ languages |
1. High-Level Architecture
Semgrep is split into two tiers: a Python CLI that handles user-facing concerns, and an OCaml core engine that performs all heavy lifting - parsing, matching, and dataflow analysis.

2. Component Breakdown
2.1 Semgrep CLI (cli/src/semgrep/)
| Module | Responsibility |
|---|---|
| `main.py` / `cli.py` | Entry point, subcommand routing (`scan`, `ci`, `test`, `login`) |
| `config_resolver.py` | Fetches rules from local paths, URLs, or the Semgrep Registry |
| `target_manager.py` | Walks the file system, applies `.semgrepignore` / `.gitignore` rules |
| `core_runner.py` | Serialises the scan plan and invokes `semgrep-core`; collects JSON output |
| `run_scan.py` | Orchestrates the full scan flow (rules → targets → core → output) |
| `output.py` / `scan_report.py` | Post-processes findings, deduplication, `# nosemgrep` suppression |
| `join_rule.py` | Evaluates multi-file join rules by joining metavariable bindings |
| `engine.py` | Detects which engine variant to use (OSS / Pro / DeepSemgrep) |
| `metrics.py` / `telemetry.py` | Collects and sends anonymous usage metrics |
| `rpc.py` / `rpc_call.py` | Optional JSON-RPC bridge for IDE/LSP integrations |
2.2 Semgrep Core (src/)
| Directory | Key Files | Responsibility |
|---|---|---|
| `src/main` | `Main.ml` | Binary entry point; parses CLI args and dispatches |
| `src/core_scan` | `Core_scan.ml`, `Parmap_targets.ml` | Parallel scan orchestration over target/rule matrix |
| `src/targeting` | `Find_targets.ml`, `Semgrepignore.ml`, `Guess_lang.ml` | File discovery, language detection, ignore rules |
| `src/rule` | `Rule.ml`, `Parse_rule.ml`, `Xpattern.ml` | Rule data model and YAML parsing |
| `src/parsing` | `Parse_target.ml`, `Pfff_or_tree_sitter.ml`, `Parse_pattern.ml` | Source & pattern parsing → `AST_generic` |
| `src/ast_generic` | `AST_generic.ml` (in `libs/`) | The universal AST type definitions |
| `src/naming` | `Naming_AST.ml` | Name/scope resolution on the parsed AST |
| `src/prefiltering` | `Analyze_rule.ml`, `File.ml` | Fast pre-scan to skip irrelevant files |
| `src/matching` | `Pattern_vs_code.ml`, `Matching_generic.ml` | Core structural pattern matching |
| `src/engine` | `Match_rules.ml`, `Match_search_mode.ml`, `Match_taint_spec.ml` | Rule dispatch, search mode, taint spec |
| `src/il` | `IL.ml`, `CFG.ml`, `Fun_CFG.ml` | Intermediate Language + Control Flow Graph |
| `src/analyzing` | `AST_to_IL.ml`, `CFG_build.ml`, `Dataflow_core.ml`, `Constant_propagation.ml` | AST→IL lowering, CFG construction, generic dataflow |
| `src/tainting` | `OSS_dataflow_tainting.ml`, `Taint.ml`, `Taint_shape.ml`, `Taint_lval_env.ml` | Taint propagation and sink detection |
| `src/fixing` | Autofix application | Applies rule-suggested code fixes |
| `src/sca` | Dependency analysis | Software Composition Analysis (SCA) |
3. The Full Analysis Pipeline

4. Target Discovery & Filtering
4.1 File Discovery Flow

4.2 Language Detection (Guess_lang.ml)
Detection is multi-layered (in priority order):
- Explicit `--lang` flag – overrides all heuristics
- File extension – `.py` → Python, `.ts` → TypeScript, etc.
- Shebang line (`#!/usr/bin/env python3`)
- `.semgrepignore` / rule `paths:` selectors
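The first three layers can be sketched as a simple priority chain. This is an illustrative Python approximation, not the logic of `Guess_lang.ml`: the function name `guess_lang`, its parameters, and both lookup tables are assumptions made up for the example, and the path-selector layer is left out.

```python
import os

# Hypothetical lookup tables for illustration; a real mapping is far larger.
EXTENSIONS = {".py": "python", ".ts": "typescript", ".js": "javascript"}
SHEBANG_HINTS = {"python": "python", "node": "javascript", "bash": "bash"}

def guess_lang(path, first_line="", explicit_lang=None):
    """Return a language name, or None when no heuristic applies."""
    if explicit_lang:                      # 1. an explicit language flag wins
        return explicit_lang
    ext = os.path.splitext(path)[1].lower()
    if ext in EXTENSIONS:                  # 2. file extension
        return EXTENSIONS[ext]
    if first_line.startswith("#!"):        # 3. shebang line
        for hint, lang in SHEBANG_HINTS.items():
            if hint in first_line:
                return lang
    return None

print(guess_lang("app.py"))                                 # python
print(guess_lang("app.py", explicit_lang="typescript"))     # typescript
print(guess_lang("deploy", first_line="#!/bin/bash"))       # bash
print(guess_lang("README"))                                 # None
```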
4.3 The .semgrepignore System (Semgrepignore.ml)
The `.semgrepignore` layering: built-in defaults < repo `.semgrepignore` < `.gitignore` < CLI `--exclude` / `--include` flags. Later rules win; a `!` prefix negates.
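The "later rules win, `!` negates" behaviour can be approximated in a few lines. The sketch below uses Python's `fnmatch` globbing, which is only a rough stand-in for gitignore-style semantics (for instance, `*` here also crosses `/`); the function name `is_ignored` and the sample layers are hypothetical.

```python
from fnmatch import fnmatch

def is_ignored(path, layers):
    """layers: lists of patterns, lowest priority first. The last matching pattern decides."""
    ignored = False
    for patterns in layers:
        for pattern in patterns:
            negated = pattern.startswith("!")
            glob = pattern[1:] if negated else pattern
            if fnmatch(path, glob):
                ignored = not negated   # a later match overrides any earlier decision
    return ignored

# Illustrative layers only; these are not Semgrep's real defaults.
layers = [
    ["tests/*", "*.min.js"],     # stand-in for built-in defaults
    ["vendor/*"],                # stand-in for a repo-level ignore file
    ["!tests/security_*"],       # a later layer re-includes some test files
]

print(is_ignored("tests/unit_a.py", layers))         # True
print(is_ignored("tests/security_auth.py", layers))  # False
print(is_ignored("src/app.py", layers))              # False
```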
5. Rule Parsing & Optimization
5.1 Rule YAML Structure
A Semgrep rule contains:
- `id` - unique identifier
- `languages` - target language(s)
- `pattern` / `patterns` / `pattern-either` - the formula
- `pattern-not` / `pattern-not-inside` - negative clauses
- `metavariable-*` - metavariable constraints (regex, type, comparison, pattern)
- `mode` - `search` (default) or `taint`
- `fix` - optional autofix template
5.2 Rule Parsing Pipeline

5.3 Formula Algebra
Rules can express boolean logic over patterns:
| Operator | Meaning |
|---|---|
| `pattern` | Match this exact pattern |
| `patterns` (AND) | All sub-patterns must match |
| `pattern-either` (OR) | Any sub-pattern may match |
| `pattern-not` | Exclude if this pattern matches |
| `pattern-inside` | Match must be inside this pattern |
| `pattern-not-inside` | Match must not be inside this pattern |
| `focus-metavariable` | Report the position of a specific metavariable |
| `metavariable-regex` | Metavariable value must match a regex |
| `metavariable-pattern` | Metavariable value must itself match a sub-pattern |
| `metavariable-type` | Metavariable type must match (typed languages) |
| `metavariable-comparison` | Numeric/string comparison on metavariable |
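One way to picture the boolean operators is as set operations over match ranges. The following Python sketch treats every raw match as a `(start, end)` tuple and combines them; the helper names (`within`, `either`, `combine`) and the simplified semantics are assumptions for illustration, not the engine's actual evaluation rules.

```python
def within(inner, outer):
    """True when range `inner` lies inside range `outer`; both are (start, end)."""
    return outer[0] <= inner[0] and inner[1] <= outer[1]

def either(*match_lists):
    """pattern-either (OR): union of all sub-pattern matches."""
    return sorted(set().union(*match_lists))

def combine(positive, not_matches=(), inside=None, not_inside=()):
    """A simplified `patterns` (AND) block: filter positive matches by the other clauses."""
    results = []
    for r in positive:
        if r in not_matches:                                      # pattern-not
            continue
        if inside is not None and not any(within(r, o) for o in inside):
            continue                                              # pattern-inside
        if any(within(r, o) for o in not_inside):                 # pattern-not-inside
            continue
        results.append(r)
    return results

positive = either([(30, 40)], [(10, 20), (50, 60)])
print(combine(positive,
              not_matches=[(30, 40)],
              inside=[(0, 45)],
              not_inside=[(48, 70)]))   # [(10, 20)]
```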
5.4 Prefiltering (src/prefiltering/)
Before any target file is parsed, each rule is statically analyzed to extract the string literals that any match must contain. A fast scan of the raw file bytes (similar in spirit to a bloom filter) then lets the engine skip files that provably cannot match any active rule, often eliminating 90%+ of the corpus without touching the parser.
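A rough Python sketch of the idea follows. It assumes the simplest possible extraction, where every identifier-like token in a pattern that is not a metavariable counts as a mandatory literal; the real analysis in `src/prefiltering/` is more careful, and the function names here are made up.

```python
import re

# Metavariables ($X, $...ARGS) come first in the alternation so they are consumed whole.
TOKEN = re.compile(r"\$(?:\.\.\.)?[A-Z_][A-Z0-9_]*|[A-Za-z_][A-Za-z0-9_]*")

def extract_literals(pattern):
    """Identifier-like tokens of a pattern, with metavariables dropped."""
    return {t for t in TOKEN.findall(pattern) if not t.startswith("$")}

def may_match(pattern, file_text):
    """A file can only match if it contains every mandatory literal."""
    return all(lit in file_text for lit in extract_literals(pattern))

rule_pattern = "subprocess.call($CMD, shell=True)"
print(sorted(extract_literals(rule_pattern)))   # ['True', 'call', 'shell', 'subprocess']
print(may_match(rule_pattern, "import subprocess\nsubprocess.call(cmd, shell=True)\n"))  # True
print(may_match(rule_pattern, "print('hello')\n"))                                       # False
```

The check is deliberately one-sided: a `True` result only means the file has to be parsed and matched properly, while a `False` result is safe to act on.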
6. Parsing & the Universal AST
6.1 Why a Universal AST?
Semgrep supports 30+ languages using a single matching engine by normalising every language’s AST into AST_generic - a lingua franca for code structure. Adding a new language means writing only a parser that outputs this type; the matching logic is reused automatically.
6.2 Parser Backends

6.3 AST Normalisation & Naming
After raw parsing, two passes enrich the generic AST:
- Normalisation (`Normalize_generic.ml`): Canonicalise syntactic sugar (e.g., `x += 1` → `x = x + 1`).
- Name Resolution (`Naming_AST.ml`): Resolve variable references to their definition scope. This enables the engine to distinguish `os.path.join` (the stdlib function) from a local `path.join`.
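The normalisation step can be illustrated on Python's own AST. The sketch below rewrites augmented assignments on plain variables into their long form using the standard `ast` module (Python 3.9+ for `ast.unparse`). It is an analogy for what a desugaring pass does, not a port of `Normalize_generic.ml`; the class and function names are invented for the example.

```python
import ast

class DesugarAugAssign(ast.NodeTransformer):
    """Rewrite `x += e` into `x = x + e` for simple variable targets."""

    def visit_AugAssign(self, node):
        if not isinstance(node.target, ast.Name):
            # Attribute/subscript targets are left alone: expanding them naively
            # could evaluate the target expression twice.
            return node
        new = ast.Assign(
            targets=[ast.Name(id=node.target.id, ctx=ast.Store())],
            value=ast.BinOp(
                left=ast.Name(id=node.target.id, ctx=ast.Load()),
                op=node.op,
                right=node.value,
            ),
        )
        return ast.copy_location(new, node)

def normalise(source):
    """Parse, desugar, and print the code back as text."""
    tree = DesugarAugAssign().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)

print(normalise("x += 1"))           # x = x + 1
print(normalise("total -= n * 2"))   # total = total - n * 2
```

After such a pass, a single pattern written against the long form can match both spellings.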

6.4 Pattern Parsing
Rule patterns go through the same parsers but with a dedicated entry point (Parse_pattern.ml). Metavariable tokens ($X, $...ARGS) are injected as special AST nodes before parsing so the grammar treats them as valid expressions - then the engine interprets them during matching.
7. The Matching Engine
7.1 Overall Matching Architecture

7.2 Pattern_vs_code.ml - The Core Matcher
This is the largest file in the codebase (~155 KB). It implements a recursive structural comparison between a pattern AST node and a code AST node, written in a monadic style: each sub-match threads the current metavariable bindings to the next through a bind operator.

Key matching rules:
| Pattern Construct | Matching Behaviour |
|---|---|
| `$X` (uppercase metavar) | Binds to any single expression/identifier; subsequent uses enforce equality |
| `$...ARGS` (ellipsis metavar) | Binds to any sequence of arguments |
| `...` (ellipsis) | Matches any sequence of statements/args/fields |
| Literal `"foo"` | Matches the exact string literal |
| `_` | Matches any single node without binding |
| `<... X ...>` (deep expression) | Searches inside any sub-expression for X |
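Two of these behaviours, metavariable binding with equality enforcement and the `...` ellipsis, fit in a small toy matcher. The Python sketch below works on nested tuples rather than a real AST, and every name in it (`match`, `match_seq`, the tuple encoding) is an assumption made for illustration. Bindings are threaded from one sub-match to the next, loosely echoing how the real matcher passes its environment along.

```python
def match(pattern, code, env):
    """Return the extended bindings dict if `pattern` matches `code`, else None."""
    if isinstance(pattern, str) and pattern.startswith("$"):
        if pattern in env:                       # repeated metavariable: enforce equality
            return env if env[pattern] == code else None
        return {**env, pattern: code}            # first use: bind it
    if isinstance(pattern, tuple) and isinstance(code, tuple):
        return match_seq(pattern, code, env)
    return env if pattern == code else None      # plain token: must be identical

def match_seq(patterns, codes, env):
    """Match two sequences element by element, with '...' absorbing any run of elements."""
    if not patterns:
        return env if not codes else None
    head, rest = patterns[0], patterns[1:]
    if head == "...":
        for skip in range(len(codes) + 1):       # try absorbing 0, 1, 2, ... elements
            result = match_seq(rest, codes[skip:], env)
            if result is not None:
                return result
        return None
    if not codes:
        return None
    env = match(head, codes[0], env)
    return None if env is None else match_seq(rest, codes[1:], env)

call = ("call", "requests.get", ("id", "u"), ("kw", "timeout", ("num", 5)))
print(match(("call", "requests.get", "$URL", "..."), call, {}))
# {'$URL': ('id', 'u')}
print(match(("eq", "$X", "$X"), ("eq", ("id", "a"), ("id", "b")), {}))
# None
```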
7.3 Metavariable Constraints (Post-match filters)
After a raw structural match, the engine evaluates metavariable constraint clauses in sequence. All must pass for the finding to be reported.
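As a sketch of that filtering step, the Python function below checks a set of bindings against a list of constraints and rejects the match as soon as one fails. The constraint encoding (`kind`, `regex`, `predicate`) is hypothetical, and anchoring the regex at the start of the value is an assumption of this example.

```python
import re

def passes_constraints(bindings, constraints):
    """bindings: metavariable name -> matched source text. Every constraint must hold."""
    for c in constraints:
        value = bindings.get(c["metavariable"])
        if value is None:
            return False
        if c["kind"] == "regex":
            if not re.match(c["regex"], value):     # anchored at the start (assumption)
                return False
        elif c["kind"] == "comparison":
            try:
                number = int(value)
            except ValueError:
                return False                        # non-numeric text cannot satisfy it
            if not c["predicate"](number):
                return False
    return True

constraints = [
    {"metavariable": "$FUNC", "kind": "regex", "regex": r"md5|sha1"},
    {"metavariable": "$BITS", "kind": "comparison", "predicate": lambda n: n < 2048},
]
print(passes_constraints({"$FUNC": "md5", "$BITS": "1024"}, constraints))     # True
print(passes_constraints({"$FUNC": "sha256", "$BITS": "1024"}, constraints))  # False
```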

7.4 Matching Visitor (Matching_visitor.ml)
The visitor drives the match attempt over every node in the code AST. For each rule formula it:
- Walks the code AST top-down.
- Attempts the pattern at each node.
- Collects `Range_with_metavars` - the code range that matched alongside metavar bindings.
- Applies boolean formula operators (`AND`, `OR`, `NOT`, `INSIDE`) to combine raw matches.
8. The Intermediate Language (IL) & CFG
Search-mode matching works directly on the AST. Taint mode requires understanding execution order, so the engine first lowers the AST into an Intermediate Language (IL) and then builds a Control Flow Graph (CFG).
8.1 AST → IL (AST_to_IL.ml)

The IL simplifies dataflow by making all side effects explicit as discrete instructions, each with a clear left-hand side (assignment target) and right-hand side (expression), eliminating complex nested expressions.
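To show what "eliminating nested expressions" looks like in practice, here is a small Python sketch that flattens one assignment into a sequence of simple instructions with temporaries. It uses Python's `ast` module as the input tree and prints instructions as strings; the `_tmpN` naming and the supported node kinds are choices made for this example, not the format of `IL.ml`.

```python
import ast
import itertools

OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*"}

def lower(source):
    """Flatten a single assignment into three-address-style instructions."""
    counter = itertools.count(1)
    out = []

    def emit(node):
        """Return a name or literal holding the node's value, emitting instructions as needed."""
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.Constant):
            return repr(node.value)
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            left, right = emit(node.left), emit(node.right)
            rhs = f"{left} {OPS[type(node.op)]} {right}"
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            args = ", ".join([emit(a) for a in node.args])
            rhs = f"{node.func.id}({args})"
        else:
            raise NotImplementedError(type(node).__name__)
        tmp = f"_tmp{next(counter)}"
        out.append(f"{tmp} = {rhs}")       # one side effect per instruction
        return tmp

    stmt = ast.parse(source).body[0]
    if not (isinstance(stmt, ast.Assign) and isinstance(stmt.targets[0], ast.Name)):
        raise NotImplementedError("only `name = expr` is handled in this sketch")
    result = emit(stmt.value)
    out.append(f"{stmt.targets[0].id} = {result}")
    return out

for line in lower("x = f(g(a) + 1)"):
    print(line)
# _tmp1 = g(a)
# _tmp2 = _tmp1 + 1
# _tmp3 = f(_tmp2)
# x = _tmp3
```

Once every intermediate value has a name, a dataflow analysis only has to reason about one assignment at a time.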
8.2 CFG Construction (CFG_build.ml)

8.3 Constant Propagation (Constant_propagation.ml)
Before taint analysis, the engine runs sparse constant propagation over the CFG. This resolves:
- String literals assigned to variables (`url = "http://evil.com"`)
- Simple arithmetic on integer constants
- Concatenation of known string constants
This enriches the IL so taint patterns can match the effective value of expressions, not just their syntactic form.
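The three cases above can be sketched for straight-line code in Python. This version ignores branches and loops entirely (the real pass runs over the CFG), and the instruction encoding, with `("const", value)` and `("var", name)` operands, is invented for the example.

```python
def propagate_constants(instructions):
    """instructions: list of (target, op, operands). Returns variables with a known constant value."""
    known = {}

    def resolve(operand):
        kind, payload = operand
        if kind == "const":
            return True, payload
        if payload in known:
            return True, known[payload]
        return False, None

    for target, op, operands in instructions:
        resolved = [resolve(o) for o in operands]
        values = [v for ok, v in resolved if ok]
        if len(values) != len(operands):
            known.pop(target, None)            # some input is unknown: target is unknown too
        elif op == "copy":
            known[target] = values[0]
        elif op == "add" and type(values[0]) is type(values[1]):
            known[target] = values[0] + values[1]   # int addition or string concatenation
        else:
            known.pop(target, None)
    return known

program = [
    ("scheme", "copy", [("const", "http://")]),
    ("host", "copy", [("const", "evil.com")]),
    ("url", "add", [("var", "scheme"), ("var", "host")]),
    ("n", "add", [("const", 2), ("const", 3)]),
    ("q", "copy", [("var", "user_input")]),
]
print(propagate_constants(program))
# {'scheme': 'http://', 'host': 'evil.com', 'url': 'http://evil.com', 'n': 5}
```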
9. Taint Analysis (Dataflow)
9.1 End-to-End Taint Flow

9.2 Fixpoint Engine (Dataflow_core.ml)
The core dataflow algorithm is a worklist-based forward analysis:
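A generic worklist loop of this kind can be sketched in Python as follows. Facts are modelled as frozensets joined by union, and the `transfer` function is supplied by the caller; these choices, along with all names in the block, are assumptions for illustration rather than the structure of `Dataflow_core.ml`.

```python
from collections import deque

def forward_fixpoint(successors, transfer):
    """Worklist-based forward dataflow.

    successors: dict mapping every node to its list of successor nodes
    transfer:   function (node, in_facts) -> out_facts, assumed monotone
    """
    predecessors = {n: [] for n in successors}
    for n, succs in successors.items():
        for s in succs:
            predecessors[s].append(n)

    out = {n: frozenset() for n in successors}
    worklist = deque(successors)                  # start with every node
    while worklist:
        node = worklist.popleft()
        in_facts = frozenset().union(*(out[p] for p in predecessors[node]))
        new_out = frozenset(transfer(node, in_facts))
        if new_out != out[node]:                  # state changed: revisit successors
            out[node] = new_out
            worklist.extend(successors[node])
    return out

# A tiny CFG with a loop: entry -> a -> b -> a, and b -> exit.
cfg = {"entry": ["a"], "a": ["b"], "b": ["a", "exit"], "exit": []}

def transfer(node, facts):
    if node == "entry":
        return facts | {"x"}
    if node == "a" and "x" in facts:
        return facts | {"y"}
    if node == "b" and "y" in facts:
        return facts | {"z"}
    return facts

result = forward_fixpoint(cfg, transfer)
print(sorted(result["exit"]))   # ['x', 'y', 'z']
```

The loop terminates because facts only ever grow and the set of possible facts is finite; once no node's output changes, the fixpoint has been reached.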

9.3 Taint Representation (Taint.ml / Taint_shape.ml)
Taint is not just a boolean - it carries provenance information:
| Concept | Representation |
|---|---|
| Taint label | Which source rule produced this taint (enables label-matching) |
| Taint call stack | The chain of function calls through which taint propagated |
| Taint shape | Per-field/index taint (Taint_shape.ml): obj.field can be tainted independently of obj |
| Xtaint | Clean / Tainted(set) / Both - tracks when a value is both clean and tainted (e.g., via branches) |
9.4 Taint Propagation Rules
- `x = tainted_y` taints `x`.
- `x.f = tainted_y` taints field `x.f`.
- A call's return value is tainted if a propagator rule applies.
- String interpolation of tainted data taints the result.
- Sanitizer calls remove taint.
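These rules can be exercised on a toy instruction list. The Python sketch below tracks a set of tainted variable names through straight-line code and reports tainted arguments reaching a sink. The instruction encoding and the function names are hypothetical, and letting unknown calls pass taint from argument to result is an assumption of this sketch rather than a statement about Semgrep's defaults.

```python
def run_taint(instructions, sources, sanitizers, sinks):
    """Straight-line taint tracking. Returns a list of (sink, argument) findings."""
    tainted = set()
    findings = []

    def set_taint(target, is_tainted):
        if is_tainted:
            tainted.add(target)
        else:
            tainted.discard(target)        # overwriting with clean data removes taint

    for instr in instructions:
        if instr[0] == "assign":           # ("assign", target, source_var)
            _, target, src = instr
            set_taint(target, src in tainted)
        elif instr[0] == "call":           # ("call", target_or_None, func, arg_or_None)
            _, target, func, arg = instr
            if func in sinks and arg in tainted:
                findings.append((func, arg))
            if target is not None:
                if func in sources:
                    set_taint(target, True)
                elif func in sanitizers:
                    set_taint(target, False)
                else:
                    set_taint(target, arg in tainted)   # assumption: unknown calls propagate
    return findings

program = [
    ("call", "name", "get_param", None),   # source
    ("assign", "q", "name"),
    ("assign", "obj.f", "q"),              # field taint, keyed by its access path
    ("call", "safe", "escape", "q"),       # sanitizer
    ("call", None, "execute", "obj.f"),    # tainted data reaches the sink
    ("call", None, "execute", "safe"),     # sanitized data does not
]
print(run_taint(program, sources={"get_param"}, sanitizers={"escape"}, sinks={"execute"}))
# [('execute', 'obj.f')]
```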
9.5 Source / Sink / Sanitizer Matching (Taint_spec_match.ml)
The “best match” strategy (`Taint_spec_match.ml`) prevents double-reporting: when multiple overlapping AST nodes match a source/sink pattern (e.g., `foo` vs `foo.bar`), the engine keeps only the longest/most-specific match.
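One simple way to implement a "keep the longest" policy is to drop every range that is contained in another. The Python sketch below does this on `(start, end)` tuples; it is an illustration of the idea under that assumption, not the algorithm in `Taint_spec_match.ml`.

```python
def best_matches(ranges):
    """Keep only ranges that are not contained in another (or equal to an already kept) range."""
    kept = []
    # Sort by start ascending, then end descending, so containers come before their contents.
    for r in sorted(ranges, key=lambda r: (r[0], -r[1])):
        if not any(k[0] <= r[0] and r[1] <= k[1] for k in kept):
            kept.append(r)
    return kept

# `foo` spans (0, 3) and `foo.bar` spans (0, 7): only the longer one survives.
print(best_matches([(0, 3), (0, 7), (10, 12)]))   # [(0, 7), (10, 12)]
```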
9.6 Intra- vs Inter-procedural Analysis
| Mode | Scope | Cross-function | Notes |
|---|---|---|---|
| Intra-procedural (OSS) | Single function | No | Function params assumed possibly tainted |
| Inter-procedural (Pro) | Whole program / cross-file | Yes | Build call graph; compute per-function taint summaries; propagate across call sites |
10. Output & Reporting Pipeline
10.1 Core Output → CLI Formatting

10.2 Exit Codes
| Code | Meaning |
|---|---|
| `0` | No findings (or findings present but `--error` not set) |
| `1` | Findings present (when `--error` flag used) |
| `2` | Scan error (parse error, rule error) |
| `3` | Invalid arguments |
| `123` | Could not connect to Semgrep Registry |
10.3 The # nosemgrep Suppression System
`# nosemgrep` suppression: if a finding is on line N and line N or N-1 contains `# nosemgrep` (optionally followed by a colon and a comma-separated list of rule IDs), the finding is suppressed and not reported.
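The line-N / line-N-1 check is easy to sketch in Python. The version below assumes the comment form `# nosemgrep: rule-id-1, rule-id-2` for targeted suppression and a bare `# nosemgrep` for blanket suppression; the regex and the function name `is_suppressed` are written for this example and are not taken from the CLI source.

```python
import re

# "nosemgrep", optionally followed by ": id1, id2, ..."
NOSEM = re.compile(r"nosemgrep(?::\s*(?P<ids>[\w.\-]+(?:\s*,\s*[\w.\-]+)*))?")

def is_suppressed(lines, finding_line, rule_id):
    """lines: file contents as a list of strings; finding_line: 1-based line of the finding."""
    for n in (finding_line, finding_line - 1):
        if 1 <= n <= len(lines):
            m = NOSEM.search(lines[n - 1])
            if m:
                ids = m.group("ids")
                if ids is None or rule_id in re.split(r"\s*,\s*", ids):
                    return True
    return False

lines = [
    "import os",
    "# nosemgrep: rules.no-system",
    "os.system(cmd)",
    "os.system(other)  # nosemgrep",
]
print(is_suppressed(lines, 3, "rules.no-system"))  # True  (comment on the previous line)
print(is_suppressed(lines, 3, "rules.other"))      # False (a different rule is listed)
print(is_suppressed(lines, 4, "rules.other"))      # True  (bare comment suppresses all rules)
```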
11. OSemgrep / RPC Architecture
11.1 OSemgrep (src/osemgrep/)
OSemgrep is the OCaml reimplementation of the Python CLI - a long-term effort to bring the full CLI into the OCaml binary, eliminating the Python dependency for performance-critical or embedded scenarios.
OSemgrep vs pysemgrep: The Python CLI invokes `semgrep-core` as a subprocess and exchanges JSON over stdin/stdout. OSemgrep (the OCaml reimplementation) calls core directly as a library, eliminating the subprocess boundary and the Python runtime dependency.
11.2 RPC / LSP Bridge (src/rpc/, cli/src/semgrep/rpc.py)
For IDE integrations, Semgrep exposes a JSON-RPC interface:
The IDE sends LSP events to the language server (`rpc.py` / `lsp_legacy/`). The server forwards them as JSON-RPC scan requests to the Core. Findings are returned as LSP diagnostics and shown as inline annotations in the editor.
12. Key Data Structures
12.1 AST_generic Node Hierarchy (simplified)
The any union covers all node kinds. The most common sub-types:
| Kind | Variants |
|---|---|
| Expr | Literal, Id, Call, Assign, BinOp, Lambda, Conditional, … |
| Stmt | ExprStmt, If, For, Return, Try, Block, … |
| Type | TyName, TyFun, TyArray, TyGeneric, … |
| Definition | FuncDef, ClassDef, VarDef, … |
| Directive | ImportFrom, ImportAll, Pragma, … |
12.2 Rule.t (OCaml Record, simplified)
| Field | Type | Description |
|---|---|---|
| `id` | `Rule_ID.t` | Unique rule identifier |
| `languages` | `Language.t list` | Target languages |
| `formula` | `Formula` | Pattern formula (search mode) |
| `taint_spec` | `Taint_spec` | Source/sink/sanitizer spec (taint mode) |
| `severity` | `Error \| Warning \| Info \| Inventory` | Finding severity |
| `fix` | `string option` | Autofix template |
| `metadata` | JSON blob | OWASP, CWE, confidence tags |
| `paths` | selectors | include/exclude path filters |
| `options` | `Rule_options.t` | Per-rule engine feature toggles |
12.3 Range_with_metavars.t (Match Result)
| Field | Description |
|---|---|
| `range` | Byte range (`start_pos`, `end_pos`) in the source file |
| `mvars` | List of (`mvar_name`, `AST_node`) bindings |
| `origin` | `Rule_ID` that produced this match |
| `taint_trace` | Source-to-sink provenance path (taint mode only) |
| `fix` | Computed autofix string (if rule has `fix:`) |
Appendix A: Parser Coverage
| Backend | Languages |
|---|---|
| Tree-sitter | Go, Java, JavaScript, TypeScript, C, C++, C#, Kotlin, Ruby, Rust, Scala, Swift, HCL, Dockerfile, JSON, YAML, HTML, Bash, … |
| Pfff | Python, PHP, OCaml (legacy) |
| Custom / Mixed | Generic (any language via Spacegrep / Aliengrep) |
Appendix B: Glossary
| Term | Definition |
|---|---|
| AST_generic | Semgrep’s universal AST - a normalised representation of any language’s code |
| IL | Intermediate Language - a flattened, 3-address form of a function’s body |
| CFG | Control Flow Graph - nodes are IL instructions; edges are execution paths |
| Metavariable | Pattern wildcard (e.g., $X) that binds to matched AST nodes |
| Taint | A label marking data that originated from a dangerous source |
| Fixpoint | The stable point of a dataflow iteration loop (no more state changes) |
| Prefilter | Per-file skip-scan based on literal strings extracted from rule patterns |
| OSemgrep | OCaml re-implementation of the semgrep CLI (replaces pysemgrep) |
| Pro engine | Semgrep’s commercial engine adding inter-procedural and cross-file analysis |
| SCA | Software Composition Analysis - scanning package manifests for known CVEs |
| SARIF | Static Analysis Results Interchange Format - standardised JSON output schema |