Semgrep Architecture: Comprehensive Reference

Semgrep (Semantic Grep) is a multi-language static analysis tool that matches code by its structure – not just its text – using a unified Abstract Syntax Tree and a rich pattern language. This document covers the full internal architecture from CLI entry-point to taint sink detection.

Introduction
- What is SAST?
- History of Semgrep
High-Level Architecture
Component Breakdown
The Full Analysis Pipeline
Target Discovery & Filtering
Rule Parsing & Optimization
Parsing & the Universal AST
The Matching Engine
The Intermediate Language (IL) & CFG
Taint Analysis (Dataflow)
Output & Reporting Pipeline
OSemgrep / RPC Architecture
Key Data Structures

-1. Why?

I have been working as a DevSecOps engineer for nearly four years now. When I started, I had little exposure to tooling like SAST, SCA, or DAST. My mindset was firmly rooted in offensive security. Penetration testing was the goal, the dream.

Then I discovered SAST. My first reaction was genuine excitement: “This is incredible, automated code analysis at scale!” But over time, reality set in. In practice, most findings tend to be noisy. False positives are plentiful, and the truly critical vulnerabilities are rarely surfaced by off-the-shelf rules. To be clear: I am not saying these tools are useless, far from it. They are an essential layer of defense in any mature security program. But they do have limits.

That tension, between the promise and the reality of SAST, planted a question in my head that would not go away: why not build my own? The idea kept nagging at me until I finally decided to dig in. And what a rabbit hole it turned out to be.

The natural starting point was Semgrep. Understanding how it works under the hood, the parsing pipeline, the universal AST, the matching engine, the taint analysis, has been one of the most intellectually rewarding deep-dives I have done in a long time. This article is the result of that exploration.

There will be more articles on this topic. I have a lot to say about static analysis, and I am just getting started.


0. Introduction

0.1 What is SAST?

Static Application Security Testing (SAST) is a white-box security technique that analyzes source code, bytecode, or binary code without executing the program. A SAST tool reads the code the same way a compiler or interpreter would, builds an internal model of program structure and data flow, and then applies a library of security rules to that model to identify vulnerabilities.

SAST sits at the earliest possible stage of the Software Development Lifecycle (SDLC). This “shift-left” principle means that many bugs can be caught before they reach a staging or production environment, which substantially reduces the cost and effort of fixing them.

Core properties of a SAST tool:

Property	Description
No execution required	Analyzes source files directly; does not need a running server, database, or network
Language-aware	Understands syntax and semantics: knows that `x = y` is an assignment, not two identifiers
Finds bugs early	Runs in CI/CD pipelines or even pre-commit hooks, giving developers instant feedback
Scalable	Can scan large codebases quickly by analyzing files in parallel
Deterministic	Same code always produces the same result (no flakiness from network or timing)

SAST vs. Other Security Testing Approaches

The five approaches below differ along two axes: whether source code access is needed, and whether the application must be running.

Approach	Code Access	App Running	Typical Integration Point
SAST	Yes	No	Commit / PR / CI build
DAST	No	Yes	Staging environment
IAST	Yes	Yes	QA environment with agent
SCA	Yes (manifests)	No	Dependency install / build
Manual Code Review	Yes	No	Security sprint / audit

Modern security programs use all of these layers together. SAST is the first line of defense because it is the cheapest and fastest to run.

How SAST Works: The Core Technique

Most SAST tools follow the same fundamental pipeline:

SAST Core Pipeline

Semgrep is unique among SAST tools because it exposes this pipeline to end-users directly. Instead of shipping only a fixed rule library, Semgrep lets security engineers write custom rules in YAML using a pattern language that mirrors actual code syntax – no formal language theory required.
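To make the parse → model → rule → report pipeline concrete, here is a minimal sketch built on Python's standard `ast` module. It is not how Semgrep works internally; the rule set `DANGEROUS_CALLS` and the helper names are my own assumptions, chosen only to illustrate the four stages.

```python
import ast

SOURCE = '''
import os

def run(cmd):
    os.system(cmd)      # should be flagged
    print("done")       # should not be flagged
'''

# Hypothetical "rule": flag any call whose dotted name is in this set.
DANGEROUS_CALLS = {"os.system", "eval"}


def dotted_name(node):
    """Return 'a.b.c' for Name/Attribute chains, otherwise None."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        base = dotted_name(node.value)
        return f"{base}.{node.attr}" if base else None
    return None


def scan(source):
    tree = ast.parse(source)                 # 1. parse into a syntax tree
    findings = []
    for node in ast.walk(tree):              # 2. traverse the program model
        if isinstance(node, ast.Call):       # 3. apply the rule to each call
            name = dotted_name(node.func)
            if name in DANGEROUS_CALLS:
                findings.append((node.lineno, name))
    return sorted(findings)                  # 4. report (line, rule hit)


print(scan(SOURCE))  # [(5, 'os.system')]
```

The interesting part is step 3: because the rule runs on the tree rather than on text, a comment or string that merely contains `os.system` would not be reported.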

0.2 History of Semgrep

Semgrep was not built from scratch. Its roots trace back to research done inside Facebook more than fifteen years ago.

The pfff Lineage (2009)

In 2009, Yoann Padioleau – then an engineer at Facebook – created pfff (“PHP Frontend For Fun”). pfff was an OCaml library and toolset that started with PHP and grew to parse multiple programming languages (PHP, C, OCaml, Java) into a common AST representation. The goal was to build code navigation, refactoring, and lightweight static analysis tools that worked uniformly across Facebook’s polyglot codebase.

pfff introduced two foundational ideas that Semgrep still relies on today:

- A universal AST (`AST_generic`) shared across all supported languages.
- sgrep – a prototype “semantic grep” that let engineers write structural code patterns (with wildcards) that matched against the pfff AST instead of raw text.

From sgrep to Semgrep (2013-2019)

The sgrep prototype demonstrated that structural matching was practical and useful for finding bugs at scale inside Facebook. The concept was refined internally over several years but not yet released publicly. The OCaml core continued to evolve alongside pfff.

Open Source Launch (2020)

In 2020, r2c (Return to Corporation, later renamed Semgrep Inc.) spun the technology out and released Semgrep as a free, open-source tool under the LGPL-2.1 license. The release included:

- A Python CLI wrapping the OCaml binary.
- The Semgrep Registry – a community-maintained library of security rules.
- Support for 10+ languages from day one.

This release was a breakthrough: for the first time, any developer or security engineer could write structural code patterns without needing a PhD in program analysis.

Expansion and Commercialization (2021-2022)

Semgrep Inc. began building commercial capabilities on top of the OSS core:

- Semgrep Pro / DeepSemgrep (2021): Added inter-procedural taint analysis and cross-file dataflow – tracking how tainted data flows not just within a function but across module boundaries and entire codebases.
- Semgrep Supply Chain / SCA (2022): Integrated dependency manifest scanning to detect known CVEs in third-party libraries, tying reachability analysis (“is this vulnerable function actually called?”) back to the SAST engine.
- Semgrep Secrets (2022-2023): Extended the rule language to detect hardcoded credentials and API keys, with entropy scoring and AI-powered online validation.

OSemgrep and Modern Semgrep (2023-present)

As the Python CLI became a performance and maintenance bottleneck, Semgrep Inc. began the OSemgrep project: a full reimplementation of the Python CLI in OCaml, merging the CLI and core engine into a single binary. This eliminates subprocess overhead, enables tighter integration with the IDE via RPC and LSP protocols, and simplifies deployment.

Era	Key Events
2009	pfff created at Facebook; multi-language AST framework
2013	sgrep prototype: structural pattern matching on AST
2020	Semgrep OSS released by r2c; Registry launched
2021	DeepSemgrep: inter-procedural taint + cross-file analysis
2022	Supply Chain (SCA) and Secrets detection added
2023	OSemgrep rewrite begins; RPC/LSP IDE integrations
2024	Semgrep Secrets GA; MCP tool support; 30+ languages

1. High-Level Architecture

Semgrep is split into two tiers: a Python CLI that handles user-facing concerns, and an OCaml core engine that performs all heavy lifting - parsing, matching, and dataflow analysis.

Semgrep High-Level Architecture

2. Component Breakdown

2.1 Semgrep CLI (`cli/src/semgrep/`)

Module	Responsibility
`main.py` / `cli.py`	Entry point, subcommand routing (`scan`, `ci`, `test`, `login`)
`config_resolver.py`	Fetches rules from local paths, URLs, or the Semgrep Registry
`target_manager.py`	Walks the file system, applies `.semgrepignore` / `.gitignore` rules
`core_runner.py`	Serialises the scan plan and invokes `semgrep-core`; collects JSON output
`run_scan.py`	Orchestrates the full scan flow (rules → targets → core → output)
`output.py` / `scan_report.py`	Post-processes findings, deduplication, `# nosemgrep` suppression
`join_rule.py`	Evaluates multi-file join rules by joining metavariable bindings
`engine.py`	Detects which engine variant to use (OSS / Pro / DeepSemgrep)
`metrics.py` / `telemetry.py`	Collects and sends anonymous usage metrics
`rpc.py` / `rpc_call.py`	Optional JSON-RPC bridge for IDE/LSP integrations

2.2 Semgrep Core (`src/`)

Directory	Key Files	Responsibility
`src/main`	`Main.ml`	Binary entry point; parses CLI args and dispatches
`src/core_scan`	`Core_scan.ml`, `Parmap_targets.ml`	Parallel scan orchestration over target/rule matrix
`src/targeting`	`Find_targets.ml`, `Semgrepignore.ml`, `Guess_lang.ml`	File discovery, language detection, ignore rules
`src/rule`	`Rule.ml`, `Parse_rule.ml`, `Xpattern.ml`	Rule data model and YAML parsing
`src/parsing`	`Parse_target.ml`, `Pfff_or_tree_sitter.ml`, `Parse_pattern.ml`	Source & pattern parsing → `AST_generic`
`src/ast_generic`	`AST_generic.ml` (in `libs/`)	The universal AST type definitions
`src/naming`	`Naming_AST.ml`	Name/scope resolution on the parsed AST
`src/prefiltering`	`Analyze_rule.ml`, `File.ml`	Fast pre-scan to skip irrelevant files
`src/matching`	`Pattern_vs_code.ml`, `Matching_generic.ml`	Core structural pattern matching
`src/engine`	`Match_rules.ml`, `Match_search_mode.ml`, `Match_taint_spec.ml`	Rule dispatch, search mode, taint spec
`src/il`	`IL.ml`, `CFG.ml`, `Fun_CFG.ml`	Intermediate Language + Control Flow Graph
`src/analyzing`	`AST_to_IL.ml`, `CFG_build.ml`, `Dataflow_core.ml`, `Constant_propagation.ml`	AST→IL lowering, CFG construction, generic dataflow
`src/tainting`	`OSS_dataflow_tainting.ml`, `Taint.ml`, `Taint_shape.ml`, `Taint_lval_env.ml`	Taint propagation and sink detection
`src/fixing`	Autofix application	Applies rule-suggested code fixes
`src/sca`	Dependency analysis	Software Composition Analysis (SCA)

3. The Full Analysis Pipeline

Full Analysis Pipeline: 7 Stages from CLI Entry to Output

4. Target Discovery & Filtering

4.1 File Discovery Flow

Target Discovery and Filtering Flow

4.2 Language Detection (`Guess_lang.ml`)

Detection is multi-layered (in priority order):

1. Explicit `--lang` flag – overrides all heuristics
2. File extension – `.py` → Python, `.ts` → TypeScript, etc.
3. Shebang line (`#!/usr/bin/env python3`)
4. `.semgrepignore` / rule `paths:` selectors
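The first three layers of that priority order can be sketched in a few lines. This is my own simplified illustration, not a port of `Guess_lang.ml`: the lookup tables are tiny and hypothetical, and `guess_language` is a name I made up.

```python
from pathlib import Path

# Hypothetical, heavily abbreviated lookup tables.
EXTENSIONS = {".py": "python", ".ts": "typescript", ".go": "go"}
INTERPRETERS = {"python": "python", "python3": "python",
                "node": "javascript", "bash": "bash"}


def guess_language(path, first_line="", explicit_lang=None):
    """Return a language name, or None when no heuristic applies."""
    if explicit_lang:                          # 1. explicit flag wins
        return explicit_lang
    ext = Path(path).suffix.lower()
    if ext in EXTENSIONS:                      # 2. file extension
        return EXTENSIONS[ext]
    if first_line.startswith("#!"):            # 3. shebang line
        parts = first_line[2:].split()
        if parts:
            prog = Path(parts[0]).name         # "/usr/bin/env" -> "env"
            if prog == "env" and len(parts) > 1:
                prog = parts[1]                # "env python3" -> "python3"
            return INTERPRETERS.get(prog)
    return None


print(guess_language("tool", "#!/usr/bin/env python3"))  # python
```

Checking the cheap signals first (flag, then extension) means the file only has to be opened for extension-less scripts.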

4.3 The `.semgrepignore` System (`Semgrepignore.ml`)

The .semgrepignore layering: built-in defaults < repo .semgrepignore < .gitignore < CLI --exclude/--include flags. Later rules win; ! prefix negates.
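The “later rules win, `!` negates” behaviour is easy to get wrong, so here is a small sketch of it. It uses `fnmatch` globbing, which is looser than real gitignore-style matching, and the layer contents are invented for the example.

```python
from fnmatch import fnmatch


def is_ignored(path, layers):
    """layers: lists of patterns, lowest-precedence layer first."""
    ignored = False
    for patterns in layers:
        for pattern in patterns:
            negated = pattern.startswith("!")
            glob = pattern[1:] if negated else pattern
            if fnmatch(path, glob):
                ignored = not negated      # the last matching pattern decides
    return ignored


LAYERS = [
    ["node_modules/*", "*.min.js"],    # built-in defaults (hypothetical)
    ["tests/*"],                       # repo-level ignore file
    ["!tests/security/*"],             # a later layer re-includes a subtree
]

print(is_ignored("tests/unit/test_a.py", LAYERS))      # True
print(is_ignored("tests/security/test_b.py", LAYERS))  # False
```

Because every pattern is evaluated in order and only the final verdict is kept, a negation in a later layer can re-include files that an earlier layer excluded.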

5. Rule Parsing & Optimization

5.1 Rule YAML Structure

A Semgrep rule contains:

- `id` - unique identifier
- `languages` - target language(s)
- `pattern` / `patterns` / `pattern-either` - the formula
- `pattern-not` / `pattern-not-inside` - negative clauses
- `metavariable-*` - metavariable constraints (regex, type, comparison, pattern)
- `mode` - `search` (default) or `taint`
- `fix` - optional autofix template

5.2 Rule Parsing Pipeline

Rule Parsing Pipeline: YAML to Compiled Pattern AST

5.3 Formula Algebra

Rules can express boolean logic over patterns:

Operator	Meaning
`pattern`	Match this exact pattern
`patterns` (AND)	All sub-patterns must match
`pattern-either` (OR)	Any sub-pattern may match
`pattern-not`	Exclude if this pattern matches
`pattern-inside`	Match must be inside this pattern
`pattern-not-inside`	Match must not be inside this pattern
`focus-metavariable`	Report the position of a specific metavariable
`metavariable-regex`	Metavariable value must match a regex
`metavariable-pattern`	Metavariable value must itself match a sub-pattern
`metavariable-type`	Metavariable type must match (typed languages)
`metavariable-comparison`	Numeric/string comparison on metavariable
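One way to think about these operators is as set operations over matched source ranges. The sketch below is my own simplification of a `patterns` (AND) block, with hypothetical names; the real engine also has to reconcile metavariable bindings, which I ignore here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Range:
    start: int
    end: int

    def inside(self, other):
        return other.start <= self.start and self.end <= other.end


def evaluate_patterns(positive, not_=(), inside=(), not_inside=()):
    """Filter the ranges matched by a positive pattern through the clauses."""
    results = []
    for r in positive:
        if any(r == n for n in not_):                      # pattern-not
            continue
        if inside and not any(r.inside(o) for o in inside):  # pattern-inside
            continue
        if any(r.inside(o) for o in not_inside):           # pattern-not-inside
            continue
        results.append(r)
    return results


def evaluate_either(*alternatives):
    """pattern-either: union of the ranges matched by each alternative."""
    return sorted({r for alt in alternatives for r in alt},
                  key=lambda r: (r.start, r.end))


hits = evaluate_patterns(
    positive=[Range(10, 20), Range(40, 50), Range(70, 80)],
    not_=[Range(40, 50)],
    inside=[Range(0, 60), Range(65, 100)],
    not_inside=[Range(65, 100)],
)
print(hits)  # [Range(start=10, end=20)]
```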

5.4 Prefiltering (`src/prefiltering/`)

Before any target file is parsed, rules are statically analyzed to extract “must-have” literal strings. A bloom-filter-like scan of raw file bytes then lets the engine skip files that provably cannot match any active rule - often eliminating 90%+ of files without touching the parser.
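A rough sketch of the idea, under strong assumptions: I treat every non-metavariable identifier in a pattern as a required literal and use plain substring checks. The real analysis in `src/prefiltering/` is more careful (for instance, it must not break semantic equivalences such as import aliases), and all names below are hypothetical.

```python
import re

METAVAR = re.compile(r"\$(?:\.\.\.)?[A-Z_][A-Z0-9_]*")   # $X, $...ARGS
IDENT = re.compile(r"[A-Za-z_]\w*")


def required_literals(pattern):
    """Identifiers that must appear verbatim for the pattern to match."""
    return set(IDENT.findall(METAVAR.sub(" ", pattern)))


def may_match(file_text, patterns):
    """True if at least one pattern has all of its literals in the file."""
    return any(all(lit in file_text for lit in required_literals(p))
               for p in patterns)


def prefilter(files, rules):
    """files: {path: text}; rules: {rule_id: [patterns]} -> work to do."""
    plan = {}
    for path, text in files.items():
        candidates = [rid for rid, pats in rules.items() if may_match(text, pats)]
        if candidates:
            plan[path] = candidates        # only these files get parsed
    return plan


RULES = {"shell-true": ["subprocess.call($CMD, shell=True)"]}
FILES = {"a.py": "subprocess.call(cmd, shell=True)", "b.py": "print('hi')"}
print(prefilter(FILES, RULES))  # {'a.py': ['shell-true']}
```

The check is deliberately one-sided: it may keep a file that ends up with no findings, but it must never drop a file that could match.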


6. Parsing & the Universal AST

6.1 Why a Universal AST?

Semgrep supports 30+ languages using a single matching engine by normalising every language’s AST into AST_generic - a lingua franca for code structure. Adding a new language means writing only a parser that outputs this type; the matching logic is reused automatically.

6.2 Parser Backends

Parser Backends and Universal AST

6.3 AST Normalisation & Naming

After raw parsing, two passes enrich the generic AST:

1. Normalisation (`Normalize_generic.ml`): Canonicalise syntactic sugar (e.g., `x += 1` → `x = x + 1`).
2. Name Resolution (`Naming_AST.ml`): Resolve variable references to their definition scope. This enables the engine to distinguish `os.path.join` (the stdlib function) from a local `path.join`.

AST Normalisation and Naming pipeline

6.4 Pattern Parsing

Rule patterns go through the same parsers but with a dedicated entry point (`Parse_pattern.ml`). The pattern grammars accept metavariable tokens (`$X`, `$...ARGS`) and ellipses as valid syntax, so they parse into special AST nodes that the engine then interprets during matching.

7. The Matching Engine

7.1 Overall Matching Architecture

Matching Engine: Match_rules dispatches to search and taint engines, producing Range_with_metavars results

7.2 `Pattern_vs_code.ml` - The Core Matcher

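This is one of the largest files in the codebase (~155 KB). It implements a recursive structural comparison between a pattern AST node and a code AST node, written in a monadic style: matcher combinators thread the metavariable environment through each comparison and short-circuit as soon as a sub-match fails.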

Pattern_vs_code.ml: recursive structural comparator binding metavariables and applying semantic equivalences

Key matching rules:

Pattern Construct	Matching Behaviour
`$X` (uppercase metavar)	Binds to any single expression/identifier; subsequent uses enforce equality
`$...ARGS` (ellipsis metavar)	Binds to any sequence of arguments
`...` (ellipsis)	Matches any sequence of statements/args/fields
Literal `"foo"`	Matches the exact string literal
`$_` (anonymous metavar)	Matches any single node without enforcing equality across uses
`<... X ...>` (deep expression)	Searches inside any sub-expression for X
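To show how metavariable binding and ellipsis matching interact, here is a toy matcher over nested tuples. It is only a sketch of the technique: the tuple encoding of AST nodes and the function names are my own, and it ignores the deep-expression operator and all semantic equivalences.

```python
ELLIPSIS = "..."


def is_metavar(p):
    return isinstance(p, str) and p.startswith("$") and p[1:].isupper()


def match(pattern, code, env):
    """Return env extended with new bindings, or None if there is no match."""
    if is_metavar(pattern):
        if pattern in env:                       # reuse enforces equality
            return env if env[pattern] == code else None
        return {**env, pattern: code}            # first use binds
    if isinstance(pattern, tuple) and isinstance(code, tuple):
        return match_seq(list(pattern), list(code), env)
    return env if pattern == code else None      # literals must be equal


def match_seq(pats, codes, env):
    if not pats:
        return env if not codes else None
    if pats[0] == ELLIPSIS:
        for skip in range(len(codes) + 1):       # "..." absorbs 0..n items
            result = match_seq(pats[1:], codes[skip:], env)
            if result is not None:
                return result
        return None
    if not codes:
        return None
    env = match(pats[0], codes[0], env)
    return None if env is None else match_seq(pats[1:], codes[1:], env)


pattern = ("call", "exec", ("args", "$X", "..."))
code = ("call", "exec", ("args", ("name", "cmd"), ("const", 1)))
print(match(pattern, code, {}))  # {'$X': ('name', 'cmd')}
```

Note that a successful match with no bindings returns an empty dict, so callers should test the result with `is not None` rather than for truthiness.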

7.3 Metavariable Constraints (Post-match filters)

After a raw structural match, the engine evaluates metavariable constraint clauses in sequence. All must pass for the finding to be reported.

Metavariable Constraint Filter Chain: all checks must pass for the finding to be reported
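The filter chain can be pictured as a list of predicates over the metavariable bindings. In this sketch the bindings are plain strings and the constraint constructors are hypothetical helpers of mine; I use left-anchored regex matching (`re.match`), which is how I understand `metavariable-regex` to behave.

```python
import re


def metavariable_regex(name, regex):
    compiled = re.compile(regex)

    def check(bindings):
        value = bindings.get(name)
        return value is not None and compiled.match(value) is not None
    return check


def metavariable_comparison(name, predicate):
    def check(bindings):
        try:
            return predicate(int(bindings[name]))
        except (KeyError, ValueError):
            return False                  # unbound or non-numeric: reject
    return check


def passes_all(bindings, constraints):
    """A raw match becomes a finding only if every constraint holds."""
    return all(check(bindings) for check in constraints)


CONSTRAINTS = [
    metavariable_regex("$FUNC", r"(md5|sha1)$"),
    metavariable_comparison("$BITS", lambda n: n < 2048),
]

print(passes_all({"$FUNC": "md5", "$BITS": "1024"}, CONSTRAINTS))     # True
print(passes_all({"$FUNC": "sha256", "$BITS": "1024"}, CONSTRAINTS))  # False
```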

7.4 Matching Visitor (`Matching_visitor.ml`)

The visitor drives the match attempt over every node in the code AST. For each rule formula it:

1. Walks the code AST top-down.
2. Attempts the pattern at each node.
3. Collects `Range_with_metavars` - the code range that matched alongside metavar bindings.
4. Applies boolean formula operators (AND, OR, NOT, INSIDE) to combine raw matches.

8. The Intermediate Language (IL) & CFG

Search-mode matching works directly on the AST. Taint mode requires understanding execution order, so the engine first lowers the AST into an Intermediate Language (IL) and then builds a Control Flow Graph (CFG).

8.1 AST → IL (`AST_to_IL.ml`)

AST to IL Lowering: AST_to_IL.ml flattens the tree into a linearised 3-address instruction stream

The IL simplifies dataflow by making all side effects explicit as discrete instructions, each with a clear left-hand side (assignment target) and right-hand side (expression), eliminating complex nested expressions.
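Here is what that flattening looks like on a single Python assignment, using the standard `ast` module. The instruction format and the `_tmpN` naming are my own invention for illustration; Semgrep's actual IL is defined in `IL.ml` and is considerably richer.

```python
import ast
import itertools


def lower(source):
    """Flatten one assignment into a list of 3-address-style instructions."""
    counter = itertools.count(1)
    instrs = []

    def fresh():
        return f"_tmp{next(counter)}"

    def expr(node):
        # Names and constants are already "atomic" operands.
        if isinstance(node, ast.Name):
            return node.id
        if isinstance(node, ast.Constant):
            return repr(node.value)
        if isinstance(node, ast.BinOp):
            left, right = expr(node.left), expr(node.right)
            tmp = fresh()
            instrs.append(f"{tmp} = {type(node.op).__name__}({left}, {right})")
            return tmp
        if isinstance(node, ast.Call):
            func = expr(node.func)
            args = [expr(a) for a in node.args]
            tmp = fresh()
            instrs.append(f"{tmp} = call {func}({', '.join(args)})")
            return tmp
        raise NotImplementedError(type(node).__name__)

    stmt = ast.parse(source).body[0]
    if not (isinstance(stmt, ast.Assign) and isinstance(stmt.targets[0], ast.Name)):
        raise NotImplementedError("sketch only handles `name = expr`")
    instrs.append(f"{stmt.targets[0].id} = {expr(stmt.value)}")
    return instrs


for line in lower("x = f(a + b) * g(c)"):
    print(line)
```

Every nested sub-expression gets its own temporary, so each instruction has exactly one operation and one assignment target, which is what makes the later dataflow passes simple.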

8.2 CFG Construction (`CFG_build.ml`)

CFG Construction: CFG_build.ml converts linear IL into a Control Flow Graph with basic blocks and edges

8.3 Constant Propagation (`Constant_propagation.ml`)

Before taint analysis, the engine runs sparse constant propagation over the CFG. This resolves:

- String literals assigned to variables (`url = "http://evil.com"`)
- Simple arithmetic on integer constants
- Concatenation of known string constants

This enriches the IL so taint patterns can match the effective value of expressions, not just their syntactic form.

9. Taint Analysis (Dataflow)

9.1 End-to-End Taint Flow

End-to-End Taint Analysis: rule spec is instantiated, CFG is analyzed with a fixpoint loop, taint findings are emitted

9.2 Fixpoint Engine (`Dataflow_core.ml`)

The core dataflow algorithm is a worklist-based forward analysis:

Fixpoint Engine: worklist-driven forward analysis iterating until state stabilizes
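A worklist-based forward analysis fits in about twenty lines. The sketch below is generic textbook dataflow, not a transcription of `Dataflow_core.ml`: states are sets of tainted variable names, the join is set union, and the tiny CFG and transfer function are made up for the example.

```python
from collections import deque


def fixpoint(cfg, transfer):
    """cfg: {node: [successors]}; transfer(node, in_state) -> out_state."""
    preds = {n: [] for n in cfg}
    for node, succs in cfg.items():
        for s in succs:
            preds[s].append(node)

    out = {n: frozenset() for n in cfg}
    worklist = deque(cfg)                       # start with every node
    while worklist:
        node = worklist.popleft()
        # Join: a variable is tainted if it is tainted on any incoming path.
        in_state = frozenset().union(*(out[p] for p in preds[node]))
        new_out = frozenset(transfer(node, in_state))
        if new_out != out[node]:                # state changed: revisit successors
            out[node] = new_out
            worklist.extend(cfg[node])
    return out


# Hypothetical CFG with a loop: entry -> loop -> body -> loop, loop -> exit
CFG = {"entry": ["loop"], "loop": ["body", "exit"], "body": ["loop"], "exit": []}
ASSIGN = {"entry": ("x", "SOURCE"), "loop": ("y", "x"), "body": ("z", "y"), "exit": None}


def transfer(node, tainted):
    if ASSIGN[node] is None:
        return tainted
    lhs, rhs = ASSIGN[node]
    if rhs == "SOURCE" or rhs in tainted:
        return tainted | {lhs}                  # taint flows into lhs
    return tainted - {lhs}                      # lhs overwritten with clean data


result = fixpoint(CFG, transfer)
print(sorted(result["exit"]))  # ['x', 'y', 'z']
```

The loop terminates because the transfer function is monotone and the set of variables is finite, so each node's state can only grow a bounded number of times.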

9.3 Taint Representation (`Taint.ml` / `Taint_shape.ml`)

Taint is not just a boolean - it carries provenance information:

Concept	Representation
Taint label	Which `source` rule produced this taint (enables label-matching)
Taint call stack	The chain of function calls through which taint propagated
Taint shape	Per-field/index taint (`Taint_shape.ml`): `obj.field` can be tainted independently of `obj`
Xtaint	`Clean` / `Tainted(set)` / `Both` - tracks when a value is both clean and tainted (e.g., via branches)

9.4 Taint Propagation Rules

Taint Propagation Rules: x = tainted_y taints x; x.f = tainted_y taints field x.f; call return is tainted if a propagator rule applies; string interpolation of tainted data taints the result; sanitizer calls remove taint.
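These rules can be exercised on a straight-line program. In the sketch below each instruction is a `(lhs, operation, args)` triple; the source, sanitizer, sink, and propagator names are hypothetical, and fields such as `x.f` would simply be tracked as separate names.

```python
SOURCES = {"get_param"}
SANITIZERS = {"escape"}
SINKS = {"execute"}
PROPAGATORS = {"copy", "concat"}     # operations that carry taint to the result


def analyze(instructions):
    """Return the indices of instructions where tainted data reaches a sink."""
    tainted = set()
    findings = []
    for index, (lhs, op, args) in enumerate(instructions):
        arg_tainted = any(a in tainted for a in args)
        if op in SINKS and arg_tainted:
            findings.append(index)
        if op in SOURCES:
            result_tainted = True                 # fresh taint
        elif op in SANITIZERS:
            result_tainted = False                # sanitizer removes taint
        elif op in PROPAGATORS:
            result_tainted = arg_tainted          # taint flows through
        else:
            result_tainted = False                # unknown call: no propagator
        if lhs is not None:
            if result_tainted:
                tainted.add(lhs)
            else:
                tainted.discard(lhs)
    return findings


PROGRAM = [
    ("q", "get_param", []),                 # 0: q is tainted
    ("sql", "concat", ["prefix", "q"]),     # 1: taint flows into sql
    (None, "execute", ["sql"]),             # 2: tainted data reaches the sink
    ("safe", "escape", ["q"]),              # 3: sanitized copy
    ("sql2", "concat", ["prefix", "safe"]), # 4: clean
    (None, "execute", ["sql2"]),            # 5: no finding
]

print(analyze(PROGRAM))  # [2]
```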

9.5 Source / Sink / Sanitizer Matching (`Taint_spec_match.ml`)

The “best match” strategy prevents double-reporting:

Best Match Strategy (Taint_spec_match.ml): When multiple overlapping AST nodes match a source/sink pattern (e.g., foo vs foo.bar), the engine keeps only the longest/most-specific match to avoid double-reporting.
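As a rough approximation of that strategy, the sketch below keeps only ranges that are not contained in a larger matching range. The real selection logic in Semgrep may use additional criteria; `best_matches` and the `(start, end)` tuple encoding are assumptions of mine.

```python
def best_matches(ranges):
    """Drop any (start, end) range that is contained in a different range."""
    unique = set(ranges)

    def contained(r):
        return any(o != r and o[0] <= r[0] and r[1] <= o[1] for o in unique)

    return sorted(r for r in unique if not contained(r))


# `foo` spans (10, 13); `foo.bar` spans (10, 17); an unrelated match at (30, 35)
print(best_matches([(10, 13), (10, 17), (30, 35)]))  # [(10, 17), (30, 35)]
```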

9.6 Intra- vs Inter-procedural Analysis

Mode	Scope	Cross-function	Notes
Intra-procedural (OSS)	Single function	No	Taint must originate from a source matched inside the function; calls into other functions are not followed
Inter-procedural (Pro)	Whole program / cross-file	Yes	Build call graph; compute per-function taint summaries; propagate across call sites

10. Output & Reporting Pipeline

10.1 Core Output → CLI Formatting

Output and Reporting Pipeline: Core JSON is ingested, deduplicated, nosemgrep-filtered, then formatted

10.2 Exit Codes

Code	Meaning
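`0`	Scan completed with no findings, or findings exist but `--error` was not set
`1`	Findings present (when the `--error` flag is used)
`2`	Semgrep failed (fatal scan error)
`3`	Invalid syntax in a scanned file (only with `--strict`)
`4`	Invalid pattern in a rule
`5`	Rule configuration is not valid YAML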

10.3 The `# nosemgrep` Suppression System

# nosemgrep suppression: If a finding is on line N and line N or N-1 contains a `# nosemgrep` comment (optionally followed by a colon and a comma-separated list of rule IDs, e.g. `# nosemgrep: rule-id-1, rule-id-2`), the finding is suppressed and not reported.
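A simplified version of that check, assuming the `nosemgrep: id-1, id-2` form for listing rule IDs. The regex and function name are mine, and the sketch does not verify that the marker actually sits inside a comment.

```python
import re

NOSEM = re.compile(r"nosemgrep(?::\s*(?P<ids>[\w.\-]+(?:\s*,\s*[\w.\-]+)*))?")


def is_suppressed(lines, finding_line, rule_id):
    """lines: file content as a list of strings; finding_line is 1-based."""
    for lineno in (finding_line, finding_line - 1):
        if not 1 <= lineno <= len(lines):
            continue
        found = NOSEM.search(lines[lineno - 1])
        if not found:
            continue
        ids = found.group("ids")
        if ids is None:                       # bare marker: suppress every rule
            return True
        if rule_id in [i.strip() for i in ids.split(",")]:
            return True
    return False


LINES = [
    "import os",
    "# nosemgrep: dangerous-system-call",
    "os.system(cmd)",
    "eval(x)  # nosemgrep",
]

print(is_suppressed(LINES, 3, "dangerous-system-call"))  # True
print(is_suppressed(LINES, 3, "some-other-rule"))        # False
```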

11. OSemgrep / RPC Architecture

11.1 OSemgrep (`src/osemgrep/`)

OSemgrep is the OCaml reimplementation of the Python CLI - a long-term effort to bring the full CLI into the OCaml binary, eliminating the Python dependency for performance-critical or embedded scenarios.

OSemgrep vs pysemgrep: The Python CLI invokes semgrep-core as a subprocess and exchanges JSON over stdin/stdout. OSemgrep (the OCaml reimplementation) calls core directly as a library, eliminating the subprocess boundary and the Python runtime dependency.

11.2 RPC / LSP Bridge (`src/rpc/`, `cli/src/semgrep/rpc.py`)

For IDE integrations, Semgrep exposes a JSON-RPC interface:

RPC / LSP Bridge: The IDE sends LSP events to the language server (rpc.py / lsp_legacy/). The server forwards them as JSON-RPC scan requests to the Core. Findings are returned as LSP diagnostics and shown as inline annotations in the editor.

12. Key Data Structures

12.1 `AST_generic` Node Hierarchy (simplified)

The `any` union covers all node kinds. The most common sub-types:

Kind	Variants
Expr	`Literal`, `Id`, `Call`, `Assign`, `BinOp`, `Lambda`, `Conditional`, …
Stmt	`ExprStmt`, `If`, `For`, `Return`, `Try`, `Block`, …
Type	`TyName`, `TyFun`, `TyArray`, `TyGeneric`, …
Definition	`FuncDef`, `ClassDef`, `VarDef`, …
Directive	`ImportFrom`, `ImportAll`, `Pragma`, …

12.2 `Rule.t` (OCaml Record, simplified)

Field	Type	Description
`id`	`Rule_ID.t`	Unique rule identifier
`languages`	`Language.t list`	Target languages
`formula`	`Formula`	Pattern formula (search mode)
`taint_spec`	`Taint_spec`	Source/sink/sanitizer spec (taint mode)
`severity`	`Error | Warning | Info | Inventory`	Finding severity
`fix`	`string option`	Autofix template
`metadata`	JSON blob	OWASP, CWE, confidence tags
`paths`	selectors	include/exclude path filters
`options`	`Rule_options.t`	Per-rule engine feature toggles

12.3 `Range_with_metavars.t` (Match Result)

Field	Description
`range`	Byte range `(start_pos, end_pos)` in the source file
`mvars`	List of `(mvar_name, AST_node)` bindings
`origin`	`Rule_ID` that produced this match
`taint_trace`	Source-to-sink provenance path (taint mode only)
`fix`	Computed autofix string (if rule has `fix:`)

Appendix A: Parser Coverage

Backend	Languages
Tree-sitter	Go, Java, JavaScript, TypeScript, C, C++, C#, Kotlin, Ruby, Rust, Scala, Swift, HCL, Dockerfile, JSON, YAML, HTML, Bash, …
Pfff	Python, PHP, OCaml (legacy)
Custom / Mixed	Generic (any language via Spacegrep / Aliengrep)

Appendix B: Glossary

Term	Definition
AST_generic	Semgrep’s universal AST - a normalised representation of any language’s code
IL	Intermediate Language - a flattened, 3-address form of a function’s body
CFG	Control Flow Graph - nodes are IL instructions; edges are execution paths
Metavariable	Pattern wildcard (e.g., `$X`) that binds to matched AST nodes
Taint	A label marking data that originated from a dangerous source
Fixpoint	The stable point of a dataflow iteration loop (no more state changes)
Prefilter	Per-file skip-scan based on literal strings extracted from rule patterns
OSemgrep	OCaml re-implementation of the semgrep CLI (replaces pysemgrep)
Pro engine	Semgrep’s commercial engine adding inter-procedural and cross-file analysis
SCA	Software Composition Analysis - scanning package manifests for known CVEs
SARIF	Static Analysis Results Interchange Format - standardised JSON output schema