Semgrep Architecture: Comprehensive Reference


Semgrep (Semantic Grep) is a multi-language static analysis tool that matches code by its structure – not just its text – using a unified Abstract Syntax Tree and a rich pattern language. This document covers the full internal architecture from CLI entry-point to taint sink detection.

Table of Contents

  0. Introduction
  1. High-Level Architecture
  2. Component Breakdown
  3. The Full Analysis Pipeline
  4. Target Discovery & Filtering
  5. Rule Parsing & Optimization
  6. Parsing & the Universal AST
  7. The Matching Engine
  8. The Intermediate Language (IL) & CFG
  9. Taint Analysis (Dataflow)
  10. Output & Reporting Pipeline
  11. OSemgrep / RPC Architecture
  12. Key Data Structures

-1. Why?

I have been working as a DevSecOps engineer for nearly four years now. When I started, I had little exposure to tooling like SAST, SCA, or DAST. My mindset was firmly rooted in offensive security. Penetration testing was the goal, the dream.

Then I discovered SAST. My first reaction was genuine excitement: “This is incredible, automated code analysis at scale!” But over time, reality set in. In practice, most findings tend to be noisy. False positives are plentiful, and the truly critical vulnerabilities are rarely surfaced by off-the-shelf rules. To be clear: I am not saying these tools are useless, far from it. They are an essential layer of defense in any mature security program. But they do have limits.

That tension, between the promise and the reality of SAST, planted a question in my head that would not go away: why not build my own? The idea kept nagging at me until I finally decided to dig in. And what a rabbit hole it turned out to be.

The natural starting point was Semgrep. Understanding how it works under the hood, the parsing pipeline, the universal AST, the matching engine, the taint analysis, has been one of the most intellectually rewarding deep-dives I have done in a long time. This article is the result of that exploration.

There will be more articles on this topic. I have a lot to say about static analysis, and I am just getting started.


0. Introduction

0.1 What is SAST?

Static Application Security Testing (SAST) is a white-box security technique that analyzes source code, bytecode, or binary code without executing the program. A SAST tool reads the code the same way a compiler or interpreter would, builds an internal model of program structure and data flow, and then applies a library of security rules to that model to identify vulnerabilities.

SAST sits at the earliest possible stage of the Software Development Lifecycle (SDLC). This “shift-left” principle means that bugs are caught before they can ever reach a staging or production environment – dramatically reducing the cost and effort of fixing them.

Core properties of a SAST tool:

Property | Description
No execution required | Analyzes source files directly; does not need a running server, database, or network
Language-aware | Understands syntax and semantics: knows that x = y is an assignment, not two identifiers
Finds bugs early | Runs in CI/CD pipelines or even pre-commit hooks, giving developers instant feedback
Scalable | Can scan millions of lines of code in seconds with parallel analysis
Deterministic | Same code always produces the same result (no flakiness from network or timing)

SAST vs. Other Security Testing Approaches

The major security testing approaches differ along two axes: whether source code access is needed, and whether the application must be running.

Approach | Code Access | App Running | Typical Integration Point
SAST | Yes | No | Commit / PR / CI build
DAST | No | Yes | Staging environment
IAST | Yes | Yes | QA environment with agent
SCA | Yes (manifests) | No | Dependency install / build
Manual Code Review | Yes | No | Security sprint / audit

Modern security programs use all of these layers together. SAST is the first line of defense because it is the cheapest and fastest to run.

How SAST Works: The Core Technique

Most SAST tools follow the same fundamental pipeline:

[Figure: SAST Core Pipeline]

Semgrep stands out among SAST tools because it exposes this pipeline to end users directly. Instead of shipping only a fixed rule library, Semgrep lets security engineers write custom rules in YAML using a pattern language that mirrors actual code syntax – no formal language theory required.
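As a rough illustration of the parse, model, rule, report sequence described above, here is a toy scanner built on Python's standard ast module. It is not how Semgrep is implemented; the scan function and the banned-call rule are made up for this sketch.

```python
import ast

def scan(source: str, banned=("eval", "exec")):
    """Toy SAST pass: report calls to banned builtins (hypothetical rule)."""
    tree = ast.parse(source)                       # 1. parse text into an AST
    findings = []
    for node in ast.walk(tree):                    # 2. traverse the program model
        if (isinstance(node, ast.Call)             # 3. apply the rule structurally
                and isinstance(node.func, ast.Name)
                and node.func.id in banned):
            findings.append((node.lineno, node.func.id))  # 4. record a finding
    return sorted(findings)

code = "x = input()\nprint('eval is fine in a string')\neval(x)\n"
print(scan(code))
```

The word eval inside the string literal on line 2 is not reported, because the rule looks at call nodes rather than raw text. That is the difference between structural matching and plain grep.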

0.2 History of Semgrep

Semgrep was not built from scratch. Its roots trace back to research done inside Facebook more than fifteen years ago.

The pfff Lineage (2009)

In 2009, Yoann Padioleau – then an engineer at Facebook – created pfff (“PHP Frontend For Fun”). pfff was an OCaml library and toolset that grew to parse multiple programming languages (PHP, C, OCaml, Java) into a common AST representation. The goal was to build code navigation, refactoring, and lightweight static analysis tools that worked uniformly across Facebook’s polyglot codebase.

pfff introduced two foundational ideas that Semgrep still relies on today:

  1. A universal AST (AST_generic) shared across all supported languages.
  2. sgrep – a prototype “semantic grep” that let engineers write structural code patterns (with wildcards) that matched against the pfff AST instead of raw text.

From sgrep to Semgrep (2013-2019)

The sgrep prototype demonstrated that structural matching was practical and useful for finding bugs at scale inside Facebook. The concept was refined internally over several years but not yet released publicly. The OCaml core continued to evolve alongside pfff.

Open Source Launch (2020)

In 2020, r2c (Return to Corporation, later renamed Semgrep Inc.) spun the technology out and released Semgrep as a free, open-source tool under the LGPL-2.1 license. The release included:

  • A Python CLI wrapping the OCaml binary.
  • The Semgrep Registry – a community-maintained library of security rules.
  • Support for 10+ languages from day one.

This release was a breakthrough: for the first time, any developer or security engineer could write structural code patterns without needing a PhD in program analysis.

Expansion and Commercialization (2021-2022)

Semgrep Inc. began building commercial capabilities on top of the OSS core:

  • Semgrep Pro / DeepSemgrep (2021): Added inter-procedural taint analysis and cross-file dataflow – tracking how tainted data flows not just within a function but across module boundaries and entire codebases.
  • Semgrep Supply Chain / SCA (2022): Integrated dependency manifest scanning to detect known CVEs in third-party libraries, tying reachability analysis (“is this vulnerable function actually called?”) back to the SAST engine.
  • Semgrep Secrets (2022-2023): Extended the rule language to detect hardcoded credentials and API keys, with entropy scoring and AI-powered online validation.

OSemgrep and Modern Semgrep (2023-present)

As the Python CLI became a performance and maintenance bottleneck, Semgrep Inc. began the OSemgrep project: a full reimplementation of the Python CLI in OCaml, merging the CLI and core engine into a single binary. This eliminates subprocess overhead, enables tighter integration with the IDE via RPC and LSP protocols, and simplifies deployment.

Era | Key Events
2009 | pfff created at Facebook; multi-language AST framework
2013 | sgrep prototype: structural pattern matching on AST
2020 | Semgrep OSS released by r2c; Registry launched
2021 | DeepSemgrep: inter-procedural taint + cross-file analysis
2022 | Supply Chain (SCA) and Secrets detection added
2023 | OSemgrep rewrite begins; RPC/LSP IDE integrations
2024 | Semgrep Secrets GA; MCP tool support; 30+ languages

1. High-Level Architecture

Semgrep is split into two tiers: a Python CLI that handles user-facing concerns, and an OCaml core engine that performs all heavy lifting - parsing, matching, and dataflow analysis.

[Figure: Semgrep High-Level Architecture]


2. Component Breakdown

2.1 Semgrep CLI (cli/src/semgrep/)

Module | Responsibility
main.py / cli.py | Entry point, subcommand routing (scan, ci, test, login)
config_resolver.py | Fetches rules from local paths, URLs, or the Semgrep Registry
target_manager.py | Walks the file system, applies .semgrepignore / .gitignore rules
core_runner.py | Serialises the scan plan and invokes semgrep-core; collects JSON output
run_scan.py | Orchestrates the full scan flow (rules → targets → core → output)
output.py / scan_report.py | Post-processes findings, deduplication, # nosemgrep suppression
join_rule.py | Evaluates multi-file join rules by joining metavariable bindings
engine.py | Detects which engine variant to use (OSS / Pro / DeepSemgrep)
metrics.py / telemetry.py | Collects and sends anonymous usage metrics
rpc.py / rpc_call.py | Optional JSON-RPC bridge for IDE/LSP integrations

2.2 Semgrep Core (src/)

Directory | Key Files | Responsibility
src/main | Main.ml | Binary entry point; parses CLI args and dispatches
src/core_scan | Core_scan.ml, Parmap_targets.ml | Parallel scan orchestration over target/rule matrix
src/targeting | Find_targets.ml, Semgrepignore.ml, Guess_lang.ml | File discovery, language detection, ignore rules
src/rule | Rule.ml, Parse_rule.ml, Xpattern.ml | Rule data model and YAML parsing
src/parsing | Parse_target.ml, Pfff_or_tree_sitter.ml, Parse_pattern.ml | Source & pattern parsing → AST_generic
src/ast_generic | AST_generic.ml (in libs/) | The universal AST type definitions
src/naming | Naming_AST.ml | Name/scope resolution on the parsed AST
src/prefiltering | Analyze_rule.ml, File.ml | Fast pre-scan to skip irrelevant files
src/matching | Pattern_vs_code.ml, Matching_generic.ml | Core structural pattern matching
src/engine | Match_rules.ml, Match_search_mode.ml, Match_taint_spec.ml | Rule dispatch, search mode, taint spec
src/il | IL.ml, CFG.ml, Fun_CFG.ml | Intermediate Language + Control Flow Graph
src/analyzing | AST_to_IL.ml, CFG_build.ml, Dataflow_core.ml, Constant_propagation.ml | AST→IL lowering, CFG construction, generic dataflow
src/tainting | OSS_dataflow_tainting.ml, Taint.ml, Taint_shape.ml, Taint_lval_env.ml | Taint propagation and sink detection
src/fixing | Autofix application | Applies rule-suggested code fixes
src/sca | Dependency analysis | Software Composition Analysis (SCA)

3. The Full Analysis Pipeline

[Figure: Full Analysis Pipeline: 7 Stages from CLI Entry to Output]


4. Target Discovery & Filtering

4.1 File Discovery Flow

[Figure: Target Discovery and Filtering Flow]

4.2 Language Detection (Guess_lang.ml)

Detection is multi-layered (in priority order):

  1. Explicit --lang flag – overrides all heuristics
  2. File extension: .py → Python, .ts → TypeScript, etc.
  3. Shebang line (#!/usr/bin/env python3)
  4. .semgrepignore / rule paths: selectors
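The first three layers of that priority order can be sketched as a simple fall-through function. This is only an illustration, not the logic of Guess_lang.ml: the guess_lang name, the explicit parameter (standing in for the --lang flag), and the two tiny lookup tables are assumptions made for the example.

```python
from pathlib import Path
from typing import Optional

# Hypothetical, deliberately tiny lookup tables; real lists are much longer.
EXTENSIONS = {".py": "python", ".ts": "typescript", ".go": "go"}
INTERPRETERS = {"python": "python", "python3": "python", "node": "javascript", "bash": "bash"}

def guess_lang(path: str, first_line: str = "", explicit: Optional[str] = None) -> Optional[str]:
    if explicit:                                  # 1. an explicit language wins
        return explicit
    ext = Path(path).suffix.lower()
    if ext in EXTENSIONS:                         # 2. then the file extension
        return EXTENSIONS[ext]
    if first_line.startswith("#!"):               # 3. then the shebang line
        parts = first_line[2:].split()
        if parts:
            # "#!/usr/bin/env python3" -> "python3"; "#!/bin/bash" -> "/bin/bash"
            uses_env = Path(parts[0]).name == "env" and len(parts) > 1
            prog = parts[1] if uses_env else parts[0]
            return INTERPRETERS.get(Path(prog).name)
    return None                                   # unknown: caller decides what to do

print(guess_lang("deploy", "#!/usr/bin/env python3"))
```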

4.3 The .semgrepignore System (Semgrepignore.ml)

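Ignore patterns are layered, from lowest to highest priority: built-in defaults, the repo .semgrepignore, .gitignore, then the CLI --exclude/--include flags. Later rules win, and a ! prefix negates a pattern (re-including a path that an earlier rule excluded).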
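The "later rules win, ! negates" behaviour can be sketched with the standard fnmatch module. This is a simplification: real gitignore-style matching has more rules (anchoring, directory-only patterns, and so on), and the is_ignored helper and the example layers are hypothetical.

```python
from fnmatch import fnmatch

def is_ignored(path: str, layers: list[list[str]]) -> bool:
    """Apply layers lowest priority first; the last matching pattern decides."""
    ignored = False
    for patterns in layers:
        for pat in patterns:
            negated = pat.startswith("!")
            glob = pat[1:] if negated else pat
            if fnmatch(path, glob):
                ignored = not negated      # "!" re-includes a previously ignored path
    return ignored

layers = [
    ["node_modules/*", "*.min.js"],        # hypothetical built-in defaults
    ["tests/*"],                           # hypothetical repo-level ignore file
    ["!tests/security/*"],                 # hypothetical later layer that re-includes a folder
]
print(is_ignored("tests/unit/test_a.py", layers))
print(is_ignored("tests/security/test_auth.py", layers))
```

Because every pattern is evaluated in order and the last match wins, a higher-priority layer can both add exclusions and undo earlier ones.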


5. Rule Parsing & Optimization

5.1 Rule YAML Structure

A Semgrep rule contains:

  • id - unique identifier
  • languages - target language(s)
  • pattern / patterns / pattern-either - the formula
  • pattern-not / pattern-not-inside - negative clauses
  • metavariable-* - metavariable constraints (regex, type, comparison, pattern)
  • mode - search (default) or taint
  • fix - optional autofix template

5.2 Rule Parsing Pipeline

[Figure: Rule Parsing Pipeline: YAML to Compiled Pattern AST]

5.3 Formula Algebra

Rules can express boolean logic over patterns:

Operator | Meaning
pattern | Match this exact pattern
patterns (AND) | All sub-patterns must match
pattern-either (OR) | Any sub-pattern may match
pattern-not | Exclude if this pattern matches
pattern-inside | Match must be inside this pattern
pattern-not-inside | Match must not be inside this pattern
focus-metavariable | Report the position of a specific metavariable
metavariable-regex | Metavariable value must match a regex
metavariable-pattern | Metavariable value must itself match a sub-pattern
metavariable-type | Metavariable type must match (typed languages)
metavariable-comparison | Numeric/string comparison on metavariable
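One way to picture the boolean operators is as set operations over matched code ranges. The sketch below assumes each sub-pattern has already produced a set of (start, end) offsets and ignores metavariable unification entirely, so it is a mental model rather than the engine's algorithm; all function names are invented for the example.

```python
Range = tuple[int, int]

def inside(r: Range, outer: Range) -> bool:
    return outer[0] <= r[0] and r[1] <= outer[1]

def either(*sets: set[Range]) -> set[Range]:
    """pattern-either: union of the sub-pattern matches."""
    return set().union(*sets)

def conjunction(positives, negatives=(), insides=(), not_insides=()) -> set[Range]:
    """patterns: intersect positive matches, then apply the filtering clauses."""
    result = set.intersection(*positives)            # every positive pattern must match
    for neg in negatives:                            # pattern-not
        result -= neg
    for ctx in insides:                              # pattern-inside
        result = {r for r in result if any(inside(r, o) for o in ctx)}
    for ctx in not_insides:                          # pattern-not-inside
        result = {r for r in result if not any(inside(r, o) for o in ctx)}
    return result

calls_exec = {(10, 25), (40, 60)}   # hypothetical ranges matching exec(...)
constant_arg = {(40, 60)}           # hypothetical ranges matching exec("...")
test_funcs = {(35, 80)}             # hypothetical ranges of test functions
print(conjunction([calls_exec], negatives=[constant_arg]))
print(conjunction([calls_exec], insides=[test_funcs]))
```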

5.4 Prefiltering (src/prefiltering/)

Before any parsing occurs, rules are statically analyzed to extract “must-have” literal strings. A bloom-filter-like scan of raw file bytes allows the engine to skip files that provably cannot match any active rule - often eliminating 90%+ of files without touching the parser.
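A minimal sketch of the idea, assuming the "must-have" literals are simply the identifiers in a pattern that are not metavariables. The real analysis in src/prefiltering is more careful than this; required_literals and may_match are hypothetical helpers.

```python
import re

# Metavariables ($X, $...ARGS) are matched first so they are never mistaken for literals.
TOKEN = re.compile(r"\$(?:\.\.\.)?\w+|[A-Za-z_]\w*")

def required_literals(pattern: str) -> set[str]:
    """Identifiers that any matching file must contain (crude approximation)."""
    return {tok for tok in TOKEN.findall(pattern) if not tok.startswith("$")}

def may_match(file_text: str, literals: set[str]) -> bool:
    """Cheap substring test on raw text; False means the parser can be skipped."""
    return all(lit in file_text for lit in literals)

needed = required_literals("hashlib.md5($X, $...ARGS)")
print(sorted(needed))
print(may_match("import hashlib\nh = hashlib.md5(data)\n", needed))
print(may_match("print('hello')\n", needed))
```

The check is deliberately one-sided: a True result only means the file has to be parsed and matched properly, while a False result is a safe skip.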



6. Parsing & the Universal AST

6.1 Why a Universal AST?

Semgrep supports 30+ languages using a single matching engine by normalising every language’s AST into AST_generic - a lingua franca for code structure. Adding a new language means writing only a parser that outputs this type; the matching logic is reused automatically.

6.2 Parser Backends

[Figure: Parser Backends and Universal AST]

6.3 AST Normalisation & Naming

After raw parsing, two passes enrich the generic AST:

  1. Normalisation (Normalize_generic.ml): Canonicalise syntactic sugar (e.g., x += 1 → x = x + 1).
  2. Name Resolution (Naming_AST.ml): Resolve variable references to their definition scope. This enables the engine to distinguish os.path.join (the stdlib function) from a local path.join.
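The normalisation step can be illustrated on Python's own AST with the standard ast module (ast.unparse needs Python 3.9 or later). This is an analogy for the x += 1 example above, not Semgrep's Normalize_generic.ml; the ExpandAugAssign class is made up for the sketch.

```python
import ast
import copy

class ExpandAugAssign(ast.NodeTransformer):
    """Rewrite `target op= value` into `target = target op value`."""

    def visit_AugAssign(self, node: ast.AugAssign) -> ast.AST:
        read_target = copy.deepcopy(node.target)
        read_target.ctx = ast.Load()               # right-hand side reads the target
        new = ast.Assign(
            targets=[node.target],
            value=ast.BinOp(left=read_target, op=node.op, right=node.value),
        )
        return ast.copy_location(new, node)

def normalise(source: str) -> str:
    tree = ExpandAugAssign().visit(ast.parse(source))
    ast.fix_missing_locations(tree)
    return ast.unparse(tree)

print(normalise("x += 1"))
```

After this pass a single pattern written as x = x + 1 covers both spellings. Note that the rewrite is only safe for analysis purposes: for a target such as a[f()] it would evaluate f() twice if the result were executed.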

[Figure: AST Normalisation and Naming pipeline]

6.4 Pattern Parsing

Rule patterns go through the same parsers but with a dedicated entry point (Parse_pattern.ml). Metavariable tokens ($X, $...ARGS) are injected as special AST nodes before parsing so the grammar treats them as valid expressions - then the engine interprets them during matching.


7. The Matching Engine

7.1 Overall Matching Architecture

[Figure: Matching Engine: Match_rules dispatches to search and taint engines, producing Range_with_metavars results]

7.2 Pattern_vs_code.ml - The Core Matcher

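This is the largest file in the codebase (~155 KB). It implements a recursive-descent structural comparator between a pattern AST node and a code AST node, written in a monadic style: each sub-match is chained to the next with a bind operator (>>=) that threads the metavariable environment through the comparison.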

[Figure: Pattern_vs_code.ml: recursive structural comparator binding metavariables and applying semantic equivalences]
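To make the idea concrete, here is a much-reduced sketch of such a comparator over nested tuples instead of a real AST. It covers only metavariable binding (with equality on reuse) and the ellipsis; the tuple encoding and the match / match_seq names are assumptions for illustration, and the environment is passed explicitly rather than through a monad.

```python
def match(pattern, code, env=None):
    """Return metavariable bindings if pattern matches code, else None."""
    env = dict(env or {})                         # copy, so failed branches leak nothing
    if isinstance(pattern, str) and pattern.startswith("$"):
        if pattern in env:                        # reused metavariable: enforce equality
            return env if env[pattern] == code else None
        env[pattern] = code                       # first use: bind
        return env
    if isinstance(pattern, tuple) and isinstance(code, tuple):
        return match_seq(list(pattern), list(code), env)
    return env if pattern == code else None       # literals must be identical

def match_seq(pats, codes, env):
    if not pats:
        return env if not codes else None
    if pats[0] is ...:                            # ellipsis: try skipping 0..n items
        for skip in range(len(codes) + 1):
            result = match_seq(pats[1:], codes[skip:], env)
            if result is not None:
                return result
        return None
    if not codes:
        return None
    bound = match(pats[0], codes[0], env)
    return None if bound is None else match_seq(pats[1:], codes[1:], bound)

print(match(("call", "foo", "$X", "$X"), ("call", "foo", "a", "a")))
print(match(("call", "open", ..., "$MODE"), ("call", "open", "path", "enc", "w")))
```

A successful match with no metavariables returns an empty dict, so callers should test the result with `is not None` rather than truthiness.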

Key matching rules:

Pattern Construct | Matching Behaviour
$X (uppercase metavar) | Binds to any single expression/identifier; subsequent uses enforce equality
$...ARGS (ellipsis metavar) | Binds to any sequence of arguments
... (ellipsis) | Matches any sequence of statements/args/fields
Literal "foo" | Matches the exact string literal
$_ (anonymous metavar) | Matches any single node without binding
<... X ...> (deep expression) | Searches inside any sub-expression for X
7.3 Metavariable Constraints (Post-match filters)

After a raw structural match, the engine evaluates metavariable constraint clauses in sequence. All must pass for the finding to be reported.

[Figure: Metavariable Constraint Filter Chain: all checks must pass for the finding to be reported]

7.4 Matching Visitor (Matching_visitor.ml)

The visitor drives the match attempt over every node in the code AST. For each rule formula it:

  1. Walks the code AST top-down.
  2. Attempts the pattern at each node.
  3. Collects Range_with_metavars - the code range that matched alongside metavar bindings.
  4. Applies boolean formula operators (AND, OR, NOT, INSIDE) to combine raw matches.

8. The Intermediate Language (IL) & CFG

Search-mode matching works directly on the AST. Taint mode requires understanding execution order, so the engine first lowers the AST into an Intermediate Language (IL) and then builds a Control Flow Graph (CFG).

8.1 AST → IL (AST_to_IL.ml)

[Figure: AST to IL Lowering: AST_to_IL.ml flattens the tree into a linearised 3-address instruction stream]

The IL simplifies dataflow by making all side effects explicit as discrete instructions, each with a clear left-hand side (assignment target) and right-hand side (expression), eliminating complex nested expressions.
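The flattening can be sketched on Python source with the standard ast module: every nested sub-expression is hoisted into its own temporary. The instruction syntax, the _tmpN names, and the lower function are invented for this example and do not reflect the real IL.ml types; only names, constants, a few binary operators, and positional calls are handled.

```python
import ast
import itertools

OPS = {ast.Add: "+", ast.Sub: "-", ast.Mult: "*"}

def lower(source: str) -> list[str]:
    """Flatten one assignment into three-address-style instructions."""
    counter = itertools.count(1)
    out: list[str] = []

    def emit(expr: ast.expr) -> str:
        if isinstance(expr, ast.Name):
            return expr.id
        if isinstance(expr, ast.Constant):
            return repr(expr.value)
        if isinstance(expr, ast.BinOp) and type(expr.op) in OPS:
            left, right = emit(expr.left), emit(expr.right)
            rhs = f"{left} {OPS[type(expr.op)]} {right}"
        elif isinstance(expr, ast.Call) and isinstance(expr.func, ast.Name):
            args = ", ".join([emit(a) for a in expr.args])
            rhs = f"call {expr.func.id}({args})"
        else:
            raise NotImplementedError(ast.dump(expr))
        tmp = f"_tmp{next(counter)}"               # one temporary per sub-expression
        out.append(f"{tmp} = {rhs}")
        return tmp

    stmt = ast.parse(source).body[0]
    if not (isinstance(stmt, ast.Assign) and isinstance(stmt.targets[0], ast.Name)):
        raise NotImplementedError("sketch only handles `name = expr`")
    result = emit(stmt.value)
    out.append(f"{stmt.targets[0].id} = {result}")
    return out

for line in lower("result = query(prefix + user_input) * 2"):
    print(line)
```

Each emitted line has exactly one operation and one assignment target, which is what makes a later dataflow pass easy to write: it only ever has to reason about one small step at a time.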

8.2 CFG Construction (CFG_build.ml)

[Figure: CFG Construction: CFG_build.ml converts linear IL into a Control Flow Graph with basic blocks and edges]

8.3 Constant Propagation (Constant_propagation.ml)

Before taint analysis, the engine runs sparse constant propagation over the CFG. This resolves:

  • String literals assigned to variables (url = "http://evil.com")
  • Simple arithmetic on integer constants
  • Concatenation of known string constants

This enriches the IL so taint patterns can match the effective value of expressions, not just their syntactic form.
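A minimal sketch of the three cases listed above, restricted to straight-line Python code with no branches. The engine works over the CFG; this example does not, and propagate_constants is a hypothetical helper.

```python
import ast

UNKNOWN = object()   # marker for "value not statically known"

def propagate_constants(source: str) -> dict[str, object]:
    """Map variable names to known constant values in straight-line code."""
    known: dict[str, object] = {}

    def value_of(expr: ast.expr):
        if isinstance(expr, ast.Constant):
            return expr.value
        if isinstance(expr, ast.Name):
            return known.get(expr.id, UNKNOWN)
        if isinstance(expr, ast.BinOp) and isinstance(expr.op, ast.Add):
            left, right = value_of(expr.left), value_of(expr.right)
            both_known = left is not UNKNOWN and right is not UNKNOWN
            if both_known and type(left) is type(right) and isinstance(left, (int, str)):
                return left + right               # integer addition or string concatenation
        return UNKNOWN

    for stmt in ast.parse(source).body:
        if (isinstance(stmt, ast.Assign) and len(stmt.targets) == 1
                and isinstance(stmt.targets[0], ast.Name)):
            name, val = stmt.targets[0].id, value_of(stmt.value)
            if val is UNKNOWN:
                known.pop(name, None)             # forget stale facts on reassignment
            else:
                known[name] = val
    return known

src = 'scheme = "http://"\nhost = "evil.com"\nurl = scheme + host\nport = 8000 + 80\ntarget = url + input()\n'
print(propagate_constants(src))
```

With url resolved to "http://evil.com", a rule that looks for a specific string value can match a later use of url even though the literal never appears at that call site.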


9. Taint Analysis (Dataflow)

9.1 End-to-End Taint Flow

[Figure: End-to-End Taint Analysis: rule spec is instantiated, CFG is analyzed with a fixpoint loop, taint findings are emitted]

9.2 Fixpoint Engine (Dataflow_core.ml)

The core dataflow algorithm is a worklist-based forward analysis:

[Figure: Fixpoint Engine: worklist-driven forward analysis iterating until state stabilizes]
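A generic worklist loop of this kind can be sketched in a few lines. This is the textbook algorithm under simplifying assumptions (states are frozensets, join is set union, the CFG is a plain dict), not a transcription of Dataflow_core.ml; forward_fixpoint and the example graph are made up.

```python
from collections import deque

def forward_fixpoint(cfg, entry, transfer, init=frozenset()):
    """cfg maps each node to its successors. Returns the OUT state of every node."""
    preds = {n: [] for n in cfg}
    for node, succs in cfg.items():
        for s in succs:
            preds[s].append(node)
    out = {n: frozenset() for n in cfg}
    worklist = deque(cfg)                          # start by visiting every node once
    while worklist:
        node = worklist.popleft()
        in_state = init if node == entry else frozenset()
        for p in preds[node]:                      # join = union over predecessors
            in_state |= out[p]
        new_out = transfer(node, in_state)
        if new_out != out[node]:                   # state changed: successors must be revisited
            out[node] = new_out
            worklist.extend(s for s in cfg[node] if s not in worklist)
    return out                                     # nothing changes any more: fixpoint

# A loop: entry -> loop -> body -> loop, and loop -> exit.
cfg = {"entry": ["loop"], "loop": ["body", "exit"], "body": ["loop"], "exit": []}
GEN = {"entry": frozenset({"x"}), "body": frozenset({"y"})}   # facts generated per node
result = forward_fixpoint(cfg, "entry", lambda n, s: s | GEN.get(n, frozenset()))
print(sorted(result["exit"]))
```

The fact y is generated inside the loop body, yet it reaches exit because the back edge forces the loop header to be re-evaluated. Termination relies on the transfer function being monotone over a finite set of facts.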

9.3 Taint Representation (Taint.ml / Taint_shape.ml)

Taint is not just a boolean - it carries provenance information:

Concept | Representation
Taint label | Which source rule produced this taint (enables label-matching)
Taint call stack | The chain of function calls through which taint propagated
Taint shape | Per-field/index taint (Taint_shape.ml): obj.field can be tainted independently of obj
Xtaint | Clean / Tainted(set) / Both - tracks when a value is both clean and tainted (e.g., via branches)

9.4 Taint Propagation Rules

Taint Propagation Rules: x = tainted_y taints x; x.f = tainted_y taints field x.f; call return is tainted if a propagator rule applies; string interpolation of tainted data taints the result; sanitizer calls remove taint.
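These rules can be exercised on a toy, straight-line instruction list. The sketch below treats taint as a plain boolean per variable name (so none of the provenance described in 9.3), treats a field such as obj.q as just another name, and uses an invented (dst, func, args) instruction format; run_taint is hypothetical.

```python
def run_taint(instrs, sources, sanitizers, sinks, propagators=frozenset()):
    """instrs: list of (dst, func, args); func is None for assignment or concatenation."""
    tainted: set[str] = set()
    findings = []
    for idx, (dst, func, args) in enumerate(instrs):
        arg_tainted = any(a in tainted for a in args)
        if func in sinks and arg_tainted:
            findings.append((idx, func))          # tainted data reached a sink
        if func in sources:
            result_tainted = True                 # source call: result is tainted
        elif func in sanitizers:
            result_tainted = False                # sanitizer call: result is clean
        elif func is None or func in propagators:
            result_tainted = arg_tainted          # assignment, field write, concat, propagator
        else:
            result_tainted = False                # other calls: not propagated in this sketch
        if dst is not None:
            (tainted.add if result_tainted else tainted.discard)(dst)
    return findings

program = [
    ("user", "request_param", []),        # source
    ("query", None, ["prefix", "user"]),  # concatenation keeps the taint
    ("obj.q", None, ["query"]),           # field write taints obj.q
    ("safe", "escape", ["query"]),        # sanitizer removes it
    (None, "execute", ["safe"]),          # clean argument: no finding
    (None, "execute", ["obj.q"]),         # tainted argument: finding
]
print(run_taint(program, {"request_param"}, {"escape"}, {"execute"}))
```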

9.5 Source / Sink / Sanitizer Matching (Taint_spec_match.ml)

The “best match” strategy prevents double-reporting:

Best Match Strategy (Taint_spec_match.ml): When multiple overlapping AST nodes match a source/sink pattern (e.g., foo vs foo.bar), the engine keeps only the longest/most-specific match to avoid double-reporting.
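A simple approximation of that strategy is to drop every match whose range is contained in another match. The sketch below handles only containment, not partial overlaps, and best_matches is a made-up helper, so the real selection logic may differ.

```python
def best_matches(ranges):
    """Keep only ranges that are not contained in a different, larger range."""
    unique = set(ranges)
    return sorted(
        r for r in unique
        if not any(o != r and o[0] <= r[0] and r[1] <= o[1] for o in unique)
    )

# Hypothetical offsets: `foo` at 4..7, `foo.bar` at 4..11, an unrelated match at 20..25.
print(best_matches([(4, 7), (4, 11), (20, 25)]))
```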

9.6 Intra- vs Inter-procedural Analysis

Mode | Scope | Cross-function | Notes
Intra-procedural (OSS) | Single function | No | Function params assumed possibly tainted
Inter-procedural (Pro) | Whole program / cross-file | Yes | Build call graph; compute per-function taint summaries; propagate across call sites

10. Output & Reporting Pipeline

10.1 Core Output → CLI Formatting

[Figure: Output and Reporting Pipeline: Core JSON is ingested, deduplicated, nosemgrep-filtered, then formatted]

10.2 Exit Codes

Code | Meaning
0 | No findings (or findings present but --error not set)
1 | Findings present (when --error flag used)
2 | Scan error (parse error, rule error)
3 | Invalid syntax in the scanned language (reported only with --strict)
123 | Could not connect to Semgrep Registry

10.3 The # nosemgrep Suppression System

# nosemgrep suppression: If a finding is on line N and line N or N-1 contains # nosemgrep (optionally followed by a colon and a comma-separated list of rule IDs, e.g. # nosemgrep: rule-id-1, rule-id-2), the finding is suppressed and not reported.
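The check itself is small enough to sketch. This version assumes rule IDs follow a colon and are separated by commas, as in # nosemgrep: rule-a, rule-b, and it does not verify that the marker sits inside a comment; the regex and is_suppressed are illustrative, not the CLI's code.

```python
import re

# "nosemgrep", optionally followed by ": id1, id2, ..."
NOSEM = re.compile(r"nosemgrep(?::\s*(?P<ids>[\w.\-]+(?:\s*,\s*[\w.\-]+)*))?")

def is_suppressed(lines: list[str], line_no: int, rule_id: str) -> bool:
    """line_no is 1-based; looks at the finding's line and the line before it."""
    for n in (line_no, line_no - 1):
        if 1 <= n <= len(lines):
            m = NOSEM.search(lines[n - 1])
            if m:
                ids = m.group("ids")
                # A bare marker suppresses every rule; a list only the named ones.
                if ids is None or rule_id in re.split(r"\s*,\s*", ids):
                    return True
    return False

lines = [
    "# nosemgrep: rule-a",
    "danger(user_input)",
    "eval(x)  # nosemgrep",
]
print(is_suppressed(lines, 2, "rule-a"))
print(is_suppressed(lines, 2, "rule-b"))
```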


11. OSemgrep / RPC Architecture

11.1 OSemgrep (src/osemgrep/)

OSemgrep is the OCaml reimplementation of the Python CLI - a long-term effort to bring the full CLI into the OCaml binary, eliminating the Python dependency for performance-critical or embedded scenarios.

OSemgrep vs pysemgrep: The Python CLI invokes semgrep-core as a subprocess and exchanges JSON over stdin/stdout. OSemgrep (the OCaml reimplementation) calls core directly as a library, eliminating the subprocess boundary and the Python runtime dependency.

11.2 RPC / LSP Bridge (src/rpc/, cli/src/semgrep/rpc.py)

For IDE integrations, Semgrep exposes a JSON-RPC interface:

RPC / LSP Bridge: The IDE sends LSP events to the language server (rpc.py / lsp_legacy/). The server forwards them as JSON-RPC scan requests to the Core. Findings are returned as LSP diagnostics and shown as inline annotations in the editor.


12. Key Data Structures

12.1 AST_generic Node Hierarchy (simplified)

The any union covers all node kinds. The most common sub-types:

Kind | Variants
Expr | Literal, Id, Call, Assign, BinOp, Lambda, Conditional, …
Stmt | ExprStmt, If, For, Return, Try, Block, …
Type | TyName, TyFun, TyArray, TyGeneric, …
Definition | FuncDef, ClassDef, VarDef, …
Directive | ImportFrom, ImportAll, Pragma, …

12.2 Rule.t (OCaml Record, simplified)

Field | Type | Description
id | Rule_ID.t | Unique rule identifier
languages | Language.t list | Target languages
formula | Formula | Pattern formula (search mode)
taint_spec | Taint_spec | Source/sink/sanitizer spec (taint mode)
severity | Error / Warning / Info / Inventory | Finding severity
fix | string option | Autofix template
metadata | JSON blob | OWASP, CWE, confidence tags
paths | selectors | include/exclude path filters
options | Rule_options.t | Per-rule engine feature toggles

12.3 Range_with_metavars.t (Match Result)

Field | Description
range | Byte range (start_pos, end_pos) in the source file
mvars | List of (mvar_name, AST_node) bindings
origin | Rule_ID that produced this match
taint_trace | Source-to-sink provenance path (taint mode only)
fix | Computed autofix string (if rule has fix:)

Appendix A: Parser Coverage

Backend | Languages
Tree-sitter | Go, Java, JavaScript, TypeScript, C, C++, C#, Kotlin, Ruby, Rust, Scala, Swift, HCL, Dockerfile, JSON, YAML, HTML, Bash, …
Pfff | Python, PHP, OCaml (legacy)
Custom / Mixed | Generic (any language via Spacegrep / Aliengrep)

Appendix B: Glossary

Term | Definition
AST_generic | Semgrep’s universal AST - a normalised representation of any language’s code
IL | Intermediate Language - a flattened, 3-address form of a function’s body
CFG | Control Flow Graph - nodes are IL instructions; edges are execution paths
Metavariable | Pattern wildcard (e.g., $X) that binds to matched AST nodes
Taint | A label marking data that originated from a dangerous source
Fixpoint | The stable point of a dataflow iteration loop (no more state changes)
Prefilter | Per-file skip-scan based on literal strings extracted from rule patterns
OSemgrep | OCaml re-implementation of the semgrep CLI (replaces pysemgrep)
Pro engine | Semgrep’s commercial engine adding inter-procedural and cross-file analysis
SCA | Software Composition Analysis - scanning package manifests for known CVEs
SARIF | Static Analysis Results Interchange Format - standardised JSON output schema

See also