infrastructure · not a scanner

Security findings you can replay, verify, and trust.

Ward is security review infrastructure. Every finding ships with a deterministic evidence chain, a calibrated probability, and a signed witness — replayable bit-for-bit and verifiable by a third party without trusting Ward. Runs in your environment.

See the results

Deterministic replay Signed witnesses · self-hostable 5 languages · 19 vulnerability classes

~/src/project zsh · ward

$ ward review

static analysis 11 candidates

investigating 2 bundled · 9 queued

evidence pipeline ok

▸ aiohttp CVE-2024-23334 evidence: under_review

classpath traversal

gradeconfig-dependent

trigger follow_symlinks False → True

oracleproven · canary echoed via symlink read

evidence config-trigger · repro-test · repro-result

▸ sanic cli/app.py:142 evidence: draft

classpath traversal

gradesemantic-only · counterexample

refuted import_module requires identifier grammar

evidence counterexample · 4f3f7a7c…

2 bundled · 9 queued · $0.19 spent · 13 turns total

real output from the run on the examples page

01 Real-world baseline

Real-CVE evaluation is the floor.

We ran Ward and the tools teams most commonly rely on over the same corpus of 3,464 paired CVE-harness entries grounded in historical CVEs across five ecosystems. A finding only counts when it flags the vulnerable code targeted by the patch and disappears on the fix commit.

1,392

paired true positives

82.1%

recall

73.6%

precision

3,464

paired CVE-harness entries · 5 ecosystems

Same corpus, same harness · paired scoring

dated 2026-05-01 · 3,464-entry corpus · 5 ecosystems

ward

1,392 TP · R=82.1%

semgrep

430 TP · R=25.5% · 31% of Ward

codeql

251 TP · R=14.5% · 18% of Ward

▸ Paired scoring: a finding is “real” only if present on the vulnerable commit, localized to the code the patch fixed, and absent on the fix commit. CodeQL precision is extremely high (98.8%, only 3 false positives after paired reclassification) but recall is 14.5% — missing 85.5% of CVEs. These figures are consistent with independent published benchmarks; see methodology for citations. Read the methodology →

▸ Locked, paired unsafe-Rust head-to-head (80 RUSTSEC pairs, 2026-05-13) pre-release: Ward F1=0.655, MCC=+0.564; every tested competitor 0 TPs at ≥ WARNING; McNemar p ≈ 1.46 × 10⁻¹¹. Internal pre-release benchmark; public reproduction unlocks at Ward source release. Read the writeup →

02 Verifiable, not vibes

A finding should be reproducible — by anyone, not just us.

A typical scanner emits an alert and a severity. Ward emits a deterministic record: the source-to-sink trace, the calibrated probability, the signed evidence chain that produced the decision, and a hermetic capsule a third party can replay bit-for-bit. Verifiable, not trust-me.

deterministic · signed · sandboxed

Ward treats a finding as a verifiable record, not a flag in a dashboard. The pipeline is deterministic, the evidence chain is signed, and the inputs and tools are pinned — everything needed to reproduce the result, including a skeptic.

Deterministic pipeline · same inputs produce bit-identical outputs
Hermetic replay capsule per run · re-derivable without our cooperation
Signed evidence chain · attestations anchored to an external transparency log
Calibrated probability with empirical bounds · not a severity label
Sandbox-by-default · analyzers run with capability restrictions, not network trust
Refutations carry a structured counterexample · not a silent suppression

verify a run third-party reproducible ▸

$ ward verify ./run-2026-04-29.capsule

✓ capsule integrity blake3:7a9f…

✓ analyzer attestations ed25519 valid

✓ commitment chain externally anchored

✓ inputs match digest bit-identical

✓ findings re-derived 1,392 / 1,392

trust: verified

a Ward run is a record anyone can re-derive from the capsule and verify against the signed chain — without trusting Ward.

03 From detection to evidence

Detect, investigate, decide — with evidence at every step.

The scanner remains the base layer. On top of it, Ward runs an investigation loop that carries each finding from candidate signal to a decision a reviewer can defend — with the trace, repro, and provenance attached.

I · DETECT

Cross-file reasoning

<source>

input

<intermediate>

…

<sink> · risk

action

finding

Reasons across files to surface vulnerable flows that single-file pattern matching often misses.

II · INVESTIGATE

Evidence bundles

candidate trace

·taint flow attached

·entry point identified

·bundle: draft

investigation evidence

·repro test

·repro result

·provenance

·review queue

Investigation runs inside a capability-restricted sandbox. Every step — model call, repro execution, patch attempt — is captured in a signed witness with provenance pinned to specific tool and model versions. Reproducible, auditable, attributable.

III · DECIDE

Evidence grades

Reproduced

exploit runs under default config · oracle proven

Config-dependent

opt-in risk · trigger captured in bundle

Counterexample

refutation with concrete witness

Semantic-only

no runnable oracle · routed to analyst review

decisions are loss-aware · costs versioned, not heuristic

Each finding carries an explicit grade. The block / warn / allow decision is driven by a versioned loss function, not a static threshold — auditable and policy-controlled.

04 Methodology

What the evaluation shows.

The benchmark matters. So do its limits. Here’s what we count, what we compare, and where the current pre-release claims stop.

What is paired scoring, precisely?

For each CVE we have a repo and two SHAs: vuln_sha (the commit the CVE was filed against) and fix_sha (the merge that closed it). We run the scanner on both and call the finding “real” only if it fires on vuln_sha at a location whose scope includes the code the patch fixed, and does not fire on fix_sha. Any other pattern is not credited. Raw alert counts across scanners aren’t comparable; paired scoring is.

Is Ward just another SAST tool?

No. Static analysis is the entry point, not the whole story. Ward is verifiable security review infrastructure: deterministic scanner output, signed evidence chains, hermetic replay capsules, calibrated probabilities, and a sandboxed investigation layer on top. The scanner is farther along than the investigation layer today.

Does Ward send my code to external APIs?

The static pipeline runs entirely in your environment with no external calls. The investigation layer can use pinned model providers; a fully self-hosted variant for air-gapped deployments is on the near roadmap. In every mode, code only enters declared, sandboxed surfaces, and every model interaction is captured in the signed witness chain.

How do I verify a finding without trusting Ward?

Each run produces a hermetic replay capsule and a signed evidence chain anchored to an external transparency log. A third party can re-derive the same findings from the capsule and verify the attestations independently — no Ward cooperation required. Anchor staleness and verification failures explicitly downgrade the run’s trust state.

Which tools did you compare against?

The current comparison includes Semgrep and CodeQL. Both were run over the same 3,464-entry corpus under the same paired-scoring harness. CodeQL was run at full-corpus scale (2026-05-01) using CodeQL CLI 2.24.3 with the javascript-security-extended and equivalent language query suites. We’ll publish the exact versions, configurations, and harness details alongside the benchmark methodology.

The recall figures we observe for CodeQL (~14%) and Semgrep (~25%) are consistent with the published academic literature. Independent evaluations report CodeQL at 20–22.5% recall on Java-only CVE corpora (Bennett et al., EASE 2024; IRIS / CWE-Bench-Java, ICLR 2025), and lower still on Go and Rust where query coverage is thinner. Semgrep off-the-shelf is reported at 17–24% on real Python and JavaScript CVE datasets (RealVuln, arXiv:2604.13764; DeepSource OSS CVE benchmark). Our paired methodology — which requires a finding to be localized to the patched code and absent on the fix commit — is stricter than most published evaluations, so our numbers sit at or slightly below the published midpoints.

The gap is structural: the same EASE 2024 study attributes 76.9% of all SAST false negatives to missing or insufficient rules, not engine limitations. The tools miss what they miss because no one has written the query or sink specification for that vulnerability pattern yet — not because taint analysis fundamentally can’t reach it.

Will the corpus be public?

We intend to publish the methodology, scoring harness, benchmark dates, and pinned tool configurations. We have not finalized what portion of the corpus itself will be public.

When can I try it?

Ward is pre-release and in active development. There’s no public install today. If you want to be notified when there is, leave your email below.