infrastructure · not a scanner

Security findings you can replay, verify, and trust.

Ward is security review infrastructure. Every finding ships with a deterministic evidence chain, a calibrated probability, and a signed witness — replayable bit-for-bit and verifiable by a third party without trusting Ward. Runs in your environment.

Deterministic replay · signed witnesses · self-hostable · 5 languages · 19 vulnerability classes
~/src/project zsh · ward
$ ward review
static analysis 11 candidates
investigating 2 bundled · 9 queued
evidence pipeline ok

▸ aiohttp CVE-2024-23334 evidence: under_review
class path traversal
grade config-dependent
trigger follow_symlinks False → True
oracle proven · canary echoed via symlink read
evidence config-trigger · repro-test · repro-result
▸ sanic cli/app.py:142 evidence: draft
class path traversal
grade semantic-only · counterexample
refuted import_module requires identifier grammar
evidence counterexample · 4f3f7a7c…

2 bundled · 9 queued · $0.19 spent · 13 turns total
real output from the run on the examples page
01   Real-world baseline

Real-CVE evaluation is the floor.

We ran Ward and the tools teams most commonly rely on over the same corpus of 3,408 entries grounded in historical CVEs across five ecosystems. A finding only counts when it flags the vulnerable code targeted by the patch and disappears on the fix commit.

1,392
paired true positives
82.1%
recall
73.6%
precision
3,408
corpus entries · 5 ecosystems

Same corpus, same harness · paired scoring

dated 2026-04-28 · 3,381-entry overlap · codeql full-corpus rerun pending
ward
1,392 TP
semgrep
430 TP · 31% of Ward
codeql
pending full-corpus rerun
withheld

▸ Paired scoring: a finding is “real” only if present on the vulnerable commit, localized to the code the patch fixed, and absent on the fix commit. CodeQL numbers are withheld until the full-corpus rerun completes cleanly enough to publish a reproducible figure under the same harness. Read the methodology →

02   Verifiable, not vibes

A finding should be reproducible — by anyone, not just us.

A typical scanner emits an alert and a severity. Ward emits a deterministic record: the source-to-sink trace, the calibrated probability, the signed evidence chain that produced the decision, and a hermetic capsule a third party can replay bit-for-bit. Verifiable, not trust-me.

deterministic · signed · sandboxed

Ward treats a finding as a verifiable record, not a flag in a dashboard. The pipeline is deterministic, the evidence chain is signed, and the inputs and tools are pinned: everything a skeptic needs to reproduce the result.

  • Deterministic pipeline · same inputs produce bit-identical outputs
  • Hermetic replay capsule per run · re-derivable without our cooperation
  • Signed evidence chain · attestations anchored to an external transparency log
  • Calibrated probability with empirical bounds · not a severity label
  • Sandbox-by-default · analyzers run with capability restrictions, not network trust
  • Refutations carry a structured counterexample · not a silent suppression
verify a run · third-party reproducible ▸
$ ward verify ./run-2026-04-29.capsule
capsule integrity blake3:7a9f…
analyzer attestations ed25519 valid
commitment chain externally anchored
inputs match digest bit-identical
findings re-derived 1,392 / 1,392
trust: verified
a Ward run is a record anyone can re-derive from the capsule and verify against the signed chain — without trusting Ward.

03   From detection to evidence

Detect, investigate, decide — with evidence at every step.

The scanner remains the base layer. On top of it, Ward runs an investigation loop that carries each finding from candidate signal to a decision a reviewer can defend — with the trace, repro, and provenance attached.

I · DETECT

Cross-file reasoning

flow: input at <source> → <intermediate> → <sink> · risk → finding

Reasons across files to surface vulnerable flows that single-file pattern matching often misses.
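A toy illustration of the kind of flow the detect stage targets (hypothetical file and function names; two "files" shown in one sketch): untrusted input enters in one module and reaches a filesystem sink in another, so a single-file pattern match never connects the two.

```python
import os.path

FILES_ROOT = "/srv/files"

# routes.py (toy): untrusted input enters here and crosses a file
# boundary into the storage helper below.
def handle(request_path: str) -> str:
    return resolve(request_path)

# storage.py (toy): the sink joins without normalizing, so '../'
# sequences in the input escape FILES_ROOT.
def resolve(name: str) -> str:
    return os.path.join(FILES_ROOT, name)
```

Here `handle("../../etc/passwd")` normalizes to `/etc/passwd`, outside the intended root; the vulnerability only becomes visible once the flow is traced across both files.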

II · INVESTIGATE

Evidence bundles

candidate trace
  • taint flow attached
  • entry point identified
  • bundle: draft
investigation evidence
  • repro test
  • repro result
  • provenance
  • review queue

Investigation runs inside a capability-restricted sandbox. Every step — model call, repro execution, patch attempt — is captured in a signed witness with provenance pinned to specific tool and model versions. Reproducible, auditable, attributable.

III · DECIDE

Evidence grades

Reproduced
exploit runs under default config · oracle proven
Config-dependent
opt-in risk · trigger captured in bundle
Counterexample
refutation with concrete witness
Semantic-only
no runnable oracle · routed to analyst review
decisions are loss-aware · costs versioned, not heuristic

Each finding carries an explicit grade. The block / warn / allow decision is driven by a versioned loss function, not a static threshold — auditable and policy-controlled.
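A minimal sketch of what a loss-driven decision looks like, assuming an illustrative cost table (the actions and structure mirror the block / warn / allow policy described above; the specific costs are invented for the example). The block/warn/allow thresholds fall out of the costs rather than being hard-coded.

```python
# Hypothetical versioned loss table: per action, the cost incurred
# if the finding is truly vulnerable vs. if it is benign.
LOSS_V1 = {
    # action: (cost if vulnerable, cost if benign)
    "block": (0.0, 5.0),
    "warn":  (2.0, 1.0),
    "allow": (20.0, 0.0),
}

def decide(p_vuln: float, loss_table=LOSS_V1) -> str:
    """Pick the action minimizing expected loss under the
    calibrated probability p_vuln."""
    def expected(action):
        c_vuln, c_benign = loss_table[action]
        return p_vuln * c_vuln + (1 - p_vuln) * c_benign
    return min(loss_table, key=expected)
```

With these example costs, a high-confidence finding blocks, a mid-range one warns, and a low one is allowed; swapping in a new versioned table changes policy without touching the decision code.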

04   Methodology

What the evaluation shows.

The benchmark matters. So do its limits. Here’s what we count, what we compare, and where the current pre-release claims stop.

What is paired scoring, precisely?

For each CVE we have a repo and two SHAs: vuln_sha (the commit the CVE was filed against) and fix_sha (the merge that closed it). We run the scanner on both and call the finding “real” only if it fires on vuln_sha at a location whose scope includes the code the patch fixed, and does not fire on fix_sha. Any other pattern is not credited. Raw alert counts across scanners aren’t comparable; paired scoring is.
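The scoring rule above can be sketched in a few lines. This is a simplification, not the published harness: "scope" here is approximated as file path plus the set of patched line numbers, and the `Finding` shape is hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Finding:
    path: str
    line: int

def paired_score(vuln_findings, fix_findings, patched_lines):
    """Credit a finding only if it fires on vuln_sha inside the
    patched region AND is absent on fix_sha. Everything else is
    uncredited (scope approximated as file + patched line set)."""
    still_present = {(f.path, f.line) for f in fix_findings}
    credited = []
    for f in vuln_findings:
        in_scope = f.line in patched_lines.get(f.path, set())
        gone_on_fix = (f.path, f.line) not in still_present
        if in_scope and gone_on_fix:
            credited.append(f)
    return credited
```

Note that an alert that fires on both commits, or fires outside the patched region, earns nothing under this rule, which is why raw alert counts inflate relative to paired scores.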

Is Ward just another SAST tool?

No. Static analysis is the entry point, not the whole story. Ward is verifiable security review infrastructure: deterministic scanner output, signed evidence chains, hermetic replay capsules, calibrated probabilities, and a sandboxed investigation layer on top. The scanner is farther along than the investigation layer today.

Does Ward send my code to external APIs?

The static pipeline runs entirely in your environment with no external calls. The investigation layer can use pinned model providers; a fully self-hosted variant for air-gapped deployments is on the near roadmap. In every mode, code only enters declared, sandboxed surfaces, and every model interaction is captured in the signed witness chain.

How do I verify a finding without trusting Ward?

Each run produces a hermetic replay capsule and a signed evidence chain anchored to an external transparency log. A third party can re-derive the same findings from the capsule and verify the attestations independently — no Ward cooperation required. Anchor staleness and verification failures explicitly downgrade the run’s trust state.

Which tools did you compare against?

The current comparison includes Semgrep. CodeQL is being rerun under a stricter full-corpus setup, but those runs currently take more than 24 hours and have not completed cleanly enough for us to publish a reproducible headline number. We’ll publish the exact versions, configurations, and harness details alongside the benchmark methodology.

Will the corpus be public?

We intend to publish the methodology, scoring harness, benchmark dates, and pinned tool configurations. We have not finalized what portion of the corpus itself will be public.

When can I try it?

Ward is pre-release and in active development. There’s no public install today. If you want to be notified when there is, leave your email below.