not just a scanner · an evidence layer

Security findings with evidence, not just alerts.

Ward starts with real-CVE-tested static analysis, then carries each finding through agentic investigation to a typed evidence bundle a reviewer can act on — trace, provenance, config-aware risk, and reproducible proof where a proof lane exists.

Rust, Go, Python, JavaScript, Java · 19 vulnerability classes · Evidence-backed findings
~/src/project zsh · ward
$ ward review
static analysis 11 candidates
investigating 2 bundled · 9 queued
evidence pipeline ok

▸ aiohttp CVE-2024-23334 evidence: under_review
  class     path traversal
  grade     config-dependent
  trigger   follow_symlinks False → True
  oracle    proven · canary echoed via symlink read
  evidence  config-trigger · repro-test · repro-result
▸ sanic cli/app.py:142 evidence: draft
  class     path traversal
  grade     semantic-only · counterexample
  refuted   import_module requires identifier grammar
  evidence  counterexample · 4f3f7a7c…

2 bundled · 9 queued · $0.19 spent · 13 turns total
real output from the run on the examples page
01   Real-world baseline

Real-CVE evaluation is the floor.

We ran Ward and the tools that teams most commonly rely on against the same corpus: 2,068 entries grounded in historical CVEs across five ecosystems. A finding counts only when it flags the vulnerable code targeted by the patch and disappears on the fix commit.

799 paired true positives
72.6% recall
73.1% precision
2,068 corpus entries · 5 ecosystems

Same corpus, same harness · paired scoring

dated 2026-04-11
ward · 799 TP
semgrep · 58% of Ward
codeql · pending full-corpus rerun · withheld

▸ Paired scoring: a finding is “real” only if present on the vulnerable commit, localized to the code the patch fixed, and absent on the fix commit. CodeQL numbers are withheld until the full-corpus rerun completes cleanly enough to publish a reproducible figure under the same harness. Read the methodology →

02   Evidence, not vibes

Findings should carry more than a severity label.

A scanner dashboard tells you something might be wrong. Ward ships the source-to-sink trace, the reproduction (or counterexample), the config that decides whether the bug is actually exploitable, and the execution history behind every decision — bundled with the finding, not buried in a log.

evidence bundles · provenance · config-aware grading

A Ward finding can carry more than a rule match: a cross-file trace, reproducible evidence where a proof lane exists, a reviewable bundle state, and the provenance needed to explain why the system believes the issue is real.

  • Cross-file trace from source to sink
  • Reviewable evidence bundle for investigation state
  • Provenance and pinned execution context captured per run
  • Config-dependent risk distinguished from default-unsafe behavior
  • Refuted findings carry a structured counterexample, not a silent suppression
bundle surface · current state ▸
signal surface state notes
trace · shipped · cross-file
repro · pilot · proof lanes
patch · pilot · fixture-first
proof · shipped · evidence
config · shipped · path c
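
To make the bundle idea concrete, here is a minimal sketch of what a typed evidence bundle could look like. It is an illustration only, not Ward's actual schema: the class names, field names, and the `problems` check are all hypothetical. The two states (`draft`, `under_review`) and the artifact kind `counterexample` are taken from the sample output above.

```python
from dataclasses import dataclass, field
from enum import Enum


class BundleState(Enum):
    # The two states that appear in the sample output; others may exist.
    DRAFT = "draft"
    UNDER_REVIEW = "under_review"


@dataclass(frozen=True)
class TraceStep:
    path: str
    line: int
    role: str  # "source", "intermediate", or "sink" (assumed vocabulary)


@dataclass
class EvidenceBundle:
    # Hypothetical shape: every field name here is an assumption.
    finding_id: str
    state: BundleState = BundleState.DRAFT
    trace: list[TraceStep] = field(default_factory=list)
    artifacts: dict[str, str] = field(default_factory=dict)   # kind -> content digest
    provenance: dict[str, str] = field(default_factory=dict)  # pinned run context
    refuted: bool = False

    def problems(self) -> list[str]:
        """List what a reviewer would find missing before acting on the bundle."""
        issues = []
        roles = [step.role for step in self.trace]
        if not roles or roles[0] != "source" or roles[-1] != "sink":
            issues.append("trace must run from a source to a sink")
        if self.refuted and "counterexample" not in self.artifacts:
            issues.append("refuted finding needs a counterexample artifact")
        if not self.provenance:
            issues.append("provenance is missing")
        return issues
```

The point of the check is the last bullet above: a refuted finding without a counterexample artifact is reported as incomplete rather than silently dropped.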
03   From detection to evidence

Detect, investigate, decide — with evidence at every step.

The scanner remains the base layer. On top of it, Ward runs an investigation loop that carries each finding from candidate signal to a decision a reviewer can defend — with the trace, repro, and provenance attached.

I · DETECT

Cross-file reasoning

<source> (input) → <intermediate> → <sink> · risk (action) → finding

Reasons across files to surface vulnerable flows that single-file pattern matching often misses.
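
As a rough illustration of why cross-file reasoning matters, the sketch below searches a data-flow graph whose nodes live in different files. It is a generic breadth-first search, not Ward's analysis; the `file:function` node names and the edge map are invented for the example.

```python
from collections import deque


def find_flow(edges, source, sink):
    """Return the shortest source-to-sink path over data-flow edges, or None."""
    queue = deque([[source]])
    seen = {source}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == sink:
            return path
        for nxt in edges.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None


# Hypothetical flow: user input enters in one file and reaches a file read in another.
edges = {
    "views.py:upload": ["utils.py:resolve"],
    "utils.py:resolve": ["storage.py:open_file"],
}
flow = find_flow(edges, "views.py:upload", "storage.py:open_file")
```

No single file in this example contains both the source and the sink, so a matcher that looks at one file at a time has nothing to match on.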

II · INVESTIGATE

Evidence bundles

candidate trace
  • taint flow attached
  • entry point identified
  • bundle: draft
investigation evidence
  • repro test
  • repro result
  • provenance
  • review queue

For supported lanes, Ward can carry a candidate forward into repro artifacts, provenance, and review state.

III · DECIDE

Evidence grades

library intrinsic · reproduced
config dependent · opt-in risk
semantic only · needs review
default-unsafe vs opt-in risk vs semantic evidence

Ward is adding product-level distinctions between bugs that are unsafe by default, risks that require an opt-in configuration, and findings that still need analyst judgment.
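
One way to picture the three grades is as a small decision function. The sketch below is an assumption about how such a rule could be written, not Ward's grading logic: the `grade` function, its parameters, and the `library-intrinsic` spelling are hypothetical, while `config-dependent` and `semantic-only` match the labels in the sample output.

```python
from enum import Enum
from typing import Optional


class Grade(Enum):
    LIBRARY_INTRINSIC = "library-intrinsic"  # reproduced under default settings
    CONFIG_DEPENDENT = "config-dependent"    # reproduced only after an opt-in
    SEMANTIC_ONLY = "semantic-only"          # no reproduction; needs review


def grade(reproduced: bool, required_config: Optional[dict], defaults: dict) -> Grade:
    """Grade a finding from its repro result and the config the repro needed."""
    if not reproduced:
        return Grade.SEMANTIC_ONLY
    # The finding is opt-in if any required setting differs from its default.
    needs_opt_in = any(
        defaults.get(key) != value for key, value in (required_config or {}).items()
    )
    return Grade.CONFIG_DEPENDENT if needs_opt_in else Grade.LIBRARY_INTRINSIC


# Mirrors the aiohttp sample: the trigger flips follow_symlinks from False to True.
aiohttp_grade = grade(
    reproduced=True,
    required_config={"follow_symlinks": True},
    defaults={"follow_symlinks": False},
)
```

Here `aiohttp_grade` comes out as `Grade.CONFIG_DEPENDENT`, because the reproduction only works once a setting is moved away from its default.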

04   Methodology

What the evaluation shows.

The benchmark matters. So do its limits. Here’s what we count, what we compare, and where the current pre-release claims stop.

What is paired scoring, precisely?

For each CVE we have a repo and two SHAs: vuln_sha (the commit the CVE was filed against) and fix_sha (the merge that closed it). We run the scanner on both and call the finding “real” only if it fires on vuln_sha at a location whose scope includes the code the patch fixed, and does not fire on fix_sha. Any other pattern is not credited. Raw alert counts across scanners aren’t comparable; paired scoring is.
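
The rule above can be sketched in a few lines. This is a simplified illustration, not the published harness: the `Finding` and `PatchedRegion` shapes are hypothetical, "scope includes the patched code" is approximated as a line-range overlap in the same file, and "fires on fix_sha" is approximated as the same rule still reporting in the same file.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    """One scanner alert: a rule, a file, and a 1-based inclusive line range."""
    rule_id: str
    path: str
    start_line: int
    end_line: int


@dataclass(frozen=True)
class PatchedRegion:
    """Code the fix changed, in vuln_sha line coordinates (an assumption)."""
    path: str
    start_line: int
    end_line: int


def covers(finding: Finding, region: PatchedRegion) -> bool:
    """True when the finding overlaps the patched region in the same file."""
    return (
        finding.path == region.path
        and finding.start_line <= region.end_line
        and region.start_line <= finding.end_line
    )


def is_paired_true_positive(vuln_findings, fix_findings, patched_regions) -> bool:
    """Credit an entry only if a finding hits the patched code and is gone after the fix."""
    still_firing = {(f.rule_id, f.path) for f in fix_findings}
    return any(
        covers(f, region) and (f.rule_id, f.path) not in still_firing
        for f in vuln_findings
        for region in patched_regions
    )
```

Under this sketch, an alert that fires on both commits earns no credit, and neither does an alert on the vulnerable commit that points at unrelated code.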

Is Ward just another SAST tool?

No. Static analysis is the entry point, not the whole story. Ward is being built as evidence-backed security review infrastructure: scanner findings, evidence bundles, provenance, and reviewable investigation state. The scanner is farther along than the investigation layer today.

Which tools did you compare against?

The current comparison includes Semgrep. CodeQL is being rerun under a stricter full-corpus setup, but those runs currently take more than 24 hours and have not completed cleanly enough for us to publish a reproducible headline number. We’ll publish the exact versions, configurations, and harness details alongside the benchmark methodology.

Will the corpus be public?

We intend to publish the methodology, scoring harness, benchmark dates, and pinned tool configurations. We have not finalized what portion of the corpus itself will be public.

When can I try it?

Ward is pre-release and in active development. There’s no public install today. If you want to be notified when there is, leave your email below.