Both start the same way — Ward's static analyzer flags a suspect taint flow. From there an agent walks each candidate through the same pipeline: read source, form a hypothesis, run it past an oracle (or rule it out), file typed evidence, route to a reviewer. One ends in a sandboxed proof; the other in a structured counterexample.
A real, public CVE in aiohttp's static file route. Default-safe; exploitable only when the application developer opts in to the unsafe parameter. Ward routes this to a config-dependent evidence grade, not a raw "high severity" alert.
The scanner traces a request URL's filepath segment into the static-route handler and marks the FileResponse call as a sink. The output is conservative by design — by itself this is a hypothesis worth investigating, nothing more.
A pinned model receives the finding alongside Ward's v0.3.2 system prompt — tool catalogue, attachment contracts per evidence grade, and a turn budget. Provider, model snapshot, and generation id are all recorded so the run can be replayed or audited later.
Three windows in web_urldispatcher.py: the _handle body around the sink, the containment-check that runs before it, and StaticResource.__init__ where the follow_symlinks knob is set.
filepath = self._directory.joinpath(filename).resolve()
if not self._follow_symlinks:
    filepath.relative_to(self._directory)  # raises if escaped
…
elif filepath.is_file():
    return FileResponse(filepath, chunk_size=self._chunk_size)
Default config blocks the traversal. The unsafe parameter removes the containment check. The library is default-safe — the bug only activates under a specific deployment choice.
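The containment pattern can be sketched with pathlib alone. This is a hypothetical helper, not aiohttp's actual code — it only mirrors the shape of the snippet above:

```python
from pathlib import Path

def resolve_static_path(root: Path, filename: str,
                        follow_symlinks: bool = False) -> Path:
    # resolve() follows any symlinks in the joined path; relative_to()
    # then raises ValueError if the resolved result left the static root.
    root = root.resolve()
    filepath = root.joinpath(filename).resolve()
    if not follow_symlinks:
        filepath.relative_to(root)  # ValueError == containment violated
    return filepath
```

With the default, a symlink that resolves outside the root trips `relative_to` before any file is opened; with `follow_symlinks=True` that guard is skipped and the out-of-root path comes back ready to serve.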
"With follow_symlinks=True, the filepath.relative_to(self._directory) check is skipped. A symlink inside the static directory pointing to /etc/passwd would be followed. This is the classic follow_symlinks=True path traversal."
A symlink inside the static dir pointing at a secret file outside it, served through an app configured with follow_symlinks=True. The first three runs failed for harness-shape reasons — the file was served, but the oracle never saw the canary. The agent diagnosed each miss and adapted.
The oracle is not the model. It's a separate check — a harness canary planted in a file outside the static root that has to surface in the response body for the verdict to flip proven=true.
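The verdict logic is simple enough to sketch. A hypothetical version of the canary-echo check, not Ward's actual implementation:

```python
CANARY = "WARD_PATH_TRAVERSAL_CANARY"

def canary_verdict(status: int, body: bytes) -> dict:
    # proven flips true only when the marker planted in a file
    # outside the static root surfaces in the HTTP response body
    proven = status == 200 and CANARY.encode() in body
    return {"oracle_type": "path_traversal_canary_echo", "proven": proven}
```

The model can claim whatever it likes; only the echoed canary moves `proven`.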
{
  "oracle_type": "path_traversal_canary_echo",
  "proven": true,
  "evidence": "harness canary 'WARD_PATH_TRAVERSAL_CANARY' echoed in response body — traversal read the out-of-root file",
  "response": {
    "status": 200,
    "latency_ms": 0,
    "body_digest": "blake3:903caef4…"
  }
}
The config-dependent grade has a fixed contract: name the knob, pin the exploit source, attach the oracle's verdict. Each artifact has a content digest — a reviewer can verify nothing was edited after the fact.
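The pin-and-verify step is ordinary content addressing. A sketch using stdlib blake2b as a stand-in — the bundles in this post pin blake3 digests, which would need the third-party `blake3` package:

```python
import hashlib

def pin(artifact: bytes) -> str:
    # content digest recorded at attachment time
    # (blake2b here; the real bundles show blake3 prefixes)
    return "blake2b:" + hashlib.blake2b(artifact).hexdigest()

def unchanged(artifact: bytes, pinned: str) -> bool:
    # reviewer-side check: recompute and compare against the pinned value
    return pin(artifact) == pinned
```

Any post-hoc edit to an attachment changes its digest, so the mismatch is mechanical to detect.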
Notable: the ledger reports supported_grade: reproduced — the evidence is strong enough for the higher tier — but the discoverer claimed config_dependent because the library is default-safe. Honest grading is part of the contract, not the model's choice.
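The honest-grading rule reduces to an ordering check. The ladder below is an assumption inferred only from the grades named in this post, not Ward's documented tier list:

```python
# Assumed ordering, lowest evidence tier first — inferred from the
# two bundles shown here, not from Ward's actual grade ladder.
GRADE_ORDER = ("semantic_only", "semantic_plus_counterexample",
               "config_dependent", "reproduced")

def claim_is_honest(claimed: str, supported: str) -> bool:
    # A claim may sit at or below the tier the evidence supports,
    # never above it — downgrading (as here) is always allowed.
    return GRADE_ORDER.index(claimed) <= GRADE_ORDER.index(supported)
```

Claiming config_dependent while the ledger supports reproduced passes; the reverse — claiming above your evidence — is what the contract forbids.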
make_claim(outcome="proved", grade_hint="config-dependent", …)
→ bundle: {
    "evidence_grade": "config_dependent",
    "review_status": "under_review",
    "supported_grade": "reproduced",
    "missing_for_next": ["patch + regression_test"],
    "attachments": 3
  }
claim_confidence_ppm: 900000
reason = config_dependent_approval. The reviewer sees the proof, the trigger, and the full provenance chain. The "fix" here is a deployment decision — audit application configs for follow_symlinks=True — not a library patch.
The same scanner flags a CLI-arg-controlled value flowing into import_module() in Sanic's CLI loader. A naïve "any tainted sink" rule would call this a path-traversal hit. Ward investigates, finds the reason it can't actually be exploited, and ships the explanation alongside the dismissal.
A CLI-arg flowing through to a module loader. Conservative taint analysis says: this looks like a sink. Twenty-four distinct taint paths in this codebase converge on the same line — a strong static signal worth verifying.
An open-weights model — pinned with the same provenance-bearing backend Sonnet uses. The pipeline isn't tied to one provider, and the run id is enough to go pull the actual transcript.
One pass at cli/app.py lines 132–152. The shape of the API — module-name resolver, not file opener — is enough to settle the question. No further reading required.
135          f" eg. sanic {self.args.module} --simple"
…
142      module = import_module(module_name)
143      app = getattr(module, app_name, None)
import_module resolves Python module names — dot-separated identifiers. Path-traversal characters aren't valid module names; the call raises ImportError before any file is touched. The taint flow is real; the sink is not.
"Python module names cannot contain path traversal characters (must be valid identifiers). import_module() requires dot-separated module names (e.g. app.main), and user-provided input results in ImportError for invalid syntax, not actual path traversal."
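The counterexample's claim is directly testable. A small illustrative probe (the helper name is ours; note that leading-dot names fail even earlier, with a TypeError from import_module's relative-import check rather than an ImportError):

```python
import importlib

def probe(name: str) -> str:
    # Traversal strings die inside module-name resolution,
    # before any file on disk is touched.
    try:
        importlib.import_module(name)
        return "imported"
    except ImportError as exc:   # slash-bearing names: no such module
        return type(exc).__name__
    except TypeError as exc:     # leading dots: relative import needs a package
        return type(exc).__name__
```

`probe("etc/passwd")` comes back ModuleNotFoundError (an ImportError subclass) and `probe("../../etc/passwd")` TypeError — either way, no file is opened.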
A reviewer doesn't see "ignored." They see the lines that were considered, the contract that rules them out, and a content digest that pins the artifact for replay.
Suppression isn't boolean. Three states track the bundle independently:
make_claim(outcome="unproven", grade_hint="semantic-plus-counterexample", …)
→ bundle: {
    "evidence_grade": "semantic_only",
    "review_status": "draft",
    "supported_grade": "semantic_plus_counterexample",
    "missing_for_next": ["repro_result"],
    "attachments": 1
  }
claim_confidence_ppm: 900000
reason = semantic_only_draft. Next time the same code shape shows up, that knowledge is still there. Ward correctly rejected a misranked candidate, and the audit trail proves it.
A scanner's job is to find candidate vulnerabilities. The harder job — and the one that decides whether security teams actually trust a tool — is telling real findings from false alarms. Ward attaches structured evidence to both outcomes.
Both bundles above were produced end-to-end by Ward's evidence pipeline: static analysis → agentic investigation → sandboxed repro or counterexample → typed evidence bundle → analyst queue. No fixture shortcuts, no human steering between steps. Combined model spend across both runs: about $0.19.