examples

Two findings, step by step.

Both start the same way — Ward's static analyzer flags a suspect taint flow. From there an agent walks each candidate through the same pipeline: read source, form a hypothesis, run it past an oracle (or rule it out), file typed evidence, route to a reviewer. One ends in a sandboxed proof; the other in a structured counterexample.

EXAMPLE 01 · confirmed

aiohttp CVE-2024-23334 — path traversal under follow_symlinks=True.

A real, public CVE in aiohttp's static file route. Default-safe; exploitable only when the application developer opts in to the unsafe parameter. Ward routes this to a config-dependent evidence grade, not a raw "high severity" alert.

model: claude-sonnet-4.6
turns used: 11 of 32
wall time: 3m 15s
api spend: $0.1866
  1. Static finding · candidate

    Ward's taint analyzer flags a candidate sink.

    The scanner traces a request URL's filepath segment into the static-route handler and marks the FileResponse call as a sink. The output is conservative by design — by itself this is a hypothesis worth investigating, nothing more.

    rule: TAINT-GENERIC
    class: path_traversal
    file: aiohttp/web_urldispatcher.py:668
    sink: FileResponse(filepath, …)
    flow: src[expect_handler] → sink[fileresponse.arg0]
    conf: 0.80
  2. Investigation opens · turn 0

    Discoverer agent picks up the candidate.

    A pinned model receives the finding alongside Ward's v0.3.2 system prompt — tool catalogue, attachment contracts per evidence grade, and a turn budget. Provider, model snapshot, and generation id are all recorded so the run can be replayed or audited later.

    model: anthropic/claude-4.6-sonnet-20260217
    prompt: discoverer-system-v0.3.2
    backend: OpenRouterPinned · zdr=true
    gen id: gen-1777057578-7UaFDtUkdZcofpI8gHw7
  3. Read source · turns 2–8

    Agent walks the surrounding code.

    Three windows in web_urldispatcher.py: the _handle body around the sink, the containment-check that runs before it, and StaticResource.__init__ where the follow_symlinks knob is set.

    643  filepath = self._directory.joinpath(filename).resolve()
    644  if not self._follow_symlinks:
    645      filepath.relative_to(self._directory)   # raises if escaped
    
    667  elif filepath.is_file():
    668      return FileResponse(filepath, chunk_size=self._chunk_size)
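
The containment check in that window can be modeled with plain pathlib. This is a minimal sketch under stated assumptions, not aiohttp's code: `resolve_static`, `demo`, and the temp-directory layout are hypothetical names for illustration, and it assumes a platform where unprivileged symlinks are allowed.

```python
import tempfile
from pathlib import Path


def resolve_static(directory: Path, filename: str, follow_symlinks: bool) -> Path:
    """Simplified model of the quoted check (hypothetical helper, not aiohttp)."""
    filepath = directory.joinpath(filename).resolve()
    if not follow_symlinks:
        # Raises ValueError when the resolved path is outside the static root.
        filepath.relative_to(directory)
    return filepath


def demo() -> tuple[bool, bool]:
    """Return (blocked_by_default, escaped_when_opted_in)."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp).resolve()
        static_dir = root / "static"
        static_dir.mkdir()
        secret = root / "secret.txt"  # lives outside the static root
        secret.write_text("canary")
        (static_dir / "link").symlink_to(secret)

        try:
            resolve_static(static_dir, "link", follow_symlinks=False)
            blocked = False
        except ValueError:
            blocked = True

        # With the check skipped, the symlink target is returned as-is.
        escaped = resolve_static(static_dir, "link", follow_symlinks=True) == secret
        return blocked, escaped
```

Under these assumptions `demo()` returns `(True, True)`: the default configuration rejects the symlink target, and the opt-in configuration hands it back.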
  4. Form hypothesis · turn 8

    Agent isolates the gating condition.

    Default config blocks the traversal. The unsafe parameter removes the containment check. The library is default-safe — the bug only activates under a specific deployment choice.

    discoverer · turn 8

    "With follow_symlinks=True, the filepath.relative_to(self._directory) check is skipped. A symlink inside the static directory pointing to /etc/passwd would be followed. This is the classic follow_symlinks=True path traversal."

  5. Build & run repro · 4 attempts

    Sandboxed exploit drafted, then iterated against the harness.

    A symlink inside the static dir pointing at a secret file outside it, served through an app configured with follow_symlinks=True. The first three runs failed for harness-shape reasons — file served but oracle didn't see the canary. The agent diagnosed each miss and adapted.

    turn 9 · harness failed to start (top-level web.run_app blocked the import) · unproven
    turn 13 · harness ran, file served (200), but oracle didn't see canary in body digest · unproven
    turn 15 · added /check echo handler, canary still not in oracle's view · unproven
    turn 18 · canary inlined in response prose, read of out-of-root file echoed back · proved
  6. Oracle verdict · proven

    An independent oracle confirms the read.

    The oracle is not the model. It's a separate check — a harness canary planted in a file outside the static root that has to surface in the response body for the verdict to flip proven=true.

    oracle response · turn 19
    {
      "oracle_type": "path_traversal_canary_echo",
      "proven": true,
      "evidence": "harness canary 'WARD_PATH_TRAVERSAL_CANARY' echoed
                   in response body — traversal read the out-of-root file",
      "response": { "status": 200, "latency_ms": 0,
                    "body_digest": "blake3:903caef4…" }
    }
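
A canary-echo oracle of this kind can be sketched in a few lines. This is an illustrative reconstruction, not Ward's implementation: `canary_oracle` is a hypothetical name, the requirement that the status be 200 is an assumption, and SHA-256 stands in for the blake3 digest reported above because blake3 is not in the Python standard library.

```python
import hashlib

CANARY = "WARD_PATH_TRAVERSAL_CANARY"


def canary_oracle(status: int, body: bytes, canary: str = CANARY) -> dict:
    """Decide 'proven' from the response alone, never from the model's claim."""
    seen = canary.encode() in body
    return {
        "oracle_type": "path_traversal_canary_echo",
        # Assumption: a proof needs both a 200 response and the canary in the body.
        "proven": status == 200 and seen,
        # sha256 is a stand-in; the run above records a blake3 digest.
        "body_digest": "sha256:" + hashlib.sha256(body).hexdigest(),
    }
```

A response that serves a file but never surfaces the canary, as in the first three attempts, leaves `proven` false under this sketch.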
  7. Submit evidence · 3 attachments

    Typed attachments filed against the bundle.

    The config-dependent grade has a fixed contract: name the knob, pin the exploit source, attach the oracle's verdict. Each artifact has a content digest — a reviewer can verify nothing was edited after the fact.

    config-trigger 3aba08de…
    Names follow_symlinks, default False, unsafe True, gating site web_urldispatcher.py:529.
    repro-test ca0c033e…
    Byte-pinned aiohttp app: symlink inside static_dir pointing at a secret outside it, plus a /check handler that exposes the read for the oracle.
    repro-result 105c404c…
    exit_code 0, proven=true, response 200 with the canary echoed.
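
The "nothing was edited after the fact" property comes from recomputing each digest at review time. A minimal sketch, assuming a SHA-256 content digest (the digest algorithm Ward uses for attachments is not stated here) and hypothetical helper names:

```python
import hashlib
import hmac


def content_digest(payload: bytes) -> str:
    """Hex digest of an attachment's bytes (sha256 is an assumption)."""
    return hashlib.sha256(payload).hexdigest()


def verify_attachment(payload: bytes, recorded_digest: str) -> bool:
    """True only if the bytes on hand still match the digest filed with the bundle."""
    return hmac.compare_digest(content_digest(payload), recorded_digest)
```

Any change to an attachment after filing, even a single byte, produces a different digest and fails the check.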
  8. Claim & bundle · turn 31

    Discoverer claims; the ledger files the bundle.

    Notable: the ledger reports supported_grade: reproduced — the evidence is strong enough for the higher tier — but the discoverer claimed config_dependent because the library is default-safe. Honest grading is part of the contract, not the model's choice.

    make_claim(outcome="proved", grade_hint="config-dependent", …)
    
    → bundle: {
        "evidence_grade":   "config_dependent",
        "review_status":    "under_review",
        "supported_grade":  "reproduced",
        "missing_for_next": ["patch + regression_test"],
        "attachments":      3
      }
      claim_confidence_ppm: 900000
  9. Analyst queue · under review

    Routed for human sign-off.

    reason = config_dependent_approval. The reviewer sees the proof, the trigger, and the full provenance chain. The "fix" here is a deployment decision — audit application configs for follow_symlinks=True — not a library patch.
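
One way to run that audit is a small syntax-tree scan over application code. The sketch below is a hypothetical helper, not part of Ward: it only catches calls that pass the literal keyword `follow_symlinks=True`, so values passed through variables or config files would need a separate check. `SAMPLE` is made-up application code for illustration.

```python
import ast


def find_follow_symlinks_true(source: str) -> list[int]:
    """Return line numbers of calls that pass follow_symlinks=True literally."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        for kw in node.keywords:
            if (
                kw.arg == "follow_symlinks"
                and isinstance(kw.value, ast.Constant)
                and kw.value.value is True
            ):
                hits.append(node.lineno)
    return sorted(hits)


# Hypothetical application snippet; it is parsed, never imported or executed.
SAMPLE = '''
from aiohttp import web
app = web.Application()
app.router.add_static("/static", "./static", follow_symlinks=True)
app.router.add_static("/assets", "./assets")
'''
```

Here `find_follow_symlinks_true(SAMPLE)` returns `[4]`, the line of the route that opts in; the default-safe route on the next line is not reported.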

EXAMPLE 02 · refuted

Sanic cli/app.py:142 — taint reaches import_module, but isn't a vulnerability.

The same scanner flags a CLI-arg-controlled value flowing into import_module() in Sanic's CLI loader. A naïve "any tainted sink" rule would call this a path-traversal hit. Ward investigates, finds the reason it can't actually be exploited, and ships the explanation alongside the dismissal.

model: qwen3-next-80b-thinking
turns used: 2 of 32
wall time: 91s
api spend: ~$0.0025
  1. Static finding · candidate

    Same scanner, same rule, similar shape.

    A CLI-arg flowing through to a module loader. Conservative taint analysis says: this looks like a sink. Twenty-four distinct taint paths in this codebase converge on the same line — a strong static signal worth verifying.

    rule: TAINT-GENERIC
    class: path_traversal
    file: sanic/cli/app.py:142
    sink: import_module(module_name)
    flow: src[path] → sink[import_module.arg0]
    conf: 0.70
    variants: 24 distinct paths to same sink
  2. Investigation opens · turn 0

    Different model, same pipeline.

    An open-weights model, pinned through the same provenance-recording backend as the Sonnet run. The pipeline isn't tied to one provider, and the generation id is enough to pull the actual transcript.

    model: qwen/qwen3-next-80b-a3b-thinking-2509
    prompt: discoverer-system-v0.3.2
    backend: OpenRouterPinned · Alibaba
    gen id: gen-1777084375-nbyQMafjbeGouwlniX5a
  3. Read source · turn 1

    Agent reads the surrounding window.

    One pass at cli/app.py lines 132–152. The shape of the API — module-name resolver, not file opener — is enough to settle the question. No further reading required.

    135  f"   eg. sanic {self.args.module} --simple"
    
    142  module = import_module(module_name)
    143  app = getattr(module, app_name, None)
  4. Identify contract violation · turn 2

    Sink can't be exercised — wrong grammar.

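    import_module resolves Python module names: dot-separated identifiers. A path-traversal string is not a resolvable module name, so the call raises without opening the named file: ImportError for a name such as etc/passwd, and TypeError for a leading-dot name such as ../etc/passwd, which is parsed as a relative import with no package. The taint flow is real; the sink is not.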
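
The grammar argument is easy to check directly against the standard library. A small sketch: `import_outcome` is a hypothetical helper, and the second module name is a made-up one chosen so that it cannot exist.

```python
from importlib import import_module


def import_outcome(name: str) -> str:
    """Return 'ok' or the name of the exception class import_module raised."""
    try:
        import_module(name)
    except (ImportError, TypeError, ValueError) as exc:
        return type(exc).__name__
    return "ok"


# A leading-dot name is parsed as a relative import; with no package given,
# import_module raises TypeError before any lookup happens.
print(import_outcome("../../etc/passwd"))

# A name containing a slash is looked up as a module name and not found.
print(import_outcome("ward_demo_missing/secret"))

# A well-formed module name imports normally.
print(import_outcome("json"))
```

This prints `TypeError`, `ModuleNotFoundError` (a subclass of ImportError), and `ok`. In neither failing case is the string treated as a filesystem path.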

    discoverer · refutation

    "Python module names cannot contain path traversal characters (must be valid identifiers). import_module() requires dot-separated module names (e.g. app.main), and user-provided input results in ImportError for invalid syntax, not actual path traversal."

  5. Submit counterexample · 1 attachment

    A structured refutation, not a flat dismissal.

    A reviewer doesn't see "ignored." They see the lines that were considered, the contract that rules them out, and a content digest that pins the artifact for replay.

    counterexample 4f3f7a7c…
    Label: "False positive — Path traversal misclassification". Cites cli/app.py:142 and cli/app.py:72 as evidence sites; reasoning rests on the import_module grammar.
  6. Claim & bundle · turn 5

    Discoverer claims unproven; ledger files the bundle.

    Suppression isn't boolean. Three states track the bundle independently:

    • Filed at semantic_only · the lowest grade compatible with an "unproven" claim.
    • Evidence supports semantic_plus_counterexample · the structured counterexample is strong enough for a higher tier.
    • Awaits promotion on review · a human can lift the grade after confirming the reasoning.
    make_claim(outcome="unproven", grade_hint="semantic-plus-counterexample", …)
    
    → bundle: {
        "evidence_grade":   "semantic_only",
        "review_status":    "draft",
        "supported_grade":  "semantic_plus_counterexample",
        "missing_for_next": ["repro_result"],
        "attachments":      1
      }
      claim_confidence_ppm: 900000
  7. Analyst queue · draft

    Routed with the reasoning attached.

    reason = semantic_only_draft. The refutation stays on record, so the next time the same code shape shows up, the reasoning is already available. Ward correctly rejected a misranked candidate, and the audit trail proves it.

WHY BOTH OUTCOMES MATTER

The point of "evidence-backed" isn't that everything reaches "exploited."

A scanner's job is to find candidate vulnerabilities. The harder job, and the one that decides whether security teams actually trust a tool, is telling real ones from false positives. Ward attaches structured evidence to both outcomes.

Both bundles above were produced end-to-end by Ward's evidence pipeline: static analysis → agentic investigation → sandboxed repro or counterexample → typed evidence bundle → analyst queue. No fixture shortcuts, no human steering between steps. Combined model spend across both runs: about $0.19.