Notes — Ward

2026-05-13 · public benchmark writeup
Ward leads the tested off-the-shelf scanners on Rust unsafe-class vulnerability detection
Headline writeup of the 80-pair RUSTSEC head-to-head benchmark. Among the tested off-the-shelf scanner configurations, Ward is the only one that fires in-class true positives at user-facing severity: F1=0.655, MCC=+0.564, McNemar p ≈ 1.46 × 10⁻¹¹ vs Semgrep, CodeQL, Rudra, and cargo-geiger. Internal pre-release benchmark; public reproduction planned.
2026-05-13 · locked methodology spec
Unsafe-Rust head-to-head benchmark — methodology
Pre-registered scoring rules, statistical contract, ruleset selection, container image digest, and the exact paired finding-identity policy used to score every tool.
2026-05-13 · full headline + audit
Unsafe-Rust head-to-head benchmark — results
Paired confusion matrix, per-class breakdown, McNemar pairings, bootstrap CIs, and the audit section (a Rudra rule-id mapping fix and a stdout-parser patch, applied to verify that the headline ranking is robust).
2026-05-13 · fairness check
Unsafe-Rust head-to-head benchmark — max-breadth auxiliary results
Re-ran each competitor on its broadest publicly available ruleset (Semgrep across 1,079 rules; CodeQL on rust-security-and-quality.qls; Rudra with parser fix). The ranking does not change.

Ward's working journal.