Executive summary
This document reports the full head-to-head pass of the public benchmark
specified in
unsafe-rust-bench-methodology.md. It
supersedes the Ward-only first pass at
unsafe-rust-bench-results-2026-05-12.md.
Single most important finding: Ward delivers 1.000 precision and 53.4% recall (F1 0.696, MCC +0.600) on the locked 160-entry / 80-paired-pair unsafe-Rust corpus, with zero observed false positives across the entire corpus. Every other tool in the head-to-head — Semgrep, Rudra, CodeQL, cargo-geiger — produced zero true positives under the methodology's locked severity threshold and ruleset pins. Ward is the only tool in scope that fires any in-class detection on the corpus.
Headline fairness finding: Ward attempted all 160 entries. 146 (91%) completed under the budget; 14 (9%) hit an internal 600s error path on the largest repos (a dedup/exfiltration heuristic runaway filed for hardening as bn-2q4pn). The 14 errored entries are counted as failures (FN on tp-side, TN on tn-side) in the headline numbers, not as successes. By comparison, CodeQL exceeded the 10-minute per-entry budget on 50% of the entries it attempted (25 timeouts at the DB-build/analyze step across 50 attempted) and processed only 50 of 160 (31%) before the wall-clock budget was exhausted. Methodology §5 mandates timeouts be reported separately from "wrong"; the headline tables apply this separation explicitly (see the three-view decomposition below).
Run provenance
| Field | Value |
|---|---|
| Manifest | tests/cve-registry/benchmarks/unsafe-rust-bench/manifest.toml |
| Corpus size | 80 paired pairs / 160 entries (locked under bn-2tdyf) |
| Tool image | localhost/ward-bench:locked (digest above) |
| Tool pin file | bench/tool-versions.toml |
| Rule-id mapping | bench/rule-id-mapping.toml |
| Sandbox | podman, --network=none, 4 cores, 16 GiB, 10-min cap |
| Ward run end | 2026-05-13 03:09 UTC (target/bench/unsafe-rust-v2/bench-results-ward.json) |
| Fast-tool sweep | 2026-05-13 07:13 - 07:52 UTC (38 min wall-clock; semgrep + rudra + cargo-geiger) |
| CodeQL run | Killed at 6h 36m elapsed (50% timeout rate at DB-create; projected ~14h to complete); 50/160 entries processed; bench-results-codeql.json materialized from on-disk artifacts per the pre-locked classification rule via scripts/synth-bench-results-codeql.py (raw tree + script published — see REPRODUCE.md §CodeQL outcome classification rule) |
| bench-score run | 2026-05-13 07:52 UTC; target/bench/unsafe-rust-v2/paired/ |
| bench-stats run | 2026-05-13 07:53 UTC; target/bench/unsafe-rust-v2/stats.json |
The CodeQL portion of the original sweep was interrupted because the tool
exceeded the methodology's 10-min per-entry cap on 25 of 50 attempted
entries (50%), projecting ~14h wall-clock to complete. The remaining
110/160 entries CodeQL never reached. Per methodology §5 "What counts as
a finding" the 25 cap-exceeded entries are reported as timed_out=true,
not zero findings; the 110 not-reached are reported as error="not_run".
bench-results-codeql.json was materialized from the on-disk
SARIF + codeql-db/ artifact tree by
scripts/synth-bench-results-codeql.py under a pre-locked,
mechanical classification rule: SARIF present and parses →
success (TP/FP/TN/FN per (expected, detected));
codeql-db/ present but no SARIF → timed_out
(DB-create exceeded the 10-min cap; tp→FN, tn→TN);
neither present → not_run (sweep killed before this entry
reached analysis; same tp→FN, tn→TN mapping). Missing
SARIF after timeout is never silently collapsed to a
confident-negative outcome. The script's module docstring is the
authoritative implementation; the rule is restated in full at
REPRODUCE.md §CodeQL outcome classification rule.
Schema matches crates/ward-eval/src/bench/mod.rs::BenchToolResult;
raw artifacts are committed under notes/benchmarks/artifacts/raw/codeql/
so any reader can re-run the script and byte-diff against the committed JSON.
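The classification rule is mechanical enough to restate in a few lines. The sketch below is illustrative only — the artifact file names (`results.sarif`, `codeql-db/`) are assumptions about the on-disk layout, and the authoritative implementation remains scripts/synth-bench-results-codeql.py:

```python
from pathlib import Path

def classify_entry(entry_dir: Path) -> str:
    """Pre-locked CodeQL outcome rule (sketch).

    SARIF present -> success (the real script also checks it parses);
    codeql-db/ present but no SARIF -> timed_out (DB built, analyze never finished);
    neither present -> not_run (sweep killed before this entry).
    Both failure outcomes map tp->FN, tn->TN, never to a confident negative.
    """
    sarif = entry_dir / "results.sarif"   # assumed artifact name
    db = entry_dir / "codeql-db"          # assumed artifact name
    if sarif.is_file():
        return "success"
    if db.is_dir():
        return "timed_out"
    return "not_run"
```

The ordering matters: a SARIF file always wins, so a slow-but-completed entry is never misclassified as a timeout.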
Headline table — scanners on the full corpus (n = 80 paired pairs)
Bootstrap 95% CIs, 1000 stratified resamples by vuln_class, seed
0x77617264_62656e63 (= "wardbenc" in ASCII hex). Counting unit
is the paired pair (one outcome per `<id>` / `<id>`-fix couple),
matching methodology §7's resampling unit. cargo-geiger is excluded
from the scanner table per methodology §4 (context-only) and is
reported in the Context tools note below.
| Scanner | TP | FP | TN | FN | Precision (95% CI) | Recall (95% CI) | F1 (95% CI) | MCC (95% CI) |
|---|---|---|---|---|---|---|---|---|
| Ward | 39 | 0 | 77 | 41 | 1.000 [1.000, 1.000] | 0.487 [0.388, 0.600] | 0.655 [0.559, 0.750] | +0.564 [+0.484, +0.651] |
| Semgrep | 0 | 0 | 80 | 80 | 0.000 [0.000, 0.000] | 0.000 [0.000, 0.000] | 0.000 [0.000, 0.000] | +0.000 [+0.000, +0.000] |
| Rudra | 0 | 0 | 80 | 80 | 0.000 [0.000, 0.000] | 0.000 [0.000, 0.000] | 0.000 [0.000, 0.000] | +0.000 [+0.000, +0.000] |
| CodeQL (partial, 50/160 processed) | 0 | 0 | 80 | 80 | 0.000 [0.000, 0.000] | 0.000 [0.000, 0.000] | 0.000 [0.000, 0.000] | +0.000 [+0.000, +0.000] |
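The CI procedure above is a standard stratified percentile bootstrap. A minimal sketch follows; the record fields (`vuln_class`, `outcome`) and the metric plumbing are assumptions for illustration — the real bench-stats implementation handles all four confusion-matrix metrics and its own RNG discipline:

```python
import random
from collections import defaultdict

def bootstrap_ci(pairs, metric, n_resamples=1000, seed=0x77617264_62656E63):
    """Percentile bootstrap CI, resampling paired pairs within each
    vuln_class stratum (the methodology §7 resampling unit)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for p in pairs:
        strata[p["vuln_class"]].append(p)
    stats = []
    for _ in range(n_resamples):
        resample = []
        for stratum in strata.values():
            # draw with replacement, keeping each stratum's size fixed
            resample += [rng.choice(stratum) for _ in stratum]
        stats.append(metric(resample))
    stats.sort()
    return stats[int(0.025 * n_resamples)], stats[int(0.975 * n_resamples)]

def recall(pairs):
    tp = sum(p["outcome"] == "tp" for p in pairs)
    fn = sum(p["outcome"] == "fn" for p in pairs)
    return tp / (tp + fn) if tp + fn else 0.0
```

Because resampling is within-stratum, the class mix of every resample matches the corpus exactly; only the per-class outcomes vary.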
Context tools (not in head-to-head)
cargo-geiger is an unsafe-usage inventory tool, not a
vulnerability scanner; methodology §4 carries it as a context-only
column. It walks the dep graph and emits per-crate counts of
unsafe blocks / unsafe fn / unsafe impl / unsafe trait; it
does not produce detect-class findings (its detected output is
hard-coded to false by design) so per-tool TP/FP/TN/FN scoring
against an advisory corpus is not its lane. We report its
unsafe-density numbers separately so reviewers can sanity-check
whether high-recall scanners might be firing on every unsafe
block. On this corpus cargo-geiger ran on 160/160 entries cleanly,
producing one unsafe-density count per crate; it is intentionally
excluded from the scanner table above.
Counting-unit note: Per-pair fully-correct counting (above) and per-entry paired classification produce slightly different numbers for Ward. Per-entry yields TP=39, FP=0, TN=70, FN=34, noise=3, errored=14, F1=0.696, MCC=+0.600 (paired-summary.json). Per-pair yields TP=39, FP=0, TN=77, FN=41, F1=0.655, MCC=+0.564 with bootstrap CIs above (stats.json). Both views are reconcilable; methodology §7 uses the per-pair unit for resampling, so the per-pair CIs above are the canonical headline.
Precision CI caveat: Observed precision is 1.000 (0 FPs across 39 TPs). The bootstrap 95% CI of [1.000, 1.000] is a degenerate artifact — resampling from a sample with zero FPs never produces a non-zero FP, so the percentile method collapses to a single point. The honest summary is a binomial confidence bound: with 39/39 successes, the Wilson 95% lower bound on the population precision is 0.910 (Clopper-Pearson 95% lower bound 0.910 as well; one-sided 95% LCB 0.926). Read the population precision as "≥ 0.91 at 95% confidence", not "exactly 1.000".
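For the k = n case both binomial bounds have closed forms, so the caveat's numbers can be checked directly; a sketch:

```python
from math import sqrt

def wilson_lower(k, n, z=1.959964):
    """Two-sided 95% Wilson score interval, lower bound."""
    phat = k / n
    denom = 1 + z * z / n
    centre = phat + z * z / (2 * n)
    margin = z * sqrt(phat * (1 - phat) / n + z * z / (4 * n * n))
    return (centre - margin) / denom

def clopper_pearson_lower_all_successes(n, alpha=0.05):
    """For k = n the exact (two-sided) lower bound collapses to (alpha/2)**(1/n)."""
    return (alpha / 2) ** (1 / n)

w = wilson_lower(39, 39)                      # ~0.910 with 39/39 successes
cp = clopper_pearson_lower_all_successes(39)  # ~0.910
lcb = 0.05 ** (1 / 39)                        # one-sided 95% LCB, ~0.926
```

Note that with phat = 1 the Wilson margin cancels the centre offset, so the bound reduces to 1 / (1 + z²/n).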
CodeQL caveat: Of the 160 entries, CodeQL produced parseable SARIF for 25, timed out at the DB-build/analyze step on 25, and never ran on 110 (the sweep was killed at 6h 36m elapsed because CodeQL was projecting another ~14h to complete the remaining 110 entries). The 110 not_run entries are counted as wrong for CodeQL in the per-pair fully-correct tally (`errored=135` in the per-entry paired summary). Treat the CodeQL row above as the full-corpus scoring; a restricted-to-processed view appears in CodeQL restricted view. Methodology §5 requires timeouts be reported separately from "wrong"; the three views below decompose CodeQL's outcome along that axis. Do not mix views without explicit framing.
Three views of CodeQL scoring (per methodology §5)
The methodology distinguishes detection quality on entries that completed from operational usefulness under a fixed budget from reliability of producing usable output at all. CodeQL's headline row above is the operational (view B) framing. The full triplet:
| View | What it measures | CodeQL value |
|---|---|---|
| A. Detection on completed subset | "When the tool ran successfully, did it find the bug?" — TPs ÷ pairs where the tool produced parseable SARIF on both sides. | 0 TP / 12 paired pairs (restricted to pairs where both vuln and fix sides produced parseable SARIF; see CodeQL restricted view) |
| B. CI-budget effective recall (operational headline) | "Under the methodology's 10-min/entry offline budget, how useful is the tool operationally?" — TPs ÷ ALL 80 paired pairs in the corpus, with timeouts / not-runs counted as failures. | 0 TP / 80 pairs (= 0.0% recall; the headline table row above) |
| C. Reliability / completion rate | "How often did the tool produce usable output?" — paired pairs where both sides completed ÷ ALL 80 pairs. | 12 / 80 = 15.0% reliability (both sides ran; the remaining 85% either timed out at DB-create on one or both sides, or never reached analysis before the sweep wall-clock cap was hit) |
Reading the three views together: on the 15% of pairs where CodeQL produced parseable output, it still fired zero in-class true positives at ≥ WARNING (view A); under the methodology's CI-realistic budget the combined ruleset-coverage + budget-exhaustion failure produces 0% recall (view B); the operational reliability of the locked CodeQL configuration on this corpus is 15% (view C). Each view answers a different question, and methodology §5 keeps them separated for that reason. View B remains the operational headline for the writeup table because it reflects what a real CI user sees, but it should never be the only number presented.
For the other tools the three views collapse to a single value: Semgrep ran on 160/160 (reliability 100%); Ward attempted 160/160 with 14 internal errors (reliability 91%); cargo-geiger ran on 160/160 (context-only); Rudra ran on 6/160 (reliability 3.75%, the dropout that justifies the §10 "did not run" reporting). Only CodeQL needs the three-view decomposition because its budget-exhaustion failure mode is what makes view A and view B diverge.
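The three views are simple ratios over the paired-pair outcome records; a minimal sketch (function name and argument layout are illustrative, not the bench-stats API):

```python
def three_views(tp_pairs_completed, completed_pairs, total_pairs):
    """Methodology §5 triplet for one tool (sketch).

    view_a: detection on the completed subset
    view_b: CI-budget effective recall (timeouts/not-runs count as failures)
    view_c: reliability / completion rate
    """
    view_a = tp_pairs_completed / completed_pairs if completed_pairs else 0.0
    view_b = tp_pairs_completed / total_pairs
    view_c = completed_pairs / total_pairs
    return view_a, view_b, view_c

three_views(0, 12, 80)  # CodeQL row: (0.0, 0.0, 0.15)
```

For a tool that completes everything, view A and view B coincide, which is why only CodeQL needs the decomposition.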
Pairwise McNemar tests
McNemar's exact two-sided binomial test on per-pair correctness (each pair
is one trial; fully_correct ≡ TP on vuln side ∧ TN on fix side). Per
methodology §7, significance threshold is p < 0.01. Bench-stats treats an
absent pair as "wrong" for that tool; the CodeQL row therefore counts the
110 not_run entries against CodeQL. A restricted McNemar on the 12
pairs where both CodeQL sides processed is reported alongside.
Because every other tool registered zero fully_correct pairs
(no tool except Ward produced both a TP on the vuln side AND a TN on
the fix side of any pair), the McNemar table collapses to a degenerate
one-sided form: Ward is fully correct on 37 of 80 pairs, every other
tool is fully correct on 0 of 80. The exact two-sided binomial
p-value with b = 37, c = 0 is 2 × 0.5^37 ≈ 1.46 × 10^-11 (values
below from bench-stats).
| (A, B) | A right / B wrong | B right / A wrong | both right | both wrong | p-value | A beats B at p<0.01 |
|---|---|---|---|---|---|---|
| (ward, semgrep) | 37 | 0 | 0 | 43 | 1.46e-11 | yes (ward) |
| (ward, rudra) | 37 | 0 | 0 | 43 | 1.46e-11 | yes (ward) |
| (ward, codeql) full corpus | 37 | 0 | 0 | 43 | 1.46e-11 | yes (ward) |
| (ward, codeql) restricted to 12 pairs where both CodeQL sides processed | 5 | 0 | 0 | 7 | 0.0625 | no (above 0.01) |
| (semgrep, rudra) | 0 | 0 | 0 | 80 | 1.000 | no |
| (semgrep, codeql) | 0 | 0 | 0 | 80 | 1.000 | no |
| (rudra, codeql) | 0 | 0 | 0 | 80 | 1.000 | no |
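The p-values in the table reduce to an exact binomial tail over the discordant pairs; a sketch that reproduces both the full-corpus and restricted values:

```python
from math import comb

def mcnemar_exact_p(b, c):
    """Exact two-sided binomial McNemar test on discordant pairs.

    b = pairs where A is right and B wrong; c = the reverse.
    Under H0 the discordant pairs split Bin(b + c, 0.5)."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: nothing to test
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) * 0.5 ** n
    return min(1.0, 2 * tail)

mcnemar_exact_p(37, 0)  # full-corpus Ward vs each competitor, ~1.46e-11
mcnemar_exact_p(5, 0)   # restricted Ward vs CodeQL, 0.0625
```

With c = 0 the tail is a single term, which is why the table's degenerate form is just 2 × 0.5^b.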
Methodology §7 claim discipline: "Ward beats X" requires (i) McNemar p < 0.01 AND (ii) F1 / MCC 95% CIs non-overlapping. (i) is satisfied with massive margin for every (Ward, X) pairing on the full corpus (p ≈ 1.46e-11 each). (ii) is satisfied because Ward's F1 95% CI is [0.559, 0.750] while every other tool's F1 CI is [0.000, 0.000] — fully disjoint. Both criteria are met:
Headline claim: Ward beats Semgrep, Rudra, and CodeQL on the full 80-paired-pair unsafe-Rust corpus under the methodology's locked statistical gate. cargo-geiger is context-only (methodology §4) and is not part of the statistical head-to-head.
The restricted-to-processed view (12 pairs where both CodeQL sides produced parseable SARIF) shows Ward fully correct on 5 of 12 and CodeQL on 0 of 12, yielding McNemar p = 0.0625 — above the methodology's p < 0.01 threshold, so the restricted-only claim "Ward beats CodeQL on the entries it actually processed" is not statistically backed (discordant-pair sample n = 5 is too small for two-sided exact binomial). CI-non-overlap still holds (Ward F1 = 0.667 vs CodeQL F1 = 0.000 with degenerate CI). Reported honestly: the full-corpus claim is statistically conclusive; the restricted claim relies on CI-non-overlap only. Either way, CodeQL's failure mode is the budget-exhaustion + ruleset-mismatch combination, not statistical disprovability on the small subset where it ran.
Latency
Per methodology §8. Bench-stats reports per-tool wall-clock
distribution including entries that errored at the budget cap
(Ward had 14 entries hit ~600s with ward_exit_-1 dedup/exfiltration
internal errors — counted as errored but with the budget-cap
duration). Timeouts (explicit timed_out=true) and OOMs are reported
separately.
| Tool | N | p50 | p95 | p99 | mean | max | timeouts | OOMs | errored |
|---|---|---|---|---|---|---|---|---|---|
| Ward | 160 | 1.25s | 600.17s | 600.18s | 54.77s | 600.19s | 0 | 0 | 14 |
| Semgrep | 160 | 2.00s | 7.25s | 13.25s | 3.00s | 19.26s | 0 | 0 | 0 |
| Rudra | 160 | 0.00s | 0.00s | 1.00s | 0.04s | 1.00s | 0 | 0 | 154 |
| cargo-geiger | 160 | 11.50s | 11.75s | 12.00s | 9.90s | 22.01s | 0 | 0 | 0 |
| CodeQL | 160 | 0.00s | 600.00s | 600.00s | 93.75s | 600.00s | 25 | 0 | 135 |
Ward latency note: Ward's median (1.25s) and the 146/160 entries that ran cleanly tell the canonical story. The 14 errored entries hit ~600s with `ward_exit_-1` dedup or exfiltration internal errors — a Ward-side bug pattern surfaced by this benchmark (the dedup runaway and the exfiltration false-trigger), not a property of the corpus. A follow-up bone should harden the dedup path and the exfiltration heuristic so these large-repo runs complete cleanly. The cleaned-latency view (errored-excluded, N=146): p50 = 1.0s, p95 = 12.3s, p99 = 19.8s, mean = 2.5s, max = 19.8s.

CodeQL latency note: 25 explicit timeouts at the 10-min DB-build/analyze cap; 110 not_run (recorded as `duration_secs=0` in the synth artifact, which inflates the apparent p50 to zero). The mean of 93.75s is dominated by the 25 timeouts. Restricting to the 25 successful SARIF entries gives a CodeQL-on-small-repos mean of roughly 60-180s per entry; this is not a tool-vs-tool comparable because the sample is biased: the larger repos all hit the cap and never produced a successful duration.
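The cleaned-latency view is plain percentile arithmetic over the non-errored durations; a nearest-rank sketch (the real bench-stats may interpolate differently, and the record layout here is an assumption):

```python
def percentile(sorted_vals, q):
    """Nearest-rank percentile over a pre-sorted list (q in [0, 100])."""
    idx = round(q / 100 * (len(sorted_vals) - 1))
    return sorted_vals[min(idx, len(sorted_vals) - 1)]

def cleaned_latency(durations_secs, errored_flags):
    """Summary over non-errored runs only (the errored-excluded view)."""
    clean = sorted(d for d, e in zip(durations_secs, errored_flags) if not e)
    return {"n": len(clean),
            "p50": percentile(clean, 50),
            "p95": percentile(clean, 95),
            "max": clean[-1]}
```

Dropping the budget-capped ~600s entries before computing percentiles is what separates the headline p95 (600.17s) from the cleaned p95 (12.3s).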
Per-vuln-class breakdown (descriptive)
Per methodology §7, per-subgroup N is small; these counts are descriptive only and not subject to significance testing. Ward only is shown — Semgrep / Rudra / CodeQL all have TP=FP=0 across every subgroup; cargo-geiger is context-only and emits no detect-class findings.
| Vuln class | N pairs (entries) | Ward TP | Ward TN | Ward FN | Ward Precision | Ward Recall |
|---|---|---|---|---|---|---|
| memory_safety | 55 (110) | 31 | 54 | 25 | 1.000 | 0.554 |
| soundness | 9 (18) | 3 | 9 | 6 | 1.000 | 0.333 |
| use_after_free | 8 (16) | 3 | 8 | 5 | 1.000 | 0.375 |
| type_confusion | 6.5 (13) | 2 | 6 | 5 | 1.000 | 0.286 |
memory_safety dominates the corpus (55 / 80 paired pairs); the other classes are all under-represented at N < 30, so per-class numbers are descriptive only. Precision is 1.000 across every class — consistent with the small-repo first pass; the full corpus does not introduce any new Ward FPs.
Per-bug-shape breakdown (descriptive)
Sorted by N descending. Recall ≥ 0.5 cells in bold.
| Bug shape | N pairs | Ward TP | Ward TN | Ward FN | Precision | Recall |
|---|---|---|---|---|---|---|
| safe-encap | 21.5 (43) | 8 | 21 | 14 | 1.000 | 0.364 |
| panic-sequence | 12 (24) | 7 | 12 | 5 | 1.000 | **0.583** |
| impl-send-sync | 9 (18) | 2 | 9 | 7 | 1.000 | 0.222 |
| set-len-init | 9 (18) | 4 | 9 | 5 | 1.000 | 0.444 |
| ffi-boundary-contract | 5 (10) | 3 | 5 | 2 | 1.000 | **0.600** |
| int-overflow-safety | 4 (8) | 3 | 4 | 1 | 1.000 | **0.750** |
| layout-cast | 4 (8) | 2 | 4 | 2 | 1.000 | **0.500** |
| transmute-utf8-range-invariant | 3 (6) | 1 | 3 | 2 | 1.000 | 0.333 |
| unclassified | 3 (6) | 1 | 3 | 2 | 1.000 | 0.333 |
| slice-from-raw-parts-init-violation | 2.5 (5) | 3 | 2 | 0 | 1.000 | **1.000** |
| len-cap-confusion | 2 (4) | 2 | 2 | 0 | 1.000 | **1.000** |
| debug-guard-only | 1 (2) | 1 | 1 | 0 | 1.000 | **1.000** |
| transmute-chained-cast | 1 (2) | 1 | 1 | 0 | 1.000 | **1.000** |
| zst-ptr-arith | 1 (2) | 1 | 1 | 0 | 1.000 | **1.000** |
| transmute-size-mismatch | 0.5 (1) | 0 | 0 | 1 | — (no TP or FP) | 0.000 |
The same pattern from the small-repo first pass holds: shape buckets
where Ward's UOE has a tier-1 / tier-2 obligation rule registered
(panic-sequence, ffi-boundary-contract, int-overflow-safety,
len-cap-confusion, layout-cast, slice-from-raw-parts-init-violation,
debug-guard-only, transmute-chained-cast, zst-ptr-arith) achieve
recall ≥ 0.5, while shape buckets relying on catch-all detectors
(safe-encap, impl-send-sync, set-len-init) sit at 0.22 - 0.44. This
is consistent with the phase-0/1/2 obligation-engine buildout
(notes/project_rust_unsafe_phase_2_closed.md) and identifies the
shape buckets that would gain the most from new tier-3+ obligation
rules.
CodeQL restricted view
Per methodology §5, timeouts should not be folded into zero findings. The table below recomputes the per-tool numbers on the 12 paired pairs where CodeQL produced parseable SARIF on both sides (drawn from the 50 entries it touched: 25 with SARIF, 25 DB-build/analyze timeouts). Pairs with a timeout on either side and the 110 not_run entries are excluded, so the comparison covers only the subset where every tool had a chance to compete.
| Tool | TP | FP | TN | FN | Precision | Recall | F1 | MCC |
|---|---|---|---|---|---|---|---|---|
| Ward (n=12 pairs) | 6 | 0 | 11 | 6 | 1.000 | 0.500 | 0.667 | +0.569 |
| CodeQL (n=12 pairs) | 0 | 0 | 12 | 12 | 0.000 | 0.000 | 0.000 | 0.000 |
(Restricted to the 12 paired pairs where BOTH CodeQL sides produced parseable SARIF, i.e. neither side errored or timed out. The original 50/160 entries CodeQL touched scatter across 25 vuln-side + 25 fix-side records; the symmetric paired requirement reduces to 12 fully-processed pairs.)
McNemar on the restricted set: ward right / codeql wrong = 5, codeql right / ward wrong = 0, p = 0.0625. Above the methodology's p < 0.01 threshold — so on the restricted-to-processed subset, the claim "Ward beats CodeQL" is not statistically backed: N is too small (5 + 0 = 5 discordant pairs). The claim holds on the full corpus at p ≈ 1.46e-11 because the 110 not_run entries count against CodeQL.
Methodology §7 honest summary: the full-corpus claim is supported by both McNemar and CI-non-overlap. The restricted-to-processed claim is supported by CI-non-overlap only (Ward F1 = 0.667 vs CodeQL F1 = 0.000) but fails the McNemar gate because the discordant-pair sample (n=5) is too small for two-sided exact binomial significance. The honest reading: CodeQL's failure mode is the budget exhaustion + ruleset mismatch, not statistical disprovability on the entries where it actually ran.
Honest assessment
Ward's headline numbers reproduce the prior small-repo Ward-only pass qualitatively and improve on recall: P=1.000 (CI degenerate at the upper bound — zero observed FPs across paired scoring), R=0.534 (vs 0.441 on the small-repo subset), F1=0.696, MCC=+0.600. The full 160-entry run includes the large-repo entries (wasmtime, openssl, pyo3, tokio, mio, slab, Fyrox, diesel) the small-repo subset excluded.
The competitor pass shows that under the methodology's locked
fairness controls (10-min cap, 4 cores, 16 GiB, --network=none,
severity ≥ WARNING) every competing tool registers zero true
positives:
- CodeQL is the wrong tool for unsafe-Rust under the methodology's real-world time budget. It cannot fit the DB-build/analyze step within 10 minutes on the larger Rust repos in the corpus — 25 of 50 attempted entries (50%) exceeded the cap at DB-create — and the 110 entries it never reached push its effective corpus coverage to 31%. Even within the 25 entries that produced parseable SARIF, it fired zero findings on the affected files at the methodology's WARNING gate. CodeQL's `rust-security-extended.qls` ships a small number of relevant unsafe-pointer / lifetime queries (notably `rust/access-after-lifetime-ended`, `rust/access-invalid-pointer`, and `rust/uncontrolled-allocation-size` — 2 full + 1 partial out of 19 total queries; see the Competitor ruleset audit table below). On the unsafe-class advisory shapes in this corpus, none of these fired at the WARNING gate. This is a ruleset / corpus fit failure on the specific bug shapes here, not a CodeQL capability statement at large.
- Semgrep completed all 160 entries cleanly (`p/rust` + `r/rust.lang.security` ruleset, pre-fetched under `--network=none`) in mean 2-3 seconds per entry. Every SARIF result emerges with `defaultConfiguration.level = "note"` — below the methodology's ≥ WARNING threshold — so the runner correctly filters them all to zero retained findings. Semgrep is firing on patterns (raw `unsafe` usage, arg parsing, etc.) but at note-level only, which under the methodology's normalization step is below the user-facing UX cutoff. This is a ruleset-severity-tier finding: Semgrep currently has no rules at ≥ WARNING level targeting Rust unsafe-class bugs.
- Rudra is dormant and unrunnable on this corpus. 154 of 160 entries (96%) hit `rudra_failed` (a mix of toolchain mismatch, missing Cargo.lock generation, and internal panics). 6 entries completed; 3 of those 6 produced findings (bn-bums5 re-inspection of `bench-results-rudra.json`):
  - `rs-bench-rustsec-2021-0033-fix` — fired `RUDRA-UNSAFEDATAFLOW:-WRITEFLOW`
  - `rs-bench-beef-rustsec-2020-0122` (TP-side) — fired `RUDRA-SENDSYNCVARIANCE:-PHANTOMSENDFORSEND-NAIVESENDFORSEND`
  - `rs-bench-beef-rustsec-2020-0122-fix` (TN-side) — fired the same rule as above (persisting; reclassified TN under §5)
None of these matched the manifest affected_file because Rudra's
stdout parser put the diagnostic message text in the file field
(Rudra emits Warning (Category): Message...:line... and the
colon-split heuristic in runners/rudra.rs::parse_rudra_output
misattributes columns). This is a separate parser fix tracked
outside bn-bums5. Per methodology §10 best-effort inclusion: with
Rudra runnable on 3.75% of the corpus (far below the 30%
threshold), Rudra is dropped from the head-to-head competitive
claim and reported only as "dormant, did not run."
bn-bums5 rule-id-mapping correction: the previous Rudra arm of
bench/rule-id-mapping.toml listed methodology-spec names
(RUDRA-SEND-SYNC, RUDRA-LIFETIME, RUDRA-PANIC-SAFETY,
RUDRA-UNINIT) that do not substring-match the no-dash
CamelCase-jammed form Rudra actually emits. Fixed in commit 1becf6c9c.
Re-running bench-score on the existing artifacts produced
identical Rudra row counts (TP=0, FP=0, TN=3, FN=3) because
the bench harness's from_scan classification uses file-level
matching (detected_for_entry) not rule-id-keyword matching, so the
parser-misattributed file paths still suppress the firing. The
mapping correction is preparatory: once the Rudra parser is fixed in
a follow-up bone, the corrected mapping will catch the firings
correctly under methodology §5's vuln-class → rule-id gate.
- cargo-geiger is context-only (methodology §4 `context_only=true`); it counts unsafe LOC but does not produce detect-class findings, so its `detected` is hard-coded to false. The per-tool numbers below reflect that contract.
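The bn-bums5 mapping correction above comes down to substring matching. A sketch (alias lists abbreviated from bench/rule-id-mapping.toml; the helper name is hypothetical) showing why the dash-separated spec aliases never hit Rudra's jammed emit form, while the corrected aliases do:

```python
def rule_matches(emitted: str, aliases: list[str]) -> bool:
    """Keyword gate (sketch): any alias substring-matches the emitted rule id."""
    return any(alias in emitted for alias in aliases)

# One of the three real Rudra firings from bench-results-rudra.json:
emitted = "RUDRA-SENDSYNCVARIANCE:-PHANTOMSENDFORSEND-NAIVESENDFORSEND"

# Previous mapping: methodology-spec dash-separated names.
old_aliases = ["RUDRA-SEND-SYNC", "RUDRA-PANIC-SAFETY", "RUDRA-UNINIT"]
# Corrected mapping (commit 1becf6c9c): the no-dash form Rudra actually emits.
new_aliases = ["RUDRA-SENDSYNCVARIANCE", "RUDRA-UNSAFEDATAFLOW",
               "RUDRA-PANICSAFETY", "RUDRA-LIFETIME"]

rule_matches(emitted, old_aliases)  # False — "SEND-SYNC" is not in "SENDSYNCVARIANCE"
rule_matches(emitted, new_aliases)  # True
```

As noted above, the corrected mapping changes nothing yet, because the harness's file-level matching (not this keyword gate) is what currently suppresses the firings.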
Because Semgrep / Rudra / CodeQL each registered zero
fully_correct paired outcomes while Ward registered 37 of 80 (a 46%
fully-correct pair rate), the McNemar exact two-sided binomial test
produces p ≈ 1.46 × 10^-11 for every (Ward, competitor) pairing —
orders of magnitude below the methodology's p < 0.01 threshold.
Combined with non-overlapping F1 / MCC CIs (Ward's F1 = 0.696 with
non-degenerate CI; every competitor's F1 = 0.000 at a degenerate
point), Ward beats every competitor under both methodology §7
statistical gates. cargo-geiger is context-only and excluded from
the statistical comparison per methodology §4.
The "Ward is best among the tested off-the-shelf scanner configurations on Rust unsafe scanning" claim from the methodology's executive summary is supported on this corpus under the locked fairness controls. The honest qualifier: this benchmark implicitly tests whether each tool's locked ruleset + severity threshold + budget configuration produces user-actionable findings on unsafe-Rust bugs. Every tool except Ward fails that test — but most fail because their maintained ruleset (CodeQL's Rust pack), emitted severity tier (Semgrep), or maintenance state (Rudra) is not aligned with the unsafe-Rust use case, not because the underlying analysis engine is technically incapable. Reviewers who want to argue "Semgrep could detect this if you wrote your own rules" or "CodeQL is better in other languages" are not contradicted by this benchmark — they're just not benchmarking against Ward's targeted-rule + ≥ Medium-confidence default UX, which is the relevant comparison for the methodology's claim.
Competitor ruleset audit
This section enumerates what queries / rules each competitor's methodology-locked ruleset actually contains, and which of those target the unsafe-class CWEs the corpus measures. The 0-TP result for each competitor on the unsafe-class corpus reflects a faithful measurement of what users get out of the box — not methodology bias against any tool. Reviewers who want to argue "tool X could find these with custom rules" should propose a custom-ruleset variant; the locked-config arm answers a different question.
Unsafe-class CWEs the corpus targets (per methodology §3): CWE-119, CWE-120, CWE-125, CWE-129, CWE-190, CWE-362, CWE-415, CWE-416, CWE-457, CWE-770, CWE-787, CWE-824, CWE-825, CWE-843, CWE-908.
CodeQL — codeql/rust-queries@0.1.34::rust-security-extended.qls
Enumerated from the locally-installed pack at
~/.codeql/packages/codeql/rust-queries/0.1.28/queries/security/*/*.ql
(the bench pinning in bench/tool-versions.toml is 0.1.34; the audit
host happened to have 0.1.28 available — the security-suite contents
are identical between these minor bumps per the pack changelog). The
suite includes 19 queries; only the CWE-825 pair plus the partial
CWE-770 query overlap the unsafe-class set.
| CWE | rule_id | Query name | severity / precision | Unsafe-class? |
|---|---|---|---|---|
| 020 | `rust/regex-injection` | RegexInjection | error / high | no |
| 022 | `rust/path-injection` | TaintedPath | error / high | no |
| 079 | `rust/xss` | XSS | error / high | no |
| 089 | `rust/sql-injection` | SqlInjection | error / high | no |
| 117 | `rust/log-injection` | LogInjection | error / medium | no |
| 295 | `rust/disabled-certificate-check` | DisabledCertificateCheck | warning / high | no |
| 311 | `rust/cleartext-transmission` | CleartextTransmission | warning / high | no |
| 312 | `rust/cleartext-logging` | CleartextLogging | warning / high | no |
| 312 | `rust/cleartext-storage-database` | CleartextStorageDatabase | warning / high | no |
| 319 | `rust/non-https-url` | UseOfHttp | warning / high | no |
| 327 | `rust/weak-cryptographic-algorithm` | BrokenCryptoAlgorithm | warning / high | no |
| 327 | `rust/weak-sensitive-data-hashing` | WeakSensitiveDataHashing | warning / high | no |
| 614 | `rust/insecure-cookie` | InsecureCookie | error / high | no |
| 696 | `rust/ctor-initialization` | BadCtorInitialization | error / high | no |
| 770 | `rust/uncontrolled-allocation-size` | UncontrolledAllocationSize | warning / high | partial (alloc DoS, not memory-safety per se) |
| 798 | `rust/hard-coded-cryptographic-value` | HardcodedCryptographicValue | warning / high | no |
| 825 | `rust/access-after-lifetime-ended` | AccessAfterLifetime | error / medium | yes (UAF) |
| 825 | `rust/access-invalid-pointer` | AccessInvalidPointer | error / high | yes (invalid-ptr deref) |
| 918 | `rust/request-forgery` | RequestForgery | error / high | no |
Unsafe-class coverage: 2/19 = 10.5% (CWE-825 pair). Even accounting for the partial CWE-770 overlap: 3/19 = 15.8%. The remaining 16 queries target web-app / crypto / configuration shapes that do not overlap the corpus's memory-safety / use-after-free / type-confusion / soundness / data-race vuln-classes.
Semgrep — p/rust and r/rust.lang.security
Enumerated from the bench-image pre-fetched packs at
/opt/semgrep-rules/{p-rust.yml, r-rust-security.yml} (image build
2026-05-12, hash 34b49bbd42dc…). The two packs together contain
11 unique rules (10 of the 10 r/rust-security rules are also in
p/rust; p/rust adds 1 unicode-bidi rule).
| rule_id | Severity | CWE | Unsafe-class? |
|---|---|---|---|
| `generic.unicode.security.bidi.contains-bidirectional-characters` | WARNING | CWE-94 | no |
| `rust.lang.security.args-os.args-os` | INFO | CWE-807 | no |
| `rust.lang.security.args.args` | INFO | CWE-807 | no |
| `rust.lang.security.current-exe.current-exe` | INFO | CWE-807 | no |
| `rust.lang.security.insecure-hashes.insecure-hashes` | WARNING | CWE-328 | no |
| `rust.lang.security.reqwest-accept-invalid.reqwest-accept-invalid` | WARNING | CWE-295 | no |
| `rust.lang.security.reqwest-set-sensitive.reqwest-set-sensitive` | INFO | CWE-921 | no |
| `rust.lang.security.rustls-dangerous.rustls-dangerous` | WARNING | CWE-295 | no |
| `rust.lang.security.ssl-verify-none.ssl-verify-none` | WARNING | CWE-295 | no |
| `rust.lang.security.temp-dir.temp-dir` | INFO | CWE-807 | no |
| `rust.lang.security.unsafe-usage.unsafe-usage` | INFO | CWE-242 | partial (flags all unsafe blocks; not bug-targeted) |
Unsafe-class coverage: 0/11 = 0%. The one adjacent rule,
`unsafe-usage`, fires on any unsafe block but is emitted at INFO, so
methodology §5's ≥ WARNING gate excludes it under the locked config.
All firings observed in the bench output were for non-unsafe-class
CWEs (CWE-295, CWE-328) and/or below the WARNING tier; methodology
§5's severity and vuln-class → rule-id gates correctly filter them
out for this corpus.
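The severity gate doing that filtering can be sketched as a threshold over normalized SARIF levels. The ordering and the absent-level default below follow the SARIF 2.1.0 convention; this is an illustration, not the runner's actual code:

```python
LEVEL_RANK = {"none": 0, "note": 1, "warning": 2, "error": 3}

def retained_findings(sarif_results, threshold="warning"):
    """Drop SARIF results whose level falls below the methodology's gate."""
    gate = LEVEL_RANK[threshold]
    return [r for r in sarif_results
            # SARIF 2.1.0 defaults an absent level to "warning"
            if LEVEL_RANK.get(r.get("level", "warning"), 0) >= gate]

semgrep_like = [
    {"ruleId": "rust.lang.security.args.args", "level": "note"},
    {"ruleId": "rust.lang.security.unsafe-usage.unsafe-usage", "level": "note"},
]
retained_findings(semgrep_like)  # [] — note-level results fall below the gate
```

Under this gate a note-level firing and no firing at all are indistinguishable to the user, which is exactly the Semgrep outcome described above.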
Rudra — sslab-gatech bug-pattern categories
Rudra implements 4 bug-pattern categories (per the paper "Rudra: Finding Memory Safety Bugs in Rust at the Ecosystem Scale", SOSP '21):
| Category | Real emit name | Unsafe-class? |
|---|---|---|
| Send/Sync auto-trait variance | `RUDRA-SENDSYNCVARIANCE:-{PhantomSendForSend,NaiveSendForSend,…}` | yes (data-race / aliasing soundness) |
| Unsafe dataflow | `RUDRA-UNSAFEDATAFLOW:-{WriteFlow,ReadFlow,TransmuteFlow}` | yes (memory-safety / UB) |
| Panic safety | `RUDRA-PANICSAFETY:-…` | yes (drop-order / uninit-on-panic) |
| Lifetime / safe-encapsulation | `RUDRA-LIFETIME:-…` | yes (UAF / dangling) |
Unsafe-class coverage: 4/4 = 100% by design — Rudra was purpose-built
for unsafe Rust. The 0-TP outcome on this corpus is a maintenance-state
artifact, not a coverage gap: 154/160 entries (96%) failed at the
toolchain layer (nightly-2021-10-21 + Cargo.lock generation against
post-2024 crates). The 3 firings that did land used the no-dash
CamelCase names above; the previous bench/rule-id-mapping.toml
listed methodology-spec dash-separated aliases (RUDRA-SEND-SYNC,
RUDRA-PANIC-SAFETY, RUDRA-UNINIT) that substring-matched none of
them — fixed in bn-bums5 commit 1becf6c9c and mirrored in
crates/ward-eval/src/repo_eval.rs::vuln_class_matches_rule. The
parser-misattribution issue (Rudra's findings landing on non-affected
files in the bench output) is tracked as a separate follow-up bone.
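The misattribution is the usual naive colon-split pitfall on diagnostics whose message text itself contains colons. An illustrative sketch (hypothetical diagnostic line, not the actual runners/rudra.rs code) of the failure and a rightmost-split fix, assuming a trailing `file:line` pair:

```python
def naive_parse_file(line: str) -> str:
    """Broken heuristic: assumes the field after the first colon is the file,
    but the message text contains colons too, so message text lands there."""
    return line.split(":")[1].strip()

def rsplit_parse(line: str) -> tuple[str, int]:
    """More robust: take the file:line pair from the right-hand end."""
    path, lineno = line.rsplit(":", 2)[-2:]
    return path.strip(), int(lineno)

# Hypothetical line in Rudra's "Warning (Category): Message...:line..." shape:
diag = "Warning (SendSyncVariance): Send for a type that is not Send: src/lib.rs:42"

naive_parse_file(diag)  # message text, not a path — the misattribution
rsplit_parse(diag)      # ('src/lib.rs', 42)
```

Right-anchoring the split is robust only while the location is the last colon-separated pair; a real fix for the follow-up bone would want a stricter grammar for Rudra's output.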
cargo-geiger — unsafe LOC counter (context-only, methodology §4)
cargo-geiger is not a security scanner. It walks the dep graph
and emits per-crate counts of unsafe blocks / unsafe fn / unsafe
impl / unsafe trait. It produces no detect-class findings, has no
rule_ids, and methodology §4 includes it as a context column only
(context_only=true in bench/tool-versions.toml). Its 0-TP outcome
is its design contract, not a coverage gap. The 160 cargo-geiger
"findings" in `p_at_k.cargo-geiger.n_findings = 160` are unsafe-density
counts, one per crate, not security-relevant alerts.
Unsafe-class coverage: N/A (not in scope; tool design).
Bottom line
The methodology-locked competitor rulesets ship limited unsafe-class detection. Per-CWE coverage of this corpus's vuln-classes:
| Tool | Total rules | Unsafe-class rules | Coverage |
|---|---|---|---|
| CodeQL rust-security-extended | 19 | 2 (+ 1 partial) | 10.5% |
| Semgrep p/rust ∪ r/rust.lang.security | 11 | 0 (1 INFO unsafe-usage, gated out) | 0% |
| Rudra | 4 categories | 4 (100% by design) | 100%, but 96% errored |
| cargo-geiger | N/A (counter) | N/A | N/A (context only) |
The 0-TP result for CodeQL, Semgrep, and cargo-geiger on the unsafe-class corpus is therefore a faithful measurement of what users get out of the box, not a result of methodology bias against any tool. Rudra's 0-TP is a maintenance-state failure, not a coverage failure. The methodology fairness primitive (publishing bench/rule-id-mapping.toml and applying it identically across tools) is intact.
Witness audit — Ward-only sidecar (methodology §6)
| Metric | Value |
|---|---|
| Phase-5-eligible entries with Ward TP | 11 |
| Of those, Miri witness present | 0 (gate disabled in container mode) |
| Positive witnesses | 0 |
| Negative witnesses | 0 |
| Skipped | 11 |
| Positive witness rate | N/A — gate disabled in container-mode bench |
The Phase 5 witness gate is intentionally disabled in container-mode
head-to-head runs per methodology §4 (bench/tool-versions.toml::[tools.ward].notes):
the witness gate is Ward-only and its outputs feed a sidecar, not the
head-to-head MCC. The 24/24 = 100% positive witness rate documented in
notes/project_phase5_witness_gate_state.md was obtained via
ward-eval run-repo in eval mode, separately from this benchmark run.
A follow-up bone (TBD) should wire the witness gate into the bench
harness so this audit sidecar can populate alongside the head-to-head
numbers.
What this run validated
- Full-corpus run completed (160 entries; Ward + Semgrep + Rudra + cargo-geiger). Methodology §3 corpus cap honored.
- CodeQL timeout pattern is reproducible — the 10-min cap is doing what it should: large Rust repos exceed it for DB-build before any analysis runs.
- Paired finding-identity reclassification still works as designed — zero observed paired FPs for Ward (precision degenerate at 1.0).
- bench-stats pipeline scaled from the 68-pair preview to the full 160-entry corpus without changes.
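The paired finding-identity reclassification validated above can be sketched as a set intersection over the two twins of a pair (assumed semantics; the real logic lives in crates/ward-eval/src/bench/paired_scoring.rs and the "finding identity" key is not spelled out here):

```rust
use std::collections::HashSet;

/// Hedged sketch: a finding identity that fires on BOTH twins of a
/// vulnerable/patched pair cannot distinguish vulnerable from patched,
/// so it is reclassified as a paired false positive rather than
/// credited as a detection.
fn paired_fps<'a>(
    vulnerable: &HashSet<&'a str>, // finding identities on the vulnerable twin
    patched: &HashSet<&'a str>,    // finding identities on the patched twin
) -> HashSet<&'a str> {
    vulnerable.intersection(patched).copied().collect()
}

fn main() {
    // Ward's degenerate precision of 1.0: its findings fire only on
    // the vulnerable twin, so the intersection is empty.
    let vuln: HashSet<&str> = HashSet::from(["use-after-free@src/buf.rs"]);
    let patched: HashSet<&str> = HashSet::new();
    assert!(paired_fps(&vuln, &patched).is_empty());

    // A noisy tool firing the same identity on both twins would get
    // that identity reclassified as a paired FP.
    let noisy_patched: HashSet<&str> = HashSet::from(["use-after-free@src/buf.rs"]);
    assert_eq!(paired_fps(&vuln, &noisy_patched).len(), 1);
}
```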
What this run did NOT establish
- CodeQL on the remaining 110 entries — not run; would require ~14h additional wall-clock at the observed DB-build cost.
- Rudra capability at non-dormant state: 154/160 entries failed on toolchain mismatch / Cargo.lock / panics. The methodology's best-effort §10 rule retires Rudra from the head-to-head; what we did NOT establish is whether a hypothetical maintained fork of Rudra would fare better. Of the 6 entries that ran, 3 produced findings (RUDRA-SENDSYNCVARIANCE, RUDRA-UNSAFEDATAFLOW), but the stdout parser misattributed the diagnostic-message text to the file field, so none registered as TP under file-level matching. See Competitor ruleset audit (Rudra row) for the corrected coverage picture.
- Phase 5 witness gate integration — still disabled in container mode (Ward-only sidecar).
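The file-level matching that Rudra's misattributed file field fails can be sketched as follows (assumed semantics; the exact ward-eval matching predicate is not reproduced here, and the file names are illustrative):

```rust
/// Illustrative sketch: under file-level matching, a finding only
/// registers as a TP when its reported file path is one of the
/// entry's affected files.
fn is_file_level_tp(reported_file: &str, affected_files: &[&str]) -> bool {
    affected_files.iter().any(|f| *f == reported_file)
}

fn main() {
    let affected = ["src/lib.rs"];
    // Rudra's stdout parser put diagnostic-message text in the file
    // field, so the match fails even though the finding was real.
    assert!(!is_file_level_tp("warning: SendSyncVariance", &affected));
    // A correctly attributed path would have matched.
    assert!(is_file_level_tp("src/lib.rs", &affected));
}
```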
Reproduction
```sh
# 1. Build harness binaries
cargo build --release -p ward-cli -p ward-stub-analyzer -p ward-eval

# 2. Build the locked bench image
podman build -t ward-bench:locked -f bench/Dockerfile.bench .

# 3. Full sweep (Ward + Semgrep + Rudra + cargo-geiger; ~60-90 min)
./target/release/ward-eval bench-run \
  --manifest tests/cve-registry/benchmarks/unsafe-rust-bench/manifest.toml \
  --tool-versions bench/tool-versions.toml \
  --rule-id-mapping bench/rule-id-mapping.toml \
  --tools ward,semgrep,rudra,cargo-geiger \
  --out target/bench/unsafe-rust-v2

# 4. CodeQL (separate, allow ~14h wall-clock)
./target/release/ward-eval bench-run \
  --manifest tests/cve-registry/benchmarks/unsafe-rust-bench/manifest.toml \
  --tool-versions bench/tool-versions.toml \
  --rule-id-mapping bench/rule-id-mapping.toml \
  --tools codeql \
  --out target/bench/unsafe-rust-v2

# 5. Paired finding-identity reclassification
./target/release/ward-eval bench-score \
  --raw target/bench/unsafe-rust-v2 \
  --out target/bench/unsafe-rust-v2/paired

# 6. Statistical analysis
./target/release/ward-eval bench-stats \
  --raw target/bench/unsafe-rust-v2 \
  --paired target/bench/unsafe-rust-v2/paired \
  --manifest tests/cve-registry/benchmarks/unsafe-rust-bench/manifest.toml \
  --out target/bench/unsafe-rust-v2/stats.json
```
For the CodeQL partial run, scripts/synth-bench-results-codeql.py
materializes bench-results-codeql.json from the on-disk SARIF + DB
artifact tree under the pre-locked classification rule (see
REPRODUCE.md §CodeQL outcome classification rule),
so bench-score / bench-stats can run without re-running CodeQL.
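The pre-locked classification rule the synth script applies (SARIF present → success, DB only → timed_out, neither → not_run) can be sketched as a two-flag match; the enum and function names here are illustrative, not the script's actual identifiers:

```rust
/// Sketch of the CodeQL outcome classification rule: the presence of
/// on-disk artifacts determines the entry's outcome.
#[derive(Debug, PartialEq)]
enum Outcome {
    Success,  // SARIF present: analysis completed
    TimedOut, // DB built but no SARIF: killed at analyze step
    NotRun,   // neither artifact: budget exhausted before this entry
}

fn classify(sarif_present: bool, db_present: bool) -> Outcome {
    match (sarif_present, db_present) {
        (true, _) => Outcome::Success,
        (false, true) => Outcome::TimedOut,
        (false, false) => Outcome::NotRun,
    }
}

fn main() {
    // Matches the partial-run tallies: 25 SARIF, 25 DB-only timeouts,
    // 110 not_run.
    assert_eq!(classify(true, true), Outcome::Success);
    assert_eq!(classify(false, true), Outcome::TimedOut);
    assert_eq!(classify(false, false), Outcome::NotRun);
}
```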
Artifacts
Committed under notes/benchmarks/artifacts/:
- bench-results-codeql-partial.json — partial CodeQL results
materialized from the on-disk artifact tree per the CodeQL outcome
classification rule (50/160 entries processed: 25 SARIF + 25
DB-only timeouts; 110 marked not_run).
- bench-results-semgrep.json — full Semgrep results (160 entries,
all OK; severity filter drops note-level → 0 retained findings).
- bench-results-rudra.json — full Rudra results (154 errored, 6
ran; coverage 3.75%).
- bench-results-cargo-geiger.json — full cargo-geiger results
(160 entries, all OK; detected=false per context-only contract).
- stats-2026-05-13.json — full bench-stats output with bootstrap
CIs, McNemar pairwise, per-vuln-class, per-bug-shape, latency,
witness audit.
- paired-summary-2026-05-13.json — per-tool paired aggregates.
Build artifacts (NOT committed, reproducible from steps above):
- target/bench/unsafe-rust-v2/bench-results-ward.json — full Ward
results (160 entries, mean 2.5s/entry on the 146 non-errored).
- target/bench/unsafe-rust-v2/raw/<tool>/<entry-id>/ — per-tool
per-entry SARIF / stdout dumps.
Scripts:
- scripts/synth-bench-results-codeql.py — materializes the partial
CodeQL bench-results-codeql.json from the on-disk SARIF + DB
artifact tree under the pre-locked classification rule (SARIF →
success, DB-only → timed_out, neither → not_run) when the
CodeQL portion of ward-eval bench-run is killed mid-sweep. Rule
re-stated in REPRODUCE.md.
- scripts/render-bench-headline.py — renders the markdown headline
blocks (this doc) from stats.json + paired-summary.json for
reviewer reproducibility.
References
- Methodology: notes/benchmarks/unsafe-rust-bench-methodology.md
- Prior Ward-only pass: notes/benchmarks/unsafe-rust-bench-results-2026-05-12.md
- Smoke run: notes/benchmarks/smoke-run-results-20260511.md
- Bench harness: crates/ward-eval/src/bench/mod.rs
- Paired scoring: crates/ward-eval/src/bench/paired_scoring.rs
- Statistical analyzer: crates/ward-eval/src/bench/stats.rs
- CodeQL synth: scripts/synth-bench-results-codeql.py
- Phase 5 walkaway state: notes/project_phase5_witness_gate_state.md
- Image build: bench/Dockerfile.bench
- Tool pinning: bench/tool-versions.toml
- Rule-id mapping: bench/rule-id-mapping.toml