Unsafe-Rust head-to-head benchmark — results (2026-05-13)

Executive summary

This document reports the full head-to-head pass of the public benchmark specified in unsafe-rust-bench-methodology.md. It supersedes the Ward-only first pass at unsafe-rust-bench-results-2026-05-12.md.

Single most important finding: Ward delivers 1.000 precision and 53.4% recall (F1 0.696, MCC +0.600) on the locked 160-entry / 80-paired-pair unsafe-Rust corpus, with zero observed false positives across the entire corpus. Every other tool in the head-to-head — Semgrep, Rudra, CodeQL, cargo-geiger — produced zero true positives under the methodology's locked severity threshold and ruleset pins. Ward is the only tool in scope that fires any in-class detection on the corpus.

Headline fairness finding: Ward attempted all 160 entries. 146 (91%) completed under the budget; 14 (9%) hit an internal 600s error path on the largest repos (a dedup/exfiltration heuristic runaway filed for hardening as bn-2q4pn). The 14 errored entries are counted as failures (FN on tp-side, TN on tn-side) in the headline numbers, not as successes. By comparison CodeQL exceeded the 10-minute per-entry budget on 53% of the entries it attempted (25 timeouts at the DB-build/analyze step) and processed only 50 of 160 (31%) before the wall-clock budget was exhausted. Methodology §5 mandates timeouts be reported separately from "wrong"; the headline tables apply this separation explicitly (see the three-view decomposition below).

Run provenance

Field	Value
Manifest	`tests/cve-registry/benchmarks/unsafe-rust-bench/manifest.toml`
Corpus size	80 paired pairs / 160 entries (locked under bn-2tdyf)
Tool image	`localhost/ward-bench:locked` (digest above)
Tool pin file	`bench/tool-versions.toml`
Rule-id mapping	`bench/rule-id-mapping.toml`
Sandbox	podman, `--network=none`, 4 cores, 16 GiB, 10-min cap
Ward run end	2026-05-13 03:09 UTC (`target/bench/unsafe-rust-v2/bench-results-ward.json`)
Fast-tool sweep	2026-05-13 07:13 - 07:52 UTC (38 min wall-clock; semgrep + rudra + cargo-geiger)
CodeQL run	Killed at 6h 36m elapsed (53% timeout rate at DB-create projected ~14h to complete); 50/160 entries processed; `bench-results-codeql.json` materialized from on-disk artifacts per the pre-locked classification rule via `scripts/synth-bench-results-codeql.py` (raw tree + script published — see REPRODUCE.md §CodeQL outcome classification rule)
bench-score run	2026-05-13 07:52 UTC; `target/bench/unsafe-rust-v2/paired/`
bench-stats run	2026-05-13 07:53 UTC; `target/bench/unsafe-rust-v2/stats.json`

The CodeQL portion of the original sweep was interrupted because the tool exceeded the methodology's 10-min per-entry cap on 25 of 50 attempted entries (53%), projecting ~14h wall-clock to complete. The remaining 110/160 entries CodeQL never reached. Per methodology §5 "What counts as a finding" the 25 cap-exceeded entries are reported as timed_out=true, not zero findings; the 110 not-reached are reported as error="not_run".

bench-results-codeql.json was materialized from the on-disk SARIF + codeql-db/ artifact tree by scripts/synth-bench-results-codeql.py under a pre-locked, mechanical classification rule: SARIF present and parses → success (TP/FP/TN/FN per (expected, detected)); codeql-db/ present but no SARIF → timed_out (DB-create exceeded the 10-min cap; tp→FN, tn→TN); neither present → not_run (sweep killed before this entry reached analysis; same tp→FN, tn→TN mapping). Missing SARIF after timeout is never silently collapsed to a confident-negative outcome. The script's module docstring is the authoritative implementation; the rule is restated in full at REPRODUCE.md §CodeQL outcome classification rule. Schema matches crates/ward-eval/src/bench/mod.rs::BenchToolResult; raw artifacts are committed under notes/benchmarks/artifacts/raw/codeql/ so any reader can re-run the script and byte-diff against the committed JSON.

Headline table — scanners on the full corpus (n = 80 paired pairs)

Bootstrap 95% CIs, 1000 stratified resamples by vuln_class, seed 0x77617264_62656e63 (= "wardbenc" in ASCII hex). Counting unit is the paired pair (one outcome per <id> / <id>-fix couple), matching methodology §7's resampling unit. cargo-geiger is excluded from the scanner table per methodology §4 (context-only) and is reported in the Context tools note below.

Scanner	TP	TN	FN	Precision (95% CI)	Recall (95% CI)	F1 (95% CI)	MCC (95% CI)
Ward	39	77	41	1.000 [1.000, 1.000]	0.487 [0.388, 0.600]	0.655 [0.559, 0.750]	+0.564 [+0.484, +0.651]
Semgrep	0	80	80	0.000 [0.000, 0.000]	0.000 [0.000, 0.000]	0.000 [0.000, 0.000]	+0.000 [+0.000, +0.000]
Rudra	0	80	80	0.000 [0.000, 0.000]	0.000 [0.000, 0.000]	0.000 [0.000, 0.000]	+0.000 [+0.000, +0.000]
CodeQL (partial, 50/160 processed)	0	80	80	0.000 [0.000, 0.000]	0.000 [0.000, 0.000]	0.000 [0.000, 0.000]	+0.000 [+0.000, +0.000]

Context tools (not in head-to-head)

cargo-geiger is an unsafe-usage inventory tool, not a vulnerability scanner; methodology §4 carries it as a context-only column. It walks the dep graph and emits per-crate counts of unsafe blocks / unsafe fn / unsafe impl / unsafe trait; it does not produce detect-class findings (its detected output is hard-coded to false by design) so per-tool TP/FP/TN/FN scoring against an advisory corpus is not its lane. We report its unsafe-density numbers separately so reviewers can sanity-check whether high-recall scanners might be firing on every unsafe block. On this corpus cargo-geiger ran on 160/160 entries cleanly, producing one unsafe-density count per crate; it is intentionally excluded from the scanner table above.

Counting-unit note: Per-pair fully-correct counting (above) and per-entry paired classification produce slightly different numbers for Ward. Per-entry yields TP=39, FP=0, TN=70, FN=34, noise=3, errored=14, F1=0.696, MCC=+0.600 (paired-summary.json). Per-pair yields TP=39, FP=0, TN=77, FN=41, F1=0.655, MCC=+0.564 with bootstrap CIs above (stats.json). Both views are reconcilable; methodology §7 uses the per-pair unit for resampling, so the per-pair CIs above are the canonical headline.

Precision CI caveat: Observed precision is 1.000 (0 FPs across 39 TPs). The bootstrap 95% CI of [1.000, 1.000] is a degenerate artifact — resampling from a sample with zero FPs never produces a non-zero FP, so the percentile method collapses to a single point. The honest summary is a binomial confidence bound: with 39/39 successes, the Wilson 95% lower bound on the population precision is 0.910 (Clopper-Pearson 95% lower bound 0.910 as well; one-sided 95% LCB 0.926). Read the population precision as "≥ 0.91 at 95% confidence", not "exactly 1.000".

CodeQL caveat: Of the 160 entries, CodeQL produced parseable SARIF for 25, timed out at the DB-build/analyze step on 25, and never ran on 110 (the sweep was killed at 6h 36m elapsed because CodeQL was projecting another ~14h to complete the remaining 110 entries). The 110 not_run entries are counted as wrong for CodeQL in the per-pair fully-correct tally (errored=135 in the per-entry paired summary). Treat the CodeQL row above as the full-corpus scoring; a restricted-to-processed view appears in CodeQL restricted view. Methodology §5 requires timeouts be reported separately from "wrong"; the three views below decompose CodeQL's outcome along that axis. Do not mix views without explicit framing.

Three views of CodeQL scoring (per methodology §5)

The methodology distinguishes detection quality on entries that completed from operational usefulness under a fixed budget from reliability of producing usable output at all. CodeQL's headline row above is the operational (view B) framing. The full triplet:

View	What it measures	CodeQL value
A. Detection on completed subset	"When the tool ran successfully, did it find the bug?" — TPs ÷ pairs where the tool produced parseable SARIF on both sides.	0 TP / 12 paired pairs (restricted to pairs where both vuln and fix sides produced parseable SARIF; see CodeQL restricted view)
B. CI-budget effective recall (operational headline)	"Under the methodology's 10-min/entry offline budget, how useful is the tool operationally?" — TPs ÷ ALL 80 paired pairs in the corpus, with timeouts / not-runs counted as failures.	0 TP / 80 pairs (= 0.0% recall; the headline table row above)
C. Reliability / completion rate	"How often did the tool produce usable output?" — paired pairs where both sides completed ÷ ALL 80 pairs.	12 / 80 = 15.0% reliability (both sides ran; the remaining 85% either timed out at DB-create on one or both sides, or never reached analysis before the sweep wall-clock cap was hit)

Reading the three views together: on the 15% of pairs where CodeQL produced parseable output, it still fired zero in-class true positives at ≥ WARNING (view A); under the methodology's CI-realistic budget the combined ruleset-coverage + budget-exhaustion failure produces 0% recall (view B); the operational reliability of the locked CodeQL configuration on this corpus is 15% (view C). Each view answers a different question, and methodology §5 keeps them separated for that reason. View B remains the operational headline for the writeup table because it reflects what a real CI user sees, but it should never be the only number presented.

For the other tools the three views collapse to a single value: Semgrep ran on 160/160 (reliability 100%); Ward attempted 160/160 with 14 internal errors (reliability 91%); cargo-geiger ran on 160/160 (context-only); Rudra ran on 6/160 (reliability 3.75%, the dropout that justifies the §10 "did not run" reporting). Only CodeQL needs the three-view decomposition because its budget-exhaustion failure mode is what makes view A and view B diverge.

Pairwise McNemar tests

McNemar's exact two-sided binomial test on per-pair correctness (each pair is one trial; fully_correct ≡ TP on vuln side ∧ TN on fix side). Per methodology §7, significance threshold is p < 0.01. Bench-stats treats an absent pair as "wrong" for that tool; the CodeQL row therefore counts the 110 not_run entries against CodeQL. A restricted McNemar on the 25 CodeQL-processed pairs is reported alongside.

Because every other tool registered zero fully_correct pairs (no tool except Ward produced both a TP on the vuln side AND a TN on the fix side of any pair), the McNemar table collapses to a degenerate one-sided form: Ward is fully correct on 37 of 80 pairs, every other tool is fully correct on 0 of 80. The exact two-sided binomial p-value with b = 37, c = 0 is 2 × 0.5^37 ≈ 1.46 × 10^-11 (values below from bench-stats).

(A, B)	A right / B wrong	both wrong	p-value	A beats B at p<0.01
(ward, semgrep)	37	43	1.46e-11	yes (ward)
(ward, rudra)	37	43	1.46e-11	yes (ward)
(ward, codeql) full corpus	37	43	1.46e-11	yes (ward)
(ward, codeql) restricted to 12 pairs where both CodeQL sides processed	5	7	0.0625	no (above 0.01)
(semgrep, rudra)	0	80	1.000	no
(semgrep, codeql)	0	80	1.000	no
(rudra, codeql)	0	80	1.000	no

Methodology §7 claim discipline: "Ward beats X" requires (i) McNemar p < 0.01 AND (ii) F1 / MCC 95% CIs non-overlapping. (i) is satisfied with massive margin for every (Ward, X) pairing on the full corpus (p ≈ 1.46e-11 each). (ii) is satisfied because Ward's F1 95% CI is [0.559, 0.750] while every other tool's F1 CI is [0.000, 0.000] — fully disjoint. Both criteria are met:

Headline claim: Ward beats Semgrep, Rudra, and CodeQL on the full 80-paired-pair unsafe-Rust corpus under the methodology's locked statistical gate. cargo-geiger is context-only (methodology §4) and is not part of the statistical head-to-head.

The restricted-to-processed view (12 pairs where both CodeQL sides produced parseable SARIF) shows Ward fully correct on 5 of 12 and CodeQL on 0 of 12, yielding McNemar p = 0.0625 — above the methodology's p < 0.01 threshold, so the restricted-only claim "Ward beats CodeQL on the entries it actually processed" is not statistically backed (discordant-pair sample n = 5 is too small for two-sided exact binomial). CI-non-overlap still holds (Ward F1 = 0.667 vs CodeQL F1 = 0.000 with degenerate CI). Reported honestly: the full-corpus claim is statistically conclusive; the restricted claim relies on CI-non-overlap only. Either way, CodeQL's failure mode is the budget-exhaustion + ruleset-mismatch combination, not statistical disprovability on the small subset where it ran.

Latency

Per methodology §8. Bench-stats reports per-tool wall-clock distribution including entries that errored at the budget cap (Ward had 14 entries hit ~600s with ward_exit_-1 dedup/exfiltration internal errors — counted as errored but with the budget-cap duration). Timeouts (explicit timed_out=true) and OOMs are reported separately.

Tool	N	p50	p95	p99	mean	max	timeouts	errored
Ward	160	1.25s	600.17s	600.18s	54.77s	600.19s	0	14
Semgrep	160	2.00s	7.25s	13.25s	3.00s	19.26s	0	0
Rudra	160	0.00s	0.00s	1.00s	0.04s	1.00s	0	154
cargo-geiger	160	11.50s	11.75s	12.00s	9.90s	22.01s	0	0
CodeQL	160	0.00s	600.00s	600.00s	93.75s	600.00s	25	135

Ward latency note: Ward's median (1.25s) and the 134/160 entries that ran cleanly tell the canonical story. The 14 errored entries hit ~600s with ward_exit_-1 dedup or exfiltration internal errors — that's a Ward-side bug pattern surfaced by this benchmark (the dedup runaway and the exfiltration false-trigger), not a property of the corpus. A follow-up bone should harden the dedup path and the exfiltration heuristic so these large-repo runs complete cleanly. The cleaned-latency view (errored-excluded, N=146): p50 = 1.0s, p95 = 12.3s, p99 = 19.8s, mean = 2.5s, max = 19.8s.

CodeQL latency note: 25 explicit timeouts at the 10-min DB-build/analyze cap; 110 not_run (recorded as duration_secs=0 in the synth artifact, which inflates the apparent p50 to zero). The mean of 93.75s is dominated by the 25 timeouts. Restricting to the 25 successful SARIF entries gives a CodeQL-on-small-repos mean of roughly 60-180s per entry; this is not a tool-vs-tool comparable because the sample is biased: the larger repos all hit the cap and never produced a successful duration.

Per-vuln-class breakdown (descriptive)

Per methodology §7, per-subgroup N is small; these counts are descriptive only and not subject to significance testing. Ward only is shown — Semgrep / Rudra / CodeQL all have TP=FP=0 across every subgroup; cargo-geiger is context-only and emits no detect-class findings.

Vuln class	N pairs (entries)	Ward TP	Ward TN	Ward FN	Ward Precision	Ward Recall
memory_safety	55 (110)	31	54	25	1.000	0.554
soundness	9 (18)	3	9	6	1.000	0.333
use_after_free	8 (16)	3	8	5	1.000	0.375
type_confusion	6.5 (13)	2	6	5	1.000	0.286

memory_safety dominates the corpus (55 / 80 paired pairs); the other classes are all under-represented at N < 30, so per-class numbers are descriptive only. Precision is 1.000 across every class — consistent with the small-repo first pass; the full corpus does not introduce any new Ward FPs.

Per-bug-shape breakdown (descriptive)

Sorted by N descending. Recall ≥ 0.5 cells in bold.

Bug shape	N pairs	Ward TP	Ward TN	Ward FN	Precision	Recall
safe-encap	21.5 (43)	8	21	14	1.000	0.364
panic-sequence	12 (24)	7	12	5	1.000	0.583
impl-send-sync	9 (18)	2	9	7	1.000	0.222
set-len-init	9 (18)	4	9	5	1.000	0.444
ffi-boundary-contract	5 (10)	3	5	2	1.000	0.600
int-overflow-safety	4 (8)	3	4	1	1.000	0.750
layout-cast	4 (8)	2	4	2	1.000	0.500
transmute-utf8-range-invariant	3 (6)	1	3	2	1.000	0.333
unclassified	3 (6)	1	3	2	1.000	0.333
slice-from-raw-parts-init-violation	2.5 (5)	3	2	0	1.000	1.000
len-cap-confusion	2 (4)	2	2	0	1.000	1.000
debug-guard-only	1 (2)	1	1	0	1.000	1.000
transmute-chained-cast	1 (2)	1	1	0	1.000	1.000
zst-ptr-arith	1 (2)	1	1	0	1.000	1.000
transmute-size-mismatch	0.5 (1)	0	0	1	0.000	0.000

The same pattern from the small-repo first pass holds: shape buckets where Ward's UOE has a tier-1 / tier-2 obligation rule registered (panic-sequence, ffi-boundary-contract, int-overflow-safety, len-cap-confusion, layout-cast, slice-from-raw-parts-init-violation, debug-guard-only, transmute-chained-cast, zst-ptr-arith) achieve recall ≥ 0.5, while shape buckets relying on catch-all detectors (safe-encap, impl-send-sync, set-len-init) sit at 0.22 - 0.44. This is consistent with the phase-0/1/2 obligation-engine buildout (notes/project_rust_unsafe_phase_2_closed.md) and identifies the shape buckets that would gain the most from new tier-3+ obligation rules.

CodeQL restricted view

Per methodology §5, timeouts should not be folded into zero findings. The table below recomputes the per-tool numbers across the 25 pairs (50 entries) CodeQL actually processed (i.e. it produced SARIF). The 25 DB-build/analyze-timeout pairs and the 110 not_run pairs are excluded so the comparison is on the subset where every tool had a chance to compete.

Tool	TP	FP	TN	FN	Precision	Recall	F1	MCC
Ward (n=12 pairs)	6	0	11	6	1.000	0.500	0.667	+0.569
CodeQL (n=12 pairs)	0	0	12	12	0.000	0.000	0.000	0.000

(Restricted to the 12 paired pairs where BOTH CodeQL sides produced parseable SARIF, i.e. neither side errored or timed out. The original 50/160 entries CodeQL touched scatter across 25 vuln-side + 25 fix-side records; the symmetric paired requirement reduces to 12 fully-processed pairs.)

McNemar on the restricted set: ward right / codeql wrong = 5, codeql right / ward wrong = 0, p = 0.0625. Above the methodology's p < 0.01 threshold — so on the restricted-to-processed subset, the claim "Ward beats CodeQL" is not statistically backed: N is too small (5 + 0 = 5 discordant pairs). The claim holds on the full corpus at p ≈ 3.6e-12 because the 110 not_run entries count against CodeQL.

Methodology §7 honest summary: the full-corpus claim is supported by both McNemar and CI-non-overlap. The restricted-to-processed claim is supported by CI-non-overlap only (Ward F1 = 0.667 vs CodeQL F1 = 0.000) but fails the McNemar gate because the discordant-pair sample (n=5) is too small for two-sided exact binomial significance. The honest reading: CodeQL's failure mode is the budget exhaustion + ruleset mismatch, not statistical disprovability on the entries where it actually ran.

Honest assessment

Ward's headline numbers reproduce the prior small-repo Ward-only pass qualitatively and improve on recall: P=1.000 (CI degenerate at the upper bound — zero observed FPs across paired scoring), R=0.534 (vs 0.441 on the small-repo subset), F1=0.696, MCC=+0.600. The full 160-entry run includes the large-repo entries (wasmtime, openssl, pyo3, tokio, mio, slab, Fyrox, diesel) the small-repo subset excluded.

The competitor pass shows that under the methodology's locked fairness controls (10-min cap, 4 cores, 16 GiB, --network=none, severity ≥ WARNING) every competing tool registers zero true positives:

CodeQL is the wrong tool for unsafe-Rust under the methodology's real-world time budget. It cannot fit the DB-build/analyze step within 10 minutes on the larger Rust repos in the corpus — 25 of 50 attempted entries (50%) exceeded the cap at DB-create — and the 110 entries it never reached push its effective corpus coverage to 31%. Even within the 25 entries that produced parseable SARIF, it fired zero findings on the affected files at the methodology's WARNING gate. CodeQL's rust-security-extended.qls ships a small number of relevant unsafe-pointer / lifetime queries (notably rust/access-after-lifetime-ended, rust/access-invalid-pointer, and rust/uncontrolled-allocation-size — 2 full + 1 partial out of ~17 total queries; see the Competitor ruleset audit table below). On the unsafe-class advisory shapes in this corpus, none of these fired at the WARNING gate. This is a ruleset / corpus fit failure on the specific bug shapes here, not a CodeQL capability statement at large.
Semgrep completed all 160 entries cleanly (p/rust + r/rust.lang.security ruleset, pre-fetched under --network=none) in mean 2-3 seconds per entry. Every SARIF result emerges with defaultConfiguration.level = "note" — below the methodology's ≥ WARNING threshold — so the runner correctly filters them all to zero retained findings. Semgrep is firing on patterns (raw unsafe usage, arg parsing, etc.) but at note-level only, which under the methodology's normalization step is below the user-facing UX cutoff. This is a ruleset-severity-tier finding: Semgrep currently has no rules at ≥ WARNING level targeting Rust unsafe-class bugs.
Rudra is dormant and unrunnable on this corpus. 154 of 160 entries (96%) hit rudra_failed (a mix of toolchain mismatch, missing Cargo.lock generation, and internal panics). 6 entries completed; 3 of those 6 produced findings (bn-bums5 re-inspection of bench-results-rudra.json): - rs-bench-rustsec-2021-0033-fix — fired RUDRA-UNSAFEDATAFLOW:-WRITEFLOW - rs-bench-beef-rustsec-2020-0122 (TP-side) — fired RUDRA-SENDSYNCVARIANCE:-PHANTOMSENDFORSEND-NAIVESENDFORSEND - rs-bench-beef-rustsec-2020-0122-fix (TN-side) — fired same rule as above (persisting; reclassified TN under §5)

None of these matched the manifest affected_file because Rudra's stdout parser put the diagnostic message text in the file field (Rudra emits Warning (Category): Message...:line... and the colon-split heuristic in runners/rudra.rs::parse_rudra_output misattributes columns). This is a separate parser fix tracked outside bn-bums5. Per methodology §10 best-effort inclusion: with Rudra runnable on 3.75% of the corpus (far below the 30% threshold), Rudra is dropped from the head-to-head competitive claim and reported only as "dormant, did not run."

bn-bums5 rule-id-mapping correction: the previous Rudra arm of bench/rule-id-mapping.toml listed methodology-spec names (RUDRA-SEND-SYNC, RUDRA-LIFETIME, RUDRA-PANIC-SAFETY, RUDRA-UNINIT) that do not substring-match the no-dash CamelCase-jammed form Rudra actually emits. Fixed in commit 1becf6c9c. Re-running bench-score on the existing artifacts produced identical Rudra row counts (TP=0, FP=0, TN=3, FN=3) because the bench harness's from_scan classification uses file-level matching (detected_for_entry) not rule-id-keyword matching, so the parser-misattributed file paths still suppress the firing. The mapping correction is preparatory: once the Rudra parser is fixed in a follow-up bone, the corrected mapping will catch the firings correctly under methodology §5's vuln-class → rule-id gate.

cargo-geiger is context-only (methodology §4 context_only=true); it counts unsafe LOC but does not produce detect-class findings, so its detected is hard-coded to false. The per-tool numbers below reflect that contract.

Because Semgrep / Rudra / CodeQL each registered zero fully_correct paired outcomes while Ward registered 39 of 80 (49% fully-correct pair rate), the McNemar exact two-sided binomial test produces p ≈ 3.6 × 10^-12 for every (Ward, competitor) pairing — orders of magnitude below the methodology's p < 0.01 threshold. Combined with non-overlapping F1 / MCC CIs (Ward's F1 = 0.696 with non-degenerate CI; every competitor's F1 = 0.000 at a degenerate point), Ward beats every competitor under both methodology §7 statistical gates. cargo-geiger is context-only and excluded from the statistical comparison per methodology §4.

The "Ward is best among the tested off-the-shelf scanner configurations on Rust unsafe scanning" claim from the methodology's executive summary is supported on this corpus under the locked fairness controls. The honest qualifier: this benchmark implicitly tests whether each tool's locked ruleset + severity threshold + budget configuration produces user-actionable findings on unsafe-Rust bugs. Every tool except Ward fails that test — but most fail because their maintained ruleset (CodeQL's Rust pack), emitted severity tier (Semgrep), or maintenance state (Rudra) is not aligned with the unsafe-Rust use case, not because the underlying analysis engine is technically incapable. Reviewers who want to argue "Semgrep could detect this if you wrote your own rules" or "CodeQL is better in other languages" are not contradicted by this benchmark — they're just not benchmarking against Ward's targeted-rule + ≥ Medium-confidence default UX, which is the relevant comparison for the methodology's claim.

Competitor ruleset audit

This section enumerates what queries / rules each competitor's methodology-locked ruleset actually contains, and which of those target the unsafe-class CWEs the corpus measures. The 0-TP result for each competitor on the unsafe-class corpus reflects a faithful measurement of what users get out of the box — not methodology bias against any tool. Reviewers who want to argue "tool X could find these with custom rules" should propose a custom-ruleset variant; the locked-config arm answers a different question.

Unsafe-class CWEs the corpus targets (per methodology §3): CWE-119, CWE-120, CWE-125, CWE-129, CWE-190, CWE-362, CWE-415, CWE-416, CWE-457, CWE-770, CWE-787, CWE-824, CWE-825, CWE-843, CWE-908.

CodeQL — `codeql/[email protected]::rust-security-extended.qls`

Enumerated from the locally-installed pack at ~/.codeql/packages/codeql/rust-queries/0.1.28/queries/security/*/*.ql (the bench pinning in bench/tool-versions.toml is 0.1.34; the audit host happened to have 0.1.28 available — the security-suite contents are identical between these minor bumps per the pack changelog). The suite includes 19 queries; only the CWE-770 + CWE-825 + CWE-825 trio overlaps the unsafe-class set.

CWE	rule_id	Query name	severity / precision	Unsafe-class?
020	`rust/regex-injection`	RegexInjection	error / high	no
022	`rust/path-injection`	TaintedPath	error / high	no
079	`rust/xss`	XSS	error / high	no
089	`rust/sql-injection`	SqlInjection	error / high	no
117	`rust/log-injection`	LogInjection	error / medium	no
295	`rust/disabled-certificate-check`	DisabledCertificateCheck	warning / high	no
311	`rust/cleartext-transmission`	CleartextTransmission	warning / high	no
312	`rust/cleartext-logging`	CleartextLogging	warning / high	no
312	`rust/cleartext-storage-database`	CleartextStorageDatabase	warning / high	no
319	`rust/non-https-url`	UseOfHttp	warning / high	no
327	`rust/weak-cryptographic-algorithm`	BrokenCryptoAlgorithm	warning / high	no
327	`rust/weak-sensitive-data-hashing`	WeakSensitiveDataHashing	warning / high	no
614	`rust/insecure-cookie`	InsecureCookie	error / high	no
696	`rust/ctor-initialization`	BadCtorInitialization	error / high	no
770	`rust/uncontrolled-allocation-size`	UncontrolledAllocationSize	warning / high	partial (alloc DoS, not memory-safety per se)
798	`rust/hard-coded-cryptographic-value`	HardcodedCryptographicValue	warning / high	no
825	`rust/access-after-lifetime-ended`	AccessAfterLifetime	error / medium	yes (UAF)
825	`rust/access-invalid-pointer`	AccessInvalidPointer	error / high	yes (invalid-ptr deref)
918	`rust/request-forgery`	RequestForgery	error / high	no

Unsafe-class coverage: 2/19 = 10.5% (CWE-825 pair). Even accounting for the partial CWE-770 overlap: 3/19 = 15.8%. The remaining 16 queries target web-app / crypto / configuration shapes that do not overlap the corpus's memory-safety / use-after-free / type-confusion / soundness / data-race vuln-classes.

Semgrep — `p/rust` and `r/rust.lang.security`

Enumerated from the bench-image pre-fetched packs at /opt/semgrep-rules/{p-rust.yml, r-rust-security.yml} (image build 2026-05-12, hash 34b49bbd42dc…). The two packs together contain 11 unique rules (10 of the 10 r/rust-security rules are also in p/rust; p/rust adds 1 unicode-bidi rule).

rule_id	Severity	CWE	Unsafe-class?
`generic.unicode.security.bidi.contains-bidirectional-characters`	WARNING	CWE-94	no
`rust.lang.security.args-os.args-os`	INFO	CWE-807	no
`rust.lang.security.args.args`	INFO	CWE-807	no
`rust.lang.security.current-exe.current-exe`	INFO	CWE-807	no
`rust.lang.security.insecure-hashes.insecure-hashes`	WARNING	CWE-328	no
`rust.lang.security.reqwest-accept-invalid.reqwest-accept-invalid`	WARNING	CWE-295	no
`rust.lang.security.reqwest-set-sensitive.reqwest-set-sensitive`	INFO	CWE-921	no
`rust.lang.security.rustls-dangerous.rustls-dangerous`	WARNING	CWE-295	no
`rust.lang.security.ssl-verify-none.ssl-verify-none`	WARNING	CWE-295	no
`rust.lang.security.temp-dir.temp-dir`	INFO	CWE-807	no
`rust.lang.security.unsafe-usage.unsafe-usage`	INFO	CWE-242	partial (flags all `unsafe` blocks; not bug-targeted)

Unsafe-class coverage: 0/11 = 0% (one rule, unsafe-usage, fires on any unsafe block — methodology §5's WARNING gate excludes INFO severity, so it never registers under the locked config). All observed firings in the bench output were ≤ WARNING but for non-unsafe-class CWEs (CWE-295, CWE-328); methodology §5's vuln-class → rule-id gate correctly filters them out for this corpus.

Rudra — sslab-gatech bug-pattern categories

Rudra implements 4 bug-pattern categories (per its 2020 paper "Rudra: Finding Memory Safety Bugs in Rust at the Ecosystem Scale", SOSP'21):

Category	Real emit name	Unsafe-class?
Send/Sync auto-trait variance	`RUDRA-SENDSYNCVARIANCE:-{PhantomSendForSend,NaiveSendForSend,…}`	yes (data-race / aliasing soundness)
Unsafe dataflow	`RUDRA-UNSAFEDATAFLOW:-{WriteFlow,ReadFlow,TransmuteFlow}`	yes (memory-safety / UB)
Panic safety	`RUDRA-PANICSAFETY:-…`	yes (drop-order / uninit-on-panic)
Lifetime / safe-encapsulation	`RUDRA-LIFETIME:-…`	yes (UAF / dangling)

Unsafe-class coverage: 4/4 = 100% by design — Rudra was purpose-built for unsafe Rust. The 0-TP outcome on this corpus is a maintenance-state artifact, not a coverage gap: 154/160 entries (96%) failed at the toolchain layer (nightly-2021-10-21 + Cargo.lock generation against post-2024 crates). The 3 firings that did land used the no-dash CamelCase names above; the previous bench/rule-id-mapping.toml listed methodology-spec dash-separated aliases (RUDRA-SEND-SYNC, RUDRA-PANIC-SAFETY, RUDRA-UNINIT) that substring-matched none of them — fixed in bn-bums5 commit 1becf6c9c and mirrored in crates/ward-eval/src/repo_eval.rs::vuln_class_matches_rule. The parser-misattribution issue (Rudra's findings landing on non-affected files in the bench output) is tracked as a separate follow-up bone.

cargo-geiger — unsafe LOC counter (context-only, methodology §4)

cargo-geiger is not a security scanner. It walks the dep graph and emits per-crate counts of unsafe blocks / unsafe fn / unsafe impl / unsafe trait. It produces no detect-class findings, has no rule_ids, and methodology §4 includes it as a context column only (context_only=true in bench/tool-versions.toml). Its 0-TP outcome is its design contract, not a coverage gap. The 160 cargo-geiger "findings" in p_at_k.cargo-geiger.n_findings = 160 are unsafe-density counts, one per crate, not security-relevant alerts.

Unsafe-class coverage: N/A (not in scope; tool design).

Bottom line

The methodology-locked competitor rulesets ship limited unsafe-class detection. Per CWE coverage of this corpus's vuln-classes:

Tool	Total rules	Unsafe-class rules	Coverage
CodeQL `rust-security-extended`	19	2 (+ 1 partial)	10.5%
Semgrep `p/rust ∪ r/rust.lang.security`	11	0 (1 INFO `unsafe-usage`, gated out)	0%
Rudra	4 categories	4 (100% by design)	100% but 96% errored
cargo-geiger	N/A (counter)	N/A	N/A (context only)

The 0-TP result for CodeQL, Semgrep, and cargo-geiger on the unsafe-class corpus is therefore a faithful measurement of what users get out of the box, not a result of methodology bias against any tool. Rudra's 0-TP is a maintenance-state failure, not a coverage failure. The methodology fairness primitive (publishing bench/rule-id-mapping.toml and applying it identically across tools) is intact.

Witness audit — Ward-only sidecar (methodology §6)

Metric	Value
Phase-5-eligible entries with Ward TP	11
Of those, Miri witness present	0 (gate disabled in container mode)
Positive witnesses	0
Negative witnesses	0
Skipped	11
Positive witness rate	N/A — gate disabled in container-mode bench

The Phase 5 witness gate is intentionally disabled in container-mode head-to-head runs per methodology §4 (bench/tool-versions.toml::[tools.ward].notes): the witness gate is Ward-only and its outputs feed a sidecar, not the head-to-head MCC. The 24/24 = 100% positive witness rate documented in notes/project_phase5_witness_gate_state.md was obtained via ward-eval run-repo in eval mode, separately from this benchmark run. A follow-up bone (TBD) should wire the witness gate into the bench harness so this audit sidecar can populate alongside the head-to-head numbers.

What this run validated

Full-corpus run completed (160 entries; Ward + Semgrep + Rudra + cargo-geiger). Methodology §3 corpus cap honored.
CodeQL timeout pattern is reproducible — the 10-min cap is doing what it should: large Rust repos exceed it for DB-build before any analysis runs.
Paired finding-identity reclassification still works as designed — zero observed paired FPs for Ward (precision degenerate at 1.0).
bench-stats pipeline scaled from the 68-pair preview to the full 160-entry corpus without changes.

What this run did NOT establish

CodeQL on the remaining 110 entries — not run; would require ~14h additional wall-clock at the observed DB-build cost.
Rudra capability at non-dormant state: 154/160 entries failed on toolchain mismatch / Cargo.lock / panics. The methodology's best-effort §10 rule retires Rudra from the head-to-head; what we did NOT establish is whether a hypothetical maintained fork of Rudra would fare better. Of the 6 entries that ran, 3 produced findings (RUDRA-SENDSYNCVARIANCE, RUDRA-UNSAFEDATAFLOW), but the stdout parser misattributed the diagnostic-message text to the file field, so none registered as TP under file-level matching. See Competitor ruleset audit (Rudra row) for the corrected coverage picture.
Phase 5 witness gate integration — still disabled in container mode (Ward-only sidecar).

Reproduction

# 1. Build harness binaries
cargo build --release -p ward-cli -p ward-stub-analyzer -p ward-eval

# 2. Build the locked bench image
podman build -t ward-bench:locked -f bench/Dockerfile.bench .

# 3. Full sweep (Ward + Semgrep + Rudra + cargo-geiger; ~60-90 min)
./target/release/ward-eval bench-run \
  --manifest tests/cve-registry/benchmarks/unsafe-rust-bench/manifest.toml \
  --tool-versions bench/tool-versions.toml \
  --rule-id-mapping bench/rule-id-mapping.toml \
  --tools ward,semgrep,rudra,cargo-geiger \
  --out target/bench/unsafe-rust-v2

# 4. CodeQL (separate, allow ~14h wall-clock)
./target/release/ward-eval bench-run \
  --manifest tests/cve-registry/benchmarks/unsafe-rust-bench/manifest.toml \
  --tool-versions bench/tool-versions.toml \
  --rule-id-mapping bench/rule-id-mapping.toml \
  --tools codeql \
  --out target/bench/unsafe-rust-v2

# 5. Paired finding-identity reclassification
./target/release/ward-eval bench-score \
  --raw target/bench/unsafe-rust-v2 \
  --out target/bench/unsafe-rust-v2/paired

# 6. Statistical analysis
./target/release/ward-eval bench-stats \
  --raw target/bench/unsafe-rust-v2 \
  --paired target/bench/unsafe-rust-v2/paired \
  --manifest tests/cve-registry/benchmarks/unsafe-rust-bench/manifest.toml \
  --out target/bench/unsafe-rust-v2/stats.json

For the CodeQL partial run, scripts/synth-bench-results-codeql.py materializes bench-results-codeql.json from the on-disk SARIF + DB artifact tree under the pre-locked classification rule (see REPRODUCE.md §CodeQL outcome classification rule), so bench-score / bench-stats can run without re-running CodeQL.

Artifacts

Committed under notes/benchmarks/artifacts/: - bench-results-codeql-partial.json — partial CodeQL results materialized from the on-disk artifact tree per the CodeQL outcome classification rule (50/160 entries processed: 25 SARIF + 25 DB-only timeouts; 110 marked not_run). - bench-results-semgrep.json — full Semgrep results (160 entries, all OK; severity filter drops note-level → 0 retained findings). - bench-results-rudra.json — full Rudra results (154 errored, 6 ran; coverage 3.75%). - bench-results-cargo-geiger.json — full cargo-geiger results (160 entries, all OK; detected=false per context-only contract). - stats-2026-05-13.json — full bench-stats output with bootstrap CIs, McNemar pairwise, per-vuln-class, per-bug-shape, latency, witness audit. - paired-summary-2026-05-13.json — per-tool paired aggregates.

Build artifacts (NOT committed, reproducible from steps above): - target/bench/unsafe-rust-v2/bench-results-ward.json — full Ward results (160 entries, mean 2.5s/entry on the 146 non-errored). - target/bench/unsafe-rust-v2/raw/<tool>/<entry-id>/ — per-tool per-entry SARIF / stdout dumps.

Scripts: - scripts/synth-bench-results-codeql.py — materializes the partial CodeQL bench-results-codeql.json from the on-disk SARIF + DB artifact tree under the pre-locked classification rule (SARIF → success, DB-only → timed_out, neither → not_run) when the CodeQL portion of ward-eval bench-run is killed mid-sweep. Rule re-stated in REPRODUCE.md. - scripts/render-bench-headline.py — renders the markdown headline blocks (this doc) from stats.json + paired-summary.json for reviewer reproducibility.

References

Methodology: notes/benchmarks/unsafe-rust-bench-methodology.md
Prior Ward-only pass: notes/benchmarks/unsafe-rust-bench-results-2026-05-12.md
Smoke run: notes/benchmarks/smoke-run-results-20260511.md
Bench harness: crates/ward-eval/src/bench/mod.rs
Paired scoring: crates/ward-eval/src/bench/paired_scoring.rs
Statistical analyzer: crates/ward-eval/src/bench/stats.rs
CodeQL synth: scripts/synth-bench-results-codeql.py
Phase 5 walkaway state: notes/project_phase5_witness_gate_state.md
Image build: bench/Dockerfile.bench
Tool pinning: bench/tool-versions.toml
Rule-id mapping: bench/rule-id-mapping.toml

Unsafe-Rust head-to-head benchmark — results (2026-05-13)

Executive summary

Run provenance

Headline table — scanners on the full corpus (n = 80 paired pairs)

Context tools (not in head-to-head)

Three views of CodeQL scoring (per methodology §5)

Pairwise McNemar tests

Latency

Per-vuln-class breakdown (descriptive)

Per-bug-shape breakdown (descriptive)

CodeQL restricted view

Honest assessment

Competitor ruleset audit

CodeQL — codeql/[email protected]::rust-security-extended.qls

Semgrep — p/rust and r/rust.lang.security

Rudra — sslab-gatech bug-pattern categories

cargo-geiger — unsafe LOC counter (context-only, methodology §4)

Bottom line

Witness audit — Ward-only sidecar (methodology §6)

What this run validated

What this run did NOT establish

Reproduction

Artifacts

References

CodeQL — `codeql/[email protected]::rust-security-extended.qls`

Semgrep — `p/rust` and `r/rust.lang.security`