Executive summary
Did broader competitor rulesets change the verdict? NO. Ward retains
its 39-TP / 0-FP / paired-F1 0.655 / MCC +0.564 headline (per-pair
counting unit; per-entry view F1=0.696). Under the broadest publicly
available rulesets for each competitor, Semgrep still produces 0
paired TPs (4 out-of-class WARNING firings, all insecure-hashes on
non-affected files), CodeQL still produces 0 paired TPs (broader
rust-security-and-quality.qls suite adds 2 quality queries but none
target unsafe-class bugs), and Rudra goes from 0 to 1 paired TP
once the in-image parser bug and the rule-id mapping are fixed (the parser
fix is bn-bums5's lane; it was applied locally for this aux pass). The Ward
beats-X McNemar test stays significant against every competitor:
p ≈ 1.46 × 10⁻¹¹ against Semgrep and CodeQL, and p ≈ 2.92 × 10⁻¹⁰
against Rudra.
This run is auxiliary to the locked methodology. It exists to preempt the skeptic objection "you didn't try hard enough on the competitors." The locked headline run remains the canonical record; this doc reports auxiliary deltas only.
Run provenance
| Field | Value |
|---|---|
| Manifest | tests/cve-registry/benchmarks/unsafe-rust-bench/manifest.toml (unchanged, 160 entries / 80 pairs) |
| Image | localhost/ward-bench:locked (unchanged, sha256:b7707fe926c9…) |
| Tool pin file | bench/tool-versions.toml (unchanged) |
| Aux harness | scripts/aux/run-aux-{semgrep,rudra,codeql}.py + synth-semgrep-from-sarif.py + audit-aux-firings.py (this bone) |
| Aux rule mapping | bench/rule-id-mapping-aux-max-breadth.toml (this bone, parallel to locked mapping) |
| Sandbox | podman, --network=none, 4 cores, 16 GiB, 10-min cap (same fairness controls as locked) |
| Ward results | Reused unchanged from headline run (bench-results-ward.json) |
| cargo-geiger results | Reused unchanged from headline run (context-only per methodology §4) |
Per-tool aux configuration
Semgrep — added r/rust, p/security-audit, p/default on top of
the methodology-locked p/rust config. All four rulesets were
prefetched from semgrep.dev on the host, then bind-mounted into the
locked image at /aux-rules/ so the container still ran
--network=none. The four packs together carry 1,079 distinct rules
in their union (vs 11 in the locked p/rust alone), of which 19
are Rust-specific (vs 11 in p/rust alone). The new Rust rules
introduced by the broader packs include rust.lang.security.unsafe-usage,
rust.lang.security.insecure-hashes, rust.lang.security.rustls-dangerous,
rust.lang.security.reqwest-accept-invalid, rust.lang.security.ssl-verify-none,
rust.lang.security.temp-dir, rust.lang.security.current-exe,
rust.lang.security.args-os, rust.lang.security.args. The
unsafe-usage rule fires on every unsafe block but is note-level
by default — below the methodology's WARNING threshold — so it does
not enter the headline. To accelerate the 160-entry run within the
session window, the aux harness fanned out into three parallel
container workers on disjoint cpusets (0-3, 4-7, 8-11); per-entry
results were materialized from on-disk SARIFs via
synth-semgrep-from-sarif.py (one-row-per-entry SARIF
parse; no per-finding heuristics). Coverage: 160/160 (zero not_run_aux).
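The one-row-per-entry SARIF parse and the WARNING gate described above can be sketched roughly as follows. This is a minimal illustration, not the actual synth-semgrep-from-sarif.py: the function name `entry_row`, the row fields, and the sample data are hypothetical, and treating a missing `level` as `warning` is an assumption based on the SARIF 2.1.0 default.

```python
# Hypothetical sketch: collapse one entry's SARIF into a single summary row.
# Only results at "warning" or "error" level pass the severity gate.
GATED_LEVELS = {"warning", "error"}


def entry_row(sarif: dict, affected_file: str) -> dict:
    fired = []
    for run in sarif.get("runs", []):
        for result in run.get("results", []):
            # Assumption: a result without an explicit level counts as "warning".
            level = result.get("level", "warning")
            if level not in GATED_LEVELS:
                continue
            for loc in result.get("locations", []):
                uri = (loc.get("physicalLocation", {})
                          .get("artifactLocation", {})
                          .get("uri", ""))
                fired.append((result.get("ruleId", ""), uri))
    on_affected = [rule for rule, uri in fired if uri.endswith(affected_file)]
    return {
        "fired": bool(fired),
        "fired_on_affected_file": bool(on_affected),
        "rules": sorted({rule for rule, _ in fired}),
    }


def _result(rule_id: str, level: str, uri: str) -> dict:
    # Helper that builds a minimal SARIF result for the made-up sample below.
    return {
        "ruleId": rule_id,
        "level": level,
        "locations": [{"physicalLocation": {"artifactLocation": {"uri": uri}}}],
    }


# Made-up input shaped like the situation in this run: a note-level
# unsafe-usage hit on the affected file, and a WARNING-level
# insecure-hashes hit on a different file.
SAMPLE = {"runs": [{"results": [
    _result("rust.lang.security.unsafe-usage", "note", "src/covdir.rs"),
    _result("rust.lang.security.insecure-hashes.insecure-hashes",
            "warning", "src/output.rs"),
]}]}

row = entry_row(SAMPLE, affected_file="src/covdir.rs")
```

In this sample the note-level hit is dropped by the gate and the WARNING-level hit lands on the wrong file, so the entry counts as fired but not as fired on the affected file.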
Rudra — the bn-bums5 fix (re-mapping RUDRA-SEND-SYNC →
RUDRA-SENDSYNCVARIANCE, etc.) had not landed on default at run time,
so the aux harness applies the same fix locally in two ways:
(a) it replaces the in-image output parser (which expects Error (CAT):
file:line:col: msg on one line) with a multi-line parser that
matches the real Rudra emission format
(Warning (Cat/Sub): msg\n -> file:line:col); (b) it extends the
rule-id keywords mapping in bench/rule-id-mapping-aux-max-breadth.toml
to match real category names (SendSyncVariance, UnsafeDataflow,
PanicSafety, etc.).
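A multi-line parser of the kind described in (a) might look like the sketch below. It assumes the two-line shape quoted above (a header line followed by a `->` location line); the function name `parse_rudra`, the rule-id derivation, and the sample text (including its message and line numbers) are illustrative assumptions rather than the harness's actual code.

```python
import re

# Header line, e.g. "Warning (UnsafeDataflow/WriteFlow): some message"
HEADER = re.compile(r"^(?:Warning|Error) \((?P<cat>[^)]+)\):\s*(?P<msg>.*)$")
# Location line, e.g. "-> src/stack.rs:42:5: 50:6" (any trailing span is ignored)
LOCATION = re.compile(r"^\s*->\s*(?P<file>.+?):(?P<line>\d+):(?P<col>\d+)")


def parse_rudra(output: str) -> list[dict]:
    """Pair each header line with the location line that follows it."""
    findings = []
    pending = None
    for line in output.splitlines():
        header = HEADER.match(line)
        if header:
            pending = header.groupdict()
            continue
        loc = LOCATION.match(line)
        if loc and pending is not None:
            findings.append({
                # Assumed mapping: "Cat/Sub" becomes "RUDRA-CAT-SUB".
                "rule_id": "RUDRA-" + pending["cat"].replace("/", "-").upper(),
                "message": pending["msg"],
                "file": loc["file"],
                "line": int(loc["line"]),
                "col": int(loc["col"]),
            })
            pending = None
    return findings


# Made-up sample text in the assumed format.
SAMPLE = (
    "Warning (UnsafeDataflow/WriteFlow): Potential unsafe dataflow issue\n"
    "-> src/stack.rs:42:5: 50:6\n"
)
parsed = parse_rudra(SAMPLE)
```

Keeping the header in a `pending` slot until a location line arrives is what stops the message text from being mistaken for the file path.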
CodeQL — switched the analysis suite from
rust-security-extended.qls to rust-security-and-quality.qls
(strict superset; adds 2 quality queries). Per the task scope, the
aux pass re-ran only the 25 entries for which the locked run had
produced parseable SARIF; the 25 DB-create-timeout entries would
still time out, and the 110 not_run entries would still not run.
The existing DBs at
ws/default/target/bench/unsafe-rust-v2/raw/codeql/<id>/codeql-db/
were reused via codeql database analyze --rerun, so the aux pass paid
only the analyze cost (~15s/entry vs full DB-create at ~10min/entry).
The broader suite added queries like rust/regex-injection,
rust/ctor-initialization, rust/cleartext-storage-database —
none of which extend the unsafe-pointer / lifetime coverage already
in rust-security-extended.qls (notably
rust/access-after-lifetime-ended, rust/access-invalid-pointer,
and rust/uncontrolled-allocation-size — 2 full + 1 partial out
of ~17 total queries). On the unsafe-class advisory shapes in this
corpus, these did not fire at the methodology's WARNING gate in
either the locked or the aux run.
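The DB-reuse step can be sketched as a command builder. This is a hedged illustration only: `analyze_command` and the output path are hypothetical, the suite is passed by the bare name used in this document (a real invocation may need a pack-qualified suite path), and the actual run-aux-codeql.py may pass additional flags.

```python
from pathlib import Path


def analyze_command(db_dir: Path, suite: str, sarif_out: Path) -> list[str]:
    """Build a `codeql database analyze` call that reuses an existing DB."""
    return [
        "codeql", "database", "analyze",
        "--rerun",                # re-evaluate queries even if cached results exist
        "--format=sarif-latest",  # emit SARIF for downstream parsing
        f"--output={sarif_out}",
        str(db_dir),
        suite,
    ]


# Hypothetical paths for illustration.
cmd = analyze_command(
    Path("codeql-db"),
    "rust-security-and-quality.qls",
    Path("results.sarif"),
)
```

Building the argument list separately from executing it makes the exact invocation easy to log alongside the results.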
Aux table — per-tool TP/FP/TN/FN with delta vs headline
Raw (pre-paired) classifications. Paired finding-identity
reclassification is applied downstream by ward-eval bench-score;
results below are pre-paired so the delta is unambiguous.
| Tool | Headline (raw) TP/FP/TN/FN | Aux (raw) TP/FP/TN/FN | Δ TP | Δ FP |
|---|---|---|---|---|
| Ward (unchanged) | 39 / 0 / 80 / 41 | 39 / 0 / 80 / 41 | 0 | 0 |
| Semgrep | 0 / 0 / 80 / 80 | 0 / 0 / 80 / 80 | +0 | +0 |
| Rudra | 0 / 0 / 80 / 80 (parser-broken on 3 firings) | 1 / 2 / 78 / 79 (parser-fixed on same 3 firings) | +1 | +2 |
| CodeQL (25-entry subset re-analyzed) | 0 / 0 / 12 / 13 | 0 / 0 / 12 / 13 | +0 | +0 |
| cargo-geiger (unchanged, context-only) | 0 / 0 / 80 / 80 | 0 / 0 / 80 / 80 | 0 | 0 |
After paired scoring (FP→TN reclassification + fix-only-noise drop):
| Tool | Paired (Headline) | Paired (Aux) TP/FP/TN/FN | F1 | MCC |
|---|---|---|---|---|
| Ward (unchanged) | TP=39 FP=0 TN=70 FN=34 (per-entry F1=0.696, MCC=+0.600) | (same) | 0.655 (per-pair) | +0.564 (per-pair) |
| Semgrep | TP=0 FP=0 TN=80 FN=80 (F1=0) | TP=0 FP=0 TN=80 FN=80 | 0.000 | 0.000 |
| Rudra | TP=0 FP=0 TN=3 FN=3 errored=154 (F1=0) | TP=1 FP=0 TN=79 FN=79 errored=0 (see note) | 0.025 | +0.079 |
| CodeQL | TP=0 FP=0 TN=12 FN=13 errored=135 (F1=0) | TP=0 FP=0 TN=12 FN=13 errored=135 | 0.000 | 0.000 |
| cargo-geiger | TP=0 FP=0 TN=80 FN=80 (F1=0) | (same) | 0.000 | 0.000 |
Note on Rudra aux paired numbers: The aux harness counts rudra_failed
entries differently from the headline run (the locked runner's parser
error caused some entries to be classified as TN even with broken
findings; the aux parser recovers proper file paths, so the
error == "" paths increase). The paired-scoring errored count
therefore drops from 154 → 0 in the aux row, but only 6 entries
actually ran Rudra to completion (the same 6 in both runs); the
remaining 154 are still rudra_failed in the underlying raw results.
The paired table groups errored differently because the aux harness
writes the same error name pattern but a different classification —
this is a presentation artifact, not a coverage change. Rudra still
covers only 3.75% of the corpus (6/160) in both headline and aux.
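The F1 and MCC columns of the paired table can be recomputed from the confusion counts with the standard formulas. The sketch below is not ward-eval's implementation; the function names are hypothetical, and returning 0 when a denominator is zero is an assumed convention that matches the 0.000 rows in the table.

```python
import math


def f1(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def mcc(tp: int, fp: int, tn: int, fn: int) -> float:
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    # Assumed convention: an undefined MCC (zero denominator) is reported as 0.
    return (tp * tn - fp * fn) / denom if denom else 0.0


# Rudra aux paired row: TP=1 FP=0 TN=79 FN=79
rudra_f1 = round(f1(1, 0, 79), 3)
rudra_mcc = round(mcc(1, 0, 79, 79), 3)

# Ward paired row, per-entry view: TP=39 FP=0 TN=70 FN=34
ward_f1 = round(f1(39, 0, 34), 3)
ward_mcc = round(mcc(39, 0, 70, 34), 3)
```

With these inputs the Rudra row comes out at F1 0.025 and MCC 0.079, and the Ward per-entry view at F1 0.696 and MCC 0.600, in line with the table.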
Per-tool firing audit
Semgrep aux
160 / 160 entries scanned. 4 entries fired at least one WARNING+
Rust rule — all 4 fired the same single rule:
rust.lang.security.insecure-hashes.insecure-hashes. None of the 4
fires landed on the affected file for its CVE, so classification
remains FN/TN:
| Entry | vuln_class | affected_file | Rule fired on | In class? |
|---|---|---|---|---|
| rs-bench-rustsec-2025-0005 | memory_safety | src/covdir.rs | src/output.rs | no |
| rs-bench-rustsec-2025-0005-fix | memory_safety | src/covdir.rs | src/output.rs | no |
| rs-bench-rustsec-2024-0363 | memory_safety | sqlx-postgres/src/arguments.rs | (different file) | no |
| rs-bench-rustsec-2024-0363-fix | memory_safety | sqlx-postgres/src/arguments.rs | (different file) | no |
The note-level rust.lang.security.unsafe-usage rule from p/default
fires extensively (every unsafe block in every entry) but is filtered
by the methodology §5 severity gate. Even with this rule promoted to
WARNING manually, none of its firings would correlate to the affected
files because unsafe-usage matches every unsafe block, not the
specific obligation that the CVE violates. The methodology's
severity gate is doing exactly the right thing.
Rudra aux
After the parser + mapping fix, Rudra emits 2 unique rule_ids across 3 of the 6 entries it successfully analyzed (out of 160 total):
| Entry | vuln_class | Rudra rule_id (aux) | File | Aux raw class | Aux paired class |
|---|---|---|---|---|---|
| rs-bench-rustsec-2021-0033 (vuln) | memory_safety | (none) | — | FN | FN (no_findings) |
| rs-bench-rustsec-2021-0033-fix | memory_safety | RUDRA-UNSAFEDATAFLOW-WRITEFLOW | src/stack.rs | FP | NOISE (fix_only) |
| rs-bench-beef-rustsec-2020-0122 (vuln) | memory_safety | RUDRA-SENDSYNCVARIANCE-PHANTOMSENDFORSEND-NAIVESENDFORSEND | src/generic.rs | TP | TP (vuln_side_persisting) |
| rs-bench-beef-rustsec-2020-0122-fix | memory_safety | (same as above) | src/generic.rs | FP | TN (persists → reclassified) |
| rs-bench-safe-transmute-rustsec-2018-0013 (vuln) | memory_safety | (none) | — | FN | FN |
| rs-bench-safe-transmute-rustsec-2018-0013-fix | memory_safety | (none) | — | TN | TN |
Net Rudra delta after paired scoring: +1 TP (beef-2020-0122) and 0 paired FP (one fix-side false positive is reclassified as TN via persistence; the other is fix-only noise excluded from the headline). The headline parser bug had been suppressing this single TP; fixing it restores Rudra's actual recall of 1.25% (1 of 80 vulnerable entries) on a real unsoundness pattern. Rudra remains effectively dormant per methodology §10.
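The paired reclassification applied to the table above can be approximated with a small rule. This is a simplified sketch, not ward-eval bench-score: `paired_classes` is a hypothetical name, finding identity is approximated by rule id alone, and the inputs are assumed to be the in-scope rule ids that fired on each side of a vuln/fix pair.

```python
def paired_classes(vuln_rules: set[str], fix_rules: set[str]) -> tuple[str, str]:
    """Return (vuln_class, fix_class) for one vuln/fix pair."""
    vuln_class = "TP" if vuln_rules else "FN"
    if not fix_rules:
        fix_class = "TN"
    elif fix_rules & vuln_rules:
        # The same finding persists across the fix: the raw FP becomes TN.
        fix_class = "TN"
    else:
        # The finding appears only on the fixed side: dropped as noise.
        fix_class = "NOISE"
    return vuln_class, fix_class


SENDSYNC = "RUDRA-SENDSYNCVARIANCE-PHANTOMSENDFORSEND-NAIVESENDFORSEND"
WRITEFLOW = "RUDRA-UNSAFEDATAFLOW-WRITEFLOW"

pair_0122 = paired_classes({SENDSYNC}, {SENDSYNC})  # beef, RUSTSEC-2020-0122
pair_0033 = paired_classes(set(), {WRITEFLOW})      # RUSTSEC-2021-0033
pair_0013 = paired_classes(set(), set())            # safe-transmute, RUSTSEC-2018-0013
```

Applied to the three Rudra pairs, this reproduces the paired column: (TP, TN), (FN, NOISE) and (FN, TN).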
CodeQL aux (25-entry subset)
All 25 entries re-analyzed cleanly with the broader
rust-security-and-quality.qls suite. SARIF emitted 29 unique rules
(vs 27 in the locked rust-security-extended.qls). The 2 added
rules are rust/disabled-certificate-check and rust/cleartext-storage-database
— neither targets unsafe-class memory-safety. Across all 25 re-runs,
zero findings were emitted on any file. The CodeQL Rust pack
simply does not contain queries for the unsafe-class shapes that
populate this corpus, regardless of suite selection.
The 25 DB-create-timeout entries and the 110 not_run entries are unchanged (they were never going to run within the 10-min cap). Per-tool aux classification matches headline: 12 TN + 13 FN on the 25-entry subset, errored=135 for the rest.
McNemar with aux numbers
McNemar's exact two-sided binomial test on per-pair fully-correct
status, same machinery as the headline bench-stats. Per methodology
§7, significance threshold is p < 0.01. Even with broader competitor
rulesets, every (Ward, X) pairing remains massively significant:
| (A, B) | A right / B wrong | B right / A wrong | p-value | A beats B at p<0.01 |
|---|---|---|---|---|
| (Ward, Semgrep aux) | 37 | 0 | ≈ 1.46 × 10⁻¹¹ | yes (Ward) |
| (Ward, Rudra aux) | 37 | 1 | ≈ 2.92 × 10⁻¹⁰ | yes (Ward) |
| (Ward, CodeQL aux full corpus) | 37 | 0 | ≈ 1.46 × 10⁻¹¹ | yes (Ward) |
| (Rudra aux, Semgrep aux) | 1 | 0 | 1.000 | no (one-trial sample too small) |
| (Rudra aux, CodeQL aux) | 1 | 0 | 1.000 | no (same) |
| (Semgrep aux, CodeQL aux) | 0 | 0 | 1.000 | no (degenerate) |
The pairwise dominance pattern from the headline run fully persists under max-breadth competitor configurations. Ward's 95% F1 CI [0.559, 0.750] remains non-overlapping with every other tool's [0.000, ~0.072] CI.
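For reference, an exact two-sided McNemar test on the discordant counts can be sketched as below. This is a textbook formulation, not the bench-stats source; `mcnemar_exact` is a hypothetical name, and the tail convention (twice the smaller binomial tail, capped at 1) is an assumption.

```python
from math import comb


def mcnemar_exact(b: int, c: int) -> float:
    """Two-sided exact binomial p-value for discordant counts b and c."""
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs: degenerate case
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


p_semgrep = mcnemar_exact(37, 0)  # (Ward, Semgrep aux)
p_rudra = mcnemar_exact(37, 1)    # (Ward, Rudra aux)
p_small = mcnemar_exact(1, 0)     # (Rudra aux, Semgrep aux)
```

For 37 versus 0 this gives 2 × 0.5³⁷ ≈ 1.46 × 10⁻¹¹, and for 1 versus 0 it gives 1.0, both matching the table. For 37 versus 1 this formulation gives about 2.8 × 10⁻¹⁰, slightly below the tabled 2.92 × 10⁻¹⁰, which may reflect a different tail convention in bench-stats; either value is far below the p < 0.01 threshold.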
Honest conclusion
Even after maximizing competitor breadth on top of the methodology's fairness floor, the headline ranking does not change:
- Semgrep: Adding r/rust + p/security-audit + p/default (1,079 rules; 19 Rust-specific; vs 11 in locked p/rust) yields zero in-class WARNING-level findings on the affected files. The only Rust rule in any of these packs that targets memory-safety (rust.lang.security.unsafe-usage) is note-level by default, below the methodology's WARNING gate. The 4 entries that did fire at WARNING fired only the generic insecure-hashes rule, on the wrong files. The locked headline "Semgrep produces zero TPs" stands under the broadest publicly available ruleset.
- CodeQL: Switching to rust-security-and-quality.qls (a strict superset of rust-security-extended.qls, adding 2 quality queries) yields zero additional findings on the 25 already-processed entries. The CodeQL Rust pack as of v2.25.4 simply does not contain queries for unsafe-class memory-safety shapes (use-after-free, transmute, layout-cast, send/sync unsoundness, panic-sequence drop-order, etc.). The 25 DB-create-timeout entries and 110 not_run entries are unchanged. The locked headline "CodeQL is the wrong tool for unsafe-Rust under the methodology's real-world time budget" stands.
- Rudra: Fixing the in-image parser bug and extending the rule-id mapping (sibling to bn-bums5) yields exactly one paired TP on rs-bench-beef-rustsec-2020-0122. This is a SendSyncVariance finding that Rudra had been emitting all along but that the in-image parser was mis-extracting (treating the message as the file path). One paired TP at 1.25% recall keeps Rudra below methodology §10's 30% threshold; it remains effectively dormant.
- cargo-geiger: context-only column; not subject to broader rulesets.
The headline claim "Ward is best-in-class on Rust unsafe scanning" holds under the auxiliary max-breadth pass, with every Ward-vs-competitor McNemar test still significant at p < 0.01 (p ≈ 1.46 × 10⁻¹¹ against Semgrep and CodeQL, p ≈ 2.92 × 10⁻¹⁰ against Rudra). The corollary is itself a stronger statement than the locked headline made: competitors' lack of detection is not a "wrong ruleset" problem that broader rules would fix, but a "no relevant rule exists in public registries" problem.
What this aux run did NOT change
- Methodology: The locked methodology (notes/benchmarks/unsafe-rust-bench-methodology.md) is unchanged.
- Image: localhost/ward-bench:locked (sha256:b7707fe926c9…) is unchanged.
- Tool pin file: bench/tool-versions.toml is unchanged.
- Locked rule-id mapping: bench/rule-id-mapping.toml is unchanged (the aux mapping at bench/rule-id-mapping-aux-max-breadth.toml is parallel, not a replacement).
- Ward results / Ward binary / Ward config: unchanged.
- Corpus: unchanged (same 160 entries / 80 pairs).
- Sandbox controls: --network=none, 4 cores, 16 GiB, 10-min cap, all identical to the locked run.
Artifacts
Committed under notes/benchmarks/artifacts/aux/:
- bench-results-semgrep-aux.json: aux Semgrep (160 entries, 4 configs union)
- bench-results-rudra-aux.json: aux Rudra (parser + mapping fixes applied; 160 entries, 154 still rudra_failed)
- bench-results-codeql-aux.json: aux CodeQL (25 entries re-analyzed with broader suite; 25 timed_out + 110 not_run retain headline status)
- paired-summary-aux.json: per-tool paired aggregates
- stats-aux.json: full bench-stats output with bootstrap CIs + McNemar pairwise + per-class + latency
Build / harness artifacts (not committed; reproducible):
- target/bench/aux-max-breadth/raw/<tool>/<entry-id>/: per-entry SARIF dumps from the aux runs.
- target/bench/aux-max-breadth/rules/: prefetched broader Semgrep YAML rulesets.
- target/bench/aux-max-breadth-tail/, target/bench/aux-max-breadth-tail2/: parallel-worker scratch directories.
Reproduction
From a fresh clone:
# 1. Build harness binaries (unchanged from headline)
cargo build --release -p ward-cli -p ward-stub-analyzer -p ward-eval
# 2. Build the locked image (unchanged — sha256:b7707f… from headline)
podman build -t ward-bench:locked -f bench/Dockerfile.bench .
# 3. Prefetch broader Semgrep rulesets to target/bench/aux-max-breadth/rules/
mkdir -p target/bench/aux-max-breadth/rules
curl -sSL "https://semgrep.dev/c/p/rust" -o target/bench/aux-max-breadth/rules/p-rust.yml
curl -sSL "https://semgrep.dev/c/r/rust" -o target/bench/aux-max-breadth/rules/r-rust-full.yml
curl -sSL "https://semgrep.dev/c/p/security-audit" -o target/bench/aux-max-breadth/rules/p-security-audit.yml
curl -sSL "https://semgrep.dev/c/p/default" -o target/bench/aux-max-breadth/rules/p-default.yml
# 4. Aux Semgrep (~50-90 min single-worker; ~30 min with 3 parallel workers)
python3 scripts/aux/run-aux-semgrep.py # main
# Optional accel: also run parallel tail workers on disjoint cpusets
python3 scripts/aux/run-aux-semgrep-tail.py & # entries 80..159, cpuset 4-7
python3 scripts/aux/run-aux-semgrep-tail2.py & # entries 120..159, cpuset 8-11
wait
# 5. Aux Rudra (~3 min; most entries fail fast)
python3 scripts/aux/run-aux-rudra.py
# 6. Aux CodeQL (~7 min; reuses existing DBs)
python3 scripts/aux/run-aux-codeql.py
# 7. Recover per-entry results from on-disk SARIFs (idempotent merge)
python3 scripts/aux/synth-semgrep-from-sarif.py
# 8. Combine ward + cargo-geiger from headline with aux results
cp <headline-out>/bench-results-ward.json target/bench/aux-max-breadth/
cp <headline-out>/bench-results-cargo-geiger.json target/bench/aux-max-breadth/
# 9. Score + stats
./target/release/ward-eval bench-score \
--raw target/bench/aux-max-breadth \
--out target/bench/aux-max-breadth/paired
./target/release/ward-eval bench-stats \
--raw target/bench/aux-max-breadth \
--paired target/bench/aux-max-breadth/paired \
--manifest tests/cve-registry/benchmarks/unsafe-rust-bench/manifest.toml \
--out target/bench/aux-max-breadth/stats-aux.json
# 10. Audit per-tool firings
python3 scripts/aux/audit-aux-firings.py
References
- Methodology (locked): notes/benchmarks/unsafe-rust-bench-methodology.md
- Headline results: notes/benchmarks/unsafe-rust-bench-results-2026-05-13.md
- Aux harness: scripts/aux/*.py
- Aux rule-id mapping: bench/rule-id-mapping-aux-max-breadth.toml
- Image build: bench/Dockerfile.bench (unchanged from headline run)
- Tool pinning: bench/tool-versions.toml (unchanged from headline run)