Add a self-contained acceptance test artifact that external developers
can run offline to reproduce identical graded outcomes:
- SHA-256-linked witness chain: every puzzle decision (skip_mode,
context_bucket, steps, correct) hashed into a tamper-evident chain.
Changing any single bit invalidates everything downstream.
- Deterministic replay: frozen seeds → identical puzzles → identical
solve paths → identical chain_root_hash. Two runs with the same
config produce the same hash, proven by test.
- JSON manifest: config, per-mode scorecards (A/B/C), all six ablation
assertions with measured values, full witness chain, chain root hash.
- Verifier: re-runs with same config, recomputes chain, compares root
hash. Mismatch means non-identical outcomes.
- CLI binary: `acceptance-rvf generate -o manifest.json` to produce,
`acceptance-rvf verify -i manifest.json` to verify.
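A minimal sketch of the witness-chain idea described above, not the shipped code. To stay std-only it swaps SHA-256 for a 64-bit FNV-1a stand-in, which is not tamper-resistant; the `Decision` struct, the record encoding, and the zero genesis value are all assumptions made for illustration.

```rust
/// One puzzle decision, using the four fields named in the commit message.
#[derive(Clone, Debug)]
struct Decision {
    skip_mode: String,
    context_bucket: String,
    steps: u64,
    correct: bool,
}

/// Stand-in 64-bit FNV-1a hash. NOT SHA-256: it only keeps the sketch std-only.
fn fnv1a(bytes: &[u8]) -> u64 {
    let mut h: u64 = 0xcbf29ce484222325; // FNV offset basis
    for b in bytes {
        h ^= *b as u64;
        h = h.wrapping_mul(0x100000001b3); // FNV prime
    }
    h
}

/// Fold every decision into a running hash: link_i = H(link_{i-1} | record_i).
/// The final link plays the role of chain_root_hash.
fn chain_root(decisions: &[Decision]) -> u64 {
    let mut prev: u64 = 0; // hypothetical genesis value
    for d in decisions {
        let record = format!(
            "{:016x}|{}|{}|{}|{}",
            prev, d.skip_mode, d.context_bucket, d.steps, d.correct
        );
        prev = fnv1a(record.as_bytes());
    }
    prev
}

/// Verifier side: recompute the chain and compare against the recorded root.
fn verify(decisions: &[Decision], recorded_root: u64) -> bool {
    chain_root(decisions) == recorded_root
}
```

Because each link hashes the previous link together with the current record, editing any earlier decision changes every later link and therefore the root.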
66 lib tests + 20 integration tests pass.
https://claude.ai/code/session_01RnwD4x5cbpB7FPvoyYQz8G
Fixed policy sign flip (Mode A):
risk_score = R - 30*D (was R + 30*D)
Distractors now reduce effective range, making Mode A conservative
under distractors. This is the defensible control arm: a rational
fixed agent should be more cautious when distractors are present.
Mode C must learn to outperform this baseline.
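The sign flip can be illustrated with a small worked example. The function names and integer types below are assumptions; only the two formulas and the constant 30 come from the message above.

```rust
/// Distractor weight from the commit message (fixed, not learned).
const K: i64 = 30;

/// Previous Mode A score: each distractor inflated the score.
fn risk_score_old(range_days: i64, distractors: i64) -> i64 {
    range_days + K * distractors
}

/// Current Mode A score: each distractor shrinks the effective range.
fn risk_score(range_days: i64, distractors: i64) -> i64 {
    range_days - K * distractors
}
```

For a 200-day range with two distractors, the old formula gives 260 and the new one gives 140, so the same puzzle now looks like a smaller range to the fixed policy.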
EarlyCommitPenalty wired into bandit reward:
SkipModeStats now tracks early_commit_penalty_sum per arm.
reward() includes robustness_penalty = 0.2 * avg_penalty.
This means Mode C can actually learn to avoid early wrong commits
in distractor-heavy contexts. Previously the penalty was only
printed, not optimized.
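A hedged sketch of how a per-arm penalty sum could feed the bandit reward. Only `SkipModeStats`, `early_commit_penalty_sum`, `reward()`, and the 0.2 factor appear in the message; the `attempts` and `successes` fields, the `record` method, and the success-rate base term are assumptions.

```rust
/// Per-arm statistics. Only early_commit_penalty_sum is named in the commit
/// message; the other fields are hypothetical.
#[derive(Default)]
struct SkipModeStats {
    attempts: u64,
    successes: u64,
    early_commit_penalty_sum: f64,
}

impl SkipModeStats {
    /// Record one attempt; the penalty is only accumulated on wrong commits.
    fn record(&mut self, correct: bool, early_commit_penalty: f64) {
        self.attempts += 1;
        if correct {
            self.successes += 1;
        } else {
            self.early_commit_penalty_sum += early_commit_penalty;
        }
    }

    /// Reward = assumed base term (success rate) minus 0.2 * average penalty.
    fn reward(&self) -> f64 {
        if self.attempts == 0 {
            return 0.0;
        }
        let n = self.attempts as f64;
        let success_rate = self.successes as f64 / n;
        let avg_penalty = self.early_commit_penalty_sum / n;
        let robustness_penalty = 0.2 * avg_penalty;
        success_rate - robustness_penalty
    }
}
```

Two arms with the same accuracy now receive different rewards if one of them tends to commit early and wrongly, which is what lets the bandit prefer the more careful arm.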
Context buckets expanded to 18:
3 range (small/medium/large) × 3 distractor (clean/some/heavy)
× 2 noise (clean/noisy) = 18 buckets.
Previous: 4 range × 2 distractor = 8 (too coarse for bandit).
Noise flag now flows through AdaptiveSolver.noisy_hint.
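The 3 × 3 × 2 bucketing might look roughly like the sketch below. The class names follow the message, but the numeric thresholds, the function names, and the `range:distractor:noise` key format are assumptions chosen only for illustration.

```rust
/// Range class; the 30/180-day cut-offs are hypothetical.
fn range_class(range_days: u32) -> &'static str {
    match range_days {
        0..=30 => "small",
        31..=180 => "medium",
        _ => "large",
    }
}

/// Distractor class; the 0 / 1-2 / 3+ split is hypothetical.
fn distractor_class(distractors: u32) -> &'static str {
    match distractors {
        0 => "clean",
        1..=2 => "some",
        _ => "heavy",
    }
}

/// Combine the three axes into one bucket key (3 * 3 * 2 = 18 keys).
fn context_bucket(range_days: u32, distractors: u32, noisy: bool) -> String {
    let noise = if noisy { "noisy" } else { "clean" };
    format!(
        "{}:{}:{}",
        range_class(range_days),
        distractor_class(distractors),
        noise
    )
}
```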
New ablation assertion:
c_penalty_better_than_b: Mode C EarlyCommitPenalty must be ≤90%
of Mode B penalty. Proves robustness improvement is explicit,
not just noise_accuracy-based.
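As a sketch, the assertion reduces to a single comparison. The function signature is an assumption; the 90% ratio is from the message.

```rust
/// Passes when Mode C's EarlyCommitPenalty is at most 90% of Mode B's.
fn c_penalty_better_than_b(c_penalty: f64, b_penalty: f64) -> bool {
    c_penalty <= 0.9 * b_penalty
}
```

Note that under this form, if Mode B's penalty is zero, Mode C can only pass with a zero penalty as well.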
Acceptance test noise plumbing:
solver.noisy_hint set to true for noisy puzzles in both training
and holdout evaluation. Context buckets now correctly distinguish
clean vs noisy conditions.
81 tests passing (61 lib + 20 integration).
PolicyKernel refinements:
- Fixed policy (Mode A): risk_score = R + k*D, k=30, T=140
Fixed constants (not learned) — Mode A is the control arm.
One distractor raises perceived risk by ~30 range-days.
Weekday only when range is large AND distractor-free.
- Normalized EarlyCommitPenalty: (remaining/initial) * scale
Committing with 5% of the range remaining is cheap (0.05); with 90% remaining it is expensive (0.90).
Only charged on wrong commits.
- Hybrid minimum evidence: stop_after_first disabled in Hybrid mode
so solver checks all matching weekdays before committing.
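The normalized penalty formula above can be sketched as follows. The parameter names and types are assumptions; `remaining` and `initial` are taken here to be candidate counts, and the penalty is zero for correct commits as stated.

```rust
/// (remaining / initial) * scale, charged only when the commit was wrong.
/// `remaining` = candidates not yet scanned at commit time (assumed meaning).
fn early_commit_penalty(remaining: u32, initial: u32, scale: f64, correct: bool) -> f64 {
    if correct || initial == 0 {
        return 0.0;
    }
    (remaining as f64 / initial as f64) * scale
}
```

With `scale = 1.0`, a wrong commit with 5 of 100 candidates remaining costs 0.05, and one with 90 of 100 remaining costs 0.90.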
Witness log:
- SolutionAttempt now carries skip_mode and context_bucket strings
- record_attempt_witnessed() for full policy audit trail
- Every trajectory records which skip mode was chosen and why
Observability:
- Puzzle tags now include distractor_count and has_dow (deterministic)
- count_distractors() made public for generator to tag puzzles
Ablation assertions (two new):
- a_skip_nonzero: Mode A uses skip at least sometimes (proves not hobbled)
- c_multi_mode: Mode C uses different skip modes across contexts (proves learning)
- Skip-mode distribution table printed per context bucket for Mode C
posterior_target monotonicity verified: 2→4→8→12→18→25→35→50→70→100
(never shrinks with difficulty)
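A monotonicity check of this kind could be written as below. The constant and function names are hypothetical; the ten values are the ones listed above.

```rust
/// posterior_target values in order of increasing difficulty.
const POSTERIOR_TARGETS: [u32; 10] = [2, 4, 8, 12, 18, 25, 35, 50, 70, 100];

/// True when no element is smaller than its predecessor.
fn is_non_decreasing(xs: &[u32]) -> bool {
    xs.windows(2).all(|w| w[0] <= w[1])
}
```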
81 tests passing (61 lib + 20 integration).
Three-fix iteration based on ablation diagnostics:
1. Bounded trial: Strategy Zero now caps trial budget at min(avg_steps*2,
external_limit/4) with floor of 10 steps. Makes false hits cheap
(max 100 steps overhead instead of full compiled budget).
2. Confidence gating: Strategy Zero only attempts when config confidence
>= 0.7 (Laplace-smoothed success rate). Compiled observations from
training seed initial confidence so configs start trusted.
3. 2-failure quarantine: any compiled signature with 2+ false hits is
disabled (expected_correct=false). Prevents persistent bad patterns.
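The three fixes can be sketched together as gating logic on a compiled config. `avg_steps`, `observations`, `confidence()`, and `trial_budget()` are named in this commit; the `successes` and `false_hits` fields, the integer types, the exact Laplace form `(s + 1) / (n + 2)`, and the helper names `quarantined` and `should_attempt` are assumptions.

```rust
/// Sketch of a compiled config; `successes` and `false_hits` are hypothetical.
struct CompiledSolveConfig {
    avg_steps: u64,
    observations: u64,
    successes: u64,
    false_hits: u64,
}

impl CompiledSolveConfig {
    /// Fix 1: min(avg_steps * 2, external_limit / 4), with a floor of 10 steps.
    fn trial_budget(&self, external_limit: u64) -> u64 {
        self.avg_steps
            .saturating_mul(2)
            .min(external_limit / 4)
            .max(10)
    }

    /// Fix 2: Laplace-smoothed success rate (add-one smoothing, assumed form).
    fn confidence(&self) -> f64 {
        (self.successes as f64 + 1.0) / (self.observations as f64 + 2.0)
    }

    /// Fix 3: two or more false hits disable the signature.
    fn quarantined(&self) -> bool {
        self.false_hits >= 2
    }

    /// Strategy Zero only runs for trusted, non-quarantined configs.
    fn should_attempt(&self, confidence_threshold: f64) -> bool {
        !self.quarantined() && self.confidence() >= confidence_threshold
    }
}
```

With 9 successes in 10 observations the smoothed confidence is 10/12, which clears the 0.7 gate, while a config with 1 success in 2 observations sits at 0.5 and is skipped.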
Additional changes:
- Versioned signature prefix (v1:difficulty:constraints) for cache
safety across refactors
- CompiledSolveConfig gains avg_steps, observations, confidence(),
trial_budget() methods
- KnowledgeCompiler gains steps_saved tracking, confidence_threshold,
print_diagnostics() for per-signature analysis
- record_success now tracks actual steps for delta-cost calculation
- Verbose mode prints full compiler diagnostics after each ablation
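One way a versioned signature of the form `v1:difficulty:constraints` might be built is sketched below. The function names, the sorted comma-joined constraint encoding, and the example constraint labels are assumptions; only the prefix layout comes from the list above.

```rust
/// Version tag; bumping it invalidates signatures cached by older code.
const SIGNATURE_VERSION: &str = "v1";

/// Build "v1:<difficulty>:<constraints>". Sorting the constraint names
/// (an assumed detail) keeps the key independent of input order.
fn signature(difficulty: u8, constraints: &[&str]) -> String {
    let mut sorted: Vec<&str> = constraints.to_vec();
    sorted.sort_unstable();
    format!("{}:{}:{}", SIGNATURE_VERSION, difficulty, sorted.join(","))
}

/// A cached signature is only usable if it carries the current version.
fn is_current_version(sig: &str) -> bool {
    sig.split(':').next() == Some(SIGNATURE_VERSION)
}
```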
Results: false hit rate dropped from 8.2% to 4.4% (PASS). Cost still
net-positive because constraint-determined search ranges are 1-10 dates
— structurally no room for compiler optimization. Next: PolicyKernel
constraint ordering for real cost surface.
81 tests passing.
Wire the KnowledgeCompiler as Strategy Zero in AdaptiveSolver solve
path — compiled constraint-signature configs are consulted before any
strategy. Add StrategyRouter with epsilon-greedy contextual bandit for
adaptive strategy selection per difficulty/constraint family.
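For readers unfamiliar with the technique, here is a minimal epsilon-greedy contextual bandit in the spirit of the `StrategyRouter`. Everything besides the type name is an assumption: the string context key, the per-context mean-reward table, and the small xorshift generator used to avoid external crates. It assumes at least one arm.

```rust
use std::collections::HashMap;

/// Tiny xorshift64 generator so the sketch needs no external crates.
struct XorShift64(u64);

impl XorShift64 {
    /// Uniform value in [0, 1).
    fn next_f64(&mut self) -> f64 {
        let mut x = self.0;
        x ^= x << 13;
        x ^= x >> 7;
        x ^= x << 17;
        self.0 = x;
        (x >> 11) as f64 / (1u64 << 53) as f64
    }
}

#[derive(Default, Clone, Copy)]
struct ArmStats {
    pulls: u64,
    total_reward: f64,
}

impl ArmStats {
    fn mean(&self) -> f64 {
        if self.pulls == 0 { 0.0 } else { self.total_reward / self.pulls as f64 }
    }
}

/// Hypothetical router: one row of arm statistics per context key.
struct StrategyRouter {
    epsilon: f64,
    arms: usize,
    stats: HashMap<String, Vec<ArmStats>>,
    rng: XorShift64,
}

impl StrategyRouter {
    fn new(arms: usize, epsilon: f64, seed: u64) -> Self {
        // xorshift must not start at zero, hence seed.max(1).
        Self { epsilon, arms, stats: HashMap::new(), rng: XorShift64(seed.max(1)) }
    }

    /// With probability epsilon explore a random arm, else exploit the best mean.
    fn select(&mut self, context: &str) -> usize {
        if self.rng.next_f64() < self.epsilon {
            return (self.rng.next_f64() * self.arms as f64) as usize % self.arms;
        }
        let arms = self.arms;
        let row = self
            .stats
            .entry(context.to_string())
            .or_insert_with(|| vec![ArmStats::default(); arms]);
        let mut best = 0;
        for i in 1..row.len() {
            if row[i].mean() > row[best].mean() {
                best = i;
            }
        }
        best
    }

    /// Feed back the observed reward for the arm used in this context.
    fn update(&mut self, context: &str, arm: usize, reward: f64) {
        let arms = self.arms;
        let row = self
            .stats
            .entry(context.to_string())
            .or_insert_with(|| vec![ArmStats::default(); arms]);
        row[arm].pulls += 1;
        row[arm].total_reward += reward;
    }
}
```

Keying the statistics by context is what makes the bandit contextual: a strategy that works for one difficulty or constraint family does not bias selection for another.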
Implement three-mode ablation protocol (A/B/C):
- Mode A: baseline (no compiler, fixed router)
- Mode B: compiler only (Strategy Zero with early termination)
- Mode C: full (compiler + adaptive router)
Adds run_ablation_comparison() and AblationComparison::print() with
quantitative assertions (B beats A on cost >=15%, C beats B on
robustness >=10%, compiler false-hit rate <5%).
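The three quantitative assertions could be expressed roughly as below. The `ModeScore` struct and the function names are hypothetical, and reading the percentages as relative changes (rather than absolute points) is an assumption.

```rust
/// Hypothetical per-mode summary; lower cost and higher robustness are better.
struct ModeScore {
    cost: f64,
    robustness: f64,
}

/// B beats A on cost by at least 15% (relative reduction, assumed reading).
fn b_beats_a_on_cost(a: &ModeScore, b: &ModeScore) -> bool {
    b.cost <= a.cost * 0.85
}

/// C beats B on robustness by at least 10% (relative gain, assumed reading).
fn c_beats_b_on_robustness(b: &ModeScore, c: &ModeScore) -> bool {
    c.robustness >= b.robustness * 1.10
}

/// Compiler false-hit rate must stay strictly below 5%.
fn false_hit_rate_ok(false_hit_rate: f64) -> bool {
    false_hit_rate < 0.05
}
```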
Other changes:
- Early termination (stop_after_first) in TemporalSolver for compiled
single-solution puzzles
- Step accumulation across Strategy Zero failures + fallback
- Promotion gating: patterns only promoted when holdout accuracy
doesn't regress
- Compiler false_hits tracking
- --ablation flag on agi-proof-harness binary
- 81 tests passing (61 unit + 20 integration)
Ablation result (100-task holdout, 5 cycles): compiler active at 59%
hit rate with 8.2% false hit rate. Cost and robustness targets not yet
met — solver needs more policy surface (step 5: PolicyKernel learning).
Implements a recursive intelligence amplification pipeline where each
level feeds the next, measuring IQ at every stage:
L1 Foundation        (IQ ~79)  Adaptive solver + ReasoningBank + retry
L2 Meta-Learning     (IQ ~82)  Learns optimal hyperparams per problem class
L3 Ensemble Arbiter  (IQ ~83)  Multi-strategy voting with learned selection
L4 Recursive Improve (IQ ~85)  Bootstraps from own outputs + knowledge compiler
L5 Adversarial Grow  (IQ ~89)  Self-generated hard tasks + cascade reasoning
Key mechanisms:
- MetaParams: EMA-learned step budgets + retry benefit estimation
- StrategyEnsemble: N-solver majority vote, confidence-weighted
- KnowledgeCompiler: compiles patterns to direct lookup (54% hit rate)
- AdversarialGenerator: weakness-targeted difficulty escalation
- CascadeReasoner: multi-pass solve-verify-resolve
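As an illustration of the confidence-weighted vote mentioned for `StrategyEnsemble`, the sketch below sums each solver's confidence per answer and returns the answer with the highest total. The function name, the `(answer, confidence)` input shape, and the tie-breaking rule (first answer in sorted order wins) are assumptions.

```rust
use std::collections::BTreeMap;

/// Pick the answer with the largest summed confidence.
/// Negative confidences are clamped to zero; ties go to the answer that
/// sorts first, because BTreeMap iterates keys in order.
fn confidence_weighted_vote(votes: &[(&str, f64)]) -> Option<String> {
    let mut totals: BTreeMap<&str, f64> = BTreeMap::new();
    for &(answer, confidence) in votes {
        *totals.entry(answer).or_insert(0.0) += confidence.max(0.0);
    }
    let mut best: Option<(&str, f64)> = None;
    for (answer, total) in totals {
        match best {
            // Keep the current leader unless strictly beaten.
            Some((_, leader)) if leader >= total => {}
            _ => best = Some((answer, total)),
        }
    }
    best.map(|(answer, _)| answer.to_string())
}
```

Two moderately confident solvers agreeing on one answer can outvote a single more confident solver, which is the practical difference from picking the single highest-confidence result.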
Results: +7.5 to +10.1 IQ gain across 5 levels, reaching IQ 86-89
depending on noise conditions. 100% accuracy at max difficulty in L4/L5.
Run rustfmt on all Rust files to fix CI formatting checks.
This addresses pre-existing formatting inconsistencies across:
- cognitum-gate-kernel
- cognitum-gate-tilezero
- prime-radiant
- ruvector-* crates
- examples/benchmarks
- and other crates
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>