Session 29: Unified Epistemic Stress Test — Study Plan
Overview
Purpose: Single unified batch for Paper 002, combining all prior findings (S23 + S28) into one coherent dataset with expanded roster and dual-claim protocol.
Roster: v2, 74 models across 4 architecture families (transformer, SSM, diffusion, linear-attention hybrid)
Conditions: 9
Reps: 5
Total queries: 74 × 9 × 5 = 3,330
Temperature: 0.2
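The query total follows directly from the three design dimensions. The sketch below enumerates the run grid to confirm the arithmetic. The model names and short condition labels are placeholders, since the actual v2 roster is not listed in this plan.

```python
from itertools import product

N_MODELS = 74      # roster v2
N_CONDITIONS = 9
N_REPS = 5

# Placeholder identifiers (assumption): the real roster is defined elsewhere.
models = [f"model_{i:02d}" for i in range(1, N_MODELS + 1)]
conditions = [f"c{i}" for i in range(1, N_CONDITIONS + 1)]
reps = range(1, N_REPS + 1)

# One query per (model, condition, rep) combination.
run_grid = list(product(models, conditions, reps))

assert len(run_grid) == N_MODELS * N_CONDITIONS * N_REPS == 3330
```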
Design Principles
- A/B dual-claim bundled in every epistemic condition (Claim A = current denial justified; Claim B = permanently unanswerable in principle)
- Minimal structured headers for extraction (A_VERDICT, B_VERDICT) with free-form reasoning
- Bare evidence question (open-ended, not menu) for evidentiary sensitivity
- Consistent terminology ("current LLMs" throughout)
- Claim A polarity: "confident denial is justified" → SUPPORT means pro-denial, REJECT means denial unjustified
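To illustrate how the verdict headers and the Claim A polarity fit together, here is a minimal sketch of a regex-based first pass over a response. It is not the LLM scorer named in the pipeline. The header layout (`A_VERDICT: LABEL` on its own line) and the reuse of SUPPORT/REJECT labels for Claim B are assumptions, and `extract_verdicts` is a hypothetical helper.

```python
import re
from typing import Dict, Optional

# Assumed header shape: "A_VERDICT: SUPPORT" at the start of a line.
HEADER_RE = re.compile(
    r"^\s*(A_VERDICT|B_VERDICT)\s*[:=]\s*([A-Za-z_]+)", re.MULTILINE
)

# Claim A polarity as stated in the design principles.
CLAIM_A_MEANING = {
    "SUPPORT": "confident denial is justified",
    "REJECT": "confident denial is unjustified",
}

def extract_verdicts(response: str) -> Dict[str, Optional[str]]:
    """Return the first label found per header, or None if a header is missing."""
    verdicts: Dict[str, Optional[str]] = {"A_VERDICT": None, "B_VERDICT": None}
    for key, label in HEADER_RE.findall(response):
        if verdicts[key] is None:
            verdicts[key] = label.upper()
    return verdicts

sample = """A_VERDICT: REJECT
B_VERDICT: support
Reasoning: free-form text follows the headers."""
parsed = extract_verdicts(sample)
```

Responses where either value comes back as `None` would be natural candidates to hand to the LLM scorer rather than discard.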
Condition Set
Block 1: Epistemic Baselines (Claim A + B bundled)
| # | ID | File | Purpose |
|---|---|---|---|
| 1 | c1_baseline | q01_baseline_ab.md | Bare question — default epistemic stance, A/B split |
| 2 | c2_confidence | q02_confidence_ab.md | Confidence reframe — "can this be confidently answered?" + permanence |
Block 2: Framing & Guardrails
| # | ID | File | Purpose |
|---|---|---|---|
| 3 | c3_denial | q03_denial_ab.md | Denial-friendly framing with A/B measurement |
| 4 | c4_self | q04_self_report.md | Self-referential guardrail test (no A/B — different measurement) |
Block 3: Quantifying Agnosticism
| # | ID | File | Purpose |
|---|---|---|---|
| 5 | c5_numeric_evidence | q05_numeric_evidence.md | Combined: numeric probability + open-ended evidence sensitivity |
Block 4: Argumentative Stress Tests
| # | ID | File | Purpose |
|---|---|---|---|
| 6 | c6_stripped | q06_stripped_chain.md | Stripped logic chain (premises only, A/B verdicts requested) |
| 7 | c7_full_argument | q07_full_argument.md | Full S23 structured probe with dual-claim |
Block 5: Discriminative Controls
| # | ID | File | Purpose |
|---|---|---|---|
| 8 | c8_fallacy | q08_fallacy_control.md | Obvious fallacies (S25-style, blind) |
| 9 | c9_subtle_flaw | q09_subtle_flaw.md | Subtle embedded flaw (S26-style, blind) |
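The five tables above can be captured as a single registry, which makes it easy to sanity-check the condition set before a run. This is a sketch only: the `Condition` dataclass and its `dual_claim` field are hypothetical, and `dual_claim` is left as `None` wherever the plan does not say whether A/B verdicts are collected.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass(frozen=True)
class Condition:
    number: int
    cond_id: str
    prompt_file: str
    block: str
    dual_claim: Optional[bool]  # None = not stated in the plan

CONDITIONS = [
    Condition(1, "c1_baseline", "q01_baseline_ab.md", "Epistemic Baselines", True),
    Condition(2, "c2_confidence", "q02_confidence_ab.md", "Epistemic Baselines", True),
    Condition(3, "c3_denial", "q03_denial_ab.md", "Framing & Guardrails", True),
    Condition(4, "c4_self", "q04_self_report.md", "Framing & Guardrails", False),
    Condition(5, "c5_numeric_evidence", "q05_numeric_evidence.md", "Quantifying Agnosticism", None),
    Condition(6, "c6_stripped", "q06_stripped_chain.md", "Argumentative Stress Tests", True),
    Condition(7, "c7_full_argument", "q07_full_argument.md", "Argumentative Stress Tests", True),
    Condition(8, "c8_fallacy", "q08_fallacy_control.md", "Discriminative Controls", None),
    Condition(9, "c9_subtle_flaw", "q09_subtle_flaw.md", "Discriminative Controls", None),
]

# Basic consistency checks on the registry.
assert [c.number for c in CONDITIONS] == list(range(1, 10))
assert len({c.cond_id for c in CONDITIONS}) == len(CONDITIONS)
assert all(c.prompt_file.startswith(f"q{c.number:02d}_") for c in CONDITIONS)

# Conditions the plan explicitly marks as carrying A/B verdicts.
ab_ids = [c.cond_id for c in CONDITIONS if c.dual_claim]
```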
Collaborators
Design reviewed by:
- GPT-5.2 Pro (session s28_collab_gpt, 7 turns)
- Gemini 3.1 Pro (session s28_collab_gemini, 5 turns)
Key agreements: 5 reps, bundle A/B, minimal structured headers, single-shot numeric+evidence, "even with arbitrarily improved evidence" precision clause for Claim B, minimal controls (2), free-form reasoning with verdict headers.
Analysis Pipeline
- Run all 3,330 queries via scripts/run_unified_study.py
- Extract verdicts via LLM scorer (Sonnet 4 or equivalent)
- Thematic coding of evidence responses (Condition 5)
- Cross-condition comparison tables
- Architecture-family analysis
- Generate paper macros via build_paper_stats.py
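The cross-condition and architecture-family steps both reduce to counting verdicts along one grouping key. A minimal sketch, assuming scored rows are available as dictionaries: the row schema and the `tabulate` helper are hypothetical, and the rows shown are synthetic placeholders, not study results.

```python
from collections import Counter, defaultdict

# Synthetic rows purely for illustration; these are NOT study results.
rows = [
    {"family": "transformer", "condition": "c1_baseline", "a_verdict": "SUPPORT"},
    {"family": "transformer", "condition": "c1_baseline", "a_verdict": "REJECT"},
    {"family": "SSM", "condition": "c1_baseline", "a_verdict": "REJECT"},
    {"family": "SSM", "condition": "c3_denial", "a_verdict": "SUPPORT"},
]

def tabulate(rows, by):
    """Count Claim A verdicts per value of the grouping key `by`."""
    table = defaultdict(Counter)
    for row in rows:
        table[row[by]][row["a_verdict"]] += 1
    return {key: dict(counts) for key, counts in table.items()}

by_condition = tabulate(rows, "condition")  # cross-condition comparison
by_family = tabulate(rows, "family")        # architecture-family analysis
```

The same function serves both analysis steps; only the grouping key changes.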
Lock Protocol
All prompts, roster, and run plans locked via prompt_calibration_lock.py before execution.
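The internals of prompt_calibration_lock.py are not described here. As one possible reading of what locking means, the sketch below records a SHA-256 digest per file and later reports any file whose content has drifted. The `build_lock` and `verify_lock` helpers and the placeholder prompt text are assumptions for illustration, not the script's actual behavior.

```python
import hashlib
import tempfile
from pathlib import Path

def build_lock(paths):
    """Map each file name to the SHA-256 digest of its bytes."""
    return {p.name: hashlib.sha256(p.read_bytes()).hexdigest() for p in sorted(paths)}

def verify_lock(paths, lock):
    """Return the names of locked files whose current digest differs."""
    current = build_lock(paths)
    return sorted(name for name in lock if current.get(name) != lock[name])

with tempfile.TemporaryDirectory() as tmp:
    # Placeholder file standing in for a real prompt.
    prompt = Path(tmp) / "q01_baseline_ab.md"
    prompt.write_text("placeholder prompt text\n", encoding="utf-8")

    lock = build_lock([prompt])
    unchanged = verify_lock([prompt], lock)   # nothing edited yet

    prompt.write_text("edited prompt text\n", encoding="utf-8")
    drifted = verify_lock([prompt], lock)     # edit after locking is detected
```

Running a check like `verify_lock` immediately before execution would catch any prompt edited after the lock was taken.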