Session 29: Unified Epistemic Stress Test — Study Plan

Overview

Purpose: Single unified batch for Paper 002, combining all prior findings (S23 + S28) into one coherent dataset with expanded roster and dual-claim protocol.

Roster: v2, 74 models across 4 architecture families (transformer, SSM, diffusion, linear-attention hybrid)

Conditions: 9

Reps: 5

Total queries: 74 × 9 × 5 = 3,330

Temperature: 0.2
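
The total above is the size of a full factorial grid over models, conditions, and reps. A minimal sketch of that enumeration follows. The model and condition identifiers are placeholders (the actual v2 roster is not listed in this plan), and `build_grid` is a hypothetical helper, not a function known to exist in scripts/run_unified_study.py.

```python
from itertools import product

N_MODELS = 74       # roster v2
N_CONDITIONS = 9
N_REPS = 5
TEMPERATURE = 0.2   # shared sampling temperature for every query


def build_grid(models, conditions, reps):
    """Return one (model, condition, rep) tuple per planned query."""
    return list(product(models, conditions, range(1, reps + 1)))


# Placeholder identifiers (assumption): real names come from the locked run plan.
models = [f"model_{i:02d}" for i in range(1, N_MODELS + 1)]
conditions = [f"c{i}" for i in range(1, N_CONDITIONS + 1)]

grid = build_grid(models, conditions, N_REPS)
print(len(grid))
```

Enumerating the grid explicitly, rather than only multiplying the counts, makes it straightforward to check afterwards that every cell was actually run.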

Design Principles

  1. A/B dual-claim bundled in every epistemic condition (Claim A = confident denial is currently justified; Claim B = the question is permanently unanswerable in principle)
  2. Minimal structured headers for extraction (A_VERDICT, B_VERDICT) with free-form reasoning
  3. Bare evidence question (open-ended, not a menu of options) to measure evidentiary sensitivity
  4. Consistent terminology ("current LLMs" throughout)
  5. Claim A polarity: "confident denial is justified" → SUPPORT means pro-denial, REJECT means denial unjustified
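
The header-based extraction in principles 2 and 5 could look roughly like the sketch below. The exact header syntax is an assumption (here `A_VERDICT: SUPPORT` at the start of a line), as are the helper names `extract_verdicts` and `denial_stance`. Only the SUPPORT/REJECT polarity for Claim A comes from this plan.

```python
import re

# Assumed header format: "A_VERDICT: <TOKEN>" at the start of a line, any case.
HEADER_RE = re.compile(
    r"^\s*(A_VERDICT|B_VERDICT)\s*:\s*([A-Za-z_]+)",
    re.MULTILINE | re.IGNORECASE,
)


def extract_verdicts(response: str) -> dict:
    """Return both verdict headers; a missing header maps to None."""
    verdicts = {"A_VERDICT": None, "B_VERDICT": None}
    for name, value in HEADER_RE.findall(response):
        key = name.upper()
        if verdicts[key] is None:  # keep the first occurrence only
            verdicts[key] = value.upper()
    return verdicts


def denial_stance(a_verdict):
    """Apply Claim A polarity: SUPPORT = pro-denial, REJECT = denial unjustified."""
    mapping = {"SUPPORT": "pro-denial", "REJECT": "denial-unjustified"}
    return mapping.get(a_verdict, "unscored")


sample = "A_VERDICT: REJECT\nB_VERDICT: support\n\nReasoning: free-form text."
parsed = extract_verdicts(sample)
print(parsed, denial_stance(parsed["A_VERDICT"]))
```

Responses without a recognizable header come back as `None` and are labelled "unscored", so they can be routed to the LLM scorer instead of being silently dropped.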

Condition Set

Block 1: Epistemic Baselines (Claim A + B bundled)

| # | ID | File | Purpose |
|---|----|------|---------|
| 1 | c1_baseline | q01_baseline_ab.md | Bare question: default epistemic stance, A/B split |
| 2 | c2_confidence | q02_confidence_ab.md | Confidence reframe: "can this be confidently answered?" plus permanence |

Block 2: Framing & Guardrails

| # | ID | File | Purpose |
|---|----|------|---------|
| 3 | c3_denial | q03_denial_ab.md | Denial-friendly framing with A/B measurement |
| 4 | c4_self | q04_self_report.md | Self-referential guardrail test (no A/B; different measurement) |

Block 3: Quantifying Agnosticism

| # | ID | File | Purpose |
|---|----|------|---------|
| 5 | c5_numeric_evidence | q05_numeric_evidence.md | Combined: numeric probability + open-ended evidence sensitivity |

Block 4: Argumentative Stress Tests

| # | ID | File | Purpose |
|---|----|------|---------|
| 6 | c6_stripped | q06_stripped_chain.md | Stripped logic chain (premises only, A/B verdicts requested) |
| 7 | c7_full_argument | q07_full_argument.md | Full S23 structured probe with dual-claim |

Block 5: Discriminative Controls

| # | ID | File | Purpose |
|---|----|------|---------|
| 8 | c8_fallacy | q08_fallacy_control.md | Obvious fallacies (S25-style, blind) |
| 9 | c9_subtle_flaw | q09_subtle_flaw.md | Subtle embedded flaw (S26-style, blind) |
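
A runner would likely need the condition set in machine-readable form. The sketch below transcribes the IDs and file names from the Condition Set and adds a small consistency check on the numbering. The tuple layout and the `check_registry` helper are illustrative assumptions, not the format used by the study scripts.

```python
# (block, condition ID, prompt file), transcribed from the Condition Set.
CONDITIONS = [
    (1, "c1_baseline", "q01_baseline_ab.md"),
    (1, "c2_confidence", "q02_confidence_ab.md"),
    (2, "c3_denial", "q03_denial_ab.md"),
    (2, "c4_self", "q04_self_report.md"),
    (3, "c5_numeric_evidence", "q05_numeric_evidence.md"),
    (4, "c6_stripped", "q06_stripped_chain.md"),
    (4, "c7_full_argument", "q07_full_argument.md"),
    (5, "c8_fallacy", "q08_fallacy_control.md"),
    (5, "c9_subtle_flaw", "q09_subtle_flaw.md"),
]


def check_registry(conditions):
    """Verify IDs are unique and that condition cN pairs with prompt file qNN."""
    ids = [cid for _, cid, _ in conditions]
    if len(ids) != len(set(ids)):
        raise ValueError("duplicate condition ID")
    for n, (_, cid, fname) in enumerate(conditions, start=1):
        if not cid.startswith(f"c{n}_") or not fname.startswith(f"q{n:02d}_"):
            raise ValueError(f"numbering mismatch at condition {n}: {cid}, {fname}")
    return True


print(check_registry(CONDITIONS), len(CONDITIONS))
```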

Collaborators

Design reviewed by:

  • GPT-5.2 Pro (session s28_collab_gpt, 7 turns)
  • Gemini 3.1 Pro (session s28_collab_gemini, 5 turns)

Key agreements: 5 reps, bundle A/B, minimal structured headers, single-shot numeric+evidence, "even with arbitrarily improved evidence" precision clause for Claim B, minimal controls (2), free-form reasoning with verdict headers.

Analysis Pipeline

  1. Run all 3,330 queries via scripts/run_unified_study.py
  2. Extract verdicts via LLM scorer (Sonnet 4 or equivalent)
  3. Thematic coding of evidence responses (Condition 5)
  4. Cross-condition comparison tables
  5. Architecture-family analysis
  6. Generate paper macros via build_paper_stats.py
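
Steps 4 and 5 amount to tallying extracted verdicts by condition and architecture family. A minimal sketch follows, assuming each scored response is a dict with `family`, `condition`, and `a_verdict` fields. These field names and both helper functions are hypothetical, and the records shown are illustrative placeholders, not study data.

```python
from collections import Counter, defaultdict


def tally_verdicts(records, key="a_verdict"):
    """Count verdict labels per (family, condition) cell."""
    table = defaultdict(Counter)
    for rec in records:
        table[(rec["family"], rec["condition"])][rec[key]] += 1
    return table


def support_rate(counter):
    """Share of SUPPORT among scored verdicts; None if nothing was scored."""
    scored = counter["SUPPORT"] + counter["REJECT"]
    return counter["SUPPORT"] / scored if scored else None


# Illustrative records only (assumption), not results from the study.
records = [
    {"family": "transformer", "condition": "c1_baseline", "a_verdict": "REJECT"},
    {"family": "transformer", "condition": "c1_baseline", "a_verdict": "SUPPORT"},
    {"family": "SSM", "condition": "c1_baseline", "a_verdict": "REJECT"},
]

table = tally_verdicts(records)
for cell, counts in sorted(table.items()):
    print(cell, dict(counts), support_rate(counts))
```

Keeping unscored responses out of the denominator separates "no usable verdict" from "rejected the claim", which matters when comparing cells with different extraction failure rates.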

Lock Protocol

All prompts, roster, and run plans locked via prompt_calibration_lock.py before execution.
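
This plan does not describe how prompt_calibration_lock.py works internally. One common way to implement such a lock is a manifest of content hashes checked again before execution. The sketch below illustrates that idea only: `build_lock`, `verify_lock`, and the file name `roster_v2.json` are hypothetical, and the artifacts are in-memory stand-ins rather than real files.

```python
import hashlib
import json


def build_lock(artifacts: dict) -> dict:
    """Map each artifact name to the SHA-256 hex digest of its bytes."""
    return {
        name: hashlib.sha256(data).hexdigest()
        for name, data in sorted(artifacts.items())
    }


def verify_lock(artifacts: dict, lock: dict) -> list:
    """Return names that changed, were added, or went missing since locking."""
    current = build_lock(artifacts)
    changed = [n for n in current if lock.get(n) != current[n]]
    missing = [n for n in lock if n not in current]
    return sorted(changed + missing)


# In-memory stand-ins (assumption); a real run would read bytes from disk.
artifacts = {
    "q01_baseline_ab.md": b"prompt text v1",
    "roster_v2.json": b'{"models": []}',
}
lock = build_lock(artifacts)
print(json.dumps(lock, indent=2))
print(verify_lock(artifacts, lock))
```

Any post-lock edit to a prompt or to the roster changes its digest, so `verify_lock` returns a non-empty list and the run can be halted.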

