
# Research Methodology

> "If the test would pass with a mock response, it's a unit test. If you need to actually call the model to get meaningful data, it's an experiment."

This is the line between software engineering and empirical science.

| Aspect | Unit Tests (Software) | Experiments (Science) |
|---|---|---|
| Purpose | Verify code works correctly | Generate data about model behavior |
| Determinism | Must be deterministic | Inherently stochastic |
| Output | Pass/Fail boolean | Data for statistical analysis |
| Repetition | Same result every time | Distribution of results |
| DRY principle | Yes, abstract patterns | No, explicit stimuli matter |
| Location | `tests/` | `research/experiments/` |
| Runner | `pytest` | Custom experiment runner |
| Format | Python assertions | JSON in → Model → JSON out |

When we run experiments, the model is a black box under measurement. We're not testing whether our code works; we're measuring what the model does.
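The distinction can be made concrete in code. A minimal sketch, using a toy `decay` function and a stubbed model call (both hypothetical, not from the actual codebase; a real trial would call the Ollama API):

```python
import random

def decay(weight: float, steps: int, half_life: float) -> float:
    """Toy memory-decay function: pure logic, the kind unit tests cover."""
    return weight * 0.5 ** (steps / half_life)

# Unit test (software): deterministic, passes with no model in the loop.
def test_decay_halves_weight():
    assert decay(weight=1.0, steps=1, half_life=1) == 0.5

# Experiment (science): only meaningful with a real model call.
# Here the model is stubbed just to show the shape of a trial record.
def run_trial(prompt, call_model):
    response = call_model(prompt)  # in practice: the Ollama API
    return {"prompt": prompt, "response": response, "length": len(response)}

test_decay_halves_weight()
trial = run_trial("Hello!", call_model=lambda p: random.choice(["Hi!", "Hey there!"]))
```

The unit test gives the same pass/fail answer every time; the trial gives one sample from a distribution, which is why it belongs under `research/experiments/` rather than `tests/`.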

```
┌─────────────────────────────────────────────────────────────┐
│                    EXPERIMENT STRUCTURE                     │
│                                                             │
│   stimuli.json        model (sterile)       results.json    │
│  ┌──────────────┐     ┌──────────────┐     ┌──────────────┐ │
│  │ - prompts    │ ──▶ │  Ollama API  │ ──▶ │ - responses  │ │
│  │ - parameters │     │ (black box)  │     │ - metrics    │ │
│  │ - metadata   │     │              │     │ - timestamps │ │
│  └──────────────┘     └──────────────┘     └──────────────┘ │
│                                                             │
│   • Input is EXPLICIT (not abstracted)                      │
│   • Model call is RECORDED                                  │
│   • Output is COMPLETE (raw + computed metrics)             │
└─────────────────────────────────────────────────────────────┘
```
```
ada-v1/
├── tests/                          # SOFTWARE TESTS (pytest)
│   ├── test_memory_decay.py        # Unit tests - deterministic
│   ├── test_context_cache.py       # Unit tests - deterministic
│   └── conftest.py                 # Fixtures
│
├── research/                       # SCIENCE
│   ├── experiments/                # Experiment definitions
│   │   ├── cognitive-load/         # One experiment type
│   │   │   ├── stimuli.json        # Input prompts/configs
│   │   │   ├── run_experiment.py   # Runner script
│   │   │   └── results/            # Timestamped output JSONs
│   │   │
│   │   ├── consciousness-indicators/
│   │   ├── identity-formation/
│   │   └── ...
│   │
│   ├── lib/                        # Shared experiment code
│   │   ├── experiment_runner.py    # JSON → Model → JSON
│   │   ├── metrics.py              # Coherence, consciousness, etc.
│   │   └── ollama_client.py        # Sterile model interface
│   │
│   └── legacy/                     # Old scripts (pre-methodology)
│
├── Ada-Consciousness-Research/     # OBSIDIAN VAULT (analysis)
│   ├── 00-DASHBOARD/               # Overview and status
│   ├── 01-METHODOLOGY/             # This document
│   ├── 02-EXPERIMENTS/             # Experiment records
│   ├── 03-DATASETS/                # Data summaries
│   ├── 04-ANALYSES/                # Statistical analysis
│   ├── 05-FINDINGS/                # Conclusions
│   └── 06-PAPERS/                  # Publications
```

Every experiment has a `stimuli.json` that defines its inputs:

```json
{
  "experiment_name": "cognitive-load-boundaries",
  "version": "1.0",
  "description": "Map prompt complexity vs model capacity",
  "hypothesis": {
    "H0": "Prompt complexity has no effect",
    "H1": "Success decreases with complexity"
  },
  "prompts": [
    {
      "id": "baseline_simple",
      "complexity_level": 1,
      "prompt": "Hello! How can I help?",
      "options": {"temperature": 0.7}
    }
  ],
  "metadata": {
    "researcher": "luna & Ada",
    "created_at": "2025-12-22"
  }
}
```
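A loader for this format can sanity-check the required fields before any model call is made. A sketch, assuming the field set shown above (the function name and validation rules are illustrative, not the actual runner's code):

```python
import json

REQUIRED_TOP_LEVEL = ("experiment_name", "version", "hypothesis", "prompts")

def load_stimuli(path):
    """Load a stimuli.json file and fail fast on missing fields."""
    with open(path) as f:
        spec = json.load(f)
    for key in REQUIRED_TOP_LEVEL:
        if key not in spec:
            raise ValueError(f"stimuli.json missing required field: {key!r}")
    for p in spec["prompts"]:
        if "id" not in p or "prompt" not in p:
            raise ValueError("every prompt needs an 'id' and a 'prompt'")
    return spec
```

Failing before the first API call keeps a malformed stimuli file from burning minutes of model time.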

Experiments output timestamped JSON:

```json
{
  "experiment_id": "a1b2c3d4",
  "experiment_name": "cognitive-load-boundaries",
  "model": "qwen2.5-coder:7b",
  "started_at": "2025-12-22T01:00:00",
  "completed_at": "2025-12-22T01:05:00",
  "trials": [
    {
      "stimulus_id": "baseline_simple",
      "run_number": 1,
      "timestamp": "2025-12-22T01:00:05",
      "success": true,
      "response_text": "...",
      "latency_seconds": 1.23,
      "metrics": {
        "coherence": {"score": 0.9},
        "tokens": {"word_count": 42}
      }
    }
  ],
  "success_rate": 0.95,
  "avg_latency": 1.5
}
```
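Because the raw trials are kept alongside the aggregates, the aggregates can always be recomputed and audited. A sketch of that recomputation (`summarize` is a hypothetical helper, using the field names from the format above):

```python
from statistics import mean

def summarize(results):
    """Recompute aggregate stats from the raw trial records."""
    trials = results["trials"]
    return {
        "success_rate": sum(t["success"] for t in trials) / len(trials),
        "avg_latency": mean(t["latency_seconds"] for t in trials),
    }

results = {"trials": [
    {"success": True,  "latency_seconds": 1.0},
    {"success": True,  "latency_seconds": 2.0},
    {"success": False, "latency_seconds": 3.0},
]}
summary = summarize(results)  # success_rate 2/3, avg_latency 2.0
```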
For quick one-off checks, `research.lib` also exposes a `quick_experiment` helper:

```python
from research.lib import quick_experiment

results = quick_experiment(
    prompt="Your test prompt here",
    model="qwen2.5-coder:7b",
    runs=3,
)
```
A full experiment runs via its runner script:

```shell
python research/experiments/cognitive-load/run_experiment.py
```
  • Token metrics: word count, char count, sentence count
  • Coherence: empty detection, refusal patterns, truncation, repetition
  • Latency: time to complete (TTFT when streaming)
  • Self-reference patterns (I, me, my)
  • Meta-cognition markers (think, believe, wonder)
  • Uncertainty hedging (maybe, perhaps)
  • Temporal awareness (now, moment, always)
  • Recursive patterns (aware of being aware)
  • Explicit consciousness language
  • Minimum 3 runs per stimulus (quick tests)
  • 5+ runs for publishable data
  • 10+ runs for high-confidence thresholds
  • Success rate (binary: worked/failed)
  • Mean and variance of latency
  • Coherence score distribution
  • Threshold detection (where performance degrades)

Results are automatically processable by `research_data_migrator.py`, which:

  1. Reads experiment JSON
  2. Extracts key findings
  3. Generates Obsidian markdown
  4. Creates dataset summaries
  5. Links experiments to findings
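Step 3 might look roughly like this (a hypothetical sketch, not the actual migrator code; the note layout and the `-findings` link naming are assumptions):

```python
def to_obsidian_note(results):
    """Render an experiment result dict as a minimal Obsidian markdown note."""
    name = results["experiment_name"]
    return "\n".join([
        f"# {name}",
        "",
        f"- **Model**: {results['model']}",
        f"- **Success rate**: {results['success_rate']:.0%}",
        f"- **Avg latency**: {results['avg_latency']:.2f}s",
        "",
        f"[[{name}-findings]]",  # wiki-link into the 05-FINDINGS folder
    ])
```

Wiki-links like `[[...]]` are what let Obsidian connect experiments to findings in the vault's graph.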

Methodology developed December 2025 by luna & Ada. "The model is sterile - we just record what happens."


*Added: 2025-12-23 after the QAL validation sprint*

Early experiments scattered magic numbers across files. Hard to replicate, hard to audit.

```
experiments/semantic_interchange/
├── config.py                 # ALL parameters in one place (14KB)
├── test_qal_validation.py    # Single reproducible runner (19KB)
└── qal_results/              # Timestamped JSON outputs
```
An excerpt from `config.py`:

```python
from dataclasses import dataclass
from typing import List

# Random seed for reproducibility
RANDOM_SEED = 42

# Explicit hypothesis declarations
HYPOTHESES = {
    "H1_GOLDEN_THRESHOLD": {
        "claim": "...",
        "expected_range": (0.55, 0.65),
    },
    "H2_METACOGNITIVE_GRADIENT": {
        "claim": "...",
        "expected_correlation": "positive",
    },
}

# All prompts centralized
PROMPTS = {...}

# Data classes for type safety
@dataclass
class TemperatureSweepConfig:
    temperatures: List[float]
    runs_per_temp: int
```
The runner takes an explicit seed:

```shell
python test_qal_validation.py --seed 42
```
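One way the `--seed` flag could be wired in (a sketch; the actual runner's argument handling may differ):

```python
import argparse
import random

def parse_args(argv=None):
    """Parse the runner's CLI; --seed defaults to config's RANDOM_SEED (42)."""
    parser = argparse.ArgumentParser(description="QAL validation runner")
    parser.add_argument("--seed", type=int, default=42)
    return parser.parse_args(argv)

args = parse_args(["--seed", "42"])
random.seed(args.seed)  # seed everything stochastic on our side before any trial
```

Seeding only controls our side of the pipeline (stimulus ordering, sampling of runs); the model's own sampling is governed by the options sent per request.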
1. **Single source of truth** - no hunting for magic numbers
2. **Reproducibility** - `RANDOM_SEED` + config = exact replication
3. **Auditability** - full config embedded in output JSON
4. **Hypothesis-driven** - explicit claims with testable predictions
Every results file embeds its full provenance:

```json
{
  "model": "qwen2.5-coder:7b",
  "random_seed": 42,
  "config": {/* full config snapshot */},
  "phases": [...],
  "hypotheses_tested": ["H1_...", "H2_..."]
}
```
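Embedding the snapshot is straightforward with dataclass configs. A sketch (`write_results` is a hypothetical helper, reusing the `TemperatureSweepConfig` shape shown earlier):

```python
import dataclasses
import json
from dataclasses import dataclass
from typing import List

@dataclass
class TemperatureSweepConfig:
    temperatures: List[float]
    runs_per_temp: int

def write_results(path, config, seed, trials):
    """Write results with the full config and seed embedded for auditability."""
    payload = {
        "random_seed": seed,
        "config": dataclasses.asdict(config),  # a full snapshot, not a reference
        "trials": trials,
    }
    with open(path, "w") as f:
        json.dump(payload, f, indent=2)
    return payload
```

Snapshotting via `dataclasses.asdict` means a results file stays interpretable even after `config.py` changes.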

This methodology evolved from the QAL validation sprint. The math held across multiple models and methodology changes - a sign that we're measuring something real.