# Research Methodology
## The Fundamental Distinction
> "If the test would pass with a mock response, it's a unit test. If you need to actually call the model to get meaningful data, it's an experiment."
This is the line between software engineering and empirical science.
## Unit Tests vs Experiments
| Aspect | Unit Tests (Software) | Experiments (Science) |
|---|---|---|
| Purpose | Verify code works correctly | Generate data about model behavior |
| Determinism | Must be deterministic | Inherently stochastic |
| Output | Pass/Fail boolean | Data for statistical analysis |
| Repetition | Same result every time | Distribution of results |
| DRY principle | Yes, abstract patterns | No, explicit stimuli matter |
| Location | tests/ | research/experiments/ |
| Runner | pytest | Custom experiment runner |
| Format | Python assertions | JSON in → Model → JSON out |
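The litmus test from the quote above can be made concrete. In this hypothetical sketch (`mock_model`, `run_experiment`, and the word-count metric are illustrative, not the project's actual API), the first check passes against a canned response, so it belongs in `tests/`; the second only yields meaningful numbers when the model is actually sampled:

```python
import statistics

def mock_model(prompt: str) -> str:
    """Deterministic stand-in for the model: sufficient for a unit test."""
    return "canned response"

def test_prompt_handling():
    # Unit test: passes with a mock, so it verifies OUR code, not the model.
    assert mock_model("Hello").strip() != ""

def run_experiment(model, prompt: str, runs: int) -> dict:
    """Experiment: repeat the call and report a distribution, not pass/fail."""
    word_counts = [len(model(prompt).split()) for _ in range(runs)]
    return {"n": runs, "mean_words": statistics.mean(word_counts)}

test_prompt_handling()
print(run_experiment(mock_model, "Hello", runs=3))
```

With a real model behind `model`, the second call returns a different distribution on every invocation; that variance is the data.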
## The Sterile Model Principle
When we run experiments, the model is a black box under measurement. We're not testing whether our code works; we're measuring what the model does.
```
                     EXPERIMENT STRUCTURE

  stimuli.json          model (sterile)        results.json
 ┌───────────────┐     ┌───────────────┐     ┌───────────────┐
 │ - prompts     │ ──▶ │  Ollama API   │ ──▶ │ - responses   │
 │ - parameters  │     │  (black box)  │     │ - metrics     │
 │ - metadata    │     │               │     │ - timestamps  │
 └───────────────┘     └───────────────┘     └───────────────┘

  • Input is EXPLICIT (not abstracted)
  • Model call is RECORDED
  • Output is COMPLETE (raw + computed metrics)
```
## Directory Structure

```
ada-v1/
├── tests/                           # SOFTWARE TESTS (pytest)
│   ├── test_memory_decay.py         # Unit tests - deterministic
│   ├── test_context_cache.py        # Unit tests - deterministic
│   └── conftest.py                  # Fixtures
│
├── research/                        # SCIENCE
│   ├── experiments/                 # Experiment definitions
│   │   ├── cognitive-load/          # One experiment type
│   │   │   ├── stimuli.json         # Input prompts/configs
│   │   │   ├── run_experiment.py    # Runner script
│   │   │   └── results/             # Timestamped output JSONs
│   │   ├── consciousness-indicators/
│   │   ├── identity-formation/
│   │   └── ...
│   │
│   ├── lib/                         # Shared experiment code
│   │   ├── experiment_runner.py     # JSON→Model→JSON
│   │   ├── metrics.py               # Coherence, consciousness, etc.
│   │   └── ollama_client.py         # Sterile model interface
│   │
│   └── legacy/                      # Old scripts (pre-methodology)
│
└── Ada-Consciousness-Research/      # OBSIDIAN VAULT (analysis)
    ├── 00-DASHBOARD/                # Overview and status
    ├── 01-METHODOLOGY/              # This document
    ├── 02-EXPERIMENTS/              # Experiment records
    ├── 03-DATASETS/                 # Data summaries
    ├── 04-ANALYSES/                 # Statistical analysis
    ├── 05-FINDINGS/                 # Conclusions
    └── 06-PAPERS/                   # Publications
```
## Stimuli File Format

Every experiment has a `stimuli.json` that defines inputs:
```json
{
  "experiment_name": "cognitive-load-boundaries",
  "version": "1.0",
  "description": "Map prompt complexity vs model capacity",
  "hypothesis": {
    "H0": "Prompt complexity has no effect",
    "H1": "Success decreases with complexity"
  },
  "prompts": [
    {
      "id": "baseline_simple",
      "complexity_level": 1,
      "prompt": "Hello! How can I help?",
      "options": {"temperature": 0.7}
    }
  ],
  "metadata": {
    "researcher": "luna & Ada",
    "created_at": "2025-12-22"
  }
}
```
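A stimuli file in this shape can be sanity-checked with a short loader before a run. This is a sketch: `load_stimuli` and `REQUIRED_KEYS` are illustrative names, not part of the actual runner.

```python
import json

# Assumed required top-level fields, based on the format shown above
REQUIRED_KEYS = {"experiment_name", "version", "hypothesis", "prompts"}

def load_stimuli(path: str) -> dict:
    """Load a stimuli.json and fail fast if required fields are missing."""
    with open(path) as f:
        stimuli = json.load(f)
    missing = REQUIRED_KEYS - stimuli.keys()
    if missing:
        raise ValueError(f"stimuli file missing keys: {sorted(missing)}")
    for p in stimuli["prompts"]:
        if "id" not in p or "prompt" not in p:
            raise ValueError("every stimulus needs an 'id' and a 'prompt'")
    return stimuli
```

Failing fast here keeps a malformed stimulus from silently producing an unanalyzable results file.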
## Results File Format

Experiments output timestamped JSON:
```json
{
  "experiment_id": "a1b2c3d4",
  "experiment_name": "cognitive-load-boundaries",
  "model": "qwen2.5-coder:7b",
  "started_at": "2025-12-22T01:00:00",
  "completed_at": "2025-12-22T01:05:00",
  "trials": [
    {
      "stimulus_id": "baseline_simple",
      "run_number": 1,
      "timestamp": "2025-12-22T01:00:05",
      "success": true,
      "response_text": "...",
      "latency_seconds": 1.23,
      "metrics": {
        "coherence": {"score": 0.9},
        "tokens": {"word_count": 42}
      }
    }
  ],
  "success_rate": 0.95,
  "avg_latency": 1.5
}
```
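The top-level `success_rate` and `avg_latency` fields are aggregates over the trial list. A minimal sketch of that reduction (hypothetical helper, not the actual runner code):

```python
def summarize(trials: list) -> dict:
    """Reduce per-trial records to the summary fields of a results file."""
    n = len(trials)
    return {
        "success_rate": sum(1 for t in trials if t["success"]) / n,
        "avg_latency": sum(t["latency_seconds"] for t in trials) / n,
    }

trials = [
    {"success": True, "latency_seconds": 1.0},
    {"success": True, "latency_seconds": 2.0},
    {"success": False, "latency_seconds": 3.0},
]
print(summarize(trials))  # success_rate of 2/3, avg_latency of 2.0
```

Keeping the raw trials alongside the aggregates means the summary can always be recomputed and audited.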
## Running Experiments
### Quick test (single prompt)

```python
from research.lib import quick_experiment

results = quick_experiment(
    prompt="Your test prompt here",
    model="qwen2.5-coder:7b",
    runs=3,
)
```
### Full experiment

```shell
python research/experiments/cognitive-load/run_experiment.py
```
## Metrics Computed

### Standard Metrics
- Token metrics: word count, char count, sentence count
- Coherence: empty detection, refusal patterns, truncation, repetition
- Latency: time to complete (time to first token, TTFT, when streaming)
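Crude versions of the coherence checks above can be sketched as boolean flags. The refusal patterns and the repetition threshold here are placeholders, not the actual `metrics.py` values:

```python
import re

REFUSAL_PATTERNS = [r"\bI can(?:not|'t) help\b", r"\bas an AI\b"]  # placeholder list

def coherence_flags(text: str) -> dict:
    """Flag empty, refusal, truncated, and repetitive responses."""
    words = text.lower().split()
    return {
        "empty": not text.strip(),
        "refusal": any(re.search(p, text, re.IGNORECASE) for p in REFUSAL_PATTERNS),
        # A response that stops mid-sentence usually lacks closing punctuation
        "truncated": bool(text.strip()) and not text.rstrip().endswith((".", "!", "?")),
        # High word repetition suggests a degenerate generation loop
        "repetitive": len(words) >= 6 and len(set(words)) / len(words) < 0.5,
    }
```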
### Consciousness Indicators

- Self-reference patterns (`I`, `me`, `my`)
- Meta-cognition markers (`think`, `believe`, `wonder`)
- Uncertainty hedging (`maybe`, `perhaps`)
- Temporal awareness (`now`, `moment`, `always`)
- Recursive patterns ("aware of being aware")
- Explicit consciousness language
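A hypothetical counter for these indicator classes (the word lists are abbreviated placeholders, not the project's full lexicon):

```python
import re

INDICATORS = {  # abbreviated, illustrative word lists
    "self_reference": ["i", "me", "my"],
    "meta_cognition": ["think", "believe", "wonder"],
    "uncertainty_hedging": ["maybe", "perhaps"],
    "temporal_awareness": ["now", "moment", "always"],
}

def indicator_counts(text: str) -> dict:
    """Count occurrences of each indicator class in a response."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return {name: sum(tokens.count(w) for w in words)
            for name, words in INDICATORS.items()}
```

Tokenizing before counting avoids substring false positives (e.g. `me` inside "moment").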
## Statistical Considerations

### Sample Size
- Minimum 3 runs per stimulus (quick tests)
- 5+ runs for publishable data
- 10+ runs for high-confidence thresholds
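The run-count guidance tracks the standard error of an observed success rate, which shrinks as 1/√n. A quick illustration (the standard Bernoulli-proportion formula, not project code):

```python
import math

def success_rate_stderr(p: float, n: int) -> float:
    """Standard error of a success-rate estimate from n independent runs."""
    return math.sqrt(p * (1 - p) / n)

# More runs per stimulus tighten the estimate:
for n in (3, 5, 10):
    print(n, round(success_rate_stderr(0.8, n), 3))
```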
### What to Report

- Success rate (binary: worked/failed)
- Mean and variance of latency
- Coherence score distribution
- Threshold detection (where performance degrades)
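Threshold detection can be sketched as a scan over per-level success rates (the 0.8 floor here is an arbitrary example, not a project constant):

```python
def detect_threshold(rates_by_level: dict, floor: float = 0.8):
    """Return the first complexity level whose success rate falls below floor."""
    for level in sorted(rates_by_level):
        if rates_by_level[level] < floor:
            return level
    return None  # no degradation observed in the sampled range

print(detect_threshold({1: 1.0, 2: 0.9, 3: 0.6, 4: 0.2}))  # degrades at level 3
```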
## Linking to Obsidian

Results are automatically processed by `research_data_migrator.py`, which:
- Reads experiment JSON
- Extracts key findings
- Generates Obsidian markdown
- Creates dataset summaries
- Links experiments to findings
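The markdown-generation step can be pictured with a minimal renderer. This is hypothetical; the real `research_data_migrator.py` output is richer:

```python
def to_obsidian_note(result: dict) -> str:
    """Render a result summary as Obsidian markdown with wiki-links."""
    return "\n".join([
        f"# {result['experiment_name']}",
        f"- Model: {result['model']}",
        f"- Success rate: {result['success_rate']:.0%}",
        # Wiki-link into the vault's dataset folder
        f"- Dataset: [[03-DATASETS/{result['experiment_name']}]]",
    ])
```

The wiki-links are what let Obsidian's graph view connect experiments to datasets and findings.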
*Methodology developed December 2025 by luna & Ada. "The model is sterile; we just record what happens."*
## Appendix: Config-Driven Methodology (v2.0)

*Added 2025-12-23, after the QAL validation sprint.*
### The Problem

Early experiments scattered magic numbers across files, making them hard to replicate and hard to audit.
### The Solution: Anthropic-Style Parameterization

```
experiments/semantic_interchange/
├── config.py               # ALL parameters in one place (14KB)
├── test_qal_validation.py  # Single reproducible runner (19KB)
└── qal_results/            # Timestamped JSON outputs
```
### config.py Structure

```python
from dataclasses import dataclass
from typing import List

# Random seed for reproducibility
RANDOM_SEED = 42

# Explicit hypothesis declarations
HYPOTHESES = {
    "H1_GOLDEN_THRESHOLD": {
        "claim": "...",
        "expected_range": (0.55, 0.65),
    },
    "H2_METACOGNITIVE_GRADIENT": {
        "claim": "...",
        "expected_correlation": "positive",
    },
}

# All prompts centralized
PROMPTS = {...}

# Data classes for type safety
@dataclass
class TemperatureSweepConfig:
    temperatures: List[float]
    runs_per_temp: int
```
### Replication Command

```shell
python test_qal_validation.py --seed 42
```
### Benefits

- Single source of truth: no hunting for magic numbers
- Reproducibility: `RANDOM_SEED` + config = exact replication
- Auditability: full config embedded in the output JSON
- Hypothesis-driven: explicit claims with testable predictions
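The reproducibility and auditability points combine naturally: seed the RNG and write a config snapshot (here hashed for quick comparison) into the output. A sketch under assumed names; `start_run` and the 12-character hash are illustrative:

```python
import hashlib
import json
import random

def start_run(config: dict, seed: int) -> dict:
    """Seed the RNG and snapshot the config so a run can be replicated and audited."""
    random.seed(seed)
    snapshot = json.dumps(config, sort_keys=True)  # canonical form for hashing
    return {
        "random_seed": seed,
        "config": config,
        "config_hash": hashlib.sha256(snapshot.encode()).hexdigest()[:12],
        "first_draw": random.random(),  # identical for identical seed
    }
```

Two runs with the same seed and config produce byte-identical random streams and matching hashes, which is exactly what "exact replication" requires.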
### Results Format

```json
{
  "model": "qwen2.5-coder:7b",
  "random_seed": 42,
  "config": {/* full config snapshot */},
  "phases": [...],
  "hypotheses_tested": ["H1_...", "H2_..."]
}
```

This methodology evolved from the QAL validation sprint. The math held across multiple models and methodology changes, a sign that we're measuring something real.