SIF Specification Formalization - Roadmap
Status: Pre-formalization (completed cleanup first)
Target Date: After consolidation + QAL collaboration
Purpose: Create machine-readable SIF specification for standardization
Current SIF State
What Works Well
- ✅ Core concept: semantic compression with importance weighting
- ✅ Compression ratio: 66-104x validated (EXP-011)
- ✅ Hallucination safety: 100% resistance consistently
- ✅ Entity extraction: 0-50 entities depending on variant
- ✅ Facts preservation: 0-100 facts with importance scores
- ✅ Relationship mapping: Character/entity relationships captured
What Needs Formalization
- ❌ File format specifics (YAML vs JSON vs custom)
- ❌ Schema validation rules
- ❌ Entity typing system (what types allowed?)
- ❌ Relationship semantics (what relationship types?)
- ❌ Importance weighting system (when to use 0.60?)
- ❌ Metadata standards (provenance, compression stats)
- ❌ Embedding storage (if including vectors)
- ❌ Version compatibility
- ❌ Tool support (generators, validators, ingestion)
Three-Stage Formalization Plan
Stage 1: Specification (This Document)
1a: Format Selection
Decision needed: What’s the target format?
| Format | Pros | Cons | Use Case |
|---|---|---|---|
| JSON | Machine-parseable, widespread | Verbose | APIs, data interchange |
| YAML | Human-readable, concise | Less tooling | Config files |
| Protobuf | Compact, strongly-typed | Complex to learn | Binary storage |
| Custom DSL | Domain-optimized | Requires parser | Specialized tools |
Recommendation for v1.0: JSON (most tooling)
- Structure mirrors Pydantic models
- Easy to validate with JSON Schema
- HTTP-friendly for API transport
1b: Core Schema Definition
```json
{
  "$schema": "http://json-schema.org/draft-07/schema#",
  "title": "Semantic Interchange Format (SIF)",
  "version": "1.0.0",
  "type": "object",
  "required": ["metadata", "summary", "entities"],
  "properties": {
    "metadata": {
      "type": "object",
      "required": ["version", "timestamp", "domain"],
      "properties": {
        "version": {"type": "string", "pattern": "^\\d+\\.\\d+\\.\\d+$"},
        "timestamp": {"type": "string", "format": "date-time"},
        "domain": {
          "type": "string",
          "enum": ["minecraft/logs", "documentation", "codebase", "literature", "custom"]
        },
        "generated_by": {"type": "string"},
        "generator_version": {"type": "string"}
      }
    },
    "provenance": {
      "type": "object",
      "properties": {
        "source_type": {
          "enum": ["log_analysis", "text_extraction", "code_analysis", "media_extraction"]
        },
        "source_size_bytes": {"type": "integer"},
        "source_hash": {"type": "string"},
        "compression_ratio": {"type": "number"},
        "compression_method": {"type": "string"},
        "confidence": {"type": "number", "minimum": 0, "maximum": 1}
      }
    },
    "summary": {
      "type": "object",
      "required": ["text"],
      "properties": {
        "text": {"type": "string"},
        "keywords": {"type": "array", "items": {"type": "string"}},
        "entities_count": {"type": "integer"},
        "facts_count": {"type": "integer"}
      }
    },
    "entities": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["id", "type", "name"],
        "properties": {
          "id": {"type": "string"},
          "type": {
            "enum": ["person", "place", "thing", "concept", "event", "organization"]
          },
          "name": {"type": "string"},
          "description": {"type": "string"},
          "attributes": {"type": "object"},
          "importance": {
            "type": "number",
            "minimum": 0,
            "maximum": 1,
            "description": "Based on surprise, relevance, frequency"
          }
        }
      }
    },
    "relationships": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["entity_a", "relation_type", "entity_b"],
        "properties": {
          "entity_a": {"type": "string"},
          "relation_type": {
            "enum": ["conflicts_with", "supports", "causes", "part_of", "related_to", "describes"]
          },
          "entity_b": {"type": "string"},
          "strength": {"type": "number", "minimum": 0, "maximum": 1}
        }
      }
    },
    "facts": {
      "type": "array",
      "items": {
        "type": "object",
        "required": ["content", "importance"],
        "properties": {
          "id": {"type": "string"},
          "content": {"type": "string"},
          "type": {
            "enum": ["factual", "causal", "definition", "property", "relationship"]
          },
          "importance": {
            "type": "number",
            "minimum": 0,
            "maximum": 1,
            "description": "1.0 = critical to understanding, 0.6 = important, 0.3 = contextual, 0.0 = irrelevant"
          },
          "supporting_entities": {"type": "array", "items": {"type": "string"}},
          "confidence": {"type": "number", "minimum": 0, "maximum": 1},
          "tags": {"type": "array", "items": {"type": "string"}}
        }
      }
    },
    "embeddings": {
      "type": "object",
      "properties": {
        "model": {"type": "string"},
        "dimension": {"type": "integer"},
        "vectors": {
          "type": "array",
          "items": {
            "type": "object",
            "properties": {
              "fact_id": {"type": "string"},
              "vector": {"type": "array", "items": {"type": "number"}}
            }
          }
        }
      }
    },
    "metadata_attributes": {
      "type": "object",
      "properties": {
        "language": {"type": "string", "default": "en"},
        "license": {"type": "string"},
        "quality_score": {"type": "number", "minimum": 0, "maximum": 1},
        "notes": {"type": "string"}
      }
    }
  }
}
```
1c: Importance Weighting Specification
The 0.60 threshold was discovered in EXP-005:
```
Importance Score Ranges:
├─ 0.90-1.00: CRITICAL
│  └─ Essential to understanding
│  └─ Use in: RAG, priority retrieval
│
├─ 0.75-0.89: HIGH
│  └─ Important but not essential
│  └─ Use in: Context, expansion
│
├─ 0.60-0.74: IMPORTANT
│  └─ Threshold for "signal" vs "noise"
│  └─ Use in: Standard retrieval, question answering
│
├─ 0.40-0.59: CONTEXTUAL
│  └─ Adds context but not central
│  └─ Use in: Depth, richness
│
└─ 0.00-0.39: NOISE
   └─ Probably not important
   └─ Use in: Filtering, compression
```
Calculation Algorithm:
```python
from typing import Dict


def calculate_importance(fact: str, context: Dict) -> float:
    """
    Based on EXP-005 optimal weights:
    - surprise: 0.60 (novelty/prediction error)
    - decay: 0.10 (temporal freshness)
    - relevance: 0.20 (semantic match to query)
    - habituation: 0.10 (repetition penalty)
    """
    # The measure_* helpers are not defined in this roadmap; each is
    # expected to return a score in [0, 1].
    scores = {
        'surprise': measure_surprise(fact, context),       # 0-1
        'relevance': measure_relevance(fact, context),     # 0-1
        'decay': measure_temporal_decay(fact),             # 0-1
        'habituation': measure_repetition(fact, context)   # 0-1
    }

    importance = (
        0.60 * scores['surprise'] +
        0.20 * scores['relevance'] +
        0.10 * scores['decay'] +
        0.10 * scores['habituation']
    )

    return min(max(importance, 0.0), 1.0)  # Clamp to [0,1]
```
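As a worked example of the weighted sum and the score bands above, here is a minimal sketch. The component scores are made-up illustrative values, and `weighted_importance` and `importance_band` are hypothetical helper names, not part of any existing SIF tooling.

```python
from typing import Dict

# Weights as listed for EXP-005 in this section; they sum to 1.0.
WEIGHTS = {"surprise": 0.60, "relevance": 0.20, "decay": 0.10, "habituation": 0.10}

# Lower bound of each band from the score ranges, highest first.
BANDS = [
    (0.90, "CRITICAL"),
    (0.75, "HIGH"),
    (0.60, "IMPORTANT"),
    (0.40, "CONTEXTUAL"),
    (0.00, "NOISE"),
]


def weighted_importance(scores: Dict[str, float]) -> float:
    """Weighted sum of the four component scores, clamped to [0, 1]."""
    total = sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)
    return min(max(total, 0.0), 1.0)


def importance_band(score: float) -> str:
    """Map a score to the label of the first band whose lower bound it meets."""
    for lower, label in BANDS:
        if score >= lower:
            return label
    return "NOISE"


# Illustrative component scores (assumed, not measured).
scores = {"surprise": 0.9, "relevance": 0.5, "decay": 0.8, "habituation": 0.4}
importance = weighted_importance(scores)
# 0.60*0.9 + 0.20*0.5 + 0.10*0.8 + 0.10*0.4 = 0.54 + 0.10 + 0.08 + 0.04 = 0.76
print(round(importance, 2), importance_band(importance))
```

Matching on lower bounds only means a score such as 0.895, which falls between the two-decimal ranges, still lands in a band (HIGH).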
Stage 2: Validation Tools
2a: JSON Schema Validator
```python
import json
from pathlib import Path
from typing import List, Tuple

import jsonschema

SCHEMA_PATH = Path(__file__).parent / "sif_schema.json"


def validate_sif(sif_dict: dict) -> Tuple[bool, List[str]]:
    """Validate SIF against schema, return (valid, errors)"""
    with open(SCHEMA_PATH) as f:
        schema = json.load(f)

    validator = jsonschema.Draft7Validator(schema)
    errors = [str(e.message) for e in validator.iter_errors(sif_dict)]
    return len(errors) == 0, errors


def validate_sif_file(path: str) -> Tuple[bool, List[str]]:
    """Load SIF from file and validate"""
    with open(path) as f:
        sif = json.load(f)
    return validate_sif(sif)
```
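Schema validation alone cannot confirm that `entity_a`, `entity_b`, and `supporting_entities` point at an `id` that exists in `entities`, because JSON Schema draft-07 has no cross-reference constraint. A second pass could cover this. The sketch below assumes (the schema does not say so explicitly) that these fields hold entity `id` values; `find_dangling_references` is a hypothetical helper, and the sample data is a small illustrative fragment.

```python
from typing import Dict, List


def find_dangling_references(sif: Dict) -> List[str]:
    """Return one message per reference to an entity id that is not defined."""
    known_ids = {entity["id"] for entity in sif.get("entities", [])}
    problems = []

    # Relationships name their endpoints in entity_a / entity_b.
    for i, rel in enumerate(sif.get("relationships", [])):
        for key in ("entity_a", "entity_b"):
            if rel[key] not in known_ids:
                problems.append(f"relationships[{i}].{key}: unknown entity '{rel[key]}'")

    # Facts may list supporting entities (optional field).
    for i, fact in enumerate(sif.get("facts", [])):
        for ref in fact.get("supporting_entities", []):
            if ref not in known_ids:
                problems.append(f"facts[{i}].supporting_entities: unknown entity '{ref}'")

    return problems


# Illustrative fragment: "wonderland" is referenced but never defined.
sample = {
    "entities": [
        {"id": "alice", "type": "person", "name": "Alice"},
        {"id": "white_rabbit", "type": "person", "name": "White Rabbit"},
    ],
    "relationships": [
        {"entity_a": "alice", "relation_type": "related_to", "entity_b": "white_rabbit"},
        {"entity_a": "white_rabbit", "relation_type": "related_to", "entity_b": "wonderland"},
    ],
    "facts": [
        {
            "content": "Alice falls down the rabbit hole after following the White Rabbit",
            "importance": 0.95,
            "supporting_entities": ["alice", "white_rabbit"],
        }
    ],
}
print(find_dangling_references(sample))
```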
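2b: Quality Checker
```python
from typing import Dict


def check_sif_quality(sif: dict, expected_entities: int, expected_facts: int) -> Dict[str, float]:
    """Assess SIF quality across dimensions"""
    # assess_importance_distribution() and check_hallucination_markers()
    # are not defined in this roadmap yet.
    entities = sif['entities']
    facts = sif.get('facts', [])                  # optional in the schema
    relationships = sif.get('relationships', [])  # optional in the schema
    return {
        'entity_coverage': len(entities) / expected_entities,
        'fact_completeness': len(facts) / expected_facts,
        'importance_distribution': assess_importance_distribution(facts),
        # max(..., 1) avoids division by zero for an empty entity list
        'relationship_density': len(relationships) / max(len(entities) ** 2, 1),
        'hallucination_risk': check_hallucination_markers(sif),
    }
```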
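`assess_importance_distribution` is not defined in this roadmap. One possible interpretation, offered purely as an assumption, is the share of facts at or above the 0.60 signal threshold. `signal_fraction` is a hypothetical name and the fact list is illustrative.

```python
from typing import Dict, List


def signal_fraction(facts: List[Dict], threshold: float = 0.60) -> float:
    """Fraction of facts whose importance is at or above the threshold."""
    if not facts:
        return 0.0
    signal = sum(1 for fact in facts if fact["importance"] >= threshold)
    return signal / len(facts)


# Two of the four illustrative facts clear the 0.60 threshold.
facts = [{"importance": 0.95}, {"importance": 0.75}, {"importance": 0.40}, {"importance": 0.10}]
print(signal_fraction(facts))  # 0.5
```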
2c: Compression Calculator
```python
import json
from typing import Dict


def calculate_compression_stats(original_size: int, sif: dict) -> Dict:
    """Calculate compression metrics"""
    # Measure the serialized SIF in bytes so it is comparable to original_size
    sif_size = len(json.dumps(sif).encode('utf-8'))
    return {
        'ratio': original_size / sif_size,
        'original_bytes': original_size,
        'compressed_bytes': sif_size,
        'saved_bytes': original_size - sif_size,
        'efficiency_percent': ((original_size - sif_size) / original_size) * 100
    }
```
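As a sanity check on these metrics, the arithmetic can be run backwards from a reported ratio. The sketch below uses the source size (144,696 bytes) and compression ratio (104.6) from the Alice in Wonderland example in this document; `implied_stats` is a hypothetical helper.

```python
def implied_stats(original_bytes: int, ratio: float) -> dict:
    """Derive the compressed size and savings implied by a compression ratio."""
    compressed = round(original_bytes / ratio)
    saved = original_bytes - compressed
    return {
        "compressed_bytes": compressed,
        "saved_bytes": saved,
        "efficiency_percent": round(saved / original_bytes * 100, 2),
    }


stats = implied_stats(144696, 104.6)
print(stats)
```

A ratio of 104.6 therefore corresponds to a SIF of roughly 1.4 KB, with about 99% of the original bytes removed.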
Stage 3: Generator & Ingestion
3a: SIF Generator
```python
class SIFGenerator:
    def __init__(self, model_name: str = "qwen2.5-coder:7b"):
        self.model = model_name
        self.client = AdaClient()  # assumed to be importable from the Ada codebase

    def compress_text(self, text: str, domain: str) -> dict:
        """
        One-pass semantic compression to SIF format.

        Produces:
        1. Summary (1-3 sentences)
        2. Key entities + relationships
        3. Importance-weighted facts
        4. Embeddings (optional)
        """
        # Step 1: Extract summary
        summary = self._extract_summary(text)

        # Step 2: Extract entities
        entities = self._extract_entities(text)

        # Step 3: Extract relationships
        relationships = self._extract_relationships(text, entities)

        # Step 4: Extract facts with importance
        facts = self._extract_facts(text, entities)

        # Step 5: Generate embeddings (optional)
        embeddings = self._generate_embeddings(facts)

        # Assemble SIF
        sif = {
            'metadata': {...},
            'summary': summary,
            'entities': entities,
            'relationships': relationships,
            'facts': facts,
        }
        # The schema types "embeddings" as an object, so omit the key
        # rather than storing None when no embeddings were generated.
        if embeddings:
            sif['embeddings'] = embeddings

        return sif
```
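The `metadata` block is elided in the generator above. As one possible way to produce the three fields the schema marks as required (`version`, `timestamp`, `domain`), here is a minimal sketch. `build_metadata` and `ALLOWED_DOMAINS` are hypothetical names, and the generator's actual metadata logic is not specified in this roadmap.

```python
from datetime import datetime, timezone
from typing import Dict, Optional

# Domain values copied from the schema's metadata.domain enum.
ALLOWED_DOMAINS = {"minecraft/logs", "documentation", "codebase", "literature", "custom"}


def build_metadata(domain: str, generated_by: Optional[str] = None,
                   spec_version: str = "1.0.0") -> Dict[str, str]:
    """Build a metadata dict containing the fields the schema requires."""
    if domain not in ALLOWED_DOMAINS:
        raise ValueError(f"unknown domain: {domain!r}")
    metadata = {
        "version": spec_version,
        # UTC timestamp in the date-time shape used by the example output,
        # e.g. 2025-12-23T12:34:56Z
        "timestamp": datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ"),
        "domain": domain,
    }
    if generated_by is not None:
        metadata["generated_by"] = generated_by
    return metadata


print(build_metadata("literature", generated_by="qwen2.5-coder:7b"))
```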
3b: SIF Ingestion into Ada
```python
class SIFIngester:
    def __init__(self, ada_client: AdaClient):
        self.client = ada_client

    def inject_sif(self, sif: dict, user_id: str) -> None:
        """
        Inject SIF directly into Ada's memory.
        Facts with importance ≥ 0.60 are added as memories.
        """
        for fact in sif.get('facts', []):  # "facts" is optional in the schema
            if fact['importance'] >= 0.60:  # Critical threshold
                memory = {
                    'type': 'fact',
                    'content': fact['content'],
                    'importance': fact['importance'],
                    'source': f"SIF:{sif['metadata']['domain']}",
                    'tags': fact.get('tags', [])
                }

                # Store in ChromaDB with importance weighting
                self.client.add_memory(memory, user_id)

        # Optionally: inject entity types for entity linking
        for entity in sif['entities']:
            entity_memory = {
                'type': 'entity',
                'name': entity['name'],
                # Separate key: a second 'type' entry would silently
                # overwrite the 'entity' marker above.
                'entity_type': entity['type'],
                # description and importance are optional in the schema
                'description': entity.get('description'),
                'importance': entity.get('importance')
            }
            self.client.add_memory(entity_memory, user_id)
```
Example SIF Output (Alice in Wonderland)
```json
{
  "metadata": {
    "version": "1.0.0",
    "timestamp": "2025-12-23T12:34:56Z",
    "domain": "literature",
    "generated_by": "qwen2.5-coder:7b",
    "generator_version": "v1.0"
  },
  "provenance": {
    "source_type": "text_extraction",
    "source_size_bytes": 144696,
    "compression_ratio": 104.6,
    "compression_method": "llm_semantic",
    "confidence": 0.78
  },
  "summary": {
    "text": "Alice follows the White Rabbit down a rabbit hole and finds herself in a strange world where she meets various characters including the Caterpillar, Mad Hatter, and Queen of Hearts. She experiences size changes and navigates peculiar social situations.",
    "keywords": ["Alice", "rabbit hole", "transformation", "characters", "adventure"],
    "entities_count": 12,
    "facts_count": 24
  },
  "entities": [
    {
      "id": "alice",
      "type": "person",
      "name": "Alice",
      "description": "Young protagonist who falls down the rabbit hole",
      "importance": 0.95,
      "attributes": {"age": "child", "curious": true}
    },
    {
      "id": "white_rabbit",
      "type": "person",
      "name": "White Rabbit",
      "description": "Anxious rabbit who leads Alice into Wonderland",
      "importance": 0.85
    },
    {
      "id": "caterpillar",
      "type": "person",
      "name": "Caterpillar",
      "description": "Mysterious creature on mushroom who asks riddles",
      "importance": 0.80
    }
  ],
  "relationships": [
    {
      "entity_a": "alice",
      "relation_type": "follows",
      "entity_b": "white_rabbit",
      "strength": 0.95
    },
    {
      "entity_a": "white_rabbit",
      "relation_type": "leads_to",
      "entity_b": "wonderland",
      "strength": 0.90
    }
  ],
  "facts": [
    {
      "id": "fact_001",
      "content": "Alice falls down the rabbit hole after following the White Rabbit",
      "type": "causal",
      "importance": 0.95,
      "confidence": 0.99,
      "tags": ["plot", "beginning", "trigger"]
    },
    {
      "id": "fact_002",
      "content": "Alice changes size multiple times throughout her journey",
      "type": "property",
      "importance": 0.88,
      "confidence": 0.98,
      "tags": ["magic", "transformation"]
    },
    {
      "id": "fact_003",
      "content": "The Caterpillar on the mushroom can grant size-changing powers",
      "type": "factual",
      "importance": 0.75,
      "confidence": 0.85,
      "tags": ["magic_system"]
    }
  ]
}
```
The `follows` and `leads_to` relation types in this output are not in the 1b `relation_type` enum. Either the enum needs extending or these need mapping to `related_to` before the output validates against the schema.
Formalization Timeline
Week 1: Schema Definition
- Finalize JSON Schema
- Document all fields
- Define enum types
- Create schema file
Week 2: Validation Tools
- Implement schema validator
- Build quality checker
- Add compression calculator
- Create test suite
Week 3: Generators
- Implement SIF generator
- Test on 5 domains
- Optimize compression
- Document usage
Week 4: Integration
- Build Ada ingestion layer
- Test RAG injection
- Performance benchmarking
- Documentation complete
Open Questions for QAL Collaboration
When reaching out to the QAL team, ask:
- Schema alignment - Does our JSON schema match QAL’s entity/relationship model?
- Importance weighting - How does our 0.60 threshold relate to their contraction sharpness?
- Embedding storage - Should SIF include vectors or reference external ones?
- Versioning - How do we handle SIF evolution without breaking compatibility?
- Domain specificity - Should SIF format differ by domain (code vs logs vs literature)?
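On the versioning question, one common convention (an assumption here, not a decision recorded in this roadmap) is semantic versioning, where a reader accepts any document whose major version matches its own. A minimal sketch using the `version` pattern from the schema; `is_compatible` is a hypothetical helper.

```python
import re

# Same pattern the schema uses for metadata.version.
VERSION_RE = re.compile(r"^(\d+)\.(\d+)\.(\d+)$")


def is_compatible(document_version: str, reader_version: str) -> bool:
    """True when both versions are well-formed and share a major version."""
    doc = VERSION_RE.match(document_version)
    reader = VERSION_RE.match(reader_version)
    if doc is None or reader is None:
        raise ValueError("versions must look like MAJOR.MINOR.PATCH")
    return doc.group(1) == reader.group(1)


print(is_compatible("1.2.0", "1.0.0"))  # True
print(is_compatible("2.0.0", "1.0.0"))  # False
```

Under this rule, new optional fields would ship in minor releases, and anything that changes or removes an existing field would require a major bump.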
Success Criteria
The specification is ready when:
- ✅ Schema validates all our existing SIF outputs
- ✅ Tools can round-trip: text → SIF → Ada memory → answerable queries
- ✅ Compression ratio ≥ 50x while maintaining hallucination safety
- ✅ Importance scores align with EXP-005 (0.60 threshold works)
- ✅ Documentation clear enough for external teams to implement
- ✅ Example generators for 3+ domains work correctly
This roadmap becomes active after research cleanup completes (Dec 2025)
Target: Full SIF v1.0 specification by end of Q4 2025