# SLIM-EVO Phase 2: Scaling & Curriculum Validation

Date: January 6, 2026 (11 AM start)
Status: 🔬 IN PROGRESS
Goal: Solidify the recipe before scaling to production
## ✅ Completed Experiments

### 2C: Half Steps (Minimum Viable Training) ✅

Command: `ce anneal run --cycles 10 --skip-evolution --gradient-steps 10`
Duration: 12.2 minutes
Result: SUCCESS - 5 steps per phase IS ENOUGH!
| Metric | Result | Target | Status |
|---|---|---|---|
| CI Density | 0.07 | < 2.0 | ✅ BASELINE! |
| WebSearch | 100% | > 60% | ✅ |
| WikiSearch | 80% | > 60% | ✅ |
| AGL Score | 0.89 | > 0.80 | ✅ |
| Coherence | 1.00 | > 0.80 | ✅ PERFECT |
Key Finding: Minimum viable training confirmed. 5 steps per phase achieves the same results as 10 steps in half the time, and the breathing pattern is still visible and healthy.
File: `results/annealing/annealing_20260106_113956.json`
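The results file can be inspected with a short script. Note this is a sketch: the key names below (`final_ci`, `agl_score`, `coherence`) are assumptions about the JSON schema, not confirmed field names; check the real file and rename accordingly.

```python
import json
from pathlib import Path


def load_final_metrics(path):
    """Load an annealing results file and pull out the headline numbers.

    The key names here are guesses at the schema -- check the real JSON
    produced by `ce anneal run` and adjust the tuple below to match.
    """
    data = json.loads(Path(path).read_text())
    return {k: data[k] for k in ("final_ci", "agl_score", "coherence") if k in data}
```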
## 🔄 Remaining Experiments

### Priority 1: Model Scaling (700M) - UP NEXT!

| # | Experiment | Cycles | Est. Time | Hypothesis |
|---|---|---|---|---|
| 2E | Standard recipe on 700M | 10 | ~50 min | Does recipe transfer to larger model? |
| 2F | Adjusted LR (5e-5) on 700M | 10 | ~50 min | Larger models need smaller LR? |
### Priority 2: Model Scaling (1.2B)

| # | Experiment | Cycles | Est. Time | Hypothesis |
|---|---|---|---|---|
| 2G | Standard recipe on 1.2B | 10 | ~90 min | Final scaling validation |
### Priority 3: Curriculum Variations (350M) - Optional

| # | Experiment | Cycles | Est. Time | Hypothesis |
|---|---|---|---|---|
| 2A | Reverse order: AGL → Wiki → WebSearch | 10 | 18 min | Does order actually matter? |
| 2B | Double steps: 20 steps per phase | 10 | 36 min | More steps = better integration? |
| 2D | More cycles: 20 cycles | 20 | 36 min | Does plateau stabilize further? |
### Priority 4: Advanced Experiments (if time)

| # | Experiment | Cycles | Est. Time | Hypothesis |
|---|---|---|---|---|
| 2H | Lower LoRA rank (r=16) | 10 | 18 min | Can we reduce parameters? |
| 2I | Higher LoRA rank (r=64) | 10 | 18 min | More capacity = better? |
| 2J | Layer-targeted LoRA | 10 | 18 min | Early layers for tools, late for AGL? |
## Time Budget Analysis

Based on experiments:
- 350M model: ~1.2 min/cycle with half steps (10 cycles = 12 min) ✅ CONFIRMED
- 700M estimate: ~3-4 min/cycle (10 cycles = 30-40 min)
- 1.2B estimate: ~6-8 min/cycle (10 cycles = 60-80 min)
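The estimates follow from simple per-cycle scaling. A quick sanity check on the arithmetic (the per-cycle minutes are the figures above; only the 350M number is measured, the rest are extrapolations):

```python
def budget_minutes(per_cycle_lo, per_cycle_hi, cycles=10):
    """Estimated wall-clock range (minutes) for a run of `cycles` cycles."""
    return per_cycle_lo * cycles, per_cycle_hi * cycles


# 350M confirmed at ~1.2 min/cycle; 700M and 1.2B are extrapolations.
estimates = {
    "350M": budget_minutes(1.2, 1.2),
    "700M": budget_minutes(3, 4),
    "1.2B": budget_minutes(6, 8),
}
```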
## Success Criteria

Each experiment should report:
- Final CI - Target: < 2.0 (lower is better)
- WebSearch accuracy - Target: > 60%
- WikiSearch accuracy - Target: > 60%
- AGL Score - Target: > 0.80
- Coherence - Target: > 0.80
- Training stability - Did it oscillate? Plateau? Diverge?
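The numeric criteria can be checked mechanically. A minimal sketch, assuming a flat dict of measured values; the metric names are placeholders, and CI is the only lower-is-better metric:

```python
# Target thresholds from the success criteria above.
# Each entry: (metric, threshold, True if lower is better).
CRITERIA = [
    ("ci", 2.0, True),
    ("websearch", 0.60, False),
    ("wikisearch", 0.60, False),
    ("agl_score", 0.80, False),
    ("coherence", 0.80, False),
]


def check(metrics):
    """Return {metric: passed} for a dict of measured values."""
    out = {}
    for name, threshold, lower_is_better in CRITERIA:
        value = metrics[name]
        out[name] = value < threshold if lower_is_better else value > threshold
    return out
```

For example, the completed 2C run (CI 0.07, WebSearch 100%, WikiSearch 80%, AGL 0.89, coherence 1.00) passes all five checks.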
## Execution Plan

### Block 1: 11 AM - 1 PM (Curriculum Validation)

- 11:00 - 2A: Reverse order (18 min) → done by 11:20
- 11:25 - 2C: Half steps (9 min) → done by 11:35
- 11:40 - 2B: Double steps (36 min) → done by 12:20
- 12:25 - 2D: 20 cycles (36 min) → done by 1:00

### Block 2: 1 PM - 3 PM (700M Scaling)
- 1:00 - 2E: 700M standard (50 min) → done by 1:50
- 2:00 - 2F: 700M lower LR (50 min) → done by 2:50

### Block 3: 3 PM - 5 PM (1.2B Scaling)
- 3:00 - 2G: 1.2B standard (90 min) → done by 4:30

### Block 4: 5 PM+ (Advanced/Overnight)
- 5:00+ - Any remaining experiments
- Extended runs for promising configs
- Layer-targeted experiments

## Quick Commands
Section titled “Quick Commands”# 2A: Reverse order (TODO: need to implement --order flag)# For now, manually edit anneal.py to reorder phases
# 2B: Double steps (20 per phase = 60 total per cycle)ce anneal run --cycles 10 --skip-evolution --gradient-steps 40
# 2C: Half steps (5 per phase = 15 total per cycle)ce anneal run --cycles 10 --skip-evolution --gradient-steps 10
# 2D: More cycles (20 instead of 10)ce anneal run --cycles 20 --skip-evolution
# 2E: 700M model ✅ (NEW FLAG!)ce anneal run --cycles 10 --skip-evolution --model LiquidAI/LFM2-700M
# 2F: 700M with lower LR ✅ (NEW FLAG!)ce anneal run --cycles 10 --skip-evolution --model LiquidAI/LFM2-700M --lr 5e-5
# 2G: 1.2B model ✅ (NEW FLAG!)ce anneal run --cycles 10 --skip-evolution --model LiquidAI/LFM2-1.2B
# 2H: Lower LoRA rank (TODO: need --lora-r flag)# For now, edit AnnealingConfig in anneal.py
# 2I: Higher LoRA rank (TODO: need --lora-r flag)# For now, edit AnnealingConfig in anneal.pyNote: --model and --lr flags now work! 🎉
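The scaling runs (2E-2G) can also be queued back to back. A hedged sketch using `subprocess`: the command lists mirror the ones above, and the `dry_run` flag is purely for illustration, not a real `ce` option:

```python
import subprocess

# The Priority 1/2 scaling runs, in execution order.
RUNS = {
    "2E": ["ce", "anneal", "run", "--cycles", "10", "--skip-evolution",
           "--model", "LiquidAI/LFM2-700M"],
    "2F": ["ce", "anneal", "run", "--cycles", "10", "--skip-evolution",
           "--model", "LiquidAI/LFM2-700M", "--lr", "5e-5"],
    "2G": ["ce", "anneal", "run", "--cycles", "10", "--skip-evolution",
           "--model", "LiquidAI/LFM2-1.2B"],
}


def run_all(dry_run=False):
    """Run each experiment sequentially, stopping on the first failure."""
    for name, cmd in RUNS.items():
        print(f"[{name}] {' '.join(cmd)}")
        if not dry_run:
            subprocess.run(cmd, check=True)
```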
## Expected Outcomes

By end of Phase 2, we should know:
- ✓/✗ Does curriculum order matter?
- ✓/✗ What’s the minimum viable training (steps/cycles)?
- ✓/✗ Does the recipe scale to 700M? 1.2B?
- ✓/✗ What learning rate works best at each scale?
- ✓/✗ Can we optimize LoRA rank?
Ready to begin! 🚀