
SLIM-EVO Phase 2: Scaling & Curriculum Validation


Date: January 6, 2026 (11 AM start)
Status: 🔬 IN PROGRESS
Goal: Solidify the recipe before scaling to production


2C: Half Steps (Minimum Viable Training) ✅


Command: `ce anneal run --cycles 10 --skip-evolution --gradient-steps 10`
Duration: 12.2 minutes
Result: SUCCESS - 5 steps per phase IS ENOUGH!

| Metric | Result | Target | Status |
|---|---|---|---|
| CI Density | 0.07 | < 2.0 | ✅ BASELINE! |
| WebSearch | 100% | > 60% | ✅ |
| WikiSearch | 80% | > 60% | ✅ |
| AGL Score | 0.89 | > 0.80 | ✅ |
| Coherence | 1.00 | > 0.80 | ✅ PERFECT |

Key Finding: Minimum viable training confirmed! 5 steps per phase achieves the same results as 10 steps in half the time. The breathing pattern is still visible and healthy.

File: results/annealing/annealing_20260106_113956.json
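Each run writes a results JSON like the one above. A minimal sketch for pulling the headline metrics back out of such a file — note that the key names (`ci_density`, `websearch_acc`, etc.) are assumptions about the schema, not confirmed field names:

```python
import json
import os
import tempfile

def load_metrics(path):
    """Extract the headline metrics from an annealing results file.
    NOTE: the key names below are assumed, not confirmed -- adjust
    them to match the real JSON schema."""
    with open(path) as f:
        data = json.load(f)
    keys = ["ci_density", "websearch_acc", "wikisearch_acc",
            "agl_score", "coherence"]
    return {k: data.get(k) for k in keys}

# Demo against a synthetic file mirroring the 2C numbers above.
sample = {"ci_density": 0.07, "websearch_acc": 1.0,
          "wikisearch_acc": 0.8, "agl_score": 0.89, "coherence": 1.0}
with tempfile.NamedTemporaryFile("w", suffix=".json", delete=False) as f:
    json.dump(sample, f)
    tmp_path = f.name
metrics = load_metrics(tmp_path)
print(metrics)
os.unlink(tmp_path)
```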


Priority 1: Model Scaling (700M) - UP NEXT!

| # | Experiment | Cycles | Est. Time | Hypothesis |
|---|---|---|---|---|
| 2E | Standard recipe on 700M | 10 | ~50 min | Does the recipe transfer to a larger model? |
| 2F | Adjusted LR (5e-5) on 700M | 10 | ~50 min | Do larger models need a smaller LR? |

Priority 2: Model Scaling (1.2B)

| # | Experiment | Cycles | Est. Time | Hypothesis |
|---|---|---|---|---|
| 2G | Standard recipe on 1.2B | 10 | ~90 min | Final scaling validation |

Priority 3: Curriculum Variations (350M) - Optional

| # | Experiment | Cycles | Est. Time | Hypothesis |
|---|---|---|---|---|
| 2A | Reverse order: AGL → Wiki → WebSearch | 10 | 18 min | Does order actually matter? |
| 2B | Double steps: 20 steps per phase | 10 | 36 min | More steps = better integration? |
| 2D | More cycles: 20 cycles | 20 | 36 min | Does the plateau stabilize further? |

Priority 4: Advanced Experiments (if time)

| # | Experiment | Cycles | Est. Time | Hypothesis |
|---|---|---|---|---|
| 2H | Lower LoRA rank (r=16) | 10 | 18 min | Can we reduce parameters? |
| 2I | Higher LoRA rank (r=64) | 10 | 18 min | More capacity = better? |
| 2J | Layer-targeted LoRA | 10 | 18 min | Early layers for tools, late for AGL? |
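For 2H/2I, the parameter cost of a rank change is easy to reason about up front: one LoRA adapter on a `d_in × d_out` linear layer adds an `(d_in, r)` matrix A and an `(r, d_out)` matrix B, so trainable parameters scale linearly with `r`. A quick sketch (the hidden size 1024 below is a placeholder, not the real LFM2 dimension):

```python
def lora_params(d_in, d_out, r):
    """Trainable parameters added by one LoRA adapter on a
    d_in x d_out linear layer: A is (d_in, r), B is (r, d_out)."""
    return r * (d_in + d_out)

# Illustrative only: 1024 is a placeholder hidden size.
# Compare ranks 16 / 32 / 64 on a single square projection.
for r in (16, 32, 64):
    print(f"r={r}: {lora_params(1024, 1024, r):,} params per adapter")
```

So 2I (r=64) costs exactly 4× the adapter parameters of 2H (r=16) — the question is whether that extra capacity buys anything on the metrics above.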

Runtime estimates, based on the experiments so far:

  • 350M model: ~1.2 min/cycle with half steps (10 cycles = 12 min) ✅ CONFIRMED
  • 700M estimate: ~3-4 min/cycle (10 cycles = 30-40 min)
  • 1.2B estimate: ~6-8 min/cycle (10 cycles = 60-80 min)
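The per-cycle numbers above make queue planning a one-liner. A small sketch that turns them into wall-time ranges per run (the 700M and 1.2B figures are the extrapolations from the list, not measurements):

```python
# Per-cycle wall-time estimates in minutes, from the list above.
# 350M is measured; 700M and 1.2B are extrapolations.
PER_CYCLE_MIN = {"350M": (1.2, 1.2), "700M": (3.0, 4.0), "1.2B": (6.0, 8.0)}

def run_estimate(model, cycles):
    """Return the (low, high) wall-time estimate in minutes for a run."""
    lo, hi = PER_CYCLE_MIN[model]
    return lo * cycles, hi * cycles

for model in ("350M", "700M", "1.2B"):
    lo, hi = run_estimate(model, 10)
    print(f"{model} x 10 cycles: {lo:.0f}-{hi:.0f} min")
```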

Each experiment should report:

  1. Final CI - Target: < 2.0 (lower is better)
  2. WebSearch accuracy - Target: > 60%
  3. WikiSearch accuracy - Target: > 60%
  4. AGL Score - Target: > 0.80
  5. Coherence - Target: > 0.80
  6. Training stability - Did it oscillate? Plateau? Diverge?
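The five numeric targets above can be checked mechanically per run. A hedged sketch of such a gate — the metric key names are assumptions about the results schema, and the stability item (#6) stays a human judgment call:

```python
# Pass/fail gate for the per-experiment report card above.
# Key names are assumed, not confirmed against the real JSON schema.
TARGETS = {
    "ci_density":     lambda v: v < 2.0,   # lower is better
    "websearch_acc":  lambda v: v > 0.60,
    "wikisearch_acc": lambda v: v > 0.60,
    "agl_score":      lambda v: v > 0.80,
    "coherence":      lambda v: v > 0.80,
}

def gate(metrics):
    """Return {metric: True/False} per target; missing metrics fail."""
    return {k: (k in metrics and check(metrics[k]))
            for k, check in TARGETS.items()}

# The 2C results above pass every gate:
result_2c = {"ci_density": 0.07, "websearch_acc": 1.0,
             "wikisearch_acc": 0.8, "agl_score": 0.89, "coherence": 1.0}
print(gate(result_2c))
```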

Block 1: 11 AM - 1 PM (Curriculum Validation)

  • 11:00 - 2A: Reverse order (18 min) → done by 11:20
  • 11:25 - 2C: Half steps (9 min) → done by 11:35
  • 11:40 - 2B: Double steps (36 min) → done by 12:20
  • 12:25 - 2D: 20 cycles (36 min) → done by 1:00
  • 1:00 - 2E: 700M standard (50 min) → done by 1:50
  • 2:00 - 2F: 700M lower LR (50 min) → done by 2:50
  • 3:00 - 2G: 1.2B standard (90 min) → done by 4:30
  • 5:00+ - Any remaining experiments, extended runs for promising configs, and layer-targeted experiments
```sh
# 2A: Reverse order (TODO: need to implement --order flag)
# For now, manually edit anneal.py to reorder phases

# 2B: Double steps (20 per phase = 60 total per cycle)
ce anneal run --cycles 10 --skip-evolution --gradient-steps 40

# 2C: Half steps (5 per phase = 15 total per cycle)
ce anneal run --cycles 10 --skip-evolution --gradient-steps 10

# 2D: More cycles (20 instead of 10)
ce anneal run --cycles 20 --skip-evolution

# 2E: 700M model ✅ (NEW FLAG!)
ce anneal run --cycles 10 --skip-evolution --model LiquidAI/LFM2-700M

# 2F: 700M with lower LR ✅ (NEW FLAG!)
ce anneal run --cycles 10 --skip-evolution --model LiquidAI/LFM2-700M --lr 5e-5

# 2G: 1.2B model ✅ (NEW FLAG!)
ce anneal run --cycles 10 --skip-evolution --model LiquidAI/LFM2-1.2B

# 2H: Lower LoRA rank (TODO: need --lora-r flag)
# For now, edit AnnealingConfig in anneal.py

# 2I: Higher LoRA rank (TODO: need --lora-r flag)
# For now, edit AnnealingConfig in anneal.py
```
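For the 2A TODO, a minimal sketch of what a `--order` flag could look like once implemented. The phase names (`websearch`, `wiki`, `agl`) and the default order are assumptions about anneal.py's internals, not its actual code:

```python
import argparse

# Hypothetical: assumed phase names and default curriculum order.
PHASES = ["websearch", "wiki", "agl"]

def parse_order(spec):
    """Parse a comma-separated phase order; it must be a permutation
    of the known phases so no phase is dropped or duplicated."""
    order = spec.split(",")
    if sorted(order) != sorted(PHASES):
        raise ValueError(f"--order must be a permutation of {PHASES}")
    return order

parser = argparse.ArgumentParser()
parser.add_argument("--order", type=parse_order, default=PHASES)

# 2A would then be the reversed curriculum:
args = parser.parse_args(["--order", "agl,wiki,websearch"])
print(args.order)
```

Validating the permutation up front means a typo fails fast instead of silently skipping a phase mid-run.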

Note: the `--model` and `--lr` flags now work! 🎉

By end of Phase 2, we should know:

  1. ✓/✗ Does curriculum order matter?
  2. ✓/✗ What’s the minimum viable training (steps/cycles)?
  3. ✓/✗ Does the recipe scale to 700M? 1.2B?
  4. ✓/✗ What learning rate works best at each scale?
  5. ✓/✗ Can we optimize LoRA rank?

Ready to begin! 🚀