MODEL_TOOLING_BEHAVIOR
Research Note: Model-to-Model Tooling Behavior Differences
Observation
While using GPT-5.2 as the "neural net" in VS Code, we noticed a style pattern:
- More frequent batched tool calls (parallel reads/searches).
- More additional verification reads ("let's read one more note") before editing.
This isn't necessarily bad (often it's a risk-management strategy), but it's measurable, and different models/prompt setups may behave differently.
Hypothesis
Different models (and different tool/policy prompts) optimize different tradeoffs:
- Correctness / non-hallucination bias → more reads before edits.
- Tool-parallelism awareness → more batching.
- Latency sensitivity → fewer calls, larger reads, earlier edits.
Proposed Experiment (later)
Run the same fixed task suite across multiple models (e.g., Sonnet 4.5 / Opus 4.5 vs GPT-5.2), in the same repo revision, with identical tool availability.
Task Suite
Include tasks across:
- Find & replace policy sweep
- Single-file bug fix
- Multi-file refactor (small)
- Docs update
- Extension build/test fix
Metrics to Collect
- Tool call count (total, by type)
- Parallel "batching factor" (calls per round)
- Lines read / lines changed (read amplification)
- Search → read → edit efficiency (% reads leading to edits)
- Time to first edit
- Wall-clock time to completion
- Outcome quality (tests pass, matches intent)
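As a rough illustration of how the trace-based metrics above might be computed, here is a minimal sketch. The `ToolCall` record and its fields are hypothetical (no log format is defined in this note), and "reads leading to edits" is interpreted here as reads of a file that is edited at some point in the run, which is only one possible definition. Wall-clock time and outcome quality are left out because they need data from outside the tool trace.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class ToolCall:
    """One tool invocation in a run trace (hypothetical schema)."""
    round_idx: int      # index of the model turn that issued the call
    kind: str           # "search", "read", or "edit"
    path: str = ""      # file touched; empty for searches
    lines: int = 0      # lines read (for reads) or changed (for edits)
    t: float = 0.0      # seconds since task start


def summarize(calls):
    """Compute the trace-based metrics for a single run."""
    rounds = {c.round_idx for c in calls}
    reads = [c for c in calls if c.kind == "read"]
    edits = [c for c in calls if c.kind == "edit"]
    lines_read = sum(c.lines for c in reads)
    lines_changed = sum(c.lines for c in edits)
    edited_paths = {c.path for c in edits}
    # Assumption: a read "leads to an edit" if its file is edited in this run.
    useful_reads = sum(1 for c in reads if c.path in edited_paths)
    return {
        "total_calls": len(calls),
        "calls_by_type": dict(Counter(c.kind for c in calls)),
        "batching_factor": len(calls) / len(rounds) if rounds else 0.0,
        "read_amplification": lines_read / lines_changed if lines_changed else None,
        "read_to_edit_rate": useful_reads / len(reads) if reads else None,
        "time_to_first_edit": min((c.t for c in edits), default=None),
    }


# Illustrative trace: one batched round, one verification read, one edit.
calls = [
    ToolCall(0, "search", t=1.0),
    ToolCall(0, "read", "a.py", 100, 1.0),
    ToolCall(0, "read", "b.py", 50, 1.0),
    ToolCall(1, "read", "a.py", 50, 4.0),
    ToolCall(2, "edit", "a.py", 10, 9.0),
]
summary = summarize(calls)
```

For this trace, 200 lines are read for 10 lines changed, so the read amplification is 20, and two of the three reads touch the file that ends up edited. Metrics with an empty denominator come back as `None` rather than zero, so runs that never edit are not mistaken for perfectly efficient ones.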
Prompt/Policy Knobs to Sweep
- "Prefer fewer, larger reads" vs "verify aggressively"
- "Max N reads before first edit" budgets
- "Edit as soon as target snippet located" stop condition
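One way the knob sweep could be laid out is as a full grid over models, tasks, and policy settings. The sketch below is an assumption about structure, not a defined harness: the model names are plain labels rather than API identifiers, the read budgets (`None`, 3, 10) are placeholder values, and the prompt wording in `policy_prompt` is illustrative.

```python
from itertools import product

# Labels only (not API model identifiers).
MODELS = ["GPT-5.2", "Sonnet 4.5", "Opus 4.5"]
TASKS = [
    "find-and-replace policy sweep",
    "single-file bug fix",
    "multi-file refactor (small)",
    "docs update",
    "extension build/test fix",
]
READ_STYLES = ["fewer_larger_reads", "verify_aggressively"]
MAX_READS_BEFORE_FIRST_EDIT = [None, 3, 10]  # None = no budget; values are placeholders
EDIT_ON_LOCATE = [False, True]


def build_runs(tasks):
    """Enumerate every model x task x knob combination as a run config."""
    return [
        {
            "model": model,
            "task": task,
            "read_style": style,
            "max_reads_before_first_edit": budget,
            "edit_on_locate": stop_early,
        }
        for model, task, style, budget, stop_early in product(
            MODELS, tasks, READ_STYLES, MAX_READS_BEFORE_FIRST_EDIT, EDIT_ON_LOCATE
        )
    ]


def policy_prompt(run):
    """Render one run's knobs as policy text (wording is illustrative)."""
    lines = []
    if run["read_style"] == "fewer_larger_reads":
        lines.append("Prefer fewer, larger reads.")
    else:
        lines.append("Verify aggressively before editing.")
    budget = run["max_reads_before_first_edit"]
    if budget is not None:
        lines.append(f"Make at most {budget} reads before the first edit.")
    if run["edit_on_locate"]:
        lines.append("Edit as soon as the target snippet is located.")
    return "\n".join(lines)


runs = build_runs(TASKS)
```

With these placeholder values the grid has 3 × 5 × 2 × 3 × 2 = 180 runs. A full grid grows quickly, so a real sweep might vary one knob at a time against a fixed baseline instead.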
This research is likely well-suited to a more exploratory model (e.g., Sonnet 4.5 / Opus 4.5) when we want to dig into methodology + analysis and produce a writeup.