
Research Note: Model-to-Model Tooling Behavior Differences


While using GPT-5.2 as the “neural net” in VS Code, we noticed a recurring style pattern:

  • More frequent batched tool calls (parallel reads/searches).
  • More extra verification reads (“let’s read one more note”) before editing.

This isn’t necessarily bad; often it’s a risk-management strategy. But it is measurable, and different models and prompt setups may behave differently.

Different models (and different tool/policy prompts) optimize different tradeoffs:

  • Correctness / non-hallucination bias → more reads before edits.
  • Tool-parallelism awareness → more batching.
  • Latency sensitivity → fewer calls, larger reads, earlier edits.

Proposed experiment: run the same fixed task suite across multiple models (e.g., Sonnet 4.5 / Opus 4.5 vs. GPT-5.2), against the same repo revision, with identical tool availability.

Include tasks across:

  • Find & replace policy sweep
  • Single-file bug fix
  • Multi-file refactor (small)
  • Docs update
  • Extension build/test fix

Metrics to record per run:

  • Tool call count (total, by type)
  • Parallel “batching factor” (calls per round)
  • Lines read / lines changed (read amplification)
  • Search → read → edit efficiency (% of reads leading to edits)
  • Time to first edit
  • Wall-clock time to completion
  • Outcome quality (tests pass, matches intent)

Policy variants to test:

  • “Prefer fewer, larger reads” vs. “verify aggressively”
  • “Max N reads before first edit” budgets
  • “Edit as soon as the target snippet is located” stop condition
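As a sketch, most of the per-run metrics above can be computed from a simple tool-call event log. The event schema and field names here are assumptions for illustration, not an existing format:

```python
from dataclasses import dataclass

@dataclass
class ToolEvent:
    t: float        # seconds since task start
    kind: str       # "read", "search", or "edit"
    round: int      # which batch of parallel calls this belonged to
    lines: int = 0  # lines read (for reads) or lines changed (for edits)

def run_metrics(events: list[ToolEvent]) -> dict:
    """Compute the behavioral metrics proposed above for one task run."""
    reads = [e for e in events if e.kind == "read"]
    edits = [e for e in events if e.kind == "edit"]
    rounds = {e.round for e in events}
    lines_read = sum(e.lines for e in reads)
    lines_changed = sum(e.lines for e in edits)
    first_edit_t = min((e.t for e in edits), default=None)
    return {
        "tool_calls_total": len(events),
        # average calls per round; high values indicate heavy batching
        "batching_factor": len(events) / max(len(rounds), 1),
        # how many lines were read per line eventually changed
        "read_amplification": lines_read / max(lines_changed, 1),
        "time_to_first_edit": first_edit_t,
        # verification appetite: reads issued before committing to an edit
        "reads_before_first_edit": sum(
            1 for e in reads if first_edit_t is None or e.t < first_edit_t
        ),
    }

# Example run: one batched search+read round, one verification read, one edit.
events = [
    ToolEvent(t=1.0, kind="search", round=0),
    ToolEvent(t=1.0, kind="read", round=0, lines=120),
    ToolEvent(t=4.0, kind="read", round=1, lines=40),
    ToolEvent(t=9.0, kind="edit", round=2, lines=8),
]
m = run_metrics(events)
```

The round-based batching factor is one plausible operationalization of “calls per round”; if the harness logs batches differently (e.g., by request ID), the grouping key would change but the metric stays the same.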

This research is likely well-suited to a more exploratory model (e.g., Sonnet 4.5 / Opus 4.5) when we want to dig into the methodology and analysis and produce a writeup.