Generative AI · Study · 2026

Agent vs CoT: Does a Tool Help?

Does a calculator help an LLM, and does the model actually use it?

The calculator tool helps only the middle model (8B, +4.2pt); it hurts the 3B (-4.2pt) and the near-ceiling 70B (97.5% plain, -5.0pt with the agent). Tool-use reliability scales with capability (23% to 100% across the three sizes), so the model that would benefit most uses the tool least.

Tool-use explorer · GSM8K · 120 Qs × 2 conditions · Real eval data

Tool-use reliability scales with capability

3B: 23% · 8B: 44% · 70B: 100%

23% → 44% → 100% across 3B / 8B / 70B. The model that would benefit most uses the tool least.

Plain chain-of-thought vs the ReAct calculator agent, on the same 120 GSM8K problems.

Top bar = plain CoT, bottom bar = ReAct agent. The tool only helps the 8B (+4.2pt); it hurts the 3B (-4.2pt) and the 70B (-5.0pt).

Llama 3.2 3B · agent failure modes

wrong, no tool: 22
wrong, with tool: 7
stuck (max steps): 1
empty response: 18

Llama 3.2 3B: tool used on 23% of problems, 1.4 avg steps. Accuracy when the tool was actually used: 64%.

Honest read: N=120 per condition, single run at temperature 0. Cross-model RAW accuracy is noisy because format adherence differs (the 8B's plain accuracy is dragged down by 34 no_answer format misses), so the robust signals are the WITHIN-model plain-vs-agent delta and the tool-use rate — not the cross-model accuracy ranking. GSM8K arithmetic word problems only. The calculator is a safe shunting-yard evaluator (no eval); answers graded by exact numeric match to the GSM8K gold.
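To make "safe shunting-yard evaluator (no eval)" concrete, here is a minimal sketch of that kind of calculator. It is not the study's code: the function names, the supported operator set (+ - * /, parentheses, unary minus) and the error messages are all assumptions made for illustration.

```typescript
// Sketch of a safe arithmetic evaluator: shunting-yard with two stacks, no eval().
// Supports + - * /, parentheses, unary minus and decimals. Names are illustrative.

function prec(op: string): number {
  if (op === "u-") return 3; // unary minus binds tightest
  if (op === "*" || op === "/") return 2;
  return 1; // "+" and "-"
}

export function evaluate(expr: string): number {
  const values: number[] = [];
  const ops: string[] = []; // pending operators and "("
  let expectOperand = true; // true at the start, after an operator, after "("

  // Pop one operator and apply it to the value stack.
  const applyTop = (): void => {
    const op = ops.pop();
    if (op === undefined || op === "(") throw new Error("mismatched parentheses");
    const b = values.pop();
    if (b === undefined) throw new Error("missing operand");
    if (op === "u-") {
      values.push(-b);
      return;
    }
    const a = values.pop();
    if (a === undefined) throw new Error("missing operand");
    if (op === "+") values.push(a + b);
    else if (op === "-") values.push(a - b);
    else if (op === "*") values.push(a * b);
    else if (b === 0) throw new Error("division by zero");
    else values.push(a / b);
  };

  let i = 0;
  while (i < expr.length) {
    const ch = expr.charAt(i);
    if (ch === " ") {
      i++;
    } else if (/[0-9.]/.test(ch)) {
      if (!expectOperand) throw new Error("unexpected number");
      let j = i;
      while (j < expr.length && /[0-9.]/.test(expr.charAt(j))) j++;
      const value = Number(expr.slice(i, j));
      if (!Number.isFinite(value)) throw new Error(`bad number: ${expr.slice(i, j)}`);
      values.push(value);
      expectOperand = false;
      i = j;
    } else if (ch === "(") {
      if (!expectOperand) throw new Error("unexpected '('");
      ops.push("(");
      i++;
    } else if (ch === ")") {
      if (expectOperand) throw new Error("unexpected ')'");
      while (ops.length > 0 && ops[ops.length - 1] !== "(") applyTop();
      if (ops.pop() !== "(") throw new Error("mismatched parentheses");
      i++;
    } else if (ch === "-" && expectOperand) {
      ops.push("u-"); // prefix operator: push without popping anything
      i++;
    } else if (ch === "+" || ch === "-" || ch === "*" || ch === "/") {
      if (expectOperand) throw new Error(`unexpected operator: ${ch}`);
      // Left-associative: apply pending operators of equal or higher precedence.
      while (ops.length > 0) {
        const top = ops[ops.length - 1];
        if (top === undefined || top === "(" || prec(top) < prec(ch)) break;
        applyTop();
      }
      ops.push(ch);
      expectOperand = true;
      i++;
    } else {
      throw new Error(`unexpected character: ${ch}`);
    }
  }
  if (expectOperand) throw new Error("incomplete expression");
  while (ops.length > 0) applyTop();
  const result = values.pop();
  if (result === undefined || values.length > 0) throw new Error("malformed expression");
  return result;
}
```

Because the evaluator only recognizes digits, four operators and parentheses, anything else the model writes after CALC (identifiers, function calls, code) is rejected with an error instead of being executed.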

Real eval data, nothing synthetic: the plain-vs-agent bars, the tool-use-rate strip and the failure-mode breakdown are read verbatim from results.json — three Llama models solving the same 120 GSM8K problems plain and as a ReAct calculator agent, graded by exact numeric match to the GSM8K gold.
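For readers unfamiliar with "exact numeric match to the GSM8K gold", the grading step could look roughly like the sketch below. GSM8K reference solutions end with a line of the form "#### <number>". The helper names and the normalization choices (stripping "$", commas and whitespace) are assumptions, not the study's actual grader.

```typescript
// Sketch of exact-numeric-match grading. All function names are hypothetical.

// Parse strings like "42", "-3.5" or "$1,250.00" into a number; null if not numeric.
export function parseNumeric(text: string): number | null {
  const cleaned = text.replace(/[$,\s]/g, "");
  if (!/^-?(\d+\.?\d*|\.\d+)$/.test(cleaned)) return null;
  return Number(cleaned);
}

// GSM8K reference solutions put the final answer after a "####" marker.
export function goldAnswer(solution: string): number | null {
  const marker = solution.lastIndexOf("####");
  if (marker === -1) return null;
  return parseNumeric(solution.slice(marker + 4));
}

// Correct only if both sides parse and the numbers are exactly equal.
export function isCorrect(predicted: string, goldSolution: string): boolean {
  const p = parseNumeric(predicted);
  const g = goldAnswer(goldSolution);
  return p !== null && g !== null && p === g;
}
```

Under this kind of grader, a response with no parseable number counts as wrong regardless of the reasoning behind it, which is how format misses (the no_answer cases) end up lowering raw accuracy.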

A study of whether a calculator tool makes an LLM better at grade-school math, and whether weaker models actually reach for it. Three Llama models (3B, 8B, 70B) each solved the SAME 120 GSM8K word problems in two conditions: plain chain-of-thought, and a ReAct agent with a calculator tool. In the agent loop the model writes "CALC: <expr>", gets a "RESULT", and loops until "ANSWER". The calculator is a safe shunting-yard evaluator (no eval), and final answers are graded by exact numeric match to the GSM8K gold. The design is 3 x 120 (three models, 120 problems each).
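The agent loop described above can be sketched as follows. This is an illustration, not the study's implementation: the exact "CALC:" / "RESULT:" / "ANSWER:" line syntax, the step limit of 6, the status names and the synchronous model stub are all assumptions (a real model client would be async).

```typescript
// Sketch of a text-protocol ReAct loop with a calculator tool.
// The line formats, maxSteps default and status names are hypothetical.

type Action =
  | { kind: "calc"; expr: string }
  | { kind: "answer"; value: string }
  | { kind: "none" };

type Outcome =
  | { status: "answered"; answer: string; steps: number; toolCalls: number }
  | { status: "max_steps"; steps: number; toolCalls: number }
  | { status: "empty"; steps: number; toolCalls: number };

// Find the first CALC: or ANSWER: line in a model reply.
export function parseAction(reply: string): Action {
  for (const raw of reply.split("\n")) {
    const line = raw.trim();
    if (line.startsWith("CALC:")) return { kind: "calc", expr: line.slice(5).trim() };
    if (line.startsWith("ANSWER:")) return { kind: "answer", value: line.slice(7).trim() };
  }
  return { kind: "none" };
}

export function runAgent(
  question: string,
  model: (transcript: string) => string, // stand-in for the LLM call
  calculator: (expr: string) => number, // e.g. a safe expression evaluator
  maxSteps = 6,
): Outcome {
  let transcript = `QUESTION: ${question}\n`;
  let toolCalls = 0;
  for (let step = 1; step <= maxSteps; step++) {
    const reply = model(transcript);
    if (reply.trim() === "") return { status: "empty", steps: step, toolCalls };
    transcript += reply + "\n";
    const action = parseAction(reply);
    if (action.kind === "answer") {
      return { status: "answered", answer: action.value, steps: step, toolCalls };
    }
    if (action.kind === "calc") {
      toolCalls++;
      let result: string;
      try {
        result = String(calculator(action.expr));
      } catch (err) {
        // Feed calculator errors back to the model instead of crashing the loop.
        result = `error: ${err instanceof Error ? err.message : String(err)}`;
      }
      transcript += `RESULT: ${result}\n`;
    }
    // kind "none": the model wrote free text; loop again with the longer transcript.
  }
  return { status: "max_steps", steps: maxSteps, toolCalls };
}
```

Tracking toolCalls separately from the final status is what makes it possible to report a tool-use rate and to split wrong answers into "with tool" and "no tool". A model that answers immediately finishes in one step with zero tool calls.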

Finding: the tool only helps the middle model. The 8B gains +4.2pt with the agent loop, but the 3B loses 4.2pt and the 70B loses 5.0pt. Tool-use reliability scales with capability: the 3B uses the calculator 23% of the time, the 8B 44%, and the 70B 100%, so the model that would benefit most uses it least. Each size fails differently: the 70B is already near-ceiling at 97.5% plain accuracy, so the agent loop only adds failure surface; the 8B often gets stuck in the loop (26 max-steps failures); and the 3B mostly ignores the tool (18 empty responses plus 22 wrong-without-tool answers).

Caveats, stated plainly: N=120, a single run at temperature 0. Cross-model RAW accuracy is noisy because format adherence differs (the 8B's plain accuracy is dragged down by 34 no_answer format misses), so the robust signals are the WITHIN-model plain-vs-agent delta and the tool-use rate, not the cross-model accuracy ranking. GSM8K (arithmetic word problems) only. The comparison bars, tool-use strip and failure-mode breakdown render straight from results.json, nothing hand-typed.
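To put the N=120 caveat in concrete terms, the short calculation below converts the reported percentage-point deltas into whole problems. It assumes each delta is computed over the same 120-problem denominator. The function names are illustrative and not taken from the study's code.

```typescript
// Worked arithmetic: what a percentage-point delta means in whole problems at N=120.
// Function names are illustrative.

// Percentage points contributed by a single problem out of n.
export function pointsPerProblem(n: number): number {
  return 100 / n;
}

// Number of problems implied by a delta in percentage points, rounded to whole problems.
export function pointsToProblems(deltaPoints: number, n: number): number {
  return Math.round((deltaPoints * n) / 100);
}

const N = 120;
const reported = [
  { model: "3B", deltaPoints: -4.2 },
  { model: "8B", deltaPoints: 4.2 },
  { model: "70B", deltaPoints: -5.0 },
];
for (const { model, deltaPoints } of reported) {
  const problems = pointsToProblems(deltaPoints, N);
  console.log(`${model}: ${deltaPoints}pt is about ${problems} of ${N} problems`);
}
```

At N=120 each problem is worth about 0.83pt, so the three deltas correspond to roughly 5, 5 and 6 problems changing outcome. That is why the single-run numbers are best read as directional.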

TypeScript
NVIDIA NIM
GSM8K
ReAct
LLM agents

tool helps only the 8B: +4.2pt
tool-use rate 3B to 70B: 23% to 100%
models x GSM8K Qs: 3 x 120
70B plain accuracy: 97.5%

What I'd improve

Next steps: extend beyond GSM8K to tasks where a tool is unambiguously required (multi-hop retrieval, code execution, unit conversion), so the tool's value is not bounded by a model already near-ceiling on arithmetic. Compare native tool-calling (function-calling APIs) against the ReAct-text protocol, since the 8B's max-steps failures suggest the text loop itself is a failure surface. Raise N well past 120 and run multiple seeds to put variance bars on each delta rather than relying on a single temperature-0 snapshot. Use constrained / structured decoding to force the CALC/ANSWER format so format adherence stops confounding the cross-model accuracy comparison.