LLM Confidence Calibration
Are LLMs calibrated when they sound sure?
Calibration improves sharply with scale: ECE drops from 0.418 for the 1B model to 0.072 for the 70B model, the best-calibrated of the four. But every model is overconfident, stating +6 to +33pt more confidence than its accuracy earns.
Each dot is a confidence bin (radius ∝ how many answers). Points below the dashed diagonal mean the model was overconfident.
Llama 3.2 1B: ECE 0.418 · overconfidence gap +32pt · n=236 parsed (79% parse rate). Below the diagonal = stated more confidence than it earned.
| Model | Acc | Conf | Gap | ECE | Parse |
|---|---|---|---|---|---|
| Llama 3.2 1B | 44% | 76% | +32pt | 0.418 | 79% |
| Llama 3.2 3B | 49% | 82% | +33pt | 0.344 | 52% |
| Llama 3.1 8B | 64% | 81% | +17pt | 0.177 | 52% |
| Llama 3.3 70B | 80% | 86% | +6pt | 0.072 | 99% |
Calibration improves sharply with scale: ECE 0.418 (1B) → 0.072 (70B). Every model is overconfident.
Honest read: parse rates vary (52% to 99%) because the smaller and mid-size models follow the strict answer format less reliably. That is both a finding and a possible selection bias, since each model is scored only on the questions it formatted correctly. Single run at temperature 0; n is bounded (300 attempted per model, fewer parsed). Calibration metrics via @kas0235/calibration.
Real eval data, nothing synthetic: the reliability bins, metrics, and comparison table are read verbatim from results.json, which holds the four Llama models' stated confidences and correctness on the same 300 MMLU questions, with calibration computed by @kas0235/calibration.
A study of whether LLMs "know when they are right." Four Llama models (1B, 3B, 8B, 70B) each answered the same 300 MMLU questions across 16 subjects and stated a confidence from 0 to 100% for each answer. Treating each (confidence, was-correct) pair as a (p, y) calibration point, I measured expected calibration error (ECE), Brier score, and reliability with @kas0235/calibration, my own toolkit (also behind the Calibration Lab and the KickCast study).
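To make the (p, y) framing concrete, here is a minimal sketch of the standard definitions of reliability bins, ECE, and Brier score. It is not the API of @kas0235/calibration: the type and function names are hypothetical, and the choice of 10 equal-width bins is an assumption, since the write-up does not state the binning used.

```typescript
// One scored answer: stated confidence in [0, 1] and whether it was correct.
interface CalibrationPoint {
  p: number; // stated confidence, e.g. 0.8 for "80%"
  y: 0 | 1;  // 1 if the answer was correct, else 0
}

// One dot on a reliability diagram.
interface ReliabilityBin {
  lo: number;
  hi: number;
  count: number;
  meanConfidence: number;
  accuracy: number;
}

// Equal-width binning. numBins = 10 is an assumed default, not taken from the study.
function reliabilityBins(points: CalibrationPoint[], numBins = 10): ReliabilityBin[] {
  const sums = Array.from({ length: numBins }, () => ({ count: 0, conf: 0, correct: 0 }));
  for (const { p, y } of points) {
    // p = 1.0 is clamped into the last bin instead of falling out of range.
    const i = Math.min(numBins - 1, Math.floor(p * numBins));
    sums[i].count += 1;
    sums[i].conf += p;
    sums[i].correct += y;
  }
  return sums
    .map((s, i) => ({
      lo: i / numBins,
      hi: (i + 1) / numBins,
      count: s.count,
      meanConfidence: s.count > 0 ? s.conf / s.count : 0,
      accuracy: s.count > 0 ? s.correct / s.count : 0,
    }))
    .filter((b) => b.count > 0); // empty bins carry no weight and draw no dot
}

// ECE: count-weighted mean of |accuracy - mean confidence| over non-empty bins.
function expectedCalibrationError(points: CalibrationPoint[], numBins = 10): number {
  const n = points.length;
  if (n === 0) return 0;
  return reliabilityBins(points, numBins).reduce(
    (acc, b) => acc + (b.count / n) * Math.abs(b.accuracy - b.meanConfidence),
    0,
  );
}

// Brier score: mean squared difference between stated confidence and outcome.
function brierScore(points: CalibrationPoint[]): number {
  if (points.length === 0) return 0;
  return points.reduce((acc, { p, y }) => acc + (p - y) ** 2, 0) / points.length;
}

// Toy data for illustration only, not taken from results.json.
const toyPoints: CalibrationPoint[] = [
  { p: 0.95, y: 1 },
  { p: 0.95, y: 0 },
  { p: 0.65, y: 1 },
  { p: 0.65, y: 0 },
];
console.log(expectedCalibrationError(toyPoints), brierScore(toyPoints));
```

In this toy set both bins are right half the time while claiming 95% and 65%, so each sits below the diagonal, which is the overconfident pattern described above.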
Finding: calibration improves sharply with scale. Across the 4 x 300 design, ECE falls from 0.418 for the 1B model to 0.072 for the 70B model, the best-calibrated of the four. But every model is overconfident: stated confidence exceeds accuracy by +6 to +33pt across the four sizes. A small model that sounds 80% sure is right closer to 45% of the time.
Caveats, stated plainly: parse rates vary (52% to 99%) because the smaller and mid-size models follow the strict answer format less reliably. That is both a finding and a possible selection bias, since a model is scored only on the questions it managed to format. This is a single run at temperature 0, and n is bounded (300 attempted per model, fewer actually parsed). The reliability diagrams and the comparison table render straight from results.json, nothing hand-typed.
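One way to see how much the parse-rate caveat could matter is a worst-case and best-case bound on accuracy over all attempted questions. The sketch below is an illustrative sensitivity check, not something the study ran: it assumes every unparsed answer is either wrong (lower bound) or right (upper bound), and it uses the rounded percentages from the comparison table, so the outputs are approximate. The `ModelRow` type and `accuracyBounds` function are hypothetical names.

```typescript
interface ModelRow {
  model: string;
  parsedAccuracy: number; // accuracy on parsed answers only ("Acc" column)
  parseRate: number;      // share of attempted questions that parsed ("Parse" column)
}

// Rounded values copied from the comparison table.
const tableRows: ModelRow[] = [
  { model: "Llama 3.2 1B", parsedAccuracy: 0.44, parseRate: 0.79 },
  { model: "Llama 3.2 3B", parsedAccuracy: 0.49, parseRate: 0.52 },
  { model: "Llama 3.1 8B", parsedAccuracy: 0.64, parseRate: 0.52 },
  { model: "Llama 3.3 70B", parsedAccuracy: 0.8, parseRate: 0.99 },
];

// Assumption: unparsed answers are all wrong (lower) or all right (upper).
// The true accuracy over every attempted question lies somewhere in between.
function accuracyBounds(row: ModelRow): { lower: number; upper: number } {
  const lower = row.parsedAccuracy * row.parseRate;
  const upper = lower + (1 - row.parseRate);
  return { lower, upper };
}

for (const row of tableRows) {
  const { lower, upper } = accuracyBounds(row);
  console.log(`${row.model}: ${(lower * 100).toFixed(1)}% to ${(upper * 100).toFixed(1)}%`);
}
```

Under these assumptions the 3B row spans roughly 25% to 73%, while the 70B row spans about 79% to 80%. The width of the interval is driven by the parse rate, which is why the low-parse models are the hardest to compare.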
- TypeScript
- NVIDIA NIM
- MMLU
- @kas0235/calibration
- Isotonic regression
- 4 x 300 (models x MMLU questions)
- +6 to +33pt (all overconfident)
What I'd improve
Next steps: first, widen beyond one family (Qwen, Mistral, GPT-class) to see if the scale-calibration trend holds across architectures. Second, run each model multiple times to put variance bars on ECE and the overconfidence gap, rather than relying on a single temperature-0 snapshot. Third, use constrained or structured decoding to force the answer format so the parse rate stops confounding the comparison. The 52% parse rates on the 3B and 8B models are the weakest link, since a model is currently scored only on the questions it happened to format. A natural follow-on is recalibration: fit isotonic regression or temperature scaling on a held-out split and measure how much of each model's overconfidence is correctable post hoc.
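As a rough illustration of the recalibration idea, the sketch below fits isotonic regression with the pool-adjacent-violators algorithm and maps a stated confidence to a recalibrated one. It is a generic textbook version, not the implementation in @kas0235/calibration. The names `fitIsotonic` and `recalibrate` are hypothetical, the fitted curve is a plain step function with no interpolation, and the toy points stand in for a held-out split.

```typescript
interface CalibrationPoint {
  p: number; // stated confidence in [0, 1]
  y: number; // 1 if the answer was correct, else 0
}

interface IsotonicModel {
  upperEdges: number[]; // largest stated confidence covered by each step
  values: number[];     // recalibrated confidence for each step (non-decreasing)
}

interface Block {
  sum: number;
  count: number;
  maxP: number;
}

function fitIsotonic(points: CalibrationPoint[]): IsotonicModel {
  const sorted = [...points].sort((a, b) => a.p - b.p);

  // Pool ties first: answers with the same stated confidence must share one value.
  const tied: Block[] = [];
  for (const { p, y } of sorted) {
    const top = tied[tied.length - 1];
    if (top !== undefined && top.maxP === p) {
      top.sum += y;
      top.count += 1;
    } else {
      tied.push({ sum: y, count: 1, maxP: p });
    }
  }

  // Pool adjacent violators: merge neighbours until block means are non-decreasing.
  const stack: Block[] = [];
  for (const block of tied) {
    stack.push({ ...block });
    while (stack.length >= 2) {
      const last = stack[stack.length - 1];
      const prev = stack[stack.length - 2];
      if (prev.sum / prev.count <= last.sum / last.count) break;
      prev.sum += last.sum;
      prev.count += last.count;
      prev.maxP = last.maxP;
      stack.pop();
    }
  }
  return {
    upperEdges: stack.map((b) => b.maxP),
    values: stack.map((b) => b.sum / b.count),
  };
}

// Step-function lookup. Confidences above the fitted range reuse the last step.
function recalibrate(model: IsotonicModel, p: number): number {
  if (model.values.length === 0) return p; // nothing fitted: leave confidence unchanged
  for (let i = 0; i < model.upperEdges.length; i++) {
    if (p <= model.upperEdges[i]) return model.values[i];
  }
  return model.values[model.values.length - 1];
}

// Toy held-out split for illustration only, not taken from results.json.
const heldOut: CalibrationPoint[] = [
  { p: 0.5, y: 0 }, { p: 0.5, y: 1 },
  { p: 0.7, y: 0 }, { p: 0.7, y: 0 }, { p: 0.7, y: 1 },
  { p: 0.9, y: 1 }, { p: 0.9, y: 1 }, { p: 0.9, y: 0 }, { p: 0.9, y: 1 },
];
const isoModel = fitIsotonic(heldOut);
console.log(isoModel, recalibrate(isoModel, 0.9));
```

Pooling ties before the merge pass matters here because stated confidences cluster on a few round values. To measure how much overconfidence is correctable, ECE would be compared before and after applying `recalibrate` on a split that was not used for fitting.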