Generative AI · Study · 2026

LLM Confidence Calibration

Are LLMs calibrated when they sound sure?

Calibration improves sharply with scale: expected calibration error (ECE) drops from 0.42 for the 1B model to 0.07 for the 70B model, which is the best-calibrated of the four at ECE 0.072. But every model is overconfident, stating 6 to 33 points more confidence than its accuracy earns.

Reliability explorer · MMLU · 300 Qs · 16 subjects · Real eval data

Each dot is a confidence bin (radius ∝ how many answers). Points below the dashed diagonal mean the model was overconfident.
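The dots described above can be derived from raw (confidence, correct) pairs roughly as follows. This is a minimal sketch, not the explorer's actual code: the names `Answer`, `ReliabilityDot` and `reliabilityDots`, the 10 equal-width bins, and the `maxRadius` scaling are all assumptions made for illustration.

```typescript
// One scored answer: stated confidence in [0, 1] and whether it was correct.
interface Answer { confidence: number; correct: 0 | 1; }

// One dot on the reliability diagram (hypothetical shape).
interface ReliabilityDot {
  meanConfidence: number; // x position
  accuracy: number;       // y position
  count: number;          // answers in the bin
  radius: number;         // proportional to count
  overconfident: boolean; // true when the dot sits below the diagonal
}

// Assumption: equal-width bins; the largest bin gets radius maxRadius.
function reliabilityDots(answers: Answer[], numBins = 10, maxRadius = 20): ReliabilityDot[] {
  const bins = Array.from({ length: numBins }, () => ({ sumP: 0, sumY: 0, count: 0 }));
  for (const { confidence, correct } of answers) {
    // Confidence 1.0 would index one past the end, so clamp it into the last bin.
    const index = Math.min(numBins - 1, Math.floor(confidence * numBins));
    const bin = bins[index];
    bin.sumP += confidence;
    bin.sumY += correct;
    bin.count += 1;
  }
  const filled = bins.filter((b) => b.count > 0);
  const largest = Math.max(...filled.map((b) => b.count));
  return filled.map((b) => {
    const meanConfidence = b.sumP / b.count;
    const accuracy = b.sumY / b.count;
    return {
      meanConfidence,
      accuracy,
      count: b.count,
      radius: maxRadius * (b.count / largest),
      overconfident: accuracy < meanConfidence,
    };
  });
}
```

Empty bins are dropped rather than drawn, so a model that only ever states a few confidence levels produces only a few dots.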

Llama 3.2 1B: ECE 0.418 · overconfidence gap +32pt · n=236 parsed (79% parse rate).

All four models: size, accuracy, mean stated confidence, overconfidence gap, ECE, and answer-parse rate, smallest to largest.
Model	Acc	Conf	Gap	ECE	Parse
Llama 3.2 1B	44%	76%	+32pt	0.418	79%
Llama 3.2 3B	49%	82%	+33pt	0.344	52%
Llama 3.1 8B	64%	81%	+17pt	0.177	52%
Llama 3.3 70B	80%	86%	+6pt	0.072	99%

Calibration improves sharply with scale: ECE 0.418 (1B) → 0.072 (70B). Every model is overconfident.
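The Gap column is simply mean stated confidence minus accuracy, in percentage points. The sketch below recomputes it from the rounded table values. The `ModelRow` type and function names are illustrative assumptions, not part of results.json. It also checks one sanity property that holds for any binned ECE computed over the same answers: ECE can never be smaller than the absolute overall gap.

```typescript
// Rows copied from the comparison table above (rounded values).
interface ModelRow { model: string; accuracyPct: number; confidencePct: number; ece: number; }

const rows: ModelRow[] = [
  { model: "Llama 3.2 1B", accuracyPct: 44, confidencePct: 76, ece: 0.418 },
  { model: "Llama 3.2 3B", accuracyPct: 49, confidencePct: 82, ece: 0.344 },
  { model: "Llama 3.1 8B", accuracyPct: 64, confidencePct: 81, ece: 0.177 },
  { model: "Llama 3.3 70B", accuracyPct: 80, confidencePct: 86, ece: 0.072 },
];

// Overconfidence gap in percentage points: positive means overconfident.
function overconfidenceGap(row: ModelRow): number {
  return row.confidencePct - row.accuracyPct;
}

// ECE is a weighted mean of per-bin |confidence - accuracy|, so it is
// bounded below by the absolute overall gap (as a fraction, not points).
function eceRespectsGapBound(row: ModelRow): boolean {
  return row.ece >= Math.abs(overconfidenceGap(row)) / 100;
}

const gaps = rows.map(overconfidenceGap); // [32, 33, 17, 6]
const allConsistent = rows.every(eceRespectsGapBound); // true
console.log(gaps, allConsistent);
```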

Honest read: parse rates vary (52% to 99%) because the smaller and mid-size models follow the strict answer format less reliably. That is both a finding and a possible selection bias, since a model is scored only on the questions it formatted. This is a single run at temperature 0, and n is bounded (300 attempted per model, fewer parsed). Calibration metrics via @kas0235/calibration.

Real eval data, nothing synthetic: the reliability bins, metrics and comparison table are read verbatim from results.json — the four Llama models' stated confidences and correctness on the same 300 MMLU questions, with calibration computed by @kas0235/calibration.

A study of whether LLMs "know when they are right." Four Llama models (1B, 3B, 8B, 70B) each answered the same 300 MMLU questions across 16 subjects and stated a confidence of 0-100% for each answer. Treating each (confidence, was-correct) pair as a (p, y) calibration point, I measured ECE, Brier score and reliability with @kas0235/calibration, my own toolkit (also behind the Calibration Lab and the KickCast study).
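For readers unfamiliar with the two metrics, here is a minimal sketch of ECE and Brier score over (p, y) points. It is not the @kas0235/calibration API, whose function names and binning scheme are not shown here. The names `CalPoint`, `expectedCalibrationError` and `brierScore` and the choice of 10 equal-width bins are assumptions.

```typescript
// One calibration point: stated confidence p in [0, 1], outcome y (1 = correct).
interface CalPoint { p: number; y: 0 | 1; }

// ECE: bin by confidence, then take the size-weighted mean of
// |mean confidence - accuracy| per bin. Assumption: equal-width bins.
function expectedCalibrationError(points: CalPoint[], numBins = 10): number {
  if (points.length === 0) return NaN;
  const sumP = new Array<number>(numBins).fill(0);
  const sumY = new Array<number>(numBins).fill(0);
  const count = new Array<number>(numBins).fill(0);
  for (const { p, y } of points) {
    const b = Math.min(numBins - 1, Math.floor(p * numBins)); // p = 1 goes in the last bin
    sumP[b] += p;
    sumY[b] += y;
    count[b] += 1;
  }
  let ece = 0;
  for (let b = 0; b < numBins; b++) {
    if (count[b] === 0) continue;
    const weight = count[b] / points.length;
    ece += weight * Math.abs(sumP[b] / count[b] - sumY[b] / count[b]);
  }
  return ece;
}

// Brier score: mean squared difference between confidence and outcome.
function brierScore(points: CalPoint[]): number {
  if (points.length === 0) return NaN;
  const total = points.reduce((acc, { p, y }) => acc + (p - y) ** 2, 0);
  return total / points.length;
}
```

ECE isolates the mismatch between stated confidence and accuracy, while the Brier score also rewards confidences that separate right answers from wrong ones, so the two can rank models differently.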

Finding: calibration improves sharply with scale. In the 4 x 300 design, ECE falls from 0.42 for the 1B model to 0.07 for the 70B model, which is the best-calibrated of the four at ECE 0.072. But every model is overconfident, with stated confidence exceeding accuracy by +6 to +33pt across the four sizes. A small model that sounds 80% sure is right closer to 45% of the time.

Caveats, stated plainly: parse rates vary (52% to 99%) because the smaller and mid-size models follow the strict answer format less reliably. That is both a finding and a possible selection bias, since a model is scored only on the questions it managed to format. This is a single run at temperature 0, and n is bounded (300 attempted per model, fewer actually parsed). The reliability diagrams and the comparison table render straight from results.json, nothing hand-typed.
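To make the parse-rate caveat concrete, the sketch below shows how a strict parser turns free-form replies into scored answers and how the parse rate falls out. The study's actual prompt and answer format are not given here, so the two-line "Answer: X / Confidence: N%" format, the regex, and the names `parseReply` and `parseRate` are purely hypothetical.

```typescript
interface ParsedReply { choice: "A" | "B" | "C" | "D"; confidence: number; }

// Hypothetical strict format: "Answer: B" on one line, "Confidence: 80%" on the next.
const STRICT_FORMAT = /^Answer:\s*([ABCD])\s*\nConfidence:\s*(\d{1,3})%?\s*$/;

// Returns null for any reply that does not match the format exactly.
function parseReply(raw: string): ParsedReply | null {
  const match = STRICT_FORMAT.exec(raw.trim());
  if (match === null) return null;
  const percent = Number(match[2]);
  if (percent > 100) return null; // out-of-range confidence counts as unparsed
  return { choice: match[1] as ParsedReply["choice"], confidence: percent / 100 };
}

// Fraction of replies that produced a scorable answer.
function parseRate(replies: string[]): number {
  if (replies.length === 0) return NaN;
  const parsed = replies.filter((reply) => parseReply(reply) !== null).length;
  return parsed / replies.length;
}
```

Every reply that returns `null` silently drops out of the accuracy and ECE calculations, which is exactly where the selection bias comes from: the metrics describe only the questions a model managed to format.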

TypeScript
NVIDIA NIM
MMLU
@kas0235/calibration
Isotonic regression

ECE, 1B -> 70B: 0.42 -> 0.07
models x MMLU Qs: 4 x 300
all overconfident: +6 to +33pt
best-calibrated (70B): ECE 0.072

What I'd improve

Next steps. First, widen beyond one family (Qwen, Mistral, GPT-class) to see whether the scale-calibration trend holds across architectures. Second, run each model several times to put error bars on ECE and the overconfidence gap, rather than relying on a single temperature-0 snapshot. Third, use constrained / structured decoding to force the answer format so the parse rate stops confounding the comparison; the 52% parse rates on the 3B and 8B models are the weakest link, since a model is currently scored only on the questions it happened to format. A natural follow-on is recalibration: fit isotonic regression or temperature scaling on a held-out split and measure how much of each model's overconfidence is correctable post hoc.
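As a rough illustration of the recalibration follow-on, the sketch below fits isotonic regression with the pool-adjacent-violators algorithm: it learns a non-decreasing map from stated confidence to observed accuracy on a held-out split. This is an assumption-laden sketch, not the toolkit's implementation. The names `Sample`, `Block`, `fitIsotonic` and `recalibrate` are hypothetical, and the fitted map is a simple step function with no interpolation.

```typescript
interface Sample { p: number; y: number; }                 // stated confidence, outcome (0 or 1)
interface Block { maxP: number; sumY: number; n: number; } // one step of the fitted function

// Fit a non-decreasing step function with pool-adjacent-violators (PAV).
function fitIsotonic(samples: Sample[]): Block[] {
  const sorted = [...samples].sort((a, b) => a.p - b.p);
  // 1) Pool tied confidences first; verbalized confidences repeat a lot (80%, 90%, ...).
  const groups: Block[] = [];
  for (const { p, y } of sorted) {
    const tail = groups[groups.length - 1];
    if (tail !== undefined && tail.maxP === p) {
      tail.sumY += y;
      tail.n += 1;
    } else {
      groups.push({ maxP: p, sumY: y, n: 1 });
    }
  }
  // 2) Merge adjacent blocks while an earlier block has a higher mean than a later one.
  const blocks: Block[] = [];
  for (const group of groups) {
    blocks.push({ ...group });
    while (blocks.length >= 2) {
      const last = blocks[blocks.length - 1];
      const prev = blocks[blocks.length - 2];
      if (prev.sumY / prev.n <= last.sumY / last.n) break;
      blocks.pop();
      prev.maxP = last.maxP;
      prev.sumY += last.sumY;
      prev.n += last.n;
    }
  }
  return blocks;
}

// Map a stated confidence to its recalibrated value (identity if nothing was fitted).
function recalibrate(blocks: Block[], p: number): number {
  if (blocks.length === 0) return p;
  for (const block of blocks) {
    if (p <= block.maxP) return block.sumY / block.n;
  }
  const last = blocks[blocks.length - 1];
  return last.sumY / last.n;
}
```

Fitting on one split and measuring ECE on another is what keeps the "correctable post hoc" number honest; fitting and evaluating on the same answers would overstate the improvement.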
