
LLM Confidence Calibration

  • TypeScript
  • NVIDIA NIM
  • MMLU
Generative AI · Study · 2026

Are LLMs calibrated when they sound sure?

Calibration improves sharply with scale: expected calibration error (ECE) drops from 0.42 to 0.07 between the 1B and 70B models, and the best-calibrated model (70B) lands at ECE 0.072. But every model is overconfident, stating 6 to 33 points more confidence than its accuracy earns.

Reliability explorer · MMLU · 300 Qs · 16 subjects · Real eval data

Each dot is a confidence bin (radius ∝ how many answers). Points below the dashed diagonal mean the model was overconfident.

Reliability diagram for Llama 3.2 1B: stated confidence (x) vs actual accuracy (y), 7 non-empty bins. ECE 0.418, overconfidence gap +32pt, n=236 parsed answers. Both axes run from 0% to 100%.

Llama 3.2 1B: ECE 0.418 · overconfidence gap +32pt · n=236 parsed (79% parse rate). Below the diagonal = stated more confidence than it earned.
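A minimal sketch of how the dots in a reliability diagram like this can be derived from raw (confidence, was-correct) pairs. It is an illustration, not the code behind the explorer: the `Point` and `Bin` types, the `reliabilityBins` name, and the choice of 10 equal-width bins are all assumptions, since the page does not state the bin count used.

```typescript
// One answer: stated confidence p in [0, 1] and whether it was correct.
interface Point { p: number; correct: boolean }

// Summary of one non-empty confidence bin (one dot in the diagram).
interface Bin {
  lo: number;              // lower edge of the bin
  hi: number;              // upper edge of the bin
  count: number;           // number of answers in the bin (drives dot radius)
  meanConfidence: number;  // x position
  accuracy: number;        // y position
  overconfident: boolean;  // true when the dot sits below the diagonal
}

// Group points into equal-width bins and summarize each non-empty one.
// numBins = 10 is an assumption made for this sketch.
function reliabilityBins(points: Point[], numBins = 10): Bin[] {
  const sums = Array.from({ length: numBins }, () => ({ count: 0, conf: 0, hits: 0 }));
  for (const { p, correct } of points) {
    // p = 1.0 would index one past the end, so clamp it into the last bin.
    const i = Math.min(numBins - 1, Math.max(0, Math.floor(p * numBins)));
    const s = sums[i]!;
    s.count += 1;
    s.conf += p;
    if (correct) s.hits += 1;
  }
  const bins: Bin[] = [];
  sums.forEach((s, i) => {
    if (s.count === 0) return; // empty bins produce no dot
    const meanConfidence = s.conf / s.count;
    const accuracy = s.hits / s.count;
    bins.push({
      lo: i / numBins,
      hi: (i + 1) / numBins,
      count: s.count,
      meanConfidence,
      accuracy,
      overconfident: meanConfidence > accuracy,
    });
  });
  return bins;
}
```

Each returned bin maps to one dot: x is `meanConfidence`, y is `accuracy`, and the radius scales with `count`.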

All four models: size, accuracy, mean stated confidence, overconfidence gap, ECE, and answer-parse rate, smallest to largest.
Model            Acc   Conf   Gap     ECE     Parse
Llama 3.2 1B     44%   76%    +32pt   0.418   79%
Llama 3.2 3B     49%   82%    +33pt   0.344   52%
Llama 3.1 8B     64%   81%    +17pt   0.177   52%
Llama 3.3 70B    80%   86%    +6pt    0.072   99%

Calibration improves sharply with scale: ECE 0.418 (1B) → 0.072 (70B). Every model is overconfident.

Honest read: parse rates vary (52%–99%) because smaller and mid-size models follow the strict answer format less reliably. That is both a finding and a possible selection bias — a model is scored only on the questions it formatted. Single run at temperature 0; n is bounded (300 attempted per model, fewer parsed). Calibration metrics via @kas0235/calibration.

Real eval data, nothing synthetic: the reliability bins, metrics and comparison table are read verbatim from results.json — the four Llama models' stated confidences and correctness on the same 300 MMLU questions, with calibration computed by @kas0235/calibration.

A study of whether LLMs "know when they are right." Four Llama models (1B, 3B, 8B, 70B) each answered the same 300 MMLU questions across 16 subjects and stated a confidence from 0 to 100% for each answer. Treating each (confidence, was-correct) pair as a (p, y) calibration point, I measured ECE, Brier score and reliability with @kas0235/calibration, my own toolkit (also behind the Calibration Lab and the KickCast study).
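For readers who want the metrics spelled out, here is a hedged sketch of ECE, Brier score and the overconfidence gap computed from (p, y) points, using their standard definitions. It does not reproduce the @kas0235/calibration API: the function names, the `Point` type and the default of 10 equal-width bins are assumptions made for illustration.

```typescript
// One calibration point: stated confidence p in [0, 1], outcome y (1 = correct).
interface Point { p: number; y: 0 | 1 }

// Brier score: mean squared difference between confidence and outcome.
function brierScore(points: Point[]): number {
  if (points.length === 0) return NaN;
  let sum = 0;
  for (const { p, y } of points) sum += (p - y) * (p - y);
  return sum / points.length;
}

// Overconfidence gap: mean stated confidence minus accuracy (positive = overconfident).
function overconfidenceGap(points: Point[]): number {
  if (points.length === 0) return NaN;
  let conf = 0;
  let hits = 0;
  for (const { p, y } of points) { conf += p; hits += y; }
  return (conf - hits) / points.length;
}

// ECE: bin-size-weighted average of |accuracy - mean confidence| per bin.
// numBins = 10 is an assumption made for this sketch.
function expectedCalibrationError(points: Point[], numBins = 10): number {
  if (points.length === 0) return NaN;
  const sums = Array.from({ length: numBins }, () => ({ count: 0, conf: 0, hits: 0 }));
  for (const { p, y } of points) {
    // Clamp so that p = 1.0 falls into the last bin.
    const i = Math.min(numBins - 1, Math.max(0, Math.floor(p * numBins)));
    const s = sums[i]!;
    s.count += 1;
    s.conf += p;
    s.hits += y;
  }
  let ece = 0;
  for (const s of sums) {
    if (s.count === 0) continue;
    const gap = Math.abs(s.hits / s.count - s.conf / s.count);
    ece += (s.count / points.length) * gap;
  }
  return ece;
}
```

ECE depends on the binning scheme, so values are only comparable between models when the same bins are used for all of them.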

Finding: calibration improves sharply with scale. Across the 4 x 300 design, ECE falls from 0.42 for the 1B model to 0.07 for the 70B model, which is the best calibrated at ECE 0.072. But every model is overconfident: stated confidence exceeds accuracy by +6 to +33pt across the four sizes. The two smallest models state roughly 80% confidence on average (76% and 82%) yet are right only 44% to 49% of the time.

Caveats, stated plainly: parse rates vary (52% to 99%) because the smaller and mid-size models follow the strict answer format less reliably. That is both a finding and a possible selection bias, since a model is scored only on the questions it managed to format. This is a single run at temperature 0, and n is bounded (300 attempted per model, fewer actually parsed). The reliability diagrams and the comparison table render straight from results.json, nothing hand-typed.
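One way to make the selection-bias caveat concrete is to bound accuracy over all attempted questions: count every unparsed answer as wrong for a lower bound and as right for an upper bound. The sketch below is an illustration under that assumption and was not part of the study; `RunCounts`, `accuracyBounds` and the counts in the usage note are hypothetical.

```typescript
// Raw counts for one model run (hypothetical shape, not the results.json schema).
interface RunCounts {
  attempted: number; // questions asked
  parsed: number;    // answers that matched the required format
  correct: number;   // parsed answers that were right
}

interface AccuracyBounds {
  parseRate: number; // parsed / attempted
  onParsed: number;  // accuracy as reported: correct / parsed
  lower: number;     // every unparsed answer counted as wrong
  upper: number;     // every unparsed answer counted as right
}

function accuracyBounds(c: RunCounts): AccuracyBounds {
  const valid =
    c.attempted > 0 &&
    c.parsed >= 0 && c.parsed <= c.attempted &&
    c.correct >= 0 && c.correct <= c.parsed;
  if (!valid) throw new RangeError("counts must satisfy 0 <= correct <= parsed <= attempted");
  const unparsed = c.attempted - c.parsed;
  return {
    parseRate: c.parsed / c.attempted,
    onParsed: c.parsed > 0 ? c.correct / c.parsed : NaN,
    lower: c.correct / c.attempted,
    upper: (c.correct + unparsed) / c.attempted,
  };
}
```

With hypothetical counts of 300 attempted, 150 parsed and 75 correct, accuracy on parsed answers is 50%, while the bounds over all attempted questions span 25% to 75%. The lower the parse rate, the wider that interval becomes.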

  • TypeScript
  • NVIDIA NIM
  • MMLU
  • @kas0235/calibration
  • Isotonic regression
ECE, 1B -> 70B: 0.42 -> 0.07
Models x MMLU Qs: 4 x 300
All overconfident: +6 to +33pt
Best-calibrated (70B): ECE 0.072
Source: results.json

What I'd improve

Next steps, in order of priority. First, widen beyond one family (Qwen, Mistral, GPT-class) to see whether the scale-calibration trend holds across architectures. Second, run each model multiple times to put variance bars on ECE and the overconfidence gap, rather than relying on a single temperature-0 snapshot. Third, use constrained or structured decoding to force the answer format so the parse rate stops confounding the comparison. The 52% parse rates on the 3B and 8B models are the weakest link, since a model is currently scored only on the questions it happened to format. A natural follow-on is recalibration: fit isotonic regression or temperature scaling on a held-out split and measure how much of each model's overconfidence is correctable post hoc.
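As a rough sketch of the recalibration follow-on, isotonic regression can be fit with the pool-adjacent-violators algorithm: sort the held-out points by stated confidence, then merge neighbouring groups whenever their accuracy would decrease as confidence rises. This is an illustration of the technique only, not existing project code; `fitIsotonic`, `recalibrate` and the `Block` type are hypothetical names.

```typescript
// Held-out calibration point: stated confidence p, outcome y (1 = correct).
interface Point { p: number; y: number }

// One step of the fitted function: applies from minP up to the next block's minP.
interface Block { minP: number; sum: number; n: number }

// Fit a non-decreasing step function from confidence to observed accuracy.
function fitIsotonic(points: Point[]): Block[] {
  const sorted = [...points].sort((a, b) => a.p - b.p);

  // Pass 1: collapse points that share the same confidence into one block.
  const tied: Block[] = [];
  for (const { p, y } of sorted) {
    const top = tied[tied.length - 1];
    if (top !== undefined && top.minP === p) {
      top.sum += y;
      top.n += 1;
    } else {
      tied.push({ minP: p, sum: y, n: 1 });
    }
  }

  // Pass 2: pool adjacent violators until block means are non-decreasing.
  const pooled: Block[] = [];
  for (const b of tied) {
    let cur: Block = { ...b };
    let prev: Block | undefined = pooled[pooled.length - 1];
    while (prev !== undefined && prev.sum / prev.n > cur.sum / cur.n) {
      pooled.pop();
      cur = { minP: prev.minP, sum: prev.sum + cur.sum, n: prev.n + cur.n };
      prev = pooled[pooled.length - 1];
    }
    pooled.push(cur);
  }
  return pooled;
}

// Map a stated confidence to its recalibrated value.
function recalibrate(model: Block[], p: number): number {
  if (model.length === 0) return p; // nothing fitted: pass the input through
  let chosen = model[0]!;           // below the first block, reuse its value
  for (const b of model) {
    if (b.minP <= p) chosen = b;
    else break;
  }
  return chosen.sum / chosen.n;
}
```

Fitting on one split and then recomputing ECE on a different split is what would show how much of the overconfidence is correctable post hoc. Evaluating on the same points used for fitting would overstate the improvement.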


© 2026 Karim Semaan. Built with Next.js, Tailwind & Supabase.