
Calibration Lab

Probability calibration you can run in your browser.

Paste a set of probabilistic predictions, or load a synthetic sample, and the lab scores how well calibrated they are (log-loss, Brier score, expected calibration error, accuracy), draws the reliability curve, and shows the effect of Platt scaling and isotonic recalibration. Everything runs client-side: the math is a small, pure TypeScript library, the computation is deterministic and costs nothing to run, and no data leaves your browser; the lab makes no network requests.

See the real study → KickCast calibration study

This lab is a sandbox over synthetic data. The measured companion applies the same isotonic recalibration to the real KickCast model on the 2022 World Cup holdout, with every number traced to the saved model artifact.
Load a synthetic sample

All three samples are synthetic, illustrative datasets generated with a fixed seed. Not real model output.
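A fixed seed is what makes a synthetic sample reproducible. The sketch below shows one way to get that in TypeScript, assuming a mulberry32-style generator and a toy overconfident model. The names `mulberry32`, `syntheticSample`, and `sharpen` are illustrative and are not taken from the lab's code.

```typescript
type Row = { p: number; y: 0 | 1 };

// Small seeded PRNG (mulberry32 style): the same seed always yields the same stream.
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // uniform in [0, 1)
  };
}

// Hypothetical generator: outcomes follow the true probability q, but the
// reported p scales the log-odds by `sharpen`, so p is overconfident for sharpen > 1.
function syntheticSample(n: number, seed: number, sharpen = 2): Row[] {
  const rand = mulberry32(seed);
  const rows: Row[] = [];
  for (let i = 0; i < n; i++) {
    const q = rand();
    const y: 0 | 1 = rand() < q ? 1 : 0;
    const qc = Math.min(Math.max(q, 1e-6), 1 - 1e-6); // keep the log-odds finite
    const logOdds = Math.log(qc / (1 - qc));
    const p = 1 / (1 + Math.exp(-sharpen * logOdds));
    rows.push({ p, y });
  }
  return rows;
}
```

Calling `syntheticSample(300, 42)` twice returns identical rows, which is the property a fixed seed is meant to guarantee.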

p is the predicted probability in [0, 1]; y is the binary outcome (0 or 1). A header row is tolerated, and values may be separated by commas or whitespace.
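The input rules above can be sketched as a small parser. This is a minimal illustration, assuming that a header row is simply treated as one more line that fails validation; `parseRows` and its return shape are hypothetical names, not the lab's API.

```typescript
type Row = { p: number; y: 0 | 1 };

// Parse "p y" or "p,y" lines. Lines that do not validate (including a header) are counted as skipped.
function parseRows(text: string): { rows: Row[]; skipped: number } {
  const rows: Row[] = [];
  let skipped = 0;
  for (const line of text.split(/\r?\n/)) {
    const trimmed = line.trim();
    if (trimmed === "") continue; // blank lines are ignored, not counted
    const fields = trimmed.split(/[\s,]+/).filter((f) => f !== "");
    const p = Number(fields[0]);
    const y = Number(fields[1]);
    const valid =
      fields.length === 2 &&
      Number.isFinite(p) && p >= 0 && p <= 1 &&
      (y === 0 || y === 1);
    if (valid) rows.push({ p, y: y as 0 | 1 });
    else skipped++;
  }
  return { rows, skipped };
}
```

For example, `parseRows("p,y\n0.9,1\n0.2 0")` keeps the two data rows and reports the header as one skipped line.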

The file is read in your browser with FileReader and never uploaded.

300 valid rows parsed
Recalibration method

Both methods are fit on the same rows and applied back to them in your browser. In-sample recalibration shows the mechanism; an honest estimate of generalization needs a held-out split.
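As a rough sketch of the two methods, the code below fits Platt scaling by plain gradient descent on log-loss and isotonic recalibration with the pool-adjacent-violators algorithm. It makes several assumptions that the page does not confirm: Platt scaling is applied to the log-odds of p, the optimizer settings (`steps`, `lr`) and the clamp `eps` are arbitrary, and the isotonic fit is evaluated as a step function. All function names are hypothetical.

```typescript
type Row = { p: number; y: 0 | 1 };
type Calibrator = (p: number) => number;
type Block = { maxP: number; sum: number; n: number };

const sigmoid = (z: number): number => 1 / (1 + Math.exp(-z));
// Clamp before taking log-odds so p = 0 or p = 1 stays finite (eps is an assumption).
const logit = (p: number, eps = 1e-6): number => {
  const q = Math.min(Math.max(p, eps), 1 - eps);
  return Math.log(q / (1 - q));
};

// Platt scaling: fit sigmoid(a * logit(p) + b) by gradient descent on mean log-loss.
function fitPlatt(rows: Row[], steps = 5000, lr = 0.01): Calibrator {
  if (rows.length === 0) return (p) => p;
  const data = rows.map((r) => ({ x: logit(r.p), y: r.y }));
  let a = 1;
  let b = 0; // (a, b) = (1, 0) is the identity map
  for (let s = 0; s < steps; s++) {
    let ga = 0;
    let gb = 0;
    for (const { x, y } of data) {
      const err = sigmoid(a * x + b) - y; // gradient of log-loss w.r.t. the linear term
      ga += err * x;
      gb += err;
    }
    a -= (lr * ga) / data.length;
    b -= (lr * gb) / data.length;
  }
  return (p) => sigmoid(a * logit(p) + b);
}

// Isotonic recalibration via pool-adjacent-violators (PAV).
function fitIsotonic(rows: Row[]): Calibrator {
  // Sort by p; within ties put y = 1 first so tied rows with mixed outcomes get pooled.
  const sorted = [...rows].sort((r1, r2) => r1.p - r2.p || r2.y - r1.y);
  const blocks: Block[] = [];
  for (const { p, y } of sorted) {
    let cur: Block = { maxP: p, sum: y, n: 1 };
    // Merge backwards while the previous block's mean exceeds the current one's.
    while (blocks.length > 0) {
      const prev = blocks[blocks.length - 1]!;
      if (prev.sum / prev.n <= cur.sum / cur.n) break;
      blocks.pop();
      cur = { maxP: cur.maxP, sum: prev.sum + cur.sum, n: prev.n + cur.n };
    }
    blocks.push(cur);
  }
  // Step-function lookup: first block whose upper bound covers p, else the last block.
  return (p) => {
    const hit = blocks.find((blk) => p <= blk.maxP) ?? blocks[blocks.length - 1];
    return hit ? hit.sum / hit.n : p;
  };
}
```

Both functions return a calibrator, so fitting on one set of rows and applying the result to a different, held-out set is a one-line change from the in-sample case.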

Metrics

Log-loss
0.5461
Brier
0.1741
ECE
0.1180
Accuracy
0.7767
n
300
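For reference, three of the metrics above follow standard definitions and can be sketched as below. The clipping constant `eps` and the 0.5 decision threshold are assumptions, and the function names are illustrative rather than the lab's own.

```typescript
type Row = { p: number; y: 0 | 1 };

// Mean negative log-likelihood; p is clipped so log(0) never occurs.
function logLoss(rows: Row[], eps = 1e-15): number {
  let total = 0;
  for (const { p, y } of rows) {
    const q = Math.min(Math.max(p, eps), 1 - eps);
    total -= y === 1 ? Math.log(q) : Math.log(1 - q);
  }
  return total / rows.length;
}

// Brier score: mean squared difference between prediction and outcome.
function brierScore(rows: Row[]): number {
  return rows.reduce((sum, { p, y }) => sum + (p - y) ** 2, 0) / rows.length;
}

// Accuracy: share of rows where thresholding p reproduces y.
function accuracy(rows: Row[], threshold = 0.5): number {
  const correct = rows.filter(({ p, y }) => (p >= threshold ? 1 : 0) === y).length;
  return correct / rows.length;
}
```

Lower is better for log-loss and the Brier score; higher is better for accuracy.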

Reliability diagram

Each point is a probability bin plotted at its mean predicted probability (x) against its observed frequency (y). A point on the dashed diagonal is perfectly calibrated; a point above it means the model under-predicts, and a point below it means the model over-predicts.
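The binning behind the diagram, and the expected calibration error derived from it, can be sketched as follows. This assumes ten equal-width bins and the common definition of ECE as the count-weighted mean gap between observed frequency and mean predicted probability; `reliabilityBins` and `expectedCalibrationError` are hypothetical names.

```typescript
type Row = { p: number; y: 0 | 1 };
type Bin = { lo: number; hi: number; n: number; meanP: number; freq: number };

// Equal-width bins over [0, 1]; p = 1 is assigned to the last bin.
function reliabilityBins(rows: Row[], nBins = 10): Bin[] {
  const binOf = (p: number): number => Math.min(Math.floor(p * nBins), nBins - 1);
  return Array.from({ length: nBins }, (_, i) => {
    const inBin = rows.filter((r) => binOf(r.p) === i);
    const n = inBin.length;
    return {
      lo: i / nBins,
      hi: (i + 1) / nBins,
      n,
      // Empty bins have no point on the diagram, so their coordinates are NaN.
      meanP: n > 0 ? inBin.reduce((s, r) => s + r.p, 0) / n : NaN,
      freq: n > 0 ? inBin.reduce((s, r) => s + r.y, 0) / n : NaN,
    };
  });
}

// ECE: sum over non-empty bins of (n_bin / N) * |observed frequency - mean predicted|.
function expectedCalibrationError(rows: Row[], nBins = 10): number {
  return reliabilityBins(rows, nBins)
    .filter((b) => b.n > 0)
    .reduce((ece, b) => ece + (b.n / rows.length) * Math.abs(b.freq - b.meanP), 0);
}
```

Each non-empty bin contributes one point `(meanP, freq)` to the diagram, and its vertical distance from the diagonal is exactly the gap that ECE averages.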

[Reliability diagram. x-axis: mean predicted probability (0.00 to 1.00); y-axis: observed frequency (0.00 to 1.00). Legend: Raw model; Perfect calibration (y = x).]

Confidence distribution

How the predicted probabilities spread across the bins. A sharp model piles up near 0 and 1; a hedging model bunches around 0.5.

  • 0.0–0.1: n=72 · 24.0%
  • 0.1–0.2: n=40 · 13.3%
  • 0.2–0.3: n=19 · 6.3%
  • 0.3–0.4: n=7 · 2.3%
  • 0.4–0.5: n=0 · 0.0%
  • 0.5–0.6: n=0 · 0.0%
  • 0.6–0.7: n=6 · 2.0%
  • 0.7–0.8: n=26 · 8.7%
  • 0.8–0.9: n=42 · 14.0%
  • 0.9–1.0: n=88 · 29.3%


© 2026 Karim Semaan. Built with Next.js, Tailwind & Supabase.