Calibration Lab

Probability calibration you can run in your browser.

Paste a set of probabilistic predictions, or load a synthetic sample, and the lab scores how well calibrated they are (log-loss, Brier score, expected calibration error, accuracy), draws the reliability curve, and shows the effect of Platt scaling and isotonic recalibration. Everything runs client-side: the math is a small pure TypeScript library, the computation is deterministic and zero-cost, and no network request is made, so no data ever leaves your browser.

See the real study → KickCast calibration study

This lab is a sandbox over synthetic data. The measured companion applies the same isotonic recalibration to the real KickCast model on the 2022 World Cup holdout, with every number traced to the saved model artifact.

Load a synthetic sample

All three samples are synthetic, illustrative datasets generated with a fixed seed. Not real model output.
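As a rough illustration of what "generated with a fixed seed" can mean, the sketch below builds a reproducible overconfident sample. It is not the lab's actual generator: the function names, the choice of the mulberry32 PRNG, and the distortion formula are all assumptions made for this example.

```typescript
// Hypothetical seeded generator; not the lab's actual sampling code.
interface Row { p: number; y: 0 | 1; }

// mulberry32: a small, widely used 32-bit PRNG. Same seed, same sequence.
function mulberry32(seed: number): () => number {
  let a = seed | 0;
  return () => {
    a = (a + 0x6d2b79f5) | 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296; // uniform in [0, 1)
  };
}

// Draw a true probability q, sample the outcome from it, then report a
// prediction pushed toward 0 or 1 (an assumed model of overconfidence).
function makeOverconfidentSample(n: number, seed: number): Row[] {
  const rand = mulberry32(seed);
  const rows: Row[] = [];
  for (let i = 0; i < n; i++) {
    const q = rand();
    const y: 0 | 1 = rand() < q ? 1 : 0;
    const p = (q * q) / (q * q + (1 - q) * (1 - q));
    rows.push({ p, y });
  }
  return rows;
}
```

Calling `makeOverconfidentSample(300, 42)` twice returns identical rows, which is the property that makes a seeded sample reproducible.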

Predictions · one "p,y" per line

p,y
0.0811,1
0.8155,1
0.7157,1
0.8768,1
0.9954,1
0.1118,0
0.9352,1
0.0971,0
0.0712,1
0.1385,0
0.9079,0
0.1182,0
0.9975,1
0.1240,1
0.0056,0
0.2181,0
0.0172,0
0.7197,1
0.9525,1
0.8895,1
0.2272,1
0.8665,1
0.9740,1
0.2586,0
0.9676,1
0.7948,0
0.2302,0
0.0137,0
0.9833,1
0.9131,1
0.1352,0
0.8010,1
0.9157,1
0.2456,0
0.9035,1
0.9842,1
0.8905,1
0.9994,1
0.9423,1
0.9025,0
0.0780,0
0.0919,0
0.0417,0
0.0243,0
0.2478,0
0.0593,0
0.0791,0
0.0120,0
0.7976,1
0.0097,0
0.8004,1
0.0481,0
0.0531,1
0.0871,0
0.8739,0
0.1697,0
0.9766,1
0.2384,0
0.9663,1
0.2226,1
0.9309,1
0.7905,1
0.9647,1
0.1886,1
0.9492,0
0.9157,0
0.0010,0
0.1093,0
0.0687,0
0.3500,0
0.1963,1
0.7680,0
0.9511,1
0.9280,1
0.7899,0
0.8671,1
0.0896,0
0.9171,1
0.1565,0
0.3074,0
0.9906,1
0.8636,0
0.9864,0
0.6474,1
0.1035,0
0.8832,1
0.1689,0
0.8245,0
0.0250,0
0.0656,0
0.9085,0
0.9630,1
0.8870,1
0.8251,1
0.0311,0
0.0570,0
0.0814,0
0.9092,1
0.7778,0
0.2237,1
0.0710,0
0.9380,1
0.8379,1
0.1291,0
0.9117,1
0.9332,1
0.7767,1
0.9689,1
0.8262,1
0.0494,0
0.0429,0
0.6428,1
0.8860,1
0.9537,1
0.2185,0
0.8905,0
0.9968,1
0.9837,1
0.8259,1
0.9863,1
0.0675,0
0.9257,1
0.2438,1
0.1497,1
0.8251,1
0.0796,0
0.1632,1
0.6943,1
0.8508,1
0.0405,0
0.9902,1
0.9582,1
0.0293,0
0.9661,1
0.1503,0
0.1114,0
0.9276,1
0.0299,0
0.8371,0
0.9938,1
0.0663,0
0.0265,0
0.9611,1
0.7957,0
0.2639,0
0.0039,0
0.9785,1
0.0785,0
0.0626,0
0.0244,1
0.1995,1
0.8092,0
0.6636,0
0.9655,1
0.7781,0
0.7028,1
0.0071,1
0.1216,0
0.7839,1
0.9314,1
0.1676,1
0.1456,0
0.0208,0
0.9280,1
0.9480,0
0.9145,1
0.9590,1
0.8293,1
0.0851,1
0.9974,1
0.9763,1
0.8735,1
0.2565,1
0.0109,0
0.7857,1
0.7134,0
0.9195,1
0.0439,0
0.9312,1
0.0448,0
0.7855,1
0.0372,0
0.3486,1
0.9133,0
0.1475,0
0.3237,0
0.9442,1
0.0266,0
0.0433,0
0.1378,1
0.9585,1
0.7692,1
0.1135,0
0.8681,1
0.9176,1
0.9947,1
0.1768,0
0.8056,1
0.9190,1
0.9955,1
0.3666,1
0.8964,1
0.9825,1
0.9653,1
0.9868,1
0.0192,1
0.8946,0
0.9136,1
0.7516,1
0.0380,0
0.0242,0
0.0539,0
0.2028,0
0.2575,1
0.0819,1
0.8725,1
0.8869,1
0.0095,0
0.2025,0
0.0165,0
0.9731,1
0.9800,1
0.0353,0
0.0039,0
0.2134,1
0.0435,0
0.9681,1
0.1801,0
0.1165,1
0.9158,0
0.1515,1
0.8416,1
0.0738,0
0.8481,1
0.9985,1
0.2950,1
0.0038,0
0.7381,1
0.9061,1
0.9731,1
0.6821,1
0.1233,0
0.1625,0
0.7635,0
0.1928,1
0.0254,0
0.9893,1
0.8826,1
0.1730,1
0.0759,0
0.8820,0
0.7775,0
0.8079,1
0.7989,1
0.0518,0
0.8134,1
0.0518,0
0.9875,1
0.9193,0
0.1075,1
0.0423,0
0.9964,1
0.0960,0
0.8630,1
0.6494,1
0.9786,1
0.7556,1
0.1327,0
0.0040,0
0.1698,1
0.9357,1
0.1646,1
0.9575,1
0.7635,1
0.0426,0
0.7498,0
0.8142,1
0.9394,0
0.9460,1
0.9862,1
0.3307,0
0.9224,1
0.0160,0
0.8001,1
0.0791,0
0.1640,1
0.9970,1
0.8695,1
0.1451,0
0.1201,0
0.9952,1
0.1125,0
0.0989,1
0.3216,0
0.9019,1
0.7900,0
0.8002,1
0.0395,1
0.2115,0
0.9864,1

p is the predicted probability in [0, 1]; y is the binary outcome (0 or 1). A header row is tolerated; commas or whitespace both work.
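The input rules above could be implemented roughly as follows. This is a minimal sketch, not the lab's own parser: `parsePredictions` and the `Row` type are hypothetical names, and silently skipping invalid lines (rather than reporting them) is an assumption.

```typescript
// Hypothetical parser for the "p,y" format described above.
interface Row { p: number; y: 0 | 1; }

function parsePredictions(text: string): Row[] {
  const rows: Row[] = [];
  for (const line of text.split(/\r?\n/)) {
    // Accept commas or whitespace as the separator.
    const parts = line.trim().split(/[\s,]+/).filter((s) => s.length > 0);
    if (parts.length !== 2) continue; // blank or malformed line
    const p = Number(parts[0]);
    const yRaw = Number(parts[1]);
    // A header row such as "p,y" parses to NaN and is skipped here.
    if (!Number.isFinite(p) || p < 0 || p > 1) continue;
    if (yRaw !== 0 && yRaw !== 1) continue;
    const y: 0 | 1 = yRaw === 1 ? 1 : 0;
    rows.push({ p, y });
  }
  return rows;
}
```

The length of the returned array corresponds to the count of valid rows reported after parsing.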

Or load a local file (.csv / .txt)

The file is read in your browser with FileReader and never uploaded.

300 valid rows parsed

Recalibration method

Both methods are fit on the same rows and applied back to them in your browser. In-sample recalibration shows the mechanism; an honest estimate of generalization needs a held-out split.
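To make the two methods concrete, here is one way they could be written. This is a sketch under stated assumptions, not the lab's library: Platt scaling is fit here as a two-parameter logistic map on logit(p) using plain gradient descent (a simplification of the usual Newton-style fit), and isotonic recalibration uses the pool-adjacent-violators algorithm with a step-function lookup. All function names, the clipping epsilon, the learning rate, and the iteration count are illustrative choices.

```typescript
interface Row { p: number; y: number; }

const clip = (p: number, eps = 1e-6): number => Math.min(1 - eps, Math.max(eps, p));
const logit = (p: number): number => Math.log(clip(p) / (1 - clip(p)));
const sigmoid = (z: number): number => 1 / (1 + Math.exp(-z));

// Platt-style scaling (assumed form): p' = sigmoid(a * logit(p) + b).
function fitPlatt(rows: Row[], iters = 5000, lr = 0.01): (p: number) => number {
  if (rows.length === 0) return (p: number) => p;
  let a = 1; // a = 1, b = 0 is the identity map, so we start from the raw model
  let b = 0;
  for (let i = 0; i < iters; i++) {
    let ga = 0;
    let gb = 0;
    for (const { p, y } of rows) {
      const x = logit(p);
      const err = sigmoid(a * x + b) - y; // gradient of log-loss w.r.t. the logit
      ga += err * x;
      gb += err;
    }
    a -= (lr * ga) / rows.length;
    b -= (lr * gb) / rows.length;
  }
  return (p: number) => sigmoid(a * logit(p) + b);
}

// Isotonic recalibration via pool-adjacent-violators (PAV).
function fitIsotonic(rows: Row[]): (p: number) => number {
  if (rows.length === 0) return (p: number) => p;
  // Sort by p; within ties put y = 1 first so tied points always pool together.
  const sorted = [...rows].sort((r, s) => r.p - s.p || s.y - r.y);
  const blocks: { sum: number; n: number; maxP: number }[] = [];
  for (const { p, y } of sorted) {
    blocks.push({ sum: y, n: 1, maxP: p });
    // Merge backwards while adjacent block means violate monotonicity.
    while (blocks.length > 1) {
      const last = blocks[blocks.length - 1];
      const prev = blocks[blocks.length - 2];
      if (prev.sum / prev.n <= last.sum / last.n) break;
      blocks.pop();
      prev.sum += last.sum;
      prev.n += last.n;
      prev.maxP = last.maxP;
    }
  }
  // Step function: the first block whose upper edge covers p, else the last block.
  return (p: number) => {
    const hit = blocks.find((blk) => p <= blk.maxP) ?? blocks[blocks.length - 1];
    return hit.sum / hit.n;
  };
}
```

Both functions return a map from a raw probability to a recalibrated one, so the same fitted map can be applied back to the training rows (as the lab does) or to a held-out split.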

Metrics

Log-loss: 0.5461
Brier: 0.1741
ECE: 0.1180
Accuracy: 0.7767
n: 300
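For reference, the four metrics can be computed along the following lines. This is a sketch, not the lab's code: the clipping epsilon for log-loss, the 0.5 threshold for accuracy, and the 10 equal-width bins for ECE are assumptions, so the figures above would only be reproduced if the lab uses the same conventions.

```typescript
interface Row { p: number; y: number; }

// Mean negative log-likelihood; p is clipped to avoid log(0).
function logLoss(rows: Row[], eps = 1e-15): number {
  let sum = 0;
  for (const { p, y } of rows) {
    const q = Math.min(1 - eps, Math.max(eps, p));
    sum -= y * Math.log(q) + (1 - y) * Math.log(1 - q);
  }
  return sum / rows.length;
}

// Mean squared error between predicted probability and outcome.
function brier(rows: Row[]): number {
  let sum = 0;
  for (const { p, y } of rows) sum += (p - y) * (p - y);
  return sum / rows.length;
}

// Fraction of rows where thresholding p gives the observed label.
function accuracy(rows: Row[], threshold = 0.5): number {
  let hits = 0;
  for (const { p, y } of rows) if ((p >= threshold ? 1 : 0) === y) hits++;
  return hits / rows.length;
}

// Expected calibration error: bin-size-weighted mean of
// |observed frequency - mean predicted probability| over equal-width bins.
function ece(rows: Row[], nBins = 10): number {
  const sumP = new Array<number>(nBins).fill(0);
  const sumY = new Array<number>(nBins).fill(0);
  const count = new Array<number>(nBins).fill(0);
  for (const { p, y } of rows) {
    const b = Math.min(nBins - 1, Math.floor(p * nBins)); // p = 1 goes in the top bin
    sumP[b] += p;
    sumY[b] += y;
    count[b] += 1;
  }
  let total = 0;
  for (let b = 0; b < nBins; b++) {
    if (count[b] === 0) continue;
    total += (count[b] / rows.length) * Math.abs(sumY[b] / count[b] - sumP[b] / count[b]);
  }
  return total;
}
```

Log-loss and Brier are both lower-is-better and reward sharp, correct predictions; ECE isolates the calibration gap and ignores how well the model separates the classes.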

Reliability diagram

Each point is a probability bin plotted at its mean predicted probability (x) against its observed frequency (y). A point on the dashed diagonal is perfectly calibrated; above the diagonal the model under-predicts, and below it the model over-predicts.
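The points of such a diagram could be derived as in the sketch below. The names `reliabilityBins` and `BinPoint`, the 10 equal-width bins, and the tolerance used to call a bin "calibrated" are assumptions for illustration rather than the lab's implementation.

```typescript
interface Row { p: number; y: number; }

interface BinPoint {
  lo: number;    // lower bin edge
  hi: number;    // upper bin edge
  n: number;     // rows in the bin
  meanP: number; // x: mean predicted probability
  freq: number;  // y: observed frequency of outcome 1
  verdict: "under" | "over" | "calibrated";
}

function reliabilityBins(rows: Row[], nBins = 10, tol = 1e-9): BinPoint[] {
  const sumP = new Array<number>(nBins).fill(0);
  const sumY = new Array<number>(nBins).fill(0);
  const count = new Array<number>(nBins).fill(0);
  for (const { p, y } of rows) {
    const b = Math.min(nBins - 1, Math.floor(p * nBins)); // p = 1 goes in the top bin
    sumP[b] += p;
    sumY[b] += y;
    count[b] += 1;
  }
  const points: BinPoint[] = [];
  for (let b = 0; b < nBins; b++) {
    if (count[b] === 0) continue; // an empty bin contributes no point
    const meanP = sumP[b] / count[b];
    const freq = sumY[b] / count[b];
    const gap = freq - meanP; // positive: the point sits above the diagonal
    points.push({
      lo: b / nBins,
      hi: (b + 1) / nBins,
      n: count[b],
      meanP,
      freq,
      verdict: Math.abs(gap) <= tol ? "calibrated" : gap > 0 ? "under" : "over",
    });
  }
  return points;
}
```

The per-bin `n` values are the same counts a confidence histogram would display, so one pass over the data can feed both charts.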

Legend: Raw model · Perfect calibration (y = x)

Confidence distribution

How the predicted probabilities spread across the bins. A sharp model piles up near 0 and 1; a hedging model bunches around 0.5.

0.0–0.1: n=72 · 24.0%
0.1–0.2: n=40 · 13.3%
0.2–0.3: n=19 · 6.3%
0.3–0.4: n=7 · 2.3%
0.4–0.5: n=0 · 0.0%
0.5–0.6: n=0 · 0.0%
0.6–0.7: n=6 · 2.0%
0.7–0.8: n=26 · 8.7%
0.8–0.9: n=42 · 14.0%
0.9–1.0: n=88 · 29.3%
