Serving

How these systems serve in production.

Eval-time quality lives on the /measured page. This page is the other half a systems reviewer asks for: the model behind each system, how it runs, and what it costs. Kirby is measured live; Bastion's facts come from its engineering docs, with production telemetry kept client-confidential; KickCast serves precomputed forecasts.

Kirby (portfolio chat)

Measured

grounded assistant, guard-first, served on NVIDIA NIM

LangGraph guard to answer · NVIDIA NIM · Next.js route handler on Vercel

Kirby runs a two-node graph: a deterministic guard classifies every turn and refuses injection or out-of-scope prompts before any token is generated, so the model runs only on legitimate, in-scope questions. Everything below is measured from the live endpoint, except the cost figure, which is labeled as an estimate.
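
The guard-then-answer flow can be sketched without any framework. The snippet below is a minimal illustration, not Kirby's actual LangGraph code: the pattern lists, the `classifyTurn` and `handleTurn` names, and the refusal strings are all hypothetical. It shows only the key property, that the model callback is never invoked for a refused turn.

```typescript
// Hypothetical sketch of a guard-first, two-node flow. The patterns and
// names below are illustrative assumptions, not Kirby's real rules.
type Verdict = "answer" | "refuse_injection" | "refuse_out_of_scope";

// Assumed example patterns; a real guard would carry a reviewed rule set.
const INJECTION_PATTERNS: RegExp[] = [
  /ignore (all |any )?(previous|prior) instructions/i,
  /reveal (your )?system prompt/i,
];
const IN_SCOPE_PATTERNS: RegExp[] = [
  /project|portfolio|experience|kirby|bastion|kickcast/i,
];

// Node 1: deterministic classification, no model call involved.
function classifyTurn(message: string): Verdict {
  for (const pattern of INJECTION_PATTERNS) {
    if (pattern.test(message)) return "refuse_injection";
  }
  for (const pattern of IN_SCOPE_PATTERNS) {
    if (pattern.test(message)) return "answer";
  }
  return "refuse_out_of_scope";
}

// Node 2 (the model call) runs only when the guard returns "answer".
function handleTurn(message: string, callModel: (m: string) => string): string {
  const verdict = classifyTurn(message);
  if (verdict === "refuse_injection") return "I can't help with that request.";
  if (verdict === "refuse_out_of_scope") return "I only answer questions about this portfolio.";
  return callModel(message);
}
```

Because the guard is a pure function of the message, a refusal costs no tokens and can be unit-tested without a provider key.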

Model: meta/llama-3.1-8b-instruct (NVIDIA NIM)
Injected into the system prompt at request time and asserted by the model-self-ID eval cases.

TTFT p50 / p95: 621 ms / 1040 ms
From the published Kirby eval run (every case rendered at /measured), measured client-side over the prod endpoint, so network-inclusive.

Guard-rejection latency: ~0.2 s
The guard short-circuits before the LLM, so injection and out-of-scope prompts are refused in roughly 180 to 250 ms (panel transcripts).

Per-turn tokens: ~10K in + ~54 out
Each turn injects the grounded corpus (~37KB serialized today, 40KB budget, ~10K tokens) as system context; the average answer is ~54 tokens, measured over the 30-case golden suite.

Cost per turn: ~$0.001 (est.)
Back-of-envelope: ~10K input + ~54 output tokens at roughly $0.10 per 1M tokens (a typical small-model serverless rate) is about a tenth of a cent per turn, dominated by the corpus. Labeled est.; no separate dollar telemetry is published.

Rate limit: 10 requests / 5 min per IP
Anti-abuse on a public, token-spending endpoint; a 429 carries a Retry-After countdown the chat dock renders.

Failure handling: 1 pre-token retry; honest mid-stream notice
A transient provider failure before the first token triggers one silent retry, so the visitor sees a clean stream; if the stream dies mid-reply the partial answer gets an honest 'Connection dropped, ask me again' line instead of a silent truncation. Asserted in tests/unit/chat-guard.test.ts.

Observability: Vercel function logs; key never logged
Server-side logging via Vercel; the NVIDIA_API_KEY is never written to logs, even on failure paths (asserted in tests).
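
The rate-limit row above can be illustrated with a small in-memory limiter. This is a sketch under assumptions: the source states only the policy (10 requests per 5 minutes per IP) and that a 429 carries Retry-After. The sliding-window algorithm, the in-memory store, and the `checkRateLimit` name are hypothetical, and a serverless deployment would likely need a shared store rather than process memory.

```typescript
// Hypothetical sliding-window limiter: 10 requests per 5 minutes per IP.
// The policy numbers come from the text; the algorithm is an assumption.
const WINDOW_MS = 5 * 60 * 1000;
const MAX_REQUESTS = 10;

// Timestamps (ms) of recent requests, keyed by client IP.
const recentRequests: Record<string, number[]> = {};

interface RateDecision {
  allowed: boolean;
  status: number;            // 200 when allowed, 429 when limited
  retryAfterSeconds: number; // 0 when allowed
}

function checkRateLimit(ip: string, nowMs: number): RateDecision {
  const windowStart = nowMs - WINDOW_MS;
  // Keep only the requests that still fall inside the window.
  const recent = (recentRequests[ip] || []).filter((t) => t > windowStart);

  if (recent.length >= MAX_REQUESTS) {
    recentRequests[ip] = recent;
    // The oldest request in the window decides when a slot frees up.
    const retryAfterMs = recent[0] + WINDOW_MS - nowMs;
    return {
      allowed: false,
      status: 429,
      retryAfterSeconds: Math.ceil(retryAfterMs / 1000),
    };
  }

  recent.push(nowMs);
  recentRequests[ip] = recent;
  return { allowed: true, status: 200, retryAfterSeconds: 0 };
}
```

A route handler built on this would answer a limited request with status 429 and set the Retry-After header to `retryAfterSeconds`, which is the value a client countdown can render.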

Bastion

Measured

evidence-to-findings security analysis (protected client work)

Next.js · Supabase · Anthropic Claude API (enterprise, no-training)

Bastion turns uploaded evidence into NIST CSF 2.0 gap findings through a multi-pass Claude pipeline. Its eval-time quality is published on /measured (10/20 strict, 83.1% recall, 95.6% citation validity); the serving facts below come from the project's own engineering docs. Per-engagement latency and throughput run against real client evidence and stay client-confidential.

Models: Claude Sonnet 4 + Claude Haiku 4.5
Sonnet for deep analysis, Haiku for triage and a verification pass (quote check, coverage, missed controls). From Bastion's project docs.

Pipeline: triage to analysis to verify to persist
A multi-pass design: Haiku triage and discovery, a calibrated Sonnet analysis, a Haiku verification pass, then persistence with feedback capture. The multi-pass design and the retrieval step are corroborated by the /measured Bastion note.

Cost per document: ~$0.04 to $0.07 (est.)
Bastion's own pipeline cost model for one evidence document across the Haiku and Sonnet passes; a design estimate, not measured production telemetry.

Data handling: no-training enterprise API
Anthropic enterprise API with no-training guarantees, so client evidence is never used to train models.

Production telemetry: client-confidential
Per-engagement latency, throughput, and volume run against real client evidence and are not published. The honest serving boundary of protected work.
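
The four stages above (triage, analysis, verify, persist) can be sketched as a plain function chain. Everything here is illustrative: the `Finding` shape, the function names, and the injected triage and analysis callbacks are hypothetical stand-ins for the Claude calls, which are not reproduced. The quote check is shown as a simple rule, that a finding survives verification only if its cited quote appears in the evidence after whitespace and case normalization; the actual verification pass is a Haiku call and may work differently.

```typescript
// Hypothetical shapes; Bastion's real schema is not published.
interface Finding {
  controlId: string; // a control identifier, e.g. a NIST CSF 2.0 subcategory
  quote: string;     // text the finding cites from the evidence
}

interface PipelineResult {
  verified: Finding[];
  rejected: Finding[];
}

// Collapse whitespace and case so line wraps do not fail the check.
function normalize(text: string): string {
  return text.replace(/\s+/g, " ").trim().toLowerCase();
}

// Verification pass (sketch): keep a finding only if its quote is in the evidence.
function quoteCheck(evidence: string, findings: Finding[]): PipelineResult {
  const haystack = normalize(evidence);
  const result: PipelineResult = { verified: [], rejected: [] };
  for (const finding of findings) {
    const needle = normalize(finding.quote);
    const found = needle.length > 0 && haystack.indexOf(needle) !== -1;
    (found ? result.verified : result.rejected).push(finding);
  }
  return result;
}

// triage -> analysis -> verify -> persist, with model calls injected as stubs.
function runPipeline(
  evidence: string,
  triage: (e: string) => boolean,
  analyze: (e: string) => Finding[],
  persist: (r: PipelineResult) => void
): PipelineResult {
  // Documents that fail triage skip the (more expensive) analysis pass.
  const findings = triage(evidence) ? analyze(evidence) : [];
  const result = quoteCheck(evidence, findings);
  persist(result);
  return result;
}
```

Injecting the passes as callbacks keeps the control flow testable with fixtures, without calling a model or touching client evidence.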

KickCast

Offline / precomputed

World Cup prediction dashboard

Python pipeline (offline) · Next.js dashboard on Vercel

KickCast does its compute offline: there is no per-request model inference. The dashboard serves a precomputed forecast, so its serving story is static delivery, not live inference.
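
Static delivery of a precomputed forecast can be sketched as a function with no model code on the request path. This is an illustration under assumptions: the `Forecast` shape, the `serveForecast` name, and the caching policy are hypothetical, since the schema of the published file and the dashboard's cache settings are not described here.

```typescript
// Hypothetical forecast shape; the real published schema is not shown.
interface Forecast {
  generatedAt: string;                      // when the offline run was published
  winProbabilities: Record<string, number>; // team -> probability of winning
}

interface StaticResponse {
  status: number;
  headers: Record<string, string>;
  body: string;
}

// Serve the published snapshot as-is: nothing is computed per visitor.
function serveForecast(snapshot: Forecast): StaticResponse {
  return {
    status: 200,
    headers: {
      "Content-Type": "application/json",
      // Assumed policy: let a shared cache hold the snapshot for an hour.
      "Cache-Control": "public, s-maxage=3600, stale-while-revalidate=86400",
    },
    body: JSON.stringify(snapshot),
  };
}
```

With this shape, per-request cost is serialization and cache lookup only, which is why latency and token cost figures do not apply to this system.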

Serving model: precomputed, no per-request inference
The dashboard serves a published forecast (win-probabilities.json); the model does not run per visitor.

Pipeline: feature build to train to 10,000-run Monte Carlo
The 21,371 x 38 feature build, ensemble training, and 10,000-iteration tournament simulation run offline, then the result is published.

Refresh: per published forecast
Regenerated and re-published when the forecast is updated, a snapshot rather than a live feed.
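
The simulation step can be illustrated with a toy Monte Carlo run. This is not KickCast's model: the real pipeline is Python and uses a trained ensemble, while the sketch below is TypeScript with a hypothetical four-team bracket, made-up ratings, and an assumed rule that win chance is proportional to rating. It shows only the general technique of repeating a randomized tournament many times and publishing the win frequencies.

```typescript
// Toy Monte Carlo over a 4-team knockout. Ratings, bracket, and the win-chance
// rule are illustrative assumptions, not KickCast's trained ensemble.
interface Team {
  name: string;
  rating: number;
}

// Small seeded generator (linear congruential) so runs are reproducible.
// Expects a non-negative integer seed.
function makeRng(seed: number): () => number {
  let state = seed % 4294967296;
  return () => {
    state = (state * 1664525 + 1013904223) % 4294967296;
    return state / 4294967296; // value in [0, 1)
  };
}

// Assumed match model: win chance proportional to rating.
function playMatch(a: Team, b: Team, rng: () => number): Team {
  return rng() < a.rating / (a.rating + b.rating) ? a : b;
}

// Runs the bracket `runs` times and returns each team's title frequency.
function simulate(teams: Team[], runs: number, seed: number): Record<string, number> {
  const rng = makeRng(seed);
  const wins: Record<string, number> = {};
  teams.forEach((t) => { wins[t.name] = 0; });

  for (let i = 0; i < runs; i++) {
    const finalistA = playMatch(teams[0], teams[1], rng);
    const finalistB = playMatch(teams[2], teams[3], rng);
    wins[playMatch(finalistA, finalistB, rng).name] += 1;
  }

  const probabilities: Record<string, number> = {};
  teams.forEach((t) => { probabilities[t.name] = wins[t.name] / runs; });
  return probabilities;
}
```

Calling `simulate(teams, 10000, seed)` with four teams yields a map of title probabilities that sum to 1; an object of that kind is what an offline job would write out for the dashboard to serve.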
