Serving
How these systems serve in production.
Eval-time quality lives on the /measured page. This is the other half a systems reviewer asks for: the model behind each system, how it runs, and what it costs. Kirby is measured live; Bastion's facts come from its engineering docs, with production telemetry kept client-confidential; KickCast serves precomputed forecasts.
Kirby (portfolio chat)
Measured: grounded assistant, guard-first, served on NVIDIA NIM
LangGraph guard to answer · NVIDIA NIM · Next.js route handler on Vercel
Kirby runs a two-node graph: a deterministic guard classifies every turn and refuses injection or out-of-scope prompts before any token is generated, so the model only runs on real questions. Everything below is measured from the live endpoint.
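The guard-first routing described above can be sketched in plain TypeScript. This is a minimal illustration, not Kirby's actual code: the pattern lists, the `classifyTurn` and `handleTurn` names, and the refusal text are all hypothetical, and the real graph is built with LangGraph rather than a hand-written function.

```typescript
// Hypothetical guard-first routing: a deterministic check runs before any model call.
type Verdict = "answer" | "refuse_injection" | "refuse_out_of_scope";

// Illustrative patterns only; a real guard would use a vetted, tested list.
const INJECTION_PATTERNS: RegExp[] = [
  /ignore (all |any )?(previous|prior) instructions/i,
  /reveal (your )?(system )?prompt/i,
];
const IN_SCOPE_KEYWORDS: string[] = ["project", "experience", "kirby", "bastion", "kickcast"];

// Node 1: classify the turn with no model involved, so the result is repeatable.
function classifyTurn(text: string): Verdict {
  if (INJECTION_PATTERNS.some((p) => p.test(text))) return "refuse_injection";
  const lower = text.toLowerCase();
  if (!IN_SCOPE_KEYWORDS.some((k) => lower.includes(k))) return "refuse_out_of_scope";
  return "answer";
}

// Node 2 (the answer function) only runs when the guard says "answer".
function handleTurn(
  text: string,
  answer: (question: string) => string,
): { verdict: Verdict; reply: string } {
  const verdict = classifyTurn(text);
  if (verdict !== "answer") {
    return { verdict, reply: "Sorry, I can only answer questions about this portfolio." };
  }
  return { verdict, reply: answer(text) };
}
```

Because the refusal path never calls `answer`, a rejected turn spends no tokens, which is consistent with the short guard-rejection latency reported for Kirby.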
- Model: meta/llama-3.1-8b-instruct (NVIDIA NIM). Injected into the system prompt at request time and asserted by the model-self-ID eval cases.
- TTFT p50 / p95: 621 ms / 1040 ms. From the published Kirby eval run (every case rendered at /measured), measured client-side over the prod endpoint, so network-inclusive.
- Guard-rejection latency: ~0.2 s. The guard short-circuits before the LLM, so injection and out-of-scope prompts are refused in roughly 180 to 250 ms (panel transcripts).
- Per-turn tokens: ~10K in + ~54 out. Each turn injects the grounded corpus (~37 KB serialized today, 40 KB budget, ~10K tokens) as system context; the average answer is ~54 tokens, measured over the 30-case golden suite.
- Cost per turn: ~$0.001 (est.). Back-of-envelope: ~10K input + ~54 output tokens at roughly $0.10 per 1M tokens (a typical small-model serverless rate) is about a tenth of a cent per turn, dominated by the corpus. Labeled est.; no separate dollar telemetry is published.
- Rate limit: 10 requests / 5 min per IP. Anti-abuse on a public, token-spending endpoint; a 429 carries a Retry-After countdown the chat dock renders.
- Failure handling: 1 pre-token retry; honest mid-stream notice. A transient provider failure before the first token triggers one silent retry, so the visitor sees a clean stream; if the stream dies mid-reply the partial answer gets an honest 'Connection dropped, ask me again' line instead of a silent truncation. Asserted in tests/unit/chat-guard.test.ts.
- Observability: Vercel function logs; key never logged. Server-side logging via Vercel; the NVIDIA_API_KEY is never written to logs, even on failure paths (asserted in tests).
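The cost-per-turn estimate in the list above is plain arithmetic, reproduced here as a small sketch. The token counts and the $0.10 per 1M rate are the rounded figures quoted there; the function name and the use of one blended rate for input and output tokens are assumptions made for illustration.

```typescript
// Back-of-envelope cost estimate using the rounded figures quoted in the list.
const USD_PER_MILLION_TOKENS = 0.1; // assumed blended rate for input and output
const INPUT_TOKENS = 10_000; // ~10K tokens of grounded-corpus context per turn
const OUTPUT_TOKENS = 54; // ~54-token average answer

function estimateCostUsd(inputTokens: number, outputTokens: number, usdPerMillion: number): number {
  return ((inputTokens + outputTokens) * usdPerMillion) / 1_000_000;
}

const costPerTurn = estimateCostUsd(INPUT_TOKENS, OUTPUT_TOKENS, USD_PER_MILLION_TOKENS);
const inputShare = INPUT_TOKENS / (INPUT_TOKENS + OUTPUT_TOKENS);

console.log(costPerTurn.toFixed(4)); // "0.0010", about a tenth of a cent
console.log((inputShare * 100).toFixed(1)); // "99.5", the corpus dominates the bill
```

Since input tokens are over 99% of the total, shrinking the injected corpus would move the estimate far more than shortening answers.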
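The rate limit above (10 requests per 5 minutes per IP, with a Retry-After value on the 429) could be implemented in several ways. The sketch below assumes a fixed window held in memory; the source does not say whether Kirby uses a fixed or sliding window or where the counters live, so `checkRateLimit`, `WindowState`, and the in-memory `Map` are all hypothetical.

```typescript
// Hypothetical fixed-window limiter: 10 requests per 5 minutes per IP.
const LIMIT = 10;
const WINDOW_MS = 5 * 60 * 1000;

type WindowState = { start: number; count: number };
type Decision =
  | { allowed: true }
  | { allowed: false; status: 429; retryAfterSeconds: number };

// In-memory store for illustration; serverless instances do not share memory,
// so a deployed limiter would need a shared store instead.
const windowsByIp = new Map<string, WindowState>();

function checkRateLimit(ip: string, nowMs: number): Decision {
  const w = windowsByIp.get(ip);
  if (!w || nowMs - w.start >= WINDOW_MS) {
    // First request from this IP, or the previous window has expired.
    windowsByIp.set(ip, { start: nowMs, count: 1 });
    return { allowed: true };
  }
  if (w.count < LIMIT) {
    w.count += 1;
    return { allowed: true };
  }
  // Seconds until the window resets: the value a Retry-After header would carry.
  const retryAfterSeconds = Math.ceil((w.start + WINDOW_MS - nowMs) / 1000);
  return { allowed: false, status: 429, retryAfterSeconds };
}
```

A route handler would call `checkRateLimit` before the guard and, on a blocked decision, respond with status 429 and a `Retry-After` header set to `retryAfterSeconds`, which is what lets a client render a countdown.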
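The failure handling above has two branches: retry once if nothing has been sent yet, and append a notice if the stream dies after tokens have reached the visitor. A minimal sketch of that logic as an async generator follows; `streamWithRetry` and the `openStream` factory are hypothetical names, and the real behavior is whatever tests/unit/chat-guard.test.ts asserts.

```typescript
// Hypothetical wrapper: one silent retry before the first token, an honest notice after it.
const DROP_NOTICE = "Connection dropped, ask me again";

async function* streamWithRetry(
  openStream: () => AsyncIterable<string>, // assumed factory that starts a fresh provider stream
): AsyncGenerator<string> {
  let emitted = false;
  for (let attempt = 1; attempt <= 2; attempt++) {
    try {
      for await (const chunk of openStream()) {
        emitted = true;
        yield chunk;
      }
      return; // stream completed cleanly
    } catch (err) {
      if (emitted) {
        // Tokens already reached the visitor: do not retry, say what happened.
        yield "\n\n" + DROP_NOTICE;
        return;
      }
      if (attempt === 2) throw err; // the single pre-token retry is used up
    }
  }
}
```

Retrying only before the first token avoids replaying text the visitor has already seen, which is why the mid-stream branch reports the drop rather than restarting.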
Bastion
Measured: evidence-to-findings security analysis (protected client work)
Next.js · Supabase · Anthropic Claude API (enterprise, no-training)
Bastion turns uploaded evidence into NIST CSF 2.0 gap findings through a multi-pass Claude pipeline. Its eval-time quality is published on /measured (10/20 strict, 83.1% recall, 95.6% citation validity); the serving facts below come from the project's own engineering docs. Per-engagement latency and throughput run against real client evidence and stay client-confidential.
- Models: Claude Sonnet 4 + Claude Haiku 4.5. Sonnet for deep analysis, Haiku for triage and a verification pass (quote check, coverage, missed controls). From Bastion's project docs.
- Pipeline: triage to analysis to verify to persist. A multi-pass design: Haiku triage and discovery, a calibrated Sonnet analysis, a Haiku verification pass, then persistence with feedback capture. The multi-pass design and retrieval are corroborated by the Bastion note on /measured.
- Cost per document: ~$0.04 to $0.07 (est.). Bastion's own pipeline cost model for one evidence document across the Haiku and Sonnet passes; a design estimate, not measured production telemetry.
- Data handling: no-training enterprise API. Anthropic enterprise API with no-training guarantees, so client evidence is never used to train models.
- Production telemetry: client-confidential. Per-engagement latency, throughput, and volume run against real client evidence and are not published. This is the honest serving boundary of protected work.
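The quote check named in the verification pass can be illustrated with a small sketch: a finding is kept only if the text it quotes actually appears in the uploaded evidence. This is not Bastion's code. The `Finding` shape, the function names, and the whitespace and case normalization are assumptions, and the real pass is run by a model rather than a string match.

```typescript
// Hypothetical quote check: a citation is valid only if its quote appears in the evidence.
type Finding = { controlId: string; quote: string };

// Collapse whitespace and case so line wraps in the evidence do not cause
// false misses (an assumption, not a documented Bastion rule).
function normalize(text: string): string {
  return text.replace(/\s+/g, " ").trim().toLowerCase();
}

function checkQuotes(
  evidence: string,
  findings: Finding[],
): { valid: Finding[]; invalid: Finding[] } {
  const haystack = normalize(evidence);
  const valid: Finding[] = [];
  const invalid: Finding[] = [];
  for (const finding of findings) {
    const needle = normalize(finding.quote);
    // Empty quotes are treated as invalid rather than trivially matching.
    if (needle.length > 0 && haystack.includes(needle)) {
      valid.push(finding);
    } else {
      invalid.push(finding);
    }
  }
  return { valid, invalid };
}
```

Findings that land in `invalid` would be dropped or sent back for re-analysis before the persist step, so an unsupported citation never reaches the stored results.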
KickCast
Offline / precomputed: World Cup prediction dashboard
Python pipeline (offline) · Next.js dashboard on Vercel
KickCast does its compute offline: there is no per-request model inference. The dashboard serves a precomputed forecast, so its serving story is static delivery, not live inference.
- Serving model: precomputed, no per-request inference. The dashboard serves a published forecast (win-probabilities.json); the model does not run per visitor.
- Pipeline: feature build to train to 10,000-run Monte Carlo. The 21,371 × 38 feature build, ensemble training, and 10,000-iteration tournament simulation run offline, then the result is published.
- Refresh: per published forecast. Regenerated and re-published when the forecast is updated, a snapshot rather than a live feed.
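The 10,000-iteration tournament simulation can be illustrated with a toy version. KickCast's real pipeline is Python and uses a trained ensemble; the sketch below is TypeScript, and everything in it is hypothetical: a four-team knockout, made-up ratings, a logistic win model standing in for the ensemble, and a small seeded generator (mulberry32) so the run is reproducible.

```typescript
// Toy Monte Carlo tournament: simulate a knockout many times, count champions.
type Team = { name: string; rating: number };

// Small seeded PRNG (mulberry32) returning values in [0, 1).
function mulberry32(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

// Logistic win probability from the rating gap (a stand-in for the trained model).
function winProbability(a: Team, b: Team): number {
  return 1 / (1 + Math.exp(-(a.rating - b.rating)));
}

// Assumes teams.length is a power of two; adjacent teams meet in each round.
function simulateTournament(teams: Team[], runs: number, seed: number): Record<string, number> {
  const rand = mulberry32(seed);
  const wins = new Map<string, number>();
  for (const team of teams) wins.set(team.name, 0);

  for (let run = 0; run < runs; run++) {
    let round = teams.slice();
    while (round.length > 1) {
      const next: Team[] = [];
      for (let i = 0; i < round.length; i += 2) {
        const a = round[i];
        const b = round[i + 1];
        next.push(rand() < winProbability(a, b) ? a : b);
      }
      round = next;
    }
    const champion = round[0];
    wins.set(champion.name, (wins.get(champion.name) ?? 0) + 1);
  }

  // Convert champion counts into win probabilities.
  const probabilities: Record<string, number> = {};
  wins.forEach((count, name) => {
    probabilities[name] = count / runs;
  });
  return probabilities;
}

const TOY_TEAMS: Team[] = [
  { name: "A", rating: 1.0 },
  { name: "B", rating: 0.5 },
  { name: "C", rating: 0.0 },
  { name: "D", rating: -0.5 },
];
const forecast = simulateTournament(TOY_TEAMS, 10_000, 42);
console.log(JSON.stringify(forecast)); // the kind of object a published JSON forecast could hold
```

Running the simulation once offline and publishing the resulting object is what lets the dashboard serve a static file with no per-visitor compute.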