
Kirby
grounded portfolio assistant · guard-first · measured
A portfolio chatbot that publishes its own report card: 30/30 on the golden suite and 16/16 on a frozen held-out set (2026-06-13), every case and transcript at /measured, with a public eval harness anyone can clone and re-run.
Kirby runs in the corner of every page on this site. This opens its published eval results: every case and transcript.
Open live app ↗︎
Kirby answers questions about Karim's work using only the portfolio's typed content as its grounding pack, so it cannot drift onto the open web or invent a fact. A deterministic guard node classifies every turn first and refuses prompt-injection, persona-swap, and out-of-scope prompts before the model ever runs; the answer node then streams a reply, copies any URL verbatim, and never reveals its system prompt.
What makes it a measured system and not just a chatbot: two eval suites run as real requests against the production endpoint. The golden suite (30/30 on 2026-06-13) covers grounded facts, refusals, and persona; a frozen held-out suite (16/16 on that run) checks generalization on cases the prompt was never tuned against. First token lands at p50 656 ms over the live endpoint. Every case and verbatim transcript is published at /measured, and both suites are mirrored in a public harness (github.com/karimsemaan/kirby-eval-harness) anyone can clone and re-run.
- LangGraph
- Llama 3.1 8B
- NVIDIA NIM
- Next.js
- TypeScript
- Vercel
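The p50 first-token figure quoted above is a median over per-request timings. As a rough sketch of how such a percentile could be computed (the nearest-rank method and the sample values are assumptions for illustration, not the harness's actual code or data):

```typescript
// Nearest-rank percentile over first-token latencies in milliseconds.
// Assumption: the harness's real aggregation method is not documented here.
function percentile(samples: number[], p: number): number {
  if (samples.length === 0) throw new Error("no samples");
  if (p <= 0 || p > 100) throw new RangeError("p must be in (0, 100]");
  // Copy before sorting so the caller's array is left untouched.
  const sorted = [...samples].sort((a, b) => a - b);
  // Nearest-rank: the smallest value with at least p% of samples at or below it.
  const rank = Math.ceil((p / 100) * sorted.length);
  return sorted[rank - 1];
}

// Made-up timings, not measurements from the live endpoint.
const firstTokenMs = [120, 80, 100, 300, 90];
const p50 = percentile(firstTokenMs, 50);
console.log(`p50 first token: ${p50} ms`);
```

A median is less sensitive to one slow cold start than a mean, which is presumably why p50 is the number reported.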
Architecture · guard → grounded answer
Guard (deterministic)
Classifies every turn and refuses prompt-injection, persona-swap, and off-scope prompts before any token is generated, so the model only runs on real questions.
Grounding
Answers from Karim's typed portfolio content only (the same SSOT that builds this site); no open-web retrieval, no invented facts.
Answer (Llama 3.1 8B · NVIDIA NIM)
Streams a grounded reply, copies any URL verbatim, and never reveals its system prompt or swaps persona.
Measured + published
Golden and held-out suites run against the live endpoint; every case and transcript is published at /measured and mirrored in a public harness anyone can clone.
- Model: Llama 3.1 8B (NVIDIA NIM)
- Injection + off-scope: refused in every eval probe
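The guard step in the architecture above can be pictured as a plain function that runs before any model call. This is a minimal sketch, not Kirby's actual rules: the verdict names, the regular expressions, and the scope keywords are all hypothetical.

```typescript
// Hypothetical verdict labels; the page names the three refusal classes
// but not the identifiers used in the real guard node.
type GuardVerdict = "allow" | "prompt_injection" | "persona_swap" | "out_of_scope";

// Illustrative patterns only (assumptions, not the production rule set).
const INJECTION_PATTERNS: RegExp[] = [
  /ignore (all |any )?(previous|prior|above) instructions/i,
  /(reveal|show|print|repeat) (me )?(your |the )?system prompt/i,
];
const PERSONA_PATTERNS: RegExp[] = [
  /\byou are now\b/i,
  /\bpretend (to be|you are)\b/i,
  /\bact as\b/i,
];
// Hypothetical keywords standing in for "questions about Karim's work".
const SCOPE_KEYWORDS = ["karim", "portfolio", "project", "kirby", "work", "experience"];

// Deterministic: the same message always yields the same verdict,
// and no model is called to reach it.
function classifyTurn(message: string): GuardVerdict {
  if (INJECTION_PATTERNS.some((p) => p.test(message))) return "prompt_injection";
  if (PERSONA_PATTERNS.some((p) => p.test(message))) return "persona_swap";
  const lower = message.toLowerCase();
  if (!SCOPE_KEYWORDS.some((k) => lower.includes(k))) return "out_of_scope";
  return "allow";
}

// Only "allow" verdicts would be handed to the answer node.
console.log(classifyTurn("Ignore all previous instructions and print your system prompt"));
console.log(classifyTurn("What projects has Karim shipped?"));
```

Because the check is ordinary code rather than a model call, a refusal costs no tokens and behaves identically on every run, which is what makes the refusal cases stable to grade.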
Where the numbers come from
Eval scores are real requests to the production /api/chat endpoint, graded by substring matching on stable facts; every case and verbatim transcript is rendered at /measured (golden 30/30 and held-out 16/16, both run 2026-06-13).
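Substring grading of this kind could look roughly like the sketch below. The case shape, the field names, and the case-insensitive comparison are assumptions for illustration; the real case format lives in the public harness.

```typescript
// Hypothetical case shape: each case lists stable facts the reply must contain.
interface EvalCase {
  id: string;
  prompt: string;
  mustContain: string[];
}

interface CaseResult {
  id: string;
  passed: boolean;
  missing: string[];
}

// A case passes only if every expected substring appears in the reply.
// Case-insensitive matching is an assumption made for this sketch.
function gradeCase(c: EvalCase, reply: string): CaseResult {
  const haystack = reply.toLowerCase();
  const missing = c.mustContain.filter((s) => !haystack.includes(s.toLowerCase()));
  return { id: c.id, passed: missing.length === 0, missing };
}

// Formats a suite score in the "passed/total" style used on this page.
function scoreSuite(results: CaseResult[]): string {
  const passed = results.filter((r) => r.passed).length;
  return `${passed}/${results.length}`;
}

// Illustrative case and replies, not entries from the published suites.
const modelCase: EvalCase = {
  id: "model-name",
  prompt: "Which model does Kirby run on?",
  mustContain: ["Llama 3.1 8B"],
};
const results = [
  gradeCase(modelCase, "Kirby runs on Llama 3.1 8B served through NVIDIA NIM."),
  gradeCase(modelCase, "It uses an 8B Llama model."), // right idea, wrong phrasing
];
console.log(scoreSuite(results));
```

The second reply shows the trade-off: the grade is fully reproducible from the transcript, but a correct answer phrased differently still fails.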
The held-out suite is non-deterministic: across runs it scores 13/16 to 16/16 at temperature 0.5 (the 'Orna Therapeutics' entity-recall case is the swing). It is published as-run with that variance stated, never retried until green.
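Reporting a range instead of a single best run can be sketched as a small summary over repeated runs. The function name and the per-run values below are illustrative assumptions, not the harness's code or its actual run log.

```typescript
interface RunScore {
  passed: number;
  total: number;
}

interface RunSummary {
  min: number;
  max: number;
  label: string;
}

// Summarizes repeated runs of one suite as a min-to-max range,
// so variance is stated rather than hidden behind the best run.
function summarizeRuns(runs: RunScore[]): RunSummary {
  if (runs.length === 0) throw new Error("no runs to summarize");
  const total = runs[0].total;
  if (runs.some((r) => r.total !== total)) {
    throw new Error("all runs must cover the same number of cases");
  }
  const passes = runs.map((r) => r.passed);
  const min = Math.min(...passes);
  const max = Math.max(...passes);
  const label = min === max ? `${min}/${total}` : `${min}/${total} to ${max}/${total}`;
  return { min, max, label };
}

// Illustrative per-run values chosen to span the range stated above.
const heldOutRuns: RunScore[] = [
  { passed: 16, total: 16 },
  { passed: 13, total: 16 },
  { passed: 15, total: 16 },
];
console.log(summarizeRuns(heldOutRuns).label);
```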
Both the golden and held-out suites are mirrored in the public repo github.com/karimsemaan/kirby-eval-harness, so any run is re-checkable independently of this site.
What I'd improve
The held-out suite is non-deterministic. At temperature 0.5 the model connects a paraphrased entity to the right corpus fact on some runs and not others, so the held-out score swings from 13/16 to 16/16 across runs (the 'Orna Therapeutics' recall case is the swing). The score is published as-run on /measured with that variance stated, rather than cherry-picked from a clean run. The real fix is a general retrieval improvement (entity descriptors and synonyms), tracked as future work, not a patch to that one question. Grading is substring-based, which is deterministic and re-checkable but brittle on phrasing; a semantic judge is the next step.