// implicit evals for AI products
Less than 1% of users leave feedback. Litmus captures the other 99%: what they actually did with your AI output. Behavioral signals become quality scores for every prompt version, every model swap.
// the problem
You shipped a new prompt last Tuesday. Your benchmarks still pass. Your LLM-as-judge still says 8/10. No complaints in Slack... yet.
[demo: your dashboard vs. what litmus sees]
Less than 1% of users leave explicit feedback. The other 99% vote with their behavior.
// how it works
// 1. track when AI generates output
const gen = litmus.generation(sessionId, {
  prompt_id: "content-writer",
  prompt_version: "v4",
});

// 2. track what the user does with it
gen.event("$copy");

// 3. that's it. litmus does the rest.
// quality scores · regressions · alerts
Litmus tracks what happens next. Every copy, edit, regenerate, and abandon becomes a data point. Behavioral patterns become quality scores that tell you exactly how each change affected your users.
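Only `$copy` appears in the snippet above. As a sketch, the other signals could be reported the same way, assuming `$edit`, `$regenerate`, and `$abandon` follow the same $-prefixed convention (those three names are our assumption, not confirmed API):

// sketch, reusing `gen` from the snippet above — $edit, $regenerate, and
// $abandon are assumed names following the $copy convention, not confirmed API
gen.event("$edit");       // user reworked the output before using it
gen.event("$regenerate"); // user rejected it and asked for another attempt
gen.event("$abandon");    // user left without using the output at all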
// what you see
Every prompt version. Every model change. Scored automatically by what users do, not what they say.
[interactive demo: content-writer · bqi trend (7 days) · what users did with the output · outcome distribution]
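How might a behavioral quality index (the bqi trending in the demo above) fall out of an outcome distribution? A minimal sketch, with illustrative weights per outcome; the weights, event names beyond $copy, and the 0-100 mapping are all assumptions, not Litmus's actual formula:

// illustrative sketch only — the weights and the 0-100 mapping are
// assumptions, not Litmus's actual bqi scoring
const WEIGHTS: Record<string, number> = {
  $copy: 1.0,        // used as-is: strongest positive signal
  $edit: 0.4,        // useful, but needed rework
  $regenerate: -0.6, // rejected, user asked for another attempt
  $abandon: -1.0,    // user walked away
};

function qualityScore(counts: Record<string, number>): number {
  let total = 0, weighted = 0;
  for (const [event, n] of Object.entries(counts)) {
    total += n;
    weighted += n * (WEIGHTS[event] ?? 0);
  }
  if (total === 0) return 50; // no signal yet: neutral
  // map the mean weight from [-1, +1] onto a 0-100 score
  return Math.round(((weighted / total + 1) / 2) * 100);
}

qualityScore({ $copy: 70, $edit: 15, $regenerate: 10, $abandon: 5 }); // ≈ 83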
// not another observability tool
observability tools: latency 1.2s · tokens 847 · error rate 0.3%
Great for ops. Blind to quality.

offline evals: MMLU 92% · HumanEval 87% · custom eval 8.2/10
Great for pre-deploy. Blind to production.

litmus: regen rate spiked 26pp after Tuesday's deploy. Power users editing 3x more. Trust score declining.
Ground truth. Continuous. Automatic.
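That 26pp alert is, at its core, simple arithmetic once every outcome carries a prompt_version tag: compare outcome rates before and after a deploy. A sketch; the function, the outcome names, and the 10pp threshold are illustrative assumptions, not Litmus internals:

// sketch: flag negative outcomes whose rate jumps by more than a threshold
// (in percentage points) between two prompt versions — threshold is illustrative
type Rates = Record<string, number>; // outcome -> share of sessions, 0..1

function regressions(before: Rates, after: Rates, thresholdPp = 10): string[] {
  const alerts: string[] = [];
  for (const outcome of ["$regenerate", "$abandon"]) {
    const deltaPp = ((after[outcome] ?? 0) - (before[outcome] ?? 0)) * 100;
    if (deltaPp > thresholdPp) alerts.push(`${outcome} rate up ${Math.round(deltaPp)}pp`);
  }
  return alerts;
}

// e.g. Tuesday's deploy: regen rate 12% -> 38% is a 26pp spike
regressions({ $regenerate: 0.12 }, { $regenerate: 0.38 }); // ["$regenerate rate up 26pp"]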
Complementary, not competing. Add Litmus alongside your existing stack. It answers the question nobody else can: did the user actually use the output?
Your users already know if your AI is good.
You're just not listening yet.