// implicit evals for AI products
Less than 1% of users leave explicit feedback. Litmus captures the other 99%: what they actually do with your AI's output. Behavioral signals become quality scores for every prompt version and every model swap.
// the problem
You shipped a new prompt last Tuesday. Your benchmarks still pass. Your LLM-as-judge still says 8/10. No complaints in Slack.
Three weeks later, a support ticket. The regression shipped with the prompt; nothing you were measuring caught it.
Less than 1% of users leave explicit feedback. The other 99% vote with their behavior.
// how it works
// 1. track when AI generates output
const gen = litmus.generation({
  prompt: "content-writer-v4",
  model: "claude-sonnet-4-20250514",
});

// 2. track what the user does with it
litmus.track("copy", { generationId: gen.id });

// 3. that's it. litmus does the rest.
// quality scores · regressions · alerts
Litmus tracks what happens next. Every copy, edit, regenerate, and abandon becomes a data point. Behavioral patterns become quality scores that tell you exactly how each change affected your users.
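A sketch of what that instrumentation could look like, following the litmus.track pattern above. Only the copy event appears in the snippet; the other event names and payload fields here are assumptions, not a documented schema:

// hypothetical event names, mirroring the litmus.track("copy") call above
litmus.track("edit", {
  generationId: gen.id,
  charsChanged: 214, // illustrative field: how heavily the user rewrote the output
});
litmus.track("regenerate", { generationId: gen.id });
// silence is a signal too: a generation with no follow-up event
// can be treated as an abandon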
// what you see
Every prompt version. Every model change. Scored automatically by what users do, not what they say.
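For a concrete picture, a per-version report could look roughly like this. The shape and the v3 numbers are illustrative assumptions, not a documented API response; the v4 figures echo the metrics in the comparison below:

// hypothetical report shape, for illustration only
{
  prompt: "content-writer-v4",
  versions: [
    { version: "v3", bqi: 86, accept: 0.66, regen: 0.08 },
    { version: "v4", bqi: 79, accept: 0.58, regen: 0.14 }, // last Tuesday's ship
  ],
}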
// not another observability tool
observability: latency 1.2s · tokens 847 · error rate 0.3%
Great for ops. Blind to quality.

benchmarks: MMLU 92% · HumanEval 87% · custom eval 8.2/10
Great for pre-deploy. Blind to production.

litmus: accept 58% · regen 14% · BQI 79 · time-to-accept 2.1s
Ground truth. Continuous. Automatic.
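How might a score like BQI come out of raw events? A minimal sketch, assuming a behavioral quality index is a weighted blend of interaction rates; the weights below are invented for illustration and are not Litmus's actual formula:

// minimal sketch: a behavioral quality index from event counts
// (weights are illustrative assumptions, not litmus's formula)
function behavioralQualityIndex({ accepts, regens, abandons, total }) {
  const accept = accepts / total;   // copied, exported, used downstream
  const regen = regens / total;     // asked for another attempt
  const abandon = abandons / total; // ignored the output entirely
  // reward acceptance, penalize rework and silence; clamp to 0..100
  const raw = 100 * (accept - 0.5 * regen - 0.8 * abandon);
  return Math.max(0, Math.min(100, Math.round(raw)));
}

Whatever the real weights, the design point stands: the score is computed continuously from events users emit anyway, with no manual labeling step.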
Complementary, not competing. Add Litmus alongside your existing stack. It answers the question nobody else can: did the user actually use it?
Your users already know if your AI is good.
You're just not listening yet.