// implicit evals for AI products

You changed the prompt.
Did it get better?

Less than 1% of users leave feedback. Litmus captures the other 99%: what they actually did with your AI output. Behavioral signals become quality scores for every prompt version, every model swap.

// the problem

You're flying blind

You shipped a new prompt last Tuesday. Your benchmarks still pass. Your LLM-as-judge still says 8/10. No complaints in Slack.

Three weeks later, a support ticket.

regression detected · content-writer · 3 weeks post-deploy

  regen rate        12%  →  38%   (+26pp)
  time-to-accept   2.1s  →  4.8s  (+129%)
  return rate       73%  →  56%   (-17pp)

source: support ticket #4,291

Less than 1% of users leave explicit feedback. The other 99% vote with their behavior.

// how it works

Three lines. Continuous signal.

// 1. track when AI generates output
const gen = litmus.generation({
  prompt: "content-writer-v4",
  model: "claude-sonnet-4-20250514",
});

// 2. track what the user does with it
litmus.track("copy", { generationId: gen.id });

// 3. that's it. litmus does the rest.
//    quality scores · regressions · alerts

Litmus tracks what happens next. Every copy, edit, regenerate, and abandon becomes a data point. Behavioral patterns become quality scores that tell you exactly how each change affected your users.
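The pipeline above can be sketched as a small aggregator that folds raw user actions into per-prompt-version counts. This is an illustrative TypeScript model, not the Litmus SDK: the `SignalAggregator` name, the `Action` union, and the aggregation shape are all assumptions.

```typescript
// Illustrative sketch: turn raw user actions into per-version signal counts.
// Event names mirror the ones above; everything else is an assumption.
type Action = "copy" | "edit" | "regenerate" | "abandon";

class SignalAggregator {
  private counts = new Map<string, Record<Action, number>>();

  // Record one user action against a prompt version.
  track(promptVersion: string, action: Action): void {
    const row = this.counts.get(promptVersion) ?? {
      copy: 0,
      edit: 0,
      regenerate: 0,
      abandon: 0,
    };
    row[action] += 1;
    this.counts.set(promptVersion, row);
  }

  // Per-version counts, or undefined if no events were seen.
  summary(promptVersion: string): Record<Action, number> | undefined {
    return this.counts.get(promptVersion);
  }
}
```

In a real deployment these counts would accrue server-side across sessions; the sketch only shows how individual events become comparable data points per prompt version.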

// what you see

From noise to signal

content-writer
v3 → v4 · 7d · 2,847 gens

behavioral quality index: 79 (+12, from 67)

  accept           58%   (+7pp)
  edit             24%   (-3pp)
  regen            14%   (-8pp)
  abandon           4%   (-1pp)
  time-to-accept   2.1s  (-50%)
  return rate      73%   (+4pp)
  sessions         412

Every prompt version. Every model change. Scored automatically by what users do, not what they say.
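One way to read the panel above is as four outcome rates folded into a single 0-100 score. The weights and clamping below are illustrative assumptions, not Litmus's published BQI formula.

```typescript
// Illustrative behavioral quality score: reward accepts, partially credit
// edits, penalize regenerates and abandons. These weights are assumptions,
// not Litmus's actual BQI formula.
interface OutcomeRates {
  accept: number;  // fraction of generations accepted as-is
  edit: number;    // accepted after editing
  regen: number;   // regenerated
  abandon: number; // abandoned
}

function behavioralQualityScore(r: OutcomeRates): number {
  const raw = r.accept * 1.0 + r.edit * 0.5 - r.regen * 0.5 - r.abandon * 1.0;
  // Clamp to 0..1 before scaling so pathological inputs stay in range.
  return Math.round(Math.min(1, Math.max(0, raw)) * 100);
}
```

With the v4 panel's rates (accept 58%, edit 24%, regen 14%, abandon 4%) these weights give 59 rather than the dashboard's 79; the point is the shape of the calculation, not the exact weights.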

// not another observability tool

Different question. Different data.

observability: what the model did

latency 1.2s · tokens 847 · error rate 0.3%

Great for ops. Blind to quality.

eval suites: what the model said

MMLU 92% · HumanEval 87% · custom eval 8.2/10

Great for pre-deploy. Blind to production.

litmus: what the user did

accept 58% · regen 14% · BQI 79 · time-to-accept 2.1s

Ground truth. Continuous. Automatic.

Complementary, not competing. Add Litmus alongside your existing stack. It answers the question nobody else can: did the user actually use it?

Your users already know if your AI is good.

You're just not listening yet.