// implicit evals for AI products

You changed the prompt.
Did it get better?

Less than 1% of users leave feedback. Litmus captures the other 99%: what they actually did with your AI output. Behavioral signals become quality scores for every prompt version, every model swap.

// the problem

You're flying blind

You shipped a new prompt last Tuesday. Your benchmarks still pass. Your LLM-as-judge still says 8/10. No complaints in Slack... yet.

your dashboard

benchmarks        passing
llm-as-judge      8.2 / 10
error rate        0.3%
complaints        0

what litmus sees

regression        3h post-deploy
regen rate        12% → 38%
trust erosion     prompt length -40%
edit distance     increasing
Less than 1% of users leave explicit feedback. The other 99% vote with their behavior.

// how it works

Three lines. Continuous signal.

// 1. track when AI generates output
const gen = litmus.generation(sessionId, {
  prompt_id: "content-writer",
  prompt_version: "v4",
});

// 2. track what the user does with it
gen.event("$copy");

// 3. that's it. litmus does the rest.
//    quality scores · regressions · alerts

Litmus tracks what happens next. Every copy, edit, regenerate, and abandon becomes a data point. Behavioral patterns become quality scores that tell you exactly how each change affected your users.
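As a rough sketch of how that wiring might look in a UI: only "$copy" appears in the snippet above, so the other event names ("$regenerate", "$edit", "$abandon") and the copyButton / regenButton / editor handles below are illustrative placeholders, not documented SDK constants.

// hypothetical wiring: event names other than "$copy" and the DOM
// handles are placeholders for whatever your product already has
const gen = litmus.generation(sessionId, {
  prompt_id: "content-writer",
  prompt_version: "v4",
});

copyButton.addEventListener("click", () => gen.event("$copy"));
regenButton.addEventListener("click", () => gen.event("$regenerate"));

// the user reworked the output before using it
editor.addEventListener("blur", () => {
  if (editor.value !== originalOutput) gen.event("$edit");
});

// the user left without acting on the output
window.addEventListener("beforeunload", () => gen.event("$abandon"));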

// what you see

From noise to signal

Every prompt version. Every model change. Scored automatically by what users do, not what they say.

// content-writer · overview · 7d

regression detected: quality dropped 23% after prompt-v3 deploy

// bqi trend (7 days)

// outcome distribution · what users did with the output

accept 35% · copy 18% · edit 21% · regen 19% · abandon 7%
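The BQI formula itself isn't spelled out on this page. As a mental model only, you can think of a behavioral quality index as a weighted roll-up of that outcome distribution; the sketch below uses invented weights and is not the actual scoring logic.

// illustrative only: not the real BQI formula, weights are made up
const weights = { accept: 1.0, copy: 0.9, edit: 0.5, regen: 0.1, abandon: 0.0 };

function qualityScore(outcomes) {
  // outcomes, e.g. { accept: 0.35, copy: 0.18, edit: 0.21, regen: 0.19, abandon: 0.07 }
  let score = 0;
  for (const [outcome, rate] of Object.entries(outcomes)) {
    score += (weights[outcome] ?? 0) * rate;
  }
  return score; // 0..1, higher means users actually used the output
}

Under those made-up weights, the distribution above scores about 0.64; a regen-heavy week drags it down, which is the kind of shift the regression alert is pointing at.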

// not another observability tool

Different question. Different data.

observability · what the model did

latency 1.2s · tokens 847 · error rate 0.3%

Great for ops. Blind to quality.

eval suites · what the model said

MMLU 92% · HumanEval 87% · custom eval 8.2/10

Great for pre-deploy. Blind to production.

litmus · what the user did

regen rate spiked 26pp after Tuesday's deploy. Power users editing 3x more. Trust score declining.

Ground truth. Continuous. Automatic.

Complementary, not competing. Add Litmus alongside your existing stack. It answers the question nobody else can: did the user actually use it?

Your users already know if your AI is good.

You're just not listening yet.