Why most customer health scores are decoration
A health score nobody validates against renewal outcomes is dashboard furniture. Here is the quarterly backtest that tells you if yours actually predicts anything.
Most customer health scores fail because nobody validates them against renewal outcomes. The score gets built once, turns red and green on a dashboard, and never changes a decision. The fix is a quarterly backtest: pull last quarter's churned accounts, check what their scores said 60 and 90 days before cancellation, then do the same for accounts that renewed. If the scores do not separate the two groups, the score is decoration.
A health score is decoration when it exists but changes nothing. You can spot one in about two minutes with three questions:
If the answers are "not sure," "no," and "the formula does something with logins," you have decoration. The score was built in a spreadsheet or a CS tool during a planning cycle, everyone felt good about it for a month, and then it became wallpaper.
This is common because building a score feels like progress and validating one feels like homework. The building part gets a kickoff meeting. The validation part gets nothing, because no calendar invite says "check whether our score predicted the churn we just ate."
The cost is not neutral. A wrong score is worse than no score. It tells you the account that is about to cancel is healthy, so you skip the call that might have saved them, and it sends you to "rescue" accounts that were never leaving.
You backtest it against outcomes you already have. Churn gives you a labeled dataset for free: every account that cancelled last quarter is a test case, and so is every account that renewed. You do not need a data scientist. You need a spreadsheet and an honest hour.
The method:
That comparison gives you two rates worth writing down, plus one check that is not a number:
| Measure | Question it answers | Bad sign |
| --- | --- | --- |
| Catch rate | Of the accounts that churned, how many did the score flag 60+ days out? | Below half were flagged |
| False alarm rate | Of the accounts flagged red, how many actually churned or downgraded? | Most red accounts renewed fine |
| Explainability | Can you say why each flagged account was flagged? | The reason is "the formula" |
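A spreadsheet is enough for this, but the arithmetic behind the two rates in the table can be sketched in a few lines of Python. The `Account` record, its field names, and the sample book below are all illustrative assumptions. The false alarm rate is taken here to mean the share of red accounts that renewed fine, which is one reasonable reading of the table.

```python
from dataclasses import dataclass


@dataclass
class Account:
    name: str
    churned: bool      # churned or downgraded last quarter
    flagged_red: bool  # score was red 60+ days before the outcome


def catch_rate(accounts):
    """Share of churned accounts that the score flagged in advance."""
    churned = [a for a in accounts if a.churned]
    if not churned:
        return None  # nothing churned, so there is nothing to catch
    return sum(a.flagged_red for a in churned) / len(churned)


def false_alarm_rate(accounts):
    """Share of red-flagged accounts that renewed fine."""
    flagged = [a for a in accounts if a.flagged_red]
    if not flagged:
        return None  # nothing was flagged, so there are no alarms to judge
    return sum(not a.churned for a in flagged) / len(flagged)


# Made-up sample book, for illustration only.
book = [
    Account("A", churned=True, flagged_red=True),
    Account("B", churned=True, flagged_red=True),
    Account("C", churned=True, flagged_red=False),
    Account("D", churned=True, flagged_red=False),
    Account("E", churned=False, flagged_red=True),
    Account("F", churned=False, flagged_red=False),
]

print(f"catch rate: {catch_rate(book):.0%}")
print(f"false alarm rate: {false_alarm_rate(book):.0%}")
```

Returning `None` for an empty group is a deliberate choice in this sketch: a quarter with no churn or no red flags has no rate to report, and a zero would be misleading.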
If your historical scores were never snapshotted, that is itself the first finding: a score you cannot look up retroactively cannot be validated, so start snapshotting it weekly (a scheduled export to a spreadsheet is enough) and run the backtest next quarter.
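Once weekly snapshots exist, the lookup "what did the score say 60 and 90 days before cancellation" is an as-of query: take the latest snapshot on or before the target date. Here is a minimal sketch, assuming each account's snapshots are stored as a date-sorted list of `(date, score)` pairs. The function names and the sample data are hypothetical.

```python
from bisect import bisect_right
from datetime import date, timedelta


def score_as_of(snapshots, target):
    """Return the latest snapshot score on or before target, or None.

    snapshots must be a list of (date, score) pairs sorted by date.
    """
    dates = [d for d, _ in snapshots]
    i = bisect_right(dates, target)  # count of snapshots dated <= target
    return snapshots[i - 1][1] if i else None


def pre_cancellation_scores(snapshots, cancelled_on, marks=(60, 90)):
    """Map each day mark to the score in effect that many days before cancellation."""
    return {
        m: score_as_of(snapshots, cancelled_on - timedelta(days=m))
        for m in marks
    }


# Made-up snapshot history for one churned account.
history = [
    (date(2024, 1, 1), 80),
    (date(2024, 1, 8), 75),
    (date(2024, 2, 5), 60),
    (date(2024, 3, 4), 40),
]

print(pre_cancellation_scores(history, cancelled_on=date(2024, 4, 8)))
```

A `None` in the result means no snapshot existed that far back, which is worth recording as a gap instead of guessing a score.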
There are two ways for a score to fail the backtest, and they feel very different but cost the same.
The first failure is silence: churned accounts sat at green until the cancellation email arrived. This usually means the score is built on lagging or vanity inputs, things like NPS responses from two quarters ago, whether a QBR happened, or account size. Those describe the relationship's paperwork, not its behavior.
The second failure is noise: the score flags a third of your book as red every week. Nobody can work a list that long, so the team learns to ignore red, and the one genuinely dying account is invisible inside the crowd. A fire alarm that goes off every day protects nothing.
The test for noise is simple: divide your red accounts by the number of save conversations your team can actually run in a week. If the red list is bigger than your capacity to act on it, the threshold is wrong or the inputs are wrong, and the score is generating anxiety instead of decisions.
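The noise test above is a single division. As a sketch, with hypothetical function names and made-up numbers (a 200-account book with a third flagged red, and a team that can run 10 save conversations a week):

```python
def red_list_load(red_accounts, weekly_save_capacity):
    """Red accounts per save conversation the team can run in a week."""
    if weekly_save_capacity <= 0:
        raise ValueError("weekly_save_capacity must be positive")
    return red_accounts / weekly_save_capacity


def red_list_is_workable(red_accounts, weekly_save_capacity):
    """True when the team can act on every red account within a week."""
    return red_list_load(red_accounts, weekly_save_capacity) <= 1.0


print(red_list_load(66, 10))         # load well above 1: the list cannot be worked
print(red_list_is_workable(66, 10))
print(red_list_is_workable(8, 10))
```

A load above 1 means the red list outruns the team's capacity, which is the signal to revisit the threshold or the inputs.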
A working health score produces a short list. If you have 200 accounts, a useful red list is 5 to 15 accounts, each with a stated reason. If your score cannot produce that, fix the score before you fix the accounts.
Every flagged account needs a reason a human can read and act on. "Health: 47" is not a reason. "Usage down 40 percent since March and the champion has not answered three emails" is a reason, and it also tells you what the save call is about.
This requirement does real work in two directions:
When you run the quarterly backtest, check reasons too: for the churned accounts the score did flag, was the attached reason the actual reason they left? A score that flags the right accounts for the wrong reasons will eventually flag the wrong accounts.
The instinct after a failed backtest is to add sophistication: more inputs, decimal weights, maybe a request to the data team for a model. Resist it. At 50 to 500 accounts, you do not have enough churn events per quarter to tune a complicated model, and a weighting scheme nobody understands fails the reason-attached requirement automatically.
The better move is usually subtraction. Take the backtest results and ask which individual signals actually separated churned accounts from renewed ones. In most books it is a short list: usage trend, champion responsiveness, and payment or contract behavior tend to carry nearly all the signal. Rebuild the score on the three to five inputs that demonstrably worked, with simple thresholds, and drop the rest.
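To make "three inputs with simple thresholds" concrete, here is a rough sketch of a score that emits readable reasons instead of a number. The field names and all three thresholds are assumptions for illustration; yours should come from whatever your own backtest shows actually separated churned accounts from renewed ones.

```python
from dataclasses import dataclass


@dataclass
class Signals:
    usage_change_pct: float    # -40.0 means usage fell 40 percent
    unanswered_emails: int     # consecutive emails the champion has ignored
    days_invoice_overdue: int  # 0 when payments are current


# Hypothetical thresholds; tune them against your own backtest.
USAGE_DROP_PCT = -30.0
UNANSWERED_LIMIT = 3
OVERDUE_DAYS = 30


def flag_reasons(s):
    """Return human-readable reasons for each threshold the account trips."""
    reasons = []
    if s.usage_change_pct <= USAGE_DROP_PCT:
        reasons.append(f"usage down {abs(s.usage_change_pct):.0f} percent")
    if s.unanswered_emails >= UNANSWERED_LIMIT:
        reasons.append(f"champion has not answered {s.unanswered_emails} emails")
    if s.days_invoice_overdue >= OVERDUE_DAYS:
        reasons.append(f"invoice {s.days_invoice_overdue} days overdue")
    return reasons


def is_red(s):
    """In this sketch an account is red when at least one threshold trips."""
    return bool(flag_reasons(s))


at_risk = Signals(usage_change_pct=-40.0, unanswered_emails=3, days_invoice_overdue=0)
healthy = Signals(usage_change_pct=5.0, unanswered_emails=0, days_invoice_overdue=0)

print(is_red(at_risk), flag_reasons(at_risk))
print(is_red(healthy), flag_reasons(healthy))
```

Because the flag is derived from the reasons, an account cannot be red without a stated reason, so the reason-attached requirement is met by construction.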
A three-signal score everyone trusts and acts on beats a fifteen-signal score everyone ignores. The measure of a health score is decisions changed per quarter, not inputs consumed.
Put a recurring 60-minute block on the calendar for the first week of each quarter. Here is the agenda:
| Minutes | Step |
| --- | --- |
| 0-10 | List last quarter's churned and downgraded accounts |
| 10-25 | Record each one's health score 60 and 90 days pre-cancellation |
| 25-35 | Pull 10-20 renewed accounts and record their scores at the same marks |
| 35-45 | Compute catch rate and false alarm rate; compare to last quarter |
| 45-55 | For each miss, name the signal that would have caught it |
| 55-60 | Change one thing: add that signal, cut a dead one, or move a threshold |
Two rules make the ritual stick. First, change at most one or two things per quarter, so next quarter's backtest tells you whether the change helped. Second, write the two rates somewhere visible. A score whose catch rate is improving quarter over quarter is a tool being sharpened. A score with no recorded history is decoration, whatever the dashboard says.
Do this four times and you will have something rare: a health score with a track record, which is the only kind worth trusting your renewal forecast to.