Compiling the HEART Score into a Model-Facing Rule
Clinical LLM workflows often mix rule criteria, exceptions, output expectations, and grading into one long prompt. That can work, but once something goes wrong it gets harder to tell whether the problem came from the model, the rule, or the evaluation setup.
I used the HEART Score as a small example of what this looks like when the rule is pulled into explicit policy artifacts, paired with a strict output schema, and rerun under a deterministic harness.
Why compile a rule?
If the whole rule sits inside one long prompt, it is harder to see which part failed. I found it more useful to split it up: one part says when the rule applies, another says what decision logic to follow, and another defines the output format. Then when something breaks, you can usually tell whether the problem was in scoring, tier assignment, or a safety check.
Here was my specific approach:
- Scope Layer: when the rule applies or not.
- Policy Layer: tiers, dispositions, and safety overrides.
- Output Contract: a strict JSON schema forcing the model to return structured, inspectable fields.
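To make the output-contract idea concrete, here is a minimal sketch of what contract validation can look like. The field names, score ranges, and error messages are illustrative assumptions, not the actual schema from the repo:

```python
# Sketch of a strict output contract for a HEART Score pack.
# Field names and ranges are illustrative, not the repo's actual schema.

ALLOWED_TIERS = {"low", "moderate", "high"}

REQUIRED_FIELDS = {
    "history_score": int,      # 0-2, per HEART component
    "ecg_score": int,          # 0-2
    "age_score": int,          # 0-2
    "risk_factor_score": int,  # 0-2
    "troponin_score": int,     # 0-2
    "total_score": int,        # 0-10
    "risk_tier": str,          # one of ALLOWED_TIERS
    "safety_flags": list,      # e.g. ["troponin_2_with_low_total"]
}

def validate_output(obj: dict) -> list[str]:
    """Return a list of contract violations; an empty list means the output passes."""
    errors = []
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in obj:
            errors.append(f"missing field: {field}")
        elif not isinstance(obj[field], expected_type):
            errors.append(f"wrong type for {field}")
    if not errors:
        # Internal-consistency checks: the components must sum to the total,
        # and the tier must be one the policy layer defines.
        component_sum = sum(
            obj[f] for f in REQUIRED_FIELDS
            if f.endswith("_score") and f != "total_score"
        )
        if component_sum != obj["total_score"]:
            errors.append("total_score does not equal the sum of components")
        if obj["risk_tier"] not in ALLOWED_TIERS:
            errors.append(f"unknown risk_tier: {obj['risk_tier']}")
    return errors
```

The point of validating structure separately from grading correctness is that a malformed response fails loudly at the contract layer instead of silently scoring as a wrong answer.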
A single large prompt can produce similar behavior. Breaking the rule into explicit artifacts mostly helps with editing, tracing changes, and understanding where failures are coming from.
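One way to keep the editing and diffing benefit while still sending the model a single prompt is to assemble it from the separate artifacts at run time. This is a hypothetical sketch of that assembly step, not the repo's actual code; the section headers are assumptions:

```python
def build_prompt(scope: str, policy: str, contract: str, case_note: str) -> str:
    """Assemble a model-facing prompt from separate rule artifacts.

    Each artifact lives in its own file, so a change to the policy layer
    shows up as a diff against one artifact rather than one long prompt.
    """
    return "\n\n".join([
        "## Scope\n" + scope,
        "## Policy\n" + policy,
        "## Output Contract\n" + contract,
        "## Case\n" + case_note,
    ])
```

The model sees one prompt either way; the split only changes how humans edit and trace it.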
What the first run showed
The first run on the 14-case canonical slice scored 9/14. Most of the misses traced back to contradictions or omissions in the specification, not model failure:
- A gold label that conflicted with the pack’s own safety-flag logic.
- Ambiguity in how to score old Q waves without acute changes.
- Conflicting instructions around STEMI bypass overrides.
After those issues were fixed, the rerun hit 14/14. At that point the artifact felt internally coherent.
Dealing with subjective components
Not every part of a clinical rule compiles cleanly. Age, troponin, and risk factors are pretty straightforward. The history component is different. “How suspicious is this presentation for ACS?” still involves judgment, and two clinicians can reasonably disagree at the boundary.
My approach was:
- Anchoring the extremes: sharp, pleuritic pain anchors a 0. Substernal pressure with diaphoresis anchors a 2. The middle stays a gray zone.
- Tolerance in grading: the harness grades the total HEART Score with a ±1 tolerance rather than requiring exact per-component agreement.
- Reasoning focus: grading checks whether the model reaches the correct risk tier and disposition, not whether it assigned a 1 or a 2 to a genuinely ambiguous history.
There are other ways to do this, including few-shot examples or physician calibration.
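The tolerance idea is simple enough to sketch directly. These function names are mine, not the harness's, and they assume the tier/disposition check stays exact while only the total carries tolerance:

```python
def grade_total(predicted_total: int, gold_total: int, tolerance: int = 1) -> bool:
    """Pass if the predicted HEART total is within ±tolerance of the gold total.

    This absorbs reasonable disagreement on a subjective component
    (e.g. history scored 1 vs 2) without failing the whole case.
    """
    return abs(predicted_total - gold_total) <= tolerance

def grade_disposition(predicted_tier: str, gold_tier: str) -> bool:
    """Tier and disposition are graded exactly; tolerance applies only to the total."""
    return predicted_tier == gold_tier
```

The asymmetry is deliberate: a ±1 wobble in the total is clinically tolerable at most boundaries, but landing in the wrong risk tier changes the disposition, so that comparison stays strict.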
Stability under messier inputs
I also ran the revised pack against a messy_note slice with noisier, shorthand ED-style inputs. That slice also ran clean, suggesting that once the policy artifact was coherent, behavior remained stable under messier surface forms of the same task.
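To illustrate what "messier surface forms of the same task" means, here is an invented case pair; the note text and gold label are made up for this example and are not cases from the repo's slices:

```python
# The same invented case in two surface forms. The harness grades both
# against one shared gold label, so only the wording varies.
clean_note = (
    "58-year-old male with substernal chest pressure radiating to the left arm, "
    "diaphoresis, history of hypertension and smoking. Troponin within normal limits."
)
messy_note = "58M c/o SSCP rad L arm, diaphoretic. PMH: HTN, tob. Trop WNL."

# One gold label for both presentations (values invented for illustration).
gold = {"total_score": 5, "risk_tier": "moderate"}
```

Because the gold label is attached to the case rather than to a wording, a clean run on the messy slice is evidence about the model's robustness to surface form, not about the rule itself.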
Clinical safety flags
The HEART Score can read “Low Risk” while the clinical picture is still concerning, for example, a troponin component of 2 sitting next to a low total score. I built a specific troponin_2_with_low_total flag into the pack to surface exactly that pattern.
In practice, that is the kind of case where the score alone is not enough. If the pack does not say that explicitly, the model has to guess.
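A safety override of this shape is easy to express as policy logic. This is a hedged sketch: the threshold and the escalation target ("moderate") are assumptions for illustration, and the real pack defines its own values:

```python
def apply_safety_flags(total_score: int, troponin_score: int,
                       base_tier: str) -> tuple[str, list[str]]:
    """Sketch of a safety override: an elevated troponin component with a low
    total raises a flag and escalates the tier, even though the raw score
    alone would say "low". Threshold and escalation target are assumed here.
    """
    flags = []
    tier = base_tier
    if troponin_score == 2 and total_score <= 3:
        flags.append("troponin_2_with_low_total")
        tier = "moderate"  # assumed escalation target, not the pack's actual value
    return tier, flags
```

Encoding the override in the pack, rather than hoping the model infers it, is what turns "the score alone is not enough" from a clinical intuition into a checkable rule.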
Why this setup is useful
What I found useful was having a setup where early failures pointed back to the right layer. In this case the first run mostly surfaced pack contradictions, missing instructions, and one scoring ambiguity. Once those were fixed, the rerun behaved the way the revised rule said it should.
Limitations
The 30/30 result is clean, but the three slices stay close to the same rule surface, and it's a very small number of synthetic cases. The holdout and messy-note cases mostly show that the model can handle variation in wording and presentation. The harness also only catches inconsistencies inside the spec: if the pack and the labels agree on the same wrong logic, the run will still pass cleanly.
The repo
Repo: github.com/sidoody/heart-context-pack
Walkthrough: /examples/case_walkthrough.md