Compiling the HEART Score into a Model-Facing Rule

Clinical LLM workflows often mix rule criteria, exceptions, output expectations, and grading into one long prompt. That can work, but once something goes wrong it gets harder to tell whether the problem came from the model, the rule, or the evaluation setup.

I used the HEART Score as a small example of what this looks like when the rule is pulled into explicit policy artifacts, paired with a strict output schema, and rerun under a deterministic harness.
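To make "strict output schema" concrete, here is a minimal sketch of the kind of check a deterministic harness can run before grading. The field names (total_score, risk_tier, flags) are illustrative, not the repo's actual schema:

```python
# Minimal sketch (hypothetical field names): strictly validate a model's JSON
# output before grading, so format errors surface at the harness layer instead
# of being mistaken for rule or model failures.
import json

REQUIRED_FIELDS = {"total_score": int, "risk_tier": str, "flags": list}
VALID_TIERS = {"low", "moderate", "high"}

def validate_output(raw: str) -> dict:
    """Parse and strictly check a model response; raise on any deviation."""
    obj = json.loads(raw)
    if set(obj) != set(REQUIRED_FIELDS):
        raise ValueError(f"unexpected or missing fields: {sorted(obj)}")
    for field, typ in REQUIRED_FIELDS.items():
        if not isinstance(obj[field], typ):
            raise TypeError(f"{field} must be {typ.__name__}")
    if obj["risk_tier"] not in VALID_TIERS:
        raise ValueError(f"bad tier: {obj['risk_tier']}")
    return obj
```

Rejecting extra fields (not just missing ones) keeps the model from smuggling free-text reasoning into the structured output.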

Why compile a rule?

If the whole rule sits inside one long prompt, it is harder to see which part failed. I found it more useful to split it up: one part says when the rule applies, another says what decision logic to follow, and another defines the output format. Then when something breaks, you can usually tell whether the problem was in scoring, tier assignment, or a safety check.

In practice, that meant three separate artifacts: an applicability statement, the decision logic, and an output schema.

A single large prompt can produce similar behavior. Breaking the rule into explicit artifacts mostly helps with editing, tracing changes, and understanding where failures are coming from.
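A hypothetical sketch of that split (the structure and names are my own, not the repo's): each part is its own editable artifact, and the compiled prompt is just their concatenation.

```python
# Hypothetical sketch of the three-part split: applicability, decision logic,
# and output format live in separate, individually editable artifacts instead
# of one monolithic prompt.
from dataclasses import dataclass

@dataclass
class PolicyPack:
    applicability: str   # when the rule applies (e.g. adult chest pain in the ED)
    decision_logic: str  # scoring and tier-assignment rules
    output_schema: str   # strict format the model must emit

    def to_prompt(self) -> str:
        # The compiled prompt is just the concatenation; the value of the
        # split is in editing and tracing failures, not in runtime behavior.
        return "\n\n".join([
            "WHEN THIS RULE APPLIES:\n" + self.applicability,
            "DECISION LOGIC:\n" + self.decision_logic,
            "OUTPUT FORMAT:\n" + self.output_schema,
        ])
```

Because to_prompt() flattens everything back into one string, the model sees the same thing either way; the split pays off when a failing case needs to be traced to one artifact.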

What the first run showed

The first run on the 14-case canonical slice came back 9/14 perfect. Most of the misses traced back to contradictions or omissions in the specification rather than model failure.

After those issues were fixed, the rerun hit 14/14. At that point the artifact felt internally coherent.

Dealing with subjective components

Not every part of a clinical rule compiles cleanly. Age, troponin, and risk factors are pretty straightforward. The history component is different. “How suspicious is this presentation for ACS?” still involves judgment, and two clinicians can reasonably disagree at the boundary.
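The deterministic components can be sketched directly from the published HEART Score tables. In this sketch (my own code, not the repo's), history and ECG are taken as pre-judged 0-2 inputs, since those are the parts that resist compilation:

```python
# Reference sketch of the deterministic parts of the HEART Score.
# History and ECG arrive as already-judged 0-2 values; age, risk factors,
# and troponin compile cleanly into code.

def age_points(age: int) -> int:
    # <45 -> 0, 45-64 -> 1, >=65 -> 2
    return 2 if age >= 65 else 1 if age >= 45 else 0

def risk_factor_points(n_risk_factors: int, known_atherosclerosis: bool = False) -> int:
    # 0 factors -> 0, 1-2 -> 1, >=3 or known atherosclerotic disease -> 2
    if known_atherosclerosis or n_risk_factors >= 3:
        return 2
    return 1 if n_risk_factors >= 1 else 0

def troponin_points(ratio_to_upper_limit: float) -> int:
    # <= normal limit -> 0, 1-3x the limit -> 1, > 3x -> 2
    if ratio_to_upper_limit > 3:
        return 2
    return 1 if ratio_to_upper_limit > 1 else 0

def heart_tier(history: int, ecg: int, age: int, n_rf: int, trop_ratio: float) -> tuple:
    total = (history + ecg + age_points(age)
             + risk_factor_points(n_rf) + troponin_points(trop_ratio))
    # 0-3 low, 4-6 moderate, 7-10 high
    tier = "low" if total <= 3 else "moderate" if total <= 6 else "high"
    return total, tier
```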

My approach was to keep that judgment call visible in the pack rather than hide it behind a precise-looking rule.

There are other ways to do this, including few-shot examples or physician calibration.

Stability under messier inputs

I also ran the revised pack against a messy_note slice with noisier, shorthand ED-style inputs. That slice also ran clean, suggesting that once the policy artifact was coherent, behavior remained stable under messier surface forms of the same task.

Clinical safety flags

The HEART Score can be “Low Risk” while the clinical picture is still concerning, for example, a low total score sitting next to a high troponin. I built a specific troponin_2_with_low_total flag into the pack.

In practice, that is the kind of case where the score alone is not enough. If the pack does not say that explicitly, the model has to guess.
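The flag logic itself is small; a sketch under the assumption that the harness has the troponin component score and the total available (the function name mirrors the pack's flag, the signature is my own):

```python
# Hedged sketch of the troponin_2_with_low_total check: even when the total
# HEART score lands in the low tier (<= 3), a troponin component of 2 should
# surface a flag instead of silently reporting "Low Risk".

def safety_flags(total_score: int, troponin_points: int) -> list:
    flags = []
    if troponin_points == 2 and total_score <= 3:
        flags.append("troponin_2_with_low_total")
    return flags
```

Encoding the flag in the pack means the model is told what to do with this case, instead of being left to guess whether the score or the troponin should win.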

Why this setup is useful

What I found useful was having a setup where early failures pointed back to the right layer. In this case the first run mostly surfaced pack contradictions, missing instructions, and one scoring ambiguity. Once those were fixed, the rerun behaved the way the revised rule said it should.

Limitations

The 30/30 result is clean, but the three slices still stay pretty close to the same rule surface, and it's a very small number of synthetic cases. The holdout and messy-note cases mostly show that the model can handle variation in wording and presentation. The harness also only catches inconsistencies inside the spec: if the pack and the labels agree on the same wrong logic, the run will still pass cleanly.

The repo

Repo: github.com/sidoody/heart-context-pack
Walkthrough: /examples/case_walkthrough.md