Does GPT-5.2 get better at diagnosis when you ask it to think harder?
I started this project thinking I would benchmark all the major models (GPT, Claude, Gemini, DeepSeek) on medical diagnosis. When I saw that OpenAI exposes a reasoning effort knob on GPT-5.2, which you can set to none, low, medium, high, and xhigh, I got distracted by a more interesting question: Does GPT-5.2 get better at clinical diagnosis when you ask it to think harder, and at what cost?
The setup
I evaluated GPT-5.2 on 897 paired cases from MedCaseReasoning. Every case went through four reasoning settings: none, low, medium, and high. I left out xhigh for this version of the study. In early runs it was much slower, token usage was more variable, and it made the project harder to finish cleanly. I would rather ship a complete four-setting study and potentially add xhigh later.
I graded the final diagnosis with GPT-4.1 using the same rubric across all variants. I originally framed the analysis stepwise: none to low, then low to medium, then medium to high, but it felt too narrow. In practice, if you are building a health AI product, you do not only care about adjacent steps, but also about real choices like none versus high, or low versus high. I decided to rebuild the reporting around all pairwise comparisons.
The results
| Setting | Accuracy | Tokens (avg) | Latency (avg) |
|---|---|---|---|
| none | 63.9% | 614 | 2.6s |
| low | 66.4% | 782 | 5.5s |
| medium | 67.3% | 935 | 10.8s |
| high | 68.8% | 1,088 | 13.6s |
Going from none to high improved diagnosis accuracy by about 4.9 percentage points. It also increased average latency from 2.6 seconds to 13.6 seconds and pushed average total tokens from 614 to 1,088. In a production clinical system where a doctor is waiting for a response, 13.6 seconds per case is a long wait compared with 2.6 seconds.
Most cases were concordant across settings. For none vs high, 765/897 (85.3%) cases had the same correctness outcome (both correct or both incorrect). The net lift came from the discordant subset (132/897, 14.7%): 88 cases flipped from incorrect to correct with high, while 44 flipped from correct to incorrect.
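The concordant/discordant split above falls out directly from paired per-case outcomes. A minimal sketch with made-up data (the real study has 897 cases, not five):

```python
# Sketch: deriving concordant/discordant counts from paired outcomes.
# `none_correct` and `high_correct` are hypothetical booleans, one per
# case, aligned by case ID -- not the study's actual data.
none_correct = [True, True, False, False, True]
high_correct = [True, False, True, True, True]

both_same = sum(a == b for a, b in zip(none_correct, high_correct))
flipped_up = sum((not a) and b for a, b in zip(none_correct, high_correct))    # incorrect -> correct
flipped_down = sum(a and (not b) for a, b in zip(none_correct, high_correct))  # correct -> incorrect

print(both_same, flipped_up, flipped_down)  # prints: 2 2 1
```

The net lift is simply `flipped_up - flipped_down` over the total case count, which is why the discordant subset drives the whole comparison.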
What was statistically strongest
I used exact McNemar tests across all pairs, then applied a Holm correction for multiple comparisons. The pairs that held up best were:
- none vs high: significant, p < 0.001
- none vs medium: significant, p = 0.029
- none vs low: not significant after correction, p = 0.15
- Adjacent steps: low vs medium and medium vs high were not significant after correction
The accuracy curve is monotonic: every step up in reasoning improves raw benchmark accuracy. Once you correct for multiple comparisons, the strongest evidence is for the cumulative gains from none to medium and from none to high.
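For readers who want to check the statistics, both pieces fit in a few lines of stdlib Python. This is a generic sketch of the exact two-sided McNemar test (a binomial test on the discordant counts) and Holm's step-down adjustment, not the study's actual analysis code:

```python
from math import comb

def mcnemar_exact(b, c):
    """Two-sided exact McNemar p-value from discordant counts b and c
    (cases that flipped in each direction)."""
    n, k = b + c, min(b, c)
    p = 2 * sum(comb(n, i) for i in range(k + 1)) * 0.5 ** n
    return min(p, 1.0)

def holm(pvals):
    """Holm step-down adjusted p-values, returned in the input order."""
    m = len(pvals)
    order = sorted(range(m), key=lambda i: pvals[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, i in enumerate(order):
        running_max = max(running_max, (m - rank) * pvals[i])
        adjusted[i] = min(running_max, 1.0)
    return adjusted

# Discordant counts reported above for none vs high: 88 up, 44 down.
print(mcnemar_exact(88, 44))  # well below 0.001
```

Plugging in the 88/44 split from the none vs high comparison reproduces the headline p < 0.001 result.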
So what is the best setting?
For highest accuracy, high wins. medium is the lowest reasoning setting that still gives a Holm-significant improvement over none. If you care most about efficiency, low is probably the most interesting setting, although its improvement over none did not reach statistical significance after correction in this study.
low captures about half of the total accuracy gain from none to high, while adding far less latency and token cost than going all the way to high. It is not the strongest statistical result after correction, but it may still be the most attractive product decision if you care about responsiveness and cost. The most realistic deployment policy is probably not one universal setting. It is probably wiser to use a cheaper default and escalate to more reasoning for harder or higher-risk cases.
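An escalation policy like the one described above could be sketched as follows. Everything here is hypothetical: `diagnose` stands in for an API call returning an answer plus some confidence signal, and the threshold is arbitrary; none of these names come from the study's pipeline.

```python
# Hypothetical escalation policy: start with a cheap reasoning setting and
# escalate only when the model signals low confidence. `diagnose` is a
# placeholder returning (answer, confidence) -- an assumption, not a real API.
def diagnose_with_escalation(case, diagnose, threshold=0.7):
    answer = None
    for effort in ("low", "medium", "high"):
        answer, confidence = diagnose(case, reasoning_effort=effort)
        if confidence >= threshold:
            return answer, effort
    return answer, "high"  # keep the highest-effort answer if never confident
```

The design choice this encodes: most cases exit at the cheap setting, so average latency stays close to the low-effort numbers while hard cases still get the benefit of more reasoning.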
My take as a doctor
I am board certified and I have spent a lot of time reading case reports. These MedCaseReasoning cases are not routine clinical encounters, but rather are skewed toward rare-disease presentations. I still think the signal here is interesting, but I would be careful not to overinterpret it. Most cases were concordant across reasoning settings, and the net lift came from a relatively small discordant subset. To me, the more interesting next question is whether there are identifiable types of cases where additional reasoning helps enough to justify the added latency and cost.
Why GPT-4.1 as the grader
GPT-4.1 is the same model OpenAI uses as the automated grader in HealthBench, their flagship clinical evaluation framework. In the HealthBench meta-evaluation GPT-4.1 achieved a macro F1 of 0.71 in agreement with physician evaluations, which is comparable to inter-physician agreement. It exceeded the average physician grading score in five out of seven evaluation themes. I held it fixed across all variants.
How I built this
Every case was run through all four variants. This means I can use McNemar's test (paired binary outcomes) instead of comparing independent samples. That is a more powerful comparison and it is the right statistical choice for this setup.
The raw model responses and grading scores are committed to the repo and never modified by the reporting pipeline. The analysis layer reads from those files and generates everything else. If you want to audit the numbers, the source of truth is right there.
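An audit of that source of truth can be this small. The filename and JSON schema below are assumptions for illustration only; check the repo for the actual layout:

```python
# Hypothetical audit sketch: recompute accuracy directly from a committed
# grading file. The path and schema are assumed, not the repo's real layout.
import json

def accuracy_from_grades(path):
    with open(path) as f:
        grades = json.load(f)  # assumed: [{"case_id": "...", "correct": true}, ...]
    return sum(1 for g in grades if g["correct"]) / len(grades)
```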
Earlier versions of this project only compared adjacent steps. That was too narrow. The current analysis compares every pair, because in practice you are often choosing between none and high, not between medium and high.
Limitations
This is not a clinical deployment paper, a safety claim, or a physician-adjudicated benchmark.
MedCaseReasoning is heavily tilted toward complex diagnostic cases from published case reports. These are not representative of the general case mix a clinician sees day to day. The accuracy numbers here are probably lower than what you would see on routine presentations.
GPT-4.1 has published physician-agreement data from HealthBench (macro F1 = 0.71) but it has not been validated against physician review on this specific dataset.
This is GPT-5.2 only. I started this project wanting to compare across model families and I still plan to.
What I would do next
- Hand-audit a stratified sample of grader decisions
- Do a real discordant-case error analysis (which cases benefit most from reasoning?)
- Run the same pipeline on Claude and Gemini
- Analyze the reasoning alignment scores captured alongside diagnosis correctness
The repo
If you want to check out the data or rerun the reporting pipeline, the repo is here: github.com/sidoody/gpt-5.2-reasoning-ablation-v1. It contains the committed raw model outputs, the committed grading files, and a deterministic reporting pipeline. You can regenerate the analysis without rerunning inference.
If you work on health AI and this kind of evaluation is useful, feel free to reach out.
Edited March 4, 2026 to clarify the interpretation of benchmark vs clinical significance and the role of discordant pairs; results unchanged.