Clinical-AI Edge Benchmark — 12 models × 6 tasks, audited

Most clinical-AI projects pick a model on vibes (“Gemma feels strong”, “MedGemma is medical”) and ship. That’s how 3% hallucination rates on dosages become coroner’s inquests. Before shipping ChartLite, we built the benchmark: 12 LLMs across 6 clinical-AI tasks, with a self-critical audit of every claim.

12 models tested: 6 cloud · 6 on-device
6 clinical-AI tasks: ~54K model-question evaluations
Apache 2.0: methodology + code + raw JSON
Every number disprovable


Three findings that shaped what we shipped

1. Gemma 4 e4b is the first on-device model good enough to ship for clinical note generation.

On peer-reviewed ACI-Bench (dialogue → SOAP note, n=207 × 5 splits) Gemma 4 e4b scores 82.74, statistically tied with Claude Haiku 4.5 (82.72) and 6 points behind the GPT-5.5 leader. On 156 real clinician-annotated Eka transcripts it lands at 82.6 — within 5 points of Haiku. Zero API cost, runs entirely on the phone.

2. MedGemma 1.5 underperforms generic Gemma 4 on every clinical task.

−34pp on ACI-Bench, −20pp on Eka Medical Calculator, −17pp on NFI Pharmacology MCQA, and a 60% silent-failure rate on simple drug-pair extraction (returns un-parseable output). Same parameter class as Gemma 4 e4b, opposite results. The medical fine-tune appears to have eroded reasoning more than it improved recall. Counter-intuitive but cleanly reproducible.

3. Adding a clinical knowledge graph (BODHI) lifts every model’s safety detection — but the size of the lift needs auditing.

Wiring Eka Care’s BODHI knowledge graph in as a safety-rule layer lifts raw alert detection by 26–63 percentage points across the 12 models. That number reads dramatic. We audited it. See below.

The honest BODHI audit

The raw +35pp average lift figure is what gets cited. The disaggregated picture is more interesting:

~40% real catch: the model would otherwise have missed this.
~60% scoring artifact: the model said the right thing in different words; our text-match scorer didn’t credit it.
~20% of BODHI alerts themselves are false positives (fuzzy match firing the wrong rule).

Net of artifact and noise, value lands where it matters most — small on-device models that frontline clinics actually run:

Model	Arm 1 (LLM alone)	Arm 3 (LLM + BODHI)	Lift
Qwen 3.5 0.8B	3%	66%	+63pp
Gemma 4 e4b	30%	57%	+27pp
Claude Opus 4.7	38%	80%	+42pp
Claude Sonnet 4.6	52%	78%	+26pp

Among the rows shown, the model with the strongest unaided baseline (Claude Sonnet 4.6) gains least and the smallest model (Qwen 3.5 0.8B) gains most. The story isn’t “BODHI replaces the LLM’s safety reasoning”; it’s “BODHI is a deterministic safety net under clinician judgement, with the biggest impact at the small-model tier where it matters most for low-resource deployment.”

The six benchmarks

#	Benchmark	n	Source	License
1	Synthetic 100 — extraction + 3-arm safety	100 encounters / 89 dangers	inlined	Apache 2.0
2	Eka real transcripts — clinician-annotated dialogue → note	156 cases	Eka Care	MIT
3	ACI-Bench — peer-reviewed dialogue → SOAP	207 × 5 splits	Yim et al., Scientific Data 2023	CC BY 4.0
4	CRESCENDDI — drug-drug interaction safety	200 pairs	Kontsioti et al., Scientific Data 2022	CC0
5	NFI Pharmacology MCQA — multiple-choice pharmacology exam	925 questions	Eka Care, May 2026	research-only
6	Medical Calculator Eval — clinical-math vignettes (incl. Hinglish)	1,066 vignettes, 26 specialties	Eka Care, May 2026	research-only

How we measured

Silver ground truth from a 3-frontier-model panel

For the extraction task, three frontier models (Opus 4.7, GPT-5.4, Gemini 3.1 Pro) each emit {primary, accepts: [...]} per ground-truth item. The merger unions accept lists across models when items share a synonym group or fuzzy similarity ≥ 0.6, and preserves single-model items. Output: 1,320 items across 100 encounters; 61% with full 3-model consensus, mean 13 accept forms per item.

This captures clinical-language variation (full term + abbreviation + lay shorthand + brand/generic + spelling variants) better than single-annotator GT could. We acknowledge the trade-off explicitly: a panel of 3 LLMs can systematically miss the same things, biasing recall on the kinds of errors all three share.
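The merge rule described above can be sketched roughly as follows. This is an illustration under assumptions, not the repository's merger: the function names, the use of `difflib.SequenceMatcher` as the fuzzy-similarity measure, and the greedy first-match grouping are all hypothetical, and synonym groups are omitted. Only the 0.6 threshold and the `{primary, accepts}` item shape come from the text.

```python
from difflib import SequenceMatcher

FUZZY_THRESHOLD = 0.6  # stated in the text; the similarity measure itself is an assumption


def similar(a: str, b: str) -> bool:
    """Case-insensitive fuzzy match (difflib ratio is an assumed stand-in)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= FUZZY_THRESHOLD


def merge_panel(panel_outputs: dict[str, list[dict]]) -> list[dict]:
    """Union accept lists across models and keep single-model items."""
    merged: list[dict] = []
    for model, items in panel_outputs.items():
        for item in items:
            forms = {item["primary"], *item.get("accepts", [])}
            # Greedy: attach to the first merged item whose primary resembles any form.
            target = next(
                (m for m in merged if any(similar(m["primary"], f) for f in forms)),
                None,
            )
            if target is None:
                merged.append(
                    {"primary": item["primary"], "accepts": set(forms), "models": {model}}
                )
            else:
                target["accepts"] |= forms
                target["models"].add(model)
    return merged


# Toy panel output (hypothetical model names and items).
panel = {
    "model_a": [{"primary": "metformin", "accepts": ["metformin 500 mg"]}],
    "model_b": [{"primary": "Metformin", "accepts": ["glucophage"]}],
    "model_c": [{"primary": "hypertension", "accepts": ["HTN", "high blood pressure"]}],
}
merged = merge_panel(panel)
for entry in merged:
    print(entry["primary"], sorted(entry["accepts"]), sorted(entry["models"]))
```

The `models` set on each merged item is what a consensus statistic (such as the share of items with full 3-model agreement) would be computed from. A greedy merge is order-dependent, so a real merger would likely need a stable ordering or a clustering step.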

Three-arm safety design

Arm 1: dedicated LLM safety-review prompt — no rules, no knowledge graph.
Arm 2: structured extraction → rule engine (drug-allergy, drug-drug, dosage, vitals).
Arm 3: Arm 2 + BODHI knowledge graph (drug-condition, triage, lab recommendation, referral).

The arms are computed independently per encounter and reported separately. Bootstrap CIs (500 resamples, paired encounter-level) are surfaced on the leaderboard as [CI_low–CI_high].
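For readers unfamiliar with paired encounter-level bootstrapping, here is a minimal sketch of how a CI on the Arm 1 → Arm 3 lift could be computed. It assumes per-encounter detection scores aligned by index and a percentile interval; the function name, the percentile method, and the toy data are assumptions, not the repository's code. Only the 500 resamples and the paired, encounter-level design come from the text.

```python
import random


def paired_bootstrap_lift(arm1, arm3, n_resamples=500, alpha=0.05, seed=0):
    """Percentile CI for mean(arm3) - mean(arm1), resampling encounters with replacement.

    arm1 / arm3: per-encounter detection scores (e.g. 0 or 1), aligned by index.
    Returns (point_estimate, ci_low, ci_high).
    """
    if len(arm1) != len(arm3) or not arm1:
        raise ValueError("arms must be non-empty and aligned per encounter")
    rng = random.Random(seed)
    n = len(arm1)
    lifts = []
    for _ in range(n_resamples):
        # The same resampled encounter indices are used for both arms: that is the pairing.
        idx = [rng.randrange(n) for _ in range(n)]
        lifts.append(sum(arm3[i] - arm1[i] for i in idx) / n)
    lifts.sort()
    ci_low = lifts[int((alpha / 2) * n_resamples)]
    ci_high = lifts[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    point = sum(a3 - a1 for a1, a3 in zip(arm1, arm3)) / n
    return point, ci_low, ci_high


# Toy data: 1 = danger detected in that encounter, 0 = missed (made-up values).
arm1_hits = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
arm3_hits = [1, 0, 1, 1, 1, 0, 1, 1, 1, 0]
point, ci_low, ci_high = paired_bootstrap_lift(arm1_hits, arm3_hits)
print(f"lift = {point:.2f} [{ci_low:.2f}–{ci_high:.2f}]")
```

Resampling encounters rather than individual dangers keeps correlated dangers within one encounter together, which is the usual reason for choosing the encounter as the resampling unit.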

Multi-axis miss adjudication

Unmatched extracted items are re-scored by a 3-stage pipeline (deterministic detectors, then lexical retrieval, then an Opus-4.7 judge with cited transcript spans) and classified on six independent axes. Result: ~40–50% of frontier-model “errors” trace to scorer issues or incomplete silver GT, not to the model itself. The ~60% scoring-artifact figure in the BODHI audit comes from this same adjudication.
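To make the first two stages concrete, here is a rough sketch of how an unmatched item might be triaged before it reaches the LLM judge. Everything in it is hypothetical: the function names, the three labels, the token-coverage measure, and the 0.5 threshold are illustrative stand-ins, and the six adjudication axes are not modelled.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and drop punctuation/whitespace so 'B.P.' and 'bp' compare equal."""
    return "".join(re.findall(r"[a-z0-9]+", text.lower()))


def token_set(text: str) -> set[str]:
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def triage_unmatched(item, accept_forms, transcript_sentences, min_cover=0.5):
    """Return (label, evidence_sentence) for one unmatched extracted item."""
    # Stage 1 (deterministic): the item equals an accept form once formatting is ignored,
    # so the miss is a scorer problem rather than a model problem.
    if normalize(item) in {normalize(form) for form in accept_forms}:
        return "scorer_artifact", None
    # Stage 2 (lexical retrieval): find the sentence covering most of the item's tokens.
    item_tokens = token_set(item)
    best_sentence, best_cover = None, 0.0
    for sentence in transcript_sentences:
        cover = len(item_tokens & token_set(sentence)) / len(item_tokens) if item_tokens else 0.0
        if cover > best_cover:
            best_sentence, best_cover = sentence, cover
    if best_cover >= min_cover:
        # Stage 3 (not shown) would hand this span to the judge for a cited verdict.
        return "send_to_judge", best_sentence
    return "no_transcript_support", None


accepts = ["blood pressure", "BP"]
sentences = ["Patient reports chest pain since morning.", "B.P. was 150 over 95 today."]
print(triage_unmatched("B.P.", accepts, sentences))
print(triage_unmatched("chest pain", accepts, sentences))
print(triage_unmatched("warfarin", accepts, sentences))
```

Running the cheap deterministic checks first keeps the judge for the cases that actually need reading comprehension, and gives it a specific span to cite.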

Where we might be wrong

Every claim above has a corresponding caveat. The full list lives in docs/METHODOLOGY.md; the most important nine:

Silver-danger list is panel-generated. The 89 expected dangers across the 100 encounters were emitted by the same 3-model panel that built the silver extractions. If the panel missed a danger BODHI catches, BODHI gets no credit; if the panel over-listed a category BODHI handles trivially, BODHI looks better than it is. Fix in flight: clinician-adjudicated 30–50 encounter gold subset.
Silver GT shares two judges with the contestants. Opus 4.7 and GPT-5.4 are both panel members and benchmark contestants. Their precision against silver GT is inflated by their own contribution to the GT. Mitigated but not eliminated by the merger logic. Fix: swap one panel member out and regenerate.
Note-quality judge shares contestants too. Opus is judge and contestant on note quality; so is GPT-5.4. Inter-judge agreement (Krippendorff α / Cohen κ) is not currently reported. Fix on the list.
Single run per (model, mode, encounter). Earlier 5-run experiments on a small subset showed F1 σ ≈ 0.008 for cloud models; small-model variance was not measured. Leaderboard deltas under ~2 F1 should be treated as within-noise.
Bootstrap CIs only on safety arms. Extraction F1 deltas, note-quality deltas, and the “tied group” indicator have no CIs yet. The per-encounter data is collected; just needs the bootstrap pass.
Single-judge miss adjudication. The miss adjudicator uses Opus 4.7. Self-bias risk on Opus output. Fix: run a 10% sample with GPT-5.4 or Gemini and check agreement.
Synthetic encounter coverage. The 100 encounters are curated to exercise specific safety-rule triggers, not to reflect a real LMIC primary-care distribution. Fix: retrospective comparison against a sample of real PHC encounters.
No external cross-validation. The benchmark has not been compared to public medical NLP benchmarks (MedQA, PubMedQA, BioASQ, n2c2). Fix: run a 100-question MedQA subset through the top three and verify rough rank order.
BODHI lab-rec scoring is category-level. Strong LLMs sometimes recommend more specific labs than BODHI’s category mapping covers, so they appear to fail when they actually did better. Affects 1–2% of total dangers.
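Two of the caveats above (the note-quality judges and the single-judge miss adjudication) propose checking agreement between judges. As a sketch of what that check could look like, here is a dependency-free Cohen's κ for two judges labelling the same items. The label names and toy data are hypothetical, and this is not code from the repository.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two raters who labelled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)
    # Observed agreement: share of items where both raters chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies, summed over labels.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a.keys() | freq_b.keys()) / (n * n)
    if expected == 1.0:
        # Both raters used one identical label throughout; kappa is undefined.
        # Returning 1.0 here is a convention chosen for this sketch.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy adjudication labels from two judges (made-up values).
judge_a = ["model_error", "scorer_issue", "scorer_issue", "gt_gap", "model_error", "scorer_issue"]
judge_b = ["model_error", "scorer_issue", "gt_gap", "gt_gap", "model_error", "model_error"]
print(round(cohen_kappa(judge_a, judge_b), 2))  # 0.52
```

In the toy data the judges agree on 4 of 6 items (0.667) against a chance agreement of 11/36 (0.306), giving κ = 13/25 = 0.52.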

If any of these caveats invalidates a number you care about, the raw per-(model, case) JSON in scripts/{bench}_raw/ is the canonical record. Aggregates can be regenerated from raw at any time.

Dashboard — the interactive view

benchmark.chartlite.health

Tabs cover: Leaderboard, Eka real, ACI-Bench, CRESCENDDI, NFI Pharmacology, Calculators, per-Encounter explorer, per-Model explorer, Miss analysis, Methodology, and Prompts.

Reproduce any number

git clone https://github.com/prismindanalytics/clinical-edge-bench
cd clinical-edge-bench
pip install -r scripts/requirements.txt
export ANTHROPIC_API_KEY=sk-…
python3 scripts/benchmark_pharmacology_mcqa.py --models claude-haiku --limit 5

Full reproduction recipe in scripts/README.md. Data is fetched on demand, not bundled. Run window for the May 2026 numbers: April 19 – May 7. Exact API model handles and Ollama tags in docs/MODEL_VERSIONS.md.

Citation

Clinical-AI Edge Benchmark. An open benchmark of 12 LLMs across 6 clinical-AI tasks for offline on-device deployment in resource-constrained settings. Run window April 19 – May 7, 2026. https://benchmark.chartlite.health · github.com/prismindanalytics/clinical-edge-bench

Research / methodology benchmark, not clinical certification. Issues, methodology critiques, and pull requests welcome at github.com/prismindanalytics/clinical-edge-bench. Apache 2.0; third-party datasets retain their original licenses.