Clinical-AI Edge Benchmark

12 LLMs × 6 clinical tasks. Audited.

Independent evaluation of what's actually deployable offline on consumer phones — with an honest accounting of where the gains are real and where the scoring scaffolding is doing the work.

Most clinical-AI projects pick a model on vibes (“Gemma feels strong”, “MedGemma is medical”) and ship. That’s how 3% hallucination rates on dosages become coroner’s inquests. Before we shipped ChartLite, we built the benchmark first — 12 LLMs across 6 clinical-AI tasks, with a self-critical audit of every claim.

12 models tested: 6 cloud · 6 on-device
6 clinical-AI tasks: ~54K model-question evaluations
Apache 2.0: methodology + code + raw JSON; every number disprovable

Three findings that shaped what we shipped

1. Gemma 4 e4b is the first on-device model good enough to ship for clinical note generation.
On the peer-reviewed ACI-Bench (dialogue → SOAP note, n=207 × 5 splits), Gemma 4 e4b scores 82.74, effectively tied with Claude Haiku 4.5 (82.72) and 6 points behind the leader, GPT-5.5. On 156 real clinician-annotated Eka transcripts it scores 82.6, within 5 points of Haiku. It has zero API cost and runs entirely on the phone.
2. MedGemma 1.5 underperforms generic Gemma 4 on every clinical task.
MedGemma 1.5 trails by 34pp on ACI-Bench, 20pp on the Eka Medical Calculator, and 17pp on NFI Pharmacology MCQA, and it fails silently (returns unparseable output) on 60% of simple drug-pair extractions. Same parameter class as Gemma 4 e4b, opposite results. The medical fine-tune appears to have eroded reasoning more than it improved recall. Counter-intuitive but cleanly reproducible.
3. Adding a clinical knowledge graph (BODHI) lifts every model’s safety detection — but the size of the lift needs auditing.
Wiring Eka Care’s BODHI knowledge graph in as a safety-rule layer lifts raw alert detection by 26–63 percentage points across the 12 models. That number reads dramatic. We audited it. See below.

The honest BODHI audit

The raw +35pp average lift figure is what gets cited. The disaggregated picture is more interesting:

~40% real catch: the model would otherwise have missed this.
~60% scoring artifact: the model said the right thing in different words; our text-match scorer didn’t credit it.
~20% of BODHI alerts themselves are false positives (fuzzy match firing the wrong rule).

Net of artifact and noise, value lands where it matters most — small on-device models that frontline clinics actually run:

| Model | Arm 1 (LLM alone) | Arm 3 (LLM + BODHI) | Lift |
|---|---|---|---|
| Qwen 3.5 0.8B | 3% | 66% | +63pp |
| Gemma 4 e4b | 30% | 57% | +27pp |
| Claude Opus 4.7 | 38% | 80% | +42pp |
| Claude Sonnet 4.6 | 52% | 78% | +26pp |
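
The lift column and the audit split are simple arithmetic, sketched below. The rates are copied from the table; `split_lift` and its `real_share=0.40` default are illustrative names, and applying the aggregate ~40% share to any single model would be an assumption, since the audit reports it only in aggregate. The sketch also ignores the ~20% false-positive alerts.

```python
# Arm 1 / Arm 3 alert-detection rates (percent), copied from the table above.
RATES = {
    "Qwen 3.5 0.8B": (3, 66),
    "Gemma 4 e4b": (30, 57),
    "Claude Opus 4.7": (38, 80),
    "Claude Sonnet 4.6": (52, 78),
}


def lift_pp(arm1: float, arm3: float) -> float:
    """Raw lift in percentage points: Arm 3 rate minus Arm 1 rate."""
    return arm3 - arm1


def split_lift(lift: float, real_share: float = 0.40) -> tuple[float, float]:
    """Split a raw lift into (real catch, scoring artifact) parts.

    real_share=0.40 is the approximate aggregate figure from the audit;
    using it for an individual model is an assumption, not a measured value.
    """
    real = lift * real_share
    return real, lift - real


for model, (arm1, arm3) in RATES.items():
    print(f"{model}: +{lift_pp(arm1, arm3)}pp raw lift")

real, artifact = split_lift(35)  # the +35pp average lift quoted above
print(f"Average lift: about {real:.0f}pp real catch, {artifact:.0f}pp scoring artifact")
```

Read this way, the headline +35pp corresponds to roughly 14pp of detections the model would otherwise have missed.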

In this table the model with the highest baseline, Claude Sonnet 4.6 (52%), gains least (+26pp), and the smallest model, Qwen 3.5 0.8B (3%), gains most (+63pp). The story isn’t “BODHI replaces the LLM’s safety reasoning”; it’s “BODHI is a deterministic safety net under clinician judgement, with the biggest impact at the small-model tier, where it matters most for low-resource deployment.”

The six benchmarks

| # | Benchmark | n | Source | License |
|---|---|---|---|---|
| 1 | Synthetic 100: extraction + 3-arm safety | 100 encounters / 89 dangers | inlined | Apache 2.0 |
| 2 | Eka real transcripts: clinician-annotated dialogue → note | 156 cases | Eka Care | MIT |
| 3 | ACI-Bench: peer-reviewed dialogue → SOAP | 207 × 5 splits | Yim et al., Nature Scientific Data 2023 | CC BY 4.0 |
| 4 | CRESCENDDI: drug-drug interaction safety | 200 pairs | Lavertu et al., Nature Sci Data 2022 | CC0 |
| 5 | NFI Pharmacology MCQA: multiple-choice pharmacology exam | 925 questions | Eka Care, May 2026 | research-only |
| 6 | Medical Calculator Eval: clinical-math vignettes (incl. Hinglish) | 1,066 vignettes, 26 specialties | Eka Care, May 2026 | research-only |

How we measured

Silver ground truth from a 3-frontier-model panel

For the extraction task, three frontier models (Opus 4.7, GPT-5.4, Gemini 3.1 Pro) each emit {primary, accepts: [...]} per ground-truth item. The merger unions accept lists across models when items share a synonym group or fuzzy similarity ≥ 0.6, and preserves single-model items. Output: 1,320 items across 100 encounters; 61% with full 3-model consensus, mean 13 accept forms per item.

This captures clinical-language variation (full term + abbreviation + lay shorthand + brand/generic + spelling variants) better than single-annotator GT could. We acknowledge the trade-off explicitly: a panel of 3 LLMs can systematically miss the same things, biasing recall on the kinds of errors all three share.
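
The merge rule can be sketched as follows. This is a minimal illustration, not the repository's merger: it assumes fuzzy similarity is `difflib.SequenceMatcher.ratio()` on lowercased strings, and `same_item`, `merge_panel`, the `voters` field, and the toy synonym group are all hypothetical names introduced here.

```python
from difflib import SequenceMatcher

SIM_THRESHOLD = 0.6  # fuzzy-similarity cutoff stated in the text


def similar(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] (assumed metric)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def same_item(x: dict, y: dict, synonym_groups: list[set[str]]) -> bool:
    """True if any pair of surface forms shares a synonym group or is fuzzy-similar."""
    forms_x = {x["primary"], *x["accepts"]}
    forms_y = {y["primary"], *y["accepts"]}
    for fx in forms_x:
        for fy in forms_y:
            if any(fx.lower() in g and fy.lower() in g for g in synonym_groups):
                return True
            if similar(fx, fy) >= SIM_THRESHOLD:
                return True
    return False


def merge_panel(panel_outputs: list[list[dict]], synonym_groups: list[set[str]]) -> list[dict]:
    """Union accept lists across models; keep single-model items as they are."""
    merged: list[dict] = []
    for model_idx, items in enumerate(panel_outputs):
        for item in items:
            forms = {item["primary"], *item["accepts"]}
            for m in merged:
                if same_item(m, item, synonym_groups):
                    m["accepts"] |= forms       # union the accept forms
                    m["voters"].add(model_idx)  # track which models agreed
                    break
            else:
                merged.append({"primary": item["primary"],
                               "accepts": set(forms),
                               "voters": {model_idx}})
    return merged


# Toy panel output for one encounter (made-up data).
panel = [
    [{"primary": "hypertension", "accepts": ["high blood pressure"]}],
    [{"primary": "HTN", "accepts": []}],
    [{"primary": "Hypertensive", "accepts": []},
     {"primary": "type 2 diabetes", "accepts": ["T2DM"]}],
]
groups = [{"hypertension", "htn"}]  # lowercase synonym group (assumed format)
merged = merge_panel(panel, groups)
```

In this toy case "HTN" joins the hypertension item only through the synonym group, "Hypertensive" joins through fuzzy similarity, and the diabetes item survives as a single-model entry. Consensus can then be read off `len(item["voters"])`.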

Three-arm safety design

The arms are computed independently per encounter and reported separately. Bootstrap CIs (500 resamples, paired encounter-level) are surfaced on the leaderboard as [CI_low–CI_high].
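
A paired, encounter-level bootstrap of this kind can be sketched as below. It is an illustration under stated assumptions: the text does not say whether the interval is taken on each arm's rate or on the difference between arms, so this version uses a percentile interval on the mean per-encounter difference, and the function name, seed, and toy scores are hypothetical.

```python
import random


def paired_bootstrap_ci(arm_a, arm_b, n_resamples=500, alpha=0.05, seed=0):
    """Percentile CI for mean(arm_b) - mean(arm_a).

    Encounters are resampled with replacement, and the same resampled
    indices are used for both arms, which is what makes the bootstrap paired.
    """
    if len(arm_a) != len(arm_b) or not arm_a:
        raise ValueError("arms must be non-empty and the same length")
    rng = random.Random(seed)
    n = len(arm_a)
    deltas = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        deltas.append(sum(arm_b[i] - arm_a[i] for i in idx) / n)
    deltas.sort()
    lo = deltas[int((alpha / 2) * n_resamples)]
    hi = deltas[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return lo, hi


# Toy per-encounter detection scores (1 = danger caught), made up for illustration.
arm1 = [0, 0, 1, 0, 1, 0, 0, 1, 0, 0]
arm3 = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
ci_low, ci_high = paired_bootstrap_ci(arm1, arm3)
print(f"lift CI: [{ci_low:.2f}–{ci_high:.2f}]")
```

Pairing matters because encounter difficulty is shared across arms: resampling the arms independently would count that shared variation twice and widen the interval.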

Multi-axis miss adjudication

Unmatched extracted items are re-scored by a 3-stage pipeline (deterministic detectors + lexical retrieval + Opus-4.7 judge with cited transcript spans), classified on six independent axes. Result: ~40–50% of frontier-model “errors” trace to scorer issues or incomplete silver GT, not the model itself. That’s where the 60% scoring-artifact figure in the BODHI audit comes from.

Where we might be wrong

Every claim above has a corresponding caveat. The full list lives in docs/METHODOLOGY.md; the most important nine:

  1. Silver-danger list is panel-generated. The 89 expected dangers across the 100 encounters were emitted by the same 3-model panel that built the silver extractions. If the panel missed a danger BODHI catches, BODHI gets no credit; if the panel over-listed a category BODHI handles trivially, BODHI looks better than it is. Fix in flight: clinician-adjudicated 30–50 encounter gold subset.
  2. Silver GT shares two judges with the contestants. Opus 4.7 and GPT-5.4 are panel members and benchmark contestants. Their precision against silver GT inflates by their own contribution to the GT. Mitigated but not eliminated by the merger logic. Fix: swap one panel member out and regenerate.
  3. Note-quality judge shares contestants too. Opus is judge and contestant on note quality; so is GPT-5.4. Inter-judge agreement (Krippendorff α / Cohen κ) is not currently reported. Fix on the list.
  4. Single run per (model, mode, encounter). Earlier 5-run experiments on a small subset showed F1 σ ≈ 0.008 for cloud models; small-model variance was not measured. Leaderboard deltas under ~2 F1 should be treated as within-noise.
  5. Bootstrap CIs only on safety arms. Extraction F1 deltas, note-quality deltas, and the “tied group” indicator have no CIs yet. The per-encounter data is collected; just needs the bootstrap pass.
  6. Single-judge miss adjudication. The miss adjudicator uses Opus 4.7. Self-bias risk on Opus output. Fix: run a 10% sample with GPT-5.4 or Gemini and check agreement.
  7. Synthetic encounter coverage. The 100 encounters are curated to exercise specific safety-rule triggers, not to reflect a real LMIC primary-care distribution. Fix: retrospective comparison against a sample of real PHC encounters.
  8. No external cross-validation. The benchmark has not been compared to public medical NLP benchmarks (MedQA, PubMedQA, BioASQ, n2c2). Fix: run a 100-question MedQA subset through the top three and verify rough rank order.
  9. BODHI lab-rec scoring is category-level. Strong LLMs sometimes recommend more specific labs than BODHI’s category mapping covers, so they appear to fail when they actually did better. Affects 1–2% of total dangers.

If any of these caveats invalidates a number you care about, the raw per-(model, case) JSON in scripts/{bench}_raw/ is the canonical record. Aggregates can be regenerated from raw at any time.
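
Caveats 3 and 6 both come down to an agreement check between two judges. A minimal sketch of Cohen's κ for two judges labelling the same sample is shown below; the label names and toy data are hypothetical, and this is not code from the repository.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and the same length")
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: probability that both raters pick the same label independently.
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label; kappa is undefined, treat as agreement
    return (observed - expected) / (1 - expected)


# Toy adjudication labels from two judges on the same 8 misses (made-up data).
judge_opus = ["scorer", "scorer", "model", "model", "scorer", "model", "scorer", "model"]
judge_other = ["scorer", "scorer", "model", "scorer", "scorer", "model", "model", "model"]
kappa = cohen_kappa(judge_opus, judge_other)
print(f"kappa = {kappa:.2f}")
```

Here the judges agree on 6 of 8 items (0.75 observed) against 0.5 expected by chance, giving κ = 0.5. Raw percent agreement alone would overstate how much the second judge corroborates the first.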

Dashboard — the interactive view

https://benchmark.chartlite.health

Tabs cover: Leaderboard, Eka real, ACI-Bench, CRESCENDDI, NFI Pharmacology, Calculators, per-Encounter explorer, per-Model explorer, Miss analysis, Methodology, and Prompts.

Reproduce any number

```shell
git clone https://github.com/prismindanalytics/clinical-edge-bench
cd clinical-edge-bench
pip install -r scripts/requirements.txt
export ANTHROPIC_API_KEY=sk-…
python3 scripts/benchmark_pharmacology_mcqa.py --models claude-haiku --limit 5
```

Full reproduction recipe in scripts/README.md. Data is fetched on demand, not bundled. Run window for the May 2026 numbers: April 19 – May 7. Exact API model handles and Ollama tags in docs/MODEL_VERSIONS.md.

Citation

Clinical-AI Edge Benchmark. An open benchmark of 12 LLMs across 6 clinical-AI tasks for offline on-device deployment in resource-constrained settings. Run window April 19 – May 7, 2026. https://benchmark.chartlite.health · github.com/prismindanalytics/clinical-edge-bench

Research / methodology benchmark, not clinical certification. Issues, methodology critiques, and pull requests welcome at github.com/prismindanalytics/clinical-edge-bench. Apache 2.0; third-party datasets retain their original licenses.