Most clinical-AI projects pick a model on vibes (“Gemma feels strong”, “MedGemma is medical”) and ship. That’s how 3% hallucination rates on dosages become coroner’s inquests. Before we shipped ChartLite, we built the benchmark first — 12 LLMs across 6 clinical-AI tasks, with a self-critical audit of every claim.
Three findings that shaped what we shipped
The honest BODHI audit
The raw +35pp average lift figure is what gets cited. The disaggregated picture is more interesting:
Net of artifact and noise, value lands where it matters most — small on-device models that frontline clinics actually run:
| Model | Arm 1 (LLM alone) | Arm 3 (LLM + BODHI) | Lift |
|---|---|---|---|
| Qwen 3.5 0.8B | 3% | 66% | +63pp |
| Gemma 4 e4b | 30% | 57% | +27pp |
| Claude Opus 4.7 | 38% | 80% | +42pp |
| Claude Sonnet 4.6 | 52% | 78% | +26pp |
The largest lift goes to the smallest model: Qwen 3.5 0.8B gains 63pp, while Claude Sonnet 4.6, which starts highest in Arm 1, gains least (+26pp). The story isn’t “BODHI replaces the LLM’s safety reasoning”. It’s “BODHI is a deterministic safety net under clinician judgement, with the biggest impact at the small-model tier where it matters most for low-resource deployment.”
The six benchmarks
| # | Benchmark | n | Source | License |
|---|---|---|---|---|
| 1 | Synthetic 100 — extraction + 3-arm safety | 100 encounters / 89 dangers | inlined | Apache 2.0 |
| 2 | Eka real transcripts — clinician-annotated dialogue → note | 156 cases | Eka Care | MIT |
| 3 | ACI-Bench — peer-reviewed dialogue → SOAP | 207 × 5 splits | Yim et al., Scientific Data 2023 | CC BY 4.0 |
| 4 | CRESCENDDI — drug-drug interaction safety | 200 pairs | Lavertu et al., Scientific Data 2022 | CC0 |
| 5 | NFI Pharmacology MCQA — multiple-choice pharmacology exam | 925 questions | Eka Care, May 2026 | research-only |
| 6 | Medical Calculator Eval — clinical-math vignettes (incl. Hinglish) | 1,066 vignettes, 26 specialties | Eka Care, May 2026 | research-only |
How we measured
Silver ground truth from a 3-frontier-model panel
For the extraction task, three frontier models (Opus 4.7, GPT-5.4, Gemini 3.1 Pro) each emit {primary, accepts: [...]} per ground-truth item. The merger unions accept lists across models when items share a synonym group or fuzzy similarity ≥ 0.6, and preserves single-model items. Output: 1,320 items across 100 encounters; 61% with full 3-model consensus, mean 13 accept forms per item.
This captures clinical-language variation (full term + abbreviation + lay shorthand + brand/generic + spelling variants) better than single-annotator GT could. We acknowledge the trade-off explicitly: a panel of 3 LLMs can systematically miss the same things, biasing recall on the kinds of errors all three share.
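The merge step described above can be sketched roughly as follows. This is a minimal illustration, not the benchmark's merger: the similarity metric is not named in this write-up, so `difflib.SequenceMatcher` stands in for it, the synonym-group lookup is omitted, and the item fields (`primary`, `accepts`, `model`) and function names are assumptions made for the example.

```python
from difflib import SequenceMatcher

THRESHOLD = 0.6  # fuzzy-similarity cutoff quoted in the text


def similar(a: str, b: str) -> bool:
    """Stand-in similarity test (assumption: the real metric may differ)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= THRESHOLD


def merge_panel(panel_items: list[dict]) -> list[dict]:
    """Union accept lists of items that match on any surface form.

    Items that match nothing from another model are kept as
    single-model entries, as the text describes.
    """
    merged: list[dict] = []
    for item in panel_items:
        forms = {item["primary"], *item["accepts"]}
        for group in merged:
            if any(similar(f, g) for f in forms for g in group["accepts"]):
                group["accepts"] |= forms          # union the accept lists
                group["models"].add(item["model"])  # track consensus
                break
        else:
            merged.append(
                {"primary": item["primary"], "accepts": set(forms), "models": {item["model"]}}
            )
    return merged


# Hypothetical panel output for one encounter.
panel = [
    {"primary": "hypertension", "accepts": ["HTN", "high blood pressure"], "model": "a"},
    {"primary": "Hypertension", "accepts": ["high BP"], "model": "b"},
    {"primary": "metformin", "accepts": ["Glucophage"], "model": "c"},
]
merged = merge_panel(panel)
for group in merged:
    print(group["primary"], sorted(group["accepts"]), len(group["models"]))
```

In this toy input the two hypertension items collapse into one group backed by two models, while the metformin item survives as a single-model entry. A greedy first-match merge like this is order-dependent, which is one reason to treat it as a sketch only.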
Three-arm safety design
- Arm 1: dedicated LLM safety-review prompt — no rules, no knowledge graph.
- Arm 2: structured extraction → rule engine (drug-allergy, drug-drug, dosage, vitals).
- Arm 3: Arm 2 + BODHI knowledge graph (drug-condition, triage, lab recommendation, referral).
The arms are computed independently per encounter and reported separately. Bootstrap CIs (500 resamples, paired encounter-level) are surfaced on the leaderboard as [CI_low–CI_high].
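For readers unfamiliar with paired encounter-level bootstrapping, here is a minimal sketch of how such an interval could be computed. It assumes one score per encounter per arm (for example, the fraction of expected dangers caught) and reports a percentile interval on the lift between two arms. The function name, the per-encounter scoring, and the percentile method are assumptions for illustration; the benchmark's own scripts may differ.

```python
import random


def paired_bootstrap_ci(arm_a, arm_b, n_resamples=500, alpha=0.05, seed=0):
    """Percentile CI for the mean difference (arm_b - arm_a).

    Both lists hold per-encounter scores in the same encounter order.
    Each resample draws encounters with replacement and keeps the
    pairing intact, so both arms are always scored on the same cases.
    """
    if len(arm_a) != len(arm_b) or not arm_a:
        raise ValueError("arms must cover the same non-empty set of encounters")
    rng = random.Random(seed)
    n = len(arm_a)
    deltas = []
    for _ in range(n_resamples):
        idx = [rng.randrange(n) for _ in range(n)]
        deltas.append(sum(arm_b[i] - arm_a[i] for i in idx) / n)
    deltas.sort()
    low = deltas[int((alpha / 2) * n_resamples)]
    high = deltas[int((1 - alpha / 2) * n_resamples) - 1]
    return low, high


# Toy per-encounter scores (hypothetical, not benchmark data).
arm1 = [0.0, 0.5, 0.0, 1.0, 0.0, 0.5, 0.0, 0.0]
arm3 = [1.0, 1.0, 0.5, 1.0, 0.5, 1.0, 1.0, 0.0]
low, high = paired_bootstrap_ci(arm1, arm3)
print(f"lift CI: [{low:.2f}–{high:.2f}]")
```

Resampling encounter indices rather than each arm separately is what makes the interval paired: encounter difficulty cancels out of the difference instead of widening it.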
Multi-axis miss adjudication
Unmatched extracted items are re-scored by a 3-stage pipeline (deterministic detectors + lexical retrieval + Opus-4.7 judge with cited transcript spans), classified on six independent axes. Result: ~40–50% of frontier-model “errors” trace to scorer issues or incomplete silver GT, not the model itself. That’s where the 60% scoring-artifact figure in the BODHI audit comes from.
Where we might be wrong
Every claim above has a corresponding caveat. The full list lives in docs/METHODOLOGY.md; the most important nine:
- Silver-danger list is panel-generated. The 89 expected dangers across the 100 encounters were emitted by the same 3-model panel that built the silver extractions. If the panel missed a danger BODHI catches, BODHI gets no credit; if the panel over-listed a category BODHI handles trivially, BODHI looks better than it is. Fix in flight: a clinician-adjudicated gold subset of 30–50 encounters.
- Silver GT shares two judges with the contestants. Opus 4.7 and GPT-5.4 are both panel members and benchmark contestants, so their precision against silver GT is inflated by their own contribution to it. The merger logic mitigates this but does not eliminate it. Fix: swap one panel member out and regenerate.
- Note-quality judge shares contestants too. Opus is judge and contestant on note quality; so is GPT-5.4. Inter-judge agreement (Krippendorff α / Cohen κ) is not currently reported. Fix on the list.
- Single run per (model, mode, encounter). Earlier 5-run experiments on a small subset showed F1 σ ≈ 0.008 for cloud models; small-model variance was not measured. Leaderboard deltas under ~2 F1 points should be treated as within-noise.
- Bootstrap CIs only on safety arms. Extraction F1 deltas, note-quality deltas, and the “tied group” indicator have no CIs yet. The per-encounter data is already collected; it just needs the bootstrap pass.
- Single-judge miss adjudication. The miss adjudicator uses Opus 4.7. Self-bias risk on Opus output. Fix: run a 10% sample with GPT-5.4 or Gemini and check agreement.
- Synthetic encounter coverage. The 100 encounters are curated to exercise specific safety-rule triggers, not to reflect a real LMIC primary-care distribution. Fix: retrospective comparison against a sample of real PHC encounters.
- No external cross-validation. The benchmark has not been compared to public medical NLP benchmarks (MedQA, PubMedQA, BioASQ, n2c2). Fix: run a 100-question MedQA subset through the top three and verify rough rank order.
- BODHI lab-rec scoring is category-level. Strong LLMs sometimes recommend more specific labs than BODHI’s category mapping covers, so they appear to fail when they actually did better. Affects 1–2% of total dangers.
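Two of the caveats above call for an agreement check between judges. As a rough sketch of what that check could look like, the function below computes Cohen's κ for two judges labelling the same items. The label names in the example are hypothetical placeholders, not the benchmark's six adjudication axes.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two judges labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: share of items where both judges agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each judge's own label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both judges used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical verdicts from two judges on six missed items.
judge_a = ["model", "scorer", "gt", "model", "scorer", "model"]
judge_b = ["model", "scorer", "model", "model", "gt", "model"]
kappa = cohen_kappa(judge_a, judge_b)
print(f"kappa = {kappa:.3f}")
```

Here the judges agree on 4 of 6 items, but κ comes out noticeably lower than 0.67 because both judges use the most common label often, so a good share of that agreement is expected by chance.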
If any of these caveats invalidates a number you care about, the raw per-(model, case) JSON in scripts/{bench}_raw/ is the canonical record. Aggregates can be regenerated from raw at any time.
Dashboard — the interactive view
Tabs cover: Leaderboard, Eka real, ACI-Bench, CRESCENDDI, NFI Pharmacology, Calculators, per-Encounter explorer, per-Model explorer, Miss analysis, Methodology, and Prompts.
Reproduce any number
```shell
git clone https://github.com/prismindanalytics/clinical-edge-bench
cd clinical-edge-bench
pip install -r scripts/requirements.txt
export ANTHROPIC_API_KEY=sk-…
python3 scripts/benchmark_pharmacology_mcqa.py --models claude-haiku --limit 5
```
Full reproduction recipe in scripts/README.md. Data is fetched on demand, not bundled. Run window for the May 2026 numbers: April 19 – May 7. Exact API model handles and Ollama tags in docs/MODEL_VERSIONS.md.
Citation
Clinical-AI Edge Benchmark. An open benchmark of 12 LLMs across 6 clinical-AI tasks for offline on-device deployment in resource-constrained settings. Run window April 19 – May 7, 2026. https://benchmark.chartlite.health · github.com/prismindanalytics/clinical-edge-bench
Research / methodology benchmark, not clinical certification. Issues, methodology critiques, and pull requests welcome at github.com/prismindanalytics/clinical-edge-bench. Apache 2.0; third-party datasets retain their original licenses.