Read this if you want the engineering depth without watching the second video. Every claim here traces to source code in the repo or to raw data on the dashboard at https://benchmark.chartlite.health.
1 — Two Gemma 4 sizes, one codebase
ChartLite picks the right on-device LLM for each phone automatically via a single call:
```kotlin
// app/.../extraction/LlmModelManager.kt
fun recommendedTierForRam(ramGb: Double): ModelTier {
    return when {
        ramGb >= 6.0 -> ModelTier.GEMMA_4_E4B   // MediaPipe LiteRT
        ramGb >= 4.0 -> ModelTier.GEMMA_4_E2B   // MediaPipe LiteRT
        else -> ModelTier.QWEN_3_5_0_8B         // MNN-LLM fallback
    }
}
```
No user configuration is needed: phones with 6 GB of RAM or more get Gemma 4 E4B, phones with 4 to 6 GB still get Gemma 4 (E2B), and devices below 4 GB fall back to Qwen 3.5 0.8B.
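The thresholds are inclusive lower bounds, which is easy to check with a standalone copy of the function. The sketch below redeclares `ModelTier` locally (the app's real enum may carry more fields) and probes both boundaries; the sample RAM values are illustrative, not measurements from real devices.

```kotlin
// Standalone copy for illustration; the app's ModelTier may differ.
enum class ModelTier { GEMMA_4_E4B, GEMMA_4_E2B, QWEN_3_5_0_8B }

fun recommendedTierForRam(ramGb: Double): ModelTier = when {
    ramGb >= 6.0 -> ModelTier.GEMMA_4_E4B
    ramGb >= 4.0 -> ModelTier.GEMMA_4_E2B
    else -> ModelTier.QWEN_3_5_0_8B
}

fun main() {
    // Illustrative RAM sizes in GB, including both boundary values.
    for (ram in listOf(8.0, 6.0, 5.9, 4.0, 3.9)) {
        println("$ram GB -> ${recommendedTierForRam(ram)}")
    }
}
```

Because the comparison is `>=`, a device reporting exactly 6.0 GB lands in the E4B tier and one reporting 5.9 GB lands in E2B.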
2 — Gemma 4 via MediaPipe LiteRT
Gemma 4 ships as INT4-quantized .task bundles via Google AI Edge LiteRT.
GemmaBridge.kt wraps the API directly:
```kotlin
val opts = LlmInference.LlmInferenceOptions.builder()
    .setModelPath(taskFile.absolutePath)
    .setMaxTokens(4096)
    .build()
llm = LlmInference.createFromOptions(context, opts)
```
Models are downloaded from huggingface.co/litert-community/gemma-4-E{2,4}B-it-litert-lm
on first launch. The runtime provides native chat-template handling, deterministic
seeded sampling, and NPU acceleration when present.
3 — Function calling, on-device-flavour
The cloud Gemma family has a native function-calling API. The on-device variant
via MediaPipe does not, so we adapt: CdssToolRegistry.kt asks Gemma 4 to emit
a JSON array of {name, args} tool calls, and the dispatcher parses them and
executes each one deterministically against the existing StaticCDSS layer.
Four tools are registered against the BODHI knowledge graph:
- check_drug_drug_interactions(meds: string[])
- check_drug_allergy(meds: string[], allergies: string[])
- check_drug_condition(meds: string[], diagnoses: string[])
- check_triage_urgency(diagnoses: string[])
Tool calling is reliable on E4B and reasonable on E2B. In the demo's clinical-encounter beat, the model chooses which two tools to invoke after seeing a prescription photo and the patient context. BODHI's triage table then sees a 4-year-old with pneumonia, a respiratory rate of 40 (alarming for that age) and an oxygen saturation of 94 % (low), and escalates to EMERGENCY. The language model on its own missed the case because it never combined the three numbers.
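The execute half of that parse-and-execute loop can be sketched as follows. This is a minimal illustration, not the code in CdssToolRegistry.kt: `ToolCall`, `ToolSpec`, `ToolResult` and `dispatch` are hypothetical names, JSON parsing is assumed to have happened already, and the handler bodies are stubs standing in for the StaticCDSS lookups. Only the four tool names and their argument names come from the list above.

```kotlin
// A parsed tool call: the {name, args} shape the model is asked to emit.
data class ToolCall(val name: String, val args: Map<String, List<String>>)

// Outcome of dispatching one call.
sealed class ToolResult {
    data class Ok(val tool: String, val alerts: List<String>) : ToolResult()
    data class Rejected(val tool: String, val reason: String) : ToolResult()
}

// Hypothetical spec: required argument names plus a deterministic handler.
class ToolSpec(
    val requiredArgs: List<String>,
    val run: (Map<String, List<String>>) -> List<String>
)

// Stub handler: echoes what a real lookup would have checked.
fun stub(label: String): (Map<String, List<String>>) -> List<String> =
    { args -> listOf("stub $label check on $args") }

val registry: Map<String, ToolSpec> = mapOf(
    "check_drug_drug_interactions" to ToolSpec(listOf("meds"), stub("interaction")),
    "check_drug_allergy" to ToolSpec(listOf("meds", "allergies"), stub("allergy")),
    "check_drug_condition" to ToolSpec(listOf("meds", "diagnoses"), stub("condition")),
    "check_triage_urgency" to ToolSpec(listOf("diagnoses"), stub("triage"))
)

// Unknown tools and calls with missing arguments are rejected, never guessed at.
fun dispatchOne(call: ToolCall): ToolResult {
    val spec = registry[call.name]
        ?: return ToolResult.Rejected(call.name, "unknown tool")
    val missing = spec.requiredArgs.filterNot { it in call.args }
    return if (missing.isNotEmpty()) {
        ToolResult.Rejected(call.name, "missing args: $missing")
    } else {
        ToolResult.Ok(call.name, spec.run(call.args))
    }
}

fun dispatch(calls: List<ToolCall>): List<ToolResult> = calls.map(::dispatchOne)

fun main() {
    val calls = listOf(
        ToolCall("check_triage_urgency", mapOf("diagnoses" to listOf("pneumonia"))),
        ToolCall("delete_records", emptyMap())  // not registered, so rejected
    )
    dispatch(calls).forEach(::println)
}
```

Keeping the registry as a closed allow-list means the model can only select among vetted checks; anything else it emits is dropped before it reaches the rules layer.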
4 — BODHI honest audit
The dashboard's three-arm safety design (LLM-alone / production rules / rules + BODHI) shows a gross +26–63 pp lift in safety detection across 12 models when BODHI is wired in. We audited that lift on the 35 dangers that GPT-5.5 missed on its own and Arm 3 caught:
- ~40 % were genuine clinical catches: the LLM never raised the danger (e.g. combining a pneumonia diagnosis with abnormal vital signs into a single EMERGENCY triage; the model had each fact but didn't act on the combination).
- ~60 % were substring-match scoring artefacts: the LLM said the equivalent in different words, and the substring scorer didn't credit it.
- Separately, ~20 % of BODHI alerts are false positives: a drug flagged as unindicated when it actually was indicated (9 / 134), plus a fuzzy referral match firing the wrong rule (16 / 134, e.g. "Chikungunya" suggested on any febrile case). Together that is 25 / 134, or about 19 %.
Net of artefact and noise, a real clinical-safety contribution remains, largest where it matters most: Qwen 3.5 0.8B goes 3 → 66 %, Gemma 4 E4B 30 → 57 %, Opus 4.7 38 → 80 %. Sonnet 4.6 gains the least: top cloud models already score near the ceiling, so BODHI has less to add.
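The arithmetic behind these figures is simple enough to recheck by hand. The sketch below uses only the counts and scores quoted in this section; the function names are illustrative and not part of the benchmark scripts.

```kotlin
import java.util.Locale

// Share of BODHI alerts that were false positives, in percent.
fun falsePositiveRatePct(unindicatedFlags: Int, wrongReferralMatches: Int, totalAlerts: Int): Double =
    100.0 * (unindicatedFlags + wrongReferralMatches) / totalAlerts

// Gross lift in percentage points between the LLM-alone and rules + BODHI arms.
fun liftPp(before: Int, after: Int): Int = after - before

fun main() {
    val fp = falsePositiveRatePct(9, 16, 134)
    println("False positives: " + "%.1f".format(Locale.ROOT, fp) + " % of alerts")

    val scores = listOf(
        Triple("Qwen 3.5 0.8B", 3, 66),
        Triple("Gemma 4 E4B", 30, 57),
        Triple("Opus 4.7", 38, 80)
    )
    for ((model, before, after) in scores) {
        println("$model: +${liftPp(before, after)} pp")
    }
}
```

The Qwen 3.5 0.8B lift of 63 pp matches the top of the quoted +26–63 pp range, and the combined false-positive share of 25 / 134 rounds to the stated ~20 %.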
The implication is the opposite of the marketing default: BODHI is not a generative model substitute, it is a deterministic safety net under clinician judgement. ChartLite renders alerts with severity tiers and an audit trail in the standard medical-terminology system (SNOMED); the clinician decides.
5 — Multilingual by default
Three layers cover the language ladder:
| Layer | Coverage | Source |
|---|---|---|
| Gemma 4 reasoning | 140+ languages | Google model card |
| Parakeet TDT v3 speech-to-text | 26 languages (English + 25 EU) at 1.69 % word-error rate | NVIDIA, on-device via Sherpa-ONNX |
| Omnilingual ASR | 1,600+ languages | Meta, on-device via ONNX Runtime |
ChartLite's ModelDownloader.rankTiersForDevice(language) picks the right
speech model for each combination of language and device RAM. The Eka
Calculator dataset is Hindi-English code-switched clinical prose (the way
Indian clinicians actually talk in the consult room), which makes it a real
test of multilingual ability; Gemma 4 handles it natively without translation.
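One way such a ranking could work is sketched below. This is not the implementation of ModelDownloader.rankTiersForDevice: `SpeechModel`, `rankSpeechModels`, the explicit `ramGb` parameter, the language subsets and the RAM floors are all hypothetical placeholders chosen for illustration. The only grounded assumption is the preference order, a specialised model first when it covers the language and the broad-coverage model otherwise.

```kotlin
// Hypothetical descriptor; ChartLite's real types are not shown in this writeup.
// languages == null stands in for "broad coverage, accept any language code".
data class SpeechModel(val name: String, val languages: Set<String>?, val minRamGb: Double)

// Preference order: the specialised model first, the broad-coverage model second.
// Language sets are a small illustrative subset; minRamGb values are placeholders.
val candidates = listOf(
    SpeechModel("parakeet-tdt-v3", setOf("en", "de", "fr", "es"), minRamGb = 4.0),
    SpeechModel("omnilingual-asr", null, minRamGb = 3.0)
)

// Keep the models that cover the language and fit in RAM, in preference order.
fun rankSpeechModels(language: String, ramGb: Double): List<SpeechModel> =
    candidates
        .filter { it.languages?.contains(language) ?: true }
        .filter { ramGb >= it.minRamGb }

fun main() {
    println(rankSpeechModels("en", 8.0).map { it.name })
    println(rankSpeechModels("hi", 8.0).map { it.name })
    println(rankSpeechModels("en", 2.0).map { it.name })
}
```

Returning a ranked list rather than a single choice lets the caller fall through to the next candidate if a download fails; an empty list signals that no on-device speech model fits.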
6 — Reproduce every number
Each of the ~54 K model-question evaluations on the dashboard (12 models ×
6 benchmarks) traces to a per-(model, case) JSON file preserved in
scripts/{benchmark}_raw/. A judge can:
```shell
git clone https://github.com/prismindanalytics/chartlite            # app + integration
git clone https://github.com/prismindanalytics/clinical-edge-bench  # the benchmark suite
# Run the next two commands from the root of the checkout that contains scripts/
pip install -r scripts/requirements.txt
python3 scripts/benchmark_pharmacology_mcqa.py --models gemma4-e4b --limit 50
```
They can also pull any number off the dashboard as curl-able raw JSON. If a single
number doesn't reproduce, that's a bug; please open an issue.
What's deliberately not on the dashboard
- We don't claim ChartLite is deployed in any production clinic. It is production-ready and awaiting a clinical pilot. We have country configurations (medical codes, formulary, language packs) for South Africa, Ethiopia, Malawi, Kenya, Nigeria, US, UK and India; these are configurations, not deployments.
- The 100-encounter synthetic safety benchmark is a directional research signal, not a clinical certification. The "ground truth" answers it grades against were themselves generated by an LLM panel (Opus 4.7 + GPT-5.4 + Gemini 3.1 Pro), two of which also appear among the models we test. We declare this openly in the Methodology tab.
- One narrow scoring quirk: BODHI groups lab tests into categories (e.g. "liver-function tests"), so a model that recommends a more specific lab by name sometimes won't get credit. Affects ~1–2 % of total dangers.
Where to look next
- Live demo + APK: https://chartlite.health/hackathon
- Source code: https://github.com/prismindanalytics/chartlite (Apache 2.0)
- Benchmark dashboard: https://benchmark.chartlite.health — every number above is reachable here.
- Submission video: see Kaggle Media Gallery.
Apache 2.0, except where BODHI's CC BY-NC 4.0 applies. Built on Google's Gemma 4, MediaPipe LLM Inference, NVIDIA Parakeet TDT v3, Meta Omnilingual ASR, Sherpa-ONNX, llama.cpp, and Eka Care's BODHI clinical knowledge graph.