Intelligence that fits in your pocket.
A complete clinical AI pipeline — speech recognition, structured extraction, and decision support — running entirely on a $100 phone.
End-to-end pipeline
From the clinician's voice to a structured clinical encounter in six stages, all on-device. The structured output feeds directly into the billing engine for automated ICD-10 to CPT/HCPCS claim generation and SOAP note production.
Offline speech recognition
sherpa-onnx powered ASR — 5 architectures across 7 model tiers, hardware-aware selection, fully offline. From 43 MB on ultra-budget phones to 1 GB for maximum accuracy.
| Tier | Model | Architecture | Size | Target |
|---|---|---|---|---|
| Moonshine Tiny | Moonshine v2 (English) | Encoder-Decoder | 43 MB | Ultra-low RAM (<2 GB) |
| Moonshine Base | Moonshine v2 (English) | Encoder-Decoder | 140 MB | 2+ GB RAM, ~7.4% WER |
| medASR | medASR CTC (medical English) | CTC | 154 MB | Recommended for clinical use |
| SenseVoice | SenseVoice Small (ZH/EN/JA/KO/YUE) | Sense | 239 MB | CJK languages |
| Omnilingual 300M | CTC 300M (1600+ languages) | CTC | 365 MB | 2–4 GB RAM, multilingual |
| Parakeet TDT | Parakeet TDT v3 (EN + 25 EU langs) | Transducer | 671 MB | Best English accuracy, 4+ GB |
| Omnilingual 1B | CTC 1B (1600+ languages) | CTC | 1.03 GB | Highest multilingual accuracy |
Why medASR matters: General-purpose ASR models often misrecognize medical terminology — drug names, anatomy, procedures. The medASR tier is trained specifically on medical speech and achieves significantly lower word error rates on clinical vocabulary. For highest medical accuracy, cloud ASR (Gemini Flash Lite, Deepgram Nova, OpenAI gpt-4o Transcribe) via the ChartLite proxy is recommended — cloud models handle medical terms better than any on-device model.
Dual-mode: sherpa-onnx on-device when offline, cloud ASR (Gemini Flash Lite, Deepgram Nova, or OpenAI gpt-4o Transcribe) when connected. Hardware-aware tier selection automatically picks the best model for each device. Automatic fallback ensures voice capture always works.
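The hardware-aware tier selection described above can be sketched roughly as follows. This is an illustrative sketch only: the RAM floors are assumptions inferred from the tier table, and the medASR and SenseVoice tiers are omitted for brevity.

```python
# Illustrative ASR tier picker: choose the largest model the device can hold.
# Tier names and sizes mirror the table above; RAM floors are assumptions.
ASR_TIERS = [
    # (name, size_mb, min_ram_gb, multilingual)
    ("Moonshine Tiny",   43,   0.0, False),
    ("Moonshine Base",   140,  2.0, False),
    ("Omnilingual 300M", 365,  2.0, True),
    ("Parakeet TDT",     671,  4.0, False),
    ("Omnilingual 1B",   1030, 6.0, True),
]

def select_asr_tier(ram_gb: float, multilingual: bool = False) -> str:
    """Pick the biggest tier whose RAM floor the device meets."""
    candidates = [
        (name, size_mb)
        for name, size_mb, floor, multi in ASR_TIERS
        if ram_gb >= floor and multi == multilingual
    ]
    return max(candidates, key=lambda t: t[1])[0]
```

In the real app the same idea extends to the cloud/on-device switch: when connectivity is available, cloud ASR outranks every local tier.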
Help build better medical ASR
General-purpose speech models struggle with clinical vocabulary. We're building specialized medical ASR for low-resource settings — and we need real-world voice data to get there.
If you have hours of medical voice recordings — clinical consultations, dictation, patient interactions — in any language, we'd love to collaborate on fine-tuning ASR models that understand medicine.
Retrieval-augmented extraction
Instead of stuffing 815 reference entries into every prompt, we retrieve only what's relevant.
Index
At app startup, a TF-IDF vector store indexes 300 ICD-10 codes + 515 formulary drugs (~20–50 ms).
Retrieve
Per transcript, cosine similarity retrieves the 10–15 most relevant codes and drugs.
Prompt
Compact prompt: instructions + retrieved references + transcript.
Generate
Qwen 3.5 runs extraction with ~80% more of the context window free for the transcript and generated output.
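The index-retrieve steps above amount to TF-IDF vectors plus cosine similarity. A stdlib-only toy version, with made-up reference entries standing in for the 815 real ones:

```python
import math
from collections import Counter

# Toy reference entries (the real index holds 300 ICD-10 codes + 515 drugs).
REFERENCES = [
    "J06.9 acute upper respiratory infection cough sore throat",
    "E11.9 type 2 diabetes mellitus hyperglycemia",
    "0097 paracetamol acetaminophen analgesic antipyretic fever",
]

def _tokens(text):
    return text.lower().split()

def _vec(text, idf):
    # TF-IDF weight per term; unknown terms get zero weight.
    tf = Counter(_tokens(text))
    return {t: c * idf.get(t, 0.0) for t, c in tf.items()}

def build_index(docs):
    n = len(docs)
    df = Counter(t for d in docs for t in set(_tokens(d)))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return idf, [_vec(d, idf) for d in docs]

def _cosine(a, b):
    dot = sum(v * b.get(t, 0.0) for t, v in a.items())
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query, docs, k=2):
    idf, vecs = build_index(docs)
    q = _vec(query, idf)
    ranked = sorted(zip(docs, vecs), key=lambda dv: _cosine(q, dv[1]), reverse=True)
    return [d for d, _ in ranked[:k]]

hits = retrieve("patient with fever and sore throat give paracetamol", REFERENCES)
```

Only the matching URTI and paracetamol entries survive the top-k cut; the diabetes entry scores zero and never reaches the prompt.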
| Component | Before (static prompt) | After (RAG pipeline) |
|---|---|---|
| Reference data | ~6,000 tokens (815 entries) | ~400–800 tokens (15–25 entries) |
| Available for transcript | ~700 tokens | ~5,000+ tokens |
| Available for generation | ~1,000 tokens | ~2,000+ tokens |
| Disambiguation quality | Low (no keywords/aliases) | High (retrieved entries include keywords + local terms) |
Unified JSON extraction format
A single benchmark schema shared by all 6 extraction strategies — consistent output regardless of inference path. On-device uses TOON (Token-Oriented Object Notation) for 40–60% token savings vs JSON, with automatic JSON fallback parsing.
{
"diagnoses": [
{
"icd10Code": "J06.9",
"description": "Upper resp. infection",
"isPrimary": true,
"confidence": 0.9
}
],
"medications": [
{
"formularyCode": "0097",
"name": "Paracetamol",
"dose": 500,
"unit": "mg",
"frequency": "TDS"
}
]
}
One schema, six extraction strategies. Hallucination guards and field validation run identically across all paths. The structured JSON output feeds directly into the billing module for automated insurance claim generation (ICD-10 to CPT/HCPCS mapping, E/M level coding) and SOAP note production.
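One way the hallucination guard can work is to reject any extracted code that was not among the retrieved references, alongside basic field validation. A hypothetical sketch (the guard logic and retrieved sets here are assumptions, not the app's actual rules):

```python
import json

# Codes that retrieval actually surfaced for this transcript (illustrative).
RETRIEVED_ICD10 = {"J06.9", "J02.9"}
RETRIEVED_FORMULARY = {"0097", "0141"}

def validate(raw: str) -> dict:
    """Drop diagnoses/medications whose codes the model invented,
    plus entries with out-of-range or malformed fields."""
    enc = json.loads(raw)
    enc["diagnoses"] = [
        d for d in enc.get("diagnoses", [])
        if d.get("icd10Code") in RETRIEVED_ICD10
        and 0.0 <= d.get("confidence", -1.0) <= 1.0
    ]
    enc["medications"] = [
        m for m in enc.get("medications", [])
        if m.get("formularyCode") in RETRIEVED_FORMULARY
        and isinstance(m.get("dose"), (int, float)) and m["dose"] > 0
    ]
    return enc

raw = """{"diagnoses": [
  {"icd10Code": "J06.9", "description": "URTI", "isPrimary": true, "confidence": 0.9},
  {"icd10Code": "Z99.9", "description": "hallucinated", "isPrimary": false, "confidence": 0.4}],
"medications": [
  {"formularyCode": "0097", "name": "Paracetamol", "dose": 500, "unit": "mg", "frequency": "TDS"}]}"""
enc = validate(raw)
```

The hallucinated `Z99.9` entry is filtered out; the grounded diagnosis and medication pass through unchanged.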
Two ways to capture
Tap for ambient conversation recording or hold for structured dictation snippets — same mic button, two interaction patterns optimized for different clinical workflows.
Ambient Conversation
Continuous recording of the full patient-clinician dialogue. Natural conversation captured without interruption.
- Tap once to start, tap again to stop
- No silence auto-stop — pauses preserved
- Best with cloud ASR or larger on-device models
Structured Snippets
Short structured phrases — vitals, medications, diagnoses — dictated one at a time and accumulated into an encounter.
- Hold mic, speak, release (~5–30 sec each)
- Regex preview gives instant structured feedback
- Optimized for small on-device models (0.8B)
Clinician records via either mode
Tap for ambient conversation or hold for quick dictation snippets
On-device ASR transcribes in real-time
sherpa-onnx ASR runs on device with hardware-aware model selection
Regex extraction provides immediate preview
No model load required — instant structured feedback as you go
Single LLM pass at finalization
Full transcript (or accumulated snippets) processed together for a coherent structured encounter
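The "regex preview" step can be sketched with a couple of patterns: vitals and medication snippets are structured enough that no model load is needed for instant feedback. The patterns below are illustrative assumptions, not the app's actual grammar:

```python
import re

# Hypothetical snippet patterns: "paracetamol 500 mg TDS", "BP 120/80".
MED_RE = re.compile(
    r"(?P<name>[A-Za-z]+)\s+(?P<dose>\d+(?:\.\d+)?)\s*(?P<unit>mg|g|ml)\s+(?P<freq>OD|BD|TDS|QID)",
    re.IGNORECASE,
)
BP_RE = re.compile(r"\bBP\s*(?P<sys>\d{2,3})\s*/\s*(?P<dia>\d{2,3})\b", re.IGNORECASE)

def preview(snippet: str) -> dict:
    """Instant structured feedback from one dictated snippet."""
    out = {}
    if m := MED_RE.search(snippet):
        out["medication"] = {
            "name": m["name"].title(),
            "dose": float(m["dose"]),
            "unit": m["unit"].lower(),
            "frequency": m["freq"].upper(),
        }
    if m := BP_RE.search(snippet):
        out["bp"] = {"systolic": int(m["sys"]), "diastolic": int(m["dia"])}
    return out

p = preview("BP 120/80, start paracetamol 500 mg TDS")
```

The preview is only a hint to the clinician; the authoritative extraction still comes from the single LLM pass at finalization.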
When to use which?
- Ambient — unhurried consultations where capturing the full dialogue matters (history-taking, counseling)
- Snippets — fast-paced clinics where the clinician dictates findings between patients
- Both modes feed into the same extraction pipeline and produce identical structured output
- Configurable as default in Settings — clinicians choose what fits their workflow
Battery-conscious processing
Model loads once for N patients instead of N times.
| Trigger | Behavior |
|---|---|
| Manual | Clinician taps "Process Queue" during a break |
| Urgent | Immediate single extraction for referral/emergency |
| End of session | Process remaining queue before closing |
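The queue idea can be sketched in a few lines: encounters accumulate cheaply, and the expensive llama.cpp weight load happens once per batch rather than once per patient. Names here are illustrative, not the app's API:

```python
# Sketch of battery-conscious batching: one model load for N transcripts.
class ExtractionQueue:
    def __init__(self, load_model):
        self._load_model = load_model  # expensive: LLM weight load
        self._queue = []

    def enqueue(self, transcript):
        """Cheap: called after each patient, no model involved."""
        self._queue.append(transcript)

    def process(self, urgent_transcript=None):
        """Manual or end-of-session trigger; `urgent_transcript` bypasses the queue."""
        model = self._load_model()  # paid once for the whole batch
        batch = [urgent_transcript] if urgent_transcript else self._queue
        results = [model(t) for t in batch]
        if not urgent_transcript:
            self._queue.clear()
        return results

loads = []
def load_model():
    loads.append(1)                   # count how often we pay the load cost
    return lambda t: {"transcript": t}

q = ExtractionQueue(load_model)
for t in ["patient 1", "patient 2", "patient 3"]:
    q.enqueue(t)
results = q.process()
```

Three patients, one model load — on a budget phone that difference dominates the battery budget.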
Quantized for the edge
| Tier | Model | Quantization | Size | Context Window |
|---|---|---|---|---|
| SMALL | Qwen 3.5 0.8B | Q4_K_M | 533 MB | 32,768 tokens |
| LARGE | Qwen 3.5 2B | Q4_K_M | 1.28 GB | 32,768 tokens |
Hardware-aware selection: 0.8B for 2 GB devices, 2B for 4 GB+. Both run via llama.cpp built from source.
Fits on a budget phone
| Device Class | Example | RAM | ASR | LLM | Total Footprint |
|---|---|---|---|---|---|
| Budget | Galaxy A03 | 2 GB | Moonshine Tiny (43 MB) | Qwen 0.8B (533 MB) | ~576 MB |
| Mid-range | Galaxy A14 | 4 GB | Parakeet TDT (671 MB) | Qwen 2B (1.28 GB) | ~1.95 GB |
| High-end | Galaxy A54 | 6+ GB | Parakeet TDT (671 MB) | Qwen 2B (1.28 GB) | ~1.95 GB |
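The footprint column is just the sum of the two resident models, which is easy to sanity-check (numbers copied from the tables above, in GB):

```python
# Total resident footprint = ASR model + LLM, per device class.
configs = {
    "Budget":    (0.043, 0.533),  # Moonshine Tiny + Qwen 0.8B
    "Mid-range": (0.671, 1.28),   # Parakeet TDT + Qwen 2B
}
totals = {name: round(asr + llm, 3) for name, (asr, llm) in configs.items()}
```

That gives ~0.576 GB for the budget class and ~1.951 GB for mid-range and above, matching the table.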