End-to-end pipeline

From the clinician's voice to a structured clinical encounter in six stages, all on-device. The structured output feeds directly into the billing engine for automated ICD-10 to CPT/HCPCS claim generation and SOAP note production.


Offline speech recognition

sherpa-onnx powered ASR — 5 architectures across 7 model tiers, hardware-aware selection, fully offline. From 43 MB on ultra-budget phones to 1 GB for maximum accuracy.

Tier | Model | Architecture | Size | Target
Moonshine Tiny | Moonshine v2 (English) | Encoder-Decoder | 43 MB | Ultra-low RAM (<2 GB)
Moonshine Base | Moonshine v2 (English) | Encoder-Decoder | 140 MB | 2+ GB RAM, ~7.4% WER
medASR | medASR CTC (medical English) | CTC | 154 MB | Recommended for clinical use
SenseVoice | SenseVoice Small (ZH/EN/JA/KO/YUE) | SenseVoice | 239 MB | CJK languages
Omnilingual 300M | CTC 300M (1600+ languages) | CTC | 365 MB | 2–4 GB RAM, multilingual
Parakeet TDT | Parakeet TDT v3 (EN + 25 EU langs) | Transducer | 671 MB | Best English accuracy, 4+ GB
Omnilingual 1B | CTC 1B (1600+ languages) | CTC | 1.03 GB | Highest multilingual accuracy
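
To make the hardware-aware selection concrete, here is a minimal Kotlin sketch that maps total device RAM (read via Android's ActivityManager) onto the tiers above. The cutoffs and tier identifiers are illustrative assumptions, not the app's actual values.

import android.app.ActivityManager
import android.content.Context

// Illustrative RAM-based ASR tier pick; thresholds mirror the table above,
// but the exact cutoffs and tier names are assumptions.
fun pickAsrTier(context: Context): String {
    val am = context.getSystemService(ActivityManager::class.java)
    val mem = ActivityManager.MemoryInfo().also { am.getMemoryInfo(it) }
    val totalGb = mem.totalMem / (1024.0 * 1024 * 1024)
    return when {
        totalGb < 2.0 -> "moonshine-tiny"   // 43 MB, ultra-low RAM
        totalGb < 4.0 -> "medasr-ctc"       // 154 MB, recommended clinical default
        else          -> "parakeet-tdt-v3"  // 671 MB, best English accuracy
    }
}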

Why medASR matters: General-purpose ASR models often misrecognize medical terminology — drug names, anatomy, procedures. The medASR tier is trained specifically on medical speech and achieves significantly lower word error rates on clinical vocabulary. For highest medical accuracy, cloud ASR (Gemini Flash Lite, Deepgram Nova, OpenAI gpt-4o Transcribe) via the ChartLite proxy is recommended — cloud models handle medical terms better than any on-device model.

Dual-mode: sherpa-onnx on-device when offline, cloud ASR (Gemini Flash Lite, Deepgram Nova, or OpenAI gpt-4o Transcribe) when connected. Hardware-aware tier selection automatically picks the best model for each device. Automatic fallback ensures voice capture always works.
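
A rough Kotlin sketch of that routing, assuming a small Asr interface over both paths: the ConnectivityManager check is standard Android, while the cloud and local implementations stand in for the ChartLite proxy and sherpa-onnx and are not real APIs.

import android.content.Context
import android.net.ConnectivityManager
import android.net.NetworkCapabilities

// Hypothetical wrapper implemented once for the cloud proxy and once for sherpa-onnx.
interface Asr { fun transcribe(audio: FloatArray): String }

fun transcribe(context: Context, audio: FloatArray, cloud: Asr, local: Asr): String {
    val cm = context.getSystemService(ConnectivityManager::class.java)
    val caps = cm.getNetworkCapabilities(cm.activeNetwork)
    val online = caps?.hasCapability(NetworkCapabilities.NET_CAPABILITY_INTERNET) == true
    return if (online) {
        // Prefer cloud ASR when connected, but fall back to on-device if the call fails.
        runCatching { cloud.transcribe(audio) }.getOrElse { local.transcribe(audio) }
    } else {
        local.transcribe(audio)  // fully offline path via sherpa-onnx
    }
}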


Help build better medical ASR

General-purpose speech models struggle with clinical vocabulary. We're building specialized medical ASR for low-resource settings — and we need real-world voice data to get there.

If you have hours of medical voice recordings — clinical consultations, dictation, patient interactions — in any language, we'd love to collaborate on fine-tuning ASR models that understand medicine.

Get in Touch

Retrieval-augmented extraction

Instead of stuffing 815 reference entries into every prompt, we retrieve only what's relevant.

01 Index

At app startup, a TF-IDF vector store indexes 300 ICD-10 codes + 515 formulary drugs (~20–50 ms).

02 Retrieve

Per transcript, cosine similarity finds the 10–15 most relevant codes and drugs.

03 Prompt

A compact prompt is assembled: instructions + retrieved references + transcript.

04 Generate

Qwen 3.5 runs extraction with roughly 80% more context window available for the transcript and output.

Component | Before (static prompt) | After (RAG pipeline)
Reference data | ~6,000 tokens (815 entries) | ~400–800 tokens (15–25 entries)
Available for transcript | ~700 tokens | ~5,000+ tokens
Available for generation | ~1,000 tokens | ~2,000+ tokens
Disambiguation quality | Low (no keywords/aliases) | High (retrieved entries include keywords + local terms)
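
For illustration, a self-contained Kotlin sketch of the index-and-retrieve steps. The entry fields, tokenizer, and default k are assumptions made for this sketch, not the production code.

import kotlin.math.ln
import kotlin.math.sqrt

data class RefEntry(val code: String, val text: String)  // e.g. ICD-10 code + description/keywords

class TfIdfIndex(private val entries: List<RefEntry>) {
    private val tokenized = entries.map { tokenize(it.text) }
    private val df = HashMap<String, Int>()                 // document frequency per term
    private val docVecs: List<Map<String, Double>>
    init {
        tokenized.forEach { doc -> doc.toSet().forEach { t -> df[t] = (df[t] ?: 0) + 1 } }
        docVecs = tokenized.map { vec(it) }                  // built once at app startup
    }
    private fun tokenize(s: String) =
        s.lowercase().split(Regex("[^a-z0-9]+")).filter { it.isNotBlank() }
    private fun idf(t: String) = ln((entries.size + 1.0) / ((df[t] ?: 0) + 1.0))
    private fun vec(tokens: List<String>): Map<String, Double> {
        val tf = tokens.groupingBy { it }.eachCount()
        return tf.mapValues { (t, c) -> c.toDouble() / tokens.size * idf(t) }
    }
    private fun cosine(a: Map<String, Double>, b: Map<String, Double>): Double {
        val dot = a.entries.sumOf { (t, w) -> w * (b[t] ?: 0.0) }
        val na = sqrt(a.values.sumOf { it * it })
        val nb = sqrt(b.values.sumOf { it * it })
        return if (na == 0.0 || nb == 0.0) 0.0 else dot / (na * nb)
    }
    // Per transcript: rank all 815 entries and keep only the most relevant few.
    fun topK(transcript: String, k: Int = 15): List<RefEntry> {
        val q = vec(tokenize(transcript))
        return entries.indices.sortedByDescending { cosine(q, docVecs[it]) }
            .take(k).map { entries[it] }
    }
}

The retrieved entries then go into the compact prompt from step 03, ahead of the transcript.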

Unified JSON extraction format

A single benchmark schema shared by all 6 extraction strategies — consistent output regardless of inference path. The on-device path uses TOON (Token-Oriented Object Notation) for 40–60% token savings vs JSON, with automatic fallback to JSON parsing.

Benchmark JSON
{
  "diagnoses": [
    {
      "icd10Code": "J06.9",
      "description": "Upper resp. infection",
      "isPrimary": true,
      "confidence": 0.9
    }
  ],
  "medications": [
    {
      "formularyCode": "0097",
      "name": "Paracetamol",
      "dose": 500,
      "unit": "mg",
      "frequency": "TDS"
    }
  ]
}

One schema, six extraction strategies. Hallucination guards and field validation run identically across all paths. The structured JSON output feeds directly into the billing module for automated insurance claim generation (ICD-10 to CPT/HCPCS mapping, E/M level coding) and SOAP note production.
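
A hedged Kotlin sketch of such guards, using the field names from the benchmark JSON above. The specific rules shown (retrieved-code membership, confidence range, positive dose) are illustrative rather than the app's exact checks.

data class Diagnosis(val icd10Code: String, val description: String, val isPrimary: Boolean, val confidence: Double)
data class Medication(val formularyCode: String, val name: String, val dose: Double, val unit: String, val frequency: String)

// Hallucination guard: keep only diagnoses whose code was actually retrieved for this transcript,
// with a sane confidence value.
fun guardDiagnoses(extracted: List<Diagnosis>, retrievedIcd10: Set<String>): List<Diagnosis> =
    extracted.filter { it.icd10Code in retrievedIcd10 && it.confidence in 0.0..1.0 }

// Field validation: drop medications with unknown formulary codes or non-positive doses.
fun guardMedications(extracted: List<Medication>, formularyCodes: Set<String>): List<Medication> =
    extracted.filter { it.formularyCode in formularyCodes && it.dose > 0 }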


Two ways to capture

Tap for ambient conversation recording or hold for structured dictation snippets — same mic button, two interaction patterns optimized for different clinical workflows.

Tap to record

Ambient Conversation

Continuous recording of the full patient-clinician dialogue. Natural conversation captured without interruption.

  • Tap once to start, tap again to stop
  • No silence auto-stop — pauses preserved
  • Best with cloud ASR or larger on-device models

Hold to dictate

Structured Snippets

Short structured phrases — vitals, medications, diagnoses — dictated one at a time and accumulated into an encounter.

  • Hold mic, speak, release (~5–30 sec each)
  • Regex preview gives instant structured feedback
  • Optimized for small on-device models (0.8B)
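
One plausible way to wire both patterns to a single button with stock Android listeners is sketched below; CaptureService and its methods are hypothetical placeholders for the app's recording layer.

import android.view.MotionEvent
import android.view.View

// Hypothetical recording layer: ambient mode plus single-snippet dictation.
interface CaptureService {
    val isAmbientRecording: Boolean
    val isSnippetActive: Boolean
    fun startAmbient(); fun stopAmbient()
    fun startSnippet(); fun stopSnippet()
}

fun bindMicButton(micButton: View, recorder: CaptureService) {
    micButton.setOnClickListener {           // tap: toggle ambient conversation recording
        if (recorder.isAmbientRecording) recorder.stopAmbient() else recorder.startAmbient()
    }
    micButton.setOnLongClickListener {       // hold: begin dictating one structured snippet
        recorder.startSnippet()
        true
    }
    micButton.setOnTouchListener { _, event -> // releasing the hold ends the snippet
        if (event.action == MotionEvent.ACTION_UP && recorder.isSnippetActive) recorder.stopSnippet()
        false                                  // let click / long-click events still be delivered
    }
}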

1 Clinician records via either mode

Tap for ambient conversation or hold for quick dictation snippets.

2 On-device ASR transcribes in real time

sherpa-onnx runs on the device with hardware-aware model selection.

3 Regex extraction provides an immediate preview

No model load required: instant structured feedback as you go (see the sketch after this flow).

4 Single LLM pass at finalization

The full transcript (or accumulated snippets) is processed together for one coherent structured encounter.
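
The instant preview can be as simple as a few regular expressions run over each snippet. The Kotlin patterns below are illustrative only and much narrower than a real clinical extractor.

// Illustrative snippet patterns: "paracetamol 500 mg tds", "bp 120 over 80".
val medPattern = Regex("""(?i)\b([a-z]+)\s+(\d+(?:\.\d+)?)\s*(mg|mcg|g|ml)\b(?:\s+(od|bd|tds|qid|prn))?""")
val bpPattern = Regex("""(?i)\bbp\s*(\d{2,3})\s*(?:/|over)\s*(\d{2,3})\b""")

fun previewSnippet(text: String): List<String> {
    val hits = mutableListOf<String>()
    medPattern.find(text)?.let { m ->
        val (drug, dose, unit, freq) = m.destructured
        hits += "Medication: $drug $dose $unit ${freq.uppercase()}".trim()
    }
    bpPattern.find(text)?.let { m -> hits += "BP: ${m.groupValues[1]}/${m.groupValues[2]} mmHg" }
    return hits
}

For example, previewSnippet("paracetamol 500 mg tds") surfaces "Medication: paracetamol 500 mg TDS" before any model is loaded.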

When to use which?

Tap-to-record ambient capture works best with cloud ASR or the larger on-device models; hold-to-dictate snippets suit low-RAM devices running the small 0.8B model, where each short phrase gets instant regex feedback.


Battery-conscious processing

Model loads once for N patients instead of N times.

Patient 1 transcript + Patient 2 transcript + Patient 3 transcript → queue → load model once → process batch → unload & free RAM
Trigger | Behavior
Manual | Clinician taps "Process Queue" during a break
Urgent | Immediate single extraction for referral/emergency
End of session | Process remaining queue before closing
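
In code, the once-per-batch lifecycle might look like the Kotlin sketch below; LlmEngine, loadModel, and save are hypothetical stand-ins for the app's llama.cpp wrapper and persistence layer.

// Hypothetical extraction engine backed by the quantized on-device model.
interface LlmEngine {
    fun extract(transcript: String): String   // returns structured encounter JSON
    fun close()                               // unload weights, free RAM
}

// Load the model once, run every queued transcript, then release the memory.
fun processQueue(queued: List<String>, loadModel: () -> LlmEngine, save: (String) -> Unit) {
    val llm = loadModel()                     // single load instead of one per patient
    try {
        for (transcript in queued) save(llm.extract(transcript))
    } finally {
        llm.close()                           // release ~0.5–1.3 GB when the batch is done
    }
}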

Quantized for the edge

Tier | Model | Quantization | Size | Context Window
SMALL | Qwen 3.5 0.8B | Q4_K_M | 533 MB | 32,768 tokens
LARGE | Qwen 3.5 2B | Q4_K_M | 1.28 GB | 32,768 tokens

Hardware-aware selection: 0.8B for 2 GB devices, 2B for 4 GB+. Both run via llama.cpp built from source.
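
A minimal sketch of that selection in Kotlin; the GGUF file names below are assumptions for illustration.

// Illustrative pick between the SMALL and LARGE GGUF files from the table above.
fun pickLlmModel(totalRamGb: Double): String =
    if (totalRamGb >= 4.0) "qwen3.5-2b-q4_k_m.gguf"     // LARGE, 1.28 GB, 32,768-token context
    else "qwen3.5-0.8b-q4_k_m.gguf"                     // SMALL, 533 MB, 32,768-token context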


Fits on a budget phone

Device Class | Example | RAM | ASR | LLM | Total Footprint
Budget | Galaxy A03 | 2 GB | Moonshine Tiny (43 MB) | Qwen 0.8B (533 MB) | ~576 MB
Mid-range | Galaxy A14 | 4 GB | Parakeet TDT (671 MB) | Qwen 2B (1.28 GB) | ~1.95 GB
High-end | Galaxy A54 | 6+ GB | Parakeet TDT (671 MB) | Qwen 2B (1.28 GB) | ~1.95 GB