End-to-end pipeline

From the clinician's voice to a structured clinical encounter in six stages, all on-device. The structured output feeds directly into the billing engine for automated ICD-10 to CPT/HCPCS claim generation and SOAP note production.


Offline speech recognition

sherpa-onnx powered ASR — 5 architectures across 7 model tiers, hardware-aware selection, fully offline. From 43 MB on ultra-budget phones to 1 GB for maximum accuracy.

Tier | Model | Architecture | Size | Target
Moonshine Tiny | Moonshine v2 (English) | Encoder-Decoder | 43 MB | Ultra-low RAM (<2 GB)
Moonshine Base | Moonshine v2 (English) | Encoder-Decoder | 140 MB | 2+ GB RAM, ~7.4% WER
medASR | medASR CTC (medical English) | CTC | 154 MB | Recommended for clinical use
SenseVoice | SenseVoice Small (ZH/EN/JA/KO/YUE) | SenseVoice | 239 MB | CJK languages
Omnilingual 300M | CTC 300M (1600+ languages) | CTC | 365 MB | 2–4 GB RAM, multilingual
Parakeet TDT | Parakeet TDT v3 (EN + 25 EU langs) | Transducer | 671 MB | Best English accuracy, 4+ GB
Omnilingual 1B | CTC 1B (1600+ languages) | CTC | 1.03 GB | Highest multilingual accuracy
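
To make the hardware-aware selection concrete, here is a minimal Kotlin sketch that maps total device RAM (read via Android's ActivityManager) onto the tiers above. The cutoffs and tier identifiers are illustrative assumptions, not the app's actual values.

import android.app.ActivityManager
import android.content.Context

// Illustrative RAM-based ASR tier pick; thresholds mirror the table above,
// but the exact cutoffs and tier names are assumptions.
fun pickAsrTier(context: Context): String {
    val am = context.getSystemService(ActivityManager::class.java)
    val mem = ActivityManager.MemoryInfo().also { am.getMemoryInfo(it) }
    val totalGb = mem.totalMem / (1024.0 * 1024 * 1024)
    return when {
        totalGb < 2.0 -> "moonshine-tiny"   // 43 MB, ultra-low RAM
        totalGb < 4.0 -> "medasr-ctc"       // 154 MB, recommended clinical default
        else          -> "parakeet-tdt-v3"  // 671 MB, best English accuracy
    }
}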

Why medASR matters: General-purpose ASR models often misrecognize medical terminology — drug names, anatomy, procedures. The medASR tier is trained specifically on medical speech and achieves significantly lower word error rates on clinical vocabulary. For highest medical accuracy, cloud ASR (Gemini Flash Lite, Deepgram Nova, OpenAI gpt-4o Transcribe) via the ChartLite proxy is recommended — cloud models handle medical terms better than any on-device model.

Dual-mode: sherpa-onnx on-device when offline, cloud ASR (Gemini Flash Lite, Deepgram Nova, or OpenAI gpt-4o Transcribe) when connected. Hardware-aware tier selection automatically picks the best model for each device. Automatic fallback ensures voice capture always works.
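
A rough Kotlin sketch of that routing, assuming a small Asr interface over both paths: the ConnectivityManager check is standard Android, while the cloud and local implementations stand in for the ChartLite proxy and sherpa-onnx and are not real APIs.

import android.content.Context
import android.net.ConnectivityManager
import android.net.NetworkCapabilities

// Hypothetical wrapper implemented once for the cloud proxy and once for sherpa-onnx.
interface Asr { fun transcribe(audio: FloatArray): String }

fun transcribe(context: Context, audio: FloatArray, cloud: Asr, local: Asr): String {
    val cm = context.getSystemService(ConnectivityManager::class.java)
    val caps = cm.getNetworkCapabilities(cm.activeNetwork)
    val online = caps?.hasCapability(NetworkCapabilities.NET_CAPABILITY_INTERNET) == true
    return if (online) {
        // Prefer cloud ASR when connected, but fall back to on-device if the call fails.
        runCatching { cloud.transcribe(audio) }.getOrElse { local.transcribe(audio) }
    } else {
        local.transcribe(audio)  // fully offline path via sherpa-onnx
    }
}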


Help build better medical ASR

General-purpose speech models struggle with clinical vocabulary. We're building specialized medical ASR for low-resource settings — and we need real-world voice data to get there.

If you have hours of medical voice recordings — clinical consultations, dictation, patient interactions — in any language, we'd love to collaborate on fine-tuning ASR models that understand medicine.

Get in Touch

Retrieval-augmented extraction

Instead of stuffing 815 reference entries into every prompt, we retrieve only what's relevant.

01 Index

At app startup, a TF-IDF vector store indexes 300 ICD-10 codes + 515 formulary drugs (~20–50 ms).

02 Retrieve

Per transcript, cosine similarity finds the 10–15 most relevant codes and drugs.

03 Prompt

A compact prompt is assembled: instructions + retrieved references + transcript.

04 Generate

Qwen 3.5 runs extraction with roughly 80% more context window available for the transcript and output.

Component | Before (static prompt) | After (RAG pipeline)
Reference data | ~6,000 tokens (815 entries) | ~400–800 tokens (15–25 entries)
Available for transcript | ~700 tokens | ~5,000+ tokens
Available for generation | ~1,000 tokens | ~2,000+ tokens
Disambiguation quality | Low (no keywords/aliases) | High (retrieved entries include keywords + local terms)
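
For illustration, a self-contained Kotlin sketch of the index-and-retrieve steps. The entry fields, tokenizer, and default k are assumptions made for this sketch, not the production code.

import kotlin.math.ln
import kotlin.math.sqrt

data class RefEntry(val code: String, val text: String)  // e.g. ICD-10 code + description/keywords

class TfIdfIndex(private val entries: List<RefEntry>) {
    private val tokenized = entries.map { tokenize(it.text) }
    private val df = HashMap<String, Int>()                 // document frequency per term
    private val docVecs: List<Map<String, Double>>
    init {
        tokenized.forEach { doc -> doc.toSet().forEach { t -> df[t] = (df[t] ?: 0) + 1 } }
        docVecs = tokenized.map { vec(it) }                  // built once at app startup
    }
    private fun tokenize(s: String) =
        s.lowercase().split(Regex("[^a-z0-9]+")).filter { it.isNotBlank() }
    private fun idf(t: String) = ln((entries.size + 1.0) / ((df[t] ?: 0) + 1.0))
    private fun vec(tokens: List<String>): Map<String, Double> {
        val tf = tokens.groupingBy { it }.eachCount()
        return tf.mapValues { (t, c) -> c.toDouble() / tokens.size * idf(t) }
    }
    private fun cosine(a: Map<String, Double>, b: Map<String, Double>): Double {
        val dot = a.entries.sumOf { (t, w) -> w * (b[t] ?: 0.0) }
        val na = sqrt(a.values.sumOf { it * it })
        val nb = sqrt(b.values.sumOf { it * it })
        return if (na == 0.0 || nb == 0.0) 0.0 else dot / (na * nb)
    }
    // Per transcript: rank all 815 entries and keep only the most relevant few.
    fun topK(transcript: String, k: Int = 15): List<RefEntry> {
        val q = vec(tokenize(transcript))
        return entries.indices.sortedByDescending { cosine(q, docVecs[it]) }
            .take(k).map { entries[it] }
    }
}

The retrieved entries then go into the compact prompt from step 03, ahead of the transcript.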

Unified JSON extraction format

A single benchmark schema shared by all 6 extraction strategies — consistent output regardless of inference path. The on-device path uses TOON (Token-Oriented Object Notation) for 40–60% token savings vs JSON, with automatic fallback to JSON parsing.

Benchmark JSON
{
  "diagnoses": [
    {
      "icd10Code": "J06.9",
      "description": "Upper resp. infection",
      "isPrimary": true,
      "confidence": 0.9
    }
  ],
  "medications": [
    {
      "formularyCode": "0097",
      "name": "Paracetamol",
      "dose": 500,
      "unit": "mg",
      "frequency": "TDS"
    }
  ]
}

One schema, six extraction strategies. Hallucination guards and field validation run identically across all paths. The structured JSON output feeds directly into the billing module for automated insurance claim generation (ICD-10 to CPT/HCPCS mapping, E/M level coding) and SOAP note production.
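
A hedged Kotlin sketch of such guards, using the field names from the benchmark JSON above. The specific rules shown (retrieved-code membership, confidence range, positive dose) are illustrative rather than the app's exact checks.

data class Diagnosis(val icd10Code: String, val description: String, val isPrimary: Boolean, val confidence: Double)
data class Medication(val formularyCode: String, val name: String, val dose: Double, val unit: String, val frequency: String)

// Hallucination guard: keep only diagnoses whose code was actually retrieved for this transcript,
// with a sane confidence value.
fun guardDiagnoses(extracted: List<Diagnosis>, retrievedIcd10: Set<String>): List<Diagnosis> =
    extracted.filter { it.icd10Code in retrievedIcd10 && it.confidence in 0.0..1.0 }

// Field validation: drop medications with unknown formulary codes or non-positive doses.
fun guardMedications(extracted: List<Medication>, formularyCodes: Set<String>): List<Medication> =
    extracted.filter { it.formularyCode in formularyCodes && it.dose > 0 }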


Two ways to capture

Tap for ambient conversation recording or hold for structured dictation snippets — same mic button, two interaction patterns optimized for different clinical workflows.

Tap to record

Ambient Conversation

Continuous recording of the full patient-clinician dialogue. Natural conversation captured without interruption.

  • Tap once to start, tap again to stop
  • No silence auto-stop — pauses preserved
  • Best with cloud ASR or larger on-device models

Hold to dictate

Structured Snippets

Short structured phrases — vitals, medications, diagnoses — dictated one at a time and accumulated into an encounter.

  • Hold mic, speak, release (~5–30 sec each)
  • Regex preview gives instant structured feedback
  • Optimized for small on-device models (0.8B)
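
One plausible way to wire both patterns to a single button with stock Android listeners is sketched below; CaptureService and its methods are hypothetical placeholders for the app's recording layer.

import android.view.MotionEvent
import android.view.View

// Hypothetical recording layer: ambient mode plus single-snippet dictation.
interface CaptureService {
    val isAmbientRecording: Boolean
    val isSnippetActive: Boolean
    fun startAmbient(); fun stopAmbient()
    fun startSnippet(); fun stopSnippet()
}

fun bindMicButton(micButton: View, recorder: CaptureService) {
    micButton.setOnClickListener {           // tap: toggle ambient conversation recording
        if (recorder.isAmbientRecording) recorder.stopAmbient() else recorder.startAmbient()
    }
    micButton.setOnLongClickListener {       // hold: begin dictating one structured snippet
        recorder.startSnippet()
        true
    }
    micButton.setOnTouchListener { _, event -> // releasing the hold ends the snippet
        if (event.action == MotionEvent.ACTION_UP && recorder.isSnippetActive) recorder.stopSnippet()
        false                                  // let click / long-click events still be delivered
    }
}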

1 Clinician records via either mode

Tap for ambient conversation or hold for quick dictation snippets.

2 On-device ASR transcribes in real time

sherpa-onnx runs on the device with hardware-aware model selection.

3 Regex extraction provides an immediate preview

No model load required: instant structured feedback as you go (see the sketch after this flow).

4 Single LLM pass at finalization

The full transcript (or accumulated snippets) is processed together for one coherent structured encounter.
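
The instant preview can be as simple as a few regular expressions run over each snippet. The Kotlin patterns below are illustrative only and much narrower than a real clinical extractor.

// Illustrative snippet patterns: "paracetamol 500 mg tds", "bp 120 over 80".
val medPattern = Regex("""(?i)\b([a-z]+)\s+(\d+(?:\.\d+)?)\s*(mg|mcg|g|ml)\b(?:\s+(od|bd|tds|qid|prn))?""")
val bpPattern = Regex("""(?i)\bbp\s*(\d{2,3})\s*(?:/|over)\s*(\d{2,3})\b""")

fun previewSnippet(text: String): List<String> {
    val hits = mutableListOf<String>()
    medPattern.find(text)?.let { m ->
        val (drug, dose, unit, freq) = m.destructured
        hits += "Medication: $drug $dose $unit ${freq.uppercase()}".trim()
    }
    bpPattern.find(text)?.let { m -> hits += "BP: ${m.groupValues[1]}/${m.groupValues[2]} mmHg" }
    return hits
}

For example, previewSnippet("paracetamol 500 mg tds") surfaces "Medication: paracetamol 500 mg TDS" before any model is loaded.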

When to use which?

Tap-to-record ambient capture works best with cloud ASR or the larger on-device models; hold-to-dictate snippets suit low-RAM devices running the small 0.8B model, where each short phrase gets instant regex feedback.


Battery-conscious processing

Model loads once for N patients instead of N times.

Patient 1 transcript + Patient 2 transcript + Patient 3 transcript → queue → load model once → process batch → unload & free RAM
Trigger | Behavior
Manual | Clinician taps "Process Queue" during a break
Urgent | Immediate single extraction for referral/emergency
End of session | Process remaining queue before closing
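
In code, the once-per-batch lifecycle might look like the Kotlin sketch below; LlmEngine, loadModel, and save are hypothetical stand-ins for the app's llama.cpp wrapper and persistence layer.

// Hypothetical extraction engine backed by the quantized on-device model.
interface LlmEngine {
    fun extract(transcript: String): String   // returns structured encounter JSON
    fun close()                               // unload weights, free RAM
}

// Load the model once, run every queued transcript, then release the memory.
fun processQueue(queued: List<String>, loadModel: () -> LlmEngine, save: (String) -> Unit) {
    val llm = loadModel()                     // single load instead of one per patient
    try {
        for (transcript in queued) save(llm.extract(transcript))
    } finally {
        llm.close()                           // release ~0.5–1.3 GB when the batch is done
    }
}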

Quantized for the edge

Tier | Model | Quantization | Size | Context Window
SMALL | Qwen 3.5 0.8B | Q4_K_M | 533 MB | 32,768 tokens
LARGE | Qwen 3.5 2B | Q4_K_M | 1.28 GB | 32,768 tokens

Hardware-aware selection: 0.8B for 2 GB devices, 2B for 4 GB+. Both run via llama.cpp built from source.
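
A minimal sketch of that selection in Kotlin; the GGUF file names below are assumptions for illustration.

// Illustrative pick between the SMALL and LARGE GGUF files from the table above.
fun pickLlmModel(totalRamGb: Double): String =
    if (totalRamGb >= 4.0) "qwen3.5-2b-q4_k_m.gguf"     // LARGE, 1.28 GB, 32,768-token context
    else "qwen3.5-0.8b-q4_k_m.gguf"                     // SMALL, 533 MB, 32,768-token context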


Fits on a budget phone

Device Class | Example | RAM | ASR | LLM | Total Footprint
Budget | Galaxy A03 | 2 GB | Moonshine Tiny (43 MB) | Qwen 0.8B (533 MB) | ~576 MB
Mid-range | Galaxy A14 | 4 GB | Parakeet TDT (671 MB) | Qwen 2B (1.28 GB) | ~1.95 GB
High-end | Galaxy A54 | 6+ GB | Parakeet TDT (671 MB) | Qwen 2B (1.28 GB) | ~1.95 GB