
Medical ASR Fine-tuning

Status: Distributed Training · Production Ready
Tech: Whisper · AdaLoRA · PyTorch · HuggingFace · WebSocket · Multi-GPU

Project Background

Most mainstream ASR models are designed for general-purpose contexts and struggle with three core challenges in the medical domain: poor recognition of specialized terminology, inability to handle code-switching between Chinese and English, and strict privacy compliance requirements. In clinical settings, physicians frequently mix English drug names (e.g., Aspirin, Metformin), diagnostic codes (ICD-10), and procedure names into Mandarin sentences — a highly code-switched speech pattern that general models fail to transcribe accurately. Moreover, clinical recordings contain sensitive patient data and cannot be processed through third-party cloud APIs.

This project is a collaboration between ITRI and 9 hospitals, aiming to build a locally deployable ASR system capable of handling code-switched medical speech, enabling clinicians to use real-time transcription within hospital premises while ensuring data never leaves the facility.

The Challenge

  • Limits of general ASR: Models like OpenAI Whisper exhibit significantly higher Character Error Rates (CER) on medical terminology and code-switched utterances — especially prone to omissions, mistranscriptions, and hallucinations when English terms appear within Chinese sentences.
  • Scarcity of labeled data: Ground truth transcriptions for medical speech are extremely difficult to obtain — requiring domain experts to review word by word, making the process costly and time-consuming.
  • Privacy compliance: Clinical recordings contain sensitive patient information and must be processed entirely on-premise, with no cloud services allowed.
  • Cross-hospital variability: The 9 hospitals differ in recording equipment, ambient noise levels, medical specialties, and speaking styles — the model must generalize across all sites.

Technical Architecture

Training Pipeline

The entire workflow is orchestrated through an automated pipeline script, from data preprocessing to model evaluation:

  1. Audio normalization — Unify sample rate, format, and channel count
  2. Dataset preparation — Convert to HuggingFace Dataset format with pseudo-label generation
  3. AdaLoRA fine-tuning — Based on Whisper Large V3 Turbo, BF16 precision, DDP dual-GPU parallel training
  4. Model merging — Merge adapter with base model into a deployable full model
  5. CER evaluation — Evaluate on both medical dataset and Common Voice, with outlier-robust statistics
  6. Parameter logging — Auto-record to CSV + Weights & Biases tracking
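As a concrete illustration of step 1, here is a minimal sketch of audio normalization with NumPy. The real pipeline likely relies on ffmpeg or torchaudio; the linear-interpolation resampler and function names here are illustrative assumptions.

```python
# Sketch of step 1 (audio normalization): downmix to mono, resample to 16 kHz,
# and peak-normalize. Linear interpolation is a simplification of what a real
# resampler (ffmpeg/torchaudio) would do.
import numpy as np

TARGET_SR = 16_000  # Whisper expects 16 kHz mono input


def to_mono(audio: np.ndarray) -> np.ndarray:
    """Average channels; accepts (samples,) or (channels, samples)."""
    return audio if audio.ndim == 1 else audio.mean(axis=0)


def resample(audio: np.ndarray, orig_sr: int, target_sr: int = TARGET_SR) -> np.ndarray:
    """Linear-interpolation resampling (adequate for a sketch)."""
    if orig_sr == target_sr:
        return audio
    duration = audio.shape[-1] / orig_sr
    t_in = np.linspace(0.0, duration, num=audio.shape[-1], endpoint=False)
    t_out = np.linspace(0.0, duration, num=int(round(duration * target_sr)), endpoint=False)
    return np.interp(t_out, t_in, audio)


def normalize(audio: np.ndarray, orig_sr: int) -> np.ndarray:
    """Unify channel count, sample rate, and amplitude range."""
    out = resample(to_mono(audio.astype(np.float32)), orig_sr)
    peak = max(float(np.abs(out).max()), 1e-9)
    return (out / peak).astype(np.float32)  # peak-normalize to [-1, 1]
```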

Inference Service

The fine-tuned model serves real-time speech recognition through a WebSocket server. Silero VAD (Voice Activity Detection) filters out silence, so only segments containing detected speech are sent to the GPU, reducing inference resource consumption. Async task distribution lets multiple clinicians use the service concurrently.
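The per-connection flow of the inference service can be sketched as below. The `transcribe` callable, the message framing (16-bit PCM chunks), and the RMS-energy gate standing in for Silero VAD are all assumptions; the real service would wire `handle_client` into something like `websockets.serve(...)`.

```python
# Sketch of the WebSocket inference loop: receive PCM chunks, skip silence,
# off-load GPU inference to a thread so other clients are not stalled.
import asyncio
import json
import numpy as np


def has_speech(chunk: np.ndarray, threshold: float = 0.01) -> bool:
    """Placeholder gate: the real service uses Silero VAD speech probabilities;
    an RMS-energy threshold keeps this sketch self-contained."""
    rms = float(np.sqrt(np.mean(chunk.astype(np.float64) ** 2)))
    return rms > threshold


async def handle_client(websocket, transcribe):
    """Per-connection loop: gate each audio chunk through VAD, then transcribe."""
    async for message in websocket:
        # Assumed framing: raw 16-bit PCM at 16 kHz
        chunk = np.frombuffer(message, dtype=np.int16).astype(np.float32) / 32768.0
        if not has_speech(chunk):
            continue  # silence never reaches the GPU
        # Blocking GPU inference runs in a worker thread; the event loop stays free
        text = await asyncio.to_thread(transcribe, chunk)
        await websocket.send(json.dumps({"text": text}, ensure_ascii=False))
```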

Hardware

  • Training: Linux Server — 2× NVIDIA L4 (24GB VRAM each), CUDA 12.1
  • Stack: Python 3.12, PyTorch 2.5, HuggingFace Transformers + PEFT
  • Precision: BF16 + Gradient Checkpointing (maximizing batch size under limited VRAM)

My Role & Responsibilities

I was responsible for ASR corpus preprocessing and acoustic model training & tuning, specifically:

  • Designed and implemented the complete automated training pipeline (supporting 3 modes: fresh training, continued training, checkpoint resumption)
  • Built a centralized parameter configuration system (PipelineConfig) for unified management of all training and evaluation parameters
  • Processed raw audio from 9 hospitals — format unification, quality filtering, pseudo-label generation
  • AdaLoRA fine-tuning strategy — rank allocation, target module selection, hyperparameter tuning
  • CER evaluation system — with multiple statistical metrics (mean, median, outlier-trimmed variants)
  • WebSocket real-time recognition service development and deployment
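The centralized PipelineConfig mentioned above might look like the following dataclass sketch: one object holding every training and evaluation knob, readable by all pipeline stages and flattenable for logging. All field names and defaults here are illustrative assumptions, not the project's actual values.

```python
# Hypothetical sketch of a centralized PipelineConfig: a single dataclass that
# every pipeline stage (training, merging, evaluation, logging) reads from.
from dataclasses import dataclass, asdict


@dataclass
class PipelineConfig:
    # --- data ---
    sample_rate: int = 16_000
    min_logprob: float = -0.5          # pseudo-label confidence threshold
    # --- AdaLoRA (assumed values) ---
    init_r: int = 12                   # starting rank before budget pruning
    target_r: int = 4                  # average rank after adaptation
    # --- training ---
    learning_rate: float = 1e-4
    per_device_batch_size: int = 8
    gradient_accumulation_steps: int = 4
    bf16: bool = True
    mode: str = "fresh"                # "fresh" | "continue" | "resume"

    @property
    def effective_batch_size(self) -> int:
        n_gpus = 2  # DDP across 2x L4 in this project
        return self.per_device_batch_size * self.gradient_accumulation_steps * n_gpus

    def to_dict(self) -> dict:
        """Flat dict, convenient for CSV rows and W&B config logging."""
        return asdict(self)
```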

Key Technical Decisions

Why AdaLoRA over LoRA or full fine-tuning?

  • Full fine-tuning issues: Whisper Large V3 Turbo has ~800M parameters. Full fine-tuning exceeds VRAM capacity on 2× L4 and risks catastrophic forgetting — losing the model's general language understanding.
  • AdaLoRA advantage: Unlike fixed-rank LoRA, AdaLoRA dynamically allocates parameter budget based on weight matrix importance — assigning more rank to critical layers (e.g., Cross-Attention) while pruning redundant ones. This achieves better convergence than fixed-rank LoRA with the same parameter overhead.
  • Domain transfer flexibility: Each adapter is only a few MB, enabling independent training per specialty or hospital site, loadable on demand without modifying the base model.
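The budget-allocation idea behind AdaLoRA can be illustrated with a small NumPy sketch: score each adapter matrix by importance and redistribute a fixed total rank budget toward the most important ones. This is a conceptual illustration only; PEFT's actual implementation uses sensitivity-based importance scores updated during training, not a one-shot SVD.

```python
# Conceptual sketch of AdaLoRA-style rank allocation: modules whose updates
# carry more "mass" (here, nuclear norm = sum of singular values) receive a
# larger share of the total rank budget; unimportant ones are pruned toward
# a minimum rank.
import numpy as np


def allocate_ranks(adapters: dict, total_budget: int, min_rank: int = 1) -> dict:
    """adapters: name -> weight-update matrix. Returns name -> allocated rank."""
    scores = {name: float(np.linalg.svd(w, compute_uv=False).sum())
              for name, w in adapters.items()}
    total = sum(scores.values())
    # Proportional allocation with a floor; a real scheduler would also
    # enforce that the allocations sum exactly to the budget.
    return {name: max(min_rank, round(total_budget * s / total))
            for name, s in scores.items()}
```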

Why Whisper Large V3 Turbo?

  • Inference speed: Turbo reduces decoder layers from 32 to 4, achieving ~8× faster inference — essential for real-time recognition.
  • Strong encoder preserved: Feature extraction capability matches Large V3, ensuring sufficient semantic features even with faint or ambiguous speech.
  • Long-form stability: Clinical consultations can span tens of minutes. Turbo exhibits lower hallucination rates in long-form transcription scenarios.

DDP Distributed Training

PyTorch DDP distributes training across 2 GPUs, effectively doubling batch size. Combined with Gradient Accumulation, it simulates stable large-batch training on limited hardware. Data loading uses length-sorted grouping to minimize GPU idle time across variable-length medical recordings.
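The length-sorted grouping can be sketched in a few lines: sort utterances by duration so each batch pads to a similar length, which minimizes wasted computation on padding and keeps both GPUs evenly loaded. (HF Trainer exposes this idea as `group_by_length`; the helper name below is an assumption.)

```python
# Sketch of length-sorted batching: batches of similar-duration utterances
# need far less padding than randomly ordered ones.
def length_grouped_batches(durations: list[float], batch_size: int) -> list[list[int]]:
    """Return batches of sample indices, grouped by similar duration."""
    order = sorted(range(len(durations)), key=lambda i: durations[i])
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]
```

In practice the sorted batches are then shuffled at the batch level so training still sees them in random order.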

Evaluation & Results

CER (Character Error Rate) serves as the primary metric, evaluated on both the medical dataset and Common Voice (general domain) to ensure medical terminology gains don't sacrifice general capability.

  • ~60% relative CER reduction in medical scenarios (vs. the off-the-shelf Whisper Large V3 Turbo), with stable performance on code-switched clinical speech
  • Evaluation system covers 8 statistics, including mean CER, median CER, and outlier-trimmed (top 1% removed) variants
  • All experiment parameters and results auto-logged to CSV and Weights & Biases for reproducibility
  • Model deployed across multiple hospital sites for real-time clinical speech recognition
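The evaluation statistics above can be sketched as follows: per-utterance CER from character-level edit distance, then mean, median, and a top-trimmed mean that discards the worst utterances (typically hallucination outliers). Of the system's 8 statistics, three representatives are shown; the trim fraction follows the top-1% figure above.

```python
# Sketch of outlier-robust CER statistics.
import statistics


def cer(ref: str, hyp: str) -> float:
    """Character Error Rate = Levenshtein distance / reference length."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,            # deletion
                           cur[j - 1] + 1,         # insertion
                           prev[j - 1] + (r != h)))  # substitution
        prev = cur
    return prev[-1] / max(len(ref), 1)


def trimmed_mean(values: list[float], trim_top: float = 0.01) -> float:
    """Mean after dropping the worst `trim_top` fraction (hallucination outliers)."""
    kept = sorted(values)[: max(1, int(len(values) * (1 - trim_top)))]
    return statistics.mean(kept)
```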

Challenges & Trade-offs

Scarcity of Ground Truth Data

The biggest challenge was the lack of human-annotated transcriptions. We adopted a pseudo-labeling strategy: first generating preliminary transcripts with unfine-tuned Whisper, then filtering by confidence scores (log-probability threshold) and speech rate checks to select high-quality pseudo-labels for training. The trained model then re-annotates the corpus, creating an iterative refinement loop.
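The filtering step described above can be sketched as a simple gate on two signals. The threshold values and character-per-second bounds here are illustrative assumptions, not the project's tuned numbers.

```python
# Sketch of the pseudo-label filter: keep a transcript only if the decoder's
# average log-probability clears a confidence threshold AND the implied speech
# rate is physically plausible (absurd rates usually signal hallucination).
def keep_pseudo_label(avg_logprob: float, n_chars: int, duration_s: float,
                      min_logprob: float = -0.5,
                      rate_bounds: tuple = (1.0, 12.0)) -> bool:
    """Return True if the pseudo-label passes confidence and speech-rate checks."""
    if avg_logprob < min_logprob:
        return False                        # low-confidence decode
    rate = n_chars / max(duration_s, 1e-6)  # characters per second
    lo, hi = rate_bounds
    return lo <= rate <= hi
```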

To further address this, I developed a separate ASR ↔ TTS closed-loop self-refining framework (see Self-Refining Framework) to augment training data through synthetic speech generation.

BF16 and PEFT Compatibility Issues

When training with BF16 precision, calling model.generate() during evaluation steps crashed because autocast was not active: the encoder's conv1 layer received float32 input while its bias remained in BF16, producing a dtype mismatch. Resolved by fixing the precision-control logic so inputs and weights agree on dtype during generation, and by unifying the CER normalization function used across training and evaluation.

Cross-Hospital Data Variability

Audio quality varied dramatically across the 9 hospitals — different microphones, background noise (instrument beeps, HVAC), and accent differences. We standardized inputs through the Corpus Annotation Pipeline (see separate project), combining Meta Denoiser for noise reduction and hallucination detection to ensure consistent training data quality.