Best Local Speech-to-Text Models in 2026
Canary-Qwen-2.5B, Parakeet, Qwen3-ASR, Whisper, and Moonshine v2 ranked by WER, RAM, and language coverage for local speech-to-text in 2026.
NVIDIA Canary-Qwen-2.5B tops the HuggingFace Open ASR Leaderboard at 5.63% average word error rate (1.6% on LibriSpeech clean English) but requires a discrete GPU to run efficiently. For most laptops with 6 GB of RAM, Whisper large-v3-turbo is still the practical default - 99+ language coverage, wide tool support, and ~3.0% WER on clean audio. For English-only transcription on 16+ GB hardware, Parakeet TDT 0.6B v3 beats Whisper on both accuracy (~1.9% WER on LibriSpeech clean) and speed. Qwen3-ASR 1.7B covers 52 languages with state-of-the-art open-source multilingual accuracy. Moonshine v2 is purpose-built for streaming and edge deployment.
Here is the full breakdown for each model.
The leaderboard in plain terms
The HuggingFace Open ASR Leaderboard averages WER across multiple English and multilingual datasets - a better proxy for real-world performance than LibriSpeech clean alone, which consists of single-speaker read audiobook recordings. The rankings for open-source local models in mid-2026:
| Model | Open ASR avg WER | LibriSpeech clean WER | Languages | RAM floor |
|---|---|---|---|---|
| Canary-Qwen-2.5B | 5.63% | 1.6% | English | 8 GB VRAM (GPU) |
| Parakeet TDT 0.6B v3 | 6.32% | ~1.9% | 25 European | 16 GB unified |
| Moonshine v2 Medium | 6.65% (paper) | - | English | under 1 GB |
| Whisper large-v3 | 7.44% | ~2.7% | 99+ | 10 GB RAM |
| Whisper large-v3-turbo | ~7.5% (est.) | ~3.0% | 99+ | 6 GB RAM |
| Qwen3-ASR 1.7B | SOTA multilingual | - | 52 langs | ~8 GB VRAM |
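The headline: Whisper is no longer the most accurate local speech model. Canary-Qwen beats Whisper large-v3 on the leaderboard average by about 1.8 percentage points (5.63% vs 7.44%) and posts the lowest LibriSpeech clean score (1.6%); Parakeet also beats Whisper on both measures (6.32% average, ~1.9% clean). Both gains are real and reproducible.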
The constraint: Canary-Qwen needs an NVIDIA GPU with 8+ GB of VRAM or 16+ GB of Apple Silicon unified memory. Parakeet needs 16 GB of unified memory or an 8 GB NVIDIA GPU. For hardware with 6-8 GB of RAM and no discrete GPU, Whisper large-v3-turbo is the highest-accuracy option that will actually run on the machine.
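For readers who want to see what the WER figures in the table measure, here is a minimal sketch of the standard definition: word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. The function name `wer` and the sample sentences are illustrative. Leaderboards typically normalise text more thoroughly before scoring, and this sketch reduces that step to lowercasing and whitespace splitting.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance / number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # substitution or exact match
            )
        prev = curr
    return prev[-1] / len(ref)


# One wrong word out of five reference words.
print(f"{wer('the quick brown fox jumps', 'the quick brown fox jumped'):.0%}")
```

Because the denominator is the reference length, a 5.63% average means roughly one error every 18 words, and a 7.44% average roughly one every 13.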
NVIDIA Canary-Qwen-2.5B
Released in 2025 through NVIDIA's NeMo framework, Canary-Qwen-2.5B pairs a FastConformer encoder (trained on 234,000 hours of audio) with a Qwen3-1.7B language model decoder. The result is a Speech-Augmented Language Model (SALM) - it can transcribe, add punctuation and capitalisation, and produce summaries in a single forward pass. It is English-only.
On the HuggingFace Open ASR Leaderboard, it posts 5.63% average WER and 1.6% on LibriSpeech clean - the best result in the open-source category at time of writing.
Hardware requirements are the limiting factor. At FP16 precision, a 12 GB GPU (RTX 3060) is comfortable. With Q4 quantisation, an 8 GB GPU is enough. Apple Silicon requires 16+ GB of unified memory via the MLX framework. There is no whisper.cpp or Ollama path at this point; deployment goes through NVIDIA NeMo or HuggingFace Transformers with a CUDA environment.
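A rough way to sanity-check those GPU sizes is to estimate the weight footprint from parameter count and precision. The sketch below covers weights only and assumes exactly 16 or 4 bits per parameter; the helper name `weight_gb` is illustrative. Activations, runtime buffers, quantisation metadata, and framework overhead come on top, which is one reason the recommended GPU sizes are larger than the raw weight figures.

```python
def weight_gb(params_billion: float, bits_per_param: int) -> float:
    """Approximate size of the model weights alone, in gigabytes (10^9 bytes)."""
    # params_billion * 1e9 params * (bits / 8) bytes, divided by 1e9 bytes per GB
    return params_billion * bits_per_param / 8


for label, bits in [("FP16", 16), ("Q4", 4)]:
    print(f"Canary-Qwen-2.5B at {label}: ~{weight_gb(2.5, bits):.2f} GB of weights")
```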
Best for: English transcription where a GPU is available and maximum accuracy is the priority - medical dictation, legal deposition, fine-grained voice notes.
Parakeet TDT 0.6B v3
Covered in depth in Whisper large-v3-turbo vs Parakeet. The summary: 600 million parameters, ~1.9% WER on LibriSpeech clean English, and roughly 10 times faster than Whisper large-v3-turbo on NVIDIA GPU hardware. Its transducer decoder emits tokens aligned to audio frames, so it does not hallucinate on silence the way Whisper's autoregressive encoder-decoder can - a meaningful advantage for meeting and interview recordings.
The limits: 25 European languages only (English, Spanish, French, German, Italian, and 20 more - but no Chinese, Japanese, Korean, Arabic, or Hindi), no whisper.cpp integration yet, and no Ollama path. It needs ~16 GB of unified memory on Apple Silicon or 8 GB VRAM on NVIDIA.
Best for: English (and European language) transcription at high throughput, especially on hardware with 16+ GB of unified memory or a dedicated GPU.
Qwen3-ASR 1.7B
Alibaba's Qwen team released Qwen3-ASR in January 2026 with two sizes: 1.7B (highest accuracy) and 0.6B (efficiency). Both are open source and available on HuggingFace at QwenLM/Qwen3-ASR.
The model supports 52 languages and dialects - Mandarin, Arabic, Hindi, Spanish, French, Japanese, Korean, and many more. Language identification, transcription, and timestamp prediction all happen in a single forward pass without a separate language-detection step. The 1.7B variant achieves SOTA performance among open-source multilingual ASR models and is competitive with leading proprietary APIs on their benchmark sets.
For teams with audio in non-European languages, this is currently the strongest open-source option. It runs via HuggingFace Transformers with a GPU-backed environment; RAM requirements are roughly comparable to a 1.7B LLM (~4 GB VRAM for INT8 quantisation).
Best for: Multilingual transcription across a wide language mix, especially audio containing Mandarin, Arabic, Hindi, or Japanese alongside European languages.
Whisper large-v3-turbo
Still the right default for most hardware configurations. Whisper large-v3-turbo (released October 2024) prunes the decoder from 32 layers to 4, cutting parameters from 1.55 billion to 809 million. The result is 6-8x faster transcription versus full large-v3, with only ~0.3 percentage points of extra WER (~3.0% vs ~2.7% on LibriSpeech clean).
The ecosystem advantage is decisive. whisper.cpp, faster-whisper, Insanely Fast Whisper, and virtually every desktop dictation app ship with Whisper large-v3-turbo as the default. For live dictation into any running application on Mac, Windows, or Linux - without a GPU or 16 GB of memory - it is the correct choice. It covers 99+ languages, runs on 6 GB of RAM, and has pre-built binaries for every major platform.
One limitation to know: Whisper's autoregressive decoder can hallucinate on silent audio segments, generating filler phrases rather than outputting nothing. Voice activity detection (VAD) strips silence before inference and eliminates the problem - see how accurate is Whisper for meetings for the full detail on hallucination and how to prevent it.
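As a sketch of the VAD-before-inference approach, the function below uses faster-whisper's `WhisperModel` with `vad_filter=True`. It assumes a recent faster-whisper release that ships the `large-v3-turbo` model name; the CPU/int8 settings and the function name `transcribe_file` are illustrative choices, not a recommended configuration.

```python
def transcribe_file(path: str, model_name: str = "large-v3-turbo"):
    """Transcribe one audio file with faster-whisper and its built-in VAD filter.

    Returns the detected language and a list of (start, end, text) tuples.
    """
    # Imported lazily so this module still loads before faster-whisper is installed.
    from faster_whisper import WhisperModel

    # Assumed settings: CPU inference with int8 weights for low-memory machines.
    model = WhisperModel(model_name, device="cpu", compute_type="int8")

    # vad_filter=True removes non-speech regions before they reach the decoder.
    segments, info = model.transcribe(path, vad_filter=True)

    # `segments` is a generator; transcription runs as it is consumed.
    return info.language, [(s.start, s.end, s.text.strip()) for s in segments]


# Example call (hypothetical file name):
# language, lines = transcribe_file("meeting.wav")
```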
Best for: General-purpose local dictation across all hardware. The default pick unless specific accuracy, speed, or multilingual requirements point to another model.
Moonshine v2
Useful Sensors' Moonshine v2 (February 2026) is purpose-built for streaming and edge deployment where latency is the constraint. It comes in three streaming variants: Tiny (34M params), Small (123M params), and Medium (245M params). Medium achieves 6.65% WER on the benchmark averages reported in the Moonshine v2 paper; the Tiny variant runs in under 70 ms per segment on a Raspberry Pi 5.
The streaming encoder uses sliding-window self-attention - words appear as you speak, and the model reuses prior computation rather than reprocessing from scratch on each new audio chunk. This keeps first-word latency low in live captioning and real-time dictation pipelines. Moonshine is English-only.
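To illustrate the general idea of sliding-window self-attention, the sketch below builds a boolean attention mask in which each audio frame may attend only to itself and a fixed number of earlier frames. The causal, no-lookahead window and the sizes used here are assumptions for illustration; Moonshine v2's actual window configuration is not given in this article.

```python
def sliding_window_mask(num_frames: int, window: int) -> list[list[bool]]:
    """mask[i][j] is True when frame i may attend to frame j.

    Each frame sees itself and up to `window - 1` earlier frames, never later ones.
    """
    return [
        [i - window < j <= i for j in range(num_frames)]
        for i in range(num_frames)
    ]


# Print the mask for 6 frames with a window of 3 ("x" = attention allowed).
for row in sliding_window_mask(num_frames=6, window=3):
    print("".join("x" if allowed else "." for allowed in row))
```

Under this assumption, the row for an earlier frame does not change when new frames arrive, so its encoder output can be cached and reused, and the cost per new frame stays bounded by the window size rather than growing with the length of the recording.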
Best for: Streaming dictation, embedded systems, live captions, or mobile applications where latency below 100 ms per segment matters more than absolute WER.
Which model to choose
On an 8 GB MacBook Air, Whisper large-v3-turbo is the correct choice - Canary-Qwen needs a GPU, Parakeet needs 16 GB, and Qwen3-ASR has no whisper.cpp path yet. On Apple Silicon with 16+ GB and English-only audio, Parakeet v3 is the better model. Add a GPU and Canary-Qwen is worth evaluating when maximum accuracy is non-negotiable.
| Use case | Recommended model |
|---|---|
| Highest English accuracy, GPU available | Canary-Qwen-2.5B |
| Best English accuracy + speed, 16 GB RAM | Parakeet TDT 0.6B v3 |
| Multilingual incl. Chinese, Arabic, Hindi | Qwen3-ASR 1.7B |
| Most hardware, widest ecosystem, 6-8 GB RAM | Whisper large-v3-turbo |
| Streaming, real-time captions, edge / mobile | Moonshine v2 Small or Medium |
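The table can also be read as a small decision procedure. The sketch below is one possible encoding of it: the function name `pick_model`, its parameters, and the order in which the rules are checked are all assumptions made for illustration, and the thresholds are the ones quoted in this article.

```python
def pick_model(ram_gb: float, has_nvidia_gpu: bool, english_only: bool,
               streaming: bool = False) -> str:
    """Map hardware and language needs to one of the models in the table."""
    if streaming and english_only:
        # Latency-bound, English-only use cases: captions, edge, mobile.
        return "Moonshine v2 Small or Medium"
    if english_only and has_nvidia_gpu:
        # Maximum English accuracy when a GPU is available.
        return "Canary-Qwen-2.5B"
    if english_only and ram_gb >= 16:
        # Accuracy plus speed on 16+ GB of unified memory.
        return "Parakeet TDT 0.6B v3"
    if not english_only and has_nvidia_gpu:
        # Wide language mix, GPU-backed environment.
        return "Qwen3-ASR 1.7B"
    # Everything else: widest ecosystem, runs on 6-8 GB of RAM.
    return "Whisper large-v3-turbo"


print(pick_model(ram_gb=8, has_nvidia_gpu=False, english_only=True))
```

The example call corresponds to the 8 GB MacBook Air case and falls through to Whisper large-v3-turbo. A fuller version would distinguish European-only multilingual audio, where Parakeet is also a candidate.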
One benchmark number to hold in mind: all local models - and cloud services - land in the 8-12% WER range on real multi-speaker meeting audio with background noise, regardless of LibriSpeech benchmark scores. The benchmark gap between Canary-Qwen and Whisper turbo mostly closes on noisy real-world audio. The models diverge again on single-speaker clean dictation, which is where their stated WERs apply.
Running any of these locally for dictation
The models above handle file transcription. For live dictation that injects text at the cursor in any running application - a code editor, Slack, a browser field - you also need voice activity detection, microphone capture, and a text injection layer. The DIY whisper.cpp path covers building that pipeline manually from whisper.cpp (46,900+ GitHub stars, C/C++ port with Metal and CUDA acceleration).
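To make the voice activity detection stage of that pipeline concrete, here is a toy energy-threshold gate that keeps only frames loud enough to be speech. It is a simplified stand-in: production tools generally rely on trained VAD models such as Silero VAD, and the frame length, threshold, and synthetic test signal below are illustrative assumptions.

```python
import math


def rms(frame: list[float]) -> float:
    """Root-mean-square amplitude of one frame of samples."""
    return math.sqrt(sum(x * x for x in frame) / len(frame))


def speech_frames(samples: list[float], frame_len: int = 160,
                  threshold: float = 0.02):
    """Yield (start_index, frame) for frames whose RMS exceeds the threshold."""
    for start in range(0, len(samples) - frame_len + 1, frame_len):
        frame = samples[start:start + frame_len]
        if rms(frame) > threshold:
            yield start, frame


# Synthetic 16 kHz input: 0.1 s silence, 0.1 s of a 440 Hz tone, 0.1 s silence.
rate = 16_000
silence = [0.0] * 1600
tone = [0.5 * math.sin(2 * math.pi * 440 * n / rate) for n in range(1600)]

kept = [start for start, _ in speech_frames(silence + tone + silence)]
print(f"kept {len(kept)} frames, samples {kept[0]} to {kept[-1] + 160}")
```

Only the frames that pass the gate would be handed to the transcription model, which is what keeps silent stretches away from the decoder.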
Typilot bundles a local Whisper runtime with VAD, three activation modes (hold-to-record, toggle-VAD, or toggle-manual), and system-wide text injection on Mac, Windows, and Linux. Every model available in the voice settings tab runs entirely on-device - no audio upload, no cloud call for transcription.
The short version
Whisper large-v3-turbo is no longer the most accurate local ASR model, but it remains the best default for most hardware. Canary-Qwen-2.5B leads on raw WER (5.63% Open ASR avg, 1.6% LibriSpeech clean) and needs an NVIDIA GPU. Parakeet TDT 0.6B v3 posts ~1.9% on LibriSpeech clean and runs roughly 10x faster than Whisper large-v3-turbo on NVIDIA GPUs, but needs 16 GB of memory. Qwen3-ASR 1.7B is the multilingual answer for 52 languages. Moonshine v2 is the streaming and edge pick.
Typilot ships a 3-day free trial with local Whisper bundled - no Python environment, no whisper.cpp build, no audio upload. The security page documents exactly what stays on your device. For a head-to-head of Parakeet v3 vs Whisper in depth, see Whisper large-v3-turbo vs Parakeet. For how these models hold up in real meeting conditions, see how accurate is Whisper for meetings.
Common questions.
What is the most accurate local speech-to-text model in 2026?
NVIDIA Canary-Qwen-2.5B tops the HuggingFace Open ASR Leaderboard at 5.63% average word error rate (1.6% on LibriSpeech clean English), making it the most accurate open-source local ASR model as of mid-2026. It is English-only and requires an NVIDIA GPU with 8+ GB VRAM or 16+ GB of Apple Silicon unified memory. For hardware without a discrete GPU, Whisper large-v3-turbo (~3.0% WER, 6 GB RAM) is the practical default.
Is Whisper still the best local speech recognition model in 2026?
No. Whisper large-v3 sits at 7.44% on the HuggingFace Open ASR Leaderboard, behind Canary-Qwen-2.5B (5.63%), Parakeet TDT 0.6B v3 (6.32%), and Moonshine v2 Medium (6.65% from paper benchmarks). Whisper large-v3-turbo remains the best practical default because it covers 99+ languages and runs on 6 GB of RAM - advantages the more accurate models cannot match on most consumer hardware.
Which local speech-to-text model supports the most languages?
Whisper large-v3 and large-v3-turbo support 99+ languages, including Chinese, Japanese, Korean, Arabic, and Hindi. Qwen3-ASR 1.7B (Alibaba) covers 52 languages including the major East Asian and Middle Eastern languages with SOTA multilingual accuracy. Parakeet TDT 0.6B v3 covers 25 European languages only - it does not support Chinese, Japanese, Korean, Arabic, or Hindi.
What is the smallest local speech-to-text model that still produces usable transcription?
Moonshine v2 Small (123M parameters, 7.84% WER) is the smallest model with strong results for live streaming and real-time captions - it runs on under 1 GB of RAM. For comparison, Whisper tiny (39M params) reaches around 5.7% WER on clean audio, a clean-audio figure rather than a multi-dataset average, and degrades noticeably on accented or noisy speech. Moonshine v2's streaming architecture keeps first-word latency low, making it the right pick for embedded systems and mobile use cases.