Offline Speech to Text: How It Works (No Internet)
Run speech to text entirely on your device with local Whisper - no internet after the initial model download, ~2.7% WER, works in any app on Mac, Windows, and Linux.
Offline speech to text converts your spoken words to text entirely on your own device, with no audio upload and no internet connection required after the initial model download. The de facto open model for this is OpenAI's Whisper, which achieves around 2.7% word error rate on clean English audio and runs on Mac, Windows, and Linux. Apps that bundle local Whisper - including Typilot, Superwhisper, MacWhisper, Handy, and Murmur - deliver system-wide dictation that works in airplane mode, in a clinic, or anywhere a cloud service simply cannot go.
Here is what you need to know to choose the right tool and understand what you are actually getting.
Why "no internet" is an architecture requirement, not a feature
Cloud dictation services - Google Docs Voice Typing, Wispr Flow, Otter - convert your audio on their servers. Your audio leaves your device, sits in a data centre, and is subject to retention policies, breach risk, and vendor decisions you cannot audit. For casual use this is invisible. For healthcare notes, legal depositions, M&A calls, or anything under an NDA, it is a structural problem that a privacy policy cannot fix.
Offline speech to text removes the exposure at the source. There is no upload, so there is nothing to retain, leak, or subpoena. If your dictation tool never reaches the internet, the privacy guarantee is enforced by architecture, not by a vendor promise.
Two practical side-effects follow automatically: the tool works offline (airplane mode, blocked outbound networks, remote locations), and there is no per-minute billing because your hardware is the only compute involved.
How offline speech to text works
Every local dictation tool is built around a speech recognition model. Whisper is by far the most widely deployed open model for this task. Trained on 680,000 hours of multilingual audio, it handles 99 languages - making it the default for anything beyond English-only workloads.
Whisper ships in five sizes. Each makes a different trade-off between speed, accuracy, and RAM:
| Model | Disk / RAM | Approx. WER (English) | Practical speed on M2 |
|---|---|---|---|
| tiny | ~75 MB / 1 GB | ~7% | 10x real-time |
| base | ~150 MB / 1 GB | ~5% | 7x real-time |
| small | ~490 MB / 2 GB | ~3.4% | 5x real-time |
| medium | ~1.5 GB / 5 GB | ~2.9% | 3x real-time |
| large | ~3 GB / 10 GB | ~2.7% | 1.5x real-time |
WER figures are benchmarks on clean English speech (LibriSpeech test-clean). Real-world dictation with background noise, accents, or technical vocabulary typically lands in the 8-12% range even on the large model.
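For readers unfamiliar with the metric: WER is the word-level edit distance (substitutions, deletions, and insertions) between a transcript and a reference, divided by the number of words in the reference. Here is a minimal sketch, assuming plain lowercase whitespace tokenization; benchmark tooling usually normalizes text more carefully, and the function name `word_error_rate` is only illustrative.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


reference = "please send the signed contract to legal before noon today"
hypothesis = "please send the sign contract to legal before noon"
# One substitution plus one deletion over ten reference words.
print(f"WER: {word_error_rate(reference, hypothesis):.1%}")  # prints WER: 20.0%
```

Read this way, ~2.7% WER is roughly one wrong word in every 37, while 10% is one in every ten.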
Voice activity detection (VAD) is the layer that trims silence before audio reaches Whisper. Without it, long pauses produce hallucinated words as the model tries to transcribe noise. A VAD pass - typically 10-20 ms per audio chunk - means Whisper processes only actual speech, which improves accuracy and cuts latency.
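The idea can be illustrated with a deliberately simple energy-threshold VAD. This is only a sketch: the tools discussed here may well use a trained VAD model rather than a fixed threshold, and the 20 ms frame size and threshold value below are assumptions chosen for illustration.

```python
import math

SAMPLE_RATE = 16_000                        # Hz; Whisper works on 16 kHz mono audio
FRAME_MS = 20                               # assumed frame length
FRAME_LEN = SAMPLE_RATE * FRAME_MS // 1000  # 320 samples per frame


def frame_rms(frame):
    """Root-mean-square energy of one frame of float samples."""
    return math.sqrt(sum(s * s for s in frame) / len(frame))


def trim_silence(samples, threshold=0.01):
    """Keep only frames whose RMS energy is at or above `threshold`.

    `samples` is a sequence of floats in [-1.0, 1.0].
    """
    kept = []
    for start in range(0, len(samples), FRAME_LEN):
        frame = samples[start:start + FRAME_LEN]
        if frame_rms(frame) >= threshold:
            kept.extend(frame)
    return kept


# Synthetic input: one second of silence, one second of a 220 Hz tone, one second of silence.
silence = [0.0] * SAMPLE_RATE
tone = [0.5 * math.sin(2 * math.pi * 220 * n / SAMPLE_RATE) for n in range(SAMPLE_RATE)]
audio = silence + tone + silence

speech = trim_silence(audio)
print(f"{len(audio) / SAMPLE_RATE:.1f} s in, {len(speech) / SAMPLE_RATE:.1f} s kept")
```

A practical VAD would also keep a short margin of audio around each detected segment so that word onsets and endings are not clipped.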
On-device vs cloud dictation
| | Local (e.g. Typilot, Superwhisper) | Cloud (e.g. Wispr Flow, Otter) |
|---|---|---|
| Audio leaves your device | No - never | Yes - uploaded and stored |
| Works offline | Yes, after model download | No |
| Languages | 99 via Whisper | Varies by service |
| Accuracy (clean English) | ~2.7% WER (large model) | ~2-3% WER |
| Per-minute cost | Zero | Subscription or per-minute |
| Data retention risk | None - nothing uploaded | Governed by vendor policy |
| macOS / Windows / Linux | Yes | Varies |
The accuracy gap between local Whisper and cloud services has largely closed. The remaining cloud advantage is accent handling on smaller models and streaming latency on some platforms - both of which shrink significantly on Apple Silicon, where even the medium model runs at 3x real-time via Metal acceleration.
The privacy difference between local and cloud dictation is structural, not a setting. A cloud tool can promise not to retain your audio; a local tool has no server to retain it on.
Hardware requirements
Offline speech to text does not require high-end hardware, but model size determines what is practical:
- tiny / base - any modern laptop with 8 GB RAM, no GPU needed. Latency is well under one second, comfortable for continuous dictation.
- small / medium - 16 GB RAM is the comfortable floor. Apple Silicon (M1 and later) runs medium at full real-time via Metal; Intel Macs and Windows laptops with integrated graphics are fine for small.
- large - benefits from Apple Silicon or a discrete GPU with 6+ GB VRAM. On CPU-only hardware it adds 1-3 seconds of lag per segment, which feels noticeable in dictation but works fine for batch transcription.
If you are on an Intel Mac or a Windows laptop without a discrete GPU, the small model is the practical sweet spot: 3.4% WER, under 500 MB download, real-time on most hardware made after 2020.
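The sizing guidance in this section can be condensed into a small lookup. The sketch below copies the approximate figures from the model table in this article; the function name `pick_model`, the `prefer_speed` flag, and the exact cut-offs are one reading of that guidance, not part of Whisper or of any tool mentioned here.

```python
# Approximate figures copied from the model table in this article.
MODELS = {
    "tiny":   {"disk_mb": 75,   "wer_pct": 7.0},
    "base":   {"disk_mb": 150,  "wer_pct": 5.0},
    "small":  {"disk_mb": 490,  "wer_pct": 3.4},
    "medium": {"disk_mb": 1500, "wer_pct": 2.9},
    "large":  {"disk_mb": 3000, "wer_pct": 2.7},
}


def pick_model(ram_gb: float, accelerated: bool, prefer_speed: bool = False) -> str:
    """Suggest a Whisper model size for live dictation.

    `accelerated` means Apple Silicon or a discrete GPU with 6+ GB VRAM.
    """
    if ram_gb < 16:
        return "base"    # tiny/base tier: fine on 8 GB RAM, no GPU needed
    if not accelerated:
        return "small"   # the CPU-only sweet spot
    return "medium" if prefer_speed else "large"


for ram, accel in [(8, False), (16, False), (16, True)]:
    name = pick_model(ram, accel)
    info = MODELS[name]
    print(f"{ram} GB RAM, accelerated={accel}: {name} "
          f"(~{info['disk_mb']} MB download, ~{info['wer_pct']}% WER)")
```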
Getting offline dictation in any app
Raw Whisper handles transcription but not delivery - getting the recognised text into your email client, your code editor, Slack, or wherever you are typing requires a layer on top.
Typilot captures your microphone, runs the Whisper model you have downloaded, and injects the resulting text directly into the active text field via the OS accessibility layer. The same flow works in any application: your IDE, your browser, your terminal. The clipboard is not involved; the text appears at the cursor as if you typed it.
Three activation modes let you match the workflow to your situation:
- Hold - hold a key (default: Fn), speak, release to commit. Best for short bursts.
- Toggle-VAD - press once to start; voice activity detection stops the recording automatically on silence. Best for continuous dictation.
- Toggle-manual - press to start, press again to stop. Best for precise control over exactly what gets transcribed.
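One way to reason about the three modes is as a small state machine driven by key and silence events. This is an illustrative model only, not Typilot's implementation; the `Recorder` class and the event names are hypothetical.

```python
from enum import Enum


class Mode(Enum):
    HOLD = "hold"
    TOGGLE_VAD = "toggle-vad"
    TOGGLE_MANUAL = "toggle-manual"


class Recorder:
    """Tracks whether the microphone is recording under a given activation mode."""

    def __init__(self, mode: Mode):
        self.mode = mode
        self.recording = False

    def handle(self, event: str) -> bool:
        """Apply an event ("key_down", "key_up", or "silence") and return the new state."""
        if self.mode is Mode.HOLD:
            if event == "key_down":
                self.recording = True
            elif event == "key_up":
                self.recording = False   # releasing the key commits the utterance
        elif self.mode is Mode.TOGGLE_VAD:
            if event == "key_down" and not self.recording:
                self.recording = True    # a second press while recording is ignored here
            elif event == "silence":
                self.recording = False   # VAD ends the recording
        elif self.mode is Mode.TOGGLE_MANUAL:
            if event == "key_down":
                self.recording = not self.recording
        return self.recording


events = ["key_down", "key_up", "silence", "key_down", "key_up"]
for mode in Mode:
    recorder = Recorder(mode)
    print(mode.value, [recorder.handle(e) for e in events])
```

The same event sequence leaves each mode in a different state, which is the practical difference: Hold follows the key, Toggle-VAD follows the silence detector, and Toggle-manual follows only key presses.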
Output mode is also configurable: transcription mode injects the raw text; AI-response mode sends the transcript to a local language model (via Ollama) and injects the polished result instead. Neither mode produces any outbound network traffic - the entire pipeline, from microphone to cursor, runs on your machine.
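The AI-response path can be sketched as a single HTTP call to a locally running Ollama server. The endpoint and JSON fields follow Ollama's `/api/generate` REST API on its default port 11434; the model name `llama3.2`, the prompt wording, and the function names are assumptions, and none of this is Typilot's actual code.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # Ollama's default local endpoint


def build_request(transcript: str, model: str = "llama3.2") -> urllib.request.Request:
    """Build a non-streaming generate request for a local Ollama server."""
    payload = {
        "model": model,   # assumed; use any model you have pulled locally
        "prompt": "Clean up this dictated text, fixing punctuation only:\n" + transcript,
        "stream": False,  # ask for one JSON object instead of a stream of chunks
    }
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def polish(transcript: str, model: str = "llama3.2") -> str:
    """Send the transcript to the local model and return its reply text."""
    with urllib.request.urlopen(build_request(transcript, model), timeout=60) as resp:
        return json.loads(resp.read())["response"]
```

Calling `polish("so um the meeting is moved to friday")` requires Ollama to be running with the named model available. Because the URL points at localhost, the request stays on the machine.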
For teams evaluating alternatives, dedicated Typilot vs Superwhisper and Typilot vs MacWhisper comparisons cover the detailed trade-offs.
Who else is doing local dictation in 2026
The offline dictation space has matured. A few tools worth knowing:
- Superwhisper (Mac, iOS, Windows) - polished UI, multiple Whisper models, $849 lifetime. Best for users who want a fully managed experience with model switching and custom AI modes.
- MacWhisper (Mac only) - file-based transcription, not system-wide dictation. Good for batch transcription of audio files; less suitable for live voice-in-any-app use.
- Handy / OpenWhispr (macOS, Windows, Linux, open source, free) - barebones but fully capable. Handy runs Whisper or Parakeet locally with zero cost.
- Murmur (Windows, free) - Microsoft Store app built on Whisper, straightforward and light.
The key differentiator across these tools is integration depth - whether transcribed text lands in the active app automatically, whether VAD trims the silence, and whether a local AI model can act on the transcript. Typilot covers all three in one package.
For a deeper look at the local-first reasoning, why Typilot is local-first covers the design decisions behind building an assistant that never phones home.
The short version
Offline speech to text is production-quality and widely accessible in 2026. Whisper large reaches ~2.7% WER on clean English audio, comparable to mainstream cloud services, and runs comfortably on Apple Silicon or a discrete GPU with 6+ GB VRAM; the smaller models run in real time on most laptops made in the last five years. The trade-off versus cloud is a one-time model download (75 MB to 3 GB) and slightly higher latency on older CPU-only hardware.
If you want this on your machine without building it yourself, Typilot bundles local Whisper with a global dictation shortcut, VAD, three activation modes, and direct text injection into every app - no cloud, no subscription, 3-day free trial. The full architecture - what runs where and what never leaves your device - is documented on the security page.
Common questions
Can speech to text work without the internet?
Yes. Apps that bundle a local speech recognition model such as Whisper process your audio entirely on your own device. No audio is uploaded, and after the one-time model download the tool works fully offline - in airplane mode, on a blocked network, or anywhere with no connectivity.
How accurate is offline speech to text?
Whisper large achieves around 2.7% word error rate on clean English audio, which is comparable to mainstream cloud services. Real-world dictation with background noise or accents typically lands in the 8-12% range. The small and medium Whisper models reach roughly 3.4% and 2.9% WER on the same clean benchmark; small runs in real time on most modern laptops, while medium is best suited to Apple Silicon.
What hardware do I need for offline speech recognition?
The tiny and base Whisper models run on any laptop with 8 GB RAM. The medium model needs 16 GB RAM and runs at full real-time speed on Apple Silicon (M1 or later). The large model benefits from Apple Silicon or a discrete GPU with 6+ GB VRAM; on CPU-only hardware it introduces 1-3 seconds of lag per segment.
Which apps support offline speech to text on Mac and Windows?
Several tools bundle local Whisper for offline dictation: Typilot (Mac, Windows, Linux - system-wide dictation in any app), Superwhisper (Mac, iOS, Windows), MacWhisper (Mac, file transcription), Handy and OpenWhispr (open source, cross-platform, free), and Murmur (Windows, free). The key difference is whether the tool injects text directly into the active application.