How to Transcribe Meetings Locally (No Cloud, No Upload)
A step-by-step guide to transcribing meetings on your own machine with local Whisper - no audio upload, no subscription. Covers setup, speaker labels, and offline summaries on Mac, Windows, and Linux.
To transcribe a meeting locally, you record the audio on your own computer, convert it to text with an on-device speech-to-text model like Whisper, and generate the summary with a local language model - so no audio, transcript, or prompt is ever uploaded to a cloud service. Everything stays on the machine you control.
This guide walks through the full pipeline - capture, transcription, speaker labels, and summary - and explains why a local workflow matters for any conversation you would not email to a stranger.
Why transcribe meetings locally?
Cloud meeting assistants such as Otter.ai and Fireflies.ai upload every minute of your audio to their servers, where it is transcribed, stored, and - depending on the plan - used to train models. For a casual standup that may be fine. For a deposition, a patient consult, an M&A call, or anything under an NDA, it is a problem you cannot fix with a settings toggle.
Local transcription removes the exposure at the source. There is no upload, so there is nothing for a vendor to log, leak, retain, or hand over. Three concrete reasons it is worth the setup:
- Confidentiality by architecture. The audio never leaves the device, so privilege and NDAs stay intact without trusting a third party's retention policy.
- It works offline. Once the models are downloaded, the whole pipeline runs in airplane mode - on a train, in a SCIF, on hotel wifi you do not trust.
- No per-minute bill. Cloud tools charge per seat or per transcript minute. Local inference has a marginal cost of zero, so the only cost is the hardware you already own.
What you need
Three pieces, all of which run on your own machine:
- A speech-to-text model. Whisper is the open standard. It ships in sizes from
tinytolargeand supports 90+ languages. - A local language model runner. Ollama runs models like Llama 3.x or Mistral locally and produces the summary.
- An app that ties it together - capturing both sides of the call, running diarization, and routing the transcript to the model. This is the part Typilot automates.
You can wire Whisper and Ollama together by hand on the command line. The steps below describe the assembled workflow, which is what Typilot does in one keystroke.
Step by step
1. Capture both sides of the call
A meeting has two audio streams: your microphone and the other participants coming out of your speakers (system audio). A local recorder captures both and mixes them, so a remote call records as cleanly as an in-person one. Typilot uses the native audio stack on each OS - Core Audio on macOS, WinAPI on Windows, PulseAudio/ALSA on Linux.
2. Transcribe on-device with Whisper
The recorder feeds the audio to a local Whisper model. Use a smaller model (base, small) on a modest laptop for near-real-time speed, or medium/large on a machine with a GPU or Apple Silicon for higher accuracy. Voice-activity detection (VAD) trims silence so the model only works on speech.
3. Assign speakers with diarization
Diarization splits the transcript into "who said what." On-device diarization labels each speaker, and a tool that remembers voices across sessions lets you name a person once and have them recognised in every future meeting. The result is a transcript that reads like a script, not a wall of text.
4. Summarise with a local model
The labeled transcript goes to your local Ollama model, which writes a structured recap - decisions, action items, open questions - the moment the meeting ends. Because the model runs locally, you can pick one tuned for summarisation and switch it any time.
5. Chat with the transcript
The strongest part of a local pipeline: you can ask the transcript questions afterward - "what did Marcus commit to?" - and the answer comes back from your local model, with the source conversation never having touched a server.
Local vs. cloud transcription
| | Local (Typilot) | Cloud (Otter, Fireflies) | |---|---|---| | Audio leaves your device | No - never | Yes - uploaded and stored | | Works offline | Yes, after model download | No - internet required | | Speaker labels | On-device diarization | Yes, server-side | | 90+ languages | Yes, via Whisper | Varies by plan | | Pricing | One-time license | Per seat / per minute | | Data retention risk | None - nothing uploaded | Governed by vendor policy |
The privacy difference is structural, not a setting. A cloud tool can promise not to retain your audio; a local tool has no server to retain it on. For regulated work, that distinction is the whole point.
Is local transcription accurate enough?
For clear audio, the medium and large Whisper models reach accuracy comparable to mainstream cloud transcription, and they handle 90+ languages out of the box. The trade-off is hardware: larger models need more RAM or a GPU. On Apple Silicon and modern discrete GPUs, even hour-long recordings transcribe in real time. On older hardware, drop to a smaller model - you trade a little accuracy for speed, and it still never uploads.
The short version
Recording locally, transcribing with Whisper, and summarising with a local model gives you the entire meeting-notes workflow - transcript, speaker labels, recap, and follow-up chat - without a single byte leaving your machine. If you would rather not assemble it by hand, Typilot ships the whole pipeline on macOS, Windows, and Linux, with a 3-day free trial and no audio upload, ever. The full architecture is documented on the security page.
Common questions.
Can you transcribe a meeting without uploading the audio?+
Yes. With on-device speech-to-text such as Whisper, the audio is processed locally on your computer and never sent to a server. Tools like Typilot capture the microphone and system audio, transcribe with local Whisper, and summarise with a local model via Ollama - so no audio, transcript, or prompt leaves the machine.
Is local meeting transcription accurate?+
Local Whisper models range from tiny to large. The medium and large variants reach accuracy comparable to mainstream cloud transcription for clear audio, and they handle 90+ languages. Larger models need more RAM or a GPU but run in real time on modern hardware.
Do I need an internet connection to transcribe a meeting?+
No. Once the speech-to-text and language models are downloaded, transcription and summarisation run fully offline. An internet connection is only needed for the initial model download.
How is local transcription different from Otter or Fireflies?+
Otter and Fireflies upload your meeting audio to their cloud, where it is stored and processed. Local transcription keeps the audio, transcript, and summary on your device. That matters for legal, medical, M&A, and any conversation under an NDA.