How On-Device Speaker Diarization Works (No Cloud)
A plain-English deep dive into speaker diarization - the layer that labels who said what in a transcript. How the pipeline works, why running it locally protects your voice data, and how a system recognises speakers across meetings.
Speaker diarization is the process of splitting a recording into segments labeled by who is speaking - turning a flat transcript into a script where every line has a name. On-device diarization does this entirely on your own computer, so the audio is never uploaded to label it. This post explains how diarization works under the hood, why running it locally matters, and how a system can learn to recognise the same voice across many meetings.
What is speaker diarization?
Transcription answers "what was said." Diarization answers "who said it." A raw Whisper transcript has timestamps but no speaker labels; diarization is the layer that breaks it into Speaker A, Speaker B, and so on, then - if the system knows them - replaces those with real names.
It is the difference between a wall of text and a usable record. For a two-person interview it is convenient. For a six-person meeting where you need to know exactly who committed to what, it is the whole point.
The pipeline, step by step
Diarization is not a single model - it is a short pipeline. Each stage is well understood and runs comfortably on a modern laptop.
1. Voice activity detection (VAD)
First, the system finds where speech actually is and discards silence and background noise. This is voice activity detection. Trimming non-speech up front makes everything downstream faster and more accurate.
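To make the idea concrete, here is a minimal energy-based VAD in plain Python. It is only a sketch: real systems generally use a trained model rather than a fixed energy threshold, and the function name `detect_speech`, the 30 ms frame length, and the threshold value are all illustrative assumptions.

```python
import math

def detect_speech(samples, sample_rate, frame_ms=30, threshold=0.02):
    """Return (start_sec, end_sec) regions whose RMS energy reaches the threshold.

    samples: list of floats in [-1.0, 1.0].
    frame_ms and threshold are illustrative values, not tuned defaults.
    """
    frame_len = int(sample_rate * frame_ms / 1000)
    n_frames = len(samples) // frame_len
    regions = []
    start = None  # start time of the speech region currently open, if any
    for i in range(n_frames):
        frame = samples[i * frame_len:(i + 1) * frame_len]
        rms = math.sqrt(sum(x * x for x in frame) / frame_len)
        t = i * frame_len / sample_rate
        if rms >= threshold and start is None:
            start = t                      # speech begins
        elif rms < threshold and start is not None:
            regions.append((start, t))     # speech ends
            start = None
    if start is not None:
        # Audio ended while speech was still active.
        regions.append((start, n_frames * frame_len / sample_rate))
    return regions

# Synthetic input: 1 s of silence, 1 s of a 220 Hz tone, 1 s of silence.
sr = 16000
silence = [0.0] * sr
tone = [0.5 * math.sin(2 * math.pi * 220 * n / sr) for n in range(sr)]
regions = detect_speech(silence + tone + silence, sr)
print(regions)  # one region, roughly from second 1 to second 2
```

A fixed threshold like this breaks down in noisy rooms, which is one reason trained VAD models are preferred in practice.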
2. Segmentation
The speech regions are cut into short, homogeneous chunks - stretches likely to contain a single speaker. A chunk that straddles a speaker change blends two voices together and is hard for later stages to attribute, so segmentation errs on the side of short windows.
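One simple way to do this is uniform segmentation: slide a fixed window across each speech region. The sketch below assumes that approach; the function name `segment_regions`, the 1.5 s window, and the 0.75 s hop are illustrative choices, and real pipelines may use a model that detects speaker changes instead.

```python
def segment_regions(regions, window=1.5, hop=0.75):
    """Cut (start, end) speech regions into short, overlapping windows.

    window and hop are in seconds; both values are illustrative.
    """
    segments = []
    for start, end in regions:
        t = start
        while t < end:
            segments.append((round(t, 3), round(min(t + window, end), 3)))
            if t + window >= end:
                break  # this window already reaches the end of the region
            t += hop
    return segments

segments = segment_regions([(0.0, 4.0)])
print(segments)
# [(0.0, 1.5), (0.75, 2.25), (1.5, 3.0), (2.25, 3.75), (3.0, 4.0)]
```

Overlapping windows mean a speaker change rarely falls in the middle of every window that covers it, at the cost of producing more segments to embed.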
3. Speaker embeddings
Each segment is converted into a speaker embedding - a fixed-length vector that captures the acoustic fingerprint of a voice, independent of the words spoken. Two segments from the same person land close together in this vector space; two different people land far apart. This is the core trick that makes the rest possible.
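"Close together" is usually measured with cosine similarity. The sketch below shows the comparison on made-up 4-dimensional vectors; real embeddings come from a neural network and typically have a few hundred dimensions, and the variable names here are purely illustrative.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Made-up vectors standing in for embeddings of three audio segments.
voice_1_seg_a = [0.9, 0.1, 0.3, 0.0]
voice_1_seg_b = [0.8, 0.2, 0.4, 0.1]
voice_2_seg_a = [0.1, 0.9, 0.0, 0.4]

same = cosine_similarity(voice_1_seg_a, voice_1_seg_b)
different = cosine_similarity(voice_1_seg_a, voice_2_seg_a)
print(round(same, 2), round(different, 2))  # 0.98 0.19
```

Because the score depends only on direction, not magnitude, a louder or quieter segment of the same voice can still score as a match.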
4. Clustering
The embeddings are grouped so that segments from the same voice fall into the same cluster. The number of clusters becomes the number of distinct speakers, and every segment inherits its cluster's label - Speaker A, Speaker B, and so on.
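A common family of methods here is agglomerative clustering: start with every segment in its own cluster and keep merging the closest pair until the remaining clusters are too far apart. The sketch below assumes average linkage over cosine distance; the function names, the 0.5 distance threshold, and the toy vectors are illustrative, not taken from any particular product.

```python
import math

def cosine_distance(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norm

def cluster_embeddings(embeddings, threshold=0.5):
    """Average-linkage agglomerative clustering; returns one label per embedding."""
    clusters = [[i] for i in range(len(embeddings))]

    def linkage(c1, c2):
        # Mean pairwise distance between the members of two clusters.
        dists = [cosine_distance(embeddings[i], embeddings[j]) for i in c1 for j in c2]
        return sum(dists) / len(dists)

    while len(clusters) > 1:
        best = None  # (distance, index_a, index_b) of the closest pair
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = linkage(clusters[a], clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        if best[0] > threshold:
            break  # remaining clusters are too far apart to be one voice
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]

    # Name clusters in order of first appearance: Speaker A, Speaker B, ...
    clusters.sort(key=min)
    labels = [None] * len(embeddings)
    for k, members in enumerate(clusters):
        for i in members:
            labels[i] = "Speaker " + chr(ord("A") + k)
    return labels

# Made-up embeddings for four segments, alternating between two voices.
embeddings = [
    [0.9, 0.1, 0.3, 0.0],
    [0.1, 0.9, 0.0, 0.4],
    [0.8, 0.2, 0.4, 0.1],
    [0.2, 0.8, 0.1, 0.5],
]
labels = cluster_embeddings(embeddings)
print(labels)  # ['Speaker A', 'Speaker B', 'Speaker A', 'Speaker B']
```

Stopping on a distance threshold rather than a fixed cluster count is what lets the number of speakers fall out of the data instead of being supplied up front.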
5. Labeling and merging
Finally, the cluster labels are stitched back onto the transcript timeline. The result is a transcript where each line is attributed to a speaker, ready to read or summarise.
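One straightforward way to do the stitching is to give each transcript line the speaker whose diarization turns overlap it the most, then merge consecutive lines from the same speaker. The sketch below assumes that approach; the function name `label_transcript`, the tuple layouts, and the sample lines are all hypothetical.

```python
def overlap(a_start, a_end, b_start, b_end):
    """Length in seconds of the intersection of two time intervals."""
    return max(0.0, min(a_end, b_end) - max(a_start, b_start))

def label_transcript(lines, turns):
    """Attach speakers to transcript lines.

    lines: [(start, end, text)] from transcription.
    turns: [(start, end, speaker)] from diarization.
    Returns [(speaker, text)] with consecutive same-speaker lines merged.
    """
    labeled = []
    for start, end, text in lines:
        # Total overlap between this line and each speaker's turns.
        totals = {}
        for t_start, t_end, speaker in turns:
            totals[speaker] = totals.get(speaker, 0.0) + overlap(start, end, t_start, t_end)
        if totals and max(totals.values()) > 0:
            speaker = max(totals, key=totals.get)
        else:
            speaker = "Unknown"  # no diarization turn covers this line
        if labeled and labeled[-1][0] == speaker:
            labeled[-1] = (speaker, labeled[-1][1] + " " + text)
        else:
            labeled.append((speaker, text))
    return labeled

# Illustrative input, not real meeting data.
lines = [
    (0.0, 2.0, "Let's start with the roadmap."),
    (2.1, 4.0, "I'll go first."),
    (4.2, 6.0, "Sounds good."),
]
turns = [(0.0, 4.1, "Speaker A"), (4.1, 6.5, "Speaker B")]
result = label_transcript(lines, turns)
for speaker, text in result:
    print(f"{speaker}: {text}")
```

Working at the level of whole lines is a simplification: when a line spans a speaker change, splitting it by word-level timestamps gives a cleaner result.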
Why run diarization locally?
Cloud meeting tools upload your audio to do this work on their servers. On-device diarization runs the entire pipeline - VAD, embeddings, clustering, labeling - on your own machine, which changes the privacy math completely:
- The audio never leaves the device. There is no server copy of the recording to retain, leak, or hand over.
- It works offline. Once the models are local, you can diarize on a plane or in a room with no network.
- Voice fingerprints stay yours. Speaker embeddings are biometric-adjacent data. Keeping them on-device means your voiceprint is never stored in someone else's database.
A speaker embedding is, in effect, a numeric fingerprint of a voice. That is precisely the kind of data you do not want sitting on a third-party server - which is the strongest argument for doing diarization locally.
How a system learns voices over time
The most useful trick in a local diarizer is persistence. Plain diarization labels speakers anonymously within a single recording - Speaker A in today's meeting has nothing to do with Speaker A in yesterday's.
If the system stores speaker embeddings locally, it can compare a new voice against the ones it already knows and match them. Name a colleague once, and their embedding is remembered, so every future meeting labels them correctly without you lifting a finger. Because the embeddings live on your machine, this personalisation never becomes a cloud profile. This is how Typilot's meeting recap recognises recurring speakers across sessions.
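As a rough sketch of how such matching could work (not a description of any specific product's implementation), the example below keeps named embeddings in a local JSON file and matches new voices by cosine similarity. The `SpeakerStore` class, the file layout, the 0.75 similarity cut-off, and the toy vectors are all assumptions made for illustration.

```python
import json
import math
import tempfile
from pathlib import Path

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

class SpeakerStore:
    """Hypothetical local store of named voice embeddings in one JSON file."""

    def __init__(self, path):
        self.path = Path(path)
        # Load previously enrolled speakers if the file already exists.
        self.speakers = json.loads(self.path.read_text()) if self.path.exists() else {}

    def enroll(self, name, embedding):
        """Remember a named voice and persist it to local disk."""
        self.speakers[name] = list(embedding)
        self.path.write_text(json.dumps(self.speakers))

    def identify(self, embedding, min_similarity=0.75):
        """Return the best-matching known name, or None if nothing is close enough."""
        best_name, best_score = None, min_similarity
        for name, known in self.speakers.items():
            score = cosine_similarity(embedding, known)
            if score >= best_score:
                best_name, best_score = name, score
        return best_name

# First meeting: the user names a speaker once.
store_path = Path(tempfile.mkdtemp()) / "speakers.json"
SpeakerStore(store_path).enroll("Dana", [0.9, 0.1, 0.3, 0.0])

# Later meeting: a fresh store reloads the file and matches new voices.
store = SpeakerStore(store_path)
known = store.identify([0.8, 0.2, 0.4, 0.1])    # close to Dana's stored vector
unknown = store.identify([0.1, 0.9, 0.0, 0.4])  # far from every stored vector
print(known, unknown)  # Dana None
```

The threshold is the key design choice: set it too low and strangers get labeled as known colleagues, too high and known voices fall back to anonymous labels. Since the file holds voice fingerprints, a real implementation would also want to protect it at rest.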
What affects diarization accuracy?
Diarization is mature but not magic. Accuracy depends on a few real-world factors:
- Audio quality. Clean, separated audio diarizes far better than a single laptop mic picking up a noisy room.
- Overlapping speech. When two people talk at once, a single stretch of audio needs two labels, and any diarizer struggles - overlap remains one of the hardest problems in the field.
- Number of speakers. Two or three voices are easy; a dozen on one channel is genuinely hard.
- Segment length. Very short interjections ("yep", "agreed") carry little acoustic signal and are easier to mis-assign.
Capturing each side of a call cleanly - microphone and system audio kept distinct where possible - is the single biggest lever on quality.
Local vs. cloud diarization
| | On-device (Typilot) | Cloud meeting tools |
|---|---|---|
| Where audio is processed | Your machine | Vendor servers |
| Where voice embeddings live | Your machine | Vendor database |
| Works offline | Yes | No |
| Recognises speakers across meetings | Yes, stored locally | Sometimes, stored in the cloud |
| Privacy of biometric voice data | Stays on device | Held by a third party |
The short version
Diarization turns "what was said" into "who said it" through a five-stage pipeline: detect speech, segment it, turn each segment into a speaker embedding, cluster the embeddings, and label the transcript. Running that pipeline on-device keeps both the audio and the voice fingerprints on your machine - and storing the embeddings locally lets the system recognise the same people across meetings without ever building a cloud profile. That is exactly how Typilot's meeting recap works, with the full architecture documented on the security page.
Common questions
What is speaker diarization?
Speaker diarization is the process of splitting an audio recording into segments labeled by who is speaking. It turns a flat transcript into a script where each line is attributed to a speaker - the layer that answers "who said it" on top of transcription, which only answers "what was said."
Can speaker diarization run locally without the cloud?
Yes. The full pipeline - voice activity detection, segmentation, speaker embeddings, and clustering - runs comfortably on a modern laptop. On-device diarization keeps both the audio and the voice embeddings on your machine, so nothing is uploaded.
How accurate is speaker diarization?
For clean, well-separated audio with a handful of speakers, modern diarization is highly accurate. Accuracy drops with overlapping speech, many speakers on one channel, very short interjections, and noisy single-mic recordings. Capturing each side of a call distinctly is the biggest lever on quality.
How does a tool recognise the same speaker across different meetings?
By storing speaker embeddings - numeric voice fingerprints - locally and matching new audio against them. Name a person once and their embedding is remembered, so future meetings label them automatically. When this is done on-device, the voiceprint never becomes a cloud profile.