Whisper runs locally on Apple Silicon with no network access
OpenAI's Whisper speech-to-text model can run entirely on-device on Apple Silicon, using the Neural Engine and unified memory for real-time transcription without network access. Because the local version runs the same model, accuracy is unchanged, while network round-trip latency, data transmission, and the cloud API's per-minute cost all go away. The article details the Whisper pipeline, model sizes, and performance trade-offs on different Apple chips, noting that M2 devices can transcribe 10 minutes of audio in approximately 63 seconds, roughly 9.5 times faster than real time.
Key Takeaways
- Whisper is described as an encoder-decoder transformer trained on 5 million hours of audio.
- On Apple Silicon, the full pipeline runs locally: mic audio, mel spectrogram, encoder, decoder, and output text.
- Model sizes range from Tiny at 39M parameters and about 75 MB to Large-v3 at 1.55B parameters and about 2.9 GB of RAM.
- For M2 devices, the article says 10 minutes of audio can be transcribed in about 63 seconds.
- The OpenAI Whisper API costs $0.006 per minute, while the local version has zero per-minute cost and zero data transmission.
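The speed and cost figures above can be put side by side with some simple arithmetic. The sketch below is illustrative only: the function names are hypothetical, and the inputs are the numbers quoted in this summary (10 minutes of audio in about 63 seconds on M2, and $0.006 per minute for the API), not independent measurements.

```python
# Figures quoted in the summary; treat them as the article's claims, not benchmarks.
AUDIO_MINUTES = 10
LOCAL_SECONDS_M2 = 63
API_USD_PER_MINUTE = 0.006


def speed_factor(audio_seconds: float, processing_seconds: float) -> float:
    """How many times faster than real time the transcription runs."""
    return audio_seconds / processing_seconds


def api_cost_usd(audio_minutes: float, usd_per_minute: float = API_USD_PER_MINUTE) -> float:
    """Cloud API cost for a given amount of audio at a flat per-minute rate."""
    return audio_minutes * usd_per_minute


if __name__ == "__main__":
    factor = speed_factor(AUDIO_MINUTES * 60, LOCAL_SECONDS_M2)
    print(f"Local M2 speed: {factor:.1f}x real time")
    print(f"API cost for {AUDIO_MINUTES} min: ${api_cost_usd(AUDIO_MINUTES):.2f}")
    print(f"API cost for 1 hour: ${api_cost_usd(60):.2f}")
```

With these inputs, the local run works out to roughly 9.5 times real time, and the same 10 minutes would cost about $0.06 through the API, or about $0.36 per hour of audio.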
Why It Matters
This shows speech-to-text can move from cloud calls to fully local execution on Macs without changing the underlying Whisper model. For teams shipping dictation, captioning, or transcription features, the trade-off is now mostly between RAM, speed, and chip class rather than model access itself. The article also notes that some cloud dictation products post-process Whisper output through an LLM, which can rewrite non-English text; on-device use returns raw output. What to watch: how M1, M2, M3, and M4 performance compares in real workloads, especially the model size each chip can sustain.
Read full article at reddit.com