Understanding the Whisper Model Family
OpenAI's Whisper is the open-source speech recognition model that changed the entire landscape of transcription. Released in September 2022 and iteratively improved through 2025, Whisper ships in five core sizes: Tiny, Base, Small, Medium, and Large. Each size represents a fundamentally different trade-off between accuracy, speed, and resource consumption. Choosing the right one is not a matter of always picking the biggest model — it is about matching model capability to your hardware, workflow, and accuracy requirements.
In this guide we will walk through every model variant in detail, present benchmark data from our own testing on Apple Silicon hardware, and explain exactly which model you should choose depending on your situation. Whether you are building an application on top of Whisper, running transcription locally on your Mac, or simply curious about how speech recognition models scale, this article will give you the technical grounding you need.
How Whisper Models Are Structured
All Whisper models share the same fundamental architecture: an encoder-decoder Transformer. Audio is converted into 80-channel log-Mel spectrograms in 30-second chunks, processed by the encoder, and then decoded autoregressively into text tokens. The difference between model sizes is the number of layers, attention heads, and the dimensionality of the internal representations.
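The input geometry described above is fixed across all model sizes and can be sketched from a few constants in OpenAI's reference implementation (16 kHz audio, a hop length of 160 samples, 80 mel channels, 30-second chunks):

```python
# Sketch of Whisper's fixed input geometry, based on the constants
# in OpenAI's reference implementation.
SAMPLE_RATE = 16_000   # Whisper resamples all audio to 16 kHz
HOP_LENGTH = 160       # one spectrogram frame per 10 ms of audio
N_MELS = 80            # mel filterbank channels
CHUNK_SECONDS = 30     # fixed context window

def spectrogram_shape(chunk_seconds: int = CHUNK_SECONDS) -> tuple:
    """Return the (mel_channels, frames) shape the encoder sees."""
    frames = chunk_seconds * SAMPLE_RATE // HOP_LENGTH
    return (N_MELS, frames)

print(spectrogram_shape())  # (80, 3000)
```

Every 30-second chunk therefore becomes an 80 x 3000 spectrogram regardless of model size; only the network that processes it grows.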
Here is a summary of the model parameters:
- Tiny — 39 million parameters, 4 encoder layers, 4 decoder layers, model dimension 384
- Base — 74 million parameters, 6 encoder layers, 6 decoder layers, model dimension 512
- Small — 244 million parameters, 12 encoder layers, 12 decoder layers, model dimension 768
- Medium — 769 million parameters, 24 encoder layers, 24 decoder layers, model dimension 1024
- Large — 1.55 billion parameters, 32 encoder layers, 32 decoder layers, model dimension 1280
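The list above can be captured in a small lookup table, and a rough rule of thumb — two bytes per parameter for 16-bit weights, an estimate rather than an exact on-disk figure — explains the download and memory sizes you see in practice:

```python
# Model dimensions from the list above. The memory estimate assumes
# 16-bit (2-byte) weights; this is a back-of-the-envelope figure,
# not an exact download or in-RAM size.
WHISPER_MODELS = {
    #         (params,       enc/dec layers, model dim)
    "tiny":   (39_000_000,    4,             384),
    "base":   (74_000_000,    6,             512),
    "small":  (244_000_000,   12,            768),
    "medium": (769_000_000,   24,            1024),
    "large":  (1_550_000_000, 32,            1280),
}

def approx_fp16_megabytes(name: str) -> float:
    params, _, _ = WHISPER_MODELS[name]
    return params * 2 / 1e6  # 2 bytes per fp16 parameter

for name in WHISPER_MODELS:
    print(f"{name}: ~{approx_fp16_megabytes(name):.0f} MB of weights")
```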
The parameter count roughly doubles or triples at each step, and this has a direct impact on both the accuracy ceiling and the computational cost. Accuracy does not scale linearly with parameter count — there are diminishing returns at each step — but the improvements in edge cases, accented speech, noisy environments, and multilingual content are significant.
Whisper Tiny: The Speed Champion
Whisper Tiny is the smallest model in the family and is designed for scenarios where speed and resource efficiency are paramount. With only 39 million parameters, it loads in under a second on most hardware and processes audio at roughly 30 to 50 times real-time speed on an M1 Mac. That means a one-minute audio clip is transcribed in about one to two seconds.
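"Times real-time" figures convert directly into expected wall-clock times. A small helper makes the arithmetic explicit (the speed values are this article's M1 measurements, not guarantees):

```python
def transcription_seconds(audio_seconds: float, speed_multiple: float) -> float:
    """Wall-clock transcription time, given a 'times real-time' speed."""
    return audio_seconds / speed_multiple

# A one-minute clip at Tiny's 30-50x real-time on an M1 Mac:
print(transcription_seconds(60, 30))  # 2.0 seconds (slow end)
print(transcription_seconds(60, 50))  # 1.2 seconds (fast end)
```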
The accuracy of Tiny is surprisingly usable for clear, single-speaker English audio recorded with a decent microphone. In our testing with high-quality voice recordings, Tiny achieved a word error rate (WER) of approximately 8 to 10 percent on English content. That is good enough for quick notes, voice memos, and draft transcriptions that you plan to edit.
Where Tiny falls apart is in challenging conditions. Background noise, overlapping speakers, heavy accents, and non-English languages all degrade Tiny's output significantly. It also struggles with punctuation placement and capitalization more than larger models, which means more post-processing work. Technical vocabulary, medical terms, and proper nouns are frequently mangled.
Tiny is ideal for real-time dictation interfaces where latency matters more than perfection, for devices with extremely constrained memory, and as a first-pass filter in multi-stage transcription pipelines. If you are building an application that needs to provide instant feedback as the user speaks, Tiny can deliver that responsiveness.
Whisper Base: A Modest Step Up
Whisper Base approximately doubles the parameter count of Tiny, moving from 39 million to 74 million. This increase yields a noticeable improvement in accuracy, particularly for punctuation, capitalization, and handling of slightly noisy audio. In our benchmarks, Base achieved a WER of roughly 6 to 8 percent on clean English audio.
The speed difference between Tiny and Base is modest on modern hardware. On an M1 MacBook Air, Base processes audio at about 20 to 35 times real-time, which still means sub-second transcription for short utterances and a few seconds for longer clips. Memory consumption rises to approximately 150 megabytes, which is negligible on any Mac from the last five years.
Base is a reasonable choice for simple dictation workflows where the speaker has a clear voice and a quiet environment. It handles common English vocabulary well and produces cleaner output than Tiny with minimal additional cost. However, it still struggles with the same categories of difficulty: accents, background noise, code-switching between languages, and domain-specific terminology.
In practice, Base occupies an awkward middle ground. If speed is your top priority, Tiny is faster. If accuracy matters, Small is meaningfully better for a still-manageable resource cost. Base is most useful as a drop-in replacement for Tiny when you need a small accuracy boost without significantly changing your infrastructure.
Whisper Small: The Sweet Spot for Many Users
Whisper Small is where the model family starts to feel genuinely reliable for professional use. At 244 million parameters, it is roughly three times the size of Base and six times the size of Tiny. The jump in quality is substantial: our benchmarks show a WER of approximately 4 to 5 percent on clean English audio, with dramatically better handling of punctuation, sentence segmentation, and capitalization.
More importantly, Small begins to handle real-world recording conditions with reasonable grace. Moderate background noise, slightly accented speech, and conversational pace (with pauses, restarts, and filler words) are all managed significantly better than by the smaller models. Small also shows meaningful improvement in multilingual transcription, particularly for major European languages.
On Apple Silicon, Small runs at 10 to 20 times real-time speed, meaning a one-minute recording is transcribed in about three to six seconds. Memory consumption is around 500 megabytes, which is comfortable for any Mac with 8 gigabytes of RAM or more. The model loads in two to three seconds on first use.
For many users, Small represents the optimal balance between quality and performance. It is the model we recommend most frequently for daily dictation use in Scrybapp. The accuracy is high enough that most users can dictate and use the output with only minor corrections, while the speed is fast enough to feel essentially real-time for dictation purposes. If you are unsure which model to use and you have a Mac from 2020 or later, start with Small.
Whisper Medium: Diminishing Returns Begin
Whisper Medium triples the parameter count again, reaching 769 million parameters. The accuracy gains over Small are real but less dramatic than the jump from Base to Small. In our testing, Medium achieves a WER of roughly 3 to 4 percent on clean English audio — an improvement of about one percentage point over Small.
Where Medium distinguishes itself is in difficult audio conditions. Noisy recordings, speakers with strong regional accents, audio with multiple speakers, and content that mixes languages all see meaningful improvements. Medium also handles domain-specific vocabulary better, making it a stronger choice for medical, legal, and technical transcription.
The cost of Medium is significant. On an M1 MacBook Air, it processes audio at roughly 4 to 8 times real-time speed, meaning a one-minute recording takes about 8 to 15 seconds. Memory consumption rises to approximately 1.5 gigabytes, which starts to matter on 8-gigabyte machines running other applications. The model takes five to eight seconds to load on first use.
Medium is the right choice for users who regularly work with challenging audio or who need the highest possible accuracy for professional transcription. Journalists transcribing interviews in noisy environments, medical professionals dictating clinical notes with specialized terminology, and researchers working with multilingual content will all appreciate Medium's capabilities. However, for everyday dictation in a quiet office, the improvement over Small may not justify the slower speed.
Whisper Large: Maximum Accuracy, Maximum Cost
Whisper Large is the flagship model, now in its third major version (Large-v3). At 1.55 billion parameters, it is the most accurate Whisper model available, achieving WERs of 2 to 3 percent on clean English audio in our benchmarks. It is also the most capable multilingual model, supporting roughly 100 languages with varying degrees of accuracy.
Large-v3 incorporates training improvements that specifically target timestamp accuracy, hallucination reduction, and handling of long-form audio. It produces the most natural-sounding transcriptions with the best punctuation and formatting, and it is the least likely to hallucinate content during silence or noise segments.
The computational cost is substantial. On an M1 MacBook Air, Large processes audio at roughly 1.5 to 3 times real-time speed. A one-minute recording takes 20 to 40 seconds to transcribe. Memory consumption reaches approximately 3 gigabytes, making it impractical to run alongside other memory-intensive applications on 8-gigabyte machines. On an M3 Pro or M4 with 18 or more gigabytes of RAM, the experience is much more comfortable, with speeds of 3 to 6 times real-time.
Large is the right choice for batch transcription of important recordings, for multilingual content, and for situations where accuracy is paramount and speed is secondary. Professional transcription services, podcast post-production, and academic research are all strong use cases. For real-time dictation, however, Large's latency makes it impractical on most hardware — you want the text to appear as you speak, not 30 seconds later.
Benchmark Results on Apple Silicon
We ran standardized benchmarks across all five models on three Apple Silicon configurations using a set of 50 audio samples covering clear speech, noisy environments, accented English, and multilingual content. Here are the results:
M1 MacBook Air (8 GB RAM):
- Tiny: 0.03 seconds per second of audio, WER 9.2 percent
- Base: 0.04 seconds per second of audio, WER 7.1 percent
- Small: 0.08 seconds per second of audio, WER 4.6 percent
- Medium: 0.18 seconds per second of audio, WER 3.4 percent
- Large: 0.52 seconds per second of audio, WER 2.7 percent
M3 Pro MacBook Pro (18 GB RAM):
- Tiny: 0.01 seconds per second of audio, WER 9.2 percent
- Base: 0.02 seconds per second of audio, WER 7.1 percent
- Small: 0.04 seconds per second of audio, WER 4.6 percent
- Medium: 0.09 seconds per second of audio, WER 3.4 percent
- Large: 0.22 seconds per second of audio, WER 2.7 percent
M4 Max Mac Studio (64 GB RAM):
- Tiny: 0.005 seconds per second of audio, WER 9.2 percent
- Base: 0.008 seconds per second of audio, WER 7.1 percent
- Small: 0.02 seconds per second of audio, WER 4.6 percent
- Medium: 0.04 seconds per second of audio, WER 3.4 percent
- Large: 0.09 seconds per second of audio, WER 2.7 percent
Note that the WER is the same across hardware configurations because accuracy depends on the model, not the machine. What changes is the speed. On the M4 Max, even Large runs at over 10 times real-time, making it viable for interactive use.
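The WER figures above are computed in the standard way: the word-level edit distance (substitutions, insertions, and deletions) between the model's output and a reference transcript, divided by the number of reference words. A minimal implementation of that metric:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / reference words,
    via the classic dynamic-programming edit distance over words."""
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between ref[:i] and hyp[:j]
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution/match
    return dp[len(ref)][len(hyp)] / len(ref)

# One substitution out of four reference words:
print(word_error_rate("the quick brown fox", "the quick brown box"))  # 0.25
```

Production evaluations typically normalize text first (lowercasing, stripping punctuation) before computing this distance, since otherwise punctuation differences inflate the error rate.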
Practical Model Selection Guide
Given all of this data, here is our practical recommendation for different use cases:
- Real-time dictation on 8 GB Mac: Use Small. It provides strong accuracy with comfortable speed.
- Real-time dictation on 16+ GB Mac: Use Small or Medium. If you regularly dictate technical content, Medium is worth the extra latency.
- Batch transcription of interviews or meetings: Use Large. Speed is less important when processing recordings after the fact.
- Multilingual transcription: Use Large. The smaller models have significantly worse accuracy on non-English languages.
- Mobile or embedded applications: Use Tiny or Base. Resource constraints make larger models impractical.
- Quick voice memos: Use Tiny or Base. The content is typically short and informal, so minor errors are acceptable.
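The recommendations above can be expressed as a simple decision function. This is an illustrative sketch of the guide's logic — the use-case names and RAM threshold are hypothetical, not an API from Whisper or Scrybapp:

```python
def recommend_model(use_case: str, ram_gb: int = 8) -> str:
    """Map a use case to a Whisper model size, following the guide above.
    Use-case labels and the 16 GB threshold are illustrative assumptions."""
    if use_case in ("batch", "multilingual"):
        return "large"   # accuracy matters more than speed
    if use_case in ("mobile", "memo"):
        return "tiny"    # resource-constrained or informal content
    if use_case == "dictation":
        return "medium" if ram_gb >= 16 else "small"
    return "small"       # sensible default for everything else

print(recommend_model("dictation", ram_gb=8))   # small
print(recommend_model("dictation", ram_gb=18))  # medium
print(recommend_model("batch"))                 # large
```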
In Scrybapp, we default to the Small model for new users because it provides the best combination of accuracy and responsiveness for everyday dictation. Users can switch to Medium or Large in settings if they need higher accuracy for specific workflows. The model switching takes only a few seconds, so it is easy to use Small for daily dictation and switch to Large when you need to transcribe an important recording.
The Future of Whisper Models
The Whisper model family continues to evolve. Quantized versions (using 4-bit and 8-bit integer representations instead of 16-bit floating point) allow larger models to run on smaller hardware with minimal accuracy loss. Community fine-tuning has produced specialized variants optimized for specific languages, accents, and domains. Hardware improvements in each Apple Silicon generation make the next-larger model practical for real-time use.
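The memory impact of quantization is easy to estimate: weights drop from 2 bytes each (16-bit float) to 1 byte (8-bit integer) or half a byte (4-bit). These are back-of-the-envelope weight sizes only — activations and runtime overhead add more:

```python
LARGE_PARAMS = 1_550_000_000  # Whisper Large parameter count

def weight_gigabytes(params: int, bits: int) -> float:
    """Approximate weight storage in GB for a given bit width.
    Rough estimate: ignores activations and runtime overhead."""
    return params * bits / 8 / 1e9

for bits in (16, 8, 4):
    print(f"Large at {bits}-bit: ~{weight_gigabytes(LARGE_PARAMS, bits):.2f} GB")
```

At 4 bits, Large's weights shrink to under a gigabyte, which is why quantized builds can bring Large-class accuracy to 8-gigabyte machines.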
We expect that within the next year or two, running Large-class models in real-time on consumer hardware will be routine. Until then, the model selection decision remains important, and understanding the trade-offs is the key to getting the best experience from local speech-to-text.
For more on how Whisper powers local transcription, see our guides on local vs cloud speech-to-text privacy, how Apple Silicon accelerates AI transcription, and our complete ranking of speech-to-text apps for Mac.