Title: How Whisper Turns Voice into Text
Introduction
Whisper is a speech recognition model made by OpenAI. It turns spoken audio into text, so you can use it to write down what people say. Let’s see how it works.
1. What Whisper Can Do
Whisper can understand many languages. It can:
- Write down (transcribe) English, Japanese, and more
- Translate speech in other languages into English text
- Mark when phrases start and end (timestamps) and notice when nobody is speaking
- Guess what language is spoken
There are five sizes: tiny, base, small, medium, and large. The larger models are more accurate, but they are slower and need more memory.
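To see why size matters, here is a rough estimate of the memory needed just to hold each model's weights. The parameter counts are the approximate figures OpenAI publishes for each size. The 2 bytes per parameter (16-bit numbers) is an assumption, and real memory use while running is higher than the weights alone.

```python
# Approximate parameter counts, in millions, as published by OpenAI.
PARAMS_MILLIONS = {
    "tiny": 39,
    "base": 74,
    "small": 244,
    "medium": 769,
    "large": 1550,
}

def weight_memory_mb(size, bytes_per_param=2):
    """Rough memory for the weights alone (assumes 2 bytes per parameter)."""
    return PARAMS_MILLIONS[size] * 1_000_000 * bytes_per_param / 1_000_000

for size in PARAMS_MILLIONS:
    print(f"{size:>6}: about {weight_memory_mb(size):,.0f} MB of weights")
```

With these assumptions, tiny needs about 78 MB and large about 3,100 MB, which is roughly 40 times more.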
2. Step-by-Step: How Whisper Works
Step 1: Voice Input
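Whisper processes audio in 30-second windows. The audio is first resampled to 16 kHz (16,000 samples per second). Shorter audio is padded with silence, and longer audio is split into several windows.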
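A minimal sketch of this step, assuming the audio is already a plain list of samples at 16 kHz. The function name `split_into_windows` is made up for this example, and the fixed, non-overlapping windows are a simplification. OpenAI's code chooses where each window starts in a more careful way.

```python
SAMPLE_RATE = 16_000                            # samples per second
WINDOW_SECONDS = 30
WINDOW_SAMPLES = SAMPLE_RATE * WINDOW_SECONDS   # 480,000 samples

def split_into_windows(samples):
    """Split a list of audio samples into 30-second windows.

    The last window is padded with zeros (silence) to the full length.
    """
    windows = []
    for start in range(0, len(samples), WINDOW_SAMPLES):
        window = samples[start:start + WINDOW_SAMPLES]
        padding = WINDOW_SAMPLES - len(window)
        windows.append(window + [0.0] * padding)
    return windows

# 70 seconds of silent audio becomes three windows: 30 s, 30 s, and 10 s plus padding.
audio = [0.0] * (70 * SAMPLE_RATE)
windows = split_into_windows(audio)
print(len(windows), [len(w) for w in windows])
```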
Step 2: Convert to a Picture
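Whisper converts each window into a log-mel spectrogram. This is a picture-like grid: one axis is time, the other is frequency (pitch), and each cell shows how strong the sound is. The mel scale spaces frequencies in a way that is close to human hearing, which makes speech patterns easier for the model to learn.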
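The sketch below computes one column of such a grid in plain Python. The frame sizes (25 ms windows, 10 ms steps, 80 mel bands) match the Whisper paper. The rest is a simplified assumption: the mel formula is one common variant, the slow DFT stands in for a fast FFT, and the scaling Whisper applies afterwards is left out.

```python
import cmath
import math

SAMPLE_RATE = 16_000   # Whisper works on 16 kHz audio
N_FFT = 400            # 25 ms analysis window
HOP = 160              # 10 ms step between frames

def hz_to_mel(hz):
    # One common mel formula (an assumption; other variants exist).
    return 2595.0 * math.log10(1.0 + hz / 700.0)

def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)

def power_spectrum(frame):
    """Power spectrum of one Hann-windowed frame, using a slow but clear DFT."""
    n = len(frame)
    windowed = [x * (0.5 - 0.5 * math.cos(2 * math.pi * i / n))
                for i, x in enumerate(frame)]
    spectrum = []
    for k in range(n // 2 + 1):
        total = sum(x * cmath.exp(-2j * math.pi * k * i / n)
                    for i, x in enumerate(windowed))
        spectrum.append(abs(total) ** 2)
    return spectrum

def mel_filterbank(n_mels, n_fft, sample_rate):
    """Triangular filters spaced evenly on the mel scale."""
    max_mel = hz_to_mel(sample_rate / 2)
    edges = [mel_to_hz(max_mel * i / (n_mels + 1)) for i in range(n_mels + 2)]
    bin_hz = sample_rate / n_fft
    bank = []
    for m in range(1, n_mels + 1):
        lo, mid, hi = edges[m - 1], edges[m], edges[m + 1]
        row = []
        for k in range(n_fft // 2 + 1):
            f = k * bin_hz
            if lo < f <= mid:
                row.append((f - lo) / (mid - lo))   # rising side of the triangle
            elif mid < f < hi:
                row.append((hi - f) / (hi - mid))   # falling side of the triangle
            else:
                row.append(0.0)
        bank.append(row)
    return bank

def log_mel_frame(frame, bank):
    """One spectrogram column: log energy in each mel band."""
    power = power_spectrum(frame)
    return [math.log10(max(sum(w * p for w, p in zip(row, power)), 1e-10))
            for row in bank]

bank = mel_filterbank(80, N_FFT, SAMPLE_RATE)
tone = [math.sin(2 * math.pi * 1000 * i / SAMPLE_RATE) for i in range(N_FFT)]
column = log_mel_frame(tone, bank)
print(len(column), "mel values; strongest band:", column.index(max(column)))
print("frames in 30 s:", 30 * SAMPLE_RATE // HOP)
```

A 30-second window gives 3,000 columns of 80 values each, so the "image" is 80 by 3,000.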
Step 3: Encoder
The encoder reads the spectrogram and turns it into a sequence of feature vectors. These features describe the sounds in the audio. The decoder uses them to work out what was said.
Step 4: Decoder
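The decoder writes the text. First, it predicts a token that names the language. Then, it writes one token (a word or part of a word) at a time.
To guess the next token, it looks at all the tokens written so far and at the encoder’s features. This repeats until it produces an end-of-text token.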
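The loop can be sketched like this. The special token names are the ones Whisper uses, but `TOY_MODEL` and `next_token_scores` are made-up stand-ins for the real decoder, and the scores are invented. The sketch always takes the highest-scoring token and leaves out details such as timestamp tokens.

```python
START = "<|startoftranscript|>"
END = "<|endoftext|>"

# Hypothetical stand-in for the real model: fixed scores for each prefix.
TOY_MODEL = {
    (START,): {"<|en|>": 0.9, "<|ja|>": 0.1},
    (START, "<|en|>", "<|transcribe|>"): {" Hello": 0.7, " Yellow": 0.3},
    (START, "<|en|>", "<|transcribe|>", " Hello"): {" world": 0.6, " word": 0.4},
    (START, "<|en|>", "<|transcribe|>", " Hello", " world"): {END: 1.0},
}

def next_token_scores(audio_features, tokens):
    """Toy replacement for the decoder. A real decoder would also use audio_features."""
    return TOY_MODEL[tuple(tokens)]

def greedy_decode(audio_features, task="<|transcribe|>", max_tokens=20):
    tokens = [START]
    # Step 1: guess the language token.
    scores = next_token_scores(audio_features, tokens)
    tokens.append(max(scores, key=scores.get))
    # Step 2: the task is chosen by the caller, not guessed.
    tokens.append(task)
    # Step 3: write one token at a time until the end token appears.
    while len(tokens) < max_tokens:
        scores = next_token_scores(audio_features, tokens)
        best = max(scores, key=scores.get)
        if best == END:
            break
        tokens.append(best)
    return tokens

tokens = greedy_decode(audio_features=None)
text = "".join(t for t in tokens if not t.startswith("<|")).strip()
print(tokens)
print(text)
```

Every call to `next_token_scores` receives the whole list of tokens so far, not only the last one.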
3. What’s Inside Whisper
Whisper uses an encoder-decoder Transformer. Transformers are also the basis of GPT and BERT. They are good at modeling how the parts of a sequence relate to each other.
a. Positional Encoding
Voice or text has order. For example:
- “The dog bit the man”
- “The man bit the dog”
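These sentences use the same words but mean different things. A Transformer reads all positions at once, so by itself it cannot tell which part came first. Whisper therefore adds position information to its inputs: fixed sine and cosine patterns in the encoder, and learned position vectors in the decoder.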
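As an illustration, here is the classic sine and cosine position formula from the original Transformer paper. The exact layout inside Whisper may differ, and the function names and the tiny 4-number embeddings are made up for this example.

```python
import math

def sinusoidal_position(position, dim):
    """Position vector for one position (dim must be even)."""
    vector = []
    for i in range(0, dim, 2):
        angle = position / (10000 ** (i / dim))
        vector.append(math.sin(angle))
        vector.append(math.cos(angle))
    return vector

def add_positions(embeddings):
    """Add a position vector to every embedding in the sequence."""
    dim = len(embeddings[0])
    return [
        [x + p for x, p in zip(vec, sinusoidal_position(pos, dim))]
        for pos, vec in enumerate(embeddings)
    ]

# The same made-up embedding for "dog" sits at positions 1 and 4.
dog = [0.5, -0.2, 0.1, 0.7]
sentence = [[0.0] * 4, dog, [0.0] * 4, [0.0] * 4, dog]
with_pos = add_positions(sentence)
print(with_pos[1])
print(with_pos[4])
```

After the position vectors are added, the two copies of "dog" are no longer identical, so the model can tell them apart.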
b. Self-Attention
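Self-attention lets every position look at every other position at once. For each position, the model learns how much weight to give to the others. Because all positions are processed in parallel, training is faster than with older recurrent models (RNNs), which read one step at a time.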
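A minimal sketch of scaled dot-product attention in plain Python. To keep it short, it uses the input vectors directly as queries, keys, and values. A real Transformer first passes them through learned weight matrices, which are left out here.

```python
import math

def softmax(scores):
    peak = max(scores)                      # subtract the max for numerical safety
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def self_attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    dim = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # How similar is this position to every other position?
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        weights = softmax(scores)
        # Mix the value vectors using those weights.
        mixed = [sum(w * v[j] for w, v in zip(weights, values))
                 for j in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

x = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
outputs, weights = self_attention(x, x, x)
for row in weights:
    print([round(w, 2) for w in row])
```

Each row of weights adds up to 1. The first position gives more weight to the second position, which is similar to it, than to the third.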
c. Multi-Head Attention
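Whisper runs several attention “heads” in parallel. Each head can focus on a different kind of relationship, and their results are combined. This lets the model capture more patterns than a single head could.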
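The idea can be sketched by cutting each vector into equal slices, running attention on each slice, and joining the results. This is a simplified assumption: real multi-head attention also applies learned weight matrices for each head and one more after joining. The function names are made up for this example.

```python
import math

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of vectors."""
    dim = len(keys[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in keys]
        peak = max(scores)
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        out.append([sum(w * v[j] for w, v in zip(weights, values))
                    for j in range(len(values[0]))])
    return out

def split_heads(vectors, n_heads):
    """Cut every vector into n_heads equal slices, one list of slices per head."""
    size = len(vectors[0]) // n_heads
    return [[v[h * size:(h + 1) * size] for v in vectors] for h in range(n_heads)]

def multi_head_attention(vectors, n_heads):
    # Each head attends over its own slice of the data.
    head_outputs = [attention(h, h, h) for h in split_heads(vectors, n_heads)]
    # Join the head outputs back into one vector per position.
    return [sum((head[pos] for head in head_outputs), [])
            for pos in range(len(vectors))]

x = [[1.0, 0.0, 0.2, 0.8], [0.9, 0.1, 0.7, 0.3], [0.0, 1.0, 0.1, 0.9]]
out = multi_head_attention(x, n_heads=2)
print(out)
```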
4. Making Words from Numbers
Whisper does not work with letters directly. It works with numbers. Text is split into small pieces called tokens, and each token has its own number (ID).
Tokenization
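Whisper uses a byte-level BPE tokenizer. Common words are often a single token, while long or rare words are broken into smaller parts. For example (the real split depends on the tokenizer’s vocabulary):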
- “transcription” → “tran” + “script” + “ion”
These parts are turned into IDs. Later, Whisper changes the IDs back into text.
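Here is a toy version of the round trip from text to IDs and back. It is not BPE: it simply takes the longest matching piece at each step, and the tiny `VOCAB` is made up for this example. A real vocabulary has tens of thousands of entries.

```python
# Hypothetical vocabulary for illustration only.
VOCAB = {"tran": 0, "script": 1, "ion": 2, "s": 3, "t": 4, "r": 5,
         "a": 6, "n": 7, "c": 8, "i": 9, "p": 10, "o": 11}
ID_TO_PIECE = {i: piece for piece, i in VOCAB.items()}

def encode(word):
    """Turn a word into token IDs by greedy longest-match."""
    ids = []
    pos = 0
    while pos < len(word):
        # Try the longest possible piece first, then shorter ones.
        for end in range(len(word), pos, -1):
            piece = word[pos:end]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos = end
                break
        else:
            raise ValueError(f"cannot tokenize {word[pos:]!r}")
    return ids

def decode(ids):
    """Turn token IDs back into text."""
    return "".join(ID_TO_PIECE[i] for i in ids)

ids = encode("transcription")
print(ids)
print(decode(ids))
```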
5. How It Chooses Words
At each step, the decoder gives a score to every token in its vocabulary. There are several ways to choose the next token from these scores.
a. Greedy Search
Always picks the token with the highest score. This is fast, but it can miss a sentence that is better overall, and it can get stuck repeating itself.
b. Beam Search
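Keeps several candidate sentences (for example, the top 2–5) at each step and picks the one with the best total score at the end. This is slower than greedy search but often more accurate.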
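A small sketch shows the difference. `TOY_MODEL` is a made-up table of probabilities, not real Whisper output, and the sketch skips extras such as length normalization. With a beam width of 1 it behaves like greedy search.

```python
import math

END = "<end>"

# Hypothetical next-token probabilities, keyed by the tokens written so far.
TOY_MODEL = {
    (): {"wreck": 0.6, "recognize": 0.4},
    ("wreck",): {"a": 0.55, "the": 0.45},
    ("recognize",): {"speech": 0.95, "peach": 0.05},
}

def next_token_probs(tokens):
    # Any prefix not in the table simply ends the sentence.
    return TOY_MODEL.get(tuple(tokens), {END: 1.0})

def beam_search(beam_width, max_steps=10):
    beams = [([], 0.0)]          # (tokens, total log-probability)
    finished = []
    for _ in range(max_steps):
        candidates = []
        for tokens, logp in beams:
            for token, p in next_token_probs(tokens).items():
                if token == END:
                    finished.append((tokens, logp + math.log(p)))
                else:
                    candidates.append((tokens + [token], logp + math.log(p)))
        if not candidates:
            break
        # Keep only the best few partial sentences.
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]
    pool = finished or beams
    return max(pool, key=lambda c: c[1])[0]

print(beam_search(beam_width=1))
print(beam_search(beam_width=2))
```

With width 1, the search commits to "wreck" because it has the best first score. With width 2, it also keeps "recognize" and finds that "recognize speech" has a higher total probability (0.38 against 0.33).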
c. Temperature
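Temperature controls how random the choice is. At temperature 0, Whisper always picks the top token. At higher temperatures, lower-scoring tokens are picked more often. For transcription the goal is accuracy, not creativity, so OpenAI’s code starts at 0 and raises the temperature only when the output looks wrong, for example when it is very repetitive.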
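The effect of temperature can be sketched with a softmax over made-up scores. The token list, the scores, and the function names are assumptions for this example only.

```python
import math
import random

def softmax_with_temperature(scores, temperature):
    """Turn scores into probabilities; higher temperature flattens them."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def choose_token(tokens, scores, temperature, rng=random):
    if temperature == 0:
        # No randomness: same as greedy search.
        return tokens[scores.index(max(scores))]
    probs = softmax_with_temperature(scores, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]

tokens = [" world", " word", " whirled"]
scores = [2.0, 1.0, 0.1]
for t in (0.2, 1.0, 2.0):
    probs = softmax_with_temperature(scores, t)
    print(t, [round(p, 2) for p in probs])
print(choose_token(tokens, scores, temperature=0))
```

At a low temperature almost all of the probability goes to the top token. At a high temperature the three tokens become closer to equally likely.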
6. Whisper on iPhone
The tiny, base, and small models can run on an iPhone because they need less memory and computing power. The large model is usually too heavy for a phone.
7. Whisper Is Open
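Whisper’s code and model weights are on GitHub under the MIT license. Everyone can check how it works. You can also learn from the research paper, “Robust Speech Recognition via Large-Scale Weak Supervision.”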
Conclusion
Whisper is a smart AI that listens and writes. It uses Transformer models to do this. It can help in many ways: meeting notes, subtitles, translation, and more.