What Is Whisper?

Title: How Whisper Turns Voice into Text

Introduction

Whisper is a speech recognition model that converts spoken audio into text. It was developed by OpenAI. You can use it to transcribe what people say. Let’s see how it works.


1. What Whisper Can Do

Whisper was trained on audio in many languages. It can:

  • Transcribe speech in English, Japanese, and many other languages
  • Translate speech in other languages into English text
  • Detect when speech is present and mark phrases with timestamps
  • Identify which language is being spoken

The models come in several sizes: tiny, base, small, medium, and large. The large model is the most accurate, but it needs the most memory and runs the slowest.
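
To get a feel for why size matters, here is a rough back-of-the-envelope sketch. The parameter counts are the approximate figures listed in the Whisper README. The 2 bytes per parameter (float16) is an assumption, and the estimate covers the weights only, not the extra memory used while the model runs.

```python
# Approximate parameter counts in millions, as listed in the Whisper README.
PARAMS_MILLIONS = {
    "tiny": 39,
    "base": 74,
    "small": 244,
    "medium": 769,
    "large": 1550,
}


def weight_memory_mb(params_millions: float, bytes_per_param: int = 2) -> float:
    """Estimate memory for the weights alone (assumption: 2 bytes = float16)."""
    total_bytes = params_millions * 1_000_000 * bytes_per_param
    return total_bytes / 1_000_000  # bytes -> megabytes


for name, params in PARAMS_MILLIONS.items():
    print(f"{name:>6}: ~{weight_memory_mb(params):.0f} MB of weights")
```

Under these assumptions the tiny model needs well under 100 MB for its weights, while the large model needs about 3 GB.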


2. Step-by-Step: How Whisper Works

Step 1: Voice Input

Whisper processes audio in 30-second chunks. The audio is resampled to 16 kHz, and shorter clips are padded with silence to fill the chunk.
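
The chunking idea can be sketched in a few lines. This is a simplified illustration, not Whisper's actual code: the function name `split_into_chunks` is made up for this example, it cuts at fixed 30-second boundaries, and it assumes the audio is already a list of samples at 16 kHz.

```python
from typing import List

SAMPLE_RATE = 16_000                         # samples per second
CHUNK_SECONDS = 30
CHUNK_SAMPLES = SAMPLE_RATE * CHUNK_SECONDS  # 480,000 samples per chunk


def split_into_chunks(samples: List[float]) -> List[List[float]]:
    """Split audio into 30-second chunks, zero-padding the last one."""
    chunks = []
    for start in range(0, len(samples), CHUNK_SAMPLES):
        chunk = samples[start:start + CHUNK_SAMPLES]
        # Pad with silence (zeros) so every chunk has the same length.
        chunk = chunk + [0.0] * (CHUNK_SAMPLES - len(chunk))
        chunks.append(chunk)
    return chunks


audio = [0.1] * (SAMPLE_RATE * 45)  # 45 seconds of dummy audio
chunks = split_into_chunks(audio)
print(len(chunks), len(chunks[0]), len(chunks[1]))  # 2 480000 480000
```

The second chunk holds 15 seconds of audio followed by 15 seconds of padding.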

Step 2: Convert to a Picture

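It converts the audio into a log-mel spectrogram, a 2D representation that looks like an image. One axis is time, the other is frequency, and each value shows how strong that frequency is at that moment. This form is easier for the model to work with than the raw waveform.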
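
The size of this "image" can be worked out with simple arithmetic. The numbers below follow the Whisper paper's description (16 kHz audio, 25 ms windows, 10 ms steps, 80 mel channels). The `hz_to_mel` formula is one common variant of the mel scale and is included only as an illustration. It is an assumption, not necessarily the exact formula Whisper uses.

```python
import math

SAMPLE_RATE = 16_000   # samples per second
CHUNK_SECONDS = 30
WINDOW_SAMPLES = 400   # 25 ms analysis window
HOP_SAMPLES = 160      # 10 ms step between windows
N_MELS = 80            # mel frequency bands (rows of the "image")

n_samples = SAMPLE_RATE * CHUNK_SECONDS
n_frames = n_samples // HOP_SAMPLES  # columns of the "image"
print(f"spectrogram shape: {N_MELS} x {n_frames}")  # spectrogram shape: 80 x 3000


def hz_to_mel(freq_hz: float) -> float:
    """One common mel-scale formula (assumption; implementations differ)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


# The mel scale squeezes high frequencies together, similar to human hearing.
for freq in (100, 1000, 8000):
    print(f"{freq:>5} Hz -> {hz_to_mel(freq):.0f} mel")
```

So each 30-second chunk becomes a grid of 80 rows by 3000 columns.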

Step 3: Encoder

The encoder reads the spectrogram and extracts important features from it. The decoder uses these features to work out what was said.

Step 4: Decoder

The decoder writes the text. First, it predicts a language token. Then, it generates the text one token at a time.
It uses the audio features and all the tokens written so far to predict the next one. This repeats until it outputs an end-of-text token.
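
The loop can be sketched with a toy stand-in for the decoder. Everything in `TOY_SCORES` is made up for illustration, and `next_token_scores` is a hypothetical helper, not part of Whisper. To stay small, the toy looks only at the last token, while the real decoder looks at all previous tokens and the audio features.

```python
START = "<|startoftranscript|>"
END = "<|endoftext|>"

# Hypothetical scores for the next token, keyed by the last token written.
TOY_SCORES = {
    START: {"<|en|>": 0.9, "<|ja|>": 0.1},
    "<|en|>": {"<|transcribe|>": 0.8, "<|translate|>": 0.2},
    "<|transcribe|>": {" Hello": 0.7, " Hi": 0.3},
    " Hello": {" world": 0.6, END: 0.4},
    " world": {END: 0.9, " Hello": 0.1},
}


def next_token_scores(tokens):
    """Toy stand-in for the decoder (looks only at the last token)."""
    return TOY_SCORES[tokens[-1]]


def decode(max_tokens=10):
    tokens = [START]
    while len(tokens) < max_tokens:
        scores = next_token_scores(tokens)
        best = max(scores, key=scores.get)  # pick the highest-scoring token
        tokens.append(best)
        if best == END:                     # stop at the end-of-text token
            break
    return tokens


tokens = decode()
print(tokens)
```

In this toy run the language token comes first, then a task token, then the text tokens, and finally the end-of-text token.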


3. What’s Inside Whisper

Whisper uses an encoder-decoder Transformer. Transformers are also the basis of models such as GPT and BERT. They help a model capture order and context in a sequence.

a. Positional Encoding

Voice or text has order. For example:

  • “The dog bit the man”
  • “The man bit the dog”

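These sentences use the same words but mean different things. A Transformer does not know the order of its inputs by itself, so Whisper adds position information to each input.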
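
One classic way to build position information is the sinusoidal encoding from the original Transformer paper, sketched below. This is an illustration of the idea, not Whisper's exact layout, and the 4-number embedding for "dog" is made up.

```python
import math
from typing import List


def sinusoidal_position(pos: int, dim: int) -> List[float]:
    """Position vector: sine on even indices, cosine on odd indices."""
    vec = []
    for i in range(dim):
        angle = pos / (10000 ** (2 * (i // 2) / dim))
        vec.append(math.sin(angle) if i % 2 == 0 else math.cos(angle))
    return vec


dog = [0.5, -0.2, 0.1, 0.9]  # hypothetical embedding for "dog"

# "dog" is the 2nd word in one sentence and the 5th word in the other.
dog_at_1 = [e + p for e, p in zip(dog, sinusoidal_position(1, 4))]
dog_at_4 = [e + p for e, p in zip(dog, sinusoidal_position(4, 4))]

print(dog_at_1 != dog_at_4)  # True
```

The same word gets a different vector at each position, so the model can tell the two sentences apart.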

b. Self-Attention

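With self-attention, each position looks at every other position in the sequence at once and learns which parts matter most. Because all positions are processed in parallel, training is faster than with older sequential models such as RNNs.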
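
Here is a minimal sketch of scaled dot-product self-attention in plain Python. It is simplified on purpose: a real Transformer first multiplies the inputs by learned weight matrices to get queries, keys, and values, while this sketch uses the inputs directly for all three. The input vectors are made up.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    """Turn scores into positive weights that sum to 1."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: Vector, b: Vector) -> float:
    return sum(x * y for x, y in zip(a, b))


def self_attention(xs: List[Vector]) -> List[Vector]:
    """Each output is a weighted average of all inputs (no learned weights)."""
    scale = math.sqrt(len(xs[0]))
    outputs = []
    for query in xs:
        # How similar is this position to every position (itself included)?
        weights = softmax([dot(query, key) / scale for key in xs])
        mixed = [sum(w * v[d] for w, v in zip(weights, xs))
                 for d in range(len(query))]
        outputs.append(mixed)
    return outputs


frames = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]  # three hypothetical inputs
for row in self_attention(frames):
    print([round(v, 3) for v in row])
```

Each output mixes information from the whole sequence, with more weight on the inputs that are most similar to the current one.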

c. Multi-Head Attention

Whisper runs several attention “heads” in parallel. Each head can focus on a different kind of relationship in the data, and their results are combined.
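
The splitting and combining can be sketched as follows. This is a simplified illustration: each head simply gets its own slice of every input vector, while a real Transformer uses learned projections for each head. The input vectors are made up.

```python
import math
from typing import List

Vector = List[float]


def softmax(scores: Vector) -> Vector:
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attention(xs: List[Vector]) -> List[Vector]:
    """Single-head attention with queries = keys = values (a simplification)."""
    scale = math.sqrt(len(xs[0]))
    outputs = []
    for q in xs:
        scores = [sum(a * b for a, b in zip(q, k)) / scale for k in xs]
        weights = softmax(scores)
        outputs.append([sum(w * v[d] for w, v in zip(weights, xs))
                        for d in range(len(q))])
    return outputs


def multi_head_attention(xs: List[Vector], n_heads: int) -> List[Vector]:
    dim = len(xs[0])
    if dim % n_heads != 0:
        raise ValueError("vector size must be divisible by the number of heads")
    head_dim = dim // n_heads
    # Each head runs attention on its own slice of every vector.
    head_outputs = []
    for h in range(n_heads):
        lo, hi = h * head_dim, (h + 1) * head_dim
        head_outputs.append(attention([x[lo:hi] for x in xs]))
    # Join the heads' results back into full-size vectors.
    combined = []
    for i in range(len(xs)):
        row = []
        for head in head_outputs:
            row.extend(head[i])
        combined.append(row)
    return combined


xs = [[1.0, 0.0, 0.0, 1.0], [0.0, 1.0, 1.0, 0.0]]
print(multi_head_attention(xs, n_heads=2))
```

The output has the same shape as the input, but each half of every vector was computed by a different head.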


4. Making Words from Numbers

A neural network works with numbers, not letters. So Whisper splits text into small pieces called tokens and gives each one a number (ID).

Tokenization

The tokenizer breaks long words into smaller parts. For example, a split could look like this (the real split depends on the tokenizer's vocabulary):

  • “transcription” → “tran” + “script” + “ion”

Each part is mapped to an ID. After decoding, Whisper converts the IDs back into text.
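
The round trip from text to IDs and back can be sketched with a toy tokenizer. The vocabulary below is made up and tiny, and the matching rule (always take the longest known piece) is a simple stand-in. Whisper's real tokenizer uses byte pair encoding with a much larger vocabulary.

```python
from typing import List

# Hypothetical toy vocabulary: piece -> ID.
VOCAB = {"tran": 0, "script": 1, "ion": 2, "de": 3, "s": 4}
ID_TO_PIECE = {token_id: piece for piece, token_id in VOCAB.items()}


def encode(word: str) -> List[int]:
    """Split a word into known pieces, longest match first, and return IDs."""
    ids = []
    pos = 0
    while pos < len(word):
        for end in range(len(word), pos, -1):
            piece = word[pos:end]
            if piece in VOCAB:
                ids.append(VOCAB[piece])
                pos = end
                break
        else:
            raise ValueError(f"no vocabulary entry covers {word[pos:]!r}")
    return ids


def decode(ids: List[int]) -> str:
    """Turn IDs back into text by joining their pieces."""
    return "".join(ID_TO_PIECE[i] for i in ids)


ids = encode("transcription")
print(ids)          # [0, 1, 2]
print(decode(ids))  # transcription
```

Because words are built from shared pieces, a small vocabulary can still cover new words such as "descriptions".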


5. How It Chooses Words

At each step, Whisper computes a score for every token in its vocabulary. How it picks the next token from these scores depends on the decoding strategy.

a. Greedy Search

Always picks the token with the highest score at each step. It is fast, but it can miss a better overall sentence and sometimes gets stuck repeating itself.

b. Beam Search

Keeps the top few candidate sentences (for example, 2–5) at each step and chooses the one with the best total score at the end.
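
The difference between the two strategies shows up in a small toy example. The probabilities in `TOY_PROBS` are made up so that the best first token does not lead to the best sentence. A beam width of 1 behaves like greedy search.

```python
import math
from typing import List, Tuple

START, END = "<start>", "<end>"

# Hypothetical next-token probabilities, keyed by the last token.
TOY_PROBS = {
    START: {"a": 0.6, "the": 0.4},
    "a": {"cat": 0.55, "dog": 0.45},
    "the": {"cat": 0.9, "dog": 0.1},
    "cat": {END: 1.0},
    "dog": {END: 1.0},
}


def beam_search(beam_width: int, max_steps: int = 10) -> Tuple[List[str], float]:
    """Return the best (tokens, log probability) found with the given width."""
    beams = [([START], 0.0)]
    for _ in range(max_steps):
        candidates = []
        for tokens, logp in beams:
            if tokens[-1] == END:
                candidates.append((tokens, logp))  # finished, carry forward
                continue
            for token, p in TOY_PROBS[tokens[-1]].items():
                candidates.append((tokens + [token], logp + math.log(p)))
        # Keep only the highest-scoring candidates.
        beams = sorted(candidates, key=lambda c: c[1], reverse=True)[:beam_width]
        if all(tokens[-1] == END for tokens, _ in beams):
            break
    return beams[0]


for width in (1, 2):
    tokens, logp = beam_search(width)
    print(width, tokens, round(math.exp(logp), 2))
```

With width 1, the search commits to "a" and ends with a probability of 0.33. With width 2, it keeps "the" alive and finds "the cat" with a probability of 0.36.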

c. Temperature

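Instead of always taking the top token, Whisper can sample a token at random according to the scores. A higher temperature makes the choice more random, and a temperature of 0 is the same as greedy search. For transcription, sampling is mainly useful as a fallback when the normal output looks wrong.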
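
The effect of temperature can be sketched like this. The tokens and scores are made up, and `sample_token` is a hypothetical helper for this example, not a Whisper function.

```python
import math
import random
from typing import List


def softmax_with_temperature(logits: List[float], temperature: float) -> List[float]:
    """Divide scores by the temperature, then turn them into probabilities."""
    scaled = [score / temperature for score in logits]
    top = max(scaled)
    exps = [math.exp(s - top) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def sample_token(tokens: List[str], logits: List[float],
                 temperature: float, rng: random.Random) -> str:
    if temperature == 0:
        # Temperature 0 means "always take the top score" (greedy).
        return tokens[logits.index(max(logits))]
    probs = softmax_with_temperature(logits, temperature)
    return rng.choices(tokens, weights=probs, k=1)[0]


tokens = [" cat", " cap", " cup"]  # hypothetical candidates
logits = [2.0, 1.0, 0.1]           # hypothetical scores

for temperature in (0.5, 1.0, 2.0):
    probs = softmax_with_temperature(logits, temperature)
    print(temperature, [round(p, 2) for p in probs])

rng = random.Random(0)  # fixed seed so the run is repeatable
print([sample_token(tokens, logits, 1.0, rng) for _ in range(5)])
```

At a low temperature almost all of the probability goes to the top token. At a high temperature the probabilities move closer together, so less likely tokens are picked more often.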


6. Whisper on iPhone

Smaller models such as tiny and small can run on an iPhone because they need less memory and compute. The large model is usually too heavy for a phone.


7. Whisper Is Open

Whisper’s code and trained models are open source and available on GitHub. Anyone can check how it works. You can also learn more from the research paper.


Conclusion

Whisper is a speech recognition model that listens and writes. It is built on the Transformer architecture. It is useful for many tasks: meeting notes, subtitles, translation, and more.

