I wanted to run it locally, but the function arguments weren't obvious to me, so I pulled up the package documentation below. I also wanted to try this new large-v3 model.
Incidentally, with GPU support for Whisper already set up and an 8 GB GPU on board, the following single line was all it took to get it working. (With less GPU memory you would presumably run into problems, of course.)
pip install faster-whisper
I haven't benchmarked it, but it is indeed fast.
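For reference, the dump below is simply the output of `help(faster_whisper)`. Based on it, here is a minimal usage sketch (my own illustration, not from the docs; it assumes faster-whisper is installed and an 8 GB-class CUDA GPU as above, and the file path is a placeholder):

```python
def transcribe_file(path: str, model_size: str = "large-v3"):
    # Imported lazily so the sketch loads even without the package installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cuda", compute_type="float16")
    segments, info = model.transcribe(path, beam_size=5)
    # transcribe() returns a generator; iterating it runs the actual decoding.
    for seg in segments:
        print(f"[{seg.start:.2f} -> {seg.end:.2f}] {seg.text}")
    return info.language
```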
Help on package faster_whisper:
NAME
faster_whisper
PACKAGE CONTENTS
audio
feature_extractor
tokenizer
transcribe
utils
vad
version
CLASSES
builtins.object
faster_whisper.transcribe.WhisperModel
class WhisperModel(builtins.object)
| WhisperModel(model_size_or_path: str, device: str = 'auto', device_index: Union[int, List[int]] = 0, compute_type: str = 'default', cpu_threads: int = 0, num_workers: int = 1, download_root: Optional[str] = None, local_files_only: bool = False)
|
| Methods defined here:
|
| __init__(self, model_size_or_path: str, device: str = 'auto', device_index: Union[int, List[int]] = 0, compute_type: str = 'default', cpu_threads: int = 0, num_workers: int = 1, download_root: Optional[str] = None, local_files_only: bool = False)
| Initializes the Whisper model.
|
| Args:
| model_size_or_path: Size of the model to use (tiny, tiny.en, base, base.en,
| small, small.en, medium, medium.en, large-v1, large-v2, large-v3, or large), a path to a
| converted model directory, or a CTranslate2-converted Whisper model ID from the HF Hub.
| When a size or a model ID is configured, the converted model is downloaded
| from the Hugging Face Hub.
| device: Device to use for computation ("cpu", "cuda", "auto").
| device_index: Device ID to use.
| The model can also be loaded on multiple GPUs by passing a list of IDs
| (e.g. [0, 1, 2, 3]). In that case, multiple transcriptions can run in parallel
| when transcribe() is called from multiple Python threads (see also num_workers).
| compute_type: Type to use for computation.
| See https://opennmt.net/CTranslate2/quantization.html.
| cpu_threads: Number of threads to use when running on CPU (4 by default).
| A non zero value overrides the OMP_NUM_THREADS environment variable.
| num_workers: When transcribe() is called from multiple Python threads,
| having multiple workers enables true parallelism when running the model
| (concurrent calls to self.model.generate() will run in parallel).
| This can improve the global throughput at the cost of increased memory usage.
| download_root: Directory where the models should be saved. If not set, the models
| are saved in the standard Hugging Face cache directory.
| local_files_only: If True, avoid downloading the file and return the path to the
| local cached file if it exists.
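To make the constructor arguments above concrete, a sketch of two typical configurations (my own example; the compute types follow the CTranslate2 quantization page linked above):

```python
def load_model(use_gpu: bool):
    from faster_whisper import WhisperModel

    if use_gpu:
        # ~8 GB of VRAM is enough for large-v3 in float16.
        return WhisperModel("large-v3", device="cuda", compute_type="float16")
    # CPU fallback: int8 quantization keeps memory and latency manageable.
    return WhisperModel("large-v3", device="cpu", compute_type="int8", cpu_threads=4)
```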
|
| add_word_timestamps(self, segments: List[dict], tokenizer: faster_whisper.tokenizer.Tokenizer, encoder_output: ctranslate2._ext.StorageView, num_frames: int, prepend_punctuations: str, append_punctuations: str, last_speech_timestamp: float) -> None
|
| encode(self, features: numpy.ndarray) -> ctranslate2._ext.StorageView
|
| find_alignment(self, tokenizer: faster_whisper.tokenizer.Tokenizer, text_tokens: List[int], encoder_output: ctranslate2._ext.StorageView, num_frames: int, median_filter_width: int = 7) -> List[dict]
|
| generate_segments(self, features: numpy.ndarray, tokenizer: faster_whisper.tokenizer.Tokenizer, options: faster_whisper.transcribe.TranscriptionOptions, encoder_output: Optional[ctranslate2._ext.StorageView] = None) -> Iterable[faster_whisper.transcribe.Segment]
|
| generate_with_fallback(self, encoder_output: ctranslate2._ext.StorageView, prompt: List[int], tokenizer: faster_whisper.tokenizer.Tokenizer, options: faster_whisper.transcribe.TranscriptionOptions) -> Tuple[ctranslate2._ext.WhisperGenerationResult, float, float, float]
|
| get_prompt(self, tokenizer: faster_whisper.tokenizer.Tokenizer, previous_tokens: List[int], without_timestamps: bool = False, prefix: Optional[str] = None) -> List[int]
|
| transcribe(self, audio: Union[str, BinaryIO, numpy.ndarray], language: Optional[str] = None, task: str = 'transcribe', beam_size: int = 5, best_of: int = 5, patience: float = 1, length_penalty: float = 1, repetition_penalty: float = 1, no_repeat_ngram_size: int = 0, temperature: Union[float, List[float], Tuple[float, ...]] = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0], compression_ratio_threshold: Optional[float] = 2.4, log_prob_threshold: Optional[float] = -1.0, no_speech_threshold: Optional[float] = 0.6, condition_on_previous_text: bool = True, prompt_reset_on_temperature: float = 0.5, initial_prompt: Union[str, Iterable[int], NoneType] = None, prefix: Optional[str] = None, suppress_blank: bool = True, suppress_tokens: Optional[List[int]] = [-1], without_timestamps: bool = False, max_initial_timestamp: float = 1.0, word_timestamps: bool = False, prepend_punctuations: str = '"\'“¿([{-', append_punctuations: str = '"\'.。,，!！?？:：”)]}、', vad_filter: bool = False, vad_parameters: Union[dict, faster_whisper.vad.VadOptions, NoneType] = None) -> Tuple[Iterable[faster_whisper.transcribe.Segment], faster_whisper.transcribe.TranscriptionInfo]
| Transcribes an input file.
|
| Arguments:
| audio: Path to the input file (or a file-like object), or the audio waveform.
| language: The language spoken in the audio. It should be a language code such
| as "en" or "fr". If not set, the language will be detected in the first 30 seconds
| of audio.
| task: Task to execute (transcribe or translate).
| beam_size: Beam size to use for decoding.
| best_of: Number of candidates when sampling with non-zero temperature.
| patience: Beam search patience factor.
| length_penalty: Exponential length penalty constant.
| repetition_penalty: Penalty applied to the score of previously generated tokens
| (set > 1 to penalize).
| no_repeat_ngram_size: Prevent repetitions of ngrams with this size (set 0 to disable).
| temperature: Temperature for sampling. It can be a tuple of temperatures,
| which will be successively used upon failures according to either
| `compression_ratio_threshold` or `log_prob_threshold`.
| compression_ratio_threshold: If the gzip compression ratio is above this value,
| treat as failed.
| log_prob_threshold: If the average log probability over sampled tokens is
| below this value, treat as failed.
| no_speech_threshold: If the no_speech probability is higher than this value AND
| the average log probability over sampled tokens is below `log_prob_threshold`,
| consider the segment as silent.
| condition_on_previous_text: If True, the previous output of the model is provided
| as a prompt for the next window; disabling may make the text inconsistent across
| windows, but the model becomes less prone to getting stuck in a failure loop,
| such as repetition looping or timestamps going out of sync.
| prompt_reset_on_temperature: Resets prompt if temperature is above this value.
| Arg has effect only if condition_on_previous_text is True.
| initial_prompt: Optional text string or iterable of token ids to provide as a
| prompt for the first window.
| prefix: Optional text to provide as a prefix for the first window.
| suppress_blank: Suppress blank outputs at the beginning of the sampling.
| suppress_tokens: List of token IDs to suppress. -1 will suppress a default set
| of symbols as defined in the model config.json file.
| without_timestamps: Only sample text tokens.
| max_initial_timestamp: The initial timestamp cannot be later than this.
| word_timestamps: Extract word-level timestamps using the cross-attention pattern
| and dynamic time warping, and include the timestamps for each word in each segment.
| prepend_punctuations: If word_timestamps is True, merge these punctuation symbols
| with the next word
| append_punctuations: If word_timestamps is True, merge these punctuation symbols
| with the previous word
| vad_filter: Enable the voice activity detection (VAD) to filter out parts of the audio
| without speech. This step is using the Silero VAD model
| https://github.com/snakers4/silero-vad.
| vad_parameters: Dictionary of Silero VAD parameters or VadOptions class (see available
| parameters and default values in the class `VadOptions`).
|
| Returns:
| A tuple with:
|
| - a generator over transcribed segments
| - an instance of TranscriptionInfo
|
| ----------------------------------------------------------------------
| Readonly properties defined here:
|
| supported_languages
| The languages supported by the model.
|
| ----------------------------------------------------------------------
| Data descriptors defined here:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
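Putting the transcribe() options above together: a sketch that sets the language explicitly, enables VAD filtering and word timestamps, and formats the result as SRT. The `to_srt` helper is my own illustration, not part of the library:

```python
def to_srt(segments) -> str:
    """Format (start, end, text) triples as an SRT string (own helper, not library API)."""
    def ts(t: float) -> str:
        h, rem = divmod(int(t * 1000), 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1_000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    lines = []
    for i, (start, end, text) in enumerate(segments, 1):
        lines.append(f"{i}\n{ts(start)} --> {ts(end)}\n{text.strip()}\n")
    return "\n".join(lines)


def transcribe_to_srt(model, path: str) -> str:
    # vad_filter drops non-speech via Silero VAD; language="ja" skips the
    # 30-second language-detection pass; word_timestamps adds per-word times.
    segments, _info = model.transcribe(
        path,
        language="ja",
        vad_filter=True,
        word_timestamps=True,
    )
    return to_srt((s.start, s.end, s.text) for s in segments)
```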
FUNCTIONS
available_models() -> List[str]
Returns the names of available models.
decode_audio(input_file: Union[str, BinaryIO], sampling_rate: int = 16000, split_stereo: bool = False)
Decodes the audio.
Args:
input_file: Path to the input file or a file-like object.
sampling_rate: Resample the audio to this sample rate.
split_stereo: Return separate left and right channels.
Returns:
A float32 Numpy array.
If `split_stereo` is enabled, the function returns a 2-tuple with the
separated left and right channels.
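decode_audio() is handy when you want the raw waveform first, e.g. to inspect or trim it before transcription. A sketch (my own example; assumes faster-whisper and its audio-decoding dependencies are installed):

```python
def load_waveform(path: str):
    from faster_whisper import decode_audio

    # 16 kHz mono float32 is what WhisperModel.transcribe() expects.
    audio = decode_audio(path, sampling_rate=16000)
    duration_sec = audio.shape[0] / 16000
    return audio, duration_sec
```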
download_model(size_or_id: str, output_dir: Optional[str] = None, local_files_only: bool = False, cache_dir: Optional[str] = None)
Downloads a CTranslate2 Whisper model from the Hugging Face Hub.
Args:
size_or_id: Size of the model to download from https://huggingface.co/guillaumekln
(tiny, tiny.en, base, base.en, small, small.en, medium, medium.en, large-v1, large-v2,
large-v3, large), or a CTranslate2-converted model ID from the Hugging Face Hub
(e.g. Systran/faster-whisper-large-v3).
output_dir: Directory where the model should be saved. If not set, the model is saved in
the cache directory.
local_files_only: If True, avoid downloading the file and return the path to the local
cached file if it exists.
cache_dir: Path to the folder where cached files are stored.
Returns:
The path to the downloaded model.
Raises:
ValueError: if the model size is invalid.
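download_model() makes it possible to fetch the converted weights ahead of time (e.g. for an offline machine) and then load them from disk with local_files_only. A sketch (the output directory here is just a placeholder):

```python
def prefetch(model_id: str = "Systran/faster-whisper-large-v3", out_dir: str = "./models"):
    from faster_whisper import WhisperModel, download_model

    # Downloads once and returns the local path to the converted model.
    path = download_model(model_id, output_dir=out_dir)
    # Later (or offline): load from disk without touching the network.
    return WhisperModel(path, local_files_only=True)
```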
format_timestamp(seconds: float, always_include_hours: bool = False, decimal_marker: str = '.') -> str
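format_timestamp() has no docstring in the dump, so here is a rough reimplementation of the same HH:MM:SS.mmm convention for illustration (my own sketch; the real function's corner-case behaviour may differ):

```python
def fmt_timestamp(seconds: float, always_include_hours: bool = False, decimal_marker: str = ".") -> str:
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    # Hours are printed only when non-zero, unless explicitly requested.
    hours_part = f"{hours:02d}:" if always_include_hours or hours > 0 else ""
    return f"{hours_part}{minutes:02d}:{secs:02d}{decimal_marker}{ms:03d}"
```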
DATA
__all__ = ['available_models', 'decode_audio', 'WhisperModel', 'downlo...
VERSION
0.10.0