I wanted to run it locally, but the function arguments weren't obvious to me, so I pulled up the package documentation below. I also wanted to try this new large-v3 model.
Incidentally, with GPU support for Whisper already set up and an 8 GB GPU on board, the following single line was all it took to get it working. (With less GPU memory you would presumably run into problems, of course.)
pip install faster-whisper
I haven't benchmarked it, but it is indeed fast.
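For reference, the dump below is simply the output of `help(faster_whisper)`. Based on it, here is a minimal usage sketch (my own illustration, not from the docs; it assumes faster-whisper is installed and an 8 GB-class CUDA GPU as above, and the file path is a placeholder):

```python
def transcribe_file(path: str, model_size: str = "large-v3"):
    # Imported lazily so the sketch loads even without the package installed.
    from faster_whisper import WhisperModel

    model = WhisperModel(model_size, device="cuda", compute_type="float16")
    segments, info = model.transcribe(path, beam_size=5)
    # transcribe() returns a generator; iterating it runs the actual decoding.
    for seg in segments:
        print(f"[{seg.start:.2f} -> {seg.end:.2f}] {seg.text}")
    return info.language
```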
Help on package faster_whisper:
NAME
faster_whisper
PACKAGE CONTENTS
audio
feature_extractor
tokenizer
transcribe
utils
vad
version
CLASSES
builtins.object
faster_whisper.transcribe.WhisperModel
class WhisperModel(builtins.object)
| WhisperModel(model_size_or_path: str, device: str = 'auto', device_index: Union[int, List[int]] = 0, compute_type: str = 'default', cpu_threads: int = 0, num_workers: int = 1, download_root: Optional[str] = None, local_files_only: bool = False)
|
| Methods defined here:
|
| __init__(self, model_size_or_path: str, device: str = 'auto', device_index: Union[int, List[int]] = 0, compute_type: str = 'default', cpu_threads: int = 0, num_workers: int = 1, download_root: Optional[str] = None, local_files_only: bool = False)
| Initializes the Whisper model.
|
| Args:
| model_size_or_path: Size of the model to use (tiny, tiny.en, base, base.en,
| small, small.en, medium, medium.en, large-v1, large-v2, large-v3, or large), a path to a
| converted model directory, or a CTranslate2-converted Whisper model ID from the HF Hub.
| When a size or a model ID is configured, the converted model is downloaded
| from the Hugging Face Hub.
| device: Device to use for computation ("cpu", "cuda", "auto").
| device_index: Device ID to use.
| The model can also be loaded on multiple GPUs by passing a list of IDs
| (e.g. [0, 1, 2, 3]). In that case, multiple transcriptions can run in parallel
| when transcribe() is called from multiple Python threads (see also num_workers).
| compute_type: Type to use for computation.
| See https://opennmt.net/CTranslate2/quantization.html.
| cpu_threads: Number of threads to use when running on CPU (4 by default).
| A non zero value overrides the OMP_NUM_THREADS environment variable.
| num_workers: When transcribe() is called from multiple Python threads,
| having multiple workers enables true parallelism when running the model
| (concurrent calls to self.model.generate() will run in parallel).
| This can improve the global throughput at the cost of increased memory usage.
| download_root: Directory where the models should be saved. If not set, the models
| are saved in the standard Hugging Face cache directory.
| local_files_only: If True, avoid downloading the file and return the path to the
| local cached file if it exists.
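To make the constructor arguments above concrete, a sketch of two typical configurations (my own example; the compute types follow the CTranslate2 quantization page linked above):

```python
def load_model(use_gpu: bool):
    from faster_whisper import WhisperModel

    if use_gpu:
        # ~8 GB of VRAM is enough for large-v3 in float16.
        return WhisperModel("large-v3", device="cuda", compute_type="float16")
    # CPU fallback: int8 quantization keeps memory and latency manageable.
    return WhisperModel("large-v3", device="cpu", compute_type="int8", cpu_threads=4)
```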
|
| add_word_timestamps(self, segments: List[dict], tokenizer: faster_whisper.tokenizer.Tokenizer, encoder_output: ctranslate2._ext.StorageView, num_frames: int, prepend_punctuations: str, append_punctuations: str, last_speech_timestamp: float) -> None
|
| encode(self, features: numpy.ndarray) -> ctranslate2._ext.StorageView
|
| find_alignment(self, tokenizer: faster_whisper.tokenizer.Tokenizer, text_tokens: List[int], encoder_output: ctranslate2._ext.StorageView, num_frames: int, median_filter_width: int = 7) -> List[dict]
|
| generate_segments(self, features: numpy.ndarray, tokenizer: faster_whisper.tokenizer.Tokenizer, options: faster_whisper.transcribe.TranscriptionOptions, encoder_output: Optional[ctranslate2._ext.StorageView] = None) -> Iterable[faster_whisper.transcribe.Segment]
|
| generate_with_fallback(self, encoder_output: ctranslate2._ext.StorageView, prompt: List[int], tokenizer: faster_whisper.tokenizer.Tokenizer, options: faster_whisper.transcribe.TranscriptionOptions) -> Tuple[ctranslate2._ext.WhisperGenerationResult, float, float, float]
|
| get_prompt(self, tokenizer: faster_whisper.tokenizer.Tokenizer, previous_tokens: List[int], without_timestamps: bool = False, prefix: Optional[str] = None) -> List[int]
|
| transcribe(self, audio: Union[str, BinaryIO, numpy.ndarray], language: Optional[str] = None, task: str = 'transcribe', beam_size: int = 5, best_of: int = 5, patience: float = 1, length_penalty: float = 1, repetition_penalty: float = 1, no_repeat_ngram_size: int = 0, temperature: Union[float, List[float], Tuple[float, ...]] = [0.0, 0.2, 0.4, 0.6, 0.8, 1.0], compression_ratio_threshold: Optional[float] = 2.4, log_prob_threshold: Optional[float] = -1.0, no_speech_threshold: Optional[float] = 0.6, condition_on_previous_text: bool = True, prompt_reset_on_temperature: float = 0.5, initial_prompt: Union[str, Iterable[int], NoneType] = None, prefix: Optional[str] = None, suppress_blank: bool = True, suppress_tokens: Optional[List[int]] = [-1], without_timestamps: bool = False, max_initial_timestamp: float = 1.0, word_timestamps: bool = False, prepend_punctuations: str = '"\'“¿([{-', append_punctuations: str = '"\'.。,，!！?？:：”)]}、', vad_filter: bool = False, vad_parameters: Union[dict, faster_whisper.vad.VadOptions, NoneType] = None) -> Tuple[Iterable[faster_whisper.transcribe.Segment], faster_whisper.transcribe.TranscriptionInfo]
| Transcribes an input file.
|
| Arguments:
| audio: Path to the input file (or a file-like object), or the audio waveform.
| language: The language spoken in the audio. It should be a language code such
| as "en" or "fr". If not set, the language will be detected in the first 30 seconds
| of audio.
| task: Task to execute (transcribe or translate).
| beam_size: Beam size to use for decoding.
| best_of: Number of candidates when sampling with non-zero temperature.
| patience: Beam search patience factor.
| length_penalty: Exponential length penalty constant.
| repetition_penalty: Penalty applied to the score of previously generated tokens
| (set > 1 to penalize).
| no_repeat_ngram_size: Prevent repetitions of ngrams with this size (set 0 to disable).
| temperature: Temperature for sampling. It can be a tuple of temperatures,
| which will be successively used upon failures according to either
| `compression_ratio_threshold` or `log_prob_threshold`.
| compression_ratio_threshold: If the gzip compression ratio is above this value,
| treat as failed.
| log_prob_threshold: If the average log probability over sampled tokens is
| below this value, treat as failed.
| no_speech_threshold: If the no_speech probability is higher than this value AND
| the average log probability over sampled tokens is below `log_prob_threshold`,
| consider the segment as silent.
| condition_on_previous_text: If True, the previous output of the model is provided
| as a prompt for the next window; disabling may make the text inconsistent across
| windows, but the model becomes less prone to getting stuck in a failure loop,
| such as repetition looping or timestamps going out of sync.
| prompt_reset_on_temperature: Resets prompt if temperature is above this value.
| Arg has effect only if condition_on_previous_text is True.
| initial_prompt: Optional text string or iterable of token ids to provide as a
| prompt for the first window.
| prefix: Optional text to provide as a prefix for the first window.
| suppress_blank: Suppress blank outputs at the beginning of the sampling.
| suppress_tokens: List of token IDs to suppress. -1 will suppress a default set
| of symbols as defined in the model config.json file.
| without_timestamps: Only sample text tokens.
| max_initial_timestamp: The initial timestamp cannot be later than this.
| word_timestamps: Extract word-level timestamps using the cross-attention pattern
| and dynamic time warping, and include the timestamps for each word in each segment.
| prepend_punctuations: If word_timestamps is True, merge these punctuation symbols
| with the next word
| append_punctuations: If word_timestamps is True, merge these punctuation symbols
| with the previous word
| vad_filter: Enable the voice activity detection (VAD) to filter out parts of the audio
| without speech. This step is using the Silero VAD model
| https://github.com/snakers4/silero-vad.
| vad_parameters: Dictionary of Silero VAD parameters or VadOptions class (see available
| parameters and default values in the class `VadOptions`).
|
| Returns:
| A tuple with:
|
| - a generator over transcribed segments
| - an instance of TranscriptionInfo
|
| ----------------------------------------------------------------------
| Readonly properties defined here:
|
| supported_languages
| The languages supported by the model.
|
| ----------------------------------------------------------------------
| Data descriptors defined here:
|
| __dict__
| dictionary for instance variables (if defined)
|
| __weakref__
| list of weak references to the object (if defined)
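Putting the transcribe() options above together: a sketch that sets the language explicitly, enables VAD filtering and word timestamps, and formats the result as SRT. The `to_srt` helper is my own illustration, not part of the library:

```python
def to_srt(segments) -> str:
    """Format (start, end, text) triples as an SRT string (own helper, not library API)."""
    def ts(t: float) -> str:
        h, rem = divmod(int(t * 1000), 3_600_000)
        m, rem = divmod(rem, 60_000)
        s, ms = divmod(rem, 1_000)
        return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"

    lines = []
    for i, (start, end, text) in enumerate(segments, 1):
        lines.append(f"{i}\n{ts(start)} --> {ts(end)}\n{text.strip()}\n")
    return "\n".join(lines)


def transcribe_to_srt(model, path: str) -> str:
    # vad_filter drops non-speech via Silero VAD; language="ja" skips the
    # 30-second language-detection pass; word_timestamps adds per-word times.
    segments, _info = model.transcribe(
        path,
        language="ja",
        vad_filter=True,
        word_timestamps=True,
    )
    return to_srt((s.start, s.end, s.text) for s in segments)
```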
FUNCTIONS
available_models() -> List[str]
Returns the names of available models.
decode_audio(input_file: Union[str, BinaryIO], sampling_rate: int = 16000, split_stereo: bool = False)
Decodes the audio.
Args:
input_file: Path to the input file or a file-like object.
sampling_rate: Resample the audio to this sample rate.
split_stereo: Return separate left and right channels.
Returns:
A float32 Numpy array.
If `split_stereo` is enabled, the function returns a 2-tuple with the
separated left and right channels.
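decode_audio() is handy when you want the raw waveform first, e.g. to inspect or trim it before transcription. A sketch (my own example; assumes faster-whisper and its audio-decoding dependencies are installed):

```python
def load_waveform(path: str):
    from faster_whisper import decode_audio

    # 16 kHz mono float32 is what WhisperModel.transcribe() expects.
    audio = decode_audio(path, sampling_rate=16000)
    duration_sec = audio.shape[0] / 16000
    return audio, duration_sec
```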
download_model(size_or_id: str, output_dir: Optional[str] = None, local_files_only: bool = False, cache_dir: Optional[str] = None)
Downloads a CTranslate2 Whisper model from the Hugging Face Hub.
Args:
size_or_id: Size of the model to download from https://huggingface.co/guillaumekln
(tiny, tiny.en, base, base.en, small, small.en, medium, medium.en, large-v1, large-v2,
large-v3, large), or a CTranslate2-converted model ID from the Hugging Face Hub
(e.g. Systran/faster-whisper-large-v3).
output_dir: Directory where the model should be saved. If not set, the model is saved in
the cache directory.
local_files_only: If True, avoid downloading the file and return the path to the local
cached file if it exists.
cache_dir: Path to the folder where cached files are stored.
Returns:
The path to the downloaded model.
Raises:
ValueError: if the model size is invalid.
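download_model() makes it possible to fetch the converted weights ahead of time (e.g. for an offline machine) and then load them from disk with local_files_only. A sketch (the output directory here is just a placeholder):

```python
def prefetch(model_id: str = "Systran/faster-whisper-large-v3", out_dir: str = "./models"):
    from faster_whisper import WhisperModel, download_model

    # Downloads once and returns the local path to the converted model.
    path = download_model(model_id, output_dir=out_dir)
    # Later (or offline): load from disk without touching the network.
    return WhisperModel(path, local_files_only=True)
```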
format_timestamp(seconds: float, always_include_hours: bool = False, decimal_marker: str = '.') -> str
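format_timestamp() has no docstring in the dump, so here is a rough reimplementation of the same HH:MM:SS.mmm convention for illustration (my own sketch; the real function's corner-case behaviour may differ):

```python
def fmt_timestamp(seconds: float, always_include_hours: bool = False, decimal_marker: str = ".") -> str:
    ms = round(seconds * 1000)
    hours, ms = divmod(ms, 3_600_000)
    minutes, ms = divmod(ms, 60_000)
    secs, ms = divmod(ms, 1_000)
    # Hours are printed only when non-zero, unless explicitly requested.
    hours_part = f"{hours:02d}:" if always_include_hours or hours > 0 else ""
    return f"{hours_part}{minutes:02d}:{secs:02d}{decimal_marker}{ms:03d}"
```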
DATA
__all__ = ['available_models', 'decode_audio', 'WhisperModel', 'downlo...
VERSION
0.10.0