Making a Japanese YouTuber Speak English [VALL-E-X]

Posted at 2023-09-13 (last updated at 2023-09-13)

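Summary

VALL-E-X (reference article) is a speech model that makes a person speak multiple languages. Microsoft published the paper in March 2023, and a student at Nanyang University released an implementation.
I used it to build a system that dubs a YouTuber's casual-talk videos into English almost automatically.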

Github: https://github.com/konbraphat51/YoutubeTranslator

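Introduction

The open-source code for the VALL-E-X mentioned above was released in late August. It was developed (individually?) by Songting Liu of Nanyang University. Microsoft, which published the paper, has not yet released this technology.
The idea behind the model is to clone a person's voice from an audio sample only a few seconds long, then synthesize speech in another language with that voice.
There is a HuggingFace Space where you can try it right away: https://huggingface.co/spaces/Plachta/VALL-E-X

I happened to be working on an AI clone of a YouTuber and thought this could help turn it into videos, so I decided to try it out in the form of a YouTube video dubbing task.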

What I built

As the figure below shows, processing passes through five classes.

The libraries used are:

VideoGetter: pytube
Transcriber: Whisper (OpenAI)
LanguageTranslator: DeepL
VoiceTranslator: VALL-E-X
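
The VoiceTranslator code later in this article reads the columns "start", "end", "text", and "translated_text" from each row. As a rough sketch of how the transcription and translation stages could produce such rows, the following assumes Whisper-style segments (dicts with "start", "end", and "text", times in seconds). `build_translation_rows` and `fake_translate` are hypothetical names for illustration, and `fake_translate` merely stands in for a real DeepL call; none of this is taken from the repository.

```python
from typing import Callable, Dict, List, Union

Row = Dict[str, Union[float, str]]


def build_translation_rows(
    segments: List[Dict[str, Union[float, str]]],
    translate: Callable[[str], str],
) -> List[Row]:
    """Attach a translation to every non-empty transcribed segment."""
    rows: List[Row] = []
    for segment in segments:
        text = str(segment["text"]).strip()
        if not text:
            continue  # nothing to translate or dub for an empty segment
        rows.append(
            {
                "start": float(segment["start"]),  # seconds
                "end": float(segment["end"]),      # seconds
                "text": text,
                "translated_text": translate(text),
            }
        )
    return rows


def fake_translate(text: str) -> str:
    """Hypothetical stand-in for a translation service such as DeepL."""
    return f"[EN] {text}"


segments = [
    {"start": 0.0, "end": 2.4, "text": " こんにちは"},
    {"start": 2.4, "end": 5.1, "text": "今日は雑談します"},
]
rows = build_translation_rows(segments, fake_translate)
```

Passing the translator in as a callable keeps the sketch independent of any particular translation API; in the actual project the rows live in a pandas DataFrame.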

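Inside VoiceTranslator

VALL-E-X is said to carry over tone of voice as well (laughing, angry, and so on). To preserve emotion too, I wrote VoiceTranslatorVALLEXAllPrompt, which feeds the model one utterance at a time and has it output the translated audio for that utterance.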

VoiceTranslator.py

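import pathlib

import pandas as pd
from pydub import AudioSegment
from scipy.io.wavfile import write as write_wav

# Assumed import paths: the VALL-E-X repository is expected to be importable as `VALLEX`.
from VALLEX.utils.generation import SAMPLE_RATE, generate_audio, preload_models
from VALLEX.utils.prompt_making import make_prompt

# `Consts` and the `VoiceTranslator` base class are defined elsewhere in the project.

class VoiceTranslatorVALLEXAllPrompt(VoiceTranslator):
    '''
    Using VALLEX by the "one sentence, one prompt" method
    '''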
    def __init__(self, consts: Consts):
        super().__init__(consts)
        preload_models()
    
    def translate(self, index:int, row_translation: pd.Series) -> pathlib.Path:
        #cut
        cut_file_path = self.cutoff_original(index, int(row_translation["start"]*1000), int(row_translation["end"]*1000))
        
        #sample voice
        prompt_name = self.sample_voice(index, cut_file_path, row_translation["text"])
        
        #generate
        generation_sound_path = self.generate_voice(prompt_name, row_translation["translated_text"])
        
        return generation_sound_path
        
    def cutoff_original(self, index:int, start:int, end:int) -> pathlib.Path:
        '''
        cut the [start, end) range (in milliseconds) out of the original sound.
        output: path of the cut sound file
        '''
        
        #cut
        original_audio = AudioSegment.from_file(self.consts.original_video_path().as_posix())
        sound_cut = original_audio[start:end]
        
        #save
        cut_path = self.consts.original_sound_cut_folder / f"{index}.mp3"
        sound_cut.export(cut_path.as_posix(), format="mp3")
        
        return cut_path
    
    def sample_voice(self, index:int, cut_file_path:pathlib.Path, transcription:str) -> str:
        '''
        sample voice from the cut original sound file.
        output: name of the prompt that was made
        '''
        
        prompt_name = f"{self.consts.project_title}_{index}"
        
        #register the clip and its transcript as a named VALL-E-X prompt
        make_prompt(
            name = prompt_name,
            audio_prompt_path=cut_file_path.as_posix(),
            transcript=transcription
        )
        
        return prompt_name
    
    def generate_voice(self, prompt_name:str, text_generating:str) -> pathlib.Path:
        '''
        generate voice from prompt.
        output: path of generated voice file
        '''
        
        #generate
        audio = generate_audio(
            text_generating,
            prompt=prompt_name
        )
        
        #save
        generated_path = self.consts.generated_sound_folder / f"{prompt_name}.wav"
        write_wav(generated_path.as_posix(), SAMPLE_RATE, audio)
        
        return generated_path

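However, this approach did not produce good-quality audio, so I wrote the VoiceTranslatorVALLEXSingle class, which uses a single, manually chosen cloning prompt that sounds good.
I expect this manual step to become unnecessary once VALL-E-X matures.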
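
Picking that one prompt by hand is easier with a shortlist. The sketch below filters transcript rows by clip length and lists the longest qualifying clips first. The helper name `shortlist_prompt_candidates` and the 3 to 10 second bounds are my assumptions, not values from this project; the final choice still has to be made by ear.

```python
from typing import Dict, List, Union

Row = Dict[str, Union[float, str]]


def shortlist_prompt_candidates(
    rows: List[Row],
    min_seconds: float = 3.0,   # assumed lower bound for a usable prompt clip
    max_seconds: float = 10.0,  # assumed upper bound for a usable prompt clip
) -> List[int]:
    """Return indices of rows whose duration is within bounds, longest first."""

    def duration(row: Row) -> float:
        return float(row["end"]) - float(row["start"])

    candidates = [
        index
        for index, row in enumerate(rows)
        if min_seconds <= duration(row) <= max_seconds
    ]
    candidates.sort(key=lambda index: duration(rows[index]), reverse=True)
    return candidates


rows = [
    {"start": 0.0, "end": 1.5, "text": "a"},    # 1.5 s: too short
    {"start": 1.5, "end": 5.5, "text": "b"},    # 4.0 s
    {"start": 5.5, "end": 17.5, "text": "c"},   # 12.0 s: too long
    {"start": 17.5, "end": 24.0, "text": "d"},  # 6.5 s
]
candidate_indices = shortlist_prompt_candidates(rows)
```

The returned indices match the per-row clip names, so each candidate can be auditioned before one is turned into the .npz prompt.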

VoiceTranslator.py


class VoiceTranslatorVALLEXSingle(VoiceTranslator):
    '''
    Using VALLEX with only one prompt.  
    The .npz prompt file should be in the "customs" folder in the working directory.  
    The .npz file can be created by `VALLEX.utils.prompt_making.make_prompt`, `VoiceTranslatorVALLEXAllPrompt`, or https://huggingface.co/spaces/Plachta/VALL-E-X.
    '''
    
    def __init__(self, consts: Consts, prompt_name:str):
        super().__init__(consts)
        preload_models()
        self.prompt_name = prompt_name
        
    def translate(self, index: int, row_translation: pd.Series) -> pathlib.Path:
        #generate voice
        audio = generate_audio(
            row_translation["translated_text"],
            prompt=self.prompt_name
        )
        
        #save
        generated_path = self.consts.generated_sound_folder / f"{self.consts.project_title}_{index}.wav"
        write_wav(generated_path.as_posix(), SAMPLE_RATE, audio)
        
        return generated_path
    
    @staticmethod
    def check_prompt_exists(consts:Consts, prompt_name:str) -> bool:
        '''
        Check if the prompt file exists.  
        Call this before instantiating this class.
        '''
        
        return (consts.working_directory / "customs" / f"{prompt_name}.npz").exists()
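
The docstring asks callers to check for the prompt file before using the class. Here is a minimal sketch of that lookup using only pathlib, assuming the `customs/<prompt_name>.npz` layout shown above. `prompt_file_path` and the name `my_voice` are hypothetical, and the empty file created here is only a placeholder, not a valid VALL-E-X prompt.

```python
import pathlib
import tempfile


def prompt_file_path(working_directory: pathlib.Path, prompt_name: str) -> pathlib.Path:
    """Build the expected location of a prompt file under the "customs" folder."""
    return working_directory / "customs" / f"{prompt_name}.npz"


with tempfile.TemporaryDirectory() as tmp:
    working_directory = pathlib.Path(tmp)

    # No prompt has been created yet, so the check fails.
    missing_before = prompt_file_path(working_directory, "my_voice").exists()

    # Simulate dropping a prompt file into place (placeholder content only).
    (working_directory / "customs").mkdir()
    prompt_file_path(working_directory, "my_voice").touch()
    found_after = prompt_file_path(working_directory, "my_voice").exists()
```

Failing fast on a missing file avoids loading the models only to have generation fail on the first sentence.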

`VideoDataMaker`

I did not feel like painstakingly typing the subtitles into AviUtl by hand, so I output them as an SRT file instead.

    def make_sub(self, df_generated: pd.DataFrame) -> None:
        '''
        Make sub and save it as "sub.srt"
        '''
        
        #make individual sub
        subs = []
        for index, row in df_generated.iterrows():
            sub = srt.Subtitle(
                index=index,
                start=srt.timedelta(seconds=row["generated_start"]),
                end=srt.timedelta(seconds=row["generated_end"]),
                content=self.make_sub_text(row["text"], row["translated_text"])
            )
            subs.append(sub)
            
        #integrate
        sub_integrated = srt.compose(subs)
        
        #save
        with open((self.consts.srt_path()).as_posix(), "w", encoding="utf-8") as f:
            f.write(sub_integrated)

Converting this file to an exo file with srt2exo lets AviUtl load it.
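
For readers unfamiliar with the format, an SRT file is a sequence of numbered cues, each with a `HH:MM:SS,mmm --> HH:MM:SS,mmm` time line followed by the subtitle text and a blank line. The sketch below writes that layout by hand using only the standard library. It illustrates the file format and is not the `srt` library's implementation; `format_timestamp` and `compose_srt` are hypothetical helper names.

```python
from datetime import timedelta
from typing import List, Tuple


def format_timestamp(offset: timedelta) -> str:
    """Render a time offset as the SRT timestamp HH:MM:SS,mmm."""
    total_ms = round(offset.total_seconds() * 1000)
    hours, remainder = divmod(total_ms, 3_600_000)
    minutes, remainder = divmod(remainder, 60_000)
    seconds, millis = divmod(remainder, 1000)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d},{millis:03d}"


def compose_srt(cues: List[Tuple[float, float, str]]) -> str:
    """Join (start_seconds, end_seconds, content) cues into SRT text."""
    blocks = []
    for number, (start, end, content) in enumerate(cues, start=1):
        blocks.append(
            f"{number}\n"
            f"{format_timestamp(timedelta(seconds=start))} --> "
            f"{format_timestamp(timedelta(seconds=end))}\n"
            f"{content}\n"
        )
    # Cues are separated by a blank line.
    return "\n".join(blocks)


srt_text = compose_srt([(0.0, 2.5, "Hello"), (2.5, 3661.25, "Bye")])
```

Cue numbers start at 1 here regardless of the source row index.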

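Results

Here is how it turned out.
The first part was made with VoiceTranslatorVALLEXSingle and the second with VoiceTranslatorVALLEXAllPrompt.

It is still far from perfect, but I feel the day is near when anyone can easily speak another language in their own voice.

By the way

I am trying to build a clone of a YouTuber. Please keep an eye on it.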
