
[Trial and Error] Auto-generating "Sung with 〇〇" Videos, Part 2: Getting Lyric Utterance Timing from MusicXML

Posted at 2022-11-04

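Overview

Working toward generating subtitle videos for parody songs, I continue from Part 1 with more trial and error in reading the contents of MusicXML. This time I extract the utterance timing of the lyrics.

Series list:
[Trial and Error] Auto-generating "Sung with 〇〇" Videos: Link Collection

Background

I am trying to auto-generate parody-song videos in the genre known as "〇〇で歌ってみた" ("sung with 〇〇").
The video I ultimately want is a sequence of still images that shows the original lyrics, the parody lyrics, and a suitable image for each parody lyric, all in time with the music.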

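For example, something like the video below (though I do not plan to reproduce the occasional animations or the updating prefecture image at the bottom right):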
[駅名替え歌] 駅名で「うっせぇわ」

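To do this, I need to know when each lyric should be displayed (its timestamp). The utterance timing of the lyrics should be described in the MusicXML that is fed into the singing voice synthesis system, so I try reading that information.

Extracting lyric timestamp information

I look through the sample bundled with NEUTRINO (sample1.musicxml) and work out how the timestamp information can be read. When I get stuck, I read the official documentation and clear explanations. If you notice a mistake, I would appreciate you pointing it out.

The BPM information for the whole piece appears to be written in the first measure tag.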

    <measure number="1" width="115.07">
      ...
      <attributes>
        <divisions>2</divisions>
        <key>
          <fifths>0</fifths>
          </key>
        <time>
          <beats>4</beats>
          <beat-type>4</beat-type>
          </time>
        <clef>
          <sign>G</sign>
          <line>2</line>
          </clef>
        </attributes>
      <direction placement="above">
        <direction-type>
          <metronome parentheses="no" default-x="-35.96" relative-y="20.00">
            <beat-unit>quarter</beat-unit>
            <per-minute>100</per-minute>
            </metronome>
          </direction-type>
        <sound tempo="100"/>
        </direction>
      <note>
        <rest/>
        <duration>8</duration>
        <voice>1</voice>
        <type>whole</type>
        </note>
      </measure>

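divisions specifies how many divisions one quarter note is split into, and it is the unit in which the lengths of notes and rests (duration) are expressed. Here divisions is 2, so one division is half a quarter note, that is, an eighth note. The reference is the quarter note regardless of beat-type.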

attributes/time holds the meter. beats is the number of beats per measure and beat-type is the note value of one beat, so the time signature is beats/beat-type (4/4 here).

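The BPM is written in direction/direction-type/metronome. beat-unit is quarter and per-minute is 100, so I infer a tempo of 100 quarter notes per minute.
The tempo attribute of direction/sound also holds BPM-like information, but I ignore it this time. (Here tempo and per-minute are equal, so ignoring it seems fine, but what happens when they differ? I will think about that when I find a file where they differ.)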

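Looking further, attributes and similar information do not seem to be written in the measures from number=2 onward. If the BPM changes partway through the piece, is it written again each time? I do not know, but I move on.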
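One way to approach the two open questions above (whether tempo and per-minute can differ, and whether tempo is restated in later measures) is to read both values from every measure and report any disagreement. The following is only a sketch: the helper name read_tempos and the inline XML fragment are assumptions made for illustration, modeled on measure 1 of the sample.

```python
import xml.etree.ElementTree as ET

# Hypothetical inline fragment modeled on measure 1 of the sample score.
MEASURE_XML = """
<measure number="1">
  <direction placement="above">
    <direction-type>
      <metronome>
        <beat-unit>quarter</beat-unit>
        <per-minute>100</per-minute>
      </metronome>
    </direction-type>
    <sound tempo="100"/>
  </direction>
</measure>
"""

def read_tempos(measure):
    """Return (per_minute, sound_tempo); either value is None when absent."""
    per_minute = measure.find(".//metronome/per-minute")
    sound = measure.find(".//sound[@tempo]")
    return (
        float(per_minute.text) if per_minute is not None else None,
        float(sound.get("tempo")) if sound is not None else None,
    )

measure = ET.fromstring(MEASURE_XML)
per_minute, sound_tempo = read_tempos(measure)
# Report only when both values exist and disagree.
if per_minute is not None and sound_tempo is not None and per_minute != sound_tempo:
    print(f"tempo mismatch in measure {measure.get('number')}: {per_minute} vs {sound_tempo}")
```

Running the same check over every measure of a real file would also show whether any later measure carries its own tempo.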

Let's look at measure number="2", where lyric information first appears.

    <measure number="2" width="136.20">
      <note default-x="13.62" default-y="-30.00">
        <pitch>
          <step>G</step>
          <octave>4</octave>
          </pitch>
        <duration>2</duration>
        <voice>1</voice>
        <type>quarter</type>
        <stem>up</stem>
        <lyric number="1" default-x="6.58" default-y="-53.60" relative-y="-30.00">
          <syllabic>single</syllabic>
          <text>は</text>
          </lyric>
        </note>
      ...
      </measure>

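A note element corresponds to a note, and when it carries a lyric (that is, when it is not a rest) note/lyric/text appears to exist.
The length of the note is written in duration, which counts how many divisions (eighth notes here, as derived from divisions above) the note lasts.
The number of seconds per division can be computed as 60 / bpm / divisions, with bpm counted in quarter notes per minute.
Therefore, taking the end time of the previous note as the reference and adding up the durations in order gives the start time and the length of each note.

The Python code to extract this looks like the following.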
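As a quick check of the arithmetic, here is a minimal sketch using only the values visible in the two excerpts above (tempo 100, divisions 2, a whole rest of duration 8, then the note "は" of duration 2). The function name seconds_per_division is an assumption for illustration.

```python
def seconds_per_division(bpm, divisions):
    """Seconds per division, assuming bpm counts quarter notes per minute."""
    return 60 / bpm / divisions

# Values taken from the sample excerpts: tempo 100, divisions 2.
spd = seconds_per_division(bpm=100, divisions=2)

# (lyric, duration in divisions): the whole rest in measure 1, then the first note.
notes = [("", 8), ("は", 2)]

current = 0.0
timeline = []
for lyric, duration in notes:
    length = duration * spd
    if lyric:
        # Record the start time and length only for notes that carry a lyric.
        timeline.append((lyric, current, length))
    current += length

print(timeline)
```

The rest occupies 8 divisions, so the first lyric starts at about 2.4 s and lasts about 0.6 s.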

import xml.etree.ElementTree as ET

class MusicXmlParser:
  def __init__(self, musicxml_path):
    self.tree = ET.parse(musicxml_path)
    self.root = self.tree.getroot()

    # Holder for the music information, pre-filled with placeholder values just in case
    self.default_musicinfo = {
      "beats": 0
      , "beattype": 0
      , "beatunit": 0
      , "perminute": 0
      , "divisions": 0
    }
    self.note_length_string_to_number = {
      "whole": 1
      , "half": 2
      , "quarter": 4
      , "eighth": 8
      , "16th": 16 # MusicXML spells this note type "16th"
    }

  # Get the music information contained in a measure element
  def get_musicinfo(self, measure=None):
    # Compare with None explicitly: the truth value of an Element depends on whether it has children
    if measure is None:
      measure = self.root.find("./part//measure")
    musicinfo = {}
    # Each item below is read as at most one occurrence (in reality there is no guarantee that there are not two or more)
    # Update divisions if present
    divisions = measure.find("./attributes/divisions")
    if divisions is not None: musicinfo["divisions"] = int(divisions.text)
    # Update the time signature if present
    beats = measure.find(".//attributes//time/beats")
    beattype = measure.find(".//attributes//time/beat-type")
    if beats is not None and beattype is not None: # must compare with "is not None"; a childless Element is falsy
      musicinfo["beats"]=int(beats.text)
      musicinfo["beattype"]=int(beattype.text)
    # Update the BPM if present
    beatunit = measure.find(".//metronome/beat-unit")
    perminute = measure.find(".//metronome/per-minute")
    if beatunit is not None and perminute is not None:
      musicinfo["beatunit"] = self.note_length_string_to_number[beatunit.text]
      musicinfo["perminute"] = int(perminute.text)
    
    return musicinfo
  
  @staticmethod
  def calculate_second_per_division(musicinfo):
    return 60 / musicinfo["perminute"] / musicinfo["divisions"]

  def parse_note(self, note):
    is_rest = ( note.find("./rest") is not None )
    duration = int(note.find("./duration").text)
    lyric_text = note.find("./lyric/text")
    # Rest, or no lyric text
    if is_rest or lyric_text is None:
      return "", duration
    else:
      return lyric_text.text, duration
      
  def get_lyric_timestamp(self, part = None):
    # Allow use on a part other than the one under self.root, just in case
    if part is None:
      part = self.root.find("./part")
    measures = part.findall(".//measure") # get the measure elements
    musicinfo = dict(self.default_musicinfo) # copy the defaults so that update() below does not overwrite them
    current_second = 0
    timestamps = []
    for measure in measures:
      # Update the score-wide information. It should normally exist only in the first measure, but check every time just in case
      musicinfo.update(self.get_musicinfo(measure))
      second_per_division = self.calculate_second_per_division(musicinfo)
      notes = measure.findall(".//note")
      for note in notes:
        lyric_text, duration = self.parse_note(note)
        duration_second = duration * second_per_division
        if lyric_text:
          timestamps.append( ( lyric_text, current_second, duration_second ) )
        else:
          pass
        current_second += duration_second
    return timestamps
        
if __name__ == "__main__":
  musicxml_path = "NEUTRINO/score/musicxml/sample1.musicxml"
  parser = MusicXmlParser(musicxml_path)
  print(parser.get_lyric_timestamp(parser.root.find("./part")))

[('は', 2.4, 0.6), ('る', 3.0, 0.3), ('が', 3.3, 0.3), ('き', 3.5999999999999996, 0.6), ('た', 4.199999999999999, 0.6), ('は', 4.799999999999999, 0.6), ('る', 5.399999999999999, 0.3), ('が', 5.699999999999998, 0.3), ('き', 5.999999999999998, 0.6), ('た', 6.599999999999998, 0.6), ('ど', 7.1999999999999975, 0.6), ('こ', 7.799999999999997, 0.6), ('に', 8.399999999999997, 0.8999999999999999), ('き', 9.299999999999997, 0.3), ('た', 9.599999999999998, 1.7999999999999998), ('や', 11.999999999999998, 0.6), ('ま', 12.599999999999998, 0.3), ('に', 12.899999999999999, 0.3), ('き', 13.2, 0.6), ('た', 13.799999999999999, 0.6), ('さ', 14.399999999999999, 0.6), ('と', 14.999999999999998, 0.3), ('に', 15.299999999999999, 0.3), ('き', 15.6, 0.6), ('た', 16.2, 0.6), ('の', 16.8, 0.6), ('に', 17.400000000000002, 0.6), ('も', 18.000000000000004, 0.8999999999999999), ('き', 18.900000000000002, 0.3), ('た', 19.200000000000003, 1.7999999999999998), ('は', 21.600000000000005, 0.6), ('な', 22.200000000000006, 0.3), ('が', 22.500000000000007, 0.3), ('さ', 22.800000000000008, 0.6), ('く', 23.40000000000001, 0.6), ('は', 24.00000000000001, 0.6), ('な', 24.600000000000012, 0.3), ('が', 24.900000000000013, 0.3), ('さ', 25.200000000000014, 0.6), ('く', 25.800000000000015, 0.6), ('ど', 26.400000000000016, 0.6), ('こ', 27.000000000000018, 0.6), ('に', 27.60000000000002, 0.8999999999999999), ('さ', 28.500000000000018, 0.3), ('く', 28.80000000000002, 1.7999999999999998), ('や', 31.20000000000002, 0.6), ('ま', 31.800000000000022, 0.3), ('に', 32.10000000000002, 0.3), ('さ', 32.40000000000002, 0.6), ('く', 33.00000000000002, 0.6), ('さ', 33.60000000000002, 0.6), ('と', 34.200000000000024, 0.3), ('に', 34.50000000000002, 0.3), ('さ', 34.80000000000002, 0.6), ('く', 35.40000000000002, 0.6), ('の', 36.00000000000002, 0.6), ('に', 36.60000000000002, 0.6), ('も', 37.200000000000024, 0.8999999999999999), ('さ', 38.10000000000002, 0.3), ('く', 38.40000000000002, 2.4)]

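I am slightly uneasy about the floating-point handling (rounding errors accumulate in the start times), but the result looks roughly right.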
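If the drift in values such as 3.5999999999999996 becomes a concern, one option is to accumulate the time as an exact fraction and convert to float only when emitting each value. This is a sketch of that alternative, not what the code above does. The function name accumulate_exact is an assumption, and it assumes bpm and divisions are integers.

```python
from fractions import Fraction

def accumulate_exact(durations, bpm, divisions):
    """Yield (start, length) in seconds as floats, accumulating with exact fractions."""
    spd = Fraction(60, bpm * divisions)  # exact seconds per division
    current = Fraction(0)
    for d in durations:
        length = d * spd
        yield float(current), float(length)
        current += length

# Durations in divisions: the whole rest, then the first four notes of the sample.
result = list(accumulate_exact([8, 2, 1, 1, 2], bpm=100, divisions=2))
print(result)
```

Each start time is then the float closest to the exact value, so the error does not grow with the number of notes.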

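Mapping to the kanji-kana mixed notation

What MusicXML provides is the pronunciation of the lyrics (kana), but the subtitles will probably show the original lyrics written in mixed kanji and kana, so I map the original lyrics to the pronunciation.

Suppose there is a file containing the original lyrics as follows.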

harugakita_lyric.txt

春が来た　春が来た　どこに来た
山に来た　里に来た　野にも来た

花がさく　花がさく　どこにさく
山にさく　里にさく　野にもさく

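The mapping is done in the following steps.

Get the pronunciation from the lyrics above with SudachiPy
Map the estimated pronunciation to the correct pronunciation obtained from MusicXML

Getting the pronunciation

Getting the pronunciation is almost the same as in my earlier post "Getting pronunciation with SudachiPy [Python]". However, elements without a pronunciation seemed likely to cause trouble, so tokens with no pronunciation (symbols and the like) are joined to the surface form before or after them as appropriate.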

import sudachipy
import re

class Tokenizer:
  def __init__(self, *, tokenizer_dict = "full", split_mode = None):
    self.tokenizer = sudachipy.dictionary.Dictionary(dict=tokenizer_dict).create()
    self.split_mode = split_mode or sudachipy.tokenizer.Tokenizer.SplitMode.A

  @staticmethod
  def mora_wakachi(kana_text):   
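    # Express each condition as a regular expression
    c1 = '[ウクスツヌフムユルグズヅブプヴ][ァィェォ]' # u-row kana + "ァ/ィ/ェ/ォ"
    c2 = '[イキシチニヒミリギジヂビピ][ャュェョ]' # i-row kana (including "イ") + "ャ/ュ/ェ/ョ"
    c3 = '[テデ][ィュ]' # "テ/デ" + "ィ/ュ"
    c4 = '[ァ-ヴー]' # a single katakana character (including the long-vowel mark)
    c5 = '[a-zA-Z]+' # also extract runs of alphabetic characters, just in case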

    condition = '('+c1+'|'+c2+'|'+c3+'|'+c4+'|'+c5+')'
    return re.findall(condition, kana_text)

  def get_surface_and_pronunciation(self, text, *, join_sign = True):
    tokens = self.tokenizer.tokenize(text ,self.split_mode)
    surfaces, pronunciations = [], []
    last_start_bra = "" # holds pending opening brackets
    for token in tokens:
      pronunciation, surface, pos, second_pos = token.reading_form(), token.surface(), token.part_of_speech()[0], token.part_of_speech()[1]
      #print(pronunciations, token.part_of_speech())
      # Minor pronunciation fixes
      if pos in ("補助記号", "空白"): # symbols and whitespace have no pronunciation
        pronunciation = ""  
      elif surface == "は" and pos == "助詞": # the particle "は" is pronounced "わ"
        pronunciation = "ワ"
      elif surface == "へ" and pos == "助詞": # the particle "へ" is pronounced "え"
        pronunciation = "エ"
      # Append the element. When join_sign=True, adjust surface slightly before appending
      if join_sign:
        # No pronunciation
        if not pronunciation:
          # If it is not an opening bracket, last_start_bra is empty, and surfaces is non-empty, join it to the previous element
          # last_start_bra is also reset (it is already empty here)
          if second_pos != "括弧開" and surfaces and not last_start_bra:
            surfaces[-1] += surface
            last_start_bra = ""
          # Otherwise, hold it and handle it in a later iteration
          else:
            last_start_bra += surface
        # Has pronunciation
        else:
          # Prepend last_start_bra to surface, then reset last_start_bra
          surface = last_start_bra + surface
          last_start_bra = ""
          surfaces.append(surface)
          pronunciation_mora = tuple(self.mora_wakachi(pronunciation)) # convert the pronunciation into a tuple of moras
          pronunciations.append(pronunciation_mora)
      else:
        pronunciation_mora = tuple(self.mora_wakachi(pronunciation)) # convert the pronunciation into a tuple of moras
        pronunciations.append(pronunciation_mora)
        surfaces.append(surface)
    # If join_sign is True and last_start_bra is left over, add it to the last element
    if join_sign:
      if surfaces:
        surfaces[-1] += last_start_bra
      else:
        surfaces.append(last_start_bra)
        # Add a silent (empty) pronunciation
        pronunciations.append(())
    return tuple(surfaces), tuple(pronunciations)


if __name__ == "__main__":
  tokenizer = Tokenizer()
  #print(tokenizer.get_surface_and_pronunciation("吾輩は猫であるaseaewf123。「どこへ」いきますか"))
  print(tokenizer.get_surface_and_pronunciation("「春が来た　春「が」来た「　どこに来た。[]", join_sign=False))

(('「', '春', 'が', '来', 'た', '\u3000', '春', '「', 'が', '」', '来', 'た', '「', '\u3000', 'どこ', 'に', '来', 'た', '。', '[', ']'), ((), ('ハ', 'ル'), ('ガ',), ('キ',), ('タ',), (), ('ハ', 'ル'), (), ('ガ',), (), ('キ',), ('タ',), (), (), ('ド', 'コ'), ('ニ',), ('キ',), ('タ',), (), (), ()))
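To see the mora splitting on its own, without SudachiPy, the regular expression from mora_wakachi can be tried directly. This sketch reuses the same five patterns; the names MORA_PATTERN and split_moras are assumptions for illustration.

```python
import re

# Same five patterns as in mora_wakachi above, compiled once.
MORA_PATTERN = re.compile(
    "[ウクスツヌフムユルグズヅブプヴ][ァィェォ]"
    "|[イキシチニヒミリギジヂビピ][ャュェョ]"
    "|[テデ][ィュ]"
    "|[ァ-ヴー]"
    "|[a-zA-Z]+"
)

def split_moras(kana_text):
    """Split a katakana string into moras; two-character moras are tried first."""
    return MORA_PATTERN.findall(kana_text)

print(split_moras("キョウ"))    # contracted sound stays together
print(split_moras("ファイト"))  # u-row kana + small vowel stays together
print(split_moras("ガッコー"))  # the small tsu and the long-vowel mark are separate moras
```

Because the two-character alternatives come before the single-character one, a kana followed by a small kana is kept as one mora whenever a pattern allows it.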

Building the alignment module

I largely reuse the Allocater class that I implemented in an earlier post. The exec method is not needed here, so I have left it out.

pip install editdistance

# Returns the edit distance and the list of correspondences
import editdistance as ed

class Allocater:

  def __init__(self):
    pass
    
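  # Input: correct_text is a tuple; test_segments is a tuple with one more level of nesting than correct_text. correct_text may also be a str
  # Output: (edit distance of the best split, list of (start, end) indices of that split)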
  @staticmethod
  def find_correspondance(correct_text, test_segments):
    memo = {}
    def inner_func(correct_text, test_segments):
      memo_key = (correct_text, tuple(test_segments))
      if memo_key in memo:
        return memo[memo_key]

      # Special cases
      if correct_text and not test_segments:
        return len(correct_text), []
      elif not correct_text and test_segments:
        flatten_test_segments = [x for row in test_segments for x in row]
        result = (len(flatten_test_segments), [(0,0) for i in range(len(test_segments))])
        memo[memo_key] = result
        return result
      elif not correct_text and not test_segments:
        return 0, []
      # When only one test segment remains, map everything left to it
      elif correct_text and len(test_segments) == 1:
        dist = ed.eval(correct_text, test_segments[0])
        memo[memo_key] = (dist, [(0, len(correct_text))])
        return dist, [(0, len(correct_text))]
      
      # If the overall edit distance is zero, just map in order from the head
      flatten_test_segments = tuple([x for row in test_segments for x in row])
      if correct_text  == flatten_test_segments:
        correspondance = []
        cnt = 0
        for seg in test_segments:
          correspondance.append((cnt, cnt+len(seg)))
          cnt += len(seg)
        memo[memo_key] = (0, correspondance)
        return 0, correspondance

      # Find the best mapping within a width of plus or minus window_size
      text = test_segments[0]
        
      results = []
      #window_size = ed.eval(correct_text, "".join(test_segments))
      window_size = 5
      for i in range(2*window_size+1):
        diff = i-window_size
        if len(text) + diff < 0: continue
        head_dist = ed.eval(correct_text[0:len(text)+diff], text)
        head_correspondance = [(0, len(text)+diff)]
        tail_dist, tail_correspondance = inner_func(correct_text[len(text)+diff:], test_segments[1:])
        # Offset the indices by the length of the first span
        tail_correspondance = [(s+len(text)+diff, e+len(text)+diff) for s,e in tail_correspondance]

        dist = head_dist+tail_dist
        correspondance = head_correspondance + tail_correspondance
        results.append((dist, correspondance))
      #print(min(results, key=lambda x: x[0]))
      min_result = min(results, key=lambda x: x[0])
      memo[memo_key] = min_result
      return min_result
    # Treat correct_text as a tuple
    if type(correct_text) is str:
      correct_text = tuple(correct_text)
    return inner_func(correct_text, test_segments)

  # For debugging and checking. Converts correspondance (start/end indices) into pairs of sequences for readability
  @staticmethod
  def display_correspondance(correct_text, test_segments, correspondance):
    for test_seg, (start, end) in zip(test_segments, correspondance):
      print("test:", test_seg)
      print("correct:", correct_text[start:end])
      print("")
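To illustrate what find_correspondance is computing, here is a simplified, self-contained sketch of the same idea: split the correct mora sequence into consecutive spans, one per estimated word, so that the total edit distance is minimal. It is not the class above. It tries every split point instead of a window of plus or minus 5, uses a pure-Python Levenshtein function in place of editdistance, and the names levenshtein and align are assumptions. The misreading "ク" for 来 is a made-up example.

```python
from functools import lru_cache

def levenshtein(a, b):
    """Edit distance between two sequences (stand-in for editdistance.eval)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (x != y)))
        prev = cur
    return prev[-1]

def align(correct, segments):
    """Split `correct` into len(segments) consecutive spans with minimal total edit distance."""
    correct = tuple(correct)
    segments = tuple(tuple(s) for s in segments)
    if not segments:
        return len(correct), ()

    @lru_cache(maxsize=None)
    def best(start, k):
        # Best (cost, spans) for matching correct[start:] against segments[k:].
        if k == len(segments) - 1:
            # The last segment takes everything that is left.
            return levenshtein(correct[start:], segments[k]), ((start, len(correct)),)
        candidates = []
        for end in range(start, len(correct) + 1):
            head = levenshtein(correct[start:end], segments[k])
            tail_cost, tail_spans = best(end, k + 1)
            candidates.append((head + tail_cost, ((start, end),) + tail_spans))
        return min(candidates, key=lambda c: c[0])

    return best(0, 0)

correct = ("ハ", "ル", "ガ", "キ", "タ")
# Hypothetical analyzer output in which 来 was misread as "ク".
estimated = [("ハ", "ル"), ("ガ",), ("ク",), ("タ",)]
dist, spans = align(correct, estimated)
print(dist, spans)
```

Even with one misread word, each word still receives the span of correct moras it should cover, which is what makes the later lookup of per-word timestamps possible. Trying every split point is slower than the windowed search, so this form is only suitable for short inputs.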

Running the alignment

Using the Tokenizer and Allocater classes created above, I align the timestamps built from MusicXML with the lyric information.

pip install jaconv
pip install pandas

import editdistance as ed
import jaconv
import pandas as pd

# Get the timestamps of the pronunciation (hiragana) from MusicXML
musicxml_path = "NEUTRINO/score/musicxml/sample1.musicxml"
parser = MusicXmlParser(musicxml_path)
timestamps = parser.get_lyric_timestamp(parser.root.find("./part"))

# Take only the pronunciation, converted to katakana
correct_moras = tuple([jaconv.hira2kata(text) for text, _, _ in timestamps])

# Get the surface form of the lyrics line by line (phrase by phrase)
tokenizer = Tokenizer()
lyric_path = "lyric/harugakita_lyric.txt"
with open(lyric_path) as f:
  lyric_surface_text = f.read()
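# Split into lines and drop empty ones
correct_phrases = [v for v in lyric_surface_text.splitlines() if v]
# Split each phrase into words and get the pronunciation
correct_words = [] # the correct word sequence
estimated_word_moras = [] # readings of the words; estimated by morphological analysis, so they may be wrong
# Keep the phrase information as well, just in case
phrase_span = [] # for each phrase, the (start, end) index range of its words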
phrase_start_index = 0
for phrase in correct_phrases:
  surfaces, pronunciations = tokenizer.get_surface_and_pronunciation(phrase) # analyze the phrase to get per-word surface forms and pronunciations
  # Keep the span of this phrase
  phrase_end_index = phrase_start_index + len(surfaces)
  phrase_span.append((phrase_start_index, phrase_end_index))
  phrase_start_index = phrase_end_index
  # Store the words
  correct_words += surfaces # tokenized surface forms
  estimated_word_moras += pronunciations # tokenized pronunciations

# Align the estimated and the correct lyric pronunciations at the word level
allocator = Allocater()
min_dist = ed.eval([mora for word in estimated_word_moras for mora in word], correct_moras)
dist, word_mora_correspondance = allocator.find_correspondance(correct_moras, estimated_word_moras)
#print(min_dist, dist)
#print(word_mora_correspondance)

# Using the alignment result, get the correct pronunciation (moras) for each word
correct_word_moras = [correct_moras[start:end] for start, end in word_mora_correspondance]
word_id, mora_id = 0, 0
# Build a table of the timestamps and the original lyrics
rows = []
for phrase_id, (phrase_start_index, phrase_end_index) in enumerate(phrase_span):
  phrase = correct_words[phrase_start_index: phrase_end_index]
  for word in phrase:
    for mora in correct_word_moras[word_id]:
      rows.append({
        "mora": mora
        , "mora_id": mora_id
        , "word_id": word_id
        , "word_surface": word
        , "phrase_id": phrase_id
      })
      mora_id += 1
    word_id += 1
df = pd.DataFrame(rows)
df["mora_hiragana"] = df["mora"].map(jaconv.kata2hira)
df["start"] = [start for _, start, _ in timestamps]
df["duration"] = [duration for _, _, duration in timestamps]
df.to_csv("output/sample1_timestamp.csv",index=False)

output/sample1_timestamp.csv

mora,mora_id,word_id,word_surface,phrase_id,mora_hiragana,start,duration
ハ,0,0,春,0,は,2.4,0.6
ル,1,0,春,0,る,3.0,0.3
ガ,2,1,が,0,が,3.3,0.3
キ,3,2,来,0,き,3.5999999999999996,0.6
タ,4,3,た　,0,た,4.199999999999999,0.6
...

I obtained the lyrics from MusicXML together with their timestamps and the corresponding surface forms.
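For subtitles, per-word display times are probably more convenient than per-mora rows. As a sketch of how the table could be collapsed, the following groups rows by word_id and takes the first start and the last end. The CSV text is the first five data rows shown above, inlined so that the example runs on its own. The function name word_spans is an assumption, and the standard csv module is used here instead of pandas only to keep the sketch dependency-free.

```python
import csv
import io

# First rows of output/sample1_timestamp.csv as shown above.
CSV_TEXT = """mora,mora_id,word_id,word_surface,phrase_id,mora_hiragana,start,duration
ハ,0,0,春,0,は,2.4,0.6
ル,1,0,春,0,る,3.0,0.3
ガ,2,1,が,0,が,3.3,0.3
キ,3,2,来,0,き,3.5999999999999996,0.6
タ,4,3,た　,0,た,4.199999999999999,0.6
"""

def word_spans(rows):
    """Collapse per-mora rows into one {surface, start, end} entry per word_id."""
    spans = {}
    for row in rows:
        wid = int(row["word_id"])
        start = float(row["start"])
        end = start + float(row["duration"])
        if wid not in spans:
            spans[wid] = {"surface": row["word_surface"], "start": start, "end": end}
        else:
            # Later moras of the same word extend its end time.
            spans[wid]["end"] = max(spans[wid]["end"], end)
    return [spans[k] for k in sorted(spans)]

words = word_spans(csv.DictReader(io.StringIO(CSV_TEXT)))
for w in words:
    print(w["surface"], round(w["start"], 3), round(w["end"], 3))
```

The same grouping on phrase_id instead of word_id would give one time range per lyric line.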

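Conclusion

I was able to extract timestamp information from MusicXML.
With subtitle video generation in mind, I also obtained a word-level mapping between the pronunciation (kana) lyrics and the kanji-kana mixed lyrics.
Next, I will try creating still images with subtitles overlaid on an image.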
