
Separating stereo audio channels with pydub

Last updated at 2019-10-18, Posted at 2019-03-16

How to separate the channels

Once the audio is loaded with AudioSegment, the stereo channels can be separated as follows.

import numpy as np
from pydub import AudioSegment

sound = AudioSegment.from_file('fileName.mp4')
samples = np.array(sound.get_array_of_samples())

# Samples are interleaved (L, R, L, R, ...), so take every second value.
left = samples[0::2]
right = samples[1::2]

Detailed explanation

This is just a quick tip, but here is an explanation anyway.
With pydub, loading audio is easy.

sound = AudioSegment.from_file('fileName.mp4')

The audio metadata can be read as shown below.
The values used most often are probably the total length (duration), the frame rate, and the number of channels.
pydub is handy.

channel_count = sound.channels
frames_per_second = sound.frame_rate
duration = sound.duration_seconds

print('channel==>', channel_count)
print('frame rate==>', frames_per_second)
print('duration==>', duration)

channel==> 2
frame rate==> 48000
duration==> 22.464000

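In this example, we can see that there are two channels.
Put simply, it is stereo. It could just as well be two mono recordings made with two separate microphones, but either way there are two channels.
For audio analysis, we want to treat these as two separate time series, for example to look at the correlation between the channels. That is the motivation for separating them.
So the question is how the samples are stored.
As a first step, let's extract just the audio samples and check their length: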

samples = np.array(sound.get_array_of_samples())
print('length of samples==>', len(samples))

length of samples==> 2156544

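2156544 is presumably 22.464 s x 48000 Hz x 2 ch = 2156544.
So the channels are stored serially. (Or perhaps they are just read out serially. I am a complete beginner with audio, so if you know the details, please let me know.)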
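
The arithmetic behind that guess can be checked in a few lines. This is a minimal sketch that only reuses the numbers printed above; the variable names `frames` and `expected_samples` are introduced here for illustration and are not from the original code.

```python
# Values taken from the metadata output shown earlier in the article.
duration_seconds = 22.464
frame_rate = 48000
channels = 2

# One frame holds one sample per channel.
# round() absorbs floating-point error in the multiplication.
frames = round(duration_seconds * frame_rate)
expected_samples = frames * channels

print(frames)            # 1078272
print(expected_samples)  # 2156544
```

The result matches the observed `len(samples)`, which is consistent with the array holding one value per channel per frame.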

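So how exactly are they serialized?
Looking at the waveform, the two channels appeared to alternate sample by sample.
So slicing out the even indices and the odd indices extracts the left and right data, respectively.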

left = samples[0:len(samples):2]
right = samples[1:len(samples):2]
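
The same idea can be tried without an audio file. The sketch below assumes the interleaved layout observed above (L, R, L, R, ...) and uses a small synthetic array in place of `sound.get_array_of_samples()`. The helper `deinterleave` is a hypothetical name introduced here, not part of pydub or NumPy. It also computes the correlation between the two channels, which was the original motivation for separating them.

```python
import numpy as np

def deinterleave(samples, channels):
    """Split a flat interleaved sample array into one row per channel.

    Hypothetical helper for illustration; assumes frame-by-frame interleaving.
    """
    samples = np.asarray(samples)
    if len(samples) % channels != 0:
        raise ValueError("sample count is not a multiple of the channel count")
    # Each row after reshape is one frame (ch0, ch1, ...);
    # transposing gives one row per channel.
    return samples.reshape(-1, channels).T

# Synthetic stereo data: left counts 0..4, right counts 100..104.
interleaved = np.array([0, 100, 1, 101, 2, 102, 3, 103, 4, 104], dtype=np.int16)

left, right = deinterleave(interleaved, channels=2)
print(left)   # [0 1 2 3 4]
print(right)  # [100 101 102 103 104]

# The result agrees with the even/odd slicing used in the article.
same_as_slicing = (np.array_equal(left, interleaved[0::2])
                   and np.array_equal(right, interleaved[1::2]))

# Pearson correlation between the two channels.
corr = np.corrcoef(left, right)[0, 1]
```

Reshaping instead of slicing generalizes to more than two channels, provided the data really is interleaved frame by frame. With a real file, `channels` would come from `sound.channels`. As far as I know, pydub also offers `AudioSegment.split_to_mono()`, which returns one mono segment per channel and may be simpler when NumPy arrays are not needed.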

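This was the first wall I ran into when I tried working with audio for the first time.
Honestly, I do not know whether all stereo audio is stored this way.
If there are cases where it differs, please let me know.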
