Python Audio Processing & Visualization Cheat Sheet

Last updated at 2024-05-28 / Posted at 2024-01-12

Overview

A cheat sheet for audio processing. It is a work in progress and will be updated over time.

Loading audio

import scipy.io.wavfile
_, x = scipy.io.wavfile.read("read_audio.wav")  # returns (sample_rate, data); the rate is discarded here
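`scipy.io.wavfile.read` returns samples in the file's native dtype (typically `int16` for 16-bit PCM), whereas librosa functions such as `librosa.stft` expect floating-point input. The following is a minimal sketch of a loader that rescales to `float32` in [-1, 1]. The helper name `load_wav_as_float` is hypothetical and not part of the original cheat sheet.

```python
import numpy as np
import scipy.io.wavfile


def load_wav_as_float(path):
    """Read a WAV file and return (sample_rate, float32 samples in [-1, 1])."""
    sr, data = scipy.io.wavfile.read(path)
    if data.dtype == np.uint8:
        # 8-bit WAV is unsigned with a 128 offset
        data = (data.astype(np.float32) - 128.0) / 128.0
    elif np.issubdtype(data.dtype, np.integer):
        # Divide by the full-scale magnitude of the integer type (32768 for int16)
        data = data.astype(np.float32) / float(abs(np.iinfo(data.dtype).min))
    else:
        # Already floating point; only unify the dtype
        data = data.astype(np.float32)
    return sr, data
```

Keeping the returned sample rate, rather than discarding it, makes it easier to notice files that are not 16 kHz.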

Writing audio

scipy.io.wavfile.write(
    filename="write_audio.wav",
    rate=16000,
    data=x,
)
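`scipy.io.wavfile.write` stores the array in whatever dtype it is given, so a `float32` array is written as a 32-bit float WAV. If 16-bit PCM output is wanted, one option is to clip and rescale first. This is a sketch under that assumption; the helper name `save_float_as_int16` is hypothetical.

```python
import numpy as np
import scipy.io.wavfile


def save_float_as_int16(path, audio, sr=16000):
    """Write float audio in [-1, 1] as a 16-bit PCM WAV file."""
    clipped = np.clip(audio, -1.0, 1.0)          # avoid integer wrap-around
    pcm = (clipped * 32767.0).astype(np.int16)   # scale to the int16 range
    scipy.io.wavfile.write(path, sr, pcm)
```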

Visualizing audio waveforms

Single waveform

import librosa.display
import matplotlib.pyplot as plt
fig, ax = plt.subplots(figsize=(8,2))
librosa.display.waveshow(audio00, sr=16000, ax=ax)  # draw on the axes created above

ax.set_title("audio00")
ax.set_ylabel("Amplitude")

plt.tight_layout()

Stacking multiple waveforms vertically

row = 2
fig, ax = plt.subplots(row, 1, figsize=(8,2*row))
librosa.display.waveshow(audio00, sr=16000, ax=ax[0])
librosa.display.waveshow(audio01, sr=16000, ax=ax[1])

ax[0].set_title("audio00")
ax[1].set_title("audio01")

for a in ax:
    a.set_ylabel("Amplitude")
plt.tight_layout()

Embedding audio

import IPython.display

print("audio00")
IPython.display.display(IPython.display.Audio(audio00, rate=16000))

print("audio01")
IPython.display.display(IPython.display.Audio(audio01, rate=16000))

Log-amplitude spectrogram and the Griffin-Lim algorithm

Simple usage

Class-based usage

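import librosa
import numpy as np


class Converter:
    """Convert between waveforms and log-amplitude spectrograms (16 kHz audio assumed)."""

    def __init__(self, ref=1.0):
        self.ref = ref  # reference amplitude for the dB conversion

    def logamplitudespectrum(self, audio):
        frame_shift = int(16000 * 0.005)  # 5 ms hop at 16 kHz (80 samples)
        n_fft = 2048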

        X = librosa.stft(
            audio,
            n_fft=n_fft,
            win_length=n_fft,
            hop_length=frame_shift,
            window="hann",
            center=False,
        )
        spec = np.abs(X)
        logspec = librosa.amplitude_to_db(spec, ref=self.ref)
        return logspec

    def griffinlim_and_pad(self, logspec, audio_size):
        frame_shift = int(16000 * 0.005)

        spec = librosa.db_to_amplitude(logspec, ref=self.ref)
        audio = librosa.griffinlim(spec, hop_length=frame_shift, n_iter=200)

        if audio.shape[0] < audio_size:
            num_pad = audio_size - audio.shape[0]
            num_pad_top = num_pad // 2
            num_pad_bottom = num_pad - num_pad_top
            audio = np.concatenate(
                [
                    np.zeros(num_pad_top, dtype=np.float32),
                    audio,
                    np.zeros(num_pad_bottom, dtype=np.float32),
                ]
            )
        else:
            audio = audio[:audio_size]

        return audio

converter = Converter()

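Note on Griffin-Lim: used as-is, the reconstructed signal comes out left-aligned (shifted toward the start) and shorter than the input, so appropriate padding is needed. This is likely because the spectrogram here is computed with center=False, while librosa.griffinlim defaults to center=True.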
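The amount of padding can be worked out by hand. The sketch below assumes the settings used in `Converter` (16 kHz, `n_fft=2048`, 80-sample hop, `center=False` for the STFT and the default `center=True` for `librosa.griffinlim`), and an example input of 16000 samples, which is an illustrative value not taken from the original. The helper names are hypothetical.

```python
def stft_frames_no_center(n_samples, n_fft, hop):
    """Number of STFT frames when center=False (no edge padding)."""
    return 1 + (n_samples - n_fft) // hop


def centered_istft_length(n_frames, hop):
    """Output length of a centered inverse STFT when no length is given."""
    return hop * (n_frames - 1)


n_samples, n_fft, hop = 16000, 2048, 80
n_frames = stft_frames_no_center(n_samples, n_fft, hop)
out_len = centered_istft_length(n_frames, hop)

# Same split as in Converter.griffinlim_and_pad
num_pad = n_samples - out_len
num_pad_top = num_pad // 2
num_pad_bottom = num_pad - num_pad_top

print(n_frames, out_len, num_pad_top, num_pad_bottom)  # 175 13920 1040 1040
```

Under these assumptions the symmetric padding of 1040 samples is close to the half-window offset `n_fft // 2 = 1024`, which is why splitting the padding evenly between the start and the end roughly restores the original alignment.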

Log-amplitude spectrogram

logspec = converter.logamplitudespectrum(potential_adv)
fig, ax = plt.subplots(1, 1, figsize=(8, 4), sharex=True)
img = librosa.display.specshow(
    logspec,
    hop_length=int(16000 * 0.005),
    sr=16000,
    x_axis="time",
    y_axis="hz",
    ax=ax,
)
fig.colorbar(img, ax=ax, format="%+2.f dB")
ax.set_xlabel("Time [sec]")
ax.set_ylabel("Frequency [Hz]")
fig.savefig("log_amplitude_spectrogram.png")

Griffin-Lim algorithm

# Pad or trim to the length of the waveform the spectrogram was computed from
audio = converter.griffinlim_and_pad(logspec, potential_adv.shape[0])

Audio frequency conversion
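This section has no content in the original. Assuming the heading refers to sampling-rate conversion, a minimal sketch using `scipy.signal.resample_poly` could look like the following. The choice of SciPy over other resamplers and the helper name `resample_audio` are assumptions made for illustration.

```python
from math import gcd

import numpy as np
from scipy.signal import resample_poly


def resample_audio(audio, orig_sr, target_sr):
    """Resample a 1-D waveform from orig_sr to target_sr (polyphase filtering)."""
    g = gcd(orig_sr, target_sr)
    up, down = target_sr // g, orig_sr // g  # e.g. 48000 -> 16000 gives up=1, down=3
    return resample_poly(audio, up, down).astype(np.float32)


x48k = np.zeros(48000, dtype=np.float32)  # one second of silence at 48 kHz
x16k = resample_audio(x48k, 48000, 16000)
print(x16k.shape)  # (16000,)
```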
