
Notes on convolution2d with cusignal

Posted at 2020-10-05

Background

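For image processing, I want to run a float32 convolution on large images (e.g. 16K), the equivalent of Conv2D in machine learning.
The kernels I want to apply are also fairly large, 128x128 to 256x256, and there are about 200 of them.
With scipy convolve, method="direct" (the naive implementation) takes an absurdly long time (a single convolve does not finish even after 10 minutes).
- With fft it takes about 3 to 4 seconds on one CPU core (scipy convolve runs single-core), but I am concerned about the approximation error.
- Running it multithreaded also uses a lot of memory.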
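
For reference, a minimal sketch of the two scipy paths mentioned above. The array sizes are hypothetical and far smaller than the 16K case so that the direct method finishes quickly; `scipy.signal.choose_conv_method` is included only to show which method scipy itself would pick.

```python
import numpy as np
from scipy import signal

# Hypothetical small sizes; the real workload is a 16K image with 128x128+ kernels.
rng = np.random.default_rng(0)
image = rng.random((128, 128), dtype=np.float32)
kernel = rng.random((17, 17), dtype=np.float32)

# Naive sliding-window convolution.
direct = signal.convolve(image, kernel, mode="same", method="direct")

# FFT-based convolution: same result up to floating-point rounding.
via_fft = signal.convolve(image, kernel, mode="same", method="fft")

# Ask scipy which method it estimates to be faster for these inputs.
suggested = signal.choose_conv_method(image, kernel, mode="same")
print(suggested, direct.shape, via_fft.shape)
```

The cost of the direct method grows with the product of image and kernel sizes, which is why it becomes impractical at the sizes described here.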

I was pointed to cusignal, which can do fast convolution on a GPU (CUDA).

If you're performance limited with scipy.signal.convolve and have a GPU available, I encourage you to try cusignal.convolve - a GPU implementation of scipy.signal. https://t.co/YlBueOhD0v
— Adam Thompson (@adamlikesai) September 21, 2020

Thank you.

cusignal convolution implementation

In the 2D case there is no FFT path; the convolution is direct (brute force).
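
A minimal usage sketch, assuming `cusignal.convolve2d` mirrors the `scipy.signal.convolve2d` signature and takes CuPy arrays. The helper name `convolve2d_same`, the CPU fallback, and the small array sizes are illustrative assumptions, not part of cusignal.

```python
import numpy as np
from scipy import signal

try:
    import cupy as cp
    import cusignal
    cp.cuda.runtime.getDeviceCount()  # raises if no usable CUDA device
    HAVE_CUSIGNAL = True
except Exception:
    HAVE_CUSIGNAL = False


def convolve2d_same(image, kernel):
    """Hypothetical helper: direct 2D convolution on GPU if available, else CPU."""
    if HAVE_CUSIGNAL:
        # Copy host arrays to the GPU, convolve there, copy the result back.
        out = cusignal.convolve2d(cp.asarray(image), cp.asarray(kernel), mode="same")
        return cp.asnumpy(out)
    # Fallback so the sketch also runs on a machine without CUDA.
    return signal.convolve2d(image, kernel, mode="same")


# Hypothetical small inputs; the measurements in this memo used 16K x 16K.
rng = np.random.default_rng(0)
image = rng.random((256, 256), dtype=np.float32)
kernel = rng.random((9, 9), dtype=np.float32)
result = convolve2d_same(image, kernel)
print(result.shape, HAVE_CUSIGNAL)
```

When applying many kernels to the same image, it is probably worth uploading the image to the GPU once and reusing it instead of copying it on every call.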

Installation

On Linux, following the official instructions will probably get it installed.

On Windows, as far as I tried, I ran into conda-related errors I could not make sense of, such as CUDA version mismatches.

Performance

RTX 2080 Ti with a 150 W power limit

With a 16K x 16K image and a 129x129 kernel, one convolution took 8.5 seconds.

GPU memory consumption was about 5 GB, so this should also work on a GPU with 8 GB of memory such as the RTX 2070.
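
As a rough sanity check on these numbers, the arithmetic below estimates the buffer sizes and the implied throughput. It assumes "16K" means 16384 pixels per side and that the output has the same size as the input; both are assumptions, not measurements.

```python
SIDE = 16 * 1024      # assumption: "16K" means 16384 pixels per side
KERNEL = 129          # kernel side length from the measurement above
BYTES_PER_VALUE = 4   # float32
SECONDS = 8.5         # measured time for one convolution

# Size of one float32 image buffer.
image_bytes = SIDE * SIDE * BYTES_PER_VALUE
image_gib = image_bytes / 2**30

# A direct convolution does one multiply-add per kernel tap per output pixel.
macs = SIDE * SIDE * KERNEL * KERNEL
macs_per_second = macs / SECONDS

print(f"one image buffer: {image_gib:.1f} GiB")
print(f"multiply-adds per convolution: {macs:.3e}")
print(f"implied throughput: {macs_per_second:.3e} multiply-adds/s")
```

Under these assumptions the input and output buffers alone account for about 2 GiB of the roughly 5 GB observed.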

TODO

Compare against a CPU + C++ convolution (float32); XNNPACK or the Conv2D in PyTorch/TensorFlow look like good candidates.
Measure the error against CPU FFT convolve.
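
One possible way to approach the error measurement, as a sketch only: treat a float64 direct convolution as the reference and report the maximum error of the float32 FFT and float32 direct results relative to the largest reference value. The function name `fft_vs_direct_error` and the input sizes are hypothetical, and random data may not reflect the error seen on real images and kernels.

```python
import numpy as np
from scipy import signal


def fft_vs_direct_error(image, kernel):
    """Hypothetical helper: max error of float32 results vs. a float64 reference."""
    # Reference: direct convolution in double precision.
    ref = signal.convolve(image.astype(np.float64), kernel.astype(np.float64),
                          mode="same", method="direct")
    fft32 = signal.convolve(image, kernel, mode="same", method="fft")
    direct32 = signal.convolve(image, kernel, mode="same", method="direct")
    scale = np.abs(ref).max()
    return {
        "fft": float(np.abs(fft32 - ref).max() / scale),
        "direct": float(np.abs(direct32 - ref).max() / scale),
    }


# Hypothetical small inputs so the direct reference finishes quickly.
rng = np.random.default_rng(0)
image = rng.random((128, 128), dtype=np.float32)
kernel = rng.random((33, 33), dtype=np.float32)
errors = fft_vs_direct_error(image, kernel)
print(errors)
```

The same comparison could be run against the GPU result by adding it as a third entry.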
