Paper > CUDA > CELES: CUDA-accelerated simulation of electromagnetic scattering by large ensembles of spheres by Egel, A. et al. 2017 #CUDA

CELES: CUDA-accelerated simulation of electromagnetic scattering by large ensembles of spheres
Egel, A., Pattelli, L., Mazzamuto, G., Wiersma, D.S., Lemmer, U.
Journal of Quantitative Spectroscopy & Radiative Transfer
volume 199, year 2017, pp. 103-110

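I was notified that a paper I co-authored in 2009 had been cited. When I looked at the citing paper, it turned out to be one that uses CUDA to improve computation speed.
I purchased it (35 USD).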

Highlights
• A new, freely available MATLAB software for light scattering by large ensembles of spheres is introduced.
• It uses block-diagonal preconditioning, a lookup-table approach for the evaluation of costly functions and parallel execution on the GPU via NVIDIA’s CUDA platform.
• The validity of the results is demonstrated by comparison to established software.
• The convergence behavior for scattering by large aggregates is discussed.
• An accurate simulation of scattering by 100,000 wavelength-scale spheres is demonstrated.

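"Block-diagonal preconditioning" is a keyword that catches my interest.

The constituent particles of the aggregate are wavelength-sized, which is perhaps on the small side(?), but 100,000 particles is impressive.
Unlike DDA, this is a rigorous solution method, so that is a large number of particles.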

Code

(added 2018/06/16)
https://disordered-photonics.github.io/celes/

Contents

The paper pursues speedup with three techniques.

a block-diagonal preconditioner
a lookup table for the spherical Hankel function
GPU acceleration

iterative solution

The following two solvers are available.

BiCGSTAB (biconjugate gradient stabilized method)
GMRES (the generalized minimal residual method)
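
As a rough illustration of what such an iterative solver does, the following is a minimal sketch of unpreconditioned BiCGSTAB in plain Python. It is not taken from CELES: the function names (`matvec`, `dot`, `bicgstab`) and the small real 3x3 test matrix are assumptions made for this note, whereas the real scattering problem is a large complex-valued system.

```python
def matvec(A, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def dot(u, v):
    """Plain (real) dot product."""
    return sum(ui * vi for ui, vi in zip(u, v))


def bicgstab(A, b, tol=1e-10, max_iter=100):
    """Solve A x = b with unpreconditioned BiCGSTAB, starting from x = 0."""
    n = len(b)
    x = [0.0] * n
    r = b[:]          # residual b - A x for the initial guess x = 0
    r_hat = r[:]      # fixed shadow residual
    rho = alpha = omega = 1.0
    v = [0.0] * n
    p = [0.0] * n
    b_norm = dot(b, b) ** 0.5
    for k in range(1, max_iter + 1):
        rho_new = dot(r_hat, r)
        beta = (rho_new / rho) * (alpha / omega)
        p = [ri + beta * (pi - omega * vi) for ri, pi, vi in zip(r, p, v)]
        v = matvec(A, p)
        alpha = rho_new / dot(r_hat, v)
        s = [ri - alpha * vi for ri, vi in zip(r, v)]
        if dot(s, s) ** 0.5 <= tol * b_norm:
            # The half step already converged.
            x = [xi + alpha * pi for xi, pi in zip(x, p)]
            return x, k
        t = matvec(A, s)
        omega = dot(t, s) / dot(t, t)
        x = [xi + alpha * pi + omega * si for xi, pi, si in zip(x, p, s)]
        r = [si - omega * ti for si, ti in zip(s, t)]
        rho = rho_new
        if dot(r, r) ** 0.5 <= tol * b_norm:
            return x, k
    return x, max_iter


# Hypothetical toy system (not from the paper): nonsymmetric, diagonally dominant.
A = [[4.0, 1.0, 0.0],
     [2.0, 5.0, 1.0],
     [0.0, 1.0, 3.0]]
x_true = [1.0, 2.0, 3.0]
b = matvec(A, x_true)
x, iterations = bicgstab(A, b)
print("solution:", x, "iterations:", iterations)
```

The only access to the matrix is through `matvec`, which is why this kind of solver suits problems where the matrix-vector product is the expensive step.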

a block-diagonal preconditioner

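For an aggregate, the paper assumes that the strong interactions come from neighboring particles. Under this assumption, the aggregate is divided into subgroups and the light scattering of each subgroup is solved exactly.
The subgroup results are then combined to obtain the final result.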
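
The idea can be sketched on a toy problem. The example below is a minimal illustration, not the CELES implementation. It assumes a 4x4 real matrix in which unknowns (0, 1) and (2, 3) form two strongly coupled "subgroups" with weak coupling between them, and it swaps in a simple preconditioned Richardson iteration in place of BiCGSTAB or GMRES to keep the code short. All names (`solve2x2`, `apply_block_preconditioner`, `preconditioned_richardson`, `groups`) are hypothetical.

```python
def matvec(A, x):
    """Dense matrix-vector product for a list-of-rows matrix."""
    return [sum(a * xj for a, xj in zip(row, x)) for row in A]


def solve2x2(block, rhs):
    """Exact solve of a 2x2 system via the explicit inverse."""
    (a, b), (c, d) = block
    det = a * d - b * c
    return [(d * rhs[0] - b * rhs[1]) / det,
            (a * rhs[1] - c * rhs[0]) / det]


def apply_block_preconditioner(A, groups, r):
    """Solve each diagonal block exactly and ignore coupling between groups."""
    z = [0.0] * len(r)
    for i, j in groups:
        block = [[A[i][i], A[i][j]], [A[j][i], A[j][j]]]
        z[i], z[j] = solve2x2(block, [r[i], r[j]])
    return z


def preconditioned_richardson(A, b, groups, tol=1e-10, max_iter=200):
    """Iterate x <- x + M^{-1} (b - A x), with M the block-diagonal part of A."""
    x = [0.0] * len(b)
    for k in range(max_iter):
        r = [bi - yi for bi, yi in zip(b, matvec(A, x))]
        if max(abs(ri) for ri in r) <= tol:
            return x, k
        z = apply_block_preconditioner(A, groups, r)
        x = [xi + zi for xi, zi in zip(x, z)]
    return x, max_iter


# Hypothetical toy matrix: strong coupling inside each pair, weak between pairs.
A = [[4.0, 2.0, 0.1, 0.0],
     [2.0, 4.0, 0.0, 0.1],
     [0.1, 0.0, 4.0, 2.0],
     [0.0, 0.1, 2.0, 4.0]]
groups = [(0, 1), (2, 3)]
x_true = [1.0, -1.0, 2.0, 0.5]
b = matvec(A, x_true)
x, iterations = preconditioned_richardson(A, b, groups)
print("solution:", x, "iterations:", iterations)
```

Because the coupling between the two groups is weak, each preconditioned step removes most of the remaining error. With stronger coupling between groups the block solves would capture less of the interaction and more iterations would be needed.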

a lookup table for the spherical Hankel function

The spherical Hankel function depends only on the radial distance coordinate d, so it is precomputed and kept in a table.
The values are stored in a lookup table with spatial resolution delta r.
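
A minimal sketch of the lookup-table idea follows, in plain Python. It uses the closed form of the order-0 spherical Hankel function of the first kind, h0(x) = -i exp(ix) / x, and linear interpolation between table entries. The order, the interpolation scheme, the table range and the names (`build_table`, `lookup`, `R_MIN`, `R_MAX`, `DELTA_R`) are assumptions for illustration and are not taken from the paper or from CELES.

```python
import cmath


def spherical_hankel_h0(x):
    """Spherical Hankel function of the first kind, order 0: -i exp(ix) / x."""
    return -1j * cmath.exp(1j * x) / x


def build_table(func, r_min, r_max, delta_r):
    """Precompute func on a uniform grid r_min, r_min + delta_r, ..., r_max."""
    n = int(round((r_max - r_min) / delta_r)) + 1
    return [func(r_min + i * delta_r) for i in range(n)]


def lookup(table, r_min, delta_r, r):
    """Linearly interpolate the tabulated function at r (r inside the grid)."""
    pos = (r - r_min) / delta_r
    i = min(int(pos), len(table) - 2)  # clamp so that i + 1 stays in range
    w = pos - i
    return (1.0 - w) * table[i] + w * table[i + 1]


# Hypothetical table parameters (argument taken as a dimensionless distance).
R_MIN, R_MAX, DELTA_R = 1.0, 20.0, 0.01
table = build_table(spherical_hankel_h0, R_MIN, R_MAX, DELTA_R)

for r in (1.005, 3.333, 12.3456):
    approx = lookup(table, R_MIN, DELTA_R, r)
    exact = spherical_hankel_h0(r)
    print(r, abs(approx - exact))
```

The table is built once, and every later evaluation costs two array reads and a multiply-add instead of a complex exponential. The choice of delta r trades memory for interpolation accuracy.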

GPU acceleration

I have not fully worked through the details of the GPU computation.

The following file seems to be the relevant one (the nesting is deep).
https://github.com/disordered-photonics/celes/blob/master/sources/scattering/coupling_matrix_multiply_CUDA.cu

The computations were run on a Maxwell-generation NVIDIA GTX Titan X (3072 CUDA cores).
Separately, a Maxwell-generation GeForce GTX 980 Ti (2816 CUDA cores) is reported to have performed similarly.

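Results

In Fig. 4, at a volume fraction of 10%, the run using the block-diagonal preconditioner converges in the fewest iterations.

The paper also compares GPU execution (CELES) against 12-thread execution (using the MSTM code); CELES is roughly twice as fast.

That does not give the impression of a dramatic speedup.

(Personal opinion) I once saw a YouTube video explaining that CUDA gains its speed by processing simple computations in parallel. In the GPU usage above, perhaps somewhat heavyweight computations are being assigned to the GPU; the speed improvement is smaller than I expected.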