[M1 Mac] How to link OpenBLAS into NumPy on Arm Python 3.9 and speed up matrix multiplication by 300x

Posted at 2021-06-09

Note: this article is for readers who want an M1-native Python 3.9 environment that is comfortable for linear algebra. Link numpy/scipy against OpenBLAS to get a faster computing environment.

Prerequisites

This article assumes the following:

  • M1 Mac
  • Arm build of Homebrew
  • Arm build of Python 3.9
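
Before building anything, it can help to confirm that the Python interpreter really is the native Arm build and not an x86_64 build running under Rosetta 2. The following is a minimal sketch using only the standard library; the helper names `describe_interpreter` and `is_native_arm_mac` are hypothetical and chosen for illustration.

```python
import platform
import sys


def describe_interpreter():
    """Collect the facts that matter for the prerequisites above."""
    return {
        "system": platform.system(),    # "Darwin" on macOS
        "machine": platform.machine(),  # "arm64" for native Arm, "x86_64" under Rosetta 2
        "python": "%d.%d" % sys.version_info[:2],
    }


def is_native_arm_mac(info):
    """True when the interpreter runs natively on an Apple Silicon Mac."""
    return info["system"] == "Darwin" and info["machine"] == "arm64"


info = describe_interpreter()
print(info, "native Arm macOS:", is_native_arm_mac(info))
```

If `machine` is reported as `x86_64` on an M1 Mac, the interpreter is most likely an Intel build running under Rosetta 2, and the steps in this article do not apply as written.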

Installation with OpenBLAS

The following commands build and install numpy/scipy linked against OpenBLAS:

# required to build numpy/scipy
% brew install openblas gfortran
% pip3 install cython pybind11
# magic incantation: points the build at Homebrew's OpenBLAS
% export OPENBLAS="$(brew --prefix openblas)/lib/"
# build from source
% pip3 install --no-binary :all: --no-use-pep517 numpy
# optionally build scipy as well (note: this takes quite a while)
% pip3 install --no-binary :all: --no-use-pep517 scipy
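
After the build finishes, one way to check that the link actually happened is to inspect the output of `numpy.show_config()`. The sketch below parses the legacy text layout that appears in the logs later in this article (a section name at column 0, followed by indented lines); newer NumPy releases print a different layout, so treat the parser as an assumption tied to that format. The helper names are hypothetical.

```python
import contextlib
import io


def parse_config(config_text):
    """Split legacy show_config() text into {section name: [body lines]}."""
    sections = {}
    current = None
    for line in config_text.splitlines():
        if not line.strip():
            continue
        if not line[0].isspace() and line.rstrip().endswith(":"):
            current = line.strip()[:-1]      # e.g. "openblas_info"
            sections[current] = []
        elif current is not None:
            sections[current].append(line.strip())
    return sections


def uses_openblas(config_text):
    """True if any section lists 'openblas' among its libraries."""
    for body in parse_config(config_text).values():
        for line in body:
            if line.startswith("libraries") and "openblas" in line:
                return True
    return False


def current_numpy_config():
    """Capture what numpy.show_config() prints (requires NumPy)."""
    import numpy as np  # imported lazily so the parser works without NumPy
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        np.show_config()
    return buffer.getvalue()
```

Calling `uses_openblas(current_numpy_config())` in the freshly built environment should return `True` when the output has the legacy layout and OpenBLAS was picked up. A section that only says `NOT AVAILABLE` is treated as not linked.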

Benchmarking computation speed

I ran the following script to measure the benchmark:
https://gist.github.com/markus-beuckelmann/8bc25531b11158431a5b09a45abd6276
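
For readers who want a feel for what such a benchmark does, here is a minimal sketch of timing the same kinds of operations. It is not the code from the gist above: the function names, the default size `n=1024`, and the repeat count are assumptions made for illustration, and the sizes are smaller than the ones reported in this article.

```python
import time


def time_call(fn, repeats=3):
    """Return the mean wall-clock seconds of calling fn() `repeats` times."""
    start = time.perf_counter()
    for _ in range(repeats):
        fn()
    return (time.perf_counter() - start) / repeats


def run_benchmarks(n=1024, vec_len=524288, repeats=3):
    """Time a few linear algebra operations; requires NumPy."""
    import numpy as np  # imported lazily so time_call works without NumPy

    rng = np.random.default_rng(0)
    a = rng.random((n, n))
    b = rng.random((n, n))
    x = rng.random(vec_len)
    y = rng.random(vec_len)
    # a @ a.T is positive semidefinite; adding n * I makes it positive definite,
    # which is what Cholesky decomposition needs.
    spd = a @ a.T + n * np.eye(n)
    return {
        "matmul": time_call(lambda: a @ b, repeats),
        "dot": time_call(lambda: x @ y, repeats),
        "svd": time_call(lambda: np.linalg.svd(a, full_matrices=False), repeats),
        "cholesky": time_call(lambda: np.linalg.cholesky(spd), repeats),
        "eig": time_call(lambda: np.linalg.eig(a), repeats),
    }
```

Running `run_benchmarks()` once in each environment gives numbers that can be compared the same way as the table in this section, although absolute values will differ from the gist because of the smaller sizes.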

| Name | Python platform | BLAS / LAPACK | Matrix product (4096x4096) [s] | Dot product (1x524288) [ms] | SVD (2048x1024) [s] | Cholesky decomposition (2048x2048) [s] | Eigendecomposition (2048x2048) [s] |
|---|---|---|---|---|---|---|---|
| Pure NumPy | aarch64 (Homebrew) | - | 298.54 | 1.09 | 13.27 | 2.12 | 73.81 |
| NumPy + OpenBLAS | aarch64 (Homebrew) | OpenBLAS | 0.95 | 0.28 | 2.49 | 0.11 | 10.27 |
| NumPy + Intel MKL | Intel (Miniconda) | Intel MKL | 2.53 | 0.08 | 0.96 | 0.22 | 8.16 |

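For the matrix product, the OpenBLAS build was about 300 times faster than pure NumPy (298.54 s versus 0.95 s). However, the native Arm build of OpenBLAS still loses to Intel MKL running under Rosetta 2 emulation on the dot product, SVD, and eigendecomposition; it is ahead only on the matrix product and the Cholesky decomposition. Why is that? (AVX-512 is not available under Rosetta 2, so MKL should not be able to run at full power.)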
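
The ratios behind this comparison can be recomputed directly from the table. The sketch below hard-codes the timings reported above (seconds, except the dot product, which is in milliseconds); the dictionary keys and helper names are hypothetical labels chosen for illustration.

```python
# Timings copied from the table above (seconds; "dot" is in milliseconds).
TIMINGS = {
    "Pure NumPy": {"matmul": 298.54, "dot": 1.09, "svd": 13.27, "cholesky": 2.12, "eig": 73.81},
    "NumPy + OpenBLAS": {"matmul": 0.95, "dot": 0.28, "svd": 2.49, "cholesky": 0.11, "eig": 10.27},
    "NumPy + Intel MKL": {"matmul": 2.53, "dot": 0.08, "svd": 0.96, "cholesky": 0.22, "eig": 8.16},
}


def speedups(baseline, candidate):
    """How many times faster `candidate` is than `baseline`, per benchmark."""
    return {name: baseline[name] / candidate[name] for name in baseline}


def winners(name_a, name_b):
    """For each benchmark, the setup with the smaller (better) time."""
    a, b = TIMINGS[name_a], TIMINGS[name_b]
    return {bench: (name_a if a[bench] < b[bench] else name_b) for bench in a}


for bench, ratio in speedups(TIMINGS["Pure NumPy"], TIMINGS["NumPy + OpenBLAS"]).items():
    print(f"{bench}: OpenBLAS is {ratio:.1f}x faster than pure NumPy")
print(winners("NumPy + OpenBLAS", "NumPy + Intel MKL"))
```

With these numbers the matrix product ratio comes out at roughly 314x, and OpenBLAS is ahead of MKL on the matrix product and Cholesky rows, while MKL is ahead on the dot product, SVD, and eigendecomposition.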

Benchmark logs

Pure NumPy (aarch64)

Dotted two 4096x4096 matrices in 298.54 s.
Dotted two vectors of length 524288 in 1.09 ms.
SVD of a 2048x1024 matrix in 13.27 s.
Cholesky decomposition of a 2048x2048 matrix in 2.12 s.
Eigendecomposition of a 2048x2048 matrix in 73.81 s.

This was obtained using the following Numpy configuration:
blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
  NOT AVAILABLE
atlas_3_10_blas_threads_info:
  NOT AVAILABLE
atlas_3_10_blas_info:
  NOT AVAILABLE
atlas_blas_threads_info:
  NOT AVAILABLE
atlas_blas_info:
  NOT AVAILABLE
blas_info:
  NOT AVAILABLE
blas_src_info:
  NOT AVAILABLE
blas_opt_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
  NOT AVAILABLE
openblas_clapack_info:
  NOT AVAILABLE
flame_info:
  NOT AVAILABLE
atlas_3_10_threads_info:
  NOT AVAILABLE
atlas_3_10_info:
  NOT AVAILABLE
atlas_threads_info:
  NOT AVAILABLE
atlas_info:
  NOT AVAILABLE
lapack_info:
  NOT AVAILABLE
lapack_src_info:
  NOT AVAILABLE
lapack_opt_info:
  NOT AVAILABLE
numpy_linalg_lapack_lite:
    language = c
    define_macros = [('HAVE_BLAS_ILP64', None), ('BLAS_SYMBOL_SUFFIX', '64_')]

NumPy w/ OpenBLAS (aarch64)

qiita@m1 ~ % python numpy_benchmark.py 
Dotted two 4096x4096 matrices in 0.95 s.
Dotted two vectors of length 524288 in 0.28 ms.
SVD of a 2048x1024 matrix in 2.49 s.
Cholesky decomposition of a 2048x2048 matrix in 0.11 s.
Eigendecomposition of a 2048x2048 matrix in 10.27 s.

This was obtained using the following Numpy configuration:
blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/opt/homebrew/opt/openblas/lib/']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/opt/homebrew/opt/openblas/lib/']
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/opt/homebrew/opt/openblas/lib/']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/opt/homebrew/opt/openblas/lib/']
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/opt/homebrew/opt/openblas/lib/']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/opt/homebrew/opt/openblas/lib/']
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/opt/homebrew/opt/openblas/lib/']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/opt/homebrew/opt/openblas/lib/']

NumPy w/ Intel MKL (x86_64)

Dotted two 4096x4096 matrices in 2.53 s.
Dotted two vectors of length 524288 in 0.08 ms.
SVD of a 2048x1024 matrix in 0.96 s.
Cholesky decomposition of a 2048x2048 matrix in 0.22 s.
Eigendecomposition of a 2048x2048 matrix in 8.16 s.

This was obtained using the following Numpy configuration:
blas_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/Users/qiita/miniconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/user/miniconda3/include']
blas_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/Users/qiita/miniconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/user/miniconda3/include']
lapack_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/Users/qiita/miniconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/user/miniconda3/include']
lapack_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/Users/qiita/miniconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/qiita/miniconda3/include']