
[M1 Mac] How to link OpenBLAS to numpy on Arm Python 3.9 and speed up matrix products 300x

Last updated at 2022-01-21. Posted at 2021-06-09.

Note: this article is for readers who want an M1-native Python 3.9 environment that handles linear algebra comfortably. Link numpy/scipy against OpenBLAS to get a faster computing environment.

Prerequisites

This article assumes the following:

M1 Mac
Arm build of Homebrew
Arm build of Python 3.9

Installation with OpenBLAS

These commands build and install numpy/scipy from source, linked against OpenBLAS.

# Required to build numpy/scipy
% brew install openblas gfortran
% pip3 install cython pybind11
# Tell the numpy/scipy build where Homebrew's OpenBLAS lives
% export OPENBLAS="$(brew --prefix openblas)/lib/"
# Build from source
% pip3 install --no-binary :all: --no-use-pep517 numpy
# Optionally scipy as well (note that this build takes quite a while)
% pip3 install --no-binary :all: --no-use-pep517 scipy
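
After the build, it may be worth confirming that numpy really picked up OpenBLAS. The sketch below captures the text printed by `np.show_config()` and checks whether a section such as `openblas_info` is populated. The helper names `numpy_build_config` and `section_available` are hypothetical, and the parser assumes the `section:` / `NOT AVAILABLE` layout shown in the logs later in this article; newer numpy releases print a different format, in which case simply read the printed text.

```python
import contextlib
import io

import numpy as np


def numpy_build_config() -> str:
    """Return the text that np.show_config() prints to stdout."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        np.show_config()
    return buffer.getvalue()


def section_available(config_text: str, section: str) -> bool:
    """Check whether `section` is followed by settings instead of NOT AVAILABLE.

    Assumes the older layout where each section header ends with a colon
    and the next line is either "NOT AVAILABLE" or the first setting.
    """
    lines = config_text.splitlines()
    for i, line in enumerate(lines):
        if line.strip() == f"{section}:":
            following = lines[i + 1].strip() if i + 1 < len(lines) else ""
            return following != "" and following != "NOT AVAILABLE"
    return False  # section header not found at all


if __name__ == "__main__":
    text = numpy_build_config()
    print(text)
    print("openblas_info populated:", section_available(text, "openblas_info"))
```

A plain substring search for "openblas" would not be enough here, because an unlinked build still prints the `openblas_info:` header followed by `NOT AVAILABLE`.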

Benchmark results

I measured performance by running the following benchmark script:
▼ https://gist.github.com/markus-beuckelmann/8bc25531b11158431a5b09a45abd6276
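
For a quick check without downloading the gist, a matrix product can be timed roughly as follows. This is a minimal sketch and not the linked script: the function name `time_matmul`, the default size of 1024, and the best-of-three policy are all assumptions made for illustration.

```python
import time

import numpy as np


def time_matmul(n: int = 1024, repeats: int = 3, seed: int = 0) -> float:
    """Return the best wall-clock time in seconds for one n x n matrix product."""
    rng = np.random.default_rng(seed)
    a = rng.random((n, n))
    b = rng.random((n, n))
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        a @ b  # dispatched to the linked BLAS when one is available
        best = min(best, time.perf_counter() - start)
    return best


if __name__ == "__main__":
    print(f"1024x1024 matrix product: {time_matmul():.3f} s")
```

Passing `n=4096` matches the matrix size used in the article's measurements, but expect it to take minutes on a build without an optimized BLAS.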

Name	Python Platform	BLAS / LAPACK	Matrix product (4096x4096) [sec]	Dot product (1x524288) [ms]	SVD (2048x1024) [sec]	Cholesky decomposition (2048x2048) [sec]	Eigendecomposition (2048x2048) [sec]
Pure NumPy	aarch64 (Homebrew)	-	298.54	1.09	13.27	2.12	73.81
NumPy + OpenBLAS	aarch64 (Homebrew)	OpenBLAS	0.95	0.28	2.49	0.11	10.27
NumPy + Intel MKL	intel (Miniconda)	Intel MKL	2.53	0.08	0.96	0.22	8.16

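For the matrix product, OpenBLAS is about 300 times faster than pure numpy. Still, native Arm OpenBLAS loses to Intel MKL emulated under Rosetta 2 on the dot product, SVD, and eigendecomposition, and I do not know why. (AVX512 is not available under Rosetta 2, so MKL should not be able to run at full power.)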
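
The ratios behind these statements can be recomputed from the table above. The snippet below is only arithmetic on the reported timings; the dictionary keys and the `speedup` helper are names chosen here for illustration.

```python
# Timings copied from the benchmark table (seconds, except dot_ms in milliseconds).
PURE = {"matmul_s": 298.54, "dot_ms": 1.09, "svd_s": 13.27, "cholesky_s": 2.12, "eig_s": 73.81}
OPENBLAS = {"matmul_s": 0.95, "dot_ms": 0.28, "svd_s": 2.49, "cholesky_s": 0.11, "eig_s": 10.27}
MKL = {"matmul_s": 2.53, "dot_ms": 0.08, "svd_s": 0.96, "cholesky_s": 0.22, "eig_s": 8.16}


def speedup(baseline: dict, candidate: dict) -> dict:
    """Return baseline time divided by candidate time for each benchmark.

    A value above 1 means the candidate is faster than the baseline.
    """
    return {name: baseline[name] / candidate[name] for name in baseline}


if __name__ == "__main__":
    for name, ratio in speedup(PURE, OPENBLAS).items():
        print(f"OpenBLAS vs pure numpy, {name}: {ratio:.1f}x")
    for name, ratio in speedup(MKL, OPENBLAS).items():
        print(f"OpenBLAS vs MKL under Rosetta 2, {name}: {ratio:.2f}x")
```

In the second comparison, a ratio below 1 marks a benchmark where MKL under Rosetta 2 is faster than native OpenBLAS.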

Benchmark logs

Pure NumPy (aarch64)

Dotted two 4096x4096 matrices in 298.54 s.
Dotted two vectors of length 524288 in 1.09 ms.
SVD of a 2048x1024 matrix in 13.27 s.
Cholesky decomposition of a 2048x2048 matrix in 2.12 s.
Eigendecomposition of a 2048x2048 matrix in 73.81 s.

This was obtained using the following Numpy configuration:
blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
  NOT AVAILABLE
atlas_3_10_blas_threads_info:
  NOT AVAILABLE
atlas_3_10_blas_info:
  NOT AVAILABLE
atlas_blas_threads_info:
  NOT AVAILABLE
atlas_blas_info:
  NOT AVAILABLE
blas_info:
  NOT AVAILABLE
blas_src_info:
  NOT AVAILABLE
blas_opt_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
  NOT AVAILABLE
openblas_clapack_info:
  NOT AVAILABLE
flame_info:
  NOT AVAILABLE
atlas_3_10_threads_info:
  NOT AVAILABLE
atlas_3_10_info:
  NOT AVAILABLE
atlas_threads_info:
  NOT AVAILABLE
atlas_info:
  NOT AVAILABLE
lapack_info:
  NOT AVAILABLE
lapack_src_info:
  NOT AVAILABLE
lapack_opt_info:
  NOT AVAILABLE
numpy_linalg_lapack_lite:
    language = c
    define_macros = [('HAVE_BLAS_ILP64', None), ('BLAS_SYMBOL_SUFFIX', '64_')]

NumPy w/ OpenBLAS (aarch64)

qiita@m1 ~ % python numpy_benchmark.py 
Dotted two 4096x4096 matrices in 0.95 s.
Dotted two vectors of length 524288 in 0.28 ms.
SVD of a 2048x1024 matrix in 2.49 s.
Cholesky decomposition of a 2048x2048 matrix in 0.11 s.
Eigendecomposition of a 2048x2048 matrix in 10.27 s.

This was obtained using the following Numpy configuration:
blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/opt/homebrew/opt/openblas/lib/']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/opt/homebrew/opt/openblas/lib/']
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/opt/homebrew/opt/openblas/lib/']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/opt/homebrew/opt/openblas/lib/']
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/opt/homebrew/opt/openblas/lib/']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/opt/homebrew/opt/openblas/lib/']
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/opt/homebrew/opt/openblas/lib/']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/opt/homebrew/opt/openblas/lib/']

NumPy w/ Intel MKL (x86_64)

Dotted two 4096x4096 matrices in 2.53 s.
Dotted two vectors of length 524288 in 0.08 ms.
SVD of a 2048x1024 matrix in 0.96 s.
Cholesky decomposition of a 2048x2048 matrix in 0.22 s.
Eigendecomposition of a 2048x2048 matrix in 8.16 s.

This was obtained using the following Numpy configuration:
blas_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/Users/qiita/miniconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/user/miniconda3/include']
blas_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/Users/qiita/miniconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/user/miniconda3/include']
lapack_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/Users/qiita/miniconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/user/miniconda3/include']
lapack_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/Users/qiita/miniconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/qiita/miniconda3/include']
