@atksh

Speeding up NumPy matrix multiplication 300x on an M1 Mac (linking OpenBLAS)

Note: this article is for readers who want an M1-native Python 3.9 environment with fast linear algebra.

Prerequisites

This article assumes the following:

  • An M1 Mac
  • The Arm build of Homebrew
  • The Arm build of Python 3.9

Installation

The following commands build and install numpy/scipy from source, linked against OpenBLAS:

# Required to build numpy/scipy
% brew install openblas gfortran
% pip3 install cython pybind11
# Magic incantation: point the build at Homebrew's OpenBLAS
% export OPENBLAS="$(brew --prefix openblas)/lib/"
# Build from source
% pip3 install --no-binary :all: --no-use-pep517 numpy
# Optionally install scipy as well (note that this build takes quite a while)
% pip3 install --no-binary :all: --no-use-pep517 scipy
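
Once the build finishes, it is worth checking that NumPy actually picked up OpenBLAS. The following is a minimal sketch that is not part of the original article: `numpy.show_config()` is NumPy's own function for printing the build configuration, while the helper name `mentions_openblas` is hypothetical and introduced only for illustration. It captures the printed configuration and searches it for the string "openblas".

```python
import contextlib
import io

import numpy as np


def mentions_openblas() -> bool:
    """Return True if NumPy's printed build configuration mentions OpenBLAS.

    Hypothetical helper: it only does a case-insensitive text search on the
    output of numpy.show_config(), so treat it as a quick sanity check.
    """
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        np.show_config()  # prints the BLAS/LAPACK build configuration
    return "openblas" in buffer.getvalue().lower()


if __name__ == "__main__":
    print("NumPy version:", np.__version__)
    print("OpenBLAS found in build config:", mentions_openblas())
```

If this reports False after the install above, the build most likely did not find the library directory exported in OPENBLAS.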

Verification

I measured performance by running the following benchmark script:
https://gist.github.com/markus-beuckelmann/8bc25531b11158431a5b09a45abd6276
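
To give a feel for what such a benchmark measures, here is a minimal sketch of the matrix-product timing only. It is an illustration written for this article, not the code of the linked gist; the function name `time_matmul` and the default size of 512 are assumptions chosen to keep the run short.

```python
import time

import numpy as np


def time_matmul(n: int = 512, seed: int = 0) -> float:
    """Return the wall-clock seconds taken to multiply two n x n matrices."""
    rng = np.random.default_rng(seed)
    a = rng.random((n, n))
    b = rng.random((n, n))
    start = time.perf_counter()
    np.dot(a, b)  # uses the linked BLAS, if there is one, for float64 inputs
    return time.perf_counter() - start


if __name__ == "__main__":
    n = 512  # much smaller than the 4096 used in the article, to keep this quick
    print(f"Dotted two {n}x{n} matrices in {time_matmul(n):.4f} s.")
```

Running the same script under each Python environment gives a rough, single-shot comparison; repeating the measurement and taking the best time would reduce noise.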

Name | Python platform | BLAS / LAPACK | Matrix product (4096x4096) [s] | Dot product (1x524288) [ms] | SVD (2048x1024) [s] | Cholesky decomposition (2048x2048) [s] | Eigendecomposition (2048x2048) [s]
--- | --- | --- | --- | --- | --- | --- | ---
Pure NumPy | aarch64 (Homebrew) | - | 298.54 | 1.09 | 13.27 | 2.12 | 73.81
NumPy + OpenBLAS | aarch64 (Homebrew) | OpenBLAS | 0.95 | 0.28 | 2.49 | 0.11 | 10.27
NumPy + Intel MKL | intel (Miniconda) | Intel MKL | 2.53 | 0.08 | 0.96 | 0.22 | 8.16

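For the matrix product, linking OpenBLAS made NumPy roughly 300 times faster than pure NumPy (298.54 s down to 0.95 s). Still, I wonder why the native Arm build of OpenBLAS loses to Intel MKL running under Rosetta 2 emulation on the dot product, SVD, and eigendecomposition; it stays ahead only on the matrix product and the Cholesky decomposition.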
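
The comparison can be made concrete by computing ratios directly from the timings reported in the table. The sketch below only does arithmetic on those published numbers; the dictionary keys and variable names are hypothetical labels chosen for this example.

```python
# Timings copied from the benchmark table (seconds, except "dot" in ms).
pure_numpy = {"matmul": 298.54, "dot": 1.09, "svd": 13.27, "cholesky": 2.12, "eig": 73.81}
openblas = {"matmul": 0.95, "dot": 0.28, "svd": 2.49, "cholesky": 0.11, "eig": 10.27}
mkl_rosetta = {"matmul": 2.53, "dot": 0.08, "svd": 0.96, "cholesky": 0.22, "eig": 8.16}

# How many times faster OpenBLAS is than pure NumPy, per benchmark.
speedup = {name: pure_numpy[name] / openblas[name] for name in pure_numpy}

# Benchmarks where native OpenBLAS beats Intel MKL under Rosetta 2.
openblas_wins = [name for name in openblas if openblas[name] < mkl_rosetta[name]]

if __name__ == "__main__":
    for name, ratio in speedup.items():
        print(f"{name:>8}: OpenBLAS is {ratio:.1f}x faster than pure NumPy")
    print("OpenBLAS beats MKL under Rosetta 2 on:", openblas_wins)
```

The matrix product comes out at about 314x, which is where the "300 times" in the title comes from, and the smaller ratios for the other operations show that the gain depends heavily on the operation.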

Benchmark logs

Pure NumPy (aarch64)

Dotted two 4096x4096 matrices in 298.54 s.
Dotted two vectors of length 524288 in 1.09 ms.
SVD of a 2048x1024 matrix in 13.27 s.
Cholesky decomposition of a 2048x2048 matrix in 2.12 s.
Eigendecomposition of a 2048x2048 matrix in 73.81 s.

This was obtained using the following Numpy configuration:
blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
  NOT AVAILABLE
atlas_3_10_blas_threads_info:
  NOT AVAILABLE
atlas_3_10_blas_info:
  NOT AVAILABLE
atlas_blas_threads_info:
  NOT AVAILABLE
atlas_blas_info:
  NOT AVAILABLE
blas_info:
  NOT AVAILABLE
blas_src_info:
  NOT AVAILABLE
blas_opt_info:
  NOT AVAILABLE
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
  NOT AVAILABLE
openblas_clapack_info:
  NOT AVAILABLE
flame_info:
  NOT AVAILABLE
atlas_3_10_threads_info:
  NOT AVAILABLE
atlas_3_10_info:
  NOT AVAILABLE
atlas_threads_info:
  NOT AVAILABLE
atlas_info:
  NOT AVAILABLE
lapack_info:
  NOT AVAILABLE
lapack_src_info:
  NOT AVAILABLE
lapack_opt_info:
  NOT AVAILABLE
numpy_linalg_lapack_lite:
    language = c
    define_macros = [('HAVE_BLAS_ILP64', None), ('BLAS_SYMBOL_SUFFIX', '64_')]

NumPy w/ OpenBLAS (aarch64)

qiita@m1 ~ % python numpy_benchmark.py 
Dotted two 4096x4096 matrices in 0.95 s.
Dotted two vectors of length 524288 in 0.28 ms.
SVD of a 2048x1024 matrix in 2.49 s.
Cholesky decomposition of a 2048x2048 matrix in 0.11 s.
Eigendecomposition of a 2048x2048 matrix in 10.27 s.

This was obtained using the following Numpy configuration:
blas_mkl_info:
  NOT AVAILABLE
blis_info:
  NOT AVAILABLE
openblas_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/opt/homebrew/opt/openblas/lib/']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/opt/homebrew/opt/openblas/lib/']
blas_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/opt/homebrew/opt/openblas/lib/']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/opt/homebrew/opt/openblas/lib/']
lapack_mkl_info:
  NOT AVAILABLE
openblas_lapack_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/opt/homebrew/opt/openblas/lib/']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/opt/homebrew/opt/openblas/lib/']
lapack_opt_info:
    libraries = ['openblas', 'openblas']
    library_dirs = ['/opt/homebrew/opt/openblas/lib/']
    language = c
    define_macros = [('HAVE_CBLAS', None)]
    runtime_library_dirs = ['/opt/homebrew/opt/openblas/lib/']

NumPy w/ Intel MKL (x86_64)

Dotted two 4096x4096 matrices in 2.53 s.
Dotted two vectors of length 524288 in 0.08 ms.
SVD of a 2048x1024 matrix in 0.96 s.
Cholesky decomposition of a 2048x2048 matrix in 0.22 s.
Eigendecomposition of a 2048x2048 matrix in 8.16 s.

This was obtained using the following Numpy configuration:
blas_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/Users/qiita/miniconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/user/miniconda3/include']
blas_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/Users/qiita/miniconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/user/miniconda3/include']
lapack_mkl_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/Users/qiita/miniconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/user/miniconda3/include']
lapack_opt_info:
    libraries = ['mkl_rt', 'pthread']
    library_dirs = ['/Users/qiita/miniconda3/lib']
    define_macros = [('SCIPY_MKL_H', None), ('HAVE_CBLAS', None)]
    include_dirs = ['/Users/qiita/miniconda3/include']