Speech-to-text with Whisper.cpp (a port of the Whisper speech recognition model) on RHEL on Power10

Posted at 2024-03-18, last updated at 2024-03-21

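Introduction

While looking at various AI models recently, I came across a line that caught my attention in the GitHub README of "Whisper.cpp", a C/C++ port of Whisper, the open-source speech recognition model released by OpenAI: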

VSX intrinsics support for POWER architectures

Does this mean that the Vector Scalar Extension (VSX) of the Power ISA (Power Instruction Set Architecture) is supported?

I decided to try it.

Environment

IBM Power S1022 (CPU only)
RHEL 9.2
(with Internet access)
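To confirm that a host matches this environment before building, a small check like the following may help. It is a minimal sketch: the function name `is_ppc64le` is made up for illustration, and it only compares the machine string reported by `uname -m` (the build log later in this article shows `ppc64le` for this system).

```shell
# Return success when the given machine string (default: output of `uname -m`) is ppc64le.
# is_ppc64le is a hypothetical helper name, not part of whisper.cpp.
is_ppc64le() {
    arch="${1:-$(uname -m)}"
    [ "$arch" = "ppc64le" ]
}

if is_ppc64le; then
    echo "ppc64le host, same architecture as in this article"
else
    echo "not ppc64le ($(uname -m)): results will differ from this article"
fi
```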

Setup

・The RHEL 9 BaseOS and AppStream repositories are already configured.

# dnf repolist
Updating Subscription Management repositories.
repo id                                   repo name

rhel-9-for-ppc64le-appstream-rpms         Red Hat Enterprise Linux 9 for Power, little endian - AppStream (RPMs)
rhel-9-for-ppc64le-baseos-rpms            Red Hat Enterprise Linux 9 for Power, little endian - BaseOS (RPMs)

① Install git, make, gcc, and gcc-c++

# dnf install git make gcc gcc-c++

(The log is long, so it is omitted.)

Including dependencies, 28 packages were installed and 3 packages were upgraded.
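Before moving on to the build, it can be worth confirming that the tools are actually on PATH. The sketch below assumes the gcc-c++ package provides the `g++` command (as the build log in this article suggests); `missing_tools` is a hypothetical helper name.

```shell
# Print the name of every command in the argument list that is not found on PATH.
# missing_tools is a hypothetical helper name used only for this illustration.
missing_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || echo "$tool"
    done
}

missing="$(missing_tools git make gcc g++)"
if [ -n "$missing" ]; then
    echo "missing build tools:" $missing
else
    echo "all build tools found"
fi
```

If anything is reported as missing, rerunning the dnf install command above is the first thing to try.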

② Set up Whisper.cpp

Follow the Quick start steps in the Whisper.cpp README.

・Clone the repository

# git clone https://github.com/ggerganov/whisper.cpp.git

Cloning into 'whisper.cpp'...
remote: Enumerating objects: 7595, done.
remote: Counting objects: 100% (1827/1827), done.
remote: Compressing objects: 100% (151/151), done.
remote: Total 7595 (delta 1716), reused 1711 (delta 1675), pack-reused 5768
Receiving objects: 100% (7595/7595), 11.64 MiB | 34.24 MiB/s, done.
Resolving deltas: 100% (4953/4953), done.

Change into the cloned directory

# cd whisper.cpp

③ Run make

# make

which: no nvcc in (/root/.local/bin:/root/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin)
I whisper.cpp build info:
I UNAME_S:  Linux
I UNAME_P:  ppc64le
I UNAME_M:  ppc64le
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC 
 -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -pthread
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC
 -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -pthread
I LDFLAGS:
I CC:       cc (GCC) 11.4.1 20230605 (Red Hat 11.4.1-2)
I CXX:      g++ (GCC) 11.4.1 20230605 (Red Hat 11.4.1-2)
cc  -I.              -O3 -DNDEBUG -std=c11   -fPIC -D_XOPEN_SOURCE=600
 -D_GNU_SOURCE -pthread   -c ggml.c -o ggml.o
cc  -I.              -O3 -DNDEBUG -std=c11   -fPIC -D_XOPEN_SOURCE=600
 -D_GNU_SOURCE -pthread   -c ggml-alloc.c -o ggml-alloc.o

(The log is long, so it is omitted.)

g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -D_XOPEN_SOURCE=600 
-D_GNU_SOURCE -pthread examples/server/server.cpp examples/common.cpp 
examples/common-ggml.cpp ggml.o ggml-alloc.o ggml-backend.o ggml-quants.o 
whisper.o -o server
examples/server/server.cpp: In lambda function:
examples/server/server.cpp:601:51: note: the layout of aggregates containing 
vectors with 2-byte alignment has changed in GCC 5
  601 |     svr.Post(sparams.request_path + "/inference", [&]
  (const Request &req, Response &res){
      |                                                   ^
# echo $?
0

Running make took about 1 minute.

④ Download the base model

Download source: https://huggingface.co/ggerganov/whisper.cpp

# bash ./models/download-ggml-model.sh base

Downloading ggml model base from 'https://huggingface.co/ggerganov/whisper.cpp' ...
  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  1200  100  1200    0     0  18461      0 --:--:-- --:--:-- --:--:-- 18461
100  141M  100  141M    0     0   122M      0  0:00:01  0:00:01 --:--:--  136M
Done! Model 'base' saved in '/work/whisper.cpp/models/ggml-base.bin'
You can now use it like this:
$ ./main -m /work/whisper.cpp/models/ggml-base.bin -f samples/jfk.wav

⑤ Transcribe the sample

Run speech-to-text on the bundled sample, samples/jfk.wav.

# ./main -m /work/whisper.cpp/models/ggml-base.bin -f samples/jfk.wav
whisper_init_from_file_with_params_no_state: loading model from '/work/whisper.cpp/models/ggml-base.bin'
whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2 (base)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs       = 99
whisper_model_load:      CPU total size =   147.37 MB
whisper_model_load: model size    =  147.37 MB
whisper_init_state: kv self size  =   16.52 MB
whisper_init_state: kv cross size =   18.43 MB
whisper_init_state: compute buffer (conv)   =   16.39 MB
whisper_init_state: compute buffer (encode) =  132.07 MB
whisper_init_state: compute buffer (cross)  =    4.78 MB
whisper_init_state: compute buffer (decode) =   96.48 MB

system_info: n_threads = 4 / 8 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 0 |
ARM_FMA = 0 | METAL = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 
| SSSE3 = 0 | VSX = 1 | CUDA = 0 | COREML = 0 | OPENVINO = 0

main: processing 'samples/jfk.wav' (176000 samples, 11.0 sec), 4 threads, 1 processors, 
5 beams + best of 5, lang = en, task = transcribe, timestamps = 1 ...

[00:00:00.000 --> 00:00:08.000]   And so, my fellow Americans, ask not what your 
country can do for you,
[00:00:08.000 --> 00:00:11.000]   ask what you can do for your country.

whisper_print_timings:     load time =    42.54 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =    17.87 ms
whisper_print_timings:   sample time =    72.48 ms /   139 runs (    0.52 ms per run)
whisper_print_timings:   encode time =  3761.49 ms /     1 runs ( 3761.49 ms per run)
whisper_print_timings:   decode time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   batchd time =   612.94 ms /   137 runs (    4.47 ms per run)
whisper_print_timings:   prompt time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:    total time =  4509.29 ms
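One way to read the timings above is as a real-time factor: processing time divided by audio length, where a value below 1.0 means faster than real time. The sketch below uses the total time (4509.29 ms) and the audio length (11.0 sec) from this log; `rtf` is a hypothetical helper name, and the total appears to include model loading, so it slightly overstates the pure inference cost.

```shell
# Real-time factor = processing time / audio duration (below 1.0 is faster than real time).
# rtf is a hypothetical helper name; $1 = total time in ms, $2 = audio length in seconds.
rtf() {
    awk -v ms="$1" -v sec="$2" 'BEGIN { printf "%.2f\n", (ms / 1000) / sec }'
}

rtf 4509.29 11.0   # values taken from the log above; prints 0.41
```

With these numbers the 11-second clip was processed in roughly 41% of its playback time.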

It worked without any problems!
This run used an allocation of 1 CPU core and 16 GB of memory.

The transcription result is " And so, my fellow Americans, ask not what your country can do for you, ask what you can do for your country."

The system_info: line shows VSX = 1, so the build appears to recognize the Vector Scalar Extension (VSX).
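Checking the flag by eye works, but it can also be pulled out of the system_info line with standard tools. This is a minimal sketch that assumes the "NAME = value | NAME = value" layout seen in the log above; `feature_flag` is a hypothetical helper name, and the sample line is shortened for readability.

```shell
# Print the value of one feature flag (e.g. VSX) from a system_info line.
# feature_flag is a hypothetical helper; $1 = flag name, $2 = text with "NAME = value | ..." fields.
feature_flag() {
    printf '%s\n' "$2" | tr '|' '\n' | sed -n "s/^ *$1 = \([0-9]*\) *\$/\1/p"
}

# Shortened copy of the system_info line from the log above.
line='system_info: n_threads = 4 / 8 | AVX = 0 | NEON = 0 | SSSE3 = 0 | VSX = 1 | CUDA = 0'
feature_flag VSX "$line"   # prints 1
```

Splitting on `|` first and anchoring the match at the start of each field keeps a query for SSE3 from accidentally matching SSSE3.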

Running the benchmark

I found that a benchmark procedure is also published, so I set it up and tried it.

Check the bench binary

# ls -l bench
-rwxr-xr-x. 1 root root 1373688 Mar 17 07:16 bench

Run make bench (the binary was already built by the earlier make, so nothing is rebuilt)

# make bench
which: no nvcc in (/root/.local/bin:/root/bin:/usr/local/sbin:/usr/local/bin:/usr/sbin:/usr/bin)
I whisper.cpp build info:
I UNAME_S:  Linux
I UNAME_P:  ppc64le
I UNAME_M:  ppc64le
I CFLAGS:   -I.              -O3 -DNDEBUG -std=c11   -fPIC -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -pthread
I CXXFLAGS: -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -D_XOPEN_SOURCE=600 -D_GNU_SOURCE -pthread
I LDFLAGS:
I CC:       cc (GCC) 11.4.1 20230605 (Red Hat 11.4.1-2)
I CXX:      g++ (GCC) 11.4.1 20230605 (Red Hat 11.4.1-2)

make: 'bench' is up to date.

# echo $?
0

Run the benchmark.

This run uses the base model with 8 threads.
The LPAR is allocated 2 CPU cores and 64 GB of memory.

# ./bench -m ./models/ggml-base.bin -t 8
whisper_init_from_file_with_params_no_state: loading model from './models/ggml-base.bin'

whisper_model_load: loading model
whisper_model_load: n_vocab       = 51865
whisper_model_load: n_audio_ctx   = 1500
whisper_model_load: n_audio_state = 512
whisper_model_load: n_audio_head  = 8
whisper_model_load: n_audio_layer = 6
whisper_model_load: n_text_ctx    = 448
whisper_model_load: n_text_state  = 512
whisper_model_load: n_text_head   = 8
whisper_model_load: n_text_layer  = 6
whisper_model_load: n_mels        = 80
whisper_model_load: ftype         = 1
whisper_model_load: qntvr         = 0
whisper_model_load: type          = 2 (base)
whisper_model_load: adding 1608 extra tokens
whisper_model_load: n_langs       = 99
whisper_model_load:      CPU total size =   147.37 MB
whisper_model_load: model size    =  147.37 MB
whisper_init_state: kv self size  =   16.52 MB
whisper_init_state: kv cross size =   18.43 MB
whisper_init_state: compute buffer (conv)   =   16.39 MB
whisper_init_state: compute buffer (encode) =  132.07 MB
whisper_init_state: compute buffer (cross)  =    4.78 MB
whisper_init_state: compute buffer (decode) =   96.48 MB

system_info: n_threads = 8 / 16 | AVX = 0 | AVX2 = 0 | AVX512 = 0 | FMA = 0 | NEON = 0 | ARM_FMA = 0 | METAL = 0 | F16C = 0 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 0 | SSSE3 = 0 | VSX = 1 | CUDA = 0 | COREML = 0 | OPENVINO = 0

whisper_print_timings:     load time =    43.40 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:      mel time =     0.00 ms
whisper_print_timings:   sample time =     0.00 ms /     1 runs (    0.00 ms per run)
whisper_print_timings:   encode time =  1992.42 ms /     1 runs ( 1992.42 ms per run)
whisper_print_timings:   decode time =  1146.21 ms /   256 runs (    4.48 ms per run)
whisper_print_timings:   batchd time =   819.36 ms /   320 runs (    2.56 ms per run)
whisper_print_timings:   prompt time =  9071.34 ms /  4096 runs (    2.21 ms per run)
whisper_print_timings:    total time = 13030.68 ms

If you wish, you can submit these results here:

  https://github.com/ggerganov/whisper.cpp/issues/89

Please include the following information:

  - CPU model
  - Operating system
  - Compiler

The result is load time = 43.40 ms and encode time = 1992.42 ms.
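When comparing several runs, it is convenient to reduce the log to name and value pairs. The sketch below assumes the `whisper_print_timings:` line layout shown in the log above; `timings` is a hypothetical helper name, and the sample lines are copied from this benchmark run.

```shell
# Print "<name> <milliseconds>" for every whisper_print_timings timing line read from stdin.
# timings is a hypothetical helper name; lines without "<name> time = <n> ms" are skipped.
timings() {
    sed -n 's/^whisper_print_timings: *\([a-z]*\) time = *\([0-9.]*\) ms.*$/\1 \2/p'
}

# Sample lines copied from the benchmark log above.
timings <<'EOF'
whisper_print_timings:     load time =    43.40 ms
whisper_print_timings:     fallbacks =   0 p /   0 h
whisper_print_timings:   encode time =  1992.42 ms /     1 runs ( 1992.42 ms per run)
whisper_print_timings:    total time = 13030.68 ms
EOF
```

Against a real run, the same function could be fed with `./bench -m ./models/ggml-base.bin -t 8 2>&1 | timings`, where `2>&1` is there so the lines are captured whichever stream they are written to.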

Benchmark result #89

Compared with the other results published in the issue above, the encode time may not be especially fast (though it is not very slow either), while the load time seems fast compared with similar runs.

Conclusion

I expected to struggle to get it running, but I am glad it ran smoothly on IBM Power10.

Although not shown above, it tended to get faster as more CPUs were added.
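The measurements behind that observation are not included in this article. As a rough way to explore something similar, the sketch below runs the benchmark at several thread counts using the `-m` and `-t` options shown earlier. Note the assumptions: `sweep_threads` and the `BENCH` and `MODEL` variables are hypothetical names, and varying threads is only a stand-in for varying the cores allocated to the LPAR, which has to be changed outside the script.

```shell
# Run the benchmark once per thread count and print the encode timing line for each run.
# sweep_threads, BENCH and MODEL are hypothetical names; the defaults follow the paths in this article.
sweep_threads() {
    for t in "$@"; do
        echo "threads=$t"
        "${BENCH:-./bench}" -m "${MODEL:-./models/ggml-base.bin}" -t "$t" 2>&1 \
            | grep 'encode time' || echo "(no encode timing found)"
    done
}

# Example, to be run from the whisper.cpp directory:
#   sweep_threads 1 2 4 8
```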

For now, this is only a check that it runs.

That is all.
