Running Llama3-8B-1.58-100B-tokens with Microsoft BitNet

Posted at 2024-11-09

These are my notes from trying out Llama3-8B-1.58-100B-tokens with the Microsoft BitNet repository.

Environment

Ubuntu version

$ lsb_release -a
No LSB modules are available.
Distributor ID: Ubuntu
Description:    Ubuntu 24.04 LTS
Release:        24.04
Codename:       noble

CPU information

$ cat /proc/cpuinfo | tail -28
processor       : 19
vendor_id       : GenuineIntel
cpu family      : 6
model           : 186
model name      : 13th Gen Intel(R) Core(TM) i9-13900H
stepping        : 2
microcode       : 0x4122
cpu MHz         : 850.731
cache size      : 24576 KB
physical id     : 0
siblings        : 20
core id         : 31
cpu cores       : 14
apicid          : 62
initial apicid  : 62
fpu             : yes
fpu_exception   : yes
cpuid level     : 32
wp              : yes
flags           : fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf tsc_known_freq pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid rdseed adx smap clflushopt clwb intel_pt sha_ni xsaveopt xsavec xgetbv1 xsaves split_lock_detect user_shstk avx_vnni dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp hwp_pkg_req hfi vnmi umip pku ospke waitpkg gfni vaes vpclmulqdq tme rdpid movdiri movdir64b fsrm md_clear serialize pconfig arch_lbr ibt flush_l1d arch_capabilities
vmx flags       : vnmi preemption_timer posted_intr invvpid ept_x_only ept_ad ept_1gb flexpriority apicv tsc_offset vtpr mtf vapic ept vpid unrestricted_guest vapic_reg vid ple shadow_vmcs ept_mode_based_exec tsc_scaling usr_wait_pause
bugs            : spectre_v1 spectre_v2 spec_store_bypass swapgs eibrs_pbrsb rfds bhi
bogomips        : 5990.40
clflush size    : 64
cache_alignment : 64
address sizes   : 46 bits physical, 48 bits virtual
power management:

Memory information

$ cat /proc/meminfo | head -5
MemTotal:       65568740 kB
MemFree:        21366936 kB
MemAvailable:   64096520 kB
Buffers:           76220 kB
Cached:         42110264 kB

Installing the development environment

Install g++ and cmake with apt, then install Miniconda. During the Miniconda installation I also run the shell init step, since I use bash, and then reload .bashrc.

sudo apt install g++ cmake
curl https://repo.anaconda.com/miniconda/Miniconda3-latest-Linux-x86_64.sh -o miniconda.sh
sh miniconda.sh
source ~/.bashrc
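
As a sanity check before building, you can confirm the toolchain is in place (these are generic commands, not a step from the BitNet README):

g++ --version     # compiler used to build the C++ inference code
cmake --version   # build system invoked when setting up BitNet
conda --version   # confirms Miniconda is on PATH after reloading .bashrc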

Cloning BitNet

Clone the BitNet repository.

git clone --recursive https://github.com/microsoft/BitNet.git
cd BitNet

Creating the Python environment

Create a Python 3.9 environment with conda and install the requirements.

conda create -n bitnet-cpp python=3.9
conda activate bitnet-cpp
pip install -r requirements.txt
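
Just as a quick check (not a required step), you can make sure the new environment's interpreter is the one in use:

python --version   # should report Python 3.9.x inside the bitnet-cpp env
which python       # should point into the conda envs/bitnet-cpp directory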

Downloading and setting up the model

Download the model from Hugging Face and set up the environment.

huggingface-cli download HF1BitLLM/Llama3-8B-1.58-100B-tokens --local-dir models/Llama3-8B-1.58-100B-tokens
python setup_env.py -md models/Llama3-8B-1.58-100B-tokens -q i2_s
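
setup_env.py converts the Hugging Face checkpoint and quantizes it to the i2_s format; if it finishes without errors, a GGUF file should appear in the model directory. A quick check, assuming the same paths as above:

ls -lh models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf
# for this 8B ternary model the file is roughly 3.6 GiB (matching the model size reported in the log below)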

Running inference with the sample prompt

Run inference with the following command, the same as in the sample:

python run_inference.py -m models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf -p "Daniel went back to the the the garden. Mary travelled to the kitchen. Sandra journeyed to the kitchen. Sandra went to the hallway. John went to the bedroom. Mary went back to the garden. Where is Mary?\nAnswer:" -n 6 -temp 0 -t 6
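
My reading of the flags (run_inference.py appears to simply forward them to the compiled llama.cpp binary, so treat these descriptions as assumptions rather than official documentation):

# -m     path to the quantized GGUF model
# -p     prompt text
# -n     number of tokens to generate
# -temp  sampling temperature (0 = greedy, deterministic decoding)
# -t     number of CPU threads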

An excerpt from the output:

warning: not compiled with GPU offload support, --gpu-layers option will be ignored
warning: see main README.md for information on enabling GPU BLAS support
build: 3947 (406a5036) with cc (Ubuntu 13.2.0-23ubuntu4) 13.2.0 for x86_64-linux-gnu
main: llama backend init
main: load the model and apply lora adapter, if any
llama_model_loader: loaded meta data with 22 key-value pairs and 291 tensors from models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf (version GGUF V3 (latest))
llama_model_loader: Dumping metadata keys/values. Note: KV overrides do not apply in this output.
llm_load_vocab: special tokens cache size = 256
llm_load_vocab: token to piece cache size = 0.8000 MB
llm_load_print_meta: format           = GGUF V3 (latest)
llm_load_print_meta: arch             = llama
llm_load_print_meta: vocab type       = BPE
llm_load_print_meta: n_vocab          = 128256
llm_load_print_meta: n_merges         = 280147
llm_load_print_meta: vocab_only       = 0
llm_load_print_meta: n_ctx_train      = 8192
llm_load_print_meta: n_embd           = 4096
llm_load_print_meta: n_layer          = 32
llm_load_print_meta: n_head           = 32
llm_load_print_meta: n_head_kv        = 8
llm_load_print_meta: n_rot            = 128
llm_load_print_meta: n_swa            = 0
llm_load_print_meta: n_embd_head_k    = 128
llm_load_print_meta: n_embd_head_v    = 128
llm_load_print_meta: n_gqa            = 4
llm_load_print_meta: n_embd_k_gqa     = 1024
llm_load_print_meta: n_embd_v_gqa     = 1024
llm_load_print_meta: f_norm_eps       = 0.0e+00
llm_load_print_meta: f_norm_rms_eps   = 1.0e-05
llm_load_print_meta: f_clamp_kqv      = 0.0e+00
llm_load_print_meta: f_max_alibi_bias = 0.0e+00
llm_load_print_meta: f_logit_scale    = 0.0e+00
llm_load_print_meta: n_ff             = 14336
llm_load_print_meta: n_expert         = 0
llm_load_print_meta: n_expert_used    = 0
llm_load_print_meta: causal attn      = 1
llm_load_print_meta: pooling type     = 0
llm_load_print_meta: rope type        = 0
llm_load_print_meta: rope scaling     = linear
llm_load_print_meta: freq_base_train  = 500000.0
llm_load_print_meta: freq_scale_train = 1
llm_load_print_meta: n_ctx_orig_yarn  = 8192
llm_load_print_meta: rope_finetuned   = unknown
llm_load_print_meta: ssm_d_conv       = 0
llm_load_print_meta: ssm_d_inner      = 0
llm_load_print_meta: ssm_d_state      = 0
llm_load_print_meta: ssm_dt_rank      = 0
llm_load_print_meta: ssm_dt_b_c_rms   = 0
llm_load_print_meta: model type       = 8B
llm_load_print_meta: model ftype      = I2_S - 2 bpw ternary
llm_load_print_meta: model params     = 8.03 B
llm_load_print_meta: model size       = 3.58 GiB (3.83 BPW)
llm_load_print_meta: general.name     = Llama3-8B-1.58-100B-tokens
llm_load_print_meta: BOS token        = 128000 '<|begin_of_text|>'
llm_load_print_meta: EOS token        = 128009 '<|eot_id|>'
llm_load_print_meta: EOT token        = 128009 '<|eot_id|>'
llm_load_print_meta: LF token         = 128 'Ä'
llm_load_print_meta: EOG token        = 128009 '<|eot_id|>'
llm_load_print_meta: max token length = 256
llm_load_tensors: ggml ctx size =    0.14 MiB
llm_load_tensors:        CPU buffer size =  3669.02 MiB
................................................
llama_new_context_with_model: n_batch is less than GGML_KQ_MASK_PAD - increasing to 32
llama_new_context_with_model: n_ctx      = 2048
llama_new_context_with_model: n_batch    = 32
llama_new_context_with_model: n_ubatch   = 32
llama_new_context_with_model: flash_attn = 0
llama_new_context_with_model: freq_base  = 500000.0
llama_new_context_with_model: freq_scale = 1
llama_kv_cache_init:        CPU KV buffer size =   256.00 MiB
llama_new_context_with_model: KV self size  =  256.00 MiB, K (f16):  128.00 MiB, V (f16):  128.00 MiB
llama_new_context_with_model:        CPU  output buffer size =     0.49 MiB
llama_new_context_with_model:        CPU compute buffer size =    16.16 MiB
llama_new_context_with_model: graph nodes  = 1030
llama_new_context_with_model: graph splits = 1
common_init_from_params: warming up the model with an empty run - please wait ... (--no-warmup to disable)
main: llama threadpool init, n_threads = 6

system_info: n_threads = 6 (n_threads_batch = 6) / 20 | AVX = 1 | AVX_VNNI = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | AVX512_BF16 = 0 | FMA = 1 | NEON = 0 | SVE = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | RISCV_VECT = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | SSSE3 = 1 | VSX = 0 | MATMUL_INT8 = 0 | LLAMAFILE = 1 |

sampler seed: 4294967295
sampler params:
        repeat_last_n = 64, repeat_penalty = 1.000, frequency_penalty = 0.000, presence_penalty = 0.000
        top_k = 40, tfs_z = 1.000, top_p = 0.950, min_p = 0.050, typical_p = 1.000, temp = 0.000
        mirostat = 0, mirostat_lr = 0.100, mirostat_ent = 5.000
sampler chain: logits -> logit-bias -> penalties -> greedy
generate: n_ctx = 2048, n_batch = 1, n_predict = 6, n_keep = 1

Daniel went back to the the the garden. Mary travelled to the kitchen. Sandra journeyed to the kitchen. Sandra went to the hallway. John went to the bedroom. Mary went back to the garden. Where is Mary?
Answer: Mary is in the garden.


llama_perf_sampler_print:    sampling time =       0.36 ms /    54 runs   (    0.01 ms per token, 149584.49 tokens per second)
llama_perf_context_print:        load time =     549.12 ms
llama_perf_context_print: prompt eval time =    2453.06 ms /    48 tokens (   51.11 ms per token,    19.57 tokens per second)
llama_perf_context_print:        eval time =     244.77 ms /     5 runs   (   48.95 ms per token,    20.43 tokens per second)
llama_perf_context_print:       total time =    2699.19 ms /    53 tokens

In response to the question "Daniel went back to the garden. Mary travelled to the kitchen. Where is Mary?", the model answers "Mary is in the garden." I ran it with 6 threads and it generated about 20 tokens per second. Fast!
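
For a more systematic throughput measurement than eyeballing the perf lines, the repository also ships a benchmark script; the token and prompt counts below are arbitrary values I chose, so adjust as needed:

python utils/e2e_benchmark.py -m models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf -n 128 -p 256 -t 6
# -n: tokens to generate, -p: prompt length in tokens, -t: threads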

Inferring a tale of an old man and an old woman

Hoping for a Momotaro-style story, I have the model continue the passage.

python run_inference.py -m models/Llama3-8B-1.58-100B-tokens/ggml-model-i2_s.gguf -p "An old fairy tale.\n\nOnce upon a time, there lived an old man and an old woman." -n 100 -temp 0 -t 8

Generated text

An old fairy tale.

Once upon a time, there lived an old man and an old woman. They had no children, and they were very sad about it. One day, the old man and the old woman went to the forest to gather wood. They were very tired, and they decided to take a nap. When they woke up, they were surprised to see that they had a baby. The baby was crying, and the old man and the old woman were very happy. They took care of the baby, and they loved it very much. The baby grew up, and it was a


Closing thoughts

Being able to run inference this fast on a CPU is impressive!
I wonder if a baby appearing is a standard trope in stories about an old man and an old woman.
