Running the Full DeepSeek R1 (671B) Model Locally

DeepSeekR1

Posted at 2025-02-02, last updated at 2025-02-09

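Introduction

The full "DeepSeek R1" model available through the official website and API is a huge 671B-parameter LLM, unlike the separately released small distilled models (1.5B-70B). A site called ApX Machine Learning recommends the following specs for running the full model.

Full model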

| Model | Parameters | VRAM Requirement | Recommended GPU |
|---|---|---|---|
| DeepSeek-R1-Zero | 671B | ~1,342 GB | Multi-GPU setup (e.g., NVIDIA A100 80GB x16) |
| DeepSeek-R1 | 671B | ~1,342 GB | Multi-GPU setup (e.g., NVIDIA A100 80GB x16) |

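DeepSeek-R1-Zero was trained with large-scale reinforcement learning (RL) without supervised fine-tuning and showed remarkable reasoning performance. Powerful as it was, it struggled with issues such as repetition and poor readability. DeepSeek-R1 addresses these issues by incorporating cold-start data before RL, improving performance across math, code, and reasoning tasks.

Both DeepSeek-R1-Zero and DeepSeek-R1 show state-of-the-art capability but require substantial hardware. Quantization and distributed GPU setups make it possible to handle the enormous parameter count.

Distributed GPU setups are required for the larger models: DeepSeek-R1-Zero and DeepSeek-R1 need so much VRAM that a distributed GPU setup (for example, NVIDIA A100 or H100 cards in a multi-GPU configuration) is essential for efficient operation.

That is 1,342 GB of VRAM, or sixteen NVIDIA A100 80GB cards: a staggering amount of hardware.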
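The 1,342 GB figure is what you get from storing each of the 671B parameters in 16 bits (2 bytes). Here is a minimal sketch of that arithmetic, assuming weights only (no KV cache, activations, or runtime overhead) and decimal gigabytes. The function name is made up for illustration.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Approximate memory for the weights alone, in GB (10^9 bytes)."""
    bytes_per_weight = bits_per_weight / 8
    # billions of parameters x bytes per parameter = gigabytes
    return params_billion * bytes_per_weight


for bits in (16, 8, 4):
    print(f"{bits:>2}-bit: ~{weight_memory_gb(671, bits):,.0f} GB")
```

The same function gives a quick first guess for the 8-bit and 4-bit versions discussed later, although real files differ somewhat because not every tensor is stored at the same precision.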

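8-bit quantized version

A $6,000 machine

A post about running an 8-bit quantized¹ version of the model on a machine with the following specs, on a budget of $6,000, drew attention on X and Reddit. It keeps costs down by not using any GPU.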

Tower PC with 2 AMD EPYC CPUs
24 x 32GB DDR5-RDIMM
No GPUs
400 W power consumption

It reportedly reaches 6 to 8 TPS (tokens per second). (The author of the post, Matthew Carrigan, appears to be an engineer at Hugging Face.)

Marc Andreessen also reacted.

Here are a few excerpts from the discussion:

"Doing this on GPUs would take more than 700GB of VRAM and probably cost over $100,000."
https://x.com/carrigmat/status/1884247727758008642

"Renting 1128GB of VRAM costs roughly $22/h."
https://x.com/stack_toshi/status/1884310946136612926/photo/1

"Long chats or large contexts need 1000GB of RAM."
https://x.com/carrigmat/status/1887592245274059256

1TB of RAM is a memory size I had rarely seen before.

4-bit quantized version

For 4-bit quantization, the ApX article recommends the following GPU resources.

| Model | Parameters | VRAM Requirement (4-bit) | Recommended GPU |
|---|---|---|---|
| DeepSeek-R1-Zero | 671B | ~336 GB | Multi-GPU setup (e.g., NVIDIA A100 80GB x6) |
| DeepSeek-R1 | 671B | ~336 GB | Multi-GPU setup (e.g., NVIDIA A100 80GB x6) |

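Eight Macs

Someone also got the 4-bit version running on seven M4 Pro Mac minis plus an M4 Max MacBook Pro (496GB of RAM in total).

A $2,000 machine

The article "How To Run Deepseek R1 671b Fully Locally On a $2000 EPYC Server" runs it on a $2,000 machine with 512GB of RAM.

That is one big PC, though!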

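1.58-bit dynamic quantized version

https://unsloth.ai/ studied the DeepSeek R1 architecture and, on January 29, published an article reporting that they had "succeeded in selectively quantizing certain layers to higher bits (such as 4-bit)".

According to that article, the 1.58-bit quantization fits in 160GB of VRAM for fast inference (2x H100 80GB), with a throughput of about 140 tokens/s and about 14 tokens/s for single-user inference. Running the 1.58-bit R1 does not require VRAM (a GPU) at all. It works with as little as 20GB of RAM (CPU only), though it may be slow.

It can also run on a Mac. Since the model is about 131GB, a Mac Studio with the maximum 192GB of RAM reportedly runs it at a good speed.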

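Preparing and running DeepSeek R1 (671B)

From here on, this is a supplement to Unsloth's article on the dynamically quantized DeepSeek R1. The steps assume the 1.58-bit quantized version in particular, but they apply to the other quantizations as well. Please read it alongside the original article.

The rest is written step by step for readers who (as I was) have never downloaded a model from Hugging Face and are at the "I use brew" level. We start by creating a Python virtual environment (venv).

0. Creating a Python virtual environment (venv)

Creating the venv folder

# In any working directory you like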
python3 -m venv .venv

This creates a folder named .venv for the virtual environment.

Activating the virtual environment
```
source .venv/bin/activate
```
- The prompt changes to something like (.venv) ..., showing that the virtual environment is active.
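If you are unsure whether the activation worked, you can also ask Python itself. This is a small sketch that relies on the standard behavior that `sys.prefix` differs from `sys.base_prefix` inside a venv. The helper name is my own.

```python
import sys


def in_virtualenv() -> bool:
    """Return True when this interpreter is running inside a venv."""
    # Inside a venv, sys.prefix points at the venv folder,
    # while sys.base_prefix still points at the base installation.
    return sys.prefix != sys.base_prefix


print("venv active:", in_virtualenv())
print("interpreter:", sys.executable)
```

When the venv is active, the interpreter path printed on the second line should be inside the .venv folder.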

Installing the required libraries

pip install --upgrade pip
pip install huggingface_hub

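You are now ready to download the model with huggingface_hub. Next comes the build.

(Strictly speaking, building llama.cpp with CMake is done entirely in C++ and does not depend on the venv's Python environment, but I included this step for the sake of a clean setup.)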

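1. Installing llama.cpp and building with Metal support

The dynamically quantized DeepSeek R1 models are packaged in the format llama.cpp uses (GGUF).
To speed up inference with the Mac's GPU (Apple Silicon, Metal), pass GGML_METAL=ON at build time to enable Metal.

Downloading the source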

git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp

Building with Metal enabled via CMake
```
cmake . -B build \
  -DGGML_METAL=ON \
  -DCMAKE_BUILD_TYPE=Release

cmake --build build --config Release
```
This should produce the following executables (binaries) in build/bin:
- llama-cli
- llama-quantize, among others

Checking that it works (--help)
```
./build/bin/llama-cli --help
```
The main command-line options are listed.
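As an extra check after the build, the following sketch lists any expected binaries that are missing from build/bin. It assumes you run it from the llama.cpp directory, and that the two names mentioned above are the ones you care about. The helper is hypothetical, not part of llama.cpp.

```python
from pathlib import Path


def missing_binaries(build_dir, names=("llama-cli", "llama-quantize")):
    """Return the names of expected executables not found in <build_dir>/bin."""
    bin_dir = Path(build_dir) / "bin"
    return [name for name in names if not (bin_dir / name).is_file()]


missing = missing_binaries("build")
print("all binaries found" if not missing else f"missing: {missing}")
```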

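Note: with Metal support the GPU can use the unified memory, but even the 1.58-bit version of DeepSeek R1 needs a lot of memory. The article recommends 80GB, but in practice it felt like more was needed.

I tried it anyway, knowing it might be too slow to be practical.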

2. Downloading the 1.58-bit quantized model (Hugging Face)

Download the 1.58-bit version (IQ1_S), or a larger variant, from Unsloth's DeepSeek-R1-GGUF repository.

python3 <<EOF
from huggingface_hub import snapshot_download

snapshot_download(
    repo_id="unsloth/DeepSeek-R1-GGUF",
    local_dir="DeepSeek-R1-GGUF",
    allow_patterns=["*UD-IQ1_S*"],  # 1.58-bit version (IQ1_S)
)
EOF

After the download, the three split files, such as DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf, are placed in the DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/ directory.

DeepSeek-R1-GGUF/
├── DeepSeek-R1-UD-IQ1_S/
│   ├── DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf
│   ├── DeepSeek-R1-UD-IQ1_S-00002-of-00003.gguf
│   ├── DeepSeek-R1-UD-IQ1_S-00003-of-00003.gguf
---
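Because the download is large and can be interrupted, it may help to confirm that all three split files are present before moving on. Here is a rough sketch that builds the expected file names from the naming pattern shown above. The helper names and default arguments are my own assumptions.

```python
from pathlib import Path


def expected_shards(model_dir, stem="DeepSeek-R1-UD-IQ1_S", total=3):
    """Build the expected split-file paths, e.g. ...-00001-of-00003.gguf."""
    base = Path(model_dir)
    return [
        base / f"{stem}-{index:05d}-of-{total:05d}.gguf"
        for index in range(1, total + 1)
    ]


def missing_shards(model_dir, **kwargs):
    """Return the file names of expected shards that do not exist on disk."""
    return [p.name for p in expected_shards(model_dir, **kwargs) if not p.is_file()]


print(missing_shards("DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S"))
```

An empty list means every shard was found. This checks only that the files exist, not their size or integrity.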

3. Basic inference command

Now it is time to run the model.
Launch it with the llama-cli binary you built, as follows.

./build/bin/llama-cli \
  --model ./DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
  --threads 14 \
  --ctx-size 1024 \
  --n-gpu-layers 30 \
  --cache-type-k q4_0 \
  -no-cnv \
  -b 1024 \
  --prompt "<|User|>Hello<|Assistant|>"

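The options are as follows:

--model: path to the downloaded GGUF model (pointing to the first split file is enough)
--threads 14: set to match the number of CPU cores (about 14 on an M3 Max)
--ctx-size 1024: context length. The larger it is, the more KV cache is used and the more memory is needed
--n-gpu-layers 30: offload 30 layers to the Metal GPU (unified memory) for speed
--cache-type-k q4_0: quantize the keys of the KV cache to 4 bits (saves memory)
-no-cnv: disable llama.cpp's conversation mode and its built-in chat template (recommended when you supply the model's own prompt format)
-b 1024: batch size. A larger value processes more prompt tokens at once but uses more memory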
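If you plan to try many combinations of these options, building the command in a script can be easier than editing a long shell line. The sketch below only assembles the same flags as the command above. The function name and its defaults are illustrative, and the actual launch is left commented out.

```python
import shlex


def build_llama_cli_args(model_path, prompt, threads=14, ctx_size=1024,
                         n_gpu_layers=30, batch=1024):
    """Assemble the llama-cli argument list used in this article."""
    return [
        "./build/bin/llama-cli",
        "--model", model_path,
        "--threads", str(threads),
        "--ctx-size", str(ctx_size),
        "--n-gpu-layers", str(n_gpu_layers),
        "--cache-type-k", "q4_0",
        "-no-cnv",
        "-b", str(batch),
        "--prompt", prompt,
    ]


args = build_llama_cli_args(
    "./DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf",
    "<|User|>Hello<|Assistant|>",
)
print(shlex.join(args))  # shell-safe form, handy for copy and paste
# To actually run it: import subprocess; subprocess.run(args, check=True)
```

Passing the arguments as a list avoids shell quoting problems with the `<|User|>` tokens in the prompt.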

If you run out of memory (OOM)

Reduce --n-gpu-layers to 25, 20, and so on
Lower --ctx-size to 512
Quit other applications to free up memory

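About layers (`--n-gpu-layers`)

Large language models stack tens to hundreds of transformer layers. The 1.58-bit quantized DeepSeek R1 has 61 such layers (the llama.cpp log shows 62, apparently because it also counts the output layer), and --n-gpu-layers sets how many of them are placed on the GPU, with the following effect.

N layers are processed on the GPU (Metal)
- The number of layers you specify is computed with the GPU's fast parallel arithmetic.
The remaining layers are processed on the CPU
- If GPU memory (unified memory) cannot hold every layer, the rest stay on the CPU side and are computed there.
A larger N tends to be faster but uses more memory
- Raising N can be expected to speed up inference, but it puts more pressure on GPU memory and makes out-of-memory errors more likely.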

Checking the number of layers actually offloaded in the startup log

At startup you will see output like this.

load_tensors: offloading 30 repeating layers to GPU
load_tensors: offloaded 30/62 layers to GPU

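"offloaded 30/62 layers" means that 30 layers were moved to the GPU as requested (for example, --n-gpu-layers=30). If memory is insufficient, however, fewer layers than requested, say only 25, may end up offloaded. Because the requested number is not always what gets loaded, check the log to see the actual count.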
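If you save the startup log to a file, a short script can pull out the offloaded count for you. This is a sketch that assumes the log line has exactly the wording shown above. The regular expression and function name are mine, and the wording may differ in other llama.cpp versions.

```python
import re

# Matches lines such as: "load_tensors: offloaded 30/62 layers to GPU"
OFFLOAD_RE = re.compile(r"offloaded (\d+)/(\d+) layers to GPU")


def parse_offloaded(log_text):
    """Return (offloaded, total) from a llama.cpp startup log, or None."""
    match = OFFLOAD_RE.search(log_text)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


sample = """load_tensors: offloading 30 repeating layers to GPU
load_tensors: offloaded 30/62 layers to GPU"""
print(parse_offloaded(sample))  # (30, 62)
```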

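Formula

This is the approximation given in the blog post.

layers to offload ≈ (VRAM (GB) / model size (GB)) × total layers − 4

DeepSeek R1 (1.58-bit) is about 131GB
The total number of layers is 61

Plugging in 32GB of VRAM gives

layers to offload ≈ (32 / 131) × 61 − 4
≈ 10 (rounded down)

so about 10 layers go to the GPU. This is only a theoretical guideline, though.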
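The approximation is easy to turn into a small helper for trying different memory sizes. The sketch below uses the 131GB model size and 61 layers stated above. Rounding down before subtracting 4 and clamping the result to the range 0 to 61 are my own choices, as are the function and parameter names.

```python
import math


def layers_to_offload(vram_gb, model_size_gb=131, total_layers=61, reserve=4):
    """Estimate --n-gpu-layers from available GPU memory (rough guideline)."""
    estimate = math.floor(vram_gb / model_size_gb * total_layers) - reserve
    # Never suggest a negative number or more layers than the model has.
    return max(0, min(estimate, total_layers))


for vram in (24, 32, 80):
    print(f"{vram} GB -> {layers_to_offload(vram)} layers")
```

Treat the result as a starting point and confirm the real number in the startup log.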

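There is a trade-off: more layers means higher speed but more memory, fewer layers means lower speed but less memory. Output quality does not change; only the speed does.

For example, DeepSeek R1 (1.58-bit) is a model of about 131GB, so my MacBook Pro with 96GB of RAM cannot offload all of it. The work is split between the GPU and the CPU. Start with something like --n-gpu-layers=30 and lower it if you hit an OOM (out of memory) error.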

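CPU only (`--n-gpu-layers=0`)

--n-gpu-layers is the option for balancing how much speed to gain from the GPU against avoiding memory shortages, but there are also reports that CPU-only runs perform better. Experiment to find the best number of layers for your memory and needs.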

4. Example runs and measuring speed

Example 1: CPU only

./build/bin/llama-cli \
  --model ./DeepSeek-R1-GGUF/... \
  --n-gpu-layers 0 \
  --prompt "<|User|>What is the meaning of life?<|Assistant|>"

Example 2: GPU offload (30 layers)

./build/bin/llama-cli \
  --model ./DeepSeek-R1-GGUF/... \
  --n-gpu-layers 30 \
  --prompt "<|User|>What is the meaning of life?<|Assistant|>"

Reading the speed from the log

Lines such as llama_perf_context_print: report the eval time as "... ms per token" along with tokens per second, so you can read off the TPS.
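To compare runs, it can be handy to extract the generation speed from a saved log rather than reading it by eye. The sketch below recomputes tokens per second from the "eval time" line, using sample lines copied from the demo log at the end of this article. The regular expression is my own and assumes this exact log layout.

```python
import re

# Matches the generation line only, not the "prompt eval time" line.
EVAL_RE = re.compile(
    r"^llama_perf_context_print:\s+eval time =\s*([\d.]+) ms /\s*(\d+) runs",
    re.MULTILINE,
)


def tokens_per_second(log_text):
    """Return generated tokens per second from a llama.cpp perf log, or None."""
    match = EVAL_RE.search(log_text)
    if match is None:
        return None
    elapsed_ms, runs = float(match.group(1)), int(match.group(2))
    return runs / (elapsed_ms / 1000.0)


sample = """llama_perf_context_print: prompt eval time =   46488.47 ms /    31 tokens ( 1499.63 ms per token,     0.67 tokens per second)
llama_perf_context_print:        eval time = 9454128.77 ms /  1762 runs   ( 5365.57 ms per token,     0.19 tokens per second)"""
print(f"{tokens_per_second(sample):.2f} tokens/s")
```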

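6. Practical tips

Use --seed if you need reproducibility
- To get the same output on every run, fix the random seed with something like --seed 12345. The article used 3407, 3408, and 3409.
Sampling parameters such as --temp
- These adjust how creative the answers are; DeepSeek's recommended temp is 0.6.
Long conversations

A --ctx-size of 1024 is quite small, and long conversations will fall apart. You would want to raise it to 2048 or 4096, but every layer needs that cache memory, so memory use jumps. This is another trade-off.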
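To get a feel for why the context length matters, here is a generic estimate of KV cache size for a standard multi-head-attention transformer: two tensors (keys and values) per layer, each growing linearly with the context. The head count and head dimension below are placeholders, not DeepSeek R1's actual values, and llama.cpp's real allocation for this model may differ. Only the linear scaling with --ctx-size is the point.

```python
def kv_cache_bytes(ctx_size, n_layers, n_kv_heads, head_dim, bytes_per_elem=2.0):
    """Generic KV cache estimate: keys + values for every layer and position."""
    return 2 * n_layers * ctx_size * n_kv_heads * head_dim * bytes_per_elem


# Hypothetical dimensions for illustration only (not R1's real configuration).
HYPOTHETICAL = dict(n_layers=61, n_kv_heads=8, head_dim=128)

for ctx in (1024, 2048, 4096):
    gib = kv_cache_bytes(ctx, **HYPOTHETICAL) / 2**30
    print(f"ctx={ctx}: {gib:.2f} GiB")
```

Doubling the context doubles the estimate, which is why options like --cache-type-k q4_0 (fewer bytes per element) help when memory is tight.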

Server

llama-server also provides a web interface.

./build/bin/llama-server \
--model ./DeepSeek-R1-GGUF/DeepSeek-R1-UD-IQ1_S/DeepSeek-R1-UD-IQ1_S-00001-of-00003.gguf \
--n-gpu-layers 30 \
--ctx-size 2048 \
--port 8080

http://localhost:8080/

You can also see the contents of the Thought Process.
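Besides the browser UI, the server can be called from a script. The sketch below assumes llama-server is running on port 8080 as above and uses its OpenAI-compatible /v1/chat/completions endpoint. The helper names, the long timeout, and the temperature default of 0.6 are my own choices.

```python
import json
import urllib.request


def build_chat_request(prompt, base_url="http://localhost:8080", temperature=0.6):
    """Build a POST request for the server's chat completions endpoint."""
    payload = {
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(prompt, timeout=3600):
    """Send the prompt and return the reply text (generation may be very slow)."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Example, with the server already running:
# print(ask("Hello"))
```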

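7. Summary of the 1.58-bit dynamic quantization

Beyond the usual quantitative benchmarks, the original article runs a distinctive evaluation: have the model generate a Flappy Bird game in Python, three generations per setting (pass@3). Each result is scored on 10 criteria in total, such as whether the Pygame code is correct, the screen colors, and the score display. The article finds that even the 1.58-bit (131GB) model generates working code fairly consistently. By contrast, models whose bit width was simply lowered without dynamic quantization often fell into infinite loops or repeated meaningless words, and were reported to be unusable in practice.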

| Model (dynamic quantization) | Size | Score (out of 10) |
|---|---|---|
| 1.58-bit | 131GB | 6.92 |
| 1.73-bit | 158GB | 9.08 |
| 2.22-bit | 183GB | 9.17 |

Note: the scores show how well each model did on the Flappy Bird generation benchmark out of 10 points (the values in the table above are example average scores).

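Making use of the DeepSeek R1 architecture

DeepSeek R1 uses large MoE (Mixture of Experts) layers, a design that keeps computation down despite the huge parameter count. Because precision matters most in the first few layers, those are kept at 4 or 6 bits, while the remaining MoE parts are lowered to around 1.5 bits. An importance matrix appears to be used to automatically find the parameters whose precision matters most and raise the bit width only there.

With these techniques, DeepSeek R1, whose 671B parameters would normally call for VRAM on the order of 1.4TB to run as-is, shrinks by about 80% through dynamic quantization while limiting the loss in quality.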
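The size of the reduction depends on which baseline you compare against, so here is a quick sketch of the arithmetic. The 1,342 GB baseline is the 16-bit estimate from the table at the top of this article. The 720 GB baseline is an assumption on my part: I am taking it to be the original download size cited in the Unsloth article, which this article does not state.

```python
def reduction_percent(original_gb, quantized_gb):
    """Percentage by which the quantized model is smaller than the original."""
    return (1 - quantized_gb / original_gb) * 100


QUANTIZED_GB = 131  # 1.58-bit dynamic quantization, from the table above

print(f"vs 1,342 GB (16-bit estimate): {reduction_percent(1342, QUANTIZED_GB):.1f}%")
print(f"vs 720 GB (assumed original size): {reduction_percent(720, QUANTIZED_GB):.1f}%")
```

Under these assumptions, the "about 80%" figure corresponds to the second baseline. Against the 1.4TB-class figure, the reduction is closer to 90%.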

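The following blog posts explain the technical details well.

"DeepSeek R1 Dynamic 1.58-bit の概要" (Overview of DeepSeek R1 Dynamic 1.58-bit) by npaka
"DeepSeek推しのローカルLLM勢から見るDeepSeek" (DeepSeek as seen by a local-LLM enthusiast who backs DeepSeek) by holy_fox
"1 bit llm の実力は？BitNet b1.58の登場" (How capable are 1-bit LLMs? The arrival of BitNet b1.58) by yonaka15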

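My machine never reached a practical speed, but I did get to experience a glimpse of the full 671B DeepSeek R1 model as an individual.

As the reference links below show, people tune several parameters to improve performance. I do not yet have enough knowledge or experience to say "this is the recommended setting," but try running the commands yourself and enjoy watching how performance changes, not just the model's responses. 😊

Quite a few people are trying to run a giant LLM like this on their own!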

Happy AI model tuning! 🤖

Reference links

Demo

An example on a MacBook Pro M3 Max with 96GB of RAM (14-core CPU / 30-core GPU).

Sample output

> Create a Flappy Bird game in Python.
<think>
Okay, so I need to create a Flappy Bird game in Python. Let me think about how to approach this. I've heard that Pygame is a good library for making games, so maybe I should use that. First, I'll need to set up the game window. I remember that in Pygame, you initialize the window with pygame.init() and set the display with a certain size. The display size should probably be something standard, like 800x600 or similar.

Next, the bird. The bird needs to be able to jump and fall. I'll represent the bird with a rectangle, maybe a simple circle or a square. For movement, gravity will pull it down, and when the user presses a key, the bird will jump, which means giving it some upward velocity. I need to handle the physics for that. Maybe using variables like bird_y and velocity, adjusting velocity each frame with gravity, and then changing the bird's position.

Then, the obstacles. In Flappy Bird, there are pipes that move from the right to the left. Each pipe has a gap that the bird needs to pass through. I'll need to generate pairs of pipes, one at the top and one at the bottom, with a gap in between. The pipes should be spaced at certain intervals and move leftwards. Once they're off the screen, they can be removed or recycled.

Collision detection is crucial. The bird must collide with the pipes or the ground, which ends the game. I'll check for collisions between the bird's rectangle and the pipe rectangles. Also, if the bird hits the ground, that's a game over.

Scoring. Every time the bird passes a pair of pipes, the score increases. I'll need to keep track of the score and display it on the screen. Maybe using a font to render the text.

The game loop should handle events, like the user pressing the space key or clicking to make the bird jump. Also, updating the positions of the bird and pipes each frame, drawing everything, and checking for collisions.

Let me outline the steps:

1. Initialize Pygame and create the window.
2. Create the bird object with position and movement variables.
3. Handle user input to make the bird jump.
4. Generate pipes with random heights but a consistent gap.
5. Move pipes leftwards and manage their lifecycle.
6. Check for collisions between bird and pipes, or bird and ground.
7. Keep track of the score and display it.
8. When collision happens, end the game and show the final score.

Potential issues: How to generate the pipes with random gaps but ensuring the gap is passable. Also, handling the bird's jump physics smoothly. Maybe using a velocity variable that decreases due to gravity each frame. Also, making sure the pipes are generated at the right intervals and removed when off-screen.

I should start coding step by step. First, the window setup. Then the bird. Then the pipes. Then collision detection. Then the score.

Testing each part as I go. Maybe start with just the bird moving up and down, then add pipes, then collision, etc.

Let me start writing the code.

Import pygame, set up the window. Then define the bird's properties. In the game loop, handle events, update the bird's position based on velocity and gravity. Then draw the bird. Then add pipes, using a list to store them, generating new ones at certain intervals. Move them left each frame, check for collisions. Finally, display the score.

Oh, and I need to load images for the bird and pipes. But maybe for simplicity, use colored rectangles. Like a red rectangle for the bird, green for the pipes. That way, I don't need images, just draw rects.

Also, the game over screen. When the bird hits something, the game stops, and maybe shows "Game Over" with the final score.

I think that's the plan. Now, let's code it.
</think>

To create a Flappy Bird game in Python, we'll use the Pygame library. Here's a step-by-step implementation:

import pygame
import random

# Initialize Pygame
pygame.init()

# Set up the display
screen_width = 400
screen_height = 600
screen = pygame.display.set_mode((screen_width, screen_height))
pygame.display.set_caption("Flappy Bird")

# Colors
white = (255, 255, 255)
green = (0, 255, 0)
red = (255, 0, 0)
black = (0, 0, 0)

# Game variables
gravity = 0.5
bird_y = screen_height // 2
bird_velocity = 0
bird_size = 20
pipe_width = 50
pipe_gap = 150
pipe_height = 300
pipe_speed = 3
pipes = []
score = 0
font = pygame.font.Font(None, 36)

# Game loop
running = True
clock = pygame.time.Clock()

while running:
    # Handle events
    for event in pygame.event.get():
        if event.type == pygame.QUIT:
            running = False
        if event.type == pygame.KEYDOWN:
            if event.key == pygame.K_SPACE:
                bird_velocity = -8

    # Update bird physics
    bird_velocity += gravity
    bird_y += bird_velocity
    if bird_y < 0:
        bird_y = 0

    # Generate pipes
    if len(pipes) == 0 or pipes[-1][0] < screen_width - 300:
        pipe_top = random.randint(50, 300)
        pipes.append([screen_width, pipe_top, pipe_top + pipe_gap])

    # Move pipes
    for pipe in pipes:
        pipe[0] -= pipe_speed

    # Remove off-screen pipes
    pipes = [pipe for pipe in pipes if pipe[0] > -pipe_width]

    # Check collisions
    bird_rect = pygame.Rect(50, bird_y, bird_size, bird_size)
    for pipe in pipes:
        top_pipe = pygame.Rect(pipe[0], 0, pipe_width, pipe[1])
        bottom_pipe = pygame.Rect(pipe[0], pipe[2], pipe_width, screen_height - pipe[2])
        if bird_rect.colliderect(top_pipe) or bird_rect.colliderect(bottom_pipe):
            running = False

    # Check ground collision
    if bird_y + bird_size > screen_height:
        running = False

    # Update score
    for pipe in pipes:
        if pipe[0] + pipe_width < 50:
            score += 1

    # Draw everything
    screen.fill(black)
    # Draw bird
    pygame.draw.rect(screen, red, (50, bird_y, bird_size, bird_size))
    # Draw pipes
    for pipe in pipes:
        pygame.draw.rect(screen, green, (pipe[0], 0, pipe_width, pipe[1]))
        pygame.draw.rect(screen, green, (pipe[0], pipe[2], pipe_width, screen_height - pipe[2]))
    # Draw score
    score_text = font.render(f"Score: {score}", True, white)
    screen.blit(score_text, (10, 10))

    pygame.display.update()
    clock.tick(30)

# Game over
print(f"Final score: {score}")
pygame.quit()

Key components explained:

1. **Game Window Setup**: Uses Pygame to create a 400x600 window
2. **Bird Physics**:
   - SPACE key triggers a "flap" (velocity upward)
   - Gravity constantly pulls the bird down
   - Velocity is adjusted by gravity each frame

3. **Pipe Generation**:
   - Pipes spawn at random heights with consistent gaps
   - Scroll left at constant speed
   - Remove when off-screen

4. **Collision Detection**:
   - Rectangle collision between bird and pipes
   - Ground collision check

5. **Scoring**:
   - Increments when passing pipes
   - Displayed using Pygame font rendering

To play:
- Press SPACE to flap
- Avoid pipes and ground
- Score increases as you pass pipes

The game ends when hitting any obstacle, showing the final score in the console.

> 
llama_perf_sampler_print:    sampling time =     233.96 ms /  1775 runs   (    0.13 ms per token,  7586.80 tokens per second)
llama_perf_context_print:        load time =   37477.30 ms
llama_perf_context_print: prompt eval time =   46488.47 ms /    31 tokens ( 1499.63 ms per token,     0.67 tokens per second)
llama_perf_context_print:        eval time = 9454128.77 ms /  1762 runs   ( 5365.57 ms per token,     0.19 tokens per second)
llama_perf_context_print:       total time = 9636530.47 ms /  1793 tokens

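The instructions were not as detailed as in the original article, but a working game was generated on the first try.

¹ Roughly speaking, it is a technique for reducing size while limiting the loss in quality, like JPEG compared to a RAW image. For details, see https://www.cloudflare.com/ja-jp/learning/ai/what-is-quantization/ ↩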
