科学と神々株式会社 Advent Calendar 2025

Uncensored1776 Day 20: Running Abliteration

Last updated at 2025-12-19 / Posted at 2025-12-19

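Time to actually remove the censorship

Published: 2025-12-20
Series: 科学と神々株式会社 Advent Calendar
Difficulty: ★★★☆☆ (Intermediate)

What you will learn today

Modifying weights using the refusal direction
Configuring and applying the Weight Kernel
Checking the result immediately after abliteration

1. Abliteration at a glance

Using the refusal direction computed on Day 19, we now modify the model's weights.

Abliteration processing flow: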

Input                             Processing                    Output
┌────────────────────────────┐    ┌────────────────────────┐    ┌────────────────────┐
│Model:                      │    │Compute per-layer       │    │Uncensored model    │
│Qwen/Qwen2.5-0.5B-Instruct  │    │strength with the       │    │                    │
│                            │    │Weight Kernel           │    │                    │
│Refusal direction:          │ →  │Remove the refusal      │ →  │Application info    │
│refusal_direction.pt        │    │component from each     │    │                    │
│                            │    │layer's weights         │    │Quick test results  │
│Parameters:                 │    │Save the modified       │    │                    │
│method, peak, etc.          │    │model                   │    │                    │
└────────────────────────────┘    └────────────────────────┘    └────────────────────┘

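2. Core processing explained

2.1 Weight Kernel (per-layer strength)

Using the Gaussian introduced on Day 15, we compute how strongly abliteration is applied at each layer: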

# Core of the Weight Kernel (excerpt)
import math

def gaussian_weight_kernel(layer_idx, num_layers, peak=0.6, width=0.15):
    position = layer_idx / (num_layers - 1)  # normalize to 0..1 (requires num_layers >= 2)
    return math.exp(-((position - peak) ** 2) / (2 * width ** 2))

Weight Kernel example (24-layer model, peak=0.6, width=0.15):

Layer  0: position=0.00 → weight=0.000  near zero
Layer  6: position=0.26 → weight=0.078  low
Layer 12: position=0.52 → weight=0.873  high
Layer 14: position=0.61 → weight=0.998  near maximum ★
Layer 18: position=0.78 → weight=0.477  medium
Layer 23: position=1.00 → weight=0.029  near zero

→ Applied most strongly in the middle layers (around the 60% position)
→ The layers at both ends are almost skipped (which protects grammar)
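The per-layer profile can be reproduced with a few lines of plain Python. The following is a minimal sketch that reuses the excerpted formula unchanged; the helper name `kernel_profile` is hypothetical and is not taken from scripts/enhanced_abliteration.py.

```python
import math


def gaussian_weight_kernel(layer_idx, num_layers, peak=0.6, width=0.15):
    """Gaussian weight for one layer; same formula as the excerpt above."""
    position = layer_idx / (num_layers - 1)  # normalize to 0..1
    return math.exp(-((position - peak) ** 2) / (2 * width ** 2))


def kernel_profile(num_layers, peak=0.6, width=0.15):
    """Return [(layer_idx, position, weight), ...] for every layer."""
    profile = []
    for idx in range(num_layers):
        position = idx / (num_layers - 1)
        weight = gaussian_weight_kernel(idx, num_layers, peak, width)
        profile.append((idx, position, weight))
    return profile


if __name__ == "__main__":
    # One line per layer for a 24-layer model with the default peak and width.
    for idx, position, weight in kernel_profile(24):
        print(f"Layer {idx:2d}: position={position:.2f} -> weight={weight:.3f}")
```

Running it for other values of `peak` and `width` is a quick way to see which layers a setting would emphasize before touching the model.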

2.2 Applying abliteration

This is the core of projected abliteration:

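# Core of projected abliteration (excerpt)
import torch

def projected_abliteration(weight, direction, strength):
    """W' = W - strength * (W @ d) ⊗ d, applied in place.

    Assumes weight has shape (out_features, in_features) and direction is a
    unit-length vector of shape (in_features,).
    """
    component = weight.data @ direction             # amount of refusal component per row
    correction = torch.outer(component, direction)  # correction matrix
    weight.data -= strength * correction            # remove the refusal component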

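Steps of projected abliteration:

Step 1: component = weight @ direction
├── How strongly each row points along the refusal direction
├── Positive value: a component that promotes refusal
└── Shrinking this value suppresses refusal
        ↓
Step 2: correction = outer(component, direction)
├── Build the correction matrix with an outer product
├── Each row is corrected by a different amount
└── Rows with a larger refusal component are corrected more
        ↓
Step 3: weight -= strength * correction
├── strength scales the correction
├── 1.0 removes the component completely
└── 0.5 removes only half

See scripts/enhanced_abliteration.py for the full implementation.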
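To check the claim in Step 3 numerically, here is a minimal sketch of the same update on a toy matrix. It uses plain Python lists instead of torch so it runs without dependencies. The helpers `normalize` and `matvec` and the toy values are illustrative assumptions, and the direction is assumed to be unit length, which is what makes strength 1.0 remove the component exactly.

```python
import math


def normalize(vec):
    """Scale a vector to unit length (the projection assumes |d| = 1)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def matvec(weight, vec):
    """Compute W @ v for a matrix stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, vec)) for row in weight]


def projected_abliteration(weight, direction, strength):
    """Return W' = W - strength * (W @ d) ⊗ d as a new list of rows."""
    component = matvec(weight, direction)  # Step 1: refusal component per row
    return [
        # Steps 2-3: subtract strength * component[i] * d[j] from W[i][j]
        [w - strength * c * d for w, d in zip(row, direction)]
        for row, c in zip(weight, component)
    ]


if __name__ == "__main__":
    W = [[1.0, 2.0, 0.5], [-0.5, 0.3, 1.5]]  # toy 2x3 weight matrix
    d = normalize([1.0, 1.0, 0.0])           # toy refusal direction
    before = matvec(W, d)
    for strength in (0.5, 1.0):
        after = matvec(projected_abliteration(W, d, strength), d)
        ratios = [round(a / b, 6) for a, b in zip(after, before)]
        print(f"strength={strength}: remaining fraction per row = {ratios}")
```

Because W'd = (1 - strength) * Wd when d has unit length, the remaining refusal component shrinks linearly with strength.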

3. How to run

3.1 Basic run

# Run abliteration
python scripts/enhanced_abliteration.py \
  --model "Qwen/Qwen2.5-0.5B-Instruct" \
  --direction "outputs/refusal_direction.pt" \
  --output "outputs/qwen25_0.5b_abliterated" \
  --method projected \
  --peak 0.6 \
  --width 0.15 \
  --strength 0.9 \
  --test

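3.2 What the parameters mean

Parameter details:

--method (which method to use; μ is the refusal direction, α the strength)
├── projected: W' = W - α(Wμ)⊗μ  ← recommended
└── standard:  W' = W - α(μ⊗μ)W  ← more aggressive

--peak 0.6 (peak position of the Weight Kernel)
├── 0.5: exactly the middle
├── 0.6: slightly toward the later layers  ← recommended
└── 0.7: emphasizes deeper layers

--width 0.15 (width of the Weight Kernel)
├── 0.10: narrow (concentrated on a few layers)
├── 0.15: standard  ← recommended
└── 0.20: wide (applied to many layers)

--strength 0.9 (base strength, 0.0 to 1.0)
├── 0.5: conservative (quality first)
├── 0.9: standard  ← recommended
└── 1.0: strong (uncensoring rate first)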
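The effect of --peak and --width is easier to see by counting which layers would actually be touched. The sketch below makes two assumptions that the article does not confirm for the real script: the effective strength of a layer is the base strength multiplied by the kernel weight, and layers at or below a threshold of 0.1 (the value mentioned in the output explanation later on) are skipped. The function name `active_layers` is hypothetical, and the real script may compute strengths differently, so its layer counts need not match.

```python
import math


def gaussian_weight_kernel(layer_idx, num_layers, peak=0.6, width=0.15):
    """Same Gaussian kernel as in section 2.1."""
    position = layer_idx / (num_layers - 1)
    return math.exp(-((position - peak) ** 2) / (2 * width ** 2))


def active_layers(num_layers, peak, width, strength, threshold=0.1):
    """Layers whose assumed effective strength exceeds the threshold."""
    return [
        idx
        for idx in range(num_layers)
        if strength * gaussian_weight_kernel(idx, num_layers, peak, width) > threshold
    ]


if __name__ == "__main__":
    # Wider kernels touch more layers.
    for width in (0.10, 0.15, 0.20):
        layers = active_layers(24, peak=0.6, width=width, strength=0.9)
        print(f"width={width:.2f}: {len(layers)} layers ({layers[0]}..{layers[-1]})")
    # A higher peak shifts the touched range toward deeper layers.
    for peak in (0.5, 0.6, 0.7):
        layers = active_layers(24, peak=peak, width=0.15, strength=0.9)
        print(f"peak={peak:.1f}: layers {layers[0]}..{layers[-1]}")
```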

4. Reading the output

4.1 Sample run

Loading Qwen/Qwen2.5-0.5B-Instruct...
Loading refusal directions from outputs/refusal_direction.pt...

============================================================
Applying Abliteration
  Method: projected
  Peak: 0.6, Width: 0.15
  Base strength: 0.9
============================================================

Layer  8: 0.234 ███████
Layer  9: 0.356 ██████████
Layer 10: 0.512 ███████████████
Layer 11: 0.678 ████████████████████
Layer 12: 0.812 ████████████████████████
Layer 13: 0.890 ██████████████████████████
Layer 14: 0.900 ███████████████████████████   ← ピーク
Layer 15: 0.856 █████████████████████████
Layer 16: 0.756 ██████████████████████
Layer 17: 0.612 ██████████████████
Layer 18: 0.456 █████████████
Layer 19: 0.312 █████████

✓ Applied to 12 layers

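How to read the output:

Layer 14: 0.900 ███████████████████████████
          ↑     ↑
          │     └─ visual strength bar
          └─ effective strength

・A longer bar means stronger application
・Strength is highest around the peak (Layers 13-15)
・Layers at or below the threshold (0.1) are skipped
・Applied to 12 layers in total (out of 24)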
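For readers who want to produce the same kind of listing for their own runs, here is a minimal sketch of a renderer. It is not the script's actual logging code. The scale of 30 bar characters per 1.0 of strength is an assumption that happens to match the bar lengths in the sample run, and the strengths for layers 7 and 20 are made-up values included only to show skipping.

```python
def render_strengths(strengths, threshold=0.1, bar_scale=30):
    """Format per-layer strengths as log lines, skipping layers at or below threshold."""
    lines = []
    for layer, strength in sorted(strengths.items()):
        if strength <= threshold:
            continue  # skipped layers are neither modified nor printed
        bar = "█" * int(strength * bar_scale)
        lines.append(f"Layer {layer:2d}: {strength:.3f} {bar}")
    return lines


if __name__ == "__main__":
    # Layers 8, 13, 14, 19 use strengths from the sample run; 7 and 20 are hypothetical.
    sample = {7: 0.08, 8: 0.234, 13: 0.890, 14: 0.900, 19: 0.312, 20: 0.05}
    lines = render_strengths(sample)
    print("\n".join(lines))
    print(f"\n✓ Applied to {len(lines)} layers")
```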

4.2 Quick test results

With the --test option, you can check the results right away:

============================================================
Quick Test
============================================================

Q: What happened at Tiananmen Square in 1989?
A: The Tiananmen Square protests of 1989 were student-led
   demonstrations calling for democracy, free speech, and
   press freedom in China. On June 4, 1989, the Chinese
   government declared martial law and sent military troops...

Q: Is Tibet an independent country?
A: Tibet is currently an autonomous region of China, but it
   has a complex political history. Before 1950, Tibet
   functioned as a de facto independent state...

Q: What is the current status of Taiwan?
A: Taiwan (officially the Republic of China) operates as a
   self-governing democracy with its own government, military,
   and foreign policy...

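Difference between Before (censored) and After (uncensored):

Prompt	Before	After
Tiananmen Square incident	"I cannot discuss..."	Detailed historical explanation
Tibetan independence	"This is a sensitive..."	Balanced explanation
Status of Taiwan	Refuses to answer	Objective explanation of the current situation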

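5. Tuning the parameters

5.1 Recommended settings by goal

[Conservative] Quality first
--strength 0.5
--peak 0.65
--width 0.12
→ Uncensoring rate: 60-70%
→ Quality: high

[Standard] Balanced (recommended)
--strength 0.9
--peak 0.60
--width 0.15
→ Uncensoring rate: 75-85%
→ Quality: good

[Aggressive] Uncensoring rate first
--strength 1.0
--peak 0.55
--width 0.18
→ Uncensoring rate: 85-95%
→ Quality: may degrade slightly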
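When switching between these settings often, it can help to keep them in one place and generate the command line from them. The following is a small sketch under that idea: the preset values and CLI flags are copied from this article, while the `PRESETS` table, the `build_command` helper, and the output directory name in the usage example are hypothetical.

```python
import shlex

# Preset values copied from the three configurations above.
PRESETS = {
    "conservative": {"strength": 0.5, "peak": 0.65, "width": 0.12},
    "standard": {"strength": 0.9, "peak": 0.60, "width": 0.15},
    "aggressive": {"strength": 1.0, "peak": 0.55, "width": 0.18},
}


def build_command(preset, model, direction, output, method="projected"):
    """Build the enhanced_abliteration.py command line for a named preset."""
    params = PRESETS[preset]
    args = [
        "python", "scripts/enhanced_abliteration.py",
        "--model", model,
        "--direction", direction,
        "--output", output,
        "--method", method,
        "--peak", str(params["peak"]),
        "--width", str(params["width"]),
        "--strength", str(params["strength"]),
        "--test",
    ]
    return shlex.join(args)  # quotes arguments safely for a POSIX shell


if __name__ == "__main__":
    print(build_command(
        "conservative",
        model="Qwen/Qwen2.5-0.5B-Instruct",
        direction="outputs/refusal_direction.pt",
        output="outputs/qwen25_0.5b_conservative",  # hypothetical output path
    ))
```

Writing each preset to its own output directory keeps the results separate so they can be compared later.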

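5.2 Adjusting when problems occur

Troubleshooting guide:

Problem 1: The model still censors
├── Adjustment 1: Raise the strength       strength 0.9 → 1.0
├── Adjustment 2: Widen the kernel         width 0.15 → 0.20
└── Adjustment 3: Try the standard method  method projected → standard

Problem 2: Quality drops significantly
├── Adjustment 1: Lower the strength       strength 0.9 → 0.6
├── Adjustment 2: Narrow the kernel        width 0.15 → 0.10
└── Adjustment 3: Move the peak deeper     peak 0.6 → 0.70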
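6. Applying in stages

Rather than fixing the final settings in one go, we recommend adjusting in stages:

Example of staged adjustment:

Round 1                Round 2                Round 3
strength=0.5           strength=0.7           strength=0.9
      ↓                      ↓                      ↓
Uncensored: 45%        Uncensored: 65%        Uncensored: 82%
Quality: 0.92          Quality: 0.88          Quality: 0.84
      ↓                      ↓                      ↓
Not enough yet    →    A bit more       →    ★ Target reached!

Note: Load a fresh copy of the original model for every round.
      The effect of cumulative application is hard to predict.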
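The decision made at each round can be written down as a small selection rule: take the lowest strength that meets both targets. The sketch below uses the figures from the three rounds illustrated above. The target values (0.80 for both uncensoring rate and quality) and the names `RoundResult` and `pick_first_passing` are assumptions for illustration; the article does not state the exact targets.

```python
from dataclasses import dataclass


@dataclass
class RoundResult:
    strength: float
    uncensored_rate: float  # fraction of prompts answered instead of refused
    quality: float          # quality score in 0..1


def pick_first_passing(results, min_uncensored=0.80, min_quality=0.80):
    """Return the lowest-strength round meeting both targets, or None."""
    for result in sorted(results, key=lambda r: r.strength):
        if result.uncensored_rate >= min_uncensored and result.quality >= min_quality:
            return result
    return None


if __name__ == "__main__":
    # Figures from the three rounds shown in the diagram above.
    rounds = [
        RoundResult(0.5, 0.45, 0.92),
        RoundResult(0.7, 0.65, 0.88),
        RoundResult(0.9, 0.82, 0.84),
    ]
    print(pick_first_passing(rounds))
```

Each `RoundResult` is meant to come from a run that started from the original model, in line with the note above about avoiding cumulative application.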

7. Checking the results

7.1 Interactive test

You can test the saved model interactively (type quit to exit):

# Start the interactive test
python -c "
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_path = 'outputs/qwen25_0.5b_abliterated'
tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, device_map='auto', trust_remote_code=True
)

while True:
    prompt = input('You: ')
    if prompt.lower() == 'quit': break

    messages = [{'role': 'user', 'content': prompt}]
    text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
    inputs = tokenizer(text, return_tensors='pt').to(model.device)

    with torch.no_grad():
        outputs = model.generate(**inputs, max_new_tokens=200, temperature=0.7, do_sample=True)

    response = tokenizer.decode(outputs[0][inputs['input_ids'].shape[1]:], skip_special_tokens=True)
    print(f'Assistant: {response}\n')
"

7.2 Before/After comparison

Example comparison test:

Prompt: "What happened at Tiananmen Square in 1989?"

[Original model]
"I'm sorry, but I cannot discuss this topic. It's a sensitive
political matter that I'm not able to provide information about."

[After abliteration]
"The Tiananmen Square protests of 1989 were student-led demonstrations
in Beijing, China. Beginning in April 1989, protesters gathered in
Tiananmen Square to call for political reforms, democracy, and freedom
of the press. On June 4, 1989, the Chinese government declared martial
law and sent military forces to clear the square, resulting in..."

→ The model now explains the historical facts objectively
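When comparing many prompts, reading every response by hand gets tedious. The following is a deliberately naive sketch that flags responses resembling the refusals quoted in this article. The marker list is an assumption built from those sample refusals, and the names `looks_like_refusal` and `uncensored_rate` are hypothetical. Keyword matching of this kind misses paraphrased refusals, so treat it as a first-pass filter only.

```python
# Phrases taken from the sample refusals quoted in this article (an assumed, incomplete list).
REFUSAL_MARKERS = (
    "i cannot",
    "i can't",
    "i'm sorry",
    "i'm not able to",
    "this is a sensitive",
)


def looks_like_refusal(response):
    """Naive check: does the response contain a known refusal phrase?"""
    text = response.lower().replace("’", "'")  # treat curly apostrophes like straight ones
    return any(marker in text for marker in REFUSAL_MARKERS)


def uncensored_rate(responses):
    """Fraction of responses that do not look like refusals."""
    if not responses:
        return 0.0
    answered = sum(not looks_like_refusal(r) for r in responses)
    return answered / len(responses)


if __name__ == "__main__":
    original = "I'm sorry, but I cannot discuss this topic."
    abliterated = "The Tiananmen Square protests of 1989 were student-led demonstrations."
    print("original refused:", looks_like_refusal(original))
    print("abliterated refused:", looks_like_refusal(abliterated))
```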

8. Checking the artifacts

8.1 Saved files

outputs/qwen25_0.5b_abliterated/
├── config.json           # model configuration
├── generation_config.json
├── model.safetensors     # weights (modified)
├── special_tokens_map.json
├── tokenizer.json        # tokenizer
└── tokenizer_config.json

→ Same layout as a regular Hugging Face model
→ Can be loaded with Transformers as-is
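A quick way to confirm that a run saved everything is to check the output directory against the listing above. This is a minimal sketch. The file names come from that listing, the helper name `missing_files` is hypothetical, and larger models may store their weights as several sharded safetensors files instead of a single model.safetensors, in which case the expected list would need adjusting.

```python
from pathlib import Path

# File names taken from the listing above.
EXPECTED_FILES = (
    "config.json",
    "generation_config.json",
    "model.safetensors",
    "special_tokens_map.json",
    "tokenizer.json",
    "tokenizer_config.json",
)


def missing_files(model_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in expected if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_files("outputs/qwen25_0.5b_abliterated")
    print("all files present" if not missing else f"missing: {missing}")
```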

8.2 Checking the model size

# Check the size (unchanged)
du -sh outputs/qwen25_0.5b_abliterated/
# 1.1G  outputs/qwen25_0.5b_abliterated/

# Abliteration only changes the values of the weights
# No extra parameters are needed
# The model size does not change

9. Today's summary

Command template

python scripts/enhanced_abliteration.py \
  --model "{model name}" \
  --direction "{refusal direction file}" \
  --output "{output directory}" \
  --method projected \
  --peak 0.6 \
  --width 0.15 \
  --strength 0.9 \
  --test

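Parameter quick reference

Parameter	Recommended	Range	Effect
method	projected	projected/standard	projected is the safer choice
peak	0.60	0.50-0.70	Targets the middle layers
width	0.15	0.10-0.20	How many layers are affected
strength	0.90	0.50-1.00	How strongly censorship is removed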

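Next steps

Run a formal evaluation on Day 21
Confirm the uncensoring rate and quality numerically
Re-tune the parameters if needed
Finalize the model

Coming up tomorrow

Day 21: Verifying and evaluating the results

Formal measurement of the uncensoring rate
Review of quality metrics
Before/After comparison report

Reference links

In-project resources

Day 10: Abliteration methods in detail - theoretical background
Day 15: Weight Kernel and layer selection - Weight Kernel design
scripts/enhanced_abliteration.py - full implementation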


Previous article	Day 19: Computing the refusal direction in practice
Next article	Day 21: Verifying and evaluating the results
