Elo Ratings of LLMs, Part 2

Posted at 2024-11-08

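There is a site that compares LLMs by their Elo ratings.

Chatbot Arena shows the outputs of two LLMs side by side, offers four voting options, and has a human choose which output is better. The following code is known for computing the Elo ratings from those votes. Here I would like to poke at this data analysis.

Tie and Both are bad are treated the same

The vote has four options, but in the code below Tie and Both are bad are handled identically. Strictly speaking the two are not the same. The latter in particular hardly looks like a proper evaluation, so counting it as 0.5 wins seems questionable.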

from collections import defaultdict

def compute_elo(battles, K=4, SCALE=400, BASE=10, INIT_RATING=1000):
    rating = defaultdict(lambda: INIT_RATING)

    for rd, model_a, model_b, winner in battles[['model_a', 'model_b', 'winner']].itertuples():
        ra = rating[model_a]
        rb = rating[model_b]
        ea = 1 / (1 + BASE ** ((rb - ra) / SCALE))
        eb = 1 / (1 + BASE ** ((ra - rb) / SCALE))
        if winner == "model_a":
            sa = 1
        elif winner == "model_b":
            sa = 0
        elif winner == "tie" or winner == "tie (bothbad)":
            sa = 0.5
        else:
            raise Exception(f"unexpected vote {winner}")
        rating[model_a] += K * (sa - ea)
        rating[model_b] += K * (1 - sa - eb)

    return rating
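
One way to see how much the tie handling matters is to make the treatment of "tie (bothbad)" a parameter. The sketch below is a hypothetical variant, not the Chatbot Arena code: `compute_elo_variant` and its `bothbad` argument are names introduced here for illustration, and it takes plain `(model_a, model_b, winner)` tuples instead of a DataFrame.

```python
from collections import defaultdict

def compute_elo_variant(battles, K=4, SCALE=400, BASE=10, INIT_RATING=1000,
                        bothbad="half"):
    """Elo update over (model_a, model_b, winner) tuples.

    bothbad="half": score "tie (bothbad)" as 0.5, like the original code.
    bothbad="skip": ignore those votes entirely (hypothetical alternative).
    """
    rating = defaultdict(lambda: INIT_RATING)
    for model_a, model_b, winner in battles:
        if winner == "tie (bothbad)" and bothbad == "skip":
            continue  # no rating change for this vote
        if winner == "model_a":
            sa = 1.0
        elif winner == "model_b":
            sa = 0.0
        elif winner in ("tie", "tie (bothbad)"):
            sa = 0.5
        else:
            raise ValueError(f"unexpected vote {winner}")
        ra, rb = rating[model_a], rating[model_b]
        ea = 1 / (1 + BASE ** ((rb - ra) / SCALE))
        delta = K * (sa - ea)
        rating[model_a] += delta
        rating[model_b] -= delta
    return rating

# Made-up votes, only to show that the two settings diverge
votes = [("x", "y", "model_a"), ("x", "y", "tie (bothbad)")]
print(dict(compute_elo_variant(votes, bothbad="half")))
print(dict(compute_elo_variant(votes, bothbad="skip")))
```

Running both settings on the real battle data and comparing the rankings would show whether the Both are bad votes change the ordering at all.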

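Is there any point in computing eb?

In the code above, $ea+eb=1$, so $eb=1-ea$. Hence $(1 - sa - eb)=(1 - sa - (1 - ea))=-(sa-ea)$.
It is therefore simpler to write the update as follows, with no need to compute eb. The two forms are mathematically equivalent, but this one makes it obvious that the update is zero-sum.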

        rating[model_a] += K * (sa - ea)
        rating[model_b] -= K * (sa - ea)

Alternatively, using $sb=1 - sa$, it can be written as follows.

        sb = 1 - sa
        rating[model_a] += K * (sa - ea)
        rating[model_b] += K * (sb - eb)
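
The identity can also be checked numerically. This is a small sketch with made-up ratings (1200 and 1000 are illustrative values, and `expected` is a helper name introduced here); it compares the original update for model_b with the simplified zero-sum form for each possible score.

```python
def expected(ra, rb, scale=400, base=10):
    """Expected score of a player rated ra against one rated rb."""
    return 1 / (1 + base ** ((rb - ra) / scale))

K = 4
ra, rb = 1200.0, 1000.0
ea, eb = expected(ra, rb), expected(rb, ra)

for sa in (1.0, 0.5, 0.0):
    original_b = K * (1 - sa - eb)   # update for model_b as written in compute_elo
    simplified_b = -K * (sa - ea)    # zero-sum form that never uses eb
    print(sa, round(original_b, 6), round(simplified_b, 6))
```

The two columns agree up to floating-point rounding, which is all the simplification relies on.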

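Are there enough battles?

In the data used by the calculation code, gpt-4 and claude-v1 are clearly the top two.
So I checked how often these two models faced each other directly. The result was [175, 118, 137, 18]: 175 wins for gpt-4, 118 wins for claude-v1, 137 ties, and 18 votes of both are bad, 448 battles in total.
That is fewer direct matchups than expected, and ties are quite common. A small number of voters also answered that both were bad.

There are 22 LLMs, so even though gpt-4 has 6447 evaluations and claude-v1 has 5903, the number of evaluations against any single opponent is not that large. Moreover, mpt-30b-chat and vicuna-33b have only 1899 and 1836 evaluations, which makes the reliability of their ratings look doubtful as well.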

# Head-to-head tally: [gpt-4 wins, claude-v1 wins, ties, both bad]
count = [0, 0, 0, 0]
for rd, model_a, model_b, winner in battles[['model_a', 'model_b', 'winner']].itertuples():
    if "gpt-4"==model_a and "claude-v1"==model_b:
        if winner == "model_a":
            count[0] += 1
        if winner == "model_b":
            count[1] += 1
        if winner == "tie":
            count[2] += 1
        if winner == "tie (bothbad)":
            count[3] += 1
    if "gpt-4"==model_b and "claude-v1"==model_a:
        if winner == "model_a":
            count[1] += 1
        if winner == "model_b":
            count[0] += 1
        if winner == "tie":
            count[2] += 1
        if winner == "tie (bothbad)":
            count[3] += 1
print(count)
---------------------------------------------
[175, 118, 137, 18]
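
To get a feel for how much uncertainty 293 decisive votes leave, here is a rough sketch using the normal approximation to the binomial. It assumes the votes are independent draws with a fixed win probability, which real Arena votes may not be, and `win_rate_interval` is a name introduced here for illustration.

```python
import math

def win_rate_interval(wins, losses, z=1.96):
    """Normal-approximation interval for the win rate, ties excluded."""
    n = wins + losses
    p = wins / n
    se = math.sqrt(p * (1 - p) / n)  # standard error of a binomial proportion
    return p, p - z * se, p + z * se

p, lo, hi = win_rate_interval(175, 118)
print(f"win rate {p:.3f}, approx. 95% interval [{lo:.3f}, {hi:.3f}]")
```

Under these assumptions the interval runs from roughly 0.54 to 0.65, so the head-to-head data does favor gpt-4 but leaves the size of the gap quite loose.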

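Incidentally, the Elo gap between gpt-4 and claude-v1 can be estimated from this head-to-head data alone. As a rough rule of thumb, a rating gap of 400 corresponds to a win rate of about 90%, so each 10 rating points shifts the win rate by about 1% away from 50%.
Excluding ties, $\frac{175}{175+118}=0.597$, which gives a rating gap of about 97 under this linear rule.
Counting ties as 0.5 wins, $\frac{175+(137+18) \times 0.5}{175+118+137+18}=0.564$, which gives a gap of about 64.
The linear rule is coarse near 50%, so these figures are only rough estimates, but either way including ties narrows the gap.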
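
The rule of thumb can be compared with the exact inverse of the expected-score formula used in compute_elo, $400\log_{10}\frac{p}{1-p}$. A minimal sketch (`rating_gap` is a helper name introduced here):

```python
import math

def rating_gap(p, scale=400, base=10):
    """Rating difference at which the Elo expected score equals p."""
    return scale * math.log(p / (1 - p), base)

no_ties = 175 / (175 + 118)
half_ties = (175 + 0.5 * (137 + 18)) / (175 + 118 + 137 + 18)

print(round(rating_gap(no_ties)), round(rating_gap(half_ties)))
```

This gives gaps of about 68 and 44, noticeably smaller than the 97 and 64 from the linear rule, because the Elo curve is steepest around a 50% win rate. The qualitative conclusion is unchanged.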

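Evaluation counts over time

Eight models are compared from the start, but their final evaluation counts are far from equal. Two of them (llama-13b and stablelm-tuned-alpha-7b, both with quite low Elo scores) stop accumulating evaluations almost entirely partway through. Isn't it a problem that newer models have almost no battles against them?

Is K=4 appropriate?

vicuna-33b, the model with the fewest battles, has played only 1836.
With 22 models, if every model were held to that count, we get $1836 \times 22=40392$. Each battle is counted twice here, so this corresponds to roughly $20000$ battles in total.

Let us check whether $K=4$ converges well enough with $22$ models and $20000$ battles.
Plotting the simulation below shows that it does, though only barely. With K=2 the step is too small and the ratings have not yet converged. So for this amount of data, K cannot be made smaller than 4.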

import numpy as np
import matplotlib.pyplot as plt
import random

a = np.log(10)
fig = plt.figure()
ax = []
for index, m in enumerate([16,8,4,2]):
    K = m
    ax.append(fig.add_subplot(2, 2, index+1))
    for n in [22,16,32,64]:
        seed = 100
        np.random.seed(seed=seed)
        random.seed(seed)
        print(n,m)
        Elo_true = np.random.randn(n) * 400 / np.sqrt(2) + 1500
        Elo_true = Elo_true - (np.mean(Elo_true) - 1500)
        Elo_pred = np.ones(n) * 1500
        l = list(range(n))
        loss = []
        for t in range(20000):
            sample = random.sample(l, 2)
            A_team, B_team = sample[:1], sample[1:]
            
            R_AB_true = np.sum(Elo_true[A_team]-Elo_true[B_team])
            R_AB_pred = np.sum(Elo_pred[A_team]-Elo_pred[B_team])
            W_AB_true = 1/(1+np.exp(-a*R_AB_true/400))
            W_AB_pred = 1/(1+np.exp(-a*R_AB_pred/400))

            s = 1 if W_AB_true > np.random.rand() else 0
            Elo_pred[A_team] += K * (s - W_AB_pred)
            Elo_pred[B_team] -= K * (s - W_AB_pred)
            loss.append(np.mean(np.abs(Elo_true-Elo_pred)))

        print(Elo_true[:8])
        print(Elo_pred[:8])
        ax[index].plot(range(len(loss)), loss, label='n=%d,K=%d'% (n,m))
        ax[index].set_ylim(bottom=0, top=250)
        ax[index].legend()
plt.show()
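
For a quick numeric check without plots, the same experiment can be condensed into a function that returns the mean absolute rating error before and after the updates. This is a sketch under the same assumptions as the script above (random pairing, true ratings drawn from a normal distribution, no ties), rewritten with the standard library only. `simulate` is a name introduced here, and because the random generator differs the numbers will not match the plots exactly.

```python
import random

def simulate(n_models=22, n_battles=20000, K=4, seed=100):
    """Return mean absolute rating error before and after random-pair Elo updates."""
    rng = random.Random(seed)
    # True ratings: normal with std 400/sqrt(2), re-centred on 1500
    true = [rng.gauss(0, 400 / 2 ** 0.5) for _ in range(n_models)]
    mean = sum(true) / n_models
    true = [1500 + t - mean for t in true]
    pred = [1500.0] * n_models

    def mae():
        return sum(abs(t - p) for t, p in zip(true, pred)) / n_models

    start = mae()
    for _ in range(n_battles):
        i, j = rng.sample(range(n_models), 2)
        w_true = 1 / (1 + 10 ** ((true[j] - true[i]) / 400))
        w_pred = 1 / (1 + 10 ** ((pred[j] - pred[i]) / 400))
        s = 1 if rng.random() < w_true else 0  # outcome drawn from the true win rate
        delta = K * (s - w_pred)
        pred[i] += delta
        pred[j] -= delta
    return start, mae()

for K in (16, 8, 4, 2):
    start, end = simulate(K=K)
    print(f"K={K}: mean abs error {start:.0f} -> {end:.0f}")
```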

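On the other hand, if random matching is replaced by matchmaking that tries to keep battles balanced by avoiding pairs whose rating gap is 150 or more, more battles are needed to converge, and even K=4 is no longer sufficient.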

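        for t in range(20000):
            # Retry up to 20 times to find a pair whose current rating gap is below 150;
            # if none is found, the last sampled pair is used as is.
            for i in range(20):
                sample = random.sample(l, 2)
                A_team, B_team = sample[:1], sample[1:]
                if np.abs(Elo_pred[A_team]-Elo_pred[B_team]) < 150.0:
                    break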
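
The retry loop above can be isolated into a small helper for experimenting with other thresholds. This is a sketch with hypothetical names (`pick_pair`, `max_gap`, `max_tries`); like the snippet above, it falls back to the last sampled pair when no balanced pair turns up within the allowed number of tries.

```python
import random

def pick_pair(ratings, max_gap=150.0, max_tries=20, rng=random):
    """Pick two distinct indices, preferring pairs whose rating gap is below max_gap."""
    indices = range(len(ratings))
    for _ in range(max_tries):
        i, j = rng.sample(indices, 2)
        if abs(ratings[i] - ratings[j]) < max_gap:
            break  # balanced enough, stop retrying
    return i, j

# Made-up current ratings, only to show the call
current = [1500.0, 1520.0, 1900.0, 1480.0]
print(pick_pair(current, rng=random.Random(0)))
```

Note that the filter uses the current estimated ratings, not the true ones, so early on (when every model sits at the initial rating) it rejects nothing.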

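Matchup combinations

vicuna-33b, the model with the fewest battles, has 1836 battles among 22 models, so if matchups were assigned evenly, every pair would have at least $1836/21=87.4$ battles.
However, the histogram shows a fair number of pairs with fewer than 80 battles: 26 of the 231 possible pairs.

Besides llama-13b and stablelm-tuned-alpha-7b, which as noted stop being evaluated partway through, dolly-v2-12b also appears to drop out.
In addition, koala-13b and vicuna-13b met 1155 times, far more than any other pair (the next largest, oasst-pythia-12b vs vicuna-13b, has 778). It is hard to call the matchups unbiased.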

import itertools
import numpy as np
import matplotlib.pyplot as plt

# battles and model_list come from the scripts later in this post

hist_data = []
count2 = 0
for m1, m2 in itertools.combinations(model_list, 2):
    count = [0, 0, 0, 0]
    for rd, model_a, model_b, winner in battles[['model_a', 'model_b', 'winner']].itertuples():
        if m1==model_a and m2==model_b:
            if winner == "model_a":
                count[0] += 1
            if winner == "model_b":
                count[1] += 1
            if winner == "tie":
                count[2] += 1
            if winner == "tie (bothbad)":
                count[3] += 1
        if m1==model_b and m2==model_a:
            if winner == "model_a":
                count[1] += 1
            if winner == "model_b":
                count[0] += 1
            if winner == "tie":
                count[2] += 1
            if winner == "tie (bothbad)":
                count[3] += 1
    hist_data.append(np.sum(count))
    if np.sum(count) < 80:
        count2 += 1
    if np.sum(count) > 600 or np.sum(count) < 30:
        print(m1,m2)
        print(count, np.sum(count))
print(count2)
plt.hist(hist_data, bins=100)
plt.show()
--------------------------------------------
koala-13b oasst-pythia-12b
[268, 133, 65, 141] 607
koala-13b vicuna-13b
[238, 497, 237, 183] 1155
oasst-pythia-12b vicuna-13b
[119, 425, 113, 121] 778
alpaca-13b vicuna-13b
[95, 354, 84, 86] 619
dolly-v2-12b gpt4all-13b-snoozy
[1, 6, 7, 8] 22
dolly-v2-12b vicuna-33b
[3, 15, 3, 2] 23
dolly-v2-12b mpt-30b-chat
[1, 14, 3, 2] 20
stablelm-tuned-alpha-7b wizardlm-13b
[1, 8, 11, 6] 26
stablelm-tuned-alpha-7b gpt4all-13b-snoozy
[1, 5, 14, 3] 23
stablelm-tuned-alpha-7b guanaco-33b
[1, 8, 4, 4] 17
stablelm-tuned-alpha-7b vicuna-33b
[0, 0, 0, 0] 0
stablelm-tuned-alpha-7b mpt-30b-chat
[0, 0, 0, 0] 0
llama-13b mpt-7b-chat
[3, 8, 4, 6] 21
llama-13b palm-2
[4, 17, 3, 1] 25
llama-13b claude-instant-v1
[3, 12, 4, 2] 21
llama-13b vicuna-7b
[2, 13, 1, 1] 17
llama-13b wizardlm-13b
[3, 8, 3, 4] 18
llama-13b gpt4all-13b-snoozy
[2, 3, 3, 6] 14
llama-13b guanaco-33b
[3, 9, 2, 4] 18
llama-13b vicuna-33b
[0, 6, 3, 1] 10
llama-13b mpt-30b-chat
[0, 6, 2, 0] 8
26
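
The tally above rescans every battle for each of the 231 pairs. If only the total per pair is needed, a single pass with a Counter keyed on the unordered pair is enough. The sketch below uses a tiny made-up sample and a hypothetical helper name `pair_counts`; it works on plain `(model_a, model_b, winner)` tuples.

```python
from collections import Counter
from math import comb

def pair_counts(battles):
    """Count battles per unordered model pair in a single pass."""
    counts = Counter()
    for model_a, model_b, _winner in battles:
        counts[frozenset((model_a, model_b))] += 1
    return counts

# Tiny made-up sample, only to show the shape of the result
sample = [
    ("koala-13b", "vicuna-13b", "model_b"),
    ("vicuna-13b", "koala-13b", "tie"),
    ("llama-13b", "vicuna-33b", "model_b"),
]
counts = pair_counts(sample)
print(comb(22, 2))  # number of possible pairs among 22 models
print(counts[frozenset(("koala-13b", "vicuna-13b"))])
```

With the DataFrame used in this post, `battles[['model_a', 'model_b', 'winner']].itertuples(index=False)` should yield tuples of this shape. Pairs that never met are simply absent from the Counter and read as 0.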

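Shuffling the data order and changing K

Shuffling the order of the data or changing K moves the Elo ratings.
In particular, claude-v1 ranks highest when the chronological order is kept, while gpt-4 ranks higher once the order is shuffled. One possible reason is that one of the models, or something around it such as the system prompt, was upgraded along the way, so the model is not really the same throughout. Another is that the most recent evaluations are skewed.

Of course the code is aware of this and deals with it by what it calls Bootstrap: shuffling the data order, computing Elo, repeating this 1000 times, and averaging. Still, the data itself may be flawed.

Also, when tie data is removed, the ratings of gpt-4 and claude-v1 come out higher than in the default calculation.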
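
The shuffle-and-average idea as described here can be sketched as follows. This follows the description in this post, not the actual notebook, whose resampling details may differ. `elo` and `shuffled_average` are names introduced for illustration, and the votes are made up.

```python
import random
from collections import defaultdict

SCORE = {"model_a": 1.0, "model_b": 0.0, "tie": 0.5, "tie (bothbad)": 0.5}

def elo(battles, K=4, init=1000):
    """Sequential Elo over (model_a, model_b, winner) tuples."""
    rating = defaultdict(lambda: init)
    for a, b, winner in battles:
        ea = 1 / (1 + 10 ** ((rating[b] - rating[a]) / 400))
        delta = K * (SCORE[winner] - ea)
        rating[a] += delta
        rating[b] -= delta
    return rating

def shuffled_average(battles, rounds=100, seed=0):
    """Average the final Elo over `rounds` random orderings of the same battles."""
    rng = random.Random(seed)
    totals = defaultdict(float)
    for _ in range(rounds):
        order = list(battles)
        rng.shuffle(order)
        for model, r in elo(order).items():
            totals[model] += r
    return {model: total / rounds for model, total in totals.items()}

# Made-up votes: "a" mostly beats "b", "b" mostly beats "c"
votes = [("a", "b", "model_a")] * 30 + [("b", "c", "model_a")] * 30 + [("a", "c", "tie")] * 10
print(shuffled_average(votes))
```

Averaging removes the dependence on one particular ordering, but it cannot repair matchups that are missing or skewed in the data itself.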

                      Model  Elo rating_default  Elo rating_all  Elo rating_no_anony  Elo rating_no_ties
1                 claude-v1                1201            1216                 1219                1309
2                     gpt-4                1185            1198                 1225                1303
3             gpt-3.5-turbo                1158            1178                 1179                1232
4         claude-instant-v1                1138            1149                 1167                1218
5                vicuna-33b                1088            1115                 1138                1147
6              wizardlm-13b                1032            1039                 1062                1064
7              mpt-30b-chat                1026            1055                 1082                1063
8                 vicuna-7b                1024            1009                  984                1044
9               guanaco-33b                1023             997                 1000                1056
10               vicuna-13b                1021            1055                 1069                1061
11                   palm-2                 989             980                  953                 979
12                koala-13b                 974             969                  970                 968
13         RWKV-4-Raven-14B                 953             946                  920                 932
14              mpt-7b-chat                 947             945                  940                 913
15       gpt4all-13b-snoozy                 941             938                  943                 918
16               chatglm-6b                 928             926                  963                 892
17               alpaca-13b                 924             933                  931                 889
18         oasst-pythia-12b                 920             911                  908                 861
19  stablelm-tuned-alpha-7b                 913             902                  846                 796
20           fastchat-t5-3b                 884             874                  879                 814
21                llama-13b                 865             831                  813                 772
22             dolly-v2-12b                 864             846                  821                 769

                      Model  Elo rating_default  Elo rating_shuffle1  Elo rating_shuffle2  Elo rating_shuffle3
1                 claude-v1                1201                 1182                 1169                 1194
2                     gpt-4                1185                 1217                 1201                 1203
3             gpt-3.5-turbo                1158                 1112                 1125                 1131
4         claude-instant-v1                1138                 1112                 1143                 1144
5                vicuna-33b                1088                 1115                 1101                 1112
6              wizardlm-13b                1032                 1042                 1034                 1038
7              mpt-30b-chat                1026                 1035                 1075                 1038
8                 vicuna-7b                1024                 1021                 1003                  981
9               guanaco-33b                1023                 1038                 1014                 1042
10               vicuna-13b                1021                 1058                 1060                 1055
11                   palm-2                 989                 1035                 1023                 1037
12                koala-13b                 974                  994                  973                  979
13         RWKV-4-Raven-14B                 953                  938                  972                  938
14              mpt-7b-chat                 947                  970                  953                  963
15       gpt4all-13b-snoozy                 941                  986                  990                  979
16               chatglm-6b                 928                  930                  893                  872
17               alpaca-13b                 924                  921                  958                  916
18         oasst-pythia-12b                 920                  922                  909                  930
19  stablelm-tuned-alpha-7b                 913                  850                  852                  861
20           fastchat-t5-3b                 884                  887                  890                  903
21                llama-13b                 865                  804                  823                  830
22             dolly-v2-12b                 864                  831                  840                  856

                      Model  Elo rating_K=4  Elo rating_K=2  Elo rating_K=8  Elo rating_K=16
1                 claude-v1            1201            1176            1230             1262
2                     gpt-4            1185            1176            1191             1197
3             gpt-3.5-turbo            1158            1140            1163             1156
4         claude-instant-v1            1138            1129            1152             1166
5                vicuna-33b            1088            1089            1089             1094
6              wizardlm-13b            1032            1038            1018              996
7              mpt-30b-chat            1026            1031            1024             1027
8                 vicuna-7b            1024            1025            1027             1036
9               guanaco-33b            1023            1028            1009              981
10               vicuna-13b            1021            1032            1013             1012
11                   palm-2             989             985            1003             1024
12                koala-13b             974             985             956              943
13         RWKV-4-Raven-14B             953             959             948              944
14              mpt-7b-chat             947             950             946              945
15       gpt4all-13b-snoozy             941             949             937              943
16               chatglm-6b             928             936             920              912
17               alpaca-13b             924             933             922              928
18         oasst-pythia-12b             920             921             916              908
19  stablelm-tuned-alpha-7b             913             894             937              952
20           fastchat-t5-3b             884             898             868              858
21                llama-13b             865             860             874              883
22             dolly-v2-12b             864             867             856              837

from collections import defaultdict
import json, math
import numpy as np
import pandas as pd

def compute_elo(battles, K=4, SCALE=400, BASE=10, INIT_RATING=1000):
    rating = defaultdict(lambda: INIT_RATING)

    for rd, model_a, model_b, winner in battles[['model_a', 'model_b', 'winner']].itertuples():
        ra = rating[model_a]
        rb = rating[model_b]
        ea = 1 / (1 + BASE ** ((rb - ra) / SCALE))
        eb = 1 / (1 + BASE ** ((ra - rb) / SCALE))
        if winner == "model_a":
            sa = 1
        elif winner == "model_b":
            sa = 0
        elif winner == "tie" or winner == "tie (bothbad)":
            sa = 0.5
        else:
            raise Exception(f"unexpected vote {winner}")
        rating[model_a] += K * (sa - ea)
        rating[model_b] += K * (1 - sa - eb)

    return rating

def preety_print_elo_ratings(ratings, ratings2, ratings3, ratings4, columns):
    df = pd.DataFrame([
        [n, ratings[n], ratings2[n], ratings3[n], ratings4[n]] for n in ratings.keys()
    ], columns=["Model"]+columns).sort_values(columns[0], ascending=False).reset_index(drop=True)
    
    for s in columns:
        df[s] = (df[s] + 0.5).astype(int)
    df.index = df.index + 1
    return df

filename = './clean_battle_20230717.json'
raw_data = pd.read_json(filename).sort_values(ascending=True, by=["tstamp"])
battles = raw_data[raw_data['anony']].reset_index(drop=True)
battles_no_anony = raw_data[~raw_data['anony']].reset_index(drop=True)
battles_no_ties = battles[~battles["winner"].str.contains("tie")]

elo_ratings  = compute_elo(battles)
elo_ratings2 = compute_elo(raw_data)
elo_ratings3 = compute_elo(battles_no_anony)
elo_ratings4 = compute_elo(battles_no_ties)
columns =  ["Elo rating_default", "Elo rating_all", "Elo rating_no_anony", "Elo rating_no_ties"]
print(preety_print_elo_ratings(elo_ratings, elo_ratings2, elo_ratings3, elo_ratings4, columns))

elo_ratings  = compute_elo(battles)
elo_ratings2 = compute_elo(battles.sample(frac=1))
elo_ratings3 = compute_elo(battles.sample(frac=1))
elo_ratings4 = compute_elo(battles.sample(frac=1))
columns =  ["Elo rating_default", "Elo rating_shuffle1", "Elo rating_shuffle2", "Elo rating_shuffle3"]
print(preety_print_elo_ratings(elo_ratings, elo_ratings2, elo_ratings3, elo_ratings4, columns))

elo_ratings  = compute_elo(battles, K=4)
elo_ratings2 = compute_elo(battles, K=2)
elo_ratings3 = compute_elo(battles, K=8)
elo_ratings4 = compute_elo(battles, K=16)
columns =  ["Elo rating_K=4", "Elo rating_K=2", "Elo rating_K=8", "Elo rating_K=16"]
print(preety_print_elo_ratings(elo_ratings, elo_ratings2, elo_ratings3, elo_ratings4, columns))

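Elo over time

I plotted how the Elo ratings evolve over time with an initial rating of 1000 and K=4.
Many models lose rating as time goes on. This is presumably because stronger new models keep being added, which pushes the older ones down in relative terms. If asked whether the high past rating or the low current rating is the correct one, the right answer is that both are. What differs is the pool of models being compared, then and now.

Curiously, around time steps 40000 to 45000 the ratings of the top models drop abruptly by about 100, while the ratings of the lower models rise. New LLMs are added around steps 37000 and 43000.

Removing the updates from tie votes shrinks this drop, so it is the tie updates that drive it.
A charitable reading is that the new LLMs are still far from converged, so a tie against one of them is treated as a tie against a much weaker opponent and the established model loses rating. A less charitable reading is that human judgment falls short of the LLMs, so voters cannot tell which answer is actually better.

・Default time series

・With tie data removed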
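
The charitable reading can be made concrete with one update step. In this sketch the ratings 1200 and 1000 are illustrative assumptions (an established top model and a newcomer still at the initial rating), and `tie_update` is a helper name introduced here.

```python
def tie_update(r_self, r_opp, K=4):
    """Rating change for r_self after a tie against r_opp."""
    expected = 1 / (1 + 10 ** ((r_opp - r_self) / 400))
    return K * (0.5 - expected)

# An established model at 1200 ties a newcomer still sitting at the initial 1000
print(round(tie_update(1200, 1000), 2))
```

Each such tie costs the established model about 1 point and gives the newcomer the same amount, so a run of ties against freshly added models can add up to a visible dip until the newcomers' ratings catch up.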

model_list = []
model_rd = []
for rd, model_a, model_b in battles[['model_a', 'model_b']].itertuples():
    if model_a not in model_list:
        model_list.append(model_a)
        model_rd.append(rd)
        print(rd, model_a)
    if model_b not in model_list:
        model_list.append(model_b)
        model_rd.append(rd)
        print(rd, model_b)
print(len(battles))

x = []
y = {}
y2 = {}
y3 = {}
for model in model_list:
    y[model] = []
    y2[model] = []
    y3[model] = []

n = len(model_list)
Elo_pred = np.ones(n) * 1000
count = np.zeros(n)
draw_count = np.zeros(n)
for rd, model_a, model_b, winner in battles[['model_a', 'model_b', 'winner']].itertuples():
    i, j = model_list.index(model_a), model_list.index(model_b)
    count[i] += 1
    count[j] += 1
    a = np.log(10)
    K = 4
    R_AB = Elo_pred[i] - Elo_pred[j]
    W_AB = 1/(1+np.exp(-a*R_AB/400))
    if winner == "model_a":
        s = 1
    elif winner == "model_b":
        s = 0
    elif winner == "tie" or winner == "tie (bothbad)":
        s = 0.5
        draw_count[i] += 1
        draw_count[j] += 1
        #continue  # uncomment to skip the rating update for ties (the tie-removed plot)
    
    Elo_pred[i] += K * (s - W_AB)
    Elo_pred[j] -= K * (s - W_AB)
    if rd%5==0:
        x.append(rd)
        for model in model_list:
            y[model].append(Elo_pred[model_list.index(model)])
            y2[model].append(count[model_list.index(model)])
            y3[model].append(draw_count[model_list.index(model)])

import matplotlib.pyplot as plt

cmap = plt.get_cmap("tab20")
for index, model in enumerate(model_list):
    plt.plot(x, y[model], color=cmap(index))
for index, model in enumerate(model_list):
    plt.scatter(model_rd[index], 1000, color=cmap(index))
plt.legend(model_list, loc='upper left', bbox_to_anchor=(0.85, 1))
plt.show()

for index, model in enumerate(model_list):
    plt.plot(x, y2[model], color=cmap(index))
for index, model in enumerate(model_list):
    plt.scatter(model_rd[index], 0, color=cmap(index))
plt.legend(model_list, loc='upper left', bbox_to_anchor=(0.85, 1))
plt.show()

for index, model in enumerate(model_list):
    plt.plot(x, y3[model], color=cmap(index))
for index, model in enumerate(model_list):
    plt.scatter(model_rd[index], 0, color=cmap(index))
plt.legend(model_list, loc='upper left', bbox_to_anchor=(0.85, 1))
plt.show()

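Hanlon's razor

Hanlon's razor is the principle "never attribute to malice that which is adequately explained by incompetence."
Applied here, it says that the unreliable tie votes mentioned above simply reflect human limitations, and that nobody deliberately voted the wrong way to drag down a particular LLM or set up a bot to answer at random to wreck the Leaderboard.

That said, ChatGPT's answers have recognizable habits such as indentation and bold text, so it is (probably) possible to guess the model even when its name is hidden, and manipulating votes to push ChatGPT up (or down) should not be impossible. With some practice, one could likely also guess the LLM from how quickly the output arrives.

Does it reflect benchmark performance?

On various benchmarks, gpt-4o and gpt-4o mini show a clear performance gap, and Llama 3.1 405B scores on par with gpt-4o, so on benchmarks the order is gpt-4o≒Llama 3.1 405B>gpt-4o mini.
On the Chatbot Arena LLM Leaderboard, however, the order is gpt-4o>gpt-4o mini>Llama 3.1 405B, by a small margin, which differs from the benchmark order.
My impression is that, correctness aside, the gpt-4 family tends to produce answers that users rate highly.

Summary

I have been looking into Elo ratings recently, and this post collects the questionable points I noticed in the Chatbot Arena data analysis. Personally, I am most bothered by the handling of ties involving models that have barely been evaluated and by the fact that new models have hardly fought certain old models. The analysis code papers over this with Bootstrap, averaging Elo ratings over randomly shuffled orderings, but the data itself has suspicious points.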
