More than 1 year has passed since last update.

ChatGPT の性能は以前より低下したのか？

Posted at 2023-07-20

概要

7 月 18 日に “How Is ChatGPT’s Behavior Changing over Time?” という論文が公開され、「ChatGPT の性能が以前より悪くなっている」と SNS などで話題になりました。

しかしこの論文の中身をよく見てみると、「GPT の振る舞いが大きく変わっている」のは事実ですが、性能は悪くなるどころかむしろ若干良くなっていると言えます。

論文の実験結果

論文では、Chat Completions API の 3 月のバージョンと 6 月のバージョン (gpt-4-0613, gpt-4-0314, gpt-3.5-turbo-0613, gpt-3.5-turbo-0301) を比較して、4 種類のタスクでテストしています。

特に GPT-4 の結果を抜き出すと以下の通りです。

与えられた整数が素数か否か、2 択で答える問題の正答率が 97.6% から 2.4% に悪化した
生成したコードを LeetCode に直接入力して正解できる確率が 52% から 10% に悪化した
違法性のあるセンシティブな質問に答えてしまう率が 21% から 5% に下がった（良くなった）
2 次元配列パズルの正答率が 24.6% から 27.4% になった

この内の 1, 2 をもって「GPT-4 が劣化した」と解釈する人が多いため、以降で詳しく掘り下げて見ます。

実験設定の問題点

素数判定

素数判定に使われているデータは、GitHub レポジトリの PRIME_EVAL.csv で確認することができます。
これは別の論文のデータセットが元になっていて、Is 17077 a prime number? というような質問文が計 500 個含まれています。

しかしこのデータの中には、与えられた整数が素数であるケース、つまり答えが Yes になる問題しかありません。
そのため、デタラメに全て Yes と答えるだけでも正答率 100% になるので、評価用のデータセットとして相応しくありません。

そこで、答えが No になるケースについてもテストしたらどうなるか、既に実験した記事があります。

この記事によると、gpt-4-0314 は与えられた数字が何だろうと素数だと回答する頻度が高く、 gpt-4-0613 は素数ではないと回答する頻度が高い、という結果でした。

すなわち、どちらのバージョンでも素数判定は苦手であることに変わり無く、Yes と答えるか No と答えるかの傾向だけが変わっている、ということです。

コード生成

コード生成タスクのデータは LEETCODE_EASY_EVAL.csv にあります。
LeetCode の問題を以下の形式のプロンプトで GPT に渡しています。

# [Check if The Number is Fascinating][title]

## Description
You are given an integer `n` that consists of exactly `3` digits.
We call the number `n` **fascinating** if, after the following modification,
the resulting number contains all the digits from `1` to `9` **exactly** once
and does not contain any `0`'s:

  * **Concatenate** `n` with the numbers `2 * n` and `3 * n`.

Return `true` _if_`n` _is fascinating, or_`false` _otherwise_.

**Concatenating** two numbers means joining them together. For example, the
concatenation of `121` and `371` is `121371`.

**Example 1:**
            Input: n = 192    Output: true    Explanation: We concatenate the numbers n = 192 and 2 * n = 384 and 3 * n = 576. The resulting number is 192384576. This number contains all the digits from 1 to 9 exactly once.    

**Example 2:**
            Input: n = 100    Output: false    Explanation: We concatenate the numbers n = 100 and 2 * n = 200 and 3 * n = 300. The resulting number is 100200300. This number does not satisfy any of the conditions.    

**Constraints:**
  * `100 <= n <= 999`

**Tags:** Hash Table, Math

**Difficulty:** Easy

## 思路

[title]: https://leetcode.com/problems/check-if-the-number-is-fascinating

Solve it by filling in the following python code. Generate the code only without any other text.
class Solution(object):
    def isFascinating(self, n):
        """"""
        :type n: int
        :rtype: bool
        """"""

これに対する返答文を、特にパースなどせずそのまま LeetCode に送信して、実際にテストを通過するか評価します。

最新バージョンの GPT-4 だと、下記のように Markdown のコードブロックで囲む傾向があるため、ほとんどが構文エラーで LeetCode に弾かれます。

```python
class Solution(object):
  def isFascinating(self, n):
    # Concatenate n, 2*n and 3*n
    s = str(n) + str(n*2) + str(n*3)
    # Check if the length of s is 9 and contains all digits from 1 to 9
    return len(s) == 9 and set(s) == set('123456789')
```

では、GPT の返答文の中から正規表現でコードだけ抽出して評価するとどうなるでしょうか？実験した結果を以下のレポジトリで公開しています。

正答率は大きく変わって 70% で、旧バージョンよりもむしろ良い値になりました。

まとめ

この論文から「GPT の性能が以前より悪くなっている」と結論付けるのは無理があります。
その一方で、「アプリケーションの実装やプロンプトを固定していても、モデルのマイナーバージョンを更新した瞬間に精度が大幅に下がる」というケースは実際に有るので、開発にあたって注意が必要になるでしょう。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up