
ChatGPT Can't Do Multiplication

Last updated at 2023-03-24. Posted at 2023-02-05.

ChatGPT is amazing, isn't it? What is amazing about it, in the opinion of this human, is that it lies with a straight face even though it is an AI.

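As everyone knows, the answer to this multiplication is the 67th Mersenne number $2^{67} - 1$, which is 147,573,952,589,676,412,927. Incidentally, ChatGPT gives a different answer every time you ask.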
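For reference, the correct product is easy to check with Python's arbitrary-precision integers. This is a minimal sketch that assumes the multiplication in question is Frank Nelson Cole's factorization of $2^{67} - 1$, namely 193,707,721 × 761,838,257,287; the variable names are chosen here for illustration.

```python
# Assumption: the two factors are those of Cole's factorization of 2**67 - 1.
x = 193_707_721
y = 761_838_257_287

product = x * y  # Python integers never overflow, so this is exact

print(f"{product:,}")     # the product with thousands separators
print(len(str(product)))  # the number of decimal digits

# The product should be exactly the Mersenne number 2**67 - 1.
assert product == 2**67 - 1
```

Comparing the digit count of a model's answer with `len(str(product))` is a quick first sanity check before comparing the digits themselves.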

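It admits that its answer was wrong, using the con artist's technique of mixing a little truth into the lies, but the corrected answer is still a lie. It is a shameless fibber. It doesn't even get the number of digits right.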

This raises the following question: up to how many digits can ChatGPT multiply?

Here, let us simplify the problem and consider it as follows.

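Let $a_n = 10^n - 1$. That is, $a_n$ is the largest positive integer with $n$ decimal digits (for example, $a_1 = 9$, $a_2 = 99$, $a_3 = 999$).
Have ChatGPT compute $a_n \times a_n$ 100 times. If the accuracy is 50% or higher, we judge that it can do $n$-digit multiplication.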
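Since $a_n^2 = 10^{2n} - 2 \cdot 10^n + 1$, the correct answers have a very regular shape: $n - 1$ nines, an 8, $n - 1$ zeros, and a 1 (81, 9801, 998001, and so on). The following sketch checks that pattern against exact integer arithmetic; the function name `expected_square` is made up for this illustration.

```python
def expected_square(n):
    """Return a_n ** 2 as a string, built only from its digit pattern."""
    return "9" * (n - 1) + "8" + "0" * (n - 1) + "1"


for n in range(1, 9):
    a = 10**n - 1
    # Compare the pattern with the exact product.
    assert str(a * a) == expected_square(n)
    print(n, a * a)
```

This regularity is one reason why the test is a simplification: a model might get these particular products right or wrong for reasons that do not carry over to arbitrary operands.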

So, let's quickly write this procedure in Python. Before that, a few notes.

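As pointed out in the article "Apps (services) that claim to use ChatGPT may not be using ChatGPT" (ChatGPTを使ったアプリ(サービス)、ChatGPTじゃない説), strictly speaking this experiment uses the OpenAI API, not ChatGPT.
The AI is continually being improved. The code and results are those at the time of writing (February 5, 2023).
The article "I heard ChatGPT's mathematical ability had improved, so I tested it" (ChatGPTの数学的な能力が向上したというのでテストしてみた) tries many things besides multiplication.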

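The code is given below. On Google Colaboratory, you first need to set your OpenAI API key and install the library, for example:

%env OPENAI_API_KEY=sk-1UiX****************************************YZ3Y
!pip install "openai<1"

The version pin is needed because the code in this article uses the openai.Completion and openai.ChatCompletion interfaces, which were removed in version 1.0 of the library.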

import openai
import time
import matplotlib.pyplot as plt


def check(n):
    """Check if the AI answers an n-digit multiplication correctly."""
    x = 10**n - 1
    y = x
    prompt = (
        f"Compute {x} multiplied by {y}. "
        "Output the answer using only decimal letters from 0 to 9. "
        "Don't output any other characters, e.g., decimal separators."
    )
    while True:
        try:
            response = openai.Completion.create(
                model="text-davinci-003",
                prompt=prompt,
            )
            return int(response.choices[0].text) == x * y
        except Exception as e:
            # Rarely, ServiceUnavailableError or ValueError (invalid decimal
            # format) occurs.
            print(f"Error ignored: {e}")
            time.sleep(10)  # for ServiceUnavailableError

n = range(1, 9)
m = 100
delay = 4  # to avoid rate limit errors

results = []

for i in n:
    count = 0
    for _ in range(m):
        if check(i):
            count += 1
        time.sleep(delay)
    results.append(count / m)

# standard error
errors = [(p * (1 - p) / m) ** 0.5 for p in results]

plt.figure(dpi=150)
plt.errorbar(n, results, errors, fmt="o-", capsize=5)
plt.xlabel("$n$")
plt.ylabel("accuracy")
plt.xticks(n)
plt.grid()
plt.show()
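The error bars in the script above are the standard error of a binomial proportion, $\sqrt{p(1-p)/m}$, where $p$ is the measured accuracy and $m$ the number of trials. As a small sketch of what that means for $m = 100$ (the helper name `standard_error` is introduced here only for illustration):

```python
def standard_error(p, m):
    """Standard error of a proportion p estimated from m independent trials."""
    return (p * (1 - p) / m) ** 0.5


m = 100
for p in (0.0, 0.5, 0.9, 1.0):
    print(f"p = {p:.1f}: standard error = {standard_error(p, m):.3f}")
```

The error is largest at $p = 0.5$, where it equals 0.05 for 100 trials, and it vanishes at $p = 0$ and $p = 1$, so points at exactly 0% or 100% accuracy are drawn without a visible error bar.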

The results, plotted as a graph, were as follows.

Up to $n=3$, the accuracy is 100%.
At $n=4$, the accuracy drops sharply.
For some reason, the accuracy at $n=5$ and $n=6$ is nearly 100%¹.

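Ideally I should also test multiplications of numbers other than $a_n$, as well as multiplications of numbers with different digit counts, but considering the effort, the time, and the cost of using the API, I will summarize the conclusion of this experiment as follows.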

ChatGPT cannot multiply numbers with four or more digits (at least, not always).

Incidentally, the author often gets even two-digit multiplications wrong when doing them mentally. ChatGPT has already surpassed humans.

[Added March 12, 2023] OpenAI has released an API for ChatGPT, so I redid the experiment. While I was at it, I also made the multiplication problems slightly random.

import openai
import random
import time
import matplotlib.pyplot as plt


def check(n, verbose=False):
    """Check if the AI answers an n-digit multiplication correctly."""
    a = 10**n - 1
    # Choose random numbers x and y such that they are approximately a
    # but x != y.
    x = y = 0
    while x == y:
        x = random.randint(a - 9, a)
        y = random.randint(a - 9, a)
    messages = [
        {
            "role": "system",
            "content": "You are an accurate calculator. "
            "Answer using only decimal letters from 0 to 9. "
            "Don't output any other characters, for example, "
            "decimal separators or punctuations.",
        },
        # We give example responses.
        {"role": "user", "content": "Compute 12345 plus 67890."},
        {"role": "assistant", "content": "80235"},
        {"role": "user", "content": "Compute 12345 multiplied by 67890."},
        {"role": "assistant", "content": "838102050"},
        # Here is the problem.
        {"role": "user", "content": f"Compute {x} multiplied by {y}."},
    ]
    while True:
        try:
            response = openai.ChatCompletion.create(
                model="gpt-3.5-turbo", messages=messages
            )
            if verbose:
                print(x, y, response)  # for debugging
            result = response.choices[0]["message"]["content"]
            if result[-1] == ".":
                result = result[:-1]
            return int(result) == x * y
        except Exception as e:
            print(f"Error ignored: {e}")
            time.sleep(10)  # for ServiceUnavailableError


n = range(1, 9)
m = 100
delay = 4  # to avoid rate limit errors

results = []

for i in n:
    count = 0
    for _ in range(m):
        if check(i):
            count += 1
        time.sleep(delay)
    results.append(count / m)

# standard error
errors = [(p * (1 - p) / m) ** 0.5 for p in results]

plt.figure(dpi=150)
plt.errorbar(n, results, errors, fmt="o-", capsize=5)
plt.xlabel("$n$")
plt.ylabel("accuracy")
plt.xticks(n)
plt.grid()
plt.show()
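Note that the script above only strips a single trailing period; any other malformed reply raises `ValueError` inside `int()` and the request is retried, so unparsable replies are never counted as wrong. One possible variation, which is not what the script above does, is to normalize the reply separately and treat anything unparsable as a missing answer. The helper `parse_answer` below is hypothetical and sketched under that assumption.

```python
import re


def parse_answer(text):
    """Return the integer in a model reply, or None if it is not a plain number.

    Leading/trailing whitespace, trailing periods, and thousands separators
    (commas) are tolerated; anything else makes the reply unparsable.
    """
    cleaned = text.strip().rstrip(".").replace(",", "")
    if re.fullmatch(r"[0-9]+", cleaned):
        return int(cleaned)
    return None


print(parse_answer("998001."))           # reply with a trailing period
print(parse_answer(" 9,801\n"))          # reply with a thousands separator
print(parse_answer("The answer is 81"))  # not a bare number
```

Whether an unparsable reply should count as an error or be retried is a design choice that changes the measured accuracy, so it should be stated along with the results.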

The results were as follows.

Hmm, ChatGPT's arithmetic still looks shaky².

[Added March 25, 2023] As Microsoft Research writes, having the model slowly write out the intermediate calculations step by step seems to raise the accuracy.

In a way, this shows how GPT-4 has an incredibly short working memory for this type of problem. However, if GPT-4 "takes its time" to answer the question then the accuracy easily goes up.

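Come to think of it, even Professor Cole took about an hour to carry out the multiplication at the top of this article. Let's have ChatGPT take its time with the calculation too³.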

See, you can do it if you try, ChatGPT⁴!
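The polynomial in footnote 4 suggests that the step-by-step calculation treated each factor as a polynomial in $x = 10$ whose coefficients are its decimal digits, multiplied the two polynomials, and then propagated the carries. The prompt itself is not reproduced here, so the following is only a sketch of that idea under this assumption, using Cole's factors 193,707,721 and 761,838,257,287; all function names are chosen for illustration.

```python
def digits_low_to_high(n):
    """Decimal digits of n, least significant digit first."""
    return [int(d) for d in reversed(str(n))]


def convolve(a, b):
    """Coefficients of the product of two polynomials (lowest degree first)."""
    c = [0] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            c[i + j] += ai * bj
    return c


def propagate_carries(coeffs):
    """Evaluate the polynomial at x = 10 by carrying from the lowest degree up."""
    digits = []
    carry = 0
    for c in coeffs:
        carry, digit = divmod(c + carry, 10)
        digits.append(digit)
    while carry:
        carry, digit = divmod(carry, 10)
        digits.append(digit)
    return int("".join(str(d) for d in reversed(digits)))


x, y = 193_707_721, 761_838_257_287
coeffs = convolve(digits_low_to_high(x), digits_low_to_high(y))
print(list(reversed(coeffs)))  # coefficients from x**19 down to the constant term

result = propagate_carries(coeffs)
assert result == x * y == 2**67 - 1
print(result)
```

The intermediate coefficients are small (at most a few hundred), which is presumably why spelling them out one at a time is easier for the model than producing all 21 digits of the product at once.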

¹ I suspect this depends on the prompt used.
² ChatGPT seems to be poor at arithmetic, but this can be overcome by combining ChatGPT with other tools. For example: https://huggingface.co/spaces/JavaFXpert/Chat-GPT-LangChain
³ The MSR paper is about GPT-4, but here I use the earlier ChatGPT.
⁴ The polynomial in Step 13 should be
$7x^{19}+69x^{18}+76x^{17}+84x^{16}+120x^{15}+115x^{14}+230x^{13}+131x^{12}+196x^{11}+179x^{10}+176x^9+218x^8+169x^7+173x^6+124x^5+89x^4+116x^3+67x^2+22x+7$
but it is not. Still, the final result is somehow correct, so let's call it good.
