
How to Summarize Papers in Bulk with the ChatGPT API

Posted at 2024-02-19

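Introduction

When reading papers, I used to copy and paste their contents into ChatGPT, but that got tedious, so I automated it with the API.
For text extraction I used GROBID via scipdf this time, and for summarization I used GPT-4.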

What I summarized

I summarized Krizhevsky et al., 2012, the paper that brought deep learning to the world's attention.
ImageNet Classification with Deep Convolutional Neural Networks

Summary results

The summaries come out like this.

What I did

Loading the paper (PDF)

I used scipdf to extract the Abstract and Sections from the PDF.
Note that GROBID does not support Windows (https://grobid.readthedocs.io/en/latest/Troubleshooting/).
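
Before sending anything to the API, it can help to check what scipdf actually extracted. The following is a minimal sketch, assuming the dict returned by `scipdf.parse_pdf_to_dict` has the `abstract` and `sections` (`heading`, `text`) keys used in the script in this post. `outline` and `parse_and_outline` are hypothetical helper names, and the parse itself needs a running GROBID server.

```python
def outline(article_dict):
    """Return (heading, character count) pairs for the abstract and each section."""
    rows = [("Abstract", len(article_dict.get("abstract") or ""))]
    for sec in article_dict.get("sections", []):
        heading = sec.get("heading") or "(no heading)"
        rows.append((heading, len(sec.get("text") or "")))
    return rows


def parse_and_outline(pdf_filename):
    # Imported here so that outline() can be used without scipdf installed.
    import scipdf

    # Assumption: a GROBID server is reachable, as the main script also requires.
    article_dict = scipdf.parse_pdf_to_dict(pdf_filename)
    for heading, n_chars in outline(article_dict):
        print(f"{n_chars:>7}  {heading}")
    return article_dict
```

Sections with a character count of zero, or with a missing heading, are worth a look before paying for a summary of them.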

Setting up the ChatGPT API

Sign up
https://platform.openai.com/
Get an API key
https://platform.openai.com/account/api-keys
The API is a paid service.
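
Rather than pasting the key into the source file, one option is to export `OPENAI_API_KEY` in the shell and fail early when it is missing. This is a minimal sketch; `require_api_key` is a hypothetical helper, not part of the openai package. The `OpenAI()` client reads `OPENAI_API_KEY` from the environment by default, so the helper only adds a clearer error message.

```python
import os


def require_api_key(env=None, name="OPENAI_API_KEY"):
    """Return the API key from the environment, or raise with a clear message."""
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before running the script")
    return key
```

It could be used as `client = OpenAI(api_key=require_api_key())`, which keeps the key out of the script and out of version control.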

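Instructions to ChatGPT

Pass in the Abstract and Sections, and instruct ChatGPT as follows:
・For the abstract, summarize the objectives, methods, results, and conclusions as bullet points
・Explain each section in 2 to 6 bullet-point lines
・For a more detailed read, split each section into paragraphs on '\n' and pass them in one at a time.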

import os
import scipdf
import docx
from openai import OpenAI

# Set the API key as an environment variable (OpenAI() reads OPENAI_API_KEY by default)
os.environ['OPENAI_API_KEY'] = '<Your OPENAI API KEY>'

# Instantiate the OpenAI client
client = OpenAI()

def summarize_section(client, text):
    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": f"Summarize the following in 2~6 lines in Japanese using bullet points. \n{text} ",
            }
        ],
        model="gpt-4",
        temperature=0
    )
    return chat_completion.choices[0].message.content


def summarize_abstract(client, text):
    chat_completion = client.chat.completions.create(
        messages=[
            {
                "role": "user",
                "content": f"Summarize the study objectives, methodology, results, and conclusions in Japanese using bullet points from the following . \n{text} ",
            }
        ],
        model="gpt-4",
        temperature=0
    )
    return chat_completion.choices[0].message.content

def main(pdf_filename):
    article_dict = scipdf.parse_pdf_to_dict(pdf_filename)
    doc = docx.Document()
    doc.add_heading('Abstract', level=1)

    # Summarize the abstract
    abstract_summary = summarize_abstract(client, article_dict['abstract'])
    print(abstract_summary)
    doc.add_paragraph(abstract_summary)

    for sec in article_dict['sections']:
        doc.add_heading(sec['heading'], level=1)
        section_text = sec['text']
        print(section_text)

        # Summarize the section
        section_summary = summarize_section(client, section_text)
        print(section_summary)
        doc.add_paragraph(section_summary)

    # Strip only the final extension so dots elsewhere in the path are preserved
    output_filename = f"{os.path.splitext(pdf_filename)[0]}_gpt4_abst.docx"
    doc.save(output_filename)
    print(f"Document saved as {output_filename}")

if __name__ == '__main__':
    pdf_filename = 'NIPS-2012-imagenet-classification-with-deep-convolutional-neural-networks-Paper.pdf'
    main(pdf_filename)
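
For the finer-grained reading mentioned in the instructions, each section can be split on '\n' before summarizing. This is a minimal sketch, assuming the paragraphs in `sec['text']` are separated by newlines. `split_paragraphs` and its `min_chars` threshold are hypothetical, and the threshold is only there to avoid sending very short fragments (stray captions, for example) as separate requests.

```python
def split_paragraphs(section_text, min_chars=200):
    """Split section text on newlines, merging short fragments into the previous paragraph."""
    paragraphs = []
    for raw in section_text.split("\n"):
        piece = raw.strip()
        if not piece:
            continue
        if paragraphs and len(piece) < min_chars:
            # Too short to summarize alone: attach it to the previous paragraph.
            paragraphs[-1] = paragraphs[-1] + " " + piece
        else:
            paragraphs.append(piece)
    return paragraphs
```

Inside `main`, the section loop could then call `summarize_section(client, para)` for each `para` in `split_paragraphs(sec['text'])`. Keep in mind that this makes one API call per paragraph, so the cost grows accordingly.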

Comparing GPT-4 and GPT-3.5-turbo

Pricing
https://openai.com/pricing
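
To get a feel for what one paper costs, the token counts reported by the API can be multiplied by the current prices. This is a minimal sketch; `estimate_cost_usd` is a hypothetical helper, and the prices are parameters on purpose because they change over time and are not hard-coded here. It assumes prices are quoted per 1,000 tokens, so adjust the divisor if the pricing page uses a different unit.

```python
def estimate_cost_usd(prompt_tokens, completion_tokens,
                      input_price_per_1k, output_price_per_1k):
    """Estimate the cost of one request from token counts and per-1K-token prices."""
    input_cost = (prompt_tokens / 1000) * input_price_per_1k
    output_cost = (completion_tokens / 1000) * output_price_per_1k
    return input_cost + output_cost
```

The token counts are available on each response as `chat_completion.usage.prompt_tokens` and `chat_completion.usage.completion_tokens`, so summing this over the abstract and all sections gives a per-paper estimate.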

GPT-4

GPT-3.5-turbo

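GPT-4 does a better job of adding headings and is slightly more detailed, but the difference is not large. Given the price, 3.5 seems good enough for this.
In my own field (earth science), GPT-3.5-turbo produced a lot of unnatural Japanese, so it may simply be well trained on machine learning topics.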
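
To repeat this comparison on another paper, the model name can be made a parameter instead of being hard-coded. This is a minimal sketch that reuses the prompt from `summarize_section`; `build_section_request` and `summarize_with_models` are hypothetical helpers, and the model names are the two compared in this post.

```python
def build_section_request(text, model="gpt-4"):
    """Build keyword arguments for client.chat.completions.create()."""
    prompt = (
        "Summarize the following in 2~6 lines in Japanese using bullet points. \n"
        f"{text} "
    )
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 0,
    }


def summarize_with_models(client, text, models=("gpt-4", "gpt-3.5-turbo")):
    """Return {model name: summary} so the outputs can be compared side by side."""
    results = {}
    for model in models:
        completion = client.chat.completions.create(**build_section_request(text, model))
        results[model] = completion.choices[0].message.content
    return results
```

Running this on a section from a paper in your own field is a quick way to check whether the cheaper model is good enough there.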

Other notes

I'd like to be able to extract figures and tables as well.
