"In Search of Valuable Papers": Extracting Body Text from PDFs with Python (1/n)

Last updated at 2024-09-12. Posted at 2024-07-16.

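Motivation

Researchers should not have to spend time on anything other than research.

Good research results also depend on the whole process of finding good papers, reading them closely, understanding them, and producing output from them.

This series therefore focuses on "finding the papers that are valuable to me",
with the goal of having ChatGPT summarize and compare papers automatically.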

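What this post covers

As a first step, we make it possible to convert a paper PDF containing both text and images into a JSON file.

Specification

The input is PDF only, and the output is JSON only.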

Required libraries

pypdf
re (standard library)
json (standard library)

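Detailed specification

The title and author names are assumed to be on page 0.
1. The title is assumed to come first and to contain no line break.
2. The string "Abstract" is searched for, and the lines before it are extracted as the author names.
If a line ends with "-", the hyphen is removed.
/uniXXXXXXXX is removed (a string that designates an image?).
Runs of consecutive whitespace are collapsed into a single space.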
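
The three cleanup rules above can be tried on a few made-up lines. The following is a minimal sketch, not the code from the appendix: the function name `clean_lines` and the sample strings are hypothetical, and it assumes that lines not ending in a hyphen should be joined with a single space.

```python
import re

# Patterns for the two regex-based rules
UNI_PATTERN = re.compile(r"/uni[0-9a-fA-F]{8}")  # /uniXXXXXXXX tokens
SPACE_PATTERN = re.compile(r"\s+")               # runs of whitespace


def clean_lines(lines):
    """Join extracted lines into one string, applying the three rules."""
    parts = []
    for line in lines:
        line = UNI_PATTERN.sub("", line)
        line = SPACE_PATTERN.sub(" ", line).strip()
        if line.endswith("-"):
            # Word hyphenated across a line break: drop the hyphen, join directly
            parts.append(line[:-1])
        elif line:
            # Assumption: ordinary lines are separated by one space
            parts.append(line + " ")
    return "".join(parts).strip()


# Hypothetical sample lines, as they might come out of a PDF page
sample = ["We propose a real-time detec-", "tor   that /uni00000013 runs", "end-to-end."]
print(clean_lines(sample))
```

Note that a hyphen is only dropped at the end of a line, so hyphens inside a line (as in "real-time") are kept.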

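Next time

I plan to build a feature that automatically fetches papers from arXiv.

Appendix

The code used in the experiment.
I will put it on GitHub when I get around to it.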

# %%
# !pip install pypdf
pip install pypdf

# %%
# !ls *.pdf
ls *.pdf

pdf_extructor.py

# %%
# Import the library
from pypdf import PdfReader
pdf_name = "2304.08069v3"
# Load the PDF file
reader = PdfReader(f"{pdf_name}.pdf")

# Get the number of pages
number_of_pages = len(reader.pages)

# Get a page; in this case, the first page (index 0)
page = reader.pages[0]

# Extract the text
text = page.extract_text()

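# %%
import re

# Compile the patterns once instead of on every line
UNI_PATTERN = re.compile(r'/uni[0-9a-fA-F]{8}')
SPACE_PATTERN = re.compile(r'\s+')

def decord_text(page_texts):
    """Join the lines of one page into a single cleaned-up string."""
    page_text = ""
    for text in page_texts:
        # endswith() is safe on empty lines, unlike text[-1]
        hyphenated = text.endswith("-")
        if hyphenated:
            text = text[:-1]
        text = UNI_PATTERN.sub('', text)
        text = SPACE_PATTERN.sub(' ', text)
        # Join a hyphenated word directly; otherwise keep a space between lines
        page_text += text if hyphenated else text + " "
    return page_text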

def extract_paper_sections(reader):
    # Extract the title (treated as the first line of page 0)
    page_dict = dict()
    page_raw = reader.pages[0].extract_text()
    page_texts = page_raw.strip().split("\n")
    page_dict["title"] = page_texts[0]
    # Find the "Abstract" line; the lines between the title and it are the authors
    abstract_idx = next(
        (idx for idx, text in enumerate(page_texts) if text.strip() == "Abstract"),
        None,
    )
    if abstract_idx is None:
        raise ValueError('no "Abstract" line found on page 0')
    page_dict["authors"] = page_texts[1:abstract_idx]
    page_dict["page"] = {}
    page_dict["page"][0] = decord_text(page_texts[abstract_idx:])

    # The remaining pages are stored whole, keyed by page index
    for cnt, page in enumerate(reader.pages[1:], start=1):
        page_raw = page.extract_text()
        page_texts = page_raw.strip().split("\n")
        page_dict["page"][cnt] = decord_text(page_texts)
    return page_dict

# %%
page_dict = extract_paper_sections(reader)

# %%
import json
with open(f'{pdf_name}.json', 'w', encoding="utf8") as f:
    json.dump(page_dict, f, indent=4, ensure_ascii=False)
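
One thing to keep in mind when reading the file back: JSON object keys are always strings, so the integer page indices come back as "0", "1", and so on. The following is a minimal sketch of a round trip; the dictionary contents and the file name `sample.json` are hypothetical stand-ins for the real output.

```python
import json
import os
import tempfile

# Hypothetical stand-in for the dictionary built by extract_paper_sections
page_dict = {
    "title": "Sample Title",
    "authors": ["First Author", "Second Author"],
    "page": {0: "Abstract ...", 1: "1. Introduction ..."},
}

# Write to a temporary directory so the sketch leaves no files behind
with tempfile.TemporaryDirectory() as tmp_dir:
    json_path = os.path.join(tmp_dir, "sample.json")
    with open(json_path, "w", encoding="utf8") as f:
        json.dump(page_dict, f, indent=4, ensure_ascii=False)
    with open(json_path, encoding="utf8") as f:
        loaded = json.load(f)

# The page keys are now strings; convert them back to int if needed
pages = {int(key): value for key, value in loaded["page"].items()}
print(sorted(pages))
```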

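[Experiment] Feeding it to ChatGPT

Pasting everything in at once made it fail, so I split the text manually before feeding it in.
In the future, it would be nice if it could also search for the referenced papers automatically.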
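
The manual split could probably be automated. The following is a minimal sketch, assuming a plain character limit per chunk is good enough; the function name `split_into_chunks` and the default of 3000 characters are hypothetical and do not correspond to any actual ChatGPT limit.

```python
def split_into_chunks(text, max_chars=3000):
    """Split text at spaces into chunks of at most max_chars characters.

    A single word longer than max_chars is kept whole in its own chunk.
    """
    chunks = []
    current = ""
    for word in text.split():
        candidate = word if not current else current + " " + word
        if len(candidate) <= max_chars or not current:
            # The word fits, or it is an oversized word starting a new chunk
            current = candidate
        else:
            # The chunk is full: store it and start a new one with this word
            chunks.append(current)
            current = word
    if current:
        chunks.append(current)
    return chunks


# Tiny limit just to make the behaviour visible
print(split_into_chunks("aaa bbb ccc ddd", max_chars=7))
```

Each chunk could then be pasted in one at a time, for example one page of the JSON output per call.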
