Automating PubMed Abstract Retrieval and Summarization with Python

Posted at 2025-05-05

Introduction

As a graduate student and self-taught programming beginner, I put together a Python script from scratch to make my own research more efficient.
The dream workflow of fetching paper abstracts from PubMed in one batch and having OpenAI summarize and structure them right away now runs painlessly in a Docker environment.

Background / Motivation

  • The volume of papers to review is huge, and working through it by hand is painful
  • Fetching from the PubMed API and then summarizing by hand is far too inefficient
  • If only I could hand the whole thing off to ChatGPT (the OpenAI API)…

To get rid of that frustration, I decided to automate whatever I could build on my own.

What This Article Covers

  • Bulk-fetch article Abstracts from the PubMed API
  • Use the OpenAI API (gpt-3.5-turbo) to structure each abstract into the following five aspects (a sketch of the expected JSON follows this list)
    1. Known (existing knowledge)
    2. Research Question (what the study asks)
    3. Methods (how it was analyzed)
    4. Findings (what was learned)
    5. Limitations (limitations of the study)
  • Write the results out as a CSV
  • Zero environment dependencies thanks to Docker, and smooth development in VSCode
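
As a concrete picture of the structuring step, here is a minimal sketch of the JSON shape the model is asked to return for each paper (the PMID below is a placeholder, and the "..." values stand for the model's actual summaries):

{
  "12345678": {
    "Known": "...",
    "Research Question": "...",
    "Methods": "...",
    "Findings": "...",
    "Limitations": "..."
  }
}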

What You Will Get from This Article

  1. How to structure and set up a Python project with Docker
  2. An implementation pattern for fetching data from the PubMed API
  3. How to migrate to the OpenAI v1 API (the openai package)
  4. Techniques for combining parallel processing (ThreadPoolExecutor) with rate limiting
  5. Practical debugging / troubleshooting examples

By the time you finish this article, you should have a comfortable workflow of your own that fetches paper abstracts automatically and has them summarized automatically as well.

Prerequisites

  • Python 3.9 or later
  • Docker & Docker Compose (optional)
  • An OpenAI API key and an NCBI API key already issued
  • A working Git environment
  • A CSV of papers found on PubMed by keyword search etc. (exported with PubMed's own CSV export feature)
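
Strictly speaking, the script only requires a PMID column in that CSV; the remaining columns from PubMed's export (Title, Authors, Journal/Book, DOI, and so on) are simply carried through to the output unchanged.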

Project File Structure

pubmed-annotator/
├── pubmed_client/           # core package
│   ├── __init__.py
│   ├── config.py
│   ├── utils.py
│   ├── fetcher.py
│   ├── summarizer.py
│   ├── io.py
│   └── cli.py               # entry point
├── tests/                   # pytest tests
│   ├── test_utils.py
│   ├── test_fetcher.py
│   └── test_summarizer.py
├── data/                    # input/output directories
│   ├── src/                 # place the input CSV here
│   └── out/                 # results are written here
├── setup.py                 # package configuration
├── requirements.txt         # dependency list
├── Dockerfile               # Docker image definition
├── .env                     # environment variable template
└── README.md                # project overview & usage

Contents of Each File

pubmed_client/config.py

pubmed_client/config.py
import os
from pathlib import Path
from dotenv import load_dotenv

load_dotenv()

OPENAI_API_KEY    = os.getenv("OPENAI_API_KEY")
NCBI_API_KEY      = os.getenv("NCBI_API_KEY")
INPUT_CSV         = Path(os.getenv("INPUT_CSV", "./data/src/pubmed_data.csv"))
OUTPUT_CSV        = Path(os.getenv("OUTPUT_CSV", "./data/out/pubmed_annotated.csv"))
BASE_URL          = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"
DB                = "pubmed"
RETMODE           = "xml"
DELAY             = 0.12 if NCBI_API_KEY else 0.34
PM_FETCH_BATCH    = int(os.getenv("PM_FETCH_BATCH", 200))
MAX_FETCH_WORKERS = int(os.getenv("MAX_FETCH_WORKERS", 10))
ENG_KEYS          = ["Known", "Research Question", "Methods", "Findings", "Limitations"]
SUMMARY_BATCH_SIZE  = int(os.getenv("SUMMARY_BATCH_SIZE", 20))
MAX_SUMMARY_WORKERS = int(os.getenv("MAX_SUMMARY_WORKERS", 10))
SUMMARY_SLEEP       = float(os.getenv("SUMMARY_SLEEP", 1.0))
MAX_RETRIES         = int(os.getenv("MAX_RETRIES", 3))
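
A note on DELAY: NCBI's E-utilities allow roughly 3 requests per second without an API key and 10 per second with one, which is what the 0.34 s / 0.12 s sleeps correspond to. That budget is per worker, so if you raise MAX_FETCH_WORKERS and start seeing failed fetches, dial the worker count back down.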

pubmed_client/utils.py

pubmed_client/utils.py
import re

def chunk_list(lst, n):
    for i in range(0, len(lst), n):
        yield lst[i : i + n]

def clean_json(s: str) -> str:
    return re.sub(r',(\s*[}\]])', r'\1', s).strip()
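
A quick, hypothetical check of what these two helpers do:

# chunk_list splits a list into fixed-size pieces; clean_json strips the trailing commas
# that gpt-3.5-turbo occasionally leaves before a closing brace or bracket.
from pubmed_client.utils import chunk_list, clean_json

print(list(chunk_list([1, 2, 3, 4, 5], 2)))   # [[1, 2], [3, 4], [5]]
print(clean_json('{"a": 1, "b": [2, 3,],}'))  # {"a": 1, "b": [2, 3]}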

pubmed_client/fetcher.py

pubmed_client/fetcher.py
import time
import xml.etree.ElementTree as ET
import requests
from concurrent.futures import ThreadPoolExecutor, as_completed

from .config import BASE_URL, DB, RETMODE, NCBI_API_KEY, DELAY, PM_FETCH_BATCH, MAX_FETCH_WORKERS
from .utils import chunk_list

def fetch_batch(batch):
    params = {"db": DB, "retmode": RETMODE, "id": ",".join(batch)}
    if NCBI_API_KEY:
        params["api_key"] = NCBI_API_KEY
    resp = requests.get(BASE_URL, params=params, timeout=10)
    time.sleep(DELAY)

    if resp.status_code != 200:
        print(f"[debug] fetch_batch failed: status={resp.status_code}")
        print(f"[debug] response snippet: {resp.text[:200]!r}")

    out = {}
    if resp.status_code == 200:
        root = ET.fromstring(resp.text)
        for art in root.findall(".//PubmedArticle"):
            pmid = art.findtext(".//PMID","").strip()
            texts = [n.text or "" for n in art.findall(".//AbstractText")]
            out[pmid] = "\n".join(texts).strip()
    return out

def fetch_abstracts(pmids, max_workers=MAX_FETCH_WORKERS):
    abstracts = {}
    batches = list(chunk_list(pmids, PM_FETCH_BATCH))
    with ThreadPoolExecutor(max_workers=max_workers) as ex:
        futures = {ex.submit(fetch_batch, b): b for b in batches}
        for fut in as_completed(futures):
            abstracts.update(fut.result() or {})
    return abstracts
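
A minimal usage sketch (the PMIDs here are placeholders, not real IDs from my dataset):

from pubmed_client.fetcher import fetch_abstracts

# returns a dict mapping PMID -> abstract text (an empty string if the article has no abstract)
abstracts = fetch_abstracts(["12345678", "23456789"])
for pmid, text in abstracts.items():
    print(pmid, text[:80])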

pubmed_client/summarizer.py

summarizer.py
import json
import re
import time
import traceback

from ratelimit import limits, sleep_and_retry
from openai import OpenAI

from .config import ENG_KEYS, OPENAI_API_KEY, MAX_RETRIES, SUMMARY_SLEEP
from .utils import clean_json

client = OpenAI(api_key=OPENAI_API_KEY)

# sleep_and_retry has to be the outer decorator so it can catch the
# RateLimitException raised by limits and wait instead of failing
@sleep_and_retry
@limits(calls=3500, period=60)
def call_openai(messages):
    return client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=messages,
        temperature=0,
    )

def summarize_batch(pmids, abstracts):
    skeleton = {pm: {k:"" for k in ENG_KEYS} for pm in pmids}
    example = json.dumps(skeleton, ensure_ascii=False, indent=2)
    prompt = (
        "You are a genius who can explain things concisely. "
        "For each abstract below, provide a JSON object with the following English keys: "
        f"{ENG_KEYS}.\n\n"
        "Output only JSON. Here is the structure example:\n"
        + example + "\n\n"
        "Now analyze these abstracts:\n"
    )
    for pm in pmids:
        prompt += f"PMID:{pm}\nAbstract:{abstracts.get(pm,'')}\n\n"

    messages = [
        {"role":"system","content":"You are a genius who can explain things concisely."},
        {"role":"user","content":prompt}
    ]

    for attempt in range(1, MAX_RETRIES+1):
        try:
            res = call_openai(messages)
            txt = res.choices[0].message.content
            m = re.search(r'^\{[\s\S]*\}$', txt, flags=re.M)
            if not m:
                raise ValueError("no JSON block in reply")
            js = clean_json(m.group(0))
            return json.loads(js)
        except Exception as e:
            print(f"[debug] summarizer attempt {attempt} failed: {e}")
            traceback.print_exc()
            time.sleep(2 ** attempt)
        finally:
            time.sleep(SUMMARY_SLEEP)
    return {pm:{k:"unknown" for k in ENG_KEYS} for pm in pmids}
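
A hypothetical call to illustrate the return shape (the PMID and abstract text are placeholders):

from pubmed_client.summarizer import summarize_batch

result = summarize_batch(["12345678"], {"12345678": "Background: ... Methods: ... Results: ..."})
# result is expected to look like:
# {"12345678": {"Known": "...", "Research Question": "...", "Methods": "...",
#               "Findings": "...", "Limitations": "..."}}
# If every retry fails, all values fall back to "unknown".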

pubmed_client/io.py

io.py
import pandas as pd
from pathlib import Path

def read_input(path):
    return pd.read_csv(path, dtype=str).fillna("")

def write_output(df, path):
    path = Path(path)
    path.parent.mkdir(parents=True, exist_ok=True)
    df.to_csv(path, index=False, encoding="utf-8-sig")

pubmed_client/cli.py

cli.py
#!/usr/bin/env python3
import argparse
from concurrent.futures import ThreadPoolExecutor, as_completed
from tqdm import tqdm

from .config import INPUT_CSV, OUTPUT_CSV, ENG_KEYS, SUMMARY_BATCH_SIZE, MAX_SUMMARY_WORKERS
from .io import read_input, write_output
from .utils import chunk_list
from .fetcher import fetch_abstracts
from .summarizer import summarize_batch

def main():
    parser = argparse.ArgumentParser(description="Fetch & summarize PubMed abstracts")
    parser.add_argument("-i","--input", default=INPUT_CSV, help="入力 CSV パス")
    parser.add_argument("-o","--output", default=OUTPUT_CSV, help="出力 CSV パス")
    args = parser.parse_args()

    df = read_input(args.input)
    pmids = df["PMID"].str.replace(r"\D","",regex=True).tolist()
    abstracts = fetch_abstracts(pmids)
    df["Abstract"] = df["PMID"].map(abstracts).fillna("")
    for k in ENG_KEYS:
        df[k] = ""
    batches = list(chunk_list(pmids, SUMMARY_BATCH_SIZE))
    with ThreadPoolExecutor(max_workers=MAX_SUMMARY_WORKERS) as ex:
        futures = {ex.submit(summarize_batch,b,abstracts):b for b in batches}
        for fut in tqdm(as_completed(futures), total=len(futures), desc="Summarizing"):
            out = fut.result()
            for pm, summary in out.items():
                for k, v in summary.items():
                    df.loc[df["PMID"]==pm, k] = v
    write_output(df, args.output)
if __name__=="__main__":
    main()
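
If you install the package locally with pip install -e ., the run-pubmed entry point defined in setup.py can also be invoked directly, e.g. run-pubmed -i data/src/pubmed_data.csv -o data/out/pubmed_annotated.csv.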

tests/test_utils.py

test_utils.py
import pytest
from pubmed_client.utils import chunk_list, clean_json

def test_chunk_list():
    data = list(range(7))
    assert list(chunk_list(data, 3)) == [[0,1,2], [3,4,5], [6]]

def test_clean_json():
    raw = '{"a":1,}'
    assert clean_json(raw) == '{"a":1}'

tests/test_fetcher.py

test_fetcher.py
import pytest
from pubmed_client.fetcher import fetch_batch

class DummyResp:
    def __init__(self, code, text):
        self.status_code = code
        self.text = text

def test_fetch_batch_non_200(monkeypatch):
    monkeypatch.setattr("requests.get", lambda *args, **kw: DummyResp(404, ""))
    assert fetch_batch(["123"]) == {}

def test_fetch_batch_success(monkeypatch):
    xml = """
    <PubmedArticleSet>
      <PubmedArticle>
        <MedlineCitation>
          <PMID>123</PMID>
          <Abstract><AbstractText>Test</AbstractText></Abstract>
        </MedlineCitation>
      </PubmedArticle>
    </PubmedArticleSet>
    """
    monkeypatch.setattr("requests.get", lambda *args, **kw: DummyResp(200, xml))
    out = fetch_batch(["123"])
    assert out["123"] == "Test"

tests/test_summarizer.py

test_summarizer.py
import pytest
from pubmed_client.summarizer import summarize_batch

def test_summarize_batch_fallback(monkeypatch):
    monkeypatch.setattr("pubmed_client.summarizer.call_openai", lambda *_: (_ for _ in ()).throw(Exception))
    result = summarize_batch(["1","2"], {"1":"", "2":""})
    assert all(val == "unknown" for pm in result for val in result[pm].values())
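
The tests can be run with pytest from the project root. Note that test_summarize_batch_fallback takes a while, because summarize_batch walks through all MAX_RETRIES attempts with exponential back-off before falling back to "unknown".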

setup.py

setup.py
from setuptools import setup, find_packages

setup(
    name="pubmed_client",
    version="0.1.0",
    description="Fetch and summarize PubMed abstracts with OpenAI",
    packages=find_packages(exclude=["tests"]),
    install_requires=[
        "requests",
        "pandas",
        "python-dotenv",
        "tqdm",
        "openai>=1.0.0",
        "ratelimit",
    ],
    entry_points={
        "console_scripts":[
            "run-pubmed=pubmed_client.cli:main"
        ]
    },
)

requirements.txt

requirements.txt
requests
pandas
python-dotenv
tqdm
openai>=1.0.0
ratelimit
pytest

Dockerfile

Dockerfile
FROM python:3.9-slim
WORKDIR /app
COPY requirements.txt setup.py ./
RUN pip install --no-cache-dir -r requirements.txt && pip install --no-cache-dir -e .
COPY . .
ENTRYPOINT ["run-pubmed"]

.env

.env
# Keep comments on their own lines: docker --env-file does not strip inline comments.
# API key issued by OpenAI
OPENAI_API_KEY=your_openai_api_key
# API key issued by NCBI
NCBI_API_KEY=your_ncbi_api_key
INPUT_CSV=./data/src/pubmed_data.csv
OUTPUT_CSV=./data/out/pubmed_annotated.csv

Running with Docker

  1. Build the image

    zsh
    docker build -t pubmed-annotator .
    
  2. Run the container

    zsh
    docker run --rm \
      -v $(pwd)/data/src:/app/data/src \
      -v $(pwd)/data/out:/app/data/out \
      --env-file .env \
      pubmed-annotator \
      --input ./data/src/pubmed_data.csv \
      --output ./data/out/pubmed_annotated.csv
    

⚠️ By default every PMID in the input is processed, which eats into your OpenAI quota.
For a first test, it is safer to trim data/src/pubmed_data.csv down to roughly 5-10 rows.
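
One way to do that trimming is a small throwaway Python snippet like the one below (paths are the defaults from .env; adjust as needed), then pass --input ./data/src/pubmed_data_small.csv when running the container:

import pandas as pd

# keep only the first 10 rows of the exported CSV for a cheap dry run
df = pd.read_csv("./data/src/pubmed_data.csv", dtype=str)
df.head(10).to_csv("./data/src/pubmed_data_small.csv", index=False)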


Sample Output (CSV Preview)

pubmed_annotated.csv
PMID,Title,Authors,Citation,First Author,Journal/Book,Publication Year,Create Date,PMCID,NIHMS ID,DOI,Abstract,Known,Research Question,Methods,Findings,Limitations
40308352,High and Rapid Uptake of COVID-19 Vaccine Among Chicago Women with/without HIV (WWH/WWoH),Barber S; McCluskey J,"Trop Med Int Health. 2025 Apr 20. doi:10.1111/tmi.14109.","Reniers G",Trop Med Int Health,2025,2025/04/21,,,"10.1111/tmi.14109","Research staff provided outreach and collected data on COVID-19 vaccination among a long-term cohort of Chicago women with/without HIV (WWH/WWoH).",existing knowledge,research question,analysis methods,findings obtained,limitations of the study
40254889,Mobile phone survey estimates of perinatal mortality in Malawi: comparing methods and validation with population-based surveillance data,Phiri K; Banda M; et al.,"Int J Public Health. 2025 Mar;70(2):123-134. doi:10.1007/s00038-024-01824-5.","Phiri K",Int J Public Health,2025,2025/03/15,,,"10.1007/s00038-024-01824-5","We use data from the Malawi Rapid Mortality Mobile Phone Survey (RaMMPS), comparing estimates to standard surveillance...",existing knowledge,research question,analysis methods,findings obtained,limitations of the study

What Didn't Work and Ideas for Future Extensions

  • For some papers, the added columns (everything except Abstract) come back as "unknown" for no obvious reason
    → At my level I can't tell why this happens. It still beats doing everything from scratch by hand, so I fill in the blank parts myself.
  • Manage the results in Notion
    → Import the generated CSV into Notion and pick out only the papers I actually need.
  • Translate into Japanese
    → I'll think about it if reading English becomes too much of a grind.

I hope this article turns out to be useful. If you spot any mistakes, I would be grateful for corrections.
The code is also published on GitHub; feel free to take a look:
https://github.com/kwmtshr/Pubmed_annotator.git
