
Making paper collection easier (XML parsing in Python)

Last updated at 2024-02-05, posted at 2024-01-29

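Introduction

A must-read if collecting papers feels like a chore!
Using the J-STAGE WebAPI, this article compiles the list of papers returned by a J-STAGE search into an Excel file.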

Environment

Python 3.10.13

J-STAGE WebAPI

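Sending a request to the J-STAGE WebAPI returns a response in XML format.
Changing the parameters in the request URL changes the search conditions.

The detailed specification of the J-STAGE WebAPI is described at the following link.
About the J-STAGE WebAPI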

Code

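Sending the request

This example retrieves papers whose title contains "健康" (health).
To change the search condition, replace the "健康" part.
To search the body text of the papers instead, change "article=" in the URL to "text=".

Send request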

import requests

# Request URL
url = 'https://api.jstage.jst.go.jp/searchapi/do?service=3&article="健康"'

# Send the request to the API
response = requests.get(url)
# Use the detected encoding so that Japanese text is decoded correctly
response.encoding = response.apparent_encoding
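
Switching between "article=" and "text=" by hand is easy to get wrong, so the URL can also be assembled from its parts. The following is a minimal sketch, not part of the original code: build_search_url and BASE_URL are hypothetical names, and the keyword is passed without the surrounding double quotes used in the URL above, which is an assumption about how the API treats the value (add them back if the results differ).

```python
from urllib.parse import urlencode

# Endpoint taken from the request URL used in this article
BASE_URL = "https://api.jstage.jst.go.jp/searchapi/do"


def build_search_url(keyword, field="article"):
    """Return a request URL that searches `field` for `keyword`.

    field: "article" searches titles, "text" searches the body text.
    """
    if field not in ("article", "text"):
        raise ValueError(f"unsupported field: {field}")
    # urlencode percent-encodes the Japanese keyword as UTF-8
    params = {"service": 3, field: keyword}
    return f"{BASE_URL}?{urlencode(params)}"


title_url = build_search_url("健康")
fulltext_url = build_search_url("健康", field="text")
print(title_url)
print(fulltext_url)
```

Either string can then be passed to requests.get() in place of the hard-coded url.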

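XML parsing

BeautifulSoup is a Python scraping library, a convenient tool for extracting data from HTML or XML text. Here it is used to parse the XML response from the API. Note that the "xml" parser used below requires the lxml package to be installed.

find_all() returns every element that matches the given condition as a list.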

BeautifulSoup

from bs4 import BeautifulSoup

soup = BeautifulSoup(response.text, "xml")
entries = soup.find_all('entry')
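
To see what find_all() and find() return without calling the API, the same calls can be tried on a small hand-written document. This is only a sketch: SAMPLE_XML is made-up data that borrows a few tag names from the code in this article and is much simpler than a real J-STAGE response.

```python
from bs4 import BeautifulSoup

# Made-up sample; a real response contains many more elements per entry
SAMPLE_XML = """<feed xmlns:prism="http://prismstandard.org/namespaces/basic/2.0/">
  <entry>
    <article_title><ja>健康に関する研究</ja></article_title>
    <pubyear>2023</pubyear>
    <prism:volume>12</prism:volume>
  </entry>
  <entry>
    <article_title><en>A study on health</en></article_title>
    <pubyear>2022</pubyear>
  </entry>
</feed>"""

soup = BeautifulSoup(SAMPLE_XML, "xml")

# find_all() returns a list-like collection of every <entry> element
entries = soup.find_all("entry")
print(len(entries))

# find() returns the first matching child, or None when nothing matches
first_year = entries[0].find("pubyear").text
missing_volume = entries[1].find("prism:volume")
print(first_year, missing_volume)
```

Because find() returns None for a missing element, the extraction code has to guard each lookup before reading .text.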

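Extracting the data and storing it in a DataFrame

The title, link, first author, year, journal, volume, issue, and pages are stored in a DataFrame.

Each and chain short-circuits: if any find() in the chain returns None because the element is missing, the whole expression evaluates to None; otherwise it evaluates to the last operand, which is the stripped text. This way the None check and the value retrieval are done in a single expression.

Storing in a DataFrame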

import pandas as pd

# Column names of the output table
columns = ['article_title', 'article_link', 'first_author', 'publish_year', 'journal', 'vol',
           'number', 'starting_page', 'ending_page']

# Extract the data from each entry; a missing element yields None
rows = []
for entry in entries:
    data = {
        'article_title': entry.find('article_title') and entry.find('article_title').find('ja') and entry.find('article_title').find('ja').text.strip(),
        'article_link' : entry.find("article_link") and entry.find("article_link").text.strip(),
        'first_author' : entry.find("author") and entry.find("author").find("ja") and entry.find("author").find("ja").find("name") and entry.find("author").find("ja").find("name").text.strip(),
        'publish_year' : entry.find("pubyear") and entry.find("pubyear").text.strip(),
        'journal'      : entry.find("material_title") and entry.find("material_title").find("ja") and entry.find("material_title").find("ja").text.strip(),
        'vol'          : entry.find("prism:volume") and entry.find("prism:volume").text.strip(),
        'number'       : entry.find("prism:number") and entry.find("prism:number").text.strip(),
        'starting_page': entry.find("prism:startingPage") and entry.find("prism:startingPage").text.strip(),
        'ending_page'  : entry.find("prism:endingPage") and entry.find("prism:endingPage").text.strip(),
    }
    rows.append(data)

# Build the DataFrame once instead of concatenating inside the loop
df = pd.DataFrame(rows, columns=columns)

# Write to Excel (to_excel needs an Excel writer such as openpyxl installed)
df.to_excel('articles.xlsx', index=False)
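
The repeated and chains above can be folded into a small helper that walks a path of nested tag names. This is a sketch of one possible refactoring, not the code used in this article: text_at is a hypothetical name, and SAMPLE_XML is made-up data.

```python
from bs4 import BeautifulSoup


def text_at(tag, *path):
    """Follow nested tag names and return the stripped text.

    Returns None as soon as any element along the path is missing.
    """
    for name in path:
        if tag is None:
            return None
        tag = tag.find(name)
    return tag.text.strip() if tag is not None else None


# Made-up sample: the second entry has no Japanese title and no author
SAMPLE_XML = """<feed>
  <entry>
    <article_title><ja> 健康に関する研究 </ja></article_title>
    <author><ja><name>山田 太郎</name></ja></author>
  </entry>
  <entry>
    <article_title><en>A study on health</en></article_title>
  </entry>
</feed>"""

entries = BeautifulSoup(SAMPLE_XML, "xml").find_all("entry")
titles = [text_at(e, "article_title", "ja") for e in entries]
authors = [text_at(e, "author", "ja", "name") for e in entries]
print(titles)
print(authors)
```

With such a helper, each dictionary value in the loop becomes a single call, for example text_at(entry, "author", "ja", "name"), and each element is looked up only once.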

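Conclusion

This time I parsed the XML with BeautifulSoup, but honestly I think xmltodict would have produced more readable code.

Someone has written an article that uses xmltodict, so I am linking it here.
#03 Searching J-STAGE by paper title and downloading the results as a CSV file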
