
PubMed API: overview, data retrieval, simple EDA, and clustering (Python)

Posted at 2019-10-17. Last updated at 2019-10-17.

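What is this article?

This article covers the following.

An introduction to PubMed
Collecting data with Python through the PubMed API (strictly speaking, the Entrez API with PubMed selected as the database)
A simple EDA
Document clustering

Full code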

github: https://github.com/tatsuya-takahashi/PubMed-API-Script
notebook: https://nbviewer.jupyter.org/github/tatsuya-takahashi/PubMed-API-Script/blob/master/PubMed.ipynb

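Motivation

There are many guides and explanations of PubMed itself, but I found few that cover it in detail in Japanese and with Python, so I wrote up what I learned while implementing this.
It can be used for scouting research seeds in a large body of papers, finding related papers, graph analysis, and so on.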

What is PubMed?

A database of medical and biological literature that lets you search some 5,000 or more journals from about 70 countries. It is freely available online at any time. PubMed is a project of the NCBI (National Center for Biotechnology Information), part of the NLM (National Library of Medicine) within the NIH (National Institutes of Health). Launched in June 1997, it is a free search service for the literature database MEDLINE (Medical Literature Analysis and Retrieval System Online).

Source: http://jspt.japanpt.or.jp/ebpt/evidence/pubmed/

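What is the PubMed API?

It is an API, developed and operated by NCBI, for accessing PubMed programmatically.
It is provided as part of Entrez (the search system operated by NCBI, which includes PubMed).
So, strictly speaking, there is no such thing as a "PubMed API": PubMed is simply one of the values you can pass to the database-selection parameter of NCBI's Entrez API (the selectable databases include PubMed, PMC, Gene, Nuccore, and Protein).
From here on, I will simply call it the "PubMed API".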
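To make the "database as a parameter" idea concrete, here is a minimal sketch that only builds request URLs and never touches the network. The base URL is the one that appears in the E-utilities links quoted in this article; the helper name `build_eutils_url` and the search term are made up for illustration.

```python
from urllib.parse import urlencode

# Base URL of the E-utilities endpoints (same host/path as the URLs quoted in this article).
EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/"


def build_eutils_url(utility, db, **params):
    """Build a request URL for one E-utility (e.g. 'esearch') against one Entrez database.

    `build_eutils_url` is a hypothetical helper, not part of any library.
    """
    query = {"db": db, **params}
    return f"{EUTILS_BASE}{utility}.fcgi?{urlencode(query)}"


if __name__ == "__main__":
    # Same utility, different database: only the `db` parameter changes.
    print(build_eutils_url("esearch", "pubmed", term="asthma"))
    print(build_eutils_url("esearch", "pmc", term="asthma"))
```

Nothing here is specific to PubMed except the value passed as `db`.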

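Terms of use

https://www.nlm.nih.gov/databases/download/terms_and_conditions.html
As I read it, republishing the data is fine as long as you keep it up to date (or make clear that it may not be the latest).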

Users who republish or redistribute the data (services, products or raw data) agree to:
maintain the most current version of all distributed data, or
make known in a clear and conspicuous manner that the products/services/applications do not reflect the most current/accurate data available from NLM.

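About API keys

The PubMed API can be used without an API key, with an API key, or with a higher rate limit arranged with NCBI.
Without a key, an IP address that sends more than 3 requests per second is rate-limited.
If that is a problem, create an account and include your email address and API key in each request, which allows up to 10 requests per second.
If you need a higher limit than that, it apparently has to be arranged with NCBI.
For batch-style retrieval like the one in this article, the turnaround of each request is long, so I implemented it without an API key.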

Coming in December 2018: API Keys
On December 1, 2018, NCBI will begin enforcing the use of API keys that will offer enhanced levels of supported access to the E-utilities. After that date, any site (IP address) posting more than 3 requests per second to the E-utilities without an API key will receive an error message. By including an API key, a site can post up to 10 requests per second by default. Higher rates are available by request (eutilities@ncbi.nlm.nih.gov). Users can obtain an API key now from the Settings page of their NCBI account (to create an account, visit http://www.ncbi.nlm.nih.gov/account/). After creating the key, users should include it in each E-utility request by assigning it to the new api_key parameter.
Example request including an API key:
esummary.fcgi?db=pubmed&id=123456&api_key=ABCDE12345
Example error message if rates are exceeded:
{"error":"API rate limit exceeded","count":"11"}
Only one API key is allowed per NCBI account; however, a user may request a new key at any time. Such a request will invalidate any existing API key associated with that NCBI account.
We encourage regular E-utility users to obtain an API key as soon as possible and begin the process of incorporating it into code. We also encourage users to monitor their request rates to determine if they will require rates higher than 10 per second. As stated above, we can potentially have higher rates negotiated prior to the beginning of enforcement on December 1, 2018.
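One way to stay under the limits quoted above is to space out requests on the client side. The following is a rough sketch, not the code used in this article: `Throttle`, `with_credentials`, and `rate_limit_for` are hypothetical names, and the 3 and 10 requests-per-second figures are simply the defaults from the NCBI notice.

```python
import time


class Throttle:
    """Space out calls so that at most `max_per_second` happen per second."""

    def __init__(self, max_per_second, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = 1.0 / max_per_second
        self._clock = clock  # injectable so the class can be tested without waiting
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Block until at least `min_interval` has passed since the previous call."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


def rate_limit_for(api_key=None):
    """Default requests per second from the NCBI notice: 3 without a key, 10 with one."""
    return 10 if api_key else 3


def with_credentials(params, api_key=None, email=None):
    """Return a copy of `params` with `api_key` / `email` added when provided."""
    out = dict(params)
    if api_key:
        out["api_key"] = api_key
    if email:
        out["email"] = email
    return out
```

A caller would create `Throttle(rate_limit_for(api_key))` once and call `wait()` before every request. The dictionary returned by `with_credentials` goes into the query string as usual.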

ORCID

Some researchers have been assigned an ORCID (an identifier that uniquely identifies a researcher).
ORCIDs can be used to disambiguate and merge researcher names.

Main features of the PubMed API

*The entries below are written in terse note form.

EInfo

Retrieve the list of names of all valid Entrez databases
Retrieve statistics for a single database, including its index fields and the list of available link names
Example of the statistics: https://eutils.ncbi.nlm.nih.gov/entrez/eutils/einfo.fcgi?db=protein&version=2.0

ESearch

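Retrieve the list of UIDs matching a search query
Post the search results to the History server¹
Download all UIDs from a dataset stored on the History server
Combine or limit UID datasets stored on the History server
Sort a set of UIDs
retmax sets how many UIDs are returned per request. The default is 20 and the maximum is 100,000.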
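The sketch below shows what handling an ESearch result might look like. The XML string is a hand-written sample shaped like an ESearch response, with placeholder IDs, so the code runs without network access. `parse_esearch` and `page_starts` are hypothetical helpers, and paging with a `retstart` offset is one possible approach rather than necessarily what the notebook does.

```python
import xml.etree.ElementTree as ET

# Hand-written sample shaped like an ESearch XML response (IDs and WebEnv are placeholders).
SAMPLE_ESEARCH_XML = """<eSearchResult>
  <Count>250</Count>
  <RetMax>3</RetMax>
  <RetStart>0</RetStart>
  <QueryKey>1</QueryKey>
  <WebEnv>PLACEHOLDER_WEBENV</WebEnv>
  <IdList>
    <Id>11111111</Id>
    <Id>22222222</Id>
    <Id>33333333</Id>
  </IdList>
</eSearchResult>"""


def parse_esearch(xml_text):
    """Extract the hit count, the returned UIDs, and the History server handles."""
    root = ET.fromstring(xml_text)
    return {
        "count": int(root.findtext("Count")),
        "ids": [node.text for node in root.findall("./IdList/Id")],
        # QueryKey / WebEnv are only present when the search was posted to the History server.
        "query_key": root.findtext("QueryKey"),
        "webenv": root.findtext("WebEnv"),
    }


def page_starts(total, retmax):
    """Offsets (retstart values) needed to page through `total` UIDs, `retmax` at a time."""
    return list(range(0, total, retmax))


if __name__ == "__main__":
    result = parse_esearch(SAMPLE_ESEARCH_XML)
    print(result["count"], result["ids"])
    print(page_starts(result["count"], 100))
```

Because the default `retmax` is only 20, a bulk download has to either raise `retmax` or loop over the offsets from `page_starts`.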

EPost

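Upload a list of UIDs to the History server
Append a list of UIDs to an existing set of UID lists attached to a Web Environment
There is no limit on the number of ids, but more than 200 requires the POST method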

ESummary

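Retrieve the document summaries (DocSums) for the list of UIDs in the request
Retrieve the DocSums for a set of UIDs stored on the History server
There is no limit on the number of ids, but more than 200 requires the POST method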
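The 200-UID rule noted for EPost and ESummary can be handled by choosing the HTTP method from the number of IDs. This is a minimal sketch using only the standard library. It builds the request object without sending it, and `build_esummary_request` is a made-up helper name.

```python
from urllib.parse import urlencode
from urllib.request import Request

ESUMMARY_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esummary.fcgi"
POST_THRESHOLD = 200  # per the notes above: more than 200 UIDs should be sent via POST


def build_esummary_request(ids, db="pubmed"):
    """Return a urllib Request: GET for short ID lists, POST for long ones."""
    params = urlencode({"db": db, "id": ",".join(str(i) for i in ids)})
    if len(ids) > POST_THRESHOLD:
        # Supplying `data` makes urllib issue a POST with the parameters in the body.
        return Request(ESUMMARY_URL, data=params.encode("ascii"))
    return Request(f"{ESUMMARY_URL}?{params}")


if __name__ == "__main__":
    short = build_esummary_request([1, 2, 3])
    long_ = build_esummary_request(list(range(1, 302)))
    print(short.get_method(), long_.get_method())
```

Passing the returned object to `urllib.request.urlopen` would actually send it. That step is left out here so the example has no network dependency.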

EFetch

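Retrieve formatted data records for a list of input UIDs
Retrieve formatted data records for a set of UIDs stored on the History server
Unlike ESummary, it can also return the abstract and other full-record fields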
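As an illustration of pulling abstracts out of an EFetch result, here is a small parser run against a hand-written sample in the shape of PubMed's XML output (what you would get with `retmode=xml`). The PMIDs, titles, and text are placeholders, and `parse_efetch` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hand-written sample shaped like PubMed EFetch XML (all values are placeholders).
SAMPLE_EFETCH_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>11111111</PMID>
      <Article>
        <ArticleTitle>First placeholder title</ArticleTitle>
        <Abstract>
          <AbstractText Label="BACKGROUND">Some background.</AbstractText>
          <AbstractText Label="RESULTS">Some <i>results</i>.</AbstractText>
        </Abstract>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>22222222</PMID>
      <Article>
        <ArticleTitle>Second placeholder title</ArticleTitle>
      </Article>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def parse_efetch(xml_text):
    """Return one dict per article with its PMID, title, and abstract text."""
    records = []
    for article in ET.fromstring(xml_text).findall("./PubmedArticle"):
        citation = article.find("MedlineCitation")
        # An abstract may be split into labelled sections and may contain inline markup,
        # so collect all text inside each AbstractText element and join the sections.
        sections = [
            "".join(node.itertext()).strip()
            for node in citation.findall("./Article/Abstract/AbstractText")
        ]
        records.append({
            "pmid": citation.findtext("PMID"),
            "title": citation.findtext("./Article/ArticleTitle"),
            "abstract": " ".join(sections),  # empty string when there is no abstract
        })
    return records


if __name__ == "__main__":
    for record in parse_efetch(SAMPLE_EFETCH_XML):
        print(record)
```

Records without an abstract come back with an empty string, so they can be filtered out before any text analysis.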

ELink

Neighbor search
Retrieve a list of UIDs similar to the UIDs in the request (various link types and metrics are available)
https://eutils.ncbi.nlm.nih.gov/entrez/query/static/entrezlinks.html
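For reference, here is a sketch of reading related-article links from an ELink result. The XML is a hand-written sample in the shape of an ELink response, with placeholder IDs. `pubmed_pubmed` is used as the link name for similar articles (see the list of link names at the URL above), and `parse_elink` is a hypothetical helper.

```python
import xml.etree.ElementTree as ET

# Hand-written sample shaped like an ELink XML response (IDs are placeholders).
SAMPLE_ELINK_XML = """<eLinkResult>
  <LinkSet>
    <DbFrom>pubmed</DbFrom>
    <IdList>
      <Id>11111111</Id>
    </IdList>
    <LinkSetDb>
      <DbTo>pubmed</DbTo>
      <LinkName>pubmed_pubmed</LinkName>
      <Link><Id>22222222</Id></Link>
      <Link><Id>33333333</Id></Link>
    </LinkSetDb>
  </LinkSet>
</eLinkResult>"""


def parse_elink(xml_text):
    """Map each link name in the response to the list of linked UIDs."""
    links = {}
    for linkset_db in ET.fromstring(xml_text).findall("./LinkSet/LinkSetDb"):
        name = linkset_db.findtext("LinkName")
        ids = [node.text for node in linkset_db.findall("./Link/Id")]
        links.setdefault(name, []).extend(ids)
    return links


if __name__ == "__main__":
    print(parse_elink(SAMPLE_ELINK_XML))
```

The resulting source-to-neighbor pairs are one possible starting point for the graph analysis mentioned in the motivation.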

EGQuery

Retrieve the number of records matching a given query

ESpell

Retrieve spelling suggestions for the terms in a single text query against a given database

ECitMatch

Retrieve the PubMed IDs (PMIDs) corresponding to a set of input citation strings

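What is MeSH?

MeSH (Medical Subject Headings) is the thesaurus vocabulary of MEDLINE (the database behind PubMed). It defines the primary term (the most commonly used word) among synonyms and organizes the hierarchical relationships between concepts. Drawing terms from MeSH reduces variation in wording, and it is used to improve search precision, among other things.
Records returned by the PubMed API sometimes have MeSH terms attached (and sometimes do not).
Whether they are present appears to correlate strongly with the record's PubMed status (I have no evidence for this).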
https://www.ncbi.nlm.nih.gov/books/NBK3827/table/pubmedhelp.T.status_subsets/
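Each MeSH term has an ID, and the MeSH database itself is publicly available. Mapping words to their primary term can therefore make the text easier to handle in natural language processing (I plan to cover this in a follow-up article).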
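Here is a sketch of reading the MeSH descriptors, with their IDs, from a fetched record. The XML is a hand-written sample in the shape of PubMed's EFetch output, the PMIDs are placeholders, and `mesh_descriptors` is a hypothetical helper. The second record has no MeSH headings, which is the "sometimes do not" case mentioned above.

```python
import xml.etree.ElementTree as ET

# Hand-written sample shaped like PubMed EFetch XML; PMIDs are placeholders.
SAMPLE_MESH_XML = """<PubmedArticleSet>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>11111111</PMID>
      <MeshHeadingList>
        <MeshHeading>
          <DescriptorName UI="D001249" MajorTopicYN="Y">Asthma</DescriptorName>
          <QualifierName UI="Q000188" MajorTopicYN="N">drug therapy</QualifierName>
        </MeshHeading>
        <MeshHeading>
          <DescriptorName UI="D006801" MajorTopicYN="N">Humans</DescriptorName>
        </MeshHeading>
      </MeshHeadingList>
    </MedlineCitation>
  </PubmedArticle>
  <PubmedArticle>
    <MedlineCitation>
      <PMID>22222222</PMID>
    </MedlineCitation>
  </PubmedArticle>
</PubmedArticleSet>"""


def mesh_descriptors(xml_text):
    """Map each PMID to its list of (descriptor ID, descriptor name) pairs."""
    result = {}
    root = ET.fromstring(xml_text)
    for citation in root.findall("./PubmedArticle/MedlineCitation"):
        descriptors = citation.findall("./MeshHeadingList/MeshHeading/DescriptorName")
        # Records without a MeshHeadingList simply get an empty list.
        result[citation.findtext("PMID")] = [(d.get("UI"), d.text) for d in descriptors]
    return result


if __name__ == "__main__":
    for pmid, terms in mesh_descriptors(SAMPLE_MESH_XML).items():
        print(pmid, terms)
```

Keying on the descriptor ID rather than the display name is what allows synonyms to be collapsed onto one term later.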

PubMed Status

PubStatus ::= INTEGER {            -- points of publication
    received  (1) ,            -- date manuscript received for review
    accepted  (2) ,            -- accepted for publication
    epublish  (3) ,            -- published electronically by publisher
    ppublish  (4) ,            -- published in print by publisher
    revised   (5) ,            -- article revised by publisher/author
    pmc       (6) ,            -- article first appeared in PubMed Central
    pmcr      (7) ,            -- article revision in PubMed Central
    pubmed    (8) ,            -- article citation first appeared in PubMed
    pubmedr   (9) ,            -- article citation revision in PubMed
    aheadofprint (10),         -- epublish, but will be followed by print
    premedline (11),           -- date into PreMedline status
    medline    (12),           -- date made a MEDLINE record
    other    (255) }
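When working with these values in Python, the definition above can be mirrored as an enumeration. This is just a convenience sketch: the names and numbers are copied from the listing, while the `PubStatus` class and the `status_code` helper are my own naming.

```python
from enum import IntEnum


class PubStatus(IntEnum):
    """Publication status codes, copied from the PubStatus definition above."""

    received = 1       # date manuscript received for review
    accepted = 2       # accepted for publication
    epublish = 3       # published electronically by publisher
    ppublish = 4       # published in print by publisher
    revised = 5        # article revised by publisher/author
    pmc = 6            # article first appeared in PubMed Central
    pmcr = 7           # article revision in PubMed Central
    pubmed = 8         # article citation first appeared in PubMed
    pubmedr = 9        # article citation revision in PubMed
    aheadofprint = 10  # epublish, but will be followed by print
    premedline = 11    # date into PreMedline status
    medline = 12       # date made a MEDLINE record
    other = 255


def status_code(name):
    """Look up the numeric code for a status name, falling back to `other`."""
    try:
        return PubStatus[name.strip().lower()]
    except KeyError:
        return PubStatus.other


if __name__ == "__main__":
    print(status_code("medline"), int(status_code("medline")))
    print(PubStatus(10).name)
```

Turning status strings into this enum makes it easy to group records by status, for example to check the suspected relationship between status and the presence of MeSH terms.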