Extracting information from papers with PubMed / PubTator

Posted at 2023-10-09

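Introduction

I learned a great deal from the article "PubMed APIの概要、データ取得、シンプルなEDA、及びクラスタリング(Python)" (Overview of the PubMed API, data retrieval, simple EDA, and clustering (Python)).

I plan to reuse that work as is and add an analysis based on PubTator on top of it.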

PubMed

PubMed: https://pubmed.ncbi.nlm.nih.gov/
PubMed indexes abstracts and other records of biomedical literature from MEDLINE, journals, and other sources: more than 36 million citations as of 2023/10/09.

Searching PubMed

PubMed's search is very powerful. For example, after simply searching for cancer, the advanced search page shows that the query was expanded to:
"cancer s"[All Fields] OR "cancerated"[All Fields] OR "canceration"[All Fields] OR "cancerization"[All Fields] OR "cancerized"[All Fields] OR "cancerous"[All Fields] OR "neoplasms"[MeSH Terms] OR "neoplasms"[All Fields] OR "cancer"[All Fields] OR "cancers"[All Fields]
This automatic expansion is very helpful.

PubMed API

PubMed supports programmatic access. The API is also updated frequently, so I believe it is becoming easier to use.
https://ncbiinsights.ncbi.nlm.nih.gov/?s=pubmed+api
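As a minimal sketch of what programmatic access can look like, the snippet below builds a search URL and pulls PMIDs out of a JSON response. It assumes the NCBI E-utilities ESearch endpoint, which this post does not name explicitly. The helper names `build_esearch_url` and `extract_pmids` and the default `retmax` are hypothetical and chosen only for illustration.

```python
from urllib.parse import urlencode

# Assumption: the E-utilities ESearch endpoint (not named in this post).
ESEARCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/esearch.fcgi"


def build_esearch_url(term, retmax=20):
    """Build a PubMed search URL that asks for a JSON response."""
    params = {"db": "pubmed", "term": term, "retmax": retmax, "retmode": "json"}
    return f"{ESEARCH_URL}?{urlencode(params)}"


def extract_pmids(esearch_json):
    """Return the PMID list from a parsed ESearch JSON response (empty if absent)."""
    return list(esearch_json.get("esearchresult", {}).get("idlist", []))


if __name__ == "__main__":
    # Only prints the URL; fetching it (for example with requests) is left to the caller.
    print(build_esearch_url("cancer"))
```

The URL construction is kept separate from the network call so that it can be checked offline before any request is sent.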

PubTator

PubTator: https://www.ncbi.nlm.nih.gov/research/pubtator/
PubTator Central (PTC) is a service that automatically annotates biomedical concepts in PubMed abstracts and PMC full-text articles. It looks handy for tasks such as picking out gene names, so I would like to try it.

PubTator API

Export Annotations

Export our annotated publications in batches of up to 100 in GET or 1000 in POST, in BioC, pubtator or JSON formats.
In order not to overload the PubTator server, we require that users post no more than three requests per second.

Constants (PubTator)

BASEURL_PUBTATOR = 'https://www.ncbi.nlm.nih.gov/research/pubtator-api/publications/export/pubtator'
BASEURL_BIOCXML = 'https://www.ncbi.nlm.nih.gov/research/pubtator-api/publications/export/biocxml'
BASEURL_BIOCJSON = 'https://www.ncbi.nlm.nih.gov/research/pubtator-api/publications/export/biocjson'
# https://www.ncbi.nlm.nih.gov/research/pubtator-api/publications/export/pubtator?pmids=28483577&concepts=gene
# https://www.ncbi.nlm.nih.gov/research/pubtator-api/publications/export/biocxml?pmcids=PMC6207735
# https://www.ncbi.nlm.nih.gov/research/pubtator-api/publications/export/biocjson?pmids=28483577,28483578,28483579
BATCH_NUM_PUBTATOR    = 100
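The batch limit and the three-requests-per-second rule quoted above could be respected roughly as follows. This is only a sketch: `chunk`, `build_export_urls`, `fetch_all`, and `MIN_INTERVAL_SEC` are hypothetical names, and `fetch` is a caller-supplied function (for example a thin wrapper around an HTTP GET), so nothing here touches the network by itself.

```python
import time

BASEURL_BIOCJSON = 'https://www.ncbi.nlm.nih.gov/research/pubtator-api/publications/export/biocjson'
BATCH_NUM_PUBTATOR = 100      # GET accepts up to 100 publications per request
MIN_INTERVAL_SEC = 1.0 / 3    # hypothetical constant: at most three requests per second


def chunk(items, size):
    """Split a list into consecutive batches of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_export_urls(pmids, base_url=BASEURL_BIOCJSON, batch_size=BATCH_NUM_PUBTATOR):
    """Return one export URL per batch, with PMIDs joined by commas."""
    batches = chunk(list(pmids), batch_size)
    return [f"{base_url}?pmids={','.join(str(p) for p in batch)}" for batch in batches]


def fetch_all(urls, fetch, sleep=time.sleep):
    """Call `fetch(url)` for each URL, pausing between consecutive requests."""
    results = []
    for i, url in enumerate(urls):
        if i > 0:
            sleep(MIN_INTERVAL_SEC)
        results.append(fetch(url))
    return results
```

Passing `fetch` and `sleep` in as arguments is a design choice for this sketch: it lets the batching and throttling logic be exercised with stand-ins before pointing it at the real server.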

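What information do I want to extract?

The idea is to search for papers on a given topic and summarize the MeSH terms and cell types (cell lines) that appear in them (at present this is not fully implemented).
The reference article "PubMed APIの概要、データ取得、シンプルなEDA、及びクラスタリング(Python)" used Tfidf for clustering and other analyses. By using PubTator, I hoped that its annotations of biomedical concepts would reduce spelling variation and make the terms easier to count (this is only my rough understanding, and I am not sure it is correct).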

Strategy

Search with the PubMed API
Use PubTator to retrieve the annotations for the papers found
Combine the two and move on to analysis
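The annotation step in the plan above can be sketched as follows. The sketch assumes the usual PubTator text layout: `PMID|t|title` and `PMID|a|abstract` lines followed by tab-separated annotation lines (PMID, start, end, mention, type, identifier). `parse_pubtator`, `count_concepts`, and `SAMPLE` are hypothetical names, and `SAMPLE` is a hand-written illustration with a placeholder PMID, not real API output.

```python
from collections import Counter

# Hand-written illustrative input (placeholder PMID), not real PubTator output.
SAMPLE = (
    "0000001|t|BRCA1 and breast cancer\n"
    "0000001|a|Mutations in BRCA1 are linked to breast cancer.\n"
    "0000001\t0\t5\tBRCA1\tGene\t672\n"
    "0000001\t10\t23\tbreast cancer\tDisease\tMESH:D001943\n"
    "0000001\t37\t42\tBRCA1\tGene\t672\n"
)


def parse_pubtator(text):
    """Return annotation dicts from PubTator-format text, skipping other lines."""
    annotations = []
    for line in text.splitlines():
        parts = line.split("\t")
        # Annotation lines have at least six tab-separated fields with numeric offsets.
        if len(parts) < 6 or not (parts[1].isdigit() and parts[2].isdigit()):
            continue
        pmid, start, end, mention, concept_type, identifier = parts[:6]
        annotations.append({
            "pmid": pmid,
            "start": int(start),
            "end": int(end),
            "mention": mention,
            "type": concept_type,
            "identifier": identifier,
        })
    return annotations


def count_concepts(annotations):
    """Count annotations by (type, identifier) rather than by surface mention."""
    return Counter((a["type"], a["identifier"]) for a in annotations)


if __name__ == "__main__":
    for key, n in count_concepts(parse_pubtator(SAMPLE)).most_common():
        print(key, n)
```

Counting on the normalized identifier instead of the mention text is what would let differently spelled mentions of the same concept fall into one bucket, which is the effect hoped for in this post.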

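Current status

Jupyter Notebook: https://hiroh23.github.io/PubMed-API-Script/PubMed-PubTator-API.html

The search step uses the code from "PubMed APIの概要、データ取得、シンプルなEDA、及びクラスタリング(Python)" as is.
Retrieving annotations through the PubTator API is also handled by slightly modifying the code from "PubMed APIの概要、データ取得、シンプルなEDA、及びクラスタリング(Python)".
I am still considering how to summarize and use the Mutation information from PubTator.
I have not yet started on the analysis that combines the two. Some of my graphs look different from what the original article leads me to expect, so I may have made a mistake when copying the code.
I do not yet understand well how to handle plotly in Jupyter Lab, and I am finding it difficult.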

Next steps

I hope to find time to add more to this post.

References

PubMed APIの概要、データ取得、シンプルなEDA、及びクラスタリング(Python) (Overview of the PubMed API, data retrieval, simple EDA, and clustering (Python)) https://qiita.com/iwashi-kun/items/bd0d772c6db0c0023e30