Searching biological data from R with rentrez

Last updated at 2023-12-19. Posted at 2023-12-18.

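Goal of this article

Get a basic understanding of rentrez and use it to retrieve data freely from NCBI.

Before rentrez: what is Entrez?

rentrez is the R interface to Entrez, but I did not really understand Entrez itself, so I first summarize what I found out about it.

Entrez is NCBI's primary text search system. It integrates the PubMed database of biomedical literature with 38 other literature and molecular databases covering DNA and protein sequences, structures, genes, genomes, genetic variation, and gene expression. (Taken from the NCBI description; the original post quoted a DeepL translation.)

In other words, Entrez lets you run text searches across databases, and it appears to be what runs behind the NCBI searches we use every day without thinking about it.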
https://www.ncbi.nlm.nih.gov/books/NBK3837/

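What is rentrez?

It is simply Entrez made usable from R. Tasks often seen in review articles, such as counting PubMed hits and plotting them, can be done with reasonable effort using rentrez. It looks handy when you want to roughly search and summarize past research.

Following the tutorial

The databases accessible from rentrez, as listed by entrez_dbs(), are as follows.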

 [1] "PubMed"          "protein"         "nuccore"         "ipg"            
 [5] "nucleotide"      "structure"       "genome"          "annotinfo"      
 [9] "assembly"        "bioproject"      "biosample"       "blastdbinfo"    
[13] "books"           "cdd"             "clinvar"         "gap"            
[17] "gapplus"         "grasp"           "dbvar"           "gene"           
[21] "gds"             "geoprofiles"     "homologene"      "medgen"         
[25] "mesh"            "nlmcatalog"      "omim"            "orgtrack"       
[29] "pmc"             "popset"          "proteinclusters" "pcassay"        
[33] "protfam"         "pccompound"      "pcsubstance"     "seqannot"       
[37] "snp"             "sra"             "taxonomy"        "biocollections" 
[41] "gtr"

Part 1: entrez_db_summary()

Returns a short summary of a given database, including its number of entries. I work on variants, so let's look at the entry count for dbSNP (snp).

entrez_db_summary("snp")

 DbName: snp
 MenuName: SNP
 Description: Single Nucleotide Polymorphisms
 DbBuild: Build221118-1625.1
 Count: 1121739543
 LastUpdate: 2022/11/22 11:07
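
The Count field can be pulled out as a number for further use. The following is a minimal sketch, assuming the value returned by entrez_db_summary() can be indexed by name like a named character vector. The helper name db_entry_count is hypothetical, and the vector below is typed in by hand from the output above rather than fetched from NCBI.

```r
# Hypothetical helper: read the "Count" element of a database summary
# and return it as a number.
db_entry_count <- function(db_info) {
  as.numeric(db_info[["Count"]])
}

# Hand-typed stand-in for entrez_db_summary("snp"), using the values shown above.
snp_info <- c(DbName = "snp", MenuName = "SNP", Count = "1121739543")

n_snp <- db_entry_count(snp_info)

# Print with thousands separators for readability.
print(formatC(n_snp, format = "d", big.mark = ","))
```

With a live connection, the same helper could presumably be applied directly to the result of entrez_db_summary("snp").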

Part 2: entrez_db_searchable()

Lists the fields that can be searched in a given database.

entrez_db_searchable("snp")

Searchable fields for database 'snp'
  ALL 	 All terms from all searchable fields 
  UID 	 Unique number assigned to publication 
  FILT 	 Limits the records 
  RS 	 Clustered SNP ID (rs) 
  CHR 	 chromosomes 
  GENE 	 locus link symbol 
  HAN 	 Submitter Handle 
  ACCN 	 nucleotide accessions 
  GENE_ID 	 Gene ID 
  FXN 	 dbSNP Functional consequence class 
  GTYP 	 Genotype info 
  SS 	 Submitter ID 
  VARI 	 Allele 
  SCLS 	 SNP class 
  CPOS 	 Chromosome base position 
  WORD 	 Free text associated with record 
  SIDX 	 SNP Index 
  CLIN 	 Variations with clinical effects or significances 
  GMAF 	 Minor Allele Frequency derived from global population (ie. 1000G) 
  VALI 	 Validation status 
  CPOS_GRCH37 	 Chromosome base position on previous assembly version 
  ORGN 	 Organism 
  ALFA_EUR 	 ALFA European Minor Allele Frequency 
  ALFA_AFR 	 ALFA African population Minor Allele Frequency 
  ALFA_ASN 	 ALFA Asian population Minor Allele Frequency 
  ALFA_LAC 	 ALFA Latin American 1 population Minor Allele Frequency 
  ALFA_LEN 	 ALFA Latin American 2 population Minor Allele Frequency 
  ALFA_SAS 	 ALFA South Asian population Minor Allele Frequency 
  ALFA_OTR 	 ALFA Other population Minor Allele Frequency
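
These field tags can be used to narrow a search with the usual Entrez "value[FIELD]" syntax. The following is a minimal sketch of building such a query string. The helper name build_fielded_query is hypothetical, and the choice of GENE and CHR (ALDH2 lies on chromosome 12) is only an illustration.

```r
# Hypothetical helper: combine "value[FIELD]" clauses with AND.
# `fields` is a named character vector: names are field tags,
# values are the search values.
build_fielded_query <- function(fields) {
  clauses <- paste0(fields, "[", names(fields), "]")
  paste(clauses, collapse = " AND ")
}

query <- build_fielded_query(c(GENE = "ALDH2", CHR = "12"))
print(query)

# The resulting string could then be passed on, e.g.
# rentrez::entrez_search(db = "snp", term = query)
```

Keeping the query construction in a small function makes it easy to inspect the exact string before it is sent to NCBI.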

Part 3: entrez_db_links()

Lists the other databases that are (potentially) linked to a given database. For example, dbSNP is linked to ClinVar, so clinvar appears in the result.

entrez_db_links("snp")

Databases with linked records for database 'snp'
 [1] bioproject biosample  clinvar    dbvar      gap        gene       nuccore   
 [8] pmc        probe      protein    pubmed     pubmed     snp        snp       
[15] sparcle    structure  taxonomy

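Part 4-1: entrez_search() -quick use-

Search for IDs in order to select entries in Entrez. First, let's search for rs671, for which I am heterozygous (ALDH2; aldehyde dehydrogenase 2; an important SNP that determines alcohol tolerance).

# Store the result in search
search <- entrez_search(db="snp", term="rs671")

# Inspect the result
search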

Entrez search result with 5 hits (object contains 5 IDs and no web_history object)
 Search term (as translated):  rs671[All Fields] 

# Check the IDs
search$ids
[1] "60823674" "4986830"  "4134524"  "2230021"  "671"

Part 4-2: entrez_search() -looking up many entries-

With a search term that matches many entries, the result looks like this.

search <- entrez_search(db="snp", term="ALDH2")

search

Entrez search result with 18788 hits (object contains 20 IDs and no web_history object)
 Search term (as translated):  ALDH2[All Fields]

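By default, only 20 IDs are stored in the local object. To see more of them, raise the maximum number of returned IDs with retmax, for example:

entrez_search(db="pubmed", term="R Language", retmax=40)

However, this approach does not suit very large result sets with thousands of hits. In that case, set the use_history flag so that the result set is kept on the NCBI server and referenced through a web_history object.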

search <- entrez_search(db="snp", term="ALDH2", use_history=TRUE)

search

Entrez search result with 18788 hits (object contains 20 IDs and a web_history object)
 Search term (as translated):  ALDH2[All Fields] 

search$web_history

Web history object (QueryKey = 1, WebEnv = MCID_6580416...)

Summarize the data for the IDs that were found.

# This is a demo, so cap the number of records at 100
summary <- entrez_summary(db="snp", web_history=search$web_history, retmax=100)

summary

List of  100 esummary records. First record:

 $`2136035282`
esummary result with 31 items:
 [1] uid                   snp_id                allele_origin        
 [4] global_mafs           global_population     global_samplesize    
 [7] suspected             clinical_significance genes                
[10] acc                   chr                   handle               
[13] spdi                  fxn_class             validated            
[16] docsum                tax_id                orig_build           
[19] upd_build             createdate            updatedate           
[22] ss                    allele                snp_class            
[25] chrpos                chrpos_prev_assm      text                 
[28] snp_id_sort           clinical_sort         cited_sort           
[31] chrpos_sort

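# Pick the esummary field you want by giving its name as a string
esum <- extract_from_esummary(summary, "snp_id")

This retrieves the information for all records in one go. The tutorial has an example that lists paper titles in bulk, which is worth a look as well.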
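
When several fields are requested at once, extract_from_esummary() returns a matrix with one row per field and one column per record, which the tutorial transposes before displaying. The following is a minimal sketch of turning such a matrix into a data frame, assuming each requested field holds a single value per record. The helper name esummary_matrix_to_df is hypothetical, and the matrix below is a hand-made stand-in with dummy values, not a real query result.

```r
# Hypothetical helper: turn a fields-by-records matrix into a data frame
# with one row per record and one column per field.
esummary_matrix_to_df <- function(m) {
  df <- as.data.frame(t(m), stringsAsFactors = FALSE)
  rownames(df) <- NULL
  df
}

# Hand-made stand-in for
#   extract_from_esummary(summary, c("snp_id", "chr"))
# The values are dummies chosen for illustration only.
demo <- matrix(
  c("111", "1",
    "222", "2"),
  nrow = 2,
  dimnames = list(c("snp_id", "chr"), c("a", "b"))
)

demo_df <- esummary_matrix_to_df(demo)
print(demo_df)
```

Fields that hold nested values (lists) would not simplify to a matrix this way and would need separate handling.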

Part 4-3: entrez_search() -making that figure you see in reviews-

Let's try a slightly more involved computation.

neuron_diff <- entrez_search(db = "pubmed",term = "iPS cells AND neural differentiation", use_history=TRUE)

neuron_diff
Entrez search result with 3123 hits

In addition to the search terms described above, the NCBI allows searches using Medical Subject Heading (MeSH) terms. These terms create a 'controlled vocabulary', and allow users to make very finely controlled queries of databases.

Using MeSH terms makes the search more precise, but since this is only a demo I skip them here.

# Only around 500 UIDs can be processed at once, so fetch in chunks
# all_recs collects the results of every chunk
all_recs <- list()

# Total number of hits (from the PubMed search above, not the earlier snp search)
total_records <- as.integer(neuron_diff$count)

# Fetch 200 records per request
chunk_size <- 200

# Number of chunks needed
num_chunks <- ceiling(total_records / chunk_size)

# Loop over the chunks
for (i in seq_len(num_chunks)) {
    # retstart is 0-based: the first chunk starts at offset 0
    start <- (i - 1) * chunk_size
    end <- min(i * chunk_size, total_records)
    
    # Fetch summaries for the current chunk
    recs <- entrez_summary(db = "pubmed", web_history = neuron_diff$web_history, retstart = start, retmax = chunk_size)
    
    # Append the fetched summaries to the list
    all_recs[[i]] <- recs
    
    # Report progress using 1-based record numbers
    cat("Processed records", start + 1, "to", end, "\n")
}

# Flatten the per-chunk lists into a single list for easier handling
combined_esummary <- unlist(all_recs, recursive = FALSE)
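
The E-utilities count retstart from 0, so the chunk offsets for a result set of a given size can be computed up front and checked before any request is sent. The following is a minimal sketch; the helper name chunk_offsets is hypothetical, and the numbers 3123 and 200 are the hit count and chunk size used above.

```r
# Hypothetical helper: 0-based retstart offsets for fetching `total`
# records in chunks of `chunk_size`.
chunk_offsets <- function(total, chunk_size) {
  if (total <= 0) return(integer(0))
  seq(0, total - 1, by = chunk_size)
}

offsets <- chunk_offsets(3123, 200)
print(length(offsets))
print(range(offsets))
```

Each offset could then be passed as retstart together with retmax = chunk_size, so that the first record is included and the last chunk simply comes back shorter.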

Making a nice figure

Let's draw the plot of publications per year that occasionally appears in papers.

search_year <- function(year, term){
    query <- paste(term, "AND (", year, "[PDAT])")
    entrez_search(db="pubmed", term=query, retmax=0)$count
}

year <- 1990:2023
papers <- sapply(year, search_year, term="iPS cells AND neural differentiation", USE.NAMES=FALSE)

plot(year, papers, type='b', main="iPS cells AND neural differentiation")

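# Tidy it up with ggplot
# (needs ggplot2, reshape2 for melt() and ggprism for theme_prism())
library(ggplot2)
library(reshape2)
library(ggprism)
trend_df <- melt(data.frame(year, papers), id.vars="year")

# https://github.com/ropensci/rentrez
# The ggplot code is adapted from the repository above
p <- ggplot(trend_df, aes(year, value, colour=variable))
# linewidth requires ggplot2 >= 3.4; use size=1 on older versions
p + geom_line(linewidth=1) + theme_prism()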
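
The per-year query string built inside search_year can also be constructed and inspected separately. The following is a minimal sketch; the helper name build_year_query is hypothetical. It wraps the search term in parentheses so that the year filter applies to the whole term, which is an assumption about the intent of the query rather than something rentrez requires.

```r
# Hypothetical helper: restrict a PubMed search term to one publication year
# using the [PDAT] field tag.
build_year_query <- function(term, year) {
  paste0("(", term, ") AND ", year, "[PDAT]")
}

# paste0() is vectorised, so several years can be handled at once.
queries <- build_year_query("iPS cells AND neural differentiation", 2021:2023)
print(queries)
```

Printing the queries before running the searches is a cheap way to catch stray spaces or misplaced parentheses.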

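Summary

With a bit of code, you can retrieve data freely without visiting the website. Automating everything from fetching the required data to processing it looks like it would be very convenient.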

https://www.nature.com/articles/s41598-019-43935-8
rentrez also runs behind systems like this one, so I would like to look into it further when building my own tools.

About this article (kyotobioinfo)

This article is day 18 of the #kyotobioinfo Advent Calendar, written by people who like Kyoto and bioinformatics!
https://adventar.org/calendars/9383
