Rでたくさんの遺伝子の配列を一気に取得する方法 #R

はじめに

「これらの遺伝子の配列をマルチプルアラインメントしたい」ということが普段からあると思います．
それにはまず，遺伝子の配列をどこからから入手しないといけません．
例えば，それらの遺伝子のEnsembl Gene IDが分かっているとき，ウェブブラウザからEnsemblのサイトに行き→Gene ID を検索窓に入れて→検索結果から数クリックして配列にたどり着いて→ドラッグしてコピペ，という作業を繰り返すという方法があります．
これはとても非人間的な時間だと思います．
そこで，BioconductorのパッケージであるbiomaRtとBiostringsを利用して，そういった作業を自動化しましょう．

実際のコマンド

まず，biomaRtパッケージを使ってEnsemblデータベースにアクセスします．

biomaRtを使ってEnsemblデータベースにアクセス

library(biomaRt)
ensembl <- useMart("ensembl")

自分が利用したいデータセットの名前を確認します．

利用できるデータセットの一覧を表示

listDatasets(ensembl)

利用するデータセットを指定します．
（ここではゼブラフィッシュ（Danio rerio）のゲノムデータベースを指定）

利用するデータセットの指定

ensembl <- useDataset("drerio_gene_ensembl", mart=ensembl)

あらかじめ用意してあるEnsembl Gene IDを文字列ベクトルにします．ここでは直接コマンドに書き込んでいますが，IDのリストのファイルを読み込むという方法もあります．

文字列ベクトルとしてIDを表現

id <- c("ENSDARG00000044774", "ENSDARG00000070913")

Ensembl Gene IDで指定された遺伝子のcDNA配列を取得します．

cDNA配列の取得

seq <- getSequence(id=id, type="ensembl_gene_id", seqType="cdna", mart=ensembl)

seqTypeの指定を変えることで，遺伝子のゲノム配列（エキソン＋イントロン）を取得することもできます．

ゲノム配列の取得

seq <- getSequence(id=id, type="ensembl_gene_id", seqType="gene_exon_intron", mart=ensembl)

取得した塩基配列をFASTAフォーマットで書き出します．
なお，getSequenceで取得した配列は，1列目に配列，2列目にIDが格納されたデータフレームという形になっているため，seq[,1]のような書き方をしています．

FASTAフォーマットでの保存

library(Biostrings)
dna <- DNAStringSet(seq[,1])
names(dna) <- seq[,2]
writeXStringSet(dna, file="hoge.fa", format="fasta")

応用

今回は，データベースをEnsembl，IDをEnsembl Gene IDとしました．
この他にも，データベースを変えたり，別の種類のIDで遺伝子を指定したり，遺伝子のIDでなく，染色体上のstartとendによって指定することも可能です．
詳しくは，biomaRtのUser guideをご覧下さい．
http://www.bioconductor.org/packages/release/bioc/html/biomaRt.html

接続できるBioMartデータベースの種類

listMarts()で調べることができます．

typeの種類

listFilters()で調べることができます．

seqTypeの種類

今回は，cDNA配列とゲノム配列だけ紹介しましたが，以下のように様々なバリエーションで配列を取得することが可能です．

’cdna’
’peptide’ for protein sequences
’3utr’ for 3’ UTR sequences,
’5utr’ for 5’ UTR sequences;
’gene_exon’ for exon sequences only;
’transcript_exon’ for transcript specific exonic sequences only;
’transcript_exon_intron’ gives the full unspliced transcript, that is exons + introns;
’gene_exon_intron’ gives the exons + introns of a gene;
’coding’ gives the coding sequence only;
’coding_transcript_flank’ gives the flanking region of the transcript including the UTRs, this must be accompanied with a given value for the upstream or downstream attribute;
’coding_gene_flank’ gives the flanking region of the gene including the UTRs, this must be accompanied with a given value for the upstream or downstream attribute;
’transcript_flank’ gives the flanking region of the transcript exculding the UTRs, this must be accompanied with a given value for the upstream or downstream attribute;
’gene_flank’ gives the flanking region of the gene excluding the UTRs, this must be accompanied with a given value for the upstream or downstream attribute.