More than 5 years have passed since last update.

Rでたくさんの遺伝子の配列を一気に取得する方法

Last updated at 2013-07-06Posted at 2013-07-06

はじめに

「これらの遺伝子の配列をマルチプルアラインメントしたい」ということが普段からあると思います．
それにはまず，遺伝子の配列をどこからから入手しないといけません．
例えば，それらの遺伝子のEnsembl Gene IDが分かっているとき，ウェブブラウザからEnsemblのサイトに行き→Gene ID を検索窓に入れて→検索結果から数クリックして配列にたどり着いて→ドラッグしてコピペ，という作業を繰り返すという方法があります．
これはとても非人間的な時間だと思います．
そこで，BioconductorのパッケージであるbiomaRtとBiostringsを利用して，そういった作業を自動化しましょう．

実際のコマンド

まず，biomaRtパッケージを使ってEnsemblデータベースにアクセスします．

biomaRtを使ってEnsemblデータベースにアクセス

library(biomaRt)
ensembl <- useMart("ensembl")

自分が利用したいデータセットの名前を確認します．

利用できるデータセットの一覧を表示

listDatasets(ensembl)

利用するデータセットを指定します．
（ここではゼブラフィッシュ（Danio rerio）のゲノムデータベースを指定）

利用するデータセットの指定

ensembl <- useDataset("drerio_gene_ensembl", mart=ensembl)

あらかじめ用意してあるEnsembl Gene IDを文字列ベクトルにします．ここでは直接コマンドに書き込んでいますが，IDのリストのファイルを読み込むという方法もあります．

文字列ベクトルとしてIDを表現

id <- c("ENSDARG00000044774", "ENSDARG00000070913")

Ensembl Gene IDで指定された遺伝子のcDNA配列を取得します．

cDNA配列の取得

seq <- getSequence(id=id, type="ensembl_gene_id", seqType="cdna", mart=ensembl)

seqTypeの指定を変えることで，遺伝子のゲノム配列（エキソン＋イントロン）を取得することもできます．

ゲノム配列の取得

seq <- getSequence(id=id, type="ensembl_gene_id", seqType="gene_exon_intron", mart=ensembl)

取得した塩基配列をFASTAフォーマットで書き出します．
なお，getSequenceで取得した配列は，1列目に配列，2列目にIDが格納されたデータフレームという形になっているため，seq[,1]のような書き方をしています．

FASTAフォーマットでの保存

library(Biostrings)
dna <- DNAStringSet(seq[,1])
names(dna) <- seq[,2]
writeXStringSet(dna, file="hoge.fa", format="fasta")

応用

今回は，データベースをEnsembl，IDをEnsembl Gene IDとしました．
この他にも，データベースを変えたり，別の種類のIDで遺伝子を指定したり，遺伝子のIDでなく，染色体上のstartとendによって指定することも可能です．
詳しくは，biomaRtのUser guideをご覧下さい．
http://www.bioconductor.org/packages/release/bioc/html/biomaRt.html

接続できるBioMartデータベースの種類

listMarts()で調べることができます．

typeの種類

listFilters()で調べることができます．

seqTypeの種類

今回は，cDNA配列とゲノム配列だけ紹介しましたが，以下のように様々なバリエーションで配列を取得することが可能です．

’cdna’
’peptide’ for protein sequences
’3utr’ for 3’ UTR sequences,
’5utr’ for 5’ UTR sequences;
’gene_exon’ for exon sequences only;
’transcript_exon’ for transcript specific exonic sequences only;
’transcript_exon_intron’ gives the full unspliced transcript, that is exons + introns;
’gene_exon_intron’ gives the exons + introns of a gene;
’coding’ gives the coding sequence only;
’coding_transcript_flank’ gives the flanking region of the transcript including the UTRs, this must be accompanied with a given value for the upstream or downstream attribute;
’coding_gene_flank’ gives the flanking region of the gene including the UTRs, this must be accompanied with a given value for the upstream or downstream attribute;
’transcript_flank’ gives the flanking region of the transcript exculding the UTRs, this must be accompanied with a given value for the upstream or downstream attribute;
’gene_flank’ gives the flanking region of the gene excluding the UTRs, this must be accompanied with a given value for the upstream or downstream attribute.

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up