More than 5 years have passed since last update.

Bioconductorにおける区間データの表現 1

Posted at 2013-10-14

Bioconductorにおける区間データの表現 1

Rにおける区間データの扱いについて

リファレンスゲノム上の区間（short readがアラインメントされた場所，エキソンイントロン構造など）を表すのに，Bioconductorではいくつかのデータ構造を用意しています．

以下ではIRanges，GenomicRanges，GenomicFeaturesというパッケージについて紹介します．

GRanges

GRangesクラスはGenomicRangesパッケージのクラスで，ゲノム上の区間を要素に持つベクトルというイメージです．
IRangesクラスをゲノム用に拡張したものです．

GRangesオブジェクトを作成する際には，データフレームのように，必要な特徴をベクトルで与えます．

GRangesオブジェクトの作成

library(GenomicRanges)
genes <- GRanges(seqnames=c("3R", "X"), 
                 ranges=IRanges(start=c(19967117, 18962306), 
                                end=c(19973212, 18962925)), 
                 strand=c("+", "-"), 
                 seqlengths=c(`3R`=27905053L, `X`=22422827L))

座標は1-basedで，strandに関わらず，座標の小さい方がstartになります．また，「閉じた」区間であるため，startとendの座標が同一の場合は幅１の区間を表します．
seqliengthはオプションです．

> genes
GRanges with 2 ranges and 0 metadata columns:
      seqnames               ranges strand
         <Rle>            <IRanges>  <Rle>
  [1]       3R [19967117, 19973212]      +
  [2]        X [18962306, 18962925]      -
  ---
  seqlengths:
         3R        X
   27905053 22422827

要素へのアクセスなど

ベクトルと同じです．

> genes[2]
GRanges with 1 range and 0 metadata columns:
      seqnames               ranges strand
         <Rle>            <IRanges>  <Rle>
  [1]        X [18962306, 18962925]      -
  ---
  seqlengths:
         3R        X
   27905053 22422827

また，seqnames()，start()，end()，width()，strand()，names()といった関数で，個々の値にアクセスできます．

要素の操作

ir <- IRanges(start=c(7, 12, 22:24),
             end=c(15, 12, 26, 27, 28))

個々の区間に対するもの

shift()，resize()，flank()，narrow()，reflect()，restrict()

> shift(ir,5)
IRanges of length 5
    start end width
[1]    12  20     9
[2]    17  17     1
[3]    27  31     5
[4]    28  32     5
[5]    29  33     5

区間の集合に対するもの

reduce()，disjoin()，gaps()，range()，coverage()

> reduce(ir)
IRanges of length 2
    start end width
[1]     7  15     9
[2]    22  28     7
> coverage(ir)
integer-Rle of length 28 with 10 runs
  Lengths: 6 5 1 3 6 1 1 3 1 1
  Values : 0 1 2 1 0 1 2 3 2 1

複数のIRangesオブジェクトに対するもの

intersect(), setdiff(), union(), pintersect(), psetdiff(), punion(), countOverlaps(), findOverlaps()

メタデータ

mcols() : 各区間についての何らかの情報(データフレーム)を格納し，操作する
metadata() : GRangesオブジェクト自体に関する情報をリストとして格納する

mcols(genes) <- DataFrame(EntrezId=c("42865", "2768869"),
                           Symbol=c("kal-1", "CG34330"))

> genes
GRanges with 2 ranges and 2 metadata columns:
      seqnames               ranges strand |    EntrezId      Symbol
         <Rle>            <IRanges>  <Rle> | <character> <character>
  [1]       3R [19967117, 19973212]      + |       42865       kal-1
  [2]        X [18962306, 18962925]      - |     2768869     CG34330
  ---
  seqlengths:
         3R        X
   27905053 22422827

metadata(genes) <- list(CreatedBy="A. User", Date=date())

GRangesList

GRangesListはGRangesオブジェクトを要素に持つリストのようなものです．
例えば，遺伝子を複数のエキソンの集合としてGRangesオブジェクトとして表現するとき，遺伝子の集合をGRangesListオブジェクトとして表現することができます．
また，シークエンサーのリードがスプリットアラインメントされていることをGRangesオブジェクトとして表現するとき，リードのアラインメント全体をGRangesListオブジェクトとして表現できます．

gr1 <- GRanges(seqnames = "chr2", ranges = IRanges(3, 6), 
        strand = "+", score = 5L, GC = 0.45)
gr2 <- GRanges(seqnames = c("chr1", "chr1"),
        ranges = IRanges(c(7,13), width = 3),
        strand = c("+", "-"), score = 3:4, GC = c(0.3, 0.5))
grl <- GRangesList("txA" = gr1, "txB" = gr2)

GRangesListオブジェクトに[i]でアクセスするとGRangesListオブジェクトが返ってくるのに対し，[[i]]でアクセスするとGRangesオブジェクトが返ってきます．

> grl
GRangesList of length 2:
$txA 
GRanges with 1 range and 2 metadata columns:
      seqnames    ranges strand |     score        GC
         <Rle> <IRanges>  <Rle> | <integer> <numeric>
  [1]     chr2    [3, 6]      + |         5      0.45

$txB 
GRanges with 2 ranges and 2 metadata columns:
      seqnames   ranges strand | score  GC
  [1]     chr1 [ 7,  9]      + |     3 0.3
  [2]     chr1 [13, 15]      - |     4 0.5

---
seqlengths:
 chr2 chr1
   NA   NA
> grl[1]
GRangesList of length 1:
$txA 
GRanges with 1 range and 2 metadata columns:
      seqnames    ranges strand |     score        GC
         <Rle> <IRanges>  <Rle> | <integer> <numeric>
  [1]     chr2    [3, 6]      + |         5      0.45

---
seqlengths:
 chr2 chr1
   NA   NA
> grl[[1]]
GRanges with 1 range and 2 metadata columns:
      seqnames    ranges strand |     score        GC
         <Rle> <IRanges>  <Rle> | <integer> <numeric>
  [1]     chr2    [3, 6]      + |         5      0.45
  ---
  seqlengths:
   chr2 chr1
     NA   NA   
> grl$txA
GRanges with 1 range and 2 metadata columns:
      seqnames    ranges strand |     score        GC
         <Rle> <IRanges>  <Rle> | <integer> <numeric>
  [1]     chr2    [3, 6]      + |         5      0.45
  ---
  seqlengths:
   chr2 chr1
     NA   NA

grl <- GRangesList()
# appendするとき
grl <- append(grl, GRangesList(gr1))
grl <- append(grl, GRangesList(gr2))
# GRangesオブジェクトごとに処理するとき（lapplyだけだとlistが返ってくる）
grl <- GRangesList(lapply(grl, function(x){
  resize(x, width=1, fix="center")
}))

GenomicFeatures

遺伝子などのゲノム上の情報のアノテーションを抽出，保存，検索しやすくするパッケージです．
アノテーションをsqliteデータベースとして保存し，R上ではGRangesListオブジェクトとして扱うことができます．

# source("http://bioconductor.org/biocLite.R")
# biocLite("TxDb.Dmelanogaster.UCSC.dm3.ensGene")
# Fruit fly (dm3)のEnsembl GenesのTranscriptDbオブジェクトをロード
library(TxDb.Dmelanogaster.UCSC.dm3.ensGene)
# 名前が長いのでエイリアスを作成
txdb <- TxDb.Dmelanogaster.UCSC.dm3.ensGene
# TranscriptDbから，遺伝子ごとにエキソンをまとめて，GRangesListとして取得
ex0 <- exonsBy(txdb, "gene")
# 遺伝子ごとのエキソンの数（区間数）の統計
table(elementLengths(ex0))
ids <- c("FBgn0002183", "FBgn0003360", "FBgn0025111", "FBgn0036449")
# それぞれのGRangesに重なりのあるエキソンがある場合，マージする
ex <- reduce(ex0[ids])

UCSCからのTranscriptDbの作成

txdb <- makeTranscriptDbFromUCSC("dm3", "ensGene")
saveDb(txdb, "my.dm3.ensGene.txdb.sqlite")

IRanges, GRanges, GRangesListに共通の演算子・関数

Accessors

start(), end(), width()
names()
mcols(), metadata()
length() : 区間の数
range() : 最小のstartから最大のendまでの範囲

Ordering

<, <=, >, >=, ==, !=
sort(), order, rank()
unique(), duplicated()

Arithmetic

r+x, r-x, r*x : 区間rを数xで拡大縮小する
shift()，resize()，flank()，distance()，restrict()

Set operations

reduce(x) : 重なっている部分を併合する
intersect(x, y), union(x, y), setdiff(x, y)
pintersect(x, y), punion(x, y), psetdiff(x, y) : xとyのi番目の要素同士の比較をparallelに行う
gaps(x), pgap(x, y) : reduceした区間でカバーされていない区間を返す．pgapは各x[i]とy[i]の組についてparallel．
disjoin(x) : startもしくはendがある度に区切られてできる区間

Overlaps

findOverlaps(x, y) : xの各要素についてyに重なりがあるかを調べる（indexを返す）
countOverlaps(x, y)
nearest(x, y) : xの各要素についてyで最近傍の区間を返す
precede(x, y), follow(x, y) : nearestに方向性の制限がついたもの
x %in% y : xの各要素について，yに重なりがあるかをlogicalのベクトルで返す

Coverage

coverage(x) : coverageをRleオブジェクトとして返す（Run-length encoding）

Extraction

r[i] : 区間を返す
r[[i]] : start[i]からend[i]までの整数列を返す
subsetByOverlaps(x, y) : yに重なりがあったxの要素を返す
head(), tail(), rev(), rep()

Split, combine

split() : RangesListとして返す
c

参考

Nicolas Delhomme and Martin Morgan, "R / Bioconductor for High-Throughput Sequence Analysis" (PDF)
http://www.bioconductor.org/packages/2.12/bioc/vignettes/GenomicRanges/inst/doc/GenomicRangesIntroduction.pdf

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up