LoginSignup
4
2

More than 3 years have passed since last update.

Rで遺伝子オントロジー解析&可視化

Last updated at Posted at 2021-06-02

参考

Robinson MD, McCarthy DJ, Smyth GK (2010). “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics, 26(1), 139-140. doi: 10.1093/bioinformatics/btp616.

McCarthy DJ, Chen Y, Smyth GK (2012). “Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation.” Nucleic Acids Research, 40(10), 4288-4297. doi: 10.1093/nar/gks042.

Young MD, Wakefield MJ, Smyth GK, Oshlack A (2010). “Gene ontology analysis for RNA-seq: accounting for selection bias.” Genome Biology, 11, R14.

準備:edgeRで発現変動解析

サンプルデータを使って解析

suppressPackageStartupMessages(library(edgeR))
suppressPackageStartupMessages(library(goseq))
suppressPackageStartupMessages(library(org.Hs.eg.db))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(patchwork))

### DE analysis ###

table.summary = read.table(system.file("extdata", "Li_sum.txt", package = "goseq"), 
                         sep = "\t", header = TRUE, stringsAsFactors = FALSE)

counts = table.summary[,-1]
rownames(counts) = table.summary[,1]
grp = factor(rep(c("Control", "Treated"),times = c(4, 3)))
summarized = DGEList(counts,lib.size = colSums(counts), group = grp)

disp = estimateCommonDisp(summarized)
disp$common.dispersion

tested = exactTest(disp)
topTags(tested)

edgeRの解析結果(の一部)

logFC logCPM PValue FDR
ENSG00000127954 11.557868 6.680748 2.574972e-80 1.274766e-75
ENSG00000151503 5.398963 8.499530 1.781732e-65 4.410322e-61
ENSG00000096060 4.897600 9.446705 7.983756e-60 1.317479e-55

発現変動遺伝子 (DEGs) をラベル

  • 1 for DEGs, 0 for not DEGs
genes = as.integer(p.adjust(tested$table$PValue[tested$table$logFC != 0],
                          method = "BH") < 0.05)

genes = as.integer(p.adjust(tested$table$PValue[tested$table$logFC != 0], method = "BH") < 0.05)
names(genes) = row.names(tested$table[tested$table$logFC != 0,])
table(genes)

goseqで遺伝子オントロジー解析

PWF (Probability Weighting Function)を得る

### GO analysis ###
pwf = nullp(genes, "hg19", "ensGene")
head(pwf)

GO.wall = goseq(pwf, "hg19", "ensGene")
head(GO.wall)

goseqの解析結果(の一部)

category over_represented_pvalue under_represented_pvalue numDEInCat numInCat term ontology
GO:0005737 9.843483e-10 1 1954 8980 cytoplasm CC

ggplot2で解析結果を可視化

可視化に使うカラムを選択

  • termはGO termの名前
  • numInCatはそのGO termに属するすべての遺伝子の数
  • numDEInCatはそのGO termに属する発現変動遺伝子の数
res <- GO.wall %>%
  dplyr::select(term, over_represented_pvalue, numInCat, numDEInCat, ontology)
head(res)

go_plot関数を定義

  • p valueのトップ10を選択
go_plot <- function(x){
  x %>%
    top_n(10, wt = -over_represented_pvalue) %>% 
    mutate(hitsPerc = numDEInCat * 100 / numInCat) %>% 
    ggplot(aes(x = -log10(over_represented_pvalue), 
               y = reorder(term, -over_represented_pvalue),
               colour = hitsPerc, 
               size = numDEInCat)) +
    geom_point() +
    guides(size = guide_legend(order = 1, reverse = TRUE), 
           colour = guide_colourbar(order = 2)) +
    labs(x = "-log10 (p value)", y = "GO term", colour = "Hits (%)", size = "Count") +
    theme_grey(base_size = 20)
}

オントロジーごとに分けてプロット

BP <- res %>%
  filter(ontology == "BP") 
MF <- res %>%
  filter(ontology == "MF")
CC <- res %>%
  filter(ontology == "CC")

p1 <- go_plot(BP) + labs(title = "Biological Processes")
p2 <- go_plot(MF) + labs(title = "Molecular Functions")
p3 <- go_plot(CC) + labs(title = "Cellular Components")

p1 / p2 / p3

プロット結果

  • p value順に並べた
  • CountはGO termに属する発現変動遺伝子の数
  • HitsはGO termに属するすべての遺伝子のうちの発現変動遺伝子の割合

GO_plot.png

4
2
0

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
4
2