More than 3 years have passed since last update.

Rで遺伝子オントロジー解析&可視化

Last updated at 2021-06-03Posted at 2021-06-02

参考

Robinson MD, McCarthy DJ, Smyth GK (2010). “edgeR: a Bioconductor package for differential expression analysis of digital gene expression data.” Bioinformatics, 26(1), 139-140. doi: 10.1093/bioinformatics/btp616.

McCarthy DJ, Chen Y, Smyth GK (2012). “Differential expression analysis of multifactor RNA-Seq experiments with respect to biological variation.” Nucleic Acids Research, 40(10), 4288-4297. doi: 10.1093/nar/gks042.

Young MD, Wakefield MJ, Smyth GK, Oshlack A (2010). “Gene ontology analysis for RNA-seq: accounting for selection bias.” Genome Biology, 11, R14.

準備：edgeRで発現変動解析

サンプルデータを使って解析

suppressPackageStartupMessages(library(edgeR))
suppressPackageStartupMessages(library(goseq))
suppressPackageStartupMessages(library(org.Hs.eg.db))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(patchwork))

### DE analysis ###

table.summary = read.table(system.file("extdata", "Li_sum.txt", package = "goseq"), 
                         sep = "\t", header = TRUE, stringsAsFactors = FALSE)

counts = table.summary[,-1]
rownames(counts) = table.summary[,1]
grp = factor(rep(c("Control", "Treated"),times = c(4, 3)))
summarized = DGEList(counts,lib.size = colSums(counts), group = grp)

disp = estimateCommonDisp(summarized)
disp$common.dispersion

tested = exactTest(disp)
topTags(tested)

edgeRの解析結果（の一部）

	logFC	logCPM	PValue	FDR
ENSG00000127954	11.557868	6.680748	2.574972e-80	1.274766e-75
ENSG00000151503	5.398963	8.499530	1.781732e-65	4.410322e-61
ENSG00000096060	4.897600	9.446705	7.983756e-60	1.317479e-55

発現変動遺伝子 (DEGs) をラベル

1 for DEGs, 0 for not DEGs

genes = as.integer(p.adjust(tested$table$PValue[tested$table$logFC != 0],
                          method = "BH") < 0.05)

genes = as.integer(p.adjust(tested$table$PValue[tested$table$logFC != 0], method = "BH") < 0.05)
names(genes) = row.names(tested$table[tested$table$logFC != 0,])
table(genes)

goseqで遺伝子オントロジー解析

PWF (Probability Weighting Function)を得る

### GO analysis ###
pwf = nullp(genes, "hg19", "ensGene")
head(pwf)

GO.wall = goseq(pwf, "hg19", "ensGene")
head(GO.wall)

goseqの解析結果（の一部）

category	over_represented_pvalue	under_represented_pvalue	numDEInCat	numInCat	term	ontology
GO:0005737	9.843483e-10	1	1954	8980	cytoplasm	CC

ggplot2で解析結果を可視化

可視化に使うカラムを選択

termはGO termの名前
numInCatはそのGO termに属するすべての遺伝子の数
numDEInCatはそのGO termに属する発現変動遺伝子の数

res <- GO.wall %>%
  dplyr::select(term, over_represented_pvalue, numInCat, numDEInCat, ontology)
head(res)

go_plot関数を定義

p valueのトップ10を選択

go_plot <- function(x){
  x %>%
    top_n(10, wt = -over_represented_pvalue) %>% 
    mutate(hitsPerc = numDEInCat * 100 / numInCat) %>% 
    ggplot(aes(x = -log10(over_represented_pvalue), 
               y = reorder(term, -over_represented_pvalue),
               colour = hitsPerc, 
               size = numDEInCat)) +
    geom_point() +
    guides(size = guide_legend(order = 1, reverse = TRUE), 
           colour = guide_colourbar(order = 2)) +
    labs(x = "-log10 (p value)", y = "GO term", colour = "Hits (%)", size = "Count") +
    theme_grey(base_size = 20)
}

オントロジーごとに分けてプロット

BP <- res %>%
  filter(ontology == "BP") 
MF <- res %>%
  filter(ontology == "MF")
CC <- res %>%
  filter(ontology == "CC")

p1 <- go_plot(BP) + labs(title = "Biological Processes")
p2 <- go_plot(MF) + labs(title = "Molecular Functions")
p3 <- go_plot(CC) + labs(title = "Cellular Components")

p1 / p2 / p3

プロット結果

p value順に並べた
CountはGO termに属する発現変動遺伝子の数
HitsはGO termに属するすべての遺伝子のうちの発現変動遺伝子の割合

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up