More than 3 years have passed since last update.

gtfファイルの取り扱い

Posted at 2020-07-22

gftファイルの取り扱いについて

【目的】RNA-seqの解析をしていて、featureCountsでの集計が、ensembl idになってしまって、Gene symbolなどがわからなくなってしまうので、追加でデータを挿れたい。
(他にいい方法がある場合、教えて下さい…)
(pandasの勉強も兼ねてやっているので、遠回りしているところもあります。。。)

gtfファイル

今回は、GENCODEのhuman Comprehensive gene annotation のAllを使用している。
ここからダウンロード
https://www.gencodegenes.org/human/

ダウンロードすると、
gencode.v34.chr_patch_hapl_scaff.annotation.gtf
が手に入る。

gtfparse

gtfの取り扱いについて調べたところ、これがヒットしたので、取り敢えず、インストールしてみた。

pip install gtfparse

これでデータを見ることが出来るか確かめてみる。

import pandas as pd
from gtfparse import read_gtf

df = read_gtf("gencode.v34.chr_patch_hapl_scaff.annotation.gtf")
print(df.head())
print(df.shape)
print(df.columns)

これがやたら時間がかかる・・・

INFO:root:Extracted GTF attributes: ['gene_id', 'gene_type', 'gene_name', 'level', 'hgnc_id', 'havana_gene', 'transcript_id', 'transcript_type', 'transcript_name', 'transcript_support_level', 'tag', 'havana_transcript', 'exon_number', 'exon_id', 'ont', 'protein_id', 'ccdsid']
  seqname  source     feature  start    end  score  ...     havana_transcript exon_number            exon_id ont protein_id ccdsid
0    chr1  HAVANA        gene  11869  14409    NaN  ...                                                                           
1    chr1  HAVANA  transcript  11869  14409    NaN  ...  OTTHUMT00000362751.1                                                     
2    chr1  HAVANA        exon  11869  12227    NaN  ...  OTTHUMT00000362751.1           1  ENSE00002234944.1                      
3    chr1  HAVANA        exon  12613  12721    NaN  ...  OTTHUMT00000362751.1           2  ENSE00003582793.1                      
4    chr1  HAVANA        exon  13221  14409    NaN  ...  OTTHUMT00000362751.1           3  ENSE00002312635.1                      

[5 rows x 25 columns]
(3171209, 25)
Index(['seqname', 'source', 'feature', 'start', 'end', 'score', 'strand',
       'frame', 'gene_id', 'gene_type', 'gene_name', 'level', 'hgnc_id',
       'havana_gene', 'transcript_id', 'transcript_type', 'transcript_name',
       'transcript_support_level', 'tag', 'havana_transcript', 'exon_number',
       'exon_id', 'ont', 'protein_id', 'ccdsid'],
      dtype='object')

こんな感じでデータが取れたので、OKそう。31万以上のデータが入ってるんだね。(Gene symbolとしては同一のものがたくさん含まれてます。)

ここで必要そうなものは、
'gene_name', 'seqname','start','end','strand','gene_type', 'gene_id','transcript_id'
なので、これらを取ってきたら、まずはOK。あたらにcsvファイルとして保存しておく。

Gene_Symbol_from_GTF.py

from gtfparse import read_gtf

df = read_gtf("gencode.v34.chr_patch_hapl_scaff.annotation.gtf")
print(df.head())
print(df.shape)
print(df.columns)

df2 = df[['gene_name', 'seqname','start','end','strand','gene_type', 'gene_id','transcript_id']]
df2.to_csv("gencode.v34.chr_patch_hapl_scaff.annotation.cur.gtf.csv")

一応確認。

INFO:root:Extracted GTF attributes: ['gene_id', 'gene_type', 'gene_name', 'level', 'hgnc_id', 'havana_gene', 'transcript_id', 'transcript_type', 'transcript_name', 'transcript_support_level', 'tag', 'havana_transcript', 'exon_number', 'exon_id', 'ont', 'protein_id', 'ccdsid']
  gene_name seqname  start    end strand                           gene_type            gene_id      transcript_id
0   DDX11L1    chr1  11869  14409      +  transcribed_unprocessed_pseudogene  ENSG00000223972.5                   
1   DDX11L1    chr1  11869  14409      +  transcribed_unprocessed_pseudogene  ENSG00000223972.5  ENST00000456328.2
2   DDX11L1    chr1  11869  12227      +  transcribed_unprocessed_pseudogene  ENSG00000223972.5  ENST00000456328.2
3   DDX11L1    chr1  12613  12721      +  transcribed_unprocessed_pseudogene  ENSG00000223972.5  ENST00000456328.2
4   DDX11L1    chr1  13221  14409      +  transcribed_unprocessed_pseudogene  ENSG00000223972.5  ENST00000456328.2
5   DDX11L1    chr1  12010  13670      +  transcribed_unprocessed_pseudogene  ENSG00000223972.5  ENST00000450305.2
6   DDX11L1    chr1  12010  12057      +  transcribed_unprocessed_pseudogene  ENSG00000223972.5  ENST00000450305.2
7   DDX11L1    chr1  12179  12227      +  transcribed_unprocessed_pseudogene  ENSG00000223972.5  ENST00000450305.2
8   DDX11L1    chr1  12613  12697      +  transcribed_unprocessed_pseudogene  ENSG00000223972.5  ENST00000450305.2
9   DDX11L1    chr1  12975  13052      +  transcribed_unprocessed_pseudogene  ENSG00000223972.5  ENST00000450305.2
          gene_name     seqname   start     end strand       gene_type            gene_id      transcript_id
3171199  AC007325.2  KI270734.1  138480  138482      -  protein_coding  ENSG00000277196.4  ENST00000621424.4
3171200  AC007325.2  KI270734.1  161689  161852      -  protein_coding  ENSG00000277196.4  ENST00000621424.4
3171201  AC007325.2  KI270734.1  161587  161626      -  protein_coding  ENSG00000277196.4  ENST00000621424.4
3171202  AC007325.2  KI270734.1  138082  138482      -  protein_coding  ENSG00000277196.4  ENST00000621424.4
3171203  AC237676.1  KI270744.1   51009   51114      -           snRNA  ENSG00000278625.1                   
3171204  AC237676.1  KI270744.1   51009   51114      -           snRNA  ENSG00000278625.1  ENST00000616830.1
3171205  AC237676.1  KI270744.1   51009   51114      -           snRNA  ENSG00000278625.1  ENST00000616830.1
3171206  AC116622.1  KI270750.1  148668  148843      +           snRNA  ENSG00000277374.1                   
3171207  AC116622.1  KI270750.1  148668  148843      +           snRNA  ENSG00000277374.1  ENST00000612925.1
3171208  AC116622.1  KI270750.1  148668  148843      +           snRNA  ENSG00000277374.1  ENST00000612925.1

OKそうですね。(transcript_idは今回使わないのですが、ちょっと別のことで使用したいので、取っておきました。）

今回は、gene_nameとgene_type, gene_idの対応表がほしいので、gene_nameで重複の削除をします。

import pandas as pd

df = pd.read_csv('gencode.v34.chr_patch_hapl_scaff.annotation.cur.gtf.csv',header = 0, sep=',')
df2 = df[~df.duplicated(['gene_name'])]
df2.to_csv("gencode.v34.Symbol.csv")

これで、目的のものが持ってこれました。
あとは、自分の集計データとこれをExcelで言うところのvlookupをしたらいいです。

＿＿＿＿＿＿＿＿＿＿＿＿＿＿＿＿＿＿＿＿＿＿＿＿＿＿＿＿＿＿＿＿＿＿＿＿＿＿＿＿＿＿＿＿＿
もしかして、そもそもgtfファイルはread_csvで読み込めた？

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up