What is enzyme.dat?
It is the flat-file distribution of ENZYME, a database of enzyme nomenclature.
Each entry in the file is made up of the following line types:
ID Identification (Begins each entry; 1 per entry)
DE Description (official name) (>=1 per entry)
AN Alternate name(s) (>=0 per entry)
CA Catalytic activity (>=1 per entry)
CF Cofactor(s) (>=0 per entry)
CC Comments (>=0 per entry)
PR Cross-references to PROSITE (>=0 per entry)
DR Cross-references to Swiss-Prot (>=0 per entry)
Every line starts with one of these two-letter codes, followed by the data for that line.
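To make the line layout concrete, here is a minimal sketch that splits one line into its code and its data. It assumes the two-letter code sits in the first two characters and the data starts at the sixth character, which is what the `[5:-1]` slices used later in this post imply. `LINE_CODES` and `split_line` are names made up for this illustration.

```python
# Line codes as listed above (two-letter code -> meaning).
LINE_CODES = {
    "ID": "Identification",
    "DE": "Description (official name)",
    "AN": "Alternate name(s)",
    "CA": "Catalytic activity",
    "CF": "Cofactor(s)",
    "CC": "Comments",
    "PR": "Cross-references to PROSITE",
    "DR": "Cross-references to Swiss-Prot",
}


def split_line(line):
    """Split one enzyme.dat line into (line code, data).

    Assumption: the code is in characters 1-2 and the data starts at character 6.
    """
    code = line[:2]
    data = line[5:].rstrip("\n")
    return code, data


code, data = split_line("DE   Alcohol dehydrogenase.\n")
print(code, LINE_CODES[code], data, sep=" | ")
```

The entry terminator `//` has no data, so `split_line("//\n")` returns an empty string as its second element.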
Contents of enzyme.dat (excerpt)
ID 1.1.1.1
DE Alcohol dehydrogenase.
AN Aldehyde reductase.
CA (1) A primary alcohol + NAD(+) = an aldehyde + NADH.
CA (2) A secondary alcohol + NAD(+) = a ketone + NADH.
CF Zn(2+) or Fe cation.
CC -!- Acts on primary or secondary alcohols or hemi-acetals with very broad
CC specificity; however the enzyme oxidizes methanol much more poorly
CC than ethanol.
CC -!- The animal, but not the yeast, enzyme acts also on cyclic secondary
CC alcohols.
PR PROSITE; PDOC00058;
PR PROSITE; PDOC00059;
PR PROSITE; PDOC00060;
DR P07327, ADH1A_HUMAN; P28469, ADH1A_MACMU; Q5RBP7, ADH1A_PONAB;
DR P25405, ADH1A_SAAHA; P25406, ADH1B_SAAHA; P00327, ADH1E_HORSE;
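As part of my research I needed a lookup table between EC numbers, which classify enzymes by function (the ID lines above),
and the UniProt entries of individual proteins (the DR lines above).
I therefore decided to extract the ID, the DR, and the description of each EC number (the DE lines above) from enzyme.dat and join them into a single table.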
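Requirements
- enzyme.dat (downloaded from ftp://ftp.expasy.org/databases/enzyme)
- Python 3
Python modules used
- pandas (used to build the DataFrame)
Plan
Pull out the lines that begin with ID, DE, and DR and build one list for each.
Combine the lists into a DataFrame and write it out as a CSV file.
What I did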
Opening the file
path = "enzyme.dat"
with open(path) as f:
    s = f.readlines()  # read the file as a list of lines
s = s[24:]  # skip the explanatory header at the top of the file
Building the id list
id_list = []
for i in s:
    if i.startswith("ID "):  # look for lines that start with "ID "
        x = i[5:-1]  # drop the 5-character line-code prefix and the trailing newline
        id_list.append(x)  # add to the list
id_list[:10]
['1.1.1.1',
'1.1.1.2',
'1.1.1.3',
'1.1.1.4',
'1.1.1.5',
'1.1.1.6',
'1.1.1.7',
'1.1.1.8',
'1.1.1.9',
'1.1.1.10']
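Building the description list
DE and DR can each span two or more lines, so I look at the following line while accumulating the text.
I keep appending to a string for as long as the next line still starts with "DE", and once the last DE line of the entry is reached I append the string to the list.
description_list = []
name = ""
for i in range(len(s)):
    if s[i].startswith("DE "):
        x = s[i][5:-1]
        name += x
        # the next line is no longer DE, so this entry's description is complete
        if not s[i + 1].startswith("DE "):
            description_list.append(name)
            name = ""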
description_list[:10]
['Alcohol dehydrogenase.',
'Alcohol dehydrogenase (NADP(+)).',
'Homoserine dehydrogenase.',
'(R,R)-butanediol dehydrogenase.',
'Transferred entry: 1.1.1.303 and 1.1.1.304.',
'Glycerol dehydrogenase.',
'Propanediol-phosphate dehydrogenase.',
'Glycerol-3-phosphate dehydrogenase (NAD(+)).',
'D-xylulose reductase.',
'L-xylulose reductase.']
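The loop above peeks at `s[i + 1]`, which works here because a DE line is never the last line of the file. As a minimal sketch of the same idea without the look-ahead, the helper below flushes the accumulated text when a line with a different code appears. `collect_field` is a hypothetical name, and the sample lines are made up for illustration; the sketch assumes the same 5-character prefix as the slices above.

```python
def collect_field(lines, code):
    """Join consecutive lines that share `code` into one string per run."""
    values = []
    current = []
    for line in lines:
        if line[:2] == code:
            # still inside a run of `code` lines: keep accumulating
            current.append(line[5:].rstrip("\n"))
        elif current:
            # the run has ended: store the joined text and start over
            values.append("".join(current))
            current = []
    if current:
        # flush a run that reaches the end of the input
        values.append("".join(current))
    return values


sample = [
    "ID   1.1.1.1\n",
    "DE   Alcohol dehydrogenase.\n",
    "//\n",
    "ID   1.1.1.5\n",
    "DE   Transferred entry: 1.1.1.303 and\n",
    "DE    1.1.1.304.\n",
    "//\n",
]
print(collect_field(sample, "DE"))
```

The same helper could be called with `"DR"` instead of `"DE"`, since both fields are collected in the same way in this post.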
Building the accession list
accession_list = []
name = ""
for i in range(len(s)):
    if s[i].startswith("DR "):
        x = s[i][5:-1]
        name += x
        # the next line is no longer DR, so this entry's cross-references are complete
        if not s[i + 1].startswith("DR "):
            accession_list.append(name)
            name = ""
accession_list[1]
'Q6AZW2, A1A1A_DANRE; Q568L5, A1A1B_DANRE; Q24857, ADH3_ENTHI ;Q04894, ADH6_YEAST ; P25377, ADH7_YEAST ; O57380, ADH8_PELPE ;Q9F282, ADHA_THEET ; P0CH36, ADHC1_MYCS2; P0CH37, ADHC2_MYCS2;P0A4X1, ADHC_MYCBO ; P9WQC4, ADHC_MYCTO ; P9WQC5, ADHC_MYCTU ;P27250, AHR_ECOLI ; Q3ZCJ2, AK1A1_BOVIN; Q5ZK84, AK1A1_CHICK;O70473, AK1A1_CRIGR; P14550, AK1A1_HUMAN; Q9JII6, AK1A1_MOUSE;P50578, AK1A1_PIG ; Q5R5D5, AK1A1_PONAB; P51635, AK1A1_RAT ;Q6GMC7, AK1A1_XENLA; Q28FD1, AK1A1_XENTR; Q9UUN9, ALD2_SPOSA ;P27800, ALDX_SPOSA ; P75691, YAHK_ECOLI ;'
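Each element of `accession_list` is still one long string. If individual accessions are needed later, the string can be split into pairs. The following is a minimal sketch, assuming every cross-reference has the form `accession, entry_name` and is terminated by `;` as in the output above; `parse_dr` is a hypothetical helper name.

```python
def parse_dr(field):
    """Split a joined DR string into (accession, entry name) pairs."""
    pairs = []
    for item in field.split(";"):
        if "," not in item:
            # skip empty fragments and anything that is not an "accession, name" pair
            continue
        accession, entry_name = item.split(",", 1)
        pairs.append((accession.strip(), entry_name.strip()))
    return pairs


field = "Q6AZW2, A1A1A_DANRE; Q568L5, A1A1B_DANRE; Q24857, ADH3_ENTHI ;Q04894, ADH6_YEAST ;"
for accession, entry_name in parse_dr(field):
    print(accession, entry_name)
```

Stripping each part also removes the padding spaces before `;` and copes with the missing space where two DR lines were concatenated.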
All that should remain is to build a DataFrame from these three lists,
but comparing the number of elements in each list gives
len(id_list), len(description_list), len(accession_list)
(7876, 7876, 5001)
Only accession_list has a different length.
Why is accession_list the only one that does not match?
A closer look at the .dat file shows entries like these:
//
ID 1.14.13.42
DE Deleted entry.
//
ID 1.14.13.43
DE Questin monooxygenase.
AN Questin oxygenase.
CA Questin + NADPH + O(2) = demethylsulochrin + NADP(+).
CC -!- The enzyme cleaves the anthraquinone ring of questin to form a
CC benzophenone.
CC -!- Involved in the biosynthesis of the seco-anthraquinone (+)-geodin.
//
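In other words, quite a few IDs have no DR line at all.
To work around this, I run
# Use PR, CC, DE, CA, and CF to find the enzymes that have no DR line
for name in ("PR", "CC", "DE", "CA", "CF"):
    print("start", name)
    no_dr_enzyme = []
    for i in range(len(s)):
        # a line of this type directly followed by the "//" terminator
        # means the entry ended without a DR line
        if s[i].startswith(f"{name} ") and s[i + 1].startswith("//"):
            no_dr_enzyme.append(i)
    x = 1  # each insertion shifts the later indices by one
    for i in no_dr_enzyme:
        # three spaces after "DR" so that the [5:-1] slice yields "none ;"
        s.insert(i + x, "DR   none ;\n")
        x += 1
which appends a "DR none" line to every ID that has no DR.
Rebuilding accession_list and comparing the lengths again gives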
len(id_list), len(description_list), len(accession_list)
(7876, 7876, 7876)
The lengths now match, so the DataFrame can be built.
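Inserting placeholder lines works, but it modifies the list of lines and depends on which line types can end an entry. As an alternative sketch (not the method used in this post), each entry can be collected into one record as the lines are read, with the placeholder filled in when the `//` terminator is reached. `parse_entries` is a hypothetical name, the sample lines are made up, and the 5-character prefix is assumed as before.

```python
def parse_entries(lines):
    """Return one dict per entry with ID, Description, and Accession."""
    records = []
    current = None
    for line in lines:
        code, data = line[:2], line[5:].rstrip("\n")
        if code == "ID":
            # an ID line begins a new entry
            current = {"ID": data, "Description": "", "Accession": ""}
        elif current is None:
            # lines before the first ID (for example a file header) are ignored
            continue
        elif code == "DE":
            current["Description"] += data
        elif code == "DR":
            current["Accession"] += data
        elif code == "//":
            # end of entry: use the same placeholder as above when no DR was seen
            if not current["Accession"]:
                current["Accession"] = "none ;"
            records.append(current)
            current = None
    return records


sample = [
    "CC   header text\n",
    "//\n",
    "ID   1.1.1.4\n",
    "DE   (R,R)-butanediol dehydrogenase.\n",
    "DR   P14940, ADH_CUPNE ;\n",
    "//\n",
    "ID   1.1.1.5\n",
    "DE   Transferred entry: 1.1.1.303 and 1.1.1.304.\n",
    "//\n",
]
for record in parse_entries(sample):
    print(record)
```

Because every record always has all three keys, the three columns cannot get out of step, and a list of such dicts can be passed directly to `pd.DataFrame`.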
Building the DataFrame and writing the CSV file
import pandas as pd
df = pd.DataFrame(
    {"ID": id_list, "Description": description_list, "Accession": accession_list}
)
# write out as a CSV file
df.to_csv("enzyme.csv", index=False)
The finished script (make_enzyme_table.py)
#!/usr/bin/env python3
# -*- coding: utf-8 -*-
#
# make_enzyme_table.py
#
import pandas as pd


def main():
    # read the file
    path = "enzyme.dat"
    with open(path) as f:
        s = f.readlines()
    s = s[24:]  # skip the explanatory header
    print(s[:10])

    # build the id column
    id_list = []
    for i in s:
        if i.startswith("ID "):
            x = i[5:-1]
            id_list.append(x)

    # build the description column
    description_list = []
    name = ""
    for i in range(len(s)):
        if s[i].startswith("DE "):
            x = s[i][5:-1]
            name += x
            if not s[i + 1].startswith("DE "):
                description_list.append(name)
                name = ""

    # use PR, CC, DE, CA, and CF to find enzymes without a DR line and fill them in
    for name in ("PR", "CC", "DE", "CA", "CF"):
        print("start", name)
        no_dr_enzyme = []
        for i in range(len(s)):
            if s[i].startswith(f"{name} ") and s[i + 1].startswith("//"):
                no_dr_enzyme.append(i)
        x = 1  # each insertion shifts the later indices by one
        for i in no_dr_enzyme:
            # three spaces after "DR" so that the [5:-1] slice yields "none ;"
            s.insert(i + x, "DR   none ;\n")
            x += 1

    # build the accession column
    accession_list = []
    name = ""
    for i in range(len(s)):
        if s[i].startswith("DR "):
            x = s[i][5:-1]
            name += x
            if not s[i + 1].startswith("DR "):
                accession_list.append(name)
                name = ""

    # build the DataFrame
    df = pd.DataFrame(
        {"ID": id_list, "Description": description_list, "Accession": accession_list}
    )
    # write the CSV file
    df.to_csv("enzyme.csv", index=False)


if __name__ == "__main__":
    main()
## Done
Contents of enzyme.csv (excerpt)
ID,Description,Accession
1.1.1.1,Alcohol dehydrogenase.,"P07327, ADH1A_HUMAN; P28469, ADH1A_MACMU; Q5RBP7, ADH1A_PONAB;P25405, ADH1A_SAAHA; P25406, ADH1B_SAAHA; P00327, ADH1E_HORSE;P00326, ADH1G_HUMAN; O97959, ADH1G_PAPHA; P00328, ADH1S_HORSE;P80222, ADH1_ALLMI ; P30350, ADH1_ANAPL ; P49645, ADH1_APTAU ;P06525, ADH1_ARATH ; P41747, ADH1_ASPFN ; Q17334, ADH1_CAEEL ;P43067, ADH1_CANAX ; P85440, ADH1_CATRO ; P14219, ADH1_CENAM ;P48814, ADH1_CERCA ; Q70UN9, ADH1_CERCO ; P23991, ADH1_CHICK ;P86883, ADH1_COLLI ; P19631, ADH1_COTJA ; P23236, ADH1_DROHY ;P48586, ADH1_DROMN ; P09370, ADH1_DROMO ; P22246, ADH1_DROMT ;P07161, ADH1_DROMU ; P12854, ADH1_DRONA ; P08843, ADH1_EMENI ;P26325, ADH1_GADMC ; Q9Z2M2, ADH1_GEOAT ; Q64413, ADH1_GEOBU ;Q64415, ADH1_GEOKN ; P12311, ADH1_GEOSE ; P05336, ADH1_HORVU ;P20369, ADH1_KLULA ; Q07288, ADH1_KLUMA ; P00333, ADH1_MAIZE ;P86885, ADH1_MESAU ; P00329, ADH1_MOUSE ; P80512, ADH1_NAJNA ;Q9P6C8, ADH1_NEUCR ; Q75ZX4, ADH1_ORYSI ; Q2R8Z5, ADH1_ORYSJ ;P12886, ADH1_PEA ; P22797, ADH1_PELPE ; P41680, ADH1_PERMA ;P25141, ADH1_PETHY ; O00097, ADH1_PICST ; Q03505, ADH1_RABIT ;P06757, ADH1_RAT ; P14673, ADH1_SOLTU ; P80338, ADH1_STRCA ;P13603, ADH1_TRIRP ; P00330, ADH1_YEAST ; Q07264, ADH1_ZEALU ;P20368, ADH1_ZYMMO ; O45687, ADH2_CAEEL ; O94038, ADH2_CANAL ;P48815, ADH2_CERCA ; Q70UP5, ADH2_CERCO ; Q70UP6, ADH2_CERRO ;P27581, ADH2_DROAR ; P25720, ADH2_DROBU ; P23237, ADH2_DROHY ;P48587, ADH2_DROMN ; P09369, ADH2_DROMO ; P07160, ADH2_DROMU ;P24267, ADH2_DROWH ; P37686, ADH2_ECOLI ; P54202, ADH2_EMENI ;Q24803, ADH2_ENTHI ; P42327, ADH2_GEOSE ; P10847, ADH2_HORVU ;P49383, ADH2_KLULA ; Q9P4C2, ADH2_KLUMA ; P04707, ADH2_MAIZE ;Q4R1E8, ADH2_ORYSI ; Q0ITW7, ADH2_ORYSJ ; O13309, ADH2_PICST ;P28032, ADH2_SOLLC ; P14674, ADH2_SOLTU ; F2Z678, ADH2_YARLI ;P00331, ADH2_YEAST ; F8DVL8, ADH2_ZYMMA ; P0DJA2, ADH2_ZYMMO ;P07754, ADH3_EMENI ; P42328, ADH3_GEOSE ; P10848, ADH3_HORVU ;P49384, ADH3_KLULA ; P14675, ADH3_SOLTU ; P07246, ADH3_YEAST ;P49385, ADH4_KLULA ; Q09669, ADH4_SCHPO ; A6ZTT5, 
ADH4_YEAS7 ;P10127, ADH4_YEAST ; Q6XQ67, ADH5_SACPS ; P38113, ADH5_YEAST ;P28332, ADH6_HUMAN ; P41681, ADH6_PERMA ; Q5R7Z8, ADH6_PONAB ;Q5XI95, ADH6_RAT ; P40394, ADH7_HUMAN ; Q64437, ADH7_MOUSE ;P41682, ADH7_RAT ; P9WQC0, ADHA_MYCTO ; P9WQC1, ADHA_MYCTU ;O31186, ADHA_RHIME ; Q7U1B9, ADHB_MYCBO ; P9WQC6, ADHB_MYCTO ;P9WQC7, ADHB_MYCTU ; P9WQB8, ADHD_MYCTO ; P9WQB9, ADHD_MYCTU ;P33744, ADHE_CLOAB ; P0A9Q8, ADHE_ECO57 ; P0A9Q7, ADHE_ECOLI ;P81600, ADHH_GADMO ; P72324, ADHI_RHOS4 ; Q9SK86, ADHL1_ARATH;Q9SK87, ADHL2_ARATH; A1L4Y2, ADHL3_ARATH; Q8VZ49, ADHL4_ARATH;Q0V7W6, ADHL5_ARATH; Q8LEB2, ADHL6_ARATH; Q9FH04, ADHL7_ARATH;P81601, ADHL_GADMO ; P39451, ADHP_ECOLI ; O46649, ADHP_RABIT ;O46650, ADHQ_RABIT ; Q96533, ADHX_ARATH ; Q3ZC42, ADHX_BOVIN ;Q17335, ADHX_CAEEL ; Q54TC2, ADHX_DICDI ; P46415, ADHX_DROME ;P19854, ADHX_HORSE ; P11766, ADHX_HUMAN ; P93629, ADHX_MAIZE ;P28474, ADHX_MOUSE ; P80360, ADHX_MYXGL ; P81431, ADHX_OCTVU ;A2XAZ3, ADHX_ORYSI ; Q0DWH1, ADHX_ORYSJ ; P80572, ADHX_PEA ;O19053, ADHX_RABIT ; P12711, ADHX_RAT ; P80467, ADHX_SAAHA ;P86884, ADHX_SCYCA ; P79896, ADHX_SPAAU ; Q9NAR7, ADH_BACOL ;P14940, ADH_CUPNE ; Q0KDL6, ADH_CUPNH ; Q00669, ADH_DROAD ;P21518, ADH_DROAF ; P25139, ADH_DROAM ; Q50L96, ADH_DROAN ;P48584, ADH_DROBO ; P22245, ADH_DRODI ; Q9NG42, ADH_DROEQ ;P28483, ADH_DROER ; P48585, ADH_DROFL ; P51551, ADH_DROGR ;Q09009, ADH_DROGU ; P51549, ADH_DROHA ; P21898, ADH_DROHE ;Q07588, ADH_DROIM ; Q9NG40, ADH_DROIN ; Q27404, ADH_DROLA ;P10807, ADH_DROLE ; P07162, ADH_DROMA ; Q09010, ADH_DROMD ;P00334, ADH_DROME ; Q00671, ADH_DROMM ; P25721, ADH_DROMY ;Q00672, ADH_DRONI ; P07159, ADH_DROOR ; P84328, ADH_DROPB ;P37473, ADH_DROPE ; P23361, ADH_DROPI ; P23277, ADH_DROPL ;Q6LCE4, ADH_DROPS ; Q9U8S9, ADH_DROPU ; Q9GN94, ADH_DROSE ;Q24641, ADH_DROSI ; P23278, ADH_DROSL ; Q03384, ADH_DROSU ;P28484, ADH_DROTE ; P51550, ADH_DROTS ; B4M8Y0, ADH_DROVI ;Q05114, ADH_DROWI ; P26719, ADH_DROYA ; P17648, ADH_FRAAN ;P48977, ADH_MALDO ; P81786, ADH_MORSE ; P9WQC2, 
ADH_MYCTO ;P9WQC3, ADH_MYCTU ; P39462, ADH_SACS2 ; P25988, ADH_SCAAL ;Q00670, ADH_SCACA ; P00332, ADH_SCHPO ; Q2FJ31, ADH_STAA3 ;Q2G0G1, ADH_STAA8 ; Q2YSX0, ADH_STAAB ; Q5HI63, ADH_STAAC ;Q99W07, ADH_STAAM ; Q7A742, ADH_STAAN ; Q6GJ63, ADH_STAAR ;Q6GBM4, ADH_STAAS ; Q8NXU1, ADH_STAAW ; Q5HRD6, ADH_STAEQ ;Q8CQ56, ADH_STAES ; Q4J781, ADH_SULAC ; P50381, ADH_SULSR ;Q96XE0, ADH_SULTO ; P51552, ADH_ZAPTU ; Q5AR48, ASQE_EMENI ;A5JYX5, DHS3_CAEEL ; P32771, FADH_YEAST ; A7ZIA4, FRMA_ECO24 ;Q8X5J4, FRMA_ECO57 ; A7ZX04, FRMA_ECOHS ; A1A835, FRMA_ECOK1 ;Q0TKS7, FRMA_ECOL5 ; Q8FKG1, FRMA_ECOL6 ; B1J085, FRMA_ECOLC ;P25437, FRMA_ECOLI ; B1LIP1, FRMA_ECOSM ; Q1RFI7, FRMA_ECOUT ;P44557, FRMA_HAEIN ; P39450, FRMA_PHODP ; Q3Z550, FRMA_SHISS ;P73138, FRMA_SYNY3 ; E1ACQ9, NOTN_ASPSM ; N4WE73, OXI1_COCH4 ;N4WE43, RED2_COCH4 ; N4WW42, RED3_COCH4 ; P33010, TERPD_PSESP;O07737, Y1895_MYCTU;"
1.1.1.2,Alcohol dehydrogenase (NADP(+)).,"Q6AZW2, A1A1A_DANRE; Q568L5, A1A1B_DANRE; Q24857, ADH3_ENTHI ;Q04894, ADH6_YEAST ; P25377, ADH7_YEAST ; O57380, ADH8_PELPE ;Q9F282, ADHA_THEET ; P0CH36, ADHC1_MYCS2; P0CH37, ADHC2_MYCS2;P0A4X1, ADHC_MYCBO ; P9WQC4, ADHC_MYCTO ; P9WQC5, ADHC_MYCTU ;P27250, AHR_ECOLI ; Q3ZCJ2, AK1A1_BOVIN; Q5ZK84, AK1A1_CHICK;O70473, AK1A1_CRIGR; P14550, AK1A1_HUMAN; Q9JII6, AK1A1_MOUSE;P50578, AK1A1_PIG ; Q5R5D5, AK1A1_PONAB; P51635, AK1A1_RAT ;Q6GMC7, AK1A1_XENLA; Q28FD1, AK1A1_XENTR; Q9UUN9, ALD2_SPOSA ;P27800, ALDX_SPOSA ; P75691, YAHK_ECOLI ;"
1.1.1.3,Homoserine dehydrogenase.,"P00561, AK1H_ECOLI ; P27725, AK1H_SERMA ; P00562, AK2H_ECOLI ;Q9SA18, AKH1_ARATH ; P49079, AKH1_MAIZE ; O81852, AKH2_ARATH ;P49080, AKH2_MAIZE ; P57290, AKH_BUCAI ; Q8K9U9, AKH_BUCAP ;Q89AR4, AKH_BUCBP ; P37142, AKH_DAUCA ; P44505, AKH_HAEIN ;P19582, DHOM_BACSU ; P08499, DHOM_CORGL ; Q5B998, DHOM_EMENI ;Q9ZL20, DHOM_HELPJ ; P56429, DHOM_HELPY ; Q9CGD8, DHOM_LACLA ;P52985, DHOM_LACLC ; P37143, DHOM_METGL ; Q58997, DHOM_METJA ;P63630, DHOM_MYCBO ; P46806, DHOM_MYCLE ; P9WPX0, DHOM_MYCTO ;P9WPX1, DHOM_MYCTU ; P29365, DHOM_PSEAE ; O94671, DHOM_SCHPO ;P52986, DHOM_SYNY3 ; P31116, DHOM_YEAST ; P37144, DHON_METGL ;"
1.1.1.4,"(R,R)-butanediol dehydrogenase.","P14940, ADH_CUPNE ; Q0KDL6, ADH_CUPNH ; P39714, BDH1_YEAST ;O34788, BDHA_BACSU ; Q00796, DHSO_HUMAN ;"
1.1.1.5,Transferred entry: 1.1.1.303 and 1.1.1.304.,none ;
1.1.1.6,Glycerol dehydrogenase.,"A4IP64, ADH1_GEOTN ; O13702, GLD1_SCHPO ; P45511, GLDA_CITFR ;P0A9S6, GLDA_ECOL6 ; P0A9S5, GLDA_ECOLI ; P32816, GLDA_GEOSE ;P50173, GLDA_PSEPU ; Q9WYQ4, GLDA_THEMA ; Q92EU6, GOLD_LISIN ;"
1.1.1.7,Propanediol-phosphate dehydrogenase.,none ;
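All that remains is to match BLAST results against this table to obtain the list of EC numbers for the identified enzymes (that is, to see which functional roles are represented).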
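As a rough sketch of that matching step, the table can be inverted into a dictionary from accession to EC numbers. The code below assumes the BLAST hits have already been reduced to a plain list of UniProt accessions; `build_accession_index`, `rows`, and `blast_hits` are hypothetical names, and the hit list is invented for illustration. The rows are shortened versions of the CSV lines above.

```python
from collections import defaultdict


def build_accession_index(rows):
    """Map each UniProt accession to the list of EC numbers that cite it."""
    index = defaultdict(list)
    for ec_number, accession_field in rows:
        for item in accession_field.split(";"):
            if "," not in item:
                # skips empty fragments and the "none" placeholder
                continue
            accession = item.split(",", 1)[0].strip()
            index[accession].append(ec_number)
    return dict(index)


# (ID, Accession) pairs, shortened from the CSV excerpt above
rows = [
    ("1.1.1.1", "P07327, ADH1A_HUMAN; P14940, ADH_CUPNE ;"),
    ("1.1.1.4", "P14940, ADH_CUPNE ; P39714, BDH1_YEAST ;"),
    ("1.1.1.5", "none ;"),
]
index = build_accession_index(rows)

# hypothetical accessions taken from BLAST hits
blast_hits = ["P14940", "P39714", "X00000"]
for hit in blast_hits:
    print(hit, index.get(hit, []))
```

A list is used as the value because one protein can be cited by more than one EC number, as P14940 is for 1.1.1.1 and 1.1.1.4 in the excerpt.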