Extracting GO term IDs from InterProScan5 output files (XML format) [memo]


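InterProScan, a protein function prediction tool, takes protein sequences and reports the domain structures they contain together with the associated GO terms.
After running InterProScan in a local environment, I wrote a script to pull the information I needed out of the resulting output file (XML format), so I am recording it here as a memo.
This time I decided to extract each protein's ID and its corresponding GO term IDs, and wrote the Python code below for that purpose. (assisted by ChatGPT)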

import re

input_file = "protein.faa.xml"
output_file = "protein_GO.tsv"

with open(input_file) as f:
    text = f.read()

# Split the text into one chunk per <protein> element
proteins = re.findall(
    r"<protein>(.*?)</protein>",
    text,
    re.DOTALL
)

with open(output_file, "w") as out:
    out.write("Protein_ID\tGO_terms\n")

    for protein in proteins:

        # Get the protein ID (the first <xref id="..."> in the chunk)
        m = re.search(
            r'<xref id="([^"]+)"',
            protein
        )

        if not m:
            continue

        protein_id = m.group(1)

        # Get the GO term IDs (deduplicated and sorted)
        gos = sorted(set(
            re.findall(
                r'<go-xref[^>]*id="(GO:\d+)"',
                protein
            )
        ))

        out.write(
            protein_id + "\t" +
            ";".join(gos) + "\n"
        )

print("Finished")
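
As a possible alternative to the regular expressions above, the same extraction can be sketched with the standard-library XML parser, which does not depend on attribute order or on how the tags happen to be written. This is only a minimal sketch under some assumptions: the helper names `local_name` and `extract_go_terms` are my own, the `SAMPLE_XML` fragment is hand-written for illustration (its namespace URI is a placeholder, not the real InterProScan one), and namespaces are ignored by comparing only the local part of each tag name.

```python
import xml.etree.ElementTree as ET

# Hand-written fragment that mimics the element names used by the script above.
# The namespace URI is a placeholder and the IDs are made up for illustration.
SAMPLE_XML = """<protein-matches xmlns="http://example.org/placeholder">
  <protein>
    <sequence>MKT</sequence>
    <xref id="prot1" name="prot1 example"/>
    <matches>
      <match>
        <entry ac="ENTRY1">
          <go-xref db="GO" id="GO:0006468"/>
          <go-xref db="GO" id="GO:0005524"/>
        </entry>
      </match>
      <match>
        <entry ac="ENTRY2">
          <go-xref db="GO" id="GO:0005524"/>
        </entry>
      </match>
    </matches>
  </protein>
  <protein>
    <sequence>MAA</sequence>
    <xref id="prot2"/>
    <matches/>
  </protein>
</protein-matches>"""


def local_name(tag):
    """Strip a leading '{namespace}' from an ElementTree tag name."""
    return tag.rsplit("}", 1)[-1]


def extract_go_terms(xml_text):
    """Return [(protein_id, sorted GO IDs), ...] for every <protein> element."""
    root = ET.fromstring(xml_text)
    rows = []
    for protein in root.iter():
        if local_name(protein.tag) != "protein":
            continue
        protein_id = None
        gos = set()
        for elem in protein.iter():
            name = local_name(elem.tag)
            if name == "xref" and protein_id is None:
                # Like the regex version, only the first xref ID is used.
                protein_id = elem.get("id")
            elif name == "go-xref":
                go_id = elem.get("id", "")
                if go_id.startswith("GO:"):
                    gos.add(go_id)
        if protein_id is not None:
            rows.append((protein_id, sorted(gos)))
    return rows


if __name__ == "__main__":
    # Print the same two-column layout as the TSV written by the script above.
    for protein_id, gos in extract_go_terms(SAMPLE_XML):
        print(protein_id + "\t" + ";".join(gos))
```

For a real output file, the text could be read with `open(input_file).read()` as in the script above and passed to `extract_go_terms`. Like the regex version, proteins without any GO term still produce a row with an empty second column.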