NEXUS

#NEXUS
Begin data;
Dimensions ntax=287 nchar=10650;
Format datatype=dna…

A data storage format that begins something like the above.
To read it, use NexusReader from the nexus package (https://pypi.python.org/pypi/nexus).
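
To make the header lines above a little more concrete, here is a minimal sketch that pulls the ntax and nchar values out of a Dimensions statement using only the standard library. The function name read_dimensions is hypothetical (it is not part of the nexus package), and the sketch assumes the simple key=value layout shown above.

```python
import re


def read_dimensions(nexus_text):
    """Return the key=value pairs of the first Dimensions statement as ints.

    Hypothetical helper for illustration; not part of the nexus package.
    """
    # NEXUS keywords are case-insensitive and a statement ends at ";".
    m = re.search(r"dimensions\s+([^;]*);", nexus_text, flags=re.IGNORECASE)
    if m is None:
        return {}
    pairs = re.findall(r"(\w+)\s*=\s*(\d+)", m.group(1))
    return {key.lower(): int(value) for key, value in pairs}


header = """#NEXUS
Begin data;
Dimensions ntax=287 nchar=10650;
"""
print(read_dimensions(header))  # {'ntax': 287, 'nchar': 10650}
```

Here ntax is the number of taxa (sequences) and nchar is the number of characters (alignment columns) per taxon.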

nex.py

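from nexus import NexusReader
import sys

# Path to the NEXUS file, given as the first command-line argument
nexus_in = sys.argv[1]

n = NexusReader()
n.read_file(nexus_in)

# Write each taxon as a FASTA record: a header line, then the sequence
for taxon, characters in n.data:
    print(">" + taxon)  # no space after ">", so the ID starts immediately
    print("".join(characters))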
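
The script above prints each sequence on a single line, which can get very long (10650 characters in the example header). As a rough sketch, the record could instead be wrapped at a fixed width. The helper name format_fasta and the default width of 60 are assumptions made for illustration, not something the nexus package provides.

```python
def format_fasta(taxon, characters, width=60):
    """Build one FASTA record with the sequence wrapped at `width` columns.

    Hypothetical helper; `characters` can be a string or a list of characters.
    """
    sequence = "".join(characters)
    # Slice the sequence into consecutive chunks of at most `width` characters.
    chunks = [sequence[i:i + width] for i in range(0, len(sequence), width)]
    return "\n".join([">" + taxon] + chunks)


# Toy data: an 80-character sequence is split into lines of 60 and 20.
print(format_fasta("taxon_A", list("ACGT" * 20)))
```

Inside the loop over n.data, print(format_fasta(taxon, characters)) would replace the two print calls.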

FASTA

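A data storage format in which the first line is a header line beginning with >,
and the actual sequence follows from the second line onward.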
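
To illustrate the layout described above, here is a minimal pure-Python sketch of a FASTA reader. It is only meant to show how header lines and sequence lines relate; the function name parse_fasta and the toy records are made up for this example.

```python
def parse_fasta(lines):
    """Yield (header, sequence) tuples from an iterable of FASTA lines."""
    header, chunks = None, []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if header is not None:
                yield header, "".join(chunks)
            header, chunks = line[1:], []
        elif header is not None:
            # Sequence lines are concatenated until the next header.
            chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


example = """>seq1 first toy record
ACGTAC
GTTA
>seq2
TTGACA
"""
for header, sequence in parse_fasta(example.splitlines()):
    print(header, sequence)
```

A sequence may span several lines, so everything between one header and the next belongs to the same record.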

fas.py

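from Bio import SeqIO
import sys

fasta_in = sys.argv[1]  # path to the FASTA file to read

for record in SeqIO.parse(fasta_in, 'fasta'):
    id_part = record.id             # header text after ">" up to the first whitespace
    desc_part = record.description  # the whole header line without ">"
    seq = record.seq                # the sequence as a Seq object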

NCBI XML files

blast_xml.py

from Bio.Blast import NCBIXML

path = "2VJM4KYR014-Alignment.xml"

with open(path, mode='r', encoding='utf-8') as fh:
    blast_records = NCBIXML.parse(fh)
    for blast_record in blast_records:
        for alignment in blast_record.alignments:
            for hsp in alignment.hsps:
                # Build a FASTA-style header: title|score|bits|identities
                he = '>' + alignment.title + '|'
                he += str(hsp.score) + '|'
                he += str(hsp.bits) + '|'
                he += str(hsp.identities)
                # To extract only the hits that meet some condition,
                # add an if statement here.
                #
                # Examples:
                #   by score:   if hsp.score < threshold:
                #   by species: if "Homo sapiens" in alignment.title:
                #   (hsp.identities is an integer count, so the species
                #    name has to be looked up in the title instead)
                print(he)
                # First 80 columns of the alignment: query, match line, subject
                print(hsp.query[:80])
                print(hsp.match[:80])
                print(hsp.sbjct[:80])
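
The filtering idea mentioned in the comments can be sketched as a small predicate. This is only an illustration under assumptions: the function name keep_hsp and its parameters are hypothetical, the direction of the score test (keep hits at or above a minimum) is my guess at the intent, and the stand-in objects below are made-up data that mimic just the .score attribute so the sketch runs without a BLAST result file.

```python
from types import SimpleNamespace


def keep_hsp(title, hsp, min_score=None, species=None):
    """Return True if the HSP passes the optional score and species filters.

    Hypothetical helper: `title` is the alignment title string and `hsp`
    is any object with a numeric `.score` attribute.
    """
    if min_score is not None and hsp.score < min_score:
        return False
    # Use "in" rather than str.find(): find() returns -1 (which is truthy)
    # when the text is missing, so it is easy to misuse in an if statement.
    if species is not None and species not in title:
        return False
    return True


# Stand-in objects (made-up data) in place of real Biopython HSP records.
strong = SimpleNamespace(score=250.0)
weak = SimpleNamespace(score=40.0)
title = "toy_hit Homo sapiens example sequence"

print(keep_hsp(title, strong, min_score=100, species="Homo sapiens"))  # True
print(keep_hsp(title, weak, min_score=100))                            # False
```

In the loop above, the print calls would then sit under something like if keep_hsp(alignment.title, hsp, min_score=100, species="Homo sapiens"):.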

(I did this a while ago, so I don't remember it very well.)

How to handle DNA data storage formats (Python)
