

FASTA file parser snippet

Posted at 2016-05-18

FASTA file parser (Python 3 version)

This is a FASTA file parser, the kind of thing you inevitably end up writing when doing sequence analysis in bioinformatics.
It can also read gzip-compressed files as they are.

Function

from itertools import groupby
import gzip

# Fasta IO
def fasta_iter(fasta_name):
    '''
    given a fasta file. yield tuples of header, sequence
    modified from Brent Pedersen
    Correct Way To Parse A Fasta File In Python
    https://www.biostars.org/p/710/
    '''

    # Opening the gzip file in text mode ('rt') yields decoded str lines,
    # so compressed and plain files can share one parsing path.
    if fasta_name.endswith(('.gz', '.gzip')):
        f = gzip.open(fasta_name, 'rt', encoding='utf-8')
    else:
        f = open(fasta_name)
    with f:
        # ditch the boolean (x[0]) and just keep the header or sequence since
        # we know they alternate.
        data = (x[1] for x in groupby(f, lambda line: line.startswith(">")))
        for header in data:
            # drop the ">"
            header = next(header)[1:].strip()
            # join all sequence lines to one; the () default avoids an error
            # when the file ends with a header that has no sequence lines.
            seq = "".join(s.strip() for s in next(data, ()))
            yield (header, seq)
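
The parser relies on `itertools.groupby` batching consecutive lines that share the same key, so header lines and sequence lines come out as alternating groups. The following is a minimal sketch of that behavior on a few in-memory lines; the record names and sequences are made up for illustration and are not from any real file.

```python
from itertools import groupby

# Hypothetical FASTA lines, used only to illustrate the grouping.
lines = [">seq1 first record\n", "ACGT\n", "TTGA\n", ">seq2\n", "GGCC\n"]

# groupby yields (key, group) pairs; the key is True for header lines
# and False for sequence lines, and consecutive lines with the same key
# end up in the same group.
groups = [
    (is_header, [line.strip() for line in group])
    for is_header, group in groupby(lines, lambda line: line.startswith(">"))
]

for is_header, group in groups:
    print(is_header, group)
```

Because the groups alternate between header and sequence, the function can take one group as the header and the next group as its sequence lines.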

Usage

  • After defining the function above, use it as follows:
fasta = '/somewhere/hg19.fa.gz'

seqs = {}
for (head, seq) in fasta_iter(fasta):
    seqs[head] = seq

print(seqs)
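
For a genome-sized file, holding every sequence in a dictionary and printing it may be more than is needed. As a rough sketch, the example below writes a tiny gzip-compressed FASTA file to a temporary directory and keeps only the length of each record. `fasta_records` is a hypothetical, condensed stand-in for `fasta_iter` so that the block runs on its own, and the file name `toy.fa.gz` and its contents are invented for the example.

```python
import gzip
import os
import tempfile
from itertools import groupby


def fasta_records(path):
    """Condensed stand-in for fasta_iter (hypothetical name), same groupby idea."""
    opener = gzip.open if path.endswith((".gz", ".gzip")) else open
    with opener(path, "rt") as handle:
        groups = (g for _, g in groupby(handle, lambda line: line.startswith(">")))
        for header_group in groups:
            # Drop the leading ">" from the header line.
            header = next(header_group)[1:].strip()
            # Join the sequence lines that follow the header.
            seq = "".join(s.strip() for s in next(groups, ()))
            yield header, seq


# Write a tiny gzip-compressed FASTA file (made-up records) to a temp directory.
tmp_dir = tempfile.mkdtemp()
path = os.path.join(tmp_dir, "toy.fa.gz")
with gzip.open(path, "wt") as out:
    out.write(">chrA\nACGT\nAC\n>chrB\nGG\n")

# Keep only the length of each sequence instead of the sequences themselves.
lengths = {head: len(seq) for head, seq in fasta_records(path)}
print(lengths)
```

Since the parser is a generator, each record is produced one at a time, so a per-record summary like this never needs all sequences in memory at once.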

References
