LoginSignup
0

More than 1 year has passed since last update.

posted at

updated at

Biopython Tutorial and Cookbook和訳(4.7)

4.7 Slicing a SeqRecord

4.6へ

You can slice a SeqRecord, to give you a new SeqRecord covering just part of the sequence.
What is important here is that any per-letter annotations are also sliced, and any features which fall completely within the new sequence are preserved (with their locations adjusted).
SeqRecordをスライスことで配列の一部分を新たなSeqRecordとして生成できます。
注意する必要があるのはper-letter annotationも同様にスライスされるが、新配列内のfeatureは元のと同じとなる(locationsは調整されます)

For example, taking the same GenBank file used earlier:
前に使ったGenBankファイルを例として

>>> from Bio import SeqIO
>>> record = SeqIO.read("NC_005816.gb", "genbank")

>>> record
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG',
IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816',
description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence',
dbxrefs=['Project:58037'])

>>> len(record)
9609
>>> len(record.features)
41

For this example we’re going to focus in on the pim gene, YP_pPCP05. If you have a look at the GenBank file directly you’ll find this gene/CDS has location string 4343..4780, or in Python counting 4342:4780.
From looking at the file you can work out that these are the twelfth and thirteenth entries in the file, so in Python zero-based counting they are entries 11 and 12 in the features list:
この例ではpim遺伝子上に注目します。OLN:YP_pPCP05。GenBankファイルを覗いたらこのgene/CDSのlocationは4343..4780、あるいはpythonカウント下は4342:4780です。
location情報はGenBankファイルの12と13のentriesとなります。pythonは0からカウントするため、featuresリストの11と12entriesとなります。

>>> print(record.features[20])
type: gene
location: [4342:4780](+)
qualifiers:
    Key: db_xref, Value: ['GeneID:2767712']
    Key: gene, Value: ['pim']
    Key: locus_tag, Value: ['YP_pPCP05']
<BLANKLINE>

>>> print(record.features[21])
type: CDS
location: [4342:4780](+)
qualifiers:
    Key: codon_start, Value: ['1']
    Key: db_xref, Value: ['GI:45478716', 'GeneID:2767712']
    Key: gene, Value: ['pim']
    Key: locus_tag, Value: ['YP_pPCP05']
    Key: note, Value: ['similar to many previously sequenced pesticin immunity ...']
    Key: product, Value: ['pesticin immunity protein']
    Key: protein_id, Value: ['NP_995571.1']
    Key: transl_table, Value: ['11']
    Key: translation, Value: ['MGGGMISKLFCLALIFLSSSGLAEKNTYTAKDILQNLELNTFGNSLSH...']

Let’s slice this parent record from 4300 to 4800 (enough to include the pim gene/CDS), and see how many features we get:
親配列から4300から4800をスライスして(pim gene/CDSが含まれる長さ)、どんなfeaturesを取得したのかを見てみましょう:

>>> sub_record = record[4300:4800]

>>> sub_record
SeqRecord(seq=Seq('ATAAATAGATTATTCCAAATAATTTATTTATGTAAGAACAGGATGGGAGGGGGA...TTA',
IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816',
description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.',
dbxrefs=[])

>>> len(sub_record)
500
>>> len(sub_record.features)
2

Our sub-record just has two features, the gene and CDS entries for YP_pPCP05:
sub-recordに二つのfeaturesが含まれている、YP_pPCP05の遺伝子とCD Sentriesです:
参考:https://www.ddbj.nig.ac.jp/ddbj/cds.html

>>> print(sub_record.features[0])
type: gene
location: [42:480](+)
qualifiers:
    Key: db_xref, Value: ['GeneID:2767712']
    Key: gene, Value: ['pim']
    Key: locus_tag, Value: ['YP_pPCP05']
<BLANKLINE>

>>> print(sub_record.features[1])
type: CDS
location: [42:480](+)
qualifiers:
    Key: codon_start, Value: ['1']
    Key: db_xref, Value: ['GI:45478716', 'GeneID:2767712']
    Key: gene, Value: ['pim']
    Key: locus_tag, Value: ['YP_pPCP05']
    Key: note, Value: ['similar to many previously sequenced pesticin immunity ...']
    Key: product, Value: ['pesticin immunity protein']
    Key: protein_id, Value: ['NP_995571.1']
    Key: transl_table, Value: ['11']
    Key: translation, Value: ['MGGGMISKLFCLALIFLSSSGLAEKNTYTAKDILQNLELNTFGNSLSH...']

Notice that their locations have been adjusted to reflect the new parent sequence!
Notice:locationsは生成された親配列に対応するように調整されます!

While Biopython has done something sensible and hopefully intuitive with the features (and any per-letter annotation), for the other annotation it is impossible to know if this still applies to the sub-sequence or not.
To avoid guessing, the annotations and dbxrefs are omitted from the sub-record, and it is up to you to transfer any relevant information as appropriate.
Biopythonはfeatures要素を賢く、直観的に獲得できたが(他のper-letter annotationも)、他のannotationは子配列に適応するかどうかを知る余地はない。
誤解を避けるために、子記録のannotationsとdbxrefsを省略しました。

>>> sub_record.annotations
{}
>>> sub_record.dbxrefs
[]

The same point could be made about the record id, name and description, but for practicality these are preserved:
実用性のために子記録にid, nameとdescriptionを保留しました。

>>> sub_record.id
'NC_005816.1'
>>> sub_record.name
'NC_005816'
>>> sub_record.description
'Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence'

This illustrates the problem nicely though, our new sub-record is not the complete sequence of the plasmid, so the description is wrong! Let’s fix this and then view the sub-record as a reduced GenBank file using the format method described above in Section 4.6:
この例は問題を露呈しました、子記録は完全のプラスミド配列ではない、ゆえにdescriptionは間違っている、Section 4.6に述べたformatメソッドで修正することができます:

>>> sub_record.description = "Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, partial."
>>> print(sub_record.format("genbank"))
...

See Sections 20.1.7 and 20.1.8 for some FASTQ examples where the per-letter annotations (the read quality scores) are also sliced.
FASTQの例は20.1.7 and 20.1.8を参照してください。この例ではper-letter annotations (クオリティスコア)はスライスされていました。

4.8へ

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
What you can do with signing up
0