
Biopython Tutorial and Cookbook: Japanese Translation (4.7)

Posted at 2020-08-13

4.7 Slicing a SeqRecord

Previous: 4.6

You can slice a SeqRecord to get a new SeqRecord covering just part of the sequence.
Importantly, any per-letter annotations are sliced as well, and any features that fall completely within the new sequence are preserved, with their locations adjusted. Features that only partly overlap the slice are dropped.
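
As a minimal sketch of this behaviour that does not need the GenBank file, the following builds a small SeqRecord in memory and slices it. The sequence, the quality scores, the `demo` identifier and the feature coordinates are all made up for illustration, and Biopython is assumed to be installed.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

# Hypothetical 10-letter record with one per-letter annotation and two features.
record = SeqRecord(Seq("ACGTACGTAC"), id="demo")
record.letter_annotations["phred_quality"] = [10, 11, 12, 13, 14, 15, 16, 17, 18, 19]
record.features.append(SeqFeature(FeatureLocation(2, 6, strand=+1), type="gene"))
record.features.append(SeqFeature(FeatureLocation(5, 10, strand=+1), type="misc_feature"))

sub_record = record[1:8]

# Per-letter annotations are sliced along with the sequence.
sliced_quality = sub_record.letter_annotations["phred_quality"]

# Only the feature lying completely inside 1:8 survives, shifted by -1.
kept = [(f.type, int(f.location.start), int(f.location.end)) for f in sub_record.features]

print(sub_record.seq)
print(sliced_quality)
print(kept)
```

The made-up `misc_feature` runs past the end of the slice, so it does not appear in the sub-record.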

For example, taking the same GenBank file used earlier:

>>> from Bio import SeqIO
>>> record = SeqIO.read("NC_005816.gb", "genbank")

>>> record
SeqRecord(seq=Seq('TGTAACGAACGGTGCAATAGTGATCCACACCCAACGCCTGAAATCAGATCCAGG...CTG',
IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816',
description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence',
dbxrefs=['Project:58037'])

>>> len(record)
9609
>>> len(record.features)
41

For this example we’re going to focus on the pim gene, YP_pPCP05. If you look at the GenBank file directly, you’ll find this gene/CDS has the location string 4343..4780, or 4342:4780 in Python counting.
Counting through the file, these are the twenty-first and twenty-second features, so in Python’s zero-based counting they are entries 20 and 21 in the features list:

>>> print(record.features[20])
type: gene
location: [4342:4780](+)
qualifiers:
    Key: db_xref, Value: ['GeneID:2767712']
    Key: gene, Value: ['pim']
    Key: locus_tag, Value: ['YP_pPCP05']
<BLANKLINE>

>>> print(record.features[21])
type: CDS
location: [4342:4780](+)
qualifiers:
    Key: codon_start, Value: ['1']
    Key: db_xref, Value: ['GI:45478716', 'GeneID:2767712']
    Key: gene, Value: ['pim']
    Key: locus_tag, Value: ['YP_pPCP05']
    Key: note, Value: ['similar to many previously sequenced pesticin immunity ...']
    Key: product, Value: ['pesticin immunity protein']
    Key: protein_id, Value: ['NP_995571.1']
    Key: transl_table, Value: ['11']
    Key: translation, Value: ['MGGGMISKLFCLALIFLSSSGLAEKNTYTAKDILQNLELNTFGNSLSH...']
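
The conversion between the two coordinate styles can be sketched in plain Python. GenBank location strings are one-based and include both ends, while Python slices are zero-based and exclude the end, so only the start changes. The helper name `genbank_to_slice` is hypothetical, and it only handles simple `start..end` locations, not joins, complements or fuzzy positions.

```python
def genbank_to_slice(location):
    """Convert a simple GenBank 'start..end' string to a Python (start, end) pair."""
    start_text, end_text = location.split("..")
    # One-based inclusive start -> zero-based start; inclusive end -> exclusive end.
    return int(start_text) - 1, int(end_text)


start, end = genbank_to_slice("4343..4780")
print(start, end)    # 4342 4780
print(end - start)   # 438 letters, the same as 4780 - 4343 + 1
```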

Let’s slice this parent record from 4300 to 4800 (enough to include the pim gene/CDS), and see how many features we get:

>>> sub_record = record[4300:4800]

>>> sub_record
SeqRecord(seq=Seq('ATAAATAGATTATTCCAAATAATTTATTTATGTAAGAACAGGATGGGAGGGGGA...TTA',
IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816',
description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence',
dbxrefs=[])

>>> len(sub_record)
500
>>> len(sub_record.features)
2

Our sub-record has just two features, the gene and CDS entries for YP_pPCP05 (for background on CDS features, see https://www.ddbj.nig.ac.jp/ddbj/cds.html):

>>> print(sub_record.features[0])
type: gene
location: [42:480](+)
qualifiers:
    Key: db_xref, Value: ['GeneID:2767712']
    Key: gene, Value: ['pim']
    Key: locus_tag, Value: ['YP_pPCP05']
<BLANKLINE>

>>> print(sub_record.features[1])
type: CDS
location: [42:480](+)
qualifiers:
    Key: codon_start, Value: ['1']
    Key: db_xref, Value: ['GI:45478716', 'GeneID:2767712']
    Key: gene, Value: ['pim']
    Key: locus_tag, Value: ['YP_pPCP05']
    Key: note, Value: ['similar to many previously sequenced pesticin immunity ...']
    Key: product, Value: ['pesticin immunity protein']
    Key: protein_id, Value: ['NP_995571.1']
    Key: transl_table, Value: ['11']
    Key: translation, Value: ['MGGGMISKLFCLALIFLSSSGLAEKNTYTAKDILQNLELNTFGNSLSH...']

Notice that their locations have been adjusted to reflect the new parent sequence! The slice starts at 4300, so 4342:4780 in the parent record becomes 42:480 in the sub-record.
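
The rule at work here can be sketched without Biopython: a feature is kept only if it lies entirely inside the slice, and the slice start is subtracted from both ends. This is an illustration of the behaviour described above, not Biopython's actual implementation; the function name `slice_features` and the tuple layout are hypothetical, and the second feature is invented to show a rejected case.

```python
def slice_features(features, start, stop):
    """Keep (name, begin, end) features fully inside start:stop, shifted by -start."""
    kept = []
    for name, begin, end in features:
        if start <= begin and end <= stop:
            kept.append((name, begin - start, end - start))
    return kept


parent_features = [
    ("pim", 4342, 4780),       # the pim gene from the example above
    ("made_up", 4700, 4900),   # invented: extends past the slice end
]
print(slice_features(parent_features, 4300, 4800))   # [('pim', 42, 480)]
```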

While Biopython has done something sensible and hopefully intuitive with the features (and any per-letter annotation), for the other annotations it is impossible to know whether they still apply to the sub-sequence.
To avoid guessing, the annotations and dbxrefs are omitted from the sub-record, and it is up to you to transfer any relevant information as appropriate.

>>> sub_record.annotations
{}
>>> sub_record.dbxrefs
[]
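
Transferring information by hand is ordinary attribute and dictionary assignment. The sketch below uses a small made-up record rather than the GenBank file; the `source` annotation and the identifiers are invented for illustration, Biopython is assumed to be installed, and which entries are worth copying is a judgment call for your own data.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Hypothetical parent record carrying whole-record annotation.
parent = SeqRecord(
    Seq("ACGTACGTAC"),
    id="demo",
    annotations={"source": "made-up organism"},
    dbxrefs=["Project:0000"],
)
child = parent[2:8]

# Right after slicing, the database cross-references are gone.
dbxrefs_after_slice = list(child.dbxrefs)
print(dbxrefs_after_slice)

# Copy across only what still applies to the sub-sequence.
child.annotations["source"] = parent.annotations["source"]
child.dbxrefs = parent.dbxrefs[:]   # copy the list so the records stay independent

print(child.annotations["source"])
print(child.dbxrefs)
```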

The same point could be made about the record id, name and description, but for practicality these are preserved:

>>> sub_record.id
'NC_005816.1'
>>> sub_record.name
'NC_005816'
>>> sub_record.description
'Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence'

This illustrates the problem nicely, though: our new sub-record is not the complete sequence of the plasmid, so the description is wrong! Let’s fix this and then view the sub-record as a reduced GenBank file using the format method described above in Section 4.6:

>>> sub_record.description = "Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, partial."
>>> print(sub_record.format("genbank"))
...

See Sections 20.1.7 and 20.1.8 for some FASTQ examples where the per-letter annotations (the read quality scores) are also sliced.

Next: 4.8
