

Biopython Tutorial and Cookbook: Japanese translation (4.2)

Posted at 2020-05-28

4.2 Creating a SeqRecord


Using a SeqRecord object is not very complicated, since all of the information is exposed as attributes of the class.

Usually you won’t create a SeqRecord “by hand”, but instead use Bio.SeqIO to read in a sequence file for you (see Chapter 5 and the examples below). However, creating a SeqRecord can be quite simple.

4.2.1 SeqRecord objects from scratch

To create a SeqRecord at a minimum you just need a Seq object:

>>> from Bio.Seq import Seq
>>> simple_seq = Seq("GATC")
>>> from Bio.SeqRecord import SeqRecord
>>> simple_seq_r = SeqRecord(simple_seq)

You can also pass the id, name and description to the initialization function. If you do not, they are set to placeholder strings indicating they are unknown, and can be modified afterwards:

>>> simple_seq_r.id
'<unknown id>'
>>> simple_seq_r.id = "AC12345"
>>> simple_seq_r.description = "Made up sequence I wish I could write a paper about"
>>> print(simple_seq_r.description)
Made up sequence I wish I could write a paper about
>>> simple_seq_r.seq

Including an identifier is very important if you want to output your SeqRecord to a file.
You would normally include this when creating the object:

>>> from Bio.Seq import Seq
>>> simple_seq = Seq("GATC")
>>> from Bio.SeqRecord import SeqRecord
>>> simple_seq_r = SeqRecord(simple_seq, id="AC12345")
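As a minimal sketch of passing all three fields at creation time: the id and description are the ones used above, while the name string is a placeholder made up for this illustration.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# "example_name" is a hypothetical placeholder, not a value from the tutorial.
record = SeqRecord(
    Seq("GATC"),
    id="AC12345",
    name="example_name",
    description="Made up sequence I wish I could write a paper about",
)

print(record.id)
print(record.name)
print(record.description)
```

Setting everything in the constructor avoids a record ever carrying the "unknown" placeholder strings.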

As mentioned above, the SeqRecord has a dictionary attribute, annotations.
This is used for any miscellaneous annotations that don’t fit under one of the other more specific attributes.

Adding annotations is easy, and just involves dealing directly with the annotation dictionary:

>>> simple_seq_r.annotations["evidence"] = "None. I just made it up."
>>> print(simple_seq_r.annotations)
{'evidence': 'None. I just made it up.'}
>>> print(simple_seq_r.annotations["evidence"])
None. I just made it up.

Working with per-letter annotations is similar: letter_annotations is a dictionary-like attribute which lets you assign any Python sequence (i.e. a string, list or tuple) that has the same length as the sequence:

>>> simple_seq_r.letter_annotations["phred_quality"] = [40, 40, 38, 30]
>>> print(simple_seq_r.letter_annotations)
{'phred_quality': [40, 40, 38, 30]}
>>> print(simple_seq_r.letter_annotations["phred_quality"])
[40, 40, 38, 30]
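The length requirement can be seen with a small sketch. It assumes that a value of the wrong length is rejected with a TypeError, which is the behaviour I would expect from current Biopython releases. The key "too_short" is a made-up name for this illustration.

```python
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

record = SeqRecord(Seq("GATC"), id="AC12345")

# Four quality scores for four letters: accepted.
record.letter_annotations["phred_quality"] = [40, 40, 38, 30]

# Three scores for four letters: expected to be rejected.
try:
    record.letter_annotations["too_short"] = [40, 40, 38]
    rejected = False
except TypeError:
    rejected = True

print(rejected)
print(sorted(record.letter_annotations))
```

This is why letter_annotations is only dictionary-like rather than a plain dictionary: it checks each value against the sequence length on assignment.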

The dbxrefs and features attributes are just Python lists, and should be used to store strings and SeqFeature objects (discussed later in this chapter) respectively.
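A minimal sketch of filling both lists by hand. The cross-reference string and the feature type are placeholders chosen for this illustration, not values from the tutorial.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

record = SeqRecord(Seq("GATC"), id="AC12345")

# dbxrefs holds plain strings; "ExampleDB:0001" is a made-up cross-reference.
record.dbxrefs.append("ExampleDB:0001")

# features holds SeqFeature objects; this one spans the whole 4-letter sequence.
feature = SeqFeature(FeatureLocation(0, 4), type="misc_feature")
record.features.append(feature)

print(record.dbxrefs)
print(len(record.features), record.features[0].type)
```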

4.2.2 SeqRecord objects from FASTA files

This example uses a fairly large FASTA file containing the whole sequence for Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, originally downloaded from the NCBI.

This file is included with the Biopython unit tests under the GenBank folder, or is available online as NC_005816.fna from the Biopython website.

The file starts like this, and you can check that there is only one record present (i.e. only one line starting with a greater-than symbol):

>gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus ... pPCP1, complete sequence

Back in Chapter 2 you will have seen the function Bio.SeqIO.parse(...) used to loop over all the records in a file as SeqRecord objects.

The Bio.SeqIO module has a sister function, Bio.SeqIO.read(...), for use on files which contain just one record, which we’ll use here (see Chapter 5 for details):
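A minimal sketch of that call, mirroring the GenBank example in section 4.2.3 and assuming the file name NC_005816.fna mentioned above. So that the snippet runs without the file, it reads a stand-in record from an in-memory handle: the title line is the one quoted above, and the 8-letter sequence is made up.

```python
from io import StringIO

from Bio import SeqIO

# With the real file on disk the call would presumably be:
#     record = SeqIO.read("NC_005816.fna", "fasta")
# Stand-in data: the quoted title line plus a made-up 8-letter sequence.
fasta_text = (
    ">gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus ... pPCP1, complete sequence\n"
    "GATCGATC\n"
)
record = SeqIO.read(StringIO(fasta_text), "fasta")
print(record.id)
print(len(record.seq))

# SeqIO.read insists on exactly one record; two records raise ValueError.
try:
    SeqIO.read(StringIO(fasta_text + fasta_text), "fasta")
    raised = False
except ValueError:
    raised = True
print(raised)
```

Unlike SeqIO.parse, which returns an iterator, SeqIO.read hands back a single SeqRecord directly, so a file with zero or several records is treated as an error.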

Now, let’s have a look at the key attributes of this SeqRecord individually, starting with the seq attribute, which gives you a Seq object:

>>> record.seq

Here Bio.SeqIO has defaulted to a generic alphabet, rather than guessing that this is DNA.

If you know in advance what kind of sequence your FASTA file contains, you can tell Bio.SeqIO which alphabet to use (see Chapter 5).

Next, the identifiers and description:

>>> record.id
'gi|45478711|ref|NC_005816.1|'
>>> record.name
'gi|45478711|ref|NC_005816.1|'
>>> record.description
'gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus ... pPCP1, complete sequence'

As you can see above, the first word of the FASTA record’s title line (after removing the greater than symbol) is used for both the id and name attributes.

The whole title line (after removing the greater than symbol) is used for the record description. This is deliberate, partly for backwards compatibility reasons, but it also makes sense if you have a FASTA file like this:

>Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1

Note that none of the other annotation attributes get populated when reading a FASTA file:
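A quick way to check this, sketched with a made-up one-record FASTA string rather than the real file (the identifier and description text are placeholders):

```python
from io import StringIO

from Bio import SeqIO

# Made-up single-record FASTA text standing in for NC_005816.fna.
fasta_text = ">example_id made-up description\nGATC\n"
record = SeqIO.read(StringIO(fasta_text), "fasta")

# All four containers are expected to come back empty for FASTA input.
print(record.dbxrefs)
print(record.annotations)
print(record.letter_annotations)
print(record.features)
```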

In this case our example FASTA file was from the NCBI, and they have a fairly well defined set of conventions for formatting their FASTA lines.

This means it would be possible to parse this information and extract the GI number and accession for example.
However, FASTA files from other sources vary, so this isn’t possible in general.
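As an illustration only, the NCBI-style title line shown earlier could be split on the pipe character. The helper name parse_ncbi_title is hypothetical, and the sketch assumes the gi|number|database|accession| layout seen in this one example. It is not a general FASTA title parser.

```python
def parse_ncbi_title(title):
    """Split a title of the form gi|<number>|<db>|<accession>| <free text>."""
    identifier = title.split(None, 1)[0]  # first word, as used for record.id
    fields = identifier.split("|")
    if len(fields) < 4 or fields[0] != "gi":
        raise ValueError(f"not a gi-style identifier: {identifier!r}")
    return {"gi": fields[1], "database": fields[2], "accession": fields[3]}


title = "gi|45478711|ref|NC_005816.1| Yersinia pestis biovar Microtus ... pPCP1, complete sequence"
parsed = parse_ncbi_title(title)
print(parsed["gi"], parsed["accession"])
```

A title from another source, such as the plain "Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1" line above, makes the helper raise ValueError, which is the point made in the text: the layout cannot be relied on in general.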

4.2.3 SeqRecord objects from GenBank files

As in the previous example, we’re going to look at the whole sequence for Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, originally downloaded from the NCBI, but this time as a GenBank file.

Again, this file is included with the Biopython unit tests under the GenBank folder, or is available online as NC_005816.gb from the Biopython website.

This file contains a single record (i.e. only one LOCUS line) and starts:

LOCUS       NC_005816               9609 bp    DNA     circular BCT 21-JUL-2008
DEFINITION  Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete
            sequence.
VERSION     NC_005816.1  GI:45478711
PROJECT     GenomeProject:10638

Again, we’ll use Bio.SeqIO to read this file in, and the code is almost identical to that used above for the FASTA file (see Chapter 5 for details):

>>> from Bio import SeqIO
>>> record = SeqIO.read("NC_005816.gb", "genbank")
>>> record
IUPACAmbiguousDNA()), id='NC_005816.1', name='NC_005816',
description='Yersinia pestis biovar Microtus str. 91001 plasmid pPCP1, complete sequence.',

You should be able to spot some differences already! Taking the attributes individually, the sequence string is the same as before, but this time Bio.SeqIO has been able to automatically assign a more specific alphabet (see Chapter 5 for details):

The name comes from the LOCUS line, while the id includes the version suffix. The description comes from the DEFINITION line:
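To see which header line feeds which attribute without the real file, here is a rough sketch that writes a small made-up record to GenBank text in memory and reads it back. It assumes Biopython 1.78 or later, where the GenBank writer requires a molecule_type annotation. The sequence and description are placeholders.

```python
from io import StringIO

from Bio import SeqIO
from Bio.Seq import Seq
from Bio.SeqRecord import SeqRecord

# Made-up 8-letter stand-in for the plasmid; id and name match the example above.
original = SeqRecord(
    Seq("GATCGATC"),
    id="NC_005816.1",
    name="NC_005816",
    description="Made up stand-in record",
)
# Assumption: Biopython 1.78+ needs this annotation to write GenBank output.
original.annotations["molecule_type"] = "DNA"

handle = StringIO()
SeqIO.write(original, handle, "genbank")
genbank_text = handle.getvalue()

# Show the header lines the parser reads the attributes from.
for line in genbank_text.splitlines():
    if line.startswith(("LOCUS", "DEFINITION", "VERSION")):
        print(line)

record = SeqIO.read(StringIO(genbank_text), "genbank")
print(record.name)  # taken from the LOCUS line
print(record.id)    # taken from the VERSION line, so it keeps the ".1" suffix
```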

GenBank files don’t have any per-letter annotations:

>>> record.letter_annotations
{}

Most of the annotation information gets recorded in the annotations dictionary, for example:

>>> len(record.annotations)
>>> record.annotations["source"]
'Yersinia pestis biovar Microtus str. 91001'

The dbxrefs list gets populated from any PROJECT or DBLINK lines:

>>> record.dbxrefs

Finally, and perhaps most interestingly, all the entries in the features table (e.g. the genes or CDS features) get recorded as SeqFeature objects in the features list.
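A small sketch of working with that list, using a hand-built record in place of the parsed GenBank file. The sequence, the feature positions and the feature types are all made up for illustration.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import FeatureLocation, SeqFeature
from Bio.SeqRecord import SeqRecord

# Made-up record with two hand-built features, standing in for a parsed file.
record = SeqRecord(Seq("ATGGCCTAAGATC"), id="example")
record.features.append(SeqFeature(FeatureLocation(0, 9, strand=+1), type="CDS"))
record.features.append(SeqFeature(FeatureLocation(0, 13), type="source"))

# Pick out the CDS features and pull the matching subsequence from the record.
cds_features = [f for f in record.features if f.type == "CDS"]
for feature in cds_features:
    print(feature.type, feature.location, feature.extract(record.seq))
```

With a record read from NC_005816.gb the same loop would run over the real features, since record.features is an ordinary Python list.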


