Biopython Tutorial and Cookbook: Japanese translation (4.3)

Posted at 2020-06-30

4.3 Feature, location and position objects

Go to 4.2

4.3.1 SeqFeature objects

Sequence features are an essential part of describing a sequence.

Once you get beyond the sequence itself, you need some way to organize and easily access the more “abstract” information that is known about the sequence.

The design is heavily based on the GenBank/EMBL feature tables, so if you understand how they look, you will probably have an easier time grasping the structure of the Biopython classes.

The key idea of each SeqFeature object is to describe a region on a parent sequence, typically a SeqRecord object. That region is described with a location object, typically a range between two positions (see Section 4.3.2 below).

position
– This refers to a single position on a sequence, which may be fuzzy or not. For instance, 5, 20, <100 and >200 are all positions.
location
– A location is a region of sequence bounded by some positions. For instance, 5..20 (i.e. 5 to 20) is a location.
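
To make the distinction concrete, here is a minimal sketch that builds two positions and the location they bound. The variable names are illustrative only, and the coordinates use Python-style counting (zero-based, end-exclusive), so this is not the same region as the GenBank-style "5..20" mentioned above.

```python
from Bio.SeqFeature import ExactPosition, FeatureLocation

# A position is a single point on the sequence.
start = ExactPosition(5)
end = ExactPosition(20)

# A location is the region bounded by two positions.
region = FeatureLocation(start, end)

print(int(start), int(end))  # the two positions as plain integers
print(region)                # the location they bound
print(len(region))           # number of bases/residues covered
```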

I just mention this because sometimes I get confused between the two.

4.3.2.1 FeatureLocation object

Unless you work with eukaryotic genes, most SeqFeature locations are extremely simple: you just need start and end coordinates and a strand.
That is essentially all the basic FeatureLocation object does.
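
As a rough illustration of those three pieces of information, the sketch below creates a simple location on the reverse strand and reads the values back. The coordinates are arbitrary example values.

```python
from Bio.SeqFeature import FeatureLocation

# Start, end and strand are all a simple location needs.
# strand=-1 means the reverse strand, +1 the forward strand.
loc = FeatureLocation(5, 18, strand=-1)

print(loc.start)   # start coordinate (a position object)
print(loc.end)     # end coordinate (a position object)
print(loc.strand)  # strand as an integer
```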

In practice, of course, things can be more complicated. First of all, we have to handle compound locations made up of several regions. Secondly, the positions themselves may be fuzzy (inexact).

4.3.2.2 CompoundLocation object

Biopython 1.62 introduced the CompoundLocation as part of a restructuring of how complex locations made up of multiple regions are represented. The main usage is for handling ‘join’ locations in EMBL/GenBank files.
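
A minimal sketch of a ‘join’ of two regions, assuming two made-up exon coordinates on the forward strand. When parsing a real EMBL/GenBank file you would not normally build these by hand; the parser creates them for you.

```python
from Bio.SeqFeature import CompoundLocation, FeatureLocation

# Two hypothetical exons on the forward strand.
exon1 = FeatureLocation(10, 20, strand=+1)
exon2 = FeatureLocation(30, 40, strand=+1)

# The operator defaults to "join".
joined = CompoundLocation([exon1, exon2])

print(joined.operator)                     # how the parts are combined
print(len(joined.parts))                   # number of constituent regions
print(int(joined.start), int(joined.end))  # outermost coordinates
```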

4.3.2.3 Fuzzy Positions
So far we have only used simple positions. One complication in dealing with feature locations comes from the positions themselves. In biology, things often aren’t entirely certain (as much as we wet lab biologists try to make them certain!).
For instance, you might do a dinucleotide priming experiment and discover that an mRNA transcript starts at one of two sites. This is very useful information, but the complication comes in how to represent this as a position.
To help us deal with this, we have the concept of fuzzy positions. There are several types of fuzzy positions, so we have the following classes to deal with them:

ExactPosition
– As its name suggests, this class represents a position which is specified as exact along the sequence. This is represented as just a number, and you can get the position by looking at the position attribute of the object.
BeforePosition
– This class represents a fuzzy position that occurs prior to some specified site. In GenBank/EMBL notation, this is represented as something like '<13', signifying that the real position is located somewhere before 13. To get the specified upper boundary, look at the position attribute of the object.
AfterPosition
– The opposite of BeforePosition, this class represents a position that occurs after some specified site. This is represented in GenBank as '>13', and as with BeforePosition, you get the boundary number by looking at the position attribute of the object.
WithinPosition
– Occasionally used for GenBank/EMBL locations, this class models a position which occurs somewhere between two specified nucleotides. In GenBank/EMBL notation, this would be represented as ‘(1.5)’, meaning that the position is somewhere within the range 1 to 5. To get the information in this class you have to look at two attributes. The position attribute specifies the lower boundary of the range, so in our example this would be 1. The extension attribute specifies the distance to the upper boundary, so in this case it would be 4. So object.position is the lower boundary and object.position + object.extension is the upper boundary.
OneOfPosition
– Occasionally used for GenBank/EMBL locations, this class deals with a position where several possible values exist. For instance, you could use this if the start codon was unclear and there were two candidates for the start of the gene. Alternatively, that might be handled explicitly as two related gene features.
UnknownPosition
– This class deals with a position of unknown location. This is not used in GenBank/EMBL, but corresponds to the ‘?’ feature coordinate used in UniProt.
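
The classes listed above can be tried out side by side. The following is a small sketch with arbitrary example coordinates; it only relies on the constructors and on converting the objects to strings, which for the fuzzy classes gives the GenBank/EMBL-style notation.

```python
from Bio.SeqFeature import (
    AfterPosition,
    BeforePosition,
    ExactPosition,
    OneOfPosition,
    UnknownPosition,
    WithinPosition,
)

positions = [
    ExactPosition(13),
    BeforePosition(13),
    AfterPosition(13),
    # Somewhere within 1 to 5; 1 is the integer used for slicing etc.
    WithinPosition(1, left=1, right=5),
    # One of two candidate sites; the first argument must be one of the choices.
    OneOfPosition(1888, [ExactPosition(1888), ExactPosition(1901)]),
    UnknownPosition(),
]

for pos in positions:
    # Print the class name next to its textual representation.
    print(type(pos).__name__, pos)
```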

Here’s an example where we create a location with fuzzy end points:

>>> from Bio import SeqFeature
>>> start_pos = SeqFeature.AfterPosition(5)
>>> end_pos = SeqFeature.BetweenPosition(9, left=8, right=9)
>>> my_location = SeqFeature.FeatureLocation(start_pos, end_pos)

Note that the details of some of the fuzzy locations changed in Biopython 1.59. In particular, for BetweenPosition and WithinPosition you must now make explicit which integer position should be used for slicing etc. For a start position this is generally the lower (left) value, while for an end position this would generally be the higher (right) value.
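
As a sketch of that convention, the example below uses WithinPosition for both ends of a location, picking the left value as the integer for the start and the right value for the end. The coordinates are made up for illustration.

```python
from Bio.SeqFeature import FeatureLocation, WithinPosition

# Start lies somewhere within 10..13; use the lower (left) value as the integer.
fuzzy_start = WithinPosition(10, left=10, right=13)
# End lies somewhere within 20..23; use the higher (right) value as the integer.
fuzzy_end = WithinPosition(23, left=20, right=23)

loc = FeatureLocation(fuzzy_start, fuzzy_end)

print(loc)                           # fuzzy representation of both ends
print(int(loc.start), int(loc.end))  # the integers used for slicing
print(len(loc))                      # length based on those integers
```

Choosing the outer values this way means a slice taken with the integer coordinates covers the whole region the feature could occupy.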

If you print out a FeatureLocation object, you get a nice representation of the information:

>>> print(my_location)
[>5:(8^9)]

We can access the fuzzy start and end positions using the start and end attributes of the location:

>>> my_location.start
AfterPosition(5)
>>> print(my_location.start)
>5
>>> my_location.end
BetweenPosition(9, left=8, right=9)
>>> print(my_location.end)
(8^9)

If you don’t want to deal with fuzzy positions and just want numbers, they are actually subclasses of int, so they should behave like integers:

>>> int(my_location.start)
5
>>> int(my_location.end)
9

For compatibility with older versions of Biopython, you can ask for the nofuzzy_start and nofuzzy_end attributes of the location, which are plain integers:

>>> my_location.nofuzzy_start
5
>>> my_location.nofuzzy_end
9

Notice that this just gives you back the position attributes of the fuzzy locations.

Similarly, to make it easy to create a location without worrying about fuzzy positions, you can just pass numbers to the FeatureLocation constructor, and you will get back ExactPosition objects:

>>> exact_location = SeqFeature.FeatureLocation(5, 9)
>>> print(exact_location)
[5:9]
>>> exact_location.start
ExactPosition(5)
>>> int(exact_location.start)
5
>>> exact_location.nofuzzy_start
5

That is most of the nitty-gritty of dealing with fuzzy positions in Biopython. It has been designed so that dealing with fuzziness is not much more complicated than dealing with exact positions, and hopefully you find that to be true!

4.3.2.4 Location testing

You can use the Python keyword in with a SeqFeature or location object to see whether the base/residue for a parent coordinate is within the feature/location or not.

For example, suppose you have a SNP of interest and you want to know which features this SNP is within, and let’s suppose this SNP is at index 4350 (Python counting!). Here is a simple brute-force solution where we just check all the features one by one in a loop:

>>> from Bio import SeqIO
>>> my_snp = 4350
>>> record = SeqIO.read("NC_005816.gb", "genbank")
>>> for feature in record.features:
...     if my_snp in feature:
...         print("%s %s" % (feature.type, feature.qualifiers.get("db_xref")))
...
source ['taxon:229193']
gene ['GeneID:2767712']
CDS ['GI:45478716', 'GeneID:2767712']

Note that gene and CDS features from GenBank or EMBL files defined with joins are the union of the exons; they do not cover any introns.
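
The point about joins can be checked without a GenBank file. The sketch below builds a hypothetical two-exon CDS feature by hand (the coordinates are made up) and tests one index in each exon and one in the gap between them.

```python
from Bio.SeqFeature import CompoundLocation, FeatureLocation, SeqFeature

# Hypothetical CDS made of two exons, with an intron at 20..29 (Python counting).
exons = CompoundLocation(
    [FeatureLocation(10, 20, strand=+1), FeatureLocation(30, 40, strand=+1)]
)
cds = SeqFeature(exons, type="CDS")

for index in (15, 25, 35):
    # Only indices inside one of the exons count as being within the feature.
    print(index, index in cds)
```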

4.3.3 Sequence described by a feature or location

A SeqFeature or location object doesn’t directly contain a sequence; instead, the location (see Section 4.3.2) describes how to get this from the parent sequence.
For example, consider a (short) gene sequence with location 5:18 on the reverse strand, which in GenBank/EMBL notation using 1-based counting would be complement(6..18), like this:

>>> from Bio.Seq import Seq
>>> from Bio.SeqFeature import SeqFeature, FeatureLocation
>>> example_parent = Seq("ACCGAGACGGCAAAGGCTAGCATAGGTATGAGACTTCCTTCCTGCCAGTGCTGAGGAACTGGGAGCCTAC")
>>> example_feature = SeqFeature(FeatureLocation(5, 18, strand=-1), type="gene")

You could take the parent sequence, slice it to extract 5:18, and then take the reverse complement.
If you are using Biopython 1.59 or later, the feature location’s start and end are integer-like, so this works:

>>> feature_seq = example_parent[example_feature.location.start:example_feature.location.end].reverse_complement()
>>> print(feature_seq)
AGCCTTTGCCGTC

This is a simple example, so this isn’t too bad. However, once you have to deal with compound features (joins), this gets rather messy.
Instead, the SeqFeature object has an extract method to take care of all this:

>>> feature_seq = example_feature.extract(example_parent)
>>> print(feature_seq)
AGCCTTTGCCGTC

The length of a SeqFeature or location matches that of the region of sequence it describes.

>>> print(example_feature.extract(example_parent))
AGCCTTTGCCGTC
>>> print(len(example_feature.extract(example_parent)))
13
>>> print(len(example_feature))
13
>>> print(len(example_feature.location))
13

For simple FeatureLocation objects the length is just the difference between the start and end positions.
However, for a CompoundLocation the length is the sum of the constituent regions.
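
To illustrate the difference, here is a short sketch comparing a simple location with a compound one that spans the same outer coordinates but skips a gap. The parent sequence is a shortened version of the example above and the coordinates are arbitrary.

```python
from Bio.Seq import Seq
from Bio.SeqFeature import CompoundLocation, FeatureLocation

parent = Seq("ACCGAGACGGCAAAGGCTAGCATAGGTATGAGACTTCC")

# Simple location: length is end minus start.
simple = FeatureLocation(0, 18, strand=+1)

# Compound location: two parts covering 0:5 and 10:18, skipping 5:10.
compound = CompoundLocation(
    [FeatureLocation(0, 5, strand=+1), FeatureLocation(10, 18, strand=+1)]
)

print(len(simple))               # 18 - 0
print(len(compound))             # 5 + 8, the gap is not counted
print(compound.extract(parent))  # the two parts joined together
```

The extracted sequence has the same length as the compound location, which matches the rule stated above.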

Go to 4.4
