More than 1 year has passed since last update.

Qiita×Findy記事投稿キャンペーン「今の開発組織でトライしたこと・トライしていること・トライしようとしていること」

google colabolatoryでcircRNAを同定したい④

Posted at 2024-04-09

circtoolsは結局断念

circtoolsに関してはこちらのページを参照にしました。

このページには、circtoolsについてこう書かれていました。

circtoolsは、データ処理コードのほとんどをPython 3 (>=3.7)で、プロットと統計解析の部分はR (>=4.0.0)で書かれている。このツールは多くの外部依存性を持っており、そのほとんどが標準的なバイオインフォマティクスツールとパッケージである。インストールはデフォルトで、必要な依存関係をすべてインストールする。

一つ引っかかる。「ほとんどをpython3」
一部は違うということか
githubのサイトを見てみると、こう書いてある。

circtoolsパッケージはPython3(>=3.4)で書かれており、detectとreconstructの2つのモジュールはPython2(>=2.7)のインストールが必要である。

え、またか
うーん、やめとこう。

CIRI2を使う

初心に戻って、これまで使われてきているpipelineでpython3ベースのものはないか探すことにしました。（なぜ最初からそうしなかったのか）
ということで、気を取り直して、CIRI2を使用することにしました。

上記の原著論文によると

CIRIはPerlで実装され、LinuxとMac OS Xで広範囲にテストされている。CIRIを使うのに他のツールは必要ない。CIRIはCIRI-simulatorとともにパッケージ化されており、[34]で入手可能である。

とのことで、これは「エラーを避けられるのではないか？」と期待が高まります

このpipelineの原著論文のタイトルは上記のように
CIRI: an efficient and unbiased algorithm for de novo circular RNA identification
と一見、de novoでcircRNAを同定できそうな雰囲気がありますが、Materials and MethodsのAlgorithmでは、

CIRIは、FASTAフォーマットされた参照配列と、BWA-MEMアルゴリズム[17]によって生成されたSAMアライメントの2種類のファイルを必要とする。

と書かれています。おそらくは新規に同定するという意味でde novoと表記しているようで、 de novo assembly 自体は自分でやらないといけないようです。

CIRI2はCIRI-cookbookとして懇切丁寧な取扱説明サイトがありました。(もちろん、他のpipelineでも形は違えど、チュートリアルがあると思います。)

こちらのサイトのAn example of running CIRI2を真似てみることにしました。

CIRI2のインストール

まずCIRI2をgithubを介してinstallしました。

クローン化

!git clone https://github.com/bioinfo-biols/CIRI-full.git

クローン化したフォルダは/content/drive/MyDriveの直下に置きました。

sampleデータのダウンロード

次にMyDriveの階層にsampleフォルダを作りました。
(ただ、わざわざここで作らなくても、サンプルデータのダウンロードのコード内でos.makedirs(download_dir, exist_ok=True)を加えて、ファイルの保存先ディレクトリが存在しない場合作成するとしてもよかったですね。)

sampleフォルダの作成

import os
print(os.getcwd())
os.chdir('/content/drive/MyDrive/sample')

sampleデータはCIRI2のフォルダ内には含まれておらず、別のサイトからダウンロードhttps://sourceforge.net/projects/ciri/files/CIRI2/する必要がありました。

Please download CIRI2 and test_data2.zip from https://sourceforge.net/projects/ciri/files/CIRI2/ and then unzip the three files in test_data2.zip.

test_data2.zipをダウンロードしていきます。.sraファイルと異なりwgetではダウンロードが出来なかったので、chatGPT先生に教えてもらって、先ほど作ったsampleフォルダにダウンロードしました。初心者の私にはstatus_codeが新鮮です。

サンプルデータのダウンロード

import requests
#import os 
#先にosをimportしていたので、今回は割愛

# ダウンロードするファイルのURL
url = "https://sourceforge.net/projects/ciri/files/CIRI2/CIRI_v2.0.6.zip/download"

# ダウンロードするファイルの保存先ディレクトリ
download_dir = "/content/drive/MyDrive/sample"

# ダウンロードするファイルの保存先パス
file_path = os.path.join(download_dir, "CIRI_v2.0.6.zip")

# ファイルをダウンロード
response = requests.get(url)

# レスポンスが成功した場合
if response.status_code == 200:
    # ダウンロードしたファイルを保存
    with open(file_path, 'wb') as f:
        f.write(response.content)
    print("ファイルのダウンロードが完了しました。保存先:", file_path)
else:
    print("ファイルのダウンロードに失敗しました。")

sampleフォルダ内にダウンロードできているか確認します。

sampleフォルダ内の確認

!ls 
#カレントディレクトリ以外のディレクトリの中身を知りたい場合は
#!ls $path

CIRI_v2.0.6.zipが表記されていたので、無事ダウンロードできていることが確認できました。
次にunzipで解凍します。

zipの解凍

!unzip test_data2.zip 
#引数として -d $pathでoutput folderを指定できる
!ls

chr1.fa CIRI_v2.0.6.zip test.sam
chr1.gtf CIRIerror.log __MACOSX
と表示されました
各ファイルの内容としては、こんな感じでしょうか。

chr1.fa：1番染色体のfastaファイル
test.sam：サンプルの配列をマッピングしたデータ
chr1.gtf：chr1をアノテーションしたデータ
__MACOSX：zip解凍時に作られるフォルダ。windowsユーザーには関係ないみたい。
CIRIerror.log：エラーのlog

次にCIRI2を使用していきます。

CIRI2で解析

CIRI2の使い方は以下の通り。

How to run CIRI2

!perl CIRI2.pl -I in.sam -O outfile -F ref.fa (-R ref_dir/)

引数は以下の通り。

-I, --in
input SAM file name (required; generated by BWA-MEM)
-O, --out
output circRNA list name (required)
-F, --ref_file
FASTA file of all reference sequences. Please make sure this file is
the same one provided to BWA-MEM. Either this argument or -R/--ref-dir
is required.
-R, --ref_dir
directory of reference sequence(s). Please make sure fasta files in
this directory are from the FASTA file(s) provided to BWA-MEM. Either
this argument or -F/--ref-file is required.
-A, --anno
input GTF/GFF3 formatted annotation file name (optional)

CIRI2自体はperlで書かれているためか、path通しがうまくできず、だらだらと絶対パスを使用して起動させました。(アノテーションファイルの指定が無くても解析可能なようですが、今回はアノテーションファイルを指定する方法で行いました。)

circRNAの同定

!perl /content/drive/MyDrive/CIRI-full/bin/CIRI_v2.0.6/CIRI2.pl -I test.sam -O outfile -F chr1.fa -A chr1.gtf

[Tue Apr 9 03:42:48 2024] CIRI finished its work. Please see output file outfile for detail.
の表記が出たので再度sampleフォルダの中身を確認しました。
outfileとoutfile.logが増えていました。このoutfileの中身は以下の情報が入っています。ちなみにtabで区切られています。

outfile
circRNA_ID	chr	circRNA_start	circRNA_end	#junction_reads	SM_MS_SMS	#non_junction_reads	junction_reads_ratio	circRNA_type	gene_id	strand	junction_reads_ID
chr1:16770127\|16775696	chr1	16770127	16775696	8	5_6_0	192	0.077	exon	ENSG00000157191.15	+	GFGG-GA2-1:2:33:2000:1585,GFGG-GA2-1:2:33:2000:1587...
chr1:16775588\|16778510	chr1	16775588	16778510	6	5_5_0	192	0.056	exon	ENSG00000157191.15	+	GFGG-GA2-1:2:33:2000:1276,GFGG-GA2-1:2:33:2000:1277...

outfileには拡張子がありませんが、おそらく.txtファイルか.csvファイル形式のように思います。
どちらのファイル形式として保存しても、開くことができました。

ちなみに各項目についてはcookbookに以下のような説明が載っていました。

列	説明
1	ID of a predicted circRNA in the pattern of "chr:start
2	chromosome of a predicted circRNA
3	start loci of a predicted circRNA on the chromosome
4	end loci of a predicted circRNA on the chromosome
5	circular junction read (also called as back-spliced junction read) count of a predicted circRNA
6	unique CIGAR types of a predicted circRNA. For example, a circRNAs have three junction reads: read A (80M20S, 80S20M), read B (80M20S, 80S20M), read C (40M60S, 40S30M30S, 70S30M), then its has two SM types (80S20M, 70S30M), two MS types (80M20S, 70M30S) and one SMS type (40S30M30S). Thus its SM_MS_SMS should be 2_2_1.
7	non-junction read count of a predicted circRNA that mapped across the circular junction but consistent with linear RNA instead of being back-spliced
8	ratio of circular junction reads calculated by 2#junction_reads/(2#junction_reads+#non_junction_reads). #junction_reads is multiplied by two because a junction read is generated from two ends of circular junction but only counted once while a non-junction read is from one end. It has to be mentioned that the non-junction reads are still possibly from another larger circRNA, so the junction_reads_ratio based on it may be an inaccurate estimation of relative expression of the circRNA.
9	type of a circRNA according to positions of its two ends on chromosome (exon, intron or intergenic_region; only available when annotation file is provided)
10	ID of the gene(s) where an exonic or intronic circRNA locates
11	strand info of a predicted circRNAs (new in CIRI2)
12	all of the circular junction read IDs (split by ",")

おー、うまくいった。テストデータなので当然かもしれませんが、今まではテストですらうまくいきませんでした。大きく前進した気持ちになります。

拡張子を書き足したときに.txtファイルとして保存してみたので、このデータを.csvファイルへ書き出したいと思います。

.csvファイルへの書き出し

import pandas as pd

# データを読み込む
data = pd.read_csv('outfile.txt', sep='\t')

# .CSVファイルに書き出す
data.to_csv('output.csv', index=False)

カレントディレクトリであるsampleフォルダ内に無事outfile.csvとして書き出されました。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up