More than 1 year has passed since last update.

Qiita×Findy記事投稿キャンペーン「今の開発組織でトライしたこと・トライしていること・トライしようとしていること」

google colabolatoryでcircRNAを同定したい②（step1~4）

Posted at 2024-04-08

前回は、circRNAFinderのgitをクローンするところで終わりました。

今回はいよいよcircRNAFinder内のTutrial.shの指示に従い、一つずつステップを踏んでいきたいと思います。基本的には材料集めとそのプロセシングの回になります。

Tutorial.shを眺める

Tutrialとしてどんなことが書かれているのか見ていきたいと思います。

#! /usr/bin/bash
# This tutorial helps you get started with circRNAFinder by demonstrating how to detect, annotate and visualize circRNAs using public RNA-Seq data. It also could be used as a template to build a pipeline for your own data.

# Step1: Create a working directory
･･･
# Step2: Download and build human genome/transcript sequences and gene annotations. (You only have to do it once.)
･･･
## Step3: Create fasta folder, download the public RNA-Seq data from SRA database.
･･･
## Step4: Convert .sra files into FASTA format, collapse the identical reads.
･･･
## Step5: Edit configure files
･･･
## Step6: Run circRNAFinder
･･･

げげっ・・・
step 6まである・・・
しかも、circRNAFinderを起動させるまでの処理が長い
でも、頑張るしかあるまい

Step1:ワーキングディレクトリを作る

ここはなんとかできそう。
まずカレントディレクトリを確認します。

python

#カレントディレクトリを確認
import os
print(os.getcwd())

/contentにあることがわかりました。
続けて、元々提供されていたコードに従い、exampleというフォルダを/content/drive/ MyDriveの中に作りたいと思います。ですので、まず、MyDriveをワーキングディレクトリとしたいと思います。

python

# MyDriveディレクトリへ移動
os.chdir('/content/drive/MyDrive')

bash

#新しくディレクトリを作成
!mkdir example
!cd example

python

#新しく作ったディレクトリに移動
os.chdir('/content/drive/MyDrive/example')

Step2:ヒトゲノム/トランスクリプトの配列及び遺伝子アノテーションをダウンロードしビルドする。

ここでも指示通り、先ほど作ったexampleフォルダの下層にbuildフォルダを作成しました。

bash

!mkdir build

再度、ワーキングディレクトリを移動します。

python

os.chdir ('/content/drive/MyDrive/example/build')

次にpath通しをして、git内にあるSeqAnnoDownloadBuild.pyを使用します。

python

#path通し
import sys
sys.path.append('/content/drive/MyDrive/circRNAFinder/src')

bash

# genome/transcript、gene annotationをダウンロード&ビルド
!SeqAnnoDownloadBuild.py hg19 --download --genome
!SeqAnnoDownloadBuild.py hg19 --build --genome
!SeqAnnoDownloadBuild.py hg19 --download --transcriptome
!SeqAnnoDownloadBuild.py hg19 --build --transcriptome
!SeqAnnoDownloadBuild.py hg19 --download --annotation
!SeqAnnoDownloadBuild.py hg19 --build --annotation
!cd ..

ここで、何度かエラーがでました。
どうもSeqAnnoDownloadBuild.py内に書かれているprintの指示の文法が古いことが原因のようで、printからprint()に変更し保存した後、実行するとうまくいきました。同様のエラーが他の.pyファイルでも起きています。
buildフォルダ内にhg19.gtf、hg19.gtf.build、hg19.grf.gz、hg19.tar.gz、hg19_dna.fa、hg19_trans.faというファイルができていました。

Step3: fastaフォルダを作り、そこにSRAデータベースで公開されているRNA-Seq dataをダウンロードする

先ほど同様にbuildフォルダ内にfastaフォルダを作成し、移動します。
念のため、移動した後、ディレクトリの位置を確認します。

python

print(getcwd())

/content/drive/MyDrive/example/build/fastaとなっており、移動が確認できました。
指示には、「マニュアルでデータをダウンロードするか、もしAsperaをインストールしているなら、次のコマンドでデータをダウンロードできる」と書いてありましたが、コードが複雑であることと、普通にwgetでデータをダウンロードできることから、指示には従わず次のコードを実行しました。(自動でプログラムやファイル名が表示されたのでTabキーを押しまくり、ゴリゴリなコードとなりました。)

bash

# Or if you've installed the Aspera, you can download them using the following commands in terminal:
# ~/.aspera/connect/bin/ascp -QTr -k1 -l640M -L log --retry-timeout=10 -i ~/.aspera/connect/etc/asperaweb_id_dsa.putty anonftp@ftp-private.ncbi.nlm.nih.gov:sra/sra-instant/reads/ByRun/sra/SRR/SRR650/SRR650317/SRR650317.sra .

#ここからが私のコード。圧縮した形式でダウンロード。
!fastq-dump --split-files SRR650317 --gzip
!fastq-dump --split-files SRR364679 --gzip
!fastq-dump --split-files SRR364680 --gzip
!fastq-dump --split-files SRR364681 --gzip
!fastq-dump --split-files SRR384963 --gzip
!fastq-dump --split-files SRR384964 --gzip
!fastq-dump --split-files SRR384965 --gzip

fasta内にSRR650317～SRR384965のfastq.gzがダウンロードされていました。
(尚、pair-endのものはSRR650317のみで、これはSRR650317_1.fastq.gzとSRR650317_2.fastaq.gzに分かれ、他は、single-endのため、例えばSRR364679_1.fastq.gzのように一つだけファイルができていました。)

Step4: Convert .sra files into FASTA format, collapse the identical reads

Step4ではsra2fasta.pyとfastx_collapser.pyを使用するよう指示がありましたが、結局使いませんでした。
sra2fasta.pyは「sra.file → fasta.file」へ変換するプログラムですが、fastq-dumpがその機能を担ってくれているので、使う必要はありませんでした。
fastx_collapser.pyは、複数の圧縮されたfasta.gzまたはfastq.gzファイルを取得し、解凍して一つのfasta.fileにまとめるプログラムでした。
しかし、python2ベースのpipelineだからか、エラーが頻発し、先ほどのprintの記述を変えたり、args = parser.parse_args()をargs = parser.parse_args(args=[])に変換したりしてみましたが、エラーがでてうまくいきませんでした。

エラーの内容

usage: colab_kernel_launcher.py [-h] [-f | -q] input output
colab_kernel_launcher.py: error: the following arguments are required: input, output
An exception has occurred, use %tb to see the full traceback.

SystemExit: 2

そこで解凍、seqtkを用いてのfastaファイルへの変換、マージは一つ一つ段階を追って行うことにしました。seqtkは.gzファイルにも対応しているので、解凍する必要はなかったな、と後から気づきました。

bash

#ディレクトリ内のファイルを全て解凍
!gzip -d *.fastq.gz

#seqtkでfastq→fastaへ変換

#seqtkをinstall
!conda install -c bioconda seqtk -y

#フォルダ内の.fastqを.fastaへ変換(以降chatGPT作)
import os

# フォルダのパス
folder_path = '/content/drive/MyDrive/example/build/fasta'

# フォルダ内のファイルを取得
files = os.listdir(folder_path)

# フォルダ内の各ファイルについて処理
for file in files:
    # .fastqファイルのみ処理
    if file.endswith('.fastq'):
        # 入力ファイルと出力ファイルのパスを作成
        input_file = os.path.join(folder_path, file)
        output_file = os.path.join(folder_path, os.path.splitext(file)[0] + '.fa')
        
        # seqtkを使って変換
        cmd = f'seqtk seq -a {input_file} > {output_file}'
        print(f'Converting {input_file} to {output_file}')
        os.system(cmd)

bash

#各fastqを一つにまとめる
# HEK293 (Paired-end reads) 
%cat SRR650317_1.fa SRR650317_2.fa > HEK293.fa

# CD19 (Single-end reads) #
%cat SRR364679.fa SRR384963.fa > CD19.fa

# CD34 (Single-end reads) #
%cat SRR364680.fa SRR384964.fa > CD34.fa

# neutrophils (Single-end reads) #
%cat SRR364681.fa SRR384965.fa > neutrophils.fa
%cd ..

ここまでで一通り、材料集めは終わりました。次回、いよいよcircRNAFinderを起動しようと思います。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up