
Personal notes on SRA Toolkit::fasterq-dump

Last updated at 2024-02-21, posted at 2024-02-06

Problem: Behind a proxy, trying to fetch SRA/FASTQ files with SRA Toolkit's prefetch/fasterq-dump fails with an SSL certificate verification error.

When downloading FASTA/FASTQ files from the NCBI SRA archive, runs of 5 Gbases or more have to be fetched with SRA Toolkit:

All runs exceed the download limit (>5 Gbases). Use SRA Toolkit to download runs locally in your preferred format.

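However, networks at universities and research institutes (or companies) often sit behind a proxy for security reasons. In such an environment, fetching SRA/FASTQ files with SRA Toolkit's prefetch or fasterq-dump can fail with an SSL error like the following.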

2024-02-04T23:54:34 fasterq-dump.3.0.10 sys: mbedtls_ssl_get_verify_result for 
'locate.ncbi.nlm.nih.gov' returned 0xC ( !! The certificate Common Name (CN) 
does not match with the expected CN  !! The certificate is not correctly signed 
by the trusted CA )
Failed to call external services.

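SRA Toolkit commands have no option to bypass (ignore) SSL certificate verification, so if the system administrators cannot configure SSL verification properly, there is nothing the user can do on their side.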

Solution: Download the SRA file from the NCBI SRA archive with a general-purpose command, then convert it to FASTQ locally with fasterq-dump.

Step 0. Preparation

Install the following commands.
・curl (or wget)
・sra-tools
Configure your proxy settings (.bashrc, .curlrc, .wgetrc, etc.).

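On the Run Browser page of the NCBI SRA archive, check whether there is a link to the SRA file you want (there usually is).
Links generally take the following form (the Xs in SRRXXXXXXXX stand for the numeric part).
https://sra-pub-run-odp.s3.amazonaws.com/sra/SRRXXXXXXXX/SRRXXXXXXXX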
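As a small illustration of the link pattern above, the following sketch builds the URL from a run accession. The function name `sra_url` and the validation rule (the letters SRR followed by digits) are assumptions made for this example; the Run Browser page remains the place to confirm that a link actually exists for a given run.

```python
import re

# Base URL taken from the link pattern shown above.
BASE_URL = "https://sra-pub-run-odp.s3.amazonaws.com"


def sra_url(accession: str) -> str:
    """Build the download URL for a run accession such as SRR15861857."""
    accession = accession.strip()
    # Assumption for this sketch: accessions are "SRR" followed by digits.
    if not re.fullmatch(r"SRR\d+", accession):
        raise ValueError(f"unexpected accession format: {accession!r}")
    return f"{BASE_URL}/sra/{accession}/{accession}"


print(sra_url("SRR15861857"))
# prints https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR15861857/SRR15861857
```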

curl has the "--insecure" option, which ignores SSL certificate verification errors.
With wget, use the "--no-check-certificate" option.

See each command's manual for details.

Step 1. Downloading the SRA file

Here we take downloading "SRR15861857" as the example.

With curl

curl --output SRR15861857 --insecure https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR15861857/SRR15861857

SRA files are binary, so if you run curl without specifying an output file name, you get the following warning.

Warning: Binary output can mess up your terminal. Use "--output -" to tell 
Warning: curl to output it to your terminal anyway, or consider "--output 
Warning: <FILE>" to save to a file.

With wget

wget --no-check-certificate https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR15861857/SRR15861857

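Downloading multiple SRA files in one batch

Typing each command by hand gets tedious when there are many SRA files to download, so I wrote the scripts below (curl version). Save the names of the SRA files you want to download in SRA_list.txt as plain text, one name per line. Replace the file path (.../path/) with your own.

Example of SRA_list.txt

Be sure to end the last file name with a newline.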

SRRXXXXXXX1
SRRXXXXXXX2
...
...
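To make the list format concrete, here is a minimal sketch of how such a list could be parsed in Python. The function name `read_accessions` and the sample names are hypothetical. Unlike a plain shell `read` loop, this version does not depend on the trailing newline and also drops blank lines.

```python
def read_accessions(text: str) -> list:
    """Return the non-empty lines of the list, stripped of whitespace."""
    return [line.strip() for line in text.splitlines() if line.strip()]


# Hypothetical sample: a blank line in the middle, no trailing newline.
sample = "SRR0000001\n\nSRR0000002"
print(read_accessions(sample))
# prints ['SRR0000001', 'SRR0000002']
```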

Shell script

download_SRA.sh

#!/bin/bash

# Path to the text file listing the files to download
SRA_LIST=".../path/SRA_list.txt"
# Base URL for downloading the files
BASE_URL="https://sra-pub-run-odp.s3.amazonaws.com"

# The "|| [[ -n ... ]]" part also picks up a last line without a trailing newline
while IFS= read -r file || [[ -n "$file" ]]
do
    # Skip blank lines
    [[ -z "$file" ]] && continue
    echo "Downloading ${file}"
    curl --output "${file}" --insecure "${BASE_URL}/sra/${file}/${file}"
done < "$SRA_LIST"

Python script

download_SRA.py

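import subprocess

# Path to the text file listing the files to download
SRA_list = ".../path/SRA_list.txt"
# Base URL for downloading the files
base_url = "https://sra-pub-run-odp.s3.amazonaws.com"

# Read the file names, dropping surrounding whitespace and blank lines
with open(SRA_list) as f:
    l_strip = [s.strip() for s in f if s.strip()]

for f_name in l_strip:
    # Build the command as a list so that no string splitting is needed
    command = ["curl", "--output", f"./{f_name}", "--insecure",
               f"{base_url}/sra/{f_name}/{f_name}"]
    print(" ".join(command))
    print('Downloading ' + f_name)
    result = subprocess.run(command)
    # Report a failed download but carry on with the remaining files
    if result.returncode != 0:
        print(f"curl failed for {f_name} (exit code {result.returncode})")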

Apparently you can also download directly with Python's urllib module instead of calling a terminal command, but the setup looked like a hassle, so I skipped it.
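For reference, the urllib route might look roughly like the sketch below. It is not the method used in this article. The function names `insecure_opener` and `download` are made up for this example, and it assumes the proxy is configured through the usual http_proxy / https_proxy environment variables. Disabling certificate verification here carries the same risk as curl's --insecure.

```python
import shutil
import ssl
import urllib.request


def insecure_opener() -> urllib.request.OpenerDirector:
    """Return an opener that skips certificate checks, like curl --insecure."""
    ctx = ssl.create_default_context()
    ctx.check_hostname = False       # must be switched off before verify_mode
    ctx.verify_mode = ssl.CERT_NONE  # do not verify the server certificate
    # build_opener keeps the default ProxyHandler, which picks up the
    # http_proxy / https_proxy environment variables.
    return urllib.request.build_opener(urllib.request.HTTPSHandler(context=ctx))


def download(url: str, dest: str, opener=None) -> None:
    """Stream url to dest in chunks so a large SRA file is not held in memory."""
    opener = opener or insecure_opener()
    with opener.open(url) as response, open(dest, "wb") as out:
        shutil.copyfileobj(response, out)


# Usage (not executed here):
# download("https://sra-pub-run-odp.s3.amazonaws.com/sra/SRR15861857/SRR15861857",
#          "SRR15861857")
```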

R script

download_SRA.R

# Change timeout. The default timeout (60 sec) is too short.
timeout_hour <- 6
options(timeout = 60 * 60 * timeout_hour)
timeout_message <- paste('timeout:', getOption('timeout') / 60 / 60, 'hours', sep = " ")
print(timeout_message)

# Path to the text file listing the files to download
SRA_List <- read.table(".../path/SRA_list.txt", quote = "\"", comment.char = "",
                       stringsAsFactors = FALSE)
# Base URL for downloading the files
base_url <- "https://sra-pub-run-odp.s3.amazonaws.com"

for (i in seq_len(nrow(SRA_List))) {
  SRA_file_name <- SRA_List[i, 1]
  full_url <- paste(base_url, 'sra', SRA_file_name, SRA_file_name, sep = "/")
  print(paste('Downloading', SRA_file_name, sep = " "))
  # mode = "wb" keeps the binary SRA file intact on Windows
  download.file(full_url, SRA_file_name, mode = "wb")
}

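Step 2. Converting the SRA file to FASTQ with fasterq-dump

Note: Specify the SRA file to convert by an absolute or relative path. If you give only the file name, fasterq-dump tries to download it over the internet (in other words, you get the SSL error again). For details on the options, see the fasterq-dump page of the sra-tools Wiki.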

fasterq-dump ./SRR15861857

To convert multiple SRA files in one go, use the following.

fasterq-dump ./SRR*

For details on how to use fasterq-dump, see the Wiki mentioned above.
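If you prefer to drive the batch conversion from Python, a minimal sketch could look like this. The function names are hypothetical. It assumes the downloaded SRA files sit in one directory under their SRR names with no extension, and it passes each one to fasterq-dump as an absolute path so that it is treated as a local file. Files with an extension, such as FASTQ output from an earlier run, are skipped.

```python
import subprocess
from pathlib import Path


def fasterq_dump_commands(directory: str) -> list:
    """Build one fasterq-dump command per downloaded SRA file in directory."""
    commands = []
    for path in sorted(Path(directory).glob("SRR*")):
        # Downloaded SRA files have no extension; skip e.g. SRR..._1.fastq.
        if path.is_file() and path.suffix == "":
            # An explicit path keeps fasterq-dump from going to the network.
            commands.append(["fasterq-dump", str(path.resolve())])
    return commands


def convert_all(directory: str = ".") -> None:
    """Run the commands one by one, stopping at the first failure."""
    for command in fasterq_dump_commands(directory):
        print(" ".join(command))
        subprocess.run(command, check=True)


# Usage (not executed here):
# convert_all(".")
```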
