
Retrieving and processing the SILVA database with RESCRIPt

Last updated at 2021-03-12, posted at 2021-03-12

What is RESCRIPt?

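When running taxonomy assignment against the SILVA database in QIIME 2, most people download the SILVA files pre-formatted for QIIME 2 from the Marker gene reference databases section of the Data resources page.

However, the RESCRIPt QIIME 2 plugin lets you choose the kind of database from the command line and then fetch, preprocess, and manage it.

RESCRIPt is Python-based software that retrieves reference databases from sources such as SILVA, Greengenes, and NCBI, and can manage and evaluate them.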

Installing RESCRIPt into QIIME 2

Here we install RESCRIPt so that it runs as a plugin inside QIIME 2.
First install the dependency, then install RESCRIPt itself with pip.

# Install the dependency.
conda activate qiime2-2021.2
conda install -c conda-forge -c bioconda -c qiime2 -c defaults xmltodict

# Install RESCRIPt.
pip install git+https://github.com/bokulich-lab/RESCRIPt.git

Once installation finishes, refresh the QIIME 2 command-line cache and check the help output.

qiime dev refresh-cache
qiime --help

rescript now appears in the Commands list.

Commands:
  info                Display information about current deployment.
  tools               Tools for working with QIIME 2 files.
  dev                 Utilities for developers and advanced users.
  alignment           Plugin for generating and manipulating alignments.
  composition         Plugin for compositional data analysis.
  cutadapt            Plugin for removing adapter sequences, primers, and other unwanted sequence from sequence data.
  dada2               Plugin for sequence quality control with DADA2.
  deblur              Plugin for sequence quality control with Deblur.
  demux               Plugin for demultiplexing & viewing sequence quality.
  diversity           Plugin for exploring community diversity.
  diversity-lib       Plugin for computing community diversity.
  emperor             Plugin for ordination plotting with Emperor.
  feature-classifier  Plugin for taxonomic classification.
  feature-table       Plugin for working with sample by feature tables.
  fragment-insertion  Plugin for extending phylogenies.
  gneiss              Plugin for building compositional models.
  longitudinal        Plugin for paired sample and time series analyses.
  metadata            Plugin for working with Metadata.
  phylogeny           Plugin for generating and manipulating phylogenies.
  quality-control     Plugin for quality control of feature and sequence data.
  quality-filter      Plugin for PHRED-based filtering and trimming.
  rescript            Pipeline for reference sequence annotation and curation.
  sample-classifier   Plugin for machine learning prediction of sample metadata.
  taxa                Plugin for working with feature taxonomy annotations.
  vsearch             Plugin for clustering and dereplicating with vsearch.
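
The visual check above can also be scripted. The following is a minimal sketch, not part of RESCRIPt or QIIME 2: `has_plugin` is a hypothetical helper that scans `qiime --help` output for a command name, and the live check only runs if `qiime` is on the PATH.

```shell
# has_plugin NAME: read `qiime --help` output on stdin and succeed if NAME
# is listed as a command (hypothetical helper for illustration only).
has_plugin() {
    grep -Eq "^[[:space:]]+$1[[:space:]]"
}

# Only query QIIME 2 when it is actually installed in this environment.
if command -v qiime >/dev/null 2>&1; then
    if qiime --help | has_plugin rescript; then
        echo "rescript is available"
    else
        echo "rescript not found; try: qiime dev refresh-cache"
    fi
fi
```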

1. Downloading the SILVA database

To download the same SILVA database as the one offered under Marker gene reference databases on the Data resources page, use the following command.

qiime rescript get-silva-data \
    --p-version '138' \
    --p-target 'SSURef_NR99' \
    --p-include-species-labels \
    --o-silva-sequences silva-138-ssu-nr99-seqs.qza \
    --o-silva-taxonomy silva-138-ssu-nr99-tax.qza

With --p-ranks you can also specify which taxonomic ranks to include.

--p-ranks 
TEXT... Choices('domain', 'superkingdom', 'kingdom',
    'subkingdom', 'superphylum', 'phylum', 'subphylum', 'infraphylum',
    'superclass', 'class', 'subclass', 'infraclass', 'superorder', 'order',
    'suborder', 'superfamily', 'family', 'subfamily', 'genus')
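
As a rough sketch of how --p-ranks might be used, the snippet below checks a rank selection against the choices listed above before calling get-silva-data. The helper `check_ranks` and the particular selection in `RANKS` are assumptions made for illustration. Passing several space-separated values after --p-ranks is assumed to work because the help text shows `TEXT...`. Note that 'species' is not among the listed choices.

```shell
# Ranks accepted by --p-ranks, copied from the help text above.
ALLOWED_RANKS="domain superkingdom kingdom subkingdom superphylum phylum subphylum infraphylum superclass class subclass infraclass superorder order suborder superfamily family subfamily genus"

# check_ranks RANK...: fail on the first rank that is not in ALLOWED_RANKS
# (hypothetical helper, not part of RESCRIPt).
check_ranks() {
    for r in "$@"; do
        case " $ALLOWED_RANKS " in
            *" $r "*) ;;
            *) echo "unsupported rank: $r" >&2; return 1 ;;
        esac
    done
}

# Illustrative selection of ranks; adjust to your needs.
RANKS="domain phylum class order family genus"

# $RANKS is intentionally unquoted so that each rank becomes its own argument.
# The download only runs when the ranks are valid and qiime is installed.
if check_ranks $RANKS && command -v qiime >/dev/null 2>&1; then
    qiime rescript get-silva-data \
        --p-version '138' \
        --p-target 'SSURef_NR99' \
        --p-ranks $RANKS \
        --p-include-species-labels \
        --o-silva-sequences silva-138-ssu-nr99-seqs.qza \
        --o-silva-taxonomy silva-138-ssu-nr99-tax.qza
fi
```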

2. Removing low-quality sequences

cull-seqs removes sequences that contain 5 or more ambiguous bases or a homopolymer of 8 or more bases.

qiime rescript cull-seqs \
    --i-sequences silva-138-ssu-nr99-seqs.qza \
    --o-clean-sequences silva-138-ssu-nr99-seqs-cleaned.qza
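
To make the two thresholds concrete, here is a small sketch that applies the same criteria to a single sequence string using standard shell tools. It only illustrates the rule described above and is not how RESCRIPt implements the filter. The function names are hypothetical, and the real plugin may handle edge cases differently.

```shell
# count_ambiguous SEQ: number of characters other than A, C, G, T, U.
count_ambiguous() {
    echo $(( $(printf '%s' "$1" | tr -d 'ACGTUacgtu' | wc -c) ))
}

# has_long_homopolymer SEQ: succeed if SEQ has a run of 8 or more identical bases.
has_long_homopolymer() {
    printf '%s\n' "$1" | grep -Eiq 'A{8,}|C{8,}|G{8,}|T{8,}|U{8,}'
}

# would_cull SEQ: succeed if SEQ meets either removal criterion
# (5 or more ambiguous bases, or a homopolymer of length 8 or more).
would_cull() {
    [ "$(count_ambiguous "$1")" -ge 5 ] || has_long_homopolymer "$1"
}

# Example: a sequence with five N characters is flagged for removal.
if would_cull "ACGTNNNNNACGT"; then
    echo "ACGTNNNNNACGT would be removed"
fi
```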

3. Removing duplicate sequences

The release information for the SILVA database (SSU Ref NR 138.1) contains the following statement.

Background information for current release (SSU Ref NR 138.1, August 2020)

Please note that due to this preservation and additional technical limitations (clustering of large datasets) there can still be sequences in the dataset with an identity of >99%.

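In other words, the database may still contain redundant sequences that are more than 99% identical to one another. To reduce this redundancy, dereplicate the sequences with --p-mode 'uniq'.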

qiime rescript dereplicate \
    --i-sequences silva-138-ssu-nr99-seqs-cleaned.qza \
    --i-taxa silva-138-ssu-nr99-tax.qza \
    --p-rank-handles 'silva' \
    --p-mode 'uniq' \
    --o-dereplicated-sequences silva-138-ssu-nr99-seqs-derep-uniq.qza \
    --o-dereplicated-taxa silva-138-ssu-nr99-tax-derep-uniq.qza

4. Taxonomy Assignment

The resulting taxonomy file, silva-138-ssu-nr99-tax-derep-uniq.qza, and sequence file, silva-138-ssu-nr99-seqs-derep-uniq.qza, can then be used to train a classifier, for example with the feature-classifier plugin.
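
One way to do this, sketched under assumptions: the article does not name a specific method, so the choice of `fit-classifier-naive-bayes` from the feature-classifier plugin is mine. The wrapper `train_classifier` and the output name used in the note below are hypothetical.

```shell
# train_classifier SEQS TAXONOMY OUT: fit a naive Bayes classifier on the
# curated reference, refusing to start if either input artifact is missing
# (hypothetical wrapper for illustration).
train_classifier() {
    seqs=$1 tax=$2 out=$3
    for f in "$seqs" "$tax"; do
        if [ ! -f "$f" ]; then
            echo "missing input artifact: $f" >&2
            return 1
        fi
    done
    qiime feature-classifier fit-classifier-naive-bayes \
        --i-reference-reads "$seqs" \
        --i-reference-taxonomy "$tax" \
        --o-classifier "$out"
}
```

With the files from step 3 in the working directory, it could be called as `train_classifier silva-138-ssu-nr99-seqs-derep-uniq.qza silva-138-ssu-nr99-tax-derep-uniq.qza silva-138-ssu-nr99-classifier.qza`. Training on the full-length SILVA reference can take a long time and a lot of memory.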

References

A database that collects ribosomal RNA sequences from bacteria, archaea, and eukaryotes.
SILVA rRNA database

The RESCRIPt preprint.
RESCRIPt: Reproducible sequence taxonomy reference database management for the masses

The RESCRIPt tutorial on the QIIME 2 forum.
Processing, filtering, and evaluating the SILVA database (and other reference sequence data) with RESCRIPt
