Evaluating nucleotide and amino acid sequences with BUSCO5

Posted at 2024-10-31

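1. Introduction: What is BUSCO?

BUSCO (Benchmarking Universal Single-Copy Orthologs) is a tool for assessing the completeness of genome and transcriptome assemblies. It uses a set of single-copy genes shared across a given lineage to check how much of the data is missing or duplicated.
Example: BUSCO run with the set of single-copy genes shared by land plants (embryophyta_odb10)

BUSCO assigns each gene in the set to one of the following four categories.
Complete (C): the full-length gene is present
Duplicated (D): a complete gene is present in more than one copy
Fragmented (F): only part of the gene is present
Missing (M): the gene is absent
Complete genes found in exactly one copy are reported as single-copy (S), so C = S + D.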
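As a rough sketch of how these categories add up, the snippet below pulls the percentages out of a one-line score written in the C/S/D/F/M notation that BUSCO uses in its short summary. The sample line and its numbers are invented for illustration, and busco_field is a hypothetical helper, not part of BUSCO.

```bash
#!/usr/bin/env bash
# Hypothetical score line in BUSCO notation (numbers invented for illustration).
line='C:95.0%[S:90.0%,D:5.0%],F:2.0%,M:3.0%,n:1614'

# busco_field KEY LINE: print the number that follows "KEY:" in LINE.
busco_field() {
    printf '%s\n' "$2" | sed -n "s/.*$1:\([0-9.]*\).*/\1/p"
}

c=$(busco_field C "$line")
s=$(busco_field S "$line")
d=$(busco_field D "$line")
f=$(busco_field F "$line")
m=$(busco_field M "$line")

echo "Complete=$c Single=$s Duplicated=$d Fragmented=$f Missing=$m"

# Complete is the sum of single-copy and duplicated genes,
# and Complete + Fragmented + Missing accounts for the whole gene set.
awk -v s="$s" -v d="$d" -v c="$c" -v f="$f" -v m="$m" \
    'BEGIN { printf "S+D=%.1f C+F+M=%.1f\n", s + d, c + f + m }'
```

A high D value relative to S often deserves a closer look, because it can come from unresolved haplotypes or from genuine duplication in the organism.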

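Schematic illustration

These categories give a quick read on data quality. BUSCO is mainly used to check the accuracy of genome assemblies and the coverage of transcriptomes, and it is widely used in new genome projects and phylogenetic analyses.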

2. Installation

# Pull the Docker image
docker pull ezlabgva/busco:v5.8.0_cv1
# Start the container and load its ~/.bashrc so that tab completion works
docker run -u $(id -u) -v "$PWD":/busco_wd/my_data -it \
            --init ezlabgva/busco:v5.8.0_cv1 bash --rcfile <(echo ". ~/.bashrc")
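If you would rather not work inside an interactive shell, a small wrapper along the following lines can run a single busco command in the container. This is only a sketch: busco_docker and DRY_RUN are hypothetical names introduced here, and the use of -w to make the mounted folder the working directory is my assumption about a convenient setup, not something the image requires.

```bash
#!/usr/bin/env bash
BUSCO_IMAGE='ezlabgva/busco:v5.8.0_cv1'
DRY_RUN=1   # 1 = only print the docker command; set to 0 to actually run it

# busco_docker ARGS...: run "busco ARGS..." in the container, with the current
# directory mounted at /busco_wd/my_data as in the interactive command above.
busco_docker() {
    # Rebuild the positional parameters into the full docker command line.
    set -- docker run --rm -u "$(id -u)" \
        -v "$PWD:/busco_wd/my_data" -w /busco_wd/my_data \
        "$BUSCO_IMAGE" busco "$@"
    if [ "$DRY_RUN" = 1 ]; then
        echo "$*"
    else
        "$@"
    fi
}

busco_docker --list-datasets
```

Printing the command first makes it easy to confirm the mount path and image tag before anything is executed.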

At NIG (the National Institute of Genetics), run BUSCO from a Singularity container image instead.

Running the container image with Singularity

singularity exec /usr/local/biotools/b/busco\:5.5.0--pyhdfd78af_0 busco -h

Help printed by "busco -h"

busco help

Welcome to BUSCO 5.4.3: the Benchmarking Universal Single-Copy Ortholog assessment tool.
For more detailed usage information, please review the README file provided with this distribution and the BUSCO user guide. Visit this page https://gitlab.com/ezlab/busco#how-to-cite-busco to see how to cite BUSCO

optional arguments:
  -i SEQUENCE_FILE, --in SEQUENCE_FILE
                        Input sequence file in FASTA format. Can be an assembled genome or transcriptome (DNA), or protein sequences from an annotated gene set. Also possible to use a path to a directory containing multiple input files.
  -o OUTPUT, --out OUTPUT
                        Give your analysis run a recognisable short name. Output folders and files will be labelled with this name. The path to the output folder is set with --out_path.
  -m MODE, --mode MODE  Specify which BUSCO analysis mode to run.
                        There are three valid modes:
                        - geno or genome, for genome assemblies (DNA)
                        - tran or transcriptome, for transcriptome assemblies (DNA)
                        - prot or proteins, for annotated gene sets (protein)
  -l LINEAGE, --lineage_dataset LINEAGE
                        Specify the name of the BUSCO lineage to be used.
  --augustus            Use augustus gene predictor for eukaryote runs
  --augustus_parameters --PARAM1=VALUE1,--PARAM2=VALUE2
                        Pass additional arguments to Augustus. All arguments should be contained within a single string with no white space, with each argument separated by a comma.
  --augustus_species AUGUSTUS_SPECIES
                        Specify a species for Augustus training.
  --auto-lineage        Run auto-lineage to find optimum lineage path
  --auto-lineage-euk    Run auto-placement just on eukaryote tree to find optimum lineage path
  --auto-lineage-prok   Run auto-lineage just on non-eukaryote trees to find optimum lineage path
  -c N, --cpu N         Specify the number (N=integer) of threads/cores to use.
  --config CONFIG_FILE  Provide a config file
  --contig_break n      Number of contiguous Ns to signify a break between contigs. Default is n=10.
  --datasets_version DATASETS_VERSION
                        Specify the version of BUSCO datasets, e.g. odb10
  --download [dataset ...]
                        Download dataset. Possible values are a specific dataset name, "all", "prokaryota", "eukaryota", or "virus". If used together with other command line arguments, make sure to place this last.
  --download_base_url DOWNLOAD_BASE_URL
                        Set the url to the remote BUSCO dataset location
  --download_path DOWNLOAD_PATH
                        Specify local filepath for storing BUSCO dataset downloads
  -e N, --evalue N      E-value cutoff for BLAST searches. Allowed formats, 0.001 or 1e-03 (Default: 1e-03)
  -f, --force           Force rewriting of existing files. Must be used when output files with the provided name already exist.
  -h, --help            Show this help message and exit
  --limit N             How many candidate regions (contig or transcript) to consider per BUSCO (default: 3)
  --list-datasets       Print the list of available BUSCO datasets
  --long                Optimization Augustus self-training mode (Default: Off); adds considerably to the run time, but can improve results for some non-model organisms
  --metaeuk_parameters "--PARAM1=VALUE1,--PARAM2=VALUE2"
                        Pass additional arguments to Metaeuk for the first run. All arguments should be contained within a single string with no white space, with each argument separated by a comma.
  --metaeuk_rerun_parameters "--PARAM1=VALUE1,--PARAM2=VALUE2"
                        Pass additional arguments to Metaeuk for the second run. All arguments should be contained within a single string with no white space, with each argument separated by a comma.
  --offline             To indicate that BUSCO cannot attempt to download files
  --out_path OUTPUT_PATH
                        Optional location for results folder, excluding results folder name. Default is current working directory.
  -q, --quiet           Disable the info logs, displays only errors
  -r, --restart         Continue a run that had already partially completed.
  --scaffold_composition
                        Writes ACGTN content per scaffold to a file scaffold_composition.txt
  --tar                 Compress some subdirectories with many files to save space
  --update-data         Download and replace with last versions all lineages datasets and files necessary to their automated selection
  -v, --version         Show this version and exit

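3. Usage

The commands below are run in a conda environment.

Listing the datasets

Print the list of available lineage datasets, that is, the sets of single-copy genes shared within each lineage.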

busco --list-datasets

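Evaluating a whole genome (for higher plants)

BUSCO on a genome

#genome
busco -m geno -i Complete_iceplant_genome.fasta -o out_dir -l embryophyta_odb10 -c 30

Evaluating a whole set of transcripts (transcriptome, for higher plants)

BUSCO on a transcriptome

#transcriptome: -i must be a FASTA of transcript sequences (transcripts.fasta is a placeholder name)
#use a different -o for each run; BUSCO refuses to reuse an existing output name unless -f is given
busco -m tran -i transcripts.fasta -o out_dir_tran -l embryophyta_odb10 -c 30

Evaluating a whole set of amino acid sequences (protein, for higher plants)

BUSCO on amino acid sequences

#protein: -i must be a FASTA of amino acid sequences (proteins.fasta is a placeholder name)
busco -m prot -i proteins.fasta -o out_dir_prot -l embryophyta_odb10 -c 30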
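When all three data types are available, it can help to generate the three command lines from one place so that the lineage, thread count, and output names stay consistent. The sketch below only prints the commands. busco_cmd is a hypothetical helper, and genome.fasta, transcripts.fasta, and proteins.fasta are placeholder file names.

```bash
#!/usr/bin/env bash
LINEAGE=embryophyta_odb10
THREADS=30

# busco_cmd MODE INPUT: print the busco command for one run.
# The output name is derived from the mode so the runs do not collide.
busco_cmd() {
    echo "busco -m $1 -i $2 -o busco_$1 -l $LINEAGE -c $THREADS"
}

# Placeholder inputs: one FASTA per data type.
busco_cmd geno genome.fasta
busco_cmd tran transcripts.fasta
busco_cmd prot proteins.fasta
```

Once the printed lines look right, they can be pasted into the terminal or piped to sh.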

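4. Combining multiple BUSCO results

After assembling a new genome, you need to check how it compares in quality with existing genome sequences.
BUSCO ships with a script that combines multiple BUSCO results, so we use it to draw a combined plot.

Combining BUSCO results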

singularity exec /usr/local/biotools/b/busco\:5.5.0--pyhdfd78af_0 generate_plot.py -wd BUSCO_summaries/

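Preparation:

Place the BUSCO result files whose names begin with short_summary... in the BUSCO_summaries folder beforehand.

File names:

In short_summary.specific.embryophyta_odb10.[species name].txt, the [species name] part is the label for the species. Changing it sets the name shown in the plot.
For example, short_summary.specific.embryophyta_odb10.O_sativa.txt displays the species name in full, without cutting it off at "O". Note that "O.sativa" would be cut off after "O", because the period acts as a separator.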
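The preparation step can be scripted. The following sketch copies the short summary from a finished run into the plotting folder and swaps the last name field for a label of your choice, turning any period in the label into an underscore. collect_summary is a hypothetical helper, and it assumes the run name that BUSCO appended to the file name contains no periods.

```bash
#!/usr/bin/env bash
# collect_summary RUN_DIR LABEL DEST: copy short_summary.*.txt from RUN_DIR
# into DEST, replacing the final name field with LABEL ("." becomes "_").
collect_summary() {
    label=$(printf '%s' "$2" | tr '.' '_')
    mkdir -p "$3"
    for f in "$1"/short_summary.*.txt; do
        [ -e "$f" ] || continue   # nothing matched the pattern
        base=${f##*/}             # file name without the directory
        stem=${base%.txt}         # drop the extension
        prefix=${stem%.*}         # drop the run name field
        cp "$f" "$3/$prefix.$label.txt"
    done
}

# Example call (paths are placeholders):
# collect_summary out_dir O.sativa BUSCO_summaries
```

With the example call, a file named short_summary.specific.embryophyta_odb10.out_dir.txt would be copied as short_summary.specific.embryophyta_odb10.O_sativa.txt.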

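5. Conclusion

BUSCO is a very handy tool: checking the quality of your sequence data up front makes every downstream analysis more reliable. It is well worth learning to use it properly!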
