Loading TogoVar into SQLite and searching it

Posted at 2019-07-22, last updated at 2019-07-25

This is a quick tip (mostly a note to myself).

TogoVar is a website that works like a dictionary of genetic variants found in Japanese people.

I looked for a Web API on the site but could not find one, so I decided to load the TSV files it distributes into SQLite instead.

Downloading the files

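Download the TSV files from the site and decompress them. There is one pair of files per chromosome with predictable names, so a command or a short script does the job.
I usually write Ruby, so mine looks like this: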

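# Download and decompress the per-chromosome TSV files.
base = "https://togovar.biosciencedbc.jp/public/release/current"
[*1..22, "X", "Y", "MT"].each do |chr|
  puts chr
  %w[frequency molecular_annotation].each do |kind|
    file = "chr_#{chr}_#{kind}.tsv.gz"
    system("wget", "#{base}/#{file}")   # fetch the gzipped TSV
    system("gunzip", file)              # leaves chr_<chr>_<kind>.tsv
  end
end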

SQLite

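Next, set up SQLite. Open (or create) the database file, then run the dot-commands that follow inside the sqlite3 shell.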

sqlite3 togo_var.sqlite

.separator "\t"

The .separator line above sets the tab character as the column separator.

.import some_nice.tsv some_nice_table

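The .import command above reads a TSV file into a table. When the target table does not exist yet, sqlite3 creates it from the header row, which produces the schema shown below. Every column type ends up as TEXT when a file is imported this way. If you want floating-point values and the like to be handled properly, you should probably declare INTEGER or REAL columns yourself. Here I prepared two tables, freq and anno.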

CREATE TABLE freq(
  "tgv_id" TEXT,
  "rs" TEXT,
  "variant_type" TEXT,
  "chr" TEXT,
  "position_grch37" TEXT,
  "ref" TEXT,
  "alt" TEXT,
  "symbol" TEXT,
  "jga_ngs_allele_alt" TEXT,
  "jga_ngs_allele_total" TEXT,
  "jga_ngs_alt_allele_freq" TEXT,
  "jga_ngs_genotype_alt_alt" TEXT,
  "jga_ngs_genotype_ref_alt" TEXT,
  "jga_ngs_genotype_ref_ref" TEXT,
  "jga_ngs_qc_status" TEXT,
  "jga_snp_allele_alt" TEXT,
  "jga_snp_allele_total" TEXT,
  "jga_snp_alt_allele_freq" TEXT,
  "jga_snp_genotype_alt_alt" TEXT,
  "jga_snp_genotype_ref_alt" TEXT,
  "jga_snp_genotype_ref_ref" TEXT,
  "jga_snp_qc_status" TEXT,
  "3.5kjpn_allele_alt" TEXT,
  "3.5kjpn_allele_total" TEXT,
  "3.5kjpn_alt_allele_freq" TEXT,
  "3.5kjpn_genotype_alt_alt" TEXT,
  "3.5kjpn_genotype_ref_alt" TEXT,
  "3.5kjpn_genotype_ref_ref" TEXT,
  "3.5kjpn_qc_status" TEXT,
  "hgvd_allele_alt" TEXT,
  "hgvd_allele_total" TEXT,
  "hgvd_alt_allele_freq" TEXT,
  "hgvd_genotype_alt_alt" TEXT,
  "hgvd_genotype_ref_alt" TEXT,
  "hgvd_genotype_ref_ref" TEXT,
  "hgvd_qc_status" TEXT,
  "exac_total_allele_alt" TEXT,
  "exac_total_allele_total" TEXT,
  "exac_total_alt_allele_freq" TEXT,
  "exac_total_genotype_alt_alt" TEXT,
  "exac_total_genotype_ref_alt" TEXT,
  "exac_total_genotype_ref_ref" TEXT,
  "exac_total_qc_status" TEXT,
  "exac_african_allele_alt" TEXT,
  "exac_african_allele_total" TEXT,
  "exac_african_alt_allele_freq" TEXT,
  "exac_african_genotype_alt_alt" TEXT,
  "exac_african_genotype_ref_alt" TEXT,
  "exac_african_genotype_ref_ref" TEXT,
  "exac_eastasian_allele_alt" TEXT,
  "exac_eastasian_allele_total" TEXT,
  "exac_eastasian_alt_allele_freq" TEXT,
  "exac_eastasian_genotype_alt_alt" TEXT,
  "exac_eastasian_genotype_ref_alt" TEXT,
  "exac_eastasian_genotype_ref_ref" TEXT,
  "exac_finnish_allele_alt" TEXT,
  "exac_finnish_allele_total" TEXT,
  "exac_finnish_alt_allele_freq" TEXT,
  "exac_finnish_genotype_alt_alt" TEXT,
  "exac_finnish_genotype_ref_alt" TEXT,
  "exac_finnish_genotype_ref_ref" TEXT,
  "exac_european_allele_alt" TEXT,
  "exac_european_allele_total" TEXT,
  "exac_european_alt_allele_freq" TEXT,
  "exac_european_genotype_alt_alt" TEXT,
  "exac_european_genotype_ref_alt" TEXT,
  "exac_european_genotype_ref_ref" TEXT,
  "exac_latino_allele_alt" TEXT,
  "exac_latino_allele_total" TEXT,
  "exac_latino_alt_allele_freq" TEXT,
  "exac_latino_genotype_alt_alt" TEXT,
  "exac_latino_genotype_ref_alt" TEXT,
  "exac_latino_genotype_ref_ref" TEXT,
  "exac_other_allele_alt" TEXT,
  "exac_other_allele_total" TEXT,
  "exac_other_alt_allele_freq" TEXT,
  "exac_other_genotype_alt_alt" TEXT,
  "exac_other_genotype_ref_alt" TEXT,
  "exac_other_genotype_ref_ref" TEXT,
  "exac_southasian_allele_alt" TEXT,
  "exac_southasian_allele_total" TEXT,
  "exac_southasian_alt_allele_freq" TEXT,
  "exac_southasian_genotype_alt_alt" TEXT,
  "exac_southasian_genotype_ref_alt" TEXT,
  "exac_southasian_genotype_ref_ref" TEXT
);
CREATE TABLE anno(
  "tgv_id" TEXT,
  "rs" TEXT,
  "chr" TEXT,
  "position_grch37" TEXT,
  "ref" TEXT,
  "alt" TEXT,
  "symbol" TEXT,
  "transcript_id" TEXT,
  "consequence" TEXT,
  "sift_qualitative_prediction" TEXT,
  "sift_value" TEXT,
  "polyphen2_qualitative_prediction" TEXT,
  "polyphen2_value" TEXT
);
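
As noted above, declaring numeric column types may be worthwhile. The following is a minimal sketch of how a typed CREATE TABLE statement could be generated from a header line. The method names guess_sqlite_type and create_table_sql are hypothetical, and the suffix rules are assumptions based only on the column names in the schema above, not on any official TogoVar specification.

```ruby
# Guess a SQLite column type from a TogoVar column name.
# ASSUMPTION: these suffix rules are inferred from the column names only.
def guess_sqlite_type(column)
  case column
  when /_freq\z/, /_value\z/
    "REAL"     # allele frequencies and prediction scores
  when /\Aposition_/, /_allele_(alt|total)\z/, /_genotype_(alt_alt|ref_alt|ref_ref)\z/
    "INTEGER"  # coordinates and counts
  else
    "TEXT"     # IDs, symbols, alleles, status strings
  end
end

# Build a CREATE TABLE statement from a tab-separated header line.
def create_table_sql(table, header_line)
  columns = header_line.chomp.split("\t")
  defs = columns.map { |c| "  \"#{c}\" #{guess_sqlite_type(c)}" }
  "CREATE TABLE #{table}(\n#{defs.join(",\n")}\n);"
end

# Example with a shortened, made-up header line.
header = %w[tgv_id rs position_grch37 jga_snp_allele_alt jga_snp_alt_allele_freq].join("\t")
puts create_table_sql("demo", header)
```

One caveat: an empty field is not a well-formed number, so SQLite would still store it as an empty string even in an INTEGER or REAL column. Queries on such columns may still need care.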

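Now that the schema is settled, import the files one after another. Note that the first line (the header) of each file is also inserted as a row when the table already exists, so those rows have to be deleted afterwards. Alternatively, removing the header from the TSV files before importing may be a good approach.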
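
For the second option, here is a minimal sketch that copies a TSV file without its first line. The method name strip_header and the demo file names are hypothetical, and it assumes the tables were already created with CREATE TABLE, so the import does not need the header.

```ruby
require "tmpdir"

# Copy a TSV file to dest, skipping its first (header) line.
# Reads line by line so a large file is never held in memory at once.
def strip_header(src, dest)
  File.open(src, "r") do |input|
    File.open(dest, "w") do |output|
      input.gets                                    # discard the header line
      input.each_line { |line| output.write(line) } # copy the data rows
    end
  end
end

# Tiny demonstration on a made-up two-line file.
Dir.mktmpdir do |dir|
  src  = File.join(dir, "demo.tsv")
  dest = File.join(dir, "demo_noheader.tsv")
  File.write(src, "tgv_id\trs\ntgv1\trs671\n")
  strip_header(src, dest)
  print File.read(dest)
end
```

If the headers are already in the database, deleting them afterwards with something like DELETE FROM freq WHERE tgv_id = 'tgv_id'; should also work, assuming no real row has that literal value.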

.import chr_1_frequency.tsv freq
.import chr_2_frequency.tsv freq
.import chr_3_frequency.tsv freq
.import chr_4_frequency.tsv freq
.import chr_5_frequency.tsv freq
.import chr_6_frequency.tsv freq
.import chr_7_frequency.tsv freq
.import chr_8_frequency.tsv freq
.import chr_9_frequency.tsv freq
.import chr_10_frequency.tsv freq
.import chr_11_frequency.tsv freq
.import chr_12_frequency.tsv freq
.import chr_13_frequency.tsv freq
.import chr_14_frequency.tsv freq
.import chr_15_frequency.tsv freq
.import chr_16_frequency.tsv freq
.import chr_17_frequency.tsv freq
.import chr_18_frequency.tsv freq
.import chr_19_frequency.tsv freq
.import chr_20_frequency.tsv freq
.import chr_21_frequency.tsv freq
.import chr_22_frequency.tsv freq
.import chr_X_frequency.tsv freq
.import chr_Y_frequency.tsv freq
.import chr_MT_frequency.tsv freq
.import chr_1_molecular_annotation.tsv anno
.import chr_2_molecular_annotation.tsv anno
.import chr_3_molecular_annotation.tsv anno
.import chr_4_molecular_annotation.tsv anno
.import chr_5_molecular_annotation.tsv anno
.import chr_6_molecular_annotation.tsv anno
.import chr_7_molecular_annotation.tsv anno
.import chr_8_molecular_annotation.tsv anno
.import chr_9_molecular_annotation.tsv anno
.import chr_10_molecular_annotation.tsv anno
.import chr_11_molecular_annotation.tsv anno
.import chr_12_molecular_annotation.tsv anno
.import chr_13_molecular_annotation.tsv anno
.import chr_14_molecular_annotation.tsv anno
.import chr_15_molecular_annotation.tsv anno
.import chr_16_molecular_annotation.tsv anno
.import chr_17_molecular_annotation.tsv anno
.import chr_18_molecular_annotation.tsv anno
.import chr_19_molecular_annotation.tsv anno
.import chr_20_molecular_annotation.tsv anno
.import chr_21_molecular_annotation.tsv anno
.import chr_22_molecular_annotation.tsv anno
.import chr_X_molecular_annotation.tsv anno
.import chr_Y_molecular_annotation.tsv anno
.import chr_MT_molecular_annotation.tsv anno
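
Typing out fifty .import lines by hand is tedious, so the list above can also be generated. This is a minimal sketch. The method name import_script and the output file name import.sql are hypothetical, and it assumes the decompressed TSV files sit in the current directory.

```ruby
# Build the sqlite3 dot-commands that import every chromosome file.
def import_script
  chromosomes = [*1..22, "X", "Y", "MT"]
  tables = { "frequency" => "freq", "molecular_annotation" => "anno" }

  # Single quotes keep \t as a literal backslash-t, which is what sqlite3 expects.
  lines = ['.separator "\t"']
  tables.each do |kind, table|
    chromosomes.each do |chr|
      lines << ".import chr_#{chr}_#{kind}.tsv #{table}"
    end
  end
  lines.join("\n") + "\n"
end

puts import_script
```

Saving the output to a file (for example import.sql) and feeding it to the shell with sqlite3 togo_var.sqlite < import.sql should run all the imports in one go.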

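Searching works at this point, but without indexes it is very slow.
From what I have read online, there is a knack to choosing good indexes.

Here I created a minimal set of indexes, as follows.
Building an index also takes a fair amount of time, so it seems reasonable to add one only once a column actually starts being queried a lot.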

create index freq_tgv_id_index on freq(tgv_id);
create index anno_tgv_id_index on anno(tgv_id);

create index freq_symbol_index on freq(symbol);
create index anno_symbol_index on anno(symbol);

create index freq_rs_index on freq(rs);
create index anno_rs_index on anno(rs);

create index freq_position_grch37_index on freq(position_grch37);
create index anno_position_grch37_index on anno(position_grch37);
create index freq_jga_ngs_alt_allele_freq_index on freq(jga_ngs_alt_allele_freq);
create index freq_jga_snp_alt_allele_freq_index on freq(jga_snp_alt_allele_freq);
create index "freq_3.5kjpn_alt_allele_freq_index" on freq("3.5kjpn_alt_allele_freq");

So, does this reproduce TogoVar's features on a local desktop? Not quite. Variant-disease associations come from ClinVar, and the TSV files that TogoVar distributes contain only variant frequency information. Please keep this in mind.

Trying a search

Let's search for the ALDH2 gene, well known in connection with alcohol drinking.

select count(*) from freq where symbol='ALDH2';

635

That is a lot of hits, so narrow them down to high-frequency variants.

select rs from freq where symbol='ALDH2' and jga_snp_alt_allele_freq > 0.2;

rs10744777
rs4767035
rs671
rs11066028
rs11066029
rs7296651

rs671 shows up, as expected.
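
One thing to watch with the frequency filter: because the imported columns are TEXT, my understanding is that SQLite compares the column with the number 0.2 as strings, byte by byte, rather than numerically. The sketch below illustrates the difference with plain Ruby strings, which also compare byte by byte. The sample values and the method names filter_as_text and filter_as_number are made up for illustration, and whether the TogoVar files actually contain values in scientific notation is an assumption I have not checked.

```ruby
# Keep values that are "greater" than the threshold when compared as strings.
def filter_as_text(values, threshold)
  values.select { |v| v > threshold.to_s }
end

# Keep values that are greater than the threshold when compared as numbers.
def filter_as_number(values, threshold)
  values.select { |v| v.to_f > threshold }
end

# Hypothetical frequency strings, including scientific notation and an empty field.
values = ["0.2534", "0.0012", "1.0E-4", ""]

p filter_as_text(values, 0.2)    # => ["0.2534", "1.0E-4"]
p filter_as_number(values, 0.2)  # => ["0.2534"]
```

If this turns out to matter, writing the condition as CAST(jga_snp_alt_allele_freq AS REAL) > 0.2, or declaring the column as REAL in the schema, would be the options to try.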

That is all for this article.

From here, putting this SQLite file on an SSD should let me run searches in bulk across multiple processes with Ruby's parallel gem...
