Predicting Cancer vs. Non-Cancer with a Decision Tree

Posted at 2021-12-06

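Last time I compared expression levels with DESeq2 using TCGA data. This time I build a decision tree from per-gene expression data plus a label saying whether each sample is a tumor, and check whether cancer can be predicted from expression levels.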

1. Downloading the expression data and metadata

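First, following the procedure from the article before last, add the normalized expression count data (FPKM) for "LUAD" (Lung Adenocarcinoma) to the cart and download it.
Next, to get the metadata for the count data, click the "Cart" button at the top right to open the cart. In the cart, download the metadata file with the "Sample Sheet" button and save it as "gdc_sample_sheet_LUAD_htseq_fpkm.tsv".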

Figure: In the cart, download the metadata file with the "Sample Sheet" button.
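
Before building the table, it can help to check how many samples of each type the sheet contains. The sketch below counts the values in the eighth column (index 7), which is the column the script in the next section reads for the sample type. The header names and the four rows are a made-up excerpt used only for illustration; a real sheet comes from the cart.

```python
import csv
import io
from collections import Counter

# Hypothetical excerpt of a sample sheet (header names and rows are assumptions).
SAMPLE_SHEET = (
    "File ID\tFile Name\tData Category\tData Type\t"
    "Project ID\tCase ID\tSample ID\tSample Type\n"
    "dir-a\ta.FPKM.txt.gz\tc\tt\tTCGA-LUAD\tcase-1\tsample-1\tPrimary Tumor\n"
    "dir-b\tb.FPKM.txt.gz\tc\tt\tTCGA-LUAD\tcase-2\tsample-2\tPrimary Tumor\n"
    "dir-c\tc.FPKM.txt.gz\tc\tt\tTCGA-LUAD\tcase-2\tsample-3\tSolid Tissue Normal\n"
    "dir-d\td.FPKM.txt.gz\tc\tt\tTCGA-LUAD\tcase-3\tsample-4\tRecurrent Tumor\n"
)


def count_sample_types(handle):
    """Count the sample-type values found in column index 7."""
    reader = csv.reader(handle, delimiter="\t")
    next(reader)  # skip the header row
    return Counter(row[7] for row in reader)


counts = count_sample_types(io.StringIO(SAMPLE_SHEET))
for sample_type, n in counts.items():
    print(sample_type, n)
```

For a real file, pass `open("gdc_sample_sheet_LUAD_htseq_fpkm.tsv", newline="")` instead of the `io.StringIO` object. Any type other than "Primary Tumor" or "Solid Tissue Normal" is skipped by the script in the next section.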

2. Preparing the training data

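Using the files downloaded in step 1, and as in the previous article, build a TSV file that combines the expression levels with the tumor / non-tumor label taken from the metadata file.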

#!/usr/local/bin/miniconda3/bin/python
import csv
import pandas as pd
import copy

cancer_type = "LUAD"
metadata_path = "gdc_sample_sheet_" + cancer_type + "_htseq_fpkm.tsv"
counter = 0
name_list = []
type_list = []
df = None  # replaced by the first sample's table inside the loop
with open(metadata_path, newline='') as f_metadata:
    reader = csv.reader(f_metadata, delimiter='\t')
    header = next(reader)
    # col[0]/col[1] is the path to the count file; col[7] holds the sample type.
    for col in reader:
        count_file = col[0] + "/" + col[1]
        name = col[1].replace(".FPKM.txt.gz", "")
        if col[7] == "Primary Tumor":
            type_list.append("1")
        elif col[7] == "Solid Tissue Normal":
            type_list.append("0")
        else:
            continue

        df_tmp = pd.read_table(count_file, compression='gzip', names=["gene", name], index_col=0, header=None, sep='\t')
        if counter == 0:
            df = copy.deepcopy(df_tmp)
        else:
            df = df.join(df_tmp)
        counter += 1

# Remove the unneeded rows near the end whose names start with "__".
# Count, by index, how many rows come before the first such row.
col_list = df.index.tolist()
idx = 0
for col_name in col_list:
    if col_name.startswith("__"):
        break
    idx += 1
# Write out the table that merges the count table with the labels
tsv_name = cancer_type + "_fpkm_table.tsv"
df = df[0:idx]
df.loc[idx] = type_list
df = df.rename(index={idx:"cancer_type"})
df.to_csv(tsv_name, sep='\t')
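
The loop above joins one sample at a time and then trims the table by position. As an alternative sketch, not the method used in this article, the same table shape can be built with `pd.concat` and a boolean filter on the row names. The gene IDs, sample names, and values below are made up, and the labels are integers here rather than the strings used above.

```python
import pandas as pd

# Toy stand-ins for two per-sample count files (all names and values are made up).
gene_index = pd.Index(["gene1", "gene2", "__no_feature"], name="gene")
sample_a = pd.DataFrame({"sample_a": [1.5, 0.0, 10.0]}, index=gene_index)
sample_b = pd.DataFrame({"sample_b": [2.5, 3.0, 20.0]}, index=gene_index)
labels = [1, 0]  # 1 = Primary Tumor, 0 = Solid Tissue Normal

# Join all samples column-wise in a single call.
table = pd.concat([sample_a, sample_b], axis=1)

# Drop the rows whose name starts with "__".
table = table[~table.index.str.startswith("__")]

# Append the labels as a final row named "cancer_type".
table.loc["cancer_type"] = labels

print(table)
```

Filtering by name rather than by position also works when the "__" rows are not all at the end of the file.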

3. Building and evaluating the decision tree

Use Scikit-learn's decision tree together with cross_validate to estimate how well the tree generalizes. Accuracy, precision, and recall are evaluated as well.

from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_validate
import pandas as pd

df = pd.read_table("LUAD_fpkm_table.tsv", index_col="gene")
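# Genes are rows; the last row ("cancer_type") holds the labels.
df_train = df[0:len(df)-1]
df_target = df[len(df)-1:]

clf = DecisionTreeClassifier(max_depth=3)

scoring = ['accuracy', 'precision', 'recall']
# Transpose so that each sample is a row, as scikit-learn expects.
scores = cross_validate(clf, df_train.T, df_target.T, scoring=scoring)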
print("average accuracy is", scores['test_accuracy'].mean())
print("average precision is", scores['test_precision'].mean())
print("average recall is", scores['test_recall'].mean())
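
The transposition step is easy to miss, so here is a minimal sketch of the same layout on a toy table. Everything in it is an assumption made for illustration: the gene names, the eight samples, and the values are invented, and `cv=2` and `random_state=0` are chosen only to keep the tiny example deterministic (the script above uses the defaults).

```python
import pandas as pd
from sklearn.model_selection import cross_validate
from sklearn.tree import DecisionTreeClassifier

# Toy table in the TSV layout: genes as rows, samples as columns,
# labels in the last row. geneA separates the classes; geneB is unexpressed.
table = pd.DataFrame(
    {
        "s1": [0.1, 0.0, 0], "s2": [0.2, 0.0, 0],
        "s3": [0.3, 0.0, 0], "s4": [0.4, 0.0, 0],
        "s5": [5.1, 0.0, 1], "s6": [5.2, 0.0, 1],
        "s7": [5.3, 0.0, 1], "s8": [5.4, 0.0, 1],
    },
    index=["geneA", "geneB", "cancer_type"],
)

X = table.iloc[:-1].T                      # shape: samples x genes
y = table.loc["cancer_type"].astype(int)   # one label per sample

clf = DecisionTreeClassifier(max_depth=3, random_state=0)
scores = cross_validate(clf, X, y, cv=2,
                        scoring=["accuracy", "precision", "recall"])
print("average accuracy is", scores["test_accuracy"].mean())
```

Here the labels are passed as a one-dimensional Series, which is the shape most scikit-learn estimators document for `y`.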

The cross-validation on the LUAD table gave the results below: even a simple decision tree achieves high accuracy, precision, and recall.

average accuracy is 0.9814413901153681
average precision is 0.9907570950870273
average recall is 0.9887321460059955
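
For readers new to these three scores, the following sketch computes them by hand from true and predicted labels, treating label 1 (tumor) as the positive class. The function name and the six example labels are hypothetical; in the script above, Scikit-learn computes the scores on each cross-validation fold.

```python
def classification_metrics(y_true, y_pred, positive=1):
    """Return (accuracy, precision, recall) for binary labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn

    accuracy = (tp + tn) / len(pairs)                # share of correct predictions
    precision = tp / (tp + fp) if tp + fp else 0.0   # predicted tumors that are tumors
    recall = tp / (tp + fn) if tp + fn else 0.0      # true tumors that were found
    return accuracy, precision, recall


# Made-up labels: 1 = tumor, 0 = normal.
y_true = [1, 1, 1, 0, 0, 1]
y_pred = [1, 1, 0, 0, 1, 1]
accuracy, precision, recall = classification_metrics(y_true, y_pred)
print(accuracy, precision, recall)
```

When one class is much larger than the other, accuracy alone can look high, which is one reason to report precision and recall alongside it.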

As this shows, Scikit-learn lets you run machine learning in just a few dozen lines of code.
Please give it a try.

Reference:
A deep learning-based multi-model ensemble method for cancer prediction
