[Python] Predicting Prescribed Drugs from Patient Data, Part 2: Model Building and Prediction

Last updated at 2024-08-06 / Posted at 2024-08-06

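Introduction

This article is published as part of the Aidemy Premium curriculum, to fulfill a course completion requirement.
I took the data analysis course at Aidemy Premium starting in April 2024.
As the final project, I built a predictive model that selects a drug from the given patient data.

This page covers model building and prediction. (See Part 1 for the data exploration.)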

What you will learn from this article

The workflow of supervised learning (classification)

Intended audience

Beginners in data analysis

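Environment and dataset

OS version
- Windows 10 Pro 22H2
Implementation environment
- Google Colaboratory
Dataset
- [Kaggle] Drug.csv

Dataset contents

The dataset consists of patient information (age, sex, blood pressure, cholesterol level, sodium, potassium, and so on) and the drug prescribed by a physician. The first few rows are shown below.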

index	Age	Sex	BP	Cholesterol	Na	K	Drug
0	23	F	HIGH	HIGH	0.792535	0.031258	drugY
1	47	M	LOW	HIGH	0.739309	0.056468	drugC
2	47	M	LOW	HIGH	0.697269	0.068944	drugC
3	28	F	NORMAL	HIGH	0.563682	0.072289	drugX
4	61	F	LOW	HIGH	0.559294	0.030998	drugY
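The code in the following sections relies on three things prepared in the data exploration part: the DataFrame `df`, a `Na/K Ratio` column, and a fitted `label_encoder`. As a rough sketch only, that preparation might look like the code below. The ordinal mapping for `BP` and `Cholesterol` and the way the ratio is computed are assumptions for illustration, not the author's actual code; the five rows are the sample rows from the table above.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# The five sample rows shown above (the real Drug.csv has 200 rows).
df = pd.DataFrame({
    "Age": [23, 47, 47, 28, 61],
    "Sex": ["F", "M", "M", "F", "F"],
    "BP": ["HIGH", "LOW", "LOW", "NORMAL", "LOW"],
    "Cholesterol": ["HIGH", "HIGH", "HIGH", "HIGH", "HIGH"],
    "Na": [0.792535, 0.739309, 0.697269, 0.563682, 0.559294],
    "K": [0.031258, 0.056468, 0.068944, 0.072289, 0.030998],
    "Drug": ["drugY", "drugC", "drugC", "drugX", "drugY"],
})

# Assumption: the Na/K Ratio feature is sodium divided by potassium.
df["Na/K Ratio"] = df["Na"] / df["K"]

# Assumption: BP and Cholesterol are encoded as ordered integers.
level_map = {"LOW": 0, "NORMAL": 1, "HIGH": 2}
df["BP"] = df["BP"].map(level_map)
df["Cholesterol"] = df["Cholesterol"].map(level_map)

# Encode the target; classes_ keeps the original drug names for reports.
label_encoder = LabelEncoder()
df["Drug"] = label_encoder.fit_transform(df["Drug"])

print(df)
print(label_encoder.classes_)
```

With this layout, dropping `Sex` and `Drug` leaves six feature columns, which is consistent with the training set shape printed in the next section.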

Preparing the dataset

from sklearn.model_selection import train_test_split

# Drop Sex (not used for training this time) and the target variable Drug
drop_col = ['Sex', 'Drug']
X = df.drop(drop_col, axis=1)
y = df['Drug']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

print("Training set shape (X, y):", X_train.shape, y_train.shape)
print("Testing set shape (X, y):", X_test.shape, y_test.shape)

Training set shape (X, y): (160, 6) (160,)
Testing set shape (X, y): (40, 6) (40,)
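The split above is purely random, so with only 40 test samples a rare class can end up with very few test rows. As an optional variation (not what this article uses), `train_test_split` accepts a `stratify` argument that keeps class proportions the same in both subsets. The sketch below uses hypothetical labels `a`, `b`, `c` and a placeholder feature rather than the drug data.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 100 / 60 / 40 samples per class.
y_demo = np.array(["a"] * 100 + ["b"] * 60 + ["c"] * 40)
X_demo = np.arange(len(y_demo)).reshape(-1, 1)  # placeholder feature

# stratify=y_demo preserves the 50% / 30% / 20% class ratio in both subsets.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42, stratify=y_demo
)

classes, test_counts = np.unique(y_te, return_counts=True)
for cls, count in zip(classes, test_counts):
    print(cls, count)
```

Here the 40 test samples are divided 20 / 12 / 8 across the three classes, mirroring the proportions of the full label array.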

Standardizing the numeric features (Age and Na/K Ratio)

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

X_train[['Age', 'Na/K Ratio']] = scaler.fit_transform(X_train[['Age', 'Na/K Ratio']])
X_test[['Age', 'Na/K Ratio']] = scaler.transform(X_test[['Age', 'Na/K Ratio']])

print("Updated Training Data:")
display(X_train.head())
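To make the standardization step concrete, the sketch below checks on toy numbers what `StandardScaler` computes: `fit_transform` learns the mean and the population standard deviation from the training data, and `transform` reuses those same statistics on the test data. The values are hypothetical stand-ins for `Age` and `Na/K Ratio`, not rows from the dataset.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Hypothetical toy values standing in for Age and Na/K Ratio.
train = pd.DataFrame({"Age": [20.0, 40.0, 60.0], "Na/K Ratio": [10.0, 20.0, 30.0]})
test = pd.DataFrame({"Age": [40.0, 80.0], "Na/K Ratio": [20.0, 40.0]})

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from the training data
test_scaled = scaler.transform(test)        # reuses the training statistics

# Manual z-score using the population standard deviation (ddof=0).
mean = train.mean().to_numpy()
std = train.std(ddof=0).to_numpy()
manual_test = (test.to_numpy() - mean) / std

print(np.allclose(test_scaled, manual_test))
```

Fitting the scaler on the training data only, as the article does, keeps information about the test set out of the preprocessing step.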

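Building the prediction models

Now that the training data is ready, let's build the prediction models.
This time I compare three models: a decision tree, a random forest, and logistic regression.

Decision tree

Visualizing the decision tree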

import sys

import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeClassifier, plot_tree

# Raise the recursion depth limit to 10000 (Python's default is 1000)
sys.setrecursionlimit(10000)

# Drop the raw Na and K columns and keep Na/K Ratio
X_train_filtered = X_train.drop(['Na', 'K'], axis=1)
# Build the decision tree model
dt_model = DecisionTreeClassifier(random_state=42)
# Train the decision tree model
dt_model.fit(X_train_filtered, y_train)

# Visualize the decision tree
# (label_encoder is assumed to be defined in the data exploration part)
plt.figure(figsize=(15, 10))
plot_tree(dt_model, feature_names=list(X_train_filtered.columns), class_names=list(label_encoder.classes_), filled=True)
plt.title("Decision Tree Visualization")

# Save the visualized tree as a PDF
plt.savefig("decision_tree_visualization.pdf")
plt.show()

Predicting and evaluating on the test data

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
from sklearn.metrics import classification_report

# Predict on the test data
X_test_filtered = X_test.drop(['Na', 'K'], axis=1)
y_test_pred = dt_model.predict(X_test_filtered)

# Evaluate the model on the test data
print("\nEvaluation on Testing Data(DecisionTreeClassifier):")
print("Accuracy:", accuracy_score(y_test, y_test_pred))
print("Precision:", precision_score(y_test, y_test_pred, average='weighted'))
print("Recall:", recall_score(y_test, y_test_pred, average='weighted'))
print("F1 Score:", f1_score(y_test, y_test_pred, average='weighted'))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_test_pred))

print("\nEvaluation Report on Testing Data(DecisionTreeClassifier):")
print(classification_report(y_test, y_test_pred, target_names=label_encoder.classes_))

Output

Evaluation on Testing Data(DecisionTreeClassifier):
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 Score: 1.0
Confusion Matrix:
[[ 6  0  0  0  0]
 [ 0  3  0  0  0]
 [ 0  0  5  0  0]
 [ 0  0  0 11  0]
 [ 0  0  0  0 15]]

Evaluation Report on Testing Data(DecisionTreeClassifier):
              precision    recall  f1-score   support

       drugA       1.00      1.00      1.00         6
       drugB       1.00      1.00      1.00         3
       drugC       1.00      1.00      1.00         5
       drugX       1.00      1.00      1.00        11
       drugY       1.00      1.00      1.00        15

    accuracy                           1.00        40
   macro avg       1.00      1.00      1.00        40
weighted avg       1.00      1.00      1.00        40

Random forest

from sklearn.ensemble import RandomForestClassifier

# Build the random forest model
rf_model = RandomForestClassifier(random_state=42)
# Train the random forest model
rf_model.fit(X_train_filtered, y_train)

# Predict on the test data
y_test_pred_rf = rf_model.predict(X_test_filtered)

# Evaluate the model on the test data
print("\nEvaluation on Testing Data (RandomForestClassifier):")
print("Accuracy:", accuracy_score(y_test, y_test_pred_rf))
print("Precision:", precision_score(y_test, y_test_pred_rf, average='weighted'))
print("Recall:", recall_score(y_test, y_test_pred_rf, average='weighted'))
print("F1 Score:", f1_score(y_test, y_test_pred_rf, average='weighted'))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_test_pred_rf))

print("\nEvaluation Report on Testing Data (RandomForestClassifier):")
print(classification_report(y_test, y_test_pred_rf, target_names=label_encoder.classes_))

Output

Evaluation on Testing Data (RandomForestClassifier):
Accuracy: 1.0
Precision: 1.0
Recall: 1.0
F1 Score: 1.0
Confusion Matrix:
[[ 6  0  0  0  0]
 [ 0  3  0  0  0]
 [ 0  0  5  0  0]
 [ 0  0  0 11  0]
 [ 0  0  0  0 15]]

Evaluation Report on Testing Data (RandomForestClassifier):
              precision    recall  f1-score   support

       drugA       1.00      1.00      1.00         6
       drugB       1.00      1.00      1.00         3
       drugC       1.00      1.00      1.00         5
       drugX       1.00      1.00      1.00        11
       drugY       1.00      1.00      1.00        15

    accuracy                           1.00        40
   macro avg       1.00      1.00      1.00        40
weighted avg       1.00      1.00      1.00        40

Logistic regression

from sklearn.linear_model import LogisticRegression

# Build the logistic regression model
lr_model = LogisticRegression(random_state=42)
# Train the logistic regression model
lr_model.fit(X_train_filtered, y_train)

# Predict on the test data
y_test_pred_lr = lr_model.predict(X_test_filtered)

# Evaluate the model on the test data
print("\nEvaluation on Testing Data (Logistic Regression):")
print("Accuracy:", accuracy_score(y_test, y_test_pred_lr))
print("Precision:", precision_score(y_test, y_test_pred_lr, average='weighted'))
print("Recall:", recall_score(y_test, y_test_pred_lr, average='weighted'))
print("F1 Score:", f1_score(y_test, y_test_pred_lr, average='weighted'))
print("Confusion Matrix:")
print(confusion_matrix(y_test, y_test_pred_lr))

print("\nEvaluation Report on Testing Data (Logistic Regression):")
print(classification_report(y_test, y_test_pred_lr, target_names=label_encoder.classes_))

Output

Evaluation on Testing Data (Logistic Regression):
Accuracy: 0.975
Precision: 0.9785714285714284
Recall: 0.975
F1 Score: 0.9755305039787799
Confusion Matrix:
[[ 6  0  0  0  0]
 [ 0  3  0  0  0]
 [ 0  0  5  0  0]
 [ 0  0  0 11  0]
 [ 1  0  0  0 14]]

Evaluation Report on Testing Data (Logistic Regression):
              precision    recall  f1-score   support

       drugA       0.86      1.00      0.92         6
       drugB       1.00      1.00      1.00         3
       drugC       1.00      1.00      1.00         5
       drugX       1.00      1.00      1.00        11
       drugY       1.00      0.93      0.97        15

    accuracy                           0.97        40
   macro avg       0.97      0.99      0.98        40
weighted avg       0.98      0.97      0.98        40
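To see where the `average='weighted'` figures come from, the sketch below recomputes them by hand from the logistic regression confusion matrix printed above. It assumes scikit-learn's convention that rows are true classes and columns are predicted classes; the variable names are illustrative.

```python
import numpy as np

# Confusion matrix from the logistic regression output above
# (rows: true class, columns: predicted class; order drugA, B, C, X, Y).
cm = np.array([
    [6, 0, 0, 0, 0],
    [0, 3, 0, 0, 0],
    [0, 0, 5, 0, 0],
    [0, 0, 0, 11, 0],
    [1, 0, 0, 0, 14],
])

tp = np.diag(cm)            # correctly classified samples per class
support = cm.sum(axis=1)    # true samples per class
predicted = cm.sum(axis=0)  # predictions made per class

precision = tp / predicted
recall = tp / support
f1 = 2 * precision * recall / (precision + recall)

# Weighted average: each class contributes in proportion to its support.
weights = support / support.sum()
accuracy = tp.sum() / cm.sum()
weighted_precision = float(np.sum(weights * precision))
weighted_recall = float(np.sum(weights * recall))
weighted_f1 = float(np.sum(weights * f1))

print("Accuracy:", accuracy)
print("Precision:", weighted_precision)
print("Recall:", weighted_recall)
print("F1 Score:", weighted_f1)
```

The single drugY sample predicted as drugA lowers the precision of drugA to 6/7 and the recall of drugY to 14/15, which is the only source of the gap from 1.0 in the report.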

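Summary

With only 200 rows in the dataset, and therefore only 40 test samples, all three models scored highly overall. A single misclassification shifts accuracy by 2.5 points, so the gap between the models should be read with caution.
Because this is a multiclass classification task, the decision tree and random forest may be better suited than logistic regression, which is most often used for data with few classes such as binary classification.
Next time I would like to try analyzing a much larger dataset.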
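Since a single 40-sample test set gives a noisy comparison, one possible follow-up is to compare the same three model types with stratified k-fold cross-validation. The sketch below is an illustration on synthetic data from `make_classification`, not on the drug dataset, and `max_iter=1000` for logistic regression is an assumption added to help convergence.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 200 samples, 4 features, 5 classes.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=4, n_informative=4, n_redundant=0,
    n_classes=5, n_clusters_per_class=1, random_state=42,
)

models = {
    "DecisionTree": DecisionTreeClassifier(random_state=42),
    "RandomForest": RandomForestClassifier(random_state=42),
    "LogisticRegression": LogisticRegression(max_iter=1000, random_state=42),
}

# Each fold keeps the class proportions of the full dataset.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

results = {}
for name, model in models.items():
    scores = cross_val_score(model, X_demo, y_demo, cv=cv, scoring="accuracy")
    results[name] = scores
    print(f"{name}: mean={scores.mean():.3f}, std={scores.std():.3f}")
```

Reporting the mean and standard deviation over five folds makes it easier to judge whether a difference between models is larger than the fold-to-fold variation.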

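References

Machine Learning in Practice (Supervised Learning: Classification) | KIKAGAKU

Why Is the Random Seed 42? | Qiita

Decision Trees (DTs) | Qiita

[Python] Implementing a Random Forest for Machine Learning (Supervised Classification) | Haruの徒然Blog

[Python] A Must-Read If Recursion Gives You Errors! Recursion Depth and setrecursionlimit | Qiita

[sklearn] A Careful Guide to Using LabelEncoder | gotutiyan's blog

[For Beginners] Classification Evaluation Metrics in Machine Learning Explained | OPTiM TECH BLOG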
