
Binary classification with logistic regression (using the Titanic problem as an example)

Posted at 2023-11-04

Using Kaggle's introductory "Titanic" problem as an example, this post gives sample code for binary classification with logistic regression.
Titanic problem:
https://www.kaggle.com/competitions/titanic

Logistic regression:
https://www.ibm.com/jp-ja/topics/logistic-regression

Missing-value imputation, categorical-variable handling, and feature engineering are all done only roughly here.
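For comparison, here is a minimal sketch of a slightly more careful preprocessing step. It is not what this post uses: the toy rows, the function name `preprocess_sketch`, and the choice of median/mode imputation plus one-hot encoding are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

# Toy rows shaped like a few Titanic columns (hypothetical values, not real passengers)
toy = pd.DataFrame({
    "Sex": ["male", "female", "female", "male", "female"],
    "Age": [22.0, np.nan, 30.0, 40.0, 30.0],
    "Embarked": ["S", "C", np.nan, "Q", "S"],
})

def preprocess_sketch(frame: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical helper: impute missing values and encode the categorical columns."""
    out = frame.copy()
    # Binary category: a 0/1 code is enough
    out["Sex"] = out["Sex"].map({"male": 0, "female": 1})
    # Numeric column: fill missing values with the median
    out["Age"] = out["Age"].fillna(out["Age"].median())
    # Nominal category: fill with the most frequent value, then one-hot encode
    out["Embarked"] = out["Embarked"].fillna(out["Embarked"].mode()[0])
    return pd.get_dummies(out, columns=["Embarked"], dtype=int)

processed = preprocess_sketch(toy)
print(processed)
```

One-hot encoding avoids implying an order among the ports (C < S < Q), which an integer code does. In a real workflow the fill values would normally be computed on the training data and reused for the test data.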

The code below builds a model from the data in train.csv.

# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

from sklearn.linear_model import LogisticRegressionCV
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import accuracy_score, confusion_matrix

# Load the CSV file
df = pd.read_csv("/kaggle/input/titanic/train.csv")

# Preprocessing: drop free-text columns, encode the categories as integers, drop rows with missing values
df1 = df.drop(['Name', 'Ticket', 'Cabin'], axis=1)
df1['Sex'] = df1['Sex'].map({'male': 0, 'female': 1})
df1['Embarked'] = df1['Embarked'].map({'C': 0, 'S': 1, 'Q': 2})
df1.dropna(inplace=True)

# Split into features and target
X = df1.iloc[:, 2:]             # Pclass through Embarked
y = df1.iloc[:, 1].astype(int)  # Survived, converted to an integer type

# Split the data into a training set and a test set
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)

# Standardize (fit on the training set only, then apply to the test set)
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Create the logistic regression model instance
logreg_cv = LogisticRegressionCV(cv=5, random_state=0, max_iter=1000)

# Train the model
logreg_cv.fit(X_train_scaled, y_train)

# Predict
y_pred = logreg_cv.predict(X_test_scaled)

# Evaluate
accuracy = accuracy_score(y_test, y_pred)
print(f'Accuracy: {accuracy}')
conf_matrix = confusion_matrix(y_test, y_pred)
print(f'Confusion Matrix:\n{conf_matrix}')
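`LogisticRegressionCV(cv=5, ...)` picks the regularization strength by 5-fold cross-validation. The sketch below shows one way to look at what was selected. It runs on synthetic data from `make_classification` rather than the Titanic data, and the `demo_` names are made up for this example.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegressionCV
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the 7 Titanic features (assumption: not the real data)
X_demo, y_demo = make_classification(
    n_samples=200, n_features=7, n_informative=4, random_state=0
)
X_demo_scaled = StandardScaler().fit_transform(X_demo)

demo_model = LogisticRegressionCV(cv=5, random_state=0, max_iter=1000)
demo_model.fit(X_demo_scaled, y_demo)

# Grid of candidate C values that was searched (10 values by default)
print("Candidate C values:", demo_model.Cs_)
# The C chosen by cross-validation
print("Selected C:", np.ravel(demo_model.C_)[0])
# One coefficient per standardized feature; sign and size hint at each feature's effect
print("Coefficients:", demo_model.coef_)
# Predicted probabilities for the first three rows (columns: class 0, class 1)
demo_proba = demo_model.predict_proba(X_demo_scaled[:3])
print("Probabilities:", demo_proba)
```

Applied to the model in this post, the same attributes would be read from `logreg_cv` after `fit`.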

The code below makes predictions on unseen data (test.csv).

# Load the test dataset
test_df = pd.read_csv("/kaggle/input/titanic/test.csv")

# Apply the same preprocessing to the test data
test_df_processed = test_df.drop(['Name', 'Ticket', 'Cabin'], axis=1)
test_df_processed['Sex'] = test_df_processed['Sex'].map({'male': 0, 'female': 1})
test_df_processed['Embarked'] = test_df_processed['Embarked'].map({'C': 0, 'S': 1, 'Q': 2})

# Missing values have to be handled here; rows cannot be dropped, so fill with column means
test_df_processed.fillna(test_df_processed.mean(), inplace=True)

# Apply standardization (using the scaler fitted on the training data)
X_test_processed = scaler.transform(test_df_processed.iloc[:, 1:])

# Run the prediction
test_predictions = logreg_cv.predict(X_test_processed)

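# Store the predictions in a Survived column
test_df['Survived'] = test_predictions

# Save the predictions as a CSV file; a Kaggle Titanic submission takes only PassengerId and Survived
test_df[['PassengerId', 'Survived']].to_csv("/kaggle/working/submission.csv", index=False)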

The accuracy was 77%. Since the feature engineering is rough, I think that is a reasonable result.
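As a reminder of how the two metrics printed in the evaluation step relate, here is a small sketch that recomputes accuracy from a confusion matrix. The label arrays are hypothetical toy values, not the results of this post.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Hypothetical true labels and predictions (10 toy samples)
y_true_toy = np.array([0, 0, 0, 0, 1, 1, 1, 1, 1, 0])
y_pred_toy = np.array([0, 0, 1, 0, 1, 1, 0, 1, 1, 0])

# For binary labels, scikit-learn lays the matrix out as [[TN, FP], [FN, TP]]
cm = confusion_matrix(y_true_toy, y_pred_toy)
tn, fp, fn, tp = cm.ravel()

# Accuracy is the share of samples on the diagonal
accuracy_from_cm = (tn + tp) / cm.sum()

print(cm)
print("Accuracy from confusion matrix:", accuracy_from_cm)
print("accuracy_score:", accuracy_score(y_true_toy, y_pred_toy))
```

The off-diagonal cells show whether the errors are mostly false positives or false negatives, which a single accuracy figure hides.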
