Survival Prediction on the Titanic Dataset: Kaggle Challenge


1. Introduction

The entire code can be found in the appendix.

1.1 What is Kaggle?

Kaggle is "The Home of Data Science & Machine Learning".
Kaggle is one of the best place for learning data science. Recently, data science has been growing in popularity in Japan, and many companies are searching for the people who is familier with data science. For those who want to work as data scientists, Kaggle is the perfect place. It offers a wide range of datasets to explore, and provides opportunities to participate in competitions to apply and enhance data science knowledge.

1.2 Overview of the Titanic Dataset

The Titanic dataset consists of three kinds of data.

1: training set

This data set should be used to build your machine learning models.

2: test set

This data set should be used to see how well your model performs on unseen data.

3: gender_submission.csv

This data set is an example of what a submission file should look like.

1.3 Project Objectives and Overview

The goal is to build a predictive model that answers the question "What sorts of people were more likely to survive?" using passenger data.

2. Data Preprocessing

2.1 Loading and Inspecting the Data

I will introduce the Python code step by step.

First, we load the training data.

import pandas as pd

# load the Kaggle training set; replace the path with your local copy of train.csv
train_data = pd.read_csv("put in your local path")
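Before moving on, it helps to inspect what we just loaded. The following is a minimal sketch (not part of the original pipeline) for checking the size, column types, and missing-value counts of the training data.

# quick inspection: dimensions, column dtypes, missing values per column, first rows
print(train_data.shape)
print(train_data.dtypes)
print(train_data.isnull().sum())
print(train_data.head())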

2.2 Handling Missing Values

Next, we handle missing values. In this example, I focus on handling the missing values in the "Age" column. We fill the missing values with the median value of the column.

train_data["Age"] = train_data["Age"].fillna(train_data["Age"].median())

2.3 Feature Engineering

Feature engineering is important for machine learning models. We need to convert categorical features (like strings) into numerical values.

mapping1 = {"male": 0, "female": 1}
train_data["Sex_num"] = train_data["Sex"].map(mapping1)

mapping2 = {"S": 0, "C": 1, "Q": 2}
train_data["Embarked_num"] = train_data["Embarked"].map(mapping2)

3. Model Building

3.1 Selection of Algorithms

In this example, I used the SVM (Support Vector Machine) algorithm to build the model.
What is SVM? In short, an SVM searches for the decision boundary (hyperplane) that separates the classes with the largest possible margin, and kernel functions (linear, RBF, polynomial, sigmoid) allow it to learn non-linear boundaries.
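As a minimal, self-contained sketch (toy data made up for illustration, not the Titanic features), training and using an SVM classifier with scikit-learn looks like this:

from sklearn.svm import SVC

# toy example: the class is determined by the first feature
X_toy = [[0, 0], [0, 1], [1, 0], [1, 1]]
y_toy = [0, 0, 1, 1]

clf = SVC(kernel="linear", C=1.0)
clf.fit(X_toy, y_toy)
print(clf.predict([[0.9, 0.2]]))  # expected: class 1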

3.2 Model Training and Hyperparameter Tuning

We need to choose the features used to train the model.
Additionally, feature scaling is very important when using SVM, as the algorithm is sensitive to the scale of the data.

from sklearn.preprocessing import StandardScaler

# choose the features used for training
X = train_data[["Pclass", "Sex_num", "Age", "SibSp", "Parch", "Fare", "Embarked_num"]]
y = train_data["Survived"]

# scale the features using StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

Next, we should tune the hyperparameters to maximize the performance of the model.
In this example, I used grid search to find the best hyperparameters.

import numpy as np

# define the grid of candidate hyperparameters for the grid search
param_grid = {
    "C": np.logspace(-4, 4, 9),
    "kernel": ["linear", "rbf", "poly", "sigmoid"],
    "gamma": np.logspace(-4, 1, 6),
    "degree": [2, 3, 4, 5, 6],
    "coef0": [0, 1, 10, 100],
    "shrinking": [True, False],
    "tol": [1e-3, 1e-4, 1e-5],
    "class_weight": [None, "balanced", {0: 1, 1: 2}],
    "decision_function_shape": ["ovo", "ovr"],
    "max_iter": [1000, 5000, 10000],
    "probability": [True, False]
}

Grid search can be computationally expensive, especially with a large parameter grid.
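The search itself is the same call used in the appendix: cv=5 runs 5-fold cross-validation for every parameter combination, and n_jobs=-1 uses all available CPU cores.

from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV

# run 5-fold cross-validated grid search over the parameter grid defined above
grid_search = GridSearchCV(SVC(random_state=42), param_grid, cv=5, scoring="accuracy", n_jobs=-1, verbose=2)
grid_search.fit(X_scaled, y)

# keep the estimator with the best cross-validation accuracy
best_model = grid_search.best_estimator_
print(grid_search.best_params_)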

4. Results

4.1 Evaluation

I use the Kaggle public leaderboard score to evaluate model accuracy.
The public score is 0.78468 (rank 2102 / 13243), which is not particularly high.

4.2 Areas for Model Improvement

This model could be improved with the following methods.

1. Feature Engineering

We can generate additional features.
For example: family size (SibSp + Parch) or fare groups (binning the fare values); a short sketch follows after this list.

2. Using other scalers

For example, MinMaxScaler or RobustScaler.

3. Using other cross-validation schemes

For example, StratifiedKFold, or a larger number of folds.

4. Using other evaluation metrics

For example, precision, recall, or F1-score.
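As a rough sketch of the first idea (my own illustration, not part of the submitted pipeline), the extra features could be built like this:

# family size: siblings/spouses + parents/children + the passenger themselves
train_data["FamilySize"] = train_data["SibSp"] + train_data["Parch"] + 1

# fare group: bin the fare into 4 quantile-based categories (0-3)
train_data["FareGroup"] = pd.qcut(train_data["Fare"], 4, labels=False)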

5. Conclusion

I have introduced one example of building a model for the Titanic Kaggle challenge.

Appendix

import pandas as pd
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import StandardScaler

# read the training data; replace the path with your local copy of train.csv
train_data = pd.read_csv("put in your local path")

# preprocess the training data
train_data = train_data.dropna(subset=["Embarked"])
train_data["Age"] = train_data["Age"].fillna(train_data["Age"].median())

mapping1 = {"male": 0, "female": 1}
train_data["Sex_num"] = train_data["Sex"].map(mapping1)

mapping2 = {"S": 0, "C": 1, "Q": 2}
train_data["Embarked_num"] = train_data["Embarked"].map(mapping2)

# choose the features used for training
X = train_data[["Pclass", "Sex_num", "Age", "SibSp", "Parch", "Fare", "Embarked_num"]]
y = train_data["Survived"]

# scale the features using StandardScaler
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# search for the best hyperparameters using grid search
param_grid = {
    "C": np.logspace(-4, 4, 9),
    "kernel": ["linear", "rbf", "poly", "sigmoid"],
    "gamma": np.logspace(-4, 1, 6),
    "degree": [2, 3, 4, 5, 6],
    "coef0": [0, 1, 10, 100],
    "shrinking": [True, False],
    "tol": [1e-3, 1e-4, 1e-5],
    "class_weight": [None, "balanced", {0: 1, 1: 2}],
    "decision_function_shape": ["ovo", "ovr"],
    "max_iter": [1000, 5000, 10000],
    "probability": [True, False]
}

grid_search = GridSearchCV(SVC(random_state=42), param_grid, cv=5, scoring="accuracy", n_jobs=-1, verbose=2)
grid_search.fit(X_scaled, y)

best_model = grid_search.best_estimator_

# read the test data; replace the path with your local copy of test.csv
test_set = pd.read_csv("put in your local path")

# preprocess the test data
test_set["Sex_num"] = test_set["Sex"].map(mapping1)
test_set["Embarked_num"] = test_set["Embarked"].map(mapping2)
test_set["Age"] = test_set["Age"].fillna(train_data["Age"].median())
test_set["Fare"] = test_set["Fare"].fillna(train_data["Fare"].mode()[0])

# choose the same features as for training
X_test = test_set[["Pclass", "Sex_num", "Age", "SibSp", "Parch", "Fare", "Embarked_num"]]
X_test_scaled = scaler.transform(X_test)

# make predictions on the scaled test features
y_pred = best_model.predict(X_test_scaled)

output_path = "put in your local path"  # where to save the submission csv
output = pd.DataFrame({"PassengerId": test_set["PassengerId"], "Survived": y_pred})
output.to_csv(output_path, index=False)

print("make csvfile")
