##Part 05
####Model building
This article series has become quite long, so I will not go into the details of the machine learning models or hyperparameter tuning here. Following the series as written should score more than 80% on the leaderboard. I will also point out some drawbacks of my approach and share tips for building better features to push the score past 83%.
Our data frame contains both the training data and the data we need to submit to Kaggle, so first we separate them. Then we split the data into independent and dependent variables. After that, we can use scikit-learn's train_test_split function to create training and testing sets. It's better to scale the data before feeding it to an ML algorithm; I used scikit-learn's StandardScaler.
train = combined_df[combined_df['Survived'].isnull() == False]
predict = combined_df[combined_df['Survived'].isnull() == True]
predict = predict.drop(columns=['Survived'])
X = train.drop(['Survived'], axis = 1)
y = train['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)
std_sc = StandardScaler()
X_train = std_sc.fit_transform(X_train)
X_test = std_sc.transform(X_test)
#Transform 'submission dataset'
predict = std_sc.transform(predict)
I'm going to build an ensemble, i.e., a combination of several machine learning algorithms whose outputs are combined into the final prediction. First, we get baseline results from several machine learning models. Then we select a few algorithms that perform well in this base step and tune their hyperparameters. StratifiedKFold is used for cross-validation so that each fold preserves the class distribution and the models generalize better.
kfold = StratifiedKFold(n_splits=10)
random_state = 42
classifiers = []
classifiers.append(LinearSVC(random_state=random_state))
classifiers.append(QuadraticDiscriminantAnalysis())
classifiers.append(DecisionTreeClassifier(random_state=random_state))
classifiers.append(GaussianNB())
classifiers.append(RandomForestClassifier(random_state=random_state))
classifiers.append(ExtraTreesClassifier(random_state=random_state))
classifiers.append(GradientBoostingClassifier(random_state = random_state))
classifiers.append(MLPClassifier(random_state=random_state))
classifiers.append(KNeighborsClassifier(n_neighbors=20))
classifiers.append(LogisticRegression(random_state=random_state))
classifiers.append(LinearDiscriminantAnalysis())
classifiers.append(BaggingClassifier(random_state=random_state))
classifiers.append(SGDClassifier(loss='log', random_state=random_state))
classifiers.append(NuSVC(probability=True, random_state=random_state))
classifiers.append(SVC(kernel='linear', probability=True, random_state=random_state))
names = []
cv_results = []
for classifier in classifiers:
    names.append(classifier.__class__.__name__)
    cv_results.append(cross_val_score(classifier, X_train, y=y_train, scoring='accuracy', cv=kfold, n_jobs=20))
cv_means = []
cv_std = []
for cv_result in cv_results:
    cv_means.append(cv_result.mean())
    cv_std.append(cv_result.std())
cv_res = pd.DataFrame(
    {'Mean_Accuracy': cv_means, 'Mean_Error': cv_std, 'Algorithm': names}
).sort_values(by="Mean_Accuracy", ascending=False)
# Use the sorted errors so the error bars line up with the sorted bars
g = sns.catplot(x='Mean_Accuracy', y='Algorithm', kind='bar', data=cv_res, orient='h', **{'xerr': cv_res['Mean_Error']})
g.fig.set_figwidth(20)
g.fig.set_figheight(6)
There are a few ways to do hyperparameter tuning:
- Grid Search
- Randomized Search (see the sketch below)
- Optimization libraries such as hyperopt
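For example, a randomized search samples a fixed number of parameter combinations instead of trying every one, which is cheaper when the grid is large. Below is a minimal, illustrative sketch using scikit-learn's RandomizedSearchCV on a random forest; the parameter ranges are my own assumptions, not the grids used later in this article.
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
# Sketch only: sample 30 random combinations from the given distributions
rf_random = RandomizedSearchCV(
    RandomForestClassifier(random_state=random_state),
    param_distributions={
        'n_estimators': randint(10, 300),
        'max_depth': randint(3, 9),
        'min_samples_leaf': randint(1, 5)
    },
    n_iter=30, cv=kfold, scoring='accuracy',
    random_state=random_state, n_jobs=20)
rf_random.fit(X_train, y_train)
print(rf_random.best_params_, rf_random.best_score_)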
Here, I used GridSearchCV from scikit-learn. The plot_confusion function is a helper to plot the confusion matrix.
######Logistic Regression
def plot_confusion(predictions):
    y_pred = predictions
    cm = confusion_matrix(y_true=y_test, y_pred=y_pred)
    ax = plt.subplot()
    sns.heatmap(cm, annot=True, ax=ax, fmt='g')
    ax.set_xlabel('Predicted', fontsize=14)
    ax.xaxis.set_label_position('bottom')
    ax.xaxis.set_ticklabels(['Not Survived', 'Survived'], fontsize=12)
    ax.set_ylabel('True', fontsize=14)
    ax.yaxis.set_ticklabels(['Not Survived', 'Survived'], fontsize=12)
    plt.show()
#Logistic Regression
LogR = LogisticRegression(random_state=random_state)
LogR_params = {
    'C': [0.1, 0.5, 1, 10],
    'class_weight': [None, 'balanced'],
    'fit_intercept': [True, False],
    'max_iter': [200, 500, 800, 1000, 2000]
}
LogR_gs = GridSearchCV(LogR, param_grid=LogR_params, refit=True, cv=kfold, scoring='accuracy', n_jobs=20)
LogR_gs.fit(X_train, y_train)
best_LogR = LogR_gs.best_estimator_
predictions = best_LogR.predict(X_test)
print('The accuracy of the Logistic Regression: ', accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))
plot_confusion(predictions)
######NuSVC
# NuSVC
NuSVC_ = NuSVC(random_state=random_state)
NuSVC_params = {
    'nu': [0.1, 0.2, 0.5, 0.9],
    'probability': [True]
}
NuSVC_gs = GridSearchCV(NuSVC_, param_grid=NuSVC_params, refit=True, cv=kfold, scoring='accuracy', n_jobs=20)
NuSVC_gs.fit(X_train,y_train)
best_NuSVC = NuSVC_gs.best_estimator_
predictions = best_NuSVC.predict(X_test)
print('The accuracy of the NuSVC: ', accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))
plot_confusion(predictions)
######Gradient Boosting
# Gboost
Gboost = GradientBoostingClassifier(random_state=random_state)
Gboost_params = {
    'loss': ['deviance'],
    'n_estimators': [8, 10, 16, 20, 100, 200, 300],
    'learning_rate': [0.1, 0.05, 0.01],
    'max_depth': [3, 4, 5, 6, 7, 8],
    'min_samples_leaf': [100, 150],
    'max_features': [0.3, 0.1]
}
Gboost_gs = GridSearchCV(Gboost, param_grid=Gboost_params, refit=True, cv=kfold, scoring='accuracy', n_jobs=20)
Gboost_gs.fit(X_train,y_train)
best_Gboost = Gboost_gs.best_estimator_
predictions = best_Gboost.predict(X_test)
print('The accuracy of the Gradient Boosting: ', accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))
plot_confusion(predictions)
######Random Forest
# Random Forest
Rf = RandomForestClassifier(random_state=random_state)
Rf_params = {
    'max_depth': [5, 6, 7, 8],
    'max_features': [3, 4, 'sqrt'],
    'min_samples_split': [4, 5, 6],
    'min_samples_leaf': [2, 3],
    'n_estimators': [5, 10, 20, 80, 100, 200, 300],
    'criterion': ['gini']
}
Rf_gs = GridSearchCV(Rf, param_grid=Rf_params, refit=True, cv=kfold, scoring='accuracy', n_jobs=20)
Rf_gs.fit(X_train,y_train)
best_Rf = Rf_gs.best_estimator_
predictions = best_Rf.predict(X_test)
print('The accuracy of the Random Forest: ', accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))
plot_confusion(predictions)
######Linear Discriminant Analysis
# Lda
Lda = LinearDiscriminantAnalysis()
Lda_params = {
    'solver': ['svd', 'lsqr', 'eigen']
}
Lda_gs = GridSearchCV(Lda, param_grid=Lda_params, refit=True, cv=kfold, scoring='accuracy', n_jobs=20)
Lda_gs.fit(X_train,y_train)
best_Lda = Lda_gs.best_estimator_
predictions = best_Lda.predict(X_test)
print('The accuracy of the Linear Discriminant Analysis: ', accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))
plot_confusion(predictions)
######KNeighbors
# Knc
Knc = KNeighborsClassifier()
Knc_params = {
    'n_neighbors': np.linspace(1, 100, 100).astype('int'),
    'weights': ['uniform', 'distance'],
    'algorithm': ['auto', 'ball_tree']
}
Knc_gs = GridSearchCV(Knc, param_grid=Knc_params, refit=True, cv=kfold, scoring='accuracy', n_jobs=20)
Knc_gs.fit(X_train,y_train)
best_Knc = Knc_gs.best_estimator_
predictions = best_Knc.predict(X_test)
print('The accuracy of the KNeighbors: ', accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))
plot_confusion(predictions)
######Extra Trees
#ExtraTrees
Ext_t = ExtraTreesClassifier(random_state=random_state)
Ext_t_params = {
    'max_depth': [1, 2, 3],
    'max_features': ['sqrt'],
    'min_samples_split': [2, 3, 10],
    'min_samples_leaf': [1, 3, 10],
    'bootstrap': [False],
    'n_estimators': [10, 20, 100, 300, 400, 500],
    'criterion': ['gini', 'entropy']
}
Ext_t_gs = GridSearchCV(Ext_t, param_grid=Ext_t_params, refit=True, cv=kfold, scoring='accuracy', n_jobs=20)
Ext_t_gs.fit(X_train,y_train)
best_Ext_t = Ext_t_gs.best_estimator_
predictions = best_Ext_t.predict(X_test)
print('The accuracy of the Extra Trees: ', accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))
plot_confusion(predictions)
######Ensemble
I used the following tuned models for my ensemble. We can score more than 80% on the leaderboard with this method, and the score can be improved a bit further by spending more time in the feature engineering phase.
voting_clf = VotingClassifier(
    estimators=[
        ('Gboost', best_Gboost),
        ('Rf', best_Rf),
        ('Ext_t', best_Ext_t),
        ('Knc', best_Knc),
        ('Lda', best_Lda),
        ('LogR', best_LogR),
    ],
    voting='hard', n_jobs=20)
voting = voting_clf.fit(X_train, y_train)
predictions = voting.predict(X_test)
print('The accuracy of the Ensemble: ', accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))
plot_confusion(predictions)
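As a side note, every estimator in this ensemble implements predict_proba, so a soft-voting variant (averaging the predicted class probabilities instead of counting votes) is a small change that may be worth comparing against the hard-voting result; the snippet below is only a sketch I did not tune further.
# Sketch only: reuse the same tuned estimators with soft voting
soft_voting_clf = VotingClassifier(estimators=voting_clf.estimators, voting='soft', n_jobs=20)
soft_voting = soft_voting_clf.fit(X_train, y_train)
print('The accuracy of the soft-voting Ensemble: ', accuracy_score(y_test, soft_voting.predict(X_test)))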
# predict and save prediction to csv for submission
test_survived = pd.Series(voting.predict(predict).astype(int), name="Survived")
id_ = test_data['PassengerId']
results = pd.concat([id_, test_survived], axis=1)
results.to_csv("sub_prediction.csv",index=False)
#####Closing comments
Earlier, I used the derived titles to estimate the missing ages. But did you spot any mistakes or issues in that approach? If you wondered what happened to young and teenage girls, you are on the right track. Titles cannot separate them: 'Miss' covers everyone from infants to unmarried adult women, so some ages were misestimated in our data. To solve that, we can use a regression algorithm to predict the missing ages from the other features, as in the sketch below.
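A minimal sketch of that idea, assuming you go back to the point where Age still has missing values and the other engineered features are already numeric; the column selection here is only illustrative.
from sklearn.ensemble import RandomForestRegressor
# Sketch only: predict the missing ages from the remaining numeric features
age_features = [c for c in combined_df.columns if c not in ('Age', 'Survived')]
known_age = combined_df[combined_df['Age'].notnull()]
unknown_age = combined_df[combined_df['Age'].isnull()]
age_model = RandomForestRegressor(n_estimators=200, random_state=random_state)
age_model.fit(known_age[age_features], known_age['Age'])
combined_df.loc[combined_df['Age'].isnull(), 'Age'] = age_model.predict(unknown_age[age_features])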
- In the dataset, some passengers had more than one cabin. We ignored that fact and only used the deck letters. Could we use this information in a better way?
- Do identical or consecutive ticket numbers mean anything?
- What can we do with passengers' family names? Can we recognize family groups?
- In my approach, Is_Married only represents female passengers. How could we represent married male passengers?
There are more possibilities. If you think more deeply during the feature engineering phase, you can come up with even better features.
Table of contents
- Kaggle Titanic data set - Top 2% guide (Part 01)
- Kaggle Titanic data set - Top 2% guide (Part 02)
- Kaggle Titanic data set - Top 2% guide (Part 03)
- Kaggle Titanic data set - Top 2% guide (Part 04)
- Kaggle Titanic data set - Top 2% guide (Part 05)
*This article was written by @nuwan, a member of @qualitia_cdev.