##Part 05
####Model building
This article series has become quite long, so I will not go into the details of the machine learning models or hyperparameter tuning here. Following the series as written should score more than 80% on the leaderboard. I will also point out some drawbacks of my approach and share tips for building better features to push the score past 83%.
Our data frame contains both the training data and the data we need to submit to Kaggle, so first we separate them. Then we split the data into independent and dependent variables. After that, we can use scikit-learn's train_test_split function to create training and testing sets. It's better to scale the data before feeding it to an ML algorithm; I used scikit-learn's StandardScaler.
train = combined_df[combined_df['Survived'].isnull() == False]
predict = combined_df[combined_df['Survived'].isnull() == True]
predict = predict.drop(columns=['Survived'])
X = train.drop(['Survived'], axis = 1)
y = train['Survived']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=100)
std_sc = StandardScaler()
X_train = std_sc.fit_transform(X_train)
X_test = std_sc.transform(X_test)
#Transform 'submission dataset'
predict = std_sc.transform(predict)
I'm going to build an ensemble, i.e., a combination of several machine learning algorithms whose outputs are combined into the final prediction. First, we get baseline results from several machine learning models. Then we select a few algorithms that perform well in this base step and tune their hyperparameters. StratifiedKFold is used for cross-validation so that each fold preserves the class distribution and the models generalize better.
kfold = StratifiedKFold(n_splits=10)
random_state = 42
classifiers = []
classifiers.append(LinearSVC(random_state=random_state))
classifiers.append(QuadraticDiscriminantAnalysis())
classifiers.append(DecisionTreeClassifier(random_state=random_state))
classifiers.append(GaussianNB())
classifiers.append(RandomForestClassifier(random_state=random_state))
classifiers.append(ExtraTreesClassifier(random_state=random_state))
classifiers.append(GradientBoostingClassifier(random_state = random_state))
classifiers.append(MLPClassifier(random_state=random_state))
classifiers.append(KNeighborsClassifier(n_neighbors=20))
classifiers.append(LogisticRegression(random_state=random_state))
classifiers.append(LinearDiscriminantAnalysis())
classifiers.append(BaggingClassifier(random_state=random_state))
classifiers.append(SGDClassifier(loss='log', random_state=random_state))
classifiers.append(NuSVC(probability=True, random_state=random_state))
classifiers.append(SVC(kernel='linear', probability=True, random_state=random_state))
names = []
cv_results = []
for classifier in classifiers:
    names.append(classifier.__class__.__name__)
    cv_results.append(cross_val_score(classifier, X_train, y=y_train, scoring='accuracy', cv=kfold, n_jobs=20))
cv_means = []
cv_std = []
for cv_result in cv_results:
    cv_means.append(cv_result.mean())
    cv_std.append(cv_result.std())
cv_res = pd.DataFrame(
    {'Mean_Accuracy': cv_means, 'Mean_Error': cv_std, 'Algorithm': names}
).sort_values(by="Mean_Accuracy", ascending=False)
# Use the sorted errors so the error bars line up with the sorted bars
g = sns.catplot(x='Mean_Accuracy', y='Algorithm', kind='bar', data=cv_res, orient='h', **{'xerr': cv_res['Mean_Error']})
g.fig.set_figwidth(20)
g.fig.set_figheight(6)
There are a few ways to do hyperparameter tuning:
- Grid Search
- Randomized Search (see the sketch below)
- Optimization libraries such as hyperopt
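For example, a randomized search samples a fixed number of parameter combinations instead of trying every one, which is cheaper when the grid is large. Below is a minimal, illustrative sketch using scikit-learn's RandomizedSearchCV on a random forest; the parameter ranges are my own assumptions, not the grids used later in this article.
from scipy.stats import randint
from sklearn.model_selection import RandomizedSearchCV
# Sketch only: sample 30 random combinations from the given distributions
rf_random = RandomizedSearchCV(
    RandomForestClassifier(random_state=random_state),
    param_distributions={
        'n_estimators': randint(10, 300),
        'max_depth': randint(3, 9),
        'min_samples_leaf': randint(1, 5)
    },
    n_iter=30, cv=kfold, scoring='accuracy',
    random_state=random_state, n_jobs=20)
rf_random.fit(X_train, y_train)
print(rf_random.best_params_, rf_random.best_score_)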
Here, I used GridSearchCV from scikit-learn. The plot_confusion function is a helper to plot the confusion matrix.
######Logistic Regression
def plot_confusion(predictions):
    y_pred = predictions
    cm = confusion_matrix(y_true=y_test, y_pred=y_pred)
    ax = plt.subplot()
    sns.heatmap(cm, annot=True, ax=ax, fmt='g')
    ax.set_xlabel('Predicted', fontsize=14)
    ax.xaxis.set_label_position('bottom')
    ax.xaxis.set_ticklabels(['Not Survived', 'Survived'], fontsize=12)
    ax.set_ylabel('True', fontsize=14)
    ax.yaxis.set_ticklabels(['Not Survived', 'Survived'], fontsize=12)
    plt.show()
#Logistic Regression
LogR = LogisticRegression(random_state=random_state)
LogR_params = {
    'C': [0.1, 0.5, 1, 10],
    'class_weight': [None, 'balanced'],
    'fit_intercept': [True, False],
    'max_iter': [200, 500, 800, 1000, 2000]
}
LogR_gs = GridSearchCV(LogR, param_grid=LogR_params, refit=True, cv=kfold, scoring='accuracy', n_jobs=20)
LogR_gs.fit(X_train, y_train)
best_LogR = LogR_gs.best_estimator_
predictions = best_LogR.predict(X_test)
print('The accuracy of the Logistic Regression: ', accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))
plot_confusion(predictions)
######NuSVC
# NuSVC
NuSVC_ = NuSVC(random_state=random_state)
NuSVC_params = {
    'nu': [0.1, 0.2, 0.5, 0.9],
    'probability': [True]
}
NuSVC_gs = GridSearchCV(NuSVC_, param_grid=NuSVC_params, refit=True, cv=kfold, scoring='accuracy', n_jobs=20)
NuSVC_gs.fit(X_train,y_train)
best_NuSVC = NuSVC_gs.best_estimator_
predictions = best_NuSVC.predict(X_test)
print('The accuracy of the NuSVC: ', accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))
plot_confusion(predictions)
######Gradient Boosting
# Gboost
Gboost = GradientBoostingClassifier(random_state=random_state)
Gboost_params = {
    'loss': ['deviance'],
    'n_estimators': [8, 10, 16, 20, 100, 200, 300],
    'learning_rate': [0.1, 0.05, 0.01],
    'max_depth': [3, 4, 5, 6, 7, 8],
    'min_samples_leaf': [100, 150],
    'max_features': [0.3, 0.1]
}
Gboost_gs = GridSearchCV(Gboost, param_grid=Gboost_params, refit=True, cv=kfold, scoring='accuracy', n_jobs=20)
Gboost_gs.fit(X_train,y_train)
best_Gboost = Gboost_gs.best_estimator_
predictions = best_Gboost.predict(X_test)
print('The accuracy of the Gradient Boosting: ', accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))
plot_confusion(predictions)
######Random Forest
# Random Forest
Rf = RandomForestClassifier(random_state=random_state)
Rf_params = {
    'max_depth': [5, 6, 7, 8],
    'max_features': [3, 4, 'sqrt'],
    'min_samples_split': [4, 5, 6],
    'min_samples_leaf': [2, 3],
    'n_estimators': [5, 10, 20, 80, 100, 200, 300],
    'criterion': ['gini']
}
Rf_gs = GridSearchCV(Rf, param_grid=Rf_params, refit=True, cv=kfold, scoring='accuracy', n_jobs=20)
Rf_gs.fit(X_train,y_train)
best_Rf = Rf_gs.best_estimator_
predictions = best_Rf.predict(X_test)
print('The accuracy of the Random Forest: ', accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))
plot_confusion(predictions)
######Linear Discriminant Analysis
# Lda
Lda = LinearDiscriminantAnalysis()
Lda_params = {
    'solver': ['svd', 'lsqr', 'eigen']
}
Lda_gs = GridSearchCV(Lda, param_grid=Lda_params, refit=True, cv=kfold, scoring='accuracy', n_jobs=20)
Lda_gs.fit(X_train,y_train)
best_Lda = Lda_gs.best_estimator_
predictions = best_Lda.predict(X_test)
print('The accuracy of the Linear Discriminant Analysis: ', accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))
plot_confusion(predictions)
######KNeighbors
# Knc
Knc = KNeighborsClassifier()
Knc_params = {
    'n_neighbors': np.linspace(1, 100, 100).astype('int'),
    'weights': ['uniform', 'distance'],
    'algorithm': ['auto', 'ball_tree']
}
Knc_gs = GridSearchCV(Knc, param_grid=Knc_params, refit=True, cv=kfold, scoring='accuracy', n_jobs=20)
Knc_gs.fit(X_train,y_train)
best_Knc = Knc_gs.best_estimator_
predictions = best_Knc.predict(X_test)
print('The accuracy of the KNeighbors: ', accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))
plot_confusion(predictions)
######Extra Trees
#ExtraTrees
Ext_t = ExtraTreesClassifier(random_state=random_state)
Ext_t_params = {
    'max_depth': [1, 2, 3],
    'max_features': ['sqrt'],
    'min_samples_split': [2, 3, 10],
    'min_samples_leaf': [1, 3, 10],
    'bootstrap': [False],
    'n_estimators': [10, 20, 100, 300, 400, 500],
    'criterion': ['gini', 'entropy']
}
Ext_t_gs = GridSearchCV(Ext_t, param_grid=Ext_t_params, refit=True, cv=kfold, scoring='accuracy', n_jobs=20)
Ext_t_gs.fit(X_train,y_train)
best_Ext_t = Ext_t_gs.best_estimator_
predictions = best_Ext_t.predict(X_test)
print('The accuracy of the Extra Trees: ', accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))
plot_confusion(predictions)
######Ensemble
I used the following tuned models for my ensemble. We can score more than 80% on the leaderboard with this method, and the score can be improved a bit further by spending more time in the feature engineering phase.
voting_clf = VotingClassifier(
    estimators=[
        ('Gboost', best_Gboost),
        ('Rf', best_Rf),
        ('Ext_t', best_Ext_t),
        ('Knc', best_Knc),
        ('Lda', best_Lda),
        ('LogR', best_LogR),
    ],
    voting='hard', n_jobs=20)
voting = voting_clf.fit(X_train, y_train)
predictions = voting.predict(X_test)
print('The accuracy of the Ensemble: ', accuracy_score(y_test, predictions))
print(classification_report(y_test, predictions))
plot_confusion(predictions)
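As a side note, every estimator in this ensemble implements predict_proba, so a soft-voting variant (averaging the predicted class probabilities instead of counting votes) is a small change that may be worth comparing against the hard-voting result; the snippet below is only a sketch I did not tune further.
# Sketch only: reuse the same tuned estimators with soft voting
soft_voting_clf = VotingClassifier(estimators=voting_clf.estimators, voting='soft', n_jobs=20)
soft_voting = soft_voting_clf.fit(X_train, y_train)
print('The accuracy of the soft-voting Ensemble: ', accuracy_score(y_test, soft_voting.predict(X_test)))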
# predict and save prediction to csv for submission
test_survived = pd.Series(voting.predict(predict).astype(int), name="Survived")
id_ = test_data['PassengerId']
results = pd.concat([id_, test_survived], axis=1)
results.to_csv("sub_prediction.csv",index=False)
#####Closing comments
Earlier, I used the derived titles to estimate the missing ages. But did you spot any mistakes or issues in that approach? If you wondered what happened to young and teenage girls, you are on the right track. Titles cannot separate them: 'Miss' covers everyone from infants to unmarried adult women, so some ages were misestimated in our data. To solve that, we can use a regression algorithm to predict the missing ages from the other features, as in the sketch below.
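A minimal sketch of that idea, assuming you go back to the point where Age still has missing values and the other engineered features are already numeric; the column selection here is only illustrative.
from sklearn.ensemble import RandomForestRegressor
# Sketch only: predict the missing ages from the remaining numeric features
age_features = [c for c in combined_df.columns if c not in ('Age', 'Survived')]
known_age = combined_df[combined_df['Age'].notnull()]
unknown_age = combined_df[combined_df['Age'].isnull()]
age_model = RandomForestRegressor(n_estimators=200, random_state=random_state)
age_model.fit(known_age[age_features], known_age['Age'])
combined_df.loc[combined_df['Age'].isnull(), 'Age'] = age_model.predict(unknown_age[age_features])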
- In the dataset, some passengers had more than one cabin. We ignored that fact and only used the deck letters. Could we use this information in a better way?
- Do identical or consecutive ticket numbers mean anything?
- What can we do with passengers' family names? Can we recognize family groups?
- In my approach, Is_Married only represents female passengers. How could we represent married male passengers?
There are more possibilities. If you think more deeply during the feature engineering phase, you can come up with even better features.
Table of contents
- Kaggle Titanic data set - Top 2% guide (Part 01)
- Kaggle Titanic data set - Top 2% guide (Part 02)
- Kaggle Titanic data set - Top 2% guide (Part 03)
- Kaggle Titanic data set - Top 2% guide (Part 04)
- Kaggle Titanic data set - Top 2% guide (Part 05)
*This article was written by @nuwan, a member of @qualitia_cdev.