
Kaggle Titanic data set - Top 2% guide (Part 04)

Posted at 2020-03-11

Part 04

Data analyzing and feature engineering : cont.

Deck

We derived the Deck feature from the Cabin feature: we removed the digits from the Cabin value and kept only the leading letter.
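As a minimal sketch of that derivation (the placeholder 'M' for missing cabins is an assumption for illustration, not necessarily what the earlier parts used):

```python
import pandas as pd

# Illustrative rows standing in for the real Cabin column
df = pd.DataFrame({'Cabin': ['C85', None, 'E46', 'B28']})

# Deck = first letter of Cabin; missing cabins get a placeholder letter
df['Deck'] = df['Cabin'].fillna('M').map(lambda c: c[0])
print(df['Deck'].tolist())  # ['C', 'M', 'E', 'B']
```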

titanic_demo.py
data = combined_df[combined_df["Survived"].notnull()]
fig, ax = plt.subplots(2, 2, figsize=(16, 10))

# catplot is figure-level and ignores ax; the axes-level barplot
# draws directly on the target axes, so no extra figures to close
sns.barplot(x='Deck', y='Survived', data=data, ci=None, ax=ax[0, 0])
sns.barplot(x='Deck', y='Survived', hue='Sex', data=data, ci=None, ax=ax[0, 1])
sns.barplot(x='Deck', y='Survived', hue='Pclass', data=data, ci=None, ax=ax[1, 0])
sns.barplot(x='Deck', y='Survived', hue='Embarked', data=data, ci=None, ax=ax[1, 1])

for i in range(2):
    for j in range(2):
        ax[i, j].set_ylabel('Survival Probability')

ax[0, 1].legend(loc='upper left', title='Sex')
ax[1, 0].legend(loc='upper right', title='Pclass')
ax[1, 1].legend(loc='upper left', title='Embarked')

fig.tight_layout()
plt.show()

(Figure: survival probability by Deck, overall and split by Sex, Pclass, and Embarked)
According to the graphs, survivability clearly depends on the deck. This may be related to the locations of the decks or to the characteristics of the passengers who occupied them. Deck B has the highest survivability, at more than 70%. Decks C and D have around 60% survivability, while all other decks are at or below 40%; decks E and G have the lowest, around 25%. Decks A, B, C, and T were used only by first-class passengers, and among them survivors outnumber non-survivors. Deck G was used only by third-class passengers, and Deck F only by second- and third-class passengers.
An interesting fact is that female passengers on decks A, B, and D have 100% survivability. Looking carefully, we can also see that female passengers on deck G have lower survivability than male passengers on decks A, B, and C. Even though women and children were given priority during the evacuation, the women on Deck G seem to have been neglected. Deck E was shared by all passenger classes, which helps clarify that the higher classes got better chances than the others: on Deck E, around 70% of first-class passengers survived, while around 80% of third-class passengers did not.
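The per-deck rates read off the graphs can be cross-checked with a simple groupby. A toy sketch (the rows below are made up for illustration, not the real dataset):

```python
import pandas as pd

# Illustrative stand-in for combined_df with Deck and Survived columns
toy = pd.DataFrame({
    'Deck':     ['B', 'B', 'B', 'C', 'C', 'G', 'G', 'G', 'G'],
    'Survived': [1,   1,   0,   1,   0,   0,   0,   1,   0],
})

# Mean of the 0/1 Survived column per deck = survival rate per deck
rates = toy.groupby('Deck')['Survived'].mean()
print(rates.to_dict())
```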

SibSp and Parch
titanic_demo.py
data = combined_df[combined_df["Survived"].notnull()]
fig, ax = plt.subplots(1, 2, figsize=(16, 6))

# Axes-level pointplot draws directly on the target axes
sns.pointplot(x='SibSp', y='Survived', data=data, ax=ax[0])
sns.pointplot(x='Parch', y='Survived', data=data, ax=ax[1])

for i in range(2):
    ax[i].set_ylabel('Survival Probability')

fig.tight_layout()
plt.show()

(Figure: survival probability vs SibSp and vs Parch)
The more siblings and spouses a passenger traveled with, the lower that passenger's survivability. Passengers who traveled alone, or with fewer than 2 siblings/spouses, have higher survivability, and if SibSp is greater than 4, survivability drops to 0. The second graph gives a similar impression. So small families have higher survivability than single travelers, medium families, and large families.
Let's create another feature named Total_family from SibSp and Parch, and then plot it.

titanic_demo.py
combined_df['Total_family'] = combined_df['Parch'] + combined_df['SibSp'] + 1
data = combined_df[combined_df['Survived'].notnull()]
g = sns.catplot(x='Total_family', y='Survived', data=data, kind='point')
g = g.set_ylabels('Survival Probability')
g.fig.set_figwidth(12)
g.fig.set_figheight(4)

(Figure: survival probability vs Total_family)

Great. We can see that small family sizes (2, 3, and 4) have the highest survivability, followed by single travelers.
Using this, we can create family-size classes: single traveler, small family, medium family, and large family.

titanic_demo.py
combined_df['Single'] = combined_df['Total_family'].map(lambda x: 1 if x==1 else 0)
combined_df['Small_family'] = combined_df['Total_family'].map(lambda x: 1 if 2 <= x <=4 else 0)
combined_df['Medium_family'] = combined_df['Total_family'].map(lambda x: 1 if 5 <= x <=7 else 0)
combined_df['Large_family'] = combined_df['Total_family'].map(lambda x: 1 if 7 < x else 0)
data = combined_df[combined_df['Survived'].notnull()]

Title

We obtained the Title feature from the Name feature. Let's check the survivability distribution among the titles.

titanic_demo.py
g = sns.catplot(x='Pclass',y='Survived', col='Title',data=data, kind='point')
g.fig.set_figwidth(16)
g.fig.set_figheight(4)

(Figure: survival probability by Pclass for each Title)
Interestingly, passengers with the title Master in 1st or 2nd class have 100% survivability, but in 3rd class it falls below the survivability of 1st- and 2nd-class passengers with female titles. The title Mr has a low probability of survival in every class. If a passenger belongs to 1st or 2nd class and has the title Master, Miss, or Mrs, there is a high probability of survival.
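One way to tabulate the Title-by-Pclass survival rates behind this plot is a pivot table. A toy sketch with made-up rows, purely for illustration:

```python
import pandas as pd

# Illustrative stand-in for the real Title/Pclass/Survived columns
toy = pd.DataFrame({
    'Title':    ['Master', 'Master', 'Mr', 'Mr'],
    'Pclass':   [1, 3, 1, 3],
    'Survived': [1, 0, 0, 0],
})

# Mean survival rate for each Title x Pclass cell
pivot = toy.pivot_table(values='Survived', index='Title', columns='Pclass')
print(pivot)
```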

Fare
titanic_demo.py
g1 = sns.catplot(x='Fare', y='Pclass', hue='Survived', data=data, kind='box', orient="h")
g1.fig.set_figwidth(20)
g2 = sns.catplot(x='Fare', y='Embarked', hue='Survived', data=data, kind='box', orient="h")
g2.fig.set_figwidth(20)

(Figure: Fare distributions by Pclass and by Embarked, split by Survived)
Interestingly, some people got a free pass for the voyage, while some paid more than $500. We can see a clear price gap for first-class tickets, whereas there doesn't seem to be much difference between 2nd- and 3rd-class ticket prices. Passengers who embarked at ports S and C spent more money than those from port Q. Since the fare covers the whole family, we cannot draw further conclusions without involving the number of family members.
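One way to act on that last observation would be a per-person fare, using the Total_family definition from earlier (SibSp + Parch + 1). A minimal sketch with illustrative numbers; `Fare_per_person` is a hypothetical feature name, not one the article builds:

```python
import pandas as pd

# Illustrative rows: a family of four on one fare, and a solo traveler
df = pd.DataFrame({'Fare': [100.0, 30.0], 'SibSp': [1, 0], 'Parch': [2, 0]})

# Total_family as defined in the article, then fare per family member
df['Total_family'] = df['SibSp'] + df['Parch'] + 1
df['Fare_per_person'] = df['Fare'] / df['Total_family']
print(df['Fare_per_person'].tolist())  # [25.0, 30.0]
```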

Embarked
titanic_demo.py
fig, ax = plt.subplots(2,2,figsize=(20,15))
sns.countplot(y='Embarked', data=data, ax=ax[0,0])
ax[0,0].set_title('# Passengers')

sns.countplot(y='Embarked', hue='Sex', data=data, ax=ax[0,1])
ax[0,1].set_title('Gender vs Embarked')

sns.countplot(y='Embarked', hue='Survived', data=data, ax=ax[1,0])
ax[1,0].set_title('Embarked vs Survived')

sns.countplot(y='Embarked', hue='Pclass', data=data, ax=ax[1,1])
ax[1,1].set_title('Embarked vs Pclass')

for i in range(2):
    for j in range(2):
        ax[i,j].set_xlabel('Number of passengers')

plt.show()

(Figure: passenger counts by Embarked, split by Sex, Survived, and Pclass)
Most of the passengers embarked at port S, where male passengers outnumber female passengers by more than two to one and the majority belong to 3rd class. Port Q has the smallest number of passengers, with very few 1st- and 2nd-class travelers; almost everyone who embarked at port Q bought a 3rd-class ticket. More than 50% of the passengers who embarked at port C survived.

Married vs not married
titanic_demo.py
combined_df['Is_Married'] = combined_df['Title'].map(lambda x: 1 if x == 'Mrs' else 0)

data = combined_df.loc[combined_df['Survived'].notnull()]
g = sns.catplot(x='Pclass',y='Survived', hue='Is_Married', data=data, kind='point')
g.fig.set_figwidth(16)
g.fig.set_figheight(4)

(Figure: survival probability by Pclass, split by Is_Married)
Now it's clear that married passengers have better survival chances than unmarried ones, so we can use this feature for our model.

It's safe to drop the features that we don't want to use and the features we no longer need for deriving others. I will drop the PassengerId, Name, Ticket, Cabin, Total_family, Age, and Fare features. The Ticket and Name features could be used to derive more features, but in this tutorial I decided to drop them for simplicity.

titanic_demo.py
combined_df.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin', 'Total_family', 'Age', 'Fare'], inplace=True)

Now it's time to transform the categorical variables, because machine learning algorithms work only with numbers. There are several techniques available:

  • pandas.get_dummies
  • Sklearn LabelEncoder
  • Sklearn OneHotEncoder

Here I just used pandas get_dummies.
titanic_demo.py
combined_df = pd.get_dummies(combined_df, columns = ['Age_class'], prefix='Age_cls')
combined_df = pd.get_dummies(combined_df, columns = ['Deck'], prefix='Deck')
combined_df = pd.get_dummies(combined_df, columns = ['Embarked'], prefix='Emb')
combined_df = pd.get_dummies(combined_df, columns = ['Sex'], prefix='Sex')
combined_df = pd.get_dummies(combined_df, columns = ['Pclass'], prefix='P_class')
combined_df = pd.get_dummies(combined_df, columns = ['Title'], prefix='Title')
combined_df = pd.get_dummies(combined_df, columns = ['Fare_groups'], prefix='Fare_groups')
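To see concretely what get_dummies produces, here is a minimal sketch on a single toy column (the 'Emb' prefix matches the article's convention; the rows are illustrative):

```python
import pandas as pd

# One categorical column with the three embarkation ports
df = pd.DataFrame({'Embarked': ['S', 'C', 'Q']})

# get_dummies replaces the column with one 0/1 indicator per category,
# named prefix_category and ordered alphabetically
out = pd.get_dummies(df, columns=['Embarked'], prefix='Emb')
print(list(out.columns))  # ['Emb_C', 'Emb_Q', 'Emb_S']
```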

We have completed the data analysis and feature engineering section. The next step is to build machine learning models using our prepared dataset.
Let's meet again in the final part of the article.

Next

Table of contents

  1. Kaggle Titanic data set - Top 2% guide (Part 01)
  2. Kaggle Titanic data set - Top 2% guide (Part 02)
  3. Kaggle Titanic data set - Top 2% guide (Part 03)
  4. Kaggle Titanic data set - Top 2% guide (Part 04)
  5. Kaggle Titanic data set - Top 2% guide (Part 05)

*This article was written by @nuwan, a member of @qualitia_cdev.
