
Kaggle Titanic data set - Top 2% guide (Part 04)

Posted at 2020-03-11

Part 04

Data analyzing and feature engineering : cont.

Deck

We derived the Deck feature from the Cabin feature: we removed the digits from the Cabin value and kept only the leading letter.
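As a minimal sketch of that derivation (the placeholder 'M' for missing cabins is an assumption for illustration, not necessarily what the earlier parts used):

```python
import pandas as pd

# Illustrative rows standing in for the real Cabin column
df = pd.DataFrame({'Cabin': ['C85', None, 'E46', 'B28']})

# Deck = first letter of Cabin; missing cabins get a placeholder letter
df['Deck'] = df['Cabin'].fillna('M').map(lambda c: c[0])
print(df['Deck'].tolist())  # ['C', 'M', 'E', 'B']
```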

titanic_demo.py
data = combined_df[combined_df["Survived"].notnull()]
fig, ax = plt.subplots(2, 2, figsize=(16, 10))

# catplot is figure-level and ignores ax; the axes-level barplot
# draws directly on the target axes, so no extra figures to close
sns.barplot(x='Deck', y='Survived', data=data, ci=None, ax=ax[0, 0])
sns.barplot(x='Deck', y='Survived', hue='Sex', data=data, ci=None, ax=ax[0, 1])
sns.barplot(x='Deck', y='Survived', hue='Pclass', data=data, ci=None, ax=ax[1, 0])
sns.barplot(x='Deck', y='Survived', hue='Embarked', data=data, ci=None, ax=ax[1, 1])

for i in range(2):
    for j in range(2):
        ax[i, j].set_ylabel('Survival Probability')

ax[0, 1].legend(loc='upper left', title='Sex')
ax[1, 0].legend(loc='upper right', title='Pclass')
ax[1, 1].legend(loc='upper left', title='Embarked')

fig.tight_layout()
plt.show()

(Figure: survival probability by Deck, overall and split by Sex, Pclass, and Embarked)
According to the graphs, survivability clearly depends on the deck. This may be related to the locations of the decks or to the characteristics of the passengers who occupied them. Deck B has the highest survivability, at more than 70%. Decks C and D have around 60% survivability, while all other decks are at or below 40%; decks E and G have the lowest, around 25%. Decks A, B, C, and T were used only by first-class passengers, and among them survivors outnumber non-survivors. Deck G was used only by third-class passengers, and Deck F only by second- and third-class passengers.
An interesting fact is that female passengers on decks A, B, and D have 100% survivability. Looking carefully, we can also see that female passengers on deck G have lower survivability than male passengers on decks A, B, and C. Even though women and children were given priority during the evacuation, the women on Deck G seem to have been neglected. Deck E was shared by all passenger classes, which helps clarify that the higher classes got better chances than the others: on Deck E, around 70% of first-class passengers survived, while around 80% of third-class passengers did not.
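The per-deck rates read off the graphs can be cross-checked with a simple groupby. A toy sketch (the rows below are made up for illustration, not the real dataset):

```python
import pandas as pd

# Illustrative stand-in for combined_df with Deck and Survived columns
toy = pd.DataFrame({
    'Deck':     ['B', 'B', 'B', 'C', 'C', 'G', 'G', 'G', 'G'],
    'Survived': [1,   1,   0,   1,   0,   0,   0,   1,   0],
})

# Mean of the 0/1 Survived column per deck = survival rate per deck
rates = toy.groupby('Deck')['Survived'].mean()
print(rates.to_dict())
```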

SibSp and Parch
titanic_demo.py
data = combined_df[combined_df["Survived"].notnull()]
fig, ax = plt.subplots(1, 2, figsize=(16, 6))

# Axes-level pointplot draws directly on the target axes
sns.pointplot(x='SibSp', y='Survived', data=data, ax=ax[0])
sns.pointplot(x='Parch', y='Survived', data=data, ax=ax[1])

for i in range(2):
    ax[i].set_ylabel('Survival Probability')

fig.tight_layout()
plt.show()

(Figure: survival probability vs SibSp and vs Parch)
The more siblings and spouses a passenger traveled with, the lower that passenger's survivability. Passengers who traveled alone, or with fewer than 2 siblings/spouses, have higher survivability, and if SibSp is greater than 4, survivability drops to 0. The second graph gives a similar impression. So small families have higher survivability than single travelers, medium families, and large families.
Let's create another feature named Total_family from SibSp and Parch, and then plot it.

titanic_demo.py
combined_df['Total_family'] = combined_df['Parch'] + combined_df['SibSp'] + 1
data = combined_df[combined_df['Survived'].notnull()]
g = sns.catplot(x='Total_family', y='Survived', data=data, kind='point')
g = g.set_ylabels('Survival Probability')
g.fig.set_figwidth(12)
g.fig.set_figheight(4)

(Figure: survival probability vs Total_family)

Great. We can see that small family sizes (2, 3, and 4) have the highest survivability, followed by single travelers.
Using this, we can create family-size classes: single traveler, small family, medium family, and large family.

titanic_demo.py
combined_df['Single'] = combined_df['Total_family'].map(lambda x: 1 if x==1 else 0)
combined_df['Small_family'] = combined_df['Total_family'].map(lambda x: 1 if 2 <= x <=4 else 0)
combined_df['Medium_family'] = combined_df['Total_family'].map(lambda x: 1 if 5 <= x <=7 else 0)
combined_df['Large_family'] = combined_df['Total_family'].map(lambda x: 1 if 7 < x else 0)
data = combined_df[combined_df['Survived'].notnull()]

Title

We obtained the Title feature from the Name feature. Let's check the survivability distribution among the titles.

titanic_demo.py
g = sns.catplot(x='Pclass',y='Survived', col='Title',data=data, kind='point')
g.fig.set_figwidth(16)
g.fig.set_figheight(4)

(Figure: survival probability by Pclass for each Title)
Interestingly, passengers with the title Master in 1st or 2nd class have 100% survivability, but in 3rd class it falls below the survivability of 1st- and 2nd-class passengers with female titles. The title Mr has a low probability of survival in every class. If a passenger belongs to 1st or 2nd class and has the title Master, Miss, or Mrs, there is a high probability of survival.
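One way to tabulate the Title-by-Pclass survival rates behind this plot is a pivot table. A toy sketch with made-up rows, purely for illustration:

```python
import pandas as pd

# Illustrative stand-in for the real Title/Pclass/Survived columns
toy = pd.DataFrame({
    'Title':    ['Master', 'Master', 'Mr', 'Mr'],
    'Pclass':   [1, 3, 1, 3],
    'Survived': [1, 0, 0, 0],
})

# Mean survival rate for each Title x Pclass cell
pivot = toy.pivot_table(values='Survived', index='Title', columns='Pclass')
print(pivot)
```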

Fare
titanic_demo.py
g1 = sns.catplot(x='Fare', y='Pclass', hue='Survived', data=data, kind='box', orient="h")
g1.fig.set_figwidth(20)
g2 = sns.catplot(x='Fare', y='Embarked', hue='Survived', data=data, kind='box', orient="h")
g2.fig.set_figwidth(20)

(Figure: Fare distributions by Pclass and by Embarked, split by Survived)
Interestingly, some people got a free pass for the voyage, while some paid more than $500. We can see a clear price gap for first-class tickets, whereas there doesn't seem to be much difference between 2nd- and 3rd-class ticket prices. Passengers who embarked at ports S and C spent more money than those from port Q. Since the fare covers the whole family, we cannot draw further conclusions without involving the number of family members.
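One way to act on that last observation would be a per-person fare, using the Total_family definition from earlier (SibSp + Parch + 1). A minimal sketch with illustrative numbers; `Fare_per_person` is a hypothetical feature name, not one the article builds:

```python
import pandas as pd

# Illustrative rows: a family of four on one fare, and a solo traveler
df = pd.DataFrame({'Fare': [100.0, 30.0], 'SibSp': [1, 0], 'Parch': [2, 0]})

# Total_family as defined in the article, then fare per family member
df['Total_family'] = df['SibSp'] + df['Parch'] + 1
df['Fare_per_person'] = df['Fare'] / df['Total_family']
print(df['Fare_per_person'].tolist())  # [25.0, 30.0]
```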

Embarked
titanic_demo.py
fig, ax = plt.subplots(2,2,figsize=(20,15))
sns.countplot(y='Embarked', data=data, ax=ax[0,0])
ax[0,0].set_title('# Passengers')

sns.countplot(y='Embarked', hue='Sex', data=data, ax=ax[0,1])
ax[0,1].set_title('Gender vs Embarked')

sns.countplot(y='Embarked', hue='Survived', data=data, ax=ax[1,0])
ax[1,0].set_title('Embarked vs Survived')

sns.countplot(y='Embarked', hue='Pclass', data=data, ax=ax[1,1])
ax[1,1].set_title('Embarked vs Pclass')

for i in range(2):
    for j in range(2):
        ax[i,j].set_xlabel('Number of passengers')

plt.show()

(Figure: passenger counts by Embarked, split by Sex, Survived, and Pclass)
Most of the passengers embarked at port S, where male passengers outnumber female passengers by more than two to one and the majority belong to 3rd class. Port Q has the smallest number of passengers, with very few 1st- and 2nd-class travelers; almost everyone who embarked at port Q bought a 3rd-class ticket. More than 50% of the passengers who embarked at port C survived.

Married vs not married
titanic_demo.py
combined_df['Is_Married'] = combined_df['Title'].map(lambda x: 1 if x == 'Mrs' else 0)

data = combined_df.loc[combined_df['Survived'].notnull()]
g = sns.catplot(x='Pclass',y='Survived', hue='Is_Married', data=data, kind='point')
g.fig.set_figwidth(16)
g.fig.set_figheight(4)

(Figure: survival probability by Pclass, split by Is_Married)
Now it's clear that married passengers have better survival chances than unmarried ones, so we can use this feature for our model.

It's safe to drop the features that we don't want to use and the features we no longer need for deriving others. I will drop the PassengerId, Name, Ticket, Cabin, Total_family, Age, and Fare features. The Ticket and Name features could be used to derive more features, but in this tutorial I decided to drop them for simplicity.

titanic_demo.py
combined_df.drop(columns=['PassengerId', 'Name', 'Ticket', 'Cabin', 'Total_family', 'Age', 'Fare'], inplace=True)

Now it's time to transform the categorical variables, because machine learning algorithms work only with numbers. There are several techniques available:

  • pandas.get_dummies
  • Sklearn LabelEncoder
  • Sklearn OneHotEncoder

Here I just used pandas get_dummies.
titanic_demo.py
combined_df = pd.get_dummies(combined_df, columns = ['Age_class'], prefix='Age_cls')
combined_df = pd.get_dummies(combined_df, columns = ['Deck'], prefix='Deck')
combined_df = pd.get_dummies(combined_df, columns = ['Embarked'], prefix='Emb')
combined_df = pd.get_dummies(combined_df, columns = ['Sex'], prefix='Sex')
combined_df = pd.get_dummies(combined_df, columns = ['Pclass'], prefix='P_class')
combined_df = pd.get_dummies(combined_df, columns = ['Title'], prefix='Title')
combined_df = pd.get_dummies(combined_df, columns = ['Fare_groups'], prefix='Fare_groups')
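To see concretely what get_dummies produces, here is a minimal sketch on a single toy column (the 'Emb' prefix matches the article's convention; the rows are illustrative):

```python
import pandas as pd

# One categorical column with the three embarkation ports
df = pd.DataFrame({'Embarked': ['S', 'C', 'Q']})

# get_dummies replaces the column with one 0/1 indicator per category,
# named prefix_category and ordered alphabetically
out = pd.get_dummies(df, columns=['Embarked'], prefix='Emb')
print(list(out.columns))  # ['Emb_C', 'Emb_Q', 'Emb_S']
```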

We have completed the data analysis and feature engineering section. The next step is to build machine learning models using our prepared dataset.
Let's meet again in the final part of the article.

Next

Table of contents

  1. Kaggle Titanic data set - Top 2% guide (Part 01)
  2. Kaggle Titanic data set - Top 2% guide (Part 02)
  3. Kaggle Titanic data set - Top 2% guide (Part 03)
  4. Kaggle Titanic data set - Top 2% guide (Part 04)
  5. Kaggle Titanic data set - Top 2% guide (Part 05)

*This article was written by @nuwan, a member of @qualitia_cdev.
