Kaggle Titanic data set - Top 2% guide (Part 03)

Posted at 2020-03-04

Part 03

Data analysis and feature engineering

This phase is very important, and many data scientists spend most of their time on it. We have to decide which features to keep, which to remove, and which new features we can derive from the existing ones.

Gender Distribution
titanic_demo.py
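import matplotlib.pyplot as plt
import seaborn as sns

# Keep only the rows of combined_df that have a known 'Survived' label
data = combined_df.loc[combined_df['Survived'].notnull()]
sns.countplot(x='Sex', hue='Survived', data=data)
plt.title('Gender Distribution - Survived vs Not-survived', fontsize=14)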

Screen Shot 2020-02-19 at 11.17.44.png

In our data set, more females survived than males. Historical accounts say that women and children were given priority during the evacuation.
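The count plot shows raw counts, so it can help to check the survival rate per group as well. Below is a minimal sketch of that check. It uses a tiny made-up frame called `sample` (not the real Titanic rows) so that it runs on its own; with the article's data you would apply the same `groupby` to `data`.

```python
import pandas as pd

# Hypothetical mini data set standing in for `data` (values are made up).
sample = pd.DataFrame({
    'Sex': ['male', 'male', 'male', 'male',
            'female', 'female', 'female', 'female'],
    'Survived': [0, 0, 0, 1, 1, 1, 1, 0],
})

# 'Survived' is 0/1, so its mean per group is the share of survivors.
survival_by_sex = sample.groupby('Sex')['Survived'].agg(['count', 'sum', 'mean'])
print(survival_by_sex)
```

Here `count` is the group size, `sum` the number of survivors, and `mean` the survival rate, which is the quantity the later bar plots display.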

titanic_demo.py
g = sns.catplot(x='Pclass', y='Survived', hue='Sex', data=data, kind='bar', ci=None, col='Embarked')
g = g.set_ylabels('Survival Probability')
g.fig.set_figwidth(12)
g.fig.set_figheight(4)

Screen Shot 2020-02-19 at 11.20.25.png
At every port, females have a higher probability of survival than males. Interestingly, port Q has no 1st or 2nd class male survivors. Perhaps most of the passengers who embarked at port Q travelled in 3rd class.

titanic_demo.py
sns.countplot(x='Pclass', hue='Sex', data=data[data['Embarked']=='Q'])

Screen Shot 2020-02-19 at 11.21.43.png

The hypothesis holds: very few 1st and 2nd class passengers embarked at port Q, and all the females in those two classes survived.
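The same check can also be done with numbers instead of a plot. The sketch below is only an illustration: `sample` is a small made-up frame with the same column names as the Titanic data, and `q_passengers`, `q_counts` and `q_rates` are names chosen here, not part of the original script.

```python
import pandas as pd

# Hypothetical rows with the Titanic column names (values are made up).
sample = pd.DataFrame({
    'Embarked': ['Q', 'Q', 'Q', 'Q', 'Q', 'S', 'C', 'S'],
    'Pclass':   [3, 3, 3, 1, 2, 1, 1, 3],
    'Sex':      ['male', 'female', 'male', 'female', 'female',
                 'male', 'female', 'male'],
    'Survived': [0, 1, 0, 1, 1, 0, 1, 0],
})

q_passengers = sample[sample['Embarked'] == 'Q']

# Number of passengers per class and sex at port Q.
q_counts = pd.crosstab(q_passengers['Pclass'], q_passengers['Sex'])

# Survival rate per class and sex at port Q (NaN where a group is empty).
q_rates = q_passengers.pivot_table(index='Pclass', columns='Sex',
                                   values='Survived', aggfunc='mean')
print(q_counts)
print(q_rates)
```

An empty group, such as 1st class males in this toy sample, shows up as 0 in the count table and as NaN in the rate table, which makes "no survivors" easy to tell apart from "no passengers".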

Passenger class Distribution
titanic_demo.py
fig, ax = plt.subplots(1, 2, figsize=(16, 6))
# catplot is a figure-level function and cannot draw into existing axes,
# so use the axes-level barplot and pointplot instead.
# (On seaborn >= 0.12, errorbar=None replaces the deprecated ci=None.)
sns.barplot(x='Pclass', y='Survived', data=data, ci=None, ax=ax[0])
sns.pointplot(x='Pclass', y='Survived', hue='Sex', data=data, ci=None, ax=ax[1])

for i in range(2):
    ax[i].set_ylabel('Survival Probability')

ax[1].legend(loc='upper right', title='Sex')

fig.tight_layout()
plt.show()

Screen Shot 2020-02-20 at 10.25.44.png

People say money can't buy everything, but here we can see a clear pattern: people who spent more on their tickets had a higher probability of survival. More than 60% of 1st class passengers survived, while only around 25% of 3rd class passengers did. Almost all the females with 1st class tickets survived, and 2nd class females also had a survival chance above 90%. Female passengers with 3rd class tickets had the lowest survival chance among females, at around 50%.
Let's look at the counts of survivors and non-survivors for each Pclass.

titanic_demo.py
g = sns.countplot(x='Pclass', hue='Survived', data=data)

Screen Shot 2020-02-19 at 11.51.26.png
There are 491 3rd class passengers, 216 1st class passengers, and 184 2nd class passengers. The graph suggests that the 1st and 2nd classes were given priority during the evacuation. More than 350 of the passengers who bought 3rd class tickets did not survive.
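Counts like these can be read straight from a cross-tabulation rather than estimated from the bars. The following is a rough sketch on a made-up `sample` frame (the numbers are not the real Titanic counts). `class_counts` and `class_rates` are names introduced only for this example.

```python
import pandas as pd

# Hypothetical rows (values are made up, not the real Titanic counts).
sample = pd.DataFrame({
    'Pclass':   [1, 1, 1, 2, 2, 3, 3, 3, 3, 3],
    'Survived': [1, 1, 0, 1, 0, 0, 0, 0, 1, 0],
})

# Counts of non-survivors (0) and survivors (1) per class, with row/column totals.
class_counts = pd.crosstab(sample['Pclass'], sample['Survived'], margins=True)

# The same table as shares within each class (each row sums to 1).
class_rates = pd.crosstab(sample['Pclass'], sample['Survived'], normalize='index')
print(class_counts)
print(class_rates)
```

With the article's data, passing `data['Pclass']` and `data['Survived']` should give the per-class totals and survival shares discussed above.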

Age Distribution
titanic_demo.py
fig, ax = plt.subplots(3, 1, figsize=(14, 9))
# violinplot is the axes-level counterpart of catplot(kind='violin'),
# so it can draw directly into each subplot.
sns.violinplot(x='Age', y='Embarked', hue='Survived', data=data, orient='h',
               dodge=True, cut=0, bw='scott', split=True, ax=ax[0])
ax[0].set_title('Age & Embarked vs Survived Comparison')

sns.violinplot(x='Age', y='Pclass', hue='Survived', data=data, orient='h',
               dodge=True, cut=0, bw='scott', split=True, ax=ax[1])
ax[1].set_title('Age & Pclass vs Survived')

sns.violinplot(x='Age', y='Sex', hue='Survived', data=data, orient='h',
               dodge=True, cut=0, bw='scott', split=True, ax=ax[2])
ax[2].set_title('Age & Sex vs Survived')

fig.tight_layout()
plt.show()

Screen Shot 2020-02-19 at 12.15.23.png

Among passengers who embarked at port Q, those aged 18 to 28 survived more often than other ages, and survival drops as age increases. At ports C and S, children appear to have been given priority. Among port S passengers, those aged 20 to 40 survived more often than other ages. Survival is much better for children under 16, and interestingly almost all the 2nd class children under 16 appear to have survived. For males, the chance of survival falls as age increases. For women, even past age 50, more survived than died.

The Age feature has too many distinct values. Since our data set is not large, I prefer to group those values into a few classes.

titanic_demo.py
# Rows with a missing Age keep the '' placeholder
combined_df['Age_class'] = ''
combined_df.loc[combined_df['Age'] <= 15, 'Age_class'] = 0
combined_df.loc[(combined_df['Age'] > 15) & (combined_df['Age'] <= 25), 'Age_class'] = 1
combined_df.loc[(combined_df['Age'] > 25) & (combined_df['Age'] <= 35), 'Age_class'] = 2
combined_df.loc[(combined_df['Age'] > 35) & (combined_df['Age'] <= 45), 'Age_class'] = 3
combined_df.loc[(combined_df['Age'] > 45) & (combined_df['Age'] <= 55), 'Age_class'] = 4
combined_df.loc[combined_df['Age'] > 55, 'Age_class'] = 5

# Recent seaborn versions require x and y to be passed as keywords
sns.catplot(x='Age_class', y='Survived', data=combined_df, col='Pclass', kind='point')
plt.show()

Screen Shot 2020-02-19 at 13.13.07.png
The graphs above show that survival falls as age increases, and all passenger classes follow the same pattern. This is consistent with children being given priority during the evacuation.
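As a side note, the age classes can also be produced in one step with `pandas.cut`. This is an alternative sketch, not the method used in this series: the bin edges mirror the `<= 15`, `<= 25`, ... thresholds of the Age_class feature, and the `ages` series is made up for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical ages, including boundary values and one missing entry.
ages = pd.Series([2, 15, 16, 25, 30, 45, 50, 55, 70, np.nan], name='Age')

# Right-closed bins: (-inf, 15], (15, 25], (25, 35], (35, 45], (45, 55], (55, inf)
bins = [-np.inf, 15, 25, 35, 45, 55, np.inf]

# labels=False returns the bin index (0..5); missing ages stay NaN.
age_class = pd.cut(ages, bins=bins, labels=False)
print(age_class.tolist())
```

One difference from the chained `.loc` assignments is that passengers with a missing age end up as NaN rather than an empty string, which keeps the column numeric.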

Let's continue in the next post.

Table of contents

  1. Kaggle Titanic data set - Top 2% guide (Part 01)
  2. Kaggle Titanic data set - Top 2% guide (Part 02)
  3. Kaggle Titanic data set - Top 2% guide (Part 03)
  4. Kaggle Titanic data set - Top 2% guide (Part 04)
  5. Kaggle Titanic data set - Top 2% guide (Part 05)

*This article was written by @nuwan, a member of @qualitia_cdev.
