## Part 03
Data analysis and feature engineering
This phase is very important, and many data scientists spend a lot of their time on it. We have to decide which features to keep, which to remove, and which new features we can derive from existing ones.
###### Gender Distribution
data = combined_df.loc[combined_df['Survived'].notnull()]
sns.countplot(x='Sex', hue='Survived', data = data)
plt.title('Gender Distribution - Survived vs Not-survived', fontsize = 14)
From our data set, we can see that more females survived than males. Historical accounts say that children and females were given priority.
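To put a number on what the count plot shows, the survival rate per group can be computed directly. This is a minimal sketch on a small made-up frame (the `toy` rows are illustrative only, not the real Titanic data); with the real data you would presumably run the same `groupby` on `data`.

```python
import pandas as pd

# Toy rows for illustration only; NOT the real Titanic data.
toy = pd.DataFrame({
    "Sex": ["female", "female", "female", "male", "male", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 1, 0, 0],
})

# The mean of a 0/1 column is the share of survivors in each group.
rate_by_sex = toy.groupby("Sex")["Survived"].mean()
print(rate_by_sex)
```

Because `Survived` is coded as 0 or 1, the group mean can be read directly as a survival probability.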
g = sns.catplot(x='Pclass', y='Survived', hue='Sex', data=data, kind='bar', ci=None, col='Embarked')
g = g.set_ylabels('Survival Probability')
g.fig.set_figwidth(12)
g.fig.set_figheight(4)
For each port, females have a high probability of survival. Interestingly, port Q has no 1st or 2nd class male survivors. Perhaps most of the passengers who embarked at port Q belonged to 3rd class.
sns.countplot(x='Pclass', hue='Sex', data=data[data['Embarked']=='Q'])
Our hypothesis holds. Very few 1st and 2nd class passengers embarked at port Q, and all the females in those two classes survived.
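The same check can be done numerically instead of by reading bar heights. The sketch below assumes a frame with `Embarked`, `Pclass` and `Sex` columns; the `toy` rows are invented for illustration and do not reflect the real passenger counts.

```python
import pandas as pd

# Toy rows for illustration only; NOT the real Titanic data.
toy = pd.DataFrame({
    "Embarked": ["Q", "Q", "Q", "Q", "S", "S", "C", "Q"],
    "Pclass":   [3, 3, 1, 3, 1, 2, 1, 2],
    "Sex":      ["male", "female", "female", "male", "male", "female", "female", "male"],
})

# Keep only passengers who embarked at port Q.
q_only = toy[toy["Embarked"] == "Q"]

# Passenger counts per class and sex for port Q.
class_by_sex = pd.crosstab(q_only["Pclass"], q_only["Sex"])

# Share of each class among port Q passengers.
class_share = q_only["Pclass"].value_counts(normalize=True).sort_index()

print(class_by_sex)
print(class_share)
```

If the share of 3rd class in `class_share` is close to 1 on the real data, that would back up the reading of the plot.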
###### Passenger Class Distribution
fig, ax = plt.subplots(1, 2, figsize=(16, 6))
# Axes-level functions accept ax=; the figure-level catplot creates its own figure instead
sns.barplot(x='Pclass', y='Survived', data=data, ci=None, ax=ax[0])
sns.pointplot(x='Pclass', y='Survived', hue='Sex', data=data, ci=None, ax=ax[1])
for i in range(2):
    ax[i].set_ylabel('Survival Probability')
ax[1].legend(loc='upper right', title='Sex')
fig.tight_layout()
plt.show()
People say money can't buy everything, but here we see a clear pattern: passengers who spent more on their tickets had a higher probability of survival. More than 60% of the 1st class passengers survived, while only around 25% of the 3rd class passengers did. Almost all the females who bought 1st class tickets survived, and 2nd class females also had a survival chance above 90%. Female passengers who bought 3rd class tickets had the lowest survival chance among females, at around 50%.
Let's look at the counts of survivors and non-survivors for each Pclass.
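The percentages quoted above are read off the plots; a pivot table gives the same kind of numbers as a table. This is a sketch on invented rows (the `toy` frame is hypothetical), so its output is not the real Titanic result.

```python
import pandas as pd

# Toy rows for illustration only; NOT the real Titanic data.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 1, 1, 3, 3, 3, 3],
    "Sex":      ["female", "female", "male", "male", "female", "female", "male", "male"],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

# Rows are passenger classes, columns are sexes, cells are survival rates.
rates = toy.pivot_table(index="Pclass", columns="Sex",
                        values="Survived", aggfunc="mean")
print(rates)
```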
g = sns.countplot(x='Pclass', hue='Survived', data=data)
There are 491 3rd class passengers, 216 1st class passengers, and 184 2nd class passengers. The graph suggests that 1st and 2nd class passengers were given priority during the evacuation. More than 350 of the passengers who bought 3rd class tickets did not survive.
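Counts like these can also be produced without a plot. The following sketch uses named aggregation on a made-up frame (`toy` is hypothetical and much smaller than the real data) to get the total, survived and not-survived counts per class.

```python
import pandas as pd

# Toy rows for illustration only; NOT the real Titanic data.
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 3, 3, 3],
    "Survived": [1, 0, 1, 0, 0, 1],
})

# size counts the rows in each class; sum counts the survivors (Survived == 1).
summary = toy.groupby("Pclass")["Survived"].agg(total="size", survived="sum")
summary["not_survived"] = summary["total"] - summary["survived"]
print(summary)
```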
###### Age Distribution
fig, ax = plt.subplots(3, 1, figsize=(14, 9))
# violinplot is the axes-level counterpart of catplot(kind='violin') and accepts ax=
sns.violinplot(x='Age', y='Embarked', hue='Survived', data=data, orient='h',
               dodge=True, cut=0, bw='scott', split=True, ax=ax[0])
ax[0].set_title('Age & Embarked vs Survived Comparison')
sns.violinplot(x='Age', y='Pclass', hue='Survived', data=data, orient='h',
               dodge=True, cut=0, bw='scott', split=True, ax=ax[1])
ax[1].set_title('Age & Pclass vs Survived')
sns.violinplot(x='Age', y='Sex', hue='Survived', data=data, orient='h',
               dodge=True, cut=0, bw='scott', split=True, ax=ax[2])
ax[2].set_title('Age & Sex vs Survived')
fig.tight_layout()
plt.show()
Among the passengers who embarked at port Q, those aged 18 to 28 survived more often than other ages, and survivability drops as age increases. Ports C and S seem to have given priority to children. Among the passengers who embarked at port S, those aged 20 to 40 survived more often than other ages. The survival rate is much better for children under 16, and interestingly, almost all the 2nd class children under 16 appear to have survived. For males, the chance of survival decreases as age increases. For women, even after age 50, more survived than died.
The Age feature has too many distinct values. Since our data set is not very large, I prefer to group those values into a few classes.
combined_df['Age_class'] = ''
combined_df.loc[combined_df['Age'] <= 15, 'Age_class'] = 0
combined_df.loc[(combined_df['Age'] > 15) & (combined_df['Age'] <= 25), 'Age_class'] = 1
combined_df.loc[(combined_df['Age'] > 25) & (combined_df['Age'] <= 35), 'Age_class'] = 2
combined_df.loc[(combined_df['Age'] > 35) & (combined_df['Age'] <= 45), 'Age_class'] = 3
combined_df.loc[(combined_df['Age'] > 45) & (combined_df['Age'] <= 55), 'Age_class'] = 4
combined_df.loc[combined_df['Age'] > 55, 'Age_class'] = 5
sns.catplot(x='Age_class', y='Survived', data=combined_df, col='Pclass', kind='point')
plt.show()
The graphs above show that survivability decreases as age increases, and all passenger classes follow the same pattern. This suggests that children were given priority during the evacuation.
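As an alternative to the chained `.loc` assignments, the same age bands can be sketched with `pd.cut`. This is not the method used in this series, just an equivalent formulation under the assumption that the bands are right-closed (`<= 15`, `15 < Age <= 25`, and so on); the `ages` frame is made up for illustration.

```python
import numpy as np
import pandas as pd

# Made-up ages for illustration, including a missing value.
ages = pd.DataFrame({"Age": [4, 15, 16, 25, 30, 45, 46, 55, 56, 80, np.nan]})

# Right-closed bins reproduce the bands: <= 15, (15, 25], ..., (45, 55], > 55.
edges = [-np.inf, 15, 25, 35, 45, 55, np.inf]
ages["Age_class"] = pd.cut(ages["Age"], bins=edges, labels=[0, 1, 2, 3, 4, 5])

print(ages)
```

One difference worth noting: with `pd.cut`, a missing `Age` stays missing in `Age_class` rather than being left as an empty string.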
Let's continue in the next post.
Table of contents
- Kaggle Titanic data set - Top 2% guide (Part 01)
- Kaggle Titanic data set - Top 2% guide (Part 02)
- Kaggle Titanic data set - Top 2% guide (Part 03)
- Kaggle Titanic data set - Top 2% guide (Part 04)
- Kaggle Titanic data set - Top 2% guide (Part 05)
*This article was written by @nuwan, a member of @qualitia_cdev.