More than 3 years have passed since last update.

Kaggle Titanic data set - Top 2% guide (Part 02)

Last updated at 2020-09-04Posted at 2020-02-26

Part 02

Cabin Feature

titanic_demo.py

combined_df[combined_df["Cabin"].notnull()]

In the cabin feature, there is a letter before the number. I searched on the internet and found out the letter mean deck. The deck can be a useful feature to predict the survivability. So, I will remove the number and keep only the character.

titanic_demo.py

combined_df['Deck'] = combined_df['Cabin'].str[:1]
combined_df[combined_df["Deck"].notnull()]

I assumed everybody owned a cabin on the ship. Before going further let's fill the missing deck values with "M" temporarily.

titanic_demo.py

combined_df['Deck'].fillna('M', inplace=True)

First, let's check the counts of each cabin

titanic_demo.py

values(combined_df, 'Deck')

We got 1014 missing values. So we need a good way to fill those missing values. Otherwise, our model can be inaccurate. I think we can use Pclass, Embarked, and Fare to fill the missing Deck values. Let's check it.

titanic_demo.py

g = combined_df.groupby('Deck')['Pclass'].value_counts(normalize=True).unstack()
g.plot(kind='bar', stacked='True', figsize=(10,4) ).legend(bbox_to_anchor=(1, 0.5), title="Pclass")

We can see that there A, B, C cabins don't have 2nd and 3rd class passengers. 3rd class passengers are only in E, F, G Decks. There are second class passengers in D, E, F. Only the deck G holds 3rd class passengers.

titanic_demo.py

g = sns.catplot(y='Fare', x='Deck', hue='Pclass', data=combined_df, kind='box', height=10, aspect=2)
g.fig.set_figwidth(12)
g.fig.set_figheight(8)

titanic_demo.py

g = sns.catplot(y='Fare', x='Deck', hue='Embarked', data=combined_df, kind='box', height=10, aspect=2)
g.fig.set_figwidth(12)
g.fig.set_figheight(8)

From the above graphs, we can see that its possible to use Fare, Embarked and Pclass data. I think it's a good idea to fill missing values using mean fare values based on grouped Embarked and Pclass values.

titanic_demo.py

print(' ---- Pclass 1 ---- ')
print(' ** Embarked : S ** ')
print(combined_df[(combined_df['Pclass']==1) & (combined_df['Embarked']=='S') & (combined_df['Deck']!='M')].groupby(['Deck'])['Fare'].mean().sort_values())
print('')
print(' ** Embarked : C ** ')
print(combined_df[(combined_df['Pclass']==1) & (combined_df['Embarked']=='C') & (combined_df['Deck']!='M')].groupby(['Deck'])['Fare'].mean().sort_values())
print('')
print(' ** Embarked : Q ** ')
print(combined_df[(combined_df['Pclass']==1) & (combined_df['Embarked']=='Q') & (combined_df['Deck']!='M')].groupby(['Deck'])['Fare'].mean().sort_values())
print('')
print('')
print(' ---- Pclass 2 ---- ')
print(' ** Embarked : S ** ')
print(combined_df[(combined_df['Pclass']==2) & (combined_df['Embarked']=='S') & (combined_df['Deck']!='M')].groupby(['Deck'])['Fare'].mean().sort_values())
print('')
print(' ** Embarked : C ** ')
print(combined_df[(combined_df['Pclass']==2) & (combined_df['Embarked']=='C') & (combined_df['Deck']!='M')].groupby(['Deck'])['Fare'].mean().sort_values())
print('')
print(' ** Embarked : Q ** ')
print(combined_df[(combined_df['Pclass']==2) & (combined_df['Embarked']=='Q') & (combined_df['Deck']!='M')].groupby(['Deck'])['Fare'].mean().sort_values())
print('')
print('')
print(' ---- Pclass 3 ---- ')
print('** Embarked : S ** ')
print(combined_df[(combined_df['Pclass']==3) & (combined_df['Embarked']=='S') & (combined_df['Deck']!='M')].groupby(['Deck'])['Fare'].mean().sort_values())
print('')
print(' ** Embarked : C ** ')
print(combined_df[(combined_df['Pclass']==3) & (combined_df['Embarked']=='C') & (combined_df['Deck']!='M')].groupby(['Deck'])['Fare'].mean().sort_values())
print('')
print(' ** Embarked : Q ** ')
print(combined_df[(combined_df['Pclass']==3) & (combined_df['Embarked']=='Q') & (combined_df['Deck']!='M')].groupby(['Deck'])['Fare'].mean().sort_values())

This will out put mean Fare values for each deck concerning Pclass and Embarked. Then we can use those values to fill missing deck values.

titanic_demo.py

def fill_deck(df):
    temp = []
    temp_deck = ''
    for idx in range(len(df)):
        temp_deck = df.iloc[idx]['Deck']
        if df.iloc[idx]['Deck'] == 'M':
            if df['Pclass'].iloc[idx] == 1:
                if df['Embarked'].iloc[idx] == 'S':
                    if df['Fare'].iloc[idx] < 35.5:
                        temp_deck = 'T'
                        
                    elif df['Fare'].iloc[idx] >= 35.50 and df['Fare'].iloc[idx] < 46.43:
                        temp_deck = 'E'
                        
                    elif df['Fare'].iloc[idx] >= 46.43 and df['Fare'].iloc[idx] < 47.27:
                        temp_deck = 'A'
                    
                    elif df['Fare'].iloc[idx] >= 47.27 and df['Fare'].iloc[idx] < 47.70:
                        temp_deck = 'D'
                        
                    elif df['Fare'].iloc[idx] >= 47.70 and df['Fare'].iloc[idx] < 78.67:
                        temp_deck = 'B'
                        
                    else:
                        temp_deck = 'C'
                
                elif df['Embarked'].iloc[idx] == 'C':
                    if df['Fare'].iloc[idx] < 35.20:
                        temp_deck = 'A'
                        
                    elif df['Fare'].iloc[idx] >= 35.20 and df['Fare'].iloc[idx] < 75.73:
                        temp_deck = 'D'
                        
                    elif df['Fare'].iloc[idx] >= 75.73 and df['Fare'].iloc[idx] < 99.07:
                        temp_deck = 'E'
                    
                    elif df['Fare'].iloc[idx] >= 99.07 and df['Fare'].iloc[idx] < 106.25:
                        temp_deck = 'C'
                        
                    else:
                        temp_deck = 'B'
                        
                elif df['Embarked'].iloc[idx] == 'Q':
                    temp_deck = 'C'
                    
            elif df['Pclass'].iloc[idx] == 2:
                if df['Embarked'].iloc[idx] == 'S':
                    if df['Fare'].iloc[idx] < 11.33:
                        temp_deck = 'E'
                        
                    elif df['Fare'].iloc[idx] >= 11.33 and df['Fare'].iloc[idx] < 13:
                        temp_deck = 'D'
                        
                    else:
                        temp_deck = 'F'
                
                elif df['Embarked'].iloc[idx] == 'C':
                    temp_deck = 'D'
                        
                elif df['Embarked'].iloc[idx] == 'Q':
                    temp_deck = 'E'
                    
            elif df['Pclass'].iloc[idx] == 3:
                if df['Embarked'].iloc[idx] == 'S':
                    if df['Fare'].iloc[idx] < 7.60:
                        temp_deck = 'F'
                        
                    elif df['Fare'].iloc[idx] >= 7.60 and df['Fare'].iloc[idx] < 11:
                        temp_deck = 'E'
                        
                    else:
                        temp_deck = 'G'

                else:
                    temp_deck = 'F'
                
        temp.append(temp_deck)
    df['Deck'] = temp


fill_deck(combined_df)

Age Feature

There are 263 missing values of the age. If we think about the real-world situation, age can be a very significant relationship to survival rate. So we have to care a bit when dealing with missing values of this feature. We can assign mean values. But wait, what happens if kids have given priority when evacuation? Then mean values is not seems a great choice because 3 years old can be assigned as 29. If we can filter out more, it will great for our model. What can be more related to age? We didn't use the Name feature yet, we should check it.

titanic_demo.py

combined_df["Name"].to_list()[:30]

OK. We have Mr, Mrs, Miss, and Master in the Name feature. This may help us to derive missing age values. Let's find out. Seems like every title ends with ".". So, we can use our god which is the regex to save us.

titanic_demo.py

combined_df['Title'] = ''
for _ in combined_df:
    combined_df['Title'] = combined_df['Name'].str.extract('([A-Za-z]+)\.')
    
combined_df.head()

titanic_demo.py

pd.crosstab(combined_df["Title"], combined_df["Sex"]).T

Oh, there are 18 different titles. We can see that there are 4 major title categories, some rare categories, and miss-spelled categories. I will make a new Title named Rare and replace rare titles by it.

titanic_demo.py

combined_df["Title"].replace(
    ['Capt', 'Col','Countess','Don', 'Dona', 'Dr', 'Jonkheer', 'Major', 'Rev', 'Sir', 'Dona'], 'Rare'
, inplace = True)

Now lets deal with the missed spelled titles,

titanic_demo.py

combined_df['Title'].replace(
    ['Lady', 'Mlle', 'Mme', 'Ms'],
    ['Mrs', 'Miss', 'Miss', 'Mrs'],inplace=True)
pd.crosstab(combined_df["Title"], combined_df["Survived"]).T

titanic_demo.py

g = sns.catplot(y="Age",x="Title",hue="Pclass", data=combined_df, kind="box", size = 8)
g.fig.set_figwidth(12)
g.fig.set_figheight(6)

The graph shows Age depends on Pclass and the Title of the name.

titanic_demo.py

combined_df.groupby(["Title", "Pclass"])["Age"].describe()

OK, now we can see how the age is distributed among the Titles. I am going to use respecting q3 values to the missing age values.

titanic_demo.py

combined_df.loc[(combined_df['Age'].isnull()) & (combined_df['Title']=='Master') & (combined_df['Pclass']==1),'Age'] = 11
combined_df.loc[(combined_df['Age'].isnull()) & (combined_df['Title']=='Master') & (combined_df['Pclass']==2),'Age'] = 3
combined_df.loc[(combined_df['Age'].isnull()) & (combined_df['Title']=='Master') & (combined_df['Pclass']==3),'Age'] = 9

combined_df.loc[(combined_df['Age'].isnull()) & (combined_df['Title']=='Miss') & (combined_df['Pclass']==1),'Age'] = 36
combined_df.loc[(combined_df['Age'].isnull()) & (combined_df['Title']=='Miss') & (combined_df['Pclass']==2),'Age'] = 28.5
combined_df.loc[(combined_df['Age'].isnull()) & (combined_df['Title']=='Miss') & (combined_df['Pclass']==3),'Age'] = 23

combined_df.loc[(combined_df['Age'].isnull()) & (combined_df['Title']=='Mr') & (combined_df['Pclass']==1),'Age'] = 50
combined_df.loc[(combined_df['Age'].isnull()) & (combined_df['Title']=='Mr') & (combined_df['Pclass']==2),'Age'] = 38
combined_df.loc[(combined_df['Age'].isnull()) & (combined_df['Title']=='Mr') & (combined_df['Pclass']==3),'Age'] = 33

combined_df.loc[(combined_df['Age'].isnull()) & (combined_df['Title']=='Mrs') & (combined_df['Pclass']==1),'Age'] = 53.25
combined_df.loc[(combined_df['Age'].isnull()) & (combined_df['Title']=='Mrs') & (combined_df['Pclass']==2),'Age'] = 40.5
combined_df.loc[(combined_df['Age'].isnull()) & (combined_df['Title']=='Mrs') & (combined_df['Pclass']==3),'Age'] = 39

combined_df.loc[(combined_df['Age'].isnull()) & (combined_df['Title']=='Rare') & (combined_df['Pclass']==1),'Age'] = 53
combined_df.loc[(combined_df['Age'].isnull()) & (combined_df['Title']=='Rare') & (combined_df['Pclass']==2),'Age'] = 53.25

Great, now we had finished filling missing values. The next phase is data analyzing and feature engineering.

I will continue this in another post.

Kaggle Titanic data set - Top 2% guide (Part 03)

*本記事は @qualitia_cdevの中の一人、@nuwanさんに作成していただきました。
*This article is written by @nuwan a member of @qualitia_cdev.

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up

Kaggle Titanic data set - Top 2% guide (Part 02)

Part 02

Cabin Feature

Age Feature

Next

Table of contents