0
0

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?

More than 3 years have passed since last update.

Kaggle Titanic data set - Top 2% guide (Part 02)

Last updated at Posted at 2020-02-26

##Part 02
######Cabin Feature

titanic_demo.py
combined_df[combined_df["Cabin"].notnull()]

Screen Shot 2020-02-18 at 12.25.28.png
In the cabin feature, there is a letter before the number. I searched on the internet and found out the letter mean deck. The deck can be a useful feature to predict the survivability. So, I will remove the number and keep only the character.

titanic_demo.py
combined_df['Deck'] = combined_df['Cabin'].str[:1]
combined_df[combined_df["Deck"].notnull()]

Screen Shot 2020-02-18 at 12.29.26.png
I assumed everybody owned a cabin on the ship. Before going further let's fill the missing deck values with "M" temporarily.

titanic_demo.py
combined_df['Deck'].fillna('M', inplace=True)

First, let's check the counts of each cabin

titanic_demo.py
values(combined_df, 'Deck')

Screen Shot 2020-02-18 at 12.33.28.png
We got 1014 missing values. So we need a good way to fill those missing values. Otherwise, our model can be inaccurate. I think we can use Pclass, Embarked, and Fare to fill the missing Deck values. Let's check it.

titanic_demo.py
g = combined_df.groupby('Deck')['Pclass'].value_counts(normalize=True).unstack()
g.plot(kind='bar', stacked='True', figsize=(10,4) ).legend(bbox_to_anchor=(1, 0.5), title="Pclass")

Screen Shot 2020-02-18 at 12.45.03.png
We can see that there A, B, C cabins don't have 2nd and 3rd class passengers. 3rd class passengers are only in E, F, G Decks. There are second class passengers in D, E, F. Only the deck G holds 3rd class passengers.

titanic_demo.py
g = sns.catplot(y='Fare', x='Deck', hue='Pclass', data=combined_df, kind='box', height=10, aspect=2)
g.fig.set_figwidth(12)
g.fig.set_figheight(8)

Screen Shot 2020-02-18 at 13.09.14.png

titanic_demo.py
g = sns.catplot(y='Fare', x='Deck', hue='Embarked', data=combined_df, kind='box', height=10, aspect=2)
g.fig.set_figwidth(12)
g.fig.set_figheight(8)

Screen Shot 2020-02-18 at 13.10.01.png
From the above graphs, we can see that its possible to use Fare, Embarked and Pclass data. I think it's a good idea to fill missing values using mean fare values based on grouped Embarked and Pclass values.

titanic_demo.py
print(' ---- Pclass 1 ---- ')
print(' ** Embarked : S ** ')
print(combined_df[(combined_df['Pclass']==1) & (combined_df['Embarked']=='S') & (combined_df['Deck']!='M')].groupby(['Deck'])['Fare'].mean().sort_values())
print('')
print(' ** Embarked : C ** ')
print(combined_df[(combined_df['Pclass']==1) & (combined_df['Embarked']=='C') & (combined_df['Deck']!='M')].groupby(['Deck'])['Fare'].mean().sort_values())
print('')
print(' ** Embarked : Q ** ')
print(combined_df[(combined_df['Pclass']==1) & (combined_df['Embarked']=='Q') & (combined_df['Deck']!='M')].groupby(['Deck'])['Fare'].mean().sort_values())
print('')
print('')
print(' ---- Pclass 2 ---- ')
print(' ** Embarked : S ** ')
print(combined_df[(combined_df['Pclass']==2) & (combined_df['Embarked']=='S') & (combined_df['Deck']!='M')].groupby(['Deck'])['Fare'].mean().sort_values())
print('')
print(' ** Embarked : C ** ')
print(combined_df[(combined_df['Pclass']==2) & (combined_df['Embarked']=='C') & (combined_df['Deck']!='M')].groupby(['Deck'])['Fare'].mean().sort_values())
print('')
print(' ** Embarked : Q ** ')
print(combined_df[(combined_df['Pclass']==2) & (combined_df['Embarked']=='Q') & (combined_df['Deck']!='M')].groupby(['Deck'])['Fare'].mean().sort_values())
print('')
print('')
print(' ---- Pclass 3 ---- ')
print('** Embarked : S ** ')
print(combined_df[(combined_df['Pclass']==3) & (combined_df['Embarked']=='S') & (combined_df['Deck']!='M')].groupby(['Deck'])['Fare'].mean().sort_values())
print('')
print(' ** Embarked : C ** ')
print(combined_df[(combined_df['Pclass']==3) & (combined_df['Embarked']=='C') & (combined_df['Deck']!='M')].groupby(['Deck'])['Fare'].mean().sort_values())
print('')
print(' ** Embarked : Q ** ')
print(combined_df[(combined_df['Pclass']==3) & (combined_df['Embarked']=='Q') & (combined_df['Deck']!='M')].groupby(['Deck'])['Fare'].mean().sort_values())

This will out put mean Fare values for each deck concerning Pclass and Embarked. Then we can use those values to fill missing deck values.

titanic_demo.py
def fill_deck(df):
    temp = []
    temp_deck = ''
    for idx in range(len(df)):
        temp_deck = df.iloc[idx]['Deck']
        if df.iloc[idx]['Deck'] == 'M':
            if df['Pclass'].iloc[idx] == 1:
                if df['Embarked'].iloc[idx] == 'S':
                    if df['Fare'].iloc[idx] < 35.5:
                        temp_deck = 'T'
                        
                    elif df['Fare'].iloc[idx] >= 35.50 and df['Fare'].iloc[idx] < 46.43:
                        temp_deck = 'E'
                        
                    elif df['Fare'].iloc[idx] >= 46.43 and df['Fare'].iloc[idx] < 47.27:
                        temp_deck = 'A'
                    
                    elif df['Fare'].iloc[idx] >= 47.27 and df['Fare'].iloc[idx] < 47.70:
                        temp_deck = 'D'
                        
                    elif df['Fare'].iloc[idx] >= 47.70 and df['Fare'].iloc[idx] < 78.67:
                        temp_deck = 'B'
                        
                    else:
                        temp_deck = 'C'
                
                elif df['Embarked'].iloc[idx] == 'C':
                    if df['Fare'].iloc[idx] < 35.20:
                        temp_deck = 'A'
                        
                    elif df['Fare'].iloc[idx] >= 35.20 and df['Fare'].iloc[idx] < 75.73:
                        temp_deck = 'D'
                        
                    elif df['Fare'].iloc[idx] >= 75.73 and df['Fare'].iloc[idx] < 99.07:
                        temp_deck = 'E'
                    
                    elif df['Fare'].iloc[idx] >= 99.07 and df['Fare'].iloc[idx] < 106.25:
                        temp_deck = 'C'
                        
                    else:
                        temp_deck = 'B'
                        
                elif df['Embarked'].iloc[idx] == 'Q':
                    temp_deck = 'C'
                    
            elif df['Pclass'].iloc[idx] == 2:
                if df['Embarked'].iloc[idx] == 'S':
                    if df['Fare'].iloc[idx] < 11.33:
                        temp_deck = 'E'
                        
                    elif df['Fare'].iloc[idx] >= 11.33 and df['Fare'].iloc[idx] < 13:
                        temp_deck = 'D'
                        
                    else:
                        temp_deck = 'F'
                
                elif df['Embarked'].iloc[idx] == 'C':
                    temp_deck = 'D'
                        
                elif df['Embarked'].iloc[idx] == 'Q':
                    temp_deck = 'E'
                    
            elif df['Pclass'].iloc[idx] == 3:
                if df['Embarked'].iloc[idx] == 'S':
                    if df['Fare'].iloc[idx] < 7.60:
                        temp_deck = 'F'
                        
                    elif df['Fare'].iloc[idx] >= 7.60 and df['Fare'].iloc[idx] < 11:
                        temp_deck = 'E'
                        
                    else:
                        temp_deck = 'G'

                else:
                    temp_deck = 'F'
                
        temp.append(temp_deck)
    df['Deck'] = temp


fill_deck(combined_df)    
Age Feature

There are 263 missing values of the age. If we think about the real-world situation, age can be a very significant relationship to survival rate. So we have to care a bit when dealing with missing values of this feature. We can assign mean values. But wait, what happens if kids have given priority when evacuation? Then mean values is not seems a great choice because 3 years old can be assigned as 29. If we can filter out more, it will great for our model. What can be more related to age? We didn't use the Name feature yet, we should check it.

titanic_demo.py
combined_df["Name"].to_list()[:30]

Screen Shot 2020-02-18 at 13.22.41.png
OK. We have Mr, Mrs, Miss, and Master in the Name feature. This may help us to derive missing age values. Let's find out. Seems like every title ends with ".". So, we can use our god which is the regex to save us.

titanic_demo.py
combined_df['Title'] = ''
for _ in combined_df:
    combined_df['Title'] = combined_df['Name'].str.extract('([A-Za-z]+)\.')
    
combined_df.head()

Screen Shot 2020-02-18 at 13.24.02.png

titanic_demo.py
pd.crosstab(combined_df["Title"], combined_df["Sex"]).T

Screen Shot 2020-02-18 at 13.25.02.png

Oh, there are 18 different titles. We can see that there are 4 major title categories, some rare categories, and miss-spelled categories. I will make a new Title named Rare and replace rare titles by it.

titanic_demo.py
combined_df["Title"].replace(
    ['Capt', 'Col','Countess','Don', 'Dona', 'Dr', 'Jonkheer', 'Major', 'Rev', 'Sir', 'Dona'], 'Rare'
, inplace = True)

Now lets deal with the missed spelled titles,

titanic_demo.py
combined_df['Title'].replace(
    ['Lady', 'Mlle', 'Mme', 'Ms'],
    ['Mrs', 'Miss', 'Miss', 'Mrs'],inplace=True)
pd.crosstab(combined_df["Title"], combined_df["Survived"]).T

Screen Shot 2020-02-18 at 13.27.06.png

titanic_demo.py
g = sns.catplot(y="Age",x="Title",hue="Pclass", data=combined_df, kind="box", size = 8)
g.fig.set_figwidth(12)
g.fig.set_figheight(6)

Screen Shot 2020-02-18 at 13.29.38.png
The graph shows Age depends on Pclass and the Title of the name.

titanic_demo.py
combined_df.groupby(["Title", "Pclass"])["Age"].describe()

Screen Shot 2020-02-18 at 13.32.05.png
OK, now we can see how the age is distributed among the Titles. I am going to use respecting q3 values to the missing age values.

titanic_demo.py
combined_df.loc[(combined_df['Age'].isnull()) & (combined_df['Title']=='Master') & (combined_df['Pclass']==1),'Age'] = 11
combined_df.loc[(combined_df['Age'].isnull()) & (combined_df['Title']=='Master') & (combined_df['Pclass']==2),'Age'] = 3
combined_df.loc[(combined_df['Age'].isnull()) & (combined_df['Title']=='Master') & (combined_df['Pclass']==3),'Age'] = 9

combined_df.loc[(combined_df['Age'].isnull()) & (combined_df['Title']=='Miss') & (combined_df['Pclass']==1),'Age'] = 36
combined_df.loc[(combined_df['Age'].isnull()) & (combined_df['Title']=='Miss') & (combined_df['Pclass']==2),'Age'] = 28.5
combined_df.loc[(combined_df['Age'].isnull()) & (combined_df['Title']=='Miss') & (combined_df['Pclass']==3),'Age'] = 23

combined_df.loc[(combined_df['Age'].isnull()) & (combined_df['Title']=='Mr') & (combined_df['Pclass']==1),'Age'] = 50
combined_df.loc[(combined_df['Age'].isnull()) & (combined_df['Title']=='Mr') & (combined_df['Pclass']==2),'Age'] = 38
combined_df.loc[(combined_df['Age'].isnull()) & (combined_df['Title']=='Mr') & (combined_df['Pclass']==3),'Age'] = 33

combined_df.loc[(combined_df['Age'].isnull()) & (combined_df['Title']=='Mrs') & (combined_df['Pclass']==1),'Age'] = 53.25
combined_df.loc[(combined_df['Age'].isnull()) & (combined_df['Title']=='Mrs') & (combined_df['Pclass']==2),'Age'] = 40.5
combined_df.loc[(combined_df['Age'].isnull()) & (combined_df['Title']=='Mrs') & (combined_df['Pclass']==3),'Age'] = 39

combined_df.loc[(combined_df['Age'].isnull()) & (combined_df['Title']=='Rare') & (combined_df['Pclass']==1),'Age'] = 53
combined_df.loc[(combined_df['Age'].isnull()) & (combined_df['Title']=='Rare') & (combined_df['Pclass']==2),'Age'] = 53.25

Great, now we had finished filling missing values. The next phase is data analyzing and feature engineering.

I will continue this in another post.

Next

Table of contents

  1. Kaggle Titanic data set - Top 2% guide (Part 01)
  2. Kaggle Titanic data set - Top 2% guide (Part 02)
  3. Kaggle Titanic data set - Top 2% guide (Part 03)
  4. Kaggle Titanic data set - Top 2% guide (Part 04)
  5. Kaggle Titanic data set - Top 2% guide (Part 05)

*本記事は @qualitia_cdevの中の一人、@nuwanさんに作成していただきました。
*This article is written by @nuwan a member of @qualitia_cdev.

0
0
0

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
0
0

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?