More than 5 years have passed since last update.

Pythonで機械学習 - Data Understanding

機械学習

Last updated at 2018-02-11Posted at 2018-02-11

Pythonで

データの操作にはPandasを使います。

import pandas as pd

Jupyterでの描画のためmatplotlibも使用します。

import matplotlib.pyplot as plt
%matplotlib inline

タイタニックのデータを読み込みます。Pandasのread_csvという関数でCSVファイルを取り込み、DataFrame型として保持します。

train = pd.read_csv("train.csv")

まずはデータの中身をざっと見てみます。

train.head()

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	STON/O2. 3101282	7.9250	NaN	S
3	4	1	1	Futrelle, Mrs. Jacques Heath (Lily May Peel)	female	35.0	1	113803	53.1000	C123	S
4	5	0	3	Allen, Mr. William Henry	male	35.0	0	373450	8.0500	NaN	S

Kaggleによると各項目の定義は以下のとおりです。

|Variable|Definition|Key/Remark|
|:--|:--|:--|:--|
|survival|Survival|0 = No, 1 = Yes|
|pclass|Ticket class|1 = 1st(Upper), 2 = 2nd(Middle), 3 = 3rd(Lower)|
|sex|Sex||
|Age|Age in years|Age is fractional if less than 1. If the age is estimated, is it in the form of xx.5|
|sibsp|# of siblings / spouses aboard the Titanic|Sibling = brother, sister, stepbrother, stepsister / Spouse = husband, wife (mistresses and fiancés were ignored)|
|parch|# of parents / children aboard the Titanic|Parent = mother, father / Child = daughter, son, stepdaughter, stepson / Some children travelled only with a nanny, therefore parch=0 for them.|
|ticket|Ticket number||
|fare|Passenger fare||
|cabin|Cabin number||
|embarked|Port of Embarkation|C = Cherbourg, Q = Queenstown, S = Southampton|

データの件数をカウントします。shape()で行数、列数を取得できます。全部で891行あります。

train.shape

(891, 12)

各列に欠損値がないかを、count()関数で確認します。Age, Cabinには欠損値があるようです。とくにCabinは欠損値が大半なので、変数としては使えなそうです。

train.count()

PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            714
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Embarked       889
dtype: int64

データの分布を可視化してみます。

train.Survived.value_counts().plot.bar()

train.Pclass.value_counts().sort_index().plot.bar()

train.Sex.value_counts().plot.bar()

train.Age.plot.hist()

train.SibSp.value_counts().sort_index().plot.bar()

train.Parch.value_counts().sort_index().plot.bar()

train.Fare.plot.hist()

train.Embarked.value_counts().plot.bar()

等級の高い（数字の小さい）船室の方が生存率が高い

train.groupby(["Pclass"]).agg(["count","sum"])["Survived"].plot.bar()

<matplotlib.axes._subplots.AxesSubplot at 0x10864b110>

女性の方が生存率が高い

train.groupby(["Sex"]).agg(["count","sum"])["Survived"].plot.bar()

<matplotlib.axes._subplots.AxesSubplot at 0x10851b590>

train.groupby(["SibSp"]).agg(["count","sum"])["Survived"].plot.bar()

<matplotlib.axes._subplots.AxesSubplot at 0x10921c690>

train.groupby(["Parch"]).agg(["count","sum"])["Survived"].plot.bar()

<matplotlib.axes._subplots.AxesSubplot at 0x10921c3d0>

train["NumFamily"] = train["SibSp"] + train["Parch"]
train.groupby(["NumFamily"]).agg(["count","sum"])["Survived"].plot.bar()

<matplotlib.axes._subplots.AxesSubplot at 0x10a26e450>

8歳未満は特に生存率が高い(AgeBand=6は欠損値)

import numpy as np
bins = np.histogram(
    train["Age"].fillna(train["Age"].median())
    ,bins=10)[1]
print bins
train["AgeBand"] = np.digitize(train["Age"],bins)
train.groupby(["AgeBand"]).agg(["count","sum"])["Survived"].plot.bar()

[  0.42    8.378  16.336  24.294  32.252  40.21   48.168  56.126  64.084
  72.042  80.   ]





<matplotlib.axes._subplots.AxesSubplot at 0x1090af0d0>

bins = np.histogram(
    train["Fare"].fillna(train["Fare"].median())
    ,bins=10)[1]
print bins
train["FareBand"] = np.digitize(train["Fare"],bins)
train.groupby(["FareBand"]).agg(["count","sum"])["Survived"].plot.bar()

[   0.        51.23292  102.46584  153.69876  204.93168  256.1646
  307.39752  358.63044  409.86336  461.09628  512.3292 ]





<matplotlib.axes._subplots.AxesSubplot at 0x1093adc10>

戻る

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up