LoginSignup
2
2

More than 5 years have passed since last update.

Pythonで機械学習 - Data Understanding

Last updated at Posted at 2018-02-11

Pythonで

データの操作にはPandasを使います。

import pandas as pd

Jupyterでの描画のためmatplotlibも使用します。

import matplotlib.pyplot as plt
%matplotlib inline

タイタニックのデータを読み込みます。Pandasのread_csvという関数でCSVファイルを取り込み、DataFrame型として保持します。

train = pd.read_csv("train.csv")

まずはデータの中身をざっと見てみます。

train.head()
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S

Kaggleによると各項目の定義は以下のとおりです。

Variable Definition Key/Remark
survival Survival 0 = No, 1 = Yes
pclass Ticket class 1 = 1st(Upper), 2 = 2nd(Middle), 3 = 3rd(Lower)
sex Sex
Age Age in years Age is fractional if less than 1. If the age is estimated, is it in the form of xx.5
sibsp # of siblings / spouses aboard the Titanic Sibling = brother, sister, stepbrother, stepsister / Spouse = husband, wife (mistresses and fiancés were ignored)
parch # of parents / children aboard the Titanic Parent = mother, father / Child = daughter, son, stepdaughter, stepson / Some children travelled only with a nanny, therefore parch=0 for them.
ticket Ticket number
fare Passenger fare
cabin Cabin number
embarked Port of Embarkation C = Cherbourg, Q = Queenstown, S = Southampton

データの件数をカウントします。shape()で行数、列数を取得できます。全部で891行あります。

train.shape
(891, 12)

各列に欠損値がないかを、count()関数で確認します。Age, Cabinには欠損値があるようです。とくにCabinは欠損値が大半なので、変数としては使えなそうです。

train.count()
PassengerId    891
Survived       891
Pclass         891
Name           891
Sex            891
Age            714
SibSp          891
Parch          891
Ticket         891
Fare           891
Cabin          204
Embarked       889
dtype: int64

データの分布を可視化してみます。

train.Survived.value_counts().plot.bar()

output_16_1.png

train.Pclass.value_counts().sort_index().plot.bar()

image.png

train.Sex.value_counts().plot.bar()

image.png

train.Age.plot.hist()

image.png

train.SibSp.value_counts().sort_index().plot.bar()

image.png

train.Parch.value_counts().sort_index().plot.bar()

image.png

train.Fare.plot.hist()

image.png

train.Embarked.value_counts().plot.bar()

image.png

等級の高い(数字の小さい)船室の方が生存率が高い

train.groupby(["Pclass"]).agg(["count","sum"])["Survived"].plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x10864b110>

output_23_1.png

女性の方が生存率が高い

train.groupby(["Sex"]).agg(["count","sum"])["Survived"].plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x10851b590>

output_25_1.png

train.groupby(["SibSp"]).agg(["count","sum"])["Survived"].plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x10921c690>

output_26_1.png

train.groupby(["Parch"]).agg(["count","sum"])["Survived"].plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x10921c3d0>

output_27_1.png

train["NumFamily"] = train["SibSp"] + train["Parch"]
train.groupby(["NumFamily"]).agg(["count","sum"])["Survived"].plot.bar()
<matplotlib.axes._subplots.AxesSubplot at 0x10a26e450>

output_28_1.png

8歳未満は特に生存率が高い(AgeBand=6は欠損値)

import numpy as np
bins = np.histogram(
    train["Age"].fillna(train["Age"].median())
    ,bins=10)[1]
print bins
train["AgeBand"] = np.digitize(train["Age"],bins)
train.groupby(["AgeBand"]).agg(["count","sum"])["Survived"].plot.bar()
[  0.42    8.378  16.336  24.294  32.252  40.21   48.168  56.126  64.084
  72.042  80.   ]





<matplotlib.axes._subplots.AxesSubplot at 0x1090af0d0>

output_30_2.png

bins = np.histogram(
    train["Fare"].fillna(train["Fare"].median())
    ,bins=10)[1]
print bins
train["FareBand"] = np.digitize(train["Fare"],bins)
train.groupby(["FareBand"]).agg(["count","sum"])["Survived"].plot.bar()


[   0.        51.23292  102.46584  153.69876  204.93168  256.1646
  307.39752  358.63044  409.86336  461.09628  512.3292 ]





<matplotlib.axes._subplots.AxesSubplot at 0x1093adc10>

output_31_2.png

戻る

2
2
0

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
2
2