
Getting a quick overview of your data with Pandas (or at least trying to)

Last updated at 2018-08-06, posted at 2018-05-25

[Note] This article is largely a personal memo. I plan to update it over time and turn it into a tutorial for my own use.

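Pandas is an open-source Python library that provides high-level, easy-to-use data structures and data analysis tools.
I recommend using Pandas to find out what kind of data you have before you start any analysis.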

Environment

pandas 0.23.0
Python 3.6.4
jupyter notebook 5.5.0
macOS High Sierra

Installation and setup

Pandas can be installed with pip.

terminal

pip install pandas
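Since the environment above pins a specific pandas version, it can be worth confirming which version actually got installed. A minimal sketch (the variable name `version` is just for illustration):

```python
import pandas as pd

# Version string of the pandas that is actually importable in this environment
version = pd.__version__
print(version)
```

If the printed version differs a lot from 0.23.0, some of the output shown in this article (for example the layout of `info()`) may look slightly different.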

Import pandas and load the dataset. This article uses the Titanic dataset.

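import pandas as pd

# Load the train and test datasets
df_train = pd.read_csv('../input/train.csv')
# Assumption: test.csv sits next to train.csv; df_test is needed for the describe step later
df_test = pd.read_csv('../input/test.csv')
df_train.head(3)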
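If you do not have the Titanic CSV at hand, you can still try `read_csv` on a small in-memory sample. The sketch below uses a made-up three-row CSV string (`csv_text` and `df_sample` are hypothetical names, and the values are only a stand-in for the real file):

```python
import io

import pandas as pd

# Hypothetical three-row sample laid out like a CSV file on disk
csv_text = """PassengerId,Survived,Pclass
1,0,3
2,1,1
3,1,3
"""

# read_csv accepts file-like objects as well as paths
df_sample = pd.read_csv(io.StringIO(csv_text))
print(df_sample.head(3))
```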

Check the size of the data with shape

First, check how many rows and columns the dataset has.

print(df_train.shape)

(891, 12)
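The tuple returned by `shape` is `(rows, columns)`, so it can be unpacked directly. A small sketch on a toy DataFrame (the values are placeholders, not the real Titanic file):

```python
import pandas as pd

# Toy stand-in for the Titanic data (hypothetical values)
df = pd.DataFrame({
    "PassengerId": [1, 2, 3],
    "Survived": [0, 1, 1],
    "Pclass": [3, 1, 3],
})

n_rows, n_cols = df.shape  # shape is a (rows, columns) tuple
print(f"{n_rows} rows, {n_cols} columns")
print(len(df))   # number of rows only
print(df.ndim)   # number of axes, which is 2 for a DataFrame
```

For the real training data, the same unpacking would give 891 rows and 12 columns.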

Check the column names with columns

print(df_train.columns)

Index(['PassengerId', 'Survived', 'Pclass', 'Name', 'Sex', 'Age', 'SibSp',
       'Parch', 'Ticket', 'Fare', 'Cabin', 'Embarked'],
      dtype='object')
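`columns` returns an Index object, which can be turned into a plain list or used for membership checks. The following is a sketch on a toy DataFrame; picking out numeric columns with `select_dtypes` is my own addition and not something the steps above rely on:

```python
import pandas as pd

# Toy stand-in with one text column and two numeric columns (hypothetical values)
df = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Sex": ["male", "female", "female"],
    "Age": [22.0, 38.0, 26.0],
})

col_names = df.columns.tolist()   # plain Python list of column names
has_age = "Age" in df.columns     # membership test on the Index
# Names of the numeric columns only
numeric_cols = df.select_dtypes(include="number").columns.tolist()

print(col_names)
print(has_age)
print(numeric_cols)
```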

Check the actual contents of the data with head()

Now that shape has told us the size of the data, let's look at the actual values.

df_train.head(3)

	PassengerId	Survived	Pclass	Name	Sex	Age	SibSp	Parch	Ticket	Fare	Cabin	Embarked
0	1	0	3	Braund, Mr. Owen Harris	male	22.0	1	0	A/5 21171	7.2500	NaN	S
1	2	1	1	Cumings, Mrs. John Bradley (Florence Briggs Th...	female	38.0	1	0	PC 17599	71.2833	C85	C
2	3	1	3	Heikkinen, Miss. Laina	female	26.0	0	0	STON/O2. 3101282	7.9250	NaN	S
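`head()` only shows the top of the file, which is sometimes not representative (for example when the rows are sorted). As a rough sketch, `tail()` and `sample()` give a look at other parts of the data; the toy DataFrame and variable names below are placeholders:

```python
import pandas as pd

# Toy stand-in with ten rows (hypothetical values)
df = pd.DataFrame({
    "PassengerId": range(1, 11),
    "Pclass": [3, 1, 3, 1, 3, 3, 1, 3, 3, 2],
})

first_rows = df.head()                         # defaults to the first 5 rows
last_rows = df.tail(2)                         # last 2 rows
random_rows = df.sample(n=3, random_state=0)   # 3 random rows, reproducible via the seed

print(first_rows)
print(last_rows)
print(random_rows)
```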

Check missing values and data types with info

df_train.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
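`info()` prints its report rather than returning it, so if you want the same information as objects you can work with, `dtypes` and `count()` are one option. A minimal sketch on a toy DataFrame with a few missing values (the values are made up):

```python
import numpy as np
import pandas as pd

# Toy stand-in with missing values in Age and Cabin (hypothetical values)
df = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [None, "C85", None],
})

df.info()               # prints the report; the return value is None
dtypes = df.dtypes      # Series mapping column name to dtype
non_null = df.count()   # non-null count per column, like the counts in info()

print(dtypes)
print(non_null)
```

In the real output above, Age (714 non-null) and Cabin (204 non-null) are the columns that stand out as having many missing values.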

Count missing values with isnull().sum()

df_train.isnull().sum()

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64
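Raw counts are easier to judge as a fraction of all rows. In the output above, Cabin is missing in 687 of 891 rows (about 77%) and Age in 177 of 891 (about 20%). A sketch of how the ratio could be computed alongside the count, using a toy DataFrame (`summary` and the other names are hypothetical):

```python
import numpy as np
import pandas as pd

# Toy stand-in with missing values (hypothetical values)
df = pd.DataFrame({
    "Sex": ["male", "female", "female", "male"],
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
})

missing_count = df.isnull().sum()    # number of missing values per column
missing_ratio = df.isnull().mean()   # mean of booleans = fraction missing

# Keep only columns that have missing values, worst first
summary = pd.DataFrame({"count": missing_count, "ratio": missing_ratio})
summary = summary[summary["count"] > 0].sort_values("count", ascending=False)
print(summary)
```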

Show summary statistics with describe

df_full = pd.concat([df_train, df_test], axis = 0, ignore_index=True)
print(df_full.shape)
df_full.describe()

(1309, 12)


/opt/conda/lib/python3.6/site-packages/ipykernel_launcher.py:1: FutureWarning: Sorting because non-concatenation axis is not aligned. A future version
of pandas will change to not sort by default.

To accept the future behavior, pass 'sort=True'.

To retain the current behavior and silence the warning, pass sort=False

  """Entry point for launching an IPython kernel.

	Age	Fare	Parch	PassengerId	Pclass	SibSp	Survived
count	1046.000000	1308.000000	1309.000000	1309.000000	1309.000000	1309.000000	891.000000
mean	29.881138	33.295479	0.385027	655.000000	2.294882	0.498854	0.383838
std	14.413493	51.758668	0.865560	378.020061	0.837836	1.041658	0.486592
min	0.170000	0.000000	0.000000	1.000000	1.000000	0.000000	0.000000
25%	21.000000	7.895800	0.000000	328.000000	2.000000	0.000000	0.000000
50%	28.000000	14.454200	0.000000	655.000000	3.000000	0.000000	0.000000
75%	39.000000	31.275000	0.000000	982.000000	3.000000	1.000000	1.000000
max	80.000000	512.329200	9.000000	1309.000000	3.000000	8.000000	1.000000
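Two things in the output above are worth a note. The count for Survived is 891 rather than 1309 because only the training data has that column, and `describe()` covers only the numeric columns by default. The sketch below reproduces both points on toy data (the values and the names `train`, `test`, `full` are hypothetical). Passing `sort=False` to `concat` is what the warning message itself suggests for silencing it. Unlike the run above, it leaves the columns in their original order instead of sorting them alphabetically.

```python
import pandas as pd

# Toy stand-ins: only the training data has a Survived column (hypothetical values)
train = pd.DataFrame({
    "Survived": [0, 1, 1],
    "Sex": ["male", "female", "male"],
    "Age": [22.0, 38.0, 26.0],
})
test = pd.DataFrame({
    "Sex": ["male", "female"],
    "Age": [34.5, 47.0],
})

# sort=False keeps the column order as-is and silences the FutureWarning
full = pd.concat([train, test], axis=0, ignore_index=True, sort=False)

numeric_summary = full.describe()        # numeric columns only by default
text_summary = full[["Sex"]].describe()  # count / unique / top / freq for a text column

print(numeric_summary)
print(text_summary)
```

Here the Survived count comes out as 3 while the Age count is 5, which mirrors the 891 versus 1309 gap in the real data.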

To be updated over time.

I also tweet machine learning information and recommended articles on Twitter, so feel free to follow me.
@bam6o0
