
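Learn pandas in 10 Minutes

Posted at 2015-01-02, last updated at 2015-01-02

The "pandas 0.15.2 documentation" includes a section called "10 Minutes to pandas", and reading through it cleared up a lot for me.
Working through it properly takes more than 10 minutes, so these notes pick out only the parts that looked most useful.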

First, import pandas and NumPy.

# Import libraries
import pandas as pd
import numpy as np

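Creating a DataFrame

There are several ways to create a DataFrame, so here is a summary.
The first pattern builds the values as a NumPy array and then attaches an index and column labels.

Create the index.

# Create an index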
dates = pd.date_range("20130101", periods=6)
dates

<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01, ..., 2013-01-06]
Length: 6, Freq: D, Timezone: None

Create the DataFrame and attach the index.

# Create a DataFrame
df = pd.DataFrame(np.random.randn(6, 4), index=dates, columns=list("ABCD"))
df

 	A 	B 	C 	D
2013-01-01 	0.705624 	-0.793903 	0.843425 	0.672602
2013-01-02 	-1.211129 	2.077101 	-1.795861 	0.028060
2013-01-03 	0.706086 	0.385631 	0.967568 	0.271894
2013-01-04 	2.152279 	-0.493576 	1.184289 	-1.193300
2013-01-05 	0.455767 	0.787551 	0.239406 	1.627586
2013-01-06 	-0.639162 	-0.052620 	0.288010 	-2.205777
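Because np.random.randn draws new values on every run, the numbers above will not match on your machine. A minimal sketch of a reproducible version follows; the seeded RandomState and the seed value 0 are my additions and are not in the original notes.

```python
import numpy as np
import pandas as pd

# Seeded generator so repeated runs give the same values (the seed is an arbitrary choice).
rng = np.random.RandomState(0)

# Same construction as above: NumPy values, a date index, and column labels A to D.
dates = pd.date_range("20130101", periods=6)
df = pd.DataFrame(rng.randn(6, 4), index=dates, columns=list("ABCD"))

print(df.shape)
print(df.index[0])
```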

Next, create a DataFrame column by column, as if building one Series per label.
This way, each column can have its own dtype.

df2 = pd.DataFrame({ 'A' : 1.,
                     'B' : pd.Timestamp('20130102'),
                     'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
                     'D' : np.array([3] * 4,dtype='int32'),
                     'E' : pd.Categorical(["test","train","test","train"]),
                     'F' : 'foo' })
df2

 	A 	B 	C 	D 	E 	F
0 	1 	2013-01-02 	1 	3 	test 	foo
1 	1 	2013-01-02 	1 	3 	train 	foo
2 	1 	2013-01-02 	1 	3 	test 	foo
3 	1 	2013-01-02 	1 	3 	train 	foo


df2.dtypes

A           float64
B    datetime64[ns]
C           float32
D             int32
E          category
F            object
dtype: object
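Since each column carries its own dtype, columns can also be picked by dtype. The sketch below rebuilds the same df2 and uses select_dtypes, which is not covered in the original notes, to keep only the numeric columns.

```python
import numpy as np
import pandas as pd

# Same mixed-dtype DataFrame as above.
df2 = pd.DataFrame({
    "A": 1.0,
    "B": pd.Timestamp("20130102"),
    "C": pd.Series(1, index=list(range(4)), dtype="float32"),
    "D": np.array([3] * 4, dtype="int32"),
    "E": pd.Categorical(["test", "train", "test", "train"]),
    "F": "foo",
})

# Keep only numeric columns; datetime, category, and object columns are dropped.
numeric = df2.select_dtypes(include=["number"])
print(list(numeric.columns))
```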

Viewing a DataFrame

Next, ways to view the data in the form you want.

Show only the index, only the columns, or only the underlying NumPy data.

df.index

<class 'pandas.tseries.index.DatetimeIndex'>
[2013-01-01, ..., 2013-01-06]
Length: 6, Freq: D, Timezone: None


df.columns

Index([u'A', u'B', u'C', u'D'], dtype='object')

df.values

array([[ 0.705624  , -0.79390348,  0.84342517,  0.67260162],
       [-1.21112884,  2.0771009 , -1.79586146,  0.02806019],
       [ 0.70608621,  0.38563092,  0.9675681 ,  0.27189394],
       [ 2.15227868, -0.49357565,  1.18428903, -1.19329976],
       [ 0.45576744,  0.78755094,  0.23940583,  1.62758649],
       [-0.63916155, -0.05261954,  0.28800958, -2.20577674]])

Show a quick summary of statistics. This one is handy.

df.describe()

 	A 	B 	C 	D
count 	6.000000 	6.000000 	6.000000 	6.000000
mean 	0.361578 	0.318364 	0.287806 	-0.133156
std 	1.177066 	1.034585 	1.087978 	1.368150
min 	-1.211129 	-0.793903 	-1.795861 	-2.205777
25% 	-0.365429 	-0.383337 	0.251557 	-0.887960
50% 	0.580696 	0.166506 	0.565717 	0.149977
75% 	0.705971 	0.687071 	0.936532 	0.572425
max 	2.152279 	2.077101 	1.184289 	1.627586
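The result of describe() is itself a DataFrame, so individual statistics can be read out of it by row and column label. A small sketch with made-up values (the data below is illustrative, not the random data used above):

```python
import pandas as pd

# Small hand-written data so the statistics are easy to verify by eye.
df = pd.DataFrame({"A": [1.0, 2.0, 3.0, 4.0],
                   "B": [10.0, 20.0, 30.0, 40.0]})

summary = df.describe()

# Each row of the summary matches the corresponding single-statistic method.
print(summary.loc["mean", "A"], df["A"].mean())
print(summary.loc["50%", "B"], df["B"].median())
```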

Transpose the DataFrame (swap rows and columns).

df.T

 	2013-01-01 00:00:00 	2013-01-02 00:00:00 	2013-01-03 00:00:00 	2013-01-04 00:00:00 	2013-01-05 00:00:00 	2013-01-06 00:00:00
A 	0.705624 	-1.211129 	0.706086 	2.152279 	0.455767 	-0.639162
B 	-0.793903 	2.077101 	0.385631 	-0.493576 	0.787551 	-0.052620
C 	0.843425 	-1.795861 	0.967568 	1.184289 	0.239406 	0.288010
D 	0.672602 	0.028060 	0.271894 	-1.193300 	1.627586 	-2.205777

Sort along either axis.
For example, sort the column labels in descending order.

df.sort_index(axis=1, ascending=False)

 	D 	C 	B 	A
2013-01-01 	0.672602 	0.843425 	-0.793903 	0.705624
2013-01-02 	0.028060 	-1.795861 	2.077101 	-1.211129
2013-01-03 	0.271894 	0.967568 	0.385631 	0.706086
2013-01-04 	-1.193300 	1.184289 	-0.493576 	2.152279
2013-01-05 	1.627586 	0.239406 	0.787551 	0.455767
2013-01-06 	-2.205777 	0.288010 	-0.052620 	-0.639162

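Next, sort by the values in column "B" in ascending order.

# pandas 0.15.2 spelled this df.sort(columns='B'); that method was later removed in favor of sort_values
df.sort_values(by='B')

 	A 	B 	C 	D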
2013-01-01 	0.705624 	-0.793903 	0.843425 	0.672602
2013-01-04 	2.152279 	-0.493576 	1.184289 	-1.193300
2013-01-06 	-0.639162 	-0.052620 	0.288010 	-2.205777
2013-01-03 	0.706086 	0.385631 	0.967568 	0.271894
2013-01-05 	0.455767 	0.787551 	0.239406 	1.627586
2013-01-02 	-1.211129 	2.077101 	-1.795861 	0.028060
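Sorting by values also works in descending order and with more than one key. The sketch below uses sort_values, which as far as I know is available from pandas 0.17 onward; the small DataFrame is made up for illustration.

```python
import pandas as pd

# Illustrative data with a repeated key in column A.
df = pd.DataFrame({"A": [2, 1, 2, 1],
                   "B": [0.3, -1.0, 0.1, 2.0]},
                  index=list("wxyz"))

# Descending order by a single column.
desc = df.sort_values(by="B", ascending=False)

# Sort by A first, then by B within each value of A.
multi = df.sort_values(by=["A", "B"])

print(desc)
print(multi)
```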

Selecting data

Data can be extracted from many angles.
For example, just part of the index.

Select data by specifying both the index range and the column labels.

df.loc['20130102':'20130104',['A','B']]

 	A 	B
2013-01-02 	-1.211129 	2.077101
2013-01-03 	0.706086 	0.385631
2013-01-04 	2.152279 	-0.493576
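Besides label-based selection with loc, the same rows can be picked by integer position with iloc or by a boolean condition. A minimal sketch, assuming integer data from np.arange in place of the random values so the result is predictable:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20130101", periods=6)
# Values 0..23 laid out row by row, so column A holds 0, 4, 8, 12, 16, 20.
df = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# Label-based: both ends of the date slice are included.
by_label = df.loc["20130102":"20130104", ["A", "B"]]

# Position-based: the end of the slice is excluded, as in plain Python.
by_position = df.iloc[1:4, 0:2]

# Boolean condition on a column.
by_condition = df[df["A"] > 8]

print(by_label.equals(by_position))
print(len(by_condition))
```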

Rows can be grouped by any column, and the groups can then be operated on directly.


# Create a DataFrame
df = pd.DataFrame({"A" : ['foo', 'bar', 'foo', 'bar','foo', 'bar', 'foo', 'foo'],
                   "B" : ['one', 'one', 'two', 'three','two', 'two', 'one', 'three'],
                   "C" : np.random.randn(8),
                   "D" : np.random.randn(8)})

df

 	A 	B 	C 	D
0 	foo 	one 	1.130975 	1.235940
1 	bar 	one 	-0.140004 	-2.714958
2 	foo 	two 	1.526578 	-0.165415
3 	bar 	three 	-1.049092 	-0.037484
4 	foo 	two 	-1.182303 	0.288754
5 	bar 	two 	0.530652 	1.204125
6 	foo 	one 	0.678477 	-0.273343
7 	foo 	three 	0.929624 	0.169822


# Group by column A, then sum each group
df.groupby('A').sum()

 	C 	D
A 		
bar 	-0.658445 	-1.548317
foo 	3.083350 	1.255758
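Grouping also works with several keys and with several aggregations at once. In the sketch below the numeric columns are selected explicitly before aggregating; my understanding is that how sum() treats string columns such as B differs between pandas versions, so being explicit avoids surprises. The data is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["foo", "bar", "foo", "bar"],
    "B": ["one", "one", "two", "two"],
    "C": [1.0, 2.0, 3.0, 4.0],
    "D": [10.0, 20.0, 30.0, 40.0],
})

# Sum only the numeric columns within each group of A.
sums = df.groupby("A")[["C", "D"]].sum()

# Group by two keys; the result is indexed by (A, B) pairs.
by_two = df.groupby(["A", "B"])["C"].mean()

# Several aggregations of one column at once.
stats = df.groupby("A")["C"].agg(["sum", "mean"])

print(sums)
print(by_two)
print(stats)
```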

Creating a pivot table

Create a DataFrame to turn into a pivot table.

# Create a DataFrame
df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
                   'B' : ['A', 'B', 'C'] * 4,
                   'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] *2,
                   'D' : np.random.randn(12),
                   'E' : np.random.randn(12)})
df

 	A 	B 	C 	D 	E
0 	one 	A 	foo 	0.575699 	-1.669032
1 	one 	B 	foo 	0.592889 	-2.526196
2 	two 	C 	foo 	-2.229949 	-0.703339
3 	three 	A 	bar 	0.801380 	-1.638983
4 	one 	B 	bar 	-0.135691 	-0.302586
5 	one 	C 	bar 	0.317401 	1.169608
6 	two 	A 	foo 	0.064460 	-0.109437
7 	three 	B 	foo 	-0.605017 	1.043246
8 	one 	C 	foo 	-0.365220 	0.850535
9 	one 	A 	bar 	1.033552 	0.226002
10 	two 	B 	bar 	-0.260542 	0.352249
11 	three 	C 	bar 	0.518531 	1.407827

Converting it to a pivot table is fairly easy.

pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])

 	C 	bar 	foo
A 	B 		
one 	A 	1.033552 	0.575699
 	B 	-0.135691 	0.592889
 	C 	0.317401 	-0.365220
three 	A 	0.801380 	NaN
 	B 	NaN 	-0.605017
 	C 	0.518531 	NaN
two 	A 	NaN 	0.064460
 	B 	-0.260542 	NaN
 	C 	NaN 	-2.229949
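By default pivot_table takes the mean of the values that fall into each cell, and combinations that never occur become NaN. The sketch below, using made-up data, shows the default next to a variant with the aggfunc and fill_value parameters, which the notes above do not use.

```python
import pandas as pd

df = pd.DataFrame({
    "A": ["one", "one", "two", "two", "one"],
    "C": ["foo", "bar", "foo", "foo", "foo"],
    "D": [1.0, 2.0, 3.0, 5.0, 3.0],
})

# Default aggregation is the mean; the missing (two, bar) combination is NaN.
mean_table = pd.pivot_table(df, values="D", index="A", columns="C")

# Sum instead of mean, and fill missing combinations with 0.
sum_table = pd.pivot_table(df, values="D", index="A", columns="C",
                           aggfunc="sum", fill_value=0)

print(mean_table)
print(sum_table)
```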

Summary

Skimming through all of this once makes it easy to come back when I actually face one of these tasks, which is a big help.

References

pandas 0.15.2 documentation
http://pandas.pydata.org/pandas-docs/stable/index.html

10 Minutes to pandas
http://pandas.pydata.org/pandas-docs/stable/10min.html
