More than 5 years have passed since last update.

【データサイエンティスト入門】科学計算、データ加工、グラフ描画ライブラリの使い方の基礎♬Pandasの基礎

Posted at 2020-08-21

昨夜は、【データサイエンティスト入門】科学計算、データ加工、グラフ描画ライブラリの使い方の基礎としてScipyの基礎をまとめましたが、今夜はPandasの基礎をまとめます。本書に乗った解説を補足することとします。
【注意】
「東京大学のデータサイエンティスト育成講座」を読んで、ちょっと疑問を持ったところや有用だと感じた部分をまとめて行こうと思う。
したがって、あらすじはまんまになると思うが内容は本書とは関係ないと思って読んでいただきたい。

Chapter 2-4 Pandasの基礎

「PandasはPythonでモデリングする（機械学習等を使う）前のいわゆる前処理をするときに便利なライブラリです。．．．表計算、データ抽出検索などの操作ができます。」

2-4-1 Pandasのライブラリのインポート

>>> import pandas as pd
>>> from pandas import Series, DataFrame
>>> pd.__version__
'1.0.3

2-4-2 Seriesの使い方

「Seriesは１次元の配列のような．．．」「ような」って、なんだよ
ということで、以下typeを見て、出力すると、なるほど「ような」．．．

>>> sample_pandas_data = pd.Series([0,10,20,30,40,50,60,70,80,90])
>>> print(type(sample_pandas_data))
<class 'pandas.core.series.Series'>
>>> print(sample_pandas_data)
0     0
1    10
2    20
3    30
4    40
5    50
6    60
7    70
8    80
9    90
dtype: int64

では、インデックスがついている。

numpy配列からpd.Seriesに変換

参考によれば、
「PandasのベースはNumPyということもあって、互換性は非常に高いです。」

>>> array = np.arange(0,100,10)
>>> array
array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90])
>>> series_sample = pd.Series(array)
>>> series_sample
0     0
1    10
2    20
3    30
4    40
5    50
6    60
7    70
8    80
9    90
dtype: int32

dtype = 'int64'を指定

>>> array = np.arange(0,100,10, dtype = 'int64')
>>> array
array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90], dtype=int64)
>>> series_sample = pd.Series(array)
>>> series_sample
0     0
1    10
2    20
3    30
4    40
5    50
6    60
7    70
8    80
9    90
dtype: int64

【参考】
PandasとNumPyの違いと使い分け方

pd.Series:インデックスを指定する

>>> sample_pandas_index_data = pd.Series([0,10,20,30,40,50,60,70,80,90], index = ['a','b','c','d','e','f','g','h','i','j'])
>>> sample_pandas_index_data
a     0
b    10
c    20
d    30
e    40
f    50
g    60
h    70
i    80
j    90
dtype: int64
>>> sample_pandas_index_data.index
Index(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'], dtype='object')
>>> sample_pandas_index_data.values
array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90], dtype=int64)

numpy配列から作成できる。

>>> array0 = np.arange(0,100,10, dtype = 'int64')
>>> array1 = np.array(['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'])
>>> sample_pandas_index_data2 = pd.Series(array0,index = array1)
>>> sample_pandas_index_data2
a     0
b    10
c    20
d    30
e    40
f    50
g    60
h    70
i    80
j    90
dtype: int64

2-4-3 DataFrameの使い方

「DataFrameは2次元の配列です。．．．」
dataは辞書形式から変換します。
出力は表形式で出力されます。

>>> attri_data1 = {'ID':['100','101','102','103','104'],
...               'City':['Tokyo','Osaka','Kyoto','Hokkaido','Tokyo'],
...               'Birth_year':['1990','1989','1970','1954','2014'],
...                'Name':['Hiroshi','Akiko','Yuki','Satoru','Steve']}
>>> attri_data_frame1=DataFrame(attri_data1)
>>> attri_data_frame1
    ID      City Birth_year     Name
0  100     Tokyo       1990  Hiroshi
1  101     Osaka       1989    Akiko
2  102     Kyoto       1970     Yuki
3  103  Hokkaido       1954   Satoru
4  104     Tokyo       2014    Steve
>>> type(attri_data1)
<class 'dict'>

DataFrame:インデックスを指定する

>>> attri_data_frame1=DataFrame(attri_data1, index=['a','b','c','d','e'])
>>> attri_data_frame1
    ID      City Birth_year     Name
a  100     Tokyo       1990  Hiroshi
b  101     Osaka       1989    Akiko
c  102     Kyoto       1970     Yuki
d  103  Hokkaido       1954   Satoru
e  104     Tokyo       2014    Steve

2-4-4 行列操作

2-4-4-1 転置

>>> attri_data_frame1.T
                  a      b      c         d      e
ID              100    101    102       103    104
City          Tokyo  Osaka  Kyoto  Hokkaido  Tokyo
Birth_year     1990   1989   1970      1954   2014
Name        Hiroshi  Akiko   Yuki    Satoru  Steve

2-4-4-2 特定列取り出し

>>> attri_data_frame1.Birth_year
a    1990
b    1989
c    1970
d    1954
e    2014
Name: Birth_year, dtype: object
>>> attri_data_frame1[['ID','Birth_year']]
    ID Birth_year
a  100       1990
b  101       1989
c  102       1970
d  103       1954
e  104       2014

2-4-5 データ抽出

>>> attri_data_frame1[attri_data_frame1['City']=='Tokyo']
    ID   City Birth_year     Name
a  100  Tokyo       1990  Hiroshi
e  104  Tokyo       2014    Steve
>>> attri_data_frame1['City']=='Tokyo'
a     True
b    False
c    False
d    False
e     True
Name: City, dtype: bool

条件複数指定

>>> attri_data_frame1[attri_data_frame1['City'].isin(['Tokyo','Osaka'])]
    ID   City Birth_year     Name
a  100  Tokyo       1990  Hiroshi
b  101  Osaka       1989    Akiko
e  104  Tokyo       2014    Steve

2-4-6 データの削除と結合

2-4-6-1 列や行の削除

drop(削除したい列のリスト,axis = 1)

axis =　1 は列

>>> attri_data_frame1.drop(['Birth_year'], axis = 1)
    ID      City     Name
a  100     Tokyo  Hiroshi
b  101     Osaka    Akiko
c  102     Kyoto     Yuki
d  103  Hokkaido   Satoru
e  104     Tokyo    Steve

drop(削除したい行のリスト,axis = 0)

axis = 0　は行

>>> attri_data_frame1.drop(['c','e'], axis = 0)
    ID      City Birth_year     Name
a  100     Tokyo       1990  Hiroshi
b  101     Osaka       1989    Akiko
d  103  Hokkaido       1954   Satoru

上記の操作では、元のdataは、変わらない

>>> attri_data_frame1
    ID      City Birth_year     Name
a  100     Tokyo       1990  Hiroshi
b  101     Osaka       1989    Akiko
c  102     Kyoto       1970     Yuki
d  103  Hokkaido       1954   Satoru
e  104     Tokyo       2014    Steve

以下のオプションinplace = Trueで置き換わる。
※元のデータは無くなるので注意

>>> attri_data_frame1.drop(['c','e'], axis = 0, inplace = True)
>>> attri_data_frame1
    ID      City Birth_year     Name
a  100     Tokyo       1990  Hiroshi
b  101     Osaka       1989    Akiko
d  103  Hokkaido       1954   Satoru

2-4-6-2 データの結合

データの追加

>>> attri_data1 = {'ID':['100','101','102','103','104'],
...               'City':['Tokyo','Osaka','Kyoto','Hokkaido','Tokyo'],
...               'Birth_year':['1990','1989','1970','1954','2014'],
...                'Name':['Hiroshi','Akiko','Yuki','Satoru','Steve']}
>>> attri_data_frame1=DataFrame(attri_data1)
>>> attri_data_frame1
    ID      City Birth_year     Name
0  100     Tokyo       1990  Hiroshi
1  101     Osaka       1989    Akiko
2  102     Kyoto       1970     Yuki
3  103  Hokkaido       1954   Satoru
4  104     Tokyo       2014    Steve
>>> math_pt = [50, 43, 33,76,98]
>>> attri_data_frame1['Math']=math_pt
>>> attri_data_frame1
    ID      City Birth_year     Name  Math
0  100     Tokyo       1990  Hiroshi    50
1  101     Osaka       1989    Akiko    43
2  102     Kyoto       1970     Yuki    33
3  103  Hokkaido       1954   Satoru    76
4  104     Tokyo       2014    Steve    98

データの結合

>>> attri_data2 = {'ID':['100','101','102','105','107'],
...                'Math':[50, 43, 33,76,98],
...                'English':[90, 30, 20,50,30],
...                'Sex':['M', 'F', 'F', 'M', 'M']}
>>> attri_data_frame2=DataFrame(attri_data2)
>>> attri_data_frame2
    ID  Math  English Sex
0  100    50       90   M
1  101    43       30   F
2  102    33       20   F
3  105    76       50   M
4  107    98       30   M
>>> attri_data_frame1
    ID      City Birth_year     Name  Math
0  100     Tokyo       1990  Hiroshi    50
1  101     Osaka       1989    Akiko    43
2  102     Kyoto       1970     Yuki    33
3  103  Hokkaido       1954   Satoru    76
4  104     Tokyo       2014    Steve    98

同じキーを見つけてマージする。
キーは、ID．．．

>>> pd.merge(attri_data_frame1,attri_data_frame2)
    ID   City Birth_year     Name  Math  English Sex
0  100  Tokyo       1990  Hiroshi    50       90   M
1  101  Osaka       1989    Akiko    43       30   F
2  102  Kyoto       1970     Yuki    33       20   F

pandas.merge

>>> pd.merge(attri_data_frame1,attri_data_frame2, how = 'outer')
    ID      City Birth_year     Name  Math  English  Sex
0  100     Tokyo       1990  Hiroshi    50     90.0    M
1  101     Osaka       1989    Akiko    43     30.0    F
2  102     Kyoto       1970     Yuki    33     20.0    F
3  103  Hokkaido       1954   Satoru    76      NaN  NaN
4  104     Tokyo       2014    Steve    98      NaN  NaN
5  105       NaN        NaN      NaN    76     50.0    M
6  107       NaN        NaN      NaN    98     30.0    M

関連
Merge, join, concatenate and compare

2-4-7 集計

「groupbyで、ある特定の列を軸とした集計」

>>> attri_data_frame2.groupby('Sex')['Math'].mean()
Sex
F    38.000000
M    74.666667
Name: Math, dtype: float64
>>> attri_data_frame2.groupby('Sex')['English'].mean()
Sex
F    25.000000
M    56.666667
Name: English, dtype: float64

2-4-8 値のソート

attri_data_frame1.sort_index()で、indexでソートできる。

>>> attri_data_frame1=DataFrame(attri_data1, index=['e','b','a','c','d'])
>>> attri_data_frame1
    ID      City Birth_year     Name
e  100     Tokyo       1990  Hiroshi
b  101     Osaka       1989    Akiko
a  102     Kyoto       1970     Yuki
c  103  Hokkaido       1954   Satoru
d  104     Tokyo       2014    Steve
>>> attri_data_frame1.sort_index()
    ID      City Birth_year     Name
a  102     Kyoto       1970     Yuki
b  101     Osaka       1989    Akiko
c  103  Hokkaido       1954   Satoru
d  104     Tokyo       2014    Steve
e  100     Tokyo       1990  Hiroshi

attri_data_frame1.sort_values(by=['Birth_year'])で、'Birth_year'列の値でソートできる。

>>> attri_data_frame1.sort_values(by=['Birth_year'])
    ID      City Birth_year     Name
c  103  Hokkaido       1954   Satoru
a  102     Kyoto       1970     Yuki
b  101     Osaka       1989    Akiko
e  100     Tokyo       1990  Hiroshi
d  104     Tokyo       2014    Steve

2-4-9 nan(null)の判定

欠損値などを除外するなどの操作する。

2-4-9-1 条件に合致したデータの比較

>>> attri_data_frame1.isin(['Tokyo'])
      ID   City  Birth_year   Name
e  False   True       False  False
b  False  False       False  False
a  False  False       False  False
c  False  False       False  False
d  False   True       False  False

2-4-9-2 nanとnullの例

>>> attri_data_frame1['Name'] = np.nan
>>> attri_data_frame1
    ID      City Birth_year  Name
e  100     Tokyo       1990   NaN
b  101     Osaka       1989   NaN
a  102     Kyoto       1970   NaN
c  103  Hokkaido       1954   NaN
d  104     Tokyo       2014   NaN
>>> attri_data_frame1.isnull()
      ID   City  Birth_year  Name
e  False  False       False  True
b  False  False       False  True
a  False  False       False  True
c  False  False       False  True
d  False  False       False  True

nullの数を数える。

>>> attri_data_frame1.isnull().sum()
ID            0
City          0
Birth_year    0
Name          5
dtype: int64

練習問題

Math >= 50の抽出

>>> attri_data_frame2
    ID  Math  English Sex  Money
0  100    50       90   M   1000
1  101    43       30   F   2000
2  102    33       20   F    500
3  105    76       50   M    300
4  107    98       30   M    700
>>> attri_data_frame2[attri_data_frame2['Math'] >= 50]
    ID  Math  English Sex  Money
0  100    50       90   M   1000
3  105    76       50   M    300
4  107    98       30   M    700

Money男女別平均

>>> attri_data_frame2['Money'] = np.array([1000,2000, 500,300,700])
>>> attri_data_frame2
    ID  Math  English Sex  Money
0  100    50       90   M   1000
1  101    43       30   F   2000
2  102    33       20   F    500
3  105    76       50   M    300
4  107    98       30   M    700
>>> attri_data_frame2.groupby('Sex')['Money'].mean()
Sex
F    1250.000000
M     666.666667
Name: Money, dtype: float64

欠損値処理させたいかも．．．

>>> attri_data_frame2['Money'].mean()
900.0
>>> attri_data_frame2['Math'].mean()
60.0
>>> attri_data_frame2['English'].mean()
44.0

csv入出力

csvファイルの書き出し、読込、index有り無しを追記しておきます。
保存ファイルのindex有り無しで読み込む必要アリですですね。

>>> attri_data_frame2.to_csv(r'samole0.csv',index=False)
>>> attri_data_frame2.to_csv(r'samole1.csv',index=True)
>>> df = pd.read_csv("samole0.csv")
>>> df
    ID  Math  English Sex  Money
0  100    50       90   M   1000
1  101    43       30   F   2000
2  102    33       20   F    500
3  105    76       50   M    300
4  107    98       30   M    700
>>> df = pd.read_csv("samole1.csv")
>>> df
   Unnamed: 0   ID  Math  English Sex  Money
0           0  100    50       90   M   1000
1           1  101    43       30   F   2000
2           2  102    33       20   F    500
3           3  105    76       50   M    300
4           4  107    98       30   M    700
>>> df = pd.read_csv("samole1.csv", index_col=0)
>>> df
    ID  Math  English Sex  Money
0  100    50       90   M   1000
1  101    43       30   F   2000
2  102    33       20   F    500
3  105    76       50   M    300
4  107    98       30   M    700

indexをつけないと、．．．やはり意識した方がいいですね。

>>> df.to_csv(r'samole3.csv')
>>> df_ = pd.read_csv("samole3.csv")
>>> df_
   Unnamed: 0   ID  Math  English Sex  Money
0           0  100    50       90   M   1000
1           1  101    43       30   F   2000
2           2  102    33       20   F    500
3           3  105    76       50   M    300
4           4  107    98       30   M    700
>>> df_ = pd.read_csv("samole3.csv", index_col=0)
>>> df_
    ID  Math  English Sex  Money
0  100    50       90   M   1000
1  101    43       30   F   2000
2  102    33       20   F    500
3  105    76       50   M    300
4  107    98       30   M    700

まとめ

・本書のPandasの基礎に沿って、まとめた
・Pandasもグラフ書いたり、いろいろ処理も出来るが今回まとめた範囲を理解すれば使えると思う

・さらなる学習のためにおまけに比較的分かり易いTutorialにリンクを記載した

おまけ

Package overview
Getting started tutorials
What kind of data does pandas handle?
How do I read and write tabular data?
How do I select a subset of a DataFrame?
How to create plots in pandas?
How to create new columns derived from existing columns?
How to calculate summary statistics?
How to reshape the layout of tables?
How to combine data from multiple tables?
How to handle time series data with ease?
How to manipulate textual data?
Comparison with other tools

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up