NumPy
Extracting elements by condition
>>> import numpy as np
# Create an array
>>> array1 = np.array([16, 10, 3])
>>> array1.shape
(3,)
# A list of booleans can be used as a mask
>>> array1[[True, False, True]]
array([16, 3])
# A comparison returns a boolean array
>>> print(array1 > 5)
[ True True False]
# Combining the two gives conditional extraction
>>> array1[array1 > 5]
array([16, 10])
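Several conditions can be combined in the same way. The following is a minimal sketch reusing array1 from above; the variable names both, either and negated are only illustrative. Each comparison needs its own parentheses because & and | bind more tightly than > and <.

```python
import numpy as np

array1 = np.array([16, 10, 3])

# Element-wise AND: greater than 5 and less than 15
both = array1[(array1 > 5) & (array1 < 15)]

# Element-wise OR: less than 5 or greater than 15
either = array1[(array1 < 5) | (array1 > 15)]

# Element-wise NOT: everything that is not greater than 5
negated = array1[~(array1 > 5)]

print(both.tolist())     # [10]
print(either.tolist())   # [16, 3]
print(negated.tolist())  # [3]
```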
bincount
Counts the occurrences of each value in an array of non-negative integers
>>> a1 = np.array([0, 1, 1, 0, 1, 0, 0, 0])
>>> np.bincount(a1)
array([5, 3])
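Position i of the result holds the count of value i, so argmax on it gives the most frequent value. A small sketch using the same a1; minlength is an optional np.bincount parameter that pads the result so values that never occur still get a count of 0.

```python
import numpy as np

a1 = np.array([0, 1, 1, 0, 1, 0, 0, 0])

counts = np.bincount(a1)

# Index of the largest count, i.e. the most frequent value
most_common = int(counts.argmax())

# Pad the result to at least 4 entries (values 2 and 3 never occur)
padded = np.bincount(a1, minlength=4)

print(counts.tolist())   # [5, 3]
print(most_common)       # 0
print(padded.tolist())   # [5, 3, 0, 0]
```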
arange, linspace
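Creating NumPy arrays: arange generates values at a fixed step (stop excluded), linspace generates a fixed number of evenly spaced points (both ends included)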
>>> np.arange(5)
array([0, 1, 2, 3, 4])
>>> np.linspace(-3, 3, 5)
array([-3. , -1.5, 0. , 1.5, 3. ])
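The difference between the two is easiest to see side by side. A short sketch; the values chosen here are arbitrary.

```python
import numpy as np

# arange(start, stop, step): stop is NOT included
stepped = np.arange(0, 10, 2)

# linspace(start, stop, num): stop IS included by default
points = np.linspace(0, 1, 5)

# endpoint=False excludes stop, which changes the spacing
open_end = np.linspace(0, 1, 5, endpoint=False)

print(stepped.tolist())  # [0, 2, 4, 6, 8]
print(points)
print(open_end)
```

points has a spacing of 0.25 and ends at 1.0, while open_end has a spacing of 0.2 and ends at 0.8.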
reshape
Changing the shape (number of dimensions) of an array
>>> line = np.linspace(-3, 3, 10)
# 1-D array
>>> line
array([-3. , -2.33333333, -1.66666667, -1. , -0.33333333,
0.33333333, 1. , 1.66666667, 2.33333333, 3. ])
# Convert to a 2-D array (10 rows, 1 column)
>>> line.reshape(10, 1)
array([[-3. ],
[-2.33333333],
[-1.66666667],
[-1. ],
[-0.33333333],
[ 0.33333333],
[ 1. ],
[ 1.66666667],
[ 2.33333333],
[ 3. ]])
# One dimension may be given as -1; its size is inferred from the array length and the other dimension
>>> line.reshape(-1, 1)
array([[-3. ],
[-2.33333333],
[-1.66666667],
[-1. ],
[-0.33333333],
[ 0.33333333],
[ 1. ],
[ 1.66666667],
[ 2.33333333],
[ 3. ]])
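The -1 can sit in any position, as long as the total number of elements divides evenly. A sketch using the same line array; the names col, grid, flat and error_raised are only illustrative.

```python
import numpy as np

line = np.linspace(-3, 3, 10)

col = line.reshape(-1, 1)    # 10 rows inferred  -> shape (10, 1)
grid = line.reshape(2, -1)   # 5 columns inferred -> shape (2, 5)
flat = grid.reshape(-1)      # back to 1-D        -> shape (10,)

# 10 elements cannot be split into 3 equal rows, so NumPy raises ValueError
try:
    line.reshape(3, -1)
    error_raised = False
except ValueError:
    error_raised = True

print(col.shape, grid.shape, flat.shape, error_raised)
```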
Arithmetic on NumPy arrays
>>> x = np.array([1, 2, 3])
>>> x.mean()
2.0
>>> xc = x - x.mean()
>>> xc
array([-1., 0., 1.])
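Specifying the axis
For a 3-D array:
- axis=0: layers
- axis=1: rows
- axis=2: columns
>>> b = np.array([[[0, 1], [2, 3], [4, 5]], [[0, 1], [2, 3], [4, 5]]])
>>> b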
array([[[0, 1],
[2, 3],
[4, 5]],
[[0, 1],
[2, 3],
[4, 5]]])
>>> b.shape
(2, 3, 2)
>>> b.sum(axis=0)
array([[ 0, 2],
[ 4, 6],
[ 8, 10]])
>>> b.sum(axis=1)
array([[6, 9],
[6, 9]])
>>> b.sum(axis=2)
array([[1, 5, 9],
[1, 5, 9]])
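Another way to read the axis argument: the axis that is summed over disappears from the shape. A sketch that rebuilds b from the output shown above and checks the resulting shapes; keepdims=True is an optional parameter that keeps the reduced axis with length 1.

```python
import numpy as np

b = np.array([[[0, 1], [2, 3], [4, 5]],
              [[0, 1], [2, 3], [4, 5]]])   # shape (2, 3, 2)

# Shape of the result for each axis: the summed axis is removed
shapes = {axis: b.sum(axis=axis).shape for axis in range(b.ndim)}

# keepdims=True keeps the summed axis as a length-1 dimension
kept = b.sum(axis=1, keepdims=True).shape

# Without axis, every element is summed into a single scalar
total = int(b.sum())

print(shapes)  # {0: (3, 2), 1: (2, 2), 2: (2, 3)}
print(kept)    # (2, 1, 2)
print(total)   # 30
```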
Pandas
Reading a CSV file
Loading and inspecting
>>> import pandas as pd
>>> df = pd.read_csv('housing.csv')
>>> df.head()
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 y
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
>>> len(df)
506
>>> df.describe()
x1 x2 x3 x4 x5 ... x10 x11 x12 x13 y
count 506.000000 506.000000 506.000000 506.000000 506.000000 ... 506.000000 506.000000 506.000000 506.000000 506.000000
mean 3.613524 11.363636 11.136779 0.069170 0.554695 ... 408.237154 18.455534 356.674032 12.653063 22.532806
std 8.601545 23.322453 6.860353 0.253994 0.115878 ... 168.537116 2.164946 91.294864 7.141062 9.197104
min 0.006320 0.000000 0.460000 0.000000 0.385000 ... 187.000000 12.600000 0.320000 1.730000 5.000000
25% 0.082045 0.000000 5.190000 0.000000 0.449000 ... 279.000000 17.400000 375.377500 6.950000 17.025000
50% 0.256510 0.000000 9.690000 0.000000 0.538000 ... 330.000000 19.050000 391.440000 11.360000 21.200000
75% 3.677082 12.500000 18.100000 0.000000 0.624000 ... 666.000000 20.200000 396.225000 16.955000 25.000000
max 88.976200 100.000000 27.740000 1.000000 0.871000 ... 711.000000 22.000000 396.900000 37.970000 50.000000
>>> df.shape
(506, 14)
>>> df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 506 entries, 0 to 505
Data columns (total 14 columns):
x1 506 non-null float64
x2 506 non-null float64
x3 506 non-null float64
x4 506 non-null int64
x5 506 non-null float64
x6 506 non-null float64
x7 506 non-null float64
x8 506 non-null float64
x9 506 non-null int64
x10 506 non-null int64
x11 506 non-null float64
x12 506 non-null float64
x13 506 non-null float64
y 506 non-null float64
dtypes: float64(11), int64(3)
memory usage: 55.4 KB
Extracting data
Selecting columns
- A single column label returns a Series
>>> type(df['x1'])
<class 'pandas.core.series.Series'>
>>> df['x1']
0 0.00632
1 0.02731
...
504 0.10959
505 0.04741
Name: x1, Length: 506, dtype: float64
- A list of column labels returns a DataFrame
>>> type(df[['x1']])
<class 'pandas.core.frame.DataFrame'>
>>> df[['x1']]
x1
0 0.00632
1 0.02731
.. ...
504 0.10959
505 0.04741
[506 rows x 1 columns]
# Multiple columns
>>> df[['x1', 'x2']]
x1 x2
0 0.00632 18.0
1 0.02731 0.0
.. ... ...
504 0.10959 0.0
505 0.04741 0.0
[506 rows x 2 columns]
Selecting rows
>>> df[500:]
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 y
500 0.22438 0.0 9.69 0 0.585 6.027 79.7 2.4982 6 391 19.2 396.90 14.33 16.8
501 0.06263 0.0 11.93 0 0.573 6.593 69.1 2.4786 1 273 21.0 391.99 9.67 22.4
502 0.04527 0.0 11.93 0 0.573 6.120 76.7 2.2875 1 273 21.0 396.90 9.08 20.6
503 0.06076 0.0 11.93 0 0.573 6.976 91.0 2.1675 1 273 21.0 396.90 5.64 23.9
504 0.10959 0.0 11.93 0 0.573 6.794 89.3 2.3889 1 273 21.0 393.45 6.48 22.0
505 0.04741 0.0 11.93 0 0.573 6.030 80.8 2.5050 1 273 21.0 396.90 7.88 11.9
>>> df[:5]
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 y
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
3 0.03237 0.0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94 33.4
4 0.06905 0.0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.90 5.33 36.2
>>> df[100:150:10]
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 y
100 0.14866 0.0 8.56 0 0.520 6.727 79.9 2.7778 5 384 20.9 394.76 9.42 27.5
110 0.10793 0.0 8.56 0 0.520 6.195 54.4 2.7778 5 384 20.9 393.49 13.00 21.7
120 0.06899 0.0 25.65 0 0.581 5.870 69.7 2.2577 2 188 19.1 389.15 14.37 22.0
130 0.34006 0.0 21.89 0 0.624 6.458 98.9 2.1185 4 437 21.2 395.04 12.60 19.2
140 0.29090 0.0 21.89 0 0.624 6.174 93.6 1.6119 4 437 21.2 388.08 24.16 14.0
Selecting rows and columns
By label (loc); the end of a slice is included
>>> df.loc[10:15, ['x1', 'x2']]
x1 x2
10 0.22489 12.5
11 0.11747 12.5
12 0.09378 12.5
13 0.62976 0.0
14 0.63796 0.0
15 0.62739 0.0
By integer position (iloc); the end of a slice is excluded
>>> df.iloc[10:15, 0:2]
x1 x2
10 0.22489 12.5
11 0.11747 12.5
12 0.09378 12.5
13 0.62976 0.0
14 0.63796 0.0
>>> df.iloc[:, :-1]  # all rows, every column except the last
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
.. ... ... ... .. ... ... ... ... .. ... ... ... ...
504 0.10959 0.0 11.93 0 0.573 6.794 89.3 2.3889 1 273 21.0 393.45 6.48
505 0.04741 0.0 11.93 0 0.573 6.030 80.8 2.5050 1 273 21.0 396.90 7.88
[506 rows x 13 columns]
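The loc and iloc outputs above differ by one row because loc slices by label and includes the end, while iloc slices by position and excludes it. A small sketch on a toy DataFrame (the data and the index starting at 10 are made up for illustration) makes the difference explicit.

```python
import pandas as pd

# Toy data; the index labels deliberately do not start at 0
df = pd.DataFrame(
    {'x1': [0.1, 0.2, 0.3, 0.4], 'x2': [1, 2, 3, 4]},
    index=[10, 11, 12, 13],
)

# Label-based: rows labelled 10, 11 and 12 (end label included)
by_label = df.loc[10:12, ['x1']]

# Position-based: rows at positions 0 and 1 (end position excluded)
by_position = df.iloc[0:2, 0:1]

print(by_label.index.tolist())     # [10, 11, 12]
print(by_position.index.tolist())  # [10, 11]
```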
Extracting rows by condition
This works basically the same way as in NumPy
>>> df['x1'] > 50
0 False
1 False
...
504 False
505 False
Name: x1, Length: 506, dtype: bool
>>> df[df['x1'] > 50]
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 y
380 88.9762 0.0 18.1 0 0.671 6.968 91.9 1.4165 24 666 20.2 396.90 17.21 10.4
405 67.9208 0.0 18.1 0 0.693 5.683 100.0 1.4254 24 666 20.2 384.97 22.98 5.0
410 51.1358 0.0 18.1 0 0.597 5.757 100.0 1.4130 24 666 20.2 2.60 10.11 15.0
418 73.5341 0.0 18.1 0 0.679 5.957 100.0 1.8026 24 666 20.2 16.45 20.62 8.8
>>> df[(df['x1'] > 50) & (df['x6'] > 6)]
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 y
380 88.9762 0.0 18.1 0 0.671 6.968 91.9 1.4165 24 666 20.2 396.9 17.21 10.4
Note, however, that the condition must be passed as a Series (1-D).
Passing a DataFrame (2-D) does not filter rows as intended.
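Difference between Series and DataFrame
(df in the following examples is a different dataset, containing international football match results)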
>>> df.head(3)
date home_team away_team home_score away_score tournament city country neutral
0 1872-11-30 Scotland England 0 0 Friendly Glasgow Scotland False
1 1873-03-08 England Scotland 4 2 Friendly London England False
2 1874-03-07 Scotland England 2 1 Friendly Glasgow Scotland False
# A single label returns a Series; a list of labels returns a DataFrame
>>> type(df['home_team'])
<class 'pandas.core.series.Series'>
>>> type(df[['home_team']])
<class 'pandas.core.frame.DataFrame'>
>>> type(df[['home_team', 'away_team']])
<class 'pandas.core.frame.DataFrame'>
>>> type(df['home_team'] == 'England')
<class 'pandas.core.series.Series'>
>>> type(df[['home_team']] == 'England')
<class 'pandas.core.frame.DataFrame'>
>>> type(df[['home_team', 'away_team']] == 'England')
<class 'pandas.core.frame.DataFrame'>
# How each one is displayed
# Series
>>> df['home_team'] == 'England'
0 False
1 True
2 False
...
39005 False
39006 False
39007 False
Name: home_team, Length: 39008, dtype: bool
# DataFrame
>>> df[['home_team']] == 'England'
home_team
0 False
1 True
2 False
... ...
39005 False
39006 False
39007 False
[39008 rows x 1 columns]
# DataFrame
>>> df[['home_team', 'away_team']] == 'England'
home_team away_team
0 False True
1 True False
2 False True
... ... ...
39005 False False
39006 False False
39007 False False
[39008 rows x 2 columns]
Using each one as a condition
# Series (works as intended)
>>> df[df['home_team'] == 'England']
date home_team away_team home_score away_score tournament city country neutral
1 1873-03-08 England Scotland 4 2 Friendly London England False
3 1875-03-06 England Scotland 2 2 Friendly London England False
... ... ... ... ... ... ... ... ... ...
38881 2018-03-27 England Italy 1 1 Friendly London England False
38981 2018-06-02 England Nigeria 2 1 Friendly London England False
[480 rows x 9 columns]
# DataFrame (rows are not filtered, and every cell other than the matches becomes NaN)
>>> df[df[['home_team']] == 'England']
date home_team away_team home_score away_score tournament city country neutral
0 NaN NaN NaN NaN NaN NaN NaN NaN NaN
1 NaN England NaN NaN NaN NaN NaN NaN NaN
2 NaN NaN NaN NaN NaN NaN NaN NaN NaN
... ... ... ... ... ... ... ... ... ...
39006 NaN NaN NaN NaN NaN NaN NaN NaN NaN
39007 NaN NaN NaN NaN NaN NaN NaN NaN NaN
[39008 rows x 9 columns]
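The two behaviours can be reproduced on a few rows. The sketch below uses made-up toy data with only two of the columns; indexing with a boolean Series selects rows, whereas indexing with a boolean DataFrame keeps the original shape and masks non-matching cells with NaN.

```python
import pandas as pd

# Toy data, not the real dataset
df = pd.DataFrame({
    'home_team': ['Scotland', 'England', 'Scotland'],
    'home_score': [0, 4, 2],
})

mask_series = df['home_team'] == 'England'    # 1-D boolean Series
mask_frame = df[['home_team']] == 'England'   # 2-D boolean DataFrame

rows = df[mask_series]    # row filter: only the matching row remains
masked = df[mask_frame]   # cell mask: same shape, non-matches become NaN

print(rows.shape)    # (1, 2)
print(masked.shape)  # (3, 2)
```

If only a DataFrame mask is at hand, picking its single column (mask_frame['home_team']) turns it back into a Series that can be used as a row filter.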
any
- On a boolean DataFrame, any can be used to select rows where at least one column is True
- any(axis=1) reduces the DataFrame to a Series
# Without any
>>> type(df[['home_team', 'away_team']] == 'England')
<class 'pandas.core.frame.DataFrame'>
>>> df[['home_team', 'away_team']] == 'England'
home_team away_team
0 False True
1 True False
2 False True
... ... ...
39005 False False
39006 False False
39007 False False
[39008 rows x 2 columns]
# With any
>>> type((df[['home_team', 'away_team']] == 'England').any(axis=1))
<class 'pandas.core.series.Series'>
>>> (df[['home_team', 'away_team']] == 'England').any(axis=1)
0 True
1 True
2 True
...
39005 False
39006 False
39007 False
Length: 39008, dtype: bool
# Filtering rows with any
>>> df[(df[['home_team', 'away_team']] == 'England').any(axis=1)]
date home_team away_team home_score away_score tournament city country neutral
0 1872-11-30 Scotland England 0 0 Friendly Glasgow Scotland False
1 1873-03-08 England Scotland 4 2 Friendly London England False
... ... ... ... ... ... ... ... ... ...
38881 2018-03-27 England Italy 1 1 Friendly London England False
38981 2018-06-02 England Nigeria 2 1 Friendly London England False
[976 rows x 9 columns]
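The counterpart of any is all, which requires every column to be True. A sketch on made-up toy rows (the fourth row is artificial, included only so that all has something to match):

```python
import pandas as pd

# Toy data, not the real dataset
df = pd.DataFrame({
    'home_team': ['Scotland', 'England', 'Wales', 'England'],
    'away_team': ['England', 'Scotland', 'Ireland', 'England'],
})

is_england = df[['home_team', 'away_team']] == 'England'

# At least one of the two columns is 'England'
either = df[is_england.any(axis=1)]

# Both columns are 'England'
both = df[is_england.all(axis=1)]

print(either.index.tolist())  # [0, 1, 3]
print(both.index.tolist())    # [3]
```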
Filtering with query
>>> df.query("home_team == 'Japan' | away_team == 'Japan'")
date home_team away_team ... city country neutral
443 1917-05-07 Japan Philippines ... Tokyo Japan False
571 1921-05-30 Japan Philippines ... Shanghai China True
... ... ... ... ... ... ... ...
38916 2018-03-27 Ukraine Japan ... Liège Belgium True
38957 2018-05-30 Japan Ghana ... Yokohama Japan False
[597 rows x 9 columns]
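Inside a query string, a Python variable can be referenced by prefixing it with @, which avoids building the string by hand. A sketch on toy data; the variable name team is an assumption for illustration, and the boolean-mask version is shown for comparison.

```python
import pandas as pd

# Toy data, not the real dataset
df = pd.DataFrame({
    'home_team': ['Japan', 'Ukraine', 'Scotland'],
    'away_team': ['Ghana', 'Japan', 'England'],
})

team = 'Japan'

# @team refers to the Python variable defined above
by_query = df.query("home_team == @team | away_team == @team")

# Equivalent boolean-mask form
by_mask = df[(df['home_team'] == team) | (df['away_team'] == team)]

print(by_query.index.tolist())  # [0, 1]
```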
Basic statistics
Mean (mean)
>>> df.mean()
x1 3.613524
x2 11.363636
x3 11.136779
x4 0.069170
x5 0.554695
x6 6.284634
x7 68.574901
x8 3.795043
x9 9.549407
x10 408.237154
x11 18.455534
x12 356.674032
x13 12.653063
y 22.532806
dtype: float64
>>> df[['x1', 'x2']].mean()
x1 3.613524
x2 11.363636
dtype: float64
Standard deviation (std); pandas computes the sample standard deviation (ddof=1) by default
>>> df.std()
x1 8.601545
x2 23.322453
x3 6.860353
x4 0.253994
x5 0.115878
x6 0.702617
x7 28.148861
x8 2.105710
x9 8.707259
x10 168.537116
x11 2.164946
x12 91.294864
x13 7.141062
y 9.197104
dtype: float64
Maximum (max)
>>> df
a b c d
0 11 12 13 14
1 21 22 23 24
2 31 32 33 34
>>> df.max()
a 31
b 32
c 33
d 34
dtype: int64
>>> df.max(axis=1)
0 14
1 24
2 34
dtype: int64
Arithmetic on DataFrame columns
>>> df.head(3)
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 y
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
1 0.02731 0.0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.90 9.14 21.6
2 0.02729 0.0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03 34.7
>>> df.mean()
x1 3.613524
x2 11.363636
x3 11.136779
x4 0.069170
x5 0.554695
x6 6.284634
x7 68.574901
x8 3.795043
x9 9.549407
x10 408.237154
x11 18.455534
x12 356.674032
x13 12.653063
y 22.532806
>>> df_c = df - df.mean()
>>> df_c.head(3)
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 y
0 -3.607204 6.636364 -8.826779 -0.06917 -0.016695 0.290366 -3.374901 0.294957 -8.549407 -112.237154 -3.155534 40.225968 -7.673063 1.467194
1 -3.586214 -11.363636 -4.066779 -0.06917 -0.085695 0.136366 10.325099 1.172057 -7.549407 -166.237154 -0.655534 40.225968 -3.513063 -0.932806
2 -3.586234 -11.363636 -4.066779 -0.06917 -0.085695 0.900366 -7.474901 1.172057 -7.549407 -166.237154 -0.655534 36.155968 -8.623063 12.167194
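df - df.mean() works because the Series of column means is aligned with the DataFrame by column label and subtracted from every row. As a further illustration (not part of the original notes), dividing the centered values by df.std() gives standardized columns. A sketch on toy data:

```python
import pandas as pd

# Toy data chosen so the results are easy to check by hand
df = pd.DataFrame({'a': [1.0, 2.0, 3.0], 'b': [10.0, 20.0, 30.0]})

# Centering: each column's mean is subtracted from that column
df_c = df - df.mean()

# Standardizing: divide by the sample standard deviation (ddof=1)
df_z = df_c / df.std()

print(df_c['b'].tolist())  # [-10.0, 0.0, 10.0]
print(df_z['b'].tolist())  # [-1.0, 0.0, 1.0]
```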
Adding new columns
# Before adding
>>> df.head()
date home_team away_team home_score away_score tournament city country neutral
0 1872-11-30 Scotland England 0 0 Friendly Glasgow Scotland False
1 1873-03-08 England Scotland 4 2 Friendly London England False
2 1874-03-07 Scotland England 2 1 Friendly Glasgow Scotland False
3 1875-03-06 England Scotland 2 2 Friendly London England False
4 1876-03-04 Scotland England 3 0 Friendly Glasgow Scotland False
# Add the columns
>>> df['year'] = pd.to_numeric([date.split('-')[0] for date in df['date']])
>>> df['month'] = pd.to_numeric([date.split('-')[1] for date in df['date']])
>>> df['day'] = pd.to_numeric([date.split('-')[2] for date in df['date']])
# Check the result
>>> df.head()
date home_team away_team home_score away_score tournament city country neutral year month day
0 1872-11-30 Scotland England 0 0 Friendly Glasgow Scotland False 1872 11 30
1 1873-03-08 England Scotland 4 2 Friendly London England False 1873 3 8
2 1874-03-07 Scotland England 2 1 Friendly Glasgow Scotland False 1874 3 7
3 1875-03-06 England Scotland 2 2 Friendly London England False 1875 3 6
4 1876-03-04 Scotland England 3 0 Friendly Glasgow Scotland False 1876 3 4
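An alternative to splitting the strings by hand is to parse the column once with pd.to_datetime and read the parts through the .dt accessor. A sketch on two toy rows, assuming the dates are in YYYY-MM-DD form as shown above:

```python
import pandas as pd

# Toy data in the same date format as the dataset above
df = pd.DataFrame({'date': ['1872-11-30', '1873-03-08']})

# Parse the strings into datetime values once
parsed = pd.to_datetime(df['date'])

# Extract the components as integer columns
df['year'] = parsed.dt.year
df['month'] = parsed.dt.month
df['day'] = parsed.dt.day

print(df['year'].tolist())   # [1872, 1873]
print(df['month'].tolist())  # [11, 3]
print(df['day'].tolist())    # [30, 8]
```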
Sorting
>>> df.sort_values('home_score', ascending=False).head()
date home_team away_team home_score away_score tournament city country neutral year month day
23569 2001-04-11 Australia American Samoa 31 0 FIFA World Cup qualification Coffs Harbour Australia False 2001 4 11
10860 1979-08-30 Fiji Kiribati 24 0 South Pacific Games Nausori Fiji False 1979 8 30
23566 2001-04-09 Australia Tonga 22 0 FIFA World Cup qualification Coffs Harbour Australia False 2001 4 9
22344 2000-02-14 Kuwait Bhutan 20 0 AFC Asian Cup qualification Kuwait City Kuwait False 2000 2 14
22257 2000-01-26 China Guam 19 0 AFC Asian Cup qualification Hanoi Vietnam True 2000 1 26
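sort_values also accepts a list of columns, with one ascending flag per column, which is useful for breaking ties. A sketch on toy data (team names and scores are made up):

```python
import pandas as pd

# Toy data with a tie in home_score
df = pd.DataFrame({
    'home_team': ['A', 'B', 'C', 'D'],
    'home_score': [3, 5, 5, 1],
    'away_score': [0, 2, 1, 1],
})

# Highest home_score first; ties broken by lowest away_score
ranked = df.sort_values(['home_score', 'away_score'],
                        ascending=[False, True])

print(ranked['home_team'].tolist())  # ['C', 'B', 'A', 'D']
```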
Replacing values
# Replace True with 1
>>> df.replace({True: 1})
date home_team away_team home_score away_score ... country neutral year month day
0 1872-11-30 Scotland England 0 0 ... Scotland False 1872 11 30
1 1873-03-08 England Scotland 4 2 ... England False 1873 3 8
... ... ... ... ... ... ... ... ... ... ... ...
39006 2018-06-04 Armenia Moldova 0 0 ... Austria 1 2018 6 4
39007 2018-06-04 India Kenya 3 0 ... India False 2018 6 4
[39008 rows x 12 columns]
Handling missing values
Drop every row that contains at least one missing value (dropna)
>>> df.dropna()
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 y
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
[506 rows x 14 columns]
Fill missing values (fillna)
>>> df.fillna(0)
x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 y
0 0.00632 18.0 2.31 0 0.538 6.575 65.2 4.0900 1 296 15.3 396.90 4.98 24.0
[506 rows x 14 columns]
Select rows with missing values (isnull)
>>> df[df['x1'].isnull()]
Empty DataFrame
Columns: [x1, x2, x3, x4, x5, x6, x7, x8, x9, x10, x11, x12, x13, y]
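The housing data has no missing values, so the three calls above leave it unchanged. The sketch below uses a toy DataFrame that does contain NaN to show what each call does; filling with the column mean is included as one common variation.

```python
import numpy as np
import pandas as pd

# Toy data with one missing value in each column
df = pd.DataFrame({'x1': [1.0, np.nan, 3.0],
                   'x2': [4.0, 5.0, np.nan]})

dropped = df.dropna()                # keeps only rows with no NaN
filled = df.fillna(0)                # NaN -> 0
filled_mean = df.fillna(df.mean())   # NaN -> mean of its own column
missing_x1 = df[df['x1'].isnull()]   # rows where x1 is missing
per_column = df.isnull().sum()       # number of NaN per column

print(dropped.index.tolist())         # [0]
print(filled['x1'].tolist())          # [1.0, 0.0, 3.0]
print(filled_mean['x2'].tolist())     # [4.0, 5.0, 4.5]
print(missing_x1.index.tolist())      # [1]
print(per_column.tolist())            # [1, 1]
```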
Applying functions to elements, rows, and columns
Apply to each element of a Series (map)
# DataFrame
>>> df
a b c d
0 11 12 13 14
1 21 22 23 24
2 31 32 33 34
>>> type(df)
<class 'pandas.core.frame.DataFrame'>
# Series
>>> df['a']
0 11
1 21
2 31
Name: a, dtype: int64
>>> type(df['a'])
<class 'pandas.core.series.Series'>
# map applies a function to each element of a Series
>>> df['a'].map(lambda x: '[{}]'.format(x))
0 [11]
1 [21]
2 [31]
Name: a, dtype: object
map also accepts a dict (similar to replace, except that values missing from the dict become NaN)
>>> df['a']
0 11
1 21
2 31
Name: a, dtype: int64
>>> df['a'].map({11:'one',21:'two'})
0 one
1 two
2 NaN
Name: a, dtype: object
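The NaN for 31 above is where map and replace differ: map turns values that have no key in the dict into NaN, while replace leaves them untouched. A sketch with the same values; the fallback string 'other' is an arbitrary choice for illustration.

```python
import pandas as pd

s = pd.Series([11, 21, 31], name='a')
mapping = {11: 'one', 21: 'two'}

# map: 31 has no key in the dict, so it becomes NaN
mapped = s.map(mapping)

# Chain fillna to supply a fallback for unmapped values
with_default = s.map(mapping).fillna('other')

# replace: 31 has no key in the dict, so it is kept as-is
replaced = s.replace(mapping)

print(mapped.isna().tolist())   # [False, False, True]
print(with_default.tolist())    # ['one', 'two', 'other']
print(replaced.tolist())        # ['one', 'two', 31]
```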
Apply to each row or column of a DataFrame (apply)
# DataFrame
>>> df
a b c d
0 11 12 13 14
1 21 22 23 24
2 31 32 33 34
>>> type(df)
<class 'pandas.core.frame.DataFrame'>
# apply runs the function on each column of the DataFrame (axis=0 is the default)
>>> df.apply(lambda x: max(x))
a 31
b 32
c 33
d 34
dtype: int64
# The result is a Series
>>> type(df.apply(lambda x: max(x)))
<class 'pandas.core.series.Series'>
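Passing axis=1 makes apply hand each row to the function instead of each column. A sketch rebuilding the same small DataFrame; row_range is a made-up example of a per-row computation.

```python
import pandas as pd

df = pd.DataFrame({'a': [11, 21, 31], 'b': [12, 22, 32],
                   'c': [13, 23, 33], 'd': [14, 24, 34]})

# Default axis=0: the function receives each column as a Series
col_max = df.apply(lambda col: col.max())

# axis=1: the function receives each row as a Series
row_max = df.apply(lambda row: row.max(), axis=1)

# Any per-row computation works, e.g. max minus min of the row
row_range = df.apply(lambda row: row.max() - row.min(), axis=1)

print(col_max.tolist())    # [31, 32, 33, 34]
print(row_max.tolist())    # [14, 24, 34]
print(row_range.tolist())  # [3, 3, 3]
```

For simple reductions such as this, the built-in methods (df.max(), df.max(axis=1)) give the same result and are usually faster than apply.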