
Pandas: About Data Frames--05: Transforming Data Frames

Last updated at 2022-06-22, posted at 2022-06-20

Transforming Data Frames

import pandas as pd
df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [1.2, 3.4, 5.6, 7.8, 9.0]
})
df

	a	b
0	1	1.2
1	2	3.4
2	3	5.6
3	4	7.8
4	5	9.0

1. Applying a function to the columns or rows of a data frame: `apply`

apply(func: 'AggFuncType', axis: 'Axis' = 0, raw: 'bool' = False,
      result_type=None, args=(), **kwargs)

import numpy as np
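
The signature above includes args and **kwargs, which are forwarded to the applied function. The following is a minimal sketch of that forwarding; the helper scale and its parameters factor and offset are hypothetical names introduced only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [1.2, 3.4, 5.6, 7.8, 9.0]
})

# Hypothetical helper: with the default axis=0 it receives one column
# (a Series) per call, followed by the extra arguments.
def scale(col, factor, offset=0):
    return col * factor + offset

# args supplies extra positional arguments; other keywords (here offset)
# are passed through to the function as keyword arguments.
scaled = df.apply(scale, args=(10,), offset=1)
print(scaled)
```

Keyword names that coincide with apply's own parameters (such as axis or raw) would be consumed by apply itself, so the helper's parameters should use different names.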

1.1. Functions that operate on scalars

The following three examples all give the same result.

np.sqrt(df)

	a	b
0	1.000000	1.095445
1	1.414214	1.843909
2	1.732051	2.366432
3	2.000000	2.792848
4	2.236068	3.000000

df.apply(np.sqrt) # axis=0

	a	b
0	1.000000	1.095445
1	1.414214	1.843909
2	1.732051	2.366432
3	2.000000	2.792848
4	2.236068	3.000000

df.apply(np.sqrt, axis=1)

	a	b
0	1.000000	1.095445
1	1.414214	1.843909
2	1.732051	2.366432
3	2.000000	2.792848
4	2.236068	3.000000
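
Rather than comparing the three printed tables by eye, the equality can be checked programmatically. This is a small sketch using np.allclose; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [1.2, 3.4, 5.6, 7.8, 9.0]
})

direct = np.sqrt(df)                 # ufunc applied to the whole data frame
by_column = df.apply(np.sqrt)        # axis=0: one column per call
by_row = df.apply(np.sqrt, axis=1)   # axis=1: one row per call

# Compare the underlying values with a floating-point tolerance.
same = (np.allclose(direct.to_numpy(), by_column.to_numpy())
        and np.allclose(direct.to_numpy(), by_row.to_numpy()))
print(same)
```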

1.2. Functions that operate on 1-D arrays

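For functions that operate on a 1-D array (vector), such as sum, mean, std, var, and median, the result depends on the axis argument.

axis can be given as 0 or 1; recent versions also accept 'index' and 'columns', respectively.

The same results are obtained by passing the data frame to a numpy function and specifying axis there.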

1.2.1. Aggregating by column

df.apply(sum) # axis=0

a    15.0
b    27.0
dtype: float64

df.apply(sum, axis='index')

a    15.0
b    27.0
dtype: float64

np.sum(df) # axis=0

a    15.0
b    27.0
dtype: float64

np.sum(df, axis='index')

a    15.0
b    27.0
dtype: float64

1.2.2. Aggregating by row

df.apply(sum, axis=1)

0     2.2
1     5.4
2     8.6
3    11.8
4    14.0
dtype: float64

df.apply(sum, axis='columns')

0     2.2
1     5.4
2     8.6
3    11.8
4    14.0
dtype: float64

np.sum(df, axis=1)

0     2.2
1     5.4
2     8.6
3    11.8
4    14.0
dtype: float64

np.sum(df, axis='columns')

0     2.2
1     5.4
2     8.6
3    11.8
4    14.0
dtype: float64
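
As a complement to the apply and numpy calls above, the same reductions are also available as methods of the data frame itself and take the same axis values. The sketch below is an alternative spelling, not something the original examples use.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [1.2, 3.4, 5.6, 7.8, 9.0]
})

# axis='index' (same as axis=0): collapse the rows, one value per column.
col_totals = df.sum(axis='index')

# axis='columns' (same as axis=1): collapse the columns, one value per row.
row_totals = df.sum(axis='columns')
row_means = df.mean(axis='columns')

print(col_totals)
print(row_totals)
print(row_means)
```

A way to remember the naming: the axis argument names the axis that disappears in the result, so axis='index' leaves one value per column.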

2. Applying a function to each cell of a data frame: applymap

Applies a function¹ to each cell (scalar) of the data frame.

applymap(func: 'PythonFuncType', na_action: 'str | None' = None, **kwargs)

df.applymap(lambda x: x**2 + 3*x + 5) # not recommended

	a	b
0	9	10.04
1	15	26.76
2	23	53.16
3	33	89.24
4	45	113.00

However, when a vectorized alternative exists (which is usually the case), applymap should be avoided. Operating on the data frame directly runs faster.

df**2 + 3*df + 5 # recommended

	a	b
0	9	10.04
1	15	26.76
2	23	53.16
3	33	89.24
4	45	113.00
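
The two approaches can be compared for both correctness and speed. The sketch below makes a few assumptions: the helper names poly and elementwise and the larger frame big are hypothetical, and the fallback reflects that pandas 2.1 and later deprecate applymap in favor of DataFrame.map. Timings depend on the machine and the pandas version, so none are quoted here.

```python
import timeit

import numpy as np
import pandas as pd

def poly(x):
    # Works on a scalar and, through vectorized operators, on a whole data frame.
    return x**2 + 3*x + 5

def elementwise(frame, func):
    # Assumption: DataFrame.map exists from pandas 2.1 on, where applymap
    # is deprecated; older versions only provide applymap.
    if hasattr(pd.DataFrame, 'map'):
        return frame.map(func)
    return frame.applymap(func)

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [1.2, 3.4, 5.6, 7.8, 9.0]
})

# Both approaches should agree up to floating-point rounding.
same = np.allclose(elementwise(df, poly).to_numpy(dtype=float),
                   poly(df).to_numpy(dtype=float))
print(same)

# Hypothetical larger frame, so the timing is not dominated by call overhead.
rng = np.random.default_rng(0)
big = pd.DataFrame(rng.random((50_000, 2)), columns=['a', 'b'])

t_cell = timeit.timeit(lambda: elementwise(big, poly), number=3)
t_vec = timeit.timeit(lambda: poly(big), number=3)
print(f'cell by cell: {t_cell:.3f} s, vectorized: {t_vec:.3f} s')
```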

3. Data transformation: transform

transform(func: 'AggFuncType', axis: 'Axis' = 0, *args, **kwargs)

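The function is given as a lambda or as a function name in quotes (a string such as 'sqrt').

For functions like the ones below, which take a scalar and return a scalar, axis=0 and axis=1 give the same values.

With axis=1, however, the data type of a column may change: the integer column a and the float column b are processed together along each row, so a comes back as a float.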

3.1. Specifying a single function

df.transform(lambda x: x + 1) # axis=0

	a	b
0	2	2.2
1	3	4.4
2	4	6.6
3	5	8.8
4	6	10.0

df.transform(lambda x: x + 1, axis=1)

	a	b
0	2.0	2.2
1	3.0	4.4
2	4.0	6.6
3	5.0	8.8
4	6.0	10.0
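
The difference between the two outputs above (2 versus 2.0 in column a) is a dtype change. The following sketch inspects the dtypes and shows one possible way to restore the original types afterwards; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [1.2, 3.4, 5.6, 7.8, 9.0]
})

by_col = df.transform(lambda x: x + 1)          # axis=0: column a stays integer
by_row = df.transform(lambda x: x + 1, axis=1)  # axis=1: column a becomes float

print(by_col.dtypes)
print(by_row.dtypes)

# One option: cast back to the original dtypes, column by column.
# This is only safe when the values are still whole numbers.
restored = by_row.astype(df.dtypes.to_dict())
print(restored.dtypes)
```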

3.2. Specifying multiple functions

To specify multiple functions, pass them as a list.

df.transform(['sqrt', lambda x: x**2])

	a		b
	sqrt	<lambda>	sqrt	<lambda>
0	1.000000	1	1.095445	1.44
1	1.414214	4	1.843909	11.56
2	1.732051	9	2.366432	31.36
3	2.000000	16	2.792848	60.84
4	2.236068	25	3.000000	81.00
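
The two-row header in the output above indicates hierarchical (MultiIndex) columns. The sketch below shows one way to select from and flatten such a result. To get readable labels it swaps in a named function, square (a hypothetical helper), for the lambda, and passes np.sqrt as a callable instead of the string 'sqrt'.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [1.2, 3.4, 5.6, 7.8, 9.0]
})

# Hypothetical named helper; its name becomes the second-level column label.
def square(x):
    return x**2

res = df.transform([np.sqrt, square])
print(res.columns.tolist())

# Select one (column, function) pair with a tuple ...
a_squared = res[('a', 'square')]
# ... or every result for one original column with its name.
b_results = res['b']

# Flatten the two levels into single labels such as 'a_sqrt'.
flat = res.copy()
flat.columns = ['_'.join(pair) for pair in res.columns]
print(flat.columns.tolist())
```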

3.3. Applying to a grouped data frame

df2 = pd.DataFrame({
    'gender': ['male', 'male', 'female', 'male', 'female', 'male', 'female'],
    'x': [2, 3, 1, 4, 6, 3, 3]
})
df2

	gender	x
0	male	2
1	male	3
2	female	1
3	male	4
4	female	6
5	male	3
6	female	3

gdf = df2.groupby('gender')
df2['Mean'] = gdf['x'].transform('mean')
df2

	gender	x	Mean
0	male	2	3.000000
1	male	3	3.000000
2	female	1	3.333333
3	male	4	3.000000
4	female	6	3.333333
5	male	3	3.000000
6	female	3	3.333333

This creates a variable Mean that holds, for every row, the mean of x for that row's gender.

df2[df2.gender == 'male']['x'].mean() # mean of the 4 males

3.0

df2[df2.gender == 'female']['x'].mean() # mean of the 3 females

3.3333333333333335
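
What sets transform apart on a grouped object is that the result keeps the shape and index of the original data, whereas a plain aggregation returns one value per group. The sketch below contrasts the two and uses the aligned result to compute each row's deviation from its group mean; the column name Deviation is a hypothetical choice.

```python
import pandas as pd

df2 = pd.DataFrame({
    'gender': ['male', 'male', 'female', 'male', 'female', 'male', 'female'],
    'x': [2, 3, 1, 4, 6, 3, 3]
})

grouped = df2.groupby('gender')['x']

per_group = grouped.mean()           # aggregation: one value per gender
per_row = grouped.transform('mean')  # transform: one value per original row

print(per_group)
print(per_row)

# Because per_row is aligned with df2, it can be used directly in arithmetic.
df2['Deviation'] = df2['x'] - per_row
print(df2)
```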

4. For those who want to know more

help(df.apply)     # Help on method apply in module pandas.core.frame
help(df.applymap)  # Help on method applymap in module pandas.core.frame
help(df.transform) # Help on method transform in module pandas.core.frame

¹ A function that takes a scalar as its argument and returns a scalar result.
