# Computing statistics

``````import numpy as np
import pandas as pd

df = pd.DataFrame({
'a': [1, 2, 3, 4, 5],
'b': [1.2, 3.4, 5.6, 7.8, np.nan]
})
df
``````
|   | a | b   |
|---|---|-----|
| 0 | 1 | 1.2 |
| 1 | 2 | 3.4 |
| 2 | 3 | 5.6 |
| 3 | 4 | 7.8 |
| 4 | 5 | NaN |

## 1. Computing the main descriptive statistics: `describe`

``````describe(percentiles=None, include=None, exclude=None, datetime_is_numeric=False)
``````

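For each column, `describe` computes the number of valid observations (NaN excluded; see note 1 at the end of the page), the mean, standard deviation, minimum, first quartile, median, third quartile, and maximum.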

``````df.describe()
``````
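|       | a        | b        |
|-------|----------|----------|
| count | 5.000000 | 4.000000 |
| mean  | 3.000000 | 4.500000 |
| std   | 1.581139 | 2.840188 |
| min   | 1.000000 | 1.200000 |
| 25%   | 2.000000 | 2.850000 |
| 50%   | 3.000000 | 4.500000 |
| 75%   | 4.000000 | 6.150000 |
| max   | 5.000000 | 7.800000 |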
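
The `percentiles` parameter in the signature above controls which percentiles `describe` reports. The following is a minimal sketch, assuming the same `df` as at the top of the page; the chosen percentiles (10%, 50%, 90%) are only an illustration.

```python
import numpy as np
import pandas as pd

# Same example data as at the top of the page.
df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [1.2, 3.4, 5.6, 7.8, np.nan]
})

# Request the 10th, 50th and 90th percentiles instead of the default quartiles.
summary = df.describe(percentiles=[0.1, 0.5, 0.9])
print(summary)
```

The rows `25%` and `75%` are replaced by `10%` and `90%`; `count` for column `b` is still 4 because NaN is excluded.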

### 1.1. Computing individual descriptive statistics

``````print(df.count()) # sample size (number of non-NaN observations)
``````
``````a    5
b    4
dtype: int64
``````
``````print(df.mean()) # mean
``````
``````a    3.0
b    4.5
dtype: float64
``````
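
These reductions skip NaN by default, which is why the mean of `b` is 4.5 rather than NaN. As a small sketch (assuming the same `df`; `skipna` does not appear in the text above but is a standard parameter of pandas reductions such as `mean`), the behaviour can be switched off:

```python
import numpy as np
import pandas as pd

# Same example data as at the top of the page.
df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [1.2, 3.4, 5.6, 7.8, np.nan]
})

mean_default = df.mean()             # NaN in column b is skipped
mean_strict = df.mean(skipna=False)  # NaN propagates, so column b becomes NaN

print(mean_default)
print(mean_strict)
```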
``````print(df.std()) # sample standard deviation (square root of the unbiased variance), ddof=1
``````
``````a    1.581139
b    2.840188
dtype: float64
``````
``````print(df.std(ddof=0)) # population (not unbiased) standard deviation, ddof=0
``````
``````a    1.414214
b    2.459675
dtype: float64
``````
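
The `ddof` argument sets the divisor of the sum of squared deviations to n - ddof, where n is the number of non-NaN values. The sketch below recomputes both standard deviations of column `b` by hand with NumPy to show the difference; the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Same example data as at the top of the page.
df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [1.2, 3.4, 5.6, 7.8, np.nan]
})

x = df['b'].dropna().to_numpy()     # the 4 valid values of column b
n = len(x)
ss = np.sum((x - x.mean()) ** 2)    # sum of squared deviations from the mean

std_sample = np.sqrt(ss / (n - 1))  # ddof=1: divide by n - 1
std_population = np.sqrt(ss / n)    # ddof=0: divide by n

print(std_sample, df['b'].std())            # both follow the ddof=1 definition
print(std_population, df['b'].std(ddof=0))  # both follow the ddof=0 definition
```

The same divisor applies to `var`, which is the square of the corresponding standard deviation.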
``````print(df.min()) # minimum
``````
``````a    1.0
b    1.2
dtype: float64
``````

Computing percentile values:

``````quantile(q=0.5, axis: 'Axis' = 0, numeric_only: 'bool' = True,
interpolation: 'str' = 'linear')
``````
``````print(df.quantile(0.25)) # first quartile
``````
``````a    2.00
b    2.85
Name: 0.25, dtype: float64
``````
``````print(df.quantile(0.50)) # median (second quartile)
``````
``````a    3.0
b    4.5
Name: 0.5, dtype: float64
``````
``````print(df.quantile(0.75)) # third quartile
``````
``````a    4.00
b    6.15
Name: 0.75, dtype: float64
``````
``````print(df.quantile([0, 0.25, 0.5, 0.75, 1])) # several quantiles at once, given as a list
``````
``````        a     b
0.00  1.0  1.20
0.25  2.0  2.85
0.50  3.0  4.50
0.75  4.0  6.15
1.00  5.0  7.80
``````
``````print(df.max()) # maximum
``````
``````a    5.0
b    7.8
dtype: float64
``````
``````print(df.var()) # unbiased variance, ddof=1
``````
``````a    2.500000
b    8.066667
dtype: float64
``````
``````print(df.var(ddof=0)) # population (not unbiased) variance, ddof=0
``````
``````a    2.00
b    6.05
dtype: float64
``````
``````print(df.median()) # median (second quartile)
``````
``````a    3.0
b    4.5
dtype: float64
``````

## 2. Computing arbitrary descriptive statistics: `aggregate`

``````aggregate(func=None, axis: 'Axis' = 0, *args, **kwargs) # aggregate() == agg()
``````
`agg` is an alias of `aggregate`; using the alias `agg` is recommended.
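
As a quick check that the two names behave the same, the sketch below (assuming the same `df` as at the top of the page) compares their results for one statistic:

```python
import numpy as np
import pandas as pd

# Same example data as at the top of the page.
df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [1.2, 3.4, 5.6, 7.8, np.nan]
})

# Compute the column means through both names and compare the results.
same = df.agg('mean').equals(df.aggregate('mean'))
print(same)
```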

``````df.agg(np.mean)
``````
``````a    3.0
b    4.5
dtype: float64
``````
``````df.agg('mean')
``````
``````a    3.0
b    4.5
dtype: float64
``````

``````df.agg(['count', sum, min, max, 'median', 'mean', 'var', 'std'])
``````
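|        | a         | b         |
|--------|-----------|-----------|
| count  | 5.000000  | 4.000000  |
| sum    | 15.000000 | 18.000000 |
| min    | 1.000000  | 1.200000  |
| max    | 5.000000  | 7.800000  |
| median | 3.000000  | 4.500000  |
| mean   | 3.000000  | 4.500000  |
| var    | 2.500000  | 8.066667  |
| std    | 1.581139  | 2.840188  |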

When the function passed to `agg` takes arguments of its own, supply them as keyword arguments to `agg`.

``````result = df.agg('quantile', q=[0, 0.25, 0.5, 0.75, 1]) # min, q1, median, q3, max
result.index = ['min', 'q1', 'median', 'q3', 'max']
result
``````
|        | a   | b    |
|--------|-----|------|
| min    | 1.0 | 1.20 |
| q1     | 2.0 | 2.85 |
| median | 3.0 | 4.50 |
| q3     | 4.0 | 6.15 |
| max    | 5.0 | 7.80 |
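
`agg` also accepts a dict that maps column names to functions, which allows different statistics per column. This form is not used in the text above; the sketch assumes the same `df`, and the particular choice of statistics is only an example.

```python
import numpy as np
import pandas as pd

# Same example data as at the top of the page.
df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [1.2, 3.4, 5.6, 7.8, np.nan]
})

# Column a: minimum and maximum; column b: mean and median.
per_column = df.agg({'a': ['min', 'max'], 'b': ['mean', 'median']})
print(per_column)
```

Cells for statistics that were not requested for a column are filled with NaN.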

## 3. Computing percentiles (quantiles): `quantile`

``````quantile(q=0.5, axis: 'Axis' = 0, numeric_only: 'bool' = True,
interpolation: 'str' = 'linear')
``````
``````df.quantile() # q=0.5
``````
``````a    3.0
b    4.5
Name: 0.5, dtype: float64
``````
``````df.quantile(q=[0, 0.25, 0.5, 0.75, 1])
``````
|      | a   | b    |
|------|-----|------|
| 0.00 | 1.0 | 1.20 |
| 0.25 | 2.0 | 2.85 |
| 0.50 | 3.0 | 4.50 |
| 0.75 | 4.0 | 6.15 |
| 1.00 | 5.0 | 7.80 |
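
The `interpolation` parameter in the signature decides what is returned when a requested quantile falls between two data points. Below is a minimal sketch for the first quartile of column `b`, which lies between 1.2 and 3.4; the loop and the variable names are illustrative.

```python
import numpy as np
import pandas as pd

# Same example data as at the top of the page.
df = pd.DataFrame({
    'a': [1, 2, 3, 4, 5],
    'b': [1.2, 3.4, 5.6, 7.8, np.nan]
})

# Compute the first quartile of column b under each interpolation method.
methods = ['linear', 'lower', 'higher', 'midpoint', 'nearest']
q1 = {m: df['b'].quantile(0.25, interpolation=m) for m in methods}

for m, value in q1.items():
    print(f'{m:>8}: {value}')
```

With the default `'linear'`, the quartile sits at position 0.25 × (4 - 1) = 0.75 between the first two valid values, giving 1.2 + 0.75 × (3.4 - 1.2) = 2.85, which matches the table above.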

## For those who want to know more

``````help(df.describe) # Help on method describe in module pandas.core.generic
help(df.count)    # Help on method count in module pandas.core.frame
help(df.mean)     # Help on method mean in module pandas.core.generic
help(df.std)      # Help on method std in module pandas.core.generic
help(df.min)      # Help on method min in module pandas.core.generic
help(df.quantile) # Help on method quantile in module pandas.core.frame
help(df.max)      # Help on method max in module pandas.core.generic
help(df.var)      # Help on method var in module pandas.core.generic
help(df.median)   # Help on method median in module pandas.core.generic
``````
1. pandas also has `pd.NA` as a missing-value marker, but the missing value that appears in the float columns used here is `np.nan`.

