
Basic aggregation in Python (1)

Last updated at 2019-05-01, posted at 2019-05-01

Doing basic data aggregation in Python.

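This is a memo for myself.
When you do data science in practice, basic aggregation is indispensable.
In conservative workplaces such as finance in particular, you sometimes need to keep the aggregation results as evidence.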

It is important but not much fun, so the basic motivation here is to build a template that can be reused.

I use pandas, matplotlib, and seaborn for aggregation and visualization; this time I extend describe().
(I am not a programmer, so please point out anything about my coding style or any more efficient approach.)

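describe() is good for getting an overall picture of the data, but it tells you little about outliers, and called directly on a DataFrame it does not break the statistics down by stratum. The class below adds both and writes the result to a CSV file.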
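As a rough illustration of that gap, here is a minimal sketch on made-up data (the frame `toy` and its columns `group` and `value` are hypothetical, not from the article). The default describe() output lists the count, mean, standard deviation, minimum, quartiles, and maximum, so zeros and missing values have to be counted separately.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical): one zero, one missing value, one extreme value
toy = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "value": [0.0, 1.0, np.nan, 2.0, 3.0, 100.0],
})

# Default summary of a numeric column
summary = toy["value"].describe()
print(summary)

# describe() does not report these, so they are computed by hand
zero_count = int((toy["value"] == 0).sum())
missing_count = int(toy["value"].isnull().sum())
print(zero_count, missing_count)
```

The extreme value only shows up as `max` here, which is one reason the class below adds the 5% and 95% points.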

FeatureAgg.py

import pandas as pd
import numpy as np

class FeatureAgg(object):
    def __init__(self, dataframe):
        self.df = dataframe
            
    # Number of zero values
    def count0_num(self, data):
        count0_num = data[data == 0].shape[0]
        return count0_num

    # Share of zero values (the denominator includes missing rows)
    def count0_rate(self, data):
        count0_num = data[data == 0].shape[0]
        return count0_num/data.shape[0]

    # Number of missing values
    def na_num(self, data):
        na_num = data.isnull().sum()
        return na_num

    # Share of missing values
    def na_rate(self, data):
        na_num = data.isnull().sum()
        return na_num/data.shape[0]

    # 5th percentile
    def lower_5per(self, data):
        low_5per = data.quantile(0.05)
        return low_5per

    # 95th percentile
    def upper_5per(self, data):
        upper_5per = data.quantile(0.95)
        return upper_5per
    
    # Compute the summary statistics of `col` for each group defined by `arg_list`
    def aggregation(self, data, arg_list, col):
        data_agg = data.groupby(arg_list, as_index=False)\
        .agg({col:[self.count0_num, self.count0_rate, self.na_num, self.na_rate, "count", "std", "min", self.lower_5per, "mean", "median", self.upper_5per, "max"]})
        return data_agg
    
    # Aggregate every non-grouping column and collect the results in one CSV file
    def agg_df_describe(self, filename, *args):
        df = self.df
        arg_list = list(args)
        header_written = False
        for i, col in enumerate(df.columns.values):
            # Skip the grouping columns themselves
            if col in arg_list:
                continue
            try:
                df_agg = self.aggregation(df[arg_list + [col]], arg_list, col)
                # Prepend the name of the aggregated column
                df_agg = pd.concat([pd.DataFrame(col, index=np.arange(df_agg.shape[0]), columns=["cols"]), df_agg], axis=1)
                df_agg.columns = ["feature"] + arg_list + ["zero count", "zero rate", "missing count", "missing rate", "count", "std", "min", "5th percentile", "mean", "median", "95th percentile", "max"]
            except Exception:
                # Skip the column so that a stale result is never written twice
                print("Could not aggregate {}.".format(col))
                continue
            try:
                # Write the header with the first successful result, then append
                df_agg.to_csv(filename + ".csv", header=not header_written, index=False,
                              encoding="shift-jis", mode="a" if header_written else "w")
                header_written = True
            except Exception:
                print("Write error at column {}: {}".format(i, col))
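For comparison, a similar grouped table can also be built with pandas named aggregation, which yields flat column names directly. This is only a sketch of an alternative, not the approach used in FeatureAgg.py; the toy frame and the helper names `zero_count`, `missing_rate`, and `p95` are hypothetical.

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical) with zeros and one missing value
toy = pd.DataFrame({
    "group": ["a", "a", "a", "b", "b", "b"],
    "value": [0.0, 1.0, np.nan, 2.0, 0.0, 4.0],
})

def zero_count(s):
    # Number of entries equal to zero
    return int((s == 0).sum())

def missing_rate(s):
    # Share of missing entries; the denominator includes missing rows
    return s.isnull().sum() / s.shape[0]

def p95(s):
    # 95th percentile (missing values are ignored by quantile)
    return s.quantile(0.95)

# Each keyword names an output column: (input column, aggregation)
result = toy.groupby("group", as_index=False).agg(
    zero_count=("value", zero_count),
    missing_rate=("value", missing_rate),
    count=("value", "count"),
    p95=("value", p95),
)
print(result)
```

Named aggregation needs pandas 0.25 or later, so it may not have been available when the class above was written.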

execute.py

# Test with the iris dataset
import pandas as pd
from sklearn.datasets import load_iris

from FeatureAgg import FeatureAgg

iris = load_iris()

df = pd.DataFrame(iris.data, columns = iris.feature_names)
df["target"] = iris.target

featureagg = FeatureAgg(df)
featureagg.agg_df_describe("iris_test", "target")  # aggregate by target (stratified)

Output (the zero and missing-value columns are not shown):

feature	target	count	std	min	5th percentile	mean	median	95th percentile	max
sepal length (cm)	0	50	0.3524896872	4.3	4.4	5.006	5	5.61	5.8
sepal length (cm)	1	50	0.5161711471	4.9	5.045	5.936	5.9	6.755	7
sepal length (cm)	2	50	0.6358795933	4.9	5.745	6.588	6.5	7.7	7.9
sepal width (cm)	0	50	0.3790643691	2.3	3	3.428	3.4	4.055	4.4
sepal width (cm)	1	50	0.3137983234	2	2.245	2.77	2.8	3.2	3.4
sepal width (cm)	2	50	0.3224966382	2.2	2.5	2.974	3	3.51	3.8
petal length (cm)	0	50	0.1736639965	1	1.2	1.462	1.5	1.7	1.9
petal length (cm)	1	50	0.4699109772	3	3.39	4.26	4.35	4.9	5.1
petal length (cm)	2	50	0.5518946957	4.5	4.845	5.552	5.55	6.655	6.9
petal width (cm)	0	50	0.1053855894	0.1	0.1	0.246	0.2	0.4	0.6
petal width (cm)	1	50	0.19775268	1	1	1.326	1.3	1.6	1.8
petal width (cm)	2	50	0.2746500556	1.4	1.545	2.026	2	2.455	2.5
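To check an exported file, one option is to read it back with the same encoding. The sketch below is self-contained: it writes two small made-up tables (`part1` and `part2` are hypothetical, not the iris results) to a temporary file using the header-once-then-append pattern, then loads the combined file.

```python
import os
import tempfile

import pandas as pd

# Two hypothetical per-column result tables with identical columns
part1 = pd.DataFrame({"feature": ["x", "x"], "target": [0, 1], "mean": [1.0, 2.0]})
part2 = pd.DataFrame({"feature": ["y", "y"], "target": [0, 1], "mean": [3.0, 4.0]})

path = os.path.join(tempfile.mkdtemp(), "demo.csv")

# The first chunk creates the file with a header; later chunks append without one
part1.to_csv(path, header=True, index=False, encoding="shift_jis")
part2.to_csv(path, header=False, index=False, encoding="shift_jis", mode="a")

# Read the combined file back with the same encoding
combined = pd.read_csv(path, encoding="shift_jis")
print(combined)
```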

That's all.
