Visualizing the distributions of categorical and continuous values with pandas plot

Posted at 2021-03-03

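When analyzing CSV data on Kaggle and similar platforms, I often want to visualize the distribution of each feature (each column key) and compare it across datasets.
For categorical values, the count is displayed on top of each bar.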

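Contents

  1. Visualizing categorical values
  2. Visualizing continuous values
  3. Comparing categorical values between two DataFrames
  4. Comparing continuous values between two DataFrames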

Importing the required libraries

python3
import pandas as pd
import numpy as np
import matplotlib
import matplotlib.pyplot as plt
np.random.seed(seed=0)

In a Jupyter notebook, also add the following:

python3
%matplotlib inline

Preparing the DataFrames

python3
length = 500
# Continuous feature: train ~ N(0, 1), eval ~ N(1, 0.5)
train_df = pd.DataFrame(np.random.normal(loc=0, scale=1, size=length), columns=["cont"])
eval_df = pd.DataFrame(np.random.normal(loc=1, scale=0.5, size=length), columns=["cont"])
# Categorical feature: randint excludes the upper bound, so train has 0-4 and eval has 0-5
train_df = train_df.assign(cat=np.random.randint(0, 5, length))
eval_df = eval_df.assign(cat=np.random.randint(0, 6, length))
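
Before plotting, it can help to confirm which categories each DataFrame actually contains. The following is a minimal sketch that repeats the setup above so it runs on its own; the names `train_cats`, `eval_cats`, and `only_in_eval` are introduced here for illustration and are not part of the original article.

```python
import numpy as np
import pandas as pd

np.random.seed(seed=0)
length = 500
train_df = pd.DataFrame(np.random.normal(loc=0, scale=1, size=length), columns=["cont"])
eval_df = pd.DataFrame(np.random.normal(loc=1, scale=0.5, size=length), columns=["cont"])
train_df = train_df.assign(cat=np.random.randint(0, 5, length))
eval_df = eval_df.assign(cat=np.random.randint(0, 6, length))

# Categories present in each DataFrame (randint excludes the upper bound).
train_cats = sorted(train_df["cat"].unique())
eval_cats = sorted(eval_df["cat"].unique())

# Categories that appear in eval but never in train.
only_in_eval = sorted(set(eval_cats) - set(train_cats))
print(train_cats, eval_cats, only_in_eval)
```

A category that exists in only one DataFrame matters later, when the two sets of counts are aligned for comparison.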

Visualizing categorical values

python3
def categorical_plot(df, key):
    # Count each category once and order the bars by category value.
    data = df[key].value_counts().sort_index()
    ax = data.plot(kind="bar", title=key)
    # Write each count just above its bar.
    for i, d in enumerate(data):
        ax.text(i, d, str(d), horizontalalignment="center",
                verticalalignment="bottom")
    return ax
python3
categorical_plot(train_df, "cat")

cat1.png
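
The same labels can also be produced with matplotlib's `Axes.bar_label`, which is available from matplotlib 3.4 onward. The sketch below is an alternative under that version assumption, not the method used in this article; `categorical_plot_bar_label` and `sample_df` are hypothetical names.

```python
import pandas as pd
import matplotlib.pyplot as plt


def categorical_plot_bar_label(df, key):
    # Count each category and order the bars by category value.
    counts = df[key].value_counts().sort_index()
    fig, ax = plt.subplots()
    counts.plot(kind="bar", title=key, ax=ax)
    # A single Series produces one BarContainer; bar_label annotates each bar
    # with its height.
    ax.bar_label(ax.containers[0])
    return ax


# Hypothetical sample data, not the article's DataFrame.
sample_df = pd.DataFrame({"cat": [0, 1, 1, 2, 2, 2]})
ax = categorical_plot_bar_label(sample_df, "cat")
```

With `bar_label`, the text positions follow the bars automatically, so no manual x/y bookkeeping is needed.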

Visualizing continuous values

python3
def continuous_plot(df, key):
    # kind="density" draws a kernel density estimate (requires SciPy).
    ax = df[key].plot(kind="density", title=key)
    return ax
python3
continuous_plot(train_df, "cont")

cont1.png
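
If SciPy is not installed, a normalized histogram gives a similar view of the distribution. This is a minimal sketch rather than part of the original article; `continuous_hist`, its `bins` parameter, and `sample_df` are hypothetical.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt


def continuous_hist(df, key, bins=30):
    fig, ax = plt.subplots()
    # density=True scales the bars so that their total area is 1,
    # which makes the shape comparable to a density curve.
    df[key].plot(kind="hist", bins=bins, density=True, title=key, ax=ax)
    return ax


# Hypothetical sample data, not the article's DataFrame.
rng = np.random.default_rng(0)
sample_df = pd.DataFrame({"cont": rng.normal(loc=0, scale=1, size=500)})
ax = continuous_hist(sample_df, "cont")
```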

Comparing categorical values between two DataFrames

python3
def compare_cat(trdf, evdf, key):
    tr_counts = trdf[key].value_counts()
    ev_counts = evdf[key].value_counts()
    # keys= names the columns. concat aligns on the category, so a category
    # missing from one DataFrame becomes NaN and is filled with 0 here.
    plot_df = pd.concat([tr_counts, ev_counts], axis=1, keys=["train", "eval"])
    plot_df = plot_df.fillna(0).astype(int).sort_index()
    ax = plot_df.plot(kind="bar", title=key)

    # Iterate over the aligned table so that label i sits on bar group i.
    for i, d in enumerate(plot_df["train"]):
        ax.text(i, d, str(d), horizontalalignment="right",
                verticalalignment="bottom")
    for i, d in enumerate(plot_df["eval"]):
        ax.text(i, d, str(d), horizontalalignment="left",
                verticalalignment="bottom")
    return ax
python3
compare_cat(train_df, eval_df, "cat")

compare_cat1.png
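
When the two DataFrames have different numbers of rows, raw counts are hard to compare directly. One option is to plot the share of each category instead. The following is a sketch under that assumption; `compare_cat_ratio` and the small sample DataFrames are hypothetical and not part of the original article.

```python
import pandas as pd
import matplotlib.pyplot as plt


def compare_cat_ratio(trdf, evdf, key):
    # Share of each category within its own DataFrame.
    tr_ratio = trdf[key].value_counts(normalize=True)
    ev_ratio = evdf[key].value_counts(normalize=True)
    # keys= names the columns; categories missing on one side become 0.
    ratio_df = pd.concat([tr_ratio, ev_ratio], axis=1, keys=["train", "eval"])
    ratio_df = ratio_df.fillna(0).sort_index()
    fig, ax = plt.subplots()
    ratio_df.plot(kind="bar", title=key, ax=ax)
    return ratio_df, ax


# Hypothetical sample data with different sizes.
small_train = pd.DataFrame({"cat": [0, 0, 1, 1]})
small_eval = pd.DataFrame({"cat": [0, 1, 2, 2, 2, 2, 2, 2]})
ratio_df, ax = compare_cat_ratio(small_train, small_eval, "cat")
```

Returning the aligned table alongside the Axes makes it easy to inspect the exact proportions behind the chart.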

Comparing continuous values between two DataFrames

python3
def compare_cont(trdf, evdf, key):
    tr_df = trdf.rename(columns={key: "train"})["train"]
    ev_df = evdf.rename(columns={key: "eval"})["eval"]
    plot_df = pd.concat([tr_df, ev_df], axis=1)
    # kind="density" requires SciPy.
    ax = plot_df.plot(kind="density", title=key)
    # legend() returns a Legend object, so call it separately and return the Axes.
    ax.legend(loc="upper left")
    return ax
python3
compare_cont(train_df, eval_df, "cont")

compare_cont1.png
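
A numeric summary can complement the density plot when the curves overlap heavily. The sketch below places the `describe()` output of both DataFrames side by side; `compare_cont_summary` and the sample data are hypothetical and not part of the original article.

```python
import pandas as pd


def compare_cont_summary(trdf, evdf, key):
    # describe() returns count, mean, std, min, quartiles, and max.
    return pd.concat(
        [trdf[key].describe(), evdf[key].describe()],
        axis=1,
        keys=["train", "eval"],
    )


# Hypothetical sample data, not the article's DataFrames.
small_train = pd.DataFrame({"cont": [1.0, 2.0, 3.0]})
small_eval = pd.DataFrame({"cont": [2.0, 4.0, 6.0]})
summary = compare_cont_summary(small_train, small_eval, "cont")
print(summary)
```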
