
Data Analysis

Last updated at 2021-10-22 / Posted at 2021-10-22

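Purpose of this article

Building an AI system that incorporates machine learning follows this process:
① Data analysis → ② Dataset creation (preprocessing) → ③ Model building and application
At the data analysis stage, you need to analyze the data in advance, understand its structure and tendencies, and decide on the subsequent approach so that preprocessing can be done correctly.
This blog post walks through the data analysis step from start to finish.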

Execution environment

Python3
MacBook Pro (machine)
PyCharm (IDE)
Jupyter Notebook (Chrome)
Google Slides (Chrome)

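Checking the target data

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data
boston_df = pd.read_csv("./boston.csv", sep=',')

# Inspect the data
boston_df.shape
boston_df.head().astype(str)
boston_df.info()
boston_df.describe().astype(str)

# Check the area on a map
from IPython.display import Image
Image("./boston_area.png")
Image("./boston_river.png")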

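Attributes

Explanatory variables

CRIM:
Per capita crime rate by town
ZN:
Proportion of residential land zoned for lots over 25,000 square feet
INDUS:
Proportion of non-retail business acres per town
CHAS: (int)
Charles River dummy variable (1 if the tract bounds the river, 0 otherwise)
NOX:
Nitric oxides concentration (parts per 10 million)
RM:
Average number of rooms per dwelling
AGE:
Proportion of owner-occupied units built before 1940
DIS:
Weighted distances to five Boston employment centers
RAD: (int)
Index of accessibility to radial highways
TAX:
Full-value property tax rate per $10,000 ($/10k)
PTRATIO:
Pupil-teacher ratio by town
B:
Result of the equation B = 1000(Bk - 0.63)^2, where Bk is the proportion of Black residents by town
LSTAT:
Percentage of the population with lower status

Target variable

MEDV:
Median value of owner-occupied homes (in units of $1,000)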

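Data types and scales

Qualitative data
- Nominal scale: CHAS (binary)
- Ordinal scale: none
Quantitative data
- Interval scale: RM, DIS, RAD
- Ratio scale: CRIM, ZN, INDUS, NOX, AGE, TAX, PTRATIO, B, LSTAT, MEDV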
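
One rough way to cross-check scale assignments like the ones above is to look at each column's dtype and its number of distinct values: a column with only a handful of distinct values is a candidate for categorical or ordinal treatment. The sketch below uses a small made-up frame in place of boston_df; the helper name summarize_columns and the max_levels threshold are assumptions for illustration, not part of the original analysis.

```python
import pandas as pd

def summarize_columns(df, max_levels=10):
    """Return dtype, distinct-value count, and a rough 'discrete' flag per column."""
    summary = pd.DataFrame({
        "dtype": df.dtypes.astype(str),
        "n_unique": df.nunique(),
    })
    # Heuristic only: few distinct values suggests a categorical/ordinal variable.
    summary["looks_discrete"] = summary["n_unique"] <= max_levels
    return summary

# Toy stand-in for boston_df (values are made up for illustration).
toy_df = pd.DataFrame({
    "CHAS": [0, 1, 0, 0, 1, 0],
    "RAD": [1, 4, 24, 4, 5, 24],
    "RM": [6.5, 6.4, 7.1, 6.9, 5.8, 6.0],
})
summary = summarize_columns(toy_df, max_levels=4)
print(summary)
```

The flag is only a hint: whether a variable such as RAD should be treated as an index, an ordinal category, or a number remains a modeling decision.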

Checking the target variable

import warnings
warnings.filterwarnings('ignore')

# Overlay a fitted normal distribution
# (sns.distplot is deprecated in seaborn 0.11 and later)
from scipy import stats
from scipy.stats import norm
plt.figure(figsize=[8,3])
sns.distplot(boston_df['MEDV'], kde=True, fit=norm, fit_kws={'label': 'Normal distribution'}).grid()
plt.legend()

# Q-Q plot
plt.figure(figsize=[8,3])
plt.grid()
stats.probplot(boston_df['MEDV'], dist="norm", plot=plt)

# Kolmogorov-Smirnov test (K-S test)
# Note: 'norm' with no extra arguments means the standard normal N(0, 1)
stats.kstest(boston_df['MEDV'], 'norm')

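Null hypothesis: the data follows the normal distribution
Alternative hypothesis: the data does not follow the normal distribution

The p-value is 0.0 (rounded), so the null hypothesis is rejected (the data does not follow the normal distribution). Note that kstest with 'norm' and no further arguments compares against the standard normal N(0, 1), so unstandardized data such as MEDV is rejected largely because of its location and scale.
For a linear model, it seems better to transform the target toward a normal distribution.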
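
Because kstest with 'norm' alone compares against N(0, 1), it can help to also compare against a normal distribution that has the sample's own mean and standard deviation. The sketch below does both on a synthetic right-skewed sample that stands in for boston_df['MEDV']; the sample and its parameters are assumptions for illustration, so the printed p-values say nothing about the real data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(0)
# Synthetic right-skewed sample standing in for boston_df['MEDV'].
sample = rng.lognormal(mean=3.0, sigma=0.4, size=500)

# Comparing raw data with N(0, 1) rejects almost any unstandardized sample.
raw_result = stats.kstest(sample, "norm")

# Compare with a normal distribution that has the sample's mean and std instead.
# Note: estimating the parameters from the same data makes this p-value
# approximate (the Lilliefors correction addresses this).
fitted_result = stats.kstest(
    sample, "norm", args=(sample.mean(), sample.std(ddof=1))
)

print(raw_result.pvalue, fitted_result.pvalue)
```

The second test is the one that actually speaks to the shape of the distribution; the first mostly reflects that the values are far from zero.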

Checking for missing values

boston_df.isnull().sum()

There are no missing values.

Checking for outliers

# Box plot
def hige_graph(cols):
    fig, ax = plt.subplots()
    data_ = []
    for col in cols:
        data_.append(boston_df[col])
    ax.set_title('Box plot')
    ax.boxplot(data_, labels=cols)
    plt.grid()
    plt.show()


# All columns
all_col = boston_df.columns
hige_graph(cols=all_col)

plt.grid()
sns.countplot(x=boston_df["ZN"])

plt.figure(figsize=[10,3])
sns.boxplot(data=boston_df, x='CRIM').grid()

plt.figure(figsize=[10,3])
sns.boxplot(data=boston_df, x='B').grid()

sns.lmplot(x='B',y='CRIM',data=boston_df,aspect=5,height=5)

sns.lmplot(x='CRIM',y='B',data=boston_df,aspect=5,height=5)

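CRIM
- Contains an outlier (around 80). A crime rate of 80% is a suspicious value. When using a linear model, remove this outlier.
B
- It is hard to decide where to cut, but values in roughly the 0.1 to 10 range account for 1.5% of all rows. Excluding them may be acceptable.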
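
As one possible way to make the cut-off less subjective, the box-plot whisker rule (1.5 times the interquartile range) can be applied directly. This is a minimal sketch on made-up values standing in for boston_df['CRIM']; the helper name iqr_outlier_mask and the choice of k=1.5 are assumptions, and the article itself does not say which rule it will use.

```python
import pandas as pd

def iqr_outlier_mask(series, k=1.5):
    """Flag values outside [Q1 - k*IQR, Q3 + k*IQR], the usual box-plot whisker rule."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    return (series < q1 - k * iqr) | (series > q3 + k * iqr)

# Made-up values standing in for boston_df['CRIM'].
crim = pd.Series([0.02, 0.05, 0.1, 0.3, 0.8, 1.2, 3.5, 80.0], name="CRIM")
mask = iqr_outlier_mask(crim)

print(crim[mask])        # rows flagged as outliers
filtered = crim[~mask]   # rows that would remain after removal
```

On a heavily skewed column this rule can flag a large share of rows, so it is worth checking mask.mean() before dropping anything.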

Univariate and multivariate data analysis

Univariate data analysis

boston_df.hist(bins=10, figsize=(20,15))

plt.figure(figsize=[20,5])
sns.distplot(boston_df['CRIM']).grid()

plt.figure(figsize=[20,5])
sns.distplot(boston_df['B'], bins=100).grid()

plt.figure(figsize=[8,3])
sns.distplot(boston_df.loc[boston_df['B'] < 200, 'B'], bins=10).grid()

def sns_distplot(col_name, bins):
    import warnings
    warnings.filterwarnings('ignore')
    plt.figure(figsize=[8,3])
    sns.distplot(boston_df[col_name], bins=bins).grid()

sns_distplot(col_name='INDUS', bins=10)

sns_distplot(col_name='CHAS', bins=10)

sns_distplot(col_name='NOX', bins=10)

sns_distplot(col_name='RM', bins=10)

sns_distplot(col_name='AGE', bins=10)

sns_distplot(col_name='DIS', bins=10)

sns_distplot(col_name='RAD', bins=10)

sns.countplot(x=boston_df["RAD"]).grid()

sns_distplot(col_name='TAX', bins=10)

sns_distplot(col_name='PTRATIO', bins=10)

sns_distplot(col_name='LSTAT', bins=10)

NOX, RM, DIS, PTRATIO, LSTAT, MEDV
If a linear model is used for modeling, apply a transformation that brings these variables closer to a normal distribution.
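
Two common options for bringing a skewed variable closer to a normal distribution are a plain log transform and the Yeo-Johnson power transform. The sketch below compares their effect on skewness using a synthetic right-skewed sample; the sample is an assumption standing in for a column such as LSTAT, and which transform suits each real column would need to be checked on the actual data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)
# Synthetic right-skewed, strictly positive sample (stand-in for a real column).
skewed = rng.lognormal(mean=2.0, sigma=0.6, size=1000)

# Simple log transform; log1p also handles zeros safely.
log_transformed = np.log1p(skewed)

# Yeo-Johnson transform; lambda is fitted by maximum likelihood when not given.
yj_transformed, yj_lambda = stats.yeojohnson(skewed)

for name, values in [
    ("raw", skewed),
    ("log1p", log_transformed),
    ("yeo-johnson", yj_transformed),
]:
    print(f"{name:12s} skewness = {stats.skew(values):.3f}")
```

A skewness closer to 0 indicates a more symmetric distribution, which is a necessary (though not sufficient) sign of normality.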

Multivariate data analysis

Correlation analysis

# Scatter plot matrix
pd.plotting.scatter_matrix(boston_df, c='blue', figsize=(20, 20))

# Heatmap 1 (Pearson correlation)
plt.figure(figsize=(7, 7))
sns.heatmap(boston_df.corr(method='pearson'), annot=False, vmin=-1, vmax=1, cmap='bwr', cbar=True)

# Heatmap 2 (Spearman rank correlation)
plt.figure(figsize=(7, 7))
sns.heatmap(boston_df.corr(method='spearman'), annot=False, vmin=-1, vmax=1, cmap='bwr', cbar=True)

Variables that appear to be correlated nonlinearly
- CRIM: INDUS, NOX, RM, RAD, TAX
- INDUS: NOX, DIS
- NOX: AGE, DIS
- AGE: DIS
- RAD: TAX
- MEDV: LSTAT
Variables with weak correlation
- CHAS: seems unnecessary on its own
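
One way to shortlist nonlinear candidates like those above without reading two heatmaps side by side is to subtract the absolute Pearson matrix from the absolute Spearman matrix: a clearly positive gap suggests a monotonic but nonlinear relationship. The following is a sketch on synthetic columns (the names x, linear, and curved are made up), not a reproduction of how the list above was obtained.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
x = rng.uniform(0, 5, size=300)
toy_df = pd.DataFrame({
    "x": x,
    "linear": 2 * x + rng.normal(0, 1, size=300),  # roughly linear in x
    "curved": np.exp(x),                           # monotonic but strongly nonlinear
})

pearson = toy_df.corr(method="pearson")
spearman = toy_df.corr(method="spearman")

# A clearly positive gap hints at a monotonic, nonlinear relationship.
gap = spearman.abs() - pearson.abs()
print(gap.loc["x"].round(3))
```

With the real data, the same two lines on boston_df would give a matrix that can be sorted or passed to sns.heatmap.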

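# Check the relationship with the target variable
def target_relation(tgt, data):
    y_train = data[tgt]

    # Number of variables to display
    k = len(data.columns)
    fig = plt.figure(figsize=(20,20))

    # Correlation coefficients between all variables
    corrmat = data.corr()

    # Take the k columns most correlated with the target, in descending order
    cols = corrmat.nlargest(k, tgt)[tgt].index

    # Plot every variable except the target itself (cols[0])
    for i in np.arange(1, k):
        X_train = data[cols[i]]
        ax = fig.add_subplot(4,4, i)
        sns.regplot(x=X_train, y=y_train)
    
    plt.tight_layout()
    plt.show()

target_relation('MEDV', boston_df)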

sns.jointplot(x='MEDV', y='RM', data=boston_df)

sns.jointplot(x='MEDV', y='LSTAT', data=boston_df)

RM and LSTAT have a linear relationship with the target variable.
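
To put a number on a linear relationship like this, a simple least-squares fit reports the slope and the coefficient of determination. The sketch below uses scipy.stats.linregress on synthetic data in which rm plays the role of RM and medv the role of MEDV; the generating coefficients 9.0 and -34.0 are arbitrary assumptions, not estimates from the Boston data.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(7)
# Synthetic stand-ins: rm plays the role of RM, medv the role of MEDV.
rm = rng.normal(6.3, 0.7, size=200)
medv = 9.0 * rm - 34.0 + rng.normal(0, 3.0, size=200)

# Ordinary least-squares fit of medv on rm.
fit = stats.linregress(rm, medv)
print(f"slope={fit.slope:.2f}, intercept={fit.intercept:.2f}, r^2={fit.rvalue ** 2:.3f}")
```

For LSTAT, where the scatter plot bends, comparing r^2 before and after a log transform of the explanatory variable is a quick way to see whether the transform helps.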

import plotly.express as px  # used for the interactive plots below
fig = px.box(boston_df, x="CHAS", y="MEDV", color="CHAS", width=600, height=400)
fig.show()

Homes along the river (CHAS = 1) are priced slightly higher.
For CHAS = 0, high-priced homes appear as outliers.

fig = px.box(boston_df, x="RAD", y="MEDV", color="RAD")
fig.show()

The low price range is covered only by RAD = 24.

# trendline="ols" requires the statsmodels package
fig = px.scatter(boston_df, x="MEDV", y="RM", color="CHAS", trendline="ols")
fig.show()

fig = px.scatter(boston_df, x="MEDV", y="RM", color="RAD", trendline="ols")
fig.show()

CHAS = 0 contains noise.

fig = px.scatter(boston_df, x="MEDV", y="LSTAT", color="CHAS", trendline="ols")
fig.show()

fig = px.scatter(boston_df, x="MEDV", y="LSTAT", color="RAD", trendline="ols")
fig.show()

fig = px.scatter(boston_df, x="MEDV", y="LSTAT", color="AGE", trendline="ols")
fig.show()

RAD = 24 tends to deviate from the trend.
AGE values of 80 and above tend to deviate from the trend.

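Checking proportions

# Stacked bar chart
def stack_graph(columns):
    tgt_df = pd.DataFrame()
    for i in range(24):
        # Column means for rows with RAD == i+1 (NaN when no such rows exist)
        tgt_df['RAD' + str(i+1)] = boston_df[boston_df['RAD'] == i+1][columns].mean()

    # Pass figsize to plot(); a separate plt.figure() call would only open an empty figure
    my_plot = tgt_df.T.plot(kind='bar', stacked=True, figsize=(20, 5), title="Mean of each cluster")
    my_plot.set_xticklabels(my_plot.xaxis.get_majorticklabels(), rotation=0)
    plt.grid()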

columns = ['RAD', 'MEDV', 'CRIM', 'DIS']
stack_graph(columns)

columns = ['RAD', 'MEDV', 'AGE']
stack_graph(columns)

RAD 1 to 8 show similar tendencies, though the proportions of the individual components differ slightly.
RAD 24 has markedly different proportions.
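
Since the stacked bars show group means on their original scales, the proportions are easier to compare if each group is first normalized to sum to 1. The following is a minimal sketch of that normalization on made-up rows standing in for boston_df; the values are invented for illustration.

```python
import pandas as pd

# Made-up rows standing in for boston_df.
toy_df = pd.DataFrame({
    "RAD":  [1, 1, 4, 4, 24, 24],
    "MEDV": [24.0, 26.0, 21.0, 23.0, 12.0, 16.0],
    "AGE":  [60.0, 70.0, 65.0, 75.0, 90.0, 94.0],
})

# Mean of each column within each RAD group.
group_means = toy_df.groupby("RAD")[["MEDV", "AGE"]].mean()

# Divide each row by its total so that every RAD group sums to 1.
shares = group_means.div(group_means.sum(axis=1), axis=0)
print(shares.round(3))
```

Calling shares.plot(kind="bar", stacked=True) would then draw bars of equal height, so only the composition differs between groups. Note that such shares mix variables with different units, so they are useful for comparing groups with each other rather than for interpreting any single share.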

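Summary

In this article, we analyzed the Boston housing price data.
The dataset does not have many columns, but the analysis revealed which columns appear strongly related to the target variable, which seem to have an effect in combination with other columns, and which appear weakly related and unlikely to help.
Based on these observations, the next article will cover preprocessing.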

Closing

Other articles are collected here. Please take a look.

Analysis results

Implementation: GitHub/boston_regression_analytics.ipynb
Dataset: Boston House Prices-Advanced Regression Techniques
