
Introduction to Multivariate Data Analysis: Visualizing Relationships in Data with Python Using a Scatter Plot and Colormapped Histograms

Posted at 2023-04-26, last updated at 2023-04-26

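1. Introduction

Hello, this is hinaate. It occurred to me that drawing histograms alongside a scatter plot, and applying a colormap to both, might make relationships easier to see when there is a lot of data, so I worked out some code with help from ChatGPT.
First, we generate random arrays x, y, and z, then compute and draw the histograms of x and y. The scatter points and the histogram elements are then colored by the z value that corresponds to each (x, y) pair.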

2. Generating the random arrays

First, we use numpy to generate three random arrays: x, y, and z. Each holds 1000 samples drawn from the standard normal distribution.

import numpy as np

np.random.seed(1)
x = np.random.randn(1000)
y = np.random.randn(1000)
z = np.random.randn(1000)
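
As a side note, NumPy also offers the newer Generator interface as an alternative to np.random.seed and np.random.randn. Below is a minimal sketch of the same step written that way. The names rng and n are only illustrative, and the generated values will not match those of the code above even with the same seed.

```python
import numpy as np

# Assumption: `rng` and `n` are illustrative names. This is an alternative to
# the legacy np.random.seed / np.random.randn calls, not the article's code.
rng = np.random.default_rng(1)

n = 1000
x = rng.standard_normal(n)
y = rng.standard_normal(n)
z = rng.standard_normal(n)

# The sample mean and standard deviation should be close to 0 and 1.
print(x.shape, round(float(x.mean()), 2), round(float(x.std()), 2))
```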

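3. Computing and sorting the histograms

Next, we compute the histograms of x and y, and sort each of x and y together with its corresponding z values. Note that counts_x and counts_y are not used by the drawing code later. The bars are built from the sorted arrays instead.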

binwidth = 0.25
# Symmetric range around zero that covers both x and y
xymax = max(np.max(np.abs(x)), np.max(np.abs(y)))
lim = (int(xymax/binwidth) + 1) * binwidth
bins = np.arange(-lim, lim + binwidth, binwidth)

# Counts per bin for x, and x sorted together with its z values
counts_x, _ = np.histogram(x, bins=bins)
sorted_indices_x = np.argsort(x)
sorted_x = x[sorted_indices_x]
sorted_z_x = z[sorted_indices_x]

# The same for y
counts_y, _ = np.histogram(y, bins=bins)
sorted_indices_y = np.argsort(y)
sorted_y = y[sorted_indices_y]
sorted_z_y = z[sorted_indices_y]
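
To see what the sorting step does, here is a small worked sketch on four made-up points. The arrays x_small and z_small are hypothetical data chosen for illustration. Indexing both arrays with the same argsort result keeps each z value paired with its x value.

```python
import numpy as np

# Hypothetical toy data, not from the article
x_small = np.array([0.3, -1.2, 0.9, -0.1])
z_small = np.array([10.0, 20.0, 30.0, 40.0])

order = np.argsort(x_small)      # indices that sort x in ascending order
sorted_x_small = x_small[order]  # [-1.2, -0.1, 0.3, 0.9]
sorted_z_small = z_small[order]  # [20.0, 40.0, 10.0, 30.0]

# Bin edges built the same way as above, here with binwidth = 0.5
binwidth = 0.5
lim = (int(np.max(np.abs(x_small)) / binwidth) + 1) * binwidth
bins = np.arange(-lim, lim + binwidth, binwidth)
counts, _ = np.histogram(x_small, bins=bins)

print(order)   # [1 3 0 2]
print(counts)  # [1 0 1 1 1 0]
```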

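4. Plot layout

The plot layout is defined as follows. Each rectangle is given as [left, bottom, width, height] in fractions of the figure size.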

left, width = 0.1, 0.65
bottom, height = 0.1, 0.65
spacing = 0.02

rect_scatter = [left, bottom, width, height]
rect_histx = [left, bottom + height + spacing, width, 0.2]
rect_histy = [left + width + spacing, bottom, 0.2, height]
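
Since the rectangles are fractions of the figure, it is easy to check by hand that they fit. The sketch below does that arithmetic in code. The helper fits_in_figure and the name hist_size are hypothetical additions for illustration, and hist_size stands for the 0.2 used in the layout above.

```python
left, width = 0.1, 0.65
bottom, height = 0.1, 0.65
spacing = 0.02
hist_size = 0.2  # assumption: a name for the 0.2 thickness of the side histograms

rects = {
    "scatter": [left, bottom, width, height],
    "histx": [left, bottom + height + spacing, width, hist_size],
    "histy": [left + width + spacing, bottom, hist_size, height],
}

def fits_in_figure(rect):
    """Return True if [left, bottom, width, height] lies inside the unit square."""
    l, b, w, h = rect
    return l >= 0 and b >= 0 and l + w <= 1 and b + h <= 1

for name, rect in rects.items():
    print(name, fits_in_figure(rect))
```

The far edge of each histogram sits at 0.1 + 0.65 + 0.02 + 0.2 = 0.97, which leaves a small margin inside the figure.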

5. Drawing the scatter plot and histograms

The scatter plot draws the x and y data points colored by their z values, as follows.

import matplotlib.pyplot as plt
from matplotlib.colors import Normalize

fig = plt.figure(figsize=(8, 8))
ax_scatter = plt.axes(rect_scatter)
ax_histx = plt.axes(rect_histx, sharex=ax_scatter)
ax_histy = plt.axes(rect_histy, sharey=ax_scatter)

# Map the range of z onto the colormap range [0, 1]
norm = Normalize(vmin=z.min(), vmax=z.max())

ax_scatter.scatter(x, y, c=z, cmap='viridis', norm=norm)
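
If you want a scale for the colors, a colorbar can be attached to the scatter plot. The sketch below is one possible way and is not part of the original program. The position of cax is a hypothetical choice that uses the empty top-right corner of this layout, and the Agg backend is selected only so the sketch runs without a display.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend, so no window is needed
import matplotlib.pyplot as plt
import numpy as np
from matplotlib.colors import Normalize

np.random.seed(1)
x = np.random.randn(1000)
y = np.random.randn(1000)
z = np.random.randn(1000)

fig = plt.figure(figsize=(8, 8))
ax_scatter = fig.add_axes([0.1, 0.1, 0.65, 0.65])

norm = Normalize(vmin=z.min(), vmax=z.max())
sc = ax_scatter.scatter(x, y, c=z, cmap="viridis", norm=norm)

# Assumption: a small axes in the top-right corner holds the colorbar.
cax = fig.add_axes([0.1 + 0.65 + 0.02, 0.1 + 0.65 + 0.02, 0.03, 0.2])
cbar = fig.colorbar(sc, cax=cax)
cbar.set_label("z")
```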

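Next, we draw the histograms. For each bin of x and of y, we select the points that fall into it and color them by their z values. The bars are drawn with the fill_between and fill_betweenx functions, stacking one unit-height cell per data point. I think this is the interesting part of the program.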

for idx in range(len(bins) - 1):
    # Points that fall into the current bin [bins[idx], bins[idx + 1])
    x_elements = (bins[idx] <= sorted_x) & (sorted_x < bins[idx + 1])
    y_elements = (bins[idx] <= sorted_y) & (sorted_y < bins[idx + 1])

    if x_elements.any():
        # Pass z through norm so the colors match the scatter plot
        colors_x = plt.cm.viridis(norm(sorted_z_x[x_elements]))
        bottom_x = 0
        for c in colors_x:
            # Stack one unit-height cell per data point
            ax_histx.fill_between([bins[idx], bins[idx + 1]], bottom_x, bottom_x + 1, color=c)
            bottom_x += 1

    if y_elements.any():
        colors_y = plt.cm.viridis(norm(sorted_z_y[y_elements]))
        bottom_y = 0
        for c in colors_y:
            # Same idea, stacked horizontally for the y histogram
            ax_histy.fill_betweenx([bins[idx], bins[idx + 1]], bottom_y, bottom_y + 1, color=c)
            bottom_y += 1
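
One detail worth checking when coloring the cells: a colormap such as viridis reads float inputs as positions in the range 0 to 1 and clips anything outside it, so z values have to pass through norm to get the same colors as the scatter plot. The sketch below illustrates the difference on three made-up values. The array z_demo is hypothetical.

```python
import numpy as np
from matplotlib import cm
from matplotlib.colors import Normalize

# Hypothetical z values that extend beyond the range [0, 1]
z_demo = np.array([-3.0, 0.0, 3.0])
norm = Normalize(vmin=z_demo.min(), vmax=z_demo.max())

# Raw values: -3.0 is clipped to the lowest color, the same one 0.0 maps to
raw = cm.viridis(z_demo)
# Normalized values: -3.0, 0.0, 3.0 become 0.0, 0.5, 1.0
scaled = cm.viridis(norm(z_demo))

print(np.allclose(raw[0], raw[1]))        # True: two different z values, one color
print(np.allclose(scaled[0], scaled[1]))  # False: the colors are now distinct
```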

Finally, we remove the tick labels on the shared axes of the histograms and show the plot.

ax_histx.tick_params(axis="x", labelbottom=False)
ax_histy.tick_params(axis="y", labelleft=False)

plt.show()

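Running this shows a scatter plot of the x and y data points, colored by their corresponding z values. A histogram is also drawn along each axis, with one colored cell per data point.
This way of drawing looks useful when a scatter plot has so many points that they overlap and the colormap becomes hard to read!
Adjusting the bin width may make the correspondence with the scatter plot a little easier to see.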
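
To try different bin widths, the bin construction from section 3 can be wrapped in a small function. This is a minimal sketch. The function name make_bins and the candidate widths are assumptions made for illustration.

```python
import numpy as np

def make_bins(x, y, binwidth):
    """Symmetric bin edges around zero, built as in section 3 (hypothetical helper)."""
    xymax = max(np.max(np.abs(x)), np.max(np.abs(y)))
    lim = (int(xymax / binwidth) + 1) * binwidth
    return np.arange(-lim, lim + binwidth, binwidth)

np.random.seed(1)
x = np.random.randn(1000)
y = np.random.randn(1000)

# Print the number of bins and the tallest bar for each candidate width
for bw in (0.1, 0.25, 0.5):
    bins = make_bins(x, y, bw)
    counts_x, _ = np.histogram(x, bins=bins)
    print(bw, len(bins) - 1, counts_x.max())
```

Narrower bins give more, shorter bars that follow the scatter plot more closely. Wider bins give fewer, taller bars.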

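6. Summary

I think this makes it easier to grasp the relationships between data points, and their distributions, visually. Plots like this could be useful in many areas, such as multivariate data analysis and feature selection in machine learning. Try building on the program in this article to make more complex plots or customized drawings. I plan to experiment more myself!
That's all for this time! Goodbye. By the way, I have no particular attachment to hippos. 🦛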
