
[Learning with Google Colab] Data Visualization Design

Last updated at 2023-12-17, posted at 2023-07-14

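Overview

In this hands-on you can try out best practices for data visualization design, based on Chapters 1 to 3 of the book データ視覚化のデザイン (SB Creative Corp.). The book explains chart design using BI tools such as Tableau; this hands-on instead explains the same design ideas using the visualization features of Python libraries that are widely used in data analysis.

Hands-on

The hands-on notebook データ可視化のデザイン.ipynb is published on Google Colab.
The article below has the same content as the notebook.

Setting up the hands-on environment

First, install and import visualization libraries such as Matplotlib and Altair. Install any other libraries you need as you go.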

!pip install japanize_matplotlib

import urllib.request  # urllib.request.urlopen is used below
import json
import numpy as np
import pandas as pd
import altair as alt
import vega_datasets
import matplotlib.pyplot as plt
import japanize_matplotlib

%matplotlib inline
%config InlineBackend.figure_format = 'retina'

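Altair in particular is a library for writing Vega-Lite, a JSON format for data visualization, from Python. Its usage is also introduced on Qiita, so please refer to that as well.

The purpose of data visualization

Visualization means building charts in order to read important information out of data. Charts made without this in mind are often hard to read. Let's first understand what kind of information becomes easier to read through visualization, and think about what data visualization is for.

Besides charts, summary statistics such as the mean and variance are a well-known way to read information from data. But some tendencies in data cannot be captured by summary statistics alone. The purpose of data visualization is to capture those tendencies in a visually clear way.

Using Anscombe's Quartet, a dataset created by the British statistician Frank Anscombe, let's look at a concrete example of tendencies that are hard to capture with summary statistics.

Anscombe's Quartet consists of four groups, I to IV, each containing 11 pairs of two variables, X and Y.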

anscombe_df = pd.melt(vega_datasets.data.anscombe().assign(point=[i%11 for i in range(44)]), id_vars=["Series", "point"]).pivot(columns=[ "variable", "Series"], index="point")
anscombe_df

Variable	X				Y
Series	I	II	III	IV	I	II	III	IV
Point
0	10.0	10.0	10.0	8.0	8.04	9.14	7.46	6.58
1	8.0	8.0	8.0	8.0	6.95	8.14	6.77	5.76
2	13.0	13.0	13.0	8.0	7.58	8.74	12.74	7.71
3	9.0	9.0	9.0	8.0	8.81	8.77	7.11	8.84
4	11.0	11.0	11.0	8.0	8.33	9.26	7.81	8.47
5	14.0	14.0	14.0	8.0	9.96	8.10	8.84	7.04
6	6.0	6.0	6.0	8.0	7.24	6.13	6.08	5.25
7	4.0	4.0	4.0	19.0	4.26	3.10	5.39	12.50
8	12.0	12.0	12.0	8.0	10.84	9.13	8.15	5.56
9	7.0	7.0	7.0	8.0	4.81	7.26	6.42	7.91
10	5.0	5.0	5.0	8.0	5.68	4.74	5.73	6.89

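In every group, the means of X and Y are 9.00 and 7.50 respectively. The mean is one statistic describing the magnitude of data, so from this we would guess that X and Y have about the same magnitude in every group.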

anscombe_df.mean(axis='rows')

       variable  Series
value  X         I         9.000000
                 II        9.000000
                 III       9.000000
                 IV        9.000000
       Y         I         7.500000
                 II        7.500909
                 III       7.500000
                 IV        7.500909

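In every group, the standard deviations of X and Y are 3.32 and 2.03 respectively. The standard deviation is one statistic describing the spread of data, so we would guess that the spread of X and Y is about the same in every group.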

anscombe_df.std(axis='rows')

       variable  Series
value  X         I         3.316625
                 II        3.316625
                 III       3.316625
                 IV        3.316625
       Y         I         2.032890
                 II        2.031657
                 III       2.030424
                 IV        2.030579

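In every group, the correlation coefficient between X and Y is 0.82. The correlation coefficient measures the association between two variables, so we would guess that the relationship between X and Y is about the same in every group.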

np.diag(anscombe_df.corr().loc["value", "value"].loc["X", "Y"].values)

array([0.81618645, 0.81623651, 0.81628674, 0.81652144])
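
As a cross-check on the three statistics above, here is a minimal sketch that recomputes them with only the Python standard library. It assumes the X and Y values of groups I and II are copied by hand from the table printed earlier; the helper `pearson_r` and the `summary` dictionary are names introduced for this illustration and are not part of the notebook.

```python
import math
import statistics

# Values copied from the table above for groups I and II (they share the same X).
x = [10.0, 8.0, 13.0, 9.0, 11.0, 14.0, 6.0, 4.0, 12.0, 7.0, 5.0]
y_by_group = {
    "I": [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.81, 5.68],
    "II": [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74],
}


def pearson_r(xs, ys):
    """Pearson correlation coefficient computed from its definition."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(xs, ys))
    var_x = sum((a - mean_x) ** 2 for a in xs)
    var_y = sum((b - mean_y) ** 2 for b in ys)
    return cov / math.sqrt(var_x * var_y)


summary = {
    group: {
        "mean_y": statistics.mean(ys),
        "std_y": statistics.stdev(ys),  # sample standard deviation, like pandas .std()
        "corr": pearson_r(x, ys),
    }
    for group, ys in y_by_group.items()
}

for group, stats in summary.items():
    print(group, {name: round(value, 2) for name, value in stats.items()})
```

Both groups print nearly identical numbers even though their Y values are clearly different, which is the point of the dataset.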

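Comparing groups I to IV through summary statistics thus suggests that their magnitude, spread, and association are about the same.
Next, let's compare groups I to IV using scatter plots.

# Build the scatter plot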
anscombe_scatter = (
    alt.Chart(data=vega_datasets.data.anscombe())
    .mark_point()
    .encode(
        x=alt.X("X", axis=alt.Axis(grid=False, labelAngle=0)), 
        y=alt.Y("Y", axis=alt.Axis(grid=False, labelAngle=0)), 
        tooltip=["X", "Y"]
        )
    .properties(width=200,height=200)
)

# Overlay the regression line
anscombe_regress = anscombe_scatter + (
    anscombe_scatter
    .transform_regression('X', 'Y', method="linear")
    .mark_line(shape='mark', opacity=0.8)
)

(
    anscombe_regress
    .facet(column = alt.Column('Series', header=alt.Header(labelFontSize=15), title=None))
    .configure_view(strokeWidth=0)
)

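The scatter plots reveal tendencies in the data that the summary statistics could not capture. When you build a chart, keep the purpose in mind, such as "I want to see the details of the data" or "I want to see the overall picture", and consider whether the information you need is easy to read.

Cognitive load

The load the brain bears in understanding information is called cognitive load. Charts and dashboards with high cognitive load may be misunderstood or fail to attract interest, so data visualization should follow the overarching policy of "lowering cognitive load".

How low the cognitive load of a chart is can sometimes be gauged with an indicator called the data-ink ratio.

$$ \text{Data-ink ratio} = \frac{\text{ink used to represent the data itself (data-ink)}}{\text{total ink used in the whole graphic (total ink)}} $$

For example, the bars of a bar chart are tied to the data and count as data-ink, whereas bar shadows, background gridlines, and legend icons do not. It is said that the less such non-data ink there is, the lower the cognitive load.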
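
To make the ratio concrete, the following is a minimal sketch of the formula as a function. The ink amounts are hypothetical numbers chosen for illustration (think of them as pixel counts), not measurements of a real chart, and `data_ink_ratio` is a name introduced here.

```python
def data_ink_ratio(data_ink: float, total_ink: float) -> float:
    """Share of the total ink that is tied to the data itself."""
    if total_ink <= 0:
        raise ValueError("total_ink must be positive")
    if not 0 <= data_ink <= total_ink:
        raise ValueError("data_ink must be between 0 and total_ink")
    return data_ink / total_ink


# Hypothetical ink amounts (e.g. pixel counts); not measured from a real chart.
cluttered = data_ink_ratio(data_ink=3000, total_ink=7500)  # bars plus shadows, gridlines, legend
cleaned = data_ink_ratio(data_ink=3000, total_ink=3600)    # same bars plus a few labels

print(f"cluttered: {cluttered:.2f}, cleaned: {cleaned:.2f}")
```

The bars are the same in both cases; only the non-data ink shrinks, so the ratio rises.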

base = (
    alt.Chart(vega_datasets.data.population())
    .transform_filter(alt.datum.year > 1900)
    .properties(width=400, height=400)
)

left_bar = (
    base.mark_bar().encode(
        x=alt.X('year:N', title="年代"), 
        y=alt.Y('sum(people)', title="人口"),
        color=alt.Color("year:N", legend=alt.Legend(orient='left'))
    )
    .properties(title="年代ごとの米国人口推移（悪い例）")
)

right_base = base.encode(
        x=alt.X('year:N', title="年代", axis=alt.Axis(ticks=False, labelAngle=0)), 
        y=alt.Y('sum(people)', title="人口",  axis=None),
    )

right_bar = right_base.mark_bar(color="gray")

right_text = (
    right_base
    .mark_text(dx=0, dy=-10, color='black')
    .encode(text=alt.Text('sum(people):Q', format='.3s'))
    .properties(title="米国人口推移（良い例）") 
)

(left_bar | (right_bar + right_text)).configure_view(strokeWidth=0)

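In the bar charts above, the good example lowers cognitive load through the following changes.

Writing the values directly on the bars, so the population labels on the vertical axis can be removed
Making the year labels on the horizontal axis horizontal
Removing unnecessary gridlines and colors
Removing the legend, since the years are already labeled

Insights from a chart whose design has been refined to lower cognitive load in this way are less likely to be misunderstood by readers and are more persuasive.

Color universal design

When building charts that will be seen by many people, also consider color universal design.

There are several types of color vision (ways in which colors are perceived). The figure below shows simulated color appearance for each color vision type. It shows that some color combinations, such as green and red, are hard to distinguish for certain color vision types.

[Source] Kanagawa Prefecture Community Welfare Division (神奈川県地域福祉課), "色使いのガイドラインサインマニュアルVer.2", June 2018 (accessed 2023-01-27)

The color range enclosed by the dotted line in the figure above can be perceived as a gradient by most color vision types.
In color codes, it lies roughly between steelblue (#4682B4) and darkorange (#FF8C00). A color scheme that takes various types of color vision into account like this is called color universal design. When building charts that rely on color, such as the heatmap described later, using colors within this range helps lower cognitive load.

This hands-on also tries to follow color universal design in its charts.

Choosing a chart by purpose

So far we have discussed why it matters to keep the purpose in mind when visualizing. From here on, we look at suitable visualization methods for each kind of information you want to read from data.

Bar charts

The bar chart is one of the standard charts for comparing magnitudes and quantities.

Start from zero

When a bar chart expresses quantities, take care to start the axis at 0. The example below shows the number of students at a fictional prep school who passed the entrance exam of a fictional university. In the left chart the axis starts at 820, so the growth in the number of passers looks larger than it really is.

Code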

passers_df = pd.DataFrame([
    {"年度":2019, "合格者数": 840},
    {"年度":2020, "合格者数": 852},
    {"年度":2021, "合格者数": 891},
]
)

left_base = alt.Chart(passers_df) .encode(
  x=alt.X('年度:N', axis=alt.Axis(ticks=False, labelAngle=0)), 
  y=alt.Y('合格者数:Q', axis=None, scale=alt.Scale(domain=[820, 900])),
)

left_bar = left_base.mark_bar(color="gray")

left_texts = left_base.mark_text(dx=0, dy=20, color='white', size=30).encode(text=alt.Text('合格者数:Q'))

left_chart = (left_bar + left_texts).properties(width=300, height=300, title="合格者数推移（起点は820人）")

right_base = alt.Chart(passers_df) .encode(
  x=alt.X('年度:N', axis=alt.Axis(ticks=False, labelAngle=0)), 
  y=alt.Y('合格者数:Q', axis=None, scale=alt.Scale(domain=[0, 900])),
)

right_bar = right_base.mark_bar(color="gray")

right_texts = right_base.mark_text(dx=0, dy=20, color='white', size=30).encode(text=alt.Text('合格者数:Q'))

right_chart = (right_bar + right_texts).properties(width=300, height=300, title="合格者数推移（起点は0人）")

(left_chart | right_chart)

In situations where it is better to emphasize the size of a difference, the axis is sometimes deliberately started at a value other than 0.
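
How much a truncated axis exaggerates the growth can be worked out by hand. The sketch below uses the same fictional pass counts as the example above and compares the ratio of drawn bar lengths for the two axis origins; `bar_height` and `apparent_ratio` are helper names introduced for this illustration.

```python
# Values from the fictional pass-count example above.
passers = {2019: 840, 2020: 852, 2021: 891}


def bar_height(value: float, baseline: float) -> float:
    """Drawn bar length when the axis starts at `baseline`."""
    return value - baseline


def apparent_ratio(baseline: float) -> float:
    """How many times longer the 2021 bar is drawn than the 2019 bar."""
    return bar_height(passers[2021], baseline) / bar_height(passers[2019], baseline)


true_ratio = passers[2021] / passers[2019]
ratio_from_zero = apparent_ratio(baseline=0)
ratio_from_820 = apparent_ratio(baseline=820)

print(f"actual ratio 2021/2019:   {true_ratio:.2f}")
print(f"bar ratio, axis from 0:   {ratio_from_zero:.2f}")
print(f"bar ratio, axis from 820: {ratio_from_820:.2f}")
```

With the axis starting at 0 the bar lengths keep the true ratio, while starting at 820 makes the 2021 bar several times longer than the 2019 bar for an increase of about 6%.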

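Do not rotate text

Because the bars of a vertical bar chart are narrow, category labels sometimes have to be rotated to fit. Avoiding such rotated text lowers cognitive load.

The data below is movie information registered in the Internet Movie Database (IMDB).

Code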

data_path = "https://cdn.jsdelivr.net/npm/vega-datasets@v1.29.0/data/movies.json"

with urllib.request.urlopen(data_path) as f:
    movies = f.read().decode("UTF-8").split("\n")[0]

dict_data = json.loads(movies)
keys = list(dict_data[0].keys())
movies_df = pd.DataFrame([[row[key]for key in keys] for row in dict_data], columns=keys)
movies_df.head()

	Title	US_Gross	Worldwide_Gross	US_DVD_Sales	Production_Budget	Release_Date	MPAA_Rating	Running_Time_min	Distributor	Source	Major_Genre	Creative_Type	Director	Rotten_Tomatoes_Rating	IMDB_Rating	IMDB_Votes
0	The Land Girls	146083.0	146083.0	NaN	8000000.0	Jun 12 1998	R	NaN	Gramercy	None	None	None	None	NaN	6.1	1071.0
1	First Love, Last Rites	10876.0	10876.0	NaN	300000.0	Aug 07 1998	R	NaN	Strand	None	Drama	None	None	NaN	6.9	207.0
2	I Married a Strange Person	203134.0	203134.0	NaN	250000.0	Aug 28 1998	None	NaN	Lionsgate	None	Comedy	None	None	NaN	6.8	865.0
3	Let's Talk About Sex	373615.0	373615.0	NaN	300000.0	Sep 11 1998	None	NaN	Fine Line	None	Comedy	None	None	13.0	NaN	NaN
4	Slam	1009819.0	1087521.0	NaN	1000000.0	Oct 09 1998	R	NaN	Trimark	Original Screenplay	Drama	Contemporary Fiction	None	62.0	3.4	165.0

The following example shows the average rating per movie genre, but the labels on the horizontal axis end up rotated.

Code

(
    alt.Chart(movies_df)
    .transform_filter(alt.datum.Major_Genre != None)
    .mark_bar(color="gray")
    .encode(
        x=alt.X("Major_Genre:N", axis=alt.Axis(ticks=False, grid=False, title="映画ジャンル")),
        y=alt.Y("mean(IMDB_Rating):Q", axis=alt.Axis(ticks=False, grid=False, title="IMDBレーティング"))
    )
    .configure_view(strokeWidth=0)
    .properties(width=600, height=400)
)

A chart like this has high cognitive load, so let's revise the design to eliminate the rotated labels. Swapping the axes as below removes the text rotation.

Code

base = (
    alt.Chart(movies_df)
    .encode(
        x=alt.X("mean(IMDB_Rating):Q", axis=None),
        y=alt.Y("Major_Genre:N", 
                axis=alt.Axis(ticks=False, grid=False, title="主要ジャンル", labelFontSize=13, titleFontSize=13)
                )
    )
)

bars = base.mark_bar(color="gray")

texts = (
    base
    .mark_text(align='left', dx=-35, dy=0, color='white', size=13)
    .encode(text=alt.Text("mean(IMDB_Rating):Q", format=".2f"))
)

(
    (bars + texts)
    .configure_view(strokeWidth=0)
    .properties(width=600, height=400, title="IMDBレーティング")
)

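Do not use 3D charts

When there are three variables, a 3D bar chart is one way to express the magnitude of one of them. However, understanding a 3D chart requires processing depth information, which tends to raise cognitive load. Try to avoid 3D charts as much as possible.

Here we visualize the monthly maximum temperature in Seattle from 2012 to 2015 as a 3D bar chart.

Code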

seattle_weather_df = vega_datasets.data.seattle_weather()
seattle_temp_max_df = (seattle_weather_df
    .assign(year=seattle_weather_df.loc[:,"date"].dt.year)
    .assign(month=seattle_weather_df.loc[:,"date"].dt.month)
    .groupby(["year", "month"])
    .max()
    .loc[:,["temp_max"]]
    .reset_index()
)
seattle_temp_max_df.head()

	year	month	temp_max
0	2012	1	12.8
1	2012	2	16.1
2	2012	3	15.6
3	2012	4	23.3
4	2012	5	26.7

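Splitting Seattle's monthly maximum temperatures along two axes, "year" and "month", and displaying the "maximum temperature" looks like a good way to express things such as seasonal change.

However, when the three pieces of information, "year", "month", and "maximum temperature", are expressed with a 3D bar chart, bars at the back look smaller than bars at the front, and parts of the back bars are hidden altogether, so reading the information accurately actually becomes harder.

Code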

fig = plt.figure(figsize=(8, 8))
ax = fig.add_subplot(111, projection='3d')

ax.bar3d(x=seattle_temp_max_df["month"],
         y=seattle_temp_max_df["year"],
         z=0, 
         dx=0.6, dy=0.2, dz=seattle_temp_max_df["temp_max"], shade=True)
ax.set_xlabel("月")
ax.set_xticks([1, 4, 7, 10])
ax.set_ylabel("年")
ax.set_yticks([2012, 2013, 2014, 2015])
ax.set_title('シアトルの月間最高気温')

plt.show()

To visualize three-dimensional information like this, arranging bar charts in a two-dimensional grid is recommended.

Code

base = alt.Chart(seattle_temp_max_df, width=120)

bar = base.mark_bar().encode(x=alt.X("temp_max:Q", axis=None))

text = base.mark_text(fontSize=13, color="white",align="left").encode(
    x=alt.value(0),
    text=alt.Text('temp_max:Q', format='.3s'),
)

(
    (bar + text).facet(
        row=alt.Row("month:N", title="月", header=alt.Header(labelAlign="left", labelAngle=0, titleAngle=0)),
        column=alt.Column("year:N", header=alt.Header(labelAngle=0), title="シアトルの月間最高気温"),
    )
    .configure_facet(spacing=0)
)

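A grid of bars like this can also be made in Excel with the "conditional formatting" and "data bars" features.

Heatmaps

A heatmap is another chart that expresses magnitude when there are three variables. Because a heatmap expresses magnitude through color gradients or shading, it saves one dimension.

However, color cannot convey exact values, so it is helpful to also display the value as text in each cell.

When considering universal design, it is helpful to use a scheme such as scheme='greenblue', whose gradient is easy to distinguish for most color vision types.

Code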

base = alt.Chart(seattle_temp_max_df).encode(
    x=alt.X('month:N', title="月", axis=alt.Axis(labelAngle=0, labelFontSize=15)),
    y=alt.Y('year:N', title=None, axis=alt.Axis(labelFontSize=15)),
)

heatmap = base.mark_rect().encode(
    color=alt.Color('temp_max:Q', title="最高気温(℃)", scale=alt.Scale(scheme='greenblue'))
)

text = base.mark_text(fontSize=13).encode(
        text=alt.Text('temp_max:Q', format='.3s'),
        color=alt.condition(
            alt.datum.temp_max > 24,
            alt.value('white'),
            alt.value('black')
        )
)

(heatmap + text).properties(width=500, height=100, title="シアトル月別最高気温推移")

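Showing proportions

Stacked bar charts

A stacked bar chart is a bar chart divided by color, where the breakdown by segment and the total can be read from the bar lengths. Because proportions sum to 1, bars of proportions always have the same height in a stacked bar chart (the constant-sum constraint). For that reason, stacked bar charts are often used to compare proportions.

This time we use Seattle weather data aggregated by year.

Code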

seattle_weather_day_cnt_df = (seattle_weather_df
    .assign(year=seattle_weather_df.loc[:,"date"].dt.year)
    .groupby(["year", "weather"])
    .count()
    .loc[:,["date"]]
    .reset_index()
)
seattle_weather_day_cnt_df.head()

	year	weather	date
0	2012	drizzle	31
1	2012	fog	5
2	2012	rain	191
3	2012	snow	21
4	2012	sun	118

Bar charts are very effective because simply placing them side by side also makes changes over time easy to see.

Code

(
    alt.Chart(seattle_weather_day_cnt_df)
    .mark_bar()
    .encode(
        x=alt.X(
            "year:O",
            axis=alt.Axis(labelAngle=0, ticks=False, grid=False, title="年")
        ),
        y=alt.Y(
            "date:Q",stack="normalize",
            axis=alt.Axis(format="%", ticks=True, grid=False, title=None),
        ),
        color=alt.Color("weather:N", title="天気")
    )
    .configure_view(strokeWidth=0)
    .properties(width=300, height=400, title="天気別日数の割合")
)
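
The constant-sum property behind the normalized stack can be checked by hand. The sketch below assumes the five 2012 rows shown in the table above (which add up to 366 days) and computes each weather type's share plus the cumulative height at which its segment ends; the variable names are introduced for this illustration only.

```python
# Day counts for 2012, copied from the aggregated table above.
days_2012 = {"drizzle": 31, "fog": 5, "rain": 191, "snow": 21, "sun": 118}

total_days = sum(days_2012.values())
shares = {weather: count / total_days for weather, count in days_2012.items()}

# Cumulative upper edge of each stacked segment, in table order.
stack_tops = {}
running = 0.0
for weather, share in shares.items():
    running += share
    stack_tops[weather] = running

for weather in shares:
    print(f"{weather:8s} share={shares[weather]:.1%} top={stack_tops[weather]:.1%}")
```

Whatever the individual counts are, the last segment always ends at 100%, which is why every normalized bar has the same height.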

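Pie charts

The pie chart is known as a common way to visualize proportions, but it has the following drawbacks.

With three or more segments, it is hard to tell which one takes the largest share
Segments can only be distinguished by color, so a legend is indispensable, which tends to add cognitive load
The circle alone does not convey exact values, so numbers often have to be written near the chart
Comparing two or more pies is difficult

Here the weather proportions for Seattle visualized earlier are shown as pie charts. Pie charts make "the most common type of weather in that year" easy to see, but year-to-year trends and changes in weather types with a small share (such as drizzle) are harder to read than in the stacked bar chart. We recommend basically not using pie charts except in limited situations, such as when you are not comparing multiple pies or when there are few categories (two or three at most).

Code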

(
    alt.Chart(seattle_weather_day_cnt_df)
    .mark_arc()
    .encode(
        theta=alt.Theta(field="date", type="quantitative", title="年"),
        color=alt.Color(field="weather", type="nominal", title="天気"),
        column=alt.Column("year:O", title="年")
    )
    .configure_view(strokeWidth=0)
    .properties(width=200, height=200)
)

A stacked bar chart visualizes cumulative proportions; when you want to visualize how each category's share changes over time, use a line chart.

Code

base = (
    alt.Chart(seattle_weather_day_cnt_df)
    .transform_joinaggregate(total_date="sum(date):Q", groupby=["year"])
    .transform_calculate(comp_date="datum.date/datum.total_date")
    .encode(
        x=alt.X("year",type="ordinal", axis=alt.Axis(ticks=True, grid=False, labelAngle=0, title="年")),
        y=alt.Y("comp_date",type="quantitative",
                scale=alt.Scale(domain=[0, 1]),
                axis=alt.Axis(format="%", ticks=True, grid=False, title=None)
                ),
        color=alt.Color("weather", type="nominal", legend=None),
    )
)

points = base.mark_circle(size=40)

lines = base.mark_line()

texts =  (
    base
    .transform_filter(alt.datum.year == 2012)
    .mark_text(dx=-20, dy=-5)
    .encode(text="weather:N")
)

(
    (points + lines + texts)
    .properties(width=400, height=300, title="天気別日数の割合")
    .configure_view(strokeWidth=0)
)

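It is a small point, but writing what each color represents directly on the chart lowers cognitive load more than collecting it in a legend.

Showing time series data

Line charts

The line chart is known as one of the effective ways to visualize change over time.

Here we use data on the number of unemployed people in the United States from 2000 to 2010.

Code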

data_path = "https://cdn.jsdelivr.net/npm/vega-datasets@1.29.0/data/unemployment-across-industries.json"

with urllib.request.urlopen(data_path) as f:
    unemployment_across_industries = f.read().decode("UTF-8").split("\n")[0]
dict_data = json.loads(unemployment_across_industries)
keys = list(dict_data[0].keys())
unemployment_across_industries_df = pd.DataFrame([[row[key]for key in keys] for row in dict_data], columns=keys)
unemployment_across_industries_df["date"] = pd.to_datetime(unemployment_across_industries_df["date"])
unemployment_across_industries_df.head()

	series	year	month	count	rate	date
0	Government	2000	1	430	2.1	2000-01-01 08:00:00+00:00
1	Government	2000	2	409	2.0	2000-02-01 08:00:00+00:00
2	Government	2000	3	311	1.5	2000-03-01 08:00:00+00:00
3	Government	2000	4	269	1.3	2000-04-01 08:00:00+00:00
4	Government	2000	5	370	1.9	2000-05-01 07:00:00+00:00

Avoid spaghetti charts

When there are too many series, the lines overlap and cognitive load can become high. Such a chart is sometimes called a spaghetti chart.

Code


base = (
    alt.Chart(unemployment_across_industries_df)
    .encode(
        x=alt.X(field="date",type="temporal",axis=alt.Axis(grid=False, title="年")),
        y=alt.Y(field="count",type="quantitative",axis=alt.Axis(grid=False, title="失業者数")),
        color=alt.Color(
            field="series",type="nominal",
            scale=alt.Scale(scheme="category20b"),
            legend=None
        )
    )
)

lines = base.mark_line()

texts = (
    base
    .transform_window(last_record='percent_rank(year)', groupby=["series"])
    .transform_filter(alt.datum.last_record==1)
    .mark_text(dx=10, dy=0, align="left")
    .encode(
        text=alt.Text("series")
        )
)

(lines+texts).configure_view(strokeWidth=0).properties(width=500, height=300, title="米国失業者数推移")

Use highlighting

Use highlighting or similar techniques so that overlapping lines do not get in the way. The chart below highlights the industry nearest to the cursor.

Code

series_mouse_selection = alt.selection(
    type="single", fields=["series"], on="mouseover", nearest=True, init={"series": "Agriculture"}
)

base = alt.Chart(unemployment_across_industries_df).encode(
        x=alt.X(field="date",type="temporal",axis=alt.Axis(grid=False, title="年")),
        y=alt.Y(field="count",type="quantitative",axis=alt.Axis(grid=False, title="失業者数")),
    detail=alt.Detail(field="series",type="nominal"),
    tooltip=[
        alt.Tooltip(field="count",type="quantitative"),
        alt.Tooltip(field="series",type="nominal"),
        alt.Tooltip(field="date",type="temporal", format="%Y年%m月%d日 %H時%M分"),
    ],
)

points = (
    base.mark_circle()
    .encode(
        opacity=alt.condition(
            predicate=series_mouse_selection,
            if_true=alt.value(1),
            if_false=alt.value(0),
        ),
    )
    .add_selection(series_mouse_selection)
)

lines = (
    base
    .mark_line()
    .encode(
        color=alt.condition(
            predicate=series_mouse_selection,
            if_true=alt.value("steelblue"),
            if_false=alt.value("lightgray")
            ),
    opacity=alt.condition(
        predicate=series_mouse_selection,
        if_true=alt.value(1),
        if_false=alt.value(0.5)
        )
    )
)

texts = (
    alt.Chart(unemployment_across_industries_df)
    .transform_window(row_number='row_number(year)', groupby=["series"])
    .transform_filter(alt.datum.row_number==1)
    .mark_text(align="center", dx=0, dy=-170, fontSize=18)
    .encode(text=alt.Text("series"))
    .transform_filter(series_mouse_selection)
)

(points + lines + texts).configure_view(strokeWidth=0).properties(width=600)

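Avoid dual axes

To visualize two variables over the same period, line charts and bar charts are sometimes overlaid. However, to understand such a chart, the reader has to consult the axes and legend to work out what each layer represents, so it often cannot be called a chart with low cognitive load.

The chart below shows the maximum temperature and maximum wind speed in Seattle from 2012 to 2015, aggregated by month. By referring to the colors, you can tell that the left vertical axis shows maximum temperature and the right vertical axis shows maximum wind speed.

Code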

seattle_temp_max_line = (
    alt.Chart(seattle_weather_df)
    .mark_line(color="steelblue")
    .encode(
        x=alt.X('yearmonth(date):T',axis=alt.Axis(grid=False, title="年")),
        y=alt.Y('max(temp_max)',axis=alt.Axis(orient="left", grid=False, titleColor="steelblue", title="最高気温"))
    )
)

seattle_wind_line = (
    alt.Chart(seattle_weather_df)
    .mark_bar(color="darkorange")
    .encode(
        x=alt.X('yearmonth(date):T',axis=alt.Axis(grid=False)),
        y=alt.Y('max(wind)',axis=alt.Axis(orient="right", grid=False, titleAngle=-90, titleX =35, titleColor="darkorange", title="最大風速"))
    )
)

(seattle_temp_max_line + seattle_wind_line).configure_view(strokeWidth=0)

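To visualize two or more variables over the same time series, placing the charts next to each other vertically or horizontally lowers cognitive load.

Code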

seattle_temp_max_line = (
    alt.Chart(seattle_weather_df)
    .mark_line(color="steelblue")
    .encode(
        x=alt.X('yearmonth(date):T',axis=None, title="年"),
        y=alt.Y('max(temp_max)',axis=alt.Axis(grid=False, titleColor="steelblue", title="最高気温"))
    )
    .properties(width=500, height=100)
)

seattle_wind_line = (
    alt.Chart(seattle_weather_df)
    .mark_bar(color="darkorange")
    .encode(
        x=alt.X('yearmonth(date):T',axis=alt.Axis(grid=False)),
        y=alt.Y('max(wind)',axis=alt.Axis(grid=False, titleAngle=-90, titleColor="darkorange", title="最大風速"))
    )
    .properties(width=500, height=100)
)

(seattle_temp_max_line & seattle_wind_line).configure_view(strokeWidth=0)

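Slope charts

To compare the difference between two points in a time series, the "first" and the "last", a slope chart is a good choice.

Here we focus on two points, 1931 and 1932, and visualize barley yield by site.

Code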

barley_df = vega_datasets.data.barley()
barley_df.head()

	yield	variety	year	site
0	27.00000	Manchuria	1931	University Farm
1	48.86667	Manchuria	1931	Waseca
2	27.43334	Manchuria	1931	Morris
3	39.93333	Manchuria	1931	Crookston
4	32.96667	Manchuria	1931	Grand Rapids

The slope chart shows that Morris is the only site where the barley yield increased.

Code

base = (
    alt.Chart(barley_df)
    .encode(
        x=alt.X('year:O', axis=alt.Axis(grid=False, labelAngle=0, title="年")),
        y=alt.Y('sum(yield)', axis=alt.Axis(grid=False, title="大麦収穫量")),
        color=alt.Color('site', legend=None)
    )
)

lines = base.mark_line(size=3)

points = base.mark_circle(size=30)

texts =  (
    base
    .transform_filter(alt.datum.year == 1932)
    .mark_text(dx=10, dy=0, align="left")
    .encode(text="site:N")
)

(lines + points + texts).configure_view(strokeWidth=0).properties(width=300, height=200, title="地域別大麦収穫量")

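However, as with line charts, a large number of series turns the chart into something like a spaghetti chart. Try to reduce cognitive load with techniques such as splitting each element into its own column.

Code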

base = (
    alt.Chart(barley_df)
    .encode(
        x=alt.X('year:O', axis=alt.Axis(grid=False, labelAngle=0, title=None)),
        y=alt.Y('median(yield)', axis=alt.Axis(grid=False, title="大麦収穫量")),
    )
)

slop = base.mark_line()

point = base.mark_circle()

(
    (slop+point)
    .properties(width=60, height=200)
    .facet(column=alt.Column("site", title=None))
    .configure_view(strokeWidth=0)
)

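Waterfall charts

A waterfall chart visualizes not only the two points, "first" and "last", but also the intermediate steps. It takes the first value as the baseline and expresses how much the value changed in each subsequent period.

It can show clearly what happened in which period, and is well suited to visualizing things like the revenue trend in the example below.

Code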


### Trying to recreate this vega lite implementation in Altair 
### https://vega.github.io/vega-lite/examples/waterfall_chart.html

## had to add some more fields from the original example data in the vega-lite implementation
the_data=[
      {"label": "Begin", "amount": 4000, "order":0, "color_label":'Begin'},
      {"label": "Jan", "amount": 1707, "order":1, "color_label":'+'},
      {"label": "Feb", "amount": -1425, "order":2, "color_label":'-'},
      {"label": "Mar", "amount": -1030, "order":3, "color_label":'-'},
      {"label": "Apr", "amount": 1812, "order":4, "color_label":'+'},
      {"label": "May", "amount": -1067, "order":5, "color_label":'-'},
      {"label": "Jun", "amount": -1481, "order":6, "color_label":'-'},
      {"label": "Jul", "amount": 1228, "order":7, "color_label":'+'},
      {"label": "Aug", "amount": 1176, "order":8, "color_label":'+'},
      {"label": "Sep", "amount": 1146, "order":9, "color_label":'+'},
      {"label": "Oct", "amount": 1205, "order":10, "color_label":'+'},
      {"label": "Nov", "amount": -1388, "order":11, "color_label":'-'},
      {"label": "Dec", "amount": 1492, "order":12, "color_label":'+'},
      {"label": "End", "amount": 0, "order":13, "color_label":'End'}
    ]

df=pd.DataFrame(the_data)


## workaround to enable 3 different colors for the bars
color_lst = (
    alt.Color('color_label', legend=None, scale=alt.Scale(
        domain=['Begin','End','-','+'], range=['gray', 'gray', 'steelblue', 'darkorange']
        )
    )
)

## the "base_chart" defines the transform_window, transform_calculate, and encode X and color coding
base = (
    alt.Chart(df)
    .transform_window(sort=[{'field': 'order'}],
                      frame=[None, 0], 
                      window_sum_amount='sum(amount)',
                      window_lead_label='lead(label)'
                      )
    .transform_calculate(
        calc_lead="datum.window_lead_label === null ? datum.label : datum.window_lead_label",
        calc_prev_sum="datum.label === 'End' ? 0 : datum.window_sum_amount - datum.amount",
        calc_amount="datum.label === 'End' ? datum.window_sum_amount : datum.amount",
        calc_text_amount="(datum.label !== 'Begin' && datum.label !== 'End' && datum.calc_amount > 0 ? '+' : '') + datum.calc_amount",
        calc_center="(datum.window_sum_amount + datum.calc_prev_sum) / 2",
        calc_sum_dec="datum.window_sum_amount < datum.calc_prev_sum ? datum.window_sum_amount : ''",
        calc_sum_inc="datum.window_sum_amount > datum.calc_prev_sum ? datum.window_sum_amount : ''",
        )
    .encode(
        x=alt.X('label:O', axis=alt.Axis(labelAngle=0, title="月"),
                sort=alt.EncodingSortField(field="order", op="max", order='ascending')
                )
        )
)

bar = (
    base.mark_bar()
    .encode(
        y=alt.Y('calc_prev_sum:Q',axis=None),
        y2=alt.Y2('window_sum_amount:Q'), 
        color=color_lst
        )
    )

## the "rule" chart is for the lines that connect bars each month
rule=base.mark_rule(color='#404040', opacity=0.9, strokeWidth=2, xOffset=-25, x2Offset=25, strokeDash=[3,3]).encode(
    #draw a horizontal line where the height (y) is at window_sum_amount, and the ending width (x2) is at calc_lead
    y='window_sum_amount:Q'
    , x2='calc_lead'
)

## create text values to display
text_pos_values_top_of_bar=base.mark_text(baseline='bottom', dy=-4).encode(
    text=alt.Text('calc_sum_inc:N')
    , y='calc_sum_inc:Q'
)
text_neg_values_bot_of_bar=base.mark_text(baseline='top', dy=4).encode(
    text=alt.Text('calc_sum_dec:N')
    , y='calc_sum_dec:Q'
)
text_bar_values_mid_of_bar=base.mark_text(baseline='middle').encode(
    text=alt.Text('calc_text_amount:N')
    , y='calc_center:Q'
     , color=alt.condition("datum.label ==='Begin'||datum.label === 'End'", alt.value("white"), alt.value("white"))
)

## layer everything together
(
    (bar+rule+text_pos_values_top_of_bar+text_neg_values_bot_of_bar+text_bar_values_mid_of_bar)
    .properties(width=600, height=300, title="売上収益推移")
    .configure_view(strokeWidth=0)
)
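
The window and calculate transforms above boil down to a running total. As a minimal sketch, the same start and end level of each floating bar can be computed in plain Python with `itertools.accumulate`, assuming the sample amounts listed above; `running_totals` and `previous_totals` are names introduced here and roughly correspond to `window_sum_amount` and `calc_prev_sum` in the chart code.

```python
from itertools import accumulate

# (label, amount) pairs from the sample data above, without the "End" placeholder row.
steps = [
    ("Begin", 4000), ("Jan", 1707), ("Feb", -1425), ("Mar", -1030),
    ("Apr", 1812), ("May", -1067), ("Jun", -1481), ("Jul", 1228),
    ("Aug", 1176), ("Sep", 1146), ("Oct", 1205), ("Nov", -1388), ("Dec", 1492),
]

labels = [label for label, _ in steps]
amounts = [amount for _, amount in steps]

# Level reached after each step: one edge of each floating bar.
running_totals = list(accumulate(amounts))
# Level before each step: the other edge of each floating bar.
previous_totals = [0] + running_totals[:-1]

for label, start, end in zip(labels, previous_totals, running_totals):
    print(f"{label:5s} {start:>5d} -> {end:>5d}")

end_total = running_totals[-1]  # height of the final "End" bar
```

Each bar spans from the previous level to the new level, and the "End" bar simply shows the last running total.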

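Area charts

An area chart is also effective for tracking how a stacked bar chart changes over time. It is hard to perceive fine differences in an area chart, but it suits the kinds of data that stacked bar charts handle well, such as proportions.

Here we use the US unemployment data introduced earlier.

Code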

(
    alt.Chart(unemployment_across_industries_df)
    .mark_area()
    .encode(
        x=alt.X("date:T", title="年"),
        y=alt.Y("count:Q", stack="normalize", axis=alt.Axis(format="%", title="割合")),
        color=alt.Color("series:N",
            scale=alt.Scale(scheme="category20b"),
            legend=alt.Legend(labelFontSize=10, title="業種")
        ),
    )
    .configure_view(strokeWidth=0)
    .properties(width=500, height=300, title="米国失業者数推移")
)

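To extract and compare only specific categories, it helps to add a feature that normalizes over only the items clicked in the legend.
In the area chart below, clicking legend entries while holding the Shift key normalizes over just those items.

Code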

series_multi_selection = alt.selection_multi(fields=["series"], bind="legend")

base = (
    alt.Chart(unemployment_across_industries_df)
    .encode(
        x=alt.X("date:T", title="年"),
        y=alt.Y("count:Q", stack="normalize", axis=alt.Axis(format="%", title="割合")),
        color=alt.Color("series:N",
            scale=alt.Scale(scheme="category20b"),
            legend=alt.Legend(labelFontSize=10, title="業種")
        ),
    )
)

background = base.mark_area(opacity=0)
foreground = base.mark_area().add_selection(series_multi_selection).transform_filter(series_multi_selection)

(background + foreground).properties(width=500, height=300, title="米国失業者数推移")

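Showing distributions

Histograms

A histogram shows a distribution by putting frequency on the vertical axis and class intervals on the horizontal axis. Narrowing the spacing between bars makes gaps easier to notice, so it becomes easier to identify, for example, which class holds the bulk of the data.

The example below shows the distribution of maximum temperature in Seattle. We can see that temperatures of 10 to 15 °C are the most frequent.

Code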

(
    alt.Chart(seattle_weather_df)
    .mark_bar()
    .encode(
        x=alt.X(
            "temp_max",
            bin=alt.Bin(step=5, extent=[-10, 45]),
            axis=alt.Axis(
                labelFontSize=15, ticks=True, grid=False, titleFontSize=18, title="最高気温"
            ),
        ),
        y=alt.Y(
            "count(temp_max)",
            axis=alt.Axis(labelFontSize=15, ticks=True, grid=False, titleFontSize=18, title="日数"),
            stack=None,
        )
    )
    .properties(width=500, height=300, title="最高気温別日数")
    .configure_view(strokeWidth=0)
)

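In histograms and other distribution charts, the class intervals are sometimes called bins. The histogram of maximum temperature above uses a bin width of 5 °C.

Charts whose bin width is not uniform throughout raise cognitive load, so keep the bin width uniform when visualizing. That said, depending on the distribution it can be hard to choose a bin width that represents the overall picture well, and histograms often require more care than bar or line charts to keep cognitive load low.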
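
Equal-width binning itself is simple arithmetic. The following is a minimal sketch, assuming a handful of hypothetical temperatures rather than the Seattle dataset, that assigns each value to a 5 °C bin and counts the values per bin; `bin_start` is a helper introduced for this illustration.

```python
import math
from collections import Counter


def bin_start(value: float, step: float) -> float:
    """Lower edge of the equal-width bin [k*step, (k+1)*step) that contains value."""
    return math.floor(value / step) * step


# Hypothetical daily maximum temperatures in °C (not taken from the dataset).
temps = [3.9, 7.2, 10.0, 11.1, 12.8, 14.4, 15.0, 18.3, 21.7, 26.1, -1.6]

step = 5
counts = Counter(bin_start(t, step) for t in temps)

# Text histogram: one '#' per value in each bin.
for start in sorted(counts):
    print(f"[{start:>3}, {start + step:>3}) {'#' * counts[start]}")
```

Changing `step` changes the shape of the printed histogram, which is a quick way to see how sensitive the picture is to the bin width.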

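Dumbbell charts

A dumbbell chart expresses the minimum and maximum in a dumbbell shape. The example below shows the temperature range by month, and we can see that every month has a range of roughly 30 °C.

Code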

base = (
    alt.Chart(seattle_weather_df)
    .encode(y=alt.Y("month(date):O",
                    axis=alt.Axis(titleAngle=0, labelFontSize=15, ticks=True, grid=False, title="月")
                    )
    )
)

temp_min_points = (
    base.mark_point(size=250, color="steelblue", filled=True)
    .encode(x=alt.X("min(temp_min):Q", axis=alt.Axis(
        labelFontSize=15, ticks=True, grid=False, title="気温"))
    )
)

temp_max_points = (
    base.mark_point(size=250, color="darkorange", filled=True)
    .encode(x=alt.X("max(temp_max):Q", axis=alt.Axis(
        labelFontSize=15, ticks=True, grid=False, titleFontSize=18, title="気温"))
    )
)

bars = (
    base.transform_fold(['temp_min', 'temp_max'], as_=['min_max', "temp"])
    .mark_line(size=2, color="gray")
    .encode(x=alt.X("temp:Q"), detail="month(date):O")
)

(
    (temp_min_points + temp_max_points + bars)
    .properties(width=600, height=300, title="2012～2015年におけるシアトルの気温")
    .configure_view(strokeWidth=0)
)

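Box plots

A box plot is a chart with a simple design that visualizes several statistics that are robust to outliers, such as the quartiles and the median. That makes it well suited to getting an overview of a distribution.

Data outside the range between the whisker minimum and maximum are treated as outliers. Typically, the lower limit for the minimum is "first quartile - 1.5 × interquartile range" and the upper limit for the maximum is "third quartile + 1.5 × interquartile range", though the factor 1.5 can be changed in some tools.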
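
The whisker limits described above can be computed directly. The sketch below uses `statistics.quantiles` from the standard library on a hypothetical sample; note that tools differ in how they interpolate quartiles, so the exact limits here (with `method="inclusive"`) are an assumption and may not match a given plotting library.

```python
import statistics


def whisker_limits(values, k=1.5):
    """Return (lower_limit, upper_limit) = (Q1 - k*IQR, Q3 + k*IQR)."""
    q1, _median, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Hypothetical sample with one extreme value.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 50]

lower, upper = whisker_limits(values)
outliers = [v for v in values if v < lower or v > upper]

print(f"limits: [{lower}, {upper}], outliers: {outliers}")
```

Passing a different `k` shows how the choice of factor changes which points are flagged as outliers.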

The chart below shows the distribution of maximum temperature in Seattle from 2012 to 2015.

Code

(
    alt.Chart(seattle_weather_df)
    .mark_boxplot( 
        size=30, # box width
        ticks=alt.MarkConfig(width=30), # width of the whisker end caps
        median=alt.MarkConfig(color="black", size=30), # style of the median line
    )
    .encode(
        x=alt.X("month(date):O", 
                axis=alt.Axis(
                    labelFontSize=13, labelAngle=0, titleFontSize=13, format='%b', ticks=True, grid=False, title="月")),
        y=alt.Y(
            "temp_max:Q", title="最高気温",
             axis=alt.Axis(labelFontSize=13, titleFontSize=13, ticks=True, grid=False)
        ),
    )
    .properties(width=600, height=300, title="2012～2015年におけるシアトルの最高気温")
    .configure_view(strokeWidth=0)
)

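A box plot can express many statistics in a simple figure, but note that it may not get across to readers who are unfamiliar with statistics.

Ridgeline plots

A ridgeline plot is one way to show the rough shape of distributions. It looks like a smoothed histogram, and the height of each ridge expresses frequency.

Code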

step = 30
overlap = 0.5
x_range = [-5, 40]
bin_range = 5

(
    alt.Chart(seattle_weather_df)
    .transform_timeunit(month='month(date)')
    .transform_bin(as_="ビン", field="temp_max", bin=alt.Bin(step=bin_range, extent=x_range))
    .transform_aggregate(y_axis="count()", groupby=["month", "ビン"])
    .transform_impute(
        impute="y_axis", groupby=["month"], key="ビン", value=0, keyvals=x_range
    )
    .mark_area(
        interpolate="monotone", fillOpacity=0.6, stroke="lightgray", strokeWidth=0.5
    )
    .encode(
        alt.X(
            "ビン:Q",
            title="最高気温（℃）",
            axis=alt.Axis(grid=False),
            scale=alt.Scale(domain=x_range),
        ),
        alt.Y(
            "y_axis:Q",
            stack=None,
            title=None,
            axis=None,
            scale=alt.Scale(range=[step, -step * overlap]),
        ),
        alt.Row("month:T", title=None, header=alt.Header(labelAlign="left", labelAngle=0, format='%b')),
    )
    .properties(bounds="flush", width=400, height=int(step))
    .configure_facet(spacing=0)
    .configure_view(stroke=None)
    .configure_title(anchor="end")
)

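Violin plots

The rough shape of a distribution can also be expressed with a violin plot. As with the ridgeline plot, the size of the bulge expresses how frequent the values are.

Code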

x_range = [-5, 40]
bin_range = 5

(
    alt.Chart(seattle_weather_df)
    .transform_timeunit(month='month(date)')
    .transform_density("temp_max", as_=["temp_max", "density"], extent=x_range, groupby=["month"])
    .mark_area(orient="horizontal")
    .encode(
        x=alt.X("density:Q", stack="center", impute=None, title=None,
            axis=alt.Axis(labels=False, values=[0], grid=False, ticks=True),
        ),
        y=alt.Y("temp_max:Q", axis=alt.Axis(labels=False, values=[0], grid=False, ticks=True, title="Maximum temperature")),
        column=alt.Column("month:T",
            header=alt.Header(labelFontSize=13, format='%b',
                title="Maximum temperature in Seattle, 2012-2015"
            ),
        ),
    )
    .configure_facet(spacing=0)
    .configure_view(stroke=None)
    .properties(width=50, height=300)
)
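`transform_density` replaces the raw values with a kernel density estimate. To make the idea concrete, here is a minimal NumPy sketch of a Gaussian kernel density estimate. The helper name `kde_gaussian`, the sample values, and the bandwidth of 2.0 are all assumptions for illustration. Vega-Lite picks its own bandwidth by default, so the numbers will not match the chart above.

```python
import numpy as np

def kde_gaussian(samples, grid, bandwidth):
    """Evaluate a Gaussian kernel density estimate of `samples` on `grid`."""
    samples = np.asarray(samples, dtype=float)
    grid = np.asarray(grid, dtype=float)
    # One Gaussian bump per sample, evaluated at every grid point.
    z = (grid[:, None] - samples[None, :]) / bandwidth
    kernels = np.exp(-0.5 * z**2) / np.sqrt(2 * np.pi)
    # Average the bumps and rescale so the result integrates to 1.
    return kernels.mean(axis=1) / bandwidth

temps = [3.0, 6.5, 8.0, 8.5, 9.0, 11.0, 14.0]  # hypothetical values (deg C)
grid = np.linspace(-5, 40, 451)                 # same extent as x_range
density = kde_gaussian(temps, grid, bandwidth=2.0)

# Sanity check: a density integrates to about 1 over a range covering the data.
area = np.sum(density) * (grid[1] - grid[0])
peak = grid[np.argmax(density)]
print(round(area, 3), peak)
```

The violin shape is this curve mirrored around a vertical centre line, which is what `stack="center"` achieves in the chart.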

Jitter plot

A jitter plot draws every data point as a dot positioned by its value. How densely or sparsely the dots cluster reveals the distribution.

Code

(
    alt.Chart(seattle_weather_df)
    .transform_timeunit(month='month(date)')
    .mark_circle(size=8)
    .encode(
        x=alt.X(
            "jitter:Q",
            title=None,
            axis=alt.Axis(values=[0], ticks=True, grid=False, labels=False),
        ),
        y=alt.Y("temp_max:Q", axis=alt.Axis(title="Maximum temperature", grid=False)),
        column=alt.Column("month:T",
                          title="Maximum temperature in Seattle, 2012-2015",
                          header=alt.Header(format='%b', labelFontSize=13, titleFontSize=13)
                          ),
    )
    # Box-Muller transform: turns two uniform random numbers into a normally
    # distributed horizontal offset, used only to spread the points apart.
    .transform_calculate(jitter="sqrt(-2*log(random()))*cos(2*PI*random())")
    .configure_facet(spacing=10)
    .properties(width=30, height=300)
)
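The `jitter` expression in the code above is the Box-Muller transform, which converts two uniform random numbers into one standard normal value, so most points stay near the centre of each column and only a few stray far from it. A NumPy sketch of the same formula follows. The seed and sample size are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)  # seed chosen arbitrarily for reproducibility
n = 100_000

# rng.random() lies in [0, 1); using 1 - u keeps log() away from zero.
u1 = 1.0 - rng.random(n)
u2 = rng.random(n)

# Box-Muller: sqrt(-2 ln u1) * cos(2 pi u2) follows a standard normal distribution.
jitter = np.sqrt(-2.0 * np.log(u1)) * np.cos(2.0 * np.pi * u2)

# The sample mean should be close to 0 and the standard deviation close to 1.
print(jitter.mean(), jitter.std())
```

The horizontal position carries no meaning in a jitter plot, so any distribution that spreads the points apart would work. A normal offset simply keeps the columns visually compact.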


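Showing relationships

Scatter plot

A scatter plot is well suited to showing the relationship between two variables. Both axes take continuous variables, and the position of each point lets you compare the magnitude of individual data points. Information that is easy to read from a scatter plot includes:

Where the points are concentrated
Which points are outliers
How the two variables on the axes are correlated (positively, negatively, or not at all)

Here we use a dataset of cars from the USA, Japan, and Europe as an example.

Code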
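The last of these points can also be checked numerically. The sketch below computes Pearson's correlation coefficient with pandas on a handful of made-up horsepower and fuel economy values. A value near -1 indicates a strong negative correlation, a value near +1 a strong positive one, and a value near 0 no linear relationship.

```python
import pandas as pd

# Hypothetical values for illustration, not taken from the cars dataset.
df = pd.DataFrame({
    "horsepower": [60, 75, 90, 110, 130, 150, 180, 200],
    "mpg":        [38, 33, 30, 26, 22, 18, 15, 13],
})

# Series.corr uses Pearson's correlation coefficient by default.
r = df["horsepower"].corr(df["mpg"])
print(f"r = {r:.2f}")
```

The coefficient summarizes the direction and strength of a linear trend in one number, but unlike the scatter plot it says nothing about clusters or outliers.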

vega_datasets.data.cars().head()

	Name	Miles_per_Gallon	Cylinders	Displacement	Horsepower	Weight_in_lbs	Acceleration	Year	Origin
0	chevrolet chevelle malibu	18.0	8	307.0	130.0	3504	12.0	1970-01-01	USA
1	buick skylark 320	15.0	8	350.0	165.0	3693	11.5	1970-01-01	USA
2	plymouth satellite	18.0	8	318.0	150.0	3436	11.0	1970-01-01	USA
3	amc rebel sst	16.0	8	304.0	150.0	3433	12.0	1970-01-01	USA
4	ford torino	17.0	8	302.0	140.0	3449	10.5	1970-01-01	USA

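By varying the color or shape of the points by attribute, a scatter plot also lets you compare how the distribution differs between groups.

The example below shows the relationship between horsepower and fuel economy. Cars from the USA tend to have more horsepower, whereas cars from Europe and Japan tend to have better fuel economy.

However, because a scatter plot spreads the data over two axes, the distribution of each individual variable can be hard to read. If readers also need the distribution of each variable, it is helpful to place histograms alongside the scatter plot.

Code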

width, height = 450, 450
base = alt.Chart(vega_datasets.data.cars())

# Main scatter plot: horsepower against fuel economy, colored by origin.
scatter = (
    base.mark_circle(size=60)
    .encode(
        x=alt.X("Horsepower", title="Horsepower", axis=alt.Axis(grid=False)),
        y=alt.Y("Miles_per_Gallon", title="Fuel economy (miles per gallon)", axis=alt.Axis(grid=False)),
        color="Origin",
        tooltip=["Name", "Origin", "Horsepower", "Miles_per_Gallon"],
    )
    .properties(width=width, height=height)
)

# Marginal histogram of horsepower, placed above the scatter plot.
hist_x = (
    base.mark_bar()
    .encode(
        x=alt.X("Horsepower", bin=alt.Bin(step=10), axis=None),
        y=alt.Y("count(Horsepower)", axis=alt.Axis(grid=False, title="Number of cars")),
        color=alt.Color("Origin"),
    )
    .properties(width=width, height=height // 3)
)

# Marginal histogram of fuel economy, placed to the right of the scatter plot.
hist_y = (
    base.mark_bar()
    .encode(
        y=alt.Y("Miles_per_Gallon", bin=alt.Bin(step=3), axis=None),
        x=alt.X("count(Miles_per_Gallon)", axis=alt.Axis(grid=False, title="Number of cars")),
        color=alt.Color("Origin"),
    )
    .properties(width=width // 3, height=height)
)

(
    (hist_x & (scatter | hist_y))
    .configure_view(strokeWidth=0)
    .properties(title="Horsepower versus fuel economy of cars by country of origin")
)

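When the goal is to compare magnitudes or amounts, a bar chart or histogram is often enough to convey the information.

Because a scatter plot takes continuous variables on both axes, it can convey information at a fine granularity, but the main message can get lost and the cognitive load on the reader can be high. Especially when readers are not familiar with the data, consider carefully which kind of visualization is appropriate.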
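When the message is simply which group has the higher typical value, aggregating before plotting is often enough. The sketch below uses made-up values and hypothetical column names to reduce the data to one number per group, which is the table a bar chart would display.

```python
import pandas as pd

# Hypothetical values for illustration only.
cars = pd.DataFrame({
    "origin": ["USA", "USA", "USA", "Europe", "Europe", "Japan", "Japan"],
    "mpg":    [18.0, 15.0, 17.0, 26.0, 28.0, 31.0, 33.0],
})

# One row per group: the data behind a bar chart of average fuel economy.
mean_mpg = (
    cars.groupby("origin", as_index=False)["mpg"].mean()
    .sort_values("mpg", ascending=False)
    .reset_index(drop=True)
)
print(mean_mpg)
```

A frame like `mean_mpg` could then be drawn with `alt.Chart(mean_mpg).mark_bar()`, giving the reader three bars to compare instead of hundreds of points.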

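Summary

Visualize data in order to read the information you need from it. Rather than fixating on making a beautiful chart, focus on whether the necessary information is easy to read.

References

Yukari Nagata, データ視覚化のデザイン (Data Visualization Design), SB Creative Corp., 2020, 201 pp.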
