
Frequently used matplotlib routines

Last updated at 2018-02-11, posted at 2018-02-11

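Motivation

I regularly use matplotlib to draw graphs.

I am satisfied with the quality of the graphs it produces,
but some routines that I use frequently require surprisingly long scripts.

It is convenient to collect such routines in a general-purpose module for analysis work.
This article introduces a few excerpts from that module.

The general-purpose module is referred to as plotTools.py from here on.
Data processing and plotting are assumed to be run from Jupyter Notebook.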

0. Preparing the data for plotting

The Titanic dataset is used as sample data.
It can be loaded through seaborn
(the seaborn API is not used for plotting).

JupyterNotebook

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
%matplotlib inline

df = sns.load_dataset('titanic')
df['survived'] = df['survived'] == 1

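1. Data labels on bar charts

This routine is surprisingly tedious to write out by hand.
Place the script below in your analysis folder.

The following can be changed through arguments.

ax : the Axes object. If not given, it is obtained with plt.gca().
formatter : the display format of the data labels (a function that converts a number to a string).
- Because percent notation is used so often, it can also be specified simply as 'percent'.
orientation : the direction of the bars. 'v' for a vertical bar chart, 'h' for a horizontal one.
- This must be specified because the Axes object alone does not tell whether the chart is vertical or horizontal.
stacked : whether the chart is a stacked bar chart.
kind : the chart type. Only 'bar' is supported; applying the routine to line charts is intended but not implemented.
fontsize : font size of the data labels.
colors : font color. For stacked bar charts, 'auto' computes the luminance of the underlying bar and chooses black or white text automatically.
weight : font weight, such as 'bold'.
rotation : angle of the data labels, for cases where many data points make the labels overlap.
total : for stacked bar charts, whether to display the total.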
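
The 'auto' rule for colors can be looked at in isolation. The following is a minimal sketch, not the module's own code: label_color and its threshold parameter are names introduced here for illustration, while the weights 0.3, 0.6, 0.1 and the 0.5 cutoff are the ones annotating_plots uses.

```python
def label_color(facecolor, threshold=0.5):
    """Return 'white' for dark fills and 'black' for light ones.

    facecolor is an (r, g, b) or (r, g, b, a) tuple with components in 0..1,
    the same shape that a bar patch's get_facecolor() returns.
    """
    weights = (0.3, 0.6, 0.1)  # rough luminance weights, as in annotating_plots
    luminance = sum(w * c for w, c in zip(weights, facecolor[:3]))
    return 'white' if luminance <= threshold else 'black'


# Approximate RGBA values of a blue and an orange fill (illustrative inputs)
print(label_color((0.12, 0.47, 0.71, 1.0)))  # dark fill, so white text
print(label_color((1.0, 0.5, 0.05, 1.0)))    # light fill, so black text
```

Green dominates the weighting, so a saturated blue still counts as dark while orange and yellow count as light.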

plotTools.py

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

def annotating_plots(ax = None, formatter = None, orientation = 'v', stacked = False, kind = 'bar', 
                     fontsize = 12, colors = None, weight='medium', rotation = 0, total=False):
    if ax is None:
        # If no Axes is given, use the one currently being drawn on
        ax = plt.gca()

    if formatter is None:
        # By default, output the number as a string unchanged
        formatter = lambda x : str(x)

    elif formatter == 'percent':
        # If the string 'percent' is given, set up a percent formatter
        formatter = lambda x : '{:.{}f}%'.format(x*100, 1)


    if colors is None:
        # If no color is given, make every label black
        colors = ['black' for _ in range(len(ax.patches))]

    elif isinstance(colors, str):
        if colors != 'auto':
            # If a single color string is given, broadcast it to every data label
            colors = [colors for _ in range(len(ax.patches))]

        else:
            # colors = 'auto' is an option for stacked bars: convert RGB to luminance and use white text at 0.5 or below
            colors = ['white' if np.dot([0.3, 0.6, 0.1],p.get_facecolor()[:3]) <= 0.5 else 'black' for p in ax.patches]
        

    if kind == 'bar':
        # Bar chart
        if orientation == 'v':
            if not stacked:
                # Grouped bar
                for cnt, p in enumerate(ax.patches):
                    ax.annotate(formatter(p.get_height()),
                                (p.get_x() + p.get_width()/2, p.get_height()*1.005),
                                ha='center', 
                                va='bottom',
                                fontsize=fontsize,
                                color = colors[cnt],
                                weight = weight,
                                rotation = rotation
                                )

            else:
                # Stacked bars need an offset, so define an array to hold it
                length = pd.Series([p.get_x() for p in ax.patches]).nunique() # number of bars along the axis
                offset = np.zeros(length)
                for cnt, p in enumerate(ax.patches):
                    ax.annotate(formatter(p.get_height()),
                                (p.get_x() + p.get_width()/2, p.get_height()/2 + offset[cnt%length]),
                                horizontalalignment='center',
                                verticalalignment='center',
                                color = colors[cnt],
                                size = fontsize,
                                weight = weight,
                                rotation = rotation
                                )
                    offset[cnt%length] += p.get_height()
                if total:
                    # Show the total above each stacked bar
                    for idx, p in zip(range(length), ax.patches):
                        ax.annotate(formatter(offset[idx]), 
                                    (p.get_x() + p.get_width()/2, offset[idx]),
                                    horizontalalignment='center',
                                    verticalalignment='bottom',
                                    size = fontsize,
                                    weight = weight,
                                    rotation = rotation
                                    )

        elif orientation == 'h':
            if not stacked:
                for cnt, p in enumerate(ax.patches):
                    ax.annotate(formatter(p.get_width()),
                                (p.get_width(),p.get_y() + p.get_height()/2),
                                horizontalalignment='left',
                                verticalalignment='center',
                                fontsize=fontsize,
                                color = colors[cnt],
                                weight = weight,
                                rotation = rotation
                                )
                    
            else:
                # Stacked bars need an offset, so define an array to hold it
                length = pd.Series([p.get_y() for p in ax.patches]).nunique() # number of bars along the axis
                offset = np.zeros(length)
                for cnt, p in enumerate(ax.patches):
                    ax.annotate(formatter(p.get_width()),
                                (p.get_width()/2 + offset[cnt%length], p.get_y() + p.get_height()/2),
                                horizontalalignment='center',
                                verticalalignment='center',
                                color = colors[cnt],
                                size = fontsize,
                                weight = weight,
                                rotation = rotation
                                )
                    offset[cnt%length] += p.get_width()
                if total:
                    # Show the total to the right of each stacked bar
                    for idx, p in zip(range(length), ax.patches):
                        ax.annotate(formatter(offset[idx]), 
                                    (offset[idx],
                                    p.get_y() + p.get_height()/2),
                                    horizontalalignment='left',
                                    verticalalignment='center',
                                    size = fontsize,
                                    weight = weight,
                                    rotation = rotation
                                    )
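
The stacked branches depend on one piece of bookkeeping: the patches are assumed to arrive series by series, so patch number cnt belongs to bar cnt % length, and a running offset per bar gives the bottom of the next segment. The sketch below reproduces only that arithmetic on plain numbers. stacked_label_positions is a name introduced here for illustration and the heights are made-up values.

```python
def stacked_label_positions(heights, n_bars):
    """Return (label centers, bar totals) for heights listed in patch order."""
    offset = [0.0] * n_bars          # running top of each stacked bar
    centers = []
    for cnt, h in enumerate(heights):
        bar = cnt % n_bars           # which bar this segment belongs to
        centers.append(offset[bar] + h / 2)  # label sits mid-segment
        offset[bar] += h
    return centers, offset


# Two bars with two segments each, listed series by series
heights = [0.25, 0.75,   # first series, one segment per bar
           0.75, 0.25]   # second series, stacked on top
centers, totals = stacked_label_positions(heights, n_bars=2)
print(centers)
print(totals)
```

If the patches were ordered bar by bar instead, the modulo would assign segments to the wrong bars, which is why the function is only meant for charts drawn in the series-by-series order.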

Let's plot the Titanic data as an example.
Assume the script above is saved as plotTools.py.

A simple bar chart example

JupyterNotebook

import plotTools as pt

pd.crosstab(df['sex'], df['survived'], normalize='index')[True].plot.bar()
pt.annotating_plots(formatter='percent')

A stacked bar chart example

JupyterNotebook

pd.crosstab(df['sex'], df['survived'], normalize='index').plot.bar(stacked=True)
pt.annotating_plots(formatter='percent',
                    stacked = True,
                    colors = 'auto')
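
Both examples use the 'percent' shortcut, but formatter accepts any function that maps a number to a string. As a hedged sketch, here are two formatters one might pass instead. thousands and with_unit are hypothetical helpers that are not part of plotTools.py.

```python
def thousands(x):
    """Format a count with thousands separators and no decimals."""
    return '{:,.0f}'.format(x)


def with_unit(unit, digits=1):
    """Build a formatter that appends a unit string to the number."""
    def _format(x):
        return '{value:.{digits}f}{unit}'.format(value=x, digits=digits, unit=unit)
    return _format


print(thousands(1234567))
print(with_unit(' people', digits=0)(314.0))

# Intended use with the module from this article (not executed here):
#   pd.crosstab(df['sex'], df['survived']).plot.bar()
#   pt.annotating_plots(formatter=thousands)
```

A closure such as with_unit keeps the call site short while still letting each chart choose its own precision.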

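2. Percent notation on an axis

This is not as long as the data-label routine, but I use it often, so it goes into the general-purpose module as well.

axis : the axis to format. 0 for the x-axis, 1 for the y-axis, 2 for the z-axis.
digits : number of decimal places to display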

plotTools.py

from matplotlib.ticker import FuncFormatter
def axis_percent(ax = None, axis=1, digits=0):
    if ax is None:
        ax = plt.gca()
    def _to_percent(x, position):
        return '{:.{}f}%'.format(x*100, digits)
    formatter = FuncFormatter(_to_percent)
    
    if axis == 0:
        ax.xaxis.set_major_formatter(formatter)        
    elif axis == 1:
        ax.yaxis.set_major_formatter(formatter)
    elif axis == 2:
        ax.zaxis.set_major_formatter(formatter)

Let's rewrite the axis of the stacked bar chart above in percent notation.

JupyterNotebook

pd.crosstab(df['sex'], df['survived'], normalize='index').plot.bar(stacked=True, legend=False)
pt.annotating_plots(formatter='percent',
                    stacked = True,
                    colors = 'auto')
pt.axis_percent()
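
The article only shows vertical charts. As an illustration of what orientation='h' together with axis=0 amounts to, here is a self-contained sketch written directly against matplotlib rather than plotTools.py. The rates are made-up values, and the Agg backend line is there only so the sketch runs outside a notebook.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line inside Jupyter Notebook
import matplotlib.pyplot as plt
from matplotlib.ticker import FuncFormatter

# Hypothetical rates, used only for illustration
rates = {"female": 0.75, "male": 0.25}

fig, ax = plt.subplots()
ax.barh(list(rates.keys()), list(rates.values()))

# Label each bar at its right edge, as the orientation='h' branch does
for p in ax.patches:
    ax.annotate("{:.1f}%".format(p.get_width() * 100),
                (p.get_width(), p.get_y() + p.get_height() / 2),
                ha="left", va="center")

# Percent ticks on the x-axis, which is what axis=0 selects in axis_percent
ax.xaxis.set_major_formatter(
    FuncFormatter(lambda x, pos: "{:.0f}%".format(x * 100)))

labels = [t.get_text() for t in ax.texts]
print(labels)
```

For a horizontal chart the value lives in the patch width rather than the height, which is why the direction has to be passed in explicitly.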

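3. Saving a graph as EMF

In Japan, I think PowerPoint is used for presentations and handouts in most cases.
PNG is fine, but when the material will be printed, saving the graph in a vector format
produces cleaner results.

The savefig function of matplotlib.pyplot supports SVG, but PowerPoint did not support SVG at the time of writing.
To embed the graph in PowerPoint, the image has to be converted to EMF.

Here we use Inkscape for the conversion.
If you have not installed Inkscape, download it from the page below.
Inkscape download page

After installing Inkscape, add the script below to the general-purpose module.

plotTools.py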

import subprocess
import os

def savefig_as_emf(file, fig = None, transparent=False):
    # Use the given figure, or fall back to the current one
    figure = plt.gcf() if fig is None else fig
    inkscape_path = "C://Program Files//Inkscape//inkscape.exe"
    filepath = file

    if filepath is not None:
        path, filename = os.path.split(filepath)
        filename, extension = os.path.splitext(filename)

        svg_filepath = os.path.join(path, filename+'.svg')
        emf_filepath = os.path.join(path, filename+'.emf')

        figure.savefig(svg_filepath, format='svg', transparent = transparent, bbox_inches='tight')

        subprocess.call([inkscape_path, svg_filepath, '--export-emf', emf_filepath])
        os.remove(svg_filepath)

After that, running the following outputs the graph in EMF format.

JupyterNotebook

pd.crosstab(df['sex'], df['survived'], normalize='index').plot.bar(stacked=True, legend=False)
pt.annotating_plots(formatter='percent',
                    stacked = True,
                    colors = 'auto')
pt.axis_percent()

pt.savefig_as_emf('hoge')
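
Two parts of savefig_as_emf are worth looking at separately: how the output paths are derived from the file argument, and how the Inkscape command line is built. The sketch below isolates both without calling Inkscape. emf_paths and inkscape_command are hypothetical helper names. The '--export-filename' form is, to my understanding, what Inkscape 1.x expects in place of the '--export-emf' option used above, so treat that branch as an assumption to check against your installed version.

```python
import os
import shutil


def emf_paths(filepath):
    """Return (svg_path, emf_path) next to filepath, ignoring its extension."""
    path, filename = os.path.split(filepath)
    stem, _ = os.path.splitext(filename)
    return os.path.join(path, stem + '.svg'), os.path.join(path, stem + '.emf')


def inkscape_command(svg_path, emf_path, inkscape=None, legacy=True):
    """Build the argument list to hand to subprocess.call."""
    if inkscape is None:
        # Look on PATH first; fall back to the bare name if nothing is found
        inkscape = shutil.which('inkscape') or 'inkscape'
    if legacy:
        # Option style used in this article
        return [inkscape, svg_path, '--export-emf', emf_path]
    # Assumed Inkscape 1.x style: output type inferred from the extension
    return [inkscape, svg_path, '--export-filename=' + emf_path]


svg_path, emf_path = emf_paths('hoge')
print(inkscape_command(svg_path, emf_path, inkscape='inkscape'))
```

Looking the executable up on PATH avoids hard-coding the Windows install location, at the cost of requiring Inkscape to be on PATH.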

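Closing

Thank you for reading this far.
When doing data analysis work, I try to gather general-purpose processing into modules as much as possible.
Data analysis involves a great many case-specific details, and
I always feel that some kind of ingenuity is needed to turn experience into efficiency.
I would be glad to hear about the approaches you use.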
