
Four ways to handle missing keys in a dict

Last updated at 2020-08-14, posted at 2020-08-14

Effective Python, 2nd Edition: 90 Specific Ways to Write Better Python is so good that I am reading through it with tears in my eyes.

One of its items covers how to handle missing keys in a dict. Please read the book for the details; I was curious about the processing time of each approach, so I measured it.

The task is deliberately simple: count how many times each character appears in a string.
The execution environment is the Google Colab default.

First, import the required libraries.

import time
from collections import defaultdict

Use an arbitrary string as the input to count.

target = 'super_string_of_my_passages. but this does not make sense at all. because this is nothing'

At the end, each snippet outputs the key-value pairs sorted by number of occurrences. The expected result is as follows.

[('s', 12),
 (' ', 12),
 ('e', 8),
 ('t', 7),
 ('a', 6),
 ('i', 5),
 ('n', 5),
 ('_', 4),
 ('o', 4),
 ('u', 3),
 ('g', 3),
 ('h', 3),
 ('p', 2),
 ('r', 2),
 ('m', 2),
 ('.', 2),
 ('b', 2),
 ('l', 2),
 ('f', 1),
 ('y', 1),
 ('d', 1),
 ('k', 1),
 ('c', 1)]

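Using in with an if statement

Check whether the key exists with an if statement and an in expression, which returns True when the key is present, and give missing keys an initial value. This is probably the first simple approach that comes to mind.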

%%time
ranking = {}
for key in target:
    if key in ranking.keys():
        count = ranking[key]
    else:
        count = 0
    ranking[key] = count + 1
sorted(ranking.items(), key=lambda x: x[1], reverse=True)

CPU times: user 45 µs, sys: 9 µs, total: 54 µs Wall time: 56.3 µs
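
As a side note, `in` applied to a dict already tests its keys, so the `.keys()` call is optional. The following is a minimal sketch of that variant; the helper name `count_with_in` and the short sample string are made up for illustration and are not part of the measurement above.

```python
def count_with_in(text):
    """Count characters using a membership test directly on the dict."""
    ranking = {}
    for key in text:
        if key in ranking:  # same result as `key in ranking.keys()`
            count = ranking[key]
        else:
            count = 0
        ranking[key] = count + 1
    return ranking


sample = "banana"  # hypothetical short input, not the article's target string
result = count_with_in(sample)
print(sorted(result.items(), key=lambda x: x[1], reverse=True))
# [('a', 3), ('n', 2), ('b', 1)]
```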

Using KeyError with a try statement

Use a try-except statement and handle KeyError, the exception raised for a missing key, as an expected error.

%%time
ranking = {}
for key in target:
    try:
        count = ranking[key]
    except KeyError:
        count = 0
    ranking[key] = count + 1
sorted(ranking.items(), key=lambda x: x[1], reverse=True)

CPU times: user 59 µs, sys: 11 µs, total: 70 µs Wall time: 78.2 µs

Using the get method

Use the get method provided by the built-in dict type. It returns its second argument (here 0) when the key is missing.

%%time
ranking = {}
for key in target:
    count = ranking.get(key, 0)
    ranking[key] = count + 1
sorted(ranking.items(), key=lambda x: x[1], reverse=True)

CPU times: user 43 µs, sys: 8 µs, total: 51 µs Wall time: 53.6 µs

Using defaultdict

%%time
ranking = defaultdict(int)

for s in target:
    ranking[s] += 1
sorted(ranking.items(), key=lambda x: x[1], reverse=True)

CPU times: user 36 µs, sys: 8 µs, total: 44 µs Wall time: 47.2 µs
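
The defaultdict snippet above works because `defaultdict(int)` calls `int()`, which returns 0, whenever a missing key is accessed, and stores that value under the key. A small sketch of this behavior, using made-up keys `"x"` and `"y"` rather than the article's target string:

```python
from collections import defaultdict

ranking = defaultdict(int)  # int() returns 0, so missing keys start at 0

print(ranking["x"])  # 0: the key was missing, so the factory supplied a value

ranking["x"] += 1
ranking["y"] += 1  # no prior check needed for a brand-new key
ranking["x"] += 1

print(dict(ranking))  # {'x': 2, 'y': 1}
```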

Conclusion

defaultdict might be the way to go! (*^^)
* defaultdict is not a cure-all and can itself cause unexpected errors, so use it with care.
Reference: (http://yoshidabenjiro.hatenablog.com/entry/2017/09/05/012828)
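
One surprise that is easy to reproduce is that merely reading a missing key inserts it into a defaultdict. I am assuming this is among the issues the caveat has in mind; the linked article may cover others. A minimal sketch with made-up keys:

```python
from collections import defaultdict

ranking = defaultdict(int)
ranking["a"] += 1

# Reading a missing key with [] inserts it with the default value.
if ranking["z"] == 0:
    pass

print("z" in ranking)  # True
print(len(ranking))    # 2

# .get() and `in` do not trigger the default factory, so nothing is inserted.
untouched = defaultdict(int)
untouched.get("z")
print("z" in untouched)  # False
```

If only a lookup is intended, `get` or an `in` check leaves the defaultdict unchanged.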

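Bonus

Those are all the approaches covered in the book, but
someone will probably say, "Come on, for a job like this you could just use you-know-what! Did you forget?"
so I am adding this as a bonus.
For a simple case like this, the Counter class from the collections library is also a good choice.
It conveniently counts the number of occurrences of each element.
It also provides a most_common method that sorts by count, so I use that here.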

from collections import Counter

%%time
ranking = Counter(target)
ranking.most_common()

CPU times: user 53 µs, sys: 0 ns, total: 53 µs Wall time: 56.5 µs
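
A single `%%time` run at the microsecond scale can vary noticeably between runs, so the differences measured above may be within noise. The following is a rough sketch of a repeated measurement with the standard `timeit` module. The function names are made up for illustration, the final sort is left out so that only the counting is timed, and the numbers will depend on the environment.

```python
import timeit
from collections import Counter, defaultdict

target = 'super_string_of_my_passages. but this does not make sense at all. because this is nothing'


def with_in():
    ranking = {}
    for key in target:
        if key in ranking.keys():
            count = ranking[key]
        else:
            count = 0
        ranking[key] = count + 1
    return ranking


def with_try():
    ranking = {}
    for key in target:
        try:
            count = ranking[key]
        except KeyError:
            count = 0
        ranking[key] = count + 1
    return ranking


def with_get():
    ranking = {}
    for key in target:
        ranking[key] = ranking.get(key, 0) + 1
    return ranking


def with_defaultdict():
    ranking = defaultdict(int)
    for key in target:
        ranking[key] += 1
    return ranking


def with_counter():
    return Counter(target)


approaches = [with_in, with_try, with_get, with_defaultdict, with_counter]

# Sanity check: every approach should produce the same counts.
expected = with_in()
assert all(dict(f()) == expected for f in approaches)

for f in approaches:
    # Best of 5 repeats, each repeat running the function 1000 times.
    best = min(timeit.repeat(f, number=1000, repeat=5))
    print(f"{f.__name__}: {best / 1000 * 1e6:.2f} µs per run")
```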

Thank you!
