
Notes on Writing Pythonic Code

Last updated at 2020-07-26. Posted at 2020-05-01.

Introduction

These are notes on things that, after writing Python for a while, I came to think are better ways of doing it.

tuple

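A tuple is much like a list, but a list is mutable while a tuple is immutable, and partly because of that tuples appear to have an edge in speed and memory. So a list whose elements never change is probably better written as a tuple.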
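As a rough illustration of the difference, the sketch below compares a list and a tuple holding the same items. The variable names are made up for this example, and the sizes reported by sys.getsizeof are CPython implementation details that may differ between versions.

```python
import sys

# Hypothetical sample data: the same three items as a list and as a tuple
langs_list = ["Python", "Go", "Rust"]
langs_tuple = ("Python", "Go", "Rust")

# Size of the container object itself (not of the strings it refers to)
list_size = sys.getsizeof(langs_list)
tuple_size = sys.getsizeof(langs_tuple)
print(f"list: {list_size} bytes, tuple: {tuple_size} bytes")

# A tuple rejects item assignment, so accidental modification fails loudly
try:
    langs_tuple[0] = "Java"
    tuple_was_modified = True
except TypeError:
    tuple_was_modified = False
print(f"tuple was modified: {tuple_was_modified}")
```

For a handful of items the saving is tiny; the more practical benefit is probably that the tuple documents the intent that the contents should not change.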

namedtuple

When you want to reuse some structure, you would probably reach for a class. The code gets long, though, so I was never very fond of that.

class Twitter:
	def __init__(self, account, user, followers, followings, tweets):
		self.account = account
		self.user = user
		self.followers = followers
		self.followings = followings
		self.tweets = tweets

	def __repr__(self):
		return f"{type(self).__name__}(account={repr(self.account)}, user={repr(self.user)}, followers={repr(self.followers)}, followings={repr(self.followings)}, tweets={repr(self.tweets)})"

t = Twitter("小池百合子", "@ecoyuri", 790000, 596, 3979)
print(t)

Twitter(account='小池百合子', user='@ecoyuri', followers=790000, followings=596, tweets=3979)

With namedtuple, the same thing can be written like this:

from collections import namedtuple

Twitter = namedtuple('Twitter', 'account user followers followings tweets')
t = Twitter('小池百合子', '@ecoyuri', 790000, 596, 3979)
print(t)

# The output is the same as above

The code is shorter
The notation is simple and easy to follow
print() shows the contents

so I found it very nice.
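Beyond printing, a namedtuple offers a few other conveniences. The sketch below reuses the Twitter definition from above; the names t2, as_dict and counts are introduced only for illustration.

```python
from collections import namedtuple

Twitter = namedtuple('Twitter', 'account user followers followings tweets')
t = Twitter('小池百合子', '@ecoyuri', 790000, 596, 3979)

# Fields are available both by name and by position
print(t.followers, t[2])

# Instances are immutable; _replace returns a modified copy instead
t2 = t._replace(followers=t.followers + 1)

# _asdict converts the record into a dict, e.g. for serialization
as_dict = t._asdict()

# Ordinary tuple unpacking works as well
account, user, *counts = t
print(account, user, counts)
```

One trade-off to keep in mind: because the fields cannot be reassigned, a namedtuple suits read-only records better than objects whose state changes over time.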

Counter

Suppose you want the frequency of each element in a list as a dictionary. It might look something like this.

judgements = ['銀シャリ', 'スーパーマラドーナ', '銀シャリ', '和牛', '銀シャリ']

counts = dict()
# Get the unique elements
members = set(judgements)
for m in members:
	count = 0
	for j in judgements:
		if j == m:
			count += 1
	counts[m] = count

# Sort by count in descending order
counts = sorted(counts.items(), key=lambda x:x[1], reverse=True)
counts_dict = dict(counts)
print(counts_dict)

{'銀シャリ': 3, 'スーパーマラドーナ': 1, '和牛': 1}

That is a lot of code for what it does, which I dislike, so using Counter you can write

from collections import Counter

judgements = ['銀シャリ', 'スーパーマラドーナ', '銀シャリ', '和牛', '銀シャリ']
counts_dict = Counter(judgements)
print(counts_dict)

and be much happier. I think Counter is indispensable whenever you want a bag-of-words representation.
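Note that printing a Counter shows it wrapped as Counter({...}) rather than as a plain dict. A minimal sketch of getting the sorted ranking and a plain dict back, using the same judgements list (the key 'unknown' is just a placeholder for an element that never appears):

```python
from collections import Counter

judgements = ['銀シャリ', 'スーパーマラドーナ', '銀シャリ', '和牛', '銀シャリ']
counts = Counter(judgements)

# most_common() returns (element, count) pairs sorted by count, descending
ranking = counts.most_common()
top_name, top_count = counts.most_common(1)[0]

# A key that never appeared yields 0 instead of raising KeyError
missing = counts['unknown']

# dict() gives a plain dict when the Counter wrapper is not wanted
counts_dict = dict(ranking)
print(counts_dict)
```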

yield

Consider code like the following, for example.

import re

def omit_stopwords(tweets):
	omitted_tweets = []
	for t in tweets:
		# Remove URLs, @{username} mentions, and #{tag} hashtags
		reg = r'https?://[\w/:%#\$&\?\(\)~\.=\+\-]+|[@＠][A-Za-z0-9._-]+|[#＃][一-龥_ぁ-ん_ァ-ヺーａ-ｚＡ-Ｚa-zA-Z0-9]+'
		text_mod = re.sub(reg,'',t['text'])
		omitted_tweets.append(text_mod)
	return omitted_tweets

# get_tweets is a function that returns data in the form [{'text': tweet1}, {'text': tweet2}, ..., {'text': tweetN}]
ots = omit_stopwords(get_tweets())

for ot in ots:
	print(f"analyzing the tweet: {ot}")

analyzing the tweet: 本日18:45～のライブ配信は、大阪府の吉村知事 にお付き合いいただきます。〜〜〜
 ・・・
analyzing the tweet: 〜〜〜。今後も感染拡大防止のため、現地調査を引き続き行います。

Tweet data is usually large, so omitted_tweets becomes a very large list, which is bad for both memory and speed. In such cases, you can write

def omit_stopwords(tweets):
	for t in tweets:
		reg = r'https?://[\w/:%#\$&\?\(\)~\.=\+\-]+|[@＠][A-Za-z0-9._-]+|[#＃][一-龥_ぁ-ん_ァ-ヺーａ-ｚＡ-Ｚa-zA-Z0-9]+'
		text_mod = re.sub(reg,'',t['text'])
		yield text_mod

ots = omit_stopwords(get_tweets())

for ot in ots:
	print(f"analyzing the tweet: {ot}")

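using yield instead of return. The substitution inside omit_stopwords then runs only when the for loop asks for each value, which apparently keeps memory usage down. As evidence, printing the variable ots gives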

<generator object omit_stopwords at 0x10f957468>

which shows that it is a generator object, and

print(f"analyzing the tweet: {next(ots)}")
print(f"analyzing the tweet: {next(ots)}")
print(f"analyzing the tweet: {next(ots)}")
# ...

returns the values one at a time. (After the for loop has run, the generator is already exhausted, so this raises StopIteration.)
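The lazy behavior described above can be observed directly. The following is a minimal sketch with a made-up generator, squares, that records in a list when each value is actually computed; it is not the tweet-processing code itself.

```python
def squares(numbers, log):
    """Yield the square of each number, recording when each one is computed."""
    for n in numbers:
        log.append(f"computing {n}")
        yield n * n

log = []
gen = squares([1, 2, 3], log)

# Nothing has run yet: the body starts only when a value is requested
log_before = list(log)

first = next(gen)   # runs the body up to the first yield
rest = list(gen)    # consumes the remaining values

# The generator is now exhausted; passing a default to next() avoids StopIteration
after_end = next(gen, None)

print(log_before, first, rest, after_end)
```

Because a generator can be consumed only once, results that need to be iterated several times are better collected into a list or tuple, at the cost of the memory the generator was saving.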

itertools.product

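Deeply nested for loops are hard to read, and I dislike them.
One example is computing the accuracy for every combination of parameters during parameter selection in machine learning.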

from sklearn.ensemble import RandomForestClassifier

...
...

ne_list = [3,5,10,50,100,500,1000,5000,10000]
md_list = [2,3,5,10,50,100]
mf_list = [0.1,0.3,0.5]

for ne in ne_list:
	for md in md_list:
		for mf in mf_list:
			clf = RandomForestClassifier(n_estimators=ne, max_depth=md, max_features=mf, random_state=0)
			print(f'(n_estimators, max_depth, max_features)={ne,md,mf}: {clf.fit(X_train, y_train).score(X_test,y_test)}')

In this example the loop body is only about two lines, so it is not too bad, but as the body grows the indentation becomes hard to follow.
With itertools.product,

from sklearn.ensemble import RandomForestClassifier
from itertools import product

...
...

for ne,md,mf in product(ne_list, md_list, mf_list):
	clf = RandomForestClassifier(n_estimators=ne, max_depth=md, max_features=mf, random_state=0)
	print(f'(n_estimators, max_depth, max_features)={ne,md,mf}: {clf.fit(X_train, y_train).score(X_test,y_test)}')

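the triple for loop collapses into a single one, which is much cleaner.
That said, some sources report that it is slower than nesting for loops, so it may not suit heavier workloads.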
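To check that the single loop really visits the same combinations, here is a small sketch without scikit-learn. The parameter lists are shortened, hypothetical versions of the ones above.

```python
from itertools import product

# Hypothetical, shortened parameter grids
ne_list = [3, 5]
md_list = [2, 3]
mf_list = [0.1, 0.3]

# Combinations collected with three nested loops
nested = []
for ne in ne_list:
    for md in md_list:
        for mf in mf_list:
            nested.append((ne, md, mf))

# The same combinations from product: the rightmost iterable varies fastest
flat = list(product(ne_list, md_list, mf_list))

print(flat == nested)
print(len(flat))
```

The order matches the nested version, so swapping one for the other should not change which combination is evaluated when.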

Comprehensions

reg = r'[@＠][A-Za-z0-9._-]+'
target_tweets = []
# Keep only the tweets that do not contain @{username}
for t in get_tweets():
	if not re.search(reg, t['text']):
		target_tweets.append(t)

The loop above becomes

reg = r'[@＠][A-Za-z0-9._-]+'
target_tweets = [t for t in get_tweets() if not re.search(reg, t['text'])]

which is much tidier. Comprehensions are commonly used to build a new list from an existing one.

Less code to write
Faster (reportedly)

so I want to use them actively.
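The same syntax also works for dicts and sets. Below is a self-contained sketch; the tweets list is hypothetical sample data in the shape get_tweets() is assumed to return.

```python
import re

# Hypothetical sample data in the shape get_tweets() is assumed to return
tweets = [
    {'text': 'hello @someone'},
    {'text': 'good morning'},
    {'text': 'thanks ＠friend'},
]

reg = re.compile(r'[@＠][A-Za-z0-9._-]+')

# List comprehension: keep only tweets without a mention
target_tweets = [t for t in tweets if not reg.search(t['text'])]

# Dict comprehension: map each remaining text to its length
lengths = {t['text']: len(t['text']) for t in target_tweets}

# Set comprehension: collect every distinct mention across all tweets
mentions = {m for t in tweets for m in reg.findall(t['text'])}

print(target_tweets, lengths, mentions)
```

Once a comprehension needs several conditions or nested loops, an ordinary for loop is often easier to read, so brevity is not the only criterion.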

f-strings

When writing variables to a log or the console, typing %s every time is tedious, and chaining everything together with + never looked good to me.

import logging
import datetime

logging.basicConfig(filename='sample.log', format='%(message)s', level=logging.INFO)
alpha_list = ['0001','0002','0003','0004','0005']
beta_list = [10,20,30,40,50]

for alpha in alpha_list:
	for beta in beta_list:
		start = datetime.datetime.now().strftime('%Y%m%d%H%M%S')
		logging.info('%s  analyze start: (alpha,beta)=(%s, %s)' % (start, alpha, beta))
		result = analyze(alpha, beta) # some analysis step
		end = datetime.datetime.now().strftime('%Y%m%d%H%M%S')
		logging.info('%s  analyze finished' % end)
		save_results(result, alpha + '_' + str(beta) + '_' + end + '.csv') # write the results to a file

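The part in save_results that glues variables together into a file name is especially unpleasant to write and to read. With f-strings, the part from the for loop onward becomes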

for alpha in alpha_list:
	for beta in beta_list:
		start = datetime.datetime.now().strftime('%Y%m%d%H%M%S')
		logging.info(f'{start}  analyze start: (alpha,beta)=({alpha}, {beta})')
		result = analyze(alpha, beta)
		end = datetime.datetime.now().strftime('%Y%m%d%H%M%S')
		logging.info(f'{end}  analyze finished')
		save_results(result, f'{alpha}_{beta}_{end}.csv')

like this.

It looks cleaner
There is less to type
No errors from a mismatch between the number of %s placeholders and the number of variables

so I like them. Note that f-strings were introduced in Python 3.6.
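f-strings also accept format specifiers and arbitrary expressions inside the braces. A short sketch with made-up values:

```python
# Hypothetical values standing in for the loop variables and a result
alpha = '0001'
beta = 10
score = 0.87654

# Plain interpolation: no str() conversion is needed for the integer
filename = f'{alpha}_{beta}_result.csv'

# Format specifiers follow a colon, as with str.format
padded = f'{beta:05d}'     # zero-padded to width 5
rounded = f'{score:.2f}'   # two decimal places

# Any expression can appear inside the braces
doubled = f'{beta * 2}'

print(filename, padded, rounded, doubled)
```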

Conclusion

I may add more items over time.
