More than 3 years have passed since last update.

[Python] Notes by library

Last updated at 2021-05-23, posted at 2020-12-04

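A personal cheat sheet. Searching every time is a hassle, so much of this is reproduced from elsewhere, with the original URLs included.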

beautifulsoup

[Qiita@itkr: PythonとBeautiful Soupでスクレイピング]
(https://qiita.com/itkr/items/513318a9b5b92bd56185)

text vs string

[いるかのボックス: Beautifulsoup4のtextとstringの違い]
(https://irukanobox.blogspot.com/2016/06/beautifulsoup4textstring.html?m=0)

Saving images

[ゼロイチ: Python3で画像をスクレイピングしてローカルに保存する|BeautifulSoupを利用]
(https://programming-beginner-zeroichi.jp/articles/73)
[西住工房: 【Python/BeautifulSoup】Webサイトの画像を自動で一括ダウンロード]
(https://algorithm.joho.info/programming/python/beautifulsoup-download-image/)

concurrent.futures

[concurrent.futures]
(https://docs.python.org/ja/3/library/concurrent.futures.html)

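Template

Swap ProcessPoolExecutor for ThreadPoolExecutor to get multithreading instead.

from concurrent import futures
with futures.ProcessPoolExecutor(max_workers=None) as executor:
    for args in all_args:  # all_args: placeholder iterable of argument tuples
        executor.submit(myfunc, *args)  # myfunc: placeholder for your own function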

When parallelization kicks in: after `submit` / `map`

This is the same for ProcessPoolExecutor and ThreadPoolExecutor.

Non-parallelizable example 1: the generator itself

from concurrent import futures
import time

max_workers = 8

def mygenerator(args):
    '''Generator with a heavy step inside'''
    for arg in args:
        time.sleep(1)
        yield arg

for i in mygenerator(range(max_workers)):
    print(i)  # takes 8 seconds in total

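The elapsed time does not change with either submit or map, because the generator is still advanced one item at a time by the caller:

# with futures.ProcessPoolExecutor(max_workers=max_workers) as executor:
with futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
    for i in mygenerator(range(max_workers)):
        executor.submit(print, i)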

# with futures.ProcessPoolExecutor(max_workers=max_workers) as executor:
with futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
    executor.map(print, mygenerator(range(max_workers)))

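Of course, work is parallelized once it has been submitted, so this still pays off when the processing after that point is heavy.

Non-parallelizable example 2: the arguments passed to `submit`

The following example is not sped up either, because myfunc(i) is evaluated by the caller before submit is called.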

def myfunc(arg):
    '''Function with a heavy step inside'''
    time.sleep(1)
    return arg

# with futures.ProcessPoolExecutor(max_workers=max_workers) as executor:
with futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
    for i in range(max_workers):
        executor.submit(print, myfunc(i))

It works if the heavy processing itself is what gets parallelized.
For example, the following is parallelized (roughly 8 s becomes 1 s):

# with futures.ProcessPoolExecutor(max_workers=max_workers) as executor:
with futures.ThreadPoolExecutor(max_workers=max_workers) as executor:
    for i in range(max_workers):
        executor.submit(time.sleep, 1)
        executor.submit(print, i)
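
The difference between the two patterns can be timed directly. This is a sketch with a shortened sleep, where `timed` is a hypothetical helper and `str` stands in for print to keep the output quiet.

```python
from concurrent import futures
import time

def myfunc(arg):
    """Function with a (shortened) heavy step inside."""
    time.sleep(0.2)
    return arg

def timed(submit_one, n=4):
    """Run submit_one(executor, i) for i in range(n); return elapsed seconds."""
    start = time.perf_counter()
    with futures.ThreadPoolExecutor(max_workers=n) as executor:
        for i in range(n):
            submit_one(executor, i)
    return time.perf_counter() - start

# myfunc(i) is evaluated by the caller, one after another, before submit().
t_serial = timed(lambda ex, i: ex.submit(str, myfunc(i)))
# myfunc itself is handed to the pool, so the sleeps overlap.
t_parallel = timed(lambda ex, i: ex.submit(myfunc, i))

print(f"argument evaluated first: {t_serial:.2f} s")
print(f"function submitted:       {t_parallel:.2f} s")
```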

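Better (?) practice

Given when parallelization actually happens,
it seems best to write a separate function that wraps the processing that can be shared, and to parallelize that function.
Example: when doing "read a file (various formats, functions, and modules) -> compute its trend (this part can be shared)" a large number of times,
write a function for each case that does everything inside the quotes, and parallelize those functions.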
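
One way this could look, as a sketch only: the readers, the slope-based `trend`, and the in-memory inputs are all hypothetical stand-ins for real file formats and the real trend calculation. A single wrapper that takes the reader as a parameter is used here instead of one wrapper per format.

```python
from concurrent import futures

def read_csv_like(text):
    """Hypothetical reader for one format: comma-separated numbers."""
    return [float(v) for v in text.split(",")]

def read_space_separated(text):
    """Hypothetical reader for another format: whitespace-separated numbers."""
    return [float(v) for v in text.split()]

def trend(values):
    """Shared step: least-squares slope of values against their index."""
    n = len(values)
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(values))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den

def read_and_trend(reader, source):
    """Wrapper that gets submitted: read (format-specific), then compute the trend."""
    return trend(reader(source))

# Hypothetical inputs; in practice these would be file paths.
jobs = [(read_csv_like, "1,2,3,4"), (read_space_separated, "10 8 6 4")]

with futures.ThreadPoolExecutor() as executor:
    fs = [executor.submit(read_and_trend, reader, src) for reader, src in jobs]
    slopes = [f.result() for f in fs]

print(slopes)  # [1.0, -2.0]
```

Because both the reading and the trend calculation happen inside the submitted function, the whole job runs in the pool rather than in the caller.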

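Notes

submit returns an instance of the Future class. It is conventionally bound to a variable named future.
When collecting these in a list or similar, futures is the tempting name, but if it clashes (for example with the imported module), use fs.
(See futures.as_completed(fs) in the official documentation.)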
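
A short sketch of the naming convention together with as_completed. `square` is a hypothetical task, and fs is a dict here so that each Future can be mapped back to its argument.

```python
from concurrent import futures

def square(x):
    """Hypothetical task."""
    return x * x

with futures.ThreadPoolExecutor(max_workers=4) as executor:
    # Named fs rather than futures, which would shadow the imported module.
    fs = {executor.submit(square, x): x for x in range(5)}
    results = {}
    for future in futures.as_completed(fs):
        # fs maps each Future back to the argument it was submitted with.
        results[fs[future]] = future.result()

print(sorted(results.items()))  # [(0, 0), (1, 1), (2, 4), (3, 9), (4, 16)]
```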

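ThreadPoolExecutor: for light (typically I/O-bound) work; because of the GIL, pure-Python code effectively uses one CPU at a time
ProcessPoolExecutor: for heavy (CPU-bound) work; multiple CPUs are used

Note: the degree of parallelism can be monitored with the top command.
  If ThreadPoolExecutor is used for heavy work, the CPU usage goes above 100%.

Note: if nothing runs (and no error is printed either), a deadlock is a possibility.
  Sometimes simply moving a function defined inside a method of your own class to the top level of the same file solves it.
  For other deadlocks, search for "hang" and similar keywords.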
[stackoverflow: Python doctest hangs using ProcessPoolExecutor]
(https://stackoverflow.com/questions/48218897/python-doctest-hangs-using-processpoolexecutor)

References
[Qiita@tag1216: Pythonでconcurrent.futuresを使った並列タスク実行]
(https://qiita.com/tag1216/items/db5adcf1ddcb67cfefc8)

Comparison with other parallel-processing modules
[Heavy Wetal: concurrent.futures — 並行処理 in Python]
(https://heavywatal.github.io/python/concurrent.html)
[Qiita@simonritchie: Pythonの並列処理・並行処理をしっかり調べてみた]
(https://qiita.com/simonritchie/items/1ce3914eb5444d2157ac)

itertools

[Python: itertools]
(https://docs.python.org/ja/3/library/itertools.html)

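Incidentally, iterators ⊃ generators (every generator is an iterator).

Splitting an iterator evenly into n sub-iterators

Use itertools.islice.
Expected behavior:

iterator -> iterator(0), iterator(1), ..., iterator(n-1)
Each sub-iterator takes every n-th element of the original iterator
- (Splitting into chunks of m elements from the front would require the total length)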

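import itertools

iters = range(12)  # range can be iterated repeatedly; a one-shot iterator would be used up by the first islice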

n = 3
for s in range(n):
    print(s, list(itertools.islice(iters, s, None, n)))
# 0 [0, 3, 6, 9]
# 1 [1, 4, 7, 10]
# 2 [2, 5, 8, 11]

n = 5
for s in range(n):
    print(s, list(itertools.islice(iters, s, None, n)))
# 0 [0, 5, 10]
# 1 [1, 6, 11]
# 2 [2, 7]
# 3 [3, 8]
# 4 [4, 9]
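
The examples above rely on range being re-iterable. For a one-shot iterator such as a generator, one possible approach (an assumption, not part of the original notes) is to combine islice with itertools.tee. `split_round_robin` is a hypothetical helper name.

```python
import itertools

def split_round_robin(iterable, n):
    """Return n sub-iterators; the s-th one yields items s, s+n, s+2n, ..."""
    copies = itertools.tee(iterable, n)  # n independent views of one iterator
    return [itertools.islice(c, s, None, n) for s, c in enumerate(copies)]

gen = (i for i in range(12))  # one-shot generator, unlike range(12)
parts = [list(sub) for sub in split_round_robin(gen, 3)]
print(parts)  # [[0, 3, 6, 9], [1, 4, 7, 10], [2, 5, 8, 11]]
```

Note that tee buffers items internally, so draining one sub-iterator completely keeps everything in memory for the others.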

`chain`: concatenating multiple iterators

n = 3
s0, s1, s2 = 0, n + 1, n + n + 2
iter0 = (i for i in range(s0, s0 + n))
iter1 = (i for i in range(s1, s1 + n))
iter2 = (i for i in range(s2, s2 + n))

for i in itertools.chain(iter0, iter1, iter2):
    print(i) # 0 1 2 4 5 6 8 9 10

[stackoverflow: How to join two generators in Python?]
(https://stackoverflow.com/questions/3211041/how-to-join-two-generators-in-python)

`cycle`: looping over an iterator repeatedly

vals = ['a', 'b', 'c']
for i, val in enumerate(vals):
    print(i, val)
# 0 a
# 1 b
# 2 c

vals = ['a', 'b', 'c']
# vals = (val for val in vals) # same result with a generator
for i, val in enumerate(itertools.cycle(vals)):
    print(i, val)
    if i > 4: break
# 0 a
# 1 b
# 2 c
# 3 a
# 4 b
# 5 c

[stackoverflow: How to make a repeating generator in Python]
(https://stackoverflow.com/a/45037522)

json

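Saving an instance as JSON

json.dump(obj.__dict__, f) does the job (obj is the instance, f an open file).
For example, the following abstract class is a template for saving an instance's information to a JSON file and loading it back.

import json
from abc import ABCMeta

class AbstractConfig(metaclass=ABCMeta):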

    def to_json(self, path):
        '''
        Args:
            path (str)
        '''
        with open(path, 'w') as f:
            json.dump(self.__dict__, f, indent=4)

    @classmethod
    def from_json(cls, path):
        '''
        Args:
            path (str)
        '''
        with open(path, 'r') as f:
            json_dict = json.load(f)
        return cls(**json_dict)

.__dict__ is reportedly not enough in some cases, but that can be dealt with when it comes up.
There is apparently also a library called jsons (not json).
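
As a rough check of the template above, here is a sketch of the same round trip with a concrete class, followed by one case where `__dict__` is not enough. The `Config` class, its attributes, and the temporary file path are hypothetical. The round trip assumes `__init__` accepts every attribute as a keyword argument.

```python
import datetime
import json
import os
import tempfile

class Config:
    """Hypothetical config class; __init__ takes every attribute as a keyword."""
    def __init__(self, name, epochs):
        self.name = name
        self.epochs = epochs

config = Config(name="run1", epochs=10)

# Round trip through a temporary file, mirroring to_json / from_json.
path = os.path.join(tempfile.mkdtemp(), "config.json")
with open(path, "w") as f:
    json.dump(config.__dict__, f, indent=4)
with open(path, "r") as f:
    restored = Config(**json.load(f))
print(restored.__dict__)  # {'name': 'run1', 'epochs': 10}

# An attribute that json cannot encode makes the __dict__ approach fail.
config.started = datetime.datetime(2020, 12, 4)
try:
    json.dumps(config.__dict__)
    failed = False
except TypeError:
    failed = True
print(failed)  # True
```

Attributes such as datetime objects would need a custom encoder or a manual conversion to strings first.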

[stackoverflow: How to make a class JSON serializable]
(https://stackoverflow.com/questions/3768895/how-to-make-a-class-json-serializable)
[verilog書く人: 【Python】jsonで自作クラスを含んだデータをシリアライズ/デシリアライズする]
(http://segafreder.hatenablog.com/entry/2017/10/01/140125)

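Another cause of JSONDecodeError: Expecting property name enclosed in double quotes

The error message says "please enclose it in double quotes",
but the error can occur even when everything is quoted.
One such case is a , right before the closing }.
The following is an example that works correctly.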

test.json

{
    "a": "one",
    "b": "two"
}

import json

path = './test.json'
with open(path, 'r') as f:
    adict = json.load(f)
print(adict) # {'a': 'one', 'b': 'two'}

However, putting a , right before the closing } as shown below (which Python itself allows) triggers the error above.
(It reports line 4 column 1, so the error is raised at the }.)

error.json

{
    "a": "one",
    "b": "two", <- this trailing , is the problem
}
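
The same behavior can be reproduced without files using `json.loads`, as a small sketch. The strings `good` and `bad` are made up for illustration, and the exact wording of the message may differ between Python versions, so it is only printed here.

```python
import json

good = '{"a": "one", "b": "two"}'
bad = '{"a": "one", "b": "two",}'  # trailing comma before the closing brace

print(json.loads(good))  # {'a': 'one', 'b': 'two'}

try:
    json.loads(bad)
    message = None
except json.JSONDecodeError as e:
    # e.msg holds the description, e.lineno / e.colno the position.
    message = e.msg
print(message)
```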

mutagen

Editing tags of mp3 files etc.

note.nkmk.me: Pythonでmp3などのID3タグを編集するmutagenの使い方

netCDF4

Converting dates and times

# Assuming nc (netCDF4.Dataset) and time_name (str) are given
from netCDF4 import num2date
time = nc.variables[time_name]
num2date(time[:], units=time.units, calendar=time.calendar)

However, what this returns is a subclass of the class cftime.datetime.
To convert it to datetime.datetime, read the attributes of the same names, such as year and month.

import datetime as dt
def cftime2datetime(cftime):
    keys = ['year', 'month', 'day', 'hour', 'minute', 'second', 'microsecond']
    dict_kw = {key: getattr(cftime, key) for key in keys}
    return dt.datetime(**dict_kw)

For how to use num2date, the section "7) Dealing with time coordinates." on the page below is easy to follow.
[公式: netCDF4 module]
(https://unidata.github.io/netcdf4-python/netCDF4/index.html)
For details on cftime.datetime, see the following.
[公式: cftime API]
(https://unidata.github.io/cftime/api.html)

numpy/scipy

pandas

pathlib

from pathlib import Path
p = Path( 'dir1/dir2/file.txt' )

print( type( p.name ), p.name ) # <class 'str'> file.txt
print( type( p.stem ), p.stem ) # <class 'str'> file
print( type( p.suffix ), p.suffix ) # <class 'str'> .txt
print( type( p.parent ), p.parent ) # <class 'pathlib.WindowsPath'> dir1\dir2
print( type( p.parents ), p.parents ) # <class 'pathlib._PathParents'> <WindowsPath.parents>
print( type( p.parents[0] ), p.parents[0] ) # <class 'pathlib.WindowsPath'> dir1\dir2
print( type( p.parents[1] ), p.parents[1] ) # <class 'pathlib.WindowsPath'> dir1

[note.nkmk.me: Python, pathlibでファイル名・拡張子・親ディレクトリを取得]
(https://note.nkmk.me/python-pathlib-name-suffix-parent/)
note.nkmk.me: Python, pathlibでファイル一覧を取得（glob, iterdir）

parse

Parsing strings

[PyPI: parse]
(https://pypi.org/project/parse/)
[trivial technologies: Pythonicな文字列パーサ parse]
(https://coreblog.org/ats/python-parse/)

pickle

きっと続かんブログ: 【python】リストや辞書を外部ファイルに保存

Generators are not picklable

import pickle

def is_picklable(obj):
    try:
        pickle.dumps(obj)
    except Exception:  # e.g. TypeError for generators
        return False
    return True

obj = range(10)
print(type(obj), is_picklable(obj)) # <class 'range'> True
obj = (i for i in range(10))
print(type(obj), is_picklable(obj)) # <class 'generator'> False
obj = iter(range(10))
print(type(obj), is_picklable(obj)) # <class 'range_iterator'> True
obj = iter([1, 2, 3])
print(type(obj), is_picklable(obj)) # <class 'list_iterator'> True

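To save one, converting it to a list or similar is enough,
but take care when parallelizing with ProcessPoolExecutor, since arguments are pickled to be sent to the worker processes.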
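
A minimal sketch of the workaround: materialize the generator before pickling it (or before handing it to a worker process). `numbers` is a hypothetical generator function.

```python
import pickle

def numbers():
    """Hypothetical generator function."""
    for i in range(3):
        yield i

gen = numbers()
try:
    pickle.dumps(gen)
    gen_ok = True
except TypeError:
    gen_ok = False  # generators raise TypeError when pickled

# Materialize the values first; a list of picklable items is picklable.
payload = pickle.dumps(list(numbers()))
restored = pickle.loads(payload)
print(gen_ok, restored)  # False [0, 1, 2]
```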

[stackoverflow: Why can't generators be pickled?]
(https://stackoverflow.com/questions/7180212/why-cant-generators-be-pickled)

re

[note.nkmk.me: Pythonの正規表現モジュールreの使い方（match、search、subなど）]
(https://note.nkmk.me/python-re-match-search-findall-etc/)
[note.nkmk.me: Pythonで文字列を抽出（位置・文字数、正規表現）]
(https://note.nkmk.me/python-str-extract/)
[Qiita@luohao0404: 分かりやすいpythonの正規表現の例]
(https://qiita.com/luohao0404/items/7135b2b96f9b0b196bf3)

schedule

[Qiita@Kai-Suzuki: Python Scheduleライブラリでジョブ実行]
(https://qiita.com/Kai-Suzuki/items/0c5c0e5cbdb4075fe482)

sqlite3

Tensorflow

Documentation by version (hyperlinks inside the .md files may not work)
[Tensorflow: TensorFlowAPIバージョン]
(https://www.tensorflow.org/versions)

[(株)クラスキャットセールスインフォメーション: TensorFlow : Get Started : TensorFlow 技法 101]
(https://tensorflow.classcat.com/2016/02/03/tensorflow-tutorials-mechanics-101/)

https://vict0rs.ch/2018/05/17/restore-tf-model-dataset/
https://screwandsilver.com/tensorflow_init_variable_error/
https://stackoverflow.com/questions/42322698/tensorflow-keras-multi-threaded-model-fitting

tqdm

Basics: wrap the iterator in `tqdm.tqdm()`

import time
import tqdm

def myfunc(i):
    time.sleep(0.0001)
    return i

n = int(1e4)
args = range(n)
for i in tqdm.tqdm(args):
    myfunc(i)

[Qiita@kuroitu: Pythonで進捗表示したい！]
(https://qiita.com/kuroitu/items/f18acf87269f4267e8c1)

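When there is no `len` (generators etc.): pass the total count with the `total=` option

- Note that `total=len(list(generator))` uses up the generator, so the for loop no longer iterates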

args = range(n)
args = (i for i in args)
for i in tqdm.tqdm(args, total=n):
    myfunc(i)
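
The pitfall in the note above can be seen without tqdm. This is a small sketch with a hypothetical `make_gen` factory: counting a generator via `list()` consumes it, so either the total has to be known beforehand or the items have to be materialized once and reused.

```python
def make_gen():
    """Hypothetical generator factory."""
    return (i for i in range(5))

gen = make_gen()
total = len(list(gen))  # counting consumes the generator...
leftover = list(gen)    # ...so a later loop sees nothing
print(total, leftover)  # 5 []

# Materialize once and reuse the list for both the count and the loop.
items = list(make_gen())
print(len(items), items)  # 5 [0, 1, 2, 3, 4]
```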

[Pystyle: tqdm でプログラムの進捗をプログレスバーで表示する方法 > 全体のイテレーション数を指定する]
(https://pystyle.info/how-to-use-tqdm-to-display-the-progress-bar/#outline__4_2)

Using it together with `print()`: replace `print()` with `tqdm.tqdm.write()`

The argument has to be converted to a string explicitly.

def myfunc(i):
    time.sleep(0.0001)
    if i % 100 == 0: tqdm.tqdm.write(str(i))
    return i

[ばいばいバイオ: 【Python】 tqdmでプログレスバーを表示してみた]
(https://www.kimoton.com/entry/20190830/1567128952)

Combining with parallel computation

tqdm.contrib
tqdm provides wrappers for range, enumerate, and Process/ThreadPoolExecutor.map.

from tqdm.contrib.concurrent import process_map#, thread_map
r = process_map(myfunc, args, max_workers=16, chunksize=100)

If you want to use submit:

with tqdm.tqdm(total=n) as pbar:
    fs = []
    with futures.ProcessPoolExecutor(max_workers=16) as executor:
        for i in args:
            f = executor.submit(myfunc, i)
            fs.append(f)
        for f in futures.as_completed(fs):
            pbar.update(1)

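The following is a bad example, as also pointed out on stackoverflow.
(executor.map yields results in input order, so if the first task is slow the progress bar stalls and then jumps forward all at once.)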

def myfunc(i):
    time.sleep(10/i)
    #time.sleep(0.0001)
    return i

def main(*args):
    n = int(1e4)
    args = range(1, n+1)

    with futures.ThreadPoolExecutor(max_workers=16) as executor:
        fs = list(tqdm.tqdm(executor.map(myfunc, args), total=n))

[stackoverflow: Use tqdm with concurrent.futures?]
(https://stackoverflow.com/questions/51601756/use-tqdm-with-concurrent-futures)
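
The reason the map-based version behaves badly can be checked with the standard library alone. In this sketch, `slow_first` is a hypothetical task whose first item is the slowest. `map` returns results in input order, while `as_completed` returns them in completion order.

```python
from concurrent import futures
import time

def slow_first(i):
    """Hypothetical task: the first item is the slowest."""
    time.sleep(0.3 if i == 0 else 0.01)
    return i

args = range(4)
with futures.ThreadPoolExecutor(max_workers=4) as executor:
    # map() yields in input order, so nothing appears until item 0 is done.
    map_order = list(executor.map(slow_first, args))

    # as_completed() yields each Future as soon as it finishes.
    fs = [executor.submit(slow_first, i) for i in args]
    done_order = [f.result() for f in futures.as_completed(fs)]

print(map_order)       # [0, 1, 2, 3]
print(done_order[-1])  # 0 (the slow task finishes last)
```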

treelib: tree structures

[Welcome to treelib’s documentation!]
(https://treelib.readthedocs.io/en/latest/)
is the documentation, but it is not enough for actual use,
so it is safer to work while reading the source code on GitHub below…
[GitHub: caesar0301/treelib]
(https://github.com/caesar0301/treelib/tree/master/treelib)

unittest

Template

import unittest

class MyTestCase(unittest.TestCase):

    def test(self):
        # Test the return value
        self.assertEqual(myfunc(args), expected_return)

        # Check that the return value is True / False
        self.assertTrue(return_true())
        self.assertFalse(return_false())

        # Test that the return value is None
        self.assertIsNone(return_none(args))

        # Test for an exception
        with self.assertRaises(ExpectedException):
            myfunc(args)

if __name__ == '__main__':
    # Runs only when executed as a script
    unittest.main(verbosity=2)

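Exceptions can also be tested as follows, but the arguments have to be passed separately.
(Seeing this, writing things in the order expected value -> function -> (arguments) feels more consistent,
but even docs.python.org below does not do that: self.assertEqual('foo'.upper(), 'FOO'))

self.assertRaises(ExpectedException, myfunc)
self.assertRaises(ExpectedException, myfunc, arg1, arg2) # when there are arguments, just list them after the function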
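
Both forms of assertRaises side by side, as a runnable sketch. `divide` and `DivideTest` are hypothetical names, and the tests are run programmatically here instead of through unittest.main() so that the snippet can be pasted anywhere.

```python
import unittest

def divide(a, b):
    """Hypothetical function under test."""
    return a / b

class DivideTest(unittest.TestCase):

    def test_context_manager(self):
        # The call is written normally inside the with-block.
        with self.assertRaises(ZeroDivisionError):
            divide(1, 0)

    def test_callable_form(self):
        # The callable and its arguments are passed separately.
        self.assertRaises(ZeroDivisionError, divide, 1, 0)

# Load and run the test case without calling unittest.main().
suite = unittest.defaultTestLoader.loadTestsFromTestCase(DivideTest)
result = unittest.TextTestRunner(verbosity=0).run(suite)
print(result.wasSuccessful())  # True
```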

[docs.python.org: unittest]
(https://docs.python.org/3/library/unittest.html)
[CUBE SUGAR CONTAINER: Python: ユニットテストを書いてみよう]
(https://blog.amedama.jp/entry/python-unittest)
[stackoverflow: How do you test that a Python function throws an exception?]
(https://stackoverflow.com/questions/129507/how-do-you-test-that-a-python-function-throws-an-exception)

Directory layout and running tests

Qiita@hoto17296: Python 3 標準の unittest でテストを書く際のディレクトリ構成

urllib

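Checking `urlopen` errors

import urllib.error
import urllib.request

try:
    res = urllib.request.urlopen(url)  # url: the address to open, defined elsewhere
except urllib.error.HTTPError as e:
    print(e.code, e.reason)  # the error details are available here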
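
A slightly fuller sketch that also handles failures without any HTTP response. `fetch_status` is a hypothetical helper, and the 5-second timeout is an arbitrary choice. HTTPError is a subclass of URLError, so it has to be caught first.

```python
import urllib.error
import urllib.request

def fetch_status(url):
    """Hypothetical helper: return a short description of the outcome."""
    try:
        with urllib.request.urlopen(url, timeout=5) as res:
            return f"ok {res.status}"
    except urllib.error.HTTPError as e:
        # The server answered, but with an error status.
        return f"http error {e.code} {e.reason}"
    except urllib.error.URLError as e:
        # No HTTP response at all (DNS failure, refused connection, ...).
        return f"url error {e.reason}"

# HTTPError derives from URLError, hence the order of the except clauses.
print(issubclass(urllib.error.HTTPError, urllib.error.URLError))  # True
```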

[Hibiki Programming Notes: [Python] Webサイトのデータを取得する（urllib.requestモジュール）]
(https://hibiki-press.tech/learn_prog/python/urllib_request_module/2019#URLErrorHTTPError)

Accessing sites that return 403: Forbidden

hdrs = {"User-Agent": "Mozilla/5.0 (X11; Ubuntu; Linux x86_64; rv:47.0) Gecko/20100101 Firefox/47.0"}
req = urllib.request.Request(url=url, headers=hdrs)
res = urllib.request.urlopen(req)
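
To confirm that the header is actually attached before sending anything, the Request object can be inspected offline. This is a sketch with a shortened User-Agent value and a placeholder URL, both hypothetical.

```python
import urllib.request

hdrs = {"User-Agent": "Mozilla/5.0"}  # shortened, hypothetical value
req = urllib.request.Request(url="https://example.com/", headers=hdrs)

# Request normalizes header names with str.capitalize().
print(req.header_items())  # [('User-agent', 'Mozilla/5.0')]
```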

[stackoverflow: Python3でwebスクレイピングしたいのですが存在するURLが開けません。]
(https://ja.stackoverflow.com/questions/27922/python3%E3%81%A7web%E3%82%B9%E3%82%AF%E3%83%AC%E3%82%A4%E3%83%94%E3%83%B3%E3%82%B0%E3%81%97%E3%81%9F%E3%81%84%E3%81%AE%E3%81%A7%E3%81%99%E3%81%8C%E5%AD%98%E5%9C%A8%E3%81%99%E3%82%8Burl%E3%81%8C%E9%96%8B%E3%81%91%E3%81%BE%E3%81%9B%E3%82%93)
