How to get every string a regular expression can match

Last updated at 2015-12-10 / Posted at 2015-12-09

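What I want to do

I want to get every string that a given regular expression matches.
Specifically, when building a crawler, I want to specify the target URLs with a regular expression.
For example: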

Input

http://xxx/abc[x-z]/image(9|10|11).png

Desired output

http://xxx/abcx/image9.png
http://xxx/abcy/image9.png
http://xxx/abcz/image9.png
http://xxx/abcx/image10.png
http://xxx/abcy/image10.png
http://xxx/abcz/image10.png
http://xxx/abcx/image11.png
http://xxx/abcy/image11.png
http://xxx/abcz/image11.png

How to do it

Use google/sre_yield.

sample.py

import sre_yield

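if __name__ == '__main__':
    # The dot before "png" is escaped so that it matches only a literal "."
    regex = r'http://xxx/abc[x-z]/image(9|10|11)\.png'
    # AllStrings enumerates every string the pattern matches
    urllist = list(sre_yield.AllStrings(regex))
    print(urllist)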

Output

['http://xxx/abcx/image9.png', 'http://xxx/abcy/image9.png', 'http://xxx/abcz/image9.png', 'http://xxx/abcx/image10.png', 'http://xxx/abcy/image10.png', 'http://xxx/abcz/image10.png', 'http://xxx/abcx/image11.png', 'http://xxx/abcy/image11.png', 'http://xxx/abcz/image11.png']
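To make it clearer what the expansion amounts to, here is a minimal standard-library sketch that builds the same nine URLs by hand. It does not show how sre_yield works internally, and it does not parse a regular expression at all: the function name `expand_urls` and its parameters `letters` and `numbers` are hypothetical, made up for this illustration, and the URL template is hard-coded. It only shows that for a pattern like this one the result is the Cartesian product of the choices at each position.

```python
from itertools import product


def expand_urls(letters, numbers):
    """Build every URL for the hard-coded template (hypothetical helper).

    letters: the choices for the character class, e.g. "xyz" for [x-z]
    numbers: the choices for the alternation, e.g. ["9", "10", "11"]
    """
    # product() varies its last argument fastest, so passing numbers first
    # makes the letter change fastest, matching the order shown above.
    return [
        f"http://xxx/abc{letter}/image{number}.png"
        for number, letter in product(numbers, letters)
    ]


urls = expand_urls("xyz", ["9", "10", "11"])
for url in urls:
    print(url)
```

Hand-writing the product like this only works when the pattern is known in advance. For an arbitrary pattern given as a string, a library such as sre_yield is the practical choice.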

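I learned about sre_yield recently from an answer on Stack Overflow.
There seems to be little demand for expanding regular expressions, and I had a hard time finding anything about it on the web, so I wrote it up here.
Once again, I am deeply indebted to Stack Overflow.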
