
Overlapping Matches with Lookahead/Lookbehind Assertions

Posted at 2020-02-26, last updated at 2020-02-26


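By default, Python 3 finds only non-overlapping matches. Characters that have been matched are "consumed", so searching "AAAA" for "AA" matches in two places, the first two characters and the last two, and not in the middle. Likewise, searching _hoge_foo_baaar_ for _[a-z]+_ does not match _foo_, because its leading underscore has already been consumed as the end of _hoge_.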
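To see the default behaviour concretely, here is a minimal sketch using only the standard re module (the variable names are mine, chosen for illustration):

```python
import re

# "AA" in "AAAA": only the first two and the last two characters match
aa_spans = [m.span() for m in re.finditer("AA", "AAAA")]
print(aa_spans)  # [(0, 2), (2, 4)]

# _foo_ is skipped because its leading "_" was consumed by _hoge_
s = "_hoge_foo_baaar_"
words = re.findall("_[a-z]+_", s)
print(words)  # ['_hoge_', '_baaar_']
```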

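To get overlapping matches, use lookahead/lookbehind assertions, which do not "consume" characters.
(Reference: Pythonの正規表現で重複したマッチのやり方（先読みアサーション）, "How to get overlapping matches with Python regular expressions (lookahead assertions)")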
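As a rough illustration of why this works: a lookahead is zero-width, so after each hit the search position advances by only one character and every start position gets tested. The sketch below puts the pattern in a capturing group inside the lookahead so the text can be read back. This is one common idiom, not necessarily what the referenced article does, and the variable names are my own.

```python
import re

text = "AAAA"
# (?=(AA)) matches the empty string wherever "AA" follows; group 1 holds it
starts = [m.start() for m in re.finditer(r"(?=(AA))", text)]
found = re.findall(r"(?=(AA))", text)
print(starts)  # [0, 1, 2]
print(found)   # ['AA', 'AA', 'AA']
```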

So I tried it out myself.

import re
s = "_hoge_foo_baaar_"
# The lookbehind/lookahead require the underscores without consuming them
key = r"(?<=_)[a-z]+(?=_)"

for m in re.finditer(key, s):
  print(m)
# <re.Match object; span=(1, 5), match='hoge'>
# <re.Match object; span=(6, 9), match='foo'>
# <re.Match object; span=(10, 15), match='baaar'>

# Extract the match positions
spans = [m.span() for m in re.finditer(key, s)]
print(spans)
# [(1, 5), (6, 9), (10, 15)]

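The match does not include the parts specified by the lookahead/lookbehind assertions.
If you want them included, it is probably quicker to use regex (a third-party library, installable with pip install regex, whose API is compatible with re and adds extra features; the official re documentation also points to it).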

import regex
s = "_hoge_foo_baaar_"
key = "_[a-z]+_"

for m in regex.finditer(key, s, overlapped=True):
  print(m)
# <regex.Match object; span=(0, 6), match='_hoge_'>
# <regex.Match object; span=(5, 10), match='_foo_'>
# <regex.Match object; span=(9, 16), match='_baaar_'>

ms = [m.span() for m in regex.finditer(key, s, overlapped=True)]
print(ms)
# [(0, 6), (5, 10), (9, 16)]
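If installing regex is not an option, the same spans can also be obtained with the standard re module by putting the whole pattern, delimiters included, in a capturing group inside a lookahead. This is a sketch of an alternative, not the approach used above; `span(1)` reads the span of group 1 rather than that of the zero-width match itself, and `ms_re` is a name I made up for the example.

```python
import re

s = "_hoge_foo_baaar_"
key = r"(?=(_[a-z]+_))"  # zero-width match; group 1 captures the text

for m in re.finditer(key, s):
    print(m.span(1), m.group(1))
# (0, 6) _hoge_
# (5, 10) _foo_
# (9, 16) _baaar_

ms_re = [m.span(1) for m in re.finditer(key, s)]
```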

