
[Python] Regular Expressions

Last updated at 2022-08-12, posted at 2022-08-10

Notes on using regular expressions in Python.

Matching functions

match

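If the beginning of the string matches the pattern, returns a Match object (which is truthy, but is not the literal True).
If the beginning does not match, returns None.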

import re

content = 'The high on August 7 is 35 degrees'

result = re.match(r'[A-Za-z]+',content)

if result:
    # The matched substring
    print(result.group())
    # Index where the match starts
    print(result.start())
    # Index just past the end of the match
    print(result.end())
    # start and end together as a tuple
    print(result.span())
else:
    print("No match")

>> The
>> 0
>> 3
>> (0, 3)
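As a small supplementary sketch, the following checks what re.match actually hands back. The variable names and the second sample string are made up for illustration, and re.Match as a class name assumes a reasonably recent Python 3.

```python
import re

content = 'The high on August 7 is 35 degrees'

# Letters appear at the very start, so a Match object comes back
word = re.match(r'[A-Za-z]+', content)
print(isinstance(word, re.Match))  # True
print(bool(word))                  # True (a Match object is truthy)
print(word is True)                # False (it is not the literal True)

# This string starts with digits, so the letter pattern fails at position 0
no_word = re.match(r'[A-Za-z]+', '35 degrees')
print(no_word)                     # None
```

Because the result is either a Match object or None, `if result:` works as a test, while comparing with `== True` would not.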

compile

Compiles a regular expression pattern into a pattern object, whose match() and search() methods are then used for matching.

import re

content = 'The high on August 7 is 35 degrees'
pattern = re.compile(r'[A-Za-z]+')
result = pattern.match(content)
if result:
    # The matched substring
    print(result.group())
    # Index where the match starts
    print(result.start())
    # Index just past the end of the match
    print(result.end())
    # start and end together as a tuple
    print(result.span())
else:
    print("No match")

>> The
>> 0
>> 3
>> (0, 3)
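One way to picture the benefit of a pattern object is to reuse it across several strings. The sketch below is only an illustration: the decimal-number pattern and the names `number` and `found` are assumptions introduced here.

```python
import re

contents = [
    'The high on August 7 is 35.8 degrees.',
    'Tropical is a night tonight.',
]

# Compile once; the pattern matches integers and simple decimals
number = re.compile(r'\d+(?:\.\d+)?')

# The same pattern object is applied to every string
found = [number.findall(content) for content in contents]
print(found)           # [['7', '35.8'], []]

# The source pattern string stays available on the object
print(number.pattern)  # \d+(?:\.\d+)?
```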

Comparing compiled and uncompiled patterns

With compilation

import re
import time

contents = ['The high on August 7 is 35.8 degrees.', 'Tropical is a night tonight.', 'The high temperature on August 10 is supposed to be 40.59 degrees.']

pattern = re.compile(r'[a-zA-Z]+')

start = time.perf_counter()
for i in range(10**6):
    for content in contents:
        pattern.search(content)
end = time.perf_counter()
print(f'{end-start}sec')

>> 0.7540798999980325sec

Without compilation

import re
import time

contents = ['The high on August 7 is 35.8 degrees.', 'Tropical is a night tonight.', 'The high temperature on August 10 is supposed to be 40.59 degrees.']

pattern = r'[a-zA-Z]+'

start = time.perf_counter()
for i in range(10**6):
    for content in contents:
        re.search(pattern,content)
end = time.perf_counter()
print(f'{end-start}sec')

>> 2.0059483000004548sec

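Unsurprisingly, the compiled version is faster (about 0.75 sec versus 2.0 sec in this run).

According to the official documentation, programs that use only a few regular expressions at a time do not need to compile them.
Compilation pays off in situations such as loops, where the same pattern is applied many times.

Note: The compiled versions of the most recent patterns passed to re.compile() and the module-level matching functions are cached, so programs that use only a few regular expressions at a time needn't worry about compiling regular expressions.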
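The same comparison can also be sketched with the standard timeit module, which handles the timing loop itself. This is a rough sketch rather than a rigorous benchmark: the function names and the repetition count of 10**4 are arbitrary choices made here, and the numbers will differ by machine and Python version.

```python
import re
import timeit

contents = [
    'The high on August 7 is 35.8 degrees.',
    'Tropical is a night tonight.',
]
raw_pattern = r'[a-zA-Z]+'
compiled_pattern = re.compile(raw_pattern)


def with_compiled():
    # Calls search() directly on the precompiled pattern object
    for content in contents:
        compiled_pattern.search(content)


def with_module_function():
    # Goes through re.search(), which has to resolve the pattern on every call
    for content in contents:
        re.search(raw_pattern, content)


# 10**4 repetitions keeps the run short; raise it for steadier numbers
compiled_sec = timeit.timeit(with_compiled, number=10**4)
module_sec = timeit.timeit(with_module_function, number=10**4)
print(f'compiled:     {compiled_sec:.4f}sec')
print(f'module-level: {module_sec:.4f}sec')
```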

search

If any part of the target string matches the pattern, returns a Match object for the first match.
If nothing matches, returns None.

import re

content = 'The high on August 7 is 35 degrees'

result = re.search(r'\d+',content)

if result:
    # If several parts of the string match, only the first one is returned
    print(result.group())
    print(result.start())
    print(result.end())
    print(result.span())
else:
    print("No match")

>> 7
>> 19
>> 20
>> (19, 20)
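Since search stops at the first hit, one possible way to reach the next one is to pass a start position to the search() method of a compiled pattern. The sketch below assumes that approach; the names `first`, `second` and `third` are illustrative.

```python
import re

content = 'The high on August 7 is 35 degrees'
pattern = re.compile(r'\d+')

# First hit, scanning from the beginning of the string
first = pattern.search(content)
print(first.group(), first.span())    # 7 (19, 20)

# Resume scanning from where the previous match ended
second = pattern.search(content, first.end())
print(second.group(), second.span())  # 35 (24, 26)

# No digits remain after the second match
third = pattern.search(content, second.end())
print(third)                          # None
```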

findall

Returns all matching substrings as a list.
If nothing matches, returns an empty list.

import re

content = 'The high on August 7 is 35 degrees'

result = re.findall(r'\d+',content)

print(result)

>> ['7', '35']
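findall returns plain strings, so the positions are lost. As a supplementary sketch, re.finditer gives one Match object per hit instead, and a capturing group changes what findall returns. The names `hits` and `degrees` are assumptions for this example.

```python
import re

content = 'The high on August 7 is 35 degrees'

# finditer yields Match objects, so the span of each hit is available
hits = [(m.group(), m.span()) for m in re.finditer(r'\d+', content)]
print(hits)     # [('7', (19, 20)), ('35', (24, 26))]

# With one capturing group, findall returns the group rather than the whole match
degrees = re.findall(r'(\d+) degrees', content)
print(degrees)  # ['35']
```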

fullmatch

If the entire target string matches the pattern, returns a Match object.
Otherwise, returns None.

import re

content = 'The high on August 7 is 35 degrees'

result = re.fullmatch(r'\d+',content)

if result:
    print(result.group())
else:
    print("No match")

>> No match
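To see where fullmatch differs from match, the following sketch runs both on a few made-up strings. The `candidates` list and the `results` name are illustrative only.

```python
import re

candidates = ['35', '35 degrees', 'August']

# For each string: does the start match, and does the whole string match?
results = [
    (text, bool(re.match(r'\d+', text)), bool(re.fullmatch(r'\d+', text)))
    for text in candidates
]
for row in results:
    print(row)
# ('35', True, True)
# ('35 degrees', True, False)
# ('August', False, False)
```

This makes fullmatch a natural fit for validating that an input consists of nothing but the expected pattern.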

split

Splits the target string at each match of the pattern and returns the pieces as a list.

import re

content = 'Interesting01YouTube1called2Weather3News'

result = re.split(r'\d+',content)

# re.split always returns a non-empty list (the whole string if nothing matches),
# so no None check is needed
print(result)

>> ['Interesting', 'YouTube', 'called', 'Weather', 'News']
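A few variations on split may be worth noting. The sketch below tries the maxsplit argument, a capturing group, and an input with no match; the variable names are chosen here for illustration.

```python
import re

content = 'Interesting01YouTube1called2Weather3News'

# maxsplit limits how many splits are made
once = re.split(r'\d+', content, maxsplit=1)
print(once)       # ['Interesting', 'YouTube1called2Weather3News']

# A capturing group keeps the separators in the result
with_seps = re.split(r'(\d+)', content)
print(with_seps)  # ['Interesting', '01', 'YouTube', '1', 'called', '2', 'Weather', '3', 'News']

# With no match, the whole string comes back as a single element
untouched = re.split(r'\d+', 'NoDigitsHere')
print(untouched)  # ['NoDigitsHere']
```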

sub (similar to replace)

Replaces the parts of the target string that match the pattern.

import re

content = 'Interesting 01 YouTube 1 called 2 Weather 3 News'

# Replace each run of digits with X
result = re.sub(r'\d+','X',content)

print(result)

>> Interesting X YouTube X called X Weather X News
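Beyond a fixed replacement string, sub also accepts a count limit and a function as the replacement. The following is a minimal sketch of both; the doubling rule is just an invented example.

```python
import re

content = 'Interesting 01 YouTube 1 called 2 Weather 3 News'

# count limits the number of replacements (passed by keyword)
first_two = re.sub(r'\d+', 'X', content, count=2)
print(first_two)  # Interesting X YouTube X called 2 Weather 3 News

# A function receives each Match object and returns the replacement text
doubled = re.sub(r'\d+', lambda m: str(int(m.group()) * 2), content)
print(doubled)    # Interesting 2 YouTube 2 called 4 Weather 6 News
```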

subn

Does the same job as sub.
Where sub returns a string, subn returns a tuple of (new string, number of replacements made).

import re

content = 'Interesting 01 YouTube 1 called 2 Weather 3 News'

# Replace each run of digits with X
result = re.subn(r'\d+','X',content)

print(result)

>> ('Interesting X YouTube X called X Weather X News', 4)
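The tuple can be unpacked directly, and the count offers a simple way to tell whether anything changed. A short sketch, with variable names and the second sample string assumed for illustration:

```python
import re

content = 'Interesting 01 YouTube 1 called 2 Weather 3 News'

# Unpack the new string and the number of replacements
new_text, replaced = re.subn(r'\d+', 'X', content)
print(new_text)  # Interesting X YouTube X called X Weather X News
print(replaced)  # 4

# When nothing matches, the string is unchanged and the count is 0
same_text, none_replaced = re.subn(r'\d+', 'X', 'No digits here')
print(same_text, none_replaced)  # No digits here 0
```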

