Regular Expressions in Python

Python3

Last updated at 2020-06-28, posted at 2020-06-27

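#Table of contents

Summary of pattern matching with regular expressions
 Summary of regular expression symbols
 Shorthand character classes
 Basic pattern-matching flow
 To search for regular expression symbols literally, escape them with a backslash
 Matching multiple patterns
 "?": zero or one of the preceding pattern
 "*": zero or more of the preceding pattern
 "+": one or more of the preceding pattern
 "{ }": specifying the number of repetitions
 Greedy and non-greedy matching
 Returning every match: the findall() method
 "[ ]": defining your own character class
 "^": matching at the start of the string
 "$": matching at the end of the string
 ".": any single character except a newline
 ".*": matches any string
 Matching newlines with the dot: re.DOTALL
Case-insensitive matching: re.IGNORECASE or re.I
Replacing strings: the sub() method
 Using the matched text as part of the replacement
 Managing complex regular expressions: re.VERBOSE
Combining re.IGNORECASE, re.DOTALL, and re.VERBOSE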

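##Summary of pattern matching with regular expressions
####1. Import the regular expression module with import re.
Note: "regex" is short for "regular expression".
####2. Call re.compile() to create a Regex object.
(Use a raw string.)
Note: regular expressions use backslashes heavily, and escaping every one of them in a normal string is tedious.
####3. Pass the string to search to a method of the Regex object.
search() method: returns a Match object for the first match anywhere in the string, or None if nothing matches
findall() method: returns a list of every match (strings, or tuples when the pattern has two or more groups), not a Match object
match() method: like search(), but matches only at the start of the string
####4. Call a method of the Match object to get the text that actually matched.
group() method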
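
The difference between the three methods in step 3 is easiest to see side by side. The following is a minimal sketch; the pattern and sample text are made up for illustration.

```python
import re

digits_regex = re.compile(r'\d+')
text = 'Room 101, floor 3'

found = digits_regex.search(text)      # Match object for the first hit anywhere
at_start = digits_regex.match(text)    # None: the text does not start with a digit
all_hits = digits_regex.findall(text)  # plain list of strings, no Match objects

print(found.group())  # 101
print(at_start)       # None
print(all_hits)       # ['101', '3']
```

Because search() and match() return None when nothing matches, it is usually worth checking the result before calling group() on it.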

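##Summary of regular expression symbols

| Syntax | Meaning |
|---|---|
| ( ) | Grouping |
| \| | Matches one of several alternative patterns |
| ? | Zero or one of the preceding pattern (placed after a repetition such as * or +, it makes that repetition non-greedy) |
| * | Zero or more of the preceding pattern |
| + | One or more of the preceding pattern |
| { } | Specifies the number of repetitions: {3}, {3,5}, {3,}, {,5} |
| ^spam | A pattern that starts with "spam" |
| spam$ | A pattern that ends with "spam" |
| . | Any single character except a newline |
| [abc%] | Matches any single character listed inside [ ] (here "a", "b", "c", or "%"). Note: most regular expression symbols are not interpreted inside [ ], so they need no escaping; the exceptions are the backslash, the closing bracket, a leading caret, and a hyphen between two characters |
| [^abc] | Matches any single character not listed inside [ ] (here anything except "a", "b", and "c") |
| [a-zA-Z0-9] | Matches any single lowercase letter, uppercase letter, or digit |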

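##Shorthand character classes

| Syntax | Meaning |
|---|---|
| \d | A digit (0-9) |
| \D | Anything that is not a digit |
| \w | A letter, a digit, or the underscore |
| \W | Anything that is not a letter, digit, or underscore |
| \s | A space, tab, or newline (any whitespace character) |
| \S | Anything that is not a space, tab, or newline |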
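
As a quick illustration of these shorthand classes, here is a small sketch using findall(); the sample string is invented for the example.

```python
import re

sample = 'id_42: a b\tc\n'

digits = re.findall(r'\d', sample)        # each digit separately
words = re.findall(r'\w+', sample)        # runs of letters, digits, and underscores
spaces = re.findall(r'\s', sample)        # every whitespace character
non_words = re.findall(r'\W', sample)     # everything that \w does not match

print(digits)     # ['4', '2']
print(words)      # ['id_42', 'a', 'b', 'c']
print(spaces)     # [' ', ' ', '\t', '\n']
print(non_words)  # [':', ' ', ' ', '\t', '\n']
```

Note that the underscore in id_42 is matched by \w, so the identifier comes back as a single word.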

##Basic pattern-matching flow

python.py

import re

phone_num_regex = re.compile(r'\d{3}-\d{3}-\d{4}')
mo = phone_num_regex.search('My phone number is 415-555-4242.')
mo.group()

result.

'415-555-4242'

####Groups can also be retrieved individually

python.py

phone_num_regex = re.compile(r'(\d{3})-(\d{3}-\d{4})')
mo = phone_num_regex.search('My phone number is 415-555-4242.')
print(mo.group(1))
print(mo.group(2))
print(mo.group())     # returns the whole match
print(mo.group(0))    # same as group(): the whole match
print(mo.groups())    # returns a tuple of all groups

result.

415
555-4242
415-555-4242
415-555-4242
('415', '555-4242')

##To search for regular expression symbols literally, escape them with a backslash

python.py

phone_num_regex = re.compile(r'(\(\d{3}\))(\d{3}-\d{4})')
mo = phone_num_regex.search('My phone number is (415)555-4242.')
print(mo.group())
print(mo.group(1))

result.

(415)555-4242
(415)
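
When the text to search for comes from a variable, escaping each symbol by hand is error-prone. The standard library's re.escape() does it for you. The following is a small sketch; the price string is an invented example.

```python
import re

price = '$9.99 (sale)'
pattern = re.escape(price)   # backslash-escapes the regex symbols in price

exact = re.search(pattern, 'Now only $9.99 (sale)!')
near_miss = re.search(pattern, 'Now only $9x99 (sale)!')

print(exact.group())  # $9.99 (sale)
print(near_miss)      # None: the escaped "." no longer matches "x"
```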

##Matching multiple patterns

python.py

hero_regex = re.compile(r'Batman|Tina Fey')
mo1 = hero_regex.search('Batman and Tina Fey.')
print(mo1.group())

mo2 = hero_regex.search('Tina Fey and Batman.')
print(mo2.group())

result.

Batman
Tina Fey

python.py

bat_regex = re.compile(r'Bat(man|mobile|copter|bat)')
mo = bat_regex.search('Batmobile lost a wheel')
print(mo.group())  # the whole match
print(mo.group(1)) # the text matched by the first () group

result.

Batmobile
mobile

##"?": zero or one of the preceding pattern

python.py

phone_num_regex = re.compile(r'(\d{3}-)?(\d{3}-\d{4})')
mo1 = phone_num_regex.search('My phone number is 415-555-4242.')
print(mo1.group())

mo2 = phone_num_regex.search('My phone number is 555-4242.')
print(mo2.group())

result.

415-555-4242
555-4242
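
One detail worth knowing: when an optional group does not take part in the match, its value is None rather than an empty string. A minimal sketch using the same pattern:

```python
import re

phone_num_regex = re.compile(r'(\d{3}-)?(\d{3}-\d{4})')
mo = phone_num_regex.search('My phone number is 555-4242.')

area_code = mo.group(1)
all_groups = mo.groups()
with_default = mo.groups(default='')   # substitute '' for groups that did not match

print(area_code)     # None
print(all_groups)    # (None, '555-4242')
print(with_default)  # ('', '555-4242')
```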

##"*": zero or more of the preceding pattern

python.py

bat_regex = re.compile(r'Bat(wo)*man')
mo1 = bat_regex.search('The Adventures of Batwowowowowoman')
mo1.group()

result.

'Batwowowowowoman'

##"+": one or more of the preceding pattern

python.py

bat_regex = re.compile(r'Bat(wo)+man')
mo1 = bat_regex.search('The Adventures of Batwowowowowoman')
print(mo1.group())

mo2 = bat_regex.search('The Adventures of Batwoman')
print(mo2.group())

mo3 = bat_regex.search('The Adventures of Batman')
print(mo3)

result.

Batwowowowowoman
Batwoman
None

##"{ }": specifying the number of repetitions

python.py

ha_regex = re.compile(r'(Ha){3,5}')
mo1 = ha_regex.search('HaHaHaHaHa')
print(mo1.group())

mo2 = ha_regex.search('Ha')
print(mo2)

result.

HaHaHaHaHa
None
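
The other forms of { } listed in the symbol summary behave as follows. This is a short sketch on an invented string of six "Ha" repetitions.

```python
import re

text = 'HaHaHaHaHaHa'   # six repetitions of 'Ha'

exactly_3 = re.search(r'(Ha){3}', text).group()     # exactly three
at_least_3 = re.search(r'(Ha){3,}', text).group()   # three or more
at_most_5 = re.search(r'(Ha){,5}', text).group()    # zero to five
from_3_to_5 = re.search(r'(Ha){3,5}', text).group() # three to five

print(exactly_3)    # HaHaHa
print(at_least_3)   # HaHaHaHaHaHa
print(at_most_5)    # HaHaHaHaHa
print(from_3_to_5)  # HaHaHaHaHa
```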

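##Greedy and non-greedy matching
Matching is greedy by default (the longest possible match is taken).
####Adding "?" after a repetition makes it non-greedy (the shortest possible match is taken)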

python.py

greedy_ha_regex = re.compile(r'(Ha){3,5}')
mo1 = greedy_ha_regex.search('HaHaHaHaHa')
print(mo1.group())

nongreedy_ha_regex = re.compile(r'(Ha){3,5}?')
mo2 = nongreedy_ha_regex.search('HaHaHaHaHa')
print(mo2.group())

result.

HaHaHaHaHa
HaHaHa

##Returning every match: the findall() method
Note: search() returns a Match object for only the first match found.

####For a regular expression without groups, findall() returns a list of the matched strings

python.py

phone_num_regex = re.compile(r'\d{3}-\d{3}-\d{4}')
mo = phone_num_regex.search('Cell: 415-555-9999 Work: 212-555-0000')
print(mo.group())
print(phone_num_regex.findall('Cell: 415-555-9999 Work: 212-555-0000'))

result.

415-555-9999
['415-555-9999', '212-555-0000']

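####For a regular expression with groups, findall() returns a list of tuples holding the string for each group (with a single group, a list of that group's strings)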

python.py

phone_num_regex = re.compile(r'(\d{3})-(\d{3})-(\d{4})')
phone_num_regex.findall('Cell: 415-555-9999 Work: 212-555-0000')

result.

[('415', '555', '9999'), ('212', '555', '0000')]
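
If you need both the whole match and the individual groups, one option is finditer(), which yields a Match object per hit instead of tuples. A minimal sketch with the same pattern:

```python
import re

phone_num_regex = re.compile(r'(\d{3})-(\d{3})-(\d{4})')
text = 'Cell: 415-555-9999 Work: 212-555-0000'

# Each item produced by finditer() is a Match object, so group() works as usual.
whole_numbers = [mo.group() for mo in phone_num_regex.finditer(text)]
area_codes = [mo.group(1) for mo in phone_num_regex.finditer(text)]

print(whole_numbers)  # ['415-555-9999', '212-555-0000']
print(area_codes)     # ['415', '212']
```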

##"[ ]": defining your own character class

python.py

vowel_regex = re.compile(r'[aeiouAEIOU]')    # matches vowels
vowel_regex.findall('RoboCop eats baby food.')

result.

['o', 'o', 'o', 'e', 'a', 'a', 'o', 'o']

####A caret (^) right after [ gives the complement of the class (everything except the listed characters)

python.py

vowel_regex = re.compile(r'[^aeiouAEIOU]')    # matches everything except vowels
vowel_regex.findall('RoboCop eats baby food.')

result.

['R', 'b', 'C', 'p', ' ', 't', 's', ' ', 'b', 'b', 'y', ' ', 'f', 'd', '.']

##"^": matching at the start of the string

python.py

begins_with_hello = re.compile(r'^Hello')
print(begins_with_hello.search('Hello world!'))
print(begins_with_hello.search('He said Hello.'))

result.

<re.Match object; span=(0, 5), match='Hello'>
None

##"$": matching at the end of the string

python.py

ends_with_number = re.compile(r'\d$')
ends_with_number.search('Your number is 42')

result.

<re.Match object; span=(16, 17), match='2'>

####A string that consists entirely of one or more digits

python.py

whole_string_is_number = re.compile(r'^\d+$')
print(whole_string_is_number.search('1234567890'))
print(whole_string_is_number.search('12345abc67890'))
print(whole_string_is_number.match('123 4567890'))

result.

<re.Match object; span=(0, 10), match='1234567890'>
None
None
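
One caveat with this idiom: "$" also matches just before a newline at the very end of the string, so input with a trailing newline still passes. If that matters, fullmatch() is a stricter alternative. A small sketch:

```python
import re

whole_string_is_number = re.compile(r'^\d+$')
digits_only = re.compile(r'\d+')

lenient = whole_string_is_number.search('12345\n')  # still matches '12345'
strict = digits_only.fullmatch('12345\n')           # the whole string must match
clean = digits_only.fullmatch('12345')

print(lenient.group())  # 12345
print(strict)           # None
print(clean.group())    # 12345
```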

##".": any single character except a newline

python.py

at_regex = re.compile(r'.at')
at_regex.findall('The cat in the hat sat on the flat mat.')

result.

['cat', 'hat', 'sat', 'lat', 'mat']

##".*": matches any string

python.py

name_regex = re.compile(r'First\s*Name:\s*(.*)\s*Last\s*Name:\s*(.*)')
mo = name_regex.search('First Name: Al Last Name: Sweigart')
print(mo.group(1))
print(mo.group(2))

result.

Al 
Sweigart
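
Note that the first group in the result above is 'Al ' with a trailing space, because the greedy ".*" takes as much as it can. Switching that group to the non-greedy ".*?" leaves the space to the following \s*. A minimal sketch comparing the two:

```python
import re

greedy = re.compile(r'First\s*Name:\s*(.*)\s*Last\s*Name:\s*(.*)')
lazy = re.compile(r'First\s*Name:\s*(.*?)\s*Last\s*Name:\s*(.*)')
text = 'First Name: Al Last Name: Sweigart'

greedy_first = greedy.search(text).group(1)
lazy_first = lazy.search(text).group(1)

print(repr(greedy_first))  # 'Al '
print(repr(lazy_first))    # 'Al'
```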

####"." does not match a newline

python.py

no_newline_regex = re.compile(r'.*')
print(no_newline_regex.findall('Serve the public trust.\nProtect the innocent.\nUphold the law.'))

no_newline_regex = re.compile(r'.+')
print(no_newline_regex.findall('Serve the public trust.\nProtect the innocent.\nUphold the law.'))

result.

['Serve the public trust.', '', 'Protect the innocent.', '', 'Uphold the law.', '']
['Serve the public trust.', 'Protect the innocent.', 'Uphold the law.']

####Greedy and non-greedy

python.py

nogreedy_regex = re.compile(r'<.*?>') # non-greedy
mo = nogreedy_regex.search('<To serve man> for dinner.>')
print(mo.group())

greedy_regex = re.compile(r'<.*>') # greedy
mo = greedy_regex.search('<To serve man> for dinner.>')
print(mo.group())

result.

<To serve man>
<To serve man> for dinner.>

##Matching newlines with the dot: re.DOTALL
####Passing re.DOTALL as the second argument to re.compile() makes the dot match every character, including newlines

python.py

newline_regex = re.compile(r'.*', re.DOTALL)
newline_regex.search('Serve the public trust.\nProtect the innocent.\nUphold the law.').group()

result.

'Serve the public trust.\nProtect the innocent.\nUphold the law.'

##Case-insensitive matching: re.IGNORECASE or re.I
####Passing re.IGNORECASE or re.I as the second argument to re.compile() makes the match ignore the difference between uppercase and lowercase

python.py

robocop_regex = re.compile(r'robocop', re.I)
print(robocop_regex.search('Robocop is part man, part machine, all cop.').group())
print(robocop_regex.search('ROBOCOP protects the innocent.').group())
print(robocop_regex.search('Please tell me about the robocop.').group())

result.

Robocop
ROBOCOP
robocop

##Replacing strings: the sub() method
####sub() takes two required arguments
####First argument: the replacement string
####Second argument: the string to search and replace in

python.py

names_regex = re.compile(r'Agent\s\w+')
names_regex.sub('CENSORED', 'Agent Alice gave the secret documents to Agent Bob.')

result.

'CENSORED gave the secret documents to CENSORED.'
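
Besides the two required arguments, sub() accepts an optional count that limits the number of replacements, and the related subn() also reports how many were made. A brief sketch with the same pattern:

```python
import re

names_regex = re.compile(r'Agent\s\w+')
text = 'Agent Alice gave the secret documents to Agent Bob.'

first_only = names_regex.sub('CENSORED', text, count=1)  # replace only the first hit
result, n_replaced = names_regex.subn('CENSORED', text)  # (new string, number of replacements)

print(first_only)  # CENSORED gave the secret documents to Agent Bob.
print(result)      # CENSORED gave the secret documents to CENSORED.
print(n_replaced)  # 2
```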

##Using the matched text as part of the replacement
####In the first argument to sub(), refer to groups by number, such as \1, \2, \3

python.py

agent_names_regex = re.compile(r'Agent\s(\w)\w*')
agent_names_regex.sub(r'\1****', 'Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.')

result.

'A**** told C**** that E**** knew B**** was a double agent.'

python.py

agent_names_regex = re.compile(r'Agent\s\w(\w*)')
agent_names_regex.sub(r'*\1', 'Agent Alice told Agent Carol that Agent Eve knew Agent Bob was a double agent.')

result.

'*lice told *arol that *ve knew *ob was a double agent.'
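
Two related options may be useful here. The \g<1> spelling avoids ambiguity when a digit follows the group reference, and sub() also accepts a function in place of the replacement string. The sketch below uses a shortened sentence, and mask_name is a hypothetical helper written for this example.

```python
import re

text = 'Agent Alice told Agent Bob.'

# r'\10' would be read as group 10; r'\g<1>0' means group 1 followed by "0".
initial_regex = re.compile(r'Agent\s(\w)\w*')
numbered = initial_regex.sub(r'\g<1>0', text)
print(numbered)  # A0 told B0.

# A function receives each Match object and returns the replacement text.
name_regex = re.compile(r'Agent\s(\w)(\w*)')

def mask_name(mo):
    # Keep the initial and replace every remaining letter with an asterisk.
    return mo.group(1) + '*' * len(mo.group(2))

masked = name_regex.sub(mask_name, text)
print(masked)  # A**** told B**.
```

Unlike the fixed r'\1****' replacement, the function version keeps the length of each name.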

##Managing complex regular expressions: re.VERBOSE
####Makes whitespace and comments inside the regular expression string be ignored

python.py

# To make a complicated regular expression readable, newlines and comments are ignored
# Spaces are ignored even inside () (write \s instead)
phone_regex = re.compile(r'''(
    (\d{3}|\(\d{3}\))?            # 3-digit area code (parentheses allowed)
    (\s|-|\.)?                    # separator (whitespace, hyphen, or period)
    \d{3}                         # 3-digit exchange
    (\s|-|\.)                     # separator
    \d{4}                         # 4-digit number
    (\s*(ext|x|ext\.)\s*\d{2,5})? # extension of 2 to 5 digits
    )''', re.VERBOSE)

print(phone_regex.findall('111-222-3333, (111)-222 3333, (111) 222.3333'))

result.

[('111-222-3333', '111', '-', '-', '', ''), ('(111)-222 3333', '(111)', '-', ' ', '', ''), ('(111) 222.3333', '(111)', ' ', '.', '', '')]
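
Because re.VERBOSE drops unescaped whitespace from the pattern, a literal space has to be written some other way. The sketch below, with an invented sample string, shows a bare space being ignored and three ways to keep it.

```python
import re

text = 'area 415'

ignored_space = re.compile(r'area 415', re.VERBOSE)   # the space is dropped: same as 'area415'
escaped_space = re.compile(r'area\ 415', re.VERBOSE)  # a backslash keeps the space
class_space = re.compile(r'area[ ]415', re.VERBOSE)   # so does a character class
any_whitespace = re.compile(r'area\s415', re.VERBOSE) # \s matches any whitespace character

print(ignored_space.search(text))           # None
print(ignored_space.search('area415'))      # a Match object
print(escaped_space.search(text).group())   # area 415
print(class_space.search(text).group())     # area 415
print(any_whitespace.search(text).group())  # area 415
```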

##Combining re.IGNORECASE, re.DOTALL, and re.VERBOSE
####re.compile() takes only a single value as its second argument, so combine the flags with the bitwise OR operator "|"

python.py

some_regex_value = re.compile(r'''
    ^m.*
    ''', re.IGNORECASE | re.DOTALL | re.VERBOSE) 

some_regex_value.match("My name is Ken.\nWhat's your name?")

result.

<re.Match object; span=(0, 33), match="My name is Ken.\nWhat's your name?">
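
Each flag also has a one-letter alias (re.I, re.S for DOTALL, re.X for VERBOSE), and the combined value can be stored in a variable. A minimal sketch; the variable names are chosen for this example.

```python
import re

long_form = re.IGNORECASE | re.DOTALL | re.VERBOSE
short_form = re.I | re.S | re.X    # one-letter aliases of the same three flags

print(long_form == short_form)  # True

some_regex_value = re.compile(r'''
    ^m.*   # an "m" in either case at the start, then everything to the end
    ''', short_form)

span = some_regex_value.match("My name is Ken.\nWhat's your name?").span()
print(span)  # (0, 33)
```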
