[Python] Handling tags (HTML, XML, ...) with regular expressions

Last updated at 2019-11-13, posted at 2018-09-20

These patterns are useful for preprocessing large XML and HTML files.

Target text

import re
text = 'He <b>absorbed</b> the knowledge or beliefs of his tribe.'

Extract the contents of a tag

pattern = r'<b>(.+?)</b>'
matched_list = re.findall(pattern, text)

print(matched_list)  # => ['absorbed']
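The same idea can be generalized to any tag name. The following is a minimal sketch, not part of the original snippet: the helper name `extract_tag_contents` and the `sample` string are hypothetical, and it assumes simple, non-nested tags without attributes. `re.DOTALL` is added so that `.` also matches line breaks inside a tag.

```python
import re

# Hypothetical sample: two different tags, one of them spanning a line break.
sample = 'He <b>absorbed</b> the <i>knowledge\nor beliefs</i> of his tribe.'

def extract_tag_contents(text, tag):
    """Return the inner text of every <tag>...</tag> pair (non-nested, no attributes)."""
    # re.escape guards against regex metacharacters in the tag name.
    # re.DOTALL lets "." match newlines inside the tag body.
    pattern = r'<{0}>(.+?)</{0}>'.format(re.escape(tag))
    return re.findall(pattern, text, flags=re.DOTALL)

print(extract_tag_contents(sample, 'b'))  # => ['absorbed']
print(extract_tag_contents(sample, 'i'))  # => ['knowledge\nor beliefs']
```

For deeply nested or malformed markup, a real parser is likely a better fit than a regular expression.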

Remove the tags

# Match each tag (<b> and </b>) separately and replace it with an empty string
pattern = r'<.+?>'

print(re.sub(pattern, '', text))  
# => He absorbed the knowledge or beliefs of his tribe.
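One caveat with `<.+?>`: by default `.` does not match a newline, so a tag whose attributes wrap onto the next line is left in place. A possible alternative is the negated character class `<[^>]+>`, sketched below. The `sample` string is a made-up example, and this variant still assumes that no attribute value contains a literal `>`.

```python
import re

# Hypothetical sample: the opening tag is broken across two lines.
sample = 'He <b\n class="x">absorbed</b> the knowledge.'

# "." stops at the newline, so only </b> is removed.
naive = re.sub(r'<.+?>', '', sample)

# [^>] matches any character except ">", including newlines,
# so both the opening and the closing tag are removed.
robust = re.sub(r'<[^>]+>', '', sample)

print(repr(naive))   # => 'He <b\n class="x">absorbed the knowledge.'
print(repr(robust))  # => 'He absorbed the knowledge.'
```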

Remove the tag and its contents

# Greedily match the whole <b>absorbed</b> span and replace it with an empty string
pattern = r'<.+>'
new_text = re.sub(pattern, '', text)
# => He  the knowledge or beliefs of his tribe.

print(new_text.split())  
# => ['He', 'the', 'knowledge', 'or', 'beliefs', 'of', 'his', 'tribe.']
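Note that the greedy `<.+>` works here only because the text contains a single tag pair. With several tags it matches from the first `<` to the last `>` and removes the plain text in between as well. A rough sketch of a stricter pattern follows, assuming non-nested tags. The `sample` string and the backreference pattern are illustrative additions, not part of the original snippet.

```python
import re

# Hypothetical sample with two separate tag pairs.
sample = 'He <b>absorbed</b> the <i>knowledge</i> of his tribe.'

# Greedy: swallows everything from "<b>" through "</i>", including " the ".
greedy = re.sub(r'<.+>', '', sample)

# Paired: (\w+) captures the tag name and \1 requires the matching closing tag.
# \b keeps the name from matching only a prefix of a longer tag name.
paired = re.sub(r'<(\w+)\b[^>]*>.*?</\1>', '', sample, flags=re.DOTALL)

print(greedy.split())  # => ['He', 'of', 'his', 'tribe.']
print(paired.split())  # => ['He', 'the', 'of', 'his', 'tribe.']
```

The paired pattern does not handle a tag nested inside another tag of the same name.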

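Note that with the approach below, the backspace character remains as an element when the text is split into a list.

# Match <b>absorbed</b> and replace it with a backspace character
pattern = r'<.+>'
new_text = re.sub(pattern, '\b', text)
# => He the knowledge or beliefs of his tribe.  (as displayed in a terminal; the string still contains '\x08')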

print(new_text.split())  
# => ['He', '\x08', 'the', 'knowledge', 'or', 'beliefs', 'of', 'his', 'tribe.']
