[Python] Handling tags (HTML, XML, ...) with regular expressions

Posted at 2018-09-20

These snippets are useful for preprocessing large XML or HTML files.

Target sentence

import re

text = 'He <b>absorbed</b> the knowledge or beliefs of his tribe.'

Extract the contents of a tag

pattern = r'<b>(.+?)</b>'
matched_list = re.findall(pattern, text)

print(matched_list)  # => ['absorbed']
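
The same pattern can be applied when a tag appears more than once or when its contents span a line break. The following is a minimal sketch under those assumptions: the `sample` string is hypothetical, and `re.DOTALL` is added (it is not in the original snippet) so that `.` also matches a newline.

```python
import re

# Hypothetical sample with two <b> elements, one of which spans a line break.
sample = 'He <b>absorbed</b> the <b>knowledge\nor beliefs</b> of his tribe.'

# The non-greedy (.+?) stops at the first closing tag;
# re.DOTALL lets "." match "\n" as well.
pattern = re.compile(r'<b>(.+?)</b>', re.DOTALL)

contents = pattern.findall(sample)
print(contents)  # => ['absorbed', 'knowledge\nor beliefs']
```

Without the `?`, the group would run from the first `<b>` to the last `</b>` and return a single merged string.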

Remove the tags

# Non-greedy match: <b> and </b> are each matched separately and replaced with an empty string
pattern = r'<.+?>'

print(re.sub(pattern, '', text))
#=> He absorbed the knowledge or beliefs of his tribe.
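
For repeated use, the tag removal can be wrapped in a small helper. This is only a sketch: `strip_tags`, `TAG_PATTERN`, and the `sample` string are hypothetical names introduced here, and the pattern `<[^>]+>` is a swapped-in alternative to `<.+?>` that also matches tags containing a line break.

```python
import re

# Assumed alternative to r'<.+?>': [^>] cannot cross a ">" and also matches newlines.
TAG_PATTERN = re.compile(r'<[^>]+>')


def strip_tags(text: str) -> str:
    """Remove every tag-like span while keeping the text between tags."""
    return TAG_PATTERN.sub('', text)


sample = 'He <b class="verb">absorbed</b> the <i>knowledge</i>.'
print(strip_tags(sample))  # => He absorbed the knowledge.
```

This stays a rough preprocessing step: an attribute value that itself contains `>` would be cut in the wrong place, so a real parser such as the standard library's `html.parser` is the safer choice for untrusted markup.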

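Remove the tags together with their contents

# Greedy match: the whole span <b>absorbed</b> is matched and replaced with an empty string
pattern = r'<.+>'
new_text = re.sub(pattern, '', text)
# The spaces on both sides of the removed span remain, so two spaces appear in a row
#=> He  the knowledge or beliefs of his tribe.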

print(new_text.split())  
#=> ['He', 'the', 'knowledge', 'or', 'beliefs', 'of', 'his', 'tribe.']
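
One caveat with the greedy `<.+>` pattern is that it runs from the first `<` to the last `>` on a line, so it removes too much when a line holds more than one element. The sketch below illustrates this with a hypothetical two-element `sample`, and shows one possible per-element pattern that uses a backreference; that second pattern is an assumption of this example, not part of the original snippet.

```python
import re

# Hypothetical sample containing two separate elements on one line.
sample = 'He <b>absorbed</b> the <i>knowledge</i> of his tribe.'

# Greedy: matches from the first "<" to the last ">" on the line,
# so the plain word "the" between the two elements is lost as well.
greedy = re.sub(r'<.+>', '', sample)
print(greedy.split())  # => ['He', 'of', 'his', 'tribe.']

# Per-element: \1 refers back to the tag name captured by (\w+),
# and the non-greedy .*? stops at the first matching closing tag.
per_element = re.sub(r'<(\w+)[^>]*>.*?</\1>', '', sample)
print(per_element.split())  # => ['He', 'the', 'of', 'his', 'tribe.']
```

The per-element pattern does not handle nested elements with the same tag name, which is a general limit of regular expressions on markup.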
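
Note that with the approach below, the backspace character is left behind as its own element when the string is split into a list.

# Match <b>absorbed</b> and replace it with a backspace character ('\b' is '\x08')
pattern = r'<.+>'
new_text = re.sub(pattern, '\b', text)
# A terminal may render the backspace by erasing the preceding space, but '\x08' is still in the string
#=> He the knowledge or beliefs of his tribe.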

print(new_text.split())  
#=> ['He', '\x08', 'the', 'knowledge', 'or', 'beliefs', 'of', 'his', 'tribe.']
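
The leftover element appears because `str.split()` with no arguments only splits on whitespace, and the backspace character is not whitespace. A minimal sketch of one way around this, assuming the goal is a clean word list, is to replace the matched span with a space instead:

```python
import re

text = 'He <b>absorbed</b> the knowledge or beliefs of his tribe.'

# '\b' in a normal string literal is the backspace character (U+0008), not whitespace.
assert '\b' == '\x08' and not '\b'.isspace()

# Replacing with a space leaves only whitespace behind, which split() discards.
spaced = re.sub(r'<.+>', ' ', text)
words = spaced.split()
print(words)
# => ['He', 'the', 'knowledge', 'or', 'beliefs', 'of', 'his', 'tribe.']

# Joining the words again collapses the run of spaces into one.
sentence = ' '.join(words)
print(sentence)
# => He the knowledge or beliefs of his tribe.
```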