More than 5 years have passed since last update.

[Python] Extracting only the text from an XML document that contains tags

Posted at 2019-06-29

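Overview

With an XML document like the one below, the text cannot be extracted by simply reading the root element's text.

Bad example
from lxml import etree  # assumed import; the original snippet omits it

node = etree.fromstring("""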
<content>
    Text outside tag <div>Text <em>inside</em> tag</div>
</content>
""")

print(node.text)

>> Text outside tag  # only the text directly under <content> is returned.

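If the XML structure is known in advance, you can simply call find() on each element, but that approach breaks down when tags appear at arbitrary positions in the document.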
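For contrast, here is a minimal sketch of the fixed-structure case, where find() is enough. The document layout (`article`, `title`, `body`) is made up for illustration, and the sketch uses the standard library's `xml.etree.ElementTree` rather than lxml so that it runs without extra dependencies; `find()`, `findtext()` and `.text` are available under the same names in lxml.

```python
import xml.etree.ElementTree as ET

# Hypothetical document whose structure is known in advance.
doc = ET.fromstring("""
<article>
    <title>Hello</title>
    <body>World</body>
</article>
""")

# Each piece of text lives in a known element, so find() works.
title = doc.find("title").text
# findtext() is a shortcut for find(...).text
body = doc.findtext("body")

print(title, body)
```

This only works because no element mixes text with child tags. As soon as `<body>` contains something like `<em>`, `.text` stops at the first child tag.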

Solution

See this question:

https://stackoverflow.com/questions/4624062/get-all-text-inside-a-tag-in-lxml

Note, however, that the stringify_children() function in the top answer there is not accurate, so use node.itertext() instead.
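To make it clearer what itertext() does, the sketch below walks the tree by hand. An element's text is split between `.text` (before its first child) and each child's `.tail` (after that child's closing tag), so both must be collected in document order. The helper name `iter_text_manually` is hypothetical, and the sketch again assumes the standard library's `xml.etree.ElementTree` in place of lxml; it is an illustration, not how either library is actually implemented.

```python
import xml.etree.ElementTree as ET


def iter_text_manually(node):
    """Yield text fragments in document order (hypothetical helper)."""
    # Text between the opening tag and the first child.
    if node.text:
        yield node.text
    for child in node:
        # Everything inside the child element, recursively.
        yield from iter_text_manually(child)
        # Text between the child's closing tag and the next sibling.
        if child.tail:
            yield child.tail


node = ET.fromstring(
    "<content>Text outside tag <div>Text <em>inside</em> tag</div></content>"
)

manual = "".join(iter_text_manually(node))
builtin = "".join(node.itertext())

print(manual)
print(manual == builtin)
```

In practice there is no reason to write this yourself; itertext() already covers it.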

Good example
from lxml import etree  # assumed import; the original snippet omits it

node = etree.fromstring("""
<content>
    Text outside tag <div>Text <em>inside</em> tag</div>
</content>
""")

print(''.join(node.itertext()))

>>      Text outside tag Text inside tag 

Bonus

To account for whitespace at both ends of the result, it is safer to write the following.

Good example 2
print(''.join(node.itertext()).strip())

>> Text outside tag Text inside tag
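strip() only trims the two ends. If the source XML is indented across several lines, newlines and indentation remain inside the extracted string. One possible extension, not part of the original approach, is to collapse every run of whitespace into a single space with `split()` and `join()`. The sketch below assumes the standard library's `xml.etree.ElementTree` in place of lxml, and a slightly modified input in which `<div>` sits on its own line.

```python
import xml.etree.ElementTree as ET

# Same content as above, but with <div> on its own indented line.
node = ET.fromstring("""
<content>
    Text outside tag
    <div>Text <em>inside</em> tag</div>
</content>
""")

raw = "".join(node.itertext())

# split() with no arguments splits on any whitespace run and drops
# leading/trailing whitespace, so join() yields single-spaced text.
normalized = " ".join(raw.split())

print(normalized)
```

This is lossy: if whitespace inside the text is significant (for example in preformatted content), stick with strip().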
