
Stripping XML tags in Python

Python

Last updated at 2019-09-28, posted at 2018-08-29

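It was easy to find ways to delete whole elements, tag and contents together,
but it took me a while to find a way to remove only the tags and keep the text inside them, so here is a note.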

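Stripping tags with regular expressions looked tedious (line breaks and so on), and I figured an XML parser would surely cover this, so I dropped that idea.
After searching the documentation, the following approach looks like a good one.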

strip_tags.py

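import xml.etree.ElementTree as ET

# Parse the XML file and get the root element
tree = ET.parse('data.xml')
root = tree.getroot()

# method='text' emits only the text content; encoding='unicode' returns a str
# and keeps non-ASCII characters intact, so no decode step is needed
notags = ET.tostring(root, encoding='unicode', method='text')

with open('notags.txt', 'w', encoding='utf-8') as f:
    f.write(notags)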
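
One detail worth checking when the XML contains non-ASCII text: without an explicit `encoding`, `ET.tostring` returns ASCII bytes, and in CPython's implementation characters outside ASCII come back as numeric character references rather than as the characters themselves. The following is a minimal sketch of that behavior, assuming Python 3; the `greeting` element is a made-up example, not part of data.xml.

```python
import xml.etree.ElementTree as ET

# Hypothetical one-element document containing a non-ASCII character
root = ET.fromstring("<greeting>あ<b>world</b></greeting>")

# Default encoding is us-ascii: the result is bytes, and the non-ASCII
# character is written as a numeric character reference
as_bytes = ET.tostring(root, method="text")

# encoding="unicode" returns a str with the character preserved
as_str = ET.tostring(root, encoding="unicode", method="text")

# encoding="utf-8" returns bytes that decode back to the same str
as_utf8 = ET.tostring(root, encoding="utf-8", method="text").decode("utf-8")

print(as_bytes)
print(as_str)
print(as_utf8)
```

If the output is going straight into a text file, asking for `encoding="unicode"` and opening the file with an explicit encoding is probably the simplest combination.
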

Example run

data.xml

<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>

notags.txt

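Line breaks and spaces are preserved as-is.
Apply further conversion afterwards as needed.
The script passes root, so it returns the text of the whole document with the tags removed;
passing an element other than root returns only the text inside that element.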
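
The two points above (cleaning up the leftover whitespace, and targeting one element instead of root) can be sketched as follows. This is only an illustration: the helper names `text_only` and `collapse_whitespace` are made up for this example, the XML is a shortened inline version of data.xml, and collapsing every run of whitespace into a single space is just one possible "further conversion".

```python
import xml.etree.ElementTree as ET

# Shortened inline version of data.xml, so the example is self-contained
xml_source = """<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
    </country>
</data>"""

root = ET.fromstring(xml_source)


def text_only(element):
    """Return the element's text with tags removed, whitespace untouched."""
    return ET.tostring(element, encoding="unicode", method="text")


def collapse_whitespace(text):
    """Join whitespace-separated tokens with single spaces."""
    return " ".join(text.split())


# Whole document: pass root
whole_document = collapse_whitespace(text_only(root))

# One element only: find() accepts a small XPath subset with attribute predicates
singapore = root.find("./country[@name='Singapore']")
singapore_only = collapse_whitespace(text_only(singapore))

print(whole_document)   # 1 2008 4 2011
print(singapore_only)   # 4 2011
```

Note that `tostring` with `method="text"` also includes the whitespace that follows the element's closing tag (its tail), which is one more reason to normalize the result afterwards.
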

As expected, the standard library can do this,
but the reference gives no details, so it took a while to find.

References
https://docs.python.jp/3/library/xml.etree.elementtree.html
The entry for xml.etree.ElementTree.tostring says:

method is one of "xml", "html", or "text" (the default is "xml").

but it does not explain what each setting actually does.
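
To see the difference in practice, the three values can be tried side by side. This is a small sketch based on observed behavior in CPython rather than on anything the reference spells out, and the `<p>` fragment is a made-up input.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment with an empty element, to make the differences visible
root = ET.fromstring("<p>Hello<br/><b>world</b></p>")

as_xml = ET.tostring(root, encoding="unicode", method="xml")
as_html = ET.tostring(root, encoding="unicode", method="html")
as_text = ET.tostring(root, encoding="unicode", method="text")

print(as_xml)   # tags kept, empty element written in self-closing form
print(as_html)  # tags kept, HTML void elements such as br get no closing form
print(as_text)  # tags dropped, only the text remains
```

In short, "xml" and "html" differ mainly in how they write empty elements, while "text" is the one that removes the tags.
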
