
Parsing broken XML with lxml

Python

Last updated at 2013-03-15, posted at 2013-03-15

lxml parses XML using libxml2, whereas BeautifulSoup parses it using regular expressions, which lets it handle broken XML like the example below.

For speed you would rather use lxml, but you may still need BeautifulSoup's tolerance. For situations where you are unsure which of the two to use, lxml provides the interface shown below.

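Suppose the input is the following broken XML, stored in a file named hoge. It is broken because it has two root elements.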

<piyo>bar</piyo>
<piyo>hoge</piyo>

Result

python

In [1]: from lxml import etree
In [2]: with open('hoge') as f:
   ...:     xml=etree.fromstring(f.read())
   ...:
  File "<string>", line unknown
XMLSyntaxError: Extra content at the end of the document, line 2, column 1

python

In [3]: from lxml.html.soupparser import fromstring

In [4]: with open('hoge') as f:
   ...:     xml=fromstring(f.read())
   ...:

In [5]: for piyo in xml.findall('piyo'): print(piyo.text.strip())
bar
hoge
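
The two sessions above can be combined into a single helper that tries the fast strict parser first and only falls back to the lenient one when parsing fails. The following is a minimal sketch, not something lxml ships: the name `parse_leniently` and its `fallback` parameter are made up for illustration, and the default fallback assumes BeautifulSoup is installed.

```python
from lxml import etree


def parse_leniently(text, fallback=None):
    """Parse with the strict libxml2-based parser, falling back on a syntax error.

    `fallback` is a hypothetical hook: any callable that takes the text and
    returns an element. By default lxml.html.soupparser.fromstring is used.
    """
    try:
        # Fast path: well-formed XML never touches the fallback.
        return etree.fromstring(text)
    except etree.XMLSyntaxError:
        if fallback is None:
            # Imported lazily so BeautifulSoup is only required for broken input.
            from lxml.html.soupparser import fromstring as fallback
        return fallback(text)
```

With this, `parse_leniently(open('hoge').read())` would behave like the second session for the broken file, while well-formed input is handled by libxml2 alone. Keeping the fallback as a parameter also makes it easy to swap in another recovery strategy.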

