
[Python] Running lxml.etree.iterparse on gzipped XML

Posted at 2019-06-24

This is nothing major, but I couldn't find an article about it in Japanese, so I'm writing it down just in case.

Problem

lxml.etree.parse can read a ***.xml.gz file directly, but lxml.etree.iterparse cannot.

parse.py
from lxml import etree
path = 'dataset/sample/sample.xml.gz'
tree = etree.parse(path)  # parse reads the gzipped file directly
print(tree.find('.//PMID').text)  # './/PMID' finds the first PMID element below the root

>> 1

iterparse.py
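from lxml import etree
path = 'dataset/sample/sample.xml.gz'
tree = etree.iterparse(path)  # same .gz path, passed directly to iterparse
print(list(tree)[0][1].text)  # text of the element in the first (event, element) pair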

>> lxml.etree.XMLSyntaxError: Document is empty, line 1, column 1

Incidentally, if path points to the decompressed sample.xml instead, iterparse also prints 1 (as expected).
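One way to see where the error might come from is to look at the first bytes of a gzipped document. The sketch below uses only the standard library and a made-up sample document (not the real sample.xml.gz). It shows that the compressed data starts with the gzip magic number rather than with `<`, which would be consistent with an XML parser that receives the raw bytes reporting a syntax error at line 1, column 1. That iterparse passes the raw bytes through unchanged is my assumption here, not something I have checked in the lxml source.

```python
import gzip

xml_bytes = b"<root><PMID>1</PMID></root>"  # made-up sample document
gz_bytes = gzip.compress(xml_bytes)

# The raw .gz content starts with the gzip magic number 0x1f 0x8b, not with '<'.
print(gz_bytes[:2])

# After decompression the original XML is back.
print(gzip.decompress(gz_bytes)[:6])
```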

Solution

I found this post:

https://mailman-mail5.webfaction.com/pipermail/lxml/20100103/013073.html

In short, parse handles a variety of file formats, including gzip-compressed files, whereas iterparse only handles very simple ones (probably).

So, use the gzip module.

iterparse.py
from lxml import etree
import gzip
path = 'dataset/sample/sample.xml.gz'
# Decompress with gzip and give iterparse a file object instead of a path
with gzip.GzipFile(path) as f:
    tree = etree.iterparse(f)
    print(list(tree)[0][1].text)

>> 1

It worked.
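If the same script has to handle both compressed and uncompressed files, the gzip step can be wrapped in a small helper. This is only a minimal sketch: `open_xml` and `first_text` are hypothetical names, not part of lxml, and the helper assumes that any file starting with the gzip magic bytes 0x1f 0x8b is gzip-compressed. So that it runs without lxml installed, the demo swaps in the standard library's xml.etree.ElementTree.iterparse and uses made-up sample data.

```python
import gzip
import os
import tempfile
import xml.etree.ElementTree as ET

GZIP_MAGIC = b"\x1f\x8b"  # first two bytes of a gzip stream


def open_xml(path):
    """Hypothetical helper: return a binary file object for path,
    decompressing transparently if the file looks gzip-compressed."""
    with open(path, "rb") as f:
        is_gzip = f.read(2) == GZIP_MAGIC
    return gzip.open(path, "rb") if is_gzip else open(path, "rb")


def first_text(path):
    """Return the text of the first element that iterparse finishes reading."""
    with open_xml(path) as f:
        for _event, elem in ET.iterparse(f):  # default: 'end' events only
            return elem.text


# Demo with made-up data: the same document, once plain and once gzipped.
xml_bytes = b"<root><PMID>1</PMID></root>"
tmpdir = tempfile.mkdtemp()
plain_path = os.path.join(tmpdir, "sample.xml")
gz_path = os.path.join(tmpdir, "sample.xml.gz")
with open(plain_path, "wb") as f:
    f.write(xml_bytes)
with gzip.open(gz_path, "wb") as f:
    f.write(xml_bytes)

print(first_text(plain_path))
print(first_text(gz_path))
```

With lxml, the file object returned by `open_xml(path)` should be usable in place of `gzip.GzipFile(path)`, i.e. `etree.iterparse(open_xml(path))`.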
