
Using XPath to slug it out with XBRL

Posted at 2017-12-19

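The XBRL files in annual securities reports are heavyweight XML documents, megabytes in size.
Fighting them head-on is too painful, so this post is about using XPath to pull out only the parts you need.

This article is an easy-to-follow introduction to XPath: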
https://qiita.com/rllllho/items/cb1187cec0fb17fc650a

Preparation

The sample target is this file:
https://www.mizuho-fg.co.jp/investors/financial/report/yuho_201603/data/xbrl_mhbk160627.zip

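Download it to ~/Downloads and unzip it.

You have to look up the namespace and tag names of the parts you want beforehand.
This step is manual.
Here we extract the elements in the jpcrp_cor namespace whose tag names contain BusinessResults.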

namespace

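In a tag name like <jpcrp_cor:OrdinaryIncomeSummaryOfBusinessResults ..., the part to the left of the colon is the namespace prefix.
The namespace has to be specified separately from the element name.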
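As a rough illustration of why the namespace is specified separately: in lxml, the prefix used in an XPath expression is resolved through the mapping you pass in, not through the prefix written in the document. The sketch below uses a tiny stand-in document with a made-up URI (`http://example.com/jpcrp_cor`) and a made-up value, and deliberately queries with a different prefix, `x`.

```python
import lxml.etree

# Tiny stand-in document (hypothetical URI and value, not real EDINET data).
SAMPLE = b"""<root xmlns:jpcrp_cor="http://example.com/jpcrp_cor">
  <jpcrp_cor:OrdinaryIncomeSummaryOfBusinessResults>100</jpcrp_cor:OrdinaryIncomeSummaryOfBusinessResults>
</root>"""

doc = lxml.etree.fromstring(SAMPLE)

# The prefix "x" exists only in this mapping; what has to match is the URI.
ns = {"x": "http://example.com/jpcrp_cor"}
elems = doc.xpath("//x:OrdinaryIncomeSummaryOfBusinessResults", namespaces=ns)

print([e.text for e in elems])
# lxml stores the tag in Clark notation: {namespace URI}local-name
print(elems[0].tag)
```

So the prefix is just a local label; the URI is what identifies the namespace.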

The namespaces are declared as attributes on the root element <xbrli:xbrl>.
For jpcrp_cor, it is this one:

xmlns:jpcrp_cor="http://disclosure.edinet-fsa.go.jp/taxonomy/jpcrp/2013-08-31/jpcrp_cor"
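If copying the URI by hand feels error-prone, one possible alternative is to read the declarations from the root element through lxml's `nsmap` attribute. This is only a sketch, not part of the original steps: the inline `SAMPLE` document is a cut-down stand-in for the real file, and the value `100` is made up.

```python
import lxml.etree

# Cut-down stand-in for the real instance document (the value is made up).
SAMPLE = b"""<xbrli:xbrl
    xmlns:xbrli="http://www.xbrl.org/2003/instance"
    xmlns:jpcrp_cor="http://disclosure.edinet-fsa.go.jp/taxonomy/jpcrp/2013-08-31/jpcrp_cor">
  <jpcrp_cor:OrdinaryIncomeSummaryOfBusinessResults>100</jpcrp_cor:OrdinaryIncomeSummaryOfBusinessResults>
</xbrli:xbrl>"""

root = lxml.etree.fromstring(SAMPLE)

# nsmap maps prefix -> URI for the declarations in scope on this element.
# A default namespace (no prefix) would appear under the key None, which
# xpath() cannot use, so drop it before reusing the mapping.
ns = {prefix: uri for prefix, uri in root.nsmap.items() if prefix is not None}

print(ns["jpcrp_cor"])
```

With a parsed file, the same mapping would come from `doc.getroot().nsmap`.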

Putting it together:

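# coding: utf-8

import os
import lxml.etree

# Path to the instance document inside the unzipped archive
path = os.path.expanduser('~/Downloads/xbrl_mhbk160627/PublicDoc/jpcrp030000-asr-001_E03532-000_2016-03-31_01_2016-06-27.xbrl')
doc = lxml.etree.parse(path)

# Prefix -> namespace URI, copied from the root element
ns = {
    "jpcrp_cor": "http://disclosure.edinet-fsa.go.jp/taxonomy/jpcrp/2013-08-31/jpcrp_cor",
}

# Match elements in the jpcrp_cor namespace whose tag name contains 'BusinessResults'
xp = "//jpcrp_cor:*[contains(local-name(), 'BusinessResults')]"
elems = doc.xpath(xp, namespaces=ns)

# Print the text content of each matched element
for x in elems:
    print(x.text)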
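The loop above prints only the text of each element, so you cannot tell which value belongs to which tag. A minimal sketch of one way to keep that information, assuming you also want the `contextRef` attribute that XBRL facts carry: the inline document below is a stand-in for the real file, and its values, context IDs, and the `SomethingElse` element are made up for illustration.

```python
import lxml.etree

NS_URI = "http://disclosure.edinet-fsa.go.jp/taxonomy/jpcrp/2013-08-31/jpcrp_cor"

# Stand-in for the real instance document (values and context IDs are made up).
SAMPLE = """<xbrli:xbrl
    xmlns:xbrli="http://www.xbrl.org/2003/instance"
    xmlns:jpcrp_cor="%s">
  <jpcrp_cor:OrdinaryIncomeSummaryOfBusinessResults contextRef="CurrentYearDuration">100</jpcrp_cor:OrdinaryIncomeSummaryOfBusinessResults>
  <jpcrp_cor:NetAssetsSummaryOfBusinessResults contextRef="CurrentYearInstant">200</jpcrp_cor:NetAssetsSummaryOfBusinessResults>
  <jpcrp_cor:SomethingElse>ignored</jpcrp_cor:SomethingElse>
</xbrli:xbrl>""" % NS_URI

doc = lxml.etree.fromstring(SAMPLE.encode("utf-8"))

ns = {"jpcrp_cor": NS_URI}
xp = "//jpcrp_cor:*[contains(local-name(), 'BusinessResults')]"
elems = doc.xpath(xp, namespaces=ns)

# QName splits the Clark-notation tag into namespace and local name.
rows = [(lxml.etree.QName(e).localname, e.get("contextRef"), e.text) for e in elems]

for row in rows:
    print(row)
```

Each row is a (tag name, context ID, value) tuple, which is easier to load into a table than bare text.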