Parsing XML with Ruby's rexml, Part 2

Last updated at 2022-03-25, posted at 2022-03-24

Introduction

I am porting code (from Python 3.7 to Ruby 2.7).

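Overview

In the previous article this looked easy, but in practice it turned out to be a difficult operation.
Because the class has to handle many kinds of XML, the XPath is not known in advance, so I had to scan the tags and parse according to the hierarchy I found.

How to do it in Ruby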

sample

<?xml version="1.0" encoding="UTF-8"?>
<xs:schema targetNamespace="http://regis-web.systemsbiology.net/pepXML" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns="http://regis-web.systemsbiology.net/pepXML" xmlns:pepx="http://regis-web.systemsbiology.net/pepXML" elementFormDefault="qualified">
	<xs:annotation>
	</xs:annotation>
	<xs:element name="msms_pipeline_analysis">
		<xs:complexType>
			<xs:sequence>
				<xs:element name="analysis_summary" minOccurs="0" maxOccurs="unbounded">
				</xs:element>
				<xs:element name="dataset_derivation" minOccurs="0">
					<xs:annotation>
					</xs:annotation>
					<xs:complexType>
						<xs:sequence>
						</xs:sequence>
						<xs:attribute name="generation_no" type="xs:nonNegativeInteger" use="required">
						</xs:attribute>
					</xs:complexType>
				</xs:element>
			</xs:sequence>
		</xs:complexType>
	</xs:element>
</xs:schema>

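Given an xsd like this, the processing flow is:

1. Find an xs:attribute tag.
2. Check that the value of its type attribute is one of ['int', 'long', 'nonNegativeInteger', 'positiveInt', 'integer', 'unsignedInt'].
3. If the parent tag is element, take the value of its name attribute.
4. If the parent tag is complexType, take the name attribute of the element tag one level further up.
5. Store the pair in an array (as a result, the data set below is stored).
6. Repeat until there are no more xs:attribute tags.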

["dataset_derivation", "generation_no"]
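
The steps above can be sketched without knowing any XPath in advance by walking the DOM recursively. This is a minimal sketch, not the code used in this article: the helper name `integer_attribute_pairs`, the constant `INTEGER_TYPES`, and the trimmed sample schema are assumptions made for illustration, and it uses `each_recursive` rather than the SAX2 or XPath approaches shown later.

```ruby
require 'rexml/document'

# Type names taken from the list in the steps above.
INTEGER_TYPES = %w[int long nonNegativeInteger positiveInt integer unsignedInt].freeze

# Hypothetical helper: returns [owner element name, attribute name] pairs.
def integer_attribute_pairs(xsd_text)
  doc = REXML::Document.new(xsd_text)
  pairs = []
  doc.root.each_recursive do |el|
    # Step 1: only xs:attribute tags (name is the local name, without prefix).
    next unless el.name == 'attribute'
    # Step 2: compare the type with its prefix ("xs:") removed.
    type = el.attributes['type'].to_s.split(':').last
    next unless INTEGER_TYPES.include?(type)
    # Steps 3 and 4: the owner is the parent, or the grandparent
    # when the parent is a complexType.
    owner = el.parent
    owner = owner.parent if owner.name == 'complexType'
    next unless owner.name == 'element'
    # Step 5: store the pair.
    pairs << [owner.attributes['name'], el.attributes['name']]
  end
  pairs
end

# Trimmed version of the sample schema (assumption: same structure).
SAMPLE_XSD = <<~XSD
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="msms_pipeline_analysis">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="dataset_derivation" minOccurs="0">
            <xs:complexType>
              <xs:attribute name="generation_no" type="xs:nonNegativeInteger" use="required"/>
            </xs:complexType>
          </xs:element>
        </xs:sequence>
      </xs:complexType>
    </xs:element>
  </xs:schema>
XSD

p integer_attribute_pairs(SAMPLE_XSD)
# [["dataset_derivation", "generation_no"]]
```

Because the walk visits every element once, no path string has to be built or matched again afterwards.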

sax2

require 'rexml/parsers/sax2parser'
require 'rexml/sax2listener'
require 'rexml/document'
require 'set'

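# Reopen REXML's SAX2Parser to expose its internal tag stack.
module REXML
  module Parsers
    class SAX2Parser
      # Return a copy of the stack of currently open tag names.
      def get_tag_stack
        @tag_stack.dup
      end
    end
  end
end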

xml = (<<XML)
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema targetNamespace="http://regis-web.systemsbiology.net/pepXML" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns="http://regis-web.systemsbiology.net/pepXML" xmlns:pepx="http://regis-web.systemsbiology.net/pepXML" elementFormDefault="qualified">
	<xs:annotation>
	</xs:annotation>
	<xs:element name="msms_pipeline_analysis">
		<xs:complexType>
			<xs:sequence>
				<xs:element name="analysis_summary" minOccurs="0" maxOccurs="unbounded">
				</xs:element>
				<xs:element name="dataset_derivation" minOccurs="0">
					<xs:annotation>
					</xs:annotation>
					<xs:complexType>
						<xs:sequence>
						</xs:sequence>
						<xs:attribute name="generation_no" type="xs:nonNegativeInteger" use="required">
						</xs:attribute>
					</xs:complexType>
				</xs:element>
			</xs:sequence>
		</xs:complexType>
	</xs:element>
</xs:schema>
XML

parser = REXML::Parsers::SAX2Parser.new(xml)
doc = REXML::Document.new(xml)

# Record the tag path and attributes of every xs:attribute start tag.
elements = []
parser.listen(:start_element, ["xs:attribute"]) { |uri, localname, qname, attrs|
  elements << [parser.get_tag_stack, attrs]
}
parser.parse

int_types = ['int', 'long', 'nonNegativeInteger', 'positiveInt', 'integer', 'unsignedInt']
dataset = Set.new
elements.each do |tag_stack, attrs|
  # to_s guards against xs:attribute tags that have no type attribute.
  next unless int_types.any? { attrs["type"].to_s.include?(_1) }
  # Join the tag stack into an XPath and look the same nodes up in the DOM.
  REXML::XPath.match(doc, tag_stack.join('/')).each do |xp|
    case xp.parent.local_name
    when 'element'
      dataset << [xp.parent.attribute('name').value, xp.attribute('name').value]
    when 'complexType'
      if xp.parent.parent.local_name == 'element'
        dataset << [xp.parent.parent.attribute('name').value, xp.attribute('name').value]
      end
    end
  end
end
p dataset

# => #<Set: {["dataset_derivation", "generation_no"]}>

This feels extremely brute-force.

sax2parser

module REXML
  module Parsers
    class SAX2Parser
      def get_tag_stack
        @tag_stack.dup
      end
    end
  end
end

The @tag_stack instance variable in SAX2Parser.rb holds the tree information (the stack of currently open tags), so I reopen the class to read it.
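
As a possible alternative to reading the private @tag_stack, the stack of open tags can be tracked from outside the parser with a start_element and an end_element listener. The sketch below is an assumption-laden variant, not the code from this article: the helper name `integer_attributes_sax`, the `open_tags` array, and the trimmed sample schema are made up for illustration. It also keeps each tag's attributes on the stack, so the owner's name can be read without a second DOM lookup.

```ruby
require 'rexml/parsers/sax2parser'

INTEGER_TYPE_NAMES = %w[int long nonNegativeInteger positiveInt integer unsignedInt].freeze

# Hypothetical helper: SAX-only version that keeps its own stack of open tags.
def integer_attributes_sax(xml)
  parser = REXML::Parsers::SAX2Parser.new(xml)
  open_tags = []  # one [local name, attributes hash] entry per open element
  found = []

  parser.listen(:start_element) do |_uri, localname, _qname, attrs|
    if localname == 'attribute'
      type = attrs['type'].to_s.split(':').last
      if INTEGER_TYPE_NAMES.include?(type)
        # The owner is the parent tag, or the grandparent under a complexType.
        owner = open_tags[-1]
        owner = open_tags[-2] if owner && owner[0] == 'complexType'
        found << [owner[1]['name'], attrs['name']] if owner && owner[0] == 'element'
      end
    end
    open_tags.push([localname, attrs])
  end

  # Pop on every end tag so the stack always mirrors the open elements.
  parser.listen(:end_element) { |_uri, _localname, _qname| open_tags.pop }

  parser.parse
  found
end

# Trimmed version of the sample schema (assumption: same structure).
SAX_SAMPLE = <<~XSD
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="msms_pipeline_analysis">
      <xs:complexType>
        <xs:sequence>
          <xs:element name="dataset_derivation" minOccurs="0">
            <xs:complexType>
              <xs:attribute name="generation_no" type="xs:nonNegativeInteger" use="required"/>
            </xs:complexType>
          </xs:element>
        </xs:sequence>
      </xs:complexType>
    </xs:element>
  </xs:schema>
XSD

p integer_attributes_sax(SAX_SAMPLE)
```

This avoids depending on an instance variable that is an implementation detail of rexml, at the cost of a few more lines.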

parser

parser.listen(:start_element, ["xs:attribute"]){|uri, localname, qname, attrs|
  elements << [parser.get_tag_stack, attrs]
}

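It is not obvious at first glance, but a block is being passed to parser.listen().

There is very little information available about SAX, which makes this hard work.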
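
Besides a block, parser.listen() also accepts a listener object. The following is a small sketch under that assumption, using the REXML::SAX2Listener module for the default no-op callbacks; the class name `AttributeCollector` and the tiny schema are hypothetical and only serve the example.

```ruby
require 'rexml/parsers/sax2parser'
require 'rexml/sax2listener'

# Hypothetical listener: collects the name of every xs:attribute tag.
class AttributeCollector
  # Provides empty default implementations of all SAX2 callbacks.
  include REXML::SAX2Listener

  attr_reader :names

  def initialize
    @names = []
  end

  # Called once per start tag with the same four arguments as the block form.
  def start_element(uri, localname, qname, attributes)
    @names << attributes['name'] if qname == 'xs:attribute'
  end
end

listener_xml = <<~XML
  <xs:schema xmlns:xs="http://www.w3.org/2001/XMLSchema">
    <xs:element name="a">
      <xs:complexType>
        <xs:attribute name="x" type="xs:int"/>
        <xs:attribute name="y" type="xs:string"/>
      </xs:complexType>
    </xs:element>
  </xs:schema>
XML

collector = AttributeCollector.new
parser = REXML::Parsers::SAX2Parser.new(listener_xml)
parser.listen(collector)  # object form instead of a block
parser.parse
p collector.names
# ["x", "y"]
```

The object form can be easier to read when several events need to share state, since the state lives in instance variables rather than in closures.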

Addendum (xpath version)

In production, the sax2 version above left elements empty.
So, assuming the XML hierarchy would never be very deep, I switched to an xpath version.

xpath

require 'rexml/document'
require 'set'

xml = (<<XML)
<?xml version="1.0" encoding="UTF-8"?>
<xs:schema targetNamespace="http://regis-web.systemsbiology.net/pepXML" xmlns:xs="http://www.w3.org/2001/XMLSchema" xmlns="http://regis-web.systemsbiology.net/pepXML" xmlns:pepx="http://regis-web.systemsbiology.net/pepXML" elementFormDefault="qualified">
	<xs:annotation>
	</xs:annotation>
	<xs:element name="msms_pipeline_analysis">
		<xs:complexType>
			<xs:sequence>
				<xs:element name="analysis_summary" minOccurs="0" maxOccurs="unbounded">
				</xs:element>
				<xs:element name="dataset_derivation" minOccurs="0">
					<xs:annotation>
					</xs:annotation>
					<xs:complexType>
						<xs:sequence>
						</xs:sequence>
						<xs:attribute name="generation_no" type="xs:nonNegativeInteger" use="required">
						</xs:attribute>
					</xs:complexType>
				</xs:element>
			</xs:sequence>
		</xs:complexType>
	</xs:element>
</xs:schema>
XML

doc = REXML::Document.new(xml)
int_types = ['int', 'long', 'nonNegativeInteger', 'positiveInt', 'integer', 'unsignedInt']
trees = Set.new    # paths to expand in the current pass
attrees = Set.new  # paths of every xs:attribute tag seen so far
ret = []

# Seed the search with the top-level element (xs:schema).
doc.children.each do |ch|
  trees << ch.expanded_name if ch.respond_to?(:expanded_name)
end

# Go one level deeper per pass until no new paths are found.
until trees.empty?
  adtrees = Set.new
  trees.each do |tree|
    doc.elements.each(tree) do |chs|
      chs.children.each do |ch|
        next unless ch.respond_to?(:expanded_name)
        path = tree + '/' + ch.expanded_name
        adtrees << path
        next unless ch.name == 'attribute'
        attrees << path
        # to_s guards against xs:attribute tags that have no type attribute.
        next unless int_types.include?(ch.attributes['type'].to_s.sub(ch.prefix + ':', ''))
        case ch.parent.local_name
        when 'element'
          ret << [ch.parent.attribute('name').value, ch.attribute('name').value]
        when 'complexType'
          if ch.parent.parent.local_name == 'element'
            ret << [ch.parent.parent.attribute('name').value, ch.attribute('name').value]
          end
        end
      end
    end
  end
  trees = adtrees
end
p ret

# output
[["dataset_derivation", "generation_no"]]

This one also feels very brute-force.

element

		doc.elements.each(tree) do |chs|

Unless each is used here, only the first matching element is picked up.
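
The difference can be seen in a small sketch. The document below is a made-up example (the tag names root and item are assumptions), comparing doc.elements[path], which returns only the first match, with doc.elements.each(path), which visits every match.

```ruby
require 'rexml/document'

doc = REXML::Document.new(<<~XML)
  <root>
    <item id="a"/>
    <item id="b"/>
    <item id="c"/>
  </root>
XML

# elements[path] returns only the first element that matches the path.
first_only = doc.elements['root/item']

# elements.each(path) yields every element that matches the path.
all_ids = []
doc.elements.each('root/item') { |item| all_ids << item.attributes['id'] }

p first_only.attributes['id']
# "a"
p all_ids
# ["a", "b", "c"]
```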

Notes

I learned how to use Ruby's rexml.
On a journey of a hundred ri, ninety ri is only the halfway point.
