@tomoya0130 posted at 2023-02-07

Cannot parse sitemap.xml properly in Python

Q&A

Python XML parsing sitemap.xml

What I want to solve

I want to extract only the URLs from sitemap.xml. I wrote the code below based on the following article, but the extraction does not work.
https://pg-chain.com/python-xml-elementtree

Problem / error encountered

- Nothing is returned.

Relevant source code

Code written with reference to the following article:
https://pg-chain.com/python-xml-elementtree

import xml.etree.ElementTree as ET
    
tree = ET.parse('〇〇.xml')
root = tree.getroot()
for loc in root.iter("loc"):
  print(loc.text)

Example contents of sitemap.xml

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns:xsi="http://www.w3.org/sample" xmlns="http://www.sitemaps.org/sample" xmlns:image="http://www.google.com/sample/1.1" xmlns:video="http://www.google.com/sample/1.1" xmlns:geo="http://www.google.com/sample/1.0" xmlns:news="http://www.google.com/sample/0.9" xmlns:mobile="http://www.google.com/sample/1.0" xmlns:pagemap="http://www.google.com/sample/1.0" xmlns:xhtml="http://www.w3.org/sample" xsi:schemaLocation="http://www.sitemaps.org/sample">
  <url>
    <loc>https://sample.jp</loc>
    <lastmod>2023-01-25T00:57:00+09:00</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://sample2.jp</loc>
    <lastmod>2023-01-25T00:57:00+09:00</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>

What I tried

<sample>
  <url>
    <loc>https://sample.jp</loc>
    <lastmod>2023-01-25T00:57:00+09:00</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://sample2.jp</loc>
    <lastmod>2023-01-25T00:57:00+09:00</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
</sample>

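- When I removed the <?xml version="1.0" encoding="UTF-8"?> declaration and the <urlset> part as shown above and ran the code, the extraction worked.
→ Is <loc> not being matched properly because of those tags?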

What should I specify to extract the URLs in the loc tags? I would appreciate any guidance m(__)m

0 likes

3 Answers

@yamayoshi7 posted at 2023-02-07

This looks namespace-related.
For example, see here:
https://banatech.net/blog/view/19
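
A quick way to confirm the namespace theory, as a minimal sketch: ElementTree stores a namespaced tag as `{namespace-uri}localname`, so a bare `"loc"` matches nothing. The inline XML is a cut-down copy of the sample in the question, and the variable names are only illustrative.

```python
import xml.etree.ElementTree as ET

# Cut-down copy of the sample sitemap from the question (inlined for illustration).
SITEMAP = """<urlset xmlns="http://www.sitemaps.org/sample">
  <url>
    <loc>https://sample.jp</loc>
  </url>
</urlset>"""

root = ET.fromstring(SITEMAP)

# The default xmlns becomes part of every tag name.
root_tag = root.tag
# No element is named plain "loc", so this search finds nothing.
bare_matches = list(root.iter("loc"))
# Listing the real tag names shows what iter() has to match against.
all_tags = [el.tag for el in root.iter()]

print(root_tag)
print(len(bare_matches))
print(all_tags)
```

Printing `root.tag` on the real file shows the exact namespace URI that has to be used when searching.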

0Like

@yamayoshi7 posted at 2023-02-07

This one might be easier to understand.
https://orangain.hatenablog.com/entry/namespaces-in-xpath

0Like

Comments

@tomoya0130
Questioner
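Thank you! After some trial and error with the reference articles, I was able to extract values with the code below.
However, the lastmod and changefreq tags are extracted as well. Could you advise how to narrow this down to the loc tags only?
Note: I thought `for child in url.iter("loc"):` would work, but it did not.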

### Code
```
import xml.etree.ElementTree as ET
tree = ET.parse("〇〇.xml")
ET.register_namespace('', 'http://www.sitemaps.org/schemas/sitemap/0.9')
root = tree.getroot()
for url in root:
    for child in url.iter():
        print(child.text)
```

### Extracted output
```
https://sample.jp
2023-01-25T00:57:00+09:00
daily
0.8

https://sample2.jp
2023-01-25T00:57:00+09:00
daily
0.8

https://sample3.jp
2023-01-25T00:57:00+09:00
daily
0.8

```
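
A note on narrowing the output to `loc`: `url.iter("loc")` most likely fails for the same namespace reason, because the element's real tag carries the namespace URI. `ET.register_namespace` only influences how prefixes are written when serializing, not how elements are found. The following is a minimal sketch, assuming the namespace URI matches the `xmlns` value of the actual file (the sample in the question declares `http://www.sitemaps.org/sample`, while real sitemaps normally declare `http://www.sitemaps.org/schemas/sitemap/0.9`). The inline XML and the name `NS` are illustrative.

```python
import xml.etree.ElementTree as ET

# Assumption: this URI must be identical to the xmlns value in the actual file.
NS = "http://www.sitemaps.org/sample"

SITEMAP = f"""<urlset xmlns="{NS}">
  <url>
    <loc>https://sample.jp</loc>
    <lastmod>2023-01-25T00:57:00+09:00</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
  <url>
    <loc>https://sample2.jp</loc>
    <lastmod>2023-01-25T00:57:00+09:00</lastmod>
    <changefreq>daily</changefreq>
    <priority>0.8</priority>
  </url>
</urlset>"""

root = ET.fromstring(SITEMAP)

# Qualify the tag as {namespace-uri}loc so that iter() can match it.
urls = [loc.text for url in root for loc in url.iter(f"{{{NS}}}loc")]

for u in urls:
    print(u)
```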
@yamayoshi7
Maybe this one.
https://stackoverflow.com/questions/74029268/parse-xml-with-specific-declaration
Stack Overflow is the best.

@STSynthe posted at 2023-02-08

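The namespace is not specified, and instead of iter it is better to use findall, which accepts XPath expressions.

As the manual also says:

For XML with namespaces, use the usual qualified {namespace}tag notation:

As stated there, in XPath you specify the element either as {namespace}tag or as prefix:tag (strictly speaking, the prefix is a key of the namespaces dict you pass in). findall selects every element that matches the XPath. If the XPath uses .// to select all loc tags among all descendants, the desired elements are returned as a list.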

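from xml.etree import ElementTree

# parse() returns an ElementTree; findall() on it searches from the root element
tree = ElementTree.parse('./sitemap.xml')
# Map a prefix of your choice to the default namespace URI declared in the file
namespaces = {'urlset': 'http://www.sitemaps.org/sample'}

# './/' searches all descendants for loc elements in that namespace
list_loc = tree.findall('.//urlset:loc', namespaces)

for _loc in list_loc:
    print(_loc.text)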
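
For comparison, here is a sketch of two other spellings: the fully qualified `{namespace}tag` form from the manual quote, and the `{*}` wildcard, which as far as I know is available from Python 3.8 and matches the tag in any namespace. The inline XML stands in for `./sitemap.xml`, and the names `NS` and `xml_text` are illustrative.

```python
from xml.etree import ElementTree

# Assumption: must match the xmlns value declared in the actual file.
NS = "http://www.sitemaps.org/sample"

xml_text = f"""<urlset xmlns="{NS}">
  <url><loc>https://sample.jp</loc></url>
  <url><loc>https://sample2.jp</loc></url>
</urlset>"""

root = ElementTree.fromstring(xml_text)

# Fully qualified form: {namespace-uri}tag, no namespaces dict needed.
qualified = [e.text for e in root.findall(f".//{{{NS}}}loc")]

# Wildcard form (Python 3.8+): matches loc in any namespace, or in none.
wildcard = [e.text for e in root.findall(".//{*}loc")]

print(qualified)
print(wildcard)
```

The wildcard is convenient when the namespace URI varies between files, at the cost of also matching `loc` elements from unrelated namespaces.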

"What should I specify"

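Please build an understanding of XPath and read the official ElementTree XML API documentation carefully. XPath in particular has nearly the same basic form regardless of the language, and it is essential for more complex parsing.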
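
As a small illustration of what path expressions allow beyond `iter`, here is a sketch that pairs each `loc` with its `lastmod`. It assumes the sample namespace from the question; the prefix `sm` and the inline XML are arbitrary choices made here. ElementTree supports only a subset of XPath, so more complex expressions may need another library.

```python
import xml.etree.ElementTree as ET

# Assumption: the URI matches the xmlns of the file; "sm" is an arbitrary prefix.
namespaces = {"sm": "http://www.sitemaps.org/sample"}

SITEMAP = """<urlset xmlns="http://www.sitemaps.org/sample">
  <url>
    <loc>https://sample.jp</loc>
    <lastmod>2023-01-25T00:57:00+09:00</lastmod>
  </url>
  <url>
    <loc>https://sample2.jp</loc>
    <lastmod>2023-01-25T00:57:00+09:00</lastmod>
  </url>
</urlset>"""

root = ET.fromstring(SITEMAP)

pairs = []
# "sm:url" selects the direct url children of the root element.
for url in root.findall("sm:url", namespaces):
    # findtext() returns the text of the first match, or the default if absent.
    loc = url.findtext("sm:loc", default="", namespaces=namespaces)
    lastmod = url.findtext("sm:lastmod", default="", namespaces=namespaces)
    pairs.append((loc, lastmod))

for loc, lastmod in pairs:
    print(loc, lastmod)
```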

0Like
