
Python Advent Calendar 2022

[Python] Fetching a web page's meta information and writing it to a text file

Last updated at 2022-12-20, posted at 2022-12-19

Background

I wanted to extract just the values from a web page's meta information.

<meta name="description" content="hogeのコンテンツ">
<meta name="keywords" content="huga,hoge">
<title>hoge</title>

From the source above, I only want the parts "hogeのコンテンツ", "huga,hoge", and "hoge".

Preparation

I wanted to fetch several pages in one go, so I prepare a list of the target page URLs in something like a text file.

hoge.txt

https://www.hoge.com/huga/index.html
https://www.hoge.com/piyo/index.html
https://www.hoge.com/poyo/index.html
https://www.hoge.com/miyo/index.html
https://www.hoge.com/mayo/index.html
https://www.hoge.com/nigo/index.html

Implementation

Loading the libraries

from urllib.request import urlopen
from bs4 import BeautifulSoup
import codecs
import glob
import time
import requests

Reading the text file created above

with open('./hoge.txt') as f:

Reading one line at a time

for index, url in enumerate(f):
  url = url.rstrip('\n')

I want to create one file per URL and name each file <index>.txt, so I use enumerate.

Recording 404 responses

  res = requests.get(url)
  if res.status_code == 404:
   print(url + ';' + '404', file=codecs.open('hoge/'+ f'{index:04}' +'.txt', 'w', encoding='UTF-8'))
   continue
  time.sleep(1)

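If the response is a 404, 404 is recorded in the output.
The output is written to a file with codecs.open. As mentioned above, the file is named <index>.txt and is written under the hoge directory.
I want the index in the file name to be 4 digits, so it is written as f'{index:04}'.
Also, although it may be pointless, I want to leave some time between accesses, so the script sleeps for 1 second.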
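
The naming and writing step described above can be sketched roughly as follows. The helper names output_path and write_record are hypothetical, and the built-in open inside a with block is swapped in for codecs.open so that each file is closed right after it is written. This is only a sketch under those assumptions, not the code used in this article.

```python
from pathlib import Path


def output_path(out_dir, index):
    """Build a path such as hoge/0003.txt from an integer index."""
    # f'{index:04}' pads the number with zeros to a width of 4.
    return Path(out_dir) / f'{index:04}.txt'


def write_record(out_dir, index, line):
    """Write one line to its own file and close the file afterwards."""
    path = output_path(out_dir, index)
    # Create the output directory if it does not exist yet.
    path.parent.mkdir(parents=True, exist_ok=True)
    with open(path, 'w', encoding='UTF-8') as out:
        print(line, file=out)
    return path
```

For example, write_record('hoge', 3, url + ';404') would write to hoge/0003.txt.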

Fetching the HTML

html = urlopen(url).read()
soup = BeautifulSoup(html, features="html.parser")

BeautifulSoup is used to work with the HTML.

Extracting the meta information from the HTML

title = soup.find('title').text
des = soup.find('meta', attrs={'name': 'description'}).get('content')     
key = soup.find('meta', attrs={'name': 'keywords'}).get('content')
        
print(url + ';' + title + ';' + des + ';' + key, file=codecs.open('hoge/'+ f'{index:04}' +'.txt', 'w', encoding='UTF-8'))
time.sleep(2)

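The title is inside the title tag, so it can be selected with .find and extracted with .text.
description and keywords are values of the name attribute of meta tags, and what I want is the value of the content attribute,
so I specify name through attrs and extract the content value with .get.
Then the result is written to a file in the same way as the 404 case above.
As before, I want to leave some time between accesses, so a sleep is added here as well.
Everything up to this point is the body of the for loop.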
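
To try the extraction step on its own, here is a minimal sketch that runs against the HTML fragment from the Background section instead of a live page. The function name extract_meta is hypothetical, and returning an empty string for a missing tag is an assumption added here (soup.find returns None when nothing matches, so calling .text or .get directly would raise an error).

```python
from bs4 import BeautifulSoup


def extract_meta(html):
    """Return (title, description, keywords); missing items become ''."""
    soup = BeautifulSoup(html, features='html.parser')

    title_tag = soup.find('title')
    title = title_tag.text if title_tag is not None else ''

    def meta_content(name):
        # find() returns None when no matching tag exists,
        # so check before calling .get().
        tag = soup.find('meta', attrs={'name': name})
        return tag.get('content', '') if tag is not None else ''

    return title, meta_content('description'), meta_content('keywords')


sample = '''
<meta name="description" content="hogeのコンテンツ">
<meta name="keywords" content="huga,hoge">
<title>hoge</title>
'''
# Join the three values in the title;description;keywords order.
print(';'.join(extract_meta(sample)))
```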

Merging the files to view everything as a list

readFiles = glob.glob("hoge/*.txt")
sortReadFiles = sorted(readFiles)
with open("hoge/result.txt", "wb") as resultFile:
 for f in sortReadFiles:
  with open(f, "rb") as infile:
   resultFile.write(infile.read())

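Once every URL has been read and written to its own file, I merge all of the files so the results can be viewed as a list.
glob collects the names of the text files written to the hoge directory into a list, which is then sorted by name (the zero-padded names make this the same as numeric order).
Each text file in the list is opened in turn with open and its contents are written to a file named result.txt.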
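
The merge step can also be wrapped in a function, sketched below. The name merge_text_files and its parameters are hypothetical, and skipping the result file is an addition made here because result.txt also matches the *.txt pattern once it exists.

```python
import glob
import os


def merge_text_files(src_dir, result_name='result.txt'):
    """Concatenate src_dir/*.txt into src_dir/result_name, in name order."""
    result_path = os.path.join(src_dir, result_name)
    # Skip the result file itself, which also matches *.txt once it exists.
    parts = sorted(
        p for p in glob.glob(os.path.join(src_dir, '*.txt'))
        if os.path.basename(p) != result_name
    )
    with open(result_path, 'wb') as result_file:
        for part in parts:
            with open(part, 'rb') as infile:
                result_file.write(infile.read())
    return parts
```

Calling merge_text_files('hoge') would correspond to the merge shown above.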

Full implementation

from urllib.request import urlopen
from bs4 import BeautifulSoup
import codecs
import glob
import time
import requests

with open('./hoge.txt') as f:
 for index, url in enumerate(f):
  url = url.rstrip('\n')
  res = requests.get(url)
  if res.status_code == 404:
   print(url + ';' + '404', file=codecs.open('hoge/'+ f'{index:04}' +'.txt', 'w', encoding='UTF-8'))
   continue
  time.sleep(1)
  html = urlopen(url).read()
  soup = BeautifulSoup(html, features="html.parser")
  title = soup.find('title').text
  des = soup.find('meta', attrs={'name': 'description'}).get('content')     
  key = soup.find('meta', attrs={'name': 'keywords'}).get('content')
  
  # Output in the form URL;title;description;keywords
  print(url + ';' + title + ';' + des + ';' + key, file=codecs.open('hoge/'+ f'{index:04}' +'.txt', 'w', 
  encoding='UTF-8'))
  time.sleep(2)

readFiles = glob.glob("hoge/*.txt")
sortReadFiles = sorted(readFiles)
with open("hoge/result.txt", "wb") as resultFile:
 for f in sortReadFiles:
  with open(f, "rb") as infile:
   resultFile.write(infile.read())

Output

result.txt

https://www.hoge.com/huga/index.html;huga;hugaの情報;huga,hoge
https://www.hoge.com/piyo/index.html;piyo;piyoの情報;piyo,hoge
https://www.hoge.com/poyo/index.html;poyo;poyoの情報;poyo,hoge
https://www.hoge.com/miyo/index.html;miyo;miyoの情報;miyo,hoge
https://www.hoge.com/mayo/index.html;mayo;mayoの情報;mayo,hoge
https://www.hoge.com/nigo/index.html;nigo;nigoの情報;nigo,hoge

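Reflections

Accessing the server twice

The 404 check and the HTML fetch access the server twice, and I am worried about things like access restrictions, so I would appreciate hearing about a way to do this with a single request.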

Addendum:

soup = BeautifulSoup(res.content, features="html.parser")

I think it is better to delete html = urlopen(url).read() and use this instead.
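
As a rough sketch of that single-request flow: the function name title_from_response is hypothetical, and SimpleNamespace objects stand in for the responses that requests.get would return, so the example runs without network access. Only the status_code and content attributes are used.

```python
from types import SimpleNamespace

from bs4 import BeautifulSoup


def title_from_response(res):
    """Use a single response for both the 404 check and the parsing."""
    if res.status_code == 404:
        return '404'
    # res.content holds the raw body bytes of the same response,
    # so no second request (urlopen) is needed.
    soup = BeautifulSoup(res.content, features='html.parser')
    title_tag = soup.find('title')
    return title_tag.text if title_tag is not None else ''


# Stand-ins for response objects, so the sketch runs offline.
ok = SimpleNamespace(status_code=200, content=b'<title>hoge</title>')
missing = SimpleNamespace(status_code=404, content=b'')
print(title_from_response(ok))
print(title_from_response(missing))
```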

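Writing one text file per URL even though everything is merged at the end

If everything were written to a single text file from the start, the final for loop would not be needed.

Addendum: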

  with open('hoge/result.txt', "a", encoding='UTF-8', newline='\n') as f:     
   f.write(url + ';' + title + ';' + des + ';' + key + "\n")

I think it would be smarter to delete the final merge step and replace the file output with this (the same goes for the 404 part).
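
One possible variation, sketched under the assumption that all rows are collected first: open the result file once and write every line inside a single with block. The name write_results is hypothetical. Mode 'w' is used here so that each run starts from an empty file, whereas mode 'a' keeps adding to whatever the file already contains.

```python
def write_results(path, rows):
    """Write every row to one file, with the fields joined by ';'."""
    # Mode 'w' starts a fresh file on each run; mode 'a' would instead
    # keep appending to the existing contents.
    with open(path, 'w', encoding='UTF-8', newline='\n') as out:
        for row in rows:
            out.write(';'.join(row) + '\n')
```

A row for a normal page would be (url, title, des, key), and a row for a missing page would be (url, '404').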

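Variations in notation are not handled

When the notation varies, for example Description instead of description, the value cannot be retrieved, so I think handling for this case should be added.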

  if soup.find('meta', attrs={'name': 'description'}):
   des = soup.find('meta', attrs={'name':'description'}).get('content')
  else:
   des = soup.find('meta', attrs={'name':'Description'}).get('content')

I added a branch as a stopgap for now, but this needs a bit more thought...
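
One option that avoids a branch per spelling is to pass a compiled regular expression as the attribute filter, which BeautifulSoup accepts in attrs. The sketch below is one possible approach, not necessarily the best one; the helper name meta_content is hypothetical, and returning an empty string when no tag matches is an assumption added here.

```python
import re

from bs4 import BeautifulSoup


def meta_content(soup, name):
    """Return the content of <meta name=...>, ignoring case in the name value."""
    # A compiled regular expression can be used as an attribute filter;
    # re.IGNORECASE lets 'description' also match 'Description'.
    pattern = re.compile(rf'^{re.escape(name)}$', re.IGNORECASE)
    tag = soup.find('meta', attrs={'name': pattern})
    return tag.get('content', '') if tag is not None else ''
```

With this helper, des = meta_content(soup, 'description') covers description, Description, and DESCRIPTION alike.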

Timeout setting

Leaving a request waiting indefinitely is not good, so a timeout should be set on the requests.get call.
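
A minimal sketch of what that could look like. The function name fetch, the get parameter (there only so the function can be tried without network access), and the 10-second value are all assumptions made for illustration.

```python
import requests


def fetch(url, get=requests.get, timeout=10):
    """Return the response, or None if the request fails or times out."""
    try:
        # timeout is in seconds; without it requests may wait indefinitely.
        return get(url, timeout=timeout)
    except requests.exceptions.RequestException as exc:
        # Timeout and connection errors are subclasses of RequestException.
        print(f'{url};request failed: {exc}')
        return None
```

In the loop, res = fetch(url) followed by a check for None would let the script skip a URL that does not respond instead of stopping.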

Summary

This was my first time using Python, and I found it fun thanks to its rich set of libraries.
