
[Python] Scraping the HTML source of web pages while preserving their directory structure

Last updated at 2023-02-09, posted at 2023-01-06

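Background

I wanted to scrape the HTML source of several web pages in one go while keeping their directory structure intact.

For example, when fetching the source of https://www.hoge.com/huga/hoge/index.html and https://www.hoge.com/huga/index.html together, I want the files saved locally with the domain name as the top-level directory: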

└── www.hoge.com
     └── huga 
          ├── hoge ── index.html
          └── index.html

This is the layout I want to end up with.

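Preparation

The basic approach follows an existing reference (link not preserved).

Prepare the list of URLs you want to fetch, for example as a text file.
(I actually ran this against real URLs, but they are obscured here.)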

hoge.txt

https://www.hoge.com/huga/hoge/index.html
https://www.hoge.com/huga/index.html
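
As a small sketch of how such a list might be loaded, the helper below reads one URL per line. The name `read_urls`, and the choice to skip blank lines and lines starting with `#`, are my own assumptions and not part of the original script.

```python
from pathlib import Path


def read_urls(list_path):
    """Return the URLs in a text file, one per line.

    Hypothetical helper: blank lines and lines starting with '#'
    are skipped (an assumption; the original script does not do this).
    """
    urls = []
    for line in Path(list_path).read_text(encoding="utf-8").splitlines():
        line = line.strip()  # drops the newline and stray spaces
        if not line or line.startswith("#"):
            continue
        urls.append(line)
    return urls
```

Skipping blank lines matters in practice because a trailing empty line in the list would otherwise be passed to the HTTP client as an invalid URL.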

Implementation

Importing libraries

import os
import time
import requests

Note: if the site uses Basic authentication, also add

from requests.auth import HTTPBasicAuth

Open the text file created earlier

with open('./hoge.txt') as f:

Read it one line at a time

    for url in f:
        url = url.rstrip('\n')

Send the request

        res = requests.get(url=url,timeout=1.5)

Note: if the site uses Basic authentication

        res = requests.get(url=url,auth=HTTPBasicAuth("username","password"),timeout=1.5)

Convert the URL into a path

        url = url.replace('https://','./')
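
The string replacement above only covers `https://` URLs that end in a file name. As a hedged alternative, the sketch below uses the standard library's `urllib.parse.urlsplit` to build the same kind of path. The helper name `url_to_path`, the `root` parameter, the fallback to `index.html` for URLs ending in `/`, and dropping the query string are all assumptions of mine, not part of the original script.

```python
import os
from urllib.parse import urlsplit


def url_to_path(url, root="."):
    """Map a URL to a local path under `root`, with the host as top directory.

    Hypothetical helper. Assumptions: a URL with no path or a trailing '/'
    is saved as 'index.html', and the query string is ignored.
    """
    parts = urlsplit(url)
    path = parts.path or "/"
    if path.endswith("/"):
        path += "index.html"
    # Drop empty, '.' and '..' segments so the result stays below `root`.
    segments = [s for s in path.split("/") if s not in ("", ".", "..")]
    return os.path.join(root, parts.netloc, *segments)
```

Called with `https://www.hoge.com/huga/hoge/index.html`, this yields a path made of `.`, `www.hoge.com`, `huga`, `hoge` and `index.html`, joined with the platform's separator, which matches the layout described in the background section.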

Create the file

        if res.status_code == 404:
            filePath = os.path.dirname(url)
            if not os.path.exists(filePath):
                os.makedirs(filePath)

            # 'out' avoids shadowing the URL list handle 'f'
            with open(url,'w',encoding='UTF-8') as out:
                out.write('404')
            time.sleep(2)
            continue

        filePath = os.path.dirname(url)
        if not os.path.exists(filePath):
            os.makedirs(filePath)

        with open(url,'w',encoding='UTF-8') as out:
            out.write(res.text)
        time.sleep(2)

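The HTML file is created from the converted path.
Before creating it, os.path.exists checks whether the directory part of the path exists.
If it does, the HTML file is created right away; if not, the directory is created first and then the HTML file.
When the status code is 404, the file contains just the string 404.
After each file is written, the script waits 2 seconds to keep the load on the server low.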
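
The file-creation step described above could also be sketched as a single helper. This is a variation under my own assumptions, not the article's code: the name `save_page` is hypothetical, and it uses `os.makedirs(..., exist_ok=True)`, which folds the existence check into the call itself.

```python
import os


def save_page(path, status_code, text):
    """Write the page body to `path`, or the string '404' for a 404 response.

    Hypothetical helper; the directory part of `path` is created if missing.
    """
    directory = os.path.dirname(path)
    if directory:
        # exist_ok=True makes this a no-op when the directory already exists.
        os.makedirs(directory, exist_ok=True)
    body = "404" if status_code == 404 else text
    with open(path, "w", encoding="UTF-8") as out:
        out.write(body)
```

The `if directory:` guard is there because `os.path.dirname` returns an empty string for a bare file name, and `os.makedirs('')` raises an error.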

Full implementation

import os
import time
import requests

with open('./hoge.txt') as f:
    for url in f:
        url = url.rstrip('\n')
        res = requests.get(url=url,timeout=1.5)
        # Turn the URL into a relative path rooted at the domain name
        url = url.replace('https://','./')
        if res.status_code == 404:
            filePath = os.path.dirname(url)
            if not os.path.exists(filePath):
                os.makedirs(filePath)

            # 'out' avoids shadowing the URL list handle 'f'
            with open(url,'w',encoding='UTF-8') as out:
                out.write('404')
            time.sleep(2)
            continue

        filePath = os.path.dirname(url)
        if not os.path.exists(filePath):
            os.makedirs(filePath)

        with open(url,'w',encoding='UTF-8') as out:
            out.write(res.text)
        # Wait between requests to keep the load on the server low
        time.sleep(2)

Output

└── www.hoge.com
     └── huga 
          ├── hoge ── index.html
          └── index.html

It worked. (I actually ran this against real URLs, but they are obscured here.)

Reflections

The file output part is redundant

It would probably be cleaner as a function.

import os
import time
import requests

def mkfile(fileName, text):
    # Create the parent directory if needed, then write the text
    filePath = os.path.dirname(fileName)
    if not os.path.exists(filePath):
        os.makedirs(filePath)

    with open(fileName,'w',encoding='UTF-8') as f:
        f.write(text)

with open('./hoge.txt') as f:
    for url in f:
        url = url.rstrip('\n')
        res = requests.get(url=url,timeout=1.5)
        url = url.replace('https://','./')
        if res.status_code == 404:
            mkfile(url,'404')
            time.sleep(2)
            continue
        print(url)
        mkfile(url,res.text)
        time.sleep(2)

A 2-second sleep is more than needed

I think about 1 or 1.5 seconds would be appropriate.
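
One possible way to shorten the wait without requesting faster than intended, sketched here under my own assumptions: instead of always sleeping a fixed time after each request, measure how long it has been since the previous request and sleep only for the remainder. The class name `Throttle` and its `clock` and `sleep` parameters are hypothetical and exist mainly so the behaviour can be checked without real waiting.

```python
import time


class Throttle:
    """Keep at least `interval` seconds between consecutive calls to wait()."""

    def __init__(self, interval, clock=time.monotonic, sleep=time.sleep):
        self.interval = interval
        self._clock = clock      # injectable for testing (assumption)
        self._sleep = sleep      # injectable for testing (assumption)
        self._last = None        # time of the previous wait() call

    def wait(self):
        now = self._clock()
        if self._last is not None:
            remaining = self.interval - (now - self._last)
            if remaining > 0:
                # Sleep only for the part of the interval not yet elapsed.
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

In the loop, one would create `throttle = Throttle(1.5)` once and call `throttle.wait()` just before each `requests.get`, so time already spent downloading and writing the file counts toward the interval.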

Summary

This is easy to do once you have a list of URLs, so I recommend it.
