Extracting every link from a web page

Posted at 2021-06-17

Every so often I want to pull the link targets out of an HTML page, so I am writing the method down here before I forget it. My OS is Ubuntu 20.04 and my browser is Google Chrome.

Using BeautifulSoup

The first approach that comes to mind for extracting links is BeautifulSoup.

Installing BeautifulSoup

pip3 install beautifulsoup4 requests

Python 3 + BeautifulSoup sample

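import requests
from bs4 import BeautifulSoup

# Fetch the page and parse the HTML exactly as the server returned it
response = requests.get('https://qiita.com')
soup = BeautifulSoup(response.content, 'html.parser')
# Print the href of every <a> tag that actually has one
for link in soup.find_all('a', href=True):
    print(link['href'])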

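This approach prints the links correctly for static content, but it cannot pick them up for dynamic content. requests only fetches the HTML the server returns and never runs the page's scripts, so links that JavaScript adds after loading are not there to be found.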
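
The limitation can be reproduced without any network access. The sketch below is only an illustration: it swaps BeautifulSoup for the standard library's `html.parser` so that it runs with no extra packages, and `SAMPLE_HTML`, `LinkCollector`, and `extract_static_links` are hypothetical names made up for this example. The sample page has one link in its markup and one that a script would create in a browser.

```python
from html.parser import HTMLParser

# Hypothetical page: one link is in the markup, the other would only
# exist after a browser runs the script.
SAMPLE_HTML = """
<html><body>
<a href="/static-link">static</a>
<script>
  var a = document.createElement("a");
  a.href = "/dynamic-link";
  document.body.appendChild(a);
</script>
</body></html>
"""


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag present in the markup."""

    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            if href is not None:
                self.links.append(href)


def extract_static_links(html_text):
    """Return the hrefs a pure HTML parser can see (no script execution)."""
    parser = LinkCollector()
    parser.feed(html_text)
    return parser.links


if __name__ == "__main__":
    # Only the link written in the markup is reported.
    print(extract_static_links(SAMPLE_HTML))
```

Only `/static-link` comes back, because a parser reads the script as text and does not execute it. BeautifulSoup behaves the same way on a page like this.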

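Using Selenium

When the page you want to inspect is dynamic, use Selenium instead. The sample loads the page with Selenium, waits 5 seconds, and then extracts the links. Selenium drives Chrome behind the scenes (in headless mode, which shows no window), so Chrome must already be installed.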

Installing Selenium

sudo apt install python3-selenium

Python 3 + Selenium sample

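import time
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By

options = Options()
options.add_argument('--headless')  # run Chrome without opening a window
# Selenium 3 style constructor; newer Selenium 4 releases take a
# Service object instead of a driver path.
driver = webdriver.Chrome('/usr/bin/chromedriver', options=options)
try:
    driver.get('https://qiita.com')
    time.sleep(5)  # give the page's scripts time to add their links
    # find_elements(By.TAG_NAME, ...) works on both Selenium 3 and 4
    for link in driver.find_elements(By.TAG_NAME, 'a'):
        print(link.get_attribute('href'))
finally:
    driver.quit()  # always shut down the headless Chrome process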
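
The fixed 5-second wait is either too long or too short depending on the page. One possible alternative, sketched below under assumptions of my own, is to poll until the list of links stops changing between two consecutive checks. The function name `collect_links_when_stable` and the "unchanged twice in a row" rule are hypothetical choices for this example, not something Selenium prescribes. Selenium also ships its own `WebDriverWait` for waiting on conditions; the hand-rolled loop is used here only to keep the sketch free of imports beyond `time`. The string `"tag name"` is the value of Selenium's `By.TAG_NAME` constant.

```python
import time


def collect_links_when_stable(driver, timeout=5.0, interval=0.5):
    """Poll the page until two consecutive checks see the same non-empty
    list of hrefs, or until `timeout` seconds have passed.

    `driver` is expected to behave like a Selenium WebDriver, i.e. to offer
    find_elements(by, value) returning elements with get_attribute(name).
    """
    deadline = time.monotonic() + timeout
    previous = None
    while True:
        elements = driver.find_elements("tag name", "a")
        # Drop <a> tags without an href (get_attribute returns None for them)
        current = [h for h in (el.get_attribute("href") for el in elements) if h]
        stable = bool(current) and current == previous
        if stable or time.monotonic() >= deadline:
            return current
        previous = current
        time.sleep(interval)
```

In the Selenium sample, a call such as `collect_links_when_stable(driver)` would take the place of `time.sleep(5)` and the lookup that follows it. A page that keeps adding links for longer than `timeout` still returns a partial list, so the timeout needs tuning for each site.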

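Using JavaScript

Still, with dynamic content you often want to check the links at a moment of your own choosing while working in the browser. In that case, run the JavaScript below in the console of the browser's developer tools.

JavaScript sample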

for(let link of document.getElementsByTagName("a")){ console.log(link.href) }

With this approach you have to repeat the steps by hand on every page.

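Conclusion

So the two are worth using for different jobs: Selenium for collecting links from many pages, and JavaScript when a single page is enough. I will look into better methods at a later date.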
