
Extracting links from a web page with Beautiful Soup

Last updated at 2021-05-23. Posted at 2017-07-17.

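Reference page
"I wrote a script to extract links from a web page in Python" (PythonでWebページのリンクを抽出するスクリプトを書いた)
That example targets Python 2, so I rewrote it for Python 3.
I also changed it to avoid HTTP Error 403: Forbidden.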
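
Some servers answer 403 when the request carries a library's default User-Agent, so sending a browser-like one is a common workaround. The script in this article does that with requests. As a minimal sketch of the same idea using only the standard library (urllib.request is swapped in here for illustration, and `build_request` and `fetch` are hypothetical helper names, not part of the original script):

```python
import urllib.request

# Same browser-like User-Agent string that get_url.py sends.
USER_AGENT = "Mozilla/5.0 (X11; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0"


def build_request(url):
    """Hypothetical helper: build a request that carries the User-Agent header."""
    return urllib.request.Request(url, headers={"User-Agent": USER_AGENT})


def fetch(url, timeout=10):
    """Hypothetical helper: send the request and return the response body as bytes."""
    with urllib.request.urlopen(build_request(url), timeout=timeout) as resp:
        return resp.read()
```

Whether this is enough depends on the server. Sites that block on other criteria will still return 403.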

Execution result

$ ./get_url.py  https://ekzemplaro.org
en/ 	 
English
ekzemplaro/ 	 
言語とデータベースの接続プログラムサンプル集
audio_books/ 	 オーディオブック
librivox/ 	 LibriVox の勧め
./audio/ 	 Audio
http://www.hi-ho.ne.jp/linux 	 オープンソース開発
./raspberry/ 	 Raspberry Pi
./storytelling/ 	 ストーリーテリング
./crowdsourcing/ 	 クラウドソーシング
https://twitter.com/ekzemplaro 	 私のツイッター
https://github.com/ekzemplaro/ 	 GitHub
qiita/ 	 Qiita
./test_dir/ 	 テストコーナー
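
The output above mixes relative links (en/, ./audio/) with absolute ones. If absolute URLs are needed, one possible approach is to resolve each href against the page URL with urllib.parse.urljoin. This is a sketch under that assumption. `to_absolute` is a hypothetical helper that does not appear in get_url.py.

```python
from urllib.parse import urljoin


def to_absolute(base_url, hrefs):
    """Hypothetical helper: resolve each href against base_url.

    Entries that are None (an <a> tag without an href attribute) are skipped.
    """
    return [urljoin(base_url, href) for href in hrefs if href is not None]


if __name__ == "__main__":
    # Sample hrefs taken from the execution result above.
    sample = ["en/", "./audio/", "https://github.com/ekzemplaro/", None]
    for link in to_absolute("https://ekzemplaro.org/", sample):
        print(link)
```

In the script, the same call could be applied to the value returned by aa.get("href") before printing it.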

get_url.py

#! /usr/bin/python
# -*- coding: utf-8 -*-
#
#   get_url.py
#
#                   Aug/18/2018
#
# ------------------------------------------------------------------
import requests
import sys
from bs4 import BeautifulSoup
#
url = sys.argv[1]
headers = {"User-Agent": "Mozilla/5.0 (X11; Linux x86_64; rv:57.0) Gecko/20100101 Firefox/57.0",}
#
try:
	rr = requests.get(url,headers=headers)
	html = rr.content
	try:
		soup = BeautifulSoup(html, "html.parser")
		for aa in soup.find_all("a"):
			link = aa.get("href")
			name = aa.get_text()
			print(link,"\t",name)
	except Exception as ee:
		sys.stderr.write("*** error *** in BeautifulSoup ***\n")
		sys.stderr.write(str(ee) + "\n")
#

except Exception as ee:
	sys.stderr.write("*** error *** in requests.get ***\n")
	sys.stderr.write(str(ee) + "\n")
#
# ------------------------------------------------------------------
# ------------------------------------------------------------------

Installing requests and beautifulsoup4 on Arch Linux

sudo pacman -S python-requests
sudo pacman -S python-beautifulsoup4

I confirmed that the script works with the following version.

$ python --version
Python 3.9.5
