
Python scraping notes: converting relative URLs to absolute URLs

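Introduction

I want to get comfortable with scraping in Python, so starting today I will write notes as I study.
Basically, I look up whatever I don't understand, test it, and once it works I summarize it here as a note. I won't go into detailed explanations.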

Some links are printed as relative paths

https://gigazine.net/
Extract the URL from each a tag on the page above.

soutai
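from urllib import request
from bs4 import BeautifulSoup

url = "https://gigazine.net/"
# Fetch the page and parse it with the built-in HTML parser
html = request.urlopen(url)
soup = BeautifulSoup(html, "html.parser")
# Print the href attribute of every a tag (None when the tag has no href)
for i in soup.find_all("a"):
    print(i.get("href"))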

soutai_kekka
https://gigazine.net/
https://twitter.com/gigazine
https://www.facebook.com/GIGAZINE
https://www.youtube.com/user/gigazine
#
#
None
https://gigazine.net/gsc/
https://gigazine.net/club/
https://gigazine.net/news/20190415-cambridge-bicycle-plan/
https://gigazine.net/news/20190415-cambridge-bicycle-plan/
https://gigazine.net/news/20190415-cambridge-bicycle-plan/

# ... omitted ...
https://gigazine.net/news/C34/
https://gigazine.net/news/C16/
https://gigazine.net/news/C36/
https://gigazine.net/news/C21/

# These are relative paths
/news/contact2/
/news/contact3/
/news/contact4/
/news/about/

http://gigazine.co.jp/

Converting to absolute paths

The following question solved it for me:
https://teratail.com/questions/27753

Overview of the urllib.parse module
https://docs.python.org/ja/3/library/urllib.parse.html
This module defines a standard interface to break Uniform Resource Locator (URL) strings up into components (addressing scheme, network location, path, etc.), to combine the components back into a URL string, and to convert a "relative URL" to an absolute URL given a "base URL".
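To check what "breaking a URL into components" looks like in practice, here is a minimal sketch using `urlparse` and `urlunparse` from `urllib.parse`. The URLs are taken from the output listing in this note. The variable names are my own.

```python
from urllib.parse import urlparse, urlunparse

# An absolute URL taken from the output listing in this note
original = "https://gigazine.net/news/20190415-cambridge-bicycle-plan/"
parts = urlparse(original)

print(parts.scheme)  # addressing scheme
print(parts.netloc)  # network location
print(parts.path)    # path

# Combine the components back into a URL string
rebuilt = urlunparse(parts)
print(rebuilt)

# A relative link has neither a scheme nor a network location,
# which is why it cannot be fetched without a base URL
relative = urlparse("/news/about/")
print(repr(relative.scheme), repr(relative.netloc), relative.path)
```

An empty `scheme` and `netloc` is one way to tell that an href is relative, although `urljoin` handles both cases without needing this check.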

Overview of the urljoin function
Construct a full ("absolute") URL by combining a "base URL" (base) with another URL (url).
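A small sketch of how `urljoin` behaves for the kinds of href values seen on this page. The `../about/` link and the deeper base URL are hypothetical inputs I made up for illustration. The other values come from the listing above.

```python
from urllib.parse import urljoin

base = "https://gigazine.net/"

# Root-relative path: resolved against the host of the base URL
joined_relative = urljoin(base, "/news/about/")
print(joined_relative)  # https://gigazine.net/news/about/

# Already-absolute URL: returned unchanged, the base is ignored
joined_absolute = urljoin(base, "https://twitter.com/gigazine")
print(joined_absolute)  # https://twitter.com/gigazine

# Hypothetical case: a "../" link resolved against a deeper base URL
deep_base = "https://gigazine.net/news/C34/"
joined_parent = urljoin(deep_base, "../about/")
print(joined_parent)  # https://gigazine.net/news/about/
```

Because absolute URLs pass through untouched, it seems safe to apply `urljoin` to every href without first checking whether it is relative.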

Looks like I'll be relying on this a lot (probably).

zettai
from urllib import request
from bs4 import BeautifulSoup
from urllib.parse import urljoin

url = "https://gigazine.net/"
html = request.urlopen(url)
soup = BeautifulSoup(html, "html.parser")
# Resolve each href against the page URL so relative paths become absolute
for i in soup.find_all("a"):
    print(urljoin(url, i.get("href")))


zettai_kekka

https://gigazine.net/
https://twitter.com/gigazine
https://www.facebook.com/GIGAZINE
https://www.youtube.com/user/gigazine
https://gigazine.net/
https://gigazine.net/
https://gigazine.net/
https://gigazine.net/gsc/
https://gigazine.net/club/
https://gigazine.net/news/20190415-cambridge-bicycle-plan/
https://gigazine.net/news/20190415-cambridge-bicycle-plan/
https://gigazine.net/news/20190415-cambridge-bicycle-plan/

# ... omitted ...

https://gigazine.net/news/C34/
https://gigazine.net/news/C16/
https://gigazine.net/news/C36/
https://gigazine.net/news/C21/

# Now printed as absolute paths
https://gigazine.net/news/contact2/
https://gigazine.net/news/contact3/
https://gigazine.net/news/contact4/
https://gigazine.net/news/about/

http://gigazine.co.jp/
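One thing visible in the result: the `#` and `None` entries from the first listing all came out as https://gigazine.net/, and the same article URL appears several times. If that is unwanted, one possible approach is to skip those hrefs and drop duplicates before joining. This is only a sketch. `absolutize` is a hypothetical helper name of mine, and the input list is a short hand-written sample of the values printed earlier, not real scraped data.

```python
from urllib.parse import urljoin


def absolutize(base, hrefs):
    """Return absolute URLs, skipping missing and fragment-only hrefs."""
    seen = set()
    result = []
    for href in hrefs:
        # Skip a tags without an href (None) and in-page links such as "#"
        if not href or href.startswith("#"):
            continue
        absolute = urljoin(base, href)
        # Keep only the first occurrence of each URL, preserving order
        if absolute not in seen:
            seen.add(absolute)
            result.append(absolute)
    return result


# Hand-written sample of href values like those printed by the first script
sample_hrefs = [
    "https://gigazine.net/",
    "#",
    None,
    "/news/about/",
    "/news/about/",
    "http://gigazine.co.jp/",
]

links = absolutize("https://gigazine.net/", sample_hrefs)
for link in links:
    print(link)
```

In the scraping script, the same function could be fed with `[i.get("href") for i in soup.find_all("a")]`.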

Wrapping up

I plan to keep writing notes like this regularly.
