Web Scraping with Python

Last updated at 2020-03-13, posted at 2020-03-13

Let's do some web scraping

Notes to self.
The target is the Yahoo! JAPAN news topics page (https://news.yahoo.co.jp/topics).
This covers scraping the headlines and saving them to a text file, nothing more.

Importing the libraries

import requests
from bs4 import BeautifulSoup

GET request

## I was behind a proxy, so set up the proxy configuration
proxies = {
    "http": "proxy URL",
    "https": "proxy URL",
}
## Pass the proxy settings as an option and send the GET request
req = requests.get('https://news.yahoo.co.jp/topics', proxies=proxies)
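
The same request can also be wrapped in a small helper that sets a timeout and fails loudly on HTTP errors. This is only a sketch: `make_session`, `fetch_html`, and the proxy address `http://proxy.example.com:8080` are made-up names for illustration, not part of the original memo.

```python
import requests


def make_session(proxy_url=None):
    """Return a requests.Session, optionally routed through a proxy."""
    session = requests.Session()
    if proxy_url:
        # Use the same (hypothetical) proxy for both schemes
        session.proxies.update({"http": proxy_url, "https": proxy_url})
    return session


def fetch_html(session, url, timeout=10):
    """GET the page and return the raw bytes, raising on 4xx/5xx responses."""
    response = session.get(url, timeout=timeout)
    response.raise_for_status()
    return response.content


# Placeholder proxy address; replace with the real one for your network
session = make_session("http://proxy.example.com:8080")
# html = fetch_html(session, "https://news.yahoo.co.jp/topics")  # performs the actual request
```

The last line is left commented out so that nothing goes over the network until the proxy address is real.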

Scraping with BeautifulSoup -> building a list

Note that the keyword argument for BeautifulSoup().find_all is "class_", not "class".
"class" is a reserved word in Python, so it cannot be used as a keyword argument name.
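
To see the point about `class_` in isolation, here is a minimal sketch on a hand-written HTML snippet. The markup is invented for this example and is much simpler than the real page. It also shows two other ways to write the same filter: the `attrs` dictionary and a CSS selector.

```python
from bs4 import BeautifulSoup

# Made-up markup for illustration only
html = """
<ul>
  <li class="topicsListItem">Topic A</li>
  <li class="topicsListItem">Topic B</li>
  <li class="other">Not a topic</li>
</ul>
"""
soup = BeautifulSoup(html, "html.parser")

# Three equivalent ways to filter by CSS class
by_keyword = soup.find_all("li", class_="topicsListItem")
by_attrs = soup.find_all("li", attrs={"class": "topicsListItem"})
by_selector = soup.select("li.topicsListItem")

titles = [li.get_text() for li in by_keyword]
print(titles)
```

Writing `find_all("li", class="topicsListItem")` is a SyntaxError, which is why the trailing underscore exists.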

yahoo = BeautifulSoup(req.content, "html.parser")
topics_list_items = yahoo.find_all("li", class_="topicsListItem")  # this is the class used in the Yahoo! JAPAN page markup
### Extract only the strings and collect them in a list
topics_list_text = []
for topics_list_item in topics_list_items:
    topics_list_text.append(topics_list_item.text)
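
The loop above can also be written as a list comprehension, and `get_text` can trim whitespace along the way. A small sketch on invented markup (the `<a>` and `<span>` nesting is an assumption for the example, not the real page structure):

```python
from bs4 import BeautifulSoup

# Made-up markup: each item wraps its headline in nested tags
html = """
<li class="topicsListItem"><a href="/pickup/1">  First headline </a></li>
<li class="topicsListItem"><a href="/pickup/2">Second headline<span>NEW</span></a></li>
"""
soup = BeautifulSoup(html, "html.parser")
items = soup.find_all("li", class_="topicsListItem")

# find_all returns a list-like ResultSet; get_text lives on each element, not on the list
result_set_has_get_text = hasattr(items, "get_text")

# strip=True trims each text fragment, and the separator joins fragments from nested tags
titles = [item.get_text(" ", strip=True) for item in items]
print(titles)
```

Without the separator argument, the text of nested tags is joined with nothing in between, so the second item would come out as one run of characters.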

Saving to a text file

f = open('topiclist.txt', 'w')  # the file is created automatically
for topic in topics_list_text:
    f.write(str(topic) + '\n')  # one topic per line
f.close()
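
A variation on the save step, sketched with a `with` block so the file is closed even if a write fails, and with an explicit UTF-8 encoding since the headlines are Japanese. The helper name `save_topics` and the sample topics are made up for this example, and it writes to a temporary directory so it does not touch any real file.

```python
import tempfile
from pathlib import Path


def save_topics(topics, path):
    """Write one topic per line, encoded as UTF-8."""
    with open(path, "w", encoding="utf-8") as f:
        for topic in topics:
            f.write(f"{topic}\n")


# Demo in a throwaway directory; the topics are placeholders
with tempfile.TemporaryDirectory() as tmp_dir:
    out_path = Path(tmp_dir) / "topiclist.txt"
    save_topics(["Topic A", "トピックB"], out_path)
    saved_lines = out_path.read_text(encoding="utf-8").splitlines()
```

Without `encoding="utf-8"`, `open` falls back to the platform default, which may not be able to represent Japanese text on every machine.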

Impressions

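This was my first time using BeautifulSoup, and it is very clear and easy. Put another way, it is a black box.
One catch: the result of find_all has no get_text method, so it seems you have to pull the text out with a for loop.
Either way, it was very easy to do. Times are moving on...
That's all.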