
Scraping news with Requests and BeautifulSoup and exporting it to CSV

Posted at 2021-11-10

This article scrapes news URLs and titles and writes them out to a CSV file.
The language used is Python.
The targets this time are Hatena Bookmark and Hacker News.

Install the following packages beforehand:

pip install requests
pip install BeautifulSoup4
pip install numpy
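
A quick way to confirm the installs succeeded is to import each package and print its version (a throwaway check, not part of the article's scripts):

import requests
import bs4
import numpy

# Print the installed versions to confirm the setup
print(requests.__version__, bs4.__version__, numpy.__version__)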

Create crawler.py and scrape the news

import requests
from bs4 import BeautifulSoup

def crawl_hatena():
    data = []
    title_list = []
    url_list = []
    load_url = "https://b.hatena.ne.jp/hotentry/it"
    html = requests.get(load_url)
    soup = BeautifulSoup(html.content, "html.parser")

    # Select the entry links by class and store their titles and URLs
    topic = soup.find_all("a", class_="js-keyboard-openable")
    for element in topic[:30]:
        title_list.append(element.get("title"))
        url_list.append(element.get("href"))
    for (title, url) in zip(title_list, url_list):
        data.append("[Title]" + title)
        data.append("[URL]" + url + "\n")
    item = "\n" + "-----[Hatena Bookmark]-----" + "\n" + "\n".join(data)
    return item, title_list, url_list

def crawl_hacker_news():
    data = []
    title_list = []
    url_list = []
    load_url = "https://news.ycombinator.com/"
    html = requests.get(load_url)
    soup = BeautifulSoup(html.content, "html.parser")

    # Select the story links by class and store their titles and URLs
    # (Hacker News markup changes over time, so this class name may need updating)
    topic = soup.find_all("a", class_="storylink")
    for element in topic[:50]:
        title_list.append(element.get_text())
        url_list.append(element.get("href"))
    for (title, url) in zip(title_list, url_list):
        data.append("[Title]" + title)
        data.append("[URL]" + url + "\n")
    item = "\n" + "-----[Hacker News]-----" + "\n" + "\n".join(data)
    return item, title_list, url_list
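
If you want to sanity-check crawler.py on its own before wiring up the CSV step, a small entry point can be appended to it (a minimal test sketch; this __main__ block is not part of the original script):

if __name__ == "__main__":
    # Print the formatted Hatena Bookmark and Hacker News summaries
    hatena_item, _, _ = crawl_hatena()
    hn_item, _, _ = crawl_hacker_news()
    print(hatena_item)
    print(hn_item)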

Create main.py and write the data extracted in crawler.py to a CSV file

import csv
import datetime

import numpy as np

import crawler

# Get the news titles and URLs (these functions live in crawler.py)
cr_hatena = crawler.crawl_hatena()
cr_hacker_news = crawler.crawl_hacker_news()

# Pair each title with its URL
hatena_data = [list(e) for e in zip(cr_hatena[1], cr_hatena[2])]
hacker_news_data = [list(e) for e in zip(cr_hacker_news[1], cr_hacker_news[2])]

# Combine the Hatena Bookmark and Hacker News data
bk_news_data = np.concatenate((hatena_data, hacker_news_data))
bk_news_header = ['Title', 'URL']
bk_news_path = 'bk_news/bk_news_'

# Timestamp used in the output file name (the original referenced an
# undefined `time`, so a datetime-based stamp is assumed here)
timestamp = datetime.datetime.now().strftime('%Y%m%d_%H%M%S')

# Write to CSV
with open(bk_news_path + timestamp + '.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(bk_news_header)
    writer.writerows(bk_news_data)

When you run the above, the scraped data should be written to a CSV file at the specified path.
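
Note that the bk_news/ directory has to exist before the file is opened, or open() raises FileNotFoundError. A minimal sketch that creates it first (this os.makedirs call is an addition, not part of the original script):

import os

# Create the output directory if it does not exist yet
os.makedirs('bk_news', exist_ok=True)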
