
Python: Notify Slack When a Website Is Updated

Posted at 2022-05-28 (last updated at 2022-05-28)

Introduction

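・This article gives code that posts the title and URL of Nikkei news articles to Slack whenever the articles have been updated.
・Before scraping the Nikkei site, check its robots.txt. Deciding whether your use violates its rules is your own responsibility.
・The code is based on the reference article below, and only the code is given here (see that article for detailed explanations).
https://qiita.com/ryo-futebol/items/235c212fdfc3704b7e9c
・Unlike that article, this one does not cover scheduled execution. The code here is meant to be run manually.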
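As a rough illustration of the robots.txt check mentioned above, the standard library's `urllib.robotparser` can evaluate rules for a given URL. This is only a sketch: the rules in `SAMPLE_ROBOTS_TXT`, the `my-bot` user agent, and the `example.com` URLs are made up so the example runs offline, and they do not reflect the real rules of any site.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used so the example runs without network access.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(is_allowed(SAMPLE_ROBOTS_TXT, "my-bot", "https://example.com/"))
print(is_allowed(SAMPLE_ROBOTS_TXT, "my-bot", "https://example.com/private/page"))
```

To check a live site instead, `RobotFileParser.set_url()` followed by `read()` downloads the actual robots.txt before `can_fetch()` is called. A technical "allowed" result does not replace reading the site's terms of use.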

Overall flow

・Scrape the target site and collect the required information
・Check whether anything has been updated
・If there are updates, post them to Slack
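The three steps above can be sketched as one small function. This is an illustrative outline, not the article's implementation: `run_once` and its four callable parameters are hypothetical names, and the scraping, storage, and notification steps are passed in so the flow can be tried without network access.

```python
from typing import Callable, List, Optional

Article = List[str]  # [title, url]


def run_once(
    fetch: Callable[[], List[Article]],
    load_previous: Callable[[], Optional[List[Article]]],
    save: Callable[[List[Article]], None],
    notify: Callable[[List[Article]], None],
) -> List[Article]:
    """Fetch articles, compare with the previous run, and notify on new ones."""
    current = fetch()
    previous = load_previous()
    save(current)  # The latest snapshot becomes the baseline for the next run.
    if previous is None:
        return []  # First run: nothing to compare against yet.
    new_items = [article for article in current if article not in previous]
    if new_items:
        notify(new_items)
    return new_items


# Usage with in-memory stand-ins for the real scraper, CSV file, and Slack.
store = {"saved": [["Old story", "https://example.com/old"]]}
sent = []
new_items = run_once(
    fetch=lambda: [["Old story", "https://example.com/old"],
                   ["New story", "https://example.com/new"]],
    load_previous=lambda: store["saved"],
    save=lambda articles: store.update(saved=articles),
    notify=sent.append,
)
print(new_items)
```

In this sketch the first run (no previous data) saves a baseline and returns quietly, which is one possible design choice.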

Libraries

import pandas as pd
import requests
from bs4 import BeautifulSoup
import csv
import os
import datetime
import slackweb

Scraping

Fetch seven articles from the Nikkei top page:
https://www.nikkei.com/

Only the title and URL of each article are collected.
For how scraping itself works, please refer to a tutorial site.

def news():
    url = "https://www.nikkei.com"
    response = requests.get(url)
    html = response.text
    soup = BeautifulSoup(html,"html.parser")
    result = []

    # Get the top stories from the Nikkei front page
    info = soup.find_all(class_="top_t74fb00")
    for i in info:
        info = i.find_all("article")
        for j in info:
            info = j.find_all("a")
            info = info[0]
            link_url = url + info.get('href')
            title = info.text
            result.append([title,link_url])

    info = soup.find_all(class_="blocks_b16l9kyv")
    for i in info:
        info = i.find_all("article")
        for j in info:
            info = j.find_all("a")
            info = info[1]
            title = info.text
            link_url = url + info.get('href')
            result.append([title,link_url])
    
    result = result[:7]  # Keep only the first seven articles
    return result
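The function above builds each link with `url + info.get('href')`, which assumes every `href` is a site-relative path. As a hedged alternative, `urllib.parse.urljoin` from the standard library handles both relative and absolute values. The helper name `absolute_url` and the sample paths below are made up for illustration.

```python
from urllib.parse import urljoin

BASE_URL = "https://www.nikkei.com"


def absolute_url(href: str) -> str:
    """Resolve an href against the site root, whether it is relative or absolute."""
    return urljoin(BASE_URL, href)


# A relative href (hypothetical path) is resolved against the base URL.
print(absolute_url("/article/example/"))
# An already-absolute href is returned unchanged instead of being prefixed twice.
print(absolute_url("https://www.nikkei.com/article/example/"))
```

Both calls print `https://www.nikkei.com/article/example/`, whereas plain string concatenation would prefix the second value with the base URL a second time.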

Save the fetched data to a CSV file

def output_csv(result):
    with open('last_log.csv', 'w', newline='',encoding='utf_8') as file:
        headers = ['Title', 'URL']
        writer = csv.writer(file)
        writer.writerow(headers)
        for row in result:
            writer.writerow(row)

Check that the file exists and is not empty

def read_csv():
    if not os.path.exists('last_log.csv'):
        result = news()  # Fetch the latest news
        output_csv(result)  # Create the saved data for the next run
        raise Exception('The file does not exist.')
    if os.path.getsize('last_log.csv') == 0:
        raise Exception('The file is empty.')
    csv_list = pd.read_csv('last_log.csv', header=None).values.tolist()  # Convert the DataFrame to a list
    return csv_list
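Because `read_csv()` passes `header=None`, the `Title`/`URL` header row is returned as an ordinary data row. That does not affect the later comparison, but the round trip can also be done with only the standard `csv` module while skipping the header. The sketch below is one possible variant, not the article's method: `write_articles` and `read_articles` are hypothetical names, and it returns `None` rather than raising when the file is missing or empty.

```python
import csv
import os
import tempfile


def write_articles(path, articles):
    """Write a header row followed by one [title, url] row per article."""
    with open(path, "w", newline="", encoding="utf_8") as f:
        writer = csv.writer(f)
        writer.writerow(["Title", "URL"])
        writer.writerows(articles)


def read_articles(path):
    """Return the saved rows without the header, or None if missing or empty."""
    if not os.path.exists(path) or os.path.getsize(path) == 0:
        return None
    with open(path, newline="", encoding="utf_8") as f:
        reader = csv.reader(f)
        next(reader, None)  # Skip the header row.
        return [row for row in reader]


# Round trip in a temporary directory so no file is left behind.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "last_log.csv")
    print(read_articles(path))  # File does not exist yet.
    write_articles(path, [["Sample title", "https://example.com/a"]])
    print(read_articles(path))
```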

Compare the previously saved CSV with the latest scrape and extract only the entries that changed

def list_diff(result, last_result):
    diff_list = []
    for tmp in result:
        if tmp not in last_result:
            diff_list.append(tmp)
    return diff_list
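To make the comparison concrete, here is the same logic applied to made-up data. The function is restated as a list comprehension so the snippet runs on its own; the titles and `example.com` URLs are placeholders.

```python
def list_diff(result, last_result):
    """Return the rows of result that do not appear in last_result, in order."""
    return [row for row in result if row not in last_result]


# Hypothetical previous snapshot, including the header row that read_csv() returns.
last_result = [["Title", "URL"], ["Old story", "https://example.com/old"]]
# Hypothetical latest scrape.
result = [
    ["Old story", "https://example.com/old"],
    ["New story", "https://example.com/new"],
]

print(list_diff(result, last_result))
# [['New story', 'https://example.com/new']]
```

Rows are compared as whole `[title, url]` pairs, so an article whose title changes while its URL stays the same is also reported as new.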

Notify Slack (Incoming WebHooks)

def send_to_slack(diff_list):
    if diff_list != []:
        now = datetime.datetime.now()
        now = now.strftime("%Y/%m/%d %H:%M:%S")
        text = now + '\n'
        for tmp in diff_list:
            text += tmp[0] + '\n' + tmp[1] + '\n'
        slack = slackweb.Slack(url='Enter your own webhook URL here')
        slack.notify(text=text)

Completed code

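On the first run the CSV file does not exist yet, so read_csv() creates it and then raises the Exception reporting that the file does not exist.
From the second run onward, if the site differs from the saved CSV, the new entries are posted to Slack.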

import pandas as pd
import requests
from bs4 import BeautifulSoup
import csv
import os
import datetime
import slackweb

def news():
    url = "https://www.nikkei.com"
    response = requests.get(url)
    html = response.text
    soup = BeautifulSoup(html,"html.parser")
    result = []

    # Get the top stories from the Nikkei front page
    info = soup.find_all(class_="top_t74fb00")
    for i in info:
        info = i.find_all("article")
        for j in info:
            info = j.find_all("a")
            info = info[0]
            link_url = url + info.get('href')
            title = info.text
            result.append([title,link_url])

    info = soup.find_all(class_="blocks_b16l9kyv")
    for i in info:
        info = i.find_all("article")
        for j in info:
            info = j.find_all("a")
            info = info[1]
            title = info.text
            link_url = url + info.get('href')
            result.append([title,link_url])
    
    result = result[:7]  # Keep only the first seven articles
    return result

def output_csv(result):
    with open('last_log.csv', 'w', newline='',encoding='utf_8') as file:
        headers = ['Title', 'URL']
        writer = csv.writer(file)
        writer.writerow(headers)
        for row in result:
            writer.writerow(row)

def read_csv():
    if not os.path.exists('last_log.csv'):
        result = news()  # Fetch the latest news
        output_csv(result)  # Create the saved data for the next run
        raise Exception('The file does not exist.')
    if os.path.getsize('last_log.csv') == 0:
        raise Exception('The file is empty.')
    csv_list = pd.read_csv('last_log.csv', header=None).values.tolist()  # Convert the DataFrame to a list
    return csv_list

def list_diff(result, last_result):
    diff_list = []
    for tmp in result:
        if tmp not in last_result:
            diff_list.append(tmp)
    return diff_list

def send_to_slack(diff_list):
    if diff_list != []:
        now = datetime.datetime.now()
        now = now.strftime("%Y/%m/%d %H:%M:%S")
        text = now + '\n'
        for tmp in diff_list:
            text += tmp[0] + '\n' + tmp[1] + '\n'
        slack = slackweb.Slack(url='Enter your own webhook URL here')
        slack.notify(text=text)

# main
if __name__ == '__main__':
    last_result = read_csv()  # Load the previously saved data
    result = news()  # Fetch the latest news
    output_csv(result)  # Update the saved data
    diff_list = list_diff(result, last_result)  # Compare the latest news with the saved data
    send_to_slack(diff_list)  # Notify Slack if there are differences
