
How to Build an Interactive LINE Bot 004 (Answering the Earnings Dates of Listed Companies)

Posted at 2019-12-03

Overview

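Last time I implemented a feature that fetches stock price information, but realistically it is only a matter of time before someone points out, "Wouldn't a stock app be handier than bothering to ask a bot?" I am no exception: I use a certain portfolio management app every day, so I never really feel the need to have a bot tell me one specific stock price.
So I went back to basics, asked myself what features I actually want, and looked for something more practical.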

"That's it: it would be handy to see all the (scheduled) earnings dates at a glance."

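When you hold many stocks, keeping track day to day of each one's schedule for annual, semiannual, and quarterly results is quite a chore. A feature that collects all of this in one place and makes it visible would be convenient, and above all it feels like what a bot is really meant to do.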

So this time I want to hone my scraping technique, with the aim of getting better at pulling specific pieces of information out of a website and analyzing them.

System Configuration

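When a trigger phrase containing a stock code is sent from the LINE app (①), the bot accesses the website, looks up the most recent earnings announcement (or scheduled) date, and replies (②).
Two modes are envisaged: specifying a single stock, or fetching multiple stocks at once from a portfolio configured in advance.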

1. Selecting the target website

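I will use a site called Kabuyoho (株予報).
When the page for an individual stock is opened, the most recent earnings announcement (or scheduled) date is displayed at the position shown in the figure below.
If the results have already been announced, that fact is shown together with the date;
if the announcement is still to come, the date is shown as the scheduled announcement date.
The HTML source corresponding to this position is the div with class "header_main", which starts at line 222 of the page source.
Everything I want turns out to be contained beneath it.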

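So this mission looks achievable once the following three steps are understood.

1. Fetch the HTML source
2. Parse it and extract the <div class="header_main"> tag
3. Format the string

To give the conclusion first, the code turns out very neat when using
　requests for step 1,
　BeautifulSoup for step 2, and
　re for step 3.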

2. Fetching the HTML source with requests

The requests package was already installed when the chatbot was implemented, so I will skip the details.
A coding example follows.

(Script execution example) How to use requests

(botenv2) [botenv2]$ python
Python 3.6.7 (default, Dec  5 2018, 15:02:16) 
>>> import requests

# Fetch the HTML source
>>> r = requests.get('https://kabuyoho.ifis.co.jp/index.php?action=tp1&sa=report_top&bcode=4689')

# Check the contents
>>> print(r.headers)
{'Cache-Control': 'max-age=1', 'Content-Encoding': 'gzip', 'Content-Type': 'text/html; charset=UTF-8', (omitted)
>>> print(r.encoding)
UTF-8
>>> print(r.content)
(omitted)

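That is about it, and it should need no particular explanation.
r.content holds the entire response body (as bytes), so from here you can do whatever you like with it:
just pull out the information using the target HTML tag as the key.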
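As a small aside, the URL in the session above is just a base path plus three query parameters, with bcode carrying the stock code. The following is a minimal sketch of rebuilding it with the standard library so that the code can be swapped; `BASE_URL` and `build_report_url` are hypothetical names introduced here, not part of the original script.

```python
from urllib.parse import urlencode

# Base path taken from the URL used in the session above.
BASE_URL = "https://kabuyoho.ifis.co.jp/index.php"


def build_report_url(code):
    """Build the report page URL for a stock code (hypothetical helper)."""
    # Parameter names and values mirror the query string of the example URL.
    params = {"action": "tp1", "sa": "report_top", "bcode": code}
    return BASE_URL + "?" + urlencode(params)


print(build_report_url("4689"))
```

requests.get also accepts such a dictionary through its params argument, so building the string by hand is a matter of taste.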

3. Introducing BeautifulSoup (parser)

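It is a well-known package that appears to support a variety of parsers.
Once I had installed it, this alone got me 90% of the way to this article's goal.
Compared with the days when I used to peck away at C, it feels like a different era.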

Installing BeautifulSoup

(botenv2) [botenv2]$ pip install BeautifulSoup4

(Script execution example, continued) How to use BeautifulSoup

>>> from bs4 import BeautifulSoup

# Parse the body with the parser
>>> soup = BeautifulSoup(r.content, "html.parser")

As a test, let's display the header_main class.

>>> print(soup.find("div", class_="header_main"))

Output


<div class="header_main">
<div class="stock_code left">4689</div>
<div class="stock_name left">Ｚホールディングス</div>
<div class="block_update right">
<div class="title left">
                                                                        決算発表済
                                                        </div>
<div class="settle left">
                                                                        2Q
                                                        </div>
<div class="date left">
                                                                                                   2019/11/01
                                                                                                </div>
<div class="float_end"></div>
</div>
<div class="float_end"></div>
</div>

Amazing. It is so convenient I can't stop shaking.
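Since each value in the output above sits in its own div, one alternative to cleaning up the combined text is to read the child elements by class. The sketch below does this on a trimmed copy of that output; the function name, the dictionary keys, and the assumption that every page uses these same class names are mine, not something the site guarantees.

```python
from bs4 import BeautifulSoup

# Trimmed copy of the header_main block printed above (whitespace shortened).
SAMPLE_HTML = """
<div class="header_main">
<div class="stock_code left">4689</div>
<div class="stock_name left">Ｚホールディングス</div>
<div class="block_update right">
<div class="title left">
    決算発表済
</div>
<div class="settle left">
    2Q
</div>
<div class="date left">
    2019/11/01
</div>
<div class="float_end"></div>
</div>
<div class="float_end"></div>
</div>
"""


def extract_header_fields(html):
    """Return the header_main fields as a dict, or None if the block is missing."""
    soup = BeautifulSoup(html, "html.parser")
    header = soup.find("div", class_="header_main")
    if header is None:
        return None

    def text_of(css_class):
        # class_ matches a tag when any one of its classes equals css_class.
        node = header.find("div", class_=css_class)
        return node.get_text(strip=True) if node is not None else None

    return {
        "code": text_of("stock_code"),
        "name": text_of("stock_name"),
        "status": text_of("title"),
        "period": text_of("settle"),
        "date": text_of("date"),
    }


fields = extract_header_fields(SAMPLE_HTML)
print(fields["code"], fields["period"], fields["date"])
```

get_text(strip=True) trims the surrounding newlines and tabs, so no further string cleanup is needed for these fields.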

3. String formatting

All that remains is to strip out the unwanted characters.
The HTML tags are not needed, so use the text attribute.

(Script execution example, continued) Text extraction

>>> s = soup.find("div", class_="header_main").text
>>> print(s)
4689
Ｚホールディングス


                                                                        決算発表済
                                                 

                                                                        2Q
                                                 

                                                                                                   2019/11/01
                                                                                         




>>>

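The tags have been swept away, but a lot of mysterious gaps remain.
I could not tell whether they were spaces or control characters, and got stuck for a moment.
In such cases, displaying the string as bytes reveals what is really there.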

(Reference) Checking the character codes

>>> s.encode()
b'\n4689\n\xef\xbc\xba\xe3\x83\x9b\xe3\x83\xbc\xe3\x83\xab\xe3\x83\x87\xe3\x82\xa3\xe3\x83\xb3\xe3\x82\xb0\xe3\x82\xb9\n\n\n\t\t\t\t\t\t\t\t\t\xe6\xb1\xba\xe7\xae\x97\xe7\x99\xba\xe8\xa1\xa8\xe6\xb8\x88\n\t\t\t\t\t\t\t\n\n\t\t\t\t\t\t\t\t\t2Q\n\t\t\t\t\t\t\t\n\n\t\t\t\t\t\t\t\t\t\t\t\t\t\t\t2019/11/01\n\t\t\t\t\t\t\t\t\t\t\t\t\n\n\n\n'
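Another way to see what the gaps are made of is to count the whitespace characters directly. This is a minimal sketch using only the standard library; `whitespace_profile` is a hypothetical helper, and the sample string is a shortened stand-in for the scraped text.

```python
from collections import Counter


def whitespace_profile(text):
    """Count each whitespace character in text, keyed by its escaped form."""
    # ascii() renders invisible characters as escapes such as '\n' or '\xa0'.
    return Counter(ascii(ch) for ch in text if ch.isspace())


sample = "\n4689\n\t\t2Q\n\t"
for escaped, count in whitespace_profile(sample).items():
    print(escaped, count)
```

Because str.isspace() also treats NO-BREAK SPACE as whitespace, the same helper would surface a stray \xa0 as well.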

In short, I just need to get rid of \n and \t.
Let's mercilessly replace them with commas and be done with them.

(Script execution example, continued) Removing control characters

>>> import re
>>> s = re.sub(r'[\n\t]+', ',', s)
>>> print(s)
,4689,Ｚホールディングス,決算発表済,2Q,2019/11/01,

As a finishing touch, remove the stray leading and trailing commas:

(Script execution example, continued)

>>> s = re.sub(r'(^,)|(,$)','', s)
>>> print(s)
4689,Ｚホールディングス,決算発表済,2Q,2019/11/01
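The two substitutions above can also be expressed as a single split that drops the empty pieces, which yields a list instead of a comma-joined string. This is only a sketch of an equivalent approach; `split_fields` is a hypothetical name and `RAW` is the scraped text with the tab runs shortened.

```python
import re

# Raw text as returned by .text in the session above (tab runs shortened).
RAW = (
    "\n4689\nＺホールディングス\n\n\n\t\t決算発表済\n\t\n\n"
    "\t\t2Q\n\t\n\n\t\t2019/11/01\n\t\n\n\n\n"
)


def split_fields(raw):
    """Split scraped text on runs of newlines/tabs and drop empty pieces."""
    return [part for part in re.split(r"[\n\t]+", raw) if part]


fields = split_fields(RAW)
print(",".join(fields))
```

Joining the list with commas should reproduce the string shown above, while keeping the list makes it easy to check how many fields were found.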

Oh, that looks good. It could be converted straight into CSV or a dataframe.
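To illustrate the CSV idea, here is a minimal sketch that parses one result line with the standard csv module and turns the date into a real date object. The column names and `parse_settle_line` are assumptions made for this example, and it assumes every line has exactly the five fields seen above.

```python
import csv
import io
from datetime import datetime

# Column names are assumptions chosen for this sketch.
COLUMNS = ["code", "name", "status", "period", "date"]


def parse_settle_line(line):
    """Parse one comma-separated result line into a dict with a real date."""
    row = next(csv.reader(io.StringIO(line)))
    if len(row) != len(COLUMNS):
        raise ValueError("unexpected number of fields: %d" % len(row))
    record = dict(zip(COLUMNS, row))
    # The site shows dates as YYYY/MM/DD in the example output.
    record["date"] = datetime.strptime(record["date"], "%Y/%m/%d").date()
    return record


record = parse_settle_line("4689,Ｚホールディングス,決算発表済,2Q,2019/11/01")
print(record["code"], record["date"].isoformat())
```

With real dates in hand, a list of such records can be sorted by announcement date before being sent back to the user.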

Incidentally, when a nonexistent stock code is fetched and run through the processing above, the following character code is left over.

For a nonexistent stock code

>>> print(s.encode())
b'\xc2\xa0'

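This \xc2\xa0 is the UTF-8 encoding of the Unicode character NO-BREAK SPACE (U+00A0), which corresponds to &nbsp; in HTML.
Leaving this character in place gets in the way of later processing, so it is better to remove it where possible.
(It seems to be a common problem when scraping web pages.)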

(Reference) [Python3] How to deal with [\xa0] when you run into it while scraping

Removing &nbsp;

s = re.sub(r'\xa0','', s)  # s is a str, so match the character U+00A0 itself; [\xc2\xa0] would also delete U+00C2
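On a Python 3 str the non-breaking space is a single character, U+00A0; the two bytes \xc2\xa0 only appear once the string is encoded as UTF-8. The sketch below shows that relationship and removes the character with str.replace as a regex-free alternative; `NBSP` and `strip_nbsp` are names introduced here for illustration.

```python
NBSP = "\u00a0"  # NO-BREAK SPACE as a single str character

# Encoding the character to UTF-8 gives the two bytes seen in the output above.
print(NBSP.encode("utf-8"))


def strip_nbsp(text):
    """Remove every NO-BREAK SPACE from a str."""
    return text.replace(NBSP, "")


print(repr(strip_nbsp("2Q" + NBSP)))
```

Working on the str rather than on the encoded bytes avoids having to think about the \xc2 prefix at all.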

4. Implementing it in the bot app

Here is the processing above, tidied up into a function.

getSettledata.py

import requests
from bs4 import BeautifulSoup
import re
import logging
logger = logging.getLogger('getSettledata')

source = 'https://kabuyoho.ifis.co.jp/index.php?action=tp1&sa=report_top&bcode='

# Earnings date lookup (with no argument, refers to the data for 4689 (ZHD))
# Returns a string on success, or the tuple (None, error message) if the access fails
def get_settleInfo(code="4689"):

  #crawling
  try:
    logger.debug('read web data code = ' + code) #logging
    r = requests.get(source + code, timeout=10) #timeout value is an arbitrary choice
  except requests.exceptions.RequestException:
    logger.debug('read web data ---> Exception Error') #logging
    return None, 'Exception error: access failed'

  #scraping
  soup = BeautifulSoup(r.content, "html.parser")
  header = soup.find("div", class_="header_main")
  settleInfo = header.text if header is not None else '' #find() returns None when the div is missing
  settleInfo = re.sub(r'[\n\t]+', ',', settleInfo) #replace runs of newlines/tabs with a comma
  settleInfo = re.sub(r'(^,)|(,$)','', settleInfo) #remove leading and trailing commas
  settleInfo = re.sub(r'\xa0','', settleInfo) #handle the &nbsp; (U+00A0) issue
  logger.debug('settleInfo result = ' + settleInfo) #logging

  if not settleInfo:
    settleInfo = 'No such stock code.'

  return settleInfo

if __name__ == '__main__':
  print(get_settleInfo())

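In the main program, as usual, add a conditional branch that recognizes the trigger phrase.
Setting up your own portfolio in SETTLEVIEW_LIST_CORD beforehand makes those stocks the targets of the bulk fetch.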

chatbot.py (★ = added) ## Existing features are unchanged and therefore omitted

# -*- coding: utf-8 -*-

from django.views.decorators.csrf import csrf_exempt
from django.http import HttpResponse
from django.shortcuts import render
from datetime import datetime
from time import sleep
import requests
import json
import base64
import logging
import os
import random
import log.logconfig
from utils import tools
import re
from .getStockdata import get_chart
from .getSettledata import get_settleInfo

logger = logging.getLogger('commonLogging')

LINE_ENDPOINT = 'https://api.line.me/v2/bot/message/reply'
LINE_ACCESS_TOKEN = ''

###
### (omitted)
###

SETTLEVIEW_KEY = ['決算','settle'] #★added (決算 = "earnings")
SETTLEVIEW_LIST_KEY = ['決算リスト'] #★added (決算リスト = "earnings list")
SETTLEVIEW_LIST_CORD = ['4689','3938','4755','1435','3244','3048'] #★added

@csrf_exempt
def line_handler(request):

    #exception
    if not request.method == 'POST':
      return HttpResponse(status=200)

    logger.debug('line_handler message incoming') #logging
    out_log = tools.outputLog_line_request(request) #logging
    request_json = json.loads(request.body.decode('utf-8'))

    for event in request_json['events']:
      reply_token = event['replyToken']
      message_type = event['message']['type']
      user_id = event['source']['userId']

      #whitelist
      if not user_id == LINE_ALLOW_USER:
        logger.warning('invalid userID:' + user_id) #logging
        return HttpResponse(status=200)

      #action
      if message_type == 'text':
        if:
        ###
        ### (omitted)
        ###

        elif any(s in event['message']['text'] for s in SETTLEVIEW_KEY): #★added
          action_data(reply_token,'settleview',event['message']['text']) #★added

        else:
        ###
        ### (omitted)
        ###

    return HttpResponse(status=200)

def action_res(reply_token,command,):
    ###
    ### (omitted)
    ###

def action_data(reply_token,command,value):

    #stock price chart
    ###
    ### (omitted)
    ###

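    ####################################################### ★added: from here
    #earnings information
    elif command == 'settleview':
      logger.debug('get_settleInfo on') #logging

      #bulk fetch for the portfolio stocks
      if any(s in value for s in SETTLEVIEW_LIST_KEY):
        logger.debug('get_settleInfo LIST') #logging

        results = []
        for cord in SETTLEVIEW_LIST_CORD:
          result = get_settleInfo(cord)
          #on failure get_settleInfo returns (None, message); keep only the message
          results.append(result[1] if isinstance(result, tuple) else result)

        logger.debug('get_settleInfo LIST ---> ' + '\n'.join(results)) #logging
        response_text(reply_token,'\n'.join(results))

      #single stock
      else:
        cord = re.search('[0-9]+$', value)
        if cord:
          logger.debug('get_settleInfo cord = ' + cord.group()) #logging
          result = get_settleInfo(cord.group())
        else:
          #no trailing stock code in the message: use get_settleInfo's default (4689)
          logger.debug('get_settleInfo cord = (default)') #logging
          result = get_settleInfo()

        #on failure get_settleInfo returns (None, message)
        if isinstance(result, tuple):
          response_text(reply_token,result[1])
        else:
          response_text(reply_token,result)
    ####################################################### ★added: to here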

def response_image(reply_token,orgUrl,preUrl,text):
    ###
    ### (omitted)
    ###

def response_text(reply_token,text):
    payload = {
      "replyToken": reply_token,
      "messages":[
        {
          "type": 'text',
          "text": text
        }
      ]
    }
    line_post(payload)

def line_post(payload):
    url = LINE_ENDPOINT
    header = {
      "Content-Type": "application/json",
      "Authorization": "Bearer " + LINE_ACCESS_TOKEN
    }
    requests.post(url, headers=header, data=json.dumps(payload))
    out_log = tools.outputLog_line_response(payload) #logging
    logger.debug('line_handler message -->reply') #logging

def ulocal_chatting(event):
    ###
    ### (omitted)
    ###

That completes it.

Start line_bot

(botenv2) [line_bot]$ gunicorn --bind 127.0.0.1:8000 line_bot.wsgi:application

Send a message in the expected format from the LINE app and the result comes back.

To fetch everything at once, enter 決算リスト.
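The way a message is routed to the single-stock or bulk branch can be sketched in isolation as follows. The trigger phrases are copied from chatbot.py, but `parse_settle_command` and its return values are hypothetical and only illustrate the matching order, not the bot's actual code.

```python
import re

# Trigger phrases copied from chatbot.py; the function itself is hypothetical.
SETTLEVIEW_KEY = ["決算", "settle"]
SETTLEVIEW_LIST_KEY = ["決算リスト"]


def parse_settle_command(text):
    """Classify a message as ('list', None), ('single', code) or (None, None)."""
    if not any(key in text for key in SETTLEVIEW_KEY):
        return None, None  # not an earnings request at all
    if any(key in text for key in SETTLEVIEW_LIST_KEY):
        return "list", None  # bulk fetch for the whole portfolio
    # A single-stock request is expected to end with the stock code.
    match = re.search(r"[0-9]+$", text)
    code = match.group() if match else None
    return "single", code


for message in ["決算 4689", "決算リスト", "settle3938", "hello"]:
    print(message, parse_settle_command(message))
```

Checking the list phrase before looking for digits matters because 決算リスト also contains 決算.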

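Processing 6 stocks serially took about 1 second by actual measurement. I was impressed that it ran faster than I had imagined, but I don't want to be a nuisance to the website, so I will use it in moderation and avoid accessing it too frequently.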
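One simple way to go easy on the site is to pause between consecutive requests. The following is a minimal sketch under that assumption; `fetch_all_politely`, its parameters, and the one-second default are illustrative choices, and `fake_fetch` stands in for the real lookup so the example runs without network access.

```python
import time


def fetch_all_politely(codes, fetch, delay_seconds=1.0, sleep=time.sleep):
    """Call fetch(code) for each code, pausing between consecutive requests."""
    results = []
    for index, code in enumerate(codes):
        if index > 0:
            sleep(delay_seconds)  # wait before every request except the first
        results.append(fetch(code))
    return results


# Stand-in fetcher so the sketch runs without network access.
def fake_fetch(code):
    return code + ",dummy"


print(fetch_all_politely(["4689", "3938"], fake_fetch, delay_seconds=0))
```

In the bot, get_settleInfo could be passed as fetch; the trade-off is that a longer delay also makes the bulk reply slower.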
That's all for this time.
