
Python Programming: Crawling Company Information from the US Yahoo Finance with BeautifulSoup4


Introduction

This is a follow-up to my previous article (Python Programming: Crawling News Articles with Selenium and BeautifulSoup4).

I now also need to fetch an overview (business description, executives, shareholders, etc.) of the companies that appear in those news articles.

So in this article I implement a Python program that fetches company information in English.
The information source is Yahoo! Finance.

Note: the code and example output below are based on information as of the time of writing (2020/11/02).

What this article covers

I have verified the code with the following versions.
- Python: 3.6.8
- BeautifulSoup4: 4.9.1

What this article does not cover

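  • How to install and use the Python libraries
    • requests
    • BeautifulSoup4
  • How to obtain ticker symbols (the US counterpart of Japanese securities codes)
    • Looking up a ticker symbol from a company name and generating the request URL automatically is not implemented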

Sample code

The code is short, so I show all of it.
There are two key points.

1. Explicit waits

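To avoid putting load on the site you are accessing, implementing a wait (sleep) is a must.
Unlike the previous article, this one does not use Selenium, but whenever you issue requests inside a for loop you should still add a wait so that the program does not send a burst of HTTP requests in a short period of time.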
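A minimal sketch of such a wait, assuming the profile URL pattern used later in this article. The names `build_profile_url`, `crawl`, and `wait_seconds`, and the second ticker `MSFT`, are illustrative assumptions and not part of the original code. A stand-in `fetch` function is passed in so that the demonstration sends no HTTP request.

```python
import time

def build_profile_url(ticker):
    """Build a profile-page URL in the same form as the URL used in this article."""
    return f"https://finance.yahoo.com/quote/{ticker}/profile?p={ticker}"

def crawl(tickers, fetch, wait_seconds=3.0):
    """Call fetch(url) for each ticker, sleeping between consecutive requests."""
    results = {}
    for i, ticker in enumerate(tickers):
        if i > 0:
            time.sleep(wait_seconds)  # pause before every request except the first
        results[ticker] = fetch(build_profile_url(ticker))
    return results

# Stand-in fetch that just returns the URL, so nothing is sent over the network.
# In real use, pass a function that performs the request and parses the page.
pages = crawl(["AAPL", "MSFT"], fetch=lambda url: url, wait_seconds=0.1)
print(pages)
```

The default of three seconds is an arbitrary choice for illustration; pick an interval that suits the target site.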

2. Specifying tag elements

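You need to look at each page's source, work out its tag structure, and specify the elements from which BeautifulSoup4 should extract information.
In most cases this means specifying the class attribute attached to a tag and fetching the target tag (and the text inside it).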
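As a small illustration of class-based lookup, the sketch below runs against a hand-written HTML fragment (the row contents are made up and are not taken from Yahoo! Finance). It shows one behavior worth knowing: a single class name matches any tag that carries that class, while a string of several space-separated classes matches only when the tag's class attribute is exactly that string, in the same order.

```python
from bs4 import BeautifulSoup

# Hand-written fragment for illustration; the values are placeholders.
html = """
<table>
  <tr class="row H(36px)">
    <td class="Ta(start)"><span>Jane Doe</span></td>
    <td class="Ta(start) W(45%)"><span>CEO</span></td>
    <td class="Ta(end)"><span>1.2M</span></td>
  </tr>
</table>
"""

soup = BeautifulSoup(html, "html.parser")
row = soup.find("tr", class_="row H(36px)")

# A single class name matches the first <td> that has that class.
name = row.find("td", class_="Ta(start)").find("span").text

# A multi-class string must equal the attribute value exactly.
title = row.find("td", class_="Ta(start) W(45%)").find("span").text

# The same classes in a different order do not match, so find() returns None.
reordered = row.find("td", class_="W(45%) Ta(start)")

print(name, title, reordered)
```

Because the match depends on the exact class string, a lookup of this kind stops working as soon as the site changes its class names or their order.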

The code

When you run the code, the output of print() is shown on the console.

crawler_yahoo.py
import requests
from bs4 import BeautifulSoup

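def getSoup(url):
  # Fetch the page and parse the response body into a BeautifulSoup tree
  html = requests.get(url)
  # Alternative that uses the standard-library parser instead of lxml:
  #soup = BeautifulSoup(html.content, "html.parser")
  soup = BeautifulSoup(html.content, "lxml")  # requires the lxml package
  return soup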

def getAssetProfile(soup):
  wrapper = soup.find("div", class_="asset-profile-container")
  paragraph = [element.text for element in wrapper.find_all("span", class_="Fw(600)")]
  return paragraph

def getKeyExecutives(soup):
  wrapper = soup.find("section", class_="Bxz(bb) quote-subsection undefined")
  paragraph = []
  for element in wrapper.find_all("tr", class_="C($primaryColor) BdB Bdc($seperatorColor) H(36px)"):
    name = element.find("td", class_="Ta(start)").find("span").text
    title = element.find("td", class_="Ta(start) W(45%)").find("span").text
    pay = element.find("td", class_="Ta(end)").find("span").text
    paragraph.append([name, title, pay])
  return paragraph

def getDescription(soup):
  wrapper = soup.find("section", class_="quote-sub-section Mt(30px)")
  paragraph = [element.text for element in wrapper.find_all("p", class_="Mt(15px) Lh(1.6)")]
  return paragraph

def getMajorHolders(soup):
  wrapper = soup.find("div", class_="W(100%) Mb(20px)")
  paragraph = []
  for element in wrapper.find_all("tr", class_="BdT Bdc($seperatorColor)"):
    share = element.find("td", class_="Py(10px) Va(m) Fw(600) W(15%)").text
    heldby = element.find("td", class_="Py(10px) Ta(start) Va(m)").find("span").text
    paragraph.append([share, heldby])
  return paragraph

def getTopHolders(soup, category):
  # The page is expected to hold two tables with this class; pick one by position
  idx = {'Institutional': 0, 'MutualFund': 1}
  wrapper = soup.find_all("div", class_="Mt(25px) Ovx(a) W(100%)")[idx[category]]
  paragraph = []
  for element in wrapper.find_all("tr", class_="BdT Bdc($seperatorColor) Bgc($hoverBgColor):h Whs(nw) H(36px)"):
    tmp = [element.find("td", class_="Ta(start) Pend(10px)").text, ]
    tmp.extend([col.text for col in element.find_all("td", class_="Ta(end) Pstart(10px)")])
    paragraph.append(tmp)
  return paragraph
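The functions above assume that every `find` call succeeds. `find` returns `None` when nothing matches, so a chained `.find(...)` or `.text` then raises `AttributeError`, for example when the page layout changes. The following is a minimal sketch of a guard; `text_or_default` is a hypothetical helper that is not part of the original code, and the HTML fragment is hand-written for illustration.

```python
from bs4 import BeautifulSoup

def text_or_default(parent, name, class_, default="N/A"):
    """Return the stripped text of the first matching tag, or `default` if absent."""
    if parent is None:
        return default
    tag = parent.find(name, class_=class_)
    return tag.get_text(strip=True) if tag is not None else default

# Hand-written fragment that mimics the container used by getAssetProfile().
html = '<div class="asset-profile-container"><span class="Fw(600)">Technology</span></div>'
soup = BeautifulSoup(html, "html.parser")

wrapper = soup.find("div", class_="asset-profile-container")
sector = text_or_default(wrapper, "span", "Fw(600)")          # tag is present
missing = text_or_default(wrapper, "span", "no-such-class")   # inner tag is absent
no_wrapper = text_or_default(
    soup.find("div", class_="renamed-container"), "span", "Fw(600)"
)  # outer container is absent

print(sector, missing, no_wrapper)
```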

Using Apple (ticker symbol: AAPL), which is in the news for the iPhone 12, as an example, here is how to run the code.
First, the basic information.

soup = getSoup('https://finance.yahoo.com/quote/AAPL/profile?p=AAPL')

profile = getAssetProfile(soup)
print('\r\n'.join(profile))
#profile[0]: Sector(s)
#profile[1]: Industry
#profile[2]: Full Time Employees

The output is as follows.

Technology
Consumer Electronics
147,000

Next, the list of executives.

exs = getKeyExecutives(soup)
#print('\r\n'.join(exs))
for ex in exs:
  print(ex)
  #ex[0]: Name
  #ex[1]: Title
  #ex[2]: Pay

The output is as follows.

['Mr. Timothy D. Cook', 'CEO & Director', '11.56M']
['Mr. Luca  Maestri', 'CFO & Sr. VP', '3.58M']
['Mr. Jeffrey E. Williams', 'Chief Operating Officer', '3.57M']
['Ms. Katherine L. Adams', 'Sr. VP, Gen. Counsel & Sec.', '3.6M']
["Ms. Deirdre  O'Brien", 'Sr. VP of People & Retail', '2.69M']
['Mr. Chris  Kondo', 'Sr. Director of Corp. Accounting', 'N/A']
['Mr. James  Wilson', 'Chief Technology Officer', 'N/A']
['Ms. Mary  Demby', 'Chief Information Officer', 'N/A']
['Ms. Nancy  Paxton', 'Sr. Director of Investor Relations & Treasury', 'N/A']
['Mr. Greg  Joswiak', 'Sr. VP of Worldwide Marketing', 'N/A']

Next, the business description.

desc = getDescription(soup)
print('\r\n'.join(desc))

The output is as follows.

Apple Inc. designs, manufactures, and markets smartphones, personal computers, tablets, wearables, and accessories worldwide. It also sells various related services. The company offers iPhone, a line of smartphones; Mac, a line of personal computers; iPad, a line of multi-purpose tablets; and wearables, home, and accessories comprising AirPods, Apple TV, Apple Watch, Beats products, HomePod, iPod touch, and other Apple-branded and third-party accessories. It also provides AppleCare support services; cloud services store services; and operates various platforms, including the App Store, that allow customers to discover and download applications and digital content, such as books, music, video, games, and podcasts. In addition, the company offers various services, such as Apple Arcade, a game subscription service; Apple Music, which offers users a curated listening experience with on-demand radio stations; Apple News+, a subscription news and magazine service; Apple TV+, which offers exclusive original content; Apple Card, a co-branded credit card; and Apple Pay, a cashless payment service, as well as licenses its intellectual property. The company serves consumers, and small and mid-sized businesses; and the education, enterprise, and government markets. It sells and delivers third-party applications for its products through the App Store. The company also sells its products through its retail and online stores, and direct sales force; and third-party cellular network carriers, wholesalers, retailers, and resellers. Apple Inc. was founded in 1977 and is headquartered in Cupertino, California.

From here the URL changes to the shareholder information page. First, the summary.

soup = getSoup('https://finance.yahoo.com/quote/AAPL/holders?p=AAPL')

holders = getMajorHolders(soup)
for holder in holders:
  print(holder)
  #holder[0]: share
  #holder[1]: heldby

The output is as follows.

['0.07%', '% of Shares Held by All Insider']
['62.12%', '% of Shares Held by Institutions']
['62.16%', '% of Float Held by Institutions']
['4,296', 'Number of Institutions Holding Shares']

Next, the shareholder information (institutional holders).

topholders = getTopHolders(soup, 'Institutional')
for holder in topholders:
  print(holder)
  #holder[0]: Holder
  #holder[1]: Shares
  #holder[2]: Date Reported
  #holder[3]: % Out
  #holder[4]: Value

The output is as follows.

['Vanguard Group, Inc. (The)', '1,315,961,000', 'Jun 29, 2020', '7.69%', '120,015,643,200']
['Blackrock Inc.', '1,101,824,048', 'Jun 29, 2020', '6.44%', '100,486,353,177']
['Berkshire Hathaway, Inc', '980,622,264', 'Jun 29, 2020', '5.73%', '89,432,750,476']
['State Street Corporation', '709,057,472', 'Jun 29, 2020', '4.15%', '64,666,041,446']
['FMR, LLC', '383,300,188', 'Jun 29, 2020', '2.24%', '34,956,977,145']
['Geode Capital Management, LLC', '251,695,416', 'Jun 29, 2020', '1.47%', '22,954,621,939']
['Price (T.Rowe) Associates Inc', '233,087,540', 'Jun 29, 2020', '1.36%', '21,257,583,648']
['Northern Trust Corporation', '214,144,092', 'Jun 29, 2020', '1.25%', '19,529,941,190']
['Norges Bank Investment Management', '187,425,092', 'Dec 30, 2019', '1.10%', '13,759,344,566']
['Bank Of New York Mellon Corporation', '171,219,584', 'Jun 29, 2020', '1.00%', '15,615,226,060']
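Every column comes back as a string, so the figures need converting before they can be sorted or summed. The sketch below converts one row in the five-column layout shown above; `parse_holder_row` and its dictionary keys are hypothetical names chosen for illustration, and the date is left as a string.

```python
def parse_holder_row(row):
    """Convert one [holder, shares, date, pct_out, value] row of strings to typed values."""
    holder, shares, date_reported, pct_out, value = row
    return {
        "holder": holder,
        "shares": int(shares.replace(",", "")),   # drop thousands separators
        "date_reported": date_reported,           # kept as the original string
        "pct_out": float(pct_out.rstrip("%")),    # percentage as a plain number
        "value": int(value.replace(",", "")),
    }

# First row of the output shown above.
row = ['Vanguard Group, Inc. (The)', '1,315,961,000', 'Jun 29, 2020', '7.69%', '120,015,643,200']
parsed = parse_holder_row(row)
print(parsed)
```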

Next, the shareholder information (mutual fund holders).

topholders = getTopHolders(soup, 'MutualFund')
for holder in topholders:
  print(holder)
  #holder[0]: Holder
  #holder[1]: Shares
  #holder[2]: Date Reported
  #holder[3]: % Out
  #holder[4]: Value

The output is as follows.

['Vanguard Total Stock Market Index Fund', '444,698,584', 'Jun 29, 2020', '2.60%', '40,556,510,860']
['Vanguard 500 Index Fund', '338,116,248', 'Jun 29, 2020', '1.98%', '30,836,201,817']
['SPDR S&P 500 ETF Trust', '169,565,200', 'Sep 29, 2020', '0.99%', '19,637,345,812']
['Invesco ETF Tr-Invesco QQQ Tr, Series 1 ETF', '155,032,988', 'Aug 30, 2020', '0.91%', '20,005,456,771']
['Fidelity 500 Index Fund', '145,557,920', 'Aug 30, 2020', '0.85%', '18,782,793,996']
['Vanguard Institutional Index Fund-Institutional Index Fund', '143,016,840', 'Jun 29, 2020', '0.84%', '13,043,135,808']
['iShares Core S&P 500 ETF', '123,444,255', 'Sep 29, 2020', '0.72%', '14,296,079,171']
['Vanguard Growth Index Fund', '123,245,072', 'Jun 29, 2020', '0.72%', '11,239,950,566']
['Vanguard Information Technology Index Fund', '79,770,560', 'Aug 30, 2020', '0.47%', '10,293,593,062']
['Select Sector SPDR Fund-Technology', '69,764,960', 'Sep 29, 2020', '0.41%', '8,079,480,017']

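The information displayed in the web browser is retrieved correctly.
If you collect information on many companies, you might start to see a list of the companies that count a certain famous individual investor among their shareholders, or at least some kind of trend.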
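One way to look for such a pattern, sketched under the assumption that holder rows have already been collected per ticker, is to invert the data into a mapping from holder name to tickers. `index_by_holder` is a hypothetical helper, and the tickers and holder names below are placeholders, not real data.

```python
from collections import defaultdict

def index_by_holder(holders_by_ticker):
    """Map each holder name to the list of tickers whose holder rows mention it."""
    index = defaultdict(list)
    for ticker, rows in holders_by_ticker.items():
        for row in rows:
            holder = row[0]  # the first column of a getTopHolders() row is the holder name
            index[holder].append(ticker)
    return dict(index)

# Placeholder data for illustration only: the tickers and holder names are made up,
# and the remaining columns of each row are omitted.
collected = {
    "TICKER_A": [["Holder X"], ["Holder Y"]],
    "TICKER_B": [["Holder Y"], ["Holder Z"]],
}
by_holder = index_by_holder(collected)
print(by_holder)
```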

Summary

This article showed how to fetch (crawl) company information from Yahoo! Finance with BeautifulSoup4.
