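Introduction
This is a follow-up to my previous article (Python Programming: Fetching (Crawling) News Articles with Selenium and BeautifulSoup4).
I now also need to collect an overview (business description, executives, shareholders, etc.) of the companies that appear in those news articles.
So in this article I use a Python program to fetch company information in English.
The information source this time is **Yahoo! Finance**.
Note: The code and sample output are based on the site as of the time of writing (2020/11/02).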
What this article covers
- Fetching the Profile from Yahoo! Finance
- Fetching the Holders from Yahoo! Finance
I verified the code with the following versions:
- Python: 3.6.8
- BeautifulSoup4: 4.9.1
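What this article does not cover
- How to install and use the Python libraries
  - requests
  - BeautifulSoup4
- How to obtain ticker symbols (Ticker Symbol; the equivalent of securities codes in Japan)
  - Looking up a ticker symbol from a company name and generating the request URL automatically is not implemented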
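Sample code
There isn't much code, so I show all of it.
There are two key points.
1. Explicit waits
To avoid putting load on the target site, implementing a wait (sleep) is a must.
Unlike the previous article, this one does not use Selenium, but when you fetch pages in a for loop you should still add a wait so that the program does not fire a burst of HTTP requests in a short time.
2. Specifying tag elements
You need to look at the source of each page, work out its tag structure, and specify the elements from which BeautifulSoup4 should extract information.
In most cases, this means selecting the target tag (and the text inside it) by the class attribute assigned to it.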
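As a rough sketch of the first point (explicit waits), the loop below pauses between consecutive requests. The name `fetch_all`, the 3-second default, and the injected `fetch` and `sleep` parameters are my own assumptions for illustration and are not part of the article's code. The demo uses a stand-in fetcher so that no real HTTP request is made.

```python
import time


def fetch_all(urls, fetch, wait_seconds=3.0, sleep=time.sleep):
    """Call fetch(url) for each URL, pausing between consecutive requests."""
    results = {}
    for i, url in enumerate(urls):
        if i > 0:
            # Wait before every request except the first one.
            sleep(wait_seconds)
        results[url] = fetch(url)
    return results


# Demo with a stand-in fetcher (hypothetical); no network access happens here.
urls = [
    "https://finance.yahoo.com/quote/AAPL/profile?p=AAPL",
    "https://finance.yahoo.com/quote/AAPL/holders?p=AAPL",
]
pages = fetch_all(urls, fetch=lambda url: "<html>stub</html>", wait_seconds=0.1)
print(len(pages))  # 2
```

In real use, you would pass a page-fetching function such as the `getSoup` function defined in this article as `fetch`, and choose a wait that suits the target site.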
The code
When you run the code, the output of print() appears in the console.
import requests
from bs4 import BeautifulSoup

def getSoup(url):
    # Fetch the page and parse it. The "lxml" parser requires the lxml package.
    html = requests.get(url)
    #soup = BeautifulSoup(html.content, "html.parser")
    soup = BeautifulSoup(html.content, "lxml")
    return soup

def getAssetProfile(soup):
    # Profile page: Sector(s), Industry, Full Time Employees
    wrapper = soup.find("div", class_="asset-profile-container")
    paragraph = [element.text for element in wrapper.find_all("span", class_="Fw(600)")]
    return paragraph

def getKeyExecutives(soup):
    # Profile page: one [name, title, pay] list per executive
    wrapper = soup.find("section", class_="Bxz(bb) quote-subsection undefined")
    paragraph = []
    for element in wrapper.find_all("tr", class_="C($primaryColor) BdB Bdc($seperatorColor) H(36px)"):
        name = element.find("td", class_="Ta(start)").find("span").text
        title = element.find("td", class_="Ta(start) W(45%)").find("span").text
        pay = element.find("td", class_="Ta(end)").find("span").text
        paragraph.append([name, title, pay])
    return paragraph

def getDescription(soup):
    # Profile page: paragraphs of the business description
    wrapper = soup.find("section", class_="quote-sub-section Mt(30px)")
    paragraph = [element.text for element in wrapper.find_all("p", class_="Mt(15px) Lh(1.6)")]
    return paragraph

def getMajorHolders(soup):
    # Holders page: one [share, heldby] list per summary row
    wrapper = soup.find("div", class_="W(100%) Mb(20px)")
    paragraph = []
    for element in wrapper.find_all("tr", class_="BdT Bdc($seperatorColor)"):
        share = element.find("td", class_="Py(10px) Va(m) Fw(600) W(15%)").text
        heldby = element.find("td", class_="Py(10px) Ta(start) Va(m)").find("span").text
        paragraph.append([share, heldby])
    return paragraph

def getTopHolders(soup, category):
    # Holders page: the first table is Institutional, the second is MutualFund
    idx = {'Institutional': 0, 'MutualFund': 1}
    wrapper = soup.find_all("div", class_="Mt(25px) Ovx(a) W(100%)")[idx[category]]
    paragraph = []
    for element in wrapper.find_all("tr", class_="BdT Bdc($seperatorColor) Bgc($hoverBgColor):h Whs(nw) H(36px)"):
        # First column is the holder name; the remaining columns are right-aligned values
        tmp = [element.find("td", class_="Ta(start) Pend(10px)").text]
        tmp.extend([col.text for col in element.find_all("td", class_="Ta(end) Pstart(10px)")])
        paragraph.append(tmp)
    return paragraph
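The functions above assume that every class name is present on the page. `find()` returns `None` when nothing matches, so a change in the site's markup surfaces as an `AttributeError`. Below is a minimal sketch of a more defensive lookup. The helper `find_text` and the HTML fragment are hypothetical and are not Yahoo! Finance's actual markup. The sketch also shows that a single class token matches a tag carrying several classes, whereas a multi-class string is compared against the exact attribute value, so the order matters.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML fragment for illustration only.
SAMPLE_HTML = """
<div class="profile box">
  <span class="label">Sector</span>
  <span class="value bold">Technology</span>
</div>
"""


def find_text(soup, name, class_, default=None):
    """Return the stripped text of the first matching tag, or default if absent."""
    tag = soup.find(name, class_=class_)
    return tag.get_text(strip=True) if tag is not None else default


soup = BeautifulSoup(SAMPLE_HTML, "html.parser")
print(find_text(soup, "span", "value"))           # Technology (single token matches)
print(find_text(soup, "span", "value bold"))      # Technology (exact attribute string)
print(find_text(soup, "span", "bold value"))      # None (same classes, different order)
print(find_text(soup, "span", "missing", "N/A"))  # N/A (fallback instead of an exception)
```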
Using Apple (ticker symbol: AAPL), much talked about for the iPhone 12, as an example, here is how to run the code.
First, the basic information.
soup = getSoup('https://finance.yahoo.com/quote/AAPL/profile?p=AAPL')
profile = getAssetProfile(soup)
print('\r\n'.join(profile))
# profile[0]: Sector(s)
# profile[1]: Industry
# profile[2]: Full Time Employees
Here is the output.
Technology
Consumer Electronics
147,000
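Since `getAssetProfile` returns a bare list, it can be handy to attach labels and convert the employee count to an integer. The following is only a sketch: the function name `label_profile` and the key names are my own, and it assumes the list always holds exactly the three values in the order shown above.

```python
PROFILE_FIELDS = ("sector", "industry", "full_time_employees")


def label_profile(values):
    """Turn [sector, industry, employees] into a dict with a numeric employee count."""
    if len(values) != len(PROFILE_FIELDS):
        raise ValueError(f"expected {len(PROFILE_FIELDS)} values, got {len(values)}")
    record = dict(zip(PROFILE_FIELDS, values))
    # '147,000' -> 147000
    record["full_time_employees"] = int(record["full_time_employees"].replace(",", ""))
    return record


print(label_profile(["Technology", "Consumer Electronics", "147,000"]))
# {'sector': 'Technology', 'industry': 'Consumer Electronics', 'full_time_employees': 147000}
```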
Next, the list of executives.
exs = getKeyExecutives(soup)
# print('\r\n'.join(exs))  # does not work here: each element is a list, not a str
for ex in exs:
    print(ex)
    #ex[0]: Name
    #ex[1]: Title
    #ex[2]: Pay
Here is the output.
['Mr. Timothy D. Cook', 'CEO & Director', '11.56M']
['Mr. Luca Maestri', 'CFO & Sr. VP', '3.58M']
['Mr. Jeffrey E. Williams', 'Chief Operating Officer', '3.57M']
['Ms. Katherine L. Adams', 'Sr. VP, Gen. Counsel & Sec.', '3.6M']
["Ms. Deirdre O'Brien", 'Sr. VP of People & Retail', '2.69M']
['Mr. Chris Kondo', 'Sr. Director of Corp. Accounting', 'N/A']
['Mr. James Wilson', 'Chief Technology Officer', 'N/A']
['Ms. Mary Demby', 'Chief Information Officer', 'N/A']
['Ms. Nancy Paxton', 'Sr. Director of Investor Relations & Treasury', 'N/A']
['Mr. Greg Joswiak', 'Sr. VP of Worldwide Marketing', 'N/A']
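The Pay column comes back as strings such as '11.56M' or 'N/A'. If you want to sort or aggregate by pay, a conversion along these lines may help. This is a sketch under assumptions: `parse_pay` is a hypothetical helper, and only the 'M' suffix and 'N/A' appear in the output above, so the 'K' and 'B' suffixes are my guess at what other pages might contain.

```python
from decimal import Decimal

# 'M' appears in the output above; 'K' and 'B' are assumed for completeness.
_MULTIPLIERS = {"K": 10**3, "M": 10**6, "B": 10**9}


def parse_pay(text):
    """Convert a string such as '11.56M' to an integer; return None for 'N/A'."""
    text = text.strip()
    if text in ("", "N/A"):
        return None
    suffix = text[-1].upper()
    if suffix in _MULTIPLIERS:
        # Decimal avoids binary floating-point rounding (e.g. 11.56 * 1e6).
        return int(Decimal(text[:-1].replace(",", "")) * _MULTIPLIERS[suffix])
    return int(Decimal(text.replace(",", "")))


print(parse_pay("11.56M"))  # 11560000
print(parse_pay("N/A"))     # None
```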
Next, the business description.
desc = getDescription(soup)
print('\r\n'.join(desc))
Here is the output.
Apple Inc. designs, manufactures, and markets smartphones, personal computers, tablets, wearables, and accessories worldwide. It also sells various related services. The company offers iPhone, a line of smartphones; Mac, a line of personal computers; iPad, a line of multi-purpose tablets; and wearables, home, and accessories comprising AirPods, Apple TV, Apple Watch, Beats products, HomePod, iPod touch, and other Apple-branded and third-party accessories. It also provides AppleCare support services; cloud services store services; and operates various platforms, including the App Store, that allow customers to discover and download applications and digital content, such as books, music, video, games, and podcasts. In addition, the company offers various services, such as Apple Arcade, a game subscription service; Apple Music, which offers users a curated listening experience with on-demand radio stations; Apple News+, a subscription news and magazine service; Apple TV+, which offers exclusive original content; Apple Card, a co-branded credit card; and Apple Pay, a cashless payment service, as well as licenses its intellectual property. The company serves consumers, and small and mid-sized businesses; and the education, enterprise, and government markets. It sells and delivers third-party applications for its products through the App Store. The company also sells its products through its retail and online stores, and direct sales force; and third-party cellular network carriers, wholesalers, retailers, and resellers. Apple Inc. was founded in 1977 and is headquartered in Cupertino, California.
From here on the URL changes, and we look at shareholder information. First, the summary.
soup = getSoup('https://finance.yahoo.com/quote/AAPL/holders?p=AAPL')
holders = getMajorHolders(soup)
for holder in holders:
    print(holder)
    #holder[0]: share
    #holder[1]: heldby
Here is the output.
['0.07%', '% of Shares Held by All Insider']
['62.12%', '% of Shares Held by Institutions']
['62.16%', '% of Float Held by Institutions']
['4,296', 'Number of Institutions Holding Shares']
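The profile and holders URLs used so far differ only in the page name, so a small helper can build them from a ticker symbol. Treat this as a sketch: `build_url` is a hypothetical name, and I am assuming that other tickers follow the same URL pattern as the two AAPL URLs in this article.

```python
from urllib.parse import quote

BASE_URL = "https://finance.yahoo.com/quote"
PAGES = ("profile", "holders")  # the two pages used in this article


def build_url(ticker, page):
    """Build a URL of the same shape as the AAPL URLs used in this article."""
    if page not in PAGES:
        raise ValueError(f"unsupported page: {page!r}")
    # Percent-encode the symbol in case it contains special characters.
    symbol = quote(ticker.strip().upper(), safe="")
    return f"{BASE_URL}/{symbol}/{page}?p={symbol}"


print(build_url("AAPL", "profile"))
# https://finance.yahoo.com/quote/AAPL/profile?p=AAPL
print(build_url("aapl", "holders"))
# https://finance.yahoo.com/quote/AAPL/holders?p=AAPL
```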
Next, shareholder information (institutional holders).
topholders = getTopHolders(soup, 'Institutional')
for holder in topholders:
    print(holder)
    #holder[0]: Holder
    #holder[1]: Shares
    #holder[2]: Date Reported
    #holder[3]: % Out
    #holder[4]: Value
Here is the output.
['Vanguard Group, Inc. (The)', '1,315,961,000', 'Jun 29, 2020', '7.69%', '120,015,643,200']
['Blackrock Inc.', '1,101,824,048', 'Jun 29, 2020', '6.44%', '100,486,353,177']
['Berkshire Hathaway, Inc', '980,622,264', 'Jun 29, 2020', '5.73%', '89,432,750,476']
['State Street Corporation', '709,057,472', 'Jun 29, 2020', '4.15%', '64,666,041,446']
['FMR, LLC', '383,300,188', 'Jun 29, 2020', '2.24%', '34,956,977,145']
['Geode Capital Management, LLC', '251,695,416', 'Jun 29, 2020', '1.47%', '22,954,621,939']
['Price (T.Rowe) Associates Inc', '233,087,540', 'Jun 29, 2020', '1.36%', '21,257,583,648']
['Northern Trust Corporation', '214,144,092', 'Jun 29, 2020', '1.25%', '19,529,941,190']
['Norges Bank Investment Management', '187,425,092', 'Dec 30, 2019', '1.10%', '13,759,344,566']
['Bank Of New York Mellon Corporation', '171,219,584', 'Jun 29, 2020', '1.00%', '15,615,226,060']
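To keep rows like these for later analysis, one option is to convert the numeric columns and write them out as CSV. The sketch below uses only the standard library. The field names and the helpers `to_record` and `write_csv` are my own, and it assumes every row has the five columns shown above. The sample rows are copied from the output above.

```python
import csv
import io

HOLDER_FIELDS = ["holder", "shares", "date_reported", "pct_out", "value"]


def to_record(row):
    """Convert one row from getTopHolders into a dict with integer counts."""
    holder, shares, date_reported, pct_out, value = row
    return {
        "holder": holder,
        "shares": int(shares.replace(",", "")),
        "date_reported": date_reported,
        "pct_out": pct_out,
        "value": int(value.replace(",", "")),
    }


def write_csv(rows, stream):
    """Write the rows to a file-like object as CSV with a header line."""
    writer = csv.DictWriter(stream, fieldnames=HOLDER_FIELDS)
    writer.writeheader()
    for row in rows:
        writer.writerow(to_record(row))


rows = [
    ['Vanguard Group, Inc. (The)', '1,315,961,000', 'Jun 29, 2020', '7.69%', '120,015,643,200'],
    ['Blackrock Inc.', '1,101,824,048', 'Jun 29, 2020', '6.44%', '100,486,353,177'],
]
buffer = io.StringIO()
write_csv(rows, buffer)
print(buffer.getvalue())
```

When writing to a real file, open it with `newline=""` as the `csv` module documentation recommends.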
Next, shareholder information (mutual fund holders).
topholders = getTopHolders(soup, 'MutualFund')
for holder in topholders:
    print(holder)
    #holder[0]: Holder
    #holder[1]: Shares
    #holder[2]: Date Reported
    #holder[3]: % Out
    #holder[4]: Value
Here is the output.
['Vanguard Total Stock Market Index Fund', '444,698,584', 'Jun 29, 2020', '2.60%', '40,556,510,860']
['Vanguard 500 Index Fund', '338,116,248', 'Jun 29, 2020', '1.98%', '30,836,201,817']
['SPDR S&P 500 ETF Trust', '169,565,200', 'Sep 29, 2020', '0.99%', '19,637,345,812']
['Invesco ETF Tr-Invesco QQQ Tr, Series 1 ETF', '155,032,988', 'Aug 30, 2020', '0.91%', '20,005,456,771']
['Fidelity 500 Index Fund', '145,557,920', 'Aug 30, 2020', '0.85%', '18,782,793,996']
['Vanguard Institutional Index Fund-Institutional Index Fund', '143,016,840', 'Jun 29, 2020', '0.84%', '13,043,135,808']
['iShares Core S&P 500 ETF', '123,444,255', 'Sep 29, 2020', '0.72%', '14,296,079,171']
['Vanguard Growth Index Fund', '123,245,072', 'Jun 29, 2020', '0.72%', '11,239,950,566']
['Vanguard Information Technology Index Fund', '79,770,560', 'Aug 30, 2020', '0.47%', '10,293,593,062']
['Select Sector SPDR Fund-Technology', '69,764,960', 'Sep 29, 2020', '0.41%', '8,079,480,017']
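The information shown in the web browser has been retrieved correctly.
If you collect this information for many companies, you might start to see a list of, or a trend among, the companies that count a certain famous individual investor among their shareholders...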
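As a starting point for that idea, the results of `getTopHolders` could be kept per ticker and searched by holder name. This is only a sketch: `tickers_held_by` is a hypothetical helper, and the sample data holds just the first three AAPL rows from the output above. In practice you would fill the dictionary by looping over many tickers, with a wait between requests.

```python
def tickers_held_by(holders_by_ticker, keyword):
    """Return the tickers whose holder list contains a name matching keyword."""
    keyword = keyword.lower()
    return sorted(
        ticker
        for ticker, rows in holders_by_ticker.items()
        # row[0] is the holder name
        if any(keyword in row[0].lower() for row in rows)
    )


holders_by_ticker = {
    "AAPL": [
        ['Vanguard Group, Inc. (The)', '1,315,961,000', 'Jun 29, 2020', '7.69%', '120,015,643,200'],
        ['Blackrock Inc.', '1,101,824,048', 'Jun 29, 2020', '6.44%', '100,486,353,177'],
        ['Berkshire Hathaway, Inc', '980,622,264', 'Jun 29, 2020', '5.73%', '89,432,750,476'],
    ],
}
print(tickers_held_by(holders_by_ticker, "berkshire"))  # ['AAPL']
```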
Summary
This article showed how to fetch (crawl) company information from Yahoo! Finance using BeautifulSoup4.