
UL Systems Advent Calendar 2017

@silva0215 (Masaru Morita) in UL Systems, Inc.

[Python] Using BeautifulSoup to find out Kaggle's total prize money

Posted at 2017-12-07, last updated at 2017-12-07

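Introduction

This is day 8 of the ulgeek Advent Calendar.

In this post I use the web pages of Kaggle, which hosts data science competitions, to find out how much prize money has been offered so far.
More than 270 competitions have been held, and clicking through each URL and copy-pasting the values by hand would be tedious, so I use BeautifulSoup.
Finally, I total the prize amounts from the scraped results to complete the investigation.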

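Caveats

Opening with caveats may feel abrupt, but they matter, so I cover them first.

First, when crawling web pages, check that the target site does not prohibit crawler access. Checking the contents of robots.txt is a good way to do this.
For reference, https://www.kaggle.com/robots.txt returned a 404 when I checked, so there did not appear to be any specific rules.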
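
The robots.txt check described above can also be done in code. The following is a minimal sketch using the standard library's `urllib.robotparser`; the robots.txt content, the `my-crawler` user agent, and the `example.com` URLs are hypothetical and only serve to illustrate the call sequence.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content, used only for illustration.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
"""


def is_allowed(robots_txt, user_agent, url):
    """Return True if the given robots.txt text permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    print(is_allowed(SAMPLE_ROBOTS_TXT, "my-crawler", "https://example.com/c/titanic"))
    print(is_allowed(SAMPLE_ROBOTS_TXT, "my-crawler", "https://example.com/private/data"))
```

With this sample content the first call prints True and the second prints False. Against a real site, the robots.txt text would first have to be downloaded, which this sketch deliberately leaves out.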

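Also, consider the load you place on the target site: avoid sending a large number of requests at once, for example by sleeping between requests.

There has actually been a case in which someone was arrested over web crawling (although the case was not prosecuted),
so you should understand that this can happen depending on the circumstances.
When crawling, be very careful about how you do it, and do so at your own risk.
I recommend reading the site below.

Reference: https://creasys.org/rambo/articles/84fc91dd1071f59e83e3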

What I used

Let's get started. First, here is what I used.

Python 3.6.3 (Anaconda3)
beautifulsoup4 4.6.0
requests 2.18.4
bash on windows

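Preparing the list of target competition URLs

Create the list of URLs to fetch as a text file.
I got the URLs used here by copying them from Kaggle's competition list with Chrome's developer tools. No complaints about that part being copy-paste, please.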

url-list.txt

https://www.kaggle.com/c/imagenet-object-detection-challenge
https://www.kaggle.com/c/imagenet-object-localization-challenge
https://www.kaggle.com/c/imagenet-object-detection-from-video-challenge
https://www.kaggle.com/c/titanic
https://www.kaggle.com/c/house-prices-advanced-regression-techniques
https://www.kaggle.com/c/digit-recognizer
https://www.kaggle.com/c/dog-breed-identification
https://www.kaggle.com/c/statoil-iceberg-classifier-challenge
https://www.kaggle.com/c/zillow-prize-1
...

Source code

This is the source code for fetching the data (scrape.py).

scrape.py

import sys
import requests
import json
import time
from bs4 import BeautifulSoup

args = sys.argv
urlListFile = args[1]

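# Each line of the URL list file holds one competition URL
for line in open(urlListFile, 'r'):
	time.sleep(5)  # wait between requests to keep the load on the site low
	url = line.strip()  # drop the trailing newline
	res = requests.get(url)

	# The competition data is embedded as JSON in a script tag inside this div
	soup = BeautifulSoup(res.content, 'html.parser')
	text = soup.body.find("div",attrs={"class":"site-layout__main-content"}).script.text
	data = text[text.find("({")+1:text.rfind("})")+1]
	jsn = json.loads(data)
	
	print(jsn["competitionTitle"] + "," + jsn["dateEnabled"] + "," + str(jsn["rewardQuantity"]))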

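Nothing here is particularly difficult, so it was quick to write.
It is nice that the library is so easy to use.

Now I will briefly walk through the source code.
First, import the libraries used.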

scrape.py

import sys
import requests
import json
import time
from bs4 import BeautifulSoup

Read the URL list created earlier (url-list.txt) and loop once per line of the list.

scrape.py

args = sys.argv
urlListFile = args[1]

for line in open(urlListFile, 'r'):
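
Lines read from a file keep their trailing newline, and a hand-edited list may contain blank lines. As a hedged sketch of one way to handle both, the helper below (the name `read_urls` is hypothetical and not part of scrape.py) strips whitespace, skips empty lines, and closes the file when done.

```python
import os
import tempfile


def read_urls(path):
    """Return the non-empty lines of the file at path, stripped of whitespace."""
    with open(path, "r", encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]


if __name__ == "__main__":
    # Write a small temporary URL list so the sketch runs on its own.
    with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False,
                                     encoding="utf-8") as tmp:
        tmp.write("https://www.kaggle.com/c/titanic\n\n"
                  "https://www.kaggle.com/c/digit-recognizer\n")
    print(read_urls(tmp.name))
    os.remove(tmp.name)
```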

The rest is the body of the for loop.
From here on, we fetch information from the web pages.

First, issue a GET request to the URL.
I added sleep(5) before the GET.

scrape.py

	time.sleep(5)
	url = line.strip()  # drop the trailing newline
	res = requests.get(url)
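
The pause between requests can also be factored out of the loop body. The sketch below is one possible arrangement, not the article's code: `throttled` is a hypothetical helper, and unlike scrape.py it does not sleep before the first URL. The `sleep` parameter exists only so the delay can be replaced in tests.

```python
import time


def throttled(urls, delay_seconds=5.0, sleep=time.sleep):
    """Yield each URL, pausing delay_seconds before every one after the first."""
    for index, url in enumerate(urls):
        if index > 0:
            sleep(delay_seconds)
        yield url


if __name__ == "__main__":
    # Placeholder URLs; a real loop would call requests.get(url) here.
    for url in throttled(["https://example.com/a", "https://example.com/b"],
                         delay_seconds=0.1):
        print("would fetch", url)
```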

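Create a BeautifulSoup object from the response. I got a warning about the HTML parser, so I specified the parser explicitly.
From the BeautifulSoup object, extract the script written inside the target div tag by specifying the div's class.
Depending on what you want to retrieve, you would change the arguments passed to soup.body.find(), or change how you look up tags, for example with soup.find_all("a").
For more details, see the documentation.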
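
To make the difference between `find()` and `find_all()` concrete, here is a small sketch that runs against an inline HTML string. The markup, the `main` class name, and the link paths are made up for illustration and are not Kaggle's actual page structure. It uses `.string`, which returns the single text child of the script tag.

```python
from bs4 import BeautifulSoup

# Hypothetical HTML, not Kaggle's real markup.
HTML = """
<html><body>
  <div class="main"><script>init({"title": "demo"});</script></div>
  <a href="/c/one">One</a>
  <a href="/c/two">Two</a>
</body></html>
"""

soup = BeautifulSoup(HTML, "html.parser")

# find() returns the first matching tag, or None if nothing matches.
div = soup.body.find("div", attrs={"class": "main"})
script_text = div.script.string

# find_all() returns a list of every matching tag.
hrefs = [a["href"] for a in soup.find_all("a")]

if __name__ == "__main__":
    print(script_text)
    print(hrefs)
```

Because `find()` returns None when the class is not present, chaining `.script` directly onto it raises an AttributeError if the page layout changes; checking for None first is a reasonable precaution.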

In this case the data I want is available from the JSON embedded in the script, so I cut it out and parse it with json.loads.

scrape.py

	soup = BeautifulSoup(res.content, 'html.parser')
	text = soup.body.find("div",attrs={"class":"site-layout__main-content"}).script.text
	data = text[text.find("({")+1:text.rfind("})")+1]
	jsn = json.loads(data)
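
The slice works by locating the first "({" and the last "})" in the script text, then keeping everything from the opening brace through the closing brace. The sketch below isolates that step in a hypothetical helper, `extract_json`, and adds an explicit error when the markers are missing. The sample script text and its `Some.init` wrapper are invented for illustration.

```python
import json


def extract_json(script_text):
    """Parse the object literal found between the first "({" and the last "})"."""
    start = script_text.find("({")
    end = script_text.rfind("})")
    if start == -1 or end == -1 or end < start:
        raise ValueError("no ({...}) payload found in script text")
    # Skip the "(" and keep the closing "}" so the slice is exactly "{...}".
    return json.loads(script_text[start + 1:end + 1])


if __name__ == "__main__":
    # Hypothetical script body; the wrapper function name is made up.
    sample = 'Some.init({"competitionTitle": "Demo", "rewardQuantity": 100.0});'
    print(extract_json(sample))
```

Without the guard, a missing marker makes `find()` return -1, and the slice then produces text that `json.loads` rejects with a less helpful error.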

Print the required fields from the JSON, separated by commas.

scrape.py

	print(jsn["competitionTitle"] + "," + jsn["dateEnabled"] + "," + str(jsn["rewardQuantity"]))
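
Joining fields with "," works as long as no competition title contains a comma. As a hedged alternative, the standard `csv` module quotes such fields automatically. The helper name `to_csv_line` and the sample title below are hypothetical.

```python
import csv
import io


def to_csv_line(title, date_enabled, reward):
    """Format one record as a CSV line, quoting fields that contain commas."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerow([title, date_enabled, reward])
    return buffer.getvalue()


if __name__ == "__main__":
    # Hypothetical title containing a comma.
    print(to_csv_line("Cats, Dogs and More", "2017-01-01T00:00:00Z", 1000.0), end="")
```

For the sample above the title is emitted inside double quotes, so a CSV reader still sees exactly three columns.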

That is all for the source code walkthrough.
The JSON retrieved with BeautifulSoup looks like this.

Contents of the retrieved JSON (example)

{
 "activeTab": "overview",
 "competitionId": 6775,
 "competitionType": "prediction",
 "competitionTitle": "Passenger Screening Algorithm Challenge",
 "briefDescription": "Improve the accuracy of the Department of Homeland Security's threat recognition algorithms",
 "competitionHeaderImageUrl": "https://kaggle2.blob.core.windows.net/competitions/kaggle/6775/logos/header.png",
 "organizationId": 384,
 "organizationName": "Department of Homeland Security",
 "organizationSlug": "dhs",
 "organizationThumbnailUrl": "https://kaggle2.blob.core.windows.net/organizations/384/thumbnail.png%3Fr=232",
 "hasAcceptedRules": false,
 "pageMessages": [],
 "dateEnabled": "2017-06-22T16:00:23.557Z",
 "deadline": "2017-12-15T23:59:00Z",
 "mergerDeadline": "2017-12-04T23:59:00Z",
 "newEntrantDeadline": "2017-12-04T23:59:00Z",
 "rewardQuantity": 1500000.0,
 "rewardTypeName": "USD",
 "totalTeams": 492,
 ...
}

Running it

Run scrape.py, passing the URL list file (url-list.txt) as the argument.
The results are written to result.csv.

$ python scrape.py url-list.txt >> result.csv

Results

As shown below, the following columns are written to the CSV file:

"competition title","competition start date and time","prize amount"

result.csv

Titanic: Machine Learning from Disaster,2012-09-28T21:13:33.55Z,0.0
House Prices: Advanced Regression Techniques,2016-08-30T01:08:56.763Z,0.0
Digit Recognizer,2012-07-25T20:43:30.087Z,0.0
Dog Breed Identification,2017-09-29T14:36:29.23Z,0.0
Statoil/C-CORE Iceberg Classifier Challenge,2017-10-23T21:23:33.377Z,50000.0
Zillow Prize: Zillow’s Home Value Prediction (Zestimate),2017-05-24T12:00:13.743Z,1200000.0
TensorFlow Speech Recognition Challenge,2017-11-15T18:16:57.437Z,25000.0
Corporación Favorita Grocery Sales Forecasting,2017-10-19T19:16:44.577Z,30000.0
WSDM - KKBox's Churn Prediction Challenge,2017-09-18T21:36:13.53Z,5000.0
WSDM - KKBox's Music Recommendation Challenge,2017-09-27T19:59:31.247Z,5000.0
Passenger Screening Algorithm Challenge,2017-06-22T16:00:23.557Z,1500000.0
Spooky Author Identification,2017-10-25T17:16:51.653Z,25000.0
Cdiscount’s Image Classification Challenge,2017-09-14T16:57:57.44Z,35000.0
Porto Seguro’s Safe Driver Prediction,2017-09-29T16:24:29.747Z,25000.0
Text Normalization Challenge - English Language,2017-09-05T17:18:48.813Z,25000.0
Text Normalization Challenge - Russian Language,2017-09-05T17:19:00.8Z,25000.0
︙

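Aggregate results

I aggregated the prize money from the scraped results.

Total prize money: $9,770,375 (about 1 billion yen, excluding competitions with undisclosed prizes)
Average: $48,130 (about 5 million yen)
Maximum: $1,500,000 (about 160 million yen)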
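
The article does not show how the aggregation was done, so the following is only a sketch of one way to compute such figures from result.csv with the standard library. The sample rows are copied from the result.csv excerpt above; the function name `summarize` is hypothetical, and the code assumes every line has exactly three comma-separated fields.

```python
import csv
import io
from collections import defaultdict

# A few rows copied from the result.csv excerpt in this article.
SAMPLE_CSV = """\
Titanic: Machine Learning from Disaster,2012-09-28T21:13:33.55Z,0.0
Statoil/C-CORE Iceberg Classifier Challenge,2017-10-23T21:23:33.377Z,50000.0
Passenger Screening Algorithm Challenge,2017-06-22T16:00:23.557Z,1500000.0
"""


def summarize(csv_text):
    """Return the total, the maximum, and per-year totals of the reward column."""
    by_year = defaultdict(float)
    rewards = []
    for title, date_enabled, reward in csv.reader(io.StringIO(csv_text)):
        amount = float(reward)
        rewards.append(amount)
        by_year[date_enabled[:4]] += amount  # the first 4 characters are the year
    return sum(rewards), max(rewards), dict(by_year)


if __name__ == "__main__":
    total, highest, by_year = summarize(SAMPLE_CSV)
    print(total, highest, by_year)
```

For the three sample rows this gives a total of 1550000.0 and a maximum of 1500000.0, with the whole amount falling in 2017.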

I also aggregated the prize money by year.

Year	Reward
2010	$19,417
2011	$590,960
2012	$776,158
2013	$970,160
2014	$811,180
2015	$1,147,500
2016	$1,025,000
2017	$4,430,000
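
The yearly figures can be cross-checked against the overall total with a few lines of arithmetic. The sketch below copies the values from the table above; the names `REWARD_BY_YEAR` and `share_of_total` are hypothetical.

```python
# Yearly totals in USD, copied from the table above.
REWARD_BY_YEAR = {
    2010: 19_417,
    2011: 590_960,
    2012: 776_158,
    2013: 970_160,
    2014: 811_180,
    2015: 1_147_500,
    2016: 1_025_000,
    2017: 4_430_000,
}


def share_of_total(rewards, year):
    """Return the given year's fraction of the grand total."""
    return rewards[year] / sum(rewards.values())


if __name__ == "__main__":
    print(sum(REWARD_BY_YEAR.values()))
    print(round(share_of_total(REWARD_BY_YEAR, 2017) * 100, 1))
```

The sum comes to 9770375, matching the reported total, and 2017 accounts for 45.3% of it.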

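That is about 1 billion yen over roughly eight years, and nearly half of it (about 45%) is this year's prize money.
The main reason seems to be that the three largest prizes to date were all offered in competitions held in 2017.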

1st: $1,500,000
2nd: $1,200,000
3rd: $1,000,000

Kaggle seems to have really taken off this year.
That concludes the investigation.
