More than 3 years have passed since last update.

プログラム経験２ヶ月の私が１週間で機械学習アプリをつくってみた

Last updated at 2022-07-21Posted at 2022-07-21

はじめに

1. 機械学習アプリを作ろうと思ったきっかけ

近頃サッカーの試合を見ていると、ボール支配率だけではなく最終スコアの予測が試合中に常に更新され続けている。海外のチームによってはこのようなスコア期待値を上回ることを目標としていることもあるらしく、このようなサッカーのデータ分析に興味を持ったため。

２. 使用言語、ライブラリの選定理由

機械学習用のライブラリが非常に充実しているpythonを利用する。
また、Webアプリの作成には、pythonだけで完成するstreamlitを利用してなるべく短時間でアプリを作ることを目的とする。

3. Statsbombについて

本分析、およびアプリ作成においてはサッカーのデータ分析の会社であるStatsBombが提供しているオープンデータを使用する。ありがとうございます！
以下のリンクにてクレジットやデータの内容が確認できる。
https://github.com/statsbomb/statsbombpy

余談だがサッカーデータ分析の講座もあるようなので、ちょっと見てみたい。やったことある人いませんか。

4. アプリ構築の流れ

機械学習モデルの構築→Githubでレポジトリ作成→Streamlitを用いてアプリ作成→デプロイ

機械学習モデルの構築

statsbomb.py及び必要なモジュールをインストールする

import pandas as pd
import numpy as np
from pandas import Series, DataFrame
import matplotlib.pyplot as plt
import matplotlib as mpl
import seaborn as sns
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC, LinearSVC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingClassifier
from sklearn.model_selection import cross_val_score
from sklearn.preprocessing import StandardScaler
%matplotlib inline
pip install statsbombpy
from statsbombpy import sb

データの確認

help関数を用いてこのモジュールに何が入っているのかを確認する(Gitを見ればよいのだが)

help(sb)

Help on module statsbombpy.sb in statsbombpy:

NAME
    statsbombpy.sb

FUNCTIONS
    competition_events(country: str, division: str, season: str, gender: str = 'male', split: bool = False, filters: dict = {}, fmt: str = 'dataframe', creds: dict = {'user': None, 'passwd': None}) -> Union[pandas.core.frame.DataFrame, dict]
    
    competition_frames(country: str, division: str, season: str, gender: str = 'male', fmt: str = 'dataframe', creds: dict = {'user': None, 'passwd': None}) -> Union[pandas.core.frame.DataFrame, dict]
    
    competitions(fmt='dataframe', creds: dict = {'user': None, 'passwd': None})
    
    events(match_id: int, split: bool = False, filters: dict = {}, fmt: str = 'dataframe', flatten_attrs: bool = True, creds: dict = {'user': None, 'passwd': None}) -> Union[pandas.core.frame.DataFrame, dict]
    
    frames(match_id: int, fmt: str = 'dataframe', creds: dict = {'user': None, 'passwd': None}) -> Union[pandas.core.frame.DataFrame, list, dict]
    
    lineups(match_id, fmt='dataframe', creds: dict = {'user': None, 'passwd': None})
    
    matches(competition_id: int, season_id: int, fmt='dataframe', creds: dict = {'user': None, 'passwd': None})
    
    player_match_stats(match_id: int, fmt: str = 'dataframe', creds: dict = {'user': None, 'passwd': None}) -> Union[pandas.core.frame.DataFrame, dict]
    
    player_season_stats(competition_id: int, season_id: int, fmt='dataframe', creds: dict = {'user': None, 'passwd': None}) -> Union[pandas.core.frame.DataFrame, dict]
    
    team_season_stats(competition_id: int, season_id: int, fmt='dataframe', creds: dict = {'user': None, 'passwd': None}) -> Union[pandas.core.frame.DataFrame, dict]

DATA
    DEFAULT_CREDS = {'passwd': None, 'user': None}
    PARALLELL_CALLS_NUM = 4
    Union = typing.Union
        Union type; Union[X, Y] means either X or Y.
        
        To define a union, use e.g. Union[int, str].  Details:
        - The arguments must be types and there must be at least one.
        - None as an argument is a special case and is replaced by
          type(None).
        - Unions of unions are flattened, e.g.::
        
            Union[Union[int, str], float] == Union[int, str, float]
        
        - Unions of a single argument vanish, e.g.::
        
            Union[int] == int  # The constructor actually returns int
        
        - Redundant arguments are skipped, e.g.::
        
            Union[int, str, int] == Union[int, str]
        
        - When comparing unions, the argument order is ignored, e.g.::
        
            Union[int, str] == Union[str, int]
        
        - You cannot subclass or instantiate a union.
        - You can use Optional[X] as a shorthand for Union[X, None].

FILE
    /opt/anaconda3/lib/python3.9/site-packages/statsbombpy/sb.py

competition_events、competition_frames、competitions、events、frames、lineups、matches、player_match_stats、player_season_stats、team_season_stats
以上９つの関数が入っており、それぞれがDataFrameを返すことがわかる。また、有料会員登録してuserとpasswaordを入力すればOpen data以外のデータにアクセスできるようになっているようだ。

コンペティション一覧の表示

competitions = sb.competitions()
competitions.head()

	competition_id	season_id	country_name	competition_name	competition_gender	competition_youth	competition_international	season_name	match_updated	match_updated_360	match_available_360	match_available
0	16	4	Europe	Champions League	male	False	False	2018/2019	2021-08-27T11:26:39.802832	2021-06-13T16:17:31.694	None	2021-07-09T14:06:05.802
1	16	1	Europe	Champions League	male	False	False	2017/2018	2021-08-27T11:26:39.802832	2021-06-13T16:17:31.694	None	2021-01-23T21:55:30.425330
2	16	2	Europe	Champions League	male	False	False	2016/2017	2021-08-27T11:26:39.802832	2021-06-13T16:17:31.694	None	2020-07-29T05:00
3	16	27	Europe	Champions League	male	False	False	2015/2016	2021-08-27T11:26:39.802832	2021-06-13T16:17:31.694	None	2020-07-29T05:00
4	16	26	Europe	Champions League	male	False	False	2014/2015	2021-08-27T11:26:39.802832	2021-06-13T16:17:31.694	None	2020-07-29T05:00

プレミアリーグ、チャンピョンズリーグ、FIFAワールドカップ、ラリーガなどのコンペティションのデータが格納されていることが確認できる。
ラリーガのデータが16シーズン分用意されており、今回はラリーガのデータを利用することにする。

特定シーズンに含まれるデータの表示

今回はラリーガのデータを使用することにしたため、competitionが１１のものを抜き出していく。　

sb.matches(competition_id=11, season_id=90).info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 35 entries, 0 to 34
Data columns (total 22 columns):
 #   Column                 Non-Null Count  Dtype 
---  ------                 --------------  ----- 
 0   match_id               35 non-null     int64 
 1   match_date             35 non-null     object
 2   kick_off               35 non-null     object
 3   competition            35 non-null     object
 4   season                 35 non-null     object
 5   home_team              35 non-null     object
 6   away_team              35 non-null     object
 7   home_score             35 non-null     int64 
 8   away_score             35 non-null     int64 
 9   match_status           35 non-null     object
 10  match_status_360       35 non-null     object
 11  last_updated           35 non-null     object
 12  last_updated_360       35 non-null     object
 13  match_week             35 non-null     int64 
 14  competition_stage      35 non-null     object
 15  stadium                35 non-null     object
 16  referee                30 non-null     object
 17  home_managers          35 non-null     object
 18  away_managers          35 non-null     object
 19  data_version           35 non-null     object
 20  shot_fidelity_version  35 non-null     object
 21  xy_fidelity_version    35 non-null     object
dtypes: int64(4), object(18)
memory usage: 6.1+ KB

matchesには試合のチームや監督、スタジアム、スコアといった基本的な情報が含まれていることがわかる。

試合に含まれるイベントデータの確認

それでは、一つの試合を取り出して何がeventデータとして保存されているのかをみてみよう。

events = sb.events(match_id=3773369)
events.head()

/opt/anaconda3/lib/python3.9/site-packages/statsbombpy/api_client.py:20: NoAuthWarning:

credentials were not supplied. open data access only

	ball_receipt_outcome	ball_recovery_recovery_failure	block_deflection	block_offensive	carry_end_location	clearance_aerial_won	clearance_body_part	clearance_head	clearance_left_foot	clearance_other	...	shot_statsbomb_xg	shot_technique	shot_type	substitution_outcome	substitution_replacement	tactics	team	timestamp	type	under_pressure
0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	{'formation': 3421, 'lineup': [{'player': {'id...	Barcelona	00:00:00.000	Starting XI	NaN
1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	{'formation': 352, 'lineup': [{'player': {'id'...	Huesca	00:00:00.000	Starting XI	NaN
2	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	Huesca	00:00:00.000	Half Start	NaN
3	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	Barcelona	00:00:00.000	Half Start	NaN
4	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	Barcelona	00:00:00.000	Half Start	NaN

5 rows × 89 columns

パスやドリブル、ブロックといった試合の中でのやり取りの一つ一つが時間や選手名、そのプレーの結果とともに収められていることがわかる。

ラリーガ2015-2020のデータを抽出

それではデータの中身がわかってきたところで、今回行いたい機械学習のためにラリーガ2015-2020のデータを抜き出して一つのDataFrameとして扱えるようにしよう。

seasons=["2015/2016", "2016/2017", "2017/2018", "2018/2019", "2019/2020"]
grouped_events = pd.DataFrame()

for i in range(len(seasons)):
    grouped_events_temp = sb.competition_events(
        country= "Spain",
        division= "La Liga",
        season= seasons[i],
    )
    grouped_events = pd.concat([grouped_events, grouped_events_temp])

grouped_events.head()

	50_50	bad_behaviour_card	ball_receipt_outcome	ball_recovery_offensive	ball_recovery_recovery_failure	block_deflection	block_offensive	block_save_block	carry_end_location	clearance_aerial_won	...	substitution_outcome	substitution_replacement	tactics	team	timestamp	type	under_pressure	player_off_permanent	half_start_late_video_start	pass_backheel
0	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	{'formation': 433, 'lineup': [{'player': {'id'...	Barcelona	00:00:00.000	Starting XI	NaN	NaN	NaN	NaN
1	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	{'formation': 442, 'lineup': [{'player': {'id'...	Getafe	00:00:00.000	Starting XI	NaN	NaN	NaN	NaN
2	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	{'formation': 4231, 'lineup': [{'player': {'id...	Real Betis	00:00:00.000	Starting XI	NaN	NaN	NaN	NaN
3	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	{'formation': 433, 'lineup': [{'player': {'id'...	Barcelona	00:00:00.000	Starting XI	NaN	NaN	NaN	NaN
4	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	...	NaN	NaN	{'formation': 433, 'lineup': [{'player': {'id'...	Barcelona	00:00:00.000	Starting XI	NaN	NaN	NaN	NaN

5 rows × 114 columns

必要な情報だけを抜き出す＋必要な情報を追加する

データを抜き出せたものの、このままだと機械学習のための特徴量としては不必要なデータが多すぎ、効果的な学習が行えないため、一部のシュートに関連がありそうなデータだけを残して、削除する。

shots_Barcelona = grouped_events[['id', 'index', 'period', 'timestamp', 'minute', 'second', 'type', 'under_pressure', 'location', 'match_id',
                         'possession', 'possession_team', 'play_pattern', 'team', 'player',  'shot_aerial_won', 'shot_body_part',
                         'shot_deflected', 'shot_end_location', 'shot_first_time','shot_follows_dribble', 
                         'shot_freeze_frame', 'under_pressure', 'shot_key_pass_id', 'shot_one_on_one',
                         'shot_open_goal', 'shot_outcome', 'shot_redirect', 'shot_saved_off_target', 'shot_saved_to_post',
                         'shot_technique', 'shot_type', ]].loc[['team'] == "Barcelona" ]
shots_Barcelona = shots_Barcelona.dropna(subset=['shot_outcome'])

さらに、シュート結果などのデータについてはカテゴリ変数化を行う。これによってカテゴリ変数をフラグ化できる。

shots_Barcelona['play_pattern'] = pd.Categorical(shots_Barcelona['play_pattern'])
shots_Barcelona['shot_technique'] = pd.Categorical(shots_Barcelona['shot_technique'])
shots_Barcelona['shot_type'] = pd.Categorical(shots_Barcelona['shot_type'])
dummy_goals = pd.get_dummies(shots_Barcelona[['play_pattern', 'shot_technique', 'shot_type']], drop_first=False)
dummy_goals['shot_one_on_one'] = shots_Barcelona['shot_one_on_one'].map(lambda x: 1 if x == True else 0)
dummy_goals['shot_open_goal'] = shots_Barcelona['shot_open_goal'].map(lambda x: 1 if x == True else 0)
dummy_goals['shot_aerial_won'] = shots_Barcelona['shot_aerial_won'].map(lambda x: 1 if x == True else 0)
dummy_goals['Goal'] = shots_Barcelona['shot_outcome'].map(lambda x: 1 if x == "Goal" else 0)

ペナルティエリア内かどうかを判定するカテゴリを追加する。機械学習の精度を上げるためには有効な特徴量エンジニアリングが重要となってくる。

def in_PenaltyArea(area):
    area = area[1:-1]
    area = area.split(',')
    if 102 <= float(area[0]) <= 120 and 18 <=  float(area[1]) <= 62:
        return 1
    if 0 <=  float(area[0]) <= 18 and 18 <=  float(area[1]) <= 62:
        return 1
    else:
        return 0 
    
dummy_goals['In_PenaltyArea'] = shots_Barcelona['location'].map(in_PenaltyArea)

モデルを構築する

今回はいくつかのモデルに学習をさせてみて、どのモデルが最も適しているかを検討する。
今回推定したいシュートの成功、失敗については、その性質上失敗に大きく偏りがある。

全てシュート失敗と判定すれば８０％近くのデータを適正に判断できることになるが、それでは決して有効なモデルとは言えない。
そのため、Confusion Matrixとよばれる（日本語では混同行列）を使い、精度について注意深く観察しよう。

sc = StandardScaler()

X = dummy_goals.drop('Goal', axis=1)
y = dummy_goals['Goal']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.1, random_state = 0, stratify = y)

sc.fit(X_train)
X_train_std = sc.transform(X_train)
X_test_std = sc.transform(X_test)

models = {
    'KNN' : KNeighborsClassifier(n_neighbors=100, weights='distance'),
    'LogisticRegression' : LogisticRegression(random_state = 0),
    'DecisionTree' : DecisionTreeClassifier(max_depth=5 ,random_state = 0),
    'SVM' : LinearSVC(random_state = 0, max_iter=10000),
    'RandomForest' : RandomForestClassifier(random_state=0),
    'GradiantBoost' : GradientBoostingClassifier(random_state=0)
}

scores = {}

for model_name, model in models.items():
    fit_model = model.fit(X_train_std, y_train)
    pred_y = fit_model.predict(X_test_std)
    confusion_m = confusion_matrix(y_test, pred_y)
    print('confusion matrix')
    print(confusion_m)
    #scores[(model_name, 'train')] = model.score(X_train_std, y_train)
    #scores[(model_name, 'test')] = model.score(X_test_std, y_test)
    print('train:', fit_model.__class__.__name__, fit_model.score(X_train_std, y_train))
    print('test:', fit_model.__class__.__name__, fit_model.score(X_test_std, y_test))
    print("適合率:{:.3f}".format(precision_score(y_test, pred_y)))
    print("再現率:{:.3f}".format(recall_score(y_test, pred_y)))
    print("F1値:{:.3f}".format(f1_score(y_test, pred_y)))

    print('===================================================')

confusion matrix
[[213   1]
 [ 38   6]]
train: KNeighborsClassifier 0.8768898488120951
test: KNeighborsClassifier 0.8488372093023255
適合率:0.857
再現率:0.136
F1値:0.235
===================================================
confusion matrix
[[212   2]
 [ 36   8]]
train: LogisticRegression 0.8466522678185745
test: LogisticRegression 0.8527131782945736
適合率:0.800
再現率:0.182
F1値:0.296
===================================================
confusion matrix
[[213   1]
 [ 38   6]]
train: DecisionTreeClassifier 0.8496760259179266
test: DecisionTreeClassifier 0.8488372093023255
適合率:0.857
再現率:0.136
F1値:0.235
===================================================
confusion matrix
[[213   1]
 [ 37   7]]
train: LinearSVC 0.8457883369330453
test: LinearSVC 0.8527131782945736
適合率:0.875
再現率:0.159
F1値:0.269
===================================================
confusion matrix
[[213   1]
 [ 37   7]]
train: RandomForestClassifier 0.8548596112311015
test: RandomForestClassifier 0.8527131782945736
適合率:0.875
再現率:0.159
F1値:0.269
===================================================
confusion matrix
[[212   2]
 [ 35   9]]
train: GradientBoostingClassifier 0.8522678185745141
test: GradientBoostingClassifier 0.8565891472868217
適合率:0.818
再現率:0.205
F1値:0.327
===================================================

モデルができました！

混同行列の行はモデルが予測したクラスを示し、列は実際のクラスを示します。この場合、表の右上の数値はモデルがシュート失敗と判断して、実際にシュートが失敗していたもの、左上はモデルがシュート失敗と判断したものの、実際にはシュートが成功していたもの、左下はモデルがシュート失敗と判定したものの、実際にはシュートが成功しゴールとなっていたもの、左下はモデルがシュート成功と予測し、実際にゴールが成功していたものを示します。

今回はどのモデルにおいても結果に大差はありませんが、F1値を見ると若干勾配ブースト分類器のスコアが良いようです。しかし混同行列を見ると、他のモデルより１つ２つゴールをシュートと判定できている程度で、やはり数値に大きな違いはなさそうです。

モデルをさらに精度を高めたい場合には、予測と実際の結果が誤っているものを調査して、どのようなケースにおいてモデルの予測が失敗しやすいのか、そのようなケースの特徴を調査し、モデルに反映させたり学習量を増加させたりすることで一定程度モデルの精度の向上が見込めるでしょう。
それ以外にも、特徴量をより良いもの（例えば相手がどのチームだったかや最終パスがどのようなものであったか等が考えられます）を選別したり、パラメータの調整を入れること、単純なデータ量を増やすこと、でもモデル精度の向上が見込めそうです。

しかし今回はここまでとし、Streamlitを利用したWebアプリの制作に移りましょう。

参考文献

東京大学のデータサイエンティスト育成講座
戦略的データサイエンス入門

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up