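Overview of this article
The leagues analyzed here are the domestic leagues of Italy, Spain, France, England, and Germany, the so-called Big Five European leagues.
Using machine learning, I want to analyze what characterizes forwards who score a lot of goals.
Dataset
The dataset is 2021-2022 Football Player Stats, published on Kaggle.
Main part
Loading the data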
import pandas as pd
data = pd.read_csv('2021-2022_Football_Player_Stats.csv',sep = ';', encoding = 'ISO-8859-1')
The data covers players in every position, but this analysis focuses on attackers only, so I extract just the attackers.
pos = ['FW', 'MFFW', 'FWMF']
# Keep only rows whose position is an attacking role (no hard-coded row count needed)
football_stats = data[data['Pos'].isin(pos)]
The features in the dataset are listed below. It contains per-90-minute statistics for football players for 2021 to 2022.
Rk : Rank
Player : Player's name
Nation : Player's nation
Pos : Position
Squad : Squad’s name
Comp : League that squad occupies
Age : Player's age
Born : Year of birth
MP : Matches played
Starts : Matches started
Min : Minutes played
90s : Minutes played divided by 90
Goals : Goals scored or allowed
Shots : Shots total (Does not include penalty kicks)
SoT : Shots on target (Does not include penalty kicks)
SoT% : Shots on target percentage (Does not include penalty kicks)
G/Sh : Goals per shot
G/SoT : Goals per shot on target (Does not include penalty kicks)
ShoDist : Average distance, in yards, from goal of all shots taken (Does not include penalty kicks)
ShoFK : Shots from free kicks
ShoPK : Penalty kicks made
PKatt : Penalty kicks attempted
PasTotCmp : Passes completed
PasTotAtt : Passes attempted
PasTotCmp% : Pass completion percentage
PasTotDist : Total distance, in yards, that completed passes have traveled in any direction
PasTotPrgDist : Total distance, in yards, that completed passes have traveled towards the opponent's goal
PasShoCmp : Passes completed (Passes between 5 and 15 yards)
PasShoAtt : Passes attempted (Passes between 5 and 15 yards)
PasShoCmp% : Pass completion percentage (Passes between 5 and 15 yards)
PasMedCmp : Passes completed (Passes between 15 and 30 yards)
PasMedAtt : Passes attempted (Passes between 15 and 30 yards)
PasMedCmp% : Pass completion percentage (Passes between 15 and 30 yards)
PasLonCmp : Passes completed (Passes longer than 30 yards)
PasLonAtt : Passes attempted (Passes longer than 30 yards)
PasLonCmp% : Pass completion percentage (Passes longer than 30 yards)
Assists : Assists
PasAss : Passes that directly lead to a shot (assisted shots)
Pas3rd : Completed passes that enter the 1/3 of the pitch closest to the goal
PPA : Completed passes into the 18-yard box
CrsPA : Completed crosses into the 18-yard box
PasProg : Completed passes that move the ball towards the opponent's goal at least 10 yards from its furthest point in the last six passes, or any completed pass into the penalty area
PasAtt : Passes attempted
PasLive : Live-ball passes
PasDead : Dead-ball passes
PasFK : Passes attempted from free kicks
TB : Completed pass sent between back defenders into open space
PasPress : Passes made while under pressure from opponent
Sw : Passes that travel more than 40 yards of the width of the pitch
PasCrs : Crosses
CK : Corner kicks
CkIn : Inswinging corner kicks
CkOut : Outswinging corner kicks
CkStr : Straight corner kicks
PasGround : Ground passes
PasLow : Passes that leave the ground, but stay below shoulder-level
PasHigh : Passes that are above shoulder-level at the peak height
PaswLeft : Passes attempted using left foot
PaswRight : Passes attempted using right foot
PaswHead : Passes attempted using head
TI : Throw-Ins taken
PaswOther : Passes attempted using body parts other than the player's head or feet
PasCmp : Passes completed
PasOff : Offsides
PasOut : Out of bounds
PasInt : Intercepted
PasBlocks : Blocked by the opponent who was standing in the path
SCA : Shot-creating actions
ScaPassLive : Completed live-ball passes that lead to a shot attempt
ScaPassDead : Completed dead-ball passes that lead to a shot attempt
ScaDrib : Successful dribbles that lead to a shot attempt
ScaSh : Shots that lead to another shot attempt
ScaFld : Fouls drawn that lead to a shot attempt
ScaDef : Defensive actions that lead to a shot attempt
GCA : Goal-creating actions
GcaPassLive : Completed live-ball passes that lead to a goal
GcaPassDead : Completed dead-ball passes that lead to a goal
GcaDrib : Successful dribbles that lead to a goal
GcaSh : Shots that lead to another goal-scoring shot
GcaFld : Fouls drawn that lead to a goal
GcaDef : Defensive actions that lead to a goal
Tkl : Number of players tackled
TklWon : Tackles in which the tackler's team won possession of the ball
TklDef3rd : Tackles in defensive 1/3
TklMid3rd : Tackles in middle 1/3
TklAtt3rd : Tackles in attacking 1/3
TklDri : Number of dribblers tackled
TklDriAtt : Number of times dribbled past plus number of tackles
TklDri% : Percentage of dribblers tackled
TklDriPast : Number of times dribbled past by an opposing player
Press : Number of times applying pressure to opposing player who is receiving, carrying or releasing the ball
PresSucc : Number of times the squad gained possession within five seconds of applying pressure
Press% : Percentage of time the squad gained possession within five seconds of applying pressure
PresDef3rd : Number of times applying pressure to opposing player who is receiving, carrying or releasing the ball, in the defensive 1/3
PresMid3rd : Number of times applying pressure to opposing player who is receiving, carrying or releasing the ball, in the middle 1/3
PresAtt3rd : Number of times applying pressure to opposing player who is receiving, carrying or releasing the ball, in the attacking 1/3
Blocks : Number of times blocking the ball by standing in its path
BlkSh : Number of times blocking a shot by standing in its path
BlkShSv : Number of times blocking a shot that was on target, by standing in its path
BlkPass : Number of times blocking a pass by standing in its path
Int : Interceptions
Tkl+Int : Number of players tackled plus number of interceptions
Clr : Clearances
Err : Mistakes leading to an opponent's shot
Touches : Number of times a player touched the ball. Note: Receiving a pass, then dribbling, then sending a pass counts as one touch
TouDefPen : Touches in defensive penalty area
TouDef3rd : Touches in defensive 1/3
TouMid3rd : Touches in middle 1/3
TouAtt3rd : Touches in attacking 1/3
TouAttPen : Touches in attacking penalty area
TouLive : Live-ball touches. Does not include corner kicks, free kicks, throw-ins, kick-offs, goal kicks or penalty kicks.
DriSucc : Dribbles completed successfully
DriAtt : Dribbles attempted
DriSucc% : Percentage of dribbles completed successfully
DriPast : Number of players dribbled past
DriMegs : Number of times a player dribbled the ball through an opposing player's legs
Carries : Number of times the player controlled the ball with their feet
CarTotDist : Total distance, in yards, a player moved the ball while controlling it with their feet, in any direction
CarPrgDist : Total distance, in yards, a player moved the ball while controlling it with their feet towards the opponent's goal
CarProg : Carries that move the ball towards the opponent's goal at least 5 yards, or any carry into the penalty area
Car3rd : Carries that enter the 1/3 of the pitch closest to the goal
CPA : Carries into the 18-yard box
CarMis : Number of times a player failed when attempting to gain control of a ball
CarDis : Number of times a player loses control of the ball after being tackled by an opposing player
RecTarg : Number of times a player was the target of an attempted pass
Rec : Number of times a player successfully received a pass
Rec% : Percentage of time a player successfully received a pass
RecProg : Completed passes that move the ball towards the opponent's goal at least 10 yards from its furthest point in the last six passes, or any completed pass into the penalty area
CrdY : Yellow cards
CrdR : Red cards
2CrdY : Second yellow card
Fls : Fouls committed
Fld : Fouls drawn
Off : Offsides
Crs : Crosses
TklW : Tackles in which the tackler's team won possession of the ball
PKwon : Penalty kicks won
PKcon : Penalty kicks conceded
OG : Own goals
Recov : Number of loose balls recovered
AerWon : Aerials won
AerLost : Aerials lost
AerWon% : Percentage of aerials won
For this analysis, I apply the following processing before using the data.
df_football_stats = pd.DataFrame()
df_football_stats['Player'] = football_stats['Player']
df_football_stats['Nation'] = football_stats['Nation']
df_football_stats['Squad'] = football_stats['Squad']
df_football_stats['League'] = football_stats['Comp']
df_football_stats['Age'] = football_stats['Age']
df_football_stats['MP'] = football_stats['MP']
df_football_stats['G/90'] = football_stats['Goals']
df_football_stats['G/Sh'] = football_stats['G/Sh']
df_football_stats['PKGoals'] = ((football_stats['ShoPK'] * football_stats['Min']) / 90).round(0).astype(int)
df_football_stats['shots'] = football_stats['Shots']
df_football_stats['Goals'] = ((football_stats['Goals'] * football_stats['Min'])/90).round(0).astype(int)
df_football_stats['Pass'] = football_stats['PasTotAtt']
df_football_stats['PassCompleted'] = football_stats['PasTotCmp']
df_football_stats['PassComp%'] = football_stats['PasTotCmp%']  # completion percentage from the dataset, not a copy of the completed count
df_football_stats['Pass3rd'] = ((football_stats['Pas3rd'] * football_stats['Min']) / 90).round(0).astype(int)
df_football_stats['Assist'] = ((football_stats['Assists'] * football_stats['Min']) / 90).round(0).astype(int)
df_football_stats['Assist/90'] = football_stats['Assists']
df_football_stats['Cross'] = football_stats['PasCrs']
df_football_stats['CrossCompleted'] = football_stats['CrsPA']
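# Assumed definition: share of crosses completed into the 18-yard box (NaN when no crosses were attempted)
df_football_stats['CrossComp%'] = (football_stats['CrsPA'] / football_stats['PasCrs'].where(football_stats['PasCrs'] > 0) * 100).round(1)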
df_football_stats['Tackle_Won'] = football_stats['TklWon']
df_football_stats['SucDribble'] = ((football_stats['DriSucc'] * football_stats['Min']) / 90).round(0).astype(int)
df_football_stats['Dribble'] = football_stats['DriAtt']
df_football_stats['DribbleComp%'] = football_stats['DriSucc%']  # success percentage from the dataset, not a copy of the success count
df_football_stats['TouAttPen'] = football_stats['TouAttPen']
df_football_stats['Fls'] = football_stats['Fls']
df_football_stats['Fld'] = football_stats['Fld']
df_football_stats['AerWon'] = football_stats['AerWon']
df_football_stats['AerLost'] = football_stats['AerLost']
df_football_stats['AerWon%'] = football_stats['AerWon%']  # win percentage from the dataset, not a copy of the win count
df_football_stats['Car3rd'] = football_stats['Car3rd']
df_football_stats['GCA'] = football_stats['GCA']
df_football_stats['Touches'] = football_stats['Touches']
df_football_stats['ShoDist'] = football_stats['ShoDist']
df_football_stats['CarMis'] = football_stats['CarMis']
df_football_stats['CPA'] = football_stats['CPA']
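The per-90 to season-total conversion used in the code above can be checked on a toy example. This is only a sketch of the arithmetic (total = per-90 value × minutes played / 90); the helper name `per90_to_total` and the numbers are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

def per90_to_total(per90: pd.Series, minutes: pd.Series) -> pd.Series:
    """Convert a per-90-minute rate into a rounded season total."""
    return (per90 * minutes / 90).round(0).astype(int)

# Made-up values for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    'Goals': [1.0, 0.5, 0.0],   # goals per 90 minutes
    'Min': [1800, 900, 450],    # minutes played
})
toy['GoalsTotal'] = per90_to_total(toy['Goals'], toy['Min'])
print(toy['GoalsTotal'].tolist())  # [20, 5, 0]
```

Because the source values are rounded per-90 rates, totals recovered this way are approximations of the real season counts.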
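Correlation coefficients
The features most strongly correlated with Goals are shown below.
A strong positive correlation means that when one variable increases, the other tends to increase as well.
The stronger that relationship, the closer the correlation coefficient gets to 1.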
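# numeric_only=True skips the string columns (Player, Nation, Squad, League), which recent pandas versions refuse to correlate
corr_df = df_football_stats.corr(numeric_only=True)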
corr_df.sort_values('Goals', ascending=False).head(15).style.background_gradient(axis=None)
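To make the meaning of the coefficient concrete, here is a minimal sketch on made-up numbers (the column names and values are assumptions for illustration, not dataset values). One column rises in step with `shots` and another falls as `shots` rises.

```python
import pandas as pd

# Made-up numbers for illustration only
toy = pd.DataFrame({
    'shots': [1, 2, 3, 4, 5],
    'goals': [2, 4, 6, 8, 10],   # rises exactly in step with shots
    'fouls': [5, 4, 3, 2, 1],    # falls exactly as shots rise
})
corr = toy.corr()  # Pearson correlation by default
print(round(corr.loc['shots', 'goals'], 6))  # 1.0
print(round(corr.loc['shots', 'fouls'], 6))  # -1.0
```

Real features rarely reach these extremes; values between 0 and 1 indicate a weaker positive relationship.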
Top 10 goal scorers
df_football_stats.sort_values('Goals', ascending = False).head(10).iloc[:, [0,1,2,3,4,5,10]]
Player | Nation | Squad | League | Age | MP | Goals |
---|---|---|---|---|---|---|
Robert Lewandowski | POL | Bayern Munich | Bundesliga | 33 | 34 | 35 |
Kylian Mbappé | FRA | Paris S-G | Ligue 1 | 23 | 35 | 28 |
Karim Benzema | FRA | Real Madrid | La Liga | 34 | 32 | 27 |
Ciro Immobile | ITA | Lazio | Serie A | 32 | 31 | 27 |
Wissam Ben Yedder | FRA | Monaco | Ligue 1 | 31 | 37 | 25 |
Patrik Schick | CZE | Leverkusen | Bundesliga | 26 | 27 | 24 |
Son Heung-min | KOR | Tottenham | Premier League | 29 | 35 | 23 |
Mohamed Salah | EGY | Liverpool | Premier League | 29 | 35 | 23 |
Erling Haaland | NOR | Dortmund | Bundesliga | 21 | 24 | 22 |
Lautaro Martínez | ARG | Inter | Serie A | 24 | 35 | 21 |
As the table shows, Robert Lewandowski tops the ranking.
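The ranking code picks columns by position with `iloc`, which depends on the order in which the columns were created. As an alternative sketch (made-up players and values, not the author's code), the same kind of ranking can select columns by name:

```python
import pandas as pd

# Made-up players and values for illustration only
toy = pd.DataFrame({
    'Player': ['P1', 'P2', 'P3', 'P4'],
    'Squad': ['S1', 'S2', 'S3', 'S4'],
    'Goals': [3, 12, 7, 1],
})
# Sort by goals, keep the top rows, and pick columns by name instead of position
top = toy.sort_values('Goals', ascending=False).head(2)[['Player', 'Goals']]
print(top['Player'].tolist())  # ['P2', 'P3']
```

Selecting by name keeps the ranking correct even if columns are later added or reordered.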
Top 10 assist providers
df_football_stats.sort_values('Assist', ascending = False).head(10).iloc[:, [0,1,2,3,4,5,15]]
Player | Nation | Squad | League | Age | MP | Assist |
---|---|---|---|---|---|---|
Kylian Mbappé | FRA | Paris S-G | Ligue 1 | 23 | 35 | 18 |
Lionel Messi | ARG | Paris S-G | Ligue 1 | 34 | 26 | 14 |
Domenico Berardi | ITA | Sassuolo | Serie A | 27 | 33 | 14 |
Ousmane Dembélé | FRA | Barcelona | La Liga | 25 | 21 | 13 |
Mohamed Salah | EGY | Liverpool | Premier League | 29 | 35 | 13 |
Christopher Nkunku | FRA | RB Leipzig | Bundesliga | 24 | 34 | 13 |
Marco Reus | GER | Dortmund | Bundesliga | 32 | 29 | 13 |
Karim Benzema | FRA | Real Madrid | La Liga | 34 | 32 | 12 |
Benjamin Bourigeaud | FRA | Rennes | Ligue 1 | 28 | 38 | 12 |
Moussa Diaby | FRA | Leverkusen | Bundesliga | 22 | 32 | 12 |
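In the assist ranking, Kylian Mbappé comes first.
Mbappé is also second in the goal ranking, so it is clear at a glance that he is involved in a large number of goals.
Top 10 by successful dribbles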
df_football_stats.sort_values('SucDribble', ascending = False).head(10).iloc[:, [0,1,2,3,4,5,21]]
Player | Nation | Squad | League | Age | MP | SucDribble |
---|---|---|---|---|---|---|
Allan Saint-Maximin | FRA | Newcastle Utd | Premier League | 25 | 35 | 140 |
Kylian Mbappé | FRA | Paris S-G | Ligue 1 | 23 | 35 | 103 |
Vinicius Júnior | BRA | Real Madrid | La Liga | 21 | 35 | 101 |
Rafael Leão | POR | Milan | Serie A | 22 | 34 | 95 |
Sofiane Boufal | MAR | Angers | Ligue 1 | 28 | 29 | 95 |
Houssem Aouar | FRA | Lyon | Ligue 1 | 23 | 36 | 75 |
Patrick Wimmer | AUT | Arminia | Bundesliga | 20 | 31 | 75 |
Lucas Paquetá | BRA | Lyon | Ligue 1 | 24 | 35 | 75 |
Emmanuel Dennis | NGA | Watford | Premier League | 24 | 33 | 73 |
Wilfried Zaha | CIV | Crystal Palace | Premier League | 29 | 33 | 73 |
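Machine learning
The model used here is LightGBM.
For the categorical features, one-hot encoding would inflate the number of features, so I label-encode them instead (the code uses scikit-learn's OrdinalEncoder).
The target variable is the number of goals; the remaining columns are used as features, except MP, Player, G/90, and G/Sh, which the code drops.
No parameter tuning was done.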
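The encoding trade-off mentioned above can be seen on a tiny made-up column. This sketch uses plain pandas rather than the scikit-learn encoder from the article, and the category values are assumptions for illustration only.

```python
import pandas as pd

# Made-up category values for illustration only
toy = pd.DataFrame({'League': ['A', 'B', 'C', 'A', 'D']})

# One-hot encoding: one new column per distinct category
one_hot = pd.get_dummies(toy['League'])

# Integer (label/ordinal-style) encoding: a single column of category codes
codes = toy['League'].astype('category').cat.codes

print(one_hot.shape[1])   # 4
print(codes.tolist())     # [0, 1, 2, 0, 3]
```

With many distinct nations and squads, the one-hot version would add one column per value, while the integer version stays at one column per feature. The numeric order of the codes is arbitrary, which tree-based models such as LightGBM can generally tolerate.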
The code for the encoding and the model is as follows.
# Label encoding
from sklearn.preprocessing import OrdinalEncoder
dummies_col = ['Nation', 'Squad', 'League']
oe = OrdinalEncoder()
# Assign the encoded array directly: going through a new DataFrame with a fresh 0..n-1 index
# would misalign with the filtered rows' original index and produce NaN
df_football_stats[dummies_col] = oe.fit_transform(df_football_stats[dummies_col])
# Model
import numpy as np
import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error

# Drop the target and the columns directly derived from it
x = df_football_stats.drop(['MP', 'Player', 'G/90', 'Goals', 'G/Sh'], axis=1)
y = df_football_stats['Goals']
x_train, x_val, y_train, y_val = train_test_split(
    x,
    y,
    random_state=42,
    test_size=0.1,
)
params = {
    'task': 'train',
    'boosting_type': 'gbdt',
    'objective': 'regression',
    'metric': 'rmse',
}
lgb_train = lgb.Dataset(x_train, y_train)
lgb_eval = lgb.Dataset(x_val, y_val, reference=lgb_train)
model = lgb.train(
    params,
    train_set=lgb_train,
    valid_sets=[lgb_eval],
    num_boost_round=1000,
    # LightGBM 4.x removed the verbose_eval / early_stopping_rounds arguments; callbacks replace them
    callbacks=[lgb.early_stopping(stopping_rounds=10), lgb.log_evaluation(period=20)],
)
y_valid_pred = model.predict(x_val)
score = np.sqrt(mean_squared_error(y_val, y_valid_pred))
print(f'RMSE: {score}')
Result: RMSE 1.8077498197050759
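For readers unfamiliar with the metric, RMSE is the square root of the mean squared difference between predicted and actual values, so it is expressed in the same unit as the target (goals here). A minimal sketch on made-up numbers, with a hypothetical helper `rmse` written out by hand:

```python
import math

def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up goal counts for illustration only
actual = [10, 5, 0, 3]
predicted = [8, 5, 2, 3]
print(rmse(actual, predicted))  # sqrt((4 + 0 + 4 + 0) / 4) = sqrt(2)
```

Read this way, an RMSE of about 1.8 means the validation predictions miss the actual goal count by something on the order of two goals.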
Feature importance
lgb.plot_importance(model)

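Discussion
Looking at the LightGBM feature importances, the features ranked as important for goal-scoring FWs are pass success rate near the opponent's goal, number of shots, number of successful dribbles, and number of touches in the penalty area. This suggests that a high-scoring FW is a player who can do three things near the opponent's goal: dribble, pass, and shoot.
For the defending DFs and GK, it seems fair to say that the more options a player has, the harder he is to defend against.
For that reason, I think Mbappé, who ranks near the top in both successful dribbles and assists, is an extremely difficult FW to defend against.
The league a player belongs to, on the other hand, has low importance. In other words, it is unlikely that some leagues are simply easier to score in.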
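A complementary check, which is not part of the original analysis, would be to compare goals across leagues directly. The sketch below uses made-up league labels and goal counts; with the real data, the same `groupby` could be run on the League and Goals columns before encoding.

```python
import pandas as pd

# Made-up values for illustration only (not taken from the dataset)
toy = pd.DataFrame({
    'League': ['X', 'X', 'Y', 'Y', 'Z', 'Z'],
    'Goals': [10, 2, 7, 5, 6, 6],
})
# Mean and spread of goals per league; similar means would be consistent
# with the league having little influence on scoring
summary = toy.groupby('League')['Goals'].agg(['mean', 'std', 'count'])
print(summary)
```

If the per-league means on the real data turned out to be close, that would point in the same direction as the low feature importance.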
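Summary
In this article, I used a simple data analysis to look at the characteristics of FWs who score a lot of goals.
The result suggests that, to finish near the top of a league's scoring charts, it is important to have many options in front of goal.
That said, I only used one season of data from the Big Five European leagues, so the results might change with more data. It would also be interesting to use running data or video data.
This article was written purely for my own enjoyment, but thank you for reading.
If you would like to see the code in detail, please follow the link below.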
https://github.com/ka1to0324/2021-2022_football_analysis/blob/main/2021-2022_football_analysis.ipynb