More than 1 year has passed since last update.

R Advent Calendar 2022

@keach_r

パスを回せればワールドカップで勝てるのかをRで眺める

Last updated at 2022-12-20Posted at 2022-12-20

R Advent Calendar 2022 21日目の記事です。

20日はまだ記事がありません。チャンスです、書きましょう。
22日は @str8808 さんの「OECDパッケージを使ったデータ分析」です。

ワールドカップ、盛り上がりました。メッシがついに優勝したり、モロッコがアフリカ勢初のベスト４で大躍進したりと、充実の大会でした。日本はまたも惜しかった...。

日本関連だとグループステージの大金星に関連して「700本以上のパスを通しながら勝てなかったチームはスペインとドイツだけでいずれも相手は日本だった」という話題がありました。

2 - Since detailed World Cup records began (1966), there are only two instances of a team losing a game despite attempting 700+ passes:

Spain vs Japan tonight
Germany vs Japan last week

Formula. pic.twitter.com/nvsTSNqyI0
— OptaJoe (@OptaJoe) December 1, 2022

2010年くらいから世界的にカウンターが流行気味で、個の力、いわゆるデュエルの強さが強調されています。プレスのかかった前線で成功率50%パスを3回つなぐよりは、勝率20%の1v1でドリブル突破を狙うのが合理的なのかもしれません。パス本数はあまり重要ではないメトリクスになったのでしょうか？

ということで、Rを使ってパスに注目したデータハンドリングと可視化をしてみます。

パスデータ

fivethirtyeightパッケージに含まれる2018年大会までの各選手の1試合あたりのパス成功本数のZスコアデータセットを使いました。試合ごとチームごとに集計されたパス本数のデータセットがあれば楽なのですが、フリーなものが見つからなかったのでこちらを整形しました。このデータを使った可視化サイトも面白いのでどうぞ。

データセットの形式は以下の通りです。16のメトリクスのZスコアが大会・選手別に集計されています。

library(tidyverse)
library(gghighlight)
library(rvest) # webスクレイピング
library(ggimage) # 国旗表示
library(ggbeeswarm) # 蜂群図

player_stats <- 
  readr::read_csv("https://projects.fivethirtyeight.com/soccer-api/international/2018/world_cup_comparisons.csv") %>%
  dplyr::mutate(season = as.character(season))

head(player_stats)
# # A tibble: 6 × 19
#   player season team  goals_z  xg_z crosses_z boxtouches_z passes_z progpasses_z takeons_z
#   <chr>  <chr>  <chr>   <dbl> <dbl>     <dbl>        <dbl>    <dbl>        <dbl>     <dbl>
# 1 Crist… 2018   Arge…   -0.42 -0.55      0.08        -0.24    -0.61        -0.72     -0.05
# 2 Eduar… 2018   Arge…   -0.42 -0.5      -0.51         0.64     0.02        -0.46     -0.05
# 3 Enzo … 2018   Arge…   -0.42  0.11     -0.51        -0.69     0.27        -0.65     -0.36
# 4 Feder… 2018   Arge…   -0.42 -0.59     -0.51        -0.69    -0.97        -0.91     -0.67
# 5 Franc… 2018   Arge…   -0.42 -0.59     -0.51        -0.8     -0.67         0.64     -0.67
# 6 Gabri… 2018   Arge…    1.3  -0.3       0.08        -0.35     1           -0.2      -0.36
# # … with 9 more variables: progruns_z <dbl>, tackles_z <dbl>, interceptions_z <dbl>,
# #   clearances_z <dbl>, blocks_z <dbl>, aerials_z <dbl>, fouls_z <dbl>, fouled_z <dbl>,
# #   nsxg_z <dbl>

Zスコアは以下の式で得られます。

$$
Z score = \dfrac{x - \mu}{\sigma}
$$

ここで、$\mu$は平均値、$\sigma$は標準偏差です。大会ごとに正規化処理はされているのですが、正規化前の数値がないので、上記の可視化サイトから各大会２人程度のパス本数を手入力で持ってきて逆変換しました。大会ごとに少なくとも２人のパス本数があれば逆変換ができるので、出場回数が多い選手をうまく選びます。

longstanding_players <-
  tibble::tribble(
    ~ player, ~ flag, ~ season, ~ passes_num,
    "Lothar Matthäus", "DE", "1982", NA_real_,
    "Lothar Matthäus", "DE", "1986",  28.8,
    "Lothar Matthäus", "DE", "1990",  53.7,
    "Lothar Matthäus", "DE", "1994",  53.8,
    "Lothar Matthäus", "DE", "1998",  41.2,
    "Paolo Maldini",   "IT", "1990",  34.4,
    "Paolo Maldini",   "IT", "1994",  40.7,
    "Paolo Maldini",   "IT", "1998",  32.4,
    "Paolo Maldini",   "IT", "2002",  29.5,
    "Roberto Carlos",  "BR", "1986",  25.3,
    "Roberto Carlos",  "BR", "1998",  49.4,
    "Roberto Carlos",  "BR", "2002",  34.1,
    "Roberto Carlos",  "BR", "2006",  32.8,
    "Sergio Ramos",    "ES", "2006",  47.3,
    "Sergio Ramos",    "ES", "2010",  63.4,
    "Sergio Ramos",    "ES", "2014",  55.0,
    "Sergio Ramos",    "ES", "2018", 111.2,
    "Yuto Nagatomo",   "JP", "2010",  18.0,
    "Yuto Nagatomo",   "JP", "2014",  37.0,
    "Yuto Nagatomo",   "JP", "2018",  40.3
  )

longstanding_stats <-
  player_stats %>%
  dplyr::inner_join(longstanding_players, by = c("player", "season"))

longstanding_stats %>%
  ggplot2::ggplot(ggplot2::aes(player, season, image = flag)) +
  ggimage::geom_flag()  +
  ggplot2::coord_flip()

ちなみに、1986年大会のRoberto Carlosは1996–2006年大会のRoberto Carlosとは同姓同名の別人GKです。形式が概ね変わらない1986年大会以降を対象に、回帰により大会ごとに$\mu$と$\sigma$を計算します。

inv_zscore <-
  longstanding_stats %>%
  dplyr::filter(season >= 1986) %>%
  split(.$season) %>%
  purrr::map_dfr(.id = "season", ~ lm(data = ., passes_num ~ passes_z) %>% tidy %>% select(term, estimate)) %>%
  tidyr::pivot_wider(names_from = term, values_from = estimate) %>%
  purrr::set_names(c("season", "mu", "sigma"))

head(inv_zscore)
# # A tibble: 6 x 3
#   season    mu sigma
#   <chr>  <dbl> <dbl>
# 1 1986    20.2  12.5
# 2 1990    20.4  13.6
# 3 1994    23.0  15.8
# 4 1998    20.2  13.3
# 5 2002    19.7  12.8
# 6 2006    19.7  13.2

逆変換して選手ごと・チームごとのパス本数データセットを作ります。なお、この集計方法は交代を考慮しておらず、単純にチームごとに全プレイヤーの平均パス本数の総和としてチームのパス本数を計算しています。そのため交代が多いチームは実際よりも試合あたりのパス数が増える計算になっています。

player_passes <-
  player_stats %>%
  filter(as.numeric(season) >= 1986) %>%
  dplyr::left_join(inv_zscore, by = "season") %>%
  dplyr::transmute(player, season, team, passes = passes_z * sigma + mu)

head(player_passes)
# # A tibble: 6 × 4
#   player          season team      passes
#   <chr>           <chr>  <chr>      <dbl>
# 1 Cristian Pavón  2018   Argentina  11.6 
# 2 Eduardo Salvio  2018   Argentina  22.6 
# 3 Enzo Pérez      2018   Argentina  26.9 
# 4 Federico Fazio  2018   Argentina   5.37
# 5 Franco Armani   2018   Argentina  10.6 
# 6 Gabriel Mercado 2018   Argentina  39.6 

team_passes <-
  player_passes %>%
  dplyr::group_by(team, season) %>%
  dplyr::summarise(passes = sum(passes), .groups = "drop")

head(team_passes)
# # A tibble: 6 × 3
#   team      season passes
#   <chr>     <chr>   <dbl>
# 1 Algeria   1986     385.
# 2 Algeria   2010     329.
# 3 Algeria   2014     263.
# 4 Angola    2006     259.
# 5 Argentina 1986     329.
# 6 Argentina 1990     287.

パスデータを概観します。
チーム別にパス本数のトレンドを見ると、パスサッカーの本場スペインが頭ひとつ抜けています。他にも、ドイツ・イタリア・アルゼンチン・ブラジルといずれも強豪です。このあたりのチームは実力差から相手が引くので結果的にパスが増えた、という影響もありそうです。あとは1994年大会は外れ値的なパスサッカー大会だったようで、コロンビアの２人が異質な印象です。

team_passes %>%
  ggplot2::ggplot(ggplot2::aes(season, passes, col = team)) +
  ggbeeswarm::geom_beeswarm() +
  gghighlight::gghighlight(!dplyr::between(passes, 220, 500), label_key = team)

player_passes %>%
  ggplot2::ggplot(ggplot2::aes(season, passes, col = team)) +
  ggbeeswarm::geom_beeswarm(cex = .3) +
  gghighlight::gghighlight(passes > 80, label_key = player,max_highlight = 20)

年代別にパス本数のトレンドを見ると、1994年大会のはずれ値を除けば、全体的なパス本数は近年も増加傾向にあるように見えます。
ただし、チーム間のばらつきも増加傾向にあり、戦術 (あるいは戦力差？) の極端化が示唆されます。

team_passes %>%
  dplyr::group_by(season) %>%
  dplyr::mutate(mean = mean(passes), sd = sd(passes)) %>%
  ggplot2::ggplot(ggplot2::aes(season, passes, group = season)) +
  ggplot2::geom_point(ggplot2::aes(y = mean), size = 5) +
  ggplot2::geom_errorbar(ggplot2::aes(ymin = mean - sd, ymax = mean + sd), width = .25) +
  ggbeeswarm::geom_beeswarm(col = "grey")

1986年大会と2018年大会だけを比べればスタイルの変化は歴然で、最少243本から最多436本と平均もその標準偏差の小さい1986年大会と比較して、2018年大会では151本のイランから743本のスペインまで多様です。

team_passes %>%
  dplyr::filter(season %in% c("1986", "2018")) %>%
  dplyr::mutate(team = forcats::fct_reorder(team, passes, .desc = TRUE)) %>%
  ggplot2::ggplot(ggplot2::aes(team, passes, label = round(passes), group = season)) +
  ggplot2::geom_bar(stat = "identity") +
  ggplot2::geom_text(col = "white", hjust = 1) +
  ggplot2::coord_flip() +
  ggplot2::facet_grid(~ season)

順位データ

パスと順位の関係を見るために、FIFA World Cup records and statistics (Wikipedia)から順位を集計したテーブルをスクレイピングします。

team_results_raw <-
  rvest::read_html("https://en.wikipedia.org/wiki/FIFA_World_Cup_records_and_statistics") %>%
  rvest::html_table() %>%
  .[[10]] %>%
  rlang::set_names(c("team", 1930, 1934, 1938, seq(1950, 2022, by = 4), "total")) %>%
  dplyr::select(-total)

head(team_results_raw)
# # A tibble: 6 × 23
#   team      `1930`  `1934`  `1938` `1950` `1954` `1958` `1962` `1966` `1970` `1974` `1978`
#   <chr>     <chr>   <chr>   <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr>  <chr> 
# 1 Algeria   "Part … "Part … "Part… "Part… "Part… "Part… "Part… "×"    "•"    "•"    "•"   
# 2 Angola    ""      ""      ""     ""     ""     ""     ""     ""     ""     ""     ""    
# 3 Argentina "2nd"   "R1T-9… "×"    "×"    "×"    "R113… "R110… "QF5t… "•"    "R28t… "1st" 
# 4 Australia ""      ""      ""     ""     ""     ""     ""     "•"    "•"    "R114… "•"   
# 5 Austria   "×"     "4th"   "••[g… "×"    "3rd"  "R115… "×"    "•"    "•"    "•"    "R27t…
# 6 Belgium   "R111t… "R115t… "R113… "×"    "R112… "•"    "•"    "•"    "R1T-… "•"    "•"   
# # … with 11 more variables: 1982 <chr>, 1986 <chr>, 1990 <chr>, 1994 <chr>, 1998 <chr>,
# #   2002 <chr>, 2006 <chr>, 2010 <chr>, 2014 <chr>, 2018 <chr>, 2022 <chr>

wide形式なので使い勝手がいいようにlong形式に整形します。本戦出場全チームの順位付けがされているので、stringrパッケージで順位を数値化し、どのラウンドまで進んだのかを取り出します。ついでにteamなどの列に付記された注釈も削除しておきます。

team_results <-
  team_results_raw %>%
  tidyr::pivot_longer(cols = -team, names_to = "season", values_to = "result") %>%
  dplyr::filter(stringr::str_detect(result, "1st|2nd|3rd|4th|QF|R1|R2")) %>%
  dplyr::mutate(team = stringr::str_remove_all(team, "\\[.\\]"),
                result = stringr::str_remove_all(result, "\\[.\\]"),
                place = as.numeric(stringr::str_remove_all(result, "st|nd|rd|th|QF|R1|R2|T-")),
                round = stringr::str_extract(result, "1st|2nd|3rd|4th|QF|R1|R2") %>%
                        factor(levels = c("1st", "2nd", "3rd", "4th", "QF", "R2", "R1"),
                               label = c("1st", "2nd", "3rd", "4th", "QF", "R16", "R32")),
                result = NULL)

head(team_results)
# # A tibble: 6 × 4
#   team      season place round
#   <chr>     <chr>  <chr> <chr>
# 1 Algeria   1982   13    R32   
# 2 Algeria   1986   22    R32   
# 3 Algeria   2010   28    R32   
# 4 Algeria   2014   14    R16   
# 5 Angola    2006   23    R32   
# 6 Argentina 1930   2     2nd

順位データを概観します。
各国の最高到達地点を地図にマッピングしてみます。国名の表記揺らぎやイギリスの３チーム登録などを微調整が手間です。上位を取るのはヨーロッパか南米かです。

record <-
  team_results %>%
  dplyr::mutate(round_num = as.numeric(round)) %>%
  dplyr::group_by(team) %>%
  dplyr::slice_min(round_num, n = 1, with_ties = FALSE)

world <- ggplot2::map_data("world")

setdiff(record$team, world$region)
#  [1] "China PR"            "Congo DR"            "East Germany"       
#  [4] "England"             "Northern Ireland"    "Republic of Ireland"
#  [7] "Scotland"            "Trinidad and Tobago" "United States"      
# [10] "Wales" 

world_modif <-
  world %>%
  dplyr::mutate(region = dplyr::case_when(region == "USA" ~ "United States",
                                          region == "China" ~ "China PR",
                                          region %in% c("Democratic Republic of the Congo", "Republic of Congo") ~ "Congo DR",
                                          region %in% c("Trinidad", "Tobago") ~ "Trinidad and Tobago",
                                          region == "Ireland" ~ "Republic of Ireland",
                                          subregion == "Northern Ireland" ~ "Republic of Ireland",
                                          subregion == "Great Britain" ~ "England",
                                          subregion == "Scotland" ~ "Scotland",
                                          subregion == "Wales" ~ "Wales",
                                          TRUE ~ region))
  
dplyr::left_join(world_modif, record, by = c("region" = "team")) %>%
  ggplot2::ggplot(ggplot2::aes(long, lat, fill = round, group = group)) +
  ggplot2::geom_polygon(fill = "grey") +
  ggplot2::geom_polygon(col = "white", linewidth = .1)

最近の上位国と日本の順位変動は以下の通りです。日本はベスト16敗退のなかではまたしても成績優秀でしたが、ベスト8の壁が分厚いです。

sure_bets <-
  team_results %>%
  dplyr::filter(as.numeric(season) >= 1986) %>%
  dplyr::filter(place <= 2) %>%
  dplyr::pull(team) %>% 
  unique

flags <-
  tibble::tibble(team = c(sure_bets, "Japan"), flag = c("AR", "BR", "CR", "FR", "DE", "IT", "NL", "ES", "JP"))

team_results %>%
  dplyr::filter(as.numeric(season) >= 1986) %>%
  dplyr::filter(team %in% c(sure_bets, "Japan")) %>%
  dplyr::left_join(flags, by = "team") %>%
  ggplot2::ggplot(ggplot2::aes(place, season, image = flag, group = team)) +
  ggplot2::annotate(geom = "rect", xmin = 0.5, xmax = 32.5, ymin = -Inf, ymax = +Inf, fill = "grey25") +
  ggplot2::annotate(geom = "rect", xmin = 0.5, xmax = 16.5, ymin = -Inf, ymax = +Inf, fill = "grey50") +
  ggplot2::annotate(geom = "rect", xmin = 0.5, xmax = 8.5, ymin = -Inf, ymax = +Inf, fill = "grey75") +
  ggplot2::annotate(geom = "rect", xmin = 0.5, xmax = 4.5, ymin = -Inf, ymax = +Inf, fill = "grey90") +
  ggplot2::annotate(geom = "rect", xmin = 0.5, xmax = 2.5, ymin = -Inf, ymax = +Inf, fill = "white") +
  ggimage::geom_flag(size = .03) +
  ggplot2::scale_x_continuous(breaks = c(1, 2, 3, 4, 5, 8, 9, 16, 17, 32), limits = c(.5, 32.5), expand = c(0, 0))

解析 | パスを回せれば勝てるのか？

パスと順位のデータをマージします。ここでも表記の揺らぎを微調整します。

pass <-
  team_passes %>%
  dplyr::mutate(team = dplyr::case_when(team == "USSR" ~ "Russia",
                                        team == "West Germany" ~ "Germany",
                                        team %in% c("Northern Ireland", "Republic of Ireland", "Rep of Ireland") ~ "Ireland",
                                        team == "Yugoslavia" ~ "Serbia",
                                        team == "Czechoslovakia" ~ "Czech Republic",
                                        TRUE ~ team))
place <- 
  team_results %>%
  dplyr::mutate(team = dplyr::case_when(team == "United States" ~ "USA",
                                        team == "USSR" ~ "Russia",
                                        team %in% c("Northern Ireland", "Republic of Ireland", "Rep of Ireland") ~ "Ireland
                                        TRUE ~ team))

dataset <-
  dplyr::left_join(pass, place, by = c("team", "season")) %>%
  dplyr::filter(as.numeric(season) >= 1986)

head(dataset)
# # A tibble: 6 x 5
#   team      season passes place round
#   <chr>     <chr>   <dbl> <dbl> <fct>
# 1 Algeria   1986     385.    22 R32  
# 2 Algeria   2010     329.    28 R32  
# 3 Algeria   2014     263.    14 R16  
# 4 Angola    2006     259.    23 R32  
# 5 Argentina 1986     329.     1 1st  
# 6 Argentina 1990     287.     2 2nd

素直にパス本数と順位の相関をとると、負の相関が認められます。つまり、パス本数が多いチームは高順位という傾向にあるようです。

cor.test(dataset$passes, dataset$place)
# 	Pearson's product-moment correlation
# 
# data:  dataset$passes and dataset$place
# t = -3.8024, df = 262, p-value = 0.0001783
# alternative hypothesis: true correlation is not equal to 0
# 95 percent confidence interval:
#  -0.3400255 -0.1110244
# sample estimates:
#        cor 
# -0.2286861

決勝進出チームの平均パス本数はほとんどの場合で300本を超えて平均400本弱、準決勝進出チームまで見ても370本程度となり、「勝てるチームはパスが回せている」は真です。プラン通りにパスを回せるチームは勝てるし、対戦相手に消極策を強いることができるような強豪も結果としてパス本数が増えて勝てる、ということかと思います。対偶の「パスが回せていないチームは勝てない」ももちろん真で、相手にがっつり持たれているようでは勝つのは難しいようです。これをひっくり返したので冒頭記事にあった日本 vs スペイン、日本 vs ドイツは珍しい出来事だったということです。

一方、下位に終わったチームでも600本超えのような数値が見られるように、「パスを回せるチームは勝てる」は偽です。DFラインで無駄パスを回しても意味がなく、このあたりはパスの出どころ、長短、方向など、より詳細なデータ解析が求められます。

dataset %>%
  dplyr::group_by(round) %>%
  dplyr::summarise(mean = mean(passes), median = median(passes))
# # A tibble: 7 x 3
#   round  mean median
#   <fct> <dbl>  <dbl>
# 1 1st    420.   382.
# 2 2nd    371.   372.
# 3 3rd    373.   371.
# 4 4th    364.   352.
# 5 QF     347.   326.
# 6 R16    360.   356.
# 7 R32    334.   329.

dataset %>%
  ggplot2::ggplot(ggplot2::aes(round, passes, col = round, group = round)) +
  ggplot2::geom_boxplot(outlier.colour = NA) +
  ggplot2::geom_jitter()
  viridis::scale_color_viridis()

2022年大会の日本をFIFAのマッチレポートから振り返ると、パス成功本数はドイツ戦で207本、コスタリカ戦で510本、スペイン戦で167本、クロアチア戦で414本と、4戦平均では324本でした。なんだかんだ平均を取れば極端に低いということもなく、相手との実力差に合わせたプレーになって (意図的にして) いた、ということに見えます。シュート数の劣勢を見ても、運の要素が多分に作用した大金星という印象です。

耳タコですが勝てる試合で勝ち切る得点力と対戦相手に消極作を強いる「日本は格上」というイメージを植え付けられる実績が重要なのかもしれません。今回の２勝で変にカウンター推しに転換せず、横綱相撲を目指してほしいと個人的には思っています。

  tibble::tribble(
    ~ "match", ~ "is_japan", ~ "Shoot", ~ "Shoot_on_target", ~ "Goal", ~ "Pass", ~ "flag",
    "Japan v Germany",    "Japan",    12, 4, 2, 207, "JP",
    "Japan v Germany",    "Opponent", 26, 9, 1, 743, "DE",
    "Japan v Costa Rica", "Japan",    13, 3, 0, 510, "JP",
    "Japan v Costa Rica", "Opponent",  4, 1, 1, 371, "CR",
    "Japan v Spain",      "Japan",     6, 3, 2, 167, "JP",
    "Japan v Spain",      "Opponent", 12, 5, 1, 992, "ES",
    "Japan v Croatia",    "Japan",    13, 4, 1, 414, "JP",
    "Japan v Croatia",    "Opponent", 17, 4, 1, 624, "HR") %>%
  tidyr::pivot_longer(cols = Shoot:Pass, names_to = "metrics", values_to = "Count") %>%
  ggplot2::ggplot(ggplot2::aes(is_japan, Count, label = Count, image = flag)) +
  ggplot2::geom_bar(stat = "identity", fill = "grey70") +
  ggimage::geom_flag(ggplot2::aes(y = Count/2), size = .3) +
  ggplot2::geom_text(y = 0, vjust = 0) +
  ggplot2::theme(axis.title = ggplot2::element_blank(),
                 axis.text.x = ggplot2::element_blank(),
                 axis.ticks.x = ggplot2::element_blank(),
                 strip.text.x = ggplot2::element_blank(),
                 strip.background = ggplot2::element_blank()
                 strip.placement = "outside") +
  ggplot2::facet_grid(metrics ~ match, scale = "free_y", switch = "y")

ちなみに2022年大会の1–3位も500–600本前後の順当なパス本数で、今大会の異常値は日本よりもモロッコに見えました。モロッコの３位決定戦までの７試合のパス成功本数は平均317本と最少レベルです。本数は少ないながら運任せではなく、相手を釣りだしてから低い位置のサイドライン際で躱すデザインされたプレーを度々披露していて、2019アジアカップで優勝したときの強かったカタールを思い出しました。

ここ数年落ちていた観戦意欲が盛り返してきたので、来年のアジアカップからまた追いかけようと思います。

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up