
A tutorial on manipulating NBA game data with Python's Pandas library and then visualizing it

Posted at 2018-10-18

Introduction

I worked through the Pandas tutorial from Understanding Basketball Analytics.

The Jupyter notebook I ran and the data are available here:
https://github.com/nanako-ut/pandas_with_NBA_data

Target data

June 10, 2016: GSW (Warriors) vs CLE (Cavaliers)

CSV data recording actions such as shots, assists, and substitutions

Data contents

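Label	Example	Description
a1 to a5	Stephen Curry	Away-team player names
h1 to h5	J.R. Smith	Home-team player names
period	1	Period
away_score	108	Away-team score
home_score	97	Home-team score
remaining_time	0:12:00	Time remaining
play_length	0:00:16	Length of the play
team	GSW or CLE	Team that performed the action
event_type	start of period	Event type (jump ball / miss / rebound / shot …)
assist	Stephen Curry	Name of the player who assisted
away	Andrew Bogut	Name of the away team's jump-ball player
home	Tristan Thompson	Name of the home team's jump-ball player
possession	Stephen Curry	Name of the player who secured the jump ball
block	Draymond Green	Name of the player who blocked the shot
entered	Andre Iguodala	Name of the player substituted in
left	Andrew Bogut	Name of the player substituted out
num	1	Free-throw attempt number
opponent	Draymond Green	Name of the opposing player involved in the foul
outof	2	Number of free throws awarded
player	Draymond Green	Name of the player who performed the action
points	2	Points scored
result	made or missed	Result of the play (made or missed)
steal	Kyrie Irving	Name of the player who made the steal
type	Jump Shot	Event detail (Hook Shot / Jump Shot / …)
shot_distance	20	Distance between the shooter and the rim
original_x	1	Player position (x coordinate)
original_y	41	Player position (y coordinate)
converted_x	25.1	Player position after adjusting for the change of ends (x coordinate)
converted_y	84.9	Player position after adjusting for the change of ends (y coordinate)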

***

Introduction to Pandas Using Play-By-Play

The explanatory text below is a very loose translation of the tutorial.

I used Python 3.6.5. The tutorial targets 2.7.10, so I changed parts of it to run on Python 3.x.

Abstract

We run basic Pandas commands to analyze sample NBA data. The sample data is a comma-separated CSV file that Pandas can read.

1. Python: Data Manipulation

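This document looks at how to analyze data with Pandas in a Python environment.
We use a Pandas DataFrame to handle a CSV file and perform simple calculations and transformations so that other Python packages can process the data further.
The tutorial uses data from the March 12, 2016 game between the Oklahoma City Thunder and the San Antonio Spurs.
Note: I could not obtain the OKC vs SAS CSV file, so I use the data from the GSW vs CLE game instead.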

file = '(2016-06-10)-0041500404-GSW@CLE.csv'

2. Pandas

Import Pandas.
If Pandas is not installed, run "pip install pandas".

import pandas as pd

2.1 Reading In Files and Accessing Data

Read the CSV file and check the shape of the data (number of rows and columns).

game = pd.read_csv(file)
game.shape
---------
(467, 44)
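
As a minimal sketch of the same read-and-inspect step without the real file, the snippet below parses a tiny play-by-play CSV from a string. The three rows and their values are invented for illustration; only the column names are taken from the data description.

```python
import io

import pandas as pd

# Hypothetical three-row sample; the values are made up for illustration.
sample_csv = io.StringIO(
    "period,away_score,home_score,event_type\n"
    "1,0,0,start of period\n"
    "1,2,0,shot\n"
    "1,2,0,miss\n"
)

sample = pd.read_csv(sample_csv)
print(sample.shape)          # (rows, columns)
print(list(sample.columns))  # column labels as a plain list
```

Reading from a string buffer like this can be handy for trying out commands before pointing them at the full game file.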

Display the column labels.

game.columns.values
---------
array(['game_id', 'data_set', 'date', 'a1', 'a2', 'a3', 'a4', 'a5', 'h1',
       'h2', 'h3', 'h4', 'h5', 'period', 'away_score', 'home_score',
       'remaining_time', 'elapsed', 'play_length', 'play_id', 'team',
       'event_type', 'assist', 'away', 'home', 'block', 'entered', 'left',
       'num', 'opponent', 'outof', 'player', 'points', 'possession',
       'reason', 'result', 'steal', 'type', 'shot_distance', 'original_x',
       'original_y', 'converted_x', 'converted_y', 'explanation'],
      dtype=object)

Display the first five rows and the last five rows.

game.head(5)
game.tail(5)

To reference the data in a specific column, specify its label.

game['away_score']

You can reference elements as easily as with an ordinary array.

awayScore = game['away_score']
awayScore[10]
---------
7

Use loc to access a row.
To see the 11th action of the game, specify 10
(the index starts at 0).

game.loc[10]

Adding a label lets you reference a single element of the row.
Let's look at the away team's first player at the time of the 11th action.

game.loc[10]['a1']
---------
'Harrison Barnes'
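
The lookup above takes the row first and then the label. A small sketch, using a hypothetical two-row frame with placeholder names, suggests that passing the row and the column to loc together returns the same element in a single step:

```python
import pandas as pd

# Hypothetical two-row frame; the names are only placeholders.
df = pd.DataFrame({"a1": ["Player A", "Player B"], "away_score": [0, 2]})

row = df.loc[1]            # whole row as a Series
chained = df.loc[1]["a1"]  # row first, then label (as in the text)
direct = df.loc[1, "a1"]   # row and column in a single lookup

print(chained == direct)
```

The single-step form is usually the safer habit, especially when assigning values rather than reading them.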

2.2 Querying Data in a Dataframe

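Now that we know how to extract data by row and column, let's query the data.

For example, let's count the number of actions for "Harrison Barnes", whom we just looked up.
We simply scan the DataFrame and count the rows in which "Harrison Barnes" appears in any column.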

count = 0
labels = list(game.columns.values)
for i in range(game.shape[0]): # loop over the rows of game
    for label in labels:       # loop over the columns of game
        if 'Harrison Barnes' in str(game.loc[i][label]):
            count = count + 1
            break
            
count
---------
354

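Some rows record an action involving "Harrison Barnes" even though he is not on the court.
(His name is missing from a1 to h5 but present in "left".)

These are rows where he was substituted out, so let's exclude them.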

count = 0
for i in range(game.shape[0]):
    # Check only the ten on-court columns (a1 to h5); no per-column loop is needed.
    if 'Harrison Barnes' in str(game.loc[i]['a1':'h5']):
        count = count + 1
            
count
---------
349
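
The row-by-row loop can also be expressed as a column-wise comparison. The sketch below is one possible variant on a hypothetical three-row frame (the players and the two on-court columns are placeholders). Note that it tests for an exact match of the name, whereas the loop above searches the printed row for a substring.

```python
import pandas as pd

# Hypothetical mini play-by-play: two on-court columns plus "left".
df = pd.DataFrame({
    "a1": ["Player A", "Player A", "Player C"],
    "h1": ["Player B", "Player B", "Player B"],
    "left": [None, None, "Player A"],
})

# True for each row in which the name appears in any on-court column.
on_court = df.loc[:, "a1":"h1"].eq("Player A").any(axis=1)
count = int(on_court.sum())
print(count)
```

The third row mentions the player only in "left", so it is not counted, which mirrors the exclusion of substitution rows described above.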

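Columns a1 to h5 hold the names of the ten players on the court.
Using Counter, we count the number of actions during which each player was on the court.

First, let's look at the ten starters.
We extract the player names from the first action of the game (a1 to h5 in row 0).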

from collections import Counter
totals = Counter(list(game.loc[0, 'a1':'h5']))
totals
---------

Counter({'Harrison Barnes': 1,
         'Draymond Green': 1,
         'Andrew Bogut': 1,
         'Klay Thompson': 1,
         'Stephen Curry': 1,
         'Richard Jefferson': 1,
         'LeBron James': 1,
         'Tristan Thompson': 1,
         'J.R. Smith': 1,
         'Kyrie Irving': 1})

Count the number of actions for all players.

for i in range(1, game.shape[0]):
    totals = totals + Counter(list(game.loc[i, 'a1':'h5']))
totals 
---------

Counter({'Harrison Barnes': 349,
         'Draymond Green': 406,
         'Andrew Bogut': 75,
         'Klay Thompson': 383,
         'Stephen Curry': 399,
         # ... rest omitted
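
Adding one Counter per row works, but the same totals can probably be obtained in a single pass by flattening the on-court columns first. This is a sketch on a hypothetical three-row frame with placeholder names, not the tutorial's own method:

```python
from collections import Counter

import pandas as pd

# Hypothetical on-court columns; the real data has a1 to h5.
df = pd.DataFrame({
    "a1": ["Player A", "Player A", "Player C"],
    "h1": ["Player B", "Player B", "Player B"],
})

# Flatten the on-court block into one sequence of names and count them once.
totals = Counter(df.loc[:, "a1":"h1"].values.ravel())
print(totals)
```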

2.3 Grouping

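Let's try a more sophisticated way of grouping with Pandas' groupby.
groupby returns a GroupBy object whose groups attribute is a dictionary (a mapping of keys to values).

As a basic example, let's look at the five-on-five matchup combinations.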

onCourt = game.groupby(['a1','a2','a3','a4','a5','h1','h2','h3','h4','h5'])
onCourt.groups

---------
{('Draymond Green',
  'Klay Thompson',
  'Andre Iguodala',
  'Shaun Livingston',
  'Stephen Curry',
  'LeBron James',
  'Iman Shumpert',
  'Matthew Dellavedova',
  'Channing Frye',
  'Richard Jefferson'): Int64Index([135], dtype='int64'),
 ('Draymond Green',
  'Klay Thompson',
  'Andre Iguodala',
  'Stephen Curry',
  'Festus Ezeli',
  'LeBron James',
  'Richard Jefferson',
  'Kyrie Irving',
  'J.R. Smith',
  'Tristan Thompson'): Int64Index([162, 163, 164, 165, 166, 167, 168, 169, 170, 171], dtype='int64'),
# rest omitted

The first combination is

('Draymond Green',
  'Klay Thompson',
  'Andre Iguodala',
  'Shaun Livingston',
  'Stephen Curry',
  'LeBron James',
  'Iman Shumpert',
  'Matthew Dellavedova',
  'Channing Frye',
  'Richard Jefferson')

and it occurs only at the action with index 135.

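A Python dictionary does not sort the names inside a key, so each key keeps the order of the columns.
As a result, the same players listed in a different order may be counted as separate combinations.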
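
A small sketch may make the ordering issue concrete. The frame below is hypothetical (two columns, placeholder names): the same pair of players appears in both orders, and groupby treats the two orders as different keys.

```python
import pandas as pd

# Hypothetical lineup columns; rows 0-1 and row 2 hold the same pair in a different order.
df = pd.DataFrame({
    "a1": ["Player A", "Player A", "Player B"],
    "a2": ["Player B", "Player B", "Player A"],
})

grouped = df.groupby(["a1", "a2"])

# Convert the index objects to plain lists so the mapping is easy to read.
groups = {key: list(rows) for key, rows in grouped.groups.items()}
print(groups)
```

The result has two keys, one for each order, even though only one pair of players is involved.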

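To make this easier to see, narrow the key from a1 to h5 down to a1 to a5 and count how often each a1 to a5 combination appears.
(Counting any one home-player column gives the number of rows; here I used h4.)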

onCourt = game.groupby(['a1','a2','a3','a4','a5'])
onCourt['h4'].count()

---------
a1               a2              a3                 a4                    a5                  
Draymond Green   Klay Thompson   Andre Iguodala     Shaun Livingston      Stephen Curry            1
                                                    Stephen Curry         Festus Ezeli            10
                                                                          James Michael McAdoo    26
                                 Stephen Curry      Andre Iguodala        Anderson Varejao         1
                 Stephen Curry   Andre Iguodala     Anderson Varejao      Shaun Livingston        42
                                                    Shaun Livingston      James Michael McAdoo     7
Harrison Barnes  Draymond Green  Andrew Bogut       Klay Thompson         Stephen Curry           75
                                 Klay Thompson      Andre Iguodala        Shaun Livingston        31
                                                    Stephen Curry         Andre Iguodala          39
                                 Stephen Curry      Andre Iguodala        Shaun Livingston        15
                 Klay Thompson   Andre Iguodala     Draymond Green        Shaun Livingston         3
                                                                          Stephen Curry           62
                                                    Shaun Livingston      Draymond Green          14
                                                                          James Michael McAdoo    20
Klay Thompson    Andre Iguodala  Draymond Green     Shaun Livingston      Stephen Curry           26
                                 Stephen Curry      Festus Ezeli          Harrison Barnes          1
                 Stephen Curry   Draymond Green     Harrison Barnes       Andre Iguodala          41
                                 Festus Ezeli       Draymond Green        Harrison Barnes          1
                                                    Harrison Barnes       Shaun Livingston        14
                                                    Shaun Livingston      Draymond Green           3
Stephen Curry    Andre Iguodala  Anderson Varejao   Marreese Speights     Klay Thompson            1
                                                    Shaun Livingston      Marreese Speights        1
                                 Harrison Barnes    Draymond Green        Klay Thompson            8
                                 Marreese Speights  Klay Thompson         Harrison Barnes          6
                                 Shaun Livingston   Harrison Barnes       Draymond Green           1
                                                    James Michael McAdoo  Harrison Barnes         18
Name: h4, dtype: int64

Some combinations were counted separately because the names were in a different order.

Draymond Green   Klay Thompson   Andre Iguodala     Shaun Livingston      Stephen Curry            1
Klay Thompson    Andre Iguodala  Draymond Green     Shaun Livingston      Stephen Curry           26

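To count the combinations correctly, try the following steps:

Extract the values of columns a1 to a5 as an array
Sort the array
Build a new DataFrame
Replace a1 to a5 in the original DataFrame with a1 to a5 from the sorted DataFrame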

a = game.loc[0:, 'a1':'a5'].values
a.sort(axis=1) # sort the names within each row

awaysort = pd.DataFrame(a, game.index, list(game.columns.values)[3:8])
game.loc[0:, 'a1':'a5'] = awaysort.loc[0:, 'a1':'a5']
onCourt = game.groupby(['a1','a2','a3','a4','a5'])
onCourt['h4'].count()

---------
a1                a2               a3                    a4                 a5              
Anderson Varejao  Andre Iguodala   Draymond Green        Klay Thompson      Stephen Curry         1
                                                         Shaun Livingston   Stephen Curry        42
                                   Klay Thompson         Marreese Speights  Stephen Curry         1
                                   Marreese Speights     Shaun Livingston   Stephen Curry         1
Andre Iguodala    Draymond Green   Festus Ezeli          Klay Thompson      Stephen Curry        10
                                   Harrison Barnes       Klay Thompson      Shaun Livingston     48
                                                                            Stephen Curry       150
                                                         Shaun Livingston   Stephen Curry        16
                                   James Michael McAdoo  Klay Thompson      Stephen Curry        26
                                                         Shaun Livingston   Stephen Curry         7
                                   Klay Thompson         Shaun Livingston   Stephen Curry        27
                  Festus Ezeli     Harrison Barnes       Klay Thompson      Stephen Curry         1
                  Harrison Barnes  James Michael McAdoo  Klay Thompson      Shaun Livingston     20
                                                         Shaun Livingston   Stephen Curry        18
                                   Klay Thompson         Marreese Speights  Stephen Curry         6
Andrew Bogut      Draymond Green   Harrison Barnes       Klay Thompson      Stephen Curry        75
Draymond Green    Festus Ezeli     Harrison Barnes       Klay Thompson      Stephen Curry         1
                                   Klay Thompson         Shaun Livingston   Stephen Curry         3
Festus Ezeli      Harrison Barnes  Klay Thompson         Shaun Livingston   Stephen Curry        14
Name: h4, dtype: int64
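
The same normalization can be sketched without overwriting the original columns. The example below assumes a hypothetical two-column frame with placeholder names; it uses np.sort, which returns a sorted copy, and size(), which counts rows per group without needing a helper column such as h4.

```python
import numpy as np
import pandas as pd

# Hypothetical lineup columns; rows 0 and 1 hold the same pair in a different order.
df = pd.DataFrame({
    "a1": ["Player B", "Player A", "Player A"],
    "a2": ["Player A", "Player B", "Player C"],
})

cols = ["a1", "a2"]
# np.sort returns a sorted copy, so the original frame is left untouched.
sorted_names = np.sort(df[cols].values, axis=1)
lineups = pd.DataFrame(sorted_names, index=df.index, columns=cols)

counts = lineups.groupby(cols).size()
print(counts)
```

Keeping the sorted names in a separate frame is a design choice: it leaves the raw a1 to a5 order available for later checks.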

For each away-team combination,
let's look in the same way at the rotations the home team used against it.

onCourt = game.groupby(['a1','a2','a3','a4','a5','h1','h2','h3','h4','h5'])
onCourt['h4'].count()
len(onCourt['h4'].count())
---------
44

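This shows that the two teams faced each other in 44 lineup combinations in total. Only a1 to a5 were sorted above, so the same home five listed in a different order in h1 to h5 would still be counted separately.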

2.4 Counting Statistics

Let's use groupby and DataFrames to tally some statistics.

For example, the number of assists

assists = game.groupby('assist')
assists['assist'].count()

Average shot distance on each player's assists
(shot distance: the distance between the shooter and the rim when the shot was taken)

assists['shot_distance'].mean()
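
To illustrate what these two aggregations return, here is a sketch on a hypothetical four-row frame (names and distances are invented). Rows without an assister are dropped from the groups, so only assisted plays contribute.

```python
import pandas as pd

# Hypothetical plays; None marks a play with no assist.
plays = pd.DataFrame({
    "assist": ["Player A", None, "Player A", "Player B"],
    "shot_distance": [2, 25, 24, 10],
})

by_assist = plays.groupby("assist")
assist_counts = by_assist["assist"].count()        # assists per player
mean_distance = by_assist["shot_distance"].mean()  # mean distance of assisted shots

print(assist_counts)
print(mean_distance)
```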

Field goals made

events = game.groupby(['event_type'])
shots = game.loc[events.groups['shot']].groupby('player')  # groups holds index labels, so select with loc
shots['player'].count()

Field goals missed

misses = game.loc[events.groups['miss']].groupby('player')  # groups holds index labels, so select with loc
misses['player'].count()

Field goal percentage (FGP) = shots made / shots attempted

S = shots['player'].count()
M = misses['player'].count()
FGP = S/(S+M)
FGP

---------
player
Anderson Varejao             NaN
Andre Iguodala          0.333333
Channing Frye                NaN
# middle omitted
Stephen Curry           0.440000
Tristan Thompson        0.714286

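Players whose FGP cannot be computed are shown as NaN. This happens when a player appears in only one of S and M, for example a player with no made shots.
We use the fillna function to fix this.

Fill the missing NaN values with 0, then compute FGP.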

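# B is not defined in the original post; it is assumed here to be a per-player table built from S and M.
B = pd.DataFrame({'shot': S, 'miss': M})
B['shot'] = B['shot'].fillna(0)
B['miss'] = B['miss'].fillna(0)
B['total'] = B['shot']+B['miss']
B['FGP'] = B['shot']/B['total']
B = B.drop('miss', axis=1)  # pass axis by keyword; recent pandas no longer accepts it positionally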
B

---------
	shot	total	FGP
Anderson Varejao	0.0	1.0	0.000000
Andre Iguodala	4.0	12.0	0.333333
Channing Frye	0.0	1.0	0.000000
Draymond Green	2.0	9.0	0.222222
Harrison Barnes	5.0	11.0	0.454545
Iman Shumpert	1.0	5.0	0.200000
J.R. Smith	3.0	10.0	0.300000
James Michael McAdoo	1.0	1.0	1.000000
Kevin Love	3.0	6.0	0.500000
Klay Thompson	7.0	14.0	0.500000
Kyrie Irving	14.0	28.0	0.500000
LeBron James	11.0	21.0	0.523810
Matthew Dellavedova	0.0	1.0	0.000000
Richard Jefferson	1.0	2.0	0.500000
Shaun Livingston	3.0	8.0	0.375000
Stephen Curry	11.0	25.0	0.440000
Tristan Thompson	5.0	7.0	0.714286
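
As an alternative sketch, pd.crosstab can build the made/missed table in one step, filling absent combinations with 0 so that no fillna call is needed. The events below are hypothetical and contain only shot and miss rows; with the real data the other event types would presumably have to be filtered out first.

```python
import pandas as pd

# Hypothetical shot log: Player B has a miss but no made shot.
events = pd.DataFrame({
    "player": ["Player A", "Player A", "Player B", "Player A"],
    "event_type": ["shot", "miss", "miss", "shot"],
})

# One row per player, one column per event type; absent pairs become 0.
table = pd.crosstab(events["player"], events["event_type"])
table["total"] = table["shot"] + table["miss"]
table["FGP"] = table["shot"] / table["total"]
print(table)
```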

Finally, compute each player's playing time.

a = game.loc[0:, 'a1':'a5'].values
a.sort(axis=1)
awaysort = pd.DataFrame(a, game.index, list(game.columns.values)[3:8])
game.loc[0:, 'a1':'a5'] = awaysort.loc[0:, 'a1':'a5']
onCourt = game.groupby(['a1','a2','a3','a4','a5'])

for key in onCourt.groups.keys():
    print('{} {}'.format(key, pd.to_numeric(onCourt.get_group(key)['play_length'].str[-2:]).sum()))

---------
('Anderson Varejao', 'Andre Iguodala', 'Draymond Green', 'Klay Thompson', 'Stephen Curry') 0
('Anderson Varejao', 'Andre Iguodala', 'Draymond Green', 'Shaun Livingston', 'Stephen Curry') 242
('Anderson Varejao', 'Andre Iguodala', 'Klay Thompson', 'Marreese Speights', 'Stephen Curry') 0
('Anderson Varejao', 'Andre Iguodala', 'Marreese Speights', 'Shaun Livingston', 'Stephen Curry') 0
('Andre Iguodala', 'Draymond Green', 'Festus Ezeli', 'Klay Thompson', 'Stephen Curry') 28
# rest omitted

Total playing time

tots = 0
for key in onCourt.groups.keys():
    tots = tots + pd.to_numeric(onCourt.get_group(key)['play_length'].str[-2:]).sum()

tots/60

---------
48.0
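
The code above reads only the last two characters of play_length, that is, the seconds field, which gives the right total as long as no single play lasts a minute or more. A sketch of a variant that parses the whole duration with pd.to_timedelta is shown below. The three play lengths are hypothetical, and the third one is deliberately longer than a minute to show where the two approaches differ.

```python
import pandas as pd

# Hypothetical play lengths in the same H:MM:SS layout as the play_length column.
play_length = pd.Series(["0:00:16", "0:00:08", "0:01:05"])

seconds_only = pd.to_numeric(play_length.str[-2:])               # seconds field only
full_seconds = pd.to_timedelta(play_length).dt.total_seconds()   # whole duration

print(seconds_only.sum())
print(full_seconds.sum())
```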

Playing time per player

playerTimes = {'player':'time'}

for key in onCourt.groups.keys():
    for playerName in key:
        if playerName in playerTimes:
            playerTimes[playerName] += pd.to_numeric(onCourt.get_group(key)['play_length'].str[-2:]).sum()
        else:
            playerTimes[playerName] = pd.to_numeric(onCourt.get_group(key)['play_length'].str[-2:]).sum()
        
playerTimes

---------
{'player': 'time',
 'Anderson Varejao': 242,
 'Andre Iguodala': 2228,
 'Draymond Green': 2534,
 'Klay Thompson': 2362,
 'Stephen Curry': 2379,
 'Shaun Livingston': 1115,
 'Marreese Speights': 5,
 'Festus Ezeli': 79,
 'Harrison Barnes': 2410,
 'James Michael McAdoo': 445,
 'Andrew Bogut': 601}

Remove the header entry and format the times from seconds to MM:SS.

import math 

del playerTimes['player']
for key, value in playerTimes.items():
    playerTimes[key] = '{:02}'.format(math.floor(value/60)) + ':'+'{:02}'.format(value%60)

playerTimes
---------
{'Anderson Varejao': '04:02',
 'Andre Iguodala': '37:08',
 'Draymond Green': '42:14',
 'Klay Thompson': '39:22',
 'Stephen Curry': '39:39',
 'Shaun Livingston': '18:35',
 'Marreese Speights': '00:05',
 'Festus Ezeli': '01:19',
 'Harrison Barnes': '40:10',
 'James Michael McAdoo': '07:25',
 'Andrew Bogut': '10:01'}
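
The seconds-to-MM:SS conversion can also be written with divmod, which returns the quotient and remainder together. The helper name format_mmss is hypothetical; the input 2534 is Draymond Green's total in seconds from the output further up.

```python
def format_mmss(total_seconds):
    """Format a non-negative number of seconds as MM:SS."""
    minutes, seconds = divmod(int(total_seconds), 60)
    return "{:02}:{:02}".format(minutes, seconds)


print(format_mmss(2534))  # prints 42:14
```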

Five players logged far more playing time than everyone else:

 'Andre Iguodala': '37:08',
 'Draymond Green': '42:14',
 'Klay Thompson': '39:22',
 'Stephen Curry': '39:39',
 'Harrison Barnes': '40:10',

3. Visualization

We use matplotlib to visualize the data.

import matplotlib.pyplot as plt

First, create a score-flow chart.

time = ((11 - pd.to_numeric(game['remaining_time'].str[-5:-3]))*60) + (60 - pd.to_numeric(game['remaining_time'].str[-2:])) + (pd.to_numeric(game['period']) - 1)*720
                                                                           
away = pd.to_numeric(game['away_score'])
home = pd.to_numeric(game['home_score'])
plt.plot(time,home,'silver')
plt.plot(time,away,'cyan')
plt.show()
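
The time expression converts the period and the remaining game clock into seconds elapsed since the start of the game, assuming 12-minute (720-second) periods. A worked sketch on three hypothetical clock readings, with the formula split into named steps:

```python
import pandas as pd

# Hypothetical clock readings in the same layout as the period and remaining_time columns.
clock = pd.DataFrame({
    "period": [1, 1, 2],
    "remaining_time": ["0:12:00", "0:11:44", "0:12:00"],
})

minutes = pd.to_numeric(clock["remaining_time"].str[-5:-3])
seconds = pd.to_numeric(clock["remaining_time"].str[-2:])

# Same formula as in the text: seconds elapsed since the start of the game.
elapsed = (11 - minutes) * 60 + (60 - seconds) + (clock["period"] - 1) * 720
print(elapsed.tolist())  # [0, 16, 720]
```

The start of the first period maps to 0, a play with 11:44 left maps to 16 seconds, and the start of the second period maps to 720.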

Next, create a shot chart.

We prepare the data grouped by team and event type,
and draw a basic shot chart with made shots in green and missed shots in red.

team = game.groupby(['team','event_type'])
team.get_group(('GSW', 'shot'))
team.get_group(('CLE', 'shot')).loc[0:,'converted_x':'converted_y']

plt.plot(team.get_group(('GSW', 'shot')).loc[0:,'converted_x'],team.get_group(('GSW', 'shot')).loc[0:,'converted_y'],'go')
plt.plot(team.get_group(('GSW', 'miss')).loc[0:,'converted_x'],team.get_group(('GSW', 'miss')).loc[0:,'converted_y'],'ro')
plt.show()

The rim is at around 25 on the x-axis. Mid-range shots do not seem to have gone in very often.
