はじめに
Join はmysqlなどの、行列をの構造を持つデータベースでとてもよく使われる関数である。しかし、pandasしか使ったことしかない人にはjoinという関数は少し難しいと思う。私もpandasしか使ってないときはjoinはあまりしっくりこなかったが、mysqlを主に使うようになってからはjoinは愛用している。ここではpandasで必要なjoinの基礎知識をまとめておく。
環境
import pandas as pd
print(pd.__version__)
# 1.3.5
Data
class_id
を基準に関係性を持つusers
とclass
のデータフレームを用意した。
users
df_users = pd.read_excel('./users.xlsx')
print(df_users.to_markdown())
| | student_id | student_name | student_age | class_id |
|---:|-------------:|:---------------|--------------:|-----------:|
| 0 | 1 | Tanaka | 22 | 1 |
| 1 | 2 | Kimura | 33 | 1 |
| 2 | 3 | Yoshida | 44 | 2 |
| 3 | 4 | Suzuki | 55 | 2 |
| 4 | 5 | Toyota | 66 | 3 |
| 5 | 6 | Honda | 77 | 4 |
| 6 | 8 | Maeda | 88 | 5 |
class
df_class = pd.read_excel('./class.xlsx')
print(df_class.to_markdown())
| | class_id | class_name | class_location |
|---:|-----------:|:-------------|:-----------------|
| 0 | 1 | swim | building_A |
| 1 | 2 | golf | ground_C |
| 2 | 3 | tennis | ground_B |
| 3 | 8 | baseball | ground_D |
| 4 | 9 | handball | ground_E |
Join関数
inner
df_join = pd.merge(df_users, df_class, on="class_id", how="inner")
print(df_join.to_markdown())
| | student_id | student_name | student_age | class_id | class_name | class_location |
|---:|-------------:|:---------------|--------------:|-----------:|:-------------|:-----------------|
| 0 | 1 | Tanaka | 22 | 1 | swim | building_A |
| 1 | 2 | Kimura | 33 | 1 | swim | building_A |
| 2 | 3 | Yoshida | 44 | 2 | golf | ground_C |
| 3 | 4 | Suzuki | 55 | 2 | golf | ground_C |
| 4 | 5 | Toyota | 66 | 3 | tennis | ground_B |
class_id
を基準に、互いのフレームがnanにならないように結合した
left
df_left = pd.merge(df_users, df_class, on="class_id", how='left')
print(df_left.to_markdown())
| | student_id | student_name | student_age | class_id | class_name | class_location |
|---:|-------------:|:---------------|--------------:|-----------:|:-------------|:-----------------|
| 0 | 1 | Tanaka | 22 | 1 | swim | building_A |
| 1 | 2 | Kimura | 33 | 1 | swim | building_A |
| 2 | 3 | Yoshida | 44 | 2 | golf | ground_C |
| 3 | 4 | Suzuki | 55 | 2 | golf | ground_C |
| 4 | 5 | Toyota | 66 | 3 | tennis | ground_B |
| 5 | 6 | Honda | 77 | 4 | nan | nan |
| 6 | 8 | Maeda | 88 | 5 | nan | nan |
class_id
を基準に、左のフレーム(LEFT)の値がnanにならないように結合した
right
df_right = pd.merge(df_users, df_class, on="class_id", how='right')
print(df_right.to_markdown())
| | student_id | student_name | student_age | class_id | class_name | class_location |
|---:|-------------:|:---------------|--------------:|-----------:|:-------------|:-----------------|
| 0 | 1 | Tanaka | 22 | 1 | swim | building_A |
| 1 | 2 | Kimura | 33 | 1 | swim | building_A |
| 2 | 3 | Yoshida | 44 | 2 | golf | ground_C |
| 3 | 4 | Suzuki | 55 | 2 | golf | ground_C |
| 4 | 5 | Toyota | 66 | 3 | tennis | ground_B |
| 5 | nan | nan | nan | 8 | baseball | ground_D |
| 6 | nan | nan | nan | 9 | handball | ground_E |
class_id
を基準に、右のフレーム(RIGHT)の値がnanにならないように結合した
outer
df_outer = pd.merge(df_users, df_class, on="class_id", how='outer')
print(df_outer.to_markdown())
| | student_id | student_name | student_age | class_id | class_name | class_location |
|---:|-------------:|:---------------|--------------:|-----------:|:-------------|:-----------------|
| 0 | 1 | Tanaka | 22 | 1 | swim | building_A |
| 1 | 2 | Kimura | 33 | 1 | swim | building_A |
| 2 | 3 | Yoshida | 44 | 2 | golf | ground_C |
| 3 | 4 | Suzuki | 55 | 2 | golf | ground_C |
| 4 | 5 | Toyota | 66 | 3 | tennis | ground_B |
| 5 | 6 | Honda | 77 | 4 | nan | nan |
| 6 | 8 | Maeda | 88 | 5 | nan | nan |
| 7 | nan | nan | nan | 8 | baseball | ground_D |
| 8 | nan | nan | nan | 9 | handball | ground_E |
class_id
を基準に、nanの存在とは関係なく、すべての値を表すように結合した
よく使う例
実際のデータのように、複数のTableにデータが存在し、1つに合わせてデータを処理したい時を想定する。
よってlocation
の説明をclass_location
に関連づけたデータを追加した。
df_location = pd.read_excel('./location.xlsx')
print(df_location.to_markdown())
| | class_location | location_explain |
|---:|:-----------------|:---------------------------|
| 0 | building_A | go straight |
| 1 | building_B | go straight and turn right |
| 2 | ground_A | go right and turn right |
| 3 | ground_B | go right and turn left |
| 4 | ground_C | go left and turn right |
| 5 | ground_D | go left and turn left |
| 6 | ground_E | next to ground_D |
df_left = pd.merge(df_users, df_class, on="class_id", how='left')
df_left_left = pd.merge(df_left, df_location, on="class_location", how='left')
print(df_left_left.to_markdown())
| | student_id | student_name | student_age | class_id | class_name | class_location | location_explain |
|---:|-------------:|:---------------|--------------:|-----------:|:-------------|:-----------------|:-----------------------|
| 0 | 1 | Tanaka | 22 | 1 | swim | building_A | go straight |
| 1 | 2 | Kimura | 33 | 1 | swim | building_A | go straight |
| 2 | 3 | Yoshida | 44 | 2 | golf | ground_C | go left and turn right |
| 3 | 4 | Suzuki | 55 | 2 | golf | ground_C | go left and turn right |
| 4 | 5 | Toyota | 66 | 3 | tennis | ground_B | go right and turn left |
| 5 | 6 | Honda | 77 | 4 | nan | nan | nan |
| 6 | 8 | Maeda | 88 | 5 | nan | nan | nan |
Users
を中心に考えたいときは、Users
にclass
とloccation
の情報を加えていくように、join left
を繰り返すと考えやすい。
Honda
やMaeda
のように、class_name
のように右側の情報がないなら、それに対して付加情報を表示するか、まだ未登録であるような情報をhtml上で条件付きで示せばよい。
関連