2
3

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?

More than 1 year has passed since last update.

[python] pandasでmergeによる結合 (join) する方法

Posted at

はじめに

Join はmysqlなどの、行列をの構造を持つデータベースでとてもよく使われる関数である。しかし、pandasしか使ったことしかない人にはjoinという関数は少し難しいと思う。私もpandasしか使ってないときはjoinはあまりしっくりこなかったが、mysqlを主に使うようになってからはjoinは愛用している。ここではpandasで必要なjoinの基礎知識をまとめておく。

環境

import pandas as pd
print(pd.__version__)
# 1.3.5

Data

class_idを基準に関係性を持つusersclassのデータフレームを用意した。

users

df_users = pd.read_excel('./users.xlsx')
print(df_users.to_markdown())
users
|    |   student_id | student_name   |   student_age |   class_id |
|---:|-------------:|:---------------|--------------:|-----------:|
|  0 |            1 | Tanaka         |            22 |          1 |
|  1 |            2 | Kimura         |            33 |          1 |
|  2 |            3 | Yoshida        |            44 |          2 |
|  3 |            4 | Suzuki         |            55 |          2 |
|  4 |            5 | Toyota         |            66 |          3 |
|  5 |            6 | Honda          |            77 |          4 |
|  6 |            8 | Maeda          |            88 |          5 |

class

df_class = pd.read_excel('./class.xlsx')
print(df_class.to_markdown())
class
|    |   class_id | class_name   | class_location   |
|---:|-----------:|:-------------|:-----------------|
|  0 |          1 | swim         | building_A       |
|  1 |          2 | golf         | ground_C         |
|  2 |          3 | tennis       | ground_B         |
|  3 |          8 | baseball     | ground_D         |
|  4 |          9 | handball     | ground_E         |

Join関数

inner

df_join = pd.merge(df_users, df_class, on="class_id", how="inner")
print(df_join.to_markdown())
join( inner )
|    |   student_id | student_name   |   student_age |   class_id | class_name   | class_location   |
|---:|-------------:|:---------------|--------------:|-----------:|:-------------|:-----------------|
|  0 |            1 | Tanaka         |            22 |          1 | swim         | building_A       |
|  1 |            2 | Kimura         |            33 |          1 | swim         | building_A       |
|  2 |            3 | Yoshida        |            44 |          2 | golf         | ground_C         |
|  3 |            4 | Suzuki         |            55 |          2 | golf         | ground_C         |
|  4 |            5 | Toyota         |            66 |          3 | tennis       | ground_B         |

class_idを基準に、互いのフレームがnanにならないように結合した

left

df_left = pd.merge(df_users, df_class, on="class_id", how='left')
print(df_left.to_markdown())
left
|    |   student_id | student_name   |   student_age |   class_id | class_name   | class_location   |
|---:|-------------:|:---------------|--------------:|-----------:|:-------------|:-----------------|
|  0 |            1 | Tanaka         |            22 |          1 | swim         | building_A       |
|  1 |            2 | Kimura         |            33 |          1 | swim         | building_A       |
|  2 |            3 | Yoshida        |            44 |          2 | golf         | ground_C         |
|  3 |            4 | Suzuki         |            55 |          2 | golf         | ground_C         |
|  4 |            5 | Toyota         |            66 |          3 | tennis       | ground_B         |
|  5 |            6 | Honda          |            77 |          4 | nan          | nan              |
|  6 |            8 | Maeda          |            88 |          5 | nan          | nan              |

class_idを基準に、左のフレーム(LEFT)の値がnanにならないように結合した

right

df_right = pd.merge(df_users, df_class, on="class_id", how='right')
print(df_right.to_markdown())
right
|    |   student_id | student_name   |   student_age |   class_id | class_name   | class_location   |
|---:|-------------:|:---------------|--------------:|-----------:|:-------------|:-----------------|
|  0 |            1 | Tanaka         |            22 |          1 | swim         | building_A       |
|  1 |            2 | Kimura         |            33 |          1 | swim         | building_A       |
|  2 |            3 | Yoshida        |            44 |          2 | golf         | ground_C         |
|  3 |            4 | Suzuki         |            55 |          2 | golf         | ground_C         |
|  4 |            5 | Toyota         |            66 |          3 | tennis       | ground_B         |
|  5 |          nan | nan            |           nan |          8 | baseball     | ground_D         |
|  6 |          nan | nan            |           nan |          9 | handball     | ground_E         |

class_idを基準に、右のフレーム(RIGHT)の値がnanにならないように結合した

outer

df_outer = pd.merge(df_users, df_class, on="class_id", how='outer')
print(df_outer.to_markdown())
outer
|    |   student_id | student_name   |   student_age |   class_id | class_name   | class_location   |
|---:|-------------:|:---------------|--------------:|-----------:|:-------------|:-----------------|
|  0 |            1 | Tanaka         |            22 |          1 | swim         | building_A       |
|  1 |            2 | Kimura         |            33 |          1 | swim         | building_A       |
|  2 |            3 | Yoshida        |            44 |          2 | golf         | ground_C         |
|  3 |            4 | Suzuki         |            55 |          2 | golf         | ground_C         |
|  4 |            5 | Toyota         |            66 |          3 | tennis       | ground_B         |
|  5 |            6 | Honda          |            77 |          4 | nan          | nan              |
|  6 |            8 | Maeda          |            88 |          5 | nan          | nan              |
|  7 |          nan | nan            |           nan |          8 | baseball     | ground_D         |
|  8 |          nan | nan            |           nan |          9 | handball     | ground_E         |

class_idを基準に、nanの存在とは関係なく、すべての値を表すように結合した

よく使う例

実際のデータのように、複数のTableにデータが存在し、1つに合わせてデータを処理したい時を想定する。
よってlocationの説明をclass_locationに関連づけたデータを追加した。

df_location = pd.read_excel('./location.xlsx')
print(df_location.to_markdown())
|    | class_location   | location_explain           |
|---:|:-----------------|:---------------------------|
|  0 | building_A       | go straight                |
|  1 | building_B       | go straight and turn right |
|  2 | ground_A         | go right and turn right    |
|  3 | ground_B         | go right and turn left     |
|  4 | ground_C         | go left and turn right     |
|  5 | ground_D         | go left and turn left      |
|  6 | ground_E         | next to ground_D           |
df_left = pd.merge(df_users, df_class, on="class_id", how='left') 
df_left_left = pd.merge(df_left, df_location, on="class_location", how='left')
print(df_left_left.to_markdown())
|    |   student_id | student_name   |   student_age |   class_id | class_name   | class_location   | location_explain       |
|---:|-------------:|:---------------|--------------:|-----------:|:-------------|:-----------------|:-----------------------|
|  0 |            1 | Tanaka         |            22 |          1 | swim         | building_A       | go straight            |
|  1 |            2 | Kimura         |            33 |          1 | swim         | building_A       | go straight            |
|  2 |            3 | Yoshida        |            44 |          2 | golf         | ground_C         | go left and turn right |
|  3 |            4 | Suzuki         |            55 |          2 | golf         | ground_C         | go left and turn right |
|  4 |            5 | Toyota         |            66 |          3 | tennis       | ground_B         | go right and turn left |
|  5 |            6 | Honda          |            77 |          4 | nan          | nan              | nan                    |
|  6 |            8 | Maeda          |            88 |          5 | nan          | nan              | nan                    |

Usersを中心に考えたいときは、Usersclassloccationの情報を加えていくように、join leftを繰り返すと考えやすい。
HondaMaedaのように、class_nameのように右側の情報がないなら、それに対して付加情報を表示するか、まだ未登録であるような情報をhtml上で条件付きで示せばよい。

関連

2
3
0

Register as a new user and use Qiita more conveniently

  1. You get articles that match your needs
  2. You can efficiently read back useful information
  3. You can use dark theme
What you can do with signing up
2
3

Delete article

Deleted articles cannot be recovered.

Draft of this article would be also deleted.

Are you sure you want to delete this article?