pandas: PostgreSQL のテーブルの重複を削除

Last updated at 2024-09-16Posted at 2024-09-16

Primary Key を設定しないテーブルでは、データが重複する可能性があります。
その重複を削除するプログラムです。

実行前

  id   |  name  | population |  date_mod  
-------+--------+------------+------------
 t3461 | 広島   |      72814 | 2001-09-14
 t3461 | 広島   |      72814 | 2001-09-14
 t3462 | 福山   |      41738 | 2001-07-21
 t3462 | 福山   |      41738 | 2001-07-21
 t3463 | 東広島 |      92513 | 2001-06-12
 t3463 | 東広島 |      92513 | 2001-06-12
 t3464 | 呉     |      93167 | 2001-09-29
 t3464 | 呉     |      93167 | 2001-09-29
(8 rows)

実行後

  id   |  name  | population |  date_mod  
-------+--------+------------+------------
 t3461 | 広島   |      72814 | 2001-09-14
 t3462 | 福山   |      41738 | 2001-07-21
 t3463 | 東広島 |      92513 | 2001-06-12
 t3464 | 呉     |      93167 | 2001-09-29
(4 rows)

プログラム

db_duplicate_omit.py

#! /usr/bin/python
#
#	postgresql/pandas/db_duplicate_omit.py
#
#						Sep/16/2024
# ------------------------------------------------------------------
import pandas as pd
import sys
import os
from dotenv import load_dotenv
from sqlalchemy import create_engine
from sqlalchemy import text as q_text
#
# ------------------------------------------------------------------
def single_proc(engine,schema,table_src):
	sys.stderr.write("*** single_proc *** start\n")
	sql_str = 'SELECT * FROM ' + schema + "." + table_src
	print(sql_str)
	q = q_text(sql_str)
	try:
#
		df=pd.read_sql(sql=q, con=engine.connect())
		print(df)
		print(df.shape)
		dg = df.drop_duplicates()
		print(dg.shape)
#
		dg.to_sql(name=table_src,con=engine,schema=schema,if_exists='replace', index=False)
#
	except Exception as ee:
		sys.stderr.write("*** error *** in read_sql_query ***\n")
		sys.stderr.write(str(ee) + "\n")
# ------------------------------------------------------------------
sys.stderr.write("*** 開始 ***\n")

dotenv_path = '.env'
load_dotenv(dotenv_path)
user=os.environ.get("user")
password=os.environ.get("password")
data_base=os.environ.get("data_base")

host ='localhost'

SQLALCHEMY_DATABASE_URL = "postgresql://" + user + ":" + password + "@" + host + "/" + data_base
engine = create_engine(SQLALCHEMY_DATABASE_URL)
#
#
schema="public"
table_src = "cities"
#
single_proc(engine,schema,table_src)
#
sys.stderr.write("*** 終了 ***\n")
# ------------------------------------------------------------------

.env

user='scott'
password='tiger123'
data_base='city'

SQL

テーブルの作成

postgre_create.sql

drop table if exists cities;
create table cities (id varchar(10), name text, population int, date_mod date);

データの挿入

postgre_insert.sql

insert into cities values ('t3461','広島',72814,'2001-9-14');
insert into cities values ('t3462','福山',41738,'2001-7-21');
insert into cities values ('t3463','東広島',92513,'2001-6-12');
insert into cities values ('t3464','呉',93167,'2001-9-29');

データの読み込み

postgre_read.sql

select * from cities order by id;

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up