More than 3 years have passed since last update.

Pythonを利用してcsvファイルをDBにインポートする

Posted at 2021-05-16

始めに

Pythonを利用して、csvファイルをそのままpostgresqlにインポートするスクリプトを作成した。sqlaclchemyとpandasを利用することで、テーブルの作成の手間がなくなったり、DBの違いも吸収できるようになった。

postgresqlの準備

実験用のpsotgresqlは、docker-composeを利用して立ち上げる。

version: '3'
 
services:
    postgres:
        image: postgres:10
        container_name: postgresql
        ports:
          - "5432:5432"
        environment:
          POSTGRES_USER: postgresql_user
          POSTGRES_PASSWORD: postgresql_password
        volumes:
          - ./db:/var/lib/postgresql/data

起動コマンドを以下に示す。

docker-compose up -d
Docker Compose is now in the Docker CLI, try `docker compose up`

Creating csv2db_postgres_1 ... done

csvをpostgresqlにインポート

必要ライブラリのインストール

利用するライブラリはpandasとsqlalchemyの2つのみ。各ライブラリのバージョンは以下の通り。

SQLAlchemy==1.4.15
pandas==1.2.4

以下のコマンドでインストールを実施する。

pip install pandas sqlalchemy

スクリプトの作成

csvファイルをDBにインポートするPythonスクリプトは以下の通り。

main.py

import argparse

import pandas as pd
from sqlalchemy import create_engine

if __name__ == "__main__":
    parser = argparse.ArgumentParser()

    parser.add_argument("table_name", help="テーブル名")
    parser.add_argument("csv_path", help="読み込むCSVファイルのパス")

    args = parser.parse_args()

    # 引数の設定
    connection_config = {
        'user': 'postgresql_user',
        'password': 'postgresql_password',
        'host': 'localhost',
        'port': '5432',
        'database': 'postgres',
        'database_type': 'postgresql' # 接続するDBのタイプ(mysqlの時: mysql)
    }

    engine = create_engine('{database_type}://{user}:{password}@{host}:{port}/{database}'.format(**connection_config), encoding='utf-8')

    df = pd.read_csv(args.csv_path)

    # PostgreSQLに書き込む
    df.to_sql(args.table_name, con=engine, if_exists='replace', index=False )

table_nameにDBのテーブル名を指定し、csv_pathに読み込むCSVファイルのパスを指定する。今回はkaggleのタイタニックの学習データをテーブルに保存してみる。

以下のURLからcsvをダウンロードして、dataフォルダ配下に配置する。

https://www.kaggle.com/c/titanic/data

実行コマンドを以下に示す。

python main.py taitanic_train data/train.csv

テーブルにログインして保存されていることを確認。

docker exec -ti postgresql /bin/bash # Dockerのコンテナにログイン
root@3ba4a2adf245:/# psql -d postgres -U postgresql_user
psql (10.16 (Debian 10.16-1.pgdg90+1))
Type "help" for help.

postgres=# postgres=# \z
                                 Access privileges
 Schema |      Name      | Type  | Access privileges | Column privileges | Policies 
--------+----------------+-------+-------------------+-------------------+----------
 public | taitanic_train | table |                   |                   | 
(1 row)
postgres=# select * from taitanic_train; # Select文でテーブルの中身を表示。
 PassengerId | Survived | Pclass |                                        Name                                        |  Sex   | Age  | SibSp | Parch |       Ticket       |   Fare   |
      Cabin      | Embarked 
-------------+----------+--------+------------------------------------------------------------------------------------+--------+------+-------+-------+--------------------+----------+
-----------------+----------
           1 |        0 |      3 | Braund, Mr. Owen Harris                                                            | male   |   22 |     1 |     0 | A/5 21171          |     7.25 |
                 | S
           2 |        1 |      1 | Cumings, Mrs. John Bradley (Florence Briggs Thayer)                                | female |   38 |     1 |     0 | PC 17599           |  71.2833 |
 C85             | C
           3 |        1 |      3 | Heikkinen, Miss. Laina                                                             | female |   26 |     0 |     0 | STON/O2. 3101282   |    7.925 |
                 | S
           4 |        1 |      1 | Futrelle, Mrs. Jacques Heath (Lily May Peel)                                       | female |   35 |     1 |     0 | 113803             |     53.1 |
 C123            | S
           5 |        0 |      3 | Allen, Mr. William Henry                                                           | male   |   35 |     0 |     0 | 373450             |     8.05 |
                 | S
===== 以下省略 ====

指定したテーブル名でデータをインポートできた！

保存されているデータの型を確認してみると、文字列はtext、数字はbigintやdouble precisionが適応されており、最低限のマッピングはなされているよう。もちろんNullやCollactionは未定義の状態になるので、これらの定義を付与したい場合は手動で定義する必要がある。

postgres=# \d taitanic_train
                  Table "public.taitanic_train"
   Column    |       Type       | Collation | Nullable | Default 
-------------+------------------+-----------+----------+---------
 PassengerId | bigint           |           |          | 
 Survived    | bigint           |           |          | 
 Pclass      | bigint           |           |          | 
 Name        | text             |           |          | 
 Sex         | text             |           |          | 
 Age         | double precision |           |          | 
 SibSp       | bigint           |           |          | 
 Parch       | bigint           |           |          | 
 Ticket      | text             |           |          | 
 Fare        | double precision |           |          | 
 Cabin       | text             |           |          | 
 Embarked    | text             |           |          |

終わりに

今まで地味にめんどくさかったDBへcsvファイルのインポートが簡単に出来るようになった。細かい設定ができないのでシステム開発でDBにデータを挿入する際は利用できないが、データ分析のために1度DBに入れたい、といった場合には使えそう。

GitHub - sqlalchemy/sqlalchemy: The Database Toolkit for Python

You get articles that match your needs
You can efficiently read back useful information
You can use dark theme

What you can do with signing up